
Stability and $L_2$-penalty in Model Averaging

Hengkun Zhu [email protected] Guohua Zou [email protected].
Abstract

Model averaging, which integrates available information by averaging over potential models, has received much attention in the past two decades. Although various model averaging methods have been developed, there is little literature on the theoretical properties of model averaging from the perspective of stability, and the majority of these methods constrain the model weights to a simplex. The aim of this paper is to introduce stability from statistical learning theory into model averaging. We therefore define stability, the asymptotic empirical risk minimizer, generalization, and consistency of model averaging and study the relationships among them. Our results indicate that stability can ensure that model averaging has good generalization performance and consistency under reasonable conditions, where consistency means that the model averaging estimator can asymptotically minimize the mean squared prediction error. We also propose an $L_2$-penalty model averaging method without limiting the model weights and prove that it has stability and consistency. In order to reduce the impact of tuning parameter selection, we use 10-fold cross-validation to select a candidate set of tuning parameters and perform a weighted average of the estimators of the model weights based on estimation errors. A Monte Carlo simulation and an illustrative application demonstrate the usefulness of the proposed method.

keywords:
Model averaging, Stability, Mean squared prediction error, $L_2$-penalty
journal: Journal of Machine Learning Research
School of Mathematical Sciences, Capital Normal University, Beijing 100048, China

1 Introduction

In practical applications, data analysts usually determine a series of models, based on exploratory analysis of the data and empirical knowledge, to describe the relationship between the variables of interest and related variables; how to use these models to produce good results is then an important problem. It is very common to select one model using a data-driven criterion, such as AIC (Akaike, 1998), BIC (Schwarz, 1978), $C_p$ (Mallows, 1973) and FIC (Hjort and Claeskens, 2003). However, the properties of the corresponding results depend not only on the selected model but also on the randomness of the selection. An alternative to model selection is to make the resultant estimator a compromise across a set of competing models. Statisticians find that they can usually obtain better and more stable results by combining information from different models. This process of combining multiple models into estimation is known as model averaging. Up to now, there has been a large literature on Bayesian model averaging (BMA) and frequentist model averaging (FMA). Fragoso et al. (2018) reviewed the relevant literature on BMA. In this paper, we focus on FMA. In the past decades, model averaging has been applied in various fields. Wan and Zhang (2009) examined applications of model averaging in tourism research. Zhang and Zou (2011) applied model averaging to grain production forecasting in China. Moral-Benito (2015) reviewed the literature on model averaging with special emphasis on its applications to economics. The key to FMA lies in how to select the model weights. Common weight selection methods include: 1) methods based on information criteria, such as smoothed AIC and smoothed BIC in Buckland et al. (1997); 2) Mallows model averaging (MMA), proposed by Hansen (2007) (see also Wan et al., 2010), modified by Liu and Okui (2013) to accommodate heteroscedasticity, and improved by Liao and Zou (2020) for small sample sizes; 3) the FIC criterion, proposed by Hjort and Claeskens (2003), which has been extended to the generalized additive partial linear model (Zhang and Liang, 2011), the Tobit model with non-zero threshold (Zhang et al., 2012), and weighted composite quantile regression (Xu et al., 2014); 4) adaptive methods, such as Yang (2001) and Yuan and Yang (2005); 5) the OPT method (Liang et al., 2011); 6) cross-validation methods, such as jackknife model averaging (JMA) (Hansen and Racine, 2012), Zhang et al. (2013), Gao et al. (2016), Zhang and Zou (2020) and Li et al. (2021), among others.

In learning theory, stability measures an algorithm's sensitivity to perturbations in the training set, and it is an important tool for analyzing the generalization and learnability of algorithms. Bousquet and Elisseeff (2002) introduced four kinds of stability (hypothesis stability, pointwise hypothesis stability, error stability, and uniform stability), and showed that stability is a sufficient condition for learnability. Kutin and Niyogi (2002) introduced several weaker variants of stability and showed how they suffice to obtain generalization bounds for algorithms that are stable in their sense. Rakhlin et al. (2005) and Mukherjee et al. (2006) discussed the necessity of stability for learnability under the assumption that uniform convergence is equivalent to learnability. Further, Shalev-Shwartz et al. (2010) showed that uniform convergence is in fact not necessary for learning in the general learning setting, where stability plays a key role that has nothing to do with uniform convergence. In the general learning setting with a differential privacy constraint, Wang et al. (2016) studied some intricate relationships between privacy, stability and learnability.

Although various model averaging methods have been proposed, there is little literature on their theoretical properties from the perspective of stability, and the majority of these methods are concerned only with whether the resultant estimator leads to a good approximation of the minimum of a given target when the model weights are constrained to a simplex. Considering that it remains unclear whether stability can ensure good theoretical properties of the model averaging estimator, and that it may be better to approximate the global minimum than a local minimum in some cases, our attempts in this paper are to study stability in model averaging and to answer whether the resultant estimator can lead to a good approximation of the minimum of a given target when the model weights are unrestricted.

For the first attempt, we introduce the concept of stability from statistical learning theory into model averaging. Stability describes how much an algorithm's output varies when the sample set changes a little. Shalev-Shwartz et al. (2010) discussed the relationships among the asymptotic empirical risk minimizer (AERM), stability, generalization and consistency, but the relevant conclusions cannot be directly applied to model averaging. Therefore, we adapt the relevant definitions and conclusions of Shalev-Shwartz et al. (2010) to model averaging. The results indicate that stability can ensure that model averaging has good generalization performance and consistency under reasonable conditions, where consistency means that the model averaging estimator can asymptotically minimize the mean squared prediction error (MSPE). For the second attempt, we find that, for MMA and JMA, extreme weights tend to appear under the influence of correlation among models when the model weights are unrestricted, resulting in poor performance of the model averaging estimator. Therefore, we should not simply remove the weight constraint and directly use an existing model averaging method. Thus, similar to ridge regression in Hoerl and Kennard (1970), we introduce an $L_2$-penalty for the weight vector in MMA and JMA and call the resulting methods Ridge-Mallows model averaging (RMMA) and Ridge-jackknife model averaging (RJMA), respectively. Like Theorem 4.3 in Hoerl and Kennard (1970), we discuss the reasonability of introducing the $L_2$-penalty. We also prove the stability and consistency of the proposed methods, where consistency means that the model averaging estimator can asymptotically minimize the MSPE when the model weights are unrestricted. In the context of shrinkage estimation, Schomaker (2012) discussed the impact of tuning parameter selection and pointed out that a weighted average of shrinkage estimators with different tuning parameters can improve the overall stability, predictive performance and standard errors of shrinkage estimators. Hence, like Schomaker (2012), we use 10-fold cross-validation to select a candidate set of tuning parameters and perform a weighted average of the estimators of the model weights based on estimation errors.

The remainder of this paper is organised as follows. In Section 2, we explain the relevant notations, give the definitions of consistency and stability, and discuss their relationship. In Section 3, we propose RMMA and RJMA methods, and prove that they are stable and consistent. Section 4 conducts the Monte Carlo simulation experiment. Section 5 applies the proposed method to a real data set. Section 6 concludes. The proofs of lemmas and theorems are provided in the Appendix.

2 Consistency and Stability for Model Averaging

2.1 Model Averaging

We assume that $S=\{z_i=(y_i,x_i')'\in\mathcal{Z}, i=1,\ldots,n\}$ is a simple random sample from distribution $\mathcal{D}$, where $y_i$ is the $i$-th observation of the response variable and $x_i$ is the $i$-th observation of the covariates. Let $z^*=(y^*,x^{*\prime})'$ be an observation from distribution $\mathcal{D}$ that is independent of $S$.

In model averaging, $M$ approximating models are selected first in order to describe the relationship between the response variable and the covariates. We assume that the hypothesis spaces of the $M$ approximating models are

\mathcal{H}_m=\big\{h_m(x^*_m),\,h_m\in\mathcal{F}_m\big\},\quad m=1,\ldots,M,

where $x^*_m$ consists of some elements of $x^*$, and $\mathcal{F}_m$ is a given function set. For example, in MMA, to estimate $E(y^*|x^*)$, we take

\mathcal{H}_m=\big\{x^{*\prime}_m\theta_m,\,\theta_m\in R^{\dim(x^*_m)}\big\},\quad m=1,\ldots,M,

where $\dim(\cdot)$ represents the dimension of the vector. For the $m$-th approximating model, a proper estimation method $A_m$ is selected, and $\hat{h}_m$, the estimator of $h_m$, is obtained based on $S$ and $A_m$. Then, the hypothesis space of model averaging is defined as follows:

\mathcal{H}=\big\{\hat{h}(x^*,w)=H[w,\hat{h}_1(x^*_1),\ldots,\hat{h}_M(x^*_M)],\,w\in W\big\},

where $W$ is a given weight space, and $H(\cdot)$ is a given function of the weight vector and the estimators of the $M$ approximating models. In MMA, we take

H[w,\hat{h}_1(x^*_1),\ldots,\hat{h}_M(x^*_M)]=\sum_{m=1}^{M}w_m\hat{h}_m(x^*_m)

as the model averaging estimator of $E(y^*|x^*)$. An important problem with model averaging is the choice of the model weights. Here, the estimator $\hat{w}$ of the weight vector is obtained based on $S$ and a proper weight selection criterion $A(w)$ that makes $\hat{w}$ optimal in a certain sense.

The selection of $A_m$, $m=1,\ldots,M$, and $A(w)$ is closely related to the definition of the loss function. Let $L[\hat{h}(x^*,w),y^*]$ be a real-valued loss function defined on $\mathcal{H}\times\mathcal{Y}$, where $\mathcal{Y}$ is the value space of $y^*$. Then, the risk function is defined as follows:

F(w,S)=E_{z^*}\big\{L[\hat{h}(x^*,w),y^*]\big\},

which is the MSPE under the sample set $S$ and weight vector $w$.
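To make these objects concrete, the following minimal Python sketch (entirely our own illustration; the synthetic data, nested least-squares candidate models, and squared error loss are assumptions, not part of the paper) forms the averaged prediction $\sum_m w_m\hat{h}_m(x^*)$ and approximates the risk $F(w,S)$, i.e., the MSPE, by Monte Carlo over a large independent sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends on three covariates (illustrative only).
n, n_test = 200, 5000
X = rng.normal(size=(n, 3))
y = 1.0 + 0.8 * X[:, 0] + 0.4 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(size=n)

# Candidate (nested) models: columns used by each approximating model.
models = [[0], [0, 1], [0, 1, 2]]

def ols_fit(X, y, cols):
    """Least-squares estimator for one approximating model (with intercept)."""
    Z = np.column_stack([np.ones(len(y)), X[:, cols]])
    theta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return theta

def predict(theta, X, cols):
    Z = np.column_stack([np.ones(len(X)), X[:, cols]])
    return Z @ theta

thetas = [ols_fit(X, y, cols) for cols in models]

def averaged_prediction(w, X_new):
    """Model averaging estimator: sum_m w_m * h_m(x)."""
    preds = np.column_stack([predict(t, X_new, c) for t, c in zip(thetas, models)])
    return preds @ w

# Monte Carlo approximation of F(w, S) = E[(y* - h(x*, w))^2] (the MSPE),
# using a large independent sample from the same data generating process.
X_star = rng.normal(size=(n_test, 3))
y_star = (1.0 + 0.8 * X_star[:, 0] + 0.4 * X_star[:, 1] + 0.2 * X_star[:, 2]
          + rng.normal(size=n_test))

w = np.array([0.2, 0.3, 0.5])          # an arbitrary weight vector
risk = np.mean((y_star - averaged_prediction(w, X_star)) ** 2)
print(f"approximate F(w, S) = {risk:.3f}")
```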

2.2 Related Concepts

In this paper, we mainly discuss whether $F(\hat{w},S)$ can approximate the smallest possible risk $\inf_{w\in W}F(w,S)$. If $A(w)$ has this property, we say that $A(w)$ is consistent. For fixed $m$, Shalev-Shwartz et al. (2010) defined the stability and consistency of $A_m$ and discussed their relationship. Obviously, for model averaging, we need to pay more attention to the stability and consistency of the weight selection. We note that the relevant conclusions of Shalev-Shwartz et al. (2010) cannot be directly applied to model averaging because $\mathcal{H}$ depends on $S$. Therefore, we extend the relevant definitions and conclusions to model averaging. The following is the definition of consistency:

Definition 1 (Consistency).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $\hat{w}$ satisfies

E_S\big\{F(\hat{w},S)-\inf_{w\in W}F(w,S)\big\}=O(\epsilon_n),

then $A(w)$ is said to be consistent with rate $\epsilon_n$.

In statistical learning theory, stability concerns how much the algorithm's output varies when $S$ changes a little. "Leave-one-out" (Loo) and "Replace-one" (Ro) are two common tools used to evaluate stability. Loo considers the change in the algorithm's output after removing an observation from $S$, and Ro considers such a change after replacing an observation in $S$ with an observation that is independent of $S$. Accordingly, the stability is called Loo stability and Ro stability, respectively. Here we give the formal definitions of Loo stability and Ro stability. To this end, we first give the definition of algorithm symmetry:

Definition 2 (Symmetry).

If the algorithm's output is not affected by the order of the observations in $S$, the algorithm is symmetric.

Now let $S^{-i}$ be the sample set after removing the $i$-th observation from $S$, $\hat{h}^{-i}_m$ be the estimator of $h_m$ based on $S^{-i}$ and $A_m$, $\hat{w}^{-i}$ be the estimator of the weight vector based on $S^{-i}$ and $A(w)$, and $F(w,S^{-i})=E_{z^*}\big\{L[\hat{h}^{-i}(x^*,w),y^*]\big\}$, where $\hat{h}^{-i}(x^*,w)=H[w,\hat{h}^{-i}_1(x^*_1),\ldots,\hat{h}^{-i}_M(x^*_M)]$. We define Loo stability as follows:

Definition 3 (PLoo Stability).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $A(w)$ satisfies

\frac{1}{n}\sum_{i=1}^{n}E_S\big[F(\hat{w},S)-F(\hat{w}^{-i},S^{-i})\big]=O(\epsilon_n),

then $A(w)$ is Predicted-Loo (PLoo) stable with rate $\epsilon_n$; if $A_m$, $m=1,\ldots,M$, and $A(w)$ are symmetric, it only has to satisfy

E_S\big[F(\hat{w},S)-F(\hat{w}^{-n},S^{-n})\big]=O(\epsilon_n).

Definition 4 (FLoo Stability).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $A(w)$ satisfies

\frac{1}{n}\sum_{i=1}^{n}E_S\big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^{-i}(x_i,\hat{w}^{-i}),y_i]\big\}=O(\epsilon_n),

then $A(w)$ is Fitted-Loo (FLoo) stable with rate $\epsilon_n$; if $A_m$, $m=1,\ldots,M$, and $A(w)$ are symmetric, it only has to satisfy

E_S\big\{L[\hat{h}(x_n,\hat{w}),y_n]-L[\hat{h}^{-n}(x_n,\hat{w}^{-n}),y_n]\big\}=O(\epsilon_n).

Let $S^i$ be the sample set $S$ with the $i$-th observation replaced by $z_i^*=(y_i^*,x_i^{*\prime})'$, $\hat{h}^i_m$ be the estimator of $h_m$ based on $S^i$ and $A_m$, and $\hat{w}^i$ be the estimator of the weight vector based on $S^i$ and $A(w)$, where $z_i^*$ is from distribution $\mathcal{D}$ and independent of $S$. Let $F(w,S^i)=E_{z^*}\big\{L[\hat{h}^i(x^*,w),y^*]\big\}$; then

\frac{1}{n}\sum_{i=1}^{n}E_{S,z_i^*}\big[F(\hat{w},S)-F(\hat{w}^i,S^i)\big]=0,

where $\hat{h}^i(x^*,w)=H[w,\hat{h}^i_1(x^*_1),\ldots,\hat{h}^i_M(x^*_M)]$. Therefore, we define Ro stability as follows:

Definition 5 (Ro Stability).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $A(w)$ satisfies

\frac{1}{n}\sum_{i=1}^{n}E_{S,z_i^*}\big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^i(x_i,\hat{w}^i),y_i]\big\}=O(\epsilon_n),

then $A(w)$ is Ro stable with rate $\epsilon_n$; if $A_m$, $m=1,\ldots,M$, and $A(w)$ are symmetric, it only has to satisfy

E_{S,z_n^*}\big\{L[\hat{h}(x_n,\hat{w}),y_n]-L[\hat{h}^n(x_n,\hat{w}^n),y_n]\big\}=O(\epsilon_n).

Before discussing the relationship between stability and consistency, we give the definitions of AERM and generalization. The empirical risk function is defined as follows:

\hat{F}(w,S)=\frac{1}{n}\sum_{i=1}^{n}L[\hat{h}(x_i,w),y_i].
Definition 6 (AERM).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $A(w)$ satisfies

E_S\big[\hat{F}(\hat{w},S)-\inf_{w\in W}\hat{F}(w,S)\big]=O(\epsilon_n),

then $A(w)$ is an AERM with rate $\epsilon_n$.

Vapnik (1998) proved some theoretical properties of the empirical risk minimization principle. However, when the sample size is small, the empirical risk minimizer tends to over-fit. Therefore, the structural risk minimization principle was proposed in Vapnik (1998), and a method satisfying this principle is usually an AERM. Shalev-Shwartz et al. (2010) also discussed the deficiency of the empirical risk minimization principle and the importance of AERM.

Definition 7 (Generalization).

If there is a sequence of constants $\{\epsilon_n, n\in N_+\}$ such that $\epsilon_n=o(1)$ and $A(w)$ satisfies

E_S\big[\hat{F}(\hat{w},S)-F(\hat{w},S)\big]=O(\epsilon_n),

then $A(w)$ generalizes with rate $\epsilon_n$.

In statistical learning theory, generalization refers to the performance of the concept learned by a model on unknown samples. It can be seen from Definition 7 that the generalization of $A(w)$ describes the difference between using $\hat{w}$ to fit the training set $S$ and using it to predict unknown samples.

2.3 Relationship between Different Concepts

Note that for any $i\in\{1,\ldots,n\}$,

\begin{aligned}
&E_{S,z_i^*}\Big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^i(x_i,\hat{w}^i),y_i]\Big\}\\
&=E_{S,z_i^*}\Big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^{-i}(x_i,\hat{w}^{-i}),y_i]+L[\hat{h}^{-i}(x_i,\hat{w}^{-i}),y_i]-L[\hat{h}^i(x_i,\hat{w}^i),y_i]\Big\}\\
&=E_S\Big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^{-i}(x_i,\hat{w}^{-i}),y_i]\Big\}+E_S\big[F(\hat{w}^{-i},S^{-i})-F(\hat{w},S)\big],
\end{aligned}

so we give the following theorem to illustrate the relationship between Loo stability and Ro stability:

Theorem 2.1.

If $A(w)$ has two of FLoo stability, PLoo stability and Ro stability with rate $\epsilon_n$, then it has all three stabilities with rate $\epsilon_n$.

Shalev-Shwartz et al. (2010) emphasized that Ro stability and Loo stability are in general incomparable notions, but Theorem 2.1 shows that they are closely related.

By definitions of generalization and Ro stability, we have

\begin{aligned}
&E_S[\hat{F}(\hat{w},S)-F(\hat{w},S)]\\
&=E_{S,z_1^*,\ldots,z_n^*}\Big\{\frac{1}{n}\sum_{i=1}^{n}L[\hat{h}(x_i,\hat{w}),y_i]-\frac{1}{n}\sum_{i=1}^{n}L[\hat{h}(x_i^*,\hat{w}),y_i^*]\Big\}\\
&=\frac{1}{n}\sum_{i=1}^{n}E_{S,z_i^*}\Big\{L[\hat{h}(x_i,\hat{w}),y_i]-L[\hat{h}^i(x_i,\hat{w}^i),y_i]\Big\},
\end{aligned}

and then we give the following theorem to illustrate the equivalence of Ro stability and generalization:

Theorem 2.2.

$A(w)$ has Ro stability with rate $\epsilon_n$ if and only if $A(w)$ generalizes with rate $\epsilon_n$.

Theorem 2.2 shows that stability is an important property of a weight selection criterion, which can ensure that the corresponding estimator has good generalization performance.

Let $\hat{w}^*\in W$ satisfy $F(\hat{w}^*,S)=\inf_{w\in W}F(w,S)$. Note that

\begin{aligned}
&E_S[F(\hat{w},S)-F(\hat{w}^*,S)]\\
&=E_S[F(\hat{w},S)-\hat{F}(\hat{w},S)+\hat{F}(\hat{w},S)-\hat{F}(\hat{w}^*,S)+\hat{F}(\hat{w}^*,S)-F(\hat{w}^*,S)]\\
&\leq E_S[F(\hat{w},S)-\hat{F}(\hat{w},S)+\hat{F}(\hat{w},S)-\inf_{w\in W}\hat{F}(w,S)+\hat{F}(\hat{w}^*,S)-F(\hat{w}^*,S)],
\end{aligned}

so we give the following theorem to illustrate the relationship between stability and consistency:

Theorem 2.3.

If $A(w)$ is an AERM and has Ro stability with rate $\epsilon_n$, and $\hat{w}^*$ satisfies

E_S\big[\hat{F}(\hat{w}^*,S)-F(\hat{w}^*,S)\big]=O(\epsilon_n),

then $A(w)$ is consistent with rate $\epsilon_n$.

Since $\hat{w}^*$ and $\mathcal{H}$ depend on $S$, unlike Lemma 15 in Shalev-Shwartz et al. (2010), Theorem 2.3 requires $\hat{w}^*$ to generalize with rate $\epsilon_n$. In the next section, we propose an $L_2$-penalty model averaging method and prove that it has stability and consistency under certain reasonable conditions.

3 $L_2$-penalty Model Averaging

In most of the existing literature on model averaging, the theoretical properties are explored under the weight set $W^0=\{w\in[0,1]^M:\sum_{m=1}^{M}w_m=1\}$. From Definition 1, it is seen that, even if the corresponding weight selection criterion is consistent, such consistency holds only on this subspace of $R^M$. Therefore, a natural question is whether it is possible to leave the weight space unrestricted, and what happens when we do so. We note that the unrestricted Granger-Ramanathan method obtains the estimator of the weight vector over $R^M$ by minimizing the sum of squared forecast errors of the combination forecast, but its performance is poor compared with some other methods (see Hansen, 2008). On the other hand, in a prediction task, we are more concerned with whether the resulting estimator predicts well, and the estimator that minimizes the MSPE over the full space will most likely outperform the estimator that minimizes the MSPE over a subspace. Therefore, it is necessary to further develop new research ideas.

3.1 Model Framework and Estimators

We assume that the response variable $y_i$ and the covariates $x_i=(x_{1i},x_{2i},\ldots)$ satisfy the following data generating process:

y_i=\mu_i+e_i=\sum_{k=1}^{\infty}x_{ki}\theta_k+e_i,\quad E(e_i|x_i)=0,\quad E(e_i^2|x_i)=\sigma_i^2,

and the $M$ approximating models are given by

y_i=\sum_{k=1}^{k_m}x_{m(k)i}\theta_{m,(k)}+b_{mi}+e_i,\quad m=1,\ldots,M,

where $b_{mi}=\mu_i-\sum_{k=1}^{k_m}x_{m(k)i}\theta_{m,(k)}$ is the approximation error of the $m$-th approximating model. We assume that the $M$-th approximating model contains all the considered covariates. To simplify notation, we let $x_i=(x_{(1)i},\ldots,x_{(k_M)i})$ throughout the rest of this article, where $x_{(k)i}=x_{M(k)i}$.

Let $y=(y_1,\ldots,y_n)'$, $\mu=(\mu_1,\ldots,\mu_n)'$, $e=(e_1,\ldots,e_n)'$, and $b_m=(b_{m1},\ldots,b_{mn})'$. Then, the corresponding matrix form of the true model is $y=X_m\theta_m+b_m+e$, where $\theta_m=(\theta_{m,(1)},\ldots,\theta_{m,(k_m)})'$ and $X_m$ is the design matrix of the $m$-th approximating model. When there is an approximating model such that $b_m=\mathbf{0}$, the true model is included in the $m$-th approximating model, i.e., the model is correctly specified. Unlike Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012), we do not require that the infimum of $R_n(w)$ (defined in Section 3.2) tend to infinity, and therefore we allow the model to be correctly specified.

Let $\pi_m\in R^{K\times k_m}$ be the variable selection matrix satisfying $X_M\pi_m=X_m$ and $\pi_m'\pi_m=I_{k_m}$, $m=1,\ldots,M$. Then, the hypothesis spaces of the $M$ approximating models are

\mathcal{H}_m=\big\{x^{*\prime}\pi_m\theta_m,\,\theta_m\in R^{k_m}\big\},\quad m=1,\ldots,M.

The least squares estimator of $\theta_m$ is $\hat{\theta}_m=(X_m'X_m)^{-1}X_m'y$, $m=1,\ldots,M$.

3.2 Weight Selection Criterion

Let $P_m=X_m(X_m'X_m)^{-1}X_m'$, $P(w)=\sum_{m=1}^{M}w_mP_m$, $L_n(w)=\|\mu-P(w)y\|_2^2$, and $R_n(w)=E_e[L_n(w)]$. When $\sigma_i^2\equiv\sigma^2$, Hansen (2007) and Wan et al. (2010) used the Mallows criterion $C_n(w)=\|y-\hat{\Omega}w\|_2^2+2\sigma^2w'\kappa$ to select a model weight vector from the hypothesis space $\mathcal{H}^0=\big\{x^{*\prime}\hat{\theta}(w),\,w\in W^0\big\}$ and proved that the estimator of the weight vector asymptotically minimizes $L_n(w)$, where $\hat{\Omega}=(P_1y,\ldots,P_My)$, $\kappa=(k_1,\ldots,k_M)'$, and $\hat{\theta}(w)=\sum_{m=1}^{M}w_m\pi_m\hat{\theta}_m$. Hansen and Racine (2012) used the jackknife criterion $J_n(w)=\|y-\bar{\Omega}w\|_2^2$ to select a model weight vector from the hypothesis space $\mathcal{H}^0$ and proved that the estimator of the weight vector asymptotically minimizes $L_n(w)$ and $R_n(w)$, where $\bar{\Omega}=\big[y-D_1(I-P_1)y,\ldots,y-D_M(I-P_M)y\big]$ with $D_m=\mathrm{diag}[(1-h_{ii}^m)^{-1}]$ and $h_{ii}^m=x_i'\pi_m(X_m'X_m)^{-1}\pi_m'x_i$, $i=1,\ldots,n$.
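As a concrete illustration, the sketch below (our own, using synthetic data and nested candidate models; $\sigma^2$ is treated as known for simplicity, which is an assumption of this sketch rather than of the paper) builds the matrices $\hat{\Omega}$ and $\bar{\Omega}$ and evaluates the Mallows and jackknife criteria at a given weight vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 100, 6
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 0.8, 0.6, 0.4, 0.2, 0.1])
sigma2 = 1.0
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# Nested candidate models: model m uses the first k_m columns of X.
k_list = [2, 4, 6]
M = len(k_list)

Omega_hat = np.empty((n, M))   # columns P_m y (fitted values of model m)
Omega_bar = np.empty((n, M))   # columns y - D_m (I - P_m) y (jackknife fits)
for m, k in enumerate(k_list):
    Xm = X[:, :k]
    Pm = Xm @ np.linalg.solve(Xm.T @ Xm, Xm.T)
    resid = y - Pm @ y
    h = np.diag(Pm)                       # leverages h_ii^m
    Omega_hat[:, m] = Pm @ y
    Omega_bar[:, m] = y - resid / (1.0 - h)

kappa = np.array(k_list, dtype=float)

def mallows(w):
    """Mallows criterion C_n(w) = ||y - Omega_hat w||^2 + 2 sigma^2 w'kappa."""
    return np.sum((y - Omega_hat @ w) ** 2) + 2.0 * sigma2 * w @ kappa

def jackknife(w):
    """Jackknife criterion J_n(w) = ||y - Omega_bar w||^2."""
    return np.sum((y - Omega_bar @ w) ** 2)

w = np.full(M, 1.0 / M)                    # equal weights as an example
print(mallows(w), jackknife(w))
```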

Different from Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012), we focus on whether the model averaging estimator can asymptotically minimize the MSPE when the model weights are not restricted. Let $\hat{\gamma}=(x^{*\prime}\pi_1\hat{\theta}_1,\ldots,x^{*\prime}\pi_M\hat{\theta}_M)$. Then, the risk function and the empirical risk function are defined as

F(w,S)=E_{z^*}[y^*-x^{*\prime}\hat{\theta}(w)]^2=E_{z^*}(y^*-\hat{\gamma}w)^2

and

\hat{F}(w,S)=\frac{1}{n}\sum_{i=1}^{n}[y_i-x_i'\hat{\theta}(w)]^2=\frac{1}{n}\|y-\hat{\Omega}w\|_2^2,

respectively. Since Hansen (2007), Wan et al. (2010) and Hansen and Racine (2012) restrict $w\in W^0$, the corresponding estimators of the weight vector do not necessarily asymptotically minimize $F(w,S)$ over $R^M$. An intuitive way to enable the estimator of the weight vector to asymptotically minimize $F(w,S)$ over $R^M$ is to remove the restriction $w\in W^0$ directly.

Let $\hat{P}$ and $\bar{P}$ be orthogonal matrices satisfying $\hat{P}'\hat{\Omega}'\hat{\Omega}\hat{P}=\mathrm{diag}(\hat{\zeta}_1,\ldots,\hat{\zeta}_M)$ and $\bar{P}'\bar{\Omega}'\bar{\Omega}\bar{P}=\mathrm{diag}(\bar{\zeta}_1,\ldots,\bar{\zeta}_M)$, where $\hat{\zeta}_1\leq\ldots\leq\hat{\zeta}_M$ and $\bar{\zeta}_1\leq\ldots\leq\bar{\zeta}_M$ are the eigenvalues of $\hat{\Omega}'\hat{\Omega}$ and $\bar{\Omega}'\bar{\Omega}$, respectively. We assume that $E_{z^*}(\hat{\gamma}'\hat{\gamma})$, $\hat{\Omega}'\hat{\Omega}$ and $\bar{\Omega}'\bar{\Omega}$ are invertible (this is reasonable under Assumption 3); then

\hat{w}^0=\arg\min_{w\in R^M}C_n(w)=(\hat{\Omega}'\hat{\Omega})^{-1}(\hat{\Omega}'y-\sigma^2\kappa),
\bar{w}^0=\arg\min_{w\in R^M}J_n(w)=(\bar{\Omega}'\bar{\Omega})^{-1}\bar{\Omega}'y,
\tilde{w}=\arg\min_{w\in R^M}\hat{F}(w,S)=(\hat{\Omega}'\hat{\Omega})^{-1}\hat{\Omega}'y,

and

\hat{w}^*=\arg\min_{w\in R^M}F(w,S)=[E_{z^*}(\hat{\gamma}'\hat{\gamma})]^{-1}E_{z^*}(\hat{\gamma}'y^*).

From this, we can see that, in order to satisfy consistency, $\hat{w}^0$ and $\bar{w}^0$ should be good estimators of $\hat{w}^*$. However, when the approximating models are highly correlated, the minimum eigenvalues of $\hat{\Omega}'\hat{\Omega}$ and $\bar{\Omega}'\bar{\Omega}$ may be so small that $\|\hat{w}^0\|_2^2=\sum_{m=1}^{M}\frac{a_m^2}{\hat{\zeta}_m^2}\geq\frac{a_1^2}{\hat{\zeta}_1^2}$ and $\|\bar{w}^0\|_2^2=\sum_{m=1}^{M}\frac{b_m^2}{\bar{\zeta}_m^2}\geq\frac{b_1^2}{\bar{\zeta}_1^2}$ become too large, which usually results in extreme weights, where $(a_1,a_2,\ldots,a_M)'=\hat{P}'\hat{\Omega}'y$ and $(b_1,b_2,\ldots,b_M)'=\bar{P}'\bar{\Omega}'y$. Therefore, similar to ridge regression in Hoerl and Kennard (1970), we make the following correction to $C_n(w)$ and $J_n(w)$:

C(w,S)=C_n(w)+\lambda_n w'w,
J(w,S)=J_n(w)+\lambda_n w'w,

where $\lambda_n>0$ is a tuning parameter. The above corrections are actually $L_2$-penalties for the weight vector. Let $\hat{Z}=(\hat{\Omega}'\hat{\Omega}+\lambda_n I)^{-1}\hat{\Omega}'\hat{\Omega}$ and $\bar{Z}=(\bar{\Omega}'\bar{\Omega}+\lambda_n I)^{-1}\bar{\Omega}'\bar{\Omega}$. Then

\hat{w}=\arg\min_{w\in R^M}C(w,S)=(\hat{\Omega}'\hat{\Omega}+\lambda_n I)^{-1}(\hat{\Omega}'y-\sigma^2\kappa)=\hat{Z}\hat{w}^0,

and

\bar{w}=\arg\min_{w\in R^M}J(w,S)=(\bar{\Omega}'\bar{\Omega}+\lambda_n I)^{-1}\bar{\Omega}'y=\bar{Z}\bar{w}^0.
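Since the penalized weights have the closed forms above, each can be computed with a single linear solve. The following minimal sketch (our own; the nearly collinear candidate fitted values, the choice of $\kappa$, and the known $\sigma^2$ are illustrative assumptions) also shows the motivating phenomenon: with $\lambda_n=0$ the weights can be extreme when the candidate fits are nearly collinear, while a positive penalty keeps them moderate.

```python
import numpy as np

def rmma_weights(Omega_hat, y, kappa, sigma2, lam):
    """Ridge-Mallows weights: argmin_w C(w,S) = C_n(w) + lam * w'w,
    i.e. w_hat = (Omega'Omega + lam I)^{-1} (Omega'y - sigma2 * kappa)."""
    M = Omega_hat.shape[1]
    A = Omega_hat.T @ Omega_hat + lam * np.eye(M)
    return np.linalg.solve(A, Omega_hat.T @ y - sigma2 * kappa)

def rjma_weights(Omega_bar, y, lam):
    """Ridge-jackknife weights: argmin_w J(w,S) = J_n(w) + lam * w'w,
    i.e. w_bar = (Omega_bar'Omega_bar + lam I)^{-1} Omega_bar'y."""
    M = Omega_bar.shape[1]
    A = Omega_bar.T @ Omega_bar + lam * np.eye(M)
    return np.linalg.solve(A, Omega_bar.T @ y)

# With lam = 0 these reduce to the unpenalized solutions w_hat^0 and w_bar^0;
# when the candidate fits are nearly collinear (small eigenvalues of
# Omega'Omega), the unpenalized weights can blow up while the penalized
# weights remain moderate.
if __name__ == "__main__":
    rng = np.random.default_rng(2)
    n, M = 100, 3
    base = rng.normal(size=n)
    # Nearly collinear candidate fitted values (illustrative, not from real models).
    Omega = np.column_stack([base + 0.01 * rng.normal(size=n) for _ in range(M)])
    y = base + rng.normal(size=n)
    kappa = np.array([2.0, 4.0, 6.0])
    print("lam = 0 :", np.round(rmma_weights(Omega, y, kappa, 1.0, 0.0), 2))
    print("lam = 10:", np.round(rmma_weights(Omega, y, kappa, 1.0, 10.0), 2))
```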

In the next subsection, we discuss the theoretical properties of $C(w,S)$ and $J(w,S)$.

3.3 Stability and Consistency

Let $\lambda_{max}(\cdot)$ and $\lambda_{min}(\cdot)$ be the maximum and minimum eigenvalues of a square matrix, respectively, and let $\chi$ be the value space of the $K$ covariates, where $K=k_M$. In order to discuss the stability and consistency of the proposed method, we need the following assumptions:

Assumption 1.

There are constants $C_1>0$ and $C_2>0$ such that

C_1\leq\lambda_{min}(n^{-1}X_M'X_M)\leq\lambda_{max}(n^{-1}X_M'X_M)\leq C_2K,\quad a.s..

Assumption 2.

There is a constant $C_3>0$ such that $E_{y^*}[(y^*)^2]\leq C_3$ and $n^{-1}y'y\leq C_3$, a.s.; there is a constant $C_4>0$ such that $\chi\subset B(\mathbf{0}_K,C_4)$, a.s., where $B(\mathbf{0}_K,C_4)$ is the ball with center $\mathbf{0}_K$ and radius $C_4$, and $\mathbf{0}_K$ is the $K$-dimensional zero vector.

Assumption 3.

There are constants $C_5>0$ and $C_6>0$ such that

C_5\leq n^{-1}\lambda_{min}(\hat{\Omega}'\hat{\Omega})\leq n^{-1}\lambda_{max}(\hat{\Omega}'\hat{\Omega})\leq C_6M,
C_5\leq\lambda_{min}[E_{z^*}(\hat{\gamma}'\hat{\gamma})]\leq\lambda_{max}[E_{z^*}(\hat{\gamma}'\hat{\gamma})]\leq C_6M,

and

C_5\leq n^{-1}\lambda_{min}(\bar{\Omega}'\bar{\Omega})\leq n^{-1}\lambda_{max}(\bar{\Omega}'\bar{\Omega})\leq C_6M,

a.s..

Assumption 4.

\lambda_n=O(M\log n),\quad \lambda_n-\lambda_{n-1}=O(K^3M).

Assumption 1 is mild, and similar conditions can be found in Zou and Zhang (2009) and Zhao et al. (2020). From $X_m'X_m=\pi_m'X_M'X_M\pi_m$, we see that, under Assumption 1, for any $m\in\{1,\ldots,M\}$, we have

C_1\leq\lambda_{min}(n^{-1}X_m'X_m)\leq\lambda_{max}(n^{-1}X_m'X_m)\leq C_2K,\quad a.s..

Shalev-Shwartz et al. (2010) assumed that the loss function is bounded, which is usually not satisfied in traditional regression analysis. We replace this assumption with Assumption 2. Tong and Wu (2017) assumed that $\chi\times\mathcal{Y}$ is a compact subset of $R^{K+1}$, under which Assumption 2 obviously holds. Assumption 3 requires the minimum eigenvalue of $\hat{\Omega}'\hat{\Omega}$ to have a lower bound away from 0 and the maximum eigenvalue to be of order $O(nM)$, a.s.. A similar assumption is used in Liao and Zou (2020). Lemma 3 guarantees the rationality of the assumptions about the eigenvalues of $E_{z^*}(\hat{\gamma}'\hat{\gamma})$ and $\bar{\Omega}'\bar{\Omega}$. Assumption 4 is a mild assumption on the tuning parameter, imposed in order to avoid an excessive penalty. In Section 3.4, we search for tuning parameters only in $[0,M\log n]$.

Let $\hat{V}(\lambda_n)=\|\hat{Z}\hat{w}^0-\hat{Z}\hat{w}^*\|_2^2$, $\hat{B}(\lambda_n)=\|\hat{Z}\hat{w}^*-\hat{w}^*\|_2^2$, $\bar{V}(\lambda_n)=\|\bar{Z}\bar{w}^0-\bar{Z}\hat{w}^*\|_2^2$, and $\bar{B}(\lambda_n)=\|\bar{Z}\hat{w}^*-\hat{w}^*\|_2^2$. We define

\hat{M}(\lambda_n)=\|\hat{Z}\hat{w}^0-\hat{w}^*\|_2^2=\hat{V}(\lambda_n)+\hat{B}(\lambda_n)+2(\hat{Z}\hat{w}^0-\hat{Z}\hat{w}^*)'(\hat{Z}\hat{w}^*-\hat{w}^*),
\bar{M}(\lambda_n)=\|\bar{Z}\bar{w}^0-\hat{w}^*\|_2^2=\bar{V}(\lambda_n)+\bar{B}(\lambda_n)+2(\bar{Z}\bar{w}^0-\bar{Z}\hat{w}^*)'(\bar{Z}\hat{w}^*-\hat{w}^*).

In order for $F(\hat{Z}\hat{w}^0,S)$ and $F(\bar{Z}\bar{w}^0,S)$ to better approximate $F(\hat{w}^*,S)$, we naturally want $E_S[\hat{M}(\lambda_n)]$ and $E_S[\bar{M}(\lambda_n)]$ to be as small as possible. In the following discussion, we call $E_S[\hat{M}(\lambda_n)]$ and $E_S[\bar{M}(\lambda_n)]$ the mean squared errors, $E_S[\hat{V}(\lambda_n)]$ and $E_S[\bar{V}(\lambda_n)]$ the estimation variances, and $E_S[\hat{B}(\lambda_n)]$ and $E_S[\bar{B}(\lambda_n)]$ the estimation biases. Obviously, when $\lambda_n=0$, $\hat{Z}=\bar{Z}=I_M$, which means the estimation bias equals zero. From Lemma 4 and the proof of Theorem 3.4, we see that, under Assumptions 1-4, $\hat{B}(\lambda_n)$ and $\bar{B}(\lambda_n)$ are $O(n^{-2}M^4\log^2 n)$, a.s.. On the other hand, the existence of extreme weights may make the performance of $\hat{w}^0$ and $\bar{w}^0$ extremely unstable. So the purpose of the $L_2$-penalty is to reduce the estimation variance by introducing estimation bias, and thus make the performance of the model averaging estimator more stable. Further, we define

\hat{M}_1(\lambda_n)=\hat{V}(\lambda_n)+\hat{B}(\lambda_n),
\bar{M}_1(\lambda_n)=\bar{V}(\lambda_n)+\bar{B}(\lambda_n).

Like Theorem 4.3 in Hoerl and Kennard (1970), we give the following theorem to illustrate the reasonability of introducing the $L_2$-penalty:

Theorem 3.4.

Let $\hat{\lambda}_n=\min\{\lambda_n:\frac{d}{d\lambda_n}\hat{M}_1(\lambda_n)=0\}$ and $\bar{\lambda}_n=\min\{\lambda_n:\frac{d}{d\lambda_n}\bar{M}_1(\lambda_n)=0\}$. Then, 1) when $\hat{w}^0\neq\hat{w}^*$, $\hat{\lambda}_n>0$ and $\hat{M}_1(\hat{\lambda}_n)<\hat{M}_1(0)$; 2) when $\bar{w}^0\neq\hat{w}^*$, $\bar{\lambda}_n>0$ and $\bar{M}_1(\bar{\lambda}_n)<\bar{M}_1(0)$.

Theorem 3.4 shows that the use of the $L_2$-penalty reduces the estimation variance by introducing estimation bias. However, since $\hat{w}^*$ is unknown, $\hat{\lambda}_n$ and $\bar{\lambda}_n$ are also unknown. In Section 3.4, we use cross-validation to select the tuning parameter $\lambda_n$. The following theorem shows that $C(w,S)$ and $J(w,S)$ are AERMs.

Theorem 3.5.

Under Assumptions 1-4, $C(w,S)$ and $J(w,S)$ are AERMs with rates $n^{-1}\log n\,KM^2(1+n^{-2}K^2)$ and $n^{-1}\log n\,K^3M^2$, respectively.

The following theorem shows that $C(w,S)$, $J(w,S)$ and $F(w,S)$ have FLoo stability and PLoo stability.

Theorem 3.6.

Under Assumptions 1-4, $C(w,S)$, $J(w,S)$ and $F(w,S)$ have FLoo stability and PLoo stability with rates $n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}(1+n^{-2}K^2)$, $n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}$ and $n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}$, respectively.

It can be seen from Theorems 2.1, 2.2 and 3.6 that $C(w,S)$, $J(w,S)$ and $F(w,S)$ have Ro stability and generalize. The following theorem shows that $C(w,S)$ and $J(w,S)$ are consistent, which is a direct consequence of Theorems 2.1-2.3 and 3.5-3.6.

Theorem 3.7.

Under Assumptions 1-4, $C(w,S)$ and $J(w,S)$ are consistent with rates $n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}(1+n^{-2}K^2)$ and $n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}$, respectively.

3.4 Optimal Weighting Based on Cross Validation

Although Theorem 3.4 shows that there are $\hat{\lambda}_n$ and $\bar{\lambda}_n$ such that $\hat{w}$ and $\bar{w}$ are better approximations of $\hat{w}^*$, $\hat{\lambda}_n$ and $\bar{\lambda}_n$ cannot be obtained in practice. Therefore, like Schomaker (2012), we propose an algorithm based on 10-fold cross-validation to obtain the estimator of the weight vector, which is a weighted average of the weight estimators for different $\lambda_n$. That is, we first select 100 equally spaced points in $[0,M\log n]$ as the candidates for $\lambda_n$. Then we calculate the estimation error for each candidate $\lambda_n$ by 10-fold cross-validation. Based on this, we remove the candidates with large estimation errors. Last, for the remaining candidates, the estimation errors are used to form a weighted average of the estimators of the weight vector. We summarize our algorithm for RMMA below; a similar algorithm can be given for RJMA.

Algorithm 1 Optimal weighting based on cross-validation

Input: $S$.
Output: $\hat{w}$.
1: Set $\hat{E}_L=0$, $L=1,\ldots,100$;
2: Randomly divide the sample set $S$ into 10 subsets of equal size, and denote the index set of the observations belonging to the $B$-th subset by $S_B$, $B=1,\ldots,10$;
3: for each $B\in\{1,2,\ldots,10\}$ do
4:     Let $S_{train}=\{z_i, i\notin S_B\}$ be the training set and $S_{test}=\{z_i, i\in S_B\}$ be the testing set;
5:     Obtain $\hat{\theta}_m^B$ based on $S_{train}$, $m=1,\ldots,M$;
6:     for each $L\in\{1,2,\ldots,100\}$ do
7:         Obtain $\hat{w}_{BL}$ based on $\lambda_n=\frac{(L-1)M\log n}{99}$ and $C(w,S_{train})$;
8:         Compute the estimation error of $\hat{w}_{BL}$ on $S_{test}$,
           \hat{E}(\hat{w}_{BL})=\sum_{z_i\in S_{test}}[y_i-x_i'\hat{\theta}^B(\hat{w}_{BL})]^2,
           where $\hat{\theta}^B(w)=\sum_{m=1}^{M}w_m\pi_m\hat{\theta}_m^B$;
9:         Set $\hat{E}_L=\hat{E}_L+\hat{E}(\hat{w}_{BL})$;
10: Let $S_\lambda$ be the index set of the 50 smallest values in $\{\hat{E}_L, L=1,\ldots,100\}$;
11: For each $L\in S_\lambda$, obtain $\hat{w}_L$ based on $\lambda_n=\frac{(L-1)M\log n}{99}$, $S$, and $C(w,S)$;
12: Output $\hat{w}=\sum_{L\in S_\lambda}\frac{\exp(-0.5\hat{E}_L)}{\sum_{L\in S_\lambda}\exp(-0.5\hat{E}_L)}\hat{w}_L$.
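A minimal Python sketch of Algorithm 1 for RMMA is given below. It is our own illustrative implementation under simplifying assumptions: nested least-squares candidate models, a synthetic data set, and $\sigma^2$ treated as known (in practice it would be replaced by an estimate, e.g., from the largest approximating model). Shifting the estimation errors by their minimum before exponentiating does not change the normalized weights; it only prevents numerical underflow.

```python
import numpy as np

def candidate_fits(X, y, k_list):
    """Fitted-value matrix Omega_hat and coefficients for nested candidates."""
    thetas, cols = [], []
    for k in k_list:
        theta, *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)
        thetas.append(theta)
        cols.append(np.arange(k))
    Omega = np.column_stack([X[:, c] @ t for t, c in zip(thetas, cols)])
    return Omega, thetas, cols

def rmma_weights(Omega, y, kappa, sigma2, lam):
    """Closed-form ridge-Mallows weights for a given penalty lam."""
    M = Omega.shape[1]
    return np.linalg.solve(Omega.T @ Omega + lam * np.eye(M),
                           Omega.T @ y - sigma2 * kappa)

def cv_weighted_rmma(X, y, k_list, sigma2=1.0, n_grid=100, n_folds=10, seed=0):
    """Sketch of Algorithm 1: 10-fold CV over a lambda grid on [0, M log n],
    then an exponentially weighted average of the weight estimators whose
    cross-validation errors are among the 50 smallest."""
    n, M = len(y), len(k_list)
    kappa = np.array(k_list, dtype=float)
    lam_grid = np.linspace(0.0, M * np.log(n), n_grid)
    E = np.zeros(n_grid)

    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        Om_tr, thetas, cols = candidate_fits(X[train_idx], y[train_idx], k_list)
        # predictions of each candidate model on the held-out fold
        preds = np.column_stack([X[np.ix_(test_idx, c)] @ t
                                 for t, c in zip(thetas, cols)])
        for L, lam in enumerate(lam_grid):
            w = rmma_weights(Om_tr, y[train_idx], kappa, sigma2, lam)
            E[L] += np.sum((y[test_idx] - preds @ w) ** 2)

    keep = np.argsort(E)[: n_grid // 2]           # 50 smallest CV errors
    Om, _, _ = candidate_fits(X, y, k_list)
    W = np.column_stack([rmma_weights(Om, y, kappa, sigma2, lam_grid[L])
                         for L in keep])
    a = np.exp(-0.5 * (E[keep] - E[keep].min()))  # shift for numerical stability
    return W @ (a / a.sum())

# toy usage with synthetic data and nested candidate models (illustrative)
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, K = 200, 6
    X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
    y = X @ np.array([1, .8, .6, .4, .2, .1]) + rng.normal(size=n)
    print(np.round(cv_weighted_rmma(X, y, [2, 4, 6]), 3))
```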

4 Simulation Study

In this section, we conduct simulation experiments to demonstrate the finite sample performance of the proposed method. Similar to Hansen (2007), we consider the following data generating process:

y_i=\mu_i+e_i=\sum_{k=1}^{K_t}x_{ki}\theta_k+e_i,\quad i=1,\ldots,n,

where $\{\theta_k, k=1,2,\ldots,K_t\}$ are the model parameters, $x_{1i}\equiv 1$, $(x_{2i},\ldots,x_{K_ti})\sim N(0,\Sigma)$, and $(e_1,e_2,\ldots,e_n)\sim N[0,\mathrm{diag}(\sigma_1^2,\sigma_2^2,\ldots,\sigma_n^2)]$. We set $n=100,300,500,700$, $\alpha=0.5,1.0,1.5$, $\Sigma=(\sigma_{kl})$ with $\sigma_{kl}=\rho^{|k-l|}$ and $\rho=0.3,0.6$, and $R^2=0.1,\ldots,0.9$, where the population $R^2=\frac{var(\sum_{k=1}^{K_t}x_{ki}\theta_k)}{var(\sum_{k=1}^{K_t}x_{ki}\theta_k+e_i)}$. For the homoskedastic simulation we set $\sigma_i^2\equiv 1$, while for the heteroskedastic simulation we set $\sigma_i^2=x_{2i}^2$.
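For reference, the sketch below generates one data set from this design. It is our own minimal implementation; it borrows the coefficient sequence $\theta_k=c\sqrt{2\alpha}k^{-\alpha-1/2}$ from the nested setting of Section 4.1 as an assumption, in order to solve for $c$ from the target population $R^2$.

```python
import numpy as np

def simulate(n, K_t=400, alpha=1.0, rho=0.3, R2=0.5, hetero=False, seed=0):
    """Generate one data set from the simulation design (our own sketch).
    theta_k = c * sqrt(2*alpha) * k^{-alpha-1/2}, with c chosen so that the
    population R^2 equals the target (exactly under homoskedastic unit errors)."""
    rng = np.random.default_rng(seed)
    # AR(1)-type covariance sigma_{kl} = rho^{|k-l|} for the non-constant covariates
    idx = np.arange(K_t - 1)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = np.column_stack([np.ones(n),
                         rng.multivariate_normal(np.zeros(K_t - 1), Sigma, size=n)])
    theta = np.sqrt(2 * alpha) * np.arange(1, K_t + 1) ** (-alpha - 0.5)
    # var(x'theta): the constant covariate does not contribute to the variance.
    v = theta[1:] @ Sigma @ theta[1:]
    # With unit error variance, R^2 = c^2 v / (c^2 v + 1)  =>  solve for c.
    c = np.sqrt(R2 / ((1 - R2) * v))
    theta *= c
    mu = X @ theta
    sigma = np.abs(X[:, 1]) if hetero else np.ones(n)   # sigma_i^2 = x_{2i}^2
    y = mu + sigma * rng.normal(size=n)
    return X, y, mu

X, y, mu = simulate(n=300, alpha=1.0, R2=0.5)
```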

We compare the following ten model selection/averaging methods: 1) model selection with AIC (AI), model selection with BIC (BI), and model selection with $C_p$ (Cp); 2) model averaging with smoothed AIC (SA), and model averaging with smoothed BIC (SB); 3) Mallows model averaging (MM), jackknife model averaging (JM), and least squares model averaging based on generalized cross-validation (GM) (Li et al., 2021); 4) Ridge-Mallows model averaging (RM), and Ridge-jackknife model averaging (RJ). To evaluate these methods, we generate a test set $\{(\mu_i^*,x_i^*), i=1,\ldots,n_t\}$ from the above data generating process, and

MSE=n_t^{-1}\sum_{i=1}^{n_t}[\mu_i^*-x_i^{*\prime}\hat{\theta}(\hat{w})]^2

is calculated as a measure of consistency. In the simulation, we set $n_t=1000$ and repeat the process 200 times. For each parameterization, we normalize the MSE by dividing it by the infeasible MSE (the mean, over the 200 replications, of the smallest MSE among the $M$ approximating models).

We consider two simulation settings. In the first setting, like Hansen (2007), all candidate models are misspecified, and candidate models are strictly nested. In the second setting, the true model is one of candidate models, and candidate models are non-nested.

4.1 Nested Setting and Results

We set $K_t=400$, $\theta_k=c\sqrt{2\alpha}k^{-\alpha-\frac{1}{2}}$, and $K=\log_4^2 n$ (i.e., $K=11,17,20,22$), where $R^2$ is controlled by the parameter $c$. For $\rho=0.3$, the mean of the normalized MSEs over the 200 replications is shown in Figures 1-6. The results with $\rho=0.6$ are similar and so are omitted to save space.

For the homoskedastic case, we can draw the following conclusions from Figures 1-3. When $\alpha=0.5$ (Figure 1): 1) RM and RJ perform better than the other methods if $R^2\leq 0.5$ and $n=300$, $500$ and $700$; 2) RM and RJ perform worse than the other methods except BI and SB if $R^2>0.5$, and worse than MM, JM and GM if $R^2=0.1$ and $n=100$; 3) RM performs better than RJ in most cases if $n=100$ and $300$. When $\alpha=1.0$ (Figure 2): 1) RM performs better than the other methods if $R^2\leq 0.8$; 2) RM and RJ perform worse than the other methods except BI and SB if $R^2=0.9$; 3) RM performs better than RJ in most cases if $n=100$ and $300$. When $\alpha=1.5$ (Figure 3): 1) RM and RJ always perform better than the other methods; 2) RM performs better than RJ in most cases if $n=100$ and $300$.

For the heteroskedastic case, we can draw the following conclusions from Figures 4-6. When $\alpha=0.5$ (Figure 4): 1) RM performs better than the other methods if $R^2\leq 0.5$; 2) RM and RJ perform worse than the other methods except BI and SB if $R^2>0.5$; 3) RM performs better than RJ. When $\alpha=1.0$ (Figure 5): 1) RM performs better than the other methods if $R^2\leq 0.7$; 2) RM and RJ perform worse than the other methods except BI and SB if $R^2>0.7$; 3) RM performs better than RJ, and RJ performs better than the other methods in most cases. When $\alpha=1.5$ (Figure 6): 1) RM always performs better than the other methods; 2) RM always performs better than RJ, and RJ performs better than the other methods in most cases.

From Figures 1-6, we can draw the following conclusions: 1) As $\alpha$ increases, RM and RJ perform better and better; 2) $n$ has no obvious influence on the performance comparison of the various methods; 3) As $R^2$ increases, the performance of all methods improves, but RJ and RM never perform very badly and are the best in most cases.

To sum up, the conclusions are as follows: 1) RM and RJ are the best in most cases and are not bad when not the best; 2) When $\alpha$ is small and $R^2$ is large, GM performs better, and RM and RJ are the best in the other cases; 3) RM performs better than RJ.

4.2 Non-nested Setting and Results

We set $K_t=12$ and $\theta_k=c\sqrt{2\alpha}k^{-\alpha-\frac{1}{2}}$ for $1\leq k\leq 10$, and $\theta_k=0$ for $k=11,12$, where $R^2$ is controlled by the parameter $c$. Each approximating model contains the first 6 covariates, and the last 6 covariates are combined to obtain $2^6$ approximating models. For $\rho=0.3$, the mean of the normalized MSEs over the 200 replications is shown in Figures 7-12. As in the nested case, the results with $\rho=0.6$ are similar and so are omitted.

For this setting, we can draw the following conclusions from Figures 7-12. When $\alpha=0.5$ (Figures 7 and 10): 1) RM and RJ perform better than the other methods if $R^2\leq 0.5$; 2) As $n$ increases, RM and RJ perform worse than the other methods for larger $R^2$; 3) For the homoskedastic case, RJ performs better than RM in most cases. When $\alpha=1.0$ (Figures 8 and 11): 1) RM and RJ perform better than the other methods except SB if $R^2\leq 0.7$, but the performance of SB is very unstable; 2) As $n$ increases, RM and RJ perform worse than the other methods for larger $R^2$; 3) RJ performs better than RM in most cases. When $\alpha=1.5$ (Figures 9 and 12): 1) RM and RJ perform better than the other methods except BI and SB, but the performance of SB and BI is very unstable; 2) RJ performs better than RM in most cases.

To sum up, the conclusions are as follows: 1) RM and RJ are the best in most cases and have stable performance; 2) One of SB, SA, BI, and AI may perform best when $R^2$ is small or large, but their performance is unstable compared with RM and RJ; 3) In the heteroskedastic case with larger $\alpha$ and $n$, and in the homoskedastic case, RJ performs better than RM on the whole.

5 Real Data Analysis

In this section, we apply the proposed method to the "wage1" dataset in Wooldridge (2003), which comes from the US Current Population Survey for the year 1976. There are 526 observations in this dataset. The response variable is the log of average hourly earnings, while the covariates include: 1) dummy variables: nonwhite, female, married, numdep, smsa, northcen, south, west, construc, ndurman, trcommpu, trade, services, profserv, profocc, clerocc, and servocc; 2) non-dummy variables: educ, exper, and tenure; 3) interaction variables: nonwhite $\times$ educ, nonwhite $\times$ exper, nonwhite $\times$ tenure, female $\times$ educ, female $\times$ exper, female $\times$ tenure, married $\times$ educ, married $\times$ exper, and married $\times$ tenure.

We consider the following two cases: 1) we rank the covariates according to their linear correlations with the response variable and then apply the strictly nested model averaging methods (the intercept term is included and ranked first); 2) 100 models are selected by using the function "regsubsets" in the "leaps" package of R, where the parameters "nvmax" and "nbest" are set to 20 and 5, respectively, and the other parameters take their default values.

We randomly divide the data into two parts: a training sample $S$ of $n$ observations for estimating the models and a test sample $S_t$ of $n_t=526-n$ observations for validating the results. We consider $n=110,210,320,420$, and

MSE=n_t^{-1}\sum_{z_i\in S_t}[y_i-x_i'\hat{\theta}(\hat{w})]^2

is calculated as a measure of consistency. We replicate the process 200 times. The box plots of the MSEs over the 200 replications are shown in Figures 13-14. From these figures, we see that the performance of RM and RJ is good and stable. We also compute the mean and median of the MSEs, as well as the best performance rate (BPR), which is the frequency of achieving the lowest risk across the replications. The results are shown in Tables 1-2. From these tables, we can draw the following conclusions: 1) RM and RJ are always superior to the other methods in terms of the mean and median of the MSEs and the BPR; 2) The performance of RM and RJ is basically the same in terms of the mean and median of the MSEs; 3) For the BPR, RM outperforms RJ on the whole.
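The evaluation protocol described above (repeated random splits, test MSE, and BPR) can be sketched as follows; the method callables and their names are placeholders for the fitted estimators, not part of the paper's code.

```python
import numpy as np

def evaluate_methods(X, y, methods, n_train, n_rep=200, seed=0):
    """Repeated random train/test splits. Returns an (n_rep, n_methods) MSE
    matrix and the best performance rate (BPR) of each method, i.e. the
    frequency with which it attains the lowest test MSE across replications.
    `methods` maps a name to a callable (X_tr, y_tr, X_te) -> predictions;
    these callables (for RM, RJ, MM, ...) are assumed to be defined elsewhere."""
    rng = np.random.default_rng(seed)
    names = list(methods)
    n = len(y)
    mse = np.empty((n_rep, len(names)))
    for r in range(n_rep):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        for j, name in enumerate(names):
            pred = methods[name](X[tr], y[tr], X[te])
            mse[r, j] = np.mean((y[te] - pred) ** 2)
    best = mse.argmin(axis=1)
    bpr = {name: float(np.mean(best == j)) for j, name in enumerate(names)}
    return mse, bpr
```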

Table 1: The mean, median and BPR of MSE in Case 1
n Statistic AI Cp BI SA SB MM RM GM JM RJ
110 Mean 0.185 0.182 0.184 0.180 0.179 0.169 0.164 0.178 0.168 0.164
Median 0.179 0.176 0.182 0.174 0.177 0.167 0.162 0.174 0.166 0.162
BPR 0.011 0.000 0.006 0.061 0.039 0.028 0.376 0.044 0.088 0.348
210 Mean 0.160 0.159 0.169 0.157 0.166 0.155 0.152 0.155 0.155 0.152
Median 0.160 0.159 0.169 0.156 0.166 0.154 0.152 0.155 0.155 0.153
BPR 0.030 0.000 0.010 0.075 0.000 0.020 0.460 0.080 0.035 0.290
320 Mean 0.152 0.152 0.156 0.150 0.154 0.148 0.146 0.148 0.148 0.146
Median 0.152 0.152 0.156 0.150 0.154 0.148 0.145 0.148 0.147 0.146
BPR 0.010 0.000 0.050 0.160 0.035 0.045 0.315 0.070 0.020 0.295
420 Mean 0.152 0.152 0.150 0.150 0.151 0.148 0.147 0.149 0.148 0.147
Median 0.151 0.150 0.148 0.147 0.149 0.147 0.147 0.148 0.147 0.147
BPR 0.025 0.005 0.105 0.175 0.060 0.030 0.215 0.055 0.030 0.300
Table 2: The mean, median and BPR of MSE in Case 2
n Statistic AI Cp BI SA SB MM RM GM JM RJ
110 Mean 0.167 0.167 0.171 0.161 0.163 0.158 0.152 0.160 0.158 0.152
Median 0.164 0.164 0.170 0.158 0.163 0.157 0.151 0.158 0.157 0.151
BPR 0.011 0.000 0.005 0.081 0.000 0.016 0.400 0.011 0.011 0.465
210 Mean 0.152 0.151 0.157 0.149 0.153 0.148 0.145 0.148 0.148 0.145
Median 0.151 0.151 0.157 0.148 0.153 0.147 0.144 0.148 0.148 0.144
BPR 0.000 0.005 0.000 0.170 0.010 0.005 0.460 0.005 0.015 0.330
320 Mean 0.147 0.146 0.152 0.144 0.148 0.145 0.142 0.145 0.145 0.142
Median 0.145 0.145 0.150 0.143 0.147 0.144 0.142 0.143 0.144 0.142
BPR 0.010 0.000 0.000 0.280 0.040 0.000 0.340 0.000 0.000 0.330
420 Mean 0.148 0.148 0.152 0.144 0.148 0.146 0.143 0.146 0.146 0.143
Median 0.147 0.147 0.151 0.144 0.147 0.145 0.141 0.145 0.145 0.141
BPR 0.005 0.000 0.000 0.370 0.055 0.000 0.265 0.000 0.010 0.295

6 Concluding Remarks

In this paper, we study the relationships between AERM, stability, generalization and consistency in model averaging. The results indicate that stability is an important property of model averaging, which can ensure that model averaging has good generalization performance and is consistent under reasonable conditions. When the model weights are not restricted, extreme weights tend to appear in MMA and JMA under the influence of correlation between candidate models, resulting in poor performance of the corresponding model averaging estimator. So, similar to ridge regression in Hoerl and Kennard (1970), we propose an $L_2$-penalty model averaging method and prove that it has stability and consistency. In order to reduce the impact of tuning parameter selection, we use 10-fold cross-validation to select a candidate set of tuning parameters and perform a weighted average of the estimators of the model weights based on estimation errors. The numerical simulation and real data analysis show the superiority of the proposed method.

Many issues deserve further investigation. We have only applied the methods of Section 2 to the generalization of MMA and JMA in linear regression; it is worth investigating whether the proposed method can be extended to more complex scenarios, such as generalized linear models, quantile regression, and dependent data. Further, we also expect to propose a model averaging framework with stability and consistency that can be applied to multiple scenarios. In addition, in RMMA and RJMA, the estimators of the weight vector have explicit expressions, so how to study their asymptotic behavior based on these expressions is a meaningful but challenging topic.

Figure 1: The mean of normalized MSE under homoskedastic errors with $\alpha=0.5$ in nested setting
Figure 2: The mean of normalized MSE under homoskedastic errors with $\alpha=1.0$ in nested setting
Figure 3: The mean of normalized MSE under homoskedastic errors with $\alpha=1.5$ in nested setting
Figure 4: The mean of normalized MSE under heteroskedastic errors with $\alpha=0.5$ in nested setting
Figure 5: The mean of normalized MSE under heteroskedastic errors with $\alpha=1.0$ in nested setting
Figure 6: The mean of normalized MSE under heteroskedastic errors with $\alpha=1.5$ in nested setting
Figure 7: The mean of normalized MSE under homoskedastic errors with $\alpha=0.5$ in non-nested setting
Figure 8: The mean of normalized MSE under homoskedastic errors with $\alpha=1.0$ in non-nested setting
Figure 9: The mean of normalized MSE under homoskedastic errors with $\alpha=1.5$ in non-nested setting
Figure 10: The mean of normalized MSE under heteroskedastic errors with $\alpha=0.5$ in non-nested setting
Figure 11: The mean of normalized MSE under heteroskedastic errors with $\alpha=1.0$ in non-nested setting
Figure 12: The mean of normalized MSE under heteroskedastic errors with $\alpha=1.5$ in non-nested setting
Figure 13: The box plot of MSE in Case 1
Figure 14: The box plot of MSE in Case 2

7 Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (Grant nos. 11971323 and 12031016).

Appendix A Lemmas and Proofs

Let $\hat{\theta}^{-n}(w)$ and $\hat{\theta}^{-(n-1,n)}(w)$ be the model averaging estimators corresponding to $\hat{\theta}(w)$ based on $S^{-n}$ and $S^{-(n-1,n)}$, respectively, where $S^{-(n-1,n)}$ denotes the sample set after removing the $(n-1)$-th and $n$-th observations from $S$.

Lemma 1.

Under Assumptions 1 and 2, there exists a constant $B_1>0$ such that $\max_{1\leq m\leq M}\|\hat{\theta}_m\|_2^2\leq B_1K$, a.s..

Proof.

It follows from Assumptions 1 and 2 that

\begin{aligned}
\max_{1\leq m\leq M}\|\hat{\theta}_m\|_2^2
&=\max_{1\leq m\leq M}\|(X_m'X_m)^{-1}X_m'y\|_2^2\\
&=\max_{1\leq m\leq M}y'X_m(X_m'X_m)^{-1}(X_m'X_m)^{-1}X_m'y\\
&\leq C_1^{-2}C_2Kn^{-1}y'y\\
&\leq C_1^{-2}C_2C_3K,\quad a.s..
\end{aligned}

We take $B_1=C_1^{-2}C_2C_3$ to complete the proof.

Lemma 2.

Under Assumptions 1 and 2, we have

ES(max1mMθ^mθ^mn22)=O(n2K3),E_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}_{m}-\hat{\theta}^{-n}_{m}\|_{2}^{2}\Big{)}=O(n^{-2}K^{3}),

and

ES(max1mMθ^mnθ^m(n1,n)22)=O(n2K3).E_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}_{m}^{-n}-\hat{\theta}^{-(n-1,n)}_{m}\|_{2}^{2}\Big{)}=O(n^{-2}K^{3}).

Proof.

By Dufour (1982), we see that for any m{1,,M}m\in\{1,...,M\},

θ^m=θ^mn+(XmXm)1πmxn(ynxnπmθ^mn).\hat{\theta}_{m}=\hat{\theta}^{-n}_{m}+(X_{m}^{{}^{\prime}}X_{m})^{-1}\pi_{m}^{{}^{\prime}}x_{n}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m}).

It follows from Assumption 1 that

ES[max1mMθ^mθ^mn22]\displaystyle E_{S}\Big{[}\max_{1\leq m\leq M}\|\hat{\theta}_{m}-\hat{\theta}^{-n}_{m}\|_{2}^{2}\Big{]}
=ES[max1mMxnπm(XmXm)1(XmXm)1πmxn(ynxnπmθ^mn)2]\displaystyle=E_{S}\Big{[}\max_{1\leq m\leq M}x_{n}^{{}^{\prime}}\pi_{m}(X_{m}^{{}^{\prime}}X_{m})^{-1}(X_{m}^{{}^{\prime}}X_{m})^{-1}\pi_{m}^{{}^{\prime}}x_{n}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})^{2}\Big{]}
ES[max1mMλmax2[(XmXm)1]πmxn(ynxnπmθ^mn)22]\displaystyle\leq E_{S}\Big{[}\max_{1\leq m\leq M}\lambda_{max}^{2}[(X_{m}^{{}^{\prime}}X_{m})^{-1}]\|\pi_{m}^{{}^{\prime}}x_{n}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})\|_{2}^{2}\Big{]}
ES[C12n2max1mMπmxn(ynxnπmθ^mn)22].\displaystyle\leq E_{S}\Big{[}C_{1}^{-2}n^{-2}\max_{1\leq m\leq M}\|\pi_{m}^{{}^{\prime}}x_{n}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})\|_{2}^{2}\Big{]}. (1)

From Assumption 2, we obtain

ES[max1mMπmxn(ynxnπmθ^mn)22]\displaystyle E_{S}\Big{[}\max_{1\leq m\leq M}\|\pi_{m}^{{}^{\prime}}x_{n}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})\|_{2}^{2}\Big{]}
ES[max1mMk=1Kx(k)n2(ynxnπmθ^mn)2]\displaystyle\leq E_{S}\Big{[}\max_{1\leq m\leq M}\sum_{k=1}^{K}x_{(k)n}^{2}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})^{2}\Big{]}
C42KES[max1mM(ynxnπmθ^mn)2]\displaystyle\leq C_{4}^{2}KE_{S}\Big{[}\max_{1\leq m\leq M}(y_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})^{2}]
C42KES[max1mM(2yn2+2(xnπmθ^mn)2)].\displaystyle\leq C_{4}^{2}KE_{S}[\max_{1\leq m\leq M}(2y_{n}^{2}+2(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})^{2})\Big{]}. (2)

Further, from Assumption 2 and Lemma 1, we have

ES[max1mM(xnπmθ^mn)2]\displaystyle E_{S}\Big{[}\max_{1\leq m\leq M}(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}^{-n}_{m})^{2}\Big{]}
ES(max1mMxn22θ^mn22)\displaystyle\leq E_{S}\Big{(}\max_{1\leq m\leq M}\|x_{n}\|_{2}^{2}\|\hat{\theta}^{-n}_{m}\|_{2}^{2}\Big{)}
C42KES(max1mMθ^mn22)\displaystyle\leq C_{4}^{2}KE_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}^{-n}_{m}\|_{2}^{2}\Big{)}
B1C42K2.\displaystyle\leq B_{1}C_{4}^{2}K^{2}. (3)

Combining (1)(3)(\ref{equ:1})-(\ref{equ:3}), it is seen that

ES(max1mMθ^mθ^mn22)=O(n2K3).E_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}_{m}-\hat{\theta}^{-n}_{m}\|_{2}^{2}\Big{)}=O(n^{-2}K^{3}).

In a similar way, it can be shown that

ES(max1mMθ^mnθ^m(n1,n)22)=O(n2K3).E_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}_{m}^{-n}-\hat{\theta}^{-(n-1,n)}_{m}\|_{2}^{2}\Big{)}=O(n^{-2}K^{3}).
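As a numerical sanity check on the recursive leave-one-out identity of Dufour (1982) invoked at the beginning of this proof, the following sketch simulates a single candidate model (with \pi_{m} taken as the identity, so X_{m}=X) and verifies that updating \hat{\theta}_{m}^{-n} by (X_{m}^{\prime}X_{m})^{-1}\pi_{m}^{\prime}x_{n}(y_{n}-x_{n}^{\prime}\pi_{m}\hat{\theta}_{m}^{-n}) recovers the full-sample estimator up to numerical precision; the simulated design and sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 4
X = rng.normal(size=(n, k))                      # design of one candidate model (pi_m = identity)
y = X @ rng.normal(size=k) + rng.normal(size=n)

theta_full = np.linalg.solve(X.T @ X, X.T @ y)   # hat{theta}_m based on the full sample S
Xm, ym = X[:-1], y[:-1]                          # S^{-n}: drop the n-th observation
theta_loo = np.linalg.solve(Xm.T @ Xm, Xm.T @ ym)

# Dufour (1982): hat{theta}_m = hat{theta}_m^{-n} + (X_m'X_m)^{-1} x_n (y_n - x_n' hat{theta}_m^{-n})
x_n, y_n = X[-1], y[-1]
theta_updated = theta_loo + np.linalg.solve(X.T @ X, x_n) * (y_n - x_n @ theta_loo)

print(np.max(np.abs(theta_full - theta_updated)))  # of the order of machine precision
```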

Lemma 3.

Under Assumptions 1 and 2, we have

ES,z(γ^γ^)=n1ES(Ω^Ω^)+O(n1K3),E_{S,z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})=n^{-1}E_{S}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega})+O(n^{-1}K^{3}),

and

n1ES(Ω¯Ω¯)=n1ES(Ω^Ω^)+O[n(C1nC42K)2K].n^{-1}E_{S}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega})=n^{-1}E_{S}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega})+O[n(C_{1}n-C_{4}^{2}K)^{-2}K].

Proof.

Note that

γ^γ^=(xπmθ^mθ^tπtx)M×M,\hat{\gamma}^{{}^{\prime}}\hat{\gamma}=(x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})_{M\times M},
Ω¯Ω¯={[yDm(yPmy)][yDt(yPty)]}M×M,\bar{\Omega}^{{}^{\prime}}\bar{\Omega}=\big{\{}[y-D_{m}(y-P_{m}y)]^{{}^{\prime}}[y-D_{t}(y-P_{t}y)]\big{\}}_{M\times M},

and

Ω^Ω^=(yPmPty)M×M=(i=1nxiπmθ^mθ^tπtxi)M×M.\hat{\Omega}^{{}^{\prime}}\hat{\Omega}=(y^{{}^{\prime}}P_{m}^{{}^{\prime}}P_{t}y)_{M\times M}=\Big{(}\sum_{i=1}^{n}x_{i}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{i}\Big{)}_{M\times M}.

It follows from Assumption 2, Lemma 1 and Lemma 2 that

|ES(xnπmθ^mθ^tπtxnxnπmθ^mnθ^tnπtxn)|\displaystyle|E_{S}(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x_{n})|
|ES[xnπm(θ^mθ^mn)θ^tπtxn]|+|ES[xnπmθ^mn(θ^tθ^tn)πtxn]|\displaystyle\leq|E_{S}[x_{n}^{{}^{\prime}}\pi_{m}(\hat{\theta}_{m}-\hat{\theta}_{m}^{-n})\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}]|+|E_{S}[x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}(\hat{\theta}_{t}-\hat{\theta}_{t}^{-n})^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}]|
ES(θ^mθ^mn22)ES(xnπtθ^tπmxn22)+ES(xnπmθ^mnπtxn22)ES(θ^tθ^tn22)\displaystyle\leq\sqrt{E_{S}(\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}^{2})}\sqrt{E_{S}(\|x_{n}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}\pi_{m}^{{}^{\prime}}x_{n}\|_{2}^{2})}+\sqrt{E_{S}(\|x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\pi_{t}^{{}^{\prime}}x_{n}\|_{2}^{2})}\sqrt{E_{S}(\|\hat{\theta}_{t}-\hat{\theta}_{t}^{-n}\|_{2}^{2})}
ES(θ^mθ^mn22)ES(xn22θ^t22xn22)+ES(xn22θ^mn22xn22)ES(θ^tθ^tn22)\displaystyle\leq\sqrt{E_{S}(\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}^{2})}\sqrt{E_{S}(\|x_{n}\|_{2}^{2}\|\hat{\theta}_{t}\|_{2}^{2}\|x_{n}\|_{2}^{2})}+\sqrt{E_{S}(\|x_{n}\|_{2}^{2}\|\hat{\theta}_{m}^{-n}\|_{2}^{2}\|x_{n}\|_{2}^{2})}\sqrt{E_{S}(\|\hat{\theta}_{t}-\hat{\theta}_{t}^{-n}\|_{2}^{2})}
=O(n1K3).\displaystyle=O(n^{-1}K^{3}).

In a similar way, we obtain

|ES,z(xπmθ^mnθ^tnπtxxπmθ^mθ^tπtx)|=O(n1K3).|E_{S,z^{*}}(x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x^{*}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})|=O(n^{-1}K^{3}).

Further, it is seen that

ES,z(1ni=1nxiπmθ^mθ^tπtxixπmθ^mθ^tπtx)\displaystyle E_{S,z^{*}}(\frac{1}{n}\sum_{i=1}^{n}x_{i}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{i}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})
=1ni=1nES,z(xiπmθ^mθ^tπtxixπmθ^mθ^tπtx)\displaystyle=\frac{1}{n}\sum_{i=1}^{n}E_{S,z^{*}}(x_{i}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{i}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})
=ES,z(xnπmθ^mθ^tπtxnxπmθ^mθ^tπtx)\displaystyle=E_{S,z^{*}}(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})
=ES,z(xnπmθ^mθ^tπtxnxπmθ^mnθ^tnπtx)+ES,z(xπmθ^mnθ^tnπtxxπmθ^mθ^tπtx)\displaystyle=E_{S,z^{*}}(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})+E_{S,z^{*}}(x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x^{*}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*})
=ES(xnπmθ^mθ^tπtxnxnπmθ^mnθ^tnπtxn)+ES,z(xπmθ^mnθ^tnπtxxπmθ^mθ^tπtx).\displaystyle=E_{S}(x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x_{n}-x_{n}^{{}^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x_{n})+E_{S,z^{*}}(x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}^{-n}\hat{\theta}_{t}^{-n^{\prime}}\pi_{t}^{{}^{\prime}}x^{*}-x^{*^{\prime}}\pi_{m}\hat{\theta}_{m}\hat{\theta}_{t}^{{}^{\prime}}\pi_{t}^{{}^{\prime}}x^{*}).

So we have

ES,z(γ^γ^)=n1ES[Ω^Ω^]+O(n1K3).E_{S,z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})=n^{-1}E_{S}[\hat{\Omega}^{{}^{\prime}}\hat{\Omega}]+O(n^{-1}K^{3}).

On the other hand, it follows from Assumptions 1 and 2 that

max1inmax1mMhiim=xiπm(XmXm)1πmxiC42KnC1,a.s..\max_{1\leq i\leq n}\max_{1\leq m\leq M}h_{ii}^{m}=x_{i}^{{}^{\prime}}\pi_{m}(X_{m}^{{}^{\prime}}X_{m})^{-1}\pi_{m}^{{}^{\prime}}x_{i}\leq\frac{C_{4}^{2}K}{nC_{1}},a.s..

Hence, we have

|yPmPty[yDm(yPmy)][yDt(yPty)]|\displaystyle|y^{{}^{\prime}}P_{m}^{{}^{\prime}}P_{t}y-[y-D_{m}(y-P_{m}y)]^{{}^{\prime}}[y-D_{t}(y-P_{t}y)]|
=|yPmPty[(InDm)y+DmPmy][(InDt)y+DtPty]|\displaystyle=|y^{{}^{\prime}}P_{m}^{{}^{\prime}}P_{t}y-[(I_{n}-D_{m})y+D_{m}P_{m}y]^{{}^{\prime}}[(I_{n}-D_{t})y+D_{t}P_{t}y]|
|yPm(DmDtIn)Pty|+y(InDm)(InDt)y+|y(InDm)DtPty|+|yPmDm(InDt)y|\displaystyle\leq|y^{{}^{\prime}}P_{m}^{{}^{\prime}}(D_{m}D_{t}-I_{n})P_{t}y|+y^{{}^{\prime}}(I_{n}-D_{m})(I_{n}-D_{t})y+|y^{{}^{\prime}}(I_{n}-D_{m})D_{t}P_{t}y|+|y^{{}^{\prime}}P_{m}^{{}^{\prime}}D_{m}(I_{n}-D_{t})y|
max1mMmax1in[(1hiim)21]yPmPmy+max1mMmax1in[1(1hiim)1]2yy\displaystyle\leq\max_{1\leq m\leq M}\max_{1\leq i\leq n}[(1-h_{ii}^{m})^{-2}-1]y^{{}^{\prime}}P_{m}^{{}^{\prime}}P_{m}y+\max_{1\leq m\leq M}\max_{1\leq i\leq n}[1-(1-h_{ii}^{m})^{-1}]^{2}y^{{}^{\prime}}y
+2max1mMmax1in(1hiim)2[1(1hiim)1]2yyyPmPmy\displaystyle\ \ \ +2\max_{1\leq m\leq M}\max_{1\leq i\leq n}\sqrt{(1-h_{ii}^{m})^{-2}[1-(1-h_{ii}^{m})^{-1}]^{2}y^{{}^{\prime}}yy^{{}^{\prime}}P_{m}^{{}^{\prime}}P_{m}y}
max1mMmax1in{[(1hiim)21]+[(1hiim)11]2+2(1hiim)1[(1hiim)11]}yy\displaystyle\leq\max_{1\leq m\leq M}\max_{1\leq i\leq n}\{[(1-h_{ii}^{m})^{-2}-1]+[(1-h_{ii}^{m})^{-1}-1]^{2}+2(1-h_{ii}^{m})^{-1}[(1-h_{ii}^{m})^{-1}-1]\}y^{{}^{\prime}}y
C3n{[(1C42KnC1)21]+[(1C42KnC1)11]2+2(1C42KnC1)1[(1C42KnC1)11]}\displaystyle\leq C_{3}n\{[(1-\frac{C_{4}^{2}K}{nC_{1}})^{-2}-1]+[(1-\frac{C_{4}^{2}K}{nC_{1}})^{-1}-1]^{2}+2(1-\frac{C_{4}^{2}K}{nC_{1}})^{-1}[(1-\frac{C_{4}^{2}K}{nC_{1}})^{-1}-1]\}
=4C1C3C42n2K(C1nC42K)2,a.s..\displaystyle=\frac{4C_{1}C_{3}C_{4}^{2}n^{2}K}{(C_{1}n-C_{4}^{2}K)^{2}},a.s..

Thus, from Assumption 2, we see that

n1ES(Ω¯Ω¯)=n1ES(Ω^Ω^)+O[n(C1nC42K)2K].n^{-1}E_{S}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega})=n^{-1}E_{S}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega})+O[n(C_{1}n-C_{4}^{2}K)^{-2}K].
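The comparison between \bar{\Omega} and \hat{\Omega} above uses the representation of the leave-one-out fitted values of model m as y-D_{m}(y-P_{m}y), with D_{m} the diagonal matrix with entries (1-h_{ii}^{m})^{-1} (consistent with the bounds in the proof). The following minimal numerical check of this representation uses a simulated design; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 3
X = rng.normal(size=(n, k))
y = X @ rng.normal(size=k) + rng.normal(size=n)

P = X @ np.linalg.solve(X.T @ X, X.T)       # hat matrix P_m of one candidate model
h = np.diag(P)                              # leverages h_ii^m
D = np.diag(1.0 / (1.0 - h))                # D_m = diag{(1 - h_ii^m)^{-1}}

loo_matrix = y - D @ (y - P @ y)            # candidate column of Omega-bar: y - D_m (y - P_m y)

loo_direct = np.empty(n)                    # direct leave-one-out fitted values x_i' hat{theta}_m^{-i}
for i in range(n):
    keep = np.arange(n) != i
    theta_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    loo_direct[i] = X[i] @ theta_i

print(np.max(np.abs(loo_matrix - loo_direct)))  # of the order of machine precision
```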

Lemma 4.

Under Assumptions 1 - 3, there is a constant B2>0B_{2}>0 such that

w^22B2M(1+n2K2),\|\hat{w}\|_{2}^{2}\leq B_{2}M(1+n^{-2}K^{2}),
w¯22B2M,\|\bar{w}\|_{2}^{2}\leq B_{2}M,
w~22B2M(1+n2K2),\|\tilde{w}\|_{2}^{2}\leq B_{2}M(1+n^{-2}K^{2}),

and

w^22B2M2,\|\hat{w}^{*}\|_{2}^{2}\leq B_{2}M^{2},

a.s..

Proof.

It follows from Assumptions 2 and 3 that

w^22\displaystyle\|\hat{w}\|_{2}^{2}
=(Ω^Ω^+λnIn)1Ω^yσ2(Ω^Ω^+λnIn)1κ22\displaystyle=\|(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}\hat{\Omega}^{{}^{\prime}}y-\sigma^{2}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}\kappa\|_{2}^{2}
2yΩ^(Ω^Ω^+λnIn)1(Ω^Ω^+λnIn)1Ω^y+2σ4κ(Ω^Ω^+λnIn)1(Ω^Ω^+λnIn)1κ\displaystyle\leq 2y^{{}^{\prime}}\hat{\Omega}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}\hat{\Omega}^{{}^{\prime}}y+2\sigma^{4}\kappa^{{}^{\prime}}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})^{-1}\kappa
2C52C6n1Myy+2σ4C52n2κκ\displaystyle\leq 2C_{5}^{-2}C_{6}n^{-1}My^{{}^{\prime}}y+2\sigma^{4}C_{5}^{-2}n^{-2}\kappa^{{}^{\prime}}\kappa
2C52C3C6M+2σ4C52n2K2M,a.s.,\displaystyle\leq 2C_{5}^{-2}C_{3}C_{6}M+2\sigma^{4}C_{5}^{-2}n^{-2}K^{2}M,a.s.,

and

w¯22\displaystyle\|\bar{w}\|_{2}^{2}
=(Ω¯Ω¯+λnIn)1Ω¯y22\displaystyle=\|(\bar{\Omega}^{{}^{\prime}}\bar{\Omega}+\lambda_{n}I_{n})^{-1}\bar{\Omega}^{{}^{\prime}}y\|_{2}^{2}
yΩ¯(Ω¯Ω¯+λnIn)1(Ω¯Ω¯+λnIn)1Ω¯y\displaystyle\leq y^{{}^{\prime}}\bar{\Omega}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega}+\lambda_{n}I_{n})^{-1}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega}+\lambda_{n}I_{n})^{-1}\bar{\Omega}^{{}^{\prime}}y
C52C6n1Myy\displaystyle\leq C_{5}^{-2}C_{6}n^{-1}My^{{}^{\prime}}y
C52C3C6M,a.s..\displaystyle\leq C_{5}^{-2}C_{3}C_{6}M,a.s..

In a similar way, we obtain

w~222C52C3C6M+2σ4C52n2K2M,a.s..\|\tilde{w}\|_{2}^{2}\leq 2C_{5}^{-2}C_{3}C_{6}M+2\sigma^{4}C_{5}^{-2}n^{-2}K^{2}M,a.s..

On the other hand, it follows from Assumptions 23\ref{ass:2}-\ref{ass:3} and Lemma 1 that

w^22\displaystyle\|\hat{w}^{*}\|_{2}^{2}
=[Ez(γ^γ^)]1Ez(γ^y)22\displaystyle=\|[E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})]^{-1}E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}y^{*})\|_{2}^{2}
Ez(γ^y)[Ez(γ^γ^)]1[Ez(γ^γ^)]1Ez(γ^y)\displaystyle\leq E_{z^{*}}(\hat{\gamma}y^{*})[E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})]^{-1}[E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})]^{-1}E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}y^{*})
Ez(γ^y)Ez(γ^y)λmin2[Ez(γ^γ^)]\displaystyle\leq E_{z^{*}}(\hat{\gamma}y^{*})E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}y^{*})\lambda_{min}^{-2}[E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})]
C52Ez(γ^y)Ez(γ^y)\displaystyle\leq C_{5}^{-2}E_{z^{*}}(\hat{\gamma}y^{*})E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}y^{*})
C52Ez(γ^γ^)Ez(y2)\displaystyle\leq C_{5}^{-2}E_{z^{*}}(\hat{\gamma}\hat{\gamma}^{{}^{\prime}})E_{z^{*}}(y^{*2})
C52tr[Ez(γ^γ^)]Ez(y2)\displaystyle\leq C_{5}^{-2}tr[E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})]E_{z^{*}}(y^{*2})
C52C3C6M2,a.s..\displaystyle\leq C_{5}^{-2}C_{3}C_{6}M^{2},a.s..

We take B2=max{2C52C3C6,2σ4C52}B_{2}=max\{2C_{5}^{-2}C_{3}C_{6},2\sigma^{4}C_{5}^{-2}\} to complete the proof.
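The vector \hat{w}^{*}=[E_{z^{*}}(\hat{\gamma}^{\prime}\hat{\gamma})]^{-1}E_{z^{*}}(\hat{\gamma}^{\prime}y^{*}) bounded above is the weight vector minimizing the prediction risk given the fitted candidate models. A minimal sketch of this object, approximating the expectation over z^{*} by a large independent test sample and using two hypothetical nested candidate models (all settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_test, k = 100, 200000, 4
beta = np.array([1.0, 0.5, 0.0, 0.0])
X = rng.normal(size=(n, k))
y = X @ beta + rng.normal(size=n)

# two nested candidate models: the first two covariates, and all four covariates
theta_1 = np.linalg.solve(X[:, :2].T @ X[:, :2], X[:, :2].T @ y)
theta_2 = np.linalg.solve(X.T @ X, X.T @ y)

# approximate E_{z*} by averaging over a large test sample (x*, y*) from the same design
Xt = rng.normal(size=(n_test, k))
yt = Xt @ beta + rng.normal(size=n_test)
gamma = np.column_stack([Xt[:, :2] @ theta_1, Xt @ theta_2])   # hat{gamma} evaluated at x*

A = gamma.T @ gamma / n_test     # Monte Carlo approximation of E_{z*}(gamma' gamma)
b = gamma.T @ yt / n_test        # Monte Carlo approximation of E_{z*}(gamma' y*)
w_star = np.linalg.solve(A, b)   # risk-minimizing weights hat{w}^* given the fitted models
print(w_star)
```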

Lemma 5.

Under Assumptions 1 - 4, we have

ES[C(w^n,S)C(w^,S)]=O[K3M2(1+n2K2)],E_{S}[C(\hat{w}^{-n},S)-C(\hat{w},S)]=O[K^{3}M^{2}(1+n^{-2}K^{2})],
ES[J(w¯n,S)J(w¯,S)]=O(K3M2),E_{S}[J(\bar{w}^{-n},S)-J(\bar{w},S)]=O(K^{3}M^{2}),

and

ES[F(w^,n,S)F(w^,S)]=O(n1K3M3).E_{S}[F(\hat{w}^{*,-n},S)-F(\hat{w}^{*},S)]=O(n^{-1}K^{3}M^{3}).

Proof.

Note that

C(w^n,S)C(w^,S)\displaystyle C(\hat{w}^{-n},S)-C(\hat{w},S)
=C(w^n,S)C(w^n,Sn)+C(w^n,Sn)C(w^,S)\displaystyle=C(\hat{w}^{-n},S)-C(\hat{w}^{-n},S^{-n})+C(\hat{w}^{-n},S^{-n})-C(\hat{w},S)
C(w^n,S)C(w^n,Sn)+C(w^,Sn)C(w^,S),\displaystyle\leq C(\hat{w}^{-n},S)-C(\hat{w}^{-n},S^{-n})+C(\hat{w},S^{-n})-C(\hat{w},S),

and

ES[C(w,S)C(w,Sn)(λnλn1)ww]\displaystyle E_{S}[C(w,S)-C(w,S^{-n})-(\lambda_{n}-\lambda_{n-1})w^{{}^{\prime}}w]
=i=1n1ES{[yixiθ^(w)]2[yixiθ^n(w)]2}+ES{[ynxnθ^(w)]2}\displaystyle=\sum_{i=1}^{n-1}E_{S}\Big{\{}[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}(w)]^{2}-[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-n}(w)]^{2}\Big{\}}+E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(w)]^{2}\Big{\}}
=(n1)ES{[yn1xn1θ^(w)]2[yn1xn1θ^n(w)]2}+ES{[ynxnθ^(w)]2},\displaystyle=(n-1)E_{S}\Big{\{}[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(w)]^{2}-[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(w)]^{2}\Big{\}}+E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(w)]^{2}\Big{\}},

where

w^n=argminwRMC(w,Sn),\hat{w}^{-n}=argmin_{w\in R^{M}}C(w,S^{-n}),

and

C(w,Sn)=i=1n1[yixiθ^n(w)]2+2σ2wκ+λn1ww.C(w,S^{-n})=\sum_{i=1}^{n-1}[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-n}(w)]^{2}+2\sigma^{2}w^{{}^{\prime}}\kappa+\lambda_{n-1}w^{{}^{\prime}}w.

It follows from Lemmas 1, 2 and 4 that

θ^(w^)22\displaystyle\|\hat{\theta}(\hat{w})\|_{2}^{2}
=m=1Mt=1Mw^mw^tθ^mπmπtθ^t\displaystyle=\sum_{m=1}^{M}\sum_{t=1}^{M}\hat{w}_{m}\hat{w}_{t}\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}
m=1Mt=1M|w^mw^t||θ^mπmπtθ^t|\displaystyle\leq\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}_{m}\hat{w}_{t}||\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}|
max1mMmax1tM|θ^mπmπtθ^t|m=1Mt=1M|w^mw^t|\displaystyle\leq\max_{1\leq m\leq M}\max_{1\leq t\leq M}|\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}|\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}_{m}\hat{w}_{t}|
max1mMmax1tMθ^m2θ^t2(m=1M|w^m|)2\displaystyle\leq\max_{1\leq m\leq M}\max_{1\leq t\leq M}\|\hat{\theta}_{m}\|_{2}\|\hat{\theta}_{t}\|_{2}\Big{(}\sum_{m=1}^{M}|\hat{w}_{m}|\Big{)}^{2}
=Mmax1mMθ^m22m=1Mw^m2\displaystyle=M\max_{1\leq m\leq M}\|\hat{\theta}_{m}\|_{2}^{2}\sum_{m=1}^{M}\hat{w}_{m}^{2}
B1B2KM2(1+n2K2),a.s.,\displaystyle\leq B_{1}B_{2}KM^{2}(1+n^{-2}K^{2}),a.s.,

and

ES[θ^(w^)θ^n(w^)22]\displaystyle E_{S}[\|\hat{\theta}(\hat{w})-\hat{\theta}^{-n}(\hat{w})\|_{2}^{2}]
=ES[m=1Mt=1Mw^mw^t(θ^mθ^mn)πmπt(θ^tθ^tn)]\displaystyle=E_{S}\Big{[}\sum_{m=1}^{M}\sum_{t=1}^{M}\hat{w}_{m}\hat{w}_{t}(\hat{\theta}_{m}-\hat{\theta}_{m}^{-n})^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}(\hat{\theta}_{t}-\hat{\theta}_{t}^{-n})\Big{]}
=ES(m=1Mt=1M|w^mw^t||(θ^mθ^mn)πmπt(θ^tθ^tn)|)\displaystyle=E_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}_{m}\hat{w}_{t}||(\hat{\theta}_{m}-\hat{\theta}_{m}^{-n})^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}(\hat{\theta}_{t}-\hat{\theta}_{t}^{-n})|\Big{)}
ES(m=1Mt=1M|w^mw^t|θ^mθ^mn2θ^tθ^tn2)\displaystyle\leq E_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}_{m}\hat{w}_{t}|\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}\|\hat{\theta}_{t}-\hat{\theta}_{t}^{-n}\|_{2}\Big{)}
MES(max1mMmax1tMθ^mθ^mn2θ^tθ^tn2m=1Mw^m2)\displaystyle\leq ME_{S}\Big{(}\max_{1\leq m\leq M}\max_{1\leq t\leq M}\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}\|\hat{\theta}_{t}-\hat{\theta}_{t}^{-n}\|_{2}\sum_{m=1}^{M}\hat{w}_{m}^{2}\Big{)}
B2M2(1+n2K2)ES(max1mMmax1tMθ^mθ^mn2θ^tθ^tn2)\displaystyle\leq B_{2}M^{2}(1+n^{-2}K^{2})E_{S}\Big{(}\max_{1\leq m\leq M}\max_{1\leq t\leq M}\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}\|\hat{\theta}_{t}-\hat{\theta}_{t}^{-n}\|_{2}\Big{)}
=B2M2(1+n2K2)ES(max1mMθ^mθ^mn22)\displaystyle=B_{2}M^{2}(1+n^{-2}K^{2})E_{S}\Big{(}\max_{1\leq m\leq M}\|\hat{\theta}_{m}-\hat{\theta}_{m}^{-n}\|_{2}^{2}\Big{)}
C8B2n2K3M2(1+n2K2).\displaystyle\leq C_{8}B_{2}n^{-2}K^{3}M^{2}(1+n^{-2}K^{2}).

In a similar way, we obtain

θ^n(w^)22B1B2KM2(1+n2K2),a.s..\|\hat{\theta}^{-n}(\hat{w})\|_{2}^{2}\leq B_{1}B_{2}KM^{2}(1+n^{-2}K^{2}),a.s..

Further, from Assumption 2, we have

ES{xn1[2yn1xn1θ^(w^)xn1θ^n(w^)]22}\displaystyle E_{S}\Big{\{}\|x_{n-1}[2y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]\|_{2}^{2}\Big{\}}
=k=1KES{xn1k2[2yn1xn1θ^(w^)xn1θ^n(w^)]2}\displaystyle=\sum_{k=1}^{K}E_{S}\Big{\{}x_{{n-1}k}^{2}[2y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]^{2}\Big{\}}
C42KES{[2yn1xn1θ^(w^)xn1θ^n(w^)]2}\displaystyle\leq C_{4}^{2}KE_{S}\Big{\{}[2y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]^{2}\Big{\}}
C42KES{6yn12+3[xn1θ^(w^)]2+3[xn1θ^n(w^)]2}\displaystyle\leq C_{4}^{2}KE_{S}\Big{\{}6y_{n-1}^{2}+3[x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})]^{2}+3[x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]^{2}\Big{\}}
C42KES[6yn12+3xn122θ^(w^)22+3xn122θ^n(w^)22]\displaystyle\leq C_{4}^{2}KE_{S}[6y_{n-1}^{2}+3\|x_{n-1}\|_{2}^{2}\|\hat{\theta}(\hat{w})\|_{2}^{2}+3\|x_{n-1}\|_{2}^{2}\|\hat{\theta}^{-n}(\hat{w})\|_{2}^{2}]
6C3C42K+6C44B1B2K3M2(1+n2K2),\displaystyle\leq 6C_{3}C_{4}^{2}K+6C_{4}^{4}B_{1}B_{2}K^{3}M^{2}(1+n^{-2}K^{2}),

and

ES{[ynxnθ^(w^)]2}2C3+2C42B1B2K2M2(1+n2K2).\displaystyle E_{S}\{[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]^{2}\}\leq 2C_{3}+2C_{4}^{2}B_{1}B_{2}K^{2}M^{2}(1+n^{-2}K^{2}).

These arguments indicate that

\displaystyle\Big{|}E_{S}\Big{\{}[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})]^{2}-[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]^{2}\Big{\}}\Big{|}
\displaystyle=\Big{|}E_{S}\Big{\{}[2y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})][x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]\Big{\}}\Big{|}
\displaystyle\leq\sqrt{E_{S}\big{[}\|x_{n-1}[2y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}(\hat{w})-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w})]\|_{2}^{2}\big{]}}\sqrt{E_{S}\big{[}\|\hat{\theta}(\hat{w})-\hat{\theta}^{-n}(\hat{w})\|_{2}^{2}\big{]}}
\displaystyle=O[n^{-1}K^{3}M^{2}(1+n^{-2}K^{2})],

and

|ES[C(w^,Sn)C(w^,S)(λnλn1)w^w^]|=O[K3M2(1+n2K2)].|E_{S}[C(\hat{w},S^{-n})-C(\hat{w},S)-(\lambda_{n}-\lambda_{n-1})\hat{w}^{{}^{\prime}}\hat{w}]|=O[K^{3}M^{2}(1+n^{-2}K^{2})].

Similarly, we obtain

|ES[C(w^n,S)C(w^n,Sn)(λnλn1)w^nw^n]|=O[K3M2(1+n2K2)].|E_{S}[C(\hat{w}^{-n},S)-C(\hat{w}^{-n},S^{-n})-(\lambda_{n}-\lambda_{n-1})\hat{w}^{-n^{\prime}}\hat{w}^{-n}]|=O[K^{3}M^{2}(1+n^{-2}K^{2})].

Thus, it follows from Assumption 4 and Lemma 4 that

ES[C(w^n,S)C(w^,S)]=O[K3M2(1+n2K2)].E_{S}[C(\hat{w}^{-n},S)-C(\hat{w},S)]=O[K^{3}M^{2}(1+n^{-2}K^{2})].

On the other hand, we notice that

J(w¯n,S)J(w¯,S)\displaystyle J(\bar{w}^{-n},S)-J(\bar{w},S)
=J(w¯n,S)J(w¯n,Sn)+J(w¯n,Sn)J(w¯,S)\displaystyle=J(\bar{w}^{-n},S)-J(\bar{w}^{-n},S^{-n})+J(\bar{w}^{-n},S^{-n})-J(\bar{w},S)
J(w¯n,S)J(w¯n,Sn)+J(w¯,Sn)J(w¯,S),\displaystyle\leq J(\bar{w}^{-n},S)-J(\bar{w}^{-n},S^{-n})+J(\bar{w},S^{-n})-J(\bar{w},S),

and

ES[J(w,S)J(w,Sn)(λnλn1)ww]\displaystyle E_{S}[J(w,S)-J(w,S^{-n})-(\lambda_{n}-\lambda_{n-1})w^{{}^{\prime}}w]
=i=1n1ES{[yixiθ^i(w)]2[yixiθ^(i,n)(w)]2}+ES{[ynxnθ^n(w)]2}\displaystyle=\sum_{i=1}^{n-1}E_{S}\Big{\{}[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-i}(w)]^{2}-[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-(i,n)}(w)]^{2}\Big{\}}+E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(w)]^{2}\Big{\}}
=(n1)ES{[yn1xn1θ^(n1)(w)]2[yn1xn1θ^(n1,n)(w)]2}\displaystyle=(n-1)E_{S}\Big{\{}[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-(n-1)}(w)]^{2}-[y_{n-1}-x_{n-1}^{{}^{\prime}}\hat{\theta}^{-(n-1,n)}(w)]^{2}\Big{\}}
+ES{[ynxnθ^n(w)]2},\displaystyle+E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(w)]^{2}\Big{\}},

where

w¯n=argminwRMJ(w,Sn),\bar{w}^{-n}=argmin_{w\in R^{M}}J(w,S^{-n}),

and

J(w,Sn)=i=1n1[yixiθ^(i,n)(w)]2+λn1ww.J(w,S^{-n})=\sum_{i=1}^{n-1}[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-(i,n)}(w)]^{2}+\lambda_{n-1}w^{{}^{\prime}}w.

So, we obtain

|ES[J(w¯,Sn)J(w¯,S)(λnλn1)w¯w¯]|=O(K3M2),|E_{S}[J(\bar{w},S^{-n})-J(\bar{w},S)-(\lambda_{n}-\lambda_{n-1})\bar{w}^{{}^{\prime}}\bar{w}]|=O(K^{3}M^{2}),

and

|ES[J(w¯n,S)J(w¯n,Sn)(λnλn1)w¯nw¯n]|=O(K3M2).|E_{S}[J(\bar{w}^{-n},S)-J(\bar{w}^{-n},S^{-n})-(\lambda_{n}-\lambda_{n-1})\bar{w}^{-n^{\prime}}\bar{w}^{-n}]|=O(K^{3}M^{2}).

Further, we have

ES[J(w¯n,S)J(w¯,S)]=O(K3M2).E_{S}[J(\bar{w}^{-n},S)-J(\bar{w},S)]=O(K^{3}M^{2}).

Similarly, from

ES[F(w^,n,S)F(w^,S)]\displaystyle E_{S}[F(\hat{w}^{*,-n},S)-F(\hat{w}^{*},S)]
=ES[F(w^,n,S)F(w^,n,Sn)+F(w^,n,Sn)F(w^,S)]\displaystyle=E_{S}[F(\hat{w}^{*,-n},S)-F(\hat{w}^{*,-n},S^{-n})+F(\hat{w}^{*,-n},S^{-n})-F(\hat{w}^{*},S)]
ES[F(w^,n,S)F(w^,n,Sn)+F(w^,Sn)F(w^,S)],\displaystyle\leq E_{S}[F(\hat{w}^{*,-n},S)-F(\hat{w}^{*,-n},S^{-n})+F(\hat{w}^{*},S^{-n})-F(\hat{w}^{*},S)],

and

ES[F(w,Sn)F(w,S)]=ES,z{[yxθ^n(w)]2[yxθ^(w)]2},\displaystyle E_{S}[F(w,S^{-n})-F(w,S)]=E_{S,z^{*}}\Big{\{}[y^{*}-x^{*^{\prime}}\hat{\theta}^{-n}(w)]^{2}-[y^{*}-x^{*^{\prime}}\hat{\theta}(w)]^{2}\Big{\}},

where

w^,n=argminwRMF(w,Sn),\hat{w}^{*,-n}=argmin_{w\in R^{M}}F(w,S^{-n}),

and

F(w,Sn)=Ez{[yxθ^n(w)]2},F(w,S^{-n})=E_{z^{*}}\{[y^{*}-x^{*^{\prime}}\hat{\theta}^{-n}(w)]^{2}\},

we have

ES[F(w^,n,S)F(w^,S)]=O(n1K3M3).E_{S}[F(\hat{w}^{*,-n},S)-F(\hat{w}^{*},S)]=O(n^{-1}K^{3}M^{3}).

Lemma 6.

Under Assumptions 1 - 4, we have

ES[w^w^n22]=O[n1K3M2(1+n2K2)],E_{S}[\|\hat{w}-\hat{w}^{-n}\|_{2}^{2}]=O[n^{-1}K^{3}M^{2}(1+n^{-2}K^{2})],
ES[w¯w¯n22]=O(n1K3M2),E_{S}[\|\bar{w}-\bar{w}^{-n}\|_{2}^{2}]=O(n^{-1}K^{3}M^{2}),

and

ES[w^w^,n22]=O(n1K3M3).E_{S}[\|\hat{w}^{*}-\hat{w}^{*,-n}\|_{2}^{2}]=O(n^{-1}K^{3}M^{3}).

Proof.

It follows from the definition of w^\hat{w} that C(w^,S)w=𝟎M\frac{\partial C(\hat{w},S)}{\partial w}={\bf{0}}_{M}. So, we have

C(w^n,S)C(w^,S)\displaystyle C(\hat{w}^{-n},S)-C(\hat{w},S)
=(w^nw^)C(w^,S)w+(w^nw^)(Ω^Ω^+λnIn)(w^nw^)\displaystyle=(\hat{w}^{-n}-\hat{w})^{{}^{\prime}}\frac{\partial C(\hat{w},S)}{\partial w}+(\hat{w}^{-n}-\hat{w})^{{}^{\prime}}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})(\hat{w}^{-n}-\hat{w})
=(w^nw^)(Ω^Ω^+λnIn)(w^nw^)\displaystyle=(\hat{w}^{-n}-\hat{w})^{{}^{\prime}}(\hat{\Omega}^{{}^{\prime}}\hat{\Omega}+\lambda_{n}I_{n})(\hat{w}^{-n}-\hat{w})
C5nw^nw^22,a.s..\displaystyle\geq C_{5}n\|\hat{w}^{-n}-\hat{w}\|_{2}^{2},a.s..

Further, it follows from Lemma 5 that

ES(w^nw^22)=O[n1K3M2(1+n2K2)].E_{S}(\|\hat{w}^{-n}-\hat{w}\|_{2}^{2})=O[n^{-1}K^{3}M^{2}(1+n^{-2}K^{2})].

On the other hand, it follows from the definition of w¯\bar{w} that J(w¯,S)w=𝟎M\frac{\partial J(\bar{w},S)}{\partial w}={\bf{0}}_{M}. So, we have

J(w¯n,S)J(w¯,S)\displaystyle J(\bar{w}^{-n},S)-J(\bar{w},S)
=(w¯nw¯)J(w¯,S)w+(w¯nw¯)(Ω¯Ω¯+λnIn)(w¯nw¯)\displaystyle=(\bar{w}^{-n}-\bar{w})^{{}^{\prime}}\frac{\partial J(\bar{w},S)}{\partial w}+(\bar{w}^{-n}-\bar{w})^{{}^{\prime}}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega}+\lambda_{n}I_{n})(\bar{w}^{-n}-\bar{w})
=(w¯nw¯)(Ω¯Ω¯+λnIn)(w¯nw¯)\displaystyle=(\bar{w}^{-n}-\bar{w})^{{}^{\prime}}(\bar{\Omega}^{{}^{\prime}}\bar{\Omega}+\lambda_{n}I_{n})(\bar{w}^{-n}-\bar{w})
C5nw¯nw¯22,a.s..\displaystyle\geq C_{5}n\|\bar{w}^{-n}-\bar{w}\|_{2}^{2},a.s..

Further, from Lemma 5, we have

ES(w¯nw¯22)=O(n1K3M2).E_{S}(\|\bar{w}^{-n}-\bar{w}\|_{2}^{2})=O(n^{-1}K^{3}M^{2}).

Similarly, from the definition of w^\hat{w}^{*}, Lemma 5 and

F(w^,n,S)F(w^,S)\displaystyle F(\hat{w}^{*,-n},S)-F(\hat{w}^{*},S)
=(w^,nw^)F(w^,S)w+(w^,nw^)Ez(γ^γ^)(w^,nw^)\displaystyle=(\hat{w}^{*,-n}-\hat{w}^{*})^{{}^{\prime}}\frac{\partial F(\hat{w}^{*},S)}{\partial w}+(\hat{w}^{*,-n}-\hat{w}^{*})^{{}^{\prime}}E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})(\hat{w}^{*,-n}-\hat{w}^{*})
=(w^,nw^)Ez(γ^γ^)(w^,nw^)\displaystyle=(\hat{w}^{*,-n}-\hat{w}^{*})^{{}^{\prime}}E_{z^{*}}(\hat{\gamma}^{{}^{\prime}}\hat{\gamma})(\hat{w}^{*,-n}-\hat{w}^{*})
C5w^,nw^22,a.s.,\displaystyle\geq C_{5}\|\hat{w}^{*,-n}-\hat{w}^{*}\|_{2}^{2},a.s.,

we obtain

ES(w^,nw^22)=O(n1K3M3).E_{S}(\|\hat{w}^{*,-n}-\hat{w}^{*}\|_{2}^{2})=O(n^{-1}K^{3}M^{3}).

Proof of Theorem 3.4: Let (c^1,,c^M)=P^w^0,(\hat{c}_{1},...,\hat{c}_{M})^{{}^{\prime}}=\hat{P}^{{}^{\prime}}\hat{w}^{0}, and (d^1,,d^M)=P^w^(\hat{d}_{1},...,\hat{d}_{M})^{{}^{\prime}}=\hat{P}^{{}^{\prime}}\hat{w}^{*}. Then, we have

M^1(λn)\displaystyle\hat{M}_{1}(\lambda_{n})
=Z^w^0Z^w^22+Z^w^w^22\displaystyle=\|\hat{Z}\hat{w}^{0}-\hat{Z}\hat{w}^{*}\|_{2}^{2}+\|\hat{Z}\hat{w}^{*}-\hat{w}^{*}\|_{2}^{2}
=m=1M(c^md^m)2ζ^m2(λn+ζ^m)2+m=1Md^m2(ζ^mλn+ζ^m1)2\displaystyle=\sum_{m=1}^{M}\frac{(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{2}}+\sum_{m=1}^{M}\hat{d}_{m}^{2}(\frac{\hat{\zeta}_{m}}{\lambda_{n}+\hat{\zeta}_{m}}-1)^{2}
=m=1M(c^md^m)2ζ^m2(λn+ζ^m)2+m=1Md^m2λn2(λn+ζ^m)2,\displaystyle=\sum_{m=1}^{M}\frac{(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{2}}+\sum_{m=1}^{M}\frac{\hat{d}_{m}^{2}\lambda_{n}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{2}},

and

ddλnM^1(λn)\displaystyle\frac{d}{d\lambda_{n}}\hat{M}_{1}(\lambda_{n})
=m=1M2(c^md^m)2ζ^m2(λn+ζ^m)3+m=1M2d^m2λn(λn+ζ^m)2d^m2λn2(λn+ζ^m)3\displaystyle=\sum_{m=1}^{M}\frac{-2(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{3}}+\sum_{m=1}^{M}\frac{2\hat{d}_{m}^{2}\lambda_{n}(\lambda_{n}+\hat{\zeta}_{m})-2\hat{d}_{m}^{2}\lambda_{n}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{3}}
=m=1M2(c^md^m)2ζ^m2(λn+ζ^m)3+m=1M2d^m2λnζ^m(λn+ζ^m)3.\displaystyle=\sum_{m=1}^{M}\frac{-2(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{3}}+\sum_{m=1}^{M}\frac{2\hat{d}_{m}^{2}\lambda_{n}\hat{\zeta}_{m}}{(\lambda_{n}+\hat{\zeta}_{m})^{3}}.

From

m=1M(c^md^m)2ζ^m2(λn+ζ^m)3\displaystyle\sum_{m=1}^{M}\frac{(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{3}}
1λn+ζ^Mm=1M(c^md^m)2ζ^m2(λn+ζ^m)2\displaystyle\geq\frac{1}{\lambda_{n}+\hat{\zeta}_{M}}\sum_{m=1}^{M}\frac{(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(\lambda_{n}+\hat{\zeta}_{m})^{2}}
=1λn+ζ^MZ^(w^0w^)22,\displaystyle=\frac{1}{\lambda_{n}+\hat{\zeta}_{M}}\|\hat{Z}(\hat{w}^{0}-\hat{w}^{*})\|_{2}^{2},

we see that, when \hat{w}^{0}\neq\hat{w}^{*}, \sum_{m=1}^{M}\frac{(\hat{c}_{m}-\hat{d}_{m})^{2}\hat{\zeta}_{m}^{2}}{(0+\hat{\zeta}_{m})^{3}}>0, so that \frac{d}{d\lambda_{n}}\hat{M}_{1}(\lambda_{n})<0 at \lambda_{n}=0 and \hat{M}_{1}(\lambda_{n}) is strictly decreasing in a neighborhood of zero. Hence, we have \hat{\lambda}_{n}>0 and \hat{M}_{1}(\hat{\lambda}_{n})<\hat{M}_{1}(0) when \hat{w}^{0}\neq\hat{w}^{*}.

Let (c¯1,,c¯M)=P¯w¯0,(d¯1,,d¯M)=P¯w^(\bar{c}_{1},...,\bar{c}_{M})^{{}^{\prime}}=\bar{P}^{{}^{\prime}}\bar{w}^{0},(\bar{d}_{1},...,\bar{d}_{M})^{{}^{\prime}}=\bar{P}^{{}^{\prime}}\hat{w}^{*}. Then, we have

M¯1(λn)\displaystyle\bar{M}_{1}(\lambda_{n})
=Z¯w¯0Z¯w^22+Z¯w^w^22\displaystyle=\|\bar{Z}\bar{w}^{0}-\bar{Z}\hat{w}^{*}\|_{2}^{2}+\|\bar{Z}\hat{w}^{*}-\hat{w}^{*}\|_{2}^{2}
=m=1M(c¯md¯m)2ζ¯m2(λn+ζ¯m)2+m=1Md¯m2(ζ¯mλn+ζ¯m1)2\displaystyle=\sum_{m=1}^{M}\frac{(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{2}}+\sum_{m=1}^{M}\bar{d}_{m}^{2}(\frac{\bar{\zeta}_{m}}{\lambda_{n}+\bar{\zeta}_{m}}-1)^{2}
=m=1M(c¯md¯m)2ζ¯m2(λn+ζ¯m)2+m=1Md¯m2λn2(λn+ζ¯m)2,\displaystyle=\sum_{m=1}^{M}\frac{(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{2}}+\sum_{m=1}^{M}\frac{\bar{d}_{m}^{2}\lambda_{n}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{2}},

and

ddλnM¯1(λn)\displaystyle\frac{d}{d\lambda_{n}}\bar{M}_{1}(\lambda_{n})
=m=1M2(c¯md¯m)2ζ¯m2(λn+ζ¯m)3+m=1M2d¯m2λn(λn+ζ¯m)2d¯m2λn2(λn+ζ¯m)3\displaystyle=\sum_{m=1}^{M}\frac{-2(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{3}}+\sum_{m=1}^{M}\frac{2\bar{d}_{m}^{2}\lambda_{n}(\lambda_{n}+\bar{\zeta}_{m})-2\bar{d}_{m}^{2}\lambda_{n}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{3}}
=m=1M2(c¯md¯m)2ζ¯m2(λn+ζ¯m)3+m=1M2d¯m2λnζ¯m(λn+ζ¯m)3.\displaystyle=\sum_{m=1}^{M}\frac{-2(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{3}}+\sum_{m=1}^{M}\frac{2\bar{d}_{m}^{2}\lambda_{n}\bar{\zeta}_{m}}{(\lambda_{n}+\bar{\zeta}_{m})^{3}}.

Similarly, from

m=1M(c¯md¯m)2ζ¯m2(λn+ζ¯m)3\displaystyle\sum_{m=1}^{M}\frac{(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{3}}
1λn+ζ¯Mm=1M(c¯md¯m)2ζ¯m2(λn+ζ¯m)2\displaystyle\geq\frac{1}{\lambda_{n}+\bar{\zeta}_{M}}\sum_{m=1}^{M}\frac{(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(\lambda_{n}+\bar{\zeta}_{m})^{2}}
=1λn+ζ¯MZ¯(w¯0w^)22,\displaystyle=\frac{1}{\lambda_{n}+\bar{\zeta}_{M}}\|\bar{Z}(\bar{w}^{0}-\hat{w}^{*})\|_{2}^{2},

we see that, when \bar{w}^{0}\neq\hat{w}^{*}, \sum_{m=1}^{M}\frac{(\bar{c}_{m}-\bar{d}_{m})^{2}\bar{\zeta}_{m}^{2}}{(0+\bar{\zeta}_{m})^{3}}>0, so that \frac{d}{d\lambda_{n}}\bar{M}_{1}(\lambda_{n})<0 at \lambda_{n}=0 and \bar{M}_{1}(\lambda_{n}) is strictly decreasing in a neighborhood of zero. Hence, we have \bar{\lambda}_{n}>0 and \bar{M}_{1}(\bar{\lambda}_{n})<\bar{M}_{1}(0) when \bar{w}^{0}\neq\hat{w}^{*}.
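The spectral forms of \hat{M}_{1}(\lambda_{n}) and \bar{M}_{1}(\lambda_{n}) derived above make this conclusion easy to visualize. The following toy sketch evaluates \sum_{m}[(c_{m}-d_{m})^{2}\zeta_{m}^{2}+d_{m}^{2}\lambda^{2}]/(\lambda+\zeta_{m})^{2} on a grid for hypothetical eigenvalues \zeta_{m} and coefficients c_{m},d_{m} (chosen purely for illustration) and confirms that the minimum is attained at a strictly positive \lambda whenever c\neq d.

```python
import numpy as np

# hypothetical spectral quantities, for illustration only
zeta = np.array([5.0, 2.0, 0.5])   # eigenvalues zeta_m
c = np.array([1.0, -0.5, 0.3])     # coordinates of the unpenalized weight vector in the eigenbasis
d = np.array([0.6, -0.2, 0.1])     # coordinates of the target weight vector in the eigenbasis

def M1(lam):
    # M1(lambda) = sum_m [(c_m - d_m)^2 zeta_m^2 + d_m^2 lambda^2] / (lambda + zeta_m)^2
    return np.sum(((c - d) ** 2 * zeta ** 2 + d ** 2 * lam ** 2) / (lam + zeta) ** 2)

grid = np.linspace(0.0, 10.0, 2001)
values = np.array([M1(lam) for lam in grid])
lam_star = grid[values.argmin()]
print(lam_star > 0.0, M1(lam_star) < M1(0.0))   # True True: a positive penalty strictly helps
```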

Proof of Theorem 3.5: It follows from Lemma 4 and Assumption 4 that

ES[F^(w^,S)F^(w~,S)]\displaystyle E_{S}[\hat{F}(\hat{w},S)-\hat{F}(\tilde{w},S)]
=ES[F^(w^,S)1nC(w^,S)+1nC(w^,S)F^(w~,S)]\displaystyle=E_{S}[\hat{F}(\hat{w},S)-\frac{1}{n}C(\hat{w},S)+\frac{1}{n}C(\hat{w},S)-\hat{F}(\tilde{w},S)]
ES[F^(w^,S)1nC(w^,S)+1nC(w~,S)F^(w~,S)]\displaystyle\leq E_{S}[\hat{F}(\hat{w},S)-\frac{1}{n}C(\hat{w},S)+\frac{1}{n}C(\tilde{w},S)-\hat{F}(\tilde{w},S)]
=ES(2σ2w~κn+λnw~w~n2σ2w^κnλnw^w^n)\displaystyle=E_{S}\Big{(}\frac{2\sigma^{2}\tilde{w}^{{}^{\prime}}\kappa}{n}+\frac{\lambda_{n}\tilde{w}^{{}^{\prime}}\tilde{w}}{n}-\frac{2\sigma^{2}\hat{w}^{{}^{\prime}}\kappa}{n}-\frac{\lambda_{n}\hat{w}^{{}^{\prime}}\hat{w}}{n}\Big{)}
4σ2B212KM(1+n2K2)12n+2B2λnM(1+n2K2)n\displaystyle\leq\frac{4\sigma^{2}B_{2}^{\frac{1}{2}}KM(1+n^{-2}K^{2})^{\frac{1}{2}}}{n}+\frac{2B_{2}\lambda_{n}M(1+n^{-2}K^{2})}{n}
=O[n1lognKM2(1+n2K2)].\displaystyle=O[n^{-1}\log nKM^{2}(1+n^{-2}K^{2})].

Further, from the proof of Lemma 5,

ES[F^(w¯,S)F^(w~,S)]\displaystyle E_{S}[\hat{F}(\bar{w},S)-\hat{F}(\tilde{w},S)]
=ES[F^(w¯,S)1nJ(w¯,S)+1nJ(w¯,S)F^(w~,S)]\displaystyle=E_{S}[\hat{F}(\bar{w},S)-\frac{1}{n}J(\bar{w},S)+\frac{1}{n}J(\bar{w},S)-\hat{F}(\tilde{w},S)]
ES[F^(w¯,S)1nJ(w¯,S)+1nJ(w~,S)F^(w~,S)],\displaystyle\leq E_{S}[\hat{F}(\bar{w},S)-\frac{1}{n}J(\bar{w},S)+\frac{1}{n}J(\tilde{w},S)-\hat{F}(\tilde{w},S)],

and

ES[F^(w¯,S)1nJ(w¯,S)+λnw¯w¯n]\displaystyle E_{S}[\hat{F}(\bar{w},S)-\frac{1}{n}J(\bar{w},S)+\frac{\lambda_{n}\bar{w}^{{}^{\prime}}\bar{w}}{n}]
=1ni=1nES{[yixiθ^(w¯)]2[yixiθ^i(w¯)]2}\displaystyle=\frac{1}{n}\sum_{i=1}^{n}E_{S}\Big{\{}[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}(\bar{w})]^{2}-[y_{i}-x_{i}^{{}^{\prime}}\hat{\theta}^{-i}(\bar{w})]^{2}\Big{\}}
=ES{[ynxnθ^(w¯)]2[ynxnθ^n(w¯)]2},\displaystyle=E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w})]^{2}\Big{\}},

we have

ES{[ynxnθ^(w¯)]2[ynxnθ^n(w¯)]2}=O(n1K3M2),E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w})]^{2}\Big{\}}=O(n^{-1}K^{3}M^{2}),

that is

ES[F^(w¯,S)1nJS(w¯)]=O(n1lognK3M2).E_{S}[\hat{F}(\bar{w},S)-\frac{1}{n}J_{S}(\bar{w})]=O(n^{-1}\log nK^{3}M^{2}).

In a similar way, we obtain

ES[1nJS(w~)F^(w~,S)]=O(n1lognK3M2).\displaystyle E_{S}[\frac{1}{n}J_{S}(\tilde{w})-\hat{F}(\tilde{w},S)]=O(n^{-1}\log nK^{3}M^{2}).

So, we have

ES[F^(w¯,S)F^(w~,S)]=O(n1lognK3M2).E_{S}[\hat{F}(\bar{w},S)-\hat{F}(\tilde{w},S)]=O(n^{-1}\log nK^{3}M^{2}).

Proof of Theorem 3.6: It follows from Lemmas 1 and 6 that

ES𝒟n(θ^(w^n)θ^(w^)22)\displaystyle E_{S\sim\mathcal{D}^{n}}(\|\hat{\theta}(\hat{w}^{-n})-\hat{\theta}(\hat{w})\|_{2}^{2})
=ES(m=1M(w^mnw^m)πmθ^m22)\displaystyle=E_{S}\Big{(}\Big{\|}\sum_{m=1}^{M}(\hat{w}^{-n}_{m}-\hat{w}_{m})\pi_{m}\hat{\theta}_{m}\Big{\|}_{2}^{2}\Big{)}
ES[m=1Mt=1M|(w^mnw^m)(w^tnw^t)θ^mπmπtθ^t|]\displaystyle\leq E_{S}\Big{[}\sum_{m=1}^{M}\sum_{t=1}^{M}|(\hat{w}^{-n}_{m}-\hat{w}_{m})(\hat{w}^{-n}_{t}-\hat{w}_{t})\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}|\Big{]}
ES(m=1Mt=1M|w^mnw^m||w^tnw^t|θ^m2θ^t2)\displaystyle\leq E_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}^{-n}_{m}-\hat{w}_{m}||\hat{w}^{-n}_{t}-\hat{w}_{t}|\|\hat{\theta}_{m}\|_{2}\|\hat{\theta}_{t}\|_{2}\Big{)}
B1KES(m=1Mt=1M|w^mnw^m||w^tnw^t|)\displaystyle\leq B_{1}KE_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}^{-n}_{m}-\hat{w}_{m}||\hat{w}^{-n}_{t}-\hat{w}_{t}|\Big{)}
=B1KES[(m=1M|w^mnw^m|)2]\displaystyle=B_{1}KE_{S}\Big{[}\Big{(}\sum_{m=1}^{M}|\hat{w}^{-n}_{m}-\hat{w}_{m}|\Big{)}^{2}\Big{]}
B1MKES(m=1M|w^mnw^m|2)\displaystyle\leq B_{1}MKE_{S}\Big{(}\sum_{m=1}^{M}|\hat{w}^{-n}_{m}-\hat{w}_{m}|^{2}\Big{)}
=B1MKES[w^nw^22]\displaystyle=B_{1}MKE_{S}[\|\hat{w}^{-n}-\hat{w}\|_{2}^{2}]
=O[n1K4M3(1+n2K2)].\displaystyle=O[n^{-1}K^{4}M^{3}(1+n^{-2}K^{2})].

Further, it follows from the proof of Lemma 5 that

|ES{[2ynxnθ^n(w^n)xnθ^(w^)][xnθ^n(w^n)xnθ^(w^n)]}|=O[n1K3M2(1+n2K2)],\displaystyle\Big{|}E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{-n})]\Big{\}}\Big{|}=O[n^{-1}K^{3}M^{2}(1+n^{-2}K^{2})],

and

|ES{[2ynxnθ^n(w^n)xnθ^(w^)][xnθ^(w^n)xnθ^(w^)]}|=O[n12K72M52(1+n2K2)].|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})][x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]\Big{\}}|=O[n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}(1+n^{-2}K^{2})].

From

|ES{[ynxnθ^n(w^n)]2[ynxnθ^(w^)]2}|\displaystyle|E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]^{2}\Big{\}}|
=|ES{[2ynxnθ^n(w^n)xnθ^(w^)][xnθ^n(w^n)xnθ^(w^)]}|\displaystyle=|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]\Big{\}}|
|ES{[2ynxnθ^n(w^n)xnθ^(w^)][xnθ^n(w^n)xnθ^(w^n)]}|\displaystyle\leq|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{-n})]\Big{\}}|
+|ES{[2ynxnθ^n(w^n)xnθ^(w^)][xnθ^(w^n)xnθ^(w^)]}|,\displaystyle\ \ \ +|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})][x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]\Big{\}}|,

we have

ES{[ynxnθ^n(w^n)]2[ynxnθ^(w^)]2}=O[n12K72M52(1+n2K2)].E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w})]^{2}\Big{\}}=O[n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}(1+n^{-2}K^{2})].

In a similar way, we obtain

ES,z{[yxθ^n(w^n)]2[yxθ^(w^)]2}=O[n12K72M52(1+n2K2)].E_{S,z^{*}}\Big{\{}[y^{*}-x^{*^{\prime}}\hat{\theta}^{-n}(\hat{w}^{-n})]^{2}-[y^{*}-x^{*^{\prime}}\hat{\theta}(\hat{w})]^{2}\Big{\}}=O[n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}(1+n^{-2}K^{2})].

On the other hand, it follows from Lemmas 1 and 6 that

ES𝒟n[θ^(w¯n)θ^(w¯)22]\displaystyle E_{S\sim\mathcal{D}^{n}}[\|\hat{\theta}(\bar{w}^{-n})-\hat{\theta}(\bar{w})\|_{2}^{2}]
=ES[m=1M(w¯mnw¯m)πmθ^m22]\displaystyle=E_{S}\Big{[}\Big{\|}\sum_{m=1}^{M}(\bar{w}^{-n}_{m}-\bar{w}_{m})\pi_{m}\hat{\theta}_{m}\Big{\|}_{2}^{2}\Big{]}
ES[m=1Mt=1M|(w¯mnw¯m)(w¯tnw¯t)θ^mπmπtθ^t|]\displaystyle\leq E_{S}\Big{[}\sum_{m=1}^{M}\sum_{t=1}^{M}|(\bar{w}^{-n}_{m}-\bar{w}_{m})(\bar{w}^{-n}_{t}-\bar{w}_{t})\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}|\Big{]}
ES(m=1Mt=1M|w¯mnw¯m||w¯tnw¯t|θ^m2θ^t2)\displaystyle\leq E_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\bar{w}^{-n}_{m}-\bar{w}_{m}||\bar{w}^{-n}_{t}-\bar{w}_{t}|\|\hat{\theta}_{m}\|_{2}\|\hat{\theta}_{t}\|_{2}\Big{)}
B1KES(m=1Mt=1M|w¯mnw¯m||w¯tnw¯t|)\displaystyle\leq B_{1}KE_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\bar{w}^{-n}_{m}-\bar{w}_{m}||\bar{w}^{-n}_{t}-\bar{w}_{t}|\Big{)}
=B1KES[(m=1M|w¯mnw¯m|)2]\displaystyle=B_{1}KE_{S}\Big{[}\Big{(}\sum_{m=1}^{M}|\bar{w}^{-n}_{m}-\bar{w}_{m}|\Big{)}^{2}\Big{]}
B1MKES(m=1M|w¯mnw¯m|2)\displaystyle\leq B_{1}MKE_{S}\Big{(}\sum_{m=1}^{M}|\bar{w}^{-n}_{m}-\bar{w}_{m}|^{2}\Big{)}
=B1MKES(w¯nw¯22)\displaystyle=B_{1}MKE_{S}(\|\bar{w}^{-n}-\bar{w}\|_{2}^{2})
=O(n1K4M3).\displaystyle=O(n^{-1}K^{4}M^{3}).

So, from the proof of Lemma 5, we have

|ES{[2ynxnθ^n(w¯n)xnθ^(w¯)][xnθ^n(w¯n)xnθ^(w¯n)]}|=O(n1K3M2),\displaystyle|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w}^{-n})]\Big{\}}|=O(n^{-1}K^{3}M^{2}),

and

|ES{[2ynxnθ^n(w¯n)xnθ^(w¯)][xnθ^(w¯n)xnθ^(w¯)]}|=O(n12K72M52).|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})][x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]\Big{\}}|=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}).

From

|ES{[ynxnθ^n(w¯n)]2[ynxnθ^(w¯)]2}|\displaystyle|E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]^{2}\Big{\}}|
=|ES{[2ynxnθ^n(w¯n)xnθ^(w¯)][xnθ^n(w¯n)xnθ^(w¯)]}|\displaystyle=|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]\Big{\}}|
|ES{[2ynxnθ^n(w¯n)xnθ^(w¯)][xnθ^n(w¯n)xnθ^(w¯n)]}|\displaystyle\leq|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w}^{-n})]\Big{\}}|
+|ES{[2ynxnθ^n(w¯n)xnθ^(w¯)][xnθ^(w¯n)xnθ^(w¯)]}|,\displaystyle+|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})][x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w}^{-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]\Big{\}}|,

we know that

ES{[ynxnθ^n(w¯n)]2[ynxnθ^(w¯)]2}=O(n12K72M52).E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\bar{w})]^{2}\Big{\}}=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}).

In a similar way, we obtain

ES,z{[yxθ^n(w¯n)]2[yxθ^(w¯)]2}=O(n12K72M52).E_{S,z^{*}}\Big{\{}[y^{*}-x^{*^{\prime}}\hat{\theta}^{-n}(\bar{w}^{-n})]^{2}-[y^{*}-x^{*^{\prime}}\hat{\theta}(\bar{w})]^{2}\Big{\}}=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{5}{2}}).

Similarly, it follows from Lemmas 1 and 6 that

ES𝒟n[θ^(w^,n)θ^(w^)22]\displaystyle E_{S\sim\mathcal{D}^{n}}[\|\hat{\theta}(\hat{w}^{*,-n})-\hat{\theta}(\hat{w}^{*})\|_{2}^{2}]
=ES[m=1M(w^m,nw^m)πmθ^m22]\displaystyle=E_{S}\Big{[}\Big{\|}\sum_{m=1}^{M}(\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m})\pi_{m}\hat{\theta}_{m}\Big{\|}_{2}^{2}\Big{]}
ES[m=1Mt=1M|(w^m,nw^m)(w^t,nw^t)θ^mπmπtθ^t|]\displaystyle\leq E_{S}\Big{[}\sum_{m=1}^{M}\sum_{t=1}^{M}|(\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m})(\hat{w}^{*,-n}_{t}-\hat{w}^{*}_{t})\hat{\theta}_{m}^{{}^{\prime}}\pi_{m}^{{}^{\prime}}\pi_{t}\hat{\theta}_{t}|\Big{]}
ES[m=1Mt=1M|w^m,nw^m||w^t,nw^t|θ^m2θ^t2]\displaystyle\leq E_{S}[\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m}||\hat{w}^{*,-n}_{t}-\hat{w}^{*}_{t}|\|\hat{\theta}_{m}\|_{2}\|\hat{\theta}_{t}\|_{2}]
B1KES(m=1Mt=1M|w^m,nw^m||w^t,nw^t|)\displaystyle\leq B_{1}KE_{S}\Big{(}\sum_{m=1}^{M}\sum_{t=1}^{M}|\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m}||\hat{w}^{*,-n}_{t}-\hat{w}^{*}_{t}|\Big{)}
=B1KES[(m=1M|w^m,nw^m|)2]\displaystyle=B_{1}KE_{S}\Big{[}\Big{(}\sum_{m=1}^{M}|\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m}|\Big{)}^{2}\Big{]}
B1MKES(m=1M|w^m,nw^m|2)\displaystyle\leq B_{1}MKE_{S}\Big{(}\sum_{m=1}^{M}|\hat{w}^{*,-n}_{m}-\hat{w}^{*}_{m}|^{2}\Big{)}
=B1MKES(w^,nw^22)\displaystyle=B_{1}MKE_{S}(\|\hat{w}^{*,-n}-\hat{w}^{*}\|_{2}^{2})
=O(n1K4M4).\displaystyle=O(n^{-1}K^{4}M^{4}).

Thus, from the proof of Lemma 5, we have

|ES{[2ynxnθ^n(w^,n)xnθ^(w^)][xnθ^n(w^,n)xnθ^(w^,n)]}|=O(n1K3M3),\displaystyle|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*,-n})]\Big{\}}|=O(n^{-1}K^{3}M^{3}),

and

|ES{[2ynxnθ^n(w^,n)xnθ^(w^)][xnθ^(w^,n)xnθ^(w^)]}|=O(n12K72M72).|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})][x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})]\Big{\}}|=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}).

From

|ES{[ynxnθ^n(w^,n)]2[ynxnθ^(w^)]2}|\displaystyle|E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})]^{2}\Big{\}}|
=|ES{[2ynxnθ^n(w^,n)xnθ^(w^)][xnθ^n(w^,n)xnθ^(w^)]}|\displaystyle=|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})]\Big{\}}|
|ES{(2ynxnθ^n(w^,n)xnθ^(w^)][xnθ^n(w^,n)xnθ^(w^,n)]}|\displaystyle\leq|E_{S}\Big{\{}(2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})][x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*,-n})]\Big{\}}|
+|ES{[2ynxnθ^n(w^,n)xnθ^(w^)][xnθ^(w^,n)xnθ^(w^)]}|,\displaystyle+|E_{S}\Big{\{}[2y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})][x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*,-n})-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})]\Big{\}}|,

we know that

ES{[ynxnθ^n(w^,n)]2[ynxnθ^(w^)]2}=O(n12K72M72).E_{S}\Big{\{}[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})]^{2}-[y_{n}-x_{n}^{{}^{\prime}}\hat{\theta}(\hat{w}^{*})]^{2}\Big{\}}=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}).

In a similar way, we obtain

ES,z{[yxθ^n(w^,n)]2[yxθ^(w^)]2}=O(n12K72M72).E_{S,z^{*}}\Big{\{}[y^{*}-x^{*^{\prime}}\hat{\theta}^{-n}(\hat{w}^{*,-n})]^{2}-[y^{*}-x^{*^{\prime}}\hat{\theta}(\hat{w}^{*})]^{2}\Big{\}}=O(n^{-\frac{1}{2}}K^{\frac{7}{2}}M^{\frac{7}{2}}).

References

  • Akaike, H., 1998. Information theory and an extension of the maximum likelihood principle, in: Selected Papers of Hirotugu Akaike, pp. 199–213.
  • Bousquet, O., Elisseeff, A., 2002. Stability and generalization. The Journal of Machine Learning Research 2, 499–526.
  • Buckland, S.T., Burnham, K.P., Augustin, N.H., 1997. Model selection: An integral part of inference. Biometrics 53, 603–618.
  • Dufour, J.M., 1982. Recursive stability analysis of linear regression relationships: An exploratory methodology. Journal of Econometrics 19, 31–76.
  • Fragoso, T.M., Bertoli, W., Louzada, F., 2018. Bayesian model averaging: A systematic review and conceptual classification. International Statistical Review 86, 1–28.
  • Gao, Y., Zhang, X., Wang, S., Zou, G., 2016. Model averaging based on leave-subject-out cross-validation. Journal of Econometrics 192, 139–151.
  • Hansen, B.E., 2007. Least squares model averaging. Econometrica 75, 1175–1189.
  • Hansen, B.E., 2008. Least-squares forecast averaging. Journal of Econometrics 146, 342–350.
  • Hansen, B.E., Racine, J.S., 2012. Jackknife model averaging. Journal of Econometrics 167, 38–46.
  • Hjort, N.L., Claeskens, G., 2003. Frequentist model average estimators. Journal of the American Statistical Association 98, 879–899.
  • Hoerl, A.E., Kennard, R.W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12, 55–67.
  • Kutin, S., Niyogi, P., 2002. Almost-everywhere algorithmic stability and generalization error, in: Proceedings of the 18th Conference on Uncertainty in Artificial Intelligence, pp. 275–282.
  • Li, X., Zou, G., Zhang, X., Zhao, S., 2021. Least squares model averaging based on generalized cross validation. Acta Mathematicae Applicatae Sinica, English Series 37, 495–509.
  • Liang, H., Zou, G., Wan, A.T., Zhang, X., 2011. Optimal weight choice for frequentist model average estimators. Journal of the American Statistical Association 106, 1053–1066.
  • Liao, J., Zou, G., 2020. Corrected Mallows criterion for model averaging. Computational Statistics and Data Analysis 144, 106902.
  • Liu, Q., Okui, R., 2013. Heteroskedasticity-robust C_{p} model averaging. The Econometrics Journal 16, 463–472.
  • Mallows, C.L., 1973. Some comments on C_{p}. Technometrics 15, 661–675.
  • Moral-Benito, E., 2015. Model averaging in economics: An overview. Journal of Economic Surveys 29, 46–75.
  • Mukherjee, S., Niyogi, P., Poggio, T., Rifkin, R., 2006. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics 25, 161–193.
  • Rakhlin, A., Mukherjee, S., Poggio, T., 2005. Stability results in learning theory. Analysis and Applications 3, 397–417.
  • Schomaker, M., 2012. Shrinkage averaging estimation. Statistical Papers 53, 1015–1034.
  • Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6, 461–464.
  • Shalev-Shwartz, S., Shamir, O., Srebro, N., Sridharan, K., 2010. Learnability, stability and uniform convergence. The Journal of Machine Learning Research 11, 2635–2670.
  • Tong, H., Wu, Q., 2017. Learning performance of regularized moving least square regression. Journal of Computational and Applied Mathematics 325, 42–55.
  • Vapnik, V., 1998. Statistical Learning Theory. Wiley.
  • Wan, A.T., Zhang, X., 2009. On the use of model averaging in tourism research. Annals of Tourism Research 36, 525–532.
  • Wan, A.T., Zhang, X., Zou, G., 2010. Least squares model averaging by Mallows criterion. Journal of Econometrics 156, 277–283.
  • Wang, Y.X., Lei, J., Fienberg, S.E., 2016. Learning with differential privacy: Stability, learnability and the sufficiency and necessity of ERM principle. The Journal of Machine Learning Research 17, 6353–6392.
  • Wooldridge, J.M., 2003. Introductory Econometrics. Thompson South-Western.
  • Xu, G., Wang, S., Huang, J.Z., 2014. Focused information criterion and model averaging based on weighted composite quantile regression. Scandinavian Journal of Statistics 41, 365–381.
  • Yang, Y., 2001. Adaptive regression by mixing. Journal of the American Statistical Association 96, 574–588.
  • Yuan, Z., Yang, Y., 2005. Combining linear regression models: When and how? Journal of the American Statistical Association 100, 1202–1214.
  • Zhang, H., Zou, G., 2020. Cross-validation model averaging for generalized functional linear model. Econometrics 8, 7.
  • Zhang, X., Liang, H., 2011. Focused information criterion and model averaging for generalized additive partial linear models. The Annals of Statistics 39, 174–200.
  • Zhang, X., Wan, A.T., Zhou, S.Z., 2012. Focused information criteria, model selection, and model averaging in a Tobit model with a nonzero threshold. Journal of Business and Economic Statistics 30, 132–142.
  • Zhang, X., Wan, A.T., Zou, G., 2013. Model averaging by jackknife criterion in models with dependent data. Journal of Econometrics 174, 82–94.
  • Zhang, X., Zou, G., 2011. Model averaging method and its application in forecast. Statistical Research 28, 6.
  • Zhao, S., Liao, J., Yu, D., 2020. Model averaging estimator in ridge regression and its large sample properties. Statistical Papers 61, 1719–1739.
  • Zou, H., Zhang, H.H., 2009. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics 37, 1733–1751.