
Superset model problem

Koji Miyawaki
School of Economics, Kwansei Gakuin University
   Steven N. MacEachern
Department of Statistics, The Ohio State University
Abstract

This paper focuses on the superset model problem that arises in the context of regression. To address this problem, we take a Bayesian approach to measure its uncertainty. An illustrative example with a real dataset is provided.

1 Introduction

Regression analysis is a statistical method widely used in many research areas. It is often specified as the normal linear model, in which the mean response is linear in the coefficients and the error term follows a normal distribution, to simplify the analysis. This specification aims to approximate the state of nature and is often useful for prediction as well as for discussing causality.

Among specification issues, variable selection is a central one in regression analysis. It is important to select an appropriate set of explanatory variables, partly because of the cost of collecting variables. Many methods have been proposed for the variable selection problem.

In relation to the variable selection problem, this paper focuses on the superset model problem, in which the linear regression model selects a larger set of variables than the state of nature does (see the example provided in Section 2). When the state of nature is linear in the variables, the linear regression model tends to choose a smaller set of variables because of its least squares loss. A nonlinear relationship, however, is more likely in practice, and in that case the linear model may select a larger set of variables, which is a deficiency of the linear regression model in terms of the data collection cost.

To evaluate the superset model problem, this paper utilizes the Bayesian approach, which provides a measure of uncertainty in the form of a posterior probability, and proposes an alternative model that selects the true set of variables when the sample size is large.

This paper is organized as follows. Section 2 describes the superset model problem with an example. Section 3 adopts Bayes' theorem to evaluate the problem, and Section 4 specifies the two regression models. Section 5 explains how the marginal likelihood is maximized, and Section 6 illustrates the proposed method and discusses its robustness.

2 Superset model problem

Suppose the continuous response $Y$ is associated with the set of explanatory variables $\bm{x}_{T}$. We are interested in its mean response conditional on $\bm{x}_{T}$. We often assume it to be linear in practice, although it is more likely to be nonlinear in reality. To this end, a regression model is specified as

Y=\phi(\bm{x}_{T})+\epsilon,

where $\epsilon$ is an additive error term with mean zero. The functional form $\phi(\cdot)$, the distribution of the error term, and the true set of explanatory variables $\bm{x}_{T}$ are all unknown. With this model, we make statistical inferences about the conditional mean response by estimating $\phi(\cdot)$ and the set of explanatory variables. Among problems about how this regression model should be specified, the variable selection problem focuses on the set of explanatory variables, based on the dataset $\{y_{i},\bm{x}_{i}\}_{i=1}^{n}$.

When the set of explanatory variables is known, the best fit is $E(Y\mid\bm{x}_{T})$ as an estimator of $\phi(\bm{x}_{T})$ when we use the squared loss. The linear regression model assumes $E(Y\mid\bm{x}_{T})=\bm{x}_{T}^{\prime}\bm{\beta}$, where $\bm{\beta}$ is referred to as the regression coefficient vector. However, in general, $E(Y\mid\bm{x}_{T})\neq\bm{x}_{T}^{\prime}\bm{\beta}$, contrary to the linearity assumption.

For example, suppose

E\left[Y\mid x_{T}\right]=\alpha+\beta_{1}x_{T}+\beta_{2}x_{T}^{2}. (1)

The linear regression with $(x_{T},x_{U})$, where $x_{U}=x_{T}^{2}$, is better than the one with $x_{T}$ alone, even though the latter selects the true explanatory variable. This is an example of the superset model problem. On the other hand, when a variable $x_{U}^{\prime}$ is independent of $x_{T}$, $x_{U}^{\prime}$ should not be included in the regression, because it does not improve the fit.
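As a concrete check, the following sketch simulates the state of nature in (1) with illustrative (assumed) values of $\alpha$, $\beta_{1}$, $\beta_{2}$, and the error scale, and compares least squares fits with and without $x_{U}=x_{T}^{2}$; the superset fits better even though $x_{T}$ alone is the true variable.

```python
import numpy as np

# Simulate from the quadratic state of nature in (1); coefficient values and
# the noise level are illustrative assumptions, not taken from the paper.
rng = np.random.default_rng(0)
n = 500
x_T = rng.normal(size=n)
y = 1.0 + 0.5 * x_T + 0.8 * x_T**2 + rng.normal(scale=0.3, size=n)

def rss(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return resid @ resid

rss_true_set = rss(x_T.reshape(-1, 1), y)               # x_T only (true set)
rss_superset = rss(np.column_stack([x_T, x_T**2]), y)   # (x_T, x_U) superset
print(rss_true_set, rss_superset)  # the superset attains a much smaller RSS
```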

The above example suggests that knowledge about the association among variables is helpful in examining the superset model problem, and hence the variable selection problem. One approach is to estimate the conditional expectation without assuming linearity and to compare it with the one implied by the normal linear model. If they differ and the latter contains more explanatory variables, the superset model problem exists. Because the dataset at hand is limited, however, it is difficult to determine conclusively whether the superset model problem exists. Rather, it is evaluated in the form of a probability, as explained in the next section.

3 Superset model probability

Suppose $\mathcal{M}^{\ast}$ is the set of explanatory variables in the state of nature, which is $\bm{x}_{T}$ when vectorized, and assume for the moment that it is known. The current dataset $\{y_{i},\bm{x}_{i}\}_{i=1}^{n}$ is generated from this state of nature independently for each observation $i$ and is observed. Depending on the context, the normal linear model with the explanatory variables in $\mathcal{M}^{\ast}$ would be a natural choice if it approximates the state of nature well. In this case, we do not have the superset model problem. On the other hand, a normal linear model with a set of explanatory variables indexed by $\mathcal{M}(\neq\mathcal{M}^{\ast})$ may be chosen, independent of the state of nature, in terms of, say, prediction; the superset model problem arises when $\mathcal{M}\supset\mathcal{M}^{\ast}$.

However, the state of nature is usually unknown and is inferred from the dataset. The uncertainty from this inference is evaluated by the posterior probability over possible subsets of explanatory variables. This paper approximates it by assuming a flexible model (see Model (5) in Section 4), which is denoted by $H_{0}$. This posterior probability is calculated via Bayes' theorem, which gives

\Pr\left(\mathcal{M}^{\ast}\mid\{y_{i},\bm{x}_{i}\}_{i=1}^{n},H_{0}\right)=\frac{\Pr\left(\{Y_{i}\}_{i=1}^{n}\mid\{\bm{x}_{i}\}_{i=1}^{n},H_{0},\mathcal{M}^{\ast}\right)\Pr\left(\mathcal{M}^{\ast},H_{0}\right)}{\sum_{\tilde{\mathcal{M}}}\Pr\left(\{Y_{i}\}_{i=1}^{n}\mid\{\bm{x}_{i}\}_{i=1}^{n},H_{0},\tilde{\mathcal{M}}\right)\Pr\left(\tilde{\mathcal{M}},H_{0}\right)}. (2)

The numerator is the product of the marginal likelihood and the prior belief about the set of explanatory variables. When the latter is uniform (which is assumed in the following empirical illustration), the posterior probability is proportional to the marginal likelihood under $H_{0}$.

When the uncertainty from inference about the normal linear model is evaluated by its posterior probability as well, the overall superset model probability is calculated as

\sum_{\mathcal{M}}\sum_{\mathcal{M}^{\ast}}I\left(\mathcal{M}\supset\mathcal{M}^{\ast}\right)\Pr\left(\mathcal{M}^{\ast}\mid\{y_{i},\bm{x}_{i}\}_{i=1}^{n},H_{0}\right)\Pr\left(\mathcal{M}\mid\{y_{i},\bm{x}_{i}\}_{i=1}^{n},H_{1}\right), (3)

where $H_{1}$ denotes the normal linear model (see Model (4) in Section 4).

We note that the above expression is general enough to include variables (the response and explanatory variables) that are continuous or discrete. The next section specifies the two regression models $H_{0}$ and $H_{1}$, where the response is assumed to be continuous for simplicity.
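As a computational note, once posterior model probabilities under $H_{0}$ and $H_{1}$ are available, expression (3) is a plain double sum over pairs of models. A minimal sketch, with models encoded as frozensets of variable indices (our own convention, not the paper's):

```python
def superset_model_probability(post_h0, post_h1):
    """Expression (3): sum over (M*, M) of I(M strictly contains M*)
    times P(M* | data, H0) times P(M | data, H1).

    post_h0, post_h1: dicts mapping frozenset(variable indices) -> posterior probability.
    """
    return sum(
        p1 * p0
        for m, p1 in post_h1.items()
        for m_star, p0 in post_h0.items()
        if m_star < m  # proper subset relation on frozensets
    )

# toy usage with two candidate variables {0, 1}
post_h0 = {frozenset({0}): 0.7, frozenset({0, 1}): 0.3}
post_h1 = {frozenset({0}): 0.2, frozenset({0, 1}): 0.8}
print(superset_model_probability(post_h0, post_h1))  # 0.7 * 0.8 = 0.56
```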

4 Two regression models

First, the linear regression model $H_{1}$ is specified as

Y_{i}=\alpha+\bm{x}_{i}^{\prime}\bm{\beta}+\eta_{i},\quad\eta_{i}\sim N\left(0,\lambda^{2}\right), (4)

where each of the explanatory variables is standardized without loss of generality. To estimate the model parameters $(\bm{\beta},\lambda^{2})$, the hyper-$g$ prior is assumed. Then, the marginal likelihood is analytically tractable (see Miyawaki and MacEachern (n/a), for example).

Second, the model $H_{0}$, an alternative to the normal linear regression model, is specified as

Y_{i}=\theta_{x}+\epsilon_{x},\quad\epsilon_{x}\sim N\left(0,\sigma_{x}^{2}\right), (5)

given $\bm{x}_{i}=\bm{x}$. The normal error assumption is made because we have no other knowledge about the error. Further, it makes the conditional mean estimation simpler, in terms of the number of parameters as well as the computational burden. The main purpose of this semiparametric model is to estimate conditional means of $Y$ in a flexible manner, and to capture the association between $Y$ and $\bm{x}$ in the state of nature as the sample size increases, which can be viewed as an extreme case of local constant estimation (see, e.g., Fan and Gijbels (2003) for local estimation).

When the dataset is fixed, the covariate space becomes sparse as its dimension grows. Then, the marginal likelihood (and hence the superset model probability) under this alternative model depends strongly on the prior specification. To mitigate this influence, this paper takes the $m$-fold cross-validation approach, which is described in detail below.

The dataset is randomly divided into $m$ groups. One of them is used as the test set, while the remaining groups form the training set. Explanatory variables in the training set are standardized, and those in the test set are standardized by the mean and standard deviation of those in the training set. Let $\mathcal{D}_{0}$ and $\mathcal{D}_{1}$ be the sets of identification numbers of observations that belong to the training and test sets, respectively. More precisely, $\mathcal{D}_{0}=\{i\mid\text{the $i$-th observation is in the training set},\ i=1,\dots,n\}$ and $\mathcal{D}_{1}=\{i\mid\text{the $i$-th observation is in the test set},\ i=1,\dots,n\}$. Given a choice of $\mathcal{D}_{0}$ and $\mathcal{D}_{1}$, we construct the prior and the conditional marginal likelihood in the following manner.
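A minimal sketch of this split-and-standardize step; the function and variable names are ours, not the paper's:

```python
import numpy as np

def cv_folds(n, m, rng):
    """Randomly partition the observation indices 0..n-1 into m groups."""
    return np.array_split(rng.permutation(n), m)

def standardize_by_training(X, d0, d1):
    """Standardize the training block by its own mean/sd and the test block
    by the training mean/sd, as described in the text."""
    mu = X[d0].mean(axis=0)
    sd = X[d0].std(axis=0, ddof=1)
    return (X[d0] - mu) / sd, (X[d1] - mu) / sd

# toy usage: fold 0 is the test set D_1, the rest form the training set D_0
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
folds = cv_folds(len(X), m=5, rng=rng)
d1 = folds[0]
d0 = np.concatenate(folds[1:])
X0, X1 = standardize_by_training(X, d0, d1)
```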

The prior for $\theta_{x}$ in model (5) given $\bm{x}_{i}=\bm{x}$ and $i\in\mathcal{D}_{1}$ is assumed to be

\theta_{x}\sim N\left(\hat{y}_{x},t_{x}^{2}\right), (6)
where $\hat{y}_{x}=\bar{y}_{0}+\bm{x}^{\prime}\hat{\bm{\beta}}$, $\bar{y}_{0}$ is the sample average of the response in $\mathcal{D}_{0}$,
\hat{\bm{\beta}}=\left(\sum_{i\in\mathcal{D}_{0}}\bm{x}_{i}\bm{x}_{i}^{\prime}\right)^{-1}\sum_{i\in\mathcal{D}_{0}}\bm{x}_{i}y_{i}, (7)
t_{x}^{2}=s^{2}\left\{\frac{1}{|\mathcal{D}_{0}|}+\bm{x}^{\prime}\left(\sum_{i\in\mathcal{D}_{0}}\bm{x}_{i}\bm{x}_{i}^{\prime}\right)^{-1}\bm{x}\right\}, (8)
s^{2}=\frac{1}{|\mathcal{D}_{0}|-k-1}\sum_{i\in\mathcal{D}_{0}}\left(y_{i}-\bar{y}_{0}-\bm{x}_{i}^{\prime}\hat{\bm{\beta}}\right)^{2}. (9)

When $\mathcal{A}$ is a set, $|\mathcal{A}|$ denotes the number of elements in the set. This prior is constructed from classical OLS estimates of the mean and standard deviation of $Y_{i}$ at $\bm{x}_{i}=\bm{x}$. By using a prior obtained from the linear model together with a model that focuses on the local observations, we are able to combine local and global information. It is possible to use other estimates, such as the corresponding normal linear regression estimates under the hyper-$g$ prior. However, to keep the methodology as simple as possible, we take the above prior specification.
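The prior construction (6)-(9) amounts to a few OLS quantities computed on the training set. A sketch, in which we take $k$ to be the number of explanatory variables (the usual reading of the degrees-of-freedom correction in (9)); names are ours:

```python
import numpy as np

def local_prior(X0, y0, x):
    """Prior mean y_hat_x and variance t_x^2 of (6)-(9) at a test point x,
    built from OLS on the (standardized) training set (X0, y0)."""
    n0, k = X0.shape
    y0_bar = y0.mean()
    XtX = X0.T @ X0
    beta_hat = np.linalg.solve(XtX, X0.T @ y0)                 # (7)
    resid = y0 - y0_bar - X0 @ beta_hat
    s2 = resid @ resid / (n0 - k - 1)                          # (9)
    y_hat_x = y0_bar + x @ beta_hat                            # prior mean in (6)
    t2_x = s2 * (1.0 / n0 + x @ np.linalg.solve(XtX, x))       # (8)
    return y_hat_x, t2_x
```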

Then, we are able to derive the marginal likelihood conditional on the nuisance parameter $\sigma_{x}^{2}$ for each $\bm{x}_{i}=\bm{x}$ and $i\in\mathcal{D}_{1}$. Let $\mathcal{M}_{x}=\{i\mid\bm{x}_{i}=\bm{x},\ i\in\mathcal{D}_{1}\}$ and $n_{x}=|\mathcal{M}_{x}|$. Then, this conditional marginal likelihood is given by

m^{\ast}\left(\{Y_{i}\}_{i\in\mathcal{M}_{x}}\mid\{\bm{x}_{i}\}_{i\in\mathcal{M}_{x}},\sigma_{x}^{2},\{y_{i},\bm{x}_{i}\}_{i\in\mathcal{D}_{0}}\right)=\frac{\tau_{x}}{\left(\sqrt{2\pi}\sigma_{x}\right)^{n_{x}}t_{x}}\exp\left\{-\frac{1}{2}\left(-\frac{\mu_{x}^{2}}{\tau_{x}^{2}}+\frac{\sum_{i\in\mathcal{M}_{x}}y_{i}^{2}}{\sigma_{x}^{2}}+\frac{\hat{y}_{x}^{2}}{t_{x}^{2}}\right)\right\},
\mu_{x}=\tau_{x}^{2}\left(\frac{\sum_{i\in\mathcal{M}_{x}}y_{i}}{\sigma_{x}^{2}}+\frac{\hat{y}_{x}}{t_{x}^{2}}\right),\quad\tau_{x}^{2}=\left(\frac{n_{x}}{\sigma_{x}^{2}}+\frac{1}{t_{x}^{2}}\right)^{-1}.
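For reference, a sketch of the logarithm of this conditional marginal likelihood; variable names are ours:

```python
import numpy as np

def log_cond_marginal_likelihood(y_x, y_hat_x, t2_x, sigma2_x):
    """Log of m*({Y_i} | {x_i}, sigma_x^2, training data) for the n_x test
    responses y_x that share the same covariate value x."""
    y_x = np.asarray(y_x)
    n_x = len(y_x)
    tau2_x = 1.0 / (n_x / sigma2_x + 1.0 / t2_x)
    mu_x = tau2_x * (y_x.sum() / sigma2_x + y_hat_x / t2_x)
    return (
        0.5 * np.log(tau2_x)
        - 0.5 * n_x * np.log(2.0 * np.pi * sigma2_x)
        - 0.5 * np.log(t2_x)
        - 0.5 * (-mu_x**2 / tau2_x + (y_x**2).sum() / sigma2_x + y_hat_x**2 / t2_x)
    )
```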

A full Bayes analysis would specify a prior on the nuisance parameter $\sigma_{x}^{2}$ as well. However, because the data are sparse at $\bm{x}$, the (unconditional) marginal likelihood is strongly affected by how that prior is specified. To mitigate this problem, this paper takes the empirical Bayes approach. The marginal likelihood for each $\bm{x}_{i}=\bm{x}$ is the conditional marginal likelihood $m^{\ast}(\{Y_{i}\}_{i\in\mathcal{M}_{x}}\mid\{\bm{x}_{i}\}_{i\in\mathcal{M}_{x}},\sigma_{x}^{2},\{y_{i},\bm{x}_{i}\}_{i\in\mathcal{D}_{0}})$ maximized over $\sigma_{x}^{2}$. More precisely,

m\left(\{Y_{i}\}_{i\in\mathcal{M}_{x}}\mid\{\bm{x}_{i}\}_{i\in\mathcal{M}_{x}},\{y_{i},\bm{x}_{i}\}_{i\in\mathcal{D}_{0}}\right)\equiv\max_{\sigma_{x}^{2}}m^{\ast}\left(\{Y_{i}\}_{i\in\mathcal{M}_{x}}\mid\{\bm{x}_{i}\}_{i\in\mathcal{M}_{x}},\sigma_{x}^{2},\{y_{i},\bm{x}_{i}\}_{i\in\mathcal{D}_{0}}\right).

See the next section for the details of this maximization. By taking the product over all distinct $\bm{x}$ in $\mathcal{D}_{1}$ and then the geometric mean, we obtain the marginal likelihood for $\mathcal{D}_{1}$ per observation, which is given by

\left\{\prod_{\bm{x}}m\left(\{Y_{i}\}_{i\in\mathcal{M}_{x}}\mid\{\bm{x}_{i}\}_{i\in\mathcal{M}_{x}},\{y_{i},\bm{x}_{i}\}_{i\in\mathcal{D}_{0}}\right)\right\}^{1/|\mathcal{D}_{1}|}.

The geometric mean accounts for the different sample sizes in different test sets.

Finally, we repeat the above process until all $m$ groups have been used as the test set, calculating the above marginal likelihood for each choice of test group. After averaging the $m$ marginal likelihoods, we raise the average to the power of $n$ to obtain the final marginal likelihood estimate for model (5).
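Putting the pieces together, the following sketch performs the empirical Bayes maximization numerically (the analytical characterization is in Section 5), takes the per-observation geometric mean within a test set, and then combines the $m$ folds; all names and the numerical bounds are ours:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def log_marginal_for_test_set(groups):
    """Per-observation log marginal likelihood for one test set D_1.

    groups: one (y_x, log_m_star) pair per distinct covariate value x in D_1,
    where log_m_star(sigma2) is the log conditional marginal likelihood of the
    responses y_x (e.g. the function in the sketch above, with y_hat_x and
    t2_x already plugged in).
    """
    total_log, n_test = 0.0, 0
    for y_x, log_m_star in groups:
        res = minimize_scalar(lambda s2: -log_m_star(s2),
                              bounds=(1e-8, 1e8), method="bounded")
        total_log += -res.fun          # empirical Bayes: maximize over sigma_x^2
        n_test += len(y_x)
    return total_log / n_test          # log geometric mean per observation

def final_marginal_likelihood(per_fold_logs, n):
    """Average the m per-observation marginal likelihoods, then raise to the power n."""
    return np.mean(np.exp(per_fold_logs)) ** n
```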

The robustness of this approach is of interest. The approach would be more useful if we knew the upper and lower bounds of the superset model probability under $H_{0}$ as its specification changes. Two points are discussed regarding robustness.

First, we consider the robustness to the number of folds in the cross-validation. In the empirical analysis in Section 6, we use 10-fold cross-validation. When the number of folds varies from 2 to 15, the resulting probability does not change much with the diabetes dataset (see Figure 1).

Figure 1: Superset model probability as the number of folds increases. (The dotted line indicates 10-fold cross-validation, which is used in Section 6.)

Second, it is important to check the robustness to the prior. One approach would be to use the $\epsilon$-contamination class of priors (see Section 4.7.4 of Berger (1985)) and to examine the sensitivity of the marginal likelihood, which we leave for future analysis.

5 Maximizing the marginal likelihood

Letting

\bar{y}_{x}=\frac{1}{n_{x}}\sum_{i\in\mathcal{M}_{x}}y_{i}, (10)
s_{y}^{2}=\frac{1}{n_{x}}\sum_{i\in\mathcal{M}_{x}}\left(y_{i}-\bar{y}_{x}\right)^{2}=\frac{1}{n_{x}}\sum_{i\in\mathcal{M}_{x}}y_{i}^{2}-\bar{y}_{x}^{2}, (11)

the local marginal likelihood function is characterized by Theorem 1.

Theorem 1.

The local marginal likelihood function has extremal values at positive solutions to the cubic equation (20) given in the proof when $s_{y}^{2}>0$; at $0$ when $s_{y}^{2}=0$ and $t_{x}^{2}-(\bar{y}_{x}-\hat{y}_{x})^{2}\geq 0$; and at $0$ and $-t_{x}^{2}+(\bar{y}_{x}-\hat{y}_{x})^{2}$ when $s_{y}^{2}=0$ and $t_{x}^{2}-(\bar{y}_{x}-\hat{y}_{x})^{2}<0$.

See Appendix A for its proof. Table 1 summarizes the result.

Table 1: Local marginal likelihood function

                | $\sigma_{x}^{2}\to 0$ | $0<\sigma_{x}^{2}<\infty$ | $\sigma_{x}^{2}\to\infty$
$s_{y}^{2}>0$   | 0 | See Lemma 2 | 0
$s_{y}^{2}=0$   | $\frac{1}{\sqrt{2\pi}t_{x}}\exp\left\{-\frac{1}{2t_{x}^{2}}\left(\bar{y}_{x}-\hat{y}_{x}\right)^{2}\right\}$ | See Lemma 2 | 0

By this theorem, we are able to set a value of $\sigma_{x}^{2}$ to maximize the marginal likelihood, instead of placing a prior on it.
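A sketch of how Theorem 1 can be used in practice: collect the candidate values of $\sigma_{x}^{2}$ (the positive roots of the cubic (20) when $s_{y}^{2}>0$, or the points identified by the theorem when $s_{y}^{2}=0$) and evaluate the conditional marginal likelihood at each candidate, keeping the best. The coefficients follow (18)-(19); names and tolerances are ours:

```python
import numpy as np

def sigma2_candidates(y_x, y_hat_x, t2_x):
    """Candidate maximizers of the local marginal likelihood in sigma_x^2,
    following Theorem 1 (cubic coefficients a1..a4 from (18)-(19))."""
    y_x = np.asarray(y_x)
    n_x = len(y_x)
    y_bar = y_x.mean()
    s_y2 = np.mean((y_x - y_bar) ** 2)
    d2 = (y_bar - y_hat_x) ** 2
    if s_y2 > 0:
        a1 = -1.0 / t2_x**2
        a2 = (1.0 - 2.0 * n_x) / t2_x + (s_y2 + d2) / t2_x**2
        a3 = -n_x**2 + n_x + 2.0 * n_x * s_y2 / t2_x
        a4 = n_x**2 * s_y2
        roots = np.roots([a1, a2, a3, a4])
        return [r.real for r in roots if abs(r.imag) < 1e-10 and r.real > 0]
    # s_y2 = 0: Theorem 1 places the extrema at the boundary sigma_x^2 -> 0 and,
    # when t2_x < (y_bar - y_hat_x)^2, also at the interior point d2 - t2_x.
    return [d2 - t2_x] if d2 > t2_x else []
```

The returned candidates can then be plugged into the log conditional marginal likelihood sketched in Section 4 to pick the maximizer.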

6 Illustrative example

The diabetes data (see Efron et al. (2004)) are used to illustrate our method. This dataset contains 442 observations. For the analysis below, we use the logarithm of the diabetes progression measure as the response and the remaining 10 variables as explanatory variables. The proposed method is applied, and the superset model probability for this dataset with 10-fold cross-validation is estimated to be 22.37%.
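A sketch of preparing the data, assuming that scikit-learn's load_diabetes ships the same 442-observation, 10-covariate dataset as Efron et al. (2004); the probability itself requires the procedure of Sections 3-5, so estimate_superset_probability below is only a hypothetical placeholder, not an existing function:

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Response: log of the diabetes progression measure; explanatory variables:
# the remaining ten covariates. (scikit-learn returns rescaled covariates by
# default, but the method standardizes them within each fold anyway.)
data = load_diabetes()
X = data.data               # 442 x 10 matrix of explanatory variables
y = np.log(data.target)    # log of the diabetes progression measure

# prob = estimate_superset_probability(X, y, folds=10)   # hypothetical driver
```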

As discussed by MacEachern and Miyawaki (2022), the diabetes dataset seems to have been collected from at least two different sources. In particular, the precision of two explanatory variables (the blood pressure and the fourth blood serum measurement) consists of a mix of finer and coarser observations. When the data are divided into two groups by this precision, MacEachern and Miyawaki (2022) suggest that these two datasets have different characteristics.

This conclusion is also confirmed in terms of the superset model probability. When the dataset of observations with the finer variables is used, it is 22.19%. When, on the other hand, the dataset of observations with the coarser variables is used, it is 10.72%. The superset model problem is more likely to occur with the former dataset than with the latter, which would be due to the difference in characteristics of these two datasets.

Appendix A Proof of Theorem 1

The local marginal likelihood function is characterized in the following two lemmas. The proof of the theorem directly follows from them.

Lemma 1.

The local marginal likelihood function converges to a finite value as σx2\sigma_{x}^{2} approaches zero or infinity.

Proof.

Observe that

g\left(\sigma_{x}^{2}\right)=-\frac{\mu_{x}^{2}}{\tau_{x}^{2}}+\frac{\sum_{i\in\mathcal{M}_{x}}y_{i}^{2}}{\sigma_{x}^{2}}
=\frac{1}{\sigma_{x}^{2}}\left\{\sum_{i\in\mathcal{M}_{x}}y_{i}^{2}-\frac{\tau_{x}^{2}}{\sigma_{x}^{2}}\left(\sum_{i\in\mathcal{M}_{x}}y_{i}\right)^{2}\right\}-2\frac{\tau_{x}^{2}}{\sigma_{x}^{2}}\frac{n_{x}\bar{y}_{x}\hat{y}_{x}}{t_{x}^{2}}-\tau_{x}^{2}\frac{\hat{y}_{x}^{2}}{t_{x}^{4}}
=\frac{1}{\sigma_{x}^{2}}\left\{n_{x}s_{y}^{2}+\left(\frac{1}{n_{x}}-\frac{\tau_{x}^{2}}{\sigma_{x}^{2}}\right)n_{x}^{2}\bar{y}_{x}^{2}\right\}-2\frac{\tau_{x}^{2}}{\sigma_{x}^{2}}\frac{n_{x}\bar{y}_{x}\hat{y}_{x}}{t_{x}^{2}}-\tau_{x}^{2}\frac{\hat{y}_{x}^{2}}{t_{x}^{4}}. (12)

First, we consider the convergence as $\sigma_{x}^{2}\to\infty$. Because $\tau_{x}^{2}\to t_{x}^{2}$ and $\frac{\tau_{x}^{2}}{\sigma_{x}^{2}}\to 0$, we have $g(\sigma_{x}^{2})\to-\frac{\hat{y}_{x}^{2}}{t_{x}^{2}}$ as $\sigma_{x}^{2}\to\infty$. Because the factor $\tau_{x}/\{(\sqrt{2\pi}\sigma_{x})^{n_{x}}t_{x}\}$ in front tends to zero, the local marginal likelihood function converges to zero in this case.

Next, we consider the case $\sigma_{x}^{2}\to 0$. As $\sigma_{x}^{2}\to 0$, $\tau_{x}^{2}\to 0$ and $\frac{\tau_{x}^{2}}{\sigma_{x}^{2}}\to\frac{1}{n_{x}}$. Because

\frac{1}{\sigma_{x}^{2}}\left(\frac{1}{n_{x}}-\frac{\tau_{x}^{2}}{\sigma_{x}^{2}}\right)=\frac{1}{n_{x}\left(n_{x}t_{x}^{2}+\sigma_{x}^{2}\right)}, (13)
we have
g\left(\sigma_{x}^{2}\right)\to\begin{cases}\infty,&\text{when $s_{y}^{2}>0$},\\ \frac{\bar{y}_{x}^{2}}{t_{x}^{2}}-2\frac{\bar{y}_{x}\hat{y}_{x}}{t_{x}^{2}},&\text{when $s_{y}^{2}=0$},\end{cases} (14)

as $\sigma_{x}^{2}\to 0$. Thus, the local marginal likelihood function converges to

\begin{cases}0,&\text{when $s_{y}^{2}>0$},\\ \frac{1}{\sqrt{2\pi}t_{x}}\exp\left\{-\frac{1}{2t_{x}^{2}}\left(\bar{y}_{x}-\hat{y}_{x}\right)^{2}\right\},&\text{when $s_{y}^{2}=0$},\end{cases} (15)

as $\sigma_{x}^{2}\to 0$. ∎

Lemma 2.

For $0<\sigma_{x}^{2}<\infty$, at least one positive extremal value of the local marginal likelihood function exists when $s_{y}^{2}>0$, while at most one such value exists otherwise.

Proof.

The first order condition for maximizing the log local marginal likelihood function is

\frac{m^{\prime}\left(\{Y_{i}\}_{i\in\mathcal{M}_{x}}\mid\{\bm{x}_{i}\}_{i\in\mathcal{M}_{x}},\sigma_{x}^{2}\right)}{m\left(\{Y_{i}\}_{i\in\mathcal{M}_{x}}\mid\{\bm{x}_{i}\}_{i\in\mathcal{M}_{x}},\sigma_{x}^{2}\right)}=0. (16)

The left hand side is calculated as

\frac{1}{2}\frac{\left(\tau_{x}^{2}\right)^{2}}{\left(\sigma_{x}^{2}\right)^{4}}n_{x}\left\{a_{1}\left(\sigma_{x}^{2}\right)^{3}+a_{2}\left(\sigma_{x}^{2}\right)^{2}+a_{3}\sigma_{x}^{2}+a_{4}\right\}, (17)
where
a_{1}=-\frac{1}{t_{x}^{4}},\quad a_{2}=\frac{1}{t_{x}^{2}}\left(1-2n_{x}\right)+\frac{1}{t_{x}^{4}}\left\{s_{y}^{2}+\left(\bar{y}_{x}-\hat{y}_{x}\right)^{2}\right\}, (18)
a_{3}=-n_{x}^{2}+n_{x}+\frac{2n_{x}s_{y}^{2}}{t_{x}^{2}},\quad a_{4}=n_{x}^{2}s_{y}^{2}. (19)

Then, the extremal values are solutions to the following cubic equation:

a_{1}\left(\sigma_{x}^{2}\right)^{3}+a_{2}\left(\sigma_{x}^{2}\right)^{2}+a_{3}\sigma_{x}^{2}+a_{4}=0. (20)

When $s_{y}^{2}>0$, $a_{4}>0$. Thus, at least one solution to this cubic equation is positive when $s_{y}^{2}>0$.

Remark 1.

Let $(\alpha,\beta,\gamma)$ be the three solutions to this cubic equation. Then, it is factorized as $(\sigma_{x}^{2}-\alpha)(\sigma_{x}^{2}-\beta)(\sigma_{x}^{2}-\gamma)=0$. By comparing coefficients, $-\alpha\beta\gamma=\frac{a_{4}}{a_{1}}<0$. This implies that the solutions are nonzero and fall into one of three possibilities: (i) three distinct real solutions, (ii) one real solution and two complex conjugate solutions, or (iii) three real solutions, at least two of which coincide. Since $\alpha\beta\gamma>0$ and a complex conjugate pair has a positive product, at least one solution is positive in each case.

When $s_{y}^{2}=0$, the cubic equation reduces to

\left(\sigma_{x}^{2}\right)^{3}+\left\{t_{x}^{2}-\left(\bar{y}_{x}-\hat{y}_{x}\right)^{2}\right\}\left(\sigma_{x}^{2}\right)^{2}=0. (21)

If $t_{x}^{2}-(\bar{y}_{x}-\hat{y}_{x})^{2}\geq 0$, no extremal value exists for the local marginal likelihood function over $0<\sigma_{x}^{2}<\infty$. If, on the other hand, $t_{x}^{2}-(\bar{y}_{x}-\hat{y}_{x})^{2}<0$,

\sigma_{x}^{2}=-t_{x}^{2}+\left(\bar{y}_{x}-\hat{y}_{x}\right)^{2} (22)

is the extremal value within the same range. ∎

References

  • Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). Springer Series in Statistics. New York: Springer-Verlag.
  • Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2), 407–499.
  • Fan, J. and I. Gijbels (2003). Local Polynomial Modelling and Its Applications. Boca Raton: Chapman & Hall/CRC.
  • MacEachern, S. N. and K. Miyawaki (2022). A regression approach to the two-dataset problem.
  • Miyawaki, K. and S. N. MacEachern (n/a). Economic variable selection. Canadian Journal of Statistics, to appear.