Superset model problem
Abstract
This paper focuses on the superset model problem that arises in the context of regression. To address this problem, we take a Bayesian approach and measure its uncertainty. An illustrative example with a real dataset is provided.
1 Introduction
Regression analysis is a statistical method widely used in many research areas. It is often specified as the normal linear model, where the mean response is linear in the coefficients and the error term follows a normal distribution, to simplify the analysis. This specification aims to approximate the state of nature and is often useful for prediction as well as for discussing causality.
Among its specifications, variable selection is a central issue in regression analysis. It is important to select an appropriate set of explanatory variables, partly because of the cost of collecting variables. Many methods have been proposed for the variable selection problem.
In relation to the variable selection problem, this paper focuses on the superset model problem, where the linear regression model selects a larger set of variables than the state of nature does (see the example provided in Section 2). The linear regression model tends to choose a smaller set of variables when the state of nature is linear in the variables, owing to its least squares loss. However, a nonlinear relationship is more likely in practice, and in that case the model may select a larger set of variables, which is a deficiency of the linear regression model in terms of the data collection cost.
To evaluate the superset model problem, this paper takes the Bayesian approach, which provides a measure of uncertainty in the form of a posterior probability, and proposes an alternative model that selects the true set of variables when the sample size is large.
This paper is organized as follows. Section 2 describes the superset model problem by providing an example. Bayes' theorem is adopted to evaluate the problem in Section 3, and two regression models for the specification are explained in Section 4. Section 5 shows how the marginal likelihood is maximized, and Section 6 illustrates the proposed method and discusses its robustness.
2 Superset model problem
Suppose a continuous response is associated with a set of explanatory variables, and we are interested in the mean response conditional on those variables. In practice, we often assume this conditional mean to be linear, although it is more likely to be nonlinear in reality. To this end, a regression model is specified as
where the error term is additive with mean zero. The functional form of the regression, the distribution of the error term, and the true set of explanatory variables are all unknown. With this model, we make statistical inferences about the conditional mean response by estimating the regression function and the set of explanatory variables. Among the questions of how this regression model should be specified, the variable selection problem focuses on the choice of explanatory variables, based on the observed dataset.
When the set of explanatory variables is known, the conditional expectation of the response given those variables is the best fit under squared loss. The linear regression model assumes that this conditional expectation is linear in the explanatory variables, with an unknown regression coefficient vector. In general, however, the conditional expectation is not linear, contrary to the linearity assumption.
For example, suppose
(1)
In this example, the linear regression that includes the additional variable fits better than the one that uses only the true explanatory variable, even though the latter selects the true explanatory variable. This is an example of the superset model problem. On the other hand, when the additional variable is independent of the true one, it should not be included in the regression to improve the fit.
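The flavor of this example can be reproduced by a small simulation. The sketch below is only an illustration of the general point, not a reconstruction of (1): it assumes a quadratic state of nature in a single variable and an extra covariate equal to its square, both of which are choices made here for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical state of nature: the mean response is nonlinear in x1 only.
x1 = rng.normal(size=n)
x2 = x1 ** 2                        # an extra variable associated with x1
y = x1 ** 2 + rng.normal(size=n)    # illustrative nonlinear truth; x2 is not in the state of nature

def ols_mse(X, y):
    """In-sample mean squared error of an ordinary least squares fit with an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return float(np.mean((y - Z @ beta) ** 2))

print(ols_mse(x1[:, None], y))                  # linear fit on the true variable only
print(ols_mse(np.column_stack([x1, x2]), y))    # lower loss with the superset {x1, x2}
```

Under squared loss, the fit that uses both variables dominates the fit that uses the true variable alone, which is exactly the mechanism behind the superset model problem.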
The above example suggests that knowledge about the association among variables is helpful for examining the superset model problem, and hence the variable selection problem. One approach is to estimate the conditional expectation without the linearity assumption and compare it with the one implied by the normal linear model. If they differ and the latter contains more explanatory variables, the superset model problem exists. Because the dataset at hand is limited, however, it is difficult to determine definitively whether the superset model problem exists. Rather, it is evaluated in the form of a probability, which is explained in the next section.
3 Superset model probability
Suppose the set of explanatory variables in the state of nature is known for the moment. The current dataset is generated from this state of nature independently for each observation and is observed. Depending on the context, the normal linear model with these explanatory variables would be a natural choice if it approximates the state of nature well. In this case, we do not have the superset model problem. On the other hand, a normal linear model with a different set of explanatory variables may be chosen independently of the state of nature in terms of, say, prediction; the superset model problem arises when the chosen set strictly contains the true set.
However, the state of nature is usually unknown and must be inferred from the dataset. The uncertainty from this inference is evaluated by the posterior probability over possible subsets of explanatory variables. This paper approximates it by assuming a flexible model (see Model (5) in Section 4). This posterior probability is then calculated via Bayes' theorem, which is given by
(2)
The numerator is the product of the marginal likelihood and the prior belief about the set of explanatory variables. When the latter is uniform (which is assumed in the following empirical illustration), the posterior probability is proportional to the marginal likelihood under the flexible model.
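Because the display for (2) is not reproduced above, the following is only a sketch of its content, written in notation introduced here for illustration: a candidate subset of explanatory variables $\gamma$, data $D$, and the flexible model $M_A$.

\[
  p(\gamma \mid D, M_A)
  \;=\;
  \frac{p(D \mid \gamma, M_A)\, p(\gamma \mid M_A)}
       {\sum_{\gamma'} p(D \mid \gamma', M_A)\, p(\gamma' \mid M_A)}.
\]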
When the uncertainty from inference about the normal linear model is evaluated by its posterior probability as well, the overall superset model probability is calculated as
(3)
where the normal linear model is specified as Model (4) in Section 4.
We note that the above expression is general enough to cover variables (the response and the explanatory variables) that are continuous or discrete. The next section specifies the two regression models, where the response is assumed to be continuous for simplicity.
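As a concrete reading of the computation in (3), the sketch below sums the product of the two posterior probabilities over pairs in which the set chosen under the linear model strictly contains the set supported under the flexible model. The dictionary representation, the treatment of the two posteriors as independent factors, and all names are assumptions made here for illustration.

```python
def superset_probability(post_flexible, post_linear):
    """Overall superset model probability: the posterior chance that the set of
    variables chosen under the normal linear model strictly contains the set
    supported under the flexible model.  Each argument maps frozensets of
    variable names to posterior probabilities summing to one."""
    prob = 0.0
    for set_a, p_a in post_flexible.items():
        for set_l, p_l in post_linear.items():
            if set_a < set_l:        # strict subset: the linear model carries extra variables
                prob += p_a * p_l
    return prob

# Toy usage with two candidate explanatory variables
post_a = {frozenset({"x1"}): 0.7, frozenset({"x1", "x2"}): 0.3}
post_l = {frozenset({"x1"}): 0.2, frozenset({"x1", "x2"}): 0.8}
print(superset_probability(post_a, post_l))   # 0.7 * 0.8 = 0.56
```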
4 Two regression models
First, the linear regression model is specified as
(4)
where each of the explanatory variables is standardized without loss of generality. To estimate the model parameters, the hyper-g prior is assumed. The marginal likelihood is then analytically tractable (see Miyawaki and MacEachern (n/a), for example).
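The display for Model (4) is not reproduced above. A standard form of a normal linear model with a hyper-g prior on the coefficients, written in notation introduced only for this sketch and not necessarily matching the paper's parameterization, is

\begin{align*}
  y_i &= \alpha + x_i'\beta + \varepsilon_i,
        \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathrm{N}(0,\sigma^2),
        \qquad i = 1,\dots,n,\\
  \beta \mid g, \sigma^2 &\sim \mathrm{N}\!\bigl(0,\; g\,\sigma^2 (X'X)^{-1}\bigr),
        \qquad
        p(g) = \tfrac{a-2}{2}\,(1+g)^{-a/2}, \quad g > 0,\ a > 2,
\end{align*}

under which the marginal likelihood is known to be available in closed form (in terms of a Gaussian hypergeometric function).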
Second, the alternative model to the normal linear regression model is specified as
(5)
given the explanatory variables. The normal error assumption is made because we have no further knowledge about the error distribution. It also makes the conditional mean estimation simpler, in terms of both the number of parameters and the computational burden. The main purpose of this semiparametric model is to estimate the conditional means of the response in a flexible manner and to capture the association between the response and the explanatory variables in the state of nature as the sample size increases; it can be viewed as an extreme of local constant estimation (see, e.g., Fan and Gijbels (2003) for local estimation).
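The display for Model (5) is also not reproduced. One formulation consistent with the description above (normal errors and a separate conditional mean at every observed covariate value, the extreme case of a local constant fit), in notation introduced only for this sketch, is

\[
  y_i \mid x_i = x \;\sim\; \mathrm{N}\!\left(\theta_x,\; \sigma_x^2\right)
  \quad \text{independently across observations},
\]

so that the parameters at each distinct covariate value are informed only by the observations at that value, which is why the prior constructed below has to carry global information into these sparse local cells.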
When the dataset is fixed, the covariate space becomes sparse as its dimension grows. The marginal likelihood (and hence the superset model probability) under this alternative model then depends strongly on the prior specification. To mitigate this influence, this paper takes a cross-validation approach, which is described in detail below.
The dataset is randomly divided into groups. One of them is used as the test set, while the remaining groups form the training set. Explanatory variables in the training set are standardized, and those in the test set are standardized by the mean and standard deviation of those in the training set. The training and test index sets consist of the identification numbers of the observations belonging to the training and test sets, respectively. Given a choice of training and test sets, we construct the prior and the conditional marginal likelihood in the following manner.
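A minimal sketch of this split-and-standardize step, assuming NumPy arrays and using scikit-learn's KFold purely for convenience; the function name and defaults are illustrative choices, not the paper's implementation.

```python
import numpy as np
from sklearn.model_selection import KFold

def standardized_folds(X, y, n_splits=10, seed=0):
    """Yield (train, test) pairs in which the explanatory variables of the test
    set are standardized by the training-set mean and standard deviation."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        mean = X[train_idx].mean(axis=0)
        sd = X[train_idx].std(axis=0, ddof=1)
        X_train = (X[train_idx] - mean) / sd
        X_test = (X[test_idx] - mean) / sd    # standardized by training statistics only
        yield (X_train, y[train_idx]), (X_test, y[test_idx])
```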
The prior for the parameters in model (5), given the training set and the covariate value of interest, is assumed as
(6)
where , is the sample average of the response in ,
(7)
(8)
(9)
When the argument is a set, the notation denotes the number of elements in the set. This prior is constructed from classical OLS estimates of the mean and standard deviation of the response at the covariate value of interest. By using a prior obtained from the linear model together with a model that focuses on the local observations, we are able to combine local and global information. It is possible to use other estimates, such as the corresponding normal linear regression estimates under the hyper-g prior. However, to keep the methodology as simple as possible, we take the above prior specification.
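A hypothetical sketch of how such OLS-based prior hyperparameters at a covariate value could be formed from the training set is given below. Using the fitted value as the prior location and the standard deviation of the fitted mean as the prior scale is an assumption of this sketch, not a statement of the paper's exact formulas (6)-(9).

```python
import numpy as np

def ols_prior_at(X_train, y_train, x0):
    """Classical OLS fit on the training set; returns a prior location and scale
    at the covariate value x0 (hypothetical choices for the local mean parameter)."""
    n, p = X_train.shape
    Z = np.column_stack([np.ones(n), X_train])
    z0 = np.concatenate([[1.0], x0])
    beta, *_ = np.linalg.lstsq(Z, y_train, rcond=None)
    resid = y_train - Z @ beta
    s2 = resid @ resid / (n - p - 1)         # residual variance
    cov = s2 * np.linalg.inv(Z.T @ Z)        # covariance of the OLS coefficients
    mean0 = z0 @ beta                        # prior location: the fitted mean at x0
    sd0 = np.sqrt(z0 @ cov @ z0)             # prior scale: standard deviation of that fitted mean
    return mean0, sd0
```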
Then we are able to derive the marginal likelihood conditional on the nuisance parameter for each covariate value in the test set. This conditional marginal likelihood is given by
The full Bayes analysis specifies a prior on the nuisance parameter as well. However, because the data are sparse at each covariate value, how we specify this prior strongly affects the (unconditional) marginal likelihood. To mitigate this problem, this paper takes the empirical Bayes approach: the marginal likelihood for each covariate value is the conditional marginal likelihood maximized over the nuisance parameter. More precisely,
See the next section for the details of this maximization. By multiplying it over all distinct covariate values in the test set and taking the geometric mean, we obtain the marginal likelihood for the test set per observation, which is given by
The geometric mean accounts for the different sample sizes in different test sets.
Finally, we repeat the above process until all groups have been used as the test set and calculate the above marginal likelihood for each choice of test group. After averaging these marginal likelihoods, we raise the average to the power of the sample size to obtain the final marginal likelihood estimate for model (5).
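Collecting the steps above, the following is a log-domain sketch of the aggregation: a geometric mean within each fold, an average across folds, and a scaling back to the full sample. Treating the scaling power as the total sample size, and averaging the fold-level values on the likelihood scale, are assumptions of this sketch.

```python
import numpy as np
from scipy.special import logsumexp

def aggregate_log_marginal_likelihood(fold_logmls, n_obs):
    """fold_logmls[k] holds the maximized log conditional marginal likelihoods,
    one per distinct covariate value in the k-th test set.  Returns a log
    marginal likelihood estimate for the alternative model."""
    # geometric mean within a fold = arithmetic mean of the logs
    per_obs_logs = np.array([np.mean(logs) for logs in fold_logmls])
    # average across folds on the likelihood scale (done stably in the log
    # domain), then scale up by the sample size
    avg_log = logsumexp(per_obs_logs) - np.log(len(per_obs_logs))
    return n_obs * avg_log

# Toy usage with three folds
print(aggregate_log_marginal_likelihood([[-1.2, -0.8], [-1.0], [-0.9, -1.1, -1.3]], n_obs=6))
```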
The robustness of this approach is of interest. The approach would be more useful if we knew the upper and lower bounds of the superset model probability under the alternative model as its specification changes. Two points are discussed regarding robustness.
First, we consider robustness to the number of folds in the cross-validation. In the empirical analysis in Section 6, we use 10-fold cross-validation. When the number of folds varies from 2 to 15, the resulting probability does not change much for the diabetes dataset (see Figure 1).
Figure 1: Superset model probability for the diabetes dataset as the number of cross-validation folds varies from 2 to 15.
Second, it is important to check the robustness to the prior. One approach would be to use the ε-contamination class of priors (see Section 4.7.4 of Berger (1985)) and to show the sensitivity of the marginal likelihood, which is left for future analysis.
5 Maximizing the marginal likelihood
Theorem 1.
The local marginal likelihood function has extremal values at positive solutions to the cubic equation (20) given in the proof when ; at 0 when ; and at 0 and when .
By this theorem, we are able to set the value of the nuisance parameter that maximizes the marginal likelihood, instead of placing a prior on it.
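Operationally, the theorem reduces the empirical Bayes step to evaluating the log marginal likelihood at zero and at the positive real roots of a cubic. The sketch below assumes the caller supplies the cubic's coefficients (highest degree first) and the objective function, since the paper's expressions (16)-(20) are not reproduced here.

```python
import numpy as np

def empirical_bayes_maximizer(cubic_coeffs, log_marglik):
    """Return the candidate value of the nuisance parameter with the largest log
    marginal likelihood, among zero and the positive real roots of the cubic
    first-order condition."""
    candidates = [0.0]
    for root in np.roots(cubic_coeffs):
        if abs(root.imag) < 1e-10 and root.real > 0:
            candidates.append(float(root.real))
    return max(candidates, key=log_marglik)
```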
6 Illustrative example
The diabetes data (see Efron et al. (2004)) are used to illustrate our method. This dataset contains 442 observations. For the analysis below, we use the logarithm of the diabetes progression measure as the response and the remaining 10 variables as explanatory variables. The proposed method is applied, and the superset model probability for this dataset with 10-fold cross-validation is estimated to be 22.37%.
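The diabetes data of Efron et al. (2004) are distributed in several places; one convenient copy is in scikit-learn. The sketch below only sets up the response and covariates as described above (the superset-model computation itself is not shown).

```python
import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()          # 442 observations, 10 explanatory variables
X = data.data                   # covariates, standardized in scikit-learn's copy
y = np.log(data.target)         # logarithm of the diabetes progression measure
print(X.shape, y.shape)         # (442, 10) (442,)
```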
As discussed by MacEachern and Miyawaki (2022), the diabetes dataset appears to have been collected from at least two different sources. In particular, the precision of two explanatory variables (the blood pressure and the fourth blood serum measurement) consists of a mix of finer and coarser observations. When the data are divided into two groups by this precision, MacEachern and Miyawaki (2022) suggest that these two datasets have different characteristics.
This conclusion is also confirmed in terms of the superset model probability. When the dataset for observations with finer variables is used, it is 22.19%. When, on the other hand, that for observations with coarser variables is used, it is 10.72%. The superset model problem is more likely to occur with the former dataset than the latter one, which would be due to the difference in characteristics of these two datasets.
Appendix A Proof of Theorem 1
The local marginal likelihood function is characterized in the following two lemmas. The proof of the theorem directly follows from them.
Lemma 1.
The local marginal likelihood function converges to a finite value as the nuisance parameter approaches zero or infinity.
Proof.
Observe that
(12)
First, we consider the convergence when . Because and , , as . Hence, the local marginal likelihood function converges to zero in this case.
Next, we consider the case when . As , and . Because
(13)
we have
(14)
as . Thus, the local marginal likelihood function converges to
(15)
when . ∎
Lemma 2.
For , at least one positive extremal value of the local marginal likelihood function exists when , while at most one such value exists otherwise.
Proof.
The first order condition for maximizing the log local marginal likelihood function is
(16)
The left hand side is calculated as
(17)
where
(18)
(19)
Then, the extremal values are solutions to the following cubic equation:
(20)
When , . Thus, at least one solution to this cubic equation is positive when .
Remark 1.
Let be three solutions to this cubic equation. Then, it is factorized as . By comparing coefficients, . This implies that the solutions are nonzero and that one of three possibilities holds: (i) three distinct real solutions, (ii) one real solution and two complex conjugate solutions, or (iii) three real solutions of which at least two coincide. We then conclude that at least one solution is positive.
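The coefficient comparison in this remark is an instance of Vieta's formulas. For a general cubic with leading coefficient a, constant term d, and roots denoted λ₁, λ₂, λ₃ (notation introduced here only),

\[
  a\lambda^3 + b\lambda^2 + c\lambda + d
  \;=\; a(\lambda-\lambda_1)(\lambda-\lambda_2)(\lambda-\lambda_3)
  \quad\Longrightarrow\quad
  \lambda_1\lambda_2\lambda_3 \;=\; -\frac{d}{a},
\]

so the product of the roots equals −d/a. When this product is positive, each of the three cases listed above forces at least one root to be positive: a real root times a conjugate pair has the sign of the real root, and a positive product of three real roots requires at least one of them to be positive.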
When , the cubic equation reduces to
(21)
If , no extremal value exists for the local marginal likelihood function over . If, on the other hand, ,
(22)
is the extremal value within the same range. ∎
References
- Berger, J. O. (1985). Statistical Decision Theory and Bayesian Analysis (2nd ed.). Springer Series in Statistics. New York: Springer-Verlag.
- Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2), 407–499.
- Fan, J. and I. Gijbels (2003). Local Polynomial Modelling and Its Applications. Boca Raton: Chapman & Hall/CRC.
- MacEachern, S. N. and K. Miyawaki (2022). A regression approach to the two-dataset problem.
- Miyawaki, K. and S. N. MacEachern (n/a). Economic variable selection. Canadian Journal of Statistics, to appear.