
Automated Analysis of Experiments using Hierarchical Garrote

Wei-Yang Yu H. Milton Stewart School of Industrial and Systems Engineering,
Georgia Institute of Technology, Atlanta, GA 30332
V. Roshan Joseph Corresponding author: roshan@gatech.edu H. Milton Stewart School of Industrial and Systems Engineering,
Georgia Institute of Technology, Atlanta, GA 30332
Abstract

In this work, we propose an automatic method for the analysis of experiments that incorporates hierarchical relationships between the experimental variables. We use a modified version of the nonnegative garrote method for variable selection that can incorporate hierarchical relationships. The nonnegative garrote method requires a good initial estimate of the regression parameters to work well. To obtain the initial estimate, we use generalized ridge regression with the ridge parameters estimated from a Gaussian process prior placed on the underlying input-output relationship. The proposed method, called HiGarrote, is fast, easy to use, and requires no manual tuning. Analyses of several real experiments are presented to demonstrate its benefits over existing methods.


Keywords: Gaussian process; Generalized ridge regression; Nonnegative garrote; Variable selection.

1 Introduction

Traditionally, the modeling and analysis of experimental data are done using analysis of variance (ANOVA) and regression techniques (Box et al., 1978; Montgomery, 2017; Wu and Hamada, 2021). Regression is more general and the preferred method because some of the factors in the experiments can be continuous and the regression methodology allows us to fit the response surface using smooth polynomial models. When some factors are categorical, they can also be incorporated into the regression methodology by introducing dummy variables. However, because “orthogonality” is present in many experimental designs such as full factorial designs, the regression analysis of experimental data is much simpler than that of observational data. The $t$-statistics do not change when an effect is added or removed, and therefore the significant effects can be identified with ease. However, this simplicity has more or less disappeared as experimental design methods evolved over time and became more complex. Fractional factorial designs and nonregular designs have become more common in practice, where the effects can be aliased, necessitating more sophisticated regression techniques for the analysis of experiments.

There are certain characteristics of experiments that warrant a different treatment of regression analysis compared to observational studies. Experiments are usually expensive and therefore, the data are smaller in size. However, experiments are conducted in controlled environments, and therefore, higher-order effects such as interactions and nonlinear effects can be entertained in the modeling. The need to model high-order effects with small data immediately introduces certain challenges in the regression analysis. First, the total number of effects can exceed, often by a wide margin, the number of experimental runs. Second, experimenters over time have gained some understanding of the importance of various effects, which is formulated into certain hierarchical principles such as effect hierarchy and effect heredity (Wu and Hamada, 2021). The effect hierarchy principle states that lower-order effects are more important than higher-order effects. Under the strong heredity principle, an interaction is considered active only if both of its parent effects are active, whereas under weak heredity, only one of its parent effects needs to be active. These principles are supported by the findings of Li et al. (2006) and Ockuly et al. (2017), who provide comprehensive meta-analyses of two-level and response surface experiments, respectively. These principles need to be used intelligently to identify the significant effects from the small experimental data.

Analysis of regular designs can be done by first deriving all the aliasing relationships, estimating the aliased effects using regression, and identifying the significant aliased effects using $t$-statistics or half-normal plots and Lenth's method (Lenth, 1989). Then the effect hierarchy and heredity principles are utilized to break the aliases and identify the significant effects. This approach cannot be used for nonregular designs because of their complex aliasing relationships. Hamada and Wu (1992) proposed several forward selection strategies that incorporate the hierarchy and heredity principles. However, these strategies do not explore the complete space of models and therefore can miss important effects. To search the model space more thoroughly, Chipman et al. (1997) proposed a Bayesian variable selection method that integrates the stochastic search variable selection (SSVS) algorithm of George and McCulloch (1993) with the hierarchical priors described in Chipman (1996). One major issue with the method is the challenge of specifying priors, which contain several tuning parameters. In addition, the approach remains computationally intensive because the SSVS algorithm usually requires a large number of Markov chain Monte Carlo (MCMC) samples to attain convergence.

Several extensions of regularization-based variable selection methods that preserve effect heredity have been proposed. Yuan et al. (2007) presented a fast algorithm by modifying the least angle regression (LARS) technique (Efron et al., 2004). However, their method is not convenient when one group of factors follows strong heredity and another follows weak heredity. To provide a more flexible approach, Yuan et al. (2009) incorporated heredity constraints into the nonnegative garrote (NG) method of Breiman (1995). However, NG requires least squares estimates of the parameters to use as initial estimates, which cannot be obtained in fractionated experiments because the number of parameters to estimate exceeds the number of runs. Extensions of LASSO (Tibshirani, 1996) can be found in Choi et al. (2010), Bien et al. (2013), and Hao et al. (2018). However, LASSO is not consistent in terms of estimation and variable selection; that is, it lacks the oracle property of Fan and Li (2001). Vazquez et al. (2021) built a best-subset selection algorithm that can accommodate any user-specified restriction, including heredity constraints. Due to the nature of best-subset selection, the algorithm necessitates searching for the optimal model across all possible model sizes, making it computationally demanding. More recently, Singh and Stufken (2024) proposed an aggregated version of the Gauss-Dantzig Selector (Candes and Tao, 2007). However, their method is limited to two-level designs.

The main aim of this article is to develop an automatic method to analyze experimental data that requires little or no manual tuning. To achieve this goal, we will extend the nonnegative garrote method proposed in Yuan et al. (2009) and make it suitable for the analysis of experiments. We chose the nonnegative garrote because it possesses the oracle property (Zou, 2006) and can be automatically tuned (Xiong, 2010). The major hurdle in using the nonnegative garrote is the non-existence of least squares estimates. Yuan and Lin (2007) have shown that ridge regression (Hoerl and Kennard, 1970) can be used to obtain initial estimates when the least squares estimates do not exist. However, in this article, we will show that the usual ridge regression estimates do not perform well in the analysis of experiments because they do not obey the hierarchical relationships. A better initial estimate can be obtained using generalized ridge regression, which has a separate ridge parameter for each effect that can be chosen to reflect the hierarchical relationships. However, there are too many ridge parameters, which makes their specification challenging. By exploiting the connection between generalized ridge regression and Bayesian regression, we will use the functionally induced priors proposed in Joseph (2006) and Joseph and Delaney (2007) to obtain estimates of the ridge parameters that respect the hierarchy and heredity principles. Integrating this with the nonnegative garrote gives the proposed method, which we call Hierarchical Garrote, or HiGarrote for short.

The article is organized as follows. In Section 2, we first review the nonnegative garrote method and its limitations. We then develop our proposed method in Section 3. Several examples are provided in Section 4 to illustrate the advantages of the proposed method. We conclude the article with some remarks in Section 5.

2 Nonnegative Garrote and Its Limitations

Suppose that there are $p$ factors $\mathbf{x}=(x_{1},x_{2},\dots,x_{p})^{\prime}$. Denote the data as $\{\mathbf{X},\mathbf{Y}\}\coloneqq\{\mathbf{x}_{i},y_{i}\}_{i=1}^{n}$ and let the response $y_{i}\in\mathbb{R}$ for all $i$. Assume that

$$y_{i}=f(\mathbf{x}_{i})+\epsilon_{i},\qquad\epsilon_{i}\sim\mathcal{N}(0,\sigma^{2}),\qquad\text{for }i=1,\dots,n,\tag{1}$$

where $\epsilon_{i}$ stands for the random error resulting from uncontrollable factors and measurement error. Throughout the paper, we center all variables so that the observed mean is 0 and scale all predictors so that their sample standard deviations are the same.

2.1 Original Nonnegative Garrote

Consider approximating the latent function $f$ by a main-effects-only linear model,

$$f(\mathbf{x})\approx\beta_{1}x_{1}+\beta_{2}x_{2}+\dots+\beta_{p}x_{p}.$$

Let $\bm{\beta}=(\beta_{1},\beta_{2},\dots,\beta_{p})^{\prime}$. The least squares estimator is $\hat{\bm{\beta}}^{LS}=(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{Y}$. The original nonnegative garrote (NG) (Breiman, 1995) works by scaling the least squares estimate with shrinkage factors $\bm{\theta}(M)$ controlled by a tuning parameter $M$. The NG estimate is defined as $\hat{\bm{\beta}}^{NG}(M)=(\hat{\beta}_{1}^{NG}(M),\dots,\hat{\beta}_{p}^{NG}(M))^{\prime}$, where $\hat{\beta}_{j}^{NG}(M)=\hat{\theta}_{j}(M)\hat{\beta}_{j}^{LS}$. For $M\geq 0$, the shrinkage factors are obtained by solving a quadratic programming problem:

$$\hat{\bm{\theta}}(M)=\underset{\bm{\theta}}{\text{argmin}}\;\frac{1}{2}\sum_{i=1}^{n}\left\{y_{i}-(\theta_{1}\hat{\beta}_{1}^{LS}x_{i1}+\theta_{2}\hat{\beta}_{2}^{LS}x_{i2}+\dots+\theta_{p}\hat{\beta}_{p}^{LS}x_{ip})\right\}^{2}\quad\text{subject to }\sum_{j}\theta_{j}\leq M\text{ and }\theta_{j}\geq 0\;\forall j.\tag{2}$$

By selecting $M$ suitably, some elements of $\hat{\bm{\beta}}^{NG}(M)$ can be exactly zero. Since $\hat{\beta}_{j}^{LS}\neq 0$ with probability 1, $x_{j}$ is considered active if and only if $\hat{\theta}_{j}(M)>0$. In other words, $\hat{\theta}_{j}(M)$ serves as an indicator of including $x_{j}$ in the selected model.
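To make the optimization concrete, the following is a minimal sketch in R of how (2) can be solved with off-the-shelf quadratic programming. It assumes $n>p$ so that the least squares estimates exist; the function name ng_shrinkage and the small jitter added for numerical stability are our own choices, not part of the original method.

    # Minimal sketch of solving (2) with the quadprog package; assumes n > p so
    # that the least squares estimates exist, and that X and y are centered/scaled.
    library(quadprog)

    ng_shrinkage <- function(X, y, M) {
      beta_ls <- coef(lm.fit(X, y))              # least squares estimates
      Z <- sweep(X, 2, beta_ls, `*`)             # Z[i, j] = x_ij * betahat_j
      p <- ncol(Z)
      Dmat <- crossprod(Z) + 1e-8 * diag(p)      # small jitter for positive definiteness
      dvec <- drop(crossprod(Z, y))
      # Constraint columns: sum(theta) <= M, and theta_j >= 0 for all j
      Amat <- cbind(-rep(1, p), diag(p))
      bvec <- c(-M, rep(0, p))
      solve.QP(Dmat, dvec, Amat, bvec, meq = 0)$solution
    }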

2.2 Nonnegative Garrote with Effect Heredity

In several problems, main-effects models are inadequate to approximate the underlying function $f$ in (1). To improve the approximation, we can include the quadratic effects and two-factor interactions:

$$f(\mathbf{x})\approx\beta_{1}x_{1}+\dots+\beta_{p}x_{p}+\beta_{11}x_{1}^{2}+\beta_{12}x_{1}x_{2}+\dots+\beta_{pp}x_{p}^{2}.$$

The shrinkage factors in the nonnegative garrote are obtained by

$$\hat{\bm{\theta}}(M)=\underset{\bm{\theta}}{\text{argmin}}\;\frac{1}{2}\sum_{i=1}^{n}\left\{y_{i}-(\theta_{1}\hat{\beta}_{1}^{LS}x_{i1}+\dots+\theta_{p}\hat{\beta}_{p}^{LS}x_{ip}+\theta_{11}\hat{\beta}_{11}^{LS}x_{i1}^{2}+\dots+\theta_{pp}\hat{\beta}_{pp}^{LS}x_{ip}^{2})\right\}^{2}\quad\text{subject to }\sum_{j}\theta_{j}\leq M\text{ and }\theta_{j}\geq 0,\;j=1,\dots,p,11,12,\dots,pp.\tag{3}$$

In this optimization, it is possible to select a model that includes an interaction term while excluding both of its main effects, that is, $\hat{\theta}_{ij}(M)>0$ and $\hat{\theta}_{i}(M)=\hat{\theta}_{j}(M)=0$. To preserve the effect heredity principle in the nonnegative garrote, Yuan et al. (2009) proposed adding constraints that enforce strong or weak heredity among the effects.

Under the strong heredity principle, we must have $\theta_{i}>0$ and $\theta_{j}>0$ if $\theta_{ij}>0$. This can be enforced by requiring $\theta_{ij}\leq\min(\theta_{i},\theta_{j})$, which is equivalent to two convex constraints

$$\theta_{ij}\leq\theta_{i}\text{ and }\theta_{ij}\leq\theta_{j}\quad\text{for all }i,j.\tag{4}$$

Thus, if $\hat{\theta}_{ij}(M)>0$, (4) will guarantee that $\hat{\theta}_{i}(M)>0$ and $\hat{\theta}_{j}(M)>0$, ensuring that $x_{i}$ and $x_{j}$ are part of the chosen model.

Under the weak heredity principle, we must have at least one of $\theta_{i}$ or $\theta_{j}$ larger than 0 if $\theta_{ij}>0$. This can be enforced by requiring $\theta_{ij}\leq\max(\theta_{i},\theta_{j})$. However, this constraint is nonconvex, and the optimization in (3) cannot be solved using quadratic programming. To simplify the optimization, Yuan et al. (2009) proposed a convex relaxation of the constraint:

$$\theta_{ij}\leq\theta_{i}+\theta_{j}\quad\text{for all }i,j.\tag{5}$$

Thus, if $\hat{\theta}_{ij}(M)>0$, (5) will ensure that at least one of $\hat{\theta}_{i}(M)>0$ or $\hat{\theta}_{j}(M)>0$ holds, thereby guaranteeing that at least one of $x_{i}$ or $x_{j}$ is included in the selected model.
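The heredity constraints simply add linear inequality columns to the quadratic program. Below is a sketch of how they can be assembled in R; the helper name and the 3-column index matrix pairs (whose rows give the positions of $\theta_{i}$, $\theta_{j}$, and $\theta_{ij}$ in the coefficient vector) are bookkeeping assumptions of ours.

    # Sketch: extra constraint columns for (4) (strong) or (5) (weak) heredity,
    # to be appended to Amat in a solver like quadprog (the matching bvec
    # entries are all 0). `pairs` is an assumed index matrix with rows (i, j, ij).
    heredity_constraints <- function(pairs, P, type = c("strong", "weak")) {
      type <- match.arg(type)
      cols <- lapply(seq_len(nrow(pairs)), function(r) {
        i <- pairs[r, 1]; j <- pairs[r, 2]; ij <- pairs[r, 3]
        if (type == "strong") {      # theta_i - theta_ij >= 0 and theta_j - theta_ij >= 0
          a1 <- a2 <- numeric(P)
          a1[c(i, ij)] <- c(1, -1)
          a2[c(j, ij)] <- c(1, -1)
          cbind(a1, a2)
        } else {                     # theta_i + theta_j - theta_ij >= 0
          a <- numeric(P)
          a[c(i, j, ij)] <- c(1, 1, -1)
          cbind(a)
        }
      })
      do.call(cbind, cols)
    }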

2.3 Nonnegative Garrote in Experimental Analysis

It is more common in the analysis of experiments (Wu and Hamada, 2021) to approximate $f$ using a linear model with both main effects (linear and quadratic) and their cross-product terms (two-factor interactions):

$$f(\mathbf{x})\approx\beta_{1}u_{1}+\beta_{2}u_{2}+\dots+\beta_{P}u_{P},\tag{6}$$

where $u_{1}=x_{1},\dots,u_{p}=x_{p},u_{p+1}=x_{1}^{2},\dots,u_{2p}=x_{p}^{2},\dots,u_{P}=x_{p-1}^{2}x_{p}^{2}$, and $P=2p+\binom{2p}{2}$. For an experiment, the data size $n$ is typically much smaller than the number of parameters $P$, making it impossible to obtain the least squares estimate required in the original nonnegative garrote. Therefore, Yuan and Lin (2007) suggested using ridge regression to obtain an initial estimate in NG.

Let $\mathbf{U}_{P}$ be the $n\times P$ model matrix. The ridge regression estimate is given by

$$\hat{\bm{\beta}}^{R}_{P}=(\mathbf{U}^{\prime}_{P}\mathbf{U}_{P}+\lambda\mathbf{I}_{P})^{-1}\mathbf{U}^{\prime}_{P}\mathbf{Y},\tag{7}$$

where $\lambda$ is a tuning parameter. Then, the NG estimate is defined as $\hat{\beta}_{j}^{NG}=\hat{\theta}_{j}(M)\hat{\beta}_{j}^{R}$, $j=1,\dots,P$, where the shrinkage factors are obtained as

$$\hat{\bm{\theta}}(M)=\underset{\bm{\theta}}{\text{argmin}}\;\frac{1}{2}\sum_{i=1}^{n}\left\{y_{i}-(\theta_{1}\hat{\beta}_{1}^{R}u_{i1}+\theta_{2}\hat{\beta}_{2}^{R}u_{i2}+\dots+\theta_{P}\hat{\beta}_{P}^{R}u_{iP})\right\}^{2}\quad\text{subject to }\sum_{j}\theta_{j}\leq M,\;\theta_{j}\geq 0,\;j=1,\dots,P,\text{ and the heredity constraints in (4) or (5)}.\tag{8}$$

To tune the hyperparameter $M$, we employ the generalized cross-validation (GCV) criterion introduced by Golub et al. (1979), which was extended by Xiong (2010) to handle NG with a ridge initial estimate. The GCV criterion is defined as

$$\text{GCV}(\hat{\bm{\theta}}(M))=\frac{\left\lVert\mathbf{Y}-\mathbf{U}_{P}\hat{\bm{\beta}}^{NG}\right\rVert^{2}_{2}}{n(1-d(\hat{\bm{\beta}}^{NG})/n)^{2}},\tag{9}$$

where $d(\hat{\bm{\beta}}^{NG})$ is the degrees of freedom of $\hat{\bm{\beta}}^{NG}$. The degrees of freedom is computed as

$$d(\hat{\bm{\beta}}^{NG})=\sum_{i=1}^{P}\hat{\theta}_{i}(M)w_{i},\tag{10}$$

where $w_{i}$ is the $(i,i)$ entry of the $P\times P$ matrix $(\mathbf{U}_{P}^{\prime}\mathbf{U}_{P}+\lambda\mathbf{I}_{P})^{-1}\mathbf{U}_{P}^{\prime}\mathbf{U}_{P}$. After constructing the solution path across a series of $M\in[0.1,n-1]$, the optimal $M$ is selected by minimizing the GCV criterion,

$$M^{\text{GCV}}=\underset{M}{\text{argmin}}\;\text{GCV}(\hat{\bm{\theta}}(M)).$$

The upper bound of $M$ is set to $n-1$ to ensure that the number of selected terms is not too high compared to the degrees of freedom in the data.
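A sketch of the resulting tuning loop is given below; ng_shrinkage_ridge is an assumed helper that solves (8) for a given $M$, and w holds the diagonal entries $w_{i}$ defined above.

    # Sketch of selecting M by GCV as in (9)-(10); ng_shrinkage_ridge is an
    # assumed helper solving (8), and w holds the diagonal entries w_i.
    select_M <- function(U, y, beta_init, w, M_grid) {
      n <- length(y)
      gcv <- sapply(M_grid, function(M) {
        theta <- ng_shrinkage_ridge(U, y, beta_init, M)   # shrinkage factors from (8)
        beta  <- theta * beta_init
        df    <- sum(theta * w)                           # degrees of freedom (10)
        sum((y - U %*% beta)^2) / (n * (1 - df / n)^2)    # GCV criterion (9)
      })
      M_grid[which.min(gcv)]
    }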

However, we will show that the ridge regression estimate is inadequate in experimental data analysis. Consider a toy example with the 12-run Plackett-Burman (PB) design. The response is generated by the true model $y=20A+10AB+5AC$. See Table 1 for the design matrix and data. We use the R package glmnet to perform the ridge regression and apply the weak heredity constraint when solving the quadratic programming problem (8). This approach identifies only the effect $A$ with $\hat{\beta}_{A}^{NG}=14.938$ and misses the two interactions $AB$ and $AC$ that are present in the true model. The issue arises from the underestimation of $\hat{\bm{\beta}}^{R}_{P}$. Due to the presence of numerous independent and irrelevant effects in the model matrix, we found that $\lambda$ is often overestimated, masking the effects of important terms. In the next section, we will propose an improved initial estimate for NG to overcome this issue.

Table 1: 12-Run Plackett-Burman Design Matrix and Data
Run A B C D E F G H I J K Y
1 1 1 -1 1 1 1 -1 -1 -1 1 -1 25
2 1 -1 1 1 1 1 -1 -1 1 1 -1 15
3 -1 1 1 1 1 -1 -1 -1 1 1 1 -35
4 1 1 1 -1 -1 -1 -1 1 1 1 -1 35
5 1 1 -1 1 -1 1 -1 1 -1 -1 1 25
6 1 -1 -1 1 1 -1 1 -1 1 1 1 5
7 -1 -1 -1 1 -1 1 -1 1 1 1 1 -5
8 -1 1 -1 1 -1 1 1 1 1 1 -1 -15
9 -1 -1 1 1 1 1 1 1 1 -1 -1 -25
10 1 -1 1 1 -1 -1 -1 1 1 -1 -1 15
11 -1 1 1 1 1 -1 1 -1 -1 1 1 -35
12 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -5
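As a rough illustration of the ridge-based analysis above, the sketch below builds the full main-effect-plus-interaction model matrix for Table 1 and obtains the ridge initial estimate; pb12 is assumed to be a data frame holding the $\pm 1$ columns A through K, y the response, and the cross-validation choice of $\lambda$ is our own simplification.

    # Sketch of the ridge initial estimate for the toy example; pb12 is assumed
    # to be a data frame with the +/-1 columns A..K of Table 1, y the response.
    library(glmnet)
    U <- model.matrix(~ .^2, data = pb12)[, -1]    # 11 main effects + 55 two-factor interactions
    fit <- cv.glmnet(U, y, alpha = 0, nfolds = 4)  # ridge regression; lambda by cross-validation
    beta_ridge <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]  # initial estimates for (8)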

3 Methodology

The ridge regression estimate in (7) is the solution to the following optimization problem:

$$\min_{\bm{\beta}_{P}}(\mathbf{Y}-\mathbf{U}_{P}\bm{\beta}_{P})^{\prime}(\mathbf{Y}-\mathbf{U}_{P}\bm{\beta}_{P})+\lambda\sum_{i=1}^{P}\beta_{i}^{2}.\tag{11}$$

Clearly, ridge regression gives equal importance to all the effects, which is undesirable because the effects are likely to follow the hierarchy and heredity principles. One approach to overcome this issue is to use generalized ridge regression (Hoerl and Kennard, 1970):

$$\min_{\bm{\beta}_{P}}(\mathbf{Y}-\mathbf{U}_{P}\bm{\beta}_{P})^{\prime}(\mathbf{Y}-\mathbf{U}_{P}\bm{\beta}_{P})+\sum_{i=1}^{P}\lambda_{i}\beta_{i}^{2},\tag{12}$$

where the $\lambda_{i}$'s can be chosen to obey the hierarchical relationships. Although generalized ridge regression is known to perform better than ridge regression in high-dimensional problems (Ishwaran and Rao, 2014), tuning $P\gg n$ parameters is an arduous task. Moreover, it is not clear how to enforce the hierarchical relationships on the $P$ ridge parameters. We will propose an idea to do this.

First note that the generalized ridge regression solution can be viewed as the posterior mean of a Bayesian linear regression with prior:

$$\bm{\beta}_{P}\sim\mathcal{N}(\mathbf{0},\tau^{2}\mathbf{R}),\tag{13}$$

where $\mathbf{R}$ is a diagonal matrix with $\mathbf{R}_{ii}=\sigma^{2}/(\tau^{2}\lambda_{i})$. Thus, we only need to determine how to specify the $\mathbf{R}$ matrix in (13). Joseph (2006) proposed an idea to do this by first postulating a Gaussian process (GP) prior (Santner et al., 2018) on the underlying function $f(\mathbf{x})$ in (1) and then inducing a prior for the linear model parameters. It is shown in Joseph (2006) and Joseph and Delaney (2007) that such priors satisfy the effect hierarchy and heredity principles under a product correlation structure. Moreover, the prior usually contains only $p$ unknown parameters, which is much smaller than $P$, and there are well-developed methods to estimate them. This is explained in the next section.

3.1 Functionally Induced Prior

Let the factor $x_{j}$ have $m_{j}$ levels in the experiment, $j=1,\ldots,p$. Consider the model in (6). Although it has only the main effects (linear and quadratic) and their cross-product terms (two-factor interactions, 2fi), it is easier to construct the prior for all possible effects. Thus, we approximate $f$ by a linear model including main effects and all the interactions, i.e.,

$$f(\mathbf{x})\approx\sum_{j=0}^{q-1}\beta_{j}u_{j},\tag{14}$$

where $q=\prod_{i=1}^{p}m_{i}$ and $u_{0}=1$. To avoid confusion, we let $\bm{\beta}_{q}=(\beta_{0},\beta_{1},\dots,\beta_{q-1})^{\prime}$.

The main effects of the factor $x_{j}$ can be described by $m_{j}-1$ dummy variables, and the interactions can be expressed as the products of these dummy variables. For example, consider an experiment with two factors, each at three levels. Let $u_{1}$ and $u_{2}$ be the main effects of the first factor and $u_{3}$ and $u_{4}$ be the main effects of the second factor. The interaction terms are $u_{5}=u_{1}u_{3}$, $u_{6}=u_{1}u_{4}$, $u_{7}=u_{2}u_{3}$, and $u_{8}=u_{2}u_{4}$. Many popular coding systems can be used to define dummy variables, such as treatment coding, Helmert coding, and orthogonal polynomial coding (Wu and Hamada, 2021; Harville, 1998).
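For instance, base R contrasts can be scaled to produce the coding matrices used in this paper; the sketch below is ours, with scaling constants chosen to match the $\mathbf{U}_{j}$ matrices displayed later.

    # Sketch of two coding systems in base R; the scalings match the U_j matrices
    # shown below (orthogonal polynomial) and in the Appendix (Helmert).
    cbind(1, sqrt(3) * contr.poly(3))              # orthogonal polynomial coding, 3 levels
    cbind(1, sqrt(3/2) * contr.helmert(3)[, 1],    # scaled Helmert coding, 3 levels
             sqrt(1/2) * contr.helmert(3)[, 2])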

The idea of functionally induced priors (Joseph, 2006) is to postulate a GP prior for $f$ and use it to induce the prior for $\bm{\beta}_{q}$. Thus, assume

$$f(\mathbf{x})\sim\text{GP}(0,\nu^{2}\psi),\tag{15}$$

where $\nu^{2}\psi$ is the covariance function of the Gaussian process, defined as $\text{cov}\{f(\mathbf{x}),f(\mathbf{x}+\mathbf{h})\}=\nu^{2}\psi(\mathbf{h})$. Thus, we have $\mathbf{f}=\mathbf{U}_{q}\bm{\beta}_{q}$, where $\mathbf{f}$ is the vector of function values for the full factorial design and $\mathbf{U}_{q}$ the $q\times q$ model matrix corresponding to (14). Note that $\mathbf{f}$ has a multivariate normal distribution with $E(\mathbf{f})=\mathbf{0}$ and $\text{var}(\mathbf{f})=\nu^{2}\bm{\Psi}$, where $\bm{\Psi}$ is the corresponding correlation matrix. Since $\bm{\beta}_{q}=\mathbf{U}_{q}^{-1}\mathbf{f}$, we obtain

$$\bm{\beta}_{q}\sim N(\bm{0},\nu^{2}\mathbf{U}^{-1}_{q}\bm{\Psi}(\mathbf{U}_{q}^{-1})^{\prime}).\tag{16}$$

Assume that the correlation function $\psi$ has a product correlation structure, i.e.,

$$\psi(\mathbf{h})=\prod_{j=1}^{p}\psi_{j}(h_{j}).$$

Under the product correlation structure, Joseph and Delaney (2007) have shown that

$$\text{var}(\bm{\beta}_{q})=\nu^{2}\bigotimes_{j=1}^{p}\mathbf{U}_{j}^{-1}\mathbf{\Psi}_{j}(\mathbf{U}_{j}^{-1})^{\prime},\tag{17}$$

where $\bigotimes$ denotes the Kronecker product, $\mathbf{U}_{j}$ is the model matrix for factor $x_{j}$, and $\bm{\Psi}_{j}$ the corresponding correlation matrix. For example, suppose that factor $x_{j}$ is a three-level factor with levels 1, 2, and 3. Then the model matrix using orthogonal polynomial coding is

$$\mathbf{U}_{j}=\begin{pmatrix}1&-\sqrt{\frac{3}{2}}&\sqrt{\frac{1}{2}}\\ 1&0&-\sqrt{2}\\ 1&\sqrt{\frac{3}{2}}&\sqrt{\frac{1}{2}}\end{pmatrix},$$

and the correlation matrix is

$$\bm{\Psi}_{j}=\begin{pmatrix}1&\psi_{j}(1)&\psi_{j}(2)\\ \psi_{j}(1)&1&\psi_{j}(1)\\ \psi_{j}(2)&\psi_{j}(1)&1\end{pmatrix}.$$

The benefit of this structure is that we can select the most appropriate coding system and correlation function for each factor according to our modeling requirements. Due to its popularity, we use the Gaussian correlation function for each factor (Santner et al., 2018): $\psi_{j}(h_{j})=\exp(-h_{j}^{2}/\theta_{j}^{2})$, $0<\theta_{j}<\infty$. Let $\rho_{j}=\exp(-1/\theta_{j}^{2})$. Then, the correlation function can be written as

$$\psi_{j}(h_{j})=\rho_{j}^{h_{j}^{2}},\quad 0<\rho_{j}<1.\tag{18}$$

Although the form of the correlation function is the same for all factors, the definition of $h_{j}$ should depend on the type of factor.

For quantitative factors, we define $h_{j}=|\mathbf{x}_{ij}-\mathbf{x}_{kj}|$ for two runs $i$ and $k$. Suppose that there are $p$ quantitative three-level factors and their model matrices are encoded using orthogonal polynomial coding. Then, (16) simplifies to (Joseph and Delaney, 2007):

$$\beta_{i}\sim\mathcal{N}\left(0,\tau^{2}\prod_{j=1}^{p}r_{j_{l}}^{l_{ij}}r_{j_{q}}^{q_{ij}}\right),\quad i=0,1,\dots,3^{p}-1,\tag{19}$$

where

$$\tau^{2}=\frac{\nu^{2}}{3^{2p}}\prod_{j=1}^{p}(3+4\rho_{j}+2\rho_{j}^{4}),\quad r_{j_{l}}=\frac{3-3\rho_{j}^{4}}{3+4\rho_{j}+2\rho_{j}^{4}},\quad r_{j_{q}}=\frac{3-4\rho_{j}+\rho_{j}^{4}}{3+4\rho_{j}+2\rho_{j}^{4}},$$

$l_{ij}=1$ if $\beta_{i}$ includes the linear effect of factor $j$ and 0 otherwise, and $q_{ij}=1$ if $\beta_{i}$ includes the quadratic effect of factor $j$ and 0 otherwise. Although the prior for $\bm{\beta}_{q}$ does include covariance terms, we do not need them to construct the diagonal matrix $\mathbf{R}$ in (13).

Since $r_{j_{l}},r_{j_{q}}\in(0,1)$, the variance of an interaction effect will be smaller than that of its parent effects. In other words, the lower-order effects are likely to be more important than the higher-order effects, and thus the priors satisfy the effect hierarchy principle. Moreover, since the variance of an interaction effect is proportional to the product of the variances of its parent effects, the priors also satisfy the effect heredity principle.
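The construction in (17) and (19) is easy to verify numerically. The sketch below computes the induced prior variances for two quantitative three-level factors; the $\rho$ values are arbitrary illustrative choices of ours.

    # Sketch: induced prior variances for p = 2 quantitative three-level factors.
    rho <- c(0.6, 0.3)                                  # arbitrary illustrative values
    U   <- cbind(1, sqrt(3) * contr.poly(3))            # U_j from the example above
    Psi <- function(r) r^(abs(outer(1:3, 1:3, "-"))^2)  # Gaussian correlation (18)
    blk <- lapply(rho, function(r) solve(U) %*% Psi(r) %*% t(solve(U)))
    V   <- Reduce(kronecker, blk)                       # var(beta_q)/nu^2 as in (17)
    round(diag(V), 4)   # diagonal reproduces the prior variances in (19), up to nu^2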

For qualitative factors, where the relative distances between levels cannot be quantified, Joseph and Delaney (2007) define $h_{j}=0$ if $\mathbf{x}_{ij}=\mathbf{x}_{kj}$ and 1 otherwise, i.e., $h_{j}=H(\mathbf{x}_{ij},\mathbf{x}_{kj})$, where $H(\cdot,\cdot)$ is the Hamming distance. However, this definition assigns equal importance to all the main effects of $x_{j}$ even if not all of them are active. Since the main effects are represented by dummy variables, we define the relative distance for each dummy variable of $x_{j}$ as $h_{j_{d}}=H(\mathbf{x}_{ij_{d}},\mathbf{x}_{kj_{d}})$, $d=1,\dots,m_{j}-1$, where the subscript $d$ stands for the $d$th dummy variable. Then, according to (18), the corresponding correlation function is $\psi_{j_{d}}(h_{j_{d}})=\rho_{j_{d}}^{h_{j_{d}}^{2}}$, $0<\rho_{j_{d}}<1$.

Suppose that there are $p$ qualitative three-level factors and their model matrices are encoded according to Helmert coding. Then, as shown in the Appendix, the marginals of the prior in (16) become

$$\beta_{i}\sim\mathcal{N}\left(0,\tau^{2}\prod_{j=1}^{p}r_{j_{1}}^{m_{ij_{1}}}r_{j_{2}}^{m_{ij_{2}}}\right),\quad i=0,1,\dots,3^{p}-1,\tag{20}$$

where

$$\tau^{2}=\frac{\nu^{2}}{3^{2p}}\prod_{j=1}^{p}(3+2\rho_{j_{1}}+4\rho_{j_{1}}\rho_{j_{2}}),\quad r_{j_{1}}=\frac{3(1-\rho_{j_{1}})}{3+2\rho_{j_{1}}+4\rho_{j_{1}}\rho_{j_{2}}},\quad r_{j_{2}}=\frac{3+\rho_{j_{1}}-4\rho_{j_{1}}\rho_{j_{2}}}{3+2\rho_{j_{1}}+4\rho_{j_{1}}\rho_{j_{2}}},$$

$m_{ij_{1}}=1$ if $\beta_{i}$ includes the first main effect of factor $j$ and 0 otherwise, and $m_{ij_{2}}=1$ if $\beta_{i}$ includes the second main effect of factor $j$ and 0 otherwise. As in the case of quantitative factors, effect hierarchy and heredity are built into these priors.

3.2 Initial Estimate

Our aim is to obtain an initial estimate of $\bm{\beta}_{P}$ from $\mathbf{Y}=\mathbf{U}_{P}\bm{\beta}_{P}+\bm{\epsilon}$, where $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\sigma^{2}\mathbf{I})$ and $\bm{\beta}_{P}\sim\mathcal{N}(\mathbf{0},\tau^{2}\mathbf{R})$. Here $\mathbf{R}$ is a diagonal matrix, constructed as in (19) and (20). The posterior mean of $\bm{\beta}_{P}$ is given by

$$\hat{\bm{\beta}}_{P}=\left(\mathbf{U}_{P}^{\prime}\mathbf{U}_{P}+\frac{\sigma^{2}}{\tau^{2}}\mathbf{R}^{-1}\right)^{-1}\mathbf{U}_{P}^{\prime}\mathbf{Y}=\tau^{2}\mathbf{R}\mathbf{U}_{P}^{\prime}\left(\tau^{2}\mathbf{U}_{P}\mathbf{R}\mathbf{U}_{P}^{\prime}+\sigma^{2}\mathbf{I}_{n}\right)^{-1}\mathbf{Y}.\tag{21}$$

To obtain the initial estimate of $\bm{\beta}_{P}$ using (21), several parameters need to be specified: $\sigma^{2}$, $\tau^{2}$, and the $\bm{\rho}$ that defines $\mathbf{R}$. Let $\bm{\rho}=(\bm{\rho}_{1}^{\prime},\dots,\bm{\rho}_{p}^{\prime})^{\prime}$, where each $\bm{\rho}_{j}$ can be a vector. A good estimate of $\sigma^{2}$ can be obtained if the experiment contains replicates. Suppose that there are $m$ replicates in each run and let $s_{i}^{2}$ be the sample variance of the $i$th run, $i=1,\dots,n$. Then $\hat{\sigma}^{2}=\sum_{i=1}^{n}s_{i}^{2}/n$ is a good estimate of $\sigma^{2}$. However, if the experiment is unreplicated, there is no information to estimate $\sigma^{2}$. Instead of assuming that there is no noise, we assume that the noise contributes at least 1% of the variation in the response, which prevents overfitting. Therefore, we reparameterize (21) as

$$\hat{\bm{\beta}}_{P}=\frac{\tau^{2}}{\nu^{2}}\mathbf{R}\mathbf{U}_{P}^{\prime}\left[\frac{\tau^{2}}{\nu^{2}}\mathbf{U}_{P}\mathbf{R}\mathbf{U}_{P}^{\prime}+\frac{\lambda}{1-\lambda}\mathbf{I}_{n}\right]^{-1}\mathbf{Y},\tag{22}$$

where $\lambda=\frac{\sigma^{2}}{\sigma^{2}+\nu^{2}}\geq 0.01$. This form can also be used for replicated experiments as long as we replace $\mathbf{Y}$ with the sample means and divide $\mathbf{I}_{n}$ by $m$.
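A direct implementation of (22) is a few lines of linear algebra. In the sketch below, Rdiag, t2n2 ($=\tau^{2}/\nu^{2}$), and lambda are assumed to be supplied by the steps described next.

    # Sketch of the initial estimate (22). U_P is the n x P model matrix, Rdiag
    # the diagonal of R from (19)/(20), t2n2 = tau^2/nu^2, and lambda the noise
    # ratio; all are assumed to come from the surrounding steps.
    initial_estimate <- function(U_P, Y, Rdiag, t2n2, lambda) {
      n   <- nrow(U_P)
      RUt <- Rdiag * t(U_P)                                         # R U_P'
      K   <- t2n2 * U_P %*% RUt + (lambda / (1 - lambda)) * diag(n)
      drop(t2n2 * RUt %*% solve(K, Y))                              # beta_hat_P
    }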

Joseph and Delaney (2007) have shown that if we use orthogonal coding for each factor,

$$\frac{\tau^{2}}{\nu^{2}}=\frac{\prod_{j=1}^{p}\text{sum}(\mathbf{\Psi}_{j})}{q^{2}},$$

where $\text{sum}(\mathbf{\Psi}_{j})$ is the sum of all elements in $\mathbf{\Psi}_{j}$. Therefore, we only need to focus on the specification of $\bm{\rho}$ and $\lambda$. These hyperparameters can be estimated by minimizing the negative log-likelihood of $\mathbf{Y}$ (Santner et al., 2018). Thus,

$$\hat{\lambda},\;\hat{\bm{\rho}}=\underset{\lambda,\bm{\rho}}{\text{argmin}}\;\log\hat{\nu}^{2}+\frac{1}{n}\log\det\left(\bm{\Psi}_{n}+\frac{\lambda}{1-\lambda}\mathbf{I}_{n}\right),\qquad\hat{\nu}^{2}=\frac{1}{n}\mathbf{Y}^{\prime}\left(\bm{\Psi}_{n}+\frac{\lambda}{1-\lambda}\mathbf{I}_{n}\right)^{-1}\mathbf{Y},\tag{23}$$

where $\bm{\Psi}_{n}$ is the correlation matrix of the $n$ runs.
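The objective in (23) translates directly into code; in the sketch below, Psi_n is an assumed function that builds the $n\times n$ correlation matrix of the design from $\bm{\rho}$.

    # Sketch of the negative log-likelihood objective in (23); Psi_n is an
    # assumed function building the n x n correlation matrix from rho.
    neg_loglik <- function(par, Y, Psi_n) {
      k      <- length(par) - 1
      rho    <- par[1:k]
      lambda <- par[k + 1]
      n      <- length(Y)
      A      <- Psi_n(rho) + (lambda / (1 - lambda)) * diag(n)
      nu2    <- drop(crossprod(Y, solve(A, Y))) / n            # nu^2 hat in (23)
      log(nu2) + as.numeric(determinant(A, logarithm = TRUE)$modulus) / n
    }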

Unlike the usual applications of GP modeling, we are interested not only in accurate prediction of the output but also in the selection of important effects. Therefore, it is very important to obtain the global optimum of (23). This is not easy because when $n$ is small, several models can fit the data well, and therefore the likelihood can be multi-modal with several local optima. While numerous global optimization algorithms are available for this purpose, we need one that requires only a few function evaluations and is robust to the choice of starting points. Therefore, we adopt a gradient-based multi-start global optimization strategy.

For numerical stability, we put mild constraints on the hyperparameters: $\rho_{i}\in[10^{-15},.999]$ for all $i=1,\ldots,k$, and $\lambda\in[.01,.99]$. First, we generate a space-filling design with $k+1$ starting points in the feasible region; here, we use the MaxPro design of Joseph et al. (2015). Second, for each starting point, we search for a local optimum using a local optimization algorithm; here, we use the Method of Moving Asymptotes (Svanberg, 2002). Among those local optima, the one with the lowest negative log-likelihood value is used as the estimate of the hyperparameters. Importantly, we choose a derivative-based approach not only because gradients help reduce the number of function evaluations but also because it gives more robust results than derivative-free algorithms.
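A simplified sketch of the multi-start search is given below. Uniform random starting points stand in for the MaxPro design and optim's L-BFGS-B stands in for the Method of Moving Asymptotes, so this is only a rough stand-in for the actual procedure.

    # Simplified multi-start sketch: random starts stand in for the MaxPro
    # design, and L-BFGS-B for the Method of Moving Asymptotes; neg_loglik is
    # the objective sketched above.
    multistart_fit <- function(Y, Psi_n, k, n_starts = k + 1) {
      lower  <- c(rep(1e-15, k), 0.01)
      upper  <- c(rep(0.999, k), 0.99)
      starts <- matrix(runif(n_starts * (k + 1), lower, upper),
                       nrow = n_starts, byrow = TRUE)
      fits <- lapply(seq_len(n_starts), function(s)
        optim(starts[s, ], neg_loglik, Y = Y, Psi_n = Psi_n,
              method = "L-BFGS-B", lower = lower, upper = upper))
      fits[[which.min(vapply(fits, `[[`, numeric(1), "value"))]]$par
    }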

3.3 HiGarrote

Once the hyperparameters are estimated from (23), the shrinkage factors and the regression coefficients can be obtained as in (8) using the initial estimates from (22). The tuning parameter $M$ in (8) can be selected as described in Section 2.3. A solution path across a range of $M\in[0.1,0.3(n-1)]$ is first constructed. The optimal $M$ is then selected by minimizing the GCV criterion (9), where the degrees of freedom is calculated as in (10) with

$$w_{i}=\frac{\tau^{2}}{\nu^{2}}\left(\mathbf{R}\mathbf{U}_{P}^{\prime}\left[\frac{\tau^{2}}{\nu^{2}}\mathbf{U}_{P}\mathbf{R}\mathbf{U}_{P}^{\prime}+\frac{\lambda}{1-\lambda}\mathbf{I}_{n}\right]^{-1}\mathbf{U}_{P}\right)_{ii}.$$

The upper bound of $M$ used in the optimization is based on the findings of Box and Meyer (1986) and Wu and Hamada (2021, p. 424), which indicate that the proportion of active effects (with the intercept term included) typically ranges between 0.13 and 0.27 of the design's run size.

For replicated experiments, where a reliable estimate of $\sigma^{2}$ can be obtained, the shrinkage factors can be directly computed by solving (Xiong, 2010):

$$\hat{\bm{\theta}}=\underset{\bm{\theta}}{\text{argmin}}\;\frac{1}{2}\left\lVert\mathbf{Y}-\mathbf{U}_{P}\hat{\bm{\beta}}^{NG}\right\rVert^{2}_{2}+\hat{\sigma}^{2}d(\hat{\bm{\beta}}^{NG})\quad\text{subject to the heredity constraints in (4) or (5)}.\tag{24}$$

Algorithm 1 summarizes the proposed Hierarchical Garrote (HiGarrote) method. Most of the computational overhead of the method is in step 1 of the algorithm, i.e., the optimization in (23).

Algorithm 1 HiGarrote
1:  Estimate $\lambda$ and $\bm{\rho}$ by (23)
2:  Obtain the initial estimate $\hat{\bm{\beta}}_{P}$ by (22)
3:  Obtain $\hat{\bm{\theta}}$ by solving the quadratic programming problem in (8) or (24)
4:  Return $\hat{\bm{\beta}}^{NG}$ by scaling $\hat{\bm{\beta}}_{P}$ with $\hat{\bm{\theta}}$
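Putting the pieces together, a rough end-to-end sketch of Algorithm 1 might look as follows; build_R (assembling the diagonal of $\mathbf{R}$ from (19)/(20)), t2n2 (the closed form for $\tau^{2}/\nu^{2}$ above), ng_weights (computing the $w_{i}$), and ng_shrinkage_ridge (solving (8)) are assumed helpers, not part of the paper.

    # End-to-end sketch of Algorithm 1, wiring together the sketches above;
    # build_R, t2n2, ng_weights, and ng_shrinkage_ridge are assumed helpers.
    higarrote <- function(U_P, y, Psi_n, k) {
      par    <- multistart_fit(y, Psi_n, k)                        # step 1: optimize (23)
      rho    <- par[1:k]; lambda <- par[k + 1]
      Rdiag  <- build_R(rho)                                       # diagonal of R
      beta0  <- initial_estimate(U_P, y, Rdiag, t2n2(rho), lambda) # step 2: (22)
      w      <- ng_weights(U_P, Rdiag, t2n2(rho), lambda)          # w_i for (10)
      M_grid <- seq(0.1, 0.3 * (length(y) - 1), length.out = 50)
      M      <- select_M(U_P, y, beta0, w, M_grid)                 # M by GCV
      theta  <- ng_shrinkage_ridge(U_P, y, beta0, M)               # step 3: solve (8)
      theta * beta0                                                # step 4: NG estimate
    }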

3.4 Simulation

To demonstrate the effectiveness of HiGarrote, a simulation study is conducted using the toy example described in Section 2.3, based on a 12-run PB design. The response is generated from the model

$$y=20A+10AB+5AC+\epsilon,\qquad\epsilon\overset{iid}{\sim}\mathcal{N}(0,1).$$

The simulation is repeated 100 times, with new data generated each time. For each simulated dataset, both HiGarrote and the nonnegative garrote (NG) with ridge estimate (Yuan et al., 2009) are fitted using weak heredity.
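For reference, one simulation replicate can be generated as below; pb12 is assumed to hold the $\pm 1$ design of Table 1, and the seed is arbitrary.

    # Sketch of one simulation replicate; pb12 is assumed to hold the +/-1
    # design of Table 1 with columns A..K.
    set.seed(2024)
    y_sim <- 20 * pb12$A + 10 * pb12$A * pb12$B + 5 * pb12$A * pb12$C + rnorm(12)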

Figure 1 presents the boxplots of the estimates of selected effects using NG with ridge estimate and HiGarrote.

Figure 1: Comparison of selected effects’ estimates using NG with ridge estimate (left) and HiGarrote (right).

As shown in Figure 1, the NG with ridge estimate consistently identifies only the main effect A and occasionally AB. However, the estimate for A shows large variability, and the estimate for AB remains close to 0. Importantly, the crucial interaction AC is never selected, indicating its limitation in identifying weaker yet significant interactions. In contrast, HiGarrote successfully identifies the important effects A, AB, and AC, with estimates close to their true values of 20, 10, and 5. While HiGarrote occasionally selects additional effects such as B, C, and BC, their estimates are generally very small (near 0) with low variability.

4 Real Data Examples

In this section, we compare HiGarrote's performance with that of several existing methods on different types of experimental designs: regular, nonregular, and screening designs. The following methods are chosen for comparison.

stepH: This is the stepwise variable selection strategy proposed in Hamada and Wu (1992).

BayesH: This is the Bayesian variable selection technique proposed in Chipman et al. (1997), which incorporates heredity principles in the SSVS algorithm.

hierNet: This is the LASSO-based method proposed in Bien et al. (2013), which is implemented in the R package hierNet.

RAMP: This is the LASSO-based method proposed in Hao et al. (2018), which is implemented in the R package RAMP.

GDSARM: This is the Gauss-Dantzig Selector aggregation proposed in Singh and Stufken (2024), which is implemented in the R package GDSARM.

To keep the exposition simple, all the methods are implemented using the weak heredity principle. We are not aware of any software packages that implement stepH and BayesH; therefore, comparisons with these two methods are limited to the cases available in the literature. The two methods hierNet and RAMP are developed for general-purpose regression problems and are therefore not suitable for the analysis of certain experimental designs such as mixed-level designs. Therefore, their results are reported only when we are able to implement them in the example. GDSARM can be applied only to two-level designs, so it is implemented only in the examples in Sections 4.1.1 and 4.2.1. On the other hand, HiGarrote is general and flexible, and can be used with any type of experimental design.

4.1 Regular Designs

4.1.1 Two-Level Regular Design

Consider the $2^{9-5}$ experiment reported by Raghavarao and Altan (2003), where the nine factors are labeled A to J, excluding I. See Table S1 in the supplementary materials for details of the design matrix and data. Using half-normal plots and Lenth's method, the aliased effects AH, J, E, G, and CH have smaller p-values, although none of them are significant at the 0.05 level. Due to the properties of the design matrix, all the two-factor interactions (2fi) are fully aliased with either main effects or other 2fis. By ignoring three- and higher-order interactions, we can establish the following aliasing relationships among these five effects:

$$J=-CF,\quad E=-BC,\quad G=-AB=-FH,\quad AH=BF=DG=EJ,\quad CH=GJ=DE.$$

By applying the effect hierarchy principle, we can select J, E, and G from the first three aliasing relationships. Now, using effect heredity, the last two aliasing relationships become $DG=EJ$ and $GJ=DE$. These effects cannot be disentangled because G, J, and E are all active. Therefore, Raghavarao and Altan (2003) conducted a heuristic analysis and ultimately concluded that the effects J, E, G, EJ, and GJ are active. Their findings are supported by scientists, who believed that the main effects of E, G, and J, as well as their potential interactions, are likely to be active.

Now consider HiGarrote. As summarized in Table 2, HiGarrote automatically identifies the five active effects without requiring prior knowledge of the aliasing relationships or domain expertise from scientists. This is no accident, and the reason behind its selection can be explained as follows. The prior variance of $\beta_{EJ}$ is $.0991\tau^{2}$, whereas that of $\beta_{DG}$ is $5.3\times 10^{-5}\tau^{2}$. The observed effect of $EJ=DG$ is divided among the two aliased effects in proportion to their prior variances (Joseph, 2006). Therefore, the initial estimate of $\beta_{EJ}$ is much larger than that of $\beta_{DG}$, which leads to the automatic selection of EJ by the nonnegative garrote instead of DG. The same thing happens between GJ and DE. Although HiGarrote selects three additional effects, H, HJ, and B, their estimates are relatively low compared to the five active effects. In terms of computational time, tested on a laptop equipped with an Apple M2 chip, HiGarrote takes 0.54 seconds to analyze the experiment. For ease of comparison, the $R^{2}$ values for all the methods are calculated based on least squares. We also applied hierNet, RAMP, and GDSARM to this experiment. However, none of these methods identified any effects.

Table 2: Selected Effects in the $2^{9-5}$ Experiment
Method | Model | $R^{2}$
Raghavarao and Altan (2003) | EJ(-1.70), J(-1.67), E(1.55), G(1.49), GJ(1.39) | 70%
HiGarrote | EJ(-1.29), J(-1.26), E(1.09), G(1.02), GJ(0.87), H(0.51), HJ(-0.2), B(0.17) | 89%

4.1.2 Mixed-Level Regular Design

Phadke (1986) described a 32-run experiment aimed at increasing the lifespan of router bits used in a routing process to cut printed wiring boards from a panel. The experiment contains seven two-level factors and two four-level qualitative factors, which are denoted by A through J with the exclusion of I. See Table S2 and Table S3 in the supplementary materials for details of the factors, design matrix, and data. For the two qualitative factors, we adopt a convenient coding scheme offered by Wu and Hamada (2021). The coding scheme generates effects that can be interpreted as the average difference between pairs of levels. Thus, the coding matrix for factors D and E is

$$\mathbf{U}_{j}=\begin{pmatrix}1&-1&1&-1\\ 1&-1&-1&1\\ 1&1&-1&-1\\ 1&1&1&1\end{pmatrix}.$$

The main effects of factors D and E are labeled $D_{1}$, $D_{2}$, $D_{3}$ and $E_{1}$, $E_{2}$, $E_{3}$, respectively.

This is a regular design and therefore, one can write down all the 15 aliasing relationships. Wu and Hamada (2021) used the effect hierarchy and heredity principles to identify the following effects: $D_{2}$, $G$, $J$, $GJ$, and $E_{1}G=D_{2}H$. Note that the last aliased effect could not be disentangled because both $D_{2}$ and $G$ are significant. However, using their physical understanding of the routing process, they argued that $E_{1}G$ is unlikely to be significant because the four spindles of the routing machine are synchronized. Thus, they finally chose $D_{2}H$.

On the other hand, as summarized in Table 3, HiGarrote automatically identifies the five effects without using any physical understanding of the routing process. In addition, it also selects a few more effects: $B$, $D_{1}$, $E_{3}$, $GH$, $HJ$, and $E_{3}H$. In terms of computational time, HiGarrote takes 6.38 seconds to analyze the experiment. As mentioned before, since hierNet and RAMP do not generate interactions that are suitable for this experiment, their performance is not reported here.

Table 3: Selected Effects in Router Bit Experiment
Method | Model | $R^{2}$
Wu and Hamada (2021) | $D_{2}(2.75)$, $G(-2.56)$, $J(2.31)$, $GJ(-2.25)$, $D_{2}H(2.25)$ | 62%
HiGarrote | $D_{2}(2.53)$, $G(-2.33)$, $J(2.06)$, $GJ(-1.98)$, $D_{2}H(1.97)$, $GH(1.33)$, $E_{3}(-1.22)$, $B(-1.14)$, $D_{1}(.92)$, $HJ(-.86)$, $E_{3}H(-.30)$ | 90%

4.2 Nonregular Designs

4.2.1 Two-Level Nonregular Design

Hunter et al. (1982) used a 12-run Plackett-Burman design to investigate the effects of seven two-level factors on the fatigue life of weld-repaired castings. The seven factors are denoted by capital letters A through G. The details of the factors, experimental design, and data are given in Table S4 and Table S5 of the supplementary materials. According to the analysis conducted by Hunter et al. (1982), the main effects of factors D and F were found to be significant, but the $R^{2}$ of the model is only 59%. In a very influential work, Hamada and Wu (1992) reanalyzed the experiment using an iterative stepwise regression strategy that incorporates effect heredity (stepH). They found that just adding the two-factor interaction FG to the model increases $R^{2}$ to 92%. They also found that removing the main effect D from the model decreases the $R^{2}$ only to 89%, which implies that the effect of D is negligible. Thus, they claim that effects F and FG are active.

hierNet, with the model selected by leave-one-out cross-validation, successfully identifies the two important effects F and FG; however, it also includes many insignificant effects in the model. On the other hand, we found that RAMP performs well in this experiment, selecting the same model as stepH. However, an issue with RAMP is that its algorithm includes quadratic effects such as $A^{2}$, $B^{2}$, $\dots$, $G^{2}$. Since there are only two levels in the experiment, these quadratic effects reduce to the intercept and will not be selected, resulting in unnecessary computation. GDSARM, using the settings provided in Singh and Stufken (2024), successfully identified the effects F and FG. However, the algorithm includes several hyperparameters that may require manual tuning for optimal performance.

HiGarrote identifies five effects: F, FG, D, G, and DG, with $R^{2}=96\%$. Compared to the model selected by stepH, the inclusion of effects D, G, and DG increases the $R^{2}$ by only 7%. Although HiGarrote selects three insignificant effects, their estimates are low compared to those of the important effects. The comparisons are summarized in Table 4, with the numbers in parentheses indicating the effect estimates. The computational time is also reported in the table. HiGarrote is the second slowest among the methods considered in this example, but 0.26 seconds will not be a computational burden for the analyst!

Table 4: Selected Effects in Cast Fatigue Experiment
Method | Model | $R^{2}$ | Runtime
Hunter et al. (1982) | F(.46), D(-.26) | 59% |
stepH | F(.46), FG(-.46) | 89% | .13 sec
hierNet | F(.44), FG(-.26), D(-.17), A(.08), G(.08), DG(.06), B(.05), BC(-.04), AE(-.04), AD(-.03), C(-.02) | 100% | .08 sec
RAMP | F(.48), FG(-.5) | 89% | .06 sec
GDSARM | F(.46), FG(-.46) | 89% | .33 sec
HiGarrote | F(.44), FG(-.43), D(-.05), G(.04), DG(.03) | 96% | .26 sec

4.2.2 Mixed-Level Nonregular Design

Hamada and Wu (1992) analyzed an 18-run experiment designed to study the blood glucose readings of a clinical testing device. The experiment contains one two-level factor and seven three-level quantitative factors, denoted by A through H. The details of the factors, design matrix, and data are given in Table S6 and Table S7 of the supplementary materials. Since the experiment contains three-level quantitative factors, we can entertain quadratic terms and their interactions. The interactions include four components: linear-by-linear, linear-by-quadratic, quadratic-by-linear, and quadratic-by-quadratic. So, we search for active effects among 15 main effects (one linear effect for the two-level factor, and linear and quadratic effects for each of the seven three-level factors) and 98 interactions. In the following comparison, the subscripts $l$ and $q$ indicate the linear and quadratic terms, respectively.

The analysis carried out by Hamada and Wu (1992) using stepH chooses the effects $E_{q}$, $F_{q}$, and $E_{l}F_{l}$. The model only achieves an $R^{2}$ of 68%. On the other hand, Chipman et al. (1997) using BayesH identified $B_{l}$, $B_{l}H_{l}$, $B_{l}H_{q}$, and $B_{q}H_{q}$ with $R^{2}=86\%$. We observed that removing $B_{l}H_{l}$ from the model only marginally reduces the $R^{2}$ to 85%, suggesting that the effect of $B_{l}H_{l}$ is negligible.

It is not ideal to use hierNet and RAMP to analyze this experiment because they construct interactions by crossing input variables. If we treat factors A to H as input variables, the interactions between linear and quadratic effects would not be entertained. On the other hand, if we treat all linear and quadratic effects as input variables, interactions between effects of the same factor, such as $B_{l}B_{q}$, would be included, potentially leading to the selection of cubic effects. Although treating all linear and quadratic effects as input variables may initially seem reasonable, our analysis revealed that neither hierNet nor RAMP selected any effects. This suggests that both methods are not suitable for analyzing mixed-level designs.

HiGarrote, on the other hand, identifies more significant effects than the existing strategies. The model selected by HiGarrote contains eight effects, $B_{l}$, $B_{q}$, $F_{l}$, $H_{l}$, $H_{q}$, $B_{l}H_{q}$, $B_{q}H_{l}$, and $B_{q}H_{q}$, with an $R^{2}$ of 97%. In the selected model, all the effects except $H_{l}$ and $H_{q}$ are significant. In terms of computational time, it takes 1.40 seconds for HiGarrote to analyze the experiment. The results are summarized in Table 5.

Table 5: Selected Effects in Blood Glucose Experiment
Method | Model | $R^{2}$
stepH | $E_{l}F_{l}(-5.52)$, $E_{q}(4.33)$, $F_{q}(4.02)$ | 68%
BayesH | $B_{l}H_{q}(6.64)$, $B_{q}H_{q}(-5.43)$, $B_{l}(-2.85)$, $B_{l}H_{l}(.71)$ | 86%
HiGarrote | $B_{l}H_{q}(6.52)$, $B_{q}H_{q}(-5.10)$, $B_{l}(-2.60)$, $B_{q}(1.28)$, $B_{q}H_{l}(.99)$, $H_{l}(-.45)$, $F_{l}(-.34)$, $H_{q}(-.05)$ | 97%

4.3 Screening Designs

4.3.1 Definitive Screening Design

Jones and Lanzerath (2021) described a 21-run experiment aimed at creating a special resin for car vents designed to prevent moisture accumulation within a car's housing. The experiment is carried out using a definitive screening design (DSD) with nine continuous factors, labeled A through J with the exclusion of I. Details of the factors and dataset are provided in Table S8 and Table S9 of the supplementary materials. The response is taken as the logarithm of Impurity.

According to the analysis carried out by Jones and Lanzerath (2021), the linear effects A and F and their interaction AF are identified as significant, with an $R^{2}$ of 93%. In addition to the three effects selected by Jones and Lanzerath (2021), HiGarrote identifies five more effects, four of which are significant. The $R^{2}$ increases to 99%. In terms of computational time, it takes 1.24 seconds for HiGarrote to analyze the experiment. Since both hierNet and RAMP can handle the full quadratic model, their performance can be examined here. It turns out that hierNet does not perform well in this experiment because too many effects are selected: in total, the selected model contains 16 effects, comprising 7 linear effects and 9 interactions. On the other hand, RAMP identifies only F. Since F is highly important, RAMP's model is good, with an $R^{2}$ of 88%. The results are summarized in Table 6. To keep the table concise, the selected model of hierNet is not reported.

Table 6: Selected Effects in Resin Experiment
Method | Model | $R^{2}$
Jones and Lanzerath (2021) | $F(-2.21)$, $A(.47)$, $AF(.23)$ | 93%
RAMP | $F(-2.21)$ | 88%
HiGarrote | $F(-2.20)$, $A(.43)$, $FJ(-.30)$, $E(-.23)$, $BF(.16)$, $AJ(-.10)$, $AF(.05)$, $F^{2}(.04)$ | 99%

4.3.2 Supersaturated Design

Lin (1993) employed a half fraction of a 28-run PB design to investigate an experiment aimed at developing an epoxy adhesive system, as described by Williams (1968). The original design matrix contains 14 runs and 24 factors. Since factors 13 and 16 were assigned to the same column in the original design matrix, only factor 13 is listed here. The design matrix and data are given in Table S10 of the supplementary materials. Since supersaturated designs are generally used for factor screening, we focus only on the main effects. Therefore, we exclude hierNet, RAMP, and GDSARM in this example, as these methods are designed to identify models that encompass quadratic effects and two-factor interactions. Although HiGarrote is also intended for models with hierarchical effects, we apply it in this example to show its generality. The results are summarized in Table 7. We can see that HiGarrote identifies the same set of factors as Lin (1993) and Chipman et al. (1997). In terms of computational time, HiGarrote takes about 5.80 seconds to perform the analysis, with most of the time spent on optimizing the 24 hyperparameters of the correlation matrix and the noise ratio.

Table 7: Selected Effects in Epoxy Experiment
Method | Model | $R^{2}$
Lin (1993) | 15(-71.26), 20(-27.98), 12(-26.77), 4(20.73), 10(-9.40) | 97%
BayesH | 15(-70.48), 20(-29.20), 12(-25.29), 4(22.12) | 95%
HiGarrote | 15(-61.22), 12(-25.84), 20(-22.19), 10(-8.42), 4(1.29) | 97%

5 Conclusion

Regression analysis of experimental data requires special care for two main reasons: (i) the data size is small, and (ii) there is a need to estimate higher-order effects that satisfy the effect hierarchy and heredity principles. We have proposed a modified version of the nonnegative garrote method, called HiGarrote, that incorporates the hierarchical relationships among the effects. A major innovation in this new technique is to carefully devise an approach to obtain an initial estimate for the nonnegative garrote. For this purpose, we used generalized ridge regression with the ridge parameters chosen based on a Gaussian process prior on the underlying function. The main computational cost of our method comes from the fitting of the Gaussian process, which increases at the rate of $\mathcal{O}(n^{3})$, where $n$ is the size of the data. However, since $n$ is small in experiments, this is not of much concern.

We have applied our method to many different kinds of experiments reported in the literature and found that it works well. The most attractive feature of the method is that it is very general and requires no manual tuning. Thus, it can be used to automate the analysis of experiments. We plan to provide an implementation of the code in an R package. Some practitioners may still prefer to analyze their experiments manually. However, even in such cases, HiGarrote could be tried so that the accuracy of their analyses can be quickly verified.

Another novelty of the proposed method is that it makes a connection to Gaussian process modeling, which is widely applied to the analysis of computer experiments. Practitioners have so far stayed away from Gaussian process regression in physical experiments because they felt that such a complex nonparametric regression model is unnecessary for analyzing the small, noisy data from experiments. Nevertheless, we found Gaussian process modeling to be useful, at least for getting a good initial estimate for the linear regression parameters. One may then ask why not simply use Gaussian process regression in the analysis of physical experiments instead of linear regression. The reason is that linear regression does a better job of identifying significant effects that are interpretable, whereas variable selection with Gaussian process regression is not trivial (Linkletter et al., 2006). However, some recent advances in global sensitivity analysis (Huang and Joseph, 2024) offer promising alternatives in this direction, which need more investigation.

Appendix: Proof of (20)

Suppose that there are $p$ qualitative three-level factors and their model matrices are encoded according to Helmert coding. Then, the model matrix for factor $j$ is

$$\mathbf{U}_{j}=\begin{pmatrix}1&-\sqrt{\frac{3}{2}}&-\sqrt{\frac{1}{2}}\\ 1&\sqrt{\frac{3}{2}}&-\sqrt{\frac{1}{2}}\\ 1&0&2\sqrt{\frac{1}{2}}\end{pmatrix}.$$

Given that each factor is represented by two dummy variables, and based on the model matrix, these dummy variables are expressed as $m_{j_{1}}=(-1,1,0)^{\prime}$ and $m_{j_{2}}=(-1,-1,2)^{\prime}$. Then, the two correlation matrices are

$$\mathbf{\Psi}_{j_{1}}=\begin{pmatrix}1&\rho_{j_{1}}&\rho_{j_{1}}\\ \rho_{j_{1}}&1&\rho_{j_{1}}\\ \rho_{j_{1}}&\rho_{j_{1}}&1\end{pmatrix},\quad\mathbf{\Psi}_{j_{2}}=\begin{pmatrix}1&1&\rho_{j_{2}}\\ 1&1&\rho_{j_{2}}\\ \rho_{j_{2}}&\rho_{j_{2}}&1\end{pmatrix}.$$

According to the product correlation structure, the correlation matrix of factor $j$ is

\mathbf{\Psi}_{j}=\mathbf{\Psi}_{j_{1}}\odot\mathbf{\Psi}_{j_{2}}=\begin{pmatrix}1&\rho_{j_{1}}&\rho_{j_{1}}\rho_{j_{2}}\\ \rho_{j_{1}}&1&\rho_{j_{1}}\rho_{j_{2}}\\ \rho_{j_{1}}\rho_{j_{2}}&\rho_{j_{1}}\rho_{j_{2}}&1\end{pmatrix},

where \odot is the Hadamard product. Then,

\mathbf{U}_{j}^{-1}\mathbf{\Psi}_{j}(\mathbf{U}_{j}^{-1})^{\prime}=\frac{1}{9}\begin{pmatrix}3+2\rho_{j_{1}}+4\rho_{j_{1}}\rho_{j_{2}}&0&\sqrt{2}(-\rho_{j_{1}}+\rho_{j_{1}}\rho_{j_{2}})\\ 0&3(1-\rho_{j_{1}})&0\\ \sqrt{2}(-\rho_{j_{1}}+\rho_{j_{1}}\rho_{j_{2}})&0&3+\rho_{j_{1}}-4\rho_{j_{1}}\rho_{j_{2}}\end{pmatrix}.

Now using (17), the marginal distributions of the $\beta_{i}$’s can be obtained from (16). This gives (20).
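As a quick check, the identity above can be reproduced symbolically. The short sympy sketch below assumes only the quantities defined in this appendix; the symbol names are ours.

```python
# Symbolic check of U_j^{-1} Psi_j (U_j^{-1})' using the appendix quantities.
import sympy as sp

r1, r2 = sp.symbols("rho_j1 rho_j2", positive=True)
U = sp.Matrix([[1, -sp.sqrt(sp.Rational(3, 2)), -sp.sqrt(sp.Rational(1, 2))],
               [1,  sp.sqrt(sp.Rational(3, 2)), -sp.sqrt(sp.Rational(1, 2))],
               [1,  0,                       2 * sp.sqrt(sp.Rational(1, 2))]])
Psi1 = sp.Matrix([[1, r1, r1], [r1, 1, r1], [r1, r1, 1]])
Psi2 = sp.Matrix([[1, 1, r2], [1, 1, r2], [r2, r2, 1]])
Psi = Psi1.multiply_elementwise(Psi2)          # Hadamard product
M = sp.simplify(U.inv() * Psi * U.inv().T)     # U_j^{-1} Psi_j (U_j^{-1})'
sp.pprint(M)  # entries agree with the displayed matrix (factor 1/9 included)
```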

Supplementary Materials

The supplementary file contains the details of the factors and their levels, the design matrices, and the data for the examples in Section 4.

Disclosure of Interest

No potential conflict of interest was reported by the authors.

Data Availability Statement

The data are provided in the supplementary materials.

Funding

This research is supported by U.S. National Science Foundation grant DMS-2310637.

References

  • Bien et al., (2013) Bien, J., Taylor, J., and Tibshirani, R. (2013). A Lasso for Hierarchical Interactions. Annals of Statistics, 41(3):1111–1141.
  • Box and Meyer, (1986) Box, G. and Meyer, D. (1986). An Analysis for Unreplicated Fractional Factorials. Technometrics, 28(1):11–18.
  • Box et al., (1978) Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters. John Wiley & Sons, New York.
  • Breiman, (1995) Breiman, L. (1995). Better Subset Regression Using the Nonnegative Garrote. Technometrics, 37(4):373–384.
  • Candes and Tao, (2007) Candes, E. and Tao, T. (2007). The Dantzig Selector: Statistical Estimation When p Is Much Larger Than n. The Annals of Statistics, 35(6):2313–2351.
  • Chipman, (1996) Chipman, H. (1996). Bayesian Variable Selection With Related Predictors. Canadian Journal of Statistics, 24(1):17–36.
  • Chipman et al., (1997) Chipman, H., Hamada, M., and Wu, C. F. J. (1997). A Bayesian Variable-Selection Approach for Analyzing Designed Experiments With Complex Aliasing. Technometrics, 39(4):372–381.
  • Choi et al., (2010) Choi, N. H., Li, W., and Zhu, J. (2010). Variable Selection With the Strong Heredity Constraint and Its Oracle Property. Journal of the American Statistical Association, 105(489):354–364.
  • Efron et al., (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least Angle Regression. The Annals of Statistics, 32(2):407–499.
  • Fan and Li, (2001) Fan, J. and Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96(456):1348–1360.
  • George and McCulloch, (1993) George, E. I. and McCulloch, R. E. (1993). Variable Selection via Gibbs Sampling. Journal of the American Statistical Association, 88(423):881–889.
  • Golub et al., (1979) Golub, G. H., Heath, M., and Wahba, G. (1979). Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter. Technometrics, 21(2):215–223.
  • Hamada and Wu, (1992) Hamada, M. and Wu, C. F. J. (1992). Analysis of Designed Experiments With Complex Aliasing. Journal of Quality Technology, 24(3):130–137.
  • Hao et al., (2018) Hao, N., Feng, Y., and Zhang, H. H. (2018). Model Selection for High-Dimensional Quadratic Regression via Regularization. Journal of the American Statistical Association, 113(522):615–625.
  • Harville, (1998) Harville, D. A. (1998). Matrix Algebra From a Statistician’s Perspective. Springer-Verlag, New York.
  • Hoerl and Kennard, (1970) Hoerl, A. E. and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1):55–67.
  • Huang and Joseph, (2024) Huang, C. and Joseph, V. R. (2024). Factor Importance Ranking and Selection Using Total Indices. arXiv preprint arXiv:2401.00800.
  • Hunter et al., (1982) Hunter, G. B., Hodi, F. S., and Eagar, T. W. (1982). High Cycle Fatigue of Weld Repaired Cast Ti-6AI-4V. Metallurgical Transactions A, 13(9):1589–1594.
  • Ishwaran and Rao, (2014) Ishwaran, H. and Rao, J. S. (2014). Geometry and Properties of Generalized Ridge Regression in High Dimensions. Contemporary Mathematics, 622:81–93.
  • Jones and Lanzerath, (2021) Jones, B. and Lanzerath, M. (2021). A Novel Application of a Definitive Screening Design: A Case Study. Quality Engineering, 33(3):563–569.
  • Joseph, (2006) Joseph, V. R. (2006). A Bayesian Approach to the Design and Analysis of Fractionated Experiments. Technometrics, 48(2):219–229.
  • Joseph and Delaney, (2007) Joseph, V. R. and Delaney, J. D. (2007). Functionally Induced Priors for the Analysis of Experiments. Technometrics, 49(1):1–11.
  • Joseph et al., (2015) Joseph, V. R., Gul, E., and Ba, S. (2015). Maximum Projection Designs for Computer Experiments. Biometrika, 102(2):371–380.
  • Lenth, (1989) Lenth, R. V. (1989). Quick and Easy Analysis of Unreplicated Factorials. Technometrics, 31(4):469–473.
  • Li et al., (2006) Li, X., Sudarsanam, N., and Frey, D. D. (2006). Regularities in Data From Factorial Experiments. Complexity, 11(5):32–45.
  • Lin, (1993) Lin, D. K. J. (1993). A New Class of Supersaturated Designs. Technometrics, 35(1):28–31.
  • Linkletter et al., (2006) Linkletter, C., Bingham, D., Hengartner, N., Higdon, D., and Ye, K. Q. (2006). Variable Selection for Gaussian Process Models in Computer Experiments. Technometrics, 48(4):478–490.
  • Montgomery, (2017) Montgomery, D. C. (2017). Design and Analysis of Experiments. John Wiley & Sons.
  • Ockuly et al., (2017) Ockuly, R. A., Weese, M. L., Smucker, B. J., Edwards, D. J., and Chang, L. (2017). Response Surface Experiments: A Meta-Analysis. Chemometrics and Intelligent Laboratory Systems, 164:64–75.
  • Phadke, (1986) Phadke, M. S. (1986). Design Optimization Case Studies. AT&T Technical Journal, 65(2):51–68.
  • Raghavarao and Altan, (2003) Raghavarao, D. and Altan, S. (2003). A Heuristic Analysis of Highly Fractionated $2^{n}$ Factorial Experiments. Metrika, 58:185–191.
  • Santner et al., (2018) Santner, T. J., Williams, B. J., and Notz, W. I. (2018). The Design and Analysis of Computer Experiments. Springer, New York.
  • Singh and Stufken, (2024) Singh, R. and Stufken, J. (2024). Factor Selection in Screening Experiments by Aggregation Over Random Models. Computational Statistics & Data Analysis, 194:107940.
  • Svanberg, (2002) Svanberg, K. (2002). A Class of Globally Convergent Optimization Methods Based on Conservative Convex Separable Approximations. SIAM Journal on Optimization, 12(2):555–573.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288.
  • Vazquez et al., (2021) Vazquez, A. R., Schoen, E. D., and Goos, P. (2021). A Mixed Integer Optimization Approach for Model Selection in Screening Experiments. Journal of Quality Technology, 53(3):243–266.
  • Williams, (1968) Williams, K. R. (1968). Designed Experiments. Rubber Age, 100:65–71.
  • Wu and Hamada, (2021) Wu, C. F. J. and Hamada, M. S. (2021). Experiments: Planning, Analysis, and Optimization, 3rd Edition. Wiley.
  • Xiong, (2010) Xiong, S. (2010). Some Notes on the Nonnegative Garrote. Technometrics, 52(3):349–361.
  • Yuan et al., (2007) Yuan, M., Joseph, V. R., and Lin, Y. (2007). An Efficient Variable Selection Approach for Analyzing Designed Experiments. Technometrics, 49(4):430–439.
  • Yuan et al., (2009) Yuan, M., Joseph, V. R., and Zou, H. (2009). Structured Variable Selection and Estimation. The Annals of Applied Statistics, 3(4):1738–1757.
  • Yuan and Lin, (2006) Yuan, M. and Lin, Y. (2006). Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49–67.
  • Yuan and Lin, (2007) Yuan, M. and Lin, Y. (2007). On the Non-Negative Garrotte Estimator. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69(2):143–161.
  • Zou, (2006) Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101(476):1418–1429.