
Automated Analysis of Experiments using Hierarchical Garrote

Wei-Yang Yu H. Milton Stewart School of Industrial and Systems Engineering,
Georgia Institute of Technology, Atlanta, GA 30332
V. Roshan Joseph Corresponding author: roshan@gatech.edu H. Milton Stewart School of Industrial and Systems Engineering,
Georgia Institute of Technology, Atlanta, GA 30332
Abstract

In this work, we propose an automatic method for the analysis of experiments that incorporates hierarchical relationships between the experimental variables. We use a modified version of the nonnegative garrote method for variable selection that can incorporate hierarchical relationships. The nonnegative garrote method requires a good initial estimate of the regression parameters to work well. To obtain the initial estimate, we use generalized ridge regression with the ridge parameters estimated from a Gaussian process prior placed on the underlying input-output relationship. The proposed method, called HiGarrote, is fast, easy to use, and requires no manual tuning. Analyses of several real experiments are presented to demonstrate its benefits over existing methods.


Keywords: Gaussian process; Generalized ridge regression; Nonnegative garrote; Variable selection.

1 Introduction

Traditionally, the modeling and analysis of experimental data are done using analysis of variance (ANOVA) and regression techniques (Box et al., 1978; Montgomery, 2017; Wu and Hamada, 2021). Regression is more general and the preferred method because some of the factors in the experiments can be continuous and the regression methodology allows us to fit the response surface using smooth polynomial models. When some factors are categorical, they can also be incorporated into the regression methodology by introducing dummy variables. However, because “orthogonality” is present in many experimental designs such as full factorial designs, the regression analysis of experimental data is much simpler than that of observational data. The $t$-statistics do not change when an effect is added or removed, and therefore the significant effects can be identified with ease. However, this simplicity has more or less disappeared as experimental design methods evolved over time and became more complex. Fractional factorial designs and nonregular designs have become more common in practice, where the effects can be aliased, necessitating more sophisticated regression techniques for the analysis of experiments.

There are certain characteristics of experiments that warrant a different treatment of regression analysis compared to observational studies. Experiments are usually expensive and therefore, the data are smaller in size. However, experiments are conducted in controlled environments, and therefore, higher-order effects such as interactions and nonlinear effects can be entertained in the modeling. The need to model high-order effects with small data immediately introduces certain challenges in the regression analysis. First, the total number of effects can exceed, often by a wide margin, the number of experimental runs. Second, experimenters over time have gained some understanding of the importance of various effects, which is formulated into certain hierarchical principles such as effect hierarchy and effect heredity (Wu and Hamada, 2021). The effect hierarchy principle states that lower-order effects are more important than higher-order effects. Under the strong heredity principle, an interaction is considered active only if both of its parent effects are active, whereas under weak heredity, only one of its parent effects needs to be active. These principles are supported by the findings of Li et al. (2006) and Ockuly et al. (2017), who provide comprehensive meta-analyses of two-level and response surface experiments, respectively. These principles need to be used intelligently to identify the significant effects from the small experimental data.

Analysis of regular designs can be done by first deriving all the aliasing relationships, estimating the aliased effects using regression, and identifying the significant aliased effects using $t$-statistics or half-normal plots and Lenth's method (Lenth, 1989). Then the effect hierarchy and heredity principles are utilized to break the aliases and identify the significant effects. This approach cannot be used for nonregular designs because of their complex aliasing relationships. Hamada and Wu (1992) proposed several forward selection strategies that incorporate the hierarchy and heredity principles. However, these strategies do not explore the complete space of models and therefore can miss important effects. To search the model space more thoroughly, Chipman et al. (1997) proposed a Bayesian variable selection method that integrates the stochastic search variable selection (SSVS) algorithm of George and McCulloch (1993) with the hierarchical priors described in Chipman (1996). One major issue with the method is the challenge of specifying priors, which contain several tuning parameters. In addition, the approach remains computationally intensive because the SSVS algorithm usually requires a large number of Markov chain Monte Carlo (MCMC) samples to attain convergence.

Several extensions of regularization-based variable selection methods that preserve effect heredity have been proposed. Yuan et al. (2007) presented a fast algorithm by modifying the least angle regression (LARS) technique (Efron et al., 2004). However, their method is not convenient when one group of factors follows strong heredity and another follows weak heredity. To provide a more flexible approach, Yuan et al. (2009) incorporated heredity constraints into the nonnegative garrote (NG) method of Breiman (1995). However, NG requires least squares estimates of the parameters to use as initial estimates, which cannot be obtained in fractionated experiments because the number of parameters to estimate exceeds the number of runs. Extensions of LASSO (Tibshirani, 1996) can be found in Choi et al. (2010), Bien et al. (2013), and Hao et al. (2018). However, LASSO is not consistent in terms of estimation and variable selection; that is, it lacks the oracle property of Fan and Li (2001). Vazquez et al. (2021) built a best-subset selection algorithm that can accommodate any user-specified restriction, including heredity constraints. Due to the nature of best-subset selection, the algorithm necessitates searching for the optimal model across all possible model sizes, making it computationally demanding. More recently, Singh and Stufken (2024) proposed an aggregated version of the Gauss-Dantzig Selector (Candes and Tao, 2007). However, their method is limited to two-level designs.

The main aim of this article is to develop an automatic method to analyze experimental data that requires little or no manual tuning. To achieve this goal, we will extend the nonnegative garrote method proposed in Yuan et al. (2009) and make it suitable for the analysis of experiments. We chose the nonnegative garrote because it possesses the oracle property (Zou, 2006) and can be automatically tuned (Xiong, 2010). The major hurdle in using the nonnegative garrote is the non-existence of least squares estimates. Yuan and Lin (2007) have shown that ridge regression (Hoerl and Kennard, 1970) can be used to obtain initial estimates when the least squares estimates do not exist. However, in this article, we will show that the usual ridge regression estimates do not perform well in the analysis of experiments because they do not obey the hierarchical relationships. A better initial estimate can be obtained using generalized ridge regression, which has a separate ridge parameter for each effect that can be chosen to reflect the hierarchical relationships. However, there are too many ridge parameters, which makes their specification challenging. By exploiting the connection between generalized ridge regression and Bayesian regression, we will use the functionally induced priors proposed in Joseph (2006) and Joseph and Delaney (2007) to obtain estimates of the ridge parameters that respect the hierarchy and heredity principles. Integrating this with the nonnegative garrote gives the proposed method, which we call Hierarchical Garrote, or HiGarrote for short.

The article is organized as follows. In Section 2, we first review the nonnegative garrote method and its limitations. We then develop our proposed method in Section 3. Several examples are provided in Section 4 to illustrate the advantages of the proposed method. We conclude the article with some remarks in Section 5.

2 Nonnegative Garrote and Its Limitations

Suppose that there are $p$ factors $\mathbf{x}=(x_{1},x_{2},\dots,x_{p})^{\prime}$. Denote the data as $\{\mathbf{X},\mathbf{Y}\}\coloneqq\{\mathbf{x}_{i},y_{i}\}_{i=1}^{n}$ and let the response $y_{i}\in\mathbb{R}$ for all $i$. Assume that

$$y_{i}=f(\mathbf{x}_{i})+\epsilon_{i},\qquad\epsilon_{i}\sim\mathcal{N}(0,\sigma^{2}),\qquad\text{for }i=1,\dots,n,\tag{1}$$

where $\epsilon_{i}$ stands for the random error resulting from uncontrollable factors and measurement error. Throughout the paper, we center all variables so that the observed mean is 0 and scale all predictors so that their sample standard deviations are the same.

2.1 Original Nonnegative Garrote

Consider approximating the latent function $f$ by a main-effects-only linear model,

$$f(\mathbf{x})\approx\beta_{1}x_{1}+\beta_{2}x_{2}+\dots+\beta_{p}x_{p}.$$

Let $\bm{\beta}=(\beta_{1},\beta_{2},\dots,\beta_{p})^{\prime}$. The least squares estimator is $\hat{\bm{\beta}}^{LS}=(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}\mathbf{Y}$. The original nonnegative garrote (NG) (Breiman, 1995) works by scaling the least squares estimate with shrinkage factors $\bm{\theta}(M)$ controlled by a tuning parameter $M$. The NG estimate is defined as $\hat{\bm{\beta}}^{NG}(M)=(\hat{\beta}_{1}^{NG}(M),\dots,\hat{\beta}_{p}^{NG}(M))^{\prime}$, where $\hat{\beta}_{j}^{NG}(M)=\hat{\theta}_{j}(M)\hat{\beta}_{j}^{LS}$. For $M\geq 0$, the shrinkage factors are obtained by solving a quadratic programming problem:

$$\hat{\bm{\theta}}(M)=\underset{\bm{\theta}}{\text{argmin}}\;\frac{1}{2}\sum_{i=1}^{n}\left\{y_{i}-(\theta_{1}\hat{\beta}_{1}^{LS}x_{i1}+\theta_{2}\hat{\beta}_{2}^{LS}x_{i2}+\dots+\theta_{p}\hat{\beta}_{p}^{LS}x_{ip})\right\}^{2}\quad\text{subject to }\sum_{j}\theta_{j}\leq M\text{ and }\theta_{j}\geq 0\;\forall j.\tag{2}$$

By selecting $M$ suitably, some elements of $\hat{\bm{\beta}}^{NG}(M)$ can be exactly zero. Since $\hat{\beta}_{j}^{LS}\neq 0$ with probability 1, $x_{j}$ is considered active if and only if $\hat{\theta}_{j}(M)>0$. In other words, $\hat{\theta}_{j}(M)$ serves as an indicator of including $x_{j}$ in the selected model.
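To make the optimization concrete, the following is a minimal sketch in R of how (2) can be solved with off-the-shelf quadratic programming. It assumes $n>p$ so that the least squares estimates exist; the function name ng_shrinkage and the small jitter added for numerical stability are our own choices, not part of the original method.

    # Minimal sketch of solving (2) with the quadprog package; assumes n > p so
    # that the least squares estimates exist, and that X and y are centered/scaled.
    library(quadprog)

    ng_shrinkage <- function(X, y, M) {
      beta_ls <- coef(lm.fit(X, y))              # least squares estimates
      Z <- sweep(X, 2, beta_ls, `*`)             # Z[i, j] = x_ij * betahat_j
      p <- ncol(Z)
      Dmat <- crossprod(Z) + 1e-8 * diag(p)      # small jitter for positive definiteness
      dvec <- drop(crossprod(Z, y))
      # Constraint columns: sum(theta) <= M, and theta_j >= 0 for all j
      Amat <- cbind(-rep(1, p), diag(p))
      bvec <- c(-M, rep(0, p))
      solve.QP(Dmat, dvec, Amat, bvec, meq = 0)$solution
    }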

2.2 Nonnegative Garrote with Effect Heredity

In several problems, main-effects models are inadequate to approximate the underlying function $f$ in (1). To improve the approximation, we can include the quadratic effects and two-factor interactions:

$$f(\mathbf{x})\approx\beta_{1}x_{1}+\dots+\beta_{p}x_{p}+\beta_{11}x_{1}^{2}+\beta_{12}x_{1}x_{2}+\dots+\beta_{pp}x_{p}^{2}.$$

The shrinkage factors in the nonnegative garrote are obtained by

$$\hat{\bm{\theta}}(M)=\underset{\bm{\theta}}{\text{argmin}}\;\frac{1}{2}\sum_{i=1}^{n}\left\{y_{i}-(\theta_{1}\hat{\beta}_{1}^{LS}x_{i1}+\dots+\theta_{p}\hat{\beta}_{p}^{LS}x_{ip}+\theta_{11}\hat{\beta}_{11}^{LS}x_{i1}^{2}+\dots+\theta_{pp}\hat{\beta}_{pp}^{LS}x_{ip}^{2})\right\}^{2}\quad\text{subject to }\sum_{j}\theta_{j}\leq M\text{ and }\theta_{j}\geq 0,\;j=1,\dots,p,11,12,\dots,pp.\tag{3}$$

In this optimization, it is possible to select a model that includes an interaction term while excluding both of its main effects, that is, $\hat{\theta}_{ij}(M)>0$ and $\hat{\theta}_{i}(M)=\hat{\theta}_{j}(M)=0$. To preserve the effect heredity principle in the nonnegative garrote, Yuan et al. (2009) proposed adding constraints that enforce strong or weak heredity among the effects.

Under the strong heredity principle, we must have $\theta_{i}>0$ and $\theta_{j}>0$ if $\theta_{ij}>0$. This can be enforced by requiring $\theta_{ij}\leq\min(\theta_{i},\theta_{j})$, which is equivalent to two convex constraints

$$\theta_{ij}\leq\theta_{i}\text{ and }\theta_{ij}\leq\theta_{j}\quad\text{for all }i,j.\tag{4}$$

Thus, if $\hat{\theta}_{ij}(M)>0$, (4) will guarantee that $\hat{\theta}_{i}(M)>0$ and $\hat{\theta}_{j}(M)>0$, ensuring that $x_{i}$ and $x_{j}$ are part of the chosen model.

Under the weak heredity principle, we must have at least one of $\theta_{i}$ or $\theta_{j}$ larger than 0 if $\theta_{ij}>0$. This can be enforced by requiring $\theta_{ij}\leq\max(\theta_{i},\theta_{j})$. However, this constraint is nonconvex, and the optimization in (3) cannot be solved using quadratic programming. To simplify the optimization, Yuan et al. (2009) proposed a convex relaxation of the constraint:

$$\theta_{ij}\leq\theta_{i}+\theta_{j}\quad\text{for all }i,j.\tag{5}$$

Thus, if $\hat{\theta}_{ij}(M)>0$, (5) will ensure that at least one of $\hat{\theta}_{i}(M)>0$ or $\hat{\theta}_{j}(M)>0$ holds, thereby guaranteeing that at least one of $x_{i}$ or $x_{j}$ is included in the selected model.
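The heredity constraints simply add linear inequality columns to the quadratic program. Below is a sketch of how they can be assembled in R; the helper name and the 3-column index matrix pairs (whose rows give the positions of $\theta_{i}$, $\theta_{j}$, and $\theta_{ij}$ in the coefficient vector) are bookkeeping assumptions of ours.

    # Sketch: extra constraint columns for (4) (strong) or (5) (weak) heredity,
    # to be appended to Amat in a solver like quadprog (the matching bvec
    # entries are all 0). `pairs` is an assumed index matrix with rows (i, j, ij).
    heredity_constraints <- function(pairs, P, type = c("strong", "weak")) {
      type <- match.arg(type)
      cols <- lapply(seq_len(nrow(pairs)), function(r) {
        i <- pairs[r, 1]; j <- pairs[r, 2]; ij <- pairs[r, 3]
        if (type == "strong") {      # theta_i - theta_ij >= 0 and theta_j - theta_ij >= 0
          a1 <- a2 <- numeric(P)
          a1[c(i, ij)] <- c(1, -1)
          a2[c(j, ij)] <- c(1, -1)
          cbind(a1, a2)
        } else {                     # theta_i + theta_j - theta_ij >= 0
          a <- numeric(P)
          a[c(i, j, ij)] <- c(1, 1, -1)
          cbind(a)
        }
      })
      do.call(cbind, cols)
    }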

2.3 Nonnegative Garrote in Experimental Analysis

It is more common in the analysis of experiments (Wu and Hamada, 2021) to approximate $f$ using a linear model with both main effects (linear and quadratic) and their cross-product terms (two-factor interactions):

$$f(\mathbf{x})\approx\beta_{1}u_{1}+\beta_{2}u_{2}+\dots+\beta_{P}u_{P},\tag{6}$$

where $u_{1}=x_{1},\dots,u_{p}=x_{p},u_{p+1}=x_{1}^{2},\dots,u_{2p}=x_{p}^{2},\dots,u_{P}=x_{p-1}^{2}x_{p}^{2}$, and $P=2p+\binom{2p}{2}$. For an experiment, the data size $n$ is typically much smaller than the number of parameters $P$, making it impossible to obtain the least squares estimate required in the original nonnegative garrote. Therefore, Yuan and Lin (2007) suggested using ridge regression to obtain an initial estimate in NG.

Let $\mathbf{U}_{P}$ be the $n\times P$ model matrix. The ridge regression estimate is given by

$$\hat{\bm{\beta}}^{R}_{P}=(\mathbf{U}^{\prime}_{P}\mathbf{U}_{P}+\lambda\mathbf{I}_{P})^{-1}\mathbf{U}^{\prime}_{P}\mathbf{Y},\tag{7}$$

where $\lambda$ is a tuning parameter. Then, the NG estimate is defined as $\hat{\beta}_{j}^{NG}=\hat{\theta}_{j}(M)\hat{\beta}_{j}^{R}$, $j=1,\dots,P$, where the shrinkage factors are obtained as

$$\hat{\bm{\theta}}(M)=\underset{\bm{\theta}}{\text{argmin}}\;\frac{1}{2}\sum_{i=1}^{n}\left\{y_{i}-(\theta_{1}\hat{\beta}_{1}^{R}u_{i1}+\theta_{2}\hat{\beta}_{2}^{R}u_{i2}+\dots+\theta_{P}\hat{\beta}_{P}^{R}u_{iP})\right\}^{2}\quad\text{subject to }\sum_{j}\theta_{j}\leq M,\;\theta_{j}\geq 0,\;j=1,\dots,P,\text{ and the heredity constraints in (4) or (5)}.\tag{8}$$

To tune the hyperparameter $M$, we employ the generalized cross-validation (GCV) criterion introduced by Golub et al. (1979), which was extended by Xiong (2010) to handle NG with a ridge initial estimate. The GCV criterion is defined as

$$\text{GCV}(\hat{\bm{\theta}}(M))=\frac{\left\lVert\mathbf{Y}-\mathbf{U}_{P}\hat{\bm{\beta}}^{NG}\right\rVert^{2}_{2}}{n(1-d(\hat{\bm{\beta}}^{NG})/n)^{2}},\tag{9}$$

where $d(\hat{\bm{\beta}}^{NG})$ is the degrees of freedom of $\hat{\bm{\beta}}^{NG}$. The degrees of freedom is computed as

$$d(\hat{\bm{\beta}}^{NG})=\sum_{i=1}^{P}\hat{\theta}_{i}(M)w_{i},\tag{10}$$

where $w_{i}$ is the $(i,i)$ entry of the $P\times P$ matrix $(\mathbf{U}_{P}^{\prime}\mathbf{U}_{P}+\lambda\mathbf{I}_{P})^{-1}\mathbf{U}_{P}^{\prime}\mathbf{U}_{P}$. After constructing the solution path across a series of $M\in[0.1,n-1]$, the optimal $M$ is selected by minimizing the GCV criterion,

$$M^{\text{GCV}}=\underset{M}{\text{argmin}}\;\text{GCV}(\hat{\bm{\theta}}(M)).$$

The upper bound of $M$ is set to $n-1$ to ensure that the number of selected terms is not too high compared to the degrees of freedom in the data.
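A sketch of the resulting tuning loop is given below; ng_shrinkage_ridge is an assumed helper that solves (8) for a given $M$, and w holds the diagonal entries $w_{i}$ defined above.

    # Sketch of selecting M by GCV as in (9)-(10); ng_shrinkage_ridge is an
    # assumed helper solving (8), and w holds the diagonal entries w_i.
    select_M <- function(U, y, beta_init, w, M_grid) {
      n <- length(y)
      gcv <- sapply(M_grid, function(M) {
        theta <- ng_shrinkage_ridge(U, y, beta_init, M)   # shrinkage factors from (8)
        beta  <- theta * beta_init
        df    <- sum(theta * w)                           # degrees of freedom (10)
        sum((y - U %*% beta)^2) / (n * (1 - df / n)^2)    # GCV criterion (9)
      })
      M_grid[which.min(gcv)]
    }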

However, we will show that the ridge regression estimate is inadequate in experimental data analysis. Consider a toy example with the 12-run Plackett-Burman (PB) design. The response is generated by the true model $y=20A+10AB+5AC$. See Table 1 for the design matrix and data. We use the R package glmnet to perform the ridge regression and apply the weak heredity constraint when solving the quadratic programming problem (8). This approach identifies only the effect $A$ with $\hat{\beta}_{A}^{NG}=14.938$ and misses the two interactions $AB$ and $AC$ that are present in the true model. The issue arises from the underestimation of $\hat{\bm{\beta}}^{R}_{P}$. Due to the presence of numerous independent and irrelevant effects in the model matrix, we found that $\lambda$ is often overestimated, masking the effects of important terms. In the next section, we will propose an improved initial estimate for NG to overcome this issue.

Table 1: 12-Run Plackett-Burman Design Matrix and Data
Run A B C D E F G H I J K Y
1 1 1 -1 1 1 1 -1 -1 -1 1 -1 25
2 1 -1 1 1 1 1 -1 -1 1 1 -1 15
3 -1 1 1 1 1 -1 -1 -1 1 1 1 -35
4 1 1 1 -1 -1 -1 -1 1 1 1 -1 35
5 1 1 -1 1 -1 1 -1 1 -1 -1 1 25
6 1 -1 -1 1 1 -1 1 -1 1 1 1 5
7 -1 -1 -1 1 -1 1 -1 1 1 1 1 -5
8 -1 1 -1 1 -1 1 1 1 1 1 -1 -15
9 -1 -1 1 1 1 1 1 1 1 -1 -1 -25
10 1 -1 1 1 -1 -1 -1 1 1 -1 -1 15
11 -1 1 1 1 1 -1 1 -1 -1 1 1 -35
12 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -5
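As a rough illustration of the ridge-based analysis above, the sketch below builds the full main-effect-plus-interaction model matrix for Table 1 and obtains the ridge initial estimate; pb12 is assumed to be a data frame holding the $\pm 1$ columns A through K, y the response, and the cross-validation choice of $\lambda$ is our own simplification.

    # Sketch of the ridge initial estimate for the toy example; pb12 is assumed
    # to be a data frame with the +/-1 columns A..K of Table 1, y the response.
    library(glmnet)
    U <- model.matrix(~ .^2, data = pb12)[, -1]    # 11 main effects + 55 two-factor interactions
    fit <- cv.glmnet(U, y, alpha = 0, nfolds = 4)  # ridge regression; lambda by cross-validation
    beta_ridge <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]  # initial estimates for (8)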

3 Methodology

The ridge regression estimate in (7) is the solution to the following optimization problem:

$$\min_{\bm{\beta}_{P}}(\mathbf{Y}-\mathbf{U}_{P}\bm{\beta}_{P})^{\prime}(\mathbf{Y}-\mathbf{U}_{P}\bm{\beta}_{P})+\lambda\sum_{i=1}^{P}\beta_{i}^{2}.\tag{11}$$

Clearly, ridge regression gives equal importance to all the effects, which is undesirable because the effects are likely to follow the hierarchy and heredity principles. One approach to overcome this issue is to use generalized ridge regression (Hoerl and Kennard, 1970):

$$\min_{\bm{\beta}_{P}}(\mathbf{Y}-\mathbf{U}_{P}\bm{\beta}_{P})^{\prime}(\mathbf{Y}-\mathbf{U}_{P}\bm{\beta}_{P})+\sum_{i=1}^{P}\lambda_{i}\beta_{i}^{2},\tag{12}$$

where the $\lambda_{i}$'s can be chosen to obey the hierarchical relationships. Although generalized ridge regression is known to perform better than ridge regression in high-dimensional problems (Ishwaran and Rao, 2014), tuning $P\gg n$ parameters is an arduous task. Moreover, it is not clear how to enforce the hierarchical relationships on the $P$ ridge parameters. We will propose an idea to do this.

First note that the generalized ridge regression solution can be viewed as the posterior mean of a Bayesian linear regression with prior:

$$\bm{\beta}_{P}\sim\mathcal{N}(\mathbf{0},\tau^{2}\mathbf{R}),\tag{13}$$

where $\mathbf{R}$ is a diagonal matrix with $\mathbf{R}_{ii}=\sigma^{2}/(\tau^{2}\lambda_{i})$. Thus, we only need to determine how to specify the $\mathbf{R}$ matrix in (13). Joseph (2006) proposed an idea to do this by first postulating a Gaussian process (GP) prior (Santner et al., 2018) on the underlying function $f(\mathbf{x})$ in (1) and then inducing a prior for the linear model parameters. It is shown in Joseph (2006) and Joseph and Delaney (2007) that such priors satisfy the effect hierarchy and heredity principles under a product correlation structure. Moreover, the prior usually contains only $p$ unknown parameters, which is much smaller than $P$, and there are well-developed methods to estimate them. This is explained in the next section.

3.1 Functionally Induced Prior

Let the factor $x_{j}$ have $m_{j}$ levels in the experiment, $j=1,\ldots,p$. Consider the model in (6). Although it has only the main effects (linear and quadratic) and their cross-product terms (two-factor interactions, 2fi), it is easier to construct the prior for all possible effects. Thus, we approximate $f$ by a linear model including main effects and all the interactions, i.e.,

$$f(\mathbf{x})\approx\sum_{j=0}^{q-1}\beta_{j}u_{j},\tag{14}$$

where $q=\prod_{i=1}^{p}m_{i}$ and $u_{0}=1$. To avoid confusion, we let $\bm{\beta}_{q}=(\beta_{0},\beta_{1},\dots,\beta_{q-1})^{\prime}$.

The main effects of the factor $x_{j}$ can be described by $m_{j}-1$ dummy variables, and the interactions can be expressed as the products of these dummy variables. For example, consider an experiment with two factors, each at three levels. Let $u_{1}$ and $u_{2}$ be the main effects of the first factor and $u_{3}$ and $u_{4}$ be the main effects of the second factor. The interaction terms are $u_{5}=u_{1}u_{3}$, $u_{6}=u_{1}u_{4}$, $u_{7}=u_{2}u_{3}$, and $u_{8}=u_{2}u_{4}$. Many popular coding systems can be used to define dummy variables, such as treatment coding, Helmert coding, and orthogonal polynomial coding (Wu and Hamada, 2021; Harville, 1998).
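For instance, base R contrasts can be scaled to produce the coding matrices used in this paper; the sketch below is ours, with scaling constants chosen to match the $\mathbf{U}_{j}$ matrices displayed later.

    # Sketch of two coding systems in base R; the scalings match the U_j matrices
    # shown below (orthogonal polynomial) and in the Appendix (Helmert).
    cbind(1, sqrt(3) * contr.poly(3))              # orthogonal polynomial coding, 3 levels
    cbind(1, sqrt(3/2) * contr.helmert(3)[, 1],    # scaled Helmert coding, 3 levels
             sqrt(1/2) * contr.helmert(3)[, 2])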

The idea of functionally induced priors (Joseph, 2006) is to postulate a GP prior for $f$ and use it to induce the prior for $\bm{\beta}_{q}$. Thus, assume

$$f(\mathbf{x})\sim\text{GP}(0,\nu^{2}\psi),\tag{15}$$

where $\nu^{2}\psi$ is the covariance function of the Gaussian process, defined as $\text{cov}\{f(\mathbf{x}),f(\mathbf{x}+\mathbf{h})\}=\nu^{2}\psi(\mathbf{h})$. Thus, we have $\mathbf{f}=\mathbf{U}_{q}\bm{\beta}_{q}$, where $\mathbf{f}$ is the vector of function values for the full factorial design and $\mathbf{U}_{q}$ the $q\times q$ model matrix corresponding to (14). Note that $\mathbf{f}$ has a multivariate normal distribution with $E(\mathbf{f})=\mathbf{0}$ and $\text{var}(\mathbf{f})=\nu^{2}\bm{\Psi}$, where $\bm{\Psi}$ is the corresponding correlation matrix. Since $\bm{\beta}_{q}=\mathbf{U}_{q}^{-1}\mathbf{f}$, we obtain

$$\bm{\beta}_{q}\sim N(\bm{0},\nu^{2}\mathbf{U}^{-1}_{q}\bm{\Psi}(\mathbf{U}_{q}^{-1})^{\prime}).\tag{16}$$

Assume that the correlation function $\psi$ has a product correlation structure, i.e.,

$$\psi(\mathbf{h})=\prod_{j=1}^{p}\psi_{j}(h_{j}).$$

Under the product correlation structure, Joseph and Delaney (2007) have shown that

$$\text{var}(\bm{\beta}_{q})=\nu^{2}\bigotimes_{j=1}^{p}\mathbf{U}_{j}^{-1}\mathbf{\Psi}_{j}(\mathbf{U}_{j}^{-1})^{\prime},\tag{17}$$

where $\bigotimes$ denotes the Kronecker product, $\mathbf{U}_{j}$ is the model matrix for factor $x_{j}$, and $\bm{\Psi}_{j}$ the corresponding correlation matrix. For example, suppose that factor $x_{j}$ is a three-level factor with levels 1, 2, and 3. Then the model matrix using orthogonal polynomial coding is

$$\mathbf{U}_{j}=\begin{pmatrix}1&-\sqrt{\frac{3}{2}}&\sqrt{\frac{1}{2}}\\ 1&0&-\sqrt{2}\\ 1&\sqrt{\frac{3}{2}}&\sqrt{\frac{1}{2}}\end{pmatrix},$$

and the correlation matrix is

$$\bm{\Psi}_{j}=\begin{pmatrix}1&\psi_{j}(1)&\psi_{j}(2)\\ \psi_{j}(1)&1&\psi_{j}(1)\\ \psi_{j}(2)&\psi_{j}(1)&1\end{pmatrix}.$$

The benefit of this structure is that we can select the most appropriate coding system and correlation function for each factor according to our modeling requirements. Due to its popularity, we use the Gaussian correlation function for each factor (Santner et al., 2018): $\psi_{j}(h_{j})=\exp(-h_{j}^{2}/\theta_{j}^{2})$, $0<\theta_{j}<\infty$. Let $\rho_{j}=\exp(-1/\theta_{j}^{2})$. Then, the correlation function can be written as

$$\psi_{j}(h_{j})=\rho_{j}^{h_{j}^{2}},\quad 0<\rho_{j}<1.\tag{18}$$

Although the form of the correlation function is the same for all factors, the definition of $h_{j}$ should depend on the type of factor.

For quantitative factors, we define $h_{j}=|\mathbf{x}_{ij}-\mathbf{x}_{kj}|$ for two runs $i$ and $k$. Suppose that there are $p$ quantitative three-level factors and their model matrices are encoded using orthogonal polynomial coding. Then, (16) simplifies to (Joseph and Delaney, 2007):

$$\beta_{i}\sim\mathcal{N}\left(0,\tau^{2}\prod_{j=1}^{p}r_{j_{l}}^{l_{ij}}r_{j_{q}}^{q_{ij}}\right),\quad i=0,1,\dots,3^{p}-1,\tag{19}$$

where

$$\tau^{2}=\frac{\nu^{2}}{3^{2p}}\prod_{j=1}^{p}(3+4\rho_{j}+2\rho_{j}^{4}),\quad r_{j_{l}}=\frac{3-3\rho_{j}^{4}}{3+4\rho_{j}+2\rho_{j}^{4}},\quad r_{j_{q}}=\frac{3-4\rho_{j}+\rho_{j}^{4}}{3+4\rho_{j}+2\rho_{j}^{4}},$$

$l_{ij}=1$ if $\beta_{i}$ includes the linear effect of factor $j$ and 0 otherwise, and $q_{ij}=1$ if $\beta_{i}$ includes the quadratic effect of factor $j$ and 0 otherwise. Although the prior for $\bm{\beta}_{q}$ does include covariance terms, we do not need them to construct the diagonal matrix $\mathbf{R}$ in (13).

Since $r_{j_{l}},r_{j_{q}}\in(0,1)$, the variance of an interaction effect will be smaller than that of its parent effects. In other words, the lower-order effects are likely to be more important than the higher-order effects, and thus the priors satisfy the effect hierarchy principle. Moreover, since the variance of an interaction effect is proportional to the product of the variances of its parent effects, the priors also satisfy the effect heredity principle.
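The construction in (17) and (19) is easy to verify numerically. The sketch below computes the induced prior variances for two quantitative three-level factors; the $\rho$ values are arbitrary illustrative choices of ours.

    # Sketch: induced prior variances for p = 2 quantitative three-level factors.
    rho <- c(0.6, 0.3)                                  # arbitrary illustrative values
    U   <- cbind(1, sqrt(3) * contr.poly(3))            # U_j from the example above
    Psi <- function(r) r^(abs(outer(1:3, 1:3, "-"))^2)  # Gaussian correlation (18)
    blk <- lapply(rho, function(r) solve(U) %*% Psi(r) %*% t(solve(U)))
    V   <- Reduce(kronecker, blk)                       # var(beta_q)/nu^2 as in (17)
    round(diag(V), 4)   # diagonal reproduces the prior variances in (19), up to nu^2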

For qualitative factors, where the relative distances between levels cannot be quantified, Joseph and Delaney (2007) define $h_{j}=0$ if $\mathbf{x}_{ij}=\mathbf{x}_{kj}$ and 1 otherwise, i.e., $h_{j}=H(\mathbf{x}_{ij},\mathbf{x}_{kj})$, where $H(\cdot,\cdot)$ is the Hamming distance. However, this definition assigns equal importance to all the main effects of $x_{j}$ even if not all of them are active. Since the main effects are represented by dummy variables, we define the relative distance for each dummy variable of $x_{j}$ as $h_{j_{d}}=H(\mathbf{x}_{ij_{d}},\mathbf{x}_{kj_{d}})$, $d=1,\dots,m_{j}-1$, where the subscript $d$ stands for the $d$th dummy variable. Then, according to (18), the corresponding correlation function is $\psi_{j_{d}}(h_{j_{d}})=\rho_{j_{d}}^{h_{j_{d}}^{2}}$, $0<\rho_{j_{d}}<1$.

Suppose that there are $p$ qualitative three-level factors and their model matrices are encoded according to Helmert coding. Then, as shown in the Appendix, the marginals of the prior in (16) become

$$\beta_{i}\sim\mathcal{N}\left(0,\tau^{2}\prod_{j=1}^{p}r_{j_{1}}^{m_{ij_{1}}}r_{j_{2}}^{m_{ij_{2}}}\right),\quad i=0,1,\dots,3^{p}-1,\tag{20}$$

where

$$\tau^{2}=\frac{\nu^{2}}{3^{2p}}\prod_{j=1}^{p}(3+2\rho_{j_{1}}+4\rho_{j_{1}}\rho_{j_{2}}),\quad r_{j_{1}}=\frac{3(1-\rho_{j_{1}})}{3+2\rho_{j_{1}}+4\rho_{j_{1}}\rho_{j_{2}}},\quad r_{j_{2}}=\frac{3+\rho_{j_{1}}-4\rho_{j_{1}}\rho_{j_{2}}}{3+2\rho_{j_{1}}+4\rho_{j_{1}}\rho_{j_{2}}},$$

$m_{ij_{1}}=1$ if $\beta_{i}$ includes the first main effect of factor $j$ and 0 otherwise, and $m_{ij_{2}}=1$ if $\beta_{i}$ includes the second main effect of factor $j$ and 0 otherwise. As in the case of quantitative factors, effect hierarchy and heredity are built into these priors.

3.2 Initial Estimate

Our aim is to obtain an initial estimate of $\bm{\beta}_{P}$ from $\mathbf{Y}=\mathbf{U}_{P}\bm{\beta}_{P}+\bm{\epsilon}$, where $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\sigma^{2}\mathbf{I})$ and $\bm{\beta}_{P}\sim\mathcal{N}(\mathbf{0},\tau^{2}\mathbf{R})$. Here $\mathbf{R}$ is a diagonal matrix, constructed as in (19) and (20). The posterior mean of $\bm{\beta}_{P}$ is given by

$$\hat{\bm{\beta}}_{P}=\left(\mathbf{U}_{P}^{\prime}\mathbf{U}_{P}+\frac{\sigma^{2}}{\tau^{2}}\mathbf{R}^{-1}\right)^{-1}\mathbf{U}_{P}^{\prime}\mathbf{Y}=\tau^{2}\mathbf{R}\mathbf{U}_{P}^{\prime}\left(\tau^{2}\mathbf{U}_{P}\mathbf{R}\mathbf{U}_{P}^{\prime}+\sigma^{2}\mathbf{I}_{n}\right)^{-1}\mathbf{Y}.\tag{21}$$

To obtain the initial estimate of $\bm{\beta}_{P}$ using (21), several parameters need to be specified: $\sigma^{2}$, $\tau^{2}$, and the $\bm{\rho}$ that defines $\mathbf{R}$. Let $\bm{\rho}=(\bm{\rho}_{1}^{\prime},\dots,\bm{\rho}_{p}^{\prime})^{\prime}$, where each $\bm{\rho}_{j}$ can be a vector. A good estimate of $\sigma^{2}$ can be obtained if the experiment contains replicates. Suppose that there are $m$ replicates in each run and let $s_{i}^{2}$ be the sample variance of the $i$th run, $i=1,\dots,n$. Then $\hat{\sigma}^{2}=\sum_{i=1}^{n}s_{i}^{2}/n$ is a good estimate of $\sigma^{2}$. However, if the experiment is unreplicated, there is no information to estimate $\sigma^{2}$. Instead of assuming that there is no noise, we assume that the noise contributes at least 1% of the variation in the response, which prevents overfitting. Therefore, we reparameterize (21) as

$$\hat{\bm{\beta}}_{P}=\frac{\tau^{2}}{\nu^{2}}\mathbf{R}\mathbf{U}_{P}^{\prime}\left[\frac{\tau^{2}}{\nu^{2}}\mathbf{U}_{P}\mathbf{R}\mathbf{U}_{P}^{\prime}+\frac{\lambda}{1-\lambda}\mathbf{I}_{n}\right]^{-1}\mathbf{Y},\tag{22}$$

where $\lambda=\frac{\sigma^{2}}{\sigma^{2}+\nu^{2}}\geq 0.01$. This form can also be used for replicated experiments as long as we replace $\mathbf{Y}$ with the sample means and divide $\mathbf{I}_{n}$ by $m$.
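A direct implementation of (22) is a few lines of linear algebra. In the sketch below, Rdiag, t2n2 ($=\tau^{2}/\nu^{2}$), and lambda are assumed to be supplied by the steps described next.

    # Sketch of the initial estimate (22). U_P is the n x P model matrix, Rdiag
    # the diagonal of R from (19)/(20), t2n2 = tau^2/nu^2, and lambda the noise
    # ratio; all are assumed to come from the surrounding steps.
    initial_estimate <- function(U_P, Y, Rdiag, t2n2, lambda) {
      n   <- nrow(U_P)
      RUt <- Rdiag * t(U_P)                                         # R U_P'
      K   <- t2n2 * U_P %*% RUt + (lambda / (1 - lambda)) * diag(n)
      drop(t2n2 * RUt %*% solve(K, Y))                              # beta_hat_P
    }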

Joseph and Delaney (2007) have shown that if we use orthogonal coding for each factor,

$$\frac{\tau^{2}}{\nu^{2}}=\frac{\prod_{j=1}^{p}\text{sum}(\mathbf{\Psi}_{j})}{q^{2}},$$

where $\text{sum}(\mathbf{\Psi}_{j})$ is the sum of all elements in $\mathbf{\Psi}_{j}$. Therefore, we only need to focus on the specification of $\bm{\rho}$ and $\lambda$. These hyperparameters can be estimated by minimizing the negative log-likelihood of $\mathbf{Y}$ (Santner et al., 2018). Thus,

$$\hat{\lambda},\;\hat{\bm{\rho}}=\underset{\lambda,\bm{\rho}}{\text{argmin}}\;\log\hat{\nu}^{2}+\frac{1}{n}\log\det\left(\bm{\Psi}_{n}+\frac{\lambda}{1-\lambda}\mathbf{I}_{n}\right),\qquad\hat{\nu}^{2}=\frac{1}{n}\mathbf{Y}^{\prime}\left(\bm{\Psi}_{n}+\frac{\lambda}{1-\lambda}\mathbf{I}_{n}\right)^{-1}\mathbf{Y},\tag{23}$$

where $\bm{\Psi}_{n}$ is the correlation matrix of the $n$ runs.
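The objective in (23) translates directly into code; in the sketch below, Psi_n is an assumed function that builds the $n\times n$ correlation matrix of the design from $\bm{\rho}$.

    # Sketch of the negative log-likelihood objective in (23); Psi_n is an
    # assumed function building the n x n correlation matrix from rho.
    neg_loglik <- function(par, Y, Psi_n) {
      k      <- length(par) - 1
      rho    <- par[1:k]
      lambda <- par[k + 1]
      n      <- length(Y)
      A      <- Psi_n(rho) + (lambda / (1 - lambda)) * diag(n)
      nu2    <- drop(crossprod(Y, solve(A, Y))) / n            # nu^2 hat in (23)
      log(nu2) + as.numeric(determinant(A, logarithm = TRUE)$modulus) / n
    }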

Unlike the usual applications of GP modeling, we are interested not only in accurate prediction of the output but also in the selection of important effects. Therefore, it is very important to obtain the global optimum of (23). This is not easy because when $n$ is small, several models can fit the data well, and therefore the likelihood can be multi-modal with several local optima. While numerous global optimization algorithms are available for this purpose, we need one that requires only a few function evaluations and is robust to the choice of starting points. Therefore, we adopt a gradient-based multi-start global optimization strategy.

For numerical stability, we put mild constraints on the hyperparameters: $\rho_{i}\in[10^{-15},.999]$ for all $i=1,\ldots,k$, and $\lambda\in[.01,.99]$. First, we generate a space-filling design with $k+1$ starting points in the feasible region; here, we use the MaxPro design of Joseph et al. (2015). Second, for each starting point, we search for a local optimum using a local optimization algorithm; here, we use the Method of Moving Asymptotes (Svanberg, 2002). Among those local optima, the one with the lowest negative log-likelihood value is used as the estimate of the hyperparameters. Importantly, we choose a derivative-based approach not only because gradients help reduce the number of function evaluations but also because it gives more robust results than derivative-free algorithms.
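A simplified sketch of the multi-start search is given below. Uniform random starting points stand in for the MaxPro design and optim's L-BFGS-B stands in for the Method of Moving Asymptotes, so this is only a rough stand-in for the actual procedure.

    # Simplified multi-start sketch: random starts stand in for the MaxPro
    # design, and L-BFGS-B for the Method of Moving Asymptotes; neg_loglik is
    # the objective sketched above.
    multistart_fit <- function(Y, Psi_n, k, n_starts = k + 1) {
      lower  <- c(rep(1e-15, k), 0.01)
      upper  <- c(rep(0.999, k), 0.99)
      starts <- matrix(runif(n_starts * (k + 1), lower, upper),
                       nrow = n_starts, byrow = TRUE)
      fits <- lapply(seq_len(n_starts), function(s)
        optim(starts[s, ], neg_loglik, Y = Y, Psi_n = Psi_n,
              method = "L-BFGS-B", lower = lower, upper = upper))
      fits[[which.min(vapply(fits, `[[`, numeric(1), "value"))]]$par
    }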

3.3 HiGarrote

Once the hyperparameters are estimated from (23), the shrinkage factors and the regression coefficients can be obtained as in (8) using the initial estimates from (22). The tuning parameter $M$ in (8) can be selected as described in Section 2.3. A solution path across a range of $M\in[0.1,0.3(n-1)]$ is first constructed. The optimal $M$ is then selected by minimizing the GCV criterion (9), where the degrees of freedom is calculated as in (10) with

$$w_{i}=\frac{\tau^{2}}{\nu^{2}}\left(\mathbf{R}\mathbf{U}_{P}^{\prime}\left[\frac{\tau^{2}}{\nu^{2}}\mathbf{U}_{P}\mathbf{R}\mathbf{U}_{P}^{\prime}+\frac{\lambda}{1-\lambda}\mathbf{I}_{n}\right]^{-1}\mathbf{U}_{P}\right)_{ii}.$$

The upper bound of $M$ used in the optimization is based on the findings of Box and Meyer (1986) and Wu and Hamada (2021, p. 424), which indicate that the proportion of active effects (with the intercept term included) typically ranges between 0.13 and 0.27 of the design's run size.

For replicated experiments, where a reliable estimate of $\sigma^{2}$ can be obtained, the shrinkage factors can be directly computed by solving (Xiong, 2010):

$$\hat{\bm{\theta}}=\underset{\bm{\theta}}{\text{argmin}}\;\frac{1}{2}\left\lVert\mathbf{Y}-\mathbf{U}_{P}\hat{\bm{\beta}}^{NG}\right\rVert^{2}_{2}+\hat{\sigma}^{2}d(\hat{\bm{\beta}}^{NG})\quad\text{subject to the heredity constraints in (4) or (5)}.\tag{24}$$

Algorithm 1 summarizes the proposed Hierarchical Garrote (HiGarrote) method. Most of the computational overhead of the method is in step 1 of the algorithm, i.e., the optimization in (23).

Algorithm 1 HiGarrote
1:  Estimate $\lambda$ and $\bm{\rho}$ by (23)
2:  Obtain the initial estimate $\hat{\bm{\beta}}_{P}$ by (22)
3:  Obtain $\hat{\bm{\theta}}$ by solving the quadratic programming problem in (8) or (24)
4:  Return $\hat{\bm{\beta}}^{NG}$ by scaling $\hat{\bm{\beta}}_{P}$ with $\hat{\bm{\theta}}$
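Putting the pieces together, a rough end-to-end sketch of Algorithm 1 might look as follows; build_R (assembling the diagonal of $\mathbf{R}$ from (19)/(20)), t2n2 (the closed form for $\tau^{2}/\nu^{2}$ above), ng_weights (computing the $w_{i}$), and ng_shrinkage_ridge (solving (8)) are assumed helpers, not part of the paper.

    # End-to-end sketch of Algorithm 1, wiring together the sketches above;
    # build_R, t2n2, ng_weights, and ng_shrinkage_ridge are assumed helpers.
    higarrote <- function(U_P, y, Psi_n, k) {
      par    <- multistart_fit(y, Psi_n, k)                        # step 1: optimize (23)
      rho    <- par[1:k]; lambda <- par[k + 1]
      Rdiag  <- build_R(rho)                                       # diagonal of R
      beta0  <- initial_estimate(U_P, y, Rdiag, t2n2(rho), lambda) # step 2: (22)
      w      <- ng_weights(U_P, Rdiag, t2n2(rho), lambda)          # w_i for (10)
      M_grid <- seq(0.1, 0.3 * (length(y) - 1), length.out = 50)
      M      <- select_M(U_P, y, beta0, w, M_grid)                 # M by GCV
      theta  <- ng_shrinkage_ridge(U_P, y, beta0, M)               # step 3: solve (8)
      theta * beta0                                                # step 4: NG estimate
    }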

3.4 Simulation

To demonstrate the effectiveness of HiGarrote, a simulation study is conducted using the toy example described in Section 2.3, based on a 12-run PB design. The response is generated from the model

$$y=20A+10AB+5AC+\epsilon,\qquad\epsilon\overset{iid}{\sim}\mathcal{N}(0,1).$$

The simulation is repeated 100 times, with new data generated each time. For each simulated dataset, both HiGarrote and the nonnegative garrote (NG) with ridge estimate (Yuan et al., 2009) are fitted using weak heredity.
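For reference, one simulation replicate can be generated as below; pb12 is assumed to hold the $\pm 1$ design of Table 1, and the seed is arbitrary.

    # Sketch of one simulation replicate; pb12 is assumed to hold the +/-1
    # design of Table 1 with columns A..K.
    set.seed(2024)
    y_sim <- 20 * pb12$A + 10 * pb12$A * pb12$B + 5 * pb12$A * pb12$C + rnorm(12)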

Figure 1 presents the boxplots of the estimates of selected effects using NG with ridge estimate and HiGarrote.

Figure 1: Comparison of selected effects’ estimates using NG with ridge estimate (left) and HiGarrote (right).

As shown in Figure 1, the NG with ridge estimate consistently identifies only the main effect A and occasionally AB. However, the estimate for A shows large variability, and the estimate for AB remains close to 0. Importantly, the crucial interaction AC is never selected, indicating its limitation in identifying weaker yet significant interactions. In contrast, HiGarrote successfully identifies the important effects A, AB, and AC, with estimates close to their true values of 20, 10, and 5. While HiGarrote occasionally selects additional effects such as B, C, and BC, their estimates are generally very small (near 0) with low variability.

4 Real Data Examples

In this section, we compare HiGarrote's performance with that of several existing methods on different types of experimental designs: regular, nonregular, and screening designs. The following methods are chosen for comparison.

stepH: This is the stepwise variable selection strategy proposed in Hamada and Wu (1992).

BayesH: This is the Bayesian variable selection technique proposed in Chipman et al. (1997), which incorporates heredity principles in the SSVS algorithm.

hierNet: This is the LASSO-based method proposed in Bien et al. (2013), which is implemented in the R package hierNet.

RAMP: This is the LASSO-based method proposed in Hao et al. (2018), which is implemented in the R package RAMP.

GDSARM: This is the Gauss-Dantzig Selector aggregation proposed in Singh and Stufken (2024), which is implemented in the R package GDSARM.

To keep the exposition simple, all the methods are implemented using the weak heredity principle. We are not aware of any software packages that implement stepH and BayesH; therefore, comparisons with these two methods are limited to the cases available in the literature. The two methods hierNet and RAMP are developed for general-purpose regression problems and are therefore not suitable for the analysis of certain experimental designs such as mixed-level designs. Therefore, their results are reported only when we are able to implement them in the example. GDSARM can be applied only to two-level designs, so it is implemented only in the examples in Sections 4.1.1 and 4.2.1. On the other hand, HiGarrote is general and flexible, and can be used with any type of experimental design.

4.1 Regular Designs

4.1.1 Two-Level Regular Design

Consider the $2^{9-5}$ experiment reported by Raghavarao and Altan (2003), where the nine factors are labeled A to J, excluding I. See Table S1 in the supplementary materials for details of the design matrix and data. Using half-normal plots and Lenth's method, the aliased effects AH, J, E, G, and CH have smaller p-values, although none of them are significant at the 0.05 level. Due to the properties of the design matrix, all the two-factor interactions (2fi) are fully aliased with either main effects or other 2fis. By ignoring three- and higher-order interactions, we can establish the following aliasing relationships among these five effects:

$$J=-CF,\quad E=-BC,\quad G=-AB=-FH,\quad AH=BF=DG=EJ,\quad CH=GJ=DE.$$

By applying the effect hierarchy principle, we can select J, E, and G from the first three aliasing relationships. Now, using effect heredity, the last two aliasing relationships become $DG=EJ$ and $GJ=DE$. These effects cannot be disentangled because G, J, and E are all active. Therefore, Raghavarao and Altan (2003) conducted a heuristic analysis and ultimately concluded that the effects J, E, G, EJ, and GJ are active. Their findings are supported by scientists, who believed that the main effects of E, G, and J, as well as their potential interactions, are likely to be active.

Now consider HiGarrote. As summarized in Table 2, HiGarrote automatically identifies the five active effects without requiring prior knowledge of the aliasing relationships or domain expertise from scientists. This is no accident, and the reason behind its selection can be explained as follows. The prior variance of $\beta_{EJ}$ is $.0991\tau^{2}$, whereas that of $\beta_{DG}$ is $5.3\times 10^{-5}\tau^{2}$. The observed effect of $EJ=DG$ is divided among the two aliased effects in proportion to their prior variances (Joseph, 2006). Therefore, the initial estimate of $\beta_{EJ}$ is much larger than that of $\beta_{DG}$, which leads to the automatic selection of EJ by the nonnegative garrote instead of DG. The same thing happens between GJ and DE. Although HiGarrote selects three additional effects, H, HJ, and B, their estimates are relatively low compared to the five active effects. In terms of computational time, tested on a laptop equipped with an Apple M2 chip, HiGarrote takes 0.54 seconds to analyze the experiment. For ease of comparison, the $R^{2}$ values for all the methods are calculated based on least squares. We also applied hierNet, RAMP, and GDSARM to this experiment. However, none of these methods identified any effects.

Table 2: Selected Effects in the $2^{9-5}$ Experiment
Method | Model | $R^{2}$
Raghavarao and Altan (2003) | EJ(-1.70), J(-1.67), E(1.55), G(1.49), GJ(1.39) | 70%
HiGarrote | EJ(-1.29), J(-1.26), E(1.09), G(1.02), GJ(0.87), H(0.51), HJ(-0.2), B(0.17) | 89%

4.1.2 Mixed-Level Regular Design

Phadke (1986) described a 32-run experiment aimed at increasing the lifespan of router bits used in a routing process to cut printed wiring boards from a panel. The experiment contains seven two-level factors and two four-level qualitative factors, which are denoted by A through J with the exclusion of I. See Table S2 and Table S3 in the supplementary materials for details of the factors, design matrix, and data. For the two qualitative factors, we adopt a convenient coding scheme offered by Wu and Hamada (2021). The coding scheme generates effects that can be interpreted as the average difference between pairs of levels. Thus, the coding matrix for factors D and E is

$$\mathbf{U}_{j}=\begin{pmatrix}1&-1&1&-1\\ 1&-1&-1&1\\ 1&1&-1&-1\\ 1&1&1&1\end{pmatrix}.$$

The main effects of factors D and E are labeled $D_{1}$, $D_{2}$, $D_{3}$ and $E_{1}$, $E_{2}$, $E_{3}$, respectively.

This is a regular design and therefore, one can write down all the 15 aliasing relationships. Wu and Hamada (2021) used the effect hierarchy and heredity principles to identify the following effects: $D_{2}$, $G$, $J$, $GJ$, and $E_{1}G=D_{2}H$. Note that the last aliased effect could not be disentangled because both $D_{2}$ and $G$ are significant. However, using their physical understanding of the routing process, they argued that $E_{1}G$ is unlikely to be significant because the four spindles of the routing machine are synchronized. Thus, they finally chose $D_{2}H$.

On the other hand, as summarized in Table 3, HiGarrote automatically identifies the five effects without using any physical understanding of the routing process. In addition, it also selects a few more effects: $B$, $D_{1}$, $E_{3}$, $GH$, $HJ$, and $E_{3}H$. In terms of computational time, HiGarrote takes 6.38 seconds to analyze the experiment. As mentioned before, since hierNet and RAMP do not generate interactions that are suitable for this experiment, their performance is not reported here.

Table 3: Selected Effects in Router Bit Experiment
Method | Model | $R^{2}$
Wu and Hamada (2021) | $D_{2}(2.75)$, $G(-2.56)$, $J(2.31)$, $GJ(-2.25)$, $D_{2}H(2.25)$ | 62%
HiGarrote | $D_{2}(2.53)$, $G(-2.33)$, $J(2.06)$, $GJ(-1.98)$, $D_{2}H(1.97)$, $GH(1.33)$, $E_{3}(-1.22)$, $B(-1.14)$, $D_{1}(.92)$, $HJ(-.86)$, $E_{3}H(-.30)$ | 90%

4.2 Nonregular Designs

4.2.1 Two-Level Nonregular Design

Hunter et al. (1982) used a 12-run Plackett-Burman design to investigate the effects of seven two-level factors on the fatigue life of weld-repaired castings. The seven factors are denoted by capital letters A through G. The details of the factors, experimental design, and data are given in Table S4 and Table S5 of the supplementary materials. According to the analysis conducted by Hunter et al. (1982), the main effects of factors D and F were found to be significant, but the $R^{2}$ of the model is only 59%. In a very influential work, Hamada and Wu (1992) reanalyzed the experiment using an iterative stepwise regression strategy that incorporates effect heredity (stepH). They found that just adding the two-factor interaction FG to the model increases $R^{2}$ to 92%. They also found that removing the main effect D from the model decreases the $R^{2}$ only to 89%, which implies that the effect of D is negligible. Thus, they claim that effects F and FG are active.

hierNet, with the model selected by leave-one-out cross-validation, successfully identifies the two important effects F and FG; however, it also includes many insignificant effects in the model. On the other hand, we found that RAMP performs well in this experiment, selecting the same model as stepH. However, an issue with RAMP is that its algorithm includes quadratic effects such as $A^{2}$, $B^{2}$, $\dots$, $G^{2}$. Since there are only two levels in the experiment, these quadratic effects reduce to the intercept and will not be selected, resulting in unnecessary computation. GDSARM, using the settings provided in Singh and Stufken (2024), successfully identified the effects F and FG. However, the algorithm includes several hyperparameters that may require manual tuning for optimal performance.

HiGarrote identifies five effects: F, FG, D, G, and DG, with $R^{2}=96\%$. Compared to the model selected by stepH, the inclusion of effects D, G, and DG increases the $R^{2}$ by only 7%. Although HiGarrote selects three insignificant effects, their estimates are low compared to those of the important effects. The comparisons are summarized in Table 4, with the numbers in parentheses indicating the effect estimates. The computational time is also reported in the table. HiGarrote is the second slowest among the methods considered in this example, but 0.26 seconds will not be a computational burden for the analyst!

Table 4: Selected Effects in Cast Fatigue Experiment
Method | Model | $R^{2}$ | Runtime
Hunter et al. (1982) | F(.46), D(-.26) | 59% |
stepH | F(.46), FG(-.46) | 89% | .13 sec
hierNet | F(.44), FG(-.26), D(-.17), A(.08), G(.08), DG(.06), B(.05), BC(-.04), AE(-.04), AD(-.03), C(-.02) | 100% | .08 sec
RAMP | F(.48), FG(-.5) | 89% | .06 sec
GDSARM | F(.46), FG(-.46) | 89% | .33 sec
HiGarrote | F(.44), FG(-.43), D(-.05), G(.04), DG(.03) | 96% | .26 sec

4.2.2 Mixed-Level Nonregular Design

Hamada and Wu (1992) analyzed an 18-run experiment designed to study the blood glucose readings of a clinical testing device. The experiment contains one two-level factor and seven three-level quantitative factors, denoted by A through H. The details of the factors, design matrix, and data are given in Table S6 and Table S7 of the supplementary materials. Since the experiment contains three-level quantitative factors, we can entertain quadratic terms and their interactions. The interactions include four components: linear-by-linear, linear-by-quadratic, quadratic-by-linear, and quadratic-by-quadratic. So, we search for active effects among 15 main effects (one linear effect for the two-level factor, and linear and quadratic effects for each of the seven three-level factors) and 98 interactions. In the following comparison, the subscripts $l$ and $q$ indicate the linear and quadratic terms, respectively.

The analysis carried out by Hamada and Wu (1992) using stepH chooses the effects $E_{q}$, $F_{q}$, and $E_{l}F_{l}$. The model only achieves an $R^{2}$ of 68%. On the other hand, Chipman et al. (1997) using BayesH identified $B_{l}$, $B_{l}H_{l}$, $B_{l}H_{q}$, and $B_{q}H_{q}$ with $R^{2}=86\%$. We observed that removing $B_{l}H_{l}$ from the model only marginally reduces the $R^{2}$ to 85%, suggesting that the effect of $B_{l}H_{l}$ is negligible.

It is not ideal to use hierNet and RAMP to analyze this experiment because they construct interactions by crossing input variables. If we treat factors A to H as input variables, the interactions between linear and quadratic effects would not be entertained. On the other hand, if we treat all linear and quadratic effects as input variables, interactions between effects of the same factor, such as $B_{l}B_{q}$, would be included, potentially leading to the selection of cubic effects. Although treating all linear and quadratic effects as input variables may initially seem reasonable, our analysis revealed that neither hierNet nor RAMP selected any effects. This suggests that both methods are not suitable for analyzing mixed-level designs.

HiGarrote, on the other hand, identifies more significant effects than the existing strategies. The model selected by HiGarrote contains eight effects, $B_{l}$, $B_{q}$, $F_{l}$, $H_{l}$, $H_{q}$, $B_{l}H_{q}$, $B_{q}H_{l}$, and $B_{q}H_{q}$, with an $R^{2}$ of 97%. In the selected model, all the effects except $H_{l}$ and $H_{q}$ are significant. In terms of computational time, it takes 1.40 seconds for HiGarrote to analyze the experiment. The results are summarized in Table 5.

Table 5: Selected Effects in Blood Glucose Experiment
Method | Model | $R^{2}$
stepH | $E_{l}F_{l}(-5.52)$, $E_{q}(4.33)$, $F_{q}(4.02)$ | 68%
BayesH | $B_{l}H_{q}(6.64)$, $B_{q}H_{q}(-5.43)$, $B_{l}(-2.85)$, $B_{l}H_{l}(.71)$ | 86%
HiGarrote | $B_{l}H_{q}(6.52)$, $B_{q}H_{q}(-5.10)$, $B_{l}(-2.60)$, $B_{q}(1.28)$, $B_{q}H_{l}(.99)$, $H_{l}(-.45)$, $F_{l}(-.34)$, $H_{q}(-.05)$ | 97%

4.3 Screening Designs

4.3.1 Definitive Screening Design

Jones and Lanzerath (2021) described a 21-run experiment aimed at creating a special resin for car vents designed to prevent moisture accumulation within a car's housing. The experiment is carried out using a definitive screening design (DSD) with nine continuous factors, labeled A through J with the exclusion of I. Details of the factors and dataset are provided in Table S8 and Table S9 of the supplementary materials. The response is taken as the logarithm of Impurity.

According to the analysis carried out by Jones and Lanzerath (2021), the linear effects A and F and their interaction AF are identified as significant, with an $R^{2}$ of 93%. In addition to the three effects selected by Jones and Lanzerath (2021), HiGarrote identifies five more effects, four of which are significant. The $R^{2}$ increases to 99%. In terms of computational time, it takes 1.24 seconds for HiGarrote to analyze the experiment. Since both hierNet and RAMP can handle the full quadratic model, their performance can be examined here. It turns out that hierNet does not perform well in this experiment because too many effects are selected: in total, the selected model contains 16 effects, comprising 7 linear effects and 9 interactions. On the other hand, RAMP identifies only F. Since F is highly important, RAMP's model is good, with an $R^{2}$ of 88%. The results are summarized in Table 6. To keep the table concise, the selected model of hierNet is not reported.

Table 6: Selected Effects in Resin Experiment
Method | Model | $R^{2}$
Jones and Lanzerath (2021) | $F(-2.21)$, $A(.47)$, $AF(.23)$ | 93%
RAMP | $F(-2.21)$ | 88%
HiGarrote | $F(-2.20)$, $A(.43)$, $FJ(-.30)$, $E(-.23)$, $BF(.16)$, $AJ(-.10)$, $AF(.05)$, $F^{2}(.04)$ | 99%

4.3.2 Supersaturated Design

Lin (1993) employed a half fraction of a 28-run PB design to investigate an experiment aimed at developing an epoxy adhesive system, as described by Williams (1968). The original design matrix contains 14 runs and 24 factors. Since factors 13 and 16 were assigned to the same column in the original design matrix, only factor 13 is listed here. The design matrix and data are given in Table S10 of the supplementary materials. Since supersaturated designs are generally used for factor screening, we focus only on the main effects. Therefore, we exclude hierNet, RAMP, and GDSARM in this example, as these methods are designed to identify models that encompass quadratic effects and two-factor interactions. Although HiGarrote is also intended for models with hierarchical effects, we apply it in this example to show its generality. The results are summarized in Table 7. We can see that HiGarrote identifies the same set of factors as Lin (1993) and Chipman et al. (1997). In terms of computational time, HiGarrote takes about 5.80 seconds to perform the analysis, with most of the time spent on optimizing the 24 hyperparameters of the correlation matrix and the noise ratio.

Table 7: Selected Effects in Epoxy Experiment
Method | Model | $R^{2}$
Lin (1993) | 15(-71.26), 20(-27.98), 12(-26.77), 4(20.73), 10(-9.40) | 97%
BayesH | 15(-70.48), 20(-29.20), 12(-25.29), 4(22.12) | 95%
HiGarrote | 15(-61.22), 12(-25.84), 20(-22.19), 10(-8.42), 4(1.29) | 97%

5 Conclusion

Regression analysis of experimental data requires special care for two main reasons: (i) the data size is small, and (ii) there is a need to estimate higher-order effects that satisfy the effect hierarchy and heredity principles. We have proposed a modified version of the nonnegative garrote method, called HiGarrote, that incorporates the hierarchical relationships among the effects. A major innovation in this new technique is to carefully devise an approach to obtain an initial estimate for the nonnegative garrote. For this purpose, we used generalized ridge regression with the ridge parameters chosen based on a Gaussian process prior on the underlying function. The main computational cost of our method comes from the fitting of the Gaussian process, which increases at the rate of $\mathcal{O}(n^{3})$, where $n$ is the size of the data. However, since $n$ is small in experiments, this is not of much concern.

We have applied our method to many different kinds of experiments reported in the literature and found that it works well. The most attractive feature of the method is that it is very general and requires no manual tuning. Thus, it can be used to automate the analysis of experiments. We plan to provide an implementation of the code in an R package. Some practitioners may still prefer to analyze their experiments manually. However, even in such cases, HiGarrote could be tried so that the accuracy of their analyses can be quickly verified.

Another novelty of the proposed method is that it makes a connection to Gaussian process modeling, which is widely applied to the analysis of computer experiments. Practitioners have so far stayed away from Gaussian process regression in physical experiments because they felt that such a complex nonparametric regression model is unnecessary for analyzing the small, noisy data from experiments. Nevertheless, we found Gaussian process modeling to be useful, at least for getting a good initial estimate for the linear regression parameters. One may then ask why not simply use Gaussian process regression in the analysis of physical experiments instead of linear regression. The reason is that linear regression does a better job of identifying significant effects that are interpretable, whereas variable selection with Gaussian process regression is not trivial (Linkletter et al., 2006). However, some recent advances in global sensitivity analysis (Huang and Joseph, 2024) offer promising alternatives in this direction, which need more investigation.

Appendix: Proof of (20)

Suppose that there are $p$ qualitative three-level factors and their model matrices are encoded according to Helmert coding. Then, the model matrix for factor $j$ is

$$\mathbf{U}_{j}=\begin{pmatrix}1&-\sqrt{\frac{3}{2}}&-\sqrt{\frac{1}{2}}\\ 1&\sqrt{\frac{3}{2}}&-\sqrt{\frac{1}{2}}\\ 1&0&2\sqrt{\frac{1}{2}}\end{pmatrix}.$$

Given that each factor is represented by two dummy variables, and based on the model matrix, these dummy variables are expressed as $m_{j_{1}}=(-1,1,0)^{\prime}$ and $m_{j_{2}}=(-1,-1,2)^{\prime}$. Then, the two correlation matrices are

$$\mathbf{\Psi}_{j_{1}}=\begin{pmatrix}1&\rho_{j_{1}}&\rho_{j_{1}}\\ \rho_{j_{1}}&1&\rho_{j_{1}}\\ \rho_{j_{1}}&\rho_{j_{1}}&1\end{pmatrix},\quad\mathbf{\Psi}_{j_{2}}=\begin{pmatrix}1&1&\rho_{j_{2}}\\ 1&1&\rho_{j_{2}}\\ \rho_{j_{2}}&\rho_{j_{2}}&1\end{pmatrix}.$$

According to the product correlation structure, the correlation matrix of factor $j$ is

\mathbf{\Psi}_{j}=\mathbf{\Psi}_{j_{1}}\odot\mathbf{\Psi}_{j_{2}}=\begin{pmatrix}1&\rho_{j_{1}}&\rho_{j_{1}}\rho_{j_{2}}\\ \rho_{j_{1}}&1&\rho_{j_{1}}\rho_{j_{2}}\\ \rho_{j_{1}}\rho_{j_{2}}&\rho_{j_{1}}\rho_{j_{2}}&1\end{pmatrix},

where \odot is the Hadamard product. Then,

\mathbf{U}_{j}^{-1}\mathbf{\Psi}_{j}(\mathbf{U}_{j}^{-1})^{\prime}=\frac{1}{9}\begin{pmatrix}3+2\rho_{j_{1}}+4\rho_{j_{1}}\rho_{j_{2}}&0&\sqrt{2}(-\rho_{j_{1}}+\rho_{j_{1}}\rho_{j_{2}})\\ 0&3(1-\rho_{j_{1}})&0\\ \sqrt{2}(-\rho_{j_{1}}+\rho_{j_{1}}\rho_{j_{2}})&0&3+\rho_{j_{1}}-4\rho_{j_{1}}\rho_{j_{2}}\end{pmatrix}.

Now using (17), the marginal distributions of the $\beta_{i}$’s can be obtained from (16). This gives (20).
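As a quick check, the identity above can be reproduced symbolically. The short sympy sketch below assumes only the quantities defined in this appendix; the symbol names are ours.

```python
# Symbolic check of U_j^{-1} Psi_j (U_j^{-1})' using the appendix quantities.
import sympy as sp

r1, r2 = sp.symbols("rho_j1 rho_j2", positive=True)
U = sp.Matrix([[1, -sp.sqrt(sp.Rational(3, 2)), -sp.sqrt(sp.Rational(1, 2))],
               [1,  sp.sqrt(sp.Rational(3, 2)), -sp.sqrt(sp.Rational(1, 2))],
               [1,  0,                       2 * sp.sqrt(sp.Rational(1, 2))]])
Psi1 = sp.Matrix([[1, r1, r1], [r1, 1, r1], [r1, r1, 1]])
Psi2 = sp.Matrix([[1, 1, r2], [1, 1, r2], [r2, r2, 1]])
Psi = Psi1.multiply_elementwise(Psi2)          # Hadamard product
M = sp.simplify(U.inv() * Psi * U.inv().T)     # U_j^{-1} Psi_j (U_j^{-1})'
sp.pprint(M)  # entries agree with the displayed matrix (factor 1/9 included)
```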

Supplementary Materials

The supplementary file contains the details of the factors and their levels, the design matrices, and the data for the examples in Section 4.

Disclosure of Interest

No potential conflict of interest was reported by the authors.

Data Availability Statement

The data are provided in the supplementary materials.

Funding

This research is supported by U.S. National Science Foundation grant DMS-2310637.

References

  • Bien et al., (2013) Bien, J., Taylor, J., and Tibshirani, R. (2013). A Lasso for Hierarchical Interactions. Annals of Statistics, 41(3):1111–1141.
  • Box and Meyer, (1986) Box, G. and Meyer, D. (1986). An Analysis for Unreplicated Fractional Factorials. Technometrics, 28(1):11–18.
  • Box et al., (1978) Box, G. E. P., Hunter, W. G., and Hunter, J. S. (1978). Statistics for Experimenters. John Wiley & Sons, New York.
  • Breiman, (1995) Breiman, L. (1995). Better Subset Regression Using the Nonnegative Garrote. Technometrics, 37(4):373–384.
  • Candes and Tao, (2007) Candes, E. and Tao, T. (2007). The Dantzig Selector: Statistical Estimation When p Is Much Larger Than n. The Annals of Statistics, 35(6):2313–2351.
  • Chipman, (1996) Chipman, H. (1996). Bayesian Variable Selection With Related Predictors. Canadian Journal of Statistics, 24(1):17–36.
  • Chipman et al., (1997) Chipman, H., Hamada, M., and Wu, C. F. J. (1997). A Bayesian Variable-Selection Approach for Analyzing Designed Experiments With Complex Aliasing. Technometrics, 39(4):372–381.
  • Choi et al., (2010) Choi, N. H., Li, W., and Zhu, J. (2010). Variable Selection With the Strong Heredity Constraint and Its Oracle Property. Journal of the American Statistical Association, 105(489):354–364.
  • Efron et al., (2004) Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least Angle Regression. The Annals of Statistics, 32(2):407–499.
  • Fan and Li, (2001) Fan, J. and Li, R. (2001). Variable Selection via Nonconcave Penalized Likelihood and Its Oracle Properties. Journal of the American Statistical Association, 96(456):1348–1360.
  • George and McCulloch, (1993) George, E. I. and McCulloch, R. E. (1993). Variable Selection via Gibbs Sampling. Journal of the American Statistical Association, 88(423):881–889.
  • Golub et al., (1979) Golub, G. H., Heath, M., and Wahba, G. (1979). Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter. Technometrics, 21(2):215–223.
  • Hamada and Wu, (1992) Hamada, M. and Wu, C. F. J. (1992). Analysis of Designed Experiments With Complex Aliasing. Journal of Quality Technology, 24(3):130–137.
  • Hao et al., (2018) Hao, N., Feng, Y., and Zhang, H. H. (2018). Model Selection for High-Dimensional Quadratic Regression via Regularization. Journal of the American Statistical Association, 113(522):615–625.
  • Harville, (1998) Harville, D. A. (1998). Matrix Algebra From a Statistician’s Perspective. Springer-Verlag, New York.
  • Hoerl and Kennard, (1970) Hoerl, A. E. and Kennard, R. W. (1970). Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics, 12(1):55–67.
  • Huang and Joseph, (2024) Huang, C. and Joseph, V. R. (2024). Factor Importance Ranking and Selection Using Total Indices. arXiv preprint arXiv:2401.00800.
  • Hunter et al., (1982) Hunter, G. B., Hodi, F. S., and Eagar, T. W. (1982). High Cycle Fatigue of Weld Repaired Cast Ti-6AI-4V. Metallurgical Transactions A, 13(9):1589–1594.
  • Ishwaran and Rao, (2014) Ishwaran, H. and Rao, J. S. (2014). Geometry and Properties of Generalized Ridge Regression in High Dimensions. Contemporary Mathematics, 622:81–93.
  • Jones and Lanzerath, (2021) Jones, B. and Lanzerath, M. (2021). A Novel Application of a Definitive Screening Design: A Case Study. Quality Engineering, 33(3):563–569.
  • Joseph, (2006) Joseph, V. R. (2006). A Bayesian Approach to the Design and Analysis of Fractionated Experiments. Technometrics, 48(2):219–229.
  • Joseph and Delaney, (2007) Joseph, V. R. and Delaney, J. D. (2007). Functionally Induced Priors for the Analysis of Experiments. Technometrics, 49(1):1–11.
  • Joseph et al., (2015) Joseph, V. R., Gul, E., and Ba, S. (2015). Maximum Projection Designs for Computer Experiments. Biometrika, 102(2):371–380.
  • Lenth, (1989) Lenth, R. V. (1989). Quick and Easy Analysis of Unreplicated Factorials. Technometrics, 31(4):469–473.
  • Li et al., (2006) Li, X., Sudarsanam, N., and Frey, D. D. (2006). Regularities in Data From Factorial Experiments. Complexity, 11(5):32–45.
  • Lin, (1993) Lin, D. K. J. (1993). A New Class of Supersaturated Designs. Technometrics, 35(1):28–31.
  • Linkletter et al., (2006) Linkletter, C., Bingham, D., Hengartner, N., Higdon, D., and Ye, K. Q. (2006). Variable Selection for Gaussian Process Models in Computer Experiments. Technometrics, 48(4):478–490.
  • Montgomery, (2017) Montgomery, D. C. (2017). Design and Analysis of Experiments. John Wiley & Sons.
  • Ockuly et al., (2017) Ockuly, R. A., Weese, M. L., Smucker, B. J., Edwards, D. J., and Chang, L. (2017). Response Surface Experiments: A Meta-Analysis. Chemometrics and Intelligent Laboratory Systems, 164:64–75.
  • Phadke, (1986) Phadke, M. S. (1986). Design Optimization Case Studies. AT&T Technical Journal, 65(2):51–68.
  • Raghavarao and Altan, (2003) Raghavarao, D. and Altan, S. (2003). A Heuristic Analysis of Highly Fractionated $2^{n}$ Factorial Experiments. Metrika, 58:185–191.
  • Santner et al., (2018) Santner, T. J., Williams, B. J., and Notz, W. I. (2018). The Design and Analysis of Computer Experiments. Springer, New York.
  • Singh and Stufken, (2024) Singh, R. and Stufken, J. (2024). Factor Selection in Screening Experiments by Aggregation Over Random Models. Computational Statistics & Data Analysis, 194:107940.
  • Svanberg, (2002) Svanberg, K. (2002). A Class of Globally Convergent Optimization Methods Based on Conservative Convex Separable Approximations. SIAM Journal on Optimization, 12(2):555–573.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288.
  • Vazquez et al., (2021) Vazquez, A. R., Schoen, E. D., and Goos, P. (2021). A Mixed Integer Optimization Approach for Model Selection in Screening Experiments. Journal of Quality Technology, 53(3):243–266.
  • Williams, (1968) Williams, K. R. (1968). Designed Experiments. Rubber Age, 100:65–71.
  • Wu and Hamada, (2021) Wu, C. F. J. and Hamada, M. S. (2021). Experiments: Planning, Analysis, and Optimization, 3rd Edition. Wiley.
  • Xiong, (2010) Xiong, S. (2010). Some Notes on the Nonnegative Garrote. Technometrics, 52(3):349–361.
  • Yuan et al., (2007) Yuan, M., Joseph, V. R., and Lin, Y. (2007). An Efficient Variable Selection Approach for Analyzing Designed Experiments. Technometrics, 49(4):430–439.
  • Yuan et al., (2009) Yuan, M., Joseph, V. R., and Zou, H. (2009). Structured Variable Selection and Estimation. The Annals of Applied Statistics, 3(4):1738–1757.
  • Yuan and Lin, (2006) Yuan, M. and Lin, Y. (2006). Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society Series B: Statistical Methodology, 68(1):49–67.
  • Yuan and Lin, (2007) Yuan, M. and Lin, Y. (2007). On the Non-Negative Garrotte Estimator. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69(2):143–161.
  • Zou, (2006) Zou, H. (2006). The Adaptive Lasso and Its Oracle Properties. Journal of the American Statistical Association, 101(476):1418–1429.