
Bayesian Elastic Net based on Empirical Likelihood

Chul Moon (a) and Adel Bedoui (b)
(a) Department of Statistical Science, Southern Methodist University, Dallas, Texas, U.S.A.
(b) Department of Statistics, University of Georgia, Athens, Georgia, U.S.A.
CONTACT Chul Moon. Email: [email protected]
Abstract

We propose a Bayesian elastic net that uses empirical likelihood, and we develop an efficient tuning of Hamiltonian Monte Carlo for posterior sampling. The proposed model relaxes the distributional assumptions on the errors, performs well when the variables are highly correlated, and enables more straightforward inference by providing posterior distributions of the regression coefficients. The Hamiltonian Monte Carlo method implemented for Bayesian empirical likelihood overcomes the challenges that the posterior distribution lacks a closed analytic form and that its domain is nonconvex. We develop a leapfrog parameter tuning algorithm for Bayesian empirical likelihood. We also show that the posterior distributions of the regression coefficients are asymptotically normal. Simulation studies and real data analysis demonstrate the advantages of the proposed method in prediction accuracy.

keywords:
Bayesian empirical likelihood; Hamiltonian Monte Carlo; Variable selection
articletype: Research Article

1 Introduction

Regularization methods have been introduced to linear models to improve prediction accuracy and interpretability by adding a penalty term to the least squares criterion; examples include the lasso, which uses an $l_{1}$ penalty [1], ridge regression, which adds an $l_{2}$ penalty [2], the elastic net (EN), which combines the $l_{1}$ and $l_{2}$ penalties [3], and non-convex penalties [4, 5]. EN is known to perform better than the lasso while preserving a sparse variable selection property [3, 6]. EN also identifies influential variables better than the lasso and has fewer false positives than ridge regression [7].

The penalized regression approaches have an intuitive Bayesian counterpart, obtained by introducing the penalty terms in the form of priors. For instance, the Bayesian lasso (BL) [8] uses a double-exponential prior, and the Bayesian elastic net (BEN) [9] uses an informative prior that is a compromise between normal and Laplace priors. Non-convex penalties have also been implemented, including the spike-and-slab prior [10], the horseshoe prior [11], and the hyperlasso [12]. Motivated by Li and Lin [9], various BEN studies have been conducted, including applications to gene-expression data [13], empirical BEN [14], and regressions [15, 16, 17].

The Bayesian approaches to regularized regression have several advantages. First, the Bayesian methods offer a natural interpretation of parameter uncertainty. For example, the standard errors of the parameters of interest can be estimated by the posterior standard deviations because the Bayesian method provides the entire posterior distribution. Second, the Bayesian methods provide valid standard errors, whereas the sandwich or bootstrap variance estimates used for non-Bayesian penalized regressions can be unreliable [18]. Third, the penalty parameters in the Bayesian methods can be estimated simultaneously with the model parameters. When non-Bayesian models have multiple penalty parameters, as EN does, multi-stage cross-validation is used to estimate the penalty terms. However, the sequential estimation of penalties may incur over-shrinkage of parameters, which leads to larger bias [9]. Lastly, the Bayesian penalized regression approaches show similar or better performance compared to the corresponding non-Bayesian approaches [19, 18, 9].

Owen [20, 21] introduces empirical likelihood (EL), a nonparametric likelihood method for making inference on statistical functionals. EL does not require the assumption that the data follow a particular distribution while maintaining properties of a conventional likelihood method such as Wilks' theorem and Bartlett correctability [21, 22, 23]. The EL approach has been applied to complex inference problems, including general estimating equations [24], density estimation [25], the area under the receiver operating characteristic curve [26], and data imputation [27]. We refer to Chen and Keilegom [28] for a review of EL methods for regression.

EL has been extended to high-dimensional data. Hjort et al. [29] and Chen et al. [30] show that the asymptotic normality of the EL ratio holds when the dimension of the data grows. Tang and Leng [31] and Leng and Tang [32] propose the penalized EL and show that it has the oracle property [4, 33]. Lahiri and Mukhopadhyay [34] propose a penalized EL for estimating population means when the dimension of the data is larger than the number of observations. Chang et al. [35] suggest imposing two penalty functions, one on the model parameters and one on the Lagrange multipliers.

EL-based Bayesian methods have also been developed. Lazar [36] replaces the likelihood function in the Bayesian setting by EL and shows the validity of the resulting posterior distribution. In addition, Grendar and Judge [37] show that Bayesian empirical likelihood (BEL) is consistent under misspecification of the model. BEL has been studied in various areas: the higher-order asymptotic and coverage properties of the posterior distribution for the population mean [38, 39], survey sampling [40], small area estimation [41], quantile regression [42], Bayesian computation [43], sampling methods [44], and lasso and ridge regressions [45].

In this paper, we suggest a Bayesian approach for EN in which the likelihood function is replaced by the profile EL of the linear model. We place a special prior distribution on the parameter of interest that combines the $l_{1}$ and $l_{2}$ penalties, leading to the EN approach. The proposed approach takes advantage of the interpretability and robustness achieved by the Bayesian perspective and the EL method, respectively. We implement the Hamiltonian Monte Carlo (HMC) sampling for BEL suggested by Chaudhuri et al. [44] and propose a leapfrog step size tuning algorithm based on the bisection method. The proposed algorithm enables more efficient sampling than hand-tuning or grid-searching the HMC parameters. In addition, our method extends the BEL method for penalized regressions of Bedoui and Lazar [45] by proposing efficient HMC sampling rather than the tailored Metropolis-Hastings of Chib and Greenberg [46].

The outline of the remaining sections is as follows. In Section 2, we briefly describe the Bayesian linear model based on EL. In Section 3, we propose BEN based on EL and discuss the HMC sampling implementation, along with variable selection procedures. In Section 4, we compare the proposed method with other penalized regression methods in various simulation studies. The air pollution data application is presented in Section 5. Section 6 contains conclusions and future work.

2 Linear model based on empirical likelihood

We first outline how EL is implemented in a linear model. Let $X=[\boldsymbol{x_{1}},\cdots,\boldsymbol{x_{p}}]$ be a collection of $p$ independent covariate vectors of size $n$. We consider the linear model

y_{i}=\boldsymbol{x_{i}}^{T}\boldsymbol{\theta}+\epsilon_{i},

where the errors $\epsilon_{i}$ are independent and identically distributed and follow an unknown distribution with mean zero and variance $\sigma^{2}$. The empirical likelihood function for $\boldsymbol{\theta}$ is defined as

L_{EL}(\boldsymbol{\theta})=\sup_{w_{i}}\Bigl\{\prod\limits_{i=1}^{n}nw_{i}\mid w_{i}\geq 0,\;\sum\limits_{i=1}^{n}w_{i}=1,\;\sum\limits_{i=1}^{n}w_{i}\boldsymbol{x_{i}}(y_{i}-\boldsymbol{x_{i}}^{T}\boldsymbol{\theta})=\boldsymbol{0}\Bigr\}, (1)

and $\boldsymbol{w}=(w_{1},\cdots,w_{n})^{T}$ are the weights on the data points. The coefficients $\boldsymbol{\theta}$ can be estimated by maximizing equation (1) using Lagrange multipliers. The profile empirical log-likelihood becomes

\begin{split}l_{EL}(\boldsymbol{\theta})&=-n\log(n)-\sum_{i=1}^{n}\log\left\{1+\boldsymbol{\gamma}^{T}\boldsymbol{x_{i}}(y_{i}-\boldsymbol{x_{i}}^{T}\boldsymbol{\theta})\right\},\\ &\propto-\sum_{i=1}^{n}\log\left\{1+\boldsymbol{\gamma}^{T}\boldsymbol{x_{i}}(y_{i}-\boldsymbol{x_{i}}^{T}\boldsymbol{\theta})\right\},\end{split} (2)

where $\boldsymbol{\gamma}=\boldsymbol{\gamma}(\boldsymbol{\theta})$ solves the equation

\sum_{i=1}^{n}\dfrac{\boldsymbol{x_{i}}(y_{i}-\boldsymbol{x_{i}}^{T}\boldsymbol{\theta})}{1+\boldsymbol{\gamma}^{T}\boldsymbol{x_{i}}(y_{i}-\boldsymbol{x_{i}}^{T}\boldsymbol{\theta})}=\boldsymbol{0}.

Here, numerical approaches such as the Newton-Raphson method can be used to find the Lagrange multipliers $\boldsymbol{\gamma}$ [47].
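For concreteness, a minimal R sketch of this computation is given below. The function name profile_el_loglik, the damped Newton iteration, and the stopping rule are our own illustrative choices rather than the implementation used in the paper.

# Profile empirical log-likelihood of theta for the linear model (up to
# the -n*log(n) constant), computed by a damped Newton-Raphson search
# for the Lagrange multiplier gamma. A sketch; names are illustrative.
profile_el_loglik <- function(theta, X, y, maxit = 50, tol = 1e-8) {
  Z <- X * as.vector(y - X %*% theta)   # rows z_i = x_i (y_i - x_i^T theta)
  gamma <- rep(0, ncol(Z))
  for (it in seq_len(maxit)) {
    denom <- as.vector(1 + Z %*% gamma)
    grad <- colSums(Z / denom)          # gradient of sum log(1 + gamma' z_i)
    if (sum(abs(grad)) < tol) break
    hess <- -crossprod(Z / denom)       # Hessian, negative semi-definite
    step <- solve(hess, -grad)          # Newton direction
    s <- 1                              # halve the step to keep 1 + gamma' z_i > 0
    while (any(1 + Z %*% (gamma + s * step) <= 0)) s <- s / 2
    gamma <- gamma + s * step
  }
  -sum(log(1 + Z %*% gamma))
}

In practice one would also guard against the case where the zero vector lies outside the convex hull of the $\boldsymbol{x_{i}}(y_{i}-\boldsymbol{x_{i}}^{T}\boldsymbol{\theta})$, in which case the profile EL is $-\infty$.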

The regularization method can be incorporated into EL for linear models in different ways. First, the penalty terms can be introduced into the model through the priors $\pi(\boldsymbol{\theta})\propto p_{n}(\boldsymbol{\theta})$, where $p_{n}$ is a penalty function. Second, Tang and Leng [31] introduce the penalty terms directly in EL, $l(\boldsymbol{\theta})\propto-\sum_{i=1}^{n}\log\left\{1+\boldsymbol{\gamma}^{T}\boldsymbol{x_{i}}(y_{i}-\boldsymbol{x_{i}}^{T}\boldsymbol{\theta})\right\}-n\sum_{j=1}^{p}p_{n}(\theta_{j})$. In our study, the $l_{1}$ and $l_{2}$ penalties are introduced in the form of priors.

3 Bayesian elastic net based on empirical likelihood

EN uses both the $l_{1}$ and $l_{2}$ penalties for $p_{n}(\boldsymbol{\theta})$ [3]. The EN estimator $\hat{\boldsymbol{\theta}}_{EN}$ is defined as the minimizer of

L(\boldsymbol{\theta})=\dfrac{1}{2}||\boldsymbol{y}-\mathit{X}\boldsymbol{\theta}||_{2}^{2}+\lambda_{1}||\boldsymbol{\theta}||_{1}+\lambda_{2}||\boldsymbol{\theta}||^{2}_{2}, (3)

where

\begin{split}&\lambda_{1},\lambda_{2}\geq 0,\\ &\boldsymbol{\theta}\;\text{is a }p\times 1\;\text{vector},\\ &||\boldsymbol{\theta}||_{1}=\sum_{j=1}^{p}|\theta_{j}|\text{ and }||\boldsymbol{\theta}||^{2}_{2}=\sum_{j=1}^{p}\theta_{j}^{2},\\ &\boldsymbol{y}\;\text{is an }n\times 1\;\text{vector},\\ &\mathit{X}\;\text{is an }n\times p\;\text{matrix}.\end{split}

Here, $\lambda_{1}$ and $\lambda_{2}$ are the $l_{1}$ and $l_{2}$ penalty parameters that control the amount of shrinkage. Without loss of generality, we assume that

\sum_{i=1}^{n}x_{ij}=0,\;\sum_{i=1}^{n}y_{i}=0,\;\sum_{i=1}^{n}x_{ij}^{2}=1\text{ for }j=1,\cdots,p.

Penalized linear regression models have Bayesian counterparts. The Bayesian penalized models introduce priors for $\boldsymbol{\theta}$ whose hyperparameters are functions of the penalty terms. For example, from the form of the EN penalty term in (3), the EN regression parameters can be represented by the prior

\pi(\boldsymbol{\theta})\propto\exp\left(-\lambda_{1}||\boldsymbol{\theta}||_{1}-\lambda_{2}||\boldsymbol{\theta}||^{2}_{2}\right).

Li and Lin [9] propose the Bayesian counterpart of EN by combining the Laplace and normal priors that correspond to the $l_{1}$ and $l_{2}$ penalties, respectively. The shrinkage parameters, $\lambda_{1}$ and $\lambda_{2}$, are introduced into the model in the form of hyperparameters.

The BEL approach for linear models replaces the likelihood with EL and places parametric priors. Then, the posterior density of EN under the BEL approach, $\pi(\boldsymbol{\theta}\mid\mathit{X},\boldsymbol{y})$, becomes

\pi(\boldsymbol{\theta}\mid\mathit{X},\boldsymbol{y})=\dfrac{L_{EL}(\boldsymbol{\theta})\pi(\boldsymbol{\theta})}{\int_{\Theta}L_{EL}(\boldsymbol{\theta})\pi(\boldsymbol{\theta})d\boldsymbol{\theta}}\;\propto\;L_{EL}(\boldsymbol{\theta})\pi(\boldsymbol{\theta}).

The hierarchical representation of the full model becomes

\begin{split}L_{EL}(\boldsymbol{\theta})&\sim\exp\left(l_{EL}(\boldsymbol{\theta})\right),\\ \boldsymbol{\theta}|\sigma^{2}&\sim\exp\left(-\dfrac{1}{2\sigma^{2}}\left[\lambda_{1}||\boldsymbol{\theta}||_{1}+\lambda_{2}||\boldsymbol{\theta}||^{2}_{2}\right]\right),\\ \sigma^{2}&\sim IG(a,b),\end{split} (4)

where $L_{EL}(\boldsymbol{\theta})$ is the profile empirical likelihood for the linear model in (2), IG is the inverse gamma distribution whose probability density function is $\dfrac{b^{a}}{\Gamma\left(a\right)}x^{-a-1}\exp\left(-\dfrac{b}{x}\right)$ for $x>0$, and $X$ and $\boldsymbol{y}$ are omitted in $L_{EL}(\boldsymbol{\theta})$ for simplicity. We condition $\boldsymbol{\theta}$ on $\sigma^{2}$ to guarantee the unimodality of the posterior distribution [8, 9]. One can also choose a noninformative prior on $\sigma^{2}$ of the form $1/\sigma^{2}$.

The prior of $\boldsymbol{\theta}$ given $\sigma^{2}$ defined in (4) is a product of normal and Laplace densities. Andrews and Mallows [48] show that the Laplace distribution can be represented as a scale mixture of normals with an exponential mixing density,

\dfrac{a}{2}e^{-a|z|}=\int_{0}^{\infty}\dfrac{1}{\sqrt{2\pi s}}e^{-z^{2}/(2s)}\dfrac{a^{2}}{2}e^{-a^{2}s/2}ds,\;\;\;a>0.

Li and Lin [9] also show that the prior defined in (4) can be represented as a scale mixture of normals with a truncated gamma mixing density. The hierarchical scheme in (4) then becomes

\begin{split}L_{EL}(\boldsymbol{\theta})&\sim\exp\left(l_{EL}(\boldsymbol{\theta})\right),\\ \boldsymbol{\theta}|\boldsymbol{\tau},\sigma^{2}&\sim\prod_{j=1}^{p}N\left(0,\left(\dfrac{\lambda_{2}}{\sigma^{2}}\dfrac{\tau_{j}}{\tau_{j}-1}\right)^{-1}\right),\\ \boldsymbol{\tau}|\sigma^{2}&\sim\prod_{j=1}^{p}TG\left(\dfrac{1}{2},\dfrac{\lambda_{1}^{2}}{8\lambda_{2}\sigma^{2}},(1,\infty)\right),\\ \sigma^{2}&\sim IG(a,b),\end{split} (5)

where the $\tau_{j}$ follow the gamma distribution truncated to $(1,\infty)$ with shape parameter $1/2$ and rate parameter $\lambda_{1}^{2}/(8\lambda_{2}\sigma^{2})$. The full joint posterior density is

\begin{split}\pi\left(\boldsymbol{\theta},\;\boldsymbol{\tau},\,\sigma^{2}|X,\;\boldsymbol{y}\right)\propto&L_{EL}\left(\boldsymbol{\theta}\right)\times\prod_{j=1}^{p}\left(\dfrac{\lambda_{2}}{\sigma^{2}}\dfrac{\tau_{j}}{\tau_{j}-1}\right)^{\frac{1}{2}}\exp\left(-\dfrac{1}{2}\dfrac{\lambda_{2}}{\sigma^{2}}\dfrac{\tau_{j}}{\tau_{j}-1}\theta_{j}^{2}\right)\\ &\times\prod_{j=1}^{p}\left(\frac{\lambda_{1}^{2}}{8\lambda_{2}\sigma^{2}}\right)^{\frac{1}{2}}\tau_{j}^{-\frac{1}{2}}\exp\left(-\tau_{j}\dfrac{\lambda_{1}^{2}}{8\lambda_{2}\sigma^{2}}\right)I\left(\tau_{j}\in(1,\infty)\right)\\ &\times\left(\frac{1}{\sigma^{2}}\right)^{a+1}\exp\left(-\frac{b}{\sigma^{2}}\right).\end{split} (6)

To sample from the posterior $\pi\left(\boldsymbol{\theta},\boldsymbol{\tau},\sigma^{2}|X,\boldsymbol{y}\right)$, we draw from the following conditional distributions:

1. Sample $\boldsymbol{\theta}$ from
L_{EL}\left(\boldsymbol{\theta}\right)\exp\left(-\dfrac{\lambda_{2}}{2\sigma^{2}}\sum_{j=1}^{p}\dfrac{\tau_{j}}{\tau_{j}-1}\theta_{j}^{2}\right).

2. Sampling $\tau_{j}$ is equivalent to sampling $\tau_{j}-1$ from
\text{GIG}\left(\nu=\dfrac{1}{2},\;\psi=\dfrac{\lambda_{1}^{2}}{4\lambda_{2}\sigma^{2}},\;\chi=\dfrac{\lambda_{2}}{\sigma^{2}}\theta_{j}^{2}\right)\text{ for }j=1,\cdots,p.

3. Sample $\sigma^{2}$ from
\text{IG}\left(a+p,\;b+\dfrac{1}{2}\sum_{j=1}^{p}\left[\lambda_{2}\dfrac{\tau_{j}}{\tau_{j}-1}\theta_{j}^{2}+\dfrac{\lambda_{1}^{2}}{4\lambda_{2}}\tau_{j}\right]\right).

Here, $\text{GIG}(\nu,\psi,\chi)$ is the generalized inverse Gaussian distribution with probability density function $\dfrac{\left(\psi/\chi\right)^{\nu/2}}{2K_{\nu}\left(\sqrt{\psi\chi}\right)}x^{\nu-1}\exp\left\{-\dfrac{1}{2}\left(\chi x^{-1}+\psi x\right)\right\}$ for $x>0$, where $K_{\nu}\left(\cdot\right)$ is the modified Bessel function of the third kind with order $\nu$. We use the rejection algorithm proposed by Hörmann and Leydold [49] to generate samples from the GIG distribution.
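For concreteness, the conditional draws in steps 2 and 3 can be written in a few lines of R. The sketch below assumes the GIGrvg package, whose rgig(n, lambda, chi, psi) function implements the rejection algorithm of Hörmann and Leydold, with lambda, chi, and psi playing the roles of $\nu$, $\chi$, and $\psi$ above; the function and variable names are ours.

library(GIGrvg)  # rgig(): generalized inverse Gaussian sampler

# One Gibbs update of tau and sigma^2 given theta and the penalty
# parameters (a sketch; variable names are illustrative).
update_tau_sigma2 <- function(theta, lambda1, lambda2, sigma2, a, b) {
  p <- length(theta)
  # Step 2: tau_j - 1 ~ GIG(nu = 1/2, psi = lambda1^2 / (4 lambda2 sigma^2),
  #                         chi = lambda2 * theta_j^2 / sigma^2)
  tau <- 1 + sapply(theta, function(th)
    rgig(1, lambda = 1 / 2,
         chi = lambda2 * th^2 / sigma2,
         psi = lambda1^2 / (4 * lambda2 * sigma2)))
  # Step 3: sigma^2 ~ IG(a + p, b + (1/2) * sum(...)), drawn as 1 / Gamma
  rate <- b + 0.5 * sum(lambda2 * tau / (tau - 1) * theta^2 +
                          lambda1^2 / (4 * lambda2) * tau)
  list(tau = tau, sigma2 = 1 / rgamma(1, shape = a + p, rate = rate))
}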

The posterior estimates of BEN based on EL are consistent: as the sample size increases, the estimates converge to the true values of the parameters being estimated. The consistency of estimators under the BEL framework has been proved for quantile regression [42] and penalized regression [45], and it can be easily extended to the proposed estimates.

3.1 Hamiltonian Monte Carlo sampling for Bayesian empirical likelihood

Implementing traditional sampling methods such as the Gibbs sampler and Metropolis-Hastings (MH) under the Bayesian empirical likelihood framework is a daunting task. First, the conditional distribution of $\boldsymbol{\theta}$ lacks a closed form, which makes the implementation of the Gibbs sampler impossible. Second, MH is suitable for sampling from a distribution that lacks a closed form, but it may fail to achieve convergence. In addition, the nature of the support of the posterior density complicates the process of finding an efficient proposal density. The support of the posterior EL is nonconvex with many local optima, and its surface is rugged. In these cases, the chain can be trapped in a region and not reach the global optimum. Therefore, it is challenging to find a proposal density with appropriate location and dispersion parameters that provides a high acceptance rate.

We use HMC [50] to sample $\boldsymbol{\theta}$, inspired by Chaudhuri et al. [44]. Let $\boldsymbol{\theta}$ be the current position vector and $\boldsymbol{m}\sim N(0,M)$ be the momentum vector, where $M$ is the dispersion matrix. The Hamiltonian is defined by the joint distribution of $\boldsymbol{\theta}$ and $\boldsymbol{m}$ and can be represented as the sum of the potential energy $U(\boldsymbol{\theta})$ and the kinetic energy $K(\boldsymbol{m})$,

\begin{split}H(\boldsymbol{\theta},\boldsymbol{m})&=U(\boldsymbol{\theta})+K(\boldsymbol{m})\\ &=-\log\pi(\boldsymbol{\theta})+\frac{1}{2}\boldsymbol{m}^{T}M^{-1}\boldsymbol{m}.\end{split} (7)

The partial derivatives of the Hamiltonian determine how $\boldsymbol{\theta}$ transitions to a new state,

\begin{split}\frac{d\boldsymbol{\theta}}{dt}&=\frac{\partial H}{\partial\boldsymbol{m}},\\ \frac{d\boldsymbol{m}}{dt}&=-\frac{\partial H}{\partial\boldsymbol{\theta}}.\end{split}

The Hamiltonian dynamics has reversibility, invariance, and volume-preservation properties that enable MCMC updates [50]. The Hamiltonian equations are approximated by discretizing time into small intervals of length $\omega$. The leapfrog integrator is the most widely used method to implement HMC. First, given the current time $t$ and the position $\boldsymbol{\theta}$, the momentum $\boldsymbol{m}$ is independently drawn from $N(0,M)$. Then the position and the momentum at time $t+\omega$ are updated as follows:

\begin{split}\boldsymbol{m}\left(t+\frac{1}{2}\omega\right)&=\boldsymbol{m}(t)-\frac{\omega}{2}\frac{\partial U}{\partial\boldsymbol{\theta}}(\boldsymbol{\theta}(t)),\\ \boldsymbol{\theta}(t+\omega)&=\boldsymbol{\theta}(t)+\omega M^{-1}\boldsymbol{m}\left(t+\frac{1}{2}\omega\right),\\ \boldsymbol{m}\left(t+\omega\right)&=\boldsymbol{m}\left(t+\frac{1}{2}\omega\right)-\frac{\omega}{2}\frac{\partial U}{\partial\boldsymbol{\theta}}(\boldsymbol{\theta}(t+\omega)).\end{split}

The proposed state $(\boldsymbol{\theta^{*}},\boldsymbol{m^{*}})$ is obtained after repeating the above updates $T$ times. Here, $\omega$ and $T$ are called the step size and the number of leapfrog steps, respectively. The proposed state is accepted with probability

\min\left[1,\exp\left(-H(\boldsymbol{\theta^{*}},\boldsymbol{m^{*}})+H(\boldsymbol{\theta},\boldsymbol{m})\right)\right].
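A compact R sketch of one HMC proposal built from these leapfrog updates is given below. Here U and grad_U are user-supplied functions for the potential energy (the negative log posterior of $\boldsymbol{\theta}$) and its gradient, for example the expressions given in Section 3.2; the function name hmc_step and the default identity mass matrix are our own choices.

# One HMC proposal with T_leap leapfrog steps of size omega (a sketch).
# U(theta): potential energy (negative log posterior); grad_U: its gradient.
hmc_step <- function(theta, U, grad_U, omega, T_leap, M = diag(length(theta))) {
  m <- as.vector(MASS::mvrnorm(1, mu = rep(0, length(theta)), Sigma = M))
  theta_prop <- theta
  m_prop <- m - 0.5 * omega * grad_U(theta_prop)                     # initial half step
  for (t in seq_len(T_leap)) {
    theta_prop <- theta_prop + omega * as.vector(solve(M, m_prop))   # position step
    if (t < T_leap) m_prop <- m_prop - omega * grad_U(theta_prop)    # momentum step
  }
  m_prop <- m_prop - 0.5 * omega * grad_U(theta_prop)                # final half step
  H_old <- U(theta) + 0.5 * sum(m * solve(M, m))                     # H = U + (1/2) m' M^{-1} m
  H_new <- U(theta_prop) + 0.5 * sum(m_prop * solve(M, m_prop))
  if (log(runif(1)) < H_old - H_new) {       # accept w.p. min(1, exp(-H_new + H_old))
    list(theta = theta_prop, accepted = TRUE)
  } else {
    list(theta = theta, accepted = FALSE)
  }
}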

HMC is known to be more efficient than random-walk-based MH in sampling from the posterior for Bayesian EL. First, Chaudhuri et al. [44] show that, under certain assumptions, once the parameters are inside the support, the HMC chain does not leave it and returns to the interior if it reaches the boundary. This is due to the property of $l_{EL}(\boldsymbol{\theta})$, whose gradient diverges at the boundary. Second, HMC moves quickly towards the target distribution and enables faster convergence. In HMC, distances between successively generated points are large, so fewer iterations are required to obtain a representative sample.

The performance of HMC depends on its parameters, and tuning them is known to be important [50, 51]. However, the optimal HMC parameter tuning procedure for BEL is not discussed in Chaudhuri et al. [44]. It is generally suggested to set a sufficiently small leapfrog step size $\omega$ to ensure that the discretization is accurate and to use a sufficiently large number of leapfrog steps $T$ so that the overall trajectory length $\omega T$ is not too short. However, setting a large $T$ could be inefficient for HMC in the BEL framework because each leapfrog step requires a computationally expensive EL evaluation. Therefore, we fix the number of leapfrog steps to $T=10$ and search for the optimal step size $\omega$ in our study.

We develop a leapfrog step size tuning algorithm for BEL based on the bisection method [52]. The optimal step size should achieve the known theoretical optimal acceptance rate of 65.1% [53]. For a fixed $T$, a larger step size tends to lower the acceptance rate and vice versa. Given lower and upper tolerance levels for the acceptance rate, $\iota_{l}$ and $\iota_{u}$, the proposed algorithm increases $\omega$ by $\varepsilon$ as long as the acceptance rate remains high, that is, greater than $0.651+\iota_{u}$. If the acceptance rate falls within the target range $[0.651-\iota_{l},0.651+\iota_{u}]$, then $\omega$ is selected. On the other hand, if the step size makes the acceptance rate fall below $0.651-\iota_{l}$, the method bisects the interval $[\omega-\varepsilon,\omega]$. The bisection procedure is repeated until the acceptance rate reaches the target range. Compared to a grid search, the proposed algorithm converges linearly to the target acceptance rate range and enables finer and computationally efficient step size tuning. The detailed bisection algorithm is given in Algorithm 1, and a runnable sketch in R follows it. One can also consider implementing the No-U-Turn sampler (NUTS) [51] that automatically selects the leapfrog steps and step size. We discuss this in Section 6.

Algorithm 1 Bisection leapfrog step size tuning algorithm for Bayesian EL
Input: itermax, (initial) $\omega$, $T$, and lower and upper tolerances $\iota_{l}$ and $\iota_{u}$
Output: acceptance rate and (updated) $\omega$
$\varepsilon\leftarrow\omega$
scounter $\leftarrow 0$
iter $\leftarrow 1$
while iter $\leq$ itermax do
     Simulate HMC chain using step size $\omega$ for a given $T$ and compute the acceptance rate
     if acceptance rate $\in[0.651-\iota_{l},0.651+\iota_{u}]$ then
          break
     else if acceptance rate $>(0.651+\iota_{u})$ then
          if scounter $=0$ then
               $\varepsilon\leftarrow\varepsilon$ ▷ Do not change $\varepsilon$ if $\omega$ has not decreased before
          else
               $\varepsilon\leftarrow\varepsilon/2$
          end if
          $\omega\leftarrow\omega+\varepsilon$ ▷ Update $\omega$ by increasing it by $\varepsilon$
     else if acceptance rate $<(0.651-\iota_{l})$ then
          scounter $\leftarrow$ scounter $+1$
          $\varepsilon\leftarrow\varepsilon/2$
          $\omega\leftarrow\omega-\varepsilon$ ▷ Update $\omega$ by decreasing it by $\varepsilon$
     end if
     iter $\leftarrow$ iter $+1$
end while
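A runnable R version of Algorithm 1 might look as follows; run_hmc is a hypothetical user-supplied function that simulates a pilot HMC chain with the given step size and number of leapfrog steps and returns its acceptance rate.

# Bisection leapfrog step size tuning (a sketch of Algorithm 1).
# run_hmc(omega, T_leap) is a hypothetical function returning the
# acceptance rate of a pilot chain run with these settings.
tune_step_size <- function(run_hmc, omega, T_leap, iota_l, iota_u,
                           itermax = 50, target = 0.651) {
  eps <- omega
  scounter <- 0
  acc <- NA
  for (iter in seq_len(itermax)) {
    acc <- run_hmc(omega, T_leap)
    if (acc >= target - iota_l && acc <= target + iota_u) break
    if (acc > target + iota_u) {
      if (scounter > 0) eps <- eps / 2   # keep eps if omega has not decreased before
      omega <- omega + eps               # larger step size lowers the acceptance rate
    } else {
      scounter <- scounter + 1
      eps <- eps / 2
      omega <- omega - eps               # smaller step size raises the acceptance rate
    }
  }
  list(omega = omega, acceptance_rate = acc)
}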

We run a single chain of length 2,000 with 1,000 burn-in iterations for quicker step size tuning. For the estimation of $\boldsymbol{\theta}$, we simulate four chains of length 2,000 with 1,000 burn-in iterations each.

The convergence of the simulated chains is examined with the split-$\widehat{R}$ of Vehtari et al. [54],

\widehat{R}=\sqrt{\frac{\widehat{\text{var}}^{+}(\boldsymbol{\theta}\mid X,\boldsymbol{y})}{W}},

where $\widehat{\text{var}}^{+}(\boldsymbol{\theta}\mid X,\boldsymbol{y})=\frac{N_{s}-1}{N_{s}}W+\frac{1}{N_{s}}B$, $W$ and $B$ are the within- and between-chain variances, and $N_{s}$ is the number of draws of one chain. Following Vehtari et al. [54], we use the posterior samples that satisfy the convergence criterion $\widehat{R}<1.01$.
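The split-$\widehat{R}$ above can be computed directly from the simulated chains; a minimal R sketch for a single scalar parameter is given below (the function name is ours).

# Split-Rhat for one scalar parameter (a minimal sketch).
# draws: matrix of post-burn-in draws, one column per chain.
split_rhat <- function(draws) {
  half <- floor(nrow(draws) / 2)
  splits <- cbind(draws[1:half, , drop = FALSE],                 # split each chain in half,
                  draws[(half + 1):(2 * half), , drop = FALSE])  # doubling the number of sequences
  Ns <- nrow(splits)
  chain_means <- colMeans(splits)
  B <- Ns * var(chain_means)            # between-chain variance
  W <- mean(apply(splits, 2, var))      # within-chain variance
  var_plus <- (Ns - 1) / Ns * W + B / Ns
  sqrt(var_plus / W)
}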

3.2 Choosing Bayesian elastic net penalty parameters

The proposed EL-based approach requires specifying the penalty parameters $\boldsymbol{\lambda}=(\lambda_{1},\lambda_{2})$. We consider two approaches: empirical Bayes (EB) and full Bayes (FB).

First, the EB method estimates the penalty parameters from the data and plugs the estimates into the model. Park and Casella [8] use the EB method to estimate the shrinkage parameter in BL. It treats the parameters as missing data and uses the Monte Carlo expectation-maximization (EM) algorithm proposed by Casella [55]. The EM algorithm iteratively approximates the parameters of interest by substituting Monte Carlo estimates for any expected values that cannot be computed explicitly. For our proposed model, $\boldsymbol{\theta}$, $\boldsymbol{\tau}$, and $\sigma^{2}$ are treated as missing data, whereas $\lambda_{1}$ and $\lambda_{2}$ are treated as fixed parameters.

We use HMC as a building block within the Gibbs sampler to sample $\boldsymbol{\theta}$, $\boldsymbol{\tau}$, and $\sigma^{2}$. The negative log empirical posterior density of $\boldsymbol{\theta}$, the potential energy in (7), is

-\log\left\{\pi\left(\boldsymbol{\theta}|\sigma^{2},\tau_{1},\cdots,\tau_{p}\right)\right\}=\sum_{i=1}^{n}\log\left\{1+\boldsymbol{\gamma}^{T}\boldsymbol{x_{i}}\left(y_{i}-\boldsymbol{x_{i}}^{T}\boldsymbol{\theta}\right)\right\}+\dfrac{\lambda_{2}}{2\sigma^{2}}\boldsymbol{\theta}^{T}\mathit{D}_{\tau}\boldsymbol{\theta},

where $\mathit{D}_{\tau}=\text{diag}\left(\dfrac{\tau_{1}}{\tau_{1}-1},\cdots,\dfrac{\tau_{p}}{\tau_{p}-1}\right)$, and its gradient is

-\dfrac{\partial\log\left\{\pi\left(\boldsymbol{\theta}|\sigma^{2},\tau_{1},\cdots,\tau_{p}\right)\right\}}{\partial\boldsymbol{\theta}}=\sum_{i=1}^{n}\dfrac{-\boldsymbol{\gamma}^{T}\boldsymbol{x_{i}x_{i}}^{T}}{1+\boldsymbol{\gamma}^{T}\boldsymbol{x_{i}}\left(y_{i}-\boldsymbol{x_{i}}^{T}\boldsymbol{\theta}\right)}+\dfrac{\lambda_{2}}{\sigma^{2}}\boldsymbol{\theta}^{T}\mathit{D}_{\tau}.

The hierarchical model presented in (6) yields the complete-data log-likelihood

p\log(\lambda_{1})-\frac{\lambda_{2}}{2\sigma^{2}}\sum\limits_{j=1}^{p}\frac{\tau_{j}}{\tau_{j}-1}\theta_{j}^{2}-\frac{\lambda_{1}^{2}}{8\lambda_{2}\sigma^{2}}\sum\limits_{j=1}^{p}\tau_{j}+\text{terms not involving }\lambda_{1}\text{ and }\lambda_{2}.

At the $k$th step of the Monte Carlo EM algorithm, the conditional log-likelihood given $\boldsymbol{\lambda}^{(k-1)}=\left(\lambda_{1}^{(k-1)},\lambda_{2}^{(k-1)}\right)$ and $\boldsymbol{y}$ is

\begin{split}Q(\boldsymbol{\lambda}|\boldsymbol{\lambda}^{(k-1)})&=p\log(\lambda_{1})-\frac{\lambda_{2}}{2}\sum\limits_{j=1}^{p}E\left[\frac{\tau_{j}}{\tau_{j}-1}\frac{\theta_{j}^{2}}{\sigma^{2}}\middle|\boldsymbol{\lambda}^{(k-1)},\boldsymbol{y}\right]-\frac{\lambda_{1}^{2}}{8\lambda_{2}}\sum\limits_{j=1}^{p}E\left[\frac{\tau_{j}}{\sigma^{2}}\middle|\boldsymbol{\lambda}^{(k-1)},\boldsymbol{y}\right]+\text{terms not involving }\lambda_{1}\text{ and }\lambda_{2}\\ &=R(\boldsymbol{\lambda}|\boldsymbol{\lambda}^{(k-1)})+\text{terms not involving }\lambda_{1}\text{ and }\lambda_{2}.\end{split}

Then we find the $\lambda_{1}$ and $\lambda_{2}$ that maximize $R(\boldsymbol{\lambda}|\boldsymbol{\lambda}^{(k-1)})$. The maximizers can be obtained from the partial derivatives of $R(\boldsymbol{\lambda}|\boldsymbol{\lambda}^{(k-1)})$,

\begin{split}\frac{\partial R}{\partial\lambda_{1}}&=\frac{p}{\lambda_{1}}-\frac{\lambda_{1}}{4\lambda_{2}}\sum\limits_{j=1}^{p}E\left[\frac{\tau_{j}}{\sigma^{2}}\middle|\boldsymbol{\lambda}^{(k-1)},\boldsymbol{y}\right],\\ \frac{\partial R}{\partial\lambda_{2}}&=-\frac{1}{2}\sum\limits_{j=1}^{p}E\left[\frac{\tau_{j}}{\tau_{j}-1}\frac{\theta_{j}^{2}}{\sigma^{2}}\middle|\boldsymbol{\lambda}^{(k-1)},\boldsymbol{y}\right]+\frac{\lambda_{1}^{2}}{8\lambda_{2}^{2}}\sum\limits_{j=1}^{p}E\left[\frac{\tau_{j}}{\sigma^{2}}\middle|\boldsymbol{\lambda}^{(k-1)},\boldsymbol{y}\right].\end{split}

Here, the expectations in $Q$ and $R$ are evaluated using the means of the sampled $\boldsymbol{\theta}$, $\boldsymbol{\tau}$, and $\sigma^{2}$.
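Setting these derivatives to zero gives closed-form updates at the $k$th EM step. The expressions below are a short calculation of ours that follows directly from the displayed derivatives:

\lambda_{2}^{(k)}=\frac{p}{\sum_{j=1}^{p}E\left[\frac{\tau_{j}}{\tau_{j}-1}\frac{\theta_{j}^{2}}{\sigma^{2}}\middle|\boldsymbol{\lambda}^{(k-1)},\boldsymbol{y}\right]},\qquad\left(\lambda_{1}^{(k)}\right)^{2}=\frac{4\,p\,\lambda_{2}^{(k)}}{\sum_{j=1}^{p}E\left[\frac{\tau_{j}}{\sigma^{2}}\middle|\boldsymbol{\lambda}^{(k-1)},\boldsymbol{y}\right]}.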

Second, the FB approach treats $\boldsymbol{\lambda}$ as unknown model parameters and specifies priors for them. Park and Casella [8] suggest using a gamma prior on the squared $l_{1}$ penalty parameter for BL. Similarly, we assume the $\text{Gamma}(r_{1},\delta_{1})$ prior on $\lambda_{1}^{2}$. Also, we place a $\text{GIG}(\nu_{2},\psi_{2},\chi_{2})$ prior on $\lambda_{2}$, which is a conjugate prior [9]. The full joint posterior density becomes

\begin{split}\pi\left(\boldsymbol{\theta},\;\boldsymbol{\tau},\,\sigma^{2},\lambda_{1}^{2},\lambda_{2}|X,\;\boldsymbol{y}\right)\propto&L_{EL}\left(\boldsymbol{\theta}\right)\times\prod_{j=1}^{p}\left(\dfrac{\lambda_{2}}{\sigma^{2}}\dfrac{\tau_{j}}{\tau_{j}-1}\right)^{\frac{1}{2}}\exp\left(-\dfrac{1}{2}\dfrac{\lambda_{2}}{\sigma^{2}}\dfrac{\tau_{j}}{\tau_{j}-1}\theta_{j}^{2}\right)\\ &\times\prod_{j=1}^{p}\left(\frac{\lambda_{1}^{2}}{8\lambda_{2}\sigma^{2}}\right)^{\frac{1}{2}}\tau_{j}^{-\frac{1}{2}}\exp\left(-\tau_{j}\dfrac{\lambda_{1}^{2}}{8\lambda_{2}\sigma^{2}}\right)I\left(\tau_{j}\in(1,\infty)\right)\\ &\times\left(\frac{1}{\sigma^{2}}\right)^{a+1}\exp\left(-\frac{b}{\sigma^{2}}\right)\times\left(\lambda_{1}^{2}\right)^{r_{1}-1}\exp\left(-\delta_{1}\lambda_{1}^{2}\right)\\ &\times\left(\lambda_{2}\right)^{\nu_{2}-1}\exp\left\{-\frac{1}{2}\left(\chi_{2}\frac{1}{\lambda_{2}}+\psi_{2}\lambda_{2}\right)\right\}.\end{split}

The full conditional distributions for the penalty parameters are

\begin{split}\lambda_{1}^{2}|\boldsymbol{\tau},\sigma^{2},\lambda_{2}&\sim\text{Gamma}\left(\frac{p}{2}+r_{1},\sum_{j=1}^{p}\frac{\tau_{j}}{8\lambda_{2}\sigma^{2}}+\delta_{1}\right),\\ \lambda_{2}|\boldsymbol{\tau},\sigma^{2},\lambda_{1}^{2},\boldsymbol{\theta}&\sim\text{GIG}\left(\nu=\nu_{2},\;\psi=\sum\limits_{j=1}^{p}\frac{\tau_{j}}{\tau_{j}-1}\frac{\theta_{j}^{2}}{\sigma^{2}}+\psi_{2},\;\chi=\sum\limits_{j=1}^{p}\frac{\tau_{j}\lambda_{1}^{2}}{4\sigma^{2}}+\chi_{2}\right).\end{split}

An alternative prior for $\lambda_{2}$ is $\text{Gamma}(r_{2},\delta_{2})$. Then, the full conditional distribution becomes

\lambda_{2}|\boldsymbol{\tau},\sigma^{2},\lambda_{1}^{2},\boldsymbol{\theta}\sim\text{GIG}\left(\nu=r_{2},\;\psi=\sum\limits_{j=1}^{p}\frac{\tau_{j}}{\tau_{j}-1}\frac{\theta_{j}^{2}}{\sigma^{2}}+2\delta_{2},\;\chi=\sum\limits_{j=1}^{p}\frac{\tau_{j}\lambda_{1}^{2}}{4\sigma^{2}}\right).

The gamma distribution is a special case of the GIG distribution family with $\chi=0$. Thus, placing a gamma prior on $\lambda_{2}$ results in a posterior distribution that follows a GIG distribution.
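Under the FB approach, the penalty parameters are updated inside the Gibbs sampler using the two full conditionals above; a short R sketch follows, again using GIGrvg::rgig with the parameterization described earlier and with function and variable names of our own choosing.

# Full-Bayes Gibbs updates for the penalty parameters (a sketch).
update_penalties <- function(theta, tau, sigma2, lambda2,
                             r1, delta1, nu2, psi2, chi2) {
  p <- length(theta)
  # lambda1^2 | . ~ Gamma(p/2 + r1, rate = sum(tau) / (8 lambda2 sigma^2) + delta1)
  lambda1_sq <- rgamma(1, shape = p / 2 + r1,
                       rate = sum(tau) / (8 * lambda2 * sigma2) + delta1)
  # lambda2 | . ~ GIG(nu2, psi = sum(tau/(tau-1) * theta^2/sigma^2) + psi2,
  #                   chi = sum(tau) * lambda1^2 / (4 sigma^2) + chi2)
  lambda2 <- GIGrvg::rgig(1, lambda = nu2,
                          chi = sum(tau) * lambda1_sq / (4 * sigma2) + chi2,
                          psi = sum(tau / (tau - 1) * theta^2 / sigma2) + psi2)
  list(lambda1_sq = lambda1_sq, lambda2 = lambda2)
}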

3.3 Variable selection methods

The Bayesian regularization approaches require additional variable selection methods. In most non-Bayesian penalized methods, some estimated coefficients shrink exactly to zero. In contrast, the outputs of Bayesian models are posterior distributions, whose summaries are not necessarily equal to zero. We present two variable selection methods: the Bayesian credible interval and the scaled neighborhood criterion.

First, the Bayesian credible interval can be used to select variables whose credible regions do not include zero for a given probability level $\alpha$ [56]. Lazar [36] proves that, under standard regularity conditions and as $n\rightarrow\infty$, the posterior EL distribution of $\boldsymbol{\theta}(F)$ converges to the normal distribution. Bedoui and Lazar [45] show that the asymptotic posterior distributions of the BEL versions of ridge and lasso regressions are normal. Lemma 3.1 shows that the posterior EL distribution of $\boldsymbol{\theta}$ for the elastic net model converges to a normal distribution as $n\rightarrow\infty$. However, we note that the Bayesian credible interval may include zero in the presence of strong correlation among explanatory variables because of bimodality of the posterior of the regression parameters [57].

Lemma 3.1.

Assume standard regularity conditions, $\nabla\log\left(\pi(\boldsymbol{\theta})\right)=0$ and $\nabla\log\left(\pi(X,\boldsymbol{y}|\boldsymbol{\theta})\right)=0$. The posterior EL distribution of $\boldsymbol{\theta}$ for the Bayesian elastic net converges to the normal distribution with mean $\boldsymbol{m}$ and covariance $\mathit{J_{n}}^{-1}$ as $n\rightarrow\infty$, where

\begin{split}\mathit{J_{n}}&=\dfrac{\lambda_{2}}{\sigma^{2}}D_{\tau}+\mathit{J}(\boldsymbol{\hat{\theta}}_{n}),\\ \boldsymbol{m}&=\mathit{J_{n}}^{-1}\left(\dfrac{\lambda_{2}}{\sigma^{2}}D_{\tau}\boldsymbol{\theta_{0}}+\mathit{J}(\boldsymbol{\hat{\theta}}_{n})\boldsymbol{\hat{\theta}}_{n}\right),\end{split}

$\boldsymbol{\theta_{0}}$ is the prior mode, $\boldsymbol{\hat{\theta}}_{n}$ is the profile maximum likelihood estimate of $\boldsymbol{\theta}$, and

\mathit{J}(\boldsymbol{\hat{\theta}}_{n})=\left[\dfrac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}\sum_{i=1}^{n}\log\left\{1+\boldsymbol{\gamma}^{T}\boldsymbol{x_{i}}(y_{i}-\boldsymbol{x_{i}}^{T}\boldsymbol{\theta})\right\}\right]_{\boldsymbol{\theta}=\boldsymbol{\hat{\theta}}_{n}}.

Also, $-2\log\left(\pi(\boldsymbol{\theta}|\boldsymbol{\tau},\sigma^{2},\mathit{X},\boldsymbol{y})\right)\xrightarrow{d}\chi_{p}^{2}$ as $n\rightarrow\infty$.

Proof.

Under the standard regularity conditions, the prior term is dominated by the likelihood function, and the log posterior distribution of $\boldsymbol{\theta}$ can be expressed using terms up to quadratic order [58],

-2\log\left(\pi(\boldsymbol{\theta}|\boldsymbol{\tau},\sigma^{2},\mathit{X},\boldsymbol{y})\right)\propto(\boldsymbol{\theta}-\boldsymbol{m})^{T}J_{n}(\boldsymbol{\theta}-\boldsymbol{m}).

Lazar [36] shows that the posterior distribution of $\boldsymbol{\theta}$ converges to the normal distribution with mean $\boldsymbol{m}$ and covariance $J_{n}^{-1}$ as $n\rightarrow\infty$. Then, showing that $J_{n}$ is symmetric and positive semi-definite completes the proof.

The first term of $J_{n}$, the diagonal matrix $\dfrac{\lambda_{2}}{\sigma^{2}}D_{\tau}$, is the negative second derivative of the log prior distribution of $\boldsymbol{\theta}$ in (5) evaluated at $\boldsymbol{\theta}_{0}$. The second term of $J_{n}$, $J(\boldsymbol{\theta})$, can be approximated by the Fisher information matrix [58], which is positive semi-definite by definition. Therefore, $J_{n}$ is positive semi-definite because both $\dfrac{\lambda_{2}}{\sigma^{2}}D_{\tau}$ and $\mathit{J}(\boldsymbol{\hat{\theta}}_{n})$ are positive semi-definite. Thus, $-2\log\left(\pi(\boldsymbol{\theta}|\boldsymbol{\tau},\sigma^{2},\mathit{X},\boldsymbol{y})\right)\xrightarrow{d}\chi_{p}^{2}$ as $n\rightarrow\infty$. ∎

Second, the scaled neighborhood criterion uses posterior probabilities to select variables [9]. The variable $\theta_{j}$ is excluded when the posterior probability $P\left(|\theta_{j}|\leq\sqrt{\text{Var}(\theta_{j}\mid\boldsymbol{y})}\mid\boldsymbol{y}\right)$ is greater than a given threshold $\eta$.
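Given posterior draws of $\boldsymbol{\theta}$, the criterion is straightforward to evaluate; a small R sketch follows, with the threshold eta chosen by the user (e.g., 0.5 as in Section 4).

# Scaled neighborhood criterion (a sketch): exclude theta_j when
# P(|theta_j| <= sd(theta_j | y) | y) > eta.
# theta_draws: posterior draws, one column per coefficient.
scaled_neighborhood_keep <- function(theta_draws, eta = 0.5) {
  post_sd <- apply(theta_draws, 2, sd)
  prob_in <- colMeans(sweep(abs(theta_draws), 2, post_sd, "<="))
  prob_in <= eta   # TRUE means the variable is kept
}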

4 Simulation studies

We conduct Monte Carlo simulations to compare the performance of the proposed Bayesian elastic net based on empirical likelihood (BEN-EL) with four other models: BEN, BL, EN, and the lasso applied to least absolute deviation regression (LADL). LADL is known to be robust to heavy-tailed errors and outliers [59]. We generate data from the linear model

\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\theta}+\boldsymbol{\epsilon}.

Note that BEN-EL and LADL assume that the mean and the median of errors are zero, respectively. On the other hand, BEN, BL, and EN assume normal errors.

4.1 Prediction performance

4.1.1 Simulation 1

In Simulation 1, we set $\boldsymbol{\theta}=(3,1.5,0,0,2,0,0,0)^{T}$. The explanatory variables $X_{ij}$ are generated from the standard normal distribution with pairwise correlations $\rho(x_{i},x_{j})=0.5^{|i-j|}$ for all $i,j\in\{1,\ldots,8\}$. We generate training data sets with three different sizes ($n=50,100,200$). For each setting, we generate test data sets of size 400. The error is generated from three different distributions: 1) a normal distribution, $\epsilon\sim N(0,3^{2})$; 2) a non-normal distribution, $\epsilon\sim N(-3,1)$ or $\epsilon\sim N(3,1)$ each with probability 0.5; and 3) the skew Student $t$ distribution proposed by Fernández and Steel [60], $\epsilon\sim ST(\nu=30,\xi=1.5)$ with mean 0 and standard deviation 3. We use the R package fGarch [61] to sample from the skew Student $t$ distribution. The sampling procedure is repeated 100 times for each simulation setting.
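A sketch of the data-generating step for Simulation 1 in R is given below; it uses mvtnorm::rmvnorm for the correlated covariates and fGarch::rsstd for the skew Student $t$ errors (rsstd's nu and xi arguments play the roles of $\nu$ and $\xi$ above). The wrapper gen_sim1 is our own illustrative function.

library(mvtnorm)   # rmvnorm: correlated normal covariates
library(fGarch)    # rsstd: skew Student t errors

theta_true <- c(3, 1.5, 0, 0, 2, 0, 0, 0)
p <- length(theta_true)
Sigma <- 0.5^abs(outer(1:p, 1:p, "-"))   # rho(x_i, x_j) = 0.5^|i - j|

# Generate one Simulation 1 data set (a sketch).
gen_sim1 <- function(n, error = c("normal", "mixture", "skewt")) {
  error <- match.arg(error)
  X <- rmvnorm(n, mean = rep(0, p), sigma = Sigma)
  eps <- switch(error,
    normal  = rnorm(n, 0, 3),
    mixture = rnorm(n, mean = sample(c(-3, 3), n, replace = TRUE), sd = 1),
    skewt   = rsstd(n, mean = 0, sd = 3, nu = 30, xi = 1.5))
  list(X = X, y = as.vector(X %*% theta_true) + eps)
}

train <- gen_sim1(50, "skewt")   # e.g., a training set of size 50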

The shrinkage parameters $\lambda_{1}$ and $\lambda_{2}$ are estimated using the EB method for the BEN-EL, BEN, and BL methods. For BEN-EL, the hyperparameters $a=10$ and $b=10$ are used. For the bisection step size tuning, the initial step size $\omega=0.5$ and tolerances $\iota_{l}=\iota_{u}=0.05$ are used. For BEN and BL, we use MCMC with 2,000 burn-ins and 12,000 iterations. The optimal value for $\lambda_{2}$ in EN is selected from $\{10^{n}\mid n=-2,-1,0,1,2,3\}$.

In all cases, the simulated HMC chains meet the convergence criterion $\widehat{R}<1.01$. The trace plots and histograms of $\boldsymbol{\theta}$ estimated by BEN-EL for one data set with skew Student $t$ errors and $n=50$ are given in Figures 3 and 4 in Appendix A.

Table: The median of mean squared prediction errors (MMSPE) along with the standard errors (SE) of the BEN-EL, BEN, BL, EN, and LADL methods for Simulation 1.

Error | n | BEN-EL | BEN | BL | EN | LADL
N(0,3^2) | n=50 | 10.24 (0.18) | 11.33 (0.25) | 10.00 (0.16) | 10.43 (0.15) | 11.07 (0.18)
N(0,3^2) | n=100 | 9.61 (0.10) | 11.65 (0.19) | 9.55 (0.10) | 9.73 (0.10) | 9.96 (0.15)
N(0,3^2) | n=200 | 9.13 (0.07) | 11.65 (0.10) | 9.10 (0.08) | 9.17 (0.07) | 9.36 (0.12)
N(3,1) or N(-3,1) | n=50 | 10.80 (0.12) | 12.33 (0.25) | 10.91 (0.12) | 11.20 (0.17) | 14.37 (0.29)
N(3,1) or N(-3,1) | n=100 | 10.49 (0.11) | 12.36 (0.21) | 10.49 (0.11) | 10.68 (0.09) | 13.09 (0.18)
N(3,1) or N(-3,1) | n=200 | 10.29 (0.05) | 12.66 (0.13) | 10.27 (0.06) | 10.34 (0.07) | 12.52 (0.10)
Skew Student t | n=50 | 90.20 (0.99) | 91.38 (1.30) | 91.83 (1.22) | 93.93 (1.30) | 92.29 (0.76)
Skew Student t | n=100 | 86.31 (1.26) | 86.77 (1.07) | 86.09 (1.12) | 88.50 (1.48) | 88.15 (1.08)
Skew Student t | n=200 | 83.56 (0.80) | 83.95 (0.68) | 83.66 (0.76) | 84.51 (0.89) | 84.81 (1.05)

BEN-EL outperforms the other methods when the number of observations is small ($n=50,100$) and the normality assumption on the errors is violated. Table 4.1.1 presents the median of mean squared prediction error (MMSPE) and standard error (SE) results for Simulation 1. For the Bayesian methods, the explanatory variables are selected based on the scaled neighborhood criterion with $\eta>0.5$ in computing the MMSPE. The SE is computed as the standard deviation of 1,000 bootstrap samples of the mean squared prediction error values.

Table: The empirical frequency (%) with which each regression coefficient is dropped for Simulation 1 with errors from N(3,1) or N(-3,1). For BEN-EL, BEN, and BL, the scaled neighborhood criterion with $\eta>0.5$ is used.

n | Method | θ1 | θ2 | θ3 | θ4 | θ5 | θ6 | θ7 | θ8
n=50 | BEN-EL | 0 | 0 | 58 | 69 | 0 | 65 | 75 | 64
n=50 | BEN | 0 | 0 | 34 | 39 | 0 | 53 | 80 | 76
n=50 | BL | 0 | 3 | 69 | 76 | 0 | 77 | 82 | 75
n=50 | LADL | 0 | 18 | 70 | 76 | 9 | 86 | 90 | 82
n=100 | BEN-EL | 0 | 0 | 70 | 61 | 0 | 59 | 72 | 69
n=100 | BEN | 0 | 0 | 13 | 15 | 0 | 34 | 73 | 74
n=100 | BL | 0 | 0 | 80 | 73 | 0 | 75 | 82 | 77
n=100 | LADL | 0 | 12 | 76 | 73 | 5 | 81 | 88 | 84
n=200 | BEN-EL | 0 | 0 | 56 | 66 | 0 | 68 | 74 | 62
n=200 | BEN | 0 | 0 | 2 | 3 | 0 | 17 | 68 | 82
n=200 | BL | 0 | 0 | 69 | 69 | 0 | 75 | 82 | 75
n=200 | LADL | 0 | 6 | 77 | 71 | 0 | 81 | 88 | 84

Table 4.1.1 reports the empirical frequency of exclusion of each variable over the 100 simulated data sets for BEN-EL, BEN, BL, and LADL. The variable selection results for EN are not included because all of its estimated coefficients are nonzero. LADL performs best in dropping the variables with zero coefficients ($\theta_{3}$, $\theta_{4}$, $\theta_{6}$, $\theta_{7}$, and $\theta_{8}$), but some nonzero coefficients ($\theta_{1}$, $\theta_{2}$, and $\theta_{5}$) are also excluded. Among the Bayesian methods, BL best identifies the zero coefficients. However, the MMSPEs for BL are larger than those for BEN-EL for $n=50,100$. This implies that BEN-EL makes more accurate estimates even though its exclusion rate for the zero coefficients is lower than that of BL.

4.1.2 Simulation 2

The Simulation 2 data set has a more complex correlation structure than Simulation 1. First, we generate $Z_{1}$, $Z_{2}$, and $Z_{3}$ independently from $N(0,1)$. Then, we add normal errors so that $x_{i}=Z_{1}+\upsilon_{i}$ for $i=1,\ldots,5$, $x_{i}=Z_{2}+\upsilon_{i}$ for $i=6,\ldots,10$, $x_{i}=Z_{3}+\upsilon_{i}$ for $i=11,\ldots,15$, and $x_{i}\sim N(0,1)$ for $i=16,\ldots,30$, where $\upsilon_{i}\sim N(0,0.01)$ for $i=1,\ldots,15$. As a result, the variables in the first three groups are highly correlated within each group. We set the true parameters of the linear model as

\boldsymbol{\theta}=(\underbrace{3,\cdots,3}_{5},\underbrace{3,\cdots,3}_{5},\underbrace{3,\cdots,3}_{5},\underbrace{0,\cdots,0}_{15})^{T}.

We generate the error $\epsilon$ from three different distributions: 1) a normal distribution, $\epsilon\sim N(0,15^{2})$; 2) a Student $t$ distribution with 3 degrees of freedom, $\epsilon\sim 10\times t_{3}$; and 3) the skew Student $t$ distribution, $\epsilon\sim ST(\nu=30,\xi=1.5)$. The normal and $t$ distributions are symmetric and have similar variances, but the $t$ distribution has a thicker tail than the normal distribution. On the other hand, the skew Student $t$ distribution is right-skewed. We use training data sets with two different sizes ($n=100,200$) and fix the size of the test data sets to 400. We generate 100 data sets for each simulation setting. For the other simulation parameters, we use the same settings as in Simulation 1.

For BEN-EL, the simulated HMC chains converge with $\widehat{R}<1.01$ in all simulation settings. We also present the trace plots and histograms of four nonzero coefficients ($\theta_{1},\theta_{2},\theta_{3},\theta_{4}$) and four zero coefficients ($\theta_{16},\theta_{17},\theta_{18},\theta_{19}$) estimated by BEN-EL for one data set with skew Student $t$ errors and $n=100$ in Figures 5 and 6 in Appendix A.

Table: The median of mean squared prediction errors (MMSPE) along with the standard errors (SE) of the BEN-EL, BEN, BL, EN, and LADL methods for Simulation 2.

Error | n | BEN-EL | BEN | BL | EN | LADL
Normal | n=100 | 281.98 (3.70) | 291.44 (4.50) | 295.28 (4.16) | 309.52 (5.36) | 332.72 (3.42)
Normal | n=200 | 255.06 (2.58) | 254.12 (3.02) | 258.17 (3.35) | 263.62 (3.42) | 323.32 (3.73)
Student t | n=100 | 334.10 (10.75) | 344.72 (7.61) | 345.03 (7.79) | 380.51 (12.33) | 374.70 (5.78)
Student t | n=200 | 313.37 (6.77) | 302.26 (6.46) | 307.12 (6.62) | 311.59 (6.13) | 351.81 (7.51)
Skew Student t | n=100 | 289.95 (4.03) | 297.96 (4.43) | 301.23 (4.64) | 318.13 (8.88) | 344.05 (2.83)
Skew Student t | n=200 | 253.97 (3.06) | 255.24 (2.74) | 256.02 (2.97) | 262.62 (3.52) | 310.68 (4.33)

The simulation results suggest that BEN-EL outperforms the other methods when the sample size is small or the error follows an asymmetric distribution. Table 4.1.2 reports the MMSPE and SE results of Simulation 2. In all cases, the Bayesian methods (BEN-EL, BEN, and BL) perform better than the non-Bayesian methods (EN and LADL), and the Bayesian EN methods (BEN-EL and BEN) perform better than BL. This corresponds to the finding that EN-based methods perform better than lasso-based methods under a complex correlation structure [3]. For the symmetric error distributions (normal and Student $t$), BEN-EL provides smaller MMSPE and SE values than BEN when the sample size is small. However, BEN-EL performs best under the skewed error distribution regardless of the sample size.

Table 4.1.2 presents the variable selection results. For the Bayesian methods, variables are selected using the scaled neighborhood criterion with $\eta>0.5$. The variable selection results for EN are not reported because all of its estimated coefficients are nonzero. BEN-EL tends to keep more variables than BEN and BL for both nonzero and zero coefficients. BEN and BL perform poorly in keeping the nonzero variables when the sample size is small, but they improve as the sample size increases. LADL shows the highest exclusion rate for the zero coefficients but removes many nonzero coefficients.

Table: The empirical frequency (%) with which the 15 nonzero and 15 zero regression coefficients are dropped for Simulation 2. For the Bayesian methods, the scaled neighborhood criterion with $\eta>0.5$ is used.

Error | n | BEN-EL Nonzero | BEN-EL Zero | BEN Nonzero | BEN Zero | BL Nonzero | BL Zero | LADL Nonzero | LADL Zero
Normal | n=100 | 22.26 | 63.33 | 34.60 | 74.07 | 42.33 | 87.53 | 71.20 | 92.93
Normal | n=200 | 5.47 | 68.47 | 5.60 | 69.27 | 9.87 | 82.20 | 48.87 | 94.40
Student t | n=100 | 23.73 | 59.13 | 41.73 | 78.40 | 48.93 | 88.47 | 64.67 | 94.40
Student t | n=200 | 10.53 | 58.80 | 12.13 | 70.07 | 16.60 | 83.20 | 35.93 | 93.87
Skew Student t | n=100 | 21.53 | 61.07 | 32.53 | 76.00 | 38.67 | 87.53 | 72.07 | 94.00
Skew Student t | n=200 | 5.40 | 65.47 | 5.67 | 66.53 | 8.40 | 81.07 | 45.67 | 92.73

4.2 Sensitivity of Hyperparameters

We examine the sensitivity of the hyperparameters of the penalty parameters $\lambda_{1}$ and $\lambda_{2}$ in BEN-EL under the FB approach. Li and Lin [9] suggest, based on their experience, that the BEN posterior samples may depend heavily on the hyperparameters of the penalty parameters. Still, this has not been investigated for BEN-EL. Therefore, we investigate how the priors affect posterior inference by changing the hyperparameters: 1) $(r_{1},\delta_{1})$ of the gamma prior for $\lambda_{1}^{2}$ and 2) $(\psi_{2},\chi_{2})$ of the GIG prior for $\lambda_{2}$. In both simulations, we use one data set from Simulation 1 with $n=50$ and normal errors $N(0,3^{2})$. For the gamma prior, 1,600 combinations of the shape and rate parameters $r_{1}$ and $\delta_{1}\in\{0.25,0.5,\ldots,9.75,10\}$ are used. For the GIG prior, we fix $\chi_{2}=1$ and use 1,640 combinations of $\nu_{2}\in\{-5,-4.75,\ldots,4.75,5\}$ and $\psi_{2}\in\{0.25,0.5,\ldots,9.75,10\}$. The shape and rate parameters $r_{1}$ and $\delta_{1}$ of the gamma prior and $\nu_{2}$ and $\chi_{2}$ of the GIG prior act in opposite directions for the nonzero and zero coefficients. For example, as $r_{1}$ and $\nu_{2}$ increase and $\delta_{1}$ and $\chi_{2}$ decrease, the estimated nonzero coefficients decrease, whereas the estimated zero coefficients increase.

The step size $\omega$ is estimated by the proposed bisection algorithm using the same input parameters as in Simulation 1. We run four HMC chains of 2,000 iterations with 1,000 burn-ins. In all cases, the split-$\widehat{R}$ values are less than 1.01 and the HMC chains converge.

The estimated coefficients are affected by the choice of the hyperparameters, but the variability is not large enough to disturb the overall inference. Figure 1 shows heatmaps of the estimated coefficients $\theta_{1}$ and $\theta_{4}$ for different combinations of the hyperparameters. The estimated coefficients are the medians of the posteriors of $\theta_{1}$ and $\theta_{4}$, whose true values are 3 and 0, respectively. The estimated coefficients vary with the hyperparameters, but they remain close to the true coefficients. For example, the estimated $\theta_{1}$ differs only up to 0.15 and 0.4 across the various combinations of the hyperparameters of $\lambda_{1}^{2}$ and $\lambda_{2}$, respectively.

Figure 1: Estimated $\theta_{1}$ (first column) and $\theta_{4}$ (second column) with respect to the hyperparameters of the gamma prior of $\lambda_{1}^{2}$ (first row) and the GIG prior of $\lambda_{2}$ (second row).

5 Air pollution data analysis

We use the air pollution and mortality data [62]. The data set includes the mortality rate ($\boldsymbol{y}$) and 15 explanatory variables on weather, socioeconomics, and air pollution in 60 metropolitan areas of the United States from 1959 to 1961. See Table A in Appendix A for a description of the variables.

The air pollution data set has a complicated correlation structure. The correlation plot is presented in Figure 2. For example, the nitric oxide level $nox$ is highly correlated with the other air pollution variable, hydrocarbon pollution $hc$, but is also correlated with the weather variable $prec$ and the population variable $popn$.

Figure 2: The correlation matrix of the air pollution data.

We randomly split the data into a training set of 30 observations and a test set of 30 observations. The five models, BEN-EL, BEN, BL, EN, and LADL, are fit using the training set, and the prediction errors are obtained using the test set. For the Bayesian methods, the EB approach is used to select the penalty parameters, and the Bayesian credible interval with probability level $\alpha=0.5$ is used to select the variables. We repeat the procedure 100 times.

Table 5 reports the mean prediction errors and standard errors of the five models. BEN-EL yields the smallest mean prediction error, but its standard error is the second largest. The relatively large standard error of BEN-EL is due to one large prediction error. BEN-EL has a smaller median, mean, first quartile (Q1), and third quartile (Q3) than the other methods. Figure 7 in Appendix A shows boxplots of the prediction errors of BEN-EL, BEN, BL, and LADL. The results imply that the proposed method outperforms the other methods when the data have a complex correlation structure and a small number of observations.

Table: The mean prediction errors along with the standard errors (SE) of the BEN-EL, BEN, BL, EN, and LADL methods for the air pollution data analysis.

Method | Mean Prediction Error (SE)
BEN-EL | 1950 (679)
BEN | 2118 (518)
BL | 2133 (665)
EN | 3044 (2903)
LADL | 1974 (620)

6 Conclusion

We propose a new Bayesian approach for EN based on EL. Accordingly, the profile EL for the linear model replaces the ordinary likelihood function in the Bayesian setting. We use the HMC algorithm because the resulting posterior EL distribution lacks a closed form, and the implementation of even standard MCMC methods is challenging. A new HMC parameter tuning algorithm based on the bisection method is proposed for BEL. Simulation studies and the air pollution data application show that our approach outperforms the other methods when the sample size is small, data are correlated, and error distributions are misspecified.

BEL methods share the limitations of EL in the sense that the number of variables cannot exceed the sample size. The reason is that the zero vector must lie in the convex hull of the estimating equations for the profile EL ratio to be maximized; however, $X^{T}X$ is not of full rank when $p>n$ and hence is non-invertible. To address this shortcoming, several approaches have been proposed, such as the penalized EL [34] and double penalization [35]. Thus, a formal investigation of whether the proposed method could be extended to the $p>n$ case would be worthwhile future work.

Another interesting topic is implementing the NUTS algorithm of [51] in the BEL framework. NUTS automatically tunes the leapfrog steps and step size and has been used as the default sampler in popular MCMC software such as Stan [63] and PyMC3 [64]. However, a proper introduction of NUTS into BEL has not been studied; we observe that HMC chains often diverge when the NUTS algorithm is used for BEN-EL. Also, BEL is not supported by most MCMC libraries because EL cannot be specified as a likelihood function. Developing a BEL computational pipeline compatible with Stan or PyMC3 for practitioners will be of further interest.

Disclosure statement

No potential conflict of interest was reported by the authors.

Data availability statement

All data used in simulation studies are generated randomly and the air pollution data are obtained from McDonald et al. [62]. The R code used to generate, import, and analyze the data used in this paper is publicly available at the URL: https://github.com/chulmoon/BEN-EL.

References

  • [1] Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B. 1996;58(1):267–288.
  • [2] Tikhonov AN. On the stability of inverse problems. Dokl Akad Nauk SSSR. 1943;39(5):195–198.
  • [3] Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc B. 2005;67(2):301–320.
  • [4] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc. 2001;96(456):1348–1360.
  • [5] Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38(2):894–942.
  • [6] Bühlmann P, Van De Geer S. Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media; 2011.
  • [7] Tutz G, Ulbricht J. Penalized regression with correlation-based penalty. Stat Comput. 2009;19(3):239–253.
  • [8] Park T, Casella G. The bayesian lasso. J Am Stat Assoc. 2008;103(482):681–686.
  • [9] Li Q, Lin N. The bayesian elastic net. Bayesian Anal. 2010;5(1):151–170.
  • [10] Ishwaran H, Rao JS. Spike and slab variable selection: Frequentist and bayesian strategies. Ann Stat. 2005;33(2):730–773.
  • [11] Carvalho CM, Polson NG, Scott JG. The horseshoe estimator for sparse signals. Biometrika. 2010;97(2):465–480.
  • [12] Griffin JE, Brown PJ. Bayesian hyper-lassos with non-convex penalization. Aust N Z J Stat. 2011;53(4):423–442.
  • [13] Chen M, Carlson D, Zaas A, et al. Detection of viruses via statistical gene expression analysis. IEEE Trans Biomed Eng. 2010;58(3):468–479.
  • [14] Huang A, Xu S, Cai X. Empirical bayesian elastic net for multiple quantitative trait locus mapping. Heredity. 2015;114(1):107–115.
  • [15] Alhamzawi R. Bayesian elastic net tobit quantile regression. Commun Stat Simul Comput. 2016;45(7):2409–2427.
  • [16] Alshaybawee T, Midi H, Alhamzawi R. Bayesian elastic net single index quantile regression. J Appl Stat. 2017;44(5):853–871.
  • [17] Alhamzawi R, Ali HTM. The Bayesian elastic net regression. Commun Stat Simul Comput. 2018;47(4):1168–1178.
  • [18] Kyung M, Gill J, Ghosh M, et al. Penalized regression, standard errors, and Bayesian lassos. Bayesian Anal. 2010;5(2):369–411.
  • [19] Hans C. Bayesian lasso regression. Biometrika. 2009;96(4):835–845.
  • [20] Owen AB. Empirical likelihood ratio confidence intervals for a single functional. Biometrika. 1988;75(2):237–249.
  • [21] Owen AB. Empirical likelihood ratio confidence regions. Ann Stat. 1990;18(1):90–120.
  • [22] DiCiccio T, Hall P, Romano J. Empirical likelihood is Bartlett-correctable. Ann Stat. 1991;19(2):1053–1061.
  • [23] Chen SX, Cui H. On Bartlett correction of empirical likelihood in the presence of nuisance parameters. Biometrika. 2006;93(1):215–220.
  • [24] Qin J, Lawless J. Empirical likelihood and general estimating equations. Ann Stat. 1994;22(1):300–325.
  • [25] Hall P, Owen AB. Empirical likelihood confidence bands in density estimation. J Comput Graph Stat. 1993;2(3):273–289.
  • [26] Qin G, Zhou XH. Empirical likelihood inference for the area under the ROC curve. Biometrics. 2006;62:613–622.
  • [27] Wang D, Chen SX. Empirical likelihood for estimating equations with missing values. Ann Stat. 2009;37(1):490–517.
  • [28] Chen SX, Keilegom IV. A review on empirical likelihood methods for regression. TEST. 2009;18:415–447.
  • [29] Hjort NL, McKeague IW, Van Keilegom I. Extending the scope of empirical likelihood. Ann Stat. 2009;37(3):1079–1111.
  • [30] Chen SX, Peng L, Qin YL. Effects of data dimension on empirical likelihood. Biometrika. 2009;96(3):711–722.
  • [31] Tang CY, Leng C. Penalized high-dimensional empirical likelihood. Biometrika. 2010;97(4):905–920.
  • [32] Leng C, Tang CY. Penalized empirical likelihood and growing dimensional general estimating equations. Biometrika. 2012;99(3):703–716.
  • [33] Fan J, Peng H. Nonconcave penalized likelihood with a diverging number of parameters. Ann Stat. 2004;32(3):928–961.
  • [34] Lahiri SN, Mukhopadhyay S. A penalized empirical likelihood method in high dimensions. Ann Stat. 2012;40(5):2511–2540.
  • [35] Chang J, Tang CY, Wu TT. A new scope of penalized empirical likelihood with high-dimensional estimating equations. Ann Stat. 2018;46(6B):3185–3216.
  • [36] Lazar NA. Bayesian empirical likelihood. Biometrika. 2003;90(2):319–326.
  • [37] Grendar M, Judge G. Asymptotic equivalence of empirical likelihood and Bayesian MAP. Ann Stat. 2009;37(5A):2445–2457.
  • [38] Fang KT, Mukerjee R. Empirical-type likelihoods allowing posterior credible sets with frequentist validity: higher-order asymptotics. Biometrika. 2006;93(3):723–733.
  • [39] Chang IH, Mukerjee R. Bayesian and frequentist confidence intervals arising from empirical-type likelihoods. Biometrika. 2008;95(1):139–147.
  • [40] Rao JNK, Wu C. Bayesian pseudo-empirical-likelihood intervals for complex surveys. J R Stat Soc B. 2010;72(4):533–544.
  • [41] Chaudhuri S, Ghosh M. Empirical likelihood for small area estimation. Biometrika. 2011;98(2):473–480.
  • [42] Yang Y, He X. Bayesian empirical likelihood for quantile regression. Ann Stat. 2012;40(2):1102–1131.
  • [43] Mengersen KL, Pudlo P, Robert CP. Bayesian computation via empirical likelihood. Proc Natl Acad Sci U S A. 2013;110(4):1321–1326.
  • [44] Chaudhuri S, Mondal D, Yin T. Hamiltonian Monte Carlo sampling in Bayesian empirical likelihood computation. J R Stat Soc B. 2017;79(1):293–320.
  • [45] Bedoui A, Lazar NA. Bayesian empirical likelihood for ridge and lasso regressions. Comput Stat Data Anal. 2020;145:106917.
  • [46] Chib S, Greenberg E. Understanding the Metropolis-Hastings algorithm. Am Stat. 1995;49(4):327–335.
  • [47] Owen AB. Empirical likelihood. Chapman & Hall/CRC, Boca Raton; 2001.
  • [48] Andrews DF, Mallows CL. Scale mixtures of normal distributions. J R Stat Soc B. 1974;36(1):99–102.
  • [49] Hörmann W, Leydold J. Generating generalized inverse Gaussian random variates. Stat Comput. 2014;24(4):547–557.
  • [50] Neal RM. MCMC using Hamiltonian dynamics. In: Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC; 2011.
  • [51] Hoffman MD, Gelman A. The no-U-turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J Mach Learn Res. 2014;15(1):1593–1623.
  • [52] Burden R, Faires J, Burden A. Numerical analysis. Cengage Learning; 2015.
  • [53] Beskos A, Pillai N, Roberts G, et al. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli. 2013;19(5A):1501–1534.
  • [54] Vehtari A, Gelman A, Simpson D, et al. Rank-normalization, folding, and localization: An improved $\widehat{R}$ for assessing convergence of MCMC. Bayesian Anal. 2021;1(1):1–28.
  • [55] Casella G. Empirical Bayes Gibbs sampling. Biostatistics. 2001;2(4):485–500.
  • [56] Li J, Das K, Fu G, et al. The Bayesian lasso for genome-wide association studies. Bioinformatics. 2011;27(4):516–523.
  • [57] Pasanen L, Holmström L, Sillanpää MJ. Bayesian lasso, scale space and decision making in association genetics. PLoS One. 2015;10(4):e0120017.
  • [58] Bernardo JM, Smith AFM. Bayesian theory. John Wiley & Sons, Chichester, New York, USA; 1994.
  • [59] Wang H, Li G, Jiang G. Robust regression shrinkage and consistent variable selection through the LAD-lasso. J Bus Econ Stat. 2007;25(3):347–355.
  • [60] Fernández C, Steel MF. On Bayesian modeling of fat tails and skewness. J Am Stat Assoc. 1998;93(441):359–371.
  • [61] Wuertz D, Setz T, Chalabi Y, et al. fGarch: Rmetrics - autoregressive conditional heteroskedastic modelling; 2020. R package version 3042.83.2.
  • [62] McDonald GC, Schwing RC. Instabilities of regression estimates relating air pollution to mortality. Technometrics. 1973;15(3):463–481.
  • [63] Stan Development Team. Stan modeling language users guide and reference manual, version 2.28; 2022.
  • [64] Salvatier J, Wiecki TV, Fonnesbeck C. Probabilistic programming in Python using PyMC3. PeerJ Comput Sci. 2016;2:e55.

Appendix A

Figure 3: Trace plots of the posterior samples, where $\boldsymbol{\theta}=(3,1.5,0,0,2,0,0,0)$, the errors follow the skew Student $t$ distribution, and $n=50$ in Simulation 1.
Figure 4: Histograms of the posterior samples, where $\boldsymbol{\theta}=(3,1.5,0,0,2,0,0,0)$, the errors follow the skew Student $t$ distribution, and $n=50$ in Simulation 1.
Figure 5: Trace plots of the posterior samples, where $\theta_{1}=\theta_{2}=\theta_{3}=\theta_{4}=3$, $\theta_{16}=\theta_{17}=\theta_{18}=\theta_{19}=0$, the errors follow the skew Student $t$ distribution, and $n=100$ in Simulation 2.
Figure 6: Histograms of the posterior samples, where $\theta_{1}=\theta_{2}=\theta_{3}=\theta_{4}=3$, $\theta_{16}=\theta_{17}=\theta_{18}=\theta_{19}=0$, the errors follow the skew Student $t$ distribution, and $n=100$ in Simulation 2.
Table: Summary of variables of the air pollution data set.

  Variable type   Variable name   Description
  Weather         prec            Mean annual precipitation (in)
                  jant            Mean January temperature (F)
                  jult            Mean July temperature (F)
                  humid           Annual average relative humidity at 1 pm (%)
  Socioeconomic   ovr95           Percentage of population aged 65 or older in 1960
                  popn            Population per household in 1960
                  educ            Median school years completed
                  hous            Percentage of housing units
                  dens            Population per square mile in 1960
                  nonw            Percentage non-white population in 1960
                  wwdrk           Percentage employed in white collar occupations in 1960
                  poor            Percent of families with income under 3,000 dollars in 1960
  Pollution       hc              Relative hydrocarbon pollution potential
                  nox             Relative oxides of nitrogen pollution potential
                  so              Relative sulfur dioxide pollution potential
  Response        mort            Age-adjusted mortality rate per 100,000

Figure 7: Boxplots of prediction errors for the air pollution data. BEN-EL has the smallest mean, median, Q1, and Q3. The prediction errors of EN are not included because they differ substantially from those of the other methods.