
Approximate cross-validated mean estimates for Bayesian hierarchical regression models

Amy Zhang
Department of Statistics, Pennsylvania State University
and
Michael J. Daniels
Department of Statistics, University of Florida
and
Changcheng Li
School of Mathematical Sciences, Dalian University of Technology
and
Le Bao
Department of Statistics, Pennsylvania State University
Abstract

We introduce a novel procedure for obtaining cross-validated predictive estimates for Bayesian hierarchical regression models (BHRMs). BHRMs are popular for modeling complex dependence structures (e.g., Gaussian processes and Gaussian Markov random fields) but can be computationally expensive to run. Cross-validation (CV) is, therefore, not a common practice for evaluating the predictive performance of BHRMs. Our method circumvents the need to re-run computationally costly estimation methods for each cross-validation fold and makes CV more feasible for large BHRMs. We shift the CV problem from probability-based sampling to a familiar and straightforward optimization problem by conditioning on the variance-covariance parameters. Our approximation applies to leave-one-out CV and leave-one-cluster-out CV, the latter of which is more appropriate for models with complex dependencies. In many cases, this produces estimates equivalent to full CV. We provide theoretical results, demonstrate the efficacy of our method on publicly available data and in simulations, and compare its performance with that of several competing methods for CV approximation. Code and other supplementary materials are available online.


Keywords: Bayesian hierarchical regression model, multi-level model, leave-one-out, leave-cluster-out, plug-in estimator

1 Introduction

Bayesian hierarchical regression models (BHRMs) are often used to model complex dependence structures while producing probabilistic uncertainty estimates. Except for the simplest models, BHRMs require computationally expensive methods such as Markov chain Monte Carlo (MCMC) to obtain the posterior density. This computational burden has led to many papers that either present new methods to approximate the posterior density (e.g., Kingma and Welling, 2013; Lewis and Raftery, 1997; Rue et al., 2009) or attempt to make current methods more efficient (Bardenet et al., 2017; Korattikara et al., 2014; Quiroz et al., 2019).

The computational cost of a BHRM increases by an order of magnitude when we need to repeat the posterior density estimation for each fold of cross-validation (CV). A common approach is to randomly partition the data into $K$ equally sized subsets, called $K$-fold CV. Reducing the number of CV folds, $K$, can mitigate the cost of cross-validation. However, when the data are not independent of each other, random $K$-fold cross-validation can select models which overfit because of the high correlation between test and training data (Arlot et al., 2010; Opsomer et al., 2001). Models with repeated measures are one such example, where the correlation between repeated samples from the same unit means that random $K$-fold cross-validation can select models which overfit the unit-specific effects. This produces over-optimistic estimates of the selected model's performance on a new unit. Cross-validation that drops out whole clusters (possibly more than one) at a time is the most robust choice, but it often requires far more than 10 CV folds.

We reduce the computational cost of BHRMs in the CV setting by introducing a novel procedure for obtaining BHRM posterior mean estimates. We use the full-data posterior mean of the variance-covariance parameters as a plug-in estimator and re-estimate all other model parameters given the training data and the plug-in estimate. Our method is motivated by Kass and Steffey (1989), who approximated the posterior mean $E(X\beta|Y)$ using the conditional mean given the estimated variance terms. Our extensions include (1) approximating hyperparameters in the CV setting, (2) allowing the model to have random effects with any dependence structure, (3) extending the approximation from Bayesian hierarchical linear regression models to generalized linear mixed models, and (4) considering leave-one-cluster-out CV, which is more appropriate for models with complex dependencies. We show with multiple datasets that our method produces more stable estimates than other competing methods while keeping computation time reasonable.

In Section 2, we introduce our procedure, which we refer to as AXE, an abbreviation for (A)pproximate (X)cross-validation (E)stimates. It can be applied to both linear mixed models and generalized linear mixed models, and to many CV schemas, e.g., $K$-fold, leave-one-out (LOO), and leave-one-cluster-out (LCO). We also discuss AXE's asymptotic properties and finite-sample behavior. In Section 3, we summarize competing methods for CV approximation and extend some of them so that they are applicable to LCO-CV. Section 4 presents and discusses empirical results comparing manual cross-validation (MCV) and the different approximate CV estimates using a variety of publicly available data sets. In Section 5, we summarize the method and our findings.

2 Approximate cross-validation estimates using plug-in estimators (AXE)

Section 2.1 presents the AXE procedure in the linear regression setting. Section 2.2 presents the AXE procedure in the generalized linear regression setting, which requires an additional approximation. A particular challenge in the LCO-CV approximation, described in detail in Section 2.3, is that all observations informing the estimation of a random effect are associated with the test data. Section 2.4 describes the large-sample behavior of AXE estimates. Section 2.5 provides a finite-sample diagnostic, which we call AXE+.

We introduce the notation used throughout the paper. Let $Y\in I\!\!R^{N}$ denote a continuous response vector and $X\in I\!\!R^{N\times P}$ denote the design matrix. For $J$-fold cross-validation, $j=1,\dots,J$ indicates the cross-validation folds; $\mathpzc{s}_{j}$ corresponds to the indices of the held-out data for CV fold $j$; $n_{j}$ is the corresponding sample size; $Y_{\mathpzc{s}_{j}}\in I\!\!R^{n_{j}}$ is the test response data; $Y_{-\mathpzc{s}_{j}}\in I\!\!R^{N-n_{j}}$ the training response data; and similarly $X_{\mathpzc{s}_{j}}\in I\!\!R^{n_{j}\times P}$ refers to the rows of $X$ indexed by $\mathpzc{s}_{j}$, while $X_{-\mathpzc{s}_{j}}\in I\!\!R^{(N-n_{j})\times P}$ refers to the remaining rows of $X$ without $X_{\mathpzc{s}_{j}}$. We denote the transpose of a matrix (or vector) $A$ as $A^{\prime}$.

2.1 AXE for linear mixed models

In linear mixed models (LMM), the continuous response vector, $Y$, follows

$$Y|\beta,\phi\sim N(X\beta,\phi^{2}), \qquad (1)$$

where $X$ is the matrix of non-random covariates, $\beta\coloneqq(\beta_{1}^{\prime},\beta_{2}^{\prime})^{\prime}\in I\!\!R^{P}$ includes fixed effects $\beta_{1}$ and random effects $\beta_{2}$, and $\phi\in I\!\!R^{+}$ is a positive scalar. The prior on $\beta$ and the hyperpriors on $\Sigma$ and $\phi$ are

$$\beta\sim N\left(\alpha,\begin{bmatrix}C&0\\ 0&\Sigma\end{bmatrix}\right),\qquad \Sigma\sim f_{\Sigma}(\Sigma),\quad\phi\sim f_{\phi}(\phi), \qquad (2)$$

where $C\in I\!\!R^{P_{1}\times P_{1}}$ is a fixed positive-definite matrix and often a diagonal matrix with large positive values. The hyperparameter $\Sigma\in I\!\!R^{P_{2}\times P_{2}}$ can take many forms, as long as it is positive-definite; e.g., models with Gaussian Markov random fields sample from the space of all possible $\Sigma$'s such that a variable in $\beta_{2}$ is independent of all others, given its neighborhood. Note that $f_{\phi}(\phi)$ in our setting is a fixed proper prior. Hence $\phi=\mathcal{O}_{P}(1)$, which means that for any $\epsilon>0$, there exists $\delta>0$ such that $P(|\phi|\leq\delta)>1-\epsilon$. In addition, from $\phi>0$, we know that $\phi^{-1}=\mathcal{O}_{P}(1)$. The probabilities for $\phi$ and $\phi^{-1}$ come from their prior distributions, following the Bayesian point of view. Finally, we take $\alpha=0$ throughout this paper without loss of generality.
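To make the model concrete, the following is a minimal R simulation sketch of (1)-(2), assuming a one-way random-intercept structure with $\Sigma=\sigma_{\theta}^{2}I$; the dimensions and parameter values are illustrative only and not tied to any example in the paper.

```r
# Minimal simulation sketch of the LMM in (1)-(2), assuming a random-intercept
# structure with Sigma = sigma_theta^2 * I; all values below are illustrative.
set.seed(1)
J <- 20; n_j <- 10; N <- J * n_j
sigma_theta <- 1; phi <- 0.5
beta1 <- c(2, -1)                                       # fixed effects, beta_1
theta <- rnorm(J, 0, sigma_theta)                       # random effects, beta_2
X1 <- cbind(1, rnorm(N))                                # fixed-effect design
Z  <- model.matrix(~ 0 + factor(rep(1:J, each = n_j)))  # cluster indicators
X  <- cbind(X1, Z)                                      # full design, X = [X1 Z]
Y  <- rnorm(N, drop(X %*% c(beta1, theta)), phi)        # response as in (1)
```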

The posterior mean for $\beta$ conditioned on variance parameters has the form

$$E(\beta|\Sigma,\phi,Y)=\phi^{-2}VX^{\prime}Y,\quad V=\left(\phi^{-2}X^{\prime}X+\begin{bmatrix}C^{-1}&0\\ 0&\Sigma^{-1}\end{bmatrix}\right)^{-1}, \qquad (3)$$

which is analogous to the form of a Frequentist linear regression. We propose to use the full-data posterior means $\hat{\Sigma}=E(\Sigma|Y)$ and $\hat{\phi}=E(\phi|Y)$ as plug-in estimators in (3) to derive cross-validated mean estimates. In the linear regression setting, this results in a straightforward closed-form solution for the cross-validated mean estimates:

$$\hat{Y}_{\mathpzc{s}_{j}}^{\text{AXE}}=E(X_{\mathpzc{s}_{j}}\beta|Y_{-\mathpzc{s}_{j}},\hat{\Sigma},\hat{\phi})=\hat{\phi}^{-2}X_{\mathpzc{s}_{j}}\left(\hat{\phi}^{-2}X_{-\mathpzc{s}_{j}}^{\prime}X_{-\mathpzc{s}_{j}}+\begin{bmatrix}C^{-1}&0\\ 0&\hat{\Sigma}^{-1}\end{bmatrix}\right)^{-1}X_{-\mathpzc{s}_{j}}^{\prime}Y_{-\mathpzc{s}_{j}}. \qquad (4)$$

The covariance of the fixed effects $\beta_{1}$, $C$, is not estimated and is fixed at the diagonal matrix of prior variances for $\beta_{1}$. One common practice is to take $C^{-1}$ as the matrix of 0s, which corresponds to the assumption that the fixed effects have infinite variance.

The AXE procedure in (4) essentially proposes using a Frequentist cross-validation to estimate $E(Y_{\mathpzc{s}_{j}}|Y_{-\mathpzc{s}_{j}})$ in place of a full Bayesian estimation of the posterior predictive density. AXE is likewise $\mathcal{O}(N^{2}P+P^{3})$ in time for each CV fold. In comparison, Gibbs sampling of the same problem (when available) is $\mathcal{O}(MN^{3}P+MNP^{2}+MP^{3})$ in time for each CV fold, where $M$ is the number of MCMC iterations.
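As a concrete illustration, the following is a minimal R sketch of the closed-form AXE estimate in (4) for a single CV fold. The inputs `X`, `Y`, the test index set `s_j`, the plug-in posterior means `Sigma_hat` and `phi_hat`, and the fixed-effect prior precision `C_inv` are assumed to come from a full-data fit; the function name is ours and not part of any released package.

```r
# Sketch of the AXE estimate (4) for one CV fold. Assumes the columns of X are
# ordered as [fixed effects, random effects], matching the block order in (2).
axe_lmm <- function(X, Y, s_j, Sigma_hat, phi_hat, C_inv) {
  X_train <- X[-s_j, , drop = FALSE]
  Y_train <- Y[-s_j]
  # Block-diagonal prior precision: C^{-1} for fixed effects, Sigma_hat^{-1}
  # for random effects.
  P1 <- nrow(C_inv); P2 <- nrow(Sigma_hat)
  prior_prec <- matrix(0, P1 + P2, P1 + P2)
  prior_prec[seq_len(P1), seq_len(P1)] <- C_inv
  prior_prec[P1 + seq_len(P2), P1 + seq_len(P2)] <- solve(Sigma_hat)
  # V^{-1} = phi^{-2} X'X + prior precision, as in (3)-(4).
  V_inv    <- crossprod(X_train) / phi_hat^2 + prior_prec
  beta_hat <- solve(V_inv, crossprod(X_train, Y_train)) / phi_hat^2
  drop(X[s_j, , drop = FALSE] %*% beta_hat)    # \hat{Y}_{s_j}^{AXE}
}

# Example with the simulated data above, using the true variance values as
# stand-ins for posterior means and C^{-1} = 0 (flat prior on fixed effects):
# yhat_1 <- axe_lmm(X, Y, which(rep(1:J, each = n_j) == 1),
#                   diag(sigma_theta^2, J), phi, diag(0, 2))
```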

Our approach relies on the approximation that the conditional expectation of the posterior mean evaluated at the full-data posterior means of the variance parameters, $\hat{\Sigma}=E(\Sigma|Y)$ and $\hat{\phi}=E(\phi|Y)$, approximates the desired posterior predictive mean, i.e., $E(X_{\mathpzc{s}_{j}}\beta|Y_{-\mathpzc{s}_{j}},\hat{\Sigma},\hat{\phi})\approx E(X_{\mathpzc{s}_{j}}\beta|Y_{-\mathpzc{s}_{j}})$. This looks similar to the approximation in Kass and Steffey (1989), but here we are in a cross-validation setting and we plug in the full-data estimates (as opposed to the training-data estimates). In Section 2.4, we show that the approximation holds asymptotically under leave-one-cluster-out cross-validation.

2.2 AXE for generalized linear mixed models

Generalized linear mixed models (GLMMs) assume that $Y|\beta$ is not normally distributed. We follow the approximate inference in GLMMs developed by Breslow and Clayton (1993) and derive the corresponding AXE for GLMMs. Suppose $Y|\beta$ are independent with some probability density function $\pi$, so that

$$Y_{i}|X_{i}\beta\sim\pi(u_{i},a_{i},\phi^{2},v(u_{i})), \qquad (5)$$

where $E(Y_{i}|X_{i}\beta)=u_{i}$, $\mathrm{var}(Y_{i}|X_{i}\beta)=a_{i}\phi^{2}v(u_{i})$, $a_{i}$ is a known constant, and $\phi^{2}$ is a dispersion parameter. The regression model assumes that $g(u_{i})=X_{i}\beta$, where $g(\cdot)$ is the link function, and the prior distribution of $\beta$ is specified in (2).

Note that the normal priors on the coefficients $\beta$ are no longer conjugate, and the analytic solution of (4) is unavailable. Instead, an iterative weighted least squares (IWLS) algorithm is commonly used to fit GLMs and GLMMs (Holland and Welsch, 1977; Green, 1984). Defining a working response,

$$\tilde{Y}_{i}=X_{i}\beta+(Y_{i}-u_{i})g^{\prime}(u_{i}),$$

and an $n\times n$ diagonal weight matrix, $W$, with diagonal terms

$$w_{i}=\{\phi^{2}a_{i}v(u_{i})[g^{\prime}(u_{i})]^{2}\}^{-1},$$

i.e., for the logistic model, $g(u_{i})=\log(u_{i})-\log(1-u_{i})$, $g^{\prime}(u_{i})=\frac{1}{u_{i}(1-u_{i})}$, and $w_{i}=u_{i}(1-u_{i})$; for the Poisson model, $g(u_{i})=\log(u_{i})$, $g^{\prime}(u_{i})=\frac{1}{u_{i}}$, and $w_{i}=u_{i}$. IWLS repeatedly updates the working response vector and the weight matrix until convergence. Given $W$, Harville (1977) shows that the best linear unbiased estimator (BLUE) of $\beta$ can be obtained from the associated weighted linear model,

$$\tilde{Y}|\beta,\sigma^{2}\sim N(X\beta,W^{-1}).$$

In the cross-validation setting, we approximate the training-data weights, $W_{-\mathpzc{s}_{j}}$, by the corresponding weights at the convergence of the full-data analysis. This is reasonable under the conditions we present in Section 2.1: 1) $E(X_{\mathpzc{s}_{j}}\beta|\tilde{Y}_{-\mathpzc{s}_{j}},\dot{\Sigma},\dot{\phi})\approx E(X_{\mathpzc{s}_{j}}\beta|\tilde{Y}_{-\mathpzc{s}_{j}})$, where $\dot{\Sigma}=E(\Sigma|\tilde{Y}_{-\mathpzc{s}_{j}})$ and $\dot{\phi}=E(\phi|\tilde{Y}_{-\mathpzc{s}_{j}})$; and 2) $E(\Sigma|\tilde{Y})\approx\dot{\Sigma}$, $E(\phi|\tilde{Y})\approx\dot{\phi}$.

We plug in the working response vector, $\tilde{Y}_{-\mathpzc{s}_{j}}$, the weight matrix, $W_{-\mathpzc{s}_{j}}$, and the other variance estimates, $\hat{\Sigma}$, from the full-data analysis, and obtain the following AXE estimate for the test-data working response, $\tilde{Y}_{\mathpzc{s}_{j}}$,

$$\hat{\tilde{Y}}_{\mathpzc{s}_{j}}^{\text{AXE}}=E(X_{\mathpzc{s}_{j}}\beta|\tilde{Y}_{-\mathpzc{s}_{j}},W_{-\mathpzc{s}_{j}},\hat{\Sigma})=X_{\mathpzc{s}_{j}}\left(X_{-\mathpzc{s}_{j}}^{\prime}W_{-\mathpzc{s}_{j}}X_{-\mathpzc{s}_{j}}+\begin{bmatrix}C^{-1}&0\\ 0&\hat{\Sigma}^{-1}\end{bmatrix}\right)^{-1}X_{-\mathpzc{s}_{j}}^{\prime}W_{-\mathpzc{s}_{j}}\tilde{Y}_{-\mathpzc{s}_{j}}. \qquad (6)$$

Finally, we transform the test-data working response to the original scale via $\hat{Y}_{\mathpzc{s}_{j}}^{\text{AXE}}=g^{-1}(\hat{\tilde{Y}}_{\mathpzc{s}_{j}}^{\text{AXE}})$. If there is any concern that the weight matrix might differ between the full-data analysis and the training-data analysis, we can run a few more iterations of the IWLS algorithm: starting from the AXE estimates of the working responses, we iteratively update the training-data weights and working responses. In practice, $\hat{\tilde{Y}}_{\mathpzc{s}_{j}}^{\text{AXE}}$ without additional IWLS iterations works well.
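As a sketch of (6) for a logistic GLMM, the snippet below forms the working response and IWLS weights at the full-data fit and then solves the weighted least-squares problem for one CV fold. The fitted probabilities `u_hat` (from the full-data analysis), `Sigma_hat`, and `C_inv` are assumed inputs, and the helper name is ours.

```r
# Sketch of the GLMM AXE estimate (6) for a logistic model, using the working
# response and weights evaluated at the full-data fit. Assumes the columns of
# X are ordered as [fixed effects, random effects].
axe_glmm_logit <- function(X, Y, s_j, u_hat, Sigma_hat, C_inv) {
  # g(u) = logit(u), g'(u) = 1 / (u(1-u)), w = u(1-u) at the full-data fit.
  Y_tilde <- qlogis(u_hat) + (Y - u_hat) / (u_hat * (1 - u_hat))
  w <- u_hat * (1 - u_hat)
  X_tr <- X[-s_j, , drop = FALSE]; Yt_tr <- Y_tilde[-s_j]; w_tr <- w[-s_j]
  P1 <- nrow(C_inv); P2 <- nrow(Sigma_hat)
  prior_prec <- matrix(0, P1 + P2, P1 + P2)
  prior_prec[seq_len(P1), seq_len(P1)] <- C_inv
  prior_prec[P1 + seq_len(P2), P1 + seq_len(P2)] <- solve(Sigma_hat)
  A <- crossprod(X_tr, w_tr * X_tr) + prior_prec        # X'WX + prior precision
  beta_hat <- solve(A, crossprod(X_tr, w_tr * Yt_tr))   # solve for X'W Y-tilde
  plogis(drop(X[s_j, , drop = FALSE] %*% beta_hat))     # back-transform via g^{-1}
}
```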

2.3 Leave-one-cluster-out cross-validation (LCO-CV)

One can apply AXE to any CV design, of which leave-one-out (LOO-CV) and leave-one-cluster-out (LCO-CV) are popular choices. For both LOO-CV and LCO-CV, the number of CV folds can grow with the amount of data, making the computation expensive. Thus they typically benefit the most from CV approximation methods. We focus on proving AXE convergence under LCO-CV, of which LOO-CV is a special case. We first define LCO-CV and related notation. Then, we introduce and discuss existing LCO-CV approximation methods.

In LOO-CV, a single observation is held out as test data for each CV fold, and the number of CV folds equals the number of data points, $N$. LOO-CV is commonly used because it maintains as much similarity to the full-data model posterior as possible while evaluating the fit to new data. LCO-CV is often used in models with clustered data and cluster-specific parameters, e.g., the patient-specific random effects in a longitudinal model with $J$ patients. When the data are sampled uniformly at random to form CV folds, data from one patient can be split between the training and test data. This introduces correlation between the training and test data and can result in overfitting to the patient-specific parameters (Arlot et al., 2010; Opsomer et al., 2001). To fairly evaluate the predictive capability for a new cluster, LCO-CV defines $J$ CV folds corresponding to the $J$ clusters and withholds all observations informing the estimation of a cluster-specific random effect.
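The distinction between random $K$-fold and LCO fold construction can be made concrete with a small sketch; the cluster labels and fold counts below are hypothetical.

```r
# Contrast of random K-fold assignment with leave-one-cluster-out folds.
set.seed(1)
cluster <- rep(1:10, each = 8)        # 10 clusters, 8 repeated measures each
N <- length(cluster)

# Random K-fold: measurements from the same cluster land in both training and
# test sets, so cluster-specific effects leak into the test data.
kfold_id <- sample(rep(1:5, length.out = N))

# LCO-CV: one fold per cluster; all observations informing a cluster-specific
# random effect are withheld together.
lco_folds <- split(seq_len(N), cluster)
s_j       <- lco_folds[["3"]]         # test indices for CV fold j = 3
train_idx <- setdiff(seq_len(N), s_j)
```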

LCO-CV has been a challenging CV design for approximate inference. Existing CV approximation methods use the full-data posterior density $f(\beta|Y)$ to approximate the training-data posterior density $f(\beta|Y_{-\mathpzc{s}_{j}})$. Under LCO-CV, these approximations can be inaccurate for two reasons. First, since the test data correspond to all the data in a cluster, the test-data sample size can be large, and the larger the sample size of the test data, the more likely it is that the difference between the densities $f(\beta|Y)$ and $f(\beta|Y_{-\mathpzc{s}_{j}})$ is large (Gelfand et al., 1992). Second, under cross-validation, the cluster-specific random effects, $\theta_{j}$, cannot be estimated from the training data, relying instead on samples drawn from $\theta_{j}|\theta_{-j},\Sigma$.

To focus on the random effects which pertain to the LCO-CV design, we split $\beta$ into $\theta$ and the remaining coefficients in $\beta$:

$$\beta=(\beta_{/\theta}^{\prime},\ \theta^{\prime})^{\prime}, \qquad (7)$$

where $\theta$ refers to the part of the random effects that pertains to the CV design, and $\beta_{/\theta}$ refers to the components of $\beta$ other than $\theta$, which contain all fixed effects and any random effects that do not pertain to the CV design. Here $\theta\coloneqq(\theta_{1},\theta_{2},\ldots,\theta_{J})^{\prime}\in I\!\!R^{K}$ with $J\leq K\leq P_{2}$. Similarly, we split $X$ column-wise into

$$X=[X_{\beta_{/\theta}}\ \ X_{\theta}], \qquad (8)$$

where $X_{\theta}$ contains the subset of covariates that correspond to $\theta$ and $X_{\beta_{/\theta}}$ contains the remaining columns of $X$.

2.4 Asymptotic behavior under LCO-CV

For the linear mixed model (LMM) case, we first show that the full-data posterior means of the variance parameters $\Sigma$ and $\phi$ approximate the training-data posterior means; see Theorem 2.1. We then show the convergence of the AXE approximation in Corollary 2.2.

Theorem 2.1.

Let the response vector $Y\in I\!\!R^{N}$ of a hierarchical linear regression follow a normal distribution as in (1) and (2), and define $\theta$ as in (7). The data are partitioned into $J$ CV folds based on $\theta$, where all data informing $\theta_{j}$ correspond to the test data for CV fold $j$, and $\mathpzc{s}_{j}\subset\{1,\dots,N\}$ is the set of indices for the test data in the $j^{\text{th}}$ CV fold. Let $X_{\mathpzc{s}_{j}}=\mathbbm{1}_{n_{j}}x_{j}^{\prime}$ for some vector $x_{j}\in I\!\!R^{P}$, where $\mathbbm{1}_{n_{j}}$ is a vector of $1$s with length $n_{j}$, $n_{j}$ is the size of $\mathpzc{s}_{j}$, and $n_{j}\geq 1$; let $\mathbbm{1}\in\text{span}(X)$; and let $f_{\Sigma}(\Sigma)$ and $f_{\phi}(\phi)$ be proper prior densities. $V$ is defined as in (3). As $J$ goes to infinity (and thus $N$ goes to infinity), we have

$$|E(\phi|Y_{-\mathpzc{s}_{j}})^{-1}E(\phi|Y)-1|=\mathcal{O}_{P}\left(n_{j}/N\right),$$
$$\|E(\Sigma|Y_{-\mathpzc{s}_{j}})^{-1}E(\Sigma|Y)-I\|=\mathcal{O}_{P}\left(n_{j}/N\right),$$

where $I$ is the identity matrix, $\|\cdot\|$ is the operator norm of a matrix defined as $\|A\|=\sqrt{\lambda_{1}(A^{\prime}A)}$, and $\lambda_{1}$ is the largest eigenvalue.

For proof, see Appendix A, included in Supplemental Materials.

Remark 1.

Throughout this article, expectations such as $E(\,\cdot\,|Y_{-\mathpzc{s}_{j}})$ and $E(\,\cdot\,|Y)$ are functions of $Y$, which is random. Thus, $\mathcal{O}_{P}$ refers to the randomness in $Y$, with the data generation process defined in (1) and (2).

Remark 2.

The requirement that $X_{\mathpzc{s}_{j}}=\mathbbm{1}_{n_{j}}x_{j}^{\prime}$ is a technical one, so that an analytical form of the difference $\ell(\Sigma,\phi|Y_{-\mathpzc{s}_{j}})-\ell(\Sigma,\phi|Y)$ can be used to establish the above asymptotic properties, where $\ell(\Sigma,\phi|Y_{-\mathpzc{s}_{j}})$ and $\ell(\Sigma,\phi|Y)$ are the log posterior densities of $\Sigma$ and $\phi$ given $Y_{-\mathpzc{s}_{j}}$ and $Y$, respectively. This condition holds in many scenarios, such as repeated-measures designs. We also believe this condition to be stronger than required in practice and include multiple real-data examples which violate it and for which AXE nonetheless performs well; see the ESP, SLC, and SRD examples in Section 4. The requirement that the resulting posterior distributions are proper is a mild condition that is easily satisfied in the case of normal priors on $\beta$.

Remark 3.

When the clusters are of similar sizes, the rate $\mathcal{O}_{P}(n_{j}/N)$ simplifies to the rate $\mathcal{O}_{P}(1/J)$. However, the rate $\mathcal{O}_{P}(n_{j}/N)$ accounts for cases where clusters vary significantly in size, and it remains valid in unbalanced settings. This remark also applies to Corollary 2.2.

Remark 4.

In our experimental results, we have found that a large $J$ is more critical for the accuracy of AXE than a large $n_{j}$. In the Radon subsets example, we varied both $J$ (from 3 to 12) and the percentage of test data (from 30% to 70%). We found little difference in AXE error as the test-data percentage increased, but large reductions in AXE error as $J$ increased. This is because AXE error, in most cases, comes from a large variance in the estimate of $\Sigma$ across CV folds, and $\Sigma$ is independent of the data given $\beta$. Removing the $n_{j}$ test observations only removes the information from one realization of $\theta_{j}$ for the estimation of $\Sigma$, out of $J$ total such realizations.

Remark 5.

The proof for the convergence of $\phi$ is an extension of the proof for $\Sigma$ and so is presented with the same error bound. In most cases, we expect $\phi$ to have a narrower error bound than $\Sigma$, as all observations inform the estimation of $\phi$ equally and the overall sample size, $N$, is more critical for the estimation of $\phi$ than the number of clusters, $J$. We found this to be the case in our example applications; $\phi$ was generally well-estimated, with a relatively narrow density for $\phi|Y$ and a correspondingly smaller difference between $E(\phi|Y_{-\mathpzc{s}_{j}})$ and $\hat{\phi}$. However, we do expect $n_{j}$ to be more critical for the convergence of $\phi$ than $J$.

Remark 6.

We do not assume any specific form for $\Sigma$, which may consist of $P_{2}(P_{2}+1)/2$ separate random variables or, in many cases, may be a function of a much smaller number of parameters, as when $\Sigma=\sigma^{2}I$. Other examples include Gaussian processes or Gaussian Markov random fields, where $\Sigma$ is a covariance function based on fixed data properties with typically only 1-3 parameters that need to be estimated. In the latter case, we expect a relatively modest number of clusters $J$ to be sufficient to provide good results, as AXE performs best when $\Sigma$ is well-estimated with a narrow posterior density, which occurs more often when $J\gg$ the number of parameters for $\Sigma$. In the Radon subsets example in Section 4, we found $J\geq 9$ to be large enough for good results in the case of one parameter for $\Sigma$.

Remark 7.

Similarly, we do not assume any specific prior for $\Sigma$ or $\phi$. One of the appeals of Bayesian modeling is its flexibility, and for regression, this often occurs through placing different priors on $\Sigma$, e.g., variable selection or Bayesian penalized splines. There is also a growing preference for half-$t$ priors over the conjugate inverse-gamma (Gelman et al., 2006; Polson et al., 2012). These considerations make it pragmatic to provide proofs for unspecified $f_{\phi}(\phi)$ and $f_{\Sigma}(\Sigma)$.

As the AXE estimate is a straightforward function of $\Sigma$ and $\phi$, we can use the result of Theorem 2.1 above to derive an overall error bound for $E(X_{\mathpzc{s}_{j}}\beta|Y_{-\mathpzc{s}_{j}},\hat{\Sigma},\hat{\phi})$.

Corollary 2.2.

Let $\hat{\Sigma}$ denote the full-data posterior mean, $E(\Sigma|Y)$, and $\tilde{\Sigma}$ the CV posterior mean over the training data, $E(\Sigma|Y_{-\mathpzc{s}_{j}})$, for CV fold $j$. Note that $X_{\mathpzc{s}_{j}}$ is made up of identical rows and $x_{\mathpzc{s}_{j}}^{\prime}$ refers to any row of $X_{\mathpzc{s}_{j}}$. Under the same conditions as Theorem 2.1, $E(x_{\mathpzc{s}_{j}}^{\prime}\beta|Y_{-\mathpzc{s}_{j}})=E(x_{\mathpzc{s}_{j}}^{\prime}\beta|Y_{-\mathpzc{s}_{j}},\hat{\Sigma},\hat{\phi})\left(1+\mathcal{O}_{P}(n_{j}/N)\right)$ as $J$ goes to infinity.

For proof, see Appendix A.

As the conditional expectation approximates the posterior predictive mean with error rate $\mathcal{O}_{P}(n_{j}/N)$, AXE approximates the posterior predictive mean with overall error rate $\mathcal{O}_{P}(n_{j}/N)$. In the LMM case, we have generally found that the accuracy of AXE largely depends on the accuracy of using the posterior means of the variance parameters as plug-in estimators. In the GLMM case, AXE introduces the same approximations into the IWLS algorithm.

2.5 Finite sample performance

In general, when $\Sigma$ and $\phi$ are estimated with narrow posterior densities, $E(\Sigma|Y_{-\mathpzc{s}_{j}})$ and $E(\phi|Y_{-\mathpzc{s}_{j}})$ will not deviate much from their full-data posterior estimates and AXE will be accurate. This is also the setting in which LCO-CV is most expensive, as it corresponds to a large number of clusters, and hence where AXE provides the most benefit. In our experience, even when posterior densities are wide, AXE can perform well (e.g., the Radon subsets and Eight schools examples; see Figure 1). The cases where AXE is inaccurate are typically those with a low number of clusters or severe data imbalance, such that removal of the test data severely impacts the estimation of $\Sigma$ or $\phi$.

Intuitively, when the number of clusters is low, running manual cross-validation (MCV) may not be burdensome. If additional assurance is needed, we suggest running MCV on a randomly selected subset of CV folds, stratified by cluster. By comparing the MCV point estimates to the AXE approximation, we can determine how similar the conclusions derived from the AXE approximation are to those using the MCV estimates. One common model evaluation criterion is the root mean square error (RMSE), where the error is the difference between $\hat{Y}_{\mathpzc{s}_{j}}^{(\text{MCV})}=E(Y_{\mathpzc{s}_{j}}|Y_{-\mathpzc{s}_{j}})$ and $Y_{\mathpzc{s}_{j}}$. We take the log ratio of the AXE-approximated RMSE to the ground-truth MCV RMSE,

$$\text{LRR}_{j}=\log\left(\frac{\sum_{i=1}^{n_{j}}(\hat{Y}_{i}^{(\text{AXE})}-Y_{i})^{2}}{\sum_{i=1}^{n_{j}}(\hat{Y}_{i}^{(\text{MCV})}-Y_{i})^{2}}\right), \qquad (9)$$

where $\text{LRR}_{j}$ is calculated separately for each chosen CV fold $j$ to obtain more fine-grained comparisons. A low $|\text{LRR}|$ represents high similarity between the approximate RMSE and the ground-truth MCV RMSE. We use LRR as our criterion rather than $\sum(\hat{Y}^{\text{AXE}}-\hat{Y}^{\text{MCV}})^{2}$, because the magnitude of the latter depends on the standard error for $\hat{Y}^{\text{AXE}}$; it is possible that the approximate RMSE is close to the true RMSE while $\sum(\hat{Y}^{\text{AXE}}-\hat{Y}^{\text{MCV}})^{2}$ is large. We also use LRR to compare AXE approximations to other LCO methods in Section 4 by replacing $\hat{Y}_{i}^{(\text{AXE})}$ in (9) with the alternative LCO method's point estimate for $Y_{i}$.

If exchangeability is assumed between clusters, then the sample mean and variance of the $\text{LRR}_{j}$'s can provide inference for the expected accuracy of the approximation. When the variance parameters $\Sigma$ and $\phi$ are not well-estimated, they are less likely to be consistent across CV folds; in these cases, we expect a large standard deviation among the AXE $\text{LRR}_{j}$'s across CV folds due to the instability of the variance parameter posterior means. In practice, we recommend selecting those clusters with particularly large $n_{j}$, or whose estimated random effects are more extreme, and if the mean or standard deviation of the $\text{LRR}_{j}$'s is over some threshold $\delta$, we recommend running MCV. The choice of $\delta$ represents the tolerable degree of error in approximating the MCV RMSE. For our examples, we use $\delta=0.25$ as our threshold, which translates to a ratio of AXE RMSE to MCV RMSE between 0.78 and 1.28. We refer to this additional validation step as AXE+ in our calculations of time cost in Section 4.
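A minimal sketch of this diagnostic, assuming AXE and MCV predictions are available for a handful of stratified folds (vector inputs per fold; the function names are ours):

```r
# LRR for one CV fold, as in (9).
lrr_fold <- function(y, yhat_axe, yhat_mcv) {
  log(sum((yhat_axe - y)^2) / sum((yhat_mcv - y)^2))
}

# AXE+ check over the MCV-validated folds: flag for full MCV if the magnitude
# of the mean LRR or the SD of the LRRs exceeds the tolerance delta = 0.25.
axe_plus_flag <- function(lrr_values, delta = 0.25) {
  abs(mean(lrr_values)) > delta || sd(lrr_values) > delta
}
```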

3 Existing CV approximation methods

We compare AXE to alternative CV approximation methods, which address the LCO-CV challenges described in Section 2.3 in different ways: ghosting (GHOST) (Marshall and Spiegelhalter, 2003), integrated importance sampling (iIS) (Li et al., 2016; Vanhatalo et al., 2013; Vehtari et al., 2016), and likelihood-based linear approximations (Jaeckel, 1972; Giordano et al., 2019; Rad and Maleki, 2020). All LCO methods are summarized in Table 2, and we briefly review them in Appendix B.

Table 1: Reference table of acronyms used
Acronym Method
AXE Approximate cross-validated mean estimates
GHOST Ghosting
iIS Integrated importance sampling
Vehtari Vehtari’s method
IJ-C Infinitesimal jackknife, integrated over held-out $\theta_{j}$
IJ-A Infinitesimal jackknife, integrated over all $\theta$
NS-C Newton-Raphson, integrated over held-out $\theta_{j}$
NS-A Newton-Raphson, integrated over all $\theta$

Table 2 summarizes the differences among the LCO methods, including distributional assumptions, whether they are biased, and computational complexity. While AXE relies only on the expected values of $\Sigma$ and $\phi$ being similar between the training data and the full data, GHOST and the iIS methods rely on the densities being the same or similar. Note that GHOST has the strongest assumptions, but if $E(\theta_{j}|\theta_{-j},\Sigma)=0$, the sole assumption it relies on is that $E(\beta_{/\theta}|Y)=E(\beta_{/\theta}|Y_{-\mathpzc{s}_{j}})$. However, GHOST produces biased estimates, while AXE and iIS do not.

Table 2: Posterior distribution assumptions and computational complexity of approximating $E(Y_{\mathpzc{s}_{j}}|Y_{-\mathpzc{s}_{j}})$ for each LCO method. The cost of Gibbs sampling for the equivalent MCV problem is $\mathcal{O}(M(N^{3}P+NP^{2}+P^{3}))$, where $N=$ total number of data points $Y$, $P=$ number of coefficients $\beta$, $P_{\Xi}=$ total number of parameters in the model, $M=$ number of MC samples, and $n_{j}=$ size of the test data for the $j^{\text{th}}$ CV fold.
Method | $f(\Sigma,\phi|Y)$ vs $f(\Sigma,\phi|Y_{-\mathpzc{s}_{j}})$ | $f(\beta_{/\theta}|Y)$ vs $f(\beta_{/\theta}|Y_{-\mathpzc{s}_{j}})$ | Bias | Time
AXE | $E(\Sigma,\phi|Y)=E(\Sigma,\phi|Y_{-\mathpzc{s}_{j}})$ | N/A | No | $\mathcal{O}(N^{2}P^{2}+JP^{3})$
GHOST | $f(\Sigma,\phi|Y)=f(\Sigma,\phi|Y_{-\mathpzc{s}_{j}})$ | $f(\beta_{/\theta}|Y)=f(\beta_{/\theta}|Y_{-\mathpzc{s}_{j}})$ | Yes | $\mathcal{O}(MJP^{3}+N)$
iIS | $f(\Sigma,\phi|Y)\approx f(\Sigma,\phi|Y_{-\mathpzc{s}_{j}})$ | $f(\beta_{/\theta}|Y)\approx f(\beta_{/\theta}|Y_{-\mathpzc{s}_{j}})$ | No | $\mathcal{O}(MJP^{3}+MN)$
Vehtari | $f(\Sigma,\phi|Y)\approx f(\Sigma,\phi|Y_{-\mathpzc{s}_{j}})$ | N/A | No | $\mathcal{O}(M(N^{2}P^{2}+JP^{3}))$
IJ | N/A | N/A | No | $\mathcal{O}(P_{\Xi}^{3}+P_{\Xi}^{2}N+NP)$
NS | N/A | N/A | No | $\mathcal{O}(JP_{\Xi}^{3}+NP)$

The derivations of the computational complexities are provided in Appendix C. Manual cross-validation (MCV), when using Gibbs sampling of the same problem (if available), is the most expensive, with complexity $\mathcal{O}(M(N^{3}P+NP^{2}+P^{3}))$, where $M$ is the number of posterior samples. However, many of the examples in Section 4 were fit using STAN, which uses Hamiltonian Monte Carlo (Girolami and Calderhead, 2011; Carpenter et al., 2017): instead of the proposal distribution being a Gaussian random walk, proposal samples are generated along the gradient of the joint density. This allows for more efficient sampling and much shorter run times. In our examples, Vehtari was often the method that took the longest to run, not MCV. iIS and GHOST can also be computationally expensive, typically due to the inversion of $\Sigma$ that is necessary to obtain $E(\theta_{j}|\theta_{-j},\Sigma)$, which contributes the $\mathcal{O}(P^{3})$ term. This cost can be greatly reduced, to $\mathcal{O}(MNJ)$, if $E(\theta_{j}|\theta_{-j},\Sigma)=0$ and $\Sigma=\sigma^{2}I$ for some hyperparameter $\sigma$, which is the case when the $\theta_{j}$, $j=1,\dots,J$, are independent and identically distributed. This brings both methods closer to AXE in terms of computing cost; however, as we show in Section 4, AXE is generally both fast and robust. IJ and NS were typically the two fastest methods in our examples; however, the linear approximation can lead to less accurate estimates in smaller data sets.

4 Results

We use publicly available data to compare AXE and the LCO methods described in Section 3 to manual cross-validation (MCV). Table 3 is a high-level summary of the data sets and models used in our examples. The first three examples are linear mixed models (LMMs); the last three are generalized linear mixed models (GLMMs). The first four examples include models where $\theta$ are independent and identically distributed (i.i.d.) and $\Sigma$ is a diagonal matrix, while $\theta$ in the last two examples are Gaussian Markov random fields (GMRFs). Three of our examples consist of balanced data sets (Eight schools, SLC, and SRD), while the remainder are imbalanced. Code is available online as part of the supplemental materials.

Table 3: Summary of data set and model properties. $P=$ dimension of $\beta$, $J=$ number of CV folds and dimension of $\theta$, $N=$ dimension of the response vector $Y$, $n_{j}=$ size of the test data in a CV fold.
Example | Model | $\theta$ | $P$ | $J$ | $N$ | Max $n_{j}$
Eight schools | LMM | i.i.d. | 9 | 8 | 8 | 1
Radon | LMM | i.i.d. | 86-88 | 85 | 919 | 116
Radon subsets | LMM | i.i.d. | 4-15 | 3-12 | 59-100 | 23
Esports (ESP) | GLMM | i.i.d. | 197 | 73 | 2160 | 56
Lip cancer (SLC) | GLMM | GMRF | 58 | 56 | 56 | 1
Respiratory disease (SRD) | GLMM | GMRF | 1359 | 271 | 1355 | 5

Detailed descriptions of the data and models are in Appendix D. Here, we briefly describe the first three examples, which include multiple transformations of the data and/or multiple models. The Eight schools example includes transformations of the response designed to reduce information borrowing as a data scaling factor, $\alpha$, increases. There are 40 unique values of $\alpha$, and all results are over the 40 different transformed data sets. The data can be found directly in the original paper (Rubin, 1981). The Radon example includes three different models applied to the same data set (Goodrich et al., 2018; Gelman and Hill, 2007). The Radon subsets example uses the same set of three models and consists of multiple data sets designed to examine the performance of AXE and other LCO methods under a varying number of clusters or percentages of test data. Using the Radon data, we fixed a particular cluster as the test data and randomly selected $J-1$ other clusters as the training data, $J\in\{3,4,6,9,12\}$, such that the size of the test data was some proportion $\delta$ of the full data, $n_{j}=\delta N$, $\delta\in\{0.3,0.4,0.5,0.6,0.7\}$. For each combination of $J$ and $\delta$, multiple subsets were selected and evaluated, for a total of 1,475 simulated subsets. Results for each example are aggregated over all data sets and models.

The Radon data are available under a GPL ($\geq 3$) license. ESP is publicly available under a CC-BY-SA 3.0 license; the data we used are included in the supplemental material, along with the Radon subsets. SLC and SRD are available under a GPL-3 license (Lee et al., 2018).

4.1 LRR

We compare AXE to the CV approximation methods described in Section 3 by calculating $\text{LRR}_{j}$ for each method and CV fold $j$, as in (9). For IJ and NS, we calculate the MCV RMSE using the posterior mode.

Figure 1 shows the proportion of CV folds with $|\text{LRR}|\leq x$ against different cutoff values, $x$. We call these line plots “LRR percentage curves”. A perfectly accurate method would have 100% of $|\text{LRR}|$s $=0$. An $|\text{LRR}|$ larger than $\log(2)$ corresponds to an approximate RMSE that is over 100% different from the ground-truth MCV RMSE, making it a poor approximation; thus we focus on $x\in[0,\log(2)]$ on the x-axis. Figure 1 presents the three top-performing methods: AXE, GHOST, and iIS. Comparisons across all methods are provided in Appendix E. AXE is consistently one of the better-performing methods across our examples, which is not the case for any other method. We examine each panel in detail below.

Figure 1: Line plots of the proportion of CV folds $j$ with $|\text{LRR}|_{j}\leq x$, for $x\in[0,\log(2)]$ on the x-axis. We refer to each line as an LRR percentage curve. Curves are colored and shaped based on the method used and are truncated at $\log(2)\approx 0.7$. GHOST is only applicable to the LMMs in examples A, B, and C.

Figure 1A shows an example where $n_{j}=1$ and LCO-CV is the same as LOO-CV. AXE and both iIS methods outperform all others, with roughly 70% of $|\text{LRR}|\leq 0.1$ compared to GHOST's 25%. As GHOST assumes $f(\beta_{/\theta}|Y)=f(\beta_{/\theta}|Y_{-\mathpzc{s}_{j}})$, the GHOST mean estimate for this example is $\bar{Y}$, resulting in larger LRRs, while AXE and the iIS methods have mean estimate $\bar{Y}_{-\mathpzc{s}_{j}}$ (see, e.g., Eight schools in Figure 2B).

Figure 1B contains results for the full Radon data, which is a relatively rich data set compared to the number of parameters to estimate. As such, AXE, iIS, and GHOST all have over 97% of $|\text{LRR}|$s under 0.1. A few CV folds do have relatively large $|\text{LRR}|$; these correspond to the same two CV folds across all LCO methods and are cases where the data are particularly informative for the random effect variance.

Figure 1C contains results for the Radon subsets data, where the training data are a randomly selected subset of counties. AXE is the best-performing method, with over 99% of $|\text{LRR}|$s $\leq 0.1$. GHOST and iIS perform worse, with 71% and 39% of $|\text{LRR}|$s $\leq 0.1$, respectively. Each of those three methods assumes $f(\beta_{/\theta}|Y)$ and $f(\beta_{/\theta}|Y_{-\mathpzc{s}_{j}})$ are similar, but the low number of counties in the training data means that estimation of the fixed-effect coefficient that describes the county-level heterogeneity can be quite different under cross-validation. We also note that, in a closer examination of these results, we found that GHOST tended to underestimate RMSE for the Radon subsets data, while iIS overestimated RMSE. As GHOST uses the full-data posteriors directly, it follows that it produces over-optimistic results.

In Figure 1D, iIS is able to produce stable estimates because it incorporates information from $\beta_{/\theta}|Y$, but it is outperformed by AXE, which has a nearly perfect LRR percentage curve. The good performance of AXE suggests that posterior expectations for $\beta_{/\theta}$ and $\Sigma$ were relatively stable across CV folds. In panel E, LRRs are in general larger for all methods than in the preceding panels (see the LRR summary table in Appendix F). This is partly due to small sample sizes ($n_{j}=1$). In panel F, AXE yields satisfactory results, with 92% of $|\text{LRR}|$s $\leq 0.1$, while the remaining LCO approximation methods perform poorly.

We summarize each method's performance using the area under the LRR percentage curves (AUC-LRRP) of Figure 1. An ideal method would have an AUC-LRRP of $\log(2)$, and a method with all $|\text{LRR}|$s $>\log(2)$ would have an AUC-LRRP of 0. We normalize AUC-LRRP and divide its value by $\log(2)$ so that the maximum is 1 and the minimum is 0. Table 4 gives AUC-LRRP$/\log(2)$ for each method and example. Further summary metrics for each method and example, consisting of the mean LRR and the standard deviation of LRR, are provided in Appendix F.
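Under our reading of this metric, the normalized AUC-LRRP is the area under the empirical curve $x\mapsto$ proportion of folds with $|\text{LRR}|\leq x$ over $[0,\log(2)]$, divided by $\log(2)$; for such a step function this reduces to the closed form sketched below (our own derivation, not code from the supplemental materials).

```r
# Normalized AUC-LRRP from a vector of per-fold LRR values: area under the
# empirical proportion curve on [0, log(2)], divided by log(2). Each fold
# contributes max(0, 1 - |LRR_j| / log(2)) to the average.
auc_lrrp <- function(lrr_values) {
  mean(pmax(0, 1 - abs(lrr_values) / log(2)))
}
```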

Table 4: Area under the LRR percentage curves of Figure 1, normalized by $\log(2)$ to have a maximum of 1. Methods with the highest AUC-LRRP are bolded.
AXE GHOST Vehtari iIS IJ-C IJ-A NS-C NS-A
A) Eight schools 0.80 0.72 0.16 0.82 0.17 0.27 0.12 0.11
B) Radon 0.98 0.98 0.65 0.98 0.70 0.88 0.71 0.47
C) Radon subsets 0.98 0.88 0.83 0.76 0.82 0.56 0.53 0.32
D) ESP 1.00 ** 0.92 0.76 0.10 0.71 0.17 0.26
E) SLC 0.67 ** 0.35 0.52 ** ** 0.19 0.34
F) SRD 0.94 ** * 0.33 ** ** 0.12 0.09
* Excluded due to computation time.
** Method does not apply.

Table 4 shows that AXE most frequently has the highest AUC-LRRP, except for Eight schools, where it has the second highest. The main weakness of Vehtari and iIS is their computational expense. They are the most expensive of all the methods and, in some of our examples, require more time than manual cross-validation, as discussed in Section 4.3.

Both NS and IJ performed worse than the majority of the other CV approximation methods. We note that our examples have fewer observations relative to the number of model parameters than is typical for NS or IJ, which are usually applied to data sets large enough that cross-validation using standard optimization methods is prohibitively expensive. In such situations, the large amount of data typically means that posterior densities are very narrow and $f(\beta_{/\theta},\Sigma,\phi|Y)$ is more similar to $f(\beta_{/\theta},\Sigma,\phi|Y_{-\mathpzc{s}_{j}})$. When there are fewer observations and, subsequently, more uncertainty in the model, a small change in the weights $w$ can result in a larger change in the log-likelihood. Both methods are often limited to LOO-CV, as in Wang et al. (2018), Rad and Maleki (2020), Giordano et al. (2019), and Stephenson and Broderick (2020), which typically results in smaller changes to the log-likelihood across CV folds. We note that all of the CV approximation methods included in this paper improve as the amount of data increases; the main advantage of IJ and NS is that their computing times may scale more easily to LOO-CV with large data sets.

4.2 Point-by-point comparisons

We provide graphical point-by-point comparisons of $\hat{Y}^{(\text{method})}$ to the ground-truth cross-validated $\hat{Y}^{(\text{MCV})}$. Figure 2A contains scatter plots of the AXE approximation $\hat{Y}$ for each data point against the actual MCV values for all examples. The vast majority of points lie on or near the 45-degree line. In many cases, the AXE approximation is point-by-point equivalent to the MCV estimates. This is in contrast to the scatter plots in Figure 2B, which contain AXE approximations as well as GHOST and iIS, both of which performed relatively well based on Figure 1. iIS has a higher variance than AXE for the Radon subsets, ESP, and SRD data sets, with approximations farther from the 45-degree line. GHOST has a higher variance in the Eight schools and Radon subsets data sets. All three methods perform similarly in the Radon data sets.

Figure 2: Scatter plots comparing each LCO approximation to the ground-truth MCV estimate for each data point, model, and data set. Panels in row A compare the AXE approximation $\hat{Y}_{ji}^{\text{AXE}}$ against the MCV estimate of $E(Y_{ji}|Y_{-\mathpzc{s}_{j}})$. Panels in row B add points for the GHOST (pink triangle) and iIS (green square) approximations whenever applicable, along with AXE (black circle). Each point in a grid represents one point in the corresponding data set(s) and model(s).

We found that the accuracy of AXE was not particularly impacted as the proportion of test data increased, but it improved greatly as the number of clusters increased. For instance, some of the largest deviations from the 45-degree line are in row B for the Radon subsets data, where the number of clusters ranges from 3 to 12. This is intuitive, as the estimation of $\Sigma$ depends largely on the cluster-specific intercepts, so when there are fewer clusters, $E(\Sigma|Y_{-\mathpzc{s}_{j}})$ is less stable across CV folds.

4.3 Computation time

Table 5 provides the total computation time on an Intel i7-8700k CPU with 40 GB of memory, which was used to run examples in parallel; note that a standard 8 GB of memory would be sufficient for the examples. Except for the SRD data, models were fit using STAN or the rstanarm package, with four chains of 2000 samples and a 1000-sample burn-in, resulting in 4000 total samples. For the SRD data, the CAR.ar() function from the CARBayesST package was used, with four chains of 220,000 samples each, a 20,000-sample burn-in, and thinning to produce 4000 total posterior samples. Calculations for IJ were conducted in Python using the autograd package. AXE took a reasonable amount of time for all examples: a few seconds for Eight schools, Radon, Radon subsets, and SLC, about one minute for ESP, and 10 minutes for SRD. To date, Vehtari has been used only in the LOO-CV case (Vehtari et al., 2016; Vanhatalo et al., 2013). When CV folds contained multiple data points, Vehtari took much longer than the other approximation methods.

Table 5: Total computing time for each method in seconds, excluding time to fit the full data. Times with “h” are in hours. Times to fit the model to the full data are included for comparison.
AXE GHOST Vehtari iIS IJ-C IJ-A NS-C NS-A MCV
A) Eight schools 0.7 1.5 1.3h 0.6h 0.8 0.3 4.1 1.2 0.3h
B) Radon 9.1 0.5 14.5h 476.8 2.0 0.1 100.5 2.1 1.1h
C) Radon subsets 1.3 90.4 6.2h 3.6h 118.1 45.7 82.7 34.3 15.7h
D) ESP 67.7 ** 55.9h 570.0 3.1 1.5 120.5 59.2 4.4h
E) SLC 0.1 ** 2.8h 738.2 ** ** 16.6 1.4 693.8
F) SRD 624.6 ** * 126.0h ** ** 24.5h 13.5h 40.6h
*: Method excluded due to computation time. **: Method does not apply.

5 Discussion

AXE is a fast and stable approximation method for obtaining cross-validated mean estimates. AXE is equivalent to a Frequentist cross-validation that uses the posterior means $\hat{\Sigma}$ and $\hat{\phi}$ as plug-in estimates, and it benefits from any subsequent computational advances. In addition to linear models, the proposed approximation is applicable to all generalized linear models that use an iterative weighted least squares (IWLS) algorithm; however, the accuracy of the approximation will also depend on the accuracy of IWLS. We show that AXE is more accurate when the number of CV folds is large, which is also when it saves the most time. When the variance parameters are not well-estimated, we recommend an approximation diagnostic that runs MCV for a small sample of folds and checks whether the mean and standard deviation of the LRRs are low.

Our empirical results show that AXE consistently performed better than more computationally expensive LCO methods. This is because AXE relies on weaker assumptions than many competing methods, making it more robust. Ghosting estimates often had absolute LRR $>0.25$ when 1) the test data were critical for estimating $\beta_{/\theta}$ and $E(\beta_{/\theta}|Y)\neq E(\beta_{/\theta}|Y_{-\mathpzc{s}_{j}})$, or 2) the $\theta_{j}$ were not i.i.d., with $E(\theta_{j}|\theta_{-j},\Sigma)\neq 0$. The latter scenario is also when ghosting may be more computationally expensive than MCV if the test-data size $n_{j}>1$. Importance sampling methods typically required longer computation times than either ghosting or AXE, and they performed worse when the training-data posterior densities differed from the full-data posterior densities, resulting in high variance. Although importance sampling has weaker assumptions than ghosting, it may also have higher variance, and there are multiple instances in our examples where ghosting outperforms importance sampling. IJ and NS were fast but consistently performed worse than the other methods. In addition, IJ only applies to models where the $Y_{i}$ are conditionally independent or $\theta$ are discrete and finite, and thus could not be applied to the SLC and SRD examples in this paper.

Another category of LCO-CV approximation methods for Bayesian models in the literature is integrated or marginal information criteria, the most notable being the integrated Watanabe-Akaike information criterion (Li et al., 2016; Merkle et al., 2019). Integrated information criteria approximate the expected log predictive density (ELPD), $\sum_{j=1}^{J}\log f(Y_{\mathpzc{s}_{j}}|Y_{-\mathpzc{s}_{j}})$. They do not produce estimates of $E(Y_{\mathpzc{s}_{j}}|Y_{-\mathpzc{s}_{j}})$ and so are omitted from the CV comparisons here.

Acknowledgments

This work was supported by the National Institutes of Health (NIH) under grants R56AI120812-01A1, R01AI136664, R01AI170249, and R01HL158963. The authors report there are no competing interests to declare.

Supplementary Materials

ReadMe:

File describing contents of supplementary materials (readme.md).

Appendix:

Appendices containing proof of Theorem 1, descriptions of data used, computational complexity calculations, and other supplemental information referenced in the article (AXE_appendix.pdf)

AXE examples:

Contains all data sets and codes to re-create the examples in the article. The directory includes a second readme file which describes all files in the code. (AXEexamples, directory).

References

  • Arlot et al. (2010) Arlot, S., A. Celisse, et al. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys 4, 40–79.
  • Bardenet et al. (2017) Bardenet, R., A. Doucet, and C. Holmes (2017). On Markov Chain Monte Carlo methods for tall data. The Journal of Machine Learning Research 18(1), 1515–1557.
  • Breslow and Clayton (1993) Breslow, N. E. and D. G. Clayton (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88(421), 9–25.
  • Carpenter et al. (2017) Carpenter, B., A. Gelman, M. D. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo, P. Li, and A. Riddell (2017). Stan: A probabilistic programming language. Journal of Statistical Software 76(1), 1–32.
  • Gelfand et al. (1992) Gelfand, A. E., D. K. Dey, and H. Chang (1992). Model determination using predictive distributions with implementation via sampling-based methods. Technical report, Stanford Univ. CA Dept. of Statistics.
  • Gelman et al. (2006) Gelman, A. et al. (2006). Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis 1(3), 515–534.
  • Gelman and Hill (2007) Gelman, A. and J. Hill (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models, Volume 1. Cambridge University Press New York, NY, USA.
  • Giordano et al. (2019) Giordano, R., W. Stephenson, R. Liu, M. Jordan, and T. Broderick (2019). A swiss army infinitesimal jackknife. In The 22nd International Conference on Artificial Intelligence and Statistics, pp.  1139–1147. PMLR.
  • Girolami and Calderhead (2011) Girolami, M. and B. Calderhead (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(2), 123–214.
  • Goodrich et al. (2018) Goodrich, B., J. Gabry, I. Ali, and S. Brilleman (2018). rstanarm: Bayesian applied regression modeling via Stan. R package version 2(4), 1758.
  • Green (1984) Green, P. J. (1984). Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society: Series B (Methodological) 46(2), 149–170.
  • Harville (1977) Harville, D. A. (1977). Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association 72(358), 320–338.
  • Holland and Welsch (1977) Holland, P. W. and R. E. Welsch (1977). Robust regression using iteratively reweighted least-squares. Communications in Statistics-Theory and Methods 6(9), 813–827.
  • Jaeckel (1972) Jaeckel, L. A. (1972). The infinitesimal jackknife. Bell Telephone Laboratories.
  • Kass and Steffey (1989) Kass, R. E. and D. Steffey (1989). Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). Journal of the American Statistical Association 84(407), 717–726.
  • Kingma and Welling (2013) Kingma, D. P. and M. Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • Korattikara et al. (2014) Korattikara, A., Y. Chen, and M. Welling (2014). Austerity in MCMC land: Cutting the Metropolis-Hastings budget. In International Conference on Machine Learning, pp.  181–189.
  • Lee et al. (2018) Lee, D., A. Rushworth, and G. Napier (2018). Spatio-temporal areal unit modelling in R with conditional autoregressive priors using the CARBayesST package. Journal of Statistical Software 84(9), 1–39.
  • Lewis and Raftery (1997) Lewis, S. M. and A. E. Raftery (1997). Estimating Bayes factors via posterior simulation with the Laplace-Metropolis estimator. Journal of the American Statistical Association 92(438), 648–655.
  • Li et al. (2016) Li, L., S. Qiu, B. Zhang, and C. X. Feng (2016). Approximating cross-validatory predictive evaluation in Bayesian latent variable models with integrated IS and WAIC. Statistics and Computing 26(4), 881–897.
  • Marshall and Spiegelhalter (2003) Marshall, E. and D. Spiegelhalter (2003). Approximate cross-validatory predictive checks in disease mapping models. Statistics in Medicine 22(10), 1649–1660.
  • Merkle et al. (2019) Merkle, E. C., D. Furr, and S. Rabe-Hesketh (2019). Bayesian comparison of latent variable models: Conditional versus marginal likelihoods. Psychometrika 84(3), 802–829.
  • Opsomer et al. (2001) Opsomer, J., Y. Wang, and Y. Yang (2001). Nonparametric regression with correlated errors. Statistical Science 16(2), 134–153.
  • Polson et al. (2012) Polson, N. G., J. G. Scott, et al. (2012). On the half-Cauchy prior for a global scale parameter. Bayesian Analysis 7(4), 887–902.
  • Quiroz et al. (2019) Quiroz, M., R. Kohn, M. Villani, and M.-N. Tran (2019). Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association 114(526), 831–843.
  • Rad and Maleki (2020) Rad, K. R. and A. Maleki (2020). A scalable estimate of the out-of-sample prediction error via approximate leave-one-out cross-validation. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82(4), 965–996.
  • Rubin (1981) Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics 6(4), 377–401.
  • Rue et al. (2009) Rue, H., S. Martino, and N. Chopin (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71(2), 319–392.
  • Stephenson and Broderick (2020) Stephenson, W. and T. Broderick (2020). Approximate cross-validation in high dimensions with guarantees. In International Conference on Artificial Intelligence and Statistics, pp.  2424–2434. PMLR.
  • Vanhatalo et al. (2013) Vanhatalo, J., J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari (2013). GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research 14(Apr), 1175–1179.
  • Vehtari et al. (2016) Vehtari, A., T. Mononen, V. Tolvanen, T. Sivula, and O. Winther (2016). Bayesian leave-one-out cross-validation approximations for Gaussian latent variable models. The Journal of Machine Learning Research 17(1), 3581–3618.
  • Wang et al. (2018) Wang, S., W. Zhou, H. Lu, A. Maleki, and V. Mirrokni (2018). Approximate leave-one-out for fast parameter tuning in high dimensions. In International Conference on Machine Learning, pp.  5228–5237. PMLR.