
Bayesian tolerance regions, with an application to linear mixed models

X. Gregory Chen and A.W. van der Vaart
Abstract

We review and contrast frequentist and Bayesian definitions of tolerance regions. We give conditions under which for large samples a Bayesian region also has frequentist validity, and study the latter for smaller samples in a simulation study. We discuss a computational strategy for computing a Bayesian two-sided tolerance interval for a Gaussian future variable, and apply this to the case of possibly unbalanced linear mixed models. We illustrate the method on a quality control experiment from the pharmaceutical industry.

Keywords: quality control, tolerance, linear mixed model

Affiliations: Biostatistics and Research Decision Sciences, MSD; DIAM, Delft Institute of Applied Mathematics. The research leading to these results has received funding from the European Research Council under ERC Grant Agreement 320637 and is partly financed by the NWO Spinoza prize awarded to A.W. van der Vaart by the Netherlands Organisation for Scientific Research (NWO).

1 Introduction

The concept of tolerance region is of central importance in quality control. A tolerance region is a prediction set for a future observation, which takes account of both the random nature of this observation and the uncertainty about its distribution. It incorporates the statistical error of estimation of unknown parameters of this distribution.

Conventional tolerance regions take the uncertainty of estimated parameters into account in one of two ways. Either the region captures the future observation a fraction $1-\alpha$ of times on average over both future and past observations (the $(1-\alpha)$-expectation tolerance region), or the region captures the future observation with probability at least $1-\delta$ with $1-\alpha$ confidence over past observations (the $(\delta,\alpha)$-tolerance region). (We give precise definitions in Section 2.) The second way appears to be preferred in the pharmaceutical industry. The “on average” and “confidence” can refer to the sampling distribution of the data in a frequentist sense, but can also refer to a posterior distribution in the Bayesian statistical framework. The main focus of the present paper is the second, but we do relate it to the frequentist setup.

Frequentist tolerance regions have been well studied in the literature. A general reference is the book [8], especially for the situation that the data are i.i.d. For the linear mixed model (LMM), the paper [14] provides an elegant solution to build one- and two-sided $(\delta,\alpha)$-tolerance intervals, and includes a comprehensive review of the literature. One purpose of the present paper is to provide a Bayesian approach for the general LMM.

The Bayesian formulation of tolerance regions dates back to at least 1964, but the subsequent literature is relatively small. In his paper [1], Aitchison derived Bayesian $(\delta,\alpha)$-tolerance regions from a decision-theoretic framework, and contrasted them with their frequentist counterparts. In [2] he extended his discussion to $(1-\alpha)$-expectation tolerance regions, which are a natural way to build Bayesian prediction intervals.

A one-sided $(\delta,\alpha)$-tolerance interval for a univariate future observation is usually easy to compute, but two-sided tolerance intervals pose challenges, both conceptually and computationally. There are at least two common approaches: intersecting two one-sided tolerance intervals, or fixing one degree of freedom of the interval (e.g. the midpoint of the interval). The former approach is identical to specifying probability masses in the two tails of the distribution of the future variable separately and gives a valid construction, in view of Bonferroni’s inequality, but it yields longer intervals than necessary. (They are called “anti-conservative” in the pharmaceutical industry in reference to the customers, whereas statisticians use the term “conservative”.) See [5] for an example application. The second approach, fixing one degree of freedom, is the conventional choice, especially in the frequentist framework, but requires untangling the dependence of the interval on the unknown true parameter. Solutions are often not available in analytical form and are computationally more challenging. Wolfinger in [20] proposed an algorithm to derive a two-sided Bayesian interval for a future normal variate, which was refined by Krishnamoorthy and Mathew [8]. Their algorithms have been widely adopted in practice, and also in other literature (e.g. [7], [16], [15]).

Other contributions to the literature include the following: [10] used the empirical Bayes method to construct a one-sided tolerance interval given an i.i.d. sample from a normal distribution; [6] derived a Bayesian tolerance interval that contains a proportion of observations with a specified confidence; [4] and [21] focused on the sample size needed to attain a certain accuracy; [16] and [15] allowed data from the unbalanced one-way random effects model and the balanced two-factor nested random effects model; [11] discussed probability matching priors (PMP) in the one-sided case to ensure second-order frequentist validity; [12] extended this to the two-sided case; [13] incorporated it into a balanced one-way random effects model, and evaluated its performance against the frequentist MLS method in [8].

Although the PMP approach has merit when the sample size is small, it is analytically demanding even when the data are i.i.d., and it seems difficult to extend to the general LMM setting. The algorithms of Wolfinger [20] and Krishnamoorthy and Mathew [8] can be extended to the LMM, but they oversimplify the target function during optimization and may result in less satisfactory performance.

In this paper we propose a computationally efficient solution for the general case that the future observation possesses a normal distribution. We show that this is easy to implement given any data model for which a sample from the posterior distribution is available. We investigate when the shortest interval is centered at the posterior mean of the parameter. We discuss the interval in particular for the linear mixed model, and within this context show its good performance by simulation. We illustrate the method on an example that is representative for pharmaceutical applications. Finally we also prove that the Bayesian interval has frequentist validity in the case of large samples.

2 Definitions and setup

Given are observed data $X$, with a distribution $P_\theta$ depending on a parameter $\theta$, and future unobserved “data” $Z$, with a distribution $Q_\theta$ depending on the same parameter $\theta$. In both cases the sample space is arbitrary. A tolerance region is a set $\mathcal{R}(X)$ in the sample space of $Z$ that captures $Z$ with a “prescribed probability”. It will typically be constructed using the observation $X$ to overcome the problem that $\theta$, and hence the law of $Z$, is unknown. There are various ways to make the “prescribed probability” precise, and these can be divided into frequentist and Bayesian definitions. The probability statement will refer to both $X$ and $Z$, and is fixed by one or two parameters $\alpha$ and $\delta$, which are typically chosen small, e.g. 5%.

The parameter $\theta$ will typically be chosen to identify the distribution of $Z$. The distribution of $X$ may also depend on unknown “nuisance” parameters. For simplicity of notation we do not make this explicit in the following. We shall use the notation $\mathrm{P}$ or $\mathrm{P}_\theta$ for general probability statements, which may be reduced to $P_\theta$ or $Q_\theta$ if the event involves only $X$ or $Z$.

2.1 Frequentist definitions

The most common frequentist definition is the $(\delta,\alpha)$-tolerance region. For a set $R$, abbreviate $Q_\theta(R)=\mathrm{P}_\theta(Z\in R)$. Then $\mathcal{R}(X)$ is a $(\delta,\alpha)$-tolerance region if

\[
P_\theta\bigl(x: Q_\theta(\mathcal{R}(x))\ge 1-\delta\bigr)\ge 1-\alpha,\qquad\forall\theta. \tag{1}
\]

If we let $Q_\theta(\mathcal{R}(X))$ denote the probability of $\mathcal{R}(X)$ under $Q_\theta$, for $X$ held fixed, then we can also write the display in the shorter form $\mathrm{P}_\theta\bigl(Q_\theta(\mathcal{R}(X))\ge 1-\delta\bigr)\ge 1-\alpha$, where the outer probability $\mathrm{P}_\theta$ refers to $X$, and the inequality must hold for all possible values of the parameter $\theta$. The latter reminds us of the definition of confidence sets, and indeed it can be seen that $\mathcal{R}(X)$ is a frequentist $(\delta,\alpha)$-tolerance region if and only if the set $\mathcal{C}(X)=\{\theta: Q_\theta(\mathcal{R}(X))\ge 1-\delta\}$ is a confidence set for $\theta$ of confidence level $1-\alpha$.

An alternative is the $\alpha$-expectation tolerance region, which requires that

\[
\int Q_\theta\bigl(\mathcal{R}(x)\bigr)\,dP_\theta(x)\ge 1-\alpha,\qquad\forall\theta. \tag{2}
\]

With the notational convention as before, the display can be written in the shorter form $\mathrm{E}_\theta Q_\theta(\mathcal{R}(X))\ge 1-\alpha$, which is again required for all possible parameter values.

Both definitions have the form of requiring that $\mathrm{E}_\theta\,\ell\bigl[Q_\theta(\mathcal{R}(X))\bigr]\ge 1-\alpha$, for all $\theta$, and some given loss function $\ell$. In the two cases this loss function is given by $\ell(q)=1\{q\ge 1-\delta\}$ for (1), and $\ell(q)=q$ for (2), respectively, where $1\{A\}$ is the indicator function of a set $A$.

2.2 Bayesian definitions

In the Bayesian setup the parameter $\theta$ is generated from a prior distribution $\Pi$, and the densities $p_\theta$ and $q_\theta$ are the conditional densities of $X$ and $Z$ given $\theta$, respectively. To proceed, it is necessary to make further assumptions that fix the joint law of $(\theta,X,Z)$. The typical assumption is that $X$ and $Z$ are independent given $\theta$.

A natural Bayesian approach is to refer to the predictive distribution of $Z$, and define a tolerance region $\mathcal{R}(X)$ to be a set such that $\mathrm{P}\bigl(Z\in\mathcal{R}(X)\mid X\bigr)\ge 1-\alpha$, i.e. a credible set in the posterior law of $Z$ given $X$. The inequality can be written in terms of the posterior distribution $\Pi(\cdot\mid X)$ of $\theta$ given $X$ as

\[
\int \mathrm{P}\bigl(Z\in\mathcal{R}(X)\mid X,\theta\bigr)\,d\Pi(\theta\mid X)\ge 1-\alpha.
\]

Under the conditional independence assumption this becomes

\[
\int Q_\theta\bigl(\mathcal{R}(X)\bigr)\,d\Pi(\theta\mid X)\ge 1-\alpha. \tag{3}
\]

This is like a frequentist $\alpha$-expectation tolerance region (2), but with the expectation with respect to $X$ under $P_\theta$ replaced by the expectation with respect to $\theta$ under the posterior distribution.

An alternative, derived from a utility analysis by Aitchison [1], is the Bayesian $(\delta,\alpha)$-tolerance region, which is a set $\mathcal{R}(X)$ such that

\[
\Pi\Bigl(\theta: Q_\theta\bigl(\mathcal{R}(X)\bigr)\ge 1-\delta\mid X\Bigr)\ge 1-\alpha. \tag{4}
\]

This may be compared to (1). We can also say that $\mathcal{R}(X)$ is a Bayesian $(\delta,\alpha)$-tolerance region if and only if the set $\mathcal{C}(X)=\{\theta: Q_\theta(\mathcal{R}(X))\ge 1-\delta\}$ is a credible set at level $1-\alpha$.

Both types of Bayesian regions satisfy $\int \ell\bigl[Q_\theta(\mathcal{R}(X))\bigr]\,d\Pi(\theta\mid X)\ge 1-\alpha$, for the appropriate loss function $\ell$. Solving the region $\mathcal{R}(X)$ from such an equation may seem daunting, but good approximations may be easy to obtain using stochastic simulation. This is true even for complicated data models, as long as one is able to generate a sample from the posterior distribution given $X$, for instance by implementing an MCMC procedure. We make this concrete in Section 3.1 for a Gaussian variable $Z$, and illustrate this in Section 4 for an unbalanced linear mixed model (LMM).
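To illustrate the simulation idea, the following minimal sketch (all names are ours) checks the two Bayesian criteria (3) and (4) for a candidate interval $[L,U]$ and a Gaussian future variable, with the posterior replaced by the empirical distribution of a sample of posterior draws of $(\nu,\tau)$:

```python
import numpy as np
from scipy.stats import norm

def check_bayes_criteria(theta_draws, L, U, alpha=0.05, delta=0.1):
    """Check (3) and (4) for R(X) = [L, U] and Z | theta ~ N(nu, tau^2),
    with the posterior replaced by the empirical law of the draws."""
    nu, tau = theta_draws[:, 0], theta_draws[:, 1]
    # content Q_theta[L, U] of the interval under each posterior draw
    content = norm.cdf((U - nu) / tau) - norm.cdf((L - nu) / tau)
    expectation_ok = content.mean() >= 1 - alpha                 # criterion (3)
    delta_alpha_ok = (content >= 1 - delta).mean() >= 1 - alpha  # criterion (4)
    return expectation_ok, delta_alpha_ok
```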

2.3 Comparison

The frequentist and Bayesian definitions differ in the usual way in that the frequentist probabilities in (2) and (1) refer to the possible values of $x$ in the sample space, whereas the Bayesian probabilities in (3) and (4) condition on the observed value of $X$ and refer to the distribution of the parameter.

As is the case for credible sets versus confidence sets, the Bayesian approach may feel more natural.

An advantage of the Bayesian approach is that while the form of the tolerance set $\mathcal{R}(X)$ in (4) is determined by the variable $Z$, through the prediction problem $Q_\theta$, the model for the data $X$ enters only through the posterior distribution $\Pi(\theta\in\cdot\mid X)$. If in the former the dependence on the parameter $\theta$ is not too complicated, then the problem is solvable even for complicated data models. In contrast, the frequentist problem permits explicit solutions only in very special cases, although approximations and asymptotic expansions may extend their use (see [14]).

Neither the frequentist nor the Bayesian formulation restrains the shape of the region $\mathcal{R}(x)$. One may prescribe a fixed form and/or seek to optimize the shape with respect to an additional criterion, such as the volume of the region. The Bayesian formulation is again easier to apply, as the optimization is performed given the data $X$. In the case of a frequentist region it may be necessary to optimize an expected quantity instead.

In general the two approaches give different tolerance regions, but the difference may disappear in the large sample limit. The requirements of the frequentist and Bayesian tolerance regions $\mathcal{R}(X)$ for loss function $\ell$ and level $\alpha$ can be given symmetric formulations, as:

\[
\mathrm{E}\Bigl(\ell\bigl[Q_\theta(\mathcal{R}(X))\bigr]\mid\theta\Bigr)\ge 1-\alpha,\qquad\forall\theta, \tag{5}
\]
\[
\mathrm{E}\Bigl(\ell\bigl[Q_\theta(\mathcal{R}(X))\bigr]\mid X\Bigr)\ge 1-\alpha. \tag{6}
\]

In the first the expectation is taken with respect to $X$, which gives an integral over $x$ with respect to the density $p_\theta$. In the second the expectation is relative to $\theta$, which leads to an integral relative to the posterior distribution given $X$. The integral of the first relative to the prior is identical to the integral of the second relative to the Bayesian marginal distribution of $X$, but there is no reason that (5) implies (6) or vice versa. In particular, a Bayesian tolerance region need not be a frequentist tolerance region.

However, Bayesian and frequentist inference typically merge as the data become increasingly informative. For instance, this is true for regular parametric models in the sense that Bayesian credible sets are frequentist confidence sets, in the limit, with corresponding levels. The prior is then washed out and the Bayesian credible sets are equivalent to confidence sets based on the maximum likelihood estimator. This equivalence extends to tolerance regions, under some conditions. We defer a discussion to Section 5.

2.4 One-sided and two-sided tolerance intervals

For a one-dimensional future variable $Z$ it is natural to choose $\mathcal{R}(x)$ to be an interval in the real line. The endpoints of such an interval are referred to as tolerance limits.

The single finite tolerance limit of a Bayesian one-sided interval is determined by meeting the $(\delta,\alpha)$- or $\alpha$-expectation tolerance criterion. The pair of tolerance limits of a Bayesian two-sided interval might additionally be optimized to give an interval of minimal length, next to requiring that the tolerance criterion is met.

One-sided tolerance limits possess a straightforward interpretation and implementation. In particular, the $(\delta,\alpha)$-type has a simple description in terms of confidence intervals and posterior quantiles:

  • $(-\infty,U(X)]$ is a frequentist $(\delta,\alpha)$-tolerance interval if and only if it is a $(1-\alpha)$-confidence interval for the induced parameter $Q_\theta^{-1}(1-\delta)$; it is a Bayesian $(\delta,\alpha)$-tolerance interval if $U(X)$ is the $(1-\alpha)$-quantile of the posterior distribution of $Q_\theta^{-1}(1-\delta)$ given $X$.

  • $[L(X),\infty)$ is a frequentist $(\delta,\alpha)$-tolerance interval if and only if it is a $(1-\alpha)$-confidence interval for $Q_{\theta+}^{-1}(\delta)$; it is a Bayesian $(\delta,\alpha)$-tolerance interval if $L(X)$ is the $\alpha$-quantile of the posterior distribution of $Q_{\theta+}^{-1}(\delta)$.

Here $Q_\theta^{-1}(u)=\inf\{z: Q_\theta(-\infty,z]\ge u\}$ is the usual quantile function of $Z$, and $Q_{\theta+}^{-1}(u)$ is the right-continuous version of this quantile function. (The distinction between the two is usually irrelevant, and linked to the arbitrary convention of including the boundary point in the tolerance intervals.) The assertions follow by inverting the inequality $Q_\theta(\mathcal{R}(X))\ge 1-\delta$, using the fact that for a cumulative distribution function $Q$ and its quantile function: $Q^{-1}(u)\le x$ if and only if $u\le Q(x)$, and $Q_+^{-1}(u)<x$ if and only if $u<Q(x-)$, for $u\in[0,1]$ and $x\in\mathbb{R}$.
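For a Gaussian future variable $Z\mid\theta\sim N(\nu,\tau^2)$ the induced parameter is simply $Q_\theta^{-1}(1-\delta)=\nu+\tau\Phi^{-1}(1-\delta)$, so the Bayesian one-sided upper limit reduces to a posterior quantile of this quantity. A minimal sketch, assuming arrays of posterior draws `nu_draws` and `tau_draws` (the names are ours):

```python
import numpy as np
from scipy.stats import norm

def upper_tolerance_limit(nu_draws, tau_draws, delta=0.1, alpha=0.05):
    """Bayesian one-sided (delta, alpha)-tolerance limit U(X) for
    Z | theta ~ N(nu, tau^2): the (1 - alpha)-quantile of the posterior
    draws of the induced parameter Q_theta^{-1}(1 - delta)."""
    q = nu_draws + tau_draws * norm.ppf(1 - delta)  # Q_theta^{-1}(1 - delta)
    return np.quantile(q, 1 - alpha)
```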

A valid two-sided interval might be constructed as the intersection of two one-sided intervals, each at half of the error rate, but this will be conservative and lead to needlessly wide intervals. It makes good sense to try and minimize the length of the interval. We consider this in the next section for the case that the future observation ZZ is univariate Gaussian.

3 Normally distributed future observation

Consider the case that the future observation $Z$ is univariate Gaussian with mean $\nu$ and variance $\tau^2$. Thus the parameter is $\theta=(\nu,\tau)$, and $Z\mid X,\nu,\tau\sim Q_\theta=N(\nu,\tau^2)$. The probability that the future observation is captured within a candidate tolerance interval $[L,U]$ is

\[
Q_\theta[L,U]=\Phi\Bigl(\frac{U-\nu}{\tau}\Bigr)-\Phi\Bigl(\frac{L-\nu}{\tau}\Bigr).
\]

It is convenient to parametrize the interval $[L,U]$ by its midpoint $A=(L+U)/2$ and half length $B=(U-L)/2$. For given $[L,U]$, or $\{A,B\}$, and $\delta\in(0,1)$, define the set

\[
G_{A,B,\delta}=\bigl\{\theta=(\nu,\tau): Q_\theta[L,U]\ge 1-\delta\bigr\}.
\]

For given $[L,U]$ the set $G_{A,B,\delta}$ is shaped as in Figure 1. It is symmetric about the vertical line $\nu=(L+U)/2$ and intersects the horizontal axis $\tau=0$ in the interval $[L,U]$. Changing $A$ moves the set $G_{A,B,\delta}$ horizontally, while changing $B$ changes its shape, with bigger $B$ making the set both wider and taller. Although we use the normal distribution as our example, similarly shaped sets and conclusions would be obtainable for other unimodal symmetric distributions.

Figure 1: The set $G_{A,B,\delta}$ of pairs $(\nu,\tau)$ such that $\Phi\bigl((U-\nu)/\tau\bigr)-\Phi\bigl((L-\nu)/\tau\bigr)\ge 1-\delta$, for $A=4$, $B=3$, $\delta=0.1$. The number $A$ is its horizontal point of symmetry and $B$ is the half-length of its base. The base of the set (the line segment at height $\tau=0$) corresponds to the tolerance interval $[L,U]$.

It follows that $\mathcal{R}(X)=\bigl[L(X),U(X)\bigr]$ satisfies inequality (1) and hence is a frequentist $(\delta,\alpha)$-tolerance interval for $Z$ if and only if

\[
P_\theta\bigl(x: \theta\in G_{A(x),B(x),\delta}\bigr)\ge 1-\alpha,\qquad\forall\theta.
\]

In other words $[L(X),U(X)]$ is a $(\delta,\alpha)$-tolerance interval for $Z$ if and only if $G_{A(X),B(X),\delta}$ is a $(1-\alpha)$-confidence region for $\theta=(\nu,\tau)$. Setting a joint confidence region for location and dispersion is a familiar problem, but here the shape is restrained to the form $G_{A,B,\delta}$ and the focus will be on minimizing $B=B(X)$ (in some average sense). Solutions will depend on the type of data $X$. Standard solutions are available in closed form for the simplest models, and more generally as approximations.

Similarly $\mathcal{R}(X)=\bigl[L(X),U(X)\bigr]$ satisfies inequality (4) and hence is a Bayesian $(\delta,\alpha)$-tolerance interval for $Z$ if

\[
\Pi\bigl(\theta: \theta\in G_{A(X),B(X),\delta}\mid X\bigr)\ge 1-\alpha. \tag{7}
\]

It is natural to choose $A(X)$ and $B(X)$ to satisfy this inequality in such a way that $B(X)$ is minimal. In the resulting optimization problem the posterior distribution $\Pi(\theta\in\cdot\mid X)$ is a fixed probability distribution on the upper half plane, and optimization entails shifting and scaling the shape shown in Figure 1 into a position such that it captures posterior mass at least $1-\alpha$, while minimizing its width. In Section 3.1 we show how to achieve this numerically given a large sample from the posterior distribution.

The data $X$ determine the posterior distribution, but do not enter the optimization problem. The parameter $\theta$ may not be the full parameter characterising the distribution of $X$, but our computational strategy will work as long as $\theta$ is a function of this full parameter. For instance, if $X$ follows a linear regression model with predictor “time” and $Z$ is an observation at time 0, then $\nu$ will be a function of the regression intercept; if $X$ follows a random-effects model, then typically $\nu$ will depend on the fixed effects and $\tau^2$ will be a specific linear combination of the variance components, depending on practical interests.

Frequentist methods typically choose $A(X)$ equal to a standard estimator of $\nu$. One might guess that the Bayesian solution will be to take $A(X)$ equal to the posterior mean $\mathrm{E}(\nu\mid X)$ of $\nu$. This would be convenient, as it would reduce the optimization over $(A,B)$ to the problem of optimizing only $B$. However, the posterior mean does not necessarily give the minimal length interval. The following lemma gives a sufficient condition.

Lemma 3.1.

Suppose that the conditional distribution of $\nu$ given $(X,\tau)$ is unimodal and symmetric with decreasing density to the right of its mode and has mean $\mathrm{E}(\nu\mid X,\tau)$ that is free of $\tau$. Then the shortest Bayesian $(\delta,\alpha)$-tolerance interval $[L,U]$ for a future variable $Z\mid X,\nu,\tau\sim N(\nu,\tau^2)$ is centered at the posterior mean $\mathrm{E}(\nu\mid X)$.

Proof.

We can decompose the probability on the left side of (7) as

\[
\Pi\bigl(G_{A,B,\delta}\mid X\bigr)=\int \Pi\bigl(\nu\in (G_{A,B,\delta})_\tau\mid X,\tau\bigr)\,d\Pi(\tau\mid X),
\]

where $(G_{A,B,\delta})_\tau=\{\nu: (\nu,\tau)\in G_{A,B,\delta}\}$ is the section of $G_{A,B,\delta}$ at height $\tau$. By the unimodality and monotonicity the integrand is maximized over $A$, for every $\tau$ and a given $B$, by choosing $A=\mathrm{E}(\nu\mid X,\tau)$. If this does not depend on $\tau$, then this common maximizer $A$ will maximize the whole expression. Since we need to determine $A$ and $B$ so that the expression is at least $1-\alpha$, maximizing it over $A$ will give the minimal $B$. By assumption $A=\mathrm{E}(\nu\mid X,\tau)=\mathrm{E}(\nu\mid X)$.

The condition of the preceding lemma is not unreasonable, but depends on the prior, as illustrated in the following simple example. In Section 4 we show that for a linear mixed model the condition is approximately satisfied. Then choosing $A$ equal to the posterior mean is a fast computational shortcut that may perform almost as well as the optimal solution.

Example 3.1.

The simplest possible data model is to let $X=(X_1,\ldots,X_n)$ be a random sample from the $N(\nu,\tau^2)$-distribution. This example was already discussed by Aitchison [1]. Here we highlight the implications of Lemma 3.1.

For the standard priors $\tau^2\sim IG(\alpha_0,\beta_0)$ and $\nu\mid\tau\sim N(a,\tau^2/b)$, the conditional posterior distribution of $\nu$ given $\tau$ is normal with mean

\[
\mathrm{E}\bigl(\nu\mid\tau,X\bigr)=\frac{ba+n\bar X}{b+n}.
\]

Since this is independent of $\tau$, the preceding discussion shows that the shortest $(\delta,\alpha)$-tolerance interval is centred at the posterior mean of $\nu$.

Choosing the prior variance $\mathrm{var}(\nu\mid\tau)$ proportional to $\tau^2$, which is customary, is crucial for this finding. For instance, if we set the prior on $\nu$ to be independent of $\tau$, say $\nu\mid\tau\sim N(a,b^{-1})$, then the conditional posterior mean changes to

\[
\mathrm{E}\bigl(\nu\mid\tau,X\bigr)=\frac{ba\tau^2+n\bar X}{b\tau^2+n}.
\]

This is $\bar X$ if $\tau=0$ and shrinks to the prior mean $a$ as $\tau\rightarrow\infty$. For illustration, let $a=0$, $b=0.1$, $\alpha_0=\beta_0=0.01$. Given data with $n=3$, $\bar x=10$, $s^2=1$, we approximated the posterior distribution of $(\nu,\tau)$ given $X$ by a Gibbs sampler, using the full conditional posteriors, where $s^2=\sum(x_i-\bar x)^2/(n-1)$:

\[
\nu\mid\tau,X\sim N\Bigl(\frac{ba\tau^2+n\bar X}{b\tau^2+n},\ \frac{\tau^2}{b\tau^2+n}\Bigr),
\]
\[
\tau^2\mid\nu,X\sim IG\Bigl(\alpha_0+\frac{n}{2},\ \beta_0+\frac{(n-1)s^2+n(\bar X-\nu)^2}{2}\Bigr).
\]
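A minimal sketch of this Gibbs sampler (the function name and defaults are ours; the conditionals are exactly those displayed above):

```python
import numpy as np

def gibbs_normal(xbar, s2, n, a=0.0, b=0.1, alpha0=0.01, beta0=0.01,
                 n_iter=20000, rng=None):
    """Gibbs sampler for Example 3.1 with nu independent of tau a priori:
    nu ~ N(a, 1/b), tau^2 ~ IG(alpha0, beta0); returns draws of (nu, tau)."""
    rng = np.random.default_rng(rng)
    nu, tau2 = xbar, s2
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # nu | tau, X ~ N((b*a*tau2 + n*xbar)/(b*tau2 + n), tau2/(b*tau2 + n))
        denom = b * tau2 + n
        nu = rng.normal((b * a * tau2 + n * xbar) / denom, np.sqrt(tau2 / denom))
        # tau^2 | nu, X ~ IG(alpha0 + n/2, beta0 + ((n-1)s^2 + n(xbar-nu)^2)/2)
        shape = alpha0 + n / 2
        rate = beta0 + ((n - 1) * s2 + n * (xbar - nu) ** 2) / 2
        tau2 = 1.0 / rng.gamma(shape, 1.0 / rate)  # inverse gamma via gamma
        draws[t] = nu, np.sqrt(tau2)
    return draws

draws = gibbs_normal(xbar=10.0, s2=1.0, n=3)  # the data of Example 3.1
```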

The contour plots of the posterior distribution in the left panel of Figure 2 show dependence between $\nu$ and $\tau$ given $X$, and an ensuing functional dependence of $\mathrm{E}(\nu\mid\tau,X)$ on $\tau$. Using Algorithm 2, as explained in Section 3.1, we computed the shortest tolerance interval (i.e. the smallest $\hat B$) for every possible location $A$ of the interval, for $A$ in a neighbourhood of the posterior mean $\mathrm{E}(\nu\mid X)$. The right panel of Figure 2 shows $\hat B$ as a function of $\hat A$. The minimum value is not taken at the location of the posterior mean $\mathrm{E}(\nu\mid X)$, which is indicated by a dashed line.

Admittedly the data in this example have been tweaked to illustrate the principle. Inspection of the vertical scale for $\hat B$ shows that the global minimal length of a tolerance interval is only slightly smaller than the length of the minimal interval centred at the posterior mean of $\nu$. In Section 4 we theoretically investigate a similar phenomenon for linear mixed models, and in Section 5 we study this approximation in a large sample context.

Figure 2: Left panel: density-level contour plots of the MCMC approximation to a realization of the posterior distribution of $\nu,\tau\mid X$ in Example 3.1 (prior parameters: $a=0$, $b=0.1$, $\alpha_0=\beta_0=0.01$; data $\bar x=10$, $s^2=1$, $n=3$). Right panel: corresponding half-lengths $B$ (vertical axis) of the $(\delta=0.05,\alpha=0.1)$-tolerance interval centered at $A$ (horizontal axis); the minimal length is not taken at the posterior mean $\mathrm{E}(\nu\mid X)$, whose location is indicated by the abscissa of the dotted line.

3.1 Computational strategy

In this section we elaborate on the computation of the two-sided Bayesian $(\delta,\alpha)$-tolerance interval for a normally distributed univariate future variable $Z$, as discussed in Section 3. We also compare our approach to the ones taken in [20] and [8]. We assume given a large sample of values $\theta_j=(\nu_j,\tau_j)$ from the posterior distribution of $\theta=(\nu,\tau)$ given $X$. This could be the result of an MCMC run of a sampler for the posterior distribution, or, depending on the data model, of using an analytic formula for the posterior distribution. We shall use the sample values $(\nu_j,\tau_j)$ to approximate expectations under the posterior distribution, whence they need not be independent, and values from a (burnt-in) MCMC run will qualify. Possible dependence together with sample size will determine the error due to simulation.

The idea is to replace the posterior distribution in (4) or (7) by the empirical distribution of the values $\{\theta_j\}_{j=1}^J$. For a given interval $\mathcal{R}(X)=[L,U]$ we can (in theory) compute the values $Q_{\theta_j}[L,U]$ and next search for the interval $[L,U]$ of minimal length $U-L$ such that

\[
\frac{1}{J}\,\#\bigl\{1\le j\le J: Q_{\theta_j}[L,U]\ge 1-\delta\bigr\}\doteq 1-\alpha,
\]

where $\doteq$ means approximately equal, yielding a slightly conservative or anti-conservative solution in case exact equality is not attainable due to discretization.

Wolfinger [20] proposed the algorithm summarized as Algorithm 1 in the display, and this was refined (or corrected) by Krishnamoorthy and Mathew [8], Chapter 11. This algorithm has a convenient graphical representation and has been widely adopted in practice. The idea is to compute for every $\theta_j$ the quantiles $L_j=Q_{\theta_j}^{-1}(\delta/2)$ and $U_j=Q_{\theta_j}^{-1}(1-\delta/2)$, yielding intervals $[L_j,U_j]$ with $Q_{\theta_j}[L_j,U_j]\ge 1-\delta$, and next setting the tolerance interval $[L,U]$ equal to an interval that is symmetric about the posterior mean and contains a fraction $1-\alpha$ of the intervals $[L_j,U_j]$ (Krishnamoorthy and Mathew), or is contained in a fraction $\alpha$ of these intervals (Wolfinger). The graphical interpretation is to plot the points $(U_j,L_j)$ in the $x$-$y$-plane and search for a point $(U,L)$ on the line $y+x=2\hat\nu$, for $\hat\nu$ the posterior mean or some other useful estimator, such that a fraction $1-\alpha$ of the points are in the left-upper quadrant relative to the point $(U,L)$ (see Figure 11.1 in [8] for an example). This method results in an interval that attains higher confidence than the prescribed level $1-\alpha$, and appears not to optimize the length of the interval.

Data: Given $\alpha$, $\delta$, $\{(\nu_j,\tau_j)\}_{j=1}^J$
1 Let $\hat A=\sum_j \nu_j/J$;
2 Calculate two quantile sequences: $\{L_j\equiv Q^{-1}_{\nu_j,\tau_j}(\frac{\delta}{2})\}_{j=1}^J$ and $\{U_j\equiv Q^{-1}_{\nu_j,\tau_j}(1-\frac{\delta}{2})\}_{j=1}^J$;
3 Find a point $(\hat L,\hat U)$ with $\hat L+\hat U=2\hat A$ satisfying one of the following:
4   (W) $\arg\min_{\hat L,\hat U}\bigl|\frac{\#S}{J}-\alpha\bigr|$, where $S=\{(L_j,U_j): L_j\le\hat L,\ U_j\ge\hat U\}$;
5   (KM) $\arg\min_{\hat L,\hat U}\bigl|\frac{\#S}{J}-1+\alpha\bigr|$, where $S=\{(L_j,U_j): L_j\ge\hat L,\ U_j\le\hat U\}$;
Result: two-sided tolerance interval $[\hat L,\hat U]$
Algorithm 1 WKM solution for two-sided tolerance interval
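For comparison with our proposal below, here is a minimal sketch of the KM variant of Algorithm 1 for a Gaussian future variable (the function name and arrays are ours); the search over points $(\hat L,\hat U)$ on the line $\hat L+\hat U=2\hat A$ is expressed through half-widths:

```python
import numpy as np
from scipy.stats import norm

def wkm_interval(nu_draws, tau_draws, delta=0.1, alpha=0.05):
    """Sketch of the KM variant of Algorithm 1: center at the posterior
    mean and widen symmetrically until a fraction 1 - alpha of the
    per-draw intervals [L_j, U_j] is contained in [L_hat, U_hat]."""
    A_hat = nu_draws.mean()
    z = norm.ppf(1 - delta / 2)
    L_j = nu_draws - z * tau_draws   # Q^{-1}_{theta_j}(delta / 2)
    U_j = nu_draws + z * tau_draws   # Q^{-1}_{theta_j}(1 - delta / 2)
    # half-width needed for an interval centered at A_hat to contain [L_j, U_j]
    half_widths = np.maximum(U_j - A_hat, A_hat - L_j)
    B_hat = np.quantile(half_widths, 1 - alpha)
    return A_hat - B_hat, A_hat + B_hat
```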

Here we propose another algorithm that directly utilizes (4). We seek to minimize $B$ under the constraint that the interval $[L,U]=[A-B,A+B]$ satisfies (4). This takes two steps: for fixed $A$ we optimize over $B$; next we perform a grid search over $A$. Because, for given $A$, the optimal $B$ yields equality in (4), $\hat B$ will be the solution to

\[
\Pi\biggl[\Phi\Bigl(\frac{A+B-\nu}{\tau}\Bigr)-\Phi\Bigl(\frac{A-B-\nu}{\tau}\Bigr)\ge 1-\delta\mid X\biggr]=1-\alpha. \tag{8}
\]

The posterior mean $\mathrm{E}(\nu\mid X)$ will typically be close to the optimal solution for $A$, and is a good starting point for this parameter. As a fast approximation we may also set $A$ equal to this value and omit the grid search.

In practice, we replace the posterior distribution in equation (8) by an average over the sample values $(\nu_j,\tau_j)$. The posterior mean of $\nu$ can be approximated by the average of the sample values $\nu_j$. Given $\hat A$ we approximate $\hat B$ by the $(1-\alpha)$-quantile of the points $g_j$ computed as the solutions to

\[
Q_{\nu_j,\tau_j}[\hat A-g_j,\hat A+g_j]\equiv\Phi\Bigl(\frac{\hat A+g_j-\nu_j}{\tau_j}\Bigr)-\Phi\Bigl(\frac{\hat A-g_j-\nu_j}{\tau_j}\Bigr)=1-\delta. \tag{9}
\]

The motivation for this procedure is that $(\nu_j,\tau_j)\in G_{\hat A,\hat B,\delta}$ if and only if $Q_{\nu_j,\tau_j}[\hat A-\hat B,\hat A+\hat B]\ge 1-\delta$, whence precisely the points $(\nu_j,\tau_j)$ with $g_j\le\hat B$ satisfy $Q_{\nu_j,\tau_j}[\hat A-\hat B,\hat A+\hat B]\ge 1-\delta$ and hence are inside the set $G_{\hat A,\hat B,\delta}$, whereas the other points are outside this set. This makes the posterior mass of the set equal to $1-\alpha$ up to simulation error.

Figure 3: Left panel: plot of the curves $g\mapsto\Phi\bigl(\frac{\varepsilon+g}{\tau}\bigr)-\Phi\bigl(\frac{\varepsilon-g}{\tau}\bigr)$ for various settings of $\varepsilon$ and $\tau$. Right panel: derivatives of these curves.

Given $\hat A$ and $(\nu_j,\tau_j)$, the function $g\mapsto Q_{\nu_j,\tau_j}[\hat A-g,\hat A+g]$ in (9) is increasing, from the value 0 when $g=0$ to 1 as $g\rightarrow\infty$ (see Figure 3). The solutions $g_j$ to each equation (9) can be found fast by a Newton-Raphson algorithm, with some caution on choosing the initial value for $g_j$ (the algorithm will diverge if the initial value is chosen in the domain where $Q_{\nu_j,\tau_j}[\hat A-g_j,\hat A+g_j]$ is very close to 1). An appropriate algorithm is listed in Algorithm 2. Note that the (middle) expression in (9) does not change if $\varepsilon:=\hat A-\nu_j$ is replaced by $-\varepsilon$.

Data: Given $\alpha$, $\delta$, $\{(\nu_j,\tau_j)\}_{j=1}^J$
1 Let $\hat A=\sum_j \nu_j/J$;
2 for $j=1,2,\ldots,J$ do
3   solve equation (9) by a Newton-Raphson algorithm as follows;
4   if $|\hat A-\nu_j|<\tau_j$ then
5     $g_0=|\hat A-\nu_j|+\tau_j$
6   else
7     $g_0=|\hat A-\nu_j|$;
8   set the initial value of $g_j$ to $g_0$;
9   while $|\omega|>0.0001$ do
10    let $q_{\nu_j,\tau_j}[\hat A-g_j,\hat A+g_j]$ denote the first-order derivative of $Q_{\nu_j,\tau_j}[\hat A-g_j,\hat A+g_j]$ with respect to $g_j$;
11    $g_j=g_j-\dfrac{Q_{\nu_j,\tau_j}[\hat A-g_j,\hat A+g_j]-1+\delta}{q_{\nu_j,\tau_j}[\hat A-g_j,\hat A+g_j]}$;
12    $\omega=Q_{\nu_j,\tau_j}[\hat A-g_j,\hat A+g_j]-1+\delta$;
13 The loop results in $\{g_j\}_{j=1}^J$; let $\hat B$ be its $(1-\alpha)$th sample quantile;
Result: two-sided tolerance interval $[\hat A-\hat B,\hat A+\hat B]$
Algorithm 2 Proposed solution for two-sided tolerance interval
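A vectorized sketch of Algorithm 2 in Python (the function name and the iteration cap are ours); it fixes $A$ at the posterior mean, so a grid search over $A$ around this value, re-running the Newton step for each candidate, would recover the fully optimized interval:

```python
import numpy as np
from scipy.stats import norm

def bayes_tolerance_interval(nu_draws, tau_draws, delta=0.1, alpha=0.05,
                             tol=1e-4):
    """Sketch of Algorithm 2: two-sided Bayesian (delta, alpha)-tolerance
    interval centered at the posterior mean, given posterior draws."""
    A_hat = nu_draws.mean()
    eps = np.abs(A_hat - nu_draws)             # |A_hat - nu_j|
    # initial values g_0 as prescribed in Algorithm 2
    g = np.where(eps < tau_draws, eps + tau_draws, eps)
    # Newton-Raphson for Q_{nu_j,tau_j}[A_hat - g, A_hat + g] = 1 - delta
    for _ in range(100):
        Q = (norm.cdf((eps + g) / tau_draws)
             - norm.cdf((eps - g) / tau_draws))             # interval content
        q = (norm.pdf((eps + g) / tau_draws)
             + norm.pdf((eps - g) / tau_draws)) / tau_draws  # dQ/dg
        omega = Q - (1 - delta)
        if np.max(np.abs(omega)) < tol:
            break
        g = g - omega / q
    B_hat = np.quantile(g, 1 - alpha)
    return A_hat - B_hat, A_hat + B_hat
```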

4 Data from a linear mixed model

In this section we apply the preceding to a model that is representative for practice in pharmaceutical quality control: the linear mixed model (LMM). We assume that the data $X$ are acquired in an LMM design, and that the future variable $Z$ is defined in terms of the same LMM. We concentrate attention on the two-sided $(\delta,\alpha)$-tolerance interval.

In the LMM we observe a vector $X=U\beta+V\gamma+e$, for known (deterministic) matrices $U$ and $V$ of covariates, a vector of fixed effects parameters $\beta$, an unobserved random effect vector $\gamma$, and an error vector $e$. Assume that $\gamma$ and $e$ are independent, with

\[
\gamma\sim N(0,D),\qquad e\sim N(0,\sigma^2 I). \tag{10}
\]

Then the data $X$ follows a $N(U\beta,C)$-distribution, for $C=VDV^T+\sigma^2 I$, and the full parameter is $(\beta,D,\sigma^2)$.

Consider predicting a new observation $Z=u^T\beta+v^T\gamma'+e'$ with given fixed and random effects coefficients $u$ and $v$ and newly generated random effect vector $\gamma'$ and error $e'$, with $\gamma'\sim\gamma$ and $e'\sim N(0,\sigma^2)$. Thus $\gamma'$ is assumed equal in distribution to, but independent of, $\gamma$, and similarly for $e'$. This target for prediction is reasonable in many contexts, but sometimes another choice, in particular for $\gamma'$, may be more relevant. Typically $\gamma$ will carry a group structure matched by a block structure in $V$. The vector $v$ will then have nonzero coordinates corresponding to a single group. The target corresponds to setting the distribution $Q_\theta$ of $Z$ equal to $N(\nu,\tau^2)$, with

\[
\nu=u^T\beta,\qquad \tau^2=v^TDv+\sigma^2. \tag{11}
\]

The “prediction” parameter $\theta=(\nu,\tau)$ is of smaller dimension than the full parameter $(\beta,D,\sigma^2)$ governing the distribution of the data $X$, whence part of the latter full parameter can be considered a nuisance parameter. To set a Bayesian tolerance interval we need a posterior distribution of $\theta$ given the data $X$. This will typically be inferred from a posterior distribution of the full parameter, resulting from a prior distribution on $(\beta,D,\sigma^2)$.

The (conditional) posterior distribution for a conditional prior $\beta\mid D,\sigma^2\sim N(0,\Lambda)$, where $\Lambda$ may depend on $(D,\sigma^2)$, satisfies

\[
\beta\mid D,\sigma^2,X\sim N\Bigl((U^TC^{-1}U+\Lambda^{-1})^{-1}U^TC^{-1}X,\ (U^TC^{-1}U+\Lambda^{-1})^{-1}\Bigr).
\]

(An alternative expression for the posterior mean is $\Lambda U^T(U\Lambda U^T+C)^{-1}X$.) In general $\mathrm{E}(\nu\mid D,\sigma^2,X)=u^T\mathrm{E}(\beta\mid D,\sigma^2,X)$ will depend on $D$ and $\sigma^2$ (hidden in $C$) and hence typically also on $\tau^2$, in view of (11). Therefore Lemma 3.1 does not apply, and there appears to be no reason that a shortest tolerance interval would be centered at the posterior mean of $\nu$. To obtain the shortest interval, Algorithm 2 should be augmented with a search over possible centerings $A$.
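For concreteness, a dense-matrix sketch of this conditional posterior computation (the array names are ours; a practical implementation would exploit the block structure of $C$):

```python
import numpy as np

def beta_posterior(X, U, V, D, sigma2, Lambda):
    """Conditional posterior mean and covariance of beta given (D, sigma^2)
    for the prior beta | D, sigma^2 ~ N(0, Lambda)."""
    C = V @ D @ V.T + sigma2 * np.eye(V.shape[0])      # cov of X given beta
    Cinv_U = np.linalg.solve(C, U)                     # C^{-1} U
    precision = U.T @ Cinv_U + np.linalg.inv(Lambda)   # posterior precision
    cov = np.linalg.inv(precision)
    mean = cov @ (Cinv_U.T @ X)                        # (...)^{-1} U^T C^{-1} X
    return mean, cov
```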

As in the standard i.i.d. model in Example 3.1, the dependence on $\sigma^2$ can be removed by choosing the variances $\Lambda$ and $D$ proportional to $\sigma^2$. If $D=\sigma^2 D_0$ and $\Lambda=\sigma^2\Lambda_0$, then $C$ will be $\sigma^2(VD_0V^T+I)=:\sigma^2 C_0$ and the conditional posterior mean of $\beta$ will be $(U^TC_0^{-1}U+\Lambda_0^{-1})^{-1}U^TC_0^{-1}X$. However, the dependence on $D_0$ (through $C_0$) remains, in general.

Letting the prior covariance matrix $\Lambda$ tend to infinity corresponds to a non-informative prior on $\beta$. If all other quantities are fixed and $\Lambda\rightarrow\infty$, then

\[
\mathrm{E}(\beta\mid D,\sigma^2,X)\rightarrow (U^TC^{-1}U)^{-1}U^TC^{-1}X.
\]

The limit is the maximum likelihood estimator of $\beta$ in the model where $C$ is known. Since this is still dependent on $C$ (and hence $D$ and $\sigma^2$), it seems that for both the Bayesian and frequentist tolerance intervals the two parameters $\nu$ and $\tau$ cannot be separated in general. The choice $\Lambda=\lambda(U^TC^{-1}U)^{-1}$ leads to $\lambda/(1+\lambda)$ times the maximum likelihood estimator, and hence also still depends on $C$.

4.1 Approximations to the conditional posterior mean

For special designs the dependence of $\mathrm{E}(\beta\mid D,\sigma^2,X)$ on $(D,\sigma^2)$ is only mild and can be quantified. We discuss three examples of linear mixed models. For clarity we restrict to balanced designs, although its flexibility makes the Bayesian approach particularly valuable for unbalanced cases, as is illustrated by the numerical examples in Section 4.2.

Example 4.1 (One-way random effects).

Suppose $X$ is a vector with coordinates $X_{ik}=\beta+\gamma_i+e_{ik}$, for $i=1,\ldots,m$ and $k=1,\ldots,n$, ordered as $(X_{11},\ldots,X_{1n},X_{21},\ldots,X_{mn})$, where $\beta\in\mathbb{R}$ and $\gamma=(\gamma_1,\ldots,\gamma_m)^T$ with i.i.d. $\gamma_i\sim N(0,d^2)$, so that $D=d^2 I_m$, for $I_m$ the $(m\times m)$-identity matrix. As prior on $\beta$ we choose a one-dimensional normal distribution $N(0,\lambda^2)$.

The matrix $U$ is the $mn$-vector $1_{mn}$ with all coordinates equal to 1, while $V$ is the $(mn\times m)$-matrix with $i$th column having 1s in rows $(i-1)n+1$ to $in$ and 0s in the other rows. Then $V^TV=nI_m$ and $U^TV=n1_m$, and it can be verified that $C1_{mn}=(nd^2+\sigma^2)1_{mn}$ and hence $C^{-1}U=(nd^2+\sigma^2)^{-1}1_{mn}$. The coefficient vector of $\mathrm{E}(\beta\mid D,\sigma^2,X)$ is

\[
(U^TC^{-1}U+\lambda^{-2})^{-1}U^TC^{-1}=\Bigl(mn+\frac{nd^2+\sigma^2}{\lambda^2}\Bigr)^{-1}1_{mn}^T.
\]

For $\lambda=\infty$, this is free of $d^2$ and $\sigma^2$, while for finite, fixed $\lambda$ and $m,n\rightarrow\infty$, the coefficient vector is $(mn)^{-1}\bigl(1+O(d^2/(m\lambda^2))+O(\sigma^2/(mn\lambda^2))\bigr)$.

The dependence on $d$ and $\sigma$ can be removed by choosing $d=d_0\sigma$ and $\lambda=\lambda_0\sqrt{nd_0^2+1}\,\sigma$.

Example 4.2 (Full random effects).

Suppose $X_{ik}=u_{ik}^T\beta+v_{ik}^T\gamma_i+e_{ik}$, ordered as in the preceding example, but now with observed covariates $u_{ik}\in\mathbb{R}^p$ and $v_{ik}\in\mathbb{R}^q$, fixed effects parameter $\beta\in\mathbb{R}^p$ and i.i.d. random effects $\gamma_i\sim N_q(0,D_q)$, for $i=1,\ldots,m$. The corresponding matrices $U$ and $V$ are

\[
U=\begin{pmatrix}U_1\\ \vdots\\ U_m\end{pmatrix},\quad V=\begin{pmatrix}V_1&\cdots&0\\ \vdots&\ddots&\vdots\\ 0&\cdots&V_m\end{pmatrix},\quad U_i=\begin{pmatrix}u_{i1}^T\\ \vdots\\ u_{in}^T\end{pmatrix},\quad V_i=\begin{pmatrix}v_{i1}^T\\ \vdots\\ v_{in}^T\end{pmatrix}.
\]

Then $C=VDV^T+\sigma^2 I$ is an $(mn\times mn)$-block-diagonal matrix with $m$ blocks $V_iD_qV_i^T+\sigma^2 I_n$, and

\[
U^TC^{-1}U=\sum_{i=1}^m U_i^T(V_iD_qV_i^T+\sigma^2 I_n)^{-1}U_i,
\]
\[
U^TC^{-1}=\Bigl(U_1^T(V_1D_qV_1^T+\sigma^2 I_n)^{-1},\ldots,U_m^T(V_mD_qV_m^T+\sigma^2 I_n)^{-1}\Bigr).
\]

The matrices $U_i^TV_i$ and $V_i^TV_i$ are of dimensions $p\times q$ and $q\times q$, and are sums over the $n$ observations (for $k=1,\ldots,n$) per group $i$, as defined by the random effect $\gamma_i$. For conventional asymptotics we could view them as $n$ times a matrix of fixed order. Then, for $D_0=\sigma^{-2}D_q$,

\[
(V_iD_0V_i^T+I)^{-1}=I-V_i(D_0^{-1}+V_i^TV_i)^{-1}V_i^T
\]
\[
\qquad=I-V_i(V_i^TV_i)^{-1}\bigl[D_0^{-1}(V_i^TV_i)^{-1}+I\bigr]^{-1}V_i^T
\]
\[
\qquad=I-V_i(V_i^TV_i)^{-1}\bigl[I-D_0^{-1}(V_i^TV_i)^{-1}+(D_0^{-1}(V_i^TV_i)^{-1})^2-\cdots\bigr]V_i^T
\]
\[
\qquad=P_{V_i^\perp}+V_i(V_i^TV_i)^{-1}D_0^{-1}(V_i^TV_i)^{-1}V_i^T-\sum_{k=2}^{\infty}(-1)^k V_i(V_i^TV_i)^{-1}\bigl(D_0^{-1}(V_i^TV_i)^{-1}\bigr)^k V_i^T, \tag{12}
\]

where $P_{V^\perp}$ is the projection onto the orthocomplement of the linear span of the columns of $V$. If $V_i^TV_i$ is large, then the series on the right can be neglected, as its terms contain multiple factors $(V_i^TV_i)^{-1}$. There is then still dependence on $D$ and $\sigma^2$ in the second term, which may dominate the first term.

In a full random effects model, every random effect is matched by a fixed effect with the same covariate vector (supplying a common mean value to the random effects), and hence $u_{ik}=v_{ik}$, for every $(i,k)$, and consequently $U_i=V_i$. Then $U_i^TP_{V_i^\perp}=0$ and $U_i^TV_i(V_i^TV_i)^{-1}=I_p$. If $n^{-1}V_i^TV_i$ stabilizes as $n\rightarrow\infty$,

\[
\sigma^2U^TC^{-1}U=\sum_{i=1}^m\sum_{k=1}^\infty(-1)^{k-1}D_0^{-1}\bigl((V_i^TV_i)^{-1}D_0^{-1}\bigr)^{k-1}=m\Bigl(D_0^{-1}+O\bigl(\tfrac{1}{n}\bigr)\Bigr),
\]
\[
\sigma^2U^TC^{-1}=\Bigl(\sum_{k=1}^\infty(-1)^{k-1}\bigl(D_0^{-1}(V_i^TV_i)^{-1}\bigr)^{k}V_i^T\Bigr)_{i=1}^m=\Bigl(D_0^{-1}(V_i^TV_i)^{-1}\Bigl(I+O\bigl(\tfrac{1}{n}\bigr)\Bigr)V_i^T\Bigr)_{i=1}^m.
\]

Thus we find that the coefficient vector of the posterior mean of $\beta$ satisfies

\[
(U^TC^{-1}U+\Lambda^{-1})^{-1}U^TC^{-1}=\biggl(\frac{1}{m}\Bigl(I+O\Bigl(\frac{D_0}{n}\Bigr)+\frac{D_0\Lambda^{-1}}{m}\Bigr)^{-1}(V_i^TV_i)^{-1}\Bigl(I+O\bigl(\tfrac{1}{n}\bigr)\Bigr)V_i^T\biggr)_{i=1}^m.
\]

To first order this is free of $D$ and $\sigma^2$, with relative remainders of the orders $D_0/n$ and $D_0\Lambda^{-1}/m$.

Example 4.3 (Random effects with additional fixed effects).

In the preceding example every random effect was matched by a fixed effect with the same covariate vector (supplying a common mean value to the random effects); now suppose there are more fixed than random effects. This corresponds to setting $u_{ik}^T=(v_{ik}^T,\bar u_{ik}^T)$, which implies $U_i=(V_i,\bar U_i)$, for an $(n\times(p-q))$-matrix $\bar U_i$. The formulas in the preceding example must then be adapted as follows, where the approximations refer to ignoring the series in (12) and $W_i=V_i(V_i^TV_i)^{-1}$:

\[
\sigma^2U^TC^{-1}U\doteq\sum_{i=1}^m\biggl[\begin{pmatrix}0&0\\ 0&\bar U_i^TP_{V_i^\perp}\bar U_i\end{pmatrix}+\begin{pmatrix}D_0^{-1}&D_0^{-1}W_i^T\bar U_i\\ \bar U_i^TW_iD_0^{-1}&\bar U_i^TW_iD_0^{-1}W_i^T\bar U_i\end{pmatrix}\biggr],
\]
\[
\sigma^2U^TC^{-1}\doteq\biggl(\begin{pmatrix}0\\ \bar U_i^TP_{V_i^\perp}\end{pmatrix}+\begin{pmatrix}D_0^{-1}W_i^T\\ \bar U_i^TW_iD_0^{-1}W_i^T\end{pmatrix}\biggr)_{i=1}^m.
\]

It is reasonable to expect that the matrices $\bar U_i^TW_i=\bar U_i^TV_i(V_i^TV_i)^{-1}$ will settle down as $n\rightarrow\infty$, as will the matrices $n^{-1}\bar U_i^TP_{V_i^\perp}\bar U_i$. Then up to lower order terms

\[
(U^TC^{-1}U)^{-1}U^TC^{-1}\doteq\biggl[\sum_{i=1}^m\begin{pmatrix}D_0^{-1}&D_0^{-1}W_i^T\bar U_i\\ \bar U_i^TW_iD_0^{-1}&\bar U_i^TP_{V_i^\perp}\bar U_i\end{pmatrix}\biggr]^{-1}\biggl(\begin{pmatrix}D_0^{-1}W_i^T\\ \bar U_i^TP_{V_i^\perp}\end{pmatrix}\biggr)_{i=1}^m.
\]

Here the three appearances of $D_0^{-1}$ in the top row cancel each other (as follows by factorizing out the block diagonal matrix with blocks $D_0^{-1}$ and $I_{p-q}$ from the inverse matrix), while the factor $\bar U_i^TW_iD_0^{-1}$ in the bottom row is a factor $1/n$ smaller in order than the matrix $\bar U_i^TP_{V_i^\perp}\bar U_i$ and hence can be set to 0 up to order $1/n$. It follows that again the matrix is free of $D$ and $\sigma^2$ up to order $1/n$.

4.2 Numerical examples

We evaluate the small sample performance of our proposed algorithm via a simulation study, with data generated from a one-way random effects model. We observe $(X_{ik}: i=1,2,\ldots,m;\ k=1,2,\ldots,n_i)$, where $X_{ik}$ is the $k$th value of the $i$th group and satisfies $X_{ik}=\nu+\gamma_i+e_{ik}$ for i.i.d. $\gamma_i\sim N(0,d^2)$ independent of the i.i.d. errors $e_{ik}\sim N(0,\sigma^2)$. We are interested in the two-sided $(\delta,\alpha)$-tolerance interval related to a future observation $Z=\nu+\gamma+e\sim Q_\theta=N(\nu,\tau^2)$, where $\gamma$ and $e$ are independent copies of the $\gamma_i$ and $e_{ik}$. The parameter is $\theta=(\nu,\tau^2)$, for $\tau^2=d^2+\sigma^2$.

We used five different parameter settings. In every setting the overall mean was set equal to $\nu=0$, and the group variance to $d^2=1$. The intra-correlation $\sigma^2/(d^2+\sigma^2)$ was chosen equal to the numbers $0.1, 0.3, 0.5, 0.7, 0.9$. In every setting the number of groups was $m=6$ and the group sizes were $(n_1,\ldots,n_6)=(2,3,4,2,3,4)$. This simulation setup is the same as in [14], and facilitates a cross-comparison against the performance of a standard frequentist solution.

We computed the $(\delta=0.1,\alpha=0.05)$-tolerance interval $[L,U]$ by Algorithm 2, for each of $K=1000$ replicates of the data in each parameter setting, and computed the true coverage $Q_\theta[L,U]$ of these intervals using the true parameter $\theta$ of the simulation. We consider an interval with true coverage no less than $1-\delta$ as “qualified”, and compared the empirical fraction of qualified intervals out of the $K$ replicates to the nominal value $1-\alpha$. The procedure is considered to perform well in the frequentist sense if this empirical fraction is close to this nominal value. Here we must allow for the simulation error, which has a standard error of $\sqrt{p(1-p)/K}$, for $p$ the true probability that an interval is qualified, which is unknown, but hopefully close to $1-\alpha$.
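This evaluation step amounts to the following check (a sketch; the array names are ours):

```python
import numpy as np
from scipy.stats import norm

def qualified_fraction(intervals, nu, tau, delta=0.1):
    """Fraction of simulated intervals [L, U] whose true coverage
    Q_theta[L, U] under the data-generating theta is at least 1 - delta."""
    L, U = intervals[:, 0], intervals[:, 1]
    coverage = norm.cdf((U - nu) / tau) - norm.cdf((L - nu) / tau)
    return (coverage >= 1 - delta).mean()
```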

For each simulated dataset the posterior distribution of $(\nu,d,\sigma^2)$ was approximated by a standard Gibbs sampler (with the vector of random effects $\gamma_i$ added in as a fourth parameter), before utilizing Algorithm 2. Two setups of priors were deployed, both with independent priors on the three parameters $\nu,d,\sigma$. The first is the vanilla setup with vague marginal prior distributions $\nu\sim N(0,1000)$, $d^2\sim IG(0.001,0.001)$ and $\sigma^2\sim IG(0.001,0.001)$. The second uses the same prior on $\sigma^2$, but uses a $t$-distribution for $\nu$ given by the hierarchy $\nu\mid\sigma_0\sim N(0,\sigma_0^2)$ and $\sigma_0^2\sim IG(0.001,0.001)$, and the prior on $d$ given by the structural equation $d=|\xi|\omega$ for independent $\xi\sim N(0,1)$ and $\omega^2\sim IG(0.001,0.001)$. The latter specification can also be understood as over-parameterizing the distribution of the random effect using two parameters instead of one, as $\gamma\mid\xi,\omega\sim N(0,\xi^2\omega^2)$. This “parameter expansion” is meant to enhance the mixing rate of the Gibbs sampler, in particular when the number of groups $m$ is small. See [3] for a comparison of methods (including non-Bayesian methods) to fit the LMM.
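A sketch of the vanilla Gibbs sampler for this unbalanced one-way model (all names are ours; the full conditionals follow from standard conjugacy), whose output $(\nu_j,\tau_j)$ feeds directly into Algorithm 2:

```python
import numpy as np

def gibbs_one_way(x, groups, n_iter=20000, a0=0.001, rng=None):
    """Vanilla Gibbs sampler for X_ik = nu + gamma_i + e_ik with priors
    nu ~ N(0, 1000), d^2 ~ IG(a0, a0), sigma^2 ~ IG(a0, a0).
    x: flat data vector; groups: integer group index per observation.
    Returns draws of (nu, tau) with tau^2 = d^2 + sigma^2."""
    rng = np.random.default_rng(rng)
    m = groups.max() + 1
    n_i = np.bincount(groups, minlength=m)
    N = x.size
    nu, d2, s2 = x.mean(), x.var(), x.var()
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # gamma_i | rest: normal with precision 1/d2 + n_i/s2
        prec = 1 / d2 + n_i / s2
        resid_sum = np.bincount(groups, weights=x - nu, minlength=m)
        gamma = rng.normal(resid_sum / s2 / prec, np.sqrt(1 / prec))
        # nu | rest: normal with precision 1/1000 + N/s2
        prec_nu = 1 / 1000 + N / s2
        nu = rng.normal((x - gamma[groups]).sum() / s2 / prec_nu,
                        np.sqrt(1 / prec_nu))
        # d^2 and sigma^2 | rest: inverse gamma (sampled via gamma)
        d2 = 1 / rng.gamma(a0 + m / 2, 1 / (a0 + (gamma ** 2).sum() / 2))
        e = x - nu - gamma[groups]
        s2 = 1 / rng.gamma(a0 + N / 2, 1 / (a0 + (e ** 2).sum() / 2))
        draws[t] = nu, np.sqrt(d2 + s2)
    return draws
```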

In all simulation settings the tolerance intervals were constructed both by fixing the center point $A$ at the posterior mean $\mathrm{E}(\nu\mid X)$, and by seeking an optimal value of $A$ to minimize the half length $B$ of the interval. Thus four tolerance intervals were calculated based on each simulated dataset. To save on computation time the optimization over $A$ was carried out only approximately. Still, for 85% of the simulation cases a shorter interval was obtained than the interval at the posterior mean, in a few cases as much as 20% shorter, but in 75% of the cases no more than a few percentage points. Table 1 reports the quotients of the lengths.

The proportions of intervals that attain true coverage at least $1-\delta=0.9$ are listed in Table 2. They are reasonably close to the nominal value $1-\alpha=0.95$, with deviations in both directions up to several percentage points. The performance seems to depend on the true intra-correlation. This dependence follows a similar pattern as for the high-order asymptotic solution in [14] without correction for imbalance, and is close to their solution that includes the correction when the intra-correlation is small or very large. The tolerance intervals centered at the (approximately) optimal value have shorter length and attain lower confidence, but their performance seems to slightly surpass that of the intervals centered at the posterior mean $\mathrm{E}(\nu\mid X)$.

The difference in performance of Algorithm 2 between the two prior setups is within the order of the simulation error.

Table 1: Interval length at the optimal value of $A$ relative to fixing $A$ at $\mathrm{E}(\nu\mid X)$

                          Vanilla setup                       Parameter Expansion setup
intra-correlation   Min     0.25 qu.  Median  0.75 qu.   Min     0.25 qu.  Median  0.75 qu.
0.1                 0.8420  0.9958    0.9984  0.9993     0.8625  0.9969    0.9991  0.9997
0.3                 0.8244  0.9943    0.9984  0.9994     0.8569  0.9969    0.9989  0.9997
0.5                 0.8192  0.9905    0.9981  0.9992     0.8374  0.9954    0.9985  0.9995
0.7                 0.8193  0.9897    0.9979  0.9992     0.8020  0.9947    0.9984  0.9994
0.9                 0.8102  0.9889    0.9980  0.9993     0.8058  0.9925    0.9980  0.9993
Table 2: Approximate confidence of the $(\delta=0.1,\alpha=0.05)$-Bayesian tolerance interval.

                            Vanilla setup                     Parameter Expansion setup
intra-correlation    $A=\mathrm{E}(\nu\mid X)$   $A$ optimal   $A=\mathrm{E}(\nu\mid X)$   $A$ optimal
0.1                  0.972                       0.969         0.968                       0.963
0.3                  0.964                       0.955         0.955                       0.949
0.5                  0.936                       0.921         0.925                       0.917
0.7                  0.925                       0.907         0.915                       0.911
0.9                  0.952                       0.941         0.940                       0.935

5 Frequentist justification of the Bayesian procedure

In this section we show that Bayesian tolerance regions are often also approximate frequentist tolerance regions of corresponding levels. We consider an asymptotic setup, with data $X=X_n$ indexed by a parameter $n\to\infty$, in which the Bernstein–von Mises theorem holds. The latter theorem (see e.g. [17], Chapter 10) entails that the posterior distribution $\Pi_n(\cdot\mid X_n)$ of $\theta$ can be approximated by a normal distribution with deterministic covariance matrix, centered at an estimator $\hat\theta_n=\hat\theta_n(X_n)$:

\Pi_n(\cdot\mid X_n)-N\Bigl(\hat\theta_n,\frac{1}{n}\Sigma_\theta\Bigr)\ \stackrel{P}{\to}\ 0, \qquad (13)

(in total variation norm), where the estimators $\hat\theta_n=\hat\theta_n(X_n)$ satisfy

\sqrt{n}(\hat\theta_n-\theta)\ \stackrel{\theta}{\rightsquigarrow}\ N(0,\Sigma_\theta). \qquad (14)

Under regularity conditions this is valid for $X_n$ a vector of $n$ i.i.d. observations from a smooth parametric model, with $\hat\theta_n$ the maximum likelihood estimator and $\Sigma_\theta$ the inverse Fisher information matrix. More generally, it is true whenever the experiments admit a local approximation by a Gaussian shift experiment ([9, 19]). The Bernstein–von Mises theorem can be used to show that Bayesian and frequentist inference (testing and confidence sets) merge for large sample sizes. In this section we investigate this for tolerance intervals.
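For instance, for $X_n$ a vector of $n$ i.i.d. $N(\nu,\tau^2)$ observations and $\theta=(\nu,\tau)$, the Gaussian model relevant in Section 5.1 below, a standard computation gives the explicit form

\hat\theta_n=\Bigl(\bar X_n,\ \Bigl(\frac{1}{n}\sum_{i=1}^n(X_i-\bar X_n)^2\Bigr)^{1/2}\Bigr), \qquad \Sigma_\theta=I_\theta^{-1}=\begin{pmatrix}\tau^2&0\\ 0&\tau^2/2\end{pmatrix},

so that, in particular, $\Sigma_{\theta,2,2}=\tau^2/2$.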

We shall show that Bayesian tolerance regions $\mathcal{R}_n(X_n)$ such that the functions

h\mapsto Q_{\hat\theta_n(X_n)+h/\sqrt{n}}\bigl(\mathcal{R}_n(X_n)\bigr), \qquad n=1,2,\ldots, \qquad (15)

stabilize asymptotically to a deterministic function are asymptotically frequentist tolerance regions, for any given loss function $\ell$ and level $\alpha$. The crux of this stability condition is that the randomness that enters the functions (15) through $X_n$ in $\hat\theta_n(X_n)$ asymptotically cancels the randomness that enters through $X_n$ in $\mathcal{R}_n(X_n)$: the Bayesian tolerance regions $\mathcal{R}_n(X_n)$ should be “asymptotically pivotal” with respect to the estimators $\hat\theta_n$. Some type of stability condition appears to be necessary, because the shape of a Bayesian tolerance region is left free by its definition.

An informal proof of the frequentist validity of Bayesian tolerance regions is as follows. Replacing the posterior distribution in (6) by its normal approximation (13) from the Bernstein–von Mises theorem, we find that

\int \ell\bigl[Q_\vartheta\bigl(\mathcal{R}_n(X_n)\bigr)\bigr]\,dN\Bigl(\hat\theta_n,\frac{1}{n}\Sigma_\theta\Bigr)(\vartheta)\doteq 1-\alpha. \qquad (16)

By the substitution $\vartheta=\hat\theta_n+h/\sqrt{n}$ this can be rewritten in the form

\int \ell\bigl[Q_{\hat\theta_n+h/\sqrt{n}}\bigl(\mathcal{R}_n(X_n)\bigr)\bigr]\,dN(0,\Sigma_\theta)(h)\doteq 1-\alpha. \qquad (17)

By the stability assumption the integrand

h\mapsto g_n(h;X_n):=\ell\bigl[Q_{\hat\theta_n+h/\sqrt{n}}\bigl(\mathcal{R}_n(X_n)\bigr)\bigr] \qquad (18)

in this expression is asymptotically the same as a deterministic function $h\mapsto g_\infty(h)$. In view of (14) the integral in (17) is then approximately equal to $\mathrm{E}_\theta\, g_\infty\bigl(\sqrt{n}(\theta-\hat\theta_n)\bigr)$, which in turn, again by stability, is asymptotically the same as $\mathrm{E}_\theta\, g_n\bigl(\sqrt{n}(\theta-\hat\theta_n);X_n\bigr)$, or

\mathrm{E}_\theta\,\ell\bigl[Q_{\hat\theta_n+\sqrt{n}(\theta-\hat\theta_n)/\sqrt{n}}\bigl(\mathcal{R}_n(X_n)\bigr)\bigr]=\mathrm{E}_\theta\,\ell\bigl[Q_\theta\bigl(\mathcal{R}_n(X_n)\bigr)\bigr].

Thus the final expression, which is the frequentist level of the tolerance region $\mathcal{R}_n(X_n)$, is asymptotically equal to $1-\alpha$.

For a $(\delta,\alpha)$-tolerance region, $\ell\bigl(Q_\theta(\mathcal{R}_n(X_n))\bigr)$ is the indicator of the set $\hat G_n=\{\theta: Q_\theta\bigl(\mathcal{R}_n(X_n)\bigr)\ge 1-\delta\}$ and the function (18) is the indicator of the set

\hat H_n=\sqrt{n}(\hat G_n-\hat\theta_n).

Thus the stability condition is that the latter sets tend to a deterministic set, as $n\to\infty$. Condition (17) becomes

N(0,\Sigma_\theta)(\hat H_n)\doteq 1-\alpha. \qquad (19)

This equality allows one to “solve” for one aspect of the sets $\hat H_n$; in general additional constraints will be imposed to define their shape. As the normal distribution in this display is fixed, it is not unnatural that these constraints should render the sets $\hat H_n$ fixed in the limit as well: stability is natural.

The following theorem makes the preceding rigorous. We shall verify its conditions for normal prediction variables in the next section.

Theorem 5.1.

Suppose that (13)–(14) hold, that the loss function $\ell$ is bounded, and that there exist (deterministic) functions $f_{n,1},f_{n,2}:\mathbb{R}^d\to\mathbb{R}$ such that $f_{n,i}(h_n)\to f_\infty(h)$, for $i=1,2$, some function $f_\infty$, and every sequence $h_n\to h$ with limit $h$ in a set of probability one under the normal distribution in (14), and such that

f_{n,1}(h)\le \ell\bigl[Q_{\hat\theta_n+h/\sqrt{n}}\bigl(\mathcal{R}_n(X_n)\bigr)\bigr]\le f_{n,2}(h), \qquad h\in\mathbb{R}^d. \qquad (20)

Then $\int \ell\bigl(Q_\theta(\mathcal{R}_n(X_n))\bigr)\,d\Pi(\theta\mid X_n)\to 1-\alpha\in(0,1)$ in probability implies that $\mathrm{E}_\theta\,\ell\bigl(Q_\theta(\mathcal{R}_n(X_n))\bigr)\to 1-\alpha$, as $n\to\infty$, for every $\theta$.

Proof.

We may assume without loss of generality that the functions $f_{n,i}$ are uniformly bounded. Then the condition that $f_{n,i}(h_n)\to f_\infty(h)$ for every sequence $h_n\to h$ implies that $\mathrm{E} f_{n,i}(Y_n)\to \mathrm{E} f_\infty(Y)$ whenever the sequence of random vectors $Y_n$ tends in distribution to the random vector $Y$, in view of the extended continuous mapping theorem (see [18], Theorem 1.11.1). Thus by (14), for $i=1,2$,

\mathrm{E}_\theta f_{n,i}\bigl(\sqrt{n}(\theta-\hat\theta_n)\bigr)\to \int f_\infty\,dN(0,\Sigma_\theta).

By assumption the function $g_n$ given in (18) is sandwiched between $f_{n,1}$ and $f_{n,2}$. Therefore $\mathrm{E}_\theta\,\ell\bigl(Q_\theta(\mathcal{R}_n(X_n))\bigr)=\mathrm{E}_\theta\, g_n\bigl(\sqrt{n}(\theta-\hat\theta_n);X_n\bigr)$ tends to the same limit.

By (13) and the definition of $g_n$ in (18) (see (16)–(17)), we have

\int \ell\bigl(Q_\theta(\mathcal{R}_n(X_n))\bigr)\,d\Pi(\theta\mid X_n)=\int g_n(h;X_n)\,dN(0,\Sigma_\theta)(h)+o_P(1).

Again by the sandwiching of $g_n(h;X_n)$, the right side is asymptotically between $\int f_{n,1}\,dN(0,\Sigma_\theta)$ and $\int f_{n,2}\,dN(0,\Sigma_\theta)$, and hence tends to $\int f_\infty\,dN(0,\Sigma_\theta)$. If the left side of the display tends to $1-\alpha$, as assumed, then it follows that $\int f_\infty\,dN(0,\Sigma_\theta)=1-\alpha$. The theorem follows by combining this with the preceding paragraph. ∎

5.1 Normal predictions

A $(\delta,\alpha)$-tolerance interval $\mathcal{R}_n(X_n)=[A_n-B_n,A_n+B_n]$ for a one-dimensional Gaussian variable $Z\sim N(\nu,\tau^2)$ is the base (the section at $\tau=0$) of a set of the form

G_{A,B,\delta}=\Bigl\{\theta=(\nu,\tau): \Phi\Bigl(\frac{A+B-\nu}{\tau}\Bigr)-\Phi\Bigl(\frac{A-B-\nu}{\tau}\Bigr)\ge 1-\delta\Bigr\}.

The values $A_n$ and $B_n$ are determined so that $\Pi(G_{A_n,B_n,\delta}\mid X_n)\ge 1-\alpha$, for $\Pi(\cdot\mid X_n)$ the posterior distribution of $\theta=(\nu,\tau)$, and so that the length $2B_n$ of the interval is minimal.

Under (13) the posterior distribution contracts (at rate $1/\sqrt{n}$) to the Dirac measure at $\hat\theta_n$, which tends to the true value of the parameter $(\nu,\tau)$ under (14). Hence the requirement $\Pi(G_{A_n,B_n,\delta}\mid X_n)\ge 1-\alpha$ forces every ball of fixed radius around this true value to intersect $G_{A_n,B_n,\delta}$ with probability tending to one. The minimality of $B_n$ implies that the horizontal locations $A_n$ of the latter sets must tend to the true value of $\nu$. Thus $A_n-\hat\nu_n\to 0$ in probability, and hence $\Phi(B_n/\hat\tau_n)-\Phi(-B_n/\hat\tau_n)\to 1-\delta$, whence $B_n/\hat\tau_n\to\xi_{\delta/2}$, for $\xi_\delta$ the upper $\delta$-quantile of the standard normal distribution.

The function $\theta\mapsto \ell\bigl(Q_\theta([A-B,A+B])\bigr)$ corresponding to the $(\delta,\alpha)$-tolerance interval is the indicator of the set $G_{A,B,\delta}$, and the stability condition (20) requires that the indicator functions of the sets $\hat H_n=\sqrt{n}(G_{A_n,B_n,\delta}-\hat\theta_n)$ be asymptotically deterministic. These sets can be written $\hat H_n=\{(g,h): K_n(g/\sqrt{n},h/\sqrt{n})\ge 0\}$, for the stochastic processes

K_n(g,h)=\Phi\Bigl(\frac{A_n+B_n-\hat\nu_n-g}{\hat\tau_n+h}\Bigr)-\Phi\Bigl(\frac{A_n-B_n-\hat\nu_n-g}{\hat\tau_n+h}\Bigr)-(1-\delta). \qquad (21)

By a second-order Taylor expansion we see that these processes satisfy the expansion (22) in Lemma 5.2 (below), with

\hat a_n=K_n(0,0)=\Phi\Bigl(\frac{A_n+B_n-\hat\nu_n}{\hat\tau_n}\Bigr)-\Phi\Bigl(\frac{A_n-B_n-\hat\nu_n}{\hat\tau_n}\Bigr)-(1-\delta),
\hat b_n=-\Bigl(\frac{\partial}{\partial h}K_n\Bigr)(0,0)=\psi\Bigl(\frac{A_n+B_n-\hat\nu_n}{\hat\tau_n}\Bigr)\frac{1}{\hat\tau_n}-\psi\Bigl(\frac{A_n-B_n-\hat\nu_n}{\hat\tau_n}\Bigr)\frac{1}{\hat\tau_n},
V(g,h)=h.

Here $\psi(x)=x\phi(x)=-\phi'(x)$. (Note that the partial derivative of $K_n$ at $(0,0)$ with respect to its first argument $g$, which equals $-\phi\bigl((A_n+B_n-\hat\nu_n)/\hat\tau_n\bigr)/\hat\tau_n+\phi\bigl((A_n-B_n-\hat\nu_n)/\hat\tau_n\bigr)/\hat\tau_n$, tends to zero.) Since $A_n$, $B_n$, $\hat\nu_n$ and $\hat\tau_n$ tend in probability to nontrivial limits, the conditions of Lemma 5.2 are satisfied, and hence the sets $\hat H_n$ are asymptotically sandwiched between pairs of deterministic sets. Functions $f_{n,i}$ as in (20) can be constructed from these sets by letting $\varepsilon\to 0$ and $M\to\infty$ slowly with $n$: for instance, $f_{n,1}(h)=1\{\|h\|\le M_n,\ Vh\le\xi_\alpha\sqrt{V\Sigma V^T}-\varepsilon_n\}$ and $f_{n,2}(h)=1\{Vh\le\xi_\alpha\sqrt{V\Sigma V^T}+\varepsilon_n\text{ or }\|h\|>M_n\}$ both converge to $f_\infty(h)=1\{Vh\le\xi_\alpha\sqrt{V\Sigma V^T}\}$ along any sequence $h_n\to h$ with $Vh\ne\xi_\alpha\sqrt{V\Sigma V^T}$, a set of probability one under $N(0,\Sigma_\theta)$. Thus the conditions of Theorem 5.1 are satisfied and we obtain the following corollary.

Corollary 5.1.

If the posterior distribution of $\theta=(\nu,\tau)$ given $X_n$ satisfies (13)–(14), then the Bayesian $(\delta,\alpha)$-tolerance interval $[A_n-B_n,A_n+B_n]$ of minimal length for a future variable $Z\mid X_n,\nu,\tau\sim N(\nu,\tau^2)$ is an asymptotic frequentist $(\delta,\alpha)$-tolerance set.

The convergence $A_n-\hat\nu_n\to 0$ in probability means that the tolerance intervals are asymptotically centered at the (asymptotic) posterior mean. If the posterior distribution of $\theta$ is exactly normal $N\bigl((\hat\nu_n,\hat\tau_n),\Sigma_\theta\bigr)$ with a diagonal covariance matrix, then $\mathrm{E}(\nu\mid X_n,\tau)$ is free of $\tau$, and Lemma 3.1 shows that the tolerance interval is centered exactly at the posterior mean.

For a non-diagonal matrix $\Sigma_\theta$ this is not necessarily true, and in general the normal distribution is only an approximation. The approximation $A_n-\hat\nu_n\to 0$ can in general be improved to the order $n^{-1/4}$. The following lemma also gives an asymptotic expression for the half length $B_n$ of the interval.

Lemma 5.1.

The centers and half lengths of the Bayesian $(\delta,\alpha)$-tolerance intervals $[A_n-B_n,A_n+B_n]$ of minimal length in Corollary 5.1 satisfy $A_n=\hat\nu_n+o_P(n^{-1/4})$ and $B_n=\hat\tau_n\xi_{\delta/2}+\xi_\alpha\xi_{\delta/2}\sqrt{\Sigma_{\theta,2,2}}\,n^{-1/2}+o_P(n^{-1/2})$.

Proof.

Since $A_n-\hat\nu_n\to 0$ and $B_n/\hat\tau_n\to\xi_{\delta/2}$, there exist intervals $I_n$ and $J_n$ around $\nu_0$ and $\tau_0\xi_{\delta/2}$, for $(\nu_0,\tau_0)$ the true value of $\theta$, that shrink to these points and contain $A_n$ and $B_n$ with probability tending to one. Define functions $F_n,G_n: I_n\times J_n\to\mathbb{R}$ by

F_n(A,B)=\Pi(G_{A,B,\delta}\mid X_n),
G_n(A,B)=N(0,\Sigma_\theta)\Bigl\{(g,h): h\le\frac{\sqrt{n}\,K_n(0,0;A,B)}{2\psi(\xi_{\delta/2})/\hat\tau_n}\Bigr\}.

Here $K_n(g,h;A,B)$ is the expression on the right side of (21), but with $(A_n,B_n)$ replaced by a generic $(A,B)$, and $\psi(x)=x\phi(x)$, as before. The values $(A_n,B_n)$ are determined so that $B_n$ is the smallest value for which there exists $A_n$ with $F_n(A_n,B_n)\ge 1-\alpha$. In other words, $B_n$ is the minimum of $\{B:\sup_A F_n(A,B)\ge 1-\alpha\}$ and $A_n$ is the point of maximum of $A\mapsto F_n(A,B_n)$. We shall show that the functions $F_n$ and $G_n$ satisfy the conditions of Lemma 5.3 with $c_n=\hat\tau_n\xi_{\delta/2}$ and $\xi(\alpha)=\xi_\alpha\xi_{\delta/2}\sqrt{\Sigma_{\theta,2,2}}$, whence the present lemma follows from that lemma.

In view of (13) we have $\sup_{A,B}\bigl|F_n(A,B)-N(0,\Sigma_\theta)(\hat H_{A,B,\delta})\bigr|\to 0$, for $\hat H_{A,B,\delta}=\bigl\{(g,h): K_n(g/\sqrt{n},h/\sqrt{n};A,B)\ge 0\bigr\}$. The supremum here and in the following is taken over $(A,B)\in I_n\times J_n$. By a second-order Taylor expansion, for functions $\eta_{n,1}$ and $\eta_{n,2}$ with $\sup_{A,B}\eta_{n,i}(A,B)\to 0$ and a constant $b$ independent of $(A,B)$,

\sqrt{n}\Bigl|K_n\Bigl(\frac{g}{\sqrt{n}},\frac{h}{\sqrt{n}};A,B\Bigr)-K_n(0,0;A,B)-\eta_{n,1}(A,B)\frac{g}{\sqrt{n}}-\Bigl(-\frac{2\psi(\xi_{\delta/2})}{\hat\tau_n}+\eta_{n,2}(A,B)\Bigr)\frac{h}{\sqrt{n}}\Bigr|\le b\,\frac{g^2+h^2}{\sqrt{n}}.

By a sandwiching argument as in the proof of Lemma 5.2, this shows that, with $C_n=\{(g,h): g^2+h^2\le L_n\}$ and $L_n\to\infty$ sufficiently slowly,

\sup_{A,B}\Bigl|N(0,\Sigma_\theta)(\hat H_{A,B,\delta}\cap C_n)-N(0,\Sigma_\theta)\Bigl\{(g,h)\in C_n: h\le\frac{\sqrt{n}\,K_n(0,0;A,B)}{2\psi(\xi_{\delta/2})/\hat\tau_n}\Bigr\}\Bigr|\to 0.

Because $N(0,\Sigma_\theta)(C_n^c)\to 0$, this remains true without intersecting the sets with $C_n$. This finishes the proof that $\sup_{A,B}\bigl|F_n(A,B)-G_n(A,B)\bigr|\to 0$.

By the unimodality and symmetry of the normal distribution, the map $A\mapsto K_n(0,0;A,B)$ has a maximum at $A=\hat\nu_n$, for every $B$, and hence the same is true for $A\mapsto G_n(A,B)$. It is elementary to verify by several Taylor expansions that $G_n(\hat\nu_n,\hat B)\to 1-\alpha$, for $\sqrt{n}\bigl(\Phi(\hat B/\hat\tau_n)-1+\delta/2\bigr)\hat\tau_n\to\xi_\alpha\psi(\xi_{\delta/2})\sqrt{\Sigma_{\theta,2,2}}$, that is, $\hat B=\hat\tau_n\xi_{\delta/2}+\xi_\alpha\xi_{\delta/2}\sqrt{\Sigma_{\theta,2,2}}/\sqrt{n}$.

Finally, since $\frac{\partial}{\partial A}K_n(0,0;\hat\nu_n,B)=0$, again by a Taylor expansion there exist functions $\eta_{n,3}$ with $\sup_A\eta_{n,3}(A,\hat B)\to 0$ such that

K_n(0,0;A,\hat B)\le K_n(0,0;\hat\nu_n,\hat B)+(A-\hat\nu_n)^2\bigl(\phi'(\hat B/\hat\tau_n)/\hat\tau_n^2+\eta_{n,3}(A,\hat B)\bigr).

As $\phi'(x)<0$ for $x>0$, we obtain, with probability tending to one, that $\sqrt{n}K_n(0,0;A,\hat B)$ is smaller than $\sqrt{n}K_n(0,0;\hat\nu_n,\hat B)-\varepsilon$, for some $\varepsilon>0$, whenever $\sqrt{n}(A-\hat\nu_n)^2>\delta$ for some $\delta>0$. On this event

G_n(A,\hat B)=N(0,\Sigma_\theta)\Bigl\{(g,h): h\le\frac{\sqrt{n}\,K_n(0,0;A,\hat B)}{2\psi(\xi_{\delta/2})/\hat\tau_n}\Bigr\}

is strictly smaller than its asymptotic value $1-\alpha$ at $A=\hat\nu_n$. This verifies the last displayed condition of Lemma 5.3. ∎
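As a numerical sanity check of Lemma 5.1 (a sketch under the idealized assumption that the posterior distribution of $\tau$ is exactly $N(\hat\tau_n,\Sigma_{\theta,2,2}/n)$, and neglecting the second-order effect of the uncertainty in $\nu$): at $A=\hat\nu_n$ the minimal half length is, to first order, $\xi_{\delta/2}$ times the upper-$\alpha$ posterior quantile of $\tau$, which reproduces the expansion of the lemma.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, delta, alpha = 200, 0.1, 0.05
tau_hat = 1.0
Sigma22 = tau_hat**2 / 2                      # e.g. the inverse Fisher information for tau
tau_draws = rng.normal(tau_hat, np.sqrt(Sigma22 / n), 200_000)

xi_d2 = norm.ppf(1 - delta / 2)               # upper delta/2-quantile of the standard normal
# Coverage >= 1 - delta holds (to first order) iff B >= tau * xi_{delta/2}, so the minimal B
# is xi_{delta/2} times the upper-alpha posterior quantile of tau:
B_exact = xi_d2 * np.quantile(tau_draws, 1 - alpha)
B_asymp = tau_hat * xi_d2 + norm.ppf(1 - alpha) * xi_d2 * np.sqrt(Sigma22 / n)
print(B_exact, B_asymp)                       # agree up to o(n^{-1/2})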

Lemma 5.2.

Suppose that for every $M>0$ the stochastic processes $(K_n(h): h\in\mathbb{R}^d)$ satisfy

\sup_{\|h\|\le M/\sqrt{n}}\sqrt{n}\bigl|K_n(h)-\hat a_n+\hat b_n Vh\bigr|\ \stackrel{P}{\to}\ 0, \qquad (22)

for random variables $\hat a_n$ and $\hat b_n>0$ such that $\hat b_n^{-1}$ is bounded in probability, and a linear map $V:\mathbb{R}^d\to\mathbb{R}$. If the sets $\hat H_n=\{h\in\mathbb{R}^d: K_n(h/\sqrt{n})\ge 0\}$ satisfy $N(0,\Sigma)(\hat H_n)\to 1-\alpha\in(0,1)$, then $\sqrt{n}\,\hat a_n/\hat b_n\to\xi_\alpha\sqrt{V\Sigma V^T}$ and, for every $\varepsilon,M>0$, with probability tending to one,

\bigl\{h\in\mathbb{R}^d:\|h\|\le M,\ Vh\le\xi_\alpha\sqrt{V\Sigma V^T}-\varepsilon\bigr\}\subset\hat H_n\subset\bigl\{h\in\mathbb{R}^d: Vh\le\xi_\alpha\sqrt{V\Sigma V^T}+\varepsilon\text{ or }\|h\|>M\bigr\}.
Proof.

Define $\varepsilon_n(h)=\sqrt{n}\bigl(K_n(h)-\hat a_n+\hat b_n Vh\bigr)$ and set $\hat\varepsilon_n=\sup_{\|h\|\le M/\sqrt{n}}|\varepsilon_n(h)|$, for given $M$. Then by assumption $\hat\varepsilon_n\to 0$ in probability, and $\sqrt{n}\bigl|K_n(h/\sqrt{n})-\hat a_n+\hat b_n Vh/\sqrt{n}\bigr|\le\hat\varepsilon_n$, for every $h$ with $\|h\|\le M$. From the latter inequality we find that

\|h\|\le M,\ K_n(h/\sqrt{n})\ge 0\ \Rightarrow\ \hat b_n\,Vh\le\sqrt{n}\hat a_n+\hat\varepsilon_n,
\|h\|\le M,\ \hat b_n\,Vh\le\sqrt{n}\hat a_n-\hat\varepsilon_n\ \Rightarrow\ K_n(h/\sqrt{n})\ge 0.

This implies that

\bigl\{\|h\|\le M,\ Vh\le(\sqrt{n}\hat a_n-\hat\varepsilon_n)/\hat b_n\bigr\}\subset\hat H_n\subset\bigl\{Vh\le(\sqrt{n}\hat a_n+\hat\varepsilon_n)/\hat b_n\text{ or }\|h\|>M\bigr\}.

Combining this with the fact that $N(0,\Sigma)(\hat H_n)\to 1-\alpha\in(0,1)$ and the fact that $Vh\sim N(0,V\Sigma V^T)$ if $h\sim N(0,\Sigma)$, we conclude that there exist $\delta_M>0$ with $\delta_M\to 0$ as $M\to\infty$ such that $(\sqrt{n}\hat a_n-\hat\varepsilon_n)/\hat b_n\le\xi_{\alpha-\delta_M}\sqrt{V\Sigma V^T}+o_P(1)$ and $(\sqrt{n}\hat a_n+\hat\varepsilon_n)/\hat b_n\ge\xi_{\alpha+\delta_M}\sqrt{V\Sigma V^T}+o_P(1)$, for every $M$. Since $\hat b_n^{-1}=O_P(1)$ by assumption, we have $\hat\varepsilon_n/\hat b_n\to 0$ in probability, and hence $\sqrt{n}\hat a_n/\hat b_n=\xi_\alpha\sqrt{V\Sigma V^T}+o_P(1)$. We substitute this in the last display to obtain the result of the lemma. ∎

Lemma 5.3.

Let $F_n,G_n: I_n\times J_n\to\mathbb{R}$ be stochastic processes indexed by rectangles $I_n\times J_n\subset\mathbb{R}^2$ that are nondecreasing in their second argument, such that

\sup_{A,B}|F_n(A,B)-G_n(A,B)|\ \stackrel{P}{\to}\ 0,

and such that, for numbers $c_n\in J_n$, a continuous function $\xi:(0,1)\to\mathbb{R}$, and every $\alpha\in(0,1)$,

\sup_A G_n\bigl(A,c_n+\xi(\alpha)/\sqrt{n}\bigr)=G_n\bigl(0,c_n+\xi(\alpha)/\sqrt{n}\bigr)\ \stackrel{P}{\to}\ 1-\alpha,
\mathrm{P}\Bigl(\sup_{A:|A|>\delta_n n^{-1/4}} G_n\bigl(A,c_n+\xi(\alpha)/\sqrt{n}\bigr)<1-\alpha\Bigr)\to 1, \qquad\text{some }\delta_n\to 0.

Then $B_n(\alpha):=\inf\bigl\{B:\sup_A F_n(A,B)\ge 1-\alpha\bigr\}$ satisfies $B_n(\alpha)=c_n+\xi(\alpha)/\sqrt{n}+o_P(n^{-1/2})$ and $\mathop{\rm argmax}_A F_n(A,B_n)=o_P(n^{-1/4})$.

Proof.

The functions $\bar F_n$ and $\bar G_n$ defined by $\bar F_n(B)=\sup_A F_n(A,B)$, and similarly for $G_n$, satisfy $\sup_B|\bar F_n(B)-\bar G_n(B)|\stackrel{P}{\to}0$. Combined with the first displayed assumption on $G_n$, this gives that $\bar F_n\bigl(c_n+\xi(\alpha)/\sqrt{n}\bigr)\stackrel{P}{\to}1-\alpha$, for every $\alpha\in(0,1)$. The definition of $B_n(\alpha)$ and the monotonicity of $\bar F_n$ now readily give that $c_n+\xi(\alpha_2)/\sqrt{n}\le B_n(\alpha)\le c_n+\xi(\alpha_1)/\sqrt{n}$, for every $\alpha_1<\alpha<\alpha_2$, with probability tending to one, or equivalently $\xi(\alpha_2)\le\sqrt{n}\bigl(B_n(\alpha)-c_n\bigr)\le\xi(\alpha_1)$, eventually. By the continuity of $\xi$ it follows that $\sqrt{n}\bigl(B_n(\alpha)-c_n\bigr)\stackrel{P}{\to}\xi(\alpha)$.

By the uniform approximation of $F_n$ by $G_n$ and the second displayed assumption on $G_n$, we have

\sup_{|A|>\delta_n n^{-1/4}} F_n\bigl(A,B_n(\alpha)\bigr)=\sup_{|A|>\delta_n n^{-1/4}} G_n\bigl(A,B_n(\alpha)\bigr)+o_P(1).

By monotonicity, $G_n\bigl(A,B_n(\alpha)\bigr)\le G_n\bigl(A,c_n+\xi(\alpha_1)/\sqrt{n}\bigr)$, for every $\alpha_1<\alpha$ and every $A$. Thus the right side of the display is eventually strictly smaller than $1-\alpha_1$, by assumption, which is strictly smaller than $1-\alpha$, for $\alpha_1$ close enough to $\alpha$. Similarly, $F_n\bigl(0,B_n(\alpha)\bigr)=G_n\bigl(0,B_n(\alpha)\bigr)+o_P(1)\to 1-\alpha$. It follows that the maximum of $A\mapsto F_n\bigl(A,B_n(\alpha)\bigr)$ is taken on the interval $[-\delta_n n^{-1/4},\delta_n n^{-1/4}]$. ∎

References

  • [1] J. Aitchison. Two papers on the comparison of Bayesian and frequentist approaches to statistical problems of prediction: Bayesian tolerance regions. Journal of the Royal Statistical Society, Series B (Methodological), 26(2):161–175, 1964.
  • [2] J. Aitchison. Expected-cover and linear-utility tolerance intervals. Journal of the Royal Statistical Society, Series B (Methodological), 28(1):57–62, 1966.
  • [3] William J. Browne and David Draper. A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Analysis, 1(3):473–514, 2006.
  • [4] Robert G. Easterling and David L. Weeks. An accuracy criterion for Bayesian tolerance intervals. Journal of the Royal Statistical Society, Series B (Methodological), 32(2):236–240, 1970.
  • [5] Michael Hamada. Bayesian tolerance interval control limits for attributes. Quality and Reliability Engineering International, 18(1):45–52, 2002.
  • [6] Michael Hamada, Valen Johnson, Leslie M. Moore, and Joanne Wendelberger. Bayesian prediction intervals and their relationship to tolerance intervals. Technometrics, 46(4):452–459, 2004.
  • [7] Hormuzd A. Katki, Eric A. Engels, and Philip S. Rosenberg. Assessing uncertainty in reference intervals via tolerance intervals: application to a mixed model describing HIV infection. Statistics in Medicine, 24(20):3185–3198, 2005.
  • [8] K. Krishnamoorthy and T. Mathew. Statistical Tolerance Regions: Theory, Applications, and Computation. Wiley Series in Probability and Statistics. Wiley, 2009.
  • [9] L. Le Cam. Limits of experiments. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. I: Theory of Statistics, pages 245–261, 1972.
  • [10] Robert W. Miller. Parametric empirical Bayes tolerance intervals. Technometrics, 31(4):449–459, 1989.
  • [11] Rahul Mukerjee and N. Reid. Second-order probability matching priors for a parametric function with application to Bayesian tolerance limits. Biometrika, 88(2):587–592, 2001.
  • [12] Dharini Pathmanathan, Rahul Mukerjee, and S. H. Ong. Two-sided Bayesian and frequentist tolerance intervals: general asymptotic results with applications. Statistics, 48(3):524–538, 2014.
  • [13] Dharini Pathmanathan and S. H. Ong. A Monte Carlo simulation study of two-sided tolerance intervals in balanced one-way random effects model for non-normal errors. Journal of Statistical Computation and Simulation, 84(11):2329–2344, 2014.
  • [14] Gaurav Sharma and Thomas Mathew. One-sided and two-sided tolerance intervals in general mixed and random effects models using small-sample asymptotics. Journal of the American Statistical Association, 107(497):258–267, 2012.
  • [15] A. J. van der Merwe and Johan Hugo. Bayesian tolerance intervals for the balanced two-factor nested random effects model. TEST, 16(3):598–612, 2007.
  • [16] A. J. van der Merwe, Albertus L. Pretorius, and J. H. Meyer. Bayesian tolerance intervals for the unbalanced one-way random effects model. Journal of Quality Technology, 38(3):280–293, 2006.
  • [17] A. van der Vaart. Asymptotic Statistics, volume 3 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 1998.
  • [18] A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics.
  • [19] Aad van der Vaart. An asymptotic representation theorem. International Statistical Review / Revue Internationale de Statistique, 59(1):97–121, 1991.
  • [20] Russell D. Wolfinger. Tolerance intervals for variance component models using Bayesian simulation. Journal of Quality Technology, 30(1):18–32, 1998.
  • [21] Derek S. Young, Charles M. Gordon, Shihong Zhu, and Bryan D. Olin. Sample size determination strategies for normal tolerance intervals using historical data. Quality Engineering, 28(3):337–351, 2016.