
Computationally efficient variational-like approximations of possibilistic inferential models

(This is an extended version of the paper (Cella and Martin, 2024) presented at the 8th International Conference on Belief Functions, in Belfast, UK.)

Leonardo Cella (Department of Statistical Sciences, Wake Forest University, [email protected]) and Ryan Martin (Department of Statistics, North Carolina State University, [email protected])
Abstract

Inferential models (IMs) offer provably reliable, data-driven, possibilistic statistical inference. But despite IMs’ theoretical and foundational advantages, efficient computation is often a challenge. This paper presents a simple and powerful numerical strategy for approximating the IM’s possibility contour, or at least its α-cut for a specified α ∈ (0,1). Our proposal starts with the specification of a parametric family that, in a certain sense, approximately covers the credal set associated with the IM’s possibility measure. Then the parameters of that family are tuned in such a way that the family’s 100(1−α)% credible set roughly matches the α-cut of the IM contour. This is reminiscent of the variational approximations now widely used in Bayesian statistics, hence the name variational-like IM approximation.

Keywords and phrases: Bayesian; confidence regions; credal set; fiducial; Monte Carlo; stochastic approximation.

1 Introduction

For a long time, despite Bayesians’ foundational advantages, few statisticians were actually using Bayesian methods—the computational burden was simply too high. Things changed significantly when Monte Carlo methods brought Bayesian solutions within reach. Things changed again more recently with the advances in various approximate Bayesian computational methods, in particular, the variational approximations in Blei et al., (2017), Tran et al., (2021), and the references therein. The once clear lines between what was computationally feasible for Bayesians and for non-Bayesians have now been blurred, reinvigorating Bayesians’ efforts in modern applications. Dennis Lindley predicted that “[statisticians] will all be Bayesians in 2020” (Smith, 1995)—his prediction did not come true, but the Bayesian community is arguably stronger now than ever.

While Bayesian and frequentist are currently the two mainstream schools of thought in statistical inference, these are not the only perspectives. For example, Dempster–Shafer theory originated as an improvement to and generalization of both Bayesian inference and Fisher’s fiducial argument. Of particular interest to us here are the recent advances in inferential models (IMs, Martin and Liu, 2013; Martin and Liu, 2015b ; Martin, 2021a ; Martin, 2022b ), a framework that offers Bayesian-like, data-dependent, possibilistic quantification of uncertainty about unknowns but with built-in, frequentist-like reliability guarantees. IMs and other new/non-traditional frameworks are currently facing the same computational challenges that Bayesians faced years ago. That is, we know what we want to compute and why, but we are currently lacking the tools to do so efficiently. While traditional Monte Carlo methods are still useful (see Section 2), the imprecision that is key to the IM’s reliability guarantees implies that Monte Carlo methods alone are not enough. In contrast to Lindley’s prediction, for Efron’s speculation about fiducial and related methods—“Maybe Fisher’s biggest blunder will become a big hit in the 21st century!” (Efron, 1998)—to come true, imprecision-accommodating advances in Monte Carlo computations are imperative. The present paper’s contribution is in this direction.

We start with a relatively simple idea that leads to a general tool for computationally efficient and statistically reliable possibilistic inference. For reasons summarized in Section 2, our focus is on the possibility-theoretic brand of IMs, which are fully determined by the corresponding contour function or, equivalently, the contour function’s so-called α-cuts. We leverage a well-known characterization of a possibility measure’s credal set as the collection of probability measures that assign probability at least 1−α to the aforementioned α-cuts of the contour function. In the IM context, the α-cuts are 100(1−α)% confidence regions and, therefore, the contents of the IM’s credal set can be justifiably interpreted as “confidence distributions” (e.g., Xie and Singh, 2013; Schweder and Hjort, 2002). Specifically, the “most diffuse” element of the credal set, or “inner probabilistic approximation,” would assign probabilities as close to 1−α as possible to each α-cut. If we could approximate this distinguished member of the IM’s credal set, via Monte Carlo or otherwise, then we would be well on our way to carrying out most—if not all—of the relevant IM computations. The challenge is that, except for very special classes of problems (Martin, 2023a), this “inner probabilistic approximation” would be rather complex. We can, however, obtain a relatively simple approximation if we only require an accurate approximation of, say, a single α-cut. Towards this, we propose to introduce a simple parametric family of probability distributions, e.g., Gaussian, on the parameter space, with parameters partially depending on data, and then to choose the unspecified features of the parameter in such a way that the distribution (approximately) assigns probability 1−α to a specified α-cut, say, α = 0.1. This is akin to variational Bayes in the sense that we propose to approximate a complex probability distribution—in our case the “inner probabilistic approximation” of the IM’s possibility measure instead of a Bayesian posterior distribution—by an appropriately chosen member of a posited class of relatively simple probability distributions. The specific, technical details of our proposal here are inspired by the recent work in Jiang et al. (2023) and the seemingly unrelated developments in Syring and Martin (2019).

The remainder of the paper is organized as follows. After some background on possibilistic IMs and their properties in Section 2, we present our basic but very general variational-like IM approximation in Section 3, which relies on a combination of Monte Carlo and stochastic approximation to tune the index of the posited parametric family of approximations. This is ideally suited for statistical inference problems involving low-dimensional parameters, and we offer a few such illustrations in Section 4. A more sophisticated version of the proposed variational-like approximation is presented in Section 5; it is more appropriate when the problem at hand is of higher dimension, though it is suited primarily for Gaussian approximation families—which is hardly a practical restriction. With a more structured approximation, we can lower the computational cost by reducing the number of Monte Carlo evaluations of the IM contour. This is illustrated in several examples, including a relatively high-dimensional problem with lasso penalization, and in problems involving nuisance parameters in parametric, nonparametric, and semiparametric models. The paper concludes in Section 7 with a concise summary and a discussion of some practically relevant extensions.

2 Background on possibilistic IMs

The first presentation of the IM framework (e.g., Martin and Liu, 2013; Martin and Liu, 2015b ) relied heavily on (nested) random sets and their corresponding belief functions. The more recent IM formulation in Martin, 2022b , building off of developments in Martin, (2015, 2018), defines the IM’s possibility contour using a probability-to-possibility transform applied to the relative likelihood. This small-but-important shift has principled motivations but we can only barely touch on this here. The present review focuses on the possibilistic IM formulation, its key properties, and the available computational strategies.

Consider a parametric statistical model {𝖯θ:θ𝕋}\{\mathsf{P}_{\theta}:\theta\in\mathbb{T}\} indexed by a parameter space 𝕋d\mathbb{T}\subseteq\mathbb{R}^{d}. Examples include 𝖯θ=𝖡𝗂𝗇(n,θ)\mathsf{P}_{\theta}={\sf Bin}(n,\theta), 𝖯θ=𝖭(θ,1)\mathsf{P}_{\theta}={\sf N}(\theta,1), 𝖯θ=𝖦𝖺𝗆𝗆𝖺(α,β)\mathsf{P}_{\theta}={\sf Gamma}(\alpha,\beta) with θ=(α,β)\theta=(\alpha,\beta), and many others; see Section 4. Nonparametric problems—see Cella, (2024), Cella and Martin, (2022), and Martin, 2023b (, Sec. 5)—can also be handled, but we postpone discussion of this until Section 6. Suppose that the observable data Xn=(X1,,Xn)X^{n}=(X_{1},\ldots,X_{n}) consists of iid samples from the distribution 𝖯Θ\mathsf{P}_{\Theta}, where Θ𝕋\Theta\in\mathbb{T} is the unknown/uncertain “true value.” The model and observed data Xn=xnX^{n}=x^{n} together determine a likelihood function θLxn(θ)\theta\mapsto L_{x^{n}}(\theta) and a corresponding relative likelihood function

R(x^{n},\theta)=\frac{L_{x^{n}}(\theta)}{\sup_{\vartheta}L_{x^{n}}(\vartheta)}.

Throughout we will implicitly assume that the denominator is finite, i.e., that the likelihood function is bounded for almost all xnx^{n}.

The relative likelihood, θR(xn,θ)\theta\mapsto R(x^{n},\theta), itself is a data-dependent possibility contour, i.e., it is a non-negative function such that supθR(xn,θ)=1\sup_{\theta}R(x^{n},\theta)=1 for almost all xnx^{n}. This contour, in turn, determines a possibility measure, or consonant plausibility function, that can be used for uncertainty quantification about Θ\Theta, given Xn=xnX^{n}=x^{n}, which has been extensively studied (e.g., Shafer, 1982; Wasserman, 1990; Denœux, 2006, 2014). Specifically, uncertainty quantification here would be based on the possibility assignments

H\mapsto\sup_{\theta\in H}R(x^{n},\theta),\quad H\subseteq\mathbb{T},

where HH represents a hypothesis about Θ\Theta. This purely likelihood-driven possibility measure has a number of desirable properties: for example, inference based on thresholding the possibility scores (with model-independent thresholds) satisfies the likelihood principle (e.g., Birnbaum, 1962; Basu, 1975), and it is asymptotically consistent (under standard regularity conditions) in the sense that it converges in 𝖯Θ\mathsf{P}_{\Theta}-probability to a possibility measure concentrated on Θ\Theta as nn\to\infty. What the purely likelihood-based possibility measure above lacks, however, is a calibration property (relative to the posited model) that gives meaning or belief-forming inferential weight to the “possibility” it assigns to hypotheses about the unknown Θ\Theta. More specifically, if we offer a possibility measure as our quantification of statistical uncertainty, then its corresponding credal set should contain probability distributions that are statistically meaningful. We are assuming vacuous prior information, so there are no meaningful or distinguished Bayesian posterior distributions. The only other natural property to enforce on the credal set’s elements is that they be “confidence distributions.” But, as in (5), this interpretation requires the α\alpha-cuts of the relative likelihood, i.e., {θ:R(xn,θ)>α}\{\theta:R(x^{n},\theta)>\alpha\}, to be 100(1α)100(1-\alpha)% confidence sets for Θ\Theta, which generally they are not. Therefore, the relative likelihood alone is not enough.

Fortunately, it is at least conceptually straightforward to achieve this calibration by applying what Martin, 2022a calls “validification”—a version of the probability-to-possibility transform (e.g., Dubois et al., 2004; Hose, 2022). In particular, for observed data Xn=xnX^{n}=x^{n}, the possibilistic IM’s contour is defined as

\pi_{x^{n}}(\theta)=\mathsf{P}_{\theta}\bigl\{R(X^{n},\theta)\leq R(x^{n},\theta)\bigr\},\quad\theta\in\mathbb{T}, (1)

and the possibility measure—or upper probability—is

\overline{\Pi}_{x^{n}}(H)=\sup_{\theta\in H}\pi_{x^{n}}(\theta),\quad H\subseteq\mathbb{T}.

A corresponding necessity measure, or lower probability, is defined via conjugacy as Π¯xn(H)=1Π¯xn(Hc)\underline{\Pi}_{x^{n}}(H)=1-\overline{\Pi}_{x^{n}}(H^{c}). This measure plays a crucial role in quantifying the support for hypotheses of interest (Cella and Martin, 2023); see Example 5.

An essential feature of this IM construction is its so-called validity property:

\sup_{\Theta\in\mathbb{T}}\mathsf{P}_{\Theta}\bigl\{\pi_{X^{n}}(\Theta)\leq\alpha\bigr\}\leq\alpha,\quad\text{for all $\alpha\in[0,1]$}. (2)

This has a number of important consequences. First, (2) is exactly the calibration property required to ensure a “confidence” connection that the relative likelihood on its own misses. In particular, (2) immediately implies that the set

C_{\alpha}(x^{n})=\{\theta\in\mathbb{T}:\pi_{x^{n}}(\theta)>\alpha\}, (3)

indexed by α[0,1]\alpha\in[0,1], is a 100(1α)100(1-\alpha)% frequentist confidence set in the sense that its coverage probability is at least 1α1-\alpha:

\sup_{\Theta\in\mathbb{T}}\mathsf{P}_{\Theta}\bigl\{C_{\alpha}(X^{n})\not\ni\Theta\bigr\}\leq\alpha,\quad\alpha\in[0,1].

Moreover, it readily follows that

\sup_{\Theta\in H}\mathsf{P}_{\Theta}\bigl\{\overline{\Pi}_{X^{n}}(H)\leq\alpha\bigr\}\leq\alpha,\quad\text{all $\alpha\in[0,1]$, all $H\subseteq\mathbb{T}$}. (4)

In words, the IM is valid, or calibrated, if it assigns possibility α\leq\alpha to true hypotheses at rate α\leq\alpha as a function of data XnX^{n}. This is precisely where the IM’s aforementioned “inferential weight” comes from: (4) implies that we do not expect small Π¯xn(H)\overline{\Pi}_{x^{n}}(H) when HH is true, so we are inclined to doubt the truthfulness of those hypotheses HH for which Π¯xn(H)\overline{\Pi}_{x^{n}}(H) is small. In addition, the above property ensures that the possibilistic IM is safe from false confidence (Balch et al., 2019; Martin, 2019, 2024), unlike all default-prior Bayes and fiducial solutions. An even stronger, uniform-in-hypotheses version of (4) holds, as shown/discussed in Cella and Martin, (2023):

\sup_{\Theta\in\mathbb{T}}\mathsf{P}_{\Theta}\bigl\{\overline{\Pi}_{X^{n}}(H)\leq\alpha\text{ for some $H$ with $H\ni\Theta$}\bigr\}\leq\alpha,\quad\text{all $\alpha\in[0,1]$}.

The “for some HH with HΘH\ni\Theta” event can be seen as a union of every hypothesis HH that contains Θ\Theta, which makes it a much broader event than that associated with any single fixed HH in (4). Consequently, regardless of the number of hypotheses evaluated or the manner in which they are selected—even if they are data-dependent—the probability that even a single suggestion from the IM is misleading remains controlled at the specified level. For further details about and discussion of the possibilistic IM’s properties, its connection to Bayesian/fiducial inference, etc, see Martin, 2022b ; Martin, 2023b ; Martin, 2023a .

In a Bayesian analysis, inference is based on summaries of the data-dependent posterior distribution, e.g., posterior probabilities of scientifically relevant hypotheses, expectations of loss/utility functions, etc. And all of these summaries boil down to integration involving the probability density function that determines the posterior. Virtually the same statement is true of the possibilistic IM: the lower–upper probability pairs for scientifically relevant hypotheses, lower–upper expectations of loss/utility functions, etc., boil down to optimization involving the possibility contour π_{x^n}. For example, if the focus is on a certain feature Φ = g(Θ) of Θ, then the Bayesian can find the marginal posterior distribution for Φ by integrating the posterior density for Θ. Similarly, the corresponding marginal possibilistic IM has a contour that is obtained by optimizing π_{x^n}:

\pi_{x^{n}}^{\text{marg}}(\phi)=\sup_{\theta:g(\theta)=\phi}\pi_{x^{n}}(\theta),\quad\phi\in g(\mathbb{T}).

Importantly, unlike with Bayesian integration, the IM’s optimization operation ensures that the validity property inherent in πxn\pi_{x^{n}} is transferred to πxnmarg\pi_{x^{n}}^{\text{marg}}, which implies that the IM’s marginal inference about Φ\Phi is safe from false confidence.
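To make this marginalization step concrete, below is a minimal Python sketch of the constrained optimization in the display above. The joint contour evaluator pi_xn and the feature map g are hypothetical, user-supplied functions (not part of any existing package), and any constrained optimizer could be substituted for the SLSQP call used here.

```python
import numpy as np
from scipy.optimize import minimize

def marginal_contour(pi_xn, g, phi, theta_init):
    """Approximate the marginal contour at phi by maximizing the joint
    contour pi_xn(theta) over the constraint set {theta : g(theta) = phi}."""
    cons = {"type": "eq", "fun": lambda theta: np.atleast_1d(g(theta) - phi)}
    # maximize pi_xn  <=>  minimize -pi_xn subject to g(theta) = phi
    res = minimize(lambda theta: -pi_xn(theta),
                   np.asarray(theta_init, dtype=float),
                   method="SLSQP", constraints=[cons])
    return -res.fun
```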

The above-described operations applied to the IM’s contour function (1), to obtain upper probabilities or to eliminate nuisance parameters, are all special cases of Choquet integrals (e.g., Troffaes and de Cooman, 2014, App. C). These more general Choquet integrals can be statistically relevant in formal decision-making contexts (Martin, 2021b), etc. For the sake of space, we will not dig into these details here, but the fact that there are non-trivial operations to be carried out even after the IM’s contour function has been obtained provides genuine, practical motivation for us to find as simple and efficient a contour approximation as possible.

While the IM construction and properties are relatively simple, computation can be a challenge. The key observation is that we rarely have the sampling distribution of the relative likelihood R(Xn,θ)R(X^{n},\theta), under 𝖯θ\mathsf{P}_{\theta}, available in closed form to facilitate exact computation of πxn(θ)\pi_{x^{n}}(\theta) for any θ\theta. So, instead, the go-to strategy is to approximate that sampling distribution using Monte Carlo at each value of θ\theta on a sufficiently fine grid (e.g., Martin, 2022b ; Hose et al., 2022). That is, the possibility contour is approximated as

\pi_{x^{n}}(\theta)\approx\frac{1}{M}\sum_{m=1}^{M}1\{R(X_{m,\theta}^{n},\theta)\leq R(x^{n},\theta)\},\quad\theta\in\mathbb{T},

where 1(·) is the indicator function and X^n_{m,θ} consists of n iid samples from P_θ, for m = 1, ..., M. The above computation is feasible at one or a few different θ values, but it is common that this needs to be carried out over a sufficiently fine grid covering the (relevant area of the) often multi-dimensional parameter space 𝕋. For example, the confidence set in (3) requires that we can solve the equation π_{x^n}(θ) = α, or at least find which θ’s satisfy π_{x^n}(θ) ≥ α, and a naive approach is to compute the contour over a huge grid and then keep those θ’s that (approximately) solve the aforementioned equation. This amounts to lots of wasted computations. More generally, the relevant summaries of the IM output involve optimization of the contour, and carrying out this optimization numerically requires many contour function evaluations. Simple tweaks to this most-naive approach can be employed in certain cases, e.g., importance sampling, but these adjustments require problem-specific considerations and they cannot be expected to offer substantial improvements in computational efficiency. This is a serious bottleneck, so new and not-so-naive computational strategies are desperately needed.
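For reference, a minimal Python sketch of this naive Monte Carlo approximation of the contour is given below; the relative likelihood rel_lik and the model sampler are assumed user-supplied and the names are illustrative only.

```python
import numpy as np

def contour_mc(theta, x_obs, rel_lik, sampler, M=1000, rng=None):
    """Naive Monte Carlo approximation of pi_{x^n}(theta): the proportion of
    datasets simulated from P_theta whose relative likelihood at theta is
    no larger than the observed relative likelihood."""
    rng = np.random.default_rng() if rng is None else rng
    r_obs = rel_lik(x_obs, theta)
    r_sim = np.array([rel_lik(sampler(theta, rng), theta) for _ in range(M)])
    return float(np.mean(r_sim <= r_obs))
```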

3 Basic variational-like IMs

The Monte Carlo-based strategy reviewed above is not the only way to numerically approximate the possibilistic IM. Another option is an analytical “Gaussian” approximation (see below) based on the available large-sample theory (Martin and Williams, 2024). The goal here is to strike a balance between the (more-or-less) exact but expensive Monte Carlo-based approximation and the rough but cheap large-sample approximation. To strike this balance, we choose to focus on a specific feature of the possibilistic IM’s output, namely, the confidence sets Cα(xn)C_{\alpha}(x^{n}) in (3), and choose the approximation such that it at least matches the given confidence set exactly. Our specific proposal resembles the variational approximations that are now widely used in Bayesian analysis, where first a relatively simple family of candidate probability distributions is specified and then the approximation is obtained by finding the candidate that minimizes (an upper bound on) the distance/divergence from the exact posterior distribution. Our approach differs in the sense that we are aiming to approximate a possibility measure by (the probability-to-possibility transform applied to) an appropriately chosen probability distribution.

Following Destercke and Dubois, (2014), Couso et al., (2001), and others, the possibilistic IM’s credal set, 𝒞(Π¯xn)\mathscr{C}(\overline{\Pi}_{x^{n}}), which is the set of precise probability distributions dominated by Π¯xn\overline{\Pi}_{x^{n}}, has a remarkably simple and intuitive characterization:

\mathsf{Q}_{x^{n}}\in\mathscr{C}(\overline{\Pi}_{x^{n}})\iff\mathsf{Q}_{x^{n}}\{C_{\alpha}(x^{n})\}\geq 1-\alpha,\quad\text{for all $\alpha\in[0,1]$}. (5)

(Where convenient, we will replace the subscript “xnx^{n}” by a subscript “nn”—e.g., 𝖰n\mathsf{Q}_{n} and Π¯n\overline{\Pi}_{n} instead of 𝖰xn\mathsf{Q}_{x^{n}} and Π¯xn\overline{\Pi}_{x^{n}}—to simplify the notation in what follows.) That is, a data-dependent probability 𝖰n\mathsf{Q}_{n} is consistent with Π¯n\overline{\Pi}_{n} if and only if, for each α[0,1]\alpha\in[0,1], it assigns at least 1α1-\alpha mass to the IM’s confidence set Cα(xn)C_{\alpha}(x^{n}) in (3). Furthermore, the “best” inner probabilistic approximation of the possibilistic IM, if it exists, corresponds to a 𝖰n\mathsf{Q}_{n}^{\star} such that 𝖰n{Cα(xn)}=1α\mathsf{Q}_{n}^{\star}\{C_{\alpha}(x^{n})\}=1-\alpha for all α[0,1]\alpha\in[0,1]. For a certain special class of statistical models, Martin, 2023a showed that this best inner approximation corresponds to Fisher’s fiducial distribution and the default-prior Bayesian posterior distribution. Beyond this special class of models, however, it is not clear if a best inner approximation exists and, if so, how to find it. A less ambitious goal is to find, for a fixed choice of α\alpha, a probability distribution 𝖰n,α\mathsf{Q}_{n,\alpha}^{\star} such that

\mathsf{Q}_{n,\alpha}^{\star}\{C_{\alpha}(x^{n})\}=1-\alpha. (6)

Our goal here is to develop a general strategy for finding, for a given α(0,1)\alpha\in(0,1), a probability distribution 𝖰n,α\mathsf{Q}_{n,\alpha}^{\star} that (at least approximately) solves the equation in (6). Once identified, we can reconstruct relevant features of the possibilistic IM via (Bayesian-like) Monte Carlo sampling from this 𝖰n,α\mathsf{Q}_{n,\alpha}^{\star}.

We propose to start with a parametric class of data-dependent probability distributions 𝒬var={𝖰nξ:ξΞ}\mathscr{Q}^{\text{var}}=\{\mathsf{Q}_{n}^{\xi}:\xi\in\Xi\} indexed by a generic space Ξ\Xi. An important example is the case where the 𝖰nξ\mathsf{Q}_{n}^{\xi}’s are Gaussian distributions with mean vector and/or covariance matrix depending on (data and) ξ\xi in some specified way. Specifically, since the possibilistic IM contour’s peak is at the maximum likelihood estimator θ^n=θ^xn\hat{\theta}_{n}=\hat{\theta}_{x^{n}}, it makes sense to fix the mean vector of the Gaussian 𝖰nξ\mathsf{Q}_{n}^{\xi} at θ^n\hat{\theta}_{n}; for the covariance matrix, however, a natural choice would be ξ2Jn1\xi^{2}\,J_{n}^{-1}, where Jn=JxnJ_{n}=J_{x^{n}} is the observed Fisher information matrix depending on data and on the posited statistical model. This Gaussian family is a very reasonable default choice of 𝒬var\mathscr{Q}^{\text{var}} given that 𝖰nξ\mathsf{Q}_{n}^{\xi} with ξ=1\xi=1 is asymptotically the best inner probabilistic approximation to Π¯n\overline{\Pi}_{n}; see Martin and Williams, (2024). So, our proposal is to insert some additional flexibility in the Gaussian approximation by allowing the spread to expand or contract depending on whether ξ>1\xi>1 or ξ<1\xi<1. While a Gaussian-based approximation is natural, choosing 𝒬var\mathscr{Q}^{\text{var}} to be a Gaussian family is not the only option for 𝒬var\mathscr{Q}^{\text{var}}; see Example 4 below. Indeed, if the parameter space has structure that is not present in the usual Euclidean space where the Gaussian is supported, e.g., if 𝕋\mathbb{T} is the probability simplex, then choosing 𝒬var\mathscr{Q}^{\text{var}} to respect that structure makes perfect sense.

The one high-level condition imposed on 𝒬var\mathscr{Q}^{\text{var}} is that it be sufficiently flexible, i.e., as ξ\xi varies, the 𝖰n,αξ\mathsf{Q}_{n,\alpha}^{\xi}-probability of the possibilistic IM’s α\alpha-cut takes values smaller and larger than the target level 1α1-\alpha. This clearly holds for the Gaussian approximation described above, since ξ\xi controls the spread of 𝖰n,αξ\mathsf{Q}_{n,\alpha}^{\xi} to the extent that the aforementioned 𝖰n,αξ\mathsf{Q}_{n,\alpha}^{\xi}-probability can be made arbitrarily small or large by taking ξ\xi sufficiently small or large, respectively. This mild condition can also be readily verified for virtually any other (reasonable) approximation family 𝒬var\mathscr{Q}^{\text{var}}.

Given such a suitable choice of approximation family 𝒬var\mathscr{Q}^{\text{var}}, indexed by a parameter ξΞ\xi\in\Xi, our proposed procedure is as follows. Define an objective function

f_{\alpha}(\xi)=\mathsf{Q}_{n,\alpha}^{\xi}(\{\theta:\pi_{x^{n}}(\theta)>\alpha\})-(1-\alpha), (7)

so that solving (6) boils down to finding a root of fαf_{\alpha}. If the probability on the right-hand side of (7) could be evaluated in closed-form, then one could apply any of the standard root-finding algorithms to solve this, e.g., Newton–Raphson. However, this 𝖰n,αξ\mathsf{Q}_{n,\alpha}^{\xi}-probability typically cannot be evaluated in closed-form so, instead, fαf_{\alpha} can be approximated via Monte Carlo with f^α\hat{f}_{\alpha} defined as

\hat{f}_{\alpha}(\xi)=\frac{1}{M}\sum_{m=1}^{M}1\{\pi_{x^{n}}(\Theta_{m}^{\xi})>\alpha\}-(1-\alpha), (8)

where Θ^ξ_1, ..., Θ^ξ_M are iid samples from Q^ξ_{x^n}. Presumably, these samples are cheap to obtain for every ξ because the family 𝒬^var has been specified by the user. But having only an unbiased estimator of the objective function requires some adjustments to the numerical routine. In particular, rather than a Newton–Raphson routine that assumes the function values are noiseless, we must use a stochastic approximation algorithm (e.g., Syring and Martin, 2021, 2019; Martin and Ghosh, 2008; Kushner and Yin, 2003; Robbins and Monro, 1951) that is adapted to noisy function evaluations of f̂_α. The basic Robbins–Monro algorithm seeks the root of (7) through the updates

\xi^{(t+1)}=\xi^{(t)}\pm w_{t+1}\,\hat{f}_{\alpha}(\xi^{(t)}),\quad t\geq 0,

where “±\pm” depends on whether ξfα(ξ)\xi\mapsto f_{\alpha}(\xi) is decreasing or increasing, ξ(0)\xi^{(0)} is an initial guess, and (wt)(w_{t}) is a deterministic step size sequence that satisfies

\sum_{t=1}^{\infty}w_{t}=\infty\quad\text{and}\quad\sum_{t=1}^{\infty}w_{t}^{2}<\infty.

Pseudocode for our proposed approximation is given in Algorithm 1. To summarize, we are proposing a simple data-dependent probability distribution for Θ\Theta whose probability-to-possibility contour closely matches the IM’s contour at least at the specified threshold α\alpha. More specifically, a reasonable choice of probability distribution is a Gaussian, and the approach presented in Algorithm 1 scales the covariance matrix so that its corresponding contour function accurately approximates the IM’s contour at threshold α\alpha.

requires: data x^n and ability to evaluate π_n = π_{x^n};
initialize: α-level, class 𝒬^var = {Q_n^ξ : ξ ∈ Ξ}, guess ξ^(0), step size sequence (w_t), Monte Carlo sample size M, and convergence threshold ε > 0;
set: stop = FALSE, t = 0;
while !stop do
       sample Θ_1, ..., Θ_M iid from Q_n^{ξ^(t)};
       evaluate f̂_α(ξ^(t)) as in (8) using (Θ_1, ..., Θ_M);
       update ξ^(t+1) = ξ^(t) ± w_{t+1} f̂_α(ξ^(t));
       if |ξ^(t+1) − ξ^(t)| < ε then
              ξ̂ = ξ^(t+1);
              stop = TRUE;
       else
              t ← t + 1;
       end if
end while
return ξ̂;
Algorithm 1: α-specific variational approximation of the IM

It is well-known that, under certain mild conditions, the sequence (ξ(t):t0)(\xi^{(t)}:t\geq 0) as defined above converges in probability to the root of fαf_{\alpha} in (7). If ξ^\hat{\xi} is the value returned when the algorithm reaches practical convergence, e.g., when |f^α(ξ(t))||\hat{f}_{\alpha}(\xi^{(t)})| or the change |ξ(t+1)ξ(t)||\xi^{(t+1)}-\xi^{(t)}| is smaller than some specified threshold, then we set 𝖰^n,α=𝖰n,αξ^\widehat{\mathsf{Q}}_{n,\alpha}=\mathsf{Q}_{n,\alpha}^{\hat{\xi}}. This distribution should be at least a decent approximation of the IM possibility measure’s inner approximation, i.e., the “most diffuse” member of 𝒞(Π¯xn)\mathscr{C}(\overline{\Pi}_{x^{n}}). Consequently, the probability-to-possibility transform in (1) applied to 𝖰^xn,α\widehat{\mathsf{Q}}_{x^{n},\alpha} should be a decent approximation of the exact possibilistic IM contour πxn\pi_{x^{n}}, at least in terms of their upper α\alpha-cuts. The illustrations presented in Section 4 below show that our proposed approximation is much better than “decent” in cases where we can visually compare with the exact IM contour.
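The following is a minimal Python sketch of Algorithm 1 for the default Gaussian family with mean θ̂_n and covariance ξ²J_n⁻¹; the contour evaluator pi_xn (itself typically Monte Carlo-based, as in Section 2) and the other names are assumptions made for illustration. Since f_α(ξ) is decreasing in ξ for this family (more spread means less mass on the fixed α-cut), the "+" sign is used in the Robbins–Monro update.

```python
import numpy as np

def fit_xi(pi_xn, theta_hat, J_inv, alpha=0.1, M=200,
           step=lambda t: 2.0 / (1.0 + t), eps=0.005, xi0=1.0,
           max_iter=500, rng=None):
    """Algorithm 1 for the Gaussian family N(theta_hat, xi^2 * J_inv):
    stochastic approximation targeting Q_{n,alpha}^xi{pi > alpha} = 1 - alpha."""
    rng = np.random.default_rng() if rng is None else rng
    xi = xi0
    for t in range(max_iter):
        # sample Theta_1, ..., Theta_M ~ N(theta_hat, xi^2 * J_inv)
        thetas = rng.multivariate_normal(theta_hat, (xi ** 2) * J_inv, size=M)
        # Monte Carlo estimate (8) of f_alpha at the current xi
        f_hat = np.mean([pi_xn(th) > alpha for th in thetas]) - (1.0 - alpha)
        # f_alpha is decreasing in xi, so the update uses the '+' sign
        xi_new = xi + step(t + 1) * f_hat
        if abs(xi_new - xi) < eps:
            return xi_new
        xi = xi_new
    return xi
```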

As emphasized above, a suitable choice of 𝒬var\mathscr{Q}^{\text{var}} depends on context. An important consideration in this choice is the ability to perform exact computations of the approximate contour determined by the inner approximation 𝖰^n,α\widehat{\mathsf{Q}}_{n,\alpha}. This is the case for the aforementioned normal variational family 𝒬var\mathscr{Q}^{\text{var}}, with mean θ^n\hat{\theta}_{n} and covariance ξ^2Jn1\hat{\xi}^{2}J_{n}^{-1}, as

\hat{\pi}_{x^{n}}(\theta)=1-G_{d}\{(\theta-\hat{\theta}_{n})^{\top}J_{n}(\theta-\hat{\theta}_{n})/\hat{\xi}^{2}\}, (9)

where GdG_{d} is the 𝖢𝗁𝗂𝖲𝗊(d){\sf ChiSq}(d) distribution function. This closed-form expression makes summaries of the contour, which are approximations of certain features (e.g., Choquet integrals) of the possibilistic IM Π¯xn\overline{\Pi}_{x^{n}}, relatively easy to obtain numerically.
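For completeness, a short sketch of how (9) might be evaluated once the fitted ξ̂ is available; the function and argument names are illustrative, and θ̂_n and J_n are assumed to be stored as NumPy arrays.

```python
import numpy as np
from scipy.stats import chi2

def contour_gauss(theta, theta_hat, J, xi_hat):
    """Closed-form approximate contour (9) under the fitted Gaussian family."""
    theta, theta_hat = np.asarray(theta, float), np.asarray(theta_hat, float)
    d = len(theta_hat)
    diff = theta - theta_hat
    q = diff @ J @ diff / xi_hat ** 2     # scaled quadratic form
    return 1.0 - chi2.cdf(q, df=d)
```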

4 Numerical illustrations

Our first goal here is to provide a proof-of-concept for the proposed approximation. Towards this, we present a few low-dimensional illustrations where we can visualize both the exact and approximate IM contours and directly assess the quality of the approximation. In all but Example 4 below, we use the normal variational family 𝒬^var with mean θ̂_n and covariance ξ²J_n⁻¹ as described above, with ξ to be determined. All of the examples display the variational-like IM approximation Q̂_{n,α} based on α = 0.1, M = 200 Monte Carlo samples, step sizes w_t = 2(1+t)⁻¹, and convergence threshold ε = 0.005.

Example 1.

Suppose the data is X𝖡𝗂𝗇(n,Θ)X\sim{\sf Bin}(n,\Theta) with probability mass function pθp_{\theta}. The exact IM possibility contour based on X=xX=x is

\pi_{x}(\theta)=\sum_{s=0}^{n}1\{R(s,\theta)\leq R(x,\theta)\}\,p_{\theta}(s), (10)

where R(s,θ)R(s,\theta) is the binomial relative likelihood of θ\theta for ss successes:

R(s,\theta)=\Bigl(\frac{n\theta}{s}\Bigr)^{s}\Bigl(\frac{n-n\theta}{n-s}\Bigr)^{n-s},\quad\theta\in[0,1].

The right-hand side of (10) can be evaluated numerically without Monte Carlo. This and the contour corresponding to the proposed Gaussian-based variational approximation are shown in Figure 1(a) for n=15n=15 and x=6x=6 successes. Note that the two contours closely agree, especially at the level α=0.1\alpha=0.1 specifically targeted.
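As a concrete illustration, the finite sum in (10) might be coded as follows; this is a sketch assuming θ ∈ (0,1), with the boundary terms s = 0 and s = n handled by the usual convention that they contribute nothing to the log relative likelihood.

```python
import numpy as np
from scipy.stats import binom

def rel_lik_binom(s, theta, n):
    """Binomial relative likelihood R(s, theta); terms with s = 0 or s = n vanish."""
    s = np.asarray(s, dtype=float)
    t1 = np.where(s > 0, s * np.log(n * theta / np.where(s > 0, s, 1.0)), 0.0)
    t2 = np.where(s < n, (n - s) * np.log(n * (1 - theta) / np.where(s < n, n - s, 1.0)), 0.0)
    return np.exp(t1 + t2)

def contour_binom(theta, x, n):
    """Exact IM contour (10): sum of p_theta(s) over s with R(s, theta) <= R(x, theta)."""
    s = np.arange(n + 1)
    r = rel_lik_binom(s, theta, n)
    return float(np.sum(binom.pmf(s, n, theta) * (r <= rel_lik_binom(x, theta, n))))
```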

Figure 1: Exact (black) and approximate (red) IM contours for (a) Example 1: Bernoulli, (b) Example 2: Bivariate normal, (c) Example 3: Logistic regression, and (d) Example 4: Multinomial. Panel (d) shows only the approximate contour (black), which is fully supported on the lower triangle corresponding to the probability simplex.
Example 2.

Suppose XnX^{n} consists of iid bivariate normal pairs with zero means and unit variances, with common density function

p_{\theta}(x)=\frac{1}{2\pi(1-\theta^{2})^{1/2}}\,\exp\Bigl\{-\frac{x_{1}^{2}-2\theta x_{1}x_{2}+x_{2}^{2}}{2(1-\theta^{2})}\Bigr\},\quad x=(x_{1},x_{2}),\quad\theta\in[-1,1].

Inference on the unknown correlation Θ is a surprisingly challenging problem (e.g., Basu, 1964; Reid, 2003; Martin and Liu, 2015a). Indeed, despite being a scalar exponential family model, it does not have a one-dimensional sufficient statistic and, moreover, there are a variety of different ancillary statistics on which to condition, leading to different solutions. From a computational perspective, it is worth noting that the maximum likelihood estimator is not available in closed-form—in particular, it is not the sample correlation coefficient. That this optimization is non-trivial adds to the burden of a naive Monte Carlo IM construction. Figure 1(b) shows the exact IM contour, based on a naive Monte Carlo implementation of (1), which is rather expensive, and the approximation for simulated data of size n = 50 with true correlation 0.5. The exact contour has some asymmetry that the normal approximation cannot perfectly accommodate, but it makes up for this imperfection with a slightly wider upper 0.1-level set.

Example 3.

The data presented in Table 8.4 of Ghosh et al., (2006) concerns the relationship between exposure to chloracetic acid and mouse mortality. A simple logistic regression model is to be fit in order to relate the binary death indicator (yy) with the levels of exposure (uu) to chloracetic acid for the dataset’s n=120n=120 mice. That is, XnX^{n} consists of independent pairs Xi=(Ui,Yi)X_{i}=(U_{i},Y_{i}) and a conditionally Bernoulli model for YiY_{i}, given UiU_{i}, with mass function

p_{\theta}(y\mid u)=F(\theta_{1}+\theta_{2}u)^{y}\,\{1-F(\theta_{1}+\theta_{2}u)\}^{1-y},\quad\theta=(\theta_{1},\theta_{2})\in\mathbb{R}^{2},

where F(z)=(1+ez)1F(z)=(1+e^{-z})^{-1} is the logistic distribution function. The corresponding likelihood cannot be maximized in closed-form, but doing so numerically is routine. The maximum likelihood estimator and the corresponding observed information matrix lead to the asymptotically valid inference reported by standard statistical software packages. For exact inference, however, the computational burden is heavier: evaluating the validified relative likelihood over a sufficiently fine grid of θ\theta values is rather expensive. Figure 1(c) presents the 0.10.1-level set of the exact IM possibility contour for the regression coefficients, based on a naive Monte Carlo implementation of (1), alongside the proposed variational approximation. The latter’s computational cost is a small fraction of the former’s, and yet the two contours line up almost perfectly.

Example 4.

Consider n iid data points sampled from {1, 2, ..., K} with unknown probabilities Θ = (Θ_1, ..., Θ_K) belonging to the probability simplex 𝕋 in ℝ^K. The frequency table, X = (X_1, ..., X_K), with X_k being the frequency of category k in the sample of size n, is a sufficient statistic, having a multinomial distribution with parameters (n, K, Θ); the probability mass function is

p_{\theta}(x)=\frac{n!}{\prod_{k=1}^{K}x_{k}!}\prod_{k=1}^{K}\theta_{k}^{x_{k}},\quad\theta\in\mathbb{T}.

Here we use a Dirichlet variational family 𝖰xnξ\mathsf{Q}_{x^{n}}^{\xi}, parametrized in terms of the mean vector m𝕋m\in\mathbb{T} and precision s>0s>0, with density function

q(\theta)\propto\prod_{k=1}^{K}\theta_{k}^{sm_{k}-1},\quad\theta\in\mathbb{T},

where, naturally, we take the mean to be the maximum likelihood estimator, θ^=n1X\hat{\theta}=n^{-1}X, and the precision to be nξn\xi, where ξ>0\xi>0 is to be determined. Of course, a Gaussian variational approximation could also be used here, but our use of the Dirichlet approximation aims to highlight our proposal’s flexibility. Figure 1(d) shows the approximate IM contour based on K=3K=3 and counts X=(8,10,7)X=(8,10,7). The exact IM contour is virtually impossible to evaluate, because naive Monte Carlo is slow, the contours are noisy when based on Monte Carlo sample sizes that are too small, and the discrete nature of the data gives it the unusual shape akin to the binomial plot in Figure 1(a). Here, however, we get a smooth contour approximation in a matter of seconds.
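For this example, the Monte Carlo estimate (8) under the Dirichlet family might look like the following sketch; pi_xn is an assumed user-supplied contour evaluator, and all observed counts are assumed positive (as in the illustration) so that the Dirichlet concentrations are valid.

```python
import numpy as np

def f_hat_dirichlet(xi, pi_xn, counts, alpha=0.1, M=200, rng=None):
    """Monte Carlo estimate (8) for the Dirichlet family with mean
    m = counts / n and precision s = n * xi (Example 4)."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    conc = (n * xi) * (counts / n)          # Dirichlet concentration s * m
    thetas = rng.dirichlet(conc, size=M)
    return np.mean([pi_xn(th) > alpha for th in thetas]) - (1.0 - alpha)
```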

Our second goal in this section is to provide a more in-depth illustration of the kind of analysis that can be carried out with the proposed variational-like approximate IM. We do this in the context of regression modeling for count data.

Example 5.

Poisson log-linear models are widely used to analyze the relationship between a count-based discrete response variable and a set of fixed explanatory variables; even if the explanatory variables are not fixed by design, it is rare that their distribution would depend on any of the relevant parameters, so they are ancillary statistics and it is customary to condition on their observed values. Let XiX_{i} represent the response variable, and zi1,,zipz_{i1},\ldots,z_{ip} denote the pp explanatory variables for the ithi^{th} observation in a sample, i=1,,ni=1,\ldots,n. The Poisson log-linear model assumes that Xi𝖯𝗈𝗂𝗌(Λi)X_{i}\sim{\sf Pois}(\Lambda_{i}), where

\log\Lambda_{i}=\Theta_{0}+\Theta_{1}z_{i1}+\Theta_{2}z_{i2}+\cdots+\Theta_{p}z_{ip}.

The goal is to quantify uncertainty about the regression coefficients Θ_0, Θ_1, ..., Θ_p. Here, we will achieve this through the IM framework.

Consider the data presented in Table 3.2 of Agresti, (2007), obtained from a study of the nesting habits of horseshoe crabs. In this study, each of the n=173n=173 female horseshoe crabs had a male attached to her nest, and the goal was to explore factors influencing whether additional males, referred to as satellites, were present nearby. The response variable XX is the number of satellites observed for each female crab. Here, we focus on evaluating two explanatory variables related to the female crab’s size that may influence this response: z1z_{1}, weight (kg), and z2z_{2}, shell width (cm).

A variational approximation of the IM contour was obtained from (9), with ξ estimated using M = 200 Monte Carlo samples, step sizes w_t = 2(1+t)⁻¹, and convergence threshold ε = 0.005. Perhaps the first question to consider is whether at least one of z_1 or z_2 has an influence on X. To address this, a marginal IM for (Θ_1, Θ_2) is constructed, as shown in Figure 2(a). Notably, the hypothesis “H: Θ_1 = Θ_2 = 0” has an upper probability close to zero, providing strong evidence that at least one of Θ_1 or Θ_2 differs from zero. To evaluate the impact of z_1 and z_2 individually, the respective marginal IMs for Θ_1 and Θ_2 are shown in Figure 2(b,c). While there is strong evidence supporting “Θ_1 > 0,” the hypothesis that “Θ_2 = 0” is highly plausible. Finally, given the apparent evidence for “Θ_1 > 0,” one might ask which hypotheses of the form “H_γ: Θ_1 > γ,” for γ > 0, are well supported. This question can be addressed using the marginal necessity measures for H_γ, as shown in Figure 2(d). We can see that “Θ_1 > 0.1” is well supported, indicating that an additional kilogram increases the average number of satellites per female crab by approximately 10%. Importantly, the IM’s uniform validity ensures that the probability that even one of the suggestions above is misleading is controllably small.

Figure 2: Results for the Poisson model in Example 5. Panels (a), (b), and (c) show the IM’s variational-like approximate joint contour for (Θ_1, Θ_2) and the marginal contours for Θ_1 and Θ_2, respectively. Panel (d) plots the necessity measure for hypotheses of the type Θ_1 > γ, as a function of γ ≥ 0.

5 Beyond basic variational-like IMs

Various tweaks to the basic procedure described in Section 3 might be considered in applications in order to speed up computations. If there is a computational bottleneck remaining in the proposal from Section 3, then it would have to be that the IM’s possibility contour must be evaluated at M-many points in each iteration of the stochastic approximation. While this is not impractically expensive in some applications, including those presented above, it could be more of a burden in other applications. A related challenge is that a scalar index ξ on the variational family, as we have focused on exclusively so far, may not be sufficiently flexible. Increasing the flexibility by considering a higher-dimensional ξ would likewise increase the computational burden, so care is needed. Here we aim to address both of the aforementioned challenges.

The particular modification that we consider here is best suited for cases where the variational family 𝒬var\mathscr{Q}^{\text{var}} satisfies the following property: for each ξΞ\xi\in\Xi, the 100(1α)100(1-\alpha)% credible set corresponding to 𝖰nξ\mathsf{Q}_{n}^{\xi} can be expressed (or at least neatly summarized) in closed-form. The idea to be presented here is more general, but, to keep the details as simple and concrete as possible, we will focus our presentation on the case of a Gaussian variational family. Then the credible sets are ellipsoids in dd-dimensional space.

As a generalization of the scalar-ξ\xi-indexed Gaussian family introduced in Section 3, let ξΞ=0d\xi\in\Xi=\mathbb{R}_{\geq 0}^{d} be a dd-vector index and take 𝒬var\mathscr{Q}^{\text{var}} to be the dd-dimensional Gaussian family with mean vector θ^n\hat{\theta}_{n}, the maximum likelihood estimator, and covariance matrix Jn(ξ)1J_{n}(\xi)^{-1} defined as follows: take the eigen-decomposition of the observed Fisher information matrix JnJ_{n} as Jn=UΨUJ_{n}=U\Psi U^{\top}, and then set

J_{n}(\xi)=U\,\text{diag}(\xi^{-1})\,\Psi\,\text{diag}(\xi^{-1})\,U^{\top}.

That is, the components of ξ\xi act as multipliers on the singular values of Jn1J_{n}^{-1}. This indeed generalizes our previously-considered Gaussian approximation family since, if ξ\xi is a scalar, then we recover the previous covariance matrix: Jn(ξ)=Jn/ξ2J_{n}(\xi)=J_{n}/\xi^{2}. Let Kn,αξK_{n,\alpha}^{\xi} denote the 100(1α)100(1-\alpha)% credible set corresponding to the distribution 𝖰n,αξ𝒬var\mathsf{Q}_{n,\alpha}^{\xi}\in\mathscr{Q}^{\text{var}}. In the present dd-dimensional Gaussian case, this is an ellipsoid in d\mathbb{R}^{d}, i.e.,

K_{n,\alpha}^{\xi}=\{\theta\in\mathbb{R}^{d}:(\theta-\hat{\theta}_{n})^{\top}J_{n}(\xi)\,(\theta-\hat{\theta}_{n})\leq\chi_{d}^{2}(1-\alpha)\},

where χd2(p)\chi_{d}^{2}(p) is the pthp^{\text{th}} quantile of the 𝖢𝗁𝗂𝖲𝗊(d){\sf ChiSq}(d) distribution. Then we know that 𝖰n,αξ\mathsf{Q}_{n,\alpha}^{\xi} assigns at least probability 1α1-\alpha to the IM’s confidence set Cα(xn)C_{\alpha}(x^{n}) if

K_{n,\alpha}^{\xi}\supseteq C_{\alpha}(x^{n}),

or, equivalently, if

\sup_{\theta\not\in K_{n,\alpha}^{\xi}}\pi_{n}(\theta)\leq\alpha.

For the Gaussian family of approximations we are considering here, the mode of πn\pi_{n}, the maximum likelihood estimator, is in the interior of each credible set, so the behavior of πn\pi_{n} outside Kn,αξK_{n,\alpha}^{\xi} is determined by its behavior on the boundary. Moreover, since equality in the above display implies a near-perfect match between the IM’s confidence sets and the posited Gaussian credible sets, a reasonable goal is to find a root to the function

g_{\alpha}(\xi):=\max_{\theta\in\text{bdry}(K_{n,\alpha}^{\xi})}\pi_{n}(\theta)-\alpha,

where bdry(A)\text{bdry}(A) denotes the boundary of a set AA. In designing an iterative algorithm that makes steps towards this root, it is desirable to allow for steps of different magnitudes depending on the direction, and this requires care. For our present Gaussian approximation family, a natural—albeit minimal—representation of the boundary of the credible region Kn,αξK_{n,\alpha}^{\xi} is given by the collection of vectors

\vartheta_{s}^{\xi,\pm}:=\hat{\theta}_{n}\pm\bigl\{\chi_{d}^{2}(1-\alpha)\,\xi_{s}/\psi_{s}\bigr\}^{1/2}\,u_{s},\quad s=1,\ldots,d, (11)

where (ψ_s, u_s) is the eigenvalue–eigenvector pair corresponding to the s-th largest eigenvalue. Then define the vector-valued function ĝ_α : Ξ → ℝ^d with components

\hat{g}_{\alpha,s}(\xi)=\max\{\pi_{n}(\vartheta_{s}^{\xi,+}),\pi_{n}(\vartheta_{s}^{\xi,-})\}-\alpha,\quad s=1,\ldots,d. (12)

At least intuitively, ĝ_{α,s}(ξ) being less than or greater than 0 determines whether the credible set K^ξ_{n,α} is too large or too small, respectively, in the u_s-direction. From here, we can apply the same stochastic approximation procedure to determine a sequence (ξ^(t) : t ≥ 1), this time of d-vectors, i.e.,

\xi_{s}^{(t+1)}=\xi_{s}^{(t)}+w_{t+1}\,\hat{g}_{\alpha,s}(\xi^{(t)}),\quad s=1,\ldots,d,\quad t\geq 0,

that converges to the root of ĝ_α, which we expect to be close to the root of g_α. The details of this new but familiar strategy are presented in Algorithm 2. The chief advantage of this new strategy compared to that summarized in Algorithm 1 is computational savings: each iteration of stochastic approximation in Algorithm 2 only requires (Monte Carlo) evaluation of the IM contour at 2d many points, which would typically be far fewer than the number of Monte Carlo samples M in Algorithm 1 needed to get a good approximation of Q^ξ_{n,α}{π_n(Θ) > α}.

requires: data x^n, eigen-pairs (ψ_s, u_s), and ability to evaluate π_n = π_{x^n};
initialize: α-level, guess ξ^(0), step size sequence (w_t), and threshold ε > 0;
set: stop = FALSE, t = 0;
while !stop do
       construct the representative points {ϑ_s^{ξ^(t),±} : s = 1, ..., d} as in (11);
       evaluate ĝ_{α,s}(ξ^(t)) for s = 1, ..., d as in (12);
       update ξ_s^(t+1) = ξ_s^(t) + w_{t+1} ĝ_{α,s}(ξ^(t)) for s = 1, ..., d;
       if max_s |ξ_s^(t+1) − ξ_s^(t)| < ε then
              ξ̂ = ξ^(t+1);
              stop = TRUE;
       else
              t ← t + 1;
       end if
end while
return ξ̂;
Algorithm 2: α-specific variational approximation of the IM
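A minimal Python sketch of Algorithm 2 for the Gaussian family, with boundary points constructed as in (11) and componentwise updates as in (12), is given below; the contour evaluator pi_xn is again an assumed user-supplied (typically Monte Carlo-based) function, and the observed information J is assumed positive definite.

```python
import numpy as np
from scipy.stats import chi2

def fit_xi_vector(pi_xn, theta_hat, J, alpha=0.1, xi0=None,
                  step=lambda t: 2.0 / (1.0 + t), eps=0.005, max_iter=500):
    """Algorithm 2: componentwise tuning of the d-vector xi for the Gaussian
    family with mean theta_hat and covariance J_n(xi)^{-1}."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    d = len(theta_hat)
    psi, U = np.linalg.eigh(J)              # eigen-pairs of the observed information
    cutoff = chi2.ppf(1 - alpha, df=d)
    xi = np.ones(d) if xi0 is None else np.array(xi0, dtype=float)
    for t in range(max_iter):
        g_hat = np.empty(d)
        for s in range(d):
            # representative boundary points (11) along the s-th eigenvector
            radius = np.sqrt(cutoff * xi[s] / psi[s])
            plus = theta_hat + radius * U[:, s]
            minus = theta_hat - radius * U[:, s]
            g_hat[s] = max(pi_xn(plus), pi_xn(minus)) - alpha   # as in (12)
        xi_new = xi + step(t + 1) * g_hat
        if np.max(np.abs(xi_new - xi)) < eps:
            return xi_new
        xi = xi_new
    return xi
```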

We will illustrate this new version of the Gaussian variational family and approximation algorithm with two examples. The first is in the relatively simple two-parameter gamma model; the second is a relatively high-dimensional model involving a penalty, intended to serve as a gateway into IM solutions to high-dimensional problems.

Example 6.

Suppose X^n consists of an iid sample of size n from a gamma distribution with shape parameter Θ_1 and scale parameter Θ_2. We simulated data of size n = 25, with Θ_1 = 7 and Θ_2 = 3, and we plot the approximate IM contour for (Θ_1, Θ_2) in Figure 3(a). This contour is constructed by first building a Gaussian approximation of the contour for (log Θ_1, log Θ_2) and then mapping the approximation back to the (Θ_1, Θ_2) space. The parameters’ non-negativity constraints are gone when mapped to the log-scale, which improves the quality of the Gaussian approximation; the approximation is worse when applied directly on the (θ_1, θ_2) space. For comparison, the contour of the relative likelihood is also shown, and its similarity to the Gaussian contour implies that the latter is a good approximation to the exact IM contour, which is rather expensive to compute. Indeed, Example 1 in Martin (2018) considers exactly the same simulation setting and obtains a banana-shaped contour similar to the one in Figure 3(a).

We also repeated the above example 1000 times and Figure 3(b) shows the distribution function of the random variable πXn(Θ)\pi_{X^{n}}(\Theta), based on both the exact contour and the Gaussian approximation. That with the exact contour is the 𝖴𝗇𝗂𝖿(0,1){\sf Unif}(0,1) distribution function (modulo Monte Carlo sampling variation) and that based on the Gaussian approximation is empirically indistinguishable from 𝖴𝗇𝗂𝖿(0,1){\sf Unif}(0,1) across the full range.

Figure 3: Results for the gamma model in Example 6. Panel (a) shows the approximate joint IM contour for (Θ_1, Θ_2) (constructed on the log-shape and log-scale) for a single data set; gray in the background shows the relative likelihood contours for comparison. Panel (b) shows the distribution function of π_{X^n}(Θ) at the true Θ for the approximate contour (black) and exact contour (gray).
Example 7.

The simplest and most canonical high-dimensional inference example is the many-normal-means problem, which goes back to classical papers such as Stein (1956), James and Stein (1961), Brown (1971), and others. The model assumes that the data X^n consists of n independent—but not iid—observations, with X_i ~ N(Θ_i, σ²), where σ² is assumed to be known but the vector Θ = (Θ_1, ..., Θ_n) is unknown and to be inferred. Roughly, the message in the references above is that the best unbiased estimator θ̂ = X^n, also the maximum likelihood estimator, is inadmissible (under squared error loss). This result sparked research efforts on penalized estimation, including the now-famous lasso (e.g., Tibshirani, 1996, 2011). Following along these lines, and in the spirit of the examples in Section 6 above, we propose here to work with a relative penalized likelihood

R_{\lambda}^{\text{\sc pen}}(x^{n},\theta)=\frac{L_{x^{n}}^{\text{\sc pen},\lambda}(\theta)}{\max_{\vartheta}L_{x^{n}}^{\text{\sc pen},\lambda}(\vartheta)},\quad\theta\in\mathbb{T}=\mathbb{R}^{n},

where

L_{x^{n}}^{\text{\sc pen},\lambda}(\theta)=L_{x^{n}}(\theta)\,e^{-\lambda\|\theta\|_{1}}=\exp\bigl\{-\tfrac{1}{2\sigma^{2}}\|x^{n}-\theta\|_{2}^{2}-\lambda\|\theta\|_{1}\bigr\},

is the penalized likelihood function, with ‖·‖_1 and ‖·‖_2 the ℓ1- and ℓ2-norms, respectively, and λ > 0 a tuning parameter to be discussed below. For this relatively simple high-dimensional model, the optimization problem in the denominator above, whose solution is the lasso estimator, can be solved in closed-form: if the vector

\hat{\theta}=\arg\max_{\vartheta}L_{x^{n}}^{\text{\sc pen},\lambda}(\vartheta)

is the solution, then its entries are given by

\hat{\theta}_{i}=\text{sign}(x_{i})\,\max(|x_{i}|-\lambda,0),\quad i=1,\ldots,n.

That is, the lasso penalization shrinks the maximum likelihood estimator XiX_{i} of Θi\Theta_{i} closer to 0 and, in fact, if the magnitude of XiX_{i} is less than the specified λ\lambda, then the estimator is exactly 0. Of course, the choice of λ\lambda is critical, and here we will take it to be λ=(σ2logn)1/2\lambda=(\sigma^{2}\log n)^{1/2}. In any case, we can construct a possibilistic IM contour as

\pi_{x^{n}}^{\text{\sc pen},\lambda}(\theta)=\mathsf{P}_{\theta}\{R_{\lambda}^{\text{\sc pen}}(X^{n},\theta)\leq R_{\lambda}^{\text{\sc pen}}(x^{n},\theta)\},\quad\theta\in\mathbb{T}.

Even though the relative penalized likelihood is available in closed-form, Monte Carlo methods are needed to evaluate this contour. And when the dimension nn is even moderately large, carrying out these computations on a grid in nn-dimensional space that is sufficiently fine to obtain, say, confidence sets for Θ\Theta is practically impossible. But the variational approximation described above offers a computationally more efficient alternative to naive Monte Carlo that can handle moderate to large nn.
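To fix ideas, a short sketch of the penalized quantities used here—the soft-thresholding estimator above and the relative penalized likelihood—is given below; the names are illustrative, and the code takes σ² = 1, which is the case in which the closed-form threshold stated in the text applies exactly.

```python
import numpy as np

def lasso_means(x, lam):
    """Penalized MLE (soft-thresholding), following the closed form in the text;
    as written this corresponds to sigma^2 = 1."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def rel_pen_lik(x, theta, lam):
    """Relative penalized likelihood R_lambda^pen(x, theta), with sigma^2 = 1."""
    x, theta = np.asarray(x, float), np.asarray(theta, float)
    def log_pen_lik(t):
        return -0.5 * np.sum((x - t) ** 2) - lam * np.sum(np.abs(t))
    return np.exp(log_pen_lik(theta) - log_pen_lik(lasso_means(x, lam)))

# e.g., lam = np.sqrt(np.log(len(x))) when sigma^2 = 1, as suggested in the text
```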

Here we follow the strategy suggested above, i.e., with a Gaussian approximation having mean θ̂ equal to the lasso or maximum penalized likelihood estimator and n×n covariance matrix J_n(ξ)⁻¹ indexed by the n-vector ξ; in this case, the initial J_n is taken to be the no-penalty information matrix J_n = σ⁻²I_n, which is proportional to the identity matrix. The intuition here is that the coordinate-specific adjustment factors, ξ_i, will allow the Gaussian approximation to adapt in some way to sparsity in the true signal Θ. For our illustration, we consider n = 50 and a true Θ whose first five entries are equal to 5 and last 45 entries equal to 0, i.e., only 10% of the coordinates of X contain a signal; the remaining 90% is just noise. We also fixed α = 0.1 for the approximation. For a single data set, Figure 4(a) displays the estimates ξ̂ obtained at convergence of the proposed stochastic approximation updates. The points in black correspond to signals (non-zero true means) and the points in gray correspond to noise. The key observation is that ξ values corresponding to signal tend to be larger than those corresponding to noise; there is virtually no variability in the signal cases, but substantial variability in the noise cases. That the ξ values tend to be smaller in the noise case is expected, since much less spread in the IM’s possibility contour is needed around those means that are clearly 0. We repeated the above simulation 1000 times and plotted the distribution function of the exact (using naive Monte Carlo) and approximate (using the Gaussian variational family) IM contour at the true Θ. Again, the exact contour at Θ is Unif(0,1) distributed, and the results are shown in Figure 4(b). The Gaussian approximation is only designed to be calibrated at level α = 0.1, which it clearly achieves; it is a bit too aggressive at lower levels and conservative at the higher levels. Further investigation into this proposal in high-dimensional problems will be reported elsewhere.

Figure 4: Results from the many-normal-means model in Example 7. Panel (a) plots the estimated ξ values based on a single data set; black and gray horizontal lines correspond to the averages of the black/signal and gray/noise points. Panel (b) shows the distribution function of π_{X^n}(Θ) at the true Θ for the approximate contour (black) and exact contour (gray).

6 Nuisance parameter problems

6.1 Parametric

The perspective taken above was that there is an unknown model parameter Θ\Theta and the primary goal is to quantify uncertainty about Θ\Theta in its entirety given the observed data Xn=xnX^{n}=x^{n} from the model. Of course, quantification of uncertainty about Θ\Theta implies quantification of uncertainty about any feature Φ=g(Θ)\Phi=g(\Theta) as described in Section 2 above. If, however, the sole objective is quantifying uncertainty about a particular Φ=g(Θ)\Phi=g(\Theta), then it is natural to ask if we can do better than first quantifying uncertainty about the full Θ\Theta and then deriving the corresponding results for Φ\Phi. There are opportunities for efficiency gain, but this requires eliminating the nuisance parameters, i.e., those aspects of Θ\Theta that are, in some sense, complementary or orthogonal to Φ\Phi. One fairly general strategy for eliminating nuisance parameters is profiling (e.g., Martin, 2023b ; Severini and Wong, 1992; Sprott, 2000), described below.

It is perhaps no surprise that, while use of the relative likelihood in the IM construction presented in Section 2 is very natural and, in some sense, “optimal,” it is not the only option. For cases involving nuisance parameters, one strategy is to replace the relative likelihood in (1) with a surrogate, namely, the relative profile likelihood

Rpr(xn,ϕ)=supθ:g(θ)=ϕLxn(θ)supϑ𝕋Lxn(ϑ).R^{\text{\sc pr}}(x^{n},\phi)=\frac{\sup_{\theta:g(\theta)=\phi}L_{x^{n}}(\theta)}{\sup_{\vartheta\in\mathbb{T}}L_{x^{n}}(\vartheta)}. (13)

Note that this is not based on a genuine likelihood function, i.e., there is no statistical model for XnX^{n}, depending solely on the unknown parameter Φ\Phi, for which ϕsupθ:g(θ)=ϕLxn(θ)\phi\mapsto\sup_{\theta:g(\theta)=\phi}L_{x^{n}}(\theta) is the corresponding likelihood function. Despite not being a genuine likelihood, the profile likelihood typically behaves very much like the usual likelihood asymptotically and, therefore, “may be used to a considerable extent as a full likelihood” (Murphy and van der Vaart, 2000, p. 449). In our present case, if we proceed with the relative profile likelihood in place of the relative likelihood in (1), then the marginal possibilistic IM has contour function

πxnpr(ϕ)=supθ:g(θ)=ϕ𝖯θ{Rpr(Xn,ϕ)Rpr(xn,ϕ)},ϕg(𝕋).\pi_{x^{n}}^{\text{\sc pr}}(\phi)=\sup_{\theta:g(\theta)=\phi}\mathsf{P}_{\theta}\bigl{\{}R^{\text{\sc pr}}(X^{n},\phi)\leq R^{\text{\sc pr}}(x^{n},\phi)\bigr{\}},\quad\phi\in g(\mathbb{T}).

The advantage of this construction is that it tends to be more efficient than the naive elimination of nuisance parameters presented in Section 2; see, e.g., Martin and Williams, (2024). The outer supremum above is the result of Φ\Phi not being the full parameter of the posited model; see Martin, 2023b (Sec. 3.2) for more details. It is often the case that the probability on the right-hand side above is approximately constant in θ\theta with g(θ)=ϕg(\theta)=\phi, but one cannot count on this—the supremum needs to be evaluated, unfortunately, in order to ensure the IM's strong validity properties hold marginally.
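To fix ideas computationally, here is a minimal, model-agnostic Python sketch of the naive Monte Carlo evaluation of this marginal contour at a single ϕ. The helpers sample_data (draws a data set from the model at a given θ) and R_pr (evaluates the relative profile likelihood in (13)) are hypothetical placeholders for model-specific code, and the outer supremum is approximated over a finite grid of θ values satisfying g(θ)=ϕ.

import numpy as np

def marginal_contour(phi, x, theta_grid, sample_data, R_pr, M=500, rng=None):
    # Naive Monte Carlo approximation of pi^pr(phi): for each theta in the grid
    # (all with g(theta) = phi), estimate P_theta{R^pr(X^n, phi) <= R^pr(x^n, phi)},
    # then take the maximum over the grid.
    rng = np.random.default_rng() if rng is None else rng
    r_obs = R_pr(x, phi)
    sup_prob = 0.0
    for theta in theta_grid:
        hits = sum(R_pr(sample_data(theta, len(x), rng), phi) <= r_obs
                   for _ in range(M))
        sup_prob = max(sup_prob, hits / M)
    return sup_prob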

For concreteness, we will focus on a deceptively challenging problem, namely, efficient inference on the mean of a gamma distribution. Roughly, the gamma mean is a highly non-linear function of the shape and scale parameters, which makes the classical, first-order asymptotic approximations of the sampling distribution of the maximum likelihood estimator rather poor in finite samples. For this reason, the gamma mean problem has received considerable attention, with focus on deriving asymptotic approximations with higher-order accuracy; we refer the reader to Fraser et al., (1997) for further details. Martin and Liu, 2015c presented an exact IM solution to the gamma mean problem and, more recently, a profile-based possibilistic IM solution was presented in Martin, 2023b (Example 6) and shown to be superior to various existing methods. Here our focus is on demonstrating the quality of the variational approximation in this new context.

Example 8.

Let the gamma model be indexed by θ=(θ1,θ2)\theta=(\theta_{1},\theta_{2}), where θ1\theta_{1} and θ2\theta_{2} represent the (positive) shape and scale parameters, respectively. There are no closed-form expressions for the maximum likelihood estimators, θ^1\hat{\theta}_{1} and θ^2\hat{\theta}_{2}, in the gamma model, but one can readily maximize the likelihood numerically to find them; one can also obtain the observed Fisher information matrix JJ, numerically or analytically. For the profile likelihood, it may help to reparametrize the model in terms of the mean parameter Φ=Θ1Θ2\Phi=\Theta_{1}\Theta_{2} and the shape parameter Θ1\Theta_{1}. Write the density in this new parametrization as

pθ1,ϕ(x)=1Γ(θ1)(ϕ/θ1)θ1xθ11eθ1x/ϕ,x>0.p_{\theta_{1},\phi}(x)=\frac{1}{\Gamma(\theta_{1})\,(\phi/\theta_{1})^{\theta_{1}}}\,x^{\theta_{1}-1}\,e^{-\theta_{1}x/\phi},\quad x>0.

In this form, the likelihood based on data XnX^{n} could be maximized numerically over θ1\theta_{1} for any fixed ϕ\phi, thus yielding the (relative) profile likelihood.
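To make the profiling step concrete, the following Python sketch computes the relative profile likelihood of the gamma mean at a fixed ϕ; the log-scale parametrization and optimizer settings are our own choices for illustration, not necessarily those used to produce Figure 5.

import numpy as np
from scipy.stats import gamma
from scipy.optimize import minimize

def neg_loglik(pars, x):
    # Gamma negative log-likelihood in the (log shape, log mean) parametrization;
    # the scale is mean / shape, matching the density displayed above.
    shape, mean = np.exp(pars)
    return -np.sum(gamma.logpdf(x, a=shape, scale=mean / shape))

def relative_profile_likelihood(phi, x):
    # Unrestricted MLE of (shape, mean), then a profile fit with the mean held
    # fixed at phi; R^pr is the ratio of the two maximized likelihoods.
    full = minimize(neg_loglik, x0=np.log([1.0, np.mean(x)]), args=(x,))
    prof = minimize(lambda s: neg_loglik(np.array([s[0], np.log(phi)]), x),
                    x0=np.array([0.0]))
    return np.exp(full.fun - prof.fun)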

Fraser et al., (1997) presented an example where n=20n=20 mice were exposed to 240 rads of gamma radiation and their survival times recorded. A plot of the exact profiling-based marginal possibilistic IM contour is shown (black line) in Figure 5. This is relatively expensive to compute since, at each ϕ\phi point on the grid, the Monte Carlo approximation involves repeated numerical optimization over θ1\theta_{1}. For comparison, we consider a Gaussian possibility contour with mean ϕ^=θ^1θ^2\hat{\phi}=\hat{\theta}_{1}\hat{\theta}_{2} and variance ξ2g˙(θ^)J1g˙(θ^)\xi^{2}\,\dot{g}(\hat{\theta})^{\top}J^{-1}\dot{g}(\hat{\theta}), where g(θ)=θ1θ2g(\theta)=\theta_{1}\theta_{2} and the gradient is g˙(θ)=(θ2,θ1)\dot{g}(\theta)=(\theta_{2},\theta_{1})^{\top}. Figure 5 shows the Gaussian approximation with ξ=1\xi=1 as discussed in Martin and Williams, (2024) and with ξ^=1.28\hat{\xi}=1.28 as identified by our variational approximation from Section 5. This approximation takes only a fraction of a second to obtain and, as intended, it closely matches the exact contour at the target level α=0.1\alpha=0.1 on the right (heavy) side and is a bit conservative on the left (thin) side. Clearly the basic large-sample Gaussian approximation is too narrow in the right tail, confirming the claim above that first-order asymptotic theory provides relatively poor approximations for smaller sample sizes. Our variational approximation, on the other hand, appropriately adjusts to match the exact contour in certain places while being a bit cautious or conservative in others.
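A minimal sketch of the corresponding Gaussian approximation, assuming the univariate Gaussian possibility contour is the usual probability-to-possibility transform and that the maximum likelihood estimate and observed information come from a numerical fit like the one above:

import numpy as np
from scipy.stats import norm

def gaussian_marginal_contour(phi, theta_hat, J_inv, xi=1.0):
    # Delta-method approximation for Phi = theta1 * theta2: mean phi_hat and
    # standard deviation xi * sqrt(g_dot' J^{-1} g_dot), with g_dot = (theta2, theta1).
    phi_hat = theta_hat[0] * theta_hat[1]
    g_dot = np.array([theta_hat[1], theta_hat[0]])
    sd = xi * np.sqrt(g_dot @ J_inv @ g_dot)
    # Univariate Gaussian possibility contour, 2{1 - Phi(|z|)}
    return 2.0 * norm.sf(np.abs(phi - phi_hat) / sd)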

Figure 5: Possibility contour plots for the gamma model in Example 8. Black line shows the exact profile-based marginal IM contour, dashed red is the naive Gaussian approximation (ξ=1\xi=1), and solid red is the Gaussian approximation with the strategically-chosen scale factor ξ^=1.28\hat{\xi}=1.28 based on α=0.1\alpha=0.1.

6.2 Nonparametric

A nonparametric problem is one in which the underlying distribution 𝖯\mathsf{P} is not assumed to have a particular form indexed by a finite-dimensional parameter. In some applications, it is 𝖯\mathsf{P} itself (or, e.g., its density function) that is the quantity of interest and, in other cases, there is some (finite-dimensional) feature or functional Θ\Theta of 𝖯\mathsf{P} that is to be inferred. Our focus here is on the latter case, so it also fits the general mold of a problem involving nuisance parameters, since whatever is left of 𝖯\mathsf{P} once Θ\Theta has been accounted for is considered "nuisance" and is to be eliminated.

At least in principle, one can approach this problem in a manner similar to that in the parametric case above, i.e., by profiling out those nuisance parts of 𝖯\mathsf{P}. Recall that the goal of profiling is to reduce the dimension, so that the compatibility of data with candidate values of the quantity of interest can be directly assessed. Since it is typically the case that the quantity Θ\Theta of interest has some real-world interpretation, there is an opportunity to leverage that interpretation for the aforementioned compatibility assessment without the need for profiling. This is the approach taken in Cella and Martin, (2022), building on the classical work in M-estimation (e.g., Huber, 1981; Stefanski and Boos, 2002) and less-classical work on Gibbs posteriors (e.g., Martin and Syring, 2022; Grünwald and Mehta, 2020; Bissiri et al., 2016; Zhang, 2006), which we summarize briefly below.

Let data Xn=(X1,,Xn)X^{n}=(X_{1},\ldots,X_{n}) consist of independent and identically distributed components with Xi𝖯X_{i}\sim\mathsf{P}, where nothing is known or assumed about 𝖯\mathsf{P}. In this more general case, the unknown quantity of interest Θ=Θ(𝖯)\Theta=\Theta(\mathsf{P}) is a functional of the underlying distribution. Examples include quantiles and moments of 𝖯\mathsf{P}. Suppose Θ\Theta can be expressed as the minimizer of a risk or expected loss function. That is, assume that there exists a loss function (x,θ)lossθ(x)(x,\theta)\mapsto\text{\sc loss}_{\theta}(x) such that

Θ=argminθ𝕋ρ(θ),\Theta=\arg\min_{\theta\in\mathbb{T}}\rho(\theta),

where ρ(θ)=lossθ(x)𝖯(dx)\rho(\theta)=\int\text{\sc loss}_{\theta}(x)\,\mathsf{P}(dx) is the risk or expected loss function. An empirical version of the risk function replaces expectation with respect to 𝖯\mathsf{P} by averaging over observations:

ρ^xn(θ)=1ni=1nlossθ(xi),θ𝕋.\hat{\rho}_{x^{n}}(\theta)=\frac{1}{n}\sum_{i=1}^{n}\text{\sc loss}_{\theta}(x_{i}),\quad\theta\in\mathbb{T}.

Then the empirical risk minimizer, θ^xn=argminθρ^xn(θ)\hat{\theta}_{x^{n}}=\arg\min_{\theta}\hat{\rho}_{x^{n}}(\theta), is the natural estimator of the risk minimizer Θ\Theta. From here, we can mimic the relative likelihood construction with an empirical-risk version

Rer(xn,θ)=exp[{ρ^xn(θ)ρ^xn(θ^xn)}],θ𝕋.R^{\text{\sc er}}(x^{n},\theta)=\exp\bigl{[}-\{\hat{\rho}_{x^{n}}(\theta)-\hat{\rho}_{x^{n}}(\hat{\theta}_{x^{n}})\}\bigr{]},\quad\theta\in\mathbb{T}.

Clearly, values of θ\theta such that Rer(xn,θ)R^{\text{\sc er}}(x^{n},\theta) is close to 1 are more compatible with data xnx^{n} than values of θ\theta such that it is close to 0. Note that no profiling over the infinite-dimensional nuisance parameters is required in evaluating RerR^{\text{\sc er}}.

The same validification step can be taken to construct a possibilistic IM:

πxner(θ)=sup𝖯:𝖯θ𝖯{Rer(Xn,θ)Rer(xn,θ)},θ𝕋.\pi^{\text{\sc er}}_{x^{n}}(\theta)=\sup_{\mathsf{P}:\mathsf{P}\rightsquigarrow\theta}\mathsf{P}\bigl{\{}R^{\text{\sc er}}(X^{n},\theta)\leq R^{\text{\sc er}}(x^{n},\theta)\bigr{\}},\quad\theta\in\mathbb{T}. (14)

The outer supremum, analogous to the one in the profile-based marginal contour of Section 6.1, is over all those probability distributions 𝖯\mathsf{P} such that the relevant feature Θ\Theta of 𝖯\mathsf{P} takes value θ\theta. This supremum appears because the distribution of Rer(Xn,θ)R^{\text{\sc er}}(X^{n},\theta) depends on the underlying 𝖯\mathsf{P}, which is unknown. This puts direct evaluation of the IM contour by naive Monte Carlo out of reach. Fortunately, validity only requires that a certain calibration property holds under the one true 𝖯\mathsf{P}, which provides a shortcut. Cella and Martin, (2022) proposed to replace "iid sampling from 𝖯\mathsf{P} for all θ\theta-compatible 𝖯\mathsf{P}" with iid sampling from the empirical distribution, which is a good estimator of the one true 𝖯\mathsf{P}. This amounts to using the bootstrap (e.g., Efron, 1979; Efron and Tibshirani, 1993; Davison and Hinkley, 1997) to approximate the above contour, and Cella and Martin prove that the corresponding IM is asymptotically valid. Here, we will demonstrate that the proposed variational-like IMs can provide a good approximation to this bootstrap-based contour.
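A minimal Python sketch of this bootstrap approximation, for a scalar parameter and a user-supplied loss function, is given below; the helper names and the use of scipy's scalar minimizer are our illustrative choices rather than the exact implementation in Cella and Martin, (2022).

import numpy as np
from scipy.optimize import minimize_scalar

def emp_risk(theta, x, loss):
    # Empirical risk: average loss over the observed data.
    return np.mean(loss(x, theta))

def R_er(x, theta, loss):
    # Empirical-risk analogue of the relative likelihood.
    opt = minimize_scalar(lambda t: emp_risk(t, x, loss))
    return np.exp(-(emp_risk(theta, x, loss) - opt.fun))

def bootstrap_contour(theta, x, loss, B=500, rng=None):
    # Fraction of bootstrap resamples whose R^er value falls at or below the
    # value computed from the observed data, approximating the contour in (14).
    rng = np.random.default_rng() if rng is None else rng
    r_obs = R_er(x, theta, loss)
    r_boot = [R_er(rng.choice(x, size=len(x), replace=True), theta, loss)
              for _ in range(B)]
    return np.mean(np.array(r_boot) <= r_obs)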

Example 9.

Suppose interest is in the τ\tau-th quantile of a distribution 𝖯\mathsf{P}, i.e., the point Θ=Θ(τ)\Theta=\Theta^{(\tau)} such that 𝖯(XΘ(τ))=τ\mathsf{P}(X\leq\Theta^{(\tau)})=\tau, for τ(0,1)\tau\in(0,1). The key component in the nonparametric IM construction above is selecting an appropriate loss function. For quantile estimation, it is well known that a suitable loss function is given by

lossθ(x)=12{(|xθ|x)+(12τ)θ}.\text{\sc loss}_{\theta}(x)=\tfrac{1}{2}\bigl{\{}(|x-\theta|-x)+(1-2\tau)\theta\bigr{\}}.

Figure 6(a) shows the bootstrap approximation of the IM contour in (14), computed using 500 bootstrap samples, for a dataset of size n=100n=100 and τ=0.25\tau=0.25. We chose to work with the normal family for the variational approximation, due to the well-known asymptotic normality of the empirical risk minimizer (sample quantile) θ^n(τ)\hat{\theta}_{n}^{(\tau)}:

n1/2(θ^n(τ)Θ(τ))𝖭(0,τ(1τ)p(Θ(τ))2),in distribution,n^{1/2}(\hat{\theta}_{n}^{(\tau)}-\Theta^{(\tau)})\to{\sf N}\left(0,\frac{\tau(1-\tau)}{p(\Theta^{(\tau)})^{2}}\right),\quad\text{in distribution},

where pp represents the common density function associated with the underlying true distribution 𝖯\mathsf{P}. More specifically, our chosen Gaussian variational family 𝒬var\mathscr{Q}^{\text{var}} has mean θ^n(0.25)\hat{\theta}_{n}^{(0.25)} and variance

0.25×0.75nξ2p^(θ^n(0.25))2,\frac{0.25\times 0.75}{n\,\xi^{2}\,\hat{p}(\hat{\theta}_{n}^{(0.25)})^{2}},

where p^\hat{p} denotes a kernel density estimate of pp based on the observed data. The same settings as in Section 4 were applied, with ξ\xi estimated using M=200M=200 Monte Carlo samples, step sizes wt=2(1+t)1w_{t}=2(1+t)^{-1}, α=0.1\alpha=0.1 and convergence threshold ε=0.005.\varepsilon=0.005. Note how the variational approximation is perfect except for small levels of α\alpha on the left side, where it is a bit conservative.

To verify that the variational approach provides approximate validity in nonparametric settings, a simulation study was performed by repeating the above scenario 250 times. For each dataset, the approximated contour was evaluated at Θ=2.53\Theta=2.53, which corresponds approximately to the first quartile when 𝖯\mathsf{P} follows a 𝖦𝖺𝗆𝗆𝖺(4,1){\sf Gamma}(4,1) distribution. The empirical distribution of this contour is shown in Figure 6(b), demonstrating that approximate validity is indeed achieved.

(a) Plausibility contours
(b) Distribution function of πXn(Θ)\pi_{X^{n}}(\Theta)
Figure 6: Results from Example 9 regarding inference on the first quartile. Panel (a) shows the bootstrap-based contour (black) and its variational approximation (red). Panel (b) shows the distribution function of both contours at the true Θ\Theta.

6.3 Semiparametric

A middle ground between the parametric and nonparametric cases described in the previous two subsections is what are called semiparametric problems, i.e., those with both parametric and nonparametric parts. Perhaps the simplest example is a linear regression model with an unspecified error distribution: the linear mean function is the parametric part and the error distribution is the nonparametric part. Below we will focus on a censored-data problem with a semiparametric model but, of course, other examples are possible; see, e.g., Bickel et al., (1998), Tsiatis, (2006), and Kosorok, (2008) for more details.

The same profiling strategy described in Section 6.1 above can be carried out for semiparametric models too; Murphy and van der Vaart, (2000) is an important reference. For concreteness, we will consider a common situation involving censored data. That is, suppose we are measuring, say, the concentration of a particular chemical in the soil, but our measurement instrument has a lower detectability limit, i.e., concentrations below this limit are undetectable. In such cases, the concentrations are (left) censored. We might have a parametric model in mind for the measured concentrations, but the censoring corrupts the data and ultimately changes that model. Let YiY_{i} denote the actual chemical concentration at site ii, which we may or may not observe; the YiY_{i}’s are assigned a statistical model {𝖯θ:θ𝕋}\{\mathsf{P}_{\theta}:\theta\in\mathbb{T}\}, and the true-but-unknown value Θ\Theta of the posited model parameter is to be inferred. Let CiC_{i} denote the censoring level, which we will assume—without loss of generality—to be subject to sampling variability, i.e., the CiC_{i}’s are random variables. Then the observed data XnX^{n} consists of iid pairs Xi=(Zi,Ti)X_{i}=(Z_{i},T_{i}), where

Zi=max(Yi,Ci)andTi=1(YiCi),i=1,,n.Z_{i}=\max(Y_{i},C_{i})\quad\text{and}\quad T_{i}=1(Y_{i}\geq C_{i}),\quad i=1,\ldots,n. (15)

The goal is to infer the unknown Θ\Theta in the model for the concentrations, but now the measurements are corrupted by censoring. The augmented model for the corrupted data has likelihood function given by

Lxn(θ,G)=i=1ng(zi)1tiG(zi)ti×i=1npθ(zi)tiPθ(zi)1ti,L_{x^{n}}(\theta,G)=\prod_{i=1}^{n}g(z_{i})^{1-t_{i}}\,G(z_{i})^{t_{i}}\times\prod_{i=1}^{n}p_{\theta}(z_{i})^{t_{i}}\,P_{\theta}(z_{i})^{1-t_{i}},

depending on both the generic value θ\theta of the true unknown model parameter Θ\Theta for the concentrations and on the generic value GG of the true unknown censoring level distribution 𝖦\mathsf{G}. In the above expression, gg and pθp_{\theta} are density functions for the censoring and concentration distributions, and GG and PθP_{\theta} are the corresponding cumulative distribution functions. Now it should be clear why this is a semiparametric model: in addition to the obvious parametric model, there is the nonparametric model for the censoring levels.

A distinguishing feature of this semiparametric model is that the likelihood function is "separable," i.e., it is a product of terms involving θ\theta and terms involving GG. Consequently, if we were to optimize over GG and then form the relative profile likelihood, the part involving the optimization over GG would cancel out. This implies that we can simply ignore the part involving GG and work with the relative profile likelihood of the form

Rpr(xn,θ)=i=1npθ(zi)tiPθ(zi)1tii=1npθ^(zi)tiPθ^(zi)1ti,θ𝕋,R^{\text{\sc pr}}(x^{n},\theta)=\frac{\prod_{i=1}^{n}p_{\theta}(z_{i})^{t_{i}}\,P_{\theta}(z_{i})^{1-t_{i}}}{\prod_{i=1}^{n}p_{\hat{\theta}}(z_{i})^{t_{i}}\,P_{\hat{\theta}}(z_{i})^{1-t_{i}}},\quad\theta\in\mathbb{T},

where θ^\hat{\theta} is the maximum likelihood estimator of Θ\Theta. The distribution of the relative profile likelihood still depends on the nuisance parameter 𝖦\mathsf{G}, so when we define the possibilistic IM contour by validifying the relative profile likelihood, we get

πxnpr(θ)=supG𝖯θ,G{Rpr(Xn,θ)Rpr(xn,θ)},θ𝕋.\pi_{x^{n}}^{\text{\sc pr}}(\theta)=\sup_{G}\mathsf{P}_{\theta,G}\bigl{\{}R^{\text{\sc pr}}(X^{n},\theta)\leq R^{\text{\sc pr}}(x^{n},\theta)\bigr{\}},\quad\theta\in\mathbb{T}.

Similar to the strategy described above (from Cella and Martin, 2022) for the nonparametric problem, Cahoon and Martin, (2021) proposed a novel strategy wherein a variation on the Kaplan–Meier estimator (e.g., Kaplan and Meier, 1958; Klein and Moeschberger, 2003) is used to obtain a G^\widehat{G}, and then the contour above is approximated by

π^xnpr(θ)=𝖯θ,G^{Rpr(Xn,θ)Rpr(xn,θ)},θ𝕋.\hat{\pi}_{x^{n}}^{\text{\sc pr}}(\theta)=\mathsf{P}_{\theta,\widehat{G}}\bigl{\{}R^{\text{\sc pr}}(X^{n},\theta)\leq R^{\text{\sc pr}}(x^{n},\theta)\bigr{\}},\quad\theta\in\mathbb{T}. (16)

Evaluation of the right-hand side via Monte Carlo boils down to sampling censoring levels from G^\widehat{G}, sampling concentration levels from 𝖯θ\mathsf{P}_{\theta}, and then constructing new data sets according to (15). While this procedure is conceptually relatively simple, naive implementation over a sufficiently fine grid of θ\theta values is rather expensive. Fortunately, our proposed variational-like approximation from Section 5 can be readily applied, quickly producing an approximate contour in closed-form.
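The following Python sketch spells out this Monte Carlo recipe for a single θ; the helpers sample_Ptheta (draws concentrations from the parametric model), sample_Ghat (draws censoring levels from the estimated censoring distribution), and rel_profile_lik (evaluates the relative profile likelihood for left-censored data) are hypothetical placeholders for the model-specific pieces.

import numpy as np

def contour_hat(theta, z, t, sample_Ptheta, sample_Ghat, rel_profile_lik,
                M=1000, rng=None):
    # Monte Carlo approximation of (16): simulate concentrations from P_theta and
    # censoring levels from Ghat, recombine them as in (15), and record how often
    # the simulated relative profile likelihood falls at or below the observed one.
    rng = np.random.default_rng() if rng is None else rng
    r_obs = rel_profile_lik(theta, z, t)
    n, count = len(z), 0
    for _ in range(M):
        y = sample_Ptheta(theta, n, rng)          # latent concentrations
        c = sample_Ghat(n, rng)                   # censoring levels
        z_new = np.maximum(y, c)                  # observed value, as in (15)
        t_new = (y >= c).astype(int)              # censoring indicator
        if rel_profile_lik(theta, z_new, t_new) <= r_obs:
            count += 1
    return count / M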

Example 10.

For illustration, we use data on Atrazine concentration collected from a well in Nebraska. The data consists of n=24n=24 observations subjected to random left-censoring as described above. This is a rather extreme case where nearly half (11) of the 24 observations are censored, but previous studies indicate that a log-normal model for the Atrazine concentrations is appropriate (Helsel, 2005). The log-normal distribution is often used to model left-censored data in environmental science applications (e.g., Krishnamoorthy and Xu, 2011). The log-normal model density is

pθ(y)=1(2πθ2)1/2yexp{(logyθ1)22θ2},p_{\theta}(y)=\frac{1}{(2\pi\theta_{2})^{1/2}y}\exp\Bigl{\{}-\frac{(\log y-\theta_{1})^{2}}{2\theta_{2}}\Bigr{\}},

where θ=(θ1,θ2)\theta=(\theta_{1},\theta_{2}) represents the mean and variance parameters for logY\log Y. Again, the log-normal is only the model for the concentrations themselves—no parametric model is assumed for the censoring levels. Figure 7(a) displays the nonparametric estimator G^\widehat{G} of the censoring level distribution obtained by applying the Kaplan–Meier estimator with the censoring labels swapped: ti1tit_{i}\mapsto 1-t_{i}. This G^\widehat{G} is then used to define what we are referring to here as the "exact" IM contour via (16), and the corresponding Gaussian variational approximation—applied to (θ1,logθ2)(\theta_{1},\log\theta_{2}) first and then mapped back to (θ1,θ2)(\theta_{1},\theta_{2})—is shown in Figure 7(b). This plot is very similar to that shown in Cahoon and Martin, (2021, Fig. 10) based on naive Monte Carlo, but is far less expensive to produce computationally.

(a) Estimated distribution function, G^\widehat{G}
(b) Possibility contour
Figure 7: Results for left-censored data analysis in Example 10. Panel (a) shows the Kaplan–Meier estimator of the censoring distribution, which happens to be supported on only two points (see Cahoon and Martin, 2021, Sec. 4.3). Panel (b) shows the IM’s variational-like approximate possibility contour.

7 Conclusion

In a similar spirit to the variational approximations that are now widely used in Bayesian statistics, and building on recent ideas presented in Jiang et al., (2023), we develop here a strategy to approximate the possibilistic IM’s contour function—or at least its α\alpha-cuts/level sets for specified α\alpha—using ordinary Monte Carlo sampling and stochastic approximation. A potpourri of applications is presented, from simple textbook problems to (parametric, nonparametric, and semiparametric) problems involving nuisance parameters, and even one of relatively high dimension, to highlight the flexibility, accuracy, and overall applicability of our proposal.

A number of important and practically relevant extensions to the proposed method can and will be explored. Here we will mention two such extensions. First, recent efforts (e.g., Martin, 2022b) have focused on incorporating incomplete or partial prior information into the possibilistic IM; something along these lines is sure to be necessary as we scale up to problems of higher dimension. The only downside to the incorporation of partial prior information is that the evaluation of the IM contour is often even more complicated than in the vacuous-prior case considered here. Owing to this added complexity, efficient numerical approximations are even more important in these partial-prior formulations and, fortunately, we expect that the proposal here will carry over more or less directly. Second, there is an obvious question of whether the α\alpha-specific approximations developed here can, somehow, be stitched together in order to obtain an approximation that is (mostly) insensitive to the specification of any particular α\alpha value. Inspired by the recent developments in Jiang et al., (2023), we believe that the answer to this question is Yes, and these important details will be reported elsewhere.

References

  • Agresti, (2007) Agresti, A. (2007). An Introduction to Categorical Data Analysis. John Wiley & Sons, 2nd edition.
  • Balch et al., (2019) Balch, M. S., Martin, R., and Ferson, S. (2019). Satellite conjunction analysis and the false confidence theorem. Proc. Royal Soc. A, 475(2227):2018.0565.
  • Basu, (1964) Basu, D. (1964). Recovery of ancillary information. Sankhyā Ser. A, 26:3–16.
  • Basu, (1975) Basu, D. (1975). Statistical information and likelihood. Sankhyā Ser. A, 37(1):1–71. Discussion and correspondance between Barnard and Basu.
  • Bickel et al., (1998) Bickel, P. J., Klaassen, C. A. J., Ritov, Y., and Wellner, J. A. (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer-Verlag, New York.
  • Birnbaum, (1962) Birnbaum, A. (1962). On the foundations of statistical inference. J. Amer. Statist. Assoc., 57:269–326.
  • Bissiri et al., (2016) Bissiri, P. G., Holmes, C. C., and Walker, S. G. (2016). A general framework for updating belief distributions. J. R. Stat. Soc. Ser. B. Stat. Methodol., 78(5):1103–1130.
  • Blei et al., (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: a review for statisticians. J. Amer. Statist. Assoc., 112(518):859–877.
  • Brown, (1971) Brown, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math. Statist., 42:855–903.
  • Cahoon and Martin, (2021) Cahoon, J. and Martin, R. (2021). Generalized inferential models for censored data. Internat. J. Approx. Reason., 137:51–66.
  • Cella, (2024) Cella, L. (2024). Distribution-free inferential models: Achieving finite-sample valid probabilistic inference, with emphasis on quantile regression. International Journal of Approximate Reasoning, 170:109211.
  • Cella and Martin, (2022) Cella, L. and Martin, R. (2022). Direct and approximately valid probabilistic inference on a class of statistical functionals. Internat. J. Approx. Reason., 151:205–224.
  • Cella and Martin, (2023) Cella, L. and Martin, R. (2023). Possibility-theoretic statistical inference offers performance and probativeness assurances. Internat. J. Approx. Reason., 163:109060.
  • Cella and Martin, (2024) Cella, L. and Martin, R. (2024). Variational approximations of possibilistic inferential models. In Bi, Y., Jousselme, A.-L., and Denoeux, T., editors, BELIEF 2024, volume 14909 of Lecture Notes in Artificial Intelligence, pages 121–130, Switzerland. Springer Nature.
  • Couso et al., (2001) Couso, I., Montes, S., and Gil, P. (2001). The necessity of the strong α\alpha-cuts of a fuzzy set. Internat. J. Uncertain. Fuzziness Knowledge-Based Systems, 9(2):249–262.
  • Davison and Hinkley, (1997) Davison, A. C. and Hinkley, D. V. (1997). Bootstrap Methods and their Application, volume 1. Cambridge University Press, Cambridge.
  • Denœux, (2006) Denœux, T. (2006). Constructing belief functions from sample data using multinomial confidence regions. Internat. J. Approx. Reason., 42(3):228–252.
  • Denœux, (2014) Denœux, T. (2014). Likelihood-based belief function: justification and some extensions to low-quality data. Internat. J. Approx. Reason., 55(7):1535–1547.
  • Destercke and Dubois, (2014) Destercke, S. and Dubois, D. (2014). Special cases. In Introduction to Imprecise Probabilities, Wiley Ser. Probab. Stat., pages 79–92. Wiley, Chichester.
  • Dubois et al., (2004) Dubois, D., Foulloy, L., Mauris, G., and Prade, H. (2004). Probability-possibility transformations, triangular fuzzy sets, and probabilistic inequalities. Reliab. Comput., 10(4):273–297.
  • Efron, (1979) Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist., 7(1):1–26.
  • Efron, (1998) Efron, B. (1998). R. A. Fisher in the 21st century. Statist. Sci., 13(2):95–122.
  • Efron and Tibshirani, (1993) Efron, B. and Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall, New York.
  • Fraser et al., (1997) Fraser, D. A. S., Reid, N., and Wong, A. (1997). Simple and accurate inference for the mean of a gamma model. Canad. J. Statist., 25(1):91–99.
  • Ghosh et al., (2006) Ghosh, J. K., Delampady, M., and Samanta, T. (2006). An Introduction to Bayesian Analysis. Springer, New York.
  • Grünwald and Mehta, (2020) Grünwald, P. D. and Mehta, N. A. (2020). Fast rates for general unbounded loss functions: from ERM to generalized Bayes. J. Mach. Learn. Res., 21(56):1–80.
  • Helsel, (2005) Helsel, D. R. (2005). Nondetects and Data Analysis. Wiley.
  • Hose, (2022) Hose, D. (2022). Possibilistic Reasoning with Imprecise Probabilities: Statistical Inference and Dynamic Filtering. PhD thesis, University of Stuttgart. https://dominikhose.github.io/dissertation/diss_dhose.pdf.
  • Hose et al., (2022) Hose, D., Hanss, M., and Martin, R. (2022). A practical strategy for valid partial prior-dependent possibilistic inference. In Le Hégarat-Mascle, S., Bloch, I., and Aldea, E., editors, Belief Functions: Theory and Applications (BELIEF 2022), volume 13506 of Lecture Notes in Artificial Intelligence, pages 197–206. Springer.
  • Huber, (1981) Huber, P. J. (1981). Robust Statistics. John Wiley & Sons Inc., New York. Wiley Series in Probability and Mathematical Statistics.
  • James and Stein, (1961) James, W. and Stein, C. (1961). Estimation with quadratic loss. In Proc. 4th Berkeley Sympos. Math. Statist. and Prob., Vol. I, pages 361–379. Univ. California Press, Berkeley, Calif.
  • Jiang et al., (2023) Jiang, Y., Liu, C., and Zhang, H. (2023). Finite sample valid inference via calibrated bootstrap. Under review.
  • Kaplan and Meier, (1958) Kaplan, E. L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc., 53:457–481.
  • Klein and Moeschberger, (2003) Klein, J. P. and Moeschberger, M. L. (2003). Survival Analysis. Springer-Verlag, New York, 2nd edition.
  • Kosorok, (2008) Kosorok, M. R. (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer Series in Statistics. Springer, New York.
  • Krishnamoorthy and Xu, (2011) Krishnamoorthy, K. and Xu, Z. (2011). Confidence limits for lognormal percentiles and for lognormal mean based on samples with multiple detection limits. Ann. Occup. Hyg., 55(5):495–509.
  • Kushner and Yin, (2003) Kushner, H. J. and Yin, G. G. (2003). Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, New York, second edition.
  • Martin, (2015) Martin, R. (2015). Plausibility functions and exact frequentist inference. J. Amer. Statist. Assoc., 110(512):1552–1561.
  • Martin, (2018) Martin, R. (2018). On an inferential model construction using generalized associations. J. Statist. Plann. Inference, 195:105–115.
  • Martin, (2019) Martin, R. (2019). False confidence, non-additive beliefs, and valid statistical inference. Internat. J. Approx. Reason., 113:39–73.
  • Martin, (2021a) Martin, R. (2021a). An imprecise-probabilistic characterization of frequentist statistical inference. arXiv:2112.10904.
  • Martin, (2021b) Martin, R. (2021b). Inferential models and the decision-theoretic implications of the validity property. arXiv:2112.13247.
  • Martin, (2022a) Martin, R. (2022a). Valid and efficient imprecise-probabilistic inference with partial priors, I. First results. arXiv:2203.06703.
  • Martin, (2022b) Martin, R. (2022b). Valid and efficient imprecise-probabilistic inference with partial priors, II. General framework. arXiv:2211.14567.
  • Martin, (2023a) Martin, R. (2023a). Fiducial inference viewed through a possibility-theoretic inferential model lens. In Miranda, E., Montes, I., Quaeghebeur, E., and Vantaggi, B., editors, Proceedings of the Thirteenth International Symposium on Imprecise Probability: Theories and Applications, volume 215 of Proceedings of Machine Learning Research, pages 299–310. PMLR.
  • Martin, (2023b) Martin, R. (2023b). Valid and efficient imprecise-probabilistic inference with partial priors, III. Marginalization. arXiv:2309.13454.
  • Martin, (2024) Martin, R. (2024). Which statistical hypotheses are afflicted by false confidence? In Bi, Y., Jousselme, A.-L., and Denoeux, T., editors, BELIEF 2024, volume 14909 of Lecture Notes in Artificial Intelligence, pages 140–149, Switzerland. Springer Nature.
  • Martin and Ghosh, (2008) Martin, R. and Ghosh, J. K. (2008). Stochastic approximation and Newton’s estimate of a mixing distribution. Statist. Sci., 23(3):365–382.
  • Martin and Liu, (2013) Martin, R. and Liu, C. (2013). Inferential models: a framework for prior-free posterior probabilistic inference. J. Amer. Statist. Assoc., 108(501):301–313.
  • Martin and Liu, (2015a) Martin, R. and Liu, C. (2015a). Conditional inferential models: combining information for prior-free probabilistic inference. J. R. Stat. Soc. Ser. B. Stat. Methodol., 77(1):195–217.
  • Martin and Liu, (2015b) Martin, R. and Liu, C. (2015b). Inferential Models, volume 147 of Monographs on Statistics and Applied Probability. CRC Press, Boca Raton, FL.
  • Martin and Liu, (2015c) Martin, R. and Liu, C. (2015c). Marginal inferential models: prior-free probabilistic inference on interest parameters. J. Amer. Statist. Assoc., 110(512):1621–1631.
  • Martin and Syring, (2022) Martin, R. and Syring, N. (2022). Direct Gibbs posterior inference on risk minimizers: construction, concentration, and calibration. In Srinivasa Rao, A. S. R., Young, G. A., and Rao, C. R., editors, Handbook of Statistics: Advancements in Bayesian Methods and Implementation, volume 47, pages 1–41. Elsevier.
  • Martin and Williams, (2024) Martin, R. and Williams, J. P. (2024). Asymptotic efficiency of inferential models and a possibilistic Bernstein–von Mises theorem. arXiv:2412.15243.
  • Murphy and van der Vaart, (2000) Murphy, S. A. and van der Vaart, A. W. (2000). On profile likelihood. J. Amer. Statist. Assoc., 95(450):449–485. With discussion.
  • Reid, (2003) Reid, N. (2003). Asymptotics and the theory of inference. Ann. Statist., 31(6):1695–1731.
  • Robbins and Monro, (1951) Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Statistics, 22:400–407.
  • Schweder and Hjort, (2002) Schweder, T. and Hjort, N. L. (2002). Confidence and likelihood. Scand. J. Statist., 29(2):309–332.
  • Severini and Wong, (1992) Severini, T. A. and Wong, W. H. (1992). Profile likelihood and conditionally parametric models. Ann. Statist., 20(4):1768–1802.
  • Shafer, (1982) Shafer, G. (1982). Belief functions and parametric models. J. Roy. Statist. Soc. Ser. B, 44(3):322–352. With discussion.
  • Smith, (1995) Smith, A. (1995). A conversation with Dennis Lindley. Statist. Sci., 10(3):305–319.
  • Sprott, (2000) Sprott, D. A. (2000). Statistical Inference in Science. Springer Series in Statistics. Springer-Verlag, New York.
  • Stefanski and Boos, (2002) Stefanski, L. A. and Boos, D. D. (2002). The calculus of MM-estimation. Amer. Statist., 56(1):29–38.
  • Stein, (1956) Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. I, pages 197–206, Berkeley and Los Angeles. University of California Press.
  • Syring and Martin, (2019) Syring, N. and Martin, R. (2019). Calibrating general posterior credible regions. Biometrika, 106(2):479–486.
  • Syring and Martin, (2021) Syring, N. and Martin, R. (2021). Stochastic optimization for numerical evaluation of imprecise probabilities. In Cano, A., De Bock, J., Miranda, E., and Moral, S., editors, Proceedings of the Twelfth International Symposium on Imprecise Probability: Theories and Applications, volume 147 of Proceedings of Machine Learning Research, pages 289–298. PMLR.
  • Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288.
  • Tibshirani, (2011) Tibshirani, R. (2011). Regression shrinkage and selection via the lasso: a retrospective. J. R. Stat. Soc. Ser. B Stat. Methodol., 73(3):273–282.
  • Tran et al., (2021) Tran, M.-N., Nguyen, T.-N., and Dao, V.-H. (2021). A practical tutorial on variational Bayes. arXiv:2103.01327.
  • Troffaes and de Cooman, (2014) Troffaes, M. C. M. and de Cooman, G. (2014). Lower Previsions. Wiley Series in Probability and Statistics. John Wiley & Sons, Ltd., Chichester.
  • Tsiatis, (2006) Tsiatis, A. A. (2006). Semiparametric Theory and Missing Data. Springer Series in Statistics. Springer, New York.
  • Wasserman, (1990) Wasserman, L. A. (1990). Belief functions and statistical inference. Canad. J. Statist., 18(3):183–196.
  • Xie and Singh, (2013) Xie, M. and Singh, K. (2013). Confidence distribution, the frequentist distribution estimator of a parameter: a review. Int. Stat. Rev., 81(1):3–39.
  • Zhang, (2006) Zhang, T. (2006). Information theoretical upper and lower bounds for statistical estimation. IEEE Trans. Inform. Theory, 52(4):1307–1321.