
Corresponding author:

Linke Li, Dalla Lana School of Public Health, University of Toronto, 155 College St Room 500, Toronto, Ontario, M5T 3M7, Canada.

A Nonparametric Approach for Estimating the Effective Sample Size in Gaussian Approximation of Expected Value of Sample Information

Linke Li¹,², Hawre Jalal³, and Anna Heath¹,²,⁴

¹ Dalla Lana School of Public Health, University of Toronto, Toronto, Canada; ² Child Health Evaluative Sciences, The Hospital for Sick Children, Toronto, Canada; ³ School of Epidemiology and Public Health, University of Ottawa, Ottawa, Canada; ⁴ Department of Statistical Science, University College London, London, United Kingdom; [email protected]
Abstract

The effective sample size (ESS) measures the informational value of a probability distribution in terms of an equivalent number of study participants. The ESS plays a crucial role in estimating the Expected Value of Sample Information (EVSI) through the Gaussian approximation approach. Despite the significance of the ESS, existing ESS estimation methods within the Gaussian approximation framework are either computationally expensive or potentially inaccurate. To address these limitations, we propose a novel approach that estimates the ESS using the summary statistics of generated datasets and nonparametric regression methods. The simulation results suggest that the proposed method provides accurate ESS estimates at a low computational cost, making it an efficient and practical way to quantify the information contained in the probability distribution of a parameter. Overall, determining the ESS can help analysts understand the uncertainty levels in complex prior distributions in the probabilistic analysis of decision models and perform efficient EVSI calculations.

keywords:
Expected value of sample information, Gaussian approximation, Effective sample size, Health economic evaluation, Value of Information, Uncertainty definition

1 INTRODUCTION

The effective sample size (ESS) quantifies the informational value of a probability distribution by representing it in terms of an equivalent number of observations contributing information to this distribution.morita2008determining For example, if the probability of success p follows a beta distribution with parameters (α = 2, β = 8), the ESS is the number of successes (2) plus the number of failures (8), i.e., 10 observations. The ESS can be particularly useful in computing the Expected Value of Sample Information (EVSI) using the Gaussian approximation (GA) approach.jalal2018gaussian EVSI is a quantitative measure used in decision analysis to assess the potential value of acquiring additional data or information through a sample. It represents the expected improvement in decision-making, or reduction in uncertainty, that would result from collecting new data before making a choice or decision. In addition to its importance in understanding the information in probability distributions and for EVSI estimation, the ESS has various applications in clinical trial design, including prior information elicitation, sensitivity analysis and trial protocol evaluation.morita2008determining

Despite its usefulness, existing ESS estimation methods within the GA approach have limitations.heath2020calculating These methods either entail potentially high computational costs, impeding their practicality, or exhibit inaccuracies, undermining the reliability of the estimates.jalal2018gaussian To efficiently assess the informativeness of probability distributions and simplify EVSI calculation, we propose a novel method for estimating the ESS based on summary statistics and nonparametric regression models, inspired by the nonparametric regression-based EVSI method proposed by Strong et al.strong2015estimating We begin by reviewing the ESS, particularly within the GA approach, and then describe the nonparametric regression-based ESS estimation method. Subsequently, we perform a simulation study to demonstrate the effectiveness and accuracy of our method, followed by a brief discussion to conclude.

2 METHODS

2.1 Effective Sample Size and Gaussian Approximation

The effective sample size (ESS) is a statistical concept that quantifies the informational value of a probability distribution through the number of hypothetical participants.morita2008determining A greater ESS signifies a more informative distribution, while a lower value indicates the opposite. The estimation of the ESS serves two important purposes in health economic evaluation. Firstly, it allows subject-matter experts to determine whether the amount of information contained in the distribution aligns with their prior beliefs. Secondly, it facilitates the estimation of EVSI within the GA approach.jalal2018gaussian ; linke2023estimating This efficiency stems from the ESS being intrinsic to the GA setup: it is computed only once, after which its value remains unaltered for EVSI calculations across a spectrum of study designs, irrespective of sample size variations.¹ This section introduces the concept of the ESS as it is presented within the GA approach.

¹ To estimate EVSI, the ESS is combined with nonparametric regression models and Taylor series expansions to efficiently approximate the distribution of the conditional economic benefit and estimate EVSI. We refer readers to the article by Li et al. for a comprehensive explanation of this EVSI estimation method.linke2023estimating

Firstly, assume we have a parameter ϕ with a corresponding probability distribution p(ϕ) that can be well approximated by a Gaussian distribution. This distribution has mean μ0 and variance σ²/n0:

\phi \sim N\left(\mu_{0}, \frac{\sigma^{2}}{n_{0}}\right), \qquad (1)

where σ² represents the individual-level variance of the data collection process, and n0 can be understood as the number of hypothetical participants that would yield the uncertainty level of ϕ. In the GA approach, n0 is defined as the ESS of p(ϕ).

Additionally, we denote the set of n observations obtained from the data collection exercise that would provide additional information on ϕ as X_n, and its sample mean as X̄_n. In the GA approach, it is assumed that, given the parameter ϕ, X̄_n is approximately Gaussian distributed with mean ϕ and variance σ²/n:

\bar{\bm{X}}_{n} \mid \phi \sim N\left(\phi, \frac{\sigma^{2}}{n}\right). \qquad (2)

Due to the conjugate nature of the distributions of ϕ and X̄_n | ϕ, the conditional mean of ϕ given the observed data is a weighted combination of μ0 and X̄_n:

\mathbb{E}_{\phi}[\phi \mid \bm{X}_{n}] = (1-v)\mu_{0} + v\bar{\bm{X}}_{n}, \qquad v = \frac{n}{n_{0}+n}. \qquad (3)

Using this formulation, the ESS n0 can be determined from the variance of either E[ϕ | X_n] or X̄_n. This is because the marginal distributions of E[ϕ | X_n] and X̄_n are both Gaussian, with variances vσ²/n0 and σ²/(v n0), respectively.jalal2018gaussian As the variance of p(ϕ) is σ²/n0, we can estimate n0 using variance ratios. Specifically, the variance ratio between ϕ and E[ϕ | X_n] yields an estimate of n0:

\hat{n}_{0} = n\left(\frac{\mathrm{Var}_{\phi}[\phi]}{\mathrm{Var}_{\bm{X}_{n}}\left[\mathbb{E}_{\phi}[\phi \mid \bm{X}_{n}]\right]} - 1\right), \qquad (4)

and the variance ratio between X̄_n and ϕ yields an estimate of n0:

\hat{n}_{0} = n\left(\frac{\mathrm{Var}_{\bar{\bm{X}}_{n}}[\bar{\bm{X}}_{n}]}{\mathrm{Var}_{\phi}[\phi]} - 1\right). \qquad (5)

Although equations 4 and 5 are derived under the Gaussian assumption, Jalal and Alarid-Escudero have shown that they can be generalized to estimate n0 for other, non-Gaussian likelihood and prior combinations.jalal2018gaussian
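For a concrete check of equation 4, consider the beta-binomial pair from the first case-study scenario (a Beta(4, 6) prior with binomial samples of size n = 20, so the true ESS is α + β = 10). Because this pair is conjugate, the posterior mean is available in closed form and the variance ratio can be evaluated by plain Monte Carlo. The following Python sketch illustrates the idea; the number of simulations M and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
M, n = 100_000, 20
alpha, beta = 4.0, 6.0  # Beta(4, 6) prior, so the true ESS is alpha + beta = 10

# Simulate the data collection exercise: draw phi from the prior,
# then a binomial dataset summarized by its success count x.
phi = rng.beta(alpha, beta, size=M)
x = rng.binomial(n, phi)

# For the conjugate beta-binomial pair, E[phi | X_n] = (alpha + x) / (alpha + beta + n),
# so no MCMC is needed to evaluate the posterior mean.
post_mean = (alpha + x) / (alpha + beta + n)

# Equation 4: n0_hat = n * (Var[phi] / Var[E[phi | X_n]] - 1).
n0_hat = n * (np.var(phi) / np.var(post_mean) - 1.0)
```

With 10⁴–10⁵ simulations the estimate lands close to the analytic value of 10; for nonconjugate pairs the closed-form posterior mean above would be replaced by an MCMC estimate, which is exactly the computational burden discussed below.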

2.2 Estimating the Effective Sample Size Using the Nonparametric Regression-Based Method

In this section, we review current n0 estimation methods and note their drawbacks.jalal2018gaussian We then introduce a novel approach, which employs nonparametric regression models, to efficiently and accurately estimate n0.

In the original GA article, Jalal and Alarid-Escudero introduced two methods that can be generalized to estimate n0 for nonconjugate priors.jalal2018gaussian The first method uses equation 4. For this, we need to estimate E[ϕ | X_n] for a large number of datasets X_n generated from a specific data collection exercise. For each dataset, we can use Markov chain Monte Carlo (MCMC) methods to simulate from the posterior distribution of the parameter ϕ and calculate its mean. We then compute the sample variance of these estimated posterior means to approximate Var[E[ϕ | X_n]] and estimate n0. While this method yields accurate n0 estimates, repeated MCMC evaluations may impose a significant computational burden and complicate practical implementation.

The second method uses equation 5. However, since the sample mean X̄_n may not have the same scale as ϕ for many non-Gaussian probability distributions, equation 5 cannot be directly applied. In such cases, we need to identify another summary statistic S that shares the same scale as ϕ and compare its variance with that of ϕ to estimate n0:

\hat{n}_{0} = n\left(\frac{\mathrm{Var}_{S}[S]}{\mathrm{Var}_{\phi}[\phi]} - 1\right). \qquad (6)

As this method only requires summary statistics, it is more computationally efficient than MCMC. However, it necessitates the identification of a suitable summary statistic for X_n, which should exhibit the same scale as ϕ and an appropriate level of variation; this can pose challenges in real-world case studies.
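As an illustration of equation 6, in a beta-binomial setting the sample proportion x/n is a natural summary statistic on the same scale as ϕ. A minimal sketch, with parameter values chosen to match the first case-study scenario (true ESS of 10):

```python
import numpy as np

rng = np.random.default_rng(2)
M, n = 100_000, 20
alpha, beta = 4.0, 6.0  # Beta(4, 6) prior, so the true ESS is alpha + beta = 10

# Draw phi from the prior and simulate one binomial dataset per draw.
phi = rng.beta(alpha, beta, size=M)
s = rng.binomial(n, phi) / n  # summary statistic S on the same scale as phi

# Equation 6: n0_hat = n * (Var[S] / Var[phi] - 1).
n0_hat = n * (np.var(s) / np.var(phi) - 1.0)
```

Here S is easy to find; the difficulty described above arises when no statistic on the scale of ϕ presents itself.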

To reduce the computational burden of estimating n0 when appropriate summary statistics of ϕ are difficult to find, we suggest estimating E[ϕ | X_n] using a nonparametric regression-based method, motivated by the EVSI methods proposed by Strong et al.strong2015estimating ; strong2014estimating To achieve this, we first determine a low-dimensional summary statistic of the dataset X_n, denoted T(X_n). Note that, unlike the current summary statistics-based method, the regression-based method does not require T(X_n) to be converted to the same scale as ϕ, which makes it simpler to specify. A commonly used summary statistic for the nonparametric regression-based method is the maximum likelihood estimate (MLE) of ϕ given X_n.

Once T(X_n) is found, using the formulation suggested by Strong et al., ϕ can be expressed as a function of T(X_n) plus a zero-mean error term ϵ:strong2015estimating

\phi = g(T(\bm{X}_{n})) + \epsilon. \qquad (7)

This equation suggests that the conditional expectation E[ϕ | X_n] can be approximated by the fitted values of a nonparametric regression model that takes ϕ as the response and T(X_n) as the predictor. Therefore, our proposed method begins by fitting this nonparametric regression and extracting the fitted values from the regression model. Next, Var[E[ϕ | X_n]] is approximated by the variance of these fitted values. Finally, n0 can be estimated from equation 4. The detailed algorithm for estimating n0 using the nonparametric regression-based method is summarized in Box 1.

Box 1: Nonparametric Regression-Based Effective Sample Size Estimation Algorithm

1. Generate samples of parameters and datasets:

  • Generate M samples of the parameter ϕ from the probability distribution p(ϕ), denoted ϕ^1, …, ϕ^M.

  • For m from 1 to M, use the sample ϕ^m to generate a dataset of sample size n, X_n^m, and compute its summary statistic T(X_n^m).

2. Fit the regression model and estimate the effective sample size:

  • Regress the parameter samples ϕ^m on the summary statistics T(X_n^m) using a nonparametric regression model.

  • Extract the fitted values from the regression model and calculate their sample variance, denoted $\widehat{\mathrm{Var}}_{\bm{X}_{n}}\left[\mathbb{E}_{\phi}[\phi \mid \bm{X}_{n}]\right]$.

  • Calculate the sample variance of ϕ^1, …, ϕ^M, denoted $\widehat{\mathrm{Var}}_{\phi}[\phi]$.

  • Estimate the effective sample size of ϕ using $n\left(\frac{\widehat{\mathrm{Var}}_{\phi}[\phi]}{\widehat{\mathrm{Var}}_{\bm{X}_{n}}\left[\mathbb{E}_{\phi}[\phi \mid \bm{X}_{n}]\right]} - 1\right)$.
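The steps in Box 1 can be sketched in a few lines of code. The sketch below again uses the beta-binomial scenario (true ESS of 10) so the answer can be checked, with the MLE x/n as T(X_n); a low-order polynomial regression stands in here for the spline smoother, and any flexible nonparametric regression could be substituted:

```python
import numpy as np

rng = np.random.default_rng(3)
M, n = 100_000, 20
alpha, beta = 4.0, 6.0  # Beta(4, 6) prior, so the true ESS is alpha + beta = 10

# Step 1: generate M parameter samples and, for each, one dataset of size n;
# the MLE x/n serves as the low-dimensional summary statistic T(X_n).
phi = rng.beta(alpha, beta, size=M)
t = rng.binomial(n, phi) / n

# Step 2: regress phi on T(X_n); a cubic polynomial plays the role of the
# nonparametric smoother. The fitted values approximate E[phi | X_n].
coef = np.polynomial.polynomial.polyfit(t, phi, deg=3)
fitted = np.polynomial.polynomial.polyval(t, coef)

# The variance of the fitted values approximates Var[E[phi | X_n]], and the
# variance ratio in equation 4 turns it into an effective sample size.
n0_hat = n * (np.var(phi) / np.var(fitted) - 1.0)
```

Note that T(X_n) is never rescaled to match ϕ: the regression absorbs any monotone transformation of the statistic, which is what frees the analyst from hunting for a statistic on the right scale.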

2.3 Case Study: Examining the Accuracy of the Nonparametric Regression-Based n0 Estimation

We assess the accuracy of the proposed method across seven diverse data collection scenarios, including settings where ϕ is univariate and multivariate. The distributions chosen for the prior and data generation, respectively, for the seven scenarios are: 1) beta-binomial, 2) gamma-exponential, 3) gamma-Poisson, 4) Dirichlet-multinomial, 5) Gaussian-Weibull, 6) truncated normal-binomial and 7) transformed beta-exponential. For each scenario, we compute n0 using the original summary statistics and MCMC methods, and our nonparametric regression-based method. Moreover, for scenarios one to four, where closed-form posterior distributions exist, we derive the analytic value of n0. Table 1 details the prior distributions, likelihood functions, and n0 for each scenario.

To calculate the ESS using our nonparametric regression-based method, we generate 10⁴ parameter samples from each prior distribution. Each parameter sample is then input into the likelihood function of the study design to generate a dataset, yielding 10⁴ datasets. Subsequently, we compute the maximum likelihood estimate (MLE) as the summary statistic for each dataset. For our method, we then regress the parameter samples on the MLEs using a spline. The fitted values are extracted from the regression model and used to compute n0 according to equation 4. The summary statistics-based method also uses the MLEs, estimating n0 by equation 6, while the MCMC method is based on 10⁴ posterior samples for each dataset to estimate E[ϕ | X_n].jalal2018gaussian

3 RESULTS

Table 1 provides the estimated n0 values and the associated time to compute the ESS for each data collection process using the distinct estimation methods. For data collection exercises one to four, the estimates of n0 obtained from the nonparametric regression-based method consistently align with the analytic results. In contrast, the summary statistics-based method overestimates n0 for the second experiment. Although the analytic result for n0 is unavailable for data collection exercises five to seven, the estimates provided by the nonparametric regression-based method remain consistent with those obtained from the MCMC method. Conversely, the summary statistics-based method may overestimate n0 for these experiments. In terms of computational efficiency, both the nonparametric regression-based method and the summary statistics-based method can estimate n0 within seconds or minutes, whereas the MCMC method takes about an hour in similar settings.

Table 1: The prior distribution, likelihood function and effective sample size (n0) derived using analytic results for seven data collection exercises. The n0 estimates and associated computational times given by the summary statistics approach, nonparametric regression-based approach and MCMC approach are also reported.
Simulation Settings (Experiments 1-3)

                           Experiment 1             Experiment 2               Experiment 3
Prior                      P ~ Beta(α=4, β=6)       λ ~ Gamma(α=20, β=10)      λ ~ Gamma(α=50, β=100)
Likelihood                 X ~ Binomial(n=20, P)    X_i ~ Exp(λ), i=1,…,100    X_i ~ Poisson(λ), i=1,…,100
Analytic n0                n0 = α + β = 10          n0 = α = 20                n0 = β = 100

n0 Estimates
Summary Statistics Method  10.09                    24.17                      100.15
Regression-based Method    9.93                     21.14                      99.58
MCMC Method                9.66                     23.38                      98.91

Computational Time of n0 (min)
Summary Statistics Method  0.01                     0.01                       0.01
Regression-based Method    0.01                     0.01                       0.01
MCMC Method                78.41                    49.98                      101.93

Simulation Settings (Experiments 4-5)

                           Experiment 4                 Experiment 5
Prior                      P ~ Dirichlet(α=(10,5,8))    θ ~ N(μ0=1, σ²=0.04)
Likelihood                 X ~ Multinomial(n=50, P)     X_i ~ Weibull(v=1, λ=1/θ), i=1,…,100
Analytic n0                n0 = 10+5+8 = 23             —

n0 Estimates
Summary Statistics Method  23.26                        30.57
Regression-based Method    22.74                        24.94
MCMC Method                22.67                        26.34

Computational Time of n0 (min)
Summary Statistics Method  0.01                         3.45
Regression-based Method    0.01                         3.52
MCMC Method                150.55                       92.63

Simulation Settings (Experiments 6-7)

                           Experiment 6                          Experiment 7
Prior                      P ~ TN(μ0=0.2, σ²=0.01, a=0, b=1) †   λ = −log(1−P), P ~ Beta(α=4, β=6) ‡
Likelihood                 X ~ Binomial(n=20, P)                 X_i ~ Exp(λ), i=1,…,100
Analytic n0                —                                     —

n0 Estimates
Summary Statistics Method  18.26                                 7.40
Regression-based Method    16.67                                 4.87
MCMC Method                16.19                                 4.90

Computational Time of n0 (min)
Summary Statistics Method  0.01                                  0.01
Regression-based Method    0.01                                  0.01
MCMC Method                13.66                                 1105.4
  • † The prior distribution of the parameter P follows a truncated normal distribution with a domain ranging from 0 to 1, a mean of 0.2, and a variance of 0.01.

  • ‡ The prior distribution of the parameter λ is defined as the negative natural logarithm of 1−P, where P follows a beta distribution with parameters α=4 and β=6.

4 DISCUSSION

This article introduces a novel computational method for efficiently quantifying the amount of information contained in a probability distribution. By combining summary statistics and nonparametric regression models, our method provides an estimation of the effective sample size (ESS) for various probability distributions based on a Gaussian approximation. The accuracy and efficiency of the proposed method are validated through a comprehensive simulation study, encompassing various types of probability distributions.

In addition to EVSI calculation, our efficient ESS estimation approach can be used in other applications. These include evidence synthesis in clinical trials, as explored by Morita et al.,morita2008determining and informing parameter distributions in medical decision-making. For instance, our method can be particularly useful when a parameter's distribution is complex and the ESS cannot be readily determined from the distribution's parameters. In such instances, the ESS provides a valuable metric quantifying the number of participants contributing information to a specific input. This insight can be compared with experts' intuition and, if there are discrepancies, the uncertainty representation can be revised accordingly.

Overall, ESS is a useful metric that can be readily used to inform EVSI and guide input distributions in medical decision-making. Our proposed approach can help estimate ESS more accurately and efficiently.

Acknowledgements

Placeholder.

Supplementary Material

Supplementary material for this article is available on the Medical Decision Making Web site at http://journals.sagepub.com/home.mdm

References

  • (1) Morita S, Thall PF and Müller P. Determining the effective sample size of a parametric prior. Biometrics 2008; 64(2): 595–602.
  • (2) Jalal H and Alarid-Escudero F. A Gaussian approximation approach for value of information analysis. Medical Decision Making 2018; 38(2): 174–188.
  • (3) Heath A, Kunst N, Jackson C et al. Calculating the expected value of sample information in practice: considerations from 3 case studies. Medical Decision Making 2020; 40(3): 314–326.
  • (4) Strong M, Oakley JE, Brennan A et al. Estimating the expected value of sample information using the probabilistic sensitivity analysis sample: a fast, nonparametric regression-based method. Medical Decision Making 2015; 35(5): 570–583.
  • (5) Li L, Jalal H and Heath A. Estimating the expected value of sample information across different sample sizes using Gaussian approximations and spline-based Taylor series expansions. Medical Decision Making 2023; XX(X): XXX–XXX.
  • (6) Strong M, Oakley JE and Brennan A. Estimating multiparameter partial expected value of perfect information from a probabilistic sensitivity analysis sample: a nonparametric regression approach. Medical Decision Making 2014; 34(3): 311–326.