Linke Li, Dalla Lana School of Public Health, University of Toronto, 155 College St Room 500, Toronto, Ontario, M5T 3M7, Canada.
A Nonparametric Approach for Estimating the Effective Sample Size in Gaussian Approximation of Expected Value of Sample Information
Abstract
The effective sample size (ESS) measures the informational value of a probability distribution in terms of an equivalent number of study participants. The ESS plays a crucial role in estimating the Expected Value of Sample Information (EVSI) through the Gaussian approximation approach. Despite the significance of the ESS, existing ESS estimation methods within the Gaussian approximation framework are either computationally expensive or potentially inaccurate. To address these limitations, we propose a novel approach that estimates the ESS using the summary statistics of generated datasets and nonparametric regression methods. The simulation results suggest that the proposed method provides accurate ESS estimates at a low computational cost, making it an efficient and practical way to quantify the information contained in the probability distribution of a parameter. Overall, determining the ESS can help analysts understand the uncertainty levels in complex prior distributions in the probabilistic analysis of decision models and perform efficient EVSI calculations.
Keywords: Expected value of sample information, Gaussian approximation, Effective sample size, Health economic evaluation, Value of Information, Uncertainty

1 INTRODUCTION
The effective sample size (ESS) quantifies the informational value of a probability distribution by representing it as an equivalent number of observations contributing information to that distribution. [1] For example, if the probability of success follows a beta distribution with parameters $\alpha = 2$ and $\beta = 8$, the ESS is the number of prior successes (2) plus the number of prior failures (8), that is, 10 observations. The ESS is particularly useful for computing the Expected Value of Sample Information (EVSI) using the Gaussian approximation (GA) approach. [2] EVSI is a quantitative measure used in decision analysis to assess the potential value of acquiring additional data or information through a sample: it represents the expected improvement in decision-making, or reduction in uncertainty, that would result from collecting new data before making a choice or decision. Beyond its importance for understanding the information in probability distributions and for EVSI estimation, the ESS also has various applications in clinical trial design, including prior information elicitation, sensitivity analysis, and trial protocol evaluation. [1]
Despite its usefulness, existing ESS estimation methods within the GA approach have limitations: [3] they either entail potentially high computational costs, impeding their practicality, or exhibit inaccuracies, undermining the reliability of the estimates. [2] To efficiently assess the informativeness of probability distributions and simplify the EVSI calculation, we propose a novel method for estimating the ESS based on summary statistics and nonparametric regression models, inspired by the nonparametric regression-based EVSI methods proposed by Strong et al. [4] We begin by reviewing the ESS, particularly within the GA approach, and proceed to illustrate the nonparametric regression-based ESS estimation method. Subsequently, we perform a simulation study to demonstrate the effectiveness and accuracy of our method, followed by a brief discussion.
2 METHODS
2.1 Effective Sample Size and Gaussian Approximation
The effective sample size (ESS) is a statistical concept that quantifies the informational value of a probability distribution through a number of hypothetical participants. [1] A greater ESS signifies a more informative distribution, while a lower value indicates the opposite. The estimation of the ESS serves two important purposes in health economic evaluation. First, it allows subject-matter experts to determine whether the amount of information contained in the distribution aligns with their prior beliefs. Second, it facilitates the estimation of EVSI within the GA approach. [2, 5] This efficiency stems from the fact that the ESS is intrinsic to the GA setup: it is computed only once, and the same value is then reused in EVSI calculations across a spectrum of study designs, irrespective of the proposed sample size. (To estimate EVSI, the ESS is combined with nonparametric regression models and Taylor series expansions to efficiently approximate the distribution of the conditional economic benefit; we refer readers to the article by Li et al. [5] for a comprehensive explanation of the EVSI estimation method.) This section introduces the concept of ESS as it is presented within the GA approach.
First, assume we have a parameter $\theta$ whose probability distribution can be well approximated by a Gaussian distribution with mean $\mu_0$ and variance $\sigma^2/n_0$:

$$ \theta \sim \mathrm{N}\!\left(\mu_0,\ \sigma^2/n_0\right), \tag{1} $$

where $\sigma^2$ represents the individual-level variance of the data collection process and $n_0$ can be understood as the number of hypothetical participants that would yield the uncertainty level of $\theta$. In the GA approach, $n_0$ is defined as the ESS of $\theta$.
Additionally, we denote the set of $n$ observations obtained from the data collection exercise that would provide additional information on $\theta$ as $X$, and its sample mean as $\bar{X}$. In the GA approach, it is assumed that, given the parameter $\theta$, $\bar{X}$ is approximately Gaussian distributed with mean $\theta$ and variance $\sigma^2/n$:

$$ \bar{X} \mid \theta \sim \mathrm{N}\!\left(\theta,\ \sigma^2/n\right). \tag{2} $$
Due to the conjugate nature of the distributions of $\theta$ and $\bar{X}$, the conditional mean of $\theta$ given the observed data is a weighted combination of $\mu_0$ and $\bar{X}$:

$$ \mathrm{E}\!\left[\theta \mid \bar{X}\right] = \frac{n_0\,\mu_0 + n\,\bar{X}}{n_0 + n}. \tag{3} $$
Using this formulation, the ESS can be determined from the variance of either $\mathrm{E}[\theta \mid \bar{X}]$ or $\bar{X}$. This is because the marginal distributions of $\mathrm{E}[\theta \mid \bar{X}]$ and $\bar{X}$ are both Gaussian, with variances $\sigma^2 n / \{n_0 (n_0 + n)\}$ and $\sigma^2 (n_0 + n) / (n_0 n)$, respectively. [2] As the variance of $\theta$ is $\sigma^2/n_0$, we can estimate $n_0$ using variance ratios. Specifically, the variance ratio between $\theta$ and $\mathrm{E}[\theta \mid \bar{X}]$ yields the estimate

$$ n_0 = n\,\frac{\mathrm{Var}(\theta) - \mathrm{Var}\!\left(\mathrm{E}[\theta \mid \bar{X}]\right)}{\mathrm{Var}\!\left(\mathrm{E}[\theta \mid \bar{X}]\right)}, \tag{4} $$

and the variance ratio between $\bar{X}$ and $\theta$ yields the estimate

$$ n_0 = n\,\frac{\mathrm{Var}(\bar{X}) - \mathrm{Var}(\theta)}{\mathrm{Var}(\theta)}. \tag{5} $$
Although equations 4 and 5 are derived under the Gaussian assumption, Jalal and Alarid-Escudero have shown that they can be generalized to estimate $n_0$ for other, non-Gaussian likelihood and prior combinations. [2]
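To make the variance-ratio idea concrete, the short Python sketch below simulates the conjugate Gaussian setting of equations (1) to (3) and recovers the ESS from both equation (4) and equation (5). All numerical values (prior mean, individual-level standard deviation, true ESS, study size, and number of simulations) are illustrative choices rather than values used elsewhere in this article; the known ESS is used to form the conditional means in equation (3), so this is purely a consistency check of the two formulas.

```python
import numpy as np

rng = np.random.default_rng(1)

mu0, sigma = 0.0, 2.0   # prior mean and individual-level standard deviation
n0_true, n = 25, 100    # true ESS and planned study size
N = 200_000             # number of simulations

theta = rng.normal(mu0, sigma / np.sqrt(n0_true), N)    # prior draws, equation (1)
xbar = rng.normal(theta, sigma / np.sqrt(n), N)         # sample means given theta, equation (2)
post_mean = (n0_true * mu0 + n * xbar) / (n0_true + n)  # conditional means, equation (3)

var_theta, var_post, var_xbar = theta.var(), post_mean.var(), xbar.var()
n0_eq4 = n * (var_theta - var_post) / var_post          # equation (4)
n0_eq5 = n * (var_xbar - var_theta) / var_theta         # equation (5)
print(round(n0_eq4, 1), round(n0_eq5, 1))               # both should be close to 25
```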
2.2 Estimating the Effective Sample Size Using the Nonparametric Regression-based Method
In this section, we review the current estimation methods and note their drawbacks. [2] We then introduce a novel approach, which employs nonparametric regression models, to estimate $n_0$ efficiently and accurately.
In the original GA article, Jalal and Alarid-Escudero introduced two methods that can be generalized to estimate $n_0$ for nonconjugate priors. [2] The first method uses equation 4. For this, we need to estimate the posterior mean of $\theta$ for a large number of datasets generated from the planned data collection exercise. For each dataset, we can use Markov chain Monte Carlo (MCMC) methods to simulate from the posterior distribution of $\theta$ and calculate its mean. We then compute the sample variance of these estimated posterior means to approximate $\mathrm{Var}\!\left(\mathrm{E}[\theta \mid \bar{X}]\right)$ and estimate $n_0$. While this method yields accurate estimates, the repeated MCMC evaluations may impose a significant computational burden and complicate practical implementation.
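As a small illustration of this first method, the sketch below applies equation (4) to the Beta(2, 8) example from the introduction, whose ESS is 10. To keep the example self-contained, the conjugate beta-binomial update stands in for the MCMC step that a nonconjugate model would require; the study size and the number of simulated datasets are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

a, b = 2.0, 8.0     # Beta(2, 8) prior, so the analytic ESS is 10
n, N = 50, 100_000  # planned study size and number of simulated datasets

theta = rng.beta(a, b, N)               # prior draws
events = rng.binomial(n, theta)         # one simulated dataset (event count) per draw
post_mean = (a + events) / (a + b + n)  # posterior mean of theta for each dataset
                                        # (conjugate update standing in for MCMC)

var_theta, var_post = theta.var(), post_mean.var()
n0_hat = n * (var_theta - var_post) / var_post  # equation (4)
print(round(n0_hat, 1))                         # should be close to 10
```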
The second method uses equation 5. However, since the sample mean may not be on the same scale as $\theta$ for many non-Gaussian probability distributions, equation 5 cannot be applied directly. In such cases, we need to identify a summary statistic $T(X)$ that shares the same scale as $\theta$ and compare its variance with that of $\theta$ to estimate $n_0$:

$$ n_0 = n\,\frac{\mathrm{Var}\!\left(T(X)\right) - \mathrm{Var}(\theta)}{\mathrm{Var}(\theta)}. \tag{6} $$
As this method only requires summary statistics, it is more computationally efficient than MCMC. However, it necessitates the identification of a suitable summary statistic for $X$. This statistic should be on the same scale as $\theta$ and have an appropriate level of variation, which can pose challenges in real-world case studies.
To reduce the computational burden of estimating $n_0$ when appropriate summary statistics of $X$ are difficult to find, we suggest estimating $n_0$ using a nonparametric regression-based method, motivated by the EVSI methods proposed by Strong et al. [4, 6] To achieve this, we first determine a low-dimensional summary statistic of the dataset $X$, denoted $T(X)$. Note that, unlike the current summary statistics-based method, the regression-based method does not require $T(X)$ to be converted to the same scale as $\theta$, which makes it simpler to specify. A commonly used summary statistic for the nonparametric regression-based method is the maximum likelihood estimate (MLE) of $\theta$ given $X$.
After we find $T(X)$, following the formulation suggested by Strong et al., $\theta$ can be expressed as a function of $T(X)$ plus a zero-mean error term: [4]

$$ \theta = g\!\left(T(X)\right) + \varepsilon, \quad \text{where } g\!\left(T(X)\right) = \mathrm{E}\!\left[\theta \mid T(X)\right]. \tag{7} $$
This equation suggests that the conditional expectation $\mathrm{E}[\theta \mid T(X)]$ can be approximated by the fitted values of a nonparametric regression model that takes $\theta$ as the response and $T(X)$ as the predictor. Therefore, our proposed method begins by fitting this nonparametric regression and extracting its fitted values. Next, $\mathrm{Var}\!\left(\mathrm{E}[\theta \mid T(X)]\right)$ is approximated by the sample variance of these fitted values. Finally, $n_0$ is estimated from equation 4. The detailed algorithm for estimating $n_0$ using the nonparametric regression-based method is summarized in Box 1; an illustrative Python sketch of these steps is given after the box.
Box 1. The nonparametric regression-based algorithm for estimating the ESS.

- Generate $N$ samples of the parameter $\theta$ from its probability distribution $p(\theta)$, denoted $\theta^{(1)}, \ldots, \theta^{(N)}$.
- For $i$ from $1$ to $N$, use the sample $\theta^{(i)}$ to generate a dataset $X^{(i)}$ of sample size $n$, and compute its summary statistic $T(X^{(i)})$.
- Regress the parameter samples $\theta^{(i)}$ on the summary statistics $T(X^{(i)})$ using a nonparametric regression model.
- Extract the fitted values from the regression model and calculate their sample variance, denoted $\hat{\sigma}^2_{\text{fit}}$.
- Calculate the sample variance of $\theta^{(1)}, \ldots, \theta^{(N)}$, denoted $\hat{\sigma}^2_{\theta}$.
- Estimate the effective sample size of $\theta$ as $\hat{n}_0 = n\,(\hat{\sigma}^2_{\theta} - \hat{\sigma}^2_{\text{fit}})/\hat{\sigma}^2_{\text{fit}}$.
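The following Python sketch walks through the steps of Box 1 for the Beta(2, 8) prior with binomial data, so that the output can be checked against the analytic ESS of 10. A hand-coded Gaussian-kernel (Nadaraya-Watson) smoother is used as the nonparametric regression; this is only one possible choice, and the case study below uses splines instead. The study size, the number of simulations, and the bandwidth rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

a, b = 2.0, 8.0   # Beta(2, 8) prior, so the analytic ESS is 10
n, N = 50, 2_000  # planned study size and number of simulated datasets

# Step 1: draw parameter samples from the prior.
theta = rng.beta(a, b, N)
# Step 2: generate one dataset per draw and reduce it to its summary statistic
# (the MLE of the event probability).
events = rng.binomial(n, theta)
mle = events / n
# Step 3: regress theta on the summary statistic with a Gaussian-kernel
# (Nadaraya-Watson) smoother, using a rule-of-thumb bandwidth.
h = 1.06 * mle.std() * N ** (-1 / 5)
weights = np.exp(-0.5 * ((mle[:, None] - mle[None, :]) / h) ** 2)
fitted = weights @ theta / weights.sum(axis=1)
# Steps 4 to 6: compare the variance of the fitted values with the prior variance.
var_fit, var_theta = fitted.var(), theta.var()
n0_hat = n * (var_theta - var_fit) / var_fit  # equation (4)
print(round(n0_hat, 1))                       # should be close to 10
```

With only 2,000 simulations, the printed estimate will typically differ from 10 by a few percent because of Monte Carlo and smoothing error; increasing $N$ tightens the estimate at the cost of a larger $N \times N$ weight matrix.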
2.3 Case Study: Examining the Accuracy of the Nonparametric Regression-based Estimation
We assess the accuracy of the proposed method across seven diverse data collection scenarios, including settings where $\theta$ is univariate and where it is multivariate. The distributions chosen for prior and data generation, respectively, in the seven scenarios are: 1) beta-binomial, 2) gamma-exponential, 3) Poisson-gamma, 4) Gaussian-Weibull, 5) Dirichlet-multinomial, 6) truncated normal-binomial, and 7) transformed beta-exponential. For each scenario, we compute $n_0$ using the original summary statistics-based and MCMC methods as well as our nonparametric regression-based method. Moreover, for scenarios one to four, where closed-form posterior distributions exist, we derive the analytic value of $n_0$. Table 1 details the prior distributions, likelihood functions, and $n_0$ estimates for each scenario.
To calculate the ESS using our nonparametric regression-based method, we generate $N$ parameter samples from each prior distribution. Each parameter sample is then input into the likelihood function of the study design to generate a dataset. Subsequently, we compute the maximum likelihood estimate (MLE) as the summary statistic for each dataset. For our method, we then regress the parameter samples on the MLEs using a spline model, extract the fitted values from the regression, and use them to compute $n_0$ according to equation 4. The summary statistics-based method also uses the MLEs, estimating $n_0$ by equation 6, while the MCMC method draws posterior samples for each dataset to estimate $n_0$. [2]
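As an illustration of this pipeline, the sketch below applies the regression-based method to one conjugate setting, a gamma prior on a Poisson event rate, for which the analytic ESS equals the prior rate parameter. The scikit-learn spline pipeline, the prior parameters, and the study size are illustrative assumptions and are not the exact settings of Table 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import SplineTransformer

rng = np.random.default_rng(4)

shape, rate = 3.0, 12.0  # gamma prior on the event rate; analytic ESS = rate = 12
n, N = 30, 10_000        # planned study size and number of simulated datasets

lam = rng.gamma(shape, 1.0 / rate, N)  # prior draws of the Poisson rate
total = rng.poisson(n * lam)           # total event count over n observations
mle = total / n                        # MLE of the rate (the summary statistic)

# Cubic regression spline of the parameter draws on the summary statistic.
model = make_pipeline(SplineTransformer(n_knots=8, degree=3), LinearRegression())
model.fit(mle.reshape(-1, 1), lam)
fitted = model.predict(mle.reshape(-1, 1))

var_fit, var_lam = fitted.var(), lam.var()
n0_hat = n * (var_lam - var_fit) / var_fit  # equation (4)
print(round(n0_hat, 1))                     # should be close to 12
```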
3 RESULTS
Table 1 provides the estimated values of $n_0$ and the associated computation time for each data collection process under the different estimation methods. For data collection exercises one to four, the estimates of $n_0$ obtained from the nonparametric regression-based method consistently align with the analytic results. In contrast, the summary statistics-based method overestimates $n_0$ for the second experiment. Although the analytic result for $n_0$ is unavailable for data collection exercises five to seven, the estimates provided by the nonparametric regression-based method remain consistent with those obtained from the MCMC method. Conversely, the summary statistics-based method may overestimate $n_0$ for these experiments. In terms of computational efficiency, both the nonparametric regression-based method and the summary statistics-based method estimate $n_0$ within seconds or minutes, whereas the MCMC method takes about an hour in similar settings.
| Simulation Settings | Experiment 1 | Experiment 2 | Experiment 3 |
|---|---|---|---|
| Prior | | | |
| Likelihood | | | |
| Analytic $n_0$ | | | |
| Estimates of $n_0$ | | | |
| Summary statistics method | | | |
| Regression-based method | | | |
| MCMC method | | | |
| Computational time (min) | | | |
| Summary statistics method | | | |
| Regression-based method | | | |
| MCMC method | | | |
| Simulation Settings | Experiment 4 | Experiment 5 |
|---|---|---|
| Prior | | |
| Likelihood | | |
| Analytic $n_0$ | | — |
| Estimates of $n_0$ | | |
| Summary statistics method | | |
| Regression-based method | | |
| MCMC method | | |
| Computational time (min) | | |
| Summary statistics method | | |
| Regression-based method | | |
| MCMC method | | |
| Simulation Settings | Experiment 6 | Experiment 7 |
|---|---|---|
| Prior | | |
| Likelihood | | |
| Analytic $n_0$ | — | — |
| Estimates of $n_0$ | | |
| Summary statistics method | | |
| Regression-based method | | |
| MCMC method | | |
| Computational time (min) | | |
| Summary statistics method | | |
| Regression-based method | | |
| MCMC method | | |
- The prior distribution for the parameter in experiment 6 follows a truncated normal distribution with a restricted domain.
- The prior distribution of the parameter in experiment 7 is defined as the negative natural logarithm of a beta-distributed random variable.
4 DISCUSSION
This article introduces a novel computational method for efficiently quantifying the amount of information contained in a probability distribution. By combining summary statistics and nonparametric regression models, our method provides an estimate of the effective sample size (ESS) for various probability distributions based on a Gaussian approximation. The accuracy and efficiency of the proposed method are validated through a comprehensive simulation study encompassing various types of probability distributions.
In addition to EVSI calculation, our efficient ESS estimation approach can also be used in other applications. These include evidence synthesis in clinical trials, as explored by Morita et al., [1] and informing parameter distributions in medical decision-making. For instance, our method can be particularly useful when a parameter's distribution is complex and the ESS cannot be readily determined from the distribution's parameters. In such instances, the ESS provides a valuable metric to quantify the number of participants contributing information to a specific input. This insight can be compared with experts' intuition, and, if there are discrepancies, the uncertainty representation can be revised accordingly.
Overall, ESS is a useful metric that can be readily used to inform EVSI and guide input distributions in medical decision-making. Our proposed approach can help estimate ESS more accurately and efficiently.
Supplementary material for this article is available on the Medical Decision Making Web site at http://journals.sagepub.com/home.mdm
References
- (1) Morita S, Thall PF and Müller P. Determining the effective sample size of a parametric prior. Biometrics 2008; 64(2): 595–602.
- (2) Jalal H and Alarid-Escudero F. A Gaussian approximation approach for value of information analysis. Medical Decision Making 2018; 38(2): 174–188.
- (3) Heath A, Kunst N, Jackson C et al. Calculating the expected value of sample information in practice: considerations from 3 case studies. Medical Decision Making 2020; 40(3): 314–326.
- (4) Strong M, Oakley JE, Brennan A et al. Estimating the expected value of sample information using the probabilistic sensitivity analysis sample: a fast, nonparametric regression-based method. Medical Decision Making 2015; 35(5): 570–583.
- (5) Li L, Jalal H and Heath A. Estimating the expected value of sample information across different sample sizes using Gaussian approximations and spline-based Taylor series expansions. Medical Decision Making 2023; XX(X): XXX–XXX.
- (6) Strong M, Oakley JE and Brennan A. Estimating multiparameter partial expected value of perfect information from a probabilistic sensitivity analysis sample: a nonparametric regression approach. Medical Decision Making 2014; 34(3): 311–326.