Minimizing Uncertainty in Prevalence Estimates
Abstract
Estimating prevalence, the fraction of a population with a certain medical condition, is fundamental to epidemiology. Traditional methods rely on classification of test samples taken at random from a population. Such approaches to estimating prevalence are biased and have uncontrolled uncertainty. Here, we construct a new, unbiased, minimum-variance estimator for prevalence. Recent results show that prevalence can be estimated from counting arguments that compare the fraction of samples in an arbitrary subset of the measurement space to what is expected from conditional probability models of the diagnostic test. The variance of this estimator depends on both the choice of subset and the fraction of samples falling in it. We employ a bathtub principle to recast variance minimization as a one-dimensional optimization problem. Using symmetry properties, we show that the resulting objective function is well-behaved and can be numerically minimized.
1 Introduction
Estimating prevalence – the proportion of a population that has been infected by a disease – is a fundamental problem in epidemiology. Nonetheless, many core mathematical issues associated with this task have only recently been discovered and understood [1, 2]. For example, it has long been assumed that classification of samples as positive or negative is necessary to compute the prevalence. In Ref. [1] we demonstrated that this is false: unbiased estimators of prevalence can be constructed from conditional probability arguments having nothing to do with classification.
The core idea of Ref. [1] was to recognize that the probability density Q(r) of a diagnostic measurement outcome r in a measurement space Ω is given by the convex combination

Q(r) = q P(r) + (1 − q) N(r),   (1)

where q is the prevalence and P(r), N(r) are conditional probability density functions (PDFs) for positive and negative populations. Importantly, P and N can be constructed from training data and are thus known. Taking D to be an arbitrary subset of Ω, one can define Q_D via

Q_D = ∫_D Q(r) dr = q P_D + (1 − q) N_D,   (2)

where μ_D indicates the measure of D with respect to the arbitrary distribution μ ∈ {P, N, Q}.¹ (¹It is necessary to assume that: (i) D has neither zero measure nor unit measure with respect to Q; and (ii) the measures of D with respect to P and N are not equal. These conditions are trivial to ensure in practice.) Given M samples drawn at random from a population and denoted r_1, …, r_M, Eq. (2) implies that q can be estimated by

q̂ = (Q̂_D − N_D)/(P_D − N_D),   Q̂_D = (1/M) Σ_{j=1}^{M} 𝕀(r_j ∈ D),   (3)

where 𝕀 is the indicator function. This estimate is unbiased and converges in mean square [1].
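As a concrete illustration of Eq. (3) (ours, not from Ref. [1]), the following sketch assumes hypothetical toy densities P(r) = 2r and N(r) = 2(1 − r) on [0, 1] and the subset D = [1/2, 1], for which P_D = 3/4 and N_D = 1/4 in closed form:

```python
import math
import random

# Hypothetical toy densities on [0, 1] (for illustration only):
#   positive population P(r) = 2r, negative population N(r) = 2(1 - r).
# For D = [1/2, 1] the exact measures are available in closed form.
P_D = 0.75   # integral of 2r over [1/2, 1]
N_D = 0.25   # integral of 2(1 - r) over [1/2, 1]

def sample_population(q, M, rng):
    """Draw M measurements from the mixture Q = q*P + (1 - q)*N."""
    samples = []
    for _ in range(M):
        u = rng.random()
        if rng.random() < q:                          # positive individual
            samples.append(math.sqrt(u))              # inverse CDF of P
        else:                                         # negative individual
            samples.append(1.0 - math.sqrt(1.0 - u))  # inverse CDF of N
    return samples

def prevalence_estimate(samples):
    """Unbiased estimator of Eq. (3): compare the empirical mass of
    D = [1/2, 1] to the known measures P_D and N_D."""
    q_d_hat = sum(1 for r in samples if r >= 0.5) / len(samples)
    return (q_d_hat - N_D) / (P_D - N_D)

rng = random.Random(42)
q_hat = prevalence_estimate(sample_population(0.3, 100_000, rng))
```

Note that no individual sample is ever classified as positive or negative; only the count of samples falling in D enters the estimate.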
To solve the related and important problem of optimal prevalence estimation, we minimize the variance of q̂, which is proportional to

Q_D (1 − Q_D)/(P_D − N_D)².   (4)

Equation (4) arises from the variance of the binomial random variable M Q̂_D but is unusual in that the denominator depends on a set D that can be deformed arbitrarily. Thus, minimization is performed with respect to both the parameter Q_D as well as the set D itself. We reduce Eq. (4) to a one-dimensional (1D) problem by maximizing the denominator for fixed Q_D. We construct this maximum in terms of a bathtub principle.
2 Bathtub Principle
Distinct sets D may yield the same Q_D but different realizations of P_D − N_D. This motivates treating Q_D as an independent variable, denoted by ω to avoid confusion. By rewriting Eq. (2) in terms of ω, we find a constraint

ω = q P_D + (1 − q) N_D   (5)

that defines the admissible D corresponding to each ω. Clearly the collection 𝒟_⋆(ω) of sets whose elements maximize |P_D − N_D| (|·| denotes absolute value) for a fixed ω is an equivalence class. Since each element in this class yields the same variance as a function of ω, minimizing Eq. (4) is equivalent to solving the 1D problem

min_{0<ω<1} σ_⋆²(ω),   σ_⋆²(ω) = ω(1 − ω)/(P_{D_⋆} − N_{D_⋆})²,   D_⋆ ∈ 𝒟_⋆(ω).   (6)

We next construct σ_⋆²(ω) by finding the equivalence class 𝒟_⋆(ω).
Equation (5) indicates that “loading” points into D to increase P_D simultaneously increases ω. We therefore wish to construct D so as to increase P_D as much as possible and N_D as little as possible. This motivates us to define the sets

D_±(ω) = X_±(s_±) ∪ d_±,   (7)

Q_{d_±} = ω − Q_{X_±(s_±)},   (8)

where

X_±(s) = {r ∈ Ω : ±[P(r) − N(r)] > s Q(r)},   d_± ⊂ {r ∈ Ω : ±[P(r) − N(r)] = s_± Q(r)},   (9)

and Eq. (8) defines d_± in Eq. (7). The d_± are any sets satisfying both Eqs. (9) and the constraints given by Eq. (8). An optimality proof for these sets is closely related to the bathtub principle [3]. For completeness and to facilitate comparison to classification methods [1], we present a proof adapted to the problem at hand.
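Numerically, the sets in Eqs. (7)–(9) can be approximated on a grid by sorting cells according to the ratio ±[P(r) − N(r)]/Q(r) and filling the largest-ratio cells first until the accumulated Q-mass reaches ω. The sketch below is our illustration, with hypothetical densities P(r) = 2r, N(r) = 2(1 − r) and prevalence q = 0.3 standing in for real models:

```python
# Grid approximation of the bathtub construction in Eqs. (7)-(9): fill the
# cells with the largest sign*(P - N)/Q first, until the Q-measure of the
# set reaches omega.  Densities are hypothetical stand-ins:
#   P(r) = 2r,  N(r) = 2(1 - r),  Q = q*P + (1 - q)*N.

N_CELLS = 20_000
DR = 1.0 / N_CELLS
GRID = [(i + 0.5) * DR for i in range(N_CELLS)]

def p_pdf(r):
    return 2.0 * r

def n_pdf(r):
    return 2.0 * (1.0 - r)

def bathtub_set(omega, q, sign=+1):
    """Approximate D_+ (sign=+1) or D_- (sign=-1) with target Q-mass omega.

    Returns (cell indices of the set, P_D - N_D, achieved Q-mass)."""
    def q_pdf(r):
        return q * p_pdf(r) + (1.0 - q) * n_pdf(r)

    # Sort cells by the bathtub ratio sign*(P - N)/Q, largest first.
    order = sorted(range(N_CELLS),
                   key=lambda i: sign * (p_pdf(GRID[i]) - n_pdf(GRID[i]))
                   / q_pdf(GRID[i]),
                   reverse=True)
    cells, mass, diff = [], 0.0, 0.0
    for i in order:
        if mass >= omega:
            break
        cells.append(i)
        mass += q_pdf(GRID[i]) * DR
        diff += (p_pdf(GRID[i]) - n_pdf(GRID[i])) * DR
    return cells, diff, mass

CELLS, DIFF_PLUS, MASS = bathtub_set(omega=0.4, q=0.3, sign=+1)
```

For these monotone toy densities the ratio (P − N)/Q increases with r, so D_+ reduces to a right-interval; in general the same greedy construction handles arbitrary level sets.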
Lemma 1
Let 0 < ω < 1, and assume every point r ∈ Ω has zero measure with respect to P and N. For fixed Q_D = ω satisfying Eq. (5), either D_+(ω) or D_−(ω) maximizes |P_D − N_D|.
Proof: First we show that Eq. (8) defines d_±. By construction, X_+(s) satisfies the inclusion X_+(s′) ⊂ X_+(s) for s′ > s. Restrict attention to D_+; the argument for D_− is identical. Let s ∈ ℝ and define

Z(s) = Q_{X_+(s)} − ω.   (10)

Clearly, Eq. (10) is a monotone decreasing function of s. Assume that Z(s) crosses zero as s increases. Then Z is either continuous or discontinuous at this crossing, which defines s_+ as the corresponding right-limit; that is, s_+ = inf{s : Z(s) ≤ 0}. The set d_+ can then be chosen as any subset of {r ∈ Ω : P(r) − N(r) = s_+ Q(r)} for which Q_{D_+} = ω. If instead Z(s) does not vanish for any finite s, then s_+ is not finite and D_+ is a subset of ∩_s X_+(s). A similar argument yields existence of s_− and d_−.
Let D′ be any set satisfying Eq. (5) that differs from D_+ by positive measure not on d_+. Requiring that Eq. (5) also hold for D′ yields

Q_{D_+ ∖ D′} = Q_{D′ ∖ D_+},   (11)

where ∖ is the set-difference operator. Combining the definition of D_+ with Eq. (5), one finds that

P_{D_+ ∖ D′} − N_{D_+ ∖ D′} ≥ s_+ Q_{D_+ ∖ D′} = s_+ Q_{D′ ∖ D_+} ≥ P_{D′ ∖ D_+} − N_{D′ ∖ D_+},

implying P_{D_+} − N_{D_+} ≥ P_{D′} − N_{D′}. A similar argument can be used to show that D_− maximizes the difference N_D − P_D. These differences are unique, as can be verified by the definition of d_±. Finally, note that |P_D − N_D| is maximized by either D_+ or D_−. ∎
The denominator of Eq. (4) can be parameterized by ω. In particular, we have shown that for a fixed ω satisfying 0 < ω < 1, one of the variances

σ_±²(ω) = ω(1 − ω)/(P_{D_±(ω)} − N_{D_±(ω)})²   (12)

minimizes Eq. (4). However, we have yet to uniquely define the objective function in Eq. (6), since it is not clear when to use σ_+² or σ_−². We now prove that both are equivalent.
Lemma 2
The variances σ_+² and σ_−² satisfy the symmetry σ_+²(ω) = σ_−²(1 − ω). In particular, σ_+² minimizes Eq. (4) if and only if σ_−² does.
Proof: The numerators of σ_+²(ω) and σ_−²(1 − ω) are invariant to the transformation ω → 1 − ω. Also, for any D ⊂ Ω,

P_D − N_D = N_{D^c} − P_{D^c},   (13)

where D^c = Ω ∖ D denotes the complement of D. By Eqs. (7)–(9), the complement D_+(ω)^c is equal to a set of the form X_−(s) ∪ d_− with s = −s_+. Moreover, since ω is the Q-measure of the domain D_+(ω), the complement has Q-measure 1 − ω. That is, D_+(ω)^c = D_−(1 − ω). In light of Eq. (13), one finds σ_+²(ω) = σ_−²(1 − ω). ∎
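The symmetry of Lemma 2 can also be verified numerically. The sketch below (our illustration) evaluates a grid analogue of Eq. (12) for hypothetical stand-in densities P(r) = 2r, N(r) = 2(1 − r) with q = 0.3 and checks that σ_+²(ω) ≈ σ_−²(1 − ω):

```python
# Numerical check of Lemma 2: sigma_+^2(omega) == sigma_-^2(1 - omega).
# Hypothetical stand-in densities: P(r) = 2r, N(r) = 2(1 - r), q = 0.3.

N_CELLS = 20_000
DR = 1.0 / N_CELLS
Q_TRUE = 0.3

def pdfs(r):
    """Return (P(r), N(r), Q(r)) at a grid point r."""
    p, n = 2.0 * r, 2.0 * (1.0 - r)
    return p, n, Q_TRUE * p + (1.0 - Q_TRUE) * n

def variance(omega, sign):
    """Grid analogue of Eq. (12): sigma_+^2 (sign=+1) or sigma_-^2 (sign=-1).
    Builds D_{sign} by filling cells of largest sign*(P - N)/Q first."""
    order = sorted(((i + 0.5) * DR for i in range(N_CELLS)),
                   key=lambda r: sign * (pdfs(r)[0] - pdfs(r)[1]) / pdfs(r)[2],
                   reverse=True)
    mass = diff = 0.0
    for r in order:
        if mass >= omega:
            break
        p, n, q_r = pdfs(r)
        mass += q_r * DR
        diff += (p - n) * DR
    return omega * (1.0 - omega) / diff ** 2

SYM_GAP = abs(variance(0.3, +1) - variance(0.7, -1))
```

Up to discretization error, the gap vanishes, consistent with the lemma.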

3 Optimal Prevalence Estimation
Lemma 2 proves that we may treat σ_+²(ω) (or σ_−²(ω)) as the objective function in Eq. (6). We now show that this objective function has desirable properties.
Lemma 3
Assume that: (i) any point r ∈ Ω has zero measure with respect to P and N; and (ii) there is a set of positive measure with respect to Q for which P(r) > N(r). Then on the open domain ω ∈ (0, 1), the function σ_+²(ω) is continuous and attains a minimum. Moreover, σ_+²(ω) → ∞ as ω → 0 or ω → 1.
Proof: By Lemma 2, it is sufficient to only consider σ_+². Because any point has zero measure, by the definition of D_+(ω), the difference P_{D_+(ω)} − N_{D_+(ω)} is a continuous function of ω. Thus, σ_+²(ω) is continuous on (0, 1) provided that P_{D_+(ω)} − N_{D_+(ω)} ≠ 0 for every ω ∈ (0, 1). To demonstrate this, assume that there exists an ω ∈ (0, 1) for which P_{D_+(ω)} − N_{D_+(ω)} = 0. By assumption (ii) and the definition of D_+(ω), the remaining points in D_+(ω)^c must have the property that P(r) ≤ N(r), with strict inequality on a set of positive Q-measure. Integrating over D_+(ω)^c shows that P_Ω < N_Ω, which violates the assumption that both P and N are probability densities. Thus P_{D_+(ω)} − N_{D_+(ω)} cannot be zero in the interior of the domain, and σ_+²(ω) is continuous on (0, 1). By definition, P_{D_+(ω)} ≤ ω/q and N_{D_+(ω)} ≤ ω/(1 − q), which implies that σ_+²(ω) → ∞ in the limit ω → 0. The corresponding result in the limit ω → 1 is proved in the same way by considering σ_−²(ω) for ω → 0 and then using Lemma 2.
Existence of the minimum follows from divergence of σ_+²(ω) at the boundaries and continuity in the interior of (0, 1). Specifically, for any 0 < ε < 1/2, σ_+²(ω) is continuous on [ε, 1 − ε]. By the extreme value theorem, there exists a value ω_ε for which σ_+²(ω) attains a minimum on this interval. Clearly there exists an ε such that for all ε′ < ε, ω_{ε′} is constant, which defines the minimum. ∎
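Lemma 3 guarantees that a minimizer exists in the interior, so a simple grid scan suffices in practice. The sketch below (ours; hypothetical stand-in densities P(r) = 2r, N(r) = 2(1 − r), q = 0.3) evaluates the objective over a mesh of ω values and locates the interior minimum:

```python
# Grid scan of the objective sigma_+^2(omega) = omega*(1 - omega)/(P_D - N_D)^2
# with D = D_+(omega) from the bathtub construction.  Hypothetical stand-in
# densities: P(r) = 2r, N(r) = 2(1 - r), prevalence q = 0.3.

N_CELLS = 20_000
DR = 1.0 / N_CELLS
Q_TRUE = 0.3

def densities(r):
    """Return (P(r), N(r), Q(r))."""
    p, n = 2.0 * r, 2.0 * (1.0 - r)
    return p, n, Q_TRUE * p + (1.0 - Q_TRUE) * n

# Pre-sort grid cells once by the bathtub ratio (P - N)/Q, largest first.
ORDER = sorted(((i + 0.5) * DR for i in range(N_CELLS)),
               key=lambda r: (densities(r)[0] - densities(r)[1])
               / densities(r)[2],
               reverse=True)

def objective(omega):
    """sigma_+^2(omega) evaluated on the grid."""
    mass = diff = 0.0
    for r in ORDER:
        if mass >= omega:
            break
        p, n, q_r = densities(r)
        mass += q_r * DR
        diff += (p - n) * DR
    return omega * (1.0 - omega) / diff ** 2

OMEGAS = [i / 100 for i in range(1, 100)]       # omega = 0.01, ..., 0.99
VALUES = [objective(w) for w in OMEGAS]
W_STAR = OMEGAS[VALUES.index(min(VALUES))]      # interior minimizer
```

The endpoint values grow rapidly, reflecting the divergence of the objective as ω → 0 or ω → 1.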
Remark: Assumption (ii) implies that there is also a set of positive measure with respect to Q for which N(r) > P(r).
Remark: The assumption that there exists a set of positive measure for which P(r) ≠ N(r) is an important feature of diagnostic tests; it implies the ability to distinguish populations. Failure to satisfy this condition is the hallmark of a useless diagnostic.
4 Validation and Discussion
4.1 Example Applied to a SARS-CoV-2 Antibody Test
The left plot of Fig. 1 shows training data and probability models for a SARS-CoV-2 immunoglobulin G (IgG) receptor binding domain (RBD) assay developed in Ref. [4]. The data was normalized according to the procedure in Ref. [1]. Following that, we added a small positive offset to all values and normalized all data by the largest positive value. While not necessary, adding this offset facilitates model construction and otherwise has a negligible effect on the data, as it represents a perturbation of less than 1 % relative to the maximum scale of the data. We model one of the conditional PDFs with a Burr distribution [5] and approximate the other as a Beta distribution. Maximum likelihood estimation was used to determine model parameters. The Burr distribution was truncated to the domain (0, 1] and renormalized to have unit probability. The right plot of Fig. 1 shows the “bathtub function” [P(r) − N(r)]/Q(r) for a fixed prevalence q. The figure illustrates that for a given s, the domain X_+(s) corresponds to the set of all r for which [P(r) − N(r)]/Q(r) > s. The corresponding value of ω is given by the integral of Q(r) over D_+(ω).
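The truncation-and-renormalization step for the Burr model can be sketched as follows; the Burr XII parameters used here (c = 2, k = 3) are arbitrary placeholders, not the fitted values from the assay data:

```python
# Burr XII distribution [5], truncated to (0, 1] and renormalized so that
# the restricted density integrates to one.  Parameters c = 2, k = 3 are
# placeholders, not fitted values.

def burr_pdf(x, c, k):
    """Burr XII probability density on x > 0."""
    return c * k * x ** (c - 1.0) * (1.0 + x ** c) ** (-k - 1.0)

def burr_cdf(x, c, k):
    """Burr XII cumulative distribution function."""
    return 1.0 - (1.0 + x ** c) ** (-k)

def truncated_burr_pdf(x, c, k):
    """Burr density truncated to (0, 1] and renormalized to unit mass."""
    if not 0.0 < x <= 1.0:
        return 0.0
    return burr_pdf(x, c, k) / burr_cdf(1.0, c, k)

# Sanity check by midpoint quadrature: the truncated density integrates to ~1.
N_PTS = 100_000
TOTAL = sum(truncated_burr_pdf((i + 0.5) / N_PTS, 2.0, 3.0) / N_PTS
            for i in range(N_PTS))
```

Dividing by the CDF evaluated at the truncation point is what restores unit probability on the restricted domain.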

Figure 2 illustrates various results derived in Lemma 3. The left plot embodies the symmetry argument expressed in Eq. (13), as well as the fact that the differences P_{D_+(ω)} − N_{D_+(ω)} and N_{D_−(ω)} − P_{D_−(ω)} are positive on the open set ω ∈ (0, 1). The right plot illustrates the symmetry σ_+²(ω) = σ_−²(1 − ω), as well as the divergence when ω → 0 and ω → 1. Moreover, the objective function is continuous and bounded from below.
4.2 Limitations and Open Directions
The analysis presented herein does not rely on interpreting P and N in terms of positive or negative populations. The functions P and N could have described two different populations having nothing to do with diagnostics. Our main result therefore has implications for broader problems associated with estimating the relative fractions of different populations.
Our method suffers from the need to empirically model training data, which may introduce uncertainty associated with the choice of distributions used; the analysis herein is idealized insofar as it assumes that P and N are known exactly. Addressing such tasks is necessary to fully understand all sources of uncertainty in prevalence estimates.
In a related vein, construction of the optimal prevalence estimation domains D_±(ω) requires a priori knowledge of the prevalence q itself. Since this quantity is often the very object being estimated, a practical algorithm for minimizing uncertainty in q̂ would be to guess a trial domain based on any prior information, estimate q, and update the domain accordingly, iterating as needed. While this approach will likely never converge in a strict mathematical sense, it should provide reasonable estimates motivated by the theory herein.
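The guess-estimate-update strategy can be sketched as follows (our illustration, not a validated algorithm). Hypothetical densities P(r) = 2r and N(r) = 2(1 − r) admit closed-form measures for threshold domains D = [t, 1], so each iteration re-optimizes t under the current prevalence guess and then re-estimates q:

```python
import math
import random

# Hypothetical densities: P(r) = 2r (positive), N(r) = 2(1 - r) (negative).
# Candidate domains are right-intervals D = [t, 1], whose measures
#   P_D = 1 - t^2  and  N_D = (1 - t)^2
# are available in closed form.

def measures(t):
    return 1.0 - t * t, (1.0 - t) ** 2          # (P_D, N_D)

def best_threshold(q_guess):
    """Pick t minimizing the variance proxy Q_D(1 - Q_D)/(P_D - N_D)^2
    under the current prevalence guess."""
    best_t, best_val = None, float("inf")
    for i in range(1, 100):
        t = i / 100
        p_d, n_d = measures(t)
        q_d = q_guess * p_d + (1.0 - q_guess) * n_d
        val = q_d * (1.0 - q_d) / (p_d - n_d) ** 2
        if val < best_val:
            best_t, best_val = t, val
    return best_t

def estimate(samples, t):
    """Eq. (3) applied to the domain D = [t, 1]."""
    p_d, n_d = measures(t)
    q_d_hat = sum(1 for r in samples if r >= t) / len(samples)
    return (q_d_hat - n_d) / (p_d - n_d)

rng = random.Random(7)
Q_TRUE, M = 0.3, 50_000
SAMPLES = [math.sqrt(rng.random()) if rng.random() < Q_TRUE
           else 1.0 - math.sqrt(1.0 - rng.random())
           for _ in range(M)]

q_hat = 0.5                        # initial guess
for _ in range(5):                 # guess -> optimize domain -> re-estimate
    q_hat = estimate(SAMPLES, best_threshold(q_hat))
```

Because the estimator is unbiased for any admissible domain, updating the domain changes only the variance of the estimate, not its expected value.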
Acknowledgements: This work is a contribution of the National Institute of Standards and Technology and is therefore not subject to copyright in the United States.
Use of all data deriving from human subjects was approved by the NIST Research Protections Office.
References
- [1] P. N. Patrone, A. J. Kearsley, Classification under uncertainty: data analysis for diagnostic antibody testing, Mathematical Medicine and Biology: A Journal of the IMA (2021).
- [2] L. Böttcher, M. R. D’Orsogna, T. Chou, A statistical model of COVID-19 testing in populations: effects of sampling bias and testing errors, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 380 (2021).
- [3] E. Lieb, M. Loss, Analysis, CRM Proceedings & Lecture Notes, American Mathematical Society, 2001.
- [4] T. Liu, J. Hsiung, S. Zhao, J. Kost, D. Sreedhar, C. V. Hanson, K. Olson, D. Keare, S. T. Chang, K. P. Bliden, P. A. Gurbel, U. S. Tantry, J. Roche, C. Press, J. Boggs, J. P. Rodriguez-Soto, J. G. Montoya, M. Tang, H. Dai, Quantification of antibody avidities and accurate detection of SARS-CoV-2 antibodies in serum and saliva on plasmonic substrates, Nature Biomedical Engineering 4 (12) (2020) 1188–1196.
- [5] I. W. Burr, Cumulative frequency functions, The Annals of Mathematical Statistics 13 (2) (1942) 215–232.