
Minimizing Uncertainty in Prevalence Estimates

Paul N. Patrone ([email protected]), Anthony Kearsley
National Institute of Standards and Technology
Abstract

Estimating prevalence, the fraction of a population with a certain medical condition, is fundamental to epidemiology. Traditional methods rely on classification of test samples taken at random from a population; such approaches are biased and have uncontrolled uncertainty. Here, we construct a new, unbiased, minimum-variance estimator for prevalence. Recent results show that prevalence can be estimated from counting arguments that compare the fraction of samples in an arbitrary subset of the measurement space to what is expected from conditional probability models of the diagnostic test. The variance of this estimator depends on both the choice of subset and the fraction of samples falling in it. We employ a bathtub principle to recast variance minimization as a one-dimensional optimization problem. Using symmetry properties, we show that the resulting objective function is well behaved and can be numerically minimized.

1 Introduction

Estimating prevalence – the proportion of a population that has been infected by a disease – is a fundamental problem in epidemiology. Nonetheless, many core mathematical issues associated with this task have only recently been discovered and understood [1, 2]. For example, it has long been assumed that classification of samples as positive or negative is necessary to compute prevalence. In Ref. [1] we demonstrated that this is false: unbiased estimators of prevalence can be constructed from conditional probability arguments having nothing to do with classification.

The core idea of Ref. [1] was to recognize that the probability $Q(r)$ of a diagnostic measurement outcome $r$ in a space $\Omega\subset\mathbb{R}^{N}$ is given by the convex combination

\[ Q(r) = qP(r) + (1-q)N(r), \tag{1} \]

where $q$ is the prevalence and $P(r)$, $N(r)$ are conditional probability density functions (PDFs) for the positive and negative populations. Importantly, $P(r)$ and $N(r)$ can be constructed from training data and are thus known. Taking $D$ to be an arbitrary subset of $\Omega$, one can define $q$ via

\[ q = \frac{Q_D - N_D}{P_D - N_D}, \tag{2} \]

where $S_D$ denotes the measure of $D$ with respect to an arbitrary distribution $S$.\footnote{It is necessary to assume that: (i) $D$ has neither zero measure nor unit measure with respect to $Q$; and (ii) the measures of $D$ with respect to $P$ and $N$ are not equal. These conditions are trivial to ensure in practice.} Given $M$ samples drawn at random from a population and denoted $r_j$, Eq. (2) implies that $q$ can be estimated by

\[ q \approx \tilde{q} = \frac{\tilde{Q}_D - N_D}{P_D - N_D}, \qquad \tilde{Q}_D = \frac{1}{M}\sum_{j=1}^{M}\mathbb{I}(r_j \in D), \tag{3} \]

where $\mathbb{I}$ is the indicator function. This estimate is unbiased and converges in mean square [1].
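As a concrete illustration, the sketch below simulates the counting estimator of Eq. (3). The Gaussian conditional models, their parameters, and the half-line choice of $D$ are all hypothetical stand-ins for illustration; they are not the assay models used later in the paper.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical 1-D conditional models (illustrative only).
P = stats.norm(loc=2.0, scale=0.7)   # positive-population PDF P(r)
N = stats.norm(loc=0.0, scale=0.7)   # negative-population PDF N(r)
q_true = 0.25                        # prevalence used to generate data

# Arbitrary subset D = [1, inf); any D with P_D != N_D works.
lo = 1.0
P_D = P.sf(lo)                       # measure of D under P
N_D = N.sf(lo)                       # measure of D under N

# Draw M samples from the mixture Q = q P + (1 - q) N.
M = 200_000
is_pos = rng.random(M) < q_true
r = np.where(is_pos, P.rvs(M, random_state=rng), N.rvs(M, random_state=rng))

# Eq. (3): count the fraction of samples in D, then invert Eq. (2).
Q_D = np.mean(r >= lo)
q_est = (Q_D - N_D) / (P_D - N_D)
print(q_est)  # close to q_true, yet no individual sample is classified
```

Note that the estimator never labels an individual sample positive or negative; it only compares the empirical mass in $D$ to the model predictions $P_D$ and $N_D$.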

To solve the related and important problem of optimal prevalence estimation, we minimize the variance of $\tilde{q}$, which is proportional to

\[ \sigma^2 = \frac{Q_D(1-Q_D)}{(P_D-N_D)^2}. \tag{4} \]

Equation (4) arises from the variance of the binomial random variable $\tilde{Q}_D$ but is unusual in that the denominator depends on a set that can be deformed arbitrarily. Thus, minimization is performed with respect to both the parameter $Q_D$ and the set $D$. We reduce Eq. (4) to a one-dimensional (1D) problem by maximizing the denominator for fixed $Q_D$. We construct this maximum in terms of a bathtub principle.

2 Bathtub Principle

Distinct sets $D$ may yield the same $Q_D$ but different realizations of $P_D - N_D$. This motivates treating $Q_D$ as an independent variable, denoted by $\hat{Q}$ to avoid confusion. Rewriting Eq. (2) in terms of $\hat{Q}$ yields the constraint

\[ \hat{Q} = q(P_D - N_D) + N_D \tag{5} \]

that defines the admissible $D$ corresponding to each $\hat{Q}$. Clearly the collection of sets $\{D^\star\}_{\hat{Q}}$ whose elements maximize $|P_D - N_D|$ (where $|\cdot|$ denotes absolute value) for fixed $\hat{Q}$ is an equivalence class. Since each element of this class yields the same $F(\hat{Q}) = (P_{D^\star} - N_{D^\star})^2$ as a function of $\hat{Q}$, minimizing Eq. (4) is equivalent to solving the 1D problem

\[ \hat{Q}^{\star} = \operatorname*{argmin}_{\hat{Q}} \frac{\hat{Q}(1-\hat{Q})}{F(\hat{Q})}. \tag{6} \]

We next construct $F(\hat{Q})$ by finding the equivalence class $\{D^\star\}_{\hat{Q}}$.

Equation (5) indicates that “loading” points $r$ into $D$ to increase $|P_D - N_D|$ simultaneously increases $N_D$. We therefore wish to construct $D$ so as to increase $|P_D - N_D|$ as much as possible while increasing $N_D$ as little as possible. This motivates us to define the sets

\[ D_{\pm} = \{r : \pm q[P(r) - N(r)] > \pm\Delta_{\pm} N(r)\} \cup S_{\pm}, \tag{7} \]
\[ \hat{Q} = \int_{D_{\pm}} Q(r)\,dr, \tag{8} \]

where

\[ S_{\pm} \subset \{r : q[P(r) - N(r)] = \Delta_{\pm} N(r)\}, \tag{9} \]

and Eq. (8) defines $\Delta_{\pm}$ in Eq. (7). The $S_{\pm}$ are any sets satisfying both Eq. (9) and the constraint given by Eq. (8). An optimality proof for these sets is closely related to the bathtub principle [3]. For completeness, and to facilitate comparison to classification methods [1], we present a proof adapted to the problem at hand.

Lemma 1

Let $0 < q < 1$, and assume every point $r$ has zero measure with respect to $P(r)$ and $N(r)$. For fixed $\hat{Q}$ satisfying $0 < \hat{Q} < 1$, either $D_+$ or $D_-$ maximizes $F(\hat{Q})$.

Proof: First we show that Eq. (8) defines $\Delta_{\pm}$. By construction, $\Delta_{\pm}$ satisfies the inequality $-q \leq \Delta_{\pm} < \infty$. Restrict attention to $\Delta_+$. Let $D(\Delta) = \{r : q[P(r) - N(r)] > \Delta N(r)\}$, where $-\infty < \Delta < \infty$, and define $D(\infty) = \lim_{\Delta\to\infty} D(\Delta) = \{r : P(r) > 0,\, N(r) = 0\}$. Clearly,

\[ \mathcal{Q}(\Delta) = \left[\int_{D(\Delta)} Q(r)\,dr\right] - \hat{Q} \tag{10} \]

is a monotone decreasing function of $\Delta$. Assume that $\mathcal{Q}(\Delta)$ crosses zero as $\Delta\to\infty$. Then $\mathcal{Q}(\Delta)$ is either continuous or discontinuous at this crossing, which defines $\Delta_+$ as the corresponding right-limit; that is, $\Delta_+ = \inf_{\Delta}\{\Delta : \mathcal{Q}(\Delta) < 0\}$. The set $S_+$ can then be chosen as any subset of $\{r : q[P(r) - N(r)] = \Delta_+ N(r)\}$ for which $\mathcal{Q}(\Delta_+) + \int_{S_+} Q(r)\,dr = 0$. If instead $\mathcal{Q}(\Delta)$ does not vanish for any finite $\Delta$, then $\Delta_+$ is not finite and $S_+$ is a subset of $D(\infty)$. A similar argument yields existence of $\Delta_-$ and $S_-$.

Let $D$ be any set that differs from $D_+$ by positive measure not on $S_+$. Requiring that Eq. (5) also hold for $D$ yields

\[ P_{D_+} - N_{D_+} - P_D + N_D = \frac{1}{q}\left[N_{D/D_+} - N_{D_+/D}\right], \tag{11} \]

where $/$ denotes the set difference. Combining the definition of $\Delta_+$ with Eq. (5), one finds that $N_{D/D_+} > N_{D_+/D}$, implying $P_{D_+} - N_{D_+} > P_D - N_D$. A similar argument shows that $D_-$ maximizes the difference $N_D - P_D$. These differences are unique, as can be verified from the definition of $S_{\pm}$. Finally, note that $F(\hat{Q})$ is maximized by $\max[P_{D_+} - N_{D_+},\, N_{D_-} - P_{D_-}]$. $\qed$

The denominator of Eq. (4) can be parameterized by $\hat{Q}$. In particular, we have shown that for fixed $\hat{Q}$ satisfying $0 < \hat{Q} < 1$, one of the variances

\[ \sigma^2_{+}(\hat{Q}) = \frac{\hat{Q}(1-\hat{Q})}{(P_{D_+}-N_{D_+})^2}, \qquad \sigma^2_{-}(\hat{Q}) = \frac{\hat{Q}(1-\hat{Q})}{(P_{D_-}-N_{D_-})^2} \tag{12} \]

minimizes $\sigma^2$. However, we have yet to uniquely define the objective function in Eq. (6), since it is not clear when to use $\sigma^2_+$ or $\sigma^2_-$. We now prove that the two are equivalent.

Lemma 2

The variances $\sigma^2_+$ and $\sigma^2_-$ satisfy the symmetry $\sigma^2_+(\hat{Q}) = \sigma^2_-(1-\hat{Q})$. In particular, $\sigma^2_+(\hat{Q}^\star)$ minimizes Eq. (4) if and only if $\sigma^2_-(1-\hat{Q}^\star)$ does.

Proof: The numerators of $\sigma^2_+$ and $\sigma^2_-$ are invariant under the transformation $\hat{Q} \to 1-\hat{Q}$. Also, for any $D_+$,

\[ P_{D_+} - N_{D_+} = N_{\Omega/D_+} - P_{\Omega/D_+}. \tag{13} \]

By Eqs. (7)–(9), the set $\Omega/D_+$ is equal to a set $D_-$ for which $\Delta_- = \Delta_+$. Moreover, since $\hat{Q}$ is the $Q$-measure of the domain $D_+$, the complement $\Omega/D_+$ has $Q$-measure $1-\hat{Q}$. That is, $\Omega/D_+ \in \{D_-\}_{(1-\hat{Q})}$. In light of Eq. (13), one finds $\sigma^2_+(\hat{Q}) = \sigma^2_-(1-\hat{Q})$. $\qed$
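Lemma 2 can also be checked numerically. The sketch below uses hypothetical equal-variance Gaussian models, for which the bathtub function is monotone in $r$; then $D_+$ is a half-line $(t,\infty)$ and its complement plays the role of $D_-$, so the symmetry $\sigma^2_+(\hat{Q}) = \sigma^2_-(1-\hat{Q})$ can be verified threshold by threshold.

```python
import numpy as np
from scipy import stats

# Hypothetical equal-variance Gaussian models (illustrative only). With
# these, the bathtub function is monotone in r, so D_+ = (t, inf) and the
# complement Omega/D_+ = (-inf, t] is the matching D_-.
P = stats.norm(2.0, 0.7)
N = stats.norm(0.0, 0.7)
q = 0.25

t = np.linspace(-1.0, 3.0, 9)                      # a few thresholds
Qhat = 1 - (q * P.cdf(t) + (1 - q) * N.cdf(t))     # Q-measure of D_+ = (t, inf)

# sigma^2_+(Qhat) built from D_+, versus sigma^2_-(1 - Qhat) built from
# the complement, per Eq. (12) and Eq. (13).
s_plus = Qhat * (1 - Qhat) / (P.sf(t) - N.sf(t)) ** 2
s_minus = (1 - Qhat) * Qhat / (N.cdf(t) - P.cdf(t)) ** 2

print(np.max(np.abs(s_plus - s_minus) / s_plus))   # should be near machine epsilon
```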

Figure 1: Left: data and probability models associated with a SARS-CoV-2 antibody test. The blue and red histograms are the relative fractions of negative and positive data from an empirical study. The continuous curves are the modeled PDFs $N(r)$ and $P(r)$. See the main text for more details. Right: the bathtub function $q[P(r)-N(r)]/N(r)$ and $Q(r)$ for $q = 30/130$. To find the domain $D_-$, we “fill” the bathtub function up to a level $\Delta_-$ for which the $Q$-measure of $D_-$ equals a specified value of $\hat{Q}$.

3 Optimal Prevalence Estimation

Lemma 2 proves that we may treat $\sigma^2_+$ (or $\sigma^2_-$) as the objective function in Eq. (6). We now show that this objective function has desirable properties.

Lemma 3

Assume that: (i) any point $r$ has zero measure with respect to $P(r)$ and $N(r)$; and (ii) there is a set of positive measure with respect to $P$ for which $P(r) > N(r)$. Then on the open domain $0 < \hat{Q} < 1$, the function $\sigma^2_{\pm}(\hat{Q})$ is continuous and attains a minimum. Moreover, $\sigma^2_{\pm} \to \infty$ as $\hat{Q} \to 0$ or $\hat{Q} \to 1$.

Proof: By Lemma 2, it suffices to consider $\sigma^2_+(\hat{Q})$. Because any point $r$ has zero measure, the definition of $D_{\pm}$ implies that the difference $P_{D_+} - N_{D_+}$ is continuous. Thus, $\sigma^2_+(\hat{Q})$ is continuous on $(0,1)$ provided that $P_{D_+} - N_{D_+} > 0$ for every $\hat{Q} \in (0,1)$. To demonstrate this, assume that there exists a $\hat{Q}$ for which $P_{D_+} - N_{D_+} = 0$. By assumption (ii) and the definition of $D_+$, the remaining points in $\Omega/D_+$ must have the property that $P(r) < N(r)$. Integrating over this set shows that $P_\Omega < N_\Omega$, which violates the assumption that both are probability densities. Thus $P_{D_+} - N_{D_+}$ cannot vanish in the interior of the domain, and $\sigma^2_+(\hat{Q})$ is continuous on $(0,1)$. Next, Eq. (5) implies $P_{D_+} \leq \hat{Q}/q$, so that $\sigma^2_+(\hat{Q}) \geq q^2(1-\hat{Q})/\hat{Q}$, which diverges in the limit $\hat{Q} \to 0$. The corresponding result in the limit $\hat{Q} \to 1$ is proved in the same way by considering $\hat{Q} \to 0$ for $\sigma^2_-$ and then using Lemma 2.

Existence of the minimum follows from divergence of $\sigma^2_+(\hat{Q})$ at the boundaries and continuity on the interior $(0,1)$. Specifically, for any $0 < \epsilon < 1/2$, $\sigma^2_+$ is continuous on $D_\epsilon = [\epsilon, 1-\epsilon]$. By the extreme value theorem, there exists a value $\hat{Q}_\epsilon \in D_\epsilon$ at which $\sigma^2_+$ attains its minimum over $D_\epsilon$. Since $\sigma^2_+$ diverges at the boundaries, there exists an $\epsilon_0$ such that for all $\epsilon < \epsilon_0$, $\sigma^2_+(\hat{Q}_\epsilon)$ is constant; this constant value is the global minimum. $\qed$

Remark: Assumption (ii) implies that there is a set of positive measure with respect to $N$ for which $N(r) > P(r)$.

Remark: The assumption that there exists a set of positive measure for which $P(r) > N(r)$ is an important feature of diagnostic tests; it implies the ability to distinguish populations. Failure to satisfy this condition is the hallmark of a useless diagnostic.
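Given Lemma 3, the objective in Eq. (6) can be minimized with a standard bounded 1D solver. The sketch below again uses hypothetical equal-variance Gaussian models (not the paper's assay models); because the bathtub function is then monotone in $r$, $D_+$ reduces to a half-line $(t,\infty)$ and the threshold $t$ parameterizes $\hat{Q}$.

```python
from scipy import stats
from scipy.optimize import minimize_scalar

# Hypothetical models (illustrative only).
P = stats.norm(2.0, 0.7)
N = stats.norm(0.0, 0.7)
q = 0.25

def sigma2_plus(t):
    """Variance (4) for the half-line domain D_+ = (t, inf)."""
    Qhat = q * P.sf(t) + (1 - q) * N.sf(t)   # Q-measure of D_+
    diff = P.sf(t) - N.sf(t)                 # P_D - N_D
    return Qhat * (1 - Qhat) / diff**2

# Minimize over a bounded interval of thresholds; Lemma 3 guarantees an
# interior minimum since the objective diverges at the boundaries.
res = minimize_scalar(sigma2_plus, bounds=(-2.0, 4.0), method="bounded")
t_star = res.x
Qhat_star = q * P.sf(t_star) + (1 - q) * N.sf(t_star)
print(t_star, Qhat_star, res.fun)
```

The resulting $\hat{Q}^\star$ fixes the optimal domain, and Eq. (3) evaluated on that domain gives the minimum-variance prevalence estimate.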

4 Validation and Discussion

4.1 Example Applied to a SARS-CoV-2 Antibody Test

The left plot of Fig. 1 shows training data and probability models for a SARS-CoV-2 immunoglobulin G (IgG) receptor-binding-domain (RBD) assay developed in Ref. [4]. The data were normalized according to the procedure in Ref. [1]. Following that, we added $\varepsilon = 0.01$ to all values and normalized all data by the largest positive value. While not necessary, adding $\varepsilon$ facilitates model construction and otherwise has a negligible effect on the data, as it represents a perturbation of less than 1 % relative to the maximum scale of the data. We model $N(r)$ with a Burr distribution [5] and approximate $P(r)$ by a Beta distribution; maximum likelihood estimation was used to determine model parameters. The Burr distribution was truncated to the domain $0 < r \leq 1$ and renormalized to unit probability. The right plot of Fig. 1 shows the bathtub function $q[P(r)-N(r)]/N(r)$ for a prevalence of $q = 30/130$. The figure illustrates that for a given $\Delta_-$, the domain $D_-$ corresponds to the set of all $r$ for which $q[P(r)-N(r)]/N(r) < \Delta_-$. The corresponding value of $\hat{Q}$ is given by the integral of $Q(r)$ over $D_-$.
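The “filling” procedure shown in the right panel of Fig. 1 can be sketched numerically. Because the fitted Burr and Beta parameters are not reproduced here, the example below substitutes hypothetical Gaussian models; it discretizes the measurement space and bisects on $\Delta_-$ until the $Q$-measure of $D_-$ reaches a target $\hat{Q}$.

```python
import numpy as np
from scipy import stats

# Hypothetical models standing in for the fitted N(r) and P(r).
P = stats.norm(2.0, 0.7)
N = stats.norm(0.0, 0.7)
q = 0.25

# Discretize the measurement space and evaluate the bathtub function.
r = np.linspace(-5.0, 7.0, 20001)
dr = r[1] - r[0]
Q = q * P.pdf(r) + (1 - q) * N.pdf(r)
bathtub = q * (P.pdf(r) - N.pdf(r)) / N.pdf(r)

def Qhat_of(Delta):
    """Q-measure of D_-(Delta) = {r : bathtub(r) < Delta}."""
    return np.sum(Q[bathtub < Delta]) * dr

# "Fill the tub": bisect on Delta_- until the Q-measure hits the target.
target = 0.5
d_lo, d_hi = bathtub.min(), bathtub.max()
for _ in range(60):
    mid = 0.5 * (d_lo + d_hi)
    if Qhat_of(mid) < target:
        d_lo = mid
    else:
        d_hi = mid
Delta_minus = 0.5 * (d_lo + d_hi)
print(Delta_minus, Qhat_of(Delta_minus))
```

The bisection works because $\hat{Q}$ is a monotone function of the water level $\Delta$, exactly as in the proof of Lemma 1.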

Figure 2: Left: the differences $P_{D_+} - N_{D_+}$ and $N_{D_-} - P_{D_-}$ as functions of $\hat{Q}$. Note that the differences are positive on the interior of the domain and reflect the symmetry implied by Eq. (13). Right: the variances $\sigma^2_+(\hat{Q})$ and $\sigma^2_-(\hat{Q})$ versus $\hat{Q}$. Note the divergence of both functions as $\hat{Q}\to 0$ and $\hat{Q}\to 1$, consistent with Lemma 3. Moreover, the variances exhibit the same symmetry as in the left plot.

Figure 2 illustrates several results derived in Lemma 3. The left plot embodies the symmetry expressed in Eq. (13), as well as the fact that the differences $P_{D_+} - N_{D_+}$ and $N_{D_-} - P_{D_-}$ are positive on the open set $0 < \hat{Q} < 1$. The right plot illustrates that $\sigma^2_+(\hat{Q}) = \sigma^2_-(1-\hat{Q})$, as well as the divergence as $\hat{Q}\to 0$ and $\hat{Q}\to 1$. Moreover, the objective function is continuous and bounded from below.

4.2 Limitations and Open Directions

The analysis presented herein makes no essential reference to positive or negative populations; the functions $P(r)$ and $N(r)$ could describe two different populations having nothing to do with diagnostics. Our main result therefore has implications for the broader problem of estimating the relative fractions of different populations.

Our method requires empirical modeling of training data, which may introduce uncertainty associated with the choice of distributions. Indeed, the analysis herein is idealized insofar as it assumes that $P(r)$ and $N(r)$ are known exactly. Addressing such issues is necessary to fully understand all sources of uncertainty in prevalence estimates.

In a related vein, construction of the optimal prevalence-estimation domains requires a priori knowledge of $q$ itself. Since this quantity is often the very thing being estimated, a practical algorithm for minimizing uncertainty in $q$ is to guess a trial domain based on any prior information, estimate $q$, update the domain, and repeat. While this approach may never converge in a strict mathematical sense, it should provide reasonable estimates motivated by the theory herein.
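The iterative scheme just described can be sketched directly. All models and parameters below are hypothetical (equal-variance Gaussians, half-line domains); the loop guesses a prevalence, builds the corresponding minimum-variance domain, re-estimates $q$ via Eq. (3), and repeats.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical models (illustrative only); D is a half-line (t, inf),
# which is the optimal-set shape when the bathtub function is monotone.
P = stats.norm(2.0, 0.7)
N = stats.norm(0.0, 0.7)
q_true = 0.30

# Synthetic test population drawn from the mixture Q.
M = 100_000
is_pos = rng.random(M) < q_true
r = np.where(is_pos, P.rvs(M, random_state=rng), N.rvs(M, random_state=rng))

def best_threshold(q):
    """Minimize the model variance (4) over thresholds for a trial q."""
    t = np.linspace(-1.0, 4.0, 2001)
    Qhat = q * P.sf(t) + (1 - q) * N.sf(t)
    diff = P.sf(t) - N.sf(t)
    return t[np.argmin(Qhat * (1 - Qhat) / diff**2)]

# Guess a prevalence, estimate, and update the domain a few times. The
# estimator (3) is unbiased for any admissible domain, so each pass only
# refines the variance, not the centering.
q_est = 0.5
for _ in range(5):
    t = best_threshold(q_est)
    Q_D = np.mean(r > t)
    q_est = (Q_D - N.sf(t)) / (P.sf(t) - N.sf(t))
print(q_est)
```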

Acknowledgements: This work is a contribution of the National Institute of Standards and Technology and is therefore not subject to copyright in the United States.

Use of all data deriving from human subjects was approved by the NIST Research Protections Office.

References

  • [1] P. N. Patrone, A. J. Kearsley, Classification under uncertainty: data analysis for diagnostic antibody testing, Mathematical Medicine and Biology: A Journal of the IMA (2021).
  • [2] L. Böttcher, M. R. D’Orsogna, T. Chou, A statistical model of COVID-19 testing in populations: effects of sampling bias and testing errors, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 380 (2021).
  • [3] E. H. Lieb, M. Loss, Analysis, American Mathematical Society, 2001.
  • [4] T. Liu, J. Hsiung, S. Zhao, J. Kost, D. Sreedhar, C. V. Hanson, K. Olson, D. Keare, S. T. Chang, K. P. Bliden, P. A. Gurbel, U. S. Tantry, J. Roche, C. Press, J. Boggs, J. P. Rodriguez-Soto, J. G. Montoya, M. Tang, H. Dai, Quantification of antibody avidities and accurate detection of SARS-CoV-2 antibodies in serum and saliva on plasmonic substrates, Nature Biomedical Engineering 4 (12) (2020) 1188–1196.
  • [5] I. W. Burr, Cumulative frequency functions, The Annals of Mathematical Statistics 13 (2) (1942) 215–232.