
Mixture Proportion Estimation Beyond Irreducibility

Yilun Zhu    Aaron Fjeldsted    Darren Holland    George Landon    Azaree Lintereur    Clayton Scott
Abstract

The task of mixture proportion estimation (MPE) is to estimate the weight of a component distribution in a mixture, given observations from both the component and mixture. Previous work on MPE adopts the irreducibility assumption, which ensures identifiability of the mixture proportion. In this paper, we propose a more general sufficient condition that accommodates several settings of interest where irreducibility does not hold. We further present a resampling-based meta-algorithm that takes any existing MPE algorithm designed to work under irreducibility and adapts it to work under our more general condition. Our approach empirically exhibits improved estimation performance relative to baseline methods and to a recently proposed regrouping-based algorithm.

Machine Learning, Mixture Proportion Estimation, Weakly Supervised Learning

1 Introduction

Mixture proportion estimation (MPE) is the problem of estimating the weight of a component distribution in a mixture. Specifically, let $\kappa^{*}\in[0,1]$ and let $F$, $G$, and $H$ be probability distributions such that $F=(1-\kappa^{*})G+\kappa^{*}H$. Given i.i.d. observations

$$X_{H}:=\left\{x_{1},x_{2},\cdots,x_{m}\right\}\stackrel{iid}{\sim}H,\qquad X_{F}:=\left\{x_{m+1},x_{m+2},\cdots,x_{m+n}\right\}\stackrel{iid}{\sim}F, \qquad (1)$$

MPE is the problem of estimating $\kappa^{*}$. A typical application: given some labeled positive reviews $X_{H}$, estimate the proportion of positive comments about a product among all comments $X_{F}$ (González et al., 2017). MPE is also an important component in solving several domain adaptation and weakly supervised learning problems, such as learning from positive and unlabeled examples (LPUE) (Elkan & Noto, 2008; Du Plessis et al., 2014; Kiryo et al., 2017), learning with noisy labels (Lawrence & Schölkopf, 2001; Natarajan et al., 2013; Blanchard et al., 2016), multi-instance learning (Zhang & Goldman, 2001), and anomaly detection (Sanderson & Scott, 2014).

If no assumptions are made on the unobserved component $G$, then $\kappa^{*}$ is not identifiable. Blanchard et al. (2010) proposed the irreducibility assumption on $G$ so that $\kappa^{*}$ becomes identifiable. Up to now, almost all MPE algorithms build upon the irreducibility assumption (Blanchard et al., 2010; Scott, 2015; Blanchard et al., 2016; Jain et al., 2016; Ramaswamy et al., 2016; Ivanov, 2020; Bekker & Davis, 2020; Garg et al., 2021), or some stricter conditions like non-overlapping support of the component distributions (Elkan & Noto, 2008; Du Plessis & Sugiyama, 2014). However, as we discuss below, irreducibility can be violated in several applications, in which case the above methods produce statistically inconsistent estimates. As far as we know, Yao et al. (2022), discussed in Sec. 5, is the first attempt to move beyond irreducibility.

This work proposes a more general sufficient condition than irreducibility, and offers a practical algorithm for estimating $\kappa^{*}$ under this condition. We introduce a meta-algorithm that takes as input any MPE method that consistently estimates $\kappa^{*}$ under irreducibility, and removes the bias of that method whenever irreducibility does not hold but our more general sufficient condition does. Furthermore, even if our new sufficient condition is not satisfied, our meta-algorithm will not increase the bias of the underlying MPE method. We describe several applications and settings where our framework is relevant, and demonstrate the practical relevance of this framework through extensive experiments. Proofs and additional details can be found in the appendices.

2 Problem Setup and Background

Let $G$ and $H$ be probability measures on a measurable space $(\mathcal{X},\mathfrak{S})$, and let $F$ be a mixture of $G$ and $H$:

$$F=(1-\kappa^{*})G+\kappa^{*}H, \qquad (2)$$

where $0\leq\kappa^{*}\leq 1$. With no assumptions on $G$, $\kappa^{*}$ is not uniquely determined by $F$ and $H$. For example, suppose $F=(1-\kappa^{*})G+\kappa^{*}H$ for some $G$, and take any $\delta\in[0,\kappa^{*}]$. Then $F=(1-\kappa^{*}+\delta)G^{\prime}+(\kappa^{*}-\delta)H$, where $G^{\prime}=(1-\kappa^{*}+\delta)^{-1}\left[(1-\kappa^{*})G+\delta H\right]$, has a different proportion on $H$ (Blanchard et al., 2010).
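The non-identifiability argument above can be verified numerically. Below is a minimal sketch on a discrete domain of three outcomes; all of the specific distributions and the value of $\delta$ are illustrative choices, not taken from the paper.

```python
import numpy as np

# Discrete toy example: two different decompositions yield the same mixture F.
G = np.array([0.6, 0.3, 0.1])
H = np.array([0.0, 0.2, 0.8])
kappa = 0.5
F = (1 - kappa) * G + kappa * H

# Shift a weight delta of H into the unknown component, as in the text.
delta = 0.2
G_prime = ((1 - kappa) * G + delta * H) / (1 - kappa + delta)
F_alt = (1 - kappa + delta) * G_prime + (kappa - delta) * H

assert np.allclose(F, F_alt)  # same mixture F, different proportion on H
```

Both decompositions reproduce $F$ exactly, so $F$ and $H$ alone cannot pin down $\kappa^{*}$.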

2.1 Ground-Truth and Maximal Proportion

To address the lack of identifiability, Blanchard et al. (2010) introduced the so-called irreducibility assumption. We now recall this definition and related concepts. Throughout this work we assume that $F$, $G$ and $H$ have densities $f$, $g$ and $h$, defined w.r.t. a common dominating measure $\mu$.

Definition 2.1 (Blanchard et al. (2010)).

For any two probability distributions $F$ and $H$, define

$$\kappa(F|H):=\sup\{\kappa\in[0,1]\;|\;F=(1-\kappa)G^{\prime}+\kappa H,\ \text{for some distribution }G^{\prime}\},$$

the maximal proportion of $H$ in $F$.

This quantity equals the infimum of the likelihood ratio:

Proposition 2.2 (Blanchard et al. (2010)).

It holds that

$$\kappa(F|H)=\inf_{S\in\mathfrak{S}:H(S)>0}\frac{F(S)}{H(S)}=\operatorname*{ess\,inf}_{x:h(x)>0}\frac{f(x)}{h(x)}. \qquad (3)$$

By substituting $F=(1-\kappa^{*})G+\kappa^{*}H$ into Eqn. (3), we get that

$$\kappa(F|H)=\inf_{S\in\mathfrak{S}:H(S)>0}\frac{F(S)}{H(S)}=\kappa^{*}+(1-\kappa^{*})\inf_{S\in\mathfrak{S}:H(S)>0}\frac{G(S)}{H(S)}=\kappa^{*}+(1-\kappa^{*})\kappa(G|H). \qquad (4)$$

Since $\kappa(F|H)$ is identifiable from $F$ and $H$, the following assumption on $G$ ensures identifiability of $\kappa^{*}$.
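The identity in Eqn. (4) can be checked directly on a discrete example. The sketch below uses illustrative distributions in which $G$ is not irreducible w.r.t. $H$ ($\kappa(G|H)>0$), so $\kappa(F|H)$ strictly exceeds $\kappa^{*}$.

```python
import numpy as np

# Check Proposition 2.2 / Eqn. (4): kappa(F|H) is the minimum of f/h over
# {h > 0}, and equals kappa* + (1 - kappa*) * kappa(G|H).
G = np.array([0.4, 0.3, 0.2, 0.1])
H = np.array([0.0, 0.1, 0.4, 0.5])
kappa_star = 0.3
F = (1 - kappa_star) * G + kappa_star * H

def kappa_max(f, h):
    """Maximal proportion of h in f on a discrete domain (Eqn. (3))."""
    mask = h > 0
    return np.min(f[mask] / h[mask])

assert kappa_max(G, H) > 0  # irreducibility fails
assert np.isclose(kappa_max(F, H),
                  kappa_star + (1 - kappa_star) * kappa_max(G, H))
```

Here $\kappa(G|H)=0.2$, so $\kappa(F|H)=0.3+0.7\cdot 0.2=0.44>\kappa^{*}=0.3$: estimating $\kappa(F|H)$ alone overestimates $\kappa^{*}$.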

Definition 2.3 (Blanchard et al. (2010)).

We say that $G$ is irreducible with respect to $H$ if $\kappa(G|H)=0$.

Thus, irreducibility means that there exists no decomposition of the form $G=\gamma H+(1-\gamma)J^{\prime}$, where $J^{\prime}$ is some probability distribution and $0<\gamma\leq 1$. Under irreducibility, $\kappa^{*}$ is identifiable, and in particular equals $\kappa(F|H)$.

2.2 Latent Label Model

We now consider another way of understanding irreducibility in terms of a latent label model. In particular, let $X$ and $Y\in\{0,1\}$ be the random variables characterized by

  (a) $(X,Y)$ are jointly distributed;

  (b) $P(Y=1)=\kappa^{*}$;

  (c) $P_{X|Y=0}=G$ and $P_{X|Y=1}=H$.

It follows from these assumptions that the marginal distribution of $X$ is $F$:

$$P_{X}=\left(1-\kappa^{*}\right)P_{X|Y=0}+\kappa^{*}P_{X|Y=1}=F.$$

We also take the conditional probability of $Y$ given $X$ to be defined via

$$P(Y=1|X=x)=\begin{cases}\frac{\kappa^{*}h(x)}{f(x)},&f(x)>0,\\ 0,&\text{otherwise}.\end{cases} \qquad (5)$$

The latent label model is commonly used in the positive-unlabeled (PU) learning literature (Bekker & Davis, 2020). MPE is also called class prior/proportion estimation (CPE) in PU learning because $P(Y=1)=\kappa^{*}$. $Y$ may be viewed as a label indicating which component an observation from $F$ was drawn from. Going forward, we use this latent label model in addition to the original MPE notation.

Proposition 2.4.

Under the latent label model,

$$\operatorname*{ess\,sup}_{x}P(Y=1|X=x)=\frac{\kappa^{*}}{\kappa(F|H)}\,,$$

where $0/0:=0$.

By definition, we know that $G$ is not irreducible with respect to $H$ iff $\kappa(G|H)>0$. Combining Proposition 2.4 and Eqn. (4), we conclude that $\operatorname*{ess\,sup}_{x}P(Y=1|X=x)<1$ is equivalent to $\kappa(G|H)>0$.

2.3 Violation of Irreducibility

Up to now, almost all MPE algorithms assume $G$ to be irreducible w.r.t. $H$ (Blanchard et al., 2010, 2016; Jain et al., 2016; Ivanov, 2020), or stricter conditions like the anchor set assumption (Scott, 2015; Ramaswamy et al., 2016), or that $G$ and $H$ have disjoint supports (Elkan & Noto, 2008; Du Plessis & Sugiyama, 2014). These methods return an estimate of $\kappa(F|H)$ as the estimate of $\kappa^{*}$. If irreducibility does not hold and $\kappa^{*}<1$, then $\kappa(F|H)>\kappa^{*}$. Even if these methods are consistent estimators of $\kappa(F|H)$, they are asymptotically biased and thus inconsistent estimators of $\kappa^{*}$.

A sufficient condition for irreducibility to hold is that the support of $H$ is not totally contained in the support of $G$. In a classification setting where $G$ and $H$ are the class-conditional distributions, this may be quite reasonable. It essentially assumes that each class has at least some subset of instances (with positive measure) that cannot possibly be confused with the other class. While irreducibility is reasonable in many classification-related tasks, there are also a number of important applications where it does not hold. In this subsection we give three examples of applications where irreducibility is not satisfied.

Ubiquitous Background. In gamma spectroscopy, we may view $H$ as the distribution of the energy of a gamma particle emitted by some source of interest (e.g., Cesium-137), and $G$ as the energy distribution of background radiation. Typically the background emits gamma particles with a wider range of energies than the source of interest does, and therefore its distribution has a wider support: $\operatorname{supp}(H)\subset\operatorname{supp}(G)$, thus violating irreducibility. What's more, $G$ is usually unknown, because it varies according to the surrounding environment (Alamaniotis et al., 2013). The MPE problem is: given observations of the source spectrum $H$, which may be collected in a laboratory setting, and observations of $F$ in the wild, estimate $\kappa^{*}$. This quantity is important for nuclear threat detection and nuclear safeguard applications.

Global Uncertainty. In marketing, let $Y\in\{0,1\}$ denote whether a customer does ($Y=1$) or does not purchase a product (Fei et al., 2013). Let $H$ be the distribution of a feature vector extracted from a customer who buys the product, and $G$ the distribution for those who do not. The MPE problem is: given data from past purchasers of a product ($H$), and from a target population ($F$), estimate the proportion $\kappa^{*}$ of $H$ in $F$. This quantity is called the transaction rate, and is important for estimating the number of products to be sold. Irreducibility is likely to be violated here because, given a finite number of features, uncertainty about customers should remain bounded away from 1: $\forall x,\ P(Y=1|X=x)<1$. In other words, there is an $\epsilon>0$ such that, for any feature vector of demographic information, the probability of buying the product is always $<1-\epsilon$.

Underreported Outcomes. In public health, let $(X,Y,Z)$ be a jointly distributed triple, where $X$ is a feature vector, $Y\in\{0,1\}$ denotes whether a person reports a health condition or not, and $Z\in\{0,1\}$ indicates whether the person truly has the health condition. Here, $H=P_{X|Y=1}$, $G=P_{X|Y=0}$ and $F=P_{X}$. The MPE problem is: given data from previous people who reported the condition ($H$), and from a target group ($F$), determine the prevalence $\kappa^{*}$ of the condition for the target group. This helps estimate the amount of resources needed to address the health condition. Assume there are no false reports from those who do not have the medical condition: $\forall x,\ P(Y=1|Z=0,X=x)=0$. Some medical conditions are underreported, such as smoking and intimate partner violence (Gorber et al., 2009; Shanmugam & Pierson, 2021). If the underreporting happens globally, meaning $e(x):=P(Y=1|Z=1,X=x)$ is bounded away from 1, then $\operatorname*{ess\,sup}_{x}P(Y=1|X=x)<1$. This is because $P(Y=1|X=x)=P(Y=1|Z=0,X=x)P(Z=0|X=x)+P(Y=1|Z=1,X=x)P(Z=1|X=x)\leq e(x)$. In this situation, irreducibility is again violated.

In the above situations, irreducibility is violated and $\kappa(F|H)>\kappa^{*}$. Estimating $\kappa(F|H)$ alone leads to bias. In the following, we will re-examine mixture proportion estimation. In particular, we propose a more general sufficient condition than irreducibility, and introduce an estimation strategy that calls an existing MPE method and reduces or eliminates its asymptotic bias.

3 A General Identifiability Condition

Previous MPE works assume irreducibility. We propose a more general sufficient condition for recovering $\kappa^{*}$.

Theorem 3.1 (Identifiability Under Local Supremal Posterior (LSP)).

Let $A$ be any non-empty measurable subset of $E_{H}=\{x:h(x)>0\}$ and $s=\operatorname*{ess\,sup}_{x\in A}P(Y=1|X=x)$. Then

$$\kappa^{*}=s\cdot\inf_{S\subseteq A}\frac{F(S)}{H(S)}=s\cdot\operatorname*{ess\,inf}_{x\in A}\frac{f(x)}{h(x)}.$$

This implies that under LSP, $\kappa^{*}$ is identifiable. Two special cases are worthy of comment. First, irreducibility holds when $A=E_{H}$ and $s=1$, in which case the above theorem recovers the known result that $\kappa^{*}=\kappa(F|H)$. Second, when $A=\{x_{0}\}$ is a singleton, then $s=P(Y=1|X=x_{0})$ and $\kappa^{*}=P(Y=1|X=x_{0})\cdot\frac{f(x_{0})}{h(x_{0})}$, which can also be derived directly from the definition of conditional probability.

The above theorem gives a general sufficient condition for recovering $\kappa^{*}$, but estimating $\inf_{S\subseteq A}\frac{F(S)}{H(S)}$ is non-trivial: when $A=E_{H}$, it can be estimated using existing MPE methods (Blanchard et al., 2016). When $A$ is a proper subset, however, a new approach is needed. We now present a variation of Theorem 3.1 that lends itself to a practical estimation strategy without having to devise a completely new method of estimating $\inf_{S\subseteq A}\frac{F(S)}{H(S)}$.

Theorem 3.2 (Identifiability Under Tight Posterior Upper Bound).

Consider any non-empty measurable set $A\subseteq E_{H}=\{x:h(x)>0\}$, and let $s=\operatorname*{ess\,sup}_{x\in A}P(Y=1|X=x)$. Let $\alpha(x)$ be any measurable function satisfying

$$\alpha(x)\in\begin{cases}\left[P(Y=1|X=x),\,s\right],&x\in A,\\ \left[P(Y=1|X=x),\,1\right],&\text{o.w.}\end{cases} \qquad (6)$$

Define a new distribution $\widetilde{F}$ in terms of its density

$$\widetilde{f}(x)=\frac{1}{c}\cdot\alpha(x)\cdot f(x),\quad\text{where}\quad c=\int\alpha(x)f(x)\,dx=\mathbb{E}_{X\sim F}\left[\alpha(X)\right]. \qquad (7)$$

Then

$$\kappa^{*}=c\cdot\kappa(\widetilde{F}|H).$$

The theorem can be re-stated as: $\kappa^{*}$ is identifiable given an upper bound $\alpha(x)\geq P(Y=1|X=x)$ on the posterior probability that is tight for some $x\in A$. One possible choice for $\alpha(x)$ is simply

$$\alpha(x)=\begin{cases}s,&x\in A,\\ 1,&\text{o.w.}\end{cases}$$

If the conditional probability $P(Y=1|X=x)$ is known for all $x\in A$, then

$$\alpha(x)=\begin{cases}P(Y=1|X=x),&x\in A,\\ 1,&\text{o.w.},\end{cases}$$

may be chosen.
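Theorem 3.2 can be verified numerically. The sketch below reuses a discrete example where irreducibility fails, takes $A$ to be a single point, and uses the simple choice of $\alpha(x)$ ($s$ on $A$, 1 elsewhere); the specific numbers are illustrative.

```python
import numpy as np

# Numerical check of Theorem 3.2 on a discrete example with kappa(G|H) > 0.
G = np.array([0.4, 0.3, 0.2, 0.1])
H = np.array([0.0, 0.1, 0.4, 0.5])
kappa_star = 0.3
F = (1 - kappa_star) * G + kappa_star * H

# Posterior P(Y=1|X=x) from Eqn. (5).
posterior = np.where(F > 0, kappa_star * H / np.maximum(F, 1e-12), 0.0)

# A = {last point}; alpha = s on A, 1 elsewhere.
alpha = np.ones(4)
alpha[3] = posterior[3]            # s = ess sup over A of P(Y=1|X=x)

c = np.sum(alpha * F)              # normalizing constant, Eqn. (7)
F_tilde = alpha * F / c            # density of the subsampled distribution

mask = H > 0
kappa_tilde = np.min(F_tilde[mask] / H[mask])
assert np.isclose(c * kappa_tilde, kappa_star)   # kappa* = c * kappa(F~|H)
```

Although $\kappa(F|H)=0.44$ here, the rescaled proportion $c\cdot\kappa(\widetilde{F}|H)$ recovers $\kappa^{*}=0.3$ exactly.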

Having $\alpha(x)$ satisfy Eqn. (6) ensures identifiability of $\kappa^{*}$. Relaxing this requirement slightly still guarantees that the bias will not increase.

Corollary 3.3.

Let $\alpha(x)$ be any measurable function with

$$\alpha(x)\in[P(Y=1|X=x),1]\quad\forall x. \qquad (8)$$

Define a new distribution $\widetilde{F}$ in terms of its density $\widetilde{f}$ according to Eqn. (7). Then

$$\kappa^{*}\leq c\cdot\kappa(\widetilde{F}|H)\leq\kappa(F|H).$$

This shows that even if we have a non-tight upper bound on $P(Y=1|X=x)$, the quantity $c\cdot\kappa(\widetilde{F}|H)$ is still bounded by $\kappa(F|H)$. Therefore, a smaller asymptotic bias may be achieved by estimating $c\cdot\kappa(\widetilde{F}|H)$ instead of $\kappa(F|H)$.

The intuition underlying the above results is that the new distribution $\widetilde{F}$ is generated by throwing away some probability mass from $G$, and therefore can be viewed as a mixture of $H$ and a new $\widetilde{G}$, where $\widetilde{G}$ tends to be irreducible w.r.t. $H$. The proportion $\kappa(\widetilde{F}|H)$ relates to the original proportion $\kappa^{*}$ by a scaling constant $c$. This interpretation is supported mathematically in Appendix B.1.

4 Subsampling MPE (SuMPE)

Theorem 3.2 directly motivates a practical algorithm. We obtain a new distribution $\widetilde{F}$ from $F$ by rejection sampling (MacKay, 2003), a Monte Carlo method that generates a sample following a new distribution $\widetilde{Q}$ from a sample drawn from a distribution $Q$, in terms of their densities $\widetilde{q}$ and $q$. An instance $x$ drawn from $q(x)$ is kept with acceptance probability $\beta(x)\in[0,1]$, and rejected otherwise. Appendix B.2 shows the detailed procedure. In our scenario, $\widetilde{Q}=\widetilde{F}$, $Q=F$ and $\beta(x)=\alpha(x)$.

4.1 Method

Our Subsampling MPE algorithm, SuMPE (Algorithm 1), follows directly from Theorem 3.2. It first obtains in line 3 a data sample $X_{\widetilde{F}}$ following distribution $\widetilde{F}$ using rejection sampling, and in line 4 estimates the normalizing constant $c$. Then in line 5, it computes an estimate $\widehat{\kappa}(\widetilde{F}|H)$ using any existing MPE method that consistently estimates the mixture proportion under irreducibility. The final estimate is returned as the product of $\widehat{c}$ and $\widehat{\kappa}(\widetilde{F}|H)$.

Rejection sampling in high-dimensional settings may be inefficient due to a potentially low acceptance rate (MacKay, 2003). However, this concern is mitigated in our setting because the acceptance rate can be taken to be 1 except on the set $A$, which is potentially a small set.

Algorithm 1 Subsampling MPE (SuMPE)
1:  Input: $X_{F}$: sample drawn i.i.d. from $F$; $X_{H}$: sample drawn i.i.d. from $H$; $\alpha(x)$: acceptance function
2:  Output: estimate of $\kappa^{*}$
3:  Generate $X_{\widetilde{F}}$ from $X_{F}$ by rejection sampling (Algorithm 4), with acceptance function $\alpha(x)$.
4:  Compute $\widehat{c}=|X_{\widetilde{F}}|/|X_{F}|$, where $|\cdot|$ denotes the cardinality of a set.
5:  Apply an off-the-shelf MPE algorithm to produce an estimate $\widehat{\kappa}(\widetilde{F}|H)$ from $X_{\widetilde{F}}$ and $X_{H}$.
6:  return $\widehat{c}\cdot\widehat{\kappa}(\widetilde{F}|H)$
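The steps of Algorithm 1 can be sketched end-to-end on a discrete domain. The plug-in histogram-ratio estimate of $\kappa(\widetilde{F}|H)$ below is a simple stand-in for the consistent off-the-shelf MPE methods cited in the paper, and the distributions and acceptance function are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete toy instance where irreducibility fails.
domain = np.arange(4)
G = np.array([0.4, 0.3, 0.2, 0.1])
H = np.array([0.0, 0.1, 0.4, 0.5])
kappa_star = 0.3
F = (1 - kappa_star) * G + kappa_star * H

X_F = rng.choice(domain, size=200_000, p=F)
X_H = rng.choice(domain, size=200_000, p=H)

# Acceptance function: the true posterior on A = {3}, 1 elsewhere.
alpha = np.array([1.0, 1.0, 1.0, kappa_star * H[3] / F[3]])

# Line 3: rejection sampling.
keep = rng.random(X_F.size) < alpha[X_F]
X_Ftilde = X_F[keep]
# Line 4: estimate of the normalizing constant c.
c_hat = X_Ftilde.size / X_F.size
# Line 5: plug-in MPE estimate of kappa(F~|H) from empirical frequencies.
f_tilde = np.bincount(X_Ftilde, minlength=4) / X_Ftilde.size
h_hat = np.bincount(X_H, minlength=4) / X_H.size
mask = h_hat > 0
kappa_hat = np.min(f_tilde[mask] / h_hat[mask])

estimate = c_hat * kappa_hat      # line 6
assert abs(estimate - kappa_star) < 0.02
```

With $2\times 10^{5}$ samples the estimate lands close to $\kappa^{*}=0.3$, whereas a direct estimate of $\kappa(F|H)$ would converge to $0.44$.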

One advantage of building our method around existing MPE methods is that we may adapt known theoretical results to our setting. To illustrate this, we give a rate of convergence result for SuMPE.

Theorem 4.1.

Assume $\alpha(x)$ satisfies the condition in Eqn. (6). After subsampling, assume the resulting $\widetilde{F}$ and $H$ are such that $\operatorname*{arg\,min}_{S\in\mathcal{A}:H(S)>0}\frac{\widetilde{F}(S)}{H(S)}$ exists. Then there exist a constant $C>0$ and an existing MPE estimator $\widehat{\kappa}$ such that, for $m$ and $n$ sufficiently large, the estimator $\widehat{\kappa}^{*}$ obtained from SuMPE (Algorithm 1) satisfies

$$\Pr\left(\left|\widehat{\kappa}^{*}-\kappa^{*}\right|\leq C\left[\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n}{n}}\right]\right)\geq 1-\mathcal{O}\left(\frac{1}{m}+\frac{1}{n}\right).$$

4.2 Practical Scenarios

Our new sufficient condition assumes knowledge of some set $A\subseteq E_{H}$ and $s=\operatorname*{ess\,sup}_{x\in A}P(Y=1|X=x)$. Practically speaking, however, our algorithm only requires an $\alpha(x)$ satisfying Eqn. (6) for some $A\subseteq E_{H}$ and the associated value of $s$; it does not require explicit knowledge of $A$ and $s$. Additionally, even if $\alpha(x)$ does not satisfy Eqn. (6), as long as it satisfies Eqn. (8) (which is easier to achieve), it will perform no worse than directly applying off-the-shelf MPE methods.

There are settings where a generic construction of $\alpha(x)$ is possible. For example, suppose the user has access to fully labeled data (where it is known which of $G$ or $H$ each instance came from), but only on a subset $A$ of the domain. This may come from an annotator who is only an expert on a subset of instances. This data should be sufficient to obtain a non-trivial upper bound on the posterior class probability $P(Y|X)$, which in turn leads to an $\alpha(x)$.

More typically, however, it may be necessary to determine $\alpha(x)$ on a case-by-case basis. This section continues the discussion of the three applications introduced in Sec. 2.3. Each of these settings leverages different domain-specific knowledge in different ways, and we believe this leads to a better $\alpha(x)$ than a one-size-fits-all construction.

4.2.1 Unfolding

Unfolding refers to the process of recovering one or more true distributions from contaminated ones (Cowan, 1998). In gamma spectrum unfolding (Li et al., 2019), a gamma ray detector measures the energies of incoming gamma rays. The gamma rays were emitted either by a source of interest or by the background. The measurement is represented as a histogram $f(x)$ whose bins correspond to a quantization of energy. In many settings, the histogram $h(x)$ of measurements from the source of interest is also known. In this case, unfolding amounts to the estimation of the unknown background histogram $g(x)$. Toward this goal, it suffices to estimate the proportion $\kappa^{*}$ of recorded particles emanating from the source of interest, since $g(x)=(f(x)-\kappa^{*}h(x))/(1-\kappa^{*})$. This application corresponds to the "ubiquitous background" setting described in Sec. 2.3, where irreducibility may not hold since the source of interest energies can be a subset of the background energies.

Using existing techniques from the nuclear detection literature (Knoll, 2010; Alamaniotis et al., 2013), we can obtain a lower bound $\rho(x)$ on the quantity $(1-\kappa^{*})g(x)$ over a certain set $A\subset\operatorname{supp}(H)$ (see Appendix B.3 for details). This leads to the acceptance function

$$\alpha(x)=\begin{cases}1-\frac{\rho(x)}{f(x)},&x\in A,\\ 1,&\text{o.w.},\end{cases} \qquad (9)$$

which is an upper bound on $P(Y=1|X=x)$ satisfying the condition in Corollary 3.3.
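The effect of this construction can be illustrated on a discrete spectrum. Below, the lower bound $\rho(x)$ is taken to be 80% of the true background mass on a region $A$ (an illustrative stand-in for a bound derived from the nuclear detection literature); Corollary 3.3 then sandwiches the resulting estimate between $\kappa^{*}$ and $\kappa(F|H)$.

```python
import numpy as np

# Sketch of the unfolding construction: a lower bound rho on (1-kappa*)g over
# a region A yields the acceptance function (9); by Corollary 3.3 the estimate
# lies between kappa* and kappa(F|H). All numbers are illustrative.
G = np.array([0.4, 0.3, 0.2, 0.1])   # background histogram
H = np.array([0.0, 0.1, 0.4, 0.5])   # source-of-interest histogram
kappa_star = 0.3
F = (1 - kappa_star) * G + kappa_star * H

rho = np.zeros(4)
rho[2:] = 0.8 * (1 - kappa_star) * G[2:]     # valid lower bound on A = {2, 3}
alpha = np.where(rho > 0, 1 - rho / F, 1.0)  # Eqn. (9)

c = np.sum(alpha * F)
F_tilde = alpha * F / c
mask = H > 0
estimate = c * np.min(F_tilde[mask] / H[mask])
kappa_FH = np.min(F[mask] / H[mask])
assert kappa_star <= estimate <= kappa_FH    # Corollary 3.3 sandwich
```

Here the non-tight bound yields an estimate of about $0.33$: still biased, but much closer to $\kappa^{*}=0.3$ than $\kappa(F|H)=0.44$.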

4.2.2 CPE under domain adaptation

In the problem of domain adaptation, the learner is given labeled examples from a source distribution, and the task is to do inference on a potentially different target distribution. Previous work on domain adaptation mainly focuses on classification and typically makes assumptions about which of the four distributions $P_{X}$, $P_{Y|X}$, $P_{Y}$, and $P_{X|Y}$ vary between the source and target. This leads to situations such as covariate shift (where $P_{X}$ changes) (Heckman, 1979), posterior drift (where $P_{Y|X}$ changes) (Scott, 2019), prior/target shift (where $P_{Y}$ changes) (Storkey et al., 2009), and conditional shift (where $P_{X|Y}$ changes) (Zhang et al., 2013). It is also quite commonly assumed that the support of the source distribution contains the support of the target (Heckman, 1979; Bickel et al., 2009; Gretton et al., 2009; Storkey et al., 2009; Zhang et al., 2013; Scott, 2019).

We study class proportion estimation (CPE) under domain adaptation. Prior work on this topic has considered distributional assumptions like those described above (Saerens et al., 2002; Sanderson & Scott, 2014; González et al., 2017). In this work, we consider the setting where, in addition to labeled examples from the source, the learner has access to labeled positive and unlabeled data from the target. We propose a model that includes covariate shift and posterior drift as special cases. We use $P^{sr}_{XY}$ and $P^{tg}_{XY}$ to denote the source and target distributions. In MPE notation, $F=P^{tg}_{X}$, $G=P^{tg}_{X|Y=0}$ and $H=P^{tg}_{X|Y=1}$.

Definition 4.2 (CSPL).

We say that covariate shift with posterior lift (CSPL) occurs whenever

$$\forall x\in\operatorname{supp}(P^{sr}_{X})\cap\operatorname{supp}(P^{tg}_{X}),\quad P^{sr}(Y=1|X=x)\geq P^{tg}(Y=1|X=x),$$

and "$=$" holds for some $x\in\operatorname{supp}(P^{sr}_{X})\cap\operatorname{supp}(P^{tg}_{X|Y=1})$.

Covariate shift is a special case of CSPL when equality always holds. One motivation for posterior lift is to model labels produced by an annotator who is biased toward one class. It is a type of posterior drift model wherein the posterior changes from source to target (Scott, 2019; Cai & Wei, 2021; Maity et al., 2023). Also notice that CSPL requires neither that the support of the source distribution contain the target, nor irreducibility.

CSPL is motivated by the marketing application mentioned in Sec. 2.3. In marketing, companies often have access to labeled data from a source distribution, such as survey results where customers express their interest in a product. Additionally, they also have access to labeled positive and unlabeled data from the target distribution, which corresponds to actual purchasing behavior. In this scenario, the CSPL assumption is often met, as it is more likely for customers to express interest than to actually make a purchase: $P^{sr}(Y=1|X=x)\geq P^{tg}(Y=1|X=x)$.

Although irreducibility is violated in the marketing application due to the "global uncertainty" about a target customer buying the product (see Sec. 2.3), CSPL ensures the identifiability of $\kappa^{*}=P^{tg}(Y=1)$ because we can choose the set $A=\operatorname{supp}(P^{sr}_{X})\cap\operatorname{supp}(P^{tg}_{X|Y=1})$ and the acceptance function as

$$\alpha(x)=\begin{cases}P^{sr}(Y=1|X=x),&x\in A,\\ 1,&\text{o.w.},\end{cases} \qquad (10)$$

which satisfies the identifiability criteria in Theorem 3.2.¹ By using the labeled source data, an estimate of $P^{sr}(Y=1|X=x)$ can be obtained and used as the acceptance function $\alpha(x)$ in Algorithm 1 to perform CPE.

¹ To see this, take $A^{\prime}=\{x\in A:P^{sr}(Y=1|X=x)=P^{tg}(Y=1|X=x)\}$ and $s^{\prime}=\operatorname*{ess\,sup}_{x\in A^{\prime}}P^{sr}(Y=1|X=x)=\operatorname*{ess\,sup}_{x\in A^{\prime}}P^{tg}(Y=1|X=x)$. Then $A^{\prime}$ and $s^{\prime}$ are the $A$ and $s$ in Theorem 3.2.
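The CSPL construction can be checked on a small discrete instance. In the sketch below, the source posterior upper-bounds the target posterior everywhere, with equality at one point; all of the specific probabilities are illustrative.

```python
import numpy as np

# Check that the CSPL acceptance function (10) recovers kappa* on a toy
# instance; the source posterior dominates the target posterior, tight at x=2.
post_src = np.array([0.9, 0.8, 0.7])   # P^sr(Y=1|X=x)
post_tgt = np.array([0.5, 0.4, 0.7])   # P^tg(Y=1|X=x), <= post_src
F = np.array([0.3, 0.3, 0.4])          # target marginal P^tg_X

kappa_star = np.sum(post_tgt * F)      # P^tg(Y=1)
H = post_tgt * F / kappa_star          # target positive-class density

alpha = post_src                       # Eqn. (10) with A = full support
c = np.sum(alpha * F)
F_tilde = alpha * F / c
kappa_tilde = np.min(F_tilde / H)
assert np.isclose(c * kappa_tilde, kappa_star)
```

In practice `post_src` would be replaced by a posterior estimate fit on the labeled source data, as described above.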

4.2.3 Selected/Reported at Random

In public health, $(X,Y,Z)$ are jointly distributed, where $X$ is the feature vector, $Y\in\{0,1\}$ denotes whether a person reports a medical condition or not, and $Z\in\{0,1\}$ indicates whether the person truly has the medical condition. The goal is to estimate the proportion of people in $X_{F}$ that report the medical condition. This setting was described in Sec. 2.3 as "underreported outcomes," where it was argued that irreducibility may not hold, in which case estimating $\kappa(F|H)$ overestimates the true value of $\kappa^{*}$. Our SuMPE framework provides a way to eliminate the bias.

The behavior of underreporting can be captured using the selection bias model (Kato et al., 2018; Bekker et al., 2019; Gong et al., 2021). Denote the probability of reporting as $e(x):=P(Y=1|X=x,Z=1)$. Assume there are no false reports: $\forall x,\ P(Y=1|X=x,Z=0)=0$. We use the notation $p(x|\Omega)$ to indicate the conditional density of $X$ given the event $\Omega$. Then $p(x|Y=1)=\frac{e(x)}{\nu}p(x|Z=1)$, where $\nu=P(Y=1|Z=1)$. Under this model, the density of the marginal distribution $P_{X}$ can be decomposed as

$$p(x)=(1-\alpha)\cdot p(x|Z=0)+\alpha\cdot p(x|Z=1)=(1-\alpha\nu)\cdot p(x|Y=0)+\alpha\nu\cdot p(x|Y=1),$$

where $\alpha=P(Z=1)$ is the proportion of people having the medical condition. The mixture proportion to be estimated is $\kappa^{*}=P(Y=1)=P(Y=1,Z=1)=P(Z=1)P(Y=1|Z=1)=\alpha\nu$.

We assume access to i.i.d. samples from $H=P_{X|Y=1}$, representing public survey data in which people report the presence of the medical condition, and from $F=P_{X}$, representing the target group. Further assume that $A\subseteq\{x:P(Z=1|X=x)=1\}$. This is a subset of patients who are guaranteed to have the condition, which could be obtained from historical patient data from a hospital. Then the mixture proportion $\kappa^{*}=\alpha\nu$ can be recovered from Algorithm 1, where the acceptance function is

$$\alpha(x)=\begin{cases}e(x),&x\in A,\\ 1,&\text{o.w.}\end{cases} \qquad (11)$$

This $\alpha(x)$ satisfies the condition in Theorem 3.2, because under the no-false-report assumption, $\forall x\in A$, $P(Y=1|X=x)=P(Y=1,Z=1|X=x)=P(Y=1|X=x,Z=1)\cdot P(Z=1|X=x)=e(x)\cdot 1=e(x)$.² In practice, $e(x)$ can be estimated from labeled examples $(X,Y,Z=1)$.

² Take $A^{\prime}=\{x\in A:e(x)>0\}$ and $s^{\prime}=\operatorname*{ess\,sup}_{x\in A^{\prime}}e(x)=\operatorname*{ess\,sup}_{x\in A^{\prime}}P(Y=1|X=x)$. Then $A^{\prime}$ and $s^{\prime}$ are the $A$ and $s$ in Theorem 3.2.
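The recovery of $\kappa^{*}=\alpha\nu$ under the selection bias model can be verified numerically. The sketch below uses a three-point domain where the last point implies $Z=1$ (so it can serve as $A$); all numbers are illustrative.

```python
import numpy as np

# Verify that the acceptance function (11) recovers kappa* = alpha * nu
# under the selection bias model on a discrete toy example.
p_x_z1 = np.array([0.2, 0.3, 0.5])   # p(x | Z=1)
p_x_z0 = np.array([0.6, 0.4, 0.0])   # p(x | Z=0); x = 2 implies Z = 1
e = np.array([0.6, 0.5, 0.7])        # reporting probability e(x)
a = 0.4                              # alpha = P(Z=1)

nu = np.sum(e * p_x_z1)              # P(Y=1 | Z=1)
kappa_star = a * nu
H = e * p_x_z1 / nu                  # p(x | Y=1)
F = (1 - a) * p_x_z0 + a * p_x_z1    # marginal p(x)

alpha_fn = np.array([1.0, 1.0, e[2]])  # Eqn. (11) with A = {2}
c = np.sum(alpha_fn * F)
F_tilde = alpha_fn * F / c
kappa_tilde = np.min(F_tilde / H)
assert np.isclose(c * kappa_tilde, kappa_star)
```

On $A=\{2\}$ the posterior equals $e(x)$ exactly (no false reports and $P(Z=1|X=x)=1$), so the bound is tight and Theorem 3.2 applies.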

5 Limitation of Previous Work

Previous research by Yao et al. (2022) introduced the Regrouping MPE (ReMPE) method,³ which is built on top of any existing MPE estimator (just like our meta-algorithm). They claimed that ReMPE works as well as the base MPE method when irreducibility holds, while improving performance when it does not. In this section we offer some comments on ReMPE.

³ Yao et al. (2022) called the method Regrouping CPE (ReCPE).

Regrouping MPE in theory.

Consider any $F,H,G,\kappa^{*}$ such that Eqn. (2) holds. Write $G$ as an arbitrary mixture of two distributions, $G=\gamma G_{1}+(1-\gamma)G_{2}$, $\gamma\geq 0$. Then $F$ can be re-written as

$$F=(1-\kappa^{*})G+\kappa^{*}H=(1-\kappa^{*})\left[\gamma G_{1}+(1-\gamma)G_{2}\right]+\kappa^{*}H=(1-\kappa^{*})(1-\gamma)G_{2}+\underbrace{\left[(1-\kappa^{*})\gamma G_{1}+\kappa^{*}H\right]}_{\text{Regrouped}}=(1-\kappa^{\prime})G_{2}+\kappa^{\prime}H^{\prime}, \qquad (12)$$

where $\kappa^{\prime}=\kappa^{*}+(1-\kappa^{*})\gamma$. Yao et al. (2022) assume there exists a set such that the infima in Eqns. (3) and (4) are achieved. They propose to specify $G_{1}$ as the truncated distribution of $G$ on the set $B$, denoted $G_{B}$, where $B=\arg\min_{S\in\mathfrak{S}}\frac{G(S)}{H(S)}$. This specific choice causes the resulting distribution $G_{2}$ to be irreducible w.r.t. $H^{\prime}$ and the bias introduced by regrouping, $(1-\kappa^{*})G(B)$, to be minimal. Denote the above procedure as ReMPE-1 (Algorithm 2). Theorem 2 in Yao et al. (2022) provides a theoretical justification for ReMPE-1, which we restate here.

Algorithm 2 ReMPE-1 (Yao et al., 2022)
1:  Input: Distributions FF and HH
2:  Obtain set B=argminS𝔖G(S)H(S)B=\arg\min_{S\in\mathfrak{S}}\frac{G(S)}{H(S)}
3:  Generate new distribution HH^{\prime} by Eqn. (12), where G1=GBG_{1}=G_{B}
4:  return κ(F|H)\kappa(F|H^{\prime})
Theorem 5.1 (Yao et al. (2022)).

Let κ(F|H)\kappa(F|H^{\prime}) be the mixture proportion obtained from ReMPE-1. 1) If GG is irreducible w.r.t. HH, then κ(F|H)=κ\kappa(F|H^{\prime})=\kappa^{*}. 2) If GG is not irreducible w.r.t. HH, then κ<κ(F|H)<κ(F|H)\kappa^{*}<\kappa(F|H^{\prime})<\kappa(F|H).

While this theorem is valid, we note that in the case where GG is irreducible w.r.t. HH, the set BB lies outside the support of GG, so it is not appropriate to describe the procedure as “regrouping GG.” In fact, any actual regrouping (γ>0\gamma>0) introduces a positive bias under irreducibility, because κ(F|H)κ>κ\kappa(F|H^{\prime})\geq\kappa^{\prime}>\kappa^{*}.
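To make this bias concrete, the following numerical sketch (toy discrete distributions of our own choosing, using the plug-in formula κ(F|H) = min over supp(H) of f(x)/h(x)) confirms on a finite support that regrouping a fraction γ > 0 of G into H shifts the recovered proportion from κ* up to exactly κ′, even though irreducibility holds:

```python
import numpy as np

def kappa(f, h):
    """Plug-in kappa(F|H) = min over the support of H of f(x)/h(x), finite support."""
    mask = h > 0
    return float(np.min(f[mask] / h[mask]))

# Toy discrete distributions on {0, 1, 2}; G is irreducible w.r.t. H
# because g/h attains 0 on the support of H.
h = np.array([0.5, 0.5, 0.0])
g = np.array([0.0, 0.2, 0.8])
kappa_star = 0.3
f = (1 - kappa_star) * g + kappa_star * h

assert abs(kappa(f, h) - kappa_star) < 1e-9  # identifiable: kappa(F|H) = kappa*

# Regroup a fraction gamma of G into H (taking G1 = G for simplicity, cf. Eqn. (12)).
gamma = 0.2
kappa_prime = kappa_star + (1 - kappa_star) * gamma
h_prime = (kappa_star * h + (1 - kappa_star) * gamma * g) / kappa_prime

print(kappa(f, h_prime))  # ~0.44 = kappa' > kappa*: regrouping is biased upward
```

The recovered value equals κ′ = κ* + (1−κ*)γ, matching the inequality κ(F|H′) ≥ κ′ > κ* stated above.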

Regrouping MPE in practice.

Yao et al. (2022)’s practical implementation of regrouping deviates from the theoretical proposal. Here, we state and analyze an idealized version of their practical algorithm, referred to as ReMPE-2 (Algorithm 3). ReMPE-2 does not rely on knowledge of κ\kappa^{*} and G(B)G(B) as outlined in Eqn. (12). Instead, the set BB is chosen based solely on FF and HH, and the distribution HH^{\prime} is obtained by regrouping some probability mass from FF rather than from GG. (Yao et al. (2022)’s actual implementation differs slightly from ReMPE-2: instead of choosing argminS𝔖F(S)H(S)\arg\min_{S\in\mathfrak{S}}\frac{F(S)}{H(S)}, they select the p=10%p=10\% of examples drawn from FF with the smallest estimated score f(x)/h(x)f(x)/h(x).)

Algorithm 3 ReMPE-2 (Yao et al., 2022)
1:  Input: Distributions FF and HH
2:  Obtain set B=argminS𝔖F(S)H(S)B=\arg\min_{S\in\mathfrak{S}}\frac{F(S)}{H(S)}
3:  Generate new distribution H=11+F(B)(FB+H)H^{\prime}=\frac{1}{1+F(B)}\left(F_{B}+H\right)
4:  return κ(F|H)\kappa(F|H^{\prime})

ReMPE-2 is fundamentally different from ReMPE-1 in how it constructs HH^{\prime}. Specifically, when the irreducibility assumption holds, ReMPE-1 regroups nothing (because G(B)=0G(B)=0), whereas ReMPE-2 still regroups a proportion F(B)F(B) from FF to HH. Therefore, Theorem 5.1 does not apply. Yao et al. (2022) did not analyze ReMPE-2, but the next result shows that it has a negative bias under irreducibility.

Proposition 5.2.

For κ(F|H)\kappa(F|H^{\prime}) obtained from ReMPE-2:

κ(F|H)<κ(F|H).\displaystyle\kappa(F|H^{\prime})<\kappa(F|H).

Thus, if irreducibility holds, then ReMPE-2 returns κ(F|H)<κ(F|H)=κ\kappa(F|H^{\prime})<\kappa(F|H)=\kappa^{*}, which is undesirable. However, when irreducibility does not hold, ReMPE-2 may lead to a smaller asymptotic bias than estimating κ(F|H)\kappa(F|H), which could explain why the authors observe empirical improvements in their results. Our theoretical analysis of ReMPE-2 is supported experimentally in Sec. 6 and Appendix D.
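The negative bias of ReMPE-2 under irreducibility can be checked numerically. Below is a small sketch (toy discrete distributions of our own choosing, with an argmin over singleton bins standing in for the argmin over 𝔖), consistent with Proposition 5.2:

```python
import numpy as np

def kappa(f, h):
    """Plug-in kappa(F|H) = min over the support of H of f(x)/h(x), finite support."""
    mask = h > 0
    return float(np.min(f[mask] / h[mask]))

# G is irreducible w.r.t. H, so kappa(F|H) = kappa*.
h = np.array([0.5, 0.5, 0.0])
g = np.array([0.0, 0.2, 0.8])
kappa_star = 0.3
f = (1 - kappa_star) * g + kappa_star * h

# ReMPE-2 (Algorithm 3), singleton bins standing in for the sets S:
# B = argmin F(S)/H(S); move the (unnormalized) mass F_B from F into H.
ratios = np.full_like(f, np.inf)
ratios[h > 0] = f[h > 0] / h[h > 0]
B = int(np.argmin(ratios))
f_B = np.zeros_like(f)
f_B[B] = f[B]
h_prime = (f_B + h) / (1 + f[B])

print(kappa(f, h))        # equals kappa* = 0.3 (irreducibility holds)
print(kappa(f, h_prime))  # strictly smaller: ReMPE-2's negative bias
```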

To summarize, Yao et al. (2022) proposed a regrouping approach that was the first attempt to tackle MPE beyond irreducibility and that motivated our work. ReMPE-1 recovers κ\kappa^{*} when irreducibility holds (although in this case it performs no regrouping) and decreases the bias when irreducibility does not hold. The more practical algorithm ReMPE-2 may decrease the bias when irreducibility does not hold, but it has a negative bias when irreducibility does hold. Like ReMPE-1, SuMPE draws on additional information beyond FF and HH. Neither meta-algorithm increases the bias, and both recover κ\kappa^{*} when irreducibility holds. Unlike ReMPE-1, however, SuMPE is able to recover κ\kappa^{*} under a more general condition. Furthermore, our practical implementations of subsampling are based directly on Theorem 3.2, unlike ReMPE-2, which lacks the desirable theoretical properties of ReMPE-1. Finally, as we argue in the next section, SuMPE offers significant empirical performance gains.

One limitation of our SuMPE framework is that some knowledge of P(Y|X)P(Y|X) is needed and that α(x)\alpha(x) may need to be developed specifically for different applications.

6 Experiments

We ran our algorithm on nuclear, synthetic, and benchmark datasets taken from the UCI machine learning repository and MNIST, corresponding to the three scenarios described in Sec. 4.2. We consider four MPE algorithms: DPL (Ivanov, 2020), EN (Elkan & Noto, 2008), KM (Ramaswamy et al., 2016) and TIcE (Bekker & Davis, 2018). We compare the original versions of these methods with their regrouping (Re-()(\cdot)) and subsampling (Su-()(\cdot)) versions. All experiments in this section consider settings where irreducibility is violated. The summarized results are shown in Table 1 and detailed results are in Appendix C. Overall, the subsampling version of each MPE method outperforms the original and regrouping versions. Additional experiments where irreducibility holds are reported in Appendix D, where we find that ReMPE harms estimation performance while SuMPE does not. The implementation is available at https://github.com/allan-z/SuMPE.

Table 1: Summary of average absolute estimation error, corresponding to the test cases in Sec. 6, for the state-of-the-art MPE algorithms DPL, EN, KM and TIcE. The mean absolute error (avg[|κ^κ|]\text{avg}[\left|\widehat{\kappa}^{*}-\kappa^{*}\right|]) is reported; the smallest error among the original, regrouping and subsampling versions is bolded. (+/)(+/-) indicates that on average the estimator produces positive/negative estimation bias (sgn[avg(κ^κ)]\text{sgn}[\text{avg}(\widehat{\kappa}^{*}-\kappa^{*})]).
Setup Dataset DPL ReDPL SuDPL EN ReEN SuEN KM ReKM SuKM TIcE ReTIcE SuTIcE
Unfolding Gamma Ray 0.045+ 0.117- \bm{0.013+} 0.034+ 0.118- \bm{0.027-} 0.163+ 0.076+ \bm{0.042+} 0.095+ 0.061+ \bm{0.019+}
Domain Adaptation Synthetic 0.060+ 0.053+ \bm{0.028-} \bm{0.045-} 0.061- 0.067- 0.063+ 0.059+ \bm{0.022-} 0.128+ 0.094+ \bm{0.041+}
Mushroom 0.047+ 0.081- \bm{0.033+} 0.122+ \bm{0.078+} 0.101+ 0.067- 0.134- \bm{0.059-} 0.060+ 0.078- \bm{0.036+}
Landsat 0.046+ 0.042- \bm{0.017-} 0.141+ 0.110+ \bm{0.085+} 0.046+ 0.029+ \bm{0.016-} 0.043+ 0.044- \bm{0.035-}
Shuttle 0.037+ 0.138- \bm{0.015-} 0.090+ 0.071+ \bm{0.046+} 0.036+ 0.111- \bm{0.032-} 0.080+ 0.086- \bm{0.038+}
MNIST17 0.047+ 0.085- \bm{0.028-} 0.231+ 0.175+ \bm{0.166+} 0.041+ 0.063- \bm{0.017-} 0.090+ 0.073- \bm{0.048+}
Selected/Reported at Random Mushroom 0.047+ 0.095- \bm{0.027+} 0.119+ 0.075+ \bm{0.074+} 0.072- 0.134- \bm{0.064-} 0.047+ 0.066- \bm{0.044-}
Landsat 0.048+ 0.057- \bm{0.019-} 0.142+ 0.108+ \bm{0.092+} 0.046+ 0.033+ \bm{0.018+} 0.046+ 0.053- \bm{0.041-}
Shuttle 0.035+ 0.144- \bm{0.013-} 0.095+ 0.073+ \bm{0.056+} \bm{0.042+} 0.129- 0.044- 0.079+ 0.089- \bm{0.050+}
MNIST17 0.047+ 0.088- \bm{0.027+} 0.240+ 0.173+ \bm{0.164+} 0.038+ 0.064- \bm{0.022+} 0.096+ 0.073+ \bm{0.057+}

6.1 Unfolding: Gamma Ray Spectra Data

The gamma ray spectra data are simulated from the Monte-Carlo N-Particle (MCNP) radiation transport code (Werner et al., 2018). HH refers to the distribution of Cesium-137. GG is the background distribution, consisting of terrestrial background and Cobalt-60. The goal is to estimate the proportion of Cesium-137 in the measured spectrum FF.

H\displaystyle H p(x|Cesium)\displaystyle\sim p(x|\text{Cesium})
G\displaystyle G 0.8p(x|Cobalt)+0.2p(x|terrestrial background)\displaystyle\sim 0.8\cdot p(x|\text{Cobalt})+0.2\cdot p(x|\text{terrestrial background})
F\displaystyle F =(1κ)G+κH.\displaystyle=(1-\kappa^{*})G+\kappa^{*}H.

Sample sizes of m=n=50,000m=n=50,000 were chosen, which is a reasonable number of counts for many nuclear detection applications. The true mixture proportion κ\kappa^{*} is varied over {0.1,0.25,0.5,0.75}\left\{0.1,0.25,0.5,0.75\right\}. The random variable xx represents quantized energy, which is one-dimensional and discrete-valued; we therefore use the histogram directly as the estimate of each distribution. We choose the acceptance function α(x)\alpha(x) according to the methodology developed in Sec. 4.2.1 and Appendix B.3.

Three of the four baseline MPE methods (DPL, EN, KM) did not work well out-of-the-box in this setting. We therefore re-implemented these methods to explicitly leverage the histogram representation of the probability distributions. This also greatly sped up the KM approach. The results are summarized in Table 2.
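For intuition, a histogram-based plug-in estimate of κ(F|H) for discrete one-dimensional data can be sketched as follows. This is only a simplified illustration with synthetic channel distributions of our own choosing, not our re-implementation of DPL/EN/KM:

```python
import numpy as np

rng = np.random.default_rng(0)

def kappa_hist(x_f, x_h, n_bins):
    """Plug-in estimate of kappa(F|H) = min over bins of F(bin)/H(bin),
    computed from empirical histograms of discrete 1-D samples."""
    f_hat, _ = np.histogram(x_f, bins=n_bins, range=(0, n_bins))
    h_hat, _ = np.histogram(x_h, bins=n_bins, range=(0, n_bins))
    f_hat = f_hat / f_hat.sum()
    h_hat = h_hat / h_hat.sum()
    mask = h_hat > 0
    return float(np.min(f_hat[mask] / h_hat[mask]))

# Synthetic "spectra" over 10 discrete energy channels; G is irreducible
# w.r.t. H here because g/h hits 0 on the support of H.
h_pmf = np.array([0.0, 0.0, 0.1, 0.3, 0.4, 0.2, 0.0, 0.0, 0.0, 0.0])
g_pmf = np.array([0.3, 0.3, 0.2, 0.0, 0.0, 0.0, 0.1, 0.05, 0.03, 0.02])
kappa_star = 0.25
f_pmf = (1 - kappa_star) * g_pmf + kappa_star * h_pmf

x_h = rng.choice(10, size=50_000, p=h_pmf)
x_f = rng.choice(10, size=50_000, p=f_pmf)
print(kappa_hist(x_f, x_h, 10))  # close to kappa* = 0.25
```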

6.2 Domain Adaptation: Synthetic Data

Following the setup in Sec. 4.2.2, we specify the target conditional and marginal distributions HH, GG and FF as:

H\displaystyle H 𝒩(μ1=0,σ1=1)\displaystyle\sim\mathcal{N}(\mu_{1}=0,\sigma_{1}=1)
G\displaystyle G 0.8𝒩(μ2=3,σ2=2)+0.2𝒩(μ3=4,σ3=1)\displaystyle\sim 0.8\cdot\mathcal{N}(\mu_{2}=3,\sigma_{2}=2)+0.2\cdot\mathcal{N}(\mu_{3}=4,\sigma_{3}=1)
F\displaystyle F =(1κ)G+κH.\displaystyle=(1-\kappa^{*})G+\kappa^{*}H.

where GG is not irreducible w.r.t. HH because it contains a Gaussian component with a larger variance than HH. We draw m=n=1000m=n=1000 instances from each of HH and FF. The true mixture proportion κ\kappa^{*} is varied over {0.1,0.25,0.5,0.75}\left\{0.1,0.25,0.5,0.75\right\}.
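The data-generating process just described can be sketched as follows (with κ\kappa^{*} fixed to 0.25 for illustration; in the experiments it ranges over the values listed above):

```python
import numpy as np

rng = np.random.default_rng(0)
kappa_star = 0.25
m = n = 1000

def sample_H(size):
    # H ~ N(0, 1)
    return rng.normal(0.0, 1.0, size)

def sample_G(size):
    # G ~ 0.8 N(3, 2) + 0.2 N(4, 1): pick a component, then draw from it
    comp = rng.random(size) < 0.8
    return np.where(comp, rng.normal(3.0, 2.0, size), rng.normal(4.0, 1.0, size))

def sample_F(size):
    # F = (1 - kappa*) G + kappa* H
    from_H = rng.random(size) < kappa_star
    return np.where(from_H, sample_H(size), sample_G(size))

x_h, x_f = sample_H(m), sample_F(n)
```

(Drawing both components of full size and selecting with `np.where` is wasteful but keeps the sketch short.)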

In addition, we draw 40004000 labeled instances from the source distribution, where μ3\mu_{3} is changed to 55. We then truncate the source distribution (and therefore the source sample) to (,2](-\infty,2]. The resulting source and target distributions satisfy the CSPL assumption. A neural network with one hidden layer of 1616 neurons was trained to predict Psr(Y=1|X=x)P^{sr}(Y=1|X=x) for x(,2]x\in(-\infty,2]; thus A=(,2]A=(-\infty,2], and α(x)\alpha(x) was chosen according to Eqn. (10). This procedure was repeated 1010 times with different random seeds. Detailed results are shown in Table 3.

6.3 Domain Adaptation: Benchmark Data

In many machine learning datasets for classification (e.g., those in UCI), irreducibility is satisfied. (The datasets Mushroom, Landsat, Shuttle and MNIST were chosen because previous empirical research (Ivanov, 2020) showed that the baseline MPE methods perform well on them when the irreducibility assumption holds; since our paper focuses on eliminating the estimation bias that arises from the violation of irreducibility, datasets where the baselines perform well let us clearly observe and measure that bias.) Here we manually create datasets that violate irreducibility by uniformly sampling out 90%90\% (or 80%80\%) of the original positive data as the target positive data, with all the remaining data treated as target negative. The target conditional and marginal distributions HH, GG and FF are specified as:

H\displaystyle H p(x|Y=1)\displaystyle\sim p(x|Y=1)
G\displaystyle G γp(x|Y=1)+(1γ)p(x|Y=0)\displaystyle\sim\gamma p(x|Y=1)+(1-\gamma)p(x|Y=0)
F\displaystyle F =(1κ)G+κH.\displaystyle=(1-\kappa^{*})G+\kappa^{*}H.

We draw m=1000m=1000 instances from HH and n[1000,4000]n\in\left[1000,4000\right] instances from FF (the exact number varies by dataset, based on the number of examples available; see the provided code). The true mixture proportion κ\kappa^{*} is varied over {0.1,0.25,0.5,0.75}\left\{0.1,0.25,0.5,0.75\right\}. The proportion γ\gamma is determined by the total numbers of positive and negative data originally available in the dataset, and therefore varies case by case.

In addition, we obtain labeled instances following the source distribution, by drawing κn\kappa^{*}n data from the target positive and 0.95(1κ)n0.95(1-\kappa^{*})n data from the target negative distribution. This causes a prior shift that simulates CSPL.

A neural network with two hidden layers of 512512 neurons was trained to predict Psr(Y=1|X=x)P^{sr}(Y=1|X=x). For real-world high-dimensional data, the support is hard to know. Therefore, we choose A={x:P^sr(Y=1|X=x)>0.5}A=\{x:\widehat{P}^{sr}(Y=1|X=x)>0.5\}, because an example xx with high P^sr(Y=1|X=x)\widehat{P}^{sr}(Y=1|X=x) is more likely to lie in the support of HH. The acceptance function α(x)\alpha(x) was determined according to Eqn. (10). The above procedure was repeated 1010 times with different random seeds. Table 4 summarizes the results.
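Across these experiments, the subsampling step itself is simple rejection sampling: draw V ~ Uniform(0,1) and keep x when V ≤ α(x), then scale the base estimate by the acceptance rate c^\widehat{c}. A minimal sketch follows; the acceptance function below is a hypothetical stand-in, not the α(x) of Eqn. (10):

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample(x_f, alpha):
    """Rejection step of the subsampling meta-algorithm (as we read Algorithm 1):
    keep x with probability alpha(x); c is estimated by the acceptance rate."""
    v = rng.random(len(x_f))
    keep = v <= alpha(x_f)
    return x_f[keep], float(keep.mean())

def alpha(x):
    # hypothetical acceptance function for illustration only
    return np.clip(1.0 - 0.3 * np.abs(x), 0.1, 1.0)

x_f = rng.normal(0.0, 1.0, 10_000)
x_tilde, c_hat = subsample(x_f, alpha)
# Final estimate: kappa_star_hat = c_hat * kappa_hat(x_tilde, x_h),
# where kappa_hat is any base MPE estimator (DPL, EN, KM, TIcE, ...).
```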

6.4 Selected/Reported at Random: Benchmark Data

Recalling the setting of Sec. 4.2.3, there is a jointly distributed triple (X,Y,Z)(X,Y,Z), where YY indicates whether a condition is reported, and ZZ indicates whether the condition is actually present. For the experiments below, the data from F,GF,G, and HH are generated in the same way as the target distribution of the previous subsection. Instead of observing labeled source data, however, in this subsection we instead observe κn\kappa^{*}n instances from p(x|Z=1)p(x|Z=1), together with their labels YY, matching the setup of Sec. 4.2.3.

A neural network with two hidden layers of 512512 neurons was trained to predict e(x)=P(Y=1|X=x,Z=1)e(x)=P(Y=1|X=x,Z=1). We chose A={x:e^(x)>0.6}A=\{x:\widehat{e}(x)>0.6\} (the set AA needs to be a subset of supp(H)\operatorname{supp}(H); note that at the population level, h(x)=p(x|Y=1)e(x)p(x|Z=1)h(x)=p(x|Y=1)\propto e(x)\cdot p(x|Z=1), thus {x:e(x)>0.6}{x:e(x)>0}={x:h(x)>0}\{x:e(x)>0.6\}\subseteq\{x:e(x)>0\}=\{x:h(x)>0\}, where the last equality holds because e(x)=P(Y=1|X=x,Z=1)e(x)=P(Y=1|X=x,Z=1); in practice we replace e(x)e(x) with e^(x)\widehat{e}(x)), and α(x)\alpha(x) according to Eqn. (11). The above procedure was repeated 1010 times with different random seeds. The results are shown in Table 5.

7 Conclusion

This work introduces a more general identifiability condition than irreducibility for mixture proportion estimation. We also propose a subsampling-based framework that achieves bias reduction/elimination for baseline MPE algorithms. Theoretically, our work expands the scope of settings where MPE can be solved. Practically, we illustrate three scenarios where irreducibility fails, and our meta-algorithm successfully improves upon baseline MPE methods.

Acknowledgements

This work was supported in part by the National Science Foundation under award 2008074 and the Department of Defense, Defense Threat Reduction Agency under award HDTRA1-20-2-0002.

References

  • Alamaniotis et al. (2013) Alamaniotis, M., Mattingly, J., and Tsoukalas, L. H. Kernel-based machine learning for background estimation of nai low-count gamma-ray spectra. IEEE Transactions on Nuclear Science, 60(3):2209–2221, 2013.
  • Bekker & Davis (2018) Bekker, J. and Davis, J. Estimating the class prior in positive and unlabeled data through decision tree induction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Bekker & Davis (2020) Bekker, J. and Davis, J. Learning from positive and unlabeled data: A survey. Machine Learning, 109(4):719–760, 2020.
  • Bekker et al. (2019) Bekker, J., Robberechts, P., and Davis, J. Beyond the selected completely at random assumption for learning from positive and unlabeled data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp.  71–85. Springer, 2019.
  • Bickel et al. (2009) Bickel, S., Brückner, M., and Scheffer, T. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(9), 2009.
  • Blanchard et al. (2010) Blanchard, G., Lee, G., and Scott, C. Semi-supervised novelty detection. The Journal of Machine Learning Research, 11:2973–3009, 2010.
  • Blanchard et al. (2016) Blanchard, G., Flaska, M., Handy, G., Pozzi, S., Scott, C., et al. Classification with asymmetric label noise: Consistency and maximal denoising. Electronic Journal of Statistics, 10(2):2780–2824, 2016.
  • Cai & Wei (2021) Cai, T. T. and Wei, H. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 49(1):100–128, 2021.
  • Cowan (1998) Cowan, G. Statistical data analysis. Oxford university press, 1998.
  • Du Plessis & Sugiyama (2014) Du Plessis, M. C. and Sugiyama, M. Class prior estimation from positive and unlabeled data. IEICE TRANSACTIONS on Information and Systems, 97(5):1358–1362, 2014.
  • Du Plessis et al. (2014) Du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. Advances in neural information processing systems, 27, 2014.
  • Elkan & Noto (2008) Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp.  213–220, 2008.
  • Fei et al. (2013) Fei, H., Kim, Y., Sahu, S., Naphade, M., Mamidipalli, S. K., and Hutchinson, J. Heat pump detection from coarse grained smart meter data with positive and unlabeled learning. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp.  1330–1338, 2013.
  • Garg et al. (2021) Garg, S., Wu, Y., Smola, A. J., Balakrishnan, S., and Lipton, Z. Mixture proportion estimation and PU learning: A modern approach. Advances in Neural Information Processing Systems, 34:8532–8544, 2021.
  • Gong et al. (2021) Gong, C., Wang, Q., Liu, T., Han, B., You, J., Yang, J., and Tao, D. Instance-dependent positive and unlabeled learning with labeling bias estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4163–4177, 2021.
  • González et al. (2017) González, P., Castaño, A., Chawla, N. V., and Coz, J. J. D. A review on quantification learning. ACM Computing Surveys (CSUR), 50(5):1–40, 2017.
  • Gorber et al. (2009) Gorber, S. C., Schofield-Hurwitz, S., Hardt, J., Levasseur, G., and Tremblay, M. The accuracy of self-reported smoking: a systematic review of the relationship between self-reported and cotinine-assessed smoking status. Nicotine & tobacco research, 11(1):12–24, 2009.
  • Gretton et al. (2009) Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
  • Heckman (1979) Heckman, J. J. Sample selection bias as a specification error. Econometrica: Journal of the econometric society, pp. 153–161, 1979.
  • Ivanov (2020) Ivanov, D. DEDPUL: Difference-of-estimated-densities-based positive-unlabeled learning. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pp.  782–790. IEEE, 2020.
  • Jain et al. (2016) Jain, S., White, M., Trosset, M. W., and Radivojac, P. Nonparametric semi-supervised learning of class proportions. arXiv preprint arXiv:1601.01944, 2016.
  • Kato et al. (2018) Kato, M., Teshima, T., and Honda, J. Learning from positive and unlabeled data with a selection bias. In International conference on learning representations, 2018.
  • Kiryo et al. (2017) Kiryo, R., Niu, G., Du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. Advances in neural information processing systems, 30, 2017.
  • Knoll (2010) Knoll, G. F. Radiation detection and measurement. John Wiley & Sons, 2010.
  • Lawrence & Schölkopf (2001) Lawrence, N. and Schölkopf, B. Estimating a kernel Fisher discriminant in the presence of label noise. In 18th International Conference on Machine Learning (ICML 2001), pp.  306–306. Morgan Kaufmann, 2001.
  • Li et al. (2019) Li, F., Gu, Z., Ge, L., Li, H., Tang, X., Lang, X., and Hu, B. Review of recent gamma spectrum unfolding algorithms and their application. Results in Physics, 13:102211, 2019.
  • MacKay (2003) MacKay, D. J. Information theory, inference and learning algorithms. Cambridge university press, 2003.
  • Maity et al. (2023) Maity, S., Dutta, D., Terhorst, J., Sun, Y., and Banerjee, M. A linear adjustment based approach to posterior drift in transfer learning. Biometrika, 2023. URL https://doi.org/10.1093/biomet/asad029.
  • Natarajan et al. (2013) Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. Advances in neural information processing systems, 26, 2013.
  • Ramaswamy et al. (2016) Ramaswamy, H., Scott, C., and Tewari, A. Mixture proportion estimation via kernel embeddings of distributions. In International conference on machine learning, pp. 2052–2060. PMLR, 2016.
  • Saerens et al. (2002) Saerens, M., Latinne, P., and Decaestecker, C. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21–41, 2002.
  • Sanderson & Scott (2014) Sanderson, T. and Scott, C. Class proportion estimation with application to multiclass anomaly rejection. In Artificial Intelligence and Statistics, pp.  850–858. PMLR, 2014.
  • Scott (2015) Scott, C. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In Artificial Intelligence and Statistics, pp.  838–846. PMLR, 2015.
  • Scott (2019) Scott, C. A generalized Neyman-Pearson criterion for optimal domain adaptation. In Algorithmic Learning Theory, pp.  738–761. PMLR, 2019.
  • Shanmugam & Pierson (2021) Shanmugam, D. and Pierson, E. Quantifying inequality in underreported medical conditions. arXiv preprint arXiv:2110.04133, 2021.
  • Storkey et al. (2009) Storkey, A. et al. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, 30:3–28, 2009.
  • Werner et al. (2018) Werner, C. J., Bull, J., Solomon, C., Brown, F., McKinney, G., Rising, M., Dixon, D., Martz, R., Hughes, H., Cox, L., et al. MCNP 6.2 release notes. Los Alamos National Laboratory, 2018.
  • Yao et al. (2022) Yao, Y., Liu, T., Han, B., Gong, M., Niu, G., Sugiyama, M., and Tao, D. Rethinking class-prior estimation for positive-unlabeled learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=aYAA-XHKyk.
  • Zhang et al. (2013) Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. Domain adaptation under target and conditional shift. In International conference on machine learning, pp. 819–827. PMLR, 2013.
  • Zhang & Goldman (2001) Zhang, Q. and Goldman, S. EM-DD: An improved multiple-instance learning technique. Advances in neural information processing systems, 14, 2001.

Appendix A Proofs

A.1 Proof of Proposition 2.4

Proposition.

Under the latent label model,

esssup𝑥P(Y=1|X=x)=κκ(F|H),\underset{x}{\operatorname{ess\ sup\ }}P(Y=1|X=x)=\frac{\kappa^{*}}{\kappa(F|H)}\ ,

where 0/0:=00/0:=0.

Proof.

First consider the trivial case κ(F|H)=0\kappa(F|H)=0. From the fact that κκ(F|H)\kappa^{*}\leq\kappa(F|H), κ\kappa^{*} must also be 0. According to Eqn. (5), P(Y=1|X=x)=0xP(Y=1|X=x)=0\ \forall x, so the equality holds.

Then for κ(F|H)>0\kappa(F|H)>0, we consider the cases κ>0\kappa^{*}>0 and κ=0\kappa^{*}=0 separately. For the first case, consider two subcases: h(x)>0h(x)>0 and h(x)=0h(x)=0. In the first subcase, we have that f(x)>0f(x)>0, and therefore by the definition of conditional probability,

P(Y=1|X=x)=κf(x)h(x).\displaystyle P(Y=1|X=x)=\frac{\kappa^{*}}{\frac{f(x)}{h(x)}}.

Taking the essential supremum over all xx with h(x)>0h(x)>0,

esssupx:h(x)>0P(Y=1|X=x)\displaystyle\underset{x:h(x)>0}{\operatorname{ess\ sup\ }}\ P(Y=1|X=x) =κessinfx:h(x)>0f(x)h(x)\displaystyle=\frac{\kappa^{*}}{\underset{x:h(x)>0}{\operatorname{ess\ inf\ }}\frac{f(x)}{h(x)}}
=κκ(F|H),\displaystyle=\frac{\kappa^{*}}{\kappa(F|H)},

where the second equality follows from Proposition 3. In the second subcase, P(Y=1|X=x)P(Y=1|X=x) is zero. Therefore,

esssup𝑥P(Y=1|X=x)\displaystyle\underset{x}{\operatorname{ess\ sup\ }}P(Y=1|X=x) =esssupx:h(x)>0P(Y=1|X=x)\displaystyle=\underset{x:h(x)>0}{\operatorname{ess\ sup\ }}P(Y=1|X=x)
=κκ(F|H).\displaystyle=\frac{\kappa^{*}}{\kappa(F|H)}.

When κ=0\kappa^{*}=0, P(Y=1|X=x)=0P(Y=1|X=x)=0 for all xx, and the equality still holds. ∎

A.2 Proof of Theorem 3.1

Theorem (Identifiability Under Local Supremal Posterior (LSP)).

Let AA be any non-empty measurable subset of EH={x:h(x)>0}E_{H}=\{x:h(x)>0\} and s=esssupxAP(Y=1|X=x)s=\underset{x\in A}{\operatorname{ess\ sup\ }}P(Y=1|X=x), then

κ=sinfSAF(S)H(S)=sessinfxAf(x)h(x).\displaystyle\kappa^{*}=s\cdot\inf_{S\subseteq A}\frac{F(S)}{H(S)}=s\cdot\underset{x\in A}{\operatorname{ess\ inf\ }}\frac{f(x)}{h(x)}.
Proof.

We consider the cases κ>0\kappa^{*}>0 and κ=0\kappa^{*}=0 separately.

If κ>0\kappa^{*}>0, then EH={x:h(x)>0}EF={x:f(x)>0}E_{H}=\{x:h(x)>0\}\subseteq E_{F}=\{x:f(x)>0\}. Recall from Eqn. (5),

P(Y=1|X=x)=κh(x)f(x)when f(x)>0.\displaystyle P(Y=1|X=x)=\kappa^{*}\cdot\frac{h(x)}{f(x)}\quad\text{when }f(x)>0.

Taking the essential supremum over AA and recalling the definition of ss,

s=esssupxAP(Y=1|X=x)\displaystyle s=\underset{x\in A}{\operatorname{ess\ sup\ }}P(Y=1|X=x) =κesssupxAh(x)f(x).\displaystyle=\kappa^{*}\cdot\underset{x\in A}{\operatorname{ess\ sup\ }}\frac{h(x)}{f(x)}.

Since AEHEFA\subseteq E_{H}\subseteq E_{F}, f(x)f(x) and h(x)h(x) are both positive for xAx\in A. Rearranging the denominator,

s=κ1essinfxAf(x)h(x),\displaystyle s=\kappa^{*}\cdot\frac{1}{\underset{x\in A}{\operatorname{ess\ inf\ }}\frac{f(x)}{h(x)}},

and moving the denominator to the other side, we get

κ=sessinfxAf(x)h(x)=sinfSAF(S)H(S),\displaystyle\kappa^{*}=s\cdot\underset{x\in A}{\operatorname{ess\ inf\ }}\frac{f(x)}{h(x)}=s\cdot\inf_{S\subseteq A}\frac{F(S)}{H(S)},

where the second equality follows from the identity that essinfxAf(x)h(x)=infSAF(S)H(S)\underset{x\in A}{\operatorname{ess\ inf\ }}\frac{f(x)}{h(x)}=\inf_{S\subseteq A}\frac{F(S)}{H(S)}.

If κ=0\kappa^{*}=0, then s=0s=0 and the above equality still holds. ∎

A.3 Proof of Theorem 3.2

Theorem (Identifiability Under Tight Posterior Upper Bound).

Consider any non-empty measurable set AEH={x:h(x)>0}A\subseteq E_{H}=\{x:h(x)>0\}, and let s=esssupxAP(Y=1|X=x)s=\underset{x\in A}{\operatorname{ess\ sup\ }}P(Y=1|X=x). Let α(x)\alpha(x) be any measurable function satisfying

α(x){[P(Y=1|X=x),s],xA,[P(Y=1|X=x),1],o.w.\alpha(x)\in\begin{cases}\left[P(Y=1|X=x),s\right],&x\in A,\\ \left[P(Y=1|X=x),1\right],&\text{o.w.}\end{cases} (13)

Define a new distribution F~\widetilde{F} in terms of its density

f~(x)=1cα(x)f(x), where c=α(x)f(x)𝑑x=𝔼XF[α(x)].\begin{split}&\widetilde{f}(x)=\frac{1}{c}\cdot\alpha(x)\cdot f(x),\\ \text{ where }\quad&c=\int\alpha(x)f(x)dx=\mathbb{E}_{X\sim F}\left[\alpha(x)\right].\end{split} (14)

Then

κ=cκ(F~|H).\displaystyle\kappa^{*}=c\cdot\kappa(\widetilde{F}|H).
Proof.

Write κ(F~|H)\kappa(\widetilde{F}|H) explicitly according to Proposition 3 and Eqn. (14),

cκ(F~|H)\displaystyle c\cdot\kappa(\widetilde{F}|H) =cessinfx:h(x)>0f~(x)h(x)by definition of κ(F~|H)\displaystyle=c\cdot\underset{x:h(x)>0}{\operatorname{ess\ inf\ }}\frac{\widetilde{f}(x)}{h(x)}\quad\text{by definition of $\kappa(\widetilde{F}|H)$}
=cessinfx:h(x)>01cα(x)f(x)h(x)plug in the expression of f~(x)\displaystyle=c\cdot\underset{x:h(x)>0}{\operatorname{ess\ inf\ }}\frac{\frac{1}{c}\cdot\alpha(x)\cdot f(x)}{h(x)}\quad\text{plug in the expression of $\widetilde{f}(x)$}
=essinfx:h(x)>0α(x)f(x)h(x).\displaystyle=\underset{x:h(x)>0}{\operatorname{ess\ inf\ }}\frac{\alpha(x)\cdot f(x)}{h(x)}.

We will show that this quantity equals κ\kappa^{*} by proving that it is both an upper and a lower bound on κ\kappa^{*}.

From Eqn. (13), we have α(x)P(Y=1|X=x)x\alpha(x)\geq P(Y=1|X=x)\ \ \forall x, so

cκ(F~|H)\displaystyle c\cdot\kappa(\widetilde{F}|H) essinfx:h(x)>0P(Y=1|X=x)f(x)h(x)\displaystyle\geq\underset{x:h(x)>0}{\operatorname{ess\ inf\ }}\frac{P(Y=1|X=x)\cdot f(x)}{h(x)}
=essinfx:h(x)>0κrearrange Eqn. (5)\displaystyle=\underset{x:h(x)>0}{\operatorname{ess\ inf\ }}\kappa^{*}\quad\text{rearrange Eqn. \eqref{eq:cond}}
=κ.\displaystyle=\kappa^{*}.

Meanwhile, restricting to xAx\in A,

cκ(F~|H)\displaystyle c\cdot\kappa(\widetilde{F}|H) =essinfx:h(x)>0α(x)f(x)h(x)\displaystyle=\underset{x:h(x)>0}{\operatorname{ess\ inf\ }}\frac{\alpha(x)\cdot f(x)}{h(x)}
essinfxAα(x)f(x)h(x)replace EH by A\displaystyle\leq\underset{x\in A}{\operatorname{ess\ inf\ }}\frac{\alpha(x)\cdot f(x)}{h(x)}\quad\text{replace $E_{H}$ by $A$}
essinfxAsf(x)h(x)because α(x)s,xA\displaystyle\leq\underset{x\in A}{\operatorname{ess\ inf\ }}\frac{s\cdot f(x)}{h(x)}\quad\text{because $\alpha(x)\leq s,\forall x\in A$}
=κby Theorem 3.1\displaystyle=\kappa^{*}\quad\text{by Theorem \ref{thm:identify}}

This shows that cκ(F~|H)=κc\cdot\kappa(\widetilde{F}|H)=\kappa^{*}. ∎
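The identity κ\kappa^{*} = c·κ(F~|H)\kappa(\widetilde{F}|H) can also be verified numerically on a two-point example where irreducibility fails (the numbers below are our own illustrative choices):

```python
import numpy as np

# Discrete example on {0, 1} where irreducibility FAILS: min f/h > kappa*.
h = np.array([0.5, 0.5])
g = np.array([0.2, 0.8])
kappa_star = 0.4
f = (1 - kappa_star) * g + kappa_star * h   # [0.32, 0.68]
posterior = kappa_star * h / f              # P(Y=1|X=x) from Eqn. (5)

# Choose A = {0} and alpha per Eqn. (13): alpha = s on A, any value in [P, 1] off A.
s = posterior[0]
alpha = np.array([s, 0.5])                  # 0.5 >= posterior[1], so admissible
c = float(np.sum(alpha * f))                # c = E_F[alpha(X)]
f_tilde = alpha * f / c                     # density of the subsampled F-tilde

kappa_tilde = float(np.min(f_tilde / h))
print(float(np.min(f / h)))  # 0.64 > kappa*: kappa(F|H) alone overestimates
print(c * kappa_tilde)       # recovers kappa* = 0.4
```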

A.4 Proof of Corollary 3.3

Corollary.

Let α(x)\alpha(x) be any measurable function with

α(x)[P(Y=1|X=x),1]x.\alpha(x)\in[P(Y=1|X=x),1]\quad\forall x.

Define a new distribution F~\widetilde{F} in terms of its density f~\widetilde{f}, obtained by Eqn. (7). Then

κcκ(F~|H)κ(F|H).\displaystyle\kappa^{*}\leq c\cdot\kappa(\widetilde{F}|H)\leq\kappa(F|H).
Proof.

From the proof of Theorem 3.2, we know that

cκ(F~|H)=essinfx:h(x)>0α(x)f(x)h(x)\displaystyle c\cdot\kappa(\widetilde{F}|H)=\underset{x:h(x)>0}{\operatorname{ess\ inf\ }}\frac{\alpha(x)\cdot f(x)}{h(x)}

From the fact that α(x)P(Y=1|X=x)x\alpha(x)\geq P(Y=1|X=x)\ \forall x, we have

cκ(F~|H)\displaystyle c\cdot\kappa(\widetilde{F}|H) essinfx:h(x)>0P(Y=1|X=x)f(x)h(x)\displaystyle\geq\underset{x:h(x)>0}{\operatorname{ess\ inf\ }}\frac{P(Y=1|X=x)\cdot f(x)}{h(x)}
=essinfx:h(x)>0κ\displaystyle=\underset{x:h(x)>0}{\operatorname{ess\ inf\ }}\kappa^{*}
=κ.\displaystyle=\kappa^{*}.

Moreover, since α(x)1x\alpha(x)\leq 1\ \forall x,

cκ(F~|H)essinfx:h(x)>01f(x)h(x)=κ(F|H).\displaystyle c\cdot\kappa(\widetilde{F}|H)\leq\underset{x:h(x)>0}{\operatorname{ess\ inf\ }}\frac{1\cdot f(x)}{h(x)}=\kappa(F|H).

A.5 Proof of Theorem 4.1

We now establish a rate of convergence result for the estimator κ^=c^κ^(F~|H)\widehat{\kappa}^{*}=\widehat{c}\cdot\widehat{\kappa}(\widetilde{F}|H) from Algorithm 1.

Theorem.

Assume α(x)\alpha(x) satisfies the condition in Eqn. (6). After subsampling, assume the resulting F~\widetilde{F} and HH are such that argminS𝒜:H(S)>0F~(S)H(S)\underset{S\in\mathcal{A}:H(S)>0}{\operatorname{arg\ min\ }}\frac{\widetilde{F}(S)}{H(S)} exists. Then there exists a constant C>0C>0 and an existing estimator κ^\widehat{\kappa} such that for mm and nn sufficiently large, the estimator κ^\widehat{\kappa}^{*} from Algorithm 1 satisfies

Pr(|κ^κ|C[logmm+lognn])1𝒪(1m+1n).\Pr\left(\left|\widehat{\kappa}^{*}-\kappa^{*}\right|\leq C\left[\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n}{n}}\right]\right)\geq 1-\mathcal{O}\left(\frac{1}{m}+\frac{1}{n}\right).
Proof.

Recall the setup: we originally have i.i.d. samples (the indexing here differs slightly from Eqn. (1), which allows for more concise notation in the derivation that follows)

XF:={X1,X2,,Xn}iidF,XH:={Xn+1,Xn+2,,Xn+m}iidH.\begin{split}X_{F}&:=\left\{X_{1},X_{2},\cdots,X_{n}\right\}\stackrel{{\scriptstyle iid}}{{\sim}}F,\\ X_{H}&:=\left\{X_{n+1},X_{n+2},\cdots,X_{n+m}\right\}\stackrel{{\scriptstyle iid}}{{\sim}}H.\end{split}

After rejection sampling, we obtain nn^{\prime} i.i.d. samples XF~F~X_{\widetilde{F}}\sim\widetilde{F} (where 𝔼[n]=cn\mathbb{E}\left[n^{\prime}\right]=c\cdot n and c=α(x)f(x)𝑑xc=\int\alpha(x)f(x)dx), from which we can estimate κ(F~|H)\kappa(\widetilde{F}|H). Under the assumption that argminS𝒜:H(S)>0F~(S)H(S)\underset{S\in\mathcal{A}:H(S)>0}{\operatorname{arg\ min\ }}\frac{\widetilde{F}(S)}{H(S)} exists, the estimator κ^(F~|H)\widehat{\kappa}(\widetilde{F}|H) of Scott (2015) has rate of convergence

Pr(|κ^(F~|H)κ(F~|H)|C1[logmm+lognn])12m2n.\Pr\left(\left|\widehat{\kappa}(\widetilde{F}|H)-\kappa(\widetilde{F}|H)\right|\leq C_{1}\left[\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n^{\prime}}{n^{\prime}}}\right]\right)\geq 1-\frac{2}{m}-\frac{2}{n^{\prime}}. (15)

Here $n^{\prime}$ is a random variable, and we want a rate of convergence stated in terms of $n$. This can be done by applying a concentration inequality to $n^{\prime}$.

Theorem.

(Hoeffding’s Inequality) Let $Z_{1},\cdots,Z_{n}$ be independent random variables on $\mathbb{R}$ that take values in $[0,1]$. Denote $Z=\frac{1}{n}\sum_{i=1}^{n}Z_{i}$; then for all $\epsilon>0$,

Pr(|1ni=1nZi𝔼[Z]|ϵ)12exp(2nϵ2).\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n}Z_{i}-\mathbb{E}\left[Z\right]\right|\leq\epsilon\right)\geq 1-2\exp(-2n\epsilon^{2}).

Setting $\delta=2\exp(-2n\epsilon^{2})$, i.e. $\epsilon=\sqrt{\frac{\log(2/\delta)}{2n}}$, the theorem can be restated as:

\Pr\left(\left|\frac{1}{n}\sum_{i=1}^{n}Z_{i}-\mathbb{E}\left[Z\right]\right|\leq\sqrt{\frac{\log(2/\delta)}{2n}}\right)\geq 1-\delta.

Take $Z_{i}=\mathbbm{1}_{\{V_{i}\leq\alpha(X_{i})\}}$ for $i\in\{1,2,\cdots,n\}$, where $V_{i}$ denotes the $i$-th independent draw from $\text{Uniform}(0,1)$ used in rejection sampling (Algorithm 4), and $X_{i}$ denotes the $i$-th draw from $F$. Then

n=i=1nZi,\displaystyle n^{\prime}=\sum_{i=1}^{n}Z_{i},
𝔼[Z]=1ni=1n𝔼[Zi]=𝔼(Xi,Vi)[Zi]=𝔼Xi[𝔼Vi|Xi[𝟙{Viα(Xi)}]]=𝔼Xi[α(Xi)]=α(x)f(x)𝑑x=c.\displaystyle\mathbb{E}\left[Z\right]=\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}\left[Z_{i}\right]=\mathbb{E}_{(X_{i},V_{i})}\left[Z_{i}\right]=\mathbb{E}_{X_{i}}\left[\mathbb{E}_{V_{i}|X_{i}}\left[\mathbbm{1}_{\{V_{i}\leq\alpha(X_{i})\}}\right]\right]=\mathbb{E}_{X_{i}}\left[\alpha(X_{i})\right]=\int\alpha(x)f(x)dx=c.

Plugging into Hoeffding’s Inequality and setting $\epsilon=\frac{c}{2}$, we have

\Pr\left(\left|n^{\prime}-c\cdot n\right|\leq\frac{c}{2}\cdot n\right)\geq 1-2\exp\left(-\frac{c^{2}}{2}n\right), (16)

which bounds $n^{\prime}$ above and below by constant multiples of $n$ with high probability.
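As a quick empirical sanity check of Eqn. (16), the concentration of $n^{\prime}$ around $c\cdot n$ can be simulated; the logistic acceptance function below is a hypothetical illustration, not the one used in the paper.

```python
import numpy as np

# Simulate n' = sum_i 1{V_i <= alpha(X_i)} and check it concentrates around c*n.
rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)                       # draws from F (standard normal here)
alpha = lambda t: 1.0 / (1.0 + np.exp(-t))   # hypothetical acceptance probability in (0, 1)
v = rng.random(n)                            # V_i ~ Uniform(0, 1)
n_prime = int(np.sum(v <= alpha(x)))         # accepted count n'
c = 0.5                                      # E[alpha(X)] = 1/2 by symmetry of the logistic and N(0, 1)
```

With these choices, $n^{\prime}/n$ lands within a few standard errors of $c=0.5$, consistent with $\frac{c}{2}\cdot n\leq n^{\prime}\leq\frac{3c}{2}\cdot n$ holding with high probability.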

Now we can establish a rate of convergence for $\widehat{\kappa}(\widetilde{F}|H)$ in terms of $n$:

Pr(|κ^(F~|H)κ(F~|H)|C2[logmm+lognn])\displaystyle\Pr\left(\left|\widehat{\kappa}(\widetilde{F}|H)-\kappa(\widetilde{F}|H)\right|\leq C_{2}\left[\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n}{n}}\right]\right)
\displaystyle\geq\ Pr((|κ^(F~|H)κ(F~|H)|C2[logmm+lognn])and(c2nn3c2n))Pr(A)Pr(AB)\displaystyle\Pr\left(\left(\left|\widehat{\kappa}(\widetilde{F}|H)-\kappa(\widetilde{F}|H)\right|\leq C_{2}\left[\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n}{n}}\right]\right)\text{and}\left(\frac{c}{2}\cdot n\leq n^{\prime}\leq\frac{3c}{2}\cdot n\right)\right)\quad\because\Pr(A)\geq\Pr(A\cap B)
=\displaystyle=\ Pr((|κ^(F~|H)κ(F~|H)|C2[logmm+lognn])|(c2nn3c2n))Pr(c2nn3c2n)\displaystyle\Pr\left(\left(\left|\widehat{\kappa}(\widetilde{F}|H)-\kappa(\widetilde{F}|H)\right|\leq C_{2}\left[\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n}{n}}\right]\right)\Biggl{|}\left(\frac{c}{2}\cdot n\leq n^{\prime}\leq\frac{3c}{2}\cdot n\right)\right)\cdot\Pr\left(\frac{c}{2}\cdot n\leq n^{\prime}\leq\frac{3c}{2}\cdot n\right)
\displaystyle\geq\ \left(1-\mathcal{O}\left(\frac{1}{m}+\frac{1}{n}\right)\right)\cdot\left(1-2\exp\left(-\frac{c^{2}}{2}n\right)\right)\quad\because\text{$n^{\prime}$ is within a constant factor of $n$, so they are interchangeable in the first probability term}
\displaystyle\geq\ 1-\mathcal{O}\left(\frac{1}{m}+\frac{1}{n}\right)-2\exp\left(-\frac{c^{2}}{2}n\right)\quad\because(1-a)(1-b)\geq 1-a-b
\displaystyle\geq\ 1-\mathcal{O}\left(\frac{1}{m}+\frac{1}{n}\right)\quad\because\text{the exponential term decays faster than }1/n

As for the estimator of $c=\mathbb{E}_{X\sim F}\left[\alpha(X)\right]$, a rate of convergence can likewise be obtained via Hoeffding’s Inequality,

c^=nn=1ni=1nZi,c=𝔼[Z].\displaystyle\widehat{c}=\frac{n^{\prime}}{n}=\frac{1}{n}\sum_{i=1}^{n}Z_{i},\quad c=\mathbb{E}\left[Z\right].

Plugging into Hoeffding’s Inequality with confidence parameter $\delta=2/n$, we have

\Pr\left(\left|\widehat{c}-c\right|\leq\sqrt{\frac{\log n}{2n}}\right)\geq 1-2/n. (17)

Note that, by the triangle inequality,

|c^κ^(F~|H)cκ(F~|H)|\displaystyle\left|\widehat{c}\cdot\widehat{\kappa}(\widetilde{F}|H)-c\cdot\kappa(\widetilde{F}|H)\right| =|c^κ^(F~|H)c^κ(F~|H)+c^κ(F~|H)cκ(F~|H)|\displaystyle=\left|\widehat{c}\cdot\widehat{\kappa}(\widetilde{F}|H)-\widehat{c}\cdot\kappa(\widetilde{F}|H)+\widehat{c}\cdot\kappa(\widetilde{F}|H)-c\cdot\kappa(\widetilde{F}|H)\right|
|c^κ^(F~|H)c^κ(F~|H)|+|c^κ(F~|H)cκ(F~|H)|\displaystyle\leq\left|\widehat{c}\cdot\widehat{\kappa}(\widetilde{F}|H)-\widehat{c}\cdot\kappa(\widetilde{F}|H)\right|+\left|\widehat{c}\cdot\kappa(\widetilde{F}|H)-c\cdot\kappa(\widetilde{F}|H)\right|
=|c^||κ^(F~|H)κ(F~|H)|+|κ(F~|H)||c^c|\displaystyle=\left|\widehat{c}\right|\cdot\left|\widehat{\kappa}(\widetilde{F}|H)-\kappa(\widetilde{F}|H)\right|+\left|\kappa(\widetilde{F}|H)\right|\cdot\left|\widehat{c}-c\right|
|κ^(F~|H)κ(F~|H)|+|c^c|.\displaystyle\leq\left|\widehat{\kappa}(\widetilde{F}|H)-\kappa(\widetilde{F}|H)\right|+\left|\widehat{c}-c\right|.

Finally, combining all of the previous results,

Pr(|κ^κ|C[logmm+lognn])\displaystyle\Pr\left(\left|\widehat{\kappa}^{*}-\kappa^{*}\right|\leq C\left[\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n}{n}}\right]\right)
=\displaystyle=\ Pr(|c^κ^(F~|H)cκ(F~|H)|C[logmm+lognn])\displaystyle\Pr\left(\left|\widehat{c}\cdot\widehat{\kappa}(\widetilde{F}|H)-c\cdot\kappa(\widetilde{F}|H)\right|\leq C\left[\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n}{n}}\right]\right)
\displaystyle\geq\ Pr(|κ^(F~|H)κ(F~|H)|+|c^c|C[logmm+lognn])\displaystyle\Pr\left(\left|\widehat{\kappa}(\widetilde{F}|H)-\kappa(\widetilde{F}|H)\right|+\left|\widehat{c}-c\right|\leq C\left[\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n}{n}}\right]\right)
\displaystyle\geq\ Pr((|κ^(F~|H)κ(F~|H)|C2[logmm+lognn])and(|c^c|C2[logmm+lognn]))\displaystyle\Pr\left(\left(\left|\widehat{\kappa}(\widetilde{F}|H)-\kappa(\widetilde{F}|H)\right|\leq\frac{C}{2}\left[\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n}{n}}\right]\right)\text{and}\left(\left|\widehat{c}-c\right|\leq\frac{C}{2}\left[\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n}{n}}\right]\right)\right)
\displaystyle\geq\ 1𝒪(1m+1n),\displaystyle 1-\mathcal{O}\left(\frac{1}{m}+\frac{1}{n}\right),

so we conclude that $\widehat{\kappa}^{*}\rightarrow\kappa^{*}$ with rate of convergence $O\left(\sqrt{\frac{\log m}{m}}+\sqrt{\frac{\log n}{n}}\right)$. ∎

A.6 Proof of Proposition 5.2

Proposition.

For κ(F|H)\kappa(F|H^{\prime}) obtained from ReMPE-2:

κ(F|H)<κ(F|H).\displaystyle\kappa(F|H^{\prime})<\kappa(F|H).
Proof.

From the fact that B=argminS𝔖F(S)H(S)B=\underset{S\in\mathfrak{S}}{\operatorname{arg\ min\ }}\frac{F(S)}{H(S)}, we have

κ(F|H)=infS𝒜:H(S)>0F(S)H(S)=F(B)H(B).\displaystyle\kappa(F|H)=\inf_{S\in\mathcal{A}:H(S)>0}\frac{F(S)}{H(S)}=\frac{F(B)}{H(B)}.

After regrouping, $H^{\prime}(B)>H(B)$; moreover, regrouping only adds mass on $B$, so the infimum defining $\kappa(F|H^{\prime})$ is still attained at $B$. Therefore,

κ(F|H)\displaystyle\kappa(F|H^{\prime}) =infS𝒜:H(S)>0F(S)H(S)\displaystyle=\inf_{S\in\mathcal{A}:H^{\prime}(S)>0}\frac{F(S)}{H^{\prime}(S)}
=F(B)H(B)\displaystyle=\frac{F(B)}{H^{\prime}(B)}
<F(B)H(B)\displaystyle<\frac{F(B)}{H(B)}
=κ(F|H).\displaystyle=\kappa(F|H).
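This argument can be checked on a toy discrete example (the distributions below are hypothetical): mixing extra mass onto the minimizing set $B$ strictly lowers the minimum ratio.

```python
import numpy as np

# Hypothetical discrete F and H on three atoms.
F = np.array([0.2, 0.3, 0.5])
H = np.array([0.5, 0.3, 0.2])
b = int(np.argmin(F / H))                 # B: the atom attaining kappa(F|H)

# Regrouping mixes mass concentrated on B into H, so H'(B) > H(B).
delta = np.zeros(3)
delta[b] = 1.0
H_prime = 0.9 * H + 0.1 * delta

kappa_FH = float(np.min(F / H))           # = F(B)/H(B) = 0.4
kappa_FHp = float(np.min(F / H_prime))    # strictly smaller after regrouping
```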

Appendix B More about Subsampling MPE

B.1 Intuition

Theorem 3.2 and Corollary 3.3 have already justified the use of subsampling. Here, we explain it from another perspective, in terms of distributions, similar to the analysis in Yao et al. (2022). The idea is that, since the original $G$ may violate the irreducibility assumption, we modify $G$ so that the resulting latent component distribution is less likely to violate it.

Write the unknown distribution $G$ itself as a mixture $G=\gamma G_{1}+(1-\gamma)G_{2}$ with $\gamma\in[0,1]$. Then $F$ becomes:

F\displaystyle F =(1κ)G+κH\displaystyle=(1-\kappa^{*})G+\kappa^{*}H
=(1κ)[γG1+(1γ)G2]+κH.\displaystyle=(1-\kappa^{*})\left[\gamma G_{1}+(1-\gamma)G_{2}\right]+\kappa^{*}H.

Moving the $G_{1}$ term to the other side (i.e., discarding the probability mass contributed by $G_{1}$),

F(1κ)γG1\displaystyle F-{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}(1-\kappa^{*})\gamma G_{1}} =(1κ)(1γ)G2+κH,\displaystyle=(1-\kappa^{*})(1-\gamma)G_{2}+\kappa^{*}H,

the left hand side can be re-written as

[1(1κ)γ]:=c, normalizing const.11(1κ)γ[F(1κ)γG1]:=F~, after subsampling.\displaystyle\underbrace{\left[1-(1-\kappa^{*})\gamma\right]}_{:=c,\text{ normalizing const.}}\underbrace{\frac{1}{1-(1-\kappa^{*})\gamma}\left[F-{\color[rgb]{1,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{1,0,0}(1-\kappa^{*})\gamma G_{1}}\right]}_{:=\widetilde{F},\textbf{ after subsampling}}.

Then the resulting distribution F~\widetilde{F} is

F~\displaystyle\widetilde{F} =(1κ)(1γ)cG2+κcH\displaystyle=\frac{(1-\kappa^{*})(1-\gamma)}{c}G_{2}+\frac{\kappa^{*}}{c}H
=:(1κ~)G~+κ~H.\displaystyle=:(1-\tilde{\kappa}^{*})\widetilde{G}+\tilde{\kappa}^{*}H.

By discarding a portion of the probability mass of $G$, which is done by subsampling in practice, the resulting latent component distribution $\widetilde{G}$ is less likely to violate the irreducibility assumption. The new mixture proportion is $\tilde{\kappa}^{*}=\kappa^{*}/c$.
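The algebra above can be sanity-checked numerically; the following sketch uses hypothetical discrete distributions on four points.

```python
import numpy as np

# Numeric check of the decomposition F_tilde = ((1-k)(1-g)/c) G2 + (k/c) H.
kappa, gamma = 0.3, 0.4                       # hypothetical values of kappa* and gamma
H  = np.array([0.7, 0.2, 0.1, 0.0])
G1 = np.array([0.5, 0.1, 0.0, 0.4])
G2 = np.array([0.1, 0.3, 0.3, 0.3])
G  = gamma * G1 + (1 - gamma) * G2
F  = (1 - kappa) * G + kappa * H

c = 1 - (1 - kappa) * gamma                   # normalizing constant
F_tilde = (F - (1 - kappa) * gamma * G1) / c  # distribution after subsampling
kappa_tilde = kappa / c                       # new mixture proportion

# F_tilde should equal the stated mixture of G2 and H (and sum to one).
recon = (1 - kappa) * (1 - gamma) / c * G2 + kappa / c * H
```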

In the following, we justify the above claim; the result can also be seen as a reformulation of Corollary 3.3.

Proposition B.1.

Suppose some probability mass from $G$ is dropped as above. Then $c\cdot\kappa(\widetilde{F}|H)$ is always bounded as $\kappa^{*}\leq c\cdot\kappa(\widetilde{F}|H)\leq\kappa(F|H)$. Furthermore, the bias is strictly reduced when $(1-\gamma)\kappa(G_{2}|H)<\kappa(G|H)$.

Proof.

Observe that

cκ(F~|H)\displaystyle c\cdot\kappa(\widetilde{F}|H) =infS𝒜:H(S)>0cF~(S)H(S)\displaystyle=\inf_{S\in\mathcal{A}:H(S)>0}\frac{c\cdot\widetilde{F}(S)}{H(S)}
=κ+(1κ)infS𝒜:H(S)>0(1γ)G2(S)H(S),\displaystyle=\kappa^{*}+(1-\kappa^{*})\inf_{S\in\mathcal{A}:H(S)>0}\frac{(1-\gamma)G_{2}(S)}{H(S)},

therefore cκ(F~|H)κc\cdot\kappa(\widetilde{F}|H)\geq\kappa^{*}.

From the fact that G=γG1+(1γ)G2G=\gamma G_{1}+(1-\gamma)G_{2}, we have G(1γ)G2G\geq(1-\gamma)G_{2}. Then

cκ(F~|H)\displaystyle c\cdot\kappa(\widetilde{F}|H) κ+(1κ)infS𝒜:H(S)>0G(S)H(S)\displaystyle\leq\kappa^{*}+(1-\kappa^{*})\inf_{S\in\mathcal{A}:H(S)>0}\frac{G(S)}{H(S)}
=κ(F|H).\displaystyle=\kappa(F|H).

To have bias reduction, we need cκ(F~|H)<κ(F|H)c\cdot\kappa(\widetilde{F}|H)<\kappa(F|H), where both quantities can be represented as

cκ(F~|H)\displaystyle c\cdot\kappa(\widetilde{F}|H) =κ+(1κ)(1γ)κ(G2|H)\displaystyle=\kappa^{*}+(1-\kappa^{*})(1-\gamma)\kappa(G_{2}|H)
κ(F|H)\displaystyle\kappa(F|H) =κ+(1κ)κ(G|H).\displaystyle=\kappa^{*}+(1-\kappa^{*})\kappa(G|H).

Direct comparison of the two expressions then yields the condition $(1-\gamma)\kappa(G_{2}|H)<\kappa(G|H)$. ∎

Based on the above proof, $c\cdot\kappa(\widetilde{F}|H)$ incurs no worse estimation bias than $\kappa(F|H)$. Specifically, when $G$ is irreducible w.r.t. $H$, then $\kappa^{*}=c\cdot\kappa(\widetilde{F}|H)=\kappa(F|H)$; when $G$ is not irreducible w.r.t. $H$, then $\kappa^{*}\leq c\cdot\kappa(\widetilde{F}|H)\leq\kappa(F|H)$. The key difference from the approach of Yao et al. (2022) is that we modify $G$, and thus equivalently subsample $F$, rather than regrouping on $H$.

In summary, without assuming irreducibility, as long as we remove some contribution of $G$ from $F$ by subsampling, the resulting identifiable quantity $c\cdot\kappa(\widetilde{F}|H)$ is no worse than the maximum proportion $\kappa(F|H)$. Furthermore, with knowledge of the set $A$ and of $\operatorname{ess\,sup}_{x\in A}P(Y=1|X=x)$, $c\cdot\kappa(\widetilde{F}|H)$ equals $\kappa^{*}$.
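The sandwich bound $\kappa^{*}\leq c\cdot\kappa(\widetilde{F}|H)\leq\kappa(F|H)$ can likewise be checked on a toy discrete example (hypothetical numbers; for discrete distributions the infimum over sets is attained at a single atom).

```python
import numpy as np

kappa, gamma = 0.3, 0.4
H  = np.array([0.7, 0.2, 0.1, 0.0])
G1 = np.array([0.5, 0.1, 0.0, 0.4])   # G1 carries mass where F/H is smallest
G2 = np.array([0.1, 0.3, 0.3, 0.3])
G  = gamma * G1 + (1 - gamma) * G2
F  = (1 - kappa) * G + kappa * H

c = 1 - (1 - kappa) * gamma
F_tilde = (F - (1 - kappa) * gamma * G1) / c

def kappa_max(P, Q):
    # kappa(P|Q): for discrete distributions the infimum over sets
    # reduces to the minimum single-atom ratio over {Q > 0}
    m = Q > 0
    return float(np.min(P[m] / Q[m]))

lower = kappa                         # true proportion kappa*
mid   = c * kappa_max(F_tilde, H)     # identifiable quantity after subsampling
upper = kappa_max(F, H)               # original maximum proportion
```

Here the bias is strictly reduced because $G_1$ has mass on the atom attaining the infimum.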

B.2 Rejection Sampling

Rejection sampling is a Monte Carlo method that generates samples from a new distribution $\widetilde{Q}$ given samples from a distribution $Q$, where $\widetilde{Q}$ and $Q$ have densities $\widetilde{q}$ and $q$. An instance $x$ drawn from $q(x)$ is kept with acceptance probability $\beta(x)\in[0,1]$ and rejected otherwise. Algorithm 4 shows the detailed procedure.

Algorithm 4 Rejection Sampling
  Input:  sample [xi]={x1,,xn}q(x)[x_{i}]=\{x_{1},\cdots,x_{n}\}\sim q(x),  acceptance function β(x)\beta(x)
  Output:  sample [zj]q~(x)=1cβ(x)q(x)\left[z_{j}\right]\sim\widetilde{q}(x)=\frac{1}{c}\cdot\beta(x)\cdot q(x),   where c=β(x)q(x)𝑑x=𝔼Xq[β(X)]c=\int\beta(x)q(x)dx=\mathbb{E}_{X\sim q}\left[\beta(X)\right]
  Initialize j1j\leftarrow 1
  for i=1i=1 to nn do
     vUniform(0,1)v\sim\text{Uniform}(0,1)
     if $v\leq\beta(x_{i})$ then
        zjxiz_{j}\leftarrow x_{i}
        jj+1j\leftarrow j+1
     end if
  end for
  return [zj]\left[z_{j}\right]
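A minimal vectorized Python sketch of Algorithm 4 follows; the source distribution and acceptance function below are hypothetical choices for illustration.

```python
import numpy as np

def rejection_sample(xs, beta, rng):
    """Keep each x_i independently with probability beta(x_i) (Algorithm 4)."""
    xs = np.asarray(xs)
    v = rng.random(len(xs))       # v_i ~ Uniform(0, 1)
    return xs[v <= beta(xs)]      # accepted sample follows (1/c) * beta(x) * q(x)

rng = np.random.default_rng(0)
xs = rng.normal(size=50_000)                  # q = N(0, 1)
beta = lambda t: np.where(t > 0, 1.0, 0.2)    # hypothetical acceptance function
accepted = rejection_sample(xs, beta, rng)
accept_rate = len(accepted) / len(xs)         # estimates c = E[beta(X)] = 0.6
```

The empirical acceptance rate approximates the normalizing constant $c=\int\beta(x)q(x)\,dx$, and the accepted sample is skewed toward the region where $\beta$ is large.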

B.3 Gamma Spectrum Unfolding

In gamma spectrum unfolding, a gamma ray detector measures the energies of incoming gamma ray particles. The gamma rays are emitted either by a source of interest or by the background. The measurement is represented as a histogram $f(x)$, whose bins correspond to a quantization of energy. The histogram $h(x)$ of measurements from the source of interest is also known.

The goal is to obtain a lower bound $\rho(x)$ of the quantity $(1-\kappa^{*})g(x)$ on a certain set $A\subset\operatorname{supp}(H)$. We specify $A$ to be the energy bins near the main peak of $h(x)$ (a.k.a. the full-energy peak). For $x\in\operatorname{supp}(F)\backslash\operatorname{supp}(H)$, we set $\rho(x)=f(x)$, because the gamma rays in these regions must come from the background. Typically, $\operatorname{supp}(F)\backslash\operatorname{supp}(H)$ contains two (or more) intervals, so we know the value of $\rho(x)$ on either side of the set $A$. Then $\rho(x)$ for all $x\in A$ can be estimated using linear interpolation (Knoll, 2010; Alamaniotis et al., 2013). The above procedure is illustrated in Figure 1, and the acceptance function is chosen to be

α(x)={1ρ(x)f(x),xA,1,o.w.\alpha(x)=\begin{cases}1-\frac{\rho(x)}{f(x)},&x\in A,\\ 1,&\text{o.w.}\end{cases} (18)
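This procedure can be sketched on a synthetic spectrum (the bin counts and peak location below are hypothetical, and for simplicity all bins outside $A$ are treated as pure background): the background level $\rho$ is linearly interpolated across the full-energy-peak region $A$, and the acceptance function follows Eqn. (18).

```python
import numpy as np

# Synthetic spectrum: flat background plus a Gaussian full-energy peak on A.
bins = np.arange(100)
f = np.full(100, 5.0)
f[40:60] += 20.0 * np.exp(-0.5 * ((np.arange(40, 60) - 50) / 3.0) ** 2)
A = (bins >= 40) & (bins < 60)            # energy bins near the main peak of h

# Outside A the counts are treated as background, so rho = f there;
# inside A, rho is linearly interpolated from the two sides of the peak.
rho = f.copy()
rho[A] = np.interp(bins[A], bins[~A], f[~A])

# Acceptance function of Eqn. (18): remove the background fraction on A.
alpha = np.where(A, 1.0 - rho / f, 1.0)
```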
Figure 1: Visual illustration of how to estimate ρ(x)\rho(x)

Appendix C Detailed Experimental Result in Sec. 6

This section shows four tables corresponding to four experimental setups in Sec. 6.

Table 2: Average absolute estimation error on gamma ray spectra data, corresponding to the unfolding scenario. Several state-of-the-art MPE algorithms are selected: DPL (Ivanov, 2020), EN (Elkan & Noto, 2008), KM (Ramaswamy et al., 2016) and TIcE (Bekker & Davis, 2018). The mean absolute error ($\text{avg}[|\widehat{\kappa}^{*}-\kappa^{*}|]$) is reported, and the smallest error among the original, regrouping and subsampling versions is bolded. $(+/-)$ denotes that on average the estimator produces positive/negative estimation bias ($\text{sgn}[\text{avg}(\widehat{\kappa}^{*}-\kappa^{*})]$).

κ\kappa^{*} DPL ReDPL SuDPL EN ReEN SuEN KM ReKM SuKM TIcE ReTIcE SuTIcE
0.10.1 0.063+0.063+ 0.053+0.053+ 0.008+\bm{0.008+} 0.057+0.057+ 0.048+0.048+ 0.012\bm{0.012-} 0.226+0.226+ 0.146+0.146+ 0.036+\bm{0.036+} 0.129+0.129+ 0.115+0.115+ 0.019+\bm{0.019+}
0.250.25 0.055+0.055+ 0.008+\bm{0.008+} 0.011+0.011+ 0.040+0.040+ 0.004\bm{0.004-} 0.0170.017- 0.216+0.216+ 0.020+\bm{0.020+} 0.059+0.059+ 0.112+0.112+ 0.065+0.065+ 0.017+\bm{0.017+}
0.50.5 0.029+0.029+ 0.1230.123- 0.019\bm{0.019-} 0.017+\bm{0.017+} 0.1290.129- 0.0320.032- 0.130+0.130+ 0.043+\bm{0.043+} 0.049+0.049+ 0.082+0.082+ 0.018+\bm{0.018+} 0.019+0.019+
0.750.75 0.034+0.034+ 0.2840.284- 0.015+\bm{0.015+} 0.022\bm{0.022-} 0.2890.289- 0.0470.047- 0.078+0.078+ 0.0940.094- 0.022+\bm{0.022+} 0.055+0.055+ 0.0450.045- 0.021+\bm{0.021+}
average 0.045+0.045+ 0.1170.117- 0.013+\bm{0.013+} 0.034+0.034+ 0.1180.118- 0.027\bm{0.027-} 0.163+0.163+ 0.076+0.076+ 0.042+\bm{0.042+} 0.095+0.095+ 0.061+0.061+ 0.019+\bm{0.019+}
Table 3: Average absolute estimation error on synthetic data, corresponding to domain adaptation scenario. Several state-of-the-art MPE algorithms DPL (Ivanov, 2020), EN (Elkan & Noto, 2008), KM (Ramaswamy et al., 2016) and TIcE (Bekker & Davis, 2018) are selected. The mean absolute error (avg[|κ^κ|]\text{avg}[\left|\widehat{\kappa}^{*}-\kappa^{*}\right|]) is reported, and the smallest error among the original, regrouping and subsampling versions is bolded. (+/)(+/-) denotes that on average the estimator produces positive//negative estimation bias (sgn[avg(κ^κ)]\text{sgn}[\text{avg}(\widehat{\kappa}^{*}-\kappa^{*})]).

κ\kappa^{*} DPL ReDPL SuDPL EN ReEN SuEN KM ReKM SuKM TIcE ReTIcE SuTIcE
0.10.1 0.089+0.089+ 0.075+0.075+ 0.014\bm{0.014-} 0.059+0.059+ 0.051+0.051+ 0.027\bm{0.027-} 0.102+0.102+ 0.091+0.091+ 0.013\bm{0.013-} 0.150+0.150+ 0.140+0.140+ 0.038+\bm{0.038+}
0.250.25 0.083+0.083+ 0.060+0.060+ 0.016\bm{0.016-} 0.037+0.037+ 0.016+\bm{0.016+} 0.0510.051- 0.081+0.081+ 0.059+0.059+ 0.018\bm{0.018-} 0.137+0.137+ 0.108+0.108+ 0.040+\bm{0.040+}
0.50.5 0.053+0.053+ 0.028+0.028+ 0.024\bm{0.024-} 0.022\bm{0.022-} 0.0570.057- 0.0770.077- 0.052+0.052+ 0.037+0.037+ 0.020\bm{0.020-} 0.114+0.114+ 0.068+0.068+ 0.035+\bm{0.035+}
0.750.75 0.016\bm{0.016-} 0.0500.050- 0.0560.056- 0.063\bm{0.063-} 0.1180.118- 0.1140.114- 0.018+\bm{0.018+} 0.0500.050- 0.0360.036- 0.112+0.112+ 0.058+0.058+ 0.051+\bm{0.051+}
average 0.060+0.060+ 0.053+0.053+ 0.028\bm{0.028-} 0.045\bm{0.045-} 0.0610.061- 0.0670.067- 0.063+0.063+ 0.059+0.059+ 0.022\bm{0.022-} 0.128+0.128+ 0.094+0.094+ 0.041+\bm{0.041+}
Table 4: Average absolute estimation error on benchmark data, corresponding to domain adaptation scenario. Several state-of-the-art MPE algorithms DPL (Ivanov, 2020), EN (Elkan & Noto, 2008), KM (Ramaswamy et al., 2016) and TIcE (Bekker & Davis, 2018) are selected. The mean absolute error is reported, and the smallest error among the original, regrouping and subsampling versions is bolded. +//+/-/\cdot denotes that on average the estimator produces positive//negative/no estimation bias.

Dataset κ\kappa^{*} DPL ReDPL SuDPL EN ReEN SuEN KM ReKM SuKM TIcE ReTIcE SuTIcE
Mushroom 0.10.1 0.075+0.075+ 0.069+0.069+ 0.053+\bm{0.053+} 0.121+0.121+ 0.105+0.105+ 0.099+\bm{0.099+} 0.061+0.061+ 0.054+0.054+ 0.039+\bm{0.039+} 0.082+0.082+ 0.086+0.086+ 0.064+\bm{0.064+}
0.250.25 0.063+0.063+ 0.023+\bm{0.023+} 0.040+0.040+ 0.139+0.139+ 0.101+\bm{0.101+} 0.109+0.109+ 0.053+0.053+ 0.041+0.041+ 0.024+\bm{0.024+} 0.057+0.057+ 0.034+0.034+ 0.024+\bm{0.024+}
0.50.5 0.035+0.035+ 0.0840.084- 0.025+\bm{0.025+} 0.132+0.132+ 0.074+\bm{0.074+} 0.112+0.112+ 0.057\bm{0.057-} 0.1890.189- 0.0630.063- 0.026+\bm{0.026+} 0.0580.058- 0.0280.028-
0.750.75 0.0150.015- 0.1490.149- 0.013\bm{0.013-} 0.097+0.097+ 0.033+\bm{0.033+} 0.082+0.082+ 0.096\bm{0.096-} 0.2530.253- 0.1080.108- 0.0760.076- 0.1340.134- 0.027\bm{0.027-}
avg 0.047+0.047+ 0.0810.081- 0.033+\bm{0.033+} 0.122+0.122+ 0.078+\bm{0.078+} 0.101+0.101+ 0.0670.067- 0.1340.134- 0.059\bm{0.059-} 0.060+0.060+ 0.0780.078- 0.036+\bm{0.036+}
Landsat 0.10.1 0.066+0.066+ 0.053+0.053+ 0.014+\bm{0.014+} 0.168+0.168+ 0.139+0.139+ 0.115+\bm{0.115+} 0.065+0.065+ 0.047+0.047+ 0.017+\bm{0.017+} 0.070+0.070+ 0.074+0.074+ 0.031+\bm{0.031+}
0.250.25 0.060+0.060+ 0.0100.010- 0.005+\bm{0.005+} 0.161+0.161+ 0.119+0.119+ 0.100+\bm{0.100+} 0.053+0.053+ 0.027+0.027+ 0.007+\bm{0.007+} 0.034+0.034+ 0.028+0.028+ 0.012\bm{0.012-}
0.50.5 0.037+0.037+ 0.0340.034- 0.010\bm{0.010-} 0.131+0.131+ 0.097+0.097+ 0.084+\bm{0.084+} 0.037+0.037+ 0.018+0.018+ 0.016\bm{0.016-} 0.033+0.033+ 0.0310.031- 0.030\bm{0.030-}
0.750.75 0.022+\bm{0.022+} 0.0710.071- 0.0370.037- 0.102+0.102+ 0.085+0.085+ 0.041+\bm{0.041+} 0.028+0.028+ 0.023\bm{0.023-} 0.0240.024- 0.036\bm{0.036-} 0.0420.042- 0.0670.067-
avg 0.046+0.046+ 0.0420.042- 0.017\bm{0.017-} 0.141+0.141+ 0.110+0.110+ 0.085+\bm{0.085+} 0.046+0.046+ 0.029+0.029+ 0.016\bm{0.016-} 0.043+0.043+ 0.0440.044- 0.035\bm{0.035-}
Shuttle 0.10.1 0.055+0.055+ 0.048+0.048+ 0.006+\bm{0.006+} 0.105+0.105+ 0.096+0.096+ 0.053+\bm{0.053+} 0.044+0.044+ 0.033+0.033+ 0.006\bm{0.006-} 0.083+0.083+ 0.072+0.072+ 0.023+\bm{0.023+}
0.250.25 0.045+0.045+ 0.0100.010- 0.007\bm{0.007-} 0.098+0.098+ 0.082+0.082+ 0.038+\bm{0.038+} 0.027+0.027+ 0.0180.018- 0.016\bm{0.016-} 0.074+0.074+ 0.030+0.030+ 0.020+\bm{0.020+}
0.50.5 0.025+0.025+ 0.1020.102- 0.016\bm{0.016-} 0.093+0.093+ 0.066+0.066+ 0.050+\bm{0.050+} 0.017\bm{0.017-} 0.0990.099- 0.0300.030- 0.074+0.074+ 0.0650.065- 0.041+\bm{0.041+}
0.750.75 0.022\bm{0.022-} 0.3930.393- 0.0330.033- 0.063+0.063+ 0.041+\bm{0.041+} 0.044+0.044+ 0.058\bm{0.058-} 0.2940.294- 0.0750.075- 0.089+0.089+ 0.1790.179- 0.068+\bm{0.068+}
avg 0.037+0.037+ 0.1380.138- 0.015\bm{0.015-} 0.090+0.090+ 0.071+0.071+ 0.046+\bm{0.046+} 0.036+0.036+ 0.1110.111- 0.032\bm{0.032-} 0.080+0.080+ 0.0860.086- 0.038+\bm{0.038+}
MNIST17 0.10.1 0.076+0.076+ 0.078+0.078+ 0.035+\bm{0.035+} 0.184+0.184+ 0.157+0.157+ 0.116+\bm{0.116+} 0.065+0.065+ 0.057+0.057+ 0.025+\bm{0.025+} 0.101+0.101+ 0.093+0.093+ 0.052+\bm{0.052+}
0.250.25 0.058+0.058+ 0.047+0.047+ 0.008\bm{0.008-} 0.218+0.218+ 0.175+0.175+ 0.138+\bm{0.138+} 0.053+0.053+ 0.017+0.017+ 0.009\bm{0.009-} 0.098+0.098+ 0.084+0.084+ 0.031+\bm{0.031+}
0.50.5 0.032+0.032+ 0.0470.047- 0.022\bm{0.022-} 0.275+0.275+ 0.180+\bm{0.180+} 0.191+0.191+ 0.029+0.029+ 0.0300.030- 0.017\bm{0.017-} 0.080+0.080+ 0.021+\bm{0.021+} 0.047+0.047+
0.750.75 0.023\bm{0.023-} 0.1690.169- 0.0460.046- 0.250+0.250+ 0.189+\bm{0.189+} 0.217+0.217+ 0.016+\bm{0.016+} 0.1460.146- 0.016\bm{0.016-} 0.081+0.081+ 0.0940.094- 0.060+\bm{0.060+}
avg 0.047+0.047+ 0.0850.085- 0.028\bm{0.028-} 0.231+0.231+ 0.175+0.175+ 0.166+\bm{0.166+} 0.041+0.041+ 0.0630.063- 0.017\bm{0.017-} 0.090+0.090+ 0.0730.073- 0.048+\bm{0.048+}
Overall avg 0.044+0.044+ 0.0870.087- 0.023\bm{0.023-} 0.146+0.146+ 0.109+0.109+ 0.099+\bm{0.099+} 0.047+0.047+ 0.0840.084- 0.031\bm{0.031-} 0.068+0.068+ 0.0700.070- 0.039+\bm{0.039+}
Table 5: Average absolute estimation error on benchmark data, corresponding to the selected/reported at random scenario. Several state-of-the-art MPE algorithms are selected: DPL (Ivanov, 2020), EN (Elkan & Noto, 2008), KM (Ramaswamy et al., 2016) and TIcE (Bekker & Davis, 2018). The mean absolute error is reported, and the smallest error among the original, regrouping and subsampling versions is bolded. $+/-/\cdot$ denotes that on average the estimator produces positive/negative/no estimation bias.

Dataset κ\kappa^{*} DPL ReDPL SuDPL EN ReEN SuEN KM ReKM SuKM TIcE ReTIcE SuTIcE
Mushroom 0.10.1 0.073+0.073+ 0.067+0.067+ 0.058+\bm{0.058+} 0.119+0.119+ 0.105+0.105+ 0.099+\bm{0.099+} 0.059+0.059+ 0.051+0.051+ 0.040+\bm{0.040+} 0.084+0.084+ 0.082+0.082+ 0.057+\bm{0.057+}
0.250.25 0.060+0.060+ 0.023+0.023+ 0.005\bm{0.005-} 0.134+0.134+ 0.096+0.096+ 0.054+\bm{0.054+} 0.054+0.054+ 0.034+0.034+ 0.010\bm{0.010-} 0.049+0.049+ 0.028+0.028+ 0.008+\bm{0.008+}
0.50.5 0.032+0.032+ 0.0870.087- 0.018\bm{0.018-} 0.125+0.125+ 0.068+\bm{0.068+} 0.076+0.076+ 0.073\bm{0.073-} 0.1900.190- 0.0820.082- 0.027+\bm{0.027+} 0.0570.057- 0.0520.052-
0.750.75 0.023\bm{0.023-} 0.2030.203- 0.0280.028- 0.096+0.096+ 0.030+\bm{0.030+} 0.066+0.066+ 0.100\bm{0.100-} 0.2620.262- 0.1230.123- 0.028\bm{0.028-} 0.0960.096- 0.0580.058-
avg 0.047+0.047+ 0.0950.095- 0.027+\bm{0.027+} 0.119+0.119+ 0.075+0.075+ 0.074+\bm{0.074+} 0.0720.072- 0.1340.134- 0.064\bm{0.064-} 0.047+0.047+ 0.0660.066- 0.044\bm{0.044-}
Landsat 0.10.1 0.066+0.066+ 0.053+0.053+ 0.034+\bm{0.034+} 0.167+0.167+ 0.141+0.141+ 0.121+\bm{0.121+} 0.066+0.066+ 0.049+0.049+ 0.028+\bm{0.028+} 0.068+0.068+ 0.074+0.074+ 0.036+\bm{0.036+}
0.250.25 0.052+0.052+ 0.0110.011- 0.007\bm{0.007\cdot} 0.159+0.159+ 0.115+0.115+ 0.085+\bm{0.085+} 0.061+0.061+ 0.035+0.035+ 0.011+\bm{0.011+} 0.034+0.034+ 0.020+0.020+ 0.011\bm{0.011-}
0.50.5 0.043+0.043+ 0.0650.065- 0.011\bm{0.011-} 0.137+0.137+ 0.103+0.103+ 0.089+\bm{0.089+} 0.033+0.033+ 0.020+0.020+ 0.016\bm{0.016-} 0.040+0.040+ 0.029\bm{0.029-} 0.0480.048-
0.750.75 0.032+0.032+ 0.0970.097- 0.023\bm{0.023-} 0.103+0.103+ 0.071+\bm{0.071+} 0.071+\bm{0.071+} 0.023+0.023+ 0.0270.027- 0.017\bm{0.017\cdot} 0.041\bm{0.041-} 0.0870.087- 0.0680.068-
avg 0.048+0.048+ 0.0570.057- 0.019\bm{0.019-} 0.142+0.142+ 0.108+0.108+ 0.092+\bm{0.092+} 0.046+0.046+ 0.033+0.033+ 0.018+\bm{0.018+} 0.046+0.046+ 0.0530.053- 0.041\bm{0.041-}
Shuttle 0.10.1 0.056+0.056+ 0.048+0.048+ 0.015+\bm{0.015+} 0.112+0.112+ 0.097+0.097+ 0.060+\bm{0.060+} 0.049+0.049+ 0.035+0.035+ 0.016\bm{0.016-} 0.080+0.080+ 0.066+0.066+ 0.032+\bm{0.032+}
0.250.25 0.044+0.044+ 0.0180.018- 0.007\bm{0.007-} 0.106+0.106+ 0.083+0.083+ 0.045+\bm{0.045+} 0.029+0.029+ 0.0200.020- 0.018\bm{0.018-} 0.077+0.077+ 0.024+0.024+ 0.037+\bm{0.037+}
0.50.5 0.026+0.026+ 0.1620.162- 0.008\bm{0.008-} 0.098+0.098+ 0.081+0.081+ 0.068+\bm{0.068+} 0.016\bm{0.016-} 0.1650.165- 0.0510.051- 0.083+0.083+ 0.0990.099- 0.052+\bm{0.052+}
0.750.75 0.012+\bm{0.012+} 0.3480.348- 0.0230.023- 0.065+0.065+ 0.029+0.029+ 0.050+\bm{0.050+} 0.072\bm{0.072-} 0.2960.296- 0.0890.089- 0.075+0.075+ 0.1690.169- 0.078+\bm{0.078+}
avg 0.035+0.035+ 0.1440.144- 0.013\bm{0.013-} 0.095+0.095+ 0.073+0.073+ 0.056+\bm{0.056+} 0.042+\bm{0.042+} 0.1290.129- 0.0440.044- 0.079+0.079+ 0.0890.089- 0.050+\bm{0.050+}
MNIST17 0.10.1 0.077+0.077+ 0.078+0.078+ 0.055+\bm{0.055+} 0.183+0.183+ 0.157+0.157+ 0.134+\bm{0.134+} 0.060+0.060+ 0.052+0.052+ 0.046+\bm{0.046+} 0.096+0.096+ 0.083+0.083+ 0.076+\bm{0.076+}
0.250.25 0.063+0.063+ 0.052+0.052+ 0.005+\bm{0.005+} 0.229+0.229+ 0.179+0.179+ 0.113+\bm{0.113+} 0.060+0.060+ 0.027+0.027+ 0.009\bm{0.009-} 0.099+0.099+ 0.087+0.087+ 0.034+\bm{0.034+}
0.50.5 0.035+0.035+ 0.0420.042- 0.014+\bm{0.014+} 0.300+0.300+ 0.198+\bm{0.198+} 0.208+0.208+ 0.022+0.022+ 0.0240.024- 0.011\bm{0.011-} 0.099+0.099+ 0.028+\bm{0.028+} 0.053+0.053+
0.750.75 0.014\bm{0.014-} 0.1780.178- 0.0330.033- 0.248+0.248+ 0.161+\bm{0.161+} 0.200+0.200+ 0.008+\bm{0.008+} 0.1530.153- 0.0210.021- 0.090+0.090+ 0.0950.095- 0.066+\bm{0.066+}
avg 0.047+0.047+ 0.0880.088- 0.027+\bm{0.027+} 0.240+0.240+ 0.173+0.173+ 0.164+\bm{0.164+} 0.038+0.038+ 0.0640.064- 0.022+\bm{0.022+} 0.096+0.096+ 0.073+0.073+ 0.057+\bm{0.057+}
Overall avg 0.044+0.044+ 0.0960.096- 0.022\bm{0.022-} 0.149+0.149+ 0.107+0.107+ 0.097+\bm{0.097+} 0.050+0.050+ 0.0900.090- 0.037\bm{0.037-} 0.067+0.067+ 0.070+0.070+ 0.048+\bm{0.048+}

Appendix D When Irreducibility Holds

In theory, when irreducibility holds, baseline MPE methods should be asymptotically unbiased estimators of the mixture proportion $\kappa^{*}$, regrouping may introduce negative bias, and subsampling should introduce no bias. Here we run synthetic experiments in a controlled setting to verify these theoretical claims.

We specify the distributions HH, GG and FF as:

H\displaystyle H 𝒩(μ1=0,σ1=1)\displaystyle\sim\mathcal{N}(\mu_{1}=0,\sigma_{1}=1)
G\displaystyle G 𝒩(μ2=2,σ2=1)\displaystyle\sim\mathcal{N}(\mu_{2}=2,\sigma_{2}=1)
F\displaystyle F =(1κ)G+κH.\displaystyle=(1-\kappa^{*})G+\kappa^{*}H.

where $G$ is irreducible w.r.t. $H$. We draw $m=500$ instances from $H$ and $n=1500$ instances from $F$. The true mixture proportion $\kappa^{*}=\kappa(F|H)$ is varied over $\left\{0.1,0.25,0.5,0.75\right\}$.

We draw $2000$ labeled instances from $F$. A neural network with one hidden layer of $16$ neurons was trained to predict $P(Y=1|X=x)$. The acceptance function $\alpha(x)$ used in Subsampling MPE is chosen to be

α(x)={P^(Y=1|X=x),xA,1,o.w.,\alpha(x)=\begin{cases}\widehat{P}(Y=1|X=x),&x\in A,\\ 1,&\text{o.w.},\end{cases}

where $A=\{x:\widehat{P}(Y=1|X=x)>0.6\}$. As for ReMPE, $10\%$ of the sample from $F$ is regrouped into the sample from $H$, as suggested by Yao et al. (2022). The above procedure was repeated $10$ times with different random seeds. Results are shown in Table 6: SuMPE never performs worst (in some cases, subsampling slightly increases the bias relative to the original version, partly because $P(Y|X)$ must be estimated), while ReMPE may introduce extremely high bias.
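For concreteness, the data-generating step of this setup can be sketched as follows, using the true posterior in place of the trained network (the paper estimates $P(Y=1|X=x)$ with a small neural net; all other choices mirror the description above).

```python
import numpy as np

rng = np.random.default_rng(0)
kappa, m, n = 0.5, 500, 1500

x_h = rng.normal(0.0, 1.0, size=m)          # sample from H = N(0, 1)
y = rng.random(n) < kappa                   # latent mixture labels for F
x_f = np.where(y, rng.normal(0.0, 1.0, n), rng.normal(2.0, 1.0, n))

def posterior(x):
    # True P(Y=1 | X=x) for this two-Gaussian mixture (normal constants cancel)
    h = np.exp(-0.5 * x ** 2)
    g = np.exp(-0.5 * (x - 2.0) ** 2)
    return kappa * h / (kappa * h + (1 - kappa) * g)

p = posterior(x_f)
alpha = np.where(p > 0.6, p, 1.0)           # acceptance function, A = {p > 0.6}
accepted = x_f[rng.random(n) <= alpha]
c_hat = len(accepted) / n                   # empirical estimate of c
```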

Table 6: Bias of the estimators on synthetic data. Several state-of-the-art MPE algorithms DPL, EN, KM and TIcE are selected. For each of them, we compute the bias (avg[κ^κ]\text{avg}[\widehat{\kappa}^{*}-\kappa^{*}]) of the original, regrouping and subsampling version. The biggest absolute bias among the original, regrouping and subsampling versions is in bold.

κ\kappa^{*} DPL ReDPL SuDPL EN ReEN SuEN KM ReKM SuKM TIcE ReTIcE SuTIcE
0.10.1 +0.029\bm{+0.029} +0.021+0.021 +0.010+0.010 +0.020\bm{+0.020} +0.005+0.005 +0.008+0.008 +0.020\bm{+0.020} +0.007+0.007 +0.006+0.006 +0.085{+0.085} +0.089\bm{+0.089} +0.063{+0.063}
0.250.25 +0.012+0.012 0.062\bm{-0.062} 0.017-0.017 0.023-0.023 0.080\bm{-0.080} 0.043-0.043 0.016-0.016 0.074\bm{-0.074} 0.031-0.031 +0.049\bm{+0.049} +0.010+0.010 +0.024+0.024
0.50.5 +0.038+0.038 0.128\bm{-0.128} 0.003-0.003 0.045-0.045 0.156\bm{-0.156} 0.067{-0.067} 0.001-0.001 0.133\bm{-0.133} 0.018-0.018 +0.059\bm{+0.059} 0.048-0.048 +0.032+0.032
0.750.75 +0.012+0.012 0.282\bm{-0.282} 0.015-0.015 0.071{-0.071} 0.278\bm{-0.278} 0.088{-0.088} 0.0000.000 0.278\bm{-0.278} 0.012-0.012 +0.089{+0.089} 0.199\bm{-0.199} +0.037+0.037