Mixture Proportion Estimation Beyond Irreducibility
Abstract
The task of mixture proportion estimation (MPE) is to estimate the weight of a component distribution in a mixture, given observations from both the component and the mixture. Previous work on MPE adopts the irreducibility assumption, which ensures identifiability of the mixture proportion. In this paper, we propose a more general sufficient condition that accommodates several settings of interest where irreducibility does not hold. We further present a resampling-based meta-algorithm that takes any existing MPE algorithm designed to work under irreducibility and adapts it to work under our more general condition. Our approach empirically exhibits improved estimation performance relative to baseline methods and to a recently proposed regrouping-based algorithm.
1 Introduction
Mixture proportion estimation (MPE) is the problem of estimating the weight of a component distribution in a mixture. Specifically, let κ* ∈ (0, 1] and let F, G, and H be probability distributions such that F = (1 − κ*)G + κ*H. Given i.i.d. observations
$$X_1, \ldots, X_m \overset{\text{i.i.d.}}{\sim} F, \qquad X_{m+1}, \ldots, X_{m+n} \overset{\text{i.i.d.}}{\sim} H, \tag{1}$$
MPE is the problem of estimating κ*. A typical application: given some labeled positive reviews of a product, estimate the proportion of positive comments about the product among all comments (González et al., 2017). MPE is also an important component in solving several domain adaptation and weakly supervised learning problems, such as learning from positive and unlabeled examples (LPUE) (Elkan & Noto, 2008; Du Plessis et al., 2014; Kiryo et al., 2017), learning with noisy labels (Lawrence & Schölkopf, 2001; Natarajan et al., 2013; Blanchard et al., 2016), multi-instance learning (Zhang & Goldman, 2001), and anomaly detection (Sanderson & Scott, 2014).
If no assumptions are made on the unobserved component G, then κ* is not identifiable. Blanchard et al. (2010) proposed the irreducibility assumption on G so that κ* becomes identifiable. Up to now, almost all MPE algorithms build upon the irreducibility assumption (Blanchard et al., 2010; Scott, 2015; Blanchard et al., 2016; Jain et al., 2016; Ramaswamy et al., 2016; Ivanov, 2020; Bekker & Davis, 2020; Garg et al., 2021) or on stricter conditions such as non-overlapping supports of the component distributions (Elkan & Noto, 2008; Du Plessis & Sugiyama, 2014). However, as we discuss below, irreducibility can be violated in several applications, in which case the above methods produce statistically inconsistent estimates. As far as we know, Yao et al. (2022), discussed in Sec. 5, is the first attempt to move beyond irreducibility.
This work proposes a more general sufficient condition than irreducibility, and offers a practical algorithm for estimating κ* under this condition. We introduce a meta-algorithm that takes as input any MPE method that consistently estimates the mixture proportion under irreducibility, and removes the bias of that method whenever irreducibility does not hold but our more general sufficient condition does. Furthermore, even if our new sufficient condition is not satisfied, our meta-algorithm does not increase the bias of the underlying MPE method. We describe several applications and settings where our framework is relevant, and demonstrate the practical relevance of this framework through extensive experiments. Proofs and additional details can be found in the appendices.
2 Problem Setup and Background
Let F and H be probability measures on a common measurable space, and let F be a mixture of H and an unknown component G,
$$F = (1 - \kappa^*)G + \kappa^* H, \tag{2}$$
where κ* ∈ (0, 1]. With no assumptions on G, κ* is not uniquely determined by F and H. For example, suppose F = (1 − κ*)G + κ*H for some κ* ∈ (0, 1), and take any γ ∈ (0, κ*). Then F = (1 − γ)G′ + γH, where G′ = [(1 − κ*)G + (κ* − γ)H]/(1 − γ), is another valid decomposition of F with a different proportion on H (Blanchard et al., 2010).
2.1 Ground-Truth and Maximal Proportion
To address the lack of identifiability, Blanchard et al. (2010) introduced the so-called irreducibility assumption. We now recall this definition and related concepts. Throughout this work we assume that F, G, and H have densities f, g, and h, defined w.r.t. a common dominating measure μ.
Definition 2.1 (Blanchard et al. (2010)).
For any two probability distributions F and H, define
$$\kappa(F|H) := \max\{\kappa \in [0,1] : F = (1-\kappa)G' + \kappa H \text{ for some probability distribution } G'\},$$
the maximal proportion of H in F.
This quantity equals the infimum of the likelihood ratio:
Proposition 2.2 (Blanchard et al. (2010)).
It holds that
$$\kappa(F|H) = \operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{f(x)}{h(x)}. \tag{3}$$
By substituting Eqn. (2) into Eqn. (3), we get that
$$\kappa(F|H) = \kappa^* + (1-\kappa^*)\,\kappa(G|H). \tag{4}$$
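For completeness, the one-step computation behind Eqn. (4), written out in the densities introduced above:
$$\operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{f(x)}{h(x)} = \operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{(1-\kappa^*)g(x) + \kappa^* h(x)}{h(x)} = \kappa^* + (1-\kappa^*)\operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{g(x)}{h(x)} = \kappa^* + (1-\kappa^*)\,\kappa(G|H).$$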
Since κ(F|H) is identifiable from F and H, the following assumption on G ensures identifiability of κ*.
Definition 2.3 (Blanchard et al. (2010)).
We say that G is irreducible with respect to H if κ(G|H) = 0.
Thus, irreducibility means that there exists no decomposition of the form G = (1 − γ)G′ + γH, where G′ is some probability distribution and 0 < γ ≤ 1. Under irreducibility, κ* is identifiable and, in particular, by Eqn. (4), equals κ(F|H).
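As a quick numerical illustration of Definition 2.3 and Eqn. (4) (a sketch with illustrative parameters of our own choosing, not taken from the paper): if H = N(0, 1) and G = N(0, σ²) with σ > 1, then g/h is minimized at the origin with value 1/σ > 0, so G is not irreducible w.r.t. H and κ(F|H) strictly overshoots κ*.

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters: H = N(0, 1), G = N(0, 3^2), kappa* = 0.3.
kappa_star, sigma_g = 0.3, 3.0
x = np.linspace(-10, 10, 20001)          # grid covering supp(H), to approximate the essential infimum
h = norm.pdf(x, 0, 1.0)
g = norm.pdf(x, 0, sigma_g)
f = (1 - kappa_star) * g + kappa_star * h

kappa_FH = np.min(f / h)                 # grid approximation of Eqn. (3)
kappa_GH = np.min(g / h)                 # = 1/sigma_g > 0, so irreducibility fails
print(kappa_FH, kappa_star + (1 - kappa_star) * kappa_GH)  # both ~0.533 > kappa* = 0.3, matching Eqn. (4)
```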
2.2 Latent Label Model
We now consider another way of understanding irreducibility in terms of a latent label model. In particular, let X and Y be random variables characterized by the following:
- (a) (X, Y) are jointly distributed, with Y taking values in {0, 1};
- (b) X | Y = 0 ∼ G and X | Y = 1 ∼ H;
- (c) P(Y = 1) = κ*.
It follows from these assumptions that the marginal distribution of X is F:
$$P(X \in \cdot\,) = (1-\kappa^*)\,G(\cdot) + \kappa^*\,H(\cdot) = F(\cdot).$$
We also take the conditional probability of Y given X to be defined via
$$P(Y = 1 \mid X = x) = \frac{\kappa^* h(x)}{f(x)} \quad \text{whenever } f(x) > 0. \tag{5}$$
The latent label model is commonly used in the positive-unlabeled (PU) learning literature (Bekker & Davis, 2020). MPE is also called class prior/proportion estimation (CPE) in PU learning because κ* = P(Y = 1). The label Y may be viewed as indicating which component an observation from F was drawn from. Going forward, we use this latent label model in addition to the original MPE notation.
Proposition 2.4.
Under the latent label model,
$$\kappa(F|H) = \frac{\kappa^*}{\operatorname*{ess\,sup}_{x} P(Y = 1 \mid X = x)},$$
where the essential supremum is taken over the support of F.
2.3 Violation of Irreducibility
Up to now, almost all MPE algorithms assume G to be irreducible w.r.t. H (Blanchard et al., 2010, 2016; Jain et al., 2016; Ivanov, 2020), or impose stricter conditions like the anchor set assumption (Scott, 2015; Ramaswamy et al., 2016) or that G and H have disjoint supports (Elkan & Noto, 2008; Du Plessis & Sugiyama, 2014). These methods return an estimate of κ(F|H) as the estimate of κ*. If irreducibility does not hold and κ* < 1, then κ(F|H) > κ*. Even if these methods are consistent estimators of κ(F|H), they are then asymptotically biased and thus inconsistent estimators of κ*.
A sufficient condition for irreducibility to hold is that the support of H is not totally contained in the support of G. In a classification setting where G and H are the class-conditional distributions, this may be quite reasonable. It essentially assumes that each class has at least some subset of instances (with positive measure) that cannot possibly be confused with the other class. While irreducibility is reasonable in many classification-related tasks, there are also a number of important applications where it does not hold. In this subsection we give three examples of applications where irreducibility is not satisfied.
Ubiquitous Background. In gamma spectroscopy, we may view H as the distribution of the energy of a gamma particle emitted by some source of interest (e.g., Cesium-137), and G as the energy distribution of background radiation. Typically the background emits gamma particles with a wider range of energies than the source of interest does, and therefore its distribution has a wider support: supp(H) ⊂ supp(G), thus violating irreducibility. What's more, G is usually unknown, because it varies according to the surrounding environment (Alamaniotis et al., 2013). The MPE problem is: given observations of the source spectrum H, which may be collected in a laboratory setting, and observations of F in the wild, estimate κ*. This quantity is important for nuclear threat detection and nuclear safeguard applications.
Global Uncertainty. In marketing, let Y denote whether a customer does (Y = 1) or does not (Y = 0) purchase a product (Fei et al., 2013). Let H be the distribution of a feature vector extracted from a customer who buys the product, and G the distribution for those who do not. The MPE problem is: given data from past purchasers of a product (drawn from H) and from a target population (drawn from F), estimate the proportion κ* of H in F. This quantity is called the transaction rate, and is important for estimating the number of products to be sold. Irreducibility is likely to be violated here because, given a finite number of features, uncertainty about customers should remain bounded away from certainty: sup_x P(Y = 1 | X = x) < 1. In other words, there is an ε > 0 such that, for any feature vector of demographic information, the probability of buying the product is always at most 1 − ε.
Underreported Outcomes. In public health, let (X, S, Y) be a jointly distributed triple, where X is a feature vector, S denotes whether a person reports a health condition or not, and Y indicates whether the person truly has the health condition. Here, H = P(X | S = 1) and F = P(X). The MPE problem is: given data from people who reported the condition (drawn from H) and from a target group (drawn from F), determine the prevalence of the condition for the target group. This helps estimate the amount of resources needed to address the health condition. Assume there are no false reports from those who do not have the medical condition: P(S = 1 | Y = 0, X = x) = 0. Some medical conditions are underreported, such as smoking and intimate partner violence (Gorber et al., 2009; Shanmugam & Pierson, 2021). If the underreporting happens globally, meaning P(S = 1 | Y = 1, X = x) is bounded away from 1, then sup_x P(S = 1 | X = x) < 1. This is because, under the no-false-report assumption, P(S = 1 | X = x) = P(S = 1 | Y = 1, X = x) · P(Y = 1 | X = x). In this situation, irreducibility is again violated.
In the above situations, irreducibility is violated and κ(F|H) > κ*, so estimating κ(F|H) alone leads to bias. In the following, we re-examine mixture proportion estimation. In particular, we propose a more general sufficient condition than irreducibility, and introduce an estimation strategy that calls an existing MPE method and reduces or eliminates its asymptotic bias.
3 A General Identifiability Condition
Previous MPE works assume irreducibility. We propose a more general sufficient condition for recovering κ*.
Theorem 3.1 (Identifiability Under Local Supremal Posterior (LSP)).
Let A be any non-empty measurable subset of supp(H), and let q_A := ess sup_{x ∈ A} P(Y = 1 | X = x). Then
$$\kappa^* = q_A \cdot \operatorname*{ess\,inf}_{x \in A} \frac{f(x)}{h(x)}.$$
This implies that under LSP, i.e., when q_A is known for some set A, κ* is identifiable. Two special cases are worthy of comment. First, under irreducibility we may take A = supp(H) and q_A = 1, in which case the above theorem recovers the known result that κ* = κ(F|H). Second, when A is a singleton {x₀}, then q_A = P(Y = 1 | X = x₀) and κ* = P(Y = 1 | X = x₀) · f(x₀)/h(x₀), which can also be derived directly from the definition of conditional probability.
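In the notation reconstructed above, the identity in Theorem 3.1 follows from Eqn. (5) in one line (a sketch; the full argument is in Appendix A.2):
$$q_A = \operatorname*{ess\,sup}_{x \in A} P(Y = 1 \mid X = x) = \operatorname*{ess\,sup}_{x \in A} \frac{\kappa^* h(x)}{f(x)} = \frac{\kappa^*}{\operatorname*{ess\,inf}_{x \in A} f(x)/h(x)},$$
and rearranging gives the stated expression for κ*.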
The above theorem gives a general sufficient condition for recovering κ*, but estimating ess inf_{x ∈ A} f(x)/h(x) is non-trivial: when A = supp(H), it can be estimated using existing MPE methods (Blanchard et al., 2016). When A is a proper subset, however, a new approach is needed. We now present a variation of Theorem 3.1 that lends itself to a practical estimation strategy, without having to devise a completely new method of estimating this restricted infimum.
Theorem 3.2 (Identifiability Under Tight Posterior Upper Bound).
Consider any non-empty measurable set A ⊆ supp(H). Let ρ be any measurable function satisfying
$$P(Y = 1 \mid X = x) \le \rho(x) \le 1 \ \text{ for } \mu\text{-a.e. } x, \qquad \operatorname*{ess\,inf}_{x \in A} \frac{\rho(x)}{P(Y = 1 \mid X = x)} = 1. \tag{6}$$
Define a new distribution F̃ in terms of its density
$$\tilde f(x) = \frac{\rho(x)\, f(x)}{c}, \qquad c := \int \rho(x) f(x)\, d\mu(x). \tag{7}$$
Then
$$\kappa^* = c \cdot \kappa(\tilde F \mid H).$$
The theorem can be re-stated as follows: κ* is identifiable given an upper bound ρ of the posterior probability P(Y = 1 | X = x) that is tight on some set A ⊆ supp(H). One possible choice, requiring only knowledge of A and q_A, is simply
ρ(x) = q_A for x ∈ A, and ρ(x) = 1 otherwise.
If the conditional probability P(Y = 1 | X = x) is known for all x ∈ A, then
ρ(x) = P(Y = 1 | X = x) for x ∈ A, and ρ(x) = 1 otherwise
may be chosen.
Having a ρ satisfying Eqn. (6) ensures identifiability of κ*. Relaxing this requirement slightly still guarantees that the bias will not increase.
Corollary 3.3.
Let ρ be any measurable function with
$$P(Y = 1 \mid X = x) \le \rho(x) \le 1 \ \text{ for } \mu\text{-a.e. } x. \tag{8}$$
Define a new distribution F̃ in terms of its density according to Eqn. (7). Then
$$\kappa^* \le c \cdot \kappa(\tilde F \mid H) \le \kappa(F \mid H).$$
This shows that even if we have a non-tight upper bound on the posterior probability, the quantity c · κ(F̃|H) is still bounded above by κ(F|H). Therefore, a smaller asymptotic bias may be achieved by estimating c · κ(F̃|H) instead of κ(F|H).
The intuition underlying the above results is that the new distribution F̃ is generated by throwing away some probability mass from F, and can therefore be viewed as a mixture of H and a new component G̃, where G̃ now tends to be irreducible w.r.t. H. The proportion of H in F̃ relates to the original proportion κ* by the scaling constant c. This interpretation is supported mathematically in Appendix B.1.
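Using the notation reconstructed above, the scaling identity can also be verified directly (a sketch; the formal arguments are in Appendices A.3 and B.1). By Eqn. (5), f(x)/h(x) = κ*/P(Y = 1 | X = x), so
$$\kappa(\tilde F \mid H) = \operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{\rho(x) f(x)}{c\, h(x)} = \frac{\kappa^*}{c} \operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{\rho(x)}{P(Y = 1 \mid X = x)}.$$
The last infimum equals 1 under Eqn. (6), giving κ* = c · κ(F̃|H); under the weaker Eqn. (8) it is at least 1, giving κ* ≤ c · κ(F̃|H).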
4 Subsampling MPE (SuMPE)
Theorem 3.2 directly motivates a practical algorithm. We obtain a sample from the new distribution F̃, starting from the sample from F, by rejection sampling (MacKay, 2003), a Monte Carlo method that generates a sample following a new distribution based on a sample from another distribution, in terms of their densities. An instance is kept with a certain acceptance probability, and rejected otherwise. Appendix B.2 shows the detailed procedure. In our scenario, the proposal distribution is F, the target distribution is F̃, and the acceptance probability for an instance x is simply ρ(x).
4.1 Method
Our Subsampling MPE algorithm, SuMPE (Algorithm 1), follows directly from Theorem 3.2. It first obtains, in line 3, a data sample following the distribution F̃ using rejection sampling, and, in line 4, estimates the normalizing constant c. Then, in line 5, it computes an estimate of κ(F̃|H) using any existing MPE method that consistently estimates the mixture proportion under irreducibility. The final estimate of κ* is returned as the product of the estimates of c and κ(F̃|H).
Rejection sampling in high-dimensional settings may be inefficient due to a potentially low acceptance rate (MacKay, 2003). However, this concern is mitigated in our setting because the acceptance probability can be taken to be 1 except on the set A, which is potentially a small set.
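The following is a minimal sketch of the meta-algorithm in Python (our own illustration, not the reference implementation; `base_mpe`, `rho`, and the variable names are placeholders). It thins the F-sample with acceptance probability ρ(x), estimates c by the empirical acceptance rate, runs an off-the-shelf MPE estimator on the thinned sample, and rescales.

```python
import numpy as np

def sumpe(X_mix, X_comp, rho, base_mpe, rng=None):
    """Subsampling MPE (sketch of Algorithm 1).

    X_mix    : array of draws from the mixture F
    X_comp   : array of draws from the known component H
    rho      : callable returning the acceptance probability (posterior upper bound) for each x
    base_mpe : callable (X_mix_sub, X_comp) -> estimate of kappa(F_tilde | H);
               any MPE method that is consistent under irreducibility
    """
    rng = np.random.default_rng() if rng is None else rng
    accept_prob = np.clip(rho(X_mix), 0.0, 1.0)
    keep = rng.random(len(X_mix)) < accept_prob       # rejection (thinning) step: accepted points ~ F_tilde
    c_hat = keep.mean()                               # E_F[rho(X)] = c, estimated by the acceptance rate
    kappa_tilde_hat = base_mpe(X_mix[keep], X_comp)   # estimate of kappa(F_tilde | H)
    return c_hat * kappa_tilde_hat                    # estimate of kappa* (Theorem 3.2)
```

With ρ ≡ 1, every point is accepted and c = 1, so the procedure reduces to running the base MPE method directly.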
One advantage of building our method around existing MPE methods is that we may adapt known theoretical results to our setting. To illustrate this, we give a rate of convergence result for SuMPE (Theorem 4.1, proved in Appendix A.5).
4.2 Practical Scenarios
Our new sufficient condition assumes knowledge of some set A and of the associated posterior upper bound. Practically speaking, however, our algorithm only requires a ρ satisfying Eqn. (6) for some A; it does not require explicit knowledge of A itself. Additionally, even if ρ does not satisfy Eqn. (6), as long as it satisfies Eqn. (8) (which is easier to achieve), SuMPE performs no worse than directly applying off-the-shelf MPE methods.
There are settings where a generic construction of ρ is possible. For example, suppose the user has access to fully labeled data (where it is known which of G or H each instance came from), but only on a subset of the domain. This may come from an annotator who is only an expert on a subset of instances. Such data should be sufficient to obtain a non-trivial upper bound on the posterior class probability P(Y = 1 | X = x), which in turn leads to a valid ρ.
More typically, however, it may be necessary to determine ρ on a case-by-case basis. This section continues the discussion of the three applications introduced in Sec. 2.3. Each of these three settings leverages domain-specific knowledge in a different way, and we believe this leads to a better ρ than a one-size-fits-all construction.
4.2.1 Unfolding
Unfolding refers to the process of recovering one or more true distributions from contaminated ones (Cowan, 1998). In gamma spectrum unfolding (Li et al., 2019), a gamma-ray detector measures the energies of incoming gamma rays. The gamma rays were emitted either by a source of interest or by the background. The measurement is represented as a histogram whose bins correspond to a quantization of energy. In many settings, the histogram of measurements from the source of interest is also known. In this case, unfolding amounts to the estimation of the unknown background histogram. Toward this goal, it suffices to estimate the proportion κ* of recorded particles emanating from the source of interest, since the background can then be recovered as G = (F − κ*H)/(1 − κ*). This application corresponds to the “ubiquitous background” setting described in Sec. 2.3, where irreducibility may not hold since the source-of-interest energies can be a subset of the background energies.
Using existing techniques from the nuclear detection literature (Knoll, 2010; Alamaniotis et al., 2013), we can obtain a lower bound on the background posterior P(Y = 0 | X = x) on a certain set A (see Appendix B.3 for details). This leads to the acceptance function
$$\rho(x) = \begin{cases} 1 - \text{(lower bound on } P(Y = 0 \mid X = x)\text{)} & x \in A \\ 1 & \text{otherwise,} \end{cases} \tag{9}$$
which is an upper bound on P(Y = 1 | X = x), satisfying the condition in Corollary 3.3.
4.2.2 CPE under domain adaptation
In the problem of domain adaptation, the learner is given labeled examples from a source distribution, and the task is to do inference on a potentially different target distribution. Previous work on domain adaptation mainly focuses on classification and typically makes assumptions about which of the distributions P(X), P(Y), P(Y|X), and P(X|Y) vary between the source and target. This leads to situations such as covariate shift (where P(X) changes) (Heckman, 1979), posterior drift (where P(Y|X) changes) (Scott, 2019), prior/target shift (where P(Y) changes) (Storkey et al., 2009), and conditional shift (where P(X|Y) changes) (Zhang et al., 2013). It is also quite commonly assumed that the support of the source distribution contains the support of the target (Heckman, 1979; Bickel et al., 2009; Gretton et al., 2009; Storkey et al., 2009; Zhang et al., 2013; Scott, 2019).
We study class proportion estimation (CPE) under domain adaptation. Prior work on this topic has considered distributional assumptions like those described above (Saerens et al., 2002; Sanderson & Scott, 2014; González et al., 2017). In this work, we consider the setting where, in addition to labeled examples from the source, the learner has access to labeled positive and unlabeled data from the target. We propose a model that includes covariate shift and posterior drift as special cases. We use subscripts S and T to denote the source and target distributions, respectively. In MPE notation, H = P_T(X | Y = 1), F = P_T(X), and κ* = P_T(Y = 1).
Definition 4.2 (CSPL).
We say that covariate shift with posterior lift (CSPL) occurs whenever
$$P_T(Y = 1 \mid X = x) \le P_S(Y = 1 \mid X = x) \ \text{ for all } x,$$
and equality holds for some x in the support of the target positive distribution H.
Covariate shift is a special case of CSPL in which equality always holds. One motivation for posterior lift is to model labels produced by an annotator who is biased toward one class. It is a type of posterior drift model, wherein the posterior changes from source to target (Scott, 2019; Cai & Wei, 2021; Maity et al., 2023). Also notice that CSPL requires neither that the support of the source distribution contain that of the target, nor irreducibility.
CSPL is motivated by the marketing application mentioned in Sec. 2.3. In marketing, companies often have access to labeled data from a source distribution, such as survey results where customers express their interest in a product. Additionally, they also have access to labeled positive and unlabeled data from the target distribution, which corresponds to actual purchasing behavior. In this scenario, the CSPL assumption is often met, as it is more likely for customers to express interest than to actually make a purchase: P_T(Y = 1 | X = x) ≤ P_S(Y = 1 | X = x).
Although irreducibility is violated in the marketing application due to the “global uncertainty” about a target customer buying the product (see Sec. 2.3), CSPL ensures the identifiability of κ* because we can choose the set A and the acceptance function as
$$\rho(x) = P_S(Y = 1 \mid X = x), \tag{10}$$
which satisfies the identifiability criteria in Theorem 3.2. (To see this, take A to be a set on which equality holds in Definition 4.2 and take ρ as in Eqn. (10); these play the roles of A and ρ in Theorem 3.2.) By using the labeled source data, an estimate of P_S(Y = 1 | X = x) can be obtained and used as the acceptance function in Algorithm 1 to do CPE.
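A sketch of this construction in code (our own illustration; the classifier choice, the `sumpe` helper from Sec. 4.1, and `some_base_mpe` are placeholders, and the experiments in Sec. 6 use a small neural network rather than logistic regression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_cspl_acceptance(X_src, y_src):
    """Fit a probabilistic classifier on labeled source data to estimate P_S(Y=1 | X=x),
    which under CSPL upper-bounds the target posterior and can serve as the acceptance function."""
    clf = LogisticRegression(max_iter=1000).fit(X_src, y_src)
    return lambda X: clf.predict_proba(X)[:, 1]

# Usage (X_mix ~ target F, X_comp ~ target positives H):
# rho = make_cspl_acceptance(X_src, y_src)
# kappa_hat = sumpe(X_mix, X_comp, rho, base_mpe=some_base_mpe)
```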
4.2.3 Selected/Reported at Random
In public health, (X, S, Y) are jointly distributed, where X is the feature vector, S denotes whether a person reports a medical condition or not, and Y indicates whether the person truly has the medical condition. The goal is to estimate the proportion of people in the target group that report the medical condition. This setting was described in Sec. 2.3 as “underreported outcomes”, where it was argued that irreducibility may not hold, in which case estimating κ(F|H) overestimates the true value of κ*. Our SuMPE framework provides a way to eliminate the bias.
The behavior of underreporting can be captured using the selection bias model (Kato et al., 2018; Bekker et al., 2019; Gong et al., 2021). Denote the probability of reporting as e(x) := P(S = 1 | Y = 1, X = x). Assume there is no false report: P(S = 1 | Y = 0, X = x) = 0. We use the notation f_{Y=1} to indicate the conditional density of X given the event Y = 1. Then the density of H satisfies h(x) ∝ e(x) f_{Y=1}(x). Under this model, the density of the marginal distribution F can be decomposed as
f(x) = α f_{Y=1}(x) + (1 − α) f_{Y=0}(x),
where α = P(Y = 1) is the proportion of people having the medical condition. The mixture proportion to be estimated is κ*.
We assume access to i.i.d. samples from H, representing public survey data in which people report the presence of the medical condition, and from F, representing the target group. Further assume that a set A with P(Y = 1 | X = x) = 1 for all x ∈ A is known. This is a subset of patients who are guaranteed to have the condition, which could be obtained from historical patient data from hospitals. Then the mixture proportion can be recovered from Algorithm 1, where the acceptance function
(11)
satisfies the condition in Theorem 3.2. This is because, under the no-false-report assumption, P(S = 1 | X = x) = P(S = 1 | Y = 1, X = x) · P(Y = 1 | X = x). (The set A and the acceptance function above play the roles of A and ρ in Theorem 3.2.) In practice, the required conditional probability P(S = 1 | X = x) can be estimated from labeled examples (xᵢ, sᵢ).
5 Limitation of Previous Work
Previous research by Yao et al. (2022) introduced the Regrouping MPE (ReMPE) method (called Regrouping CPE, or ReCPE, in that work), which is built on top of any existing MPE estimator, just like our meta-algorithm. They claimed that ReMPE works as well as the base MPE method when irreducibility holds, while improving the performance when it does not. In this section we offer some comments on ReMPE.
Regrouping MPE in theory.
Consider any G such that Eqn. (2) holds, and write G as an arbitrary mixture of two distributions, G = (1 − β)G′ + βG″ with β ∈ [0, 1]. Then F can be re-written as
$$F = (1 - \kappa')\,G' + \kappa'\, H', \tag{12}$$
where κ′ = κ* + (1 − κ*)β and H′ = [κ* H + (1 − κ*)β G″]/κ′. Yao et al. (2022) assume there exists a set on which the infimum in Eqns. (3) and (4) is achieved. They propose to specify G″ as the truncated distribution of G on that set, with β equal to the G-probability of the set. This specific choice causes the resulting distribution G′ to be irreducible w.r.t. H′ and the bias introduced by regrouping to be minimal. Denote the above procedure as ReMPE-1 (Algorithm 2). Theorem 2 in Yao et al. (2022) provides a theoretical justification for ReMPE-1, which we restate here.
Theorem 5.1 (Yao et al. (2022)).
Let κ′ be the mixture proportion obtained from ReMPE-1. 1) If G is irreducible w.r.t. H, then κ′ = κ*. 2) If G is not irreducible w.r.t. H, then κ* ≤ κ′ ≤ κ(F|H).
While this theorem is valid, we note that in the case where G is irreducible w.r.t. H, the set being regrouped lies outside the support of G, and therefore it is not appropriate to describe the procedure as “regrouping G.” In fact, performing regrouping with β > 0 always introduces a positive bias, because κ′ = κ* + (1 − κ*)β > κ* whenever κ* < 1. This indicates that any kind of actual regrouping will have a positive bias under irreducibility.
Regrouping MPE in practice.
Yao et al. (2022)'s practical implementation of regrouping deviates from the theoretical proposal. Here, we state and analyze an idealized version of their practical algorithm, referred to as ReMPE-2 (Algorithm 3). ReMPE-2 does not rely on knowledge of the decomposition outlined in Eqn. (12). Instead, the regrouped set is chosen based solely on F and H, and the new component distribution is obtained by regrouping some probability mass from F rather than from G. (Yao et al. (2022)'s real implementation differs a bit from ReMPE-2 in that, instead of choosing a set explicitly, they select a small fraction of the examples drawn from F with the smallest estimated scores.)
ReMPE-2 is fundamentally different from ReMPE-1 in that it constructs the regrouped component differently. Specifically, when the irreducibility assumption holds, ReMPE-1 regroups nothing (because the regrouped set carries no G-probability), but ReMPE-2 still regroups a fixed proportion of mass from F to H. Therefore, Theorem 5.1 does not apply to it. Yao et al. (2022) did not analyze ReMPE-2, but the next result shows that it has a negative bias under irreducibility.
Proposition 5.2.
For the mixture proportion obtained from ReMPE-2:
Thus, if irreducibility holds, then ReMPE-2 returns an estimate strictly smaller than κ*, which is undesirable. However, when irreducibility does not hold, ReMPE-2 may lead to a smaller asymptotic bias than estimating κ(F|H), which could explain why the authors observe empirical improvements in their results. Our theoretical analysis of ReMPE-2 is supported experimentally in Sec. 6 and Appendix D.
To summarize, Yao et al. (2022) proposed a regrouping approach that was the first attempt to tackle the problem of MPE beyond irreducibility, and it motivated our work. ReMPE-1 recovers κ* when irreducibility holds (although in this case it does no regrouping), and decreases the bias when irreducibility does not hold. The more practical algorithm ReMPE-2 may decrease the bias when irreducibility does not hold, but it has a negative bias when irreducibility does hold. Like ReMPE-1, SuMPE draws on some additional information beyond F and H. Both meta-algorithms do not increase the bias, and both recover κ* when irreducibility holds. Unlike ReMPE-1, however, SuMPE is able to recover κ* under a more general condition. Furthermore, our practical implementations of subsampling are based directly on Theorem 3.2, unlike ReMPE-2, which does not have the desirable theoretical properties of ReMPE-1. Finally, as we argue in the next section, SuMPE offers significant empirical performance gains.
One limitation of our SuMPE framework is that some knowledge of the acceptance function ρ is needed, and ρ may need to be developed specifically for different applications.
6 Experiments
We ran our algorithm on nuclear, synthetic, and benchmark datasets taken from the UCI machine learning repository and MNIST, corresponding to the three scenarios described in Sec. 4.2. We take four MPE algorithms: DPL (Ivanov, 2020), EN (Elkan & Noto, 2008), KM (Ramaswamy et al., 2016) and TIcE (Bekker & Davis, 2018). We compare the original version of these methods together with their regrouping (Re-) and subsampling (Su-) versions. All experiments in this section consider settings where irreducibility is violated. The summarized results are shown in Table 1, and the detailed results are in Appendix C. Overall, the subsampling version of each MPE method outperforms the original and regrouping versions. Additional experiments where irreducibility holds are offered in Appendix D, where we find that ReMPE harms the estimation performance while SuMPE does not. The implementation is available at https://github.com/allan-z/SuMPE.
Setup | Dataset | DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unfolding | Gamma Ray | ||||||||||||
Domain Adaptation | Synthetic | ||||||||||||
Mushroom | |||||||||||||
Landsat | |||||||||||||
Shuttle | |||||||||||||
MNIST17 | |||||||||||||
Selected/ Reported at Random | Mushroom | ||||||||||||
Landsat | |||||||||||||
Shuttle | |||||||||||||
MNIST17 |
6.1 Unfolding: Gamma Ray Spectra Data
The gamma-ray spectra data are simulated with the Monte Carlo N-Particle (MCNP) radiation transport code (Werner et al., 2018). H is the distribution of Cesium-137. G is the background distribution, consisting of terrestrial background and Cobalt-60. The goal is to estimate the proportion κ* of Cesium-137 in the measured spectrum F.
Sample sizes were chosen to reflect a reasonable number of counts for many nuclear detection applications, and the true mixture proportion κ* is varied over a range of values. The random variable X represents quantized energy, which is one-dimensional and discrete-valued. Therefore, we directly use the histogram as the estimate of each distribution. We choose the acceptance function according to the methodology developed in Sec. 4.2.1 and Appendix B.3.
Three of the four baseline MPE methods (DPL, EN, KM) did not work well out-of-the-box in this setting. We therefore re-implemented these methods to explicitly leverage the histogram representation of the probability distributions. This also greatly sped up the KM approach. The results are summarized in Table 2.
6.2 Domain Adaptation: Synthetic Data
Following the setup in Sec. 4.2.2, we specify the target conditional and marginal distributions H, G, and F as Gaussian mixtures, where G is not irreducible w.r.t. H because it contains a Gaussian component with a larger variance than that of H. We draw instances from both F and H, and the true mixture proportion κ* is varied over a range of values.
In addition, we draw labeled instances from a source distribution that is modified relative to the target, and we truncate the source distribution (and therefore the source sample) to a sub-region of the domain. The resulting source and target distributions satisfy the CSPL assumption. A one-hidden-layer neural network was trained to predict P_S(Y = 1 | X = x), and the acceptance function was chosen according to Eqn. (10). This procedure was repeated multiple times with different random seeds. Detailed results are shown in Table 3.
6.3 Domain Adaptation: Benchmark Data
In many machine learning datasets for classification (e.g., those in UCI), irreducibility is satisfied. (The datasets Mushroom, Landsat, Shuttle and MNIST were chosen for our study because previous empirical research (Ivanov, 2020) showed that the baseline MPE methods perform well on these datasets when the irreducibility assumption holds. Our paper focuses on how to eliminate the estimation bias that arises from the violation of irreducibility; we therefore chose datasets where the baseline methods perform well, in order to clearly observe and measure the bias introduced by MPE methods when irreducibility is not met.) Here we manually create datasets that violate irreducibility by uniformly sampling out a fraction of the original positive data as the target positive data, with all the remaining data treated as target negative. The target conditional and marginal distributions H, G, and F are specified accordingly.
We draw instances from F and from H (the exact numbers vary by dataset and are based on the number of examples available; see the code provided). The true mixture proportion κ* is varied over a range of values, which is determined by the total number of positive and negative data originally available in the dataset and therefore varies case by case.
In addition, we obtain labeled instances following the source distribution, by drawing data from the target positive and data from the target negative distribution. This causes a prior shift that simulates CSPL.
A one-hidden-layer neural network was trained to predict P_S(Y = 1 | X = x). For real-world high-dimensional data, it is hard to know the support of H. Therefore, we choose A to consist of examples with a high estimated posterior, because an example with a high posterior is more likely to lie in the support of H. The acceptance function was determined according to Eqn. (10). The above procedure was repeated multiple times with different random seeds. Table 4 summarizes the results.
6.4 Selected/Reported at Random: Benchmark Data
Recalling the setting of Sec. 4.2.3, there is a jointly distributed triple (X, S, Y), where S indicates whether a condition is reported and Y indicates whether the condition is actually present. For the experiments below, the data from F, H, and G are generated in the same way as the target distribution of the previous subsection. Instead of observing labeled source data, however, in this subsection we observe instances from F together with their labels S, matching the setup of Sec. 4.2.3.
A one-hidden-layer neural network was trained to predict P(S = 1 | X = x). We chose the set A and the acceptance function according to Eqn. (11). (A needs to be a subset of the support of H; at the population level the choice described in Sec. 4.2.3 satisfies this, and in practice we replace the population-level quantities with their estimates.) The above procedure was repeated multiple times with different random seeds. The results are shown in Table 5.
7 Conclusion
This work introduces a more general identifiability condition than irreducibility for mixture proportion estimation. We also propose a subsampling-based framework that achieves bias reduction/elimination for baseline MPE algorithms. Theoretically, our work expands the scope of settings where MPE can be solved. Practically, we illustrate three scenarios where irreducibility fails, and our meta-algorithm successfully improves upon baseline MPE methods.
Acknowledgements
This work was supported in part by the National Science Foundation under award 2008074 and the Department of Defense, Defense Threat Reduction Agency under award HDTRA1-20-2-0002.
References
- Alamaniotis et al. (2013) Alamaniotis, M., Mattingly, J., and Tsoukalas, L. H. Kernel-based machine learning for background estimation of NaI low-count gamma-ray spectra. IEEE Transactions on Nuclear Science, 60(3):2209–2221, 2013.
- Bekker & Davis (2018) Bekker, J. and Davis, J. Estimating the class prior in positive and unlabeled data through decision tree induction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Bekker & Davis (2020) Bekker, J. and Davis, J. Learning from positive and unlabeled data: A survey. Machine Learning, 109(4):719–760, 2020.
- Bekker et al. (2019) Bekker, J., Robberechts, P., and Davis, J. Beyond the selected completely at random assumption for learning from positive and unlabeled data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 71–85. Springer, 2019.
- Bickel et al. (2009) Bickel, S., Brückner, M., and Scheffer, T. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(9), 2009.
- Blanchard et al. (2010) Blanchard, G., Lee, G., and Scott, C. Semi-supervised novelty detection. The Journal of Machine Learning Research, 11:2973–3009, 2010.
- Blanchard et al. (2016) Blanchard, G., Flaska, M., Handy, G., Pozzi, S., Scott, C., et al. Classification with asymmetric label noise: Consistency and maximal denoising. Electronic Journal of Statistics, 10(2):2780–2824, 2016.
- Cai & Wei (2021) Cai, T. T. and Wei, H. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 49(1):100–128, 2021.
- Cowan (1998) Cowan, G. Statistical data analysis. Oxford university press, 1998.
- Du Plessis & Sugiyama (2014) Du Plessis, M. C. and Sugiyama, M. Class prior estimation from positive and unlabeled data. IEICE TRANSACTIONS on Information and Systems, 97(5):1358–1362, 2014.
- Du Plessis et al. (2014) Du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. Advances in neural information processing systems, 27, 2014.
- Elkan & Noto (2008) Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 213–220, 2008.
- Fei et al. (2013) Fei, H., Kim, Y., Sahu, S., Naphade, M., Mamidipalli, S. K., and Hutchinson, J. Heat pump detection from coarse grained smart meter data with positive and unlabeled learning. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1330–1338, 2013.
- Garg et al. (2021) Garg, S., Wu, Y., Smola, A. J., Balakrishnan, S., and Lipton, Z. Mixture proportion estimation and PU learning: A modern approach. Advances in Neural Information Processing Systems, 34:8532–8544, 2021.
- Gong et al. (2021) Gong, C., Wang, Q., Liu, T., Han, B., You, J., Yang, J., and Tao, D. Instance-dependent positive and unlabeled learning with labeling bias estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4163–4177, 2021.
- González et al. (2017) González, P., Castaño, A., Chawla, N. V., and Coz, J. J. D. A review on quantification learning. ACM Computing Surveys (CSUR), 50(5):1–40, 2017.
- Gorber et al. (2009) Gorber, S. C., Schofield-Hurwitz, S., Hardt, J., Levasseur, G., and Tremblay, M. The accuracy of self-reported smoking: a systematic review of the relationship between self-reported and cotinine-assessed smoking status. Nicotine & tobacco research, 11(1):12–24, 2009.
- Gretton et al. (2009) Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
- Heckman (1979) Heckman, J. J. Sample selection bias as a specification error. Econometrica: Journal of the econometric society, pp. 153–161, 1979.
- Ivanov (2020) Ivanov, D. DEDPUL: Difference-of-estimated-densities-based positive-unlabeled learning. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 782–790. IEEE, 2020.
- Jain et al. (2016) Jain, S., White, M., Trosset, M. W., and Radivojac, P. Nonparametric semi-supervised learning of class proportions. arXiv preprint arXiv:1601.01944, 2016.
- Kato et al. (2018) Kato, M., Teshima, T., and Honda, J. Learning from positive and unlabeled data with a selection bias. In International conference on learning representations, 2018.
- Kiryo et al. (2017) Kiryo, R., Niu, G., Du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. Advances in neural information processing systems, 30, 2017.
- Knoll (2010) Knoll, G. F. Radiation detection and measurement. John Wiley & Sons, 2010.
- Lawrence & Schölkopf (2001) Lawrence, N. and Schölkopf, B. Estimating a kernel Fisher discriminant in the presence of label noise. In 18th International Conference on Machine Learning (ICML 2001), pp. 306–306. Morgan Kaufmann, 2001.
- Li et al. (2019) Li, F., Gu, Z., Ge, L., Li, H., Tang, X., Lang, X., and Hu, B. Review of recent gamma spectrum unfolding algorithms and their application. Results in Physics, 13:102211, 2019.
- MacKay (2003) MacKay, D. J. Information theory, inference and learning algorithms. Cambridge university press, 2003.
- Maity et al. (2023) Maity, S., Dutta, D., Terhorst, J., Sun, Y., and Banerjee, M. A linear adjustment based approach to posterior drift in transfer learning. Biometrika, 2023. URL https://doi.org/10.1093/biomet/asad029.
- Natarajan et al. (2013) Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. Advances in neural information processing systems, 26, 2013.
- Ramaswamy et al. (2016) Ramaswamy, H., Scott, C., and Tewari, A. Mixture proportion estimation via kernel embeddings of distributions. In International conference on machine learning, pp. 2052–2060. PMLR, 2016.
- Saerens et al. (2002) Saerens, M., Latinne, P., and Decaestecker, C. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21–41, 2002.
- Sanderson & Scott (2014) Sanderson, T. and Scott, C. Class proportion estimation with application to multiclass anomaly rejection. In Artificial Intelligence and Statistics, pp. 850–858. PMLR, 2014.
- Scott (2015) Scott, C. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In Artificial Intelligence and Statistics, pp. 838–846. PMLR, 2015.
- Scott (2019) Scott, C. A generalized Neyman-Pearson criterion for optimal domain adaptation. In Algorithmic Learning Theory, pp. 738–761. PMLR, 2019.
- Shanmugam & Pierson (2021) Shanmugam, D. and Pierson, E. Quantifying inequality in underreported medical conditions. arXiv preprint arXiv:2110.04133, 2021.
- Storkey et al. (2009) Storkey, A. et al. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, 30:3–28, 2009.
- Werner et al. (2018) Werner, C. J., Bull, J., Solomon, C., Brown, F., McKinney, G., Rising, M., Dixon, D., Martz, R., Hughes, H., Cox, L., et al. MCNP 6.2 release notes. Los Alamos National Laboratory, 2018.
- Yao et al. (2022) Yao, Y., Liu, T., Han, B., Gong, M., Niu, G., Sugiyama, M., and Tao, D. Rethinking class-prior estimation for positive-unlabeled learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=aYAA-XHKyk.
- Zhang et al. (2013) Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. Domain adaptation under target and conditional shift. In International conference on machine learning, pp. 819–827. PMLR, 2013.
- Zhang & Goldman (2001) Zhang, Q. and Goldman, S. EM-DD: An improved multiple-instance learning technique. Advances in neural information processing systems, 14, 2001.
Appendix A Proofs
A.1 Proof of Proposition 2.4
Proposition.
Under the latent label model,
$$\kappa(F|H) = \frac{\kappa^*}{\operatorname*{ess\,sup}_{x} P(Y = 1 \mid X = x)},$$
where the essential supremum is taken over the support of F.
Proof.
First consider the trivial case when , then from the fact that , must also be . According to Eqn. (5), , therefore the equality holds.
Then for , we consider the cases and separately. For the first case, consider two subcases: and . In the first subcase, we have that , and therefore by the definition of conditional probability,
Taking the essential supremum over all with ,
where the second equality follows from Proposition 3. In the second subcase, is zero. Therefore,
When , for all , and the equality still holds. ∎
A.2 Proof of Theorem 3.1
Theorem (Identifiability Under Local Supremal Posterior (LSP)).
Let A be any non-empty measurable subset of supp(H), and let q_A := ess sup_{x ∈ A} P(Y = 1 | X = x). Then
$$\kappa^* = q_A \cdot \operatorname*{ess\,inf}_{x \in A} \frac{f(x)}{h(x)}.$$
Proof.
Consider the case of and separately.
If , then . Recall from Eqn. (5),
Taking the essential supremum over and recall the definition of ,
Since , and are both positive for . Rearrange the denominator
take the denominator to the other side, we get
where the second equality follows from the identity that .
If , then , the above equality still holds. ∎
A.3 Proof of Theorem 3.2
Theorem (Identifiability Under Tight Posterior Upper Bound).
Consider any non-empty measurable set A ⊆ supp(H). Let ρ be any measurable function satisfying
$$P(Y = 1 \mid X = x) \le \rho(x) \le 1 \ \text{ for } \mu\text{-a.e. } x, \qquad \operatorname*{ess\,inf}_{x \in A} \frac{\rho(x)}{P(Y = 1 \mid X = x)} = 1. \tag{13}$$
Define a new distribution F̃ in terms of its density
$$\tilde f(x) = \frac{\rho(x)\, f(x)}{c}, \qquad c := \int \rho(x) f(x)\, d\mu(x). \tag{14}$$
Then
$$\kappa^* = c \cdot \kappa(\tilde F \mid H).$$
A.4 Proof of Corollary 3.3
Corollary.
Let ρ be any measurable function with P(Y = 1 | X = x) ≤ ρ(x) ≤ 1 for μ-a.e. x. Define a new distribution F̃ in terms of its density f̃, obtained by Eqn. (7). Then
$$\kappa^* \le c \cdot \kappa(\tilde F \mid H) \le \kappa(F \mid H).$$
Proof.
A.5 Proof of Theorem 4.1
We now establish a rate of convergence result for the estimator from Algorithm 1.
Theorem.
Proof.
Recall the setup: we originally have an i.i.d. sample (the notation here differs slightly from Eqn. (1) in that the indexing is changed, which allows for more concise notation in the following derivation)
After rejection sampling, we obtain i.i.d. sample (where and ), from which we can estimate . Under the assumption that exists, estimator by Scott (2015) has rate of convergence
(15) |
Now is a random variable here, and we want to establish a rate of convergence result involving . This can be done by applying a concentration inequality for .
Theorem.
(Hoeffding's Inequality) Let Z₁, …, Zₙ be independent random variables taking values in [0, 1], and denote Z̄ = (1/n) ∑ᵢ Zᵢ. Then for all t > 0,
$$P\big(\,|\bar Z - \mathbb{E}[\bar Z]| \ge t\,\big) \le 2\exp(-2nt^2).$$
Setting the right-hand side to δ, the theorem can be restated as: with probability at least 1 − δ,
$$|\bar Z - \mathbb{E}[\bar Z]| \le \sqrt{\frac{\log(2/\delta)}{2n}}.$$
Take , where , denotes the -th independent draw from 888 is used in rejection sampling (Algorihm 4). and denote the -th draw from . Then
Plug into Hoeffding’s Inequality and setting , we have
(16) |
which allows us to bound by a constant times .
Now we can establish a rate of convergence result of w.r.t.
As for the estimator of c, the rate of convergence can also be shown via Hoeffding's Inequality.
Plugging into Hoeffding's Inequality and letting the confidence level be 1 − δ, we have
(17) |
Note that, by the triangle inequality, the error of the final estimate can be decomposed into the error in estimating κ(F̃|H) and the error in estimating c.
Finally, combining all of the previous results,
we conclude that the estimator achieves the stated rate of convergence.
∎
A.6 Proof of Proposition 5.2
Proposition.
For the mixture proportion obtained from ReMPE-2:
Proof.
From the fact that , we have
After regrouping, . Therefore,
∎
Appendix B More about Subsampling MPE
B.1 Intuition
Theorem 3.2 and Corollary 3.3 have already justified the use of subsampling. Here, we explain it from another perspective (in terms of distributions, similar to the analysis in Yao et al. (2022)). The idea is that, since the original G may violate the irreducibility assumption, we modify F so that the resulting latent component distribution is less likely to violate the assumption.
Write the unknown distribution G itself as a mixture of a part to be kept and a part to be discarded. Then F becomes a mixture of H, the kept part of G, and the discarded part. Moving the discarded part to the other side (i.e., discarding that probability mass from F), the left-hand side can be re-written, after renormalization, as the new distribution F̃. By discarding a portion of probability mass from F, which is done by subsampling in practice, the resulting latent component distribution G̃ is less likely to violate the irreducibility assumption. The new mixture proportion of H in F̃ is κ*/c.
In the following, we provide justification of the above claim, which can also be seen as a reformulation of Corollary 3.3.
Proposition B.1.
Given that some probability mass is dropped from F, the quantity c · κ(F̃|H) is always bounded as follows: κ* ≤ c · κ(F̃|H) ≤ κ(F|H). Furthermore, under suitable conditions on the discarded mass, the bias is strictly reduced.
Proof.
Observe that
therefore .
From the fact that , we have . Then
To have bias reduction, we need , where both quantities can be represented as
Then we can get the result of by direct comparison. ∎
Based on the above proof, we claim that c · κ(F̃|H) leads to estimation bias no worse than κ(F|H). To be specific, when G is irreducible w.r.t. H, then κ* = c · κ(F̃|H) = κ(F|H). When G is not irreducible w.r.t. H, then κ* ≤ c · κ(F̃|H) ≤ κ(F|H). The key difference compared to Yao et al. (2022)'s approach is that we are modifying F, equivalently subsampling F, rather than regrouping probability mass onto H.
In summary, without making assumptions about irreducibility, as long as we remove some contribution of G from F by subsampling, the resulting identifiable quantity c · κ(F̃|H) will be no worse an estimate of κ* than the maximal proportion κ(F|H). Furthermore, with knowledge of a set A and an acceptance function satisfying Eqn. (6), c · κ(F̃|H) will equal κ*.
B.2 Rejection Sampling
Rejection sampling is a Monte Carlo method that aims to generate a sample following a new distribution F̃ based on a sample from a distribution F, where the two are characterized by their densities f̃ and f. An instance x drawn from F is kept with an acceptance probability proportional to the density ratio f̃(x)/f(x), and rejected otherwise. Algorithm 4 shows the detailed procedure.
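A minimal sketch of this thinning procedure (our own illustration; `rho` is the acceptance function of the main text, so that the accepted points follow the density ρ(x)f(x)/c):

```python
import numpy as np

def thin_sample(X, rho, rng=None):
    """Keep each draw x from F independently with probability rho(x).

    The accepted points follow the density rho(x) * f(x) / c, and the
    empirical acceptance rate estimates c = E_F[rho(X)].
    """
    rng = np.random.default_rng() if rng is None else rng
    p = np.clip(rho(X), 0.0, 1.0)
    keep = rng.random(len(X)) < p
    return X[keep], keep.mean()

# Example: keep only half of the points falling in A = [0, 1):
# X_tilde, c_hat = thin_sample(X, lambda x: np.where((x >= 0) & (x < 1), 0.5, 1.0))
```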
B.3 Gamma Spectrum Unfolding
In gamma spectrum unfolding, a gamma ray detector measures the energies of incoming gamma ray particles. The gamma rays were emitted either by a source of interest or from the background. The measurement is represented as a histogram , where the bins correspond to a quantization of energy. The histogram of measurements from the source of interest is also known.
The goal is to obtain a lower bound on the background posterior P(Y = 0 | X = x) on a certain set A. We specify A to be the energy bins near the main peak of H (a.k.a. the full-energy peak). For bins just outside A, the recorded gamma rays must come from the background, so the background level is known in these regions. Typically, this neighborhood consists of two (or more) intervals, so we know the background level on either side of the set A. The background level inside A can then be estimated using linear interpolation (Knoll, 2010; Alamaniotis et al., 2013), which yields the desired lower bound on P(Y = 0 | X = x) for x ∈ A. The above procedure is illustrated in Figure 1, and the acceptance function is chosen to be
$$\rho(x) = \begin{cases} 1 - \text{(estimated lower bound on } P(Y = 0 \mid X = x)\text{)} & x \in A \\ 1 & \text{otherwise.} \end{cases} \tag{18}$$
[Figure 1: Illustration of the construction of the acceptance function for gamma spectrum unfolding; the background level under the full-energy peak (the set A) is estimated by linear interpolation from the neighboring energy regions.]
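A sketch of this construction in code (our own illustration with hypothetical bin indices; the histogram, the peak window, and the interpolation endpoints are placeholders):

```python
import numpy as np

def unfolding_acceptance(counts_F, peak_lo, peak_hi):
    """Acceptance function over energy bins for the unfolding setting (sketch).

    counts_F         : measured mixture histogram (counts per energy bin)
    peak_lo, peak_hi : bin indices delimiting the full-energy-peak window A
    """
    bins = np.arange(len(counts_F))
    in_A = (bins >= peak_lo) & (bins <= peak_hi)
    # Linearly interpolate the background level across A from the bins on either side,
    # where essentially all counts come from the background.
    bg = np.interp(bins[in_A], [peak_lo - 1, peak_hi + 1],
                   [counts_F[peak_lo - 1], counts_F[peak_hi + 1]])
    lb_background_post = np.clip(bg / np.maximum(counts_F[in_A], 1), 0.0, 1.0)  # lower bound on P(Y=0|X=x), x in A
    rho = np.ones(len(counts_F))
    rho[in_A] = 1.0 - lb_background_post   # Eqn. (18): upper bound on P(Y=1|X=x) inside A
    return rho
```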
Appendix C Detailed Experimental Result in Sec. 6
This section shows four tables corresponding to four experimental setups in Sec. 6.
DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
average |
DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
average |
Dataset | DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mushroom | |||||||||||||
avg | |||||||||||||
Landsat | |||||||||||||
avg | |||||||||||||
Shuttle | |||||||||||||
avg | |||||||||||||
MNIST17 | |||||||||||||
avg | |||||||||||||
Overall | avg |
Dataset | DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mushroom | |||||||||||||
avg | |||||||||||||
Landsat | |||||||||||||
avg | |||||||||||||
Shuttle | |||||||||||||
avg | |||||||||||||
MNIST17 | |||||||||||||
avg | |||||||||||||
Overall | avg |
Appendix D When Irreducibility Holds
In theory, when irreducibility holds, baseline MPE methods should be asymptotically unbiased estimators of the mixture proportion κ*, regrouping may introduce a negative bias, and subsampling should not introduce bias. Here we run some synthetically generated experiments (in a controlled setting) to verify these theoretical claims.
We specify the distributions , and as:
where G is irreducible w.r.t. H. We draw instances from both F and H, and the true mixture proportion κ* is varied over a range of values.
We draw labeled instances and train a one-hidden-layer neural network to predict P(Y = 1 | X = x). The acceptance function used in Subsampling MPE is constructed from this estimate. As for ReMPE, a small fraction of the sample from F is regrouped to the sample from H, as suggested by Yao et al. (2022). The above procedure was repeated multiple times with different random seeds. Results are shown in Table 6, where SuMPE never performs the worst (in some cases, subsampling slightly increases the bias compared to the original version, which is partly due to the acceptance function being estimated) and ReMPE may introduce extremely high bias.
DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|