Mixture Proportion Estimation Beyond Irreducibility
Abstract
The task of mixture proportion estimation (MPE) is to estimate the weight of a component distribution in a mixture, given observations from both the component and the mixture. Previous work on MPE adopts the irreducibility assumption, which ensures identifiability of the mixture proportion. In this paper, we propose a more general sufficient condition that accommodates several settings of interest where irreducibility does not hold. We further present a resampling-based meta-algorithm that takes any existing MPE algorithm designed to work under irreducibility and adapts it to work under our more general condition. Our approach empirically exhibits improved estimation performance relative to baseline methods and to a recently proposed regrouping-based algorithm.
1 Introduction
Mixture proportion estimation (MPE) is the problem of estimating the weight of a component distribution in a mixture. Specifically, let κ* ∈ (0, 1] and let F, G, and H be probability distributions such that F = (1 − κ*)G + κ*H. Given i.i.d. observations
$$X_1, \ldots, X_m \overset{\text{i.i.d.}}{\sim} F, \qquad X_{m+1}, \ldots, X_{m+n} \overset{\text{i.i.d.}}{\sim} H, \tag{1}$$
MPE is the problem of estimating κ*. A typical application: given some labeled positive reviews of a product, estimate the proportion of positive comments about the product among all comments (González et al., 2017). MPE is also an important component in solving several domain adaptation and weakly supervised learning problems, such as learning from positive and unlabeled examples (LPUE) (Elkan & Noto, 2008; Du Plessis et al., 2014; Kiryo et al., 2017), learning with noisy labels (Lawrence & Schölkopf, 2001; Natarajan et al., 2013; Blanchard et al., 2016), multi-instance learning (Zhang & Goldman, 2001), and anomaly detection (Sanderson & Scott, 2014).
If no assumptions are made on the unobserved component G, then κ* is not identifiable. Blanchard et al. (2010) proposed the irreducibility assumption on G so that κ* becomes identifiable. Up to now, almost all MPE algorithms build upon the irreducibility assumption (Blanchard et al., 2010; Scott, 2015; Blanchard et al., 2016; Jain et al., 2016; Ramaswamy et al., 2016; Ivanov, 2020; Bekker & Davis, 2020; Garg et al., 2021) or on stricter conditions such as non-overlapping supports of the component distributions (Elkan & Noto, 2008; Du Plessis & Sugiyama, 2014). However, as we discuss below, irreducibility can be violated in several applications, in which case the above methods produce statistically inconsistent estimates. As far as we know, Yao et al. (2022), discussed in Sec. 5, is the first attempt to move beyond irreducibility.
This work proposes a more general sufficient condition than irreducibility, and offers a practical algorithm for estimating κ* under this condition. We introduce a meta-algorithm that takes as input any MPE method that consistently estimates the mixture proportion under irreducibility, and removes the bias of that method whenever irreducibility does not hold but our more general sufficient condition does. Furthermore, even if our new sufficient condition is not satisfied, our meta-algorithm does not increase the bias of the underlying MPE method. We describe several applications and settings where our framework is relevant, and demonstrate the practical relevance of this framework through extensive experiments. Proofs and additional details can be found in the appendices.
2 Problem Setup and Background
Let F and H be probability measures on a common measurable space, and let F be a mixture of H and an unknown component G,
$$F = (1 - \kappa^*)G + \kappa^* H, \tag{2}$$
where κ* ∈ (0, 1]. With no assumptions on G, κ* is not uniquely determined by F and H. For example, suppose F = (1 − κ*)G + κ*H for some κ* ∈ (0, 1), and take any γ ∈ (0, κ*). Then F = (1 − γ)G′ + γH, where G′ = [(1 − κ*)G + (κ* − γ)H]/(1 − γ), is another valid decomposition of F with a different proportion on H (Blanchard et al., 2010).
2.1 Ground-Truth and Maximal Proportion
To address the lack of identifiability, Blanchard et al. (2010) introduced the so-called irreducibility assumption. We now recall this definition and related concepts. Throughout this work we assume that F, G, and H have densities f, g, and h, defined w.r.t. a common dominating measure μ.
Definition 2.1 (Blanchard et al. (2010)).
For any two probability distributions F and H, define
$$\kappa(F|H) := \max\{\kappa \in [0,1] : F = (1-\kappa)G' + \kappa H \text{ for some probability distribution } G'\},$$
the maximal proportion of H in F.
This quantity equals the infimum of the likelihood ratio:
Proposition 2.2 (Blanchard et al. (2010)).
It holds that
$$\kappa(F|H) = \operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{f(x)}{h(x)}. \tag{3}$$
By substituting Eqn. (2) into Eqn. (3), we get that
$$\kappa(F|H) = \kappa^* + (1-\kappa^*)\,\kappa(G|H). \tag{4}$$
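For completeness, the one-step computation behind Eqn. (4), written out in the densities introduced above:
$$\operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{f(x)}{h(x)} = \operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{(1-\kappa^*)g(x) + \kappa^* h(x)}{h(x)} = \kappa^* + (1-\kappa^*)\operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{g(x)}{h(x)} = \kappa^* + (1-\kappa^*)\,\kappa(G|H).$$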
Since κ(F|H) is identifiable from F and H, the following assumption on G ensures identifiability of κ*.
Definition 2.3 (Blanchard et al. (2010)).
We say that G is irreducible with respect to H if κ(G|H) = 0.
Thus, irreducibility means that there exists no decomposition of the form G = (1 − γ)G′ + γH, where G′ is some probability distribution and 0 < γ ≤ 1. Under irreducibility, κ* is identifiable and, in particular, by Eqn. (4), equals κ(F|H).
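As a quick numerical illustration of Definition 2.3 and Eqn. (4) (a sketch with illustrative parameters of our own choosing, not taken from the paper): if H = N(0, 1) and G = N(0, σ²) with σ > 1, then g/h is minimized at the origin with value 1/σ > 0, so G is not irreducible w.r.t. H and κ(F|H) strictly overshoots κ*.

```python
import numpy as np
from scipy.stats import norm

# Illustrative parameters: H = N(0, 1), G = N(0, 3^2), kappa* = 0.3.
kappa_star, sigma_g = 0.3, 3.0
x = np.linspace(-10, 10, 20001)          # grid covering supp(H), to approximate the essential infimum
h = norm.pdf(x, 0, 1.0)
g = norm.pdf(x, 0, sigma_g)
f = (1 - kappa_star) * g + kappa_star * h

kappa_FH = np.min(f / h)                 # grid approximation of Eqn. (3)
kappa_GH = np.min(g / h)                 # = 1/sigma_g > 0, so irreducibility fails
print(kappa_FH, kappa_star + (1 - kappa_star) * kappa_GH)  # both ~0.533 > kappa* = 0.3, matching Eqn. (4)
```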
2.2 Latent Label Model
We now consider another way of understanding irreducibility in terms of a latent label model. In particular, let X and Y be random variables characterized by the following:
- (a) (X, Y) are jointly distributed, with Y taking values in {0, 1};
- (b) X | Y = 0 ∼ G and X | Y = 1 ∼ H;
- (c) P(Y = 1) = κ*.
It follows from these assumptions that the marginal distribution of X is F:
$$P(X \in \cdot\,) = (1-\kappa^*)\,G(\cdot) + \kappa^*\,H(\cdot) = F(\cdot).$$
We also take the conditional probability of Y given X to be defined via
$$P(Y = 1 \mid X = x) = \frac{\kappa^* h(x)}{f(x)} \quad \text{whenever } f(x) > 0. \tag{5}$$
The latent label model is commonly used in the positive-unlabeled (PU) learning literature (Bekker & Davis, 2020). MPE is also called class prior/proportion estimation (CPE) in PU learning because κ* = P(Y = 1). The label Y may be viewed as indicating which component an observation from F was drawn from. Going forward, we use this latent label model in addition to the original MPE notation.
Proposition 2.4.
Under the latent label model,
$$\kappa(F|H) = \frac{\kappa^*}{\operatorname*{ess\,sup}_{x} P(Y = 1 \mid X = x)},$$
where the essential supremum is taken over the support of F.
2.3 Violation of Irreducibility
Up to now, almost all MPE algorithms assume G to be irreducible w.r.t. H (Blanchard et al., 2010, 2016; Jain et al., 2016; Ivanov, 2020), or impose stricter conditions like the anchor set assumption (Scott, 2015; Ramaswamy et al., 2016) or that G and H have disjoint supports (Elkan & Noto, 2008; Du Plessis & Sugiyama, 2014). These methods return an estimate of κ(F|H) as the estimate of κ*. If irreducibility does not hold and κ* < 1, then κ(F|H) > κ*. Even if these methods are consistent estimators of κ(F|H), they are then asymptotically biased and thus inconsistent estimators of κ*.
A sufficient condition for irreducibility to hold is that the support of H is not totally contained in the support of G. In a classification setting where G and H are the class-conditional distributions, this may be quite reasonable. It essentially assumes that each class has at least some subset of instances (with positive measure) that cannot possibly be confused with the other class. While irreducibility is reasonable in many classification-related tasks, there are also a number of important applications where it does not hold. In this subsection we give three examples of applications where irreducibility is not satisfied.
Ubiquitous Background. In gamma spectroscopy, we may view H as the distribution of the energy of a gamma particle emitted by some source of interest (e.g., Cesium-137), and G as the energy distribution of background radiation. Typically the background emits gamma particles with a wider range of energies than the source of interest does, and therefore its distribution has a wider support: supp(H) ⊂ supp(G), thus violating irreducibility. What's more, G is usually unknown, because it varies according to the surrounding environment (Alamaniotis et al., 2013). The MPE problem is: given observations of the source spectrum H, which may be collected in a laboratory setting, and observations of F in the wild, estimate κ*. This quantity is important for nuclear threat detection and nuclear safeguard applications.
Global Uncertainty. In marketing, let Y denote whether a customer does (Y = 1) or does not (Y = 0) purchase a product (Fei et al., 2013). Let H be the distribution of a feature vector extracted from a customer who buys the product, and G the distribution for those who do not. The MPE problem is: given data from past purchasers of a product (drawn from H) and from a target population (drawn from F), estimate the proportion κ* of H in F. This quantity is called the transaction rate, and is important for estimating the number of products to be sold. Irreducibility is likely to be violated here because, given a finite number of features, uncertainty about customers should remain bounded away from certainty: sup_x P(Y = 1 | X = x) < 1. In other words, there is an ε > 0 such that, for any feature vector of demographic information, the probability of buying the product is always at most 1 − ε.
Underreported Outcomes. In public health, let (X, S, Y) be a jointly distributed triple, where X is a feature vector, S denotes whether a person reports a health condition or not, and Y indicates whether the person truly has the health condition. Here, H = P(X | S = 1) and F = P(X). The MPE problem is: given data from people who reported the condition (drawn from H) and from a target group (drawn from F), determine the prevalence of the condition for the target group. This helps estimate the amount of resources needed to address the health condition. Assume there are no false reports from those who do not have the medical condition: P(S = 1 | Y = 0, X = x) = 0. Some medical conditions are underreported, such as smoking and intimate partner violence (Gorber et al., 2009; Shanmugam & Pierson, 2021). If the underreporting happens globally, meaning P(S = 1 | Y = 1, X = x) is bounded away from 1, then sup_x P(S = 1 | X = x) < 1. This is because, under the no-false-report assumption, P(S = 1 | X = x) = P(S = 1 | Y = 1, X = x) · P(Y = 1 | X = x). In this situation, irreducibility is again violated.
In the above situations, irreducibility is violated and κ(F|H) > κ*, so estimating κ(F|H) alone leads to bias. In the following, we re-examine mixture proportion estimation. In particular, we propose a more general sufficient condition than irreducibility, and introduce an estimation strategy that calls an existing MPE method and reduces or eliminates its asymptotic bias.
3 A General Identifiability Condition
Previous MPE works assume irreducibility. We propose a more general sufficient condition for recovering κ*.
Theorem 3.1 (Identifiability Under Local Supremal Posterior (LSP)).
Let A be any non-empty measurable subset of supp(H), and let q_A := ess sup_{x ∈ A} P(Y = 1 | X = x). Then
$$\kappa^* = q_A \cdot \operatorname*{ess\,inf}_{x \in A} \frac{f(x)}{h(x)}.$$
This implies that under LSP, i.e., when q_A is known for some set A, κ* is identifiable. Two special cases are worthy of comment. First, under irreducibility we may take A = supp(H) and q_A = 1, in which case the above theorem recovers the known result that κ* = κ(F|H). Second, when A is a singleton {x₀}, then q_A = P(Y = 1 | X = x₀) and κ* = P(Y = 1 | X = x₀) · f(x₀)/h(x₀), which can also be derived directly from the definition of conditional probability.
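In the notation reconstructed above, the identity in Theorem 3.1 follows from Eqn. (5) in one line (a sketch; the full argument is in Appendix A.2):
$$q_A = \operatorname*{ess\,sup}_{x \in A} P(Y = 1 \mid X = x) = \operatorname*{ess\,sup}_{x \in A} \frac{\kappa^* h(x)}{f(x)} = \frac{\kappa^*}{\operatorname*{ess\,inf}_{x \in A} f(x)/h(x)},$$
and rearranging gives the stated expression for κ*.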
The above theorem gives a general sufficient condition for recovering κ*, but estimating ess inf_{x ∈ A} f(x)/h(x) is non-trivial: when A = supp(H), it can be estimated using existing MPE methods (Blanchard et al., 2016). When A is a proper subset, however, a new approach is needed. We now present a variation of Theorem 3.1 that lends itself to a practical estimation strategy, without having to devise a completely new method of estimating this restricted infimum.
Theorem 3.2 (Identifiability Under Tight Posterior Upper Bound).
Consider any non-empty measurable set A ⊆ supp(H). Let ρ be any measurable function satisfying
$$P(Y = 1 \mid X = x) \le \rho(x) \le 1 \ \text{ for } \mu\text{-a.e. } x, \qquad \operatorname*{ess\,inf}_{x \in A} \frac{\rho(x)}{P(Y = 1 \mid X = x)} = 1. \tag{6}$$
Define a new distribution F̃ in terms of its density
$$\tilde f(x) = \frac{\rho(x)\, f(x)}{c}, \qquad c := \int \rho(x) f(x)\, d\mu(x). \tag{7}$$
Then
$$\kappa^* = c \cdot \kappa(\tilde F \mid H).$$
The theorem can be re-stated as follows: κ* is identifiable given an upper bound ρ of the posterior probability P(Y = 1 | X = x) that is tight on some set A ⊆ supp(H). One possible choice, requiring only knowledge of A and q_A, is simply
ρ(x) = q_A for x ∈ A, and ρ(x) = 1 otherwise.
If the conditional probability P(Y = 1 | X = x) is known for all x ∈ A, then
ρ(x) = P(Y = 1 | X = x) for x ∈ A, and ρ(x) = 1 otherwise
may be chosen.
Having a ρ satisfying Eqn. (6) ensures identifiability of κ*. Relaxing this requirement slightly still guarantees that the bias will not increase.
Corollary 3.3.
Let ρ be any measurable function with
$$P(Y = 1 \mid X = x) \le \rho(x) \le 1 \ \text{ for } \mu\text{-a.e. } x. \tag{8}$$
Define a new distribution F̃ in terms of its density according to Eqn. (7). Then
$$\kappa^* \le c \cdot \kappa(\tilde F \mid H) \le \kappa(F \mid H).$$
This shows that even if we have a non-tight upper bound on the posterior probability, the quantity c · κ(F̃|H) is still bounded above by κ(F|H). Therefore, a smaller asymptotic bias may be achieved by estimating c · κ(F̃|H) instead of κ(F|H).
The intuition underlying the above results is that the new distribution F̃ is generated by throwing away some probability mass from F, and can therefore be viewed as a mixture of H and a new component G̃, where G̃ now tends to be irreducible w.r.t. H. The proportion of H in F̃ relates to the original proportion κ* by the scaling constant c. This interpretation is supported mathematically in Appendix B.1.
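Using the notation reconstructed above, the scaling identity can also be verified directly (a sketch; the formal arguments are in Appendices A.3 and B.1). By Eqn. (5), f(x)/h(x) = κ*/P(Y = 1 | X = x), so
$$\kappa(\tilde F \mid H) = \operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{\rho(x) f(x)}{c\, h(x)} = \frac{\kappa^*}{c} \operatorname*{ess\,inf}_{x \in \operatorname{supp}(H)} \frac{\rho(x)}{P(Y = 1 \mid X = x)}.$$
The last infimum equals 1 under Eqn. (6), giving κ* = c · κ(F̃|H); under the weaker Eqn. (8) it is at least 1, giving κ* ≤ c · κ(F̃|H).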
4 Subsampling MPE (SuMPE)
Theorem 3.2 directly motivates a practical algorithm. We obtain a sample from the new distribution F̃, starting from the sample from F, by rejection sampling (MacKay, 2003), a Monte Carlo method that generates a sample following a new distribution based on a sample from another distribution, in terms of their densities. An instance is kept with a certain acceptance probability, and rejected otherwise. Appendix B.2 shows the detailed procedure. In our scenario, the proposal distribution is F, the target distribution is F̃, and the acceptance probability for an instance x is simply ρ(x).
4.1 Method
Our Subsampling MPE algorithm, SuMPE (Algorithm 1), follows directly from Theorem 3.2. It first obtains, in line 3, a data sample following the distribution F̃ using rejection sampling, and, in line 4, estimates the normalizing constant c. Then, in line 5, it computes an estimate of κ(F̃|H) using any existing MPE method that consistently estimates the mixture proportion under irreducibility. The final estimate of κ* is returned as the product of the estimates of c and κ(F̃|H).
Rejection sampling in high-dimensional settings may be inefficient due to a potentially low acceptance rate (MacKay, 2003). However, this concern is mitigated in our setting because the acceptance probability can be taken to be 1 except on the set A, which is potentially a small set.
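The following is a minimal sketch of the meta-algorithm in Python (our own illustration, not the reference implementation; `base_mpe`, `rho`, and the variable names are placeholders). It thins the F-sample with acceptance probability ρ(x), estimates c by the empirical acceptance rate, runs an off-the-shelf MPE estimator on the thinned sample, and rescales.

```python
import numpy as np

def sumpe(X_mix, X_comp, rho, base_mpe, rng=None):
    """Subsampling MPE (sketch of Algorithm 1).

    X_mix    : array of draws from the mixture F
    X_comp   : array of draws from the known component H
    rho      : callable returning the acceptance probability (posterior upper bound) for each x
    base_mpe : callable (X_mix_sub, X_comp) -> estimate of kappa(F_tilde | H);
               any MPE method that is consistent under irreducibility
    """
    rng = np.random.default_rng() if rng is None else rng
    accept_prob = np.clip(rho(X_mix), 0.0, 1.0)
    keep = rng.random(len(X_mix)) < accept_prob       # rejection (thinning) step: accepted points ~ F_tilde
    c_hat = keep.mean()                               # E_F[rho(X)] = c, estimated by the acceptance rate
    kappa_tilde_hat = base_mpe(X_mix[keep], X_comp)   # estimate of kappa(F_tilde | H)
    return c_hat * kappa_tilde_hat                    # estimate of kappa* (Theorem 3.2)
```

With ρ ≡ 1, every point is accepted and c = 1, so the procedure reduces to running the base MPE method directly.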
One advantage of building our method around existing MPE methods is that we may adapt known theoretical results to our setting. To illustrate this, we give a rate of convergence result for SuMPE (Theorem 4.1, proved in Appendix A.5).
4.2 Practical Scenarios
Our new sufficient condition assumes knowledge of some set A and of the associated posterior upper bound. Practically speaking, however, our algorithm only requires a ρ satisfying Eqn. (6) for some A; it does not require explicit knowledge of A itself. Additionally, even if ρ does not satisfy Eqn. (6), as long as it satisfies Eqn. (8) (which is easier to achieve), SuMPE performs no worse than directly applying off-the-shelf MPE methods.
There are settings where a generic construction of ρ is possible. For example, suppose the user has access to fully labeled data (where it is known which of G or H each instance came from), but only on a subset of the domain. This may come from an annotator who is only an expert on a subset of instances. Such data should be sufficient to obtain a non-trivial upper bound on the posterior class probability P(Y = 1 | X = x), which in turn leads to a valid ρ.
More typically, however, it may be necessary to determine ρ on a case-by-case basis. This section continues the discussion of the three applications introduced in Sec. 2.3. Each of these three settings leverages domain-specific knowledge in a different way, and we believe this leads to a better ρ than a one-size-fits-all construction.
4.2.1 Unfolding
Unfolding refers to the process of recovering one or more true distributions from contaminated ones (Cowan, 1998). In gamma spectrum unfolding (Li et al., 2019), a gamma-ray detector measures the energies of incoming gamma rays. The gamma rays were emitted either by a source of interest or by the background. The measurement is represented as a histogram whose bins correspond to a quantization of energy. In many settings, the histogram of measurements from the source of interest is also known. In this case, unfolding amounts to the estimation of the unknown background histogram. Toward this goal, it suffices to estimate the proportion κ* of recorded particles emanating from the source of interest, since the background can then be recovered as G = (F − κ*H)/(1 − κ*). This application corresponds to the “ubiquitous background” setting described in Sec. 2.3, where irreducibility may not hold since the source-of-interest energies can be a subset of the background energies.
Using existing techniques from the nuclear detection literature (Knoll, 2010; Alamaniotis et al., 2013), we can obtain a lower bound on the background posterior P(Y = 0 | X = x) on a certain set A (see Appendix B.3 for details). This leads to the acceptance function
$$\rho(x) = \begin{cases} 1 - \text{(lower bound on } P(Y = 0 \mid X = x)\text{)} & x \in A \\ 1 & \text{otherwise,} \end{cases} \tag{9}$$
which is an upper bound on P(Y = 1 | X = x), satisfying the condition in Corollary 3.3.
4.2.2 CPE under domain adaptation
In the problem of domain adaptation, the learner is given labeled examples from a source distribution, and the task is to do inference on a potentially different target distribution. Previous work on domain adaptation mainly focuses on classification and typically makes assumptions about which of the distributions P(X), P(Y), P(Y|X), and P(X|Y) vary between the source and target. This leads to situations such as covariate shift (where P(X) changes) (Heckman, 1979), posterior drift (where P(Y|X) changes) (Scott, 2019), prior/target shift (where P(Y) changes) (Storkey et al., 2009), and conditional shift (where P(X|Y) changes) (Zhang et al., 2013). It is also quite commonly assumed that the support of the source distribution contains the support of the target (Heckman, 1979; Bickel et al., 2009; Gretton et al., 2009; Storkey et al., 2009; Zhang et al., 2013; Scott, 2019).
We study class proportion estimation (CPE) under domain adaptation. Prior work on this topic has considered distributional assumptions like those described above (Saerens et al., 2002; Sanderson & Scott, 2014; González et al., 2017). In this work, we consider the setting where, in addition to labeled examples from the source, the learner has access to labeled positive and unlabeled data from the target. We propose a model that includes covariate shift and posterior drift as special cases. We use subscripts S and T to denote the source and target distributions, respectively. In MPE notation, H = P_T(X | Y = 1), F = P_T(X), and κ* = P_T(Y = 1).
Definition 4.2 (CSPL).
We say that covariate shift with posterior lift (CSPL) occurs whenever
$$P_T(Y = 1 \mid X = x) \le P_S(Y = 1 \mid X = x) \ \text{ for all } x,$$
and equality holds for some x in the support of the target positive distribution H.
Covariate shift is a special case of CSPL in which equality always holds. One motivation for posterior lift is to model labels produced by an annotator who is biased toward one class. It is a type of posterior drift model, wherein the posterior changes from source to target (Scott, 2019; Cai & Wei, 2021; Maity et al., 2023). Also notice that CSPL requires neither that the support of the source distribution contain that of the target, nor irreducibility.
CSPL is motivated by the marketing application mentioned in Sec. 2.3. In marketing, companies often have access to labeled data from a source distribution, such as survey results where customers express their interest in a product. Additionally, they also have access to labeled positive and unlabeled data from the target distribution, which corresponds to actual purchasing behavior. In this scenario, the CSPL assumption is often met, as it is more likely for customers to express interest than to actually make a purchase: P_T(Y = 1 | X = x) ≤ P_S(Y = 1 | X = x).
Although irreducibility is violated in the marketing application due to the “global uncertainty” about a target customer buying the product (see Sec. 2.3), CSPL ensures the identifiability of κ* because we can choose the set A and the acceptance function as
$$\rho(x) = P_S(Y = 1 \mid X = x), \tag{10}$$
which satisfies the identifiability criteria in Theorem 3.2. (To see this, take A to be a set on which equality holds in Definition 4.2 and take ρ as in Eqn. (10); these play the roles of A and ρ in Theorem 3.2.) By using the labeled source data, an estimate of P_S(Y = 1 | X = x) can be obtained and used as the acceptance function in Algorithm 1 to do CPE.
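A sketch of this construction in code (our own illustration; the classifier choice, the `sumpe` helper from Sec. 4.1, and `some_base_mpe` are placeholders, and the experiments in Sec. 6 use a small neural network rather than logistic regression):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def make_cspl_acceptance(X_src, y_src):
    """Fit a probabilistic classifier on labeled source data to estimate P_S(Y=1 | X=x),
    which under CSPL upper-bounds the target posterior and can serve as the acceptance function."""
    clf = LogisticRegression(max_iter=1000).fit(X_src, y_src)
    return lambda X: clf.predict_proba(X)[:, 1]

# Usage (X_mix ~ target F, X_comp ~ target positives H):
# rho = make_cspl_acceptance(X_src, y_src)
# kappa_hat = sumpe(X_mix, X_comp, rho, base_mpe=some_base_mpe)
```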
4.2.3 Selected/Reported at Random
In public health, (X, S, Y) are jointly distributed, where X is the feature vector, S denotes whether a person reports a medical condition or not, and Y indicates whether the person truly has the medical condition. The goal is to estimate the proportion of people in the target group that report the medical condition. This setting was described in Sec. 2.3 as “underreported outcomes”, where it was argued that irreducibility may not hold, in which case estimating κ(F|H) overestimates the true value of κ*. Our SuMPE framework provides a way to eliminate the bias.
The behavior of underreporting can be captured using the selection bias model (Kato et al., 2018; Bekker et al., 2019; Gong et al., 2021). Denote the probability of reporting as e(x) := P(S = 1 | Y = 1, X = x). Assume there is no false report: P(S = 1 | Y = 0, X = x) = 0. We use the notation f_{Y=1} to indicate the conditional density of X given the event Y = 1. Then the density of H satisfies h(x) ∝ e(x) f_{Y=1}(x). Under this model, the density of the marginal distribution F can be decomposed as
f(x) = α f_{Y=1}(x) + (1 − α) f_{Y=0}(x),
where α = P(Y = 1) is the proportion of people having the medical condition. The mixture proportion to be estimated is κ*.
We assume access to i.i.d. samples from H, representing public survey data in which people report the presence of the medical condition, and from F, representing the target group. Further assume that a set A with P(Y = 1 | X = x) = 1 for all x ∈ A is known. This is a subset of patients who are guaranteed to have the condition, which could be obtained from historical patient data from hospitals. Then the mixture proportion can be recovered from Algorithm 1, where the acceptance function
(11)
satisfies the condition in Theorem 3.2. This is because, under the no-false-report assumption, P(S = 1 | X = x) = P(S = 1 | Y = 1, X = x) · P(Y = 1 | X = x). (The set A and the acceptance function above play the roles of A and ρ in Theorem 3.2.) In practice, the required conditional probability P(S = 1 | X = x) can be estimated from labeled examples (xᵢ, sᵢ).
5 Limitation of Previous Work
Previous research by Yao et al. (2022) introduced the Regrouping MPE (ReMPE) method (called Regrouping CPE, or ReCPE, in that work), which is built on top of any existing MPE estimator, just like our meta-algorithm. They claimed that ReMPE works as well as the base MPE method when irreducibility holds, while improving the performance when it does not. In this section we offer some comments on ReMPE.
Regrouping MPE in theory.
Consider any G such that Eqn. (2) holds, and write G as an arbitrary mixture of two distributions, G = (1 − β)G′ + βG″ with β ∈ [0, 1]. Then F can be re-written as
$$F = (1 - \kappa')\,G' + \kappa'\, H', \tag{12}$$
where κ′ = κ* + (1 − κ*)β and H′ = [κ* H + (1 − κ*)β G″]/κ′. Yao et al. (2022) assume there exists a set on which the infimum in Eqns. (3) and (4) is achieved. They propose to specify G″ as the truncated distribution of G on that set, with β equal to the G-probability of the set. This specific choice causes the resulting distribution G′ to be irreducible w.r.t. H′ and the bias introduced by regrouping to be minimal. Denote the above procedure as ReMPE-1 (Algorithm 2). Theorem 2 in Yao et al. (2022) provides a theoretical justification for ReMPE-1, which we restate here.
Theorem 5.1 (Yao et al. (2022)).
Let κ′ be the mixture proportion obtained from ReMPE-1. 1) If G is irreducible w.r.t. H, then κ′ = κ*. 2) If G is not irreducible w.r.t. H, then κ* ≤ κ′ ≤ κ(F|H).
While this theorem is valid, we note that in the case where G is irreducible w.r.t. H, the set being regrouped lies outside the support of G, and therefore it is not appropriate to describe the procedure as “regrouping G.” In fact, performing regrouping with β > 0 always introduces a positive bias, because κ′ = κ* + (1 − κ*)β > κ* whenever κ* < 1. This indicates that any kind of actual regrouping will have a positive bias under irreducibility.
Regrouping MPE in practice.
Yao et al. (2022)'s practical implementation of regrouping deviates from the theoretical proposal. Here, we state and analyze an idealized version of their practical algorithm, referred to as ReMPE-2 (Algorithm 3). ReMPE-2 does not rely on knowledge of the decomposition outlined in Eqn. (12). Instead, the regrouped set is chosen based solely on F and H, and the new component distribution is obtained by regrouping some probability mass from F rather than from G. (Yao et al. (2022)'s real implementation differs a bit from ReMPE-2 in that, instead of choosing a set explicitly, they select a small fraction of the examples drawn from F with the smallest estimated scores.)
ReMPE-2 is fundamentally different from ReMPE-1 in that it constructs the regrouped component differently. Specifically, when the irreducibility assumption holds, ReMPE-1 regroups nothing (because the regrouped set carries no G-probability), but ReMPE-2 still regroups a fixed proportion of mass from F to H. Therefore, Theorem 5.1 does not apply to it. Yao et al. (2022) did not analyze ReMPE-2, but the next result shows that it has a negative bias under irreducibility.
Proposition 5.2.
For the mixture proportion obtained from ReMPE-2:
Thus, if irreducibility holds, then ReMPE-2 returns an estimate strictly smaller than κ*, which is undesirable. However, when irreducibility does not hold, ReMPE-2 may lead to a smaller asymptotic bias than estimating κ(F|H), which could explain why the authors observe empirical improvements in their results. Our theoretical analysis of ReMPE-2 is supported experimentally in Sec. 6 and Appendix D.
To summarize, Yao et al. (2022) proposed a regrouping approach that was the first attempt to tackle the problem of MPE beyond irreducibility, and it motivated our work. ReMPE-1 recovers κ* when irreducibility holds (although in this case it does no regrouping), and decreases the bias when irreducibility does not hold. The more practical algorithm ReMPE-2 may decrease the bias when irreducibility does not hold, but it has a negative bias when irreducibility does hold. Like ReMPE-1, SuMPE draws on some additional information beyond F and H. Both meta-algorithms do not increase the bias, and both recover κ* when irreducibility holds. Unlike ReMPE-1, however, SuMPE is able to recover κ* under a more general condition. Furthermore, our practical implementations of subsampling are based directly on Theorem 3.2, unlike ReMPE-2, which does not have the desirable theoretical properties of ReMPE-1. Finally, as we argue in the next section, SuMPE offers significant empirical performance gains.
One limitation of our SuMPE framework is that some knowledge of the acceptance function ρ is needed, and ρ may need to be developed specifically for different applications.
6 Experiments
We ran our algorithm on nuclear, synthetic, and benchmark datasets taken from the UCI machine learning repository and MNIST, corresponding to the three scenarios described in Sec. 4.2. We take four MPE algorithms: DPL (Ivanov, 2020), EN (Elkan & Noto, 2008), KM (Ramaswamy et al., 2016) and TIcE (Bekker & Davis, 2018). We compare the original version of these methods together with their regrouping (Re-) and subsampling (Su-) versions. All experiments in this section consider settings where irreducibility is violated. The summarized results are shown in Table 1, and the detailed results are in Appendix C. Overall, the subsampling version of each MPE method outperforms the original and regrouping versions. Additional experiments where irreducibility holds are offered in Appendix D, where we find that ReMPE harms the estimation performance while SuMPE does not. The implementation is available at https://github.com/allan-z/SuMPE.
Setup | Dataset | DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Unfolding | Gamma Ray | ||||||||||||
Domain Adaptation | Synthetic | ||||||||||||
Mushroom | |||||||||||||
Landsat | |||||||||||||
Shuttle | |||||||||||||
MNIST17 | |||||||||||||
Selected/ Reported at Random | Mushroom | ||||||||||||
Landsat | |||||||||||||
Shuttle | |||||||||||||
MNIST17 |
6.1 Unfolding: Gamma Ray Spectra Data
The gamma-ray spectra data are simulated with the Monte Carlo N-Particle (MCNP) radiation transport code (Werner et al., 2018). H is the distribution of Cesium-137. G is the background distribution, consisting of terrestrial background and Cobalt-60. The goal is to estimate the proportion κ* of Cesium-137 in the measured spectrum F.
Sample sizes were chosen to reflect a reasonable number of counts for many nuclear detection applications, and the true mixture proportion κ* is varied over a range of values. The random variable X represents quantized energy, which is one-dimensional and discrete-valued. Therefore, we directly use the histogram as the estimate of each distribution. We choose the acceptance function according to the methodology developed in Sec. 4.2.1 and Appendix B.3.
Three of the four baseline MPE methods (DPL, EN, KM) did not work well out-of-the-box in this setting. We therefore re-implemented these methods to explicitly leverage the histogram representation of the probability distributions. This also greatly sped up the KM approach. The results are summarized in Table 2.
6.2 Domain Adaptation: Synthetic Data
Following the setup in Sec. 4.2.2, we specify the target conditional and marginal distributions H, G, and F as Gaussian mixtures, where G is not irreducible w.r.t. H because it contains a Gaussian component with a larger variance than that of H. We draw instances from both F and H, and the true mixture proportion κ* is varied over a range of values.
In addition, we draw labeled instances from a source distribution that is modified relative to the target, and we truncate the source distribution (and therefore the source sample) to a sub-region of the domain. The resulting source and target distributions satisfy the CSPL assumption. A one-hidden-layer neural network was trained to predict P_S(Y = 1 | X = x), and the acceptance function was chosen according to Eqn. (10). This procedure was repeated multiple times with different random seeds. Detailed results are shown in Table 3.
6.3 Domain Adaptation: Benchmark Data
In many machine learning datasets for classification (e.g., those in UCI), irreducibility is satisfied. (The datasets Mushroom, Landsat, Shuttle and MNIST were chosen for our study because previous empirical research (Ivanov, 2020) showed that the baseline MPE methods perform well on these datasets when the irreducibility assumption holds. Our paper focuses on how to eliminate the estimation bias that arises from the violation of irreducibility; we therefore chose datasets where the baseline methods perform well, in order to clearly observe and measure the bias introduced by MPE methods when irreducibility is not met.) Here we manually create datasets that violate irreducibility by uniformly sampling out a fraction of the original positive data as the target positive data, with all the remaining data treated as target negative. The target conditional and marginal distributions H, G, and F are specified accordingly.
We draw instances from F and from H (the exact numbers vary by dataset and are based on the number of examples available; see the code provided). The true mixture proportion κ* is varied over a range of values, which is determined by the total number of positive and negative data originally available in the dataset and therefore varies case by case.
In addition, we obtain labeled instances following the source distribution, by drawing data from the target positive and data from the target negative distribution. This causes a prior shift that simulates CSPL.
A one-hidden-layer neural network was trained to predict P_S(Y = 1 | X = x). For real-world high-dimensional data, it is hard to know the support of H. Therefore, we choose A to consist of examples with a high estimated posterior, because an example with a high posterior is more likely to lie in the support of H. The acceptance function was determined according to Eqn. (10). The above procedure was repeated multiple times with different random seeds. Table 4 summarizes the results.
6.4 Selected/Reported at Random: Benchmark Data
Recalling the setting of Sec. 4.2.3, there is a jointly distributed triple (X, S, Y), where S indicates whether a condition is reported and Y indicates whether the condition is actually present. For the experiments below, the data from F, H, and G are generated in the same way as the target distribution of the previous subsection. Instead of observing labeled source data, however, in this subsection we observe instances from F together with their labels S, matching the setup of Sec. 4.2.3.
A one-hidden-layer neural network was trained to predict P(S = 1 | X = x). We chose the set A and the acceptance function according to Eqn. (11). (A needs to be a subset of the support of H; at the population level the choice described in Sec. 4.2.3 satisfies this, and in practice we replace the population-level quantities with their estimates.) The above procedure was repeated multiple times with different random seeds. The results are shown in Table 5.
7 Conclusion
This work introduces a more general identifiability condition than irreducibility for mixture proportion estimation. We also propose a subsampling-based framework that achieves bias reduction/elimination for baseline MPE algorithms. Theoretically, our work expands the scope of settings where MPE can be solved. Practically, we illustrate three scenarios where irreducibility fails, and our meta-algorithm successfully improves upon baseline MPE methods.
Acknowledgements
This work was supported in part by the National Science Foundation under award 2008074 and the Department of Defense, Defense Threat Reduction Agency under award HDTRA1-20-2-0002.
References
- Alamaniotis et al. (2013) Alamaniotis, M., Mattingly, J., and Tsoukalas, L. H. Kernel-based machine learning for background estimation of NaI low-count gamma-ray spectra. IEEE Transactions on Nuclear Science, 60(3):2209–2221, 2013.
- Bekker & Davis (2018) Bekker, J. and Davis, J. Estimating the class prior in positive and unlabeled data through decision tree induction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Bekker & Davis (2020) Bekker, J. and Davis, J. Learning from positive and unlabeled data: A survey. Machine Learning, 109(4):719–760, 2020.
- Bekker et al. (2019) Bekker, J., Robberechts, P., and Davis, J. Beyond the selected completely at random assumption for learning from positive and unlabeled data. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 71–85. Springer, 2019.
- Bickel et al. (2009) Bickel, S., Brückner, M., and Scheffer, T. Discriminative learning under covariate shift. Journal of Machine Learning Research, 10(9), 2009.
- Blanchard et al. (2010) Blanchard, G., Lee, G., and Scott, C. Semi-supervised novelty detection. The Journal of Machine Learning Research, 11:2973–3009, 2010.
- Blanchard et al. (2016) Blanchard, G., Flaska, M., Handy, G., Pozzi, S., Scott, C., et al. Classification with asymmetric label noise: Consistency and maximal denoising. Electronic Journal of Statistics, 10(2):2780–2824, 2016.
- Cai & Wei (2021) Cai, T. T. and Wei, H. Transfer learning for nonparametric classification: Minimax rate and adaptive classifier. The Annals of Statistics, 49(1):100–128, 2021.
- Cowan (1998) Cowan, G. Statistical data analysis. Oxford university press, 1998.
- Du Plessis & Sugiyama (2014) Du Plessis, M. C. and Sugiyama, M. Class prior estimation from positive and unlabeled data. IEICE TRANSACTIONS on Information and Systems, 97(5):1358–1362, 2014.
- Du Plessis et al. (2014) Du Plessis, M. C., Niu, G., and Sugiyama, M. Analysis of learning from positive and unlabeled data. Advances in neural information processing systems, 27, 2014.
- Elkan & Noto (2008) Elkan, C. and Noto, K. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 213–220, 2008.
- Fei et al. (2013) Fei, H., Kim, Y., Sahu, S., Naphade, M., Mamidipalli, S. K., and Hutchinson, J. Heat pump detection from coarse grained smart meter data with positive and unlabeled learning. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1330–1338, 2013.
- Garg et al. (2021) Garg, S., Wu, Y., Smola, A. J., Balakrishnan, S., and Lipton, Z. Mixture proportion estimation and PU learning: A modern approach. Advances in Neural Information Processing Systems, 34:8532–8544, 2021.
- Gong et al. (2021) Gong, C., Wang, Q., Liu, T., Han, B., You, J., Yang, J., and Tao, D. Instance-dependent positive and unlabeled learning with labeling bias estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8):4163–4177, 2021.
- González et al. (2017) González, P., Castaño, A., Chawla, N. V., and Coz, J. J. D. A review on quantification learning. ACM Computing Surveys (CSUR), 50(5):1–40, 2017.
- Gorber et al. (2009) Gorber, S. C., Schofield-Hurwitz, S., Hardt, J., Levasseur, G., and Tremblay, M. The accuracy of self-reported smoking: a systematic review of the relationship between self-reported and cotinine-assessed smoking status. Nicotine & tobacco research, 11(1):12–24, 2009.
- Gretton et al. (2009) Gretton, A., Smola, A., Huang, J., Schmittfull, M., Borgwardt, K., and Schölkopf, B. Covariate shift by kernel mean matching. Dataset shift in machine learning, 3(4):5, 2009.
- Heckman (1979) Heckman, J. J. Sample selection bias as a specification error. Econometrica: Journal of the econometric society, pp. 153–161, 1979.
- Ivanov (2020) Ivanov, D. DEDPUL: Difference-of-estimated-densities-based positive-unlabeled learning. In 2020 19th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 782–790. IEEE, 2020.
- Jain et al. (2016) Jain, S., White, M., Trosset, M. W., and Radivojac, P. Nonparametric semi-supervised learning of class proportions. arXiv preprint arXiv:1601.01944, 2016.
- Kato et al. (2018) Kato, M., Teshima, T., and Honda, J. Learning from positive and unlabeled data with a selection bias. In International conference on learning representations, 2018.
- Kiryo et al. (2017) Kiryo, R., Niu, G., Du Plessis, M. C., and Sugiyama, M. Positive-unlabeled learning with non-negative risk estimator. Advances in neural information processing systems, 30, 2017.
- Knoll (2010) Knoll, G. F. Radiation detection and measurement. John Wiley & Sons, 2010.
- Lawrence & Schölkopf (2001) Lawrence, N. and Schölkopf, B. Estimating a kernel Fisher discriminant in the presence of label noise. In 18th International Conference on Machine Learning (ICML 2001), pp. 306–306. Morgan Kaufmann, 2001.
- Li et al. (2019) Li, F., Gu, Z., Ge, L., Li, H., Tang, X., Lang, X., and Hu, B. Review of recent gamma spectrum unfolding algorithms and their application. Results in Physics, 13:102211, 2019.
- MacKay (2003) MacKay, D. J. Information theory, inference and learning algorithms. Cambridge university press, 2003.
- Maity et al. (2023) Maity, S., Dutta, D., Terhorst, J., Sun, Y., and Banerjee, M. A linear adjustment based approach to posterior drift in transfer learning. Biometrika, 2023. URL https://doi.org/10.1093/biomet/asad029.
- Natarajan et al. (2013) Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. Advances in neural information processing systems, 26, 2013.
- Ramaswamy et al. (2016) Ramaswamy, H., Scott, C., and Tewari, A. Mixture proportion estimation via kernel embeddings of distributions. In International conference on machine learning, pp. 2052–2060. PMLR, 2016.
- Saerens et al. (2002) Saerens, M., Latinne, P., and Decaestecker, C. Adjusting the outputs of a classifier to new a priori probabilities: a simple procedure. Neural computation, 14(1):21–41, 2002.
- Sanderson & Scott (2014) Sanderson, T. and Scott, C. Class proportion estimation with application to multiclass anomaly rejection. In Artificial Intelligence and Statistics, pp. 850–858. PMLR, 2014.
- Scott (2015) Scott, C. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In Artificial Intelligence and Statistics, pp. 838–846. PMLR, 2015.
- Scott (2019) Scott, C. A generalized Neyman-Pearson criterion for optimal domain adaptation. In Algorithmic Learning Theory, pp. 738–761. PMLR, 2019.
- Shanmugam & Pierson (2021) Shanmugam, D. and Pierson, E. Quantifying inequality in underreported medical conditions. arXiv preprint arXiv:2110.04133, 2021.
- Storkey et al. (2009) Storkey, A. et al. When training and test sets are different: characterizing learning transfer. Dataset shift in machine learning, 30:3–28, 2009.
- Werner et al. (2018) Werner, C. J., Bull, J., Solomon, C., Brown, F., McKinney, G., Rising, M., Dixon, D., Martz, R., Hughes, H., Cox, L., et al. MCNP 6.2 release notes. Los Alamos National Laboratory, 2018.
- Yao et al. (2022) Yao, Y., Liu, T., Han, B., Gong, M., Niu, G., Sugiyama, M., and Tao, D. Rethinking class-prior estimation for positive-unlabeled learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=aYAA-XHKyk.
- Zhang et al. (2013) Zhang, K., Schölkopf, B., Muandet, K., and Wang, Z. Domain adaptation under target and conditional shift. In International conference on machine learning, pp. 819–827. PMLR, 2013.
- Zhang & Goldman (2001) Zhang, Q. and Goldman, S. EM-DD: An improved multiple-instance learning technique. Advances in neural information processing systems, 14, 2001.
Appendix A Proofs
A.1 Proof of Proposition 2.4
Proposition.
Under the latent label model,
$$\kappa(F|H) = \frac{\kappa^*}{\operatorname*{ess\,sup}_{x} P(Y = 1 \mid X = x)},$$
where the essential supremum is taken over the support of F.
Proof.
First consider the trivial case when , then from the fact that , must also be . According to Eqn. (5), , therefore the equality holds.
Then for , we consider the cases and separately. For the first case, consider two subcases: and . In the first subcase, we have that , and therefore by the definition of conditional probability,
Taking the essential supremum over all with ,
where the second equality follows from Proposition 3. In the second subcase, is zero. Therefore,
When , for all , and the equality still holds. ∎
A.2 Proof of Theorem 3.1
Theorem (Identifiability Under Local Supremal Posterior (LSP)).
Let A be any non-empty measurable subset of supp(H), and let q_A := ess sup_{x ∈ A} P(Y = 1 | X = x). Then
$$\kappa^* = q_A \cdot \operatorname*{ess\,inf}_{x \in A} \frac{f(x)}{h(x)}.$$
Proof.
Consider the case of and separately.
If , then . Recall from Eqn. (5),
Taking the essential supremum over and recall the definition of ,
Since , and are both positive for . Rearrange the denominator
take the denominator to the other side, we get
where the second equality follows from the identity that .
If , then , the above equality still holds. ∎
A.3 Proof of Theorem 3.2
Theorem (Identifiability Under Tight Posterior Upper Bound).
Consider any non-empty measurable set A ⊆ supp(H). Let ρ be any measurable function satisfying
$$P(Y = 1 \mid X = x) \le \rho(x) \le 1 \ \text{ for } \mu\text{-a.e. } x, \qquad \operatorname*{ess\,inf}_{x \in A} \frac{\rho(x)}{P(Y = 1 \mid X = x)} = 1. \tag{13}$$
Define a new distribution F̃ in terms of its density
$$\tilde f(x) = \frac{\rho(x)\, f(x)}{c}, \qquad c := \int \rho(x) f(x)\, d\mu(x). \tag{14}$$
Then
$$\kappa^* = c \cdot \kappa(\tilde F \mid H).$$
A.4 Proof of Corollary 3.3
Corollary.
Let ρ be any measurable function with P(Y = 1 | X = x) ≤ ρ(x) ≤ 1 for μ-a.e. x. Define a new distribution F̃ in terms of its density f̃, obtained by Eqn. (7). Then
$$\kappa^* \le c \cdot \kappa(\tilde F \mid H) \le \kappa(F \mid H).$$
Proof.
A.5 Proof of Theorem 4.1
We now establish a rate of convergence result for the estimator from Algorithm 1.
Theorem.
Proof.
Recall the setup: we originally have an i.i.d. sample (the notation here differs slightly from Eqn. (1) in that the indexing is changed, which allows for more concise notation in the following derivation)
After rejection sampling, we obtain i.i.d. sample (where and ), from which we can estimate . Under the assumption that exists, estimator by Scott (2015) has rate of convergence
(15) |
Now is a random variable here, and we want to establish a rate of convergence result involving . This can be done by applying a concentration inequality for .
Theorem.
(Hoeffding's Inequality) Let Z₁, …, Zₙ be independent random variables taking values in [0, 1], and denote Z̄ = (1/n) ∑ᵢ Zᵢ. Then for all t > 0,
$$P\big(\,|\bar Z - \mathbb{E}[\bar Z]| \ge t\,\big) \le 2\exp(-2nt^2).$$
Setting the right-hand side to δ, the theorem can be restated as: with probability at least 1 − δ,
$$|\bar Z - \mathbb{E}[\bar Z]| \le \sqrt{\frac{\log(2/\delta)}{2n}}.$$
Take , where , denotes the -th independent draw from 888 is used in rejection sampling (Algorihm 4). and denote the -th draw from . Then
Plug into Hoeffding’s Inequality and setting , we have
(16) |
which allows us to bound by a constant times .
Now we can establish a rate of convergence result of w.r.t.
As for the estimator of c, the rate of convergence can also be shown via Hoeffding's Inequality.
Plugging into Hoeffding's Inequality and letting the confidence level be 1 − δ, we have
(17) |
Note that, by the triangle inequality, the error of the final estimate can be decomposed into the error in estimating κ(F̃|H) and the error in estimating c.
Finally, combining all of the previous results,
we conclude that the estimator achieves the stated rate of convergence.
∎
A.6 Proof of Proposition 5.2
Proposition.
For the mixture proportion obtained from ReMPE-2:
Proof.
From the fact that , we have
After regrouping, . Therefore,
∎
Appendix B More about Subsampling MPE
B.1 Intuition
Theorem 3.2 and Corollary 3.3 have already justified the use of subsampling. Here, we explain it from another perspective (in terms of distributions, similar to the analysis in Yao et al. (2022)). The idea is that, since the original G may violate the irreducibility assumption, we modify F so that the resulting latent component distribution is less likely to violate the assumption.
Write the unknown distribution G itself as a mixture of a part to be kept and a part to be discarded. Then F becomes a mixture of H, the kept part of G, and the discarded part. Moving the discarded part to the other side (i.e., discarding that probability mass from F), the left-hand side can be re-written, after renormalization, as the new distribution F̃. By discarding a portion of probability mass from F, which is done by subsampling in practice, the resulting latent component distribution G̃ is less likely to violate the irreducibility assumption. The new mixture proportion of H in F̃ is κ*/c.
In the following, we provide justification of the above claim, which can also be seen as a reformulation of Corollary 3.3.
Proposition B.1.
Given that some probability mass is dropped from F, the quantity c · κ(F̃|H) is always bounded as follows: κ* ≤ c · κ(F̃|H) ≤ κ(F|H). Furthermore, under suitable conditions on the discarded mass, the bias is strictly reduced.
Proof.
Observe that
therefore .
From the fact that , we have . Then
To have bias reduction, we need , where both quantities can be represented as
Then we can get the result of by direct comparison. ∎
Based on the above proof, we claim that c · κ(F̃|H) leads to estimation bias no worse than κ(F|H). To be specific, when G is irreducible w.r.t. H, then κ* = c · κ(F̃|H) = κ(F|H). When G is not irreducible w.r.t. H, then κ* ≤ c · κ(F̃|H) ≤ κ(F|H). The key difference compared to Yao et al. (2022)'s approach is that we are modifying F, equivalently subsampling F, rather than regrouping probability mass onto H.
In summary, without making assumptions about irreducibility, as long as we remove some contribution of G from F by subsampling, the resulting identifiable quantity c · κ(F̃|H) will be no worse an estimate of κ* than the maximal proportion κ(F|H). Furthermore, with knowledge of a set A and an acceptance function satisfying Eqn. (6), c · κ(F̃|H) will equal κ*.
B.2 Rejection Sampling
Rejection sampling is a Monte Carlo method that aims to generate a sample following a new distribution F̃ based on a sample from a distribution F, where the two are characterized by their densities f̃ and f. An instance x drawn from F is kept with an acceptance probability proportional to the density ratio f̃(x)/f(x), and rejected otherwise. Algorithm 4 shows the detailed procedure.
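A minimal sketch of this thinning procedure (our own illustration; `rho` is the acceptance function of the main text, so that the accepted points follow the density ρ(x)f(x)/c):

```python
import numpy as np

def thin_sample(X, rho, rng=None):
    """Keep each draw x from F independently with probability rho(x).

    The accepted points follow the density rho(x) * f(x) / c, and the
    empirical acceptance rate estimates c = E_F[rho(X)].
    """
    rng = np.random.default_rng() if rng is None else rng
    p = np.clip(rho(X), 0.0, 1.0)
    keep = rng.random(len(X)) < p
    return X[keep], keep.mean()

# Example: keep only half of the points falling in A = [0, 1):
# X_tilde, c_hat = thin_sample(X, lambda x: np.where((x >= 0) & (x < 1), 0.5, 1.0))
```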
B.3 Gamma Spectrum Unfolding
In gamma spectrum unfolding, a gamma ray detector measures the energies of incoming gamma ray particles. The gamma rays were emitted either by a source of interest or from the background. The measurement is represented as a histogram , where the bins correspond to a quantization of energy. The histogram of measurements from the source of interest is also known.
The goal is to obtain a lower bound on the background posterior P(Y = 0 | X = x) on a certain set A. We specify A to be the energy bins near the main peak of H (a.k.a. the full-energy peak). For bins just outside A, the recorded gamma rays must come from the background, so the background level is known in these regions. Typically, this neighborhood consists of two (or more) intervals, so we know the background level on either side of the set A. The background level inside A can then be estimated using linear interpolation (Knoll, 2010; Alamaniotis et al., 2013), which yields the desired lower bound on P(Y = 0 | X = x) for x ∈ A. The above procedure is illustrated in Figure 1, and the acceptance function is chosen to be
$$\rho(x) = \begin{cases} 1 - \text{(estimated lower bound on } P(Y = 0 \mid X = x)\text{)} & x \in A \\ 1 & \text{otherwise.} \end{cases} \tag{18}$$
[Figure 1: Illustration of the construction of the acceptance function for gamma spectrum unfolding; the background level under the full-energy peak (the set A) is estimated by linear interpolation from the neighboring energy regions.]
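A sketch of this construction in code (our own illustration with hypothetical bin indices; the histogram, the peak window, and the interpolation endpoints are placeholders):

```python
import numpy as np

def unfolding_acceptance(counts_F, peak_lo, peak_hi):
    """Acceptance function over energy bins for the unfolding setting (sketch).

    counts_F         : measured mixture histogram (counts per energy bin)
    peak_lo, peak_hi : bin indices delimiting the full-energy-peak window A
    """
    bins = np.arange(len(counts_F))
    in_A = (bins >= peak_lo) & (bins <= peak_hi)
    # Linearly interpolate the background level across A from the bins on either side,
    # where essentially all counts come from the background.
    bg = np.interp(bins[in_A], [peak_lo - 1, peak_hi + 1],
                   [counts_F[peak_lo - 1], counts_F[peak_hi + 1]])
    lb_background_post = np.clip(bg / np.maximum(counts_F[in_A], 1), 0.0, 1.0)  # lower bound on P(Y=0|X=x), x in A
    rho = np.ones(len(counts_F))
    rho[in_A] = 1.0 - lb_background_post   # Eqn. (18): upper bound on P(Y=1|X=x) inside A
    return rho
```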
Appendix C Detailed Experimental Result in Sec. 6
This section shows four tables corresponding to four experimental setups in Sec. 6.
DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
average |
DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
average |
Dataset | DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mushroom | |||||||||||||
avg | |||||||||||||
Landsat | |||||||||||||
avg | |||||||||||||
Shuttle | |||||||||||||
avg | |||||||||||||
MNIST17 | |||||||||||||
avg | |||||||||||||
Overall | avg |
Dataset | DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Mushroom | |||||||||||||
avg | |||||||||||||
Landsat | |||||||||||||
avg | |||||||||||||
Shuttle | |||||||||||||
avg | |||||||||||||
MNIST17 | |||||||||||||
avg | |||||||||||||
Overall | avg |
Appendix D When Irreducibility Holds
In theory, when irreducibility holds, baseline MPE methods should be asymptotically unbiased estimators of the mixture proportion κ*, regrouping may introduce a negative bias, and subsampling should not introduce bias. Here we run some synthetically generated experiments (in a controlled setting) to verify these theoretical claims.
We specify the distributions , and as:
where G is irreducible w.r.t. H. We draw instances from both F and H, and the true mixture proportion κ* is varied over a range of values.
We draw labeled instances and train a one-hidden-layer neural network to predict P(Y = 1 | X = x). The acceptance function used in Subsampling MPE is constructed from this estimate. As for ReMPE, a small fraction of the sample from F is regrouped to the sample from H, as suggested by Yao et al. (2022). The above procedure was repeated multiple times with different random seeds. Results are shown in Table 6, where SuMPE never performs the worst (in some cases, subsampling slightly increases the bias compared to the original version, which is partly due to the acceptance function being estimated) and ReMPE may introduce extremely high bias.
DPL | ReDPL | SuDPL | EN | ReEN | SuEN | KM | ReKM | SuKM | TIcE | ReTIcE | SuTIcE | |
---|---|---|---|---|---|---|---|---|---|---|---|---|