Logistic Regression for Massive Data with Rare Events
Abstract
This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events (observations in one class, often called cases) is significantly smaller than the number of nonevents (observations in the other class, often called controls). We first derive the asymptotic distribution of the maximum likelihood estimator (MLE) of the unknown parameter, which shows that the asymptotic variance converges to zero at the rate of the inverse of the number of events rather than the inverse of the full data sample size. This indicates that the available information in rare events data is at the scale of the number of events, not the full data sample size. Furthermore, we prove that, when under-sampling retains only a small proportion of the nonevents, the resulting under-sampled estimator may have an asymptotic distribution identical to that of the full data MLE. This demonstrates the advantage of under-sampling nonevents for rare events data, because the procedure may significantly reduce computation and/or data collection costs. Another common practice in analyzing rare events data is to over-sample (replicate) the events, which has a higher computational cost. We show that this procedure may even result in efficiency loss in terms of parameter estimation.
1 Introduction
Big data with rare events in binary responses, also called imbalanced data, are data in which the number of events (observations for one class of the binary response) is much smaller than the number of nonevents (observations for the other class of the binary response). In this paper we also call the events “cases” and the nonevents “controls”. Rare events data are common in many scientific fields and applications. However, several important questions remain unanswered that are essential for valid data analysis and appropriate decision-making. For example, should we consider the amount of information contained in the data to be at the scale of the full-data sample size (very large) or the number of cases (relatively small)? Rare events data provide unique challenges and opportunities for sampling. On the one hand, sampling will not work without looking at responses, because the probability of not selecting a rare case is high. On the other hand, since the rare cases are more informative than the controls, is it possible to use a small proportion of the full data to preserve most or all of the relevant information in the data about unknown parameters? A common practice when analyzing rare events data is to under-sample the controls and/or over-sample (replicate) the cases. Is there any information loss when using this approach? This paper provides a rigorous theoretical analysis of the aforementioned questions in the context of parameter estimation. Some answers may be counter-intuitive. For example, keeping all the cases, there may be no efficiency loss at all from under-sampling controls; on the other hand, using all the controls and over-sampling cases may reduce estimation efficiency.
Rare events data, or imbalanced data, have attracted a lot of attention in machine learning and other quantitative fields; see, for example, Japkowicz (2000); King and Zeng (2001); Chawla et al. (2004); Estabrooks et al. (2004); Owen (2007); Sun et al. (2007); Chawla (2009); Rahman and Davis (2013); Fithian and Hastie (2014); Lemaître et al. (2017). A commonly implemented approach in practice is to try to balance the data by under-sampling controls (Drummond et al., 2003; Liu et al., 2009) and/or over-sampling cases (Chawla et al., 2002; Han et al., 2005; Mathew et al., 2017; Douzas and Bacao, 2017). However, most existing investigations focus on algorithms and methodologies for classification. Theoretical analyses of the effects of under-sampling and over-sampling on parameter estimation are still rare.
King and Zeng (2001) considered logistic regression for rare events data and focused on correcting the biases in estimating the regression coefficients and probabilities. Fithian and Hastie (2014) utilized the special structure of logistic regression models to design a novel local case-control sampling method. These investigations obtained theoretical results under the regular assumption that the probability of an event occurring is fixed and does not go to zero. This assumption rules out the scenario of extremely imbalanced data, because for extremely imbalanced data it is more appropriate to assume that the event probability goes to zero. Owen (2007)’s investigation did not require this fixed-probability assumption. He assumed that the number of rare cases is fixed, and derived the non-trivial point limit of the slope parameter estimator in logistic regression. However, the convergence rate and distributional properties of this estimator were not investigated. In this paper, we obtain convergence rates and asymptotic distributions of parameter estimators under the assumption that both the number of cases and the number of controls are random and grow large at rates such that the ratio of the number of cases to the number of controls decays to zero. This is the first study that provides distributional results for rare events data with a decaying event rate, and it gives the following indications.
• The convergence rate of the maximum likelihood estimator (MLE) is at the inverse of the number of cases instead of the total number of observations. This means that the amount of available information about unknown parameters in the data may be limited even if the full data volume is massive.
• There may be no efficiency loss at all in parameter estimation if one removes most of the controls in the data, because the control under-sampled estimators may have an asymptotic distribution identical to that of the full data MLE.
• Besides incurring a higher computational cost, over-sampling cases may result in estimation efficiency loss, because the asymptotic variances of the resulting estimators may be larger than that of the full data MLE.
The rest of the paper is organized as follows. We introduce the model setup and related assumptions in Section 2, and derive the asymptotic distribution for the full data MLE. We investigate under-sampled estimators in Section 3 and study over-sampled estimators in Section 4. Section 5 presents some numerical experiments, and Section 6 concludes the paper and points out some necessary future research. All the proofs of theoretical findings in this paper are presented in the supplementary material.
2 Model setups and assumptions
Let be independent data of size from a logistic regression model,
(1)
Here is the covariate, is the binary class label, is the intercept parameter, and is the slope parameter vector. For ease of presentation, denote as the full vector of regression coefficient, and define accordingly. This paper focuses on estimating the unknown .
If is fixed (i.e., does not change as grows), then model (1) is just the regular logistic regression model, and classical likelihood theory shows that the MLE based on the full data converges at a rate of . A fixed implies that is also a fixed constant bounded away from zero. However, for rare events data, because the event rate is so low, it is more appropriate to assume that approaches zero in some way. We discuss how to model this scenario in the following.
Let and be the numbers of cases (observations with ) and controls (observations with ), respectively, in . Here, and are random because they are summary statistics of the observed data, i.e., and . For rare events data, is much smaller than . Thus, for asymptotic investigations, it is reasonable to assume that , or equivalently , in probability, as . For big data with rare events, there should be a fair number of cases observed, so it is appropriate to assume that in probability. To model this scenario, we assume that the marginal event probability satisfies, as ,
(2)
We accommodate this condition by assuming that the true value of , denoted as , is fixed while the true value of , denoted as , goes to negative infinity at a certain rate. Specifically, we assume that as at a rate such that
(3)
where means a term that converges to zero in probability, i.e., a term that is arbitrarily small with probability approaching one. The assumption of a diverging with a fixed means that the baseline probability of a rare event is low, and the effect of the covariate does not change the order of the probability for a rare event to occur. This is a very reasonable assumption for many practical problems. For example, although making phone calls when driving may increase the probability of car accidents, it may not make car accidents a high-probability event.
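To make this regime concrete, the following sketch (our own illustration with hypothetical parameter values, not taken from the paper) generates data from a logistic model whose intercept drifts toward negative infinity as the sample size grows, so the marginal event rate decays while the expected number of cases still increases:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_logistic_data(n, alpha, beta, rng):
    """Draw (x, y) with P(y = 1 | x) = 1 / (1 + exp(-(alpha + beta * x)))."""
    x = rng.normal(size=n)
    p = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
    return x, rng.binomial(1, p)

# Fixed slope, intercept drifting to -infinity: the marginal event rate
# decays to zero while the expected number of cases keeps growing.
rates = []
for n, alpha in [(10**4, -4.0), (10**5, -6.0), (10**6, -8.0)]:
    x, y = generate_logistic_data(n, alpha, beta=0.5, rng=rng)
    rates.append(y.mean())
    print(f"n = {n:>7}, event rate = {y.mean():.5f}, cases = {y.sum()}")
```

The fixed slope keeps the covariate effect from changing the order of the event probability, matching the assumption discussed above.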
2.1 How much information do we have in rare events data
To demonstrate how much information is really available in rare events data, we derive the asymptotic distribution of the MLE for model (1) in the scenario described in (2) and (3). The MLE based on the full data , say , is the maximizer of
(4)
which is also the solution to the following equation,
(5)
where is the gradient of the log-likelihood .
The following theorem gives the asymptotic normality of the MLE for rare events data.
Theorem 1.
Remark 1.
The result in (6) shows that the convergence rate of the full-data MLE is at the order of , i.e., . This is different from the classical result of for the case that is a fixed constant. Theorem 1 indicates that for rare events data, the real amount of available information is actually at the scale of instead of . A large volume of data does not mean that we have a large amount of information.
3 Efficiency of under-sampled estimators
Theorem 1 in the previous section shows that the full-data MLE has a convergence rate of . If we under-sample controls to reduce the number of controls to the same level as , does the resulting estimator retain the full-data convergence rate of ? If so, one can significantly improve computational efficiency and reduce the storage requirement for massive data. Furthermore, will under-sampling controls cause any estimation efficiency loss (an enlarged asymptotic variance)? This section answers these questions.
From the full data set , we want to use all the cases (data points with ) while selecting only a subset of the controls (data points with ). Specifically, let be the probability that each data point with is selected into the subset. Let be the binary indicator variable that indicates whether the -th observation is included in the subset, i.e., the -th observation is included in the sample if and ignored if . Here, we define the sampling plan by assigning
(9)
where , , are independent and identically distributed (i.i.d.) random variables with the standard uniform distribution. This is a mixture of deterministic selection and random sampling. The resulting control under-sampled data include all rare cases (with ), and the number of controls (with ) is on average at the order of . The average sample size for the under-sampled data given the full data is , which is if . The average sample size reduction is , which is of the same order as if , and if .
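A minimal implementation of this plan on a hypothetical toy data set (the inclusion probability is written as `rho`): every case is kept deterministically, and each control is kept independently with probability `rho` via an i.i.d. standard-uniform draw.

```python
import numpy as np

rng = np.random.default_rng(1)

def undersample_controls(y, rho, rng):
    """Inclusion indicators for the plan in (9): keep the i-th point if
    y_i = 1, or if y_i = 0 and an i.i.d. standard uniform u_i <= rho."""
    u = rng.uniform(size=len(y))
    return (y == 1) | (u <= rho)

# Toy full data: 100 cases hidden among 100_000 observations.
y = np.zeros(100_000, dtype=int)
y[:100] = 1

keep = undersample_controls(y, rho=0.01, rng=rng)
print(keep.sum(), y[keep].sum())  # subsample size; all 100 cases retained
```

With `rho = 0.01`, the subsample keeps all 100 cases plus roughly 1% of the 99,900 controls, illustrating the drastic sample size reduction.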
Note that the under-sampled data taken according to in (9) is a biased sample, so we need to maximize a weighted objective function to obtain an asymptotically unbiased estimator. Alternatively, we can maximize an unweighted objective function and then correct the bias for the resulting estimator in logistic regression.
3.1 Under-sampled weighted estimator
The sampling inclusion probability given the full data for the -th data point is
The under-sampled weighted estimator, , is the maximizer of
(10)
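As a sketch of how such a weighted fit can be computed (our own Newton-Raphson illustration with hypothetical true parameter values, not the paper's code), the controls kept with probability `rho` receive weight `1/rho` while the cases receive weight 1:

```python
import numpy as np

rng = np.random.default_rng(2)

def weighted_logistic_mle(X, y, w, n_iter=30):
    """Newton-Raphson maximizer of the weighted log-likelihood
    sum_i w_i * (y_i * eta_i - log(1 + exp(eta_i))), eta = X @ theta."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        theta = theta + np.linalg.solve(hess, grad)
    return theta

# Rare-event data, then under-sampled controls with inverse-probability weights.
n, alpha, beta, rho = 200_000, -6.0, 1.0, 0.02
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha + beta * x))))
keep = (y == 1) | (rng.uniform(size=n) <= rho)   # sampling plan (9)
X = np.column_stack([np.ones(keep.sum()), x[keep]])
w = np.where(y[keep] == 1, 1.0, 1.0 / rho)       # weight 1/rho for controls
theta = weighted_logistic_mle(X, y[keep], w)
print(theta)  # close to the true (intercept, slope) = (-6, 1)
```

The inverse-probability weights undo the selection bias of the subsample, so the estimator targets the same parameter as the full data MLE.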
We present the asymptotic distribution of in the following theorem.
Theorem 2.
Remark 2.
If for any , then from (3) and the dominated convergence theorem, we know that . Thus
Since is the average number of controls in the under-sampled data, can be interpreted as the asymptotic ratio of the number of cases to the number of controls in the under-sampled data. Therefore, since is a fixed constant, the value of has the following intuitive interpretations.
• : take many more controls than cases;
• : the number of controls taken is of the same order as the number of cases;
• : take far fewer controls than cases.
Theorem 2 requires that . This means that the number of controls taken should not be significantly smaller than the number of cases, which is a very reasonable assumption.
Remark 3.
Theorem 2 shows that as long as does not make the number of controls in the under-sampled data much smaller than the number of cases , the under-sampled estimator preserves the convergence rate of the full-data estimator. Furthermore, if then , which implies that . This means that if one takes many more controls than cases, then asymptotically there is no estimation efficiency loss at all. Here, the number of controls taken can still be significantly smaller than , so the computational burden is significantly reduced. If , since , we know that , in the Loewner order (for two Hermitian matrices and of the same dimension, means that is positive semi-definite, and means that is positive definite). Thus reducing the number of controls to the same order as the number of cases may reduce the estimation efficiency, although the convergence rate is the same as that of the full-data estimator.
3.2 Under-sampled unweighted estimator with bias correction
Based on the control under-sampled data, if we obtain an estimator from an unweighted objective function, say
then in , the intercept estimator is asymptotically biased while the slope estimator is still asymptotically unbiased. We correct the bias of using , and define the under-sampled unweighted estimator with bias correction as
(14)
where
(15)
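A sketch of the unweighted-then-correct approach, under the assumption (ours, for illustration; the paper's exact correction (14)-(15) may differ in form) that the correction amounts to shifting the unweighted intercept by `log(rho)`, the standard adjustment when controls are thinned by a factor `rho`; hypothetical parameter values throughout:

```python
import numpy as np

rng = np.random.default_rng(3)

def logistic_mle(X, y, n_iter=30):
    """Unweighted Newton-Raphson MLE for logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        theta = theta + np.linalg.solve(
            (X * (p * (1.0 - p))[:, None]).T @ X, X.T @ (y - p))
    return theta

# Under-sampled data: thinning controls by rho inflates the apparent odds of
# an event by 1/rho, biasing only the intercept; shift it back by log(rho).
n, alpha, beta, rho = 200_000, -6.0, 1.0, 0.02
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha + beta * x))))
keep = (y == 1) | (rng.uniform(size=n) <= rho)
X = np.column_stack([np.ones(keep.sum()), x[keep]])

theta_u = logistic_mle(X, y[keep])                 # intercept biased upward
theta_bc = theta_u + np.array([np.log(rho), 0.0])  # bias-corrected intercept
print(theta_u[0], theta_bc)
```

Only the intercept needs correction; the slope estimator from the unweighted fit is already asymptotically unbiased, in line with the discussion above.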
The following theorem gives the asymptotic distribution of .
Theorem 3.
Remark 4.
Similarly to the case of the under-sampled weighted estimator, Theorem 3 shows that the estimator preserves the convergence rate of the full-data estimator if . Furthermore, if then ; if , then .
The following proposition is useful to compare the asymptotic variances of the weighted and the unweighted estimators.
Proposition 1.
Let be a random vector and be a positive scalar random variable. Assume that , , and are all finite and positive-definite matrices. The following inequality holds in the Loewner order.
Remark 5.
If we let and in Proposition 1, then we know that in the Loewner order. This indicates that with the same control under-sampled data, the unweighted estimator with bias correction, , has a higher estimation efficiency than the weighted estimator, .
4 Efficiency loss due to over-sampling
Another common practice for analyzing rare events data is to use all the controls and over-sample the cases. To investigate the effect of this approach, let denote the number of times that a data point is used, and define
(19)
where , , are i.i.d. Poisson random variables with parameter . Under this over-sampling plan, a data point with will be used only once, while a data point with will on average be used times in the over-sampled data. Here, can be interpreted as the average over-sampling rate for cases.
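One plausible reading of this plan (an assumption on our part: each case is kept once and receives an additional Poisson(λ) number of copies, so that no case is dropped and each case appears on average 1 + λ times) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(4)

def oversample_cases(x, y, lam, rng):
    """Each control enters once; each case enters 1 + Poisson(lam) times
    (an assumed reading of plan (19), not necessarily its exact form)."""
    reps = np.where(y == 1, 1 + rng.poisson(lam, size=len(y)), 1)
    return np.repeat(x, reps), np.repeat(y, reps)

# Toy data: 50 cases among 10_000 observations, over-sampled with lam = 20.
y = np.zeros(10_000, dtype=int)
y[:50] = 1
x = rng.normal(size=len(y))

x_os, y_os = oversample_cases(x, y, lam=20.0, rng=rng)
print(len(y_os), y_os.sum())  # controls unchanged; cases about 50 * 21
```

Note that the over-sampled data set is larger than the original one, which is the source of the extra computational cost discussed below.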
Again, the case over-sampled data taken according to (19) are a biased sample, and we need to either use a weighted objective function or correct the bias of the estimator obtained from an unweighted objective function.
4.1 Over-sampled weighted estimator
Let . The case over-sampled weighted estimator, , is the maximizer of
(20)
The following theorem gives the asymptotic distribution of .
Theorem 4.
Remark 6.
Note that in (22), and the equality holds only if or . Thus , meaning that over-sampling the cases may result in estimation efficiency loss unless the number of over-sampled cases is either small enough to be negligible () or very large (). Considering that over-sampling incurs additional computational cost along with potential estimation efficiency loss, this procedure is not recommended if the primary goal is parameter estimation.
4.2 Over-sampled unweighted estimator with bias correction
For completeness, we derive the asymptotic distribution of the over-sampled unweighted estimator with bias correction, , defined as , where
(23)
and
(24)
The following theorem is about the asymptotic distribution of .
Theorem 5.
Remark 7.
Unlike the case of under-sampled estimators, for over-sampled estimators the unweighted estimator with bias correction has a lower estimation efficiency than the weighted estimator . To see this, letting and in Proposition 1, we know that , and the equality holds if . Here, since , we can intuitively interpret as the ratio of the average number of over-sampled cases to the number of controls. If in addition , then ; but in general, .
5 Numerical experiments
5.1 Full data estimator
Consider model (1) with one covariate and . We set , , and , and generate corresponding full data of sizes , , and , respectively. As a result, the average numbers of cases () in the resulting data are , , and . The above value configuration aims to mimic the scenario that , , and . The covariates ’s are generated from for cases () and from for controls (). For the above setup, the true value of is fixed , and the true values of are , , and , respectively for the four different values of . We repeat the simulation for times and calculate empirical MSEs as , , where , , and is the estimate from the -th repetition.
Table 1 presents empirical MSEs (eMSEs) multiplied by and , respectively. We see that the eMSE does not diverge as increases for both and . This confirms the conclusion in Theorem 1 that converges at a rate of (which implies that ). On the other hand, the values of eMSE are large, and they increase quickly as increases, indicating that diverges to infinity. Table 1 confirms that although the full data sample sizes are very large, it is the values of that reflect the real amount of available information about the regression parameters, and these are actually much smaller.
|     | eMSE |      | eMSE |       |
|-----|------|------|------|-------|
| 20  | 2.51 | 1.21 | 125.7 | 60.6 |
| 40  | 2.06 | 1.09 | 515.5 | 271.9 |
| 80  | 2.22 | 1.00 | 2774.4 | 1248.8 |
| 160 | 2.16 | 1.08 | 13474.9 | 6731.6 |
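The qualitative message of Table 1 can be reproduced with a small Monte Carlo sketch (our own illustration with hypothetical parameter values and far fewer repetitions than the paper): holding the full-data size fixed and lowering the event rate inflates the slope's empirical MSE roughly in proportion to the inverse of the expected number of cases:

```python
import numpy as np

rng = np.random.default_rng(5)

def logistic_mle(X, y, n_iter=25):
    """Plain Newton-Raphson MLE for logistic regression."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        theta = theta + np.linalg.solve(
            (X * (p * (1.0 - p))[:, None]).T @ X,  # observed information
            X.T @ (y - p),                          # score vector
        )
    return theta

# Same full-data size n, two event rates: the slope's eMSE tracks the
# expected number of cases, not the full sample size n.
n, beta_true, reps = 100_000, 0.5, 40
emse = []
for alpha in (-4.0, -6.5):  # expected cases roughly 2000 vs 170
    errs = []
    for _ in range(reps):
        x = rng.normal(size=n)
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha + beta_true * x))))
        X = np.column_stack([np.ones(n), x])
        errs.append((logistic_mle(X, y)[1] - beta_true) ** 2)
    emse.append(float(np.mean(errs)))
    print(f"alpha = {alpha}: slope eMSE = {emse[-1]:.2e}")
```

Even though both settings use the same number of observations, the rarer-event setting yields a markedly larger eMSE, mirroring the pattern in Table 1.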
5.2 Sampling-based estimators
Now we provide numerical results about under-sampled and over-sampled estimators. Consider model (1) with , , and , so that . For under-sampling, consider , , , , , , , and ; for over-sampling, consider and , which corresponds to , , , , , , and , respectively. We repeat the simulation for times and calculate empirical MSEs as
where is the estimate from the -th repetition for some estimator . We consider , , , and . Note that if then the under-sampled estimators become the full data estimator, i.e., ; if , then the over-sampled estimators become the full data estimator, i.e., .
Figure 1 presents the simulation results. Figure 1 (a) plots eMSEs () against . When is small, the number of controls in the under-sampled data is small, and the resulting estimators are not as efficient as the full-data estimator. For example, when , the numbers of cases and controls are roughly the same, and we do see significant information loss in this case. However, as gets larger, the under-sampled estimators become more efficient, and when , they perform almost as well as the full-data estimator. In addition, the unweighted estimator is more efficient than the weighted estimator for smaller ’s, and both perform more like the full data estimator as grows. These observations are consistent with the conclusions in Theorems 2 and 3 and the discussions in the relevant remarks.
Figure 1 (b) plots eMSEs () against . We see that the case over-sampled estimators are less efficient than the full data estimator unless the average number of over-sampled cases is very small or very large. For small , and perform similarly, but is more efficient than for large . The reason for this phenomenon is that if is large, the condition required in Theorem 5 for may not hold. This confirms our recommendation that the weighted estimator is preferable if over-sampling has to be used.
6 Discussion and future research
In this paper, we have obtained distributional results showing that the amount of information contained in massive data with rare events is at the scale of the relatively small total number of cases rather than the large total number of observations. We have further demonstrated that aggressively under-sampling the controls may not sacrifice the estimation efficiency at all while over-sampling the cases may reduce the estimation efficiency.
Although the current paper focuses on the logistic regression model, we conjecture that our conclusions hold more generally for rare events data, and we will investigate more complicated and general models in future research. As another direction, more comprehensive numerical experiments would help to gain further understanding of parameter estimation with imbalanced data. This paper has focused on point estimation; how to make valid and more accurate statistical inference with rare events data still needs further research. There is a long-standing literature investigating the effects of under-sampling and over-sampling in classification. However, most investigations adopted an empirical approach, so theoretical investigations of the effects of sampling are still needed for classification.
Appendix
In this section, we prove all theoretical results in the paper. To facilitate the presentation of the proofs, denote
The condition that for any implies that
(A.1)
for any and , and we will use this result multiple times in the proof. The inequality in (A.1) is true because for any and , we can choose and so that
with probability one.
Appendix A.1 Proof of Theorem 1
Proof of Theorem 1.
The estimator is the maximizer of
(A.2)
so is the maximizer of
(A.3)
By Taylor’s expansion,
(A.4)
where , and
is the gradient of , and lies between and . If we can show that
(A.5)
in distribution, and for any ,
(A.6)
in probability, then from the Basic Corollary on page 2 of Hjort and Pollard (2011), we know that , the maximizer of , satisfies that
(A.7)
Slutsky’s theorem together with (A.5) and (A.7) implies the result in Theorem 1. We prove (A.5) and (A.6) in the following.
Note that
(A.8)
is a summation of i.i.d. quantities. Since, as , the distribution of depends on , we need to use a central limit theorem for triangular arrays. The Lindeberg-Feller central limit theorem (see Section ∗2.8 of van der Vaart, 1998) is appropriate.
We examine the mean and variance of . For the mean, from the fact that
we know that .
For the variance,
Note that
almost surely, and
Thus, from the dominated convergence theorem,
Now we check the Lindeberg-Feller condition. For any ,
where the last step is from the dominated convergence theorem. Thus, applying the Lindeberg-Feller central limit theorem (Section ∗2.8 of van der Vaart, 1998), we finish the proof of (A.5).
To finish the proof, we only need to prove that
(A.10)
in probability. This is done by noting that
(A.11)
(A.12)
by Proposition 1 of Wang (2019). ∎
Appendix A.2 Proof of Theorem 2
Proof of Theorem 2.
The estimator is the maximizer of defined in (10), so is the maximizer of . By Taylor’s expansion,
(A.13)
where
is the gradient of , and lies between and . Similarly to the proof of Theorem 1, we only need to show that
(A.14)
in distribution, and for any ,
(A.15)
in probability.
We prove (A.14) first. Recall that is the full data set and , satisfying that
We notice that
Letting , we know that , , are i.i.d., with the underlying distribution of depending on . From direct calculation, we have
Thus, by the dominated convergence theorem, we obtain that
(A.16)
Now we check the Lindeberg-Feller condition (Section ∗2.8 of van der Vaart, 1998). For simplicity, let and , where . For any ,
where the second-to-last step follows from the dominated convergence theorem and the facts that and . Thus, applying the Lindeberg-Feller central limit theorem (Section ∗2.8 of van der Vaart, 1998) finishes the proof of (A.14).
Now we prove (A.15). By direct calculation, we first notice that
(A.17)
has a mean of
(A.18)
where the last step is by the dominated convergence theorem. In addition, the variance of each component of is bounded by
(A.19)
where the last step is because and imply that . From (A.18) and (A.19), Chebyshev’s inequality implies that in probability. Notice that
Since , to finish the proof of (A.15), we only need to prove that is bounded in probability. Using an approach similar to (A.18) and (A.19), we can show that has a mean that is bounded and a variance that converges to zero.
∎
Appendix A.3 Proof of Theorem 3
Proof of Theorem 3.
If we use to denote the under-sampled objective function shifted by , i.e., , then the estimator is the maximizer of
(A.20)
We notice that is the maximizer of . By Taylor’s expansion,
(A.21)
where
is the gradient of , and lies between and .
Similarly to the proof of Theorem 1, we only need to show that
(A.22)
in distribution, and for any ,
(A.23)
in probability.
We prove (A.22) first. Define . We have that
which implies that . For the conditional variance
where is integrable. Thus, by the dominated convergence theorem, satisfies that
(A.24)
Therefore, we have
(A.25)
Now we check the Lindeberg-Feller condition. For any ,
where the second-to-last step follows from the dominated convergence theorem. Thus, applying the Lindeberg-Feller central limit theorem (Section ∗2.8 of van der Vaart, 1998) finishes the proof of (A.22).
Now we prove (A.23). First, letting
(A.26)
the mean of satisfies that
(A.27)
by the dominated convergence theorem, and the variance of each component of is bounded by
(A.28)
Thus, Chebyshev’s inequality implies that
(A.29)
in probability. Furthermore,
(A.30)
where the last step is because is bounded in probability, since it has a bounded mean and a variance that converges to zero. Combining (A.29) and (A.30), (A.23) follows. ∎
Appendix A.4 Proof of Proposition 1
Proof of Proposition 1.
Let
Since , we have
which finishes the proof. ∎
Appendix A.5 Proof of Theorem 4
Proof of Theorem 4.
The estimator is the maximizer of (20), so is the maximizer of . By Taylor’s expansion,
(A.31)
where
is the gradient of , and lies between and . Similarly to the proof of Theorem 1, we only need to show that
(A.32)
in distribution, and for any ,
(A.33)
in probability.
We prove (A.32) first. Denote , so , , are i.i.d. with the underlying distribution of depending on . From direct calculation, we have
where the is bounded. Thus, by the dominated convergence theorem, we obtain that
Now we check the Lindeberg-Feller condition (Section ∗2.8 of van der Vaart, 1998). Let and , where . For any ,
Thus, applying the Lindeberg-Feller central limit theorem (Section ∗2.8 of van der Vaart, 1998) finishes the proof of (A.32).
Now we prove (A.33). Let
Since
by the dominated convergence theorem, and each component of has a variance that is bounded by
applying Chebyshev’s inequality gives that
in probability. Thus, (A.33) follows from the fact that
where the last step is because has a bounded mean and a bounded variance and thus it is bounded in probability. ∎
Appendix A.6 Proof of Theorem 5
Proof of Theorem 5.
The over-sampled estimator is the maximizer of
(A.34)
Thus, is the maximizer of . By Taylor’s expansion,
(A.35)
where
is the gradient of , and lies between and .
Similarly to the proof of Theorem 1, we only need to show that
(A.36)
in distribution, and for any ,
(A.37)
in probability.
We prove (A.36) first. Let . We have that
which implies that . For the conditional variance
where the ’s above are all bounded and the last step is because . Thus, by the dominated convergence theorem, satisfies that
(A.38)
which indicates that
(A.39)
Now we check the Lindeberg-Feller condition. Recall that , where . We can show that . For any ,
This indicates that , and thus the Lindeberg-Feller condition holds. Applying the Lindeberg-Feller central limit theorem (Section ∗2.8 of van der Vaart, 1998) finishes the proof of (A.36).
Now we prove (A.37). Let
(A.40)
Note that
(A.41)
(A.42)
(A.43)
by the dominated convergence theorem, and the variance of each component of is bounded by
where the last step is because and both expectations are finite. Therefore, Chebyshev’s inequality implies that in probability. Thus, (A.37) follows from the fact that
where the last step is from the fact that has a bounded mean and a bounded variance, and an application of Chebyshev’s inequality. ∎
References
- Chawla (2009) Chawla, N. V. (2009). Data mining for imbalanced datasets: An overview. In Data Mining and Knowledge Discovery Handbook, 875–886. Springer.
- Chawla et al. (2002) Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
- Chawla et al. (2004) Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Editorial: special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter 6, 1, 1–6.
- Douzas and Bacao (2017) Douzas, G. and Bacao, F. (2017). Self-organizing map oversampling (SOMO) for imbalanced data set learning. Expert Systems with Applications 82, 40–52.
- Drummond et al. (2003) Drummond, C., Holte, R. C., et al. (2003). C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, vol. 11, 1–8. Citeseer.
- Estabrooks et al. (2004) Estabrooks, A., Jo, T., and Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Computational Intelligence 20, 1, 18–36.
- Fithian and Hastie (2014) Fithian, W. and Hastie, T. (2014). Local case-control sampling: Efficient subsampling in imbalanced data sets. Annals of Statistics 42, 5, 1693.
- Han et al. (2005) Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In D.-S. Huang, X.-P. Zhang, and G.-B. Huang, eds., Advances in Intelligent Computing, 878–887. Springer, Berlin, Heidelberg.
- Hjort and Pollard (2011) Hjort, N. L. and Pollard, D. (2011). Asymptotics for minimisers of convex processes. arXiv preprint arXiv:1107.3806 .
- Japkowicz (2000) Japkowicz, N. (2000). Learning from imbalanced data sets: Papers from the AAAI workshop, AAAI, 2000. Technical Report WS-00-05.
- King and Zeng (2001) King, G. and Zeng, L. (2001). Logistic regression in rare events data. Political Analysis 9, 2, 137–163.
- Lemaître et al. (2017) Lemaître, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research 18, 1, 559–563.
- Liu et al. (2009) Liu, X., Wu, J., and Zhou, Z. (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 2, 539–550.
- Mathew et al. (2017) Mathew, J., Pang, C. K., Luo, M., and Leong, W. H. (2017). Classification of imbalanced data by oversampling in kernel space of support vector machines. IEEE Transactions on Neural Networks and Learning Systems 29, 9, 4065–4076.
- Owen (2007) Owen, A. B. (2007). Infinitely imbalanced logistic regression. The Journal of Machine Learning Research 8, 761–773.
- Rahman and Davis (2013) Rahman, M. M. and Davis, D. (2013). Addressing the class imbalance problem in medical datasets. International Journal of Machine Learning and Computing 3, 2, 224.
- Sun et al. (2007) Sun, Y., Kamel, M. S., Wong, A. K., and Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40, 12, 3358–3378.
- van der Vaart (1998) van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press, London.
- Wang (2019) Wang, H. (2019). More efficient estimation for logistic regression with optimal subsamples. Journal of Machine Learning Research 20, 132, 1–59.