
Logistic Regression for Massive Data with Rare Events

HaiYing Wang
Department of Statistics, University of Connecticut
Abstract

This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events (observations in one class, often called cases) is significantly smaller than the number of nonevents (observations in the other class, often called controls). We first derive the asymptotic distribution of the maximum likelihood estimator (MLE) of the unknown parameter, which shows that the asymptotic variance converges to zero at the rate of the inverse of the number of events instead of the inverse of the full data sample size. This indicates that the available information in rare events data is at the scale of the number of events rather than the full data sample size. Furthermore, we prove that if we under-sample the nonevents so that only a small proportion of them is retained, the resulting under-sampled estimator may have an asymptotic distribution identical to that of the full data MLE. This demonstrates the advantage of under-sampling nonevents for rare events data, because the procedure may significantly reduce the computation and/or data collection costs. Another common practice in analyzing rare events data is to over-sample (replicate) the events, which has a higher computational cost. We show that this procedure may even result in efficiency loss in terms of parameter estimation.

1 Introduction

Big data with rare events in binary responses, also called imbalanced data, are data in which the number of events (observations for one class of the binary response) is much smaller than the number of nonevents (observations for the other class of the binary response). In this paper we also call the events “cases” and the nonevents “controls”. Rare events data are common in many scientific fields and applications. However, several important questions remain unanswered that are essential for valid data analysis and appropriate decision-making. For example, should we consider the amount of information contained in the data to be at the scale of the full-data sample size (very large) or the number of cases (relatively small)? Rare events data provide unique challenges and opportunities for sampling. On the one hand, sampling without looking at the responses does not work well, because the probability of not selecting a rare case is high. On the other hand, since the rare cases are more informative than the controls, is it possible to use a small proportion of the full data and still preserve most or all of the relevant information about the unknown parameters? A common practice when analyzing rare events data is to under-sample the controls and/or over-sample (replicate) the cases. Is there any information loss when using this approach? This paper provides a rigorous theoretical analysis of the aforementioned questions in the context of parameter estimation. Some answers may be counter-intuitive. For example, if all the cases are kept, under-sampling the controls may incur no efficiency loss at all; on the other hand, using all the controls and over-sampling the cases may reduce estimation efficiency.

Rare events data, or imbalanced data, have attracted a great deal of attention in machine learning and other quantitative fields; see, for example, Japkowicz (2000); King and Zeng (2001); Chawla et al. (2004); Estabrooks et al. (2004); Owen (2007); Sun et al. (2007); Chawla (2009); Rahman and Davis (2013); Fithian and Hastie (2014); Lemaître et al. (2017). A common approach in practice is to balance the data by under-sampling controls (Drummond et al., 2003; Liu et al., 2009) and/or over-sampling cases (Chawla et al., 2002; Han et al., 2005; Mathew et al., 2017; Douzas and Bacao, 2017). However, most existing investigations focus on algorithms and methodologies for classification. Theoretical analyses of the effects of under-sampling and over-sampling in terms of parameter estimation are still rare.

King and Zeng (2001) considered logistic regression for rare events data and focused on correcting the biases in estimating the regression coefficients and probabilities. Fithian and Hastie (2014) utilized the special structure of logistic regression models to design a novel local case-control sampling method. These investigations obtained theoretical results under the regular assumption that the probability of an event occurring is fixed and does not go to zero. This assumption rules out the scenario of extremely imbalanced data, for which it is more appropriate to assume that the event probability goes to zero. Owen (2007)’s investigation did not require this fixed-probability assumption. He assumed that the number of rare cases is fixed, and derived the non-trivial point limit of the slope parameter estimator in logistic regression. However, the convergence rate and distributional properties of this estimator were not investigated. In this paper, we obtain convergence rates and asymptotic distributions of parameter estimators under the assumption that both the number of cases and the number of controls are random and grow large, at rates such that the number of cases divided by the number of controls decays to zero. This is the first study that provides distributional results for rare events data with a decaying event rate, and it gives the following indications.

  • The convergence rate of the maximum likelihood estimator (MLE) is the inverse of the square root of the number of cases rather than of the total number of observations. This means that the amount of available information about the unknown parameters may be limited even if the full data volume is massive.

  • There may be no efficiency loss at all in parameter estimation if one removes most of the controls in the data, because the control under-sampled estimators may have an asymptotic distribution that is identical to that of the full data MLE.

  • Besides incurring a higher computational cost, over-sampling cases may result in estimation efficiency loss, because the asymptotic variances of the resulting estimators may be larger than that of the full data MLE.

The rest of the paper is organized as follows. We introduce the model setup and related assumptions in Section 2, and derive the asymptotic distribution of the full data MLE. We investigate under-sampled estimators in Section 3 and over-sampled estimators in Section 4. Section 5 presents some numerical experiments, and Section 6 concludes the paper and points out directions for future research. All the proofs of the theoretical findings in this paper are presented in the Appendix.

2 Model setups and assumptions

Let $\mathcal{D}_{n}=\{(\mathbf{x}_{i},y_{i}),i=1,...,n\}$ be independent data of size $n$ from a logistic regression model,

\mathbb{P}(y=1|\mathbf{x})=p(\alpha,\bm{\beta})=\frac{e^{\alpha+\mathbf{x}^{\mathrm{T}}\bm{\beta}}}{1+e^{\alpha+\mathbf{x}^{\mathrm{T}}\bm{\beta}}}.   (1)

Here $\mathbf{x}\in\mathbb{R}^{d}$ is the covariate, $y\in\{0,1\}$ is the binary class label, $\alpha$ is the intercept parameter, and $\bm{\beta}$ is the slope parameter vector. For ease of presentation, denote $\bm{\theta}=(\alpha,\bm{\beta}^{\mathrm{T}})^{\mathrm{T}}$ as the full vector of regression coefficients, and define $\mathbf{z}=(1,\mathbf{x}^{\mathrm{T}})^{\mathrm{T}}$ accordingly. This paper focuses on estimating the unknown $\bm{\theta}$.

If $\bm{\theta}$ is fixed (does not change with $n$), then model (1) is just the regular logistic regression model, and classical likelihood theory shows that the MLE based on the full data $\mathcal{D}_{n}$ converges at a rate of $n^{-1/2}$. A fixed $\bm{\theta}$ implies that $\mathbb{P}(y=1)=\mathbb{E}\{\mathbb{P}(y=1|\mathbf{x})\}$ is also a fixed constant bounded away from zero. However, for rare events data, because the event rate in the data is so low, it is more appropriate to assume that $\mathbb{P}(y=1)$ approaches zero in some way. We discuss how to model this scenario in the following.

Let $n_{1}$ and $n_{0}$ be the numbers of cases (observations with $y_{i}=1$) and controls (observations with $y_{i}=0$), respectively, in $\mathcal{D}_{n}$. Here, $n_{1}$ and $n_{0}$ are random because they are summary statistics of the observed data, i.e., $n_{1}=\sum_{i=1}^{n}y_{i}$ and $n_{0}=n-n_{1}$. For rare events data, $n_{1}$ is much smaller than $n_{0}$. Thus, for asymptotic investigations, it is reasonable to assume that $n_{1}/n_{0}\rightarrow 0$, or equivalently $n_{1}/n\rightarrow 0$, in probability, as $n\rightarrow\infty$. For big data with rare events, there should be a fair number of cases observed, so it is appropriate to assume that $n_{1}\rightarrow\infty$ in probability. To model this scenario, we assume that the marginal event probability $\mathbb{P}(y=1)$ satisfies, as $n\rightarrow\infty$,

\mathbb{P}(y=1)\rightarrow 0\quad\text{and}\quad n\mathbb{P}(y=1)\rightarrow\infty.   (2)

We accommodate this condition by assuming that the true value of $\bm{\beta}$, denoted as $\bm{\beta}_{t}$, is fixed, while the true value of $\alpha$, denoted as $\alpha_{t}$, goes to negative infinity at a certain rate. Specifically, we assume $\alpha_{t}\rightarrow-\infty$ as $n\rightarrow\infty$ at a rate such that

\frac{n_{1}}{n}=\mathbb{P}(y=1)\{1+o_{P}(1)\}=\mathbb{E}\bigg(\frac{e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\bigg)\{1+o_{P}(1)\},   (3)

where $o_{P}(1)$ means a term that converges to zero in probability, i.e., a term that is arbitrarily small with probability approaching one. The assumption of a diverging $\alpha_{t}$ with a fixed $\bm{\beta}_{t}$ means that the baseline probability of a rare event is low, and the effect of the covariate does not change the order of the probability for a rare event to occur. This is a very reasonable assumption for many practical problems. For example, although making phone calls when driving may increase the probability of car accidents, it may not make car accidents a high-probability event.
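
To make the asymptotic regime in (2) and (3) concrete, the following minimal sketch generates data from model (1) with an intercept drifting to $-\infty$ as $n$ grows, so that the event rate vanishes while the expected number of cases still diverges. The constant c0, the value $\bm{\beta}_{t}=1$, and the standard normal covariate are illustrative assumptions, not values taken from the paper.

    import numpy as np

    def simulate_rare_events(n, beta_t=1.0, c0=2.0, rng=None):
        # alpha_t = log(c0) - 0.5*log(n) makes P(y=1) shrink roughly like
        # c0/sqrt(n), so P(y=1) -> 0 while n*P(y=1) -> infinity, matching (2).
        rng = np.random.default_rng(rng)
        alpha_t = np.log(c0) - 0.5 * np.log(n)
        x = rng.standard_normal(n)
        p = 1.0 / (1.0 + np.exp(-(alpha_t + beta_t * x)))
        y = rng.binomial(1, p)
        return x, y, alpha_t

    x, y, alpha_t = simulate_rare_events(10**6)
    print(alpha_t, y.sum())   # n1 = y.sum() is large even though the event rate is tiny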

2.1 How much information do we have in rare events data

To demonstrate how much information is really available in rare events data, we derive the asymptotic distribution of the MLE for model (1) in the scenario described in (2) and (3). The MLE based on the full data $\mathcal{D}_{n}$, say $\hat{\bm{\theta}}$, is the maximizer of

\ell(\bm{\theta})=\sum_{i=1}^{n}\big\{y_{i}\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}-\log(1+e^{\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}})\big\},   (4)

which is also the solution to the following equation,

\dot{\ell}(\bm{\theta})=\sum_{i=1}^{n}\big\{y_{i}-p_{i}(\alpha,\bm{\beta})\big\}\mathbf{z}_{i}=0,   (5)

where $\dot{\ell}(\bm{\theta})$ is the gradient of the log-likelihood $\ell(\bm{\theta})$.
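
For illustration, a minimal Newton-Raphson sketch for solving the score equation (5) is given below; it is not tuned for numerical robustness (no step-halving or safeguards against separation), and the function name is ours.

    import numpy as np

    def logistic_mle(X, y, n_iter=100, tol=1e-10):
        # Newton-Raphson for theta = (alpha, beta^T)^T maximizing (4),
        # i.e., solving the score equation (5).
        Z = np.column_stack([np.ones(len(y)), X])        # z_i = (1, x_i^T)^T
        theta = np.zeros(Z.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-Z @ theta))
            score = Z.T @ (y - p)                        # gradient in (5)
            info = (Z * (p * (1.0 - p))[:, None]).T @ Z  # observed information
            step = np.linalg.solve(info, score)
            theta += step
            if np.max(np.abs(step)) < tol:
                break
        return theta

    # theta_hat = logistic_mle(x[:, None], y)   # (x, y) from the earlier sketch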

The following theorem gives the asymptotic normality of the MLE $\hat{\bm{\theta}}$ for rare events data.

Theorem 1.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$ and $\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}})$ is a positive-definite matrix, then under the conditions in (2) and (3), as $n\rightarrow\infty$,

\sqrt{n_{1}}(\hat{\bm{\theta}}-\bm{\theta}_{t})\longrightarrow\mathbb{N}\big(\mathbf{0},\ \mathbf{V}_{f}\big),   (6)

in distribution, where

\mathbf{V}_{f}=\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\big)\mathbf{M}_{f}^{-1},\qquad\text{and}   (7)
\mathbf{M}_{f}=\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)=\mathbb{E}\left\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\begin{pmatrix}1&\mathbf{x}^{\mathrm{T}}\\ \mathbf{x}&\mathbf{x}\mathbf{x}^{\mathrm{T}}\end{pmatrix}\right\}.   (8)
Remark 1.

The result in (6) shows that the convergence rate of the full-data MLE is of the order $n_{1}^{-1/2}$, i.e., $\hat{\bm{\theta}}-\bm{\theta}_{t}=O_{P}(n_{1}^{-1/2})$. This is different from the classical result of $\hat{\bm{\theta}}-\bm{\theta}_{t}=O_{P}(n^{-1/2})$ for the case that $\mathbb{P}(y=1)$ is a fixed constant. Theorem 1 indicates that for rare events data, the real amount of available information is actually at the scale of $n_{1}$ instead of $n$. A large volume of data does not mean that we have a large amount of information.

3 Efficiency of under-sampled estimators

Theorem 1 in the previous section shows that the full-data MLE has a convergence rate of $n_{1}^{-1/2}$. If we under-sample controls to reduce the number of controls to the same level as $n_{1}$, does the resulting estimator retain the full-data convergence rate of $n_{1}^{-1/2}$? If so, one can significantly improve the computational efficiency and reduce the storage requirement for massive data. Furthermore, does under-sampling controls cause any estimation efficiency loss (an enlarged asymptotic variance)? This section answers these questions.

From the full data set $\mathcal{D}_{n}=\{(\mathbf{x}_{1},y_{1}),...,(\mathbf{x}_{n},y_{n})\}$, we want to use all the cases (data points with $y_{i}=1$) while selecting only a subset of the controls (data points with $y_{i}=0$). Specifically, let $\pi_{0}$ be the probability that each data point with $y_{i}=0$ is selected into the subset. Let $\delta_{i}\in\{0,1\}$ be the binary indicator variable that signifies whether the $i$-th observation is included in the subset, i.e., the $i$-th observation is included in the sample if $\delta_{i}=1$ and ignored if $\delta_{i}=0$. Here, we define the sampling plan by assigning

\delta_{i}=y_{i}+(1-y_{i})I(u_{i}\leq\pi_{0}),\quad i=1,...,n,   (9)

where $u_{i}\sim\mathbb{U}(0,1)$, $i=1,...,n$, are independent and identically distributed (i.i.d.) random variables with the standard uniform distribution. This is a mixture of deterministic selection and random sampling. The resulting control under-sampled data include all rare cases (with $y_{i}=1$), and the number of controls (with $y_{i}=0$) is on average of the order $n_{0}\pi_{0}$. The average sample size for the under-sampled data given the full data is $\sum_{i=1}^{n}\mathbb{E}(\delta_{i}|\mathcal{D}_{n})=n_{1}+n_{0}\pi_{0}$, which is $o_{P}(n)$ if $\pi_{0}\rightarrow 0$. The average sample size reduction is $n_{0}(1-\pi_{0})$, which is of the same order as $n$ if $\pi_{0}\nrightarrow 1$, and $n_{0}(1-\pi_{0})/n\rightarrow 1$ if $\pi_{0}\rightarrow 0$.
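
A short sketch of the sampling plan in (9) (the function name is ours):

    import numpy as np

    def undersample_controls(y, pi0, rng=None):
        # Inclusion indicators delta_i from (9): every case (y_i = 1) is kept,
        # and each control (y_i = 0) is kept independently with probability pi0.
        rng = np.random.default_rng(rng)
        u = rng.uniform(size=len(y))
        return np.where(y == 1, 1, (u <= pi0).astype(int))

    # delta = undersample_controls(y, pi0=0.05)
    # under-sampled data: x[delta == 1], y[delta == 1]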

Note that the under-sampled data taken according to $\delta_{i}$ in (9) is a biased sample, so we need to maximize a weighted objective function to obtain an asymptotically unbiased estimator. Alternatively, we can maximize an unweighted objective function and then correct the bias for the resulting estimator in logistic regression.

3.1 Under-sampled weighted estimator

The sampling inclusion probability given the full data $\mathcal{D}_{n}$ for the $i$-th data point is

\pi_{i}=\mathbb{E}(\delta_{i}|\mathcal{D}_{n})=y_{i}+(1-y_{i})\pi_{0}=\pi_{0}+(1-\pi_{0})y_{i}.

The under-sampled weighted estimator, $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$, is the maximizer of

\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta})=\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\big\{y_{i}\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}-\log(1+e^{\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}})\big\}.   (10)
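
A weighted variant of the earlier Newton-Raphson sketch maximizes (10) when the weights $\delta_{i}/\pi_{i}$ are supplied; this is again a minimal illustration with names of our choosing.

    import numpy as np

    def weighted_logistic_mle(X, y, w, n_iter=100, tol=1e-10):
        # Newton-Raphson for a weighted log-likelihood
        #   sum_i w_i { y_i z_i^T theta - log(1 + exp(z_i^T theta)) },
        # which is (10) when w_i = delta_i / pi_i (rows with w_i = 0 drop out).
        Z = np.column_stack([np.ones(len(y)), X])
        theta = np.zeros(Z.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-Z @ theta))
            score = Z.T @ (w * (y - p))
            info = (Z * (w * p * (1.0 - p))[:, None]).T @ Z
            step = np.linalg.solve(info, score)
            theta += step
            if np.max(np.abs(step)) < tol:
                break
        return theta

    # pi = pi0 + (1 - pi0) * y; w = delta / pi   (delta from the sketch after (9))
    # theta_under_w = weighted_logistic_mle(x[:, None], y, w)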

We present the asymptotic distribution of $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$ in the following theorem.

Theorem 2.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$, $\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)$ is a positive-definite matrix, and $c_{n}=e^{\alpha_{t}}/\pi_{0}\rightarrow c$ for a constant $c\in[0,\infty)$, then under the conditions in (2) and (3), as $n\rightarrow\infty$,

\sqrt{n_{1}}(\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}-\bm{\theta}_{t})\longrightarrow\mathbb{N}(\mathbf{0},\ \mathbf{V}_{\mathrm{under}}^{\mathrm{w}}),   (11)

in distribution, where

\mathbf{V}_{\mathrm{under}}^{\mathrm{w}}=\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})\mathbf{M}_{f}^{-1}\mathbf{M}_{\mathrm{under}}^{\mathrm{w}}\mathbf{M}_{f}^{-1},\quad\text{and}   (12)
\mathbf{M}_{\mathrm{under}}^{\mathrm{w}}=\mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}(1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})\mathbf{z}\mathbf{z}^{\mathrm{T}}\big\}.   (13)
Remark 2.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$, then from (3) and the dominated convergence theorem, we know that $n_{1}=ne^{\alpha_{t}}\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})\{1+o_{P}(1)\}$. Thus

c_{n}\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})=\frac{n_{1}}{n\pi_{0}}\{1+o_{P}(1)\}=\frac{n_{1}}{n_{0}\pi_{0}}\{1+o_{P}(1)\}.

Since $n_{0}\pi_{0}$ is the average number of controls in the under-sampled data, $c\,\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})$ can be interpreted as the asymptotic ratio of the number of cases to the number of controls in the under-sampled data. Therefore, since $\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})>0$ is a fixed constant, the value of $c$ has the following intuitive interpretations.

  • $c=0$: take many more controls than cases;

  • $0<c<\infty$: the number of controls to take is of the same order as the number of cases;

  • $c=\infty$: take far fewer controls than cases.

Theorem 2 requires that $0\leq c<\infty$. This means that the number of controls to take should not be significantly smaller than the number of cases, which is a very reasonable assumption.

Remark 3.

Theorem 2 shows that as long as $\pi_{0}$ does not make the number of controls in the under-sampled data much smaller than the number of cases $n_{1}$, the under-sampled estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$ preserves the convergence rate of the full-data estimator. Furthermore, if $c=0$ then $\mathbf{M}_{\mathrm{under}}^{\mathrm{w}}=\mathbf{M}_{f}$, which implies that $\mathbf{V}_{\mathrm{under}}^{\mathrm{w}}=\mathbf{V}_{f}$. This means that if one takes many more controls than cases, then asymptotically there is no estimation efficiency loss at all. Here, the number of controls to take can still be significantly smaller than $n_{0}$, so the computational burden is significantly reduced. If $c>0$, since $\mathbf{M}_{\mathrm{under}}^{\mathrm{w}}>\mathbf{M}_{f}$, we know that $\mathbf{V}_{\mathrm{under}}^{\mathrm{w}}>\mathbf{V}_{f}$ in the Loewner order. (For two Hermitian matrices $\mathbf{A}_{1}$ and $\mathbf{A}_{2}$ of the same dimension, $\mathbf{A}_{1}\geq\mathbf{A}_{2}$ if $\mathbf{A}_{1}-\mathbf{A}_{2}$ is positive semi-definite and $\mathbf{A}_{1}>\mathbf{A}_{2}$ if $\mathbf{A}_{1}-\mathbf{A}_{2}$ is positive definite.) Thus reducing the number of controls to the same order as the number of cases may reduce the estimation efficiency, although the convergence rate is the same as that of the full-data estimator.

3.2 Under-sampled unweighted estimator with bias correction

Based on the control under-sampled data, if we obtain an estimator from an unweighted objective function, say

\tilde{\bm{\theta}}_{\mathrm{under}}^{\mathrm{u}}=\arg\max_{\bm{\theta}}\ \ell_{\mathrm{under}}^{\mathrm{u}}(\bm{\theta})=\arg\max_{\bm{\theta}}\sum_{i=1}^{n}\delta_{i}\big[y_{i}\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}-\log\{1+e^{\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}}\}\big],

then in $\tilde{\bm{\theta}}_{\mathrm{under}}^{\mathrm{u}}=(\hat{\alpha}_{\mathrm{under}}^{\mathrm{u}},\hat{\bm{\beta}}_{\mathrm{under}}^{\mathrm{u}\,\mathrm{T}})^{\mathrm{T}}$, the intercept estimator $\hat{\alpha}_{\mathrm{under}}^{\mathrm{u}}$ is asymptotically biased while the slope estimator $\hat{\bm{\beta}}_{\mathrm{under}}^{\mathrm{u}}$ is still asymptotically unbiased. We correct the bias of $\hat{\alpha}_{\mathrm{under}}^{\mathrm{u}}$ using $\log(\pi_{0})$, and define the under-sampled unweighted estimator with bias correction $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$ as

\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}=\tilde{\bm{\theta}}_{\mathrm{under}}^{\mathrm{u}}+\mathbf{b},   (14)

where

\mathbf{b}=\{\log(\pi_{0}),0,...,0\}^{\mathrm{T}}.   (15)
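
In code, the bias-corrected estimator in (14)-(15) amounts to fitting an ordinary (unweighted) logistic regression on the retained subsample and then adding $\log(\pi_{0})$ to the intercept; a minimal sketch, reusing logistic_mle from the sketch after (5):

    import numpy as np

    def undersampled_ubc(X_sub, y_sub, pi0):
        # Unweighted MLE on the under-sampled data, then the correction (14)-(15):
        # add b = (log pi0, 0, ..., 0)^T to shift the intercept back.
        theta_tilde = logistic_mle(X_sub, y_sub)
        b = np.zeros_like(theta_tilde)
        b[0] = np.log(pi0)
        return theta_tilde + b

    # keep = delta == 1   (delta from the sketch after (9))
    # theta_under_ubc = undersampled_ubc(x[keep][:, None], y[keep], pi0=0.05)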

The following theorem gives the asymptotic distribution of $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$.

Theorem 3.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$, $\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)$ is a positive-definite matrix, and $e^{\alpha_{t}}/\pi_{0}\rightarrow c$ for a constant $c\in[0,\infty)$, then under the conditions in (2) and (3), as $n\rightarrow\infty$,

\sqrt{n_{1}}(\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}-\bm{\theta}_{t})\longrightarrow\mathbb{N}(\mathbf{0},\ \mathbf{V}_{\mathrm{under}}^{\mathrm{ubc}}),   (16)

in distribution, where

\mathbf{V}_{\mathrm{under}}^{\mathrm{ubc}}=\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})(\mathbf{M}_{\mathrm{under}}^{\mathrm{ubc}})^{-1},\quad\text{and}   (17)
\mathbf{M}_{\mathrm{under}}^{\mathrm{ubc}}=\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg).   (18)
Remark 4.

Similarly to the case of the under-sampled weighted estimator, Theorem 3 shows that the estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$ preserves the convergence rate of the full-data estimator if $c<\infty$. Furthermore, if $c>0$ then $\mathbf{V}_{\mathrm{under}}^{\mathrm{ubc}}>\mathbf{V}_{f}$; if $c=0$, then $\mathbf{V}_{\mathrm{under}}^{\mathrm{ubc}}=\mathbf{V}_{f}$.

The following proposition is useful to compare the asymptotic variances of the weighted and the unweighted estimators.

Proposition 1.

Let $\mathbf{v}$ be a random vector and $h$ be a positive scalar random variable. Assume that $\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})$, $\mathbb{E}(h\mathbf{v}\mathbf{v}^{\mathrm{T}})$, and $\mathbb{E}(h^{-1}\mathbf{v}\mathbf{v}^{\mathrm{T}})$ are all finite and positive-definite matrices. The following inequality holds in the Loewner order:

\big\{\mathbb{E}(h^{-1}\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}\leq\big\{\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}\mathbb{E}(h\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\{\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}.
Remark 5.

If we let $\mathbf{v}=e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}/2}\mathbf{z}$ and $h=1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}$ in Proposition 1, then we know that $\mathbf{V}_{\mathrm{under}}^{\mathrm{ubc}}\leq\mathbf{V}_{\mathrm{under}}^{\mathrm{w}}$ in the Loewner order. This indicates that with the same control under-sampled data, the unweighted estimator with bias correction, $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$, has a higher estimation efficiency than the weighted estimator, $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$.
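
The inequality in Proposition 1, with the substitution used in Remark 5, can be checked numerically by Monte Carlo. The sketch below uses illustrative values $\bm{\beta}_{t}=1$, $c=0.3$, and a standard normal covariate (our assumptions, not the paper's); the printed eigenvalues should be nonnegative up to Monte Carlo error.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(200_000)
    z = np.column_stack([np.ones_like(x), x])     # z = (1, x)^T
    v = z * np.exp(0.5 * x)[:, None]              # v = e^{beta_t^T x / 2} z, beta_t = 1
    h = 1.0 + 0.3 * np.exp(x)                     # h = 1 + c e^{beta_t^T x}, c = 0.3

    vvT = np.einsum('ni,nj->nij', v, v)
    lhs = np.linalg.inv((vvT / h[:, None, None]).mean(axis=0))   # {E(h^{-1} v v^T)}^{-1}
    mid = np.linalg.inv(vvT.mean(axis=0))                        # {E(v v^T)}^{-1}
    rhs = mid @ (vvT * h[:, None, None]).mean(axis=0) @ mid
    print(np.linalg.eigvalsh(rhs - lhs))          # Proposition 1: rhs - lhs >= 0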

4 Efficiency loss due to over-sampling

Another common practice in analyzing rare events data is to use all the controls and over-sample the cases. To investigate the effect of this approach, let $\tau_{i}$ denote the number of times that a data point is used, and define

\tau_{i}=y_{i}v_{i}+1,\quad i=1,...,n,   (19)

where $v_{i}\sim\mathbb{POI}(\lambda_{n})$, $i=1,...,n$, are i.i.d. Poisson random variables with parameter $\lambda_{n}$. For this over-sampling plan, a data point with $y_{i}=0$ is used only once, while a data point with $y_{i}=1$ is used in the over-sampled data $\mathbb{E}(\tau_{i}|\mathcal{D}_{n},y_{i}=1)=1+\lambda_{n}$ times on average. Here, $\lambda_{n}$ can be interpreted as the average over-sampling rate for cases.
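
A short sketch of the replication scheme in (19) (the function name is ours):

    import numpy as np

    def oversample_counts(y, lam, rng=None):
        # Replication counts tau_i from (19): each control is used once;
        # each case is used 1 + v_i times with v_i ~ Poisson(lam).
        rng = np.random.default_rng(rng)
        v = rng.poisson(lam, size=len(y))
        return y * v + 1

    # tau = oversample_counts(y, lam=3.48)
    # the over-sampled data repeat observation i tau[i] times, e.g.
    # x_over, y_over = np.repeat(x, tau), np.repeat(y, tau)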

Again, the case over-sampled data according to (19) is a biased sample, and we need to use a weighted objective function or to correct the bias of the estimator from an unweighted objective function.

4.1 Over-sampled weighted estimator

Let $w_{i}=\mathbb{E}(\tau_{i}|\mathcal{D}_{n})=1+\lambda_{n}y_{i}$. The case over-sampled weighted estimator, $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$, is the maximizer of

\ell_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta})=\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\big\{y_{i}\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}-\log(1+e^{\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}})\big\}.   (20)
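
Maximizing (20) is the same weighted fit as before, now with weights $\tau_{i}/w_{i}$; a brief sketch reusing the helpers defined in the earlier sketches:

    lam = 3.48
    tau = oversample_counts(y, lam)                           # sketch after (19)
    w = tau / (1.0 + lam * y)                                 # tau_i / w_i, with w_i = 1 + lambda_n y_i
    theta_over_w = weighted_logistic_mle(x[:, None], y, w)   # sketch after (10)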

The following theorem gives the asymptotic distribution of $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$.

Theorem 4.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$, $\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)$ is positive-definite, and $\lambda_{n}\rightarrow\lambda\geq 0$, then under the conditions in (2) and (3), as $n\rightarrow\infty$,

\sqrt{n_{1}}(\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}-\bm{\theta}_{t})\longrightarrow\mathbb{N}(\mathbf{0},\ \mathbf{V}_{\mathrm{over}}^{\mathrm{w}}),   (21)

in distribution, where

\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}=\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})\mathbf{M}_{f}^{-1}.   (22)
Remark 6.

Note that in (22), $\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\geq 1$, and the equality holds only if $\lambda=0$ or $\lambda=\infty$. Thus, $\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}\geq\mathbf{V}_{f}$, meaning that over-sampling the cases may result in estimation efficiency loss unless the number of over-sampled cases is small enough to be negligible ($\lambda=0$) or very large ($\lambda=\infty$). Considering that over-sampling incurs additional computational cost with potential estimation efficiency loss, this procedure is not recommended if the primary goal is parameter estimation.

4.2 Over-sampled unweighted estimator with bias correction

For completeness, we derive the asymptotic distribution of the over-sampled unweighted estimator with bias correction, $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$, defined as $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}=\tilde{\bm{\theta}}_{\mathrm{over}}^{\mathrm{u}}-\mathbf{b}_{o}$, where

\tilde{\bm{\theta}}_{\mathrm{over}}^{\mathrm{u}}=\arg\max_{\bm{\theta}}\ \ell_{\mathrm{over}}^{\mathrm{u}}(\bm{\theta})=\arg\max_{\bm{\theta}}\sum_{i=1}^{n}\tau_{i}\big[y_{i}\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}-\log\{1+e^{\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}}\}\big],   (23)

and

\mathbf{b}_{o}=(b_{o0},0,...,0)^{\mathrm{T}}=\{\log(1+\lambda_{n}),0,...,0\}^{\mathrm{T}}.   (24)

The following theorem is about the asymptotic distribution of $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$.

Theorem 5.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$, $\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)$ is positive-definite, $\lambda_{n}\rightarrow\lambda\geq 0$, and $\lambda_{n}e^{\alpha_{t}}\rightarrow c_{o}$ for a constant $c_{o}\in[0,\infty)$, then under the conditions in (2) and (3), as $n\rightarrow\infty$,

\sqrt{n_{1}}(\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}-\bm{\theta}_{t})\longrightarrow\mathbb{N}(\mathbf{0},\ \mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}),   (25)

in distribution, where

\mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}=\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\big)\mathbf{M}_{obc2}^{-1}\mathbf{M}_{obc1}\mathbf{M}_{obc2}^{-1},
\mathbf{M}_{obc1}=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\},\quad\text{ and }
\mathbf{M}_{obc2}=\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg).
Remark 7.

Unlike the case of under-sampled estimators, for over-sampled estimators the unweighted estimator with bias correction $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ has a lower estimation efficiency than the weighted estimator $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$. To see this, letting $h=(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{-1}$ and $\mathbf{v}=e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}/2}(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{-1/2}\mathbf{z}$ in Proposition 1, we know that $\mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}\geq\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}$, and the equality holds if $c_{o}=0$. Here, since $\lambda_{n}e^{\alpha_{t}}\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})=\frac{n_{1}\lambda_{n}}{n_{0}}\{1+o_{P}(1)\}$, we can intuitively interpret $c_{o}\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})$ as the asymptotic ratio of the average number of over-sampled case replicates to the number of controls. If in addition $\lambda=0$, then $\mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}=\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}=\mathbf{V}_{f}$; but in general, $\mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}\geq\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}\geq\mathbf{V}_{f}$.

Remark 8.

Compared with Theorem 4 for $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$, Theorem 5 for $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ requires the extra condition that $\lambda_{n}e^{\alpha_{t}}\rightarrow c_{o}\in[0,\infty)$. In addition, $\mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}\geq\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}$. Thus, if over-sampling has to be implemented, then we recommend using the weighted estimator $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$.

5 Numerical experiments

5.1 Full data estimator $\hat{\bm{\theta}}$

Consider model (1) with one covariate $x$ and $\bm{\theta}=(\alpha,\beta)^{\mathrm{T}}$. We set $\mathbb{P}(y=1)=0.02$, $0.004$, $0.0008$ and $0.00016$, and generate corresponding full data of sizes $n=10^{3}$, $10^{4}$, $10^{5}$ and $10^{6}$, respectively. As a result, the average numbers of cases ($y_{i}=1$) in the resulting data are $\mathbb{E}(n_{1})=20$, $40$, $80$ and $160$. This value configuration mimics the scenario that $n\rightarrow\infty$, $\mathbb{P}(y=1)\rightarrow 0$, and $\mathbb{E}(n_{1})\rightarrow\infty$. The covariates $x_{i}$ are generated from $\mathbb{N}(1,1)$ for cases ($y_{i}=1$) and from $\mathbb{N}(0,1)$ for controls ($y_{i}=0$). Under this setup, the true value of $\beta$ is fixed at $\beta_{t}=1$, and the true values of $\alpha$ are $\alpha_{t}=-4.39$, $-6.02$, $-7.63$ and $-9.24$, respectively, for the four different values of $n$. We repeat the simulation $S=1,000$ times and calculate empirical MSEs as $\text{eMSE}(\hat{\theta}_{j})=S^{-1}\sum_{s=1}^{S}(\hat{\theta}_{j}^{(s)}-\theta_{tj})^{2}$, $j=0,1$, where $\hat{\theta}_{0}=\hat{\alpha}$, $\hat{\theta}_{1}=\hat{\beta}$, and $\hat{\theta}_{j}^{(s)}$ is the estimate from the $s$-th repetition.
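
A minimal sketch of one scenario of this experiment is given below (reusing logistic_mle from the sketch after (5)). The stated $\alpha_{t}$ values follow because, under the Gaussian mixture above, $\mathbb{P}(y=1|x)$ is logistic with $\beta_{t}=1$ and $\alpha_{t}=\log\{\rho/(1-\rho)\}-1/2$, where $\rho=\mathbb{P}(y=1)$; for example, $\rho=0.02$ gives $\alpha_{t}\approx-4.39$.

    import numpy as np

    def emse_full_data(n, rho, S=1000, rng=None):
        # One scenario of Section 5.1: y ~ Bernoulli(rho), x | y=1 ~ N(1,1),
        # x | y=0 ~ N(0,1); the implied true parameters are
        # theta_t = (log(rho/(1-rho)) - 0.5, 1).
        rng = np.random.default_rng(rng)
        theta_t = np.array([np.log(rho / (1.0 - rho)) - 0.5, 1.0])
        sq_err = np.zeros(2)
        for _ in range(S):
            y = rng.binomial(1, rho, size=n)
            x = rng.normal(loc=y.astype(float), scale=1.0)   # mean 1 for cases, 0 for controls
            theta_hat = logistic_mle(x[:, None], y)          # sketch after (5)
            sq_err += (theta_hat - theta_t) ** 2
        return sq_err / S   # empirical MSEs of (alpha_hat, beta_hat)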

Table 1 presents the empirical MSEs (eMSEs) multiplied by $\mathbb{E}(n_{1})$ and by $n$, respectively. We see that $\mathbb{E}(n_{1})\times\text{eMSE}(\hat{\theta}_{j})$ does not diverge as $n$ increases, for both $\hat{\alpha}$ and $\hat{\beta}$. This confirms the conclusion in Theorem 1 that $\hat{\bm{\theta}}$ converges at a rate of $n_{1}^{-1/2}$ (it implies that $n_{1}\|\hat{\bm{\theta}}-\bm{\theta}_{t}\|^{2}=O_{P}(1)$). On the other hand, the values of $n\times\text{eMSE}(\hat{\theta}_{j})$ are large, and they increase quickly as $n$ increases, indicating that $n\|\hat{\bm{\theta}}-\bm{\theta}_{t}\|^{2}$ diverges to infinity. Table 1 confirms that although the full data sample sizes $n$ are very large, it is the values of $n_{1}$, which are much smaller, that reflect the real amount of available information about the regression parameters.

Table 1: Empirical MSE (eMSE) multiplied by $\mathbb{E}(n_{1})$ and $n$.

  $n$        $\mathbb{E}(n_{1})$   $\mathbb{E}(n_{1})\times$eMSE($\hat{\alpha}$)   $\mathbb{E}(n_{1})\times$eMSE($\hat{\beta}$)   $n\times$eMSE($\hat{\alpha}$)   $n\times$eMSE($\hat{\beta}$)
  $10^{3}$   20                    2.51                                            1.21                                           125.7                           60.6
  $10^{4}$   40                    2.06                                            1.09                                           515.5                           271.9
  $10^{5}$   80                    2.22                                            1.00                                           2774.4                          1248.8
  $10^{6}$   160                   2.16                                            1.08                                           13474.9                         6731.6

5.2 Sampling-based estimators

Now we provide numerical results for the under-sampled and over-sampled estimators. Consider model (1) with $n=10^{5}$, $x\sim\mathbb{N}(0,1)$, and $\bm{\theta}_{t}=(-6,1)^{\mathrm{T}}$, so that $\mathbb{P}(y=1)\approx 0.004$. For under-sampling, consider $\pi_{0}=0.005$, $0.01$, $0.05$, $0.1$, $0.2$, $0.5$, $0.8$, and $1.0$; for over-sampling, consider $\lambda_{n}=0$, $0.22$, $0.49$, $1.23$, $3.48$, $6.39$, $11.18$ and $53.6$, which correspond to $\log(1+\lambda_{n})=0$, $0.2$, $0.4$, $0.8$, $1.5$, $2.0$, $2.5$ and $4.0$, respectively. We repeat the simulation $S=1,000$ times and calculate empirical MSEs as

\text{eMSE}(\hat{\bm{\theta}}_{g})=\frac{1}{S}\sum_{s=1}^{S}\|\hat{\bm{\theta}}_{g}^{(s)}-\bm{\theta}_{t}\|^{2},

where $\hat{\bm{\theta}}_{g}^{(s)}$ is the estimate from the $s$-th repetition for a given estimator $\hat{\bm{\theta}}_{g}$. We consider $\hat{\bm{\theta}}_{g}=\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$, $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$, $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$, and $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$. Note that if $\pi_{0}=1$ then the under-sampled estimators become the full data estimator, i.e., $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}=\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}=\hat{\bm{\theta}}$; if $\lambda_{n}=0$, then the over-sampled estimators become the full data estimator, i.e., $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}=\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}=\hat{\bm{\theta}}$.

Figure 1 presents the simulation results. Figure 1(a) plots eMSEs ($\times 10^{3}$) against $\pi_{0}$. When $\pi_{0}$ is small, the number of controls in the under-sampled data is small, and the resulting estimators are not as efficient as the full-data estimator. For example, when $\pi_{0}=0.005$, the numbers of cases and controls are roughly the same, and we do see significant information loss in this case. However, as $\pi_{0}$ gets larger, the under-sampled estimators become more efficient, and when $\pi_{0}>0.1$, they perform almost as well as the full-data estimator. In addition, the unweighted estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$ is more efficient than the weighted estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$ for smaller $\pi_{0}$'s, and both perform more and more similarly to the full data estimator $\hat{\bm{\theta}}$ as $\pi_{0}$ grows. These observations are consistent with the conclusions in Theorems 2 and 3, and with the discussions in the relevant remarks.

Figure 1(b) plots eMSEs ($\times 10^{3}$) against $\log(\lambda_{n}+1)$. We see that the case over-sampled estimators are less efficient than the full data estimator unless the average over-sampling rate $\lambda_{n}$ is very small or very large. For small $\lambda_{n}$, $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$ and $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ perform similarly, but $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$ is more efficient than $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ for large $\lambda_{n}$. The reason for this phenomenon is that if $\lambda_{n}$ is large, then the condition $\lambda_{n}e^{\alpha_{t}}\rightarrow c_{o}\in[0,\infty)$ required in Theorem 5 for $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ may not be valid. This confirms our recommendation that the weighted estimator $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$ is preferable if over-sampling has to be used.

(a) eMSEs ($\times 10^{3}$) for under-sampling. (b) eMSEs for over-sampling.
Figure 1: Empirical MSEs ($\times 10^{3}$) of under-sampled and over-sampled estimators. The eMSE ($\times 10^{3}$) for the full data estimator $\hat{\bm{\theta}}$ (the horizontal line) is also plotted for comparison. A smaller eMSE means that the corresponding estimator has a higher estimation efficiency.

6 Discussion and future research

In this paper, we have obtained distributional results showing that the amount of information contained in massive data with rare events is at the scale of the relatively small total number of cases rather than the large total number of observations. We have further demonstrated that aggressively under-sampling the controls may not sacrifice the estimation efficiency at all while over-sampling the cases may reduce the estimation efficiency.

Although the current paper focuses on the logistic regression model, we conjecture that our conclusions hold more generally for rare events data, and we will investigate more complicated and general models in future research projects. As another direction, more comprehensive numerical experiments would be helpful to gain further understanding of parameter estimation with imbalanced data. This paper has focused on point estimation; how to make valid and more accurate statistical inference with rare events data still needs further research. There is a long-standing literature investigating the effects of under-sampling and over-sampling in classification. However, most investigations adopted an empirical approach, so theoretical investigations on the effects of sampling are still needed for classification.

Appendix

In this section, we prove all the theoretical results in the paper. To facilitate the presentation of the proofs, denote

a_{n}=\sqrt{ne^{\alpha_{t}}}.

The condition that $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$ implies that

\mathbb{E}(e^{t_{1}\|\mathbf{x}\|}\|\mathbf{z}\|^{t_{2}})<\infty,   (A.1)

for any $t_{1}>0$ and $t_{2}>0$, and we will use this result multiple times in the proof. The inequality in (A.1) is true because for any $t_{1}>0$ and $t_{2}>0$, we can choose $t>t_{1}$ and $k>t_{2}$ so that

e^{t\|\mathbf{x}\|}\geq e^{-t}e^{t\|\mathbf{z}\|}=e^{-t}e^{t_{1}\|\mathbf{z}\|}e^{(t-t_{1})\|\mathbf{z}\|}\geq\frac{(t-t_{1})^{k}e^{-t}}{k!}e^{t_{1}\|\mathbf{x}\|}\|\mathbf{z}\|^{k}\geq\frac{(t-t_{1})^{k}e^{-t}}{k!}e^{t_{1}\|\mathbf{x}\|}\|\mathbf{z}\|^{t_{2}}

with probability one.

Appendix A.1 Proof of Theorem 1

Proof of Theorem 1.

The estimator $\hat{\bm{\theta}}$ is the maximizer of

\ell(\bm{\theta})=\sum_{i=1}^{n}\big[(\alpha+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta})y_{i}-\log\{1+\exp(\alpha+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta})\}\big],   (A.2)

so $\mathbf{u}_{n}=a_{n}(\hat{\bm{\theta}}-\bm{\theta}_{t})$ is the maximizer of

\gamma(\mathbf{u})=\ell(\bm{\theta}_{t}+a_{n}^{-1}\mathbf{u})-\ell(\bm{\theta}_{t}).   (A.3)

By Taylor's expansion,

\gamma(\mathbf{u})=a_{n}^{-1}\mathbf{u}^{\mathrm{T}}\dot{\ell}(\bm{\theta}_{t})+0.5a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})(\mathbf{z}_{i}^{\mathrm{T}}\mathbf{u})^{2},   (A.4)

where $\phi_{i}(\bm{\theta})=p_{i}(\alpha,\bm{\beta})\{1-p_{i}(\alpha,\bm{\beta})\}$, and

\dot{\ell}(\bm{\theta})=\frac{\partial\ell(\bm{\theta})}{\partial\bm{\theta}}=\sum_{i=1}^{n}\{y_{i}-p_{i}(\bm{\theta})\}\mathbf{z}_{i}=\sum_{i=1}^{n}\{y_{i}-p_{i}(\alpha,\bm{\beta})\}\mathbf{z}_{i}

is the gradient of $\ell(\bm{\theta})$, and $\acute{\mathbf{u}}$ lies between $\mathbf{0}$ and $\mathbf{u}$. If we can show that

a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})\longrightarrow\mathbb{N}\big(\mathbf{0},\ \mathbf{M}_{f}\big),   (A.5)

in distribution, and for any $\mathbf{u}$,

a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbf{M}_{f},   (A.6)

in probability, then from the Basic Corollary on page 2 of Hjort and Pollard (2011), we know that $a_{n}(\hat{\bm{\theta}}-\bm{\theta}_{t})$, the maximizer of $\gamma(\mathbf{u})$, satisfies

a_{n}(\hat{\bm{\theta}}-\bm{\theta}_{t})=\mathbf{M}_{f}^{-1}\times a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})+o_{P}(1).   (A.7)

Slutsky's theorem together with (A.5) and (A.7) implies the result in Theorem 1. We prove (A.5) and (A.6) in the following.

Note that

\dot{\ell}(\bm{\theta}_{t})=\sum_{i=1}^{n}\big\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\big\}\mathbf{z}_{i}   (A.8)

is a summation of i.i.d. quantities. Since $\alpha_{t}\rightarrow-\infty$ as $n\rightarrow\infty$, the distribution of $\{y-p(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}$ depends on $n$, so we need to use a central limit theorem for triangular arrays. The Lindeberg-Feller central limit theorem (see Section 2.8 of van der Vaart, 1998) is appropriate.

We examine the mean and variance of $a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})$. For the mean, from the fact that

\mathbb{E}[\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}]=\mathbb{E}[\mathbb{E}\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})|\mathbf{z}_{i}\}\mathbf{z}_{i}]=\mathbf{0},

we know that $\mathbb{E}\{a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})\}=\mathbf{0}$.

For the variance,

\mathbb{V}\{a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})\}=a_{n}^{-2}\sum_{i=1}^{n}\mathbb{V}[\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}]=a_{n}^{-2}n\mathbb{E}\{\phi(\bm{\theta}_{t})\mathbf{z}\mathbf{z}^{\mathrm{T}}\}=a_{n}^{-2}n\mathbb{E}\bigg\{\frac{e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\bigg\}=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\bigg\}.

Note that

\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\longrightarrow e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}},

almost surely, and

\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\leq e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}\quad\text{ with }\quad\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2})<\infty.

Thus, from the dominated convergence theorem,

\mathbb{V}\{a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})\}=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\bigg\}\longrightarrow\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big).

Now we check the Lindeberg-Feller condition. For any $\epsilon>0$,

\sum_{i=1}^{n}\mathbb{E}\Big[\|\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}\|^{2}I(\|\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}\|>a_{n}\epsilon)\Big]
=n\mathbb{E}\Big[\|\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\Big]
=n\mathbb{E}\big[p(\bm{\theta}_{t})\{1-p(\bm{\theta}_{t})\}^{2}\|\mathbf{z}\|^{2}I(\|\{1-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
\quad+n\mathbb{E}\big[\{1-p(\bm{\theta}_{t})\}\{p(\bm{\theta}_{t})\}^{2}\|\mathbf{z}\|^{2}I(\|p(\bm{\theta}_{t})\mathbf{z}\|>a_{n}\epsilon)\big]
\leq n\mathbb{E}\big[p(\bm{\theta}_{t})\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big]+n\mathbb{E}\big[\{p(\bm{\theta}_{t})\}^{2}\|\mathbf{z}\|^{2}I(\|p(\bm{\theta}_{t})\mathbf{z}\|>a_{n}\epsilon)\big]
\leq a_{n}^{2}\mathbb{E}\{e^{\|\bm{\beta}_{t}\|\|\mathbf{x}\|}\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\}+a_{n}^{2}\mathbb{E}\{e^{\|\bm{\beta}_{t}\|\|\mathbf{x}\|}\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\}
=o(a_{n}^{2}),

where the last step is from the dominated convergence theorem. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998), we finish the proof of (A.5).

The last step is to prove (A.6). We first show that

\bigg|a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\|\mathbf{z}_{i}\|^{2}-a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t})\|\mathbf{z}_{i}\|^{2}\bigg|
\leq a_{n}^{-2}\sum_{i=1}^{n}\big|\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})-\phi_{i}(\bm{\theta}_{t})\big|\|\mathbf{z}_{i}\|^{2}
\leq\|a_{n}^{-1}\acute{\mathbf{u}}\|a_{n}^{-2}\sum_{i=1}^{n}p_{i}(\bm{\theta}_{t}+a_{n}^{-1}\breve{\mathbf{u}})\|\mathbf{z}_{i}\|^{3}
=\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{n}\sum_{i=1}^{n}\frac{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}+a_{n}^{-1}\breve{\mathbf{u}}^{\mathrm{T}}\mathbf{z}_{i}}}{\{1+e^{\bm{\theta}_{t}^{\mathrm{T}}\mathbf{z}_{i}+a_{n}^{-1}\breve{\mathbf{u}}^{\mathrm{T}}\mathbf{z}_{i}}\}^{2}}\|\mathbf{z}_{i}\|^{3}
\leq\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{n}\sum_{i=1}^{n}e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)(1+\|\mathbf{x}_{i}\|)}\|\mathbf{z}_{i}\|^{3}=o_{P}(1).   (A.9)

Here $\breve{\mathbf{u}}$ lies between $\mathbf{0}$ and $\acute{\mathbf{u}}$, and thus $\|a_{n}^{-1}\breve{\mathbf{u}}\|\leq\|\mathbf{u}\|$ for $a_{n}\geq 1$.

To finish the proof, we only need to prove that

a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}),   (A.10)

in probability. This is done by noting that

a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\frac{1}{ne^{\alpha_{t}}}\sum_{i=1}^{n}\frac{e^{\bm{\theta}_{t}^{\mathrm{T}}\mathbf{z}_{i}}}{(1+e^{\bm{\theta}_{t}^{\mathrm{T}}\mathbf{z}_{i}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}   (A.11)
=\frac{1}{n}\sum_{i=1}^{n}\frac{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\bm{\theta}_{t}^{\mathrm{T}}\mathbf{z}_{i}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}})+o_{P}(1),   (A.12)

by Proposition 1 of Wang (2019). ∎

Appendix A.2 Proof of Theorem 2

Proof of Theorem 2.

The estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$ is the maximizer of $\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta})$ defined in (10), so $a_{n}(\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}-\bm{\theta}_{t})$ is the maximizer of $\gamma_{\mathrm{under}}^{\mathrm{w}}(\mathbf{u})=\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta}_{t}+a_{n}^{-1}\mathbf{u})-\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta}_{t})$. By Taylor's expansion,

\gamma_{\mathrm{under}}^{\mathrm{w}}(\mathbf{u})=\frac{1}{a_{n}}\mathbf{u}^{\mathrm{T}}\dot{\ell}_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta}_{t})+\frac{1}{2a_{n}^{2}}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})(\mathbf{z}_{i}^{\mathrm{T}}\mathbf{u})^{2},   (A.13)

where

\dot{\ell}_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta})=\frac{\partial\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta})}{\partial\bm{\theta}}=\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\{y_{i}-p_{i}(\bm{\theta})\}\mathbf{z}_{i}=\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\{y_{i}-p_{i}(\alpha,\bm{\beta})\}\mathbf{z}_{i}

is the gradient of $\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta})$, and $\acute{\mathbf{u}}$ lies between $\mathbf{0}$ and $\mathbf{u}$. Similarly to the proof of Theorem 1, we only need to show that

a_{n}^{-1}\dot{\ell}_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta}_{t})\longrightarrow\mathbb{N}\Big[\mathbf{0},\ \mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}(1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})\mathbf{z}\mathbf{z}^{\mathrm{T}}\big\}\Big],   (A.14)

in distribution, and for any $\mathbf{u}$,

a_{n}^{-2}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big),   (A.15)

in probability.

We prove (A.14) first. Recall that $\mathcal{D}_{n}$ is the full data set and $\delta_{i}=y_{i}+(1-y_{i})I(u_{i}\leq\pi_{0})$, satisfying

\pi_{i}=\mathbb{E}(\delta_{i}|\mathcal{D}_{n})=y_{i}+(1-y_{i})\pi_{0}=\pi_{0}+(1-\pi_{0})y_{i}.

We notice that

\mathbb{E}(\delta_{i}|\mathbf{z}_{i})=p_{i}(\alpha_{t},\bm{\beta}_{t})+\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\pi_{0}=\pi_{0}+(1-\pi_{0})p_{i}(\alpha_{t},\bm{\beta}_{t}).

Let $\eta_{i}=\frac{\delta_{i}}{\pi_{i}}\{y_{i}-p_{i}(\bm{\theta}_{t})\}\mathbf{z}_{i}$. Then $\eta_{i}$, $i=1,...,n$, are i.i.d., with the underlying distribution of $\eta_{i}$ depending on $n$. From direct calculation, we have

\mathbb{E}(\eta_{i}|\mathbf{z}_{i})=\mathbf{0},\quad\text{ and}
\mathbb{V}(\eta_{i}|\mathbf{z}_{i})=\mathbb{E}\bigg[\frac{\{y_{i}-p_{i}(\bm{\theta}_{t})\}^{2}}{\pi_{0}+y_{i}(1-\pi_{0})}\bigg|\mathbf{z}_{i}\bigg]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}
=\big[p_{i}(\bm{\theta}_{t})\{1-p_{i}(\bm{\theta}_{t})\}^{2}+\pi_{0}^{-1}\{1-p_{i}(\bm{\theta}_{t})\}\{p_{i}(\bm{\theta}_{t})\}^{2}\big]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}
=\big\{1-p_{i}(\bm{\theta}_{t})+\pi_{0}^{-1}p_{i}(\bm{\theta}_{t})\big\}p_{i}(\bm{\theta}_{t})\{1-p_{i}(\bm{\theta}_{t})\}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}
=\frac{1+\pi_{0}^{-1}e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}p_{i}(\bm{\theta}_{t})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}
\leq e^{\alpha_{t}}(1+\pi_{0}^{-1}e^{\alpha_{t}}e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}.

Thus, by the dominated convergence theorem, we obtain that

\mathbb{V}(\eta_{i})=\mathbb{E}\{\mathbb{V}(\eta_{i}|\mathbf{z}_{i})\}=e^{\alpha_{t}}\mathbb{E}\Big\{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}(1+ce^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\Big\}\{1+o(1)\}.   (A.16)

Now we check the Lindeberg-Feller condition (Section 2.8 of van der Vaart, 1998). For simplicity, let $\pi=\pi_{0}+(1-\pi_{0})y$ and $\delta=y+(1-y)I(u\leq\pi)$, where $u\sim\mathbb{U}(0,1)$. For any $\epsilon>0$,

\sum_{i=1}^{n}\mathbb{E}\big\{\|\eta_{i}\|^{2}I(\|\eta_{i}\|>a_{n}\epsilon)\big\}
=n\mathbb{E}\big[\|\pi^{-1}\delta\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\pi^{-1}\delta\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
=\pi_{0}n\mathbb{E}\big[\|\pi^{-1}\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\pi^{-1}\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
\quad+(1-\pi_{0})n\mathbb{E}\big[\pi^{-1}\|y\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\pi^{-1}y\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
=\pi_{0}n\mathbb{E}\big[p(\bm{\theta}_{t})\|\{1-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\{1-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
\quad+\pi_{0}^{-1}n\mathbb{E}\big[\{1-p(\bm{\theta}_{t})\}\|p(\bm{\theta}_{t})\mathbf{z}\|^{2}I(\pi_{0}^{-1}\|p(\bm{\theta}_{t})\mathbf{z}\|>a_{n}\epsilon)\big]
\quad+(1-\pi_{0})n\mathbb{E}\big[p(\bm{\theta}_{t})\|\{1-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\{1-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
\leq n\mathbb{E}\big\{p(\bm{\theta}_{t})\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big\}+n\pi_{0}^{-1}\mathbb{E}\big\{\|p(\bm{\theta}_{t})\mathbf{z}\|^{2}I(\|\pi_{0}^{-1}p(\bm{\theta}_{t})\mathbf{z}\|>a_{n}\epsilon)\big\}
\leq ne^{\alpha_{t}}\mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big\}+n\pi_{0}^{-1}e^{2\alpha_{t}}\mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}I(\pi_{0}^{-1}e^{\alpha_{t}}e^{\alpha_{t}}\|\mathbf{z}\|>a_{n}\epsilon)\big\}
=o(ne^{\alpha_{t}})=o(a_{n}^{2}),

where the second last step is from the dominated convergence theorem and the facts that $a_{n}\rightarrow\infty$ and $\lim_{n\rightarrow\infty}e^{\alpha_{t}}/\pi_{0}=c<\infty$. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998) finishes the proof of (A.14).

Now we prove (A.15). By direct calculation, we first notice that

\begin{align*}
\Delta_{1}\equiv a_{n}^{-2}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\phi_{i}(\bm{\theta}_{t})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\frac{1}{n}\sum_{i=1}^{n}\frac{\{y_{i}+(1-y_{i})I(u_{i}\leq\pi_{0})\}e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{\pi_{i}(1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}
\tag{A.17}
\end{align*}

has a mean of

\begin{align*}
\mathbb{E}(\Delta_{1})=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}=\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)+o(1),
\tag{A.18}
\end{align*}

where the last step is by the dominated convergence theorem. In addition, the variance of each component of $\Delta_{1}$ is bounded by

\begin{align*}
\frac{1}{n}\mathbb{E}\bigg\{\frac{e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4}}{\pi(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{4}}\bigg\}\leq\frac{\mathbb{E}(e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4})}{n\pi_{0}}=o(1),
\tag{A.19}
\end{align*}

where the last step is because $ne^{\alpha_{t}}\rightarrow\infty$ and $e^{\alpha_{t}}/\pi_{0}\rightarrow c<\infty$ imply that $n\pi_{0}\rightarrow\infty$. From (A.18) and (A.19), Chebyshev's inequality implies that $\Delta_{1}\rightarrow\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)$ in probability. Notice that

\begin{align*}
&\bigg|a_{n}^{-2}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\|\mathbf{z}_{i}\|^{2}-a_{n}^{-2}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\phi_{i}(\bm{\theta}_{t})\|\mathbf{z}_{i}\|^{2}\bigg|\\
&\leq\|a_{n}^{-1}\acute{\mathbf{u}}\|\,a_{n}^{-2}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}p_{i}(\bm{\theta}_{t}+a_{n}^{-1}\breve{\mathbf{u}})\|\mathbf{z}_{i}\|^{3}\\
&\leq\|a_{n}^{-1}\acute{\mathbf{u}}\|\times\frac{1}{n}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)\|\mathbf{z}_{i}\|}\|\mathbf{z}_{i}\|^{3}\equiv\|a_{n}^{-1}\acute{\mathbf{u}}\|\times\Delta_{2}.
\end{align*}

Since $\|a_{n}^{-1}\acute{\mathbf{u}}\|\rightarrow 0$, to finish the proof of (A.15), we only need to prove that $\Delta_{2}$ is bounded in probability. Using an approach similar to (A.18) and (A.19), we can show that $\Delta_{2}$ has a mean that is bounded and a variance that converges to zero.
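As a purely numerical illustration of the weighted under-sampled estimating equation analyzed above (it keeps every case with weight one and every sampled control with weight $1/\pi_{0}$), the following Python sketch simulates a rare-events logistic model, under-samples the controls with acceptance probability $\pi_{0}$, and solves the weighted score equation by Newton-Raphson. All simulation settings ($n$, $\pi_{0}$, $\bm{\theta}_{t}$, the covariate distribution) are arbitrary assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative rare-events design (all settings are assumptions for this sketch).
n, d = 200_000, 3
alpha_t, beta_t = -7.0, np.array([0.5, -0.5, 1.0])
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha_t + X @ beta_t))))

# Under-sample the controls: keep every case, keep a control with probability pi0.
pi0 = 0.01
keep = (y == 1) | (rng.uniform(size=n) < pi0)
w = np.where(y == 1, 1.0, 1.0 / pi0)[keep]            # delta_i / pi_i weights
Z = np.column_stack([np.ones(keep.sum()), X[keep]])   # z_i = (1, x_i^T)^T
yk = y[keep]

# Solve the weighted under-sampled score equation by Newton-Raphson (plain IRLS).
theta = np.zeros(d + 1)
for _ in range(30):
    pr = 1.0 / (1.0 + np.exp(-(Z @ theta)))
    score = Z.T @ (w * (yk - pr))
    hess = (Z * (w * pr * (1.0 - pr))[:, None]).T @ Z
    step = np.linalg.solve(hess, score)
    theta += step
    if np.max(np.abs(step)) < 1e-8:
        break

print("weighted under-sampled estimate:", theta)
print("true theta_t:                   ", np.r_[alpha_t, beta_t])
```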

Appendix A.3 Proof of Theorem 3

Proof of Theorem 3.

If we use $\Upsilon_{bc}$ to denote the under-sampled objective function shifted by $\mathbf{b}$, i.e., $\Upsilon_{bc}(\bm{\theta})=\ell_{\mathrm{under}}^{\mathrm{u}}(\bm{\theta}-\mathbf{b})$, then the estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$ is the maximizer of

\begin{align*}
\Upsilon_{bc}(\bm{\theta})=\sum_{i=1}^{n}\delta_{i}\big[(\bm{\theta}-\mathbf{b})^{\mathrm{T}}\mathbf{z}_{i}y_{i}-\log\{1+e^{(\bm{\theta}-\mathbf{b})^{\mathrm{T}}\mathbf{z}_{i}}\}\big].
\tag{A.20}
\end{align*}

We notice that $a_{n}(\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}-\bm{\theta}_{t})$ is the maximizer of $\gamma_{bc}(\mathbf{u})=\Upsilon_{bc}(\bm{\theta}_{t}+a_{n}^{-1}\mathbf{u})-\Upsilon_{bc}(\bm{\theta}_{t})$. By Taylor's expansion,

\begin{align*}
\gamma_{bc}(\mathbf{u})=\frac{1}{a_{n}}\mathbf{u}^{\mathrm{T}}\dot{\Upsilon}_{bc}(\bm{\theta}_{t})+\frac{1}{2a_{n}^{2}}\sum_{i=1}^{n}\delta_{i}\phi_{i}(\bm{\theta}_{t}-\mathbf{b}+a_{n}^{-1}\acute{\mathbf{u}})(\mathbf{z}_{i}^{\mathrm{T}}\mathbf{u})^{2},
\tag{A.21}
\end{align*}

where

\[
\dot{\Upsilon}_{bc}(\bm{\theta})=\frac{\partial\Upsilon_{bc}(\bm{\theta})}{\partial\bm{\theta}}=\sum_{i=1}^{n}\delta_{i}\{y_{i}-p_{i}(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}_{i}=\sum_{i=1}^{n}\delta_{i}\{y_{i}-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}\mathbf{z}_{i}
\]

is the gradient of $\Upsilon_{bc}(\bm{\theta})$, and $\acute{\mathbf{u}}$ lies between $\mathbf{0}$ and $\mathbf{u}$.

Similarly to the proof of Theorem 1, we only need to show that

\begin{align*}
a_{n}^{-1}\dot{\Upsilon}_{bc}(\bm{\theta}_{t})\longrightarrow\mathbb{N}\bigg\{\mathbf{0},\ \mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\bigg)\bigg\},
\tag{A.22}
\end{align*}

in distribution, and for any $\mathbf{u}$,

\begin{align*}
a_{n}^{-2}\sum_{i=1}^{n}\delta_{i}\phi_{i}(\bm{\theta}_{t}-\mathbf{b}+a_{n}^{-1}\acute{\mathbf{u}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbb{E}\bigg(\frac{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{1+ce^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\bigg)
\tag{A.23}
\end{align*}

in probability.

We prove (A.22) first. Define $\eta_{ui}=\delta_{i}\{y_{i}-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}\mathbf{z}_{i}$. We have that

\begin{align*}
\mathbb{E}(\eta_{ui}\mid\mathbf{z}_{i})
&=\mathbb{E}[\{\pi_{0}+y_{i}(1-\pi_{0})\}\{y_{i}-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}\mid\mathbf{z}_{i}]\mathbf{z}_{i}\\
&=[p_{i}(\alpha_{t},\bm{\beta}_{t})\{1-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}-\pi_{0}\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}p_{i}(\alpha_{t}-b,\bm{\beta}_{t})]\mathbf{z}_{i}=\mathbf{0},
\end{align*}

which implies that $\mathbb{E}(\eta_{ui})=\mathbf{0}$. For the conditional variance,

\begin{align*}
\mathbb{V}(\eta_{ui}\mid\mathbf{z}_{i})
&=\mathbb{E}[\{\pi_{0}+y_{i}(1-\pi_{0})\}\{y_{i}-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}^{2}\mid\mathbf{z}_{i}]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\big[p_{i}(\alpha_{t},\bm{\beta}_{t})\{1-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}^{2}+\pi_{0}\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\{p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}^{2}\big]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}+\pi_{0}e^{2(\alpha_{t}-b_{0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t})}}{(1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})(1+e^{\alpha_{t}-b_{0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{1+e^{\alpha_{t}-b_{0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\leq e^{\alpha_{t}}e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}},
\end{align*}

where $e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}$ is integrable. Thus, by the dominated convergence theorem, $\mathbb{V}(\eta_{ui})$ satisfies that

\begin{align*}
\mathbb{V}(\eta_{ui})=\mathbb{E}\{\mathbb{V}(\eta_{ui}\mid\mathbf{z}_{i})\}
=e^{\alpha_{t}}\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg)\{1+o(1)\}.
\tag{A.24}
\end{align*}

Therefore, we have

\begin{align*}
a_{n}^{-2}\sum_{i=1}^{n}\mathbb{V}(\eta_{ui})\longrightarrow\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg).
\tag{A.25}
\end{align*}

Now we check the Lindeberg-Feller condition. For any $\epsilon>0$,

\begin{align*}
\sum_{i=1}^{n}\mathbb{E}\big\{\|\eta_{ui}\|^{2}&I(\|\eta_{ui}\|>a_{n}\epsilon)\big\}\\
=&\,n\mathbb{E}\big[\|\delta\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|^{2}I(\|\delta\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
=&\,\pi_{0}n\mathbb{E}\big[\|\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|^{2}I(\|\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
&+(1-\pi_{0})n\mathbb{E}\big[\|y\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|^{2}I(\|y\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
=&\,\pi_{0}n\mathbb{E}\big[p(\bm{\theta}_{t})\|\{1-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|^{2}I(\|\{1-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
&+\pi_{0}n\mathbb{E}\big[\{1-p(\bm{\theta}_{t})\}\|p(\bm{\theta}_{t}-\mathbf{b})\mathbf{z}\|^{2}I(\|p(\bm{\theta}_{t}-\mathbf{b})\mathbf{z}\|>a_{n}\epsilon)\big]\\
&+(1-\pi_{0})n\mathbb{E}\big[p(\bm{\theta}_{t})\|\{1-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|^{2}I(\|\{1-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
\leq&\,n\mathbb{E}\big\{p(\bm{\theta}_{t})\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big\}+\pi_{0}n\mathbb{E}\big[\|p(\bm{\theta}_{t}-\mathbf{b})\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big]\\
\leq&\,ne^{\alpha_{t}}\mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big\}+\pi_{0}^{-1}ne^{2\alpha_{t}}\mathbb{E}\big\{e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big\}\\
=&\,o(ne^{\alpha_{t}})=o(a_{n}^{2}),
\end{align*}

where the second last step is from the dominated convergence theorem. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998) finishes the proof of (A.22).

Now we prove (A.23). First, letting

\begin{align*}
\Delta_{3}\equiv a_{n}^{-2}\sum_{i=1}^{n}\delta_{i}\phi_{i}(\bm{\theta}_{t}-\mathbf{b})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\frac{1}{n}\sum_{i=1}^{n}\frac{\{y_{i}+(1-y_{i})I(u_{i}\leq\pi_{0})\}e^{-b_{0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}-b_{0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}},
\tag{A.26}
\end{align*}

the mean of $\Delta_{3}$ satisfies that

\begin{align*}
\mathbb{E}(\Delta_{3})=\mathbb{E}\bigg[\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})(1+e^{\alpha_{t}-b_{0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg]=\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg)+o(1),
\tag{A.27}
\end{align*}

by the dominated convergence theorem, and the variance of each component of $\Delta_{3}$ is bounded by

\begin{align*}
\frac{1}{n}\mathbb{E}\bigg[\frac{\{y+(1-y)I(u\leq\pi_{0})\}e^{-2b_{0}+2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}-b_{0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{4}}\|\mathbf{z}\|^{4}\bigg]\leq\frac{\mathbb{E}(e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4})}{n\pi_{0}}=o(1).
\tag{A.28}
\end{align*}

Thus, Chebyshev’s inequality implies that

\begin{align*}
\Delta_{3}\longrightarrow\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg),
\tag{A.29}
\end{align*}

in probability. Furthermore,

\begin{align*}
&\bigg|a_{n}^{-2}\sum_{i=1}^{n}\delta_{i}\phi_{i}(\bm{\theta}_{t}-\mathbf{b}+a_{n}^{-1}\acute{\mathbf{u}})\|\mathbf{z}_{i}\|^{2}-a_{n}^{-2}\sum_{i=1}^{n}\delta_{i}\phi_{i}(\bm{\theta}_{t}-\mathbf{b})\|\mathbf{z}_{i}\|^{2}\bigg|\\
&\leq\|a_{n}^{-1}\acute{\mathbf{u}}\|\,a_{n}^{-2}\sum_{i=1}^{n}\delta_{i}p_{i}(\bm{\theta}_{t}-\mathbf{b}+a_{n}^{-1}\breve{\mathbf{u}})\|\mathbf{z}_{i}\|^{3}\\
&\leq\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{n}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{0}}e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)(1+\|\mathbf{x}_{i}\|)}\|\mathbf{z}_{i}\|^{3}\equiv\|a_{n}^{-1}\acute{\mathbf{u}}\|\times\Delta_{4}=o_{P}(1),
\tag{A.30}
\end{align*}

where the last step is because $\Delta_{4}$ is bounded in probability, due to the fact that it has a bounded mean and a variance that converges to zero. Combining (A.29) and (A.30), (A.23) follows. ∎
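The zero-conditional-mean calculation above holds exactly when $e^{b}=\pi_{0}$, i.e., $b=\log\pi_{0}$, so the bias-corrected under-sampled estimator can be computed by fitting an ordinary (unweighted) logistic regression to the under-sampled data and shifting only the intercept by $\log\pi_{0}$. A minimal illustrative sketch, under the same assumed simulation settings as the previous snippet (not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200_000, 3
alpha_t, beta_t = -7.0, np.array([0.5, -0.5, 1.0])
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha_t + X @ beta_t))))

pi0 = 0.01
keep = (y == 1) | (rng.uniform(size=n) < pi0)         # keep all cases, pi0 of controls
Z = np.column_stack([np.ones(keep.sum()), X[keep]])
yk = y[keep]

# Plain (unweighted) MLE on the under-sampled data.
theta = np.zeros(d + 1)
for _ in range(30):
    pr = 1.0 / (1.0 + np.exp(-(Z @ theta)))
    hess = (Z * (pr * (1.0 - pr))[:, None]).T @ Z
    theta += np.linalg.solve(hess, Z.T @ (yk - pr))

theta_bc = theta.copy()
theta_bc[0] += np.log(pi0)   # intercept shift b = log(pi0), since the proof needs e^b = pi0
print("bias-corrected under-sampled estimate:", theta_bc)
print("true theta_t:                         ", np.r_[alpha_t, beta_t])
```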

Appendix A.4 Proof of Proposition 1

Proof of Proposition 1.

Let

\[
\mathbf{g}=\frac{1}{\sqrt{h}}\big\{\mathbb{E}(h^{-1}\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}\mathbf{v}-\sqrt{h}\big\{\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}\mathbf{v}.
\]

Since $\mathbf{g}\mathbf{g}^{\mathrm{T}}\geq\mathbf{0}$, we have

\[
\mathbf{0}\leq\mathbb{E}(\mathbf{g}\mathbf{g}^{\mathrm{T}})=\big\{\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}\mathbb{E}(h\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\{\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}-\big\{\mathbb{E}(h^{-1}\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1},
\]

which finishes the proof. ∎
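A quick Monte Carlo sanity check of this matrix inequality (illustration only; the distribution of $\mathbf{v}$ and the positive weight $h$ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 500_000, 3
v = rng.normal(size=(m, d))               # any v with a nonsingular E(v v^T)
h = rng.uniform(0.2, 5.0, size=m)         # any positive weight h

E_vv     = (v.T @ v) / m
E_hvv    = (v.T @ (v * h[:, None])) / m
E_hinvvv = (v.T @ (v * (1.0 / h)[:, None])) / m

A   = np.linalg.inv(E_vv)
lhs = A @ E_hvv @ A                       # {E(vv^T)}^{-1} E(h vv^T) {E(vv^T)}^{-1}
rhs = np.linalg.inv(E_hinvvv)             # {E(h^{-1} vv^T)}^{-1}

# The difference should be positive semi-definite (up to Monte Carlo error).
print("smallest eigenvalue of lhs - rhs:", np.linalg.eigvalsh(lhs - rhs).min())
```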

Appendix A.5 Proof of Theorem 4

Proof of Theorem 4.

The estimator $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$ is the maximizer of (20), so $a_{n}(\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}-\bm{\theta}_{t})$ is the maximizer of $\gamma_{\mathrm{over}}^{\mathrm{w}}(\mathbf{u})=\ell_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta}_{t}+a_{n}^{-1}\mathbf{u})-\ell_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta}_{t})$. By Taylor's expansion,

\begin{align*}
\gamma_{\mathrm{over}}^{\mathrm{w}}(\mathbf{u})=\frac{1}{a_{n}}\mathbf{u}^{\mathrm{T}}\dot{\ell}_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta}_{t})+\frac{1}{2a_{n}^{2}}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})(\mathbf{z}_{i}^{\mathrm{T}}\mathbf{u})^{2},
\tag{A.31}
\end{align*}

where

\[
\dot{\ell}_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta})=\frac{\partial\ell_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta})}{\partial\bm{\theta}}=\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\{y_{i}-p_{i}(\bm{\theta}_{t})\}\mathbf{z}_{i}=\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}
\]

is the gradient of $\ell_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta})$, and $\acute{\mathbf{u}}$ lies between $\mathbf{0}$ and $\mathbf{u}$. Similarly to the proof of Theorem 1, we only need to show that

\begin{align*}
a_{n}^{-1}\dot{\ell}_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta}_{t})\longrightarrow\mathbb{N}\Big\{\mathbf{0},\ \frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)\Big\},
\tag{A.32}
\end{align*}

in distribution, and for any $\mathbf{u}$,

\begin{align*}
a_{n}^{-2}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big),
\tag{A.33}
\end{align*}

in probability.

We prove (A.32) first. Denote $\eta_{owi}=\tau_{i}w_{i}^{-1}\{y_{i}-p_{i}(\bm{\theta}_{t})\}\mathbf{z}_{i}$, so $\eta_{owi}$, $i=1,\ldots,n$, are i.i.d. with the underlying distribution of $\eta_{owi}$ depending on $n$. From direct calculation, we have

\begin{align*}
\mathbb{E}(\eta_{owi}\mid\mathbf{z}_{i})
&=\mathbf{0},\quad\text{and}\\
\mathbb{V}(\eta_{owi}\mid\mathbf{z}_{i})
&=\mathbb{E}\bigg[\frac{\{y_{i}(3\lambda_{n}+\lambda_{n}^{2})+1\}\{y_{i}-p_{i}(\bm{\theta}_{t})\}^{2}}{(1+\lambda_{n}y_{i})^{2}}\bigg|\mathbf{z}_{i}\bigg]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\bigg[p_{i}(\bm{\theta}_{t})\{1-p_{i}(\bm{\theta}_{t})\}^{2}\frac{(1+\lambda_{n})^{2}+\lambda_{n}}{(1+\lambda_{n})^{2}}+\{1-p_{i}(\bm{\theta}_{t})\}\{p_{i}(\bm{\theta}_{t})\}^{2}\bigg]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{(1+\lambda_{n})^{2}+\lambda_{n}}{(1+\lambda_{n})^{2}}e^{\alpha_{t}}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\{1+o_{P}(1)\},
\end{align*}

where the $o_{P}(1)$ term is bounded. Thus, by the dominated convergence theorem, we obtain that

\[
\mathbb{V}(\eta_{owi})=\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}e^{\alpha_{t}}\mathbb{E}\big(e^{\mathbf{x}^{\mathrm{T}}\bm{\beta}_{t}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)\{1+o(1)\}.
\]

Now we check the Lindeberg-Feller condition (Section 2.8 of van der Vaart, 1998). Let $w=1+\lambda_{n}y$ and $\tau=yv+1$, where $v\sim\mathbb{POI}(\lambda_{n})$. For any $\epsilon>0$,

\begin{align*}
\sum_{i=1}^{n}\mathbb{E}\big[\|\eta_{owi}\|^{2}&I(\|\eta_{owi}\|>a_{n}\epsilon)\big]\\
&=n\mathbb{E}\big[\|w^{-1}\tau\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|w^{-1}\tau\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
&\leq\frac{n}{a_{n}\epsilon}\mathbb{E}\big[\|w^{-1}\tau\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{3}\big]\\
&=\frac{n}{a_{n}\epsilon}\mathbb{E}\bigg[\frac{(1+vy)^{3}}{(1+\lambda_{n}y)^{3}}|y-p(\bm{\theta}_{t})|^{3}\|\mathbf{z}\|^{3}\bigg]\\
&\leq\frac{n}{a_{n}\epsilon}\,\frac{1+7\lambda_{n}+6\lambda_{n}^{2}+\lambda_{n}^{3}}{(1+\lambda_{n})^{3}}\mathbb{E}\{p(\bm{\theta}_{t})\|\mathbf{z}\|^{3}\}+\frac{n}{a_{n}\epsilon}\mathbb{E}[\{p(\bm{\theta}_{t})\}^{3}\|\mathbf{z}\|^{3}]\\
&\leq\frac{a_{n}}{\epsilon}\,\frac{1+7\lambda_{n}+6\lambda_{n}^{2}+\lambda_{n}^{3}}{(1+\lambda_{n})^{3}}\mathbb{E}(e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}\|\mathbf{z}\|^{3})+\frac{a_{n}e^{2\alpha_{t}}}{\epsilon}\mathbb{E}(e^{3\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}\|\mathbf{z}\|^{3})=o(a_{n}^{2}).
\end{align*}

Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998) finishes the proof of (A.32).
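The constants above are driven by the moments of the Poisson replication count $\tau=1+yv$: for a case ($y=1$), $\mathbb{E}(\tau)=1+\lambda_{n}$, $\mathbb{E}(\tau^{2})=(1+\lambda_{n})^{2}+\lambda_{n}$, and $\mathbb{E}(\tau^{3})=1+7\lambda_{n}+6\lambda_{n}^{2}+\lambda_{n}^{3}<2(1+\lambda_{n})^{3}$. A small Monte Carlo sketch checking these identities (the value of $\lambda_{n}$ is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.7                                   # arbitrary illustrative value of lambda_n
tau = 1 + rng.poisson(lam, size=2_000_000)  # replication count of a case (y = 1)

print(tau.mean(),        1 + lam)                         # E(tau)   = 1 + lambda
print((tau**2.0).mean(), (1 + lam)**2 + lam)              # E(tau^2) = (1+lambda)^2 + lambda
print((tau**3.0).mean(), 1 + 7*lam + 6*lam**2 + lam**3)   # E(tau^3), which is < 2(1+lambda)^3
print(2 * (1 + lam)**3)
```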

Now we prove (A.33). Let

\[
\Delta_{5}\equiv a_{n}^{-2}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\phi_{i}(\bm{\theta}_{t})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\frac{1}{n}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\,\frac{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}.
\]

Since

\[
\mathbb{E}(\Delta_{5})=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}=\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)+o(1),
\]

by the dominated convergence theorem, and each component of $\Delta_{5}$ has a variance that is bounded by

\[
\frac{1}{n}\mathbb{E}\bigg\{\frac{2e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{4}}\bigg\}\leq\frac{2\mathbb{E}(e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4})}{n}=o(1),
\]

applying Chebyshev’s inequality gives that

\[
\Delta_{5}\longrightarrow\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big),
\]

in probability. Thus, (A.33) follows from the fact that

\begin{align*}
&\bigg|a_{n}^{-2}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\|\mathbf{z}_{i}\|^{2}-a_{n}^{-2}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\phi_{i}(\bm{\theta}_{t})\|\mathbf{z}_{i}\|^{2}\bigg|\\
&\leq\|a_{n}^{-1}\acute{\mathbf{u}}\|\,a_{n}^{-2}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}p_{i}(\bm{\theta}_{t}+a_{n}^{-1}\breve{\mathbf{u}})\|\mathbf{z}_{i}\|^{3}\\
&\leq\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{n}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)\|\mathbf{z}_{i}\|}\|\mathbf{z}_{i}\|^{3}=o_{P}(1),
\end{align*}

where the last step is because $n^{-1}\sum_{i=1}^{n}\tau_{i}w_{i}^{-1}e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)\|\mathbf{z}_{i}\|}\|\mathbf{z}_{i}\|^{3}$ has a bounded mean and a bounded variance, and thus it is bounded in probability. ∎
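Combined with the Hessian limit (A.33) through the usual argument for maximizers of quadratic-type processes, (A.32) says that the asymptotic variance of the weighted over-sampled estimator carries the scalar inflation factor $\{(1+\lambda)^{2}+\lambda\}/(1+\lambda)^{2}=1+\lambda/(1+\lambda)^{2}$, which exceeds one for every $\lambda>0$ and peaks at $1.25$ when $\lambda=1$. A one-line illustration:

```python
# Variance-inflation factor {(1+lambda)^2 + lambda} / (1+lambda)^2 of the
# weighted over-sampled estimator; it exceeds 1 and peaks at 1.25 when lambda = 1.
for lam in [0.1, 0.5, 1.0, 2.0, 10.0]:
    print(lam, ((1 + lam)**2 + lam) / (1 + lam)**2)
```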

Appendix A.6 Proof of Theorem 5

Proof of Theorem 5.

The over-sampled estimator $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ is the maximizer of

\begin{align*}
\Upsilon_{oc}(\bm{\theta})=\frac{1}{1+\lambda_{n}}\sum_{i=1}^{n}\tau_{i}\big[(\bm{\theta}+\mathbf{b}_{o})^{\mathrm{T}}\mathbf{z}_{i}y_{i}-\log\{1+e^{\mathbf{z}_{i}^{\mathrm{T}}(\bm{\theta}+\mathbf{b}_{o})}\}\big].
\tag{A.34}
\end{align*}

Thus, $a_{n}(\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}-\bm{\theta}_{t})$ is the maximizer of $\gamma_{oc}(\mathbf{u})=\Upsilon_{oc}(\bm{\theta}_{t}+a_{n}^{-1}\mathbf{u})-\Upsilon_{oc}(\bm{\theta}_{t})$. By Taylor's expansion,

\begin{align*}
\gamma_{oc}(\mathbf{u})=\frac{1}{a_{n}}\mathbf{u}^{\mathrm{T}}\dot{\Upsilon}_{oc}(\bm{\theta}_{t})+\frac{1}{2a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}\phi_{i}(\bm{\theta}_{t}+\mathbf{b}_{o}+a_{n}^{-1}\acute{\mathbf{u}})(\mathbf{z}_{i}^{\mathrm{T}}\mathbf{u})^{2},
\tag{A.35}
\end{align*}

where

\[
\dot{\Upsilon}_{oc}(\bm{\theta})=\frac{\partial\Upsilon_{oc}(\bm{\theta})}{\partial\bm{\theta}}=\frac{1}{1+\lambda_{n}}\sum_{i=1}^{n}\tau_{i}\{y_{i}-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}\mathbf{z}_{i}
\]

is the gradient of $\Upsilon_{oc}(\bm{\theta})$, and $\acute{\mathbf{u}}$ lies between $\mathbf{0}$ and $\mathbf{u}$.

Similarly to the proof of Theorem 1, we only need to show that

\begin{align*}
a_{n}^{-1}\dot{\Upsilon}_{oc}(\bm{\theta}_{t})\longrightarrow\mathbb{N}\bigg[\mathbf{0},\ \frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\,\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}\bigg],
\tag{A.36}
\end{align*}

in distribution, and for any $\mathbf{u}$,

\begin{align*}
\frac{1}{a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}\phi_{i}(\bm{\theta}_{t}+\mathbf{b}_{o}+a_{n}^{-1}\acute{\mathbf{u}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg),
\tag{A.37}
\end{align*}

in probability.

We prove (A.36) first. Let $\eta_{obi}=(1+\lambda_{n})^{-1}\tau_{i}\{y_{i}-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}\mathbf{z}_{i}$. We have that

\begin{align*}
(1+\lambda_{n})\mathbb{E}(\eta_{obi}\mid\mathbf{z}_{i})
&=\mathbb{E}[(1+\lambda_{n}y_{i})\{y_{i}-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}\mid\mathbf{z}_{i}]\mathbf{z}_{i}\\
&=[p_{i}(\alpha_{t},\bm{\beta}_{t})(1+\lambda_{n})\{1-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}\\
&\qquad-\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})]\mathbf{z}_{i}=\mathbf{0},
\end{align*}

which implies that $\mathbb{E}(\eta_{obi})=\mathbf{0}$. For the conditional variance,

\begin{align*}
(1+\lambda_{n})^{2}&\mathbb{V}(\eta_{obi}\mid\mathbf{z}_{i})\\
&=\mathbb{E}[\{1+3\lambda_{n}y_{i}+\lambda_{n}^{2}y_{i}\}\{y_{i}-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}^{2}\mid\mathbf{z}_{i}]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\big[p_{i}(\alpha_{t},\bm{\beta}_{t})(1+3\lambda_{n}+\lambda_{n}^{2})\{1-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}^{2}+\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\{p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}^{2}\big]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{(1+3\lambda_{n}+\lambda_{n}^{2})e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}+e^{2(\alpha_{t}+b_{o0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t})}}{(1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})(1+e^{\alpha_{t}+b_{o0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{(1+3\lambda_{n}+\lambda_{n}^{2})e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}+b_{o0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\,\frac{1+\frac{1+2\lambda_{n}+\lambda_{n}^{2}}{1+3\lambda_{n}+\lambda_{n}^{2}}e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{(1+3\lambda_{n}+\lambda_{n}^{2})e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}+b_{o0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\{1+o_{P}(1)\}\\
&=e^{\alpha_{t}}(1+3\lambda_{n}+\lambda_{n}^{2})\frac{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+c_{o}e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\{1+o_{P}(1)\},
\end{align*}

where the $o_{P}(1)$'s above are all bounded and the last step is because $(1+\lambda_{n})e^{\alpha_{t}}\rightarrow c_{o}$. Thus, by the dominated convergence theorem, $\mathbb{V}(\eta_{obi})$ satisfies that

\begin{align*}
\mathbb{V}(\eta_{obi})=e^{\alpha_{t}}\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}\{1+o(1)\},
\tag{A.38}
\end{align*}

which indicates that

\begin{align*}
\frac{1}{a_{n}^{2}}\sum_{i=1}^{n}\mathbb{V}(\eta_{obi})\longrightarrow\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\,\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}.
\tag{A.39}
\end{align*}

Now we check the Lindeberg-Feller condition. Recall that $\tau=yv+1$, where $v\sim\mathbb{POI}(\lambda_{n})$. We can show that $\mathbb{E}\{(1+v)^{3}\}<2(1+\lambda_{n})^{3}$. For any $\epsilon>0$,

\begin{align*}
a_{n}\epsilon(1+\lambda_{n})^{3}\sum_{i=1}^{n}\mathbb{E}\big\{\|\eta_{obi}\|^{2}I(\|\eta_{obi}\|>a_{n}\epsilon)\big\}
&\leq(1+\lambda_{n})^{3}\sum_{i=1}^{n}\mathbb{E}(\|\eta_{obi}\|^{3})\\
&=n\mathbb{E}\big[\tau^{3}\|\{y-p(\bm{\theta}_{t}+\mathbf{b}_{o})\}\mathbf{z}\|^{3}\big]\\
&=n\mathbb{E}\big[p(\bm{\theta}_{t})(1+v)^{3}\|\{1-p(\bm{\theta}_{t}+\mathbf{b}_{o})\}\mathbf{z}\|^{3}\big]+n\mathbb{E}\big[\{1-p(\bm{\theta}_{t})\}\|p(\bm{\theta}_{t}+\mathbf{b}_{o})\mathbf{z}\|^{3}\big]\\
&\leq 2n(1+\lambda_{n})^{3}\mathbb{E}\big\{p(\bm{\theta}_{t})\|\mathbf{z}\|^{3}\big\}+n\mathbb{E}\big\{\|p(\bm{\theta}_{t}+\mathbf{b}_{o})\mathbf{z}\|^{3}\big\}\\
&\leq 2n(1+\lambda_{n})^{3}e^{\alpha_{t}}\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{3}\big)+n(1+\lambda_{n})^{3}e^{3\alpha_{t}}\mathbb{E}\big(e^{3\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{3}\big)\\
&=(1+\lambda_{n})^{3}O(a_{n}^{2}).
\end{align*}

This indicates that $a_{n}^{-2}\sum_{i=1}^{n}\mathbb{E}\{\|\eta_{obi}\|^{2}I(\|\eta_{obi}\|>a_{n}\epsilon)\}=o(1)$, and thus the Lindeberg-Feller condition holds. Applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998) finishes the proof of (A.36).

Now we prove (A.37). Let

\begin{align*}
\Delta_{6}\equiv\frac{1}{a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}\phi_{i}(\bm{\theta}_{t}+\mathbf{b}_{o})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\frac{1}{n}\sum_{i=1}^{n}\frac{(1+v_{i}y_{i})e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}+b_{o0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}.
\tag{A.40}
\end{align*}

Note that

\begin{align*}
\mathbb{E}(\Delta_{6})
&=\mathbb{E}\bigg\{\frac{(1+\lambda_{n}y)e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+b_{o0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}
\tag{A.41}\\
&=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})(1+e^{\alpha_{t}+b_{o0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}
\tag{A.42}\\
&=\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg)+o(1),
\tag{A.43}
\end{align*}

by the dominated convergence theorem, and the variance of each component of $\Delta_{6}$ is bounded by

\begin{align*}
\frac{1}{n}\mathbb{E}\bigg[\frac{(1+vy)^{2}e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+b_{o0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{4}}\|\mathbf{z}\|^{4}\bigg]
&=\frac{1}{n}\mathbb{E}\bigg[\frac{\{1+(3\lambda_{n}+\lambda_{n}^{2})p(\bm{\theta}_{t})\}e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+b_{o0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{4}}\|\mathbf{z}\|^{4}\bigg]\\
&\leq\frac{\mathbb{E}(e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4})}{n}+\frac{e^{\alpha_{t}}(3\lambda_{n}+\lambda_{n}^{2})}{n}\mathbb{E}\big(e^{3\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4}\big)=o(1),
\end{align*}

where the last step is because $n^{-1}e^{\alpha_{t}}\lambda_{n}^{2}=(e^{\alpha_{t}}\lambda_{n})^{2}a_{n}^{-2}\rightarrow 0$ and both expectations are finite. Therefore, Chebyshev's inequality implies that $\Delta_{6}\rightarrow\mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{-1}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big\}$ in probability. Thus, (A.37) follows from the fact that

\begin{align*}
&\bigg|\frac{1}{a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}\phi_{i}(\bm{\theta}_{t}+\mathbf{b}_{o}+a_{n}^{-1}\acute{\mathbf{u}})\|\mathbf{z}_{i}\|^{2}-\frac{1}{a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}\phi_{i}(\bm{\theta}_{t}+\mathbf{b}_{o})\|\mathbf{z}_{i}\|^{2}\bigg|\\
&\leq\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}p_{i}(\bm{\theta}_{t}+\mathbf{b}_{o}+a_{n}^{-1}\breve{\mathbf{u}})\|\mathbf{z}_{i}\|^{3}\\
&\leq\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{n}\sum_{i=1}^{n}(1+v_{i}y_{i})e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)\|\mathbf{z}_{i}\|}\|\mathbf{z}_{i}\|^{3}=o_{P}(1),
\end{align*}

where the last step is from the fact that $n^{-1}\sum_{i=1}^{n}(1+v_{i}y_{i})e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)\|\mathbf{z}_{i}\|}\|\mathbf{z}_{i}\|^{3}$ has a bounded mean and a bounded variance, and an application of Chebyshev's inequality. ∎
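As in the under-sampling case, the zero-conditional-mean step in the proof of (A.36) pins down the correction constant: it requires $(1+\lambda_{n})e^{\alpha_{t}}=e^{\alpha_{t}+b_{o0}}$, i.e., $b_{o0}=\log(1+\lambda_{n})$, so the bias-corrected over-sampled estimator can be obtained by fitting an unweighted logistic regression to the case-replicated data and subtracting $\log(1+\lambda_{n})$ from the intercept. A purely illustrative sketch under assumed simulation settings (not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200_000, 3
alpha_t, beta_t = -7.0, np.array([0.5, -0.5, 1.0])
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha_t + X @ beta_t))))

# Over-sample the cases: case i is replicated tau_i = 1 + v_i times, v_i ~ Poisson(lam).
lam = 50.0
tau = np.where(y == 1, 1 + rng.poisson(lam, size=n), 1)
Z = np.repeat(np.column_stack([np.ones(n), X]), tau, axis=0)
yr = np.repeat(y, tau)

# Unweighted MLE on the replicated data, followed by the intercept correction.
theta = np.zeros(d + 1)
for _ in range(30):
    pr = 1.0 / (1.0 + np.exp(-(Z @ theta)))
    hess = (Z * (pr * (1.0 - pr))[:, None]).T @ Z
    theta += np.linalg.solve(hess, Z.T @ (yr - pr))

theta[0] -= np.log(1 + lam)   # subtract b_o0 = log(1 + lambda_n)
print("bias-corrected over-sampled estimate:", theta)
print("true theta_t:                        ", np.r_[alpha_t, beta_t])
```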

References

  • Chawla (2009) Chawla, N. V. (2009). Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook, 875–886. Springer.
  • Chawla et al. (2002) Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
  • Chawla et al. (2004) Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Editorial: special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter 6, 1, 1–6.
  • Douzas and Bacao (2017) Douzas, G. and Bacao, F. (2017). Self-organizing map oversampling (SOMO) for imbalanced data set learning. Expert Systems with Applications 82, 40–52.
  • Drummond et al. (2003) Drummond, C., Holte, R. C., et al. (2003). C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, vol. 11, 1–8. Citeseer.
  • Estabrooks et al. (2004) Estabrooks, A., Jo, T., and Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Computational intelligence 20, 1, 18–36.
  • Fithian and Hastie (2014) Fithian, W. and Hastie, T. (2014). Local case-control sampling: Efficient subsampling in imbalanced data sets. Annals of statistics 42, 5, 1693.
  • Han et al. (2005) Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In D.-S. Huang, X.-P. Zhang, and G.-B. Huang, eds., Advances in Intelligent Computing, 878–887, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Hjort and Pollard (2011) Hjort, N. L. and Pollard, D. (2011). Asymptotics for minimisers of convex processes. arXiv preprint arXiv:1107.3806 .
  • Japkowicz (2000) Japkowicz, N. (2000). Learning from imbalanced data sets: Papers from the AAAI workshop, AAAI, 2000. Technical Report WS-00-05.
  • King and Zeng (2001) King, G. and Zeng, L. (2001). Logistic regression in rare events data. Political analysis 9, 2, 137–163.
  • Lemaître et al. (2017) Lemaître, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research 18, 1, 559–563.
  • Liu et al. (2009) Liu, X., Wu, J., and Zhou, Z. (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 2, 539–550.
  • Mathew et al. (2017) Mathew, J., Pang, C. K., Luo, M., and Leong, W. H. (2017). Classification of imbalanced data by oversampling in kernel space of support vector machines. IEEE transactions on neural networks and learning systems 29, 9, 4065–4076.
  • Owen (2007) Owen, A. B. (2007). Infinitely imbalanced logistic regression. The Journal of Machine Learning Research 8, 761–773.
  • Rahman and Davis (2013) Rahman, M. M. and Davis, D. (2013). Addressing the class imbalance problem in medical datasets. International Journal of Machine Learning and Computing 3, 2, 224.
  • Sun et al. (2007) Sun, Y., Kamel, M. S., Wong, A. K., and Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40, 12, 3358–3378.
  • van der Vaart (1998) van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press, London.
  • Wang (2019) Wang, H. (2019). More efficient estimation for logistic regression with optimal subsamples. Journal of Machine Learning Research 20, 132, 1–59.