
Logistic Regression for Massive Data with Rare Events

HaiYing Wang
Department of Statistics, University of Connecticut
Abstract

This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events (observations in one class, often called cases) is significantly smaller than the number of nonevents (observations in the other class, often called controls). We first derive the asymptotic distribution of the maximum likelihood estimator (MLE) of the unknown parameter, which shows that the asymptotic variance converges to zero at the rate of the inverse of the number of events instead of the inverse of the full data sample size. This indicates that the available information in rare events data is at the scale of the number of events rather than the full data sample size. Furthermore, we prove that if we under-sample the nonevents so that only a small proportion of them is retained, the resulting under-sampled estimator may have an asymptotic distribution identical to that of the full data MLE. This demonstrates the advantage of under-sampling nonevents for rare events data, because the procedure may significantly reduce the computation and/or data collection costs. Another common practice in analyzing rare events data is to over-sample (replicate) the events, which has a higher computational cost. We show that this procedure may even result in efficiency loss in terms of parameter estimation.

1 Introduction

Big data with rare events in binary responses, also called imbalanced data, are data in which the number of events (observations for one class of the binary response) is much smaller than the number of nonevents (observations for the other class of the binary response). In this paper we also call the events “cases” and the nonevents “controls”. Rare events data are common in many scientific fields and applications. However, several important questions remain unanswered that are essential for valid data analysis and appropriate decision-making. For example, should we consider the amount of information contained in the data to be at the scale of the full-data sample size (very large) or the number of cases (relatively small)? Rare events data provide unique challenges and opportunities for sampling. On the one hand, sampling without looking at the responses does not work well, because the probability of not selecting a rare case is high. On the other hand, since the rare cases are more informative than the controls, is it possible to use a small proportion of the full data and still preserve most or all of the relevant information about the unknown parameters? A common practice when analyzing rare events data is to under-sample the controls and/or over-sample (replicate) the cases. Is there any information loss when using this approach? This paper provides a rigorous theoretical analysis of the aforementioned questions in the context of parameter estimation. Some answers may be counter-intuitive. For example, if all the cases are kept, under-sampling the controls may incur no efficiency loss at all; on the other hand, using all the controls and over-sampling the cases may reduce estimation efficiency.

Rare events data, or imbalanced data, have attracted a great deal of attention in machine learning and other quantitative fields; see, for example, Japkowicz (2000); King and Zeng (2001); Chawla et al. (2004); Estabrooks et al. (2004); Owen (2007); Sun et al. (2007); Chawla (2009); Rahman and Davis (2013); Fithian and Hastie (2014); Lemaître et al. (2017). A common approach in practice is to balance the data by under-sampling controls (Drummond et al., 2003; Liu et al., 2009) and/or over-sampling cases (Chawla et al., 2002; Han et al., 2005; Mathew et al., 2017; Douzas and Bacao, 2017). However, most existing investigations focus on algorithms and methodologies for classification. Theoretical analyses of the effects of under-sampling and over-sampling in terms of parameter estimation are still rare.

King and Zeng (2001) considered logistic regression for rare events data and focused on correcting the biases in estimating the regression coefficients and probabilities. Fithian and Hastie (2014) utilized the special structure of logistic regression models to design a novel local case-control sampling method. These investigations obtained theoretical results under the regular assumption that the probability of an event occurring is fixed and does not go to zero. This assumption rules out the scenario of extremely imbalanced data, for which it is more appropriate to assume that the event probability goes to zero. Owen (2007)’s investigation did not require this fixed-probability assumption. He assumed that the number of rare cases is fixed, and derived the non-trivial point limit of the slope parameter estimator in logistic regression. However, the convergence rate and distributional properties of this estimator were not investigated. In this paper, we obtain convergence rates and asymptotic distributions of parameter estimators under the assumption that both the number of cases and the number of controls are random and grow large, at rates such that the number of cases divided by the number of controls decays to zero. This is the first study that provides distributional results for rare events data with a decaying event rate, and it gives the following indications.

  • The convergence rate of the maximum likelihood estimator (MLE) is the inverse of the square root of the number of cases rather than of the total number of observations. This means that the amount of available information about the unknown parameters may be limited even if the full data volume is massive.

  • There may be no efficiency loss at all in parameter estimation if one removes most of the controls in the data, because the control under-sampled estimators may have an asymptotic distribution that is identical to that of the full data MLE.

  • Besides incurring a higher computational cost, over-sampling cases may result in estimation efficiency loss, because the asymptotic variances of the resulting estimators may be larger than that of the full data MLE.

The rest of the paper is organized as follows. We introduce the model setup and related assumptions in Section 2, and derive the asymptotic distribution of the full data MLE. We investigate under-sampled estimators in Section 3 and over-sampled estimators in Section 4. Section 5 presents some numerical experiments, and Section 6 concludes the paper and points out directions for future research. All the proofs of the theoretical findings in this paper are presented in the Appendix.

2 Model setups and assumptions

Let $\mathcal{D}_{n}=\{(\mathbf{x}_{i},y_{i}),i=1,...,n\}$ be independent data of size $n$ from a logistic regression model,

\mathbb{P}(y=1|\mathbf{x})=p(\alpha,\bm{\beta})=\frac{e^{\alpha+\mathbf{x}^{\mathrm{T}}\bm{\beta}}}{1+e^{\alpha+\mathbf{x}^{\mathrm{T}}\bm{\beta}}}.   (1)

Here $\mathbf{x}\in\mathbb{R}^{d}$ is the covariate, $y\in\{0,1\}$ is the binary class label, $\alpha$ is the intercept parameter, and $\bm{\beta}$ is the slope parameter vector. For ease of presentation, denote $\bm{\theta}=(\alpha,\bm{\beta}^{\mathrm{T}})^{\mathrm{T}}$ as the full vector of regression coefficients, and define $\mathbf{z}=(1,\mathbf{x}^{\mathrm{T}})^{\mathrm{T}}$ accordingly. This paper focuses on estimating the unknown $\bm{\theta}$.

If $\bm{\theta}$ is fixed (does not change with $n$), then model (1) is just the regular logistic regression model, and classical likelihood theory shows that the MLE based on the full data $\mathcal{D}_{n}$ converges at a rate of $n^{-1/2}$. A fixed $\bm{\theta}$ implies that $\mathbb{P}(y=1)=\mathbb{E}\{\mathbb{P}(y=1|\mathbf{x})\}$ is also a fixed constant bounded away from zero. However, for rare events data, because the event rate in the data is so low, it is more appropriate to assume that $\mathbb{P}(y=1)$ approaches zero in some way. We discuss how to model this scenario in the following.

Let $n_{1}$ and $n_{0}$ be the numbers of cases (observations with $y_{i}=1$) and controls (observations with $y_{i}=0$), respectively, in $\mathcal{D}_{n}$. Here, $n_{1}$ and $n_{0}$ are random because they are summary statistics of the observed data, i.e., $n_{1}=\sum_{i=1}^{n}y_{i}$ and $n_{0}=n-n_{1}$. For rare events data, $n_{1}$ is much smaller than $n_{0}$. Thus, for asymptotic investigations, it is reasonable to assume that $n_{1}/n_{0}\rightarrow 0$, or equivalently $n_{1}/n\rightarrow 0$, in probability, as $n\rightarrow\infty$. For big data with rare events, there should be a fair number of cases observed, so it is appropriate to assume that $n_{1}\rightarrow\infty$ in probability. To model this scenario, we assume that the marginal event probability $\mathbb{P}(y=1)$ satisfies, as $n\rightarrow\infty$,

\mathbb{P}(y=1)\rightarrow 0\quad\text{and}\quad n\mathbb{P}(y=1)\rightarrow\infty.   (2)

We accommodate this condition by assuming that the true value of $\bm{\beta}$, denoted as $\bm{\beta}_{t}$, is fixed, while the true value of $\alpha$, denoted as $\alpha_{t}$, goes to negative infinity at a certain rate. Specifically, we assume $\alpha_{t}\rightarrow-\infty$ as $n\rightarrow\infty$ at a rate such that

\frac{n_{1}}{n}=\mathbb{P}(y=1)\{1+o_{P}(1)\}=\mathbb{E}\bigg(\frac{e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\bigg)\{1+o_{P}(1)\},   (3)

where $o_{P}(1)$ means a term that converges to zero in probability, i.e., a term that is arbitrarily small with probability approaching one. The assumption of a diverging $\alpha_{t}$ with a fixed $\bm{\beta}_{t}$ means that the baseline probability of a rare event is low, and the effect of the covariate does not change the order of the probability for a rare event to occur. This is a very reasonable assumption for many practical problems. For example, although making phone calls when driving may increase the probability of car accidents, it may not make car accidents a high-probability event.
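
To make the asymptotic regime in (2) and (3) concrete, the following minimal sketch generates data from model (1) with an intercept drifting to $-\infty$ as $n$ grows, so that the event rate vanishes while the expected number of cases still diverges. The constant c0, the value $\bm{\beta}_{t}=1$, and the standard normal covariate are illustrative assumptions, not values taken from the paper.

    import numpy as np

    def simulate_rare_events(n, beta_t=1.0, c0=2.0, rng=None):
        # alpha_t = log(c0) - 0.5*log(n) makes P(y=1) shrink roughly like
        # c0/sqrt(n), so P(y=1) -> 0 while n*P(y=1) -> infinity, matching (2).
        rng = np.random.default_rng(rng)
        alpha_t = np.log(c0) - 0.5 * np.log(n)
        x = rng.standard_normal(n)
        p = 1.0 / (1.0 + np.exp(-(alpha_t + beta_t * x)))
        y = rng.binomial(1, p)
        return x, y, alpha_t

    x, y, alpha_t = simulate_rare_events(10**6)
    print(alpha_t, y.sum())   # n1 = y.sum() is large even though the event rate is tiny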

2.1 How much information do we have in rare events data

To demonstrate how much information is really available in rare events data, we derive the asymptotic distribution of the MLE for model (1) in the scenario described in (2) and (3). The MLE based on the full data $\mathcal{D}_{n}$, say $\hat{\bm{\theta}}$, is the maximizer of

\ell(\bm{\theta})=\sum_{i=1}^{n}\big\{y_{i}\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}-\log(1+e^{\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}})\big\},   (4)

which is also the solution to the following equation,

\dot{\ell}(\bm{\theta})=\sum_{i=1}^{n}\big\{y_{i}-p_{i}(\alpha,\bm{\beta})\big\}\mathbf{z}_{i}=0,   (5)

where $\dot{\ell}(\bm{\theta})$ is the gradient of the log-likelihood $\ell(\bm{\theta})$.
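
For illustration, a minimal Newton-Raphson sketch for solving the score equation (5) is given below; it is not tuned for numerical robustness (no step-halving or safeguards against separation), and the function name is ours.

    import numpy as np

    def logistic_mle(X, y, n_iter=100, tol=1e-10):
        # Newton-Raphson for theta = (alpha, beta^T)^T maximizing (4),
        # i.e., solving the score equation (5).
        Z = np.column_stack([np.ones(len(y)), X])        # z_i = (1, x_i^T)^T
        theta = np.zeros(Z.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-Z @ theta))
            score = Z.T @ (y - p)                        # gradient in (5)
            info = (Z * (p * (1.0 - p))[:, None]).T @ Z  # observed information
            step = np.linalg.solve(info, score)
            theta += step
            if np.max(np.abs(step)) < tol:
                break
        return theta

    # theta_hat = logistic_mle(x[:, None], y)   # (x, y) from the earlier sketch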

The following theorem gives the asymptotic normality of the MLE $\hat{\bm{\theta}}$ for rare events data.

Theorem 1.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$ and $\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}})$ is a positive-definite matrix, then under the conditions in (2) and (3), as $n\rightarrow\infty$,

\sqrt{n_{1}}(\hat{\bm{\theta}}-\bm{\theta}_{t})\longrightarrow\mathbb{N}\big(\mathbf{0},\ \mathbf{V}_{f}\big),   (6)

in distribution, where

\mathbf{V}_{f}=\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\big)\mathbf{M}_{f}^{-1},\qquad\text{and}   (7)
\mathbf{M}_{f}=\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)=\mathbb{E}\left\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\begin{pmatrix}1&\mathbf{x}^{\mathrm{T}}\\ \mathbf{x}&\mathbf{x}\mathbf{x}^{\mathrm{T}}\end{pmatrix}\right\}.   (8)
Remark 1.

The result in (6) shows that the convergence rate of the full-data MLE is of the order $n_{1}^{-1/2}$, i.e., $\hat{\bm{\theta}}-\bm{\theta}_{t}=O_{P}(n_{1}^{-1/2})$. This is different from the classical result of $\hat{\bm{\theta}}-\bm{\theta}_{t}=O_{P}(n^{-1/2})$ for the case that $\mathbb{P}(y=1)$ is a fixed constant. Theorem 1 indicates that for rare events data, the real amount of available information is actually at the scale of $n_{1}$ instead of $n$. A large volume of data does not mean that we have a large amount of information.

3 Efficiency of under-sampled estimators

Theorem 1 in the previous section shows that the full-data MLE has a convergence rate of $n_{1}^{-1/2}$. If we under-sample controls to reduce the number of controls to the same level as $n_{1}$, does the resulting estimator retain the full-data convergence rate of $n_{1}^{-1/2}$? If so, one can significantly improve the computational efficiency and reduce the storage requirement for massive data. Furthermore, does under-sampling controls cause any estimation efficiency loss (an enlarged asymptotic variance)? This section answers these questions.

From the full data set $\mathcal{D}_{n}=\{(\mathbf{x}_{1},y_{1}),...,(\mathbf{x}_{n},y_{n})\}$, we want to use all the cases (data points with $y_{i}=1$) while selecting only a subset of the controls (data points with $y_{i}=0$). Specifically, let $\pi_{0}$ be the probability that each data point with $y_{i}=0$ is selected into the subset. Let $\delta_{i}\in\{0,1\}$ be the binary indicator variable that signifies whether the $i$-th observation is included in the subset, i.e., the $i$-th observation is included in the sample if $\delta_{i}=1$ and ignored if $\delta_{i}=0$. Here, we define the sampling plan by assigning

\delta_{i}=y_{i}+(1-y_{i})I(u_{i}\leq\pi_{0}),\quad i=1,...,n,   (9)

where $u_{i}\sim\mathbb{U}(0,1)$, $i=1,...,n$, are independent and identically distributed (i.i.d.) random variables with the standard uniform distribution. This is a mixture of deterministic selection and random sampling. The resulting control under-sampled data include all rare cases (with $y_{i}=1$), and the number of controls (with $y_{i}=0$) is on average of the order $n_{0}\pi_{0}$. The average sample size for the under-sampled data given the full data is $\sum_{i=1}^{n}\mathbb{E}(\delta_{i}|\mathcal{D}_{n})=n_{1}+n_{0}\pi_{0}$, which is $o_{P}(n)$ if $\pi_{0}\rightarrow 0$. The average sample size reduction is $n_{0}(1-\pi_{0})$, which is of the same order as $n$ if $\pi_{0}\nrightarrow 1$, and $n_{0}(1-\pi_{0})/n\rightarrow 1$ if $\pi_{0}\rightarrow 0$.
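
A short sketch of the sampling plan in (9) (the function name is ours):

    import numpy as np

    def undersample_controls(y, pi0, rng=None):
        # Inclusion indicators delta_i from (9): every case (y_i = 1) is kept,
        # and each control (y_i = 0) is kept independently with probability pi0.
        rng = np.random.default_rng(rng)
        u = rng.uniform(size=len(y))
        return np.where(y == 1, 1, (u <= pi0).astype(int))

    # delta = undersample_controls(y, pi0=0.05)
    # under-sampled data: x[delta == 1], y[delta == 1]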

Note that the under-sampled data taken according to $\delta_{i}$ in (9) is a biased sample, so we need to maximize a weighted objective function to obtain an asymptotically unbiased estimator. Alternatively, we can maximize an unweighted objective function and then correct the bias for the resulting estimator in logistic regression.

3.1 Under-sampled weighted estimator

The sampling inclusion probability given the full data $\mathcal{D}_{n}$ for the $i$-th data point is

\pi_{i}=\mathbb{E}(\delta_{i}|\mathcal{D}_{n})=y_{i}+(1-y_{i})\pi_{0}=\pi_{0}+(1-\pi_{0})y_{i}.

The under-sampled weighted estimator, $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$, is the maximizer of

\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta})=\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\big\{y_{i}\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}-\log(1+e^{\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}})\big\}.   (10)
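
A weighted variant of the earlier Newton-Raphson sketch maximizes (10) when the weights $\delta_{i}/\pi_{i}$ are supplied; this is again a minimal illustration with names of our choosing.

    import numpy as np

    def weighted_logistic_mle(X, y, w, n_iter=100, tol=1e-10):
        # Newton-Raphson for a weighted log-likelihood
        #   sum_i w_i { y_i z_i^T theta - log(1 + exp(z_i^T theta)) },
        # which is (10) when w_i = delta_i / pi_i (rows with w_i = 0 drop out).
        Z = np.column_stack([np.ones(len(y)), X])
        theta = np.zeros(Z.shape[1])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-Z @ theta))
            score = Z.T @ (w * (y - p))
            info = (Z * (w * p * (1.0 - p))[:, None]).T @ Z
            step = np.linalg.solve(info, score)
            theta += step
            if np.max(np.abs(step)) < tol:
                break
        return theta

    # pi = pi0 + (1 - pi0) * y; w = delta / pi   (delta from the sketch after (9))
    # theta_under_w = weighted_logistic_mle(x[:, None], y, w)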

We present the asymptotic distribution of $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$ in the following theorem.

Theorem 2.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$, $\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)$ is a positive-definite matrix, and $c_{n}=e^{\alpha_{t}}/\pi_{0}\rightarrow c$ for a constant $c\in[0,\infty)$, then under the conditions in (2) and (3), as $n\rightarrow\infty$,

\sqrt{n_{1}}(\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}-\bm{\theta}_{t})\longrightarrow\mathbb{N}(\mathbf{0},\ \mathbf{V}_{\mathrm{under}}^{\mathrm{w}}),   (11)

in distribution, where

\mathbf{V}_{\mathrm{under}}^{\mathrm{w}}=\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})\mathbf{M}_{f}^{-1}\mathbf{M}_{\mathrm{under}}^{\mathrm{w}}\mathbf{M}_{f}^{-1},\quad\text{and}   (12)
\mathbf{M}_{\mathrm{under}}^{\mathrm{w}}=\mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}(1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})\mathbf{z}\mathbf{z}^{\mathrm{T}}\big\}.   (13)
Remark 2.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$, then from (3) and the dominated convergence theorem, we know that $n_{1}=ne^{\alpha_{t}}\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})\{1+o_{P}(1)\}$. Thus

c_{n}\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})=\frac{n_{1}}{n\pi_{0}}\{1+o_{P}(1)\}=\frac{n_{1}}{n_{0}\pi_{0}}\{1+o_{P}(1)\}.

Since $n_{0}\pi_{0}$ is the average number of controls in the under-sampled data, $c\,\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})$ can be interpreted as the asymptotic ratio of the number of cases to the number of controls in the under-sampled data. Therefore, since $\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})>0$ is a fixed constant, the value of $c$ has the following intuitive interpretations.

  • $c=0$: take many more controls than cases;

  • $0<c<\infty$: the number of controls to take is of the same order as the number of cases;

  • $c=\infty$: take far fewer controls than cases.

Theorem 2 requires that $0\leq c<\infty$. This means that the number of controls to take should not be significantly smaller than the number of cases, which is a very reasonable assumption.

Remark 3.

Theorem 2 shows that as long as $\pi_{0}$ does not make the number of controls in the under-sampled data much smaller than the number of cases $n_{1}$, the under-sampled estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$ preserves the convergence rate of the full-data estimator. Furthermore, if $c=0$ then $\mathbf{M}_{\mathrm{under}}^{\mathrm{w}}=\mathbf{M}_{f}$, which implies that $\mathbf{V}_{\mathrm{under}}^{\mathrm{w}}=\mathbf{V}_{f}$. This means that if one takes many more controls than cases, then asymptotically there is no estimation efficiency loss at all. Here, the number of controls to take can still be significantly smaller than $n_{0}$, so the computational burden is significantly reduced. If $c>0$, since $\mathbf{M}_{\mathrm{under}}^{\mathrm{w}}>\mathbf{M}_{f}$, we know that $\mathbf{V}_{\mathrm{under}}^{\mathrm{w}}>\mathbf{V}_{f}$ in the Loewner order. (For two Hermitian matrices $\mathbf{A}_{1}$ and $\mathbf{A}_{2}$ of the same dimension, $\mathbf{A}_{1}\geq\mathbf{A}_{2}$ if $\mathbf{A}_{1}-\mathbf{A}_{2}$ is positive semi-definite and $\mathbf{A}_{1}>\mathbf{A}_{2}$ if $\mathbf{A}_{1}-\mathbf{A}_{2}$ is positive definite.) Thus reducing the number of controls to the same order as the number of cases may reduce the estimation efficiency, although the convergence rate is the same as that of the full-data estimator.

3.2 Under-sampled unweighted estimator with bias correction

Based on the control under-sampled data, if we obtain an estimator from an unweighted objective function, say

\tilde{\bm{\theta}}_{\mathrm{under}}^{\mathrm{u}}=\arg\max_{\bm{\theta}}\ \ell_{\mathrm{under}}^{\mathrm{u}}(\bm{\theta})=\arg\max_{\bm{\theta}}\sum_{i=1}^{n}\delta_{i}\big[y_{i}\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}-\log\{1+e^{\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}}\}\big],

then in $\tilde{\bm{\theta}}_{\mathrm{under}}^{\mathrm{u}}=(\hat{\alpha}_{\mathrm{under}}^{\mathrm{u}},\hat{\bm{\beta}}_{\mathrm{under}}^{\mathrm{u}\,\mathrm{T}})^{\mathrm{T}}$, the intercept estimator $\hat{\alpha}_{\mathrm{under}}^{\mathrm{u}}$ is asymptotically biased while the slope estimator $\hat{\bm{\beta}}_{\mathrm{under}}^{\mathrm{u}}$ is still asymptotically unbiased. We correct the bias of $\hat{\alpha}_{\mathrm{under}}^{\mathrm{u}}$ using $\log(\pi_{0})$, and define the under-sampled unweighted estimator with bias correction $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$ as

\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}=\tilde{\bm{\theta}}_{\mathrm{under}}^{\mathrm{u}}+\mathbf{b},   (14)

where

\mathbf{b}=\{\log(\pi_{0}),0,...,0\}^{\mathrm{T}}.   (15)
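
In code, the bias-corrected estimator in (14)-(15) amounts to fitting an ordinary (unweighted) logistic regression on the retained subsample and then adding $\log(\pi_{0})$ to the intercept; a minimal sketch, reusing logistic_mle from the sketch after (5):

    import numpy as np

    def undersampled_ubc(X_sub, y_sub, pi0):
        # Unweighted MLE on the under-sampled data, then the correction (14)-(15):
        # add b = (log pi0, 0, ..., 0)^T to shift the intercept back.
        theta_tilde = logistic_mle(X_sub, y_sub)
        b = np.zeros_like(theta_tilde)
        b[0] = np.log(pi0)
        return theta_tilde + b

    # keep = delta == 1   (delta from the sketch after (9))
    # theta_under_ubc = undersampled_ubc(x[keep][:, None], y[keep], pi0=0.05)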

The following theorem gives the asymptotic distribution of $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$.

Theorem 3.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$, $\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)$ is a positive-definite matrix, and $e^{\alpha_{t}}/\pi_{0}\rightarrow c$ for a constant $c\in[0,\infty)$, then under the conditions in (2) and (3), as $n\rightarrow\infty$,

\sqrt{n_{1}}(\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}-\bm{\theta}_{t})\longrightarrow\mathbb{N}(\mathbf{0},\ \mathbf{V}_{\mathrm{under}}^{\mathrm{ubc}}),   (16)

in distribution, where

\mathbf{V}_{\mathrm{under}}^{\mathrm{ubc}}=\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})(\mathbf{M}_{\mathrm{under}}^{\mathrm{ubc}})^{-1},\quad\text{and}   (17)
\mathbf{M}_{\mathrm{under}}^{\mathrm{ubc}}=\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg).   (18)
Remark 4.

Similarly to the case of the under-sampled weighted estimator, Theorem 3 shows that the estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$ preserves the convergence rate of the full-data estimator if $c<\infty$. Furthermore, if $c>0$ then $\mathbf{V}_{\mathrm{under}}^{\mathrm{ubc}}>\mathbf{V}_{f}$; if $c=0$, then $\mathbf{V}_{\mathrm{under}}^{\mathrm{ubc}}=\mathbf{V}_{f}$.

The following proposition is useful to compare the asymptotic variances of the weighted and the unweighted estimators.

Proposition 1.

Let $\mathbf{v}$ be a random vector and $h$ be a positive scalar random variable. Assume that $\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})$, $\mathbb{E}(h\mathbf{v}\mathbf{v}^{\mathrm{T}})$, and $\mathbb{E}(h^{-1}\mathbf{v}\mathbf{v}^{\mathrm{T}})$ are all finite and positive-definite matrices. The following inequality holds in the Loewner order:

\big\{\mathbb{E}(h^{-1}\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}\leq\big\{\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}\mathbb{E}(h\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\{\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}.
Remark 5.

If we let $\mathbf{v}=e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}/2}\mathbf{z}$ and $h=1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}$ in Proposition 1, then we know that $\mathbf{V}_{\mathrm{under}}^{\mathrm{ubc}}\leq\mathbf{V}_{\mathrm{under}}^{\mathrm{w}}$ in the Loewner order. This indicates that with the same control under-sampled data, the unweighted estimator with bias correction, $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$, has a higher estimation efficiency than the weighted estimator, $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$.
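
The inequality in Proposition 1, with the substitution used in Remark 5, can be checked numerically by Monte Carlo. The sketch below uses illustrative values $\bm{\beta}_{t}=1$, $c=0.3$, and a standard normal covariate (our assumptions, not the paper's); the printed eigenvalues should be nonnegative up to Monte Carlo error.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(200_000)
    z = np.column_stack([np.ones_like(x), x])     # z = (1, x)^T
    v = z * np.exp(0.5 * x)[:, None]              # v = e^{beta_t^T x / 2} z, beta_t = 1
    h = 1.0 + 0.3 * np.exp(x)                     # h = 1 + c e^{beta_t^T x}, c = 0.3

    vvT = np.einsum('ni,nj->nij', v, v)
    lhs = np.linalg.inv((vvT / h[:, None, None]).mean(axis=0))   # {E(h^{-1} v v^T)}^{-1}
    mid = np.linalg.inv(vvT.mean(axis=0))                        # {E(v v^T)}^{-1}
    rhs = mid @ (vvT * h[:, None, None]).mean(axis=0) @ mid
    print(np.linalg.eigvalsh(rhs - lhs))          # Proposition 1: rhs - lhs >= 0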

4 Efficiency loss due to over-sampling

Another common practice in analyzing rare events data is to use all the controls and over-sample the cases. To investigate the effect of this approach, let $\tau_{i}$ denote the number of times that a data point is used, and define

\tau_{i}=y_{i}v_{i}+1,\quad i=1,...,n,   (19)

where $v_{i}\sim\mathbb{POI}(\lambda_{n})$, $i=1,...,n$, are i.i.d. Poisson random variables with parameter $\lambda_{n}$. For this over-sampling plan, a data point with $y_{i}=0$ is used only once, while a data point with $y_{i}=1$ is used in the over-sampled data $\mathbb{E}(\tau_{i}|\mathcal{D}_{n},y_{i}=1)=1+\lambda_{n}$ times on average. Here, $\lambda_{n}$ can be interpreted as the average over-sampling rate for cases.
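
A short sketch of the replication scheme in (19) (the function name is ours):

    import numpy as np

    def oversample_counts(y, lam, rng=None):
        # Replication counts tau_i from (19): each control is used once;
        # each case is used 1 + v_i times with v_i ~ Poisson(lam).
        rng = np.random.default_rng(rng)
        v = rng.poisson(lam, size=len(y))
        return y * v + 1

    # tau = oversample_counts(y, lam=3.48)
    # the over-sampled data repeat observation i tau[i] times, e.g.
    # x_over, y_over = np.repeat(x, tau), np.repeat(y, tau)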

Again, the case over-sampled data according to (19) is a biased sample, and we need to use a weighted objective function or to correct the bias of the estimator from an unweighted objective function.

4.1 Over-sampled weighted estimator

Let $w_{i}=\mathbb{E}(\tau_{i}|\mathcal{D}_{n})=1+\lambda_{n}y_{i}$. The case over-sampled weighted estimator, $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$, is the maximizer of

\ell_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta})=\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\big\{y_{i}\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}-\log(1+e^{\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}})\big\}.   (20)
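
Maximizing (20) is the same weighted fit as before, now with weights $\tau_{i}/w_{i}$; a brief sketch reusing the helpers defined in the earlier sketches:

    lam = 3.48
    tau = oversample_counts(y, lam)                           # sketch after (19)
    w = tau / (1.0 + lam * y)                                 # tau_i / w_i, with w_i = 1 + lambda_n y_i
    theta_over_w = weighted_logistic_mle(x[:, None], y, w)   # sketch after (10)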

The following theorem gives the asymptotic distribution of $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$.

Theorem 4.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$, $\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)$ is positive-definite, and $\lambda_{n}\rightarrow\lambda\geq 0$, then under the conditions in (2) and (3), as $n\rightarrow\infty$,

\sqrt{n_{1}}(\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}-\bm{\theta}_{t})\longrightarrow\mathbb{N}(\mathbf{0},\ \mathbf{V}_{\mathrm{over}}^{\mathrm{w}}),   (21)

in distribution, where

\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}=\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})\mathbf{M}_{f}^{-1}.   (22)
Remark 6.

Note that in (22), $\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\geq 1$, and the equality holds only if $\lambda=0$ or $\lambda=\infty$. Thus, $\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}\geq\mathbf{V}_{f}$, meaning that over-sampling the cases may result in estimation efficiency loss unless the number of over-sampled cases is small enough to be negligible ($\lambda=0$) or very large ($\lambda=\infty$). Considering that over-sampling incurs additional computational cost with potential estimation efficiency loss, this procedure is not recommended if the primary goal is parameter estimation.

4.2 Over-sampled unweighted estimator with bias correction

For completeness, we derive the asymptotic distribution of the over-sampled unweighted estimator with bias correction, $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$, defined as $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}=\tilde{\bm{\theta}}_{\mathrm{over}}^{\mathrm{u}}-\mathbf{b}_{o}$, where

\tilde{\bm{\theta}}_{\mathrm{over}}^{\mathrm{u}}=\arg\max_{\bm{\theta}}\ \ell_{\mathrm{over}}^{\mathrm{u}}(\bm{\theta})=\arg\max_{\bm{\theta}}\sum_{i=1}^{n}\tau_{i}\big[y_{i}\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}-\log\{1+e^{\mathbf{z}_{i}^{\mathrm{T}}\bm{\theta}}\}\big],   (23)

and

\mathbf{b}_{o}=(b_{o0},0,...,0)^{\mathrm{T}}=\{\log(1+\lambda_{n}),0,...,0\}^{\mathrm{T}}.   (24)

The following theorem is about the asymptotic distribution of $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$.

Theorem 5.

If $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$, $\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)$ is positive-definite, $\lambda_{n}\rightarrow\lambda\geq 0$, and $\lambda_{n}e^{\alpha_{t}}\rightarrow c_{o}$ for a constant $c_{o}\in[0,\infty)$, then under the conditions in (2) and (3), as $n\rightarrow\infty$,

\sqrt{n_{1}}(\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}-\bm{\theta}_{t})\longrightarrow\mathbb{N}(\mathbf{0},\ \mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}),   (25)

in distribution, where

\mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}=\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\big)\mathbf{M}_{obc2}^{-1}\mathbf{M}_{obc1}\mathbf{M}_{obc2}^{-1},
\mathbf{M}_{obc1}=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\},\quad\text{ and }
\mathbf{M}_{obc2}=\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg).
Remark 7.

Unlike the case of under-sampled estimators, for over-sampled estimators the unweighted estimator with bias correction $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ has a lower estimation efficiency than the weighted estimator $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$. To see this, letting $h=(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{-1}$ and $\mathbf{v}=e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}/2}(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{-1/2}\mathbf{z}$ in Proposition 1, we know that $\mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}\geq\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}$, and the equality holds if $c_{o}=0$. Here, since $\lambda_{n}e^{\alpha_{t}}\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})=\frac{n_{1}\lambda_{n}}{n_{0}}\{1+o_{P}(1)\}$, we can intuitively interpret $c_{o}\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})$ as the asymptotic ratio of the average number of over-sampled case replicates to the number of controls. If in addition $\lambda=0$, then $\mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}=\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}=\mathbf{V}_{f}$; but in general, $\mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}\geq\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}\geq\mathbf{V}_{f}$.

Remark 8.

Compared with Theorem 4 for $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$, Theorem 5 for $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ requires the extra condition that $\lambda_{n}e^{\alpha_{t}}\rightarrow c_{o}\in[0,\infty)$. In addition, $\mathbf{V}_{\mathrm{over}}^{\mathrm{ubc}}\geq\mathbf{V}_{\mathrm{over}}^{\mathrm{w}}$. Thus, if over-sampling has to be implemented, then we recommend using the weighted estimator $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$.

5 Numerical experiments

5.1 Full data estimator $\hat{\bm{\theta}}$

Consider model (1) with one covariate $x$ and $\bm{\theta}=(\alpha,\beta)^{\mathrm{T}}$. We set $\mathbb{P}(y=1)=0.02$, $0.004$, $0.0008$ and $0.00016$, and generate corresponding full data of sizes $n=10^{3}$, $10^{4}$, $10^{5}$ and $10^{6}$, respectively. As a result, the average numbers of cases ($y_{i}=1$) in the resulting data are $\mathbb{E}(n_{1})=20$, $40$, $80$ and $160$. This value configuration mimics the scenario that $n\rightarrow\infty$, $\mathbb{P}(y=1)\rightarrow 0$, and $\mathbb{E}(n_{1})\rightarrow\infty$. The covariates $x_{i}$ are generated from $\mathbb{N}(1,1)$ for cases ($y_{i}=1$) and from $\mathbb{N}(0,1)$ for controls ($y_{i}=0$). Under this setup, the true value of $\beta$ is fixed at $\beta_{t}=1$, and the true values of $\alpha$ are $\alpha_{t}=-4.39$, $-6.02$, $-7.63$ and $-9.24$, respectively, for the four different values of $n$. We repeat the simulation $S=1,000$ times and calculate empirical MSEs as $\text{eMSE}(\hat{\theta}_{j})=S^{-1}\sum_{s=1}^{S}(\hat{\theta}_{j}^{(s)}-\theta_{tj})^{2}$, $j=0,1$, where $\hat{\theta}_{0}=\hat{\alpha}$, $\hat{\theta}_{1}=\hat{\beta}$, and $\hat{\theta}_{j}^{(s)}$ is the estimate from the $s$-th repetition.
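
A minimal sketch of one scenario of this experiment is given below (reusing logistic_mle from the sketch after (5)). The stated $\alpha_{t}$ values follow because, under the Gaussian mixture above, $\mathbb{P}(y=1|x)$ is logistic with $\beta_{t}=1$ and $\alpha_{t}=\log\{\rho/(1-\rho)\}-1/2$, where $\rho=\mathbb{P}(y=1)$; for example, $\rho=0.02$ gives $\alpha_{t}\approx-4.39$.

    import numpy as np

    def emse_full_data(n, rho, S=1000, rng=None):
        # One scenario of Section 5.1: y ~ Bernoulli(rho), x | y=1 ~ N(1,1),
        # x | y=0 ~ N(0,1); the implied true parameters are
        # theta_t = (log(rho/(1-rho)) - 0.5, 1).
        rng = np.random.default_rng(rng)
        theta_t = np.array([np.log(rho / (1.0 - rho)) - 0.5, 1.0])
        sq_err = np.zeros(2)
        for _ in range(S):
            y = rng.binomial(1, rho, size=n)
            x = rng.normal(loc=y.astype(float), scale=1.0)   # mean 1 for cases, 0 for controls
            theta_hat = logistic_mle(x[:, None], y)          # sketch after (5)
            sq_err += (theta_hat - theta_t) ** 2
        return sq_err / S   # empirical MSEs of (alpha_hat, beta_hat)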

Table 1 presents the empirical MSEs (eMSEs) multiplied by $\mathbb{E}(n_{1})$ and by $n$, respectively. We see that $\mathbb{E}(n_{1})\times\text{eMSE}(\hat{\theta}_{j})$ does not diverge as $n$ increases, for both $\hat{\alpha}$ and $\hat{\beta}$. This confirms the conclusion in Theorem 1 that $\hat{\bm{\theta}}$ converges at a rate of $n_{1}^{-1/2}$ (it implies that $n_{1}\|\hat{\bm{\theta}}-\bm{\theta}_{t}\|^{2}=O_{P}(1)$). On the other hand, the values of $n\times\text{eMSE}(\hat{\theta}_{j})$ are large, and they increase quickly as $n$ increases, indicating that $n\|\hat{\bm{\theta}}-\bm{\theta}_{t}\|^{2}$ diverges to infinity. Table 1 confirms that although the full data sample sizes $n$ are very large, it is the values of $n_{1}$, which are much smaller, that reflect the real amount of available information about the regression parameters.

Table 1: Empirical MSE (eMSE) multiplied by $\mathbb{E}(n_{1})$ and $n$.

  $n$        $\mathbb{E}(n_{1})$   $\mathbb{E}(n_{1})\times$eMSE($\hat{\alpha}$)   $\mathbb{E}(n_{1})\times$eMSE($\hat{\beta}$)   $n\times$eMSE($\hat{\alpha}$)   $n\times$eMSE($\hat{\beta}$)
  $10^{3}$   20                    2.51                                            1.21                                           125.7                           60.6
  $10^{4}$   40                    2.06                                            1.09                                           515.5                           271.9
  $10^{5}$   80                    2.22                                            1.00                                           2774.4                          1248.8
  $10^{6}$   160                   2.16                                            1.08                                           13474.9                         6731.6

5.2 Sampling-based estimators

Now we provide numerical results for the under-sampled and over-sampled estimators. Consider model (1) with $n=10^{5}$, $x\sim\mathbb{N}(0,1)$, and $\bm{\theta}_{t}=(-6,1)^{\mathrm{T}}$, so that $\mathbb{P}(y=1)\approx 0.004$. For under-sampling, consider $\pi_{0}=0.005$, $0.01$, $0.05$, $0.1$, $0.2$, $0.5$, $0.8$, and $1.0$; for over-sampling, consider $\lambda_{n}=0$, $0.22$, $0.49$, $1.23$, $3.48$, $6.39$, $11.18$ and $53.6$, which correspond to $\log(1+\lambda_{n})=0$, $0.2$, $0.4$, $0.8$, $1.5$, $2.0$, $2.5$ and $4.0$, respectively. We repeat the simulation $S=1,000$ times and calculate empirical MSEs as

\text{eMSE}(\hat{\bm{\theta}}_{g})=\frac{1}{S}\sum_{s=1}^{S}\|\hat{\bm{\theta}}_{g}^{(s)}-\bm{\theta}_{t}\|^{2},

where $\hat{\bm{\theta}}_{g}^{(s)}$ is the estimate from the $s$-th repetition for a given estimator $\hat{\bm{\theta}}_{g}$. We consider $\hat{\bm{\theta}}_{g}=\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$, $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$, $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$, and $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$. Note that if $\pi_{0}=1$ then the under-sampled estimators become the full data estimator, i.e., $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}=\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}=\hat{\bm{\theta}}$; if $\lambda_{n}=0$, then the over-sampled estimators become the full data estimator, i.e., $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}=\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}=\hat{\bm{\theta}}$.

Figure 1 presents the simulation results. Figure 1(a) plots eMSEs ($\times 10^{3}$) against $\pi_{0}$. When $\pi_{0}$ is small, the number of controls in the under-sampled data is small, and the resulting estimators are not as efficient as the full-data estimator. For example, when $\pi_{0}=0.005$, the numbers of cases and controls are roughly the same, and we do see significant information loss in this case. However, as $\pi_{0}$ gets larger, the under-sampled estimators become more efficient, and when $\pi_{0}>0.1$, they perform almost as well as the full-data estimator. In addition, the unweighted estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$ is more efficient than the weighted estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$ for smaller $\pi_{0}$'s, and both perform more and more similarly to the full data estimator $\hat{\bm{\theta}}$ as $\pi_{0}$ grows. These observations are consistent with the conclusions in Theorems 2 and 3, and with the discussions in the relevant remarks.

Figure 1(b) plots eMSEs ($\times 10^{3}$) against $\log(\lambda_{n}+1)$. We see that the case over-sampled estimators are less efficient than the full data estimator unless the average over-sampling rate $\lambda_{n}$ is very small or very large. For small $\lambda_{n}$, $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$ and $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ perform similarly, but $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$ is more efficient than $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ for large $\lambda_{n}$. The reason for this phenomenon is that if $\lambda_{n}$ is large, then the condition $\lambda_{n}e^{\alpha_{t}}\rightarrow c_{o}\in[0,\infty)$ required in Theorem 5 for $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ may not be valid. This confirms our recommendation that the weighted estimator $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$ is preferable if over-sampling has to be used.

(a) eMSEs ($\times 10^{3}$) for under-sampling. (b) eMSEs for over-sampling.
Figure 1: Empirical MSEs ($\times 10^{3}$) of under-sampled and over-sampled estimators. The eMSE ($\times 10^{3}$) for the full data estimator $\hat{\bm{\theta}}$ (the horizontal line) is also plotted for comparison. A smaller eMSE means that the corresponding estimator has a higher estimation efficiency.

6 Discussion and future research

In this paper, we have obtained distributional results showing that the amount of information contained in massive data with rare events is at the scale of the relatively small total number of cases rather than the large total number of observations. We have further demonstrated that aggressively under-sampling the controls may not sacrifice the estimation efficiency at all while over-sampling the cases may reduce the estimation efficiency.

Although the current paper focuses on the logistic regression model, we conjecture that our conclusions hold more generally for rare events data, and we will investigate more complicated and general models in future research projects. As another direction, more comprehensive numerical experiments would be helpful to gain further understanding of parameter estimation with imbalanced data. This paper has focused on point estimation; how to make valid and more accurate statistical inference with rare events data still needs further research. There is a long-standing literature investigating the effects of under-sampling and over-sampling in classification. However, most investigations adopted an empirical approach, so theoretical investigations on the effects of sampling are still needed for classification.

Appendix

In this section, we prove all the theoretical results in the paper. To facilitate the presentation of the proofs, denote

a_{n}=\sqrt{ne^{\alpha_{t}}}.

The condition that $\mathbb{E}(e^{t\|\mathbf{x}\|})<\infty$ for any $t>0$ implies that

\mathbb{E}(e^{t_{1}\|\mathbf{x}\|}\|\mathbf{z}\|^{t_{2}})<\infty,   (A.1)

for any $t_{1}>0$ and $t_{2}>0$, and we will use this result multiple times in the proof. The inequality in (A.1) is true because for any $t_{1}>0$ and $t_{2}>0$, we can choose $t>t_{1}$ and $k>t_{2}$ so that

e^{t\|\mathbf{x}\|}\geq e^{-t}e^{t\|\mathbf{z}\|}=e^{-t}e^{t_{1}\|\mathbf{z}\|}e^{(t-t_{1})\|\mathbf{z}\|}\geq\frac{(t-t_{1})^{k}e^{-t}}{k!}e^{t_{1}\|\mathbf{x}\|}\|\mathbf{z}\|^{k}\geq\frac{(t-t_{1})^{k}e^{-t}}{k!}e^{t_{1}\|\mathbf{x}\|}\|\mathbf{z}\|^{t_{2}}

with probability one.

Appendix A.1 Proof of Theorem 1

Proof of Theorem 1.

The estimator $\hat{\bm{\theta}}$ is the maximizer of

\ell(\bm{\theta})=\sum_{i=1}^{n}\big[(\alpha+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta})y_{i}-\log\{1+\exp(\alpha+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta})\}\big],   (A.2)

so $\mathbf{u}_{n}=a_{n}(\hat{\bm{\theta}}-\bm{\theta}_{t})$ is the maximizer of

\gamma(\mathbf{u})=\ell(\bm{\theta}_{t}+a_{n}^{-1}\mathbf{u})-\ell(\bm{\theta}_{t}).   (A.3)

By Taylor's expansion,

\gamma(\mathbf{u})=a_{n}^{-1}\mathbf{u}^{\mathrm{T}}\dot{\ell}(\bm{\theta}_{t})+0.5a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})(\mathbf{z}_{i}^{\mathrm{T}}\mathbf{u})^{2},   (A.4)

where $\phi_{i}(\bm{\theta})=p_{i}(\alpha,\bm{\beta})\{1-p_{i}(\alpha,\bm{\beta})\}$, and

\dot{\ell}(\bm{\theta})=\frac{\partial\ell(\bm{\theta})}{\partial\bm{\theta}}=\sum_{i=1}^{n}\{y_{i}-p_{i}(\bm{\theta})\}\mathbf{z}_{i}=\sum_{i=1}^{n}\{y_{i}-p_{i}(\alpha,\bm{\beta})\}\mathbf{z}_{i}

is the gradient of $\ell(\bm{\theta})$, and $\acute{\mathbf{u}}$ lies between $\mathbf{0}$ and $\mathbf{u}$. If we can show that

a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})\longrightarrow\mathbb{N}\big(\mathbf{0},\ \mathbf{M}_{f}\big),   (A.5)

in distribution, and for any $\mathbf{u}$,

a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbf{M}_{f},   (A.6)

in probability, then from the Basic Corollary on page 2 of Hjort and Pollard (2011), we know that $a_{n}(\hat{\bm{\theta}}-\bm{\theta}_{t})$, the maximizer of $\gamma(\mathbf{u})$, satisfies

a_{n}(\hat{\bm{\theta}}-\bm{\theta}_{t})=\mathbf{M}_{f}^{-1}\times a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})+o_{P}(1).   (A.7)

Slutsky's theorem together with (A.5) and (A.7) implies the result in Theorem 1. We prove (A.5) and (A.6) in the following.

Note that

\dot{\ell}(\bm{\theta}_{t})=\sum_{i=1}^{n}\big\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\big\}\mathbf{z}_{i}   (A.8)

is a summation of i.i.d. quantities. Since $\alpha_{t}\rightarrow-\infty$ as $n\rightarrow\infty$, the distribution of $\{y-p(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}$ depends on $n$, so we need to use a central limit theorem for triangular arrays. The Lindeberg-Feller central limit theorem (see Section 2.8 of van der Vaart, 1998) is appropriate.

We examine the mean and variance of $a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})$. For the mean, from the fact that

\mathbb{E}[\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}]=\mathbb{E}[\mathbb{E}\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})|\mathbf{z}_{i}\}\mathbf{z}_{i}]=\mathbf{0},

we know that $\mathbb{E}\{a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})\}=\mathbf{0}$.

For the variance,

\mathbb{V}\{a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})\}=a_{n}^{-2}\sum_{i=1}^{n}\mathbb{V}[\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}]=a_{n}^{-2}n\mathbb{E}\{\phi(\bm{\theta}_{t})\mathbf{z}\mathbf{z}^{\mathrm{T}}\}=a_{n}^{-2}n\mathbb{E}\bigg\{\frac{e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\bigg\}=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\bigg\}.

Note that

\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\longrightarrow e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}},

almost surely, and

\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\leq e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}\quad\text{ with }\quad\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2})<\infty.

Thus, from the dominated convergence theorem,

\mathbb{V}\{a_{n}^{-1}\dot{\ell}(\bm{\theta}_{t})\}=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\bigg\}\longrightarrow\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big).

Now we check the Lindeberg-Feller condition. For any $\epsilon>0$,

\sum_{i=1}^{n}\mathbb{E}\Big[\|\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}\|^{2}I(\|\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}\|>a_{n}\epsilon)\Big]
=n\mathbb{E}\Big[\|\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\Big]
=n\mathbb{E}\big[p(\bm{\theta}_{t})\{1-p(\bm{\theta}_{t})\}^{2}\|\mathbf{z}\|^{2}I(\|\{1-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
\quad+n\mathbb{E}\big[\{1-p(\bm{\theta}_{t})\}\{p(\bm{\theta}_{t})\}^{2}\|\mathbf{z}\|^{2}I(\|p(\bm{\theta}_{t})\mathbf{z}\|>a_{n}\epsilon)\big]
\leq n\mathbb{E}\big[p(\bm{\theta}_{t})\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big]+n\mathbb{E}\big[\{p(\bm{\theta}_{t})\}^{2}\|\mathbf{z}\|^{2}I(\|p(\bm{\theta}_{t})\mathbf{z}\|>a_{n}\epsilon)\big]
\leq a_{n}^{2}\mathbb{E}\{e^{\|\bm{\beta}_{t}\|\|\mathbf{x}\|}\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\}+a_{n}^{2}\mathbb{E}\{e^{\|\bm{\beta}_{t}\|\|\mathbf{x}\|}\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\}
=o(a_{n}^{2}),

where the last step is from the dominated convergence theorem. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998), we finish the proof of (A.5).

The last step is to prove (A.6). We first show that

\bigg|a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\|\mathbf{z}_{i}\|^{2}-a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t})\|\mathbf{z}_{i}\|^{2}\bigg|
\leq a_{n}^{-2}\sum_{i=1}^{n}\big|\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})-\phi_{i}(\bm{\theta}_{t})\big|\|\mathbf{z}_{i}\|^{2}
\leq\|a_{n}^{-1}\acute{\mathbf{u}}\|a_{n}^{-2}\sum_{i=1}^{n}p_{i}(\bm{\theta}_{t}+a_{n}^{-1}\breve{\mathbf{u}})\|\mathbf{z}_{i}\|^{3}
=\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{n}\sum_{i=1}^{n}\frac{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}+a_{n}^{-1}\breve{\mathbf{u}}^{\mathrm{T}}\mathbf{z}_{i}}}{\{1+e^{\bm{\theta}_{t}^{\mathrm{T}}\mathbf{z}_{i}+a_{n}^{-1}\breve{\mathbf{u}}^{\mathrm{T}}\mathbf{z}_{i}}\}^{2}}\|\mathbf{z}_{i}\|^{3}
\leq\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{n}\sum_{i=1}^{n}e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)(1+\|\mathbf{x}_{i}\|)}\|\mathbf{z}_{i}\|^{3}=o_{P}(1).   (A.9)

Here $\breve{\mathbf{u}}$ lies between $\mathbf{0}$ and $\acute{\mathbf{u}}$, and thus $\|a_{n}^{-1}\breve{\mathbf{u}}\|\leq\|\mathbf{u}\|$ for $a_{n}\geq 1$.

To finish the proof, we only need to prove that

a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}),   (A.10)

in probability. This is done by noting that

a_{n}^{-2}\sum_{i=1}^{n}\phi_{i}(\bm{\theta}_{t})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\frac{1}{ne^{\alpha_{t}}}\sum_{i=1}^{n}\frac{e^{\bm{\theta}_{t}^{\mathrm{T}}\mathbf{z}_{i}}}{(1+e^{\bm{\theta}_{t}^{\mathrm{T}}\mathbf{z}_{i}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}   (A.11)
=\frac{1}{n}\sum_{i=1}^{n}\frac{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\bm{\theta}_{t}^{\mathrm{T}}\mathbf{z}_{i}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\mathbb{E}(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}})+o_{P}(1),   (A.12)

by Proposition 1 of Wang (2019). ∎

Appendix A.2 Proof of Theorem 2

Proof of Theorem 2.

The estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}$ is the maximizer of $\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta})$ defined in (10), so $a_{n}(\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{w}}-\bm{\theta}_{t})$ is the maximizer of $\gamma_{\mathrm{under}}^{\mathrm{w}}(\mathbf{u})=\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta}_{t}+a_{n}^{-1}\mathbf{u})-\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta}_{t})$. By Taylor's expansion,

\gamma_{\mathrm{under}}^{\mathrm{w}}(\mathbf{u})=\frac{1}{a_{n}}\mathbf{u}^{\mathrm{T}}\dot{\ell}_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta}_{t})+\frac{1}{2a_{n}^{2}}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})(\mathbf{z}_{i}^{\mathrm{T}}\mathbf{u})^{2},   (A.13)

where

\dot{\ell}_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta})=\frac{\partial\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta})}{\partial\bm{\theta}}=\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\{y_{i}-p_{i}(\bm{\theta})\}\mathbf{z}_{i}=\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\{y_{i}-p_{i}(\alpha,\bm{\beta})\}\mathbf{z}_{i}

is the gradient of $\ell_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta})$, and $\acute{\mathbf{u}}$ lies between $\mathbf{0}$ and $\mathbf{u}$. Similarly to the proof of Theorem 1, we only need to show that

a_{n}^{-1}\dot{\ell}_{\mathrm{under}}^{\mathrm{w}}(\bm{\theta}_{t})\longrightarrow\mathbb{N}\Big[\mathbf{0},\ \mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}(1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})\mathbf{z}\mathbf{z}^{\mathrm{T}}\big\}\Big],   (A.14)

in distribution, and for any $\mathbf{u}$,

a_{n}^{-2}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big),   (A.15)

in probability.

We prove (A.14) first. Recall that $\mathcal{D}_{n}$ is the full data set and $\delta_{i}=y_{i}+(1-y_{i})I(u_{i}\leq\pi_{0})$, satisfying

\pi_{i}=\mathbb{E}(\delta_{i}|\mathcal{D}_{n})=y_{i}+(1-y_{i})\pi_{0}=\pi_{0}+(1-\pi_{0})y_{i}.

We notice that

\mathbb{E}(\delta_{i}|\mathbf{z}_{i})=p_{i}(\alpha_{t},\bm{\beta}_{t})+\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\pi_{0}=\pi_{0}+(1-\pi_{0})p_{i}(\alpha_{t},\bm{\beta}_{t}).

Let $\eta_{i}=\frac{\delta_{i}}{\pi_{i}}\{y_{i}-p_{i}(\bm{\theta}_{t})\}\mathbf{z}_{i}$. Then $\eta_{i}$, $i=1,...,n$, are i.i.d., with the underlying distribution of $\eta_{i}$ depending on $n$. From direct calculation, we have

\mathbb{E}(\eta_{i}|\mathbf{z}_{i})=\mathbf{0},\quad\text{ and}
\mathbb{V}(\eta_{i}|\mathbf{z}_{i})=\mathbb{E}\bigg[\frac{\{y_{i}-p_{i}(\bm{\theta}_{t})\}^{2}}{\pi_{0}+y_{i}(1-\pi_{0})}\bigg|\mathbf{z}_{i}\bigg]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}
=\big[p_{i}(\bm{\theta}_{t})\{1-p_{i}(\bm{\theta}_{t})\}^{2}+\pi_{0}^{-1}\{1-p_{i}(\bm{\theta}_{t})\}\{p_{i}(\bm{\theta}_{t})\}^{2}\big]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}
=\big\{1-p_{i}(\bm{\theta}_{t})+\pi_{0}^{-1}p_{i}(\bm{\theta}_{t})\big\}p_{i}(\bm{\theta}_{t})\{1-p_{i}(\bm{\theta}_{t})\}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}
=\frac{1+\pi_{0}^{-1}e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}p_{i}(\bm{\theta}_{t})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}
\leq e^{\alpha_{t}}(1+\pi_{0}^{-1}e^{\alpha_{t}}e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}.

Thus, by the dominated convergence theorem, we obtain that

\mathbb{V}(\eta_{i})=\mathbb{E}\{\mathbb{V}(\eta_{i}|\mathbf{z}_{i})\}=e^{\alpha_{t}}\mathbb{E}\Big\{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}(1+ce^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\Big\}\{1+o(1)\}.   (A.16)

Now we check the Lindeberg-Feller condition (Section 2.8 of van der Vaart, 1998). For simplicity, let $\pi=\pi_{0}+(1-\pi_{0})y$ and $\delta=y+(1-y)I(u\leq\pi)$, where $u\sim\mathbb{U}(0,1)$. For any $\epsilon>0$,

\sum_{i=1}^{n}\mathbb{E}\big\{\|\eta_{i}\|^{2}I(\|\eta_{i}\|>a_{n}\epsilon)\big\}
=n\mathbb{E}\big[\|\pi^{-1}\delta\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\pi^{-1}\delta\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
=\pi_{0}n\mathbb{E}\big[\|\pi^{-1}\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\pi^{-1}\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
\quad+(1-\pi_{0})n\mathbb{E}\big[\pi^{-1}\|y\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\pi^{-1}y\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
=\pi_{0}n\mathbb{E}\big[p(\bm{\theta}_{t})\|\{1-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\{1-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
\quad+\pi_{0}^{-1}n\mathbb{E}\big[\{1-p(\bm{\theta}_{t})\}\|p(\bm{\theta}_{t})\mathbf{z}\|^{2}I(\pi_{0}^{-1}\|p(\bm{\theta}_{t})\mathbf{z}\|>a_{n}\epsilon)\big]
\quad+(1-\pi_{0})n\mathbb{E}\big[p(\bm{\theta}_{t})\|\{1-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|\{1-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]
\leq n\mathbb{E}\big\{p(\bm{\theta}_{t})\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big\}+n\pi_{0}^{-1}\mathbb{E}\big\{\|p(\bm{\theta}_{t})\mathbf{z}\|^{2}I(\|\pi_{0}^{-1}p(\bm{\theta}_{t})\mathbf{z}\|>a_{n}\epsilon)\big\}
\leq ne^{\alpha_{t}}\mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big\}+n\pi_{0}^{-1}e^{2\alpha_{t}}\mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}I(\pi_{0}^{-1}e^{\alpha_{t}}e^{\alpha_{t}}\|\mathbf{z}\|>a_{n}\epsilon)\big\}
=o(ne^{\alpha_{t}})=o(a_{n}^{2}),

where the second last step is from the dominated convergence theorem and the facts that $a_{n}\rightarrow\infty$ and $\lim_{n\rightarrow\infty}e^{\alpha_{t}}/\pi_{0}=c<\infty$. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998) finishes the proof of (A.14).

Now we prove (A.15). By direct calculation, we first notice that

\begin{align*}
\Delta_{1}\equiv a_{n}^{-2}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\phi_{i}(\bm{\theta}_{t})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\frac{1}{n}\sum_{i=1}^{n}\frac{\{y_{i}+(1-y_{i})I(u_{i}\leq\pi_{0})\}e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{\pi_{i}(1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}
\tag{A.17}
\end{align*}

has a mean of

\begin{align*}
\mathbb{E}(\Delta_{1})=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}=\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)+o(1),
\tag{A.18}
\end{align*}

where the last step is by the dominated convergence theorem. In addition, the variance of each component of $\Delta_{1}$ is bounded by

\begin{align*}
\frac{1}{n}\mathbb{E}\bigg\{\frac{e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4}}{\pi(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{4}}\bigg\}\leq\frac{\mathbb{E}(e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4})}{n\pi_{0}}=o(1),
\tag{A.19}
\end{align*}

where the last step is because $ne^{\alpha_{t}}\rightarrow\infty$ and $e^{\alpha_{t}}/\pi_{0}\rightarrow c<\infty$ imply that $n\pi_{0}\rightarrow\infty$. From (A.18) and (A.19), Chebyshev's inequality implies that $\Delta_{1}\rightarrow\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)$ in probability. Notice that

\begin{align*}
&\bigg|a_{n}^{-2}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\|\mathbf{z}_{i}\|^{2}-a_{n}^{-2}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}\phi_{i}(\bm{\theta}_{t})\|\mathbf{z}_{i}\|^{2}\bigg|\\
&\leq\|a_{n}^{-1}\acute{\mathbf{u}}\|\,a_{n}^{-2}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}p_{i}(\bm{\theta}_{t}+a_{n}^{-1}\breve{\mathbf{u}})\|\mathbf{z}_{i}\|^{3}\\
&\leq\|a_{n}^{-1}\acute{\mathbf{u}}\|\times\frac{1}{n}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{i}}e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)\|\mathbf{z}_{i}\|}\|\mathbf{z}_{i}\|^{3}\equiv\|a_{n}^{-1}\acute{\mathbf{u}}\|\times\Delta_{2}.
\end{align*}

Since $\|a_{n}^{-1}\acute{\mathbf{u}}\|\rightarrow 0$, to finish the proof of (A.15), we only need to prove that $\Delta_{2}$ is bounded in probability. Using an approach similar to (A.18) and (A.19), we can show that $\Delta_{2}$ has a mean that is bounded and a variance that converges to zero.
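As a purely numerical illustration of the weighted under-sampled estimating equation analyzed above (it keeps every case with weight one and every sampled control with weight $1/\pi_{0}$), the following Python sketch simulates a rare-events logistic model, under-samples the controls with acceptance probability $\pi_{0}$, and solves the weighted score equation by Newton-Raphson. All simulation settings ($n$, $\pi_{0}$, $\bm{\theta}_{t}$, the covariate distribution) are arbitrary assumptions made only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative rare-events design (all settings are assumptions for this sketch).
n, d = 200_000, 3
alpha_t, beta_t = -7.0, np.array([0.5, -0.5, 1.0])
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha_t + X @ beta_t))))

# Under-sample the controls: keep every case, keep a control with probability pi0.
pi0 = 0.01
keep = (y == 1) | (rng.uniform(size=n) < pi0)
w = np.where(y == 1, 1.0, 1.0 / pi0)[keep]            # delta_i / pi_i weights
Z = np.column_stack([np.ones(keep.sum()), X[keep]])   # z_i = (1, x_i^T)^T
yk = y[keep]

# Solve the weighted under-sampled score equation by Newton-Raphson (plain IRLS).
theta = np.zeros(d + 1)
for _ in range(30):
    pr = 1.0 / (1.0 + np.exp(-(Z @ theta)))
    score = Z.T @ (w * (yk - pr))
    hess = (Z * (w * pr * (1.0 - pr))[:, None]).T @ Z
    step = np.linalg.solve(hess, score)
    theta += step
    if np.max(np.abs(step)) < 1e-8:
        break

print("weighted under-sampled estimate:", theta)
print("true theta_t:                   ", np.r_[alpha_t, beta_t])
```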

Appendix A.3 Proof of Theorem 3

Proof of Theorem 3.

If we use $\Upsilon_{bc}$ to denote the under-sampled objective function shifted by $\mathbf{b}$, i.e., $\Upsilon_{bc}(\bm{\theta})=\ell_{\mathrm{under}}^{\mathrm{u}}(\bm{\theta}-\mathbf{b})$, then the estimator $\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}$ is the maximizer of

\begin{align*}
\Upsilon_{bc}(\bm{\theta})=\sum_{i=1}^{n}\delta_{i}\big[(\bm{\theta}-\mathbf{b})^{\mathrm{T}}\mathbf{z}_{i}y_{i}-\log\{1+e^{(\bm{\theta}-\mathbf{b})^{\mathrm{T}}\mathbf{z}_{i}}\}\big].
\tag{A.20}
\end{align*}

We notice that $a_{n}(\hat{\bm{\theta}}_{\mathrm{under}}^{\mathrm{ubc}}-\bm{\theta}_{t})$ is the maximizer of $\gamma_{bc}(\mathbf{u})=\Upsilon_{bc}(\bm{\theta}_{t}+a_{n}^{-1}\mathbf{u})-\Upsilon_{bc}(\bm{\theta}_{t})$. By Taylor's expansion,

\begin{align*}
\gamma_{bc}(\mathbf{u})=\frac{1}{a_{n}}\mathbf{u}^{\mathrm{T}}\dot{\Upsilon}_{bc}(\bm{\theta}_{t})+\frac{1}{2a_{n}^{2}}\sum_{i=1}^{n}\delta_{i}\phi_{i}(\bm{\theta}_{t}-\mathbf{b}+a_{n}^{-1}\acute{\mathbf{u}})(\mathbf{z}_{i}^{\mathrm{T}}\mathbf{u})^{2},
\tag{A.21}
\end{align*}

where

\[
\dot{\Upsilon}_{bc}(\bm{\theta})=\frac{\partial\Upsilon_{bc}(\bm{\theta})}{\partial\bm{\theta}}=\sum_{i=1}^{n}\delta_{i}\{y_{i}-p_{i}(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}_{i}=\sum_{i=1}^{n}\delta_{i}\{y_{i}-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}\mathbf{z}_{i}
\]

is the gradient of $\Upsilon_{bc}(\bm{\theta})$, and $\acute{\mathbf{u}}$ lies between $\mathbf{0}$ and $\mathbf{u}$.

Similarly to the proof of Theorem 1, we only need to show that

\begin{align*}
a_{n}^{-1}\dot{\Upsilon}_{bc}(\bm{\theta}_{t})\longrightarrow\mathbb{N}\bigg\{\mathbf{0},\ \mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\bigg)\bigg\},
\tag{A.22}
\end{align*}

in distribution, and for any $\mathbf{u}$,

\begin{align*}
a_{n}^{-2}\sum_{i=1}^{n}\delta_{i}\phi_{i}(\bm{\theta}_{t}-\mathbf{b}+a_{n}^{-1}\acute{\mathbf{u}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbb{E}\bigg(\frac{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{1+ce^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\bigg)
\tag{A.23}
\end{align*}

in probability.

We prove (A.22) first. Define $\eta_{ui}=\delta_{i}\{y_{i}-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}\mathbf{z}_{i}$. We have that

\begin{align*}
\mathbb{E}(\eta_{ui}\mid\mathbf{z}_{i})
&=\mathbb{E}[\{\pi_{0}+y_{i}(1-\pi_{0})\}\{y_{i}-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}\mid\mathbf{z}_{i}]\mathbf{z}_{i}\\
&=[p_{i}(\alpha_{t},\bm{\beta}_{t})\{1-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}-\pi_{0}\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}p_{i}(\alpha_{t}-b,\bm{\beta}_{t})]\mathbf{z}_{i}=\mathbf{0},
\end{align*}

which implies that $\mathbb{E}(\eta_{ui})=\mathbf{0}$. For the conditional variance,

\begin{align*}
\mathbb{V}(\eta_{ui}\mid\mathbf{z}_{i})
&=\mathbb{E}[\{\pi_{0}+y_{i}(1-\pi_{0})\}\{y_{i}-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}^{2}\mid\mathbf{z}_{i}]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\big[p_{i}(\alpha_{t},\bm{\beta}_{t})\{1-p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}^{2}+\pi_{0}\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\{p_{i}(\alpha_{t}-b,\bm{\beta}_{t})\}^{2}\big]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}+\pi_{0}e^{2(\alpha_{t}-b_{0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t})}}{(1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})(1+e^{\alpha_{t}-b_{0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{1+e^{\alpha_{t}-b_{0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\leq e^{\alpha_{t}}e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}},
\end{align*}

where $e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}$ is integrable. Thus, by the dominated convergence theorem, $\mathbb{V}(\eta_{ui})$ satisfies that

\begin{align*}
\mathbb{V}(\eta_{ui})=\mathbb{E}\{\mathbb{V}(\eta_{ui}\mid\mathbf{z}_{i})\}
=e^{\alpha_{t}}\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg)\{1+o(1)\}.
\tag{A.24}
\end{align*}

Therefore, we have

\begin{align*}
a_{n}^{-2}\sum_{i=1}^{n}\mathbb{V}(\eta_{ui})\longrightarrow\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg).
\tag{A.25}
\end{align*}

Now we check the Lindeberg-Feller condition. For any $\epsilon>0$,

\begin{align*}
\sum_{i=1}^{n}\mathbb{E}\big\{\|\eta_{ui}\|^{2}&I(\|\eta_{ui}\|>a_{n}\epsilon)\big\}\\
=&\,n\mathbb{E}\big[\|\delta\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|^{2}I(\|\delta\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
=&\,\pi_{0}n\mathbb{E}\big[\|\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|^{2}I(\|\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
&+(1-\pi_{0})n\mathbb{E}\big[\|y\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|^{2}I(\|y\{y-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
=&\,\pi_{0}n\mathbb{E}\big[p(\bm{\theta}_{t})\|\{1-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|^{2}I(\|\{1-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
&+\pi_{0}n\mathbb{E}\big[\{1-p(\bm{\theta}_{t})\}\|p(\bm{\theta}_{t}-\mathbf{b})\mathbf{z}\|^{2}I(\|p(\bm{\theta}_{t}-\mathbf{b})\mathbf{z}\|>a_{n}\epsilon)\big]\\
&+(1-\pi_{0})n\mathbb{E}\big[p(\bm{\theta}_{t})\|\{1-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|^{2}I(\|\{1-p(\bm{\theta}_{t}-\mathbf{b})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
\leq&\,n\mathbb{E}\big\{p(\bm{\theta}_{t})\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big\}+\pi_{0}n\mathbb{E}\big[\|p(\bm{\theta}_{t}-\mathbf{b})\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big]\\
\leq&\,ne^{\alpha_{t}}\mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big\}+\pi_{0}^{-1}ne^{2\alpha_{t}}\mathbb{E}\big\{e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{2}I(\|\mathbf{z}\|>a_{n}\epsilon)\big\}\\
=&\,o(ne^{\alpha_{t}})=o(a_{n}^{2}),
\end{align*}

where the second last step is from the dominated convergence theorem. Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998) finishes the proof of (A.22).

Now we prove (A.23). First, letting

\begin{align*}
\Delta_{3}\equiv a_{n}^{-2}\sum_{i=1}^{n}\delta_{i}\phi_{i}(\bm{\theta}_{t}-\mathbf{b})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\frac{1}{n}\sum_{i=1}^{n}\frac{\{y_{i}+(1-y_{i})I(u_{i}\leq\pi_{0})\}e^{-b_{0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}-b_{0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}},
\tag{A.26}
\end{align*}

the mean of $\Delta_{3}$ satisfies that

\begin{align*}
\mathbb{E}(\Delta_{3})=\mathbb{E}\bigg[\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})(1+e^{\alpha_{t}-b_{0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg]=\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg)+o(1),
\tag{A.27}
\end{align*}

by the dominated convergence theorem, and the variance of each component of $\Delta_{3}$ is bounded by

\begin{align*}
\frac{1}{n}\mathbb{E}\bigg[\frac{\{y+(1-y)I(u\leq\pi_{0})\}e^{-2b_{0}+2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}-b_{0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{4}}\|\mathbf{z}\|^{4}\bigg]\leq\frac{\mathbb{E}(e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4})}{n\pi_{0}}=o(1).
\tag{A.28}
\end{align*}

Thus, Chebyshev’s inequality implies that

\begin{align*}
\Delta_{3}\longrightarrow\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+ce^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg),
\tag{A.29}
\end{align*}

in probability. Furthermore,

\begin{align*}
&\bigg|a_{n}^{-2}\sum_{i=1}^{n}\delta_{i}\phi_{i}(\bm{\theta}_{t}-\mathbf{b}+a_{n}^{-1}\acute{\mathbf{u}})\|\mathbf{z}_{i}\|^{2}-a_{n}^{-2}\sum_{i=1}^{n}\delta_{i}\phi_{i}(\bm{\theta}_{t}-\mathbf{b})\|\mathbf{z}_{i}\|^{2}\bigg|\\
&\leq\|a_{n}^{-1}\acute{\mathbf{u}}\|\,a_{n}^{-2}\sum_{i=1}^{n}\delta_{i}p_{i}(\bm{\theta}_{t}-\mathbf{b}+a_{n}^{-1}\breve{\mathbf{u}})\|\mathbf{z}_{i}\|^{3}\\
&\leq\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{n}\sum_{i=1}^{n}\frac{\delta_{i}}{\pi_{0}}e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)(1+\|\mathbf{x}_{i}\|)}\|\mathbf{z}_{i}\|^{3}\equiv\|a_{n}^{-1}\acute{\mathbf{u}}\|\times\Delta_{4}=o_{P}(1),
\tag{A.30}
\end{align*}

where the last step is because $\Delta_{4}$ is bounded in probability, due to the fact that it has a bounded mean and a variance that converges to zero. Combining (A.29) and (A.30), (A.23) follows. ∎
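The zero-conditional-mean calculation above holds exactly when $e^{b}=\pi_{0}$, i.e., $b=\log\pi_{0}$, so the bias-corrected under-sampled estimator can be computed by fitting an ordinary (unweighted) logistic regression to the under-sampled data and shifting only the intercept by $\log\pi_{0}$. A minimal illustrative sketch, under the same assumed simulation settings as the previous snippet (not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200_000, 3
alpha_t, beta_t = -7.0, np.array([0.5, -0.5, 1.0])
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha_t + X @ beta_t))))

pi0 = 0.01
keep = (y == 1) | (rng.uniform(size=n) < pi0)         # keep all cases, pi0 of controls
Z = np.column_stack([np.ones(keep.sum()), X[keep]])
yk = y[keep]

# Plain (unweighted) MLE on the under-sampled data.
theta = np.zeros(d + 1)
for _ in range(30):
    pr = 1.0 / (1.0 + np.exp(-(Z @ theta)))
    hess = (Z * (pr * (1.0 - pr))[:, None]).T @ Z
    theta += np.linalg.solve(hess, Z.T @ (yk - pr))

theta_bc = theta.copy()
theta_bc[0] += np.log(pi0)   # intercept shift b = log(pi0), since the proof needs e^b = pi0
print("bias-corrected under-sampled estimate:", theta_bc)
print("true theta_t:                         ", np.r_[alpha_t, beta_t])
```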

Appendix A.4 Proof of Proposition 1

Proof of Proposition 1.

Let

\[
\mathbf{g}=\frac{1}{\sqrt{h}}\big\{\mathbb{E}(h^{-1}\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}\mathbf{v}-\sqrt{h}\big\{\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}\mathbf{v}.
\]

Since $\mathbf{g}\mathbf{g}^{\mathrm{T}}\geq\mathbf{0}$, we have

\[
\mathbf{0}\leq\mathbb{E}(\mathbf{g}\mathbf{g}^{\mathrm{T}})=\big\{\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}\mathbb{E}(h\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\{\mathbb{E}(\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1}-\big\{\mathbb{E}(h^{-1}\mathbf{v}\mathbf{v}^{\mathrm{T}})\big\}^{-1},
\]

which finishes the proof. ∎
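A quick Monte Carlo sanity check of this matrix inequality (illustration only; the distribution of $\mathbf{v}$ and the positive weight $h$ below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 500_000, 3
v = rng.normal(size=(m, d))               # any v with a nonsingular E(v v^T)
h = rng.uniform(0.2, 5.0, size=m)         # any positive weight h

E_vv     = (v.T @ v) / m
E_hvv    = (v.T @ (v * h[:, None])) / m
E_hinvvv = (v.T @ (v * (1.0 / h)[:, None])) / m

A   = np.linalg.inv(E_vv)
lhs = A @ E_hvv @ A                       # {E(vv^T)}^{-1} E(h vv^T) {E(vv^T)}^{-1}
rhs = np.linalg.inv(E_hinvvv)             # {E(h^{-1} vv^T)}^{-1}

# The difference should be positive semi-definite (up to Monte Carlo error).
print("smallest eigenvalue of lhs - rhs:", np.linalg.eigvalsh(lhs - rhs).min())
```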

Appendix A.5 Proof of Theorem 4

Proof of Theorem 4.

The estimator $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}$ is the maximizer of (20), so $a_{n}(\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{w}}-\bm{\theta}_{t})$ is the maximizer of $\gamma_{\mathrm{over}}^{\mathrm{w}}(\mathbf{u})=\ell_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta}_{t}+a_{n}^{-1}\mathbf{u})-\ell_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta}_{t})$. By Taylor's expansion,

\begin{align*}
\gamma_{\mathrm{over}}^{\mathrm{w}}(\mathbf{u})=\frac{1}{a_{n}}\mathbf{u}^{\mathrm{T}}\dot{\ell}_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta}_{t})+\frac{1}{2a_{n}^{2}}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})(\mathbf{z}_{i}^{\mathrm{T}}\mathbf{u})^{2},
\tag{A.31}
\end{align*}

where

\[
\dot{\ell}_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta})=\frac{\partial\ell_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta})}{\partial\bm{\theta}}=\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\{y_{i}-p_{i}(\bm{\theta}_{t})\}\mathbf{z}_{i}=\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\{y_{i}-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\mathbf{z}_{i}
\]

is the gradient of $\ell_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta})$, and $\acute{\mathbf{u}}$ lies between $\mathbf{0}$ and $\mathbf{u}$. Similarly to the proof of Theorem 1, we only need to show that

\begin{align*}
a_{n}^{-1}\dot{\ell}_{\mathrm{over}}^{\mathrm{w}}(\bm{\theta}_{t})\longrightarrow\mathbb{N}\Big\{\mathbf{0},\ \frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)\Big\},
\tag{A.32}
\end{align*}

in distribution, and for any $\mathbf{u}$,

\begin{align*}
a_{n}^{-2}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big),
\tag{A.33}
\end{align*}

in probability.

We prove (A.32) first. Denote $\eta_{owi}=\tau_{i}w_{i}^{-1}\{y_{i}-p_{i}(\bm{\theta}_{t})\}\mathbf{z}_{i}$, so $\eta_{owi}$, $i=1,\ldots,n$, are i.i.d. with the underlying distribution of $\eta_{owi}$ depending on $n$. From direct calculation, we have

\begin{align*}
\mathbb{E}(\eta_{owi}\mid\mathbf{z}_{i})
&=\mathbf{0},\quad\text{and}\\
\mathbb{V}(\eta_{owi}\mid\mathbf{z}_{i})
&=\mathbb{E}\bigg[\frac{\{y_{i}(3\lambda_{n}+\lambda_{n}^{2})+1\}\{y_{i}-p_{i}(\bm{\theta}_{t})\}^{2}}{(1+\lambda_{n}y_{i})^{2}}\bigg|\mathbf{z}_{i}\bigg]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\bigg[p_{i}(\bm{\theta}_{t})\{1-p_{i}(\bm{\theta}_{t})\}^{2}\frac{(1+\lambda_{n})^{2}+\lambda_{n}}{(1+\lambda_{n})^{2}}+\{1-p_{i}(\bm{\theta}_{t})\}\{p_{i}(\bm{\theta}_{t})\}^{2}\bigg]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{(1+\lambda_{n})^{2}+\lambda_{n}}{(1+\lambda_{n})^{2}}e^{\alpha_{t}}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\{1+o_{P}(1)\},
\end{align*}

where the $o_{P}(1)$ term is bounded. Thus, by the dominated convergence theorem, we obtain that

\[
\mathbb{V}(\eta_{owi})=\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}e^{\alpha_{t}}\mathbb{E}\big(e^{\mathbf{x}^{\mathrm{T}}\bm{\beta}_{t}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)\{1+o(1)\}.
\]

Now we check the Lindeberg-Feller condition (Section 2.8 of van der Vaart, 1998). Let $w=1+\lambda_{n}y$ and $\tau=yv+1$, where $v\sim\mathbb{POI}(\lambda_{n})$. For any $\epsilon>0$,

\begin{align*}
\sum_{i=1}^{n}\mathbb{E}\big[\|\eta_{owi}\|^{2}&I(\|\eta_{owi}\|>a_{n}\epsilon)\big]\\
&=n\mathbb{E}\big[\|w^{-1}\tau\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{2}I(\|w^{-1}\tau\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|>a_{n}\epsilon)\big]\\
&\leq\frac{n}{a_{n}\epsilon}\mathbb{E}\big[\|w^{-1}\tau\{y-p(\bm{\theta}_{t})\}\mathbf{z}\|^{3}\big]\\
&=\frac{n}{a_{n}\epsilon}\mathbb{E}\bigg[\frac{(1+vy)^{3}}{(1+\lambda_{n}y)^{3}}|y-p(\bm{\theta}_{t})|^{3}\|\mathbf{z}\|^{3}\bigg]\\
&\leq\frac{n}{a_{n}\epsilon}\,\frac{1+7\lambda_{n}+6\lambda_{n}^{2}+\lambda_{n}^{3}}{(1+\lambda_{n})^{3}}\mathbb{E}\{p(\bm{\theta}_{t})\|\mathbf{z}\|^{3}\}+\frac{n}{a_{n}\epsilon}\mathbb{E}[\{p(\bm{\theta}_{t})\}^{3}\|\mathbf{z}\|^{3}]\\
&\leq\frac{a_{n}}{\epsilon}\,\frac{1+7\lambda_{n}+6\lambda_{n}^{2}+\lambda_{n}^{3}}{(1+\lambda_{n})^{3}}\mathbb{E}(e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}\|\mathbf{z}\|^{3})+\frac{a_{n}e^{2\alpha_{t}}}{\epsilon}\mathbb{E}(e^{3\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}\|\mathbf{z}\|^{3})=o(a_{n}^{2}).
\end{align*}

Thus, applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998) finishes the proof of (A.32).
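The constants above are driven by the moments of the Poisson replication count $\tau=1+yv$: for a case ($y=1$), $\mathbb{E}(\tau)=1+\lambda_{n}$, $\mathbb{E}(\tau^{2})=(1+\lambda_{n})^{2}+\lambda_{n}$, and $\mathbb{E}(\tau^{3})=1+7\lambda_{n}+6\lambda_{n}^{2}+\lambda_{n}^{3}<2(1+\lambda_{n})^{3}$. A small Monte Carlo sketch checking these identities (the value of $\lambda_{n}$ is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.7                                   # arbitrary illustrative value of lambda_n
tau = 1 + rng.poisson(lam, size=2_000_000)  # replication count of a case (y = 1)

print(tau.mean(),        1 + lam)                         # E(tau)   = 1 + lambda
print((tau**2.0).mean(), (1 + lam)**2 + lam)              # E(tau^2) = (1+lambda)^2 + lambda
print((tau**3.0).mean(), 1 + 7*lam + 6*lam**2 + lam**3)   # E(tau^3), which is < 2(1+lambda)^3
print(2 * (1 + lam)**3)
```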

Now we prove (A.33). Let

\[
\Delta_{5}\equiv a_{n}^{-2}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\phi_{i}(\bm{\theta}_{t})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\frac{1}{n}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\,\frac{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}.
\]

Since

\[
\mathbb{E}(\Delta_{5})=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}=\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big)+o(1),
\]

by the dominated convergence theorem, and each component of $\Delta_{5}$ has a variance that is bounded by

\[
\frac{1}{n}\mathbb{E}\bigg\{\frac{2e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{4}}\bigg\}\leq\frac{2\mathbb{E}(e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4})}{n}=o(1),
\]

applying Chebyshev’s inequality gives that

\[
\Delta_{5}\longrightarrow\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big),
\]

in probability. Thus, (A.33) follows from the fact that

\begin{align*}
&\bigg|a_{n}^{-2}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\phi_{i}(\bm{\theta}_{t}+a_{n}^{-1}\acute{\mathbf{u}})\|\mathbf{z}_{i}\|^{2}-a_{n}^{-2}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}\phi_{i}(\bm{\theta}_{t})\|\mathbf{z}_{i}\|^{2}\bigg|\\
&\leq\|a_{n}^{-1}\acute{\mathbf{u}}\|\,a_{n}^{-2}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}p_{i}(\bm{\theta}_{t}+a_{n}^{-1}\breve{\mathbf{u}})\|\mathbf{z}_{i}\|^{3}\\
&\leq\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{n}\sum_{i=1}^{n}\frac{\tau_{i}}{w_{i}}e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)\|\mathbf{z}_{i}\|}\|\mathbf{z}_{i}\|^{3}=o_{P}(1),
\end{align*}

where the last step is because $n^{-1}\sum_{i=1}^{n}\tau_{i}w_{i}^{-1}e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)\|\mathbf{z}_{i}\|}\|\mathbf{z}_{i}\|^{3}$ has a bounded mean and a bounded variance, and thus it is bounded in probability. ∎
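Combined with the Hessian limit (A.33) through the usual argument for maximizers of quadratic-type processes, (A.32) says that the asymptotic variance of the weighted over-sampled estimator carries the scalar inflation factor $\{(1+\lambda)^{2}+\lambda\}/(1+\lambda)^{2}=1+\lambda/(1+\lambda)^{2}$, which exceeds one for every $\lambda>0$ and peaks at $1.25$ when $\lambda=1$. A one-line illustration:

```python
# Variance-inflation factor {(1+lambda)^2 + lambda} / (1+lambda)^2 of the
# weighted over-sampled estimator; it exceeds 1 and peaks at 1.25 when lambda = 1.
for lam in [0.1, 0.5, 1.0, 2.0, 10.0]:
    print(lam, ((1 + lam)**2 + lam) / (1 + lam)**2)
```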

Appendix A.6 Proof of Theorem 5

Proof of Theorem 5.

The over-sampled estimator $\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}$ is the maximizer of

\begin{align*}
\Upsilon_{oc}(\bm{\theta})=\frac{1}{1+\lambda_{n}}\sum_{i=1}^{n}\tau_{i}\big[(\bm{\theta}+\mathbf{b}_{o})^{\mathrm{T}}\mathbf{z}_{i}y_{i}-\log\{1+e^{\mathbf{z}_{i}^{\mathrm{T}}(\bm{\theta}+\mathbf{b}_{o})}\}\big].
\tag{A.34}
\end{align*}

Thus, $a_{n}(\hat{\bm{\theta}}_{\mathrm{over}}^{\mathrm{ubc}}-\bm{\theta}_{t})$ is the maximizer of $\gamma_{oc}(\mathbf{u})=\Upsilon_{oc}(\bm{\theta}_{t}+a_{n}^{-1}\mathbf{u})-\Upsilon_{oc}(\bm{\theta}_{t})$. By Taylor's expansion,

\begin{align*}
\gamma_{oc}(\mathbf{u})=\frac{1}{a_{n}}\mathbf{u}^{\mathrm{T}}\dot{\Upsilon}_{oc}(\bm{\theta}_{t})+\frac{1}{2a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}\phi_{i}(\bm{\theta}_{t}+\mathbf{b}_{o}+a_{n}^{-1}\acute{\mathbf{u}})(\mathbf{z}_{i}^{\mathrm{T}}\mathbf{u})^{2},
\tag{A.35}
\end{align*}

where

\[
\dot{\Upsilon}_{oc}(\bm{\theta})=\frac{\partial\Upsilon_{oc}(\bm{\theta})}{\partial\bm{\theta}}=\frac{1}{1+\lambda_{n}}\sum_{i=1}^{n}\tau_{i}\{y_{i}-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}\mathbf{z}_{i}
\]

is the gradient of $\Upsilon_{oc}(\bm{\theta})$, and $\acute{\mathbf{u}}$ lies between $\mathbf{0}$ and $\mathbf{u}$.

Similarly to the proof of Theorem 1, we only need to show that

\begin{align*}
a_{n}^{-1}\dot{\Upsilon}_{oc}(\bm{\theta}_{t})\longrightarrow\mathbb{N}\bigg[\mathbf{0},\ \frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\,\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}\bigg],
\tag{A.36}
\end{align*}

in distribution, and for any $\mathbf{u}$,

\begin{align*}
\frac{1}{a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}\phi_{i}(\bm{\theta}_{t}+\mathbf{b}_{o}+a_{n}^{-1}\acute{\mathbf{u}})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\longrightarrow\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg),
\tag{A.37}
\end{align*}

in probability.

We prove (A.36) first. Let $\eta_{obi}=(1+\lambda_{n})^{-1}\tau_{i}\{y_{i}-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}\mathbf{z}_{i}$. We have that

\begin{align*}
(1+\lambda_{n})\mathbb{E}(\eta_{obi}\mid\mathbf{z}_{i})
&=\mathbb{E}[(1+\lambda_{n}y_{i})\{y_{i}-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}\mid\mathbf{z}_{i}]\mathbf{z}_{i}\\
&=[p_{i}(\alpha_{t},\bm{\beta}_{t})(1+\lambda_{n})\{1-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}\\
&\qquad-\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})]\mathbf{z}_{i}=\mathbf{0},
\end{align*}

which implies that $\mathbb{E}(\eta_{obi})=\mathbf{0}$. For the conditional variance,

\begin{align*}
(1+\lambda_{n})^{2}&\mathbb{V}(\eta_{obi}\mid\mathbf{z}_{i})\\
&=\mathbb{E}[\{1+3\lambda_{n}y_{i}+\lambda_{n}^{2}y_{i}\}\{y_{i}-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}^{2}\mid\mathbf{z}_{i}]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\big[p_{i}(\alpha_{t},\bm{\beta}_{t})(1+3\lambda_{n}+\lambda_{n}^{2})\{1-p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}^{2}+\{1-p_{i}(\alpha_{t},\bm{\beta}_{t})\}\{p_{i}(\alpha_{t}+b_{o0},\bm{\beta}_{t})\}^{2}\big]\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{(1+3\lambda_{n}+\lambda_{n}^{2})e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}+e^{2(\alpha_{t}+b_{o0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t})}}{(1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})(1+e^{\alpha_{t}+b_{o0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{(1+3\lambda_{n}+\lambda_{n}^{2})e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}+b_{o0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\,\frac{1+\frac{1+2\lambda_{n}+\lambda_{n}^{2}}{1+3\lambda_{n}+\lambda_{n}^{2}}e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{1+e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\\
&=\frac{(1+3\lambda_{n}+\lambda_{n}^{2})e^{\alpha_{t}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}+b_{o0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\{1+o_{P}(1)\}\\
&=e^{\alpha_{t}}(1+3\lambda_{n}+\lambda_{n}^{2})\frac{e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+c_{o}e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}\{1+o_{P}(1)\},
\end{align*}

where the $o_{P}(1)$'s above are all bounded and the last step is because $(1+\lambda_{n})e^{\alpha_{t}}\rightarrow c_{o}$. Thus, by the dominated convergence theorem, $\mathbb{V}(\eta_{obi})$ satisfies that

\begin{align*}
\mathbb{V}(\eta_{obi})=e^{\alpha_{t}}\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}\{1+o(1)\},
\tag{A.38}
\end{align*}

which indicates that

\begin{align*}
\frac{1}{a_{n}^{2}}\sum_{i=1}^{n}\mathbb{V}(\eta_{obi})\longrightarrow\frac{(1+\lambda)^{2}+\lambda}{(1+\lambda)^{2}}\,\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}.
\tag{A.39}
\end{align*}

Now we check the Lindeberg-Feller condition. Recall that $\tau=yv+1$, where $v\sim\mathbb{POI}(\lambda_{n})$. We can show that $\mathbb{E}\{(1+v)^{3}\}<2(1+\lambda_{n})^{3}$. For any $\epsilon>0$,

\begin{align*}
a_{n}\epsilon(1+\lambda_{n})^{3}\sum_{i=1}^{n}\mathbb{E}\big\{\|\eta_{obi}\|^{2}I(\|\eta_{obi}\|>a_{n}\epsilon)\big\}
&\leq(1+\lambda_{n})^{3}\sum_{i=1}^{n}\mathbb{E}(\|\eta_{obi}\|^{3})\\
&=n\mathbb{E}\big[\tau^{3}\|\{y-p(\bm{\theta}_{t}+\mathbf{b}_{o})\}\mathbf{z}\|^{3}\big]\\
&=n\mathbb{E}\big[p(\bm{\theta}_{t})(1+v)^{3}\|\{1-p(\bm{\theta}_{t}+\mathbf{b}_{o})\}\mathbf{z}\|^{3}\big]+n\mathbb{E}\big[\{1-p(\bm{\theta}_{t})\}\|p(\bm{\theta}_{t}+\mathbf{b}_{o})\mathbf{z}\|^{3}\big]\\
&\leq 2n(1+\lambda_{n})^{3}\mathbb{E}\big\{p(\bm{\theta}_{t})\|\mathbf{z}\|^{3}\big\}+n\mathbb{E}\big\{\|p(\bm{\theta}_{t}+\mathbf{b}_{o})\mathbf{z}\|^{3}\big\}\\
&\leq 2n(1+\lambda_{n})^{3}e^{\alpha_{t}}\mathbb{E}\big(e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{3}\big)+n(1+\lambda_{n})^{3}e^{3\alpha_{t}}\mathbb{E}\big(e^{3\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{3}\big)\\
&=(1+\lambda_{n})^{3}O(a_{n}^{2}).
\end{align*}

This indicates that $a_{n}^{-2}\sum_{i=1}^{n}\mathbb{E}\{\|\eta_{obi}\|^{2}I(\|\eta_{obi}\|>a_{n}\epsilon)\}=o(1)$, and thus the Lindeberg-Feller condition holds. Applying the Lindeberg-Feller central limit theorem (Section 2.8 of van der Vaart, 1998) finishes the proof of (A.36).

Now we prove (A.37). Let

\begin{align*}
\Delta_{6}\equiv\frac{1}{a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}\phi_{i}(\bm{\theta}_{t}+\mathbf{b}_{o})\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}=\frac{1}{n}\sum_{i=1}^{n}\frac{(1+v_{i}y_{i})e^{\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}}}{(1+e^{\alpha_{t}+b_{o0}+\mathbf{x}_{i}^{\mathrm{T}}\bm{\beta}_{t}})^{2}}\mathbf{z}_{i}\mathbf{z}_{i}^{\mathrm{T}}.
\tag{A.40}
\end{align*}

Note that

\begin{align*}
\mathbb{E}(\Delta_{6})
&=\mathbb{E}\bigg\{\frac{(1+\lambda_{n}y)e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+b_{o0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{2}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}
\tag{A.41}\\
&=\mathbb{E}\bigg\{\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})(1+e^{\alpha_{t}+b_{o0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg\}
\tag{A.42}\\
&=\mathbb{E}\bigg(\frac{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}\mathbf{z}\mathbf{z}^{\mathrm{T}}\bigg)+o(1),
\tag{A.43}
\end{align*}

by the dominated convergence theorem, and the variance of each component of $\Delta_{6}$ is bounded by

\begin{align*}
\frac{1}{n}\mathbb{E}\bigg[\frac{(1+vy)^{2}e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+b_{o0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{4}}\|\mathbf{z}\|^{4}\bigg]
&=\frac{1}{n}\mathbb{E}\bigg[\frac{\{1+(3\lambda_{n}+\lambda_{n}^{2})p(\bm{\theta}_{t})\}e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}}{(1+e^{\alpha_{t}+b_{o0}+\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{4}}\|\mathbf{z}\|^{4}\bigg]\\
&\leq\frac{\mathbb{E}(e^{2\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4})}{n}+\frac{e^{\alpha_{t}}(3\lambda_{n}+\lambda_{n}^{2})}{n}\mathbb{E}\big(e^{3\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}\|\mathbf{z}\|^{4}\big)=o(1),
\end{align*}

where the last step is because $n^{-1}e^{\alpha_{t}}\lambda_{n}^{2}=(e^{\alpha_{t}}\lambda_{n})^{2}a_{n}^{-2}\rightarrow 0$ and both expectations are finite. Therefore, Chebyshev's inequality implies that $\Delta_{6}\rightarrow\mathbb{E}\big\{e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}}(1+c_{o}e^{\bm{\beta}_{t}^{\mathrm{T}}\mathbf{x}})^{-1}\mathbf{z}\mathbf{z}^{\mathrm{T}}\big\}$ in probability. Thus, (A.37) follows from the fact that

\begin{align*}
&\bigg|\frac{1}{a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}\phi_{i}(\bm{\theta}_{t}+\mathbf{b}_{o}+a_{n}^{-1}\acute{\mathbf{u}})\|\mathbf{z}_{i}\|^{2}-\frac{1}{a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}\phi_{i}(\bm{\theta}_{t}+\mathbf{b}_{o})\|\mathbf{z}_{i}\|^{2}\bigg|\\
&\leq\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{a_{n}^{2}(1+\lambda_{n})}\sum_{i=1}^{n}\tau_{i}p_{i}(\bm{\theta}_{t}+\mathbf{b}_{o}+a_{n}^{-1}\breve{\mathbf{u}})\|\mathbf{z}_{i}\|^{3}\\
&\leq\frac{\|a_{n}^{-1}\acute{\mathbf{u}}\|}{n}\sum_{i=1}^{n}(1+v_{i}y_{i})e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)\|\mathbf{z}_{i}\|}\|\mathbf{z}_{i}\|^{3}=o_{P}(1),
\end{align*}

where the last step is from the fact that $n^{-1}\sum_{i=1}^{n}(1+v_{i}y_{i})e^{(\|\bm{\beta}_{t}\|+\|\mathbf{u}\|)\|\mathbf{z}_{i}\|}\|\mathbf{z}_{i}\|^{3}$ has a bounded mean and a bounded variance, and an application of Chebyshev's inequality. ∎
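As in the under-sampling case, the zero-conditional-mean step in the proof of (A.36) pins down the correction constant: it requires $(1+\lambda_{n})e^{\alpha_{t}}=e^{\alpha_{t}+b_{o0}}$, i.e., $b_{o0}=\log(1+\lambda_{n})$, so the bias-corrected over-sampled estimator can be obtained by fitting an unweighted logistic regression to the case-replicated data and subtracting $\log(1+\lambda_{n})$ from the intercept. A purely illustrative sketch under assumed simulation settings (not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200_000, 3
alpha_t, beta_t = -7.0, np.array([0.5, -0.5, 1.0])
X = rng.normal(size=(n, d))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(alpha_t + X @ beta_t))))

# Over-sample the cases: case i is replicated tau_i = 1 + v_i times, v_i ~ Poisson(lam).
lam = 50.0
tau = np.where(y == 1, 1 + rng.poisson(lam, size=n), 1)
Z = np.repeat(np.column_stack([np.ones(n), X]), tau, axis=0)
yr = np.repeat(y, tau)

# Unweighted MLE on the replicated data, followed by the intercept correction.
theta = np.zeros(d + 1)
for _ in range(30):
    pr = 1.0 / (1.0 + np.exp(-(Z @ theta)))
    hess = (Z * (pr * (1.0 - pr))[:, None]).T @ Z
    theta += np.linalg.solve(hess, Z.T @ (yr - pr))

theta[0] -= np.log(1 + lam)   # subtract b_o0 = log(1 + lambda_n)
print("bias-corrected over-sampled estimate:", theta)
print("true theta_t:                        ", np.r_[alpha_t, beta_t])
```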

References

  • Chawla (2009) Chawla, N. V. (2009). Data mining for imbalanced datasets: An overview. In Data mining and knowledge discovery handbook, 875–886. Springer.
  • Chawla et al. (2002) Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
  • Chawla et al. (2004) Chawla, N. V., Japkowicz, N., and Kotcz, A. (2004). Editorial: special issue on learning from imbalanced data sets. ACM Sigkdd Explorations Newsletter 6, 1, 1–6.
  • Douzas and Bacao (2017) Douzas, G. and Bacao, F. (2017). Self-organizing map oversampling (SOMO) for imbalanced data set learning. Expert Systems with Applications 82, 40–52.
  • Drummond et al. (2003) Drummond, C., Holte, R. C., et al. (2003). C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, vol. 11, 1–8. Citeseer.
  • Estabrooks et al. (2004) Estabrooks, A., Jo, T., and Japkowicz, N. (2004). A multiple resampling method for learning from imbalanced data sets. Computational intelligence 20, 1, 18–36.
  • Fithian and Hastie (2014) Fithian, W. and Hastie, T. (2014). Local case-control sampling: Efficient subsampling in imbalanced data sets. Annals of statistics 42, 5, 1693.
  • Han et al. (2005) Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In D.-S. Huang, X.-P. Zhang, and G.-B. Huang, eds., Advances in Intelligent Computing, 878–887, Berlin, Heidelberg. Springer Berlin Heidelberg.
  • Hjort and Pollard (2011) Hjort, N. L. and Pollard, D. (2011). Asymptotics for minimisers of convex processes. arXiv preprint arXiv:1107.3806 .
  • Japkowicz (2000) Japkowicz, N. (2000). Learning from imbalanced data sets: Papers from the AAAI workshop, AAAI, 2000. Technical Report WS-00-05.
  • King and Zeng (2001) King, G. and Zeng, L. (2001). Logistic regression in rare events data. Political analysis 9, 2, 137–163.
  • Lemaître et al. (2017) Lemaître, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. The Journal of Machine Learning Research 18, 1, 559–563.
  • Liu et al. (2009) Liu, X., Wu, J., and Zhou, Z. (2009). Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 2, 539–550.
  • Mathew et al. (2017) Mathew, J., Pang, C. K., Luo, M., and Leong, W. H. (2017). Classification of imbalanced data by oversampling in kernel space of support vector machines. IEEE transactions on neural networks and learning systems 29, 9, 4065–4076.
  • Owen (2007) Owen, A. B. (2007). Infinitely imbalanced logistic regression. The Journal of Machine Learning Research 8, 761–773.
  • Rahman and Davis (2013) Rahman, M. M. and Davis, D. (2013). Addressing the class imbalance problem in medical datasets. International Journal of Machine Learning and Computing 3, 2, 224.
  • Sun et al. (2007) Sun, Y., Kamel, M. S., Wong, A. K., and Wang, Y. (2007). Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition 40, 12, 3358–3378.
  • van der Vaart (1998) van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press, London.
  • Wang (2019) Wang, H. (2019). More efficient estimation for logistic regression with optimal subsamples. Journal of Machine Learning Research 20, 132, 1–59.