
Balancing Weights for Non-monotone Missing Data

Jianing Dong, Raymond K. W. Wong, and Kwun Chuen Gary Chan
Department of Biostatistics, University of Washington
Abstract

Balancing weights have been widely applied to single or monotone missingness due to their empirical advantages over likelihood-based methods and inverse probability weighting approaches. This paper considers non-monotone missing data under the complete-case missing variable condition (CCMV), a case of missing not at random (MNAR). Using relationships between each missing pattern and the complete-case subsample, we construct a weighted estimator in which the weight is a sum of ratios of the conditional probability of observing a particular missing pattern versus that of observing the complete-case pattern, given the variables observed in the corresponding missing pattern. However, plug-in estimators of the propensity odds can be unbounded and lead to unstable estimation. Exploiting a further relation between the propensity odds and the balancing of moments across response patterns, we employ tailored loss functions, each encouraging empirical balance across patterns, to estimate the propensity odds flexibly via functional basis expansions. We propose two penalizations to control the smoothness of the propensity odds model and the empirical imbalance. We study the asymptotic properties of the proposed estimators and show that they are consistent under mild smoothness assumptions. Asymptotic normality and efficiency are also established. Simulation results show the superior performance of the proposed method.

Keywords: Non-monotone missing; Missing not at random (MNAR); Complete-case missing variable (CCMV); Covariate balancing

1 Introduction

Missing data is ubiquitous in many research fields, such as health sciences, economics, and sociology. Monotone missing patterns, which arise in longitudinal studies and multi-stage sampling, have been widely studied. In practice, non-monotone missing patterns are more common since an ordering of variables may not exist to characterize the response patterns. For example, participants of an observational cohort may miss a particular study visit but return in later visits.

Restricting the analysis to complete cases in which all relevant variables are observed is a convenient solution but ignores information from partially missing data. It may lead to biased estimates unless data are missing completely at random (MCAR) [16], and would typically lose efficiency even under MCAR. The missing at random (MAR) assumption, where missingness depends only on observed data, is more reasonable and is widely used to handle missing data in likelihood-based or weighting methods.

Researchers often focus on a specific target, such as the coefficients of a linear regression with Gaussian errors, which can be easily estimated by maximum likelihood if the full-data density is known. To estimate parameters in the full-data density when the full variable vector L may be partially missing, Little and Schluchter [17] suggested a mixture of normal distributions as the model of the missing data density. However, its validity heavily relies on parametric density assumptions. Likelihood-based methods are attractive because the missingness probability can be factored out from the likelihood under MAR [24]. To mimic the score function (the derivative of the log-likelihood) in the absence of missing data, Reilly and Pepe [21] proposed a mean-score approach, and Chatterjee et al. [5] proposed a pseudo-score approach for the bivariate case and extended it to the partial questionnaire design, a particular case of non-monotone missing data [6]. Chen [7] proposed a semi-parametric method based on a reparametrization of the joint likelihood into a product of odds-ratio functions. However, these methods cannot directly handle continuous variables and require the discretization of continuous variables.

Inverse propensity weighting (IPW) methods [23] are also attractive since they do not require the specification of the full-data likelihood; instead, a propensity model for non-missingness is needed. However, the resulting estimators can be highly unstable since even a small set of tiny propensity estimates leads to extreme weights. A more direct weighting approach, the balancing method, was developed especially for the average treatment effect (ATE), which can be viewed as a missing data problem with only the response variable missing; see, for example, Zubizarreta [35], Chan et al. [4] and Wong and Chan [31]. Zhao [34] introduced tailored loss functions to show the connection between IPW and balancing methods. These balancing methods handle the single-missing-variable scenario and have yet to be extended to non-monotone missingness.

Robins and Gill [22] argued that missing not at random (MNAR) is more natural than MAR under non-monotone missingness. However, many different specifications of MNAR are not identifiable from observed data. Little [15] imposed an identifying condition, the complete-case missing variable (CCMV) restriction, which matches the unidentified conditional distribution of the missing variables in each missing pattern to the identified distribution in the complete cases. The full-data density can then be estimated by a pattern-mixture model. IPW estimators have also been proposed under the CCMV restriction and its extensions [28]. Another increasingly popular MNAR mechanism is the no self-censoring (NSC) restriction, studied in Sinha et al. [26], Sadinle and Reiter [25] and Malinsky et al. [18]. Under MNAR, specifications of the missing variable density typically rely on parametric assumptions and have intrinsic drawbacks. For instance, likelihood-based methods are often restricted to discrete variables due to complex integral computation, and IPW methods are often unstable due to difficulties in weight estimation for multiple missing patterns.

We propose a non-parametric approach that generalizes the balancing approaches to non-monotone missing scenarios under the CCMV assumption. Based on a tailored loss function, the proposed method approximately balances a set of functions of observed variables and mimics the missing-data adjustment of the mean-score approaches. The proposed method does not require the discretization of continuous variables and is less prone to misspecification due to the non-parametric modeling. A carefully designed penalization strikes a good balance between bias and variance, and prioritizes balancing smoother functions. We also show the consistency and asymptotic efficiency of the resulting estimator.

2 Notation and preliminaries

We formally describe the setup of the problem. Let L=(L^{[1]},\ldots,L^{[d]})\in\mathbb{R}^{d} be a random vector of interest and R=(R^{1},\ldots,R^{d})\in\mathcal{R}\subseteq\{0,1\}^{d} be a binary random vector, where R^{j}=1 indicates that L^{[j]} is observed and \mathcal{R}=\{r:P(R=r)>0\} is the set of possible response patterns in the study. Denote the complete-case pattern by 1_{d}=(1,\ldots,1). Let M=|\mathcal{R}| be the number of response patterns. For the response pattern r, we denote the observed variables by L^{r}\in\mathbb{R}^{d_{r}} and the missing variables by L^{\overline{r}}\in\mathbb{R}^{d-d_{r}}, where d_{r} is the number of observed variables. So, the observations are (L^{R_{i}}_{i},R_{i})_{i=1}^{N}. An example of a bivariate non-monotone missing structure with M=3 is

response pattern: R=11, R=10, R=01
observed variables: L^{11}=(L^{[1]},L^{[2]}), L^{10}=L^{[1]}, L^{01}=L^{[2]}
missing variables: L^{\overline{11}}=\emptyset, L^{\overline{10}}=L^{[2]}, L^{\overline{01}}=L^{[1]}

Let \theta_{0}\in\mathbb{R}^{q} be the parameter of interest, which is the unique solution to \mathbb{E}\{\psi_{\theta}(L)\}=0, where \psi_{\theta}(L)=\psi(L,\theta) is a known vector-valued estimating function taking values in \mathbb{R}^{q}. For instance, we could use the quasi-likelihood estimating functions for generalized linear models. If full data were observed, a solution to the estimating equations N^{-1}\sum_{i=1}^{N}\psi_{\theta}(L_{i})=0 would be a standard Z-estimator. However, \psi_{\theta}(L_{i}) can only be evaluated on complete cases, and a complete-case analysis is typically biased or inefficient. To address the problem, we posit the following assumptions throughout this paper.

Assumption 1.

  1.  A:

    The estimating function \psi(L,\theta) is differentiable with respect to \theta with derivative \dot{\psi}_{\theta}(L). Also, \mathbb{E}\{\psi_{\theta}(L)\} has the unique root \theta_{0} and is differentiable at \theta_{0} with nonsingular derivative D_{\theta_{0}}.

  2.  B:

    There exists a constant \delta_{0}>0 such that P(R=1_{d}\mid l^{r})\geq\delta_{0} for any r\in\mathcal{R}, and so 1_{d}\in\mathcal{R}.

  3.  C:

    p(l^{\overline{r}}\mid l^{r},R=r)=p(l^{\overline{r}}\mid l^{r},R=1_{d}), or equivalently, \frac{P(R=r\mid l)}{P(R=1_{d}\mid l)}=\frac{P(R=r\mid l^{r})}{P(R=1_{d}\mid l^{r})}, for any r\in\mathcal{R}\backslash 1_{d}.

Assumption 1.A is a standard regularity assumption for Z-estimation. Assumption 1.B ensures that complete cases are available for analysis. The first equation in Assumption 1.C is the CCMV condition, while the second equation motivates the logit discrete choice nonresponse model (LDCM) in Tchetgen Tchetgen et al. [28]. To see that the missing mechanism is MNAR, we define the propensity as the probability of the data belonging to a specific response pattern conditional on the variables of interest and write it as P(R=r\mid l). The propensity odds between patterns r and 1_{d} is defined as P(R=r\mid l)/P(R=1_{d}\mid l). The CCMV condition states that the propensity odds depends on l only via l^{r}, and so equals \mathrm{Odds}^{r}(l^{r})=P(R=r\mid l^{r})/P(R=1_{d}\mid l^{r}). The propensity is a function of L=\bigcup_{s\in\mathcal{R}\backslash 1_{d}}L^{s} and satisfies neither the MCAR nor the MAR condition, because propensities and odds are related through

P(R=r\mid l)=\frac{P(R=r\mid l)/P(R=1_{d}\mid l)}{\sum_{s\in\mathcal{R}}P(R=s\mid l)/P(R=1_{d}\mid l)}=\frac{\mathrm{Odds}^{r}(l^{r})}{\sum_{s\in\mathcal{R}}\mathrm{Odds}^{s}(l^{s})}\ . (1)

Tchetgen Tchetgen et al. [28] showed that Assumption 1 is sufficient for non-parametric identification and developed the inverse propensity weighting (IPW) estimator that is motivated by the law of total expectation:

\mathbb{E}\left\{\frac{\mathsf{1}_{R=1_{d}}\psi_{\theta}(L)}{P(R=1_{d}\mid L)}\right\}=\mathbb{E}\left[\mathbb{E}\left\{\frac{\mathsf{1}_{R=1_{d}}\psi_{\theta}(L)}{P(R=1_{d}\mid L)}\bigg|L\right\}\right]=\mathbb{E}\{\psi_{\theta}(L)\}\ . (2)

Taking the reciprocal of (1) and noting that \mathrm{Odds}^{1_{d}}(l)=1, we obtain

\frac{1}{P(R=1_{d}\mid l)}=\sum_{r\in\mathcal{R}}\mathrm{Odds}^{r}(l^{r})\ . (3)

As such, one can focus on the estimation of the odds. A standard approach is to fit a logistic regression in which the binary response indicates whether the pattern is R=1_{d} or R=r, and obtain an estimate of P(R=r\mid l^{r},R\in\{1_{d},r\}) [28]. Due to the relationship

\mathrm{Odds}^{r}(l^{r})=\frac{P(R=r\mid l^{r},R\in\{1_{d},r\})}{1-P(R=r\mid l^{r},R\in\{1_{d},r\})}\ ,

a plug-in estimator of \mathrm{Odds}^{r}(l^{r}) can then be constructed based on the predicted probability.

Based on (2) and (3), the resulting estimator \hat{\theta} can be obtained by solving the weighted estimating equations N^{-1}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\sum_{r\in\mathcal{R}}\hat{w}^{r}(L^{r}_{i})\psi_{\theta}(L_{i})=0, where \hat{w}^{r}(l^{r}) is an estimator of \mathrm{Odds}^{r}(l^{r}) and \mathsf{1}_{A} is the indicator function of an event A. However, the estimated odds may lead to extremely large weights and hence an unstable estimator of \theta_{0} when likelihood estimation of a misspecified missingness model is used. Similar phenomena were observed in inverse propensity estimation in [14] and many subsequent papers, which motivates covariate balancing methods for parameter estimation. See, for example, [13], [33], [34] and [27].
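To make the plug-in IPW construction concrete, the following is a hedged sketch for the bivariate example of Section 2 (patterns 11, 10, 01): each propensity odds is estimated by a logistic regression restricted to the corresponding pattern and the complete cases, the fitted odds are summed into complete-case weights via (3), and the weighted estimating equation is solved for a simple mean parameter. All modeling choices and names below are illustrative, not the paper's implementation.

```python
# Hedged sketch of the plug-in IPW estimator under CCMV for the bivariate example.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
N = 5000
L = rng.normal(size=(N, 2))                       # full data (L1, L2)

# True propensity odds versus the complete-case pattern (Odds^{11} = 1).
odds10 = np.exp(-1.0 + 0.5 * L[:, 0])             # depends only on L1 (observed in pattern 10)
odds01 = np.exp(-1.0 - 0.5 * L[:, 1])             # depends only on L2 (observed in pattern 01)
probs = np.column_stack([np.ones(N), odds10, odds01])
probs /= probs.sum(axis=1, keepdims=True)
R = np.array([rng.choice(["11", "10", "01"], p=p) for p in probs])

def fitted_odds(x_shared, mask_r, mask_cc):
    """Fit P(R=r | l^r, R in {r, 1_d}) by logistic regression; return fitted odds on complete cases."""
    keep = mask_r | mask_cc
    clf = LogisticRegression(C=1e6, max_iter=1000).fit(x_shared[keep], mask_r[keep].astype(int))
    p = clf.predict_proba(x_shared[mask_cc])[:, 1]
    return p / (1 - p)

cc = R == "11"
w_hat = 1.0 + fitted_odds(L[:, [0]], R == "10", cc) + fitted_odds(L[:, [1]], R == "01", cc)

# Weighted estimating equation for theta = E[L] with psi_theta(L) = L - theta.
theta_hat = (w_hat[:, None] * L[cc]).sum(axis=0) / w_hat.sum()
print(theta_hat, L.mean(axis=0))
```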

We note that the estimation of P(R=1_{d}\mid l) or \mathrm{Odds}^{r}(l^{r}) is not the ultimate target but only serves as an intermediate step for estimating \theta_{0}. Instead of using the entropy loss, i.e., the negative log-likelihood function, which is typical for logistic regression, we employ a tailored loss that directly imposes empirical control on key identifying conditions of the weights, which we call the balancing conditions described below. These balancing conditions are directly related to estimating the target parameter \theta_{0}. Since the missingness mechanism is typically not of interest in itself, parsimonious modeling may not be required. Instead, we use functional basis expansions for flexible modeling, which requires penalization for stable estimation. In addition, our proposed estimator achieves the semiparametric efficiency bound for parameters defined through the estimating equations. By contrast, Tchetgen Tchetgen et al. [28] only provided asymptotic variances when the parametric model of either the propensity odds or the missing variable density is correctly specified. Here, we present the semiparametric efficiency bound, whose proof is given in Appendix B. The "regular estimators" are defined according to Begun et al. [2]. The regularity of an estimator can be viewed as a robustness or stability property.

Theorem 1.

Under Assumption 1, the asymptotic variance lower bound for all regular estimators of \theta_{0} is D_{\theta_{0}}^{-1}V_{\theta_{0}}D_{\theta_{0}}^{-1^{\intercal}}, where V_{\theta}=\mathbb{E}\{F_{\theta}(L,R)F_{\theta}(L,R)^{\intercal}\} and

F_{\theta}(L,R)=\mathsf{1}_{R=1_{d}}\sum_{r\in\mathcal{R}}\mathrm{Odds}^{r}(L^{r})\{\psi_{\theta}(L)-u_{\theta}^{r}(L^{r})\}+\sum_{r\in\mathcal{R}}\mathsf{1}_{R=r}u_{\theta}^{r}(L^{r})-\mathbb{E}\{\psi_{\theta}(L)\}

with u_{\theta}^{r}(l^{r})=\mathbb{E}\{\psi_{\theta}(L)\mid L^{r}=l^{r},R=r\}, the conditional expectation of the estimating function given the variables L^{r}, which equals \mathbb{E}\{\psi_{\theta}(L)\mid L^{r}=l^{r},R=1_{d}\} under Assumption 1.C. In a slight abuse of notation, u_{\theta}^{1_{d}}(l^{1_{d}})=\psi_{\theta}(l).

3 Construction of our method

In this section, we define the balancing conditions and propose a tailored loss function whose expectation is minimized when the propensity odds model satisfies the balancing conditions. We will show that the imbalance can be explicitly controlled under penalization. Eventually, we will introduce two penalizations to control empirical imbalance and the smoothness of the propensity odds estimates, respectively.

3.1 Balancing conditions and benefits

To estimate the propensity odds in a way more directly tied to the estimation of \theta_{0}, we first consider the effect of a set of generic weights w^{r} in the weighted estimating equations that are constructed to estimate \mathbb{E}\{\psi_{\theta}(L)\}. The law of total expectation (2) shows that the weighted average in the fully observed pattern is equal to the unweighted average in the population. Moreover, for each missing pattern r, one can see that under the CCMV assumption the following equation holds for any measurable function g(l^{r}). We call it the balancing condition associated with the function g(l^{r}) using \mathrm{Odds}^{r}(l^{r}):

\mathbb{E}\left\{\mathsf{1}_{R=r}g(L^{r})\right\} =\mathbb{E}\left\{\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r})g(L^{r})\right\} (4)
 =\mathbb{E}\left\{P(R=1_{d}\mid L^{r})\frac{P(R=r\mid L^{r})}{P(R=1_{d}\mid L^{r})}g(L^{r})\right\}\ .

Note that the weights w^{r}(L^{r}) are equal to \mathrm{Odds}^{r}(L^{r}) almost surely with respect to the conditional distribution given R=1_{d} if they satisfy the equations \mathbb{E}\{\mathsf{1}_{R=r}g(L^{r})\}=\mathbb{E}\{\mathsf{1}_{R=1_{d}}w^{r}g(L^{r})\} for all measurable functions g. In other words, the set of balancing conditions associated with all measurable functions identifies the propensity odds. This motivates another way to estimate the propensity odds: balancing the shared variables between the two patterns. To see explicitly how the balancing approach helps the estimation of \theta, we study the error N^{-1}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\sum_{r\in\mathcal{R}}w^{r}_{i}\psi_{\theta}(L_{i})-\mathbb{E}\{\psi_{\theta}(L)\} based on a generic set of weights w^{r}_{i},i=1,\ldots,N. The error can first be decomposed as

\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\sum_{r\in\mathcal{R}}w^{r}_{i}\psi_{\theta}(L_{i})-\mathbb{E}\{\psi_{\theta}(L)\}
=\sum_{r\in\mathcal{R}}\left[\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}w^{r}_{i}\psi_{\theta}(L_{i})-\mathbb{E}\{\mathsf{1}_{R=r}\psi_{\theta}(L)\}\right]\ .

Further decomposition of each inner term leads to:

\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}w^{r}_{i}\psi_{\theta}(L_{i})-\mathbb{E}\{\mathsf{1}_{R=r}\psi_{\theta}(L)\}
=\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}w^{r}_{i}\left\{\psi_{\theta}(L_{i})-u_{\theta}^{r}(L^{r}_{i})\right\} (5)
\quad+\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}w^{r}_{i}u_{\theta}^{r}(L^{r}_{i})-\mathsf{1}_{R_{i}=r}u_{\theta}^{r}(L^{r}_{i})\right\} (6)
\quad+\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=r}u_{\theta}^{r}(L^{r}_{i})-\mathbb{E}\{\mathsf{1}_{R=r}u_{\theta}^{r}(L^{r})\}\ . (7)

To control the error, we aim to design weights that can control the magnitude of the components in the above decomposition. First, we note that only the first two terms (5) and (6) depend on the weights. Indeed, the last term (7) is expected to converge to zero uniformly over \theta in a compact set \Theta under mild assumptions. Next, we focus on the term (6). It is the empirical version of the imbalance associated with u_{\theta}^{r}. Therefore, to control this error component, we would like the weights to achieve empirical balance, at least approximately. This is the fundamental motivation of the proposed weights. Finally, the term (5) with the proposed weights introduced later can be shown to converge uniformly to zero by some technical arguments. In Section 4 and Appendices B–F, we provide rigorous theoretical statements and corresponding proofs. In the next subsection, we propose a tailored loss function that encourages the empirical balance.

3.2 Tailored loss function for balancing conditions

For each missing pattern r\neq 1_{d}, suppose

\mathrm{Odds}^{r}(l^{r};\alpha^{r})=\exp\left\{\Phi^{r}(l^{r})^{\intercal}\alpha^{r}\right\}\ ,

where \Phi^{r}(l^{r})=\{\phi^{r}_{1}(l^{r}),\ldots,\phi^{r}_{K_{r}}(l^{r})\} are K_{r} basis functions for the observed variables in pattern r. The basis functions for l^{r} can be constructed as tensor products of the basis functions for each variable. One may choose suitable basis functions depending on the observed variables and the number of observations in different patterns. For each missing pattern r\neq 1_{d}, we propose the tailored loss function:

\mathcal{L}^{r}\{\mathrm{Odds}^{r}(l^{r};\alpha^{r}),R\}=\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(l^{r};\alpha^{r})-\mathsf{1}_{R=r}\log\mathrm{Odds}^{r}(l^{r};\alpha^{r})\ .

Assuming the interchangeability of differentiation with respect to \alpha^{r} and expectation with respect to L, the minimizer of \mathbb{E}[\mathcal{L}^{r}\{\mathrm{Odds}^{r}(L^{r};\alpha^{r}),R\}] satisfies

\partial\mathbb{E}[\mathcal{L}^{r}\{\mathrm{Odds}^{r}(L^{r};\alpha^{r}),R\}]/\partial\alpha^{r}=0,

which can be rewritten as the balancing conditions:

\mathbb{E}\left\{\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r};\alpha^{r})\Phi^{r}(L^{r})\right\}=\mathbb{E}\left\{\mathsf{1}_{R=r}\Phi^{r}(L^{r})\right\}\ .

In practice, the minimum tailored loss estimator of \alpha^{r}, denoted by \breve{\alpha}^{r}, is obtained by minimizing the average loss

\mathcal{L}^{r}_{N}(\alpha^{r})=N^{-1}\sum_{i=1}^{N}\mathcal{L}^{r}\{\mathrm{Odds}^{r}(L^{r}_{i};\alpha^{r}),R_{i}\}\ .

The estimating equations \nabla\mathcal{L}^{r}_{N}(\breve{\alpha}^{r})=0 can be rewritten as the empirical version of the balancing conditions using \mathrm{Odds}^{r}(L^{r}_{i};\breve{\alpha}^{r}):

\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\breve{\alpha}^{r})\Phi^{r}(L^{r}_{i})=\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=r}\Phi^{r}(L^{r}_{i})\ . (8)

Recall that we want to control the term (6). If u_{\theta}^{r}\in\mathrm{span}\{\phi^{r}_{1},\ldots,\phi^{r}_{K_{r}}\}, the empirical balances associated with the basis functions imply the empirical balance associated with u_{\theta}^{r}. Therefore, the basis functions should be chosen to approximate u_{\theta}^{r} well. Theoretically, one can increase the number of basis functions with the sample size to extend the spanned space and lower the approximation error.
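As a concrete illustration, the following is a minimal sketch (with illustrative data and a small polynomial basis, not the paper's code) of minimizing the unpenalized tailored loss for one pattern r and verifying that the empirical balancing conditions (8) hold at the minimizer.

```python
# Minimal sketch of the tailored loss for one missing pattern r and a check of (8).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 4000
l1 = rng.uniform(-1, 1, n)                          # shared variable L^r in patterns r and 1_d
is_r = rng.binomial(1, 1 / (1 + np.exp(-(0.3 * l1 - 0.5)))).astype(bool)   # pattern r vs 1_d
is_cc = ~is_r

Phi = np.column_stack([np.ones(n), l1, l1 ** 2])    # basis Phi^r(l^r), K_r = 3

def tailored_loss(alpha):
    """N^{-1} sum_i [ 1{R_i=1_d} exp(Phi' alpha) - 1{R_i=r} Phi' alpha ]."""
    eta = Phi @ alpha
    return np.mean(is_cc * np.exp(eta) - is_r * eta)

def grad(alpha):
    eta = Phi @ alpha
    return ((is_cc * np.exp(eta) - is_r)[:, None] * Phi).mean(axis=0)

alpha_hat = minimize(tailored_loss, np.zeros(3), jac=grad, method="BFGS").x

# Empirical balancing conditions (8): weighted complete-case moments match pattern-r moments.
lhs = (is_cc[:, None] * np.exp(Phi @ alpha_hat)[:, None] * Phi).mean(axis=0)
rhs = (is_r[:, None] * Phi).mean(axis=0)
print(np.max(np.abs(lhs - rhs)))                    # close to zero, up to optimizer tolerance
```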

3.3 Penalized optimization and empirical imbalance

As mentioned in the last subsection, one wants to balance as many basis functions as necessary to enlarge the space spanned by the basis functions, since u_{\theta}^{r} is unknown. However, the unpenalized optimization can overfit or even be infeasible if one chooses too many basis functions. More precisely, \mathcal{L}^{r}_{N}(C\alpha^{r})\to-\infty as C\to\infty if there exists \alpha^{r} such that the linear component {\Phi^{r}}(l^{r})^{\intercal}\alpha^{r}>0 for all data in pattern r and {\Phi^{r}}(l^{r})^{\intercal}\alpha^{r}<0 for all data in pattern 1_{d}.

Therefore, we consider the penalized average loss \mathcal{L}^{r}_{\lambda}(\alpha^{r})=\mathcal{L}^{r}_{N}(\alpha^{r})+\lambda J^{r}(\alpha^{r}), where the penalty function J^{r}(\cdot) is a continuous non-negative convex function, and the tuning parameter \lambda\geq 0 controls the degree of penalization and is chosen by a cross-validation procedure to be discussed later. Since the estimation of the propensity odds is not the final goal, we want to study how penalization affects the estimation of \mathbb{E}\{\psi_{\theta}(L)\}. We will explore the empirical imbalance of u_{\theta}^{r}, which depends on those of the basis functions.

Note that the tailored loss function is convex in \alpha^{r}. One can consider the sub-differential, as the penalty J^{r}(\cdot) may be non-differentiable. Denote the sub-differential of \mathcal{L}^{r}_{\lambda} at \alpha^{r} by \partial\mathcal{L}^{r}_{\lambda}(\alpha^{r}), which is the set of sub-derivatives g\in\mathbb{R}^{K_{r}} satisfying

\mathcal{L}^{r}_{\lambda}(\beta^{r})\geq\mathcal{L}^{r}_{\lambda}(\alpha^{r})+g^{\intercal}(\beta^{r}-\alpha^{r}),\textrm{ for all }\beta^{r}\in\mathbb{R}^{K_{r}}\ .

The sub-differential of \mathcal{L}^{r}_{\lambda} at any bounded minimizer \hat{\alpha}^{r} must include the vector \mathbf{0}, since \mathcal{L}^{r}_{\lambda}(\alpha^{r})\geq\mathcal{L}^{r}_{\lambda}(\hat{\alpha}^{r})+\mathbf{0}^{\intercal}(\alpha^{r}-\hat{\alpha}^{r}) for all \alpha^{r}\in\mathbb{R}^{K_{r}}. Therefore, there exists a sub-derivative \hat{s}=(\hat{s}_{1},\ldots,\hat{s}_{K_{r}})\in\mathbb{R}^{K_{r}} of J^{r} at \hat{\alpha}^{r} such that

\nabla\mathcal{L}^{r}_{N}(\hat{\alpha}^{r})+\lambda\hat{s}=\mathbf{0}\ . (9)

3.3.1 Controlling empirical imbalance

We call the absolute difference between the left-hand side and the right-hand side of (8) the empirical imbalance. One immediate observation from (9) is that the empirical balance might not hold exactly when \lambda\neq 0. Since the penalty determines \hat{\alpha}^{r} and \hat{s}, one wants to choose an appropriate penalty to ensure control of the empirical imbalance. As a continuous penalty yields bounded sub-derivatives, a simple choice is the \ell_{1}-norm penalty, i.e., \sum_{k=1}^{K_{r}}|\alpha^{r}_{k}|, whose sub-derivative entries belong to [-1,1], and so, for each k\in\{1,\ldots,K_{r}\},

\left|\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=r}\phi^{r}_{k}(L^{r}_{i})-\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})\phi^{r}_{k}(L^{r}_{i})\right|\leq\lambda\ .

As such, \lambda uniformly controls the empirical imbalances of all basis functions. However, to emphasize the control of the empirical imbalance of u_{\theta}^{r}, the tolerance to the empirical imbalances of different basis functions should be allowed to vary. We should tolerate smaller empirical imbalances for those basis functions that are believed to have stronger approximating power for u_{\theta}^{r}. For example, if one believes that u_{\theta}^{r} satisfies certain smoothness assumptions, one may want to prioritize the empirical balances of smoother basis functions. One natural way to achieve this is to use the weighted \ell_{1}-norm penalty, i.e., \sum_{k=1}^{K_{r}}t_{k}|\alpha^{r}_{k}|, where t_{k}\geq 0, k=1,\ldots,K_{r}, represent relative tolerances determined by the assumptions on u_{\theta}^{r}. For each k\in\{1,\ldots,K_{r}\}, the corresponding imbalance bound becomes

\left|\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=r}\phi^{r}_{k}(L^{r}_{i})-\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})\phi^{r}_{k}(L^{r}_{i})\right|\leq\lambda t_{k}\ .

Small t_{k} should be assigned to important basis functions to ensure stronger balance. A detailed example of the choice of t_{k} is presented in Section 3.3.3.
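A hedged numerical check of this bound, under illustrative data, basis functions, and tolerances, is sketched below; the weighted \ell_{1} part is handled by splitting each coefficient into positive and negative parts so that a smooth bound-constrained optimizer can be used.

```python
# Minimal sketch: at the minimizer of the tailored loss plus a weighted l1 penalty,
# each basis function's empirical imbalance is bounded by lambda * t_k.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
n = 4000
l1 = rng.uniform(-1, 1, n)
is_r = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * l1 - 0.7)))).astype(bool)   # pattern r vs 1_d
is_cc = ~is_r
Phi = np.column_stack([np.ones(n), l1, l1 ** 2, l1 ** 3])
t = np.array([0.5, 0.5, 1.0, 3.0])       # larger tolerance for rougher basis functions (illustrative)
lam = 1e-3

def objective(ab):
    """Tailored loss plus lam * sum_k t_k |alpha_k|, with alpha = a_plus - a_minus."""
    a_plus, a_minus = ab[:4], ab[4:]
    eta = Phi @ (a_plus - a_minus)
    tailored = np.mean(is_cc * np.exp(eta) - is_r * eta)
    return tailored + lam * np.sum(t * (a_plus + a_minus))

res = minimize(objective, np.zeros(8), method="L-BFGS-B", bounds=[(0, None)] * 8)
alpha_hat = res.x[:4] - res.x[4:]

imbalance = np.abs((is_r[:, None] * Phi).mean(axis=0)
                   - (is_cc[:, None] * np.exp(Phi @ alpha_hat)[:, None] * Phi).mean(axis=0))
print(imbalance)        # entrywise bounded by lam * t, up to optimizer tolerance
print(lam * t)
```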

3.3.2 Controlling propensity odds smoothness

In addition to the empirical balance, we can also introduce a penalty to promote the smoothness of the propensity odds estimate. In this paper, we focus on a quadratic penalty of the form {\alpha^{r}}^{\intercal}\mathbf{D}^{r}\alpha^{r}, where \mathbf{D}^{r} is a positive semi-definite matrix. As suggested by Wood [32], the most convenient penalties for controlling the degree of smoothness are those that measure function roughness as a quadratic form in the coefficients of the function. A detailed example of this choice is presented in Section 3.3.3, where the roughness is related to the order of derivatives of a function. It is worth noting that our asymptotic theory only depends on the positive semi-definiteness of \mathbf{D}^{r}, not on a specific choice of \mathbf{D}^{r}.

3.3.3 A combined penalty

We suggest the following penalized loss to combine the strengths of the above two penalties:

\mathcal{L}^{r}_{\lambda}(\alpha^{r})=\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}^{r}\{\mathrm{Odds}^{r}(L^{r}_{i};\alpha^{r}),R_{i}\}+\lambda\left\{\gamma\sum_{k=1}^{K_{r}}t_{k}|\alpha^{r}_{k}|+(1-\gamma){\alpha^{r}}^{\intercal}\mathbf{D}^{r}\alpha^{r}\right\}, (10)

where 0\leq\gamma\leq 1 is a tuning parameter controlling the relative weight of the two penalties. A method for tuning parameter selection is given in Section 3.3.4.

Here, we focus on a standard choice of \{t_{k}\}_{k=1}^{K_{r}} and of the matrix \mathbf{D}^{r}, based on a prior belief that u_{\theta}^{r} and \log\mathrm{Odds}^{r} are smooth. There are many ways to define the roughness of a function. Here, we adopt a version in Wood [32] for twice continuously differentiable functions of d-dimensional continuous variables, such as cubic splines. The roughness of a function f is defined as \mathtt{PEN}_{2}(f)=\langle f,f\rangle, where the semi-inner product is defined as

\langle f,g\rangle=\int\cdots\int_{\mathbb{R}^{d}}\sum_{v_{1}+\cdots+v_{d}=2}\frac{2!}{v_{1}!\cdots v_{d}!}\left(\frac{\partial^{2}f}{\partial x_{1}^{v_{1}}\cdots\partial x_{d}^{v_{d}}}\right)\left(\frac{\partial^{2}g}{\partial x_{1}^{v_{1}}\cdots\partial x_{d}^{v_{d}}}\right)dx_{1}\cdots dx_{d}\ .

Let \mathbf{D}^{r} be the Gram matrix with \mathbf{D}^{r}_{i,j}=\langle\phi^{r}_{i},\phi^{r}_{j}\rangle. Then, the roughness of the basis expansion {\Phi^{r}}^{\intercal}\alpha^{r} is the quadratic term {\alpha^{r}}^{\intercal}\mathbf{D}^{r}\alpha^{r}. We orthogonalize the basis functions with respect to the above inner product so that the matrix \mathbf{D}^{r} becomes diagonal. Thus, the quadratic term becomes \sum_{k=1}^{K_{r}}({\alpha^{r}_{k}})^{2}\mathtt{PEN}_{2}(\phi^{r}_{k}), which has a simple derivative and is strongly convex with respect to \alpha^{r}_{k} for each k with \mathtt{PEN}_{2}(\phi^{r}_{k})>0. To prioritize balancing smoother basis functions, we tolerate larger imbalances for rougher functions. Thus, we can connect the tolerance with the function roughness. In particular, we choose t_{k}=\sqrt{\langle\phi^{r}_{k},\phi^{r}_{k}\rangle}=\sqrt{\mathtt{PEN}_{2}(\phi^{r}_{k})}. Then, for the minimizer \hat{\alpha}^{r} of \mathcal{L}^{r}_{\lambda}(\alpha^{r}), the empirical imbalance of any linear combination of basis functions v(\cdot):=\sum_{k=1}^{K_{r}}\beta_{k}^{r}\phi_{k}^{r}(\cdot) satisfies

\left|\sum_{k=1}^{K_{r}}\beta^{r}_{k}\left[\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=r}-\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})\right\}\phi^{r}_{k}(L_{i}^{r})\right]\right|
\leq\lambda\gamma\sum_{k=1}^{K_{r}}|\beta^{r}_{k}|\sqrt{\mathtt{PEN}_{2}(\phi^{r}_{k})}+2\lambda(1-\gamma)\sum_{k=1}^{K_{r}}|\beta^{r}_{k}\hat{\alpha}^{r}_{k}|\mathtt{PEN}_{2}(\phi^{r}_{k})
\leq\lambda\gamma\sqrt{K_{r}}\sqrt{\sum_{k=1}^{K_{r}}\mathtt{PEN}_{2}(\beta^{r}_{k}\phi^{r}_{k})}+2\lambda(1-\gamma)\sqrt{\sum_{k=1}^{K_{r}}\mathtt{PEN}_{2}(\beta^{r}_{k}\phi^{r}_{k})}\sqrt{\sum_{k=1}^{K_{r}}\mathtt{PEN}_{2}(\hat{\alpha}^{r}_{k}\phi^{r}_{k})}
=\lambda\left\{\gamma\sqrt{K_{r}}+2(1-\gamma)\sqrt{\mathtt{PEN}_{2}\left(\sum_{k=1}^{K_{r}}\hat{\alpha}^{r}_{k}\phi^{r}_{k}\right)}\right\}\sqrt{\mathtt{PEN}_{2}\left(\sum_{k=1}^{K_{r}}\beta^{r}_{k}\phi^{r}_{k}\right)}
=\lambda\left\{\gamma\sqrt{K_{r}}+2(1-\gamma)\sqrt{\mathtt{PEN}_{2}({\Phi^{r}}^{\intercal}\hat{\alpha}^{r})}\right\}\sqrt{\mathtt{PEN}_{2}(v)}\ ,

where \Phi^{r}(l^{r})^{\intercal}\hat{\alpha}^{r}=\log\mathrm{Odds}^{r}(l^{r};\hat{\alpha}^{r}) is the log propensity odds model and K_{r} is the number of basis functions; the second inequality uses the Cauchy–Schwarz inequality, and the equalities use the orthogonality of the basis functions with respect to \langle\cdot,\cdot\rangle. As such, the empirical imbalance of v is bounded in proportion to the square root of its roughness \mathtt{PEN}_{2}(v). So, smoother functions in the spanned space have smaller empirical imbalances. We assume that u_{\theta}^{r} is well approximated by the basis functions under certain smoothness assumptions introduced in Assumption 3.A. Thus, the approximation errors are controlled, and the empirical imbalance of u_{\theta}^{r} is also controlled.

Besides, \mathtt{PEN}_{2}({\Phi^{r}}^{\intercal}\hat{\alpha}^{r}) is controlled through the quadratic term in the penalty: rougher propensity odds models are penalized more. In total, using such a model we control both the smoothness of the propensity odds model and the imbalance of u_{\theta}^{r}.
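To make this construction concrete in the one-dimensional case, where the semi-inner product reduces to \langle f,g\rangle=\int f''g''\,dx, the following hedged sketch computes the Gram matrix \mathbf{D}^{r} for a monomial basis on [0,1], diagonalizes it by an orthogonal change of basis, and sets t_{k}=\sqrt{\mathtt{PEN}_{2}(\phi^{r}_{k})}; the basis and interval are illustrative assumptions.

```python
# Hedged 1-d illustration of the roughness penalty: D_{ij} = \int phi_i'' phi_j'' dx
# for monomials on [0, 1], then an orthogonal change of basis making D diagonal.
import numpy as np
from scipy.integrate import quad

degrees = [0, 1, 2, 3]                                # phi_k(x) = x^k on [0, 1]

def d2(k):
    """Second derivative of x^k (identically zero for k < 2)."""
    if k < 2:
        return lambda x: 0.0
    return lambda x: k * (k - 1) * x ** (k - 2)

D = np.array([[quad(lambda x, i=i, j=j: d2(i)(x) * d2(j)(x), 0, 1)[0]
               for j in degrees] for i in degrees])

# Eigen-decomposition gives an orthogonal transformation under which D is diagonal;
# the eigenvalues are PEN_2 of the transformed basis functions.
pen2, _ = np.linalg.eigh(D)
t = np.sqrt(np.clip(pen2, 0.0, None))                 # t_k = sqrt(PEN_2(phi_k)); zero for the
print(np.round(D, 3), np.round(t, 3), sep="\n")       # constant and linear directions
```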

3.3.4 Tuning parameter selection

One approach to choosing the tuning parameters \lambda and \gamma is to select the pair that minimizes the expected loss, usually estimated by a cross-validation procedure. The candidate parameters can be drawn from a grid satisfying \lambda>0 and \gamma\in[0,1]. For example, in our simulation study, we perform 5-fold cross-validation using the grid (\lambda\in\{1,0.1,\ldots,10^{-10}\})\times(\gamma\in\{0,0.1,0.5,0.9,1\}). The dataset is randomly divided into five folds. In each iteration, the propensity odds models are trained on four folds using the penalized optimization and evaluated on the remaining fold to estimate the out-of-sample tailored loss. The parameter pair that yields the lowest cross-validation error across the five iterations is then used to retrain the model on the entire dataset. This approach enhances the model's ability to generalize to unseen data and ensures a stable estimation of the propensity odds.
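A hedged sketch of this procedure for one pattern r is given below, with a smaller illustrative grid, simulated data, a small polynomial basis, and a generic optimizer standing in for the penalized fit; none of these choices are taken from the paper's implementation.

```python
# Hedged sketch of 5-fold cross-validation over (lambda, gamma) for one pattern r.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
n = 3000
l1 = rng.uniform(-1, 1, n)
is_r = rng.binomial(1, 1 / (1 + np.exp(-(0.4 * l1 - 0.6)))).astype(bool)
is_cc = ~is_r
Phi = np.column_stack([np.ones(n), l1, l1 ** 2, l1 ** 3])
t = np.array([0.5, 0.5, 2.0, 6.0])            # illustrative tolerances ~ sqrt(PEN_2(phi_k))
Dmat = np.diag(t ** 2)                        # diagonal roughness matrix (orthogonalized basis)

def tailored_loss(alpha, idx):
    eta = Phi[idx] @ alpha
    return np.mean(is_cc[idx] * np.exp(eta) - is_r[idx] * eta)

def fit(idx, lam, gam):
    """Minimize the penalized average loss (10) on the rows indexed by idx."""
    obj = lambda a: (tailored_loss(a, idx)
                     + lam * (gam * np.sum(t * np.abs(a)) + (1 - gam) * a @ Dmat @ a))
    return minimize(obj, np.zeros(Phi.shape[1]), method="Powell").x

folds = np.array_split(rng.permutation(n), 5)
grid = [(lam, gam) for lam in 10.0 ** -np.arange(6) for gam in (0.0, 0.1, 0.5, 0.9, 1.0)]
cv_err = {(lam, gam): np.mean([tailored_loss(fit(np.setdiff1d(np.arange(n), f), lam, gam), f)
                               for f in folds])
          for (lam, gam) in grid}
lam_best, gam_best = min(cv_err, key=cv_err.get)
alpha_hat = fit(np.arange(n), lam_best, gam_best)
print(lam_best, gam_best)
```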

4 Asymptotic properties

In this section, we investigate the consistency and asymptotic normality of the proposed estimators. Consider the proposed penalization (10) with generic basis functions. By minimizing the penalized average loss \mathcal{L}^{r}_{\lambda}(\alpha^{r}) for each missing pattern r\neq 1_{d}, we first obtain the minimizers \hat{\alpha}^{r} and the balancing estimators of the propensity odds, \mathrm{Odds}^{r}(l^{r};\hat{\alpha}^{r}). Summing up the propensity odds estimators, we obtain the weights \hat{w}(L_{i})=\sum_{r\in\mathcal{R}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r}) and a plug-in estimator of \mathbb{E}\{\psi_{\theta}(L)\}:

\hat{\mathbb{P}}_{N}\psi_{\theta}=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}\hat{w}(L_{i})\psi_{\theta}(L_{i})\right\}\ .

The resulting estimator of \theta_{0}, denoted by \hat{\theta}_{N}, is the solution to \hat{\mathbb{P}}_{N}\psi_{\theta}=0. Under mild conditions, we show that \mathrm{Odds}^{r}(l^{r};\hat{\alpha}^{r}) is consistent, \hat{\mathbb{P}}_{N}\psi_{\theta} is asymptotically normal for each \theta in a compact set \Theta\subset\mathbb{R}^{q}, and \hat{\theta}_{N} is consistent and efficient. We require the following set of assumptions to establish the consistency of the proposed estimator.
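Before stating them, the following hedged sketch illustrates this final estimation step, with the logistic-regression score as \psi_{\theta} and placeholder complete-case data and weights; in practice the weights would come from the penalized odds estimates described above.

```python
# Hedged sketch of solving \hat{P}_N psi_theta = 0 given complete-case weights.
import numpy as np
from scipy.optimize import root
from scipy.special import expit

def weighted_z_estimate(X_cc, Y_cc, w_hat, theta_init):
    """Solve the weighted estimating equation over complete cases (same root as \hat{P}_N psi_theta = 0)."""
    def ee(theta):
        resid = Y_cc - expit(X_cc @ theta)
        return (w_hat[:, None] * resid[:, None] * X_cc).mean(axis=0)
    return root(ee, x0=theta_init).x

# Placeholder complete-case data and weights, for illustration only.
rng = np.random.default_rng(4)
X_cc = np.column_stack([rng.normal(size=(800, 2)), np.ones(800)])
Y_cc = rng.binomial(1, expit(X_cc @ np.array([1.0, -1.0, 0.5])))
w_hat = np.exp(rng.normal(scale=0.3, size=800))       # stands in for sum_r Odds^r(L_i^r; alpha_hat^r)
print(weighted_z_estimate(X_cc, Y_cc, w_hat, np.zeros(3)))
```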

Assumption 2.

The following conditions hold for each missing pattern r\in\mathcal{R}:

  1.  A:

    There exist constants 0<c_{0}<C_{0} such that c_{0}\leq\mathrm{Odds}^{r}(l^{r})\leq C_{0} for all l^{r}\in\mathrm{dom}^{r}, where \mathrm{dom}^{r}\subset\mathbb{R}^{d_{r}} is the domain of \mathrm{Odds}^{r}.

  2.  B:

    The optimization \min_{\alpha^{r}}\mathbb{E}[\mathcal{L}^{r}\{\mathrm{Odds}^{r}(L^{r};\alpha^{r}),R\}] has a unique solution \alpha^{r}_{0}\in\mathbb{R}^{K_{r}}.

  3.  C:

    The total number of basis functions K_{r} grows as the sample size increases and satisfies K_{r}^{2}=o(N_{r}), where N_{r} is the number of observations in patterns r and 1_{d}.

  4.  D:

    There exist constants C_{1}>0 and \mu_{1}>1/2 such that for any positive integer K_{r}, there exists \alpha_{K_{r}}^{*}\in\mathbb{R}^{K_{r}} satisfying \sup_{l^{r}\in\mathrm{dom}^{r}}\big|\mathrm{Odds}^{r}(l^{r})-\mathrm{Odds}^{r}(l^{r};\alpha_{K_{r}}^{*})\big|\leq C_{1}K_{r}^{-\mu_{1}}.

  5.  E:

    The Euclidean norm of the basis function vector satisfies \sup_{l^{r}\in\mathrm{dom}^{r}}\|\Phi^{r}(l^{r})\|_{2}=O(K_{r}^{1/2}).

  6.  F:

    Let \lambda_{1},\ldots,\lambda_{K_{r}} be the eigenvalues of \mathbb{E}\{\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\} in non-decreasing order. There exist constants \lambda_{\min}^{*} and \lambda_{\max}^{*} such that 0<\lambda_{\min}^{*}\leq\lambda_{1}\leq\lambda_{K_{r}}\leq\lambda_{\max}^{*}.

  7.  G:

    The tuning parameter \lambda satisfies \lambda=o(1/\sqrt{K_{r}N_{r}}).

Assumption 2.A is the boundedness assumption commonly used in missing data and causal inference; it is equivalent to requiring that P(R=r\mid l^{r},R\in\{1_{d},r\}) be strictly bounded away from 0 and 1. The domain \mathrm{dom}^{r} is usually assumed to be compact, so that \mathrm{Odds}^{r} can be approximated by compactly supported functions. Assumption 2.B is a standard condition for the consistency of minimum loss estimators of \alpha^{r}_{0}. It is well known that the uniform approximation error is related to the number of basis functions, so we allow K_{r} to increase with the sample size under the restrictions in Assumption 2.C. The uniform approximation rate \mu_{1} in Assumption 2.D depends on the true propensity odds \mathrm{Odds}^{r} and the choice of basis functions. For instance, the rate is \mu_{1}=s/d for power series and splines if \mathrm{Odds}^{r} is continuously differentiable of order s on [0,1]^{d} under mild assumptions; see Newey [20] and Fan et al. [11] for details. The restriction \mu_{1}>1/2 is a technical condition ensuring that the estimator of the propensity odds is consistent. Assumptions 2.E and 2.F are standard conditions for controlling the magnitude of the basis functions: the Euclidean norm of the basis function vector can increase as the spanned space extends, but its growth rate cannot be too fast. These assumptions are satisfied by many bases, such as regression spline, trigonometric polynomial, and wavelet bases; see, e.g., Newey [20], Horowitz and Mammen [12], Chen [8] and Fan et al. [11]. Assumption 2.G is a technical assumption on the tuning parameter \lambda that maintains the consistency of the weights. We now establish the consistency of the estimated odds.

Theorem 2.

Under Assumptions 1 and 2, for each missing pattern r, we have

\left\|\mathrm{Odds}^{r}(\ \cdot\ ;\hat{\alpha}^{r})-\mathrm{Odds}^{r}\right\|_{\infty}=O_{p}\left(\sqrt{\frac{K_{r}^{2}}{N_{r}}}+K_{r}^{\frac{1}{2}-\mu_{1}}\right)=o_{p}(1)\ ,
\left\|\mathrm{Odds}^{r}(\ \cdot\ ;\hat{\alpha}^{r})-\mathrm{Odds}^{r}\right\|_{P,2}=O_{p}\left(\sqrt{\frac{K_{r}}{N_{r}}}+K_{r}^{-\mu_{1}}\right)=o_{p}(1)\ ,

where \|X\|_{P,2}^{2}=\int X^{2}dP is the second moment of a random variable.

The proof is provided in Appendix C.

Next, we establish the asymptotic normality of the empirical weighted estimating function \hat{\mathbb{P}}_{N}\psi_{\theta} for each \theta. Note that the estimating function \psi_{\theta} is a q-dimensional vector-valued function, so we only need to consider each entry separately. Denote the j-th entries of \psi_{\theta} and u_{\theta}^{r} by \psi_{\theta,j} and u_{\theta,j}^{r}, respectively. Let n_{[\ ]}\{\epsilon,\mathcal{F},\cdot\} denote the bracketing number of the set \mathcal{F} by \epsilon-brackets with respect to a specific norm. We will need the following additional conditions.

Assumption 3.

The following conditions hold for all missing patterns r and all \theta\in\Theta, where \Theta is a compact set:

  1.  A:

    There exist constants C_{2}>0 and \mu_{2}>1/2 such that for any \theta and each missing pattern r, there exists a parameter \beta_{\theta}^{r} satisfying \sup_{l^{r}\in\mathrm{dom}^{r}}|u_{\theta}^{r}(l^{r})-\Phi^{r}(l^{r})^{\intercal}\beta_{\theta}^{r}|\leq C_{2}K_{r}^{-\mu_{2}}.

  2.  B:

    Each true propensity odds \mathrm{Odds}^{r} is contained in a set of smooth functions \mathcal{M}^{r}. There exist constants C_{\mathcal{M}}>0 and d_{\mathcal{M}}>1/2 such that \log n_{[\ ]}\{\epsilon,\mathcal{M}^{r},L^{\infty}\}\leq C_{\mathcal{M}}(1/\epsilon)^{1/d_{\mathcal{M}}}.

  3.  C:

    The set \Psi:=\{\psi_{\theta,j}:\theta\in\Theta,j=1,\ldots,q\} is contained in a function class \mathcal{H} for which there exist constants C_{\mathcal{H}}>0 and d_{\mathcal{H}}>1/2 such that \log n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}\leq C_{\mathcal{H}}(1/\epsilon)^{1/d_{\mathcal{H}}}.

  4.  D:

    There exists a constant C_{3} such that for all j=1,\ldots,q,

    \mathbb{E}\left\{\psi_{\theta,j}(L)-u_{\theta,j}^{r}(L^{r})\right\}^{2}\leq\mathbb{E}\left[\sup_{\theta}\left\{\psi_{\theta,j}(L)-u_{\theta,j}^{r}(L^{r})\right\}^{2}\right]\leq C_{3}^{2}\ .

  5.  E:

    N_{r}^{1/\{2(\mu_{1}+\mu_{2})\}}=o(K_{r}); that is, the growth rate of the number of basis functions has a lower bound.

Assumption 3.A is a requirement similar in spirit to Assumption 2.D, ensuring that the conditional expectation u_{\theta}^{r}(l^{r}) can be well approximated as we extend the space spanned by the basis functions. Assumptions 3.B and 3.C are conditions on the complexity of the function classes \mathcal{M}^{r} and \mathcal{H} to ensure uniform convergence over \theta. These assumptions are satisfied by many function classes. For instance, if \mathcal{M}^{r} is a Sobolev class of functions f:[0,1]\mapsto\mathbb{R} such that \|f\|_{\infty}\leq 1 and the (s-1)-th derivative is absolutely continuous with \int(f^{(s)})^{2}(x)dx\leq 1 for some fixed s\in\mathbb{N}, then \log n_{[\ ]}\{\epsilon,\mathcal{M}^{r},L^{\infty}\}\leq C(1/\epsilon)^{1/s} by Example 19.10 of Van der Vaart [29]. The condition d_{\mathcal{M}}>1/2 is then satisfied for all s\geq 1. A Hölder class of functions also satisfies this condition [11]. Assumption 3.D is a technical condition on the envelope function that allows us to apply a maximal inequality via bracketing. Assumption 3.E requires the number of basis functions to grow so that the approximation error decreases.

Theorem 3.

Suppose that Assumptions 1, 2 and 3 hold. For any \theta\in\Theta,

\sqrt{N}\left[\hat{\mathbb{P}}_{N}\psi_{\theta}-\mathbb{E}\{\psi_{\theta}(L)\}\right]\overset{d}{\to}N(0,V_{\theta})\ ,

where V_{\theta} is defined in Theorem 1.

To prove the theorem, we utilize a few lemmas on the bracketing number n_{[\ ]}\{\epsilon,\mathcal{F},\cdot\}. The proofs of the theorem and the lemmas are given separately in Appendices D and F.

With further assumptions, we show the consistency and asymptotic normality of \hat{\theta}_{N}, the solution to \hat{\mathbb{P}}_{N}\psi_{\theta}=0.

Assumption 4.

The following conditions hold for all missing patterns r and all \theta\in\Theta:

  1.  A:

    For any sequence \{\theta_{n}\}\subset\Theta, \mathbb{E}\{\max_{1\leq j\leq q}|\psi_{\theta_{n},j}(L)|\}\to 0 implies \|\theta_{n}-\theta_{0}\|_{2}\to 0.

  2.  B:

    For each j-th entry and any \delta>0, there exists an envelope function f_{\delta,j} such that |\psi_{\theta,j}(l)-\psi_{\theta_{0},j}(l)|\leq f_{\delta,j}(l) for any \theta with \|\theta-\theta_{0}\|_{2}\leq\delta. Moreover, \|f_{\delta,j}\|_{P,2}\to 0 as \delta\to 0.

Assumption 4.A is a standard regularity assumption for Z-estimation. Assumption 4.B corresponds to a continuity assumption on \psi(l,\theta) with respect to \theta. For example, a Lipschitz class of functions \{f_{\theta}:\theta\in\Theta\} satisfies this condition if \Theta is compact. More precisely, there exists a uniform envelope function f such that |f_{\theta_{1}}(l)-f_{\theta_{2}}(l)|\leq\|\theta_{1}-\theta_{2}\|_{2}f(l) for any \theta_{1},\theta_{2}\in\Theta, where \|f\|_{P,2}<\infty. Now, we establish the theorem.

Theorem 4.

Suppose that Assumptions 1–4 hold. Then

\hat{\theta}_{N}\xrightarrow{P}\theta_{0}

and

N^{\frac{1}{2}}(\hat{\theta}_{N}-\theta_{0})\overset{d}{\to}N(0,D_{\theta_{0}}^{-1}V_{\theta_{0}}D_{\theta_{0}}^{-1^{\intercal}})\ ,

where D_{\theta_{0}}^{-1}V_{\theta_{0}}D_{\theta_{0}}^{-1^{\intercal}} is the asymptotic variance bound in Theorem 1. Therefore, \hat{\theta}_{N} is semiparametrically efficient.

Similar to Assumption 3.B, suppose that the set of derivatives of the estimating functions \mathcal{J}_{\Theta}:=\{\dot{\psi}_{\theta}:\theta\in\Theta\} satisfies n_{[\ ]}\{\epsilon,\mathcal{J}_{\Theta},L_{1}(P)\}<\infty, and that \dot{\psi}_{\theta} is continuous in a neighborhood of \theta_{0}. Also suppose that there exists a constant \lambda_{\max}^{\prime} such that \sup_{\theta\in\Theta}\|\mathbb{E}\{\dot{\psi}_{\theta}(L)\dot{\psi}_{\theta}(L)^{\intercal}\}\|_{2}\leq\lambda_{\max}^{\prime}. Then,

\hat{D}_{\hat{\theta}_{N}}=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}\hat{w}(L_{i})\dot{\psi}_{\hat{\theta}_{N}}(L_{i})\right\}

is a consistent estimator of D_{\theta_{0}}. Furthermore, let \hat{u}_{\hat{\theta}_{N}}^{r} be an estimator of the conditional expectation u_{\hat{\theta}_{N}}^{r} such that \sup_{l^{r}\in\mathrm{dom}^{r}}|\hat{u}_{\hat{\theta}_{N},j}^{r}(l^{r})-u_{\hat{\theta}_{N},j}^{r}(l^{r})|=o_{p}(1) for each j-th entry. Then, \hat{D}_{\hat{\theta}_{N}}^{-1}\hat{V}_{\hat{\theta}_{N}}\hat{D}_{\hat{\theta}_{N}}^{-1^{\intercal}} is a consistent estimator of D_{\theta_{0}}^{-1}V_{\theta_{0}}D_{\theta_{0}}^{-1^{\intercal}}, where

\hat{V}_{\hat{\theta}_{N}}=\frac{1}{N}\sum_{i=1}^{N}\left(\hat{F}_{i}\hat{F}_{i}^{\intercal}\right),
\hat{F}_{i}=\sum_{r\in\mathcal{R}}\mathsf{1}_{R_{i}=r}\hat{u}_{\hat{\theta}_{N}}^{r}(L^{r}_{i})+\mathsf{1}_{R_{i}=1_{d}}\sum_{r\in\mathcal{R}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})\left\{\psi_{\hat{\theta}_{N}}(L_{i})-\hat{u}_{\hat{\theta}_{N}}^{r}(L^{r}_{i})\right\}-\hat{\mathbb{P}}_{N}\psi_{\hat{\theta}_{N}}\ . (11)

The proof is given in Appendix F.
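To illustrate the variance estimator, the following hedged sketch computes (11) for the bivariate example of Section 2 with \psi_{\theta}(L)=L-\theta (so D_{\theta}=-I and the variance bound is V_{\theta} itself); for brevity it uses the true odds and estimates u^{r} by a linear regression on the complete cases, both simplifying illustrative choices rather than the paper's procedure.

```python
# Hedged sketch of the plug-in variance estimator (11) for the bivariate example.
import numpy as np

rng = np.random.default_rng(5)
N = 5000
L = rng.normal(size=(N, 2))
odds10 = np.exp(-1.0 + 0.5 * L[:, 0])          # Odds^{10} depends only on L1
odds01 = np.exp(-1.0 - 0.5 * L[:, 1])          # Odds^{01} depends only on L2
P = np.column_stack([np.ones(N), odds10, odds01])
P /= P.sum(axis=1, keepdims=True)
R = np.array([rng.choice(3, p=p) for p in P])  # 0: pattern 11, 1: 10, 2: 01
cc = R == 0

w = 1.0 + odds10 + odds01                                    # weights on complete cases
theta_hat = (w[cc, None] * L[cc]).sum(axis=0) / w[cc].sum()  # solves the weighted EE exactly
psi = L - theta_hat                                          # psi_theta(L) = L - theta

def u_hat(col, x_new):
    """Linear-regression estimate of u^r(l^r) = E{psi_theta(L) | L^r = l^r, R = 1_d}."""
    Z = np.column_stack([np.ones(cc.sum()), L[cc, col]])
    beta = np.linalg.lstsq(Z, psi[cc], rcond=None)[0]
    return np.column_stack([np.ones(len(x_new)), x_new]) @ beta

u10, u01 = u_hat(0, L[:, 0]), u_hat(1, L[:, 1])

# Influence-function estimates F_i from (11).
F = np.zeros((N, 2))
F[R == 1] = u10[R == 1]
F[R == 2] = u01[R == 2]
F[cc] = psi[cc] + odds10[cc, None] * (psi[cc] - u10[cc]) + odds01[cc, None] * (psi[cc] - u01[cc])
F -= (cc[:, None] * w[:, None] * psi).mean(axis=0)           # subtract \hat{P}_N psi_theta
V_hat = F.T @ F / N                                          # here D_theta = -I, so variance bound = V
print(theta_hat, np.sqrt(np.diag(V_hat) / N))                # estimate and its standard errors
```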

5 Simulation

A simulation study is conducted to evaluate the finite-sample performance of the proposed estimators. We designed three missing mechanisms that satisfy the CCMV condition. For each setting, we simulated 1,000 independent data sets, each of size N=1,000, where X_{j}, j=1,2,3, are generated independently from a truncated standard normal distribution with support [-3,3]. We considered a logistic regression model \mathrm{logit}\{P(Y=1\mid X)\}=\theta_{1}X_{1}+\theta_{2}X_{2}+\theta_{3}X_{3}+\theta_{4}, where the true coefficients \theta_{0}=(1,-1,1,-2) are the parameters of interest. The response variable Y is always observed, while X_{1},X_{2},X_{3} can be missing. Four non-monotone response patterns are considered: L^{1111}=(Y,X_{1},X_{2},X_{3}), L^{1110}=(Y,X_{1},X_{2}), L^{1101}=(Y,X_{1},X_{3}) and L^{1100}=(Y,X_{1}). The categorical variable R\in\{1111,1110,1101,1100\} indicating the response pattern is generated from a multinomial distribution with the probabilities shown in Appendix A.
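For reference, a hedged sketch of generating one such dataset is shown below; the covariates, outcome model, and response patterns follow the description above, while the pattern probabilities are illustrative stand-ins because the actual probabilities are given in Appendix A.

```python
# Hedged sketch of one simulated dataset; pattern probabilities are illustrative only.
import numpy as np
from scipy.stats import truncnorm
from scipy.special import expit

rng = np.random.default_rng(6)
N = 1000
X = truncnorm.rvs(-3, 3, size=(N, 3), random_state=rng)        # X1, X2, X3 supported on [-3, 3]
theta0 = np.array([1.0, -1.0, 1.0, -2.0])
Y = rng.binomial(1, expit(X @ theta0[:3] + theta0[3]))

# CCMV-compatible pattern probabilities: odds versus pattern 1111 depend only on
# the variables observed in each pattern (coefficients below are made up).
odds = np.column_stack([
    np.ones(N),                                   # R = 1111
    np.exp(-1.0 + 0.3 * X[:, 0] + 0.3 * Y),       # R = 1110: observes (Y, X1, X2)
    np.exp(-1.2 + 0.2 * X[:, 0] - 0.3 * Y),       # R = 1101: observes (Y, X1, X3)
    np.exp(-1.5 - 0.2 * X[:, 0]),                 # R = 1100: observes (Y, X1)
])
probs = odds / odds.sum(axis=1, keepdims=True)
patterns = np.array(["1111", "1110", "1101", "1100"])
R = np.array([rng.choice(patterns, p=p) for p in probs])

# Mask unobserved covariates according to the drawn pattern.
X_obs = X.copy()
X_obs[np.isin(R, ["1101", "1100"]), 1] = np.nan              # X2 missing
X_obs[np.isin(R, ["1110", "1100"]), 2] = np.nan              # X3 missing
print(dict(zip(*np.unique(R, return_counts=True))))
```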

Table 1: Results of the simulation study based on 1000 replications under three CCMV conditions.
Method | Bias: \theta_{1} \theta_{2} \theta_{3} \theta_{4} | MSE: \theta_{1} \theta_{2} \theta_{3} \theta_{4}
Setting 1
Full -0.022 0.007 -0.010 0.009 0.016 0.012 0.012 0.012
Complete-case 0.512 0.239 0.002 0.103 0.297 0.100 0.031 0.048
Mean-score 0.218 0.098 -0.004 0.084 0.077 0.050 0.034 0.047
True-weight -0.065 0.086 -0.044 0.058 0.049 0.094 0.059 0.072
Entropy-linear -0.067 0.082 -0.045 0.057 0.038 0.091 0.059 0.071
Entropy-parametric -0.067 0.082 -0.045 0.057 0.038 0.091 0.059 0.071
Entropy-basis -0.039 0.064 -0.027 0.088 0.030 0.053 0.044 0.054
Proposed -0.024 0.060 -0.032 0.112 0.033 0.047 0.043 0.054
Setting 2
Full -0.022 0.007 -0.010 0.009 0.016 0.012 0.012 0.012
Complete-case 0.943 0.310 -0.176 -0.089 0.921 0.152 0.070 0.043
Mean-score 0.318 0.085 0.059 0.052 0.140 0.073 0.045 0.041
True-weight -0.045 0.080 -0.063 0.038 0.050 0.100 0.067 0.065
Entropy-linear -0.001 0.126 -0.111 0.051 0.041 0.118 0.081 0.072
Entropy-parametric -0.056 0.088 -0.065 0.044 0.042 0.092 0.065 0.065
Entropy-basis -0.062 0.054 -0.058 0.070 0.042 0.074 0.064 0.067
Proposed -0.023 0.029 -0.044 0.089 0.029 0.051 0.055 0.057
Setting 3
Full -0.022 0.007 -0.010 0.009 0.016 0.012 0.012 0.012
Complete-case 0.805 0.889 -0.043 0.252 0.689 0.892 0.041 0.111
Mean-score 0.253 0.119 0.002 -0.071 0.098 0.090 0.050 0.061
True-weight -0.076 0.195 -0.115 0.081 0.115 0.220 0.134 0.129
Entropy-linear -0.252 0.349 -0.153 -0.022 0.135 0.295 0.149 0.096
Entropy-parametric -0.158 0.265 -0.136 0.104 0.098 0.218 0.151 0.143
Entropy-basis -0.084 0.187 -0.129 0.109 0.135 0.244 0.147 0.145
Proposed 0.016 0.035 -0.021 0.047 0.039 0.054 0.052 0.071

We first analyzed the simulated data with the full dataset (Full), the ideal case with no missingness. We then analyzed the data in the complete-case pattern (Complete-case), discarding the data in all missing patterns r\neq 1_{d} and applying an unweighted analysis to the remaining data. As described before, the mean-score approach typically requires the discretization of continuous variables. To implement the mean-score method of Chatterjee and Li [6] (Mean-score), we created discrete surrogates \tilde{X}_{j}=\mathsf{1}_{X_{j}\geq 0}, j=1,2,3. Note that the mean-score method makes a missing-at-random (MAR) assumption. Next, we considered inverse propensity weighting with the true inverse propensity weights (True-weight). We also examined the performance of estimators based on propensity odds estimated with the entropy loss. Tchetgen Tchetgen et al. [28] modeled each propensity odds by linear logistic regression (Entropy-linear), i.e., the logarithm of the propensity odds, \log(\mathrm{Odds}^{r}), is a linear function of the variables shared by patterns r and 1_{d}; this model is correctly specified only for setting 1. For settings 2 and 3, we also applied a parametric logistic regression with the correct model (Entropy-parametric). Note that these are unpenalized methods. When implementing the proposed methodology (Proposed), we set the basis functions \Phi^{r} to be tensor products of functions of the variables shared by patterns r and 1_{d}, where polynomials of degree up to three are chosen for each continuous variable and a binary indicator function is chosen for the discrete response Y. We then orthogonalized the basis functions and followed the choices of \{t_{k}\}_{k=1}^{K_{r}} and of the matrix \mathbf{D}^{r} in Section 3.3.3 to construct the penalty function. Notice that the constant and linear functions have zero roughness; in practice, to achieve stable estimation of the propensity odds, we set the tolerance to the empirical imbalances of these functions equal to the smallest positive roughness among the other basis functions. We also applied the same basis functions and penalties with the entropy loss (Entropy-basis) to estimate the propensity odds, in order to compare the tailored and entropy loss functions. The biases and mean squared errors of each coefficient are shown in Table 1.

Table 2: Results of the asymptotic property study: Avg Est SD stands for the average of the estimated asymptotic standard deviations. MC SD is short for the Monte Carlo standard deviation, i.e., the standard deviation of the proposed estimators of \theta over 1000 replications. A 95% confidence interval was constructed using the estimates of \theta and the asymptotic variance for each simulated dataset. The confidence interval coverage probabilities are reported based on 1000 replications.
Avg Est SD/MC SD | Coverage probabilities
Sample size | \theta_{1} \theta_{2} \theta_{3} \theta_{4} | \theta_{1} \theta_{2} \theta_{3} \theta_{4}
Setting 1 N=1000 0.937 0.861 0.961 0.970 0.938 0.908 0.930 0.926
N=2000 0.939 0.979 0.990 1.032 0.937 0.942 0.940 0.934
N=5000 1.023 1.091 1.040 1.164 0.947 0.958 0.957 0.962
N=10000 1.079 1.115 1.039 1.219 0.960 0.959 0.954 0.962
Setting 2 N=1000 0.978 0.843 0.892 0.958 0.946 0.897 0.930 0.934
N=2000 0.926 0.952 0.916 0.950 0.930 0.931 0.931 0.928
N=5000 0.967 0.971 0.933 1.037 0.944 0.919 0.930 0.946
N=10000 1.061 1.097 0.984 1.041 0.959 0.962 0.951 0.946
Setting 3 N=1000 0.877 0.793 0.856 0.856 0.899 0.893 0.904 0.916
N=2000 0.918 0.932 0.925 0.905 0.921 0.924 0.940 0.921
N=5000 0.952 1.018 1.007 0.947 0.936 0.951 0.959 0.939
N=10000 0.992 1.051 1.012 1.010 0.946 0.959 0.950 0.947

In all three settings, the mean-score method performs poorly (considerable bias in \theta_{1}), which is likely attributable to the misspecified missing mechanism and the discretization. In settings 1 and 2, compared with the linear model and the correctly specified parametric model, using basis functions as regressors and adding penalization alleviates overfitting, so the inverse propensity weights and the resulting estimates are more stable. The proposed method provides comparable or even smaller errors in these two settings, which is expected since the tailored loss also encourages balance of the observed variables. In setting 3, all the methods misspecify the missing mechanism, but the proposed method outperforms the others.

We also assess the performance of the asymptotic variance estimator of Section 4 and the coverage of confidence intervals based on the estimated standard errors. The results in Table 2 show that the average of the estimated asymptotic standard deviations is close to the Monte Carlo standard deviations, which are the standard deviations of the estimated parameters across the 1000 datasets. The 95% confidence intervals constructed with the estimated asymptotic variance have close to nominal coverage, especially as the sample size increases.

6 Real data application

We apply the proposed method to a study of hip fracture risk factors among male veterans [1]. The binary outcome of interest was the presence (Y=1) or absence (Y=0) of hip fracture. Preliminary exploratory analyses suggested that nine risk factors are potentially significant. A detailed description of the data set and its missing data patterns is in Chen [7]. Three of these factors (BMI, HGB, and Albumin) are continuous variables, and the remaining six (Etoh, Smoke, Dement, Antiseiz, LevoT4, Antichol) are discrete variables. Including 1_{d}, where all variables are observed, there is a total of 38 missingness patterns. The parameters of interest are the coefficients of a logistic regression model with these nine variables as predictors and the presence or absence of hip fracture as the binary outcome.

We implemented the same Complete-case analysis as in Chen [7] and our proposed method under the CCMV assumption. Like the simulation study, we generated orthogonalized tensor product basis functions for continuous variables and considered two penalties constructed by the roughness norm. Indicator functions are created for each discrete variable as basis functions. Since there are many combinations of possible values of discrete variables, we do not consider any further tensor products. In other words, the indicator functions of discrete variables are added to the log-linear model for propensity odds in an additive way. For example, if Lr=L^{r}=(BMI, HGB, Dement, Antiseiz), the propensity odds model is

logOddsr(lr)=k=116αkrϕkr(BMI,HGB)+α17r𝟣Dement=1+α18r𝟣Antiseiz=1\displaystyle\log\mathrm{Odds}^{r}(l^{r})=\sum_{k=1}^{16}\alpha^{r}_{k}\phi^{r}_{k}(\mathrm{BMI,HGB})+\alpha^{r}_{17}\mathsf{1}_{\mathrm{Dement}=1}+\alpha^{r}_{18}\mathsf{1}_{\mathrm{Antiseiz}=1}

where {ϕkr}k=116\{\phi^{r}_{k}\}_{k=1}^{16} are tensor products of two sets of functions, each containing polynomials up to degree 3 in one continuous variable. For stabilization, the indicator functions are penalized as well, since they play a role similar to that of the lower-order moments. The relative tolerance for the empirical imbalances of the indicator functions was chosen to be the same as that for the constant and linear functions.
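For illustration, the following minimal sketch (in Python) constructs the 18 regressors of the displayed model for this pattern; it uses raw rather than orthogonalized polynomials, omits the roughness penalties, and the variable names are purely illustrative.

import numpy as np

def odds_regressors(bmi, hgb, dement, antiseiz):
    # Sketch of the 18 regressors in the displayed log-odds model:
    # 16 tensor products of polynomials of degree 0-3 in BMI and HGB,
    # plus indicator functions for Dement and Antiseiz.
    p_bmi = np.column_stack([bmi ** k for k in range(4)])   # N x 4
    p_hgb = np.column_stack([hgb ** k for k in range(4)])   # N x 4
    tensor = (p_bmi[:, :, None] * p_hgb[:, None, :]).reshape(len(bmi), 16)
    indicators = np.column_stack([dement == 1, antiseiz == 1]).astype(float)
    return np.column_stack([tensor, indicators])            # N x 18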

We present the parameter estimates, standard errors, and p-values in Table 3. The results show that our proposed estimators have standard errors similar to those from the complete-case analysis. All nine predictor variables are statistically significant at the 5%5\% level. Our results largely agree with those of Chen [7]. One apparent difference is that the coefficient of LevoT4 is significant in our analysis but not in Chen [7]. The main difference between the two estimators is the missing mechanism assumptions.

Some patterns have very few observations: 26 patterns contain at most three observations each, which may raise concerns about the stability of the propensity odds estimator for those patterns. By definition, the true propensity odds between a rare pattern and the complete cases should be negligible, so the contribution of rare patterns to the combined weights should also be negligible. Although such sample sizes are too small to guarantee the asymptotic properties, the estimated odds for rare patterns under the proposed penalized estimation contribute only slightly to the total weights. Thus, the total weights and the weighted estimator of the parameters of interest remain stable.

Table 3: Results of the Hip Fracture Data analysis: The asymptotic variance estimator is used to calculate the standard error(SE) and p-value.
Parameters Complete-case Proposed
Estimate SE p-value Estimate SE p-value
Etoh 1.391 0.391 <<1e-3 1.410 0.397 <<1e-3
Smoke 0.929 0.400 0.020 0.972 0.386 0.012
Dementia 2.509 0.724 0.001 2.595 0.689 <<1e-3
Antiseiz 3.311 1.064 0.002 3.435 0.935 <<1e-3
LevoT4 2.010 1.015 0.048 1.941 0.771 0.012
AntiChol -1.918 0.768 0.012 -1.872 0.796 0.019
BMI -0.104 0.039 0.007 -0.100 0.040 0.013
logHGB -2.597 1.202 0.031 -2.541 1.232 0.039
Albumin -0.911 0.353 0.010 -0.857 0.378 0.024

7 Discussion

We studied the estimation of model parameters defined by estimating functions with non-monotone missing data under the complete-case missing variable condition, a case of missing not at random. Using a tailored loss that encourages balance, functional bases for flexible modeling, and penalization for stable estimation, we proposed a method that can be viewed as a generalization of both inverse probability weighting and the mean-score approach. The proposed framework improves the reliability of inference and has significant implications for the analysis of datasets with non-monotone missingness, which is frequently encountered in biomedical data. Given the limited availability of analytical tools for addressing non-monotone missingness, we hope to fill an important gap in the literature.

Despite its strengths, the proposed method has certain limitations. The CCMV assumption, while plausible, is inherently unverifiable from the observed data, which may limit its acceptance in some applications. Recently, [10] generalized this assumption to pattern graphs, which allow detailed relationships among non-monotone missing patterns to be specified; the CCMV assumption is a special case of a pattern graph. An example of such a graph is shown in Figure 1, in which all the missing patterns have the same and only parent, the fully observed pattern.

Figure 1: The pattern graph for CCMV. The nodes are the patterns 1111, 1110, 1101, and 1100, and each missing pattern has the fully observed pattern 1111 as its only parent.

A future research direction is to extend the tailored loss estimation to more flexible identifying assumptions represented by general pattern graphs; the proposed method is well-suited as a building block for estimation under such assumptions. Another important direction is to investigate the impact of misspecifying the identifying assumptions and to explore ways to incorporate robustness checks, including sensitivity analyses that account for potential deviations from CCMV or other assumed structures.

Acknowledgements

This work is partly supported by the National Science Foundation (DMS-1711952). Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.

Appendix A Details of simulation settings

The probabilities of data belonging to the four response patterns are respectively:

P(R=1111L)\displaystyle P(R=1111\mid L) =11+Odds1110(L1110)+Odds1101(L1101)+Odds1100(L1100),\displaystyle=\frac{1}{1+\mathrm{Odds}^{1110}(L^{1110})+\mathrm{Odds}^{1101}(L^{1101})+\mathrm{Odds}^{1100}(L^{1100})},
P(R=1110L)\displaystyle P(R=1110\mid L) =Odds1110(L1110)1+Odds1110(L1110)+Odds1101(L1101)+Odds1100(L1100),\displaystyle=\frac{\mathrm{Odds}^{1110}(L^{1110})}{1+\mathrm{Odds}^{1110}(L^{1110})+\mathrm{Odds}^{1101}(L^{1101})+\mathrm{Odds}^{1100}(L^{1100})},
P(R=1101L)\displaystyle P(R=1101\mid L) =Odds1101(L1101)1+Odds1110(L1110)+Odds1101(L1101)+Odds1100(L1100),\displaystyle=\frac{\mathrm{Odds}^{1101}(L^{1101})}{1+\mathrm{Odds}^{1110}(L^{1110})+\mathrm{Odds}^{1101}(L^{1101})+\mathrm{Odds}^{1100}(L^{1100})},
P(R=1100L)\displaystyle P(R=1100\mid L) =Odds1100(L1100)1+Odds1110(L1110)+Odds1101(L1101)+Odds1100(L1100),\displaystyle=\frac{\mathrm{Odds}^{1100}(L^{1100})}{1+\mathrm{Odds}^{1110}(L^{1110})+\mathrm{Odds}^{1101}(L^{1101})+\mathrm{Odds}^{1100}(L^{1100})},

where Oddsr(Lr)=P(R=rLr)/P(R=1111Lr)\mathrm{Odds}^{r}(L^{r})=P(R=r\mid L^{r})/P(R=1111\mid L^{r}) is the true propensity odds, which depends only on the shared variables LrL^{r} under the CCMV condition, for r=1110,1101,1100r=1110,1101,1100.
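For reference, the following minimal sketch (in Python) implements this data-generating step: given the three propensity odds evaluated at each observation, it draws the response pattern from the implied multinomial probabilities. The function and argument names are illustrative.

import numpy as np

def draw_patterns(odds_1110, odds_1101, odds_1100, rng):
    # Sketch: draw each response pattern R from the multinomial distribution
    # implied by the propensity odds above (the leading 1 corresponds to 1111).
    ones = np.ones_like(odds_1110)
    weights = np.column_stack([ones, odds_1110, odds_1101, odds_1100])
    probs = weights / weights.sum(axis=1, keepdims=True)
    labels = np.array(["1111", "1110", "1101", "1100"])
    cum = np.cumsum(probs, axis=1)
    u = rng.random(len(ones))
    return labels[(u[:, None] > cum).sum(axis=1)]

In the simulations, the odds would be computed from the log-odds formulas of Settings 1 to 3 below, and the generator could be, for example, numpy.random.default_rng(0).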

Setting 1

The logarithms of true propensity odds are:

logOdds1110(L1110)=\displaystyle\log\mathrm{Odds}^{1110}(L^{1110})= X1+X2𝟣Y=10.5,\displaystyle X_{1}+X_{2}-\mathsf{1}_{Y=1}-0.5\ ,
logOdds1101(L1101)=\displaystyle\log\mathrm{Odds}^{1101}(L^{1101})= 12X1+X30.5𝟣Y=10.3,\displaystyle\frac{1}{2}X_{1}+X_{3}-0.5\mathsf{1}_{Y=1}-0.3\ ,
logOdds1100(L1100)=\displaystyle\log\mathrm{Odds}^{1100}(L^{1100})= 32X1𝟣Y=10.4.\displaystyle\frac{3}{2}X_{1}-\mathsf{1}_{Y=1}-0.4\ .

Setting 2

The logarithms of true propensity odds are:

logOdds1110(L1110)=\displaystyle\log\mathrm{Odds}^{1110}(L^{1110})= 15(X129)(X1+1.5)+15(X229)(X2+1)\displaystyle\frac{1}{5}(X_{1}^{2}-9)(X_{1}+1.5)+\frac{1}{5}(X_{2}^{2}-9)(X_{2}+1)
+110(X1+2)(X2+2)(X21)2𝟣Y=1+3,\displaystyle+\frac{1}{10}(X_{1}+2)(X_{2}+2)(X_{2}-1)-2\mathsf{1}_{Y=1}+3\ ,
logOdds1101(L1101)=\displaystyle\log\mathrm{Odds}^{1101}(L^{1101})= 15(X129)(X1+1)+15(X329)(X3+1.5)2𝟣Y=1,\displaystyle-\frac{1}{5}(X_{1}^{2}-9)(X_{1}+1)+\frac{1}{5}(X_{3}^{2}-9)(X_{3}+1.5)-2\mathsf{1}_{Y=1}\ ,
logOdds1100(L1100)=\displaystyle\log\mathrm{Odds}^{1100}(L^{1100})= 15(X1+2)(X1+0.5)(X14)2𝟣Y=11.\displaystyle-\frac{1}{5}(X_{1}+2)(X_{1}+0.5)(X_{1}-4)-2\mathsf{1}_{Y=1}-1\ .

Setting 3

The logarithms of true propensity odds are:

logOdds1110(L1110)=\displaystyle\log\mathrm{Odds}^{1110}(L^{1110})= 110(X129)(X124)X1+110(X229)(X2+1)\displaystyle\frac{1}{10}(X_{1}^{2}-9)(X_{1}^{2}-4)X_{1}+\frac{1}{10}(X_{2}^{2}-9)(X_{2}+1)
+14(X1+2)(X2+2)(X21)2𝟣Y=1,\displaystyle+\frac{1}{4}(X_{1}+2)(X_{2}+2)(X_{2}-1)-2\mathsf{1}_{Y=1}\ ,
logOdds1101(L1101)=\displaystyle\log\mathrm{Odds}^{1101}(L^{1101})= 110(X129)(X1+1)+110(X329)(X324)X3\displaystyle\frac{1}{10}(X_{1}^{2}-9)(X_{1}+1)+\frac{1}{10}(X_{3}^{2}-9)(X_{3}^{2}-4)X_{3}
+𝟣Y=1{(X1+1)(X324)2},\displaystyle+\mathsf{1}_{Y=1}\left\{(X_{1}+1)(X_{3}^{2}-4)-2\right\}\ ,
logOdds1100(L1100)=\displaystyle\log\mathrm{Odds}^{1100}(L^{1100})= 𝟣Y=0{15(X129)(X124)X11}\displaystyle\mathsf{1}_{Y=0}\left\{\frac{1}{5}(X_{1}^{2}-9)(X_{1}^{2}-4)X_{1}-1\right\}
𝟣Y=1{110(X129)(X122.52)(X1+0.5)}.\displaystyle-\mathsf{1}_{Y=1}\left\{\frac{1}{10}(X_{1}^{2}-9)(X_{1}^{2}-2.5^{2})(X_{1}+0.5)\right\}\ .

We assume that the number of basis functions KrK_{r} increases with the sample size NN. More precisely, one should predetermine a rule for choosing KrK_{r} depending on the sizes of the fully observed pattern and of missing pattern rr, denoted NrN_{r}, which also increase with NN. In the following proofs, we use “a large enough NN” to describe the requirements on the sample size, while the explicit conditions may involve KrK_{r} or NrN_{r}.

Appendix B Proof of Theorem 1

Proof sketch: To show that Dθ01Vθ0Dθ01D_{\theta_{0}}^{-1}V_{\theta_{0}}D_{\theta_{0}}^{-1^{\intercal}} is the efficiency bound, we closely follow the derivations of semiparametric efficiency bounds in [19], [3] and [9]. Briefly, the efficiency bound is defined as the supremum of the Cramer-Rao bounds over all regular parametric submodels. If an efficient estimator exists, its influence function is the efficient influence function ζ\zeta, which attains the minimum of 𝔼{ζζ}\mathbb{E}\{\zeta\zeta^{\intercal}\}, the asymptotic variance of such an estimator. By Theorem 2.2 in Newey [19], a regular and asymptotically linear estimator has an influence function satisfying (12). Theorem 3.1 in Newey [19] shows that the efficient influence function lies in the tangent space 𝒯\mathcal{T}, defined as the mean square closure of all qq-dimensional linear combinations of scores over all smooth parametric submodels.

Proof.

Consider an arbitrary parametric submodel for the likelihood function of (L,R)(L,R)

Lα(l,r)=sfα(ls,s)𝟣r=s\displaystyle L_{\alpha}(l,r)=\prod_{s\in\mathcal{R}}f_{\alpha}(l^{s},s)^{\mathsf{1}_{r=s}}

where fα0(ls,s)f_{\alpha_{0}}(l^{s},s) gives the true distribution of observed variables in pattern ss. The resulting score function is given by

Sα(l,r)=s𝟣r=sSα(ls,s)\displaystyle S_{\alpha}(l,r)=\sum_{s\in\mathcal{R}}\mathsf{1}_{r=s}S_{\alpha}(l^{s},s)

where Sα(ls,s)=logfα(ls,s)/αS_{\alpha}(l^{s},s)=\partial\log f_{\alpha}(l^{s},s)/\partial\alpha for ss\in\mathcal{R}. Recall that the parameter of interest θ0\theta_{0} is the solution to 𝔼{ψθ(L)}=0\mathbb{E}\{\psi_{\theta}(L)\}=0 and thus is a function of α\alpha. Pathwise differentiability follows if we can find an influence function ζ(L,R)\zeta(L,R) for all regular parametric submodels such that

θ0(α0)α=𝔼{ζ(L,R)Sα0(L,R)}.\displaystyle\frac{\partial\theta_{0}(\alpha_{0})}{\partial\alpha}=\mathbb{E}\{\zeta(L,R)S_{\alpha_{0}}(L,R)\}\ . (12)

To simplify notation, α\alpha also denotes the true parameter value α0\alpha_{0} when the context is clear. The chain rule and the Leibniz integral rule (differentiation under the integral sign) give

𝔼{ψθ(L)}α\displaystyle\frac{\partial\mathbb{E}\{\psi_{\theta}(L)\}}{\partial\alpha} =ψθ(l)fα(l)α𝑑l={ψθ(l)θθ(α)αfα(l)+ψθ(l)fα(l)α}𝑑l\displaystyle=\int\frac{\partial\psi_{\theta}(l)f_{\alpha}(l)}{\partial\alpha}dl=\int\left\{\frac{\partial\psi_{\theta}(l)}{\partial\theta}\frac{\partial\theta(\alpha)}{\partial\alpha}f_{\alpha}(l)+\psi_{\theta}(l)\frac{\partial f_{\alpha}(l)}{\partial\alpha}\right\}dl
=θ(α)αψθ(l)θfα(l)𝑑l+ψθ(l)logfα(l)αfα(l)𝑑l\displaystyle=\frac{\partial\theta(\alpha)}{\partial\alpha}\int\frac{\partial\psi_{\theta}(l)}{\partial\theta}f_{\alpha}(l)dl+\int\psi_{\theta}(l)\frac{\partial\log f_{\alpha}(l)}{\partial\alpha}f_{\alpha}(l)dl
=θ(α)α𝔼{ψθ(L)}θ+𝔼{ψθ(L)logfα(L)α}.\displaystyle=\frac{\partial\theta(\alpha)}{\partial\alpha}\frac{\partial\mathbb{E}\{\psi_{\theta}(L)\}}{\partial\theta}+\mathbb{E}\left\{\psi_{\theta}(L)\frac{\partial\log f_{\alpha}(L)}{\partial\alpha}\right\}\ .

Therefore,

θ0(α0)α=[𝔼{ψθ(L)}θ|θ0]1𝔼{ψθ0(L)logfα(L)α}.\displaystyle\frac{\partial\theta_{0}(\alpha_{0})}{\partial\alpha}=-\left[\left.\frac{\partial\mathbb{E}\{\psi_{\theta}(L)\}}{\partial\theta}\right|_{\theta_{0}}\right]^{-1}\mathbb{E}\left\{\psi_{\theta_{0}}(L)\frac{\partial\log f_{\alpha}(L)}{\partial\alpha}\right\}\ .

By the identification assumption, fα(lr¯lr,r)=fα(lr¯lr,1d)f_{\alpha}(l^{\overline{r}}\mid l^{r},r)=f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d}). Thus, the marginal density

fα(l)=fα(l,1d)+r1d,rfα(lr,r)fα(lr¯lr,1d).\displaystyle f_{\alpha}(l)=f_{\alpha}(l,1_{d})+\sum_{r\neq 1_{d},r\in\mathcal{R}}f_{\alpha}(l^{r},r)f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})\ .

Define Sα(lr¯lr,1d)=logfα(lr¯lr,1d)/αS_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})=\partial\log f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})/\partial\alpha where Sα(lr¯lr,1d)fα(lr¯lr,1d)𝑑lr¯=0\int S_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})dl^{\overline{r}}=0. Then,

𝔼{ψθ(L)logfα(L)α}\displaystyle\mathbb{E}\left\{\psi_{\theta}(L)\frac{\partial\log f_{\alpha}(L)}{\partial\alpha}\right\} =ψθ(l)fα(l,1d)α𝑑l\displaystyle=\int\psi_{\theta}(l)\frac{\partial f_{\alpha}(l,1_{d})}{\partial\alpha}dl
+r1d,rψθ(l)fα(lr,r)fα(lr¯lr,1d)α𝑑l.\displaystyle\quad\ +\sum_{r\neq 1_{d},r\in\mathcal{R}}\int\psi_{\theta}(l)\frac{\partial f_{\alpha}(l^{r},r)f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})}{\partial\alpha}dl\ .

The first term is

ψθ(l)fα(l,1d)α𝑑l=ψθ(l)Sα(l,1d)fα(l,1d)𝑑l=𝔼{𝟣R=1dψθ(L)Sα(L,1d)}.\displaystyle\int\psi_{\theta}(l)\frac{\partial f_{\alpha}(l,1_{d})}{\partial\alpha}dl=\int\psi_{\theta}(l)S_{\alpha}(l,1_{d})f_{\alpha}(l,1_{d})dl=\mathbb{E}\{\mathsf{1}_{R=1_{d}}\psi_{\theta}(L)S_{\alpha}(L,1_{d})\}\ .

For each r1d,rr\neq 1_{d},r\in\mathcal{R},

ψθ(l)fα(lr,r)fα(lr¯lr,1d)α𝑑l\displaystyle\quad\ \int\psi_{\theta}(l)\frac{\partial f_{\alpha}(l^{r},r)f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})}{\partial\alpha}dl
=ψθ(l){fα(lr,r)αfα(lr¯lr,1d)+fα(lr,r)fα(lr¯lr,1d)α}𝑑l\displaystyle=\int\psi_{\theta}(l)\left\{\frac{\partial f_{\alpha}(l^{r},r)}{\partial\alpha}f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})+f_{\alpha}(l^{r},r)\frac{\partial f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})}{\partial\alpha}\right\}dl
=ψθ(l)Sα(lr,r)fα(lr,r)fα(lr¯lr,1d)𝑑l\displaystyle=\int\psi_{\theta}(l)S_{\alpha}(l^{r},r)f_{\alpha}(l^{r},r)f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})dl
+ψθ(l)Sα(lr¯lr,1d)fα(lr,r)fα(lr¯lr,1d)𝑑l.\displaystyle\quad\ +\int\psi_{\theta}(l)S_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})f_{\alpha}(l^{r},r)f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})dl\ .

The first integral is

ψθ(l)Sα(lr,r)fα(lr,r)fα(lr¯lr,1d)𝑑l\displaystyle\quad\ \int\psi_{\theta}(l)S_{\alpha}(l^{r},r)f_{\alpha}(l^{r},r)f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})dl
={ψθ(l)fα(lr¯lr,1d)𝑑lr¯}Sα(lr,r)fα(lr,r)𝑑l\displaystyle=\int\left\{\int\psi_{\theta}(l)f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})dl^{\overline{r}}\right\}S_{\alpha}(l^{r},r)f_{\alpha}(l^{r},r)dl
=𝔼{ψθ(L)Lr=lr,R=1d}Sα(lr,r)fα(lr,r)𝑑l\displaystyle=\int\mathbb{E}\{\psi_{\theta}(L)\mid L^{r}=l^{r},R=1_{d}\}S_{\alpha}(l^{r},r)f_{\alpha}(l^{r},r)dl
=𝔼{𝟣R=ruθr(Lr)Sα(Lr,r)}.\displaystyle=\mathbb{E}\left\{\mathsf{1}_{R=r}u_{\theta}^{r}(L^{r})S_{\alpha}(L^{r},r)\right\}\ .

Recall that Oddsr(lr)=P(R=rlr)/P(R=1dlr)=p(lr,r)/p(lr,1d)\mathrm{Odds}^{r}(l^{r})=P(R=r\mid l^{r})/P(R=1_{d}\mid l^{r})=p(l^{r},r)/p(l^{r},1_{d}). The second integral is

ψθ(l)Sα(lr¯lr,1d)fα(lr,r)fα(lr¯lr,1d)𝑑l\displaystyle\quad\ \int\psi_{\theta}(l)S_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})f_{\alpha}(l^{r},r)f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})dl
=ψθ(l)Sα(lr¯lr,1d)fα(lr,r)fα(lr,1d)fα(lr,1d)fα(lr¯lr,1d)𝑑l\displaystyle=\int\psi_{\theta}(l)S_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})\frac{f_{\alpha}(l^{r},r)}{f_{\alpha}(l^{r},1_{d})}f_{\alpha}(l^{r},1_{d})f_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d})dl
=𝔼{𝟣R=1dψθ(L)Oddsr(Lr)Sα(Lr¯Lr,1d)}\displaystyle=\mathbb{E}\left\{\mathsf{1}_{R=1_{d}}\psi_{\theta}(L)\mathrm{Odds}^{r}(L^{r})S_{\alpha}(L^{\overline{r}}\mid L^{r},1_{d})\right\}
=𝔼[𝟣R=1d{ψθ(L)uθ(Lr)}Oddsr(Lr)Sα(Lr¯Lr,1d)]\displaystyle=\mathbb{E}\left[\mathsf{1}_{R=1_{d}}\left\{\psi_{\theta}(L)-u_{\theta}(L^{r})\right\}\mathrm{Odds}^{r}(L^{r})S_{\alpha}(L^{\overline{r}}\mid L^{r},1_{d})\right]
=𝔼[𝟣R=1d{ψθ(L)uθ(Lr)}Oddsr(Lr)Sα(L,1d)]\displaystyle=\mathbb{E}\left[\mathsf{1}_{R=1_{d}}\left\{\psi_{\theta}(L)-u_{\theta}(L^{r})\right\}\mathrm{Odds}^{r}(L^{r})S_{\alpha}(L,1_{d})\right]

since

𝔼{𝟣R=1duθ(Lr)Oddsr(Lr)Sα(Lr¯Lr,1d)}\displaystyle\quad\ \mathbb{E}\left\{\mathsf{1}_{R=1_{d}}u_{\theta}(L^{r})\mathrm{Odds}^{r}(L^{r})S_{\alpha}(L^{\overline{r}}\mid L^{r},1_{d})\right\}
=𝔼[𝟣R=1duθ(Lr)Oddsr(Lr)𝔼{Sα(Lr¯Lr,1d)Lr,R=1d}]=0,\displaystyle=\mathbb{E}\left[\mathsf{1}_{R=1_{d}}u_{\theta}(L^{r})\mathrm{Odds}^{r}(L^{r})\mathbb{E}\left\{S_{\alpha}(L^{\overline{r}}\mid L^{r},1_{d})\mid L^{r},R=1_{d}\right\}\right]=0\ ,

and

𝔼[𝔼{ψθ(L)Lr,R=1d}𝟣R=1dOddsr(Lr)Sα(Lr,1d)]\displaystyle\quad\ \mathbb{E}\left[\mathbb{E}\left\{\psi_{\theta}(L)\mid L^{r},R=1_{d}\right\}\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r})S_{\alpha}(L^{r},1_{d})\right]
=𝔼{𝟣R=1duθ(Lr)Oddsr(Lr)Sα(Lr,1d)}\displaystyle=\mathbb{E}\left\{\mathsf{1}_{R=1_{d}}u_{\theta}(L^{r})\mathrm{Odds}^{r}(L^{r})S_{\alpha}(L^{r},1_{d})\right\}

where Sα(lr,1d)=logfα(lr,1d)/αS_{\alpha}(l^{r},1_{d})=\partial\log f_{\alpha}(l^{r},1_{d})/\partial\alpha and Sα(l,1d)=Sα(lr,1d)+Sα(lr¯lr,1d)S_{\alpha}(l,1_{d})=S_{\alpha}(l^{r},1_{d})+S_{\alpha}(l^{\overline{r}}\mid l^{r},1_{d}). Therefore,

𝔼{ψθ(L)logfα(L)α}\displaystyle\mathbb{E}\left\{\psi_{\theta}(L)\frac{\partial\log f_{\alpha}(L)}{\partial\alpha}\right\} =𝔼{𝟣R=1dψθ(L)Sα(L,1d)}+r1d,r𝔼{𝟣R=ruθr(Lr)Sα(Lr,r)}\displaystyle=\mathbb{E}\{\mathsf{1}_{R=1_{d}}\psi_{\theta}(L)S_{\alpha}(L,1_{d})\}+\sum_{r\neq 1_{d},r\in\mathcal{R}}\mathbb{E}\left\{\mathsf{1}_{R=r}u_{\theta}^{r}(L^{r})S_{\alpha}(L^{r},r)\right\}
+r1d,r𝔼[𝟣R=1d{ψθ(L)uθ(Lr)}Oddsr(Lr)Sα(L,1d)].\displaystyle\quad\ +\sum_{r\neq 1_{d},r\in\mathcal{R}}\mathbb{E}\left[\mathsf{1}_{R=1_{d}}\left\{\psi_{\theta}(L)-u_{\theta}(L^{r})\right\}\mathrm{Odds}^{r}(L^{r})S_{\alpha}(L,1_{d})\right]\ .

Notice that 𝔼{ψθ0(L)}=0\mathbb{E}\{\psi_{\theta_{0}}(L)\}=0. Recall that we denote the derivative 𝔼{ψθ(L)}/θ\partial\mathbb{E}\{\psi_{\theta}(L)\}/\partial\theta at θ0\theta_{0} as Dθ0D_{\theta_{0}}. Let the influence function ζ(L,R)=Dθ01Fθ0(L,R)\zeta(L,R)=-D_{\theta_{0}}^{-1}F_{\theta_{0}}(L,R) where

Fθ(L,R)=𝟣R=1drOddsr(Lr){ψθ(L)uθr(Lr)}+r𝟣R=ruθr(Lr)𝔼{ψθ(L)}.\displaystyle F_{\theta}(L,R)=\mathsf{1}_{R=1_{d}}\sum_{r\in\mathcal{R}}\mathrm{Odds}^{r}(L^{r})\{\psi_{\theta}(L)-u_{\theta}^{r}(L^{r})\}+\sum_{r\in\mathcal{R}}\mathsf{1}_{R=r}u_{\theta}^{r}(L^{r})-\mathbb{E}\{\psi_{\theta}(L)\}\ .
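Although it is not needed for (12) itself, it is worth recording that F_{θ}(L,R) has mean zero. For each r ∈ ℛ, conditioning on L^{r} and using the CCMV condition as above gives

\displaystyle\mathbb{E}\left[\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r})\{\psi_{\theta}(L)-u_{\theta}^{r}(L^{r})\}\right]=\mathbb{E}\left[P(R=1_{d}\mid L^{r})\mathrm{Odds}^{r}(L^{r})\mathbb{E}\{\psi_{\theta}(L)-u_{\theta}^{r}(L^{r})\mid L^{r},R=1_{d}\}\right]=0

and

\displaystyle\mathbb{E}\{\mathsf{1}_{R=r}u_{\theta}^{r}(L^{r})\}=\mathbb{E}\left[\mathsf{1}_{R=r}\mathbb{E}\{\psi_{\theta}(L)\mid L^{r},R=r\}\right]=\mathbb{E}\{\mathsf{1}_{R=r}\psi_{\theta}(L)\}\ ,

so that \mathbb{E}\{F_{\theta}(L,R)\}=\sum_{r\in\mathcal{R}}\mathbb{E}\{\mathsf{1}_{R=r}\psi_{\theta}(L)\}-\mathbb{E}\{\psi_{\theta}(L)\}=0.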

Thus, equation (12) is satisfied, and θ(α)\theta(\alpha) is pathwise differentiable. To calculate the efficiency bound, we need to find the efficient influence function, which is the projection of ζ(L,R)\zeta(L,R) on the tangent set. The tangent set 𝒯\mathcal{T} is defined as the mean square closure of all qq-dimensional linear combinations of scores SαS_{\alpha} for smooth parametric submodels,

𝒯={h(L,R)q:𝔼{h2}<,AjSαj with limj𝔼{hAjSαj}=0}\displaystyle\mathcal{T}=\left\{h(L,R)\in\mathbb{R}^{q}:\mathbb{E}\{\|h\|^{2}\}<\infty,\exists A_{j}S_{\alpha j}\textrm{ with }\lim_{j\to\infty}\mathbb{E}\{\|h-A_{j}S_{\alpha j}\|\}=0\right\}

where AjA_{j} is a constant matrix with qq rows. This characterization can be verified by arguments similar to those in [19]. Note that each score SαjS_{\alpha j} admits the same form of decomposition, namely a sum of partial scores over the missing patterns rr. It is also easy to see that 𝒯\mathcal{T} is linear and that ζ\zeta belongs to the tangent set, with 𝟣R=ruθ0r(Lr)\mathsf{1}_{R=r}u_{\theta_{0}}^{r}(L^{r}) playing the role of 𝟣R=rSα(lr,r)\mathsf{1}_{R=r}S_{\alpha}(l^{r},r) for r1d,rr\neq 1_{d},r\in\mathcal{R} and the remaining part playing the role of 𝟣R=1dSα(l,1d)\mathsf{1}_{R=1_{d}}S_{\alpha}(l,1_{d}). Therefore, all the conditions of Theorem 3.1 in [19] hold, so the efficiency bound for regular estimators of the parameter θ\theta is given by Dθ01Vθ0Dθ01D_{\theta_{0}}^{-1}V_{\theta_{0}}D_{\theta_{0}}^{-1^{\intercal}} where Vθ0=𝔼{Fθ0(L,R)Fθ0(L,R)}V_{\theta_{0}}=\mathbb{E}\{F_{\theta_{0}}(L,R)F_{\theta_{0}}(L,R)^{\intercal}\}. ∎

Appendix C Proof of Theorem 2

Proof sketch: Assumption 2.D guarantees that the true propensity odds can be closely approximated within the basis expansion. We show that our estimator is close to this approximation. With the help of a few inequalities, the distance between the two odds functions is bounded by a constant multiple of the distance between their coefficient vectors, so the key step is to show that the distance between the two coefficient vectors converges at a certain rate. In Lemma 3, this problem is converted to the study of a quadratic form with random coefficients. The quadratic coefficients form a symmetric random matrix; by the Weyl inequality, this matrix can be related to the magnitude of the basis functions, and the matrix Bernstein inequality then provides a bound on its spectral norm, i.e., the largest eigenvalue. Similarly, the linear coefficients are bounded. Lemmas 4 and 5 provide the bounds for the quadratic and linear coefficients, respectively.

Proof.

By the triangle inequality and Assumption 2.D,

suplrdomr|Oddsr(lr;α^r)Oddsr(lr)|\displaystyle\quad\ \underset{l^{r}\in\mathrm{dom}^{r}}{\sup}\left|\mathrm{Odds}^{r}(l^{r};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(l^{r})\right|
suplrdomr|Oddsr(lr;α^r)Oddsr(lr;αKr)|+suplrdomr|Oddsr(lr;αKr)Oddsr(lr)|\displaystyle\leq\underset{l^{r}\in\mathrm{dom}^{r}}{\sup}\left|\mathrm{Odds}^{r}(l^{r};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(l^{r};\alpha_{K_{r}}^{*})\right|+\underset{l^{r}\in\mathrm{dom}^{r}}{\sup}\left|\mathrm{Odds}^{r}(l^{r};\alpha_{K_{r}}^{*})-\mathrm{Odds}^{r}(l^{r})\right|
suplrdomr|exp{Φr(lr)α^r}exp{Φr(lr)αKr}|+C1Krμ1.\displaystyle\leq\underset{l^{r}\in\mathrm{dom}^{r}}{\sup}\left|\exp\left\{\Phi^{r}(l^{r})^{\intercal}\hat{\alpha}^{r}\right\}-\exp\left\{\Phi^{r}(l^{r})^{\intercal}\alpha_{K_{r}}^{*}\right\}\right|+C_{1}K_{r}^{-\mu_{1}}\ .

Since the exponential function is locally Lipschitz continuous, |exey|=ey|exy1|2ey|xy||e^{x}-e^{y}|=e^{y}|e^{x-y}-1|\leq 2e^{y}|x-y| if |xy|ln2|x-y|\leq\ln 2. By the triangle inequality, Assumption 2.A, and Assumption 2.D,

suplrdomrOddsr(lr;αKr)\displaystyle\underset{l^{r}\in\mathrm{dom}^{r}}{\sup}\mathrm{Odds}^{r}(l^{r};\alpha_{K_{r}}^{*}) suplrdomrOddsr(lr)+suplrdomr|Oddsr(lr;αKr)Oddsr(lr)|\displaystyle\leq\underset{l^{r}\in\mathrm{dom}^{r}}{\sup}\mathrm{Odds}^{r}(l^{r})+\underset{l^{r}\in\mathrm{dom}^{r}}{\sup}\left|\mathrm{Odds}^{r}(l^{r};\alpha_{K_{r}}^{*})-\mathrm{Odds}^{r}(l^{r})\right|
C0+C1Krμ1.\displaystyle\leq C_{0}+C_{1}K_{r}^{-\mu_{1}}\ .

Thus, there exists large enough NN^{*} such that suplrdomrOddsr(lr;αKr)2C0\sup_{l^{r}\in\mathrm{dom}^{r}}\mathrm{Odds}^{r}(l^{r};\alpha_{K_{r}}^{*})\leq 2C_{0} for all NNN\geq N^{*}. Therefore,

|exp{Φr(lr)α^r}exp{Φr(lr)αKr}|4C0|Φr(lr)α^rΦr(lr)αKr|\displaystyle|\exp\left\{\Phi^{r}(l^{r})^{\intercal}\hat{\alpha}^{r}\right\}-\exp\{\Phi^{r}(l^{r})^{\intercal}\alpha_{K_{r}}^{*}\}|\leq 4C_{0}|\Phi^{r}(l^{r})^{\intercal}\hat{\alpha}^{r}-\Phi^{r}(l^{r})^{\intercal}\alpha_{K_{r}}^{*}| (13)

if |Φr(lr)α^rΦr(lr)αKr|ln2|\Phi^{r}(l^{r})^{\intercal}\hat{\alpha}^{r}-\Phi^{r}(l^{r})^{\intercal}\alpha_{K_{r}}^{*}|\leq\ln 2. By the Cauchy inequality and Assumption 2.E, |Φr(lr)α^rΦr(lr)αKr|Kr1/2α^rαKr2|\Phi^{r}(l^{r})^{\intercal}\hat{\alpha}^{r}-\Phi^{r}(l^{r})^{\intercal}\alpha_{K_{r}}^{*}|\leq K_{r}^{1/2}\|\hat{\alpha}^{r}-\alpha_{K_{r}}^{*}\|_{2} for any lrdomrl^{r}\in\mathrm{dom}^{r}. By Lemma 3, α^rαKr2=Op(Kr/Nr+Krμ1)\|\hat{\alpha}^{r}-\alpha_{K_{r}}^{*}\|_{2}=O_{p}(\sqrt{K_{r}/N_{r}}+K_{r}^{-\mu_{1}}). More precisely, for any ϵ>0\epsilon>0, there exists a finite Mϵ>0M_{\epsilon}>0 and Nϵ>0N_{\epsilon}>0 such that

P{α^rαKr2>Mϵ(KrNr+Krμ1)}<ϵ\displaystyle P\left\{\|\hat{\alpha}^{r}-\alpha_{K_{r}}^{*}\|_{2}>M_{\epsilon}\left(\sqrt{\frac{K_{r}}{N_{r}}}+K_{r}^{-\mu_{1}}\right)\right\}<\epsilon

for any NNϵN\geq N_{\epsilon}. Considering the complementary event, we can find NϵN_{\epsilon}^{*} large enough such that Mϵ(Kr/Nr+Kr1/2μ1)<ln2M_{\epsilon}(K_{r}/\sqrt{N_{r}}+K_{r}^{1/2-\mu_{1}})<\ln 2 which makes the inequality (13) hold for any NNϵN\geq N_{\epsilon}^{*}. Then,

P{suplrdomr|Oddsr(lr;α^r)Oddsr(lr)|4C0Mϵ(KrNr+Kr12μ1)+C1Krμ1}1ϵ\displaystyle P\left\{\underset{l^{r}\in\mathrm{dom}^{r}}{\sup}\left|\mathrm{Odds}^{r}(l^{r};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(l^{r})\right|\leq 4C_{0}M_{\epsilon}\left(\frac{K_{r}}{\sqrt{N_{r}}}+K_{r}^{\frac{1}{2}-\mu_{1}}\right)+C_{1}K_{r}^{-\mu_{1}}\right\}\geq 1-\epsilon

for all Nmax{N,Nϵ,Nϵ}N\geq\max\{N^{*},N_{\epsilon},N_{\epsilon}^{*}\}. In other words,

suplrdomr|Oddsr(lr;α^r)Oddsr(lr)|=Op(KrNr+Kr12μ1).\displaystyle\underset{l^{r}\in\mathrm{dom}^{r}}{\sup}|\mathrm{Odds}^{r}(l^{r};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(l^{r})|=O_{p}\left(\frac{K_{r}}{\sqrt{N_{r}}}+K_{r}^{\frac{1}{2}-\mu_{1}}\right)\ .

Now, we consider the L2(P)L_{2}(P) norm.

Oddsr(Lr;α^r)Oddsr(Lr)P,2\displaystyle\quad\ \|\mathrm{Odds}^{r}(L^{r};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(L^{r})\|_{P,2}
Oddsr(Lr;α^r)Oddsr(Lr;αKr)P,2+Oddsr(Lr;αKr)Oddsr(Lr)P,2\displaystyle\leq\|\mathrm{Odds}^{r}(L^{r};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(L^{r};\alpha_{K_{r}}^{*})\|_{P,2}+\|\mathrm{Odds}^{r}(L^{r};\alpha_{K_{r}}^{*})-\mathrm{Odds}^{r}(L^{r})\|_{P,2}
Oddsr(Lr;α^r)Oddsr(Lr;αKr)P,2+suplrdomr|Oddsr(lr;αKr)Oddsr(lr)|.\displaystyle\leq\|\mathrm{Odds}^{r}(L^{r};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(L^{r};\alpha_{K_{r}}^{*})\|_{P,2}+\underset{l^{r}\in\mathrm{dom}^{r}}{\sup}\left|\mathrm{Odds}^{r}(l^{r};\alpha_{K_{r}}^{*})-\mathrm{Odds}^{r}(l^{r})\right|\ .

Following similar arguments, when |Φr(lr)α^rΦr(lr)αKr|ln2|\Phi^{r}(l^{r})^{\intercal}\hat{\alpha}^{r}-\Phi^{r}(l^{r})^{\intercal}\alpha_{K_{r}}^{*}|\leq\ln 2, we have

Oddsr(Lr;α^r)Oddsr(Lr;αKr)P,22\displaystyle\quad\ \|\mathrm{Odds}^{r}(L^{r};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(L^{r};\alpha_{K_{r}}^{*})\|_{P,2}^{2}
16C02{Φr(Lr)α^rΦr(Lr)αKr}2𝑑P(L)\displaystyle\leq 16C_{0}^{2}\int\left\{\Phi^{r}(L^{r})^{\intercal}\hat{\alpha}^{r}-\Phi^{r}(L^{r})^{\intercal}\alpha_{K_{r}}^{*}\right\}^{2}dP(L)
16C02(α^rαKr)Φr(Lr)Φr(Lr)(α^rαKr)𝑑P(L)\displaystyle\leq 16C_{0}^{2}\int(\hat{\alpha}^{r}-\alpha_{K_{r}}^{*})^{\intercal}\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}(\hat{\alpha}^{r}-\alpha_{K_{r}}^{*})dP(L)
16C02suplrdomrλmax{Φr(lr)Φr(lr)}(α^rαKr)(α^rαKr)𝑑P(L)\displaystyle\leq 16C_{0}^{2}\underset{l^{r}\in\mathrm{dom}^{r}}{\sup}\lambda_{\max}\{\Phi^{r}(l^{r})\Phi^{r}(l^{r})^{\intercal}\}\int(\hat{\alpha}^{r}-\alpha_{K_{r}}^{*})^{\intercal}(\hat{\alpha}^{r}-\alpha_{K_{r}}^{*})dP(L)
16C02λmaxα^rαKr22.\displaystyle\leq 16C_{0}^{2}\lambda_{\max}^{*}\|\hat{\alpha}^{r}-\alpha_{K_{r}}^{*}\|_{2}^{2}\ .

Thus, Oddsr(Lr;α^r)Oddsr(Lr;αKr)P,2=Op(Kr/Nr+Krμ1)\|\mathrm{Odds}^{r}(L^{r};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(L^{r};\alpha_{K_{r}}^{*})\|_{P,2}=O_{p}(\sqrt{K_{r}/N_{r}}+K_{r}^{-\mu_{1}}). Therefore,

Oddsr(Lr;α^r)Oddsr(Lr)P,2=Op(KrNr+Krμ1).\displaystyle\|\mathrm{Odds}^{r}(L^{r};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(L^{r})\|_{P,2}=O_{p}\left(\sqrt{\frac{K_{r}}{N_{r}}}+K_{r}^{-\mu_{1}}\right)\ .

Appendix D Proof of Theorem 3

Proof sketch: We decompose the error into four terms and show that the first three converge to 0 at a rate faster than 1/N1/\sqrt{N}, while the last term contributes the influence function. Since the components of the decomposition involve the estimator, they must be treated as random functions, so we consider uniform convergence over a class of functions and apply the theory in Van der Vaart [29]. Via the maximal inequality with bracketing, the problem is reduced to controlling the entropy integral, which requires calculating bracketing numbers. Lemmas 12-15 are bracketing inequalities that may be of independent interest.

Proof.

First, recall that ^Nψθ𝔼{ψθ(L)}\hat{\mathbb{P}}_{N}\psi_{\theta}-\mathbb{E}\{\psi_{\theta}(L)\} has the following decomposition

^Nψθ𝔼{ψθ(L)}=r[1Ni=1N𝟣Ri=1dOddsr(Lir;α^r)ψθ(Li)𝔼{𝟣R=rψθ(L)}].\displaystyle\hat{\mathbb{P}}_{N}\psi_{\theta}-\mathbb{E}\{\psi_{\theta}(L)\}=\sum_{r\in\mathcal{R}}\left[\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})\psi_{\theta}(L_{i})-\mathbb{E}\{\mathsf{1}_{R=r}\psi_{\theta}(L)\}\right]\ .

For each missing pattern rr, denote 1/Ni=1N𝟣Ri=1dOddsr(Lir;α^r)ψθ(Li)1/N\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})\psi_{\theta}(L_{i}) as ^Nrψθ\hat{\mathbb{P}}_{N}^{r}\psi_{\theta}. Then, ^Nrψθ𝔼{𝟣R=rψθ(L)}\hat{\mathbb{P}}_{N}^{r}\psi_{\theta}-\mathbb{E}\{\mathsf{1}_{R=r}\psi_{\theta}(L)\} can be decomposed into 4 parts:

Sθ,1r\displaystyle S_{\theta,1}^{r} =1Ni=1N𝟣Ri=1d{Oddsr(Lir;α^r)Oddsr(Lir)}{ψθ(Li)uθr(Lir)},\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\left\{\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(L^{r}_{i})\right\}\left\{\psi_{\theta}(L_{i})-u^{r}_{\theta}(L^{r}_{i})\right\}\ ,
Sθ,2r\displaystyle S_{\theta,2}^{r} =1Ni=1N{𝟣Ri=1dOddsr(Lir;α^r)𝟣Ri=r}{uθr(Lir)Φr(Lir)βθr},\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})-\mathsf{1}_{R_{i}=r}\right\}\left\{u^{r}_{\theta}(L^{r}_{i})-\Phi^{r}(L^{r}_{i})^{\intercal}\beta_{\theta}^{r}\right\}\ ,
Sθ,3r\displaystyle S_{\theta,3}^{r} =1Ni=1N{𝟣Ri=1dOddsr(Lir;α^r)𝟣Ri=r}Φr(Lir)βθr,\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})-\mathsf{1}_{R_{i}=r}\right\}\Phi^{r}(L^{r}_{i})^{\intercal}\beta_{\theta}^{r}\ ,
Sθ,4r\displaystyle S_{\theta,4}^{r} =1Ni=1N𝟣Ri=1dOddsr(Lir){ψθ(Li)uθr(Lir)}\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i})\left\{\psi_{\theta}(L_{i})-u^{r}_{\theta}(L^{r}_{i})\right\}
+1Ni=1N𝟣Ri=ruθr(Lir)𝔼{𝟣R=rψθ(L)}.\displaystyle\quad\ +\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=r}u^{r}_{\theta}(L^{r}_{i})-\mathbb{E}\{\mathsf{1}_{R=r}\psi_{\theta}(L)\}\ .
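These four terms indeed sum to the quantity being decomposed: S_{θ,2}^{r} and S_{θ,3}^{r} combine into

\displaystyle S_{\theta,2}^{r}+S_{\theta,3}^{r}=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})-\mathsf{1}_{R_{i}=r}\right\}u^{r}_{\theta}(L^{r}_{i})\ ,

and adding S_{θ,1}^{r} and S_{θ,4}^{r} cancels the u_{θ}^{r} terms, leaving

\displaystyle S_{\theta,1}^{r}+S_{\theta,2}^{r}+S_{\theta,3}^{r}+S_{\theta,4}^{r}=\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})\psi_{\theta}(L_{i})-\mathbb{E}\{\mathsf{1}_{R=r}\psi_{\theta}(L)\}=\hat{\mathbb{P}}_{N}^{r}\psi_{\theta}-\mathbb{E}\{\mathsf{1}_{R=r}\psi_{\theta}(L)\}\ .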

For any fixed θΘ\theta\in\Theta and any missing pattern rr, by Lemmas 6, 7, and 8, N|Sθ,ir|=op(1),i=1,2,3\sqrt{N}|S_{\theta,i}^{r}|=o_{p}(1),i=1,2,3. It’s easy to see that 𝔼(Sθ,4r)=0\mathbb{E}(S_{\theta,4}^{r})=0. Therefore, by the central limit theorem,

N[^Nψθ𝔼{ψθ(L)}]𝒩(0,Vθ)\displaystyle\sqrt{N}\left[\hat{\mathbb{P}}_{N}\psi_{\theta}-\mathbb{E}\{\psi_{\theta}(L)\}\right]\to\mathcal{N}(0,V_{\theta})

where Vθ=𝔼{Fθ(L,R)Fθ(L,R)}V_{\theta}=\mathbb{E}\{F_{\theta}(L,R)F_{\theta}(L,R)^{\intercal}\} and

Fθ(L,R)=𝟣R=1drOddsr(Lr){ψθ(L)uθr(Lr)}+r𝟣R=ruθr(Lr)𝔼{ψθ(L)}.\displaystyle F_{\theta}(L,R)=\mathsf{1}_{R=1_{d}}\sum_{r\in\mathcal{R}}\mathrm{Odds}^{r}(L^{r})\{\psi_{\theta}(L)-u_{\theta}^{r}(L^{r})\}+\sum_{r\in\mathcal{R}}\mathsf{1}_{R=r}u_{\theta}^{r}(L^{r})-\mathbb{E}\{\psi_{\theta}(L)\}\ .

Appendix E Proof of Theorem 4

Proof sketch: First, by Assumption 4.A, the convergence of θ^N\hat{\theta}_{N} should be implied by the uniform convergence of ^Nψθ\hat{\mathbb{P}}_{N}\psi_{\theta} over θ\theta in a compact set. Second, we study the convergence of 𝔼{ψθ^N(L)}\mathbb{E}\{\psi_{\hat{\theta}_{N}}(L)\} and apply the Delta method to obtain the limiting distribution of θ^N\hat{\theta}_{N}. The functional version of the central limit theorem, i.e. Donsker’s theorem, is applied to achieve uniform convergence.

Proof.

Denote the empirical average N1i=1Nψθ(Li)N^{-1}\sum_{i=1}^{N}\psi_{\theta}(L_{i}) as Nψθ\mathbb{P}_{N}\psi_{\theta} and the centered and scaled version N[Nψθ𝔼{ψθ(L)}]\sqrt{N}[\mathbb{P}_{N}\psi_{\theta}-\mathbb{E}\{\psi_{\theta}(L)\}] as 𝔾Nψθ\mathbb{G}_{N}\psi_{\theta}. Recall the proposed weighted average is

^Nψθ=1Ni=1N{𝟣Ri=1dw^(Li)ψθ(Li)}.\displaystyle\hat{\mathbb{P}}_{N}\psi_{\theta}=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}\hat{w}(L_{i})\psi_{\theta}(L_{i})\right\}\ .

Since θ^N\hat{\theta}_{N} is the solution to ^Nψθ=0\hat{\mathbb{P}}_{N}\psi_{\theta}=0, by Lemma 9,

𝔼{ψθ^N(L)}=𝔼{ψθ^N(L)}^Nψθ^NsupθΘ|^Nψθ𝔼{ψθ(L)}|=op(1).\displaystyle\mathbb{E}\{\psi_{\hat{\theta}_{N}}(L)\}=\mathbb{E}\{\psi_{\hat{\theta}_{N}}(L)\}-\hat{\mathbb{P}}_{N}\psi_{\hat{\theta}_{N}}\leq\underset{\theta\in\Theta}{\sup}\left|\hat{\mathbb{P}}_{N}\psi_{\theta}-\mathbb{E}\{\psi_{\theta}(L)\}\right|=o_{p}(1)\ .

By identifiability condition Assumption 4.A, θ^Nθ02𝑃0\|\hat{\theta}_{N}-\theta_{0}\|_{2}\xrightarrow{P}0.

Next, we investigate the asymptotic normality of θ^N\hat{\theta}_{N}. Although 𝔼{ψθ^N(L)}\mathbb{E}\{\psi_{\hat{\theta}_{N}}(L)\} has the form of a population expectation, it should be viewed as a random vector because θ^N\hat{\theta}_{N} depends on the observations. Since θ^N𝑃θ0\hat{\theta}_{N}\xrightarrow{P}\theta_{0}, one would expect 𝔼{ψθ^N(L)}\mathbb{E}\{\psi_{\hat{\theta}_{N}}(L)\} to converge to 𝔼{ψθ0(L)}\mathbb{E}\{\psi_{\theta_{0}}(L)\} in some sense. If the limiting distribution is known, one can apply the Delta method to obtain the limiting distribution of θ^N\hat{\theta}_{N}. From Theorem 3, we have the asymptotic normality of ^Nψθ\hat{\mathbb{P}}_{N}\psi_{\theta} for any fixed θΘ\theta\in\Theta. It is natural to consider

[^Nψθ^N𝔼{ψθ^N(L)}][^Nψθ0𝔼{ψθ0(L)}]\displaystyle\left[\hat{\mathbb{P}}_{N}\psi_{\hat{\theta}_{N}}-\mathbb{E}\{\psi_{\hat{\theta}_{N}}(L)\}\right]-\left[\hat{\mathbb{P}}_{N}\psi_{\theta_{0}}-\mathbb{E}\{\psi_{\theta_{0}}(L)\}\right] (14)

The above difference has the form appearing in asymptotic equicontinuity, which can be established when the function class is Donsker. More precisely, consider the class of jj-th entries of the estimating functions, Ψj:={ψθ,j:θΘ}\Psi_{j}:=\{\psi_{\theta,j}:\theta\in\Theta\}. It is Donsker by Theorem 19.5 in [29] and

J[]{1,Ψj,L2(P)}\displaystyle J_{[\ ]}\{1,\Psi_{j},L_{2}(P)\} =01logn[]{ϵ,Ψj,L2(P)}𝑑ϵ\displaystyle=\int_{0}^{1}\sqrt{\log n_{[\ ]}\{\epsilon,\Psi_{j},L_{2}(P)\}}d\epsilon
01logn[]{ϵ,,L2(P)}𝑑ϵ\displaystyle\leq\int_{0}^{1}\sqrt{\log n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}}d\epsilon
01Cϵ12d𝑑ϵ=2d2d1C<.\displaystyle\leq\int_{0}^{1}\sqrt{C_{\mathcal{H}}}\epsilon^{-\frac{1}{2d_{\mathcal{H}}}}d\epsilon=\frac{2d_{\mathcal{H}}}{2d_{\mathcal{H}}-1}\sqrt{C_{\mathcal{H}}}<\infty\ .

Then, by Section 2.1.2 in [30], we have the following asymptotic equicontinuity: for any ϵ,η>0\epsilon,\eta>0, there exists Cϵ,η>0C_{\epsilon,\eta}>0 and Nϵ,ηN_{\epsilon,\eta} such that for all NNϵ,ηN\geq N_{\epsilon,\eta},

P(supψθ,j:ρP(ψθ,jψθ0,j)<Cϵ,η|𝔾Nψθ,j𝔾Nψθ0,j|ϵ)η2\displaystyle P\left(\underset{\psi_{\theta,j}:\rho_{P}(\psi_{\theta,j}-\psi_{\theta_{0},j})<C_{\epsilon,\eta}}{\sup}\left|\mathbb{G}_{N}\psi_{\theta,j}-\mathbb{G}_{N}\psi_{\theta_{0},j}\right|\geq\epsilon\right)\leq\frac{\eta}{2}

where the seminorm ρP\rho_{P} is defined as ρP(f)={P(fPf)2}1/2\rho_{P}(f)=\{P(f-Pf)^{2}\}^{1/2}. Consider

𝔾Nψθ^N,j𝔾Nψθ0,j\displaystyle\quad\ \mathbb{G}_{N}\psi_{\hat{\theta}_{N},j}-\mathbb{G}_{N}\psi_{\theta_{0},j}
=N[Nψθ^N,j𝔼{ψθ^N,j(L)}]N[Nψθ0,j𝔼{ψθ0,j(L)}].\displaystyle=\sqrt{N}\left[\mathbb{P}_{N}\psi_{\hat{\theta}_{N},j}-\mathbb{E}\{\psi_{\hat{\theta}_{N},j}(L)\}\right]-\sqrt{N}\left[\mathbb{P}_{N}\psi_{\theta_{0},j}-\mathbb{E}\{\psi_{\theta_{0},j}(L)\}\right]\ .

Notice that ρP(f)fP,2\rho_{P}(f)\leq\|f\|_{P,2}. By Assumption 4.B, for any δ>0\delta>0, there exists an envelope function fδ,jf_{\delta,j} such that

P(θ^Nθ02<δ)P(ψθ^N,jψθ0,jP,2<Cδ)P{ρP(ψθ^N,jψθ0,j)<Cδ}\displaystyle P\left(\|\hat{\theta}_{N}-\theta_{0}\|_{2}<\delta\right)\leq P\left(\|\psi_{\hat{\theta}_{N},j}-\psi_{\theta_{0},j}\|_{P,2}<C_{\delta}\right)\leq P\left\{\rho_{P}(\psi_{\hat{\theta}_{N},j}-\psi_{\theta_{0},j})<C_{\delta}\right\}

where Cδ=fδ,jP,20C_{\delta}=\|f_{\delta,j}\|_{P,2}\to 0 when δ0\delta\to 0. Thus, there exists δϵ,η\delta_{\epsilon,\eta} small enough such that CδCϵ,ηC_{\delta}\leq C_{\epsilon,\eta} for all δδϵ,η\delta\leq\delta_{\epsilon,\eta}. Then, by the consistency of θ^N\hat{\theta}_{N}, there exists Nϵ,ηN_{\epsilon,\eta}^{*} such that for all NNϵ,ηN\geq N_{\epsilon,\eta}^{*},

P(θ^Nθ02δϵ,η)η2.\displaystyle P\left(\|\hat{\theta}_{N}-\theta_{0}\|_{2}\geq\delta_{\epsilon,\eta}\right)\leq\frac{\eta}{2}\ .

Thus,

P{ρP(ψθ^N,jψθ0,j)<Cϵ,η}>1η2.\displaystyle P\left\{\rho_{P}(\psi_{\hat{\theta}_{N},j}-\psi_{\theta_{0},j})<C_{\epsilon,\eta}\right\}>1-\frac{\eta}{2}\ .

Note that if both the event ρP(ψθ^N,jψθ0,j)<Cϵ,η\rho_{P}(\psi_{\hat{\theta}_{N},j}-\psi_{\theta_{0},j})<C_{\epsilon,\eta} and the event

supψθ,j:ρP(ψθ,jψθ0,j)<Cϵ,η|𝔾Nψθ,j𝔾Nψθ0,j|<ϵ\displaystyle\underset{\psi_{\theta,j}:\rho_{P}(\psi_{\theta,j}-\psi_{\theta_{0},j})<C_{\epsilon,\eta}}{\sup}\left|\mathbb{G}_{N}\psi_{\theta,j}-\mathbb{G}_{N}\psi_{\theta_{0},j}\right|<\epsilon

occur, then |𝔾Nψθ^N,j𝔾Nψθ0,j|<ϵ|\mathbb{G}_{N}\psi_{\hat{\theta}_{N},j}-\mathbb{G}_{N}\psi_{\theta_{0},j}|<\epsilon. Taking complementary events, for any Nmax{Nϵ,η,Nϵ,η}N\geq\max\{N_{\epsilon,\eta},N_{\epsilon,\eta}^{*}\}, we obtain

P(|𝔾Nψθ^N,j𝔾Nψθ0,j|ϵ)P{ρP(ψθ^N,jψθ0,j)Cϵ,η}\displaystyle P\left(\left|\mathbb{G}_{N}\psi_{\hat{\theta}_{N},j}-\mathbb{G}_{N}\psi_{\theta_{0},j}\right|\geq\epsilon\right)\leq P\left\{\rho_{P}(\psi_{\hat{\theta}_{N},j}-\psi_{\theta_{0},j})\geq C_{\epsilon,\eta}\right\}
+P(supψθ,j:ρP(ψθ,jψθ0,j)<Cϵ,η|𝔾Nψθ,j𝔾Nψθ0,j|ϵ)\displaystyle\quad\quad\quad\quad+P\left(\underset{\psi_{\theta,j}:\rho_{P}(\psi_{\theta,j}-\psi_{\theta_{0},j})<C_{\epsilon,\eta}}{\sup}\left|\mathbb{G}_{N}\psi_{\theta,j}-\mathbb{G}_{N}\psi_{\theta_{0},j}\right|\geq\epsilon\right)
η2+η2=η.\displaystyle\leq\frac{\eta}{2}+\frac{\eta}{2}=\eta\ .

That is, for each jj-th entry,

N[Nψθ^N,j𝔼{ψθ^N,j(L)}]N[Nψθ0,j𝔼{ψθ0,j(L)}]𝑃0.\displaystyle\sqrt{N}\left[\mathbb{P}_{N}\psi_{\hat{\theta}_{N},j}-\mathbb{E}\{\psi_{\hat{\theta}_{N},j}(L)\}\right]-\sqrt{N}\left[\mathbb{P}_{N}\psi_{\theta_{0},j}-\mathbb{E}\{\psi_{\theta_{0},j}(L)\}\right]\xrightarrow{P}0\ . (15)

Therefore, by the comparison between terms in (14) and 𝔾Nψθ^N,j𝔾Nψθ0,j\mathbb{G}_{N}\psi_{\hat{\theta}_{N},j}-\mathbb{G}_{N}\psi_{\theta_{0},j}, we should consider

N[^Nψθ^N,jNψθ^N,j]N[^Nψθ0,jNψθ0,j]\displaystyle\sqrt{N}\left[\hat{\mathbb{P}}_{N}\psi_{\hat{\theta}_{N},j}-\mathbb{P}_{N}\psi_{\hat{\theta}_{N},j}\right]-\sqrt{N}\left[\hat{\mathbb{P}}_{N}\psi_{\theta_{0},j}-\mathbb{P}_{N}\psi_{\theta_{0},j}\right]

which can be decomposed into the following terms:

r(Sθ^N,1r+Sθ^N,2r+Sθ^N,3r+Sθ^N,5rSθ^N,6r)\displaystyle\quad\ \sum_{r\in\mathcal{R}}\left(S_{\hat{\theta}_{N},1}^{r}+S_{\hat{\theta}_{N},2}^{r}+S_{\hat{\theta}_{N},3}^{r}+S_{\hat{\theta}_{N},5}^{r}-S_{\hat{\theta}_{N},6}^{r}\right)
r(Sθ0,1r+Sθ0,2r+Sθ0,3r+Sθ0,5rSθ0,6r)\displaystyle-\sum_{r\in\mathcal{R}}\left(S_{\theta_{0},1}^{r}+S_{\theta_{0},2}^{r}+S_{\theta_{0},3}^{r}+S_{\theta_{0},5}^{r}-S_{\theta_{0},6}^{r}\right)

where

Sθ,5r\displaystyle S_{\theta,5}^{r} =1Ni=1N{𝟣Ri=1dOddsr(Lir)𝟣Ri=r}ψθ(Li),\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i})-\mathsf{1}_{R_{i}=r}\right\}\psi_{\theta}(L_{i})\ ,
Sθ,6r\displaystyle S_{\theta,6}^{r} =1Ni=1N{𝟣Ri=1dOddsr(Lir)𝟣Ri=r}uθr(Lir).\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i})-\mathsf{1}_{R_{i}=r}\right\}u^{r}_{\theta}(L^{r}_{i})\ .
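As in the proof of Theorem 3, these terms can be combined directly: for each θ and each pattern r,

\displaystyle S_{\theta,1}^{r}+S_{\theta,2}^{r}+S_{\theta,3}^{r}+S_{\theta,5}^{r}-S_{\theta,6}^{r}=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})-\mathsf{1}_{R_{i}=r}\right\}\psi_{\theta}(L_{i})\ ,

and summing over r ∈ ℛ recovers \hat{\mathbb{P}}_{N}\psi_{\theta}-\mathbb{P}_{N}\psi_{\theta}, since \sum_{r\in\mathcal{R}}\mathsf{1}_{R_{i}=r}=1.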

By Lemmas 6, 7, and 8, N|Sθ,ir|=op(1),i=1,2,3\sqrt{N}\left|S_{\theta,i}^{r}\right|=o_{p}(1),i=1,2,3 for any missing pattern rr and θΘ\theta\in\Theta. Combining this with Lemmas 10 and 11, we have

N(^Nψθ^NNψθ^N^Nψθ0+Nψθ0)𝑃0.\displaystyle\sqrt{N}\left(\hat{\mathbb{P}}_{N}\psi_{\hat{\theta}_{N}}-\mathbb{P}_{N}\psi_{\hat{\theta}_{N}}-\hat{\mathbb{P}}_{N}\psi_{\theta_{0}}+\mathbb{P}_{N}\psi_{\theta_{0}}\right)\xrightarrow{P}0\ . (16)

By Equations (16) and (15), we have

N[^Nψθ^N𝔼{ψθ^N(L)}^Nψθ0+𝔼{ψθ0(L)}]𝑃0.\displaystyle\sqrt{N}\left[\hat{\mathbb{P}}_{N}\psi_{\hat{\theta}_{N}}-\mathbb{E}\{\psi_{\hat{\theta}_{N}}(L)\}-\hat{\mathbb{P}}_{N}\psi_{\theta_{0}}+\mathbb{E}\{\psi_{\theta_{0}}(L)\}\right]\xrightarrow{P}0\ .

Since ^Nψθ^N=0\hat{\mathbb{P}}_{N}\psi_{\hat{\theta}_{N}}=0 and 𝔼{ψθ0(L)}=0\mathbb{E}\{\psi_{\theta_{0}}(L)\}=0, the above equation can be rewritten as

N[𝔼{ψθ^N(L)}𝔼{ψθ0(L)}+^Nψθ0𝔼{ψθ0(L)}]𝑃0.\displaystyle\sqrt{N}\left[\mathbb{E}\{\psi_{\hat{\theta}_{N}}(L)\}-\mathbb{E}\{\psi_{\theta_{0}}(L)\}+\hat{\mathbb{P}}_{N}\psi_{\theta_{0}}-\mathbb{E}\{\psi_{\theta_{0}}(L)\}\right]\xrightarrow{P}0\ .

By Theorem 3,

N[^Nψθ0𝔼{ψθ0(L)}]𝑑N(0,Vθ0).\displaystyle\sqrt{N}\left[\hat{\mathbb{P}}_{N}\psi_{\theta_{0}}-\mathbb{E}\{\psi_{\theta_{0}}(L)\}\right]\overset{d}{\to}N(0,V_{\theta_{0}})\ .

Since Dθ0D_{\theta_{0}} is nonsingular, by multivariate Delta method,

\displaystyle\sqrt{N}(\hat{\theta}_{N}-\theta_{0})\overset{d}{\to}N\left(0,D_{\theta_{0}}^{-1}V_{\theta_{0}}D_{\theta_{0}}^{-1^{\intercal}}\right)\ .

Therefore, θ^N\hat{\theta}_{N} is semiparametrically efficient.

Lastly, we look into the estimator for the asymptotic variance. We have the following decomposition:

D^θ^NDθ0\displaystyle\hat{D}_{\hat{\theta}_{N}}-D_{\theta_{0}} =1Ni=1N[𝟣Ri=1d{w^(Li)rOddsr(Lri)}ψ˙θ^N(Li)]\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left[\mathsf{1}_{R_{i}=1_{d}}\left\{\hat{w}(L_{i})-\sum_{r\in\mathcal{R}}\mathrm{Odds}^{r}(L^{r}_{i})\right\}\dot{\psi}_{\hat{\theta}_{N}}(L_{i})\right]
+1Ni=1N𝟣Ri=1d1P(Ri=1dLi)ψ˙θ^N(Li)Dθ^N+Dθ^NDθ0.\displaystyle\quad\ +\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\frac{1}{P(R_{i}=1_{d}\mid L_{i})}\dot{\psi}_{\hat{\theta}_{N}}(L_{i})-D_{\hat{\theta}_{N}}+D_{\hat{\theta}_{N}}-D_{\theta_{0}}\ .

where w^(Li)=rOddsr(Lri;α^r)\hat{w}(L_{i})=\sum_{r\in\mathcal{R}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r}).

Consider the first term on the right hand side. By Theorem 2,

w^(l)1/P(R=1dl)\displaystyle\quad\ \|\hat{w}(l)-1/P(R=1_{d}\mid l)\|_{\infty}
rOddsr(;α^r)Oddsr=Op(Kr2/Nr+Kr1/2μ1).\displaystyle\leq\sum_{r\in\mathcal{R}}\|\mathrm{Odds}^{r}(\ \cdot\ ;\hat{\alpha}^{r})-\mathrm{Odds}^{r}\|_{\infty}=O_{p}(\sqrt{K_{r}^{2}/N_{r}}+K_{r}^{1/2-\mu_{1}})\ .

Let

𝐅=1Ni=1N𝟣Ri=1dψ˙θ^N(Li)ψ˙θ^N(Li).\displaystyle\mathbf{F}=\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\dot{\psi}_{\hat{\theta}_{N}}(L_{i})\dot{\psi}_{\hat{\theta}_{N}}(L_{i})^{\intercal}\ .

Following arguments similar to those in Lemma 5, one can see that

1Ni=1N[𝟣Ri=1d{w^(Li)rOddsr(Lri)}ψ˙θ^N(Li)]22\displaystyle\quad\ \left\|\frac{1}{N}\sum_{i=1}^{N}\left[\mathsf{1}_{R_{i}=1_{d}}\left\{\hat{w}(L_{i})-\sum_{r\in\mathcal{R}}\mathrm{Odds}^{r}(L^{r}_{i})\right\}\dot{\psi}_{\hat{\theta}_{N}}(L_{i})\right]\right\|_{2}^{2}
supθΘ1Ni=1N𝟣Ri=1d{w^(Li)rOddsr(Lri)}2λmax{𝐅}\displaystyle\leq\underset{\theta\in\Theta}{\sup}\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\left\{\hat{w}(L_{i})-\sum_{r\in\mathcal{R}}\mathrm{Odds}^{r}(L^{r}_{i})\right\}^{2}\lambda_{\max}\left\{\mathbf{F}\right\}
λmaxw^(l)1/P(R=1dl)2=op(1).\displaystyle\leq\lambda_{\max}^{\prime}\|\hat{w}(l)-1/P(R=1_{d}\mid l)\|_{\infty}^{2}=o_{p}(1)\ .

Consider the second term on the right hand side. Notice that ψ˙θ\dot{\psi}_{\theta} is the Jacobian matrix of ψθ\psi_{\theta}. We consider all the entries of ψ˙θ\dot{\psi}_{\theta} together and abbreviate the subscripts in the following statements. Define a set of functions Θ:={fθ:θΘ}\mathcal{F}_{\Theta}:=\{f_{\theta}:\theta\in\Theta\} where fθ(L,R):=1R=1d/P(R=1dL)ψ˙θ(L)f_{\theta}(L,R):=1_{R=1_{d}}/P(R=1_{d}\mid L)\dot{\psi}_{\theta}(L). Similar to Lemma 15, one can show that

n[]{ϵ/δ0,Θ,L1(P)}n[]{ϵ,𝒥Θ,L1(P)}<\displaystyle n_{[\ ]}\{\epsilon/\delta_{0},\mathcal{F}_{\Theta},L_{1}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{J}_{\Theta},L_{1}(P)\}<\infty

since 1R=1d/P(R=1dL)1/δ01_{R=1_{d}}/P(R=1_{d}\mid L)\leq 1/\delta_{0}. Therefore, by Theorem 19.4 in [29], Θ\mathcal{F}_{\Theta} is P-Glivenko-Cantelli. Since the set Θ\mathcal{F}_{\Theta} includes all entries of the Jacobian matrix, we consider the Frobenius/Euclidean norm of a matrix to construct the following convergence result.

supfθΘNfθPfθFa.s.0\displaystyle\underset{f_{\theta}\in\mathcal{F}_{\Theta}}{\sup}\left\|\mathbb{P}_{N}f_{\theta}-Pf_{\theta}\right\|_{F}\xrightarrow{a.s.}0

where F\|\cdot\|_{F} is the Euclidean norm of a matrix. The fact that 2F\|\cdot\|_{2}\leq\|\cdot\|_{F} implies

1Ni=1N𝟣Ri=1d1P(Ri=1dLi)ψ˙θ^N(Li)Dθ^N2=op(1).\displaystyle\left\|\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\frac{1}{P(R_{i}=1_{d}\mid L_{i})}\dot{\psi}_{\hat{\theta}_{N}}(L_{i})-D_{\hat{\theta}_{N}}\right\|_{2}=o_{p}(1)\ .

Finally, Dθ^N𝑃Dθ0D_{\hat{\theta}_{N}}\xrightarrow{P}D_{\theta_{0}} since θ^Nθ02𝑃0\|\hat{\theta}_{N}-\theta_{0}\|_{2}\xrightarrow{P}0 and ψ˙θ\dot{\psi}_{\theta} is continuous in a neighborhood of θ0\theta_{0}. Therefore, D^θ^N\hat{D}_{\hat{\theta}_{N}} is a consistent estimator of Dθ0D_{\theta_{0}}.

We omit the details but outline the remaining argument. Each component of F^i\hat{F}_{i} converges to the corresponding true value; therefore, F^i\hat{F}_{i} and Vθ^NV_{\hat{\theta}_{N}} are consistent estimators of Fθ0(Li,Ri)F_{\theta_{0}}(L_{i},R_{i}) and Vθ0V_{\theta_{0}} respectively. Since D^θ^N1Vθ^ND^θ^N1\hat{D}_{\hat{\theta}_{N}}^{-1}V_{\hat{\theta}_{N}}\hat{D}_{\hat{\theta}_{N}}^{-1^{\intercal}} is a standard sandwich estimator, it is straightforward to show that it is a consistent estimator of the above asymptotic variance. ∎

Appendix F Related Lemmas

Lemma 1 (Weyl’s inequality).

Let 𝐀\mathbf{A} and 𝐁\mathbf{B} be m×mm\times m Hermitian matrices and 𝐂=𝐀𝐁\mathbf{C}=\mathbf{A}-\mathbf{B}. Suppose their respective eigenvalues μi,νi,ρi\mu_{i},\nu_{i},\rho_{i} are ordered as follows:

𝐀:μ1μm,\displaystyle\mathbf{A}:\quad\mu_{1}\geq\cdots\geq\mu_{m}\ ,
𝐁:ν1νm,\displaystyle\mathbf{B}:\quad\nu_{1}\geq\cdots\geq\nu_{m}\ ,
𝐂:ρ1ρm.\displaystyle\mathbf{C}:\quad\rho_{1}\geq\cdots\geq\rho_{m}\ .

Then, the following inequalities hold.

ρmμiνiρ1,i=1,,m.\displaystyle\rho_{m}\leq\mu_{i}-\nu_{i}\leq\rho_{1},\quad i=1,\cdots,m\ .

In particular, if 𝐂\mathbf{C} is positive semi-definite, plugging ρm0\rho_{m}\geq 0 into the above inequalities leads to

μiνi,i=1,,m.\displaystyle\mu_{i}\geq\nu_{i},\quad i=1,\cdots,m\ .
Lemma 2 (Bernstein’s inequality).

Let {𝐀i}i=1N\{\mathbf{A}_{i}\}_{i=1}^{N} be a sequence of independent random matrices with dimensions d1×d2d_{1}\times d_{2}. Assume that 𝔼{𝐀i}=𝟎d1,d2\mathbb{E}\{\mathbf{A}_{i}\}=\mathbf{0}_{d_{1},d_{2}} and 𝐀i2c\|\mathbf{A}_{i}\|_{2}\leq c almost surely for all i=1,,Ni=1,\cdots,N and some constant cc. Also assume that

max{i=1N𝔼(𝐀i𝐀i)2,i=1N𝔼(𝐀i𝐀i)2}σ2.\displaystyle\max\left\{\left\|\sum_{i=1}^{N}\mathbb{E}(\mathbf{A}_{i}\mathbf{A}_{i}^{\intercal})\right\|_{2},\left\|\sum_{i=1}^{N}\mathbb{E}(\mathbf{A}_{i}^{\intercal}\mathbf{A}_{i})\right\|_{2}\right\}\leq\sigma^{2}\ .

Then, for all t0t\geq 0,

P(i=1N𝐀i2t)(d1+d2)exp(t2/2σ2+ct/3).\displaystyle P\left(\left\|\sum_{i=1}^{N}\mathbf{A}_{i}\right\|_{2}\geq t\right)\leq(d_{1}+d_{2})\exp\left(-\frac{t^{2}/2}{\sigma^{2}+ct/3}\right)\ .
Lemma 3.

Under Assumptions 1 and 2, the minimizer α^r\hat{\alpha}^{r} satisfies

α^rαKr2=Op(KrNr+Krμ1)=op(1).\displaystyle\|\hat{\alpha}^{r}-\alpha_{K_{r}}^{*}\|_{2}=O_{p}\left(\sqrt{\frac{K_{r}}{N_{r}}}+K_{r}^{-\mu_{1}}\right)=o_{p}(1)\ .
Proof.

It suffices to show for any ϵ>0\epsilon>0, there exists CϵC_{\epsilon} and NϵN_{\epsilon} such that

P{α^rαKr2>Cϵ(KrNr+Krμ1)}ϵ\displaystyle P\left\{\|\hat{\alpha}^{r}-\alpha_{K_{r}}^{*}\|_{2}>C_{\epsilon}\left(\sqrt{\frac{K_{r}}{N_{r}}}+K_{r}^{-\mu_{1}}\right)\right\}\leq\epsilon (17)

for any NNϵN\geq N_{\epsilon}. This means that the minimizer α^r\hat{\alpha}^{r} lies in a small neighbourhood of αKr\alpha_{K_{r}}^{*} with probability higher than 1ϵ1-\epsilon. Consider the set Δ={δKr:δ2C(Kr/Nr+Krμ1)}\Delta=\{\delta\in\mathbb{R}^{K_{r}}:\|\delta\|_{2}\leq C(\sqrt{K_{r}/N_{r}}+K_{r}^{-\mu_{1}})\} for an arbitrary constant CC. Since rλ\mathcal{L}^{r}_{\lambda} is a convex function of αr\alpha^{r}, the minimizer α^rαKr+Δ\hat{\alpha}^{r}\in\alpha_{K_{r}}^{*}+\Delta if infδΔrλ(αKr+δ)>rλ(αKr)\inf_{\delta\in\partial\Delta}\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*}+\delta)>\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*}). Thus, considering the complementary event, we have

P{α^rαKr2>C(KrNr+Krμ1)}P{infδΔrλ(αKr+δ)rλ(αKr)}.\displaystyle P\left\{\|\hat{\alpha}^{r}-\alpha_{K_{r}}^{*}\|_{2}>C\left(\sqrt{\frac{K_{r}}{N_{r}}}+K_{r}^{-\mu_{1}}\right)\right\}\leq P\left\{\underset{\delta\in\partial\Delta}{\inf}\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*}+\delta)\leq\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*})\right\}\ .

Recall that for any r1dr\neq 1_{d} and any λ>0\lambda>0, the objective function is

rλ(αr)\displaystyle\mathcal{L}^{r}_{\lambda}(\alpha^{r}) =1Ni=1N{𝟣Ri=1dOddsr(Lri;αr)𝟣Ri=rlogOddsr(Lri;αr)}+λJr(αr)\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\alpha^{r})-\mathsf{1}_{R_{i}=r}\log\mathrm{Odds}^{r}(L^{r}_{i};\alpha^{r})\right\}+\lambda J^{r}(\alpha^{r})

where Oddsr(lr;αr)=exp{Φr(lr)αr}\mathrm{Odds}^{r}(l^{r};\alpha^{r})=\exp\{\Phi^{r}(l^{r})^{\intercal}\alpha^{r}\} and Jr(αr)=γk=1Krtk|αrk|+(1γ)(αr)𝐃rαrJ^{r}(\alpha^{r})=\gamma\sum_{k=1}^{K_{r}}t_{k}|\alpha^{r}_{k}|+(1-\gamma)(\alpha^{r})^{\intercal}\mathbf{D}^{r}\alpha^{r}. To investigate infδΔrλ(αKr+δ)rλ(αKr)\inf_{\delta\in\partial\Delta}\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*}+\delta)-\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*}), we apply the mean value theorem: for any δΔ\delta\in\Delta, there exists some α~r\tilde{\alpha}^{r} (depending on δ\delta) satisfying α~rαKrint(Δ)\tilde{\alpha}^{r}-\alpha_{K_{r}}^{*}\in\mathrm{int}(\Delta), the interior of Δ\Delta, such that

rλ(αKr+δ)rλ(αKr)\displaystyle\quad\ \mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*}+\delta)-\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*})
=δrN(αr)αr|αKr+12δ{2rN(αr)(αr)2|α~r}δ+λJr(αKr+δ)λJr(αKr).\displaystyle=\delta^{\intercal}\left.\frac{\partial\mathcal{L}^{r}_{N}(\alpha^{r})}{\partial\alpha^{r}}\right|_{\alpha_{K_{r}}^{*}}+\frac{1}{2}\delta^{\intercal}\left\{\left.\frac{\partial^{2}\mathcal{L}^{r}_{N}(\alpha^{r})}{(\partial\alpha^{r})^{2}}\right|_{\tilde{\alpha}^{r}}\right\}\delta+\lambda J^{r}(\alpha_{K_{r}}^{*}+\delta)-\lambda J^{r}(\alpha_{K_{r}}^{*})\ .

By the triangle inequality and Cauchy inequality, the difference between penalties satisfies

Jr(αKr+δ)Jr(αKr)\displaystyle J^{r}(\alpha_{K_{r}}^{*}+\delta)-J^{r}(\alpha_{K_{r}}^{*})
=γk=1Kr(|αr,k+δkr||αr,k|)tk+(1γ)(δ𝐃rδ+2δ𝐃rαKr)\displaystyle=\gamma\sum_{k=1}^{K_{r}}\left(\left|\alpha^{r}_{*,k}+\delta_{k}^{r}\right|-\left|\alpha^{r}_{*,k}\right|\right)t_{k}+(1-\gamma)\left(\delta^{\intercal}\mathbf{D}^{r}\delta+2\delta^{\intercal}\mathbf{D}^{r}\alpha_{K_{r}}^{*}\right)
γk=1Kr|δkr|tk+(1γ)(δ𝐃rδ+2δ𝐃rαKr)\displaystyle\geq-\gamma\sum_{k=1}^{K_{r}}\left|\delta_{k}^{r}\right|t_{k}+(1-\gamma)\left(\delta^{\intercal}\mathbf{D}^{r}\delta+2\delta^{\intercal}\mathbf{D}^{r}\alpha_{K_{r}}^{*}\right)
γk=1Krtk2δ22(1γ)δ2𝐃r2αKr2+(1γ)δ𝐃rδ.\displaystyle\geq-\gamma\sqrt{\sum_{k=1}^{K_{r}}t_{k}^{2}}\left\|\delta\right\|_{2}-2(1-\gamma)\|\delta\|_{2}\|\mathbf{D}^{r}\|_{2}\|\alpha_{K_{r}}^{*}\|_{2}+(1-\gamma)\delta^{\intercal}\mathbf{D}^{r}\delta\ .

Denote the constant γk=1Krtk2+2(1γ)𝐃r2αKr2\gamma\sqrt{\sum_{k=1}^{K_{r}}t_{k}^{2}}+2(1-\gamma)\|\mathbf{D}^{r}\|_{2}\|\alpha_{K_{r}}^{*}\|_{2} as clinc_{\mathrm{lin}}. Then, by the Cauchy inequality again,

rλ(αKr+δ)rλ(αKr)\displaystyle\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*}+\delta)-\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*}) {rN(αr)αr|αKr2+λclin}δ2\displaystyle\geq-\left\{\left\|\left.\frac{\partial\mathcal{L}^{r}_{N}(\alpha^{r})}{\partial\alpha^{r}}\right|_{\alpha_{K_{r}}^{*}}\right\|_{2}+\lambda c_{\mathrm{lin}}\right\}\left\|\delta\right\|_{2}
+δ{122rN(αr)(αr)2|α~r+λ(1γ)𝐃r}δ.\displaystyle\quad\ +\delta^{\intercal}\left\{\frac{1}{2}\left.\frac{\partial^{2}\mathcal{L}^{r}_{N}(\alpha^{r})}{(\partial\alpha^{r})^{2}}\right|_{\tilde{\alpha}^{r}}+\lambda(1-\gamma)\mathbf{D}^{r}\right\}\delta\ .

First, consider the quadratic term. Since 𝐃r\mathbf{D}^{r} is a positive semi-definite matrix, λ(1γ)δ𝐃rδ0\lambda(1-\gamma)\delta^{\intercal}\mathbf{D}^{r}\delta\geq 0. Thus, by Lemma 4, the quadratic term is bounded from below. More precisely, for any ϵ>0\epsilon>0, there exists NϵN_{\epsilon}^{*} such that for any NNϵN\geq N_{\epsilon}^{*},

P[δ{122rN(αr)(αr)2|α~r+λ(1γ)𝐃r}δCquad2δ22]112ϵ.\displaystyle P\left[\delta^{\intercal}\left\{\frac{1}{2}\left.\frac{\partial^{2}\mathcal{L}^{r}_{N}(\alpha^{r})}{(\partial\alpha^{r})^{2}}\right|_{\tilde{\alpha}^{r}}+\lambda(1-\gamma)\mathbf{D}^{r}\right\}\delta\geq\frac{C_{\mathrm{quad}}}{2}\|\delta\|_{2}^{2}\right]\geq 1-\frac{1}{2}\epsilon\ .

Next, consider the bound on the linear term. By Assumption 2.E, λ=O(Kr/N)\lambda=O(\sqrt{K_{r}/N}). By Lemma 5, for any ϵ>0\epsilon>0, there exist NϵN_{\epsilon}^{\prime} and a constant CϵC_{\epsilon}^{\prime} such that for any NNϵN\geq N_{\epsilon}^{\prime},

P[{rN(αr)αr|αKr2+λclin}Cϵ(KrN+Krμ1)]12ϵ.\displaystyle P\left[\left\{\left\|\left.\frac{\partial\mathcal{L}^{r}_{N}(\alpha^{r})}{\partial\alpha^{r}}\right|_{\alpha_{K_{r}}^{*}}\right\|_{2}+\lambda c_{\mathrm{lin}}\right\}\geq C_{\epsilon}^{\prime}\left(\sqrt{\frac{K_{r}}{N}}+K_{r}^{-\mu_{1}}\right)\right]\leq\frac{1}{2}\epsilon\ .

Considering the complement of the above event and the fact that P(AB)=P(A)+P(B)P(AB)P(A)+P(B)1P(A\cap B)=P(A)+P(B)-P(A\cup B)\geq P(A)+P(B)-1, we have

P{rλ(αKr+δ)rλ(αKr)Cquad2δ22Cϵ(KrN+Krμ1)δ2}1ϵ\displaystyle P\left\{\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*}+\delta)-\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*})\geq\frac{C_{\mathrm{quad}}}{2}\|\delta\|_{2}^{2}-C_{\epsilon}^{\prime}\left(\sqrt{\frac{K_{r}}{N}}+K_{r}^{-\mu_{1}}\right)\|\delta\|_{2}\right\}\geq 1-\epsilon

for any Nmax{Nϵ,Nϵ}N\geq\max\{N_{\epsilon}^{*},N_{\epsilon}^{\prime}\}. Note that Δ={δKr:δ2=C(Kr/Nr+Krμ1)}\partial\Delta=\{\delta\in\mathbb{R}^{K_{r}}:\|\delta\|_{2}=C(\sqrt{K_{r}/N_{r}}+K_{r}^{-\mu_{1}})\}. Choosing C>2Cϵ/CquadC>2C_{\epsilon}^{\prime}/C_{\mathrm{quad}}, we have P{infδΔrλ(αKr+δ)rλ(αKr)}1ϵP\{\inf_{\delta\in\partial\Delta}\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*}+\delta)\geq\mathcal{L}^{r}_{\lambda}(\alpha_{K_{r}}^{*})\}\geq 1-\epsilon for any ϵ>0\epsilon>0. Therefore, inequality (17) holds which completes the proof. ∎

Lemma 4.

There exists a constant CquadC_{\mathrm{quad}} such that the Hessian matrix of rN(αr)\mathcal{L}^{r}_{N}(\alpha^{r}) at α~r\tilde{\alpha}^{r} satisfies

limNP[λmin{2rN(αr)(αr)2|α~r}Cquad]=1\displaystyle\lim_{N\to\infty}P\left[\lambda_{\min}\left\{\left.\frac{\partial^{2}\mathcal{L}^{r}_{N}(\alpha^{r})}{(\partial\alpha^{r})^{2}}\right|_{\tilde{\alpha}^{r}}\right\}\geq C_{\mathrm{quad}}\right]=1

where λmin()\lambda_{\min}(\cdot) represents the minimal eigenvalue of the matrix.

Proof.

Denote the Hessian matrix of rN(αr)\mathcal{L}^{r}_{N}(\alpha^{r}) at α~r\tilde{\alpha}^{r} as 𝐀\mathbf{A}:

𝐀=2rN(αr)(αr)2|α~r=1Ni=1N{Oddsr(Lri;α~r)𝟣Ri=1dΦr(Lri)Φr(Lri)}.\displaystyle\mathbf{A}=\left.\frac{\partial^{2}\mathcal{L}^{r}_{N}(\alpha^{r})}{(\partial\alpha^{r})^{2}}\right|_{\tilde{\alpha}^{r}}=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathrm{Odds}^{r}(L^{r}_{i};\tilde{\alpha}^{r})\mathsf{1}_{R_{i}=1_{d}}\Phi^{r}(L^{r}_{i})\Phi^{r}(L^{r}_{i})^{\intercal}\right\}\ .

Recall that the set Δ={δKr:δ2C(Kr/Nr+Krμ1)}\Delta=\{\delta\in\mathbb{R}^{K_{r}}:\|\delta\|_{2}\leq C(\sqrt{K_{r}/N_{r}}+K_{r}^{-\mu_{1}})\} and α~rαKrint(Δ)\tilde{\alpha}^{r}-\alpha_{K_{r}}^{*}\in\mathrm{int}(\Delta). Following arguments similar to those in the proof of Theorem 2 (Appendix C), it can be shown that

Oddsr(lr;α~r)\displaystyle\quad\ \mathrm{Odds}^{r}(l^{r};\tilde{\alpha}^{r})
Oddsr(lr)|Oddsr(lr)Oddsr(lr;αKr)|4C0|Φr(lr)α~rΦr(lr)αKr|\displaystyle\geq\mathrm{Odds}^{r}(l^{r})-\left|\mathrm{Odds}^{r}(l^{r})-\mathrm{Odds}^{r}(l^{r};\alpha_{K_{r}}^{*})\right|-4C_{0}\left|\Phi^{r}(l^{r})^{\intercal}\tilde{\alpha}^{r}-\Phi^{r}(l^{r})^{\intercal}\alpha_{K_{r}}^{*}\right|
c0C1Krμ14C0C(Kr/Nr+Kr12μ1).\displaystyle\geq c_{0}-C_{1}K_{r}^{-\mu_{1}}-4C_{0}C(K_{r}/\sqrt{N_{r}}+K_{r}^{\frac{1}{2}-\mu_{1}})\ .

Then, there exists NΔN_{\Delta} such that Oddsr(lr;α~r)>c0/2\mathrm{Odds}^{r}(l^{r};\tilde{\alpha}^{r})>c_{0}/2 holds for any NNΔN\geq N_{\Delta}. Let

𝐁\displaystyle\mathbf{B} =1Ni=1N{12c0𝟣Ri=1dΦr(Lri)Φr(Lri)},\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\left\{\frac{1}{2}c_{0}\mathsf{1}_{R_{i}=1_{d}}\Phi^{r}(L^{r}_{i})\Phi^{r}(L^{r}_{i})^{\intercal}\right\},
𝐂\displaystyle\mathbf{C} =12c0𝔼{𝟣R=1dΦr(Lr)Φr(Lr)}=12c0𝔼{P(R=1dLr)Φr(Lr)Φr(Lr)},\displaystyle=\frac{1}{2}c_{0}\mathbb{E}\left\{\mathsf{1}_{R=1_{d}}\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\right\}=\frac{1}{2}c_{0}\mathbb{E}\left\{P(R=1_{d}\mid L^{r})\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\right\},
𝐃\displaystyle\mathbf{D} =12c0δ0𝔼{Φr(Lr)Φr(Lr)}.\displaystyle=\frac{1}{2}c_{0}\delta_{0}\mathbb{E}\left\{\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\right\}\ .

It’s easy to see that matrices 𝐀,𝐁,𝐂,𝐃\mathbf{A},\mathbf{B},\mathbf{C},\mathbf{D} are symmetric. Based on the above discussions, 𝐀𝐁\mathbf{A}-\mathbf{B} is positive semi-definite for large enough NN. By Assumption 1.B, 𝐂𝐃\mathbf{C}-\mathbf{D} is also positive semi-definite. Applying Lemma 1, we have λmin(𝐀)λmin(𝐁)\lambda_{\min}(\mathbf{A})\geq\lambda_{\min}(\mathbf{B}), λmin(𝐂)λmin(𝐃)\lambda_{\min}(\mathbf{C})\geq\lambda_{\min}(\mathbf{D}) and |λmin(𝐁)λmin(𝐂)|max{|λmin(𝐁𝐂)|,|λmax(𝐁𝐂)|}=𝐁𝐂2|\lambda_{\min}(\mathbf{B})-\lambda_{\min}(\mathbf{C})|\leq\max\{|\lambda_{\min}(\mathbf{B}-\mathbf{C})|,|\lambda_{\max}(\mathbf{B}-\mathbf{C})|\}=\|\mathbf{B}-\mathbf{C}\|_{2}. Therefore, λmin(𝐀)λmin(𝐃)𝐁𝐂2c0δ0λmin/2𝐁𝐂2\lambda_{\min}(\mathbf{A})\geq\lambda_{\min}(\mathbf{D})-\|\mathbf{B}-\mathbf{C}\|_{2}\geq c_{0}\delta_{0}\lambda_{\min}^{*}/2-\|\mathbf{B}-\mathbf{C}\|_{2}. To study 𝐁𝐂2\|\mathbf{B}-\mathbf{C}\|_{2}, we apply Lemma 2. Let

𝐄i=1N[𝟣Ri=1dΦr(Lri)Φr(Lri)𝔼{𝟣R=1dΦr(Lr)Φr(Lr)}].\displaystyle\mathbf{E}_{i}=\frac{1}{N}\Big{[}\mathsf{1}_{R_{i}=1_{d}}\Phi^{r}(L^{r}_{i})\Phi^{r}(L^{r}_{i})^{\intercal}-\mathbb{E}\left\{\mathsf{1}_{R=1_{d}}\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\right\}\Big{]}\ .

So, 𝔼{𝐄i}=𝟎Kr,Kr\mathbb{E}\{\mathbf{E}_{i}\}=\mathbf{0}_{K_{r},K_{r}}. By the triangle inequality, Lemma 1 and the fact that 2F\|\cdot\|_{2}\leq\|\cdot\|_{F},

𝐄i2\displaystyle\|\mathbf{E}_{i}\|_{2} 1N𝟣Ri=1dΦr(Lri)Φr(Lri)F+1N𝔼{𝟣R=1dΦr(Lr)Φr(Lr)}2\displaystyle\leq\frac{1}{N}\mathsf{1}_{R_{i}=1_{d}}\|\Phi^{r}(L^{r}_{i})\Phi^{r}(L^{r}_{i})^{\intercal}\|_{F}+\frac{1}{N}\|\mathbb{E}\left\{\mathsf{1}_{R=1_{d}}\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\right\}\|_{2}
1Ntrace{Φr(Lri)Φr(Lri)Φr(Lri)Φr(Lri)}+1N𝔼{Φr(Lr)Φr(Lr)}2\displaystyle\leq\frac{1}{N}\sqrt{\textrm{trace}\{\Phi^{r}(L^{r}_{i})\Phi^{r}(L^{r}_{i})^{\intercal}\Phi^{r}(L^{r}_{i})\Phi^{r}(L^{r}_{i})^{\intercal}\}}+\frac{1}{N}\|\mathbb{E}\left\{\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\right\}\|_{2}
=1NΦr(Lri)22+1N𝔼{Φr(Lr)Φr(Lr)}2.\displaystyle=\frac{1}{N}\|\Phi^{r}(L^{r}_{i})\|_{2}^{2}+\frac{1}{N}\|\mathbb{E}\left\{\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\right\}\|_{2}\ .

By Assumption 2.E, Assumption 2.F and the fact that Nr/N<1N_{r}/N<1, 𝐄i2=O(Kr/Nr)\|\mathbf{E}_{i}\|_{2}=O(K_{r}/N_{r}). Similarly,

i=1N𝔼(𝐄i𝐄i)2\displaystyle\quad\ \left\|\sum_{i=1}^{N}\mathbb{E}(\mathbf{E}_{i}\mathbf{E}_{i}^{\intercal})\right\|_{2}
1N𝔼{𝟣R=1dΦr(Lr)Φr(Lr)Φr(Lr)Φr(Lr)}2\displaystyle\leq\frac{1}{N}\left\|\mathbb{E}\{\mathsf{1}_{R=1_{d}}\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\}\right\|_{2}
+1N𝔼{𝟣R=1dΦr(Lr)Φr(Lr)}𝔼{𝟣R=1dΦr(Lr)Φr(Lr)}2\displaystyle\quad\ +\frac{1}{N}\left\|\mathbb{E}\{\mathsf{1}_{R=1_{d}}\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\}\mathbb{E}\left\{\mathsf{1}_{R=1_{d}}\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\right\}\right\|_{2}
1Nsuplr∈domrΦr(lr)22𝔼{Φr(Lr)Φr(Lr)}2+1N𝔼{Φr(Lr)Φr(Lr)}22\displaystyle\leq\frac{1}{N}\underset{l^{r}\in\mathrm{dom}^{r}}{\sup}\|\Phi^{r}(l^{r})\|_{2}^{2}\|\mathbb{E}\left\{\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\right\}\|_{2}+\frac{1}{N}\|\mathbb{E}\left\{\Phi^{r}(L^{r})\Phi^{r}(L^{r})^{\intercal}\right\}\|_{2}^{2}
=O(Kr/Nr).\displaystyle=O(K_{r}/N_{r})\ .

Taking t=CKrlogKr/Nrt=C\sqrt{K_{r}\log K_{r}/N_{r}} in Lemma 2 for an arbitrary constant CC, we have

P(i=1N𝐄i2t)2Krexp(ClogKr)\displaystyle P\left(\left\|\sum_{i=1}^{N}\mathbf{E}_{i}\right\|_{2}\geq t\right)\leq 2K_{r}\exp(-C^{\prime}\log K_{r})

for large enough NN and some constant CC^{\prime}. In other words, 𝐁𝐂2=Op(KrlogKr/Nr)=op(1)\|\mathbf{B}-\mathbf{C}\|_{2}=O_{p}(\sqrt{K_{r}\log K_{r}/N_{r}})=o_{p}(1). Therefore, for any ϵ\epsilon there exists NΔ,ϵN_{\Delta,\epsilon} such that

P{λmin(𝐀)14c0δ0λmin}1ϵ\displaystyle P\left\{\lambda_{\min}(\mathbf{A})\geq\frac{1}{4}c_{0}\delta_{0}\lambda_{\min}^{*}\right\}\geq 1-\epsilon

for any Nmax{NΔ,NΔ,ϵ}N\geq\max\{N_{\Delta},N_{\Delta,\epsilon}\}. ∎
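As an informal numerical illustration (not part of the proof), the two eigenvalue comparisons borrowed from Lemma 1 can be checked directly: if A−B is positive semi-definite then the smallest eigenvalue of A is at least that of B, and the smallest eigenvalues of B and C differ by at most the spectral norm of B−C. The sketch below uses random symmetric matrices as stand-ins for the matrices of this proof; it verifies only these two matrix facts, not the Lemma 4 construction itself.

```python
# Illustration only: the Weyl-type eigenvalue comparisons used in the proof of Lemma 4.
import numpy as np

rng = np.random.default_rng(0)
K = 8

def rand_sym():
    M = rng.normal(size=(K, K))
    return (M + M.T) / 2

B = rand_sym()
C = rand_sym()
P = rng.normal(size=(K, K))
A = B + P @ P.T                                   # A - B is PSD by construction

lam_min = lambda M: np.linalg.eigvalsh(M)[0]      # smallest eigenvalue
assert lam_min(A) >= lam_min(B) - 1e-10           # A - B PSD implies lam_min(A) >= lam_min(B)
assert abs(lam_min(B) - lam_min(C)) <= np.linalg.norm(B - C, 2) + 1e-10
print(lam_min(A), lam_min(B), lam_min(C), np.linalg.norm(B - C, 2))
```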

Lemma 5.

The gradient of rN(αr)\mathcal{L}^{r}_{N}(\alpha^{r}) at αKr\alpha_{K_{r}}^{*} satisfies

rN(αr)αr|αKr2\displaystyle\left\|\left.\frac{\partial\mathcal{L}^{r}_{N}(\alpha^{r})}{\partial\alpha^{r}}\right|_{\alpha_{K_{r}}^{*}}\right\|_{2} =Op(KrN+Krμ1).\displaystyle=O_{p}\left(\sqrt{\frac{K_{r}}{N}}+K_{r}^{-\mu_{1}}\right)\ .
Proof.

The gradient of rN(αr)\mathcal{L}^{r}_{N}(\alpha^{r}) at αKr\alpha_{K_{r}}^{*} is

rN(αr)αr|αKr=1Ni=1N{𝟣Ri=r𝟣Ri=1dOddsr(Lri;αKr)}Φr(Lri).\displaystyle\left.\frac{\partial\mathcal{L}^{r}_{N}(\alpha^{r})}{\partial\alpha^{r}}\right|_{\alpha_{K_{r}}^{*}}=\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=r}-\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\alpha_{K_{r}}^{*})\right\}\Phi^{r}(L^{r}_{i})\ .

Thus, by the triangle inequality,

rN(αr)αr|αKr2\displaystyle\left\|\left.\frac{\partial\mathcal{L}^{r}_{N}(\alpha^{r})}{\partial\alpha^{r}}\right|_{\alpha_{K_{r}}^{*}}\right\|_{2} 1Ni=1N{𝟣Ri=r𝟣Ri=1dOddsr(Lri)}Φr(Lri)2\displaystyle\leq\left\|\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=r}-\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i})\right\}\Phi^{r}(L^{r}_{i})\right\|_{2}
+1Ni=1N𝟣Ri=1d{Oddsr(Lri)Oddsr(Lri;αKr)}Φr(Lri)2.\displaystyle\quad\ +\left\|\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\left\{\mathrm{Odds}^{r}(L^{r}_{i})-\mathrm{Odds}^{r}(L^{r}_{i};\alpha_{K_{r}}^{*})\right\}\Phi^{r}(L^{r}_{i})\right\|_{2}\ .

Consider the first term on the right hand side. Let Ai={𝟣Ri=r𝟣Ri=1dOddsr(Lri)}Φr(Lri)A_{i}=\{\mathsf{1}_{R_{i}=r}-\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i})\}\Phi^{r}(L^{r}_{i}). It’s easy to see that {Ai}i=1N\{A_{i}\}_{i=1}^{N} are i.i.d. and 𝔼(Ai)=0\mathbb{E}(A_{i})=0. Thus,

𝔼1Ni=1NAi22=1N𝔼(AiAi)\displaystyle\quad\ \mathbb{E}\left\|\frac{1}{N}\sum_{i=1}^{N}A_{i}\right\|_{2}^{2}=\frac{1}{N}\mathbb{E}(A_{i}^{\intercal}A_{i})
=1N𝔼[k=1Kr{𝟣R=r+𝟣R=1dOddsr(Lr)Oddsr(Lr)}ϕrk(Lr)ϕrk(Lr)]\displaystyle=\frac{1}{N}\mathbb{E}\left[\sum_{k=1}^{K_{r}}\left\{\mathsf{1}_{R=r}+\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r})\mathrm{Odds}^{r}(L^{r})\right\}\phi^{r}_{k}(L^{r})\phi^{r}_{k}(L^{r})\right]
C02+1N𝔼Φr(Lr)22.\displaystyle\leq\frac{C_{0}^{2}+1}{N}\mathbb{E}\|\Phi^{r}(L^{r})\|_{2}^{2}\ .

By Assumption 2.E, 𝔼i=1NAi/N22=O(Kr/N)\mathbb{E}\|\sum_{i=1}^{N}A_{i}/N\|_{2}^{2}=O(K_{r}/N). By the Markov inequality, this implies i=1NAi/N2=Op(Kr/N)\|\sum_{i=1}^{N}A_{i}/N\|_{2}=O_{p}(\sqrt{K_{r}/N}). As for the second term on the right hand side, let ξ=(ξ1,,ξN)\xi=(\xi_{1},\cdots,\xi_{N}) where ξi=𝟣Ri=1d{Oddsr(Lri)Oddsr(Lri;αKr)}\xi_{i}=\mathsf{1}_{R_{i}=1_{d}}\{\mathrm{Odds}^{r}(L^{r}_{i})-\mathrm{Odds}^{r}(L^{r}_{i};\alpha_{K_{r}}^{*})\} and B=(B1,,BN)B=(B_{1},\cdots,B_{N}) where Bi=𝟣Ri=1dΦr(Lri)B_{i}=\mathsf{1}_{R_{i}=1_{d}}\Phi^{r}(L^{r}_{i}). Then,

1Ni=1NξiBi22=1N2ξBBξ=1Nξ{1Ni=1N𝟣Ri=1dΦr(Lri)Φr(Lri)}ξ.\displaystyle\left\|\frac{1}{N}\sum_{i=1}^{N}\xi_{i}B_{i}\right\|_{2}^{2}=\frac{1}{N^{2}}\xi^{\intercal}BB^{\intercal}\xi=\frac{1}{N}\xi^{\intercal}\left\{\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\Phi^{r}(L^{r}_{i})\Phi^{r}(L^{r}_{i})^{\intercal}\right\}\xi\ .

Following arguments similar to those in the proof of Lemma 4, it’s easy to see that

λmax{1Ni=1N𝟣Ri=1dΦr(Lri)Φr(Lri)}λmax+op(1).\displaystyle\lambda_{\max}\left\{\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\Phi^{r}(L^{r}_{i})\Phi^{r}(L^{r}_{i})^{\intercal}\right\}\leq\lambda_{\max}^{*}+o_{p}(1)\ .

By Assumption 2.D, |ξi|C1Krμ1|\xi_{i}|\leq C_{1}K_{r}^{-\mu_{1}}. Thus, 1Ni=1NξiBi2=Op(Krμ1)\|\frac{1}{N}\sum_{i=1}^{N}\xi_{i}B_{i}\|_{2}=O_{p}(K_{r}^{-\mu_{1}}) and

rN(αr)αr|αKr2=Op(KrN+Krμ1).\displaystyle\left\|\left.\frac{\partial\mathcal{L}^{r}_{N}(\alpha^{r})}{\partial\alpha^{r}}\right|_{\alpha_{K_{r}}^{*}}\right\|_{2}=O_{p}\left(\sqrt{\frac{K_{r}}{N}}+K_{r}^{-\mu_{1}}\right)\ .
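The first term above is handled by the elementary fact that an average of i.i.d. mean-zero random vectors whose squared norm has expectation of order K_r has Euclidean norm of order (K_r/N)^{1/2}, by Markov's inequality. A quick numerical illustration (hypothetical Gaussian vectors, not the actual terms A_i):

```python
# Illustration only: ||N^{-1} sum_i A_i||_2 is of order sqrt(K/N) for i.i.d. mean-zero
# vectors A_i in R^K with E||A_i||_2^2 = K (here, standard Gaussian rows).
import numpy as np

rng = np.random.default_rng(1)
for N, K in [(500, 5), (2000, 10), (8000, 20)]:
    A = rng.normal(size=(N, K))
    avg_norm = np.linalg.norm(A.mean(axis=0))
    print(f"N={N:5d} K={K:2d}  ||mean||_2={avg_norm:.4f}  sqrt(K/N)={np.sqrt(K / N):.4f}")
```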

Lemma 6.

Under Assumptions 1–3, for any missing pattern rr,

supθΘ|NSθ,1r|=op(1).\sup_{\theta\in\Theta}|\sqrt{N}S_{\theta,1}^{r}|=o_{p}(1).
Proof.

Consider the following empirical process.

𝔾N(fθ,1)=N[1Ni=1Nfθ,1(Li,Ri)𝔼{fθ,1(L,R)}]\displaystyle\mathbb{G}_{N}(f_{\theta,1})=\sqrt{N}\left[\frac{1}{N}\sum_{i=1}^{N}f_{\theta,1}(L_{i},R_{i})-\mathbb{E}\left\{f_{\theta,1}(L,R)\right\}\right]

where fθ,1(L,R)=𝟣R=1d{O(Lr)Oddsr(Lr)}{ψθ(L)urθ(Lr)}f_{\theta,1}(L,R)=\mathsf{1}_{R=1_{d}}\{O(L^{r})-\mathrm{Odds}^{r}(L^{r})\}\{\psi_{\theta}(L)-u^{r}_{\theta}(L^{r})\} and OO is an arbitrary function, which can be viewed as an estimator of the true propensity odds Oddsr\mathrm{Odds}^{r}. By Theorem 2, for any γ>0\gamma>0, there exist constants Cγ>0C_{\gamma}>0 and Nγ>0N_{\gamma}>0 such that for any NNγN\geq N_{\gamma},

P{Oddsr(;α^r)OddsrCγ(Kr2Nr+Kr12μ1)}γ.\displaystyle P\left\{\left\|\mathrm{Odds}^{r}(\ \cdot\ ;\hat{\alpha}^{r})-\mathrm{Odds}^{r}\right\|_{\infty}\geq C_{\gamma}\left(\sqrt{\frac{K_{r}^{2}}{N_{r}}}+K_{r}^{\frac{1}{2}-\mu_{1}}\right)\right\}\leq\gamma.

Let δ1=Cγ(Kr2/Nr+Kr1/2μ1)\delta_{1}=C_{\gamma}(\sqrt{K_{r}^{2}/N_{r}}+K_{r}^{1/2-\mu_{1}}) and consider the set of functions

1={fθ,1:OOddsrδ1,θΘ}.\displaystyle\mathcal{F}_{1}=\left\{f_{\theta,1}:\left\|O-\mathrm{Odds}^{r}\right\|_{\infty}\leq\delta_{1},\theta\in\Theta\right\}\ .

By Assumption 1.C, for any fθ,11f_{\theta,1}\in\mathcal{F}_{1},

𝔼{fθ,1(L,R)}\displaystyle\mathbb{E}\left\{f_{\theta,1}(L,R)\right\} =𝔼[𝔼{fθ,1(L,R)Lr,R}]\displaystyle=\mathbb{E}\left[\mathbb{E}\left\{f_{\theta,1}(L,R)\mid L^{r},R\right\}\right]
=𝔼[𝟣R=1d𝔼{fθ,1(L,R)Lr,R=1d}]=0.\displaystyle=\mathbb{E}\left[\mathsf{1}_{R=1_{d}}\mathbb{E}\left\{f_{\theta,1}(L,R)\mid L^{r},R=1_{d}\right\}\right]=0\ .

Define f^θ,1(L,R):=𝟣R=1d{Oddsr(Lr;α^r)Oddsr(Lr)}{ψθ(L)urθ(Lr)}\hat{f}_{\theta,1}(L,R):=\mathsf{1}_{R=1_{d}}\{\mathrm{Odds}^{r}(L^{r};\hat{\alpha}^{r})-\mathrm{Odds}^{r}(L^{r})\}\{\psi_{\theta}(L)-u^{r}_{\theta}(L^{r})\}. To simplify notation, for vectors, A>BA>B means that Aj>BjA_{j}>B_{j} for each entry, and A>cA>c means that Aj>cA_{j}>c for each entry, where cc is a constant.

Notice that supθΘ|NSθ,1r|=supθΘ|𝔾N(f^θ,1)|\sup_{\theta\in\Theta}|\sqrt{N}S_{\theta,1}^{r}|=\sup_{\theta\in\Theta}|\mathbb{G}_{N}(\hat{f}_{\theta,1})|. Thus,

1γP(f^θ,11)P(supθΘ|NSθ,1r|supfθ,11|𝔾N(fθ,1)|).\displaystyle 1-\gamma\leq P\left(\hat{f}_{\theta,1}\in\mathcal{F}_{1}\right)\leq P\left(\underset{\theta\in\Theta}{\sup}\left|\sqrt{N}S_{\theta,1}^{r}\right|\leq\underset{f_{\theta,1}\in\mathcal{F}_{1}}{\sup}|\mathbb{G}_{N}(f_{\theta,1})|\right)\ .

By Markov’s inequality, for any ξ>0\xi>0, we have

P(supfθ,11|𝔾N(fθ,1)|1ξ𝔼supfθ,11|𝔾N(fθ,1)|)ξ.\displaystyle P\left(\underset{f_{\theta,1}\in\mathcal{F}_{1}}{\sup}|\mathbb{G}_{N}(f_{\theta,1})|\geq\frac{1}{\xi}\mathbb{E}\underset{f_{\theta,1}\in\mathcal{F}_{1}}{\sup}\left|\mathbb{G}_{N}(f_{\theta,1})\right|\right)\leq\xi\ .

If we can show 𝔼supfθ,11|𝔾N(fθ,1)|=op(1)\mathbb{E}\sup_{f_{\theta,1}\in\mathcal{F}_{1}}|\mathbb{G}_{N}(f_{\theta,1})|=o_{p}(1), then for any η>0\eta>0 and fixed ξ>0\xi>0, there exist Nξ,ηN_{\xi,\eta} and σξ,η\sigma_{\xi,\eta} such that for any NNξ,ηN\geq N_{\xi,\eta},

P(1ξ𝔼supfθ,11|𝔾N(fθ,1)|σξ,η)η.\displaystyle P\left(\frac{1}{\xi}\mathbb{E}\underset{f_{\theta,1}\in\mathcal{F}_{1}}{\sup}\left|\mathbb{G}_{N}(f_{\theta,1})\right|\geq\sigma_{\xi,\eta}\right)\leq\eta\ .

Then, for any ϵ>0\epsilon>0, by taking γ=ξ=η=ϵ3\gamma=\xi=\eta=\frac{\epsilon}{3} and appropriately choosing CγC_{\gamma}, NγN_{\gamma}, Nξ,ηN_{\xi,\eta} and σξ,η\sigma_{\xi,\eta}, we have the above inequalities and for any NNϵ=max{Nγ,Nξ,η}N\geq N_{\epsilon}=\max\{N_{\gamma},N_{\xi,\eta}\},

P(supθΘ|NSθ,1r|σξ,η)γ+ξ+η=ϵ.\displaystyle P\left(\underset{\theta\in\Theta}{\sup}\left|\sqrt{N}S_{\theta,1}^{r}\right|\geq\sigma_{\xi,\eta}\right)\leq\gamma+\xi+\eta=\epsilon\ .

That is, supθΘ|NSθ,1r|=op(1)\sup_{\theta\in\Theta}|\sqrt{N}S_{\theta,1}^{r}|=o_{p}(1).

To show 𝔼supfθ,11|𝔾N(fθ,1)|=op(1)\mathbb{E}\sup_{f_{\theta,1}\in\mathcal{F}_{1}}|\mathbb{G}_{N}(f_{\theta,1})|=o_{p}(1), we utilize the maximal inequality with bracketing (Corollary 19.35 in [29]). Define the envelope function F1(L):=supθΘ|ψθ(L)urθ(Lr)|δ1F_{1}(L):=\sup_{\theta\in\Theta}|\psi_{\theta}(L)-u^{r}_{\theta}(L^{r})|\delta_{1}. It’s easy to see that |fθ,1(L,R)|F1(L)|f_{\theta,1}(L,R)|\leq F_{1}(L) for any fθ,11f_{\theta,1}\in\mathcal{F}_{1}. Besides, due to Assumption 3.D, for each entry jj,

F1,jP,2=F1,j(L)2dP(L)=𝔼[sup𝜃{ψθ,j(L)uθ,jr(Lr)}2δ12]C3δ1.\displaystyle\|F_{1,j}\|_{P,2}=\sqrt{\int F_{1,j}(L)^{2}dP(L)}=\sqrt{\mathbb{E}\left[\underset{\theta}{\sup}\left\{\psi_{\theta,j}(L)-u_{\theta,j}^{r}(L^{r})\right\}^{2}\delta_{1}^{2}\right]}\leq C_{3}\delta_{1}\ .

To save notation, ψθ\psi_{\theta} and uθru_{\theta}^{r} are used to denote their jj-th entries. We also omit the subscript “jj” from some sets of functions; the related inequalities hold for each entry jj.

By the maximal inequality,

𝔼supfθ,11|𝔾N(fθ,1)|=Op(J[]{C3δ1,1,L2(P)}).\displaystyle\mathbb{E}\underset{f_{\theta,1}\in\mathcal{F}_{1}}{\sup}\left|\mathbb{G}_{N}(f_{\theta,1})\right|=O_{p}\left(J_{[\ ]}\{C_{3}\delta_{1},\mathcal{F}_{1},L_{2}(P)\}\right)\ .

To study the entropy integral of 1\mathcal{F}_{1}, we split function fθ,1f_{\theta,1} into two parts and consider two sets of functions 𝒢1={g1:g1δ1}\mathcal{G}_{1}=\{g_{1}:\|g_{1}\|_{\infty}\leq\delta_{1}\} where g1(L)=O(Lr)Oddsr(Lr)g_{1}(L)=O(L^{r})-\mathrm{Odds}^{r}(L^{r}) and 1={hθ,1:θΘ}\mathcal{H}_{1}=\{h_{\theta,1}:\theta\in\Theta\} where hθ,1(L)=ψθ(L)urθ(Lr)h_{\theta,1}(L)=\psi_{\theta}(L)-u^{r}_{\theta}(L^{r}). Notice that g1δ1\|g_{1}\|_{\infty}\leq\delta_{1}, hθ,1(L)P,2C3\|h_{\theta,1}(L)\|_{P,2}\leq C_{3} and δ11\delta_{1}\leq 1 when NN is large enough. By Lemma 12,

n[]{4(C3+1)ϵ,1,L2(P)}n[]{ϵ,𝒢1,L}n[]{ϵ,1,L2(P)}.\displaystyle n_{[\ ]}\left\{4\left(C_{3}+1\right)\epsilon,\mathcal{F}_{1},L_{2}(P)\right\}\leq n_{[\ ]}\{\epsilon,\mathcal{G}_{1},L^{\infty}\}n_{[\ ]}\{\epsilon,\mathcal{H}_{1},L_{2}(P)\}\ .

Define 𝒢~1:={g1:g1C}\tilde{\mathcal{G}}_{1}:=\{g_{1}:\|g_{1}\|_{\infty}\leq C\} for some constant CC and 𝒪:=𝒢~1+Oddsr={O:OOddsrC}\mathcal{O}:=\tilde{\mathcal{G}}_{1}+\mathrm{Odds}^{r}=\{O:\|O-\mathrm{Odds}^{r}\|_{\infty}\leq C\}. It is obvious that 𝒢1=(δ1/C)𝒢~1\mathcal{G}_{1}=(\delta_{1}/C)\tilde{\mathcal{G}}_{1}. Since Oddsr\mathrm{Odds}^{r} is a fixed function,

n[]{ϵ,𝒢1,L}=n[]{ϵ,δ1/C𝒢~1,L}\displaystyle\quad\ n_{[\ ]}\left\{\epsilon,\mathcal{G}_{1},L^{\infty}\right\}=n_{[\ ]}\left\{\epsilon,\delta_{1}/C\tilde{\mathcal{G}}_{1},L^{\infty}\right\}
=n[]{Cϵ/δ1,𝒢~1,L}=n[]{Cϵ/δ1,𝒪,L}.\displaystyle=n_{[\ ]}\left\{C\epsilon/\delta_{1},\tilde{\mathcal{G}}_{1},L^{\infty}\right\}=n_{[\ ]}\left\{C\epsilon/\delta_{1},\mathcal{O},L^{\infty}\right\}\ .

The true propensity odds Oddsr\mathrm{Odds}^{r} is unknown, but its roughness is controlled by Assumption 3.B, so we need not consider functions that are much rougher. In other words, our models for the propensity odds should satisfy a similar smoothness condition: there exists an appropriate constant C𝒪C_{\mathcal{O}} such that, taking C=C𝒪C=C_{\mathcal{O}}, 𝒪r\mathcal{O}\subset\mathcal{M}^{r}. Thus,

n[]{ϵ,𝒪,L}n[]{ϵ,r,L}.\displaystyle n_{[\ ]}\{\epsilon,\mathcal{O},L^{\infty}\}\leq n_{[\ ]}\{\epsilon,\mathcal{M}^{r},L^{\infty}\}\ .

Define a set of functions 𝒰r={urθ:θΘ}\mathcal{U}^{r}=\{u^{r}_{\theta}:\theta\in\Theta\}. Notice that 1Ψ𝒰r\mathcal{H}_{1}\subset\Psi-\mathcal{U}^{r}. By Lemma 13, Assumption 3.E, and Lemma 14,

n[]{2ϵ,1,L2(P)}n[]{ϵ,,L2(P)}n[]{ϵ,𝒰r,L2(P)}n[]{ϵ,,L2(P)}2.\displaystyle n_{[\ ]}\{2\epsilon,\mathcal{H}_{1},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}n_{[\ ]}\{\epsilon,\mathcal{U}^{r},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}^{2}\ .

Combining the above inequalities and recalling Assumption 3.B and Assumption 3.C,

J[]{F1P,2,1,L2(P)}\displaystyle J_{[\ ]}\{\|F_{1}\|_{P,2},\mathcal{F}_{1},L_{2}(P)\} 0C3δ1logn[]{C𝒪ϵ4(C3+1)δ1,r,L}dϵ\displaystyle\leq\int_{0}^{C_{3}\delta_{1}}\sqrt{\log n_{[\ ]}\left\{\frac{C_{\mathcal{O}}\epsilon}{4(C_{3}+1)\delta_{1}},\mathcal{M}^{r},L^{\infty}\right\}}d\epsilon
+20C3δ1logn[]{ϵ8(C3+1)δ1,,L2(P)}dϵ\displaystyle\quad\ +\sqrt{2}\int_{0}^{C_{3}\delta_{1}}\sqrt{\log n_{[\ ]}\left\{\frac{\epsilon}{8(C_{3}+1)\delta_{1}},\mathcal{H},L_{2}(P)\right\}}d\epsilon
C0C3δ1{4(C3+1)δ1/(C𝒪ϵ)}12ddϵ\displaystyle\leq\sqrt{C_{\mathcal{M}}}\int_{0}^{C_{3}\delta_{1}}\{4(C_{3}+1)\delta_{1}/(C_{\mathcal{O}}\epsilon)\}^{\frac{1}{2d_{\mathcal{M}}}}d\epsilon
+2C0C3δ1{8(C3+1)δ1/ϵ}12ddϵ\displaystyle\quad\ +\sqrt{2C_{\mathcal{H}}}\int_{0}^{C_{3}\delta_{1}}\{8(C_{3}+1)\delta_{1}/\epsilon\}^{\frac{1}{2d_{\mathcal{H}}}}d\epsilon
=C{4(C3+1)/C𝒪}12dC3112dδ1\displaystyle=\sqrt{C_{\mathcal{M}}}\{4(C_{3}+1)/C_{\mathcal{O}}\}^{\frac{1}{2d_{\mathcal{M}}}}C_{3}^{1-\frac{1}{2d_{\mathcal{M}}}}\delta_{1}
+2C{8(C3+1)}12dC3112dδ1\displaystyle\quad\ +\sqrt{2C_{\mathcal{H}}}\{8(C_{3}+1)\}^{\frac{1}{2d_{\mathcal{H}}}}C_{3}^{1-\frac{1}{2d_{\mathcal{H}}}}\delta_{1}
0\displaystyle\to 0

since d,d>1/2d_{\mathcal{M}},d_{\mathcal{H}}>1/2 and δ10\delta_{1}\to 0 as NN\to\infty. Therefore, 𝔼supfθ,11|𝔾N(fθ,1)|=Op(op(1))=op(1)\mathbb{E}\sup_{f_{\theta,1}\in\mathcal{F}_{1}}|\mathbb{G}_{N}(f_{\theta,1})|=O_{p}(o_{p}(1))=o_{p}(1) and supθΘ|NSθ,1r|=op(1)\sup_{\theta\in\Theta}|\sqrt{N}S_{\theta,1}^{r}|=o_{p}(1). ∎
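The final step reduces to the elementary integral ∫_0^a (b/ε)^{1/(2d)} dε = b^{1/(2d)} a^{1−1/(2d)}/(1 − 1/(2d)), which tends to zero with a whenever d > 1/2; here a is proportional to δ_1 and d stands for the entropy exponents above. A small numerical check of this calculation (illustration only, with hypothetical constants b and d):

```python
# Illustration only: the entropy-integral calculation that closes the proof of Lemma 6.
import numpy as np

def entropy_integral(a, b, d):
    """Closed form of int_0^a (b/eps)^(1/(2d)) d eps, finite for d > 1/2."""
    p = 1.0 / (2.0 * d)
    return b**p * a**(1.0 - p) / (1.0 - p)

b, d = 2.0, 1.5                                   # hypothetical constants, d > 1/2
for a in [1.0, 0.1, 0.01, 0.001]:
    h = a / 20_000
    mid = np.arange(20_000) * h + h / 2           # midpoint rule on (0, a]
    numeric = np.sum((b / mid) ** (1.0 / (2.0 * d)) * h)
    print(f"a={a:<6} closed form={entropy_integral(a, b, d):.5f}  midpoint rule={numeric:.5f}")
```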

Lemma 7.

Under Assumptions 1–3, for any missing pattern rr, supθΘ|NSθ,2r|=op(1)\sup_{\theta\in\Theta}|\sqrt{N}S_{\theta,2}^{r}|=o_{p}(1).

Proof.

Consider the following empirical process.

𝔾N(fθ,2)=N[1Ni=1Nfθ,2(Li,Ri)𝔼{fθ,2(L,R)}]\displaystyle\mathbb{G}_{N}(f_{\theta,2})=\sqrt{N}\left[\frac{1}{N}\sum_{i=1}^{N}f_{\theta,2}(L_{i},R_{i})-\mathbb{E}\left\{f_{\theta,2}(L,R)\right\}\right]

where fθ,2(L,R)={𝟣R=1dO(Lr)𝟣R=r}{urθ(Lr)U(Lr)}f_{\theta,2}(L,R)=\{\mathsf{1}_{R=1_{d}}O(L^{r})-\mathsf{1}_{R=r}\}\{u^{r}_{\theta}(L^{r})-U(L^{r})\} and OO and UU are arbitrary functions. By Theorem 2, for any γ>0\gamma>0, there exist constants Cγ>0C_{\gamma}>0 and Nγ>0N_{\gamma}>0 such that for any NNγN\geq N_{\gamma},

P{Oddsr(;α^r)OddsrP,2Cγ(KrNr+Krμ1)}γ.\displaystyle P\left\{\left\|\mathrm{Odds}^{r}(\ \cdot\ ;\hat{\alpha}^{r})-\mathrm{Odds}^{r}\right\|_{P,2}\geq C_{\gamma}\left(\sqrt{\frac{K_{r}}{N_{r}}}+K_{r}^{-\mu_{1}}\right)\right\}\leq\gamma\ .

Besides, by Assumption 3.A, suplrdomr|urθ(lr)Φr(lr)βθr|C2Krμ2\sup_{l^{r}\in\mathrm{dom}^{r}}|u^{r}_{\theta}(l^{r})-\Phi^{r}(l^{r})^{\intercal}\beta_{\theta}^{r}|\leq C_{2}K_{r}^{-\mu_{2}}. So, we consider the set of functions

2={fθ,2:OOddsrP,2δ1,urθUδ2,θΘ}\displaystyle\mathcal{F}_{2}=\left\{f_{\theta,2}:\left\|O-\mathrm{Odds}^{r}\right\|_{P,2}\leq\delta_{1}^{\prime},\|u^{r}_{\theta}-U\|_{\infty}\leq\delta_{2},\theta\in\Theta\right\}

where δ1=Cγ(Kr/Nr+Krμ1)\delta_{1}^{\prime}=C_{\gamma}(\sqrt{K_{r}/N_{r}}+K_{r}^{-\mu_{1}}) and δ2=C2Krμ2\delta_{2}=C_{2}K_{r}^{-\mu_{2}}. Then, for any fθ,22f_{\theta,2}\in\mathcal{F}_{2},

𝔼{fθ,2(L,R)}\displaystyle\mathbb{E}\left\{f_{\theta,2}(L,R)\right\} =𝔼[{𝟣R=1dOddsr(Lr)𝟣R=r}{urθ(Lr)U(Lr)}]\displaystyle=\mathbb{E}\left[\left\{\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r})-\mathsf{1}_{R=r}\right\}\left\{u^{r}_{\theta}(L^{r})-U(L^{r})\right\}\right]
+𝔼[𝟣R=1d{O(Lr)Oddsr(Lr)}{urθ(Lr)U(Lr)}]\displaystyle\quad\ +\mathbb{E}\left[\mathsf{1}_{R=1_{d}}\left\{O(L^{r})-\mathrm{Odds}^{r}(L^{r})\right\}\left\{u^{r}_{\theta}(L^{r})-U(L^{r})\right\}\right]
0+OOddsrP,2urθUP,2\displaystyle\leq 0+\left\|O-\mathrm{Odds}^{r}\right\|_{P,2}\|u^{r}_{\theta}-U\|_{P,2}
δ1δ2=C2Cγ(Kr12μ2Nr+Krμ1μ2)=op(N12).\displaystyle\leq\delta_{1}^{\prime}\delta_{2}=C_{2}C_{\gamma}\left(\frac{K_{r}^{\frac{1}{2}-\mu_{2}}}{\sqrt{N_{r}}}+K_{r}^{-\mu_{1}-\mu_{2}}\right)=o_{p}(N^{-\frac{1}{2}})\ .

The last line holds because P,2\|\cdot\|_{P,2}\leq\|\cdot\|_{\infty}, together with Assumption 3.A and Assumption 3.E. Plug in our estimator and define f^θ,2(L,R):={𝟣R=1dOddsr(Lr;α^r)𝟣R=r}{urθ(Lr)Φr(Lr)βθr}\hat{f}_{\theta,2}(L,R):=\{\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r};\hat{\alpha}^{r})-\mathsf{1}_{R=r}\}\{u^{r}_{\theta}(L^{r})-\Phi^{r}(L^{r})^{\intercal}\beta_{\theta}^{r}\}. Then, supθΘ|NSθ,2r|supθΘ|𝔾N(f^θ,2)|+Nδ1δ2\sup_{\theta\in\Theta}|\sqrt{N}S_{\theta,2}^{r}|\leq\sup_{\theta\in\Theta}|\mathbb{G}_{N}(\hat{f}_{\theta,2})|+\sqrt{N}\delta_{1}^{\prime}\delta_{2} and

P(supθΘ|NSθ,2r|>supfθ,22|𝔾N(fθ,2)|+Nδ1δ2)P(f^θ,22)γ.\displaystyle P\left(\underset{\theta\in\Theta}{\sup}\left|\sqrt{N}S_{\theta,2}^{r}\right|>\underset{f_{\theta,2}\in\mathcal{F}_{2}}{\sup}|\mathbb{G}_{N}(f_{\theta,2})|+\sqrt{N}\delta_{1}^{\prime}\delta_{2}\right)\leq P\left(\hat{f}_{\theta,2}\notin\mathcal{F}_{2}\right)\leq\gamma\ .

Similarly, we need to show 𝔼supfθ,22|𝔾N(fθ,2)|=op(1)\mathbb{E}\sup_{f_{\theta,2}\in\mathcal{F}_{2}}|\mathbb{G}_{N}(f_{\theta,2})|=o_{p}(1). Define the envelope function F2:=(C0+1)δ2F_{2}:=(C_{0}+1)\delta_{2}. It’s easy to see that |fθ,2(L,R)|F2|f_{\theta,2}(L,R)|\leq F_{2} for any fθ,22f_{\theta,2}\in\mathcal{F}_{2} when NN is large enough. By the maximal inequality with bracketing,

𝔼supfθ,22|𝔾N(fθ,2)|=Op(J[]{F2P,2,2,L2(P)}).\displaystyle\mathbb{E}\underset{f_{\theta,2}\in\mathcal{F}_{2}}{\sup}\left|\mathbb{G}_{N}(f_{\theta,2})\right|=O_{p}\left(J_{[\ ]}\{\|F_{2}\|_{P,2},\mathcal{F}_{2},L_{2}(P)\}\right)\ .

To study the entropy integral of 2\mathcal{F}_{2}, we first compare it with 2={fθ,2:OOddsrP,2δ1,urθUP,2δ2,θΘ}\mathcal{F}_{2}^{\prime}=\{f_{\theta,2}:\|O-\mathrm{Odds}^{r}\|_{P,2}\leq\delta_{1}^{\prime},\|u^{r}_{\theta}-U\|_{P,2}\leq\delta_{2},\theta\in\Theta\}. It is apparent 22\mathcal{F}_{2}\subset\mathcal{F}_{2}^{\prime}. Then, we split function fθ,2f_{\theta,2} into two parts and consider two sets of functions 𝒢2={g2:OOddsrP,2δ1}\mathcal{G}_{2}=\{g_{2}:\|O-\mathrm{Odds}^{r}\|_{P,2}\leq\delta_{1}^{\prime}\} where g2(L,R)=𝟣R=1dO(Lr)𝟣R=rg_{2}(L,R)=\mathsf{1}_{R=1_{d}}O(L^{r})-\mathsf{1}_{R=r} and 2={hθ,2:hθ,2P,2δ2,θΘ}\mathcal{H}_{2}=\{h_{\theta,2}:\|h_{\theta,2}\|_{P,2}\leq\delta_{2},\theta\in\Theta\} where hθ,2(L)=urθ(Lr)U(Lr)h_{\theta,2}(L)=u^{r}_{\theta}(L^{r})-U(L^{r}). Notice that g2P,2C0+1\|g_{2}\|_{P,2}\leq C_{0}+1 and hθ,2P,2δ21\|h_{\theta,2}\|_{P,2}\leq\delta_{2}\leq 1 when NN is large enough. By Lemma 12,

n[]{4(C0+2)ϵ,2,L2(P)}n[]{ϵ,𝒢2,L2(P)}n[]{ϵ,2,L2(P)}.\displaystyle n_{[\ ]}\{4(C_{0}+2)\epsilon,\mathcal{F}_{2},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{G}_{2},L_{2}(P)\}n_{[\ ]}\{\epsilon,\mathcal{H}_{2},L_{2}(P)\}\ .

Notice that 𝒢2+(𝟣R=r𝟣R=1dOddsr)=𝟣R=1d𝒢1\mathcal{G}_{2}+(\mathsf{1}_{R=r}-\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r})=\mathsf{1}_{R=1_{d}}\mathcal{G}_{1}. Since 𝟣R=r𝟣R=1dOddsr\mathsf{1}_{R=r}-\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r} is a fixed function, and 𝟣R=1d1\|\mathsf{1}_{R=1_{d}}\|_{\infty}\leq 1, by Lemma 15,

n[]{ϵ,𝒢2,L2(P)}=n[]{ϵ,𝟣R=1d𝒢1,L2(P)}n[]{ϵ,𝒢1,L2(P)}.\displaystyle n_{[\ ]}\{\epsilon,\mathcal{G}_{2},L_{2}(P)\}=n_{[\ ]}\{\epsilon,\mathsf{1}_{R=1_{d}}\mathcal{G}_{1},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{G}_{1},L_{2}(P)\}\ .

It is obvious that any ϵ\epsilon-brackets equipped with \|\cdot\|_{\infty} norm are also ϵ\epsilon-brackets in L2(P)L_{2}(P). With similar arguments in the proof of Lemma 6, we have

n[]{ϵ,𝒢1,L2(P)}n[]{ϵ,𝒢1,L}n[]{C𝒪ϵ/δ1,r,L}.\displaystyle n_{[\ ]}\{\epsilon,\mathcal{G}_{1},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{G}_{1},L^{\infty}\}\leq n_{[\ ]}\left\{C_{\mathcal{O}}\epsilon/\delta_{1}^{\prime},\mathcal{M}^{r},L^{\infty}\right\}\ .

Define a set of functions ~2={hθ,2:hθ,2P,2C,θΘ}\tilde{\mathcal{H}}_{2}=\{h_{\theta,2}:\|h_{\theta,2}\|_{P,2}\leq C,\theta\in\Theta\}. Similarly,

n[]{ϵ,2,L2(P)}=n[]{ϵ,δ2/C~2,L2(P)}=n[]{Cϵ/δ2,~2,L2(P)}.\displaystyle n_{[\ ]}\{\epsilon,\mathcal{H}_{2},L_{2}(P)\}=n_{[\ ]}\left\{\epsilon,\delta_{2}/C\tilde{\mathcal{H}}_{2},L_{2}(P)\right\}=n_{[\ ]}\left\{C\epsilon/\delta_{2},\tilde{\mathcal{H}}_{2},L_{2}(P)\right\}\ .

Similarly, we split ~2\tilde{\mathcal{H}}_{2} into two parts. Define a set of functions 𝒰^r={U:urθ𝒰rs.t.urθUC}\hat{\mathcal{U}}^{r}=\{U:\exists u^{r}_{\theta}\in\mathcal{U}^{r}\ s.t.\ \|u^{r}_{\theta}-U\|_{\infty}\leq C\} where 𝒰r={urθ:θΘ}\mathcal{U}^{r}=\{u^{r}_{\theta}:\theta\in\Theta\}. By Lemma 13,

n[]{2ϵ,~2,L2(P)}n[]{ϵ,𝒰r,L2(P)}n[]{ϵ,𝒰^r,L2(P)}.\displaystyle n_{[\ ]}\{2\epsilon,\tilde{\mathcal{H}}_{2},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{U}^{r},L_{2}(P)\}n_{[\ ]}\{\epsilon,\hat{\mathcal{U}}^{r},L_{2}(P)\}\ .

Also define a set of functions 𝔼r:={gr(lr):=𝔼{f(L)Lr=lr,R=r},f}\mathbb{E}\mathcal{H}^{r}:=\{g^{r}(l^{r}):=\mathbb{E}\{f(L)\mid L^{r}=l^{r},R=r\},f\in\mathcal{H}\}. Although the set 𝒰r\mathcal{U}^{r} is unknown, we need not consider functions that are much rougher than those in 𝔼r\mathbb{E}\mathcal{H}^{r}. Therefore, there exists an appropriate constant C𝒰^rC_{\hat{\mathcal{U}}^{r}} such that, taking C=C𝒰^rC=C_{\hat{\mathcal{U}}^{r}}, 𝒰^r𝔼r\hat{\mathcal{U}}^{r}\subset\mathbb{E}\mathcal{H}^{r}. Thus, by Lemma 14,

n[]{ϵ,𝒰^r,L2(P)}n[]{ϵ,𝔼r,L2(P)}n[]{ϵ,,L2(P)}.\displaystyle n_{[\ ]}\{\epsilon,\hat{\mathcal{U}}^{r},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathbb{E}\mathcal{H}^{r},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}\ .

By Assumption 3.B, Assumption 3.C, and the above inequalities,

J[]{F2P,2,2,L2(P)}\displaystyle\quad\ J_{[\ ]}\{\|F_{2}\|_{P,2},\mathcal{F}_{2},L_{2}(P)\}
0(C0+1)δ2logn[]{C𝒪ϵ4(C0+2)δ1,r,L2(P)}dϵ\displaystyle\leq\int_{0}^{(C_{0}+1)\delta_{2}}\sqrt{\log n_{[\ ]}\left\{\frac{C_{\mathcal{O}}\epsilon}{4(C_{0}+2)\delta_{1}^{\prime}},\mathcal{M}^{r},L_{2}(P)\right\}}d\epsilon
+20(C0+1)δ2logn[]{C𝒰^rϵ8(C0+2)δ2,,L2(P)}dϵ\displaystyle\quad\ +\sqrt{2}\int_{0}^{(C_{0}+1)\delta_{2}}\sqrt{\log n_{[\ ]}\left\{\frac{C_{\hat{\mathcal{U}}^{r}}\epsilon}{8(C_{0}+2)\delta_{2}},\mathcal{H},L_{2}(P)\right\}}d\epsilon
C{4(C0+2)δ1/C𝒪}12d{(C0+1)δ2}112d\displaystyle\leq\sqrt{C_{\mathcal{M}}}\{4(C_{0}+2)\delta_{1}^{\prime}/C_{\mathcal{O}}\}^{\frac{1}{2d_{\mathcal{M}}}}\{(C_{0}+1)\delta_{2}\}^{1-\frac{1}{2d_{\mathcal{M}}}}
+2C{8(C0+2)/C𝒰^r}12d(C0+1)112dδ2\displaystyle\quad\ +\sqrt{2C_{\mathcal{H}}}\{8(C_{0}+2)/C_{\hat{\mathcal{U}}^{r}}\}^{\frac{1}{2d_{\mathcal{H}}}}(C_{0}+1)^{1-\frac{1}{2d_{\mathcal{H}}}}\delta_{2}
0\displaystyle\to 0

since d,d>1/2d_{\mathcal{M}},d_{\mathcal{H}}>1/2 and δ1,δ20\delta_{1}^{\prime},\delta_{2}\to 0 as NN\to\infty. So, 𝔼supfθ,22|𝔾N(fθ,2)|=Op(op(1))=op(1)\mathbb{E}\sup_{f_{\theta,2}\in\mathcal{F}_{2}}|\mathbb{G}_{N}(f_{\theta,2})|=O_{p}(o_{p}(1))=o_{p}(1) and supθΘ|NSθ,2r|=op(1)\sup_{\theta\in\Theta}|\sqrt{N}S_{\theta,2}^{r}|=o_{p}(1). ∎

Lemma 8.

Under Assumptions 1–3, for any missing pattern rr,

supθΘ|NSθ,3r|=op(1).\sup_{\theta\in\Theta}|\sqrt{N}S_{\theta,3}^{r}|=o_{p}(1).
Proof.

Notice that Sθ,3rS_{\theta,3}^{r} is related to the balancing error:

supθΘ|NSθ,3r|\displaystyle\underset{\theta\in\Theta}{\sup}\left|\sqrt{N}S_{\theta,3}^{r}\right| =supθΘ|1Ni=1N{𝟣Ri=1dOddsr(Lri;α^r)𝟣Ri=r}Φr(Lri)βθr|\displaystyle=\underset{\theta\in\Theta}{\sup}\left|\frac{1}{N}\sum_{i=1}^{N}\left\{\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i};\hat{\alpha}^{r})-\mathsf{1}_{R_{i}=r}\right\}\Phi^{r}(L^{r}_{i})^{\intercal}\beta_{\theta}^{r}\right|
λ{γKr+2(1γ)𝙿𝙴𝙽2(Φrα^r)}𝙿𝙴𝙽2(Φrβθr)\displaystyle\leq\lambda\left\{\gamma\sqrt{K_{r}}+2(1-\gamma)\sqrt{\mathtt{PEN}_{2}({\Phi^{r}}^{\intercal}\hat{\alpha}^{r})}\right\}\sqrt{\mathtt{PEN}_{2}({\Phi^{r}}^{\intercal}\beta_{\theta}^{r})}

where Φr(lr)α^r=logOddsr(lr;α^r)\Phi^{r}(l^{r})^{\intercal}\hat{\alpha}^{r}=\log\mathrm{Odds}^{r}(l^{r};{\hat{\alpha}^{r}}) denotes the log transformation of the propensity odds model. For a similar reason, the roughness of the approximating functions is bounded, so both penalty terms above are bounded. Besides, by Assumption 2.D, λ=o(1/KrNr)\lambda=o(1/\sqrt{K_{r}N_{r}}). Thus, supθΘ|NSθ,3r|=op(1)\sup_{\theta\in\Theta}|\sqrt{N}S_{\theta,3}^{r}|=o_{p}(1). ∎
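For concreteness, the quantity controlled in this lemma is the empirical imbalance of the basis functions when the complete cases are weighted by the estimated odds; S^r_{θ,3} is the inner product of this imbalance vector with β^r_θ. The sketch below computes such an imbalance vector on simulated data; the basis, the fitted odds, and the pattern indicators are hypothetical stand-ins rather than output of the proposed estimator.

```python
# Illustration only (hypothetical data): the empirical imbalance vector
#   (1/N) sum_i { 1{R_i = 1_d} * Odds_hat(L_i^r) - 1{R_i = r} } Phi^r(L_i^r)
# appearing in the proof of Lemma 8.
import numpy as np

rng = np.random.default_rng(2)
N, K = 1000, 4
x = rng.uniform(size=N)                                   # stand-in for the observed L^r
Phi = np.column_stack([x**k for k in range(K)])           # hypothetical basis Phi^r (polynomials)
complete = rng.random(N) < 0.5                            # indicator of R_i = 1_d
pattern_r = (~complete) & (rng.random(N) < 0.6)           # indicator of R_i = r
odds_hat = np.exp(0.3 - 0.5 * x)                          # hypothetical fitted odds

imbalance = ((complete * odds_hat - pattern_r)[:, None] * Phi).mean(axis=0)
print("imbalance vector:", imbalance)                     # S_{theta,3}^r = imbalance . beta_theta^r
```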

Lemma 9.

Suppose that Assumptions 1–4 hold. Then,

supθΘ|^Nψθ𝔼{ψθ(L)}|=op(1).\displaystyle\underset{\theta\in\Theta}{\sup}\left|\hat{\mathbb{P}}_{N}\psi_{\theta}-\mathbb{E}\{\psi_{\theta}(L)\}\right|=o_{p}(1)\ .
Proof.

By Lemmas 6, 7, and 8, we only need to show supθΘ|Sθ,4r|=op(1)\sup_{\theta\in\Theta}|S_{\theta,4}^{r}|=o_{p}(1) where

Sθ,4r\displaystyle S_{\theta,4}^{r} =1Ni=1N𝟣Ri=1dOddsr(Lri){ψθ(Li)urθ(Lri)}\displaystyle=\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=1_{d}}\mathrm{Odds}^{r}(L^{r}_{i})\left\{\psi_{\theta}(L_{i})-u^{r}_{\theta}(L^{r}_{i})\right\}
+1Ni=1N𝟣Ri=rurθ(Lri)𝔼{𝟣R=rψθ(L)}.\displaystyle\quad\ +\frac{1}{N}\sum_{i=1}^{N}\mathsf{1}_{R_{i}=r}u^{r}_{\theta}(L^{r}_{i})-\mathbb{E}\{\mathsf{1}_{R=r}\psi_{\theta}(L)\}\ .

Consider the following decomposition. Let a={fθ,a:θΘ}\mathcal{F}_{a}=\{f_{\theta,a}:\theta\in\Theta\} where fθ,a(L,R)=𝟣R=rurθ(Lr)f_{\theta,a}(L,R)=\mathsf{1}_{R=r}u^{r}_{\theta}(L^{r}). It’s easy to see that for any ϵ>0\epsilon>0,

n[]{ϵ,a,L2(P)}n[]{ϵ,𝒰r,L2(P)}n[]{ϵ,,L2(P)}<.\displaystyle n_{[\ ]}\{\epsilon,\mathcal{F}_{a},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{U}^{r},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}<\infty\ .

For any measurable function ff, f(L)P,22=𝔼{f(L)2}{𝔼|f(L)|}2=fP,12\|f(L)\|_{P,2}^{2}=\mathbb{E}\{f(L)^{2}\}\geq\{\mathbb{E}|f(L)|\}^{2}=\|f\|_{P,1}^{2}. Thus,

n[]{ϵ,a,L1(P)}n[]{ϵ,a,L2(P)}.\displaystyle n_{[\ ]}\{\epsilon,\mathcal{F}_{a},L_{1}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{F}_{a},L_{2}(P)\}\ .

By Theorem 19.4 in [29], a\mathcal{F}_{a} is Glivenko-Cantelli. Thus,

supθΘ|Nfθ,aPfθ,a|a.s.0.\displaystyle\underset{\theta\in\Theta}{\sup}\left|\mathbb{P}_{N}f_{\theta,a}-Pf_{\theta,a}\right|\xrightarrow{a.s.}0\ .

Also let b={fθ,b:θΘ}\mathcal{F}_{b}=\{f_{\theta,b}:\theta\in\Theta\} where fθ,b(L,R)=𝟣R=1dOddsr(Lr)ψθ(L)f_{\theta,b}(L,R)=\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r})\psi_{\theta}(L) and c={fθ,c:θΘ}\mathcal{F}_{c}=\{f_{\theta,c}:\theta\in\Theta\} where fθ,c(L,R)=𝟣R=1dOddsr(Lr)urθ(Lr)f_{\theta,c}(L,R)=\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r})u^{r}_{\theta}(L^{r}). Similarly,

supθΘ|Nfθ,bPfθ,b|a.s.0 and supθΘ|Nfθ,cPfθ,c|a.s.0.\displaystyle\underset{\theta\in\Theta}{\sup}\left|\mathbb{P}_{N}f_{\theta,b}-Pf_{\theta,b}\right|\xrightarrow{a.s.}0\textrm{ and }\underset{\theta\in\Theta}{\sup}\left|\mathbb{P}_{N}f_{\theta,c}-Pf_{\theta,c}\right|\xrightarrow{a.s.}0\ .

Notice that 𝔼{fθ,b(L,R)}=𝔼{fθ,c(L,R)}\mathbb{E}\{f_{\theta,b}(L,R)\}=\mathbb{E}\{f_{\theta,c}(L,R)\}. Besides, almost sure convergence implies convergence in probability. Thus, supθΘ|Sθ,4r|=op(1)\sup_{\theta\in\Theta}|S_{\theta,4}^{r}|=o_{p}(1). Then,

supθΘ|^Nψθ𝔼{ψθ(L)}|=op(1).\displaystyle\underset{\theta\in\Theta}{\sup}\left|\hat{\mathbb{P}}_{N}\psi_{\theta}-\mathbb{E}\{\psi_{\theta}(L)\}\right|=o_{p}(1)\ .
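The uniform convergence invoked here is the standard Glivenko–Cantelli property of classes with finite bracketing numbers. As an illustration only, the toy class f_θ(x) = 1{x ≤ θ}, θ ∈ [0, 1], with Uniform(0, 1) data exhibits the same behaviour: the uniform deviation sup_θ |P_N f_θ − P f_θ| shrinks as N grows. This sketch does not use the specific classes of the proof.

```python
# Illustration only: a uniform law of large numbers over a simple bracketed class.
import numpy as np

rng = np.random.default_rng(4)
thetas = np.linspace(0.0, 1.0, 201)
for N in [100, 1000, 10_000, 100_000]:
    x = rng.uniform(size=N)
    emp = (x[:, None] <= thetas[None, :]).mean(axis=0)    # P_N f_theta
    # P f_theta = theta for Uniform(0, 1) data
    print(f"N={N:6d}  sup_theta |P_N f_theta - P f_theta| = {np.abs(emp - thetas).max():.4f}")
```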

Lemma 10.

Under Assumptions 1–4, we have

N|Sθ^N,5rSθ0,5r|=op(1).\displaystyle\sqrt{N}\left|S_{\hat{\theta}_{N},5}^{r}-S_{\theta_{0},5}^{r}\right|=o_{p}(1)\ .
Proof.

Consider the following empirical process.

𝔾N(fθ,5)=N[1Ni=1Nfθ,5(Li,Ri)𝔼{fθ,5(L,R)}]\displaystyle\mathbb{G}_{N}(f_{\theta,5})=\sqrt{N}\left[\frac{1}{N}\sum_{i=1}^{N}f_{\theta,5}(L_{i},R_{i})-\mathbb{E}\left\{f_{\theta,5}(L,R)\right\}\right]

where fθ,5(L,R)={𝟣R=1dOddsr(Lr)𝟣R=r}{ψθ(L)ψθ0(L)}f_{\theta,5}(L,R)=\{\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r})-\mathsf{1}_{R=r}\}\{\psi_{\theta}(L)-\psi_{\theta_{0}}(L)\}. Pick any decreasing sequence {δm}0\{\delta_{m}\}\to 0. Since θ^Nθ02=op(1)\|\hat{\theta}_{N}-\theta_{0}\|_{2}=o_{p}(1), for any γ>0\gamma>0 and each δm\delta_{m}, there exists a constant Nδm,γ>0N_{\delta_{m},\gamma}>0 such that for any NNδm,γN\geq N_{\delta_{m},\gamma},

P(θ^Nθ02δm)γ\displaystyle P\left(\|\hat{\theta}_{N}-\theta_{0}\|_{2}\geq\delta_{m}\right)\leq\gamma (18)

Consider the set of functions 5={fθ,5:θθ02δm}\mathcal{F}_{5}=\{f_{\theta,5}:\|\theta-\theta_{0}\|_{2}\leq\delta_{m}\}. It is easy to check that 𝔼{fθ,5(L,R)}=0\mathbb{E}\{f_{\theta,5}(L,R)\}=0. Plug in our estimator and define f^θ,5(L,R):={𝟣R=1dOddsr(Lr)𝟣R=r}{ψθ^N(L)ψθ0(L)}\hat{f}_{\theta,5}(L,R):=\{\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r})-\mathsf{1}_{R=r}\}\{\psi_{\hat{\theta}_{N}}(L)-\psi_{\theta_{0}}(L)\}. Notice that N(Sθ^N,5rSθ0,5r)=𝔾N(f^θ,5)\sqrt{N}(S_{\hat{\theta}_{N},5}^{r}-S_{\theta_{0},5}^{r})=\mathbb{G}_{N}(\hat{f}_{\theta,5}). Thus,

P(N|Sθ^N,5rSθ0,5r|>supfθ,55|𝔾N(fθ,5)|)P(f^θ,55)γ.\displaystyle P\left(\sqrt{N}\left|S_{\hat{\theta}_{N},5}^{r}-S_{\theta_{0},5}^{r}\right|>\underset{f_{\theta,5}\in\mathcal{F}_{5}}{\sup}|\mathbb{G}_{N}(f_{\theta,5})|\right)\leq P\left(\hat{f}_{\theta,5}\notin\mathcal{F}_{5}\right)\leq\gamma\ .

Similarly, we only need to show 𝔼supfθ,55|𝔾N(fθ,5)|=op(1)\mathbb{E}\sup_{f_{\theta,5}\in\mathcal{F}_{5}}|\mathbb{G}_{N}(f_{\theta,5})|=o_{p}(1). Define the envelope function F5(L):=(C0+1)fδm(L)F_{5}(L):=(C_{0}+1)f_{\delta_{m}}(L) where fδf_{\delta} is the envelope function in Assumption 4.B. So, |fθ,5(L,R)|F5(L)|f_{\theta,5}(L,R)|\leq F_{5}(L) for any fθ,55f_{\theta,5}\in\mathcal{F}_{5}. Besides, F5P,2(C0+1)fδmP,2\|F_{5}\|_{P,2}\leq(C_{0}+1)\|f_{\delta_{m}}\|_{P,2}. By the maximal inequality,

𝔼supfθ,55|𝔾N(fθ,5)|=Op(J[]{F5P,2,5,L2(P)}).\displaystyle\mathbb{E}\underset{f_{\theta,5}\in\mathcal{F}_{5}}{\sup}\left|\mathbb{G}_{N}(f_{\theta,5})\right|=O_{p}\left(J_{[\ ]}\{\|F_{5}\|_{P,2},\mathcal{F}_{5},L_{2}(P)\}\right)\ .

Define a set of functions 𝒢5={gθ,5:θθ02δm}\mathcal{G}_{5}=\{g_{\theta,5}:\|\theta-\theta_{0}\|_{2}\leq\delta_{m}\} where gθ,5(L)=ψθ(L)ψθ0(L)g_{\theta,5}(L)=\psi_{\theta}(L)-\psi_{\theta_{0}}(L). Since 𝟣R=1dOddsr𝟣R=r(C0+1)\|\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}-\mathsf{1}_{R=r}\|_{\infty}\leq(C_{0}+1), by Lemma 15,

n[]{(C0+1)ϵ,5,L2(P)}n[]{ϵ,𝒢5,L2(P)}.\displaystyle n_{[\ ]}\{(C_{0}+1)\epsilon,\mathcal{F}_{5},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{G}_{5},L_{2}(P)\}\ .

Define a set of functions 𝒢~5={ψθ:θθ02δm}\tilde{\mathcal{G}}_{5}=\{\psi_{\theta}:\|\theta-\theta_{0}\|_{2}\leq\delta_{m}\}. Since ψθ0\psi_{\theta_{0}} is a fixed function, n[]{ϵ,𝒢5,L2(P)}=n[]{ϵ,𝒢~5,L2(P)}n_{[\ ]}\{\epsilon,\mathcal{G}_{5},L_{2}(P)\}=n_{[\ ]}\{\epsilon,\tilde{\mathcal{G}}_{5},L_{2}(P)\}. Since δm0\delta_{m}\to 0 as NN\to\infty, we can take δm\delta_{m} small enough such that the set {θ:θθ02δm}Θ\{\theta:\|\theta-\theta_{0}\|_{2}\leq\delta_{m}\}\subset\Theta. So, 𝒢~5\tilde{\mathcal{G}}_{5}\subset\mathcal{H}, and n[]{ϵ,𝒢~5,L2(P)}n[]{ϵ,,L2(P)}n_{[\ ]}\{\epsilon,\tilde{\mathcal{G}}_{5},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}. Then,

J[]{F5P,2,5,L2(P)}\displaystyle J_{[\ ]}\{\|F_{5}\|_{P,2},\mathcal{F}_{5},L_{2}(P)\} 0(C0+1)fδmP,2logn[]{ϵC0+1,,L2(P)}dϵ\displaystyle\leq\int_{0}^{(C_{0}+1)\|f_{\delta_{m}}\|_{P,2}}\sqrt{\log n_{[\ ]}\left\{\frac{\epsilon}{C_{0}+1},\mathcal{H},L_{2}(P)\right\}}d\epsilon
C0(C0+1)fδmP,2{(C0+1)/ϵ}12ddϵ\displaystyle\leq\sqrt{C_{\mathcal{H}}}\int_{0}^{(C_{0}+1)\|f_{\delta_{m}}\|_{P,2}}\{(C_{0}+1)/\epsilon\}^{\frac{1}{2d_{\mathcal{H}}}}d\epsilon
C(C0+1)fδmP,2112d\displaystyle\leq\sqrt{C_{\mathcal{H}}}(C_{0}+1)\|f_{\delta_{m}}\|_{P,2}^{1-\frac{1}{2d_{\mathcal{H}}}}
0\displaystyle\to 0

since d>1/2d_{\mathcal{H}}>1/2 and fδmP,20\|f_{\delta_{m}}\|_{P,2}\to 0 as NN\to\infty. Thus, 𝔼supfθ,55|𝔾N(fθ,5)|=op(1)\mathbb{E}\sup_{f_{\theta,5}\in\mathcal{F}_{5}}|\mathbb{G}_{N}(f_{\theta,5})|=o_{p}(1) and N|Sθ^N,5rSθ0,5r|=op(1)\sqrt{N}|S_{\hat{\theta}_{N},5}^{r}-S_{\theta_{0},5}^{r}|=o_{p}(1). ∎

Lemma 11.

Under Assumptions 1–4, we have

N|Sθ^N,6rSθ0,6r|=op(1).\displaystyle\sqrt{N}\left|S_{\hat{\theta}_{N},6}^{r}-S_{\theta_{0},6}^{r}\right|=o_{p}(1)\ .
Proof.

Consider the following empirical process.

𝔾N(fθ,6)=N[1Ni=1Nfθ,6(Li,Ri)𝔼{fθ,6(L,R)}]\displaystyle\mathbb{G}_{N}(f_{\theta,6})=\sqrt{N}\left[\frac{1}{N}\sum_{i=1}^{N}f_{\theta,6}(L_{i},R_{i})-\mathbb{E}\left\{f_{\theta,6}(L,R)\right\}\right]

where fθ,6(L,R)={𝟣R=1dOddsr(Lr)𝟣R=r}{urθ(Lr)urθ0(Lr)}f_{\theta,6}(L,R)=\{\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}(L^{r})-\mathsf{1}_{R=r}\}\{u^{r}_{\theta}(L^{r})-u^{r}_{\theta_{0}}(L^{r})\}. Similarly, inequality (18) holds and 𝔼{fθ,6(L,R)}=0\mathbb{E}\{f_{\theta,6}(L,R)\}=0. Consider the set of functions 6={fθ,6:θθ02δm}\mathcal{F}_{6}=\{f_{\theta,6}:\|\theta-\theta_{0}\|_{2}\leq\delta_{m}\}. To show N|Sθ^N,6rSθ0,6r|=op(1)\sqrt{N}|S_{\hat{\theta}_{N},6}^{r}-S_{\theta_{0},6}^{r}|=o_{p}(1), we need to show 𝔼supfθ,66|𝔾N(fθ,6)|=op(1)\mathbb{E}\sup_{f_{\theta,6}\in\mathcal{F}_{6}}|\mathbb{G}_{N}(f_{\theta,6})|=o_{p}(1). Define the envelope function F6(L):=(C0+1)𝔼{fδm(L)Lr}F_{6}(L):=(C_{0}+1)\mathbb{E}\{f_{\delta_{m}}(L)\mid L^{r}\}. It’s easy to see that |fθ,6(L,R)|F6(L)|f_{\theta,6}(L,R)|\leq F_{6}(L) for any fθ,66f_{\theta,6}\in\mathcal{F}_{6} and F6P,2(C0+1)fδmP,2\|F_{6}\|_{P,2}\leq(C_{0}+1)\|f_{\delta_{m}}\|_{P,2}. Applying the maximal inequality,

𝔼supfθ,66|𝔾N(fθ,6)|=Op(J[]{F6P,2,6,L2(P)}).\displaystyle\mathbb{E}\underset{f_{\theta,6}\in\mathcal{F}_{6}}{\sup}\left|\mathbb{G}_{N}(f_{\theta,6})\right|=O_{p}\left(J_{[\ ]}\{\|F_{6}\|_{P,2},\mathcal{F}_{6},L_{2}(P)\}\right)\ .

Define a set of functions 𝒢6={gθ,6:θθ02δm}\mathcal{G}_{6}=\{g_{\theta,6}:\|\theta-\theta_{0}\|_{2}\leq\delta_{m}\} where gθ,6(L)=urθ(Lr)urθ0(Lr)g_{\theta,6}(L)=u^{r}_{\theta}(L^{r})-u^{r}_{\theta_{0}}(L^{r}). Since 𝟣R=1dOddsr𝟣R=r(C0+1)\|\mathsf{1}_{R=1_{d}}\mathrm{Odds}^{r}-\mathsf{1}_{R=r}\|_{\infty}\leq(C_{0}+1), by Lemma 15,

n[]{(C0+1)ϵ,6,L2(P)}n[]{ϵ,𝒢6,L2(P)}.\displaystyle n_{[\ ]}\{(C_{0}+1)\epsilon,\mathcal{F}_{6},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{G}_{6},L_{2}(P)\}\ .

Define a set of functions 𝒢~6={urθ:θθ02δm}\tilde{\mathcal{G}}_{6}=\{u^{r}_{\theta}:\|\theta-\theta_{0}\|_{2}\leq\delta_{m}\}. Similarly, since urθ0u^{r}_{\theta_{0}} is a fixed function, n[]{ϵ,𝒢6,L2(P)}=n[]{ϵ,𝒢~6,L2(P)}n_{[\ ]}\{\epsilon,\mathcal{G}_{6},L_{2}(P)\}=n_{[\ ]}\{\epsilon,\tilde{\mathcal{G}}_{6},L_{2}(P)\}. Take δm\delta_{m} small enough such that the set {θ:θθ02δm}Θ\{\theta:\|\theta-\theta_{0}\|_{2}\leq\delta_{m}\}\subset\Theta. Then, 𝒢~6𝒰r\tilde{\mathcal{G}}_{6}\subset\mathcal{U}^{r}, and by Lemma 14,

n[]{ϵ,𝒢~6,L2(P)}n[]{ϵ,𝒰r,L2(P)}n[]{ϵ,,L2(P)}.\displaystyle n_{[\ ]}\{\epsilon,\tilde{\mathcal{G}}_{6},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{U}^{r},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}\ .

Therefore,

J[]{F6P,2,6,L2(P)}\displaystyle J_{[\ ]}\{\|F_{6}\|_{P,2},\mathcal{F}_{6},L_{2}(P)\} 0(C0+1)fδmP,2logn[](ϵ/(C0+1),,L2(P))dϵ\displaystyle\leq\int_{0}^{(C_{0}+1)\|f_{\delta_{m}}\|_{P,2}}\sqrt{\log n_{[\ ]}\left(\epsilon/(C_{0}+1),\mathcal{H},L_{2}(P)\right)}d\epsilon
C(C0+1)fδmP,2112d\displaystyle\leq\sqrt{C_{\mathcal{H}}}(C_{0}+1)\|f_{\delta_{m}}\|_{P,2}^{1-\frac{1}{2d_{\mathcal{H}}}}
0\displaystyle\to 0

since d>1/2d_{\mathcal{H}}>1/2 and fδmP,20\|f_{\delta_{m}}\|_{P,2}\to 0 as NN\to\infty. Thus, 𝔼supfθ,66|𝔾N(fθ,6)|=Op(op(1))=op(1)\mathbb{E}\sup_{f_{\theta,6}\in\mathcal{F}_{6}}|\mathbb{G}_{N}(f_{\theta,6})|=O_{p}(o_{p}(1))=o_{p}(1) and N|Sθ^N,6rSθ0,6r|=op(1)\sqrt{N}|S_{\hat{\theta}_{N},6}^{r}-S_{\theta_{0},6}^{r}|=o_{p}(1). ∎

Lemma 12.

Consider the set of functions ={f:=gh,g𝒢,h}\mathcal{F}=\{f:=gh,g\in\mathcal{G},h\in\mathcal{H}\}. Assume that gcg\|g\|_{\infty}\leq c_{g} for all g𝒢g\in\mathcal{G} and hP,2ch\|h\|_{P,2}\leq c_{h} for all hh\in\mathcal{H}. Then, for any ϵmin{cg,ch}\epsilon\leq\min\{c_{g},c_{h}\},

n[]{4(cg+ch)ϵ,,L2(P)}n[]{ϵ,𝒢,L}n[]{ϵ,,L2(P)}.\displaystyle n_{[\ ]}\{4(c_{g}+c_{h})\epsilon,\mathcal{F},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{G},L^{\infty}\}n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}\ .
Proof.

Suppose {ui,vi}i=1n\{u_{i},v_{i}\}_{i=1}^{n} are the ϵ\epsilon-brackets that can cover 𝒢\mathcal{G} and {Uj,Vj}j=1m\{U_{j},V_{j}\}_{j=1}^{m} are the ϵ\epsilon-brackets that can cover \mathcal{H}. Define the bracket [𝖴k,𝖵k][\mathsf{U}_{k},\mathsf{V}_{k}] for k=(i1)m+jk=(i-1)m+j where i=1,,n,j=1,,mi=1,\cdots,n,j=1,\cdots,m:

𝖴k(x)=min{ui(x)Uj(x),ui(x)Vj(x),vi(x)Uj(x),vi(x)Vj(x)},\displaystyle\mathsf{U}_{k}(x)=\min\{u_{i}(x)U_{j}(x),u_{i}(x)V_{j}(x),v_{i}(x)U_{j}(x),v_{i}(x)V_{j}(x)\}\ ,
𝖵k(x)=max{ui(x)Uj(x),ui(x)Vj(x),vi(x)Uj(x),vi(x)Vj(x)}.\displaystyle\mathsf{V}_{k}(x)=\max\{u_{i}(x)U_{j}(x),u_{i}(x)V_{j}(x),v_{i}(x)U_{j}(x),v_{i}(x)V_{j}(x)\}\ .

For any function ff\in\mathcal{F}, there exist functions g𝒢g\in\mathcal{G} and hh\in\mathcal{H} such that f=ghf=gh. Besides, we can find two pairs of functions (ui0,vi0)(u_{i_{0}},v_{i_{0}}) and (Uj0,Vj0)(U_{j_{0}},V_{j_{0}}) such that ui0(x)g(x)vi0(x)u_{i_{0}}(x)\leq g(x)\leq v_{i_{0}}(x), Uj0(x)h(x)Vj0(x)U_{j_{0}}(x)\leq h(x)\leq V_{j_{0}}(x), ui0vi0ϵ\|u_{i_{0}}-v_{i_{0}}\|_{\infty}\leq\epsilon, and Uj0Vj0P,2ϵ\|U_{j_{0}}-V_{j_{0}}\|_{P,2}\leq\epsilon. Then, 𝖴k0(x)f(x)𝖵k0(x)\mathsf{U}_{k_{0}}(x)\leq f(x)\leq\mathsf{V}_{k_{0}}(x) where k0=(i01)m+j0k_{0}=(i_{0}-1)m+j_{0}. Next, we bound the size of the new brackets. By simple algebra,

𝖴k𝖵kP,2\displaystyle\|\mathsf{U}_{k}-\mathsf{V}_{k}\|_{P,2} (|ui|+|vi|)|UjVj|+(|Uj|+|Vj|)|uivi|P,2\displaystyle\leq\left\|\left(|u_{i}|+|v_{i}|\right)|U_{j}-V_{j}|+\left(|U_{j}|+|V_{j}|\right)|u_{i}-v_{i}|\right\|_{P,2}
uiUjVjP,2+viUjVjP,2\displaystyle\leq\|u_{i}\|_{\infty}\|U_{j}-V_{j}\|_{P,2}+\|v_{i}\|_{\infty}\|U_{j}-V_{j}\|_{P,2}
+uiviUjP,2+uiviVjP,2\displaystyle+\|u_{i}-v_{i}\|_{\infty}\|U_{j}\|_{P,2}+\|u_{i}-v_{i}\|_{\infty}\|V_{j}\|_{P,2}
2ϵ(cg+ϵ)+2(ch+ϵ)ϵ=2(cg+ch+2ϵ)ϵ.\displaystyle\leq 2\epsilon(c_{g}+\epsilon)+2(c_{h}+\epsilon)\epsilon=2(c_{g}+c_{h}+2\epsilon)\epsilon\ .

Furthermore, for any ϵmin{cg,ch}\epsilon\leq\min\{c_{g},c_{h}\}, we have 2(cg+ch+2ϵ)ϵ4(cg+ch)ϵ2(c_{g}+c_{h}+2\epsilon)\epsilon\leq 4(c_{g}+c_{h})\epsilon. Therefore,

n[]{4(cg+ch)ϵ,,L2(P)}n[]{ϵ,𝒢,L}n[]{ϵ,,L2(P)}.\displaystyle n_{[\ ]}\{4(c_{g}+c_{h})\epsilon,\mathcal{F},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{G},L^{\infty}\}n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}\ .
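The bracket construction above is concrete enough to verify numerically: if u ≤ g ≤ v and U ≤ h ≤ V pointwise, then the pointwise minimum and maximum of the four products uU, uV, vU, vV bracket gh, since a bilinear function attains its extrema at the corners of a rectangle. A small sketch with arbitrary functions (illustration only):

```python
# Illustration only: the product brackets built in the proof of Lemma 12.
import numpy as np

x = np.linspace(-2.0, 2.0, 401)
g, h = np.sin(x), x**2 - 1.0                   # arbitrary functions to be bracketed
eps = 0.1
u, v = g - eps / 2, g + eps / 2                # an eps-bracket [u, v] for g
U, V = h - eps / 2, h + eps / 2                # an eps-bracket [U, V] for h

prods = np.stack([u * U, u * V, v * U, v * V])
lower, upper = prods.min(axis=0), prods.max(axis=0)
assert np.all(lower <= g * h) and np.all(g * h <= upper)   # the new bracket contains g*h
print("maximal width of the product bracket:", float((upper - lower).max()))
```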

Lemma 13.

Consider the set of functions =+𝒢={f:=g+h,g𝒢,h}\mathcal{F}=\mathcal{H}+\mathcal{G}=\{f:=g+h,g\in\mathcal{G},h\in\mathcal{H}\}. Assume that gP,2cg\|g\|_{P,2}\leq c_{g} for all g𝒢g\in\mathcal{G} and hP,2ch\|h\|_{P,2}\leq c_{h} for all hh\in\mathcal{H}. Then,

n[]{2ϵ,,L2(P)}n[]{ϵ,𝒢,L2(P)}n[]{ϵ,,L2(P)}.\displaystyle n_{[\ ]}\{2\epsilon,\mathcal{F},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{G},L_{2}(P)\}n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}\ .
Proof.

Suppose {ui,vi}i=1n\{u_{i},v_{i}\}_{i=1}^{n} are the ϵ\epsilon-brackets that can cover 𝒢\mathcal{G} and {Uj,Vj}j=1m\{U_{j},V_{j}\}_{j=1}^{m} are the ϵ\epsilon-brackets that can cover \mathcal{H}. Define the bracket [𝖴k,𝖵k][\mathsf{U}_{k},\mathsf{V}_{k}] for k=(i1)m+jk=(i-1)m+j and i=1,,n,j=1,,mi=1,\cdots,n,j=1,\cdots,m:

𝖴k(x)=ui(x)+Uj(x),\displaystyle\mathsf{U}_{k}(x)=u_{i}(x)+U_{j}(x)\ ,
𝖵k(x)=vi(x)+Vj(x).\displaystyle\mathsf{V}_{k}(x)=v_{i}(x)+V_{j}(x)\ .

For any function ff\in\mathcal{F}, there exist functions g𝒢g\in\mathcal{G} and hh\in\mathcal{H} such that f=g+hf=g+h. Besides, we can find two pairs of functions (ui0,vi0)(u_{i_{0}},v_{i_{0}}) and (Uj0,Vj0)(U_{j_{0}},V_{j_{0}}) such that ui0(x)g(x)vi0(x)u_{i_{0}}(x)\leq g(x)\leq v_{i_{0}}(x), Uj0(x)h(x)Vj0(x)U_{j_{0}}(x)\leq h(x)\leq V_{j_{0}}(x), ui0vi0P,2ϵ\|u_{i_{0}}-v_{i_{0}}\|_{P,2}\leq\epsilon, and Uj0Vj0P,2ϵ\|U_{j_{0}}-V_{j_{0}}\|_{P,2}\leq\epsilon. Then, 𝖴k0(x)f(x)𝖵k0(x)\mathsf{U}_{k_{0}}(x)\leq f(x)\leq\mathsf{V}_{k_{0}}(x) where k0=(i01)m+j0k_{0}=(i_{0}-1)m+j_{0} and

𝖴k𝖵kP,2uiviP,2+UjVjP,22ϵ.\displaystyle\|\mathsf{U}_{k}-\mathsf{V}_{k}\|_{P,2}\leq\|u_{i}-v_{i}\|_{P,2}+\|U_{j}-V_{j}\|_{P,2}\leq 2\epsilon\ .

Therefore,

n[]{2ϵ,,L2(P)}n[]{ϵ,𝒢,L2(P)}n[]{ϵ,,L2(P)}.\displaystyle n_{[\ ]}\{2\epsilon,\mathcal{F},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{G},L_{2}(P)\}n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}\ .

Lemma 14.

Let \mathcal{H}, 𝒰r\mathcal{U}^{r} and 𝔼r\mathbb{E}\mathcal{H}^{r} be the sets of functions as we defined before. Then,

n[]{ϵ,𝒰r,L2(P)}\displaystyle n_{[\ ]}\{\epsilon,\mathcal{U}^{r},L_{2}(P)\} n[]{ϵ,,L2(P)},\displaystyle\leq n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}\ ,
n[]{ϵ,𝔼r,L2(P)}\displaystyle n_{[\ ]}\{\epsilon,\mathbb{E}\mathcal{H}^{r},L_{2}(P)\} n[]{ϵ,,L2(P)}.\displaystyle\leq n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}\ .
Proof.

Suppose {ui,vi}i=1n\{u_{i},v_{i}\}_{i=1}^{n} are the ϵ\epsilon-brackets that can cover \mathcal{H}. Define Ui(lr)=𝔼{ui(L)Lr=lr,R=r}U_{i}(l^{r})=\mathbb{E}\{u_{i}(L)\mid L^{r}=l^{r},R=r\} and Vi(lr)=𝔼{vi(L)Lr=lr,R=r}V_{i}(l^{r})=\mathbb{E}\{v_{i}(L)\mid L^{r}=l^{r},R=r\}. Then, for any ur𝒰ru^{r}\in\mathcal{U}^{r}, there exists ψθ\psi_{\theta}\in\mathcal{H} such that ur(lr)=𝔼{ψθ(L)Lr=lr,R=r}u^{r}(l^{r})=\mathbb{E}\{\psi_{\theta}(L)\mid L^{r}=l^{r},R=r\} with a pair of functions (u0,v0)(u_{0},v_{0}) satisfying u0(l)ψθ(l)v0(l)u_{0}(l)\leq\psi_{\theta}(l)\leq v_{0}(l) and u0(L)v0(L)P,2ϵ\|u_{0}(L)-v_{0}(L)\|_{P,2}\leq\epsilon. Then, U0(lr)ur(lr)V0(lr)U_{0}(l^{r})\leq u^{r}(l^{r})\leq V_{0}(l^{r}) and

U0(Lr)V0(Lr)P,22=𝔼[𝔼{u0(L)v0(L)Lr,R=r}2]\displaystyle\|U_{0}(L^{r})-V_{0}(L^{r})\|_{P,2}^{2}=\mathbb{E}[\mathbb{E}\{u_{0}(L)-v_{0}(L)\mid L^{r},R=r\}^{2}]
𝔼𝔼[{u0(L)v0(L)}2Lr,R=r]\displaystyle\leq\mathbb{E}\mathbb{E}[\{u_{0}(L)-v_{0}(L)\}^{2}\mid L^{r},R=r]
=𝔼{u0(L)v0(L)}2=u0v0P,22ϵ2.\displaystyle=\mathbb{E}\{u_{0}(L)-v_{0}(L)\}^{2}=\|u_{0}-v_{0}\|_{P,2}^{2}\leq\epsilon^{2}\ .

So, {Ui,Vi}i=1n\{U_{i},V_{i}\}_{i=1}^{n} are the ϵ\epsilon-brackets that can cover 𝒰r\mathcal{U}^{r} and

n[]{ϵ,𝒰r,L2(P)}n[]{ϵ,,L2(P)}.\displaystyle n_{[\ ]}\{\epsilon,\mathcal{U}^{r},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}\ .

For any gr𝔼rg^{r}\in\mathbb{E}\mathcal{H}^{r}, there exists ff\in\mathcal{H} such that gr(lr)=𝔼{f(L)Lr=lr,R=r}g^{r}(l^{r})=\mathbb{E}\{f(L)\mid L^{r}=l^{r},R=r\}. Similarly,

n[]{ϵ,𝔼r,L2(P)}n[]{ϵ,,L2(P)}.\displaystyle n_{[\ ]}\{\epsilon,\mathbb{E}\mathcal{H}^{r},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{H},L_{2}(P)\}\ .
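The contraction step in this proof, namely that taking a conditional expectation does not increase the L_2(P) size of a bracket, is Jensen's inequality. A Monte Carlo illustration with hypothetical variables, where conditioning on X plays the role of conditioning on (L^r, R = r):

```python
# Illustration only: E[{E(diff | X)}^2] <= E[diff^2], so an eps-bracket for psi_theta
# maps to an eps-bracket for its conditional expectation (the mechanism in Lemma 14).
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
X = rng.normal(size=n)
Y = rng.normal(size=n)
diff = 0.2 * np.abs(np.sin(X * Y)) + 0.05          # stand-in for the bracket width v - u >= 0

# crude estimate of E(diff | X) by binning on X
edges = np.quantile(X, np.linspace(0, 1, 51)[1:-1])
bins = np.digitize(X, edges)
cond_mean = np.zeros(n)
for b in np.unique(bins):
    idx = bins == b
    cond_mean[idx] = diff[idx].mean()

print("L2 norm of the bracket width:          ", np.sqrt(np.mean(diff**2)))
print("L2 norm of its conditional expectation:", np.sqrt(np.mean(cond_mean**2)))
```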

Lemma 15.

Let hh be a fixed bounded function. Assume hch\|h\|_{\infty}\leq c_{h}. We consider two function classes ={f:f(x):=g(x)h(x),g𝒢}\mathcal{F}=\{f:f(x):=g(x)h(x),g\in\mathcal{G}\} and 𝒢={g:gP,2c}\mathcal{G}=\{g:\|g\|_{P,2}\leq c\} for a fixed constant cc. Then,

n[]{chϵ,,L2(P)}n[]{ϵ,𝒢,L2(P)}.\displaystyle n_{[\ ]}\{c_{h}\epsilon,\mathcal{F},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{G},L_{2}(P)\}\ .
Proof.

Suppose {ui,vi}i=1n\{u_{i},v_{i}\}_{i=1}^{n} are the ϵ\epsilon-brackets that can cover 𝒢\mathcal{G}. That is, for any g𝒢g\in\mathcal{G}, we can find a pair of functions (u0,v0)(u_{0},v_{0}) such that u0(x)g(x)v0(x)u_{0}(x)\leq g(x)\leq v_{0}(x) and u0v0P,2ϵ\|u_{0}-v_{0}\|_{P,2}\leq\epsilon. Then, for any xx, either u0(x)h(x)g(x)h(x)v0(x)h(x)u_{0}(x)h(x)\leq g(x)h(x)\leq v_{0}(x)h(x) or u0(x)h(x)g(x)h(x)v0(x)h(x)u_{0}(x)h(x)\geq g(x)h(x)\geq v_{0}(x)h(x) holds. Define Ui(x)=min{ui(x)h(x),vi(x)h(x)}U_{i}(x)=\min\{u_{i}(x)h(x),v_{i}(x)h(x)\} and Vi(x)=max{ui(x)h(x),vi(x)h(x)}V_{i}(x)=\max\{u_{i}(x)h(x),v_{i}(x)h(x)\}. For any ff\in\mathcal{F}, there exists g𝒢g\in\mathcal{G} such that f=ghf=gh and a pair of functions (U0,V0)(U_{0},V_{0}) such that U0(x)f(x)V0(x)U_{0}(x)\leq f(x)\leq V_{0}(x) and

UiViP,2=(uivi)hP,2h(uivi)P,2chϵ.\displaystyle\|U_{i}-V_{i}\|_{P,2}=\|(u_{i}-v_{i})h\|_{P,2}\leq\|h\|_{\infty}\|(u_{i}-v_{i})\|_{P,2}\leq c_{h}\epsilon\ .

So, {Ui,Vi}i=1n\{U_{i},V_{i}\}_{i=1}^{n} are the chϵc_{h}\epsilon-brackets that can cover \mathcal{F} and

n[]{chϵ,,L2(P)}n[]{ϵ,𝒢,L2(P)}.\displaystyle n_{[\ ]}\{c_{h}\epsilon,\mathcal{F},L_{2}(P)\}\leq n_{[\ ]}\{\epsilon,\mathcal{G},L_{2}(P)\}\ .

References

  • Barengolts [2001] Barengolts, E. (2001). Risk factors for hip fracture in predominantly African-American veteran male population. J Bone Miner Res 16, S170.
  • Begun et al. [1983] Begun, J. M., W. J. Hall, W.-M. Huang, and J. A. Wellner (1983). Information and asymptotic efficiency in parametric-nonparametric models. The Annals of Statistics 11(2), 432–452.
  • Bickel et al. [1993] Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1993). Efficient and adaptive estimation for semiparametric models, Volume 4. Springer.
  • Chan et al. [2016] Chan, K. C. G., S. C. P. Yam, and Z. Zhang (2016). Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(3), 673–700.
  • Chatterjee et al. [2003] Chatterjee, N., Y.-H. Chen, and N. E. Breslow (2003). A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association 98(461), 158–168.
  • Chatterjee and Li [2010] Chatterjee, N. and Y. Li (2010). Inference in semiparametric regression models under partial questionnaire design and nonmonotone missing data. Journal of the American Statistical Association 105(490), 787–797.
  • Chen [2004] Chen, H. Y. (2004). Nonparametric and semiparametric models for missing covariates in parametric regression. Journal of the American Statistical Association 99(468), 1176–1189.
  • Chen [2007] Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of econometrics 6, 5549–5632.
  • Chen et al. [2008] Chen, X., H. Hong, and A. Tarozzi (2008). Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics 36(2), 808–843.
  • Chen [2022] Chen, Y.-C. (2022). Pattern graphs: a graphical approach to nonmonotone missing data. The Annals of Statistics 50(1), 129–146.
  • Fan et al. [2022] Fan, J., K. Imai, I. Lee, H. Liu, Y. Ning, and X. Yang (2022). Optimal covariate balancing conditions in propensity score estimation. Journal of Business & Economic Statistics 41(1), 97–110.
  • Horowitz and Mammen [2004] Horowitz, J. L. and E. Mammen (2004). Nonparametric estimation of an additive model with a link function. The Annals of Statistics 32(6), 2412 – 2443.
  • Imai and Ratkovic [2014] Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society Series B: Statistical Methodology 76(1), 243–263.
  • Kang and Schafer [2007] Kang, J. D. and J. L. Schafer (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22(4), 523–539.
  • Little [1993] Little, R. J. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association 88(421), 125–134.
  • Little and Rubin [2019] Little, R. J. and D. B. Rubin (2019). Statistical analysis with missing data, Volume 793. John Wiley & Sons.
  • Little and Schluchter [1985] Little, R. J. and M. D. Schluchter (1985). Maximum likelihood estimation for mixed continuous and categorical data with missing values. Biometrika 72(3), 497–512.
  • Malinsky et al. [2022] Malinsky, D., I. Shpitser, and E. J. Tchetgen Tchetgen (2022). Semiparametric inference for nonmonotone missing-not-at-random data: the no self-censoring model. Journal of the American Statistical Association 117(539), 1415–1423.
  • Newey [1990] Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of applied econometrics 5(2), 99–135.
  • Newey [1997] Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of econometrics 79(1), 147–168.
  • Reilly and Pepe [1995] Reilly, M. and M. S. Pepe (1995). A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82(2), 299–314.
  • Robins and Gill [1997] Robins, J. M. and R. D. Gill (1997). Non-response models for the analysis of non-monotone ignorable missing data. Statistics in medicine 16(1), 39–56.
  • Robins et al. [1994] Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American statistical Association 89(427), 846–866.
  • Rubin [1976] Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3), 581–592.
  • Sadinle and Reiter [2017] Sadinle, M. and J. P. Reiter (2017). Itemwise conditionally independent nonresponse modelling for incomplete multivariate data. Biometrika 104(1), 207–220.
  • Sinha et al. [2014] Sinha, S., K. K. Saha, and S. Wang (2014). Semiparametric approach for non-monotone missing covariates in a parametric regression model. Biometrics 70(2), 299–311.
  • Tan [2020] Tan, Z. (2020). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika 107(1), 137–158.
  • Tchetgen Tchetgen et al. [2018] Tchetgen Tchetgen, E. J., L. Wang, and B. Sun (2018). Discrete choice models for nonmonotone nonignorable missing data: Identification and inference. Statistica Sinica 28(4), 2069.
  • Van der Vaart [2000] Van der Vaart, A. W. (2000). Asymptotic statistics, Volume 3. Cambridge university press.
  • Wellner et al. [2013] Wellner, J. et al. (2013). Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media.
  • Wong and Chan [2018] Wong, R. K. and K. C. G. Chan (2018). Kernel-based covariate functional balancing for observational studies. Biometrika 105(1), 199–213.
  • Wood [2017] Wood, S. N. (2017). Generalized additive models: an introduction with R. CRC press.
  • Yiu and Su [2018] Yiu, S. and L. Su (2018). Covariate association eliminating weights: a unified weighting framework for causal effect estimation. Biometrika 105(3), 709–722.
  • Zhao [2019] Zhao, Q. (2019). Covariate balancing propensity score by tailored loss functions. The Annals of Statistics 47(2), 965–993.
  • Zubizarreta [2015] Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association 110(511), 910–922.