Balancing Weights for Non-monotone Missing Data
Abstract
Balancing weights have been widely applied to single or monotone missingness due to empirical advantages over likelihood-based methods and inverse probability weighting approaches. This paper considers non-monotone missing data under the complete-case missing variable condition (CCMV), a case of missing not at random (MNAR). Using relationships between each missing pattern and the complete-case subsample, we construct a weighted estimator in which the weight is a sum of ratios of the conditional probability of observing a particular missing pattern to that of observing the complete-case pattern, given the variables observed in the corresponding missing pattern. However, plug-in estimators of the propensity odds can be unbounded and lead to unstable estimation. Using further relations between the propensity odds and balancing of moments across response patterns, we employ tailored loss functions, each encouraging empirical balance across patterns, to estimate the propensity odds flexibly via a functional basis expansion. We propose two penalizations to control the smoothness of the propensity odds model and the empirical imbalance. We study the asymptotic properties of the proposed estimators and show that they are consistent under mild smoothness assumptions. Asymptotic normality and efficiency are also established. Simulation results show the superior performance of the proposed method.
Keywords: Non-monotone missing; Missing not at random (MNAR); Complete-case missing variable (CCMV); Covariate balancing
1 Introduction
Missing data is ubiquitous in many research fields, such as health sciences, economics, and sociology. Monotone missing patterns, which arise in longitudinal studies and multi-stage sampling, have been widely studied. In practice, non-monotone missing patterns are more common since an ordering of variables may not exist to characterize the response patterns. For example, participants of an observational cohort may miss a particular study visit but return in later visits.
Restricting the analysis to complete cases in which all relevant variables are observed is a convenient solution but ignores information from partially missing data. It may lead to biased estimates unless data are missing completely at random (MCAR) [16], and would typically lose efficiency even under MCAR. The missing at random (MAR) assumption, where missingness depends only on observed data, is more reasonable and is widely used to handle missing data in likelihood-based or weighting methods.
Researchers often focus on a specific target, such as the coefficients of linear regression with Gaussian errors, which can be easily estimated by maximum likelihood estimation if the full-data density is known. To estimate parameters in the full-data density where the full variable vector may be partially missing, Little and Schluchter [17] suggested a mixture of normal distributions as the model of the missing data density. However, its validity heavily relies on parametric density assumptions. Likelihood-based methods are attractive because the missingness probability can be factored out of the likelihood under MAR [24]. To mimic the score function (the derivative of the log-likelihood) in the absence of missing data, Reilly and Pepe [21] proposed a mean-score approach, and Chatterjee et al. [5] proposed a pseudo-score approach for the bivariate case, which was later extended to the partial questionnaire design, a particular case of non-monotone missing data [6]. Chen [7] proposed a semi-parametric method based on reparametrizing the joint likelihood into a product of odds-ratio functions. However, these methods cannot directly handle continuous variables and require them to be discretized.
Inverse propensity weighting (IPW) methods [23] are also attractive since they do not require specification of the full-data likelihood; instead, a propensity model for non-missingness is needed. However, the resulting estimators can be highly unstable since even a small number of tiny propensity estimates leads to extreme weights. A more direct weighting approach, the balancing method, was developed especially for the average treatment effect (ATE), which can be viewed as a missing data problem with only the response variable missing; see, for example, Zubizarreta [35], Chan et al. [4], and Wong and Chan [31]. Zhao [34] introduced tailored loss functions that reveal the connection between IPW and balancing methods. These balancing methods handle the single-missing-variable scenario and have yet to be extended to non-monotone missingness.
Robins and Gill [22] argued that missing not at random (MNAR) is more natural than MAR under non-monotone missingness. However, many different specifications of MNAR are not identifiable from observed data. Little [15] imposed an identifying condition, the complete-case missing variable (CCMV) restriction, that matches the unidentified conditional distribution of missing variables for missing patterns to the identified distribution for complete cases. Then, the full-data density can be estimated by a pattern-mixture model. IPW estimators were also proposed under CCMV restriction and extensions [28]. Another increasingly popular MNAR mechanism is the no self-censoring (NSC) restriction, studied in Sinha et al. [26], Sadinle and Reiter [25] and Malinsky et al. [18]. Under MNAR, specifications of missing variable density typically rely on parametric assumptions and have intrinsic drawbacks. For instance, likelihood-based methods are often restricted to discrete variables due to complex integral computation, and IPW methods are often unstable due to difficulties in weight estimation for multiple missing patterns.
We propose a non-parametric approach that generalizes the balancing approaches to non-monotone missing scenarios under the CCMV assumption. Based on a tailored loss function, the proposed method approximately balances a set of functions of observed variables and mimics the missing-data adjustment of the mean-score approaches. The proposed method does not require the discretization of continuous variables and is less prone to misspecification due to the non-parametric modeling. A carefully designed penalization strikes a good balance between bias and variance, and prioritizes balancing smoother functions. We also show the consistency and asymptotic efficiency of the resulting estimator.
2 Notation and preliminaries
We formally describe the setup of the problem. Let be a random vector of interest and be a binary random vector where indicates that is observed and is the set of possible response patterns in the study. Denote the complete-case pattern by . Let be the number of response patterns. For the response pattern , we denote the observed variables by and the missing variables by where is the number of observed variables. So, the observations are . An example of a bivariate non-monotone missing structure with is
response pattern | |||
---|---|---|---|
observed variables | |||
missing variables |
Let be the parameter of interest which is the unique solution to , with a known vector-valued estimating function with values in . For instance, we could use the quasi-likelihood estimating functions for the generalized linear models. If full data were observed, a solution to the estimating equations is a common Z-estimator. However, can only be evaluated at complete samples, and a complete-case analysis is typically biased or inefficient. To address the problem, we posit the following assumptions throughout this paper.
Assumption 1.
- A: The estimating function is differentiable with respect to with derivative . Also, has the unique root and is differentiable at with nonsingular derivative .
- B: There exists a constant such that for any and so .
- C: , or equivalently, , for any .
Assumption 1.A is a standard regularity assumption for Z-estimation. Assumption 1.B ensures that complete cases are available for analysis. The first equation in Assumption 1.C is the CCMV condition, while the second equation motivates the logit discrete choice nonresponse model (LDCM) in Tchetgen Tchetgen et al. [28]. To see that the missing mechanism is MNAR, we define the propensity as the probability of data belonging to a specific response pattern conditional on the variables of interest and write it as . The propensity odds between patterns and is defined as . The CCMV condition states that the propensity odds depends on via , and so is equal to . The propensity is a function of and does not satisfy either MCAR or MAR conditions because propensities and odds are related through
(1) |
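In our own notation (the display above is not reproduced here, so the symbols below are assumptions rather than the authors' exact formula), write $O_r$ for the propensity odds of pattern $r$ versus the complete-case pattern, with $O_1 \equiv 1$. Since the pattern probabilities sum to one, CCMV gives

$$
P(R = r \mid L) \;=\; \frac{O_r\!\left(L_{(r)}\right)}{1 + \sum_{r' \neq 1} O_{r'}\!\left(L_{(r')}\right)},
\qquad
P(R = 1 \mid L) \;=\; \frac{1}{1 + \sum_{r' \neq 1} O_{r'}\!\left(L_{(r')}\right)}.
$$

The complete-case propensity thus depends on the odds of every pattern, each a function of a different subset of variables, so the missingness probability can depend on variables that are themselves missing.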
Tchetgen Tchetgen et al. [28] showed that Assumption 1 is sufficient for non-parametric identification and developed the inverse propensity weighting (IPW) estimator that is motivated by the law of total expectation:
(2) |
Taking the reciprocal of (1) and that ,
(3) |
As such, one can focus on the estimation of the odds. A standard approach is to fit a logistic regression where the binary response indicates whether the pattern is or , and obtain an estimate of [28]. Due to the relationship
a plug-in estimator of can then be constructed based on the predicted probability.
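As a concrete illustration of this plug-in strategy, the following sketch (function and variable names are ours; the logistic regression is the parametric working model described above) pools the complete cases with the pattern-$r$ cases, fits a logistic regression on the variables shared by the two patterns, and converts the predicted probabilities into odds:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def plug_in_odds(X_complete, X_pattern_r):
    """Plug-in estimate of the propensity odds between pattern r and the
    complete-case pattern, evaluated at the complete cases.

    Both inputs contain only the variables observed under pattern r.  Pooling
    the two subsamples and labelling pattern r as 1, the fitted probability p
    satisfies p / (1 - p) = P(R = r | L_r) / P(R = 1 | L_r), which equals the
    propensity odds under CCMV, provided the logistic model is correct.
    """
    X = np.vstack([X_complete, X_pattern_r])
    y = np.concatenate([np.zeros(len(X_complete)), np.ones(len(X_pattern_r))])
    fit = LogisticRegression(C=1e8, max_iter=1000).fit(X, y)  # essentially unpenalized
    p = fit.predict_proba(X_complete)[:, 1]
    return p / (1.0 - p)
```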
Based on (2) and (3), the resulting estimator can be obtained by solving the weighted estimating equations where represents an estimator of and is the indicator function of an event . However, the estimated odds may lead to extremely large weights, and hence an unstable estimator of , when likelihood estimation is based on a misspecified missingness model. Similar phenomena were observed in inverse propensity estimation in [14] and many subsequent papers, which motivates covariate balancing methods for parameter estimation; see, for example, [13], [33], [34] and [27].
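For a logistic-regression target (as in the simulation study and data example later), solving these weighted estimating equations amounts to a weighted complete-case fit. A minimal sketch, with names of our own choosing:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_complete_case_fit(X_cc, y_cc, odds_by_pattern):
    """Solve the weighted complete-case estimating equation for a logistic model.

    odds_by_pattern maps each missing pattern r to its estimated odds, evaluated
    at the complete cases (arrays of length n_cc).  The combined weight is
    1 + sum_r O_r, the plug-in estimate of the reciprocal complete-case propensity.
    """
    w_hat = 1.0 + sum(odds_by_pattern.values())
    fit = LogisticRegression(C=1e8, max_iter=1000).fit(X_cc, y_cc, sample_weight=w_hat)
    return np.concatenate([fit.intercept_, fit.coef_.ravel()])
```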
We note that the estimation of or is not the ultimate target but only serves as an intermediate step for estimating . Instead of using the entropy loss, i.e., the negative log-likelihood function, as the loss function, which is typical for logistic regression, we employ a tailored loss that directly imposes empirical control on key identifying conditions of the weights, which we call the balancing conditions described below. These balancing conditions are directly related to estimating target parameters . Since the missingness mechanism is typically not our interest, parsimonious modeling may not be required. Instead, we will use functional basis expansions for flexible modeling, which requires penalizations for stable estimation. In addition, our proposed estimator achieves the semiparametric efficiency bounds of parameters defined through the estimating equations. By contrast, Tchetgen Tchetgen et al. [28] only provided asymptotic variances when the parametric model of either propensity odds or missing variable density is correctly specified. Here, we present the semiparametric efficiency bound, whose proof is given in Appendix B. The “regular estimators” are defined according to Begun et al. [2]. The regularity of an estimator can be viewed as a robustness or stability property.
Theorem 1.
Under Assumption 1, the asymptotic variance lower bound for all regular estimators of is , where and
with , the conditional expectation of the estimating function given variables and is equal to under Assumption 1.C. In a slight abuse of notation, .
3 Construction of our method
In this section, we define the balancing conditions and propose a tailored loss function whose expectation is minimized when the propensity odds model satisfies the balancing conditions. We will show that the imbalance can be explicitly controlled under penalization. Eventually, we will introduce two penalizations to control empirical imbalance and the smoothness of the propensity odds estimates, respectively.
3.1 Balancing conditions and benefits
To estimate the propensity odds in a way more related to the estimation of , we first consider the effect of a set of generic weights in the weighted estimating equations that are constructed to estimate . The law of total expectation (2) shows that the weighted average in the fully observed pattern is equal to the unweighted average in the population. Moreover, for each missing pattern , one can see that under the CCMV assumption, the following equation holds for any measurable function . We call it the balancing condition associated with function using :
(4)
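In our notation (which may differ from the authors' symbols), the balancing condition associated with a measurable function $g$ of the variables shared by pattern $r$ and the complete cases reads

$$
\mathbb{E}\!\left[\,\mathbb{1}(R = 1)\, W \, g\!\left(L_{(r)}\right)\right]
\;=\;
\mathbb{E}\!\left[\,\mathbb{1}(R = r)\, g\!\left(L_{(r)}\right)\right],
$$

and the true odds $W = O_r\!\left(L_{(r)}\right)$ satisfy it, because multiplying the complete-case propensity by the odds recovers the pattern-$r$ propensity under CCMV.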
Note that the weights are equal to almost surely with respect to the conditional measure given if they satisfy the equations for all measurable functions. In other words, the set of balancing conditions associated with all measurable functions identifies the propensity odds. This motivates another way to estimate the propensity odds: balancing the shared variables between the two patterns. To see explicitly how the balancing approach helps the estimation of , we study the error induced by a generic set of weights . The error can first be decomposed as
Further decomposition of each inner term leads to:
(5) | |||
(6) | |||
(7) |
To control the error, we aim to design weights that can control the magnitude of the components in the above decomposition. First, we note that only the first two terms (5) and (6) depend on the weights. Indeed, the last term (7) is expected to converge to zero uniformly over in a compact set under mild assumptions. Next, we focus on the term (6). It is the empirical version of the imbalance associated with . Therefore, to control this error component, we would like the weights to achieve empirical balance, at least approximately. This is the fundamental motivation for the proposed weights. Finally, the term (5) with the proposed weights introduced later can be shown to converge uniformly to zero by some technical arguments. In Section 4 and Appendices B–F, we provide rigorous theoretical statements and corresponding proofs. In the next subsection, we propose a tailored loss function that encourages empirical balance.
3.2 Tailored loss function for balancing conditions
For each missing pattern , suppose
where are basis functions for the observed variables in pattern . The basis functions for can be constructed as tensor products of the basis functions for each variable. One may choose suitable basis functions depending on the observed variables and the number of observations in different patterns. For each missing pattern , we propose the tailored loss function:
Assuming that differentiation with respect to and expectation with respect to can be interchanged, the minimizer of satisfies
which can be rewritten as the balancing conditions:
In practice, the minimum tailored loss estimator of , denoted by , is obtained by minimizing the average loss
The estimating equations can be rewritten as the empirical version of balancing conditions using :
(8) |
Recall that we want to control the term (6). If , the empirical balances associated with the basis functions imply the empirical balance associated with . Therefore, the basis functions should be chosen to approximate well. Theoretically, one can increase the number of basis functions with sample size to extend the spanned space and lower the approximation error.
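The display of the tailored loss is not reproduced above, so the following sketch should be read as our reconstruction of a loss of the kind described (the functional form, names, and symbols are assumptions). Its gradient is exactly the vector of empirical imbalances of the basis functions, so minimizing it enforces the empirical balancing conditions (8):

```python
import numpy as np

def tailored_loss_and_gradient(alpha, phi, is_complete, is_pattern_r):
    """Unpenalized tailored loss for one missing pattern r (a reconstruction).

    alpha        : length-K coefficient vector of the log-linear odds model
    phi          : (n, K) basis functions of the variables shared by pattern r
                   and the complete cases, evaluated at all n observations
    is_complete  : length-n 0/1 indicator of the complete-case pattern
    is_pattern_r : length-n 0/1 indicator of pattern r

    Per-observation loss: 1{R=1} * exp(alpha'phi) - 1{R=r} * alpha'phi.
    The gradient below is the empirical imbalance of each basis function, with
    the fitted odds exp(alpha'phi) acting as the complete-case weights.
    """
    eta = phi @ alpha
    loss = np.mean(is_complete * np.exp(eta) - is_pattern_r * eta)
    grad = phi.T @ (is_complete * np.exp(eta) - is_pattern_r) / len(eta)
    return loss, grad
```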
3.3 Penalized optimization and empirical imbalance
As mentioned in the last subsection, one wants to balance as many basis functions as necessary to enlarge the spanned space since is unknown. However, the unpenalized optimization can result in overfitting, or can even become infeasible to compute, if one chooses too many basis functions. More precisely, as if there exists such that the linear component for all data in pattern and for all data in pattern .
Therefore, we consider the penalized average loss where the penalty function is a continuous non-negative convex function, and the tuning parameter controls the degree of penalization and is chosen by a cross-validation procedure to be discussed later. Since the estimation of the propensity odds is not the final goal, we want to study how penalization affects the estimation of . We will explore the empirical imbalance of , which depends on those of the basis functions.
Note that the tailored loss function is convex in . One can consider the sub-differential, as the penalty may be non-differentiable. Denote the sub-differential of at as , which is a set of sub-derivatives satisfying
The sub-differential of at any bounded minimizer should include the vector since for all . Therefore, there exists a sub-derivative of at such that
(9) |
3.3.1 Controlling empirical imbalance
We call the absolute difference between the left-hand side and the right-hand side of (8) the empirical imbalance. One immediate observation from (9) is that the empirical balance might not hold exactly when . Since the penalty determines and , one wants to choose an appropriate penalty to ensure control of the empirical imbalance. As a continuous function leads to bounded sub-derivatives, a simple choice is the -norm penalty, i.e., , whose sub-derivative entries belong to , and so, for each ,
As such, uniformly controls all empirical imbalances of the basis functions. However, to emphasize the control of the empirical imbalance of , the tolerance to the empirical imbalances of different basis functions should be allowed to vary. We should tolerate smaller empirical imbalances for those basis functions that are believed to have stronger approximating power for . For example, if one believes that satisfies certain smoothness assumptions, one may want to prioritize the empirical balance of smoother basis functions. One natural way to achieve this is to use the weighted -norm penalty, i.e., where , represent relative tolerances, which should be determined by the assumptions on . For each , the corresponding imbalance bound becomes:
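The display of this bound is not reproduced above; in our notation, and up to the normalization convention used for the average loss, the stationarity condition (9) with the weighted $\ell_1$ penalty implies, for each basis function $\phi_k$,

$$
\left|\;\frac{1}{n}\sum_{i:\,R_i = 1} e^{\hat{\alpha}^{\top}\phi\left(L_{(r),i}\right)}\,\phi_k\!\left(L_{(r),i}\right)
\;-\;\frac{1}{n}\sum_{i:\,R_i = r} \phi_k\!\left(L_{(r),i}\right)\;\right|
\;\le\; \lambda\, \delta_k ,
$$

so basis functions assigned a smaller tolerance $\delta_k$ are balanced more tightly.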
Small should be assigned to important basis functions to ensure stronger balance. A detailed example of the choice of is presented in Section 3.3.3.
3.3.2 Controlling propensity odds smoothness
In addition to the empirical balance, we could also introduce a penalty to promote the smoothness of the propensity odds estimate. In this paper, we focus on a quadratic penalty of the form , where is a positive semi-definite matrix. As suggested by Wood [32], the most convenient penalties to control the degree of smoothness are those that measure function roughness as a quadratic form in the coefficients of the function. A detailed example of the choice is presented in Section 3.3.3, where the roughness is related to the order of derivatives of a function. It is worth noting that our asymptotic theory only depends on the positive semi-definite property of , but not a specific choice of .
3.3.3 A combined penalty
We suggest the following penalized loss to combine the strengths of the above two penalties:
(10) |
where is the tuning parameter controlling the weights between two penalties. A method for tuning parameter selection is given in Section 3.3.4.
Here, we focus on a standard choice of and matrix , based on a prior belief that and are smooth. There are many ways to define the roughness of a function. Here, we adopt a version in Wood [32] for twice continuously differentiable functions of -dimensional continuous variables, such as cubic splines. The roughness of function is defined as where the semi-inner product is defined as:
Let be the Gram matrix where . Then, the roughness of basis expansion is the quadratic term . We orthogonalize the basis functions with respect to the above inner product such that the matrix becomes diagonal. Thus, the quadratic term becomes and has a simple derivative and is strongly convex with respect to for such that . To prioritize balancing smoother basis functions, we tolerate larger imbalances for rougher functions. Thus, we can connect the tolerance with function roughness. In particular, we choose . Then, for the minimizer of , the empirical imbalance of any linear combination of basis functions is
where denotes the log transformation of the propensity odds model, and is the number of basis functions. As such, the empirical imbalance of is proportionally bounded by the square root of its roughness . So, smoother functions in the spanned space have smaller empirical imbalances. We assume that is well approximated by the basis functions under certain smoothness assumptions introduced in Assumption 3.A. Thus, the approximation errors are controlled, and the empirical imbalance of is also controlled.
In addition, is controlled through the quadratic term in the penalty: rougher propensity odds models are penalized more heavily. In total, we control both the smoothness of the propensity odds model and the imbalance of incurred by using such a model.
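The following sketch records our reading of this construction (names and the exact choice of tolerances are assumptions): eigendecompose the roughness Gram matrix to orthogonalize the basis, read the roughness of each new basis function off the diagonal, and set the imbalance tolerances proportional to the square root of the roughness, with zero-roughness (constant and linear) functions given the smallest positive tolerance, mirroring the choice described in the simulation section.

```python
import numpy as np

def orthogonalize_and_tolerances(phi, gram):
    """Orthogonalize a basis w.r.t. a roughness Gram matrix and derive tolerances.

    phi  : (n, K) matrix of basis function evaluations
    gram : (K, K) positive semi-definite Gram matrix of the roughness
           semi-inner products between the original basis functions
    """
    eigval, eigvec = np.linalg.eigh(gram)      # gram = eigvec @ diag(eigval) @ eigvec.T
    phi_orth = phi @ eigvec                    # new basis is diagonal in the roughness inner product
    roughness = np.clip(eigval, 0.0, None)
    positive = roughness[roughness > 1e-12]
    floor = positive.min() if positive.size else 1.0
    delta = np.sqrt(np.where(roughness > 1e-12, roughness, floor))  # smoother => smaller tolerance
    return phi_orth, roughness, delta
```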
3.3.4 Tuning parameter selection
One approach to choosing the tuning parameters and is to select the pair of parameters that gives the minimal expected loss, usually estimated by a cross-validation procedure. The candidate parameters can be drawn from a grid satisfying and . For example, in our simulation study, we perform 5-fold cross-validation using the grid . The dataset is randomly divided into five folds. For each iteration, the propensity odds models are trained on four folds using penalized optimization and tested on the remaining fold to estimate the out-of-sample tailored loss. The parameter pair that yields the lowest cross-validation error across the five iterations is then used to retrain the model on the entire dataset. This approach enhances the model’s ability to generalize to unseen data and ensures a stable estimation of the propensity odds.
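A schematic version of this selection procedure (callback names and the grid are placeholders, not the authors' code) is sketched below; the held-out criterion is the unpenalized tailored loss:

```python
import numpy as np
from itertools import product

def select_tuning_parameters(fit_fn, loss_fn, n, grid_lam, grid_gamma, n_folds=5, seed=0):
    """5-fold cross-validation over a (lambda, gamma) grid.

    fit_fn(lam, gamma, train_idx) -> fitted odds-model coefficients
    loss_fn(alpha, test_idx)      -> average unpenalized tailored loss on held-out data
    Returns the pair with the smallest estimated out-of-sample loss.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    best, best_err = None, np.inf
    for lam, gamma in product(grid_lam, grid_gamma):
        err = 0.0
        for k in range(n_folds):
            test_idx = folds[k]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            err += loss_fn(fit_fn(lam, gamma, train_idx), test_idx) / n_folds
        if err < best_err:
            best, best_err = (lam, gamma), err
    return best
```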
4 Asymptotic properties
In this section, we investigate the consistency and asymptotic normality of our proposed estimators. Consider the proposed penalization (10) with generic basis functions. By minimizing the penalized average loss for each missing pattern , we first obtain the minimizers and the balancing estimators of propensity odds, . Summing up the propensity odds estimators, we obtain the weights, , and a plug-in estimator of :
The resulting estimator of is the solution to . Denote it by . Under mild conditions, we show that is consistent, is asymptotically normal for each in a compact set , and is consistent and efficient. We require the following set of assumptions to establish the consistency of the proposed estimator.
Assumption 2.
The following conditions hold for each missing pattern :
- A: There exist constants such that for all where is the domain of .
- B: The optimization has a unique solution .
- C: The total number of basis functions grows as the sample size increases and satisfies where is the number of observations in patterns and .
- D: There exist constants and such that for any positive integer , there exist satisfying
- E: The Euclidean norm of the basis functions satisfies .
- F: Let be the eigenvalues of in non-decreasing order. There exist constants and such that .
- G: The tuning parameter satisfies .
Assumption 2.A is the boundedness assumption commonly used in missing data and causal inference. It is equivalent to requiring that is strictly bounded away from 0 and 1. The domain is usually assumed to be compact, so it becomes possible to approximate with compactly supported functions. Assumption 2.B is a standard condition for consistency of minimum loss estimators of . It is well known that the uniform approximation error is related to the number of basis functions. Thus, we allow to increase with sample size under certain restrictions in Assumption 2.C. The uniform approximation rate in Assumption 2.D is related to the true propensity odds and the choice of basis functions. For instance, the rate for power series and splines if is continuously differentiable of order on under mild assumptions; see Newey [20] and Fan et al. [11] for details. The restriction is a technical condition ensuring that the estimator of the propensity odds is consistent. Assumption 2.E and Assumption 2.F are standard conditions for controlling the magnitude of the basis functions. The Euclidean norm of the basis function vector can increase as the spanned space extends, but its growth rate cannot be too fast. These assumptions are satisfied by many bases such as the regression spline, trigonometric polynomial, and wavelet bases; see, e.g., Newey [20], Horowitz and Mammen [12], Chen [8] and Fan et al. [11]. Assumption 2.G is a technical assumption on the tuning parameter that maintains the consistency of the weights. We now establish the consistency of the estimated odds.
Theorem 2.
The proof is provided in Appendix C.
Next, we establish the asymptotic normality of the empirical weighted estimating function for each . Note that the estimating function is a -dimensional vector-valued function. We only need to consider each entry separately. Denote the -th entry of and as and respectively. Let denote the bracketing number of the set by -brackets with respect to a specific norm. We will need the following additional conditions.
Assumption 3.
The following conditions hold for all missing pattern and all where is a compact set:
- A: There exist constants and such that for any and each missing pattern , there exists a parameter satisfying .
- B: Each of the true propensity odds, , is contained in a set of smooth functions . There exist constants and such that .
- C: The sets are contained in a function class such that there exist constants and such that .
- D: There exists a constant such that for all ,
- E: , which means that the growth rate of the number of basis functions has a lower bound.
Assumption 3.A is a requirement similar in spirit to Assumption 2.D such that the conditional expectation can be well approximated as we extend the space spanned by the basis functions. Assumption 3.B and Assumption 3.C are conditions on the complexity of the function classes and to ensure uniform convergence over . These assumptions are satisfied for many function classes. For instance, if is a Sobolev class of functions such that and the ()-th derivative is absolutely continuous with for some fixed , then by Example 19.10 of Van der Vaart [29]. The condition is satisfied for all . A Hölder class of functions also satisfies this condition [11]. Assumption 3.D is a technical condition related to the envelope function such that we can apply the maximal inequality via bracketing. Assumption 3.E requires the number of basis functions to grow such that the approximation error decreases in general.
To prove the theorem, we utilize a few lemmas of the bracketing number . The proofs of the theorem and lemmas are given separately in Appendices D and F.
With further assumptions, we show the consistency and asymptotic normality of that solves .
Assumption 4.
The following conditions hold for all missing pattern and all :
- A: For any sequence , implies
- B: For each -th entry and any , there exists an envelope function such that for any such that . Besides, when .
Assumption 4.A is a standard regularity assumption for Z-estimation. Assumption 4.B corresponds to the continuity assumption on with respect to . For example, the Lipschitz class of functions satisfies this condition if is compact. More precisely, there exists a uniform envelope function such that for any where . Now, we establish the theorem.
Theorem 4.
Suppose that Assumptions 1–4 hold. Then
and
where is the asymptotic variance bound in Theorem 1. Therefore, is semiparametrically efficient.
Similar to Assumption 3.B, suppose that the set of derivatives of of estimating functions satisfies , and is continuous in a neighborhood of . Also suppose that there exists a constant such that . Then,
is a consistent estimator of . Furthermore, let be an estimator of the conditional expectation such that for each -th entry. Then, is a consistent estimator of where
(11) |
The proof is given in Appendix F.
5 Simulation
A simulation study is conducted to evaluate the finite-sample performance of the proposed estimators. We designed three missing mechanisms that satisfy the CCMV condition. For each setting, we simulated 1,000 independent data sets, each of size 1,000, where , are generated independently from a truncated standard normal distribution with support . We considered a logistic regression model where the true coefficients are the parameters of interest. The response variable is always observed, while could be missing. Four non-monotone response patterns are considered: , , and . The categorical variable indicating the response patterns is generated from a multinomial distribution with the probabilities shown in Appendix A.
Method | Bias (4 coefficients) | | | | MSE (4 coefficients) | | | |
---|---|---|---|---|---|---|---|---|
Setting 1 | ||||||||
Full | -0.022 | 0.007 | -0.010 | 0.009 | 0.016 | 0.012 | 0.012 | 0.012 |
Complete-case | 0.512 | 0.239 | 0.002 | 0.103 | 0.297 | 0.100 | 0.031 | 0.048 |
Mean-score | 0.218 | 0.098 | -0.004 | 0.084 | 0.077 | 0.050 | 0.034 | 0.047 |
True-weight | -0.065 | 0.086 | -0.044 | 0.058 | 0.049 | 0.094 | 0.059 | 0.072 |
Entropy-linear | -0.067 | 0.082 | -0.045 | 0.057 | 0.038 | 0.091 | 0.059 | 0.071 |
Entropy-parametric | -0.067 | 0.082 | -0.045 | 0.057 | 0.038 | 0.091 | 0.059 | 0.071 |
Entropy-basis | -0.039 | 0.064 | -0.027 | 0.088 | 0.030 | 0.053 | 0.044 | 0.054 |
Proposed | -0.024 | 0.060 | -0.032 | 0.112 | 0.033 | 0.047 | 0.043 | 0.054 |
Setting 2 | ||||||||
Full | -0.022 | 0.007 | -0.010 | 0.009 | 0.016 | 0.012 | 0.012 | 0.012 |
Complete-case | 0.943 | 0.310 | -0.176 | -0.089 | 0.921 | 0.152 | 0.070 | 0.043 |
Mean-score | 0.318 | 0.085 | 0.059 | 0.052 | 0.140 | 0.073 | 0.045 | 0.041 |
True-weight | -0.045 | 0.080 | -0.063 | 0.038 | 0.050 | 0.100 | 0.067 | 0.065 |
Entropy-linear | -0.001 | 0.126 | -0.111 | 0.051 | 0.041 | 0.118 | 0.081 | 0.072 |
Entropy-parametric | -0.056 | 0.088 | -0.065 | 0.044 | 0.042 | 0.092 | 0.065 | 0.065 |
Entropy-basis | -0.062 | 0.054 | -0.058 | 0.070 | 0.042 | 0.074 | 0.064 | 0.067 |
Proposed | -0.023 | 0.029 | -0.044 | 0.089 | 0.029 | 0.051 | 0.055 | 0.057 |
Setting 3 | ||||||||
Full | -0.022 | 0.007 | -0.010 | 0.009 | 0.016 | 0.012 | 0.012 | 0.012 |
Complete-case | 0.805 | 0.889 | -0.043 | 0.252 | 0.689 | 0.892 | 0.041 | 0.111 |
Mean-score | 0.253 | 0.119 | 0.002 | -0.071 | 0.098 | 0.090 | 0.050 | 0.061 |
True-weight | -0.076 | 0.195 | -0.115 | 0.081 | 0.115 | 0.220 | 0.134 | 0.129 |
Entropy-linear | -0.252 | 0.349 | -0.153 | -0.022 | 0.135 | 0.295 | 0.149 | 0.096 |
Entropy-parametric | -0.158 | 0.265 | -0.136 | 0.104 | 0.098 | 0.218 | 0.151 | 0.143 |
Entropy-basis | -0.084 | 0.187 | -0.129 | 0.109 | 0.135 | 0.244 | 0.147 | 0.145 |
Proposed | 0.016 | 0.035 | -0.021 | 0.047 | 0.039 | 0.054 | 0.052 | 0.071 |
We first analyzed the simulated data with the full dataset (Full), which is the ideal case with no missingness. We then analyzed the data in the complete case pattern (Complete-case), for which data in all missing patterns are discarded, and an unweighted analysis is used for the remaining data. As we described before, the mean-score approach typically requires the discretization of continuous variables. To implement the mean-score method in Chatterjee and Li [6] (Mean-score), we created discrete surrogates as . Note that the mean-score method makes a missing-at-random (MAR) assumption. Next, we considered the inverse propensity weighting methods with the true inverse propensity weights (True-weight). We also examined the performance of the estimators based on the estimated propensity odds using the entropy loss. Tchetgen Tchetgen et al. [28] modeled each propensity odds using linear logistic regression (Entropy-linear), i.e., the logarithm of propensity odds, , is a linear function of the shared variables in patterns and , which is correctly specified only for setting 1. For settings 2 and 3, we also applied a parametric logistic regression with the correct model (Entropy-parametric). Note that these are unpenalized methods. When implementing the proposed methodology (Proposed), we set the basis functions as the tensor products of functions of variables shared in patterns and , where polynomials of degrees up to three are chosen for each continuous variable, and a binary indicator function is chosen for the discrete response . Then, we orthogonalized the basis functions and followed the choices of and matrix in Section 3.3.3 to construct the penalty function. Notice that the constant and linear functions have zero roughness. In practice, to achieve stable estimation of the propensity odds, we set the tolerance to the empirical imbalances of these functions to be the same as the smallest positive roughness of the other basis functions. We also applied the above basis functions and penalties with the entropy loss (Entropy-basis) to estimate the propensity odds, to compare the tailored and entropy loss functions. The biases and mean squared errors of each coefficient are shown in Table 1.
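For reference, a sketch of the basis construction described above (our reconstruction; the exact crossing with the response indicator may differ from the authors' implementation):

```python
import numpy as np
from itertools import product

def tensor_product_basis(X_cont, y_binary=None, degree=3):
    """Tensor-product polynomial basis for one missing pattern.

    X_cont   : (n, d) continuous variables shared with the complete cases
    y_binary : optional length-n binary response shared with the complete cases
    Returns tensor products of per-variable polynomials of degree 0..degree,
    optionally crossed with the indicator of the binary response.
    """
    n, d = X_cont.shape
    polys = [np.column_stack([X_cont[:, j] ** p for p in range(degree + 1)])
             for j in range(d)]                       # 1, x, x^2, x^3 per variable
    cols = [np.prod([polys[j][:, p] for j, p in enumerate(powers)], axis=0)
            for powers in product(range(degree + 1), repeat=d)]
    basis = np.column_stack(cols)
    if y_binary is not None:
        basis = np.column_stack([basis, basis * y_binary[:, None]])
    return basis
```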
Setting | Sample size | Avg Est SD/MC SD (4 coefficients) | | | | Coverage probabilities (4 coefficients) | | | |
---|---|---|---|---|---|---|---|---|---|
Setting 1 | N=1000 | 0.937 | 0.861 | 0.961 | 0.970 | 0.938 | 0.908 | 0.930 | 0.926 |
 | N=2000 | 0.939 | 0.979 | 0.990 | 1.032 | 0.937 | 0.942 | 0.940 | 0.934 |
 | N=5000 | 1.023 | 1.091 | 1.040 | 1.164 | 0.947 | 0.958 | 0.957 | 0.962 |
 | N=10000 | 1.079 | 1.115 | 1.039 | 1.219 | 0.960 | 0.959 | 0.954 | 0.962 |
Setting 2 | N=1000 | 0.978 | 0.843 | 0.892 | 0.958 | 0.946 | 0.897 | 0.930 | 0.934 |
 | N=2000 | 0.926 | 0.952 | 0.916 | 0.950 | 0.930 | 0.931 | 0.931 | 0.928 |
 | N=5000 | 0.967 | 0.971 | 0.933 | 1.037 | 0.944 | 0.919 | 0.930 | 0.946 |
 | N=10000 | 1.061 | 1.097 | 0.984 | 1.041 | 0.959 | 0.962 | 0.951 | 0.946 |
Setting 3 | N=1000 | 0.877 | 0.793 | 0.856 | 0.856 | 0.899 | 0.893 | 0.904 | 0.916 |
 | N=2000 | 0.918 | 0.932 | 0.925 | 0.905 | 0.921 | 0.924 | 0.940 | 0.921 |
 | N=5000 | 0.952 | 1.018 | 1.007 | 0.947 | 0.936 | 0.951 | 0.959 | 0.939 |
 | N=10000 | 0.992 | 1.051 | 1.012 | 1.010 | 0.946 | 0.959 | 0.950 | 0.947 |
In all three settings, the mean-score method performs poorly (considerable bias over ), which is likely attributable to the misspecified missing mechanism and the discretization. In settings 1 and 2, compared with the linear model and the correctly specified parametric model, using basis functions as regressors and adding penalization alleviates overfitting, so the inverse propensity weights and estimates are more stable. The proposed method provides comparable or even smaller errors in these two settings. This is expected since the tailored loss also encourages the balance of the observed variables. In setting 3, all the methods misspecify the missing mechanism, but the proposed method outperforms the others.
We also assess the performance of the asymptotic variance estimators (4) and the confidence interval coverage using the estimated standard errors. The results in Table 2 show that the average of the estimated asymptotic standard deviations is close to the Monte Carlo standard deviations, which are the standard deviations of the estimated parameters across the 1,000 datasets. The confidence intervals constructed from the estimated asymptotic variances have close-to-nominal coverage, especially as the sample size increases.
6 Real data application
We apply the proposed method to a study of hip fracture risk factors among male veterans [1]. The binary outcome of interest was the presence or absence of hip fracture. Preliminary exploratory analyses suggested that nine risk factors are potentially significant. A detailed description of the data set and its missing data patterns is in Chen [7]. Three (BMI, HGB, and Albumin) of these factors are continuous variables, and the remaining six (Etoh, Smoke, Dement, Antiseiz, LevoT4, Antichol) are discrete variables. Including , where all variables are observed, there is a total of 38 missingness patterns. The parameters of interest are the coefficients of a logistic regression model with these nine variables as the predictors and the presence or absence of hip fracture as the binary outcome.
We implemented the same Complete-case analysis as in Chen [7] and our proposed method under the CCMV assumption. Like the simulation study, we generated orthogonalized tensor product basis functions for continuous variables and considered two penalties constructed by the roughness norm. Indicator functions are created for each discrete variable as basis functions. Since there are many combinations of possible values of discrete variables, we do not consider any further tensor products. In other words, the indicator functions of discrete variables are added to the log-linear model for propensity odds in an additive way. For example, if (BMI, HGB, Dement, Antiseiz), the propensity odds model is
where are tensor products of two sets of functions, each of which contains polynomials up to degree 3 in one continuous variable. As in the simulation study, for stability we penalize the indicator functions, which play a role similar to lower-order moments. The relative tolerance to the empirical imbalances of the indicator functions was chosen to be the same as that for the constant and linear functions.
We present the parameter estimates, standard errors, and p-values in Table 3. The results show that our proposed estimators have standard errors similar to those from the complete-case analysis. All nine predictor variables are statistically significant at the level. Our results largely agree with those of Chen [7]. One apparent difference is that the coefficient of LevoT4 is significant in our analysis but not in Chen [7]. The main difference between the two estimators is the missing mechanism assumptions.
One may notice that some patterns have very few observations; 26 patterns have at most three observations. This may raise concerns about the stability of the propensity odds estimator for those patterns. By definition, the true propensity odds between a rare pattern and the complete cases should be negligible, so the contribution of rare patterns to the combined weights should also be negligible. Although the sample size may be too small to guarantee the asymptotic properties, the estimated odds for rare patterns under the proposed penalized estimation contribute only slightly to the total weights. Thus, the total weights and the weighted estimator for the parameters of interest remain stable.
Parameters | Complete-case | | | Proposed | | |
---|---|---|---|---|---|---|
 | Estimate | SE | p-value | Estimate | SE | p-value |
Etoh | 1.391 | 0.391 | 1e-3 | 1.410 | 0.397 | 1e-3 |
Smoke | 0.929 | 0.400 | 0.020 | 0.972 | 0.386 | 0.012 |
Dementia | 2.509 | 0.724 | 0.001 | 2.595 | 0.689 | 1e-3 |
Antiseiz | 3.311 | 1.064 | 0.002 | 3.435 | 0.935 | 1e-3 |
LevoT4 | 2.010 | 1.015 | 0.048 | 1.941 | 0.771 | 0.012 |
AntiChol | -1.918 | 0.768 | 0.012 | -1.872 | 0.796 | 0.019 |
BMI | -0.104 | 0.039 | 0.007 | -0.100 | 0.040 | 0.013 |
logHGB | -2.597 | 1.202 | 0.031 | -2.541 | 1.232 | 0.039 |
Albumin | -0.911 | 0.353 | 0.010 | -0.857 | 0.378 | 0.024 |
7 Discussion
We studied the estimation of model parameters defined by estimating functions with non-monotone missing data under the complete-case missing variable condition, which is a case of missing not at random. Using tailored loss for balance, functional bases for flexible modeling and penalization for stable estimation, we propose a method that can be viewed as a generalization of both inverse probability weighting and mean-score approach. The proposed framework improves the reliability of inferences and has significant implications for the analysis of datasets with non-monotone missingness, which is frequently encountered in biomedical data. Given the limited availability of analytical tools for addressing non-monotone missingness, we hope to fill an important gap in the literature.
Despite its strengths, the proposed method has certain limitations. The CCMV assumption, while plausible, is inherently unverifiable from observed data, which may limit its acceptance in some applications. Recently, [10] generalized the assumption to pattern graphs, which allow detailed relationships among non-monotone missing patterns to be assumed. The CCMV assumption is a special case of a pattern graph: an example of such a graph is shown in Figure 1, in which all the missing patterns have the same and only parent, the fully observed pattern.
A future research direction is to extend our tailored loss estimation to allow more flexible identifying assumptions, which are represented by general pattern graphs. The proposed method is well-suited to serve as a tool for these assumptions. Another important direction is to investigate the impact of misspecification on the identifying assumptions and explore ways to incorporate robustness checks. This includes designing sensitivity analyses that account for potential deviations from CCMV or other assumption structures.
Acknowledgements
This work is partly supported by the National Science Foundation (DMS-1711952). Portions of this research were conducted with the advanced computing resources provided by Texas A&M High Performance Research Computing.
Appendix A Details of simulation settings
The probabilities of data belonging to the four response patterns are respectively:
where is the true propensity odds, which only depends on the shared variables under CCMV condition, for .
Setting 1
The logarithms of true propensity odds are:
Setting 2
The logarithms of true propensity odds are:
Setting 3
The logarithms of true propensity odds are:
We assume that the number of basis functions increases with sample size . Precisely, one should predetermine a way to choose depending on the sizes of the fully observed pattern and the missing pattern , , which also increase with . In the following proofs, we use “a large enough ” to describe the requirements on the sample size, while the explicit conditions may be related to or .
Appendix B Proof of Theorem 1
Proof sketch: To show is the efficiency bound, we closely follow the structure of semiparametric efficiency bound derivation of [19], [3] and [9]. Briefly speaking, the efficiency bound is defined as the supremum of the Cramer-Rao bounds for all regular parametric submodels. If it exists, the efficient estimator should have the efficient influence function that gives the minimum , which is known as the asymptotic variance of such an estimator. Based on Theorem 2.2 in Newey [19], the regular and asymptotically linear estimator has the influence function satisfying (12). Theorem 3.1 in Newey [19] reveals that the efficient influence function is in the tangent space where is defined as the mean square closure of all -dimensional linear combinations of scores for all smooth parametric submodels.
Proof.
Consider an arbitrary parametric submodel for the likelihood function of
where gives the true distribution of observed variables in pattern . The resulting score function is given by
where for . Recall that the parameter of interest is the solution to and thus is a function of . Pathwise differentiability follows if we can find an influence function for all regular parametric submodels such that
(12) |
To simplify notation, is also used as the true parameter value when the context is clear. The chain rule and the Leibniz integral rule (differentiating under the integral) give
Therefore,
By the identification assumption, . Thus, the marginal density
Define where . Then,
The first term is
For each ,
The first integral is
Recall that . The second integral is
since
and
where and . Therefore,
Notice that . Recall that we denote the derivative at as . Let the influence function where
Thus, equation (12) is satisfied, and is pathwise differentiable. To calculate the efficiency bound, we need to find the efficient influence function, which is the projection of on the tangent set. The tangent set is defined as the mean square closure of all -dimensional linear combinations of scores for smooth parametric submodels,
where is a constant matrix with rows. It can be verified by similar arguments as in [19]. Note that each score has the same form of decomposition, which is the summation of partial scores for each missing pattern . It is also easy to see that is linear and belongs to the tangent set where taking the role of for and the rest taking the role of . Therefore all the conditions of Theorem 3.1 in [19] hold, so the efficiency bound for regular estimators of the parameter is given by where . ∎
Appendix C Proof of Theorem 2
Proof sketch: Assumption 2.D assumes that there is a close approximation of the true propensity odds. We will show our estimator is close to the approximation. With the help of a few inequalities, the distance of functions is proportionally bounded by the distance of coefficients. The key to the proof is to show the distance between two coefficients converges with a certain order. The problem is converted to the study of a quadratic form with random coefficients in Lemma 3. The quadratic coefficients form a symmetric random matrix. By the Weyl inequality, we can connect the random matrix with the magnitude of basis functions. So, we can apply the matrix Bernstein inequality to provide bounds on the spectral norm, i.e., the largest eigenvalue. Similarly, we can show that the linear coefficients are also bounded. Lemmas 4 and 5 provide the bound for the quadratic and linear coefficients respectively.
Proof.
By the triangle inequality and Assumption 2.D,
Since the exponential function is locally Lipschitz continuous, if . By the triangle inequality, Assumption 2.A, and Assumption 2.D,
Thus, there exists large enough such that for all . Therefore,
(13) |
if . By the Cauchy inequality and Assumption 2.E, for any . By Lemma 3, . More precisely, for any , there exists a finite and such that
for any . Considering the complementary event, we can find large enough such that which makes the inequality (13) hold for any . Then,
for all . In other words,
Now, we consider the norm.
Following similar arguments, when , we have
Thus, . Therefore,
∎
Appendix D Proof of Theorem 3
Proof sketch: We further decompose the error terms and show that the first three terms converge to 0 with the rate faster than , and the last term contributes as the influence function. Since the components in the decomposition involve the estimator, they should be treated as random functions. So, we consider the uniform convergence over a set of functions and apply the theory in Van der Vaart [29]. With the maximal inequality via bracketing, the problem is converted to the control of the entropy integral, which requires the calculation of bracketing numbers. Lemmas 12–15 are bracketing inequalities which could be of independent interest.
Appendix E Proof of Theorem 4
Proof sketch: First, by Assumption 4.A, the convergence of should be implied by the uniform convergence of over in a compact set. Second, we study the convergence of and apply the Delta method to obtain the limiting distribution of . The functional version of the central limit theorem, i.e. Donsker’s theorem, is applied to achieve uniform convergence.
Proof.
Denote the empirical average as and the centered and scaled version as . Recall the proposed weighted average is
Since is the solution to , by Lemma 9,
By identifiability condition Assumption 4.A, .
Next, we investigate the asymptotic normality of . Although has a form of expectation over the population, it can be viewed as a random vector because depends on the observations. Since , one would expect that converges to in some way. If the limiting distribution is known, one could apply the Delta method to obtain limiting distribution of . From Theorem 3, we have the asymptotic normality of for any fixed . It is natural to consider
(14) |
The above difference has a similar form to the asymptotic equicontinuity, which can be derived if the function class is Donsker. More precisely, consider the class of -th entry of the estimating functions, . It is Donsker due to Theorem 19.5 in [29] and
Then, by Section 2.1.2 in [30], we have the following asymptotic equicontinuity: for any , there exists and such that for all ,
where the seminorm is defined as . Consider
Notice that . By Assumption 4.B, for any , there exists an envelope function such that
where when . Thus, there exists small enough such that for all . Then, by the consistency of , there exists such that for all ,
Thus,
Note that the event and
happening together implies . By taking complementary event, for any , we obtain
That is, for each -th entry,
(15) |
Therefore, by the comparison between terms in (14) and , we should consider
which can be decomposed as the following terms.
where
By Lemmas 6, 7, and 8, for any missing pattern and . Combining with Lemmas 10 and 11, we have
(16) |
By Equations (16) and (15), we have
Since and , the above equation can be rewritten as
By Theorem 3,
Since is nonsingular, by multivariate Delta method,
Therefore, is semiparametrically efficient.
Lastly, we look into the estimator for the asymptotic variance. We have the following decomposition:
where .
Consider the first term on the right hand side. By Theorem 2,
Let
Following similar arguments as in Lemma 5, one can see that
Consider the second term on the right hand side. Notice that is the Jacobian matrix of . We consider all the entries of together and abbreviate the subscripts in the following statements. Define a set of functions where . Similar to Lemma 15, one can show that
since . Therefore, by Theorem 19.4 in [29], is P-Glivenko-Cantelli. Since the set includes all entries of the Jacobian matrix, we consider the Frobenius/Euclidean norm of a matrix to construct the following convergence result.
where is the Euclidean norm of a matrix. The fact that implies
Finally, since and is continuous in a neighborhood of . Therefore, is a consistent estimator of .
We skip the details and only provide a sketch of the remaining argument. Notice that each component of converges to the corresponding true value. Therefore, and are consistent estimators of and , respectively. Since is a standard sandwich estimator, it is easy to show that it is a consistent estimator of the above asymptotic variance. ∎
Appendix F Related Lemmas
Lemma 1 (Weyl’s inequality).
Let and be Hermitian matrices and . Suppose their respective eigenvalues are ordered as follows:
Then, the following inequalities hold.
In particular, if is positive semi-definite, plugging into the above inequalities leads to
Lemma 2 (Bernstein’s inequality).
Let be a sequence of independent random matrices with dimensions . Assume that and almost surely for all and some constant . Also assume that
Then, for all ,
Proof.
It suffices to show for any , there exists and such that
(17) |
for any . It means that the minimizer is in a small neighbourhood of with probability higher than . Consider the set for an arbitrary constant . Since is a convex function of , the minimizer if . Thus, considering the complementary event, we have
Recall that for any and any , the objective function is
where and . To investigate , we apply the mean value theorem. There exists some satisfying , which is the interior of , such that for any ,
By the triangle inequality and Cauchy inequality, the difference between penalties satisfies
Denote the constant as . Then, by the Cauchy inequality again,
First, let’s have a look at the quadratic coefficient. Since is a positive semi-definite matrix, . Thus, by Lemma 4, the quadratic terms are bounded from below. More precisely, for any , there exists such that for any ,
Next, let’s investigate the bound of the linear coefficient. By Assumption 2.E, . By Lemma 5, for any , there exists and a constant such that for any ,
Considering the complement of the above event and the fact that , we have
for any . Note that . Choosing , we have for any . Therefore, inequality (17) holds which completes the proof. ∎
Lemma 4.
There exists a constant such that the Hessian matrix of at satisfies
where represents the minimal eigenvalue of the matrix.
Proof.
Denote the Hessian matrix of at as :
Recall that the set and . Following similar arguments as in the proof of Theorem 2 (Appendix C), it can easily be shown that
Then, there exists such that holds for any . Let
It’s easy to see that matrices are symmetric. Based on the above discussions, is positive semi-definite for large enough . By Assumption 1.B, is also positive semi-definite. Applying Lemma 1, we have , and . Therefore, . To study , we apply Lemma 2. Let
So, . By the triangle inequality, Lemma 1 and the fact that ,
By Assumption 2.E, Assumption 2.F and the fact that , . Similarly,
Taking in Lemma 2 for an arbitrary constant , we have
for large enough and some constant . In other words, . Therefore, for any there exists such that
for any . ∎
Lemma 5.
The gradient of at satisfies
Proof.
The gradient of at is
Thus, by the triangle inequality,
Consider the first term on the right hand side. Let . It’s easy to see that are i.i.d. and . Thus,
By Assumption 2.E, . By the Markov inequality, this implies . As for the second term on the right hand side, let where and where . Then,
Following similar arguments as in the proof of Lemma 4, it is easy to see that
By Assumption 2.D, . Thus, and
∎
Proof.
Consider the following empirical process.
where and is an arbitrary function, which can be viewed as an estimator of true propensity odds . By Theorem 2, for any , there exists constants and such that for any ,
Let and consider the set of functions
By Assumption 1.C, for any ,
Define . To simplify notations, vectors means that for each entry, and vector means that for each entry where is a constant.
Notice that . Thus,
By Markov’s inequality, for any , we have
If we can show , then for any and fixed , there exists and such that for any ,
Then, for any , by taking and appropriately choosing , , and , we have the above inequalities and for any ,
That is, .
To show , we utilize the maximal inequality with bracketing (Corollary 19.35 in [29]). Define the envelope function . It is easy to see that for any . Besides, due to Assumption 3.D, for each -th entry,
To save notations, and are used as their -th entry. We also omit the subscripts “” of some sets of functions where the related inequalities should hold for each -th entry.
By the maximal inequality,
To study the entropy integral of , we split function into two parts and consider two sets of functions where and where . Notice that , and when is large enough. By Lemma 12,
Define for some constant and . It is obvious that . Since is a fixed function,
The true propensity score odds is unknown, but its roughness is controlled by Assumption 3.B. Thus, we should not consider much rougher functions. In other words, our models for the propensity score odds should satisfy a similar smoothness condition. There exists an appropriate constant such that . Thus,
Define a set of functions . Notice that  By Lemma 13, Assumption 3.E and Lemma 14,
Combining the above inequalities and recalling Assumption 3.B and Assumption 3.C,
since and as . Therefore, and . ∎
Proof.
Consider the following empirical process.
where and and are arbitrary functions. By Theorem 2, for any , there exists constants and such that for any ,
Besides, by Assumption 3.A, . So, we consider the set of functions
where and . Then, for any ,
The last line holds due to the fact that and Assumption 3.A and Assumption 3.E. Plug in our estimator and define . Then, and
Similarly, we need to show . Define the envelope function . It is easy to see that for any when is large enough. By the maximal inequality with bracketing,
To study the entropy integral of , we first compare it with . It is apparent . Then, we split function into two parts and consider two sets of functions where and where . Notice that and when is large enough. By Lemma 12,
Notice that . Since is a fixed function, and , by Lemma 15,
It is obvious that any -brackets equipped with norm are also -brackets in . With similar arguments in the proof of Lemma 6, we have
Define a set of functions . Similarly,
Similarly, we split into two parts. Define a set of functions where . By Lemma 13,
Also define a set of functions . Although the set is unknown, we should not consider much rougher functions than those in . Therefore, there exists a constant such that . Thus, by Lemma 14,
By Assumption 3.B, Assumption 3.C, and the above inequalities,
since and as . So, and . ∎
Proof.
Notice that is related to the balancing error:
where denotes the log transformation of the propensity odds model. For a similar reason, the roughness of the approximation functions is bounded. Besides, by Assumption 2.D, . Thus, . ∎
Proof.
By Lemmas 6, 7, and 8, we only need to show where
Consider the following decomposition. Let where . It is easy to see that for any ,
For any measurable function , . Thus,
By Theorem 19.4 in [29], is Glivenko-Cantelli. Thus,
Also let where and where . Similarly,
Notice that . Besides, the convergence almost surely implies the convergence in probability. Thus, . Then,
∎
Proof.
Consider the following empirical process.
where . Pick any decreasing sequence . Since , for any and each , there exists a constant such that for any ,
(18)
Consider the set of functions . It is easy to check that . Plug in our estimator and define . Notice that . Thus,
Similarly, we only need to show . Define the envelope function , where is the envelope function in Assumption 4.B. So, for any . Besides, . By the maximal inequality,
Define a set of functions where . Since , by Lemma 15,
Define a set of functions . Since is a fixed function, . Since as , we can take small enough such that the set . So, , and . Then,
since and as . Thus, and . ∎
Proof.
Consider the following empirical process.
where . Similarly, inequality (18) holds and . Consider the set of functions . To show , we need to show . Define the envelope function . It is easy to see that for any and . Applying the maximal inequality,
Define a set of functions where . Since , by Lemma 15,
Define a set of functions . Similarly, since is a fixed function, . Take small enough such that the set . Then, , and by Lemma 14,
Therefore,
since and as . Thus, and . ∎
Lemma 12.
Consider the set of functions . Assume that for all and for all . Then, for any ,
Proof.
Suppose are the -brackets that can cover and are the -brackets that can cover . Define the bracket for where :
For any function , there exist functions and such that . Besides, we can find two pairs of functions and such that , , , and . Then, where . Next, we look at the size of the new brackets. By simple algebra,
Furthermore, for any , we have . Therefore,
∎
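For intuition, a standard version of this product-bracketing bound, writing $C_{\mathcal{F}}$ and $C_{\mathcal{G}}$ for the uniform bounds assumed in Lemma 12 (the symbols and exact constants here are ours and may differ from the lemma's), is
\[
N_{[\,]}\big( (C_{\mathcal{F}} + C_{\mathcal{G}})\,\varepsilon,\ \mathcal{F}\cdot\mathcal{G},\ L_2(P) \big)
\le
N_{[\,]}\big( \varepsilon, \mathcal{F}, L_2(P) \big)\, N_{[\,]}\big( \varepsilon, \mathcal{G}, L_2(P) \big),
\]
obtained, as in the proof above, by forming each product bracket from the pointwise minimum and maximum of the four corner products of a pair of brackets.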
Lemma 13.
Consider the set of functions . Assume that for all and for all . Then,
Proof.
Suppose are the -brackets that can cover and are the -brackets that can cover . Define the bracket for and :
For any function , there exist functions and such that . Besides, we can find two pairs of functions and such that , , , and . Then, where and
Therefore,
∎
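For intuition, the analogous standard bound for sums follows from adding brackets termwise and the triangle inequality (again, the exact constant here is ours and may differ from Lemma 13):
\[
N_{[\,]}\big( 2\varepsilon,\ \mathcal{F} + \mathcal{G},\ L_2(P) \big)
\le
N_{[\,]}\big( \varepsilon, \mathcal{F}, L_2(P) \big)\, N_{[\,]}\big( \varepsilon, \mathcal{G}, L_2(P) \big).
\]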
Lemma 14.
Let , and be the sets of functions as we defined before. Then,
Proof.
Suppose are the -brackets that can cover . Define and . Then, for any , there exists such that with a pair of functions satisfying and . Then, and
So, are the -brackets that can cover and
For any , there exists such that . Similarly,
∎
Lemma 15.
Let be a fixed bounded function. Assume . We consider two function classes and for a fixed constant . Then,
Proof.
Suppose are the -brackets that can cover . That is, for any , we can find a pair of functions such that and . Then, for any , either or holds. Define and . For any , there exists such that and a pair of functions such that and
So, are the -brackets that can cover and
∎
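For intuition, the key fact behind Lemma 15 is that translating a class by a fixed bounded function $h$ does not increase bracketing numbers: if $[l, u]$ brackets $f$, then $[h + l, h + u]$ brackets $h + f$ with the same $L_2(P)$ size, so (in generic notation)
\[
N_{[\,]}\big( \varepsilon,\ \{\, h + f : f \in \mathcal{F} \,\},\ L_2(P) \big)
\le
N_{[\,]}\big( \varepsilon,\ \mathcal{F},\ L_2(P) \big).
\]
Restricting attention to the subclass satisfying the additional norm constraint in the lemma cannot increase the bracketing number further.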
References
- Barengolts [2001] Barengolts, E. (2001). Risk factors for hip fracture in predominantly African-American veteran male population. Journal of Bone and Mineral Research 16, S170.
- Begun et al. [1983] Begun, J. M., W. J. Hall, W.-M. Huang, and J. A. Wellner (1983). Information and asymptotic efficiency in parametric-nonparametric models. The Annals of Statistics 11(2), 432–452.
- Bickel et al. [1993] Bickel, P. J., C. A. J. Klaassen, Y. Ritov, and J. A. Wellner (1993). Efficient and adaptive estimation for semiparametric models, Volume 4. Springer.
- Chan et al. [2016] Chan, K. C. G., S. C. P. Yam, and Z. Zhang (2016). Globally efficient non-parametric inference of average treatment effects by empirical balancing calibration weighting. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 78(3), 673–700.
- Chatterjee et al. [2003] Chatterjee, N., Y.-H. Chen, and N. E. Breslow (2003). A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association 98(461), 158–168.
- Chatterjee and Li [2010] Chatterjee, N. and Y. Li (2010). Inference in semiparametric regression models under partial questionnaire design and nonmonotone missing data. Journal of the American Statistical Association 105(490), 787–797.
- Chen [2004] Chen, H. Y. (2004). Nonparametric and semiparametric models for missing covariates in parametric regression. Journal of the American Statistical Association 99(468), 1176–1189.
- Chen [2007] Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. Handbook of Econometrics 6, 5549–5632.
- Chen et al. [2008] Chen, X., H. Hong, and A. Tarozzi (2008). Semiparametric efficiency in GMM models with auxiliary data. The Annals of Statistics 36(2), 808–843.
- Chen [2022] Chen, Y.-C. (2022). Pattern graphs: a graphical approach to nonmonotone missing data. The Annals of Statistics 50(1), 129–146.
- Fan et al. [2022] Fan, J., K. Imai, I. Lee, H. Liu, Y. Ning, and X. Yang (2022). Optimal covariate balancing conditions in propensity score estimation. Journal of Business & Economic Statistics 41(1), 97–110.
- Horowitz and Mammen [2004] Horowitz, J. L. and E. Mammen (2004). Nonparametric estimation of an additive model with a link function. The Annals of Statistics 32(6), 2412 – 2443.
- Imai and Ratkovic [2014] Imai, K. and M. Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society Series B: Statistical Methodology 76(1), 243–263.
- Kang and Schafer [2007] Kang, J. D. and J. L. Schafer (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science 22(4), 523–539.
- Little [1993] Little, R. J. (1993). Pattern-mixture models for multivariate incomplete data. Journal of the American Statistical Association 88(421), 125–134.
- Little and Rubin [2019] Little, R. J. and D. B. Rubin (2019). Statistical analysis with missing data, Volume 793. John Wiley & Sons.
- Little and Schluchter [1985] Little, R. J. and M. D. Schluchter (1985). Maximum likelihood estimation for mixed continuous and categorical data with missing values. Biometrika 72(3), 497–512.
- Malinsky et al. [2022] Malinsky, D., I. Shpitser, and E. J. Tchetgen Tchetgen (2022). Semiparametric inference for nonmonotone missing-not-at-random data: the no self-censoring model. Journal of the American Statistical Association 117(539), 1415–1423.
- Newey [1990] Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics 5(2), 99–135.
- Newey [1997] Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics 79(1), 147–168.
- Reilly and Pepe [1995] Reilly, M. and M. S. Pepe (1995). A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82(2), 299–314.
- Robins and Gill [1997] Robins, J. M. and R. D. Gill (1997). Non-response models for the analysis of non-monotone ignorable missing data. Statistics in Medicine 16(1), 39–56.
- Robins et al. [1994] Robins, J. M., A. Rotnitzky, and L. P. Zhao (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association 89(427), 846–866.
- Rubin [1976] Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3), 581–592.
- Sadinle and Reiter [2017] Sadinle, M. and J. P. Reiter (2017). Itemwise conditionally independent nonresponse modelling for incomplete multivariate data. Biometrika 104(1), 207–220.
- Sinha et al. [2014] Sinha, S., K. K. Saha, and S. Wang (2014). Semiparametric approach for non-monotone missing covariates in a parametric regression model. Biometrics 70(2), 299–311.
- Tan [2020] Tan, Z. (2020). Regularized calibrated estimation of propensity scores with model misspecification and high-dimensional data. Biometrika 107(1), 137–158.
- Tchetgen Tchetgen et al. [2018] Tchetgen Tchetgen, E. J., L. Wang, and B. Sun (2018). Discrete choice models for nonmonotone nonignorable missing data: Identification and inference. Statistica Sinica 28(4), 2069.
- Van der Vaart [2000] Van der Vaart, A. W. (2000). Asymptotic statistics, Volume 3. Cambridge University Press.
- Wellner et al. [2013] Wellner, J. et al. (2013). Weak convergence and empirical processes: with applications to statistics. Springer Science & Business Media.
- Wong and Chan [2018] Wong, R. K. and K. C. G. Chan (2018). Kernel-based covariate functional balancing for observational studies. Biometrika 105(1), 199–213.
- Wood [2017] Wood, S. N. (2017). Generalized additive models: an introduction with R. CRC press.
- Yiu and Su [2018] Yiu, S. and L. Su (2018). Covariate association eliminating weights: a unified weighting framework for causal effect estimation. Biometrika 105(3), 709–722.
- Zhao [2019] Zhao, Q. (2019). Covariate balancing propensity score by tailored loss functions. The Annals of Statistics 47(2), 965–993.
- Zubizarreta [2015] Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association 110(511), 910–922.