\authormark

Yuliang et al.

\corres

*Yuliang Shi.

\presentaddress

200 University Ave W, Waterloo, ON, Canada, N2L 3G1.

Model Selection for Causal Modeling in Missing Exposure Problems

Yuliang Shi, Yeying Zhu, Joel A. Dubin

\orgdiv{Department of Statistics and Actuarial Science}, \orgname{University of Waterloo}, \orgaddress{\state{Waterloo}, \country{Canada}}


Abstract

In causal inference, properly selecting the propensity score (PS) model is an important topic and has been widely investigated in observational studies. There is also a large literature on the missing data problem. However, very few studies investigate model selection for causal inference when the exposure is missing at random (MAR). In this paper, we discuss how to select both the imputation and PS models so as to obtain the smallest root mean squared error (RMSE) of the estimated causal effect in our simulation study. We then propose a new criterion, called the “rank score”, for evaluating the overall performance of both models. The simulation studies show that the full imputation model combined with the outcome-related PS model leads to the smallest RMSE, and that the rank score can help select the best models. An application study is conducted to quantify the causal effect of cardiovascular disease (CVD) on the mortality of COVID-19 patients.

keywords:

Causal Inference; Propensity Score; Multiple Imputation Chained Equations; Rank Score.

articletype: RESEARCH ARTICLE

1 Introduction

In application studies, missing data can arise in multiple ways, and one common case is that the exposure of interest is not fully observed ¹. For example, a patient’s exposure status may not be fully recorded when the patient suddenly drops out of a clinical study, or an individual may decline to answer a sensitive question about a health problem in a survey questionnaire. Researchers have proposed various methodologies for handling cases where the outcome is missing ². However, when the exposure status is missing, few approaches are available in the literature ^{3, 4}.

In causal inference, propensity score (PS) analysis is a popular tool to address confounding ⁵. However, when the exposure status is missing at random (MAR), estimation of the causal effect is challenging. The case of MAR on the exposure differs substantially from MAR on the outcome. When the outcome is missing, only the estimation of the outcome model is influenced by the missing data; the estimation of the PS model is not affected. In contrast, when the exposure is missing, the estimation of all three models (imputation, PS, and outcome) is affected. Consequently, although PS-based methods have been widely discussed when the outcome is MAR, their application to cases where the exposure is missing is not straightforward ³. Careful adjustment for both missingness and confounding is necessary in such scenarios.

One common approach to dealing with missing data is imputation. Single imputation (SI) is easy to conduct based on the observed data ⁶. However, imputing the data only once may not be reliable and may induce a large variation ⁷. In contrast, the multiple imputation by chained equations (MICE) approach generates several imputed datasets through random sampling, and Rubin’s Rules can be applied to account for both the sampling variability and the uncertainty in the imputation of missing values ^{8, 9}.

In addition to imputation-based methods, inverse probability weighting (IPW) and double robust (DR) methods have been widely investigated in the literature ^{10, 11, 12}. When the exposure is MAR, two-layer DR estimators have been proposed to deal with both missingness and confounding ³. Later, a triple robust (TR) estimator was proposed to protect against misspecification of the missingness and imputation models ⁴. In addition, nonparametric estimators with efficiency bounds have been proposed using the efficient influence function ¹³. Compared with the imputation-based method, the DR and TR approaches allow more flexible conditions on model specification ¹⁴, but very few papers discuss the model selection problem when the exposure is MAR. Despite these more sophisticated methods for estimating causal effects when the exposure is MAR, in this article we focus on a more intuitive and straightforward approach: we first impute the missing exposure values via MICE and then employ PS-based methods, such as IPW or DR, to estimate causal effects on the imputed datasets. Therefore, we face the model selection issue for both the imputation model and the PS model.

In the previous literature, researchers have shown that improper selection of the PS model can result in higher bias and variance ¹⁵. However, little attention has been given to how to select both the imputation and PS models when the exposure is MAR. As a motivating example, we consider the case in which researchers aim to determine the causal relationship between the incidence of cardiovascular disease (CVD, the exposure) and COVID-19 patients’ mortality (the outcome) in an observational study where CVD status is not completely observed. In this situation, we need to adjust for both the missing CVD status and other confounding factors. However, it is not clear which variables should be selected for the imputation and PS models because we do not know the true missing values of CVD.

Another problem arising from the motivating example is the so-called “double-dipping issue”. If we impute the missing exposure values with all exposure-related covariates in the first step and then fit the PS model with the same set of covariates on the imputed dataset in the second step, the bias and variance of the estimated causal effect may increase for two main reasons: (1) we do not clearly distinguish the purposes of fitting the imputation and PS models; (2) we include redundant exposure-related covariates, instead of outcome-related variables, in the two models. More specifically, for the imputation model, we aim to increase its predictive ability so that the missing exposure is imputed correctly. In contrast, when selecting the PS model, we intend to balance the confounders across the exposed and unexposed groups rather than to increase its predictive performance ¹⁶. In addition, including all exposure-related covariates in the PS model is redundant because those variables are not the main confounders.

Even though fitting the imputation and PS models with the same set of exposure-related covariates is very common in application studies, it is important to investigate which variables should be included in each model to achieve the best performance of the causal estimates among candidate models. Furthermore, since we generally do not know the true causal effect in applications, evaluating the selected models through model selection criteria is also essential. Therefore, we propose a new rank-based criterion that combines the performance of the imputation and PS models, and we compare it with traditional criteria. The rest of this article is organized as follows. In Section 2, we describe the goal of our research and discuss the proposed criteria for conducting model selection. In Section 3, we design a simulation study to investigate model selection in both the imputation and PS models. In Section 4, an application study is conducted to identify the causal effect of CVD on mortality for a cohort of COVID-19 patients. The article ends with a discussion in Section 5.

2 Notation and Methods

2.1 Framework and Assumptions

Based on the counterfactual causal framework ¹⁷, we denote $(Y_{i}^{1},Y_{i}^{0}),i=1,\dots,n,$ as the potential outcomes if individual $i$ were exposed or unexposed (or equivalently, treated or untreated), respectively. Let $\bm{X}_{i}$ denote the covariates, $A_{i}$ denote the exposure of interest, $Y_{i}$ denote the binary outcome, and $R_{i}=I\{A_{i}\text{ is missing}\}$ denote the indicator of whether the exposure value is missing or not for individual $i=1,\dots,n$ . In this study, we focus on the exposure variable being missing, so all other variables are assumed to be completely observed. In addition, we study the model selection problem on a finite set of covariates instead of the high-dimensional setting.

For each individual $i$ , $Y_{i}^{1}$ and $Y_{i}^{0}$ cannot be observed at the same time; instead, we only observe $Y_{i}$ and $A_{i}$ . The observed dataset is $(\bm{X}_{i},A_{i},Y_{i},R_{i}),i=1,\dots,n$ , and the relationship between the observed and potential outcomes can be written as $Y_{i}=A_{i}Y_{i}^{1}+(1-A_{i})Y_{i}^{0}$ , for $i=1,\dots,n$ . The true propensity score (PS) is defined as $\pi(\bm{X}_{i})=P(A_{i}=1|\bm{X}_{i})$ for individual $i=1,\dots,n$ . We denote $\tau_{1}=E(Y^{1})$ and $\tau_{0}=E(Y^{0})$ as the average potential outcomes if treated or untreated, respectively. For the binary outcome in our application studies, we define the causal estimand of interest as the causal risk ratio: $\tau=\tau_{1}/\tau_{0}$ . Even though the study focuses on the binary outcome, the theory is developed in general scenarios and the key ideas presented in this article will not be largely changed by the different formats of the causal quantities, so the application can be easily expanded to other cases, such as the continuous outcome or count outcome. To estimate the causal risk ratio $\tau$ , three assumptions are required as follows:

1. Strongly ignorable treatment assignment assumption (SITA): $(Y^{1},Y^{0})\bot A|\bm{X}$.
2. Positivity assumption: $0<P(A=1|\bm{X}=\bm{x})<1$ for all possible $\bm{x}$.
3. MAR assumption: $R\bot A|\bm{X},Y$.

Here, the SITA assumption is also known as the assumption of no unmeasured confounders, which cannot be tested in application studies. The positivity assumption requires the propensity score to lie strictly between 0 and 1, which can be checked after imputing the missing exposure status correctly ¹⁷. The MAR assumption states that the missing indicator is conditionally independent of the exposure itself given all observed covariates and the outcome, which also cannot be tested on the observed data because the missing values are unknown ³.

The objective of this study is two-fold. First, since different components of $\bm{X}$ can play different roles, as shown in Figure 1, our primary goal is to find the best selection of covariates for both the imputation and PS models when the exposure is MAR. Second, we aim to find a suitable criterion to evaluate the performance of the two models in application studies when the DAG is not known.

2.2 An Illustrative Example

To study the different roles of covariates in model selection, we consider a simplified scenario with three covariates, which have different effects on the exposure or the outcome. As shown in Figure 1, $X_{1}$ is the main confounder, $X_{2}$ is the exposure-related covariate, and $X_{3}$ is the covariate related only to the outcome. We denote the $4\times 1$ vector of covariates for subject $i$ as $\bm{X}_{i}=(1,X_{i1},X_{i2},X_{i3})^{T}$ . In terms of assumptions described in the general scenario in Section 2.1, we can rewrite three main assumptions based on Figure 1 as follows:

1. Strongly ignorable treatment assignment assumption (SITA): $(Y^{1},Y^{0})\bot A|X_{1}$.
2. Positivity assumption: $0<P(A=1|\bm{X}=\bm{x})<1$ for all possible $\bm{x}$.
3. MAR assumption: $R\bot A|X_{1},X_{2}$.

Since $X_{1}$ is the “sufficient set” for confounding adjustment in this specific case ¹⁸, the SITA assumption only requires conditioning on the main confounder $X_{1}$ instead of all covariates. In addition, we do not include the association between $R$ and $Y$, as shown in Figure 1, since we want to investigate whether adding $X_{3}$ and $Y$ will increase the accuracy of the imputation model for the missing exposure. In that way, we only need to condition on $X_{1},X_{2}$ for the MAR assumption based on Figure 1.

Figure 1: An illustrative causal diagram for simulation studies. The black arrows refer to the causal relationships among confounders, the exposure, and the outcome. The double-sided dashed arrows refer to the associational relationships among the missing indicator, the outcome, and the covariates. The red arrow is the causal effect of primary interest. The dashed short line refers to no association between the missing indicator and the exposure given covariates and the outcome due to the MAR assumption.

To investigate the role of each variable in the imputation model, one approach is to rely on the theory of directed acyclic graphs (DAGs), a common tool that uses “d-separation” ¹⁹ to check conditional independence among variables without specifying the form of the models. In our specific example in Figure 1, we know that $X_{1},X_{2}$ should be included in the imputation model because they affect $A$ and $R$ directly. The main question is whether $Y$ should be included in the imputation model. Even though we do not include the outcome in the true missingness model in Figure 1, $Y$ is directly correlated with the exposure variable, so adding $Y$ may help predict the missing exposure values.

Next, we study the role of $X_{3}$ in the imputation model. Notice that $X_{3}\bot A$ marginally, but $Y$ is a function of $X_{3}$ and $R$ is also correlated with $X_{3}$ given $(X_{1},X_{2})$, as shown in Figure 1. In addition, $Y$ is a collider on the path $A\rightarrow Y\leftarrow X_{3}$, so conditioning on $Y$ induces dependence: $X_{3}\not\!\perp A|(Y,X_{1},X_{2})$. In other words, including the outcome and outcome-related variables is expected to improve the predictive ability of the imputation model. In summary, we should include all of $(X_{1},X_{2},X_{3},Y)$ in the imputation model, which will be verified later in the simulation studies.

2.3 Estimation

Before discussing the model selection strategy, we first discuss how to estimate $\tau$. As mentioned previously, estimating $\tau$ requires us to account for both missingness and confounding. For the missing data problem, we impute the exposure status via MICE under the MAR assumption ^{7, 8}. After the missing exposures are imputed, $\tau$ can be estimated using the inverse probability weighting (IPW) or double robust (DR) method to account for confounding.

In summary, we present an algorithm with three key steps as follows:

1. Fit the imputation (Imp) model under the MAR assumption using MICE: fit a logistic regression model for $A\sim\bm{X}+Y$ to obtain $A_{i}^{\text{imp}}$, the exposure in the imputed dataset, which equals the original exposure for individuals without missing values and the imputed exposure for individuals with missing values.
2. Fit the PS model using a logistic model on the imputed dataset, $A^{\text{imp}}\sim\bm{X}$, to obtain the fitted PS values, denoted as $\hat{\pi}_{i}(\bm{X}_{i})$.
3. Apply inverse weighting to adjust for confounding. The IPW estimator can be written as ²⁰:

\begin{equation}
\hat{\tau}_{1}^{\text{IPW}}=\frac{1}{n}\sum_{i=1}^{n}\frac{A_{i}^{\text{imp}}Y_{i}}{\hat{\pi}_{i}(\bm{X}_{i})},\qquad
\hat{\tau}_{0}^{\text{IPW}}=\frac{1}{n}\sum_{i=1}^{n}\frac{(1-A_{i}^{\text{imp}})Y_{i}}{1-\hat{\pi}_{i}(\bm{X}_{i})}. \tag{1}
\end{equation}

Then, an IPW estimator for $\tau$ is $\hat{\tau}^{\text{IPW}}=\hat{\tau}_{1}^{\text{IPW}}/\hat{\tau}_{0}^{\text{IPW}}$. Alternatively, the DR estimator can be written as:

\begin{equation}
\begin{aligned}
\hat{\tau}_{1}^{\text{DR}}&=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{A_{i}^{\text{imp}}Y_{i}}{\hat{\pi}_{i}(\bm{X}_{i})}-\frac{A_{i}^{\text{imp}}-\hat{\pi}_{i}(\bm{X}_{i})}{\hat{\pi}_{i}(\bm{X}_{i})}\hat{m}_{1}(A_{i}^{\text{imp}},\bm{X}_{i})\right],\\
\hat{\tau}_{0}^{\text{DR}}&=\frac{1}{n}\sum_{i=1}^{n}\left[\frac{(1-A_{i}^{\text{imp}})Y_{i}}{1-\hat{\pi}_{i}(\bm{X}_{i})}+\frac{A_{i}^{\text{imp}}-\hat{\pi}_{i}(\bm{X}_{i})}{1-\hat{\pi}_{i}(\bm{X}_{i})}\hat{m}_{0}(A_{i}^{\text{imp}},\bm{X}_{i})\right],
\end{aligned} \tag{2}
\end{equation}

where $\hat{m}_{1}(A_{i}^{\text{imp}},\bm{X}_{i})$ is the fitted response for the treatment group, i.e., $\hat{m}_{1}(A_{i}^{\text{imp}},\bm{X}_{i})=E[Y_{i}|A_{i}^{\text{imp}}=1,\bm{X}_{i};\hat{\bm{\beta}}_{1}]$, and $\hat{m}_{0}(A_{i}^{\text{imp}},\bm{X}_{i})$ is the fitted response for the control group, i.e., $\hat{m}_{0}(A_{i}^{\text{imp}},\bm{X}_{i})=E[Y_{i}|A_{i}^{\text{imp}}=0,\bm{X}_{i};\hat{\bm{\beta}}_{0}]$. Here, $(\hat{\bm{\beta}}_{1},\hat{\bm{\beta}}_{0})$ are the estimated parameters for the treatment and control groups from the outcome model, written as $Y\sim A+\bm{X}$. Then, a DR estimator for $\tau$ is $\hat{\tau}^{\text{DR}}=\hat{\tau}_{1}^{\text{DR}}/\hat{\tau}_{0}^{\text{DR}}$.
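The two estimators can be sketched for a single imputed dataset as follows. This is a minimal illustration, not the authors' implementation: the function name `risk_ratio_ipw_dr` and the toy inputs are hypothetical, the fitted propensity scores and outcome predictions are assumed to be supplied by the user, and the DR means are written in the standard augmented IPW form (fitted outcome mean plus an inverse-weighted residual).

```python
def risk_ratio_ipw_dr(a_imp, y, ps, m1, m0):
    """Return (tau_ipw, tau_dr), the estimated causal risk ratios.

    a_imp : imputed binary exposure (0/1) for each individual
    y     : binary outcome (0/1)
    ps    : fitted propensity scores, assumed strictly inside (0, 1)
    m1,m0 : fitted outcome means under exposure / no exposure
    """
    n = len(y)
    rows = list(zip(a_imp, y, ps, m1, m0))

    # IPW means of the two potential outcomes.
    tau1_ipw = sum(a * yi / p for a, yi, p, _, _ in rows) / n
    tau0_ipw = sum((1 - a) * yi / (1 - p) for a, yi, p, _, _ in rows) / n

    # DR (augmented IPW) means: outcome-model prediction plus weighted residual.
    tau1_dr = sum(mu1 + a * (yi - mu1) / p for a, yi, p, mu1, _ in rows) / n
    tau0_dr = sum(mu0 + (1 - a) * (yi - mu0) / (1 - p)
                  for a, yi, p, _, mu0 in rows) / n

    return tau1_ipw / tau0_ipw, tau1_dr / tau0_dr


# Toy inputs, made up for illustration only (not data from the paper).
a_imp = [1, 0, 1, 0]
y = [1, 0, 1, 1]
ps = [0.5, 0.5, 0.5, 0.5]
m1 = [1.0, 1.0, 1.0, 1.0]
m0 = [0.5, 0.5, 0.5, 0.5]
tau_ipw, tau_dr = risk_ratio_ipw_dr(a_imp, y, ps, m1, m0)
print(tau_ipw, tau_dr)
```

In practice the function would be called once per imputed dataset, and the resulting estimates would then be pooled across imputations.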

Notice that when we conduct imputation using MICE, we choose $m=20$ as the number of imputations. The IPW and DR estimators for $\tau$ are then constructed on each imputed dataset following the algorithm described above in Section 2.3, and Rubin’s Rules are applied to obtain the final estimates ⁹. If the imputation model is correctly specified, then under the MAR assumption we can approximate $P(A=1|\bm{X},Y)$ by $P(A^{\text{imp}}=1|\bm{X},Y)$. Furthermore, if the PS model is also correct, then under the SITA assumption, $\hat{\tau}^{\text{IPW}}\xrightarrow[]{\text{p}}\tau$ and asymptotic normality holds as $n\to\infty$ ²⁰. The DR estimator protects against misspecification of either the PS or the outcome model ^{11, 12}. In other words, if the imputation model is correct and either the PS or the outcome model is correct, then $\hat{\tau}^{\text{DR}}\xrightarrow[]{\text{p}}\tau$ and asymptotic normality holds as $n\to\infty$ ².
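The pooling step can be illustrated with the standard form of Rubin's Rules: the pooled estimate is the mean of the per-imputation estimates, and the total variance adds the average within-imputation variance to the between-imputation variance inflated by $1+1/m$. The sketch below assumes the per-imputation estimates and their variances are already available; the function name `pool_rubin` is hypothetical, and whether pooling is done on the risk ratio or on its logarithm is left to the user, since the text does not specify the scale.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m per-imputation estimates and their variances with Rubin's Rules.

    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Made-up numbers for three imputations, for illustration only.
est, var = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print(est, var)
```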

2.4 Model Selection Criteria

In application studies, since the true causal effect is unknown, we are not able to find which combination of imputation and PS models leads to the best performance in terms of some performance metrics, such as RMSE. In such a case, we need some model selection criteria to choose the best combination of imputation and PS models. In this section, we discuss some traditional criteria for model selection and propose a new criterion that takes into consideration both models.

2.4.1 Weighted Accuracy ($\text{Accuracy}^{(w)}$)

We first discuss a theoretical way to evaluate the performance of the imputation model via weighted accuracy based on the observed data. Due to the missing exposure, even though we can impute the missing data, we do not know the true missing values, which makes it challenging to estimate the accuracy only based on the observed data. A naive approach is to fit the imputation model based on the observed data and compare the imputed exposure with those observed values. However, since the accuracy is a function of $(\bm{X},Y)$ and the distribution of $(\bm{X},Y)$ on observed data is different from the whole data in general, this approach to estimate accuracy from the observed data is no longer valid.

One approach to deal with this issue is to apply inverse probability weighting to the accuracy, where $w_{i}=P(R_{i}=1|\bm{X}_{i},Y_{i})$ is the propensity for missingness. The weighted accuracy is obtained in the following four steps:

1. First, we randomly split the individuals into either the training or the testing data according to a ratio $q=\frac{n_{\text{test}}}{n_{\text{obs}}}$, where $n_{\text{obs}}$ and $n_{\text{test}}$ are the sample sizes of the whole observed data and of the testing data, respectively. The ratio $q$ can be chosen by the user, but the simplest choice is to set it equal to the missing rate in the original dataset.
2. Next, we fit the imputation model on the training data and impute the exposure on the testing data to compare with the known exposure status.
3. Then, we fit the full missingness model on the whole dataset, written as $\text{logit}(w_{i})=\bm{X}_{i}^{T}\bm{\gamma}$, where $\bm{\gamma}$ is the vector of coefficients, including the intercept, to be estimated, so that we obtain $\hat{w}_{i}$ as the estimated propensity of missingness for individual $i=1,2,\dots,n$.
4. Finally, we calculate the weighted accuracy based on the observed data, called “$\text{Accuracy}^{(w)}$”:

\begin{equation}
\text{Accuracy}^{(w)}(A^{\text{imp}})=\frac{\sum_{i=1}^{n}\frac{1-R_{i}}{1-\hat{w}_{i}}\delta_{i}\mathbbm{1}(A_{i}=A_{i}^{\text{imp}})}{n\times q}, \tag{3}
\end{equation}

where $\delta_{i}=\mathbbm{1}(\text{individual $i$ is chosen in the test set})$ and $\mathbbm{1}(A_{i}=A_{i}^{\text{imp}})$ is an indicator for whether the imputed value equals the observed value. Here, $\hat{w}_{i}$ is estimated from the full missingness model including all covariates and the outcome. The weighted accuracy is consistent for the true accuracy when the full missingness model is correctly specified; the detailed proof is given in Appendix A.1.
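As a minimal sketch of Equation (3), the function below computes the weighted accuracy once the train/test split, the imputed test exposures, and the estimated missingness propensities are in hand. The function name `weighted_accuracy`, the argument names, and the toy inputs are hypothetical; fitting the imputation and missingness models is assumed to have been done elsewhere.

```python
def weighted_accuracy(r, w_hat, in_test, a_obs, a_imp, q):
    """Inverse-probability-weighted accuracy of the imputed exposure.

    r       : missing indicators (1 = exposure missing, 0 = observed)
    w_hat   : estimated propensities of missingness, assumed below 1
    in_test : test-set indicators (delta_i in the text)
    a_obs   : observed exposures (None where the exposure is missing)
    a_imp   : imputed exposures from the model fitted on the training data
    q       : test ratio n_test / n_obs
    """
    n = len(r)
    total = 0.0
    for r_i, w_i, d_i, a_i, a_imp_i in zip(r, w_hat, in_test, a_obs, a_imp):
        # Only observed individuals in the test set contribute,
        # each weighted by 1 / (1 - w_hat_i).
        if r_i == 0 and d_i == 1:
            total += float(a_i == a_imp_i) / (1.0 - w_i)
    return total / (n * q)


# Toy example, made up for illustration: 8 individuals, 4 observed, 2 in the test set.
r = [0, 0, 0, 0, 1, 1, 1, 1]
w_hat = [0.5] * 8
in_test = [1, 1, 0, 0, 0, 0, 0, 0]
a_obs = [1, 0, 1, 1, None, None, None, None]
a_imp = [1, 1, 1, 1, 0, 0, 0, 0]
print(weighted_accuracy(r, w_hat, in_test, a_obs, a_imp, q=0.5))
```

In this toy case one of the two test individuals is imputed correctly, and with constant weights the weighted accuracy reduces to the unweighted test accuracy.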

2.4.2 ASMD, KS, and BIC

To evaluate the PS model, the traditional approaches are based on the balance statistic calculated from the confounders. For example, one can calculate absolute standardized mean difference (ASMD) and Kolmogorov-Smirnov (KS) statistics in the covariates after propensity score adjustment ^{21, 22, 23}.

Another approach to evaluating the PS model is to use the BIC to select the best model, accounting for both the goodness of fit and the number of parameters in the model. Since it is recommended in PS model selection to include outcome-related variables rather than exposure-related variables ^{15, 24, 25}, we suggest using the BIC of the outcome model, i.e., $Y\sim A+\bm{X}$, as the selection criterion. For example, from Figure 1, since $X_{2}\bot Y|A=a$, a smaller BIC indicates that we have included outcome-related variables and excluded the exposure-related variables. Notice that we do not recommend using the c-statistic or the accuracy of the PS model to select variables because, in PS analysis, we aim to balance the confounders across the exposed and unexposed groups rather than to maximize the predictive ability of the PS model ^{16, 26, 27}.

We should also notice that using only one traditional criterion, such as $\text{Accuracy}^{(w)}$, ASMD, or the KS statistic, may not be appropriate for selecting both the imputation and PS models, because each focuses on the performance of a single model. This motivates us to combine $\text{Accuracy}^{(w)}$ and BIC into an integrated criterion, called the “rank score”, proposed in the next section.

2.4.3 Rank Score

The idea of this new criterion, called the “rank score”, is to take into account the performance of both the imputation and PS models. We first obtain the values of $\text{Accuracy}^{(w)}$ and BIC for all possible combinations of the candidate models. Then, for a given model, we calculate its rank score by:

\begin{equation}
\text{Rank Score}=\frac{\text{Rank}(1-\text{Accuracy}^{(w)})+\text{Rank}(\text{BIC})}{2}, \tag{4}
\end{equation}

where “Rank($1-\text{Accuracy}^{(w)}$)” is the rank of the model based on the value of $1-\text{Accuracy}^{(w)}$, with rank 1 assigned to the smallest value of $1-\text{Accuracy}^{(w)}$ (equivalently, the largest $\text{Accuracy}^{(w)}$), and “Rank(BIC)” is the rank of the model based on the BIC from regressing $Y$ on $A$ and the covariates selected in the given model, with rank 1 assigned to the smallest BIC. We calculate the rank score for every possible combination of the imputation and PS models and select the combination with the smallest rank score, which favors a high $\text{Accuracy}^{(w)}$ and a small BIC.

In summary, the main advantage of the rank score is that it combines the performance of both the imputation and PS models and directly identifies the best model through a rank-based criterion. In addition, the rank score is unit-free, so it is not affected by the different magnitudes of the accuracy and BIC. The evaluation of those criteria is shown in Table 4.
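A small sketch of the rank score computation is given below. The helper names `average_ranks` and `rank_scores` are hypothetical, and the treatment of ties (tied values share their average rank) is an assumption, since the text does not state how ties are handled. The illustrative inputs are the average $\text{Accuracy}^{(w)}$ and outcome-model BIC values reported in the simulation table for four of the twenty candidate combinations, so the resulting ranks are relative to these four only.

```python
def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_scores(accuracy_w, bic):
    """Average of Rank(1 - Accuracy^(w)) and Rank(BIC) for each candidate."""
    rank_acc = average_ranks([1 - a for a in accuracy_w])
    rank_bic = average_ranks(bic)
    return [(ra + rb) / 2 for ra, rb in zip(rank_acc, rank_bic)]


# (Imp, PS) combinations: (Exp, out), (Out, out), (Res, out), (Full, out).
accuracy_w = [0.586, 0.507, 0.609, 0.622]
bic = [430.074, 434.737, 411.095, 401.525]
print(rank_scores(accuracy_w, bic))
```

Among these four candidates, the combination with the full imputation model and the outcome-related PS model receives the smallest rank score.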

2.4.4 Other Criteria

Another possible criterion for combining accuracy and BIC is to rescale both terms to the range $[0,1]$. We then average the two rescaled terms and call the resulting criterion “ABIC”:

\begin{equation}
\text{ABIC}=\frac{1}{2}\left[\left(1-\frac{\text{Accuracy}^{(w)}-\min(\text{Accuracy}^{(w)})}{\max(\text{Accuracy}^{(w)})-\min(\text{Accuracy}^{(w)})}\right)+\frac{\text{BIC}-\min(\text{BIC})}{\max(\text{BIC})-\min(\text{BIC})}\right], \tag{5}
\end{equation}

where “min” and “max” denote the smallest and largest values among all candidate models. Notice that we cannot directly average the accuracy and BIC values because they are on different scales; ABIC addresses this by rescaling both to the range $[0,1]$. We want to choose the candidate model with the smallest ABIC value, which usually corresponds to a larger accuracy and a smaller BIC. However, the min and max values used in ABIC can still be affected by extreme values of either BIC or $\text{Accuracy}^{(w)}$. As a result, the ABIC values of the candidate models may be compressed into a narrow range, which makes it hard to distinguish model performance. The main results are shown in Table 3 and A.10 in Section A.2 of the Appendix.
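For comparison with the rank score, Equation (5) can be sketched as follows. The function name `abic` and the toy inputs are hypothetical, and the sketch assumes that the candidates do not all share the same accuracy or the same BIC, since the min-max rescaling would otherwise divide by zero.

```python
def abic(accuracy_w, bic):
    """Average of the rescaled (1 - Accuracy^(w)) and rescaled BIC per candidate."""
    a_min, a_max = min(accuracy_w), max(accuracy_w)
    b_min, b_max = min(bic), max(bic)
    scores = []
    for a, b in zip(accuracy_w, bic):
        acc_term = 1 - (a - a_min) / (a_max - a_min)  # 0 for the most accurate model
        bic_term = (b - b_min) / (b_max - b_min)      # 0 for the smallest BIC
        scores.append((acc_term + bic_term) / 2)
    return scores


# Made-up values for three candidate models, for illustration only.
print(abic([0.5, 0.6, 0.7], [300.0, 200.0, 100.0]))
```

Because both terms depend on the extreme candidates through the min and max, one outlying BIC or accuracy value changes the scores of every other model, which is the sensitivity discussed above.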

3 The Simulation Studies

3.1 Simulation Setup

To investigate which covariates should be included in the imputation and PS models, a simulation study is conducted based on the causal diagram in Figure 1. The number of replications is $N=1000$ , and the sample size is $n=500$ in each data generation. Three covariates $X_{1},X_{2},X_{3}$ are generated from $N(0,1)$ independently. $Y,A$ , and the missing indicator $R$ are generated by the following three models:

• missingness model: $\text{logit}\{P(R=1|\bm{X})\}=-0.3+0.4X_{1}+0.6X_{2}+1.8X_{3}$;
• treatment model: $\text{logit}\{P(A=1|\bm{X})\}=-0.2+0.3X_{1}+X_{2}$;
• outcome model: $\text{logit}\{P(Y=1|A,\bm{X})\}=-0.2+2A-0.3X_{1}+2.5X_{3}$.

The missing rate is about 48% in this scenario, and the true causal effect is $\tau=1.523$. Notice that the true missingness model does not include the outcome $Y$; one of our aims is to investigate whether adding the outcome to the imputation model improves the estimated causal effect in the simulation studies.
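One replication of this data-generating process can be sketched with the Python standard library as follows. The function name `simulate`, the dictionary layout, and the seed handling are hypothetical choices made for illustration; only the three logistic models and the standard normal covariates come from the setup above, and the sketch is not expected to reproduce the reported numbers exactly.

```python
import math
import random


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def simulate(n=500, seed=0):
    """Generate one dataset of size n from the missingness, treatment, and outcome models."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2, x3 = (rng.gauss(0, 1) for _ in range(3))
        # Treatment, outcome, and missingness models as specified in the text.
        a = int(rng.random() < expit(-0.2 + 0.3 * x1 + 1.0 * x2))
        y = int(rng.random() < expit(-0.2 + 2 * a - 0.3 * x1 + 2.5 * x3))
        r = int(rng.random() < expit(-0.3 + 0.4 * x1 + 0.6 * x2 + 1.8 * x3))
        rows.append({
            "x1": x1, "x2": x2, "x3": x3,
            "a": None if r == 1 else a,  # exposure is hidden when R = 1
            "a_true": a,                 # kept only to evaluate imputations
            "y": y, "r": r,
        })
    return rows


data = simulate(n=500, seed=1)
missing_rate = sum(row["r"] for row in data) / len(data)
print(round(missing_rate, 3))
```

Keeping the true exposure alongside the masked one is a convenience for simulation studies only; it is what allows the accuracy of an imputation model to be checked against the truth.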

For each simulated dataset, we investigate five different imputation models and four different PS models, which are specified and named in Table 3.1. In total, twenty possible combinations of the imputation and PS models are investigated. We provide some rationale behind the specification of these models. When selecting the imputation model, we start with an exposure-related model, which includes only $X_{1}$ and $X_{2}$ as they directly affect $A$. To increase the predictive ability, we gradually modify the imputation model by adding $X_{3}$ and $Y$ step by step until the full imputation model is obtained.

\begin{tabular}{ll}
\multicolumn{2}{l}{Five imputation models} \\
Model Name (shortened form) & Model Form \\
exposure-related model (Exp) & $A\sim X_{1}+X_{2}$ \\
outcome-related model (Out) & $A\sim X_{1}+X_{3}$ \\
all covariates-included model (Covs) & $A\sim X_{1}+X_{2}+X_{3}$ \\
response-included model (Res) & $A\sim X_{1}+X_{2}+Y$ \\
full model (Full) & $A\sim X_{1}+X_{2}+X_{3}+Y$ \\
\hdashline
\multicolumn{2}{l}{Four PS models} \\
naive model (Naive) & $A^{\text{imp}}\sim X_{1}$ \\
exposure-related model (Exp) & $A^{\text{imp}}\sim X_{1}+X_{2}$ \\
outcome-related model (Out) & $A^{\text{imp}}\sim X_{1}+X_{3}$ \\
all covariates-included model (Covs) & $A^{\text{imp}}\sim X_{1}+X_{2}+X_{3}$ \\
\end{tabular}

\begin{tabular}{lrrrr}
Imp, PS Models & Bias & Bias Rate (\%) & ESE & RMSE \\
Exp, naive & -0.186 & -12.187 & 0.085 & 0.204 \\
Exp, exp & -0.217 & -14.283 & 0.092 & 0.236 \\
Exp, out & -0.188 & -12.324 & 0.067 & 0.199 \\
Exp, covs & -0.219 & -14.384 & 0.080 & 0.233 \\
\hdashline
Out, naive & -0.290 & -19.069 & 0.109 & 0.310 \\
Out, exp & -0.333 & -21.858 & 0.103 & 0.348 \\
Out, out & -0.230 & -15.074 & 0.059 & 0.237 \\
Out, covs & -0.263 & -17.301 & 0.059 & 0.270 \\
\hdashline
Covs, naive & -0.181 & -11.920 & 0.123 & 0.219 \\
Covs, exp & -0.211 & -13.883 & 0.138 & 0.252 \\
Covs, out & -0.188 & -12.335 & 0.067 & 0.199 \\
Covs, covs & -0.218 & -14.295 & 0.080 & 0.232 \\
\hdashline
Res, naive & 0.102 & 6.679 & 0.176 & 0.204 \\
Res, exp & 0.137 & 9.026 & 0.217 & 0.257 \\
Res, out & -0.046 & -2.996 & 0.107 & 0.116 \\
Res, covs & -0.038 & -2.523 & 0.137 & 0.143 \\
\hdashline
Full, naive & -0.017 & -1.096 & 0.160 & 0.161 \\
Full, exp & -0.011 & -0.710 & 0.194 & 0.194 \\
Full, out & 0.010 & 0.640 & 0.110 & 0.111 \\
Full, covs & 0.023 & 1.498 & 0.137 & 0.139 \\
\end{tabular}

\begin{tabular}{lrrrrrrr}
Imp, PS Models & $\text{Accuracy}^{(w)}$ & Out BIC & ASMD & KS & ABIC & Rank Score & Rank(RMSE) \\
Exp, naive & 0.586 & 677.561 & 0.161 & 0.130 & 0.608 & 11.789 & 10 \\
Exp, exp & 0.586 & 720.437 & 0.022 & 0.130 & 0.675 & 14.272 & 14 \\
Exp, out & 0.586 & 430.074 & 0.149 & 0.130 & 0.225 & 7.849 & 7 \\
Exp, covs & 0.586 & 433.966 & 0.014 & 0.130 & 0.231 & 8.959 & 13 \\
\hdashline
Out, naive & 0.507 & 683.862 & 0.103 & 0.148 & 0.930 & 15.947 & 19 \\
Out, exp & 0.507 & 724.226 & 0.031 & 0.148 & 0.993 & 18.306 & 20 \\
Out, out & 0.507 & 434.737 & 0.075 & 0.148 & 0.545 & 12.460 & 15 \\
Out, covs & 0.507 & 435.659 & 0.008 & 0.148 & 0.546 & 12.918 & 18 \\
\hdashline
Covs, naive & 0.585 & 676.668 & 0.170 & 0.136 & 0.612 & 11.875 & 11 \\
Covs, exp & 0.585 & 719.239 & 0.033 & 0.136 & 0.678 & 14.354 & 16 \\
Covs, out & 0.585 & 430.103 & 0.148 & 0.136 & 0.231 & 8.043 & 8 \\
Covs, covs & 0.585 & 433.958 & 0.014 & 0.136 & 0.237 & 9.136 & 12 \\
\hdashline
Res, naive & 0.609 & 652.471 & 0.182 & 0.124 & 0.484 & 8.103 & 9 \\
Res, exp & 0.609 & 694.546 & 0.040 & 0.124 & 0.549 & 10.450 & 17 \\
Res, out & 0.609 & 411.095 & 0.154 & 0.124 & 0.110 & 3.993 & 2 \\
Res, covs & 0.609 & 415.906 & 0.015 & 0.124 & 0.118 & 4.580 & 4 \\
\hdashline
Full, naive & 0.622 & 662.369 & 0.172 & 0.138 & 0.430 & 7.463 & 5 \\
Full, exp & 0.622 & 705.051 & 0.034 & 0.138 & 0.497 & 9.941 & 6 \\
Full, out & 0.622 & 401.525 & 0.149 & 0.138 & 0.027 & 1.989 & 1 \\
Full, covs & 0.622 & 406.367 & 0.014 & 0.138 & 0.034 & 2.574 & 3 \\
\end{tabular}

\begin{tabular}{lrr}
Criterion & IPW Method & DR Method \\
$1-\text{Accuracy}^{(w)}$ & 0.678 & 0.655 \\
ASMD & -0.306 & -0.337 \\
KS & 0.212 & 0.191 \\
Out BIC & 0.682 & 0.702 \\
\hdashline
ABIC & 0.793 & 0.785 \\
Rank Score & 0.841 & 0.837 \\
\end{tabular}

\begin{tabular}{lrrrr}
Covariate & Full Sample (n=2878) & Alive (n=1625) & Died (n=1253) & p-value \\
CVD & & & & 0.022 \\
No & 802 (34) & 478 (36) & 324 (32) & \\
Yes & 1528 (66) & 835 (64) & 693 (68) & \\
Missing & 548 & 312 & 236 & \\
\hdashline
age & & & & $<$ 0.001 \\
Mean (sd) & 68.9 (16.3) & 65.4 (16.6) & 73.3 (14.7) & \\
Median (Min,Max) & 72 (0,107) & 67 (0,104) & 76 (0,107) & \\
\hdashline
sex & & & & 0.054 \\
Male & 1551 (54) & 850 (52) & 701 (56) & \\
Female & 1327 (46) & 775 (48) & 552 (44) & \\
\hdashline
diabetes & & & & 0.3 \\
No & 1117 (39) & 617 (38) & 500 (40) & \\
Yes & 1761 (61) & 1008 (62) & 753 (60) & \\
\end{tabular}
