
Nonparametric assessment of regimen response curve estimators

Cuong Pham1, Benjamin R. Baer1,2, and Ashkan Ertefaie1
1Department of Biostatistics and Computational Biology,
University of Rochester Medical Center, Rochester, NY, USA
2School of Mathematics and Statistics,
University of St Andrews, St Andrews, Scotland
Abstract

Marginal structural models have been widely used in causal inference to estimate mean outcomes under either a static or a prespecified set of treatment decision rules. This approach requires imposing a working model for the mean outcome given a sequence of treatments and possibly baseline covariates. In this paper, we introduce a dynamic marginal structural model that can be used to estimate an optimal decision rule within a class of parametric rules. Specifically, we estimate the mean outcome as a function of the parameters in the class of decision rules, referred to as a regimen-response curve. In general, misspecification of the working model may lead to a biased estimate with questionable causal interpretability. To mitigate this issue, we leverage risk to assess the "goodness-of-fit" of the imposed working model. We consider the counterfactual risk as our target parameter and derive inverse probability weighting and canonical gradients to map it to the observed data. We provide asymptotic properties of the resulting risk estimators, considering both fixed and data-dependent target parameters. We show that the inverse probability weighting estimator can be efficient and asymptotically linear when the weight functions are estimated using a sieve-based estimator. The proposed method is implemented on the LS1 study to estimate a regimen-response curve for patients with Parkinson's disease.

1 Introduction

Many diseases and disorders, including cancer, Parkinson’s, substance abuse, and depression, require a sequence of treatments to address the evolving characteristics of the disease and the patient. Dynamic treatment regimes offer personalized treatment sequences to cater to these specific needs (Robins, 1986, 1989, 1997; Murphy et al., 2001). Such regimes consist of a predefined set of decision rules established before treatment initiation, which map the patient’s time-varying characteristics to appropriate treatment options at each decision point.

A key step toward estimating optimal treatment regimes is to estimate the mean potential outcome under a given regime. Marginal structural models have been widely used for this purpose, where a working model is imposed for the mean outcome given a sequence of treatments and possibly baseline covariates. However, the majority of the current literature focuses on finding an optimal regime among a prespecified set of regimes (Robins et al., 2000; Murphy et al., 2001; Shortreed and Moodie, 2012; Ertefaie et al., 2016). Dynamic marginal structural models are alternatives that can be used to estimate optimal regimes within a class of parametric decision rules (Orellana et al., 2010; Duque et al., 2021). This approach estimates the mean outcome as a function of the parameters in the class of decision rules, referred to as a regimen-response curve (Ertefaie et al., 2023). The quality of the estimated optimal regime and its causal interpretability depend on how closely the working model approximates the true regimen-response curve (Neugebauer and van der Laan, 2007).

Cross-validation provides a powerful tool for model selection both in parametric and nonparametric settings. The asymptotic and finite sample properties of cross-validated risk have been extensively studied (van der Laan and Dudoit, 2003; van der Vaart et al., 2006). Brookhart and van der Laan (2006) illustrated how to utilize cross-validated risk for selecting the best nuisance parameters in a semi-parametric model. In this paper, we will use cross-validated risk to assess the performance of an imposed working model to estimate a regimen-response curve. In this context, risk quantifies the proximity of the working model to the actual regimen-response curve. It is defined as the Euclidean distance between the working model and the potential mean outcome under all treatment regimes. The challenge lies in the fact that we do not observe the potential mean outcome under a treatment regime. Hence, an inverse probability weighting (IPW) mapping is often utilized to estimate the risk of a working model.

The consistency of the IPW estimators relies on a correctly specified propensity score model (Cole and Hernán, 2008; Mortimer et al., 2005; Howe et al., 2011). This requirement is not always guaranteed when parametric methods are used. A solution to this problem is to use nonparametric methods. However, naively utilizing nonparametric methods to estimate the propensity score often leads to risk estimators that are not $\sqrt{n}$-consistent and suffer from poor finite-sample performance (Ertefaie et al., 2023). This is because the convergence rate of an IPW estimator relies on the convergence rate of the propensity score estimator, which is slower than the desired root-$n$ rate when nonparametric models are used. Additionally, inefficiency is a notable drawback of IPW estimators. In clinical trials that involve human subjects, efficiency is a critical consideration, as it is usually costly to have a large sample size.

Numerous approaches have emerged to address limitations of the IPW estimator. For instance, Imai and Ratkovic (2015) proposed the covariate balancing propensity score, while Yu and van der Laan (2006) introduced a double robust estimator. However, these methods were tailored to estimating a marginal structural model in the static treatment setting and are not applicable to identifying an optimal regime in the dynamic treatment setting. We expand the literature by deriving the canonical gradient of the risk function and proposing a multiply robust estimator of the risk of a dynamic marginal structural working model. This estimator remains consistent even when some of the nuisance parameters are misspecified (Rotnitzky et al., 1998; van der Laan and Robins, 2003), and it achieves efficiency when all nuisance parameters are correctly specified. Furthermore, the multiply robust estimator allows for a slower convergence rate of the propensity score model, enabling the use of more flexible nonparametric methods for its modeling. However, a drawback of the multiply robust estimator is its reliance on estimating numerous nuisance parameters. Employing parametric methods for their estimation can lead to inconsistency in some parameters, impacting the estimator's efficiency. Conversely, nonparametric methods ensure consistency but may become computationally burdensome, potentially yielding irregular estimators when some parameters are inconsistently estimated (van der Laan, 2014; Benkeser et al., 2017). Therefore, we also introduce a technique to refine the IPW estimator, enhancing its efficiency. This technique was initially introduced by Ertefaie et al. (2023) for estimating causal treatment effects in point-exposure settings; we extend it to the longitudinal setting.

The contributions of our proposed method beyond the existing literature can be summarized as follows. First, we derive the canonical gradient of the risk function corresponding to a dynamic marginal structural model in longitudinal settings. Our derivation leads to a multiply robust estimator of the risk function. Second, we show that when the propensity score is estimated using a sieve-based approach, the resulting IPW estimator solves the efficient influence function. This implies that the proposed IPW estimator achieves the nonparametric efficiency bound. Third, we show that, under certain rate conditions, the proposed multiply robust and sieve-based estimators are asymptotically linear for both fixed and data-dependent risk functions.

2 Problem Setting

2.1 Notation and setup

Our data consist of $n$ i.i.d. samples $O_{i}=\{S_{1i},A_{1i},\dots,S_{Ti},A_{Ti},Y_{i}\}_{i=1}^{n}\sim\mathbb{P}_{0}\in\mathcal{M}$, where $\mathcal{M}$ is a nonparametric model. We denote $S_{t}$ as the vector of patient characteristics at time $t$ and $V\subseteq S_{1}$ as a subset of the baseline characteristics. The observed treatment is denoted $A_{t}\in\{0,1\}$ at time $t=1,\dots,T$, and the outcome of interest is denoted $Y$. We use the overbar notation to denote the history of a variable. For example, $\bar{S}_{t}=(S_{1},\dots,S_{t})$ and $\bar{A}_{t}=(A_{1},\dots,A_{t})$ are the covariates and past treatments up to time $t$. We let $\bar{H}_{t}=(\bar{A}_{t-1},\bar{S}_{t})$ be all the information up to time $t$. We denote $\mathbb{P}_{n}$ as the empirical distribution and define $\mathbb{P}f=\int f(o)\,\mathrm{d}\mathbb{P}(o)$ for a given function $f(o)$ and distribution $\mathbb{P}$.

A dynamic regime consists of a set of decision rules that assign the treatment at each stage based on the information at and before that stage. We define the deterministic dynamic treatment rule $d_{t}^{\theta}=I(\theta_{t}^{T}S_{t}>0)$ to be the decision rule at time $t$. Let $\pi_{0}^{t}\equiv P(A_{t}=d_{t}^{\theta}\mid\bar{H}_{t})$ be the true propensity score at time $t$, let $\pi_{n}^{t}$ denote the corresponding estimator, and let $Y^{\theta}$ denote the potential outcome under the regime indexed by $\theta$. Suppose that we have the full data $X=\{(Y^{\theta}:d^{\theta}\in\mathcal{D}),V\}\sim P_{X}\in\mathcal{M}^{F}$, where $\mathcal{M}^{F}$ is a nonparametric full data model. We can then define the regimen-response curve $m_{0}(\theta,V):=E_{P_{X}}(Y^{\theta}\mid V)$.

2.2 Cross-validated risk

A dynamic marginal structural model can be used to approximate $m_{0}(\theta,V)$ by imposing a working model. We quantify the quality of the imposed working model using a cross-validated risk function. Let $B_{n}\in\{0,1\}^{n}$ denote the random partition of $\{1,\dots,n\}$ into two folds. A realization of $B_{n}=(B_{n,1},B_{n,2},\dots,B_{n,n})$ defines a particular split of the sample of $n$ observations into a training set $\{i\in\{1,\dots,n\}:B_{n,i}=0\}$ and a validation set $\{i\in\{1,\dots,n\}:B_{n,i}=1\}$. Denote by $\mathbb{P}_{n,b}^{0}$ and $\mathbb{P}_{n,b}^{1}$ the empirical distributions of the training sample and the validation sample, respectively. The type of cross-validation procedure is determined by the specific distribution of $B_{n}$. For example, in Monte Carlo cross-validation, where the sample is divided into training and validation sets of sizes $n(1-p)$ and $np$, the distribution of $B_{n}$ puts mass $1/\binom{n}{np}$ on each such binary realization of $B_{n}$. Another example is $B$-fold cross-validation, in which the distribution of $B_{n}$ puts mass $1/B$ on each of the $B$ possible realizations. Throughout, we assume that this randomness is independent of all data.
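As a concrete illustration of the partition variable $B_{n}$ (our own sketch, not part of the paper's methodology; the function names are ours), the two cross-validation schemes can be simulated as follows.

```python
import numpy as np

def bfold_partitions(n, B, rng):
    """All B realizations of B_n under B-fold cross-validation: row b is a
    0/1 vector whose 1's mark the validation set of fold b."""
    idx = rng.permutation(n)
    folds = np.array_split(idx, B)
    parts = np.zeros((B, n), dtype=int)
    for b, val in enumerate(folds):
        parts[b, val] = 1
    return parts

def monte_carlo_partition(n, p, rng):
    """One draw of B_n under Monte Carlo cross-validation: a uniformly random
    validation set of size round(n * p)."""
    part = np.zeros(n, dtype=int)
    val = rng.choice(n, size=int(round(n * p)), replace=False)
    part[val] = 1
    return part

rng = np.random.default_rng(0)
parts = bfold_partitions(n=10, B=5, rng=rng)
assert (parts.sum(axis=0) == 1).all()   # each i is validated exactly once
```

The rows of `parts` are the $B$ equally likely realizations of $B_{n}$; averaging an estimator over them implements the expectation $\mathbb{E}_{B_{n}}$ used below.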

Consider an estimation rule $m_{n}(\theta,V)\equiv m_{n}(\theta,V\mid\mathbb{P})$ that is fit using the distribution $\mathbb{P}$; we refer to $m_{n}$ as a marginal structural model (MSM) to be consistent with the literature, although it is the fit from a working model rather than a model itself. Let $m_{n,b}(\theta,V)\equiv m_{n}(\theta,V\mid\mathbb{P}_{n,b}^{0})$ denote the MSM $m_{n}(\theta,V)$ evaluated at the empirical distribution $\mathbb{P}_{n,b}^{0}$, i.e., the training data. We denote $m^{*}(\theta,V)$ as the pointwise probability limit of $m_{n,b}(\theta,V)$. Note that $m^{*}$ does not necessarily equal the true regimen-response curve $m_{0}$.

Define the full data conditional risk based on training observations as

$$\Psi_{0}(m_{n})\equiv\mathbb{E}_{B_{n}}\left[\mathbb{P}_{0}\left\{\int\{Y^{\theta}-m_{n,b}(\theta,V)\}^{2}\,\mathrm{d}F(\theta)\right\}\right],\qquad(1)$$

where $\mathbb{E}_{B_{n}}$ denotes an expectation over possible partitions and $F$ is a user-specified distribution function over the parameter $\theta$. The averaging over the partitions in (1) prevents the estimand from depending on any randomness in the selection of training data, thus making $\Psi_{0}(m_{n})$ a data-adaptive estimand (Hubbard et al., 2016). We define the full-data non-data-adaptive risk for a given regimen-response curve $m$ as $\Psi_{0,\text{nda}}(m)\equiv\mathbb{P}_{0}\left\{\int\{Y^{\theta}-m(\theta,V)\}^{2}\,\mathrm{d}F(\theta)\right\}$. The full data conditional risk $\Psi_{0}(m_{n})$ can be expressed in terms of the non-data-adaptive risk $\Psi_{0,\text{nda}}(m)$ through the relation $\Psi_{0}(m_{n})=\mathbb{E}_{B_{n}}[\Psi_{0,\text{nda}}(m_{n,b})]$. In Section 3.1, we develop efficiency theory for $\Psi_{0,\text{nda}}(m)$ of a given regimen-response curve $m$. Then, we use those results to construct nonparametric asymptotically linear estimators for $\Psi_{0}(m_{n})$.

2.3 Inverse-probability weighted identification

For a given marginal structural model $m_{n}$, the definition of the risk $\Psi_{0}(m_{n})$ involves the potential outcome $Y^{\theta}$, which is not observed. However, under the following causal assumptions, it is identifiable using the observed data. Recall that $\pi_{0}^{t}\equiv P(A_{t}=d_{t}^{\theta}\mid\bar{H}_{t})$ is the true propensity score at time $t$.

Assumption 1.

(Causal assumptions)

  a. Consistency: $Y=Y^{\theta}$ if $\bar{A}_{T}=\bar{d}^{\theta}_{T}$ for all $\theta$;

  b. Exchangeability: $A_{t}\perp\!\!\!\perp Y^{\theta}\mid\bar{H}_{t}$ for all $\theta$ and all $t=1,\dots,T$;

  c. Positivity: $\prod_{t=1}^{T}\pi_{0}^{t}(A_{t}=d^{\theta}_{t}\mid\bar{H}_{t})>\epsilon$ almost surely, for all $\theta$ and some $\epsilon>0$.

Assumption 1a states that the potential outcome under an observed regime equals the observed value. This assumption can be violated if the treatment assignment of a unit impacts the outcome of others (Hernán and Robins, 2023; Kennedy, 2019). Assumption 1b states that there are no unmeasured confounders. This assumption holds in standard randomized studies. However, in observational studies it requires a rich enough set of characteristics $S_{1},\dots,S_{T}$. Assumption 1c states that regardless of a subject's medical history, that person will have a nonzero probability of following any given treatment regime.

Lemma 1.

Let $m_{n}$ be an estimation rule for a regimen-response curve. If Assumption 1 holds, then the estimand is an observed-data functional; that is,

$$\Psi_{0}(m_{n})=\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left[\int\left\{\prod_{t=1}^{T}\frac{I(A_{t}=d_{t}^{\theta})}{\pi_{0}^{t}(A_{t}=d_{t}^{\theta}\mid\bar{H}_{t})}\right\}\left\{Y-m_{n,b}(\theta,V)\right\}^{2}\,\mathrm{d}F(\theta)\right].$$

2.4 Inverse-probability weighted estimator

Applying the plug-in approach to Lemma 1, we obtain the inverse probability weighting (IPW) estimator of the full data conditional risk of a working model $m_{n}(\theta,V)$:

$$\Psi_{n}^{\mathrm{IPW}}(m_{n},\pi_{n})=\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\left[\int\left\{\prod_{t=1}^{T}\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{n,b}(A_{t}=d_{t}^{\theta}\mid\bar{A}_{t-1},\bar{S}_{t})}\right\}\left\{Y-m_{n,b}(\theta,V)\right\}^{2}\,\mathrm{d}F(\theta)\right],$$

where $\pi^{t}_{n,b}$ is the estimate of the propensity score fit on the training sample for the $b$th sample split. The consistency and convergence rate of $\Psi_{n}^{\mathrm{IPW}}(m_{n},\pi_{n})$ depend entirely on the consistency and convergence rate of the estimators $\pi^{t}_{n}$ of $\pi^{t}_{0}$ for each time point $t=1,\dots,T$. It is common to estimate the propensity score using parametric models, which have a convergence rate of $O_{p}(n^{-1/2})$. Alternatively, data-adaptive methods can be used to increase the flexibility of the model. However, due to the slow convergence rate of nonparametric methods, the IPW estimator can be biased. We can see this through the following decomposition.
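To make the estimator concrete, here is a minimal sketch (ours, on simulated data) of the IPW risk for a rule $d_{t}^{\theta}=I(\theta S_{t}>0)$ with a scalar $S_{t}$, a known propensity score of $0.5$ (as in a randomized study), a constant working model, and $F$ uniform on a grid of $\theta$ values; the sample splitting over $B_{n}$ is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T = 500, 2

# Toy longitudinal data: one scalar covariate per stage, randomized treatment,
# so the true propensity score is 0.5 at every stage.
S = rng.normal(size=(n, T))
A = rng.binomial(1, 0.5, size=(n, T))
Y = S[:, 0] + A.sum(axis=1) + rng.normal(size=n)

theta_grid = np.linspace(-1.0, 1.0, 5)   # F puts mass 1/5 on each theta

def d_theta(theta, s):
    # decision rule d_t^theta = I(theta * S_t > 0) for scalar S_t
    return (theta * s > 0).astype(int)

def ipw_risk(m, pi_hat=0.5):
    """Plug-in IPW estimate of the risk of a working model m(theta)
    (constant in V here), averaging over theta per the distribution F."""
    risks = []
    for theta in theta_grid:
        w = np.ones(n)
        for t in range(T):
            w *= (A[:, t] == d_theta(theta, S[:, t])) / pi_hat
        risks.append(np.mean(w * (Y - m(theta)) ** 2))
    return float(np.mean(risks))

risk = ipw_risk(m=lambda theta: 1.0)
```

In practice the weights would use fitted, history-dependent propensity scores; the known constant here is only to keep the sketch short.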

For a given marginal structural model $m(\theta,V)$, define the loss as

$$L_{m}(\pi)=\int\left\{\prod_{t=1}^{T}\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}}\right\}\left\{Y-m(\theta,V)\right\}^{2}\,\mathrm{d}F(\theta).$$

Then, we have

$$\begin{aligned}\Psi_{n}^{\mathrm{IPW}}(m,\pi_{n})-\Psi_{0}(m,\pi_{0})&=\mathbb{P}_{n}L_{m}(\pi_{n})-\mathbb{P}_{0}L_{m}(\pi_{0})\\&=(\mathbb{P}_{n}-\mathbb{P}_{0})L_{m}(\pi_{0})+\mathbb{P}_{0}\{L_{m}(\pi_{n})-L_{m}(\pi_{0})\}+(\mathbb{P}_{n}-\mathbb{P}_{0})\{L_{m}(\pi_{n})-L_{m}(\pi_{0})\}\\&=(I)+(II)+(III).\end{aligned}$$

Assuming that $\pi_{n}$ is a consistent estimator of $\pi_{0}$ and that a Donsker class condition holds, standard empirical process theory shows that term (III) converges to 0 at rate $n^{-1/2}$ (van der Vaart, 1998). Hence, $\Psi_{n}^{\mathrm{IPW}}$ is asymptotically linear only if term (II) also converges to 0 at the rate $n^{-1/2}$. Since that rate is not attainable using nonparametric methods, term (II) will dominate the right-hand side, causing $\Psi_{n}^{\mathrm{IPW}}$ to be biased. The same issue arises when $m_{n}(\theta,V)$ is a cross-fitted estimator of the underlying marginal structural model.

2.5 The highly adaptive lasso

Our proposed approach for refining the inverse probability weighting estimator of $\Psi_{0}(m_{n})$ utilizes the highly adaptive lasso (HAL). The details of this process are discussed in Section 4.1. To provide context, in this section we give a concise introduction to the HAL model and highlight its key characteristics. For an in-depth exploration of HAL, we refer interested readers to Benkeser and van der Laan (2016) and Ertefaie et al. (2023).

The highly adaptive lasso is a nonparametric regression method that estimates infinite-dimensional functions by constructing a linear combination of indicator functions that minimizes a loss function subject to a constraint on the $L_{1}$-norm complexity of the model. Bibaut and van der Laan (2019) show that highly adaptive lasso estimators have a fast convergence rate of $n^{-1/3}$ (up to a log factor).

Let $\mathbb{D}[0,\tau]$ be the Banach space of $d$-variate real-valued cadlag functions on a cube $[0,\tau]\subset\mathbb{R}^{d}$. For a function $f\in\mathbb{D}[0,\tau]$, the sectional variation norm is defined as

$$\|f\|_{\nu}=|f(0)|+\sum_{r\subset\{1,\ldots,d\}}\int_{0_{r}}^{\tau_{r}}\left|\mathrm{d}f_{r}(u)\right|,$$

where $f_{r}(u)=f\{u_{1}I(1\in r),\dots,u_{d}I(d\in r)\}$ is the $r$th section of $f$ and the summation is over all subsets of $\{1,\dots,d\}$. The term $\int_{0_{r}}^{\tau_{r}}|\mathrm{d}f_{r}(u)|$ is the $r$-specific variation norm.

For any $f\in\mathbb{D}[0,\tau]$ with $\|f\|_{\nu}<\infty$, Gill et al. (1995) show that $f$ can be represented as a linear combination of indicator functions:

$$\begin{aligned}f(x)&=f(0)+\sum_{r\subset\{1,\ldots,d\}}\int_{0_{r}}^{x_{r}}\mathrm{d}f_{r}(u)\\&=f(0)+\sum_{r\subset\{1,\ldots,d\}}\int_{0_{r}}^{\tau_{r}}I\left(u_{r}\leq x_{r}\right)\mathrm{d}f_{r}(u).\qquad(2)\end{aligned}$$

The latter integral in Equation (2) can be approximated using a discrete measure $F_{n}$ putting mass on the observed values of $X_{r}=(X_{j}:j\in r)$, denoted $x_{r,i}$ for the $i$th subject. We let $\beta_{r,i}$ denote the mass placed on each observation $x_{r,i}$ and $\phi_{r,i}=I(x_{i,r}\leq c_{r})$. Then the approximation of $f$ is

$$\begin{aligned}\tilde{f}(x)&=f(0)+\sum_{r\subset\{1,\ldots,d\}}\sum_{i=1}^{n}I\left(x_{r,i}\leq c_{r}\right)f_{r,n}\left(dx_{i,r}\right)\\&=\beta_{0}+\sum_{r\subset\{1,\ldots,d\}}\sum_{i=1}^{n}\beta_{r,i}\phi_{r,i}.\end{aligned}$$

In this formulation, $|\beta_{0}|+\sum_{r\subset\{1,\ldots,d\}}\sum_{i=1}^{n}|\beta_{r,i}|$ is an approximation of the sectional variation norm of $f$. Consequently, the minimum loss based highly adaptive lasso estimator $\beta_{n}$ is defined as

$$\beta_{n}=\arg\min_{\beta:\left|\beta_{0}\right|+\sum_{r\subset\{1,\ldots,d\}}\sum_{i=1}^{n}\left|\beta_{r,i}\right|<\lambda}\mathbb{P}_{n}L(\tilde{f}),$$

where $L(\cdot)$ is a given loss function. This is a constrained minimization that corresponds to lasso linear regression with up to $n\times(2^{d}-1)$ parameters (i.e., the $\beta_{r,i}$). Each value of $\lambda$ gives a unique HAL estimator.
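To fix ideas, the following sketch (ours) builds the zero-order HAL design matrix of indicators $\phi_{r,i}$ for $d=2$ and fits it with scikit-learn's `Lasso`; the penalized (Lagrangian) form with a fixed penalty is used as a stand-in for the constrained minimization above, whereas dedicated implementations such as the `hal9001` R package select the penalty by cross-validation.

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

def hal_basis(X_knots, X_eval):
    """Zero-order HAL design: one indicator column per subset r of the
    coordinates and per observed knot, phi_{r,i}(x) = I(x_r >= x_{r,i})
    componentwise (i.e., the evaluation point dominates the knot on r)."""
    n_knots, d = X_knots.shape
    cols = []
    for size in range(1, d + 1):
        for r in itertools.combinations(range(d), size):
            r = list(r)
            for i in range(n_knots):
                cols.append(np.all(X_eval[:, r] >= X_knots[i, r], axis=1).astype(float))
    return np.column_stack(cols)      # up to n * (2^d - 1) columns

rng = np.random.default_rng(2)
n, d = 200, 2
X = rng.uniform(size=(n, d))
y = np.sin(3 * X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=n)

H = hal_basis(X, X)
# Lagrangian lasso with a fixed penalty as a crude stand-in for the
# sectional-variation-norm constraint on the coefficients.
fit = Lasso(alpha=0.01, max_iter=10000).fit(H, y)
pred = fit.predict(H)
```

The L1 penalty zeroes out most of the $n\times(2^{d}-1)$ coefficients, which is what keeps the fit's sectional variation norm under control.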

3 Efficient multiply-robust estimation

In this section, we develop a multiply robust (MR) estimator for the cross-validated risk $\Psi_{0}(m_{n})$ of an estimation rule, or working model, $m_{n}$.

3.1 The Canonical Gradient of $\Psi_{0,\text{nda}}(m)$

To construct the MR estimator, we need to find the efficient influence function of $\Psi_{0,\text{nda}}(m)$. Influence functions are important because a regular and asymptotically linear estimator can be expressed as the empirical mean of its influence function plus a negligible term that converges to 0 at the rate $n^{-1/2}$ or faster (van der Laan and Robins, 2003; Tsiatis, 2006). The efficient influence function is the influence function whose variance attains the efficiency bound. In the nonparametric setting the influence function is unique and equals the efficient influence function. The multiply robust estimator is defined using the efficient influence function, and we will show that it is asymptotically efficient when all the nuisance parameters converge to their true values at a rate of $n^{-1/4}$. This slower-than-parametric rate allows us to estimate the nuisance parameters with flexible nonparametric methods. Although the MR estimator is not efficient when the propensity score model is misspecified, it is still consistent when the outcome regression is correctly specified, or vice versa.

The following result provides the nonparametric efficient influence function for the inner term in the definition of the cross-validated risk $\Psi_{0}(m_{n})$.

Theorem 1.

Let $m$ be a given regimen-response curve. If the identification result of Lemma 1 holds, the nonparametric efficient influence function for

$$\Psi_{0,\text{nda}}(m)\equiv\mathbb{P}_{0}\int\{Y^{\theta}-m(\theta,V)\}^{2}\,\mathrm{d}F(\theta)$$

is $\varphi(\pi_{0},Q_{0,m};m)-\Psi_{0,\text{nda}}(m)$, where the uncentered term $\varphi(\pi_{0},Q_{0,m};m)$ is

$$\int\left[\left\{\prod_{t=1}^{T}\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{0}}\right\}\{Y-m(\theta,V)\}^{2}-\sum_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})-\pi^{t}_{0}}{\pi^{t}_{0}}\,Q^{t}_{0,m}\prod_{j=1}^{t-1}\frac{I(A_{j}=d_{j}^{\theta})}{\pi^{j}_{0}}\right]\mathrm{d}F(\theta),$$

and the nuisance parameter

$$Q^{t}_{0,m}=\mathbb{P}_{0}\left[\left\{\prod_{j=t+1}^{T}\frac{I(A_{j}=d_{j}^{\theta})}{\pi^{j}_{0}}\right\}\{Y-m(\theta,V)\}^{2}\,\middle|\,\bar{S}_{t},\bar{A}_{t-1},A_{t}=d^{\theta}_{t}\right].$$

The structure of the influence function in Theorem 1 involves a weighted average over $\theta$ of an inverse probability weighting term and an augmentation term for each of the time points. The function $Q^{t}_{0,m}$ can be represented via a recursive sequential regression formula, which is useful in practice.

Remark 1.

The function $Q^{t}_{0,m}$ can be expressed recursively as

$$Q^{t}_{0,m}=\mathbb{P}_{0}\left(Q^{t+1}_{0,m}\mid\bar{S}_{t},\bar{A}_{t-1},A_{t}=d^{\theta}_{t}\right),$$

for $t=1,\dots,T$, where $Q^{T+1}_{0,m}=\{Y-m(\theta,V)\}^{2}$.
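In practice, the recursion can be implemented by regressing a running pseudo-outcome backward in time among subjects who followed the rule at that stage. The sketch below (ours, on simulated data, with linear regressions standing in for flexible learners) takes the terminal pseudo-outcome to be the squared residual $\{Y-m(\theta,V)\}^{2}$, matching the definition of $Q^{t}_{0,m}$ in Theorem 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n, T, theta = 500, 2, 0.5

S = rng.normal(size=(n, T))
A = rng.binomial(1, 0.5, size=(n, T))
Y = S[:, 0] + A.sum(axis=1) + rng.normal(size=n)
m_theta = 1.0                                   # working-model value m(theta, V)

d = (theta * S > 0).astype(int)                 # d_t^theta for each subject/time

# Backward recursion: regress the running pseudo-outcome on the history among
# subjects who followed the rule at time t, then predict for everyone,
# which evaluates the regression at A_t = d_t^theta.
Q = (Y - m_theta) ** 2                          # terminal pseudo-outcome Q^{T+1}
for t in reversed(range(T)):
    hist = np.column_stack([S[:, :t + 1], A[:, :t]])   # (S_bar_t, A_bar_{t-1})
    on_rule = A[:, t] == d[:, t]
    reg = LinearRegression().fit(hist[on_rule], Q[on_rule])
    Q = reg.predict(hist)
Q1 = Q                                          # estimate of Q^1_{0,m}
```

Fitting only among the on-rule subjects and predicting for all subjects is what encodes the conditioning event $A_{t}=d^{\theta}_{t}$ at each step.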

3.2 The estimator

The form of the efficient influence function for $\Psi_{0,\text{nda}}$ may be used to construct an estimator for $\Psi_{0}(m_{n})$ that is asymptotically efficient for certain choices of nuisance parameter estimators (Bickel et al., 1993). The estimator construction is more delicate than usual since the efficient influence function is for $\Psi_{0,\text{nda}}(m)$, with $m$ a given regimen-response curve, while the estimand is $\Psi_{0}(m_{n})=\mathbb{E}_{B_{n}}\{\Psi_{0,\text{nda}}(m_{n}(\cdot;\mathbb{P}^{0}_{n,b}))\}$. We first develop estimators as if $m$ were given; then we discuss how to adapt these estimators to instead target our estimand $\Psi_{0}(m_{n})$.

Consider estimation of $\Psi_{0,\text{nda}}(m)$ using the efficient influence function as a guide. Two common estimation approaches are one-step estimators, which add the estimated efficient influence function to an initial estimator, and estimating-equation estimators, which treat the efficient influence function as an estimating function (Tsiatis, 2006; Kennedy, 2022). Since the efficient influence function in Theorem 1 depends on $\Psi_{0,\text{nda}}(m)$ only through its subtraction, the two estimation strategies are equivalent here. Thus we initially consider the estimator

$$\mathbb{P}_{n}\{\varphi(\pi_{n},Q_{n,m};m)\},\qquad(3)$$

where $\pi_{n}$ and $Q_{n,m}$ are estimators for $\pi_{0}$ and $Q_{0,m}$, respectively. Assuming these nuisance parameter estimators satisfy certain conditions, we can show that the above estimator is asymptotically efficient, i.e., has the efficient influence function of $\Psi_{0,\text{nda}}(m)$, found in Theorem 1, as its influence function.

Now consider estimation of our estimand $\Psi_{0}(m_{n})=\mathbb{E}_{B_{n}}[\Psi_{0,\text{nda}}\{m_{n}(\cdot\mid\mathbb{P}^{0}_{n,b})\}]$; we construct our estimator by adapting the estimator in (3) for $\Psi_{0,\text{nda}}(m)$. A plugin-like approach readily produces the estimator

$$\mathbb{E}_{B_{n}}\{\mathbb{P}_{n}\varphi(\pi_{n,b},Q_{n,b,m_{n,b}};m_{n,b})\},\qquad(4)$$

in which an outer average over folds is included, each fixed regimen-response curve $m$ is replaced with $m_{n,b}$, which is estimated on the training data $\mathbb{P}^{0}_{n,b}$ of the fold, and each nuisance parameter is replaced with an estimator fit on the training data.

The estimator in (4) is almost a suitable estimator for the cross-validated risk $\Psi_{0}(m_{n})$; however, it still has one major deficiency which harms its performance. It averages over all observations, including those in the training set $\mathbb{P}_{n,b}^{0}$ as well as the validation set $\mathbb{P}_{n,b}^{1}$, which biases the estimator downwards. A corrected estimator, which we study throughout this paper, is

$$\Psi_{n}^{\mathrm{MR}}(m_{n},\pi_{n},Q_{n,m_{n}})=\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\varphi(\pi_{n,b},Q_{n,b,m_{n,b}};m_{n,b}),\qquad(5)$$

where $\pi_{n,b}$ and $Q_{n,b,m_{n,b}}$ are estimates of the nuisance parameters $\pi_{0}$ and $Q_{0,m_{n,b}}$ calculated with the training sample for the $b$th sample split.
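For intuition, the uncentered term $\varphi$ appearing in (5) can be computed from fitted nuisance values as below (our sketch for a single $\theta$, with simulated randomized data and placeholder nuisance fits; the sample splitting over $B_{n}$ and the integral over $F$ are omitted).

```python
import numpy as np

def phi_uncentered(A, S, Y, d, pi, Q, m_theta):
    """Uncentered efficient-influence-function term of Theorem 1 for one theta:
    the IPW residual term minus the T augmentation terms.  pi[:, t] holds the
    propensity scores P(A_t = d_t | history) and Q[:, t] the fitted Q^t."""
    n, T = A.shape
    ipw = np.ones(n)
    for t in range(T):
        ipw *= (A[:, t] == d[:, t]) / pi[:, t]
    out = ipw * (Y - m_theta) ** 2
    cum = np.ones(n)                      # prod_{j < t} I(A_j = d_j) / pi_j
    for t in range(T):
        out -= ((A[:, t] == d[:, t]) - pi[:, t]) / pi[:, t] * Q[:, t] * cum
        cum *= (A[:, t] == d[:, t]) / pi[:, t]
    return out

rng = np.random.default_rng(4)
n, T = 400, 2
S = rng.normal(size=(n, T))
A = rng.binomial(1, 0.5, size=(n, T))
Y = S[:, 0] + A.sum(axis=1) + rng.normal(size=n)
d = (0.5 * S > 0).astype(int)            # d_t^theta for theta = 0.5
pi = np.full((n, T), 0.5)                # known randomization probabilities
Q = np.ones((n, T))                      # placeholder nuisance fits (e.g. HAL)
est = phi_uncentered(A, S, Y, d, pi, Q, m_theta=1.0).mean()
```

Setting `Q` to zero recovers the pure IPW term, which is one way to see that the augmentation terms have mean zero at the true propensity score and act only to reduce variance and add robustness.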

The next result establishes that $\Psi_{n}^{\mathrm{MR}}$ is an asymptotically efficient estimator of the estimand $\Psi_{0}(m_{n})$. Before stating it, we make two assumptions.

Assumption 2.

For all $t=1,\dots,T$, the nuisance parameter estimators $\pi^{t}_{n}$ and $Q^{t}_{n,m_{n,b}}$ satisfy $\mathbb{E}_{B_{n}}\int\|\pi^{t}_{0}-\pi^{t}_{n,b}\|_{L_{2}(\mathbb{P}_{0})}\|Q^{t}_{0,m_{n,b}}-Q^{t}_{n,m_{n,b}}\|_{L_{2}(\mathbb{P}_{0})}\,\mathrm{d}F(\theta)=o_{p}(n^{-1/2})$.

Assumption 3.

The estimator $m_{n,b}$ satisfies $\int\|m_{n,b}(\theta;V)-m^{*}(\theta;V)\|_{L_{2}(\mathbb{P}_{0})}\,\mathrm{d}F(\theta)=o_{p}(n^{-1/4})$.

Assumption 2 states, roughly, that the estimators $\pi_{n}^{t}$ and $Q^{t}_{n,m_{n,b}}$ need to converge to their true values at a rate of $o_{p}(n^{-1/4})$ for any $\theta$. This assumption holds when these nuisance parameters are estimated using the highly adaptive lasso. Moreover, if either $\pi_{0}^{t}$ or $Q^{t}_{0,m_{n,b}}$ is estimated at the rate $O_{p}(n^{-1/2})$, we only require consistency for the other nuisance parameter; the $O_{p}(n^{-1/2})$ rate can be achieved using correctly specified parametric methods. Assumption 3 is also relatively mild. It only requires that the fitted marginal structural model $m_{n,b}$ have a limit $m^{*}$ and converge to $m^{*}$ at the rate $o_{p}(n^{-1/4})$, enabling the use of both parametric and nonparametric methods to estimate the marginal structural model.

Theorem 2.

If Assumptions 1 and 3 hold, then the estimator $\Psi_{n}^{\mathrm{MR}}(m_{n},\pi_{n},Q_{n,m_{n}})$ satisfies

$$\Psi_{n}^{\mathrm{MR}}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m_{n})=(\mathbb{P}_{n}-\mathbb{P}_{0})\varphi(\pi_{0},Q_{0,m^{*}};m^{*})+R_{n},$$

where $R_{n}$ is defined in Section 10 of the Supplementary Material. If Assumption 2 additionally holds, then $R_{n}=o_{p}(n^{-1/2})$ and

$$\sqrt{n}\left\{\Psi_{n}^{\mathrm{MR}}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m_{n})\right\}\rightsquigarrow\mathcal{N}\left[0,\mathbb{V}_{0}\{\varphi(\pi_{0},Q_{0,m^{*}};m^{*})\}\right].$$

Theorem 2 shows that, as the fitted working marginal structural model $m_{n,b}$ converges to its limiting value $m^{*}$ at the rate $o_{p}(n^{-1/4})$, the multiply robust estimator is asymptotically linear with respect to the limit of $m_{n}$. The centering term in Theorem 2 is the data-adaptive risk $\Psi_{0}(m_{n})$, which depends on the estimator $m_{n}$, the true propensity score, and the outcome regression.

When two working models are under consideration, the results of Theorem 2 allow us to compare their estimators, $m_{n,1}$ and $m_{n,2}$, by applying the asymptotic linearity result once for each estimator. We may ascertain the quality of the estimators by comparing their risks $\Psi_{0}(m_{n,1})$ and $\Psi_{0}(m_{n,2})$, for example by testing the null hypothesis that these risks are equal.

In Theorem 2, we study the difference $\Psi^{\mathrm{MR}}_{n}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m_{n})$. This is because $\Psi^{\mathrm{MR}}_{n}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m^{*})$ is asymptotically linear only under much more restrictive assumptions. In the following result, we study the asymptotics of $\Psi^{\mathrm{MR}}_{n}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m^{*})$ to replace the data-adaptive target parameter $\Psi_{0}(m_{n})$ with the fixed target parameter $\Psi_{0}(m^{*})$ in the conclusions of Theorem 2.

Theorem 3.

If Assumptions 1 and 3 hold and additionally $m^{*}$, the limit of $m_{n}$, equals the true minimizer of $\Psi_{0}(m)$ over $m\in\mathcal{M}$ (i.e., $m^{*}=m_{0}$), then

$$\Psi^{\mathrm{MR}}_{n}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m^{*})=(\mathbb{P}_{n}-\mathbb{P}_{0})\varphi(\pi_{0},Q_{0,m_{0}};m_{0})+R_{n},$$

for $R_{n}$ defined in Section 11 of the Supplementary Materials. If additionally Assumption 2 holds, then $R_{n}=o_{p}(n^{-1/2})$ and

$$\sqrt{n}\left\{\Psi_{n}^{\mathrm{MR}}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m^{*})\right\}\rightsquigarrow\mathcal{N}\left[0,\mathbb{V}_{0}\{\varphi(\pi_{0},Q_{0,m^{*}};m^{*})\}\right].$$

The result of Theorem 3 can be used to compare two different modeling approaches known to be consistent with $\|m_{n,1}-m_{0}\|_{2}=o_{p}(n^{-1/4})$ and $\|m_{n,2}-m_{0}\|_{2}=o_{p}(n^{-1/4})$. Although the two estimators share the same asymptotic variance, the estimated variances based on $\varphi(\pi_{0},Q_{0,m_{n,1}};m_{n,1})$ and $\varphi(\pi_{0},Q_{0,m_{n,2}};m_{n,2})$ may differ in finite samples. As an illustration, one could compare a super learner whose library includes the highly adaptive lasso (HAL) alongside other parametric and nonparametric methods against a HAL-only model.

4 Efficient plugin estimation

4.1 The estimator

Although doubly robust procedures facilitate the use of data-adaptive techniques for modeling nuisance parameters, the resulting estimator can be irregular, with large bias and a slow rate of convergence, when either of the nuisance parameters is inconsistently estimated (van der Laan, 2014; Benkeser et al., 2017). To mitigate this issue, we propose a sieve-based IPW estimator, called UIPW, in which the propensity scores are estimated using an undersmoothed highly adaptive lasso. Define the estimator as

ΨnUIPW(mn,πn,λ)=𝔼Bnn,b1[{t=1TI(At=dtθ)πn,b,λt(At=dtθH¯t)}{Ymn,b(θ,V)}2dF(θ)],\displaystyle\Psi_{n}^{\mathrm{UIPW}}(m_{n},\pi_{n,\lambda})=\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\left[\int\left\{\prod_{t=1}^{T}\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{n,b,\lambda}(A_{t}=d_{t}^{\theta}\mid\bar{H}_{t})}\right\}\left\{Y-m_{n,b}(\theta,V)\right\}^{2}\,\mathrm{d}F(\theta)\right], (6)

where $\pi^{t}_{n,b,\lambda}$ is the highly adaptive lasso estimate of $\pi^{t}_{0}$ obtained using the $b$th sample split for a given tuning parameter $\lambda$. Smaller values of $\lambda$ imply a weaker penalty and thereby undersmooth the fit.
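As a minimal computational sketch of the weighted risk in (6) for a single fold, assuming the propensities, rule-recommended treatments, and working-model predictions have already been evaluated on a grid of $\theta$ draws (array names and shapes are our assumptions, not the paper's notation):

```python
import numpy as np

def uipw_risk(Y, A, prob_treat, d_rules, m_vals):
    """Monte Carlo evaluation of the weighted risk in (6) for one fold:
    Y is (n,), A is (n, T) observed treatments, prob_treat is
    (n, T, n_theta) fitted propensities for following the rule,
    d_rules is (n, T, n_theta) rule-recommended treatments, and
    m_vals is (n, n_theta) working-model predictions m(theta, V).
    The integral dF(theta) is approximated by averaging over draws."""
    follow = (A[:, :, None] == d_rules)              # I(A_t = d_t^theta)
    weights = np.where(follow, 1.0, 0.0) / prob_treat
    w = weights.prod(axis=1)                         # product over t = 1..T
    loss = (Y[:, None] - m_vals) ** 2                # squared error per theta draw
    return (w * loss).mean()                         # average over n and theta
```

With $T=1$, a rule that everyone follows, and unit propensities, the expression collapses to the empirical mean squared error, which provides a quick correctness check.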

Our theoretical results show that, for specific choices of $\lambda$, the estimator $\Psi_{m_{n},n}^{\mathrm{UIPW}}(\pi_{n,\lambda})$ solves the efficient influence function presented in Theorem 1, and hence achieves nonparametric efficiency. This result relies on two key assumptions:

Assumption 4 (Regularity of Functions).

Q0,mtQ^{t}_{0,m^{*}} and π0t\pi^{t}_{0} for any 1tT1\leq t\leq T are cadlag functions with finite sectional variation norm.

Assumption 5 (Approximation of Functions).

For $1\leq t\leq T$, let $f_{t}=\frac{Q^{t}_{0,m^{*}}}{\pi^{t}_{0}}\prod_{j=1}^{t-1}\frac{I(A_{j}=d^{\theta}_{j})}{\pi^{j}_{0}}$, and let $\tilde{f}_{t}$ be the projection of $f_{t}$ onto the linear span of the basis functions $\phi^{t}_{r,i}$ in $L^{2}(P)$. Here, $\phi^{t}_{r,i}$ denotes the basis functions generated by the highly adaptive lasso fit of $\pi^{t}_{n,b,\lambda}$. We assume that $\|f_{t}-\tilde{f}_{t}\|_{L_{2}(\mathbb{P}_{0})}=O_{p}(n^{-1/4})$.

Assumption 4 typically holds in practical applications because the class of cadlag functions with finite sectional variation norm is extensive. Assumption 5, on the other hand, posits that the basis functions generated in the fit $\pi^{t}_{n,b,\lambda}$ are rich enough to approximate $f_{t}$ within an $O_{p}(n^{-1/4})$ neighborhood. Without undersmoothing, the generated basis functions can effectively approximate $f_{t}$ when $\pi^{t}_{0}$ is as complex as $f_{t}$; however, when the complexity of $f_{t}$ surpasses that of $\pi^{t}_{0}$, more undersmoothing is necessary.

Theorem 4 (Efficiency of UIPW estimator).

Let

\displaystyle D^{t}_{CAR}(\pi,Q)=\int\left\{\frac{I(A_{t}=d^{\theta}_{t})-\pi^{t}}{\pi^{t}}Q^{t}_{m}\prod_{j=1}^{t-1}\frac{I(A_{j}=d_{j}^{\theta})}{\pi^{j}}\right\}dF(\theta).

Under Assumptions 1, 4, and 5, and provided $\mathbb{E}_{B_{n}}\int\|\pi_{n,b}-\pi_{0}\|_{2}\cdot\left(\|\pi_{n,b}-\pi_{0}\|_{2}+\|m_{n,b}-m^{*}\|_{2}\right)dF(\theta)=o_{p}(n^{-1/2})$, there exists a set $\lambda_{opt}=\{\lambda^{1}_{opt},\ldots,\lambda^{T}_{opt}\}$ such that $\mathbb{E}_{B_{n}}\mathbb{P}_{n}\sum_{t=1}^{T}D^{t}_{CAR}(\pi_{b,\lambda_{opt}},Q_{0,m^{*}})=o_{p}(n^{-1/2})$, and the estimator $\Psi^{\mathrm{UIPW}}_{n}(m_{n},\pi_{n,\lambda_{opt}})$ is asymptotically efficient and satisfies

ΨnUIPW(mn,πn,λopt)Ψ0(mn,π0)=(n0)φ(π0,Qm,0;m)+op(n1/2).\Psi^{UIPW}_{n}(m_{n},\pi_{n,\lambda_{opt}})-\Psi_{0}(m_{n},\pi_{0})=(\mathbb{P}_{n}-\mathbb{P}_{0})\varphi(\pi_{0},Q_{m^{*},0};m^{*})+o_{p}(n^{-1/2}).

4.2 Undersmoothing criteria in practice

As discussed in Section 2.4, without undersmoothing the estimator $\Psi_{m_{n},n}^{\mathrm{UIPW}}$ may fail to be asymptotically linear. An IPW estimator is efficient if it solves the efficient influence function. Theorem 4 shows that the latter is achieved when $\mathbb{P}_{n}\sum_{t=1}^{T}D^{t}_{CAR}(\pi_{n,\lambda},Q_{0,m^{*}})=o_{p}(n^{-1/2})$. The theoretical rate alone may not provide practical guidance because of the unknown constant. However, given the finite sample at hand, one can achieve the best performance by selecting the tuning parameter that minimizes $\mathbb{P}_{n}\sum_{t=1}^{T}D^{t}_{CAR}(\pi_{n,\lambda},Q_{0,m^{*}})$. Specifically, we define

λn,t=argminλ|B1b=1Bn,b1DCARt(πn,b,λ,Qn,b,mb)|,\displaystyle\lambda_{n,t}=\operatorname*{argmin}_{\lambda}\left|B^{-1}\sum_{b=1}^{B}\mathbb{P}^{1}_{n,b}D^{t}_{CAR}(\pi_{n,b,\lambda},Q_{n,b,m_{b}})\right|, (7)

where $Q_{n,b,m_{b}}$ is a cross-validated highly adaptive lasso estimate of $Q_{0,m^{*}}$ with the $L_{1}$-norm bound chosen by the global cross-validation selector.
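Operationally, criterion (7) reduces to a grid search over candidate penalties; a minimal sketch, assuming the fold-averaged per-observation $D_{CAR}$ values have already been computed for each candidate $\lambda$:

```python
import numpy as np

def select_lambda_dcar(dcar_by_lambda):
    """Criterion (7): choose the lambda whose absolute empirical mean
    of sum_t D_CAR^t, averaged across folds, is closest to zero.
    dcar_by_lambda maps each candidate lambda to a 1-d array of
    fold-averaged, per-observation D_CAR values."""
    crit = {lam: abs(np.mean(vals)) for lam, vals in dcar_by_lambda.items()}
    return min(crit, key=crit.get)
```

In practice, the grid of candidate penalties would range from the cross-validated choice down to substantially smaller (more undersmoothed) values.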

The criterion (7) relies on knowledge about the efficient influence function. However, in some cases, deriving this function can be intricate (e.g., longitudinal settings with many decision points) or may not have a closed-form solution (e.g., bivariate survival probability in bivariate right-censored data) (van der Laan, 1996). To this end, we propose an alternative criterion, drawing from the approach in Ertefaie et al. (2023), that does not require the efficient influence function. Define

λ~n,t=argminλB1b=1B[(r,i)𝒥n1βt,n,λ,bL1|n,b1Ω~r,i(ϕ,πn,b,λt)|],\tilde{\lambda}_{n,t}=\operatorname*{argmin}_{\lambda}B^{-1}\sum_{b=1}^{B}\left[\sum_{(r,i)\in\mathcal{J}_{n}}\frac{1}{\lVert\beta_{t,n,\lambda,b}\rVert_{L_{1}}}\bigg{\lvert}\mathbb{P}_{n,b}^{1}\tilde{\Omega}_{r,i}(\phi,\pi^{t}_{n,b,\lambda})\bigg{\rvert}\right], (8)

in which $\lVert\beta_{t,n,\lambda,b}\rVert_{L_{1}}=\lvert\beta_{t,n,\lambda,0}\rvert+\sum_{r\subset\{1,\ldots,d\}}\sum_{i=1}^{n}\lvert\beta_{t,n,\lambda,r,i}\rvert$ is the $L_{1}$-norm of the coefficients $\beta_{t,n,\lambda,r,i}$ in the highly adaptive lasso estimator $\pi^{t}_{n,\lambda}$ for a given $\lambda$, and $\tilde{\Omega}_{r,i}(\phi,\pi^{t}_{n,\lambda,v})=\phi_{r,i}(\bar{H}_{t})\{A_{t}-\pi^{t}_{n,\lambda,v}(1\mid\bar{H}_{t})\}\{\pi^{t}_{n,\lambda,v}(1\mid\bar{H}_{t})\}^{-1}$.
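For a single stage and a single candidate $\lambda$, the inner quantity of criterion (8) can be sketched as follows (the inputs `phi`, `pi_hat`, and `beta` are our hypothetical names for the HAL basis matrix, the fitted propensities, and the lasso coefficients):

```python
import numpy as np

def score_criterion(phi, A, pi_hat, beta):
    """Inner sum of criterion (8) for one stage and one lambda:
    averages the basis-specific score terms Omega-tilde_{r,i} over
    observations and scales by the L1 norm of the lasso coefficients."""
    resid = (A - pi_hat) / pi_hat                # {A_t - pi(1|H)} / pi(1|H)
    scores = phi * resid[:, None]                # Omega-tilde per basis function
    l1 = np.abs(beta).sum()                      # |beta_0| + sum |beta_{r,i}|
    return np.abs(scores.mean(axis=0)).sum() / l1
```

A fully score-solving fit drives each averaged term toward zero, which is why smaller values of this criterion indicate more undersmoothing.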

The main drawback of undersmoothing is the decreased convergence rate of the highly adaptive lasso fit. Our asymptotic linearity proof for $\Psi_{m_{n},n}^{\mathrm{UIPW}}(\pi_{n,\lambda})$ requires $\int\|\pi^{t}_{n}-\pi^{t}_{0}\|_{2}\,dF(\theta)=o_{p}(n^{-1/4})$, which is slower than the $O_{p}(n^{-1/3})$ rate of the cross-validated highly adaptive lasso fit. This gap ensures that we can tolerate a certain degree of undersmoothing without compromising the desired asymptotic properties. Let $J$ be the number of features included in the fit (i.e., $J=|\mathcal{J}_{n}|$). van der Laan (2023) showed that the convergence rate of a highly adaptive lasso is $(J/n)^{1/2}$. Hence, to ensure the required $o_{p}(n^{-1/4})$ rate, we must choose $J$ such that $J<n^{1/2}$; let $\lambda_{n,J}$ be the corresponding $L_{1}$-norm bound. We then propose our score-based criterion as

λn,t=max(λn,J,λ~n,t).\displaystyle\lambda^{\prime}_{n,t}=\max(\lambda_{n,J},\tilde{\lambda}_{n,t}). (9)

In practice, the score-based criterion tends to undersmooth too much (i.e., it picks a very small $\tilde{\lambda}_{n,t}$), and the max operator mitigates this issue.

5 Simulation Studies

5.1 Setting

In our simulation study, we evaluate and compare the performance of our proposed estimators, MR and UIPW, with that of the conventional IPW estimator. We consider a two-stage treatment ($T=2$). The baseline covariates $S_{1,t=1},S_{2,t=1}$ are generated from a normal distribution with mean 0 and standard deviation 0.5. The time-dependent covariates $S_{1,t},S_{2,t}$ are functions of the preceding $S_{1,t-1},S_{2,t-1}$. In particular,

S1,t=Max[Min{0.9(2At1)S1,t1+0.05S1,t1S2,t1+N(0,0.5),8},4]\displaystyle S_{1,t}=\text{Max}[\text{Min}\{0.9(2A_{t-1})S_{1,t-1}+0.05S_{1,t-1}S_{2,t-1}+N(0,0.5),8\},-4]
S2,t=Max[Min{0.9(12At1)S1,t1+0.05S1,t1S2,t1+N(0,0.5),8},4]\displaystyle S_{2,t}=\text{Max}[\text{Min}\{0.9(1-2A_{t-1})S_{1,t-1}+0.05S_{1,t-1}S_{2,t-1}+N(0,0.5),8\},-4]

The outcome $Y=S_{1,T}-S_{2,T}+A_{T}S_{1,T}+N(0,0.1^{2})$ is a function of $S_{1},S_{2}$ in the last stage. The decision rule is $d^{\theta}_{t}=I\{S_{1,t}<\theta\}$, which depends only on $S_{1,t}$. We generate $\theta$ from a normal distribution $f(\theta)=N(0,0.1^{2})$. The treatment $A_{t}$ is generated from a Bernoulli distribution with $P(A_{t}=1\mid S_{1,t},S_{2,t})=\mathrm{expit}(0.1+0.2S_{1,t})$. Additional simulations in which $A_{t}$ is randomized can be found in Section 13 of the Supplementary Material.
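A sketch of this data-generating process, reading the displayed formulas literally (in particular the factor $2A_{t-1}$ and the truncation bounds $-4$ and $8$):

```python
import numpy as np

def simulate(n, rng=None):
    """Draw one dataset from the two-stage (T = 2) simulation design
    described above."""
    rng = rng or np.random.default_rng(0)
    # baseline covariates ~ N(0, 0.5^2)
    S1 = rng.normal(0.0, 0.5, n)
    S2 = rng.normal(0.0, 0.5, n)
    # stage-1 treatment: P(A_1 = 1 | S) = expit(0.1 + 0.2 S_1)
    A1 = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.1 + 0.2 * S1))))
    # truncated covariate updates between stages
    S1_2 = np.clip(0.9 * (2 * A1) * S1 + 0.05 * S1 * S2
                   + rng.normal(0, 0.5, n), -4, 8)
    S2_2 = np.clip(0.9 * (1 - 2 * A1) * S1 + 0.05 * S1 * S2
                   + rng.normal(0, 0.5, n), -4, 8)
    # stage-2 treatment and final outcome
    A2 = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.1 + 0.2 * S1_2))))
    Y = S1_2 - S2_2 + A2 * S1_2 + rng.normal(0.0, 0.1, n)
    return S1_2, S2_2, A1, A2, Y
```

This sketch fixes $T=2$ for concreteness; extending it to more stages only requires looping the covariate update and treatment draw.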

We consider two variants of the UIPW estimator, differing in how the tuning parameter $\lambda$ is selected: UIPW-Dcar chooses $\lambda$ based on criterion (7), while UIPW-Score chooses $\lambda$ based on criterion (9). The nuisance parameters $Q^{t}_{n,m_{n}}$ and $\pi^{t}_{n}$ are estimated using the highly adaptive lasso. We impose the parametric working model $m=E[Y\mid\theta]=\beta_{0}+\beta_{1}\theta$. We perform 360 simulations at sample sizes 250, 500, 850, and 1000. In Section 13 of the Supplementary Material, we present additional simulations in which the working model $m$ is estimated using other regression models, both parametric and nonparametric.

5.2 Efficiency of the Proposed Estimators

We showcase the effectiveness of our proposed estimators via coverage plots, as seen in the right panel of Figure 1. Notably, the coverage of IPW falls below 77% at a sample size of 250 and diminishes further with larger sample sizes. Conversely, the MR and UIPW-Dcar estimators maintain stable coverage around 90% at small sample sizes and achieve 93% and 95% at sample sizes of 850 and 1000, respectively. The coverage of UIPW-Score hovers around 85% at sample sizes of 250 and 500 but climbs rapidly to approximately 92% at a sample size of 1000.

The left panel of Figure 1 provides insight into the scaled bias, calculated by multiplying the bias by the square root of the sample size. The MR estimator displays the smallest scaled bias. Additionally, as the sample size increases, our proposed estimators exhibit a declining trend in scaled bias, indicating convergence towards the true value at a rate of op(n1/2)o_{p}(n^{-1/2}).

Conversely, the scaled bias of the IPW estimator shows an upward trend with increasing sample size, signifying a convergence rate slower than $n^{-1/2}$. This slower convergence is attributable to the bias term of the IPW estimator, which does not diminish quickly enough when the propensity score model is estimated nonparametrically.
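The scaled bias and coverage summaries behind Figure 1 amount to elementary computations over simulation replicates; a sketch assuming each replicate yields a point estimate and a standard error:

```python
import numpy as np

def summarize_replicates(estimates, ses, truth, n):
    """Monte Carlo summaries of the kind plotted in Figure 1: the bias
    of the replicate estimates scaled by sqrt(n), and the empirical
    coverage of nominal 95% Wald intervals."""
    estimates, ses = np.asarray(estimates, float), np.asarray(ses, float)
    scaled_bias = np.sqrt(n) * (estimates.mean() - truth)
    covered = ((estimates - 1.96 * ses <= truth)
               & (truth <= estimates + 1.96 * ses))
    return scaled_bias, covered.mean()
```

A flat or declining scaled bias across sample sizes is the diagnostic for bias vanishing faster than $n^{-1/2}$.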

Figure 1: Bias and Coverage plots

6 Data Application

We implement our proposed methods to determine the optimal treatment strategy for individuals with Parkinson’s disease, focusing on the LS1 study. The LS1 study enrolled patients diagnosed with Parkinson’s disease within the previous five years who were being treated with either Levodopa or a drug in the DRA class (Kieburtz et al., 2015). Given the prevalent use of Levodopa as the first-line therapy, we concentrate solely on patients already on Levodopa before the study. The study includes follow-ups at three months initially and annually thereafter, with our primary outcome ($Y$) being the UPDRS-III score two years after the start of the LS1 study. UPDRS-III is a measure of a patient’s motor function, ranging from 0 to 100.

Within our sample of 238 patients, the UPDRS-III score ranges from 3.00 to 68.00, with mean and median values around 20.00 and first and third quartiles at 13.00 and 26.00, respectively. We focus on two decision points for treatment adjustment: at three months and at one year into the study (i.e., $T=2$). Treatment decisions involve high (total Levodopa $>$ 300 mg/day) or low ($\leq$ 300 mg/day) doses (i.e., $A_{t}\in\{\text{high},\text{low}\}$). Our decision rule, $d^{\theta}$, assigns a high or low dose based on whether the UPDRS-III score exceeds the threshold $\theta$, i.e., $d^{\theta}=I(\theta>\text{UPDRS-III})$.

We seek to find θopt=argminθ[0,30]E[Yθ]\theta_{opt}=\operatorname*{argmin}_{\theta\in[0,30]}E[Y^{\theta}] where the choice of θ\theta’s range [0,30] aligns with our data support. We consider three marginal structural models:

Model 1: Quadratic regression in $\theta$, i.e., $E[Y\mid\theta]=\beta_{0}+\beta_{1}\theta+\beta_{2}\theta^{2}$.

Model 2: Highly adaptive lasso model with zero-order splines.

Model 3: Highly adaptive lasso model with smoothing.

We employ cross-validation to prevent overfitting. For risk calculation, we use four estimators: DR, IPW, and the two UIPWs (Dcar and Score). These estimators leverage the highly adaptive lasso to fit both the propensity score model $\pi^{t}$ and the nuisance parameter $Q^{t}$, using multiple covariates.

The risk estimates in Table 1 indicate that the quadratic regression model (Model 1) consistently yields the lowest risk among all estimators. However, the estimated risks of the highly adaptive lasso models (Models 2 and 3) are close to those of the quadratic regression model. Examining Table 2, the optimal $\theta$ values across folds mostly converge around 30, suggesting minimal improvement with higher Levodopa doses. Figure 2 presents the fitted marginal structural model curves, indicating minimal or no marginal benefit with increased Levodopa doses.

Dyskinesia and motor fluctuation are two major side effects of Levodopa treatment. To examine the association between Levodopa dose (high vs. low) and these side effects, we fit a logistic model that includes the binary dosing of Levodopa as the dependent variable and the presence of dyskinesia or motor fluctuation, recorded in UPDRS-IV one year after treatment initiation, as the independent variable. Our analysis indicates that the odds of developing adverse events in the high-dose group are 1.6 times those in the low-dose group. This finding is consistent with other studies reporting that a high dose of Levodopa is associated with dyskinesia and motor fluctuation (Thanvi et al., 2007). Hence, our analysis suggests favoring low-dose Levodopa for early-stage Parkinson’s patients, as a high dose of Levodopa yields minimal improvement in UPDRS-III scores while increasing the likelihood of adverse events.
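Because the fitted model is a univariate logistic regression with binary variables, the reported quantity equals the sample odds ratio of the corresponding 2×2 table; a sketch with hypothetical input arrays:

```python
import numpy as np

def odds_ratio(high_dose, adverse):
    """Sample odds ratio of an adverse event (dyskinesia or motor
    fluctuation) for high- vs. low-dose Levodopa; equivalent to the
    exponentiated slope of a univariate logistic regression."""
    high_dose = np.asarray(high_dose, bool)
    adverse = np.asarray(adverse, bool)
    a = np.sum(high_dose & adverse)     # high dose, adverse event
    b = np.sum(high_dose & ~adverse)    # high dose, no event
    c = np.sum(~high_dose & adverse)    # low dose, adverse event
    d = np.sum(~high_dose & ~adverse)   # low dose, no event
    return (a * d) / (b * c)            # cross-product ratio
```

Because the 2×2 odds ratio is symmetric in its two variables, the same number is obtained whichever binary variable is treated as the outcome.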

Table 1: Estimated risk with different risk estimators
Model DR UIPW-Dcar IPW UIPW-Score
Model 1 114.19 199.33 199.33 199.33
Model 2 125.64 204.16 204.16 204.16
Model 3 121.94 203.02 203.02 203.02
Table 2: Optimal points with each fold
Fold Model 1 Model 2 Model 3
Fold 1 19.5 0.0 30.0
Fold 2 30.0 18.5 30.0
Fold 3 30.0 18.5 22.5
Fold 4 17.0 21.5 25.0
Fold 5 30.0 18.5 30.0
Figure 2: Working models estimated using 5-fold cross validation

7 Discussion

This paper introduces two efficient estimators for assessing the risk associated with a marginal structural model. The first is a multiply robust estimator, shown to achieve efficiency when all nuisance parameters are estimated consistently; it remains consistent whenever either the propensity score or the mean outcome model is correctly specified. The second is a sieve-based IPW estimator that leverages an undersmoothed highly adaptive lasso to attain efficiency.

Our work has several promising directions for extension. While our method is adaptable to studies with multiple stages, the positivity assumption often fails in scenarios with a large number of stages. A potential modification involves transitioning from a deterministic to a stochastic decision rule, in which a probability is assigned to each possible treatment such that higher probabilities reflect a higher chance of being the optimal treatment. Stochastic rules are of interest because they carry more information about the relative optimality of a treatment compared with the other treatment choices.

To facilitate the comparison of several marginal structural models, one can use our asymptotic linearity results to identify a set of best models that contains the true best model with a given probability. This would require generalizing the multiple-comparison-with-the-best approach to our setting. Dynamic marginal structural models can also be used with time-to-event data by considering structural Cox or accelerated failure time models for the potential outcomes.

Acknowledgements

CP, BRB, and AE were supported by the National Institute of Neurological Disorders and Stroke (R61/R33 NS120240).

References

  • Benkeser and van der Laan [2016] David Benkeser and Mark van der Laan. The highly adaptive lasso estimator. In IEEE International Conference on Data Science and Advanced Analytics, pages 689–696. IEEE, 2016.
  • Benkeser et al. [2017] David Benkeser, Marco Carone, Mark J van der Laan, and Peter B Gilbert. Doubly robust nonparametric inference on the average treatment effect. Biometrika, 104(4):863–880, 2017.
  • Bibaut and van der Laan [2019] Aurelien F Bibaut and Mark J van der Laan. Fast rates for empirical risk minimization over cadlag functions with bounded sectional variation norm. arXiv preprint arXiv:1907.09244, 2019.
  • Bickel et al. [1993] Peter J Bickel, Chris AJ Klaassen, Ya’acov Ritov, and Jon A Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Springer, New York, 1993.
  • Brookhart and van der Laan [2006] M Alan Brookhart and Mark J van der Laan. A semiparametric model selection criterion with applications to the marginal structural model. Computational Statistics & Data Analysis, 50(2):475–498, 2006.
  • Cole and Hernán [2008] Stephen R Cole and Miguel A Hernán. Constructing inverse probability weights for marginal structural models. American Journal of Epidemiology, 168(6):656–664, 2008.
  • Duque et al. [2021] Daniel Rodriguez Duque, David A Stephens, and Erica EM Moodie. Estimation of optimal dynamic treatment regimes using gaussian process emulation. arXiv preprint arXiv:2105.12259, 2021.
  • Ertefaie et al. [2016] Ashkan Ertefaie, Tianshuang Wu, Kevin G Lynch, and Inbal Nahum-Shani. Identifying a set that contains the best dynamic treatment regimes. Biostatistics, 17(1):135–148, 2016.
  • Ertefaie et al. [2023] Ashkan Ertefaie, Nima S Hejazi, and Mark J van der Laan. Nonparametric inverse-probability-weighted estimators based on the highly adaptive lasso. Biometrics, 79(2):1029–1041, 2023.
  • Gill et al. [1995] Richard D Gill, Mark J van der Laan, and Jon A Wellner. Inefficient estimators of the bivariate survival function for three models. In Annales de l’IHP Probabilités et statistiques, volume 31, pages 545–597, 1995.
  • Hernán and Robins [2023] Miguel A. Hernán and James M. Robins. Causal Inference: What If. Chapman & Hall/CRC, Boca Raton, 2023.
  • Howe et al. [2011] Chanelle J Howe, Stephen R Cole, Joan S Chmiel, and Alvaro Munoz. Limitation of inverse probability-of-censoring weights in estimating survival in the presence of strong selection bias. American Journal of Epidemiology, 173(5):569–577, 2011.
  • Hubbard et al. [2016] Alan E Hubbard, Sara Kherad-Pajouh, and Mark J van der Laan. Statistical inference for data adaptive target parameters. The International Journal of Biostatistics, 12(1):3–19, 2016.
  • Imai and Ratkovic [2015] Kosuke Imai and Marc Ratkovic. Robust estimation of inverse probability weights for marginal structural models. Journal of the American Statistical Association, 110(511):1013–1023, 2015.
  • Kennedy [2019] Edward H Kennedy. Nonparametric causal effects based on incremental propensity score interventions. Journal of the American Statistical Association, 114(526):645–656, 2019.
  • Kennedy [2022] Edward H Kennedy. Semiparametric doubly robust targeted double machine learning: a review. arXiv preprint arXiv:2203.06469, 2022.
  • Kieburtz et al. [2015] Karl Kieburtz, Barbara C Tilley, Jordan J Elm, Debra Babcock, Robert Hauser, G Webster Ross, Alicia H Augustine, Erika U Augustine, Michael J Aminoff, Ivan G Bodis-Wollner, et al. Effect of creatine monohydrate on clinical progression in patients with parkinson disease: a randomized clinical trial. Journal of the American Medical Association, 313(6):584–593, 2015.
  • Mortimer et al. [2005] Kathleen M Mortimer, Romain Neugebauer, Mark van der Laan, and Ira B Tager. An application of model-fitting procedures for marginal structural models. American Journal of Epidemiology, 162(4):382–388, 2005.
  • Murphy et al. [2001] Susan A Murphy, Mark J van der Laan, James M Robins, and Conduct Problems Prevention Research Group. Marginal mean models for dynamic regimes. Journal of the American Statistical Association, 96(456):1410–1423, 2001.
  • Neugebauer and van der Laan [2007] Romain Neugebauer and Mark van der Laan. Nonparametric causal effects based on marginal structural models. Journal of Statistical Planning and Inference, 137(2):419–434, 2007.
  • Orellana et al. [2010] Liliana Orellana, Andrea Rotnitzky, and James M Robins. Dynamic regime marginal structural mean models for estimation of optimal dynamic treatment regimes, part i: main content. The International Journal of Biostatistics, 6(2), 2010.
  • Robins [1986] James Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9-12):1393–1512, 1986.
  • Robins [1989] James M Robins. The analysis of randomized and non-randomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In L Sechrest, H Freeman, and A Mulley, editors, Health Service Research Methodology: A Focus on AIDS, pages 113–159, Washington, D.C., 1989. U.S. Public Health Service, National Center for Health Services Research.
  • Robins [1997] James M Robins. Causal inference from complex longitudinal data. In M Berkane, editor, Latent Variable Modeling and Applications to Causality, pages 69–117, New York, 1997. Springer.
  • Robins et al. [2000] James M Robins, Miguel Angel Hernan, and Babette Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, pages 550–560, 2000.
  • Rotnitzky et al. [1998] Andrea Rotnitzky, James M Robins, and Daniel O Scharfstein. Semiparametric regression for repeated outcomes with nonignorable nonresponse. Journal of the American Statistical Association, 93(444):1321–1339, 1998.
  • Shortreed and Moodie [2012] Susan M Shortreed and Erica EM Moodie. Estimating the optimal dynamic antipsychotic treatment regime: evidence from the sequential multiple-assignment randomized clinical antipsychotic trials of intervention and effectiveness schizophrenia study. Journal of the Royal Statistical Society Series C: Applied Statistics, 61(4):577–599, 2012.
  • Thanvi et al. [2007] Bhomraj Thanvi, Nelson Lo, and Tom Robinson. Levodopa-induced dyskinesia in parkinson’s disease: clinical features, pathogenesis, prevention and treatment. Postgraduate Medical Journal, 83(980):384–388, 2007.
  • Tsiatis [2006] Anastasios A Tsiatis. Semiparametric Theory and Missing Data. Springer, New York, 2006.
  • van der Laan [2023] Mark van der Laan. Higher order spline highly adaptive lasso estimators of functional parameters: Pointwise asymptotic normality and uniform convergence rates. arXiv preprint arXiv:2301.13354, 2023.
  • van der Laan [1996] Mark J van der Laan. Efficient estimation in the bivariate censoring model and repairing npmle. The Annals of Statistics, 24(2):596–627, 1996.
  • van der Laan [2014] Mark J van der Laan. Targeted estimation of nuisance parameters to obtain valid statistical inference. The International Journal of Biostatistics, 10(1):29–57, 2014.
  • van der Laan and Dudoit [2003] Mark J van der Laan and Sandrine Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. U.C. Berkeley Division of Biostatistics Working Paper Series, 2003. Working Paper 130.
  • van der Laan and Robins [2003] Mark J van der Laan and James M Robins. Unified Methods for Censored Longitudinal Data and Causality. Springer, New York, 2003.
  • van der Vaart [1998] Aad W van der Vaart. Asymptotic Statistics. Cambridge University Press, Cambridge, 1998.
  • van der Vaart et al. [2006] Aad W van der Vaart, Sandrine Dudoit, and Mark J van der Laan. Oracle inequalities for multi-fold cross validation. Statistics & Decisions, 24(3):351–371, 2006.
  • Yu and van der Laan [2006] Zhuo Yu and Mark van der Laan. Double robust estimation in longitudinal marginal structural models. Journal of Statistical Planning and Inference, 136(3):1061–1089, 2006.

SUPPLEMENTARY MATERIAL

8 Proof of Lemma 1

The lemma follows in the usual way. For any given regimen-response curve mm, we see that

0[t=1TI(At=dtθ)π0t(At=dtθH¯t){Ym(θ,V)}2dF(θ)]\displaystyle\mathbb{P}_{0}\left[\int\prod_{t=1}^{T}\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{0}(A_{t}=d_{t}^{\theta}\mid\bar{H}_{t})}\left\{Y-m(\theta,V)\right\}^{2}\,\mathrm{d}F(\theta)\right]
=0[t=1TI(At=dtθ)π0t(At=dtθH¯t){Yθm(θ,V)}2dF(θ)]\displaystyle\hskip 42.67912pt=\mathbb{P}_{0}\left[\int\prod_{t=1}^{T}\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{0}(A_{t}=d_{t}^{\theta}\mid\bar{H}_{t})}\left\{Y_{\theta}-m(\theta,V)\right\}^{2}\,\mathrm{d}F(\theta)\right]
=0[π0T(At=dtθH¯t,Yθ)π0T(At=dtθH¯t)t=1T1I(At=dtθ)π0t(At=dtθH¯t){Yθm(θ,V)}2dF(θ)]\displaystyle\hskip 42.67912pt=\mathbb{P}_{0}\left[\int\frac{\pi^{T}_{0}(A_{t}=d_{t}^{\theta}\mid\bar{H}_{t},Y_{\theta})}{\pi^{T}_{0}(A_{t}=d_{t}^{\theta}\mid\bar{H}_{t})}\prod_{t=1}^{T-1}\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{0}(A_{t}=d_{t}^{\theta}\mid\bar{H}_{t})}\left\{Y_{\theta}-m(\theta,V)\right\}^{2}\,\mathrm{d}F(\theta)\right]
=0[t=1T1I(At=dtθ)π0t(At=dtθH¯t){Yθm(θ,V)}2dF(θ)]\displaystyle\hskip 42.67912pt=\mathbb{P}_{0}\left[\int\prod_{t=1}^{T-1}\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{0}(A_{t}=d_{t}^{\theta}\mid\bar{H}_{t})}\left\{Y_{\theta}-m(\theta,V)\right\}^{2}\,\mathrm{d}F(\theta)\right]
=\displaystyle\hskip 42.67912pt=\dots
=0[{Yθm(θ,V)}2dF(θ)],\displaystyle\hskip 42.67912pt=\mathbb{P}_{0}\left[\int\left\{Y_{\theta}-m(\theta,V)\right\}^{2}\,\mathrm{d}F(\theta)\right],

where the second equality applies iterated expectation, the third uses the sequential randomization assumption (so that the leading ratio equals one), and the dots formally follow by induction over the remaining stages.

Now, recall that

Ψ0(mn)=𝔼Bn0{Yθmn,b(θ,V)}2dF(θ)\Psi_{0}(m_{n})=\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\{Y^{\theta}-m_{n,b}(\theta,V)\}^{2}\,\mathrm{d}F(\theta)

is an average over folds. Taking the regimen-response curve mm to be the curve mn,bm_{n,b} fitted to the training data n,b1\mathbb{P}^{1}_{n,b} and 0\mathbb{P}_{0} to be the expectation with respect to a (new) test point, our previous calculation shows that

0[t=1TI(At=dtθ)π0t(At=dtθH¯t){Ymn,b(θ,V)}2dF(θ)]=0[{Yθmn,b(θ,V)}2dF(θ)].\displaystyle\mathbb{P}_{0}\left[\int\prod_{t=1}^{T}\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{0}(A_{t}=d_{t}^{\theta}\mid\bar{H}_{t})}\left\{Y-m_{n,b}(\theta,V)\right\}^{2}\,\mathrm{d}F(\theta)\right]=\mathbb{P}_{0}\left[\int\left\{Y_{\theta}-m_{n,b}(\theta,V)\right\}^{2}\,\mathrm{d}F(\theta)\right].

The result now follows by averaging over folds.
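The identity can also be checked numerically. The following one-stage ($T=1$) example with a trivial working model $m\equiv 0$ is entirely our construction and is not the simulation design of Section 5; it verifies that the IPW-weighted observed risk agrees with the counterfactual risk when the true propensity score is used.

```python
import numpy as np

# One-stage numerical check of the identity proved above.
rng = np.random.default_rng(1)
n = 200_000
S = rng.normal(0, 1, n)
pi = 1 / (1 + np.exp(-0.5 * S))              # true propensity P(A = 1 | S)
A = rng.binomial(1, pi)
Y1 = S + 1 + rng.normal(0, 1, n)             # potential outcome under a = 1
Y0 = S + rng.normal(0, 1, n)                 # potential outcome under a = 0
Y = np.where(A == 1, Y1, Y0)                 # observed outcome (consistency)
d = (S < 0).astype(int)                      # rule d^theta with theta = 0
m = 0.0                                      # trivial working model
follow = (A == d)
w = follow / np.where(A == 1, pi, 1 - pi)    # I(A = d) / pi(A = d | S)
ipw_risk = np.mean(w * (Y - m) ** 2)         # left-hand side of the identity
Y_d = np.where(d == 1, Y1, Y0)               # counterfactual outcome under the rule
true_risk = np.mean((Y_d - m) ** 2)          # right-hand side of the identity
```

Up to Monte Carlo error, the two means coincide, as the displayed chain of equalities predicts.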

9 Proof of Theorem 1

The target parameter Ψ0,nda(m)\Psi_{0,\text{nda}}(m) is a pathwise differentiable function. Hence, the canonical gradient of Ψ0,nda(m)\Psi_{0,\text{nda}}(m) in a nonparametric model \mathcal{M} corresponds to the efficient influence function of a regular, asymptotically linear (RAL) estimator. Let Ψϵ,nda(m)=PϵLm(π)\Psi_{\epsilon,\text{nda}}(m)=P_{\epsilon}L_{m}(\pi) where Pϵ=(1+ϵ𝒮)P0P_{\epsilon}=(1+\epsilon\mathcal{S})P_{0} is a perturbed data distribution. The canonical gradient φ\varphi is the one that satisfies the following equation:

ϵΨϵ,nda(m)ϵ=0=E{φ(m)𝒮(O;ϵ)}ϵ=0,\frac{\partial}{\partial\epsilon}\Psi_{\epsilon,\text{nda}}(m)\mid_{\epsilon=0}=E\{\varphi(m)\mathcal{S}(O;\epsilon)\}\mid_{\epsilon=0},

where 𝒮(O;ϵ)=ϵlogf(O;ϵ)\mathcal{S}(O;\epsilon)=\frac{\partial}{\partial\epsilon}\log f(O;\epsilon).

Let πϵ\pi_{\epsilon} denote the perturbed propensity score. Then,

ϵΨϵ,nda(m)ϵ=0\displaystyle\frac{\partial}{\partial\epsilon}\Psi_{\epsilon,\text{nda}}(m)\mid_{\epsilon=0} =𝔼{t=1TI(At=dtθ)π0t(At=dtθ|H¯t){Ym(θ,V)}2𝒮(𝒪)dF(θ)}\displaystyle=\mathbb{E}\int\Big{\{}\prod_{t=1}^{T}\frac{I(A_{t}=d_{t}^{\theta})}{\pi_{0}^{t}(A_{t}=d_{t}^{\theta}|\bar{H}_{t})}\{Y-m(\theta,V)\}^{2}\mathcal{S}(\mathcal{O})\mathrm{d}F(\theta)\Big{\}}
+𝔼[ϵ{t=1TI(At=dtθ)πϵt(At=dtθH¯t){Ym(θ,V)}2}ϵ=0dF(θ)]\displaystyle+\mathbb{E}\left[\int\frac{\partial}{\partial\epsilon}\left\{\prod_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})}{\pi^{t}_{\epsilon}(A_{t}=d^{\theta}_{t}\mid\bar{H}_{t})}\{Y-m(\theta,V)\}^{2}\right\}_{\epsilon=0}\mathrm{d}F(\theta)\right]
=𝔼{t=1TI(At=dtθ)π0t(At=dtθ|H¯t){Ym(θ,V)}2𝒮(𝒪)dF(θ)}\displaystyle=\mathbb{E}\Big{\{}\int\prod_{t=1}^{T}\frac{I(A_{t}=d_{t}^{\theta})}{\pi_{0}^{t}(A_{t}=d_{t}^{\theta}|\bar{H}_{t})}\{Y-m(\theta,V)\}^{2}\mathcal{S}(\mathcal{O})\mathrm{d}F(\theta)\Big{\}}
\displaystyle\quad-\mathbb{E}\int\Big[\sum_{t=1}^{T}\Big\{\prod_{j\neq t}\frac{I(A_{j}=d_{j}^{\theta})}{\pi^{j}_{0}(A_{j}=d^{\theta}_{j}\mid\bar{H}_{j})}\Big\}\frac{I(A_{t}=d_{t}^{\theta})\{Y-m(\theta,V)\}^{2}}{\pi^{t}_{0}(A_{t}=d^{\theta}_{t}\mid\bar{H}_{t})^{2}}\frac{\partial}{\partial\epsilon}\pi_{\epsilon}(A_{t}=d^{\theta}_{t}\mid\bar{H}_{t})\,\mathrm{d}F(\theta)\Big] (10)

We derive the form of (10) by focusing on a given tt which then easily generalizes to all 1tT1\leq t\leq T. We have

𝔼[jtT{I(Aj=djθ)π0j(Aj=djθ|H¯j)}I(At=dtθ)π0t(At=dtθH¯t)2{Ym(θ,V)}2ϵπϵ(At=dtθH¯t)dF(θ)]\displaystyle\mathbb{E}\Big{[}\int\prod_{j\neq t}^{T}\Big{\{}\frac{I(A_{j}=d_{j}^{\theta})}{\pi^{j}_{0}(A_{j}=d^{\theta}_{j}|\bar{H}_{j})}\Big{\}}\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{0}(A_{t}=d^{\theta}_{t}\mid\bar{H}_{t})^{2}}\{Y-m(\theta,V)\}^{2}\frac{\partial}{\partial\epsilon}\pi_{\epsilon}(A_{t}=d^{\theta}_{t}\mid\bar{H}_{t})\mathrm{d}F(\theta)\Big{]}
=𝔼[I(At=dtθ)π0t{At=dtθH¯t}𝔼[jtTI(Aj=djθ)π0j{Aj=djθ|H¯j}{Ym(θ,V)}2S¯t,A¯t]𝒮(𝒪)dF(θ)]\displaystyle=\mathbb{E}\left[\int\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{0}\left\{A_{t}=d^{\theta}_{t}\mid\bar{H}_{t}\right\}}\mathbb{E}\left[\prod_{j\neq t}^{T}\frac{I(A_{j}=d_{j}^{\theta})}{\pi^{j}_{0}\left\{A_{j}=d^{\theta}_{j}|\bar{H}_{j}\right\}}\{Y-m(\theta,V)\}^{2}\mid\bar{S}_{t},\bar{A}_{t}\right]\mathcal{S}(\mathcal{O})\mathrm{d}F(\theta)\right]
\displaystyle-\mathbb{E}\Bigg[\int\mathbb{E}\Big[\prod_{j\neq t}^{T}\frac{I(A_{j}=d_{j}^{\theta})}{\pi^{j}_{0}\{A_{j}=d^{\theta}_{j}\mid\bar{H}_{j}\}}\{Y-m(\theta,V)\}^{2}\mid\bar{S}_{t},\bar{A}_{t-1},A_{t}=d^{\theta}_{t}\Big]\mathcal{S}(\mathcal{O})\mathrm{d}F(\theta)\Bigg]
\displaystyle=\mathbb{E}\Bigg[\int\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{0}\{A_{t}=d^{\theta}_{t}\mid\bar{H}_{t}\}}\mathbb{E}\Big[\prod_{j\neq t}^{T}\frac{I(A_{j}=d_{j}^{\theta})}{\pi_{0}^{j}\{A_{j}=d^{\theta}_{j}\mid\bar{H}_{j}\}}\{Y-m(\theta,V)\}^{2}\mid\bar{S}_{t},\bar{A}_{t-1},A_{t}=d^{\theta}_{t}\Big]\mathcal{S}(\mathcal{O})\mathrm{d}F(\theta)\Bigg]
\displaystyle-\mathbb{E}\Bigg[\int\mathbb{E}\Big[\prod_{j\neq t}^{T}\frac{I(A_{j}=d_{j}^{\theta})}{\pi_{0}^{j}\{A_{j}=d^{\theta}_{j}\mid\bar{H}_{j}\}}\{Y-m(\theta,V)\}^{2}\mid\bar{S}_{t},\bar{A}_{t-1},A_{t}=d^{\theta}_{t}\Big]\mathcal{S}(\mathcal{O})\mathrm{d}F(\theta)\Bigg]
\displaystyle=\mathbb{E}\Bigg[\int\Big\{\prod_{i=1}^{t-1}\frac{I(A_{i}=d^{\theta}_{i})}{\pi^{i}_{0}(A_{i}=d^{\theta}_{i}\mid\bar{H}_{i})}\Big\}Q^{t}_{0,m}\Big[\frac{I(A_{t}=d_{t}^{\theta})}{\pi_{0}^{t}\{A_{t}=d^{\theta}_{t}\mid\bar{H}_{t}\}}-1\Big]\mathcal{S}(\mathcal{O})\mathrm{d}F(\theta)\Bigg]
\displaystyle=\mathbb{E}\Bigg[\int\Big\{\prod_{i=1}^{t-1}\frac{I(A_{i}=d^{\theta}_{i})}{\pi^{i}_{0}(A_{i}=d^{\theta}_{i}\mid\bar{H}_{i})}\Big\}Q^{t}_{0,m}\Big[\frac{I(A_{t}=d_{t}^{\theta})-\pi_{0}^{t}\{A_{t}=d^{\theta}_{t}\mid\bar{H}_{t}\}}{\pi_{0}^{t}\{A_{t}=d^{\theta}_{t}\mid\bar{H}_{t}\}}\Big]\mathcal{S}(\mathcal{O})\mathrm{d}F(\theta)\Bigg],

where

Q^{t}_{0,m}=\mathbb{E}\Big[\prod_{j=t+1}^{T}\frac{I(A_{j}=d_{j}^{\theta})}{\pi_{0}^{j}\{A_{j}=d^{\theta}_{j}\mid\bar{H}_{j}\}}\{Y-m(\theta,V)\}^{2}\mid\bar{H}_{t},A_{t}=d^{\theta}_{t}\Big].

Hence, the un-centered canonical gradient of Ψ0,nda(m)\Psi_{0,\text{nda}}(m) is

\displaystyle\varphi_{m}(\pi,Q_{0,m})=\int\Bigg\{\prod_{t=1}^{T}\frac{I(A_{t}=d_{t}^{\theta})}{\pi^{t}_{0}(A_{t}=d^{\theta}_{t}\mid\bar{H}_{t})}\{Y-m(\theta,V)\}^{2}
\displaystyle-\sum_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})-\pi^{t}_{0}(A_{t}=d^{\theta}_{t}\mid\bar{H}_{t})}{\pi^{t}_{0}(A_{t}=d^{\theta}_{t}\mid\bar{H}_{t})}Q^{t}_{0,m}\prod_{j=1}^{t-1}\frac{I(A_{j}=d_{j}^{\theta})}{\pi^{j}_{0}(A_{j}=d^{\theta}_{j}\mid\bar{H}_{j})}\Bigg\}\,dF(\theta). (11)

10 Proof of Theorem 2

We let πn,b\pi_{n,b} be the estimate of π0\pi_{0} obtained using the bthb^{th} sample split. Similarly, Qn,b,mQ_{n,b,m} is the estimate of Q0,mQ_{0,m} and mbm_{b} is the estimate of m0m_{0} obtained using the bthb^{th} sample split.

ΨnMR(mn,πn,Qn,mn)Ψ0(mn)\displaystyle\Psi^{MR}_{n}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m_{n})
=\displaystyle= ΨnMR(mn,πn,Qn,mn)Ψ0(mn)±ΨnMR(mn,π0,Q0,mn)\displaystyle\Psi^{MR}_{n}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m_{n})\pm\Psi_{n}^{MR}(m_{n},\pi_{0},Q_{0,m_{n}})
=\displaystyle= 𝔼Bnn,b1φmn,b(πn,b,Qn,b,mn,b)𝔼Bn0φmn,b(π0,Q0,mn,b)±𝔼Bnn,b1φmn,b(π0,Q0,mn,b)\displaystyle\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\mathbb{E}_{B_{n}}\mathbb{P}_{0}\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})\pm\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})
=\displaystyle= 𝔼Bn(n,b10)φmn,b(π0,Q0,mn,b)+𝔼Bnn,b1φmn,b(πn,b,Qn,b,mn,b)𝔼Bnn,b1φmn,b(π0,Q0,mn,b)\displaystyle\mathbb{E}_{B_{n}}(\mathbb{P}^{1}_{n,b}-\mathbb{P}_{0})\varphi_{m_{n,b}}\left(\pi_{0},Q_{0,m_{n,b}}\right)+\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})
=\displaystyle= 𝔼Bn(n,b10)φmn,b(π0,Q0,mn,b)+𝔼Bnn,b1{φmn,b(πn,b,Qn,b,mn,b)φmn,b(π0,Q0,mn,b)}\displaystyle\mathbb{E}_{B_{n}}(\mathbb{P}^{1}_{n,b}-\mathbb{P}_{0})\varphi_{m_{n,b}}\left(\pi_{0},Q_{0,m_{n,b}}\right)+\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\left\{\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})\right\}
=\displaystyle= 𝔼Bn(n,b10)φm(π0,Q0,m)+𝔼Bn0{φmn,b(πn,b,Qn,b,mn,b)φmn,b(π0,Q0,mn,b)}+op(n1/2)\displaystyle\mathbb{E}_{B_{n}}(\mathbb{P}^{1}_{n,b}-\mathbb{P}_{0})\varphi_{m^{*}}\left(\pi_{0},Q_{0,m^{*}}\right)+\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left\{\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})\right\}+o_{p}(n^{-1/2})
=\displaystyle= (n0)φm(π0,Q0,m)+𝔼Bn0{φmn,b(πn,b,Qn,b,mn,b)φmn,b(π0,Q0,mn,b)}+op(n1/2)\displaystyle(\mathbb{P}_{n}-\mathbb{P}_{0})\varphi_{m^{*}}\left(\pi_{0},Q_{0,m^{*}}\right)+\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left\{\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})\right\}+o_{p}(n^{-1/2})

We need to show that 𝔼Bn0{φmn,b(πn,b,Qn,b,mn,b)φmn,b(π0,Q0,mn,b)}=op(n1/2)\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left\{\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})\right\}=o_{p}(n^{-1/2}).

𝔼Bn0{φmn,b(πn,b,Qn,b,mn,b)φmn,b(π0,Q0,mn,b)}\displaystyle\mathbb{E}_{B_{n}}\mathbb{P}_{0}\{\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})\}
\displaystyle=\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\left[\prod_{t=1}^{T}\frac{I\{A_{t}=d^{\theta}_{t}\}}{\pi^{t}_{n,b}}\{Y-m_{n,b}(\theta,V)\}^{2}-\prod_{t=1}^{T}\frac{I\{A_{t}=d^{\theta}_{t}\}}{\pi^{t}_{0}}\{Y-m_{n,b}(\theta,V)\}^{2}\right]dF(\theta)
+t=1T𝔼Bn0[i=1tI{Ai=diθ}π0iQ0,mn,bti=1tI{Ai=diθ}πn,biQn,b,mn,bt]𝑑F(θ)\displaystyle+\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\left[\prod_{i=1}^{t}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}Q^{t}_{0,m_{n,b}}-\prod_{i=1}^{t}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{n,b}}Q^{t}_{n,b,m_{n,b}}\right]dF(\theta)
+t=1T𝔼Bn0[i=1t1I{Ai=diθ}πn,biQb,mn,bti=1t1I{Ai=diθ}π0iQ0,mn,bt]𝑑F(θ)\displaystyle+\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\left[\prod_{i=1}^{t-1}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{n,b}}Q^{t}_{b,m_{n,b}}-\prod_{i=1}^{t-1}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}Q^{t}_{0,m_{n,b}}\right]dF(\theta)
=(I)+(II)+(III)\displaystyle=(I)+(II)+(III)

For the rest of this calculation, we use the color brown to indicate that a term converges to 0 at the n^{1/2} rate under the required assumptions.
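Each of the three terms is handled with the same telescoping identity for differences of products: for any sequences (a_{t})_{t=1}^{T} and (b_{t})_{t=1}^{T},

```latex
\prod_{t=1}^{T}a_{t}-\prod_{t=1}^{T}b_{t}
  =\sum_{k=1}^{T}\Big(\prod_{t=1}^{k-1}a_{t}\Big)(a_{k}-b_{k})\Big(\prod_{t=k+1}^{T}b_{t}\Big).
```

The \pm steps in the expansions of (I), (II), and (III) below instantiate this identity with a_{t}=I\{A_{t}=d^{\theta}_{t}\}/\pi^{t}_{n,b} and b_{t}=I\{A_{t}=d^{\theta}_{t}\}/\pi^{t}_{0} (or the analogous ratios), so each summand carries exactly one \pi^{k}_{0}-\pi^{k}_{n,b} factor.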

\displaystyle(I)=\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\prod_{t=1}^{T}\frac{I\{A_{t}=d^{\theta}_{t}\}}{\pi^{t}_{n,b}}\{Y-m_{n,b}(\theta,V)\}^{2}-\prod_{t=1}^{T}\frac{I\{A_{t}=d^{\theta}_{t}\}}{\pi^{t}_{0}}\{Y-m_{n,b}(\theta,V)\}^{2}dF(\theta)
±t=1T𝔼Bn0[i=1TI{Ai=diθ}(i=1t1πn,bi)(i=tTπ0i){Ymn,b(θ,V)}2]𝑑F(θ)\displaystyle\pm\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\left[\frac{\prod_{i=1}^{T}I\{A_{i}=d^{\theta}_{i}\}}{\left(\prod_{i=1}^{t-1}\pi^{i}_{n,b}\right)\left(\prod_{i=t}^{T}\pi^{i}_{0}\right)}\left\{Y-m_{n,b}(\theta,V)\right\}^{2}\right]dF(\theta)
\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{T}I\{A_{i}=d^{\theta}_{i}\}}{\left(\prod_{i=1}^{t}\pi^{i}_{n,b}\right)\left(\prod_{i=t+1}^{T}\pi^{i}_{0}\right)}\left\{Y-m_{n,b}(\theta,V)\right\}^{2}\left\{\frac{\pi^{t}_{0}-\pi^{t}_{n,b}}{\pi^{t}_{0}}\right\}dF(\theta)
\displaystyle={\color[rgb]{.75,.5,.25}\sum_{t=1}^{T}\sum_{l=1}^{t-1}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{T}I\{A_{i}=d^{\theta}_{i}\}}{\left(\prod_{i=1}^{l-1}\pi^{i}_{n,b}\right)\left(\prod_{i=l+1}^{T}\pi^{i}_{0}\right)}\{Y-m_{n,b}(\theta,V)\}^{2}\,\frac{\pi^{l}_{0}-\pi^{l}_{n,b}}{\pi^{l}_{n,b}\pi^{l}_{0}}\,\frac{\pi^{t}_{0}-\pi^{t}_{n,b}}{\pi^{t}_{n,b}}dF(\theta)}
+t=1T𝔼Bn0(i=1TI{Ai=diθ}π0i{Ymn,b(θ,V)}2π0tπn,btπn,bt)𝑑F(θ)\displaystyle+\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\left(\prod_{i=1}^{T}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}\left\{Y-m_{n,b}(\theta,V)\right\}^{2}\frac{\pi^{t}_{0}-\pi^{t}_{n,b}}{\pi^{t}_{n,b}}\right)dF(\theta)
=\displaystyle= t=1T𝔼Bn0(i=1TI{Ai=diθ}π0i{Ymn,b(θ,V)}2π0tπn,btπn,bt)𝑑F(θ)+op(n1/2)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\left(\prod_{i=1}^{T}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}\left\{Y-m_{n,b}(\theta,V)\right\}^{2}\frac{\pi^{t}_{0}-\pi^{t}_{n,b}}{\pi^{t}_{n,b}}\right)dF(\theta)+o_{p}(n^{-1/2})
=\displaystyle= t=1T𝔼Bn0i=1tI{Ai=diθ}π0i(Q0,mn,bt)π0tπn,btπn,btdF(θ)+op(n1/2)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\prod_{i=1}^{t}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}\left(Q^{t}_{0,m_{n,b}}\right)\frac{\pi^{t}_{0}-\pi^{t}_{n,b}}{\pi^{t}_{n,b}}dF(\theta)+o_{p}(n^{-1/2})

We continue with Term (II)

(II)=\displaystyle(II)= t=1T𝔼Bn0[i=1tI{Ai=diθ}π0iQ0,mn,bti=1tI{Ai=diθ}πn,biQn,b,mn,bt]𝑑F(θ)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\left[\prod_{i=1}^{t}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}Q^{t}_{0,m_{n,b}}-\prod_{i=1}^{t}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{n,b}}Q^{t}_{n,b,m_{n,b}}\right]dF(\theta)
=\displaystyle= t=1T𝔼Bn0[i=1tI{Ai=diθ}π0iQ0,mn,bti=1tI{Ai=diθ}πn,biQn,b,mn,bt]𝑑F(θ)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\left[\prod_{i=1}^{t}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}Q^{t}_{0,m_{n,b}}-\prod_{i=1}^{t}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{n,b}}Q^{t}_{n,b,m_{n,b}}\right]dF(\theta)
±t=1Tk=1t𝔼Bn0i=1tI{Ai=diθ}i=1k1πn,bii=ktπ0iQn,b,mn,bt\displaystyle\pm\sum_{t=1}^{T}\sum_{k=1}^{t}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{i=1}^{k-1}\pi^{i}_{n,b}\prod_{i=k}^{t}\pi^{i}_{0}}Q^{t}_{n,b,m_{n,b}}
=\displaystyle= t=1T𝔼Bn0i=1tI{Ai=diθ}π0i(Q0,mn,btQn,b,mn,bt)dF(θ)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\prod_{i=1}^{t}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}\left(Q^{t}_{0,m_{n,b}}-Q^{t}_{n,b,m_{n,b}}\right)dF(\theta)
+t=1Ti=1t𝔼Bn0i=1tI{Ai=diθ}k=1i1(πn,bk)l=it(π0l)(Qn,b,mn,btQ0,mn,bt)(πn,biπ0iπn,bi)𝑑F(θ)\displaystyle+{\color[rgb]{.75,.5,.25}\sum_{t=1}^{T}\sum_{i=1}^{t}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{k=1}^{i-1}\left(\pi^{k}_{n,b}\right)\prod_{l=i}^{t}\left(\pi^{l}_{0}\right)}\left(Q^{t}_{n,b,m_{n,b}}-Q^{t}_{0,m_{n,b}}\right)\left(\frac{\pi^{i}_{n,b}-\pi^{i}_{0}}{\pi^{i}_{n,b}}\right)dF(\theta)}
+t=1Ti=1t𝔼Bn0i=1tI{Ai=diθ}k=1i1(πn,bk)l=it(π0l)(πn,biπ0iπn,bi)Q0,mn,bt𝑑F(θ)\displaystyle+\sum_{t=1}^{T}\sum_{i=1}^{t}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{k=1}^{i-1}\left(\pi^{k}_{n,b}\right)\prod_{l=i}^{t}\left(\pi^{l}_{0}\right)}\left(\frac{\pi^{i}_{n,b}-\pi^{i}_{0}}{\pi^{i}_{n,b}}\right)Q^{t}_{0,m_{n,b}}dF(\theta)
=\displaystyle= t=1T𝔼Bn0i=1tI{Ai=diθ}π0i(Qn,b,mn,btQ0,mn,bt)dF(θ)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\prod_{i=1}^{t}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}\left(Q^{t}_{n,b,m_{n,b}}-Q^{t}_{0,m_{n,b}}\right)dF(\theta)
+t=1Ti=1t𝔼Bn0i=1tI{Ai=diθ}i=1t(π0i)(πn,biπ0iπn,bi)Q0,mn,bt𝑑F(θ)\displaystyle+\sum_{t=1}^{T}\sum_{i=1}^{t}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{i=1}^{t}\left(\pi^{i}_{0}\right)}\left(\frac{\pi^{i}_{n,b}-\pi^{i}_{0}}{\pi^{i}_{n,b}}\right)Q^{t}_{0,m_{n,b}}dF(\theta)
+t=1T1klt𝔼Bn0i=1tI{Ai=diθ}i=1t(π0i)(πn,blπ0lπn,bl)(π0kπn,bkπn,bk)Q0,mbt𝑑F(θ)+op(n1/2)\displaystyle+{\color[rgb]{.75,.5,.25}\sum_{t=1}^{T}\sum_{1\leq k\leq l\leq t}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{i=1}^{t}\left(\pi^{i}_{0}\right)}\left(\frac{\pi^{l}_{n,b}-\pi^{l}_{0}}{\pi^{l}_{n,b}}\right)\left(\frac{\pi^{k}_{0}-\pi^{k}_{n,b}}{\pi^{k}_{n,b}}\right)Q^{t}_{0,m_{b}}dF(\theta)}+o_{p}(n^{-1/2})
=\displaystyle= t=1T𝔼Bn0i=1tI{Ai=diθ}π0i(Qn,b,mbtQ0,mbt)dF(θ)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\prod_{i=1}^{t}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}\left(Q^{t}_{n,b,m_{b}}-Q^{t}_{0,m_{b}}\right)dF(\theta)
+t=1Ti=1t𝔼Bn0i=1tI{Ai=diθ}i=1t(π0i)(πn,biπ0iπn,bi)Q0,mbt𝑑F(θ)+op(n1/2)\displaystyle+\sum_{t=1}^{T}\sum_{i=1}^{t}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{i=1}^{t}\left(\pi^{i}_{0}\right)}\left(\frac{\pi^{i}_{n,b}-\pi^{i}_{0}}{\pi^{i}_{n,b}}\right)Q^{t}_{0,m_{b}}dF(\theta)+o_{p}(n^{-1/2})

Lastly, we consider the third term, (III).

(III)=\displaystyle(III)= t=1T𝔼Bn0[i=1t1I{Ai=diθ}πn,biQn,b,mn,bti=1t1I{Ai=diθ}πn,biQ0,mn,bt]𝑑F(θ)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\left[\prod_{i=1}^{t-1}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{n,b}}Q^{t}_{n,b,m_{n,b}}-\prod_{i=1}^{t-1}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{n,b}}Q^{t}_{0,m_{n,b}}\right]dF(\theta)
=\displaystyle= t=1T𝔼Bn0[i=1t1I{Ai=diθ}πn,biQn,b,mn,bti=1t1I{Ai=diθ}πn,biQ0,mn,bt]𝑑F(θ)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\left[\prod_{i=1}^{t-1}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{n,b}}Q^{t}_{n,b,m_{n,b}}-\prod_{i=1}^{t-1}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{n,b}}Q^{t}_{0,m_{n,b}}\right]dF(\theta)
±t=1Tk=1t1𝔼Bn0i=1t1I{Ai=diθ}i=1k1πn,bii=kt1π0iQn,b,mn,bt\displaystyle\pm\sum_{t=1}^{T}\sum_{k=1}^{t-1}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t-1}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{i=1}^{k-1}\pi^{i}_{n,b}\prod_{i=k}^{t-1}\pi^{i}_{0}}Q^{t}_{n,b,m_{n,b}}
=\displaystyle= t=1T𝔼Bn0i=1t1I{Ai=diθ}π0i(Qn,b,mn,btQ0,mn,bt)dF(θ)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\prod_{i=1}^{t-1}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}\left(Q^{t}_{n,b,m_{n,b}}-Q^{t}_{0,m_{n,b}}\right)dF(\theta)
+t=1Ti=1t1𝔼Bn0i=1t1I{Ai=diθ}k=1i(πn,bk)l=it1(π0i)(Q0,mn,btQn,b,mn,bt)(π0lπn,blπn,bl)𝑑F(θ)\displaystyle+{\color[rgb]{.75,.5,.25}\sum_{t=1}^{T}\sum_{i=1}^{t-1}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t-1}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{k=1}^{i}\left(\pi^{k}_{n,b}\right)\prod_{l=i}^{t-1}\left(\pi^{i}_{0}\right)}\left(Q^{t}_{0,m_{n,b}}-Q^{t}_{n,b,m_{n,b}}\right)\left(\frac{\pi^{l}_{0}-\pi^{l}_{n,b}}{\pi^{l}_{n,b}}\right)dF(\theta)}
+t=1T1lt1𝔼Bn0i=1t1I{Ai=diθ}k=1i(πn,bi)l=it1(π0i)(π0lπn,blπn,bl)Q0,mn,bt𝑑F(θ)\displaystyle+\sum_{t=1}^{T}\sum_{1\leq l\leq t-1}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t-1}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{k=1}^{i}\left(\pi^{i}_{n,b}\right)\prod_{l=i}^{t-1}\left(\pi^{i}_{0}\right)}\left(\frac{\pi^{l}_{0}-\pi^{l}_{n,b}}{\pi^{l}_{n,b}}\right)Q^{t}_{0,m_{n,b}}dF(\theta)
=\displaystyle= t=1T𝔼Bn0i=1t1I{Ai=diθ}π0i(Q0,mn,btQn,b,mn,bt)dF(θ)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\prod_{i=1}^{t-1}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}\left(Q^{t}_{0,m_{n,b}}-Q^{t}_{n,b,m_{n,b}}\right)dF(\theta)
+t=1Ti=1t1𝔼Bn0i=1tI{Ai=diθ}i=1t(π0i)(π0iπn,biπn,bi)Q0,mn,bt𝑑F(θ)\displaystyle+\sum_{t=1}^{T}\sum_{i=1}^{t-1}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{i=1}^{t}\left(\pi^{i}_{0}\right)}\left(\frac{\pi^{i}_{0}-\pi^{i}_{n,b}}{\pi^{i}_{n,b}}\right)Q^{t}_{0,m_{n,b}}dF(\theta)
+t=1T1klt1𝔼Bn0i=1t1I{Ai=diθ}i=1t1(π0i)(π0lπn,blπn,bl)(πn,bkπ0kπn,bk)Qn,b,mn,bt𝑑F(θ)+op(n1/2)\displaystyle+{\color[rgb]{.75,.5,.25}\sum_{t=1}^{T}\sum_{1\leq k\leq l\leq t-1}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t-1}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{i=1}^{t-1}\left(\pi^{i}_{0}\right)}\left(\frac{\pi^{l}_{0}-\pi^{l}_{n,b}}{\pi^{l}_{n,b}}\right)\left(\frac{\pi^{k}_{n,b}-\pi^{k}_{0}}{\pi^{k}_{n,b}}\right)Q^{t}_{n,b,m_{n,b}}dF(\theta)}+o_{p}(n^{-1/2})
=\displaystyle= t=1T𝔼Bn0i=1t1I{Ai=diθ}π0i(Q0,mn,btQn,b,mn,bt)dF(θ)\displaystyle\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\prod_{i=1}^{t-1}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}\left(Q^{t}_{0,m_{n,b}}-Q^{t}_{n,b,m_{n,b}}\right)dF(\theta)
+t=1Ti=1t1𝔼Bn0i=1t1I{Ai=diθ}i=1t1π0i(π0iπn,biπn,bi)Q0,mn,bt𝑑F(θ)+op(n1/2)\displaystyle+\sum_{t=1}^{T}\sum_{i=1}^{t-1}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t-1}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{i=1}^{t-1}\pi^{i}_{0}}\left(\frac{\pi^{i}_{0}-\pi^{i}_{n,b}}{\pi^{i}_{n,b}}\right)Q^{t}_{0,m_{n,b}}dF(\theta)+o_{p}(n^{-1/2})

We combine all the remaining terms in (I), (II), and (III)

(I)+(II)+(III)\displaystyle(I)+(II)+(III)
\displaystyle=\sum_{t=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\prod_{i=1}^{t-1}\frac{I\{A_{i}=d^{\theta}_{i}\}}{\pi^{i}_{0}}\left(Q^{t}_{n,b,m_{n,b}}-Q^{t}_{0,m_{n,b}}\right)\left(\frac{I\{A_{t}=d^{\theta}_{t}\}-\pi^{t}_{0}}{\pi^{t}_{0}}\right)dF(\theta)
+t=1Ti=1t𝔼Bn0i=1tI{Ai=diθ}i=1t(π0i)(πn,biπ0iπn,bi)Q0,mn,bt𝑑F(θ)+op(n1/2)\displaystyle+\sum_{t=1}^{T}\sum_{i=1}^{t}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{i=1}^{t}\left(\pi^{i}_{0}\right)}\left(\frac{\pi^{i}_{n,b}-\pi^{i}_{0}}{\pi^{i}_{n,b}}\right)Q^{t}_{0,m_{n,b}}dF(\theta)+o_{p}(n^{-1/2})
+t=1Ti=1t𝔼Bn0i=1tI{Ai=diθ}i=1t(π0i)(π0iπn,biπn,bi)Q0,mn,bt𝑑F(θ)+op(n1/2)\displaystyle+\sum_{t=1}^{T}\sum_{i=1}^{t}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{i=1}^{t}I\{A_{i}=d^{\theta}_{i}\}}{\prod_{i=1}^{t}\left(\pi^{i}_{0}\right)}\left(\frac{\pi^{i}_{0}-\pi^{i}_{n,b}}{\pi^{i}_{n,b}}\right)Q^{t}_{0,m_{n,b}}dF(\theta)+o_{p}(n^{-1/2})
=op(n1/2)\displaystyle=o_{p}(n^{-1/2})

Hence,

ΨnMR(mn,πn,Qn,mn)Ψ0(mn)=(n0)φm(π0,Q0,m)+op(n1/2)\Psi^{MR}_{n}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m_{n})=(\mathbb{P}_{n}-\mathbb{P}_{0})\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})+o_{p}(n^{-1/2})

11 Proof of Theorem 3

In the calculation below, we show that if we consider \Psi^{MR}_{n}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m^{*}) instead of \Psi^{MR}_{n}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m_{n}), a bias term appears.

ΨnMR\displaystyle\Psi^{MR}_{n} (mn,πn,Qn,mn)Ψ0(m)\displaystyle(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m^{*})
=\displaystyle= ΨnMR(mn,πn,Qn,mn)Ψ0(m)±ΨnMR(m,π0,Q0,m)\displaystyle\Psi^{MR}_{n}(m_{n},\pi_{n},Q_{n,m_{n}})-\Psi_{0}(m^{*})\pm\Psi^{MR}_{n}(m^{*},\pi_{0},Q_{0,m^{*}})
\displaystyle=\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\mathbb{P}_{0}\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})\pm\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})
\displaystyle=(\mathbb{P}_{n}-\mathbb{P}_{0})\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})+\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\left\{\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})\right\}
\displaystyle\pm\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})
\displaystyle=(\mathbb{P}_{n}-\mathbb{P}_{0})\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})+\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\left\{\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})\right\}
\displaystyle+\mathbb{E}_{B_{n}}\mathbb{P}^{1}_{n,b}\left\{\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})-\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})\right\}
=\displaystyle= (n0)φm(π0,Q0,m)+𝔼Bn0{φmn,b(πn,b,Qn,b,mn,b)φmn,b(π0,Q0,mn,b)}\displaystyle(\mathbb{P}_{n}-\mathbb{P}_{0})\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})+\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left\{\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})\right\}
+𝔼Bn0{φmn,b(π0,Q0,mn,b)φm(π0,Q0,m)}\displaystyle+\mathbb{E}_{B_{n}}\mathbb{P}_{0}\{\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})-\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})\}

In the proof of Theorem 2, we have shown that
𝔼Bn0{φmn,b(πn,b,Qn,b,mn,b)φmn,b(π0,Q0,mn,b)}=op(n1/2)\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left\{\varphi_{m_{n,b}}(\pi_{n,b},Q_{n,b,m_{n,b}})-\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})\right\}=o_{p}(n^{-1/2}). We will continue to examine the third term.

𝔼Bn0\displaystyle\mathbb{E}_{B_{n}}\mathbb{P}_{0} [φmn,b(π0,Q0,mn,b)φm(π0,Q0,m)]\displaystyle[\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})-\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})]
\displaystyle=\mathbb{E}_{B_{n}}\int\mathbb{P}_{0}\prod_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})}{\pi^{t}_{0}}\left[\{Y-m_{n,b}(\theta,V)\}^{2}-\{Y-m^{*}(\theta,V)\}^{2}\right]dF(\theta)
\displaystyle=\mathbb{E}_{B_{n}}\int\mathbb{P}_{0}\prod_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})}{\pi^{t}_{0}}\left\{Y^{2}-2Ym_{n,b}(\theta,V)+m_{n,b}^{2}(\theta,V)-Y^{2}+2Ym^{*}(\theta,V)-m^{*}(\theta,V)^{2}\right\}dF(\theta)
\displaystyle=\mathbb{E}_{B_{n}}\int\mathbb{P}_{0}\prod_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})}{\pi^{t}_{0}}\big\{Y^{2}-2m_{0}(\theta,V)m_{n,b}(\theta,V)+m_{n,b}^{2}(\theta,V)-Y^{2}
\displaystyle\hskip 128.0374pt+2m_{0}(\theta,V)m^{*}(\theta,V)-m^{*}(\theta,V)^{2}\big\}dF(\theta)
\displaystyle=\mathbb{E}_{B_{n}}\int\mathbb{P}_{0}\prod_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})}{\pi^{t}_{0}}\left[\{m_{0}(\theta,V)-m_{n,b}(\theta,V)\}^{2}-\{m^{*}(\theta,V)-m_{0}(\theta,V)\}^{2}\right]dF(\theta)
\displaystyle=\mathbb{E}_{B_{n}}\int\mathbb{P}_{0}\prod_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})}{\pi^{t}_{0}}\big[\{m_{0}(\theta,V)-m^{*}(\theta,V)+m^{*}(\theta,V)-m_{n,b}(\theta,V)\}^{2}
\displaystyle\hskip 128.0374pt-\{m^{*}(\theta,V)-m_{0}(\theta,V)\}^{2}\big]dF(\theta)
\displaystyle=\mathbb{E}_{B_{n}}\int\mathbb{P}_{0}\prod_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})}{\pi^{t}_{0}}\big[\{m^{*}(\theta,V)-m_{n,b}(\theta,V)\}^{2}
\displaystyle\hskip 128.0374pt-2\{m^{*}(\theta,V)-m_{0}(\theta,V)\}\{m^{*}(\theta,V)-m_{n,b}(\theta,V)\}\big]dF(\theta)

where m0(θ,V)=E[Y|θ,V]m_{0}(\theta,V)=E[Y|\theta,V] is the true minimizer of the risk function and mn(θ,V)m(θ,V)m_{n}(\theta,V)\xrightarrow{}m^{*}(\theta,V).

Hence 𝔼Bn0[φmn,b(π0,Q0,mn,b)φm(π0,Q0,m)]=op(n1/2)\mathbb{E}_{B_{n}}\mathbb{P}_{0}[\varphi_{m_{n,b}}(\pi_{0},Q_{0,m_{n,b}})-\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})]=o_{p}(n^{-1/2}) if and only if m0(θ,V)=m(θ,V)m_{0}(\theta,V)=m^{*}(\theta,V).
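To see why only the cross term can be first order, note that by iterated expectations the inverse probability weights integrate out: for any fixed function g of (\theta,V),

```latex
\mathbb{P}_{0}\prod_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})}{\pi^{t}_{0}(A_{t}=d^{\theta}_{t}\mid\bar{H}_{t})}\,g(\theta,V)
  =\mathbb{E}\{g(\theta,V)\}.
```

Applied to the last display, the first term is second order in m_{n,b}-m^{*}, while the cross term equals -2\mathbb{E}_{B_{n}}\int\mathbb{E}[\{m^{*}(\theta,V)-m_{0}(\theta,V)\}\{m^{*}(\theta,V)-m_{n,b}(\theta,V)\}]dF(\theta), which is first order in m_{n,b}-m^{*} and hence generally not o_{p}(n^{-1/2}) unless m^{*}=m_{0}.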

12 Proof of Theorem 4

In this section, we show that by undersmoothing the propensity score estimator, the IPW estimator becomes asymptotically linear. We start by considering the difference below.

ΨnUIPW(mn,πn)Ψ0(mn)=\displaystyle\Psi^{UIPW}_{n}(m_{n},\pi_{n})-\Psi_{0}(m_{n})= 𝔼Bn0{Lmn,b(πn,b,λ)Lmn,b(π0)}\displaystyle\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left\{L_{m_{n,b}}\left(\pi_{n,b,\lambda}\right)-L_{m_{n,b}}\left(\pi_{0}\right)\right\}
+𝔼Bn(n,b10)Lmn,b(πn,b,λ)\displaystyle+\mathbb{E}_{B_{n}}(\mathbb{P}^{1}_{n,b}-\mathbb{P}_{0})L_{m_{n,b}}\left(\pi_{n,b,\lambda}\right) (12)

The second term on the right-hand side of (12) can be represented as

\displaystyle\mathbb{E}_{B_{n}}(\mathbb{P}^{1}_{n,b}-\mathbb{P}_{0})L_{m_{n,b}}\left(\pi_{n,b,\lambda}\right)=\mathbb{E}_{B_{n}}(\mathbb{P}^{1}_{n,b}-\mathbb{P}_{0})\left\{L_{m_{n,b}}\left(\pi_{n,b,\lambda}\right)-L_{m^{*}}\left(\pi_{0}\right)\right\}
\displaystyle+\mathbb{E}_{B_{n}}(\mathbb{P}_{n,b}^{1}-\mathbb{P}_{0})L_{m^{*}}\left(\pi_{0}\right) (13)
=\displaystyle= 𝔼Bn(n0)Lm(π0)+op(n1/2)\displaystyle\mathbb{E}_{B_{n}}(\mathbb{P}_{n}-\mathbb{P}_{0})L_{m^{*}}\left(\pi_{0}\right)+o_{p}(n^{-1/2}) (14)

The only remaining term to study is the first term in (12).

𝔼Bn\displaystyle\mathbb{E}_{B_{n}} 0{Lmn,b(πn,b,λ)Lmn,b(π0)}\displaystyle\mathbb{P}_{0}\left\{L_{m_{n,b}}\left(\pi_{n,b,\lambda}\right)-L_{m_{n,b}}\left(\pi_{0}\right)\right\}
=\displaystyle= 𝔼Bn0{Lmn,b(πn,b,λ)Lmn,b(π0)}𝔼Bn0{Lm(πn,b,λ)Lm(π0)}\displaystyle\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left\{L_{m_{n,b}}\left(\pi_{n,b,\lambda}\right)-L_{m_{n,b}}\left(\pi_{0}\right)\right\}-\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left\{L_{m^{*}}\left(\pi_{n,b,\lambda}\right)-L_{m^{*}}\left(\pi_{0}\right)\right\}
+𝔼Bn0{Lm(πn,b,λ)Lm(π0)}\displaystyle+\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left\{L_{m^{*}}\left(\pi_{n,b,\lambda}\right)-L_{m^{*}}\left(\pi_{0}\right)\right\}

The terms in the first integral can be written as

𝔼Bn0{Lmn,b(πn,b,λ)Lmn,b(π0)}=𝔼Bn𝔼{(Ymn,b(θ,V))2H¯T}π0(π0πn,b,λπn,b,λπ0)𝑑F(θ)𝑑P(o)\displaystyle\mathbb{E}_{B_{n}}\mathbb{P}_{0}\{L_{m_{n,b}}\left(\pi_{n,b,\lambda}\right)-L_{m_{n,b}}\left(\pi_{0}\right)\}=\mathbb{E}_{B_{n}}\int\int\mathbb{E}\left\{\left(Y-m_{n,b}(\theta,V)\right)^{2}\mid\bar{H}_{T}\right\}\pi_{0}\left(\frac{\pi_{0}-\pi_{n,b,\lambda}}{\pi_{n,b,\lambda}\pi_{0}}\right)dF(\theta)dP\left(o\right)
𝔼Bn0{Lm(πn,b,λ)Lm(π0)}=𝔼Bn𝔼{(Ym(θ,V))2H¯T}π0(π0πn,b,λπn,b,λπ0)𝑑F(θ)𝑑P(o).\displaystyle\mathbb{E}_{B_{n}}\mathbb{P}_{0}\{L_{m^{*}}\left(\pi_{n,b,\lambda}\right)-L_{m^{*}}\left(\pi_{0}\right)\}=\mathbb{E}_{B_{n}}\int\int\mathbb{E}\left\{\left(Y-m^{*}(\theta,V)\right)^{2}\mid\bar{H}_{T}\right\}\pi_{0}\left(\frac{\pi_{0}-\pi_{n,b,\lambda}}{\pi_{n,b,\lambda}\pi_{0}}\right)dF(\theta)dP\left(o\right).

where π0=t=1Tπ0t\pi_{0}=\prod_{t=1}^{T}\pi_{0}^{t} and πn,b\pi_{n,b} is an estimate of π0\pi_{0} using the bthb^{th} sample split.

Combining these two terms, we have

\displaystyle\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left[\left\{L_{m_{n,b}}\left(\pi_{n,b,\lambda}\right)-L_{m_{n,b}}\left(\pi_{0}\right)\right\}-\left\{L_{m^{*}}\left(\pi_{n,b,\lambda}\right)-L_{m^{*}}\left(\pi_{0}\right)\right\}\right]=
[{m(θ,V)mn,b(θ,V)}{2Ymn,b(θ,V)m(θ,V)}]π0(π0πn,b,λπn,b,λπ0)𝑑F(θ)𝑑P(o)=op(n1/2).\displaystyle\int\int\left[\{m^{*}(\theta,V)-m_{n,b}(\theta,V)\}\{2Y-m_{n,b}(\theta,V)-m^{*}(\theta,V)\}\right]\pi_{0}\left(\frac{\pi_{0}-\pi_{n,b,\lambda}}{\pi_{n,b,\lambda}\pi_{0}}\right)dF(\theta)dP\left(o\right)=o_{p}\left(n^{-1/2}\right).

The last equality applies the Cauchy–Schwarz inequality and the assumption that \int\|\pi_{n}-\pi_{0}\|_{2}\cdot\|m_{n}-m^{*}\|_{2}\,dF(\theta)=o_{p}(n^{-1/2}). Gathering the relevant terms, we have

ΨnUIPW(mn,πn)Ψ0(mn)=\displaystyle\Psi^{UIPW}_{n}(m_{n},\pi_{n})-\Psi_{0}(m_{n})= 1ni=1n{Lm(π0)Ψ0(m)}\displaystyle\frac{1}{n}\sum_{i=1}^{n}\left\{L_{m^{*}}\left(\pi_{0}\right)-\Psi_{0}(m^{*})\right\}
\displaystyle+\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left\{L_{m^{*}}\left(\pi_{n,b,\lambda}\right)-L_{m^{*}}\left(\pi_{0}\right)\right\}+o_{p}\left(n^{-1/2}\right).

We now consider the second term on the right-hand side.

𝔼Bn0{Lm(πn,b,λ)Lm(π0)}\displaystyle\mathbb{E}_{B_{n}}\mathbb{P}_{0}\left\{L_{m^{*}}\left(\pi_{n,b,\lambda}\right)-L_{m^{*}}\left(\pi_{0}\right)\right\}
=𝔼Bn0t=1TI(At=dtθ){Ym(θ,V)}2πn,b,λtdF(θ)0t=1TI(At=dtθ){Ym(θ,V)}2π0tdF(θ)\displaystyle=\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\prod_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})\{Y-m^{*}(\theta,V)\}^{2}}{\pi^{t}_{n,b,\lambda}}dF(\theta)-\mathbb{P}_{0}\int\prod_{t=1}^{T}\frac{I(A_{t}=d^{\theta}_{t})\{Y-m^{*}(\theta,V)\}^{2}}{\pi^{t}_{0}}dF(\theta)
\displaystyle\pm\sum_{k=2}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{t=1}^{T}I(A_{t}=d^{\theta}_{t})\{Y-m^{*}(\theta,V)\}^{2}}{\prod_{t=1}^{k-1}(\pi^{t}_{n,b,\lambda})\prod_{t=k}^{T}(\pi^{t}_{0})}dF(\theta)
=k=1T𝔼Bn0t=1TI(At=dtθ){Ym(θ,V)}2t=1k(πn,b,λt)t=k+1T(π0t)𝑑F(θ)\displaystyle=\sum_{k=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{t=1}^{T}I(A_{t}=d_{t}^{\theta})\{Y-m^{*}(\theta,V)\}^{2}}{\prod_{t=1}^{k}(\pi^{t}_{n,b,\lambda})\prod_{t=k+1}^{T}(\pi^{t}_{0})}dF(\theta)
\displaystyle=\sum_{k=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{t=1}^{k-1}I(A_{t}=d^{\theta}_{t})Q^{k}_{0,m^{*}}}{\prod_{t=1}^{k-1}(\pi^{t}_{n,b,\lambda})}\frac{\pi^{k}_{0}(\pi^{k}_{0}-\pi^{k}_{n,b,\lambda})}{\pi^{k}_{0}\pi^{k}_{n,b,\lambda}}dF(\theta)
\displaystyle=\sum_{k=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{t=1}^{k-1}I(A_{t}=d^{\theta}_{t})Q^{k}_{0,m^{*}}}{\prod_{t=1}^{k-1}(\pi^{t}_{n,b,\lambda})}\frac{\pi^{k}_{0}-\pi^{k}_{n,b,\lambda}}{\pi^{k}_{0}}-\frac{\prod_{t=1}^{k-1}I(A_{t}=d^{\theta}_{t})Q^{k}_{0,m^{*}}}{\prod_{t=1}^{k-1}(\pi^{t}_{n,b,\lambda})}\frac{(\pi^{k}_{0}-\pi^{k}_{n,b,\lambda})^{2}}{\pi^{k}_{0}\pi^{k}_{n,b,\lambda}}dF(\theta)
=k=1T𝔼Bn0t=1k1I(At=dtθ)Q0,mkt=1k1(πn,b,λt)π0kπn,b,λkπ0k+op(n1/2)\displaystyle=\sum_{k=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{t=1}^{k-1}I(A_{t}=d^{\theta}_{t})Q^{k}_{0,m^{*}}}{\prod_{t=1}^{k-1}(\pi^{t}_{n,b,\lambda})}\frac{\pi^{k}_{0}-\pi^{k}_{n,b,\lambda}}{\pi^{k}_{0}}+o_{p}(n^{-1/2})
=k=1T𝔼Bn0t=1k1I(At=dtθ)Q0,mkt=1k1(πn,b,λt)I(Ak=dkθ)πn,b,λkπ0k+op(n1/2).\displaystyle=\sum_{k=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{0}\int\frac{\prod_{t=1}^{k-1}I(A_{t}=d^{\theta}_{t})Q^{k}_{0,m^{*}}}{\prod_{t=1}^{k-1}(\pi^{t}_{n,b,\lambda})}\frac{I(A_{k}=d^{\theta}_{k})-\pi^{k}_{n,b,\lambda}}{\pi^{k}_{0}}+o_{p}(n^{-1/2}). (15)

Let

DCARk(Q0,π0,πn,b,λ)=t=1k1I(At=dtθ)Q0,mkt=1k1πn,b,λtI(Ak=dkθ)πn,b,λkπ0k\displaystyle D_{CAR}^{k}(Q_{0},\pi_{0},\pi_{n,b,\lambda})=\int\frac{\prod_{t=1}^{k-1}I(A_{t}=d^{\theta}_{t})Q^{k}_{0,m^{*}}}{\prod_{t=1}^{k-1}\pi^{t}_{n,b,\lambda}}\frac{I(A_{k}=d^{\theta}_{k})-\pi^{k}_{n,b,\lambda}}{\pi^{k}_{0}} (16)
D_{CAR}^{k}(Q_{0},\pi_{0})=\int\frac{\prod_{t=1}^{k-1}I(A_{t}=d^{\theta}_{t})Q^{k}_{0,m^{*}}}{\prod_{t=1}^{k-1}\pi^{t}_{0}}\frac{I(A_{k}=d^{\theta}_{k})-\pi^{k}_{0}}{\pi^{k}_{0}} (17)

and DCAR(Q0,π0,πn,b,λ)=k=1TDCARk(Q0,π0,πn,b,λ)D_{CAR}(Q_{0},\pi_{0},\pi_{n,b,\lambda})=\sum_{k=1}^{T}D_{CAR}^{k}(Q_{0},\pi_{0},\pi_{n,b,\lambda}) and DCAR(Q0,π0)=k=1TDCARk(Q0,π0)D_{CAR}(Q_{0},\pi_{0})=\sum_{k=1}^{T}D_{CAR}^{k}(Q_{0},\pi_{0}). We then have

(15)\displaystyle(\ref{eq:dcar1}) =𝔼Bn0DCAR(Q0,π0,πn,b,λ)+op(n1/2)\displaystyle=\mathbb{E}_{B_{n}}\mathbb{P}_{0}D_{CAR}(Q_{0},\pi_{0},\pi_{n,b,\lambda})+o_{p}(n^{-1/2})
=\displaystyle= 𝔼Bn0DCAR(Q0,π0,πn,b,λ)𝔼Bnn,b1DCAR(Q0,π0,πn,b,λ)+𝔼Bnn,b1DCAR(Q0,π0,πn,b,λ)+op(n1/2)\displaystyle\mathbb{E}_{B_{n}}\mathbb{P}_{0}D_{CAR}(Q_{0},\pi_{0},\pi_{n,b,\lambda})-\mathbb{E}_{B_{n}}\mathbb{P}_{n,b}^{1}D_{CAR}(Q_{0},\pi_{0},\pi_{n,b,\lambda})+\mathbb{E}_{B_{n}}\mathbb{P}_{n,b}^{1}D_{CAR}(Q_{0},\pi_{0},\pi_{n,b,\lambda})+o_{p}(n^{-1/2})
=\displaystyle= 𝔼Bn(0n,b1)DCAR(Q0,π0,πn,b,λ)+𝔼Bnn,b1DCAR(Q0,π0,πn,b,λ)+op(n1/2)\displaystyle\mathbb{E}_{B_{n}}(\mathbb{P}_{0}-\mathbb{P}_{n,b}^{1})D_{CAR}(Q_{0},\pi_{0},\pi_{n,b,\lambda})+\mathbb{E}_{B_{n}}\mathbb{P}_{n,b}^{1}D_{CAR}(Q_{0},\pi_{0},\pi_{n,b,\lambda})+o_{p}(n^{-1/2})
=\displaystyle= 𝔼Bn(0n,b1)[DCAR(Q0,π0,πn,b,λ)DCAR(Q0,π0)]+𝔼Bn(0n,b1)DCAR(Q0,π0)\displaystyle\mathbb{E}_{B_{n}}(\mathbb{P}_{0}-\mathbb{P}_{n,b}^{1})[D_{CAR}(Q_{0},\pi_{0},\pi_{n,b,\lambda})-D_{CAR}(Q_{0},\pi_{0})]+\mathbb{E}_{B_{n}}(\mathbb{P}_{0}-\mathbb{P}_{n,b}^{1})D_{CAR}(Q_{0},\pi_{0})
+𝔼Bnn,b1DCAR(Q0,π0,πn,b,λ)+op(n1/2)\displaystyle+\mathbb{E}_{B_{n}}\mathbb{P}_{n,b}^{1}D_{CAR}(Q_{0},\pi_{0},\pi_{n,b,\lambda})+o_{p}(n^{-1/2})
=\displaystyle= (0n)DCAR(Q0,π0)+𝔼Bnn,b1DCAR(Q0,π0,πn,b,λ)+op(n1/2).\displaystyle(\mathbb{P}_{0}-\mathbb{P}_{n})D_{CAR}(Q_{0},\pi_{0})+\mathbb{E}_{B_{n}}\mathbb{P}_{n,b}^{1}D_{CAR}(Q_{0},\pi_{0},\pi_{n,b,\lambda})+o_{p}(n^{-1/2}).
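The centering in D_{CAR}(Q_{0},\pi_{0}) is what makes the leading empirical-process term above well behaved: since \mathbb{E}\{I(A_{k}=d^{\theta}_{k})\mid\bar{H}_{k}\}=\pi^{k}_{0}(A_{k}=d^{\theta}_{k}\mid\bar{H}_{k}), each summand is conditionally mean zero given the history (and the training folds used to construct the fitted nuisances),

```latex
\mathbb{E}\big\{D^{k}_{CAR}(Q_{0},\pi_{0})\mid\bar{H}_{k},\bar{A}_{k-1}\big\}=0,
```

so \mathbb{P}_{0}D_{CAR}(Q_{0},\pi_{0})=0 and (\mathbb{P}_{0}-\mathbb{P}_{n})D_{CAR}(Q_{0},\pi_{0}) is a centered term of order O_{p}(n^{-1/2}).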

Putting everything together, we have

\displaystyle\Psi^{UIPW}_{n}(m_{n},\pi_{n})-\Psi_{0}(m_{n})= (\mathbb{P}_{n}-\mathbb{P}_{0})\{L_{m^{*}}(\pi_{0})-D_{CAR}(Q_{0},\pi_{0})\}
+k=1T𝔼Bnn,b1DCARk(Q0,π0,πb,λopt)+op(n1/2),\displaystyle+\sum_{k=1}^{T}\mathbb{E}_{B_{n}}\mathbb{P}_{n,b}^{1}D^{k}_{CAR}(Q_{0},\pi_{0},\pi_{b,\lambda_{opt}})+o_{p}(n^{-1/2}), (18)
=\displaystyle= (n0)φm(π0,Q0,m)\displaystyle(\mathbb{P}_{n}-\mathbb{P}_{0})\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}}) (19)

so it remains to show the second term on the right-hand side is asymptotically negligible.

For any given kk and any given collection of folds, we have

n,b1DCARk(Q0,π0,πb,λopt)\displaystyle\mathbb{P}_{n,b}^{1}D^{k}_{CAR}(Q_{0},\pi_{0},\pi_{b,\lambda_{opt}}) =n,b1t=1k1I(At=dtθ)Q0,mkt=1k1(πb,λoptt)I(Ak=dkθ)πb,λoptkπ0k\displaystyle=\mathbb{P}_{n,b}^{1}\int\frac{\prod_{t=1}^{k-1}I(A_{t}=d^{\theta}_{t})Q^{k}_{0,m^{*}}}{\prod_{t=1}^{k-1}(\pi^{t}_{b,\lambda_{opt}})}\frac{I(A_{k}=d^{\theta}_{k})-\pi^{k}_{b,\lambda_{opt}}}{\pi^{k}_{0}}
=op(n1/2),\displaystyle=o_{p}(n^{-1/2}),

by applying Lemma 1 of Ertefaie et al. [2023], which relies on our assumptions (their Assumptions 1 and 2 and their Equation 4, applied at each k). The conclusion that

\[
\Psi^{UIPW}_{n}(m_{n},\pi_{n,\lambda_{opt}})-\Psi_{0}(m_{n})=(\mathbb{P}_{n}-\mathbb{P}_{0})\varphi_{m^{*}}(\pi_{0},Q_{0,m^{*}})+o_{p}(n^{-1/2})
\]

then follows by summing over $k$.

13 Additional Simulations

We consider the case where the treatment $A_t$ is randomized (Figure 3). Figure 3 shows that both our proposed estimators and the IPW estimator are asymptotically linear; however, the IPW estimator is not efficient, whereas our proposed estimator is. The coverage of our proposed estimators approaches 95% as the sample size increases, while the coverage of the IPW estimator decreases to roughly 80%.
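To make the randomized-treatment setting concrete, the following is a minimal sketch of an IPW estimate of a counterfactual mean outcome under a decision rule, with a known propensity of 0.5. The data-generating process, sample size, and rule $d(X)=I(X>0)$ are hypothetical illustrations, not the estimators or scenarios of our simulation study.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process with a randomized binary treatment.
X = rng.normal(size=n)                      # baseline covariate
A = rng.binomial(1, 0.5, size=n)            # randomized treatment, known propensity 0.5
Y = 1.0 + 0.5 * X + A * (1.0 + X) + rng.normal(size=n)

# IPW estimate of the mean outcome had everyone followed the rule d(X) = I(X > 0):
# weight each subject whose observed treatment agrees with the rule by 1/0.5.
d = (X > 0).astype(int)
w = (A == d) / 0.5
psi_ipw = np.mean(w * Y)
print(psi_ipw)
```

Because the propensity is known and constant, the weights are exact; in the observational setting of the main text they must be estimated, which is where the efficiency comparison above becomes relevant.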

Figure 3: Randomized treatment: scaled bias and coverage plots

We also present simulation scenarios where the marginal structural model is estimated using

  1. a linear model with a second-order term (i.e., $E[Y\mid\theta]=\beta_{0}+\beta_{1}\theta+\beta_{2}\theta^{2}$) (Figure 4),

  2. a random forest (Figure 5), and

  3. the highly adaptive lasso (Figure 6).
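As an illustration of fitting such working models for the regimen-response curve, here is a hedged sketch. The simulated pairs $(\theta_i, Y_i)$ and all model settings are hypothetical; scikit-learn's `PolynomialFeatures`/`LinearRegression` and `RandomForestRegressor` serve as stand-ins for the first two scenarios (the highly adaptive lasso, which requires a dedicated implementation, is omitted).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
theta = rng.uniform(-1, 1, size=(1000, 1))                       # hypothetical regimen index
y = 0.5 - theta[:, 0] ** 2 + rng.normal(scale=0.3, size=1000)    # hypothetical outcomes

# (1) Linear working model with a second-order term:
#     E[Y | theta] = b0 + b1*theta + b2*theta^2.
quad = PolynomialFeatures(degree=2, include_bias=False)
lm = LinearRegression().fit(quad.fit_transform(theta), y)

# (2) Random forest working model.
rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(theta, y)

# Evaluate both fitted regimen-response curves on a grid of theta values.
grid = np.linspace(-1, 1, 5).reshape(-1, 1)
print(lm.predict(quad.transform(grid)))
print(rf.predict(grid))
```

Under this toy curve, the quadratic model is correctly specified while the random forest fits the same shape nonparametrically, mirroring the parametric-versus-nonparametric contrast in the scenarios above.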

While our proposed estimators may display some variability in scaled bias at small sample sizes when the marginal structural model is estimated using nonparametric methods, the bias exhibits a downward trend as the sample size increases, suggesting the asymptotic linearity property. In contrast, the scaled bias of the inverse probability weighting (IPW) estimator shows an upward trend, indicating potential bias when the propensity model is estimated using nonparametric methods. Notably, the UIPW-score estimator outperforms the UIPW-Dcar estimator in various scenarios: its scaled bias is consistently smaller than that of UIPW-Dcar and, in some instances, even smaller than that of the MR estimator. Furthermore, its coverage is closer to 0.95 than that of UIPW-Dcar at smaller sample sizes, especially when the marginal structural model is estimated using nonparametric methods.

Figure 4: The marginal structural model is estimated using linear regression with a second-order term
Figure 5: The marginal structural model is estimated using a random forest
Figure 6: The marginal structural model is estimated using the highly adaptive lasso