This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Individualized treatment rules
under stochastic treatment cost constraints

Hongxiang Qiu Department of Statistics, the Wharton School, University of Pennsylvania Marco Carone Department of Biostatistics, University of Washington Alex Luedtke Department of Statistics, University of Washington
Abstract

Estimation and evaluation of individualized treatment rules have been studied extensively, but real-world treatment resource constraints have received limited attention in existing methods. We investigate a setting in which treatment is intervened upon based on covariates to optimize the mean counterfactual outcome under treatment cost constraints when the treatment cost is random. In a particularly interesting special case, an instrumental variable corresponding to encouragement to treatment is intervened upon with constraints on the proportion receiving treatment. For such settings, we first develop a method to estimate optimal individualized treatment rules. We further construct an asymptotically efficient plug-in estimator of the corresponding average treatment effect relative to a given reference rule.

1 Introduction

The effect of a treatment often varies across subgroups of the population [38, 52]. When such differences are clinically meaningful, it may be beneficial to assign treatments strategically depending on subgroup membership. Such treatment assignment mechanisms are called individualized treatment rules (ITRs). A treatment rule is commonly evaluated on the basis of the mean counterfactual outcome value it generates — what is often referred to as the treatment rule’s value — and an ITR with an optimal value is called an optimal ITR. There is an extensive literature on estimation of optimal ITRs and their corresponding values using data from randomized trials or observational studies [6, 21, 25, 37, 54].

Most existing approaches for estimating ITRs do not incorporate real-world resource constraints. Without such constraints, an optimal ITR would assign the treatment to members of a subgroup provided there is any benefit for such individuals, even when this benefit is minute. In contrast, under treatment resource limits, it may be more advantageous to reserve treatment for subgroups with the greatest benefit from treatment. This issue has received attention in recent work. Luedtke and van der Laan developed methods for estimation and evaluation of optimal ITRs with a constraint on the proportion receiving treatment [20]. Qiu et al. instead considered related problems in settings in which instrumental variables (IVs) are available [34]. In one of the settings they considered, the same resource constraint is imposed as in Luedtke and van der Laan [20] but a binary IV is used to identify optimal ITRs even in settings in which there may be unmeasured confounders. In another setting considered in Qiu et al. [34], the authors considered interventions on a causal IV or encouragement status, and developed methods to estimate individualized encouragement (rather than treatment) rules with a constraint on the proportion receiving both encouragement and treatment [33]. They also developed nonparametrically efficient estimators of the average causal effect of optimal rules relative to a prespecified reference rule. Sun et al. considered a setting in which the cost of treatment is random and dependent upon baseline covariates. They developed methods to estimate optimal ITRs under a constraint on the expected additional treatment cost as compared to control, though inference on the impact of implementing the optimal ITR in the population was not studied [41]. Sun [42] considered a related problem involving the development of optimal ITRs under resource constraints, and established the asymptotic properties of the estimated optimal ITR. Their method appears viable when the class of ITRs is restricted by the user a priori.

In this paper, we study estimation and inference for an optimal rule under two different cost constraints. The first is the same as appearing in Sun et al. [41]. In contrast to earlier work on this setting, we do not constrain the class of ITRs considered and provide a means to obtain inference about the optimal ITR. The second constraint we consider places a cap on the total cost under the rule rather than on the incremental cost relative to control. To our knowledge, the latter problem has not previously been considered in the literature. Both of these estimation problems mirror the intervention-on-encouragement setting considered in Qiu et al. [34] but involve different constraints and a more general cost function.

Similarly as in Qiu et al. [34], the estimators that we develop are asymptotically efficient within a nonparametric model and enable the construction of asymptotically valid confidence intervals for the impact of implementing the optimal rule. We develop our estimators using similar tools — such as semiparametric efficiency theory [31, 51] and targeted minimum loss-based estimation (TMLE) [45, 49] — as were used to tackle the related problem studied in Qiu et al. [34]. Consequently, our proposed estimators are similar to that in Qiu et al. [34]. Therefore, we will streamline the presentation by highlighting the key similarities and focusing on the differences between these related problems and estimation schemes.

The rest of this paper is organized as follows. In Section 2, we describe the problem setup, introduce notation, and present the causal estimands along with basic causal conditions. In Section 3, we present additional causal conditions and the corresponding nonparametric identification results. In Section 4, we present our proposed estimators and their theoretical properties. In Section 5, we present a simulation illustrating the performance of our proposed estimators. We make concluding remarks in Section 6. Proofs, technical conditions, and additional simulation results can be found in the Supplementary Material.

2 Setup and objectives

To facilitate comparisons with Qiu et al. [34], we adopt similar notation as in that work. Suppose that we observe independent and identically distributed data units O1,O2,,OnP0O_{1},O_{2},\ldots,O_{n}\sim P_{0}, where P0P_{0} is an unknown sampling distribution. A prototypical data unit OO consists of the quadruplet (W,T,C,Y)(W,T,C,Y), where W𝒲pW\in\mathscr{W}\subseteq\mathbb{R}^{p} is the vector of baseline covariates, T{0,1}T\in\{0,1\} is the treatment status, C[0,)C\in[0,\infty) is the random treatment cost, and YY\in is the outcome of interest. As a convention, we assume that larger values of YY are preferable. We use V=V(W)𝒱V=V(W)\in\mathcal{V} to denote a fixed transformation of WW upon which we allow treatment decisions to depend. For example, VV may be a subset of covariates in WW or a summary of WW (e.g., BMI as a summary of height and weight). In practice, VV may be chosen based on prior knowledge on potential modifiers of the treatment effect as well as the cost of measuring various covariates. We distinguish between V(W)V(W) and WW because of their different roles. On the one hand, we will assume that the full covariate WW contains all confounders and thus is used to identify causal effects, while V(W)V(W) might not be sufficient for this purpose. On the other hand, some covariates in WW may be expensive or difficult to measure in future applications, and thus implementing an optimal ITR based on a subset V(W)V(W) of covariate WW may be desirable. In the rest of this paper, we will use the shorthand notation VV, ViV_{i} and vv to refer to V(W)V(W), V(Wi)V(W_{i}) and V(w)V(w), respectively. We define an individualized (stochastic) treatment rule (ITR) to be a function ρ:𝒱[0,1]{\rho}:\mathcal{V}\rightarrow[0,1] that prescribes treatment with probability ρ(v){\rho}(v) according to an exogenous source of randomness for an individual with covariate value vv. Any stochastic ITR that only takes values in {0,1}\{0,1\} is referred to as a deterministic ITR.

In this work, we adopt the potential outcomes framework [27, 39]. For each individual, we use C(t)C(t) and Y(t)Y(t) to denote the potential treatment cost and potential outcome, respectively, corresponding to scenarios in which the individual has treatment status tt. We use 𝔼\operatorname{\mathbb{E}} to denote an expectation over the counterfactual observations and the exogenous random mechanism defining a rule, and E0\operatorname{E}_{0} to denote an expectation over observables alone under sampling from P0P_{0}. We make the usual Stable Unit Treatment Value Assumption (SUTVA) assumption.

Condition A1 (Stable Unit Treatment Value Assumption).

The counterfactual data unit of one individual is unaffected by the treatment assigned to other individuals, and there is only a single version of the treatment, so that T=tT=t implies that C=C(t)C=C(t) and Y=Y(t)Y=Y(t).

Remark 1.

The ITRs we consider are not truly individualized, because they are based on the value of covariate VV rather than each individual’s unique potential treatment effects Y(1)Y(0)Y(1)-Y(0) and C(1)C(0)C(1)-C(0). Nevertheless, depending on the resolution of VV, these ITRs can be considerably more individualized than assigning everyone to either treatment or control. In this paper, we adopt the conventional nomenclature and refer to the treatment rules we study as ITRs [see, e.g., 5, 7, 14, 18, 19, 29, 32, 40, 47, 54, 55, 57].

We define C(ρ)C({\rho}) and Y(ρ)Y({\rho}) to be the counterfactual treatment cost and outcome, respectively, for an ITR ρ{\rho} under an exogenous random mechanism. We note that if ρ(v)(0,1){\rho}(v)\in(0,1) for an individual with covariate vv, an exogenous random mechanism is used to randomly assign treatment with probability ρ(v){\rho}(v) and thus C(ρ)C({\rho}) and Y(ρ)Y({\rho}) are random for this given individual. If ρ{\rho} were implemented in the population, then the population mean outcome would be 𝔼[Y(ρ)]\operatorname{\mathbb{E}}\left[Y({\rho})\right], where we use 𝔼\operatorname{\mathbb{E}} to denote expectation under the true data-generating mechanism involving potential outcomes (C(t),Y(t))(C(t),Y(t)) and exogenous randomness in ρ\rho. We consider a generic treatment resource constraint requiring that a convex combination of the population average treatment cost and the population average additional treatment cost compared to control be no greater than a specified constant κ(0,]\kappa\in(0,\infty]. Consequently, an optimal ITR ρ0{\rho}_{0} under this constraint is a solution in ρ:𝒱[0,1]{\rho}:\mathcal{V}\rightarrow[0,1] to

maximize 𝔼[Y(ρ)]subject toα𝔼[C(ρ)]+(1α)𝔼[C(ρ)C(0)]κ.\displaystyle\operatorname{\mathbb{E}}[Y({\rho})]\hskip 15.00002pt\text{subject to}\hskip 15.00002pt\alpha\operatorname{\mathbb{E}}[C({\rho})]+(1-\alpha)\operatorname{\mathbb{E}}[C({\rho})-C(0)]\leq\kappa\ . (1)

Here, α[0,1]\alpha\in[0,1] is also a constant specified by the investigator. Natural choices of α\alpha are α=0\alpha=0, corresponding to a constraint on the population average additional treatment cost compared to control, and α=1\alpha=1, corresponding to a constraint on the population average treatment cost. The first choice may be preferred when the control treatment corresponds to the current standard of care and a limited budget is available to fund the novel treatment to some patients. The second choice may be more relevant when both treatment and control incur treatment costs.

Remark 2.

Our setup is similar to that in Qiu et al. [34] if we view TT and CC defined here as the instrumental variable/encouragement ZZ and treatment status AA defined in those prior works, respectively. However, the constraint in our setup is different from the constraint 𝔼[ρ(V)C(ρ)]κ\operatorname{\mathbb{E}}[{\rho}(V)C({\rho})]\leq\kappa considered previously. In IV settings, the constraint in (1) with α=1\alpha=1 is useful when assigning treatment always incurs a cost, regardless of whether encouragement is applied, such as in distributing a limited supply of an expensive drug within a health system based on the results of a randomized clinical trial. It is instead useful with α=0\alpha=0 when no encouragement is present under the standard of care but intervention on the encouragement is of interest when additional treatment resources are available. The constraint considered in Qiu et al. [34, 33] was instead useful in cases in which treatment only incurs a cost when paired with encouragement, such as when housing vouchers are used to encourage individuals to live in a certain area. In the general setting in which TT is viewed as treatment status and CC as a random treatment cost, the constraint in (1) with α=0\alpha=0 is identical to that considered in Sun et al. [41] — we refer the readers to these works for a more in-depth discussion of the relation between the current problem setup and IV settings.

To evaluate an optimal ITR ρ0{\rho}_{0}, we follow Qiu et al. [34] in considering three types of reference ITRs and develop methods for statistical inference on the difference in the mean counterfactual outcome between ρ0{\rho}_{0} and a reference ITR ρ0:𝒱[0,1]{\rho}^{\mathcal{R}}_{0}:\mathcal{V}\rightarrow[0,1]. The first type of reference ITR considered, denoted by ρFR{\rho}^{\mathrm{FR}} (FR{\mathrm{FR}}=fixed rule), is any fixed ITR that may be specified by the investigator before the study. When α=0\alpha=0, it is usually most reasonable to consider the rule that always assigns control, namely v0v\mapsto 0, because the constraint in (1) may arise due to limited funding for implementing treatment whereas the standard of care rule is to always assign control. The second type, denoted by ρ0RD{\rho}^{\mathrm{RD}}_{0} (RD{\mathrm{RD}}=random), prescribes treatment completely at random to individuals regardless of their baseline covariates. The probability of prescribing treatment is chosen such that the treatment resource is saturated (i.e., all available resources are used) or all individuals receive treatment, if such a probability exists. Symbolically, this ITR is given by ρ0RD:vmin{1,(κα𝔼[C(0)])/𝔼[C(1)C(0)]}{\rho}^{\mathrm{RD}}_{0}:v\mapsto\min\left\{1,(\kappa-\alpha\operatorname{\mathbb{E}}\left[C(0)\right])/\operatorname{\mathbb{E}}\left[C(1)-C(0)\right]\right\} under the condition that 𝔼[C(0)]κ\operatorname{\mathbb{E}}\left[C(0)\right]\leq\kappa and 𝔼[C(1)C(0)]>0\operatorname{\mathbb{E}}[C(1)-C(0)]>0. Although ρ0RD{\rho}^{\mathrm{RD}}_{0} has the same interpretation as the corresponding encouragement rule in Qiu et al. [34], its mathematical expression is different due to the different resource constraint. This rule may be of interest if it is known a priori that treatment is harmless. The third type, denoted by ρ0TP{\rho}^{\mathrm{TP}}_{0} (TP{\mathrm{TP}}=true propensity), prescribes treatment according to the true propensity of the treatment implied by the study sampling mechanism P0P_{0}, so that ρ0TP{\rho}^{\mathrm{TP}}_{0} equals wP0(T=1W=w)w\mapsto P_{0}\left(T=1\mid W=w\right). This ITR may be of interest in two settings. In one setting, ρ0TP{\rho}^{\mathrm{TP}}_{0} satisfies the treatment resource constraint. The investigator may wish to determine the extent to which the implementation of an optimal ITR would improve upon the standard of care. In the other setting, the treatment resource constraint is newly introduced and the standard of care ITR may lead to overuse of treatment resources. The investigator may then be interested in whether the implementation of an optimal constrained ITR would result, despite the new resource constraint, in a noninferior mean outcome.

3 Identification of causal estimands

In this section, we present nonparametric identification results. Though these results are similar to those for individualized encouragement rules in Qiu et al. [34], there are two key differences. First, the form of some of the conditions in Qiu et al. [34] need to be modified to account for the novel resource constraint considered here. Second, two additional conditions are needed to overcome challenges that arise due to this new constraint.

We first introduce notation that will be useful when presenting our identification results and our proposed estimators. For any observed-data distribution PP, we define pointwise the conditional mean functions μPC(t,w):=EP(CT=t,W=w)\mu^{C}_{P}(t,w):=\operatorname{E}_{P}(C\mid T=t,W=w) and μPY(t,w):=EP(YT=t,W=w)\mu^{Y}_{P}(t,w):=\operatorname{E}_{P}(Y\mid T=t,W=w), where we use EP\operatorname{E}_{P} to denote an expectation over observables alone under sampling from PP, and their corresponding contrasts due to different treatment status, ΔPC(w):=μPC(1,w)μPC(0,w)\Delta_{P}^{C}(w):=\mu^{C}_{P}(1,w)-\mu^{C}_{P}(0,w) and ΔPY(w):=μPY(1,w)μPY(0,w)\Delta_{P}^{Y}(w):=\mu^{Y}_{P}(1,w)-\mu^{Y}_{P}(0,w). We also define the average of these contrasts conditional on VV as δPC(v):=EP[ΔPC(W)V=v]\delta^{C}_{P}(v):=\operatorname{E}_{P}[\Delta^{C}_{P}(W)\mid V=v] and δPY(v):=EP[ΔPY(W)V=v]\delta^{Y}_{P}(v):=\operatorname{E}_{P}[\Delta^{Y}_{P}(W)\mid V=v], and the propensity to receive treatment μPT(w):=P(T=1W=w)\mu^{T}_{P}(w):=P\left(T=1\mid W=w\right). Additionally, we define νP(t,v):=EP[EP(CT=t,W)V=v]\nu_{P}(t,v):=\operatorname{E}_{P}\left[\operatorname{E}_{P}\left(C\mid T=t,W\right)\mid V=v\right], ϕP:=EP[μPC(0,W)]\phi_{P}:=\operatorname{E}_{P}[\mu^{C}_{P}(0,W)]. These quantities play an important role in tackling the problem at hand. Throughout the paper, for ease of notation, if fPf_{P} is a quantity or operation indexed by distribution PP, we may denote fP0f_{P_{0}} by f0f_{0}. As an example, we may use Δ0Y\Delta^{Y}_{0} to denote ΔP0Y\Delta^{Y}_{P_{0}}.

We introduce additional causal conditions we will require, positivity and unconfoundedness. In one form or another, these conditions commonly appear in the causal inference literature [49], including in the IV literature [1, 15, 43, 53].

Condition A2 (Strong positivity).

There exists a constant ϵT>0\epsilon_{T}>0 such that ϵT<μ0T(w)<1ϵT\epsilon_{T}<\mu^{T}_{0}(w)<1-\epsilon_{T} holds for P0P_{0}-almost every ww.

Condition A3 (Unconfoundedness of treatment).

For each t{0,1}t\in\{0,1\}, TT and (C(t),Y(t))(C(t),Y(t)) are conditionally independent given W=wW=w for P0P_{0}-almost every ww.

Equipped with these conditions, we are able to state a theorem on the nonparametric identification of the mean counterfactual outcomes and average treatment effect (ATE) — these results can be viewed as a corollary of the well-known G-formula [36].

Theorem 1 (Identification of ATE and expected treatment resource expenditure).

Provided Conditions A1A3 are satisfied, it holds that 𝔼[Y(t)W=w]=μ0Y(t,w)\operatorname{\mathbb{E}}[Y(t)\mid W=w]=\mu^{Y}_{0}(t,w), 𝔼[Y(1)Y(0)W=w]=Δ0Y(w)\operatorname{\mathbb{E}}[Y(1)-Y(0)\mid W=w]=\Delta^{Y}_{0}(w) and 𝔼[Y(1)Y(0)V=v]=δ0Y(v)\operatorname{\mathbb{E}}[Y(1)-Y(0)\mid V=v]=\delta^{Y}_{0}(v) for P0P_{0}-almost every ww and vv, and so, 𝔼[Y(ρ)Y(ρ0)]=E0[{ρ(V)ρ0(W)}Δ0Y(W)]\operatorname{\mathbb{E}}[Y({\rho})-Y({\rho}^{\mathcal{R}}_{0})]=\operatorname{E}_{0}[\{{\rho}(V)-{\rho}^{\mathcal{R}}_{0}(W)\}\Delta^{Y}_{0}(W)]. In addition, it holds that 𝔼[C(t)W=w]=μ0C(t,w)\operatorname{\mathbb{E}}[C(t)\mid W=w]=\mu^{C}_{0}(t,w) for P0P_{0}-almost every ww, and so, 𝔼[C(ρ)]=E0[ρ(V)μ0C(1,W)+(1ρ(V))μ0C(0,W)]\operatorname{\mathbb{E}}[C({\rho})]=\operatorname{E}_{0}[{\rho}(V)\mu^{C}_{0}(1,W)+(1-{\rho}(V))\mu^{C}_{0}(0,W)].

In view of Theorem 1, the objective function in (1) can be identified as

𝔼[Y(ρ)]=E0[ρ(V)μ0Y(1,W)+(1ρ(V))μ0Y(0,W)]=E0[ρ(V)Δ0Y(W)]+E0[μ0Y(0,W)],\displaystyle\operatorname{\mathbb{E}}\left[Y({\rho})\right]=\operatorname{E}_{0}\left[{\rho}(V)\mu^{Y}_{0}(1,W)+(1-{\rho}(V))\mu^{Y}_{0}(0,W)\right]=\operatorname{E}_{0}\left[{\rho}(V)\Delta_{0}^{Y}(W)\right]+\operatorname{E}_{0}[\mu^{Y}_{0}(0,W)]\ ,

and, similarly, the expected cost is identified as 𝔼[C(ρ)]=E0[ρ(V)Δ0C(W)]+E0[μ0C(0,W)]\operatorname{\mathbb{E}}\left[C({\rho})\right]=\operatorname{E}_{0}\left[{\rho}(V)\Delta_{0}^{C}(W)\right]+\operatorname{E}_{0}[\mu^{C}_{0}(0,W)]. It follows that the optimization problem (1) is equivalent to

maximize E0[ρ(V)δ0Y(V)]subject toE0[ρ(V)δ0C(V)]+αϕ0κ.\displaystyle\operatorname{E}_{0}\left[{\rho}(V)\delta_{0}^{Y}(V)\right]\hskip 15.00002pt\text{subject to}\hskip 15.00002pt\operatorname{E}_{0}\left[{\rho}(V)\delta_{0}^{C}(V)\right]+\alpha\phi_{0}\leq\kappa\ . (2)

This differs from Equation 3 defining optimal individualized encouragement rules in Qiu et al. [34]. We now present two additional conditions so that (2) is a fractional knapsack problem [9], thereby allowing us to use existing results from the optimization literature. These conditions are similar to those in Sun et al. [41].

Condition A4 (Strictly costlier treatment).

There exists a constant ϵC>0\epsilon_{C}>0 such that Δ0C(w)>ϵC\Delta^{C}_{0}(w)>\epsilon_{C} holds for P0P_{0}-almost every ww.

Condition A5 (Financial feasibility of assigning treatment).

The inequality αϕ0<κ\alpha\phi_{0}<\kappa holds.

Condition A4 is reasonable if treatment is more expensive than control. When applied to an IV setting as outlined in Remark 2, this condition corresponds to the assumption that the IV is indeed an encouragement to take treatment. This condition is slightly stronger than its counterpart in Sun et al. [41], which only requires that Δ0C0\Delta^{C}_{0}\geq 0. This stronger condition is needed to ensure the asymptotic linearity of our proposed estimator in Section 4. Under Condition A4, it is evident that Condition A5 is reasonable because if αϕ0>κ\alpha\phi_{0}>\kappa, then no ITR satisfies the treatment resource constraint in view of the fact that E0[ρ(V)δ0C(V)]0\operatorname{E}_{0}[{\rho}(V)\delta_{0}^{C}(V)]\geq 0, whereas if αϕ0=κ\alpha\phi_{0}=\kappa, then only the trivial ITR v0v\mapsto 0 satisfies the constraint and there is no need to estimate an optimal ITR.

Under these two additional conditions, (2) is a fractional knapsack problem [9] in which every subgroup defined by a different value of VV corresponds to a different ‘item’. A solution in the special case in which V(W)=WV(W)=W and α=0\alpha=0 was given in Theorem 1 of Sun et al. [41]. We now state a more general result with the following differences: (i) the treatment decision may be based on a summary VV rather than the entire covariate vector WW, and (ii) α\alpha may take any value in [0,1][0,1] rather than only zero. We also explicitly state the randomization probability at the boundary for completeness and clarity. Despite these differences, the result we obtain is similar to Theorem 1 in Sun et al. [41]. Define pointwise ξ0(v):=δ0Y(v)/δ0C(v)\xi_{0}(v):=\delta^{Y}_{0}(v)/\delta^{C}_{0}(v), and write η0:=inf{η:E0[I(ξ0(V)>η)δ0C(V)]καϕ0}\eta_{0}:=\inf\{\eta:\operatorname{E}_{0}[I(\xi_{0}(V)>\eta)\delta^{C}_{0}(V)]\leq\kappa-\alpha\phi_{0}\} and τ0:=max{η0,0}\tau_{0}:=\max\{\eta_{0},0\}.

Theorem 2 (Optimal ITR).

Under Conditions A1A5, a solution to (2) is explicitly given by

ρ0(v):={καϕ0E0[I(ξ0(V)>τ0)δ0C(V)]E0[I(ξ0(V)=τ0)δ0C(V)]: if τ0>0,ξ0(v)=τ0 and E0[I(ξ0(V)=τ0)δ0C(V)]>0I(ξ0(v)>τ0): otherwise .\displaystyle{\rho}_{0}(v):=\begin{cases}\ \frac{\kappa-\alpha\phi_{0}-\operatorname{E}_{0}\left[I(\xi_{0}(V)>\tau_{0})\delta^{C}_{0}(V)\right]}{\operatorname{E}_{0}\left[I(\xi_{0}(V)=\tau_{0})\delta^{C}_{0}(V)\right]}&:\ \text{ if }\tau_{0}>0,\ \xi_{0}(v)=\tau_{0}\textnormal{ and }\operatorname{E}_{0}\left[I(\xi_{0}(V)=\tau_{0})\delta^{C}_{0}(V)\right]>0\\ \ I\left(\xi_{0}(v)>\tau_{0}\right)&:\ \text{ otherwise\ .}\end{cases}

Here, the first case is the boundary case with the randomization probability that saturates the treatment resource.

We also note that the reference ITRs introduced in Section 2 are also identified under the above conditions. In particular, it can be shown that ρ0RD(v):=min{1,(καϕ0)/E0[Δ0C(W)]}{\rho}^{\mathrm{RD}}_{0}(v):=\min\{1,(\kappa-\alpha\phi_{0})/\operatorname{E}_{0}[\Delta^{C}_{0}(W)]\} and ρ0TP=μ0T{\rho}^{\mathrm{TP}}_{0}=\mu^{T}_{0}.

4 Estimating and evaluating optimal individualized treatment rules

In this section, we present an estimator of an optimal ITR ρ0{\rho}_{0} and an inferential procedure for its ATE relative to a reference ITR ρ0{\rho}^{\mathcal{R}}_{0}, where \mathcal{R} is any of FR{\mathrm{FR}}, RD{\mathrm{RD}} or TP{\mathrm{TP}}. The proposed procedure is an adaptation of the method first proposed in Qiu et al. [34, 33].

We begin by introducing some notations that are useful for defining the estimands. We define the parameter Ψρ(P):=EP[ρ(V)ΔPY(W)]\Psi_{{\rho}}(P):=\operatorname{E}_{P}\left[{\rho}(V)\Delta^{Y}_{P}(W)\right] or Ψρ(P):=EP[ρ(W)ΔPY(W)]\Psi_{{\rho}}(P):=\operatorname{E}_{P}\left[{\rho}(W)\Delta^{Y}_{P}(W)\right] for each ITR ρ{\rho} and distribution PP\in\mathscr{M}, depending on whether the domain of ρ{\rho} is 𝒱\mathcal{V} or 𝒲\mathcal{W}. Here, we consider the model \mathscr{M} to be locally nonparametric at P0P_{0} [31]. For PP\in\mathscr{M}, the ATE of an optimal ITR ρP{\rho}_{P} relative to a reference ITR ρP{\rho}^{\mathcal{R}}_{P} equals Ψ(P):=ΨρP(P)ΨρP(P)\Psi_{\mathcal{R}}(P):=\Psi_{{\rho}_{P}}(P)-\Psi_{{\rho}^{\mathcal{R}}_{P}}(P). We are interested in making inference about ψ0:=Ψ(P0)\psi_{0}:=\Psi_{\mathcal{R}}(P_{0}), where we have suppressed dependence on \mathcal{R} from our shorthand notation.

4.1 Pathwise differentiability of the ATE

We first present a result regarding the pathwise differentiability of the ATE. Pathwise differentiability of the parameter of interest serves as the foundation for constructing asymptotically efficient estimators of this parameter, based on which an inferential procedure may be developed. Additional technical conditions are required and are provided in Section S1 in the Supplementary Material. For a distribution PP\in\mathscr{M}, a function μC:{0,1}×𝒲\mu^{C}:\{0,1\}\times\mathcal{W}\rightarrow, an ITR ρ{\rho}, and a decision threshold τ\tau\in, we define pointwise the following functions:

D(P,ρ,τ,μC)(o):=ρ(v)[yμPY(t,w)t+μPT(w)1+ΔPY(w)]Ψe(P)τ{ρ(v)[cμC(t,w)t+μPT(w)1+ΔC(w)]+α[(1t)(cμC(0,w))1μPT(w)+μC(0,w)]κ};G(P)(o):=D(P,ρP,τP,μPC)(o);D1(P,μC)(o):=(1t)(cμC(0,w))1μPT(w)+μC(0,w)EP[μC(0,W)];D2(P,μC)(o):=cμC(t,w)t+μPT(w)1+ΔC(w)EP[ΔC(W)];GRD(P)(o):=D(P,ρPRD,0,μPC)(o)αΨρPRD(P)D1(P,μPC)κϕPΨρPRD(P)D2(P,μPC)EP[ΔPC(W)];GTP(P)(o):=μPT(w)t+μPT(w)1[yμPY(t,w)]+tΔPY(w)ΨρPTP(P);GFR(P)(o):=D(P,ρFR,0,μPC)(o).\displaystyle\begin{split}D(P,{\rho},\tau,\mu^{C})(o)\ &:=\ {\rho}(v)\left[\frac{y-\mu^{Y}_{P}(t,w)}{t+\mu^{T}_{P}(w)-1}+\Delta^{Y}_{P}(w)\right]-\Psi_{e}(P)\\ &\hskip 18.06749pt-\tau\left\{{\rho}(v)\left[\frac{c-\mu^{C}(t,w)}{t+\mu^{T}_{P}(w)-1}+\Delta^{C}(w)\right]+\alpha\left[\frac{(1-t)(c-\mu^{C}(0,w))}{1-\mu^{T}_{P}(w)}+\mu^{C}(0,w)\right]-\kappa\right\};\\ G(P)(o)\ &:=D(P,{\rho}_{P},\tau_{P},\mu^{C}_{P})(o)\ ;\\ D_{1}(P,\mu^{C})(o)\ &:=\ \frac{(1-t)(c-\mu^{C}(0,w))}{1-\mu^{T}_{P}(w)}+\mu^{C}(0,w)-\operatorname{E}_{P}\left[\mu^{C}(0,W)\right]\ ;\\ D_{2}(P,\mu^{C})(o)\ &:=\ \frac{c-\mu^{C}(t,w)}{t+\mu^{T}_{P}(w)-1}+\Delta^{C}(w)-\operatorname{E}_{P}\left[\Delta^{C}(W)\right]\ ;\\ G_{\mathrm{RD}}(P)(o)\ &:=\ D(P,{\rho}^{\mathrm{RD}}_{P},0,\mu^{C}_{P})(o)-\frac{\alpha\Psi_{{\rho}^{\mathrm{RD}}_{P}}(P)D_{1}(P,\mu^{C}_{P})}{\kappa-\phi_{P}}-\frac{\Psi_{{\rho}^{\mathrm{RD}}_{P}}(P)D_{2}(P,\mu^{C}_{P})}{\operatorname{E}_{P}\left[\Delta_{P}^{C}(W)\right]}\ ;\\ G_{\mathrm{TP}}(P)(o)\ &:=\ \frac{\mu^{T}_{P}(w)}{t+\mu^{T}_{P}(w)-1}\left[y-\mu^{Y}_{P}(t,w)\right]+t\Delta^{Y}_{P}(w)-\Psi_{{\rho}^{\mathrm{TP}}_{P}}(P)\ ;\\ G_{\mathrm{FR}}(P)(o)\ &:=D(P,{\rho}^{\mathrm{FR}},0,\mu^{C}_{P})(o)\ .\end{split} (3)

One key condition we rely on is the following non-exceptional law assumption.

Condition B1 (Non-exceptional law).

P0(ξ0(V)=τ0)=0P_{0}(\xi_{0}(V)=\tau_{0})=0.

Under this condition, the true optimal ITR ρ0\rho_{0} is identical to an indicator function. If all covariates are discrete, then we can plug in the empirical estimates into the identification formulae in Theorems 12 and show that the resulting estimators of the ATE are asymptotically normal by the delta method even when Condition B1 does not hold. We do not further pursue this simple case in this paper, and thus need to rely on the non-exceptional law assumption, namely Condition B1, to account for continuous covariates. We list additional technical conditions in Supplement S1.

We can now provide a formal result describing the pathwise differentiability of the ATE parameter.

Theorem 3 (Pathwise differentiability of the ATE).

Let {FR,RD,TP}\mathcal{R}\in\{{\mathrm{FR}},{\mathrm{RD}},{\mathrm{TP}}\}. Provided Conditions A1A5 and B1B5 are satisfied, the parameters PΨρP(P)P\mapsto\Psi_{{\rho}_{P}}(P) and PΨρP(P)P\mapsto\Psi_{{\rho}^{\mathcal{R}}_{P}}(P) are pathwise differentiable at P0P_{0} relative to \mathscr{M} with canonical gradients G(P0)G(P_{0}) and G(P0)G_{\mathcal{R}}(P_{0}), respectively.

We note that the pathwise differentiability of PΨρP(P)P\mapsto\Psi_{{\rho}^{\mathcal{R}}_{P}}(P) was established in Theorem 3 of Qiu et al. [34] for {FR,TP}\mathcal{R}\in\{{\mathrm{FR}},{\mathrm{TP}}\}. The other results can be proven using similar techniques. We put the proof of these results in Supplement S4.2. In view of Theorem 3, it follows that the ATE parameter Ψ\Psi_{\mathcal{R}} is pathwise differentiable at P0P_{0} with nonparametric canonical gradient

D(P0):=G(P0)G(P0)D_{\mathcal{R}}(P_{0}):=G(P_{0})-G_{\mathcal{R}}(P_{0}) (4)

for {FR,RD,TP}\mathcal{R}\in\{{\mathrm{FR}},{\mathrm{RD}},{\mathrm{TP}}\}.

Remark 3.

We have noted similar additional terms related to the resource being used in the canonical gradient of the mean counterfactual outcome or ATE of optimal ITRs under resource constraints, for example, in Luedtke and van der Laan [20] and Qiu et al. [34]. In our problem, this additional term is

τ0{ρ0(v)[cμ0C(t,w)t+μ0T(w)1+Δ0C(w)]+α[(1t)(cμ0C(0,w))1μ0T(w)+μ0C(0,w)]κ}.-\tau_{0}\left\{{\rho}_{0}(v)\left[\frac{c-\mu^{C}_{0}(t,w)}{t+\mu^{T}_{0}(w)-1}+\Delta^{C}_{0}(w)\right]+\alpha\left[\frac{(1-t)(c-\mu^{C}_{0}(0,w))}{1-\mu^{T}_{0}(w)}+\mu^{C}_{0}(0,w)\right]-\kappa\right\}.

Such terms appear to come from solving a fractional knapsack problem with truncation at zero and take the form of a product of (i) the threshold in the solution, and (ii) a term that equals the influence function of the resource being used under the solution when the resource is saturated. We conjecture that such structures generally exist for fractional knapsack problems.

4.2 Proposed estimator and asymptotic linearity

We next present our proposed nonparametric procedure for estimating an optimal ITR ρ0{\rho}_{0} and the corresponding ATE ψ0\psi_{0}. We will generally use subscript nn to denote an estimator with sample size nn, and add a hat to a nuisance function estimator that is targeted toward estimating ϕ0\phi_{0}.

  1. 1.

    Use the empirical distribution P^W,n\hat{P}_{W,n} of WW as an estimate of the true marginal distribution of WW. Compute estimates μnY\mu^{Y}_{n}, μnC\mu^{C}_{n}, μnT\mu^{T}_{n}, δnY\delta^{Y}_{n} and δnC\delta^{C}_{n} of μ0Y\mu^{Y}_{0}, μ0C\mu^{C}_{0}, μ0T\mu^{T}_{0}, δ0Y\delta^{Y}_{0} and δ0C\delta^{C}_{0}, respectively, using flexible regression methods. Recall that μ0Y(t,w)=E0[YT=t,W=w]\mu^{Y}_{0}(t,w)=\operatorname{E}_{0}[Y\mid T=t,W=w], μ0C(t,w)=E0[CT=t,W=w]\mu^{C}_{0}(t,w)=\operatorname{E}_{0}[C\mid T=t,W=w], δ0Y(v)=E0[μ0Y(1,W)μ0Y(0,W)V=v]\delta^{Y}_{0}(v)=\operatorname{E}_{0}[\mu^{Y}_{0}(1,W)-\mu^{Y}_{0}(0,W)\mid V=v], and δ0C(v)=E0[μ0C(1,W)μ0C(0,W)V=v]\delta^{C}_{0}(v)=\operatorname{E}_{0}[\mu^{C}_{0}(1,W)-\mu^{C}_{0}(0,W)\mid V=v]. Define pointwise ΔnC(w):=μnC(1,w)μnC(0,w)\Delta^{C}_{n}(w):=\mu^{C}_{n}(1,w)-\mu^{C}_{n}(0,w).

  2. 2.

    Estimate an optimal ITR:

    1. (a)

      Estimate ϕ0=E0[μ0C(0,W)]\phi_{0}=\operatorname{E}_{0}[\mu^{C}_{0}(0,W)] with a one-step correction estimator

      ϕn:=1ni=1n[μnC(0,Wi)+(1Ti)(CiμnC(0,Wi))1μnT(Wi)].\phi_{n}:=\frac{1}{n}\sum_{i=1}^{n}\left[\mu^{C}_{n}(0,W_{i})+\frac{(1-T_{i})(C_{i}-\mu^{C}_{n}(0,W_{i}))}{1-\mu^{T}_{n}(W_{i})}\right].
    2. (b)

      Let ξn:=δnY/δnC\xi_{n}:=\delta^{Y}_{n}/\delta^{C}_{n}, Γn:τ1ni:ξn(Vi)>τΔnC(Wi)\Gamma_{n}:\tau\mapsto\frac{1}{n}\sum_{i:\xi_{n}(V_{i})>\tau}\Delta^{C}_{n}(W_{i}) and γn:τ1ni:ξn(Vi)=τΔnC(Wi)\gamma_{n}:\tau\mapsto\frac{1}{n}\sum_{i:\xi_{n}(V_{i})=\tau}\Delta^{C}_{n}(W_{i}). For any k[0,]k\in[0,\infty], define ηn(k):=inf{τ:Γn(τ)kαϕn}\eta_{n}(k):=\inf\left\{\tau:\Gamma_{n}(\tau)\leq k-\alpha\phi_{n}\right\}, τn(k):=max{ηn(k),0}\tau_{n}(k):=\max\left\{\eta_{n}(k),0\right\} and

      dn,k:v{kαϕnΓn(ηn(k))γn(ηn(k)): if ξn(v)=ηn(k) and γn(ηn(k))>0,I{ξn(v)>ηn(k)}: otherwise.d_{n,k}:v\mapsto\begin{cases}\ \frac{k-\alpha\phi_{n}-\Gamma_{n}(\eta_{n}(k))}{\gamma_{n}(\eta_{n}(k))}&:\ \text{ if }\xi_{n}(v)=\eta_{n}(k)\textnormal{ and }\gamma_{n}(\eta_{n}(k))>0\ ,\\ \ I\{\xi_{n}(v)>\eta_{n}(k)\}&:\ \text{ otherwise.}\end{cases}

      The rule dn,kd_{n,k} is the sample analog of an ITR that prescribes treatment to those with the highest values of ξ0(V)\xi_{0}(V), regardless of whether treatment is harmful or not, until treatment resources run out.

    3. (c)

      Compute knk_{n}, which is used to define an estimate of ρ0{\rho}_{0} for which the plug-in estimator is asymptotically linear under conditions, as follows:

      • if τn(κ)>0\tau_{n}(\kappa)>0 and there is a solution in k[0,)k\in[0,\infty) to

        1ni=1ndn,k(Vi)[ΔnC(Wi)+CiμnC(Ti,Wi)ti+μnT(Wi)1]+αϕn=κ,\frac{1}{n}\sum_{i=1}^{n}d_{n,k}(V_{i})\left[\Delta^{C}_{n}(W_{i})+\frac{C_{i}-\mu^{C}_{n}(T_{i},W_{i})}{t_{i}+\mu^{T}_{n}(W_{i})-1}\right]+\alpha\phi_{n}=\kappa\ , (5)

        then take knk_{n} to be this solution;

      • otherwise, set kn=κk_{n}=\kappa.

    4. (d)

      Estimate ρ0{\rho}_{0} using the sample analog of ρ0{\rho}_{0} with treatment resource constraint knk_{n}, namely

      ρn:v{knαϕnΓn(τn(kn))γn(τn(kn)): if ξn(v)=τn(kn) and γn(τn(kn))>0I{ξn(v)>τn(kn)}: otherwise.{\rho}_{n}:v\mapsto\begin{cases}\ \frac{k_{n}-\alpha\phi_{n}-\Gamma_{n}(\tau_{n}(k_{n}))}{\gamma_{n}(\tau_{n}(k_{n}))}&:\ \text{ if }\xi_{n}(v)=\tau_{n}(k_{n})\textnormal{ and }\gamma_{n}(\tau_{n}(k_{n}))>0\\ \ I\{\xi_{n}(v)>\tau_{n}(k_{n})\}&:\ \text{ otherwise.}\end{cases}
  3. 3.

    Obtain an estimate ρn{\rho}^{\mathcal{R}}_{n} of the reference ITR ρ0{\rho}^{\mathcal{R}}_{0} as follows:

    • For =FR\mathcal{R}={\mathrm{FR}}, take ρn{\rho}^{\mathcal{R}}_{n} to be ρFR{\rho}^{\mathrm{FR}}.

    • For =RD\mathcal{R}={\mathrm{RD}},

      1. (a)

        obtain a targeted estimate μ^nC(1,)\hat{\mu}^{C}_{n}(1,\cdot) of μ0C(1,)\mu^{C}_{0}(1,\cdot): run an ordinary least-squares linear regression with outcome CC, covariate 1/(T+μnT(W)1)1/(T+\mu^{T}_{n}(W)-1), offset μnC(T,W)\mu^{C}_{n}(T,W) and no intercept. Take μ^nC\hat{\mu}^{C}_{n} to be the fitted mean model;

      2. (b)

        take ρn{\rho}^{\mathcal{R}}_{n} to be the constant function wmin{1,(κϕn)/1ni=1nΔ^nC(Wi)}w\mapsto\min\left\{1,(\kappa-\phi_{n})/\tfrac{1}{n}\sum_{i=1}^{n}\hat{\Delta}^{C}_{n}(W_{i})\right\}, where we define pointwise Δ^nC(w):=μ^nC(1,w)μ^nC(0,w)\hat{\Delta}^{C}_{n}(w):=\hat{\mu}^{C}_{n}(1,w)-\hat{\mu}^{C}_{n}(0,w).

    • For =TP\mathcal{R}={\mathrm{TP}}, take ρn{\rho}^{\mathcal{R}}_{n} to be μnT\mu^{T}_{n}.

  4. 4.

    Estimate ATE of ρ0{\rho}_{0} relative to the reference ITR ρ0{\rho}^{\mathcal{R}}_{0} with a targeted minimum-loss based estimator (TMLE) ψn\psi_{n}:

    1. (a)

      obtain a targeted estimate μ^nY\hat{\mu}^{Y}_{n} of μ0Y\mu^{Y}_{0}: run an ordinary least-squares linear regression with outcome YY, covariate [ρn(V)ρn(W)]/[T+μnT(W)1][{\rho}_{n}(V)-{\rho}^{\mathcal{R}}_{n}(W)]/[T+\mu^{T}_{n}(W)-1], offset μnY(T,W)\mu^{Y}_{n}(T,W) and no intercept. Take μ^nY\hat{\mu}^{Y}_{n} to be the fitted mean function.

    2. (b)

      with P^n\hat{P}_{n} being any distribution with components μ^nY\hat{\mu}^{Y}_{n} and P^W,n\hat{P}_{W,n}, take

      ψn:=Ψρn(P^n)Ψρn(P^n)=1ni=1n[ρn(Vi)ρn,i][μ^nY(1,Wi)μ^nY(1,Wi)],\psi_{n}:=\Psi_{{\rho}_{n}}(\hat{P}_{n})-\Psi_{{\rho}^{\mathcal{R}}_{n}}(\hat{P}_{n})=\frac{1}{n}\sum_{i=1}^{n}[{\rho}_{n}(V_{i})-{\rho}^{\mathcal{R}}_{n,i}][\hat{\mu}^{Y}_{n}(1,W_{i})-\hat{\mu}^{Y}_{n}(1,W_{i})]\ ,

      where ρn,i{\rho}^{\mathcal{R}}_{n,i} is defined as ρn(Wi){\rho}^{\mathcal{R}}_{n}(W_{i}) or ρn(Vi){\rho}^{\mathcal{R}}_{n}(V_{i}) depending on the covariate used by the reference ITR.

The above procedure is similar to that proposed in Qiu et al. [34]. One key difference is the use of the refined estimator knk_{n} of κ\kappa obtained via the estimating equation (5), which is key to ensuring the asymptotic linearity of ψn\psi_{n}. Another difference is that the denominator of ξn\xi_{n} is now δnC\delta^{C}_{n}, which is consistent with our different definition of the unit value for solving the fractional knapsack problem (2). Similarly to TMLE for other problems, when CC or YY has known bounds (e.g., the closed interval [0,1][0,1]), to obtain a corresponding targeted estimate that respect the known bounds, we may use logistic regression rather than ordinary least-squares [12].

The above procedure has both similarities and substantial differences compared to the estimation procedure proposed by Sun et al. [41]. The main difference is that our procedure is targeted towards efficient estimation of and inference about the ATE of ψ0\psi_{0} of the optimal ITR under a nonparametric model, while Sun et al. [41] focus on estimating the optimal ITR ρ0{\rho}_{0} and does not evaluate this optimal ITR. This leads to a key difference between the two procedures when estimating the optimal ITR: we need to solve an estimating equation (5), which is crucial to ensuring that the estimator ψn\psi_{n} is asymptotically linear, while Sun et al. [41] do not. The requirement of solving (5) is related to the nature of the fractional knapsack problem discussed in Remark 3, and we conjecture that such a calibration on the resource used is necessary for general problems of the same nature. Our procedure is also related to the method in Sun [42]. Sun [42] rely on the availability of asymptotically normal estimators of both the average benefit and average resource used (Assumption 2.4), a nontrivial requirement when the propensity score μ0T\mu^{T}_{0} is unknown in observational studies. Our procedure essentially produces such estimators: in Step 4, an asymptotically normal estimator of the ATE is constructed, whereas an asymptotically normal estimator of the expected resource is produced in Step 2 and used to calibrate the resource expenditure of the estimated optimal ITR ρn{\rho}_{n} in Step 2(c).

Remark 4.

In Step 1 of the above procedure, we estimate the functions δ0Y\delta^{Y}_{0} and δ0C\delta^{C}_{0} using a naïve approach based on outcome regression. It is viable to use more advanced techniques such as the doubly robust methods in van der Laan and Rubin [45], van der Laan and Luedtke [46], Luedtke and van der Laan [22], and Kennedy [17] or R-learning as in Nie and Wager [28]. These methods were developed for conditional average treatment effect estimation and might lead to better estimators of δ0Y\delta^{Y}_{0} and δ0C\delta^{C}_{0}. It is also possible to develop multiply robust methods to estimate ξ0\xi_{0} using influence function techniques. Such methods to estimate ξ0\xi_{0} are beyond the scope of our paper, whose main focus is on the inference for the ATE. Our theoretical analysis of the estimator only applies to naïve estimators based on outcome regression, but we expect only minor modifications to be required to study these more advanced estimators once their asymptotic behavior is characterized.

Remark 5.

In Step 2(a), it is also viable to use other efficient estimators of ϕ0\phi_{0}, for example, a targeted minimum loss-based estimator (TMLE). We note that estimating ϕ0\phi_{0} is only one component of estimating the optimal ITR ρ0{\rho}_{0}. Methods such as TMLE can be preferable to ensure that the estimator respects known bounds on the estimand. However, in our case, such an improvement in estimating ϕ0\phi_{0} does not necessarily lead to an improvement in the estimation of ρ0{\rho}_{0}.

We now present results on the asymptotic linearity and efficiency of our proposed estimator. We state and discuss the technical conditions required by the theorem below in Supplement S1.

Theorem 4 (Asymptotic linearity of ATE estimator).

Let {FR,RD,TP}\mathcal{R}\in\{{\mathrm{FR}},{\mathrm{RD}},{\mathrm{TP}}\}. Under Conditions B1B12, with the canonical gradient D(P0)D_{\mathcal{R}}(P_{0}) defined in (3) and (4), it holds that

ψnψ0=1ni=1nD(P0)(Oi)+op(n1/2).\psi_{n}-\psi_{0}=\frac{1}{n}\sum_{i=1}^{n}D_{\mathcal{R}}(P_{0})(O_{i})+{\mathrm{o}}_{p}(n^{-1/2})\ .

Therefore, n(ψnψ0)𝑑N(0,σ02)\sqrt{n}\left(\psi_{n}-\psi_{0}\right)\overset{d}{\longrightarrow}\textnormal{N}\left(0,\sigma_{0}^{2}\right), where σ02:=E0[D(P0)(O)2]\sigma_{0}^{2}:=\operatorname{E}_{0}\left[D_{\mathcal{R}}(P_{0})(O)^{2}\right]. Since ψn\psi_{n} is asymptotically linear with influence function equal to the canonical gradient, ψn\psi_{n} is also asymptotically efficient.

To conduct inference about ψ0\psi_{0}, we can directly plug the estimators of nuisance functions into D(P0)D_{\mathcal{R}}(P_{0}) to obtain a consistent estimator of D(P0)D_{\mathcal{R}}(P_{0}), and then take the sample variance to obtain a consistent estimator of the asymptotic variance σ02\sigma_{0}^{2}. The proof of Theorem 4 can be found in Supplements S4.3 and S4.4.

Remark 6.

It may be desirable to use cross-fitting [26, 56] to estimate an optimal ITR for better finite-sample performance. The asymptotic linearity is maintained by a similar argument that is used to prove Theorem 4. We describe this algorithm in Section S3 in the Supplementary Material.

Remark 7.

We note that, unlike in Qiu et al. [34] where the bound κ\kappa lies in (0,1](0,1] due to the binary nature of treatment status, the methods we propose here do not require knowledge of an upper bound on treatment costs. When such a bound is indeed known (e.g., one), our methods may still be applied as long as all special cases corresponding to κ=\kappa=\infty or κ<\kappa<\infty in Section 4 are replaced by κ\kappa being equal to or less than the known bound, respectively.

5 Simulation

5.1 Simulation setting

In this simulation study, we investigate the performance of our proposed estimator of the ATE of an optimal ITR relative to specified reference ITRs. We focus here on the setting α=1\alpha=1. This scenario is more difficult than the case α=0\alpha=0 because it requires the estimation of ϕ0\phi_{0}.

We generate data from a model in which the treatment TT is an IV and the treatment cost CC and outcome YY are both binary. This data-generating mechanism satisfies all causal conditions and has an unobserved confounder between treatment cost and outcome. We first generate a trivariate covariate W=(W1,W2,W3)W=(W_{1},W_{2},W_{3}), where W1Unif(1,1)W_{1}\sim\mathrm{Unif}(-1,1), W2Bernoulli(0.8)W_{2}\sim\mathrm{Bernoulli}(0.8) and W3N(0,1)W_{3}\sim\mathrm{N}(0,1) are mutually independent. We also simulate an unobserved treatment-outcome confounder UBernoulli(0.5)U\sim\mathrm{Bernoulli}(0.5) independently of WW, and then simulate TT, CC, and YY as follows:

TW,U\displaystyle T\mid W,U\ Bernoulli(expit(2.5W1+0.5W2W3)),\displaystyle\sim\ \text{Bernoulli}\left(\operatorname{expit}(2.5W_{1}+0.5W_{2}W_{3})\right),
CT,W,U\displaystyle C\mid T,W,U\ Bernoulli(expit(2T1W1+0.2W2+0.7W3+2W1W2+0.5U)),\displaystyle\sim\ \text{Bernoulli}\left(\operatorname{expit}(2T-1-W_{1}+0.2W_{2}+0.7W_{3}+2W_{1}W_{2}+0.5U)\right),
YT,C,W,U\displaystyle Y\mid T,C,W,U\ Bernoulli(expit(0.3C+CW2W1+0.2W20.9W3+0.3CU)).\displaystyle\sim\ \text{Bernoulli}\left(\operatorname{expit}(-0.3C+CW_{2}-W_{1}+0.2W_{2}-0.9W_{3}+0.3CU)\right).

We introduce UU in the data-generating mechanism to emphasize that we do not require assumptions on the joint distribution of treatment cost and outcome conditional on covariates. We consider all three reference ITRs {FR,RD,TP}\mathcal{R}\in\{{\mathrm{FR}},{\mathrm{RD}},{\mathrm{TP}}\}, where we set ρFR:v0{\rho}^{\mathrm{FR}}:v\mapsto 0. We set κ=0.68\kappa=0.68, which is an active constraint with τ0>0\tau_{0}>0 and ρ0RD<1{\rho}^{\mathrm{RD}}_{0}<1.

The ITRs we consider are based on all covariates — that is, we take V(W)=WV(W)=W. We estimate the nuisance functions using the Super Learner [48] with library including a logistic regression, generalized additive model with logit link [13], gradient boosting machine [10, 11, 23, 24], support vector machine [2, 8] and neural network [3, 35]. Because none of the nuisance functions follow a logistic regression model, the resulting ensemble learner is not expected to achieve the parametric convergence rate. Since both CC and YY are binary, we use logistic regression rather than ordinary least-squares to obtain their corresponding targeted estimates in Section 4.2. We consider sample size n{500,1000,4000,16000}n\in\{500,1000,4000,16000\}, and run 1000 Monte Carlo repetitions for each sample size. We implement the algorithm that incorporates cross-fitting discussed in Remark 6 and described in Section S3 in the Supplementary Material.

To evaluate the performance of our proposed estimator, we investigate the bias and root mean squared error (RMSE) of the estimator. We also investigate the coverage probability and the width of nominal 95% Wald CIs constructed using influence function-based standard error estimates. We further investigate the probability that our confidence lower limit falls below the true ATE, that is, the coverage probability of the 97.5% Wald confidence lower bound.

5.2 Simulation results

Table 1 presents the performance of our proposed estimator in this simulation. For sample sizes 500, 1000 and 4000, the CI coverage of our proposed method is lower than the nominal coverage 95%. When sample size is larger (16000), the CI coverage of our proposed method increases to 90–93%. The coverage of the confidence lower bounds is much closer to nominal (97.5%) for all sample sizes considered, though, and is always approximately nominal when the sample size is large. For all reference ITRs, the bias and RMSE of our proposed estimator appear to converge to zero faster than and at the same rate as the square root of sample size, respectively. All biases are negative, which is expected in view of Remark 6. All standard errors underestimate the variation of the estimator with the extent decreasing as sample size increases.

Table 1: Performance of estimators of average treatment effects in the simulation with nuisance functions estimated via machine learning.
Performance measure Sample size FR{\mathrm{FR}} RD{\mathrm{RD}} TP{\mathrm{TP}}
95% Wald CI coverage 500 74%74\% 71%71\% 70%70\%
1000 78%78\% 74%74\% 73%73\%
4000 90%90\% 84%84\% 88%88\%
16000 93%93\% 90%90\% 93%93\%
97.5% confidence lower 500 94%94\% 96%96\% 96%96\%
bound coverage 1000 97%97\% 98%98\% 96%96\%
4000 98%98\% 98%98\% 98%98\%
16000 97%97\% 98%98\% 97%97\%
bias 500 0.018-0.018 0.018-0.018 0.020-0.020
1000 0.014-0.014 0.013-0.013 0.013-0.013
4000 0.003-0.003 0.004-0.004 0.003-0.003
16000 0.000-0.000 0.001-0.001 0.000-0.000
RMSE 500 0.0560.056 0.0390.039 0.0460.046
1000 0.0390.039 0.0250.025 0.0310.031
4000 0.0170.017 0.0090.009 0.0120.012
16000 0.0090.009 0.0040.004 0.0050.005
Ratio of mean standard error 500 0.6200.620 0.6200.620 0.5710.571
to standard deviation 1000 0.6830.683 0.6730.673 0.6370.637
4000 0.8680.868 0.7650.765 0.8090.809
16000 0.9130.913 0.8700.870 0.9060.906

Figure 1 presents the width of the Wald CIs scaled by the square root of sample size nn. Our theory indicates that the CI width should shrink at a root-nn rate, and our simulation results are consistent with this. There are some outlying cases of extremely wide or narrow CIs. This is expected for small sample sizes because the estimator of σ02\sigma_{0}^{2} in Theorem 4 resembles a sample mean and might not be close to σ02\sigma_{0}^{2} with high probability when sample size is small. In practice, this issue might be slightly mitigated by fine-tuning the involved machine learning algorithms.

Refer to caption
Figure 1: Boxplot of n×\sqrt{n}\times CI width for ATE relative to each reference ITR.

As indicated in Theorem 4, theoretical guarantees for the validity of the Wald CIs rely on the nuisance function estimators converging to the truth sufficiently quickly. It appears that the undercoverage of our Wald CI in small samples may owe, in part, to poor estimation of these nuisance functions in small sample sizes. To illustrate how our procedure may perform with improved small-sample nuisance function estimators, we conducted another two simulations: one is identical to those reported earlier in all ways except that the nuisance function estimators μnY\mu^{Y}_{n}, μnC\mu^{C}_{n} and μnT\mu^{T}_{n} are taken to be equal to the truth; the other is a simpler scenario under a lower dimension and a parametric model. The results are presented in Section S5 in the Supplementary Material and suggest that our proposed estimator may achieve significantly better performance with improved machine learning estimators of the nuisance functions. This motivates seeking ways to optimize the finite-sample performance of the nuisance function estimators employed in future applications of the proposed method, possibly based on prior subject-matter expertise. The underestimation of standard errors in this simulation also motivates future work exploring whether there are standard error estimators with better finite-sample performance, for example, estimators based on the bootstrap.

6 Conclusion

There is an extensive literature on estimating optimal ITRs and evaluating their performance. Among these works, only a few incorporated treatment resource constraints. In this paper, we build upon Sun et al. [41] and study the problem of estimating optimal ITRs under treatment cost constraints when the treatment cost is random. Using similar techniques as used in Qiu et al. [34], we have proposed novel methods to estimate an optimal ITR and infer about the corresponding average treatment effect relative to a prespecified reference ITR, under a locally nonparametric model. Our methods may also be applied to instrumental variable (IV) settings studied in Qiu et al. [34] when the IV is intervened on.

References

  • Abadie [2003] Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics 113(2), 231–263.
  • Bennett and Campbell [2000] Bennett, K. P. and C. Campbell (2000). Support vector machines: hype or hallelujah? SIGKDD Explor. Newsl. 2(2), 1–13.
  • Bishop [1995] Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
  • Bolthausen et al. [2002] Bolthausen, E., E. Perkins, and A. van der Vaart (2002). Lectures on Probability Theory and Statistics: Ecole D’Eté de Probabilités de Saint-Flour XXIX-1999, Volume 1781 of Lecture Notes in Mathematics. Berlin, Heidelberg: Springer Science & Business Media.
  • Butler et al. [2018] Butler, E. L., E. B. Laber, S. M. Davis, and M. R. Kosorok (2018). Incorporating Patient Preferences into Estimation of Optimal Individualized Treatment Rules. Biometrics 74(1), 18–26.
  • Chakraborty and Moodie [2013] Chakraborty, B. and E. E. Moodie (2013). Statistical Methods for Dynamic Treatment Regimes. Statistics for Biology and Health. New York, NY: Springer New York.
  • Chen et al. [2018] Chen, J., H. Fu, X. He, M. R. Kosorok, and Y. Liu (2018). Estimating individualized treatment rules for ordinal treatments. Biometrics 74(3), 924–933.
  • Cortes and Vapnik [1995] Cortes, C. and V. Vapnik (1995). Support-vector networks. Machine Learning 20(3), 273–297.
  • Dantzig [1957] Dantzig, G. B. (1957). Discrete-variable extremum problems. Operations research 5(2), 266–288.
  • Friedman [2001] Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Technical Report 5.
  • Friedman [2002] Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis 38(4), 367–378.
  • Gruber and Van Der Laan [2010] Gruber, S. and M. J. Van Der Laan (2010). A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. International Journal of Biostatistics 6(1).
  • Hastie and Tibshirani [1990] Hastie, T. and R. Tibshirani (1990). Generalized additive models. Chapman and Hall.
  • Imai and Li [2021] Imai, K. and M. L. Li (2021). Experimental Evaluation of Individualized Treatment Rules. Journal of the American Statistical Association.
  • Imbens and Angrist [1994] Imbens, G. W. and J. D. Angrist (1994). Identification and Estimation of Local Average Treatment Effects. Econometrica 62(2), 467–475.
  • Kennedy [2016] Kennedy, E. H. (2016). Semiparametric theory and empirical processes in causal inference. In Statistical causal inferences and their applications in public health research, pp.  141–167. Springer.
  • Kennedy [2020] Kennedy, E. H. (2020). Towards optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497v3.
  • Laber and Zhao [2015] Laber, E. and Y. Zhao (2015). Tree-based methods for individualized treatment regimes. Biometrika 102(3), 501–514.
  • Lei et al. [2012] Lei, H., I. Nahum-Shani, K. Lynch, D. Oslin, and S. A. Murphy (2012). A ”SMART” design for building individualized treatment sequences. Annual Review of Clinical Psychology 8, 21–48.
  • Luedtke and van der Laan [2016a] Luedtke, A. R. and M. J. van der Laan (2016a). Optimal Individualized Treatments in Resource-Limited Settings. International Journal of Biostatistics 12(1), 283–303.
  • Luedtke and van der Laan [2016b] Luedtke, A. R. and M. J. van der Laan (2016b). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics 44(2), 713–742.
  • Luedtke and van der Laan [2016c] Luedtke, A. R. and M. J. van der Laan (2016c). Super-Learning of an Optimal Dynamic Treatment Rule. International Journal of Biostatistics 12(1), 305–332.
  • Mason et al. [1999] Mason, L., J. Baxter, P. Bartlett, and M. Frean (1999). Boosting Algorithms as Gradient Descent in Function Space. Technical report.
  • Mason et al. [2000] Mason, L., J. Baxter, P. L. Bartlett, and M. Frean (2000). Boosting Algorithms as Gradient Descent. Technical report.
  • Murphy [2003] Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2), 331–355.
  • Newey and Robins [2018] Newey, W. K. and J. R. Robins (2018). Cross-Fitting and Fast Remainder Rates for Semiparametric Estimation. arXiv preprint arXiv:1801.09138v1.
  • Neyman [1923] Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essay des principles. (Excerpts reprinted and translated to English, 1990). Statistical Science 5, 463–472.
  • Nie and Wager [2021] Nie, X. and S. Wager (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 108(2), 299–319.
  • Petersen et al. [2007] Petersen, M. L., S. G. Deeks, and M. J. van der Laan (2007). Individualized treatment rules: Generating candidate clinical trials. Statistics in Medicine 26(25), 4578–4601.
  • Pfanzagl [1982] Pfanzagl, J. (1982). Contributions to a General Asymptotic Statistical Theory, Volume 13 of Lecture Notes in Statistics. New York, NY: Springer New York.
  • Pfanzagl [1990] Pfanzagl, J. (1990). Estimation in semiparametric models. In Estimation in Semiparametric Models, pp.  17–22. Springer.
  • Qian and Murphy [2011] Qian, M. and S. A. Murphy (2011). Performance guarantees for individualized treatment rules. The Annals of Statistics 39(2), 1180.
  • Qiu et al. [2021a] Qiu, H., M. Carone, E. Sadikova, M. Petukhova, R. C. Kessler, and A. Luedtke (2021a). Correction to: “optimal individualized decision rules using instrumental variable methods”. Journal of the American Statistical Association (just-accepted), 1–2.
  • Qiu et al. [2021b] Qiu, H., M. Carone, E. Sadikova, M. Petukhova, R. C. Kessler, and A. Luedtke (2021b). Optimal individualized decision rules using instrumental variable methods. Journal of the American Statistical Association 116(533), 174–191.
  • Ripley [2014] Ripley, B. D. (2014). Pattern recognition and neural networks. Cambridge University Press.
  • Robins [1986] Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Mathematical modelling 7(9-12), 1393–1512.
  • Robins [2004] Robins, J. M. (2004). Optimal Structural Nested Models for Optimal Sequential Decisions. pp.  189–326. Springer, New York, NY.
  • Rothwell [2005] Rothwell, P. M. (2005). Subgroup analysis in randomised controlled trials: Importance, indications, and interpretation. Lancet 365(9454), 176–186.
  • Rubin [1974] Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), 688–701.
  • Song et al. [2015] Song, R., M. Kosorok, D. Zeng, Y. Zhao, E. Laber, and M. Yuan (2015). On sparse representation for optimal individualized treatment selection with penalized outcome weighted learning. Stat 4(1), 59–68.
  • Sun et al. [2021] Sun, H., S. Du, and S. Wager (2021). Treatment Allocation under Uncertain Costs. arXiv preprint arXiv:2103.11066v1.
  • Sun [2021] Sun, L. (2021). Empirical Welfare Maximization with Constraints. arXiv preprint arXiv:2103.15298v1.
  • Tchetgen Tchetgen and Vansteelandt [2013] Tchetgen Tchetgen, E. J. and S. Vansteelandt (2013). Alternative Identification and Inference for the Effect of Treatment on the Treated with an Instrumental Variable. Harvard University BiostatisticsWorking Paper Series.
  • van der Laan [2017] van der Laan, M. (2017). A Generally Efficient Targeted Minimum Loss Based Estimator based on the Highly Adaptive Lasso. International Journal of Biostatistics 13(2).
  • van der Laan and Rubin [2006] van der Laan, M. and D. Rubin (2006). Targeted Maximum Likelihood Learning. The International Journal of Biostatistics 2(1).
  • van der Laan and Luedtke [2014] van der Laan, M. J. and A. R. Luedtke (2014). Targeted Learning of the Mean Outcome under an Optimal Dynamic Treatment Rule. Journal of Causal Inference 3(1).
  • van der Laan and Petersen [2007] van der Laan, M. J. and M. L. Petersen (2007). Causal effect models for realistic individualized treatment and intention to treat rules. International Journal of Biostatistics 3(1).
  • van der Laan et al. [2007] van der Laan, M. J., E. C. Polley, and A. E. Hubbard (2007). Super Learner. Statistical Applications in Genetics and Molecular Biology 6(1).
  • van der Laan and Rose [2018] van der Laan, M. J. and S. Rose (2018). Targeted Learning in Data Science.
  • van der Vaart and Wellner [2000] van der Vaart, A. and J. Wellner (2000). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer.
  • van der Vaart [1998] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.
  • Varadhan et al. [2013] Varadhan, R., J. B. Segal, C. M. Boyd, A. W. Wu, and C. O. Weiss (2013). A framework for the analysis of heterogeneity of treatment effect in patient-centered outcomes research. Journal of Clinical Epidemiology 66(8), 818–825.
  • Wang and Tchetgen Tchetgen [2018] Wang, L. and E. Tchetgen Tchetgen (2018). Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(3), 531–550.
  • Zhao et al. [2012] Zhao, Y., D. Zeng, A. J. Rush, and M. R. Kosorok (2012). Estimating Individualized Treatment Rules Using Outcome Weighted Learning. Journal of the American Statistical Association 107(499), 1106–1118.
  • Zhao et al. [2015] Zhao, Y. Q., D. Zeng, E. B. Laber, R. Song, M. Yuan, and M. R. Kosorok (2015). Doubly robust learning for estimating individualized treatment with censored data. Biometrika 102(1), 151–168.
  • Zheng and van der Laan [2011] Zheng, W. and M. J. van der Laan (2011). Cross-Validated Targeted Minimum-Loss-Based Estimation. pp.  459–474. Springer, New York, NY.
  • Zhou et al. [2017] Zhou, X., N. Mayer-Hamblett, U. Khan, and M. R. Kosorok (2017). Residual Weighted Learning for Estimating Individualized Treatment Rules. Journal of the American Statistical Association 112(517), 169–187.

Supplementary Material for “Individualized treatment rules under stochastic treatment cost constraints”

This Supplementary Material is organized as follows. Section S1 contains technical conditions to ensure that the statistical parameter of interest, the average treatment effect, is pathwise differentiable and that our proposed estimator is asymptotically efficient. We discuss a particular technical condition that may be difficult to verify in Section S2. In Section S3, we describe a modified version of our proposed estimator with improved performance in small to moderate samples. We present proofs of theoretical results in Section S4. In Section S5, we present the results of a simulation under an idealized setting. These results may provide guidance on interpreting the simulation results in Section 5.

As noted in the main text, the methods proposed in this work build upon tools used in Qiu et al. [34]; as such, the involved technical details bear similarity. To orient readers and facilitate comparisons, we have organized these supplementary materials for these papers similarly and shared portions of technical details when appropriate.

S1 Technical conditions for pathwise differentiability of parameter and asymptotic linearity of proposed estimator

In this section, we list the additional technical conditions required by Theorems 3 and 4 in Section 4 that we omit in the main text. Before doing this, we define pointwise

Dn,FR(o)\displaystyle D_{n,{\mathrm{FR}}}(o) :=D(P^n,ρn,τ0,μnC)(o)D(P^n,ρFR,0,μ0C)(o),\displaystyle:=D(\hat{P}_{n},{\rho}_{n},\tau_{0},\mu^{C}_{n})(o)-D(\hat{P}_{n},{\rho}^{\mathrm{FR}},0,\mu^{C}_{0})(o)\ ,
Dn,RD(o)\displaystyle D_{n,{\mathrm{RD}}}(o) :=D(P^n,ρn,τ0,μnC)(o)D(P^n,ρnRD,0,μ0C)(o)\displaystyle:=D(\hat{P}_{n},{\rho}_{n},\tau_{0},\mu^{C}_{n})(o)-D(\hat{P}_{n},{\rho}^{\mathrm{RD}}_{n},0,\mu^{C}_{0})(o)
αΨρnRD(P^n)κϕnD1(P^n,μnC)(o)ΨρnRD(P^n)PnΔ^nCD2(P^n,μ^nC)(o),\displaystyle\hskip 36.135pt-\alpha\frac{\Psi_{{\rho}^{\mathrm{RD}}_{n}}(\hat{P}_{n})}{\kappa-\phi_{n}}D_{1}(\hat{P}_{n},\mu^{C}_{n})(o)-\frac{\Psi_{{\rho}^{\mathrm{RD}}_{n}}(\hat{P}_{n})}{P_{n}\hat{\Delta}^{C}_{n}}D_{2}(\hat{P}_{n},\hat{\mu}^{C}_{n})(o)\ ,
Dn,TP(o)\displaystyle D_{n,{\mathrm{TP}}}(o) :=D(P^n,ρn,τ0,μnC)(o)GTP(P^n)(o).\displaystyle:=D(\hat{P}_{n},{\rho}_{n},\tau_{0},\mu^{C}_{n})(o)-G_{\mathrm{TP}}(\hat{P}_{n})(o)\ .
Condition B2 (Nonzero continuous density of ξ0(V)\xi_{0}(V) around η0\eta_{0}).

If η0>\eta_{0}>-\infty, then the distribution of ξ0(V)\xi_{0}(V) has positive, finite and continuous Lebesgue density in a neighborhood of η0\eta_{0}.

Since Condition B2 is most plausible when covariates are continuous, in this case, it is also plausible to expect the distribution of ξ0(V)\xi_{0}(V) to be continuous and thus Condition B2 holds.

Condition B3 (Smooth treatment cost function or lack of constraint).

If η0>\eta_{0}>-\infty, then the function ηE0[I(ξ0(V)>η)Δ0C(W)]\eta\mapsto\operatorname{E}_{0}\left[I\left(\xi_{0}(V)>\eta\right)\Delta^{C}_{0}(W)\right] is continuously differentiable with nonzero derivative in a neighborhood of η0\eta_{0}; if η0=\eta_{0}=-\infty and κ<\kappa<\infty, then E0[Δ0C(W)]<καϕ0\operatorname{E}_{0}\left[\Delta^{C}_{0}(W)\right]<\kappa-\alpha\phi_{0}.

Condition B3 requires different conditions in separate cases. There are three cases in terms of the sufficiency of the budget to treat every individual: (i) there is an infinite budget and no constraint is present (κ=\kappa=\infty); (ii) the budget is insufficient (η0>\eta_{0}>-\infty); and (iii) the budget is finite but sufficient (η0=\eta_{0}=-\infty and κ<\kappa<\infty). Condition B3 makes no assumption for Case (i). In Case (ii), we require a function ηE0[I(ξ0(V)>η)Δ0C(W)]\eta\mapsto\operatorname{E}_{0}\left[I\left(\xi_{0}(V)>\eta\right)\Delta^{C}_{0}(W)\right] to be locally continuously differentiable. Since Δ0C>0\Delta_{0}^{C}>0 by Condition A4, this function is nonincreasing and thus only continuous differentiability is required. For each η\eta, this function is an integral of additional cost Δ0C\Delta_{0}^{C} over the set {v:ξ0(v)>0}\{v:\xi_{0}(v)>0\} and has a similar nature to survival functions. When covariates are continuous, it is plausible to assume that Δ0C(W)\Delta_{0}^{C}(W) is continuous and thus ηE0[I(ξ0(V)>η)Δ0C(W)]\eta\mapsto\operatorname{E}_{0}\left[I\left(\xi_{0}(V)>\eta\right)\Delta^{C}_{0}(W)\right] is continuously differentiable. In Case (iii), we require that the budget has a surplus. When it is unknown a priori whether the budget is sufficient to treat every individual, namely in Case (ii) or (iii), it is highly unlikely that the budget exactly suffices with no surplus. Therefore, Condition B3 is mild.

Condition B4 (Bounded additional treatment cost).

Δ0C\Delta^{C}_{0} is bounded.

Condition B5 (Active constraint).

If =RD\mathcal{R}={\mathrm{RD}}, then it holds that (καϕ0)/E0[Δ0C(W)]<1(\kappa-\alpha\phi_{0})/\operatorname{E}_{0}\left[\Delta^{C}_{0}(W)\right]<1.

Condition B5 requires that, when the rule ρRD{\rho}^{\mathrm{RD}} that assigns treatment completely at random while respecting the budget constraint is the reference rule of interest, it should not correspond to the trivial rule v1v\mapsto 1 that assigns treatment to every individual. The rule ρRD{\rho}^{\mathrm{RD}} equals v1v\mapsto 1 only when the budget is sufficient to treat every individual. Since, as a separate reference rule from given fixed rules ρFR{\rho}^{\mathrm{FR}}, the reference rule ρRD{\rho}^{\mathrm{RD}} is only interesting when the budget constraint is active, Condition B5 often holds automatically.

Condition B6 (Sufficient rates for nuisance estimators).
μnTμ0T2,P0{\displaystyle\|\mu^{T}_{n}-\mu^{T}_{0}\|_{2,P_{0}}\Big{\{} μnYμ0Y2,P0+μ^nYμ0Y2,P0\displaystyle\|\mu^{Y}_{n}-\mu^{Y}_{0}\|_{2,P_{0}}+\|\hat{\mu}^{Y}_{n}-\mu^{Y}_{0}\|_{2,P_{0}}
+μnCμ0C2,P0+μ^nCμ0C2,P0}=op(n1/2).\displaystyle+\|\mu^{C}_{n}-\mu^{C}_{0}\|_{2,P_{0}}+\|\hat{\mu}^{C}_{n}-\mu^{C}_{0}\|_{2,P_{0}}\Big{\}}={\mathrm{o}}_{p}(n^{-1/2})\ .

Condition B6 holds if all above nuisance estimators converge at a rate faster than n1/4n^{-1/4}, which may be much slower than the parametric rate n1/2n^{-1/2} and thus allows for the use of flexible nonparametric estimators. This condition also holds if μnY\mu^{Y}_{n}, μ^nY\hat{\mu}^{Y}_{n}, μnC\mu^{C}_{n} and μ^nC\hat{\mu}^{C}_{n} each converges slower than n1/4n^{-1/4}, as long as the estimated propensity score μnT\mu^{T}_{n} converges sufficiently fast to compensate.

Condition B7 (Consistency of estimated influence function).

The following terms are all op(1){\mathrm{o}}_{p}(1):

D1(P^n,μ^nC)D1(P0,μ0C)2,P0,D2(P^n,μnC)D2(P0,μ0C)2,P0,Dn,D(P0)2,P0,\displaystyle\|D_{1}(\hat{P}_{n},\hat{\mu}^{C}_{n})-D_{1}(P_{0},\mu^{C}_{0})\|_{2,P_{0}}\ ,\quad\|D_{2}(\hat{P}_{n},\mu^{C}_{n})-D_{2}(P_{0},\mu^{C}_{0})\|_{2,P_{0}}\ ,\quad\|D_{n,\mathcal{R}}-D_{\mathcal{R}}(P_{0})\|_{2,P_{0}}\ ,
[D(P^n,ρn,τ0,μnC)D(P^n,ρnRD,0,μ0C)][D(P0,ρ0,τ0,μ0C)D(P0,ρ0RD,0,μ0C)]2,P0.\displaystyle\|[D(\hat{P}_{n},{\rho}_{n},\tau_{0},\mu^{C}_{n})-D(\hat{P}_{n},{\rho}^{\mathrm{RD}}_{n},0,\mu^{C}_{0})]-[D(P_{0},{\rho}_{0},\tau_{0},\mu^{C}_{0})-D(P_{0},{\rho}^{\mathrm{RD}}_{0},0,\mu^{C}_{0})]\|_{2,P_{0}}\ .
Condition B8 (Consistency of strong positivity).

With probability tending to one over the sample used to obtain μnT\mu^{T}_{n}, it holds that I{ϵT<μnT(w)<1ϵT}𝑑P0(w)=1\int I\{\epsilon_{T}<\mu^{T}_{n}(w)<1-\epsilon_{T}\}dP_{0}(w)=1.

Condition B9 (Consistency of strictly more costly treatment).

With probability tending to one over the sample used to obtain ΔnC\Delta^{C}_{n} and δnC\delta^{C}_{n}, it holds that I(ΔnC(w)>δC)𝑑P0(w)=1\int I(\Delta^{C}_{n}(w)>\delta_{C})dP_{0}(w)=1 and I(δnC(v)>δC)𝑑P0(v)=1\int I(\delta^{C}_{n}(v)>\delta_{C})dP_{0}(v)=1.

Condition B10 (Fast rate of estimated optimal ITR).

As sample size nn tends to infinity, it holds that

{ρn(v)ρ0(v)}{δ0Y(v)τ0δ0C(v)}𝑑P0(v)=op(n1/2).\int\left\{{\rho}_{n}(v)-{\rho}_{0}(v)\right\}\left\{\delta^{Y}_{0}(v)-\tau_{0}\delta^{C}_{0}(v)\right\}dP_{0}(v)={\mathrm{o}}_{p}(n^{-1/2})\ .

Condition B10 may, at first sight, appear to be difficult to verify and is discussed in detail in Section S2. As shown in Theorem S1 of Section S2, Condition B10 may require faster rates on nuisance estimators than Condition B6. For example, convergence in the L2L^{2}-sense at a rate op(n1/4){\mathrm{o}}_{p}(n^{-1/4}) is sufficient for Condition B6, but a rate op(n3/8){\mathrm{o}}_{p}(n^{-3/8}) is needed in order to use Theorem S1 to show that Condition B10 holds.

Condition B11 (Donsker condition).

{odn,k(v)D2(P^n,μnC)(o):k[0,1]}\{o\mapsto d_{n,k}(v)D_{2}(\hat{P}_{n},\mu^{C}_{n})(o):k\in[0,1]\} is a subset of a fixed P0P_{0}-Donsker class with probability tending to 1. Additionally, each of D1(P^n,μnC)D_{1}(\hat{P}_{n},\mu^{C}_{n}), D(P^n,ρn,τ0,μnC)D(P^n,ρnRD,0,μ0C)D(\hat{P}_{n},{\rho}_{n},\tau_{0},\mu^{C}_{n})-D(\hat{P}_{n},{\rho}^{\mathrm{RD}}_{n},0,\mu^{C}_{0}) and Dn,D_{n,\mathcal{R}} belongs to a (possibly different) fixed P0P_{0}-Donsker class with probability tending to 1.

Condition B12 (Glivenko-Cantelli condition).

ξnξ01,P0=op(1)\|\xi_{n}-\xi_{0}\|_{1,P_{0}}={\mathrm{o}}_{p}(1) and ΔnCΔ0C1,P0=op(1)\|\Delta^{C}_{n}-\Delta^{C}_{0}\|_{1,P_{0}}={\mathrm{o}}_{p}(1). Moreover, (i) if η0>\eta_{0}>-\infty, then, for any η\eta sufficiently close to η0\eta_{0}, wI(ξn(v)>η)ΔnC(w)w\mapsto I(\xi_{n}(v)>\eta)\Delta^{C}_{n}(w) belongs to a P0P_{0}-Glivenko-Cantelli class with probability tending to 1; (ii) otherwise, if η0=\eta_{0}=-\infty, then, for any η<0\eta<0 with sufficiently large |η||\eta|, wI(ξn(v)>η)ΔnC(w)w\mapsto I(\xi_{n}(v)>\eta)\Delta^{C}_{n}(w) belongs to a P0P_{0}-Glivenko-Cantelli class with probability tending to 1.

The Donsker condition B11 and the Glivenko-Cantelli condition B12 impose restrictions on the flexibility of the methods used to estimate nuisance functions. We refer readers to, for example, van der Vaart and Wellner [50], for a more thorough introduction to such conditions.

All above conditions are similar to those in Qiu et al. [34] except that Conditions B9 and B4 are additional in this paper because the assumption of more costly treatment was not needed and a boundedness condition similar to B4 was automatically satisfied with a binary cost.

S2 Sufficient condition for fast convergence rate of estimated optimal rule

Condition B10, which is required by Theorem 4, may seem unintuitive and difficult to verify. In Theorem S1 below, we present sufficient conditions for Condition B10 that are similar to those in Qiu et al. [34].

Throughout the rest of the Supplement, for two quantities a,ba,b\in, we use aba\lesssim b to denote a𝒞ba\leq{\mathscr{C}}b for some constant 𝒞>0{\mathscr{C}}>0 that may depend on P0P_{0}.

Theorem S1 (Sufficient condition for Condition B10).

Assume that I(ξn(v)=τn)𝑑P0(v)=Op(n1/2)\int I(\xi_{n}(v)=\tau_{n})dP_{0}(v)={\mathrm{O}}_{p}(n^{-1/2}). Further assume that each of oI(ξn(v)>ηn)o\mapsto I(\xi_{n}(v)>\eta_{n}) and oI(ξn(v)>ηn)δ0C(v)o\mapsto I(\xi_{n}(v)>\eta_{n})\delta^{C}_{0}(v) belongs to a (possibly different) fixed P0P_{0}-Donsker class with probability tending to 1. Suppose also that the distribution of ξ0(V)\xi_{0}(V) (VP0V\sim P_{0}) has nonzero finite continuous Lebesgue density in a neighborhood of η0\eta_{0} and a neighborhood of τ0\tau_{0}. Under Condition B4, the following statements hold.

  • If δnYδ0Yq,P0=op(1)\|\delta^{Y}_{n}-\delta^{Y}_{0}\|_{q,P_{0}}={\mathrm{o}}_{p}(1) for some q1q\geq 1, then

    |P0{(ρnρ0)(δ0Yτ0δ0C)}|δnYδ0Yq,P02q/(q+1)+Op(n1).\displaystyle|P_{0}\{({\rho}_{n}-{\rho}_{0})(\delta^{Y}_{0}-\tau_{0}\delta^{C}_{0})\}|\lesssim\|\delta^{Y}_{n}-\delta^{Y}_{0}\|_{q,P_{0}}^{2q/(q+1)}+{\mathrm{O}}_{p}(n^{-1}).
  • If δnYδ0Y,P0=op(1)\|\delta^{Y}_{n}-\delta^{Y}_{0}\|_{\infty,P_{0}}={\mathrm{o}}_{p}(1), then

    |P0{(ρnρ0)(δ0Yτ0δ0C)}|δnYδ0Y,P02+Op(n1).\displaystyle|P_{0}\{({\rho}_{n}-{\rho}_{0})(\delta^{Y}_{0}-\tau_{0}\delta^{C}_{0})\}|\lesssim\|\delta^{Y}_{n}-\delta^{Y}_{0}\|_{\infty,P_{0}}^{2}+{\mathrm{O}}_{p}(n^{-1}).

The proof of Theorem S1 is very similar to Theorem 5 in Qiu et al. [34] and can be found in Section S4.5.

S3 Modified procedure with cross-fitting

In this section, we describe our proposed procedure to estimate the ATE with cross-fitting, which is mentioned in Remark 6. We use Λ\Lambda to denote a user-specified fixed number of folds to split the data. Common choices of Λ\Lambda used in practice include 5, 10 and 20.

  1. 1.

    Use the empirical distribution P^W,n\hat{P}_{W,n} of WW as an estimate of the true marginal distribution of WW. Compute estimates μnY\mu^{Y}_{n}, μnC\mu^{C}_{n} and μnT\mu^{T}_{n} of μ0Y\mu^{Y}_{0}, μnC\mu^{C}_{n} and μ0T\mu^{T}_{0}, respectively using flexible regression methods.

  2. 2.

    Estimate an optimal individualized treatment rule for each observation:

    1. (a)

      Create folds: split the set of observation indices {1,2,,n}\{1,2,\ldots,n\} into Λ\Lambda mutually exclusive and exhaustive folds of (approximately) equal size. Denote these sets by SλS_{\lambda}, λ=1,2,,Λ\lambda=1,2,\ldots,\Lambda. Define Sλ:=λλSλS_{-\lambda}:=\cup_{\lambda^{\prime}\neq\lambda}S_{\lambda^{\prime}}. For each i=1,2,,ni=1,2,\ldots,n, let λ(i)\lambda(i) be the index of the fold containing ii; in other words, λ(i)\lambda(i) is the unique value of λ\lambda such that iSλi\in S_{\lambda}.

    2. (b)

      Estimate ξ0(Vi)\xi_{0}(V_{i}) using sample splitting: for each λ=1,2,,Λ\lambda=1,2,\ldots,\Lambda, compute estimates δn,SλY\delta^{Y}_{n,S_{-\lambda}} and δn,SλC\delta^{C}_{n,S_{-\lambda}} of δ0Y\delta^{Y}_{0} and Δ0,bC\Delta^{C}_{0,b} using flexible regression methods based on data {Oi:iSλ}\{O_{i}:i\in S_{-\lambda}\}. For each i=1,2,,ni=1,2,\ldots,n, let ξn,i:=δn,Sλ(i)Y(Vi)/δn,Sλ(i)C(Vi)\xi_{n,i}:=\delta^{Y}_{n,S_{-\lambda(i)}}(V_{i})/\delta^{C}_{n,S_{-\lambda(i)}}(V_{i}) be the sample splitting estimate of ξ0(Vi)\xi_{0}(V_{i}).

    3. (c)

      Estimate ϕ0\phi_{0} with a one-step correction estimator

      ϕn:=1ni=1n{μnC(0,Wi)+1Ti1μnT(Wi)[CiμnC(0,Wi)]}.\phi_{n}:=\frac{1}{n}\sum_{i=1}^{n}\{\mu^{C}_{n}(0,W_{i})+\frac{1-T_{i}}{1-\mu^{T}_{n}(W_{i})}[C_{i}-\mu^{C}_{n}(0,W_{i})]\}.
    4. (d)

      Let Γn:τ1ni:ξn,i>τΔnC(1,Wi)\Gamma_{n}:\tau\mapsto\frac{1}{n}\sum_{i:\xi_{n,i}>\tau}\Delta^{C}_{n}(1,W_{i}) and γn:τ1ni:ξn,i=τΔnC(Wi)\gamma_{n}:\tau\mapsto\frac{1}{n}\sum_{i:\xi_{n,i}=\tau}\Delta^{C}_{n}(W_{i}). For any k[0,)k\in[0,\infty), define ηn(k):=inf{τ:Γn(τ)kαϕn}\eta_{n}(k):=\inf\{\tau:\Gamma_{n}(\tau)\leq k-\alpha\phi_{n}\}, τn(k):=max{ηn(k),0}\tau_{n}(k):=\max\{\eta_{n}(k),0\}, and, for i=1,2,,ni=1,2,\ldots,n,

      dn,k,i:={kαϕnΓn(ηn(k))γn(ηn(k)), if ξn,i=ηn(k) and γn(ηn(k))>0,I{ξn,i>ηn(k)}, otherwise.d_{n,k,i}:=\begin{cases}\frac{k-\alpha\phi_{n}-\Gamma_{n}(\eta_{n}(k))}{\gamma_{n}(\eta_{n}(k))},&\text{ if }\xi_{n,i}=\eta_{n}(k)\text{ and }\gamma_{n}(\eta_{n}(k))>0,\\ I\{\xi_{n,i}>\eta_{n}(k)\},&\text{ otherwise.}\end{cases}
    5. (e)

      Compute knk_{n}, which is used to define an estimate of ρ0{\rho}_{0} for which the plug-in estimator is asymptotically linear.

      • If τn(κ)>0\tau_{n}(\kappa)>0 and there is a solution in k[0,)k\in[0,\infty) to

        1ni=1ndn,k,i[ΔnC(Wi)+1Ti+μnT(Wi)1[CiμnC(Ti,Wi)]]+αϕn=κ,\frac{1}{n}\sum_{i=1}^{n}d_{n,k,i}\left[\Delta^{C}_{n}(W_{i})+\frac{1}{T_{i}+\mu^{T}_{n}(W_{i})-1}[C_{i}-\mu^{C}_{n}(T_{i},W_{i})]\right]+\alpha\phi_{n}=\kappa, (S1)

        then take knk_{n} to be this solution.

      • otherwise, set kn=κk_{n}=\kappa.

    6. (f)

      For each i=1,2,,ni=1,2,\ldots,n, estimate ρ0(Vi){\rho}_{0}(V_{i}) with

      ρn,i:={knαϕnΓn(τn(kn))γn(τn(kn)), if ξn,i=τn(kn), and γn(τn(kn))>0,I{ξn,i>τn(kn)}, otherwise.{\rho}_{n,i}:=\begin{cases}\frac{k_{n}-\alpha\phi_{n}-\Gamma_{n}(\tau_{n}(k_{n}))}{\gamma_{n}(\tau_{n}(k_{n}))},&\text{ if }\xi_{n,i}=\tau_{n}(k_{n}),\;\text{ and }\gamma_{n}(\tau_{n}(k_{n}))>0,\\ I\{\xi_{n,i}>\tau_{n}(k_{n})\},&\text{ otherwise.}\end{cases}
  3. 3.

    Obtain an estimate ρn{\rho}^{\mathcal{R}}_{n} of the reference ITR ρ0{\rho}^{\mathcal{R}}_{0} as follows:

    • For =FR\mathcal{R}={\mathrm{FR}}, take ρn{\rho}^{\mathcal{R}}_{n} to be ρFR{\rho}^{\mathrm{FR}}.

    • For =RD\mathcal{R}={\mathrm{RD}},

      1. (a)

        obtain a targeted estimate μ^nC\hat{\mu}^{C}_{n} of μ0C\mu^{C}_{0}: run an ordinary least-squared regression using observations i=1,2,,ni=1,2,\ldots,n with outcome CiC_{i}, offset μnC(Ti,Wi)\mu^{C}_{n}(T_{i},W_{i}), no intercept and covariate 1/(Ti+μnT(Wi)1)1/(T_{i}+\mu^{T}_{n}(W_{i})-1). Take μ^nC\hat{\mu}^{C}_{n} to be the fitted mean model;

      2. (b)

        take ρn{\rho}^{\mathcal{R}}_{n} to be the constant function omin{1,(καϕn)/P^W,nΔ^nC}o\mapsto\min\{1,(\kappa-\alpha\phi_{n})/\hat{P}_{W,n}\hat{\Delta}^{C}_{n}\}, where we define pointwise Δ^nC:wμ^nC(1,w)μ^nC(0,w)\hat{\Delta}^{C}_{n}:w\mapsto\hat{\mu}^{C}_{n}(1,w)-\hat{\mu}^{C}_{n}(0,w).

    • For =TP\mathcal{R}={\mathrm{TP}}, take ρn{\rho}^{\mathcal{R}}_{n} to be μnT\mu^{T}_{n}.

  4. 4.

    Estimate ATE of ρ0{\rho}_{0} relative to the reference ITR ρ0{\rho}^{\mathcal{R}}_{0} with a targeted minimum-loss based estimator (TMLE) ψn\psi_{n}:

    1. (a)

      obtain a targeted estimate μ^nY\hat{\mu}^{Y}_{n} of μ0Y\mu^{Y}_{0}: run an ordinary least-squares linear regression using observations i=1,2,,ni=1,2,\ldots,n with outcome YiY_{i}, offset μnY(Ti,Wi)\mu^{Y}_{n}(T_{i},W_{i}), no intercept and covariate [ρn,iρn(Oi)]/[Ti+μnT(Wi)1][{\rho}_{n,i}-{\rho}^{\mathcal{R}}_{n}(O_{i})]/[T_{i}+\mu^{T}_{n}(W_{i})-1]. Take μ^nY\hat{\mu}^{Y}_{n} to be the fitted mean function.

    2. (b)

      with P^n\hat{P}_{n} being any distribution with components μ^nY\hat{\mu}^{Y}_{n} and P^W,n\hat{P}_{W,n}, set ψn:=1ni=1nρn,iΔ^nY(Wi)Ψρn(P^n)\psi_{n}:=\frac{1}{n}\sum_{i=1}^{n}{\rho}_{n,i}\hat{\Delta}^{Y}_{n}(W_{i})-\Psi_{{\rho}^{\mathcal{R}}_{n}}(\hat{P}_{n}) where Δ^nY:wμ^nY(1,w)μ^nY(0,w)\hat{\Delta}^{Y}_{n}:w\mapsto\hat{\mu}^{Y}_{n}(1,w)-\hat{\mu}^{Y}_{n}(0,w).

S4 Proof of theorems

S4.1 Identification results (Theorem 1 and 2)

Theorem 1 is a simple corollary of the standard G-formula [36]. We provide a complete proof below.

Proof of Theorem 1.

Note that

𝔼[Y(1)W]=𝔼[Y(1)T=1,W]=E0[YT=1,W]=μ0Y(1,W).\displaystyle\operatorname{\mathbb{E}}[Y(1)\mid W]=\operatorname{\mathbb{E}}[Y(1)\mid T=1,W]=\operatorname{E}_{0}[Y\mid T=1,W]=\mu^{Y}_{0}(1,W).

Similarly, 𝔼[Y(0)W]=E0[YT=0,W]=μ0Y(0,W)\operatorname{\mathbb{E}}[Y(0)\mid W]=\operatorname{E}_{0}[Y\mid T=0,W]=\mu^{Y}_{0}(0,W). Hence, 𝔼[Y(1)Y(0)W]=Δ0Y(W)\operatorname{\mathbb{E}}[Y(1)-Y(0)\mid W]=\Delta^{Y}_{0}(W). By the law of total expectation, this yields that 𝔼[Y(1)Y(0)V]=E0[Δ0Y(W)V]=δ0Y(V)\operatorname{\mathbb{E}}[Y(1)-Y(0)\mid V]=\operatorname{E}_{0}[\Delta^{Y}_{0}(W)\mid V]=\delta^{Y}_{0}(V). It then follows that

𝔼[Y(ρ)Y(ρ0)]\displaystyle\operatorname{\mathbb{E}}[Y({\rho})-Y({\rho}^{\mathcal{R}}_{0})] =𝔼[{ρ(V)ρ0(W)}{Y(1)Y(0)}]\displaystyle=\operatorname{\mathbb{E}}[\{{\rho}(V)-{\rho}^{\mathcal{R}}_{0}(W)\}\{Y(1)-Y(0)\}]
=E0[{ρ(V)ρ0(W)}𝔼[Y(1)Y(0)W]]\displaystyle=\operatorname{E}_{0}[\{{\rho}(V)-{\rho}^{\mathcal{R}}_{0}(W)\}\operatorname{\mathbb{E}}[Y(1)-Y(0)\mid W]]
=E0[{ρ(V)ρ0(W)}Δ0Y(W)].\displaystyle=\operatorname{E}_{0}[\{{\rho}(V)-{\rho}^{\mathcal{R}}_{0}(W)\}\Delta^{Y}_{0}(W)].

The results for the treatment cost can be proved similarly. ∎

We next prove Theorem 2.

Proof of Theorem 2.

Let ρ{\rho} be any ITR that satisfies the constraint that E0[ρ(V)δ0C(V)]+αϕ0κ\operatorname{E}_{0}[{\rho}(V)\delta^{C}_{0}(V)]+\alpha\phi_{0}\leq\kappa. We will show that E0[ρ0(V)δ0Y(V)]E0[ρ(V)δ0Y(V)]\operatorname{E}_{0}[{\rho}_{0}(V)\delta^{Y}_{0}(V)]\geq\operatorname{E}_{0}[{\rho}(V)\delta^{Y}_{0}(V)], implying that ρ0{\rho}_{0} is a solution to (2).

Observe that

E0[ρ0(V)δ0Y(V)]E0[ρ(V)δ0Y(V)]\displaystyle\operatorname{E}_{0}[{\rho}_{0}(V)\delta^{Y}_{0}(V)]-\operatorname{E}_{0}[{\rho}(V)\delta^{Y}_{0}(V)]
=E0[{ρ0(V)ρ(V)}δ0Y(V)]\displaystyle=\operatorname{E}_{0}[\{{\rho}_{0}(V)-{\rho}(V)\}\delta^{Y}_{0}(V)]
=E0[{ρ0(V)ρ(V)}δ0Y(V)I(ξ0(V)>τ0)]+E0[{ρ0(V)ρ(V)}δ0Y(V)I(ξ0(V)<τ0)]\displaystyle=\operatorname{E}_{0}[\{{\rho}_{0}(V)-{\rho}(V)\}\delta^{Y}_{0}(V)I(\xi_{0}(V)>\tau_{0})]+\operatorname{E}_{0}[\{{\rho}_{0}(V)-{\rho}(V)\}\delta^{Y}_{0}(V)I(\xi_{0}(V)<\tau_{0})]
+E0[{ρ0(V)ρ(V)}δ0Y(V)I(ξ0(V)=τ0)]\displaystyle\hskip 36.135pt+\operatorname{E}_{0}[\{{\rho}_{0}(V)-{\rho}(V)\}\delta^{Y}_{0}(V)I(\xi_{0}(V)=\tau_{0})]
=E0[{ρ0(V)ρ(V)}ξ0(V)δ0C(V)I(ξ0(V)>τ0)]+E0[{ρ0(V)ρ(V)}ξ0(V)δ0C(V)I(ξ0(V)<τ0)]\displaystyle=\operatorname{E}_{0}[\{{\rho}_{0}(V)-{\rho}(V)\}\xi_{0}(V)\delta^{C}_{0}(V)I(\xi_{0}(V)>\tau_{0})]+\operatorname{E}_{0}[\{{\rho}_{0}(V)-{\rho}(V)\}\xi_{0}(V)\delta^{C}_{0}(V)I(\xi_{0}(V)<\tau_{0})]
+E0[{ρ0(V)ρ(V)}ξ0(V)δ0C(V)I(ξ0(V)=τ0)].\displaystyle\hskip 36.135pt+\operatorname{E}_{0}[\{{\rho}_{0}(V)-{\rho}(V)\}\xi_{0}(V)\delta^{C}_{0}(V)I(\xi_{0}(V)=\tau_{0})].

Note that ρ0(v)=1ρ(v){\rho}_{0}(v)=1\geq{\rho}(v) if ξ0(v)>τ0\xi_{0}(v)>\tau_{0} and ρ0(v)=0ρ(v){\rho}_{0}(v)=0\leq{\rho}(v) if ξ0(v)<τ0\xi_{0}(v)<\tau_{0}. Combining this observation with the fact that τ00\tau_{0}\geq 0, the above shows that

E0[ρ0(V)δ0Y(V)]E0[ρ(V)δ0Y(V)]\displaystyle\operatorname{E}_{0}[{\rho}_{0}(V)\delta^{Y}_{0}(V)]-\operatorname{E}_{0}[{\rho}(V)\delta^{Y}_{0}(V)]
τ0E0[{ρ0(V)ρ(V)}δ0C(V)I(ξ0(V)>τ0)]+τ0E0[{ρ0(V)ρ(V)}δ0C(V)I(ξ0(V)<τ0)]\displaystyle\geq\tau_{0}\operatorname{E}_{0}[\{{\rho}_{0}(V)-{\rho}(V)\}\delta^{C}_{0}(V)I(\xi_{0}(V)>\tau_{0})]+\tau_{0}\operatorname{E}_{0}[\{{\rho}_{0}(V)-{\rho}(V)\}\delta^{C}_{0}(V)I(\xi_{0}(V)<\tau_{0})]
+τ0E0[{ρ0(V)ρ(V)}δ0C(V)I(ξ0(V)=τ0)]\displaystyle\hskip 36.135pt+\tau_{0}\operatorname{E}_{0}[\{{\rho}_{0}(V)-{\rho}(V)\}\delta^{C}_{0}(V)I(\xi_{0}(V)=\tau_{0})]
=τ0E0[{ρ0(V)ρ(V)}δ0C(V)].\displaystyle=\tau_{0}\operatorname{E}_{0}[\{{\rho}_{0}(V)-{\rho}(V)\}\delta^{C}_{0}(V)].

If τ0=0\tau_{0}=0, then E0[ρ0(V)δ0Y(V)]E0[ρ(V)δ0Y(V)]0\operatorname{E}_{0}[{\rho}_{0}(V)\delta^{Y}_{0}(V)]-\operatorname{E}_{0}[{\rho}(V)\delta^{Y}_{0}(V)]\geq 0, as desired; otherwise, τ0>0\tau_{0}>0 and E0[ρ(V)δ0C(V)]καϕ0=E0[ρ0(V)δ0C(V)]\operatorname{E}_{0}[{\rho}(V)\delta^{C}_{0}(V)]\leq\kappa-\alpha\phi_{0}=\operatorname{E}_{0}[{\rho}_{0}(V)\delta^{C}_{0}(V)], and so it follows that E0[ρ0(V)δ0Y(V)]E0[ρ(V)δ0Y(V)]\operatorname{E}_{0}[{\rho}_{0}(V)\delta^{Y}_{0}(V)]\geq\operatorname{E}_{0}[{\rho}(V)\delta^{Y}_{0}(V)]. Therefore, we conclude that ρ0{\rho}_{0} is a solution to (2). ∎

S4.2 Pathwise differentiability of ATE parameter (Theorem 3)

We follow existing literature on semiparametric efficiency theory closely to prove pathwise differentiability of our estimands and asymptotic efficiency of our estimators under nonparametric models. We refer readers to, for example, Pfanzagl [30, 31], Bolthausen et al. [4], for a more thorough introduction to semiparametric efficiency.

To derive the canonical gradient of the ATE parameters, let L02(P0)\mathcal{H}\subseteq L^{2}_{0}(P_{0}) be the set of score functions with range contained in [1,1][-1,1] and we study the behavior of the parameters under perturbations in an arbitrary direction HH\in\mathcal{H}. We note that the L02(P0)L^{2}_{0}(P_{0})-closure of \mathcal{H} is indeed L02(P0)L^{2}_{0}(P_{0}).

We define HW:wE0[H(O)W=w]H_{W}:w\mapsto\operatorname{E}_{0}[H(O)\mid W=w], HT:(tw)E0[H(O)T=t,W=w]H_{T}:(t\mid w)\mapsto\operatorname{E}_{0}[H(O)\mid T=t,W=w] and PH,ϵP_{H,\epsilon} via its Radon-Nikodym derivative with respect to P0P_{0}:

dPH,ϵdP0:o[1+ϵH(o)ϵHT(tw)ϵHW(w)][1+ϵHT(tw)][1+ϵHW(w)]\displaystyle\frac{dP_{H,\epsilon}}{dP_{0}}:o\mapsto\left[1+\epsilon H(o)-\epsilon H_{T}(t\mid w)-\epsilon H_{W}(w)\right]\left[1+\epsilon H_{T}(t\mid w)\right]\left[1+\epsilon H_{W}(w)\right] (S2)

for any ϵ\epsilon in a sufficiently small neighborhood of 0 such that the right-hand side is positive for all o𝒲×{0,1}×{0,1}×o\in\mathcal{W}\times\{0,1\}\times\{0,1\}\times. It is straightforward to verify that the score function for ϵ\epsilon at ϵ=0\epsilon=0 is indeed HH. For the rest of this section, we may drop HH from the notation and use PϵP_{\epsilon} as a shorthand notation for PH,ϵP_{H,\epsilon} when no confusion should arise.

We will see that each parameter evaluated at PϵP_{\epsilon} depends on the following marginal or conditional distributions in a clean way: the marginal distribution PW,ϵP_{W,\epsilon} of WW, the marginal distribution PT,W,ϵP_{T,W,\epsilon} of (T,W)(T,W), the conditional distribution PT,ϵP_{T,\epsilon} of TT given WW, the conditional distribution PC,ϵP_{C,\epsilon} of CC given (T,W)(T,W), and the conditional distribution PY,ϵP_{Y,\epsilon} of YY given (T,W)(T,W). We now derive their closed-form expressions. Let HC:(ct,w)E0[H(O)C=c,T=t,W=w]HT(tw)HW(w)H_{C}:(c\mid t,w)\mapsto\operatorname{E}_{0}[H(O)\mid C=c,T=t,W=w]-H_{T}(t\mid w)-H_{W}(w), and HY:(yt,w)E0[H(O)Y=y,T=t,W=w]HT(tw)HW(w)H_{Y}:(y\mid t,w)\mapsto\operatorname{E}_{0}[H(O)\mid Y=y,T=t,W=w]-H_{T}(t\mid w)-H_{W}(w). We can then show that

dPW,ϵdPW,0:w1+ϵHW(w),dPT,W,ϵdPT,W,0:(t,w)1+ϵHT(tw)+ϵHW(w),dPT,ϵdPT,0(w):t1+ϵHT(tw),dPC,ϵdPC,0(t,w):c1+ϵHC(ct,w),dPY,ϵdPY,0(t,w):y1+ϵHY(yt,w).\displaystyle\begin{split}\frac{dP_{W,\epsilon}}{dP_{W,0}}&:w\mapsto 1+\epsilon H_{W}(w),\\ \frac{dP_{T,W,\epsilon}}{dP_{T,W,0}}&:(t,w)\mapsto 1+\epsilon H_{T}(t\mid w)+\epsilon H_{W}(w),\\ \frac{dP_{T,\epsilon}}{dP_{T,0}}(\,\cdot\,\mid w)&:t\mapsto 1+\epsilon H_{T}(t\mid w),\\ \frac{dP_{C,\epsilon}}{dP_{C,0}}(\,\cdot\,\mid t,w)&:c\mapsto 1+\epsilon H_{C}(c\mid t,w),\\ \frac{dP_{Y,\epsilon}}{dP_{Y,0}}(\,\cdot\,\mid t,w)&:y\mapsto 1+\epsilon H_{Y}(y\mid t,w).\end{split} (S3)

Moreover, E0[HW(W)]=0\operatorname{E}_{0}[H_{W}(W)]=0, E0[HT(TW)W]=0\operatorname{E}_{0}[H_{T}(T\mid W)\mid W]=0 P0P_{0}-a.s., E0[HC(CT,W)T,W]=0\operatorname{E}_{0}[H_{C}(C\mid T,W)\mid T,W]=0 P0P_{0}-a.s., and E0[HY(YT,W)T,W]=0\operatorname{E}_{0}[H_{Y}(Y\mid T,W)\mid T,W]=0 P0P_{0}-a.s.

We finally introduce some additional notations that are used for the rest of the section. We use 𝒞{\mathscr{C}} to denote a generic positive constant that may vary line by line. Let S0S_{0} be the survival function of the distribution of ξ0(V)\xi_{0}(V) when VP0V\sim P_{0}. We also use the notation \lesssim defined in Section S2. For a generic function f:f:\mathbb{R}\rightarrow\mathbb{R}, we will use the big- and little-oh notations, namely O(f(ϵ)){\mathrm{O}}(f(\epsilon)) and o(f(ϵ)){\mathrm{o}}(f(\epsilon)), respectively, to denote the behavior of f(ϵ)f(\epsilon) as ϵ0\epsilon\rightarrow 0. Finally, for a general function or quantity fPf_{P} that depends on a distribution PP, we use fϵf_{\epsilon} to denote fPϵf_{P_{\epsilon}}. For example, we may write μϵY\mu^{Y}_{\epsilon} as a shorthand for μPϵY\mu^{Y}_{P_{\epsilon}}. We will also write expectations under PϵP_{\epsilon} as Eϵ\operatorname{E}_{\epsilon}.

The derivation of the canonical gradients of PΨρFR(P)P\mapsto\Psi_{{\rho}^{\mathrm{FR}}}(P) can be found in the Supplement of Qiu et al. [34]. We now derive the canonical gradients of PΨρPTP(P)P\mapsto\Psi_{{\rho}^{\mathrm{TP}}_{P}}(P), PΨρPRD(P)P\mapsto\Psi_{{\rho}^{\mathrm{RD}}_{P}}(P) and PΨρP(P)P\mapsto\Psi_{{\rho}_{P}}(P), which are different from the parameters in Qiu et al. [34].

S4.2.1 Canonical gradient of PΨρPTP(P)P\mapsto\Psi_{{\rho}^{\mathrm{TP}}_{P}}(P) (Theorem 3)

Fix a score HH\in\mathscr{H}. Note that, for all PP\in\mathscr{M}, ΨρPTP(P)=μPT(w)ΔPY(w)PW(dw)\Psi_{{\rho}^{\mathrm{TP}}_{P}}(P)=\int\mu^{T}_{P}(w)\Delta^{Y}_{P}(w)P_{W}(dw). Combining this, (S3) and the chain rule yields that

ddϵΨρϵTP(Pϵ)|ϵ=0\displaystyle\left.\frac{d}{d\epsilon}\Psi_{{\rho}^{\mathrm{TP}}_{\epsilon}}(P_{\epsilon})\right|_{\epsilon=0}
=ddϵ[μϵT(w)ΔϵY(w)PW,ϵ(dw)]|ϵ=0\displaystyle=\int\left.\frac{d}{d\epsilon}\left[\mu^{T}_{\epsilon}(w)\Delta^{Y}_{\epsilon}(w)P_{W,\epsilon}(dw)\right]\right|_{\epsilon=0}
=(ddϵμϵT(w)|ϵ=0)Δ0Y(w)PW,0(dw)+μ0T(w)(ddϵΔϵY(w)|ϵ=0)PW,0(dw)\displaystyle=\int\left(\left.\frac{d}{d\epsilon}\mu^{T}_{\epsilon}(w)\right|_{\epsilon=0}\right)\Delta^{Y}_{0}(w)P_{W,0}(dw)+\int\mu^{T}_{0}(w)\left(\left.\frac{d}{d\epsilon}\Delta^{Y}_{\epsilon}(w)\right|_{\epsilon=0}\right)P_{W,0}(dw)
+μ0T(w)Δ0Y(w)ddϵPW,ϵ(dw)|ϵ=0\displaystyle\quad+\int\mu^{T}_{0}(w)\Delta^{Y}_{0}(w)\left.\frac{d}{d\epsilon}P_{W,\epsilon}(dw)\right|_{\epsilon=0}
=(tμ0T(w))HT(tw)Δ0Y(w)PT,0(dtw)PW,0(dw)\displaystyle=\iint(t-\mu^{T}_{0}(w))H_{T}(t\mid w)\Delta^{Y}_{0}(w)P_{T,0}(dt\mid w)P_{W,0}(dw)
+μ0T(w)(I(t=1)μPT(w)I(t=0)1μPT(w))(yμ0Y(t,w))HY(yt,w)PY,0(dyt,w)PT,0(dtw)PW,0(dw)\displaystyle\quad+\iiint\mu^{T}_{0}(w)\left(\frac{I(t=1)}{\mu^{T}_{P}(w)}-\frac{I(t=0)}{1-\mu^{T}_{P}(w)}\right)(y-\mu^{Y}_{0}(t,w))H_{Y}(y\mid t,w)P_{Y,0}(dy\mid t,w)P_{T,0}(dt\mid w)P_{W,0}(dw)
+(μ0T(w)Δ0Y(w)Ψρ0TP(P0))HW(w)PW,0(dw)\displaystyle\quad+\int(\mu^{T}_{0}(w)\Delta^{Y}_{0}(w)-\Psi_{{\rho}^{\mathrm{TP}}_{0}}(P_{0}))H_{W}(w)P_{W,0}(dw)
=GTP(P0)(o)H(o)P0(do),\displaystyle=\int G_{\mathrm{TP}}(P_{0})(o)H(o)P_{0}(do),

where we have used the fact that E0[HY(YT,W)T,W]=0\operatorname{E}_{0}[H_{Y}(Y\mid T,W)\mid T,W]=0 P0P_{0}-a.s., E0[HT(TW)W]=0\operatorname{E}_{0}[H_{T}(T\mid W)\mid W]=0 P0P_{0}-a.s., and E0[HW(W)]=0\operatorname{E}_{0}[H_{W}(W)]=0. Therefore, the canonical gradient of PΨρPTP(P)P\mapsto\Psi_{{\rho}^{\mathrm{TP}}_{P}}(P) at P0P_{0} is GTP(P0)G_{\mathrm{TP}}(P_{0}).

S4.2.2 Canonical gradient of PΨρPRD(P)P\mapsto\Psi_{{\rho}^{\mathrm{RD}}_{P}}(P)

Let HH be a score function in \mathcal{H}. We aim to show that

ddϵΨρϵRD(Pϵ)|ϵ=0=GRD(P0)(o)H(o)P0(do),\displaystyle\left.\frac{d}{d\epsilon}\Psi_{{\rho}^{\mathrm{RD}}_{\epsilon}}(P_{\epsilon})\right|_{\epsilon=0}=\int G_{\mathrm{RD}}(P_{0})(o)H(o)P_{0}(do), (S4)

which shows that PΨρPRD(P)P\mapsto\Psi_{{\rho}^{\mathrm{RD}}_{P}}(P) is pathwise differentiable with canonical gradient GRD(P0)G_{\mathrm{RD}}(P_{0}) at P0P_{0}.

By similar arguments to those in Section 3.4 of Kennedy [16], we can show that

ddϵPϵμϵC(0,)|ϵ=0\displaystyle\left.\frac{d}{d\epsilon}P_{\epsilon}\mu^{C}_{\epsilon}(0,\cdot)\right|_{\epsilon=0} ={1t1μ0T(w)[cμ0C(0,w)]+μ0C(0,w)P0μ0C(0,)}H(o)P0(do),\displaystyle=\int\left\{\frac{1-t}{1-\mu^{T}_{0}(w)}[c-\mu^{C}_{0}(0,w)]+\mu^{C}_{0}(0,w)-P_{0}\mu^{C}_{0}(0,\cdot)\right\}H(o)P_{0}(do), (S5)
ddϵPϵΔϵC|ϵ=0\displaystyle\left.\frac{d}{d\epsilon}P_{\epsilon}\Delta^{C}_{\epsilon}\right|_{\epsilon=0} ={1t+μ0T(w)1[cμ0C(t,w)]+Δ0C(w)P0Δ0C}H(o)P0(do).\displaystyle=\int\left\{\frac{1}{t+\mu^{T}_{0}(w)-1}[c-\mu^{C}_{0}(t,w)]+\Delta^{C}_{0}(w)-P_{0}\Delta^{C}_{0}\right\}H(o)P_{0}(do). (S6)

Consequently, PϵμϵC(0,)=P0μ0C(0,)+O(ϵ)P_{\epsilon}\mu^{C}_{\epsilon}(0,\cdot)=P_{0}\mu^{C}_{0}(0,\cdot)+{\mathrm{O}}(\epsilon) and PϵΔϵC=P0Δ0C+O(ϵ)P_{\epsilon}\Delta^{C}_{\epsilon}=P_{0}\Delta^{C}_{0}+{\mathrm{O}}(\epsilon). It follows that, for all ϵ\epsilon in a sufficiently small neighborhood of zero, Condition B5 implies that (καPϵμϵC(0,))/PϵΔϵC<1(\kappa-\alpha P_{\epsilon}\mu^{C}_{\epsilon}(0,\cdot))/P_{\epsilon}\Delta^{C}_{\epsilon}<1. Consequently, for each ϵ\epsilon in this neighborhood, ΨρϵRD(Pϵ)=καPϵμϵC(0,)PϵΔϵCΨv1(Pϵ)\Psi_{{\rho}^{\mathrm{RD}}_{\epsilon}}(P_{\epsilon})=\frac{\kappa-\alpha P_{\epsilon}\mu^{C}_{\epsilon}(0,\cdot)}{P_{\epsilon}\Delta^{C}_{\epsilon}}\Psi_{v\mapsto 1}(P_{\epsilon}), where we have used that PϵΔϵY=Ψv1(Pϵ)P_{\epsilon}\Delta^{Y}_{\epsilon}=\Psi_{v\mapsto 1}(P_{\epsilon}). It follows that the derivative ddϵPϵμϵC(0,)|ϵ=0\left.\frac{d}{d\epsilon}P_{\epsilon}\mu^{C}_{\epsilon}(0,\cdot)\right|_{\epsilon=0} is the same as the derivative of f:ϵκPϵμϵC(0,)PϵΔϵCΨv1(Pϵ)f:\epsilon\mapsto\frac{\kappa-P_{\epsilon}\mu^{C}_{\epsilon}(0,\cdot)}{P_{\epsilon}\Delta^{C}_{\epsilon}}\Psi_{v\mapsto 1}(P_{\epsilon}) at ϵ=0\epsilon=0, provided this derivative exists. Noting that v1v\mapsto 1 is a particular instance of a fixed treatment rule, we may take ρFR{\rho}^{\mathrm{FR}} to be v1v\mapsto 1 in the results on pathwise differentiability of PΨρFR(P)P\mapsto\Psi_{{\rho}^{\mathrm{FR}}}(P) and show that

ddϵΨv1(Pϵ)|ϵ=0=D(P0,v1,0,μ0C)(o)H(o)P0(do).\displaystyle\left.\frac{d}{d\epsilon}\Psi_{v\mapsto 1}(P_{\epsilon})\right|_{\epsilon=0}=\int D(P_{0},v\mapsto 1,0,\mu^{C}_{0})(o)H(o)P_{0}(do). (S7)

As both the above derivative and the derivatives in (S5) and (S6) exist, by the chain rule, it follows that

ddϵf(ϵ)|ϵ=0\displaystyle\left.\frac{d}{d\epsilon}f(\epsilon)\right|_{\epsilon=0} =καP0μ0C(0,)P0Δ0CddϵΨv1(Pϵ)|ϵ=0(καP0μ0C(0,))Ψv1(P0)(P0Δ0C)2ddϵPϵΔϵC|ϵ=0\displaystyle=\frac{\kappa-\alpha P_{0}\mu^{C}_{0}(0,\cdot)}{P_{0}\Delta^{C}_{0}}\left.\frac{d}{d\epsilon}\Psi_{v\mapsto 1}(P_{\epsilon})\right|_{\epsilon=0}-\frac{(\kappa-\alpha P_{0}\mu^{C}_{0}(0,\cdot))\Psi_{v\mapsto 1}(P_{0})}{(P_{0}\Delta^{C}_{0})^{2}}\left.\frac{d}{d\epsilon}P_{\epsilon}\Delta^{C}_{\epsilon}\right|_{\epsilon=0}
αΨv1(P0)P0Δ0CddϵPϵμϵC(0,)|ϵ=0.\displaystyle\hskip 36.135pt-\alpha\frac{\Psi_{v\mapsto 1}(P_{0})}{P_{0}\Delta^{C}_{0}}\left.\frac{d}{d\epsilon}P_{\epsilon}\mu^{C}_{\epsilon}(0,\cdot)\right|_{\epsilon=0}.

Note that ϕP=PμPC(0,)\phi_{P}=P\mu^{C}_{P}(0,\cdot). Plugging (S5), (S6) and (S7) into the above and we can show that the right-hand side of the above is equal to the right-hand side of (S4). As ddϵf(ϵ)|ϵ=0=ddϵΨρϵRD(Pϵ)|ϵ=0\left.\frac{d}{d\epsilon}f(\epsilon)\right|_{\epsilon=0}=\left.\frac{d}{d\epsilon}\Psi_{{\rho}^{\mathrm{RD}}_{\epsilon}}(P_{\epsilon})\right|_{\epsilon=0}, we have shown that (S4) holds, and the desired result follows.

S4.2.3 Canonical gradient of PΨρP(P)P\mapsto\Psi_{{\rho}_{P}}(P)

Let HH be a score function in \mathcal{H}. The argument that we use parallels that of Luedtke and van der Laan [20] and Qiu et al. [34], except that it is slightly modified to account for the fact that the resource constraint takes a different form in this paper.

We first note that all of following hold for all ϵ\epsilon sufficiently close to zero:

supw|ΔϵC(w)Δ0C(w)|\displaystyle\sup_{w}|\Delta^{C}_{\epsilon}(w)-\Delta^{C}_{0}(w)| |ϵ|,\displaystyle\lesssim|\epsilon|, (S8)
supv|δϵY(v)δ0Y(v)|\displaystyle\sup_{v}|\delta^{Y}_{\epsilon}(v)-\delta^{Y}_{0}(v)| |ϵ|,\displaystyle\lesssim|\epsilon|, (S9)
supv|δϵC(v)δ0C(v)|\displaystyle\sup_{v}|\delta^{C}_{\epsilon}(v)-\delta^{C}_{0}(v)| |ϵ|.\displaystyle\lesssim|\epsilon|. (S10)

The derivations of these inequalities are straightforward and hence omitted. Under Condition A4, the above inequalities imply that

supv|ξϵ(v)ξ0(v)|=|δϵY(v)δϵC(v)δ0Y(v)δ0C(v)||ϵ|.\displaystyle\sup_{v}|\xi_{\epsilon}(v)-\xi_{0}(v)|=\left|\frac{\delta^{Y}_{\epsilon}(v)}{\delta^{C}_{\epsilon}(v)}-\frac{\delta^{Y}_{0}(v)}{\delta^{C}_{0}(v)}\right|\lesssim|\epsilon|. (S11)

For ϵ\epsilon sufficiently close to zero, it will be useful to define

Γϵ:ηEϵ[I{ξϵ(V)>η}δϵC(V)]\displaystyle\Gamma_{\epsilon}:\eta\mapsto\operatorname{E}_{\epsilon}[I\{\xi_{\epsilon}(V)>\eta\}\delta^{C}_{\epsilon}(V)]

for η[,)\eta\in[-\infty,\infty). We also define Γϵ:ηddsΓϵ(s)|s=η\Gamma_{\epsilon}^{\prime}:\eta\mapsto\frac{d}{ds}\Gamma_{\epsilon}(s)|_{s=\eta} when the derivative exists.

We first show two lemmas. These two lemmas show that, under a perturbed distribution PϵP_{\epsilon} with magnitude ϵ\epsilon, the fluctuation in the threshold τϵϵ0\tau_{\epsilon}-\epsilon_{0} is of order ϵ\epsilon. This result is crucial in quantifying the convergence rate of two terms in the expansion of Ψρϵ(Pϵ)Ψρ0(P0)\Psi_{\rho_{\epsilon}}(P_{\epsilon})-\Psi_{\rho_{0}}(P_{0}), namely terms 1 and 3 in (S12) below. In particular, term 1 is the main challenge in the analysis as it comes from the perturbation in the threshold and is unique in estimation problems involving the evaluation of optimal ITRs. The first studies the convergence of ηϵ\eta_{\epsilon} to η0\eta_{0}. Because it may be the case that η0=\eta_{0}=-\infty, the convergence stated in this result is convergence in the extended real line.

Lemma S1.

Under the conditions of Theorem 3, ηϵη0\eta_{\epsilon}\rightarrow\eta_{0} as ϵ0\epsilon\rightarrow 0.

Proof of Lemma S1.

We separately consider the cases where η0>\eta_{0}>-\infty and η0=\eta_{0}=-\infty.

Suppose that η0>\eta_{0}>-\infty. For all sufficiently small δ>0\delta>0 and sufficiently small |ϵ||\epsilon|, by (S10), (S11) and the fact that the range of HH is contained in [1,1][-1,1], we can show that

Γϵ(η0+δ)+αϕϵ\displaystyle\Gamma_{\epsilon}(\eta_{0}+\delta)+\alpha\phi_{\epsilon} (1+𝒞|ϵ|)E0[I{ξϵ(V)>η0+δ}δ0C(V)]+αϕϵ(1+𝒞|ϵ|)Γ0(η0+δ𝒞|ϵ|)+αϕϵ.\displaystyle\leq(1+{\mathscr{C}}|\epsilon|)\operatorname{E}_{0}[I\{\xi_{\epsilon}(V)>\eta_{0}+\delta\}\delta^{C}_{0}(V)]+\alpha\phi_{\epsilon}\leq(1+{\mathscr{C}}|\epsilon|)\Gamma_{0}\left(\eta_{0}+\delta-{\mathscr{C}}|\epsilon|\right)+\alpha\phi_{\epsilon}.

Under Condition B3, as long as δ\delta is small enough, the right-hand side converges to Γ0(η0+δ)+αϕ0\Gamma_{0}(\eta_{0}+\delta)+\alpha\phi_{0} as ϵ0\epsilon\rightarrow 0. Moreover, Conditions B3 and A4 can be combined to show that the derivative of Γ0\Gamma_{0} is strictly negative for all x[η0,η0+δ]x\in[\eta_{0},\eta_{0}+\delta] for sufficiently small δ\delta, and so Γ0(η0)>Γ0(η0+δ)\Gamma_{0}(\eta_{0})>\Gamma_{0}(\eta_{0}+\delta). Because Γ0(η0)+αϕ0=κ\Gamma_{0}(\eta_{0})+\alpha\phi_{0}=\kappa by the definition of ρ0{\rho}_{0} under Condition B2, it follows that, for all ϵ\epsilon sufficiently close to zero, Γϵ(η0+δ)+αϕϵ<κ\Gamma_{\epsilon}(\eta_{0}+\delta)+\alpha\phi_{\epsilon}<\kappa. By the definition ηϵ:=inf{η:Γϵ(η)καϕϵ}\eta_{\epsilon}:=\inf\{\eta:\Gamma_{\epsilon}(\eta)\leq\kappa-\alpha\phi_{\epsilon}\}, it follows that, for all ϵ\epsilon sufficiently close to zero, η0+δηϵ\eta_{0}+\delta\geq\eta_{\epsilon}, that is, ηϵη0δ\eta_{\epsilon}-\eta_{0}\leq\delta.

By similar arguments, we can show that, for all ϵ\epsilon sufficiently close to zero, ηϵη0δ\eta_{\epsilon}-\eta_{0}\geq-\delta. Indeed,

Γϵ(η0δ)+αϕϵ(1𝒞|ϵ|)Γ0(η0δ+𝒞|ϵ|)+αϕϵ.\displaystyle\Gamma_{\epsilon}(\eta_{0}-\delta)+\alpha\phi_{\epsilon}\geq(1-{\mathscr{C}}|\epsilon|)\Gamma_{0}(\eta_{0}-\delta+{\mathscr{C}}|\epsilon|)+\alpha\phi_{\epsilon}.

The right-hand side converges to Γ0(η0δ)+αϕ0\Gamma_{0}(\eta_{0}-\delta)+\alpha\phi_{0} as ϵ0\epsilon\rightarrow 0 provided δ\delta is sufficiently small. The derivative of Γ0\Gamma_{0} is strictly negative on [η0δ,η0][\eta_{0}-\delta,\eta_{0}] provided δ\delta is small enough, and therefore, Γ0(η0δ)+αϕ0>Γ0(η0)+αϕ0=κ\Gamma_{0}(\eta_{0}-\delta)+\alpha\phi_{0}>\Gamma_{0}(\eta_{0})+\alpha\phi_{0}=\kappa. Hence, Γϵ(η0δ)+αϕϵ>κ\Gamma_{\epsilon}(\eta_{0}-\delta)+\alpha\phi_{\epsilon}>\kappa. By the definition of ηϵ\eta_{\epsilon}, it follows that ηϵη0δ\eta_{\epsilon}-\eta_{0}\geq-\delta.

Combining these two results, we see that, for all ϵ\epsilon sufficiently close to zero, |ηϵη0|δ|\eta_{\epsilon}-\eta_{0}|\leq\delta. Hence, lim supϵ0|ηϵη0|δ\limsup_{\epsilon\rightarrow 0}|\eta_{\epsilon}-\eta_{0}|\leq\delta. As δ>0\delta>0 is an arbitrary in a neighborhood of zero, it follows that lim supϵ0|ηϵη0|=0\limsup_{\epsilon\rightarrow 0}|\eta_{\epsilon}-\eta_{0}|=0. That is, ηϵη0\eta_{\epsilon}\rightarrow\eta_{0} as ϵ0\epsilon\rightarrow 0 in the case that η0>\eta_{0}>-\infty.

We now study the case where η0=\eta_{0}=-\infty. If κ=\kappa=\infty, then it is trivial that ηϵ==η0\eta_{\epsilon}=-\infty=\eta_{0} for all ϵ\epsilon, and so the desired result holds. Suppose now that κ<\kappa<\infty. Fix a small enough δ>0\delta>0 so that the bound in (S11) is valid for all ϵ[δ,δ]\epsilon\in[-\delta,\delta]. Also fix ϵ[δ,δ]\epsilon\in[-\delta,\delta] and η\eta\in. By (S11) and the bound on the range of HH,

Γϵ(η)+αϕϵ(1+𝒞|ϵ|)E0[I{ξϵ(V)>η}Δ0,bC(V)]+αϕϵ(1+𝒞|ϵ|)Γ0(η𝒞|ϵ|)+αϕϵ.\displaystyle\Gamma_{\epsilon}(\eta)+\alpha\phi_{\epsilon}\leq(1+{\mathscr{C}}|\epsilon|)\operatorname{E}_{0}[I\{\xi_{\epsilon}(V)>\eta\}\Delta^{C}_{0,b}(V)]+\alpha\phi_{\epsilon}\leq(1+{\mathscr{C}}|\epsilon|)\Gamma_{0}(\eta-{\mathscr{C}}|\epsilon|)+\alpha\phi_{\epsilon}.

Because Γ0\Gamma_{0} is a nonnegative decreasing function, the right-hand side is no greater than (1+𝒞|ϵ|)Γ0(η𝒞δ)+αϕϵ(1+{\mathscr{C}}|\epsilon|)\Gamma_{0}(\eta-{\mathscr{C}}\delta)+\alpha\phi_{\epsilon}. This upper bound tends to Γ0(η𝒞δ)+αϕ0\Gamma_{0}(\eta-{\mathscr{C}}\delta)+\alpha\phi_{0} as ϵ0\epsilon\rightarrow 0. Hence, lim supϵ0Γϵ(η)+αϕϵΓ0(η𝒞δ)+αϕ0\limsup_{\epsilon\rightarrow 0}\Gamma_{\epsilon}(\eta)+\alpha\phi_{\epsilon}\leq\Gamma_{0}(\eta-{\mathscr{C}}\delta)+\alpha\phi_{0}. By Condition B3 and the monotonicity of Γ0\Gamma_{0}, Γ0(η𝒞δ)+αϕ0<κ\Gamma_{0}(\eta-{\mathscr{C}}\delta)+\alpha\phi_{0}<\kappa, and so Γϵ(η)+αϕϵ<κ\Gamma_{\epsilon}(\eta)+\alpha\phi_{\epsilon}<\kappa for all ϵ\epsilon sufficiently close to zero. By the definition of ηϵ\eta_{\epsilon}, it follows that ηϵη\eta_{\epsilon}\leq\eta for all ϵ\epsilon sufficiently close to zero. Since η\eta\in is arbitrary, the desired result follows. ∎

The next lemma establishes a rate of convergence of τϵ\tau_{\epsilon} to τ0\tau_{0} as ϵ0\epsilon\rightarrow 0.

Lemma S2.

Under conditions of Theorem 3, τϵ=τ0+O(ϵ)\tau_{\epsilon}=\tau_{0}+{\mathrm{O}}(\epsilon).

Proof of Lemma S2.

We separately consider the cases where η0<0\eta_{0}<0 and η00\eta_{0}\geq 0.

We start with the easier case where η0<0\eta_{0}<0. In this case, Lemma S1 shows that τϵ:=max{ηϵ,0}\tau_{\epsilon}:=\max\{\eta_{\epsilon},0\} is equal to τ0=0\tau_{0}=0 for all ϵ\epsilon sufficiently close to zero. Thus, τϵτ0=O(|ϵ|)\tau_{\epsilon}-\tau_{0}={\mathrm{O}}(|\epsilon|).

Now consider the more difficult case where η00\eta_{0}\geq 0. By the Lipschitz property of the function xmax{x,0}x\mapsto\max\{x,0\}, we can show that |max{ηϵ,0}max{η0,0}||ηϵη0||\max\{\eta_{\epsilon},0\}-\max\{\eta_{0},0\}|\leq|\eta_{\epsilon}-\eta_{0}|. As a consequence, to show that τϵτ0=O(ϵ)\tau_{\epsilon}-\tau_{0}={\mathrm{O}}(\epsilon), it suffices to show that ηϵη0=O(ϵ)\eta_{\epsilon}-\eta_{0}={\mathrm{O}}(\epsilon). We next establish this statement.

Fix ϵ\epsilon in a sufficiently small neighborhood of zero. By the definition ηϵ:=inf{η:Γϵ(η)κϕϵ}\eta_{\epsilon}:=\inf\{\eta:\Gamma_{\epsilon}(\eta)\leq\kappa-\phi_{\epsilon}\}, the bound on the range of HH, and (S11), it holds that κ<Γϵ(ηϵ|ϵ|)+αϕϵ[1+𝒞|ϵ|]Γ0(ηϵ[1+𝒞]|ϵ|)+αϕϵ\kappa<\Gamma_{\epsilon}(\eta_{\epsilon}-|\epsilon|)+\alpha\phi_{\epsilon}\leq[1+{\mathscr{C}}|\epsilon|]\Gamma_{0}(\eta_{\epsilon}-[1+{\mathscr{C}}]|\epsilon|)+\alpha\phi_{\epsilon}. We use a Taylor expansion of Γ0\Gamma_{0} about η0\eta_{0}, which is justified by Condition B3 provided |ϵ||\epsilon| is small enough, and it follows that

κ<[1+𝒞|ϵ|][Γ0(η0)+{ηϵη0(1𝒞)|ϵ|}{Γ0(η0)+o(1)}]+αϕ0+O(ϵ).\displaystyle\kappa<[1+{\mathscr{C}}|\epsilon|]\left[\Gamma_{0}(\eta_{0})+\{\eta_{\epsilon}-\eta_{0}-(1-{\mathscr{C}})|\epsilon|\}\{\Gamma_{0}^{\prime}(\eta_{0})+{\mathrm{o}}(1)\}\right]+\alpha\phi_{0}+{\mathrm{O}}(\epsilon).

By Condition B3, Γ0(η0)+αϕ0=κ\Gamma_{0}(\eta_{0})+\alpha\phi_{0}=\kappa. Plugging this into the above shows that

0\displaystyle 0 <𝒞Γ0(η0)|ϵ|+[1+𝒞|ϵ|][ηϵη0(1𝒞)|ϵ|][Γ0(η0)+o(1)]+O(ϵ).\displaystyle<{\mathscr{C}}\Gamma_{0}(\eta_{0})|\epsilon|+[1+{\mathscr{C}}|\epsilon|]\left[\eta_{\epsilon}-\eta_{0}-(1-{\mathscr{C}})|\epsilon|\right]\left[\Gamma_{0}^{\prime}(\eta_{0})+{\mathrm{o}}(1)\right]+{\mathrm{O}}(\epsilon).

Note that Condition B3 implies that Γ0(η0)(,0)\Gamma_{0}^{\prime}(\eta_{0})\in(-\infty,0). Therefore, the above shows that, for all ϵ\epsilon sufficiently close to zero, 0<[ηϵη0]Γ0(η0)+𝒞|ϵ|+o(ηϵη0)0<[\eta_{\epsilon}-\eta_{0}]\Gamma_{0}^{\prime}(\eta_{0})+{\mathscr{C}}|\epsilon|+{\mathrm{o}}\left(\eta_{\epsilon}-\eta_{0}\right), which implies that there exists an O(ϵ){\mathrm{O}}(\epsilon) sequence for which ηϵη0<O(ϵ)\eta_{\epsilon}-\eta_{0}<{\mathrm{O}}(\epsilon).

A similar argument, which is based on the observation that Γϵ(ηϵ+|ϵ|)+ϕϵκ\Gamma_{\epsilon}(\eta_{\epsilon}+|\epsilon|)+\phi_{\epsilon}\leq\kappa, can be used to show that there exists an O(ϵ){\mathrm{O}}(\epsilon) sequence such that ηϵη0>O(ϵ)\eta_{\epsilon}-\eta_{0}>{\mathrm{O}}(\epsilon). Combining these two bounds shows that ηϵη0=O(ϵ)\eta_{\epsilon}-\eta_{0}={\mathrm{O}}(\epsilon), as desired. This concludes the proof. ∎

Our derivation of the canonical gradient is based on the following decomposition:

Ψρϵ\displaystyle\Psi_{{\rho}_{\epsilon}} (Pϵ)Ψρ0(P0)\displaystyle(P_{\epsilon})-\Psi_{{\rho}_{0}}(P_{0})
=\displaystyle= Ψρϵ(Pϵ)Ψρ0(Pϵ)+Ψρ0(Pϵ)Ψρ0(P0)\displaystyle\Psi_{{\rho}_{\epsilon}}(P_{\epsilon})-\Psi_{{\rho}_{0}}(P_{\epsilon})+\Psi_{{\rho}_{0}}(P_{\epsilon})-\Psi_{{\rho}_{0}}(P_{0})
=\displaystyle= Pϵ{[ρϵρ0]δϵY}+Ψρ0(Pϵ)Ψρ0(P0)\displaystyle P_{\epsilon}\{[{\rho}_{\epsilon}-{\rho}_{0}]\delta^{Y}_{\epsilon}\}+\Psi_{{\rho}_{0}}(P_{\epsilon})-\Psi_{{\rho}_{0}}(P_{0})
=\displaystyle= Pϵ{[ρϵρ0](δϵYτ0δϵC)}+τ0Pϵ{(ρϵρ0)δϵC}+Ψρ0(Pϵ)Ψρ0(P0)\displaystyle P_{\epsilon}\{[{\rho}_{\epsilon}-{\rho}_{0}](\delta^{Y}_{\epsilon}-\tau_{0}\delta^{C}_{\epsilon})\}+\tau_{0}P_{\epsilon}\{({\rho}_{\epsilon}-{\rho}_{0})\delta^{C}_{\epsilon}\}+\Psi_{{\rho}_{0}}(P_{\epsilon})-\Psi_{{\rho}_{0}}(P_{0})
=\displaystyle= Pϵ{[ρϵρ0](δϵYτ0δ0C)}+[Ψρ0(Pϵ)Ψρ0(P0)]+τ0{Pϵ[δϵCρϵ]+αϕϵP0[δ0Cρ0]αϕ0}\displaystyle P_{\epsilon}\{[{\rho}_{\epsilon}-{\rho}_{0}](\delta^{Y}_{\epsilon}-\tau_{0}\delta^{C}_{0})\}+[\Psi_{{\rho}_{0}}(P_{\epsilon})-\Psi_{{\rho}_{0}}(P_{0})]+\tau_{0}\{P_{\epsilon}[\delta^{C}_{\epsilon}{\rho}_{\epsilon}]+\alpha\phi_{\epsilon}-P_{0}[\delta^{C}_{0}{\rho}_{0}]-\alpha\phi_{0}\}
τ0Pϵ{(δϵCΔ0,bC)ρ0}τ0(PϵP0){δ0Cρ0}ατ0{ϕϵϕ0}.\displaystyle-\tau_{0}P_{\epsilon}\{(\delta^{C}_{\epsilon}-\Delta^{C}_{0,b}){\rho}_{0}\}-\tau_{0}(P_{\epsilon}-P_{0})\{\delta^{C}_{0}{\rho}_{0}\}-\alpha\tau_{0}\{\phi_{\epsilon}-\phi_{0}\}. (S12)

We separately study each of the six terms on the right-hand side, which we refer to as term 1 up to term 6.

Study of term 1 in (S12): We will show that this term is o(ϵ){\mathrm{o}}(\epsilon). By Lemma S2 and (S9),

supv|δϵY(v)τϵδϵCδ0Y(v)+τ0δ0C|supv|δϵY(v)δ0Y(v)|+supv|δϵC(v)δ0C(v)|+|τϵτ0||ϵ|.\sup_{v}|\delta^{Y}_{\epsilon}(v)-\tau_{\epsilon}\delta^{C}_{\epsilon}-\delta^{Y}_{0}(v)+\tau_{0}\delta^{C}_{0}|\leq\sup_{v}|\delta^{Y}_{\epsilon}(v)-\delta^{Y}_{0}(v)|+\sup_{v}|\delta^{C}_{\epsilon}(v)-\delta^{C}_{0}(v)|+|\tau_{\epsilon}-\tau_{0}|\lesssim|\epsilon|.

Under Condition B2, P0{ξ0(V)=τ0}=0P_{0}\{\xi_{0}(V)=\tau_{0}\}=0. We apply a similar argument as that used to prove Lemma 2 in van der Laan and Luedtke [46]:

|Pϵ\displaystyle\big{|}P_{\epsilon} {[ρϵρ0](δϵYτ0δϵC)}|\displaystyle\{[{\rho}_{\epsilon}-{\rho}_{0}](\delta^{Y}_{\epsilon}-\tau_{0}\delta^{C}_{\epsilon})\}\big{|}
=|[ρϵ(v)ρ0(v)][δϵY(v)τ0δϵC(v)]PW,ϵ(dw)|\displaystyle=\left|\int[{\rho}_{\epsilon}(v)-{\rho}_{0}(v)][\delta^{Y}_{\epsilon}(v)-\tau_{0}\delta^{C}_{\epsilon}(v)]P_{W,\epsilon}(dw)\right|
|ρϵ(v)ρ0(v)||δϵY(v)τ0δϵC(v)|PW,ϵ(dw).\displaystyle\leq\int\left|{\rho}_{\epsilon}(v)-{\rho}_{0}(v)\right|\left|\delta^{Y}_{\epsilon}(v)-\tau_{0}\delta^{C}_{\epsilon}(v)\right|P_{W,\epsilon}(dw).
Because ρϵ(v)ρ0(v){\rho}_{\epsilon}(v)\neq{\rho}_{0}(v) implies that either (i) ξϵ(v)τϵ\xi_{\epsilon}(v)-\tau_{\epsilon} and ξ0(v)τ0\xi_{0}(v)-\tau_{0} have different signs or (ii) only one of these quantities is zero, the display continues as
I{|ξ0(v)τ0||ξϵ(v)τϵξ0(v)+τ0|}|δϵY(v)τ0δϵC(v)|PW,ϵ(dw)\displaystyle\leq\int I\{|\xi_{0}(v)-\tau_{0}|\leq|\xi_{\epsilon}(v)-\tau_{\epsilon}-\xi_{0}(v)+\tau_{0}|\}\left|\delta^{Y}_{\epsilon}(v)-\tau_{0}\delta^{C}_{\epsilon}(v)\right|P_{W,\epsilon}(dw)
I{|ξ0(v)τ0|𝒞|ϵ|}(|δ0Y(v)τ0δ0C(v)|+𝒞|ϵ|)PW,ϵ(dw).\displaystyle\leq\int I\{|\xi_{0}(v)-\tau_{0}|\leq{\mathscr{C}}|\epsilon|\}\left(\left|\delta^{Y}_{0}(v)-\tau_{0}\delta^{C}_{0}(v)\right|+{\mathscr{C}}|\epsilon|\right)P_{W,\epsilon}(dw).
Using the facts that infvδ0C(v)>0\inf_{v}\delta^{C}_{0}(v)>0 by Condition A4, that supvδ0C(v)1\sup_{v}\delta^{C}_{0}(v)\leq 1 since probabilities are no more than one, and that ξ0(v):=δ0Y(v)/δ0C(v)\xi_{0}(v):=\delta^{Y}_{0}(v)/\delta^{C}_{0}(v), the display continues as
I{|ξ0(v)τ0|𝒞|ϵ|}(|ξ0(v)τ0|+𝒞|ϵ|)PW,ϵ(dw)\displaystyle\leq\int I\{|\xi_{0}(v)-\tau_{0}|\leq{\mathscr{C}}|\epsilon|\}\left(\left|\xi_{0}(v)-\tau_{0}\right|+{\mathscr{C}}|\epsilon|\right)P_{W,\epsilon}(dw)
Leveraging the bound on |ξ0(v)τ0||\xi_{0}(v)-\tau_{0}| that appears in the indicator function, we see that
I{|ξ0(v)τ0|𝒞|ϵ|}(𝒞|ϵ|+𝒞|ϵ|)PW,ϵ(dw)\displaystyle\leq\int I\{|\xi_{0}(v)-\tau_{0}|\leq{\mathscr{C}}|\epsilon|\}({\mathscr{C}}|\epsilon|+{\mathscr{C}}|\epsilon|)P_{W,\epsilon}(dw)
|ϵ|I{|ξ0(v)τ0|𝒞|ϵ|}PW,0(dw)\displaystyle\lesssim|\epsilon|\int I\{|\xi_{0}(v)-\tau_{0}|\leq{\mathscr{C}}|\epsilon|\}P_{W,0}(dw)
=|ϵ|I{0<|ξ0(v)τ0|𝒞|ϵ|}PW,0(dw),\displaystyle=|\epsilon|\int I\{0<|\xi_{0}(v)-\tau_{0}|\leq{\mathscr{C}}|\epsilon|\}P_{W,0}(dw),

where the final equality holds by Condition B1. The integral in the final expression is o(1){\mathrm{o}}(1), and so this expression is o(ϵ){\mathrm{o}}(\epsilon).

Study of term 2 in (S12): By the result on the pathwise differentiability of PΨρFR(P)P\mapsto\Psi_{{\rho}^{\mathrm{FR}}}(P), setting ρFR{\rho}^{\mathrm{FR}} to be ρ0{\rho}_{0}, we see that the second term satisfies Ψρ0(Pϵ)Ψρ0(P0)=ϵG2(o)H(o)P(do)+o(ϵ)\Psi_{{\rho}_{0}}(P_{\epsilon})-\Psi_{{\rho}_{0}}(P_{0})=\epsilon\int G_{2}(o)H(o)P(do)+{\mathrm{o}}(\epsilon), where G2L02(P0)G_{2}\in L^{2}_{0}(P_{0}) is equal to D(P0,ρ0,0,μ0C)D(P_{0},{\rho}_{0},0,\mu^{C}_{0}).

Study of term 3 in (S12): We will show that the third term is identical to zero for any ϵ\epsilon that is sufficiently close to zero. If τ0=0\tau_{0}=0, then this term is trivially zero. Otherwise, τ0=η0>0\tau_{0}=\eta_{0}>0. Lemma S1 shows that, in this case, ηϵ>0\eta_{\epsilon}>0 for ϵ\epsilon sufficiently close to zero. Hence, Eϵ[δϵC(V)ρϵ(V)]+αϕϵ=κ=E0[δ0C(V)ρ0(V)]+αϕ0\operatorname{E}_{\epsilon}[\delta^{C}_{\epsilon}(V){\rho}_{\epsilon}(V)]+\alpha\phi_{\epsilon}=\kappa=\operatorname{E}_{0}[\delta^{C}_{0}(V){\rho}_{0}(V)]+\alpha\phi_{0}. Consequently, term 3 equals zero for all ϵ\epsilon sufficiently close to zero.

Study of term 4 in (S12): We will show that this term can be writes as ϵG4(o)H(o)P0(do)+o(ϵ)\epsilon\int G_{4}(o)H(o)P_{0}(do)+{\mathrm{o}}(\epsilon) for an appropriately defined G4L02(P0)G_{4}\in L^{2}_{0}(P_{0}) that does not depend on HH. Note that there exists a function HW:(wv)HW(wv)H_{W}:(w\mid v)\mapsto H_{W}(w\mid v) for which HW(wv)PW,0(dwv)=0\int H_{W}(w\mid v)P_{W,0}(dw\mid v)=0, supw,v|HW(wv)|<\sup_{w,v}|H_{W}(w\mid v)|<\infty, and, for all vv,

PW,ϵ(dwv)=(1+ϵHW(wv)+o(ϵ))PW,0(dwv).\displaystyle P_{W,\epsilon}(dw\mid v)=(1+\epsilon H_{W}(w\mid v)+{\mathrm{o}}(\epsilon))P_{W,0}(dw\mid v).

The function HWH_{W} can be chosen so that the above o(ϵ){\mathrm{o}}(\epsilon) term indicates little-oh behavior uniformly over ww and vv. By the definition of HCH_{C} from (S3), we see that

δϵC(v)δ0C(v)\displaystyle\delta^{C}_{\epsilon}(v)-\delta^{C}_{0}(v) =c{[1+ϵHC(c1,w)][1+ϵHW(wv)+o(ϵ)]1}P0(dc1,w)P0(dwv)\displaystyle=\iint c\Big{\{}[1+\epsilon H_{C}(c\mid 1,w)][1+\epsilon H_{W}(w\mid v)+{\mathrm{o}}(\epsilon)]-1\Big{\}}P_{0}(dc\mid 1,w)P_{0}(dw\mid v)
c{[1+ϵHC(c0,w)][1+ϵHW(wv)+o(ϵ)]1}P0(dc0,w)P0(dwv)\displaystyle\quad-\iint c\Big{\{}[1+\epsilon H_{C}(c\mid 0,w)][1+\epsilon H_{W}(w\mid v)+{\mathrm{o}}(\epsilon)]-1\Big{\}}P_{0}(dc\mid 0,w)P_{0}(dw\mid v)
=ϵ{c(HC(c1,w)+HW(wv)+o(1))P0(dc1,w)P0(dwv)\displaystyle=\epsilon\Big{\{}\iint c(H_{C}(c\mid 1,w)+H_{W}(w\mid v)+{\mathrm{o}}(1))P_{0}(dc\mid 1,w)P_{0}(dw\mid v)
c(HC(c0,w)+HW(wv)+o(1))P0(dc0,w)P0(dwv)}+o(ϵ),\displaystyle\quad-\iint c(H_{C}(c\mid 0,w)+H_{W}(w\mid v)+{\mathrm{o}}(1))P_{0}(dc\mid 0,w)P_{0}(dw\mid v)\Big{\}}+{\mathrm{o}}(\epsilon),

where the little-oh terms are uniform over ww and vv. Hence,

ddϵPϵ{[δϵCδ0C]ρ0}|ϵ=0\displaystyle\left.\frac{d}{d\epsilon}P_{\epsilon}\{[\delta^{C}_{\epsilon}-\delta^{C}_{0}]{\rho}_{0}\}\right|_{\epsilon=0}
=ρ0(v)c{HC(c1,w)+HW(wv)}P0(dc1,w)P0(dw)\displaystyle=\iint{\rho}_{0}(v)c\{H_{C}(c\mid 1,w)+H_{W}(w\mid v)\}P_{0}(dc\mid 1,w)P_{0}(dw)
ρ0(v)c{HC(c0,w)+HW(wv)}P0(dc0,w)P0(dw)\displaystyle\quad-\iint{\rho}_{0}(v)c\{H_{C}(c\mid 0,w)+H_{W}(w\mid v)\}P_{0}(dc\mid 0,w)P_{0}(dw)
=E0[ρ0(V)(1T+μ0T(W)1{Cμ0C(T,W)}HC(C1,W)+{Δ0C(W)δ0C(V)}HW(WV))].\displaystyle=\operatorname{E}_{0}\left[{\rho}_{0}(V)\left(\frac{1}{T+\mu^{T}_{0}(W)-1}\{C-\mu^{C}_{0}(T,W)\}H_{C}(C\mid 1,W)+\{\Delta^{C}_{0}(W)-\delta^{C}_{0}(V)\}H_{W}(W\mid V)\right)\right].
Since E0[HC(CT,W)T,W]=E0[HW(WV)V]=0\operatorname{E}_{0}[H_{C}(C\mid T,W)\mid T,W]=\operatorname{E}_{0}[H_{W}(W\mid V)\mid V]=0 P0P_{0}-a.s., the display continues as
=E0[ρ0(V)(1T+μ0T(W)1{Cμ0C(T,W)}+Δ0C(W)δ0C(V))H(O)].\displaystyle=\operatorname{E}_{0}\left[{\rho}_{0}(V)\left(\frac{1}{T+\mu^{T}_{0}(W)-1}\{C-\mu^{C}_{0}(T,W)\}+\Delta^{C}_{0}(W)-\delta^{C}_{0}(V)\right)H(O)\right].

As a consequence, term 4 satisfies

τ0Pϵ{[δϵCδ0C]ρ0}\displaystyle-\tau_{0}P_{\epsilon}\{[\delta^{C}_{\epsilon}-\delta^{C}_{0}]{\rho}_{0}\} =ϵG4(o)H(o)P0(do)+o(ϵ),\displaystyle=\epsilon\int G_{4}(o)H(o)P_{0}(do)+{\mathrm{o}}(\epsilon),

where

G4:o\displaystyle G_{4}:o\mapsto τ0ρ0(v){1t+μ0T(w)1[cμ0C(t,w)]+Δ0C(w)δ0C(v)}.\displaystyle-\tau_{0}{\rho}_{0}(v)\left\{\frac{1}{t+\mu^{T}_{0}(w)-1}[c-\mu^{C}_{0}(t,w)]+\Delta^{C}_{0}(w)-\delta^{C}_{0}(v)\right\}.

Study of term 5 in (S12): By (S3) and the fact that P0{δ0Cρ0}=καϕ0P_{0}\{\delta^{C}_{0}{\rho}_{0}\}=\kappa-\alpha\phi_{0} whenever τ0>0\tau_{0}>0, we see that τ0(PϵP0){δ0Cρ0}=ϵG5(o)HV(v)P0(do)-\tau_{0}(P_{\epsilon}-P_{0})\{\delta^{C}_{0}{\rho}_{0}\}=\epsilon\int G_{5}(o)H_{V}(v)P_{0}(do), where G5L02(P0)G_{5}\in L^{2}_{0}(P_{0}) is defined as oτ0[δ0C(v)ρ0(v)κ+αϕ0]o\mapsto-\tau_{0}[\delta^{C}_{0}(v){\rho}_{0}(v)-\kappa+\alpha\phi_{0}]. Since HVH_{V} is defined as vE0[H(O)V=v]v\mapsto\operatorname{E}_{0}[H(O)\mid V=v], we see that it also holds that τ0(PϵP0){δ0Cρ0}=ϵG5(o)H(o)P0(do)-\tau_{0}(P_{\epsilon}-P_{0})\{\delta^{C}_{0}{\rho}_{0}\}=\epsilon\int G_{5}(o)H(o)P_{0}(do).

Study of term 6 in (S12): We have shown that

ϕϵϕ0=ϵ{1t1μ0T(w)[cμ0C(0,w)]+μ0C(0,w)ϕ0}H(o)P0(do)+o(ϵ).\phi_{\epsilon}-\phi_{0}=\epsilon\int\left\{\frac{1-t}{1-\mu^{T}_{0}(w)}[c-\mu^{C}_{0}(0,w)]+\mu^{C}_{0}(0,w)-\phi_{0}\right\}H(o)P_{0}(do)+{\mathrm{o}}(\epsilon).

Therefore, τ0α(ϕϵϕ0)=ϵG6(o)H(o)P0(do)+o(ϵ)-\tau_{0}\alpha(\phi_{\epsilon}-\phi_{0})=\epsilon\int G_{6}(o)H(o)P_{0}(do)+{\mathrm{o}}(\epsilon) where

G6:oτ0α{1t1μ0T(w)[cμ0C(0,w)]+μ0C(0,w)ϕ0}.G_{6}:o\mapsto-\tau_{0}\alpha\left\{\frac{1-t}{1-\mu^{T}_{0}(w)}[c-\mu^{C}_{0}(0,w)]+\mu^{C}_{0}(0,w)-\phi_{0}\right\}.

Conclusion of the derivation of the canonical gradient of PΨρP(P)P\mapsto\Psi_{{\rho}_{P}}(P): Combining our results regarding the six terms in (S12), we see that

Ψρϵ(Pϵ)Ψρ0(P0)\displaystyle\Psi_{{\rho}_{\epsilon}}(P_{\epsilon})-\Psi_{{\rho}_{0}}(P_{0}) =ϵ[G2(o)+G4(o)+G5(o)+G6(o)]H(o)P0(do)+o(ϵ).\displaystyle=\epsilon\int\left[G_{2}(o)+G_{4}(o)+G_{5}(o)+G_{6}(o)\right]H(o)P_{0}(do)+{\mathrm{o}}(\epsilon).

Dividing both sides by ϵ0\epsilon\not=0 and taking the limit as ϵ0\epsilon\rightarrow 0, we see that G2+G4+G5+G6=G(P0)G_{2}+G_{4}+G_{5}+G_{6}=G(P_{0}) is the canonical gradient of PΨρP(P)P\mapsto\Psi_{{\rho}_{P}}(P) at P0P_{0}.

S4.3 Expansions based on gradients or pseudo-gradients

In this section, we present (approximate) first-order expansions of ATE parameters based on which we construct our proposed targeted minimum-loss based estimators (TMLE) and prove their asymptotic linearity. We refer the readers to Supplement S5 of Qiu et al. [34] for an overview of TMLE based on gradients and pseudo-gradients. The overall idea behind TMLE based on gradients is the following: the empirical mean of the gradient at the estimated distribution can be viewed as the first-order bias of the plug-in estimator; this bias can be removed by solving the estimating equation that equates the first-order bias to zero. The idea behind pseudo-gradients is similar, except that the gradient is replaced by an approximation that we term pseudo-gradient so that the corresponding estimating equation is easy to solve with a single regression step.

For any ITR ρ:𝒲[0,1]{\rho}:\mathcal{W}\rightarrow[0,1] that utilizes all covariates, we define

Rρ(P,P0):=\displaystyle R_{{\rho}}(P,P_{0}):= Ψρ(P)Ψρ(P0)+P0D(P,ρ,0,μC)\displaystyle\Psi_{{\rho}}(P)-\Psi_{{\rho}}(P_{0})+P_{0}D(P,{\rho},0,\mu^{C})
=\displaystyle= E0[ρ(W){μPT(W)μ0T(W)μPT(W)(μPY(1,W)μ0Y(1,W))\displaystyle\operatorname{E}_{0}\Bigg{[}{\rho}(W)\Bigg{\{}\frac{\mu^{T}_{P}(W)-\mu^{T}_{0}(W)}{\mu^{T}_{P}(W)}(\mu^{Y}_{P}(1,W)-\mu^{Y}_{0}(1,W))
+μPT(W)μ0T(W)1μPT(W)(μPY(0,W)μ0Y(0,W))}].\displaystyle\quad\quad\quad\quad\quad+\frac{\mu^{T}_{P}(W)-\mu^{T}_{0}(W)}{1-\mu^{T}_{P}(W)}(\mu^{Y}_{P}(0,W)-\mu^{Y}_{0}(0,W))\Bigg{\}}\Bigg{]}.

For any ITR ρ:𝒱[0,1]{\rho}:\mathcal{V}\rightarrow[0,1] that only utilizes VV, for convenience, we define Rρ(P,P0):=Rwρ(V(w))(P,P0)R_{{\rho}}(P,P_{0}):=R_{w\mapsto{\rho}(V(w))}(P,P_{0}).

For ΨρFR\Psi_{{\rho}^{\mathrm{FR}}} and ΨρPTP\Psi_{{\rho}^{\mathrm{TP}}_{P}}, it is straightforward to show that the following expansions hold:

ΨρFR(P)ΨρFR(P0)\displaystyle\Psi_{{\rho}^{\mathrm{FR}}}(P)-\Psi_{{\rho}^{\mathrm{FR}}}(P_{0}) =P0GFR(P)+RρFR(P,P0),\displaystyle=-P_{0}G_{\mathrm{FR}}(P)+R_{{\rho}^{\mathrm{FR}}}(P,P_{0}),
ΨρPTP(P)ΨρP0TP(P0)\displaystyle\Psi_{{\rho}^{\mathrm{TP}}_{P}}(P)-\Psi_{{\rho}^{\mathrm{TP}}_{P_{0}}}(P_{0}) =P0GTP(P)+P0{μPT()μ0T()1μPT()(μPY(0,)μ0Y(0,))}.\displaystyle=-P_{0}G_{\mathrm{TP}}(P)+P_{0}\left\{\frac{\mu^{T}_{P}(\cdot)-\mu^{T}_{0}(\cdot)}{1-\mu^{T}_{P}(\cdot)}(\mu^{Y}_{P}(0,\cdot)-\mu^{Y}_{0}(0,\cdot))\right\}.

For PΨρPRD(P)P\mapsto\Psi_{{\rho}^{\mathrm{RD}}_{P}}(P), we expand this parameter sequentially as follows:

ΨρPRD(P)ΨρP0RD(P0)=P0D(P,ρPRD,0,μPC)+RρPRD(P,P0)+(ρPRDρ0RD)P0Δ0Y,\displaystyle\Psi_{{\rho}^{\mathrm{RD}}_{P}}(P)-\Psi_{{\rho}^{\mathrm{RD}}_{P_{0}}}(P_{0})=P_{0}D(P,{\rho}^{\mathrm{RD}}_{P},0,\mu^{C}_{P})+R_{{\rho}^{\mathrm{RD}}_{P}}(P,P_{0})+({\rho}^{\mathrm{RD}}_{P}-{\rho}^{\mathrm{RD}}_{0})P_{0}\Delta^{Y}_{0},
ρPRDρ0RD=κϕPPΔPCκϕ0P0Δ0C,\displaystyle{\rho}^{\mathrm{RD}}_{P}-{\rho}^{\mathrm{RD}}_{0}=\frac{\kappa-\phi_{P}}{P\Delta^{C}_{P}}-\frac{\kappa-\phi_{0}}{P_{0}\Delta^{C}_{0}},
(καϕP)(καϕ0)=α{P0D1(P,μC)+P0{(μC(0,)μC(1,))μPTμ0T1μPT}},\displaystyle(\kappa-\alpha\phi_{P})-(\kappa-\alpha\phi_{0})=\alpha\left\{P_{0}D_{1}(P,\mu^{C})+P_{0}\left\{(\mu^{C}(0,\cdot)-\mu^{C}(1,\cdot))\frac{\mu^{T}_{P}-\mu^{T}_{0}}{1-\mu^{T}_{P}}\right\}\right\},
PΔPCP0Δ0C=P0D2(P,μPC)+P0{(μPC(1,)μ0C(1,))μPTμ0TμPT+(μPC(0,)μ0C(0,))μPTμ0T1μPT}.\displaystyle P\Delta^{C}_{P}-P_{0}\Delta^{C}_{0}=-P_{0}D_{2}(P,\mu^{C}_{P})+P_{0}\left\{(\mu^{C}_{P}(1,\cdot)-\mu^{C}_{0}(1,\cdot))\frac{\mu^{T}_{P}-\mu^{T}_{0}}{\mu^{T}_{P}}+(\mu^{C}_{P}(0,\cdot)-\mu^{C}_{0}(0,\cdot))\frac{\mu^{T}_{P}-\mu^{T}_{0}}{1-\mu^{T}_{P}}\right\}.

For PΨρP(P)P\mapsto\Psi_{{\rho}_{P}}(P), straightforward but tedious calculation shows that the following expansion holds:

ΨρP(P)Ψρ0(P0)\displaystyle\Psi_{{\rho}_{P}}(P)-\Psi_{{\rho}_{0}}(P_{0}) =P0D(P,ρ,τ0,μC)\displaystyle=-P_{0}D(P,{\rho},\tau_{0},\mu^{C})
+Rρ(P,P0)+P0{(ρρ0)(δ0Yτ0δ0C)}\displaystyle\quad+R_{{\rho}}(P,P_{0})+P_{0}\{({\rho}-{\rho}_{0})(\delta^{Y}_{0}-\tau_{0}\delta^{C}_{0})\}
τ0E0[ρ(V)μPT(W)μ0T(W)μPT(W){μC(1,W)μ0C(1,W)}]\displaystyle\quad-\tau_{0}\operatorname{E}_{0}\left[{\rho}(V)\frac{\mu^{T}_{P}(W)-\mu^{T}_{0}(W)}{\mu^{T}_{P}(W)}\{\mu^{C}(1,W)-\mu^{C}_{0}(1,W)\}\right]
+τ0E0[(1ρ(V))μPT(W)μ0T(W)1μPT(W){μC(0,W)μ0C(0,W)}].\displaystyle\quad+\tau_{0}\operatorname{E}_{0}\left[(1-{\rho}(V))\frac{\mu^{T}_{P}(W)-\mu^{T}_{0}(W)}{1-\mu^{T}_{P}(W)}\{\mu^{C}(0,W)-\mu^{C}_{0}(0,W)\}\right].

S4.4 Asymptotic linearity of proposed estimator (Theorem 4)

For convenience, we set P^n\hat{P}_{n} to have component μnT\mu^{T}_{n} and μ^nC\hat{\mu}^{C}_{n}, even though the plug-in estimator does not explicitly involve these functions. We start with some lemmas that facilitate the proof of the main theorem. In this section, we define ηn:=ηn(kn)\eta_{n}:=\eta_{n}(k_{n}) and τn:=τn(kn)\tau_{n}:=\tau_{n}(k_{n}) to simplify notations.

Our proof is centered around the expansions in Supplement S4.3. We first prove a few lemmas. Lemma S3 is a standard asymptotic linearity result on estimators ϕn\phi_{n} and PnΔ^nCP_{n}\hat{\Delta}^{C}_{n} about treatment resource being used for constant ITRs v0v\mapsto 0 and v1v\mapsto 1, respectively; Lemma S4 is a technical convenient tool to convert conditions on norms in Condition B6 between functions; Lemmas S5S7 are analysis results for estimators that are similar to Lemmas S1S2 for deterministic perturbations of P0P_{0}, and they lead to the crucial Lemma S8 on the negligibility of the remainder Rρ(P^n,P0)R_{\rho}(\hat{P}_{n},P_{0}) for an arbitrary ITR ρ{\rho}.

Lemma S3 (Asymptotic linearity of ϕn\phi_{n} and PnΔ^nCP_{n}\hat{\Delta}^{C}_{n}).

Under the conditions of Theorem 4,

ϕnϕ0\displaystyle\phi_{n}-\phi_{0} =(PnP0)D1(P0,μ0C)+op(n1/2)=Op(n1/2),\displaystyle=(P_{n}-P_{0})D_{1}(P_{0},\mu^{C}_{0})+{\mathrm{o}}_{p}(n^{-1/2})={\mathrm{O}}_{p}(n^{-1/2}),
PnΔ^nCP0Δ0C\displaystyle P_{n}\hat{\Delta}^{C}_{n}-P_{0}\Delta^{C}_{0} =(PnP0)D2(P0,μ0C)+op(n1/2)=Op(n1/2).\displaystyle=(P_{n}-P_{0})D_{2}(P_{0},\mu^{C}_{0})+{\mathrm{o}}_{p}(n^{-1/2})={\mathrm{O}}_{p}(n^{-1/2}).

This result follows from the facts that (i) ϕn\phi_{n} is a one-step correction estimator of ϕ0\phi_{0} [30], and (ii) PnΔ^nCP_{n}\hat{\Delta}^{C}_{n} is a TMLE for P0Δ0CP_{0}\Delta^{C}_{0} [44, 49]. Therefore the proof is omitted.

Lemma S4 (Lemma S8 in Qiu et al. [34]).

Fix functions μC:{0,1}×𝒲[0,1]\mu^{C}:\{0,1\}\times\mathcal{W}\rightarrow[0,1] and μY:{0,1}×𝒲\mu^{Y}:\{0,1\}\times\mathcal{W}\rightarrow, and suppose that P0μY(0,)2<P_{0}\mu^{Y}(0,\cdot)^{2}<\infty and P0μY(1,)2<P_{0}\mu^{Y}(1,\cdot)^{2}<\infty. If Condition A2 holds, then

μY(1,)μ0Y(1,)2,P0+μY(0,)μ0Y(0,)2,P0μYμ0Y2,P0,\displaystyle\|\mu^{Y}(1,\cdot)-\mu^{Y}_{0}(1,\cdot)\|_{2,P_{0}}+\|\mu^{Y}(0,\cdot)-\mu^{Y}_{0}(0,\cdot)\|_{2,P_{0}}\simeq\|\mu^{Y}-\mu^{Y}_{0}\|_{2,P_{0}},
μC(1,)μ0C(1,)2,P0+μC(0,)μ0C(0,)2,P0μCμ0C2,P0,\displaystyle\|\mu^{C}(1,\cdot)-\mu^{C}_{0}(1,\cdot)\|_{2,P_{0}}+\|\mu^{C}(0,\cdot)-\mu^{C}_{0}(0,\cdot)\|_{2,P_{0}}\simeq\|\mu^{C}-\mu^{C}_{0}\|_{2,P_{0}},

where aba\simeq b is defined as aba\lesssim b and bab\lesssim a.

The following Lemmas S5S7 prove consistency of the estimated thresholds used to define the estimated optimal ITR ρn{\rho}_{n}.

Lemma S5 (Lemma S5 in Qiu et al. [34]).

Let ϵ>0\epsilon>0, η\eta\in, g:𝒪g:\mathscr{O}\rightarrow be bounded and functions f0:𝒪f_{0}:\mathscr{O}\rightarrow and f:𝒪f:\mathscr{O}\rightarrow. Then

|P0([I(f>η)I(f0>η)]g)|\displaystyle|P_{0}([I(f>\eta)-I(f_{0}>\eta)]g)| P0|[I(f>η)I(f0>η)]g|\displaystyle\leq P_{0}|[I(f>\eta)-I(f_{0}>\eta)]g|
P0{|f(O)f0(O)|>ϵ}+P0{|f0(O)η|ϵ}.\displaystyle\lesssim P_{0}\{|f(O)-f_{0}(O)|>\epsilon\}+P_{0}\{|f_{0}(O)-\eta|\leq\epsilon\}.

If gg takes values in [1,1][-1,1], then \lesssim can be replaced by \leq.

Lemma S6 (Consistency of ηn(κ)\eta_{n}(\kappa)).

Under Conditions B2B3 and B12, ηn(κ)𝑝η0\eta_{n}(\kappa)\overset{p}{\rightarrow}\eta_{0}.

This lemma is a stochastic variant of the deterministic result in Lemma S1 and has a similar proof. Therefore, the arguments are slightly abbreviated here.

Proof of Lemma S6.

We separately consider the cases where η0>\eta_{0}>-\infty and η0=\eta_{0}=-\infty.

First consider the case where η0>\eta_{0}>-\infty. We start by showing that, for any η\eta sufficiently close to η0\eta_{0}, it holds that Γn(η)Γ0(η)=op(1)\Gamma_{n}(\eta)-\Gamma_{0}(\eta)={\mathrm{o}}_{p}(1). Fix an η\eta in a neighborhood of η0\eta_{0}. By the triangle inequality,

|Γn(η)Γ0(η)|\displaystyle|\Gamma_{n}(\eta)-\Gamma_{0}(\eta)| |P0[I{ξn(V())>η}I(ξ0(V())>η)]Δ0C|\displaystyle\leq|P_{0}[I\{\xi_{n}(V(\cdot))>\eta\}-I(\xi_{0}(V(\cdot))>\eta)]\Delta^{C}_{0}|
+|P0I{ξn(V())>η}[ΔnCΔ0C]|\displaystyle\quad+|P_{0}I\{\xi_{n}(V(\cdot))>\eta\}[\Delta^{C}_{n}-\Delta^{C}_{0}]|
+|(PnP0)I{ξn(V())>η}ΔnC|.\displaystyle\quad+|(P_{n}-P_{0})I\{\xi_{n}(V(\cdot))>\eta\}\Delta^{C}_{n}|. (S13)

We will show that the right-hand side is op(1){\mathrm{o}}_{p}(1). By Condition B12, the third term on the right is op(1){\mathrm{o}}_{p}(1) for η\eta sufficiently close to η0\eta_{0}. Moreover, because the second term is no greater than ΔnCΔ0C1,P0\|\Delta^{C}_{n}-\Delta^{C}_{0}\|_{1,P_{0}}, Condition B12 also implies that this second term is also op(1){\mathrm{o}}_{p}(1). We will now argue that the first term is op(1){\mathrm{o}}_{p}(1). By Lemma S5 and Condition B4, for any ϵ>0\epsilon^{\prime}>0,

|P0[I(ξn(V())>η)I(ξ0(V())>η)]Δ0C|\displaystyle|P_{0}[I(\xi_{n}(V(\cdot))>\eta)-I(\xi_{0}(V(\cdot))>\eta)]\Delta^{C}_{0}| P0|I(ξn>η)I(ξ0>η)|\displaystyle\lesssim P_{0}|I(\xi_{n}>\eta)-I(\xi_{0}>\eta)|
P0I(|ξnξ0|>ϵ)+P0I(|ξ0η|ϵ)\displaystyle\leq P_{0}I(|\xi_{n}-\xi_{0}|>\epsilon^{\prime})+P_{0}I(|\xi_{0}-\eta|\leq\epsilon^{\prime})
ξnξ01,P0ϵ+P0I(|ξ0η|ϵ),\displaystyle\leq\frac{\|\xi_{n}-\xi_{0}\|_{1,P_{0}}}{\epsilon^{\prime}}+P_{0}I(|\xi_{0}-\eta|\leq\epsilon^{\prime}),

where the final relation follows from Markov’s inequality. We next show that the last line is op(1){\mathrm{o}}_{p}(1). Fix ϵ>0\epsilon>0. For η\eta that is sufficiently close to η0\eta_{0} and ϵ\epsilon^{\prime} that is sufficiently small, by Condition B2, we see that S0S_{0} is continuous in [ηϵ,η+ϵ][\eta-\epsilon^{\prime},\eta+\epsilon^{\prime}] and hence, for all sufficiently small ϵ>0\epsilon^{\prime}>0, it holds that P0I(|ξ0η|ϵ)ϵ/2P_{0}I(|\xi_{0}-\eta|\leq\epsilon^{\prime})\leq\epsilon/2. Therefore,

P0{|P0[I(ξn>η)I(ξ0>η)]|>ϵ}\displaystyle P_{0}\left\{|P_{0}[I(\xi_{n}>\eta)-I(\xi_{0}>\eta)]|>\epsilon\right\} P0{ξnξ01,P0ϵ+P0I(|ξ0η|ϵ)>ϵ}\displaystyle\leq P_{0}\left\{\frac{\|\xi_{n}-\xi_{0}\|_{1,P_{0}}}{\epsilon^{\prime}}+P_{0}I(|\xi_{0}-\eta|\leq\epsilon^{\prime})>\epsilon\right\}
P0{ξnξ01,P0ϵ>ϵ/2}.\displaystyle\leq P_{0}\left\{\frac{\|\xi_{n}-\xi_{0}\|_{1,P_{0}}}{\epsilon^{\prime}}>\epsilon/2\right\}.

Since ξnξ01,P0=op(1)\|\xi_{n}-\xi_{0}\|_{1,P_{0}}={\mathrm{o}}_{p}(1) by Condition B12, the right-hand side of the above display converges to zero as nn\rightarrow\infty. Therefore, |P0[I(ξn(V())>η)I(ξ0(V())>η)]Δ0C|=op(1)|P_{0}[I(\xi_{n}(V(\cdot))>\eta)-I(\xi_{0}(V(\cdot))>\eta)]\Delta^{C}_{0}|={\mathrm{o}}_{p}(1). Recalling (S13), the above results imply that Γn(η)Γ0(η)=op(1)\Gamma_{n}(\eta)-\Gamma_{0}(\eta)={\mathrm{o}}_{p}(1) for any η\eta that is sufficiently close to η0\eta_{0}.

Fix ϵ>0\epsilon>0. For any ϵ\epsilon sufficiently small, the above result and Lemma S3 imply that Γn(η0ϵ)+αϕn=Γ0(η0ϵ)+αϕ0+op(1)\Gamma_{n}(\eta_{0}-\epsilon)+\alpha\phi_{n}=\Gamma_{0}(\eta_{0}-\epsilon)+\alpha\phi_{0}+{\mathrm{o}}_{p}(1) and Γn(η0+ϵ)+αϕn=Γ0(η0+ϵ)+αϕ0+op(1)\Gamma_{n}(\eta_{0}+\epsilon)+\alpha\phi_{n}=\Gamma_{0}(\eta_{0}+\epsilon)+\alpha\phi_{0}+{\mathrm{o}}_{p}(1). By Condition B3, Γ0(η0ϵ)+αϕ0>κ>Γ0(η0+ϵ)+αϕ0\Gamma_{0}(\eta_{0}-\epsilon)+\alpha\phi_{0}>\kappa>\Gamma_{0}(\eta_{0}+\epsilon)+\alpha\phi_{0} provided ϵ\epsilon is sufficiently small. It follows that, with probability tending to one, Γn(η0ϵ)+αϕn>κ>Γn(η0+ϵ)+αϕn\Gamma_{n}(\eta_{0}-\epsilon)+\alpha\phi_{n}>\kappa>\Gamma_{n}(\eta_{0}+\epsilon)+\alpha\phi_{n}, and hence η0ϵηn(κ)η0+ϵ\eta_{0}-\epsilon\leq\eta_{n}(\kappa)\leq\eta_{0}+\epsilon by the definition of ηn(κ)\eta_{n}(\kappa). Because ϵ\epsilon is arbitrary, it follows that ηn(κ)𝑝η0\eta_{n}(\kappa)\overset{p}{\rightarrow}\eta_{0}.

The case where η0=\eta_{0}=-\infty can be proved similarly. If κ=\kappa=\infty, then it trivially holds that ηn(κ)==η0\eta_{n}(\kappa)=-\infty=\eta_{0} for all nn and the desired result holds. Otherwise, for any η<0\eta<0 for which |η||\eta| is sufficiently large, a nearly identical argument to that used above shows that [Γn(η)+αϕn][Γ0(η)+αϕ0]=op(1)[\Gamma_{n}(\eta)+\alpha\phi_{n}]-[\Gamma_{0}(\eta)+\alpha\phi_{0}]={\mathrm{o}}_{p}(1). By Condition B3 and monotonicity of Γ0\Gamma_{0}, it follows that Γ0(η)+αϕ0<κ\Gamma_{0}(\eta)+\alpha\phi_{0}<\kappa, and so, with probability tending to one, Γn(η)+αϕn<κ\Gamma_{n}(\eta)+\alpha\phi_{n}<\kappa and hence ηn(κ)η\eta_{n}(\kappa)\leq\eta by the definition of ηn(κ)\eta_{n}(\kappa). Because η\eta is arbitrary, we have shown that ηn(κ)𝑝=η0\eta_{n}(\kappa)\overset{p}{\rightarrow}-\infty=\eta_{0}. ∎

Lemma S7 (Consistency of τn\tau_{n} and existence of solution to Eq. 5 when η0>\eta_{0}>-\infty).

Assume that the conditions of Theorem 4 hold. The following statements hold:

  1. i)

    if η0>\eta_{0}>-\infty, then, with probability tending to one, a solution kn[0,)k_{n}^{\prime}\in[0,\infty) to (5) exists. Note that we let kn=knk_{n}=k_{n}^{\prime} when ηn(κ)>0\eta_{n}(\kappa)>0 and knk_{n}^{\prime} exists. Hence, if η0>0\eta_{0}>0, ηn=ηn(kn)=ηn(kn)\eta_{n}=\eta_{n}(k_{n})=\eta_{n}(k_{n}^{\prime}) with probability tending to one;

  2. ii)

    if a solution knk_{n}^{\prime} to (5) exists, then with probability tending to one, P0{dn,knΔnC}+αϕ0=κ+Op(n1/2)P_{0}\{d_{n,k_{n}^{\prime}}\Delta^{C}_{n}\}+\alpha\phi_{0}=\kappa+{\mathrm{O}}_{p}(n^{-1/2});

  3. iii)

    τnτ0=op(1)\tau_{n}-\tau_{0}={\mathrm{o}}_{p}(1).

We separately prove i, ii, and iii in the case that η0>\eta_{0}>-\infty, and then we separately prove iii in the cases where η0>0\eta_{0}>0, η0=0\eta_{0}=0 and η0<0\eta_{0}<0.

Proof of i from Lemma S7..

Our strategy for showing the existence of a solution to (5) is as follows. First, we show that the left-hand side of (5) consistently estimates the treatment resource being used uniformly over rules {dn,k:k[0,]}\{d_{n,k}:k\in[0,\infty]\}. Next, we show that the left-hand side of (5) is a continuous function in kk that takes different signs at k=0k=0 and k=k=\infty with probability tending to one.

Define fn,k:odn,k(v)[ΔnC(w)+1t+μnT(w)1[cμnC(t,w)]]f_{n,k}:o\mapsto d_{n,k}(v)\left[\Delta^{C}_{n}(w)+\frac{1}{t+\mu^{T}_{n}(w)-1}[c-\mu^{C}_{n}(t,w)]\right]. We first show that

supk[0,]|Pnfn,kP0{dn,kΔ0C}|=Op(n1/2).\displaystyle\sup_{k\in[0,\infty]}|P_{n}f_{n,k}-P_{0}\{d_{n,k}\Delta^{C}_{0}\}|={\mathrm{O}}_{p}(n^{-1/2}). (S14)

We rely on the fact that, for fixed dn,kd_{n,k}, Pnfn,kP_{n}f_{n,k} is a one-step estimator of P0{dn,kΔ0C}P_{0}\{d_{n,k}\Delta^{C}_{0}\}. Note that

supk[0,]|Pnfn,kP0{dn,kΔ0C}|\displaystyle\sup_{k\in[0,\infty]}\left|P_{n}f_{n,k}-P_{0}\{d_{n,k}\Delta^{C}_{0}\}\right| supk[0,]|(PnP0)dn,kD2(P^n,μnC)|\displaystyle\leq\sup_{k\in[0,\infty]}\left|(P_{n}-P_{0})d_{n,k}D_{2}(\hat{P}_{n},\mu^{C}_{n})\right|
+supk[0,]|P0{dn,k(V())μnT()μ0T()μnT()[μnC(1,)μ0C(1,)]}|\displaystyle\quad+\sup_{k\in[0,\infty]}\left|P_{0}\left\{d_{n,k}(V(\cdot))\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{\mu^{T}_{n}(\cdot)}[\mu^{C}_{n}(1,\cdot)-\mu^{C}_{0}(1,\cdot)]\right\}\right|
+supk[0,]|P0{dn,k(V())μnT()μ0T()1μnT()[μnC(0,)μ0C(0,)]}|.\displaystyle\quad+\sup_{k\in[0,\infty]}\left|P_{0}\left\{d_{n,k}(V(\cdot))\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{1-\mu^{T}_{n}(\cdot)}[\mu^{C}_{n}(0,\cdot)-\mu^{C}_{0}(0,\cdot)]\right\}\right|.

Conditions B7 and B11 along with Lemma S4 imply that the first term on the right-hand side is Op(n1/2){\mathrm{O}}_{p}(n^{-1/2}). For the second term, we note that the fact that dn,k(V(w))[0,1]d_{n,k}(V(w))\in[0,1] for all w𝒲w\in\mathcal{W} and all k[0,]k\in[0,\infty] and the Cauchy-Schwarz inequality imply that

supk[0,]|P0{dn,k(V())μnT()μ0T()μnT()[μnC(1,)μ0C(1,)]}|\displaystyle\sup_{k\in[0,\infty]}\left|P_{0}\left\{d_{n,k}(V(\cdot))\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{\mu^{T}_{n}(\cdot)}[\mu^{C}_{n}(1,\cdot)-\mu^{C}_{0}(1,\cdot)]\right\}\right|
P0|μnT()μ0T()μnT()[μnC(1,)μ0C(1,)]|μnTμ0T2,P0μnC(1,)μ0C(1,)2,P0.\displaystyle\leq P_{0}\left|\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{\mu^{T}_{n}(\cdot)}[\mu^{C}_{n}(1,\cdot)-\mu^{C}_{0}(1,\cdot)]\right|\lesssim\|\mu^{T}_{n}-\mu^{T}_{0}\|_{2,P_{0}}\|\mu^{C}_{n}(1,\cdot)-\mu^{C}_{0}(1,\cdot)\|_{2,P_{0}}.

Hence, the second term is op(n1/2){\mathrm{o}}_{p}(n^{-1/2}) by Condition B6. The third term is also op(n1/2){\mathrm{o}}_{p}(n^{-1/2}) by an almost identical argument. Combining the previous two displays shows that (S14) holds.

Applying (S14) at k=0k=0 shows that Pnfn,0+αϕn=P0{dn,0Δ0C}+αϕ0+Op(n1/2)=αϕ0+Op(n1/2)P_{n}f_{n,0}+\alpha\phi_{n}=P_{0}\{d_{n,0}\Delta^{C}_{0}\}+\alpha\phi_{0}+{\mathrm{O}}_{p}(n^{-1/2})=\alpha\phi_{0}+{\mathrm{O}}_{p}(n^{-1/2}). Therefore, Pnfn,0+αϕn<κP_{n}f_{n,0}+\alpha\phi_{n}<\kappa with probability tending to one. Applying this result at k=k=\infty shows that Pnfn,1+αϕn=P0{dn,Δ0C}+αϕ0+Op(n1/2)=P0Δ0C+αϕ0+Op(n1/2)P_{n}f_{n,1}+\alpha\phi_{n}=P_{0}\{d_{n,\infty}\Delta^{C}_{0}\}+\alpha\phi_{0}+{\mathrm{O}}_{p}(n^{-1/2})=P_{0}\Delta^{C}_{0}+\alpha\phi_{0}+{\mathrm{O}}_{p}(n^{-1/2}). Combining this fact with the fact that P0Δ0C+αϕ0>κP_{0}\Delta^{C}_{0}+\alpha\phi_{0}>\kappa whenever η0>\eta_{0}>-\infty shows that Pnfn,1+αϕn>κP_{n}f_{n,1}+\alpha\phi_{n}>\kappa with probability tending to one. Combining these results at k=0k=0 and k=k=\infty with the fact that kPnfn,kk\mapsto P_{n}f_{n,k} is a continuous function shows that, with probability tending to one, there exists a kn[0,)k_{n}^{\prime}\in[0,\infty) such that Pnfn,kn=καϕnP_{n}f_{n,k_{n}^{\prime}}=\kappa-\alpha\phi_{n}. Lemma S6 then implies ηn=ηn(kn)\eta_{n}=\eta_{n}(k_{n}) with probability tending to 1. ∎

Proof of ii from Lemma S7..

. By Lemma S3, Eq. S14 and part i of this lemma, we see that P0{dn,knΔ0C}+αϕ0=Pnfn,kn+αϕn+Op(n1/2)=κ+Op(n1/2)P_{0}\{d_{n,k_{n}}\Delta^{C}_{0}\}+\alpha\phi_{0}=P_{n}f_{n,k_{n}}+\alpha\phi_{n}+{\mathrm{O}}_{p}(n^{-1/2})=\kappa+{\mathrm{O}}_{p}(n^{-1/2}), as desired. ∎

Proof of iii from Lemma S7 when η0>0\eta_{0}>0..

In this proof, we use P0nP_{0}^{n} to denote a probability statement over the draws of O1,,OnO_{1},\ldots,O_{n}. Fix ϵ>0\epsilon>0. We will argue by contradiction to show that P0n{ηnη0+ϵ}0P_{0}^{n}\left\{\eta_{n}\geq\eta_{0}+\epsilon\right\}\rightarrow 0 and P0n{ηnη0ϵ}0P_{0}^{n}\left\{\eta_{n}\leq\eta_{0}-\epsilon\right\}\rightarrow 0 as nn\rightarrow\infty, implying the consistency of ηn\eta_{n}. The consistency of τn\tau_{n} then follows. We study these two events separately. First, we suppose that

lim supnP0n{ηnη0+ϵ}>0.\displaystyle\limsup_{n}P_{0}^{n}\left\{\eta_{n}\geq\eta_{0}+\epsilon\right\}>0. (S15)

Then there exists δ>0\delta>0 such that, for all nn in an infinite sequence NN\subseteq\mathbb{N}, the probability P0n{ηnη0+ϵ}P_{0}^{n}\left\{\eta_{n}\geq\eta_{0}+\epsilon\right\} is at least δ\delta. Consequently, for any nNn\in N, the following holds with probability at least δ\delta:

P0{dn,knΔ0C}+αϕ0κ\displaystyle P_{0}\{d_{n,k_{n}}\Delta^{C}_{0}\}+\alpha\phi_{0}-\kappa P0{I(ξn>η0+ϵ/2)Δ0C}+αϕ0κ\displaystyle\leq P_{0}\{I(\xi_{n}>\eta_{0}+\epsilon/2)\Delta^{C}_{0}\}+\alpha\phi_{0}-\kappa
=P0{[I(ξn>η0+ϵ/2)I(ξ0>η0+ϵ/2)]Δ0C}+Γ0(η0+ϵ/2)+αϕ0κ.\displaystyle=P_{0}\{[I(\xi_{n}>\eta_{0}+\epsilon/2)-I(\xi_{0}>\eta_{0}+\epsilon/2)]\Delta^{C}_{0}\}+\Gamma_{0}(\eta_{0}+\epsilon/2)+\alpha\phi_{0}-\kappa. (S16)

We now show that the first term is op(1){\mathrm{o}}_{p}(1). For any x>0x>0 and nn\in\mathbb{N}, by Lemma S5 and Condition B4,

|P0{[I(ξn>η0+ϵ/2)I(ξ0>η0+ϵ/2)]Δ0C}|\displaystyle|P_{0}\{[I(\xi_{n}>\eta_{0}+\epsilon/2)-I(\xi_{0}>\eta_{0}+\epsilon/2)]\Delta^{C}_{0}\}| P0I(|ξnξ0|>x)+P0I(|ξ0ηn+ϵ/2|x)\displaystyle\lesssim P_{0}I(|\xi_{n}-\xi_{0}|>x)+P_{0}I(|\xi_{0}-\eta_{n}+\epsilon/2|\leq x)
ξnξ01,P0x+P0I(|ξ0ηn+ϵ/2|x).\displaystyle\leq\frac{\|\xi_{n}-\xi_{0}\|_{1,P_{0}}}{x}+P_{0}I(|\xi_{0}-\eta_{n}+\epsilon/2|\leq x).

Similarly to the proof of Lemma S6, the fact that ξnξ01,P0=op(1)\|\xi_{n}-\xi_{0}\|_{1,P_{0}}={\mathrm{o}}_{p}(1) (Condition B12) ensures that P0{[I(ξn>η0+ϵ/2)I(ξ0>η0+ϵ/2)]Δ0C=op(1)P_{0}\{[I(\xi_{n}>\eta_{0}+\epsilon/2)-I(\xi_{0}>\eta_{0}+\epsilon/2)]\Delta^{C}_{0}={\mathrm{o}}_{p}(1). By Condition B3, Γ0(η0+ϵ/2)Γ0(η0)\Gamma_{0}(\eta_{0}+\epsilon/2)-\Gamma_{0}(\eta_{0}) is a negative constant. Because (S16) holds with probability at least δ>0\delta>0 for infinitely many nn, this shows that P0{dn,knΔ0C}+αϕ0κP_{0}\{d_{n,k_{n}}\Delta^{C}_{0}\}+\alpha\phi_{0}-\kappa is not op(1){\mathrm{o}}_{p}(1). This contradicts our result from part ii of this lemma. Therefore, (S15) is false, that is, lim supnP0n{ηnη0+ϵ}=0\limsup_{n}P_{0}^{n}\left\{\eta_{n}\geq\eta_{0}+\epsilon\right\}=0.

Now we assume that, for some ϵ>0\epsilon>0, lim supnP0n{ηnη0ϵ}>0\limsup_{n}P_{0}^{n}\left\{\eta_{n}\leq\eta_{0}-\epsilon\right\}>0. Then there exists δ>0\delta>0 such that, for all nn in an infinite sequence NN\subseteq\mathbb{N}, P0n{ηnη0ϵ}δP_{0}^{n}\left\{\eta_{n}\leq\eta_{0}-\epsilon\right\}\geq\delta. Now, for any nNn\in N, the following holds with probability at least δ\delta:

P0{dn,knΔ0C}+αϕ0κ\displaystyle P_{0}\{d_{n,k_{n}}\Delta^{C}_{0}\}+\alpha\phi_{0}-\kappa P0{I(ξn>η0ϵ)Δ0C}+αϕ0κ\displaystyle\geq P_{0}\{I(\xi_{n}>\eta_{0}-\epsilon)\Delta^{C}_{0}\}+\alpha\phi_{0}-\kappa
=P0{[I(ξn>η0ϵ)I(ξ0>η0ϵ)]ν0}+Γ0(η0ϵ)+αϕ0κ.\displaystyle=P_{0}\{[I(\xi_{n}>\eta_{0}-\epsilon)-I(\xi_{0}>\eta_{0}-\epsilon)]\nu_{0}\}+\Gamma_{0}(\eta_{0}-\epsilon)+\alpha\phi_{0}-\kappa.

The rest of the argument is almost identical to the contradiction argument for the previous event, and is therefore omitted.

Since ϵ\epsilon is arbitrary, combining the results of these two contradiction arguments shows that |τnτ0||ηnη0|=op(1)|\tau_{n}-\tau_{0}|\leq|\eta_{n}-\eta_{0}|={\mathrm{o}}_{p}(1), as desired. ∎

Proof of iii from Lemma S7 when η0=0\eta_{0}=0..

If η0=0\eta_{0}=0, then the construction of ηn\eta_{n} implies that ηn\eta_{n} takes values from two sequences: ηn(κ)\eta_{n}(\kappa) and ηn(kn)\eta_{n}(k_{n}) where knk_{n} is a solution to (5). By Lemma S6, ηn(κ)\eta_{n}(\kappa) is consistent for η0\eta_{0}. When a solution to (5) exists and equals knk_{n}, the proof of part iii from Lemma S7 when η0>0\eta_{0}>0 shows that ηn(kn)\eta_{n}(k_{n}) is consistent for η0\eta_{0} and the desired result follows. ∎

Proof of iii from Lemma S7 when η0<0\eta_{0}<0..

If η0<0\eta_{0}<0, then by Lemma S6, ηn(κ)0\eta_{n}(\kappa)\leq 0 with probability tending to one. Hence, with probability tending to one, τn=0=τ0\tau_{n}=0=\tau_{0}. Therefore, part iii holds. ∎

The following Lemma S8 show that certain remainders in the expansions in Section S4.3 are op(n1/2){\mathrm{o}}_{p}(n^{-1/2}).

Lemma S8.

Under Conditions A2B8 and B6,

supρ:𝒲[0,1]|Rρ(P^n,P0)|=op(n1/2).\sup_{{\rho}:\mathcal{W}\rightarrow[0,1]}\left|R_{\rho}(\hat{P}_{n},P_{0})\right|={\mathrm{o}}_{p}(n^{-1/2}).
Proof of Lemma S8.

By the boundedness of the range of ρ{\rho}, we see that

supρ:𝒲[0,1]|Rρ(P^n,P0)|\displaystyle\sup_{{\rho}:\mathcal{W}\rightarrow[0,1]}\left|R_{\rho}(\hat{P}_{n},P_{0})\right|
=supρ:𝒲[0,1]P0|ρ()[μnT()μ0T()μnT(){μ^nY(1,)μ0Y(1,)}+μnT()μ0T()1μnT{μ^nY(0,)μ0Y(0,)}]|\displaystyle=\sup_{{\rho}:\mathcal{W}\rightarrow[0,1]}P_{0}\left|{\rho}(\cdot)\left[\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{\mu^{T}_{n}(\cdot)}\{\hat{\mu}^{Y}_{n}(1,\cdot)-\mu^{Y}_{0}(1,\cdot)\}+\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{1-\mu^{T}_{n}}\{\hat{\mu}^{Y}_{n}(0,\cdot)-\mu^{Y}_{0}(0,\cdot)\}\right]\right|
P0|[μnT()μ0T()μnT(){μ^nY(1,)μ0Y(1,)}+μnT()μ0T()1μnT(){μ^nY(0,)μ0Y(0,)}]|.\displaystyle\leq P_{0}\left|\left[\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{\mu^{T}_{n}(\cdot)}\{\hat{\mu}^{Y}_{n}(1,\cdot)-\mu^{Y}_{0}(1,\cdot)\}+\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{1-\mu^{T}_{n}(\cdot)}\{\hat{\mu}^{Y}_{n}(0,\cdot)-\mu^{Y}_{0}(0,\cdot)\}\right]\right|.
Using Condition B8 and Lemma S4, the display continues as
\displaystyle\lesssim P0|(μnT()μ0T())[μ^nY(1,)μ0Y(1,)]|+P0|(μnT()μ0T())[μ^nY(0,)μ0Y(0,)]|\displaystyle P_{0}\left|(\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot))[\hat{\mu}^{Y}_{n}(1,\cdot)-\mu^{Y}_{0}(1,\cdot)]\right|+P_{0}\left|(\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot))[\hat{\mu}^{Y}_{n}(0,\cdot)-\mu^{Y}_{0}(0,\cdot)]\right|
\displaystyle\leq μnTμ0T2,P0μ^nY(1,)μ0Y(1,)2,P0+μnTμ0T2,P0μ^nY(0,)μ0Y(0,)2,P0\displaystyle\|\mu^{T}_{n}-\mu^{T}_{0}\|_{2,P_{0}}\|\hat{\mu}^{Y}_{n}(1,\cdot)-\mu^{Y}_{0}(1,\cdot)\|_{2,P_{0}}+\|\mu^{T}_{n}-\mu^{T}_{0}\|_{2,P_{0}}\|\hat{\mu}^{Y}_{n}(0,\cdot)-\mu^{Y}_{0}(0,\cdot)\|_{2,P_{0}}
\displaystyle\lesssim μnTμ0T2,P0μ^nYμ0Y2,P0.\displaystyle\|\mu^{T}_{n}-\mu^{T}_{0}\|_{2,P_{0}}\|\hat{\mu}^{Y}_{n}-\mu^{Y}_{0}\|_{2,P_{0}}.

The right-hand side is op(n1/2){\mathrm{o}}_{p}(n^{-1/2}) by Condition B6. ∎

We next prove Theorem 4.

Proof of Theorem 4.

By the expansion of PΨρP(P)P\mapsto\Psi_{{\rho}_{P}}(P) presented in Section S4.3,

Ψρn\displaystyle\Psi_{{\rho}_{n}} (P^n)Ψρ0(P0)\displaystyle(\hat{P}_{n})-\Psi_{{\rho}_{0}}(P_{0})
=P0D(P^n,ρn,τ0,μnC)+Rρn(P^n,P0)+P0{(ρnρ0)(δ0Yτ0δ0C)}\displaystyle=P_{0}D(\hat{P}_{n},{\rho}_{n},\tau_{0},\mu^{C}_{n})+R_{{\rho}_{n}}(\hat{P}_{n},P_{0})+P_{0}\{({\rho}_{n}-{\rho}_{0})(\delta^{Y}_{0}-\tau_{0}\delta^{C}_{0})\}
τ0P0{ρn()μnT()μ0T()μnT()[μnC(1,)μ0C(1,)]}\displaystyle\quad-\tau_{0}P_{0}\left\{{\rho}_{n}(\cdot)\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{\mu^{T}_{n}(\cdot)}[\mu^{C}_{n}(1,\cdot)-\mu^{C}_{0}(1,\cdot)]\right\}
+τ0P0{(1ρn())μnT()μ0T()1μnT()[μnC(0,)μ0C(0,)]}\displaystyle\quad+\tau_{0}P_{0}\left\{(1-{\rho}_{n}(\cdot))\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{1-\mu^{T}_{n}(\cdot)}[\mu^{C}_{n}(0,\cdot)-\mu^{C}_{0}(0,\cdot)]\right\}
=(PnP0)D(P0,ρ0,τ0,μ0C)PnD(P^n,ρn,τ0,μnC)\displaystyle=(P_{n}-P_{0})D(P_{0},{\rho}_{0},\tau_{0},\mu^{C}_{0})-P_{n}D(\hat{P}_{n},{\rho}_{n},\tau_{0},\mu^{C}_{n})
+(PnP0)[D(P^n,ρn,τ0,μnC)D(P0,ρ0,τ0,μ0C)]\displaystyle\quad+(P_{n}-P_{0})\left[D(\hat{P}_{n},{\rho}_{n},\tau_{0},\mu^{C}_{n})-D(P_{0},{\rho}_{0},\tau_{0},\mu^{C}_{0})\right]
+Rρn(P^n,P0)+P0{(ρnρ0)(δ0Yτ0δ0C)}\displaystyle\quad+R_{{\rho}_{n}}(\hat{P}_{n},P_{0})+P_{0}\{({\rho}_{n}-{\rho}_{0})(\delta^{Y}_{0}-\tau_{0}\delta^{C}_{0})\}
τ0P0{ρn()μnT()μ0T()μnT()[μnC(1,)μ0C(1,)]}\displaystyle\quad-\tau_{0}P_{0}\left\{{\rho}_{n}(\cdot)\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{\mu^{T}_{n}(\cdot)}[\mu^{C}_{n}(1,\cdot)-\mu^{C}_{0}(1,\cdot)]\right\}
+τ0P0{(1ρn())μnT()μ0T()1μnT()[μnC(0,)μ0C(0,)]}.\displaystyle\quad+\tau_{0}P_{0}\left\{(1-{\rho}_{n}(\cdot))\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{1-\mu^{T}_{n}(\cdot)}[\mu^{C}_{n}(0,\cdot)-\mu^{C}_{0}(0,\cdot)]\right\}.

Similarly,

ΨρFR(P^n)ΨρFR(P0)\displaystyle\Psi_{{\rho}^{\mathrm{FR}}}(\hat{P}_{n})-\Psi_{{\rho}^{\mathrm{FR}}}(P_{0}) =(PnP0)D(P0,ρFR,0,μ0C)PnD(P^n,ρFR,0,μ0C)\displaystyle=(P_{n}-P_{0})D(P_{0},{\rho}^{\mathrm{FR}},0,\mu^{C}_{0})-P_{n}D(\hat{P}_{n},{\rho}^{\mathrm{FR}},0,\mu^{C}_{0})
+(PnP0)[D(P^n,ρFR,0,μ0C)D(P0,ρFR,0,μ0C)]+RρFR(P^n,P0);\displaystyle\quad+(P_{n}-P_{0})\left[D(\hat{P}_{n},{\rho}^{\mathrm{FR}},0,\mu^{C}_{0})-D(P_{0},{\rho}^{\mathrm{FR}},0,\mu^{C}_{0})\right]+R_{{\rho}^{\mathrm{FR}}}(\hat{P}_{n},P_{0});
ΨρnRD(P^n)Ψρ0RD(P0)\displaystyle\Psi_{{\rho}^{\mathrm{RD}}_{n}}(\hat{P}_{n})-\Psi_{{\rho}^{\mathrm{RD}}_{0}}(P_{0}) =(PnP0)D(P0,ρ0RD,0,μ0C)PnD(P^n,ρnRD,0,μ0C)\displaystyle=(P_{n}-P_{0})D(P_{0},{\rho}^{\mathrm{RD}}_{0},0,\mu^{C}_{0})-P_{n}D(\hat{P}_{n},{\rho}^{\mathrm{RD}}_{n},0,\mu^{C}_{0})
+(PnP0)[D(P^n,ρnRD,0,μ0C)D(P0,ρ0RD,0,μ0C)]\displaystyle\quad+(P_{n}-P_{0})\left[D(\hat{P}_{n},{\rho}^{\mathrm{RD}}_{n},0,\mu^{C}_{0})-D(P_{0},{\rho}^{\mathrm{RD}}_{0},0,\mu^{C}_{0})\right]
+RρnRD(P^n,P0)+(ρnRDρ0RD)P0Δ0Y;\displaystyle\quad+R_{{\rho}^{\mathrm{RD}}_{n}}(\hat{P}_{n},P_{0})+({\rho}^{\mathrm{RD}}_{n}-{\rho}^{\mathrm{RD}}_{0})P_{0}\Delta^{Y}_{0};
ΨρnTP(P^n)Ψρ0TP(P0)\displaystyle\Psi_{{\rho}^{\mathrm{TP}}_{n}}(\hat{P}_{n})-\Psi_{{\rho}^{\mathrm{TP}}_{0}}(P_{0}) =(PnP0)GTP(P0)PnGTP(P^n)+(PnP0)[GTP(P^n)GTP(P0)]\displaystyle=(P_{n}-P_{0})G_{\mathrm{TP}}(P_{0})-P_{n}G_{\mathrm{TP}}(\hat{P}_{n})+(P_{n}-P_{0})\left[G_{\mathrm{TP}}(\hat{P}_{n})-G_{\mathrm{TP}}(P_{0})\right]
+Rρ0TP(P^n,P0).\displaystyle\quad+R_{{\rho}^{\mathrm{TP}}_{0}}(\hat{P}_{n},P_{0}).

First, we note the following facts, which will be sufficient to ensure that the remainders and empirical process terms in all of the first-order expansions given above are op(n1/2){\mathrm{o}}_{p}(n^{-1/2}). By Condition B6, Lemmas S4 and S8, the Cauchy-Schwarz inequality, and boundedness of the range of an ITR, the following terms are all op(n1/2){\mathrm{o}}_{p}(n^{-1/2}):

Rρ(P^n,P0)for ρ=ρn,ρFR,ρnRD,\displaystyle R_{{\rho}}(\hat{P}_{n},P_{0})\ \textnormal{for ${\rho}={\rho}_{n},{\rho}^{\mathrm{FR}},{\rho}^{\mathrm{RD}}_{n}$},
P0{μnT()μ0T()1μnT()(μ^PY(0,)μ^0Y(0,))},\displaystyle P_{0}\left\{\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{1-\mu^{T}_{n}(\cdot)}(\hat{\mu}^{Y}_{P}(0,\cdot)-\hat{\mu}^{Y}_{0}(0,\cdot))\right\},
τ0P0{ρn()μnT()μ0T()μnT()[μnC(1,)μ0C(1,)]},\displaystyle\tau_{0}P_{0}\left\{{\rho}_{n}(\cdot)\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{\mu^{T}_{n}(\cdot)}[\mu^{C}_{n}(1,\cdot)-\mu^{C}_{0}(1,\cdot)]\right\},
τ0P0{(1ρn())μnT()μ0T()1μnT()[μnC(0,)μ0C(0,)]}.\displaystyle\tau_{0}P_{0}\left\{(1-{\rho}_{n}(\cdot))\frac{\mu^{T}_{n}(\cdot)-\mu^{T}_{0}(\cdot)}{1-\mu^{T}_{n}(\cdot)}[\mu^{C}_{n}(0,\cdot)-\mu^{C}_{0}(0,\cdot)]\right\}.

Moreover, by Condition B10, P0{(ρnρ0)(δ0Yτ0δ0C)}=op(n1/2)P_{0}\{({\rho}_{n}-{\rho}_{0})(\delta^{Y}_{0}-\tau_{0}\delta^{C}_{0})\}={\mathrm{o}}_{p}(n^{-1/2}); by Conditions B7 and B11, (PnP0)[Dn,D(P0)]=op(n1/2)(P_{n}-P_{0})\left[D_{n,\mathcal{R}}-D_{\mathcal{R}}(P_{0})\right]={\mathrm{o}}_{p}(n^{-1/2}) for all {FR,RD,TP}\mathcal{R}\in\{{\mathrm{FR}},{\mathrm{RD}},{\mathrm{TP}}\} and

(PnP0){[D(P^n,ρn,τ0,μnC)D(P^n,ρnRD,0,μ0C)][D(P0,ρ0,τ0,μ0C)D(P0,ρ0RD,0,μ0C)]}=op(n1/2).(P_{n}-P_{0})\left\{[D(\hat{P}_{n},{\rho}_{n},\tau_{0},\mu^{C}_{n})-D(\hat{P}_{n},{\rho}^{\mathrm{RD}}_{n},0,\mu^{C}_{0})]-[D(P_{0},{\rho}_{0},\tau_{0},\mu^{C}_{0})-D(P_{0},{\rho}^{\mathrm{RD}}_{0},0,\mu^{C}_{0})]\right\}={\mathrm{o}}_{p}(n^{-1/2}).

Therefore, all relevant remainders and empirical process terms are op(n1/2){\mathrm{o}}_{p}(n^{-1/2}).

We separately study the three cases where =FR\mathcal{R}={\mathrm{FR}}, =RD\mathcal{R}={\mathrm{RD}}, and =TP\mathcal{R}={\mathrm{TP}}.

Case I: =FR\mathcal{R}={\mathrm{FR}}. It holds that

ψnψ0\displaystyle\psi_{n}-\psi_{0} =(PnP0)DFR(P0)PnDn,FR+op(n1/2)\displaystyle=(P_{n}-P_{0})D_{\mathrm{FR}}(P_{0})-P_{n}D_{n,{\mathrm{FR}}}+{\mathrm{o}}_{p}(n^{-1/2})
=(PnP0)DFR(P0)\displaystyle=(P_{n}-P_{0})D_{\mathrm{FR}}(P_{0})
+τ0{1ni=1n{ρn(Vi)[ΔnC(Wi)+1Ti+μnT(Wi)1[CiμnC(Ti,Wi)]]\displaystyle\quad+\tau_{0}\Bigg{\{}\frac{1}{n}\sum_{i=1}^{n}\Bigg{\{}{\rho}_{n}(V_{i})\left[\Delta^{C}_{n}(W_{i})+\frac{1}{T_{i}+\mu^{T}_{n}(W_{i})-1}[C_{i}-\mu^{C}_{n}(T_{i},W_{i})]\right]
+α[μnC(0,Wi)+1Ti1μnT(Wi)[CiμnC(0,Wi)]]}κ}+op(n1/2),\displaystyle\hskip 36.135pt+\alpha\left[\mu^{C}_{n}(0,W_{i})+\frac{1-T_{i}}{1-\mu^{T}_{n}(W_{i})}[C_{i}-\mu^{C}_{n}(0,W_{i})]\right]\Bigg{\}}-\kappa\Bigg{\}}+{\mathrm{o}}_{p}(n^{-1/2}),

where the last step follows from the TMLE construction of P^n\hat{P}_{n} (Step 4a of our estimator), which implies that

1ni=1n{ρn(Vi)ρFR(V)Ti+μnT(Wi)1[Yiμ^nY(Ti,Wi)]}=0.\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{{\rho}_{n}(V_{i})-{\rho}^{\mathrm{FR}}(V)}{T_{i}+\mu^{T}_{n}(W_{i})-1}[Y_{i}-\hat{\mu}^{Y}_{n}(T_{i},W_{i})]\right\}=0.

We now show that the second term on the right-hand side is zero with probability tending to one. If τ0=0\tau_{0}=0, then this term is zero. Otherwise, τ0=η0>0\tau_{0}=\eta_{0}>0. By Lemma S7, the following holds with probability tending to one:

1ni=1n{ρn(Vi)[μnC(1,Wi)+TiμnT(Wi)[CiμnC(1,Wi)]]+α[μnC(0,Wi)+1Ti1μnT(Wi)[CiμnC(0,Wi)]]}=κ,\frac{1}{n}\sum_{i=1}^{n}\left\{{\rho}_{n}(V_{i})\left[\mu^{C}_{n}(1,W_{i})+\frac{T_{i}}{\mu^{T}_{n}(W_{i})}[C_{i}-\mu^{C}_{n}(1,W_{i})]\right]+\alpha\left[\mu^{C}_{n}(0,W_{i})+\frac{1-T_{i}}{1-\mu^{T}_{n}(W_{i})}[C_{i}-\mu^{C}_{n}(0,W_{i})]\right]\right\}=\kappa,

and hence the second term is zero with probability tending to one, as desired. Therefore, ψnψ0=(PnP0)DFR(P0)+op(n1/2)\psi_{n}-\psi_{0}=(P_{n}-P_{0})D_{\mathrm{FR}}(P_{0})+{\mathrm{o}}_{p}(n^{-1/2}).

Case II: =RD\mathcal{R}={\mathrm{RD}}. It holds that

ψnψ0\displaystyle\psi_{n}-\psi_{0} =(PnP0){D(P0,ρ0,τ0,μ0C)D(P0,ρ0RD,0,μ0C)}\displaystyle=(P_{n}-P_{0})\{D(P_{0},{\rho}_{0},\tau_{0},\mu^{C}_{0})-D(P_{0},{\rho}^{\mathrm{RD}}_{0},0,\mu^{C}_{0})\}
Pn{D(P^n,ρn,τ0,μnC)D(P^n,ρnRD,0,μ0C)}\displaystyle\quad-P_{n}\{D(\hat{P}_{n},{\rho}_{n},\tau_{0},\mu^{C}_{n})-D(\hat{P}_{n},{\rho}^{\mathrm{RD}}_{n},0,\mu^{C}_{0})\}
(ρnRDρ0RD)P0Δ0Y+op(n1/2),\displaystyle\quad-({\rho}^{\mathrm{RD}}_{n}-{\rho}^{\mathrm{RD}}_{0})P_{0}\Delta^{Y}_{0}+{\mathrm{o}}_{p}(n^{-1/2}),

where we have used ρnRD{\rho}^{\mathrm{RD}}_{n} and ρ0RD{\rho}^{\mathrm{RD}}_{0} to denote the values that the two functions take, respectively. The TMLE construction of P^n\hat{P}_{n} (Step 4a of our estimator) implies that

1ni=1nρn(Vi)ρnRD(Vi)Ti+μnT(Wi)1[Yiμ^nY(Ti,Wi)]=0,\frac{1}{n}\sum_{i=1}^{n}\frac{{\rho}_{n}(V_{i})-{\rho}^{\mathrm{RD}}_{n}(V_{i})}{T_{i}+\mu^{T}_{n}(W_{i})-1}[Y_{i}-\hat{\mu}^{Y}_{n}(T_{i},W_{i})]=0,

and hence

Pn{D(P^n,ρn,τ0,μnC)D(P^n,ρnRD,0,μ0C)}\displaystyle P_{n}\{D(\hat{P}_{n},{\rho}_{n},\tau_{0},\mu^{C}_{n})-D(\hat{P}_{n},{\rho}^{\mathrm{RD}}_{n},0,\mu^{C}_{0})\}
=τ0{1ni=1n{ρn(Vi)[ΔnC(Wi)+1Ti+μnT(Wi)1[CiμnC(Ti,Wi)]]\displaystyle=-\tau_{0}\Bigg{\{}\frac{1}{n}\sum_{i=1}^{n}\Bigg{\{}{\rho}_{n}(V_{i})\left[\Delta^{C}_{n}(W_{i})+\frac{1}{T_{i}+\mu^{T}_{n}(W_{i})-1}[C_{i}-\mu^{C}_{n}(T_{i},W_{i})]\right]
+α[μnC(0,Wi)+1Ti1μnT(Wi)[CiμnC(0,Wi)]]}κ},\displaystyle\hskip 36.135pt+\alpha\left[\mu^{C}_{n}(0,W_{i})+\frac{1-T_{i}}{1-\mu^{T}_{n}(W_{i})}[C_{i}-\mu^{C}_{n}(0,W_{i})]\right]\Bigg{\}}-\kappa\Bigg{\}},

which is zero with probability tending to one as proved above. By Condition B5, Lemma S3 and the delta method for influence functions, the value that ρnRD{\rho}^{\mathrm{RD}}_{n} takes is an asymptotic linear estimator of the value that ρ0RD{\rho}^{\mathrm{RD}}_{0} takes. Straightforward application of the delta method for influence functions implies that

ψnψ0=(PnP0)DRD(P0)+op(n1/2).\psi_{n}-\psi_{0}=(P_{n}-P_{0})D_{\mathrm{RD}}(P_{0})+{\mathrm{o}}_{p}(n^{-1/2}).

Case 3: =TP\mathcal{R}={\mathrm{TP}}. It holds that

ψnψ0=(PnP0)DTP(P0)PnDn,TP+op(n1/2).\psi_{n}-\psi_{0}=(P_{n}-P_{0})D_{{\mathrm{TP}}}(P_{0})-P_{n}D_{n,{\mathrm{TP}}}+{\mathrm{o}}_{p}(n^{-1/2}).

The TMLE construction of P^n\hat{P}_{n} (Step 4a of our estimator) implies that

1ni=1n{ρn(Vi)μnT(Wi)Ti+μnT(Wi)1[Yiμ^nY(Ti,Wi)]}=0,\frac{1}{n}\sum_{i=1}^{n}\left\{\frac{{\rho}_{n}(V_{i})-\mu^{T}_{n}(W_{i})}{T_{i}+\mu^{T}_{n}(W_{i})-1}[Y_{i}-\hat{\mu}^{Y}_{n}(T_{i},W_{i})]\right\}=0,

so

PnDn,TP\displaystyle P_{n}D_{n,{\mathrm{TP}}} =τ0{1ni=1n{ρn(Vi)[ΔnC(Wi)+1Ti+μnT(Wi)1[CiμnC(Ti,Wi)]]\displaystyle=-\tau_{0}\Bigg{\{}\frac{1}{n}\sum_{i=1}^{n}\Bigg{\{}{\rho}_{n}(V_{i})\left[\Delta^{C}_{n}(W_{i})+\frac{1}{T_{i}+\mu^{T}_{n}(W_{i})-1}[C_{i}-\mu^{C}_{n}(T_{i},W_{i})]\right]
+α[μnC(0,Wi)+1Ti1μnT(Wi)[CiμnC(0,Wi)]}]κ},\displaystyle\hskip 36.135pt+\alpha\left[\mu^{C}_{n}(0,W_{i})+\frac{1-T_{i}}{1-\mu^{T}_{n}(W_{i})}[C_{i}-\mu^{C}_{n}(0,W_{i})]\Bigg{\}}\right]-\kappa\Bigg{\}},

which is zero with probability tending to one as proved above. Therefore,

ψnψ0=(PnP0)DTP(P0)+op(n1/2).\psi_{n}-\psi_{0}=(P_{n}-P_{0})D_{\mathrm{TP}}(P_{0})+{\mathrm{o}}_{p}(n^{-1/2}).

Conclusion: The asymptotic linearity result on ψn\psi_{n} follows from the above results. Consequently, the asymptotic normality result on ψn\psi_{n} holds by the central limit theorem and Slutsky’s theorem. ∎

S4.5 Proof of Theorem S1

In this section, we prove Theorem S1. The arguments are almost identical to those in Supplement S9 Qiu et al. [34] with adaptations to the different treatment resource constraint.

Lemma S9 (Convergence rate of τn\tau_{n} if η0>\eta_{0}>-\infty).

Assume that the conditions for Theorem 4 hold. Suppose that η0>\eta_{0}>-\infty, that the Lebesgue density of the distribution of ξ0(V)\xi_{0}(V) under VP0V\sim P_{0} is well-defined, nonzero and finite in a neighborhood of and that P0I(ξn=ηn)=Op(n1/2)P_{0}I(\xi_{n}=\eta_{n})={\mathrm{O}}_{p}(n^{-1/2}). Under these conditions, the following implications hold with probability tending to one:

  • If ξnξ0q,P0=op(1)\|\xi_{n}-\xi_{0}\|_{q,P_{0}}={\mathrm{o}}_{p}(1) for some 0<q<0<q<\infty, then |τnτ0|ξnξ0q,P0q/q+1+Op(n1/2)|\tau_{n}-\tau_{0}|\lesssim\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{q/{q+1}}+{\mathrm{O}}_{p}(n^{-1/2}).

  • If ξnξ0,P0=op(1)\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}={\mathrm{o}}_{p}(1), then |τnτ0|ξnξ0,P0+Op(n1/2)|\tau_{n}-\tau_{0}|\lesssim\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}+{\mathrm{O}}_{p}(n^{-1/2}).

The condition that P0I(ξn=ηn)=Op(n1/2)P_{0}I(\xi_{n}=\eta_{n})={\mathrm{O}}_{p}(n^{-1/2}) is reasonable if ξn(V)\xi_{n}(V) has a continuous distribution when VP0V\sim P_{0}, in which case P0I(ξn=ηn)=0P_{0}I(\xi_{n}=\eta_{n})=0.

Proof of Lemma S9.

We study the three cases where η0>0\eta_{0}>0, η0<0\eta_{0}<0 and η0=0\eta_{0}=0 separately.

We first study the case where η0>0\eta_{0}>0. By Lemma S7, with probability tending to one, ηn=ηn(kn)\eta_{n}=\eta_{n}(k_{n}) where knk_{n} is a solution to (5), and

P0{[I(ξn>ηn)I(ξ0>η0)]Δ0C}=P0{dn,knΔ0C}(καϕ0)+Op(n1/2)=Op(n1/2).\displaystyle P_{0}\{[I(\xi_{n}>\eta_{n})-I(\xi_{0}>\eta_{0})]\Delta^{C}_{0}\}=P_{0}\{d_{n,k_{n}}\Delta^{C}_{0}\}-(\kappa-\alpha\phi_{0})+{\mathrm{O}}_{p}(n^{-1/2})={\mathrm{O}}_{p}(n^{-1/2}).

We argue conditionally on the event that knk_{n} is a solution to (5). Adding Γ0(ηn)P0{I(ξn>ηn)Δ0C}\Gamma_{0}(\eta_{n})-P_{0}\{I(\xi_{n}>\eta_{n})\Delta^{C}_{0}\} to both sides shows that Γ0(ηn)Γ0(η0)=P0{[I(ξn>ηn)I(ξ0>ηn)]Δ0C}+Op(n1/2)\Gamma_{0}(\eta_{n})-\Gamma_{0}(\eta_{0})=-P_{0}\{[I(\xi_{n}>\eta_{n})-I(\xi_{0}>\eta_{n})]\Delta^{C}_{0}\}+{\mathrm{O}}_{p}(n^{-1/2}). By a Taylor expansion of Γ0\Gamma_{0} under Conditions B2B3 and A4, the left-hand side is equal to C(ηnη0)+op(ηnη0)-C(\eta_{n}-\eta_{0})+{\mathrm{o}}_{p}(\eta_{n}-\eta_{0}) for some C>0C>0, yielding that

[C+op(1)][ηnη0]\displaystyle[C+{\mathrm{o}}_{p}(1)][\eta_{n}-\eta_{0}] =P0{[I(ξn>ηn)I(ξ0>ηn)]Δ0C}+Op(n1/2),\displaystyle=P_{0}\{[I(\xi_{n}>\eta_{n})-I(\xi_{0}>\eta_{n})]\Delta^{C}_{0}\}+{\mathrm{O}}_{p}(n^{-1/2}),

which immediately implies that

ηnη0\displaystyle\eta_{n}-\eta_{0} =Op(P0{[I(ξn>ηn)I(ξ0>ηn)]Δ0C})+Op(n1/2).\displaystyle={\mathrm{O}}_{p}\Big{(}P_{0}\{[I(\xi_{n}>\eta_{n})-I(\xi_{0}>\eta_{n})]\Delta^{C}_{0}\}\Big{)}+{\mathrm{O}}_{p}(n^{-1/2}). (S17)

The rest of the proof for this case and the proof for the other two cases are identical to the proof of Lemma S14 in Qiu et al. [34]. We present the argument below for completeness. By Lemma S5 and Condition B4, for any ϵ>0\epsilon>0 it holds that

|P0{[I(ξn>ηn)I(ξ0>ηn)]Δ0C}|\displaystyle|P_{0}\{[I(\xi_{n}>\eta_{n})-I(\xi_{0}>\eta_{n})]\Delta^{C}_{0}\}|
|P0{[I(ξn>ηn)I(ξ0>ηn)]}|\displaystyle\lesssim|P_{0}\{[I(\xi_{n}>\eta_{n})-I(\xi_{0}>\eta_{n})]\}|
P0I(|ξnξ0|>ϵ)+P0I(|ξ0ηn|ϵ).\displaystyle\leq P_{0}I(|\xi_{n}-\xi_{0}|>\epsilon)+P_{0}I(|\xi_{0}-\eta_{n}|\leq\epsilon).

Fix a positive sequence {ϵn}n=1\{\epsilon_{n}\}_{n=1}^{\infty}, where each ϵn\epsilon_{n} may be random through observations O1,,OnO_{1},\ldots,O_{n}, such that ϵn𝑝0\epsilon_{n}\overset{p}{\rightarrow}0 as nn\rightarrow\infty. By a Taylor expansion of S0S_{0}, the survival function of the distribution of ξ0(V)\xi_{0}(V) when VP0V\sim P_{0}, around η0\eta_{0}, which is valid under Condition B2 provided ϵn\epsilon_{n} is sufficiently small, it follows that

|P0{[I(ξn>ηn)I(ξ0>ηn)]Δ0C}|\displaystyle|P_{0}\{[I(\xi_{n}>\eta_{n})-I(\xi_{0}>\eta_{n})]\Delta^{C}_{0}\}| P0I(|ξnξ0|>ϵn)2(S0)(η0)ϵn+op(ϵn).\displaystyle\lesssim P_{0}I(|\xi_{n}-\xi_{0}|>\epsilon_{n})-2(S_{0})^{\prime}(\eta_{0})\epsilon_{n}+{\mathrm{o}}_{p}(\epsilon_{n}).

Here we recall that (S0)(η0)(S_{0})^{\prime}(\eta_{0}) is finite by Condition B2. Returning to (S17),

ηnη0\displaystyle\eta_{n}-\eta_{0} =Op(P0I(|ξnξ0|>ϵn))[2(S0)(η0)+op(1)]ϵn+Op(n1/2).\displaystyle={\mathrm{O}}_{p}\Big{(}P_{0}I(|\xi_{n}-\xi_{0}|>\epsilon_{n})\Big{)}-[2(S_{0})^{\prime}(\eta_{0})+{\mathrm{o}}_{p}(1)]\epsilon_{n}+{\mathrm{O}}_{p}(n^{-1/2}).

If ξnξ0q,P0=op(1)\|\xi_{n}-\xi_{0}\|_{q,P_{0}}={\mathrm{o}}_{p}(1) for some 0<q<0<q<\infty, by Markov’s inequality, P0I(|ξnξ0|>ϵn)ξnξ0q,P0q/ϵnqP_{0}I(|\xi_{n}-\xi_{0}|>\epsilon_{n})\leq\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{q}/\epsilon_{n}^{q}. In this case, taking ϵn=ξnξ0q,P0q/(q+1)\epsilon_{n}=\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{q/(q+1)} yields that |ηnη0|ξnξ0q,P0q/(q+1)+Op(n1/2)|\eta_{n}-\eta_{0}|\lesssim\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{q/(q+1)}+{\mathrm{O}}_{p}(n^{-1/2}) with probability tending to one. If ξnξ0,P0=op(1)\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}={\mathrm{o}}_{p}(1), then taking ϵn=ξnξ0,P0\epsilon_{n}=\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}} yields that P0I(|ξnξ0|>ϵn)=0P_{0}I(|\xi_{n}-\xi_{0}|>\epsilon_{n})=0, and hence that |ηnη0|ξnξ0,P02+Op(n1/2)|\eta_{n}-\eta_{0}|\lesssim\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}^{2}+{\mathrm{O}}_{p}(n^{-1/2}) with probability tending to one. The desired result follows by noting that τ0=η0\tau_{0}=\eta_{0} and in both cases, τn=ηn(kn)\tau_{n}=\eta_{n}(k_{n}) with probability tending to one.

We now study the case where η<0\eta<0. By Lemma S6, with probability tending to one, ηn<0\eta_{n}<0 and hence τn=0=τ0\tau_{n}=0=\tau_{0}, as desired.

We finally study the case where η0=0\eta_{0}=0. We argue conditional on the event that a solution knk_{n}^{\prime} to (5) exists, which happens with probability tending to one by Lemma S7. Recall that for convenience we let kn=knk_{n}=k_{n}^{\prime} when ηn(κ)>0\eta_{n}(\kappa)>0. Then, exactly one of the following two events happen: (i) ηn(κ)0\eta_{n}(\kappa)\leq 0 or ηn(kn)0\eta_{n}(k_{n}^{\prime})\leq 0, in which case τn=0=τ0\tau_{n}=0=\tau_{0}; (2) ηn(κ)>0\eta_{n}(\kappa)>0 and ηn(kn)>0\eta_{n}(k_{n}^{\prime})>0, in which case a similar argument as the above proof for the case where η0>0\eta_{0}>0 shows that the distance between τn=ηn(kn)\tau_{n}=\eta_{n}(k_{n}^{\prime}) and τ0\tau_{0} has the desired bound. The desired result holds conditional on either event, so it holds unconditional on either event. ∎

We finally prove Theorem S1.

Proof of Theorem S1.

Observe that

|P0{(ρnρ0)(δ0Yτ0Δ0C)}|P0|{I(ξn>τn)I(ξ0>τ0)}(ξ0τ0)Δ0C|P0|{I(ξn>τn)I(ξ0>τ0)}(ξ0τ0)|P0|{I(ξn>τn)I(ξ0>τn)}(ξ0τn)|+P0|{I(ξ0>τn)I(ξ0>τ0)}(ξ0τ0)|+|τnτ0|P0|I(ξn>τn)I(ξ0>τn)|.\displaystyle\begin{split}|P_{0}\{({\rho}_{n}-{\rho}_{0})(\delta^{Y}_{0}-\tau_{0}\Delta^{C}_{0})\}|&\leq P_{0}|\{I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{0})\}(\xi_{0}-\tau_{0})\Delta^{C}_{0}|\\ &\lesssim P_{0}|\{I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{0})\}(\xi_{0}-\tau_{0})|\\ &\leq P_{0}|\{I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})\}(\xi_{0}-\tau_{n})|\\ &\quad+P_{0}|\{I(\xi_{0}>\tau_{n})-I(\xi_{0}>\tau_{0})\}(\xi_{0}-\tau_{0})|\\ &\quad+|\tau_{n}-\tau_{0}|P_{0}|I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})|.\end{split} (S18)

Starting from this inequality, the rest of the proof is identical to that of Theorem 5 in Qiu et al. [34]. We present the argument below for completeness. Let {ϵn}n=1\{\epsilon_{n}\}_{n=1}^{\infty} be a positive sequence, where each ϵn\epsilon_{n} is random through the observations O1,,OnO_{1},\ldots,O_{n}, such that ϵn𝑝0\epsilon_{n}\overset{p}{\rightarrow}0 as nn\rightarrow\infty.

We denote the three terms on the right-hand side by terms 1, 2, and 3, and study these terms separately. It is useful to note that τnτ0=op(1)\tau_{n}-\tau_{0}={\mathrm{o}}_{p}(1), so the Lebesgue density of the distribution of ξ0(V)\xi_{0}(V), VP0V\sim P_{0}, is finite in a neighborhood of τn\tau_{n} with probability tending to one.

Study of term 1 in (S18): Observe that

P0\displaystyle P_{0} |{I(ξn>τn)I(ξ0>τn)}(ξ0τn)|\displaystyle|\{I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})\}(\xi_{0}-\tau_{n})|
=P0|{I(ξn>τn)I(ξ0>τn)}(ξ0τn)|I(0<|ξ0τn|).\displaystyle=P_{0}|\{I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})\}(\xi_{0}-\tau_{n})|I(0<|\xi_{0}-\tau_{n}|).

First consider the bound with the Lq(P0)L^{q}(P_{0})-distance. Because I(ξn(v)>τn)I(ξ0(v)>τn)I(\xi_{n}(v)>\tau_{n})\neq I(\xi_{0}(v)>\tau_{n}) if and only if (i) ξn(v)τn\xi_{n}(v)-\tau_{n} and ξ0(v)τn\xi_{0}(v)-\tau_{n} take different signs or (ii) only one of them is zero, this event implies |ξ0(v)τn||ξn(v)ξ0(v)||\xi_{0}(v)-\tau_{n}|\leq|\xi_{n}(v)-\xi_{0}(v)|, and so this term is upper bounded by

P0|{I(ξn>τn)I(ξ0>τn)}(ξ0τn)|I(0<|ξ0τn|ϵn)\displaystyle P_{0}|\{I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})\}(\xi_{0}-\tau_{n})|I(0<|\xi_{0}-\tau_{n}|\leq\epsilon_{n})
+P0|{I(ξn>τn)I(ξ0>τn)}(ξ0τn)|I(|ξ0τn|>ϵn)\displaystyle\quad+P_{0}|\{I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})\}(\xi_{0}-\tau_{n})|I(|\xi_{0}-\tau_{n}|>\epsilon_{n})
P0|ξnξ0|I(0<|ξ0τn|ϵn)+P0|ξnξ0|I(|ξnξ0|>ϵn)\displaystyle\leq P_{0}|\xi_{n}-\xi_{0}|I(0<|\xi_{0}-\tau_{n}|\leq\epsilon_{n})+P_{0}|\xi_{n}-\xi_{0}|I(|\xi_{n}-\xi_{0}|>\epsilon_{n})
ξnξ0q,P0{P0(0<|ξ0(V)τn|ϵn)}(q1)/q+P0|ξnξ0|qϵnq1\displaystyle\leq\|\xi_{n}-\xi_{0}\|_{q,P_{0}}\left\{P_{0}(0<|\xi_{0}(V)-\tau_{n}|\leq\epsilon_{n})\right\}^{(q-1)/q}+\frac{P_{0}|\xi_{n}-\xi_{0}|^{q}}{\epsilon_{n}^{q-1}}
ξnξ0q,P0ϵn(q1)/q+ξnξ0q,P0qϵnq1,\displaystyle\lesssim\|\xi_{n}-\xi_{0}\|_{q,P_{0}}\cdot\epsilon_{n}^{(q-1)/q}+\frac{\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{q}}{\epsilon_{n}^{q-1}},

where second to last relation holds by Hölder’s inequality and Markov’s inequality, and the last relation holds with probability tending to one by the assumption that the distribution of ξ0(V)\xi_{0}(V), VP0V\sim P_{0}, has a continuous finite Lebesgue density in a neighborhood of τ0\tau_{0} and Lemma S7. Taking ϵn=ξnξ0q,P0q/(q+1)\epsilon_{n}=\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{q/(q+1)} yields that |P0{I(ξn>τn)I(ξ0>τn)}(ξ0τn)|ξnξ0q,P02q/(q+1)|P_{0}\{I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})\}(\xi_{0}-\tau_{n})|\lesssim\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{2q/(q+1)}.

Next consider the bound with the L(P0)L^{\infty}(P_{0})-distance. We have that

P0|{I(ξn>τn)I(ξ0>τn)}(ξ0τn)|\displaystyle P_{0}|\{I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})\}(\xi_{0}-\tau_{n})| P0I(|ξ0τn||ξnξ0|)|ξ0τn|\displaystyle\leq P_{0}I(|\xi_{0}-\tau_{n}|\leq|\xi_{n}-\xi_{0}|)|\xi_{0}-\tau_{n}|
=P0I(0<|ξ0τn||ξnξ0|)|ξ0τn|\displaystyle=P_{0}I(0<|\xi_{0}-\tau_{n}|\leq|\xi_{n}-\xi_{0}|)|\xi_{0}-\tau_{n}|
P0I(0<|ξ0τn|ξnξ0,P0)|ξ0τn|\displaystyle\leq P_{0}I(0<|\xi_{0}-\tau_{n}|\leq\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}})|\xi_{0}-\tau_{n}|
ξnξ0,P0P0(0<|ξ0(V)τn|ξnξ0,P0)\displaystyle\leq\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}P_{0}(0<|\xi_{0}(V)-\tau_{n}|\leq\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}})
ξnξ0,P02.\displaystyle\lesssim\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}^{2}.

Therefore, the first term is upper bounded by both ξnξ0q,P02q/(q+1)\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{2q/(q+1)} and ξnξ0,P02\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}^{2}, up to an absolute constant.

Study of term 2 in (S18): Because I(ξ0(v)>τn)I(ξ0(v)>τ0)I(\xi_{0}(v)>\tau_{n})\neq I(\xi_{0}(v)>\tau_{0}) if and only if the two indicators take different signs or only one of them is zero, these indicators only take different values if |ξ0(v)τ0||τnτ0||\xi_{0}(v)-\tau_{0}|\leq|\tau_{n}-\tau_{0}|. Therefore, term 2 bounds as

P0|{I(ξ0>τn)I(ξ0>τ0)}(ξ0τ0)|\displaystyle P_{0}|\{I(\xi_{0}>\tau_{n})-I(\xi_{0}>\tau_{0})\}(\xi_{0}-\tau_{0})| P0I(|ξ0τ0||τnτ0|)|ξ0τ0|\displaystyle\leq P_{0}I(|\xi_{0}-\tau_{0}|\leq|\tau_{n}-\tau_{0}|)|\xi_{0}-\tau_{0}|
|τnτ0|P0I(|ξ0τ0||τnτ0|)\displaystyle\leq|\tau_{n}-\tau_{0}|P_{0}I(|\xi_{0}-\tau_{0}|\leq|\tau_{n}-\tau_{0}|)
|τnτ0|2,\displaystyle\lesssim|\tau_{n}-\tau_{0}|^{2},

where the last step holds for with probability tending to one by the assumption that the distribution of ξ0(V)\xi_{0}(V), VP0V\sim P_{0}, has a continuous finite Lebesgue density in a neighborhood of τ0\tau_{0} and Lemma S7. If η0>\eta_{0}>-\infty, by Lemma S9, with probability tending to one,

P0|I(ξ0>τn)I(ξ0>τ0)||ξ0τ0|{ξnξ0q,P02q/(q+1)+Op(n1),if ξnξ0q,P0=op(1)ξnξ0,P02+Op(n1),if ξnξ0,P0=op(1).P_{0}|I(\xi_{0}>\tau_{n})-I(\xi_{0}>\tau_{0})||\xi_{0}-\tau_{0}|\lesssim\begin{cases}\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{2q/(q+1)}+{\mathrm{O}}_{p}(n^{-1}),&\text{if }\|\xi_{n}-\xi_{0}\|_{q,P_{0}}={\mathrm{o}}_{p}(1)\\ \|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}^{2}+{\mathrm{O}}_{p}(n^{-1}),&\text{if }\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}={\mathrm{o}}_{p}(1)\end{cases}.

Otherwise, by Lemma S6, with probability tending to one, τn=0=τ0\tau_{n}=0=\tau_{0} and the above result still holds.

Study of term 3 in (S18): By Lemma S5,

P0|I(ξn>τn)I(ξ0>τn)|P0I(|ξnξ0|>ϵn)+P0I(|ξ0τn|ϵn).P_{0}|I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})|\leq P_{0}I(|\xi_{n}-\xi_{0}|>\epsilon_{n})+P_{0}I(|\xi_{0}-\tau_{n}|\leq\epsilon_{n}).

By a Taylor expansion of S0S_{0} around τ0\tau_{0}, similarly to the proof of Lemma S9, with probability tending to one,

P0I(|ξ0τn|ϵn)=2(S0)(τ0)ϵn+op(|τnτ0|+ϵn),P_{0}I(|\xi_{0}-\tau_{n}|\leq\epsilon_{n})=-2(S_{0})^{\prime}(\tau_{0})\epsilon_{n}+{\mathrm{o}}_{p}(|\tau_{n}-\tau_{0}|+\epsilon_{n}),

where |(S0)(τ0)|<|(S_{0})^{\prime}(\tau_{0})|<\infty. If ξnξ0q,P0=op(1)\|\xi_{n}-\xi_{0}\|_{q,P_{0}}={\mathrm{o}}_{p}(1) for some 1<q<1<q<\infty, then P0I(|ξnξ0|>ϵn)ξnξ0q,P0q/ϵnqP_{0}I(|\xi_{n}-\xi_{0}|>\epsilon_{n})\leq\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{q}/\epsilon_{n}^{q}. Taking ϵn=ξnξ0q,P0q/(q+1)\epsilon_{n}=\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{q/(q+1)} yields that |P0{I(ξn>τn)I(ξ0>τn)}|ξnξ0q,P0q/(q+1)|P_{0}\{I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})\}|\lesssim\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{q/(q+1)}. If ξnξ0,P0=op(1)\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}={\mathrm{o}}_{p}(1), then taking ϵn=ξnξ0,P0\epsilon_{n}=\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}} yields that |P0{I(ξn>τn)I(ξ0>τn)}|ξnξ0,P0|P_{0}\{I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})\}|\lesssim\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}} with probability tending to one. Also note that, by Lemma S9, if η0>\eta_{0}>-\infty, then, with probability tending to one,

|τnτ0||ηnη0|{ξnξ0q,P0q/(q+1)+Op(n1/2),if ξnξ0q,P0=op(1),ξnξ0,P0+Op(n1/2),if ξnξ0,P0=op(1).|\tau_{n}-\tau_{0}|\leq|\eta_{n}-\eta_{0}|\lesssim\begin{cases}\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{q/(q+1)}+{\mathrm{O}}_{p}(n^{-1/2}),&\text{if }\|\xi_{n}-\xi_{0}\|_{q,P_{0}}={\mathrm{o}}_{p}(1),\\ \|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}+{\mathrm{O}}_{p}(n^{-1/2}),&\text{if }\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}={\mathrm{o}}_{p}(1).\end{cases}

The same holds when η0=\eta_{0}=-\infty since then |τnτ0|=0|\tau_{n}-\tau_{0}|=0 with probability tending to one.

Therefore, with probability tending to one,

|\displaystyle| τnτ0|P0|I(ξn>τn)I(ξ0>τn)|\displaystyle\tau_{n}-\tau_{0}|P_{0}|I(\xi_{n}>\tau_{n})-I(\xi_{0}>\tau_{n})|
{ξnξ0q,P02q/(q+1)+ξnξ0q,P0q/(q+1)Op(n1/2) if ξnξ0q,P0=op(1),ξnξ0,P02+ξnξ0,P0Op(n1/2) if ξnξ0,P0=op(1).\displaystyle\lesssim\begin{cases}\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{2q/(q+1)}+\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{q/(q+1)}{\mathrm{O}}_{p}(n^{-1/2})&\text{ if }\|\xi_{n}-\xi_{0}\|_{q,P_{0}}={\mathrm{o}}_{p}(1),\\ \|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}^{2}+\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}{\mathrm{O}}_{p}(n^{-1/2})&\text{ if }\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}={\mathrm{o}}_{p}(1).\end{cases}

Conclusion of the bound in (S18): We finally combine the bounds for all three terms. Note that anOp(bn)an2+Op(bn2)a_{n}{\mathrm{O}}_{p}(b_{n})\lesssim a_{n}^{2}+{\mathrm{O}}_{p}(b_{n}^{2}) for any sequence of non-negative random variables ana_{n} and sequence of constants bnb_{n}. It follows that, with probability tending to one,

|P0{(ρnρ0)(δ0Yτ0ν0)}|{ξnξ0q,P02q/(q+1)+Op(n1),if ξnξ0q,P0=op(1),ξnξ0,P02+Op(n1),if ξnξ0,P0=op(1).|P_{0}\{({\rho}_{n}-{\rho}_{0})(\delta^{Y}_{0}-\tau_{0}\nu_{0})\}|\lesssim\begin{cases}\|\xi_{n}-\xi_{0}\|_{q,P_{0}}^{2q/(q+1)}+{\mathrm{O}}_{p}(n^{-1}),&\text{if }\|\xi_{n}-\xi_{0}\|_{q,P_{0}}={\mathrm{o}}_{p}(1),\\ \|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}^{2}+{\mathrm{O}}_{p}(n^{-1}),&\text{if }\|\xi_{n}-\xi_{0}\|_{\infty,P_{0}}={\mathrm{o}}_{p}(1).\end{cases}

S5 Additional simulations

S5.1 Results of simulation with nuisance functions being truth

In this section, we present the results of the simulation with an identical setting as that in Section 5 in the main text except that the nuisance functions are taken to be the truth rather than estimated via machine learning. The purpose of this simulation is to show that the performance of our proposed estimator may be significantly improved by using machine learning estimators of nuisance functions that outperform those used in the simulation study reported in the main text.

Table S1 presents the performance of our proposed estimator in this simulation. The Wald CI coverage is close to 95% for sample sizes of 1000 or more. The coverage of the confidence lower bounds is also close to the nominal coverage of 97.5%. Therefore, our proposed procedure appears to have the potential to be significantly improved when using improved estimators of nuisance functions. Figure S1 presents the width of our 95% Wald CI scaled by the square root of sample size nn. For each estimand, the scaled width appears to stabilize as nn grows and to be similar to the scaled width observed in the simulation reported in Section 5, where nuisance functions are estimated from data.

Table S1: Performance of estimators of average causal effects in the simulation with nuisance functions being the truth.
Performance measure Sample size FR{\mathrm{FR}} RD{\mathrm{RD}} TP{\mathrm{TP}}
95% Wald CI coverage 500 93%93\% 90%90\% 90%90\%
1000 94%94\% 94%94\% 93%93\%
4000 96%96\% 95%95\% 95%95\%
16000 94%94\% 96%96\% 95%95\%
97.5% confidence lower 500 96%96\% 96%96\% 95%95\%
bound coverage 1000 97%97\% 97%97\% 96%96\%
4000 98%98\% 97%97\% 97%97\%
16000 97%97\% 97%97\% 97%97\%
bias 500 0.00230.0023 0.00040.0004 0.00100.0010
1000 0.00130.0013 0.00080.0008 0.00070.0007
4000 0.00030.0003 0.00030.0003 0.00030.0003
16000 0.00020.0002 0.0030.003 0.00020.0002
RMSE 500 0.0480.048 0.0230.023 0.0290.029
1000 0.0330.033 0.0150.015 0.0200.020
4000 0.0160.016 0.0080.008 0.0100.010
16000 0.0090.009 0.0040.004 0.0050.005
Ratio of mean standard error 500 0.9640.964 0.9110.911 0.9280.928
to standard deviation 1000 0.9670.967 0.9980.998 0.9580.958
4000 1.0281.028 0.9830.983 0.9950.995
16000 0.9630.963 1.0071.007 0.9960.996
Refer to caption
Figure S1: Boxplot of n×\sqrt{n}\times CI width for ATE relative to each reference ITR in the simulation with nuisance functions being the truth.

S5.2 Simulation under a low dimension and a parametric model

In this section, we describe the additional simulation in a setting with a low dimension and a parametric model as well as the simulation results.

The data is generated as follows. We first generate a univariate covariate WUnif(1,1)W\sim\mathrm{Unif}(-1,1). We then generate TT, CC and YY as follows:

TW\displaystyle T\mid W\ Bernoulli(expit(W)),\displaystyle\sim\ \text{Bernoulli}\left(\operatorname{expit}(W)\right),
CT,W\displaystyle C\mid T,W\ Bernoulli(expit(2T1+W)),\displaystyle\sim\ \text{Bernoulli}\left(\operatorname{expit}(2T-1+W)\right),
YT,W\displaystyle Y\mid T,W\ Bernoulli(expit(1.4T0.70.3W)),\displaystyle\sim\ \text{Bernoulli}\left(\operatorname{expit}(1.4T-0.7-0.3W)\right),

where CC and YY are independent conditional on (W,T)(W,T). We set ρFR:v0{\rho}^{\mathrm{FR}}:v\mapsto 0, V=WV=W, and κ=0.35\kappa=0.35, which is an active constraint with τ0>0\tau_{0}>0 and ρ0RD<1{\rho}^{\mathrm{RD}}_{0}<1. We use logistic regression to estimate functions μ0T\mu^{T}_{0}, μ0C\mu^{C}_{0} and μ0C\mu^{C}_{0}. All other simulation settings are identical to that in Section 5.

The simulation results are presented in Table S2 and Figure S2. The performance is generally between the nonparametric setting in Section 5 and the oracle setting in Section S5.1. The CI coverage is much better than the nonparametric case, thus suggesting that our method might perform better with improved estimators of nuisance functions μ0T\mu^{T}_{0}, μ0C\mu^{C}_{0} and μ0C\mu^{C}_{0}.

Table S2: Performance of estimators of average causal effects in the simulation with nuisance functions in a parametric model.
Performance measure Sample size FR{\mathrm{FR}} RD{\mathrm{RD}} TP{\mathrm{TP}}
95% Wald CI coverage 500 95%95\% 83%83\% 96%96\%
1000 91%91\% 83%83\% 93%93\%
4000 94%94\% 88%88\% 93%93\%
16000 94%94\% 94%94\% 95%95\%
97.5% confidence lower 500 99%99\% 99%99\% 99%99\%
bound coverage 1000 99%99\% 99%99\% 99%99\%
4000 99%99\% 99%99\% 98%98\%
16000 98%98\% 99%99\% 99%99\%
bias 500 0.0177-0.0177 0.0161-0.0161 0.0167-0.0167
1000 0.0122-0.0122 0.0113-0.0113 0.0125-0.0125
4000 0.0037-0.0037 0.0036-0.0036 0.0035-0.0035
16000 0.0009-0.0009 0.0010-0.0010 0.0008-0.0008
RMSE 500 0.0350.035 0.0290.029 0.0400.040
1000 0.0260.026 0.0210.021 0.0290.029
4000 0.0120.012 0.0090.009 0.0130.013
16000 0.0060.006 0.0040.004 0.0060.006
Ratio of mean standard error 500 1.1221.122 0.9080.908 1.0461.046
to standard deviation 1000 0.9880.988 0.8570.857 0.9840.984
4000 0.9780.978 0.8910.891 0.9420.942
16000 0.9860.986 0.9790.979 0.9790.979
Refer to caption
Figure S2: Boxplot of n×\sqrt{n}\times CI width for ATE relative to each reference ITR in the simulation with nuisance functions in a parametric model.