Individualized treatment rules
under stochastic treatment cost constraints
Abstract
Estimation and evaluation of individualized treatment rules have been studied extensively, but real-world treatment resource constraints have received limited attention in existing methods. We investigate a setting in which treatment is intervened upon based on covariates to optimize the mean counterfactual outcome under treatment cost constraints when the treatment cost is random. In a particularly interesting special case, an instrumental variable corresponding to encouragement to treatment is intervened upon with constraints on the proportion receiving treatment. For such settings, we first develop a method to estimate optimal individualized treatment rules. We further construct an asymptotically efficient plug-in estimator of the corresponding average treatment effect relative to a given reference rule.
1 Introduction
The effect of a treatment often varies across subgroups of the population [38, 52]. When such differences are clinically meaningful, it may be beneficial to assign treatments strategically depending on subgroup membership. Such treatment assignment mechanisms are called individualized treatment rules (ITRs). A treatment rule is commonly evaluated on the basis of the mean counterfactual outcome value it generates — what is often referred to as the treatment rule’s value — and an ITR with an optimal value is called an optimal ITR. There is an extensive literature on estimation of optimal ITRs and their corresponding values using data from randomized trials or observational studies [6, 21, 25, 37, 54].
Most existing approaches for estimating ITRs do not incorporate real-world resource constraints. Without such constraints, an optimal ITR would assign the treatment to members of a subgroup provided there is any benefit for such individuals, even when this benefit is minute. In contrast, under treatment resource limits, it may be more advantageous to reserve treatment for the subgroups with the greatest benefit from treatment. This issue has received attention in recent work. Luedtke and van der Laan developed methods for estimation and evaluation of optimal ITRs with a constraint on the proportion receiving treatment [20]. Qiu et al. instead considered related problems in settings in which instrumental variables (IVs) are available [34]. In one of the settings they considered, the same resource constraint is imposed as in Luedtke and van der Laan [20], but a binary IV is used to identify optimal ITRs even when there may be unmeasured confounders. In another setting, the authors considered interventions on a causal IV, or encouragement status, and developed methods to estimate individualized encouragement (rather than treatment) rules with a constraint on the proportion receiving both encouragement and treatment [33]. They also developed nonparametrically efficient estimators of the average causal effect of optimal rules relative to a prespecified reference rule. Sun et al. considered a setting in which the cost of treatment is random and dependent upon baseline covariates; they developed methods to estimate optimal ITRs under a constraint on the expected additional treatment cost relative to control, though inference on the impact of implementing the optimal ITR in the population was not studied [41]. Sun [42] considered a related problem involving the development of optimal ITRs under resource constraints and established the asymptotic properties of the estimated optimal ITR; that method appears viable when the class of ITRs is restricted by the user a priori.
In this paper, we study estimation and inference for an optimal rule under two different cost constraints. The first is the same as that appearing in Sun et al. [41]. In contrast to earlier work on this setting, we do not constrain the class of ITRs considered, and we provide a means to obtain inference about the optimal ITR. The second constraint we consider places a cap on the total cost under the rule rather than on the incremental cost relative to control. To our knowledge, the latter problem has not previously been considered in the literature. Both of these estimation problems mirror the intervention-on-encouragement setting considered in Qiu et al. [34] but involve different constraints and a more general cost function.
As in Qiu et al. [34], the estimators that we develop are asymptotically efficient within a nonparametric model and enable the construction of asymptotically valid confidence intervals for the impact of implementing the optimal rule. We develop our estimators using similar tools — such as semiparametric efficiency theory [31, 51] and targeted minimum loss-based estimation (TMLE) [45, 49] — as were used to tackle the related problem studied in Qiu et al. [34]. Consequently, our proposed estimators are similar to those in Qiu et al. [34], and we streamline the presentation by highlighting the key similarities and focusing on the differences between these related problems and estimation schemes.
The rest of this paper is organized as follows. In Section 2, we describe the problem setup, introduce notation, and present the causal estimands along with basic causal conditions. In Section 3, we present additional causal conditions and the corresponding nonparametric identification results. In Section 4, we present our proposed estimators and their theoretical properties. In Section 5, we present a simulation illustrating the performance of our proposed estimators. We make concluding remarks in Section 6. Proofs, technical conditions, and additional simulation results can be found in the Supplementary Material.
2 Setup and objectives
To facilitate comparisons with Qiu et al. [34], we adopt similar notation as in that work. Suppose that we observe independent and identically distributed data units , where is an unknown sampling distribution. A prototypical data unit consists of the quadruplet , where is the vector of baseline covariates, is the treatment status, is the random treatment cost, and is the outcome of interest. As a convention, we assume that larger values of are preferable. We use to denote a fixed transformation of upon which we allow treatment decisions to depend. For example, may be a subset of covariates in or a summary of (e.g., BMI as a summary of height and weight). In practice, may be chosen based on prior knowledge on potential modifiers of the treatment effect as well as the cost of measuring various covariates. We distinguish between and because of their different roles. On the one hand, we will assume that the full covariate contains all confounders and thus is used to identify causal effects, while might not be sufficient for this purpose. On the other hand, some covariates in may be expensive or difficult to measure in future applications, and thus implementing an optimal ITR based on a subset of covariate may be desirable. In the rest of this paper, we will use the shorthand notation , and to refer to , and , respectively. We define an individualized (stochastic) treatment rule (ITR) to be a function that prescribes treatment with probability according to an exogenous source of randomness for an individual with covariate value . Any stochastic ITR that only takes values in is referred to as a deterministic ITR.
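To make the definition concrete, the snippet below shows how a stochastic ITR is applied: the rule maps the decision covariate to a treatment probability, and an exogenous source of randomness realizes the treatment assignment. The rule used here is an invented example for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)   # exogenous source of randomness

def example_rule(v):
    """An invented stochastic ITR: treatment probability increasing in v."""
    return 1.0 / (1.0 + np.exp(-v))

v = rng.normal(size=5)    # decision covariates for five individuals
d = example_rule(v)       # prescribed treatment probabilities, each in [0, 1]
a = rng.binomial(1, d)    # exogenous randomization realizes the treatments
# A deterministic ITR is the special case in which d(v) is always 0 or 1.
```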
In this work, we adopt the potential outcomes framework [27, 39]. For each individual, we use and to denote the potential treatment cost and potential outcome, respectively, corresponding to scenarios in which the individual has treatment status . We use to denote an expectation over the counterfactual observations and the exogenous random mechanism defining a rule, and to denote an expectation over observables alone under sampling from . We make the usual Stable Unit Treatment Value Assumption (SUTVA).
Condition A1 (Stable Unit Treatment Value Assumption).
The counterfactual data unit of one individual is unaffected by the treatment assigned to other individuals, and there is only a single version of the treatment, so that implies that and .
Remark 1.
The ITRs we consider are not truly individualized, because they are based on the value of covariate rather than each individual’s unique potential treatment effects and . Nevertheless, depending on the resolution of , these ITRs can be considerably more individualized than assigning everyone to either treatment or control. In this paper, we adopt the conventional nomenclature and refer to the treatment rules we study as ITRs [see, e.g., 5, 7, 14, 18, 19, 29, 32, 40, 47, 54, 55, 57].
We define and to be the counterfactual treatment cost and outcome, respectively, for an ITR under an exogenous random mechanism. We note that if for an individual with covariate , an exogenous random mechanism is used to randomly assign treatment with probability and thus and are random for this given individual. If were implemented in the population, then the population mean outcome would be , where we use to denote expectation under the true data-generating mechanism involving potential outcomes and exogenous randomness in . We consider a generic treatment resource constraint requiring that a convex combination of the population average treatment cost and the population average additional treatment cost compared to control be no greater than a specified constant . Consequently, an optimal ITR under this constraint is a solution in to
maximize | (1) |
Here, is also a constant specified by the investigator. Natural choices of are , corresponding to a constraint on the population average additional treatment cost compared to control, and , corresponding to a constraint on the population average treatment cost. The first choice may be preferred when the control treatment corresponds to the current standard of care and a limited budget is available to fund the novel treatment to some patients. The second choice may be more relevant when both treatment and control incur treatment costs.
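Because the display above did not survive extraction, the following is a sketch of the constrained problem it describes, written in assumed notation that is ours rather than the paper's: $d$ denotes the ITR, $Y(d)$ and $C(d)$ the counterfactual outcome and treatment cost under $d$, $C(0)$ the cost under control, $\alpha \in [0,1]$ the investigator-specified mixing weight, and $\kappa$ the budget.

```latex
% Sketch of problem (1); all symbols are assumptions, not the paper's notation.
\begin{equation*}
  \underset{d}{\text{maximize}} \quad \mathbb{E}^{*}\!\left[Y(d)\right]
  \qquad \text{subject to} \qquad
  \alpha\, \mathbb{E}^{*}\!\left[C(d)\right]
  + (1-\alpha)\, \mathbb{E}^{*}\!\left[C(d) - C(0)\right] \;\le\; \kappa .
\end{equation*}
```

In this parametrization, the mixing weight at one extreme yields a cap on the population average treatment cost and at the other a cap on the average additional cost relative to control, matching the two natural choices just described.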
Remark 2.
Our setup is similar to that in Qiu et al. [34] if we view and defined here as the instrumental variable/encouragement and treatment status defined in those prior works, respectively. However, the constraint in our setup is different from the constraint considered previously. In IV settings, the constraint in (1) with is useful when assigning treatment always incurs a cost, regardless of whether encouragement is applied, such as in distributing a limited supply of an expensive drug within a health system based on the results of a randomized clinical trial. It is instead useful with when no encouragement is present under the standard of care but intervention on the encouragement is of interest when additional treatment resources are available. The constraint considered in Qiu et al. [34, 33] was instead useful in cases in which treatment only incurs a cost when paired with encouragement, such as when housing vouchers are used to encourage individuals to live in a certain area. In the general setting in which is viewed as treatment status and as a random treatment cost, the constraint in (1) with is identical to that considered in Sun et al. [41] — we refer the readers to these works for a more in-depth discussion of the relation between the current problem setup and IV settings.
To evaluate an optimal ITR , we follow Qiu et al. [34] in considering three types of reference ITRs and develop methods for statistical inference on the difference in the mean counterfactual outcome between and a reference ITR . The first type of reference ITR considered, denoted by (=fixed rule), is any fixed ITR that may be specified by the investigator before the study. When , it is usually most reasonable to consider the rule that always assigns control, namely , because the constraint in (1) may arise due to limited funding for implementing treatment whereas the standard of care rule is to always assign control. The second type, denoted by (=random), prescribes treatment completely at random to individuals regardless of their baseline covariates. The probability of prescribing treatment is chosen such that the treatment resource is saturated (i.e., all available resources are used) or all individuals receive treatment, if such a probability exists. Symbolically, this ITR is given by under the condition that and . Although has the same interpretation as the corresponding encouragement rule in Qiu et al. [34], its mathematical expression is different due to the different resource constraint. This rule may be of interest if it is known a priori that treatment is harmless. The third type, denoted by (=true propensity), prescribes treatment according to the true propensity of the treatment implied by the study sampling mechanism , so that equals . This ITR may be of interest in two settings. In one setting, satisfies the treatment resource constraint. The investigator may wish to determine the extent to which the implementation of an optimal ITR would improve upon the standard of care. In the other setting, the treatment resource constraint is newly introduced and the standard of care ITR may lead to overuse of treatment resources. The investigator may then be interested in whether the implementation of an optimal constrained ITR would result, despite the new resource constraint, in a noninferior mean outcome.
3 Identification of causal estimands
In this section, we present nonparametric identification results. Though these results are similar to those for individualized encouragement rules in Qiu et al. [34], there are two key differences. First, the form of some of the conditions in Qiu et al. [34] need to be modified to account for the novel resource constraint considered here. Second, two additional conditions are needed to overcome challenges that arise due to this new constraint.
We first introduce notation that will be useful when presenting our identification results and our proposed estimators. For any observed-data distribution , we define pointwise the conditional mean functions and , where we use to denote an expectation over observables alone under sampling from , and their corresponding contrasts due to different treatment status, and . We also define the average of these contrasts conditional on as and , and the propensity to receive treatment . Additionally, we define , . These quantities play an important role in tackling the problem at hand. Throughout the paper, for ease of notation, if is a quantity or operation indexed by distribution , we may denote by . As an example, we may use to denote .
We now introduce additional causal conditions we will require: positivity and unconfoundedness. In one form or another, these conditions commonly appear in the causal inference literature [49], including in the IV literature [1, 15, 43, 53].
Condition A2 (Strong positivity).
There exists a constant such that holds for -almost every .
Condition A3 (Unconfoundedness of treatment).
For each , and are conditionally independent given for -almost every .
Equipped with these conditions, we are able to state a theorem on the nonparametric identification of the mean counterfactual outcomes and average treatment effect (ATE) — these results can be viewed as a corollary of the well-known G-formula [36].
Theorem 1 (Identification of ATE and expected treatment resource expenditure).
In view of Theorem 1, the objective function in (1) can be identified as
and, similarly, the expected cost is identified as . It follows that the optimization problem (1) is equivalent to
maximize | (2) |
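For concreteness, here is a sketch of the identified quantities entering (2), again in assumed notation: $\mu_Y(a,w) = E_P[Y \mid A=a, W=w]$ and $\mu_C(a,w) = E_P[C \mid A=a, W=w]$ denote the outcome and cost regressions, and $d(V)$ the treatment probability assigned by the rule; the paper's own symbols may differ.

```latex
% G-formula identification sketch (assumed notation; see lead-in).
\begin{align*}
  \mathbb{E}^{*}\!\left[Y(d)\right]
    &= \mathbb{E}_{P}\!\left[ d(V)\,\mu_{Y}(1,W) + \{1 - d(V)\}\,\mu_{Y}(0,W) \right], \\
  \mathbb{E}^{*}\!\left[C(d)\right]
    &= \mathbb{E}_{P}\!\left[ d(V)\,\mu_{C}(1,W) + \{1 - d(V)\}\,\mu_{C}(0,W) \right].
\end{align*}
```

Problem (2) then maximizes the first expression over rules $d$ subject to the identified analog of the cost constraint, with every term now a functional of the observed-data distribution $P$.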
This differs from Equation 3 defining optimal individualized encouragement rules in Qiu et al. [34]. We now present two additional conditions so that (2) is a fractional knapsack problem [9], thereby allowing us to use existing results from the optimization literature. These conditions are similar to those in Sun et al. [41].
Condition A4 (Strictly costlier treatment).
There exists a constant such that holds for -almost every .
Condition A5 (Financial feasibility of assigning treatment).
The inequality holds.
Condition A4 is reasonable if treatment is more expensive than control. When applied to an IV setting as outlined in Remark 2, this condition corresponds to the assumption that the IV is indeed an encouragement to take treatment. This condition is slightly stronger than its counterpart in Sun et al. [41], which only requires that . This stronger condition is needed to ensure the asymptotic linearity of our proposed estimator in Section 4. Under Condition A4, it is evident that Condition A5 is reasonable because if , then no ITR satisfies the treatment resource constraint in view of the fact that , whereas if , then only the trivial ITR satisfies the constraint and there is no need to estimate an optimal ITR.
Under these two additional conditions, (2) is a fractional knapsack problem [9] in which every subgroup defined by a different value of corresponds to a different ‘item’. A solution in the special case in which and was given in Theorem 1 of Sun et al. [41]. We now state a more general result with the following differences: (i) the treatment decision may be based on a summary rather than the entire covariate vector , and (ii) may take any value in rather than only zero. We also explicitly state the randomization probability at the boundary for completeness and clarity. Despite these differences, the result we obtain is similar to Theorem 1 in Sun et al. [41]. Define pointwise , and write and .
Theorem 2 (Optimal ITR).
We note that the reference ITRs introduced in Section 2 are also identified under the above conditions. In particular, it can be shown that and .
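To convey the structure of the solution characterized by Theorem 2, the sketch below solves a discretized fractional knapsack problem greedily: subgroups are treated in decreasing order of estimated benefit-to-cost ratio until the budget is exhausted, with randomization in the boundary subgroup. The function and argument names are illustrative, subgroups are given equal mass, and the mixing-weight and truncation-at-zero details of the theorem are omitted.

```python
import numpy as np

def greedy_fractional_rule(benefit, cost, budget):
    """Greedy solution of a fractional knapsack problem.

    benefit[i]: estimated gain from treating subgroup i (a contrast of
        conditional mean outcomes); cost[i]: expected resource use when
        treating subgroup i; budget: total resource available.
    Returns an array of treatment probabilities in [0, 1], one per subgroup.
    Illustrative only: subgroups have equal mass, and the mixing-weight and
    truncation-at-zero details of Theorem 2 are omitted.
    """
    benefit, cost = np.asarray(benefit, float), np.asarray(cost, float)
    d = np.zeros_like(benefit)
    order = np.argsort(-benefit / cost)   # highest benefit-to-cost ratio first
    remaining = budget
    for i in order:
        if cost[i] <= remaining:          # budget covers this subgroup fully
            d[i], remaining = 1.0, remaining - cost[i]
        else:                             # boundary subgroup: randomize
            d[i] = remaining / cost[i]
            break
    return d

# Toy usage: three subgroups, budget covering roughly half of the total cost.
print(greedy_fractional_rule(benefit=[0.3, 0.1, 0.2], cost=[1.0, 1.0, 2.0], budget=2.0))
```

The threshold implicit in this greedy pass is the boundary subgroup's benefit-to-cost ratio, which is the quantity that the estimation procedure in Section 4 calibrates.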
4 Estimating and evaluating optimal individualized treatment rules
In this section, we present an estimator of an optimal ITR and an inferential procedure for its ATE relative to a reference ITR , where is any of , or . The proposed procedure is an adaptation of the method first proposed in Qiu et al. [34, 33].
We begin by introducing some notation that is useful for defining the estimands. We define the parameter or for each ITR and distribution , depending on whether the domain of is or . Here, we consider the model to be locally nonparametric at [31]. For , the ATE of an optimal ITR relative to a reference ITR equals . We are interested in making inference about , where we have suppressed dependence on from our shorthand notation.
4.1 Pathwise differentiability of the ATE
We first present a result regarding the pathwise differentiability of the ATE. Pathwise differentiability of the parameter of interest serves as the foundation for constructing asymptotically efficient estimators of this parameter, based on which an inferential procedure may be developed. Additional technical conditions are required and are provided in Section S1 in the Supplementary Material. For a distribution , a function , an ITR , and a decision threshold , we define pointwise the following functions:
(3) |
One key condition we rely on is the following non-exceptional law assumption.
Condition B1 (Non-exceptional law).
.
Under this condition, the true optimal ITR is identical to an indicator function. If all covariates are discrete, then we can plug the empirical estimates into the identification formulae in Theorems 1–2 and show that the resulting estimators of the ATE are asymptotically normal by the delta method, even when Condition B1 does not hold. We do not further pursue this simple case in this paper, and thus need to rely on the non-exceptional law assumption, namely Condition B1, to account for continuous covariates. We list additional technical conditions in Supplement S1.
We can now provide a formal result describing the pathwise differentiability of the ATE parameter.
Theorem 3 (Pathwise differentiability of the ATE).
We note that the pathwise differentiability of was established in Theorem 3 of Qiu et al. [34] for . The other results can be proven using similar techniques. The proofs of these results are provided in Supplement S4.2. In view of Theorem 3, it follows that the ATE parameter is pathwise differentiable at with nonparametric canonical gradient
(4) |
for .
Remark 3.
Similar additional terms related to the resource being used appear in the canonical gradients of the mean counterfactual outcome or ATE of optimal ITRs under resource constraints, for example, in Luedtke and van der Laan [20] and Qiu et al. [34]. In our problem, this additional term is
Such terms appear to come from solving a fractional knapsack problem with truncation at zero and take the form of a product of (i) the threshold in the solution, and (ii) a term that equals the influence function of the resource being used under the solution when the resource is saturated. We conjecture that such structures generally exist for fractional knapsack problems.
4.2 Proposed estimator and asymptotic linearity
We next present our proposed nonparametric procedure for estimating an optimal ITR and the corresponding ATE . We will generally use subscript to denote an estimator with sample size , and add a hat to a nuisance function estimator that is targeted toward estimating .
1. Use the empirical distribution of as an estimate of the true marginal distribution of . Compute estimates , , , and of , , , and , respectively, using flexible regression methods. Recall that , , , and . Define pointwise .
2. Estimate an optimal ITR:
   (a) Estimate with a one-step correction estimator
   (b) Let , and . For any , define , and
       The rule is the sample analog of an ITR that prescribes treatment to those with the highest values of , regardless of whether treatment is harmful or not, until treatment resources run out.
   (c) Compute , which is used to define an estimate of for which the plug-in estimator is asymptotically linear under conditions, as follows:
       - if and there is a solution in to
         (5) then take to be this solution;
       - otherwise, set .
   (d) Estimate using the sample analog of with treatment resource constraint , namely
3. Obtain an estimate of the reference ITR as follows:
   - For , take to be .
   - For ,
     (a) obtain a targeted estimate of : run an ordinary least-squares linear regression with outcome , covariate , offset and no intercept. Take to be the fitted mean model;
     (b) take to be the constant function , where we define pointwise .
   - For , take to be .
4. Estimate the ATE of relative to the reference ITR with a targeted minimum loss-based estimator (TMLE) :
   (a) obtain a targeted estimate of : run an ordinary least-squares linear regression with outcome , covariate , offset and no intercept. Take to be the fitted mean function.
   (b) with being any distribution with components and , take
       where is defined as or depending on the covariate used by the reference ITR.
The above procedure is similar to that proposed in Qiu et al. [34]. One key difference is the use of the refined estimator of obtained via the estimating equation (5), which is essential for ensuring the asymptotic linearity of . Another difference is that the denominator of is now , consistent with our different definition of the unit value for solving the fractional knapsack problem (2). As with TMLE for other problems, when or has known bounds (e.g., the closed interval ), we may use logistic regression rather than ordinary least-squares to obtain a corresponding targeted estimate that respects the known bounds [12].
The above procedure has both similarities and substantial differences compared to the estimation procedure proposed by Sun et al. [41]. The main difference is that our procedure is targeted towards efficient estimation of, and inference about, the ATE of the optimal ITR under a nonparametric model, while Sun et al. [41] focus on estimating the optimal ITR and do not evaluate it. This leads to a key difference between the two procedures when estimating the optimal ITR: we need to solve an estimating equation (5), which is crucial to ensuring that the estimator is asymptotically linear, while Sun et al. [41] do not. The requirement of solving (5) is related to the nature of the fractional knapsack problem discussed in Remark 3, and we conjecture that such a calibration of the resource used is necessary for general problems of this nature. Our procedure is also related to the method in Sun [42], which relies on the availability of asymptotically normal estimators of both the average benefit and the average resource used (Assumption 2.4), a nontrivial requirement when the propensity score is unknown in observational studies. Our procedure essentially produces such estimators: in Step 4, an asymptotically normal estimator of the ATE is constructed, whereas an asymptotically normal estimator of the expected resource is produced in Step 2 and used to calibrate the resource expenditure of the estimated optimal ITR in Step 2(c).
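To illustrate the calibration step in Step 2(c), here is a minimal bisection sketch for an empirical resource-saturation equation: it searches for a threshold at which the average spending of the thresholded rule matches the budget. The function and argument names are placeholders, and the actual estimating equation (5) contains correction terms and special cases not reproduced here.

```python
import numpy as np

def calibrate_threshold(ratio, inc_cost, budget, lo=-10.0, hi=10.0, iters=60):
    """Bisection for a threshold tau at which the empirical resource use of
    the rule 'treat if ratio > tau' matches the available budget.

    ratio[i]: estimated benefit-to-cost ratio for unit i; inc_cost[i]:
    estimated (incremental) cost of treating unit i; budget: per-capita
    resource limit.  Placeholder only: the paper's estimating equation (5)
    contains correction terms and special cases not reproduced here.
    """
    ratio, inc_cost = np.asarray(ratio, float), np.asarray(inc_cost, float)

    def excess(tau):  # average spending of the thresholded rule minus the budget
        return float(np.mean(inc_cost * (ratio > tau)) - budget)

    if excess(lo) <= 0:     # treating (almost) everyone already fits the budget
        return lo
    # excess() is nonincreasing in tau; assume hi is large enough that no unit
    # is treated at tau = hi, so a sign change is bracketed by [lo, hi].
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if excess(mid) > 0 else (lo, mid)
    return hi
```

As in Step 2(c), such a root would only be sought when the uncalibrated rule would overspend; otherwise the threshold retains its default value.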
Remark 4.
In Step 1 of the above procedure, we estimate the functions and using a naïve approach based on outcome regression. It is viable to use more advanced techniques, such as the doubly robust methods in van der Laan and Rubin [45], van der Laan and Luedtke [46], Luedtke and van der Laan [22], and Kennedy [17], or R-learning as in Nie and Wager [28]. These methods were developed for conditional average treatment effect estimation and might lead to better estimators of and . It is also possible to develop multiply robust methods to estimate using influence function techniques. Such methods to estimate are beyond the scope of our paper, whose main focus is on inference for the ATE. Our theoretical analysis of the estimator only applies to naïve estimators based on outcome regression, but we expect only minor modifications to be required to study these more advanced estimators once their asymptotic behavior is characterized.
Remark 5.
In Step 2(a), it is also viable to use other efficient estimators of , for example, a targeted minimum loss-based estimator (TMLE). We note that estimating is only one component of estimating the optimal ITR . Methods such as TMLE can be preferable to ensure that the estimator respects known bounds on the estimand. However, in our case, such an improvement in estimating does not necessarily lead to an improvement in the estimation of .
We now present results on the asymptotic linearity and efficiency of our proposed estimator. We state and discuss the technical conditions required by the theorem below in Supplement S1.
Theorem 4 (Asymptotic linearity of ATE estimator).
To conduct inference about , we can directly plug the estimators of nuisance functions into to obtain a consistent estimator of , and then take the sample variance to obtain a consistent estimator of the asymptotic variance . The proof of Theorem 4 can be found in Supplements S4.3 and S4.4.
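As a concrete illustration of this inferential step, the sketch below forms a two-sided Wald confidence interval from estimated influence function values; the argument if_values stands for the estimated canonical gradient evaluated at the observations, an interface assumption rather than the paper's code.

```python
import numpy as np
from scipy import stats

def wald_ci(estimate, if_values, level=0.95):
    """Wald confidence interval for an asymptotically linear estimator.

    estimate: the point estimate (e.g., the TMLE of the ATE);
    if_values: estimated influence function evaluated at each observation.
    The standard error is the sample standard deviation of the influence
    function values divided by the square root of the sample size.
    """
    if_values = np.asarray(if_values, float)
    se = if_values.std(ddof=1) / np.sqrt(if_values.shape[0])
    z = stats.norm.ppf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se, se
```

A one-sided 97.5% lower confidence bound, as examined in the simulation study, is obtained as estimate - stats.norm.ppf(0.975) * se.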
Remark 6.
Remark 7.
We note that, unlike in Qiu et al. [34] where the bound lies in due to the binary nature of treatment status, the methods we propose here do not require knowledge of an upper bound on treatment costs. When such a bound is indeed known (e.g., one), our methods may still be applied as long as all special cases corresponding to or in Section 4 are replaced by being equal to or less than the known bound, respectively.
5 Simulation
5.1 Simulation setting
In this simulation study, we investigate the performance of our proposed estimator of the ATE of an optimal ITR relative to specified reference ITRs. We focus here on the setting . This scenario is more difficult than the case because it requires the estimation of .
We generate data from a model in which the treatment is an IV and the treatment cost and outcome are both binary. This data-generating mechanism satisfies all causal conditions and has an unobserved confounder between treatment cost and outcome. We first generate a trivariate covariate , where , and are mutually independent. We also simulate an unobserved treatment-outcome confounder independently of , and then simulate , , and as follows:
We introduce in the data-generating mechanism to emphasize that we do not require assumptions on the joint distribution of treatment cost and outcome conditional on covariates. We consider all three reference ITRs , where we set . We set , which is an active constraint with and .
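The displayed formulas of the data-generating mechanism did not survive extraction. Purely to illustrate the qualitative structure described above (three mutually independent covariates, an unmeasured confounder shared by the treatment cost and the outcome, and binary treatment, cost, and outcome), here is a hypothetical generator; every coefficient below is invented and is not the mechanism used in the paper.

```python
import numpy as np
from scipy.special import expit

def generate_data(n, seed=None):
    """Hypothetical data generator mirroring the qualitative structure in the
    text: trivariate covariates, an unmeasured confounder U shared by the
    treatment cost C and the outcome Y, and binary A, C, Y.
    All coefficients are invented for illustration."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, 3))              # three mutually independent covariates
    U = rng.normal(size=n)                   # unmeasured cost-outcome confounder
    A = rng.binomial(1, expit(0.2 * W[:, 0] - 0.3 * W[:, 1]))              # treatment
    C = rng.binomial(1, expit(-1.0 + 1.5 * A + 0.4 * W[:, 2] + 0.5 * U))   # cost
    Y = rng.binomial(1, expit(0.3 * C + 0.2 * W[:, 0] - 0.5 * U))          # outcome
    return W, A, C, Y
```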
The ITRs we consider are based on all covariates — that is, we take . We estimate the nuisance functions using the Super Learner [48] with library including a logistic regression, generalized additive model with logit link [13], gradient boosting machine [10, 11, 23, 24], support vector machine [2, 8] and neural network [3, 35]. Because none of the nuisance functions follow a logistic regression model, the resulting ensemble learner is not expected to achieve the parametric convergence rate. Since both and are binary, we use logistic regression rather than ordinary least-squares to obtain their corresponding targeted estimates in Section 4.2. We consider sample size , and run 1000 Monte Carlo repetitions for each sample size. We implement the algorithm that incorporates cross-fitting discussed in Remark 6 and described in Section S3 in the Supplementary Material.
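The nuisance regressions were fit with the Super Learner [48]; as a rough Python analogue, the sketch below builds a cross-validated stacked ensemble with scikit-learn whose library loosely mirrors the one listed above (no generalized additive model learner is included, and this is not the authors' implementation).

```python
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def make_stacked_learner(cv=10):
    """Cross-validated stacked ensemble, loosely analogous to a Super Learner,
    for a binary nuisance regression such as the propensity score."""
    learners = [
        ("logit", LogisticRegression(max_iter=1000)),
        ("gbm", GradientBoostingClassifier()),
        ("svm", SVC(probability=True)),
        ("nnet", MLPClassifier(max_iter=2000)),
    ]
    # The meta-learner combines cross-validated predictions from each learner.
    return StackingClassifier(
        estimators=learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=cv,
        stack_method="predict_proba",
    )

# Example: pi_hat = make_stacked_learner().fit(W, A).predict_proba(W)[:, 1]
```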
To evaluate the performance of our proposed estimator, we investigate the bias and root mean squared error (RMSE) of the estimator. We also investigate the coverage probability and the width of nominal 95% Wald CIs constructed using influence function-based standard error estimates. We further investigate the probability that our confidence lower limit falls below the true ATE, that is, the coverage probability of the 97.5% Wald confidence lower bound.
5.2 Simulation results
Table 1 presents the performance of our proposed estimator in this simulation. For sample sizes 500, 1000 and 4000, the CI coverage of our proposed method is lower than the nominal 95% level. When the sample size is larger (16000), the CI coverage of our proposed method increases to 90–93%. The coverage of the confidence lower bounds is much closer to nominal (97.5%) for all sample sizes considered, though, and is approximately nominal when the sample size is large. For all reference ITRs, the bias and RMSE of our proposed estimator appear to converge to zero faster than and at the same rate as the square root of sample size, respectively. All biases are negative, which is expected in view of Remark 6. All standard errors underestimate the variation of the estimator, with the degree of underestimation decreasing as sample size increases.
Table 1: Performance of the proposed estimator for each reference ITR at sample sizes 500, 1000, 4000 and 16000: 95% Wald CI coverage, 97.5% confidence lower bound coverage, bias, RMSE, and ratio of mean standard error to standard deviation.
Figure 1 presents the width of the Wald CIs scaled by the square root of sample size . Our theory indicates that the CI width should shrink at a root- rate, and our simulation results are consistent with this. There are some outlying cases of extremely wide or narrow CIs. This is expected for small sample sizes because the estimator of in Theorem 4 resembles a sample mean and might not be close to with high probability when sample size is small. In practice, this issue might be slightly mitigated by fine-tuning the involved machine learning algorithms.
Figure 1: Widths of nominal 95% Wald CIs scaled by the square root of sample size.
As indicated in Theorem 4, theoretical guarantees for the validity of the Wald CIs rely on the nuisance function estimators converging to the truth sufficiently quickly. It appears that the undercoverage of our Wald CIs in small samples may owe, in part, to poor estimation of these nuisance functions. To illustrate how our procedure may perform with improved small-sample nuisance function estimators, we conducted two additional simulations: one is identical to those reported earlier in all ways except that the nuisance function estimators , and are taken to be equal to the truth; the other is a simpler scenario with a lower covariate dimension and a parametric model. The results are presented in Section S5 in the Supplementary Material and suggest that our proposed estimator may achieve significantly better performance with improved machine learning estimators of the nuisance functions. This motivates seeking ways to optimize the finite-sample performance of the nuisance function estimators employed in future applications of the proposed method, possibly based on prior subject-matter expertise. The underestimation of standard errors in this simulation also motivates future work exploring whether there are standard error estimators with better finite-sample performance, for example, estimators based on the bootstrap.
6 Conclusion
There is an extensive literature on estimating optimal ITRs and evaluating their performance. Among these works, only a few incorporate treatment resource constraints. In this paper, we build upon Sun et al. [41] and study the problem of estimating optimal ITRs under treatment cost constraints when the treatment cost is random. Using techniques similar to those in Qiu et al. [34], we have proposed novel methods to estimate an optimal ITR and to draw inference about the corresponding average treatment effect relative to a prespecified reference ITR under a locally nonparametric model. Our methods may also be applied to the instrumental variable (IV) settings studied in Qiu et al. [34] when the IV is intervened upon.
References
- Abadie [2003] Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics 113(2), 231–263.
- Bennett and Campbell [2000] Bennett, K. P. and C. Campbell (2000). Support vector machines: hype or hallelujah? SIGKDD Explor. Newsl. 2(2), 1–13.
- Bishop [1995] Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
- Bolthausen et al. [2002] Bolthausen, E., E. Perkins, and A. van der Vaart (2002). Lectures on Probability Theory and Statistics: Ecole D’Eté de Probabilités de Saint-Flour XXIX-1999, Volume 1781 of Lecture Notes in Mathematics. Berlin, Heidelberg: Springer Science & Business Media.
- Butler et al. [2018] Butler, E. L., E. B. Laber, S. M. Davis, and M. R. Kosorok (2018). Incorporating Patient Preferences into Estimation of Optimal Individualized Treatment Rules. Biometrics 74(1), 18–26.
- Chakraborty and Moodie [2013] Chakraborty, B. and E. E. Moodie (2013). Statistical Methods for Dynamic Treatment Regimes. Statistics for Biology and Health. New York, NY: Springer New York.
- Chen et al. [2018] Chen, J., H. Fu, X. He, M. R. Kosorok, and Y. Liu (2018). Estimating individualized treatment rules for ordinal treatments. Biometrics 74(3), 924–933.
- Cortes and Vapnik [1995] Cortes, C. and V. Vapnik (1995). Support-vector networks. Machine Learning 20(3), 273–297.
- Dantzig [1957] Dantzig, G. B. (1957). Discrete-variable extremum problems. Operations research 5(2), 266–288.
- Friedman [2001] Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Technical Report 5.
- Friedman [2002] Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis 38(4), 367–378.
- Gruber and Van Der Laan [2010] Gruber, S. and M. J. Van Der Laan (2010). A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. International Journal of Biostatistics 6(1).
- Hastie and Tibshirani [1990] Hastie, T. and R. Tibshirani (1990). Generalized additive models. Chapman and Hall.
- Imai and Li [2021] Imai, K. and M. L. Li (2021). Experimental Evaluation of Individualized Treatment Rules. Journal of the American Statistical Association.
- Imbens and Angrist [1994] Imbens, G. W. and J. D. Angrist (1994). Identification and Estimation of Local Average Treatment Effects. Econometrica 62(2), 467–475.
- Kennedy [2016] Kennedy, E. H. (2016). Semiparametric theory and empirical processes in causal inference. In Statistical causal inferences and their applications in public health research, pp. 141–167. Springer.
- Kennedy [2020] Kennedy, E. H. (2020). Towards optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497v3.
- Laber and Zhao [2015] Laber, E. and Y. Zhao (2015). Tree-based methods for individualized treatment regimes. Biometrika 102(3), 501–514.
- Lei et al. [2012] Lei, H., I. Nahum-Shani, K. Lynch, D. Oslin, and S. A. Murphy (2012). A ”SMART” design for building individualized treatment sequences. Annual Review of Clinical Psychology 8, 21–48.
- Luedtke and van der Laan [2016a] Luedtke, A. R. and M. J. van der Laan (2016a). Optimal Individualized Treatments in Resource-Limited Settings. International Journal of Biostatistics 12(1), 283–303.
- Luedtke and van der Laan [2016b] Luedtke, A. R. and M. J. van der Laan (2016b). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics 44(2), 713–742.
- Luedtke and van der Laan [2016c] Luedtke, A. R. and M. J. van der Laan (2016c). Super-Learning of an Optimal Dynamic Treatment Rule. International Journal of Biostatistics 12(1), 305–332.
- Mason et al. [1999] Mason, L., J. Baxter, P. Bartlett, and M. Frean (1999). Boosting Algorithms as Gradient Descent in Function Space. Technical report.
- Mason et al. [2000] Mason, L., J. Baxter, P. L. Bartlett, and M. Frean (2000). Boosting Algorithms as Gradient Descent. Technical report.
- Murphy [2003] Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2), 331–355.
- Newey and Robins [2018] Newey, W. K. and J. R. Robins (2018). Cross-Fitting and Fast Remainder Rates for Semiparametric Estimation. arXiv preprint arXiv:1801.09138v1.
- Neyman [1923] Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essay des principles. (Excerpts reprinted and translated to English, 1990). Statistical Science 5, 463–472.
- Nie and Wager [2021] Nie, X. and S. Wager (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 108(2), 299–319.
- Petersen et al. [2007] Petersen, M. L., S. G. Deeks, and M. J. van der Laan (2007). Individualized treatment rules: Generating candidate clinical trials. Statistics in Medicine 26(25), 4578–4601.
- Pfanzagl [1982] Pfanzagl, J. (1982). Contributions to a General Asymptotic Statistical Theory, Volume 13 of Lecture Notes in Statistics. New York, NY: Springer New York.
- Pfanzagl [1990] Pfanzagl, J. (1990). Estimation in semiparametric models. In Estimation in Semiparametric Models, pp. 17–22. Springer.
- Qian and Murphy [2011] Qian, M. and S. A. Murphy (2011). Performance guarantees for individualized treatment rules. The Annals of Statistics 39(2), 1180.
- Qiu et al. [2021a] Qiu, H., M. Carone, E. Sadikova, M. Petukhova, R. C. Kessler, and A. Luedtke (2021a). Correction to: “optimal individualized decision rules using instrumental variable methods”. Journal of the American Statistical Association (just-accepted), 1–2.
- Qiu et al. [2021b] Qiu, H., M. Carone, E. Sadikova, M. Petukhova, R. C. Kessler, and A. Luedtke (2021b). Optimal individualized decision rules using instrumental variable methods. Journal of the American Statistical Association 116(533), 174–191.
- Ripley [2014] Ripley, B. D. (2014). Pattern recognition and neural networks. Cambridge University Press.
- Robins [1986] Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Mathematical modelling 7(9-12), 1393–1512.
- Robins [2004] Robins, J. M. (2004). Optimal Structural Nested Models for Optimal Sequential Decisions. pp. 189–326. Springer, New York, NY.
- Rothwell [2005] Rothwell, P. M. (2005). Subgroup analysis in randomised controlled trials: Importance, indications, and interpretation. Lancet 365(9454), 176–186.
- Rubin [1974] Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), 688–701.
- Song et al. [2015] Song, R., M. Kosorok, D. Zeng, Y. Zhao, E. Laber, and M. Yuan (2015). On sparse representation for optimal individualized treatment selection with penalized outcome weighted learning. Stat 4(1), 59–68.
- Sun et al. [2021] Sun, H., S. Du, and S. Wager (2021). Treatment Allocation under Uncertain Costs. arXiv preprint arXiv:2103.11066v1.
- Sun [2021] Sun, L. (2021). Empirical Welfare Maximization with Constraints. arXiv preprint arXiv:2103.15298v1.
- Tchetgen Tchetgen and Vansteelandt [2013] Tchetgen Tchetgen, E. J. and S. Vansteelandt (2013). Alternative Identification and Inference for the Effect of Treatment on the Treated with an Instrumental Variable. Harvard University Biostatistics Working Paper Series.
- van der Laan [2017] van der Laan, M. (2017). A Generally Efficient Targeted Minimum Loss Based Estimator based on the Highly Adaptive Lasso. International Journal of Biostatistics 13(2).
- van der Laan and Rubin [2006] van der Laan, M. and D. Rubin (2006). Targeted Maximum Likelihood Learning. The International Journal of Biostatistics 2(1).
- van der Laan and Luedtke [2014] van der Laan, M. J. and A. R. Luedtke (2014). Targeted Learning of the Mean Outcome under an Optimal Dynamic Treatment Rule. Journal of Causal Inference 3(1).
- van der Laan and Petersen [2007] van der Laan, M. J. and M. L. Petersen (2007). Causal effect models for realistic individualized treatment and intention to treat rules. International Journal of Biostatistics 3(1).
- van der Laan et al. [2007] van der Laan, M. J., E. C. Polley, and A. E. Hubbard (2007). Super Learner. Statistical Applications in Genetics and Molecular Biology 6(1).
- van der Laan and Rose [2018] van der Laan, M. J. and S. Rose (2018). Targeted Learning in Data Science.
- van der Vaart and Wellner [2000] van der Vaart, A. and J. Wellner (2000). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer.
- van der Vaart [1998] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.
- Varadhan et al. [2013] Varadhan, R., J. B. Segal, C. M. Boyd, A. W. Wu, and C. O. Weiss (2013). A framework for the analysis of heterogeneity of treatment effect in patient-centered outcomes research. Journal of Clinical Epidemiology 66(8), 818–825.
- Wang and Tchetgen Tchetgen [2018] Wang, L. and E. Tchetgen Tchetgen (2018). Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(3), 531–550.
- Zhao et al. [2012] Zhao, Y., D. Zeng, A. J. Rush, and M. R. Kosorok (2012). Estimating Individualized Treatment Rules Using Outcome Weighted Learning. Journal of the American Statistical Association 107(499), 1106–1118.
- Zhao et al. [2015] Zhao, Y. Q., D. Zeng, E. B. Laber, R. Song, M. Yuan, and M. R. Kosorok (2015). Doubly robust learning for estimating individualized treatment with censored data. Biometrika 102(1), 151–168.
- Zheng and van der Laan [2011] Zheng, W. and M. J. van der Laan (2011). Cross-Validated Targeted Minimum-Loss-Based Estimation. pp. 459–474. Springer, New York, NY.
- Zhou et al. [2017] Zhou, X., N. Mayer-Hamblett, U. Khan, and M. R. Kosorok (2017). Residual Weighted Learning for Estimating Individualized Treatment Rules. Journal of the American Statistical Association 112(517), 169–187.
Supplementary Material for “Individualized treatment rules under stochastic treatment cost constraints”
This Supplementary Material is organized as follows. Section S1 contains technical conditions ensuring that the statistical parameter of interest, the average treatment effect, is pathwise differentiable and that our proposed estimator is asymptotically efficient. In Section S2, we discuss a particular technical condition that may be difficult to verify. In Section S3, we describe a modified version of our proposed estimator with improved performance in small to moderate samples. We present proofs of theoretical results in Section S4. In Section S5, we present the results of a simulation under an idealized setting. These results may provide guidance on interpreting the simulation results in Section 5.
As noted in the main text, the methods proposed in this work build upon tools used in Qiu et al. [34]; as such, the involved technical details bear similarity. To orient readers and facilitate comparisons, we have organized these supplementary materials for these papers similarly and shared portions of technical details when appropriate.
S1 Technical conditions for pathwise differentiability of parameter and asymptotic linearity of proposed estimator
In this section, we list the additional technical conditions required by Theorems 3 and 4 in Section 4 that we omit in the main text. Before doing this, we define pointwise
Condition B2 (Nonzero continuous density of around ).
If , then the distribution of has positive, finite and continuous Lebesgue density in a neighborhood of .
Condition B2 is most plausible when covariates are continuous; in that case, it is also plausible to expect the distribution of to be continuous and thus Condition B2 to hold.
Condition B3 (Smooth treatment cost function or lack of constraint).
If , then the function is continuously differentiable with nonzero derivative in a neighborhood of ; if and , then .
Condition B3 requires different conditions in separate cases. There are three cases in terms of the sufficiency of the budget to treat every individual: (i) there is an infinite budget and no constraint is present (); (ii) the budget is insufficient (); and (iii) the budget is finite but sufficient ( and ). Condition B3 makes no assumption for Case (i). In Case (ii), we require a function to be locally continuously differentiable. Since by Condition A4, this function is nonincreasing and thus only continuous differentiability is required. For each , this function is an integral of additional cost over the set and has a similar nature to survival functions. When covariates are continuous, it is plausible to assume that is continuous and thus is continuously differentiable. In Case (iii), we require that the budget has a surplus. When it is unknown a priori whether the budget is sufficient to treat every individual, namely in Case (ii) or (iii), it is highly unlikely that the budget exactly suffices with no surplus. Therefore, Condition B3 is mild.
Condition B4 (Bounded additional treatment cost).
is bounded.
Condition B5 (Active constraint).
If , then it holds that .
Condition B5 requires that, when the rule that assigns treatment completely at random while respecting the budget constraint is the reference rule of interest, it should not correspond to the trivial rule that assigns treatment to every individual. The rule equals only when the budget is sufficient to treat every individual. Since, as a separate reference rule from given fixed rules , the reference rule is only interesting when the budget constraint is active, Condition B5 often holds automatically.
Condition B6 (Sufficient rates for nuisance estimators).
Condition B6 holds if all above nuisance estimators converge at a rate faster than , which may be much slower than the parametric rate and thus allows for the use of flexible nonparametric estimators. This condition also holds if , , and each converges slower than , as long as the estimated propensity score converges sufficiently fast to compensate.
Condition B7 (Consistency of estimated influence function).
The following terms are all :
Condition B8 (Consistency of strong positivity).
With probability tending to one over the sample used to obtain , it holds that .
Condition B9 (Consistency of strictly more costly treatment).
With probability tending to one over the sample used to obtain and , it holds that and .
Condition B10 (Fast rate of estimated optimal ITR).
As sample size tends to infinity, it holds that
Condition B10 may, at first sight, appear to be difficult to verify and is discussed in detail in Section S2. As shown in Theorem S1 of Section S2, Condition B10 may require faster rates on nuisance estimators than Condition B6. For example, convergence in the -sense at a rate is sufficient for Condition B6, but a rate is needed in order to use Theorem S1 to show that Condition B10 holds.
Condition B11 (Donsker condition).
is a subset of a fixed -Donsker class with probability tending to 1. Additionally, each of , and belongs to a (possibly different) fixed -Donsker class with probability tending to 1.
Condition B12 (Glivenko-Cantelli condition).
and . Moreover, (i) if , then, for any sufficiently close to , belongs to a -Glivenko-Cantelli class with probability tending to 1; (ii) otherwise, if , then, for any with sufficiently large , belongs to a -Glivenko-Cantelli class with probability tending to 1.
S2 Sufficient condition for fast convergence rate of estimated optimal rule
Condition B10, which is required by Theorem 4, may seem unintuitive and difficult to verify. In Theorem S1 below, we present sufficient conditions for Condition B10 that are similar to those in Qiu et al. [34].
Throughout the rest of the Supplement, for two quantities , we use to denote for some constant that may depend on .
Theorem S1 (Sufficient condition for Condition B10).
Assume that . Further assume that each of and belongs to a (possibly different) fixed -Donsker class with probability tending to 1. Suppose also that the distribution of () has nonzero finite continuous Lebesgue density in a neighborhood of and a neighborhood of . Under Condition B4, the following statements hold.
- If for some , then
- If , then
The proof of Theorem S1 is very similar to that of Theorem 5 in Qiu et al. [34] and can be found in Section S4.5.
S3 Modified procedure with cross-fitting
In this section, we describe our proposed procedure to estimate the ATE with cross-fitting, as mentioned in Remark 6. We use to denote a user-specified fixed number of folds into which to split the data. Common choices of used in practice include 5, 10 and 20. A brief code sketch of the fold-splitting step is given after the procedure below.
1. Use the empirical distribution of as an estimate of the true marginal distribution of . Compute estimates , and of , and , respectively, using flexible regression methods.
2. Estimate an optimal individualized treatment rule for each observation:
   (a) Create folds: split the set of observation indices into mutually exclusive and exhaustive folds of (approximately) equal size. Denote these sets by , . Define . For each , let be the index of the fold containing ; in other words, is the unique value of such that .
   (b) Estimate using sample splitting: for each , compute estimates and of and using flexible regression methods based on data . For each , let be the sample splitting estimate of .
   (c) Estimate with a one-step correction estimator
   (d) Let and . For any , define , , and, for ,
   (e) Compute , which is used to define an estimate of for which the plug-in estimator is asymptotically linear.
       - If and there is a solution in to
         (S1) then take to be this solution.
       - Otherwise, set .
   (f) For each , estimate with
3. Obtain an estimate of the reference ITR as follows:
   - For , take to be .
   - For ,
     (a) obtain a targeted estimate of : run an ordinary least-squares regression using observations with outcome , offset , no intercept and covariate . Take to be the fitted mean model;
     (b) take to be the constant function , where we define pointwise .
   - For , take to be .
4. Estimate the ATE of relative to the reference ITR with a targeted minimum loss-based estimator (TMLE) :
   (a) obtain a targeted estimate of : run an ordinary least-squares linear regression using observations with outcome , offset , no intercept and covariate . Take to be the fitted mean function.
   (b) with being any distribution with components and , set where .
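As referenced before the procedure, the sketch below illustrates the fold-splitting and out-of-fold nuisance estimation of Steps 2(a)-(b): each observation's nuisance prediction is produced by a learner trained only on the other folds. The learner factory and the use of predict_proba for a binary nuisance are assumptions for illustration.

```python
import numpy as np

def cross_fit_predictions(X, y, make_learner, n_folds=10, seed=None):
    """Out-of-fold nuisance predictions: split indices into n_folds folds and,
    for each fold, predict its observations with a learner trained on the
    remaining folds only (as in Steps 2(a)-(b))."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    fold_id = rng.permutation(np.arange(n) % n_folds)  # roughly equal fold sizes
    out = np.empty(n, dtype=float)
    for k in range(n_folds):
        test = fold_id == k
        learner = make_learner()
        learner.fit(X[~test], y[~test])                   # fit off-fold
        out[test] = learner.predict_proba(X[test])[:, 1]  # predict in-fold
    return out
```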
S4 Proof of theorems
S4.1 Identification results (Theorems 1 and 2)
Proof of Theorem 1.
Note that
Similarly, . Hence, . By the law of total expectation, this yields that . It then follows that
The results for the treatment cost can be proved similarly. ∎
We next prove Theorem 2.
Proof of Theorem 2.
Let be any ITR that satisfies the constraint that . We will show that , implying that is a solution to (2).
Observe that
Note that if and if . Combining this observation with the fact that , the above shows that
If , then , as desired; otherwise, and , and so it follows that . Therefore, we conclude that is a solution to (2). ∎
S4.2 Pathwise differentiability of ATE parameter (Theorem 3)
We closely follow the existing literature on semiparametric efficiency theory to prove pathwise differentiability of our estimands and asymptotic efficiency of our estimators under nonparametric models. We refer readers to, for example, Pfanzagl [30, 31] and Bolthausen et al. [4] for a more thorough introduction to semiparametric efficiency.
To derive the canonical gradient of the ATE parameters, let be the set of score functions with range contained in and we study the behavior of the parameters under perturbations in an arbitrary direction . We note that the -closure of is indeed .
We define , and via its Radon-Nikodym derivative with respect to :
(S2) |
for any in a sufficiently small neighborhood of 0 such that the right-hand side is positive for all . It is straightforward to verify that the score function for at is indeed . For the rest of this section, we may drop from the notation and use as a shorthand notation for when no confusion should arise.
We will see that each parameter evaluated at depends on the following marginal or conditional distributions in a clean way: the marginal distribution of , the marginal distribution of , the conditional distribution of given , the conditional distribution of given , and the conditional distribution of given . We now derive their closed-form expressions. Let , and . We can then show that
(S3) |
Moreover, , -a.s., -a.s., and -a.s.
We finally introduce some additional notation used in the rest of the section. We use to denote a generic positive constant that may vary line by line. Let be the survival function of the distribution of when . We also use the notation defined in Section S2. For a generic function , we will use the big- and little-oh notations, namely and , respectively, to denote the behavior of as . Finally, for a general function or quantity that depends on a distribution , we use to denote . For example, we may write as a shorthand for . We will also write expectations under as .
The derivation of the canonical gradients of can be found in the Supplement of Qiu et al. [34]. We now derive the canonical gradients of , and , which are different from the parameters in Qiu et al. [34].
S4.2.1 Canonical gradient of (Theorem 3)
Fix a score . Note that, for all , . Combining this, (S3) and the chain rule yields that
where we have used the fact that -a.s., -a.s., and . Therefore, the canonical gradient of at is .
S4.2.2 Canonical gradient of
Let be a score function in . We aim to show that
(S4) |
which shows that is pathwise differentiable with canonical gradient at .
By similar arguments to those in Section 3.4 of Kennedy [16], we can show that
(S5) | ||||
(S6) |
Consequently, and . It follows that, for all in a sufficiently small neighborhood of zero, Condition B5 implies that . Consequently, for each in this neighborhood, , where we have used that . It follows that the derivative is the same as the derivative of at , provided this derivative exists. Noting that is a particular instance of a fixed treatment rule, we may take to be in the results on pathwise differentiability of and show that
(S7) |
As both the above derivative and the derivatives in (S5) and (S6) exist, by the chain rule, it follows that
Note that . Plugging (S5), (S6) and (S7) into the above, we can show that its right-hand side is equal to the right-hand side of (S4). As , we have shown that (S4) holds, and the desired result follows.
S4.2.3 Canonical gradient of
Let be a score function in . The argument that we use parallels that of Luedtke and van der Laan [20] and Qiu et al. [34], except that it is slightly modified to account for the fact that the resource constraint takes a different form in this paper.
We first note that all of the following hold for all sufficiently close to zero:
(S8) | ||||
(S9) | ||||
(S10) |
The derivations of these inequalities are straightforward and hence omitted. Under Condition A4, the above inequalities imply that
(S11) |
For sufficiently close to zero, it will be useful to define
for . We also define when the derivative exists.
We first establish two lemmas. They show that, under a perturbed distribution with perturbation magnitude , the fluctuation in the threshold is of order . This result is crucial for quantifying the convergence rate of two terms in the expansion of , namely terms 1 and 3 in (S12) below. In particular, term 1 is the main challenge in the analysis: it arises from the perturbation of the threshold and is unique to estimation problems involving the evaluation of optimal ITRs. The first lemma studies the convergence of  to . Because it may be the case that , the convergence stated in this result is convergence in the extended real line.
Lemma S1.
Under the conditions of Theorem 3, as .
Proof of Lemma S1.
We separately consider the cases where and .
Suppose that . For all sufficiently small and sufficiently small , by (S10), (S11) and the fact that the range of is contained in , we can show that
Under Condition B3, as long as is small enough, the right-hand side converges to as . Moreover, Conditions B3 and A4 can be combined to show that the derivative of is strictly negative for all for sufficiently small , and so . Because by the definition of under Condition B2, it follows that, for all sufficiently close to zero, . By the definition , it follows that, for all sufficiently close to zero, , that is, .
By similar arguments, we can show that, for all sufficiently close to zero, . Indeed,
The right-hand side converges to as provided is sufficiently small. The derivative of is strictly negative on provided is small enough, and therefore, . Hence, . By the definition of , it follows that .
Combining these two results, we see that, for all sufficiently close to zero, . Hence, . As is an arbitrary in a neighborhood of zero, it follows that . That is, as in the case that .
We now study the case where . If , then it is trivial that for all , and so the desired result holds. Suppose now that . Fix a small enough so that the bound in (S11) is valid for all . Also fix and . By (S11) and the bound on the range of ,
Because is a nonnegative decreasing function, the right-hand side is no greater than . This upper bound tends to as . Hence, . By Condition B3 and the monotonicity of , , and so for all sufficiently close to zero. By the definition of , it follows that for all sufficiently close to zero. Since is arbitrary, the desired result follows. ∎
The next lemma establishes a rate of convergence of to as .
Lemma S2.
Under the conditions of Theorem 3, .
Proof of Lemma S2.
We separately consider the cases where and .
We start with the easier case where . In this case, Lemma S1 shows that is equal to for all sufficiently close to zero. Thus, .
Now consider the more difficult case where . By the Lipschitz property of the function , we can show that . As a consequence, to show that , it suffices to show that . We next establish this statement.
Fix in a sufficiently small neighborhood of zero. By the definition , the bound on the range of , and (S11), it holds that . We use a Taylor expansion of about , which is justified by Condition B3 provided is small enough, and it follows that
By Condition B3, . Plugging this into the above shows that
Note that Condition B3 implies that . Therefore, the above shows that, for all sufficiently close to zero, , which implies that there exists an sequence for which .
A similar argument, which is based on the observation that , can be used to show that there exists an sequence such that . Combining these two bounds shows that , as desired. This concludes the proof. ∎
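In schematic form, writing $S_\epsilon$ for the perturbed survival-type function whose level defines the threshold, $\tau_\epsilon$ for the perturbed threshold, and $\tau_0$ for its limit (all symbols here are illustrative stand-ins for the quantities in the proof above), the key step of Lemma S2 can be summarized as
\[
S_\epsilon(\tau_\epsilon) - S_\epsilon(\tau_0) = S_\epsilon'(\tilde{\tau})\,(\tau_\epsilon - \tau_0)
\quad \text{for some } \tilde{\tau} \text{ between } \tau_0 \text{ and } \tau_\epsilon,
\]
where the left-hand side is $O(\epsilon)$ by the perturbation bounds and $|S_\epsilon'|$ is bounded away from zero near $\tau_0$ for all sufficiently small $\epsilon$, so that
\[
|\tau_\epsilon - \tau_0| \le \frac{\bigl| S_\epsilon(\tau_\epsilon) - S_\epsilon(\tau_0) \bigr|}{\inf_{\tilde{\tau}} |S_\epsilon'(\tilde{\tau})|} = O(\epsilon).
\]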
Our derivation of the canonical gradient is based on the following decomposition:
(S12)
We separately study each of the six terms on the right-hand side, which we refer to as term 1 up to term 6.
Study of term 1 in (S12): We will show that this term is . By Lemma S2 and (S9),
Under Condition B2, . We apply an argument similar to that used to prove Lemma 2 in van der Laan and Luedtke [46]:
Because  implies that either (i)  and  have different signs or (ii) only one of these quantities is zero, the display continues as
Using the facts that  by Condition A4, that  since probabilities are no more than one, and that , the display continues as
Leveraging the bound on  that appears in the indicator function, we see that
where the final equality holds by Condition B1. The integral in the final expression is , and so this expression is .
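The core inequality behind this margin-type argument can be written, in illustrative notation with $b$ the blip-type contrast (possibly centered at the constraint threshold) whose sign determines the rule and $d_\epsilon$, $d_0$ the perturbed and unperturbed rules, as
\[
\Bigl| \int \{ d_\epsilon(w) - d_0(w) \}\, b(w)\, dP_0(w) \Bigr|
\;\le\; \int \mathbf{1}\{ 0 < |b(w)| \le C\epsilon \}\, |b(w)|\, dP_0(w)
\;\le\; C\epsilon \int \mathbf{1}\{ 0 < |b(w)| \le C\epsilon \}\, dP_0(w),
\]
since the two rules can only disagree at points where the contrast is nonzero but no larger than $C\epsilon$ in absolute value; the final integral vanishes as $\epsilon \to 0$, so the whole expression is $o(\epsilon)$.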
Study of term 2 in (S12): By the result on the pathwise differentiability of , setting to be , we see that the second term satisfies , where is equal to .
Study of term 3 in (S12): We will show that the third term is identical to zero for any that is sufficiently close to zero. If , then this term is trivially zero. Otherwise, . Lemma S1 shows that, in this case, for sufficiently close to zero. Hence, . Consequently, term 3 equals zero for all sufficiently close to zero.
Study of term 4 in (S12): We will show that this term can be written as  for an appropriately defined  that does not depend on . Note that there exists a function  for which , , and, for all ,
The function  can be chosen so that the little-oh term above is uniform over  and . By the definition of  from (S3), we see that
where the little-oh terms are uniform over and . Hence,
Since  -a.s., the display continues as
As a consequence, term 4 satisfies
where
Study of term 5 in (S12): By (S3) and the fact that whenever , we see that , where is defined as . Since is defined as , we see that it also holds that .
Conclusion of the derivation of the canonical gradient of : Combining our results regarding the six terms in (S12), we see that
Dividing both sides by and taking the limit as , we see that is the canonical gradient of at .
S4.3 Expansions based on gradients or pseudo-gradients
In this section, we present (approximate) first-order expansions of the ATE parameters, based on which we construct our proposed targeted minimum-loss based estimators (TMLEs) and prove their asymptotic linearity. We refer the reader to Supplement S5 of Qiu et al. [34] for an overview of TMLE based on gradients and pseudo-gradients. The overall idea behind TMLE based on gradients is the following: the empirical mean of the gradient at the estimated distribution can be viewed as the first-order bias of the plug-in estimator, and this bias can be removed by solving the estimating equation that sets it to zero. The idea behind pseudo-gradients is similar, except that the gradient is replaced by an approximation, which we term a pseudo-gradient, chosen so that the corresponding estimating equation can be solved with a single regression step.
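To make this description concrete, the following minimal Python sketch contrasts the one-step correction with TMLE-style targeting for a generic parameter. It illustrates the general idea only, not the estimator proposed in this paper; in particular, the routine `fluctuate`, which performs one targeting update of the nuisance fit, is a hypothetical user-supplied function.

```python
import numpy as np

def debias_illustration(psi_plugin, gradient_values, fluctuate, tol=1e-8, max_iter=50):
    """Illustration of gradient-based debiasing (not this paper's exact estimator).

    psi_plugin      : plug-in estimate of the parameter at the fitted distribution.
    gradient_values : estimated canonical gradient evaluated at each observation.
    fluctuate       : hypothetical routine performing one targeting update of the
                      nuisance fit; returns updated (psi_plugin, gradient_values).
    """
    # One-step correction: the empirical mean of the estimated gradient estimates
    # the first-order bias of the plug-in estimator, so we add it back.
    one_step = psi_plugin + np.mean(gradient_values)

    # TMLE-style targeting: update the nuisance fit until the empirical mean of the
    # gradient (the estimating equation) is approximately zero, then use the plug-in.
    psi, grad = psi_plugin, np.asarray(gradient_values)
    for _ in range(max_iter):
        if abs(np.mean(grad)) < tol:
            break
        psi, grad = fluctuate(psi, grad)
    return one_step, psi
```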
For any ITR that utilizes all covariates, we define
For any ITR that only utilizes , for convenience, we define .
For and , it is straightforward to show that the following expansions hold:
For , we expand this parameter sequentially as follows:
For , straightforward but tedious calculation shows that the following expansion holds:
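For orientation, first-order expansions of this kind are typically organized around the generic von Mises form below, in which $P_0$ is the true distribution, $\hat{P}$ the fitted distribution, $P_n$ the empirical distribution, $D_P$ the (pseudo-)gradient at $P$, $R$ a second-order remainder, and $Q f := \int f \, dQ$ for a distribution $Q$ and function $f$; this notation is illustrative and does not reproduce the specific remainder terms of this paper:
\[
\Psi(\hat{P}) - \Psi(P_0)
= (P_n - P_0) D_{P_0} \;-\; P_n D_{\hat{P}} \;+\; (P_n - P_0)\bigl(D_{\hat{P}} - D_{P_0}\bigr) \;+\; R(\hat{P}, P_0).
\]
The targeting step is designed to make $P_n D_{\hat{P}}$ (approximately) zero, the empirical-process term is controlled by conditions on the nuisance estimators, and $R$ is second order in the nuisance estimation errors.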
S4.4 Asymptotic linearity of proposed estimator (Theorem 4)
For convenience, we set  to have component  and , even though the plug-in estimator does not explicitly involve these functions. We start with some lemmas that facilitate the proof of the main theorem. In this section, we define  and  to simplify notation.
Our proof is centered around the expansions in Supplement S4.3. We first prove a few lemmas. Lemma S3 is a standard asymptotic linearity result for the estimators  and  of the treatment resource used under the constant ITRs  and , respectively; Lemma S4 is a convenient technical tool for converting the conditions on norms in Condition B6 between functions; Lemmas S5–S7 are results for our estimators that parallel Lemmas S1–S2 for deterministic perturbations of , and they lead to the crucial Lemma S8 on the negligibility of the remainder for an arbitrary ITR .
Lemma S3 (Asymptotic linearity of and ).
Under the conditions of Theorem 4,
This result follows from the facts that (i) is a one-step correction estimator of [30], and (ii) is a TMLE for [44, 49]. Therefore the proof is omitted.
Lemma S4 (Lemma S8 in Qiu et al. [34]).
The following Lemmas S5–S7 prove consistency of the estimated thresholds used to define the estimated optimal ITR .
Lemma S5 (Lemma S5 in Qiu et al. [34]).
Let , , be bounded and functions and . Then
If takes values in , then can be replaced by .
This lemma is a stochastic variant of the deterministic result in Lemma S1 and has a similar proof. Therefore, the arguments are slightly abbreviated here.
Proof of Lemma S6.
We separately consider the cases where and .
First consider the case where . We start by showing that, for any sufficiently close to , it holds that . Fix an in a neighborhood of . By the triangle inequality,
(S13)
We will show that the right-hand side is . By Condition B12, the third term on the right is  for  sufficiently close to . Moreover, because the second term is no greater than , Condition B12 implies that the second term is also . We will now argue that the first term is . By Lemma S5 and Condition B4, for any ,
where the final relation follows from Markov’s inequality. We next show that the last line is . Fix . For that is sufficiently close to and that is sufficiently small, by Condition B2, we see that is continuous in and hence, for all sufficiently small , it holds that . Therefore,
Since by Condition B12, the right-hand side of the above display converges to zero as . Therefore, . Recalling (S13), the above results imply that for any that is sufficiently close to .
Fix . For any sufficiently small, the above result and Lemma S3 imply that and . By Condition B3, provided is sufficiently small. It follows that, with probability tending to one, , and hence by the definition of . Because is arbitrary, it follows that .
The case where can be proved similarly. If , then it trivially holds that for all and the desired result holds. Otherwise, for any for which is sufficiently large, a nearly identical argument to that used above shows that . By Condition B3 and monotonicity of , it follows that , and so, with probability tending to one, and hence by the definition of . Because is arbitrary, we have shown that . ∎
Lemma S7 (Consistency of  and existence of a solution to (5) when ).
Assume that the conditions of Theorem 4 hold. The following statements hold:
We separately prove i, ii, and iii in the case that , and then prove iii separately in the cases where , and .
Proof of i from Lemma S7.
Our strategy for showing the existence of a solution to (5) is as follows. First, we show that the left-hand side of (5) consistently estimates the treatment resource being used, uniformly over rules . Next, we show that the left-hand side of (5) is a continuous function in  that takes different signs at  and  with probability tending to one; a schematic illustration of this root-finding step is given after the proof.
Define . We first show that
(S14)
We rely on the fact that, for fixed , is a one-step estimator of . Note that
Conditions B7 and B11, along with Lemma S4, imply that the first term on the right-hand side is . For the second term, we note that the fact that  for all  and all , together with the Cauchy-Schwarz inequality, implies that
Hence, the second term is by Condition B6. The third term is also by an almost identical argument. Combining the previous two displays shows that (S14) holds.
Applying (S14) at  shows that . Therefore,  with probability tending to one. Applying this result at  shows that . Combining this with the fact that  whenever  shows that  with probability tending to one. Combining these results at  and  with the fact that  is a continuous function shows that, with probability tending to one, there exists a  such that . Lemma S6 then implies that  with probability tending to one. ∎
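As a purely illustrative companion to the existence argument above, the sketch below locates a threshold at which an estimated resource-use curve meets the budget by bracketing and root-finding. The function `estimated_cost_at`, the budget `kappa`, and the bracketing interval are assumptions made for the sketch and do not reproduce the exact form of (5); in the proof above, the left-hand side of (5) plays the role of `g`, and Lemma S6 yields consistency of the resulting root.

```python
import numpy as np
from scipy.optimize import brentq

def solve_threshold(estimated_cost_at, kappa, tau_lo, tau_hi):
    """Find a threshold tau with estimated_cost_at(tau) == kappa when the constraint binds.

    estimated_cost_at : callable mapping a candidate threshold to the estimated expected
                        resource use of the rule that treats units whose estimated benefit
                        exceeds that threshold (assumed continuous and nonincreasing).
    kappa             : resource budget.
    """
    g = lambda tau: estimated_cost_at(tau) - kappa
    # If even the most permissive rule stays within budget, the constraint is inactive
    # and no root-finding is needed.
    if g(tau_lo) <= 0:
        return tau_lo
    # Otherwise g is positive at tau_lo and, for a large enough tau_hi, negative at
    # tau_hi, so a sign change exists and a bracketing method locates a root.
    return brentq(g, tau_lo, tau_hi)
```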
Proof of iii from Lemma S7 when .
In this proof, we use to denote a probability statement over the draws of . Fix . We will argue by contradiction to show that and as , implying the consistency of . The consistency of then follows. We study these two events separately. First, we suppose that
(S15)
Then there exists such that, for all in an infinite sequence , the probability is at least . Consequently, for any , the following holds with probability at least :
(S16)
We now show that the first term is . For any and , by Lemma S5 and Condition B4,
Similarly to the proof of Lemma S6, the fact that (Condition B12) ensures that . By Condition B3, is a negative constant. Because (S16) holds with probability at least for infinitely many , this shows that is not . This contradicts our result from part ii of this lemma. Therefore, (S15) is false, that is, .
Next, suppose that, for some , . Then there exists  such that, for all  in an infinite sequence , . Consequently, for any , the following holds with probability at least :
The rest of the argument is almost identical to the contradiction argument for the previous event, and is therefore omitted.
Since is arbitrary, combining the results of these two contradiction arguments shows that , as desired. ∎
Proof of Lemma S8.
We next prove Theorem 4.
Proof of Theorem 4.
First, we note the following facts, which will be sufficient to ensure that the remainders and empirical process terms in all of the first-order expansions given above are . By Condition B6, Lemmas S4 and S8, the Cauchy-Schwarz inequality, and boundedness of the range of an ITR, the following terms are all :
Moreover, by Condition B10, ; by Conditions B7 and B11, for all and
Therefore, all relevant remainders and empirical process terms are .
We separately study the three cases where , , and .
Case I: . It holds that
where the last step follows from the TMLE construction of (Step 4a of our estimator), which implies that
We now show that the second term on the right-hand side is zero with probability tending to one. If , then this term is zero. Otherwise, . By Lemma S7, the following holds with probability tending to one:
and hence the second term is zero with probability tending to one, as desired. Therefore, .
Case II: . It holds that
where we have used and to denote the values that the two functions take, respectively. The TMLE construction of (Step 4a of our estimator) implies that
and hence
which is zero with probability tending to one as proved above. By Condition B5, Lemma S3 and the delta method for influence functions, the value that takes is an asymptotic linear estimator of the value that takes. Straightforward application of the delta method for influence functions implies that
Case III: . It holds that
The TMLE construction of (Step 4a of our estimator) implies that
so
which is zero with probability tending to one as proved above. Therefore,
Conclusion: The asymptotic linearity result on follows from the above results. Consequently, the asymptotic normality result on holds by the central limit theorem and Slutsky’s theorem. ∎
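The asymptotic linearity established above is what licenses Wald-type inference in the simulations reported later. The sketch below shows the generic construction of a two-sided Wald confidence interval and a one-sided lower confidence bound from a point estimate and its estimated influence-function values; the function and variable names are illustrative rather than part of the paper's software.

```python
import numpy as np
from scipy.stats import norm

def wald_intervals(estimate, influence_values, level=0.95):
    """Wald-type inference for an asymptotically linear estimator."""
    influence_values = np.asarray(influence_values)
    n = influence_values.size
    # Standard error from the empirical variance of the estimated influence function.
    se = influence_values.std(ddof=1) / np.sqrt(n)
    z = norm.ppf(1 - (1 - level) / 2)              # e.g. 1.96 for a 95% interval
    two_sided = (estimate - z * se, estimate + z * se)
    lower_bound = estimate - norm.ppf(0.975) * se  # 97.5% one-sided lower confidence bound
    return two_sided, lower_bound
```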
S4.5 Proof of Theorem S1
In this section, we prove Theorem S1. The arguments are almost identical to those in Supplement S9 of Qiu et al. [34], with adaptations to the different treatment resource constraint.
Lemma S9 (Convergence rate of if ).
Assume that the conditions of Theorem 4 hold. Suppose that , that the Lebesgue density of the distribution of  under  is well-defined, nonzero and finite in a neighborhood of , and that . Under these conditions, the following implications hold with probability tending to one:
• If  for some , then .
• If , then .
The condition that is reasonable if has a continuous distribution when , in which case .
Proof of Lemma S9.
We study the three cases where , and separately.
We first study the case where . By Lemma S7, with probability tending to one, where is a solution to (5), and
We argue conditionally on the event that is a solution to (5). Adding to both sides shows that . By a Taylor expansion of under Conditions B2, B3 and A4, the left-hand side is equal to for some , yielding that
which immediately implies that
(S17)
The rest of the proof for this case and the proof for the other two cases are identical to the proof of Lemma S14 in Qiu et al. [34]. We present the argument below for completeness. By Lemma S5 and Condition B4, for any it holds that
Fix a positive sequence , where each may be random through observations , such that as . By a Taylor expansion of , the survival function of the distribution of when , around , which is valid under Condition B2 provided is sufficiently small, it follows that
Here we recall that is finite by Condition B2. Returning to (S17),
If for some , by Markov’s inequality, . In this case, taking yields that with probability tending to one. If , then taking yields that , and hence that with probability tending to one. The desired result follows by noting that and in both cases, with probability tending to one.
We now study the case where . By Lemma S6, with probability tending to one, and hence , as desired.
We finally study the case where . We argue conditional on the event that a solution to (5) exists, which happens with probability tending to one by Lemma S7. Recall that for convenience we let  when . Then exactly one of the following two events happens: (i)  or , in which case ; (ii)  and , in which case an argument similar to the above proof for the case where  shows that the distance between  and  satisfies the desired bound. The desired result holds conditionally on either event, so it also holds unconditionally. ∎
We finally prove Theorem S1.
Proof of Theorem S1.
Observe that
(S18)
Starting from this inequality, the rest of the proof is identical to that of Theorem 5 in Qiu et al. [34]. We present the argument below for completeness. Let be a positive sequence, where each is random through the observations , such that as .
We denote the three terms on the right-hand side by terms 1, 2, and 3, and study these terms separately. It is useful to note that , so the Lebesgue density of the distribution of , , is finite in a neighborhood of with probability tending to one.
Study of term 1 in (S18): Observe that
First consider the bound with the -distance. Because if and only if (i) and take different signs or (ii) only one of them is zero, this event implies , and so this term is upper bounded by
where the second-to-last relation holds by Hölder's inequality and Markov's inequality, and the last relation holds with probability tending to one by Lemma S7 and the assumption that the distribution of , , has a continuous and finite Lebesgue density in a neighborhood of . Taking  yields that .
Next consider the bound with the -distance. We have that
Therefore, the first term is upper bounded by both and , up to an absolute constant.
Study of term 2 in (S18): Because  if and only if the two indicators take different signs or only one of them is zero, these indicators can only take different values if . Therefore, term 2 is bounded as
where the last step holds for with probability tending to one by the assumption that the distribution of , , has a continuous finite Lebesgue density in a neighborhood of and Lemma S7. If , by Lemma S9, with probability tending to one,
Otherwise, by Lemma S6, with probability tending to one, and the above result still holds.
Study of term 3 in (S18): By Lemma S5,
By a Taylor expansion of around , similarly to the proof of Lemma S9, with probability tending to one,
where . If for some , then . Taking yields that . If , then taking yields that with probability tending to one. Also note that, by Lemma S9, if , then, with probability tending to one,
The same holds when since then with probability tending to one.
Therefore, with probability tending to one,
Conclusion of the bound in (S18): We finally combine the bounds for all three terms. Note that for any sequence of non-negative random variables and sequence of constants . It follows that, with probability tending to one,
∎
S5 Additional simulations
S5.1 Results of the simulation with nuisance functions set to the truth
In this section, we present the results of a simulation with a setting identical to that in Section 5 of the main text, except that the nuisance functions are taken to be the truth rather than estimated via machine learning. The purpose of this simulation is to show that the performance of our proposed estimator may improve substantially when machine learning estimators of the nuisance functions outperform those used in the simulation study reported in the main text.
Table S1 presents the performance of our proposed estimator in this simulation. The Wald CI coverage is close to 95% for sample sizes of 1000 or more. The coverage of the confidence lower bounds is also close to the nominal level of 97.5%. Therefore, our proposed procedure appears to have the potential to improve substantially when better estimators of the nuisance functions are used. Figure S1 presents the width of our 95% Wald CI scaled by the square root of the sample size . For each estimand, the scaled width appears to stabilize as  grows and to be similar to the scaled width observed in the simulation reported in Section 5, where the nuisance functions are estimated from data.
Table S1: Performance of the proposed estimator when the nuisance functions are set to the truth: 95% Wald CI coverage, 97.5% confidence lower bound coverage, bias, RMSE, and the ratio of the mean standard error to the standard deviation, for each estimand at sample sizes 500, 1000, 4000, and 16000.
Figure S1: Width of the 95% Wald CI scaled by the square root of the sample size, for each estimand.
S5.2 Simulation in a low-dimensional setting with a parametric model
In this section, we describe an additional simulation in a low-dimensional setting with a parametric model, along with the simulation results.
The data are generated as follows. We first generate a univariate covariate . We then generate ,  and  as follows:
where  and  are independent conditional on . We set , , and , which is an active constraint with  and . We use logistic regression to estimate the functions ,  and . All other simulation settings are identical to those in Section 5.
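A minimal sketch of this low-dimensional setup is given below. The data-generating coefficients, the covariate distribution, and the exact set of nuisance regressions are placeholders chosen only for illustration; they are not the specification used in our simulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
expit = lambda x: 1.0 / (1.0 + np.exp(-x))

# Hypothetical data-generating process with a univariate covariate W, a binary
# encouragement/instrument Z, a binary treatment A, and a binary outcome Y.
W = rng.uniform(-1.0, 1.0, size=n)
Z = rng.binomial(1, expit(0.5 * W))
A = rng.binomial(1, expit(0.5 * W + 1.0 * Z))
Y = rng.binomial(1, expit(0.5 * W + 0.5 * A))

# Parametric (logistic-regression) estimates of nuisance functions, in the spirit of
# the simulation described above; which regressions are needed is itself an assumption.
X = W.reshape(-1, 1)
XZ = np.column_stack([W, Z])
fit_Z_given_W = LogisticRegression().fit(X, Z)    # estimates P(Z = 1 | W)
fit_A_given_WZ = LogisticRegression().fit(XZ, A)  # estimates P(A = 1 | W, Z)
fit_Y_given_WZ = LogisticRegression().fit(XZ, Y)  # estimates E[Y | W, Z]
```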
The simulation results are presented in Table S2 and Figure S2. The performance generally falls between that in the nonparametric setting of Section 5 and that in the oracle setting of Section S5.1. The CI coverage is much better than in the nonparametric case, suggesting that our method might perform better with improved estimators of the nuisance functions ,  and .
Table S2: Performance of the proposed estimator in the low-dimensional parametric simulation: 95% Wald CI coverage, 97.5% confidence lower bound coverage, bias, RMSE, and the ratio of the mean standard error to the standard deviation, for each estimand at sample sizes 500, 1000, 4000, and 16000.
Figure S2: Width of the 95% Wald CI scaled by the square root of the sample size in the low-dimensional parametric simulation.