
Smoothed Concordance-Assisted Learning for Optimal Treatment Decision in High Dimensional Data

Angzhi Fan   
Department of Statistics, University of Chicago
The author would like to thank Professor Wenbin Lu and Professor Rui Song for their kind guidance during the author’s visit to North Carolina State University in 2017.
Abstract

An optimal treatment regime is the individualized treatment decision rule that yields the optimal treatment outcome in expectation. A simple case of a treatment decision rule is the linear decision rule, which is characterized by its coefficients and its threshold. As patients' heterogeneity data accumulate, it is of interest to estimate the optimal treatment regime with a linear decision rule in high-dimensional settings. A single-timepoint optimal treatment regime can be estimated using concordance-assisted learning (CAL), which is based on pairwise comparisons. CAL is flexible and achieves good results in low dimensions. However, with an indicator function inside it, CAL is difficult to optimize in high dimensions. Recently, researchers proposed a smoothing approach using a family of cumulative distribution functions to replace indicator functions. In this paper, we introduce smoothed concordance-assisted learning (SMCAL), which applies the smoothing method to CAL using a family of sigmoid functions. We then prove convergence rates for the estimated coefficients by analyzing the approximation and stochastic errors for the case of continuous covariates. We also consider the discrete covariates case and establish similar results. Simulation studies are conducted, demonstrating the advantage of our method.


Keywords: optimal treatment regime, precision medicine, smoothing approximation, monotonic single-index model

1 Introduction

In order to give precise treatment decisions, we have to take patients' heterogeneity into account. Decision rules based on patients' own features are called individualized treatment rules. In a binary treatment decision setting, \mathcal{A}=\{1,0\} is the set of treatment indicators and

\{(Y_{i},X_{i},A_{i}),\ i=1,2,...,n\}

are the i.i.d. observations, where Y_{i} and X_{i} are the outcome and features of subject i. Suppose Y_{i}^{*}(A_{i}) is the potential outcome for subject i with features X_{i} under treatment A_{i}. If Y_{i}^{*}(1)>Y_{i}^{*}(0), treatment 1 is more favorable to subject i than treatment 0. Assuming the linear decision rule d(X)=\mathbbm{1}(\beta^{T}X>c), a treatment regime is determined by the coefficients \beta and the threshold c. The mean outcome under a regime d(X) can be written as

E[Y^{*}(d(X))]=E_{X}[E(Y|X,A=1)d(X)+E(Y|X,A=0)(1-d(X))].

An optimal treatment regime is the treatment regime that yields the most favorable mean treatment outcome. Optimal treatment decision-making with multiple decision timepoints is referred to as optimal dynamic treatment regimes. There are currently two leading dynamic treatment learning approaches. The first is Q-learning by Watkins, (1989); Watkins and Dayan, (1992) proved the convergence property of Q-learning, and Song et al., (2011) extended Q-learning to penalized Q-learning (PQ-learning). The second is advantage learning (A-learning) by Murphy, (2003). Blatt et al., (2004) showed that A-learning is likely to have smaller bias than Q-learning.

Due to the rapid accumulation of patients’ heterogeneity data, people started to consider the high dimensional setting. Zhu et al., (2019) extended Q-learning to high dimensional Q-learning, with a focus on the two-stage dynamic treatment regimes. Shi et al., (2018) proposed Penalized A-learning (PAL), which applied the Dantzig selector under the A-learning framework, with a focus on two or three-stage settings.

For a single timepoint treatment decision, Zhang et al., (2012) used a doubly robust augmented inverse probability weighted estimator to handle the possible model misspecification issue. Zhao et al., (2012) adopted the support vector machine framework and proposed outcome weighted learning (OWL). Song et al., (2015) modified OWL into penalized outcome weighted learning (POWL). Lu et al., (2013) introduced a robust estimator that does not require estimating the baseline function and makes variable selection in high dimensions easy. CAL by Fan et al., (2017) used the contrast between two treatments to construct a concordance function, and then used the concordance function to estimate the coefficients \beta^{*}. CAL has a fast convergence rate, and it does not assume that the contrast function is a linear combination of the features. They assumed the stable unit treatment value assumption (SUTVA) in Rubin, (1980),

Y_{i}^{*}(A_{i})=\mathbbm{1}(A_{i}=0)Y_{i}^{*}(0)+\mathbbm{1}(A_{i}=1)Y_{i}^{*}(1), (1.1)

and the no-unmeasured-confounders condition,

\{Y_{i}^{*}(0),Y_{i}^{*}(1)\}\perp A_{i}\,|\,X_{i}. (1.2)

The proposed concordance function C(\beta) is

C(\beta) =E\{[\{Y_{i}^{*}(1)-Y_{i}^{*}(0)\}-\{Y_{j}^{*}(1)-Y_{j}^{*}(0)\}]\mathbbm{1}(\beta^{T}X_{i}>\beta^{T}X_{j})\}
=E\{[D(X_{i})-D(X_{j})]\mathbbm{1}(\beta^{T}X_{i}>\beta^{T}X_{j})\}, (1.3)

where the contrast function D(X_{i})=E[Y_{i}|A_{i}=1,X_{i}]-E[Y_{i}|A_{i}=0,X_{i}]. Within the concordance function, an important assumption is that \beta^{*T}X_{i} is concordant with D(X_{i}), which means

D(X_{i})=Q(\beta^{*T}X_{i}), (1.4)

where Q(x) is an unknown monotone increasing function. Using the propensity score \pi(X_{i})=P(A_{i}=1|X_{i}), D(X_{i}) is estimated by its unbiased estimator

w_{i}=\frac{\{Y_{i}-v(X_{i},\hat{\theta})\}\{A_{i}-\pi(X_{i})\}}{\pi(X_{i})\{1-\pi(X_{i})\}},

where v(X_{i},\hat{\theta}) can be an arbitrary function independent of A_{i}; it is usually chosen as the mean response of the patients who received treatment 0. The true coefficients \beta^{*} and the true threshold c_{0} in the linear decision rule are estimated by

\hat{\beta}=\mathop{\arg\max}_{\beta:\|\beta\|_{2}=1}\frac{1}{n(n-1)}\sum_{i\neq j}(w_{i}-w_{j})\times\mathbbm{1}(\beta^{T}x_{i}>\beta^{T}x_{j}), (1.5)

and

\hat{c}=\mathop{\arg\max}_{c}\frac{1}{n}\sum_{i=1}^{n}\frac{Y_{i}\mathbbm{1}[A_{i}=\mathbbm{1}(\hat{\beta}^{T}x_{i}>c)]}{A_{i}\pi(x_{i})+(1-A_{i})[1-\pi(x_{i})]}. (1.6)

In the high dimensional setting, however, the indicator function inside the concordance function makes the optimization of CAL difficult. In order to deal with this optimization issue, Liang et al., (2018) proposed SCAL, which uses the hinge loss to replace \mathbbm{1}(x) and adds an L_{1} penalty for variable selection. They estimate \beta^{*} and c_{0} by

\hat{\beta}=\mathop{\arg\min}_{\beta:\|\beta\|_{2}=1}\frac{2}{n(n-1)}\sum_{w_{i}>w_{j}}(w_{i}-w_{j})\times(1-\beta^{T}(x_{i}-x_{j}))_{+}+\lambda\|\beta\|_{1},

and

\hat{c}=\mathop{\arg\max}_{c}\frac{1}{n}\sum_{i=1}^{n}w_{i}\mathbbm{1}(\hat{\beta}^{T}x_{i}>c).

SCAL achieves the n^{1/2} rate but induces a relatively large bias due to the difference between the hinge loss and the original 0-1 loss.

Our way to overcome the optimization difficulty of CAL with high dimensional data is based on the smoothing method proposed by Han et al., (2017). A family of sigmoid functions is used to substitute for the indicator function in CAL, and the L_{1} penalty is added. We then employ a coordinate descent algorithm for the optimization. Our method is called Smoothed Concordance-Assisted Learning (SMCAL).

Compared to SCAL, SMCAL has a slower convergence rate but achieves a much smaller bias. Numerical comparisons with POWL, SCAL and PAL demonstrate the advantage of our method, especially when the sample size is relatively large. Besides the continuous covariates cases considered in the SCAL setting, we extend SMCAL to discrete covariates cases.

There are three main differences between our work and Han et al., (2017). First, they focused on the Maximum Rank Correlation (MRC) proposed in Han, (1987), whereas our smoothing is applied to the concordance function. Second, their paper studied the monotone transformation model, while we focus on the monotone single-index model and obtain a faster convergence rate. Third, we impose weaker assumptions in place of their normality assumption on the covariates, and we extend our method to discrete cases.

The rest of this paper is organized as follows. Section 2 introduces SMCAL and describes the coordinate descent algorithm. Section 3 presents the L_{2}-error rate in the continuous cases and the L_{1}-error rate in the discrete cases, which are the main results of our paper. In Section 4, we conduct numerical comparisons with SCAL, PAL and POWL. A real data application to the STAR*D dataset is provided in Section 5. The proofs of the lemmas and theorems in Section 3 are left to the supplementary material.

2 Methods

2.1 The Model Set-up

We still assume Assumptions (1.1), (1.2) and (1.4) posed in Fan et al., (2017). Our smoothing procedure approximates their concordance function by

C_{n}(\beta)=E\{[D(X_{i})-D(X_{j})]F(\beta^{T}(X_{i}-X_{j})\alpha_{n})\}, (2.1)

where we use a family of sigmoid functions

F(\alpha_{n}x)=\frac{1}{1+e^{-\alpha_{n}x}}

to replace \mathbbm{1}(x) in Fan et al., (2017). \alpha_{n} is a positive constant depending on n. An unbiased estimator of C_{n}(\beta) is

\hat{C}_{n}(\beta)=\frac{1}{n(n-1)}\sum_{i\neq j}(w_{i}-w_{j})\times F(\beta^{T}(x_{i}-x_{j})\alpha_{n}). (2.2)
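For concreteness, the following sketch (in Python with numpy; the function names are ours, and v(X_{i},\hat{\theta}) is taken as the control-arm mean as described in Section 1; it is only meant to illustrate the formulas, not to reproduce the original implementation) computes the contrast estimates w_{i} and the smoothed concordance estimator \hat{C}_{n}(\beta) of (2.2):

```python
import numpy as np

def contrast_estimates(Y, A, pi):
    """w_i = {Y_i - v}{A_i - pi} / [pi (1 - pi)],
    with v taken as the mean response of the control (A = 0) group."""
    v = Y[A == 0].mean()
    return (Y - v) * (A - pi) / (pi * (1.0 - pi))

def smoothed_concordance(beta, X, w, alpha_n):
    """Empirical smoothed concordance C_n-hat(beta) in (2.2)."""
    score = X @ beta                               # beta^T x_i
    dscore = score[:, None] - score[None, :]       # beta^T (x_i - x_j)
    dw = w[:, None] - w[None, :]                   # w_i - w_j
    sig = 1.0 / (1.0 + np.exp(-alpha_n * dscore))  # F(alpha_n * .)
    n = len(w)
    mask = ~np.eye(n, dtype=bool)                  # all ordered pairs i != j
    return (dw * sig)[mask].sum() / (n * (n - 1))
```

Here pi may be a scalar (for example 0.5 in a balanced randomized study) or an array of estimated propensity scores.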

Without loss of generality, we can set \beta_{1}^{*}=\beta_{1}=1. Define

\beta_{r(\gamma),\alpha_{n}}^{*}=\mathop{\arg\min}_{\beta:\beta_{1}=1,\|\beta-\beta^{*}\|_{2}\leq r(\gamma)}(-C_{n}(\beta)),

where r(\gamma) is a constant to be chosen. \beta_{r(\gamma),\alpha_{n}}^{*} converges to \beta^{*} as \alpha_{n} goes to infinity.

Define \mathcal{H}=\{\beta\in\mathbb{R}^{d}:\beta_{1}=1,\|\beta\|_{0}\leq s_{n},\|\beta-\beta^{*}\|_{2}\leq r(\gamma)\}, where s_{n} is a constant such that \|\beta_{r(\gamma),\alpha_{n}}^{*}\|_{0}\leq s_{n}. Since maximizing \hat{C}_{n}(\beta) is equivalent to maximizing

\frac{2}{n(n-1)}\sum_{w_{i}>w_{j}}(w_{i}-w_{j})\times F(\beta^{T}(x_{i}-x_{j})\alpha_{n}),

we can write our loss function as

l_{n}(\beta)=-\frac{2}{n(n-1)}\sum_{w_{i}>w_{j}}(w_{i}-w_{j})\times F(\beta^{T}(x_{i}-x_{j})\alpha_{n})+\lambda_{n}\|\beta\|_{1}. (2.3)

Then we use the following two steps to estimate \beta^{*} and c_{0}.

\hat{\beta}_{r(\gamma),\alpha_{n}}=\mathop{\arg\min}_{\beta\in\mathcal{H}}l_{n}(\beta), (2.4)
\hat{c}=\mathop{\arg\max}_{c}\frac{1}{n}\sum_{i=1}^{n}w_{i}\mathbbm{1}(\hat{\beta}_{r(\gamma),\alpha_{n}}^{T}x_{i}>c). (2.5)

Our \hat{\beta}_{r(\gamma),\alpha_{n}} differs from \beta_{r(\gamma),\alpha_{n}}^{*} due to stochastic factors and the L_{1} penalty. We will call \beta_{r(\gamma),\alpha_{n}}^{*}-\beta^{*} the approximation error and \hat{\beta}_{r(\gamma),\alpha_{n}}-\beta_{r(\gamma),\alpha_{n}}^{*} the stochastic error.
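The threshold step (2.5) is a one-dimensional search once \hat{\beta}_{r(\gamma),\alpha_{n}} is fixed. A minimal sketch (illustrative names; scanning the observed index values as candidate cut points is our choice, since the objective is piecewise constant between them) is:

```python
import numpy as np

def estimate_threshold(beta_hat, X, w):
    """Maximize (1/n) sum_i w_i 1(beta_hat^T x_i > c) over c, as in (2.5)."""
    score = X @ beta_hat
    # the objective only changes at the observed scores, so it suffices to
    # scan cut points just below each score, plus one above the maximum
    candidates = np.append(np.sort(score) - 1e-8, score.max() + 1e-8)
    values = [np.mean(w * (score > c)) for c in candidates]
    return candidates[int(np.argmax(values))]
```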

2.2 Coordinate Descent Algorithm

We iteratively apply proximal gradient descent to each coordinate to solve the optimization problem. The step size t is a fixed constant.

Algorithm 1 Coordinate Descent Algorithm
1:  Initialize \beta=(1,0,...,0)^{T} or \beta=(-1,0,...,0)^{T}. Repeat the following steps until convergence: For k=2,3,\ldots,d:
2:  Calculate g_{k}=\frac{-\alpha}{n(n-1)}\sum_{i\neq j}(w_{i}-w_{j})(x_{i,k}-x_{j,k})F^{\prime}((x_{i}-x_{j})^{T}\beta\alpha).
3:  Update \beta_{k}=S_{t\lambda}(\beta_{k}-tg_{k}), where S_{t\lambda}(x)=\begin{cases}x-t\lambda&x>t\lambda\\ 0&-t\lambda\leq x\leq t\lambda\\ x+t\lambda&x<-t\lambda\end{cases} is the soft-thresholding operator.

Consider -\frac{2}{n(n-1)}\sum_{w_{i}>w_{j}}(w_{i}-w_{j})F((x_{i}-x_{j})^{T}\beta\alpha_{n}). When a pair (i,j) satisfies w_{i}>w_{j}, we likely have D(X_{i})>D(X_{j}), which means (x_{i}-x_{j})^{T}\beta^{*}>0. Because -F(x) is convex when x>0, our optimization problem is likely to be convex when \beta is close to \beta^{*}. The tuning parameters \lambda and \alpha can be chosen by cross-validation.
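As an illustration of Algorithm 1, a compact numpy sketch (our own, with \beta_{1} fixed at \pm 1, a fixed step size, and a fixed number of sweeps in place of a formal convergence check) could read as follows; it uses F^{\prime}(z)=1/(e^{z}+e^{-z}+2) for the gradient g_{k} and the soft-thresholding update of Step 3:

```python
import numpy as np

def soft_threshold(x, thr):
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def coordinate_descent(X, w, alpha_n, lam, step=0.01, n_sweeps=200, sign1=1.0):
    """Coordinate-wise proximal gradient descent for the penalized loss (2.3)."""
    n, d = X.shape
    beta = np.zeros(d)
    beta[0] = sign1                      # first coefficient fixed at +1 or -1
    dw = w[:, None] - w[None, :]         # w_i - w_j
    mask = ~np.eye(n, dtype=bool)
    for _ in range(n_sweeps):
        for k in range(1, d):
            z = alpha_n * (X @ beta)
            dz = z[:, None] - z[None, :]
            # F'(dz); overflow for very large |dz| is ignored in this sketch
            Fp = 1.0 / (np.exp(dz) + np.exp(-dz) + 2.0)
            dxk = X[:, k][:, None] - X[:, k][None, :]
            g_k = -alpha_n * (dw * dxk * Fp)[mask].sum() / (n * (n - 1))
            beta[k] = soft_threshold(beta[k] - step * g_k, step * lam)
    return beta
```

The gradient here is taken over all ordered pairs as in Step 2 of Algorithm 1; it coincides with the gradient of the smooth part of (2.3) because the two objectives differ only by a constant (cf. the equivalence noted before (2.3)). In practice \lambda and \alpha_{n} would be chosen by cross-validation as described above.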

3 Convergence Rates of the Estimated Coefficients

In this section, we establish the error rates for \hat{\beta}_{r(\gamma),\alpha_{n}}-\beta^{*}, first under continuous and then under discrete covariates. Let \lambda_{n}\geq 2\|\nabla\hat{L}_{n}(\beta_{r,\alpha_{n}}^{*})\|_{\infty}. Assume that the true coefficient vector \beta^{*} is sparse. Let S denote the set of indices of the nonzero entries of \beta^{*}, s=|S|, and define S^{c}=\{1,2,...,d\}\setminus S. Let X_{i,I} denote the entries of X_{i} whose indices belong to the set I. The notation \lesssim means that if a_{n}\lesssim b_{n}, or equivalently b_{n}\gtrsim a_{n}, then there exist C>0 and N_{0}>0 such that for all n>N_{0}, we have |a_{n}|\leq C|b_{n}|. If a_{n}\lesssim b_{n} and b_{n}\lesssim a_{n}, then we write a_{n}\asymp b_{n}.

3.1 Continuous Cases

3.1.1 Approximation Error

To analyze the approximation error, we assume the following assumptions.

Assumption 1.

(A1) Q(x) is twice differentiable and |Q(x)|<M_{1}.

(A2) For all \beta:\ \beta_{1}=1, \|\beta-\beta^{*}\|_{2}\leq r(\gamma), and all t\in\mathbb{R}, we have |\frac{\partial}{\partial t}f_{X^{T}\beta}(t)|<M_{3}.

(A3) Let \Gamma_{(d-1)\times(d-1)}=-\nabla^{2}C(\beta^{*}), where the derivative is taken with respect to every coordinate except the first one, because \beta_{1}^{*}=1 is fixed. Assume 0<c_{3}\leq\lambda_{min}(\Gamma)\leq\lambda_{max}(\Gamma)\leq c_{4}.

Remark 1.

(A2) is true for many distributions of X. For example, it is easy to show that if X\sim\mathcal{N}(\mu,\Sigma_{d\times d}) and 0<c_{1}\leq\lambda_{min}(\Sigma), then (A2) is satisfied.

Remark 2.

Since \beta^{*} is the maximizer of C(\beta), \Gamma is non-negative definite. An assumption similar to (A3) was also posed in Section 4.1.1 of Han et al., (2017). In the discrete cases, however, we will use another way to build the upper error bound and will no longer need this assumption.

Under the above assumptions, the following Lemma 1 measures the closeness between the concordance function C(\beta) and its approximation C_{n}(\beta).

Lemma 1.
\sup_{\beta\in\mathbb{R}^{d}:\beta_{1}=1,\|\beta^{*}-\beta\|_{2}\leq r(\gamma)}|C(\beta)-C_{n}(\beta)|\lesssim\alpha_{n}^{-2}. (3.1)

The convexity assumption (A3) in Assumption 1 implies that \|\beta_{r(\gamma),\alpha_{n}}^{*}-\beta^{*}\|_{2}^{2} and |C(\beta^{*})-C(\beta_{r(\gamma),\alpha_{n}}^{*})| are of the same order, so the approximation error can be bounded by analyzing |C(\beta^{*})-C(\beta_{r(\gamma),\alpha_{n}}^{*})|, to which we apply Lemma 1 and obtain Theorem 2.

Theorem 2 (Approximation Error Rate).
\|\beta_{r(\gamma),\alpha_{n}}^{*}-\beta^{*}\|_{2}^{2}\lesssim\alpha_{n}^{-2}. (3.2)
Remark 3.

The approximation error in Han et al., (2017) is

\|\beta_{r(\gamma),\alpha_{n}}^{*}-\beta^{*}\|_{2}^{2}\lesssim\alpha_{n}^{-1}\cdot\sup_{\beta:\beta_{1}=1}\frac{1}{\sqrt{2\beta^{T}\Sigma\beta}}.

We have a faster convergence rate because we use the concordance function estimator

\frac{1}{n(n-1)}\sum_{i\neq j}(w_{i}-w_{j})\times\mathbbm{1}(\beta^{T}x_{i}>\beta^{T}x_{j}),

which is different from the Maximum Rank Correlation (MRC) estimator

\frac{1}{n(n-1)}\sum_{i\neq j}\text{sign}(w_{i}-w_{j})\times\text{sign}(\beta^{T}x_{i}-\beta^{T}x_{j})

in their paper.

3.1.2 Stochastic Error

When investigating the stochastic error β^r(γ),αnβr(γ),αn\hat{\beta}_{r(\gamma),\alpha_{n}}-\beta_{r(\gamma),\alpha_{n}}^{*}, we follow the framework of Section 4.2 in Han et al., (2017). In order to be consistent with their notations, define

L^n(β)=C^n(β)=1n(n1)ij(wiwj)×F(βT(XiXj)αn),\hat{L}_{n}(\beta)=-\hat{C}_{n}(\beta)=-\frac{1}{n(n-1)}{\displaystyle\sum_{i\neq j}(w_{i}-w_{j})\times}F(\beta^{T}(X_{i}-X_{j})\alpha_{n}),
L(β)=C(β),Ln(β)=Cn(β),L(\beta)=-C(\beta),\quad L_{n}(\beta)=-C_{n}(\beta),
Δ^=β^r(γ),αnβr(γ),αn,\hat{\Delta}=\hat{\beta}_{r(\gamma),\alpha_{n}}-\beta_{r(\gamma),\alpha_{n}}^{*},

and P(\cdot) denotes the L_{1} penalty, with P^{*}(\cdot) its dual norm. The following assumptions are needed in this subsection.

Assumption 2.

(A4) |w_{i}|\leq M_{2}.

(A5) There exists M_{4} such that -M_{4}\leq\|X_{i}\|_{\infty}\leq M_{4}.

(A6) \|\beta_{r(\gamma),\alpha_{n}}^{*}\|_{0}\leq s_{n}.

Remark 4.

Our assumption (A4) is very similar to (A1) in Assumption 1, but (A1) doesn’t include any stochastic factors.

Remark 5.

Although it’s reasonable for us to post (A5) in our optimal treatment decision problem, (A5) inevitably eliminates the multivariate normal distribution.

Remark 6.

(A6) appeared as Assumption (A0) in Section 4.2.2 of Han et al., (2017). As we will see in the proof of Theorem 7, under some conditions we have \beta_{r(\gamma),\alpha_{n},S^{c}}^{*}=\vec{0}, so s_{n} can be relatively small.

We provide two necessary steps, Lemmas 3 and 4, in order to apply Theorem 4.8 in Han et al., (2017). Lemma 3 is used to bound the term \langle\nabla\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}),\Delta\rangle.

Lemma 3.
\|\nabla\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})\|_{\infty}\stackrel{p}{\lesssim}\alpha_{n}\sqrt{\frac{\log(d)}{n}}. (3.3)

In Han et al., (2017), they introduced a parameter κn=Esupβ|L^n(β)Ln(β)|\kappa_{n}=E\mathop{\sup_{\beta\in\mathcal{H}}}|\hat{L}_{n}(\beta)-L_{n}(\beta)| and stated that κn=O(n1/2)\kappa_{n}=O(n^{-1/2}) when dd is fixed. However, in our case dd can be much larger than nn, so we will use a different approach, which is our Lemma 4, to analyze the rate of supβ|L^n(β)Ln(β)|\mathop{\sup_{\beta\in\mathcal{H}}}|\hat{L}_{n}(\beta)-L_{n}(\beta)|. Define

B(h)=|\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}-h)-L_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}-h)-\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})+L_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})|.

The following Lemma 4 provides a bound for B(h) that holds with high probability.

Lemma 4.
\sup_{\|h\|_{0}\leq 2s_{n},\|h\|_{2}\leq 2r(\gamma)}\frac{B(h)}{\|h\|_{2}}\stackrel{p}{\lesssim}\frac{\sqrt{s_{n}^{2}\log(d)}}{\sqrt{n}}\alpha_{n}. (3.4)

B(h) is used to bound the difference between \hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}+\hat{\Delta})-\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}) and L_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}+\hat{\Delta})-L_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}). By an argument similar to Theorem 4.8 and Lemma 4.10 in Han et al., (2017), we can prove the stochastic error rate stated in the following theorem.

Theorem 5 (Stochastic Error Rate).
\|\hat{\beta}_{r(\gamma),\alpha_{n}}-\beta_{r(\gamma),\alpha_{n}}^{*}\|_{2}^{2}\stackrel{p}{\lesssim}\alpha_{n}^{2}\log(d)s_{n}^{2}/n+\alpha_{n}^{-2}. (3.5)

From Theorem 2 and Theorem 5, we conclude \|\hat{\beta}_{r(\gamma),\alpha_{n}}-\beta^{*}\|_{2}^{2}\stackrel{p}{\lesssim}\alpha_{n}^{2}\log(d)s_{n}^{2}/n+\alpha_{n}^{-2}. Choosing \alpha_{n}\asymp(\frac{n}{\log(d)s_{n}^{2}})^{\frac{1}{4}}, we get the overall L_{2} error rate as follows.

Theorem 6 (Overall Error Rate).
\|\hat{\beta}_{r(\gamma),\alpha_{n}}-\beta^{*}\|_{2}\stackrel{p}{\lesssim}(\log(d)s_{n}^{2}/n)^{\frac{1}{4}}. (3.6)
Remark 7.

For comparison, Han et al., (2017) proved \|\beta_{r(\gamma),\alpha_{n}}^{*}-\beta^{*}\|_{2}^{2}\lesssim\alpha_{n}^{-1} and \|\hat{\beta}_{r(\gamma),\alpha_{n}}-\beta_{r(\gamma),\alpha_{n}}^{*}\|_{2}^{2}\stackrel{p}{\lesssim}\alpha_{n}^{2}\log(d)s_{n}/n+\alpha_{n}^{-1}+\kappa_{n}. Choosing \alpha_{n}\asymp(\frac{n}{\log(d)s_{n}})^{\frac{1}{3}}, their paper obtains \|\hat{\beta}_{r(\gamma),\alpha_{n}}-\beta^{*}\|_{2}\stackrel{p}{\lesssim}(\log(d)s_{n}/n)^{\frac{1}{6}}+\kappa_{n}^{\frac{1}{2}}.

3.2 Discrete Cases

We make different assumptions for the discrete covariates cases in this section. The main difference between the discrete and continuous cases is that in the discrete cases, -C(\beta) might be constant in a neighborhood of \beta^{*}.

Assumption 3.

(B1) There exists a matrix \mathcal{M}\in\mathbb{R}^{(d-s)\times(s-1)} such that the conditional distribution of (X_{i}-X_{j})_{S^{c}}-\mathcal{M}(X_{i}-X_{j})_{S\setminus 1} given (X_{i}-X_{j})_{S} is symmetric about the origin.

(B2) |w_{i}|\leq M_{2}.

(B3) -M_{4}\leq\|X_{i}\|_{\infty}\leq M_{4}.

Remark 8.

In the continuous covariates cases, (B1) is satisfied under some multivariate normal distributions, where we can take S to be slightly larger than the set of nonzero indices. We can let S include the features that are correlated with the first feature.

Remark 9.

(B2) and (B3) also appear in the continuous cases and are used in the proofs of Lemmas 3 and 4.

In the discrete cases, we bound the approximation error and the stochastic error simultaneously. Lemmas 3 and 4 also hold in the discrete cases, and the proofs are almost the same. Moreover, we have a different main theorem, as follows.

Theorem 7 (Error Rate in Discrete Cases).

As long as s^{3}\log(d)/n\rightarrow 0, for all (x_{i},x_{j}) such that (x_{i}^{T}-x_{j}^{T})\beta^{*}>0, we have (x_{i}^{T}-x_{j}^{T})_{S}\hat{\beta}_{r,\alpha_{n},S}\stackrel{p}{>}\frac{\epsilon}{\alpha_{n}}, and \|\hat{\beta}_{r,\alpha_{n},S^{c}}\|_{1}\stackrel{p}{\lesssim}(\frac{s^{3}\log(d)}{n})^{1/2}.

Remark 10.

In the proof, we choose \alpha_{n}=\max(1,\frac{\log(c_{11})-\log(\delta)+\log(3)}{\varepsilon}). Using (B3), the theorem implies that for any i,j, |\hat{\beta}_{r,\alpha_{n},S^{c}}(X_{i}^{T}-X_{j}^{T})_{S^{c}}|\stackrel{p}{\lesssim}(\frac{s^{3}\log(d)}{n})^{1/2}. If (x_{i}^{T}-x_{j}^{T})\beta^{*}>0, then (x_{i}^{T}-x_{j}^{T})_{S}\hat{\beta}_{r,\alpha_{n},S}\stackrel{p}{>}\frac{\varepsilon}{\alpha_{n}}\gtrsim(\frac{s^{3}\log(d)}{n})^{1/2}\stackrel{p}{\gtrsim}|\hat{\beta}_{r,\alpha_{n},S^{c}}(x_{i}^{T}-x_{j}^{T})_{S^{c}}|. Therefore, when \frac{s^{3}\log(d)}{n}\rightarrow 0, we conclude that once (x_{i}^{T}-x_{j}^{T})\beta^{*}>0, then (x_{i}^{T}-x_{j}^{T})\hat{\beta}_{r,\alpha_{n}}\stackrel{p}{\geq}0. In other words, if the contrast functions satisfy D(x_{i})>D(x_{j}), then x_{i}^{T}\hat{\beta}_{r,\alpha_{n}}\stackrel{p}{\geq}x_{j}^{T}\hat{\beta}_{r,\alpha_{n}}, which means we will rank the subjects in the right order with high probability.

4 Simulation Studies

Four methods are compared in this section: POWL in Song et al., (2015), PAL in Shi et al., (2018), SCAL in Liang et al., (2018) and SMCAL. The POWL method minimizes

\frac{1}{n}\sum_{i=1}^{n}\frac{Y_{i}\cdot[1-(2A_{i}-1)g(X_{i})]_{+}}{A_{i}\pi(X_{i})+(1-A_{i})[1-\pi(X_{i})]}+\lambda\sum_{i=1}^{d}|\beta_{i}|, (4.1)

where g(X_{i})=\beta^{T}X_{i}+\beta_{0} is the decision function and the tuning parameter \lambda is selected to maximize the IPW value estimator. PAL is a dynamic treatment regime approach, but it can be used in the single-stage problem because PAL estimates the last stage first. PAL performs variable selection using penalized A-learning, then uses unpenalized A-learning to solve for the coefficients. We used the R package provided by the authors of PAL to obtain its results.

To evaluate the estimated coefficients, we report the MSE after normalization. To evaluate the variable selection accuracy, we report the Incorrect Zeros (the true coefficient is nonzero but the estimate is zero) and the Correct Zeros (the true coefficient is zero and the estimate is zero). To evaluate the estimated treatment regime, we report the Percentage of Correct Decisions (PCD) and the mean response if we follow the estimated treatment regime (Estimated Value). The Estimated Value is calculated by drawing 1000 new samples. Standard errors of the PCDs and Estimated Values are given in parentheses.

4.1 Low Dimension

4.1.1 Linear Case

The first example can be found in Zhao et al., (2012) as well as in Liang et al., (2018). Set d=50; X_{i1},X_{i2},...,X_{i50} are independent variables, all generated from U(-1,1) in the continuous cases. \pi(X_{i}) is chosen to be 0.5 for all i=1,2,...,n. Y_{i}|X_{i},A_{i}\sim N(Q_{0}(X_{i}),1), where Q_{0}(X_{i})=1+2X_{i1}+X_{i2}+0.5X_{i3}+0.442(1-X_{i1}-X_{i2})(2A_{i}-1). We conduct 100 simulations for n=30,100,200 respectively.

In the discrete cases, everything remains the same except that we generate X_{i1},X_{i2},...,X_{i50} from the uniform distribution on \{-0.9,-0.7,-0.5,-0.3,-0.1,0.1,0.3,0.5,0.7,0.9\}.
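A sketch of this data-generating mechanism (continuous and discrete variants; written by us purely for illustration) is:

```python
import numpy as np

def generate_linear_example(n, d=50, discrete=False, seed=None):
    """Section 4.1.1 design: pi(X) = 0.5 and
    Q0(X) = 1 + 2*X1 + X2 + 0.5*X3 + 0.442*(1 - X1 - X2)*(2A - 1)."""
    rng = np.random.default_rng(seed)
    if discrete:
        levels = np.linspace(-0.9, 0.9, 10)       # {-0.9, -0.7, ..., 0.9}
        X = rng.choice(levels, size=(n, d))
    else:
        X = rng.uniform(-1.0, 1.0, size=(n, d))
    A = rng.binomial(1, 0.5, size=n)
    Q0 = (1 + 2 * X[:, 0] + X[:, 1] + 0.5 * X[:, 2]
          + 0.442 * (1 - X[:, 0] - X[:, 1]) * (2 * A - 1))
    Y = Q0 + rng.standard_normal(n)               # Y | X, A ~ N(Q0(X), 1)
    return X, A, Y
```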

Numerical comparisons are summarized in Table 1, where we ran simulations only for SMCAL and PAL; the results for SCAL and POWL are taken from Liang et al., (2018).

Table 1: Comparison between POWL, PAL, SCAL and SMCAL: low dimensional case.
n MSE Incorr0(0) Corr0(48) PCD Estimated Value
POWL 30 1.60 1.70 42.23 0.615(0.02) 1.09(0.02)
100 1.27 1.94 46.64 0.768(0.02) 1.27(0.02)
200 1.09 1.99 47.78 0.786(0.02) 1.30(0.03)
PAL 30 1.83 1.76 46.23 0.631(0.012) 1.17(0.015)
100 1.01 0.92 46.53 0.808(0.009) 1.37(0.009)
200 0.32 0.17 47.00 0.903(0.005) 1.45(0.006)
SCAL 30 1.40 0.73 35.79 0.659(0.01) 1.16(0.01)
100 0.52 0.11 41.97 0.764(0.01) 1.31(0.01)
200 0.19 0.01 46.03 0.749(0.01) 1.32(0.01)
SMCAL 30 0.93 0.82 43.09 0.677(0.014) 1.21(0.017)
100 0.81 0.75 44.11 0.735(0.010) 1.28(0.013)
200 0.69 0.57 45.42 0.788(0.007) 1.34(0.009)
SMCAL-Discrete 30 0.95 0.89 43.12 0.653(0.014) 1.20(0.017)
100 0.79 0.74 44.15 0.723(0.009) 1.28(0.011)
200 0.70 0.50 43.78 0.764(0.006) 1.33(0.008)

In general, POWL gives many estimated zeros. Our MSEs are larger than those of SCAL but smaller than those of POWL. In fact, our theoretical results indicate a slower convergence rate than SCAL. When the sample size is 200, our PCDs and Estimated Values are significantly higher than those of SCAL, which may be explained by SMCAL's smaller bias compared to SCAL. We also notice that the PCD of 200-sample SCAL is abnormally lower than the PCD of 100-sample SCAL. This may be because SCAL does not converge to the true coefficients due to the bias induced by the hinge loss. In summary, we think that when the sample size is small, SCAL is better than SMCAL, but as the sample size grows, SMCAL becomes increasingly better than SCAL.

In the discrete cases, the MSE, variable selection, PCDs and Estimated Values have similar patterns.

PAL has the best performance in this example. The 100-sample PAL is already much better than the 200-sample SCAL and SMCAL in terms of PCDs and Estimated Values. In this example, the contrast function D(X_{i})=0.884(1-X_{i1}-X_{i2}), which is a linear combination of the features. In such a linear case, pairwise-comparison-based methods like CAL, SCAL and SMCAL seem to be less efficient than PAL. However, CAL, SCAL and SMCAL only assume D(X_{i})=Q(\beta^{*T}X_{i}), which is flexible and can be applied in nonlinear cases.

4.1.2 Nonlinear Case

The second example, see Table 2, compares PAL and SMCAL in a case where the contrast function is a nonlinear function of the features. We set d=50, n=30,100,200, X_{i}\sim\mathcal{N}(0,I), \pi(X_{i})=0.5 and Y_{i}=1+X_{i,1}-X_{i,2}+X_{i,3}+X_{i,4}+A\cdot(\exp(1+X_{i,1}+X_{i,2}-X_{i,3}+X_{i,4})-\exp(1))+\epsilon, where \epsilon\sim\mathcal{N}(0,0.5). Our MSEs are much smaller than those of PAL, and we can successfully select the important features when the sample size is relatively large. Our PCDs and Estimated Values are also much better than those of PAL. PAL does not seem suitable for such a nonlinear case, but SMCAL performs quite well.

Table 2: Comparison between PAL and SMCAL: nonlinear case.
n MSE Incorr0(0) Corr0(46) PCD Estimated Value
PAL 30 1.44 3.12 44.30 0.579(0.007) 10.42(0.431)
100 0.82 2.22 45.35 0.666(0.006) 14.67(0.486)
200 0.58 1.80 45.83 0.706(0.005) 16.39(0.552)
SMCAL 30 0.90 1.03 33.53 0.689(0.006) 17.45(0.464)
100 0.52 0.05 29.44 0.814(0.006) 18.56(0.319)
200 0.28 0.00 30.66 0.888(0.004) 19.92(0.510)

4.2 High Dimension

We conduct simulations for the following six high dimensional models, which can be found in Section 4.2 of Liang et al., (2018). We uniformly set the sample size n=100 and the dimension d=500. The interaction terms with the treatment type A are X\beta or (X\beta)^{3}, which satisfy our monotone single-index model assumption. However, the baseline function can be very flexible: the following models use linear, polynomial or even trigonometric sine baseline functions. We set X\sim\mathcal{N}(0,I), \pi(X_{i})=0.5 and \varepsilon\sim\mathcal{N}(0,1).

\mathbf{Model\ 1}: Y=3+X\gamma_{1}+X\beta A+\varepsilon, where \gamma_{1}=(-1,1,\mathbf{0}_{d-2})^{T}, \beta=(2,1.8,0,0,0,-1.6,\mathbf{0}_{d-6})^{T}.

\mathbf{Model\ 2}: Y=3-0.5(X\gamma_{1})^{2}+0.625(X\gamma_{2})^{2}+X\beta A+\varepsilon, where \gamma_{1}=(1,0.5,\mathbf{0}_{d-2})^{T}, \gamma_{2}=(0,1,\mathbf{0}_{d-2})^{T}, \beta=(2,1.8,0,0,0,-1.6,\mathbf{0}_{d-6})^{T}.

\mathbf{Model\ 3}: Y=1-\sin(X\gamma_{1})+\sin(X\gamma_{2})+X\beta A+\varepsilon, where \gamma_{1}=(1,\mathbf{0}_{d-1})^{T}, \gamma_{2}=(0,1,\mathbf{0}_{d-2})^{T}, \beta=(2,1.8,0,0,0,-1.6,\mathbf{0}_{d-6})^{T}.

\mathbf{Model\ 4}: Y=3+X\gamma_{1}+(X\beta)^{3}A+\varepsilon, where \gamma_{1}=(-1,1,\mathbf{0}_{d-2})^{T}, \beta=(1,0.9,0,0,0,-0.8,\mathbf{0}_{d-6})^{T}.

\mathbf{Model\ 5}: Y=3-0.5(X\gamma_{1})^{2}+0.625(X\gamma_{2})^{2}+(X\beta)^{3}A+\varepsilon, where \gamma_{1}=(1,0.5,\mathbf{0}_{d-2})^{T}, \gamma_{2}=(0,1,\mathbf{0}_{d-2})^{T}, \beta=(1,0.9,0,0,0,-0.8,\mathbf{0}_{d-6})^{T}.

\mathbf{Model\ 6}: Y=1-\sin(X\gamma_{1})+\sin(X\gamma_{2})+(X\beta)^{3}A+\varepsilon, where \gamma_{1}=(1,\mathbf{0}_{d-1})^{T}, \gamma_{2}=(0,1,\mathbf{0}_{d-2})^{T}, \beta=(1,0.9,0,0,0,-0.8,\mathbf{0}_{d-6})^{T}.
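To make the designs concrete, the sketch below (illustrative only) generates data from Model 1; the other five models differ only in the baseline term and in whether the interaction is X\beta or (X\beta)^{3}.

```python
import numpy as np

def generate_model1(n=100, d=500, seed=None):
    """Model 1: Y = 3 + X gamma1 + (X beta) A + eps, with pi(X) = 0.5."""
    rng = np.random.default_rng(seed)
    gamma1 = np.zeros(d); gamma1[:2] = [-1.0, 1.0]
    beta = np.zeros(d); beta[[0, 1, 5]] = [2.0, 1.8, -1.6]
    X = rng.standard_normal((n, d))
    A = rng.binomial(1, 0.5, size=n)
    Y = 3 + X @ gamma1 + (X @ beta) * A + rng.standard_normal(n)
    return X, A, Y, beta
```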

Liang et al., (2018) reported the results of SCAL, and we ran simulations for SMCAL, all summarized in Table 3.

Table 3: Simulation results of SCAL and SMCAL: high dimensional cases.
Model MSE Incorr0(0) Corr0(497) PCD Estimated Value
SCAL SMCAL SCAL SMCAL SCAL SMCAL SCAL SMCAL SCAL SMCAL
Model 1 0.61 0.76 0.75 0.86 482.62 490.81 0.744(0.01) 0.732(0.005) 3.80(0.02) 3.82(0.016)
Model 2 0.56 0.71 0.57 0.55 485.34 492.69 0.763(0.01) 0.749(0.006) 3.79(0.03) 3.87(0.015)
Model 3 0.44 0.69 0.49 0.47 488.12 491.59 0.786(0.01) 0.751(0.005) 1.92(0.02) 1.88(0.015)
Model 4 0.35 0.62 0.35 0.21 486.81 481.69 0.801(0.01) 0.752(0.006) 5.67(0.04) 5.69(0.043)
Model 5 0.32 0.56 0.25 0.18 487.00 485.41 0.810(0.01) 0.740(0.008) 5.63(0.05) 5.62(0.057)
Model 6 0.29 0.55 0.20 0.05 485.33 485.09 0.820(0.01) 0.774(0.007) 3.74(0.04) 3.78(0.040)

It shows that our model selections and Estimated Values are close to those in SCAL. The MSEs and PCDs are not as good as those in SCAL, perhaps because SCAL has a faster convergence rate, which makes it perform better than SMCAL in high dimensions.

5 Real Data Analysis

The STAR*D study, which focused on Major Depressive Disorder (MDD), enrolled over 4,000 outpatients from age 18 to age 75. There were altogether four treatment levels: 2, 2A, 3 and 4. At each treatment level, patients were randomly assigned to different treatment groups. Various clinical and socioeconomic factors were recorded, as well as the treatment outcomes. More details can be found in Fava et al., (2003).

We focus on level 2 of the STAR*D study. The data consist of 319 samples, with each sample containing the patient ID, treatment type, treatment outcome and 305 other clinical or genetic features. There are two treatment types: bupropion (BUP) and sertraline (SER). We use 0 to label SER and 1 to label BUP. The 16-item Quick Inventory of Depressive Symptomatology-Clinician-Rated (QIDS-C16) ranges from 0 to 24, with smaller values indicating better outcomes. In our setting, a larger value indicates a better treatment effect, so we use the negative QIDS-C16 as our treatment outcome.

Among the 319 patients selected, 166 received SER and 153 received BUP. After obtaining our estimated treatment regime, we use the inverse probability weighted (IPW) estimator proposed by Zhang et al., (2012),

\frac{1}{n}\sum_{i=1}^{n}\frac{Y_{i}\mathbbm{1}[A_{i}=\mathbbm{1}(\hat{\beta}^{T}X_{i}>\hat{c})]}{A_{i}\pi(X_{i})+(1-A_{i})[1-\pi(X_{i})]},

to calculate the estimated values. We draw 1000 bootstrap samples, each of size 1000, to obtain the Estimated Value and its 95% CI.
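A hedged sketch of this value evaluation (our own helper names; pi is taken as an array of propensity scores, for example the observed treatment proportion in this randomized comparison) is:

```python
import numpy as np

def ipw_value(Y, A, recommend, pi):
    """IPW estimate of the mean outcome under a 0/1 treatment rule."""
    agree = (A == recommend).astype(float)
    return np.mean(Y * agree / (A * pi + (1 - A) * (1 - pi)))

def bootstrap_value(Y, A, recommend, pi, n_boot=1000, m=1000, seed=None):
    """Bootstrap the IPW value: n_boot resamples of size m."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    vals = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=m)
        vals[b] = ipw_value(Y[idx], A[idx], recommend[idx], pi[idx])
    return vals.mean(), tuple(np.percentile(vals, [2.5, 97.5]))
```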

We ran PAL and our SMCAL on this dataset. Liang et al., (2018) reported the results of SCAL, POWL and the non-dynamic treatment regimes. These are all summarized in Table 4.

Table 4: Estimated Values of POWL, SCAL and SMCAL.
Treatment Regime Estimated Value Diff 95% CI on Diff
Optimal Regime (SCAL) -6.77
Optimal Regime (SMCAL) -6.90 0.13 (-0.46,0.69)
Optimal Regime (PAL) -8.15 1.38 (0.71,2.02)
Optimal Regime (POWL) -9.46 2.69 (1.18,4.24)
BUP -10.50 3.75 (2.38,5.19)
SER -10.72 3.97 (2.57,5.50)

Taking SCAL as the baseline, Diff in the table denotes the difference between the estimated value of each method and the estimated value of SCAL. According to Table 4, the optimal treatment regimes are all better than the non-dynamic treatment regimes. The optimal treatment regimes estimated by SCAL and SMCAL have the best estimated values. PAL and POWL are not as good as SCAL and SMCAL in this real data example.

A comparison of the received treatments and the estimated optimal treatment regime is given in Table 5.

Table 5: Treatment recommended by SMCAL.
Recommended: SER Recommended: BUP
Randomized Treatment: SER 75 91
Randomized Treatment: BUP 68 85

Randomized Treatment means the treatment actually administered in the STAR*D study. Recommended means the treatment suggested by the estimated optimal treatment regime. From Table 5 we can see that among the 153 people who received BUP, the optimal treatment regime by SMCAL suggests that 68 of them receive SER and 85 of them still receive BUP. It is a rather balanced treatment regime.

6 Conclusions

In this paper, we have proposed SMCAL, which is based on the concordance-assisted learning (CAL) framework and the smoothing procedure of Han et al., (2017), and which aims at dealing with the optimization issue of CAL in high dimensional data. We established convergence results in both the continuous and the discrete covariates cases. The proposed SMCAL can be successfully applied when the contrast function is a nonlinear function of the features, and, unlike SCAL, it does not induce a relatively large bias.


SUPPLEMENTARY MATERIAL

7 Proofs

We first present Lemma 8 which will be used in the proof of Lemma 4.

Lemma 8.

There exists M5M_{5}, s.t. suphd,h04snhTDTDhn(n1)2h22M5sn{\displaystyle{\sup_{h\in\mathbb{R}^{d},\|h\|_{0}\leq 4s_{n}}\frac{h^{T}D^{T}Dh}{\frac{n(n-1)}{2}\|h\|_{2}^{2}}}\leq M_{5}s_{n}}, where

D=(X1TX2TX1TX3TX1TXnTX2TX3T)n(n1)2×d\displaystyle D=\begin{pmatrix}X_{1}^{T}-X_{2}^{T}\\ X_{1}^{T}-X_{3}^{T}\\ ...\\ X_{1}^{T}-X_{n}^{T}\\ X_{2}^{T}-X_{3}^{T}\\ ...\end{pmatrix}_{\frac{n(n-1)}{2}\times d}
Proof of Lemma 8.

(A5) in Assumption 2 or (B3) in Assumption 3 implies that

suphd:h04sn,h2=1((XiXj)Th)24M42suphd:h04sn,h2=1h1216M42sn.{\displaystyle\sup_{h\in\mathbb{R}^{d}:\|h\|_{0}\leq 4s_{n},\|h\|_{2}=1}((X_{i}-X_{j})^{T}h)^{2}}\leq 4M_{4}^{2}{\displaystyle\sup_{h\in\mathbb{R}^{d}:\|h\|_{0}\leq 4s_{n},\|h\|_{2}=1}\|h\|_{1}^{2}}\leq 16M_{4}^{2}s_{n}.

Therefore,

suphd:h04snhTDTDhn(n1)2h22\displaystyle\sup_{h\in\mathbb{R}^{d}:\|h\|_{0}\leq 4s_{n}}\frac{h^{T}D^{T}Dh}{\frac{n(n-1)}{2}\|h\|_{2}^{2}} 2n(n1)suphd:h04snhT(1i<jn(XiXj)(XiXj)T)hh22\displaystyle\leq\frac{2}{n(n-1)}{\displaystyle{\displaystyle\sup_{h\in\mathbb{R}^{d}:\|h\|_{0}\leq 4s_{n}}\frac{h^{T}({\displaystyle\sum_{1\leq i<j\leq n}}(X_{i}-X_{j})(X_{i}-X_{j})^{T})h}{\|h\|_{2}^{2}}}}
2n(n1)1i<jnsuphd:h04sn((XiXj)Th)2h22\displaystyle\leq\frac{2}{n(n-1)}{{\displaystyle\sum_{1\leq i<j\leq n}}{\displaystyle\sup_{h\in\mathbb{R}^{d}:\|h\|_{0}\leq 4s_{n}}\frac{((X_{i}-X_{j})^{T}h)^{2}}{\|h\|_{2}^{2}}}}
<suphd:h04sn,h2=1((XiXj)Th)216M42sn,\displaystyle<{\displaystyle{\sup_{h\in\mathbb{R}^{d}:\|h\|_{0}\leq 4s_{n},\|h\|_{2}=1}((X_{i}-X_{j})^{T}h)^{2}}}\leq 16M_{4}^{2}s_{n}, (7.1)

This implies that if we let M5=16M42M_{5}=16M_{4}^{2}, then

suphd,h04snhTDTDhn(n1)2h22M5sn.{\displaystyle{\sup_{h\in\mathbb{R}^{d},\|h\|_{0}\leq 4s_{n}}}\frac{h^{T}D^{T}Dh}{\frac{n(n-1)}{2}\|h\|_{2}^{2}}}\leq M_{5}s_{n}.

7.1 Proofs of Section 3.1

Proof of Lemma 1.

Define

W(v,β)=E[(Q(XiTβ)Q(XjTβ))|(XiTXjT)β=v]f(XiTXjT)β(v),W(v,\beta)=E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))|(X_{i}^{T}-X_{j}^{T})\beta=v]f_{(X_{i}^{T}-X_{j}^{T})\beta}(v),

and

T=12(XiT+XjT)β,U=XTβXTβ,T=\frac{1}{2}(X_{i}^{T}+X_{j}^{T})\beta,\quad U=X^{T}\beta^{*}-X^{T}\beta,

then

W(v,β)\displaystyle W(v,\beta)
=\displaystyle= (Q(t+v2+u1)Q(tv2+u2))f(XTβ,U)(t+v2,u1)f(XTβ,U)(tv2,u2)𝑑u1𝑑u2𝑑t\displaystyle\int_{\mathbb{R}}\int_{\mathbb{R}}\int_{\mathbb{R}}(Q(t+\frac{v}{2}+u_{1})-Q(t-\frac{v}{2}+u_{2}))f_{(X^{T}\beta,U)}(t+\frac{v}{2},u_{1})f_{(X^{T}\beta,U)}(t-\frac{v}{2},u_{2})du_{1}du_{2}dt
=\displaystyle= Q(t+v2+u1)f(XTβ,U)(t+v2,u1)fXTβ(tv2)𝑑u1𝑑t\displaystyle\int_{\mathbb{R}}\int_{\mathbb{R}}Q(t+\frac{v}{2}+u_{1})f_{(X^{T}\beta,U)}(t+\frac{v}{2},u_{1})f_{X^{T}\beta}(t-\frac{v}{2})du_{1}dt
Q(tv2+u2)fXTβ(t+v2)f(XTβ,U)(tv2,u2)𝑑u2𝑑t\displaystyle-\int_{\mathbb{R}}\int_{\mathbb{R}}Q(t-\frac{v}{2}+u_{2})f_{X^{T}\beta}(t+\frac{v}{2})f_{(X^{T}\beta,U)}(t-\frac{v}{2},u_{2})du_{2}dt
=\displaystyle= Q(t+v2+u1)f(XTβ,U)(t+v2,u1)fXTβ(tv2)𝑑u1𝑑t\displaystyle\int_{\mathbb{R}}\int_{\mathbb{R}}Q(t+\frac{v}{2}+u_{1})f_{(X^{T}\beta,U)}(t+\frac{v}{2},u_{1})f_{X^{T}\beta}(t-\frac{v}{2})du_{1}dt
Q(t+v2+u1)fXTβ(t+3v2)f(XTβ,U)(t+v2,u1)𝑑u1𝑑t\displaystyle-\int_{\mathbb{R}}\int_{\mathbb{R}}Q(t+\frac{v}{2}+u_{1})f_{X^{T}\beta}(t+\frac{3v}{2})f_{(X^{T}\beta,U)}(t+\frac{v}{2},u_{1})du_{1}dt
=\displaystyle= Q(t+v2+u1)f(XTβ,U)(t+v2,u1)(fXTβ(tv2)fXTβ(t+3v2))𝑑u1𝑑t\displaystyle\int_{\mathbb{R}}\int_{\mathbb{R}}Q(t+\frac{v}{2}+u_{1})f_{(X^{T}\beta,U)}(t+\frac{v}{2},u_{1})(f_{X^{T}\beta}(t-\frac{v}{2})-f_{X^{T}\beta}(t+\frac{3v}{2}))du_{1}dt (7.2)

(A1) and (A2) in Assumption 1 imply |Q(t+\frac{v}{2}+u_{1})|\leq M_{1} and |f_{X^{T}\beta}(t-\frac{v}{2})-f_{X^{T}\beta}(t+\frac{3v}{2})|\leq 2|v|M_{3}. So

|W(v,β)|2M1M3|v|.|W(v,\beta)|\leq 2M_{1}M_{3}|v|.

Notice that

E[(Q(XiTβ)Q(XjTβ))(1F((XiTXjT)βαn))𝟙((XiTXjT)β>0)]\displaystyle E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))(1-F((X_{i}^{T}-X_{j}^{T})\beta\alpha_{n}))\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta>0)]
=\displaystyle= E[(Q(XiTβ)Q(XjTβ))F((XjTXiT)βαn)𝟙((XjTXiT)β<0)]\displaystyle E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))F((X_{j}^{T}-X_{i}^{T})\beta\alpha_{n})\mathbbm{1}((X_{j}^{T}-X_{i}^{T})\beta<0)]
=\displaystyle= E[(Q(XjTβ)Q(XiTβ))F((XiTXjT)βαn)𝟙((XiTXjT)β<0)],\displaystyle E[(Q(X_{j}^{T}\beta^{*})-Q(X_{i}^{T}\beta^{*}))F((X_{i}^{T}-X_{j}^{T})\beta\alpha_{n})\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta<0)],

we have

|C(\beta)-C_{n}(\beta)|= |E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))(\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta>0)-F((X_{i}^{T}-X_{j}^{T})\beta\alpha_{n}))]|
=\displaystyle= |E[(Q(XiTβ)Q(XjTβ))(1F((XiTXjT)βαn))𝟙((XiTXjT)β>0)\displaystyle|E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))(1-F((X_{i}^{T}-X_{j}^{T})\beta\alpha_{n}))\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta>0)
(Q(XiTβ)Q(XjTβ))F((XiTXjT)βαn)𝟙((XiTXjT)β<0)]|\displaystyle-(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))F((X_{i}^{T}-X_{j}^{T})\beta\alpha_{n})\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta<0)]|
=\displaystyle= 2|E[(Q(XjTβ)Q(XiTβ))F((XiTXjT)βαn)𝟙((XiTXjT)β<0)]|.\displaystyle 2|E[(Q(X_{j}^{T}\beta^{*})-Q(X_{i}^{T}\beta^{*}))F((X_{i}^{T}-X_{j}^{T})\beta\alpha_{n})\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta<0)]|.

Therefore,

supβd:β1=1,ββ2r(γ)|C(β)Cn(β)|\displaystyle\sup_{\beta\in\mathbb{R}^{d}:\beta_{1}=1,\|\beta^{*}-\beta\|_{2}\leq r(\gamma)}|C(\beta)-C_{n}(\beta)| (7.3)
=\displaystyle= 2supβd:β1=1,ββ2r(γ)|0W(v,β)11+eαnv𝑑v|\displaystyle 2\sup_{\beta\in\mathbb{R}^{d}:\beta_{1}=1,\|\beta^{*}-\beta\|_{2}\leq r(\gamma)}|\int_{-\infty}^{0}W(v,\beta)\frac{1}{1+e^{-\alpha_{n}v}}dv|
\displaystyle\leq 4M1M3|0v1+eαnv𝑑v|\displaystyle 4M_{1}M_{3}|\int_{0}^{\infty}\frac{v}{1+e^{\alpha_{n}v}}dv|
=\displaystyle= 4M1M3αn2|0v1+ev𝑑v|αn2.\displaystyle 4M_{1}M_{3}\alpha_{n}^{-2}|\int_{0}^{\infty}\frac{v}{1+e^{v}}dv|\lesssim\alpha_{n}^{-2}.
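For reference, the remaining integral is a standard series evaluation, so the constant implicit in the bound of Lemma 1 can be made explicit (a side remark of ours, not part of the original argument):

\int_{0}^{\infty}\frac{v}{1+e^{v}}\,dv=\sum_{k=1}^{\infty}\frac{(-1)^{k+1}}{k^{2}}=\frac{\pi^{2}}{12},\qquad\text{so}\qquad\sup_{\beta}|C(\beta)-C_{n}(\beta)|\leq\frac{\pi^{2}M_{1}M_{3}}{3}\,\alpha_{n}^{-2}.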

Proof of Theorem 2.

Notice that

C(βr(γ),αn)Cn(βr(γ),αn)C(βr(γ),αn)Cn(β)C(β)Cn(β),C(\beta_{r(\gamma),\alpha_{n}}^{*})-C_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})\leq C(\beta_{r(\gamma),\alpha_{n}}^{*})-C_{n}(\beta^{*})\leq C(\beta^{*})-C_{n}(\beta^{*}),

we have

|C(β)C(βr(γ),αn)|\displaystyle|C(\beta^{*})-C(\beta_{r(\gamma),\alpha_{n}}^{*})|\leq |Cn(β)C(βr(γ),αn)|+|C(β)Cn(β)|\displaystyle|C_{n}(\beta^{*})-C(\beta_{r(\gamma),\alpha_{n}}^{*})|+|C(\beta^{*})-C_{n}(\beta^{*})| (7.4)
\displaystyle\leq 2supβd:β1=1,ββ2r(γ)|C(β)Cn(β)|αn2.\displaystyle 2\sup_{\beta\in\mathbb{R}^{d}:\beta_{1}=1,\|\beta^{*}-\beta\|_{2}\leq r(\gamma)}|C(\beta)-C_{n}(\beta)|\lesssim\alpha_{n}^{-2}.

The last step is by Lemma 1.

Similar to Han et al., (2017), we can pick a set of positive constants \gamma=\{\gamma_{1},\gamma_{2}\} with \frac{\gamma_{2}}{\gamma_{1}} close to 1, such that for some small enough r(\gamma)>0 and all \beta with \beta_{1}=\beta_{1}^{*}=1 and \|\beta-\beta^{*}\|_{2}\leq r(\gamma), we have

γ1λmin(Γ)ββ22C(β)C(β)γ2λmax(Γ)ββ22,\gamma_{1}\lambda_{min}(\Gamma)\parallel\beta-\beta^{*}\parallel_{2}^{2}\leq C(\beta^{*})-C(\beta)\leq\gamma_{2}\lambda_{max}(\Gamma)\parallel\beta-\beta^{*}\parallel_{2}^{2}, (7.5)

then by (A3) in Assumption 1, we know λmin(Γ)c3\lambda_{min}(\Gamma)\geq c_{3}, so

βr(γ),αnβ22C(β)C(βr(γ),αn)γ1λmin(Γ)αn2.\parallel\beta_{r(\gamma),\alpha_{n}}^{*}-\beta^{*}\parallel_{2}^{2}\leq\frac{C(\beta^{*})-C(\beta_{r(\gamma),\alpha_{n}}^{*})}{\gamma_{1}\lambda_{min}(\Gamma)}\lesssim\alpha_{n}^{-2}.

Proof of Lemma 3.

Let

hk(xi,xj)=βk(wiwj)F((xixj)Tβαn)|β=βr(γ),αn,h_{k}(x_{i},x_{j})=\frac{\partial}{\partial\beta_{k}}(w_{i}-w_{j})F((x_{i}-x_{j})^{T}\beta\alpha_{n})|_{\beta=\beta_{r(\gamma),\alpha_{n}}^{*}},

where k=2,3,...,d. Then Eh_{k}(X_{i},X_{j})=0 since \beta_{r(\gamma),\alpha_{n}}^{*} is the local maximizer of C_{n}(\beta). Because

F(x)=1/(ex+ex+2)14,F^{{}^{\prime}}(x)=1/(e^{x}+e^{-x}+2)\leq\frac{1}{4},

under Assumption 2 or Assumption 3, we both have

|h_{k}(X_{i},X_{j})|=|(W_{i}-W_{j})F^{\prime}((X_{i}-X_{j})^{T}\beta_{r(\gamma),\alpha_{n}}^{*}\alpha_{n})\alpha_{n}(X_{i,k}-X_{j,k})|\leq\frac{1}{4}\alpha_{n}|(W_{i}-W_{j})(X_{i,k}-X_{j,k})|\leq\alpha_{n}M_{2}M_{4}.

Therefore, using the Hoeffding bound for U-statistics which can be found in, for example, Pitcan, (2017),

P{L^n(βr(γ),αn)t}\displaystyle P\{\|\nabla\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})\|_{\infty}\geq t\} =P{C^n(βr(γ),αn)t}\displaystyle=P\{\|\nabla\hat{C}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})\|_{\infty}\geq t\}
\displaystyle\leq dP{|1i<jn2n(n1)hk(xi,xj)|t}\displaystyle d\cdot P\{|{\displaystyle\sum_{1\leq i<j\leq n}\frac{2}{n(n-1)}h_{k}(x_{i},x_{j})|\geq t\}}
\displaystyle\leq d2exp{n/2t22αn2M22M42}\displaystyle d\cdot 2exp\{-\frac{\lfloor n/2\rfloor t^{2}}{2\alpha_{n}^{2}M_{2}^{2}M_{4}^{2}}\} (7.6)

Choosing t=\sqrt{8}\alpha_{n}M_{2}M_{4}\sqrt{\frac{\log(d)}{n}} completes the proof. ∎

Proof of Lemma 4.

Define

B(h)= |\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}-h)-L_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}-h)-\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})+L_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})|
=\displaystyle= |1n(n1)ij(wiwj)F((xixj)T(βr(γ),αnh)αn)\displaystyle|\frac{1}{n(n-1)}{\sum_{i\neq j}}(w_{i}-w_{j})F((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-h)\alpha_{n})
E[(WiWj)F((XiXj)T(βr(γ),αnh)αn)]\displaystyle-E[(W_{i}-W_{j})F((X_{i}-X_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-h)\alpha_{n})]
1n(n1)ij(wiwj)F((xixj)Tβr(γ),αnαn)\displaystyle-\frac{1}{n(n-1)}{\sum_{i\neq j}}(w_{i}-w_{j})F((x_{i}-x_{j})^{T}\beta_{r(\gamma),\alpha_{n}}^{*}\alpha_{n})
+E[(WiWj)F((XiXj)Tβr(γ),αnαn)]|\displaystyle+E[(W_{i}-W_{j})F((X_{i}-X_{j})^{T}\beta_{r(\gamma),\alpha_{n}}^{*}\alpha_{n})]|

Under Assumption 2, if h02sn\|h\|_{0}\leq 2s_{n},

|(wiwj)F((XiTXjT)(βr(γ),αnh)αn)(wiwj)F((XiTXjT)βr(γ),αnαn)|\displaystyle|(w_{i}-w_{j})F((X_{i}^{T}-X_{j}^{T})(\beta_{r(\gamma),\alpha_{n}}^{*}-h)\alpha_{n})-(w_{i}-w_{j})F((X_{i}^{T}-X_{j}^{T})\beta_{r(\gamma),\alpha_{n}}^{*}\alpha_{n})|
\displaystyle\leq 14|wiwj||(XiTXjT)h|αnM2M4αnh1M2M4αn2snh2.\displaystyle\frac{1}{4}|w_{i}-w_{j}|\cdot|(X_{i}^{T}-X_{j}^{T})h|\cdot\alpha_{n}\leq M_{2}M_{4}\alpha_{n}\|h\|_{1}\leq M_{2}M_{4}\alpha_{n}\sqrt{2s_{n}}\|h\|_{2}.

According to the Hoeffding bound for U-statistics,

P(B(h)t)p2exp{n/2t24M22M42αn2snh22}.P(B(h)\geq t)\stackrel{{\scriptstyle p}}{{\leq}}2\exp\{-\frac{\lfloor n/2\rfloor t^{2}}{4M_{2}^{2}M_{4}^{2}\alpha_{n}^{2}s_{n}\|h\|_{2}^{2}}\}.

Let t=h2tnt=\frac{\|h\|_{2}t^{{}^{\prime}}}{\sqrt{n}} and M6=8M22M42M_{6}=8M_{2}^{2}M_{4}^{2}, then

P(B(h)h2tn)p2exp{t2snαn2M6}.P(\frac{B(h)}{\|h\|_{2}}\geq\frac{t^{{}^{\prime}}}{\sqrt{n}})\stackrel{{\scriptstyle p}}{{\leq}}2\exp\{-\frac{t^{{}^{\prime}2}}{s_{n}\alpha_{n}^{2}M_{6}}\}.

Choose t=C2sn(sn+1)log(d)αnt^{{}^{\prime}}=C\sqrt{2s_{n}(s_{n}+1)log(d)}\alpha_{n}, then

P(B(h)h2C2sn(sn+1)log(d)αnn)p2d2(sn+1)C2M6.P(\frac{B(h)}{\|h\|_{2}}\geq\frac{C\sqrt{2s_{n}(s_{n}+1)log(d)}\alpha_{n}}{\sqrt{n}})\stackrel{{\scriptstyle p}}{{\leq}}2d^{-\frac{2(s_{n}+1)C^{2}}{M_{6}}}. (7.7)

Let N1N_{1} be an ϵ\epsilon-covering of {hd:h02sn,h2=2r(γ)}\{h\in\mathbb{R}^{d}:\|h\|_{0}\leq 2s_{n},\>\|h\|_{2}=2r(\gamma)\}, where we constrain N1{hd:h02sn,h2=2r(γ)}N_{1}\subset\{h\in\mathbb{R}^{d}:\|h\|_{0}\leq 2s_{n},\>\|h\|_{2}=2r(\gamma)\} and ϵ\epsilon is a small positive constant to be chosen. Since the covering number of {u2sn:u2=2r(γ)}\{u\in\mathbb{R}^{2s_{n}}:\>\|u\|_{2}=2r(\gamma)\} should not exceed the packing number, which is less than

(1+ϵ2r(γ)ϵ2r(γ))2sn,(\frac{1+\frac{\epsilon}{2r(\gamma)}}{\frac{\epsilon}{2r(\gamma)}})^{2s_{n}},

we can find a N1N_{1} such that

|N1|(d2sn)(4r(γ)ϵ)2sn.|N_{1}|\leq{d\choose 2s_{n}}(\frac{4r(\gamma)}{\epsilon})^{2s_{n}}.

Consider

N=i=12r(γ)/ϵi2r(γ)/ϵN1,\displaystyle N={\displaystyle\bigcup_{i=1}^{\lceil 2r(\gamma)/\epsilon\rceil}\frac{i}{\lceil 2r(\gamma)/\epsilon\rceil}\cdot N_{1}}, (7.8)

then

P(suphNB(h)h2C2sn(sn+1)log(d)αnn)\displaystyle P({\displaystyle{\sup_{h\in N}}\frac{B(h)}{\|h\|_{2}}\geq\frac{C\sqrt{2s_{n}(s_{n}+1)log(d)}\alpha_{n}}{\sqrt{n}})}
p\displaystyle\stackrel{{\scriptstyle p}}{{\leq}} 4r(γ)ϵ(d2sn)(4r(γ)ϵ)2sn2d2(sn+1)C2M62(4r(γ)ϵd1C2M6)2sn+1.\displaystyle\frac{4r(\gamma)}{\epsilon}{d\choose 2s_{n}}(\frac{4r(\gamma)}{\epsilon})^{2s_{n}}2d^{-\frac{2(s_{n}+1)C^{2}}{M_{6}}}\leq 2(\frac{4r(\gamma)}{\epsilon}d^{1-\frac{C^{2}}{M_{6}}})^{2s_{n}+1}. (7.9)

When h12=h22\|h_{1}\|_{2}=\|h_{2}\|_{2},

|B(h1)h12B(h2)h22|=1h12|B(h1)B(h2)|\displaystyle|\frac{B(h_{1})}{\|h_{1}\|_{2}}-\frac{B(h_{2})}{\|h_{2}\|_{2}}|=\frac{1}{\|h_{1}\|_{2}}|B(h_{1})-B(h_{2})|
=\displaystyle= 1h12|1n(n1)ij(wiwj)F((xixj)T(βr(γ),αnh1)αn)\displaystyle\frac{1}{\|h_{1}\|_{2}}|\frac{1}{n(n-1)}{\sum_{i\neq j}}(w_{i}-w_{j})F((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-h_{1})\alpha_{n})
E[(WiWj)F((XiXj)T(βr(γ),αnh1)αn)]\displaystyle-{\displaystyle E[(W_{i}-W_{j})F((X_{i}-X_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-h_{1})\alpha_{n})]}
1n(n1)ij(wiwj)F((xixj)T(βr(γ),αnh2)αn)\displaystyle-\frac{1}{n(n-1)}{\sum_{i\neq j}}(w_{i}-w_{j})F((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-h_{2})\alpha_{n})
+E[(WiWj)F((XiXj)T(βr(γ),αnh2)αn)]|\displaystyle+{\displaystyle E[(W_{i}-W_{j})F((X_{i}-X_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-h_{2})\alpha_{n})]}|
\displaystyle\leq M2αn2n(n1)D(h1h12h2h22)1+M2αn2E|(XiTXjT)(h1h12h2h22)|\displaystyle\frac{M_{2}\alpha_{n}}{2n(n-1)}\|D(\frac{h_{1}}{\|h_{1}\|_{2}}-\frac{h_{2}}{\|h_{2}\|_{2}})\|_{1}+\frac{M_{2}\alpha_{n}}{2}\cdot E|(X_{i}^{T}-X_{j}^{T})(\frac{h_{1}}{\|h_{1}\|_{2}}-\frac{h_{2}}{\|h_{2}\|_{2}})|
\displaystyle\leq 14M2αnn(n1)2D(h1h12h2h22)2+M2αnE|XiT(h1h12h2h22)|,\displaystyle\frac{\frac{1}{4}M_{2}\alpha_{n}}{\sqrt{\frac{n(n-1)}{2}}}\|D(\frac{h_{1}}{\|h_{1}\|_{2}}-\frac{h_{2}}{\|h_{2}\|_{2}})\|_{2}+M_{2}\alpha_{n}\cdot E|X_{i}^{T}(\frac{h_{1}}{\|h_{1}\|_{2}}-\frac{h_{2}}{\|h_{2}\|_{2}})|, (7.10)

where DD is defined in Lemma 8. Then by Lemma 8, we have

suph1h204sn,h12=h22,h1h12h2h222ϵ|B(h1)h12B(h2)h22|\displaystyle{\displaystyle{\sup_{\|h_{1}-h_{2}\|_{0}\leq 4s_{n},\|h_{1}\|_{2}=\|h_{2}\|_{2},\|\frac{h_{1}}{\|h_{1}\|_{2}}-\frac{h_{2}}{\|h_{2}\|_{2}}\|_{2}\leq\epsilon}}|\frac{B(h_{1})}{\|h_{1}\|_{2}}-\frac{B(h_{2})}{\|h_{2}\|_{2}}|}
\displaystyle\leq 14M2αnϵ(M5sn+4E[suphϵd:h04sn,h2=1|XiTh|])\displaystyle\frac{1}{4}M_{2}\alpha_{n}\epsilon\cdot(\sqrt{M_{5}s_{n}}+4E[\sup_{h\epsilon\mathbb{R}^{d}:\|h\|_{0}\leq 4s_{n},\|h\|_{2}=1}|X_{i}^{T}h|])
\displaystyle\leq 14M2αnϵ(M5sn+8M4sn).\displaystyle\frac{1}{4}M_{2}\alpha_{n}\epsilon\cdot(\sqrt{M_{5}s_{n}}+8M_{4}\sqrt{s_{n}}). (7.11)

It bounds the difference between B(h1)h12\frac{B(h_{1})}{\|h_{1}\|_{2}} and B(h2)h22\frac{B(h_{2})}{\|h_{2}\|_{2}} when h12=h22\|h_{1}\|_{2}=\|h_{2}\|_{2}. Next, consider h1h_{1} and h2h_{2} in the same direction but h12h22\|h_{1}\|_{2}\neq\|h_{2}\|_{2}. If h2=1\|h\|_{2}=1 and h02sn\|h\|_{0}\leq 2s_{n}, we have

|(1t[F((xixj)T(βr(γ),αnth)αn)F((xixj)Tβr(γ),αnαn)])|\displaystyle|(\frac{1}{t}[F((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-th)\alpha_{n})-F((x_{i}-x_{j})^{T}\beta_{r(\gamma),\alpha_{n}}^{*}\alpha_{n})])^{{}^{\prime}}|
=\displaystyle= |1tF((xixj)T(βr(γ),αnth)αn)αn(xixj)Th\displaystyle|\frac{-1}{t}F^{{}^{\prime}}((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-th)\alpha_{n})\alpha_{n}(x_{i}-x_{j})^{T}h
1t2[F((xixj)T(βr(γ),αnth)αn)F((xixj)Tβr(γ),αnαn)]|\displaystyle-\frac{1}{t^{2}}[F((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-th)\alpha_{n})-F((x_{i}-x_{j})^{T}\beta_{r(\gamma),\alpha_{n}}^{*}\alpha_{n})]|
=\displaystyle= |1tF((xixj)T(βr(γ),αnth)αn)αn(xixj)Th\displaystyle|\frac{-1}{t}F^{{}^{\prime}}((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-th)\alpha_{n})\alpha_{n}(x_{i}-x_{j})^{T}h
1t2[F((xixj)T(βr(γ),αnth)αn)(t(xixj)Thαn)\displaystyle-\frac{1}{t^{2}}[F^{{}^{\prime}}((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-th)\alpha_{n})(-t(x_{i}-x_{j})^{T}h\alpha_{n})
+12F′′((xixj)T(βr(γ),αnξh)αn)(t(xixj)Thαn)2]|\displaystyle+\frac{1}{2}F^{{}^{\prime\prime}}((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-\xi h)\alpha_{n})\cdot(-t(x_{i}-x_{j})^{T}h\alpha_{n})^{2}]|
=\displaystyle= |12F′′((xixj)T(βr(γ),αnξh)αn)αn2((xixj)Th)2|c6αn2M42h122c6αn2M42sn,\displaystyle|\frac{1}{2}F^{{}^{\prime\prime}}((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-\xi h)\alpha_{n})\cdot\alpha_{n}^{2}((x_{i}-x_{j})^{T}h)^{2}|\leq c_{6}\alpha_{n}^{2}M_{4}^{2}\|h\|_{1}^{2}\leq 2c_{6}\alpha_{n}^{2}M_{4}^{2}s_{n},

therefore,

|B(t1h)t1B(t2h)t2|\displaystyle{\displaystyle|\frac{B(t_{1}h)}{t_{1}}-\frac{B(t_{2}h)}{t_{2}}|}
\displaystyle\leq 2M2n(n1)ij|[F((xixj)T(βr(γ),αnt1h)αn)F((xixj)Tβr(γ),αnαn)]/t1\displaystyle\frac{2M_{2}}{n(n-1)}{\sum_{i\neq j}}|[F((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-t_{1}h)\alpha_{n})-F((x_{i}-x_{j})^{T}\beta_{r(\gamma),\alpha_{n}}^{*}\alpha_{n})]/t_{1}
[F((xixj)T(βr(γ),αnt2h)αn)F((xixj)Tβr(γ),αnαn)]/t2|\displaystyle-[F((x_{i}-x_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-t_{2}h)\alpha_{n})-F((x_{i}-x_{j})^{T}\beta_{r(\gamma),\alpha_{n}}^{*}\alpha_{n})]/t_{2}|
+2M2E|[F((XiXj)T(βr(γ),αnt1h)αn)F((XiXj)Tβr(γ),αnαn)]/t1\displaystyle+2M_{2}\cdot E|[F((X_{i}-X_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-t_{1}h)\alpha_{n})-F((X_{i}-X_{j})^{T}\beta_{r(\gamma),\alpha_{n}}^{*}\alpha_{n})]/t_{1}
[F((XiXj)T(βr(γ),αnt2h)αn)F((XiXj)Tβr(γ),αnαn)]/t2|8c6M2M42snαn2|t1t2|\displaystyle-[F((X_{i}-X_{j})^{T}(\beta_{r(\gamma),\alpha_{n}}^{*}-t_{2}h)\alpha_{n})-F((X_{i}-X_{j})^{T}\beta_{r(\gamma),\alpha_{n}}^{*}\alpha_{n})]/t_{2}|\leq 8c_{6}M_{2}M_{4}^{2}s_{n}\alpha_{n}^{2}|t_{1}-t_{2}| (7.12)

For any h{h:h02sn,h22r(γ)}h\in\{h:\|h\|_{0}\leq 2s_{n},\|h\|_{2}\leq 2r(\gamma)\}, we can find an integer i{1,2,,2r(γ)/ϵ}i\in\{1,2,...,\lceil 2r(\gamma)/\epsilon\rceil\} and h0Nh_{0}\in N s.t. |h22ir(γ)2r(γ)/ϵ|ϵ|\|h\|_{2}-\frac{2ir(\gamma)}{\lceil 2r(\gamma)/\epsilon\rceil}|\leq\epsilon, h02=2ir(γ)2r(γ)/ϵ\|h_{0}\|_{2}=\frac{2ir(\gamma)}{\lceil 2r(\gamma)/\epsilon\rceil} and h02ir(γ)h2r(γ)/ϵh22ϵ\|h_{0}-\frac{2ir(\gamma)h}{\lceil 2r(\gamma)/\epsilon\rceil\|h\|_{2}}\|_{2}\leq\epsilon, where NN is defined in (7.8).

Since

|B(h)h2B(h0)h02||B(h)h2B(2ir(γ)h2r(γ)/ϵh2)2ir(γ)h2r(γ)/ϵh22|+|B(2ir(γ)h2r(γ)/ϵh2)2ir(γ)h2r(γ)/ϵh22B(h0)h02|,|\frac{B(h)}{\|h\|_{2}}-\frac{B(h_{0})}{\|h_{0}\|_{2}}|\leq|\frac{B(h)}{\|h\|_{2}}-\frac{B(\frac{2ir(\gamma)h}{\lceil 2r(\gamma)/\epsilon\rceil\|h\|_{2}})}{\|\frac{2ir(\gamma)h}{\lceil 2r(\gamma)/\epsilon\rceil\|h\|_{2}}\|_{2}}|+|\frac{B(\frac{2ir(\gamma)h}{\lceil 2r(\gamma)/\epsilon\rceil\|h\|_{2}})}{\|\frac{2ir(\gamma)h}{\lceil 2r(\gamma)/\epsilon\rceil\|h\|_{2}}\|_{2}}-\frac{B(h_{0})}{\|h_{0}\|_{2}}|,

by (7.11) and (7.12) we have

suph02sn,h22r(γ)B(h)h2suphNB(h)h2+c7M2αn2ϵ,{\displaystyle{\sup_{\|h\|_{0}\leq 2s_{n},\|h\|_{2}\leq 2r(\gamma)}}\frac{B(h)}{\|h\|_{2}}}\leq{\displaystyle{\sup_{h\in N}}\frac{B(h)}{\|h\|_{2}}}+c_{7}M_{2}\alpha_{n}^{2}\epsilon,

where c7=14M5sn+2M4sn+8M42snc6c_{7}=\frac{1}{4}\sqrt{M_{5}s_{n}}+2M_{4}\sqrt{s_{n}}+8M_{4}^{2}s_{n}c_{6}.

Let \epsilon=\frac{\sqrt{2s_{n}(s_{n}+1)\log(d)}}{\sqrt{n}c_{7}\alpha_{n}}; using (7.9), we get

P(suph02sn,h22r(γ)B(h)h2(C1)2sn(sn+1)log(d)M2αnn)\displaystyle P({\displaystyle{\sup_{\|h\|_{0}\leq 2s_{n},\|h\|_{2}\leq 2r(\gamma)}}\frac{B(h)}{\|h\|_{2}}}\geq\frac{(C-1)\sqrt{2s_{n}(s_{n}+1)log(d)}M_{2}\alpha_{n}}{\sqrt{n}})
\displaystyle\leq 2(4r(γ)ϵd1C22M6)2sn+1=2(nc7αn4r(γ)2sn(sn+1)log(d)d1C22M6)2sn+1,\displaystyle 2(\frac{4r(\gamma)}{\epsilon}d^{1-\frac{C^{2}}{2M_{6}}})^{2s_{n}+1}=2(\frac{\sqrt{n}c_{7}\alpha_{n}4r(\gamma)}{\sqrt{2s_{n}(s_{n}+1)log(d)}}d^{1-\frac{C^{2}}{2M_{6}}})^{2s_{n}+1}, (7.13)

and we will let \alpha_{n}\ll n. Choosing a large enough C, we have

suph02sn,h22r(γ)B(h)h2psn2log(d)nαn.{\displaystyle{\sup_{\|h\|_{0}\leq 2s_{n},\|h\|_{2}\leq 2r(\gamma)}}\frac{B(h)}{\|h\|_{2}}\stackrel{{\scriptstyle p}}{{\lesssim}}\frac{\sqrt{s_{n}^{2}log(d)}}{\sqrt{n}}\alpha_{n}}.
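For completeness, a quick check on the choice of \epsilon above: it makes the discretization error match the order of the bound over the net, since

c_{7}M_{2}\alpha_{n}^{2}\epsilon=M_{2}\alpha_{n}\sqrt{\frac{2s_{n}(s_{n}+1)log(d)}{n}},

which is of the same order as the threshold (C-1)\sqrt{2s_{n}(s_{n}+1)log(d)}M_{2}\alpha_{n}/\sqrt{n} appearing in (7.13).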

The proof for the discrete cases under Assumption 3 is essentially the same. ∎

Proof of Theorem 5.

The proof of Theorem 5 is very similar to that of Lemma 4.10 in Han et al., (2017), so here we only give a sketch. According to Lemma 1,

Ln(βr(γ),αn+Δ^)Ln(βr(γ),αn)L(βr(γ),αn+Δ^)L(βr(γ),αn)c5αn2.L_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}+\hat{\Delta})-L_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})\geq L(\beta_{r(\gamma),\alpha_{n}}^{*}+\hat{\Delta})-L(\beta_{r(\gamma),\alpha_{n}}^{*})-c_{5}\alpha_{n}^{-2}. (7.14)

Based on Taylor expansions of L(\beta_{r(\gamma),\alpha_{n}}^{*})-L(\beta^{*}) and L(\beta_{r(\gamma),\alpha_{n}}^{*}+\hat{\Delta})-L(\beta^{*}),

L(βr(γ),αn+Δ^)L(βr(γ),αn)12Δ^TΓΔ^(1+o(1))Δ^TΓ(ββr(γ),αn)(1+o(1)),L(\beta_{r(\gamma),\alpha_{n}}^{*}+\hat{\Delta})-L(\beta_{r(\gamma),\alpha_{n}}^{*})\geq\frac{1}{2}\hat{\Delta}^{T}\Gamma\hat{\Delta}(1+o(1))-\hat{\Delta}^{T}\Gamma(\beta^{*}-\beta_{r(\gamma),\alpha_{n}}^{*})(1+o(1)), (7.15)

where \hat{\Delta} and \beta^{*}-\beta_{r(\gamma),\alpha_{n}}^{*} do not include the first coordinate. Then, using Theorem 2 and Assumption 1, we have

Ln(βr(γ),αn+Δ^)Ln(βr(γ),αn)\displaystyle L_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}+\hat{\Delta})-L_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})\geq c8Δ^22c9αn1Δ^2c10αn2\displaystyle c_{8}\|\hat{\Delta}\|_{2}^{2}-c_{9}\alpha_{n}^{-1}\|\hat{\Delta}\|_{2}-c_{10}\alpha_{n}^{-2} (7.16)

where c_{8},c_{9},c_{10} are suitable constants.

Similarly to the proof of Theorem 4.8 in Han et al., (2017), as long as \lambda_{n}\geq 2P^{*}(\nabla\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})) and \hat{L}_{n}(\cdot) is locally convex and differentiable, where P^{*}(\cdot) is the dual norm of P(\cdot), we have

Δ^Snc13Δ^Sn1\|\hat{\Delta}_{S_{n}^{c}}\|_{1}\leq 3\|\hat{\Delta}_{S_{n}}\|_{1}

Then, by Lemma 3, Lemma 4 and (7.16), we have

δL^n(Δ^,βr(γ),αn):=L^n(βr(γ),αn+Δ^)L^n(βr(γ),αn)L^n(βr(γ),αn),Δ^\displaystyle\delta\hat{L}_{n}(\hat{\Delta},\beta_{r(\gamma),\alpha_{n}}^{*}):=\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}+\hat{\Delta})-\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})-\langle\nabla\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*}),\hat{\Delta}\rangle (7.17)
\displaystyle\geq c8Δ^22c9αn1Δ^2c10αn2supβ^B(Δ^)Δ^1L^n(βr(γ),αn)\displaystyle c_{8}\|\hat{\Delta}\|_{2}^{2}-c_{9}\alpha_{n}^{-1}\|\hat{\Delta}\|_{2}-c_{10}\alpha_{n}^{-2}-{\displaystyle{\sup_{\hat{\beta}\in\mathcal{H}}}B(\hat{\Delta})-\|\hat{\Delta}\|_{1}\|\nabla\hat{L}_{n}(\beta_{r(\gamma),\alpha_{n}}^{*})\|_{\infty}}
p\displaystyle\stackrel{{\scriptstyle p}}{{\geq}} c8Δ^22c9αn1Δ^2c10αn22(C1)αnsn2log(d)nΔ^222M2M4Δ^1αnlog(d)n\displaystyle c_{8}\|\hat{\Delta}\|_{2}^{2}-c_{9}\alpha_{n}^{-1}\|\hat{\Delta}\|_{2}-c_{10}\alpha_{n}^{-2}-2(C-1)\alpha_{n}\frac{\sqrt{s_{n}^{2}log(d)}}{\sqrt{n}}\|\hat{\Delta}\|_{2}-2\sqrt{2}M_{2}M_{4}\|\hat{\Delta}\|_{1}\alpha_{n}\sqrt{\frac{log(d)}{n}}
\displaystyle\geq c8Δ^22c9αn1Δ^2c10αn22(C1)αnsn2log(d)nΔ^282M2M4Δ^Sn1αnlog(d)n\displaystyle c_{8}\|\hat{\Delta}\|_{2}^{2}-c_{9}\alpha_{n}^{-1}\|\hat{\Delta}\|_{2}-c_{10}\alpha_{n}^{-2}-2(C-1)\alpha_{n}\frac{\sqrt{s_{n}^{2}log(d)}}{\sqrt{n}}\|\hat{\Delta}\|_{2}-8\sqrt{2}M_{2}M_{4}\|\hat{\Delta}_{S_{n}}\|_{1}\alpha_{n}\sqrt{\frac{log(d)}{n}}
\displaystyle\geq c8Δ^22Δ^2(c9αn1+(2C2+82M2M4)αnlog(d)sn2n)c10αn2.\displaystyle c_{8}\|\hat{\Delta}\|_{2}^{2}-\|\hat{\Delta}\|_{2}(c_{9}\alpha_{n}^{-1}+(2C-2+8\sqrt{2}M_{2}M_{4})\alpha_{n}\sqrt{\frac{log(d)s_{n}^{2}}{n}})-c_{10}\alpha_{n}^{-2}.

Since Ψ(S¯n)=supvS¯n0P(v)v2sn\Psi(\bar{S}_{n})=\displaystyle{\sup_{v\in\bar{S}_{n}\setminus 0}}\dfrac{P(v)}{\|v\|_{2}}\leq\sqrt{s_{n}}, we can check that the assumptions of Theorem 4.8 in Han et al., (2017) are all satisfied. Let λnαnlog(d)n\lambda_{n}\asymp\alpha_{n}\sqrt{\frac{log(d)}{n}}. Theorem 4.8 in Han et al., (2017) implies that Δ^22\|\hat{\Delta}\|_{2}^{2} p(2λnΨ(S¯n)\stackrel{{\scriptstyle p}}{{\leq}}(2\lambda_{n}\Psi(\bar{S}_{n}) +c9αn1+c_{9}\alpha_{n}^{-1} +(2C2+82M2M4)αnlog(d)sn2n)2/c82+(2C-2+8\sqrt{2}M_{2}M_{4})\alpha_{n}\sqrt{\frac{log(d)s_{n}^{2}}{n}})^{2}/c_{8}^{2} +2c10αn2/c8+2c_{10}\alpha_{n}^{-2}/c_{8} pαn2log(d)sn2/n+αn2\stackrel{{\scriptstyle p}}{{\lesssim}}\alpha_{n}^{2}log(d)s_{n}^{2}/n+\alpha_{n}^{-2}. ∎
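As a side calculation (not part of the proof), the two terms in this final bound trade off in \alpha_{n}: balancing \alpha_{n}^{2}log(d)s_{n}^{2}/n against \alpha_{n}^{-2} suggests taking \alpha_{n}\asymp(n/(s_{n}^{2}log(d)))^{1/4}, whenever this is compatible with the other requirements on \alpha_{n}, in which case

\|\hat{\Delta}\|_{2}^{2}\stackrel{p}{\lesssim}s_{n}\sqrt{\frac{log(d)}{n}}.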

7.2 Proofs of Section 3.2

Proof of Theorem 7.

Recall the definition of \mathcal{M} in Assumption 3. Define

ε=minxiTβ>xjTβ(xixj)Tβ.\varepsilon=\min\limits_{x_{i}^{T}\beta^{*}>x_{j}^{T}\beta^{*}}(x_{i}-x_{j})^{T}\beta^{*}.
𝒰t={β:β1=1,xi,xj, if (xiTxjT)β>0, then (xiTxjT)SβS+(xiTxjT)S1TβSc>t}.\mathcal{U}_{t}=\{\beta:\beta_{1}=1,\forall x_{i},x_{j},\text{ if }(x_{i}^{T}-x_{j}^{T})\beta^{*}>0,\text{ then }(x_{i}^{T}-x_{j}^{T})_{S}\beta_{S}+(x_{i}^{T}-x_{j}^{T})_{S\setminus 1}\mathcal{M}^{T}\beta_{S^{c}}>t\}.

Let \alpha_{n}\geq 1; then \beta^{*}\in\mathcal{U}_{\frac{\varepsilon}{\alpha_{n}}} by the definition of \varepsilon, so \mathcal{U}_{\frac{\varepsilon}{\alpha_{n}}} is nonempty. We prove the theorem in three steps.
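For intuition (this is not used in the proof), the margin \varepsilon is simply the smallest strictly positive gap of the index x^{T}\beta^{*} over pairs of support points of the discrete covariates. A minimal numerical sketch in Python, with a hypothetical design matrix X and coefficient vector beta_star chosen only for illustration:

import numpy as np

# hypothetical discrete support points (rows) and coefficients with the
# first entry normalized to 1, matching the constraint beta_1 = 1
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
beta_star = np.array([1.0, 0.5, -0.5])

scores = X @ beta_star                    # x_i^T beta*
gaps = scores[:, None] - scores[None, :]  # (x_i - x_j)^T beta* for all pairs
eps = gaps[gaps > 0].min()                # smallest strictly positive gap
print(eps)                                # the margin epsilon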

First, we prove that, for large enough \alpha_{n}, \beta_{r,\alpha_{n}}^{*}\in\mathcal{U}_{\frac{\varepsilon}{\alpha_{n}}}. Clearly

Cn(β)\displaystyle C_{n}(\beta) =E[(Q(XiTβ)Q(XjTβ))(2F((XiTXjT)βαn)1)𝟙((XiTXjT)β>0)]\displaystyle=E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))(2F((X_{i}^{T}-X_{j}^{T})\beta\alpha_{n})-1)\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta^{*}>0)] (7.18)
\displaystyle\leq E[(Q(XiTβ)Q(XjTβ))1𝟙((XiTXjT)β>0)]=C(β),\displaystyle E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))\cdot 1\cdot\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta^{*}>0)]=C(\beta^{*}),

and there exists a positive constant c11c_{11} s.t.

C(β)Cn(β)=\displaystyle C(\beta^{*})-C_{n}(\beta^{*})= 2E[(Q(XjTβ)Q(XiTβ))F((XiTXjT)βαn)𝟙((XiTXjT)β<0)]\displaystyle 2E[(Q(X_{j}^{T}\beta^{*})-Q(X_{i}^{T}\beta^{*}))F((X_{i}^{T}-X_{j}^{T})\beta^{*}\alpha_{n})\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta^{*}<0)] (7.19)
\displaystyle\leq 2E[(Q(XjTβ)Q(XiTβ))𝟙(XjTβ>XiTβ)]/(1+eαnε)c11eεαn.\displaystyle 2E[(Q(X_{j}^{T}\beta^{*})-Q(X_{i}^{T}\beta^{*}))\mathbbm{1}(X_{j}^{T}\beta^{*}>X_{i}^{T}\beta^{*})]/(1+e^{\alpha_{n}\varepsilon})\leq c_{11}e^{-\varepsilon\alpha_{n}}.

In the last step of (7.19) we used that, whenever X_{j}^{T}\beta^{*}>X_{i}^{T}\beta^{*}, the margin gives (X_{i}-X_{j})^{T}\beta^{*}\leq-\varepsilon, so F((X_{i}-X_{j})^{T}\beta^{*}\alpha_{n})\leq 1/(1+e^{\alpha_{n}\varepsilon}).

If \beta\notin\mathcal{U}_{\frac{\varepsilon}{\alpha_{n}}}, then there exist x_{i},x_{j} such that (x_{i}^{T}-x_{j}^{T})\beta^{*}>0 but (x_{i}^{T}-x_{j}^{T})_{S}\beta_{S}+(x_{i}^{T}-x_{j}^{T})_{S\setminus 1}\mathcal{M}^{T}\beta_{S^{c}}<\frac{\varepsilon}{\alpha_{n}}. Define \delta=\min\limits_{x_{i}^{T}\beta^{*}>x_{j}^{T}\beta^{*}}(1-F(\varepsilon))(Q(x_{i}^{T}\beta^{*})-Q(x_{j}^{T}\beta^{*}))p(x_{i,S})p(x_{j,S}), where p(X_{S}) is the p.m.f. of X_{S}; then \delta>0 and \delta does not depend on d. Using (B1) in Assumption 3, we have

C(β)Cn(β)\displaystyle C(\beta^{*})-C_{n}(\beta) (7.20)
=\displaystyle= E[(Q(XiTβ)Q(XjTβ))(22F((XiTXjT)βαn))𝟙((XiTXjT)β>0)]\displaystyle E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))(2-2F((X_{i}^{T}-X_{j}^{T})\beta\alpha_{n}))\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta^{*}>0)]
=\displaystyle= E[(Q(XiTβ)Q(XjTβ))(22F((XiXj)STβSαn+(XiXj)S1TTβScαn\displaystyle E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))(2-2F((X_{i}-X_{j})_{S}^{T}\beta_{S}\alpha_{n}+(X_{i}-X_{j})_{S\setminus 1}^{T}\mathcal{M}^{T}\beta_{S^{c}}\alpha_{n}
+[(XiXj)Sc(XiXj)S1]TβScαn))𝟙((XiTXjT)β>0)]\displaystyle+[(X_{i}-X_{j})_{S^{c}}-\mathcal{M}(X_{i}-X_{j})_{S\setminus 1}]^{T}\beta_{S^{c}}\alpha_{n}))\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta^{*}>0)]
\displaystyle\geq E[(Q(XiTβ)Q(XjTβ))(1F((XiXj)STβSαn+(XiXj)S1TTβScαn))\displaystyle E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))(1-F((X_{i}-X_{j})_{S}^{T}\beta_{S}\alpha_{n}+(X_{i}-X_{j})_{S\setminus 1}^{T}\mathcal{M}^{T}\beta_{S^{c}}\alpha_{n}))
𝟙((XiTXjT)β>0)]p(xi,S)p(xj,S)(Q(xiTβ)Q(xjTβ))(1F(ε))δ.\displaystyle\mathbbm{1}((X_{i}^{T}-X_{j}^{T})\beta^{*}>0)]\geq p(x_{i,S})p(x_{j,S})(Q(x_{i}^{T}\beta^{*})-Q(x_{j}^{T}\beta^{*}))(1-F(\varepsilon))\geq\delta.

Let \alpha_{n}>\frac{log(c_{11})-log(\delta)}{\varepsilon}; then C(\beta^{*})-C_{n}(\beta^{*})\leq c_{11}e^{-\varepsilon\alpha_{n}}<\delta\leq C(\beta^{*})-C_{n}(\beta), which implies that no \beta\notin\mathcal{U}_{\frac{\varepsilon}{\alpha_{n}}} can maximize C_{n}(\cdot). So \beta_{r,\alpha_{n}}^{*}\in\mathcal{U}_{\frac{\varepsilon}{\alpha_{n}}}.

Second, we prove that the entries of \beta_{r,\alpha_{n},S^{c}}^{*} are all 0. By (B1) in Assumption 3,

Cn(β)\displaystyle C_{n}(\beta) (7.21)
=\displaystyle= E[(Q(XiTβ)Q(XjTβ))(2F((XiXj)Tβαn)1)𝟙((XiXj)Tβ>0)]\displaystyle E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))(2F((X_{i}-X_{j})^{T}\beta\alpha_{n})-1)\mathbbm{1}((X_{i}-X_{j})^{T}\beta^{*}>0)]
=\displaystyle= E[(Q(XiTβ)Q(XjTβ))(2F((XiXj)STβSαn+(XiXj)S1TTβScαn\displaystyle E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))(2F((X_{i}-X_{j})_{S}^{T}\beta_{S}\alpha_{n}+(X_{i}-X_{j})_{S\setminus 1}^{T}\mathcal{M}^{T}\beta_{S^{c}}\alpha_{n}
+[(XiXj)Sc(XiXj)S1]TβScαn)+2F((XiXj)STβSαn+(XiXj)S1TTβScαn\displaystyle+[(X_{i}-X_{j})_{S^{c}}-\mathcal{M}(X_{i}-X_{j})_{S\setminus 1}]^{T}\beta_{S^{c}}\alpha_{n})+2F((X_{i}-X_{j})_{S}^{T}\beta_{S}\alpha_{n}+(X_{i}-X_{j})_{S\setminus 1}^{T}\mathcal{M}^{T}\beta_{S^{c}}\alpha_{n}
[(XiXj)Sc(XiXj)S1]TβScαn)2)𝟙((XiXj)Tβ>0)\displaystyle-[(X_{i}-X_{j})_{S^{c}}-\mathcal{M}(X_{i}-X_{j})_{S\setminus 1}]^{T}\beta_{S^{c}}\alpha_{n})-2)\cdot\mathbbm{1}((X_{i}-X_{j})^{T}\beta^{*}>0)\cdot
𝟙([(XiXj)Sc(XiXj)S1]TβSc>0)].\displaystyle\mathbbm{1}([(X_{i}-X_{j})_{S^{c}}-\mathcal{M}(X_{i}-X_{j})_{S\setminus 1}]^{T}\beta_{S^{c}}>0)].

Let \beta=\beta_{r,\alpha_{n}}^{*} in the formula above, and notice that, for all a>0 and t>0, \frac{\partial}{\partial t}(F(a+t)+F(a-t))=\frac{e^{-a+t}}{(e^{t}+e^{-a})^{2}}-\frac{e^{-a+t}}{(1+e^{-a+t})^{2}}. Because e^{t}+e^{-a}-(1+e^{-a+t})=(e^{t}-1)(1-e^{-a})>0, this derivative is negative, so F(a+t)+F(a-t) is decreasing in t>0 and attains its maximum at t=0 when a>0. Therefore, we have

Cn(βr,αn)\displaystyle C_{n}(\beta_{r,\alpha_{n}}^{*}) (7.22)
\displaystyle\leq 2E[(Q(XiTβ)Q(XjTβ))(2F((XiXj)STβr,αn,Sαn+(XiXj)S1TTβr,αn,Scαn)1)\displaystyle 2E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))(2F((X_{i}-X_{j})_{S}^{T}\beta_{r,\alpha_{n},S}^{*}\alpha_{n}+(X_{i}-X_{j})_{S\setminus 1}^{T}\mathcal{M}^{T}\beta_{r,\alpha_{n},S^{c}}^{*}\alpha_{n})-1)\cdot
𝟙((XiXj)Tβ>0)𝟙([(XiXj)Sc(XiXj)S1]Tβr,αn,Sc>0)]\displaystyle\mathbbm{1}((X_{i}-X_{j})^{T}\beta^{*}>0)\mathbbm{1}([(X_{i}-X_{j})_{S^{c}}-\mathcal{M}(X_{i}-X_{j})_{S\setminus 1}]^{T}\beta_{r,\alpha_{n},S^{c}}^{*}>0)]
=\displaystyle= E[(Q(XiTβ)Q(XjTβ))(2F((XiXj)STβr,αn,Sαn+(XiXj)S1TTβr,αn,Scαn)1)\displaystyle E[(Q(X_{i}^{T}\beta^{*})-Q(X_{j}^{T}\beta^{*}))(2F((X_{i}-X_{j})_{S}^{T}\beta_{r,\alpha_{n},S}^{*}\alpha_{n}+(X_{i}-X_{j})_{S\setminus 1}^{T}\mathcal{M}^{T}\beta_{r,\alpha_{n},S^{c}}^{*}\alpha_{n})-1)\cdot
𝟙((XiXj)Tβ>0)],\displaystyle\mathbbm{1}((X_{i}-X_{j})^{T}\beta^{*}>0)],

which means that the \beta with \beta_{1}=1, \beta_{S\setminus 1}=\beta_{r,\alpha_{n},S\setminus 1}^{*}+\mathcal{M}^{T}\beta_{r,\alpha_{n},S^{c}}^{*} and \beta_{S^{c}}=\vec{0} makes C_{n}(\beta) no smaller than C_{n}(\beta_{r,\alpha_{n}}^{*}). Since \beta_{r,\alpha_{n}}^{*} maximizes C_{n}(\cdot), we can set \beta_{r,\alpha_{n},S^{c}}^{*}=\vec{0} without loss of generality.
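As an aside, the elementary property of the logistic function used in (7.22), namely F(a+t)+F(a-t)\leq 2F(a) for a>0, can also be checked numerically. A minimal sketch in Python, assuming F is the standard logistic c.d.f. F(x)=1/(1+e^{-x}):

import numpy as np

def F(x):
    # standard logistic c.d.f., used here as the smoothing function
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
a = rng.uniform(0.01, 5.0, size=10000)   # a > 0
t = rng.uniform(0.0, 5.0, size=10000)    # t >= 0
# the sum is maximized at t = 0, where it equals 2 F(a)
assert np.all(F(a + t) + F(a - t) <= 2.0 * F(a) + 1e-12)
print("maximum of F(a+t)+F(a-t)-2F(a):", np.max(F(a + t) + F(a - t) - 2.0 * F(a)))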

Finally, define F(Δ^)=L^n(βr,αn+Δ^)+λnβr,αn+Δ^1L^n(βr,αn)λnβr,αn1F(\hat{\Delta})=\hat{L}_{n}(\beta_{r,\alpha_{n}}^{*}+\hat{\Delta})+\lambda_{n}\|\beta_{r,\alpha_{n}}^{*}+\hat{\Delta}\|_{1}-\hat{L}_{n}(\beta_{r,\alpha_{n}}^{*})-\lambda_{n}\|\beta_{r,\alpha_{n}}^{*}\|_{1}, then

F(Δ^)\displaystyle F(\hat{\Delta})\geq Cn(βr,αn)Cn(β^r,αn)supβ^B(Δ^)+λn(βr,αn+Δ^1βr,αn1)\displaystyle C_{n}(\beta_{r,\alpha_{n}}^{*})-C_{n}(\hat{\beta}_{r,\alpha_{n}})-{\displaystyle{\sup_{\hat{\beta}\in\mathcal{H}}}B(\hat{\Delta})+}\lambda_{n}(\|\beta_{r,\alpha_{n}}^{*}+\hat{\Delta}\|_{1}-\|\beta_{r,\alpha_{n}}^{*}\|_{1}) (7.23)
=\displaystyle= C(β)Cn(β^r,αn)(C(β)Cn(βr,αn))supβ^B(Δ^)+λn(βr,αn+Δ^1βr,αn1).\displaystyle C(\beta^{*})-C_{n}(\hat{\beta}_{r,\alpha_{n}})-(C(\beta^{*})-C_{n}(\beta_{r,\alpha_{n}}^{*}))-{\displaystyle{\sup_{\hat{\beta}\in\mathcal{H}}}B(\hat{\Delta})+}\lambda_{n}(\|\beta_{r,\alpha_{n}}^{*}+\hat{\Delta}\|_{1}-\|\beta_{r,\alpha_{n}}^{*}\|_{1}).

If \hat{\beta}_{r,\alpha_{n}}\notin\mathcal{U}_{\frac{\varepsilon}{\alpha_{n}}}, then using (7.19), (7.20) and the fact that \beta_{r,\alpha_{n},S^{c}}^{*}=\vec{0},

F(Δ^)\displaystyle F(\hat{\Delta})\geq δc11eεαnsupβ^B(Δ^)+λn(βr,αn+Δ^1βr,αn1)\displaystyle\delta-c_{11}e^{-\varepsilon\alpha_{n}}-{\displaystyle{\sup_{\hat{\beta}\in\mathcal{H}}}B(\hat{\Delta})+}\lambda_{n}(\|\beta_{r,\alpha_{n}}^{*}+\hat{\Delta}\|_{1}-\|\beta_{r,\alpha_{n}}^{*}\|_{1}) (7.24)
\displaystyle\geq δc11eεαnsupβ^B(Δ^)+λn(βr,αn,S+Δ^S1+Δ^Sc1βr,αn,S1),\displaystyle\delta-c_{11}e^{-\varepsilon\alpha_{n}}-{\displaystyle{\sup_{\hat{\beta}\in\mathcal{H}}}B(\hat{\Delta})+}\lambda_{n}(\|\beta_{r,\alpha_{n},S}^{*}+\hat{\Delta}_{S}\|_{1}+\|\hat{\Delta}_{S^{c}}\|_{1}-\|\beta_{r,\alpha_{n},S}^{*}\|_{1}),

then by Lemma 4,

F(Δ^)p\displaystyle F(\hat{\Delta})\stackrel{{\scriptstyle p}}{{\geq}} δc11eεαn+λn(Δ^S1+Δ^Sc1)c12Δ^2αnlog(d)s2/n\displaystyle\delta-c_{11}e^{-\varepsilon\alpha_{n}}+\lambda_{n}(-\|\hat{\Delta}_{S}\|_{1}+\|\hat{\Delta}_{S^{c}}\|_{1})-c_{12}\|\hat{\Delta}\|_{2}\alpha_{n}\sqrt{log(d)s^{2}/n} (7.25)
\displaystyle\geq λnΔ^S1+δc11eεαnc12Δ^2αnlog(d)s2/n\displaystyle-\lambda_{n}\|\hat{\Delta}_{S}\|_{1}+\delta-c_{11}e^{-\varepsilon\alpha_{n}}-c_{12}\|\hat{\Delta}\|_{2}\alpha_{n}\sqrt{log(d)s^{2}/n}
\displaystyle\geq λnΔ^2s+δc11eεαnc12Δ^2αnlog(d)s2/n.\displaystyle-\lambda_{n}\|\hat{\Delta}\|_{2}\sqrt{s}+\delta-c_{11}e^{-\varepsilon\alpha_{n}}-c_{12}\|\hat{\Delta}\|_{2}\alpha_{n}\sqrt{log(d)s^{2}/n}.

Choose \lambda_{n}=\frac{\delta}{3r\sqrt{s}} and \alpha_{n}=max(1,\frac{log(c_{11})-log(\delta)+log(3)}{\varepsilon}); this choice satisfies \alpha_{n}>\frac{log(c_{11})-log(\delta)}{\varepsilon}, guarantees c_{11}e^{-\varepsilon\alpha_{n}}\leq\delta/3, and gives \lambda_{n}\gtrsim\alpha_{n}\sqrt{\frac{log(d)}{n}}. So

F(Δ^)pδ/3c12Δ^2αnlog(d)s2/nδ/3max(1,log(3c11/δ)ε)c12rlog(d)s2/nF(\hat{\Delta})\stackrel{{\scriptstyle p}}{{\geq}}\delta/3-c_{12}\|\hat{\Delta}\|_{2}\alpha_{n}\sqrt{log(d)s^{2}/n}\geq\delta/3-max(1,\frac{log(3c_{11}/\delta)}{\varepsilon})c_{12}r\sqrt{log(d)s^{2}/n} (7.26)

If log(d)s^{2}/n\longrightarrow 0, then F(\hat{\Delta})\stackrel{p}{>}0, which contradicts the fact that \hat{\beta}_{r,\alpha_{n}} minimizes \hat{L}_{n}(\beta)+\lambda_{n}P(\beta) in \mathcal{H}. Hence P(\hat{\beta}_{r,\alpha_{n}}\in\mathcal{U}_{\frac{\varepsilon}{\alpha_{n}}})\longrightarrow 1.

Now consider \hat{\beta}_{r,\alpha_{n},S^{c}}. Truncate the S^{c} part of \hat{\beta}_{r,\alpha_{n}} to 0 and call the resulting vector \widetilde{\hat{\beta}}_{r,\alpha_{n}}. By definition, \hat{L}_{n}(\hat{\beta}_{r,\alpha_{n}})+\lambda_{n}\|\hat{\beta}_{r,\alpha_{n}}\|_{1}\leq\hat{L}_{n}(\widetilde{\hat{\beta}}_{r,\alpha_{n}})+\lambda_{n}\|\widetilde{\hat{\beta}}_{r,\alpha_{n}}\|_{1}, so \lambda_{n}\|\hat{\beta}_{r,\alpha_{n},S^{c}}\|_{1}\leq\hat{L}_{n}(\widetilde{\hat{\beta}}_{r,\alpha_{n}})-\hat{L}_{n}(\hat{\beta}_{r,\alpha_{n}}). Similarly to (7.22), we know L_{n}(\widetilde{\hat{\beta}}_{r,\alpha_{n}})-L_{n}(\hat{\beta}_{r,\alpha_{n}})\leq 0, so

λnβ^r,αn,Sc1B(β^r,αnβ^~r,αn)p(s2log(d)n)1/2,\lambda_{n}\|\hat{\beta}_{r,\alpha_{n},S^{c}}\|_{1}\leq B(\hat{\beta}_{r,\alpha_{n}}-\widetilde{\hat{\beta}}_{r,\alpha_{n}})\stackrel{{\scriptstyle p}}{{\lesssim}}(\frac{s^{2}\log(d)}{n})^{1/2}, (7.27)

and therefore β^r,αn,Sc1p(s3log(d)n)1/2\|\hat{\beta}_{r,\alpha_{n},S^{c}}\|_{1}\stackrel{{\scriptstyle p}}{{\lesssim}}(\frac{s^{3}\log(d)}{n})^{1/2}. ∎
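To connect the argument back to the quantity being optimized, the last step concerns the minimizer of \hat{L}_{n}(\beta)+\lambda_{n}P(\beta), a penalized smoothed pairwise-concordance objective. The following Python sketch is only schematic: the estimated contrast scores q_hat are hypothetical stand-ins for the Q(\cdot) terms, the penalty is taken to be the \ell_{1} norm excluding the first coefficient (fixed at 1 as in the identification constraint \beta_{1}=1), and this is an illustration rather than the paper's exact implementation.

import numpy as np

def sigmoid(x):
    # logistic smoothing function F
    return 1.0 / (1.0 + np.exp(-x))

def penalized_smoothed_objective(beta, X, q_hat, alpha_n, lam):
    # schematic penalized smoothed pairwise objective (to be minimized)
    n = X.shape[0]
    diff_q = q_hat[:, None] - q_hat[None, :]      # q_hat_i - q_hat_j
    diff_x = X[:, None, :] - X[None, :, :]        # x_i - x_j
    smooth = sigmoid(alpha_n * (diff_x @ beta))   # F(alpha_n (x_i - x_j)^T beta)
    off_diag = ~np.eye(n, dtype=bool)
    concordance = np.sum(diff_q[off_diag] * smooth[off_diag]) / (n * (n - 1))
    return -concordance + lam * np.sum(np.abs(beta[1:]))  # beta_1 = 1 is not penalized

Such an objective can be handed to a generic numerical optimizer over the free coordinates of \beta, with the first coordinate held at 1.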

References

  • Blatt et al., (2004) Blatt, D., Murphy, S., and Zhu, J. (2004). A-learning for approximate planning.
  • Fan et al., (2017) Fan, C., Lu, W., Song, R., and Zhou, Y. (2017). Concordance-assisted learning for estimating optimal individualized treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1565 – 1582.
  • Fava et al., (2003) Fava, M., Rush, J., Kupfer, D. J., Trivedi, M. H., Nierenberg, A. A., Thase, M. E., Sackeim, H. A., Quitkin, F. M., Wisniewski, S., Lavori, P. W., Rosenbaum, J. F., Dunner, D. L., and STAR*D Investigators, G. (2003). Background and rationale for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study. Drugs therapy: predictors of response, 26(2).
  • Han et al., (2017) Han, F., Ji, H., Ji, Z., and Wang, H. (2017). A provable smoothing approach for high dimensional generalized regression with applications in genomics. Electronic Journal of Statistics, 11(2):4347–4403.
  • Han, (1987) Han, A. K. (1987). Non-parametric analysis of a generalized regression model: The maximum rank correlation estimator. Journal of Econometrics, 35(2-3):303 – 316.
  • Liang et al., (2018) Liang, S., Lu, W., Song, R., and Wang, L. (2018). Sparse concordance-assisted learning for optimal treatment decision. Journal of Machine Learning Research, 18(154-234):1 – 26.
  • Lu et al., (2013) Lu, W., Zhang, H., and Zeng, D. (2013). Variable selection for optimal treatment decision. Statistical Methods in Medical Research, (5):493.
  • Murphy, (2003) Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 65(2):331 – 366.
  • Pitcan, (2017) Pitcan, Y. (2017). A note on concentration inequalities for U-statistics.
  • Rubin, (1980) Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591 – 593.
  • Shi et al., (2018) Shi, C., Fan, A., Song, R., and Lu, W. (2018). High-dimensional A-learning for optimal dynamic treatment regimes. Annals of Statistics, 46(3):925–957.
  • Song et al., (2015) Song, R., Kosorok, M., Zeng, D., Zhao, Y., Laber, E., and Yuan, M. (2015). On sparse representation for optimal individualized treatment selection with penalized outcome weighted learning.
  • Song et al., (2011) Song, R., Wang, W., Zeng, D., and Kosorok, M. R. (2011). Penalized Q-learning for dynamic treatment regimes.
  • Watkins, (1989) Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK.
  • Watkins and Dayan, (1992) Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279.
  • Zhang et al., (2012) Zhang, B., Tsiatis, A., Laber, E., and Davidian, M. (2012). A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010 – 1018.
  • Zhao et al., (2012) Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106 – 1118.
  • Zhu et al., (2019) Zhu, W., Zeng, D., and Song, R. (2019). Proper inference for value function in high-dimensional q-learning for dynamic treatment regimes. Journal of the American Statistical Association, 114(527):1404–1417.