
The oracle property of the generalized outcome adaptive lasso

Ismaila Baldé
(April 6, 2025)
Abstract

The generalized outcome-adaptive lasso (GOAL) is a variable selection method for high-dimensional causal inference proposed by Baldé et al. [2023, Biometrics 79(1), 514–520]. It is now well established that, when the dimension is high, an ideal variable selection method should have the oracle property to ensure optimal large-sample performance. However, the oracle property of GOAL has not been proven. In this paper, we show that the GOAL estimator enjoys the oracle property. Our simulations also show that GOAL handles collinearity better than the oracle-like outcome-adaptive lasso (OAL).

Keywords: Causal inference; GOAL; High-dimensional data; Oracle property; Propensity score; Variable selection.

1 Introduction

Let $Y$ be a continuous outcome variable, $\mathbf{X}=(X_{1},\ldots,X_{p})$ a matrix of potential confounders and $A$ a binary treatment. Assume that all $p$ covariates $X_{j}$, $j=1,\ldots,p$, are measured prior to the treatment $A$, which in turn is measured prior to the outcome $Y$. We assume the propensity score (PS) model is ${\rm logit}\{P(A=1\mid\mathbf{X})\}=\sum_{j=1}^{p}\alpha_{j}X_{j}$. Let $\mathcal{C}$ and $\mathcal{P}$ denote the indices of confounders (covariates related to both outcome and treatment) and of pure predictors of the outcome, respectively. Our objective is to estimate the following PS model: ${\rm logit}\{\pi(X,\hat{\alpha})\}=\sum_{j\in\mathcal{C}}\hat{\alpha}_{j}X_{j}+\sum_{j\in\mathcal{P}}\hat{\alpha}_{j}X_{j}$. The negative log-likelihood function of $\alpha$ is $\ell_{n}(\alpha;A,\mathbf{X})=\sum_{i=1}^{n}\{-a_{i}(x_{i}^{T}\alpha)+\log(1+e^{x_{i}^{T}\alpha})\}$. Baldé et al. (2023) proposed the generalized outcome-adaptive lasso (GOAL): $\hat{\alpha}(\mathrm{GOAL})=\arg\min_{\alpha}[\ell_{n}(\alpha;A,\mathbf{X})+\lambda_{1}\sum_{j=1}^{p}\hat{w}_{j}|\alpha_{j}|+\lambda_{2}\sum_{j=1}^{p}\alpha_{j}^{2}]$, where $\hat{w}_{j}=|\hat{\beta}_{j}^{\,ols}|^{-\gamma}$ with $\gamma>1$ and $(\hat{\beta}_{A}^{ols},\hat{\beta}^{ols})=\arg\min_{(\beta_{A},\beta)}\|Y-\beta_{A}A-\mathbf{X}\beta\|_{2}^{2}$. GOAL is designed to improve on OAL (Shortreed and Ertefaie, 2017) for high-dimensional data analysis. Baldé (2022) conjectured that GOAL satisfies the oracle property. In this paper, we show that GOAL enjoys the oracle property for a proper choice of $\lambda_{1}$, $\lambda_{2}$ and $\gamma$; that is, GOAL performs as well as if the true underlying model were known in advance. The oracle property is particularly important for a variable selection method when the dimension is high, to ensure optimal large-sample performance (Zou, 2006; Zou and Zhang, 2009).
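To make the estimator concrete, the following is a minimal sketch (ours, not the implementation of Baldé et al., 2023) of a GOAL-type fit by proximal gradient descent: the smooth part of the objective is the logistic negative log-likelihood plus the ridge term, and the weighted $\ell_1$ penalty is handled by coordinatewise soft-thresholding. The adaptive weights come from a least-squares fit of the outcome on $(A,\mathbf{X})$; when $p>n$ this fit is not unique, and the step size and iteration count below are illustrative choices.

```python
import numpy as np

def goal_sketch(X, A, y, lam1, lam2, gamma=3.0, lr=0.01, n_iter=5000):
    """Illustrative GOAL-type estimator: adaptive-lasso + ridge penalized
    logistic regression for the propensity score, fitted by proximal
    gradient descent. Not the authors' implementation."""
    n, p = X.shape
    # Adaptive weights w_j = |beta_j^ols|^(-gamma), from a (minimum-norm)
    # least-squares regression of the outcome y on (A, X).
    Z = np.column_stack([A, X])
    beta_ols = np.linalg.lstsq(Z, y, rcond=None)[0][1:]  # drop treatment coefficient
    w = np.abs(beta_ols) ** (-gamma)

    alpha = np.zeros(p)
    for _ in range(n_iter):
        # Gradient of the smooth part: logistic negative log-likelihood + ridge.
        ps = 1.0 / (1.0 + np.exp(-X @ alpha))
        grad = X.T @ (ps - A) + 2.0 * lam2 * alpha
        z = alpha - lr * grad
        # Proximal step for the weighted L1 penalty: soft-thresholding.
        alpha = np.sign(z) * np.maximum(np.abs(z) - lr * lam1 * w, 0.0)
    return alpha
```

In practice $\lambda_{1}$ and $\lambda_{2}$ are chosen by a data-driven criterion (Shortreed and Ertefaie, 2017, tune OAL with a covariate-balance criterion); that step is not reproduced here.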

2 Statistical theory

Let $\mathcal{A}=\mathcal{C}\cup\mathcal{P}=\{1,2,\ldots,p_{0}\}$ be the indices of the desirable covariates to include in the estimated PS. Let $\mathcal{A}^{c}=\mathcal{I}\cup\mathcal{S}=\{p_{0}+1,p_{0}+2,\ldots,p_{0}+(p-p_{0})\}$ be the indices of the covariates to exclude, where $\mathcal{I}$ and $\mathcal{S}$ are the indices of pure predictors of treatment and of spurious covariates (covariates unrelated to both outcome and treatment), respectively. We write the Fisher information (FI) matrix as

$$\mathbf{F}(\alpha^{*})=E\left(\phi''(x^{T}\alpha^{*})\,xx^{T}\right)=\begin{pmatrix}\mathbf{F}_{11}&\mathbf{F}_{12}\\ \mathbf{F}_{21}&\mathbf{F}_{22}\end{pmatrix},\qquad(1)$$

where $\mathbf{F}_{11}$ is the ($p_{0}\times p_{0}$) FI matrix for the parsimonious PS.
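For later reference (we state this explicitly; it is implicit in the negative log-likelihood above), the function $\phi$ and the derivatives used throughout the proof are

$$\phi(t)=\log(1+e^{t}),\qquad \phi'(t)=\frac{e^{t}}{1+e^{t}},\qquad \phi''(t)=\phi'(t)\{1-\phi'(t)\},\qquad \phi'''(t)=\phi'(t)\{1-\phi'(t)\}\{1-2\phi'(t)\},$$

so that $\ell_{n}(\alpha;A,\mathbf{X})=\sum_{i=1}^{n}\{-a_{i}x_{i}^{T}\alpha+\phi(x_{i}^{T}\alpha)\}$ and $\phi'(x^{T}\alpha)$ is the propensity score.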
Theorem 1.

Assume the following regularity conditions:

  1. (C.1) The Fisher information matrix $\mathbf{F}(\alpha^{*})$ defined in Equation 1 is finite and positive definite.

  2. (C.2) For each $\alpha^{*}\in\mathbf{\Omega}$, where $\mathbf{\Omega}$ is an open parameter space for $\alpha$, there exist functions $M_{1}(x)$ and $M_{2}(x)$ such that, for $\alpha$ in a neighborhood of $\alpha^{*}$, $|\phi''(x,\alpha)|\leq M_{1}(x)$ and $|\phi'''(x,\alpha)|\leq M_{2}(x)$, with $\int M_{1}(x)\,dx<\infty$ and $E(M_{2}(x)\mid x_{j},x_{k},x_{l})<\infty$ for all $1\leq j,k,l\leq p_{0}$.

  3. (C.3) $\lambda_{1}/\sqrt{n}\rightarrow 0$ and $\lambda_{1}n^{\gamma/2-1}\rightarrow\infty$, for $\gamma>1$.

  4. (C.4) $\lambda_{2}/\sqrt{n}\rightarrow 0$. (A concrete choice of $(\lambda_{1},\lambda_{2},\gamma)$ satisfying (C.3) and (C.4) is verified after the theorem statement.)

Then, under conditions (C.1)–(C.4), the generalized outcome-adaptive lasso estimator $\hat{\alpha}(\mathrm{GOAL})$ satisfies the following:

  1. Consistency in variable selection: $\lim_{n}P\{\hat{\alpha}_{j}(\mathrm{GOAL})=0\mid j\in\mathcal{I}\cup\mathcal{S}\}=1$;

  2. Asymptotic normality: $\sqrt{n}\{\hat{\alpha}_{\mathcal{A}}(\mathrm{GOAL})-\alpha^{*}_{\mathcal{A}}\}\rightarrow_{d}N(0,\mathbf{F}_{11}^{-1})$.
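As a concrete illustration (ours, not part of the theorem), with $\gamma=3$, the value used in Section 3, the choices $\lambda_{1}=n^{1/4}$ and $\lambda_{2}=\log n$ satisfy (C.3) and (C.4):

$$\frac{\lambda_{1}}{\sqrt{n}}=n^{-1/4}\rightarrow 0,\qquad \lambda_{1}n^{\gamma/2-1}=n^{1/4}\,n^{1/2}=n^{3/4}\rightarrow\infty,\qquad \frac{\lambda_{2}}{\sqrt{n}}=\frac{\log n}{\sqrt{n}}\rightarrow 0.$$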

Proof of Theorem 1. The ideas of the proof are taken from Zou (2006), Khalili and Chen (2007), Slawski et al. (2010), and Shortreed and Ertefaie (2017).

First, we prove asymptotic normality. Let $\alpha=\alpha^{*}+\frac{b}{\sqrt{n}}$ and define $\mathcal{G}_{n}(b)$ by

$$\begin{aligned}
\mathcal{G}_{n}(b)&=\ell_{n}\Big(\alpha^{*}+\frac{b}{\sqrt{n}};A,\mathbf{X}\Big)+\lambda_{1}\sum_{j=1}^{p_{0}}\hat{w}_{j}\Big|\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big|+\lambda_{2}\sum_{j=1}^{p_{0}}\Big(\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big)^{2}\\
&=\sum_{i=1}^{n}\Big\{-a_{i}x_{i}^{T}\Big(\alpha^{*}+\frac{b}{\sqrt{n}}\Big)+\phi\Big(x_{i}^{T}\Big(\alpha^{*}+\frac{b}{\sqrt{n}}\Big)\Big)\Big\}+\lambda_{1}\sum_{j=1}^{p_{0}}\hat{w}_{j}\Big|\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big|+\lambda_{2}\sum_{j=1}^{p_{0}}\Big(\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big)^{2}.
\end{aligned}$$

For $b=0$, we have $\mathcal{G}_{n}(0)=\sum_{i=1}^{n}\{-a_{i}(x_{i}^{T}\alpha^{*})+\phi(x_{i}^{T}\alpha^{*})\}+\lambda_{1}\sum_{j=1}^{p_{0}}\hat{w}_{j}|\alpha_{j}^{*}|+\lambda_{2}\sum_{j=1}^{p_{0}}\alpha_{j}^{*2}$.

Define $\widehat{\mathcal{W}}(b)=\mathcal{G}_{n}(b)-\mathcal{G}_{n}(0)$. Thus

$$\begin{aligned}
\widehat{\mathcal{W}}(b)&=\sum_{i=1}^{n}\Big\{-a_{i}x_{i}^{T}\Big(\alpha^{*}+\frac{b}{\sqrt{n}}\Big)+\phi\Big(x_{i}^{T}\Big(\alpha^{*}+\frac{b}{\sqrt{n}}\Big)\Big)-\Big(-a_{i}(x_{i}^{T}\alpha^{*})+\phi(x_{i}^{T}\alpha^{*})\Big)\Big\}\\
&\quad+\lambda_{1}\sum_{j=1}^{p_{0}}\hat{w}_{j}\Big(\Big|\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big|-|\alpha_{j}^{*}|\Big)+\lambda_{2}\sum_{j=1}^{p_{0}}\Big(\Big(\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big)^{2}-\alpha_{j}^{*2}\Big)\\
&=\sum_{i=1}^{n}\Big\{-a_{i}x_{i}^{T}\frac{b}{\sqrt{n}}+\phi\Big(x_{i}^{T}\Big(\alpha^{*}+\frac{b}{\sqrt{n}}\Big)\Big)-\phi(x_{i}^{T}\alpha^{*})\Big\}\\
&\quad+\lambda_{1}\sum_{j=1}^{p_{0}}\hat{w}_{j}\Big(\Big|\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big|-|\alpha_{j}^{*}|\Big)+\lambda_{2}\sum_{j=1}^{p_{0}}\Big(\Big(\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big)^{2}-\alpha_{j}^{*2}\Big)\\
&=\mathcal{L}_{n}(b)+\lambda_{1}\sum_{j=1}^{p_{0}}\hat{w}_{j}\Big(\Big|\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big|-|\alpha_{j}^{*}|\Big)+\lambda_{2}\sum_{j=1}^{p_{0}}\Big(\alpha_{j}^{*}\frac{2b_{j}}{\sqrt{n}}+\frac{b_{j}^{2}}{n}\Big),
\end{aligned}$$

where $\mathcal{L}_{n}(b)=\sum_{i=1}^{n}\big\{-a_{i}x_{i}^{T}\frac{b}{\sqrt{n}}+\phi\big(x_{i}^{T}\big(\alpha^{*}+\frac{b}{\sqrt{n}}\big)\big)-\phi(x_{i}^{T}\alpha^{*})\big\}$.

By a second-order Taylor expansion of $\mathcal{L}_{n}(b)$ around $b=0$, we have

$$\begin{aligned}
\mathcal{L}_{n}(b)&=\sum_{i=1}^{n}\Big(-a_{i}x_{i}^{T}\frac{b}{\sqrt{n}}\Big)+\sum_{i=1}^{n}\phi'(x_{i}^{T}\alpha^{*})\frac{x_{i}^{T}b}{\sqrt{n}}+\frac{1}{2}\sum_{i=1}^{n}\phi''(x_{i}^{T}\alpha^{*})\,b^{T}\Big(\frac{x_{i}x_{i}^{T}}{n}\Big)b+n^{-\frac{3}{2}}\sum_{i=1}^{n}\frac{1}{6}\phi'''(x_{i}^{T}\alpha^{*})(x_{i}^{T}b)^{3}\\
&=-\sum_{i=1}^{n}\big(a_{i}-\phi'(x_{i}^{T}\alpha^{*})\big)\frac{x_{i}^{T}b}{\sqrt{n}}+\frac{1}{2}\sum_{i=1}^{n}\phi''(x_{i}^{T}\alpha^{*})\,b^{T}\Big(\frac{x_{i}x_{i}^{T}}{n}\Big)b+n^{-\frac{3}{2}}\sum_{i=1}^{n}\frac{1}{6}\phi'''(x_{i}^{T}\alpha^{*})(x_{i}^{T}b)^{3}.
\end{aligned}$$

Thus, we can rewrite $\widehat{\mathcal{W}}(b)$ as $\widehat{\mathcal{W}}(b)=R_{1}^{(n)}+R_{2}^{(n)}+R_{3}^{(n)}+R_{4}^{(n)}+R_{5}^{(n)}$, with

$$R_{1}^{(n)}=-\sum_{i=1}^{n}\big(a_{i}-\phi'(x_{i}^{T}\alpha^{*})\big)\frac{x_{i}^{T}b}{\sqrt{n}},\qquad R_{2}^{(n)}=\frac{1}{2}\sum_{i=1}^{n}\phi''(x_{i}^{T}\alpha^{*})\,b^{T}\Big(\frac{x_{i}x_{i}^{T}}{n}\Big)b,\qquad R_{3}^{(n)}=n^{-\frac{3}{2}}\sum_{i=1}^{n}\frac{1}{6}\phi'''(x_{i}^{T}\alpha^{*})(x_{i}^{T}b)^{3},$$

$$R_{4}^{(n)}=\lambda_{1}\sum_{j=1}^{p_{0}}\hat{w}_{j}\Big(\Big|\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big|-|\alpha_{j}^{*}|\Big),\qquad R_{5}^{(n)}=\lambda_{2}\sum_{j=1}^{p_{0}}\Big(\alpha_{j}^{*}\frac{2b_{j}}{\sqrt{n}}+\frac{b_{j}^{2}}{n}\Big).$$

By applying the central limit theorem and the law of large numbers, we have:

$$R_{1}^{(n)}=-\sum_{i=1}^{n}\big(a_{i}-\phi'(x_{i}^{T}\alpha^{*})\big)\frac{x_{i}^{T}b}{\sqrt{n}}\rightarrow_{d}b^{T}Z,\qquad Z\sim N(0,\mathbf{F}(\alpha^{*})).$$
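To justify the covariance in this limit (a step we spell out here): conditionally on $x_{i}$, $a_{i}$ is Bernoulli with mean $\phi'(x_{i}^{T}\alpha^{*})$ under the assumed PS model, so

$$E\big[\{a_{i}-\phi'(x_{i}^{T}\alpha^{*})\}^{2}\,x_{i}x_{i}^{T}\big]=E\big[\phi''(x_{i}^{T}\alpha^{*})\,x_{i}x_{i}^{T}\big]=\mathbf{F}(\alpha^{*}),$$

which is exactly the FI matrix of Equation 1.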

For the term $R_{2}^{(n)}$, we observe that

$$\sum_{i=1}^{n}\phi''(x_{i}^{T}\alpha^{*})\Big(\frac{x_{i}x_{i}^{T}}{n}\Big)\rightarrow_{p}\mathbf{F}(\alpha^{*}),\qquad\text{thus}\qquad R_{2}^{(n)}=\frac{1}{2}\sum_{i=1}^{n}\phi''(x_{i}^{T}\alpha^{*})\,b^{T}\Big(\frac{x_{i}x_{i}^{T}}{n}\Big)b\rightarrow_{p}\frac{1}{2}b^{T}\mathbf{F}(\alpha^{*})\,b.$$

By condition (C.2), we observe that

$$6\sqrt{n}\,R_{3}^{(n)}=\frac{1}{n}\sum_{i=1}^{n}\phi'''(x_{i}^{T}\tilde{\alpha}_{*})(x_{i}^{T}b)^{3}\leq\frac{1}{n}\sum_{i=1}^{n}M_{2}(x_{i})\,|x_{i}^{T}b|^{3}\rightarrow_{p}E\big(M_{2}(x)\,|x^{T}b|^{3}\big)<\infty,\qquad(2)$$

where $\tilde{\alpha}_{*}$ lies between $\alpha^{*}$ and $\alpha^{*}+\frac{b}{\sqrt{n}}$. Equation 2 shows that $6\sqrt{n}\,R_{3}^{(n)}$ is bounded in probability, hence $R_{3}^{(n)}\rightarrow_{p}0$. The behavior of $R_{4}^{(n)}$ and $R_{5}^{(n)}$ depends on the covariate type. If a covariate $X_{j}$ is a confounder ($j\in\mathcal{C}$) or a pure predictor of the outcome ($j\in\mathcal{P}$), that is $j\in\mathcal{A}=\mathcal{C}\cup\mathcal{P}$, then $\alpha_{j}^{*}\neq 0$. If $\alpha_{j}^{*}\neq 0$, then we have

$$\frac{\lambda_{1}}{\sqrt{n}}\,\hat{w}_{j}\,\sqrt{n}\Big(\Big|\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big|-|\alpha_{j}^{*}|\Big)=\Big(\frac{\lambda_{1}}{\sqrt{n}}\Big)\big(\hat{w}_{j}\big)\Big(\sqrt{n}\Big(\Big|\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\Big|-|\alpha_{j}^{*}|\Big)\Big),$$

with $\frac{\lambda_{1}}{\sqrt{n}}\rightarrow 0$, $\hat{w}_{j}\rightarrow_{p}|\beta_{j}^{*}|^{-\gamma}$ and $\sqrt{n}\big(\big|\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\big|-|\alpha_{j}^{*}|\big)\rightarrow b_{j}\,{\rm sign}(\alpha_{j}^{*})$.

By Slutsky's theorem, we have $\frac{\lambda_{1}}{\sqrt{n}}\hat{w}_{j}\sqrt{n}\big(\big|\alpha_{j}^{*}+\frac{b_{j}}{\sqrt{n}}\big|-|\alpha_{j}^{*}|\big)\rightarrow_{p}0$.

For the behavior of $R_{5}^{(n)}$, we have $\big(\frac{\lambda_{2}}{\sqrt{n}}\big)\big(\sqrt{n}\big(\alpha_{j}^{*}\frac{2b_{j}}{\sqrt{n}}+\frac{b_{j}^{2}}{n}\big)\big)\rightarrow_{p}0$, since $\frac{\lambda_{2}}{\sqrt{n}}\rightarrow 0$ by assumption, $\sqrt{n}\big(\alpha_{j}^{*}\frac{2b_{j}}{\sqrt{n}}+\frac{b_{j}^{2}}{n}\big)\rightarrow 2\alpha_{j}^{*}b_{j}$, and Slutsky's theorem applies again.
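Collecting these limits, a step we make explicit here, and restricting attention to the components indexed by $\mathcal{A}$, the remainder term $R_{3}^{(n)}$ and the penalty terms $R_{4}^{(n)}$ and $R_{5}^{(n)}$ vanish asymptotically and, for every fixed $b_{\mathcal{A}}$,

$$\widehat{\mathcal{W}}(b_{\mathcal{A}})\rightarrow_{d}\mathcal{W}(b_{\mathcal{A}})=b_{\mathcal{A}}^{T}Z_{\mathcal{A}}+\tfrac{1}{2}\,b_{\mathcal{A}}^{T}\mathbf{F}_{11}\,b_{\mathcal{A}},\qquad Z_{\mathcal{A}}\sim N(0,\mathbf{F}_{11}),$$

a convex quadratic whose unique minimizer is $-\mathbf{F}_{11}^{-1}Z_{\mathcal{A}}$, which has the same distribution as $\mathbf{F}_{11}^{-1}Z_{\mathcal{A}}\sim N(0,\mathbf{F}_{11}^{-1})$.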

Using the convexity of $\widehat{\mathcal{W}}(b)$ and following the epi-convergence results of Geyer (1994), we have

$$\arg\min\widehat{\mathcal{W}}(b)\rightarrow_{d}\arg\min\mathcal{W}(b),\qquad\text{where}\qquad\arg\min\widehat{\mathcal{W}}(b)=\sqrt{n}(\hat{\alpha}-\alpha^{*}).$$

Thus, applying Slutsky's theorem again, we have

$$\arg\min\mathcal{W}(b_{\mathcal{A}})=\mathbf{F}_{11}^{-1}Z_{\mathcal{A}},\qquad Z_{\mathcal{A}}\sim N(0,\mathbf{F}_{11}).$$

This proves the asymptotic normality part.

Now we show the consistency in variable selection part, i.e. $\lim_{n}P\{\hat{\alpha}_{\mathcal{A}^{c}}=0\}=1$. Let $\alpha=(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})$ and define the penalized negative log-likelihood function

$$\tilde{\ell}_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})=\ell_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})+\lambda_{1}\sum_{j=1}^{p}\hat{w}_{j}|\alpha_{j}|+\lambda_{2}\sum_{j=1}^{p}\alpha_{j}^{2}=\ell_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})+\lambda_{1}\sum_{j\in\mathcal{A}\cup\mathcal{A}^{c}}\hat{w}_{j}|\alpha_{j}|+\lambda_{2}\sum_{j\in\mathcal{A}\cup\mathcal{A}^{c}}\alpha_{j}^{2}.$$

Thus, to prove sparsity it suffices to show that $\tilde{\ell}_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\tilde{\ell}_{n}(\alpha_{\mathcal{A}},0)>0$ with probability tending to 1 as $n\rightarrow\infty$. We observe that

$$\begin{aligned}
\tilde{\ell}_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\tilde{\ell}_{n}(\alpha_{\mathcal{A}},0)
&=\big[\ell_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\ell_{n}(\alpha_{\mathcal{A}},0)\big]+\lambda_{1}\Big[\sum_{j\in\mathcal{A}\cup\mathcal{A}^{c}}\hat{w}_{j}|\alpha_{j}|-\sum_{j\in\mathcal{A}}\hat{w}_{j}|\alpha_{j}|\Big]+\lambda_{2}\Big[\sum_{j\in\mathcal{A}\cup\mathcal{A}^{c}}\alpha_{j}^{2}-\sum_{j\in\mathcal{A}}\alpha_{j}^{2}\Big]\\
&=\big[\ell_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\ell_{n}(\alpha_{\mathcal{A}},0)\big]+\lambda_{1}\sum_{j\in\mathcal{A}^{c}}\hat{w}_{j}|\alpha_{j}|+\lambda_{2}\sum_{j\in\mathcal{A}^{c}}\alpha_{j}^{2}.
\end{aligned}$$

By the mean value theorem, we have $\ell_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\ell_{n}(\alpha_{\mathcal{A}},0)=\big[\frac{\partial\ell_{n}(\alpha_{\mathcal{A}},\xi)}{\partial\alpha_{\mathcal{A}^{c}}}\big]^{T}\alpha_{\mathcal{A}^{c}}$, for some $\|\xi\|\leq\|\alpha_{\mathcal{A}^{c}}\|$.

By the mean value theorem again, we have

$$\begin{aligned}
\Big\|\frac{\partial\ell_{n}(\alpha_{\mathcal{A}},\xi)}{\partial\alpha_{\mathcal{A}^{c}}}-\frac{\partial\ell_{n}(\alpha^{*}_{\mathcal{A}},0)}{\partial\alpha_{\mathcal{A}^{c}}}\Big\|
&=\Big\|\Big[\frac{\partial\ell_{n}(\alpha_{\mathcal{A}},\xi)}{\partial\alpha_{\mathcal{A}^{c}}}-\frac{\partial\ell_{n}(\alpha_{\mathcal{A}},0)}{\partial\alpha_{\mathcal{A}^{c}}}\Big]+\Big[\frac{\partial\ell_{n}(\alpha_{\mathcal{A}},0)}{\partial\alpha_{\mathcal{A}^{c}}}-\frac{\partial\ell_{n}(\alpha^{*}_{\mathcal{A}},0)}{\partial\alpha_{\mathcal{A}^{c}}}\Big]\Big\|\\
&\leq\Big\|\frac{\partial\ell_{n}(\alpha_{\mathcal{A}},\xi)}{\partial\alpha_{\mathcal{A}^{c}}}-\frac{\partial\ell_{n}(\alpha_{\mathcal{A}},0)}{\partial\alpha_{\mathcal{A}^{c}}}\Big\|+\Big\|\frac{\partial\ell_{n}(\alpha_{\mathcal{A}},0)}{\partial\alpha_{\mathcal{A}^{c}}}-\frac{\partial\ell_{n}(\alpha^{*}_{\mathcal{A}},0)}{\partial\alpha_{\mathcal{A}^{c}}}\Big\|\\
&\leq\Big(\sum_{i=1}^{n}M_{1}(x_{i})\Big)\|\xi\|+\Big(\sum_{i=1}^{n}M_{1}(x_{i})\Big)\|\alpha_{\mathcal{A}}-\alpha^{*}_{\mathcal{A}}\|\qquad\text{by condition (C.2)}\\
&=\big(\|\xi\|+\|\alpha_{\mathcal{A}}-\alpha^{*}_{\mathcal{A}}\|\big)\Big(\sum_{i=1}^{n}M_{1}(x_{i})\Big)=\big(\|\xi\|+\|\alpha_{\mathcal{A}}-\alpha^{*}_{\mathcal{A}}\|\big)O_{p}(n).
\end{aligned}$$

Thus, the limiting behavior of $\big(\|\xi\|+\|\alpha_{\mathcal{A}}-\alpha^{*}_{\mathcal{A}}\|\big)O_{p}(n)$ depends on whether $j\in\mathcal{S}$ or $j\in\mathcal{I}$.

If $j\in\mathcal{S}$, we have $\|\xi\|\leq\|\alpha_{\mathcal{S}}\|=O_{p}(n^{-1/2})$. Thus,

$$\Big\|\frac{\partial\ell_{n}(\alpha_{\mathcal{A}},\xi)}{\partial\alpha_{\mathcal{A}^{c}}}-\frac{\partial\ell_{n}(\alpha^{*}_{\mathcal{A}},0)}{\partial\alpha_{\mathcal{A}^{c}}}\Big\|\leq\big(\|\xi\|+\|\alpha_{\mathcal{A}}-\alpha^{*}_{\mathcal{A}}\|\big)O_{p}(n)=O_{p}(n^{1/2}).$$

If $j\in\mathcal{I}$, we have $\|\xi\|\leq\|\alpha_{\mathcal{I}}\|=O_{p}(1)$. Thus,

$$\Big\|\frac{\partial\ell_{n}(\alpha_{\mathcal{A}},\xi)}{\partial\alpha_{\mathcal{A}^{c}}}-\frac{\partial\ell_{n}(\alpha^{*}_{\mathcal{A}},0)}{\partial\alpha_{\mathcal{A}^{c}}}\Big\|\leq\big(\|\xi\|+\|\alpha_{\mathcal{A}}-\alpha^{*}_{\mathcal{A}}\|\big)O_{p}(n)=O_{p}(n).$$

Then, for $j\in\mathcal{A}^{c}=\mathcal{I}\cup\mathcal{S}$, we have $\Big\|\frac{\partial\ell_{n}(\alpha_{\mathcal{A}},\xi)}{\partial\alpha_{\mathcal{A}^{c}}}-\frac{\partial\ell_{n}(\alpha^{*}_{\mathcal{A}},0)}{\partial\alpha_{\mathcal{A}^{c}}}\Big\|\leq O_{p}(n)$.

Hence, we have $\ell_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\ell_{n}(\alpha_{\mathcal{A}},0)=O_{p}(n)\sum_{j\in\mathcal{A}^{c}}(-|\alpha_{j}|)$. Thus,

$$\tilde{\ell}_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\tilde{\ell}_{n}(\alpha_{\mathcal{A}},0)=\sum_{j\in\mathcal{A}^{c}}\big[-|\alpha_{j}|O_{p}(n)+\lambda_{1}\hat{w}_{j}|\alpha_{j}|+\lambda_{2}\alpha_{j}^{2}\big]=\sum_{j\in\mathcal{A}^{c}}\big[-|\alpha_{j}|O_{p}(n)+\lambda_{1}\big(O_{p}(1)n^{\gamma/2}\big)|\alpha_{j}|+\lambda_{2}\alpha_{j}^{2}\big],\qquad(3)$$

where the second equality uses that, for $j\in\mathcal{A}^{c}$, $\beta_{j}^{*}=0$, so that $\hat{\beta}_{j}^{\,ols}=O_{p}(n^{-1/2})$ and hence $\hat{w}_{j}=|\hat{\beta}_{j}^{\,ols}|^{-\gamma}=O_{p}(1)\,n^{\gamma/2}$.

In a shrinking neighborhood of $0$, $|\alpha_{j}|O_{p}(n)\leq\lambda_{1}\big(O_{p}(1)n^{\gamma/2}\big)|\alpha_{j}|+\lambda_{2}\alpha_{j}^{2}$ with probability tending to one; indeed, $\lambda_{1}n^{\gamma/2}/n=\lambda_{1}n^{\gamma/2-1}\rightarrow\infty$ by condition (C.3), so the penalty term dominates. This shows that $\tilde{\ell}_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\tilde{\ell}_{n}(\alpha_{\mathcal{A}},0)>0$ with probability tending to 1 as $n\rightarrow\infty$. Let $(\widehat{\alpha}_{\mathcal{A}},0)$ be the minimizer of the penalized negative log-likelihood function $\tilde{\ell}_{n}(\alpha_{\mathcal{A}},0)$, viewed as a function of $\alpha_{\mathcal{A}}$. It now suffices to prove $\tilde{\ell}_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\tilde{\ell}_{n}(\widehat{\alpha}_{\mathcal{A}},0)>0$. By adding and subtracting $\tilde{\ell}_{n}(\alpha_{\mathcal{A}},0)$, we have

$$\begin{aligned}
\tilde{\ell}_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\tilde{\ell}_{n}(\widehat{\alpha}_{\mathcal{A}},0)
&=\big[\tilde{\ell}_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\tilde{\ell}_{n}(\alpha_{\mathcal{A}},0)\big]+\big[\tilde{\ell}_{n}(\alpha_{\mathcal{A}},0)-\tilde{\ell}_{n}(\widehat{\alpha}_{\mathcal{A}},0)\big]\qquad(4)\\
&\geq\tilde{\ell}_{n}(\alpha_{\mathcal{A}},\alpha_{\mathcal{A}^{c}})-\tilde{\ell}_{n}(\alpha_{\mathcal{A}},0).\qquad(5)
\end{aligned}$$

By the result in Equation 3, the right-hand side of Equation 5 is positive with probability tending to 1 as $n\rightarrow\infty$. This completes the proof of the sparsity part.

3 Numerical example

In this section, we present simulations conducted to study the finite-sample performance of GOAL. Unlike the simulation studies in Baldé et al. (2023) and Shortreed and Ertefaie (2017), we follow Zou and Zhang (2009) and allow the intrinsic dimension (the size of $\mathcal{A}=\mathcal{C}\cup\mathcal{P}$) to diverge with the sample size. This makes our numerical study more challenging than those in Baldé et al. (2023) and Shortreed and Ertefaie (2017), which considered a fixed intrinsic dimension $|\mathcal{A}|$. We consider three methods: GOAL, OAL and Lasso. OAL is an oracle-like method (Shortreed and Ertefaie, 2017), while Lasso does not have the oracle property (Zou, 2006; Zou and Zhang, 2009). We now describe the setup used to generate the data $(\mathbf{X},A,Y)$, denoting the covariate matrix, treatment and outcome, respectively. The vector $X_{i}=(X_{i1},X_{i2},\ldots,X_{ip})$, for $i=1,2,\ldots,n$, is simulated from a multivariate standard Gaussian distribution with pairwise correlation $\rho$. The binary treatment $A$ is simulated from a Bernoulli distribution with ${\rm logit}\{P(A_{i}=1)\}=\sum_{j=1}^{p}\alpha_{j}X_{ij}$. Given $\mathbf{X}$ and $A$, the continuous outcome is simulated as $Y_{i}=\beta_{A}A_{i}+\sum_{j=1}^{p}\beta_{j}X_{ij}+\epsilon_{i}$, where $\epsilon_{i}\sim N(0,1)$. The true ATE is $\beta_{A}=0$. We consider two correlation levels: independent covariates ($\rho=0$) and strongly correlated covariates ($\rho=0.5$). Let $p=p_{n}=\lfloor 4n^{1/2}-5\rfloor$ for $n=100,200,400$. The true PS coefficients are $\alpha^{*}=(0.6\cdot 1_{q},\,0.6\cdot 1_{q},\,0\cdot 1_{q},\,0_{p-3q})$ and $|\mathcal{A}|=2q$ with $q=\lfloor p_{n}/9\rfloor$, where $1_{m}$ and $0_{m}$ denote $m$-vectors of 1's and 0's. We follow Zou and Zhang (2009) and set $\gamma=3$ for computing the adaptive weights in GOAL and OAL. For estimation accuracy, we use the bias, standard error (SE) and mean squared error (MSE) to compare the methods. For variable selection, we evaluate the methods based on the proportion of times each variable is selected for inclusion in the PS model, where a variable is considered selected if the absolute value of its estimated coefficient exceeds the tolerance $10^{-8}$ (Shortreed and Ertefaie, 2017).
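For concreteness, a minimal sketch of this data-generating process and of the inverse probability of treatment weighting (IPTW) estimator of the ATE is given below. The outcome coefficients `beta` and the assignment of blocks to confounders, pure predictors of treatment and pure predictors of the outcome are illustrative placeholders (assumptions, since they are not restated in this section), and in practice `ps_hat` would come from the GOAL-selected PS model rather than the true PS.

```python
import numpy as np

def simulate_data(n, rho, rng):
    """Sketch of the simulation design described above (beta values are placeholders)."""
    p = int(np.floor(4 * np.sqrt(n) - 5))
    q = p // 9
    # Covariates: multivariate normal, unit variances, pairwise correlation rho.
    Sigma = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    # True PS coefficients: 0.6 for the first 2q covariates, 0 elsewhere.
    alpha = np.zeros(p)
    alpha[:2 * q] = 0.6
    # Placeholder outcome coefficients (assumed): confounders and pure
    # predictors of the outcome have nonzero beta, the rest have beta = 0.
    beta = np.zeros(p)
    beta[:q] = 0.6            # confounders
    beta[2 * q:3 * q] = 0.6   # pure predictors of the outcome
    pi_true = 1.0 / (1.0 + np.exp(-X @ alpha))
    A = rng.binomial(1, pi_true)
    Y = 0.0 * A + X @ beta + rng.normal(size=n)   # true ATE: beta_A = 0
    return X, A, Y, pi_true

def iptw_ate(A, Y, ps_hat):
    """IPTW estimator of the ATE with normalized weights."""
    w1, w0 = A / ps_hat, (1 - A) / (1 - ps_hat)
    return np.sum(w1 * Y) / np.sum(w1) - np.sum(w0 * Y) / np.sum(w0)

rng = np.random.default_rng(0)
X, A, Y, pi_true = simulate_data(200, rho=0.5, rng=rng)
print(iptw_ate(A, Y, pi_true))   # computed with the true PS, for illustration only
# A covariate is counted as selected if |alpha_hat_j| > 1e-8 in the fitted PS model.
```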

[Figure 1: two rows of three panels, for n = 100, n = 200 and n = 400.]

Figure 1: Probability of each covariate being included in the PS model for estimating the average treatment effect (ATE), under independent covariates ($\rho=0$) in row 1 and strongly correlated covariates ($\rho=0.5$) in row 2.
Table 1: Bias, standard error (SE) and mean squared error (MSE) of the IPTW estimator of the average treatment effect (ATE) for GOAL, OAL and Lasso, based on 1 000 replications.

                                 No correlation (rho = 0)     Strong correlation (rho = 0.5)
  n     p_n   |A|   Model        Bias    SE      MSE          Bias    SE      MSE
  100   35    6     GOAL         0.13    0.32    0.12         0.32    0.68    0.57
                    OAL          0.17    0.32    0.13         1.40    0.80    2.60
                    Lasso        0.69    0.32    0.58         2.93    0.63    8.97
  200   51    10    GOAL         0.12    0.26    0.08         0.51    0.84    0.95
                    OAL          0.15    0.27    0.10         2.96    1.14    10.07
                    Lasso        0.92    0.27    0.92         5.51    0.67    30.79
  400   75    16    GOAL         0.15    0.20    0.06         1.00    1.13    2.27
                    OAL          0.19    0.22    0.08         5.56    1.68    33.74
                    Lasso        1.24    0.24    1.59         9.48    0.78    90.43

Table 1 and Figure 1 present the simulation results based on 1 000 replications for the $(n,p,|\mathcal{A}|)$ combinations $(100,35,6)$, $(200,51,10)$ and $(400,75,16)$, for both independent ($\rho=0$) and strongly correlated ($\rho=0.5$) covariates.

Table 1 presents the bias, standard error (SE) and mean squared error (MSE) of the GOAL, OAL and Lasso estimators of the ATE. In all scenarios considered, GOAL and OAL performed much better than Lasso. For independent covariates ($\rho=0$), the MSE of GOAL and OAL decreased as the sample size increased. GOAL exhibited the smallest bias and MSE for every combination $(n,p,|\mathcal{A}|,\rho)$. When covariates are strongly correlated ($\rho=0.5$), the performance of OAL deteriorates as the sample size increases. Lasso performed worst in all settings considered.

Figure 1 reports the proportion of times, over the 1 000 replications, that each covariate was selected for inclusion in the PS model when Lasso, OAL and GOAL were used to fit the PS model for estimating the ATE. OAL and GOAL selected covariates at similar rates: confounders and pure predictors of the outcome were included with high probability (close to 100%), while pure predictors of treatment and spurious covariates were included with relatively small probability, for every combination $(n,p,|\mathcal{A}|,\rho)$ considered. In contrast, Lasso selected the confounders and pure predictors of treatment with high probability and tended to exclude the pure predictors of the outcome and the spurious covariates.

4 Discussion

In this paper, we studied the statistical properties of the GOAL method. We compared GOAL, OAL and Lasso in a simulation study where both the number of parameters and the intrinsic dimension ($\mathcal{A}=\mathcal{C}\cup\mathcal{P}$) diverge with the sample size. A distinctive feature of our simulation scenarios, compared to many existing variable selection studies for causal inference, including those in Baldé et al. (2023) and Shortreed and Ertefaie (2017), is that the dimension of the active set $\mathcal{A}$ diverges with the sample size. This makes our numerical example more challenging and more appropriate for high-dimensional data analysis. GOAL and OAL outperformed Lasso in all scenarios considered. The two oracle-like methods (GOAL and OAL) performed best when the covariates are independent ($\rho=0$) and the sample size is large ($n=400$), as expected from the asymptotic theory for oracle-like methods (Zou and Zhang, 2009). However, OAL performed worse than GOAL when the correlation is strong ($\rho=0.5$). GOAL had the best performance for every combination of $(n,p,|\mathcal{A}|,\rho)$. As a result, GOAL has much better finite-sample performance than the oracle-like method OAL under collinearity.

Data availability

No data was used for the research described in the article.

Acknowledgements

This work was funded by grants from New Brunswick Innovation Foundation (NBIF).

References

Baldé, I., 2022. Algorithmes de sélection de confondants en petite et grande dimensions : contextes d'application conventionnels et pour l'analyse de la médiation. Ph.D. thesis, Université du Québec à Montréal, Montréal (Québec, Canada).

Baldé, I., Yang, A. Y., Lefebvre, G., 2023. Reader reaction to "Outcome-adaptive lasso: Variable selection for causal inference" by Shortreed and Ertefaie (2017). Biometrics 79(1), 514–520.

Geyer, C. J., 1994. On the asymptotics of constrained M-estimation. The Annals of Statistics 22(4), 1993–2010.

Khalili, A., Chen, J., 2007. Variable selection in finite mixture of regression models. Journal of the American Statistical Association 102, 1025–1038.

Shortreed, S. M., Ertefaie, A., 2017. Outcome-adaptive lasso: Variable selection for causal inference. Biometrics 73(4), 1111–1122.

Slawski, M., zu Castell, W., Tutz, G., 2010. Feature selection guided by structural information. The Annals of Applied Statistics 4(2), 1056–1080.

Zou, H., 2006. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101, 1418–1429.

Zou, H., Zhang, H. H., 2009. On the adaptive elastic-net with a diverging number of parameters. The Annals of Statistics 37, 1733–1751.