
Private Estimation and Inference in High-Dimensional Regression with FDR Control

Zhanrui Cai  
Faculty of Business and Economics, University of Hong Kong
and
Sai Li
Institute of Statistics and Big Data, Renmin University of China
and
Xintao Xia
Department of Statistics, Iowa State University
and
Linjun Zhang
Department of Statistics, Rutgers University
The authors are listed alphabetically.
Abstract

This paper presents novel methodologies for conducting practical differentially private (DP) estimation and inference in high-dimensional linear regression. We start by proposing a differentially private Bayesian Information Criterion (BIC) for selecting the unknown sparsity parameter in DP-Lasso, eliminating the need for prior knowledge of model sparsity, a requisite in the existing literature. We then propose a differentially private debiased Lasso algorithm that enables accurate, privacy-preserving inference on individual regression parameters by leveraging the inherent sparsity of high-dimensional linear models. Additionally, we address the issue of multiple testing in high-dimensional linear regression by introducing a differentially private multiple testing procedure that controls the false discovery rate (FDR), allowing accurate and privacy-preserving identification of significant predictors in the regression model. Through extensive simulations and real data analysis, we demonstrate the efficacy of our proposed methods in conducting inference for high-dimensional linear models while safeguarding privacy and controlling the FDR.


Keywords: differential privacy, high dimension, linear regression, debiased Lasso, false discovery rate control.

1 Introduction

In the era of big data, the significance of data privacy has grown considerably. With the continuous collection, storage, processing, and sharing of vast amounts of personal data (Lane et al., 2014), there is a pressing need to protect sensitive information. Traditional data analytics and statistical inference tools, unfortunately, may fall short in ensuring such protection. However, the concept of differential privacy, initially proposed by theoretical computer scientists (Dwork et al., 2006), has made substantial progress and found widespread use in various large-scale applications. In essence, differentially private algorithms incorporate random noise that is independent of the original database and produce privatized summary statistics or model parameters. The ultimate goal of differentially private analysis is to safeguard individual data while still allowing meaningful statistical analysis of the original database.

In this paper, we develop a novel framework that enables differentially private statistical inference in high-dimensional linear regression. Let $Y\in\mathbb{R}$ be the response and $X\in\mathbb{R}^{p}$ be the covariates. Assume the linear model

$$Y=\bm{\beta}^{\top}X+e,$$

where $e$ denotes the noise term and follows a sub-Gaussian distribution. Assume that $\{\bm{x}_{i},y_{i}\}_{i=1}^{n}$ are independent samples. We are interested in the high-dimensional setting where $p$ may grow exponentially with $n$ and only a small subset of $\bm{\beta}$ has non-zero values. We aim to develop practical differentially private estimation and inference methods for the parameter $\bm{\beta}$, as well as a false discovery rate control procedure when multiple testing is of concern.

Numerous approaches have been developed to address the challenge of statistical inference in high-dimensional linear regression, under the typical non-private setting. Wasserman and Roeder (2009) applied sample-splitting for high-dimensional inference, where the estimation and inference are conducted on different samples. Later on, the debiased lasso (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; Van de Geer et al., 2014) emerged as a technique capable of mitigating the potential bias arising from the Lasso estimator, thereby providing asymptotically optimal confidence intervals for regression coefficients. Additionally, Ning and Liu (2017) proposed a method for constructing confidence intervals for high-dimensional penalized M-estimators based on decorrelated score statistics, while Shi et al. (2021) developed a recursive online-score estimation approach. More recently, Wang et al. (2022) introduced the repro framework, which enables the construction of valid confidence intervals for finite samples with high-dimensional covariates.

Besides conducting inference on one parameter, another desired feature in high-dimensional linear regression is to control the false discovery rate (FDR) of the selected variables. This objective has led to the development of various FDR control methods in the literature. One such method, proposed by Liu (2013), focuses on the Gaussian graphical model and utilizes the asymptotic normality of the test statistic. Another influential approach is the knockoff framework introduced by Barber and Candès (2015), which exploits the symmetry of the statistics under the null hypothesis. Since its inception, the knockoff method has undergone further development and generalization, finding applications in diverse settings (Candes et al., 2018; Barber and Candès, 2019; Romano et al., 2020). Recently, Dai et al. (2022) proposed a method that combines the idea of symmetric mirror statistics and data splitting to control the FDR asymptotically; the power of this method can be further enhanced by a multiple-split approach. Dai et al. (2023) extended this idea to generalized linear models, expanding its applicability.

Addressing privacy concerns in various statistical inference scenarios has gained significant attention in recent literature. Avella-Medina et al. (2023a) proposed using first-order and second-order optimization algorithms to develop private M-estimators, analyzed the asymptotic normality of these private estimators, and evaluated their privacy error rate. The extension of these methodologies to the high-dimensional setting is discussed in Avella-Medina et al. (2023b), though that work is not yet published. Moreover, the multiple testing problem has also been a subject of recent investigation: Dwork et al. (2021) proposed the DP-BH method, while Xia and Cai (2023) introduced the DP-Adapt method, which controls the false discovery rate (FDR) with finite-sample guarantees.

Our paper contributes to the differentially private analysis of high-dimensional linear regression in several key aspects.

  1. We propose a differentially private Bayesian Information Criterion (BIC) to select the unknown sparsity parameter in the DP-Lasso of Cai et al. (2021), eliminating the need for prior knowledge of the model sparsity. This advancement enhances the reliability of the DP-Lasso framework and can be used in many downstream tasks.

  2. We develop a differentially private debiased Lasso procedure that produces asymptotically normal estimators under privacy constraints, yielding differentially private confidence intervals for a single parameter of interest.

  3. We design a differentially private procedure for controlling the FDR in multiple testing scenarios. Our approach allows FDR control at any user-specified rate $\alpha$ and exhibits asymptotic power approaching 1 under proper conditions.

The structure of this paper is as follows. In Section 2, we formulate the problem of interest and review the necessary background on differential privacy. In Section 3, we propose a new DP-Lasso estimator that does not require knowledge of the sparsity. In Section 4, we introduce the differentially private debiased Lasso algorithm and establish its asymptotic distribution and privacy guarantee. Section 5 presents a novel differentially private false discovery rate (FDR) control method that leverages sample splitting and mirror statistics; we also explain why existing FDR techniques are inapplicable in the context of differential privacy. In Section 6, we present extensive simulations and real data analysis to validate the performance of our proposed methods. Finally, Section 7 concludes the article with a discussion of our findings and potential future research directions. All proofs are provided in the appendix.

Notation: Throughout the paper, we use the following notation. For any $p$-dimensional vector $\bm{x}=(x_{1},\dots,x_{p})^{\top}$, we define the $l_{q}$-norm of $\bm{x}$ as $\|\bm{x}\|_{q}:=(\sum_{i=1}^{p}|x_{i}|^{q})^{1/q}$ for $q\geq 1$, with $|\cdot|$ denoting the absolute value. The support $\text{supp}(\bm{x})=\{i:|x_{i}|>0\}$ is the index set of nonzero elements of $\bm{x}$. We define the $l_{0}$-norm of $\bm{x}$ by $\|\bm{x}\|_{0}=\#\text{supp}(\bm{x})$, the number of nonzero coordinates of $\bm{x}$. For a positive integer $n$, we use $[n]$ to denote the set $\{1,\dots,n\}$. For a subset $\mathcal{S}\subseteq[n]$ and a vector $\bm{x}\in\mathbb{R}^{n}$, we use $\bm{x}_{\mathcal{S}}$ to denote the restriction of $\bm{x}$ to the index set $\mathcal{S}$. For a vector $\bm{x}\in\mathbb{R}^{p}$, we use $\Pi_{R}(\bm{x})$ to denote the projection of $\bm{x}$ onto the $l_{2}$-ball $\{\bm{u}\in\mathbb{R}^{p}:\|\bm{u}\|_{2}\leq R\}$. Let $X$ and $\{X_{n}\}_{n=1}^{\infty}$ be random variables. The expression $X_{n}\xrightarrow{D}X$ denotes convergence of $X_{n}$ to $X$ in distribution, and $X_{n}=O_{P}(a_{n})$ indicates that $X_{n}/a_{n}$ is stochastically bounded.
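Since the truncation operator $\Pi_{R}$ recurs throughout our algorithms, we include a minimal Python sketch of it (assuming numpy; the function name is ours):

```python
import numpy as np

def project_l2_ball(x: np.ndarray, R: float) -> np.ndarray:
    """Project x onto the l2-ball {u : ||u||_2 <= R}.

    Returns x unchanged if it already lies in the ball; otherwise
    rescales it onto the sphere of radius R. For scalars this reduces
    to truncation to the interval [-R, R].
    """
    norm = np.linalg.norm(x)
    return x if norm <= R else (R / norm) * x
```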

2 Preliminaries

Consider a dataset $D:=\{(\bm{x}_{i},y_{i})\}_{i=1}^{n}\in\mathcal{D}$ drawn from distributions satisfying

$$y_{i}=\bm{\beta}^{\top}\bm{x}_{i}+e_{i},$$

where $e_{i}\stackrel{i.i.d.}{\sim}N(0,\sigma^{2})$, $\bm{\beta}\in\mathbb{R}^{p}$ and $\|\bm{\beta}\|_{0}\leq s$. In this paper, we are interested in the high-dimensional setting where the dimension $p$ may grow exponentially with the sample size $n$, under the $(\epsilon,\delta)$-differential privacy framework. We begin by introducing the background of differential privacy, whose formal definition is as follows.

Definition 1 (Differential Privacy (Dwork et al., 2006)).

A randomized algorithm $M:\mathcal{D}\to\mathcal{R}$ is $(\epsilon,\delta)$-differentially private for $\epsilon,\delta>0$ if for every pair of neighboring data sets $D,D^{\prime}\in\mathcal{D}$ that differ by one individual datum and every measurable set $\mathcal{S}\subset\mathcal{R}$ with respect to $M(\cdot)$,

$$\mathbb{P}[M(D)\in\mathcal{S}]\leq e^{\epsilon}\,\mathbb{P}[M(D^{\prime})\in\mathcal{S}]+\delta,$$ (2.1)

where the probability measure $\mathbb{P}$ is induced by the randomness of $M$ only.

In the definition, the two data sets $D$ and $D^{\prime}$ are treated as fixed, and the probability measures the randomness in the mechanism $M(\cdot)$. The level of privacy against an adversary is controlled by the pre-specified privacy parameters $(\epsilon,\delta)$: when $\epsilon$ and $\delta$ are small, the algorithm offers strong privacy guarantees.

One of the most useful concepts in differential privacy is the sensitivity, which characterizes the magnitude of change in the algorithm when only one single individual in the dataset is replaced by another element.

Definition 2 (Sensitivity).

For a vector-valued deterministic algorithm $\mathcal{T}(\cdot):\mathcal{D}\to\mathbb{R}^{m}$, the $l_{q}$ sensitivity of $\mathcal{T}(\cdot)$ is defined as

$$\Delta_{q}(\mathcal{T}):=\sup_{D,D^{\prime}\in\mathcal{D}}\|\mathcal{T}(D)-\mathcal{T}(D^{\prime})\|_{q},$$ (2.2)

where $D$ and $D^{\prime}$ differ in only one single entry.
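For example, the mean of $n$ scalars truncated to $[-R,R]$ has $l_{1}$ sensitivity $2R/n$, since replacing one record moves the truncated sum by at most $2R$. A minimal sketch (function names are ours):

```python
import numpy as np

def clipped_mean(data: np.ndarray, R: float) -> float:
    """Mean of the data after truncation to [-R, R]."""
    return float(np.mean(np.clip(data, -R, R)))

def l1_sensitivity_of_clipped_mean(n: int, R: float) -> float:
    # Replacing one record changes the clipped sum by at most 2R,
    # hence the mean by at most 2R/n.
    return 2.0 * R / n
```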

Next, we introduce two well-known mechanisms: the Laplace mechanism, which achieves $(\epsilon,0)$-differential privacy for algorithms with bounded $l_{1}$ sensitivity, and the Gaussian mechanism, which achieves $(\epsilon,\delta)$-differential privacy for algorithms with bounded $l_{2}$ sensitivity.

Lemma 1 (Dwork and Roth (2014)).

  1. (Laplace mechanism) For a deterministic algorithm $\mathcal{T}(\cdot)$ with $l_{1}$ sensitivity $\Delta_{1}(\mathcal{T})$, the randomized algorithm $\mathcal{M}(\cdot):=\mathcal{T}(\cdot)+\bm{\xi}$ achieves $(\epsilon,0)$-differential privacy, where $\bm{\xi}=(\xi_{1},\dots,\xi_{m})^{\top}$ has i.i.d. Laplace entries with scale parameter $\Delta_{1}(\mathcal{T})/\epsilon$.

  2. (Gaussian mechanism) For a deterministic algorithm $\mathcal{T}(\cdot)$ with $l_{2}$ sensitivity $\Delta_{2}(\mathcal{T})$, the randomized algorithm $\mathcal{M}(\cdot):=\mathcal{T}(\cdot)+\bm{\xi}$ achieves $(\epsilon,\delta)$-differential privacy, where $\bm{\xi}=(\xi_{1},\dots,\xi_{m})^{\top}$ has i.i.d. Gaussian entries with mean 0 and standard deviation $\sqrt{2\log(1.25/\delta)}\,\Delta_{2}(\mathcal{T})/\epsilon$.
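A minimal sketch of the two mechanisms with the noise scales of Lemma 1 (assuming numpy; function names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_mechanism(t: np.ndarray, delta1: float, eps: float) -> np.ndarray:
    """Release t + Laplace noise with scale Delta_1(T)/eps: (eps, 0)-DP."""
    return t + rng.laplace(scale=delta1 / eps, size=t.shape)

def gaussian_mechanism(t: np.ndarray, delta2: float,
                       eps: float, delta: float) -> np.ndarray:
    """Release t + Gaussian noise with standard deviation
    sqrt(2 log(1.25/delta)) * Delta_2(T) / eps: (eps, delta)-DP."""
    sd = np.sqrt(2 * np.log(1.25 / delta)) * delta2 / eps
    return t + rng.normal(scale=sd, size=t.shape)
```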

Differentially private algorithms have other nice properties that will be useful for our analysis, as summarized in the following lemma.

Lemma 2.

Differentially private algorithms have the following properties (Dwork et al., 2006, 2010):

  1. Post-processing: Let $\mathcal{M}(\cdot)$ be an $(\epsilon,\delta)$-differentially private algorithm and $f(\cdot)$ be a deterministic function that maps $\mathcal{M}(D)$ to a real Euclidean space; then $f\{\mathcal{M}(D)\}$ is also $(\epsilon,\delta)$-differentially private.

  2. Composition: Let $\mathcal{M}_{1}(\cdot)$ be $(\epsilon_{1},\delta_{1})$-differentially private and $\mathcal{M}_{2}(\cdot)$ be $(\epsilon_{2},\delta_{2})$-differentially private; then $\mathcal{M}_{1}\circ\mathcal{M}_{2}$ is $(\epsilon_{1}+\epsilon_{2},\delta_{1}+\delta_{2})$-differentially private.

  3. Advanced composition: Let $\mathcal{M}(\cdot)$ be $(\epsilon,0)$-differentially private and $0<\delta^{\prime}<1$; then the $k$-fold adaptive composition of $\mathcal{M}(\cdot)$ is $(\epsilon^{\prime},\delta^{\prime})$-differentially private for $\epsilon^{\prime}=k\epsilon(e^{\epsilon}-1)+\epsilon\sqrt{2k\log(1/\delta^{\prime})}$.
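The advanced composition bound of part 3 can be evaluated directly; a small sketch (the function name is ours):

```python
import numpy as np

def advanced_composition(eps: float, k: int, delta_prime: float) -> float:
    """Overall epsilon' after k-fold adaptive composition of an
    (eps, 0)-DP mechanism (Lemma 2, part 3)."""
    return k * eps * (np.exp(eps) - 1) + eps * np.sqrt(2 * k * np.log(1 / delta_prime))

# e.g. 20 queries at eps = 0.1 each, with delta' = 1e-5:
# advanced_composition(0.1, 20, 1e-5) is approximately 2.36.
```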

In high-dimensional problems, the parameters of interest are often assumed to be sparse, and reporting the entire estimate would incur a substantial variance. Fortunately, exploiting the sparsity assumption allows us to selectively disclose only the significant nonzero coordinates. The “peeling” algorithm (Dwork et al., 2021) is a differentially private algorithm that addresses this problem by identifying and returning the top-$s$ coordinates in absolute value. Since its proposal, this algorithm has been widely used to protect privacy for high-dimensional data; see, for example, the applications in Cai et al. (2021) and Xia and Cai (2023).

Input : vector-valued function $\bm{\xi}=\bm{\xi}(D)\in\mathbb{R}^{d}$, sparsity $s$, privacy parameters $\epsilon,\delta$, noise scale $\lambda$.
1 Initialize $S=\emptyset$;
2 for $i$ in $1$ to $s$ do
3       Generate $\bm{\eta}_{i}\in\mathbb{R}^{d}$ with $\eta_{i1},\eta_{i2},\dots,\eta_{id}\stackrel{\text{i.i.d.}}{\sim}\text{Laplace}\left(\lambda\cdot\frac{2\sqrt{3s\log(1/\delta)}}{\epsilon}\right)$;
4       Append $j^{*}=\operatorname*{arg\,max}_{j\in[d]\setminus S}|\xi_{j}|+\eta_{ij}$ to $S$;
5 end for
6 Set $\tilde{P}_{s}(\bm{\xi})=\bm{\xi}_{S}$;
7 Generate $\tilde{\bm{\eta}}$ with $\tilde{\eta}_{1},\dots,\tilde{\eta}_{d}\stackrel{\text{i.i.d.}}{\sim}\text{Laplace}\left(\lambda\cdot\frac{2\sqrt{3s\log(1/\delta)}}{\epsilon}\right)$;
Output : $\tilde{P}_{s}(\bm{\xi})+\tilde{\bm{\eta}}_{S}$.
Algorithm 1 Noisy Hard Thresholding ($\text{NoisyHT}(\bm{\xi},s,\epsilon,\delta,\lambda)$)
Lemma 3 (Dwork et al. (2021)).

For a vector-valued function $\bm{\xi}$ with $\|\bm{\xi}(D)-\bm{\xi}(D^{\prime})\|_{\infty}\leq\lambda$ for any neighboring data sets $D$ and $D^{\prime}$, Algorithm 1 is $(\epsilon,\delta)$-differentially private.
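A minimal Python sketch of Algorithm 1 under the stated Laplace noise scale (the function name and implementation details are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_hard_threshold(xi: np.ndarray, s: int, eps: float, delta: float,
                         lam: float) -> np.ndarray:
    """Peeling / NoisyHT: privately keep the top-s coordinates of xi
    by absolute value.

    lam bounds the per-coordinate sensitivity ||xi(D) - xi(D')||_inf.
    Returns a d-vector that is zero outside the selected set S.
    """
    d = xi.shape[0]
    scale = lam * 2 * np.sqrt(3 * s * np.log(1 / delta)) / eps
    S: list[int] = []
    for _ in range(s):
        noisy = np.abs(xi) + rng.laplace(scale=scale, size=d)
        noisy[S] = -np.inf            # exclude already-selected coordinates
        S.append(int(np.argmax(noisy)))
    out = np.zeros(d)
    out[S] = xi[S] + rng.laplace(scale=scale, size=s)  # noise on released values
    return out
```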

3 Differentially Private Estimation

The estimation of regression parameters in the high-dimensional private setting has been studied in Cai et al. (2021) with optimality guarantees. However, the algorithm in Cai et al. (2021) requires prior knowledge of the sparsity parameter $s$, which is typically unknown in practice. In this section, we propose a differentially private Bayesian Information Criterion (BIC) in Algorithm 2 to accurately select the unknown sparsity parameter, eliminating the need for prior knowledge of the model sparsity. The pipeline of the proposed estimation algorithm is as follows.

Input : dataset $\{(\bm{x}_{i},y_{i})\}_{i\in[n]}$, step size $\eta^{0}$, privacy parameters $\epsilon,\delta$, noise scale $B$, number of iterations $T$, truncation level $R$, feasibility parameter $C$, initial value $\bm{\beta}^{(0)}$.
1 Randomly split the data into $T$ parts of roughly equal sizes: $[n]=\mathcal{S}_{0}\cup\dots\cup\mathcal{S}_{T-1}$ with $\mathcal{S}_{i}\cap\mathcal{S}_{j}=\emptyset$ for $i\neq j$;
2 Set $K=\lfloor\log_{2}(\sqrt{n}/\log^{2}p)\rfloor$;
3 for $k$ in $0$ to $K$ do
4       $s=2^{k}$;
5       for $t$ in $0$ to $T-1$ do
6             Compute $\bm{\beta}^{(t+0.5)}=\bm{\beta}^{(t)}-(\eta^{0}/|\mathcal{S}_{t}|)\sum_{i\in\mathcal{S}_{t}}(\Pi_{R}(\bm{x}_{i}^{\top}\bm{\beta}^{(t)})-\Pi_{R}(y_{i}))\bm{x}_{i}$, where $|\mathcal{S}_{t}|$ is the size of $\mathcal{S}_{t}$; set $\bm{\beta}^{(t+1)}=\Pi_{C}(\text{NoisyHT}(\bm{\beta}^{(t+0.5)},s,\epsilon/\{T(K+2)\},\delta/\{T(K+1)\},\eta^{0}B/n))$;
7       end for
8       Let $\hat{\bm{\beta}}(k)=\bm{\beta}^{(T)}$;
9 end for
10 Select the model based on the BIC-type loss:
$$\hat{\bm{\beta}}=\operatorname*{arg\,min}_{\hat{\bm{\beta}}(k):0\leq k\leq K}\bigg\{\sum_{i=1}^{n}\{\Pi_{R}(y_{i})-\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}}(k))\}^{2}+c_{0}\bigg\{\log(p)\log(n)\cdot 2^{k}+\frac{\log^{2}(p)\cdot 2^{2k}\log(1/\delta)\log^{7}n}{n\epsilon^{2}}\bigg\}+z_{k}\bigg\},$$
where $z_{k}\stackrel{i.i.d.}{\sim}\text{Laplace}\big(2(4R)^{2}(K+2)/\epsilon\big)$.
Output : $\hat{\bm{\beta}}$.
Algorithm 2 Adaptive Differentially Private Sparse Linear Regression

The random sample splitting in the proposed algorithm is equivalent to running stochastic gradient descent with one pass over the entire dataset. Moreover, using powers of 2 as candidate values of the sparsity parameter accomplishes two critical goals: first, it ensures that the candidate set covers the true model, since it contains a value $s$ with $s^{*}\leq s<2s^{*}$, where $s^{*}$ denotes the true sparsity; second, it limits the total number of candidate models to the order of $\log n$, which is indeed $o(n)$. The choice $K=\lfloor\log_{2}(\sqrt{n}/\log^{2}p)\rfloor$ aligns with the sparsity requirement for statistical inference, as will be discussed in Section 4.
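To illustrate the candidate-set construction and the Laplace-perturbed BIC comparison in Algorithm 2, we include a hedged sketch; here `fit_dp_lasso` is a hypothetical stand-in for the inner noisy iterative hard-thresholding loop and must itself be differentially private:

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_bic_select(X, y, eps, delta, R, fit_dp_lasso, c0=1.0):
    """Sketch of the DP-BIC model selection in Algorithm 2.

    fit_dp_lasso(X, y, s, eps, delta) returns a DP estimate at sparsity s.
    """
    n, p = X.shape
    K = int(np.floor(np.log2(np.sqrt(n) / np.log(p) ** 2)))
    best, best_score = None, np.inf
    for k in range(K + 1):
        s = 2 ** k
        beta_k = fit_dp_lasso(X, y, s, eps / (K + 2), delta / (K + 1))
        resid = np.clip(y, -R, R) - np.clip(X @ beta_k, -R, R)
        penalty = c0 * (np.log(p) * np.log(n) * 2 ** k
                        + np.log(p) ** 2 * 4 ** k * np.log(1 / delta)
                          * np.log(n) ** 7 / (n * eps ** 2))
        z_k = rng.laplace(scale=2 * (4 * R) ** 2 * (K + 2) / eps)  # privatize the BIC score
        score = np.sum(resid ** 2) + penalty + z_k
        if score < best_score:
            best, best_score = beta_k, score
    return best
```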

Condition 3.1.

The covariates $\bm{x}_{i}$ are independent sub-Gaussian with mean zero and covariance $\bm{\Sigma}$ such that $1/L\leq\Lambda_{\min}(\bm{\Sigma})\leq\Lambda_{\max}(\bm{\Sigma})\leq L$. There is a constant $c_{x}<\infty$ such that $\|\bm{x}_{i}\|_{\infty}<c_{x}$.

Condition 3.2.

The true parameter satisfies $\|\bm{\beta}\|_{2}\leq c_{0}$ for some constant $c_{0}$ and $\|\bm{\beta}\|_{0}\leq s$.

Condition 3.1 is commonly assumed in the high-dimensional literature; see, for example, Javanmard and Montanari (2014). The bound $c_{x}$ on the infinity norm of the design can be relaxed to hold with high probability, which is easily satisfied for sub-Gaussian distributions. Different from the algorithm proposed by Cai et al. (2021), we employ a data-splitting technique to establish independence between $\bm{\beta}^{(t)}$ and the sub-sample $\{\bm{x}_{i}\}_{i\in\mathcal{S}_{t}}$ used in the $t$-th iteration. Notably, our Condition 3.1 is less stringent than the design conditions in Cai et al. (2021). This relaxation comes at the cost of reducing the effective sample size by a factor of $1/T$. As demonstrated in Theorem 3.1, the number of iterations $T$ satisfies $T=O(\log(n))$, resulting in an associated error-bound inflation of $O(\log(n))$. Condition 3.2 assumes boundedness and sparsity of the true parameter $\bm{\beta}$. Lemma 3.1 provides the privacy guarantee of Algorithm 2, for which we only require Condition 3.1.

Lemma 3.1.

Assume Condition 3.1 and $B\geq 4Rc_{x}$. Then Algorithm 2 is $(\epsilon,\delta)$-differentially private.

Now we establish the error bound of the new estimation algorithm.

Theorem 3.1.

Suppose Conditions 3.1 and 3.2 hold, and let $R=\sigma\sqrt{2\log n}$, $B=2(R+c_{0}c_{x})c_{x}$ and $C>c_{0}$. There exists a constant $\rho$ such that, if $2^{K}>\rho L^{4}s$, $\eta^{0}=1/6L$, $T=\rho L^{2}\log(8c_{0}^{2}Ln)$, $s\log p\log n=o(n)$ and $s\log p\sqrt{\log(1/\delta)}\log^{3.5}n/\epsilon=o(n)$, then with probability at least $1-\exp(-c_{1}\log n)$, there exist constants $c_{2}$ and $c_{3}$ such that

$$\|\hat{\bm{\beta}}-\bm{\beta}\|_{2}^{2}\leq c_{2}\frac{s\log p\log n}{n}+c_{3}\frac{s^{2}\log^{2}p\log(1/\delta)\log^{7}n}{n^{2}\epsilon^{2}}.$$

The upper error bound matches the minimax lower bound presented in Cai et al. (2021) up to a logarithmic factor of $n$, which is considerably small. In comparison to algorithms with a known sparsity level $s$, our proposed algorithm introduces an additional factor of $\log(n)$ in both the statistical error component and the privacy cost component. This arises from our use of data splitting in the estimation process, along with the output of $K+1$ candidate estimates and the use of the BIC for selecting the “optimal” model.

The selection of candidate models presents a delicate balance. Firstly, we desire the candidate set to encompass the true model, necessitating the inclusion of a sufficient number of potential models. Secondly, privacy constraints impose limitations on the total number of outputs. Consequently, the number of candidate models must be on the order of o(n)o(n) for consistent estimation.

4 Statistical Inference with Privacy

In this section, we construct the confidence interval for a particular regression coefficient $\beta_{j}$, $j\in[p]$, under $(\epsilon,\delta)$-differential privacy. Following the debiased Lasso framework (Zhang and Zhang, 2014; Van de Geer et al., 2014; Javanmard and Montanari, 2014), we first estimate the precision matrix $\bm{\Omega}:=\bm{\Sigma}^{-1}$ with differential privacy. Note that $\bm{w}_{j}=\{\bm{\Sigma}^{-1}\}_{\cdot,j}$ satisfies the linear equation $\bm{e}_{j}=\bm{\Sigma}\bm{w}_{j}$, and hence it is the unique minimizer of the convex quadratic function

$$\frac{1}{2}\bm{w}^{\top}\bm{\Sigma}\bm{w}-\bm{w}^{\top}\bm{e}_{j}.$$

In addition, we assume that $\bm{w}_{j}$ is sparse.

Given a pre-specified sparsity level $s_{j}$, the differentially private algorithm for estimating $\bm{w}_{j}$ is presented in Algorithm 3.

Input : dataset $\{(\bm{x}_{i},y_{i})\}_{i\in[n]}$, step size $\eta^{0}$, privacy parameters $\epsilon,\delta$, noise scale $B$, number of iterations $T$, truncation level $R$, feasibility parameter $C$, sparsity $s^{*}$, initial value $\bm{w}_{j}^{(0)}$.
1 Randomly split the data into $T$ parts of roughly equal sizes: $[n]=\mathcal{S}_{0}\cup\dots\cup\mathcal{S}_{T-1}$ with $\mathcal{S}_{i}\cap\mathcal{S}_{j}=\emptyset$ for $i\neq j$;
2 for $t$ in $0$ to $T-1$ do
3       Compute $\bm{w}_{j}^{(t+0.5)}=\bm{w}_{j}^{(t)}-\eta^{0}\big(\sum_{i\in\mathcal{S}_{t}}\bm{x}_{i}\Pi_{R}(\bm{x}_{i}^{\top}\bm{w}_{j}^{(t)})/|\mathcal{S}_{t}|-\bm{e}_{j}\big)$;
4       $\bm{w}_{j}^{(t+1)}=\Pi_{C}(\text{NoisyHT}(\bm{w}_{j}^{(t+0.5)},s^{*},\epsilon/T,\delta/T,\eta^{0}B/n))$;
5 end for
Output : $\bm{w}_{j}^{(T)}$.
Algorithm 3 Differentially Private Estimation of $\bm{w}_{j}$ Given the Sparsity
Lemma 4.1.

Suppose Conditions 3.1, 3.2 and 4.1 hold, and let $B=2Rc_{x}$, $C>L$ and $R=2L\log(n)$. There exists a constant $\rho$ such that, if $s^{*}=\rho L^{4}s_{j}$, $\eta^{0}=1/6L$, $T=\rho L^{2}\log(8L^{3}n)$, $s\log p=o(n)$ and $s\log p\log(1/\delta)\log^{3}n/\epsilon=o(n)$, then with probability at least $1-\exp(-c_{1}\log n)$, there exist constants $c_{2}$ and $c_{3}$ such that

$$\|\bm{w}_{j}^{(T)}-\bm{w}_{j}\|_{2}^{2}\leq c_{2}\frac{s^{*}\log p\log n}{n}+c_{3}\frac{s^{*2}\log^{2}p\log(1/\delta)\log^{5}n}{n^{2}\epsilon^{2}}.$$

Theoretical properties of Algorithm 3 are outlined in Lemma 4.1. As in Algorithm 2, we again employ the data-splitting technique to relax design-related conditions. Our approach is similar to the node-wise regression of Van de Geer et al. (2014); what sets it apart is the simultaneous estimation of the entire vector, which reduces the privacy cost by avoiding separate estimation of the node-wise regression parameters and the error variance.

Condition 4.1.

The precision matrix $\bm{\Omega}$ is sparse and satisfies $\|\bm{\Omega}_{j}\|_{0}\leq s_{j}$ for $j\in[p]$, where $\bm{\Omega}_{j}$ is the $j$-th column of $\bm{\Omega}$.

This condition is necessary to facilitate differentially private estimation of the precision matrix. When the sparsity level $s_{j}$ is unknown, we follow an approach similar to Algorithm 2 and use the DP-BIC to select it. The resulting procedure for estimating $\bm{w}_{j}$ is summarized in Algorithm 4. Lemma 4.2 shows that under certain regularity conditions, Algorithm 4 is $(\epsilon,\delta)$-differentially private.

Lemma 4.2.

Under Conditions 3.1 and 3.2, if $B\geq 2Rc_{x}$, then Algorithm 4 is $(\epsilon,\delta)$-differentially private.

Lemma 4.3.

Suppose Conditions 3.1, 3.2 and 4.1 hold, and let $B=2(Lc_{x})c_{x}$, $C>L$ and $R=2L\sqrt{\log n}$. There exists a constant $\rho$ such that, if $2^{K}>\rho L^{4}s_{j}$, $\eta^{0}=1/6L$, $T=\rho L^{2}\log(8L^{3}n)$, $s_{j}\log p\log n=o(n)$ and $s_{j}\log p\sqrt{\log(1/\delta)}\log^{3}n/\epsilon=o(n)$, then with probability at least $1-\exp(-c_{1}\log n)$, there exist constants $c_{2}$ and $c_{3}$ such that

$$\|\hat{\bm{w}}_{j}-\bm{w}_{j}\|_{2}^{2}\leq c_{2}\frac{s_{j}\log p\log n}{n}+c_{3}\frac{s_{j}^{2}\log^{2}p\log(1/\delta)\log^{6}n}{n^{2}\epsilon^{2}}.$$

The theoretical property of $\hat{\bm{w}}_{j}$ from Algorithm 4 is shown in Lemma 4.3. Compared to the case where the sparsity level is known, the proposed algorithm introduces an additional factor of $\log(n)$ in the privacy cost component of the error bound due to the DP-BIC selection. Using the “tracing attack” technique developed in Cai et al. (2021), one can show that the error bound in Lemma 4.3 matches the minimax lower bound up to a logarithmic factor of $n$.

Input : dataset $\{(\bm{x}_{i},y_{i})\}_{i\in[n]}$, step size $\eta^{0}$, privacy parameters $\epsilon,\delta$, noise scale $B$, number of iterations $T$, truncation level $R$, feasibility parameter $C$, initial value $\bm{w}_{j}^{(0)}$.
1 Set $K=\lfloor\log_{2}(\sqrt{n}/\log^{2}p)\rfloor$;
2 Randomly split the data into $T$ parts of roughly equal sizes: $[n]=\mathcal{S}_{0}\cup\dots\cup\mathcal{S}_{T-1}$ with $\mathcal{S}_{i}\cap\mathcal{S}_{j}=\emptyset$ for $i\neq j$;
3 for $k$ in $0$ to $K$ do
4       $s=2^{k}$;
5       for $t$ in $0$ to $T-1$ do
6             Compute $\bm{w}_{j}^{(t+0.5)}=\bm{w}_{j}^{(t)}-\eta^{0}\big(\sum_{i\in\mathcal{S}_{t}}\bm{x}_{i}\Pi_{R}(\bm{x}_{i}^{\top}\bm{w}_{j}^{(t)})/|\mathcal{S}_{t}|-\bm{e}_{j}\big)$;
7             $\bm{w}_{j}^{(t+1)}=\Pi_{C}(\text{NoisyHT}(\bm{w}_{j}^{(t+0.5)},s,\epsilon/\{T(K+2)\},\delta/\{T(K+1)\},\eta^{0}B/n))$;
8       end for
9       Let $\hat{\bm{w}}_{j}(k)=\bm{w}_{j}^{(T)}$;
10 end for
11 Select the model based on the BIC-type loss:
$$\hat{\bm{w}}_{j}=\operatorname*{arg\,min}_{\hat{\bm{w}}_{j}(k):0\leq k\leq K}\bigg\{n\big(\hat{\bm{w}}_{j}(k)^{\top}\hat{\bm{\Sigma}}_{XX}\hat{\bm{w}}_{j}(k)/2-\hat{\bm{w}}_{j}(k)^{\top}\bm{e}_{j}\big)+c_{0}\bigg\{\log(p)\log(n)\cdot 2^{k}+\frac{\log^{2}(p)\cdot 2^{2k}\log(1/\delta)\log^{7}n}{n^{2}\epsilon^{2}}\bigg\}+z_{k}\bigg\},$$
where $z_{k}\stackrel{i.i.d.}{\sim}\text{Laplace}\big(R^{2}(K+2)/\epsilon\big)$;
Output : $\hat{\bm{w}}_{j}$.
Algorithm 4 Adaptive Differentially Private Estimation of $\bm{w}_{j}$

After obtaining the private estimator $\hat{\bm{w}}_{j}$, we propose the following debiased estimator to facilitate private inference:

$$\hat{\beta}_{j}^{(db)}=\hat{\beta}_{j}+\frac{\sum_{i=1}^{n}\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{w}}_{j})(\Pi_{R}(y_{i})-\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}}))}{n}+z_{j},$$ (4.1)

where $z_{j}\sim N\big(0,\{\frac{4R^{2}}{n}\}^{2}\cdot\frac{2\log(1.25/\delta)}{\epsilon^{2}}\big)$. Different from the non-private debiased estimator in Van de Geer et al. (2014), the additional random noise $z_{j}$ guarantees $(\epsilon,\delta)$-differential privacy via the Gaussian mechanism. Due to the privacy constraints, the variance analysis of $\hat{\beta}_{j}^{(db)}$ differs from that of Van de Geer et al. (2014). The following lemma provides the decomposition of the private debiased estimator $\hat{\beta}^{(db)}_{j}$.

Lemma 4.4 (Limiting distribution of the private debiased Lasso).

Assume the same conditions as in Theorem 3.1 and Lemma 4.3, and let $s_{0}=\max\{s,s_{j}\}$. Then

$$\sqrt{n}(\hat{\beta}^{(db)}_{j}-\beta_{j})=u_{j}+v_{j},$$

where $u_{j}\xrightarrow{D}N(0,\Omega_{j,j}\sigma^{2})$ with $\Omega_{j,j}$ the $(j,j)$ element of the precision matrix, and $v_{j}=O_{p}\{s_{0}\log(p)\log(n)/\sqrt{n}+s_{0}^{2}\log^{2}(p)\log(1/\delta)\log^{6.5}(n)/(n^{1.5}\epsilon^{2})\}$.

Therefore, to obtain a differentially private confidence interval, we need to estimate the variance privately. The estimate of $\Omega_{j,j}$ can be obtained directly from $\hat{w}_{j,j}$, so it suffices to estimate $\sigma^{2}$. We propose the following differentially private estimator of $\sigma^{2}$:

$$\hat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}(\Pi_{R}(y_{i})-\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}}))^{2}+Z,$$

where $Z\sim N\big(0,\{\frac{2(2R)^{2}}{n}\}^{2}\cdot\frac{2\log(1.25/\delta)}{\epsilon^{2}}\big)$.

For the reader’s convenience, we summarize the entire procedure for obtaining a differentially private confidence interval for $\beta_{j}$ in the high-dimensional setting. The algorithm consists of four steps, and we allocate an $(\epsilon/4,\delta/4)$ privacy budget to each step: (1) estimating the regression parameter $\hat{\bm{\beta}}$; (2) estimating the precision matrix column $\hat{\bm{w}}_{j}$; (3) computing the debiased estimator $\hat{\beta}_{j}^{(db)}$; and (4) estimating the standard error of the debiased estimator. Since each of the four steps is $(\epsilon/4,\delta/4)$-differentially private, Algorithm 5 itself is $(\epsilon,\delta)$-DP by composition. The privacy guarantee, together with the nominal coverage of the confidence interval, is summarized in Theorem 4.1.

Input : dataset $\{(\bm{x}_{i},y_{i})\}_{i\in[n]}$, step size $\eta^{0}$, privacy parameters $\epsilon,\delta$.
1 Compute $\hat{\bm{\beta}}$ by Algorithm 2 with privacy parameters $(\epsilon/4,\delta/4)$ and the tuning parameters defined in Theorem 3.1;
2 Compute $\hat{\bm{w}}_{j}$ by Algorithm 4 with privacy parameters $(\epsilon/4,\delta/4)$ and the tuning parameters defined in Lemma 4.3;
3 Let
$$\hat{\beta}_{j}^{(db)}=\hat{\beta}_{j}+\frac{\sum_{i=1}^{n}\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{w}}_{j})(\Pi_{R}(y_{i})-\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}}))}{n}+z_{j},$$
where $z_{j}\sim N\big(0,16\{\frac{4R^{2}}{n}\}^{2}\cdot\frac{2\log(4\times 1.25/\delta)}{\epsilon^{2}}\big)$;
4 Compute
$$I_{j}=[\hat{\beta}^{(db)}_{j}-z_{\alpha/2}\sqrt{\widehat{V}_{j}/n},\;\;\hat{\beta}^{(db)}_{j}+z_{\alpha/2}\sqrt{\widehat{V}_{j}/n}],$$ (4.2)
where $\widehat{V}_{j}=\hat{w}_{j,j}\hat{\sigma}^{2}$, $z_{\alpha}$ is the $(1-\alpha)$-th quantile of the standard normal distribution, and $\hat{\sigma}^{2}=\frac{1}{n}\sum_{i=1}^{n}(\Pi_{R}(y_{i})-\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}}))^{2}+Z$ with $Z\sim N\big(0,16\{\frac{2(2R)^{2}}{n}\}^{2}\cdot\frac{2\log(4\times 1.25/\delta)}{\epsilon^{2}}\big)$.
Output : $I_{j}$.
Algorithm 5 $(1-\alpha)\times 100\%$ differentially private confidence interval for $\beta_{j}$
Theorem 4.1 (Validity of the proposed CI).

Under Conditions 3.1 and 3.2, Algorithm 5 is $(\epsilon,\delta)$-differentially private. Moreover, under the assumptions of Lemma 4.4, if $s_{0}\log(p)\log(n)/\sqrt{n}=o(1)$ and $s_{0}^{2}\log^{2}(p)\log(1/\delta)\log^{6.5}(n)/(n^{2}\epsilon^{2})=o(n^{-1/2})$, then

$$\lim_{n\to\infty}\mathbb{P}(\beta_{j}\in I_{j})=1-\alpha.$$

Theorem 4.1 demonstrates that the proposed confidence interval attains asymptotically nominal coverage while ensuring privacy. Nevertheless, in finite samples, we suggest incorporating a minor correction to the confidence interval to enhance the finite-sample performance of our method. A similar approach was previously discussed by Avella-Medina et al. (2023a) in the context of low-dimensional noisy gradient descent and noisy Newton’s method. Note that the proposed debiased estimator (4.1) incorporates additional noise $z_{j}$ generated from a Gaussian distribution with known variance. The confidence interval with finite-sample correction accounts for the variance of $z_{j}$ as follows:

$$I_{j}=[\hat{\beta}^{(db)}_{j}-z_{\alpha/2}\sqrt{\widehat{V}_{j}/n+V_{c}},\;\;\hat{\beta}^{(db)}_{j}+z_{\alpha/2}\sqrt{\widehat{V}_{j}/n+V_{c}}],$$ (4.3)

where $V_{c}=16\{\frac{4R^{2}}{n}\}^{2}\cdot\frac{2\log(4\times 1.25/\delta)}{\epsilon^{2}}$ is the variance of $z_{j}$. It is worth noting that the additional variance term $V_{c}$ is $O(n^{-2})$, which vanishes as $n$ approaches infinity; consequently, the corrected confidence interval remains asymptotically as efficient as the debiased Lasso.
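A minimal sketch of the corrected interval (4.3), assuming the private quantities $\hat{\beta}_{j}^{(db)}$, $\hat{w}_{j,j}$ and $\hat{\sigma}^{2}$ have already been computed (the function name is ours):

```python
import numpy as np
from scipy.stats import norm

def dp_corrected_ci(beta_db: float, w_jj: float, sigma2_hat: float,
                    n: int, R: float, eps: float, delta: float,
                    alpha: float = 0.05) -> tuple[float, float]:
    """Finite-sample-corrected (1-alpha) confidence interval, eq. (4.3).

    V_c is the known variance of the Gaussian privacy noise z_j added
    to the debiased estimator; it is O(n^{-2}) and vanishes asymptotically.
    """
    V_hat = w_jj * sigma2_hat                      # plug-in variance of u_j
    V_c = 16 * (4 * R ** 2 / n) ** 2 * 2 * np.log(4 * 1.25 / delta) / eps ** 2
    half = norm.ppf(1 - alpha / 2) * np.sqrt(V_hat / n + V_c)
    return beta_db - half, beta_db + half
```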

5 FDR Control with Differential Privacy

False discovery rate (FDR) control with privacy under the high-dimensional linear model is a difficult problem. Existing approaches to differentially private FDR control (Dwork et al., 2021; Xia and Cai, 2023) require mutual independence of the $p$-values under the null hypotheses, an assumption that does not necessarily hold in linear regression. The differentially private knockoff method of Pournaderi and Xiang (2021) has two severe limitations: first, it requires the dimension $p$ to be smaller than the sample size $n$; second, the noise introduced for privacy preservation overwhelms the signal. To generate differentially private knockoff features, Pournaderi and Xiang (2021) proposed adding noise whose standard error is proportional to an upper bound on the signal, so even when $p$ remains fixed and $n$ approaches infinity, the added noise dominates the signal, severely limiting the method's effectiveness. How to construct differentially private knockoff features with a reasonably small amount of added noise remains an open challenge.

Our approach draws inspiration from recent advances in mirror statistics (Dai et al., 2022, 2023). In particular, sample splitting and post-selection techniques allow us to reduce the dimensionality of the model from high to relatively low, which in turn keeps the scale of the noise required for privacy preservation manageable. Our approach shares similarities with the peeling algorithm of Dwork et al. (2021) and the mirror-peeling algorithm of Xia and Cai (2023). Specifically, we divide the data into two parts, denoted $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$. First, we run the high-dimensional differentially private sparse linear regression algorithm (Algorithm 2) on $\mathcal{D}_{1}$; the resulting estimator is denoted $\tilde{\bm{\beta}}_{(1)}$, with support $\mathcal{A}:=\{j\in[p]:\tilde{\beta}_{(1)j}\neq 0\}$. Next, we use $\mathcal{D}_{2}$ to fit a differentially private ordinary least squares regression on the estimated support $\mathcal{A}$, and denote the resulting estimator by $\tilde{\bm{\beta}}_{(2)}$. For each $j\in\mathcal{A}$, we define the mirror statistic $M_{j}$ by

$$M_{j}=\text{sign}(\tilde{\beta}_{(1)j}\tilde{\beta}_{(2)j})f(|\tilde{\beta}_{(1)j}|,|\tilde{\beta}_{(2)j}|).$$

As in Dai et al. (2022), $f(u,v)$ can take different forms, such as

$$f(u,v)=2\min(u,v),\quad f(u,v)=uv,\quad f(u,v)=u+v.$$

The data-driven cutoff $\tau_{q}$ is defined as

$$\tau_{q}:=\min\bigg\{t>0:\frac{\#\{j\in\mathcal{A}:M_{j}<-t\}}{\#\{j\in\mathcal{A}:M_{j}>t\}\lor 1}\leq q\bigg\},$$

where $q$ is the desired FDR level. We select the subset of variables $\mathcal{A}_{\tau_{q}}=\{j\in\mathcal{A}:M_{j}>\tau_{q}\}$ as the important variables. Define $\mathcal{S}:=\{j\in[p]:\beta_{j}\neq 0\}$ as the true support set and $\bar{\mathcal{S}}=[p]\setminus\mathcal{S}$ as its complement. The false discovery proportion (FDP), FDR and power of the proposed selection procedure are defined as follows:

$$\text{FDP}(\mathcal{A}_{\tau_{q}}):=\frac{|\mathcal{A}_{\tau_{q}}\cap\bar{\mathcal{S}}|}{|\mathcal{A}_{\tau_{q}}|\lor 1},\quad\text{FDR}(\mathcal{A}_{\tau_{q}})=\mathbb{E}\{\text{FDP}(\mathcal{A}_{\tau_{q}})\},\quad\text{Power}(\mathcal{A}_{\tau_{q}}):=\frac{|\mathcal{A}_{\tau_{q}}\cap\mathcal{S}|}{|\mathcal{S}|}.$$
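A minimal sketch of the mirror-statistic selection rule with $f(u,v)=u+v$, applied to the two coefficient estimates restricted to $\mathcal{A}$ (the function name is ours):

```python
import numpy as np

def mirror_select(beta1: np.ndarray, beta2: np.ndarray, q: float) -> np.ndarray:
    """Mirror-statistic selection on the screened set A.

    beta1, beta2: coefficient estimates for the coordinates in A from the
    two data halves. Returns a boolean mask over A of selected variables.
    """
    M = np.sign(beta1 * beta2) * (np.abs(beta1) + np.abs(beta2))
    # data-driven cutoff: smallest t > 0 with
    # #{M_j < -t} / max(#{M_j > t}, 1) <= q
    for t in np.sort(np.abs(M[M != 0])):
        fdp_hat = np.sum(M < -t) / max(np.sum(M > t), 1)
        if fdp_hat <= q:
            return M > t
    return np.zeros_like(M, dtype=bool)   # no feasible cutoff: select nothing
```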

We summarize the details of the algorithm in Algorithm 6. The privacy guarantee of the proposed procedure is provided in Lemma 5.1.

Lemma 5.1.

Assume Conditions 3.1 and 3.2 hold. Then Algorithm 6 is $(\epsilon,\delta)$-differentially private provided $B_{1}\geq 4Kc_{x}^{2}/n$ and $B_{2}\geq 4R\sqrt{K}c_{x}/n$.

Under regularity conditions, the proposed method controls the FDR asymptotically at the user-specified level $q$, and its power tends to 1. We summarize these results in Theorem 5.1.

Theorem 5.1.

Under the conditions of Theorem 3.1 and $\log(1/\delta)/(\log^{6}p\sqrt{n}\epsilon^{2})=o(1)$, for any nominal FDR level $q\in(0,1)$, if the signal strength satisfies

$$\min_{j\in\mathcal{S}}|\beta_{j}|\gg\sqrt{s\log p/n}+\max\{s\log p\log^{7/2}(n),\,n^{3/4}/\log^{3}(p)\}\sqrt{\log(1/\delta)}/(n\epsilon),$$

where $\mathcal{S}:=\{j\in[p]:\beta_{j}\neq 0\}$ is the true support set, then the output of Algorithm 6 satisfies

$$\limsup_{n,p\to\infty}{\rm FDR}(\mathcal{A}_{\tau_{q}})\leq q.$$

Furthermore,

$$\liminf_{n,p\to\infty}{\rm Power}(\mathcal{A}_{\tau_{q}})=1.$$
Input : dataset $\{(\bm{x}_{i},y_{i})\}_{i\in[n]}$, privacy parameters $\epsilon,\delta$, noise scales $B_{1}$ and $B_{2}$.
1 Randomly split the data into $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$. Compute $\tilde{\bm{\beta}}_{(1)}$ by Algorithm 2 on $\mathcal{D}_{1}$ with privacy parameters $(\epsilon/2,\delta/2)$ and the tuning parameters defined in Theorem 3.1. Denote the support set of $\tilde{\bm{\beta}}_{(1)}$ by $\mathcal{A}$;
2 Estimate
$$\tilde{\bm{\beta}}_{(2)\mathcal{A}}:=\left(\sum_{i\in\mathcal{D}_{2}}\bm{x}_{i,\mathcal{A}}\bm{x}_{i,\mathcal{A}}^{\top}/n_{2}+\bm{N}_{XX}\right)^{-1}\left(\sum_{i\in\mathcal{D}_{2}}\bm{x}_{i,\mathcal{A}}y_{i}/n_{2}+\bm{N}_{XY}\right),$$
where $\bm{x}_{i,\mathcal{A}}$ is the subvector of $\bm{x}_{i}$ restricted to $\mathcal{A}$, $\bm{N}_{XX}$ is an $|\mathcal{A}|\times|\mathcal{A}|$ symmetric matrix with i.i.d. $N(0,B_{1}^{2}\cdot\frac{4\times 2\log(2\times 1.25/\delta)}{\epsilon^{2}})$ entries, and $\bm{N}_{XY}$ is an $|\mathcal{A}|\times 1$ vector with i.i.d. $N(0,B_{2}^{2}\cdot\frac{4\times 2\log(2\times 1.25/\delta)}{\epsilon^{2}})$ entries;
3 For each $j\in\mathcal{A}$, compute the mirror statistic
$$M_{j}=\text{sign}(\tilde{\beta}_{(1)j}\tilde{\beta}_{(2)j})f(|\tilde{\beta}_{(1)j}|,|\tilde{\beta}_{(2)j}|),$$
and let the data-driven cutoff $\tau_{q}$ be
$$\tau_{q}:=\min\bigg\{t>0:\frac{\#\{j\in\mathcal{A}:M_{j}<-t\}}{\#\{j\in\mathcal{A}:M_{j}>t\}\lor 1}\leq q\bigg\}.$$
Output : subset $\mathcal{A}_{\tau_{q}}=\{j\in\mathcal{A}:M_{j}>\tau_{q}\}$.
Algorithm 6 Differentially Private False Discovery Rate Control

The minimal signal strength condition guarantees the sure screening property (Fan and Lv, 2008), i.e., the set $\mathcal{A}$ contains all active coefficients. This property is very important for FDR control under a high-dimensional linear model; see Barber and Candès (2019) and Dai et al. (2022). The critical requirement for FDR control is that the linear model still holds conditional on the selected set $\mathcal{A}$. Using a data-splitting strategy, the above condition can be relaxed to require only that $\mathcal{A}$ contains all active coefficients with high probability.

Regarding differentially private FDR control, we acknowledge the latest version of mirror statistics developed in Dai et al. (2023). However, directly implementing the algorithm of Dai et al. (2023) would cause the noise required for privacy to overwhelm the signals: DP-FDR control requires a screening step to reduce the number of tests to a moderate level so that the required noise level becomes reasonable. See, for example, the peeling algorithm in Dwork et al. (2021) and the mirror-peeling algorithm in Xia and Cai (2023). It would also be interesting to develop a differentially private version of the well-known knockoff procedure of Barber and Candès (2015), which enables FDR control with finite samples. We leave these for future research.

6 Numeric Study

6.1 Simulation

We evaluate the finite-sample behavior of the private debiased procedure for inference on individual regression coefficients, and the false discovery rate of the selection procedure. For our simulation study, we consider linear models where the rows of the covariate matrix $\bm{X}$ are i.i.d. draws from $N(\bm{0},\bm{\Sigma})$. We simulate the response from the linear model $\bm{y}=\bm{X}\bm{\beta}+\bm{e}$, where $\bm{y}\in\mathbb{R}^{n}$, $\bm{X}\in\mathbb{R}^{n\times p}$ and $\bm{e}\in\mathbb{R}^{n}$.

6.1.1 Debiased Inference

We first evaluate the performance of the proposed debiased procedure under two designs: Toeplitz (AR) covariance matrices and blocked equal-correlation covariance matrices for the design matrix:

$$\begin{split}\text{Toeplitz: }&\Sigma_{j,k}=\rho^{|j-k|}\text{ for }j,k\in\{1,\dots,p\};\\ \text{Equi corr: }&\Sigma_{j,k}=\rho\text{ for }j\neq k\text{ and }\lfloor j/4\rfloor=\lfloor k/4\rfloor,\quad\Sigma_{j,k}=1\text{ for }j=k.\end{split}$$

The active set $S_{0}=\{1,2,3\}$ has cardinality $s_{0}=|S_{0}|=3$, and the nonzero regression coefficients are fixed at 1. All errors $e_{1},\dots,e_{n}$ are drawn independently from $N(0,1)$. The sample size is $n=2000$ and the number of covariates is $p=2000$. The privacy parameter for each coordinate is $\epsilon=0.5$ with $\delta=1/n^{1.1}$. For comparison, we report coverages and interval lengths for three methods: DB-Lasso, the debiased Lasso of Javanmard and Montanari (2014); DP naive, the proposed DP debiased procedure (Algorithm 5) without finite-sample correction; and DP correction, the proposed procedure with the finite-sample correction in (4.3). All results are based on 100 independent repetitions of the model with random design and fixed regression coefficients.
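A sketch of this data-generating process (variable names are ours; smaller dimensions run faster):

```python
import numpy as np

rng = np.random.default_rng(2023)

def toeplitz_design(n: int, p: int, rho: float) -> np.ndarray:
    """Rows i.i.d. N(0, Sigma) with Toeplitz covariance Sigma_{jk} = rho^{|j-k|}."""
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    return rng.multivariate_normal(np.zeros(p), Sigma, size=n, method="cholesky")

n, p, s0 = 2000, 2000, 3
X = toeplitz_design(n, p, rho=0.5)
beta = np.zeros(p)
beta[:s0] = 1.0                        # active set S0 = {1, 2, 3}
y = X @ beta + rng.standard_normal(n)  # standard normal errors
```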

Figure 1: 95% confidence intervals for one realization of DP correction under the AR covariance structure with $\rho=0.5$, for the first 100 regression parameters. The true parameters are shown in red and the debiased estimators in black.

To demonstrate the effectiveness of the proposed debiased procedure, we begin by displaying the estimated confidence intervals for the first 100 coordinates in one particular realization, as depicted in Figure 1. Notably, the estimated confidence intervals cover all of the signals, which are the first three coordinates. The overall coverage across all coordinates in the figure is approximately 95%. Further results can be found in Table 1 for the Toeplitz covariance design and in Table 2 for the equal-correlation design.

Measure    Method         ρ=0.0   ρ=0.2   ρ=0.4   ρ=0.6
Avgcov     DB-Lasso       0.960   0.960   0.959   0.961
           DP naive       0.753   0.764   0.797   0.848
           DP correction  0.951   0.950   0.950   0.950
Avglength  DB-Lasso       0.190   0.195   0.219   0.261
           DP naive       0.179   0.187   0.210   0.265
           DP correction  0.304   0.309   0.324   0.361
Table 1: Average coverage and length of confidence intervals, for nominal coverage equal to 0.95, under Toeplitz covariance matrices.
Measure    Method         ρ=0.1   ρ=0.3   ρ=0.5   ρ=0.7
Avgcov     DB-Lasso       0.960   0.960   0.959   0.962
           DP naive       0.755   0.777   0.817   0.866
           DP correction  0.950   0.950   0.950   0.951
Avglength  DB-Lasso       0.191   0.204   0.234   0.282
           DP naive       0.182   0.196   0.228   0.291
           DP correction  0.306   0.314   0.335   0.381
Table 2: Average coverage and length of confidence intervals, for nominal coverage equal to 0.95, under blocked equal covariance matrices.

The numeric performance of the proposed debiased procedure is remarkably similar between the Toeplitz covariance design (Table 1) and the equal-correlation design (Table 2). Notably, the coverage rates of DP naive fall significantly below the 95% benchmark, which empirically confirms our intuition that the additional correction is necessary in finite samples, as discussed in Section 4. In comparison, DP correction achieves significantly improved coverage over DP naive, albeit with wider confidence intervals: the corrected intervals are approximately 30% wider than those of DP naive. The interval length for DP correction is also approximately 30% greater than that of DB-Lasso, reflecting the efficiency loss stemming from privacy constraints. Overall, the proposed method attains coverage rates of approximately 95% with only a marginal reduction in efficiency.

6.1.2 FDR control

Next, we proceed to assess the performance of the proposed algorithm for controlling the false discovery rate (FDR), as outlined in Section 5. We consider Toeplitz (AR) covariance matrices $\Sigma_{j,k}=\rho^{|j-k|}$ for $j,k\in\{1,\dots,p\}$. The active set $S_{0}$ consists of $s_{0}=|S_{0}|=30$ covariates chosen at random among all available covariates. The nonzero regression coefficients $\beta_{j}$ for $j\in S_{0}$ are independently sampled from a normal distribution with mean 0 and variance $\xi$, where $\xi$ represents the signal strength. The errors in the linear model follow the standard normal distribution, i.e., $e_{1},\dots,e_{n}\sim N(0,1)$. The sample size is $n=20{,}000$ and the number of covariates is $p=20{,}000$. For privacy preservation, we choose the privacy parameters $\epsilon=0.5$ and $\delta=1/n^{1.1}$, and we set the target FDR control level at $q=0.1$.

We compare our method with the non-private FDR control algorithm of Dai et al. (2022). The empirical FDR and power are reported; all results are based on 100 independent simulations of the model with fixed design and random regression coefficients (i.e., repeating the simulation 100 times with different errors in each linear model). Figure 2 presents the empirical FDR and power across various signal levels. Both the proposed differentially private FDR control procedure and the non-private procedure effectively control the empirical FDR at the predetermined level $q=0.1$. The power of the proposed procedure exhibits a minor reduction compared to the non-private procedure due to privacy constraints. The numerical study demonstrates that, for reasonably large sample sizes, the proposed method maintains FDR control with minimal sacrifice in power compared to the non-private approach.

Figure 2: Empirical FDRs and powers of the proposed methods with increasing signal strength $\xi$, for $\rho=0.2$ and $n=20{,}000$.

Figure 3: Empirical FDRs and powers of the proposed methods with increasing sample size $n$, for $\rho=0.2$ and $\xi=0.2$.

Figure 3 presents the empirical FDR and power across increasing sample sizes. Notably, the proposed procedure may fail to control the empirical FDR when the sample size is very small. This is primarily because, with small samples, the initial differentially private high-dimensional regression step struggles to identify all active features, and a nontrivial bias may remain in the second step, the differentially private ordinary least squares estimation. As the sample size increases, the proposed method successfully controls the empirical FDR at the predetermined level $q=0.1$. Similarly, with small sample sizes the power of the proposed procedure is notably lower than that of the non-private procedure, owing to the reduced estimation accuracy caused by privacy constraints. Nevertheless, as the sample size grows, both the differentially private and non-private methods gain power and the gap between them diminishes, reflecting improved estimation accuracy in both. Overall, our numerical study demonstrates that for reasonably large sample sizes, the proposed method effectively maintains FDR control with only a marginal reduction in power compared to the non-private approach.

6.2 Real Example: Parkinson's Telemonitoring

For a real data application, we demonstrate the performance of the proposed differentially private algorithms on the Parkinson's Disease Progression data (Tsanas et al., 2009). In clinical diagnosis, assessing the progression of Parkinson's disease (PD) symptoms typically relies on the unified Parkinson's disease rating scale (UPDRS), which requires the patient's physical presence at the clinic and time-consuming physical evaluations by trained medical professionals. Monitoring symptoms is therefore costly and logistically challenging for both patients and clinical staff. The goal with this data set is to track UPDRS through noninvasive speech tests. At the same time, it is crucial to protect the privacy of each patient's data, as any unauthorized disclosure could lead to potential harm or trouble for the participants. With privacy protection ensured, individuals are more likely to contribute their personal data, facilitating advances in Parkinson's disease research.

The data collection process utilized the Intel AHTD, a telemonitoring system designed for remote, internet-enabled measurement of various motor impairment symptoms associated with Parkinson's disease (PD). The research was overseen by six U.S. medical centers: the Georgia Institute of Technology (seven subjects), the National Institutes of Health (ten subjects), Oregon Health and Science University (fourteen subjects), Rush University Medical Center (eleven subjects), Southern Illinois University (six subjects), and the University of California Los Angeles (four subjects). A total of 52 individuals diagnosed with idiopathic PD were recruited. After an initial screening to eliminate flawed recordings (such as instances of patient coughing), a total of 5,923 sustained phonations were analyzed, to each of which 16 dysphonia measures were applied. Tsanas et al. (2009) proposed a linear model for these data but did not consider privacy issues.

6.2.1 Debiased Inference

To evaluate the performance of the proposed differentially private inference procedure (Algorithm 5) in high-dimensional settings, we add 5,000 random features generated independently and identically from the standard normal distribution. The resulting dataset therefore has sample size $n=5{,}923$ and covariate dimension $p=5{,}016$, where the first 16 covariates are the real features.

We consider the following three methods: the oracle method, the proposed differentially private algorithm, and the non-private debiased Lasso (Van de Geer et al., 2014). The oracle method uses only the 16 real features, while the proposed algorithm and the debiased Lasso use all 5,016 features. The privacy parameters are set as $\epsilon=0.5$ and $\delta=1/n^{1.1}$. Figure 4 displays the confidence intervals for the 16 real features obtained by the three methods. Overall, the private confidence intervals consistently cover the estimates obtained from the oracle method and the debiased Lasso; in fact, they exhibit substantial overlap with the confidence intervals from both. However, due to the privacy cost, the proposed confidence intervals are slightly wider than those from the debiased Lasso.

Figure 4: The 95% confidence intervals of OLS, the non-private debiased Lasso, and the proposed Algorithm 5 with finite-sample correction.

6.2.2 FDR control

Next, we evaluate the performance of the private FDR control algorithm proposed in Section 5 on the Parkinson's Disease Progression data. We add 100 random features generated independently and identically from $N(0,1)$. The proposed algorithm is compared with the non-private data-splitting algorithm of Dai et al. (2022) and the knockoff of Barber and Candès (2015). The results are reported in Figure 5, with the target FDR set to 0.1 and 0.3, respectively. The proposed method exhibits a notable count of discoveries within the real features while registering only a minimal number of false discoveries among the random features, compared to both the knockoff and the non-private data-splitting methods.

Figure 5: Numbers of discoveries.
Table 3: Selected real features for target q = 0.1. The candidate features are age, sex, test_time, Jitter, Abs, PPQ5, Shimmer, dB, APQ5, APQ11, DDA, NHR1, HNR2, RPDE, DFA, and PPE; for each feature, the table marks whether it is selected by the knockoff, the non-private data-splitting method, and DP-FDR.

We further report the selected features at the target FDR level q = 0.1 in Table 3. There is substantial overlap among the three methods, with several features consistently selected. For instance, age, Jitter, Abs, HNR2, DFA, and PPE are chosen by all three methods. For Parkinson’s disease, the second most prevalent neurodegenerative disorder among the elderly, age is the single most important risk factor; numerous medical studies underscore its pivotal role, as documented by Elbaz et al. (2002). Furthermore, Jitter and Abs are commonly employed to characterize cycle-to-cycle variability in fundamental frequency, while HNR (Harmonics-to-Noise Ratio) is a fundamental feature in speech processing techniques. Detrended Fluctuation Analysis (DFA) and Pitch Period Entropy (PPE) are two recently proposed speech signal processing measures, both of which exhibit a strong correlation with PD-dysphonia, as highlighted in Little et al. (2008). In addition to these commonly selected features, the proposed method also identifies Shimmer and DDA, which are frequently used to describe cycle-to-cycle variability in amplitude; both are endorsed as relevant features in the clinical study of Tsanas et al. (2009). The features identified by our proposed method thus receive substantial clinical support, while the procedure simultaneously guarantees privacy for the individual patients.

7 Discussion

This paper presents a comprehensive framework for performing differentially private analysis in high-dimensional linear models, encompassing estimation, inference, and false discovery rate (FDR) control. This framework is particularly valuable in scenarios where individual privacy in the dataset needs to be protected and can be readily applied across various disciplines. The numerical studies conducted in this work demonstrate that privacy protection can be achieved with only a minor loss in the accuracy of confidence intervals and multiple testing.

We briefly discuss several intriguing extensions. For example, the tools developed in this paper for DP estimation, the debiased Lasso, and FDR control can all be extended to generalized linear models. It would also be interesting to explore scenarios where a partial dataset, potentially following a different distribution, is publicly available and not bound by privacy constraints. In addition, the newly developed DP-BIC could be adapted to other tasks involving the selection of tuning parameters with privacy guarantees. These topics are left for future research.

References

  • Avella-Medina et al. (2023a) Avella-Medina, M., Bradshaw, C., and Loh, P.L. (2023a). “Differentially private inference via noisy optimization.” Annals of Statistics.
  • Avella-Medina et al. (2023b) Avella-Medina, M., Liu, Z., and Loh, P.L. (2023b). “Differentially private inference for high dimensional M-estimators.” Unpublished manuscript.
  • Azriel and Schwartzman (2015) Azriel, D. and Schwartzman, A. (2015). “The empirical distribution of a large number of correlated normal variables.” Journal of the American Statistical Association, 110(511), 1217–1228.
  • Barber and Candès (2015) Barber, R.F. and Candès, E.J. (2015). “Controlling the false discovery rate via knockoffs.” The Annals of Statistics, 43(5), 2055–2085.
  • Barber and Candès (2019) Barber, R.F. and Candès, E.J. (2019). “A knockoff filter for high-dimensional selective inference.” The Annals of Statistics, 47(5), 2504–2537.
  • Cai et al. (2021) Cai, T.T., Wang, Y., and Zhang, L. (2021). “The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy.” The Annals of Statistics, 49(5), 2825–2850.
  • Candes et al. (2018) Candes, E., Fan, Y., Janson, L., and Lv, J. (2018). “Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection.” Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3), 551–577.
  • Dai et al. (2022) Dai, C., Lin, B., Xing, X., and Liu, J.S. (2022). “False discovery rate control via data splitting.” Journal of the American Statistical Association, pages 1–18.
  • Dai et al. (2023) Dai, C., Lin, B., Xing, X., and Liu, J.S. (2023). “A scale-free approach for false discovery rate control in generalized linear models.” Journal of the American Statistical Association, pages 1–15.
  • Dwork et al. (2006) Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). “Calibrating noise to sensitivity in private data analysis.” In “TCC 2006,” pages 265–284. Springer.
  • Dwork and Roth (2014) Dwork, C. and Roth, A. (2014). “The algorithmic foundations of differential privacy.” Foundations and Trends® in Theoretical Computer Science, 9(3–4), 211–407.
  • Dwork et al. (2010) Dwork, C., Rothblum, G.N., and Vadhan, S. (2010). “Boosting and differential privacy.” In “2010 IEEE 51st Annual Symposium on Foundations of Computer Science,” pages 51–60. IEEE.
  • Dwork et al. (2021) Dwork, C., Su, W., and Zhang, L. (2021). “Differentially private false discovery rate control.” Journal of Privacy and Confidentiality, 11(2).
  • Elbaz et al. (2002) Elbaz, A., Bower, J.H., Maraganore, D.M., McDonnell, S.K., Peterson, B.J., Ahlskog, J.E., Schaid, D.J., and Rocca, W.A. (2002). “Risk tables for parkinsonism and Parkinson’s disease.” Journal of Clinical Epidemiology, 55(1), 25–31.
  • Fan and Lv (2008) Fan, J. and Lv, J. (2008). “Sure independence screening for ultrahigh dimensional feature space.” Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(5), 849–911.
  • Javanmard and Montanari (2014) Javanmard, A. and Montanari, A. (2014). “Confidence intervals and hypothesis testing for high-dimensional regression.” Journal of Machine Learning Research, 15(1), 2869–2909.
  • Lane et al. (2014) Lane, J., Stodden, V., Bender, S., and Nissenbaum, H. (2014). Privacy, big data, and the public good: Frameworks for engagement. Cambridge University Press.
  • Little et al. (2008) Little, M., McSharry, P., Hunter, E., Spielman, J., and Ramig, L. (2008). “Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease.” Nature Precedings.
  • Liu (2013) Liu, W. (2013). “Gaussian graphical model estimation with false discovery rate control.” The Annals of Statistics, 41(6), 2948–2978.
  • Ning and Liu (2017) Ning, Y. and Liu, H. (2017). “A general theory of hypothesis tests and confidence regions for sparse high dimensional models.” The Annals of Statistics, 45(1), 158–195.
  • Pournaderi and Xiang (2021) Pournaderi, M. and Xiang, Y. (2021). “Differentially private variable selection via the knockoff filter.” In “2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP),” pages 1–6. IEEE.
  • Romano et al. (2020) Romano, Y., Sesia, M., and Candès, E. (2020). “Deep knockoffs.” Journal of the American Statistical Association, 115(532), 1861–1872.
  • Shi et al. (2021) Shi, C., Song, R., Lu, W., and Li, R. (2021). “Statistical inference for high-dimensional models via recursive online-score estimation.” Journal of the American Statistical Association, 116(535), 1307–1318.
  • Tan et al. (2020) Tan, K., Shi, L., and Yu, Z. (2020). “Sparse sir: Optimal rates and adaptive estimation.” The Annals of Statistics, 48(1), 64–85.
  • Tsanas et al. (2009) Tsanas, A., Little, M., McSharry, P., and Ramig, L. (2009). “Accurate telemonitoring of Parkinson’s disease progression by non-invasive speech tests.” Nature Precedings.
  • Van de Geer et al. (2014) Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). “On asymptotically optimal confidence regions and tests for high-dimensional models.” The Annals of Statistics, 42(3), 1166–1202.
  • Wang et al. (2022) Wang, P., Xie, M.G., and Zhang, L. (2022). “Finite-and large-sample inference for model and coefficients in high-dimensional linear regression with repro samples.” arXiv preprint arXiv:2209.09299.
  • Wasserman and Roeder (2009) Wasserman, L. and Roeder, K. (2009). “High dimensional variable selection.” The Annals of Statistics, 37(5A), 2178.
  • Xia and Cai (2023) Xia, X. and Cai, Z. (2023). “Adaptive false discovery rate control with privacy guarantee.” Journal of Machine Learning Research, 24(252), 1–35.
  • Zhang and Zhang (2014) Zhang, C.H. and Zhang, S.S. (2014). “Confidence intervals for low dimensional parameters in high dimensional linear models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1), 217–242.

SUPPLEMENTARY MATERIAL

8 Proofs

8.1 The convergence rate in Algorithm 2

We now prove the differential privacy and the convergence rate of \hat{\bm{\beta}} in Algorithm 2.

8.1.1 Proof of Lemma 3.1

Proof of Lemma 3.1.

The l_{\infty} sensitivity of the gradient in the t-th iteration,
-\eta_{0}/|S_{t}|\sum_{i\in S_{t}}(\Pi_{R}(\bm{x}_{i}^{\top}{\bm{\beta}}^{(t)})-\Pi_{R}(y_{i}))\bm{x}_{i}, is

\sup_{(\bm{x}_{i},y_{i}),(\bm{x}_{i}^{\prime},y_{i}^{\prime})}\eta_{0}/|S_{t}|\cdot|(\Pi_{R}(\bm{x}_{i}^{\top}{\bm{\beta}}^{(t)})-\Pi_{R}(y_{i}))\bm{x}_{i}-(\Pi_{R}(\bm{x}_{i}^{\prime\top}{\bm{\beta}}^{(t)})-\Pi_{R}(y_{i}^{\prime}))\bm{x}_{i}^{\prime}|\leq\eta_{0}T/n\cdot 2(R+R)c_{x}.

By the advanced composition theorem, outputting the gradient is (\epsilon/\{T(K+2)\},\delta/\{T(K+1)\})-differentially private. Hence, by the composition theorem, for each 0\leq k\leq K, the output \hat{\bm{\beta}}(k) is (\epsilon/(K+2),\delta/(K+1))-differentially private, and returning all \{\hat{\bm{\beta}}(k)\}_{k=0}^{K} is (\epsilon(K+1)/(K+2),\delta)-differentially private.

Then, we consider the sensitivity of BIC loss. Notice that

\sup_{(\bm{x}_{i},y_{i}),(\bm{x}_{i}^{\prime},y_{i}^{\prime})}|(\Pi_{R}(\bm{x}_{i}^{\top}{\bm{\beta}}(k))-\Pi_{R}(y_{i}))^{2}-(\Pi_{R}(\bm{x}_{i}^{\prime\top}{\bm{\beta}}(k))-\Pi_{R}(y_{i}^{\prime}))^{2}|\leq 2(R+R)^{2},

for every 0\leq k\leq K. The BIC selection procedure returns the noisy minimum, which by Dwork and Roth (2014) is (\epsilon/(K+2),0)-differentially private. Finally, by the composition theorem, Algorithm 2 is (\epsilon,\delta)-differentially private. ∎
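To see how such a budget split translates into concrete noise levels, the sketch below calibrates Gaussian noise for the clipped gradient; the (K+2)-way budget split mirrors the proof above, while the function name and the specific parameter values are illustrative assumptions.

import numpy as np

def gaussian_noise_scale(sensitivity, eps, delta):
    # Standard Gaussian mechanism: sigma = sqrt(2 * log(1.25/delta)) * sensitivity / eps.
    return np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / eps

# Budget split as in the proof: T(K+2) gradient releases plus a noisy BIC
# minimization, each charged a fraction of the total (eps, delta).
eps, delta, T, K = 0.5, 1e-5, 30, 10
eps_grad, delta_grad = eps / (T * (K + 2)), delta / (T * (K + 1))

# Per-iteration gradient sensitivity eta0 * T/n * 2(R + R) * c_x, as bounded above.
eta0, R, c_x, n = 0.2, 2.0, 1.0, 5000
sensitivity = eta0 * T / n * 2 * (R + R) * c_x
print(gaussian_noise_scale(sensitivity, eps_grad, delta_grad))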

8.1.2 Proof of Theorem 3.1

Proof of Theorem 3.1.

Let \hat{k} be the selected index corresponding to \hat{\bm{\beta}} and let k^{*} be such that 2^{k^{*}-1}<\rho L^{4}s\leq 2^{k^{*}}. Note that k^{*} is uniquely defined by s and (\rho,L). Define the events

E_{0}=\big\{\inf_{\|{\mathbf{u}}\|_{0}=o(n),\|{\mathbf{u}}\|_{2}=1}{\mathbf{u}}^{\top}\widehat{\bm{\Sigma}}{\mathbf{u}}\geq c\|{\mathbf{u}}\|_{2}^{2}\big\},

where \widehat{\bm{\Sigma}}={\bm{X}}^{\top}{\bm{X}}/n, and

E_{1}=\big\{\|\hat{\bm{\beta}}(k)-{\bm{\beta}}\|_{2}\leq c_{2}\frac{2^{k}\log p\log n}{n}+c_{3}\frac{2^{2k}\log^{2}p\log(1/\delta)\log^{7}n}{n^{2}\epsilon^{2}}\text{ for }2^{k}\geq\rho L^{4}s\big\}.

By Theorem 4.4 in Cai et al. (2021), with T there replaced by KT, the event E_{1} happens with probability at least 1-\exp(-c_{1}\log n).

By the oracle inequality of the BIC criterion,

\|\bm{y}-{\bm{X}}\hat{\bm{\beta}}\|_{2}^{2}+c_{0}f(n,\hat{k})\leq\|\bm{y}-{\bm{X}}\hat{\bm{\beta}}(k^{*})\|_{2}^{2}+c_{0}f(n,k^{*})+\epsilon_{privacy},

where f(n,k)=2^{k}\log p\log n+\frac{2^{2k}\log^{2}p\log(1/\delta)\log^{7}n}{n\epsilon^{2}} and \epsilon_{privacy} is the additional term equal to \sup_{k}|z_{k}|. It implies that

\|{\bm{X}}(\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*}))\|_{2}^{2}\leq 2|\langle{\bm{X}}(\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})),\bm{y}-{\bm{X}}\hat{\bm{\beta}}(k^{*})\rangle|+c_{0}\{f(n,k^{*})-f(n,\hat{k})\}+\epsilon_{privacy}.

Let \widehat{U}=\text{supp}(\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})). Note that

|\widehat{U}|=\sqrt{n}/\log^{2}p+s=o(n).

Hence, by the first statement in E_{0}, we have

\begin{split}
c\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{2}^{2}\leq&\frac{1}{n}\|{\bm{X}}(\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*}))\|_{2}^{2}\\
\leq&\frac{2}{n}|\langle{\bm{X}}(\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})),{\bm{X}}(\hat{\bm{\beta}}(k^{*})-{\bm{\beta}})\rangle|+\frac{2}{n}|\langle{\bm{X}}(\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})),\bm{y}-{\bm{X}}{\bm{\beta}}\rangle|\\
&+c_{0}\{f(n,k^{*})-f(n,\hat{k})\}/n+\epsilon_{privacy}/n\\
\leq&\frac{2}{n}\|{\bm{X}}(\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*}))\|_{2}\|{\bm{X}}(\hat{\bm{\beta}}(k^{*})-{\bm{\beta}})\|_{2}+2\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{1}\Big\|\frac{1}{n}{\bm{X}}^{\top}\epsilon\Big\|_{\infty}\\
&+c_{0}\{f(n,k^{*})-f(n,\hat{k})\}/n+\epsilon_{privacy}/n\\
\leq&C\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{2}\|\hat{\bm{\beta}}(k^{*})-{\bm{\beta}}\|_{2}+C\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{1}\sqrt{\frac{\log p}{n}}\\
&+c_{0}\{f(n,k^{*})-f(n,\hat{k})\}/n+\epsilon_{privacy}/n
\end{split}

with high probability, where we use Hölder’s inequality and a concentration inequality. We first consider the case \hat{k}<k^{*}. Then, by Hölder’s inequality,

\begin{split}
c\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{2}^{2}\leq&C\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{2}\|\hat{\bm{\beta}}(k^{*})-{\bm{\beta}}\|_{2}+C\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{2}\sqrt{\frac{(2^{\hat{k}}+2^{k^{*}})\log p}{n}}\\
&+c_{0}\{f(n,k^{*})-f(n,\hat{k})\}/n+\epsilon_{privacy}/n\\
\leq&2C\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{2}\|\hat{\bm{\beta}}(k^{*})-{\bm{\beta}}\|_{2}+c_{0}\{f(n,k^{*})-f(n,\hat{k})\}/n+\epsilon_{privacy}/n.
\end{split}

Since \hat{k}<k^{*}, we have c_{0}\{f(n,k^{*})-f(n,\hat{k})\}+\epsilon_{privacy}>0 for c_{0}>0. Thus, the solution of the resulting quadratic inequality satisfies

\begin{split}
\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{2}&\leq\frac{C\|\hat{\bm{\beta}}(k^{*})-{\bm{\beta}}\|_{2}+\sqrt{C^{2}\|\hat{\bm{\beta}}(k^{*})-{\bm{\beta}}\|_{2}^{2}+c\{c_{0}\{f(n,k^{*})-f(n,\hat{k})\}/n+\epsilon_{privacy}/n\}}}{c}\\
&\leq C^{\prime}\|\hat{\bm{\beta}}(k^{*})-{\bm{\beta}}\|_{2},
\end{split}

for a constant C^{\prime}. Then, we consider the case where \hat{k}\geq k^{*}. By Hölder’s inequality and the triangle inequality, we have

\begin{split}
c\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{2}^{2}\leq&C\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{2}\|\hat{\bm{\beta}}(k^{*})-{\bm{\beta}}\|_{2}+C\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{1}\sqrt{\frac{\log p}{n}}\\
&+c_{0}\{f(n,k^{*})-f(n,\hat{k})\}/n+\epsilon_{privacy}/n\\
\leq&C\|\hat{\bm{\beta}}(k^{*})-{\bm{\beta}}\|_{2}^{2}+C\|\hat{\bm{\beta}}-{\bm{\beta}}\|_{2}^{2}+C\|\hat{\bm{\beta}}-{\bm{\beta}}\|_{1}\sqrt{\frac{\log p}{n}}\\
&+C\|{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{1}\sqrt{\frac{\log p}{n}}+c_{0}\{f(n,k^{*})-f(n,\hat{k})\}/n+\epsilon_{privacy}/n\\
\leq&2C\|\hat{\bm{\beta}}(k^{*})-{\bm{\beta}}\|_{2}^{2}+2C\bigg\{c_{2}\frac{2^{\hat{k}}\log p\log n}{n}+c_{3}\frac{2^{2\hat{k}}\log p\log^{7}n\log(1/\delta)}{n^{2}\epsilon^{2}}\bigg\}\\
&+c_{0}\{f(n,k^{*})-f(n,\hat{k})\}/n+\epsilon_{privacy}/n.
\end{split}

For c_{0}>2C, we have

c\|\hat{\bm{\beta}}-\hat{\bm{\beta}}(k^{*})\|_{2}^{2}\leq(2C+c_{0})\|\hat{\bm{\beta}}(k^{*})-{\bm{\beta}}\|_{2}^{2}+\epsilon_{privacy}/n.

The conclusion holds. ∎

8.2 Proofs of Statistical Inference

Given a pre-specified sparsity level s_{j}, the differentially private algorithm for estimating \bm{w}_{j} is presented in Algorithm 7.

Input: dataset \{(\bm{x}_{i},y_{i})\}_{i\in[n]}, step size \eta^{0}, privacy parameters \varepsilon,\delta, noise scale B, number of iterations T, truncation level R, feasibility parameter C, sparsity s^{*}, initial value \bm{w}^{0}.
1. Randomly split the data into T parts of roughly equal size: [n]=\mathcal{S}_{0}\cup\dots\cup\mathcal{S}_{T-1} with \mathcal{S}_{i}\cap\mathcal{S}_{j}=\emptyset for i\neq j;
2. for t in 0 to T-1 do
3.   Compute \bm{w}^{(t+0.5)}=\bm{w}^{(t)}-\eta^{0}(\bm{e}_{j}-\sum_{i\in\mathcal{S}_{t}}\bm{x}_{i}\Pi_{R}(\bm{x}_{i}^{\top}\bm{w}^{(t)})/|\mathcal{S}_{t}|);
4.   \bm{w}^{(t+1)}=\Pi_{C}(\text{NoisyIHT}(\bm{w}^{(t+0.5)},s^{*},\varepsilon/T,\delta/T,\eta^{0}B/n)).
5. end for
Output: \bm{w}_{j}^{(T)}.
Algorithm 7: Differentially private estimation of \bm{w}_{j} given sparsity.
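To make the iteration concrete, below is a minimal NumPy sketch of one plausible implementation of Algorithm 7; the NoisyIHT step is simplified to a single noisy top-s* selection rather than the paper’s peeling mechanism, the update is written in gradient-descent form for the quadratic loss L_n(w) = w'Σ̂w/2 − w'e_j, and the projection Π_C is taken onto an l2-ball. All of these choices are illustrative assumptions, not the authors’ implementation.

import numpy as np

def noisy_iht(v, s_star, noise_scale, rng):
    # Simplified stand-in for NoisyIHT: select the s* largest coordinates via
    # noisy magnitudes, then release them with additional noise.
    noisy_mag = np.abs(v) + rng.laplace(scale=noise_scale, size=v.shape)
    keep = np.argsort(noisy_mag)[-s_star:]
    out = np.zeros_like(v)
    out[keep] = v[keep] + rng.laplace(scale=noise_scale, size=s_star)
    return out

def dp_estimate_w(X, j, s_star, T, eta, R, C, noise_scale, rng):
    n, p = X.shape
    w, e_j = np.zeros(p), np.eye(p)[j]
    parts = np.array_split(rng.permutation(n), T)     # disjoint data splits
    for t in range(T):
        Xt = X[parts[t]]
        # Gradient of L_n at w on the t-th split, with truncated fits Pi_R(x_i' w).
        grad = Xt.T @ np.clip(Xt @ w, -R, R) / len(parts[t]) - e_j
        w = noisy_iht(w - eta * grad, s_star, noise_scale, rng)
        w *= min(1.0, C / max(np.linalg.norm(w), 1e-12))  # projection Pi_C
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 40))
w_hat = dp_estimate_w(X, j=0, s_star=5, T=10, eta=0.2, R=20.0, C=10.0,
                      noise_scale=0.005, rng=rng)
print(np.round(w_hat[:8], 3))  # close to the precision column e_0 when Sigma = I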

Theoretical properties of Algorithm 7 are outlined in Lemma 8.1. The proof follows arguments similar to those of Theorem 4.4 in Cai et al. (2021).

Lemma 8.1.

Suppose Conditions 3.1, 3.2, and 4.1 hold, and let B=2Rc_{x}, C>L, and R=2L\log(n). There exists a constant \rho such that, if s^{*}=\rho L^{4}s_{j}, \eta^{0}=1/(6L), T=\rho L^{2}\log(8L^{3}n), s\log p=o(n), and s\log p\log(1/\delta)\log^{3}n/\epsilon=o(n), then with probability at least 1-\exp(-c_{1}\log n), there exist constants c_{2} and c_{3} such that

\|\bm{w}_{j}^{(T)}-\bm{w}_{j}\|_{2}^{2}\leq c_{2}\frac{s^{*}\log p\log n}{n}+c_{3}\frac{s^{*2}\log^{2}p\log(1/\delta)\log^{4}n}{n^{2}\epsilon^{2}}.

8.2.1 Proof of Lemma 8.1

Proof of Lemma 8.1.

Let S_{0}=\text{supp}(\bm{w}_{j}) be the support of the true parameter \bm{w}_{j}, and let S be any subset satisfying S_{0}\subset S and |S|\leq cs_{j}. The oracle estimator \hat{\bm{w}}_{j} is defined by

\hat{\bm{w}}_{j}=\operatorname*{arg\,min}_{\bm{w}\in\mathbb{R}^{p},\,\text{supp}(\bm{w})\subseteq S}{\mathcal{L}}_{n}(\bm{w}):=\frac{1}{2}\bm{w}^{\top}\hat{\Sigma}\bm{w}-\bm{w}^{\top}\bm{e}_{j}.

We first study the statistical properties of \hat{\bm{w}}_{j}. An equivalent formulation is \hat{\bm{w}}_{j,S}=\operatorname*{arg\,min}_{\bm{w}\in\mathbb{R}^{|S|}}\frac{1}{2}\bm{w}^{\top}\hat{\Sigma}_{SS}\bm{w}-\bm{w}^{\top}\bm{e}_{j,S}, where \hat{\bm{w}}_{j,S} is the sub-vector of \hat{\bm{w}}_{j} on S and \hat{\Sigma}_{SS} is the corresponding sub-matrix of \hat{\Sigma}. Since j\in S, the sub-vector \bm{e}_{j,S} is a unit vector, and we obtain the analytic solution \hat{\bm{w}}_{j,S}=\hat{\Sigma}_{SS}^{-1}\bm{e}_{j,S}. Then,

\|\hat{\bm{w}}_{j}-\bm{w}_{j}\|_{2}=\|\hat{\bm{w}}_{j,S}-\bm{w}_{j,S}\|_{2}=\|(\hat{\Sigma}_{SS}^{-1}-\Sigma_{SS}^{-1})\bm{e}_{j,S}\|_{2}\leq\|\hat{\Sigma}_{SS}^{-1}-\Sigma_{SS}^{-1}\|_{2}.

By Corollary 10.1 in Tan et al. (2020), for any constant C^{\prime} there exists some constant C>0 such that

\|\hat{\bm{w}}_{j}-\bm{w}_{j}\|_{2}^{2}\leq\frac{C}{n}s_{j}\log(ep/s_{j}),

with probability at least 1-\exp\{-C^{\prime}s_{j}\log(ep/s_{j})\}.

Next, the loss {\mathcal{L}}_{n}(\bm{w}) satisfies \langle\nabla{\mathcal{L}}_{n}(\bm{w}_{1})-\nabla{\mathcal{L}}_{n}(\bm{w}_{2}),\bm{w}_{1}-\bm{w}_{2}\rangle=(\bm{w}_{1}-\bm{w}_{2})^{\top}\hat{\Sigma}(\bm{w}_{1}-\bm{w}_{2}). Thus, we have

\alpha\|\bm{w}_{1}-\bm{w}_{2}\|_{2}^{2}\leq\langle\nabla{\mathcal{L}}_{n}(\bm{w}_{1})-\nabla{\mathcal{L}}_{n}(\bm{w}_{2}),\bm{w}_{1}-\bm{w}_{2}\rangle\leq\gamma\|\bm{w}_{1}-\bm{w}_{2}\|_{2}^{2},

for \bm{w}_{1},\bm{w}_{2}\in\mathbb{R}^{p} satisfying \max\{|\text{supp}(\bm{w}_{1})|,|\text{supp}(\bm{w}_{2})|\}\leq cs. Then, we have

\begin{split}
{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t+1)})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})&=\frac{1}{2}\hat{\bm{w}}_{j}^{(t+1)\top}\hat{\Sigma}\hat{\bm{w}}_{j}^{(t+1)}-\frac{1}{2}\hat{\bm{w}}_{j}^{(t)\top}\hat{\Sigma}\hat{\bm{w}}_{j}^{(t)}-(\hat{\bm{w}}_{j}^{(t+1)}-\hat{\bm{w}}_{j}^{(t)})^{\top}\bm{e}_{j}\\
&=\langle\hat{\bm{w}}_{j}^{(t+1)}-\hat{\bm{w}}_{j}^{(t)},\hat{\bm{w}}_{j}^{(t)\top}\hat{\Sigma}-\bm{e}_{j}\rangle+\frac{1}{2}(\hat{\bm{w}}_{j}^{(t+1)}-\hat{\bm{w}}_{j}^{(t)})^{\top}\hat{\Sigma}(\hat{\bm{w}}_{j}^{(t+1)}-\hat{\bm{w}}_{j}^{(t)})\\
&\leq\langle\hat{\bm{w}}_{j}^{(t+1)}-\hat{\bm{w}}_{j}^{(t)},\bm{g}^{t}\rangle+\frac{\gamma}{2}\|\hat{\bm{w}}_{j}^{(t+1)}-\hat{\bm{w}}_{j}^{(t)}\|_{2}^{2},
\end{split}

where \bm{g}^{t}=\hat{\bm{w}}_{j}^{(t)\top}\hat{\Sigma}-\bm{e}_{j} is the gradient of {\mathcal{L}}_{n}(\bm{w}) evaluated at \hat{\bm{w}}_{j}^{(t)}. Let S^{t}=\text{supp}(\hat{\bm{w}}_{j}^{(t)}), S^{t+1}=\text{supp}(\hat{\bm{w}}_{j}^{(t+1)}) and I^{t}=S^{t+1}\cup S^{t}\cup S. Let \bm{n}_{1},\bm{n}_{2},\dots be the noise vectors added to \hat{\bm{w}}_{j}-\eta\nabla{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)}) by the peeling mechanism, and let \bm{N}=4\sum_{i\in[s]}\|\bm{n}_{i}\|^{2}_{\infty}. Then, we have the decomposition

\begin{split}
&\langle\hat{\bm{w}}_{j}^{(t+1)}-\hat{\bm{w}}_{j}^{(t)},\bm{g}^{t}\rangle+\frac{\gamma}{2}\|\hat{\bm{w}}_{j}^{(t+1)}-\hat{\bm{w}}_{j}^{(t)}\|_{2}^{2}\\
=&\frac{\gamma}{2}\|\hat{\bm{w}}_{j,I^{t}}^{(t+1)}-\hat{\bm{w}}_{j,I^{t}}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{I^{t}}\|_{2}^{2}-\frac{\eta^{2}}{2\gamma}\|\bm{g}^{t}_{I^{t}}\|_{2}^{2}+(1-\eta)\langle\hat{\bm{w}}_{j}^{(t+1)}-\hat{\bm{w}}_{j}^{(t)},\bm{g}^{t}\rangle.
\end{split}

We first consider the first two terms. Let R=S^{t}/S^{t+1}; then |I^{t}/(S^{t}\cup S)|=|S^{t}/S^{t+1}|. By the fact that \hat{\bm{w}}^{(t)}_{j,I^{t}/(S^{t}\cup S)}=\bm{0} and the selection criterion, we have for every c>1

\frac{\eta^{2}}{\gamma^{2}}\|\bm{g}^{t}_{I^{t}/(S^{t}\cup S)}\|_{2}^{2}=\|\hat{\bm{w}}_{j,I^{t}/(S^{t}\cup S)}^{(t)}-\frac{\eta^{2}}{\gamma^{2}}\bm{g}^{t}_{I^{t}/(S^{t}\cup S)}\|_{2}^{2}\geq(1-1/c)\|\hat{\bm{w}}_{j,R}^{(t)}-\frac{\eta^{2}}{\gamma^{2}}\bm{g}^{t}_{R}\|_{2}^{2}-c\bm{N}.

Then, we have

\begin{split}
&\frac{\gamma}{2}\|\hat{\bm{w}}_{j,I^{t}}^{(t+1)}-\hat{\bm{w}}_{j,I^{t}}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{I^{t}}\|_{2}^{2}-\frac{\eta^{2}}{2\gamma}\|\bm{g}^{t}_{I^{t}/(S^{t}\cup S)}\|_{2}^{2}\\
\leq&\gamma\|\bm{n}_{S^{t+1}}\|_{2}^{2}+\gamma\|\tilde{\bm{w}}_{j,I^{t}}^{(t+1)}-\hat{\bm{w}}_{j,I^{t}}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{I^{t}}\|_{2}^{2}-\frac{\gamma}{2}(1-1/c)\|\hat{\bm{w}}_{j,R}^{(t)}-\frac{\eta^{2}}{\gamma^{2}}\bm{g}^{t}_{R}\|_{2}^{2}+\frac{c\gamma}{2}\bm{N}\\
=&\gamma\|\tilde{\bm{w}}_{j,I^{t}}^{(t+1)}-\hat{\bm{w}}_{j,I^{t}}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{I^{t}}\|_{2}^{2}-\gamma\|\tilde{\bm{w}}_{j,R}^{(t+1)}-\hat{\bm{w}}_{j,R}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{R}\|_{2}^{2}+\frac{\gamma}{2}(1+1/c)\|\hat{\bm{w}}_{j,R}^{(t)}-\frac{\eta^{2}}{\gamma^{2}}\bm{g}^{t}_{R}\|_{2}^{2}\\
&+\gamma\|\bm{n}_{S^{t+1}}\|_{2}^{2}+\frac{c\gamma}{2}\bm{N}\\
\leq&\gamma\|\tilde{\bm{w}}_{j,I^{t}/R}^{(t+1)}-\hat{\bm{w}}_{j,I^{t}/R}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{I^{t}/R}\|_{2}^{2}+\frac{\eta^{2}}{2\gamma}\frac{c+1}{c-1}\|\bm{g}^{t}_{I^{t}/(S^{t}\cup S)}\|_{2}^{2}+\gamma\|\bm{n}_{S^{t+1}}\|_{2}^{2}+\frac{c\gamma}{2}\Big(\frac{c+1}{c-1}c+1\Big)\bm{N},
\end{split}

where we use (a+b)^{2}\leq 2a^{2}+2b^{2} in the first inequality. By Lemma A.3 in Cai et al. (2021), we have

\|\tilde{\bm{w}}_{j,I^{t}/R}^{(t+1)}-\hat{\bm{w}}_{j,I^{t}/R}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{I^{t}/R}\|_{2}^{2}\leq\frac{3}{2}\frac{|I^{t}/R|-s}{|I^{t}/R|-s^{*}}\|\hat{\bm{w}}_{j,I^{t}}-\hat{\bm{w}}_{j,I^{t}}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{I^{t}}\|_{2}^{2}+3\bm{N}.

Thus,

\begin{split}
&\frac{\gamma}{2}\|\hat{\bm{w}}_{j,I^{t}}^{(t+1)}-\hat{\bm{w}}_{j,I^{t}}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{I^{t}}\|_{2}^{2}-\frac{\eta^{2}}{2\gamma}\|\bm{g}^{t}_{I^{t}/(S^{t}\cup S)}\|_{2}^{2}\\
\leq&\gamma\frac{3}{2}\frac{|I^{t}/R|-s_{j}}{|I^{t}/R|-s^{*}}\|\hat{\bm{w}}_{j,I^{t}}-\hat{\bm{w}}_{j,I^{t}}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{I^{t}}\|_{2}^{2}+3\gamma\bm{N}+\frac{\eta^{2}}{2\gamma}\frac{c+1}{c-1}\|\bm{g}^{t}_{I^{t}/(S^{t}\cup S)}\|_{2}^{2}\\
&+\gamma\|\bm{n}_{S^{t+1}}\|_{2}^{2}+\frac{c\gamma}{2}\Big(\frac{c+1}{c-1}c+1\Big)\bm{N}\\
=&\frac{3s^{*}}{s_{j}+s^{*}}\Big(2\eta\langle\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}^{(t)},\bm{g}^{t}\rangle+\gamma\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}^{(t)}\|_{2}^{2}+\frac{\eta^{2}}{\gamma}\|\bm{g}^{t}_{I^{t}}\|_{2}^{2}\Big)+\frac{\eta^{2}}{2\gamma}\frac{c+1}{c-1}\|\bm{g}^{t}_{I^{t}/(S^{t}\cup S)}\|_{2}^{2}\\
&+\gamma\|\bm{n}_{S^{t+1}}\|_{2}^{2}+\frac{c\gamma}{2}\Big(\frac{c+1}{c-1}c+1+6/c\Big)\bm{N}\\
\leq&\frac{3s^{*}}{s_{j}+s^{*}}\Big[2\eta\{{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})\}+(\gamma-2\eta\alpha)\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}^{(t)}\|_{2}^{2}+\frac{\eta^{2}}{\gamma}\|\bm{g}^{t}_{I^{t}}\|_{2}^{2}\Big]\\
&+\frac{\eta^{2}}{2\gamma}\frac{c+1}{c-1}\|\bm{g}^{t}_{I^{t}/(S^{t}\cup S)}\|_{2}^{2}+\gamma\|\bm{n}_{S^{t+1}}\|_{2}^{2}+\frac{c\gamma}{2}\Big(\frac{c+1}{c-1}c+1+6/c\Big)\bm{N},
\end{split}

where in the first inequality we use the assumption s_{j}>s^{*}.

Then, we consider the second term, which has the decomposition

\begin{split}
\langle\hat{\bm{w}}_{j}^{(t+1)}-\hat{\bm{w}}_{j}^{(t)},\bm{g}^{t}\rangle&=\langle\tilde{\bm{w}}_{j,S^{t+1}}^{(t+1)}-\hat{\bm{w}}_{j,S^{t+1}}^{(t)},\bm{g}^{t}_{S^{t+1}}\rangle+\langle\bm{n}_{S^{t+1}},\bm{g}^{t}_{S^{t+1}}\rangle-\langle\hat{\bm{w}}_{j,S^{t}/S^{t+1}}^{(t)},\bm{g}_{S^{t}/S^{t+1}}^{t}\rangle\\
&\leq-\frac{\eta}{\gamma}\|\bm{g}^{t}_{S^{t}}\|_{2}^{2}+c\|\bm{n}_{S^{t+1}}\|_{2}^{2}+(1/4c)\|\bm{g}^{t}_{S^{t+1}}\|^{2}_{2}-\langle\hat{\bm{w}}_{j,S^{t}/S^{t+1}}^{(t)},\bm{g}_{S^{t}/S^{t+1}}^{t}\rangle,
\end{split}

where we use the fact that ab\leq a^{2}/2+b^{2}/2. The last term satisfies

-\langle\hat{\bm{w}}_{j,S^{t}/S^{t+1}}^{(t)},\bm{g}_{S^{t}/S^{t+1}}^{t}\rangle\leq\frac{\gamma}{2\eta}\Big\{\|\hat{\bm{w}}_{j,S^{t}/S^{t+1}}^{(t)}-\frac{\eta}{\gamma}\bm{g}_{S^{t}/S^{t+1}}^{t}\|_{2}^{2}-\Big(\frac{\eta}{\gamma}\Big)^{2}\|\bm{g}_{S^{t}/S^{t+1}}^{t}\|_{2}^{2}\Big\}.

By the selection property, we have

\begin{split}
-\langle\hat{\bm{w}}_{j,S^{t}/S^{t+1}}^{(t)},\bm{g}_{S^{t}/S^{t+1}}^{t}\rangle&\leq\frac{\gamma}{2\eta}\{(1+1/c)\|\tilde{\bm{w}}_{j,S^{t+1}/S^{t}}^{(t)}\|_{2}^{2}+(1+c)\bm{N}\}-\frac{\eta}{2\gamma}\|\bm{g}_{S^{t}/S^{t+1}}^{t}\|_{2}^{2}\\
&=\frac{\eta}{2\gamma}\{(1+1/c)\|\bm{g}_{S^{t+1}/S^{t}}^{t}\|_{2}^{2}+(1+c)\bm{N}\}-\frac{\eta}{2\gamma}\|\bm{g}_{S^{t}/S^{t+1}}^{t}\|_{2}^{2}.
\end{split}

Combining these results, we have

\begin{split}
\langle\hat{\bm{w}}_{j}^{(t+1)}-\hat{\bm{w}}_{j}^{(t)},\bm{g}^{t}\rangle\leq&-\frac{\eta}{\gamma}\|\bm{g}^{t}_{S^{t+1}}\|_{2}^{2}+c\|\bm{n}_{S^{t+1}}\|_{2}^{2}+(1/4c)\|\bm{g}^{t}_{S^{t+1}}\|^{2}_{2}\\
&+\frac{\gamma}{2\eta}\{(1+1/c)\|\bm{g}_{S^{t+1}/S^{t}}^{t}\|_{2}^{2}+(1+c)\bm{N}\}-\frac{\eta}{2\gamma}\|\bm{g}_{S^{t}/S^{t+1}}^{t}\|_{2}^{2}\\
\leq&\frac{\eta}{2\gamma}\|\bm{g}_{S^{t+1}/S^{t}}^{t}\|_{2}^{2}-\frac{\eta}{2\gamma}\|\bm{g}_{S^{t}/S^{t+1}}^{t}\|_{2}^{2}-\frac{\eta}{\gamma}\|\bm{g}^{t}_{S^{t+1}}\|_{2}^{2}\\
&+(1/c)\Big(4+\frac{\eta}{2\gamma}\Big)\|\bm{g}^{t}_{S^{t+1}}\|^{2}_{2}+c\|\bm{n}_{S^{t+1}}\|_{2}^{2}+(1+c)\frac{\gamma}{2\eta}\bm{N}\\
\leq&-\frac{\eta}{2\gamma}\|\bm{g}_{S^{t}\cup S^{t+1}}^{t}\|_{2}^{2}+(1/c)\Big(4+\frac{\eta}{2\gamma}\Big)\|\bm{g}^{t}_{S^{t+1}}\|^{2}_{2}+c\|\bm{n}_{S^{t+1}}\|_{2}^{2}+(1+c)\frac{\gamma}{2\eta}\bm{N}.
\end{split}

Combining all of the above, we have

\begin{split}
&{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t+1)})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})\\
\leq&\frac{\gamma}{2}\|\hat{\bm{w}}_{j,I^{t}}^{(t+1)}-\hat{\bm{w}}_{j,I^{t}}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{I^{t}}\|_{2}^{2}-\frac{\eta^{2}}{2\gamma}\|\bm{g}^{t}_{I^{t}}\|_{2}^{2}-\frac{\eta(1-\eta)}{2\gamma}\|\bm{g}_{S^{t}\cup S^{t+1}}^{t}\|_{2}^{2}\\
&+(1-\eta)(1/c)\Big(4+\frac{\eta}{2\gamma}\Big)\|\bm{g}^{t}_{S^{t+1}}\|^{2}_{2}+c(1-\eta)\|\bm{n}_{S^{t+1}}\|_{2}^{2}+(1-\eta)(1+c)\frac{\gamma}{2\eta}\bm{N}\\
\leq&\frac{\gamma}{2}\|\hat{\bm{w}}_{j,I^{t}}^{(t+1)}-\hat{\bm{w}}_{j,I^{t}}^{(t)}+\frac{\eta}{\gamma}\bm{g}^{t}_{I^{t}}\|_{2}^{2}-\frac{\eta^{2}}{2\gamma}\|\bm{g}^{t}_{I^{t}/(S^{t}\cup S)}\|_{2}^{2}-\frac{\eta^{2}}{2\gamma}\|\bm{g}^{t}_{S^{t}\cup S}\|_{2}^{2}\\
&-\frac{\eta(1-\eta)}{2\gamma}\|\bm{g}_{S^{t+1}/(S^{t}\cup S)}^{t}\|_{2}^{2}+(1-\eta)(1/c)\Big(4+\frac{\eta}{2\gamma}\Big)\|\bm{g}^{t}_{S^{t+1}}\|^{2}_{2}\\
&+c(1-\eta)\|\bm{n}_{S^{t+1}}\|_{2}^{2}+(1-\eta)(1+c)\frac{\gamma}{2\eta}\bm{N}\\
\leq&\frac{3s^{*}}{s_{j}+s^{*}}\Big[2\eta\{{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})\}+(\gamma-2\eta\alpha)\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}^{(t)}\|_{2}^{2}+\frac{\eta^{2}}{\gamma}\|\bm{g}^{t}_{I^{t}}\|_{2}^{2}\Big]\\
&+\frac{\eta^{2}}{2\gamma}\frac{c+1}{c-1}\|\bm{g}^{t}_{I^{t}/(S^{t}\cup S)}\|_{2}^{2}+\frac{c\gamma}{2}\Big(\frac{c+1}{c-1}c+1+6/c\Big)\bm{N}-\frac{\eta^{2}}{2\gamma}\|\bm{g}^{t}_{S^{t}\cup S}\|_{2}^{2}\\
&-\frac{\eta(1-\eta)}{2\gamma}\|\bm{g}_{S^{t+1}/(S^{t}\cup S)}^{t}\|_{2}^{2}+(1-\eta)(1/c)\Big(4+\frac{\eta}{2\gamma}\Big)\|\bm{g}^{t}_{S^{t+1}}\|^{2}_{2}+c(1-\eta)\|\bm{n}_{S^{t+1}}\|_{2}^{2}\\
&+(1-\eta)(1+c)\frac{\gamma}{2\eta}\bm{N}+\gamma\|\bm{n}_{S^{t+1}}\|_{2}^{2}\\
\leq&\frac{3s^{*}}{s_{j}+s^{*}}\Big[2\eta\{{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})\}+(\gamma-2\eta\alpha)\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}^{(t)}\|_{2}^{2}+\frac{c\eta^{2}}{\gamma}\|\bm{g}^{t}_{I^{t}}\|_{2}^{2}\Big]\\
&-\frac{\eta^{2}}{2\gamma}\|\bm{g}^{t}_{S^{t}\cup S}\|_{2}^{2}-\frac{\eta(1-\eta)}{2\gamma}\|\bm{g}_{S^{t+1}/(S^{t}\cup S)}^{t}\|_{2}^{2}+\{\gamma+c(1-\eta)\}\|\bm{n}_{S^{t+1}}\|_{2}^{2}\\
&+\Big\{\frac{c\gamma}{2}\Big(\frac{c+1}{c-1}c+1+6/c\Big)+(1-\eta)(1+c)\frac{\gamma}{2\eta}\Big\}\bm{N}
\end{split}

for a constant c large enough. By the choice s_{j}=72C(\gamma/\alpha)^{2}s^{*}=\rho L^{4}s^{*} and \eta=2/3, we have 6cs^{*}/(s_{j}+s^{*})\leq\alpha^{2}/(24\gamma(\gamma-2\eta\alpha))\leq 1/8 because \alpha<\gamma. Then,

\begin{split}
{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t+1)})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})\leq&\frac{6s^{*}}{s^{*}+s_{j}}\eta\{{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})\}+\frac{\alpha^{2}}{24\gamma}\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}^{(t)}\|_{2}^{2}+\frac{c}{18\gamma}\|\bm{g}^{t}_{I^{t}}\|_{2}^{2}\\
&-\frac{1}{9\gamma}\|\bm{g}^{t}_{S^{t}\cup S}\|_{2}^{2}-\frac{1}{18\gamma}\|\bm{g}_{S^{t+1}/(S^{t}\cup S)}^{t}\|_{2}^{2}+\{\gamma+c/3\}\|\bm{n}_{S^{t+1}}\|_{2}^{2}\\
&+\Big\{\frac{c\gamma}{2}\Big(\frac{c+1}{c-1}c+1+6/c\Big)+(1+c)\frac{\gamma}{4}\Big\}\bm{N}\\
\leq&\frac{6s^{*}}{s^{*}+s_{j}}\eta\{{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})\}+\frac{\alpha^{2}}{48\gamma}\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}^{(t)}\|_{2}^{2}+\frac{1}{36\gamma}\|\bm{g}^{t}_{I^{t}}\|_{2}^{2}\\
&-\frac{1}{9\gamma}\|\bm{g}^{t}_{S^{t}\cup S}\|_{2}^{2}-\frac{1}{18\gamma}\|\bm{g}_{S^{t+1}/(S^{t}\cup S)}^{t}\|_{2}^{2}+\{\gamma+c/3\}\|\bm{n}_{S^{t+1}}\|_{2}^{2}\\
&+\Big\{\frac{c\gamma}{2}\Big(\frac{c+1}{c-1}c+1+6/c\Big)+(1+c)\frac{\gamma}{4}\Big\}\bm{N}\\
\leq&\frac{6s^{*}}{s^{*}+s_{j}}\eta\{{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})\}+\frac{3}{36\gamma}\Big(\frac{\alpha^{2}}{4}\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}^{(t)}\|_{2}^{2}-\|\bm{g}^{t}_{I^{t}}\|_{2}^{2}\Big)\\
&+c_{3}(\|\bm{n}_{S^{t+1}}\|_{2}^{2}+\bm{N})\\
\leq&-\Big(\frac{3\alpha}{72\gamma}+\frac{4s^{*}}{s_{j}+s^{*}}\Big)\{{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})\}+c_{3}(\|\bm{n}_{S^{t+1}}\|_{2}^{2}+\bm{N})\\
\leq&-\frac{1}{\rho L^{2}}\{{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})\}+c_{3}(\|\bm{n}_{S^{t+1}}\|_{2}^{2}+\bm{N}).
\end{split}

Thus,

{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t+1)})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})\leq\Big(1-\frac{1}{\rho L^{2}}\Big)\big({\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(t)})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})\big)+c_{3}(\|\bm{n}_{S^{t+1}}\|_{2}^{2}+\bm{N}).

Let \bm{N}_{t}=c_{3}(\|\bm{n}_{S^{t+1}}\|_{2}^{2}+\bm{N}) and iterate the above inequality to obtain

{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(T)})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})\leq\Big(1-\frac{1}{\rho L^{2}}\Big)^{T}\{{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{0})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})\}+\sum_{k=0}^{T-1}\Big(1-\frac{1}{\rho L^{2}}\Big)^{T-k-1}\bm{N}_{k}.

Thus, by T=O(\log(n)) and

{\mathcal{L}}_{n}(\hat{\bm{w}}_{j}^{(T)})-{\mathcal{L}}_{n}(\hat{\bm{w}}_{j})\geq c\|\hat{\bm{w}}_{j}^{(T)}-\bm{w}_{j}\|_{2}^{2}-\langle\nabla{\mathcal{L}}_{n}(\bm{w}_{j}),\bm{w}_{j}-\hat{\bm{w}}_{j}^{(T)}\rangle,

we have

c\|\hat{\bm{w}}_{j}^{(T)}-\bm{w}_{j}\|_{2}^{2}\leq\|\nabla{\mathcal{L}}_{n}(\bm{w}_{j})\|_{\infty}\sqrt{s+s^{*}}\|\bm{w}_{j}-\hat{\bm{w}}_{j}^{(T)}\|_{2}+1/n+\sum_{k=0}^{T-1}\Big(1-\frac{1}{\rho L^{2}}\Big)^{T-k-1}\bm{N}_{k}.

Thus,

\|\hat{\bm{w}}_{j}^{(T)}-\bm{w}_{j}\|_{2}^{2}\leq c\frac{s^{*}\log p}{n}+c\frac{(s^{*}\log p)^{2}\log(1/\delta)\log^{2}n}{n^{2}\epsilon^{2}}.

The desired result follows by replacing n with n/T. ∎

8.2.2 Proof of Lemma 4.2

Proof of Lemma 4.2.

The l_{\infty} sensitivity of the gradient in the t-th iteration,
-\eta_{0}\{\bm{e}_{j}-\sum_{i\in S_{t}}\bm{x}_{i}\Pi_{R}(\bm{x}_{i}^{\top}\bm{w}^{(t)})/|S_{t}|\}, is

\sup_{\bm{x}_{i},\bm{x}_{i}^{\prime}}\eta_{0}/|S_{t}|\cdot|\bm{x}_{i}\Pi_{R}(\bm{x}_{i}^{\top}\bm{w}^{(t)})-\bm{x}_{i}^{\prime}\Pi_{R}(\bm{x}_{i}^{\prime\top}\bm{w}^{(t)})|\leq\eta_{0}T/n\cdot 2Rc_{x},

where we use the fact that \|\bm{w}\|_{2}\leq\|\bm{\Sigma}^{-1}\|_{op}\leq L by Condition 3.1. By the advanced composition theorem, outputting the gradient is (\epsilon/\{T(K+2)\},\delta/\{T(K+1)\})-differentially private. For 0\leq k\leq K, outputting \hat{\bm{w}}(k) is (\epsilon/(K+2),\delta/(K+1))-differentially private. By the composition theorem, returning all \{\hat{\bm{w}}(k)\}_{k=0}^{K} is (\epsilon(K+1)/(K+2),\delta)-differentially private.

Then, we consider the sensitivity of BIC loss. Notice that

\sup_{\bm{x}_{i},\bm{x}_{i}^{\prime}}|\Pi_{R}(\hat{\bm{w}}(k)^{\top}\bm{x}_{i})\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{w}}(k))/2-\Pi_{R}(\hat{\bm{w}}(k)^{\top}\bm{x}_{i}^{\prime})\Pi_{R}(\bm{x}_{i}^{\prime\top}\hat{\bm{w}}(k))/2|\leq R^{2}.

The BIC selection procedure returns the noisy minimum, which by Dwork and Roth (2014) is (\epsilon/(K+2),0)-differentially private. Finally, by the composition theorem, Algorithm 4 is (\epsilon,\delta)-differentially private. ∎

8.2.3 Proof of Lemma 4.3

Proof of Lemma 4.3.

Let \hat{k} be the selected index corresponding to \hat{\bm{w}} and let k^{*} be such that 2^{k^{*}-1}<\rho L^{4}s_{j}\leq 2^{k^{*}}. Note that k^{*} is uniquely defined by s_{j} and (\rho,L). Define the events

E_{0}=\big\{\inf_{\|{\mathbf{u}}\|_{0}=o(n),\|{\mathbf{u}}\|_{2}=1}{\mathbf{u}}^{\top}\widehat{\bm{\Sigma}}{\mathbf{u}}\geq c\|{\mathbf{u}}\|_{2}^{2}\big\},

and

E_{1}=\big\{\|\hat{\bm{w}}_{j}(k)-\bm{w}_{j}\|_{2}\leq c_{2}\frac{2^{k}\log p\log n}{n}+c_{3}\frac{2^{2k}\log^{2}p\log(1/\delta)\log^{6}n}{n^{2}\epsilon^{2}}\text{ for }2^{k}\geq\rho L^{4}s_{j}\big\}.

By Lemma 8.1, with T replaced by KT, the event E_{1} happens with probability at least 1-\exp(-c_{1}\log n).

By the oracle inequality of the BIC criterion,

\hat{\bm{w}}_{j}^{\top}\hat{\Sigma}\hat{\bm{w}}_{j}/2-\hat{\bm{w}}_{j}^{\top}\bm{e}_{j}+c_{0}f(p,\hat{k})/n\leq\hat{\bm{w}}_{j}(k^{*})^{\top}\hat{\Sigma}\hat{\bm{w}}_{j}(k^{*})/2-\hat{\bm{w}}_{j}(k^{*})^{\top}\bm{e}_{j}+c_{0}f(p,k^{*})/n+\epsilon_{privacy},

where f(p,k)=2^{k}\log p\log n+\frac{2^{2k}\log^{2}p\log(1/\delta)\log^{4}n}{n\epsilon^{2}} and \epsilon_{privacy} is the additional term equal to \sup_{k}|z_{k}|/n. It implies that

\begin{split}
(\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*}))^{\top}\hat{\Sigma}(\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*}))/2&\leq|\langle\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*}),\hat{\bm{w}}_{j}(k^{*})^{\top}\hat{\Sigma}-\bm{e}_{j}\rangle|\\
&+c_{0}\{f(p,k^{*})-f(p,\hat{k})\}/n+\epsilon_{privacy}.
\end{split}

Let \widehat{U}=\text{supp}(\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})). Note that

|\widehat{U}|=\sqrt{n}/\log^{2}p+s_{j}=o(n).

Hence, by the first statement in E_{0}, we have

\begin{split}
c\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{2}^{2}\leq&(\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*}))^{\top}\hat{\Sigma}(\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*}))/2\\
\leq&|\langle\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*}),\bm{w}_{j}^{\top}\hat{\Sigma}-\bm{e}_{j}\rangle|+|\langle\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*}),(\hat{\bm{w}}_{j}(k^{*})-\bm{w}_{j})^{\top}\hat{\Sigma}\rangle|\\
&+c_{0}\{f(p,k^{*})-f(p,\hat{k})\}/n+\epsilon_{privacy}\\
\leq&\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{1}\|\bm{w}_{j}^{\top}\hat{\Sigma}-\bm{e}_{j}\|_{\infty}+c\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{2}\|\bm{w}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{2}\\
&+c_{0}\{f(p,k^{*})-f(p,\hat{k})\}/n+\epsilon_{privacy},
\end{split}

with high probability, where we use Hölder’s inequality. We first consider the case \hat{k}<k^{*}. Then, by Hölder’s inequality,

\begin{split}
c\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{2}^{2}\leq&\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{2}\|\hat{\bm{w}}_{j}(k^{*})-\bm{w}_{j}\|_{2}+C\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{2}\sqrt{\frac{(2^{\hat{k}}+2^{k^{*}})\log p}{n}}\\
&+c_{0}\{f(p,k^{*})-f(p,\hat{k})\}/n+\epsilon_{privacy}\\
\leq&2\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{2}\|\hat{\bm{w}}_{j}(k^{*})-\bm{w}_{j}\|_{2}+c_{0}\{f(p,k^{*})-f(p,\hat{k})\}/n+\epsilon_{privacy}.
\end{split}

Since \hat{k}<k^{*}, we have c_{0}\{f(p,k^{*})-f(p,\hat{k})\}/n+\epsilon_{privacy}>0 for c_{0}>0. Thus, the solution of the resulting quadratic inequality satisfies

\begin{split}
\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{2}&\leq\frac{C\|\hat{\bm{w}}_{j}(k^{*})-\bm{w}_{j}\|_{2}+\sqrt{C^{2}\|\hat{\bm{w}}_{j}(k^{*})-\bm{w}_{j}\|_{2}^{2}+c\{c_{0}\{f(p,k^{*})-f(p,\hat{k})\}/n+\epsilon_{privacy}\}}}{c}\\
&\leq C\|\hat{\bm{w}}_{j}(k^{*})-\bm{w}_{j}\|_{2}.
\end{split}

Then, we consider the case where \hat{k}\geq k^{*}. By Hölder’s inequality and the triangle inequality, we have

\begin{split}
c\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{2}^{2}\leq&\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{2}\|\hat{\bm{w}}_{j}(k^{*})-\bm{w}_{j}\|_{2}+C\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{1}\sqrt{\frac{\log p}{n}}\\
&+c_{0}\{f(p,k^{*})-f(p,\hat{k})\}/n+\epsilon_{privacy}\\
\leq&\|\hat{\bm{w}}_{j}(k^{*})-\bm{w}_{j}\|_{2}^{2}+\|\hat{\bm{w}}_{j}-\bm{w}_{j}\|_{2}^{2}+C\|\hat{\bm{w}}_{j}-\bm{w}_{j}\|_{1}\sqrt{\frac{\log p}{n}}\\
&+C\|\bm{w}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{1}\sqrt{\frac{\log p}{n}}+c_{0}\{f(p,k^{*})-f(p,\hat{k})\}/n+\epsilon_{privacy}\\
\leq&2C\|\hat{\bm{w}}_{j}(k^{*})-\bm{w}_{j}\|_{2}^{2}+2C\bigg\{c_{1}\frac{2^{\hat{k}}\log p\log n}{n}+c_{2}\frac{2^{2\hat{k}}\log p\log(1/\delta)\log^{5}n}{n^{2}\epsilon^{2}}\bigg\}\\
&+c_{0}\{f(p,k^{*})-f(p,\hat{k})\}/n+\epsilon_{privacy}.
\end{split}

For c_{0}>2C, we have

c\|\hat{\bm{w}}_{j}-\hat{\bm{w}}_{j}(k^{*})\|_{2}^{2}\leq(2C+c_{0})\|\hat{\bm{w}}_{j}(k^{*})-\bm{w}_{j}\|_{2}^{2}+\epsilon_{privacy}.

Then, the conclusion holds. ∎

8.2.4 Proof of Lemma 4.4

Proof of Lemma 4.4.

Recall that

\widehat{\bm{\beta}}^{db}_{j}-\bm{\beta}_{j}=\underbrace{\hat{\bm{w}}_{j}^{\top}\frac{\sum_{i=1}^{n}\bm{x}_{i}\epsilon_{i}}{n}}_{R_{1,j}}-\underbrace{(\bm{e}_{j}-\hat{\bm{w}}_{j}^{\top}\widehat{\bm{\Sigma}}_{XX})({\bm{\beta}}-\widehat{\bm{\beta}})}_{R_{2,j}}.

For R_{2,j},

\begin{split}
|R_{2,j}|&\leq\|\bm{e}_{j}-\bm{w}_{j}^{\top}\widehat{\Sigma}_{XX}\|_{\infty}\|\bm{\beta}-\widehat{\bm{\beta}}\|_{1}+\|\hat{\bm{w}}_{j}-\bm{w}_{j}\|_{2}\|\bm{\beta}-\widehat{\bm{\beta}}\|_{2}\\
&\leq\sqrt{\frac{\log p}{n}}\sqrt{2^{K}}\sqrt{c_{2}\frac{s\log p\log n}{n}+c_{3}\frac{s^{2}\log^{2}p\log(1/\delta)\log^{7}n}{n^{2}\epsilon^{2}}}\\
&\quad+\sqrt{c_{2}\frac{s\log p\log n}{n}+c_{3}\frac{s^{2}\log^{2}p\log(1/\delta)\log^{6}n}{n^{2}\epsilon^{2}}}\times\sqrt{c_{2}\frac{s_{j}\log p\log n}{n}+c_{3}\frac{s_{j}^{2}\log^{2}p\log(1/\delta)\log^{7}n}{n^{2}\epsilon^{2}}}.
\end{split}

The term R_{1,j} is asymptotically normal when \hat{\bm{w}}_{j} is replaced with \bm{w}_{j}. Let R_{1,j}^{*}=\frac{1}{n}\bm{w}_{j}^{\top}({\bm{X}}^{\top}\varepsilon). Then, we have

\textup{Var}(R_{1,j}^{*})=\frac{\Omega_{j,j}\sigma^{2}}{n}.

Furthermore, we have

\begin{split}
|R_{1,j}^{*}-R_{1,j}|&\leq\|\hat{\bm{w}}_{j}-\bm{w}_{j}\|_{1}\Big\|\frac{\sum_{i=1}^{n}\bm{x}_{i}\epsilon_{i}}{n}\Big\|_{\infty}\\
&\leq\sqrt{\frac{\log p}{n}}\sqrt{2^{K}}\sqrt{c_{2}\frac{s_{j}\log p\log n}{n}+c_{3}\frac{s_{j}^{2}\log^{2}p\log(1/\delta)\log^{6}n}{n^{2}\epsilon^{2}}}.
\end{split}

∎
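To connect this decomposition with the construction of the estimator, the sketch below computes the one-coordinate debiased estimate and its plug-in variance from generic inputs; the variable names are ours, and the noise-addition steps of Algorithm 5 are omitted so as to isolate the decomposition itself.

import numpy as np

def debias(X, y, beta_hat, w_j, j, sigma2_hat):
    # beta_db_j = beta_hat_j + w_j' X'(y - X beta_hat) / n, which reproduces the
    # decomposition beta_db_j - beta_j = R_{1,j} - R_{2,j} displayed above.
    n = X.shape[0]
    beta_db_j = beta_hat[j] + w_j @ X.T @ (y - X @ beta_hat) / n
    # Plug-in variance Omega_jj * sigma^2 / n, with Omega_jj estimated by w_jj.
    var_j = w_j[j] * sigma2_hat / n
    return beta_db_j, var_j

rng = np.random.default_rng(3)
n, p = 500, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[0] = 1.0
y = X @ beta + rng.normal(size=n)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # stand-in for the DP-Lasso fit
w_0 = np.linalg.inv(X.T @ X / n)[:, 0]           # stand-in for the private w_j
print(debias(X, y, beta_hat, w_0, j=0, sigma2_hat=1.0))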

8.2.5 Proof of Theorem 4.1

Proof of Theorem 4.1.

We first show the privacy part. By Lemma 3.1 and Lemma 4.2, the first two steps of Algorithm 5 are each (\epsilon/4,\delta/4)-differentially private. By the composition theorem, it remains to show that Step 3 and Step 4 are each (\epsilon/4,\delta/4)-differentially private.

The sensitivity of \frac{\sum_{i=1}^{n}\Pi_{R}(\hat{\bm{w}}_{j}^{\top}\bm{x}_{i})\Pi_{R}(y_{i})}{n}-\frac{\sum_{i=1}^{n}\Pi_{R}(\hat{\bm{w}}_{j}^{\top}\bm{x}_{i})\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}})}{n} is

\begin{split}
&\sup_{(\bm{x}_{i},y_{i}),(\bm{x}_{i}^{\prime},y_{i}^{\prime})}\frac{1}{n}|\Pi_{R}(\hat{\bm{w}}_{j}^{\top}\bm{x}_{i})\Pi_{R}(y_{i})-\Pi_{R}(\hat{\bm{w}}_{j}^{\top}\bm{x}_{i})\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}})\\
&\quad-\Pi_{R}(\hat{\bm{w}}_{j}^{\top}\bm{x}_{i}^{\prime})\Pi_{R}(y_{i}^{\prime})+\Pi_{R}(\hat{\bm{w}}_{j}^{\top}\bm{x}_{i}^{\prime})\Pi_{R}(\bm{x}_{i}^{\prime\top}\hat{\bm{\beta}})|\\
&\leq\frac{2}{n}(R^{2}+R^{2}).
\end{split}

The sensitivity of \frac{1}{n}\sum_{i=1}^{n}(\Pi_{R}(y_{i})-\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}}))^{2} is

\sup_{(\bm{x}_{i},y_{i}),(\bm{x}_{i}^{\prime},y_{i}^{\prime})}\frac{1}{n}|(\Pi_{R}(y_{i})-\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}}))^{2}-(\Pi_{R}(y^{\prime}_{i})-\Pi_{R}(\bm{x}_{i}^{\prime\top}\hat{\bm{\beta}}))^{2}|\leq\frac{2}{n}(R+R)^{2}.

By the Gaussian mechanism, Step 3 and Step 4 are each (\epsilon/4,\delta/4)-differentially private. Finally, by the composition theorem, Algorithm 5 is (\epsilon,\delta)-differentially private.

We now show the validity of the proposed confidence interval. Notice that \sqrt{n}\sup_{j}|z_{j}|\leq C\sqrt{\log n}\frac{\log(1/\delta)}{\sqrt{n}\epsilon} for some constant C with probability approaching 1. Thus, |z_{j}|=o_{p}(n^{-1/2}). By Lemma 4.4, we have

\sqrt{n}(\hat{\beta}^{(db)}_{j}-\beta_{j})\stackrel{d}{\to}N(0,\Omega_{jj}\sigma^{2}).

By Lemma 4.3, the convergence of \hat{\bm{w}}_{j} implies \hat{w}_{jj}\stackrel{p}{\to}w_{jj}. Furthermore, the estimator of \sigma^{2} satisfies

\begin{split}
\hat{\sigma}^{2}-{\sigma}^{2}=&\frac{1}{n}\sum_{i=1}^{n}(\Pi_{R}(y_{i})-\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}}))^{2}+Z-\sigma^{2}\\
=&\frac{1}{n}\sum_{i=1}^{n}(\Pi_{R}(y_{i})-\bm{x}_{i}^{\top}{\bm{\beta}})^{2}-\sigma^{2}+\frac{1}{n}\sum_{i=1}^{n}(\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}})-\bm{x}_{i}^{\top}{\bm{\beta}})^{2}\\
&-\frac{2}{n}\sum_{i=1}^{n}(\Pi_{R}(y_{i})-\bm{x}_{i}^{\top}{\bm{\beta}})(\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}})-\bm{x}_{i}^{\top}{\bm{\beta}})+Z.
\end{split}

The term \frac{1}{n}\sum_{i=1}^{n}(\Pi_{R}(y_{i})-\bm{x}_{i}^{\top}{\bm{\beta}})^{2}-\sigma^{2} is o_{p}(1) by the law of large numbers. The term \frac{1}{n}\sum_{i=1}^{n}(\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}})-\bm{x}_{i}^{\top}{\bm{\beta}})^{2} is o_{p}(1) by Theorem 3.1. The last term satisfies

\begin{split}
\Big|\frac{2}{n}\sum_{i=1}^{n}(\Pi_{R}(y_{i})-\bm{x}_{i}^{\top}{\bm{\beta}})(\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}})-\bm{x}_{i}^{\top}{\bm{\beta}})\Big|&=\Big|\frac{2}{n}\sum_{i=1}^{n}\epsilon_{i}(\Pi_{R}(\bm{x}_{i}^{\top}\hat{\bm{\beta}})-\bm{x}_{i}^{\top}{\bm{\beta}})\Big|\\
&\leq\sqrt{\frac{\log p}{n}}c_{x}\|{\bm{\beta}}-\hat{\bm{\beta}}\|_{1}=o_{p}(1).
\end{split}

Then, we have \hat{\sigma}^{2}\stackrel{p}{\to}\sigma^{2}. The final result follows by Slutsky’s theorem. ∎

8.3 Proof of FDR: Theorem 5.1

8.3.1 Proof of Lemma 5.1

Proof of Lemma 5.1.

By Lemma 3.1, outputting \tilde{\bm{\beta}}_{(1)} is (\epsilon/2,\delta/2)-differentially private. It remains to show that the DP-OLS step is (\epsilon/2,\delta/2)-differentially private; the final result then follows by the composition theorem. We first calculate the l_{2} sensitivities of \hat{\bm{\Sigma}}_{(2)XX,\mathcal{A}} and \hat{\bm{\Sigma}}_{(2)XY,\mathcal{A}}. The l_{2} sensitivity of \hat{\bm{\Sigma}}_{(2)XX,\mathcal{A}} satisfies

\sup_{(\bm{x}_{i},\bm{x}_{i}^{\prime})}\|\bm{x}_{i\mathcal{A}}\bm{x}_{i\mathcal{A}}^{\top}-\bm{x}_{i\mathcal{A}}^{\prime}\bm{x}_{i\mathcal{A}}^{\prime\top}\|_{2}/n_{2}\leq 2|\mathcal{A}|c_{x}^{2}/n_{2}\leq 2Kc_{x}^{2}/n_{2}.

The l_{2} sensitivity of \hat{\bm{\Sigma}}_{(2)XY,\mathcal{A}} satisfies

\sup_{(\bm{x}_{i},y_{i}),(\bm{x}_{i}^{\prime},y_{i}^{\prime})}\|\bm{x}_{i\mathcal{A}}\Pi_{R}(y_{i})-\bm{x}_{i\mathcal{A}}^{\prime}\Pi_{R}(y_{i}^{\prime})\|_{2}/n_{2}\leq 2|\mathcal{A}|c_{x}\sqrt{R}/n_{2}\leq 2c_{x}\sqrt{KR}/n_{2}.

By the data splitting n_{2}=n/2 and the Gaussian mechanism, the DP-OLS step is (\epsilon/2,\delta/2)-differentially private. ∎
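A minimal sketch of the DP-OLS release on the screened support, under the sensitivity bounds just derived; the clipping constants, the per-entry noise calibration, and the ridge term are illustrative assumptions rather than the exact constants of the paper.

import numpy as np

def dp_ols(X_A, y, eps, delta, c_x, R, rng):
    # Perturb the sufficient statistics on the selected support A, then solve.
    # Noise scales follow the Gaussian mechanism applied with the l2
    # sensitivities 2|A|c_x^2/n and 2|A|c_x*sqrt(R)/n bounded above.
    n, k = X_A.shape
    factor = np.sqrt(2 * np.log(1.25 / (delta / 2))) / (eps / 2)
    Sxx = X_A.T @ X_A / n + rng.normal(scale=2 * k * c_x**2 / n * factor,
                                       size=(k, k))
    Sxx = (Sxx + Sxx.T) / 2                  # symmetrize the noisy Gram matrix
    Sxy = X_A.T @ np.clip(y, -R, R) / n + rng.normal(
        scale=2 * k * c_x * np.sqrt(R) / n * factor, size=k)
    # Small ridge term, in the spirit of the regularization sigma_r^2 used
    # in the proof of Theorem 5.1 below.
    return np.linalg.solve(Sxx + np.sqrt(k) / n * np.eye(k), Sxy)

rng = np.random.default_rng(4)
n, k = 2000, 5
X_A = rng.uniform(-1, 1, size=(n, k))
y = X_A @ np.ones(k) + rng.normal(size=n)
print(np.round(dp_ols(X_A, y, eps=1.0, delta=1e-5, c_x=1.0, R=10.0, rng=rng), 2))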

8.3.2 Proof of Theorem 5.1

Proof of Theorem 5.1.

We only need to consider the subset \mathcal{A}; we omit the subscript \mathcal{A} to simplify the notation. Notice that the DP-OLS estimator \tilde{\bm{\beta}}_{(2)} satisfies

\begin{split}
\tilde{\bm{\beta}}_{(2)}-{\bm{\beta}}&=\hat{\bm{\Sigma}}^{-1}_{(2)XX}\Big(\frac{1}{n/2}\sum_{i\in\mathcal{S}_{2}}{\bm{X}}_{i}e_{j}+\bm{N}_{XY}\Big)+(\hat{\bm{\Sigma}}^{-1}_{(2)XX}-\tilde{\bm{\Sigma}}^{-1}_{(2)XX})\tilde{\bm{\Sigma}}_{(2)XX}{\bm{\beta}}\\
&=:\tilde{\bm{\beta}}_{(2)}^{(0)}+\tilde{\bm{\beta}}_{(2)}^{(1)}.
\end{split}

Conditional on \{{\bm{X}}_{i}\}_{i=1}^{n}, the distribution of \tilde{\bm{\beta}}_{(2)}^{(0)} is symmetric around 0, and the bias term \tilde{\bm{\beta}}_{(2)}^{(1)} is small. For j\in\mathcal{S}_{0}, \beta_{j}=0, and thus the sampling distribution of \tilde{\beta}_{(2)j}-\tilde{\beta}_{(2)j}^{(1)} is symmetric around 0. Without loss of generality, we assume \hat{s}=2^{\hat{k}}\to\infty; otherwise the FDR control problem is trivial.

Define \bm{R} and its normalized version \bm{R}^{0} as

\bm{R}=\Big(\frac{1}{n/2}{\bm{X}}^{(2)T}_{\mathcal{A}}{\bm{X}}^{(2)}_{\mathcal{A}}\Big)^{-1}\Big(\frac{1}{n/2}{\bm{X}}^{(2)}_{\mathcal{A}}{\bm{X}}^{(2)T}_{\mathcal{A}}+\sigma^{2}_{r}\bm{I}_{p}\Big)\Big(\frac{1}{n/2}{\bm{X}}^{(2)}_{\mathcal{A}}{\bm{X}}^{(2)T}_{\mathcal{A}}\Big)^{-1}=:ABA,\qquad R_{ij}^{0}=\frac{R_{ij}}{\sqrt{R_{ii}R_{jj}}},

where \sigma^{2}_{r}=C\sqrt{\hat{s}}/n, and R_{ij}^{0} represents the conditional correlation of the OLS regression coefficients \tilde{\bm{\beta}}_{(2)}. By the proof of Theorem 3.1, \hat{s}=2^{\hat{k}}=o(n). Thus,

\lambda_{\min}(R)\geq\lambda_{\min}(A)\lambda_{\min}(B)\lambda_{\min}(A)\geq\{1/c-o_{p}(1)\}^{3}\geq 1/c^{3}-o_{p}(1),\qquad\lambda_{\max}(R)\leq\lambda_{\max}(A)\lambda_{\max}(B)\lambda_{\max}(A)\leq\{c+o_{p}(1)\}^{3}\leq c^{3}+o_{p}(1).

Furthermore,

\|R^{0}\|_{1}\leq\lambda_{\min}^{-1}\|R\|_{1}\leq\lambda_{\min}^{-1}\hat{s}\|R\|_{2}\leq\lambda_{\max}\lambda_{\min}^{-1}\hat{s}^{3/2}=O_{p}(\hat{s}^{3/2}),

where \|R\|_{p}:=(\sum_{i,j=1}^{m}|R_{ij}|^{p})^{1/p} is the element-wise matrix norm. In addition, \|R^{0}_{\bar{\mathcal{S}}\cap\mathcal{A}}\|_{1}\leq O_{p}(p_{0}^{3/2}), where p_{0}=|\bar{\mathcal{S}}\cap\mathcal{A}|.

For any threshold t\in\mathbb{R}, define

\hat{G}_{p}^{0}(t)=\frac{1}{p_{0}}\sum_{j\in\bar{\mathcal{S}}\cap\mathcal{A}}\mathbbm{1}(M_{j}>t),\quad G_{p}^{0}(t)=\frac{1}{p_{0}}\sum_{j\in\bar{\mathcal{S}}\cap\mathcal{A}}\mathbb{P}(M_{j}>t);
\hat{G}_{p}^{1}(t)=\frac{1}{p_{1}}\sum_{j\in\mathcal{S}\cap\mathcal{A}}\mathbbm{1}(M_{j}>t),\quad\hat{V}_{p}^{0}(t)=\frac{1}{p_{0}}\sum_{j\in\bar{\mathcal{S}}\cap\mathcal{A}}\mathbbm{1}(M_{j}<-t),

where p_{0}=|\bar{\mathcal{S}}\cap\mathcal{A}| and p_{1}=\hat{s}-p_{0}. Let r_{p}=p_{1}/p_{0} and

\text{FDP}(t)=\frac{\hat{G}_{p}^{0}(t)}{\hat{G}_{p}^{0}(t)+r_{p}\hat{G}_{p}^{1}(t)},\quad\text{FDP}^{s}(t)=\frac{\hat{V}_{p}^{0}(t)}{\hat{G}_{p}^{0}(t)+r_{p}\hat{G}_{p}^{1}(t)}\quad\text{and}\quad\text{FDP}^{e}(t)=\frac{G_{p}^{0}(t)}{G_{p}^{0}(t)+r_{p}\hat{G}_{p}^{1}(t)}.

It is easy to see that G_{p}^{0}(t)=\mathbb{E}\{\hat{G}_{p}^{0}(t)\}=\mathbb{E}\{\hat{V}_{p}^{0}(t)\}, where the last equality is due to the symmetry of M_{j} for j\in\bar{\mathcal{S}}. Furthermore, we have

\text{Var}\{\hat{G}_{p}^{0}(t)\}=\frac{1}{p_{0}^{2}}\sum_{j\in\bar{\mathcal{S}}\cap\mathcal{A}}\text{Var}\{\mathbbm{1}(M_{j}>t)\}+\frac{1}{p_{0}^{2}}\sum_{i,j\in\bar{\mathcal{S}}\cap\mathcal{A};i\neq j}\text{Cov}\{\mathbbm{1}(M_{i}>t),\mathbbm{1}(M_{j}>t)\}.

The first term is bounded by C/p_{0}\to 0 as n\to\infty. Without loss of generality, we assume \hat{\beta}_{1,i}>0 and \hat{\beta}_{1,j}>0. By the definition of f(u,v), there exist functions of \hat{\beta}_{1,i} and \hat{\beta}_{1,j}, denoted by \mathcal{I}_{t}(\hat{\beta}_{1,i}) and \mathcal{I}_{t}(\hat{\beta}_{1,j}), such that

\mathbb{P}(M_{i}>t,M_{j}>t)=\mathbb{P}\{\hat{\beta}_{2,i}>\mathcal{I}_{t}(\hat{\beta}_{1,i}),\hat{\beta}_{2,j}>\mathcal{I}_{t}(\hat{\beta}_{1,j})\},

where $\mathcal{I}_{t}(u)=\inf\{v\geq 0:f(u,v)>t\}$. Note that $\|\tilde{{\bm{\beta}}}_{(2)}^{(1)}\|_{\infty}\leq c\hat{s}^{3}\log(1/\delta)/(n^{2}\epsilon^{2})\|{\bm{\beta}}\|_{2}=o_{p}(1)$ by the conditions of Theorem 5.1. Then, by the Lipschitz continuity of $f(u,v)$,

\begin{split}\mathbb{P}\{\tilde{\beta}_{(2)i}>\mathcal{I}_{t}(\tilde{\beta}_{(1)i}),\tilde{\beta}_{(2)j}>\mathcal{I}_{t}(\tilde{\beta}_{(1)j})\}&=\mathbb{P}\{\tilde{\beta}_{(2)i}-\tilde{\beta}_{(2)i}^{(1)}>\mathcal{I}_{t}(\tilde{\beta}_{(1)i}),\tilde{\beta}_{(2)j}-\tilde{\beta}_{(2)j}^{(1)}>\mathcal{I}_{t}(\tilde{\beta}_{(1)j})\}\\ &\quad+o_{p}(1).\end{split}

Note that the joint distribution of $(\tilde{\beta}_{(2)i}^{(0)},\tilde{\beta}_{(2)j}^{(0)})$ is bivariate normal. By Theorem 1 of Azriel and Schwartzman (2015), for any $t_{1},t_{2}\in\mathbb{R}$,

\mathbb{P}(\tilde{\beta}_{(2)i}-\tilde{\beta}_{(2)i}^{(0)}>t_{1},\tilde{\beta}_{(2)j}-\tilde{\beta}_{(2)j}^{(0)}>t_{2})-\mathbb{P}(\tilde{\beta}_{(2)i}-\tilde{\beta}_{(2)i}^{(0)}>t_{1})\mathbb{P}(\tilde{\beta}_{(2)j}-\tilde{\beta}_{(2)j}^{(0)}>t_{2})\leq O(|R_{ij}^{0}|).
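Since the covariance of two indicators is exactly the gap between the joint and the product probabilities, combining this display with the approximation in the previous step gives, for each null pair $i\neq j$,

\text{Cov}\{\mathbbm{1}(M_{i}>t),\mathbbm{1}(M_{j}>t)\}=\mathbb{P}(M_{i}>t,M_{j}>t)-\mathbb{P}(M_{i}>t)\mathbb{P}(M_{j}>t)\leq O(|R_{ij}^{0}|)+o_{p}(1).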

Thus,

\begin{split}\frac{1}{p_{0}^{2}}\sum_{i,j\in\bar{\mathcal{S}}\cap\mathcal{A};i\neq j}\text{Cov}\{\mathbbm{1}(M_{i}>t),\mathbbm{1}(M_{j}>t)\}&\leq O_{p}(p_{0}^{-2}\|R^{0}_{\bar{\mathcal{S}}\cap\mathcal{A}}\|_{1})+o_{p}(1)\\ &\leq O_{p}(p_{0}^{-2}p_{0}^{3/2})+o_{p}(1)\stackrel{n\to\infty}{\longrightarrow}0.\end{split}

By Markov's inequality and the symmetry of $M_{i}$ under the null,

|\hat{G}_{p}^{0}(t)-G_{p}^{0}(t)|\to 0\ \text{ and }\ |\hat{V}_{p}^{0}(t)-G_{p}^{0}(t)|\to 0\ \text{ in probability}.
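Indeed, for any fixed $\eta>0$, the second-moment form of Markov's inequality together with the variance bound established above gives

\mathbb{P}\{|\hat{G}_{p}^{0}(t)-G_{p}^{0}(t)|>\eta\}\leq\eta^{-2}\text{Var}\{\hat{G}_{p}^{0}(t)\}\to 0,

and the same argument applies to $\hat{V}_{p}^{0}(t)$, whose mean is also $G_{p}^{0}(t)$ by the null symmetry of $M_{j}$.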

Thus, we have

\sup_{0\leq t\leq 1}|\text{FDP}(t)-\text{FDP}^{s}(t)|=o_{p}(1).

For any $\epsilon\in(0,q)$ and $t_{q-\epsilon}$ satisfying $\mathbb{P}\{\text{FDP}^{s}(t_{q-\epsilon})\leq q-\epsilon\}\to 1$, we have

\begin{split}\mathbb{P}(\tau_{q}\leq t_{q-\epsilon})&\geq\mathbb{P}\{\text{FDP}^{s}(t_{q-\epsilon})\leq q\}\\ &\geq\mathbb{P}\{\text{FDP}(t_{q-\epsilon})\leq q-\epsilon,\,|\text{FDP}^{s}(t_{q-\epsilon})-\text{FDP}(t_{q-\epsilon})|\leq\epsilon\}\geq 1-\epsilon+o(1).\end{split}

Thus, we have

\begin{split}\limsup_{n\to\infty}\mathbb{E}\{\text{FDP}(\tau_{q})\}&\leq\limsup_{n\to\infty}\mathbb{E}\{\text{FDP}(\tau_{q})\mid\tau_{q}\leq t_{q-\epsilon}\}\mathbb{P}(\tau_{q}\leq t_{q-\epsilon})+\mathbb{P}(\tau_{q}>t_{q-\epsilon})\\ &\leq\limsup_{n\to\infty}\mathbb{E}\{\text{FDP}^{s}(\tau_{q})\mid\tau_{q}\leq t_{q-\epsilon}\}\mathbb{P}(\tau_{q}\leq t_{q-\epsilon})\\ &\quad+\limsup_{n\to\infty}\mathbb{E}\{|\text{FDP}(\tau_{q})-\text{FDP}^{e}(\tau_{q})|\mid\tau_{q}\leq t_{q-\epsilon}\}\mathbb{P}(\tau_{q}\leq t_{q-\epsilon})\\ &\quad+\limsup_{n\to\infty}\mathbb{E}\{|\text{FDP}^{s}(\tau_{q})-\text{FDP}^{e}(\tau_{q})|\mid\tau_{q}\leq t_{q-\epsilon}\}\mathbb{P}(\tau_{q}\leq t_{q-\epsilon})+\epsilon\\ &\leq\limsup_{n\to\infty}\mathbb{E}\{\text{FDP}^{s}(\tau_{q})\}+\limsup_{n\to\infty}\mathbb{E}\{|\text{FDP}(\tau_{q})-\text{FDP}^{e}(\tau_{q})|\}\\ &\quad+\limsup_{n\to\infty}\mathbb{E}\{|\text{FDP}^{s}(\tau_{q})-\text{FDP}^{e}(\tau_{q})|\}+\epsilon\leq q+\epsilon+o(1).\end{split}

So far, we have shown $\limsup_{n\to\infty}\text{FDR}(\tau_{q})\leq q$. Next, we consider the power of the procedure. Given the signal strength condition, the sure screening property holds with probability approaching one; that is, $\mathbb{P}(\mathcal{S}\subseteq\mathcal{A})\to 1$ as $n\to\infty$. Recall that the OLS estimator satisfies

\hat{{\bm{\beta}}}_{2}-{\bm{\beta}}_{0}=\hat{\Sigma}^{-1}_{XX,\mathcal{A}}\Big{(}\frac{1}{n/2}\sum_{i\in\mathcal{S}_{2}}{\bm{X}}_{i}e_{i}+N_{XY}\Big{)}+(\hat{\Sigma}^{-1}_{XX,\mathcal{A}}-\tilde{\Sigma}^{-1}_{XX,\mathcal{A}})\tilde{\Sigma}_{XX,\mathcal{A}}{\bm{\beta}}_{0}.

Since the eigenvalues of $\Sigma$ are bounded and Condition 3.2 holds, the eigenvalues of $\hat{\Sigma}^{-1}_{XX,\mathcal{A}}$ are bounded with probability approaching one. By the large deviation theorem,

\|\hat{{\bm{\beta}}}_{2}-{\bm{\beta}}\|_{\infty}\leq\|\hat{{\bm{\beta}}}_{2}^{(0)}\|_{\infty}+\|\hat{{\bm{\beta}}}_{2}^{(1)}\|_{\infty}\leq O_{p}\{\sqrt{\log(\hat{s})/n}+\hat{s}^{3/2}\sqrt{\log(1/\delta)}/(n\epsilon)\}.

Under the signal strength condition, with probability approaching one, we have

\min_{i\in\mathcal{S}}|\hat{\beta}_{1,i}|\geq\max_{i\in\bar{\mathcal{S}}}|\hat{\beta}_{1,i}|\text{ and }\min_{i\in\mathcal{S}\cap\mathcal{A}}|\hat{\beta}_{2,i}|\geq\max_{i\in\bar{\mathcal{S}}\cap\mathcal{A}}|\hat{\beta}_{2,i}|,

which further implies that $\min_{i\in\mathcal{S}\cap\mathcal{A}}|M_{i}|\geq\max_{i\in\bar{\mathcal{S}}\cap\mathcal{A}}|M_{i}|$ by the definition of $f(u,v)$. Consequently, $\hat{G}^{1}_{p}(\tau_{q})\to 1$ with probability approaching one. Thus, the power converges to one asymptotically. ∎