Private Estimation and Inference in High-Dimensional Regression with FDR Control
Abstract
This paper presents novel methodology for practical differentially private (DP) estimation and inference in high-dimensional linear regression. We first propose a differentially private Bayesian Information Criterion (BIC) for selecting the unknown sparsity parameter in DP-Lasso, eliminating the need for prior knowledge of model sparsity, a requisite in the existing literature. We then propose a differentially private debiased Lasso algorithm that, by leveraging the inherent sparsity of high-dimensional linear models, enables accurate and privacy-preserving inference on individual regression parameters. Additionally, we address multiple testing in high-dimensional linear regression by introducing a differentially private procedure that controls the false discovery rate (FDR), allowing accurate and privacy-preserving identification of significant predictors. Through extensive simulations and real data analysis, we demonstrate the efficacy of our proposed methods in conducting inference for high-dimensional linear models while safeguarding privacy and controlling the FDR.
Keywords: differential privacy, high dimension, linear regression, debiased Lasso, false discovery rate control.
1 Introduction
In the era of big data, the significance of data privacy has grown considerably. With the continuous collection, storage, processing, and sharing of vast amounts of personal data (Lane et al., 2014), there is a pressing need to protect sensitive information. Traditional data analytics and statistical inference tools, unfortunately, may fall short in ensuring such protection. However, the concept of differential privacy, initially proposed by theoretical computer scientists (Dwork et al., 2006), has made substantial progress and found widespread use in various large-scale applications. In essence, differentially private algorithms incorporate random noise that is independent of the original database and produce privatized summary statistics or model parameters. The ultimate goal of differentially private analysis is to safeguard individual data while still allowing meaningful statistical analysis of the original database.
In this paper, we develop a novel framework that enables differentially private statistical inference in high-dimensional linear regression. Let $y \in \mathbb{R}$ be the response and $x \in \mathbb{R}^p$ be the covariates. Assume the linear model
$$y = x^\top \beta + \epsilon,$$
where $\epsilon$ denotes the noise term and follows a sub-Gaussian distribution. Assume that $(x_1, y_1), \ldots, (x_n, y_n)$ are independent samples. We are interested in the high-dimensional setting where the dimension $p$ may grow exponentially with the sample size $n$, and only a small subset of the coordinates of $\beta$ have non-zero values. We aim to develop practical differentially private estimation and inference methods for the parameter $\beta$, and a false discovery rate control procedure when multiple testing is concerned.
Numerous approaches have been developed to address the challenge of statistical inference in high-dimensional linear regression, under the typical non-private setting. Wasserman and Roeder (2009) applied sample-splitting for high-dimensional inference, where the estimation and inference are conducted on different samples. Later on, the debiased lasso (Zhang and Zhang, 2014; Javanmard and Montanari, 2014; Van de Geer et al., 2014) emerged as a technique capable of mitigating the potential bias arising from the Lasso estimator, thereby providing asymptotically optimal confidence intervals for regression coefficients. Additionally, Ning and Liu (2017) proposed a method for constructing confidence intervals for high-dimensional penalized M-estimators based on decorrelated score statistics, while Shi et al. (2021) developed a recursive online-score estimation approach. More recently, Wang et al. (2022) introduced the repro framework, which enables the construction of valid confidence intervals for finite samples with high-dimensional covariates.
Besides conducting inference on one parameter, another desired feature in high dimensional linear regression is to control the false discovery rate (FDR) of the selected variables. This objective has led to the development of various FDR control methods in the literature. One such method, proposed by Liu (2013), focuses on the Gaussian graphical model and utilizes the asymptotic normality of the test statistic. Another influential approach is the knockoff framework introduced by Barber and Candès (2015), which exploits the symmetry of the statistics under the null hypothesis. Since its inception, the knockoff method has undergone further development and generalization, finding applications in diverse settings (Candes et al., 2018; Barber and Candès, 2019; Romano et al., 2020). Recently, Dai et al. (2022) proposed a method that combines the idea of symmetric mirror statistics and data splitting to control the FDR asymptotically. The power of this method can be further enhanced by implementing a multiple-split approach. Dai et al. (2023) extended this idea to generalized linear models, expanding its applicability.
Addressing privacy concerns in various statistical inference scenarios has gained significant attention in recent literature. Avella-Medina et al. (2023a) proposed using first-order and second-order optimization algorithms to construct private M-estimators, and they analyzed the asymptotic normality of these private estimators along with their privacy error rates. The extension of these methodologies to the high dimensional setting is further discussed in Avella-Medina et al. (2023b), albeit not published yet. Moreover, the multiple testing problem has also been a subject of recent investigation. Dwork et al. (2021) proposed the DP-BH method, while Xia and Cai (2023) introduced the DP-Adapt method, which provides control over the false discovery rate (FDR) while guaranteeing finite sample performance.
Our paper contributes to the differentially private analysis of high-dimensional linear regression in several key aspects.
1. We propose a differentially private Bayesian Information Criterion (BIC) to accurately select the unknown sparsity parameter in the DP-Lasso of Cai et al. (2021), eliminating the need for prior knowledge of the model sparsity. This advancement enhances the reliability of the DP-Lasso framework and can be used in many downstream tasks.

2. We develop a differentially private debiased Lasso procedure that produces asymptotically normal estimators under privacy constraints, providing differentially private confidence intervals for a single parameter of interest.

3. We design a differentially private procedure for controlling the FDR in multiple testing scenarios. Our approach allows for FDR control at any user-specified rate and exhibits asymptotic power approaching 1 under proper conditions.
The structure of this paper is as follows. In Section 2, we present a formulation of the problem of interest and the necessary background on differential privacy. In Section 3, we propose a new DP-Lasso estimation procedure that does not require knowledge of the sparsity. In Section 4, we introduce the differentially private debiased Lasso algorithm and provide its asymptotic distribution and privacy guarantee. Section 5 presents a novel differentially private false discovery rate (FDR) control method that leverages sample splitting and mirror statistics; we also elucidate the reasons why existing FDR techniques are inapplicable in the context of differential privacy. In Section 6, we present extensive simulations and real data analysis to validate the performance of our proposed methods. We conclude with a discussion of our findings and an outline of potential future research directions. All proofs are provided in the appendix.
Notation: Throughout the paper, we use the following notation. For a $p$-dimensional vector $v = (v_1, \ldots, v_p)^\top$, we define the $\ell_q$-norm of $v$ as $\|v\|_q = (\sum_{j=1}^{p} |v_j|^q)^{1/q}$ for $q \geq 1$, with $|\cdot|$ representing the absolute value. The support $\operatorname{supp}(v)$ is the index set of nonzero elements in $v$. We define the $\ell_0$-norm of $v$ by $\|v\|_0 = |\operatorname{supp}(v)|$, which is the number of nonzero coordinates of $v$. For a positive integer $k$, we use $[k]$ to denote the set $\{1, \ldots, k\}$. For a subset $I \subseteq [p]$ and a vector $v$, we use $v_I$ to denote the restriction of $v$ to the index set $I$. For a vector $v$ and radius $R > 0$, we use $\Pi_R(v)$ to denote the projection of $v$ onto the $\ell_2$-ball $\{u : \|u\|_2 \leq R\}$. Let $X_n$ and $X$ be random variables. The expression $X_n \xrightarrow{d} X$ denotes the convergence of $X_n$ to $X$ in distribution. Similarly, the notation $X_n = O_P(1)$ indicates that the random variable $X_n$ is stochastically bounded.
2 Preliminaries
Consider the dataset $\{(x_i, y_i)\}_{i=1}^{n}$ drawn from distributions satisfying
$$y_i = x_i^\top \beta^* + \epsilon_i, \qquad i = 1, \ldots, n,$$
where $\beta^* \in \mathbb{R}^p$ is the true regression parameter and the $\epsilon_i$ are independent noise terms. In this paper, we are interested in the high-dimensional setting where the dimension $p$ may grow exponentially with the sample size $n$, under the $(\varepsilon, \delta)$-differential privacy framework. We begin by introducing the background of differential privacy, with its formal definition as follows.
Definition 1 (Differential Privacy (Dwork et al., 2006)).
A randomized algorithm $M$ is $(\varepsilon, \delta)$-differentially private for $\varepsilon, \delta \geq 0$ if for every pair of neighboring data sets $D, D'$ that differ by one individual datum and every measurable set $A$ with respect to the range of $M$,
$$\mathbb{P}\big(M(D) \in A\big) \leq e^{\varepsilon}\, \mathbb{P}\big(M(D') \in A\big) + \delta, \tag{2.1}$$
where the probability measure is induced by the randomness of $M$ only.
In the definition, two data sets $D$ and $D'$ are treated as fixed, and the probability measures the randomness in the mechanism $M$. The level of privacy against an adversary is controlled by the pre-specified privacy parameters $(\varepsilon, \delta)$. When $\varepsilon$ and $\delta$ are small, the algorithm offers strong privacy guarantees.
One of the most useful concepts in differential privacy is the sensitivity, which characterizes the magnitude of change in the algorithm's output when a single individual in the dataset is replaced by another element.
Definition 2 (Sensitivity).
For a vector-valued deterministic algorithm $f$, the $\ell_q$-sensitivity of $f$ is defined as
$$\Delta_q(f) = \sup_{D, D'} \|f(D) - f(D')\|_q, \tag{2.2}$$
where $D$ and $D'$ only differ in one single entry.
Next, we introduce two well-known mechanisms: the Laplace mechanism, which achieves $\varepsilon$-differential privacy for algorithms with bounded $\ell_1$-sensitivity, and the Gaussian mechanism, which achieves $(\varepsilon, \delta)$-differential privacy for algorithms with bounded $\ell_2$-sensitivity.
Lemma 1 (Dwork and Roth (2014)).
1. (Laplace mechanism): For a deterministic algorithm $f$ with $\ell_1$-sensitivity $\Delta_1$, the randomized algorithm $f(D) + w$ achieves $\varepsilon$-differential privacy, where the coordinates of $w$ follow i.i.d. Laplace distributions with scale parameter $\Delta_1/\varepsilon$.

2. (Gaussian mechanism): For a deterministic algorithm $f$ with $\ell_2$-sensitivity $\Delta_2$, the randomized algorithm $f(D) + w$ achieves $(\varepsilon, \delta)$-differential privacy, where the coordinates of $w$ follow i.i.d. Gaussian distributions with mean $0$ and standard deviation $\sqrt{2\log(1.25/\delta)}\,\Delta_2/\varepsilon$.
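As a concrete illustration, both mechanisms amount to adding noise calibrated to the sensitivity and the privacy budget. The following is a minimal Python sketch; the function names and the `rng` argument are our own, and the Gaussian calibration is the classical one stated in the lemma (valid for moderate $\varepsilon$):

```python
import numpy as np

def laplace_mechanism(value, l1_sensitivity, epsilon, rng=None):
    """Release `value` with epsilon-DP by adding i.i.d. Laplace(l1_sensitivity/epsilon) noise."""
    rng = np.random.default_rng() if rng is None else rng
    scale = l1_sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale, size=np.shape(value))

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng=None):
    """Release `value` with (epsilon, delta)-DP via the classical Gaussian calibration."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / epsilon
    return value + rng.normal(loc=0.0, scale=sigma, size=np.shape(value))
```

As the budget $\varepsilon$ grows, the injected noise shrinks toward zero, recovering the non-private statistic; as $\varepsilon \to 0$, the noise scale diverges.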
Differentially private algorithms have other nice properties that will be useful for our analysis, as summarized in the following lemma.
Lemma 2.
Differentially private algorithms have the following properties (Dwork et al., 2006, 2010):
1. Post-processing: Let $M$ be an $(\varepsilon, \delta)$-differentially private algorithm and $g$ be a deterministic function that maps the range of $M$ to a real Euclidean space; then $g \circ M$ is also $(\varepsilon, \delta)$-differentially private.

2. Composition: Let $M_1$ be $(\varepsilon_1, \delta_1)$-differentially private and $M_2$ be $(\varepsilon_2, \delta_2)$-differentially private; then $(M_1, M_2)$ is $(\varepsilon_1 + \varepsilon_2, \delta_1 + \delta_2)$-differentially private.

3. Advanced composition: Let $M$ be $(\varepsilon, \delta)$-differentially private and $\delta' > 0$; then the $k$-fold adaptive composition of $M$ is $(\varepsilon', k\delta + \delta')$-differentially private for $\varepsilon' = \sqrt{2k\log(1/\delta')}\,\varepsilon + k\varepsilon(e^{\varepsilon} - 1)$.
In high-dimensional problems, parameters of interest are often assumed to be sparse. Reporting the entire estimation results can result in a substantial variance. Fortunately, exploiting the sparsity assumption allows us to selectively disclose only the significant nonzero coordinates. The "peeling" algorithm (Dwork et al., 2021) is a differentially private algorithm that addresses this problem by identifying and returning the top-$s$ coordinates with the largest absolute values. Since its proposal, this algorithm has been widely used to protect privacy for high-dimensional data; see, for example, applications in Cai et al. (2021) and Xia and Cai (2023).
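The core idea of peeling can be conveyed in a few lines. The sketch below is a simplification of our own: fresh Laplace noise is drawn in every selection round, with a generic `noise_scale` argument standing in for the exact sensitivity/budget calibration of Dwork et al. (2021):

```python
import numpy as np

def peeling_top_s(v, s, noise_scale, rng=None):
    """Iteratively 'peel off' the coordinate with the largest noisy magnitude.

    Fresh noise in each round keeps earlier selections from leaking extra
    information about later ones; the released values receive their own noise.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    v = np.asarray(v, dtype=float)
    remaining, selected = list(range(len(v))), []
    for _ in range(s):
        noisy = np.abs(v[remaining]) + rng.laplace(scale=noise_scale, size=len(remaining))
        pick = remaining[int(np.argmax(noisy))]
        selected.append(pick)
        remaining.remove(pick)
    released = np.zeros_like(v)
    released[selected] = v[selected] + rng.laplace(scale=noise_scale, size=s)
    return selected, released
```

Only the $s$ selected coordinates are disclosed, so the noise scale depends on $s$ and $\log$-factors of the dimension rather than on the full dimension.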
3 Differentially Private Estimation
The estimation of regression parameters in the high-dimensional private setting has been studied in Cai et al. (2021) with optimality guarantees. However, the algorithm in Cai et al. (2021) requires prior knowledge of the sparsity parameter, which is typically unknown in practice. In this section, we propose a differentially private Bayesian Information Criterion (BIC) in Algorithm 2 to accurately select the unknown sparsity parameter, eliminating the need for prior knowledge of the model sparsity. The pipeline of the proposed estimation algorithm is as follows.
The random sample splitting in the proposed algorithm is equivalent to employing the stochastic gradient descent algorithm with one pass over the entire dataset. Moreover, our choice of the powers of 2 as candidate values of the sparsity parameter accomplishes two critical goals: firstly, it ensures that the candidate set covers the true model, since for any true sparsity $s^*$ below the largest candidate there is a candidate $s$ with $s^* \leq s < 2s^*$; secondly, it limits the total number of candidate models to a logarithmic order. This choice aligns with the sparsity requirement for statistical inference, as will be discussed in Section 4.
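The candidate grid itself is trivial to generate. A short sketch, where the cap on the largest candidate is an assumption made for illustration (the paper ties it to the sparsity level compatible with inference):

```python
import math

def sparsity_candidates(n):
    """Powers of 2 up to an assumed cap of sqrt(n).

    For any true sparsity s* below the cap, some candidate s satisfies
    s* <= s < 2*s*, and the number of candidates is only O(log n).
    """
    cap = int(math.isqrt(n))
    cands, s = [], 1
    while s <= cap:
        cands.append(s)
        s *= 2
    return cands
```

The DP-BIC then scores each candidate model and privately reports the best one, so the privacy cost of the selection step scales with the (logarithmic) number of candidates rather than with the dimension.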
Condition 3.1.
The covariates $x_i$ are independent sub-Gaussian vectors with mean zero and covariance matrix $\Sigma$ whose eigenvalues are bounded away from zero and infinity. There is a constant $K$ such that $\|x_i\|_\infty \leq K$.
Condition 3.2.
The true parameter $\beta^*$ is $s^*$-sparse and satisfies $\|\beta^*\|_2 \leq C$ for some constant $C$.
Condition 3.1 is commonly assumed in the high-dimensional literature; for example, see Javanmard and Montanari (2014). The bound on the infinity norm of the design matrix can be relaxed to hold with high probability, which is easily satisfied for sub-Gaussian distributions. Different from the algorithm proposed by Cai et al. (2021), we employ a data-splitting technique to establish independence between each iterate and the sub-data utilized in the corresponding iteration. Notably, our Condition 3.1 is less stringent than the design conditions outlined in Cai et al. (2021). This relaxation comes at the cost of decreasing the effective sample size by a factor equal to the number of iterations. As demonstrated in Theorem 3.1, the number of iterations grows only logarithmically, resulting in a correspondingly mild increase in the error bound. Condition 3.2 assumes boundedness and sparsity of the true parameter $\beta^*$. Lemma 3.1 provides the privacy guarantee of Algorithm 2, where we only require Condition 3.1.
Now we establish the error bound of the new estimation algorithm.
Theorem 3.1.
The upper error bound matches the minimax lower bound presented in Cai et al. (2021) up to a logarithmic factor, which is considerably small. In comparison to algorithms with a known sparsity level, our proposed algorithm introduces an additional logarithmic factor in both the statistical error component and the privacy cost component. This arises from our use of data-splitting in the estimation process, along with the output of multiple candidate estimates and the utilization of the BIC for selecting the "optimal" model.
The selection of candidate models presents a delicate balance. Firstly, we desire the candidate set to encompass the true model, necessitating the inclusion of a sufficient number of potential models. Secondly, privacy constraints impose limitations on the total number of outputs. Consequently, the number of candidate models must be kept to a logarithmic order for consistent estimation.
4 Statistical Inference with Privacy
In this section, we construct a confidence interval for a particular regression coefficient $\beta_j$, under $(\varepsilon, \delta)$-differential privacy. Following the debiased Lasso framework (Zhang and Zhang, 2014; Van de Geer et al., 2014; Javanmard and Montanari, 2014), we first estimate the precision matrix $\Omega = \Sigma^{-1}$ with differential privacy. Note that the $j$-th column $\omega_j$ of $\Omega$ satisfies the linear equation $\Sigma \omega_j = e_j$. Thus it is the unique minimizer of the following convex quadratic function:
$$f_j(\omega) = \tfrac{1}{2}\,\omega^\top \Sigma\, \omega - e_j^\top \omega.$$
In addition, we assume that $\omega_j$ is sparse.
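The characterization of the precision-matrix column as a quadratic minimizer follows from a one-line first-order condition. In standard notation, with $\Sigma \succ 0$ the covariance matrix and $e_j$ the $j$-th standard basis vector, consider $f_j(\omega) = \tfrac{1}{2}\omega^\top \Sigma \omega - e_j^\top \omega$; then

```latex
\nabla f_j(\omega) \;=\; \Sigma\,\omega - e_j \;=\; 0
\quad\Longleftrightarrow\quad
\omega \;=\; \Sigma^{-1} e_j \;=\; \omega_j .
```

Strict convexity of $f_j$ (its Hessian is $\Sigma \succ 0$) guarantees that this minimizer is unique.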
Given a pre-specified sparsity level, the differentially private estimation algorithm for $\omega_j$ is presented in Algorithm 7.
Lemma 4.1.
Theoretical properties of Algorithm 7 are outlined in Lemma 8.1. Similar to Algorithm 2, we once again employ the data-splitting technique to relax design-related conditions. Our approach shares similarities with the node-wise regression introduced in Van de Geer et al. (2014). What sets our approach apart is its simultaneous estimation of the entire vector $\omega_j$. This confers the advantage of privacy cost reduction by avoiding the separate estimation of node-wise regression parameters and error variance.
Condition 4.1.
The precision matrix $\Omega$ is column-wise sparse, with $\|\omega_j\|_0$ bounded for each $j$, where $\omega_j$ is the $j$-th column of $\Omega$.
This condition is necessary to facilitate differentially private estimation of the precision matrix. When the sparsity level is unknown, we propose to use the DP-BIC, following a similar approach as in Algorithm 2, to select the unknown sparsity parameter. The algorithm for estimating $\omega_j$ is summarized in Algorithm 4. Lemma 4.2 shows that under certain regularity conditions, Algorithm 4 is $(\varepsilon, \delta)$-differentially private.
Lemma 4.3.
The theoretical property of the estimator produced by Algorithm 4 is shown in Lemma 4.3. Compared to the case when the sparsity level is known, the proposed algorithm introduces an additional logarithmic factor in the privacy cost component of the error bound due to the DP-BIC selection. Using the "tracing attack" technique developed in Cai et al. (2021), one can show that the error bound in Lemma 4.3 matches the minimax lower bound up to a logarithmic factor.
After obtaining the private estimators $\hat{\beta}$ and $\hat{\omega}_j$, we propose the following debiased estimator to facilitate private inference:
$$\hat{\beta}_j^{\,d} = \hat{\beta}_j + \frac{1}{n}\,\hat{\omega}_j^\top \sum_{i=1}^{n} x_i \big(y_i - x_i^\top \hat{\beta}\big) + w_j, \tag{4.1}$$
where $w_j$ is Gaussian noise injected for privacy. Different from the non-private debiased estimator in Van de Geer et al. (2014), the additional random noise $w_j$ guarantees $(\varepsilon, \delta)$-differential privacy by the Gaussian mechanism. Due to the privacy constraints, the variance analysis of $\hat{\beta}_j^{\,d}$ turns out to be different from Van de Geer et al. (2014): the estimator decomposes into an asymptotically normal main term, a bias term, and the privacy noise. The following lemma provides theoretical insights for this decomposition of the private debiased estimator.
Lemma 4.4 (Limiting distribution of private debiased Lasso).
Therefore, to obtain a differentially private confidence interval, we need to estimate the variance privately. Note that an estimate of $\omega_j$ can be obtained directly from Algorithm 4. Therefore, it suffices to estimate the noise variance $\sigma^2$. We propose a differentially private estimator of $\sigma^2$ based on the privatized residual sum of squares, with Gaussian noise calibrated to its sensitivity.
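To make the construction concrete, here is a minimal sketch of the private debiased step (4.1). The function name and the `noise_sd` argument are ours; the latter stands in for the Gaussian-mechanism calibration derived from the sensitivity analysis:

```python
import numpy as np

def private_debiased_coordinate(X, y, beta_hat, omega_j_hat, j, noise_sd, rng=None):
    """One-step bias correction of the j-th private Lasso coordinate,
    plus Gaussian noise for privacy (equation (4.1))."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = X.shape[0]
    residual = y - X @ beta_hat                       # in-sample residuals
    correction = omega_j_hat @ (X.T @ residual) / n   # decorrelated score correction
    return beta_hat[j] + correction + rng.normal(scale=noise_sd)
```

With `noise_sd = 0` this reduces to the classical (non-private) debiased Lasso coordinate.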
For the reader’s convenience, we summarize the entire algorithm for obtaining a differentially private confidence interval for $\beta_j$ in the high-dimensional setting. Specifically, the algorithm consists of four steps, and we allocate an equal share of the privacy budget to each step: (1) estimating the regression parameters; (2) estimating the column $\omega_j$ of the precision matrix; (3) computing the debiased estimator; and (4) estimating the standard error of the debiased estimator. Our algorithm ensures that each of the four steps is $(\varepsilon/4, \delta/4)$-differentially private; thus Algorithm 5 itself is $(\varepsilon, \delta)$-differentially private by the composition property. The privacy guarantee together with the nominal coverage of the confidence interval is summarized in Theorem 4.1.
$$\Big[\hat{\beta}_j^{\,d} - z_{1-\alpha/2}\,\widehat{\mathrm{se}}_j,\;\; \hat{\beta}_j^{\,d} + z_{1-\alpha/2}\,\widehat{\mathrm{se}}_j\Big] \tag{4.2}$$
Theorem 4.1 (Validness of the proposed CI).
Theorem 4.1 demonstrates that the proposed confidence interval attains asymptotically nominal coverage while ensuring privacy. Nevertheless, in finite samples, we suggest incorporating a minor correction to the confidence interval to enhance the finite-sample performance of our method. A similar approach was previously discussed by Avella-Medina et al. (2023a) in the context of low-dimensional noisy gradient descent and noisy Newton's method algorithms. Note that the proposed debiased estimator (4.1) incorporates additional noise, denoted as $w_j$, which is generated from a Gaussian distribution with a known variance. The confidence interval with finite-sample correction is defined by accounting for the variance of $w_j$ as follows:
$$\hat{\beta}_j^{\,d} \;\pm\; z_{1-\alpha/2}\,\sqrt{\widehat{\mathrm{se}}_j^{\,2} + \operatorname{Var}(w_j)}, \tag{4.3}$$
where $\operatorname{Var}(w_j)$ represents the variance of the injected noise $w_j$. It is worth noting that this additional variance term diminishes as $n$ approaches infinity. Consequently, the corrected confidence interval remains asymptotically as efficient as the debiased Lasso.
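The corrected interval (4.3) is then a one-liner. A sketch, with the $z$-quantile hard-coded for 95% intervals; `se_hat` denotes the estimated standard error of the main term and `noise_sd` the known standard deviation of $w_j$ (names are ours):

```python
import math

def corrected_ci_95(point_est, se_hat, noise_sd):
    """Widen the asymptotic 95% interval by the known privacy-noise variance."""
    z = 1.959964  # standard normal 0.975-quantile
    half = z * math.sqrt(se_hat**2 + noise_sd**2)
    return point_est - half, point_est + half
```

With `noise_sd = 0` this reduces to the usual debiased-Lasso interval; since the noise variance vanishes as $n$ grows, the correction is asymptotically free.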
5 FDR Control with Differential Privacy
False discovery rate (FDR) control with privacy under the high-dimensional linear model is a difficult problem. Current approaches to differentially private FDR control (Dwork et al., 2021; Xia and Cai, 2023) require mutual independence of the $p$-values under the null hypotheses, an assumption that does not necessarily hold in linear regression problems. The differentially private knockoff method introduced by Pournaderi and Xiang (2021) has two severe limitations: firstly, it requires the dimension $p$ to be smaller than the sample size $n$; secondly, the noise introduced for privacy preservation overwhelms the signal. To generate differentially private knockoff features, Pournaderi and Xiang (2021) proposed adding noise with a standard error proportional to the upper bound of the signal; consequently, even when $p$ remains fixed and $n$ approaches infinity, the added noise dominates the signal, severely limiting the method's effectiveness. How to construct differentially private knockoff features with a reasonably small amount of added noise remains an open challenge.
Our approach draws inspiration from recent advancements in mirror statistics (Dai et al., 2022, 2023). In particular, the utilization of sample splitting and post-selection techniques allows us to effectively reduce the dimensionality of the model from high to relatively low dimensions, which in turn lets us manage the scale of noise required for privacy preservation. Our approach shares similarities with the peeling algorithm proposed by Dwork et al. (2021) and the mirror peeling algorithm introduced by Xia and Cai (2023). Specifically, we divide the data into two parts, denoted as $\mathcal{D}_1$ and $\mathcal{D}_2$. Initially, we employ the high-dimensional differentially private sparse linear regression algorithm (Algorithm 2) on $\mathcal{D}_1$; the resulting estimator is denoted as $\hat{\beta}^{(1)}$, with its support represented by $\hat{S}_1$. Subsequently, we utilize $\mathcal{D}_2$ to fit a differentially private ordinary least squares regression on the estimated active support $\hat{S}_1$, and denote the resulting estimator as $\hat{\beta}^{(2)}$. For all $j \in \hat{S}_1$, we define the mirror statistic $M_j$ by
$$M_j = \operatorname{sign}\big(\hat{\beta}_j^{(1)} \hat{\beta}_j^{(2)}\big)\, f\big(|\hat{\beta}_j^{(1)}|, |\hat{\beta}_j^{(2)}|\big).$$
Similarly as in Dai et al. (2022), the function $f(u, v)$ can take different choices, such as $f(u, v) = 2\min(u, v)$, $f(u, v) = uv$, or $f(u, v) = u + v$.
The data-driven cutoff $\tau_q$ is defined as
$$\tau_q = \min\bigg\{ t > 0 : \frac{\#\{j \in \hat{S}_1 : M_j < -t\}}{\#\{j \in \hat{S}_1 : M_j > t\} \vee 1} \leq q \bigg\},$$
where $q$ is the desired FDR level. We select $\hat{S} = \{j \in \hat{S}_1 : M_j > \tau_q\}$ as the set of important variables. Define $S$ as the true support set and $S^c$ as the complement of $S$. The false discovery proportion (FDP), FDR and power of the proposed selection procedure are defined as follows:
$$\mathrm{FDP} = \frac{|\hat{S} \cap S^c|}{|\hat{S}| \vee 1}, \qquad \mathrm{FDR} = \mathbb{E}\big[\mathrm{FDP}\big], \qquad \mathrm{Power} = \mathbb{E}\bigg[\frac{|\hat{S} \cap S|}{|S|}\bigg].$$
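The selection rule can be sketched end-to-end in a few lines, assuming for illustration the mirror statistic $M_j = \operatorname{sign}(\hat{\beta}_j^{(1)}\hat{\beta}_j^{(2)})(|\hat{\beta}_j^{(1)}| + |\hat{\beta}_j^{(2)}|)$, one of the choices in Dai et al. (2022):

```python
import numpy as np

def mirror_select(b1, b2, q):
    """Select variables whose mirror statistic exceeds the data-driven cutoff.

    The cutoff is the smallest t whose estimated FDP,
    #{j : M_j < -t} / max(#{j : M_j > t}, 1), is at most q.  Null coordinates
    have (approximately) symmetric M_j, so the left tail counts the false
    discoveries expected on the right.
    """
    b1, b2 = np.asarray(b1, dtype=float), np.asarray(b2, dtype=float)
    M = np.sign(b1 * b2) * (np.abs(b1) + np.abs(b2))
    for t in np.sort(np.abs(M[M != 0])):
        fdp_hat = np.sum(M < -t) / max(np.sum(M > t), 1)
        if fdp_hat <= q:
            return np.flatnonzero(M > t)
    return np.array([], dtype=int)
```

Signals estimated with consistent signs on both halves produce large positive $M_j$, while nulls scatter symmetrically around zero, which is exactly what the cutoff exploits.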
We summarize the details of the algorithm in Algorithm 6. The privacy guarantee of the proposed procedure is provided in Lemma 5.1.
Lemma 5.1.
Under regularity conditions, the proposed method controls the FDR asymptotically at the user-specified level, and its power goes to 1. We summarize these results in Theorem 5.1.
Theorem 5.1.
The minimal signal strength condition guarantees the sure screening property (Fan and Lv, 2008), i.e., the selected set $\hat{S}_1$ contains all active coefficients. This property is very important for FDR control under a high-dimensional linear model; see Barber and Candès (2019) and Dai et al. (2022). The critical requirement for FDR control is that the linear model still holds conditional on the selected set $\hat{S}_1$. Using a data-splitting strategy, this condition can be relaxed to require only that the selected set contains all active coefficients with high probability.
Regarding differentially private FDR control, we acknowledge the latest version of mirror statistics developed in Dai et al. (2023). However, directly implementing the algorithm from Dai et al. (2023) would result in the noise required for privacy overwhelming the signals, because differentially private FDR control requires a screening step to reduce the number of tests to a moderate level so that the required noise level becomes reasonable. See, for example, the peeling algorithm in Dwork et al. (2021) and the mirror-peeling algorithm in Xia and Cai (2023). It would also be interesting to develop a differentially private version of the well-known knockoff procedure of Barber and Candès (2015), which enables FDR control with finite samples. We leave these directions for future research.
6 Numeric Study
6.1 Simulation
We evaluate the finite-sample behavior of the proposed private debiased procedure for inference on individual regression coefficients, and the false discovery rate of the proposed selection procedure. For our simulation study, we consider linear models in which the rows of the covariate matrix are i.i.d. multivariate Gaussian, and we simulate the response from the linear model $y = X\beta + \epsilon$, with the covariance structure, regression coefficients, and error distribution specified separately for each setting below.
6.1.1 Debiased Inference
We first evaluate the performance of the proposed debiased procedure under two designs for the design matrix: Toeplitz (AR) covariance matrices and blocked equal-correlation covariance matrices.
The active set has fixed cardinality, with the nonzero regression coefficients fixed at a common value and the errors drawn independently from a Gaussian distribution; the privacy budget is allocated per coordinate. For comparison purposes, we provide coverages and interval lengths for three methods: DB-Lasso, the debiased Lasso of Javanmard and Montanari (2014); DP naive, the proposed DP debiased procedure (Algorithm 5) without finite-sample correction; and DP correction, the proposed debiased procedure with the finite-sample correction in formula (4.3). All results are based on independent repetitions of the model with random design and fixed regression coefficients.

To demonstrate the effectiveness of the proposed debiased procedure, we begin by displaying the estimated confidence intervals for the leading coordinates in one particular realization, as depicted in Figure 1. Notably, the estimated confidence intervals cover all of the signals, which are represented by the first three coordinates. The collective coverage for all the coordinates presented in the figure is approximately at the nominal level. Further results can be found in Table 1 for the Toeplitz covariance design and in Table 2 for the equal correlation design.
| Measure | Method | | | | |
|---|---|---|---|---|---|
| Avgcov | DB-Lasso | 0.960 | 0.960 | 0.959 | 0.961 |
| | DP naive | 0.753 | 0.764 | 0.797 | 0.848 |
| | DP correction | 0.951 | 0.950 | 0.950 | 0.950 |
| Avglength | DB-Lasso | 0.190 | 0.195 | 0.219 | 0.261 |
| | DP naive | 0.179 | 0.187 | 0.210 | 0.265 |
| | DP correction | 0.304 | 0.309 | 0.324 | 0.361 |
| Measure | Method | | | | |
|---|---|---|---|---|---|
| Avgcov | DB-Lasso | 0.960 | 0.960 | 0.959 | 0.962 |
| | DP naive | 0.755 | 0.777 | 0.817 | 0.866 |
| | DP correction | 0.950 | 0.950 | 0.950 | 0.951 |
| Avglength | DB-Lasso | 0.191 | 0.204 | 0.234 | 0.282 |
| | DP naive | 0.182 | 0.196 | 0.228 | 0.291 |
| | DP correction | 0.306 | 0.314 | 0.335 | 0.381 |
The numeric performance of our proposed debiased procedure exhibits remarkable similarity between the Toeplitz covariance design (Table 1) and the equal correlation design (Table 2). Notably, the coverage rates for DP naive fall significantly below the nominal level, empirically confirming that the additional correction is necessary in finite samples, as discussed in Section 4. In comparison, the DP correction method demonstrates significantly improved coverage relative to DP naive, albeit with wider confidence intervals: the corrected intervals are noticeably wider than those of DP naive, and wider than those of DB-Lasso, reflecting the efficiency loss stemming from privacy constraints. Overall, the proposed method exhibits coverage rates close to the nominal level with only a marginal reduction in efficiency.
6.1.2 FDR control
Next, we proceed to assess the performance of the proposed algorithm for controlling the false discovery rate (FDR), as outlined in Section 5. To evaluate its effectiveness, we again consider Toeplitz (AR) covariance matrices for the design. The active set consists of randomly chosen covariates among all available covariates, and the nonzero regression coefficients are sampled independently from a mean-zero normal distribution whose standard deviation determines the signal strength. The errors in the linear model are assumed to follow a standard normal distribution, i.e., $\epsilon_i \sim N(0, 1)$. For privacy preservation, we choose fixed privacy parameters $\varepsilon$ and $\delta$, and we set the target FDR level at a fixed nominal rate.
We compare our method with the non-private FDR control algorithm presented in Dai et al. (2022). The empirical FDR and power are reported, and all results are based on 100 independent simulations of the model with a fixed design and random regression coefficients (i.e., repeating the simulations 100 times with different errors in each linear model). Figure 2 presents the empirical FDR and power across various signal levels. Both the proposed differentially private FDR control procedure and the non-private procedure effectively control the empirical FDR at the predetermined level. The power of the proposed procedure exhibits a minor reduction compared to the non-private procedure due to privacy constraints. The numerical study demonstrates that, for reasonably large sample sizes, the proposed method maintains FDR control with minimal sacrifice in power compared to the non-private approach.


Figure 3 presents the empirical FDR and power across increasing sample sizes. It is important to note that the proposed procedure may fail to control the empirical FDR when the sample size is very small. This is primarily because, with small sample sizes, the initial step involving differentially private high-dimensional linear regression struggles to accurately identify all active features. Additionally, there may exist a nontrivial bias in the second step, which is the differentially private ordinary least squares estimation. However, as the sample size increases, the proposed method successfully controls the empirical FDR at the predetermined level. Similarly, when dealing with small sample sizes, the power of the proposed procedure is notably lower than that of the non-private procedure due to the reduced estimation accuracy caused by privacy constraints. Nevertheless, as the sample size grows, both the differentially private algorithm and the non-private methods exhibit increased power, and the difference between them diminishes, as a result of improved estimation accuracy in both cases. Overall, our numerical study demonstrates that for reasonably large sample sizes, the proposed method can effectively maintain FDR control with only a marginal reduction in power compared to the non-private approach.
6.2 Real Example: Parkinsons Telemonitoring
For real data applications, we demonstrate the performance of the proposed differentially private algorithms in analyzing the Parkinson's Disease Progression data (Tsanas et al., 2009). In clinical diagnosis, assessing the progression of Parkinson's disease (PD) symptoms typically relies on the unified Parkinson's disease rating scale (UPDRS), which necessitates the patient's physical presence at the clinic and time-consuming physical evaluations conducted by trained medical professionals. Thus, monitoring symptoms is associated with high costs and logistical challenges for both patients and clinical staff. The goal for this data set is to track UPDRS via noninvasive speech tests. However, it is crucial to protect the privacy of each patient's data, as any unauthorized disclosure could lead to potential harm or trouble for the participants. By ensuring privacy protection, individuals are more likely to contribute their personal data, facilitating advancements in Parkinson's disease research.
The data collection process utilized the Intel AHTD, a telemonitoring system designed for remote, internet-enabled measurement of various motor impairment symptoms associated with PD. The research was overseen by six U.S. medical centers, namely the Georgia Institute of Technology (seven subjects), National Institutes of Health (ten subjects), Oregon Health and Science University (fourteen subjects), Rush University Medical Center (eleven subjects), Southern Illinois University (six subjects), and the University of California Los Angeles (four subjects). A total of 52 individuals diagnosed with idiopathic PD were recruited. After an initial screening to eliminate flawed recordings (such as instances of patient coughing), a total of 5,923 sustained phonations were subjected to analysis, with dysphonia measures applied to each. Tsanas et al. (2009) proposed to use a linear model for these data and did not consider privacy.
6.2.1 Debiased Inference
To evaluate the performance of the proposed differentially private inference procedure (Algorithm 5) in high-dimensional settings, we add 5,000 random features, generated independently and identically from the standard normal distribution. The resulting dataset thus has 5,016 covariates, of which the first 16 are real features.
We consider the following three methods: the oracle method, the proposed differentially private algorithm, and the non-private debiased lasso (Van de Geer et al., 2014). The oracle method uses only the 16 real features, while the proposed algorithm and the debiased lasso utilize all 5,016 features. The privacy parameters are set as and . Figure 4 displays the confidence intervals for the 16 real features obtained from the three methods. Overall, the private confidence intervals consistently cover the estimates obtained from the oracle method and the debiased lasso. In fact, the private confidence intervals exhibit substantial overlap with the confidence intervals from both the oracle method and the debiased lasso. However, due to the cost of privacy, the proposed confidence intervals are slightly wider than those from the debiased lasso.
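The widening of the private intervals can be understood schematically: when an independent Gaussian noise term is added to a debiased estimate for privacy, its variance adds to the sampling variance, which enlarges the normal-approximation interval. The sketch below illustrates this with hypothetical scales; it is not the paper's Algorithm 5:

```python
import numpy as np

def normal_ci(estimate, se, z=1.959964):
    """Two-sided 95% normal-approximation confidence interval."""
    return estimate - z * se, estimate + z * se

# Non-private debiased-Lasso-style standard error (schematic): sigma_hat / sqrt(n).
n, sigma_hat = 5000, 1.0
se_nonprivate = sigma_hat / np.sqrt(n)

# Private version: an independent Gaussian privacy-noise term of scale tau
# inflates the variance additively (tau is a hypothetical noise scale).
tau = 0.01
se_private = np.sqrt(se_nonprivate**2 + tau**2)

lo_np, hi_np = normal_ci(0.5, se_nonprivate)
lo_p, hi_p = normal_ci(0.5, se_private)
# The private interval is strictly wider: (hi_p - lo_p) > (hi_np - lo_np).
```

This additive variance inflation is why the private intervals in Figure 4 overlap substantially with the non-private ones while being slightly wider.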

6.2.2 FDR control
Next, we evaluate the performance of the private FDR control algorithm proposed in Section 5 on the Parkinson's Disease Progression data. We again add independently and identically generated random features. The proposed algorithm is compared with the non-private data-splitting algorithm of Dai et al. (2022) and the knockoff filter of Barber and Candès (2015). The results are reported in Figure 5 for two target FDR levels. Compared with both the knockoff and the non-private data-splitting methods, the proposed method makes a notable number of discoveries among the real features while registering only a minimal number of false discoveries among the random features.

feature | age | sex | test_time | Jitter | Abs | PPQ5 | Shimmer | dB
---|---|---|---|---|---|---|---|---
knockoff | ✓ | ✓ | ✓ | ✓ | | | |
Non-Private | ✓ | ✓ | ✓ | ✓ | ✓ | | |
DP-FDR | ✓ | ✓ | ✓ | ✓ | | | |

feature | APQ5 | APQ11 | DDA | NHR1 | HNR2 | RPDE | DFA | PPE
---|---|---|---|---|---|---|---|---
knockoff | ✓ | ✓ | ✓ | ✓ | ✓ | | |
Non-Private | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
DP-FDR | ✓ | ✓ | ✓ | ✓ | | | |
We further report the features selected at the target FDR level in Table 3. There is substantial overlap among the three methods, with several features consistently selected. For instance, age, Jitter, Abs, HNR2, DFA, and PPE are chosen by all three methods. For Parkinson's disease, the second most prevalent neurodegenerative disorder among the elderly, age is the most important risk factor; numerous medical studies underscore its pivotal role as the single most significant factor associated with PD, as documented by Elbaz et al. (2002). Furthermore, Jitter and Abs are commonly employed to characterize cycle-to-cycle variability in fundamental frequency, while HNR (Harmonics-to-Noise Ratio) is a fundamental feature in speech processing. Detrended Fluctuation Analysis (DFA) and Pitch Period Entropy (PPE) are two recently proposed speech signal processing measures, both of which exhibit a strong correlation with PD-dysphonia, as highlighted in Little et al. (2008). In addition to these commonly selected features, the proposed method also identifies Shimmer and DDA, which are frequently used to describe cycle-to-cycle variability in amplitude; both are endorsed as relevant features in the clinical study of Tsanas et al. (2009). The features identified by our proposed method thus receive substantial clinical support, while additionally providing a privacy guarantee for the individual patients.
7 Discussion
This paper presents a comprehensive framework for performing differentially private analysis in high-dimensional linear models, encompassing estimation, inference, and false discovery rate (FDR) control. This framework is particularly valuable in scenarios where individual privacy in the dataset needs to be protected and can be readily applied across various disciplines. The numerical studies conducted in this work demonstrate that privacy protection can be achieved with only a minor loss in the accuracy of confidence intervals and multiple testing.
We briefly discuss several intriguing extensions. For example, the tools developed in this paper for DP estimation, the debiased Lasso, and FDR control can all be extended to generalized linear models. It would also be interesting to explore scenarios where a partial dataset, potentially following a different distribution, is publicly available and not bound by privacy constraints. In addition, the newly developed DP-BIC could be adapted for other tasks involving the selection of tuning parameters with privacy guarantees. These topics are left for future research.
References
- Avella-Medina et al. (2023a) Avella-Medina, M., Bradshaw, C., and Loh, P.L. (2023a). “Differentially private inference via noisy optimization.” Annals of Statistics.
- Avella-Medina et al. (2023b) Avella-Medina, M., Liu, Z., and Loh, P.L. (2023b). “Differentially private inference for high dimensional M-estimators.” Unpublished manuscript.
- Azriel and Schwartzman (2015) Azriel, D. and Schwartzman, A. (2015). “The empirical distribution of a large number of correlated normal variables.” Journal of the American Statistical Association, 110(511), 1217–1228.
- Barber and Candès (2015) Barber, R.F. and Candès, E.J. (2015). “Controlling the false discovery rate via knockoffs.” The Annals of Statistics, 43(5), 2055–2085.
- Barber and Candès (2019) Barber, R.F. and Candès, E.J. (2019). “A knockoff filter for high-dimensional selective inference.” The Annals of Statistics, 47(5), 2504–2537.
- Cai et al. (2021) Cai, T.T., Wang, Y., and Zhang, L. (2021). “The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy.” The Annals of Statistics, 49(5), 2825–2850.
- Candes et al. (2018) Candes, E., Fan, Y., Janson, L., and Lv, J. (2018). “Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection.” Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3), 551–577.
- Dai et al. (2022) Dai, C., Lin, B., Xing, X., and Liu, J.S. (2022). “False discovery rate control via data splitting.” Journal of the American Statistical Association, pages 1–18.
- Dai et al. (2023) Dai, C., Lin, B., Xing, X., and Liu, J.S. (2023). “A scale-free approach for false discovery rate control in generalized linear models.” Journal of the American Statistical Association, pages 1–15.
- Dwork et al. (2006) Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006). “Calibrating noise to sensitivity in private data analysis.” In “TCC 2006,” pages 265–284. Springer.
- Dwork and Roth (2014) Dwork, C. and Roth, A. (2014). “The algorithmic foundations of differential privacy.” Foundations and Trends® in Theoretical Computer Science, 9(3–4), 211–407.
- Dwork et al. (2010) Dwork, C., Rothblum, G.N., and Vadhan, S. (2010). “Boosting and differential privacy.” In “2010 IEEE 51st Annual Symposium on Foundations of Computer Science,” pages 51–60. IEEE.
- Dwork et al. (2021) Dwork, C., Su, W., and Zhang, L. (2021). “Differentially private false discovery rate control.” Journal of Privacy and Confidentiality, 11(2).
- Elbaz et al. (2002) Elbaz, A., Bower, J.H., Maraganore, D.M., McDonnell, S.K., Peterson, B.J., Ahlskog, J.E., Schaid, D.J., and Rocca, W.A. (2002). “Risk tables for parkinsonism and Parkinson’s disease.” Journal of Clinical Epidemiology, 55(1), 25–31.
- Fan and Lv (2008) Fan, J. and Lv, J. (2008). “Sure independence screening for ultrahigh dimensional feature space.” Journal of the Royal Statistical Society Series B: Statistical Methodology, 70(5), 849–911.
- Javanmard and Montanari (2014) Javanmard, A. and Montanari, A. (2014). “Confidence intervals and hypothesis testing for high-dimensional regression.” Journal of Machine Learning Research, 15(1), 2869–2909.
- Lane et al. (2014) Lane, J., Stodden, V., Bender, S., and Nissenbaum, H. (2014). Privacy, big data, and the public good: Frameworks for engagement. Cambridge University Press.
- Little et al. (2008) Little, M., McSharry, P., Hunter, E., Spielman, J., and Ramig, L. (2008). “Suitability of dysphonia measurements for telemonitoring of Parkinson’s disease.” Nature Precedings.
- Liu (2013) Liu, W. (2013). “Gaussian graphical model estimation with false discovery rate control.” The Annals of Statistics, 41(6), 2948 – 2978.
- Ning and Liu (2017) Ning, Y. and Liu, H. (2017). “A general theory of hypothesis tests and confidence regions for sparse high dimensional models.” The Annals of Statistics, 45(1), 158 – 195. doi:10.1214/16-AOS1448.
- Pournaderi and Xiang (2021) Pournaderi, M. and Xiang, Y. (2021). “Differentially private variable selection via the knockoff filter.” In “2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP),” pages 1–6. IEEE.
- Romano et al. (2020) Romano, Y., Sesia, M., and Candès, E. (2020). “Deep knockoffs.” Journal of the American Statistical Association, 115(532), 1861–1872.
- Shi et al. (2021) Shi, C., Song, R., Lu, W., and Li, R. (2021). “Statistical inference for high-dimensional models via recursive online-score estimation.” Journal of the American Statistical Association, 116(535), 1307–1318.
- Tan et al. (2020) Tan, K., Shi, L., and Yu, Z. (2020). “Sparse sir: Optimal rates and adaptive estimation.” The Annals of Statistics, 48(1), 64–85.
- Tsanas et al. (2009) Tsanas, A., Little, M., McSharry, P., and Ramig, L. (2009). “Accurate telemonitoring of Parkinson’s disease progression by non-invasive speech tests.” Nature Precedings.
- Van de Geer et al. (2014) Van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). “On asymptotically optimal confidence regions and tests for high-dimensional models.” The Annals of Statistics, 42(3), 1166–1202.
- Wang et al. (2022) Wang, P., Xie, M.G., and Zhang, L. (2022). “Finite-and large-sample inference for model and coefficients in high-dimensional linear regression with repro samples.” arXiv preprint arXiv:2209.09299.
- Wasserman and Roeder (2009) Wasserman, L. and Roeder, K. (2009). “High dimensional variable selection.” The Annals of Statistics, 37(5A), 2178.
- Xia and Cai (2023) Xia, X. and Cai, Z. (2023). “Adaptive false discovery rate control with privacy guarantee.” Journal of Machine Learning Research, 24(252), 1–35.
- Zhang and Zhang (2014) Zhang, C.H. and Zhang, S.S. (2014). “Confidence intervals for low dimensional parameters in high dimensional linear models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1), 217–242.
SUPPLEMENTARY MATERIAL
8 Proofs
8.1 The convergence rate in Algorithm 2
We now prove the differential privacy and convergence rate of in Algorithm 2.
8.1.1 Proof of Lemma 3.1
Proof of Lemma 3.1.
The sensitivity of the gradient in the th iteration is
By the advanced composition theorem, outputting the gradient is -differentially private. Furthermore, by the composition theorem, for , the output is -differentially private. By the composition theorem again, returning all iterates is -differentially private.
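As a concrete, simplified illustration of how such per-iteration budgets translate into noise, the classical Gaussian mechanism (Dwork and Roth, 2014) calibrates the noise scale to the sensitivity of each release. The sketch below uses basic rather than advanced composition, and the iteration count, sensitivity, and budget are hypothetical constants, not the paper's settings:

```python
import numpy as np

def gaussian_sigma(sensitivity, eps, delta):
    """Classical Gaussian-mechanism calibration (Dwork and Roth, 2014):
    sigma = sensitivity * sqrt(2 * log(1.25/delta)) / eps yields
    (eps, delta)-DP for a single release (requires eps <= 1)."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

rng = np.random.default_rng(0)
T = 50                              # hypothetical number of gradient iterations
eps_total, delta_total = 1.0, 1e-5  # hypothetical total privacy budget
# Basic composition: split the budget evenly across the T releases.
eps_t, delta_t = eps_total / T, delta_total / T
sens = 0.1                          # hypothetical per-iteration gradient sensitivity
sigma = gaussian_sigma(sens, eps_t, delta_t)

grad = np.ones(10)                  # stand-in for a clipped gradient vector
private_grad = grad + rng.normal(0.0, sigma, size=grad.shape)
```

Advanced composition, as used in the lemma, lets the per-iteration budget scale roughly like rather than , giving smaller noise for the same total budget.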
8.1.2 Proof of Theorem 3.1
Proof of Theorem 3.1.
Let be the selected number corresponding to and be such that . Note that is uniquely defined by and . Define events
where , and
By Theorem 4.4 in Cai et al. (2021), with replaced by , the event happens with probability at least .
By the oracle inequality of the BIC criterion,
where and is the additional term equal to . It implies that
Let . Note that
Hence, by the first statement in , we have
with high probability, where we use Hölder’s inequality and a concentration inequality. We first consider the case where . Then, by Hölder’s inequality,
By the assumption that , we have for . Thus, the solution of the quadratic inequality exists and satisfies,
for a constant . Then, we consider the case where . By Hölder’s inequality and triangle inequality, we have
For , we have
The conclusion holds. ∎
8.2 Proofs of Statistical Inference
Given a pre-specified sparsity level , the differentially private algorithm for estimating is presented in Algorithm 7.
Theoretical properties of Algorithm 7 are outlined in Lemma 8.1. The proof follows arguments similar to those of Theorem 4.4 in Cai et al. (2021).
Lemma 8.1.
8.2.1 Proofs of Lemma 8.1
Proofs of Lemma 8.1.
Let be the support of the true parameter and for any subset satisfying and . The oracle estimator is defined by
We first study the statistical property of . An equivalent formulation of is , where is a sub-vector of and is a sub-matrix of . Notice that , so the sub-vector is a unit vector. We then have the analytic solution . Then,
By Corollary 10.1 in Tan et al. (2020), we have for any constant , there exists some constant , such that
with probability at least .
Then, the satisfy . Thus, we have
for satisfying . Then, we have
where is the gradient of evaluated at . Let , and . Let be the noise vectors added to after the peeling mechanism and . Then, we have the decomposition,
We first consider the first two terms. Let the subset , then . Then, by the fact and the selection criterion, we have for every
Then, we have
where we use in the first inequality. By Lemma A.3. in Cai et al. (2021), we have
Thus,
where in the first inequality, we use the assumption .
Then, we consider the second term, which has the decomposition
where we use the fact that . The last term satisfies
By the selection property, we have
By combining the results together, we have
Combining all results together, we have
for constant large enough. By the choice of , such that because . Then,
Thus,
Let and iterate the above equation,
Thus, by and
we have
Thus,
The desired result follows by replacing with . ∎
8.2.2 Proof of Lemma 4.2
8.2.3 Proofs of Lemma 4.3
Proof of Lemma 4.3.
Let be the selected number corresponding to and be such that . Note that is uniquely defined by and . Define events
and
By Lemma 8.1, with replaced by , the event happens with probability at least .
By the oracle inequality of the BIC criterion,
where and is the additional term equal to . It implies that
Let . Note that
Hence, by the first statement in , we have
with high probability, where we use Hölder’s inequality. We first consider the case where . Then, by Hölder’s inequality,
By the assumption that , we have for . Thus, the solution of the quadratic inequality exists and satisfies,
Then, we consider the case where . By Hölder’s inequality and triangle inequality, we have
For , we have
Then, the conclusion holds. ∎
8.2.4 Proof of Lemma 4.4
Proof of Lemma 4.4.
Recall that
For ,
The two terms are asymptotically normal when is replaced with . Let
. Then, we have
Furthermore, we have
∎
8.2.5 Proof of Theorem 4.1
Proof of Theorem 4.1.
We first show the privacy part. By Lemma 3.1 and Lemma 4.2, the first two steps of Algorithm 5 are -differentially private. By the composition theorem, it remains to show that Step 3 and Step 4 are each -differentially private.
The sensitivity of is
The sensitivity of is
By the Gaussian mechanism, Step 3 and Step 4 are -differentially private. Finally, by the composition theorem, Algorithm 5 is -differentially private.
We now show the validity of the proposed confidence interval. Notice that for some constant with probability approaching . Thus, . By Lemma 4.4, we have
By Lemma 4.3, the convergence of implies . Furthermore, the estimation of satisfies,
The term by the law of large numbers. The term is by Theorem 3.1. The term
Then, we have . The final result follows by Slutsky’s theorem. ∎
8.3 Proof of FDR: Theorem 5.1
8.3.1 Proof of Lemma 5.1
Proof of Lemma 5.1.
By Lemma 3.1, outputting is -differentially private. It remains to show that the DP-OLS is -differentially private; the final result then follows from the composition theorem. We first calculate the sensitivities of and . The sensitivity of satisfies
The sensitivity of satisfies
By the data-splitting and Gaussian mechanism, the DP-OLS is -differentially private. ∎
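A sufficient-statistics DP-OLS of this kind can be sketched as follows: Gaussian noise is added to the Gram matrix and the cross-product vector, and the normal equations are then solved with the perturbed statistics. The noise scales below are hypothetical placeholders rather than the calibrated values from the sensitivities above, so this is a schematic illustration, not the paper's exact algorithm:

```python
import numpy as np

def dp_ols(X, y, sigma_xx, sigma_xy, rng):
    """Schematic sufficient-statistics DP-OLS: perturb X^T X and X^T y with
    Gaussian noise (scales assumed calibrated elsewhere), then solve the
    noisy normal equations."""
    d = X.shape[1]
    # Symmetrize the Gram-matrix noise so the perturbed matrix stays symmetric.
    E = rng.normal(0.0, sigma_xx, size=(d, d))
    E = (E + E.T) / 2.0
    gram = X.T @ X + E
    xty = X.T @ y + rng.normal(0.0, sigma_xy, size=d)
    return np.linalg.solve(gram, xty)

# Toy data: the noise is negligible relative to the n-scale sufficient
# statistics, so the private estimate stays close to the truth.
rng = np.random.default_rng(1)
n, d = 2000, 5
X = rng.normal(size=(n, d))
beta = np.array([1.0, -1.0, 0.5, 0.0, 0.0])
y = X @ beta + rng.normal(size=n)
beta_hat = dp_ols(X, y, sigma_xx=1.0, sigma_xy=1.0, rng=rng)
```

Because the sufficient statistics grow with the sample size while the privacy noise does not, the bias introduced by the perturbation vanishes as n grows, consistent with the asymptotic analysis below.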
8.3.2 Proof of Theorem 5.1
Proof of Theorem 5.1.
We only need to consider the subset . We omit the subscript to simplify the notation. Notice that the DP-OLS estimator satisfies
Conditional on , the distribution of is symmetric around and the resulting bias is small. For , and thus the sampling distribution of is symmetric around . Without loss of generality, we assume ; otherwise the FDR control problem is trivial.
Define and its normalized version as
where and the represents the conditional correlation of the OLS regression coefficients . By the proof of Theorem 3.1, . Thus,
Furthermore,
where is the element-wise matrix norm. In addition, , where .
For any threshold , define
where and . Let and
It is easy to see that , where the last equality is due to the symmetry of for . Furthermore, we have
The first term is bounded by . Without loss of generality, we assume and . By the definition of , there exists a function of and , denoted by and , such that
where the function . Notice that by the condition in Theorem 5.1. Then, by the Lipschitz continuity of ,
Notice that the joint distribution of is bivariate normal. By Theorem 1 in Azriel and Schwartzman (2015), for any ,
Thus,
By Markov’s inequality and the symmetry of under the null,
Thus, we have
For any and satisfying , we have
Thus, we have
So far, we have shown . Next, we consider the power of the procedure. Given the signal strength condition, the sure screening property holds with probability approaching one, that is, as . Recall that the OLS estimator satisfies,
Since the eigenvalues of are bounded and Condition 3.2 holds, the eigenvalues of are bounded with probability approaching one. By the large deviation theorem,
Under the signal strength condition, with probability approaching one, we have
which further implies that by the definition of . Consequently, we have with probability approaching one. Thus, the power asymptotically converges to one. ∎