Near-optimal differentially private low-rank trace regression with guaranteed private initialization
Abstract
We study differentially private (DP) estimation of a rank- matrix under the trace regression model with Gaussian measurement matrices. Theoretically, the sensitivity of non-private spectral initialization is precisely characterized, and the differential-privacy-constrained minimax lower bound for estimating under the Schatten- norm is established. Methodologically, the paper introduces a computationally efficient algorithm for DP-initialization with a sample size of . Under certain regularity conditions, the DP-initialization falls within a local ball surrounding . We also propose a differentially private algorithm for estimating based on Riemannian optimization (DP-RGrad), which achieves a near-optimal convergence rate with the DP-initialization and sample size of . Finally, the paper discusses the non-trivial gap between the minimax lower bound and the upper bound of low-rank matrix estimation under the trace regression model. It is shown that the estimator given by DP-RGrad attains the optimal convergence rate in a weaker notion of differential privacy. Our powerful technique for analyzing the sensitivity of initialization requires no eigengap condition between non-zero singular values.
1 Introduction
The trace regression model (Rohde and Tsybakov, 2011; Koltchinskii et al., 2011), as an extension of the standard regression model, has been widely applied in various fields such as matrix completion, compressed sensing, and multi-task learning (Negahban and Wainwright, 2011; Koltchinskii et al., 2011; Hamidi and Bayati, 2022). Previous studies have proposed both convex and non-convex procedures that estimate the model optimally. However, the increasing demand for privacy protection has added new complexities to this extensively studied problem. Differential privacy (DP) (Dwork et al., 2006), a framework for protecting individual privacy, has been widely adopted in industrial and governmental applications (Erlingsson et al., 2014; Apple Differential Privacy Team, 2017; Abowd et al., 2020). This paper aims to develop a near-optimal differentially private method for low-rank matrix estimation under the trace regression model.
Trace regression model
Let be an unknown rank- matrix and be the measurement matrix for . Suppose the noisy observation satisfies
(1)
where the model noise and the inner product between and in Euclidean space is given by . The goal of the present paper is to estimate the unknown rank- matrix under the trace regression model defined by (1), subject to differential privacy, based on independent observations .
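To make the setup concrete, the following is a minimal simulation sketch of model (1) in Python; the dimensions, rank, and noise level are illustrative choices rather than values from the paper, and the design is taken to be standard Gaussian for simplicity.

import numpy as np

rng = np.random.default_rng(0)
d1, d2, r, n, sigma = 30, 20, 3, 2000, 0.5

# Rank-r ground truth M = U diag(s) V^T.
U, _ = np.linalg.qr(rng.standard_normal((d1, r)))
V, _ = np.linalg.qr(rng.standard_normal((d2, r)))
M = U @ np.diag(np.linspace(5.0, 2.0, r)) @ V.T

# Observations y_i = <X_i, M> + xi_i with X_i having i.i.d. N(0, 1) entries.
X = rng.standard_normal((n, d1, d2))
y = np.einsum('nij,ij->n', X, M) + sigma * rng.standard_normal(n)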
Our approaches are built upon the Gaussian mechanism (Dwork et al., 2006). The main difficulty in applying the Gaussian mechanism is sharply characterizing the sensitivities of the statistics whose privacy is under protection. Listed below are the definitions of sensitivity, differential privacy (DP), and the Gaussian mechanism. Interested readers may refer to Dwork et al. (2006, 2014); Vadhan (2017) for proofs and other details. Let denote the spectral norm and denote the Frobenius norm.
Sensitivity
The sensitivity of a function that maps a dataset into is defined by where and the supremum is taken over all neighbouring datasets and that differ by at most one observation.
Differential privacy
Let and , then we say the randomized algorithm is -differentially private if for all neighbouring data sets and all subsets .
Gaussian mechanism
The randomized algorithm defined by is -DP where has i.i.d. entries.
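As a concrete illustration (a minimal sketch, not the paper's exact calibration), the classical Gaussian mechanism releases a statistic with L2-sensitivity Delta by adding noise of scale sigma = sqrt(2 log(1.25/delta)) * Delta / epsilon, which is (epsilon, delta)-DP for epsilon < 1:

import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng=None):
    """Release value plus N(0, sigma^2) noise calibrated to the L2-sensitivity."""
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / eps
    return value + sigma * rng.standard_normal(np.shape(value))

# Example: privately release the mean of n records known to lie in [0, 1];
# replacing one record changes the mean by at most 1/n.
data = np.random.default_rng(1).uniform(size=500)
dp_mean = gaussian_mechanism(data.mean(), l2_sensitivity=1.0 / data.size, eps=1.0, delta=1e-5)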
RIP of Gaussian measurement matrices
The sensitivity of any statistic involving depends on the properties of the measurement matrices. Moreover, it has been well established since Candes and Tao (2005) that the restricted isometry property (RIP) of the measurement matrices is crucial for recovering the unknown matrix . Hence, assumptions on are necessary, and the present paper considers with Gaussian design.
Assumption 1 (Gaussian design).
The vectorizations of the measurement matrices are independent Gaussian where the ’s are known, symmetric, and positive definite. There exist absolute constants such that
The following Lemma 1 shows that under Assumption 1, the measurement matrices satisfy the restricted isometry property (RIP) with high probability; see the proof in Appendix E.1.
Lemma 1.
Under Assumption 1, for any of rank , there exist constants and such that if , with probability at least , we have
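For orientation, a standard form of the rank-restricted isometry property that Lemma 1 refers to is, in generic notation (the isometry constant $R_r$ and the exact constants are assumptions here and may differ from the paper's): for every matrix $M \in \mathbb{R}^{d_1 \times d_2}$ of rank at most $r$,
\[
(1 - R_r)\,\|M\|_{\mathrm F}^2 \;\le\; \frac{1}{n}\sum_{i=1}^{n}\langle X_i, M\rangle^2 \;\le\; (1 + R_r)\,\|M\|_{\mathrm F}^2 .
\]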
Notations
Suppose is of rank- and its singular value decomposition is of the form where , and with . Here, denotes the set of matrices satisfying . Let be the condition number and be the signal-to-noise ratio. Let stand for the typical big-O notation up to logarithmic factors and stand for holds with high probability.
1.1 Main results
The paper presents several key results related to differentially private low-rank matrix estimation. Firstly, we propose a private initialization (as detailed in Algorithm 1). Secondly, we establish the privacy-constrained minimax lower bound under the general Schatten- norm (as detailed in Theorem 2). Finally, we introduce a private estimator (as detailed in Algorithm 2) that achieves the near-optimal convergence rate under the Frobenius norm. The sensitivity analysis of heavily relies on a spectral representation formula for asymmetric matrices (See Lemma 2).
We prove in Corollary 1 that the private initialization satisfies with high probability (w.h.p.), for a small constant , provided that
1.2 Motivations and related works
The trace regression model has been extensively researched, resulting in well-known optimal procedures and theoretical properties. Both convex (Rohde and Tsybakov, 2011; Koltchinskii et al., 2011; Candes and Plan, 2011; Negahban and Wainwright, 2011) and non-convex methods (Burer and Monteiro, 2003; Chen and Wainwright, 2015; Zheng and Lafferty, 2016; Wei et al., 2016) have achieved the optimal convergence rate of the order without the constraint of differential privacy. However, the DP-constrained minimax rate of low-rank matrix estimation under the trace regression model is still unknown. (Near) optimal DP algorithms have been developed for statistical problems such as learning Gaussians (Kamath et al., 2019; Kuditipudi et al., 2023; Brown et al., 2023) or heavy-tailed distributions (Kamath et al., 2020), (sparse or generalized) linear regression (Wang, 2018; Cai et al., 2021, 2023), and PCA (Blum et al., 2005; Dwork et al., 2014; Chaudhuri et al., 2012; Liu et al., 2022). Previous works on DP regression (Cai et al., 2021, 2023) assume that all measurements have bounded norm. This assumption is a significant limitation when studying the role that the measurements play in the estimation error. Additionally, by treating the measurements as a fixed vector or matrix, their statistical properties are disregarded, and the opportunity for optimal statistical analysis subject to privacy concerns is inevitably lost. For DP low-rank matrix completion, gradient-descent-based algorithms have been proposed (McSherry and Mironov, 2009; Liu et al., 2015; Jain et al., 2018; Chien et al., 2021; Wang et al., 2023) and attain near-optimal sample complexity. However, the problem of sample-efficient, differentially private initialization remains under-explored. Additionally, it is unknown how to establish the minimax risk of low-rank matrix estimation under the constraints of differential privacy, especially when the matrix is asymmetric.
1.3 Organization
Section 2 proposes a DP-initialization algorithm and presents its privacy and utility guarantees. In Section 3, we establish a DP-constrained minimax lower bound (5) for estimating the rank- matrix under the trace regression model. Section 4 presents the DP-estimator based on non-convex optimization and derives the upper bound of the DP-estimator’s error, as stated in (7). We discuss the score attack argument and the non-trivial gap between the upper bound (7) and the DP-constrained minimax lower bound (5) in Section 5. Proofs are given in Appendices A to F.
2 DP-initialization
Section 2.1 presents an -DP initialization , as stated in Algorithm 1. In Section 2.2, we introduce a spectral representation formula (See Lemma 2) that is crucial to the sensitivity analysis of the initialization. With the help of the spectral representation formula, the privacy and utility guarantees of the DP-initialization are given in Section 2.3.
2.1 Algorithm for DP-initialization
We begin with which is unbiased for . Suppose that the leading- left and right singular vectors of are given by the columns of and , respectively. Then, is a non-private estimator for . Let , then we have
It is reasonable to privatize , , and separately. We first privatize the empirical spectral projectors and by the Gaussian mechanism. Thanks to the post-processing property (Dwork et al., 2006), we obtain and whose columns are differentially private and orthogonal. Secondly, we privatize the matrix by the Gaussian mechanism and obtain , which is a private surrogate for . Finally, we take as the DP-initialization. We display the pseudo-code of the proposed DP-initialization in Algorithm 1. The privacy of is guaranteed by the composition property (Dwork et al., 2006).
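The following Python sketch mirrors the structure of the procedure just described (privatize the two spectral projectors, re-extract orthonormal bases by post-processing, then privatize the projected core matrix); it assumes a standard Gaussian design, and the noise scales are placeholders for the calibrated sensitivities of Theorem 1 rather than the paper's exact values.

import numpy as np

def dp_initialization(X, y, r, proj_noise_scale, core_noise_scale, rng=None):
    rng = rng or np.random.default_rng()
    n, d1, d2 = X.shape
    M_hat = np.einsum('n,nij->ij', y, X) / n            # non-private unbiased estimate
    U, _, Vt = np.linalg.svd(M_hat, full_matrices=False)
    U_r, V_r = U[:, :r], Vt[:r, :].T                    # top-r singular subspaces

    def private_basis(B, d, scale):
        # Perturb the spectral projector B B^T with a symmetric Gaussian matrix,
        # then keep its top-r eigenvectors; orthonormality comes from post-processing.
        E = rng.standard_normal((d, d))
        P = B @ B.T + scale * (E + E.T) / np.sqrt(2.0)
        w, Q = np.linalg.eigh(P)
        return Q[:, np.argsort(w)[::-1][:r]]

    U_priv = private_basis(U_r, d1, proj_noise_scale)
    V_priv = private_basis(V_r, d2, proj_noise_scale)
    # Privatize the r x r core matrix and recombine.
    core = U_priv.T @ M_hat @ V_priv + core_noise_scale * rng.standard_normal((r, r))
    return U_priv @ core @ V_priv.T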
We now define the sensitivities of and that appear in Algorithm 1. Let
where is an i.i.d. copy of . Then, the estimator differs from only in the -th pair of observations. Suppose the top- left and right singular vectors of are given by and , respectively. The sensitivity of is defined by
and the sensitivity of is defined similarly. We refer to as the sensitivity of singular subspaces and define the sensitivity
When privatizing by the Gaussian mechanism, the scale of the artificial noise avoids growing with the unnecessarily large and instead grows with the smaller quantity . This benefit motivates us to privatize , , and separately. However, it is technically challenging to characterize due to the non-linear dependence of and on the dataset . To address this challenge, we introduce an explicit spectral representation formula (See Lemma 2) to obtain a sharp upper bound on the sensitivity of the singular subspaces.
2.2 Spectral representation formula
This section introduces a spectral representation formula for asymmetric matrices (See Lemma 2). To begin with, we quickly explain the standard symmetric dilation trick (See e.g., Section 2.1.17 in Tropp et al. (2015)) and define auxiliary operators used in Lemma 2.
Symmetric dilation and auxiliary operators
For any , the symmetric dilation of is a matrix defined by . It is easy to check that and . Further, if we assume that is of rank and has the SVD , then is of rank- and has an eigendecomposition of the form
The eigenvectors of are given by the columns of . For integer , we define operators
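In standard notation (which may differ from the paper's symbols), the symmetric dilation and its eigenstructure can be written explicitly: for $A\in\mathbb{R}^{d_1\times d_2}$ with rank-$r$ SVD $A=U\Sigma V^{\top}$,
\[
\mathcal{D}(A)=\begin{pmatrix}0 & A\\ A^{\top} & 0\end{pmatrix}
=\frac{1}{2}\begin{pmatrix}U & U\\ V & -V\end{pmatrix}
\begin{pmatrix}\Sigma & 0\\ 0 & -\Sigma\end{pmatrix}
\begin{pmatrix}U & U\\ V & -V\end{pmatrix}^{\!\top},
\qquad \|\mathcal{D}(A)\|=\|A\|,
\]
so that $\mathcal{D}(A)$ has rank $2r$ and eigenvalues $\pm\sigma_{1},\dots,\pm\sigma_{r}$.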
Lemma 2 (Spectral representation formula).
Let be any rank- matrix with singular values and be a perturbation of where is the deviation matrix. Suppose the top- left and right singular vectors of and , are given by the columns of , and , , respectively. Suppose that , then
Here, is the symmetric dilation of and the -th order term is a summation of terms defined by where contains non-negative indices and
In Lemma 2, the spectral projectors and of the matrix are explicitly represented in terms of the symmetric dilation of , with the help of the auxiliary operators and for integer . The proof of Lemma 2 is deferred to Appendix E.2. Note that Lemma 2 accommodates a diverging condition number and requires no eigengap condition between non-zero singular values. In the proof of Theorem 1, we shall see that and are mainly contributed by the first-order approximation , where is the symmetric dilation of .
2.3 Privacy and utility guarantees of the initialization
In this section, we study the privacy and utility guarantees of the initialization . Theorem 1 characterizes the sensitivities and needed to guarantee an -DP and presents the upper bounds of and . The proof of Theorem 1 is provided in Appendix A.
Theorem 1 (Privacy and utility guarantees of the initialization ).
Consider i.i.d. observations drawn from the trace regression model stated in (1) where for . Let the true rank- regression coefficient matrix be . Suppose that satisfy Assumption 1. Under the mild condition , there exist absolute constants such that
if Algorithm 1 takes in sensitivities at least and , then Algorithm 1 is guaranteed to be -DP. Moreover, the output of Algorithm 1 satisfies
with probability at least .
In Theorem 1, the sample size condition ensures that the spectral norm of perturbations is small enough, i.e., to apply Lemma 2 and obtain a sharp characterization on . Theorem 1 also provides an upper bound on the sensitivity of pseudo singular values, which is of the order . Based on these results, Algorithm 1 outputs an -DP initialization under the sample size condition
with an upper bound on the error consisting of two terms. The first term accounts for the statistical error of and is greater than the optimal rate (Koltchinskii, 2011). The second term can be further decomposed into the cost of privacy on the singular subspaces, which is of the order , and the cost of privacy arising from privatizing the singular values by the Gaussian mechanism, which is of the order .
Next, Corollary 1 gives the sample size required by a DP-initialization that falls within a local ball of . The proof of Corollary 1 is deferred to Appendix A.5.
Corollary 1.
Under the conditions stated in Theorem 1, as the sample size is sufficiently large
for some absolute constant , then we have, for some small constant ,
(2)
In Corollary 1, the error due to dominates the statistical error and the sample size is required to control these two terms; the sample size controls the error due to privatizing the singular subspaces and singular values. According to Corollary 1, as the sample size , the -DP is guaranteed to fall into a local ball surrounding , as stated in (2). The condition (2) is a prerequisite for Algorithm 2 to converge geometrically, as discussed in Theorem 3.
3 Minimax lower bounds
This section applies DP-Fano’s lemma (See Lemma 3) to establish the DP-constrained minimax lower bound for estimating the matrix under the trace regression model
(3)
Suppose we observe an i.i.d. sample of size drawn from (3). Then, we have
where the underlying matrix . Let be the joint distribution of and ; be the conditional distribution of given ; and denote the joint distribution of and as . It is clear that is given by the distribution of
Let represent the tensor product of marginal laws. For a given matrix where for some constants and , we consider the family of normal distributions under the trace regression model:
By definition, each distribution is indexed by whose columns are the eigenvectors of . Next, we employ DP-Fano’s lemma to derive the minimax lower bound for estimating from a sample drawn from . Let and denote the Kullback-Leibler divergence and the total variation distance between two distributions, respectively.
Lemma 3 (DP-Fano’s lemma, Acharya et al. (2021)).
Let be a family of product measures indexed by a parameter from a pseudo-metric space . Denote the parameter associated with the distribution . Let contain probability measures and there exist constants such that for all , where and . Suppose , then
(4)
where the infimum is taken over all -DP randomized algorithms defined by .
To apply Fano’s lemma, we need to construct a large subset with well-separated elements for . By Lemma 6, there exists a subset with cardinality such that for any ,
for some small constants , where denotes the Schatten-q norm. We then consider the family of distributions
whose cardinality . Let and . As shown in Appendix B, for any , we have
and . We then obtain Theorem 2 by applying Lemma 3 with the bounded KL divergence and TV distance, together with the facts that . The proof of Theorem 2 is deferred to Appendix B.
Theorem 2.
Consider a sample of size drawn from the distribution , then for any and any , there exists a constant
where the infimum is taken over all possible -DP algorithms. It suffices to choose to obtain the bounds in the nuclear norm, Frobenius norm, and spectral norm, respectively. For example, when , there exists a constant
(5)
In Theorem 2, the lower bound (5) consists of two terms, the statistical error and the cost of privacy . The next section proposes a DP-estimator that attains the minimax lower bound (5), up to an additional factor and some logarithmic factors. As a supplement to DP-Fano’s Lemma, which works for , we also employ the score attack argument, which is valid for a wider range where is a constant. Theorem 5 presents the DP-constrained lower bound established by the score attack argument. The statement and proof of Theorem 5 are deferred to Appendix D. We also point out that it is straightforward to derive the minimax lower bound for the case based on DP-Fano’s Lemma since there is no need to apply the symmetrization trick.
4 Upper bounds with differential privacy
In this section, we present Algorithm 2, DP-RGrad, and show that DP-RGrad attains the near-optimal convergence rate for differentially privately estimating low-rank matrices under the trace regression model. Our approach is based on privatizing the Riemannian gradient descent (RGrad) by the Gaussian mechanism. Interested readers may refer to Vandereycken (2013); Edelman et al. (1998); Adler et al. (2002); Absil et al. (2008) for the basics of RGrad. Let the estimate we obtain after iterations be the rank- matrix whose SVD has the form . It is well-known in Absil et al. (2008); Vandereycken (2013) that the tangent space of at is given by . The projection of the gradient onto is which is of rank at most . Let the noisy gradient descent on the tangent space be where is the Gaussian noise matrix. Then, we retract it back to and obtain
(6)
We update the estimate as defined in (6) for , where is the total number of iterations. Thanks to the composition property and the Gaussian mechanism, we only need to ensure that each iteration is -DP. For the trace regression model defined in (1), the empirical mean squared loss is defined as and the empirical Euclidean gradient is . The sensitivity of the -th iteration is
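The following Python sketch shows one iteration of the DP Riemannian gradient update just described, assuming a standard Gaussian design; the step size eta and noise scale are placeholders, and the placement of the Gaussian perturbation reflects one natural reading of the text rather than the paper's exact Algorithm 2.

import numpy as np

def dp_rgrad_step(X, y, M_t, r, eta, noise_scale, rng=None):
    rng = rng or np.random.default_rng()
    n, d1, d2 = X.shape
    U, _, Vt = np.linalg.svd(M_t, full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T

    # Empirical Euclidean gradient of the squared loss.
    residuals = np.einsum('nij,ij->n', X, M_t) - y
    G = np.einsum('n,nij->ij', residuals, X) / n

    # Projection onto the tangent space of the rank-r manifold at M_t.
    PG = U @ (U.T @ G) + (G @ V) @ V.T - U @ (U.T @ G @ V) @ V.T

    # Noisy gradient step on the tangent space, then retraction by rank-r truncated SVD.
    Z = noise_scale * rng.standard_normal((d1, d2))
    W = M_t - eta * (PG + Z)
    Uw, sw, Vwt = np.linalg.svd(W, full_matrices=False)
    return (Uw[:, :r] * sw[:r]) @ Vwt[:r, :]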
Theorem 3 establishes the error bound of the estimator given by Algorithm 2. The proof of Theorem 3 is deferred to Appendix C.
Theorem 3.
Consider i.i.d. observations drawn from the trace regression model stated in (1) where the true low-rank regression coefficient matrix is . Here, for and we assume that satisfy Assumption 1. Under Assumption 1 and the condition that , suppose that Algorithm 2 takes in an -DP initialization such that for some small constant , and the sensitivities take the values
for some absolute constant ; then Algorithm 2 is -differentially private. Moreover, as the sample size
for some small constant , number of iterations , and step size , the output of Algorithm 2 satisfies
with probability at least
According to Theorem 3, the upper bound of can be decomposed into the statistical error and the cost of privacy . The term matches the optimal rate , up to logarithmic factors. However, the term differs from the theoretical lower bound on the cost of privacy by a non-trivial factor , apart from logarithmic factors. In conclusion, Theorem 3 shows that as the sample size , the estimator given by Algorithm 2 attains the near-optimal convergence rate
(7)
The sample size requirement of Theorem 3 has the following interpretation. The sample size is required to guarantee that the RIP condition stated in Lemma 1 holds with high probability. The sample size is necessary to control the statistical error contributed by in each iteration, where is the model noise. The sample size arises from controlling the cost of privacy due to in each iteration.
5 Discussion
In this section, we discuss the non-trivial gap between and . Note that is free of while contains the factor arising from sensitivities
The quantity arises from , as elaborated in (S.12). Here, denotes the sub-Gaussian norm. Therefore, we cannot get rid of the factor once the measurement matrices are subject to differential privacy. In many real applications, however, the measurement matrices are fixed with deterministic designs. People publish to the public with little concern about the privacy of . Although the exposure of alone will not reveal any information about , the privacy of suffers from leakage when the public has access to the joint observations . We, therefore, introduce the following notion of privacy for neighbouring datasets sharing the same measurement matrix.
Definition 1 (weak -differential privacy).
The algorithm that maps into is weak -differentially private over the dataset if for all neighbouring data sets sharing the same measurements and all subsets .
In Theorem 7, Appendix F, we show that as the estimator given by Algorithm 2 attains the optimal convergence rate in the sense of weak differential privacy. The analogs of Theorem 1, Corollary 1 and Theorem 3 in the sense of weak differential privacy can be found as Theorem 6, Corollary 2 and Theorem 7 in Appendix F. It is interesting to explore in future work whether the score attack argument or DP-Fano’s Lemma can be generalized to include the non-trivial factor .
Acknowledgement
The author would like to express sincere gratitude to the Hong Kong PhD Fellowship Scheme and Hong Kong RGC GRF Grant 16301622 for providing financial support for this research. The author also wishes to acknowledge the invaluable guidance provided by Prof. Dong Xia throughout the research process. Additionally, the author would like to extend heartfelt thanks to Mr. Zetao Fei for his constructive criticism during the paper revision. Their contributions have been instrumental in the successful completion of this research.
Appendix A Proof of Theorem 1
The proof of Theorem 1 consists of four parts. In Part A.1, we list several existing results that are useful in the proofs later. In Part A.2, Lemma 2 works as the main technique to derive the sensitivity . Part A.3 derives the sensitivity . Part A.4 establishes the upper bounds of and based on and .
A.1 Part 1
The following Theorem 4, taken from Proposition of Koltchinskii (2011), will be frequently used to establish an upper bound on the spectral norm of a sum of independent random matrices.
Theorem 4 (Bernstein’s inequality, Koltchinskii (2011)).
Let be independent matrices such that for some and all
Let
Then, for some constant and for all ,
Theorem 4 applies to bound the spectral norm of . The existing result for the case of heavy-tailed noise can be found in Theorem 6, Shen et al. (2023). Adapting the existing result to the case of Gaussian noise, we have that for some absolute constant ,
(S.1)
with probability at least . The following lemma, which originates from Lemma 18 of Shen et al. (2023), is useful for analyzing the matrix perturbation due to the singular value decomposition.
Lemma 4 (Matrix Perturbation, Shen et al. (2023)).
Let be a rank- matrix with singular value decomposition of the form where with . For any satisfying , we have
and
where the leading left singular vectors of are given by the columns of and the leading right singular vectors of are given by the columns of .
A.2 Part 2
The second part aims to derive the sensitivity
Before moving on, we present Lemma 5, which provides conclusions on and , frequently used in the proof later. The proof of Lemma 5 can be found in Appendix E.4.
Lemma 5.
The following analysis proceeds under the sample size condition (S.2) and is mainly conditioned on the event , which happens with probability at least .
Step 1: expansion. Conditioned on , we are able to apply Lemma 2 to and and get
Our goal is to bound which satisfies
Step 2: bounding the first order term. By the definition of and ,
(S.4)
conditioned on the event , for some absolute constant .
Step 3: bounding the higher order terms. Let be the index set for terms in
with the cardinality . We define
for and . Since , the higher order terms
(S.5)
It is sufficient to find an upper bound of . Denote
which appeared in the event as an upper bound of .
Conditioned on , for all , and ,
where the first inequality is because is of rank at most . Let be a function defined by , then and for all integer ,
(S.6)
where the last step is due to (S.3), which is guaranteed by the sample size condition (S.2) together with the event . Combining (S.5) and (S.6), since , conditioned on the event , we have
for some absolute constant . In conclusion, conditioned on , as the sample size for some absolute constant , we have for some absolute constant .
Let be a symmetric matrix with entries i.i.d. distributed as for . Then, conditioned on the event and (S.2), for some absolute constant , with probability at least . Moreover, by the Gaussian mechanism, is -DP and thus is also -DP thanks to the post-processing property of differential privacy. Furthermore, by the Davis-Kahan theorem, for some absolute constant
(S.7)
where we apply Lemma 4 in step .
Let be a symmetric matrix with entries i.i.d. distributed as for ; then for some absolute constant , conditioned on the event and (S.2), we have with probability at least . Moreover, by the Gaussian mechanism, is -DP and is also -DP thanks to the post-processing property of differential privacy. Furthermore, by the Davis-Kahan theorem, for some absolute constant
(S.8)
A.3 Part 3
Given the -DP singular vectors and obtained in Part A.2, we derive the sensitivity . Conditioned on the event ,
Let be a matrix with entries i.i.d. distributed as ; then
(S.9)
for some absolute constant with probability at least . Moreover, by the Gaussian mechanism, is -differentially private. Thanks to the composition property of differential privacy, the output of Algorithm 1 is -differentially private.
A.4 Part 4
In this part, we derive the upper bound of . Note that
is a matrix of rank at most . Since
(S.10)
it is sufficient to plug in the upper bound (S.1) of , , as well as the upper bounds (S.7) of , (S.8) of and (S.9) of . In conclusion, conditioned on the event , as the sample size
for some absolute constant , we have
for some absolute constant with probability at least .
A.5 Proof of Corollary 1
Appendix B Proof of Theorem 2
We first present some preliminary results on the KL-divergence and total variation distance between Gaussian distributions. Let and be two -dimensional multivariate Gaussians, then
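For reference, the standard closed form being used here is, in generic notation (which may differ from the paper's symbols): for two $d$-dimensional Gaussians,
\[
D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu_{1},\Sigma_{1})\,\|\,\mathcal{N}(\mu_{2},\Sigma_{2})\bigr)
=\frac{1}{2}\Bigl(\operatorname{tr}\bigl(\Sigma_{2}^{-1}\Sigma_{1}\bigr)
+(\mu_{2}-\mu_{1})^{\top}\Sigma_{2}^{-1}(\mu_{2}-\mu_{1})-d
+\log\frac{\det\Sigma_{2}}{\det\Sigma_{1}}\Bigr).
\]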
Let and where , then
and further by Pinsker’s inequality, we have
For any probability measures , we have
and
The next Lemma 6 states that there exists a sufficiently large subset of such that the elements in the subset are well separated.
Lemma 6.
For any , there exists a subset with cardinality such that for any ,
where denotes the Schatten-q norm and, meanwhile,
To invoke Lemma 3, we define the metric as for any and take ,
for some small absolute constant . Then, by Lemma 3, for any -DP estimator ,
Recall that if . We can take
to get
where the last term is due to a trivial upper bound of . Since is already known, it suffices to estimate differentially privately by the estimator , and an estimator for the matrix is given by . Therefore,
Due to , we have
There is a one-to-one mapping between and . Let , then . Note that
and
Therefore, and
(S.11)
where we use the fact and the infimum is taken over all possible -DP algorithms.
Appendix C Proof of Theorem 3
In Appendix C, we aim to prove Theorem 3. The proof is composed of three parts. In Part C.1, we characterize the sensitivity for iterations and bound . In Part C.2, we use mathematical induction to prove that if the RIP condition holds and both and are bounded with high probability, then and are also bounded with high probability. In Part C.3, we choose an appropriate as the total number of iterations and give the convergence rate of .
C.1 Part 1
In Part C.1, we focus on upper bounding . The first step is to characterize the sensitivity of the -th iteration for . Let
which is the gradient of the -th iteration obtained from a dataset that differs from the original one only in the -th pair of observations. The sensitivity of the gradient descent step on the tangent space is
By the definition of and ,
where for all and
(S.12)
Here, the last inequality uses the fact that for any , the matrix is of rank at most . Since both and are sub-Gaussian with and , we turn to Lemma 7 to upper bound .
Lemma 7 (Vershynin (2018)).
For any sub-Gaussian random variable ,
According to the tail probability of sub-Gaussian random variable stated in Lemma 7, we have with probability at least ,
(S.13)
for some absolute constant . Shen et al. (2023) offers the following result on .
Lemma 8 (Shen et al. (2023)).
Suppose the vectorization of follows mean zero multivariate Gaussian distribution where satisfies . Then, for some constant
It implies and for some constants .
Thus, for some absolute constant , with probability at least
(S.14)
Combining (S.12), (S.13) and (S.14) and taking maximum over , for some constant , we have the event
(S.15)
happens with probability at least . In the event stated in (S.15), the sensitivity still relies on . To get an upper bound that does not depend on , we condition on the event
and obtain that for some absolute constant , the event
(S.16)
happens with the probability
In the -th iteration of Algorithm 2, the matrix and operator are known. Moreover, the rank approximation does not depend on the data set . Thanks to the post-processing property and composition property of differential privacy, we only need to guarantee that is -DP, where the gradient
is the only component that depends on the data set . Let be a matrix with entries i.i.d. normal with variance . Under the condition that and conditioned on the event , we have for some constant
with probability at least .
C.2 Part 2
In Part C.2, we use mathematical induction to prove that when the events
occur with high probability, the event occurs with high probability as well. Here, is defined as the event where the RIP condition of holds; see (S.17).
Step 1: is true with high probability.
We first consider the RIP condition. According to Lemma 1, for any of rank , there exist constants and such that when , with probability at least ,
where . The values of and are defined similarly. Therefore, under the condition that , for some constants ,
(S.17)
happens with probability at least .
As for , we refer to Corollary 1, which shows that as the sample size
the event happens with probability at least . Conditioned on , plugging into the event defined in (S.15), we see that the event defined in (S.16) happens with probability at least . To summarize, we have
Step 2: induction. The following analysis is conditioned on the event . Let be an operator defined by for all . It is easy to check that the adjoint operator of is which is defined by for all . Therefore, where and accordingly,
Our first goal is to upper bound
(S.18)
Lemma 9 characterizes the operators and , which are critical for upper bounding and in (S.18).
Lemma 9 (Wei et al. (2016), Luo and Zhang (2022)).
Suppose the event happens, then the following conclusions hold
1. .
2. ,
where according to Wei et al. (2016).
According to Lemma 9, conditioned on the event
and . The only term in (S.18) that remains unknown is
where is the spectral norm of a summation of i.i.d. mean zero sub-exponential random matrices. We upper bound by Theorem 4. Let for all , then
Since and ,
where the first inequality uses the fact . Similarly, .
Therefore,
Applying Theorem 4 with , and , we have
In conclusion, with probability at least
for some constant .
Conditioned on , we plug , and into (S.18) and obtain that for some small constant and absolute constant
with probability at least where we define
Suppose that , the step size being a small constant, then we have . Further, as for some absolute constant , the sample size satisfies
we have for some small constant ,
Applying Lemma 4, we obtain that under the condition ,
(S.19)
In summary, conditioned on the event , the event
occurs with probability at least . Besides, according to defined in (S.15), the event
occurs with probability . Therefore, conditioned on , the event happens with probability at least
This finishes the induction, and we conclude Part C.2 with
C.3 Part 3
In Part C.3, we derive the convergence rate of and choose an appropriate value for . Conditioned on the event , according to (S.19), with probability at least
Let and , then we have , indicating that there is little reason to run the algorithm further than iterations.
In conclusion,
with probability at least
Appendix D The lower bound derived by score attack argument
Let . This section establishes the minimax lower bound for differentially privately estimating the matrix within the trace regression model based on an alternative approach, the score attack argument (Cai et al., 2023).
The score attack argument involves designing a test statistic and establishing the lower bound of the statistic with the help of a prior distribution on the parameters to estimate. It is unclear, however, how to construct a prior distribution for the low-rank matrix such that the prior complies with the parameter space and the score attack is easy to compute at the same time. Compared to DP-Fano’s Lemma (See Lemma 3), which requires , the score attack argument is valid for a wider range where is a constant. We first define some necessary notation for the elaboration of the score attack argument. For any matrix , we denote by the support of , and the matrix restricted on is , where is the -th canonical basis in and is the -th canonical basis in .
To apply the score attack argument, we relax the problem to deriving minimax lower bounds over . The benefit is that there exists a trivial prior on such that for and otherwise. Similarly, we may consider establishing the minimax lower bound over . For any , there is a trivial prior as well. Let be a randomized algorithm mapping a dataset to a matrix. We define the DP-constrained minimax risk over as
where is taken over all -DP algorithms. Similarly, we define and . Since and , we have
(S.20)
which indicates that the lower bound of will be an immediate result once we successfully lower bound and .
Next, we construct score attacks to derive the lower bounds of and . Let be the pair of a measurement matrix and its corresponding response variable, drawn independently from (3). The score function is defined by
and the score attack is defined by
where is an -DP algorithm estimating ; is a datum for which we want to test whether it belongs to ; the quantity is obtained by restricting to the index set . Under some regularity conditions, the score attack will lead to the lower bound of . Similarly, we derive the lower bound of with the help of the attack
Finally, Theorem 5 establishes the lower bound for estimating the low-rank matrix . The proof of Theorem 5 is given below.
Theorem 5.
Consider i.i.d. observations drawn from the trace regression model defined in (1), where for . We assume that satisfy Assumption 1, , and for some ; then
(S.21)
By Theorem 5, the lower bound of consists of two terms where the first term accounts for the statistical error and the second term is the cost of privacy. The proof for can be found in Rohde and Tsybakov (2011) and the cost of privacy is deduced in the following proof.
Proof of Theorem 5.
We now start proving Theorem 5 by the score attack argument. Throughout the proof, we assume that is an i.i.d. sample drawn from and is a neighbouring data set of obtained by replacing with an independent copy . Besides, we mainly focus on the case and state the result for the case in Remark 1. Let
We derive the lower bound of in three steps. For ease of notation, we define
Step 1: bounding the summation. The following Lemma 10 bounds ; Lemma 11 develops the upper bound of based on discussed in Lemma 10 and a tuning parameter . The proofs of Lemmas 10 and 11 can be found in Appendices E.7 and E.8.
Lemma 10.
For , we have
Lemma 11.
Let be an -DP algorithm with and , under model (1), by choosing , we have
(S.22)
Step 2: lower bounding the summation. Under some regularity conditions, the following Lemma 12 characterizes the quantity as a summation of functions of . Lemma 13 lower bounds the summation of functions by assigning an appropriate prior distribution to . The proofs of Lemmas 12 and 13 can be found in Appendix E.8.
Lemma 12.
If for every is continuously differentiable with respect to and such that , we have
Lemma 12 has its general form stated in Theorem , Cai et al. (2023). Let be a function defined by for all , then
Lemma 13 lower bounds this quantity by assigning the prior distribution to such that for all and otherwise, .
Lemma 13.
Let be distributed according to a density whose marginal densities are for and such that for all , and otherwise, be the density function such that . Then,
Step 3: combining the upper and lower bounds. Combining the lower bound of and the upper bound (S.22) of , we have
Under the assumption that for some , we have , and therefore . Since the sup-risk is greater than the Bayesian risk,
Furthermore, due to , we have and
where is an -DP algorithm that satisfies . This conclusion extends to all differentially private if we assume that such that .
Remark 1.
Lemmas 11 and 13 are also applicable to the case where the parameter space is . For , Lemma 11 implies that
and Lemma 13 results in . Therefore, as for some , the minimax lower bound
where is an -DP algorithm that satisfies . Similarly, this conclusion extends to all differentially private if we assume that such that .
∎
Appendix E Proofs of Technical Lemmas
E.1 Proof of Lemma 1
E.2 Proof of Lemma 2
Lemma 14 (Theorem 1, Xia (2021)).
Let be a rank- symmetric matrix with eigendecomposition of the form , where and the diagonal matrix has the eigenvalues of arranged in non-increasing order. Let be another symmetric matrix and the leading eigenvectors of be given by . Then,
where the -th order term is a summation of terms defined by
where contains non-negative indices and
Lemma 14 provides an explicit representation formula for the spectral projector given that is symmetric and of rank . Since we are interested in the asymmetric rank- matrix , we apply the symmetric dilation trick to and obtain the rank- symmetric matrix , which has an eigendecomposition of the form
The proof is finished by applying Lemma 14 with , , , the rank being , and
E.3 Proof of Lemma 3 and 4
E.4 Proof of Lemma 5
where and
for some absolute constant . Therefore, for some absolute constant , with probability at least ,
Taking maximum over , with probability at least
In (S.1), we have already shown that for some absolute constant ,
(S.24)
with probability at least . Note that for all , and thus
As long as the sample size there exists an absolute constant such that
with probability at least .
E.5 Proof of Lemma 6
By (Pajor, 1998, Proposition 8) and (Koltchinskii and Xia, 2015, Lemma 5), for any , there exists an absolute constant and a subset such that for any , , and the cardinality of is at least . Here, denotes the Schatten- norm of a matrix. In particular, the spectral norm is the Schatten- norm, the Frobenius norm is the Schatten- norm, and the nuclear norm is the Schatten- norm. Let be a small number to be decided later. Now, for each , we define
such that and . This means that, for any , we can construct a . This defines a subset with such that for any ,
and, meanwhile,
E.6 Proof of Lemma 7, 8 and 9
E.7 Proof of Lemma 10
Since is independent of and ,
As for , we apply Jensen’s inequality and have
(S.25)
where the second line is due to and the last inequality is because is independent of . By the definition of and the independence between and ,
Plugging the upper bound of into (S.25),
E.8 Proof of Lemma 11, Proof of Lemma 12, Lemma 13
E.9 Proof of Lemma 14
Appendix F Weak Differential privacy
This section proposes a weaker definition than differential privacy such that the sensitivities are free of .
Definition 2 (weak -differential privacy).
Let be a given data set and be a weak neighbouring data set of , i.e., and differ by at most one pair of observations and sharing the same measurement . The algorithm that maps into is weak -differentially private over the dataset if
(S.26)
for all weak neighbouring data sets and all subsets .
Compared to the standard -DP, weak -differential privacy is a less powerful constraint. Definition 2 only requires the algorithm to preserve the property (S.26) over weak neighbouring datasets, i.e., datasets that differ by at most one pair of observations sharing the same measurement . Specifically, we consider a pair of observations and under the model (1), where the difference is free of the measurement .
Next, we list Theorem 6, Corollary 2, and Theorem 7 as analogues of Theorem 1, Corollary 1, and Theorem 3. All proofs for this section are deferred to the end of this section.
Theorem 6 (Weak DP and utility guarantees of the initialization ).
Consider i.i.d. observations drawn from the trace regression model stated in (1) where for . Let the true low-rank regression coefficient matrix be . Suppose that satisfy Assumption 1. Under the mild condition , there exist absolute constants such that as the sample size , the sensitivity for the leading left and right singular vectors takes the value
the sensitivity for the singular values takes the value
and Algorithm 1 is weak -differentially private. Moreover,
with probability at least .
Theorem 6 requires the same sample size condition as Theorem 1; however, the sensitivities and derived under weak DP differ from their DP counterparts and by the factor . This leads to a smaller cost of privacy than the one obtained under the stronger standard DP constraint, as presented in Theorem 1.
Corollary 2.
Under the conditions stated in Theorem 6, as the sample size is sufficiently large such that for some absolute constant ,
we have for some small constant ,
Theorem 7.
Consider i.i.d. observations drawn from the trace regression model stated in (1) where the true low-rank regression coefficient matrix is . Here, for and we assume that satisfy Assumption 1 and . Suppose the weak -DP initialization satisfies 2; then Algorithm 2 is weak -differentially private with the sensitivities
for some absolute constant . Moreover, as the sample size
for some small constant , number of iterations , and step size , the output of Algorithm 2 satisfies
with probability at least
Theorem 7 shows that as the sample size the estimator given by Algorithm 2 attains the optimal convergence rate in the sense of weak differential privacy.
The proofs of Theorems 6 and 7 and Corollary 2 are a straightforward consequence of replacing the first part of Lemma 5 with the following Lemma 15.
Lemma 15.
References
- Abowd et al. (2020) Abowd, J. M., I. M. Rodriguez, W. N. Sexton, P. E. Singer, and L. Vilhuber (2020). The modernization of statistical disclosure limitation at the us census bureau. US Census Bureau.
- Absil et al. (2008) Absil, P.-A., R. Mahony, and R. Sepulchre (2008). Optimization Algorithms on Matrix Manifolds. Princeton, NJ: Princeton University Press.
- Acharya et al. (2021) Acharya, J., Z. Sun, and H. Zhang (2021). Differentially private assouad, fano, and le cam. In Algorithmic Learning Theory, pp. 48–78. PMLR.
- Adler et al. (2002) Adler, R. L., J. Dedieu, J. Y. Margulies, M. Martens, and M. Shub (2002, 07). Newton’s method on Riemannian manifolds and a geometric model for the human spine. IMA Journal of Numerical Analysis 22(3), 359–390.
- Apple Differential Privacy Team (2017) Apple Differential Privacy Team (2017). Learning with privacy at scale.
- Blum et al. (2005) Blum, A., C. Dwork, F. McSherry, and K. Nissim (2005, 06). Practical privacy: The sulq framework. Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 128–138.
- Brown et al. (2023) Brown, G., S. Hopkins, and A. Smith (2023, 12–15 Jul). Fast, sample-efficient, affine-invariant private mean and covariance estimation for subgaussian distributions. In G. Neu and L. Rosasco (Eds.), Proceedings of Thirty Sixth Conference on Learning Theory, Volume 195 of Proceedings of Machine Learning Research, pp. 5578–5579. PMLR.
- Burer and Monteiro (2003) Burer, S. and R. D. Monteiro (2003). A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming 95(2), 329–357.
- Cai et al. (2021) Cai, T. T., Y. Wang, and L. Zhang (2021). The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. The Annals of Statistics 49, 2825–2850.
- Cai et al. (2023) Cai, T. T., Y. Wang, and L. Zhang (2023). Score attack: A lower bound technique for optimal differentially private learning. arXiv preprint arXiv:2303.07152.
- Cai et al. (2024) Cai, T. T., D. Xia, and M. Zha (2024). Optimal differentially private pca and estimation for spiked covariance matrices. arXiv preprint arXiv:2401.03820.
- Candes and Plan (2011) Candes, E. J. and Y. Plan (2011). Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory 57(4), 2342–2359.
- Candes and Tao (2005) Candes, E. J. and T. Tao (2005). Decoding by linear programming. IEEE transactions on information theory 51(12), 4203–4215.
- Chaudhuri et al. (2012) Chaudhuri, K., A. Sarwate, and K. Sinha (2012). Near-optimal differentially private principal components. Advances in Neural Information Processing Systems 25.
- Chen et al. (2019) Chen, H., G. Raskutti, and M. Yuan (2019). Non-convex projected gradient descent for generalized low-rank tensor regression. The Journal of Machine Learning Research 20(1), 172–208.
- Chen and Wainwright (2015) Chen, Y. and M. J. Wainwright (2015). Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees.
- Chien et al. (2021) Chien, S., P. Jain, W. Krichene, S. Rendle, S. Song, A. Thakurta, and L. Zhang (2021). Private alternating least squares: Practical private matrix completion with tighter rates. pp. 1877–1887.
- Dwork et al. (2006) Dwork, C., F. McSherry, K. Nissim, and A. Smith (2006). Calibrating noise to sensitivity in private data analysis. Theory of Cryptography Conference, 265–284.
- Dwork et al. (2014) Dwork, C., A. Roth, et al. (2014). The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9(3–4), 211–407.
- Dwork et al. (2014) Dwork, C., K. Talwar, A. Thakurta, and L. Zhang (2014). Analyze gauss: optimal bounds for privacy-preserving principal component analysis. Proceedings of the forty-sixth annual ACM symposium on Theory of computing, 11–20.
- Edelman et al. (1998) Edelman, A., T. A. Arias, and S. T. Smith (1998). The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications 20(2), 303–353.
- Erlingsson et al. (2014) Erlingsson, U., V. Pihur, and A. Korolova (2014). Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, CCS ’14, New York, NY, USA, pp. 1054–1067. Association for Computing Machinery.
- Hamidi and Bayati (2022) Hamidi, N. and M. Bayati (2022). On low-rank trace regression under general sampling distribution. Journal of Machine Learning Research 23(321), 1–49.
- Jain et al. (2018) Jain, P., O. D. Thakkar, and A. Thakurta (2018). Differentially private matrix completion revisited. pp. 2215–2224.
- Kamath et al. (2019) Kamath, G., J. Li, V. Singhal, and J. Ullman (2019). Privately learning high-dimensional distributions. In Conference on Learning Theory, pp. 1853–1902. PMLR.
- Kamath et al. (2020) Kamath, G., V. Singhal, and J. Ullman (2020, 09–12 Jul). Private mean estimation of heavy-tailed distributions. In J. Abernethy and S. Agarwal (Eds.), Proceedings of Thirty Third Conference on Learning Theory, Volume 125 of Proceedings of Machine Learning Research, pp. 2204–2235. PMLR.
- Koltchinskii (2011) Koltchinskii, V. (2011). Von neumann entropy penalization and low-rank matrix estimation.
- Koltchinskii et al. (2011) Koltchinskii, V., K. Lounici, and A. B. Tsybakov (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion.
- Koltchinskii and Xia (2015) Koltchinskii, V. and D. Xia (2015). Optimal estimation of low rank density matrices. J. Mach. Learn. Res. 16(53), 1757–1792.
- Kuditipudi et al. (2023) Kuditipudi, R., J. Duchi, and S. Haque (2023, 12–15 Jul). A pretty fast algorithm for adaptive private mean estimation. In G. Neu and L. Rosasco (Eds.), Proceedings of Thirty Sixth Conference on Learning Theory, Volume 195 of Proceedings of Machine Learning Research, pp. 2511–2551. PMLR.
- Liu et al. (2022) Liu, X., W. Kong, P. Jain, and S. Oh (2022). Dp-pca: Statistically optimal and differentially private pca. Advances in Neural Information Processing Systems 35, 29929–29943.
- Liu et al. (2015) Liu, Z., Y.-X. Wang, and A. Smola (2015). Fast differentially private matrix factorization. In Proceedings of the 9th ACM Conference on Recommender Systems, pp. 171–178.
- Luo and Zhang (2022) Luo, Y. and A. R. Zhang (2022). Tensor-on-tensor regression: Riemannian optimization, over-parameterization, statistical-computational gap, and their interplay. arXiv preprint arXiv:2206.08756.
- McSherry and Mironov (2009) McSherry, F. and I. Mironov (2009). Differentially private recommender systems: Building privacy into the netflix prize contenders. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 627–636.
- Negahban and Wainwright (2011) Negahban, S. and M. J. Wainwright (2011). Estimation of (near) low-rank matrices with noise and high-dimensional scaling.
- Pajor (1998) Pajor, A. (1998). Metric entropy of the grassmann manifold. Convex Geometric Analysis 34(181-188), 0942–46013.
- Rohde and Tsybakov (2011) Rohde, A. and A. B. Tsybakov (2011). Estimation of high-dimensional low-rank matrices.
- Shen et al. (2023) Shen, Y., J. Li, J.-F. Cai, and D. Xia (2023). Computationally efficient and statistically optimal robust high-dimensional linear regression. arXiv preprint arXiv:2305.06199.
- Tropp et al. (2015) Tropp, J. A. et al. (2015). An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning 8(1-2), 1–230.
- Vadhan (2017) Vadhan, S. (2017). The complexity of differential privacy. Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, 347–450.
- Vandereycken (2013) Vandereycken, B. (2013). Low-rank matrix completion by riemannian optimization. SIAM Journal on Optimization 23(2), 1214–1236.
- Vershynin (2015) Vershynin, R. (2015). Estimation in high dimensions: a geometric perspective. In Sampling Theory, a Renaissance: Compressive Sensing and Other Developments, pp. 3–66. Springer.
- Vershynin (2018) Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science, Volume 47. Cambridge university press.
- Wang et al. (2023) Wang, L., B. Zhao, and M. Kolar (2023, 25–27 Apr). Differentially private matrix completion through low-rank matrix factorization. 206, 5731–5748.
- Wang (2018) Wang, Y.-X. (2018). Revisiting differentially private linear regression: optimal and adaptive prediction and estimation in unbounded domain. In UAI 2018.
- Wei et al. (2016) Wei, K., J.-F. Cai, T. F. Chan, and S. Leung (2016). Guarantees of riemannian optimization for low rank matrix recovery. SIAM Journal on Matrix Analysis and Applications 37(3), 1198–1222.
- Xia (2021) Xia, D. (2021). Normal approximation and confidence region of singular subspaces. Electronic Journal of Statistics 15(2), 3798–3851.
- Zheng and Lafferty (2016) Zheng, Q. and J. Lafferty (2016). A convergent gradient descent algorithm for rank minimization and semidefinite programming from random linear measurements.