Differentially private analysis of networks with covariates via a generalized β-model
Abstract
How to balance the tradeoff between privacy and utility is one of the fundamental problems in private data analysis. In this paper, we give a rigorous differential privacy analysis of networks in the presence of covariates via a generalized β-model, which has an n-dimensional degree parameter β and a p-dimensional homophily parameter γ. Under ε-edge differential privacy, we use the popular Laplace mechanism to release the network statistics. The method of moments is used to estimate the unknown model parameters. We establish conditions guaranteeing consistency of the differentially private estimators β̂ and γ̂ as the number of nodes n goes to infinity, which reveal an interesting tradeoff between the privacy parameter and the model parameters. The consistency is shown by applying a two-stage Newton's method to obtain an upper bound on the error between (β̂, γ̂) and its true value in terms of the ℓ∞ distance, with convergence rates of rough order n^{-1/2} for β̂ and n^{-1} for γ̂, respectively, up to logarithmic factors. Further, we derive the asymptotic normalities of β̂ and γ̂, whose asymptotic variances are the same as those of the non-private estimators under some conditions. Our paper sheds light on how to explore asymptotic theory under differential privacy in a principled manner; these principled methods should be applicable to a class of network models with covariates beyond the generalized β-model. Numerical studies and a real data analysis demonstrate our theoretical findings.
Keywords: Asymptotic normality; Consistency; Covariate; Differential privacy; Network data
1 Introduction
Social network data may contain sensitive information about relationships between individuals (e.g., friendship, email exchange, sexual interaction) and even about the individuals themselves (e.g., respondents in sexual partner networks). Undoubtedly, directly releasing these data to the public for various research purposes would expose individuals' privacy. Even if individuals are anonymized by removing identifiers before the data are made public, they are still vulnerable to de-anonymization attacks [e.g., Narayanan and Shmatikov (2009)]. A randomized data releasing mechanism that injects random noise into the original data (i.e., input perturbation) or into the aggregate statistics returned to queries (i.e., output perturbation) provides an alternative way to protect data privacy. To rigorously restrict privacy leakage, Dwork et al. (2006) developed a privacy notion, differential privacy, which requires that the output of a query does not change too much if any single individual's record is added to or removed from the database. Since then, it has been widely accepted as a privacy standard for releasing sensitive data.
Many differentially private algorithms have been proposed to release network data or their aggregate statistics, especially in the computer science and machine learning literature [e.g., Day et al. (2016); Macwan and Patel (2018); Nguyen et al. (2016); Wang et al. (2022)]. On the other hand, denoising approaches have been developed to improve the estimation of network statistics [e.g., Hay et al. (2009); Karwa and Slavković (2016); Yan (2021)]. However, differentially private inference in network models is still in its infancy. This is partly because network data are nonstandard and asymptotic analysis is usually based on only one observed network. The increasing dimension of parameters and the presence of noise pose additional challenges as well [Fienberg (2012)]. Recently, Karwa and Slavković (2016) derived consistency and asymptotic normality of the differentially private estimator of the parameter constructed from the denoised degree sequence in the β-model, which is an exponential random graph model with the degree sequence as the sufficient statistic [Chatterjee et al. (2011)]. Yan (2021) derived asymptotic properties of the differentially private estimators in the p0 model for directed networks. Despite these recent developments, to the best of our knowledge, differentially private analyses of networks in the presence of covariates have not been explored, in that neither their releasing methods nor their theoretical properties are well understood.
The covariates of nodes could have important implications for link formation. A common phenomenon in social and econometric network data is that individuals tend to form connections with those like themselves, which is referred to as homophily. Therefore, it is of interest to see how covariates influence differentially private estimation in network models. The aim of this paper is to give a rigorous differentially private analysis of networks in the presence of covariates via a generalized β-model. It contains an n-dimensional degree parameter β characterizing the variation in the node degrees and a p-dimensional regression parameter γ of covariates measuring homophilic or heterophilic effects. A detailed description is given in Section 2.1. This model was introduced in Graham (2017) to model economic networks, and a directed version was proposed in Yan et al. (2019). In this model, the degree sequence and the covariate term are the sufficient statistics, so it suffices to treat only this information as the privacy content of the model. Under ε-edge differential privacy, we propose a joint Laplace mechanism to release these network statistics, adding discrete Laplace noise to the degree sequence and continuous Laplace noise to the covariate term.
We construct estimating equations to infer the model parameters based on the original maximum likelihood equations, in which the original network statistics are directly replaced by their noisy outputs. Owing to the noises having zero mean, these equations are the same as the moment equations. We develop new approaches to establish the asymptotic theory of differentially private estimators. The main contributions are as follows. First, we establish conditions on the privacy parameter and the model parameters that guarantee consistency of the differentially private estimator; these conditions control the trade-off between privacy and utility. A key idea of the proof is a two-stage Newton's method that first obtains an upper bound on the ℓ∞ error between β̂(γ) and β* for a given γ, and then derives an upper bound on the error between γ̂ and γ* by using a profiled function, where β̂ and γ̂ are the differentially private estimators of β and γ, respectively. As a result, we obtain convergence rates of β̂ and γ̂ of respective rough orders n^{-1/2} and n^{-1}, both up to a logarithm factor. Notably, the convergence rate for γ̂ matches the minimax optimal upper bound for the Lasso estimator in the linear model with a p-dimensional parameter in Lounici (2008), with the number of node pairs playing the role of the sample size. Second, we derive the asymptotic normal distributions of β̂ and γ̂. This is proved by applying Taylor expansions to a series of functions constructed from the estimating equations and showing that the various remainder terms in the expansions are asymptotically negligible. The convergence rate of γ̂ ensures that the asymptotic distribution of β̂ does not depend on γ̂ and therefore has no bias. The asymptotic distribution of the homophily parameter estimator γ̂ contains a bias term in the form of a weighted sum of covariates. Finally, we provide simulation studies as well as a real data analysis to illustrate the theoretical results.
We note that Karwa and Slavković (2016) obtained asymptotic results for the edge-differentially private estimator based on a denoising process, while our asymptotic results do not require denoising. Another important difference from Karwa and Slavković (2016) is that we characterize how the errors of the estimators depend on the privacy parameter, and we do not assume that all parameters are bounded above by a constant in the asymptotic theory.
The rest of the paper proceeds as follows. In Section 2, we give the necessary background on the generalized β-model and differential privacy. In Section 3, we present the estimation. In Section 4, we present the consistency and asymptotic normality of the differentially private estimator. We carry out simulations and illustrate our results with a real data analysis in Section 5. We give a summary and further discussion in Section 6. The proofs of the main results are relegated to Section 7. The proofs of the supporting lemmas are given in the supplementary material.
2 Model and differential privacy
In this section, we introduce the generalized β-model with covariates and present the necessary background on differential privacy.
2.1 Generalized β-model
Let G be an undirected graph on n nodes labeled 1, …, n. Let A = (a_ij) be the adjacency matrix of G, where a_ij is an indicator denoting whether node i is connected to node j. That is, a_ij = 1 if there is a link between i and j, and a_ij = 0 otherwise. We do not consider self-loops here, i.e., a_ii = 0. Let d_i = Σ_{j≠i} a_ij be the degree of node i and d = (d_1, …, d_n)^⊤ be the degree sequence of the graph G. We also observe a p-dimensional vector z_ij, the covariate information attached to the edge between nodes i and j. The covariate z_ij can be formed according to the similarity or dissimilarity between nodal attributes x_i and x_j for nodes i and j. Specifically, z_ij can be represented through a symmetric function g(x_i, x_j) with x_i and x_j as its arguments. As an example, if x_i is an indicator of gender (e.g., 1 for male and 0 for female), then we could use z_ij = 1{x_i = x_j} to denote the similarity or dissimilarity measurement between x_i and x_j.
The β-model with covariates [Graham (2017); Yan et al. (2019)] assumes that the edge a_ij between i and j, conditional on the unobserved degree effects and observed covariates, has the following probability:

P(a_ij = 1) = exp(β_i + β_j + z_ij^⊤ γ) / (1 + exp(β_i + β_j + z_ij^⊤ γ)),   (1)
independent of other edges. The parameter β_i is the intrinsic individual effect that reflects the heterogeneity of node i in participating in network connections. The common parameter γ is exogenous, measuring the homophilic or heterophilic effect. A larger homophily component z_ij^⊤ γ means a larger homophily effect. We will refer to γ as the homophily parameter hereafter although it could also represent a heterophilic measurement. Hereafter, we call model (1) the covariate-adjusted β-model.
2.2 Differential privacy
Given an original database D with records of n persons, we consider a randomized data releasing mechanism Q that takes D as input and outputs a sanitized database for public use. As an illustrative example, the additive noise mechanism returns the answer f(D) + e to the query f, where e is a random noise. Let ε be a positive real number and S denote the sample space of Q. The data releasing mechanism Q is ε-differentially private if for any two neighboring databases D and D′ that differ on a single element (i.e., the data of one person), and all measurable subsets B of S [Dwork et al. (2006)],

P(Q(D) ∈ B) ≤ e^ε · P(Q(D′) ∈ B).
This says that the probability of an output given the input D is at most the probability of the same output given the input D′ multiplied by the privacy factor e^ε. The privacy parameter ε is chosen according to the privacy policy and controls the trade-off between privacy and utility; it is generally public. A smaller value of ε means more privacy protection.
Differential privacy requires that the distribution of the output is almost the same whether or not an individual's record appears in the original database. We illustrate why it protects privacy with an example. Suppose a hospital wants to release some statistics on the medical records of its patients to the public. In response, a patient may wish to have his record omitted from the study due to the privacy concern that the published results will reveal something about him personally. Differential privacy alleviates this concern because whether or not the patient participates in the study, the probability of a possible output is almost the same. From a theoretical point of view, any test statistic has nearly no power for testing whether an individual's data is in the original database or not [Wasserman and Zhou (2010)].
What is being protected under differential privacy is precisely the difference between two neighboring databases. For network data, depending on the definition of a graph neighbor, differential privacy is divided into k-node differential privacy [Hay et al. (2009)] and k-edge differential privacy [Nissim et al. (2007)]. If two graphs are called neighbors when they differ in exactly k edges, the resulting notion is k-edge differential privacy. The special case with k = 1 is generally referred to as edge differential privacy. Analogously, we can define k-node differential privacy by letting graphs be neighbors if one can be obtained from the other by removing k nodes and their adjacent edges. Edge differential privacy protects edges from being detected, whereas node differential privacy protects nodes together with their adjacent edges, which is a stronger privacy policy. However, it may be infeasible to design algorithms that are node differentially private and still have good utility, since this generally requires a large amount of noise [e.g., Hay et al. (2009)]. Following Hay et al. (2009) and Karwa and Slavković (2016), we use edge differential privacy here.
Let δ(G, G′) be the Hamming distance between two graphs G and G′, i.e., the number of edges on which G and G′ differ. The formal definition of ε-edge differential privacy is as follows.
Definition 1 (Edge differential privacy).
Let ε > 0 be a privacy parameter. A randomized mechanism Q is ε-edge differentially private if

sup_{G, G′ ∈ 𝒢: δ(G, G′) = 1}  sup_{B}  P(Q(G) ∈ B) / P(Q(G′) ∈ B) ≤ e^ε,

where 𝒢 is the set of all graphs of interest on n nodes and B ranges over the set of all possible outputs.
Let f : 𝒢 → ℝ^k be a function. The global sensitivity [Dwork et al. (2006)] of the function f, denoted Δ(f), is defined below.
Definition 2.
(Global Sensitivity). Let f : 𝒢 → ℝ^k. The global sensitivity of f is defined as

Δ(f) = max_{δ(G, G′) = 1} ‖f(G) − f(G′)‖_1,

where ‖·‖_1 is the ℓ_1 norm.
The global sensitivity measures the largest change of the query function f, in terms of the ℓ_1-norm, between any two neighboring graphs. The magnitude of the noise added in a differentially private algorithm crucially depends on the global sensitivity. If the outputs are network statistics, then a simple algorithm to guarantee edge differential privacy is the Laplace mechanism [e.g., Dwork et al. (2006)], which adds Laplacian noise proportional to the global sensitivity of f.
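To make the notion of global sensitivity concrete, the following minimal sketch (with hypothetical function names, not taken from the paper) checks numerically that flipping any single edge changes the degree sequence by exactly 2 in the ℓ_1 norm, whether the edge is removed or added.

```python
import numpy as np

def degree_sequence(adj):
    """Degree sequence of a simple undirected graph (the query f)."""
    return adj.sum(axis=1)

def l1_sensitivity_one_edge(adj, i, j):
    """L1 change in the degree sequence after flipping edge (i, j)."""
    neighbor = adj.copy()
    neighbor[i, j] = neighbor[j, i] = 1 - neighbor[i, j]
    return np.abs(degree_sequence(neighbor) - degree_sequence(adj)).sum()

# Flipping one edge changes exactly two degrees by 1 each, so the
# global sensitivity of the degree sequence under edge flips is 2.
adj = np.zeros((5, 5), dtype=int)
adj[0, 1] = adj[1, 0] = 1
removed = l1_sensitivity_one_edge(adj, 0, 1)  # flip an existing edge
added = l1_sensitivity_one_edge(adj, 2, 3)    # flip a missing edge
```

Both quantities equal 2, matching the sensitivity used for the degree sequence later in Section 3.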
Lemma 1.
(Laplace Mechanism). Suppose that f is an output function in ℝ^k. Let e_1, …, e_k be independently and identically distributed Laplace random variables with density function (λ/2) e^{−λ|x|}. Then the Laplace mechanism that outputs f(G) + (e_1, …, e_k)^⊤ is ε-edge differentially private, where ε = λ · Δ(f).
When f(G) is integer-valued, one can use a discrete Laplace random variable as the noise, which has the probability mass function

P(X = x) = ((1 − λ)/(1 + λ)) λ^{|x|},  x ∈ ℤ,  λ ∈ (0, 1). (4)
Lemma 1 still holds if the continuous Laplace distribution is replaced by its discrete version with λ = e^{−ε/Δ(f)}; see Karwa and Slavković (2016).
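As a sketch of how the discrete Laplace noise in (4) can be generated in practice, the snippet below uses the standard representation of a discrete Laplace variable as the difference of two i.i.d. geometric variables (counting failures before the first success). The function name and the choice ε = 1 are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sample_discrete_laplace(lam, size, rng):
    """Draw from the discrete Laplace pmf p(x) = (1-lam)/(1+lam) * lam**|x|,
    x integer, 0 < lam < 1, as the difference of two i.i.d. geometric
    variables on {0, 1, 2, ...}."""
    # numpy's Generator.geometric counts trials (support {1, 2, ...}),
    # so subtract 1 to get the number of failures before the first success.
    g1 = rng.geometric(1.0 - lam, size) - 1
    g2 = rng.geometric(1.0 - lam, size) - 1
    return g1 - g2

# For a query with global sensitivity Delta under epsilon-edge DP,
# the discrete Laplace mechanism uses lam = exp(-epsilon / Delta).
rng = np.random.default_rng(0)
eps, delta = 1.0, 2  # e.g. the degree sequence, with Delta = 2
noise = sample_discrete_laplace(np.exp(-eps / delta), 100_000, rng)
```

The draws are integer-valued and symmetric about zero, as required for adding to the (integer) degree sequence.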
We introduce a useful property of differential privacy: any function of the output of a differentially private mechanism is also differentially private, as stated in the lemma below.
Lemma 2 (Dwork et al., (2006)).
Let f(G) be the output of an ε-differentially private mechanism and g be any function. Then g(f(G)) is also ε-differentially private.
3 Releasing network statistics and estimation
3.1 Releasing
From the log-likelihood function (2), we know that (d, Z) with Z = Σ_{i<j} a_ij z_ij is the sufficient statistic. Thus, the private information in the covariate-adjusted β-model is (d, Z). We use the discrete version of the Laplace mechanism in Lemma 1 to release the degree sequence d and the continuous Laplace mechanism to release the covariate statistic Z, under ε_{n1}- and ε_{n2}-edge differential privacy, respectively. The joint mechanism satisfies (ε_{n1} + ε_{n2})-edge differential privacy. The subscript n means that ε_{n1} and ε_{n2} are allowed to depend on n. If we add or remove one edge in G and denote the induced graph by G′, then the degree sequence changes by at most 2 in the ℓ_1 norm and Z changes by at most max_{i<j} ‖z_ij‖_1. So the global sensitivity is 2 for d and max_{i<j} ‖z_ij‖_1 for Z. We release the sufficient statistics d and Z as follows:

d̃ = d + (e_1, …, e_n)^⊤,  Z̃ = Z + (e′_1, …, e′_p)^⊤, (7)

where e_1, …, e_n are independently generated from the discrete Laplace distribution with λ = e^{−ε_{n1}/2}, and e′_1, …, e′_p are independently generated from the Laplace distribution with parameter λ′ = ε_{n2} / max_{i<j} ‖z_ij‖_1.
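The release step (7) can be sketched as follows. This is an illustrative implementation, not the paper's code: the noise scales are assumed from the sensitivities discussed above (2 for the degree sequence, max_{i<j} ‖z_ij‖_1 for the covariate statistic), and all names are hypothetical.

```python
import numpy as np

def release_statistics(adj, z, eps1, eps2, rng):
    """Noisy release of the sufficient statistics (d, Z) under edge DP.
    adj: n x n 0/1 adjacency matrix; z: n x n x p symmetric edge covariates.
    Assumed sensitivities: 2 for d, max_{i<j} ||z_ij||_1 for Z."""
    n = adj.shape[0]
    d = adj.sum(axis=1)
    iu = np.triu_indices(n, k=1)
    Z = (adj[iu][:, None] * z[iu]).sum(axis=0)  # sum_{i<j} a_ij z_ij
    # Discrete Laplace noise for the (integer) degrees.
    lam = np.exp(-eps1 / 2.0)
    e = (rng.geometric(1 - lam, n) - 1) - (rng.geometric(1 - lam, n) - 1)
    # Continuous Laplace noise for the covariate statistic.
    scale = np.abs(z[iu]).sum(axis=1).max() / eps2
    e_prime = rng.laplace(0.0, scale, z.shape[-1])
    return d + e, Z + e_prime
```

The returned pair plays the role of (d̃, Z̃); any downstream estimator built from it inherits the privacy guarantee by Lemma 2.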
3.2 Estimation
Write π_ij = β_i + β_j + z_ij^⊤ γ. Define

μ(x) = e^x / (1 + e^x). (8)

It is clear that μ(π_ij) is the expectation of a_ij. When we emphasize the arguments β and γ in π_ij, we write π_ij(β, γ) instead of π_ij. To estimate the model parameters, we directly replace d and Z in the maximum likelihood equations (3) with their noisy observed values d̃ and Z̃:

d̃_i = Σ_{j≠i} μ(π_ij),  i = 1, …, n,
Z̃ = Σ_{i<j} μ(π_ij) z_ij. (9)

Because the expectations of the noises are zero, the above equations are the same as the moment equations.
Let (β̂, γ̂) be the solution to the equations (9). Since (d̃, Z̃) satisfies (ε_{n1} + ε_{n2})-edge differential privacy, (β̂, γ̂) is also (ε_{n1} + ε_{n2})-edge differentially private according to Lemma 2. A two-step iterative algorithm, alternating between solving the first equation in (9) via the fixed point method of Chatterjee et al. (2011) for a given γ and solving the second equation in (9) via Newton's method or gradient descent, can be employed to obtain the solution.
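The first step of the alternating algorithm, the fixed point update for β at a fixed γ, can be sketched as below. The update formula, written in the spirit of Chatterjee et al. (2011) and adapted to the covariate term, is our assumption for this model; names, iteration counts and tolerances are illustrative.

```python
import numpy as np

def fit_beta_fixed_point(d, z, gamma, n_iter=2000, tol=1e-8):
    """Fixed-point iteration for beta at a fixed gamma (sketch):
        beta_i <- log d_i - log sum_{j != i}
                  exp(beta_j + z_ij'gamma) / (1 + exp(pi_ij)),
    where pi_ij = beta_i + beta_j + z_ij'gamma.  A fixed point solves
    the moment equations d_i = sum_{j != i} mu(pi_ij).
    d: (noisy or exact) degrees, must be positive; z: n x n x p covariates."""
    n = len(d)
    beta = np.zeros(n)
    zg = z @ np.asarray(gamma)              # n x n matrix of z_ij' gamma
    off = ~np.eye(n, dtype=bool)            # mask out the diagonal (i = j)
    for _ in range(n_iter):
        pi = beta[:, None] + beta[None, :] + zg
        term = np.exp(beta[None, :] + zg) / (1 + np.exp(pi))
        new_beta = np.log(d) - np.log((term * off).sum(axis=1))
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta
        beta = new_beta
    return beta
```

With noisy degrees d̃ some entries may be non-positive, in which case this iteration (like the MLE itself) fails to exist, consistent with the discussion in Section 6.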
4 Asymptotic properties
In this section, we present the consistency and asymptotic normality of the differentially private estimator (β̂, γ̂). We first introduce some notation. For a subset C ⊂ ℝ^n, let C° and C̄ denote the interior and closure of C, respectively. For a vector x = (x_1, …, x_n)^⊤ ∈ ℝ^n, ‖x‖ denotes a general norm on vectors, with the special cases ‖x‖_∞ = max_i |x_i| and ‖x‖_1 = Σ_i |x_i| for the ℓ_∞- and ℓ_1-norms of x, respectively. Let B(x, ε) = {y : ‖x − y‖_∞ ≤ ε} be an ε-neighborhood of x. For an n×n matrix J = (J_ij), ‖J‖_∞ denotes the matrix norm induced by the ℓ_∞-norm on vectors in ℝ^n, i.e.,

‖J‖_∞ = max_{x≠0} ‖Jx‖_∞ / ‖x‖_∞ = max_i Σ_j |J_ij|,

and ‖J‖ denotes a general matrix norm. Define the matrix maximum norm ‖J‖_max = max_{i,j} |J_ij|. We use the superscript “*” to denote the true parameter under which the data are generated; when there is no ambiguity, we omit the superscript “*”. The notation Σ_{i<j} is shorthand for Σ_{i=1}^{n} Σ_{j=i+1}^{n}.
Recall that μ(x) = e^x/(1 + e^x). Write μ′(x), μ″(x) and μ‴(x) for the first, second and third derivatives of μ(x) with respect to x, respectively. A direct calculation gives that
It is easily checked that
(10) |
Let and be two small positive numbers. Note that is a decreasing function of when and . Recall that . Define
(11) |
In other words, we have
Note that π_ij(β, γ) = π_ji(β, γ). When this causes no confusion, we will simply write π_ij instead of π_ij(β, γ) for shorthand, and we will use the two notations interchangeably. Hereafter, we assume that the dimension p of z_ij is fixed.
4.1 Consistency
To derive consistency of the differentially private estimator, let us first define a system of functions based on the estimating equations (9). Define
(12) |
and F(β, γ) = (F_1(β, γ), …, F_n(β, γ))^⊤. Further, we define F_γ(β) as the value of F(β, γ) for an arbitrarily fixed γ. Let β̂(γ) be a solution to F_γ(β) = 0. Correspondingly, we define two functions for exploring the asymptotic behavior of the estimator of the homophily parameter:
(13) | |||
(14) |
The latter can be viewed as the profile function of γ, in which the degree parameter β is profiled out. It is clear that
By the chain rule, we have
(15) | |||
(16) |
Solving (15) and substituting the result into (16), we obtain the Jacobian matrix:
(17) |
where
The asymptotic behavior of γ̂ crucially depends on this Jacobian matrix. Since β̂(γ) does not have a closed form, conditions imposed directly on the Jacobian are not easily checked. To derive feasible conditions, we define
(18) |
which is a general form of . Note that is the Fisher information matrix of the concentrated likelihood function , where the degree parameter is profiled out. When and , we have the approximation:
whose proof is given in the supplementary material. We assume that there exists a number such that
Note that the dimension of γ is fixed and each of its entries is a sum over all node pairs. If the matrix converges to a constant matrix, then it is bounded. Moreover, if it is positive definite, then
where is the smallest eigenvalue of .
We use a two-stage Newton iterative scheme to show consistency. In the first stage, we obtain an upper bound on the error between β̂(γ) and β* in terms of the ℓ∞ norm for a given γ. This is done by verifying the well-known Newton-Kantorovich conditions, under which optimal error bounds are established. We then derive an upper bound on the error between γ̂ and γ* by using a profiled function constructed from the estimating equations. Now we formally state the consistency result.
Theorem 1.
Let and . If
then the differentially private estimator exists with probability approaching one and is consistent in the sense that
The scaling factor appears because of the magnitude of the statistics involved. Note that the error bound for the parameter estimator in Theorem 3 of Karwa and Slavković (2016) does not depend on the privacy parameter; our result here characterizes how the error bound varies with ε. In view of the above theorem, we present the consistency conditions and the error bounds in two special cases. The first case is that the parameters and covariates are bounded above by a constant; the second is that the noise level goes to zero.
Corollary 1.
Assume that and and are bounded above by a constant. If , then
Corollary 2.
Assume that . If , and , then
Remark 1.
The condition in Corollary 2 means that the outputs are nearly the same as the original network statistics. From Corollary 2, we can see that the MLE of γ has a convergence rate of order n^{-1} up to a logarithmic factor.
4.2 Asymptotic normality of
The asymptotic distribution of depends crucially on the inverse of the Fisher information matrix of . Given , we say an matrix belongs to the matrix class if is a diagonally balanced matrix with positive elements bounded by and , i.e.,
Clearly, belongs to the matrix class when and . We will obtain the asymptotic distribution of the estimator through obtaining its asymptotic expression, which depends on the inverse of . However, its inverse does not have a closed form. Yan et al., (2015) proposed to approximate the inverse of by a diagonal matrix
(19) |
and obtained an upper bound on the approximation error.
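A quick numerical check of this diagonal approximation is sketched below. This is an illustration under assumptions of ours, not the paper's computation: we build an arbitrary diagonally balanced matrix from edge variances p_ij(1 − p_ij) rather than from a fitted model, and compare its exact inverse with the diagonal approximation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 80
# Off-diagonal entries are edge variances p_ij (1 - p_ij) of independent
# edges; each diagonal entry equals its off-diagonal row sum, so the
# matrix is "diagonally balanced" in the sense of the matrix class above.
p = rng.uniform(0.2, 0.8, (n, n))
p = (p + p.T) / 2
v = p * (1 - p)
np.fill_diagonal(v, 0.0)
np.fill_diagonal(v, v.sum(axis=1))

inv_v = np.linalg.inv(v)
s = np.diag(1.0 / np.diag(v))
# Entrywise error of the diagonal approximation: an order of magnitude
# smaller than the diagonal entries 1/v_ii themselves (which are ~1/n).
err = np.abs(inv_v - s).max()
```

For this configuration the error is on the order of 1/n² while the diagonal entries of the inverse are on the order of 1/n, which is the phenomenon the approximation result captures.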
Note that d_i is a sum of independent bounded random variables. By the central limit theorem for the bounded case, as in Loève (1977, p. 289), if v_ii diverges, then the standardized d_i converges in distribution to the standard normal distribution. When considering the asymptotic behavior of a vector of a fixed number of degrees, one can replace the degrees by independent random variables with the same means and variances. Therefore, we have the following proposition.
Proposition 1.
If , then as :
(1)For any fixed , the components of are
asymptotically independent and normally distributed with variances ,
respectively.
(2)More generally, is asymptotically normally distributed with mean zero
and variance whenever are fixed constants and the latter sum is finite.
Part (2) follows from part (1) and the fact that
by Theorem 4.2 of Billingsley (1995). To prove the above equation, it suffices to show that the eigenvalues of the covariance matrix of the fixed collection of standardized degrees are bounded by 2 (for all n). This holds by the well-known Perron-Frobenius theorem: if a symmetric matrix has nonnegative elements, then its largest eigenvalue is at most its largest row sum.
We apply a second-order Taylor expansion to the estimating function to derive an asymptotic expression for β̂. In the expansion, the first-order term involves the inverse Fisher information matrix, which does not have a closed form, so we work with the diagonal matrix defined at (19) to approximate it. By Theorem 1, γ̂ has a convergence rate of order n^{-1} up to a logarithm factor, which ensures that the term involving γ̂ is a remainder term. The second-order term in the expansion is also asymptotically negligible. We then represent β̂ − β* as the sum of a leading linear term in d̃ − E(d) and a remainder term. The central limit theorem is proved by establishing the asymptotic normality of the leading term and showing that the remainder is negligible. We formally state the central limit theorem as follows.
Theorem 2.
Assume that the conditions in Theorem 1 hold. If , then for fixed the vector converges in distribution to the -dimensional multivariate standard normal distribution.
Remark 2.
The asymptotic variance of β̂_i lies between the two stated bounds, which is the same as for the non-private estimator.
4.3 Asymptotic normality of
Let be an -dimensional column vector with th and th elements ones and other elements zeros. Define
Note that , , are independent vectors. By direct calculations,
such that
By Lemma 4, we have
Therefore, all are bounded. Let and . Note that
where the matrix is defined at (18). By the central limit theorem for the bounded case, as in Loève (1977, p. 289), we have the following proposition.
Proposition 2.
For any nonzero fixed vector , if diverges, then
converges in distribution to the standard normal distribution.
Let and
We assume that the above limit exists. We briefly describe the idea of proving the asymptotic normality of γ̂. We use a mean-value expansion to derive an explicit expression for γ̂ − γ*. Then, by applying a third-order Taylor expansion, we find that the first-order term is asymptotically normal, the second-order term is the asymptotic bias, and the third-order term is a remainder. The asymptotic distribution of γ̂ is stated below.
Theorem 3.
Assume that the conditions in Theorem 1 hold. If , then as goes to infinity, converges in distribution to the normal distribution with mean and variance , where
(20) |
Remark 3.
We now discuss the bias term in (20). We assume that z_ij is centered and independently drawn from a p-dimensional multivariate distribution with bounded support. Then, by Hoeffding's (1963) inequality, the bias term vanishes with probability approaching one, and there is no bias in the limiting distribution of γ̂. When the bias is nonzero, the confidence intervals and the p-values of hypothesis tests constructed from γ̂ cannot achieve the nominal level without bias correction. This is the well-known incidental parameter problem in the econometrics literature [Neyman and Scott (1948); Graham (2017)]. As in Dzemski (2019), we could use a simple analytical bias correction formula, subtracting a plug-in estimate of the bias obtained by replacing β and γ with their estimators β̂ and γ̂, respectively.
5 Numerical studies
5.1 Simulations
In this section, we evaluate the performance of the asymptotic theories on networks of finite size. The parameter settings in the simulations are as follows. The degree parameter β* took a linear form across nodes, where we chose four different values for its overall magnitude. Since the conditions guaranteeing the asymptotic properties in the theorems depend on the whole noise scale, we set it to be fixed and let the privacy parameters change with n accordingly; we considered two different privacy levels. The edge covariates were formed as follows. For each node i, we generated two dichotomous random variables x_{i1} and x_{i2} from {0, 1}, with unequal probabilities and with equal probabilities, respectively, and set z_ij by homophilic matching of these attributes. For the parameter γ, the first component was positive, measuring the homophily effect, and the second was negative, measuring a heterophilic effect. We carried out simulations under two different network sizes, n = 100 and n = 200. Each simulation was repeated a large number of times.
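A data-generating sketch for this design is given below. The linear form of β*, the attribute probabilities, and the value γ = (1, −1)^⊤ are placeholder assumptions consistent with, but not guaranteed to match, the exact constants used in the paper.

```python
import numpy as np

def simulate_network(n, L, gamma, rng):
    """Simulate one network from model (1).
    Degree parameters take a linear form beta_i = (i - 1) * L / (n - 1)
    (a placeholder for the paper's setting); edge covariates are built
    from two dichotomous node attributes by homophilic matching."""
    beta = np.arange(n) * L / (n - 1)
    x1 = rng.choice([0, 1], n, p=[0.3, 0.7])  # unequal probabilities
    x2 = rng.choice([0, 1], n)                # equal probabilities
    # z_ij = (1{x1_i = x1_j}, 1{x2_i = x2_j})
    z = np.stack([(x1[:, None] == x1[None, :]).astype(float),
                  (x2[:, None] == x2[None, :]).astype(float)], axis=-1)
    logits = beta[:, None] + beta[None, :] + z @ np.asarray(gamma)
    prob = 1.0 / (1.0 + np.exp(-logits))
    upper = rng.random((n, n)) < prob
    adj = np.triu(upper, k=1)
    adj = (adj | adj.T).astype(int)           # symmetric, zero diagonal
    return adj, z

rng = np.random.default_rng(2024)
adj, z = simulate_network(100, np.log(np.log(100)), (1.0, -1.0), rng)
```

Each simulated replicate would then be released via (7) and fitted by the two-step algorithm of Section 3.2.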
By Theorem 2, the standardized difference β̂_i − β̂_j converges in distribution to the standard normal distribution, where the asymptotic variance is estimated by replacing (β*, γ*) with (β̂, γ̂). We choose five special pairs (i, j) for each n; for n = 100 these are (1, 2), (49, 50), (99, 100), (1, 50) and (1, 100). We record the coverage probability of the confidence interval, the length of the confidence interval, and the frequency with which the estimates do not exist.
100 | (1, 2) | ||||
---|---|---|---|---|---|
(49,50) | |||||
(99, 100) | |||||
(1,50) | |||||
(1, 100) | |||||
200 | (1, 2) | ||||
(99,100) | |||||
(199,200) | |||||
(1,100) | |||||
(1,200) | |||||
100 | (1, 2) | ||||
(49,50) | |||||
(99, 100) | |||||
(1,50) | |||||
(1, 100) | |||||
200 | (1, 2) | ||||
(99,100) | |||||
(199,200) | |||||
(1,100) | |||||
(1,200) |
Table 1 reports the simulation results for β̂. The reported frequencies and lengths are conditional on the event that the estimator exists. We found that the empirical coverage frequencies are very close to the nominal level in the less private settings and a little below it in the more private ones. As expected, the length of the confidence interval increases with the noise scale; conversely, it decreases as n increases. In the most private setting, the estimator failed to exist with a positive frequency; in the other cases, estimates existed in almost every simulation.
Table 2 reports the median of γ̂ as well as that of the bias-corrected estimator. As we can see, the bias is small, and the empirical coverage frequencies for the bias-corrected estimator are closer to the target level than those of the uncorrected estimator, except in the most private settings, where they are a little lower than the target level. On the other hand, when n is fixed, the length of the confidence interval of γ̂ increases as the noise becomes larger.
5.2 A real data example
We use the Enron email dataset as an example [Cohen (2004)], available from https://www.cs.cmu.edu/~enron/. This dataset was released by William Cohen at Carnegie Mellon University and has been widely studied. The Enron email data were originally acquired and made public by the Federal Energy Regulatory Commission during its investigation into fraudulent accounting practices. Some of the emails were deleted upon requests from affected employees. However, the raw data are messy and need to be cleaned before any analysis is conducted. Zhou et al. (2007) applied data cleaning strategies to compile the Enron email dataset; we use their cleaned data for the subsequent analysis. The resulting data comprise messages sent between employees together with their covariate information, so the corresponding graph has multiple edges. We treat it as a simple undirected graph for our analysis, where each edge indicates that at least one message was exchanged between the corresponding two nodes. We remove the isolated nodes “32” and “37” with zero degrees, for which the estimators of the corresponding degree parameters do not exist. This leaves a connected network. We summarize the degree distribution by its minimum, lower quartile, median, upper quartile and maximum.
[Figure 1 about here: the Enron email network plotted with node departments and genders.]
Each employee has three categorical variables: department (Trading, Legal, Other), gender (Male, Female) and seniority (Senior, Junior). We plot the network with individual departments and genders in Figure 1. We can see that the degrees exhibit great variation across nodes, and it is not easy to judge homophilic or heterophilic effects visually; this requires quantitative analysis. The 3-dimensional covariate vector z_ij of edge (i, j) is formed by using a homophilic matching function between the three covariates of the two employees i and j: if the attributes of i and j are equal, the corresponding component of z_ij is one; otherwise it is zero.
We evaluate how close the differentially private estimator is to the original MLE fitted in the generalized β-model. We chose two privacy levels, set the noise scales as in the simulations, and repeatedly released d̃ and Z̃ according to (7) for each privacy level. We then computed the median private estimates together with their upper and lower quantiles. The frequencies with which the private estimate fails to exist were small for both privacy levels. The results for the estimates of γ are shown in Table 3 and Figure 2. From Table 3, we can see that the median value of the private estimate coincides with the MLE and the MLE lies in the confidence interval. A similar phenomenon can be observed in Figure 2. From this figure, we can also see that the lengths of the confidence intervals of the private estimates under the larger privacy budget are shorter than those under the smaller one.
Covariate | confidence interval | ||
---|---|---|---|
Department | |||
Gender | |||
Seniority | |||
Department | |||
Gender | |||
Seniority |
[Figure 2 about here: private estimates of γ across repeated releases under the two privacy levels.]
6 Summary and discussion
We have presented (ε_{n1} + ε_{n2})-edge differentially private estimation for inferring the degree parameter β and the homophily parameter γ in the generalized β-model. We establish consistency of the estimator under several conditions and also derive its asymptotic normal distribution. It is worth noting that the conditions imposed on the parameters imply that the network density goes to zero at a very slow rate. When networks are very sparse, adding noise easily produces outputs with negative degrees, which leads to the non-existence of the maximum likelihood estimator in the covariate-adjusted β-model. In addition, the conditions in Theorems 2 and 3 appear stronger than those needed for consistency. Note that the asymptotic behavior of the estimator depends not only on the privacy parameters, but also on the configuration of the model parameters. It is of interest to see whether the conditions guaranteeing the theoretical properties could be relaxed.
We used a generalized β-model to give a rigorous differential privacy analysis of networks with covariates. It is notable that the assumption of a logistic distribution for an edge is not essential to our proof strategies. These principled methods should be applicable to a class of network models beyond the covariate-adjusted β-model [Wang et al. (2023)]. For instance, the two-stage Newton method developed to prove the consistency of the differentially private estimator still works if the logistic distribution is replaced with the probit distribution. Further, the edge independence assumption is not directly used in this method; it only plays a role in deriving the upper bounds of the noisy statistics. There are many tail probability inequalities for dependent random variables that could be applied to network statistics in edge-dependent situations. We hope that the methods developed here can be further applied to edge-dependent network models.
7 Appendix
7.1 Preliminaries
In this section, we present three results that will be used in the proofs. The first concerns the approximation error of using to approximate the inverse of belonging to the matrix class , where and . Yan et al., (2015) obtained an upper bound of order for this approximation error. The second is a tight bound on in Hillar et al., (2012). These two results are stated below as lemmas.
Lemma 3 (Yan et al., (2015)).
If , then the following holds:
where for a general matrix .
Lemma 4 (Hillar et al., (2012)).
For , when , we have
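A quick numerical illustration of the phenomenon behind these two lemmas (our own generic construction, not the paper's exact matrix class): for a symmetric matrix whose diagonal dominates its row sums, the diagonal matrix built from the reciprocals of the diagonal entries approximates the inverse, and the entrywise error shrinks as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(2)

def max_entry_error(n):
    """Entrywise error of approximating inv(V) by diag(1/v_11, ..., 1/v_nn)
    for a symmetric, strictly diagonally dominant V (generic example)."""
    off = np.abs(rng.normal(size=(n, n)))
    off = np.triu(off, 1)
    V = off + off.T
    np.fill_diagonal(V, V.sum(axis=1) + 1.0)  # diagonal dominates each row
    S = np.diag(1.0 / np.diag(V))
    return np.max(np.abs(np.linalg.inv(V) - S))

# The approximation error decays as n grows.
print(max_entry_error(20), max_entry_error(200))
```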
Let be a function vector on . We say that a Jacobian matrix with is Lipschitz continuous on a convex set if there exists a constant such that for any vectors the inequality
holds. We will use the Newton iterative sequence to establish the existence and consistency of the differentially private estimator. Gragg and Tapia, (1974) gave the optimal error bound for Newton's method under the Kantorovich conditions [Kantorovich, (1948)]. We state only the partial results that suffice for our applications.
Lemma 5 (Gragg and Tapia, (1974)).
Let be an open convex set of and be Fréchet differentiable on with a Jacobian that is Lipschitz continuous on with Lipschitz coefficient . Assume that is such that exists,
and
Then: (1) The Newton iterations exist and for . (2) exists, and .
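To make Lemma 5 concrete, here is a small numerical sketch in generic notation of ours (not the paper's elided symbols): eta is the length of the first Newton step, beta bounds the norm of the inverse Jacobian at the starting point, and the Kantorovich quantity h = L * beta * eta must not exceed 1/2 for the guarantee to apply.

```python
import numpy as np

def newton(F, J, x0, tol=1e-12, max_iter=50):
    """Newton iteration x_{k+1} = x_k - J(x_k)^{-1} F(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(J(x), F(x))
        x = x - step
        if np.linalg.norm(step) < tol:
            break
    return x

# Toy system with root (1, 2); its Jacobian is Lipschitz with constant
# L = 2 in the spectral norm, since J(x) - J(y) = 2 diag(x - y).
F = lambda x: np.array([x[0]**2 + x[1] - 3.0, x[0] + x[1]**2 - 5.0])
J = lambda x: np.array([[2.0 * x[0], 1.0], [1.0, 2.0 * x[1]]])
x0 = np.array([1.1, 1.9])

beta = np.linalg.norm(np.linalg.inv(J(x0)), 2)       # ||J(x0)^{-1}||
eta = np.linalg.norm(np.linalg.solve(J(x0), F(x0)))  # first step length
h = 2.0 * beta * eta                                 # Kantorovich quantity
root = newton(F, J, x0)
print("h =", round(h, 3), "root =", np.round(root, 6))
```

Here h is well below 1/2, so Lemma 5 guarantees that the iterates stay in a ball around x0 and converge to the root.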
7.2 Error bound between and
We will use Newton's method to derive the error bound between and by verifying the Kantorovich conditions in Lemma 5. The Kantorovich conditions require the Lipschitz continuity of and upper bounds on . We need the three lemmas below, whose proofs are given in the supplementary material.
Lemma 6.
For any given , the Jacobian matrix of on is Lipschitz continuous on with the Lipschitz coefficient .
Lemma 7.
The tail probabilities of and are given below:
Lemma 8.
For any given , we have
Recall that
We state the error bound between and below.
Lemma 9.
Assume and . If , then with probability approaching one, exists and satisfies
Proof of Lemma 9.
We will derive the error bound between and by constructing the Newton iterative sequence , where we choose as the starting point . To this end, it suffices to verify the Kantorovich conditions in Lemma 5. Note that when and , where is a small positive number and . The following calculations are based on the event :
Let and . By Lemma 4, we have . Recall that . Note that the dimension of is a fixed constant. If , by the mean value theorem and the event , we have
Repeatedly utilizing Lemma 4, we have
By Lemma 6, is Lipschitz continuous with Lipschitz coefficient . Therefore, if , then
The above arguments verify the Kantorovich conditions. By Lemma 5, we obtain
(21) |
7.3 Proof of Theorem 1
To show Theorem 1, we need three lemmas below.
Lemma 10.
Let be an open convex set containing the true point . If , then is Lipschitz continuous on with the Lipschitz coefficient .
Lemma 11.
Write as and . If , then has the following expansion:
(22) |
where is the remainder term and
Lemma 12.
If , for any and , then we have
Now we are ready to prove Theorem 1.
Proof of Theorem 1.
We construct the Newton iterative sequence to show consistency. It suffices to verify the Kantorovich conditions in Lemma 5. In the Newton method, we set as the initial point and .
The following calculations are based on the event that for , exists and satisfies
This shows that exists such that and are well defined. This in turn shows that in every iterative step, exists as long as exists.
Recall the definition of and in (13) and (14). By Lemmas 7 and 8, we have
By the mean value theorem and Lemma 12, we have
Then it follows that
By Lemma 10, is Lipschitz continuous with . Note that . Thus,
As a result, if , then
By Lemma 5, with probability approaching one, the limiting point of the sequence exists, denoted by , and satisfies
At the same time, by Lemma 9, exists, denoted by . The limiting point satisfies equation (9). This completes the proof. ∎
7.4 Proofs for Theorem 2
Proof of Theorem 2.
To simplify notations, write , , and
By a second order Taylor expansion, we have
(23) |
where
and lies between and . Recall that . Since (see (10)), we have
Let and . If , by Theorem 1, we have
(24) |
By writing (23) into a matrix form, we have
which is equivalent to
(25) |
We bound the last three remainder terms in the above equation as follows. Let . Note that and . By Lemma 4 and inequality (24), we have
(26) |
7.5 Proof of Theorem 3
Proof of Theorem 3.
Assume that the conditions in Theorem 1 hold. A mean value expansion gives
where lies between and . By noting that , we have
Note that the dimension of is fixed. By Theorem 1 and (4.1), we have
Write as for convenience. Let and . Note that is a Laplace random vector. By Lemma 10, if , then
Therefore,
(30) |
By applying a third order Taylor expansion to , it yields
(31) |
where
and for some . We will show that (1) asymptotically follows a multivariate normal distribution; (2) is a bias term; (3) is an asymptotically negligible remainder term. Specifically, they are accurately characterized as follows:
We defer the proofs of the above equations to supplementary material. Substituting the above equations into (30) then gives
∎
References
- Billingsley, (1995) Billingsley, P. (1995). Probability and measure. 3rd edition. Wiley, New York.
- Chatterjee et al., (2011) Chatterjee, S., Diaconis, P., and Sly, A. (2011). Random graphs with a given degree sequence. Annals of Applied Probability, 21(4):1400–1435.
- Cohen, (2004) Cohen, W. W. (2004). Enron email dataset. Retrieved March 12, 2005.
- Day et al., (2016) Day, W.-Y., Li, N., and Lyu, M. (2016). Publishing graph degree distribution with node differential privacy. In Proceedings of the 2016 International Conference on Management of Data, pages 123–138, New York, NY, USA.
- Dwork et al., (2006) Dwork, C., Mcsherry, F., Nissim, K., and Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. Lecture Notes in Computer Science, pages 265–284.
- Dzemski, (2019) Dzemski, A. (2019). An empirical model of dyadic link formation in a network with unobserved heterogeneity. The Review of Economics and Statistics, (To appear).
- Fienberg, (2012) Fienberg, S. E. (2012). A brief history of statistical models for network analysis and open challenges. Journal of Computational and Graphical Statistics, 21(4):825–839.
- Gragg and Tapia, (1974) Gragg, W. B. and Tapia, R. A. (1974). Optimal error bounds for the Newton-Kantorovich theorem. SIAM Journal on Numerical Analysis, 11(1):10–13.
- Graham, (2017) Graham, B. S. (2017). An econometric model of network formation with degree heterogeneity. Econometrica, 85(4):1033–1063.
- Hay et al., (2009) Hay, M., Li, C., Miklau, G., and Jensen, D. (2009). Accurate estimation of the degree distribution of private networks. In 2009 Ninth IEEE International Conference on Data Mining, pages 169–178.
- Hillar et al., (2012) Hillar, C. J., Lin, S., and Wibisono, A. (2012). Inverses of symmetric, diagonally dominant positive matrices and applications.
- Hoeffding, (1963) Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30.
- Kantorovich, (1948) Kantorovich, L. V. (1948). Functional analysis and applied mathematics. Uspekhi Mat Nauk, pages 89–185.
- Karwa and Slavković, (2016) Karwa, V. and Slavković, A. (2016). Inference using noisy degrees-differentially private beta model and synthetic graphs. The Annals of Statistics, 44:87–112.
- Loève, (1977) Loève, M. (1977). Probability theory I. 4th ed. Springer, New York.
- Lounici, (2008) Lounici, K. (2008). Sup-norm convergence rate and sign concentration property of lasso and dantzig estimators. Electronic Journal of Statistics, 2:90–102.
- Macwan and Patel, (2018) Macwan, K. R. and Patel, S. J. (2018). Node differential privacy in social graph degree publishing. Procedia Computer Science, 143:786–793. 8th International Conference on Advances in Computing & Communications (ICACC-2018).
- McCullagh and Nelder, (1989) McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, Second Edition. Chapman and Hall.
- Narayanan and Shmatikov, (2009) Narayanan, A. and Shmatikov, V. (2009). De-anonymizing social networks. In 2009 30th IEEE Symposium on Security and Privacy, pages 173–187.
- Neyman and Scott, (1948) Neyman, J. and Scott, E. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16:1–32.
- Nguyen et al., (2016) Nguyen, H. H., Imine, A., and Rusinowitch, M. (2016). Detecting communities under differential privacy. In Proceedings of the 2016 ACM on Workshop on Privacy in the Electronic Society, WPES '16, pages 83–93, New York, NY, USA. Association for Computing Machinery.
- Nissim et al., (2007) Nissim, K., Raskhodnikova, S., and Smith, A. (2007). Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 75–84.
- Wang et al., (2022) Wang, Q., Yan, T., Jiang, B., and Leng, C. (2022). Two-mode networks: inference with as many parameters as actors and differential privacy. Journal of Machine Learning Research, 23(292):1–38.
- Wang et al., (2023) Wang, Q., Zhang, Y., and Yan, T. (2023). Asymptotic theory in network models with covariates and a growing number of node parameters. Annals of the Institute of Statistical Mathematics, 75(2):369–392.
- Wasserman and Zhou, (2010) Wasserman, L. and Zhou, S. (2010). A statistical framework for differential privacy. Journal of the American Statistical Association, 105(489):375–389.
- Yan, (2021) Yan, T. (2021). Directed networks with a differentially private bi-degree sequence. Statistica Sinica, 31(4):2031–2050.
- Yan et al., (2019) Yan, T., Jiang, B., Fienberg, S. E., and Leng, C. (2019). Statistical inference in a directed network model with covariates. Journal of the American Statistical Association, 114(526):857–868.
- Yan et al., (2015) Yan, T., Zhao, Y., and Qin, H. (2015). Asymptotic normality in the maximum entropy models on graphs with an increasing number of parameters. Journal of Multivariate Analysis, 133:61–76.
- Zhou et al., (2007) Zhou, Y., Goldberg, M., Magdon-Ismail, M., and Wallace, W. A. (2007). Strategies for cleaning organizational emails with an application to Enron email dataset. In 5th Conference of the North American Association for Computational Social and Organizational Science, Pittsburgh, PA.