Federated Graph Analytics with Differential Privacy
Abstract
Collaborative graph analysis across multiple institutions is becoming increasingly popular. Realistic examples include social network analysis across various social platforms, financial transaction analysis across multiple banks, and analyzing the transmission of infectious diseases across multiple hospitals. We define the federated graph analytics, a new problem for collaborative graph analytics under differential privacy. Although differentially private graph analysis has been widely studied, it fails to achieve a good tradeoff between utility and privacy in federated scenarios, due to the limited view of local clients and overlapping information across multiple subgraphs. Motivated by this, we first propose a federated graph analytic framework, named , which enables arbitrary downstream common graph statistics while preserving individual privacy. Furthermore, we introduce an optimized framework based on our proposed degree-based partition algorithm, called +, which improves the overall utility by leveraging the true local subgraphs. Finally, extensive experiments demonstrate that our and significantly outperform the baseline approach by approximately one and four orders of magnitude, respectively.
Index Terms:
Federated Analytics, Graph Statistics, Differential Privacy.I Introduction
Graph data has become a crucial resource for analyzing big data in a variety of applications such as finance, social networks, and healthcare due to its widespread usage. Owing to escalating privacy concerns and regulatory measures like the GDPR, conducting centralized graph analysis has become increasingly challenging. In this paper, we define the federated graph analytics (FGA), a new problem for collaborative graph analysis with the privacy guarantee, which is motivated by the following scenarios:
Example 1.1. Social Network Analytics. Various social media platforms, including Facebook, Twitter, and LINE, collaborate to estimate different metrics of a global social network within a particular region. Each platform has its own local subgraph, which is a subset of a ground-truth global social network graph. In a graph, a node represents a user, and an edge represents a friendship between two users. As users may use multiple platforms, these clients (i.e., subgraphs) may have overlapping edges.
Example 1.2. Financial Transaction Analytics. Several banks work together to analyze transaction data [1] for financial risk management or macroeconomic analysis over transaction graphs, in which each node represents a bank account owned by a user, and each edge represents a money transaction between two accounts.
Example 1.3. Disease Transmission Analytics. Several medical institutions are collaborating to study the transmission of diseases, such as COVID-19 [2], in a particular region. Each hospital is responsible for a subgraph that includes nodes representing patients and edges representing the transmission of the disease between those patients.

This study is the first to discuss the problem of federated graph analytics (FGA) under the differential privacy (DP) [11, 12]. Different from existing differentially private graph analytic works, such as central models [3, 4, 5, 6, 13] and local models [7, 8, 9, 10, 14], FGA considers a more general setting. In particular, in a central scenario (as shown in Figure 1(a)), a trusted server owns the entire global graph that consists of multiple nodes and edges. Nevertheless, a central server is amenable to privacy issues in practice, such as data leaks and breaches [15, 16]. In a local scenario (as shown in Figure 1(b)), each client manages an user and her 1-hop path information (i.e, neighboring information). Each client doesn’t trust the server and directly perturbs local sensitive data. In contrast, in federated graph analytics (as shown in Figure 1(c)), each client possesses a subgraph consisting of multiple nodes and edges. Each client does not trust other parties, including the server and other clients. In fact, local scenarios can be viewed as an extreme case of federated graph analytics when each client contains a subgraph consisting of one user and her 1-hop path. At this point, is equal to , where and are the number of clients and users, respectively. Additionally, FGA is similar to cross-silo federated learning [17, 18, 19, 20] where different silos (or clients) collaboratively train machine learning models without collecting the raw data. Nevertheless, federated learning focuses on optimization-based questions (i.e., learning models) that differ from graph statistics.
Although differentially private graph analysis has been widely studied [3, 4, 5, 6, 7, 8, 9, 10, 13, 14], this does not apply to FGA due to the following reasons. On the one hand, the limited view of each local client leads to utility issues. Each client only possesses a portion of the entire graph, making it hard to calculate accurate statistics. For instance, if a query task is to count triangles, each client in Figure 1(c) returns the answer . Although their sum is 3, the true answer is 4. This discrepancy happens because the edges of the triangle come from three different clients. Consequently, it is impossible for any individual client to obtain the ground truth answer. On the other hand, overlapping information among different subgraphs causes privacy issues. An edge may exist in multiple subgraphs owned by different clients. In Figure 1(c), for example, the edge appears in both client and client . Although each client can individually provide sufficient privacy guarantees for , multiple reports of the same information amplify the probability of distinguishing such an edge multiple times, leaking the edge privacy.
In this paper, we propose a federated graph analytic framework, named , which enables arbitrary downstream common graph statistics while preserving individual privacy. The main idea is to let the server privately collect the subgraph information from local clients and then aggregate a noisy global graph for executing targeted query tasks, thereby overcoming the limited view problem. To avoid collecting the same edge multiple times, leverages the private set union (PSU) technique [21, 22, 23, 24] to aggregate the subgraph information. However, existing multi-party private set union protocols do not satisfy DP. Hence, we design a differentially private set union (DPSU) algorithm, which ensures that the sensitive information is reported only once and the output global graph is protected under DP.
Moreover, we observe that there is still room for improving the accuracy by leveraging true local subgraphs. To this end, we introduce an improved framework + that allows additional communication between the server and clients. In +, each client reports the intermediate answer based on its local subgraph and the global graph. However, a key challenge arises from the possibility of different clients reporting the same edge multiple times, thereby compromising individual privacy. To mitigate this risk, we devise a degree-based node partition method to partition entire nodes into multiple disjoint sets. Consequently, the query answer associated with each set is collected only once.
In summary, our contributions in this work are elaborated as follows:
-
•
We investigate the federated graph analytics (FGA) under DP for the first time. By comparing with previous protocols, we conclude unique challenges in FGA.
-
•
We present a generalized federated graph analytic framework with differential privacy () based on our proposed DPSU protocol, which supports a wide range of common graph statistics, e.g., subgraph counting.
-
•
We introduce an optimized framework (+) based on our proposed degree-based partition algorithm, which improves the overall utility by leveraging true subgraphs.
-
•
We verify the effectiveness of our proposed methods through extensive experiments. reduces the error than baseline approach by up to an order of 4. + outperforms by at least an order of 1.
II Preliminary
II-A Problem Formulation
II-A1 System Model
In our work, we consider undirected, unattributed graphs, represented as , where is the set of nodes, and is the set of edges. We study the common graph statistics in cross-silo federated scenario, where there are an untrusted server and silos, i.e., clients . Each client owns a subgraph , which is represented as a adjacent matrix. The virtual global graph is the union of all subgraphs, which can be represented as . It is worth noting that each client is mutually independent of others and there may exist overlapping information among different subgraphs, denoted as , where . Clients collaboratively support graph queries over their subgraph data while preserving user privacy. Table I lists the major notations used in this paper.
II-A2 Trust Assumption
Our objective is to create a protocol that enables the server to coordinate graph statistics while ensuring that none of the clients’ sensitive information is disclosed. Similar to prior works [19, 25, 26], we assume that clients are semi-honest. In other words, each client follows the protocol honestly but is curious about the sensitive information on other clients. The server is untrusted and has no access to individual sensitive information. Furthermore, we presume that any parties beyond the system, such as servers, analysts, or other individuals, are adversaries who are computationally constrained.
Notation | Definition |
---|---|
True global graph | |
Node set | |
Edge set | |
Noisy global graph | |
A set of all clients | |
Number of clients in | |
The -th client | |
Subgraph of client | |
Number of nodes in | |
Noisy degree sequence | |
Noisy -stars counts | |
Noisy triangle counts |
II-B Privacy Model
Like previous works [7, 8, 9, 10, 14], the private information considered in this study is the edge privacy. We assume that the server knows the node information, i.e., , which makes sense in some real-world applications. For example, consider that the healthy administration is examining the spread of COVID-19 in a certain region. It collects the disease transmission paths according to the released census in this area.
Differential privacy (DP) [11, 12] has become a de-facto standard for preserving individual privacy, which can be formalized in Definition 1. In our work, we use global sensitivity [11] to achieve the DP, defined as Definition 2. It considers the maximum difference between statistic results on two neighboring graphs.
Definition 1 (Differential Privacy [11]).
Let be the privacy budget and be the number of users. A randomized algorithm with domain satisfies -DP, iff for any neighboring datasets that differ in a single user’s data and any subset ,
,
Definition 2 (Global Sensitivity [11]).
For a query function : , the global sensitivity is defined by
,
where and are neighboring databases that differ in a single user’s data.
The Laplace mechanism is one of common techniques to achieve DP. The formal definition is as follows:
Definition 3 (Laplace Mechanism [27]).
Given any function , let be the sensitivity of function . satisfies -differential privacy, where are random variables drawn from Lap().
Definition 4 (Edge LDP [28]).
For any , let be a randomized algorithm of user . satisfies -Edge LDP, iff for any two neighboring adjacent bit vectors and that differ in one edge and any subset ,
,
where is the privacy budget.
Existing locally differentially private models, such as edge local differentially privacy (Definition 4) [7, 9, 8] is a promising model. However, it fails to provide privacy guarantee in federated scenarios. While the same edge may exist in multiple different clients, each client only considers its own edge information. Multiple reports of the same information increase the probability of distinguishing such an edge multiple times, leading to privacy issues. To address this challenge, we introduce edge distributed differential privacy to achieve our privacy objectives. The formal definition is as follows:
Definition 5 (Neighboring Graphs).
Two graphs and are neighboring graphs if and differ in one edge.
Definition 6 (Edge Distributed Differential Privacy (Edge DDP)).
Let and be two neighboring global graphs. Let be the client set. Let and () be graphs owned by client in and , respectively. A set of randomized mechanisms collectively satisfy -Edge DDP iff. for any subsets of possible outputs , we have the following inequality.
.
Edge DDP guarantees that the server cannot distinguish the presence or absence of any edge based on all reports collected from clients. It also guarantees that the information about which client an edge in belongs to is private, if the edge exists. For example, in Figure 1(c), both the presence of the edge in and the absence of the edge in are private. Furthermore, no party knows that the edge belongs to clients even if the existence of has been disclosed.
II-C Private Set Union
Private Set Union (PSU) [21, 22, 23, 24] is a secure multiparty computation cryptographic technique designed for securely computing the union of private sets held by different parties. At its core, neither party reveals anything to the counterparty except for the elements in the union. From a high-level perspective, a typical PSU protocol involves the following steps: (1) Set Encoding: each party privately encodes its set into a cryptographic form suitable for secure computation: . (2) Union Computation: parties engage in computing the union of their encoded sets without revealing the underlying elements: . (3) Result Decoding: once the computation is complete, parties decode the computed union from its cryptographic representation to obtain the set union: .
III A General Framework:
In this section, we introduce our proposed general framework for federated graph analytics with differential privacy, called . The main idea is to let the server privately collect subgraphs from local clients and aggregate a noisy global graph, which facilitates common graph statistics. As shown in Figure 2, in general, enhances the utility by (1) reducing the added noise by introducing the crypto technique (i.e., PSU) into differentially private graph statistics; (2) calibrating the noisy results to suppress the estimation bias.
We first present a baseline approach which revises the prior protocol (e.g., randomized response) in order to satisfy our privacy goal, and discuss its limitations. Then, we introduce the overview of our proposed general framework , which substantially improves the baseline approach, and elaborate on its details in subsequent sections.
III-A A Baseline Approach
Randomized response (RR) [11] is a common methodology for enhancing local differential privacy. However, it fails to provide -Edge DDP, as the same one edge could be reported by different clients multiple times. One edge may exist in multiple subgraphs. In the worst case, an edge is included by all subgraphs . Without loss of generality, assume and are neighboring graphs and differ in edge , we have . Then, we have
To address this privacy issue, we could consider the following baseline method. It divides the overall privacy budget into portions equally. Then each client randomizes each subgraph using the randomized response with . As a result, the baseline can provide -Edge DDP, i.e.,
Algorithm 1 shows the details of the baseline approach. It takes as input the subgraph set and privacy budget . Each graph is represented as a adjacent matrix , where each bit . If there is an edge between two users, the bit will be set as 1; otherwise, it will be set as 0. After that, each client flips each bit in the upper triangular part of the matrix with the flipping probability , where . After that, the server collects all noisy edges and aggregates an union as a noisy global graph . Finally, the server computes the targeted graph statistics based on the noisy global graph . This algorithm is denoted as .
Theorem 1.
approach satisfies -Edge DDP.
Subgraph set , privacy budget , |
is represented as adjacent matrix |

Discussion. To provide a strong privacy guarantee, too small privacy budget is allocated to each client. Although the baseline approach achieves our privacy goal discussed in Section II, much redundant noise is added into results. For example, for -star counting, obtains the expected loss errors of ; For triangle counting, it attains the expected loss errors of . Our experiments in Section V further shows that the baseline approach cannot obtain competitive result accuracy under various cases.
III-B Overview of
To alleviate the limitations of method, we propose a federated graph analytic framework, called , as shown in Figure 2. It supports arbitrary downstream common graph statistics while satisfying -Edge DDP.
As discussed in the above section, one edge appears in clients in the extreme case and then the overall privacy budget should be allocated to clients, i.e., , which leads to much noise. In , we leverage the private set union (PSU) technique [21, 22, 23, 24] to achieve , reducing the noise scale. PSU allows parties to collaboratively compute the union of multiple sets held by different parties without revealing the individual elements of the sets to other parties. It guarantees that an element appears in the final union only once. However, previous PSU protocols cannot be easily applied in our FGA setting. Most of PSU works [21, 22, 24] focus on a two-party setting, which is different from our multi-party scenario. Although the paper [23] proposes a multi-party protocol, it faces the security and efficiency issue. Additionally, this multi-party PSU [23] outputs the true union and is unable to provide the differential privacy guarantee. To this end, we propose a differentially private set union protocol to compute the union of multiple subgraphs while satisfying -Edge DDP. The detailed analysis will be presented in Section III-C.
Algorithm 2 presents the overall protocol of . It involves two kinds of entities: clients and a server. The local clients , , owns a subgraph that is represented as a adjacent matrix ( is the number of users). It takes as input the set of subgraphs , privacy budget , and the targeted graph query . At the beginning, recalls a function to collect a noisy global graph (step ①). Each client perturbs its subgraph with suitable noise and then encrypts the noisy subgraph. Next, clients communicate with each other to compute the union of subgraphs. All clients collaboratively decrypts the union and outputs a noisy global graph (details in Section III-C). Once obtaining the noisy global graph, it executes the targeted graph statistics (step ②). In this paper, we compute the subgraph counts (i.e., -stars, triangles) to verify the effectiveness of our framework. Considering the estimation bias, the server further calibrates the noisy results to improve the utility (refer to Section III-D for details).
III-C DPSU-based Graph Collection
We propose a DPSU protocol based on the state-of-the-art PSU method for computing the global graph under DP.
PSU Protocol. Suppose that each client owns a set . Our goal is to compute the union .
The SOTA multi-party PSU method [23] supports computing the union among multiple clients as follows:
(1) Initialization. Each client creates a flag vector . If , ; otherwise, .
(2) Key Generation. Each client generates a pair of keys , where . All clients jointly generate the public key: .
(3) Encryption. Each client encrypts with the public key : .
(4) Modification. Each client modifies the flag vector in sequence as follows: if , ; otherwise, .
(5) Decryption. All clients jointly decrypt with secret keys : . For , if , .
However, the above protocol may face the security issue. It is implemented based on finite field cryptography (FFC) [29] and in fact the prime number with 512 bits is not secure according to Table 2 in [30]. If this multi-party PSU is implemented in a secure way, the prime number should be set as 3072 bits. As a result, it is inefficient for the large-scale graph analytics. To improve the tradeoff between security and efficiency, we implement the PSU protocol based on the Elliptic Curve Cryptography (ECC) [31], which is more efficient than finite field cryptography (FFC). In particular, there are three cryptographic libraries for implementing ECC, namely, Libsodium [32], OpenSSL [33], and MCL[34]. As shown in Table II, Libsodium is more suitable for processing large-scale graph data, which motivates us to implement PSU protocol based on Libsodium library.
Libsodium | 0.0056 | 0.059 | 0.558 | 5.26 | 53.14 | 528.10 |
---|---|---|---|---|---|---|
OpenSSL | 0.0098 | 0.060 | 0.549 | 5.49 | 53.76 | 539.52 |
MCL | 0.0541 | 0.504 | 4.983 | 49.84 | 498.84 | 4986.43 |
-
1
: data size
DPSU Protocol. We then propose a differentially private set union (DPSU) protocol for aggregating a global graph. One challenge is that the output of the PSU protocol does not satisfy DP. The individual privacy could be disclosed via the union of subgraphs. Although some previous works [35, 36, 37, 38] discuss how to calculate the private set union while satisfying DP, their system models differ from ours and can not be used directly. One naive solution is that each client perturbs the flag vector using the randomized response (RR) mechanism [39] before encrypting. Specifically, each client flips each bit in her flag vector with probability in the step Initialization of PSU protocol, where is the privacy budget. Subsequently, the differentially private set union can be computed according to step (2)(5) in the above PSU protocol.
Although this natural solution can provide the DP guarantee, it leads to much bias in a noisy graph. In fact, most of real-world graphs tend to be sparse, which means that there are much more 0s than 1s in an adjacent matrix. After the randomly flipping bits, however, the number of 1s is much larger than that of 0s, making the noisy global graph denser than the original one. Additionally, the step Modification further amplifies the denser problem. Even if ‘0’ bit is flipped into ‘1’ bit in one of subgraphs, the according bit in final union will become ‘1’. In particular, assume that the number of ‘1’ bits in a true global graph is and the number of ‘0’ bits is . After flipping, the number of ‘1’ bits in a noisy global graph becomes . Take the Facebook graph as an example, which has 4,039 nodes (i.e., ) and 88,234 edges (i.e., ). Even with a fairly small privacy budget and the number of clients , the expected number of ‘1’ bits will become 3.8, increasing at least 439 times than before.
To address the above issue, we propose a novel differentially private set union protocol. In particular, ‘0’ bits are perturbed only by the first client and ‘1’ bits are randomized by all clients. The server can then obtain the whole noisy matrix by computing the union of them. Theoretically, our method can reduce the denser problem by a factor of . As a result, the utility loss can be alleviated by at least a factor of . For instance, for -star counting, achieves the expected loss errors of ; For triangle counting, it attains the expected loss errors of . To further suppress the bias from the randomized response, we calibrate the noisy results during the graph query processing. The details will be discussed in next Section III-D.
Subgraph set |
Privacy budget |
Algorithm 3 describes the details of our DPSU-based edge collection method. It takes as input the subgraph set and the privacy budget . Each client initializes an edge domain according to the node information . In particular, each client first constructs a adjacent matrix , where , and then transforms the upper triangular part of into a vector with the size . For example, , . The server initializes an empty set for collecting the edge information. Each client generates a pair of keys , and all clients jointly generate the public key . Client first initializes a flag vector according to the principle in line 5. If exists in , the according bit will be set as 1; otherwise, it will be set as 0. After, client perturbs each bit of with the flipping probability using the randomized response, and encrypts the noisy with the public key . Then, client sends to the client . For each client from to , updates according to the principle in lines 10 to 15. If is in , client will flip ‘1’ with RR and then replace with . Once the client completes updating and sending it to the server, all clients jointly decrypt and send decrypted to the server. Finally, the server generates and releases according to . We call this algorithm by .
Example 3.1. Table III shows how Algorithm 3 works for computing the union of local edge sets while satisfying the DP. Specially, clients owns private edge sets , and node set is . Thus, , where , , , , , . The goal is to compute . Firstly, client constructs a flag vector according to line 5 in Algorithm 3. Then, perturbs each bit of and encrypts it to obtain . Secondly, clients and update based on the principle of lines 10 to 15 in Algorithm 3. Thirdly, all clients jointly decrypt and send to the server. Finally, the server can obtain the edge union according to .
1 | 1 | 0 | 1 | 1 | 0 | |
- | - |
-
1
: : .
-
1
In this example, and are flipped by RR.
Theorem 2.
Algorithm 3 satisfies -Edge DDP.
Proof of Theorem 2. Let and be two neighboring graphs. Let and () be the neighboring graphs of client with respect to and , respectively. Let be a set of randomized mechanisms with any subsets of possible outputs . Given the privacy budget , we can easily obtain . Due to using the threshold ElGamal encryption, any change of adding or deleting an edge from clients is only collected by the server once. Thus, we have
III-D Graph Query Processing
In this section, we execute two common subgraph counting queries, i.e., -star counting [7, 9, 40] and triangle counting [4, 7, 41], in order to explain how to execute the function of Algorithm 2 and how to calibrate the noisy results.
A -star refers to a subgraph consisting of a central node connecting to other nodes. To count the number of -stars in a given graph, we iterate through each vertex in the graph and compute , where is the noisy degree of node . The -star counts of the whole graph is equal to the summation of each node’s -stars, i.e., . However, a direct computation of node degrees from can lead to significant bias induced by the randomized response in Algorithm 3. To mitigate the bias, we leverage the post-processing property of DP [11], yielding an unbiased estimation of as Proposition 1. Hence, we can obtain an unbiased estimation of -stars, i.e., .
Proposition 1.
Let be a noisy global graph. Let be the node degree of in and be an unbiased estimate of . Let be the number of nodes in . Let be the flipping probability, where is the privacy budget in . We have
(1) |
Proof of Proposition 3. The mapping relationship between and can be represented as:
(2) |
Then we can prove Proposition 3. ∎
A triangle in a graph refers to a subgraph consisting of three nodes connected by three edges, forming a closed loop. To count the number of triangles in a graph, one common approach is to iterate each subgraph with three nodes and check if it is a loop. However, simply counting the triangles in a noisy graph introduces a significant bias. We continue to calibrate the biased triangle counts through the post-processing. Building upon the insights of [7], we categorize triplets in into four types based on the number of edges they involve: , , , and , where () represents the count of triplets in involving edges (referred to as -edge; -edges are identical to triangles). Thus, we can compute the unbiased triangle counts according to Proposition 2.
Proposition 2.
Let be a noisy global graph. Let be the number of 0-edges, 1-edges, 2-edges, and triangles in , respectively. Let be the privacy budget used in . We have
(3) |
Proof of Proposition 2. Let be the number of 0-edges, 1-edges, 2-edges, and triangles in , respectively, when we do not flip 1/0 using the randomized response. Let . Then we have:
(4) |
.
is a transition matrix from a type of subgraph (0-edge, 1-edge, 2-edge, and triangle) in an original graph to a type of subgraph in a noisy graph. Let (, ) be an unbiased estimate of (). Then we obtain:
(5) |
Let be the -th element of . Then we have:
(6) |
(7) |
By combining above equations, Proposition 2 is proved. ∎
IV An Improved Framework:
Although reduces the problems from the limited view and overlaps, there is still large space for improving the overall utility. calculates common graph statistics based on a noisy global graph collected from local clients and without considering the true local subgraph information. The graph statistics such as -stars and triangles can be calculated more accurately via an additional round of interaction between the server and clients. After the server publishes , each client can concatenate her local subgraph with , which removes the limited view issues. For example, in Figure 3, if the targeted query is to count triangles, client will return instead of since she can see the edges and via . Thus, the final answer will be (redundant counts will be removed).

Nevertheless, one edge may be reported by different clients multiple times, which leaks individual privacy (as discussed in Section I). For example, in Figure 3, the triangle may be collected by the server three times during counting triangles. In intuition, if one edge is reported by one of clients only once, the probability of distinguishing such an edge will be reduced. Hence, the problem is reduced to determining how to assign the same information to one of the clients for reporting. To do this effectively, we propose a degree-based node partition method that splits nodes into disjoint node sets so that each node is assigned to a client with the highest node degree. In this context, the degree information plays a significant role: the higher a node degree, the more edges it possesses, and the more likely it is to be involved in -stars or triangles. For instance, in Figure 3, the degree of node in client () is larger than the degrees of clients and (). Thus, client can count 2-stars or triangles of more accurately than clients and . This is an intuition behind our +.
IV-A Overview of +
Algorithm 4 presents the overall protocol of +. It mainly consists of two phases: global graph collection and local query collection. The global graph collection is used to publish a noisy global graph , which is finished by (Algorithm 2). The local query collection consists of three main steps: degree-based partition (Algortihm 5), graph query processing (Section IV-C), and perturbation. To be specific, it takes as input local subgraph set , privacy budget , and query sensitivity . + first collects a noisy global graph via using . Then, this global graph is published to each client for local query collection. Next, + recalls a function with privacy budget , to split the node set into disjoint node sets (Step 1). These node sets are distributed to the respective clients. After that, each client computes the targeted query. In our paper, we take -star and triangle counting as examples to explain the rationale behind (Step 2). Since the noisy global graph is dense, we carefully design calibration techniques to obtain unbiased estimates of -stars or triangles, as presented in Section IV-C. Upon obtaining the query results, each client perturbs the local answers using a suitable noise (Step 3). Specifically, each client calculates the noisy query by adding the Laplacian noise to , where is the global sensitivity. Finally, the server computes , which is an unbiased estimate of and satisfies DP.
Theorem 3.
Algorithm 4 satisfies ()-Edge DDP.
Proof of Theorem 3. Algorithm 4 uses three kinds of privacy budgets: in , in , and in . By Theorem 2, satisfies -Edge DDP. In , each client adds the Laplacian noise to the node degree, which satisfies -Edge DDP. In , each client adds the Laplacian noise to the query result , which satisfies -Edge DDP. Following the post-processing property of DP, the aggregated result satisfies -Edge DDP. Thus, the entire process of Algorithm 4 provides ()-Edge DDP. ∎
IV-B Degree-based Node Partition
We propose a degree-based user partition technique to split the node set into multiple disjoint user sets . For each node, the server collects node degrees from clients privately. The node (i.e., user) will be distributed to the client that sends the maximum degree to the server. Algorithm 5 presents how to partition users into multiple disjoint sets. It takes as input the true subgraph sets and privacy budget . At the outset, the server initializes the empty set for each client, in order to record the partition information. For each node in the node set , each client computes the node degree based on its true subgraph . Then, client perturbs the degree with the Laplace mechanism and sends the noisy degree to the server. After that, the server computes the max noisy degree for each node and adds the user index to the corresponding user partition set . The final output is the user partition sets , where and . This algorithm is denoted by .
Subgraph set , |
Privacy budget |
Compute -th node degree |
Perturb |
Send to Server |
Compute |
Obtain index |
Update |
Noisy global graph , -th subgraph |
-th user partition . |
IV-C Graph Query Processing
In this section, we execute two basic subgraph counting queries, i.e., -star counting [7, 9, 40] and triangle counting [4, 7, 41], in order to explain how to execute the function in line 4 of Algorithm 4.
IV-C1 -Star Counting
A -star refers to a subgraph consisting of a central node connecting to other nodes. The key is to compute the total number of edges in a graph, i.e., the summation of node degrees. From each client ’s view, each node degree is contributed from a local subgraph and a noisy global graph . Thus, the problem is reduced to compute the true node degree and noisy node degree .
Algorithm 6 shows an instantiation of the function in -star counting. It takes as input a noisy global graph , -th subgraph , and -th user partition . Client first initializes to record two kinds of degrees. Then, it obtains a new global graph by computing the union between and its true subgraph (line 2). After that, it traverses each node in the user partition set and calculates and . If the edge exists in subgraph , the true degree will be updated as ; otherwise, the noisy degree will be updated as (lines 5-6). However, simply computing the degrees from can introduce a significant bias, as the randomized response in Algorithm 3 makes a graph dense [7, 8, 28]. By the post-processing property of DP [11], we obtain an unbiased estimate of according to Proposition 3 (line 7). Upon obtaining the node degree , client can calculate the -stars by . The final output is the number of -stars.
Proposition 3.
Let be the union of a noisy global graph and local subgraph . Let be the absolute complement of in . Let be the node degree of in and be an unbiased estimate of . Let be the number of nodes in . Let be the flipping probability, where is the privacy budget in (i.e., line 1 of Algorithm 4). We have
(8) |
IV-C2 Triangle Counting
Noisy global graph , -th subgraph |
-th user partition . |
Next, we focus on triangle counting. This is more challenging because three edges of a triangle can be from the local subgraph or a noisy global graph . There are four kinds of triangles according to the number of edges from : , and , where () is the number of triangles involving edges from (referred to as -edge in ). Since , and involves some noisy edges from , simply counting the noisy triangles can introduce a bias. We propose different empirical estimation methods to obtain unbiased counts, as presented in Propositions 4, 5, and 6 at the end of this subsection.
Algorithm 7 presents an instantiation of the function in triangle counting. It takes as input a noisy global graph , the -th subgraph , the -th user partition , and the privacy budget . Client first obtains a graph by merging and . Then it calculates four kinds of triangles , i.e., 0-edge, 1-edge, 2-edge, and 3-edge in . After that, the unbiased estimates of are computed based on Proposition 4, 5, and 6. The final triangle count can be obtained by aggregating , and .
Proposition 4.
Let be the union of a noisy global graph and local subgraph . Let be the absolute complement of in . Let be the number of 0-edges, 1-edges, 2-edges, and triangles in , respectively. Let be the number of 0-edges in , i.e., . Let be an unbiased estimate of . Let be the privacy budget used in (i.e., line 1 of Algorithm 4). We have
(10) |
Proposition 5.
Let be the union of a noisy global graph and local subgraph . Let be the absolute complement of in . Let be the number of 0-edges, 1-edges, and 2-edges in , respectively. Let be the number of 1-edges in , i.e., . Let be an unbiased estimate of . Let be the privacy budget used in . We have
(11) |
Proof of Proposition 5. Consider that one edge of a triangle is from and another two edges are from . Let be the number of 0-edges, 1-edges, and 2-edges in , respectively, when we do not flip 1/0 using the randomized response. Let . Then we have:
(12) |
is a transition matrix from a type of subgraph (0-edge, 1-edge, 2-edge) in an original graph to a type of subgraph in a noisy graph. Let be the unbiased estimation of (). Then we obtain:
(13) |
Let be the -th element of . Then we have:
(14) |
(15) |
(16) |
By combining above equations, Proposition 5 is proved. ∎
Proposition 6.
Let be a noisy global graph and be the local subgraph. Let be the number of 2-edges in . Let be the number of 2-stars in . Let be an unbiased estimate of . Let be the flipping probability. We have
(17) |
V Experimental Evaluation
In this section, we evaluate our and + along two dimensions: utility and running time. To simulate the federated scenario, we split a graph randomly into multiple local subgraphs by controlling two key parameters: sampling rate and overlapping rate . Then, we conducted experiments to answer the following questions:
-
•
Q1: How do our general and improved + compare with baseline approaches (denoted by ) in terms of the utility-privacy trade-off?
-
•
Q2: How do the sampling rate and overlapping rate affect accuracy?
-
•
Q3: How much do our and + compare with in terms of the running time?
Evaluation Highlights:
- •
- •
-
•
and + takes more time than by at least an order of 1 (Figure 12).








V-A Experimental Setup
Datasets. We use two real-world graph datasets from SNAP [42] as follows: (1) Facebook. The Facebook graph is collected from survey participants using the Facebook app, which includes 4039 nodes and 88234 edges. The average degree of the Facebook graph is 21.85 (=). (2) Wiki-Vote. The Wiki-Vote graph contains all the Wikipedia voting data from the inception of Wikipedia till 2008, which includes 7115 nodes and 103689 edges. The average degree of Wiki-Vote graph is 14.57 (=). Thus, Wiki-Vote graph is more sparse than the Facebook graph. As explained above, we split a graph randomly into 4 local subgraphs by controlling the sampling rate and overlapping rate .
Parameters. There are some key parameters that influence the overall accuracy of the system. (1) Privacy Budget . The privacy budget varies from 1 to 6, and the default is 3. Note that + involves three kind of privacy budgets, namely, . We set and . (2) Sampling rate . The sampling rate is the ratio of each local subgraph to the global graph. It is set from 0.1 to 0.5, and the default is 0.3. (3) Overlapping Rate . The overlapping rate is the ratio of edges shared among multiple local subgraphs to the total number of edges. It varies from 0 to 0.4, and the default is 0.2.
Graph Statistics and Metrics. For graph analytic tasks, we evaluate two common graph statistics: 2-star counts and triangle counts, as in [4, 7, 9, 10, 40, 41]. For each query, we compare the results and from the true graph and noisy graph respectively. We use two common measures to assess the accuracy of our algorithms: mean squared error (MSE) [43] and mean relative error (MRE) [44]. We evaluate the average results over 10 repeated runs.








V-B Experimental Results
Utility-Privacy Trade-off (Q1). We first answer Q1 by comparing the utility of our and + with that of when the privacy budget varies from to . Figures 5 to 7 show that the utility loss of all methods decreases with an increase in . Our general framework significantly outperforms in all cases, and our improved framework + can further improve the accuracy of .
In particular, outperforms by at least an order of 1. For instance, for 2-star counting, Figure 4(b) shows that when , owns an MSE of while gives an MSE of . Similarly, for triangle counting, Figure 7(a) shows that gives an MRE of only 9.53 when . In contrast, has an MRE of 8.27. This is because uses small privacy budgets (i.e., ) to perturb sensitive data, which leads to much noise. Instead, collects noisy graphs using by combining PSU and DP. The same information is collected only once, and therefore, individual privacy is not leaked.
Another observation is that the error of + is smaller than that of in all cases. For example, Figure 5(b) shows that when , + only owns an MRE of 3.86, whereas has an MRE of 2.98. Similarly, Figure 6(a) shows that when , the MSE of + is 4.68 when owns a MSE of 6.25. This is mainly because + calculates graph statistics by utilizing local true subgraphs. In contrast, the results of are computed based on noisy graphs, which results in much utility loss. Thus, outperforms significantly, and + improves the utility of .
Parameter Effects (Q2). Next, we evaluate the key parameters that may influence the overall utility of , i.e., sampling rate and overlapping rate . determines the size of local subgraphs , and the size of becomes larger as increases. determines the number of the same edges that exist in multiple local subgraphs.
Figures 9 and 9 show that in all cases, the MSE and MRE increase with an increase in . This is because graph analytic tasks are executed over a noisy graph . The added error becomes larger as the graph size increases. We can also observe that and + owns better utility than and , respectively, for all .
Figures 11 and 11 show that performs the worst while + owns the best utility over all values of the overlapping rate . Another interesting observation is that the overlapping rate has little influence on the overall utility. For instance, Figure 10(a) shows that with the increase in , the MSE of increases from 1.014 to 1.0442. The slight growth is from multiple perturbations of the same edge information. There are little changes in the results of and + over various . This is because the same information is randomized and collected only once.
Execution Time (Q3). Finally, we answer the third question by evaluating the running time over graphs with different sizes. Here, we use different sampling rates to generate multiple graphs with different scales. Figure 12 presents the running time of , , and + for various values of . We can find that the running time increases when the graph scale becomes larger. This is because when increases, there are more edges collected and computed accordingly. Another important observation is that the running time of is approximately 10 higher than that of . This is because the computation of cryptographic techniques leads to additional time overhead. + takes more time than by about 50%. This is because additional communication between clients and the server consumes more time. Thus, utilizing cryptographic tools improves the utility while not leaking sensitive information but at the cost of efficiency.


VI Related Works
Federated Analytics. The term “federated analytics” is first introduced by Google in 2020 [45], which is explored in support of federated learning for Google engineers to measure the quality of federated learning models against real-world data. Bharadwaj [46] introduces the notion of federated computation, which is a means of working with private data at a rather large scale. Wang [47] clarify what federated analytics is and its position in literature and then presents the motivation, application, and opportunities of federated analytics. Elkordy [48] gives a comprehensive survey about federated analytics. Nevertheless, they only focus on tabular data analytics, which totally differs from graph analytics in our work. Roth [49] introduce Mycelium for large-scale distributed graph queries with differential privacy. Yuan [50] define the notion of graph federation for subgraph matching, where graph data sources are temporarily federated. However, they assume that all clients are mutually independent, which is different from ours. Our (informal) previous work [51]111Note that [51] was not published in proceedings in accordance with the policy of the KDD conference, cf. https://fl4data-mining.github.io/calls/. introduces the concept of federated graph analytics for the first time. However, it encountered privacy leakage issues due to the potential disclosure of intersection information between two clients.
Cross-silo Federated Learning. There exist several works related to cross-silo federated learning [17, 18, 19, 20, 52, 53]. Huang [17] propose FedAMP that employs federated attentive message passing to facilitate the collaboration effectiveness between clients without infringing their data privacy. Li [53] propose a practical one-shot federated learning algorithm by using the knowledge transfer technique. Liu [18] empirically show that MR-MTL is a remarkably strong baseline under silo-specific sample-level DP. Tang [20] propose an incentive mechanism for cross-silo federated learning, addressing the public goods feature. Zheng [19] propose a one-server solution based solely on homomorphic encryption and a two-server protocol based on homomorphic encryption and additive secret sharing, which are designed for contribution evaluation in cross-silo federated learning. However, these protocols are designed for machine learning cannot be used for graph analytics.
Differentially Private Graph Analytics. The standard way to calculate graph statistics is through differential privacy (DP) [11, 12], which is a golden standard in the privacy community. However, existing protocols [4, 5, 7, 9, 10, 40, 54] are designed for different scenarios and not applicable to federated graph scenarios. To be specific, central protocols [4, 5] rely on a trusted server to collect the entire graph information from local users and then release accurate graph statistics privately. Nevertheless, a centrally trusted server is amenable to privacy issues in practical such as data leaks and breaches [15, 16]. Instead, local protocols [7, 9] remove the assumption of a trusted server, and each user directly perturbs local sensitive data. However, they cannot protect the user-client membership and faces the edge privacy issues due to the overlaps. Extended local view (ELV)-based protocols [10, 55] consider an extension of local scenarios, where each client can see not only her 1-hop path but also her 2-hop path. Similar to local mechanisms, ELV-based protocols fail to protect the user-client membership. Additionally, decentralized differential privacy in [55] can protect edge privacy in ELV but fails to do it in federated settings since overlaps of ELV are different from those of federated graph analytics. In a nutshell, this paper is the first work to formulate the federated graph analytics (FGA) to our best of knowledge. Our proposed is a general framework for various common graph analytics.
VII Conclusion
We made the first attempt to study federated graph analytics with privacy guarantee. We showed unique challenges in federated graph analytics, namely, utility issue from the limited view and privacy issue due to overlaps. To alleviate them, we proposed a general federated graph analytic framework , based on our proposed differentially private set union protocol. Furthermore, we observed that it calculates graph statistics over a noisy global graph without considering true local subgraphs, and there is still room for improving the overall utility. To address this issue, we introduced an improved framework + by combining a noisy global graph with true local subgraphs. Comprehensive experiments verify that our proposed methods significantly outperform the baseline approaches in various graph analytics.
References
- [1] D. Cheng, F. Yang, S. Xiang, and J. Liu, “Financial time series forecasting with multi-modality graph neural network,” Pattern Recognition, vol. 121, p. 108218, 2022.
- [2] S. J. Dancer, “Reducing the risk of covid-19 transmission in hospitals: focus on additional infection control strategies,” Surgery (Oxford), vol. 39, no. 11, pp. 752–758, 2021.
- [3] X. Jian, Y. Wang, and L. Chen, “Publishing graphs under node differential privacy,” IEEE TKDE, 2021.
- [4] X. Ding, S. Sheng, H. Zhou, X. Zhang, Z. Bao, P. Zhou, and H. Jin, “Differentially private triangle counting in large graphs,” IEEE TKDE, vol. 34, no. 11, pp. 5278–5292, 2021.
- [5] W.-Y. Day, N. Li, and M. Lyu, “Publishing graph degree distribution with node differential privacy,” in Proceedings of the 2016 International Conference on Management of Data, 2016, pp. 123–138.
- [6] Q. Yuan, Z. Zhang, L. Du, M. Chen, P. Cheng, and M. Sun, “PrivGraph: Differentially private graph data publication by exploiting community information,” in USENIX Security Symposium, 2023.
- [7] J. Imola, T. Murakami, and K. Chaudhuri, “Locally differentially private analysis of graph statistics.” in USENIX Security Symposium, 2021.
- [8] Q. Ye, H. Hu, M. H. Au, X. Meng, and X. Xiao, “Lf-gdpr: A framework for estimating graph metrics with local differential privacy,” IEEE TKDE, vol. 34, no. 10, pp. 4905–4920, 2020.
- [9] J. Imola, T. Murakami, and K. Chaudhuri, “Communication-efficient triangle counting under local differential privacy,” in USENIX Security Symposium, 2022.
- [10] Y. Liu, S. Zhao, Y. Liu, D. Zhao, H. Chen, and C. Li, “Collecting triangle counts with edge relationship local differential privacy,” in IEEE ICDE, 2022.
- [11] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
- [12] N. Li, M. Lyu, D. Su, and W. Yang, “Differential privacy: From theory to practice,” Synthesis Lectures on Information Security, Privacy, & Trust, vol. 8, no. 4, pp. 1–138, 2016.
- [13] S. Zhang, W. Ni, and N. Fu, “Community preserved social graph publishing with node differential privacy,” in IEEE ICDM, 2020.
- [14] C. Wei, S. Ji, C. Liu, W. Chen, and T. Wang, “Asgldp: collecting and generating decentralized attributed graphs with local differential privacy,” IEEE TIFS, 2020.
- [15] N. N. Neto, S. Madnick, A. M. G. D. Paula, and N. M. Borges, “Developing a global data breach database and the challenges encountered,” JDIQ, vol. 13, no. 1, pp. 1–33, 2021.
- [16] B. Gibson, S. Townes, D. Lewis, and S. Bhunia, “Vulnerability in massive api scraping: 2021 linkedin data breach,” in IEEE CSCI, 2021.
- [17] Y. Huang, L. Chu, Z. Zhou, L. Wang, J. Liu, J. Pei, and Y. Zhang, “Personalized cross-silo federated learning on non-iid data,” in Proc. AAAI, vol. 35, no. 9, 2021, pp. 7865–7873.
- [18] K. Liu, S. Hu, S. Z. Wu, and V. Smith, “On privacy and personalization in cross-silo federated learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 5925–5940, 2022.
- [19] S. Zheng, Y. Cao, and M. Yoshikawa, “Secure shapley value for cross-silo federated learning,” Proc. VLDB, 2023.
- [20] M. Tang and V. W. Wong, “An incentive mechanism for cross-silo federated learning: A public goods perspective,” in IEEE INFOCOM, 2021, pp. 1–10.
- [21] C. Zhang, Y. Chen, W. Liu, M. Zhang, and D. Lin, “Linear private set union from Multi-Query reverse private membership test,” in USENIX Security Symposium, 2023, pp. 337–354.
- [22] Y. Jia, S.-F. Sun, H.-S. Zhou, J. Du, and D. Gu, “Shuffle-based private set union: Faster and more secure,” in 31st USENIX Security Symposium (USENIX Security 22), 2022, pp. 2947–2964.
- [23] W. Wang, S. Li, J. Dou, and R. Du, “Privacy-preserving mixed set operations,” Information Sciences, vol. 525, pp. 67–81, 2020.
- [24] C. Dong and G. Loukides, “Approximating private set union/intersection cardinality with logarithmic complexity,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2792–2806, 2017.
- [25] C. Zhang, S. Li, J. Xia, W. Wang, F. Yan, and Y. Liu, “BatchCrypt: Efficient homomorphic encryption for Cross-Silo federated learning,” in USENIX ATC, 2020, pp. 493–506.
- [26] Y. Liu, Y. Kang, C. Xing, T. Chen, and Q. Yang, “A secure federated transfer learning framework,” IEEE Intelligent Systems, vol. 35, no. 4, pp. 70–82, 2020.
- [27] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Springer TCC, 2006, pp. 265–284.
- [28] Z. Qin, T. Yu, Y. Yang, I. Khalil, X. Xiao, and K. Ren, “Generating synthetic decentralized social graphs with local differential privacy,” in Proc. CCS, 2017, pp. 425–438.
- [29] W. Diffie and M. E. Hellman, “New directions in cryptography,” in Democratizing Cryptography: The Work of Whitfield Diffie and Martin Hellman, 2022, pp. 365–390.
- [30] E. B. Barker, W. C. Barker, W. E. Burr, W. T. Polk, and M. E. Smid, “Sp 800-57. recommendation for key management, part 1: General (revised),” 2007.
- [31] D. Hankerson and A. Menezes, “Elliptic curve cryptography,” in Encyclopedia of Cryptography, Security and Privacy. Springer, 2021, pp. 1–2.
- [32] Libsodium. [Online]. Available: https://doc.libsodium.org
- [33] OpenSSL. [Online]. Available: https://www.openssl.org/
- [34] MCL. [Online]. Available: https://github.com/herumi/mcl
- [35] S. Gopi, P. Gulhane, J. Kulkarni, J. H. Shen, M. Shokouhi, and S. Yekhanin, “Differentially private set union,” in International Conference on Machine Learning. PMLR, 2020, pp. 3627–3636.
- [36] K. Kim, S. Gopi, J. Kulkarni, and S. Yekhanin, “Differentially private n-gram extraction,” Advances in Neural Information Processing Systems, vol. 34, pp. 5102–5111, 2021.
- [37] R. S. Carvalho, K. Wang, and L. S. Gondara, “Incorporating item frequency for differentially private set union,” in Proc. AAAI, vol. 36, no. 9, 2022, pp. 9504–9511.
- [38] X. Qiao, Z. Youwen, and X. L. Jian WANG, “Locally differentially private distributed algorithms for set intersection and union,” Information Sciences, vol. 64, no. 219101, pp. 1–219 101, 2021.
- [39] S. L. Warner, “Randomized response: A survey technique for eliminating evasive answer bias,” Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–69, 1965.
- [40] J. Imola, T. Murakami, and K. Chaudhuri, “Differentially private triangle and 4-cycle counting in the shuffle model,” in ACM CCS, 2022.
- [41] S. Liu, Y. Cao, T. Murakami, J. Liu, and M. Yoshikawa, “Cargo: Crypto-assisted differentially private triangle counting without trusted servers,” arXiv preprint arXiv:2312.12938, 2023.
- [42] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, Jun. 2014.
- [43] K. Das, J. Jiang, and J. Rao, “Mean squared error of empirical predictor,” 2004.
- [44] T. Foss, I. Myrtveit, and E. Stensrud, “Mre and heteroscedasticity: An empirical validation of the assumption of homoscedasticity of the magnitude of relative error,” in Proc. ESCOM, 2001, pp. 157–164.
- [45] D. Ramage and S. Mazzocchi, “Federated analytics: Collaborative data science without data collection,” https://ai.googleblog.com/2020/05/federated-analytics-collaborative-data.html, 2020.
- [46] A. Bharadwaj and G. Cormode, “An introduction to federated computation,” in Proceedings of the 2022 International Conference on Management of Data, 2022, pp. 2448–2451.
- [47] D. Wang, S. Shi, Y. Zhu, and Z. Han, “Federated analytics: Opportunities and challenges,” IEEE Network, vol. 36, no. 1, pp. 151–158, 2021.
- [48] A. R. Elkordy, Y. H. Ezzeldin, S. Han, S. Sharma, C. He, S. Mehrotra, S. Avestimehr et al., “Federated analytics: A survey,” APSIPA Transactions on Signal and Information Processing, vol. 12, no. 1, 2023.
- [49] E. Roth, K. Newatia, Y. Ma, K. Zhong, S. Angel, and A. Haeberlen, “Mycelium: Large-scale distributed graph queries with differential privacy,” in Proc. SOSP, 2021, pp. 327–343.
- [50] Y. Yuan, D. Ma, Z. Wen, Z. Zhang, and G. Wang, “Subgraph matching over graph federation,” Proceedings of the VLDB Endowment, vol. 15, no. 3, pp. 437–450, 2021.
- [51] S. Liu, Y. Cao, and M. Yoshikawa, “Federated graph analytics with differential privacy,” in International Workshop on Federated Learning for Distributed Data Mining, 2023 (informal paper not published in proceedings, cf., https://fl4data-mining.github.io/calls/.
- [52] Y. Wang, Y. Tong, D. Shi, and K. Xu, “An efficient approach for cross-silo federated learning to rank,” in 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2021, pp. 1128–1139.
- [53] Q. Li, B. He, and D. Song, “Practical one-shot federated learning for cross-silo setting.”
- [54] S. Liu, Y. Cao, T. Murakami, and M. Yoshikawa, “A crypto-assisted approach for publishing graph statistics with node local differential privacy,” in IEEE Big Data, 2022, pp. 5765–5774.
- [55] H. Sun, X. Xiao, I. Khalil, Y. Yang, Z. Qin, H. Wang, and T. Yu, “Analyzing subgraph statistics from extended local views with decentralized differential privacy,” in Proc. CCS, 2019, pp. 703–717.