
Federated Graph Analytics with Differential Privacy

Shang Liu, Yang Cao, Takao Murakami, Weiran Liu,
Seng Pei Liew, Tsubasa Takahashi, Jinfei Liu, Masatoshi Yoshikawa
Shang Liu is with the Graduate School of Informatics, Kyoto University, Kyoto 606-8501, Japan. Email: [email protected]. Yang Cao is with the Department of Computer Science, Tokyo Institute of Technology, Tokyo 152-8552, Japan. Email: [email protected]. Takao Murakami is with the Department of Interdisciplinary Statistical Mathematics, Institute of Statistical Mathematics, Tokyo 190-8562, Japan. Email: [email protected]. Weiran Liu is with Alibaba Group, Beijing 100102, China. Email: [email protected]. Seng Pei Liew and Tsubasa Takahashi are with LY Corporation, Tokyo 160-0004, Japan. Email: {sengpei.liew, tsubasa.takahash}@lycorp.co.jp. Jinfei Liu is with the College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China. Email: [email protected]. Masatoshi Yoshikawa is with Osaka Seikei University, Osaka 533-0007, Japan. Email: [email protected].
Abstract

Collaborative graph analysis across multiple institutions is becoming increasingly popular. Realistic examples include social network analysis across various social platforms, financial transaction analysis across multiple banks, and analyzing the transmission of infectious diseases across multiple hospitals. We define federated graph analytics, a new problem for collaborative graph analytics under differential privacy. Although differentially private graph analysis has been widely studied, it fails to achieve a good tradeoff between utility and privacy in federated scenarios, due to the limited view of local clients and overlapping information across multiple subgraphs. Motivated by this, we first propose a federated graph analytic framework, named FEAT, which enables arbitrary downstream common graph statistics while preserving individual privacy. Furthermore, we introduce an optimized framework based on our proposed degree-based partition algorithm, called FEAT+, which improves the overall utility by leveraging the true local subgraphs. Finally, extensive experiments demonstrate that our FEAT and FEAT+ significantly outperform the baseline approach by approximately one and four orders of magnitude, respectively.

Index Terms:
Federated Analytics, Graph Statistics, Differential Privacy.

I Introduction

Graph data has become a crucial resource for analyzing big data in a variety of applications such as finance, social networks, and healthcare due to its widespread usage. Owing to escalating privacy concerns and regulatory measures like the GDPR, conducting centralized graph analysis has become increasingly challenging. In this paper, we define the federated graph analytics (FGA), a new problem for collaborative graph analysis with the privacy guarantee, which is motivated by the following scenarios:

Example 1.1. Social Network Analytics. Various social media platforms, including Facebook, Twitter, and LINE, collaborate to estimate different metrics of a global social network within a particular region. Each platform has its own local subgraph, which is a subset of a ground-truth global social network graph. In a graph, a node represents a user, and an edge represents a friendship between two users. As users may use multiple platforms, these clients (i.e., subgraphs) may have overlapping edges.

Example 1.2. Financial Transaction Analytics. Several banks work together to analyze transaction data [1] for financial risk management or macroeconomic analysis over transaction graphs, in which each node represents a bank account owned by a user, and each edge represents a money transaction between two accounts.

Example 1.3. Disease Transmission Analytics. Several medical institutions are collaborating to study the transmission of diseases, such as COVID-19 [2], in a particular region. Each hospital is responsible for a subgraph that includes nodes representing patients and edges representing the transmission of the disease between those patients.

Figure 1: Comparisons among central, local and federated scenarios. (a) In a central scenario [3, 4, 5, 6], one trusted server owns the entire graph. (b) In a local scenario [7, 8, 9, 10], each client owns one node and its 1-hop path information. (c) In a federated scenario, each client owns a subgraph that consists of multiple nodes and edges among them.

This study is the first to discuss the problem of federated graph analytics (FGA) under differential privacy (DP) [11, 12]. Different from existing differentially private graph analytic works, such as central models [3, 4, 5, 6, 13] and local models [7, 8, 9, 10, 14], FGA considers a more general setting. In particular, in a central scenario (as shown in Figure 1(a)), a trusted server owns the entire global graph that consists of multiple nodes and edges. Nevertheless, a central server is prone to privacy issues in practice, such as data leaks and breaches [15, 16]. In a local scenario (as shown in Figure 1(b)), each client manages a user and her 1-hop path information (i.e., neighboring information). Each client does not trust the server and directly perturbs local sensitive data. In contrast, in federated graph analytics (as shown in Figure 1(c)), each client possesses a subgraph consisting of multiple nodes and edges. Each client does not trust other parties, including the server and other clients. In fact, the local scenario can be viewed as an extreme case of federated graph analytics in which each client holds a subgraph consisting of one user and her 1-hop paths. In this case, m is equal to n, where m and n are the numbers of clients and users, respectively. Additionally, FGA is similar to cross-silo federated learning [17, 18, 19, 20], where different silos (or clients) collaboratively train machine learning models without collecting the raw data. Nevertheless, federated learning focuses on optimization-based questions (i.e., learning models) that differ from graph statistics.

Although differentially private graph analysis has been widely studied [3, 4, 5, 6, 7, 8, 9, 10, 13, 14], it does not apply to FGA for the following reasons. On the one hand, the limited view of each local client leads to utility issues. Each client only possesses a portion of the entire graph, making it hard to calculate accurate statistics. For instance, if a query task Q is to count triangles, each client in Figure 1(c) returns the answer Q_i = 1. Although their sum is 3, the true answer is 4. This discrepancy happens because the edges of the triangle ⟨v_1, v_2, v_3⟩ come from three different clients. Consequently, it is impossible for any individual client to obtain the ground-truth answer. On the other hand, overlapping information among different subgraphs causes privacy issues. An edge may exist in multiple subgraphs owned by different clients. In Figure 1(c), for example, the edge ⟨v_1, v_4⟩ appears in both client C_1 and client C_3. Although each client can individually provide a sufficient privacy guarantee for ⟨v_1, v_4⟩, multiple reports of the same information amplify the probability of distinguishing such an edge, leaking edge privacy.

In this paper, we propose a federated graph analytic framework, named FEAT, which enables arbitrary downstream common graph statistics while preserving individual privacy. The main idea is to let the server privately collect the subgraph information from local clients and then aggregate a noisy global graph for executing targeted query tasks, thereby overcoming the limited view problem. To avoid collecting the same edge multiple times, FEAT leverages the private set union (PSU) technique [21, 22, 23, 24] to aggregate the subgraph information. However, existing multi-party private set union protocols do not satisfy DP. Hence, we design a differentially private set union (DPSU) algorithm, which ensures that sensitive information is reported only once and the output global graph is protected under DP.

Moreover, we observe that there is still room for improving the accuracy by leveraging true local subgraphs. To this end, we introduce an improved framework FEAT+ that allows additional communication between the server and clients. In FEAT+, each client reports an intermediate answer based on its local subgraph and the global graph. However, a key challenge arises from the possibility of different clients reporting the same edge multiple times, thereby compromising individual privacy. To mitigate this risk, we devise a degree-based node partition method to partition the entire node set into multiple disjoint sets. Consequently, the query answer associated with each set is collected only once.

In summary, our contributions in this work are elaborated as follows:

  • We investigate federated graph analytics (FGA) under DP for the first time. By comparing it with previous protocols, we identify the unique challenges in FGA.

  • We present a generalized federated graph analytic framework with differential privacy (FEAT) based on our proposed DPSU protocol, which supports a wide range of common graph statistics, e.g., subgraph counting.

  • We introduce an optimized framework (FEAT+) based on our proposed degree-based partition algorithm, which improves the overall utility by leveraging true subgraphs.

  • We verify the effectiveness of our proposed methods through extensive experiments. FEAT reduces the error of the baseline approach by up to four orders of magnitude, and FEAT+ outperforms FEAT by at least one order of magnitude.

Section II introduces the preliminaries. Our generalized framework FEAT and improved framework FEAT+ are proposed in Section III and Section IV, respectively. Section V presents experimental results. Section VI reviews the related work, and Section VII draws a conclusion.

II Preliminary

II-A Problem Formulation

II-A1 System Model

In our work, we consider undirected, unattributed graphs, represented as G = (V, E), where V = {v_1, ..., v_n} is the set of nodes and E ⊆ V × V is the set of edges. We study common graph statistics in the cross-silo federated scenario, where there are an untrusted server and m silos, i.e., clients C = {C_1, ..., C_m}. Each client C_i owns a subgraph G_i, which is represented as an n × n adjacency matrix. The virtual global graph is the union of all subgraphs, represented as G = ∪_{i=1}^m G_i. It is worth noting that each client is mutually independent of the others, and there may exist overlapping information among different subgraphs, denoted as G_i ∩ G_j ≠ ∅, where i ≠ j. Clients collaboratively support graph queries over their subgraph data while preserving user privacy. Table I lists the major notations used in this paper.
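As a minimal illustration of this system model (with a toy node set and hypothetical subgraphs), the virtual global graph is simply the union of the clients' edge sets, and overlapping edges between subgraphs are allowed:

```python
# Toy sketch of the system model: each client C_i holds a subgraph as an
# edge set over the common node set V; the virtual global graph G is the
# union of all subgraphs. All concrete values here are illustrative.
nodes = [0, 1, 2, 3]                 # V = {v_1, ..., v_n}
g1 = {(0, 1), (1, 2)}                # subgraph of client C_1
g2 = {(1, 2), (2, 3)}                # subgraph of client C_2
g3 = {(0, 3)}                        # subgraph of client C_3
global_graph = g1 | g2 | g3          # G = G_1 ∪ G_2 ∪ G_3
overlap = g1 & g2                    # G_1 ∩ G_2 ≠ ∅: edge (1, 2) is shared
```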

II-A2 Trust Assumption

Our objective is to design a protocol that enables the server to coordinate graph statistics while ensuring that none of the clients' sensitive information is disclosed. Similar to prior works [19, 25, 26], we assume that clients are semi-honest. In other words, each client follows the protocol honestly but is curious about the sensitive information of other clients. The server is untrusted and has no access to individual sensitive information. Furthermore, we assume that any parties beyond the system, such as external servers, analysts, or other individuals, are computationally bounded adversaries.

TABLE I: Summary of Notations.
       Notation        Definition
       G        True global graph
       V        Node set
       E        Edge set
       G'        Noisy global graph
       C        Set of all clients
       m        Number of clients in C
       C_i        The i-th client
       G_i        Subgraph of client C_i
       n        Number of nodes in G
       D'        Noisy degree sequence
       S_k'        Noisy k-star counts
       T'        Noisy triangle counts

II-B Privacy Model

Like previous works [7, 8, 9, 10, 14], the private information considered in this study is edge privacy. We assume that the server knows the node information, i.e., V = {v_1, ..., v_n}, which makes sense in some real-world applications. For example, consider a health administration examining the spread of COVID-19 in a certain region. It collects the disease transmission paths according to the released census in this area.

Differential privacy (DP) [11, 12] has become a de-facto standard for preserving individual privacy, as formalized in Definition 1. In our work, we use global sensitivity [11] to achieve DP, as defined in Definition 2. It considers the maximum difference between statistical results on two neighboring graphs.

Definition 1 (Differential Privacy [11]).

Let ε > 0 be the privacy budget and n be the number of users. A randomized algorithm ℳ with domain 𝔻^n satisfies ε-DP, iff for any neighboring datasets D, D' ∈ 𝔻^n that differ in a single user's data and any subset S ⊆ Range(ℳ),

Pr[ℳ(D) ∈ S] ≤ e^ε · Pr[ℳ(D') ∈ S].

Definition 2 (Global Sensitivity [11]).

For a query function f: 𝔻 → ℝ, the global sensitivity is defined by

Δ_GS = max_{D ∼ D'} |f(D) − f(D')|,

where D and D' are neighboring databases that differ in a single user's data.

The Laplace mechanism is one of common techniques to achieve DP. The formal definition is as follows:

Definition 3 (Laplace Mechanism [27]).

Given any function f: 𝔻 → ℝ^k, let Δf be the sensitivity of f. M(x) = f(x) + (Y_1, ..., Y_k) satisfies (ε, 0)-differential privacy, where the Y_i are i.i.d. random variables drawn from Lap(Δf/ε).
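As a sketch, the Laplace mechanism of Definition 3 can be implemented by inverse-CDF sampling; the sensitivity-1 counting query below is an illustrative assumption, not taken from the paper:

```python
import math
import random

# Minimal sketch of the Laplace mechanism (Definition 3): release the
# query output plus noise drawn from Lap(Δf/ε).
def laplace_noise(scale):
    """Sample Lap(0, scale) by inverting the CDF."""
    u = random.random() - 0.5                     # u ~ Uniform(-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def laplace_mechanism(value, sensitivity, eps):
    """Release value + Lap(sensitivity/eps), satisfying eps-DP."""
    return value + laplace_noise(sensitivity / eps)

# e.g., a counting query (sensitivity 1) released with budget eps = 1
noisy_count = laplace_mechanism(42, sensitivity=1.0, eps=1.0)
```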

Definition 4 (Edge LDP [28]).

For any i ∈ [n], let ℳ_i be a randomized algorithm of user v_i. ℳ_i satisfies ε-Edge LDP, iff for any two neighboring adjacency bit vectors A_i and A_i' that differ in one edge and any subset S ⊆ Range(ℳ_i),

Pr[ℳ_i(A_i) ∈ S] ≤ e^ε · Pr[ℳ_i(A_i') ∈ S],

where ε > 0 is the privacy budget.

Existing locally differentially private models, such as edge local differential privacy (Definition 4) [7, 9, 8], are promising. However, they fail to provide a privacy guarantee in federated scenarios. Since the same edge may exist in multiple clients, while each client only considers its own edge information, multiple reports of the same information increase the probability of distinguishing such an edge, leading to privacy issues. To address this challenge, we introduce edge distributed differential privacy to achieve our privacy objectives. The formal definition is as follows:

Definition 5 (Neighboring Graphs).

Two graphs G and G' are neighboring graphs if they differ in one edge.

Definition 6 (Edge Distributed Differential Privacy (Edge DDP)).

Let G = (V, E) and G' = (V, E') be two neighboring global graphs. Let C = {C_1, ..., C_m} be the client set. Let G_i and G_i' (i ∈ [1, m]) be the subgraphs owned by client C_i in G and G', respectively. A set of randomized mechanisms {ℳ_i, i ∈ [1, m]} collectively satisfies ε-Edge DDP iff for any subsets of possible outputs S_i ⊆ Range(ℳ_i), i ∈ [1, m], we have the following inequality:

Pr[ℳ_1(G_1) ∈ S_1, ..., ℳ_m(G_m) ∈ S_m]

≤ e^ε · Pr[ℳ_1(G_1') ∈ S_1, ..., ℳ_m(G_m') ∈ S_m].

Edge DDP guarantees that the server cannot distinguish the presence or absence of any edge based on all reports collected from clients. It also guarantees that the information about which client C_i an edge in E belongs to is private, if the edge exists. For example, in Figure 1(c), both the presence of the edge ⟨v_2, v_3⟩ in G and the absence of the edge ⟨v_1, v_6⟩ in G are private. Furthermore, no party knows that the edge ⟨v_2, v_3⟩ belongs to client C_2, even if the existence of ⟨v_2, v_3⟩ has been disclosed.

II-C Private Set Union

Private Set Union (PSU) [21, 22, 23, 24] is a secure multiparty computation technique designed for securely computing the union of private sets held by different parties. At its core, no party reveals anything to the other parties except the elements in the union. From a high-level perspective, a typical PSU protocol involves the following steps: (1) Set Encoding: each party privately encodes its set into a cryptographic form suitable for secure computation: [[X_i]] ← Enc(X_i), i ∈ [1, m]. (2) Union Computation: parties engage in computing the union of their encoded sets without revealing the underlying elements: [[X]] ← Enc(X_1) ∪ Enc(X_2) ∪ ... ∪ Enc(X_m). (3) Result Decoding: once the computation is complete, parties decode the computed union from its cryptographic representation to obtain the set union: X ← Dec([[X]]).

III A General Framework: FEAT

In this section, we introduce our proposed general framework for federated graph analytics with differential privacy, called FEAT. The main idea is to let the server privately collect subgraphs from local clients and aggregate a noisy global graph, which facilitates common graph statistics. As shown in Figure 2, FEAT enhances the utility by (1) reducing the added noise by introducing a cryptographic technique (i.e., PSU) into differentially private graph statistics; and (2) calibrating the noisy results to suppress the estimation bias.

We first present a baseline approach that revises a prior protocol (i.e., randomized response) to satisfy our privacy goal, and discuss its limitations. Then, we give an overview of our proposed general framework FEAT, which substantially improves the baseline approach, and elaborate on its details in subsequent sections.

III-A A Baseline Approach

Randomized response (RR) [11] is a common methodology for achieving local differential privacy. However, it fails to provide ε-Edge DDP, as the same edge could be reported by different clients multiple times. One edge may exist in multiple subgraphs; in the worst case, an edge is included in all subgraphs G_i, i ∈ [1, m]. Without loss of generality, assume G_i and G_i' are neighboring graphs that differ in edge e_1, and that each client's mechanism satisfies Pr[ℳ_i(G_i) ∈ S_i] ≤ e^ε · Pr[ℳ_i(G_i') ∈ S_i]. Then, we have

Pr[ℳ_1(G_1) ∈ S_1, ..., ℳ_m(G_m) ∈ S_m] / Pr[ℳ_1(G_1') ∈ S_1, ..., ℳ_m(G_m') ∈ S_m]
= (Pr[ℳ_1(G_1) ∈ S_1] ⋯ Pr[ℳ_m(G_m) ∈ S_m]) / (Pr[ℳ_1(G_1') ∈ S_1] ⋯ Pr[ℳ_m(G_m') ∈ S_m])
≤ (e^ε)^m = e^{mε}.

To address this privacy issue, we consider the following baseline method. It divides the overall privacy budget equally into m portions. Each client then randomizes its subgraph using randomized response with ε_l = ε/m. As a result, the baseline provides ε-Edge DDP, i.e.,

Pr[ℳ_1(G_1) ∈ S_1, ..., ℳ_m(G_m) ∈ S_m] / Pr[ℳ_1(G_1') ∈ S_1, ..., ℳ_m(G_m') ∈ S_m] ≤ (e^{ε_l})^m = e^{m·ε/m} = e^ε.

Algorithm 1 shows the details of the baseline approach. It takes as input the subgraph set G = {G_1, ..., G_m} and the privacy budget ε. Each graph G_i is represented as an n × n adjacency matrix M_i, where each bit ∈ {0, 1}. If there is an edge between two users, the bit is set to 1; otherwise, it is set to 0. Each client flips each bit in the upper-triangular part of the matrix with flipping probability 1/(1 + e^{ε_l}), where ε_l = ε/m. After that, the server collects all noisy edges and aggregates their union as a noisy global graph G'. Finally, the server computes the targeted graph statistics f(G') based on the noisy global graph G'. This algorithm is denoted as Baseline.

Theorem 1.

The Baseline approach satisfies ε-Edge DDP.

Algorithm 1 Baseline: Plain Randomized Response
Input: Subgraph set G = {G_1, ..., G_m}, privacy budget ε; each G_i is represented as an adjacency matrix M_i ∈ {0, 1}^{n×n}
Output: Noisy global graph G'
Each client C_i perturbs M_i:
1:  for each bit b in M_i do
2:     b' = b w.p. e^{ε_l}/(1 + e^{ε_l}); b' = 1 − b w.p. 1/(1 + e^{ε_l}), where ε_l = ε/m
3:  Client C_i: send the noisy subgraph G_i' to the server
4:  Server: G' ← ∪_{i=1}^m G_i'
5:  return G'
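A runnable sketch of Algorithm 1 follows; the edge-set encoding and function names are our own illustrative choices:

```python
import math
import random

# Sketch of Algorithm 1 (Baseline): each client flips every bit of its
# adjacency matrix with probability 1/(1 + e^(eps/m)), and the server
# unions the noisy subgraphs into G'.
def baseline(subgraphs, n, eps):
    m = len(subgraphs)
    eps_l = eps / m                                   # per-client budget
    keep = math.exp(eps_l) / (1 + math.exp(eps_l))    # keep-bit probability
    noisy_union = set()
    for edges in subgraphs:                           # each client C_i
        for i in range(n):
            for j in range(i + 1, n):                 # upper-triangular bits
                bit = 1 if (i, j) in edges else 0
                if random.random() >= keep:           # flip w.p. 1/(1+e^eps_l)
                    bit = 1 - bit
                if bit == 1:
                    noisy_union.add((i, j))
    return noisy_union                                # noisy global graph G'

noisy_g = baseline([{(0, 1)}, {(0, 1), (1, 2)}], n=4, eps=2.0)
```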
Figure 2: Overview of FEAT.

Discussion. To provide a strong privacy guarantee, only a small privacy budget is allocated to each client. Although the baseline approach achieves the privacy goal discussed in Section II, much redundant noise is added to the results. For example, for k-star counting, Baseline incurs an expected l_2 loss of O((mn)^{2k−2}/ε²); for triangle counting, it incurs an expected l_2 loss of O((mn)²/ε²). Our experiments in Section V further show that the baseline approach cannot obtain competitive accuracy in various cases.

III-B Overview of FEAT

To alleviate the limitations of the Baseline method, we propose a federated graph analytic framework, called FEAT, as shown in Figure 2. It supports arbitrary downstream common graph statistics while satisfying ε-Edge DDP.

As discussed in the above section, one edge may appear in m clients in the extreme case, and then the overall privacy budget ε must be allocated across m clients, i.e., ε_l = ε/m, which leads to much noise. In FEAT, we leverage the private set union (PSU) technique [21, 22, 23, 24] to achieve ε_l = ε, reducing the noise scale. PSU allows parties to collaboratively compute the union of multiple sets held by different parties without revealing the individual elements of the sets to other parties. It guarantees that an element appears in the final union only once. However, previous PSU protocols cannot be easily applied in our FGA setting. Most PSU works [21, 22, 24] focus on a two-party setting, which differs from our multi-party scenario. Although the paper [23] proposes a multi-party protocol, it faces security and efficiency issues. Additionally, this multi-party PSU [23] outputs the true union and is unable to provide a differential privacy guarantee. To this end, we propose a differentially private set union protocol to compute the union of multiple subgraphs while satisfying ε-Edge DDP. The detailed analysis is presented in Section III-C.

Algorithm 2 presents the overall protocol of FEAT. It involves two kinds of entities: clients and a server. Each local client C_i, i ∈ [m], owns a subgraph that is represented as an n × n adjacency matrix (n is the number of users). The protocol takes as input the set of subgraphs G = {G_1, ..., G_m}, the privacy budget ε, and the targeted graph query Q. At the beginning, FEAT calls the CollectGraph() function to collect a noisy global graph (step ①). Each client perturbs its subgraph with suitable noise and then encrypts the noisy subgraph. Next, the clients communicate with each other to compute the union of subgraphs. All clients collaboratively decrypt the union and output a noisy global graph (details in Section III-C). Once the noisy global graph is obtained, FEAT executes the targeted graph statistics (step ②). In this paper, we compute subgraph counts (i.e., k-stars and triangles) to verify the effectiveness of our framework. Considering the estimation bias, the server further calibrates the noisy results to improve the utility (refer to Section III-D for details).

Algorithm 2 Overall Protocol of FEAT.
Input: Subgraph set G = {G_1, ..., G_m}, where each G_i is represented as an n × n adjacency matrix; privacy budget ε; targeted query Q
Output: Query result Q'
Step 1: DPSU-based Graph Collection
1:  G' ← CollectGraph(G, ε)   ▷ Algorithm 3
Step 2: Graph Query and Calibration
2:  Q' ← Query(G', Q)   ▷ Section III-D
3:  return Q'

III-C DPSU-based Graph Collection

We propose a DPSU protocol based on the state-of-the-art PSU method for computing the global graph under DP.

PSU Protocol. Suppose that each client C_i (i ∈ [1, m]) owns a set X_i ⊆ X = {x_1, ..., x_n}. Our goal is to compute the union X = ∪_{i=1}^m X_i.

The SOTA multi-party PSU method [23] supports computing the union among multiple clients as follows:

(1) Initialization. Each client C_i creates a flag vector W_i. If x_j ∈ X_i (j ∈ [1, n]), W_{i,j} = 1; otherwise, W_{i,j} = 0.

(2) Key Generation. Each client C_i (i ∈ [1, m]) generates a pair of keys ⟨pk_i, sk_i⟩, where pk_i = g^{sk_i}. All clients jointly generate the public key: pk = ∏_{i=1}^m pk_i = g^{sk_1 + ... + sk_m}.

(3) Encryption. Each client C_i (i ∈ [1, m]) encrypts W_i with the public key pk: Enc(W_i) = (Enc(W_{i,1}), ..., Enc(W_{i,n})).

(4) Modification. Each client C_i (i ∈ [2, m]) modifies the flag vector W_i in sequence as follows: if x_j ∉ X_i (j ∈ [1, n]), Enc(W_{i,j}) = Enc(W_{i−1,j}); otherwise, Enc(W_{i,j}) = Enc(W_{i,j}).

(5) Decryption. All clients jointly decrypt Enc(W_m) with the secret keys {sk_1, ..., sk_m}: W_m ← Dec[Enc(W_m)]. For j ∈ [1, n], if W_{m,j} = 1, X ← X ∪ {x_j}.
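The five steps above can be sketched with textbook ElGamal encryption, where the joint public key is g^{sk_1+...+sk_m} and decryption combines every client's share. The group parameters below are toy values for illustration only; a secure instantiation would use a proper group and re-randomized ciphertexts:

```python
import random

# Toy sketch of the multi-party PSU flow (steps 1-5) using textbook
# ElGamal. P and G are NOT secure parameters; all names are illustrative.
P = 2**127 - 1          # toy prime modulus
G = 3                   # toy group element

def enc(pk, bit):
    """ElGamal-encrypt a flag bit, encoded as g^bit, under the joint key."""
    r = random.randrange(2, P - 1)
    return (pow(G, r, P), (pow(G, bit, P) * pow(pk, r, P)) % P)

def joint_dec(sks, ct):
    """All clients jointly decrypt: strip g^(r * sum(sk)) from c2."""
    c1, c2 = ct
    shares = 1
    for sk in sks:                      # each client contributes c1^sk_i
        shares = (shares * pow(c1, sk, P)) % P
    msg = (c2 * pow(shares, -1, P)) % P # c2 / c1^(sk_1+...+sk_m)
    return 1 if msg == G else 0         # g^1 -> bit 1, g^0 = 1 -> bit 0

def psu(domain, client_sets):
    sks = [random.randrange(2, P - 1) for _ in client_sets]  # sk_i
    pk = 1
    for sk in sks:                      # pk = g^(sk_1 + ... + sk_m)
        pk = (pk * pow(G, sk, P)) % P
    # C_1 encrypts its flag vector; C_2..C_m overwrite positions they hold
    # with a fresh Enc(1) (equivalent outcome to the Modification step).
    cts = [enc(pk, 1 if x in client_sets[0] else 0) for x in domain]
    for s in client_sets[1:]:
        for j, x in enumerate(domain):
            if x in s:
                cts[j] = enc(pk, 1)
    return {x for x, ct in zip(domain, cts) if joint_dec(sks, ct) == 1}

# Example: three clients with overlapping sets over domain {0, ..., 7}.
union = psu(list(range(8)), [{0, 1, 2}, {2, 3}, {3, 4, 5}])
```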

However, the above protocol may face a security issue. It is implemented based on finite field cryptography (FFC) [29], and in fact a 512-bit prime p is not secure according to Table 2 in [30]. To implement this multi-party PSU securely, the prime p should be set to 3072 bits, which makes it inefficient for large-scale graph analytics. To improve the tradeoff between security and efficiency, we implement the PSU protocol based on Elliptic Curve Cryptography (ECC) [31], which is more efficient than FFC. In particular, there are three cryptographic libraries for implementing ECC, namely, Libsodium [32], OpenSSL [33], and MCL [34]. As shown in Table II, Libsodium is the most suitable for processing large-scale graph data, which motivates us to implement the PSU protocol based on the Libsodium library.

TABLE II: Performance of ECC with Three Libraries.
 ECC \ n      10       10^2    10^3    10^4    10^5     10^6
 Libsodium    0.0056   0.059   0.558   5.26    53.14    528.10
 OpenSSL      0.0098   0.060   0.549   5.49    53.76    539.52
 MCL          0.0541   0.504   4.983   49.84   498.84   4986.43
  • 1 n: data size

DPSU Protocol. We then propose a differentially private set union (DPSU) protocol for aggregating a global graph. One challenge is that the output of the PSU protocol does not satisfy DP: individual privacy could be disclosed via the union of subgraphs. Although some previous works [35, 36, 37, 38] discuss how to compute a private set union under DP, their system models differ from ours and cannot be used directly. One naive solution is that each client C_i perturbs the flag vector W_i using the randomized response (RR) mechanism [39] before encrypting. Specifically, each client C_i flips each bit in its flag vector W_i with probability p = 1/(1 + e^ε) in the Initialization step of the PSU protocol, where ε is the privacy budget. Subsequently, the differentially private set union can be computed according to steps (2)–(5) of the above PSU protocol.

Although this naive solution can provide a DP guarantee, it introduces significant bias into the noisy graph. In fact, most real-world graphs tend to be sparse, which means that there are many more 0s than 1s in an adjacency matrix. After randomly flipping bits, however, the number of 1s becomes much larger than before, making the noisy global graph denser than the original one. Additionally, the Modification step further amplifies the densification problem: even if a '0' bit is flipped into a '1' bit in only one of the m subgraphs, the corresponding bit in the final union becomes '1'. In particular, assume that the number of '1' bits in the true global graph is t and the number of '0' bits is (n² − t). After flipping, the expected number of '1' bits becomes m[(1 − p)t + p(n² − t)]. Take the Facebook graph as an example, which has 4,039 nodes (i.e., n) and 88,234 edges (i.e., t). Even with a fairly small privacy budget ε = 0.1 and m = 5 clients, the expected number of '1' bits becomes 3.8 × 10^7, an increase of at least 439 times.
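The arithmetic behind this example can be checked directly, using the figures stated above:

```python
import math

# Numeric check of the densification estimate: expected '1' bits after RR,
# m * [(1 - p) * t + p * (n^2 - t)], with the Facebook graph figures
# stated above (n = 4,039 nodes, t = 88,234 edges), m = 5, eps = 0.1.
n, t, m, eps = 4039, 88234, 5, 0.1
p = 1 / (1 + math.exp(eps))                   # RR flip probability
ones = m * ((1 - p) * t + p * (n * n - t))    # expected '1' bits in the union
blow_up = ones / t                            # growth relative to true edges
print(f"expected 1-bits: {ones:.3g}, blow-up: {blow_up:.0f}x")
```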

To address this issue, we propose a novel differentially private set union protocol. In particular, '0' bits are perturbed only by the first client, while '1' bits are randomized by all clients. The server can then obtain the whole noisy matrix by computing the union. Theoretically, our method reduces the densification problem by a factor of m. As a result, the utility loss is alleviated by at least a factor of O(m²). For instance, for k-star counting, FEAT achieves an expected l_2 loss of O(n^{2k−2}/ε²); for triangle counting, it attains an expected l_2 loss of O(n²/ε²). To further suppress the bias from randomized response, we calibrate the noisy results during graph query processing. The details are discussed in Section III-D.

Algorithm 3 𝖢𝗈𝗅𝗅𝖾𝖼𝗍𝖦𝗋𝖺𝗉𝗁\mathsf{CollectGraph}: DPSU-based Graph Collection.
0:  
Subgraph set G={G1,,Gm},Gi=(Vi,Ei),G=\{G_{1},...,G_{m}\},G_{i}=(V_{i},E_{i}),
Privacy budget ε\varepsilon
0:  Noisy global graph GG^{\prime}
1:  Initialize: An edge domain 𝔼\mathbb{E} according to VV, where N=|𝔼|=n(n1)2N=|\mathbb{E}|=\frac{n(n-1)}{2}
2:  Server: Initialize E=E^{\prime}=\varnothing
3:  Each client Ci[1,m]C_{i\in[1,m]} generates a pair of keys pki,ski\langle pk_{i},sk_{i}\rangle, where pki=gskipk_{i}=g^{sk_{i}}
4:  All clients jointly generate the public key: pk=i=1mpki=gsk1++skmpk=\prod_{i=1}^{m}pk_{i}=g^{sk_{1}+...+sk_{m}}
5:  Client C1C_{1}: Initialize a flag vector YY,
Yj[1,N]={1,𝔼jE10,𝔼jE1Y_{j\in[1,N]}=\left\{\begin{array}[]{ll}1,&\mathbb{E}_{j}\in E_{1}\\ 0,&\mathbb{E}_{j}\notin E_{1}\end{array}\right.
6:  Client C1C_{1}: Perturb YY with 𝖱𝖱\mathsf{RR},
Yj[1,N]={Yjw.p.eε1+eε1Yjw.p.11+eεY_{j\in[1,N]}^{\prime}=\left\{\begin{array}[]{ll}Y_{j}&w.p.\frac{e^{\varepsilon}}{1+e^{\varepsilon}}\\ 1-Y_{j}&w.p.\frac{1}{1+e^{\varepsilon}}\\ \end{array}\right.
7:  Client C1C_{1}: Encrypt YY^{\prime} with pkpk to 𝖤𝗇𝖼(Y)=[𝖤𝗇𝖼(Y1),,𝖤𝗇𝖼(YN)]\mathsf{Enc}(Y^{\prime})=[\mathsf{Enc}(Y_{1}^{\prime}),...,\mathsf{Enc}(Y_{N}^{\prime})]. Send 𝖤𝗇𝖼(Y)\mathsf{Enc}(Y^{\prime}) to client C2C_{2}
8:  for each client Ci,i[2,m]C_{i},i\in[2,m] do
9:     for each bit Yj,j[1,N]Y_{j}^{\prime},j\in[1,N] do
10:        if 𝔼jEi\mathbb{E}_{j}\in E_{i} then
11:           if 𝖱𝖱(𝟣)==1\mathsf{RR(1)}==1 then 𝖤𝗇𝖼(Yj)𝖤𝗇𝖼(1)\mathsf{Enc}(Y_{j}^{\prime})\leftarrow\mathsf{Enc}(1)
12:           else if 𝖱𝖱(𝟣)==0\mathsf{RR(1)}==0 then 𝖤𝗇𝖼(Yj)𝖤𝗇𝖼(Yj)\mathsf{Enc}(Y_{j}^{\prime})\leftarrow\mathsf{Enc}(Y_{j}^{\prime})
13:        else if 𝔼jEi\mathbb{E}_{j}\notin E_{i} then
14:           if 𝖱𝖱(𝟢)==1\mathsf{RR(0)}==1 then 𝖤𝗇𝖼(Yj)𝖤𝗇𝖼(1)\mathsf{Enc}(Y_{j}^{\prime})\leftarrow\mathsf{Enc}(1)
15:           else if 𝖱𝖱(𝟢)==0\mathsf{RR(0)}==0 then 𝖤𝗇𝖼(Yj)𝖤𝗇𝖼(Yj)\mathsf{Enc}(Y_{j}^{\prime})\leftarrow\mathsf{Enc}(Y_{j}^{\prime})
16:  All clients jointly decrypt 𝖤𝗇𝖼(Y)\mathsf{Enc}(Y^{\prime}) with their secret keys: Y𝖣𝖾𝖼[𝖤𝗇𝖼(Y)]Y^{\prime}\leftarrow\mathsf{Dec[Enc}(Y^{\prime})]. Send YY^{\prime} to server
17:  for each bit Yj,j[1,N]Y_{j}^{\prime},j\in[1,N] do
18:     if Yj=1Y_{j}^{\prime}=1 then EE{𝔼j}E^{\prime}\leftarrow E^{\prime}\cup\{\mathbb{E}_{j}\}
19:  Server: G(V,E)G^{\prime}\leftarrow(V,E^{\prime})
20:  return  GG^{\prime}

Algorithm 3 describes the details of our DPSU-based edge collection method. It takes as input the subgraph set GG and the privacy budget ε\varepsilon. Each client CiC_{i} initializes an edge domain 𝔼\mathbb{E} according to the node information VV. In particular, each client CiC_{i} first constructs an n×nn\times n adjacency matrix MiM_{i}, where n=|V|n=|V|, and then transforms the upper triangular part of MiM_{i} into a vector 𝔼\mathbb{E} of size N=|𝔼|=n(n1)2N=|\mathbb{E}|=\frac{n(n-1)}{2}. For example, 𝔼1=e1=v1,v2\mathbb{E}_{1}=e_{1}=\langle v_{1},v_{2}\rangle, 𝔼2=e2=v1,v3\mathbb{E}_{2}=e_{2}=\langle v_{1},v_{3}\rangle. The server initializes an empty set EE^{\prime} for collecting the edge information. Each client CiC_{i} generates a pair of keys pki,ski\langle pk_{i},sk_{i}\rangle, and all clients jointly generate the public key pkpk. Client C1C_{1} first initializes a flag vector YY according to the rule in line 5: if 𝔼j\mathbb{E}_{j} exists in E1E_{1}, the corresponding bit is set to 1; otherwise, it is set to 0. Afterward, client C1C_{1} perturbs each bit of YY with the flipping probability 11+eε\frac{1}{1+e^{\varepsilon}} using the randomized response, encrypts the noisy YY^{\prime} with the public key pkpk, and sends 𝖤𝗇𝖼(Y)\mathsf{Enc}(Y^{\prime}) to client C2C_{2}. Each client CiC_{i} from C2C_{2} to CmC_{m} then updates 𝖤𝗇𝖼(Y)\mathsf{Enc}(Y^{\prime}) according to the rules in lines 10 to 15: client CiC_{i} applies RR to ‘1’ if 𝔼j\mathbb{E}_{j} is in EiE_{i} and to ‘0’ otherwise; if the randomized output is 1, she replaces 𝖤𝗇𝖼(Yj)\mathsf{Enc}(Y_{j}^{\prime}) with 𝖤𝗇𝖼(1)\mathsf{Enc}(1), and otherwise she leaves it unchanged. Once client CmC_{m} completes updating 𝖤𝗇𝖼(Y)\mathsf{Enc}(Y^{\prime}), all clients jointly decrypt 𝖤𝗇𝖼(Y)\mathsf{Enc}(Y^{\prime}) and send the decrypted YY^{\prime} to the server. Finally, the server generates and releases GG^{\prime} according to YY^{\prime}. We refer to this algorithm as 𝖢𝗈𝗅𝗅𝖾𝖼𝗍𝖦𝗋𝖺𝗉𝗁\mathsf{CollectGraph}.
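For intuition, the flag-vector logic of Algorithm 3 can be sketched in plaintext as follows. This is only an illustration of the RR-and-union mechanics: the threshold ElGamal layer is omitted, so unlike the real protocol the intermediate vector is visible here, and all function names are ours.

```python
import math
import random

def rr(bit, eps, rng):
    """Randomized response: keep the bit w.p. e^eps/(1+e^eps)."""
    keep = rng.random() < math.exp(eps) / (1.0 + math.exp(eps))
    return bit if keep else 1 - bit

def collect_graph(edge_sets, domain, eps, seed=0):
    """Plaintext sketch of Algorithm 3 (CollectGraph), without encryption."""
    rng = random.Random(seed)
    # Client C1 perturbs every bit of its flag vector with RR (lines 5-6).
    y = [rr(1 if e in edge_sets[0] else 0, eps, rng) for e in domain]
    # Clients C2..Cm (lines 8-15): each client randomizes its own bit and,
    # whenever the randomized output is 1, overwrites the slot with
    # (an encryption of) 1; a slot is never reset to 0.
    for edges in edge_sets[1:]:
        for j, e in enumerate(domain):
            if rr(1 if e in edges else 0, eps, rng) == 1:
                y[j] = 1
    # Server (lines 17-19): keep the edges whose final bit is 1.
    return {e for j, e in enumerate(domain) if y[j] == 1}
```

With a very large budget (hence a negligible flipping probability), the output coincides with the exact union, mirroring Example 3.1.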

Example 3.1. Table III shows how Algorithm 3 computes the union of local edge sets while satisfying DP. Specifically, clients C1,C2,C3C_{1},C_{2},C_{3} own the private edge sets E1={e1,e2},E2={e2,e3},E3={e5}E_{1}=\{e_{1},e_{2}\},E_{2}=\{e_{2},e_{3}\},E_{3}=\{e_{5}\}, and the node set is V={v1,v2,v3,v4}V=\{v_{1},v_{2},v_{3},v_{4}\}. Thus, E1,E2,E3𝔼={e1,e2,e3,e4,e5,e6}E_{1},E_{2},E_{3}\subseteq\mathbb{E}=\{e_{1},e_{2},e_{3},e_{4},e_{5},e_{6}\}, where e1=v1,v2e_{1}=\langle v_{1},v_{2}\rangle, e2=v1,v3e_{2}=\langle v_{1},v_{3}\rangle, e3=v1,v4e_{3}=\langle v_{1},v_{4}\rangle, e4=v2,v3e_{4}=\langle v_{2},v_{3}\rangle, e5=v2,v4e_{5}=\langle v_{2},v_{4}\rangle, e6=v3,v4e_{6}=\langle v_{3},v_{4}\rangle. The goal is to compute E=E1E2E3E=E_{1}\cup E_{2}\cup E_{3}. First, client C1C_{1} constructs a flag vector YY according to line 5 of Algorithm 3, perturbs each bit of YY, and encrypts the result. Second, clients C2C_{2} and C3C_{3} update the encrypted vector following lines 10 to 15 of Algorithm 3. Third, all clients jointly decrypt 𝖤𝗇𝖼(Y)\mathsf{Enc}(Y^{\prime}) and send YY^{\prime} to the server. Finally, the server obtains the noisy edge union E={e1,e2,e4,e5}E=\{e_{1},e_{2},e_{4},e_{5}\} according to YY^{\prime}.

TABLE III: An Example of DPSU Protocol.
YY Y1Y_{1} Y2Y_{2} Y3Y_{3} Y4Y_{4} Y5Y_{5} Y6Y_{6}
C1C_{1} [[(1)]][\![(1)]\!] [[(1)]][\![(1)]\!] [[(0)]][\![(0)]\!] [[(0)]][\![(0)]\!] [[(0)]][\![(0)]\!] [[(0)]][\![(0)]\!]
C2C_{2} [[(1)]][\![(1)]\!] [[(1)]][\![(1)]\!] [[(1)]][\![(1)]\!] [[(0)]][\![(0)]\!] [[(0)]][\![(0)]\!] [[(0)]][\![(0)]\!]
C3C_{3} [[(1)]][\![(1)]\!] [[(1)]][\![(1)]\!] [[(1)]][\![(1)]\!] [[(0)]][\![(0)]\!] [[(1)]][\![(1)]\!] [[(0)]][\![(0)]\!]
YY^{\prime} 1 1 0 1 1 0
EE e1e_{1} e2e_{2} - e4e_{4} e5e_{5} -
  • [[x]][\![x]\!]: 𝖤𝗇𝖼(x)\mathsf{Enc}(x); (x)(x): 𝖱𝖱(x)\mathsf{RR}(x).

  • In this example, Y3Y_{3}^{\prime} and Y4Y_{4}^{\prime} are flipped by RR.

Theorem 2.

Algorithm 3 satisfies ε\varepsilon-Edge DDP.

Proof of Theorem 2. Let G=(V,E)G=(V,E) and G=(V,E)G^{\prime}=(V,E^{\prime}) be two neighboring graphs. Let GiG_{i} and GiG_{i}^{\prime} (i[1,m]i\in[1,m]) be the neighboring subgraphs of client CiC_{i} with respect to GG and GG^{\prime}, respectively. Let {i,i[1,m]}\{\mathcal{M}_{i},i\in[1,m]\} be a set of randomized mechanisms with any subsets of possible outputs Sirange(i),i[1,m]S_{i}\subseteq range(\mathcal{M}_{i}),i\in[1,m]. Given the privacy budget ε\varepsilon, each mechanism satisfies Pr[i(Gi)Si]eεPr[i(Gi)Si]Pr[\mathcal{M}_{i}(G_{i})\in S_{i}]\leq e^{\varepsilon}Pr[\mathcal{M}_{i}(G_{i}^{\prime})\in S_{i}]. Owing to the threshold ElGamal encryption, any change of adding or deleting an edge across the mm clients is collected by the server only once, so the neighboring inputs differ for a single client. Thus, we have

Pr[1(G1)S1,,m(Gm)Sm]Pr[1(G1)S1,,m(Gm)Sm]\displaystyle\frac{Pr[\mathcal{M}_{1}(G_{1})\in S_{1},...,\mathcal{M}_{m}(G_{m})\in S_{m}]}{Pr[\mathcal{M}_{1}(G_{1}^{\prime})\in S_{1},...,\mathcal{M}_{m}(G_{m}^{\prime})\in S_{m}]}
=\displaystyle= Pr[1(G1)S1]Pr[m(Gm)Sm]Pr[1(G1)S1]Pr[m(Gm)Sm]\displaystyle\frac{Pr[\mathcal{M}_{1}(G_{1})\in S_{1}]...Pr[\mathcal{M}_{m}(G_{m})\in S_{m}]}{Pr[\mathcal{M}_{1}(G_{1}^{\prime})\in S_{1}]...Pr[\mathcal{M}_{m}(G_{m}^{\prime})\in S_{m}]}
\displaystyle\leq eε\displaystyle e^{\varepsilon}

Therefore, Algorithm 3 satisfies ε\varepsilon-Edge DDP. Furthermore, by the immunity to post-processing [11], Algorithm 2 provides the ε\varepsilon-Edge DDP guarantee. ∎

III-D Graph Query Processing

In this section, we execute two common subgraph counting queries, i.e., kk-star counting [7, 9, 40] and triangle counting [4, 7, 41], in order to explain how to execute the 𝖰𝗎𝖾𝗋𝗒\mathsf{Query} function of Algorithm 2 and how to calibrate the noisy results.

A kk-star refers to a subgraph consisting of a central node connected to kk other nodes. To count the number of kk-stars in a given graph, we iterate through each vertex and compute (dik)\binom{d_{i}^{\prime}}{k}, where did_{i}^{\prime} is the noisy degree of node viv_{i}. The kk-star count SS of the whole graph equals the summation of each node’s kk-stars, i.e., S=i=1n(dik)S=\sum_{i=1}^{n}\binom{d_{i}^{\prime}}{k}. However, computing node degrees directly from GG^{\prime} leads to significant bias induced by the randomized response in Algorithm 3. To mitigate the bias, we leverage the post-processing property of DP [11], yielding an unbiased estimate di~\widetilde{d_{i}} of did_{i}, as shown in Proposition 1. Hence, we obtain an unbiased estimate of the kk-star count, i.e., S=i=1n(di~k)S=\sum_{i=1}^{n}\binom{\widetilde{d_{i}}}{k}.
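As a sketch, the degree calibration of Proposition 1 and the resulting kk-star estimate might look as follows (function names are ours; the generalized binomial coefficient handles the non-integral calibrated degrees):

```python
import math

def debias_degree(noisy_degree, n, eps):
    """Unbiased degree estimate (Proposition 1): invert
    d' = d*(1-p) + (n-d)*p, where p = 1/(1+e^eps) is the flip probability."""
    p = 1.0 / (1.0 + math.exp(eps))
    return (noisy_degree - n * p) / (1.0 - 2.0 * p)

def kstar_count(noisy_degrees, n, eps, k):
    """Unbiased k-star estimate: sum of C(d_i, k) over calibrated degrees."""
    total = 0.0
    for d in noisy_degrees:
        d_t = debias_degree(d, n, eps)
        # Generalized binomial coefficient, since d_t may be non-integral.
        total += math.prod(d_t - j for j in range(k)) / math.factorial(k)
    return total
```

For instance, with n=100n=100 and p=0.25p=0.25 (i.e., ε=ln3\varepsilon=\ln 3), a true degree of 10 is observed as 30 in expectation, and `debias_degree(30, 100, math.log(3))` recovers 10.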

Proposition 1.

Let GG^{\prime} be a noisy global graph. Let did_{i}^{\prime} be the degree of node viv_{i} in GG^{\prime} and di~\widetilde{d_{i}} be an unbiased estimate of the true degree did_{i}. Let nn be the number of nodes in GG^{\prime}. Let p=11+eεp=\frac{1}{1+e^{\varepsilon}} be the flipping probability, where ε\varepsilon is the privacy budget in 𝖥𝖤𝖠𝖳\mathsf{FEAT}. We have

di~=112p(dinp).\widetilde{d_{i}}=\frac{1}{1-2p}(d_{i}^{\prime}-np). (1)

Proof of Proposition 1. The mapping relationship between did_{i}^{\prime} and di~\widetilde{d_{i}} can be represented as:

di=di~(1p)+(ndi~)p.d_{i}^{\prime}=\widetilde{d_{i}}(1-p)+(n-\widetilde{d_{i}})p. (2)

Solving Equation (2) for di~\widetilde{d_{i}} yields Equation (1), which proves Proposition 1. ∎

A triangle in a graph refers to a subgraph consisting of three nodes connected by three edges, forming a closed loop. To count the number of triangles, one common approach is to iterate over every triplet of nodes and check whether it forms a closed loop. However, simply counting the triangles in a noisy graph GG^{\prime} introduces a significant bias. We again calibrate the biased triangle counts through post-processing. Building upon the insights of [7], we categorize triplets in GG^{\prime} into four types based on the number of edges they involve: t0t_{0}, t1t_{1}, t2t_{2}, and t3t_{3}, where tjt_{j} (j{0,1,2,3}j\in\{0,1,2,3\}) represents the count of triplets in GG^{\prime} involving jj edges (referred to as jj-edges; 33-edges are identical to triangles). Thus, we can compute the unbiased triangle count T~\widetilde{T} according to Proposition 2.

Proposition 2.

Let GG^{\prime} be a noisy global graph. Let t0,t1,t2,t3t_{0},t_{1},t_{2},t_{3} be the number of 0-edges, 1-edges, 2-edges, and triangles in GG^{\prime}, respectively. Let ε\varepsilon be the privacy budget used in 𝖥𝖤𝖠𝖳\mathsf{FEAT}. We have

T~=1(eε1)3(t0+t1eεt2e2ε+t3e3ε).\widetilde{T}=\frac{1}{(e^{\varepsilon}-1)^{3}}(-t_{0}+t_{1}e^{\varepsilon}-t_{2}e^{2\varepsilon}+t_{3}e^{3\varepsilon}). (3)

Proof of Proposition 2. Let u0,u1,u2,u3u_{0},u_{1},u_{2},u_{3} be the number of 0-edges, 1-edges, 2-edges, and triangles in GG, respectively, when we do not flip 1/0 using the randomized response. Let x=eεx=e^{\varepsilon}. Then we have:

(t0,t1,t2,t3)=(u0,u1,u2,u3)𝑨,(t_{0},t_{1},t_{2},t_{3})=(u_{0},u_{1},u_{2},u_{3})\bm{A}, (4)

𝑨=1(x+1)3[x33x23x1x2x3+2x2x2+1xx2x2+1x3+2xx213x3x2x3]\bm{A}=\frac{1}{(x+1)^{3}}\begin{bmatrix}x^{3}&3x^{2}&3x&1\\ x^{2}&x^{3}+2x&2x^{2}+1&x\\ x&2x^{2}+1&x^{3}+2x&x^{2}\\ 1&3x&3x^{2}&x^{3}\end{bmatrix}.

𝑨\bm{A} is a transition matrix from a type of subgraph (0-edge, 1-edge, 2-edge, and triangle) in an original graph to a type of subgraph in a noisy graph. Let (u~0,u~1,u~2\widetilde{u}_{0},\widetilde{u}_{1},\widetilde{u}_{2}, u~3\widetilde{u}_{3}) be an unbiased estimate of (u0,u1,u2,u3u_{0},u_{1},u_{2},u_{3}). Then we obtain:

(u~0,u~1,u~2,u~3)=(t0,t1,t2,t3)𝑨1(\widetilde{u}_{0},\widetilde{u}_{1},\widetilde{u}_{2},\widetilde{u}_{3})=(t_{0},t_{1},t_{2},t_{3})\bm{A}^{-1} (5)

Let 𝑨i,j1\bm{A}_{i,j}^{-1} be the (i,j)(i,j)-th element of 𝑨1\bm{A}^{-1}. Since T~=u~3\widetilde{T}=\widetilde{u}_{3}, we need the last column of 𝑨1\bm{A}^{-1}:

𝑨1,41=1(x1)3,𝑨2,41=x(x1)3,\bm{A}_{1,4}^{-1}=-\frac{1}{(x-1)^{3}},\bm{A}_{2,4}^{-1}=\frac{x}{(x-1)^{3}}, (6)
𝑨3,41=x2(x1)3,𝑨4,41=x3(x1)3.\bm{A}_{3,4}^{-1}=-\frac{x^{2}}{(x-1)^{3}},\bm{A}_{4,4}^{-1}=\frac{x^{3}}{(x-1)^{3}}. (7)

Substituting these entries into Equation (5) yields Equation (3), which proves Proposition 2. ∎
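Proposition 2 translates directly into a short de-biasing routine (a sketch; the function name is ours):

```python
import math

def debias_triangles(t0, t1, t2, t3, eps):
    """Unbiased triangle estimate (Proposition 2): combine the counts of
    0-, 1-, 2-, and 3-edge triplets observed in the noisy graph."""
    x = math.exp(eps)
    return (-t0 + t1 * x - t2 * x**2 + t3 * x**3) / (x - 1.0) ** 3
```

One sanity check is to push true triplet counts through the transition matrix A and confirm that the routine recovers the true triangle count exactly.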

IV An Improved Framework: 𝖥𝖤𝖠𝖳\mathsf{FEAT}+

Although 𝖥𝖤𝖠𝖳\mathsf{FEAT} mitigates the problems of the limited view and overlaps, there is still large room for improving the overall utility. 𝖥𝖤𝖠𝖳\mathsf{FEAT} calculates common graph statistics based on a noisy global graph collected from local clients, without considering the true local subgraph information. Graph statistics such as kk-stars and triangles can be calculated more accurately via an additional round of interaction between the server and clients. After the server publishes GG^{\prime}, each client CiC_{i} can combine her local subgraph GiG_{i} with GG^{\prime}, which removes the limited-view issue. For example, in Figure 3, if the targeted query is to count triangles, client C1C_{1} will return Q1=4Q_{1}=4 instead of Q1=1Q_{1}=1 since she can see the edges v1,v3,v2,v3,v3,v5\langle v_{1},v_{3}\rangle,\langle v_{2},v_{3}\rangle,\langle v_{3},v_{5}\rangle and v3,v6\langle v_{3},v_{6}\rangle via GG^{\prime}. Thus, the final answer will be (4+4+4)/3=4(4+4+4)/3=4 (redundant counts are removed).

Figure 3: Motivation of 𝖥𝖤𝖠𝖳\mathsf{FEAT}+.

Nevertheless, one edge may be reported by different clients multiple times, which leaks individual privacy (as discussed in Section I). For example, in Figure 3, the triangle v1,v2,v3\langle v_{1},v_{2},v_{3}\rangle may be collected by the server three times during triangle counting. Intuitively, if each edge is reported by only one of the mm clients, the probability of distinguishing such an edge is reduced. Hence, the problem reduces to determining how to assign each piece of shared information to a single client for reporting. To do this effectively, we propose a degree-based node partition method that splits the nodes into mm disjoint sets, assigning each node to the client holding the highest degree for it. The degree information plays a significant role here: the higher a node’s degree, the more edges it possesses, and the more likely it is to be involved in kk-stars or triangles. For instance, in Figure 3, the degree of node v4v_{4} in client C1C_{1} (=2=2) is larger than its degrees in clients C2C_{2} and C3C_{3} (=1=1). Thus, client C1C_{1} can count the 2-stars or triangles of v4v_{4} more accurately than clients C2C_{2} and C3C_{3}. This is the intuition behind 𝖥𝖤𝖠𝖳\mathsf{FEAT}+.

Algorithm 4 Overall Protocol of 𝖥𝖤𝖠𝖳\mathsf{FEAT}+.
0:  
Subgraph set G={G1,,Gm}G=\{G_{1},...,G_{m}\},
Privacy budget ε=ε1+ε2+ε3\varepsilon=\varepsilon_{1}+\varepsilon_{2}+\varepsilon_{3}, sensitivity \triangle.
0:  Query result QQ^{\prime}
Phase I: Global Graph Collection
1:  G𝖥𝖤𝖠𝖳(G,ε1)G^{\prime}\leftarrow\mathsf{FEAT}(G,\varepsilon_{1}) \triangleright Algorithm 2
Phase II: Local Query Collection
2:  U𝖯𝖺𝗋𝗍𝗂𝗍𝗂𝗈𝗇𝖭𝗈𝖽𝖾(G,ε2)U\leftarrow\mathsf{PartitionNode}(G,\varepsilon_{2}) \triangleright Step 1: Algorithm 5
3:  for each client Ci[m]C_{i\in[m]} in CC do
4:     Qi𝖰𝗎𝖾𝗋𝗒(G,Gi,Ui)Q_{i}\leftarrow\mathsf{Query}(G^{\prime},G_{i},U_{i}) \triangleright Step 2: Section IV-C
5:     Qi=Qi+𝖫𝖺𝗉(ε3)Q_{i}^{\prime}=Q_{i}+\mathsf{Lap}(\frac{\triangle}{\varepsilon_{3}}) \triangleright Step 3: 𝖯𝖾𝗋𝗍𝗎𝗋𝖻𝖺𝗍𝗂𝗈𝗇\mathsf{Perturbation}
6:  Server: Qi=1mQiQ^{\prime}\leftarrow\sum_{i=1}^{m}Q_{i}^{\prime}
7:  return  QQ^{\prime}

IV-A Overview of 𝖥𝖤𝖠𝖳\mathsf{FEAT}+

Algorithm 4 presents the overall protocol of 𝖥𝖤𝖠𝖳\mathsf{FEAT}+. It mainly consists of two phases: global graph collection and local query collection. The global graph collection publishes a noisy global graph GG^{\prime}, which is accomplished by 𝖥𝖤𝖠𝖳\mathsf{FEAT} (Algorithm 2). The local query collection consists of three main steps: degree-based partition (Algorithm 5), graph query processing (Section IV-C), and perturbation. To be specific, it takes as input the local subgraph set G={G1,,Gm}G=\{G_{1},...,G_{m}\}, the privacy budget ε=ε1+ε2+ε3\varepsilon=\varepsilon_{1}+\varepsilon_{2}+\varepsilon_{3}, and the query sensitivity \triangle. 𝖥𝖤𝖠𝖳\mathsf{FEAT}+ first collects a noisy global graph GG^{\prime} via 𝖥𝖤𝖠𝖳\mathsf{FEAT} using ε1\varepsilon_{1}. This global graph is then published to each client CiC_{i} for local query collection. Next, 𝖥𝖤𝖠𝖳\mathsf{FEAT}+ invokes the 𝖯𝖺𝗋𝗍𝗂𝗍𝗂𝗈𝗇𝖭𝗈𝖽𝖾\mathsf{PartitionNode} function with privacy budget ε2\varepsilon_{2} to split the node set VV into mm disjoint node sets UU (Step 1). These node sets are distributed to the respective clients. After that, each client CiC_{i} computes the targeted query. In this paper, we take kk-star and triangle counting as examples to explain the rationale behind 𝖰𝗎𝖾𝗋𝗒\mathsf{Query} (Step 2). Since the noisy global graph GG^{\prime} is dense, we carefully design calibration techniques to obtain unbiased estimates of kk-stars or triangles, as presented in Section IV-C. Upon obtaining the query results, each client CiC_{i} perturbs the local answer with Laplace noise (Step 3). Specifically, each client CiC_{i} calculates the noisy query Qi=Qi+𝖫𝖺𝗉(ε3)Q_{i}^{\prime}=Q_{i}+\mathsf{Lap}(\frac{\triangle}{\varepsilon_{3}}), where \triangle is the global sensitivity. Finally, the server computes Qi=1mQiQ^{\prime}\leftarrow\sum_{i=1}^{m}Q_{i}^{\prime}, which is an unbiased estimate of QQ and satisfies DP.
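Step 3 and the final aggregation can be sketched as follows (names ours; the Laplace sample is drawn as the difference of two exponentials):

```python
import random

def laplace_noise(rng, scale):
    """The difference of two i.i.d. Exponential(1/scale) draws
    is distributed as Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def aggregate_local_queries(local_answers, sensitivity, eps3, seed=0):
    """Step 3 and line 6 of Algorithm 4: each client adds
    Laplace(sensitivity/eps3) noise to its local answer Q_i, and the
    server sums the noisy answers into Q'."""
    rng = random.Random(seed)
    scale = sensitivity / eps3
    return sum(q + laplace_noise(rng, scale) for q in local_answers)
```

Since the Laplace noise has zero mean, the aggregated Q' remains an unbiased estimate of the sum of the local answers.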

Theorem 3.

Algorithm 4 satisfies (ε1+ε2+ε3\varepsilon_{1}+\varepsilon_{2}+\varepsilon_{3})-Edge DDP.

Proof of Theorem 3. Algorithm 4 uses three privacy budgets: ε1\varepsilon_{1} in 𝖥𝖤𝖠𝖳\mathsf{FEAT}, ε2\varepsilon_{2} in 𝖯𝖺𝗋𝗍𝗂𝗍𝗂𝗈𝗇𝖭𝗈𝖽𝖾\mathsf{PartitionNode}, and ε3\varepsilon_{3} in 𝖯𝖾𝗋𝗍𝗎𝗋𝖻𝖺𝗍𝗂𝗈𝗇\mathsf{Perturbation}. By Theorem 2, 𝖥𝖤𝖠𝖳\mathsf{FEAT} satisfies ε1\varepsilon_{1}-Edge DDP. In 𝖯𝖺𝗋𝗍𝗂𝗍𝗂𝗈𝗇𝖭𝗈𝖽𝖾\mathsf{PartitionNode}, each client Ci[m]C_{i\in[m]} adds Laplace noise Lap(mε2)Lap(\frac{m}{\varepsilon_{2}}) to the node degrees, which satisfies ε2\varepsilon_{2}-Edge DDP. In 𝖯𝖾𝗋𝗍𝗎𝗋𝖻𝖺𝗍𝗂𝗈𝗇\mathsf{Perturbation}, each client Ci[m]C_{i\in[m]} adds Laplace noise Lap(ε3)Lap(\frac{\triangle}{\varepsilon_{3}}) to the query result QiQ_{i}, which satisfies ε3\varepsilon_{3}-Edge DDP; by the post-processing property of DP, the aggregated result QQ^{\prime} also satisfies ε3\varepsilon_{3}-Edge DDP. By sequential composition, the entire process of Algorithm 4 provides (ε1+ε2+ε3\varepsilon_{1}+\varepsilon_{2}+\varepsilon_{3})-Edge DDP. ∎

IV-B Degree-based Node Partition

We propose a degree-based node partition technique to split the node set VV into multiple disjoint sets. For each node, the server privately collects mm noisy node degrees {d1,,dm}\{d_{1}^{\prime},...,d_{m}^{\prime}\} from the mm clients, and the node (i.e., user) is assigned to the client that reports the maximum noisy degree. Algorithm 5 presents how to partition users into multiple disjoint sets. It takes as input the true subgraph set G={V,E}={G1,,Gm}G=\{V,E\}=\{G_{1},...,G_{m}\} and the privacy budget ε2\varepsilon_{2}. At the outset, the server initializes the empty sets U={U1,,Um}U=\{U_{1},...,U_{m}\}, one per client, to record the partition information. For each node viv_{i} in the node set VV, each client CjC_{j} computes the node degree di,jd_{i,j} based on its true subgraph GjG_{j}. Then, client CjC_{j} perturbs the degree with the Laplace mechanism and sends the noisy degree di,jd_{i,j}^{\prime} to the server. After that, the server computes the maximum noisy degree di,k=𝗆𝖺𝗑{di,1,,di,m}d_{i,k}^{\prime}=\mathsf{max}\{d_{i,1}^{\prime},...,d_{i,m}^{\prime}\} for each node viv_{i} and adds the node index ii to the corresponding partition set UkU_{k}. The final output is the partition U={U1,,Um}U=\{U_{1},...,U_{m}\}, where UiUj=U_{i}\cap U_{j}=\varnothing and i=1mUi=V\bigcup_{i=1}^{m}U_{i}=V. We refer to this algorithm as 𝖯𝖺𝗋𝗍𝗂𝗍𝗂𝗈𝗇𝖭𝗈𝖽𝖾\mathsf{PartitionNode}.
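A plaintext sketch of this partition logic (names ours; `degree_tables[j][i]` stands for client CjC_{j}'s local degree of node viv_{i}):

```python
import random

def laplace_noise(rng, scale):
    # Difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def partition_nodes(degree_tables, eps2, seed=0):
    """Sketch of Algorithm 5 (PartitionNode): each degree is perturbed with
    Laplace(m/eps2) noise, and node i is assigned to the client that
    reports the largest noisy degree."""
    rng = random.Random(seed)
    m = len(degree_tables)           # number of clients
    n = len(degree_tables[0])        # number of nodes
    parts = [set() for _ in range(m)]
    for i in range(n):
        noisy = [degree_tables[j][i] + laplace_noise(rng, m / eps2)
                 for j in range(m)]
        parts[max(range(m), key=lambda j: noisy[j])].add(i)
    return parts
```

With a very large budget the noise is negligible, so each node goes to the client with the truly largest local degree.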

Algorithm 5 𝖯𝖺𝗋𝗍𝗂𝗍𝗂𝗈𝗇𝖭𝗈𝖽𝖾\mathsf{PartitionNode}: Degree-based Node Partition.
0:  
Subgraph set G={V,E}={G1,,Gm}G=\{V,E\}=\{G_{1},...,G_{m}\},
Privacy budget ε2\varepsilon_{2}
0:  User partition U={U1,,Um}U=\{U_{1},...,U_{m}\}
1:  Initialize: U={U1,,Um}U=\{U_{1},...,U_{m}\}, where Ui[m]=U_{i\in[m]}=\varnothing
2:  for each node viv_{i} in VV do
3:     Client Cj[1,m]C_{j\in[1,m]}:
Compute ii-th node degree di,jd_{i,j}
Perturb di,jdi,j+𝖫𝖺𝗉(mε2)d_{i,j}^{\prime}\leftarrow d_{i,j}+\mathsf{Lap}(\frac{m}{\varepsilon_{2}})
Send di,jd_{i,j}^{\prime} to Server
4:     Server:
Compute di,k𝗆𝖺𝗑{di,1,,di,m}d_{i,k}^{\prime}\leftarrow\mathsf{max}\{d_{i,1}^{\prime},...,d_{i,m}^{\prime}\}
Obtain index kk
Update UkUk{i}U_{k}\leftarrow U_{k}\cup\{i\}
5:  return  UU
Algorithm 6 kk-Star Counting.
0:  
Noisy global graph GG^{\prime}, ii-th subgraph GiG_{i}
ii-th user partition UiU_{i}.
0:  Number of kk-stars SS.
1:  Initialize: d1=d2=0d_{1}=d_{2}=0
2:  G¯GGi\overline{G}\leftarrow G^{\prime}\cup G_{i}
3:  for each node vjv_{j} in UiU_{i} do
4:     for each friend vkv_{k} of vjv_{j} in G¯\overline{G} do
5:        if edgevj,vkedge\langle v_{j},v_{k}\rangle in GiG_{i} then d1d1+1d_{1}\leftarrow d_{1}+1 \triangleright True degree in GiG_{i}
6:        else d2d2+1d_{2}\leftarrow d_{2}+1 \triangleright Noisy degree in GG^{\prime}
7:  p11+eε1p\leftarrow\frac{1}{1+e^{\varepsilon_{1}}}, d~2112p[d2np]\widetilde{d}_{2}\leftarrow\frac{1}{1-2p}[d_{2}-np] \triangleright De-bias
8:  dd1+d~2d\leftarrow d_{1}+\widetilde{d}_{2}, S(dk)S\leftarrow\binom{d}{k}
9:  return  SS

IV-C Graph Query Processing

In this section, we execute two basic subgraph counting queries, i.e., kk-star counting [7, 9, 40] and triangle counting [4, 7, 41], in order to explain how to execute the 𝖰𝗎𝖾𝗋𝗒\mathsf{Query} function in line 4 of Algorithm 4.

IV-C1 kk-Star Counting

A kk-star refers to a subgraph consisting of a central node connected to kk other nodes. The key is to compute the total number of edges in the graph, i.e., the summation of node degrees. From each client CiC_{i}’s view, each node degree is contributed partly by the local subgraph GiG_{i} and partly by the noisy global graph GG^{\prime}. Thus, the problem reduces to computing the true node degree d1d_{1} and the noisy node degree d2d_{2}.

Algorithm 6 shows an instantiation of the 𝖰𝗎𝖾𝗋𝗒\mathsf{Query} function for kk-star counting. It takes as input a noisy global graph GG^{\prime}, the ii-th subgraph GiG_{i}, and the ii-th user partition UiU_{i}. Client CiC_{i} first initializes d1=d2=0d_{1}=d_{2}=0 to record the two kinds of degrees. Then, it obtains a new global graph G¯\overline{G} by computing the union of GG^{\prime} and its true subgraph GiG_{i} (line 2). After that, it traverses each node in the user partition set UiU_{i} and accumulates d1d_{1} and d2d_{2}. If the edge vj,vk\langle v_{j},v_{k}\rangle exists in subgraph GiG_{i}, the true degree is updated as d1d1+1d_{1}\leftarrow d_{1}+1; otherwise, the noisy degree is updated as d2d2+1d_{2}\leftarrow d_{2}+1 (lines 5-6). However, simply using the degrees from GG^{\prime} introduces a significant bias, as the randomized response in Algorithm 3 makes a graph dense [7, 8, 28]. By the post-processing property of DP [11], we obtain an unbiased estimate d2~\widetilde{d_{2}} of the true degree underlying d2d_{2} according to Proposition 3 (line 7). Upon obtaining the node degree d=d1+d2~d=d_{1}+\widetilde{d_{2}}, client CiC_{i} calculates the number of kk-stars as (dk)\binom{d}{k}. The final output is the number of kk-stars.
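A per-node sketch of this logic (names ours; undirected edges are represented as frozensets, and the noisy part of each degree is calibrated as in Proposition 3):

```python
import math

def kstar_query(global_adj, local_edges, part_nodes, n, eps1, k):
    """Sketch of Algorithm 6: global_adj maps each node to its neighbours in
    the merged graph (G' union G_i); edges found in the client's own
    subgraph count exactly, the remaining noisy edges are de-biased."""
    p = 1.0 / (1.0 + math.exp(eps1))
    total = 0.0
    for v in part_nodes:
        neighbours = global_adj.get(v, set())
        d1 = sum(1 for u in neighbours if frozenset((v, u)) in local_edges)
        d2 = len(neighbours) - d1
        d = d1 + (d2 - n * p) / (1.0 - 2.0 * p)  # de-bias the noisy part
        # Generalized binomial coefficient C(d, k) for non-integral d.
        total += math.prod(d - j for j in range(k)) / math.factorial(k)
    return total
```

Note that this variant accumulates per-node kk-star counts, which is the estimate the main text targets.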

Proposition 3.

Let G¯\overline{G} be the union of a noisy global graph GG^{\prime} and the local subgraph GiG_{i}. Let GCG^{C} be the absolute complement of GiG_{i} in G¯\overline{G}. Let did_{i}^{\prime} be the degree of node viv_{i} in GCG^{C} and di~\widetilde{d_{i}} be an unbiased estimate of the corresponding true degree. Let nn be the number of nodes in GCG^{C}. Let p=11+eε1p=\frac{1}{1+e^{\varepsilon_{1}}} be the flipping probability, where ε1\varepsilon_{1} is the privacy budget in 𝖥𝖤𝖠𝖳\mathsf{FEAT} (i.e., line 1 of Algorithm 4). We have

di~=112p(dinp).\widetilde{d_{i}}=\frac{1}{1-2p}(d_{i}^{\prime}-np). (8)

Proof of Proposition 3. The mapping relationship between did_{i}^{\prime} and di~\widetilde{d_{i}} can be represented as:

di=di~(1p)+(ndi~)p.d_{i}^{\prime}=\widetilde{d_{i}}(1-p)+(n-\widetilde{d_{i}})p. (9)

Solving Equation (9) for di~\widetilde{d_{i}} yields Equation (8), which proves Proposition 3. ∎

IV-C2 Triangle Counting

Algorithm 7 Triangle Counting.
0:  
Noisy global graph GG^{\prime}, ii-th subgraph GiG_{i}
ii-th user partition UiU_{i}.
0:  Number of triangles TT.
1:  Initialize: T0=T1=T2=T3=0T_{0}=T_{1}=T_{2}=T_{3}=0
2:  G¯GGi\overline{G}\leftarrow G^{\prime}\cup G_{i}
3:  for each node vjv_{j} in UiU_{i} do
4:     for each friend vkv_{k} of vjv_{j} in G¯\overline{G} do
5:        for each friend vlv_{l} of vjv_{j} in G¯\overline{G} do
6:           if j<k<lj<k<l then
7:              if vk,vlv_{k},v_{l} are friends in G¯\overline{G} then
8:                 Initialize: 𝒆={vj,vk,vj,vl,vk,vl}\bm{e}=\{\langle v_{j},v_{k}\rangle,\langle v_{j},v_{l}\rangle,\langle v_{k},v_{l}\rangle\}
9:                 if 0 edges of 𝒆\bm{e} in GiG_{i} then T0T0+1T_{0}\leftarrow T_{0}+1
10:                 if 1 edges of 𝒆\bm{e} in GiG_{i} then T1T1+1T_{1}\leftarrow T_{1}+1
11:                 if 2 edges of 𝒆\bm{e} in GiG_{i} then T2T2+1T_{2}\leftarrow T_{2}+1
12:                 if 3 edges of 𝒆\bm{e} in GiG_{i} then T3T3+1T_{3}\leftarrow T_{3}+1
13:  (T~0,T~1,T~2)𝖣𝖾𝖻𝗂𝖺𝗌(T0,T1,T2)(\widetilde{T}_{0},\widetilde{T}_{1},\widetilde{T}_{2})\leftarrow\mathsf{Debias}(T_{0},T_{1},T_{2})
14:  TT~0+T~1+T~2+T3T\leftarrow\widetilde{T}_{0}+\widetilde{T}_{1}+\widetilde{T}_{2}+T_{3}
15:  return  TT

Next, we focus on triangle counting. This is more challenging because the three edges of a triangle can come from either the local subgraph GiG_{i} or the noisy global graph GG^{\prime}. There are four kinds of triangles according to the number of edges from GiG_{i}: T0,T1,T2T_{0},T_{1},T_{2}, and T3T_{3}, where TjT_{j} (j{0,1,2,3}j\in\{0,1,2,3\}) is the number of triangles involving jj edges from GiG_{i} (referred to as jj-edges in GiG_{i}). Since T0,T1T_{0},T_{1}, and T2T_{2} involve some noisy edges from GG^{\prime}, simply counting the noisy triangles introduces a bias. We propose dedicated estimation methods to obtain unbiased counts, as presented in Propositions 4, 5, and 6 at the end of this subsection.

Algorithm 7 presents an instantiation of the 𝖰𝗎𝖾𝗋𝗒\mathsf{Query} function for triangle counting. It takes as input a noisy global graph GG^{\prime}, the ii-th subgraph GiG_{i}, and the ii-th user partition UiU_{i}. Client CiC_{i} first obtains a graph G¯\overline{G} by merging GG^{\prime} and GiG_{i}. Then it calculates the four kinds of triangles (T0,T1,T2,T3)(T_{0},T_{1},T_{2},T_{3}), i.e., 0-edge, 1-edge, 2-edge, and 3-edge triangles with respect to GiG_{i}. After that, the unbiased estimates (T~0,T~1,T~2)(\widetilde{T}_{0},\widetilde{T}_{1},\widetilde{T}_{2}) of (T0,T1,T2)(T_{0},T_{1},T_{2}) are computed based on Propositions 4, 5, and 6. The final triangle count TT is obtained by aggregating T~0,T~1,T~2\widetilde{T}_{0},\widetilde{T}_{1},\widetilde{T}_{2}, and T3T_{3}.
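The categorization loop of Algorithm 7 can be sketched as follows (names ours; each triangle is counted once, from the node jUij\in U_{i} with j<k<lj<k<l):

```python
from itertools import combinations

def classify_triangles(global_adj, local_edges, part_nodes):
    """Core loop of Algorithm 7: count the triangles of the merged graph and
    group them by how many of their three edges come from the client's
    own subgraph G_i (edges are represented as frozensets)."""
    counts = [0, 0, 0, 0]  # (T0, T1, T2, T3)
    for j in part_nodes:
        for k, l in combinations(sorted(global_adj[j]), 2):
            if j < k < l and l in global_adj.get(k, set()):
                tri = [frozenset((j, k)), frozenset((j, l)), frozenset((k, l))]
                counts[sum(e in local_edges for e in tri)] += 1
    return counts
```

The returned counts (T0,T1,T2,T3)(T_{0},T_{1},T_{2},T_{3}) then feed the de-biasing step of line 13.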

Proposition 4.

Let G¯\overline{G} be the union of a noisy global graph GG^{\prime} and local subgraph GiG_{i}. Let GCG^{C} be the absolute complement of GiG_{i} in G¯\overline{G}. Let t0,t1,t2,t3t_{0},t_{1},t_{2},t_{3} be the number of 0-edges, 1-edges, 2-edges, and triangles in GCG^{C}, respectively. Let T0T_{0} be the number of 0-edges in GiG_{i}, i.e., T0=t3T_{0}=t_{3}. Let T~0\widetilde{T}_{0} be an unbiased estimate of T0T_{0}. Let ε1\varepsilon_{1} be the privacy budget used in 𝖥𝖤𝖠𝖳\mathsf{FEAT} (i.e., line 1 of Algorithm 4). We have

T~0=1(eε11)3(t0+t1eε1t2e2ε1+T0e3ε1).\widetilde{T}_{0}=\frac{1}{(e^{\varepsilon_{1}}-1)^{3}}(-t_{0}+t_{1}e^{\varepsilon_{1}}-t_{2}e^{2\varepsilon_{1}}+T_{0}e^{3\varepsilon_{1}}). (10)

The proof of Proposition 4 follows the same argument as that of Proposition 2.

Proposition 5.

Let G¯\overline{G} be the union of a noisy global graph GG^{\prime} and local subgraph GiG_{i}. Let GCG^{C} be the absolute complement of GiG_{i} in G¯\overline{G}. Let t0,t1,t2t_{0},t_{1},t_{2} be the number of 0-edges, 1-edges, and 2-edges in GCG^{C}, respectively. Let T1T_{1} be the number of 1-edges in GiG_{i}, i.e., T1=t2T_{1}=t_{2}. Let T~1\widetilde{T}_{1} be an unbiased estimate of T1T_{1}. Let ε1\varepsilon_{1} be the privacy budget used in 𝖥𝖤𝖠𝖳\mathsf{FEAT}. We have

T~1=(eε1+1)[e2ε1t0eε1(eε1+1)t1+(2eε1+1)T1](eε11)(e2ε12eε11).\widetilde{T}_{1}=\frac{(e^{\varepsilon_{1}}+1)[e^{2\varepsilon_{1}}t_{0}-e^{\varepsilon_{1}}(e^{\varepsilon_{1}}+1)t_{1}+(2e^{\varepsilon_{1}}+1)T_{1}]}{(e^{\varepsilon_{1}}-1)(e^{2\varepsilon_{1}}-2e^{\varepsilon_{1}}-1)}. (11)

Proof of Proposition 5. Consider that one edge of a triangle is from GiG_{i} and another two edges are from GCG^{C}. Let u0,u1,u2u_{0},u_{1},u_{2} be the number of 0-edges, 1-edges, and 2-edges in GCG^{C}, respectively, when we do not flip 1/0 using the randomized response. Let x=eε1x=e^{\varepsilon_{1}}. Then we have:

(t0,t1,t2)=(u0,u1,u2)𝑨(t_{0},t_{1},t_{2})=(u_{0},u_{1},u_{2})\bm{A} (12)

𝑨=1(x+1)2[x22x1x1+xx12xx2]\bm{A}=\frac{1}{(x+1)^{2}}\begin{bmatrix}x^{2}&2x&1\\ x&1+x&x\\ 1&2x&x^{2}\\ \end{bmatrix}

𝑨\bm{A} is the transition matrix from a type of subgraph (0-edge, 1-edge, 2-edge) in the original graph to a type of subgraph in the noisy graph. Let u~0,u~1,u~2\widetilde{u}_{0},\widetilde{u}_{1},\widetilde{u}_{2} be unbiased estimates of (u0,u1,u2u_{0},u_{1},u_{2}). Then we obtain:

(u~0,u~1,u~2)=(t0,t1,t2)𝑨1(\widetilde{u}_{0},\widetilde{u}_{1},\widetilde{u}_{2})=(t_{0},t_{1},t_{2})\bm{A}^{-1} (13)

Let 𝑨i,j1\bm{A}_{i,j}^{-1} be the (i,j)(i,j)-th element of 𝑨1\bm{A}^{-1}. Then we have:

𝑨1,11=(x+1)x2(x1)(x22x1),\bm{A}_{1,1}^{-1}=\frac{(x+1)x^{2}}{(x-1)(x^{2}-2x-1)}, (14)
𝑨2,11=(x+1)(x2x)(x1)(x22x1),\bm{A}_{2,1}^{-1}=\frac{(x+1)(-x^{2}-x)}{(x-1)(x^{2}-2x-1)}, (15)
𝑨3,11=(x+1)(2x+1)(x1)(x22x1),\bm{A}_{3,1}^{-1}=\frac{(x+1)(2x+1)}{(x-1)(x^{2}-2x-1)}, (16)

By combining the above equations, Proposition 5 is proved. ∎

Proposition 6.

Let GG^{\prime} be a noisy global graph and GiG_{i} be the local subgraph. Let T2T_{2} be the number of 2-edges in GiG_{i}. Let SS be the number of 2-stars in GiG_{i}. Let T~2\widetilde{T}_{2} be an unbiased estimate of T2T_{2}. Let p=11+eε1p=\frac{1}{1+e^{\varepsilon_{1}}} be the flipping probability. We have

T~2=112p(T2Sp).\widetilde{T}_{2}=\frac{1}{1-2p}(T_{2}-Sp). (17)

Proof of Proposition 6. The mapping relationship between T2T_{2} and T~2\widetilde{T}_{2} can be represented as:

T2=T~2(1p)+(ST~2)p.T_{2}=\widetilde{T}_{2}(1-p)+(S-\widetilde{T}_{2})p. (18)

Solving Equation (18) for T~2\widetilde{T}_{2} yields Equation (17), which proves Proposition 6. ∎

V Experimental Evaluation

In this section, we evaluate our 𝖥𝖤𝖠𝖳\mathsf{FEAT} and 𝖥𝖤𝖠𝖳\mathsf{FEAT}+ along two dimensions: utility and running time. To simulate the federated scenario, we split a graph randomly into multiple local subgraphs by controlling two key parameters: sampling rate ρ\rho and overlapping rate σ\sigma. Then, we conducted experiments to answer the following questions:

  • Q1: How do our general 𝖥𝖤𝖠𝖳\mathsf{FEAT} and improved 𝖥𝖤𝖠𝖳\mathsf{FEAT}+ compare with baseline approaches (denoted by 𝖡𝖺𝗌𝖾𝗅𝗂𝗇𝖾\mathsf{Baseline}) in terms of the utility-privacy trade-off?

  • Q2: How do the sampling rate ρ\rho and overlapping rate σ\sigma affect accuracy?

  • Q3: How do our 𝖥𝖤𝖠𝖳\mathsf{FEAT} and 𝖥𝖤𝖠𝖳\mathsf{FEAT}+ compare with 𝖡𝖺𝗌𝖾𝗅𝗂𝗇𝖾\mathsf{Baseline} in terms of running time?

Evaluation Highlights:

  • 𝖥𝖤𝖠𝖳\mathsf{FEAT} reduces the error of 𝖡𝖺𝗌𝖾𝗅𝗂𝗇𝖾\mathsf{Baseline} by up to four orders of magnitude, and 𝖥𝖤𝖠𝖳\mathsf{FEAT}+ outperforms 𝖥𝖤𝖠𝖳\mathsf{FEAT} by at least one order of magnitude (Figures 5 to 7).

  • 𝖥𝖤𝖠𝖳\mathsf{FEAT} and 𝖥𝖤𝖠𝖳\mathsf{FEAT}+ significantly outperform 𝖡𝖺𝗌𝖾𝗅𝗂𝗇𝖾\mathsf{Baseline} and 𝖥𝖤𝖠𝖳\mathsf{FEAT}, respectively, under various values of ρ\rho and σ\sigma (Figures 9 and 11).

  • \mathsf{FEAT} and \mathsf{FEAT}+ take more time than \mathsf{Baseline} by at least one order of magnitude (Figure 12).

Figure 4: The MSE in 2-star counting ((a) Facebook, (b) Wiki).
Figure 5: The MRE in 2-star counting ((a) Facebook, (b) Wiki).
Figure 6: The MSE in triangle counting ((a) Facebook, (b) Wiki).
Figure 7: The MRE in triangle counting ((a) Facebook, (b) Wiki).

V-A Experimental Setup

Datasets. We use two real-world graph datasets from SNAP [42]: (1) Facebook. The Facebook graph is collected from survey participants using the Facebook app and includes 4039 nodes and 88234 edges; its average degree is 21.85 (=\frac{88234}{4039}). (2) Wiki-Vote. The Wiki-Vote graph contains all the Wikipedia voting data from the inception of Wikipedia until 2008 and includes 7115 nodes and 103689 edges; its average degree is 14.57 (=\frac{103689}{7115}). Thus, the Wiki-Vote graph is sparser than the Facebook graph. As explained above, we split each graph randomly into 4 local subgraphs by controlling the sampling rate \rho and the overlapping rate \sigma.
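The paper does not spell out the splitting procedure, so the following Python sketch shows one plausible way (the function name and details are our assumptions) to split a global edge list into m overlapping local subgraphs controlled by \rho and \sigma:

```python
import random

def split_graph(edges, m, rho, sigma, seed=0):
    # Split a global edge list into m local subgraphs (our hypothetical scheme).
    # rho:   each subgraph holds about rho * |E| edges.
    # sigma: about sigma * |E| edges are shared by all subgraphs (the overlap).
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    n_shared = int(sigma * len(edges))
    shared, private = edges[:n_shared], edges[n_shared:]
    per_client = int(rho * len(edges))
    take = max(per_client - n_shared, 0)  # disjoint private edges per client
    return [shared + private[i * take:(i + 1) * take] for i in range(m)]

# Toy example: a complete graph on 50 nodes has 1225 edges.
edges = [(i, j) for i in range(50) for j in range(i + 1, 50)]
subs = split_graph(edges, m=4, rho=0.3, sigma=0.2)
print([len(s) for s in subs])  # each subgraph holds ~30% of the edges
```

Under this scheme, any two subgraphs intersect exactly in the shared slice, so \sigma directly controls the amount of overlapping edge information.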

Parameters. Several key parameters influence the overall accuracy of the \mathsf{FEAT} system. (1) Privacy budget \varepsilon. The privacy budget varies from 1 to 6, and the default is 3. Note that \mathsf{FEAT}+ involves three privacy budgets, namely, \varepsilon=\varepsilon_{1}+\varepsilon_{2}+\varepsilon_{3}. We set \varepsilon_{1}=\varepsilon_{3}=0.45\varepsilon and \varepsilon_{2}=0.1\varepsilon. (2) Sampling rate \rho. The sampling rate is the ratio of each local subgraph to the global graph. It is set from 0.1 to 0.5, and the default is 0.3. (3) Overlapping rate \sigma. The overlapping rate is the ratio of edges shared among multiple local subgraphs to the total number of edges. It varies from 0 to 0.4, and the default is 0.2.

Graph Statistics and Metrics. For graph analytic tasks, we evaluate two common graph statistics: 2-star counts and triangle counts, as in [4, 7, 9, 10, 40, 41]. For each query, we compare the results Q and Q^{\prime} from the true graph and the noisy graph, respectively. We use two common measures to assess the accuracy of our algorithms: mean squared error (MSE) [43] and mean relative error (MRE) [44]. We report average results over 10 repeated runs.
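For concreteness, the two metrics can be sketched as follows (our minimal illustration; the toy inputs are ours, not the paper's data):

```python
def mse(true_vals, noisy_vals):
    # Mean squared error over repeated runs.
    return sum((q - qp) ** 2 for q, qp in zip(true_vals, noisy_vals)) / len(true_vals)

def mre(true_vals, noisy_vals):
    # Mean relative error: |Q - Q'| / Q, averaged over runs.
    return sum(abs(q - qp) / q for q, qp in zip(true_vals, noisy_vals)) / len(true_vals)

# 10 repeated runs of a query whose true answer is 1000.
true_counts = [1000.0] * 10
noisy_counts = [1000.0 + d for d in range(-5, 5)]
print(mse(true_counts, noisy_counts))  # 8.5
print(mre(true_counts, noisy_counts))  # 0.0025
```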

Figure 8: The MSE with various \rho ((a) 2-star counting, (b) triangle counting).
Figure 9: The MRE with various \rho ((a) 2-star counting, (b) triangle counting).
Figure 10: The MSE with various \sigma ((a) 2-star counting, (b) triangle counting).
Figure 11: The MRE with various \sigma ((a) 2-star counting, (b) triangle counting).

V-B Experimental Results

Utility-Privacy Trade-off (Q1). We first answer Q1 by comparing the utility of our \mathsf{FEAT} and \mathsf{FEAT}+ with that of \mathsf{Baseline} when the privacy budget varies from 1 to 6. Figures 5 to 7 show that the utility loss of all methods decreases as \varepsilon increases. Our general framework \mathsf{FEAT} significantly outperforms \mathsf{Baseline} in all cases, and our improved framework \mathsf{FEAT}+ further improves the accuracy of \mathsf{FEAT}.

In particular, \mathsf{FEAT} outperforms \mathsf{Baseline} by at least one order of magnitude. For instance, for 2-star counting, Figure 4(b) shows that when \varepsilon=6, \mathsf{FEAT} achieves an MSE of 5.09\times 10^{5}, while \mathsf{Baseline} gives an MSE of 1.45\times 10^{11}. Similarly, for triangle counting, Figure 7(a) shows that \mathsf{FEAT} gives an MRE of only 9.53\times 10^{-4} when \varepsilon=4, whereas \mathsf{Baseline} has an MRE of 8.27\times 10^{-1}. This is because \mathsf{Baseline} uses small privacy budgets (i.e., \varepsilon/m) to perturb sensitive data, which introduces excessive noise. Instead, \mathsf{FEAT} collects noisy graphs using the full budget \varepsilon by combining PSU and DP: the same information is collected only once, and therefore individual privacy is not leaked.
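The cost of budget splitting can be made concrete with randomized response, whose flipping probability p=\frac{1}{1+e^{\varepsilon}} appears in Proposition 6: splitting \varepsilon across m clients pushes p toward 1/2 (pure noise). A quick sketch, ours and for intuition only:

```python
import math

def rr_flip_prob(eps):
    # Randomized-response flipping probability: p = 1 / (1 + e^eps)
    return 1.0 / (1.0 + math.exp(eps))

eps, m = 3.0, 4
p_full = rr_flip_prob(eps)       # perturbing once with the full budget
p_split = rr_flip_prob(eps / m)  # budget split across m clients
print(round(p_full, 4), round(p_split, 4))  # ~0.0474 vs ~0.3208
```

With the default \varepsilon=3 and m=4 clients, each bit is flipped roughly 7 times more often under the split budget, which is consistent with the large error gap observed for \mathsf{Baseline}.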

Another observation is that the error of \mathsf{FEAT}+ is smaller than that of \mathsf{FEAT} in all cases. For example, Figure 5(b) shows that when \varepsilon=6, \mathsf{FEAT}+ achieves an MRE of only 3.86\times 10^{-5}, whereas \mathsf{FEAT} has an MRE of 2.98\times 10^{-4}. Similarly, Figure 6(a) shows that when \varepsilon=3, the MSE of \mathsf{FEAT}+ is 4.68\times 10^{2}, whereas \mathsf{FEAT} has an MSE of 6.25\times 10^{4}. This is mainly because \mathsf{FEAT}+ calculates graph statistics by utilizing the true local subgraphs. In contrast, the results of \mathsf{FEAT} are computed over noisy graphs, which incurs considerable utility loss. Thus, \mathsf{FEAT} outperforms \mathsf{Baseline} significantly, and \mathsf{FEAT}+ further improves the utility of \mathsf{FEAT}.

Parameter Effects (Q2). Next, we evaluate the key parameters that may influence the overall utility of \mathsf{FEAT}, i.e., the sampling rate \rho and the overlapping rate \sigma. \rho determines the size of the local subgraphs G_{i}, i\in[m]: the size of G_{i} grows as \rho increases. \sigma determines the number of identical edges that appear in multiple local subgraphs.

Figures 8 and 9 show that in all cases, the MSE and MRE increase with \rho. This is because graph analytic tasks are executed over a noisy graph G^{\prime}, and the added error grows with the graph size. We also observe that \mathsf{FEAT} and \mathsf{FEAT}+ achieve better utility than \mathsf{Baseline} and \mathsf{FEAT}, respectively, for all \rho.

Figures 10 and 11 show that \mathsf{Baseline} performs the worst while \mathsf{FEAT}+ achieves the best utility over all values of the overlapping rate \sigma. Another interesting observation is that \sigma has little influence on the overall utility. For instance, Figure 10(a) shows that as \sigma increases, the MSE of \mathsf{Baseline} grows from 1.014\times 10^{16} to 1.0442\times 10^{16}; this slight growth comes from multiple perturbations of the same edge information. The results of \mathsf{FEAT} and \mathsf{FEAT}+ change little across \sigma, because the same information is randomized and collected only once.

Execution Time (Q3). Finally, we answer the third question by evaluating the running time over graphs of different sizes. Here, we use different sampling rates \rho to generate graphs of different scales. Figure 12 presents the running time of \mathsf{Baseline}, \mathsf{FEAT}, and \mathsf{FEAT}+ for various values of \rho. The running time increases as the graph scale grows, since a larger \rho means more edges are collected and computed. Another important observation is that the running time of \mathsf{FEAT} is approximately 10\times that of \mathsf{Baseline}, because the cryptographic computation introduces additional time overhead. \mathsf{FEAT}+ takes about 50% more time than \mathsf{FEAT}, because the additional communication between clients and the server consumes more time. Thus, utilizing cryptographic tools improves utility without leaking sensitive information, but at the cost of efficiency.

Figure 12: Running time with various \rho ((a) Facebook, (b) Wiki).

VI Related Works

Federated Analytics. The term “federated analytics” was first introduced by Google in 2020 [45], in support of federated learning: it allows Google engineers to measure the quality of federated learning models against real-world data. Bharadwaj et al. [46] introduce the notion of federated computation, a means of working with private data at large scale. Wang et al. [47] clarify what federated analytics is and its position in the literature, and present the motivation, applications, and opportunities of federated analytics. Elkordy et al. [48] give a comprehensive survey of federated analytics. Nevertheless, these works focus only on tabular data analytics, which differs fundamentally from the graph analytics in our work. Roth et al. [49] introduce Mycelium for large-scale distributed graph queries with differential privacy. Yuan et al. [50] define the notion of graph federation for subgraph matching, where graph data sources are temporarily federated. However, they assume that all clients are mutually independent, which differs from our setting. Our (informal) previous work [51] (not published in proceedings in accordance with the policy of the KDD conference; cf. https://fl4data-mining.github.io/calls/) introduces the concept of federated graph analytics for the first time. However, it suffers from privacy leakage due to the potential disclosure of intersection information between two clients.

Cross-silo Federated Learning. Several works address cross-silo federated learning [17, 18, 19, 20, 52, 53]. Huang et al. [17] propose FedAMP, which employs federated attentive message passing to facilitate collaboration between clients without infringing on their data privacy. Li et al. [53] propose a practical one-shot federated learning algorithm based on knowledge transfer. Liu et al. [18] empirically show that MR-MTL is a remarkably strong baseline under silo-specific sample-level DP. Tang et al. [20] propose an incentive mechanism for cross-silo federated learning that addresses its public-goods feature. Zheng et al. [19] propose a one-server solution based solely on homomorphic encryption and a two-server protocol based on homomorphic encryption and additive secret sharing, both designed for contribution evaluation in cross-silo federated learning. However, these protocols are designed for machine learning and cannot be used for graph analytics.

Differentially Private Graph Analytics. The standard way to calculate graph statistics privately is through differential privacy (DP) [11, 12], the gold standard in the privacy community. However, existing protocols [4, 5, 7, 9, 10, 40, 54] are designed for different scenarios and are not applicable to federated graph scenarios. Specifically, central protocols [4, 5] rely on a trusted server to collect the entire graph from local users and then release accurate graph statistics privately. Nevertheless, a central trusted server is vulnerable to practical privacy issues such as data leaks and breaches [15, 16]. Local protocols [7, 9] remove the assumption of a trusted server, and each user directly perturbs her local sensitive data. However, they cannot protect user-client membership and face edge privacy issues due to overlaps. Extended local view (ELV)-based protocols [10, 55] consider an extension of the local scenario, where each client can see not only her 1-hop path but also her 2-hop path. Like local mechanisms, ELV-based protocols fail to protect user-client membership. Additionally, the decentralized differential privacy of [55] can protect edge privacy in the ELV setting but fails to do so in federated settings, since the overlaps of ELVs differ from those of federated graph analytics. In a nutshell, to the best of our knowledge, this paper is the first to formulate federated graph analytics (FGA). Our proposed \mathsf{FEAT} is a general framework for various common graph analytics.

VII Conclusion

We made the first attempt to study federated graph analytics with privacy guarantees. We identified the unique challenges in federated graph analytics, namely, the utility issue arising from the limited local view and the privacy issue due to overlaps. To alleviate them, we proposed a general federated graph analytic framework, \mathsf{FEAT}, based on our differentially private set union protocol. Furthermore, we observed that \mathsf{FEAT} calculates graph statistics over a noisy global graph without considering the true local subgraphs, leaving room for improving the overall utility. To address this, we introduced an improved framework, \mathsf{FEAT}+, which combines a noisy global graph with the true local subgraphs. Comprehensive experiments verify that our proposed methods significantly outperform the baseline approach on various graph analytics tasks.

References

  • [1] D. Cheng, F. Yang, S. Xiang, and J. Liu, “Financial time series forecasting with multi-modality graph neural network,” Pattern Recognition, vol. 121, p. 108218, 2022.
  • [2] S. J. Dancer, “Reducing the risk of covid-19 transmission in hospitals: focus on additional infection control strategies,” Surgery (Oxford), vol. 39, no. 11, pp. 752–758, 2021.
  • [3] X. Jian, Y. Wang, and L. Chen, “Publishing graphs under node differential privacy,” IEEE TKDE, 2021.
  • [4] X. Ding, S. Sheng, H. Zhou, X. Zhang, Z. Bao, P. Zhou, and H. Jin, “Differentially private triangle counting in large graphs,” IEEE TKDE, vol. 34, no. 11, pp. 5278–5292, 2021.
  • [5] W.-Y. Day, N. Li, and M. Lyu, “Publishing graph degree distribution with node differential privacy,” in Proceedings of the 2016 International Conference on Management of Data, 2016, pp. 123–138.
  • [6] Q. Yuan, Z. Zhang, L. Du, M. Chen, P. Cheng, and M. Sun, “PrivGraph: Differentially private graph data publication by exploiting community information,” in USENIX Security Symposium, 2023.
  • [7] J. Imola, T. Murakami, and K. Chaudhuri, “Locally differentially private analysis of graph statistics.” in USENIX Security Symposium, 2021.
  • [8] Q. Ye, H. Hu, M. H. Au, X. Meng, and X. Xiao, “Lf-gdpr: A framework for estimating graph metrics with local differential privacy,” IEEE TKDE, vol. 34, no. 10, pp. 4905–4920, 2020.
  • [9] J. Imola, T. Murakami, and K. Chaudhuri, “Communication-efficient triangle counting under local differential privacy,” in USENIX Security Symposium, 2022.
  • [10] Y. Liu, S. Zhao, Y. Liu, D. Zhao, H. Chen, and C. Li, “Collecting triangle counts with edge relationship local differential privacy,” in IEEE ICDE, 2022.
  • [11] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy,” Foundations and Trends® in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.
  • [12] N. Li, M. Lyu, D. Su, and W. Yang, “Differential privacy: From theory to practice,” Synthesis Lectures on Information Security, Privacy, & Trust, vol. 8, no. 4, pp. 1–138, 2016.
  • [13] S. Zhang, W. Ni, and N. Fu, “Community preserved social graph publishing with node differential privacy,” in IEEE ICDM, 2020.
  • [14] C. Wei, S. Ji, C. Liu, W. Chen, and T. Wang, “Asgldp: collecting and generating decentralized attributed graphs with local differential privacy,” IEEE TIFS, 2020.
  • [15] N. N. Neto, S. Madnick, A. M. G. D. Paula, and N. M. Borges, “Developing a global data breach database and the challenges encountered,” JDIQ, vol. 13, no. 1, pp. 1–33, 2021.
  • [16] B. Gibson, S. Townes, D. Lewis, and S. Bhunia, “Vulnerability in massive api scraping: 2021 linkedin data breach,” in IEEE CSCI, 2021.
  • [17] Y. Huang, L. Chu, Z. Zhou, L. Wang, J. Liu, J. Pei, and Y. Zhang, “Personalized cross-silo federated learning on non-iid data,” in Proc. AAAI, vol. 35, no. 9, 2021, pp. 7865–7873.
  • [18] K. Liu, S. Hu, S. Z. Wu, and V. Smith, “On privacy and personalization in cross-silo federated learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 5925–5940, 2022.
  • [19] S. Zheng, Y. Cao, and M. Yoshikawa, “Secure shapley value for cross-silo federated learning,” Proc. VLDB, 2023.
  • [20] M. Tang and V. W. Wong, “An incentive mechanism for cross-silo federated learning: A public goods perspective,” in IEEE INFOCOM, 2021, pp. 1–10.
  • [21] C. Zhang, Y. Chen, W. Liu, M. Zhang, and D. Lin, “Linear private set union from Multi-Query reverse private membership test,” in USENIX Security Symposium, 2023, pp. 337–354.
  • [22] Y. Jia, S.-F. Sun, H.-S. Zhou, J. Du, and D. Gu, “Shuffle-based private set union: Faster and more secure,” in 31st USENIX Security Symposium (USENIX Security 22), 2022, pp. 2947–2964.
  • [23] W. Wang, S. Li, J. Dou, and R. Du, “Privacy-preserving mixed set operations,” Information Sciences, vol. 525, pp. 67–81, 2020.
  • [24] C. Dong and G. Loukides, “Approximating private set union/intersection cardinality with logarithmic complexity,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 11, pp. 2792–2806, 2017.
  • [25] C. Zhang, S. Li, J. Xia, W. Wang, F. Yan, and Y. Liu, “BatchCrypt: Efficient homomorphic encryption for Cross-Silo federated learning,” in USENIX ATC, 2020, pp. 493–506.
  • [26] Y. Liu, Y. Kang, C. Xing, T. Chen, and Q. Yang, “A secure federated transfer learning framework,” IEEE Intelligent Systems, vol. 35, no. 4, pp. 70–82, 2020.
  • [27] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Springer TCC, 2006, pp. 265–284.
  • [28] Z. Qin, T. Yu, Y. Yang, I. Khalil, X. Xiao, and K. Ren, “Generating synthetic decentralized social graphs with local differential privacy,” in Proc. CCS, 2017, pp. 425–438.
  • [29] W. Diffie and M. E. Hellman, “New directions in cryptography,” in Democratizing Cryptography: The Work of Whitfield Diffie and Martin Hellman, 2022, pp. 365–390.
  • [30] E. B. Barker, W. C. Barker, W. E. Burr, W. T. Polk, and M. E. Smid, “Sp 800-57. recommendation for key management, part 1: General (revised),” 2007.
  • [31] D. Hankerson and A. Menezes, “Elliptic curve cryptography,” in Encyclopedia of Cryptography, Security and Privacy.   Springer, 2021, pp. 1–2.
  • [32] Libsodium. [Online]. Available: https://doc.libsodium.org
  • [33] OpenSSL. [Online]. Available: https://www.openssl.org/
  • [34] MCL. [Online]. Available: https://github.com/herumi/mcl
  • [35] S. Gopi, P. Gulhane, J. Kulkarni, J. H. Shen, M. Shokouhi, and S. Yekhanin, “Differentially private set union,” in International Conference on Machine Learning.   PMLR, 2020, pp. 3627–3636.
  • [36] K. Kim, S. Gopi, J. Kulkarni, and S. Yekhanin, “Differentially private n-gram extraction,” Advances in Neural Information Processing Systems, vol. 34, pp. 5102–5111, 2021.
  • [37] R. S. Carvalho, K. Wang, and L. S. Gondara, “Incorporating item frequency for differentially private set union,” in Proc. AAAI, vol. 36, no. 9, 2022, pp. 9504–9511.
  • [38] X. Qiao, Z. Youwen, and X. L. Jian WANG, “Locally differentially private distributed algorithms for set intersection and union,” Information Sciences, vol. 64, no. 219101, pp. 1–219 101, 2021.
  • [39] S. L. Warner, “Randomized response: A survey technique for eliminating evasive answer bias,” Journal of the American Statistical Association, vol. 60, no. 309, pp. 63–69, 1965.
  • [40] J. Imola, T. Murakami, and K. Chaudhuri, “Differentially private triangle and 4-cycle counting in the shuffle model,” in ACM CCS, 2022.
  • [41] S. Liu, Y. Cao, T. Murakami, J. Liu, and M. Yoshikawa, “Cargo: Crypto-assisted differentially private triangle counting without trusted servers,” arXiv preprint arXiv:2312.12938, 2023.
  • [42] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network dataset collection,” http://snap.stanford.edu/data, Jun. 2014.
  • [43] K. Das, J. Jiang, and J. Rao, “Mean squared error of empirical predictor,” 2004.
  • [44] T. Foss, I. Myrtveit, and E. Stensrud, “Mre and heteroscedasticity: An empirical validation of the assumption of homoscedasticity of the magnitude of relative error,” in Proc. ESCOM, 2001, pp. 157–164.
  • [45] D. Ramage and S. Mazzocchi, “Federated analytics: Collaborative data science without data collection,” https://ai.googleblog.com/2020/05/federated-analytics-collaborative-data.html, 2020.
  • [46] A. Bharadwaj and G. Cormode, “An introduction to federated computation,” in Proceedings of the 2022 International Conference on Management of Data, 2022, pp. 2448–2451.
  • [47] D. Wang, S. Shi, Y. Zhu, and Z. Han, “Federated analytics: Opportunities and challenges,” IEEE Network, vol. 36, no. 1, pp. 151–158, 2021.
  • [48] A. R. Elkordy, Y. H. Ezzeldin, S. Han, S. Sharma, C. He, S. Mehrotra, S. Avestimehr et al., “Federated analytics: A survey,” APSIPA Transactions on Signal and Information Processing, vol. 12, no. 1, 2023.
  • [49] E. Roth, K. Newatia, Y. Ma, K. Zhong, S. Angel, and A. Haeberlen, “Mycelium: Large-scale distributed graph queries with differential privacy,” in Proc. SOSP, 2021, pp. 327–343.
  • [50] Y. Yuan, D. Ma, Z. Wen, Z. Zhang, and G. Wang, “Subgraph matching over graph federation,” Proceedings of the VLDB Endowment, vol. 15, no. 3, pp. 437–450, 2021.
  • [51] S. Liu, Y. Cao, and M. Yoshikawa, “Federated graph analytics with differential privacy,” in International Workshop on Federated Learning for Distributed Data Mining, 2023 (informal paper not published in proceedings; cf. https://fl4data-mining.github.io/calls/).
  • [52] Y. Wang, Y. Tong, D. Shi, and K. Xu, “An efficient approach for cross-silo federated learning to rank,” in 2021 IEEE 37th International Conference on Data Engineering (ICDE).   IEEE, 2021, pp. 1128–1139.
  • [53] Q. Li, B. He, and D. Song, “Practical one-shot federated learning for cross-silo setting.”
  • [54] S. Liu, Y. Cao, T. Murakami, and M. Yoshikawa, “A crypto-assisted approach for publishing graph statistics with node local differential privacy,” in IEEE Big Data, 2022, pp. 5765–5774.
  • [55] H. Sun, X. Xiao, I. Khalil, Y. Yang, Z. Qin, H. Wang, and T. Yu, “Analyzing subgraph statistics from extended local views with decentralized differential privacy,” in Proc. CCS, 2019, pp. 703–717.