¹¹institutetext: Núcleo de Investigación y Desarrollo Tecnológico (NIDTEC) ²²institutetext: Facultad Politécnica - Universidad Nacional de Asunción, San Lorenzo C.P. 2169, Paraguay.

A Distributed Algorithm for Spectral Sparsification of Graphs with Applications to Data Clustering

Fabricio Mendoza-Granada and Marcos Villagra

Abstract

Spectral sparsification is a technique that is used to reduce the number of non-zero entries in a positive semidefinite matrix with little changes to its spectrum. In particular, the main application of spectral sparsification is to construct sparse graphs whose spectra are close to a given dense graph. We study spectral sparsification under the assumption that the edges of a graph are allocated among sites which can communicate among each other. In this work we show that if a graph is allocated among several sites, the union of the spectral sparsifiers of each induced subgraph give us an spectral sparsifier of the original graph. In contrast to other works in the literature, we present precise computations of the approximation factor of the union of spectral sparsifiers and give an explicit calculation of the edge weights. Then we present an application of this result to data clustering in the Number-On-Forehead model of multiparty communication complexity when input data is allocated as a sunflower among sites in the party.

1 Introduction

Spectral sparsification is a technique introduced by Spielman and Teng spielman:2011 that is used to approximate a graph $G$ by a sparse graph $H$ . The notion of approximation used by spectral sparsification is that the spectra of both $H$ and $G$ must be close up to a constant factor. Batson, Spielman and Srivastava batson:12 proved that every graph $G$ has an spectral sparsifier with a number of edges linear in the number of vertices of $G$ and provided an algorithm achieving such bound. There are several algorithms in the literature that construct spectral sparsifiers of graphs with a trade-off between running time and number of edges of $H$ . To the best of our knowledge, Lee and Sun lee:2018 has the best probabilistic algorithm for spectral sparsification with a running time that is almost linear and constructs spectral sparsifiers with $O(qn/\epsilon^{2})$ edges, where $n$ is the number of vertices of $G$ , $\epsilon$ is an approximation factor and $q\geq 10$ is a constant.

There are situations where algorithms need to work with data that is not centralized and allocated in different sites. One way to deal with decentralized data is to design communication protocols so that the sites can communicate among them. The efficiency of a communication protocol can be measured by the number of bits shared among the sites and such a measure is known as the communication complexity of the protocol nisan:93 . When data comes in the form of a graph, the edges greatly affects communication complexity, and hence, computing spectral sparsifiers of graphs in distributed systems is of great importance.

In this work we present a distributed algorithm for spectral sparsification of graphs in the communication complexity model. In this model, we are only interested in the communication costs among sites and we assume that each site has arbitrary computational power. The idea behind this protocol is that, given an input graph $G$ , spectral sparsifiers of induced subgraphs of $G$ can be computed in each site first, and then any given site computes the union of such graphs which results in a spectral sparsifier of $G$ . Even though other works have used the idea of taking the union of spectral sparsifiers like Chen et al. chen:16 , they have not shown a precise calculation of the approximation factor. The main contribution of this work, presented in Theorem 3.1, is an estimation of the approximation factor and an explicit calculation of the edge weights in the union of spectral sparsifiers. In order to compute the approximation factor we introduce an idea that we call “overlapping cardinality partition,” which is a way to partition the edge set of a graph with respect to the number of times each edge is allocated among sites. Overlapping cardinality partition is a technical tool that allows us to express the Laplacian matrix of the union of induced subgraphs of $G$ as a linear combination of the Laplacian matrices of graphs induced from the partition.

In a second part of this paper, we present in Section 4 an application of Theorem 3.1 in distributed data clustering in the Number-On-Forehead model of communication complexity. In particular, if we assume the existence of a sunflower structure erdos:61 ; deza:74 ; kostochka:00 on the input data, we show how a communication protocol can detect the presence of the sunflower and take advantage of its kernel to reduce the communication costs.

The rest of this paper is organized as follows. In Section 2 we present the main definitions and notation used throughout this work. In Section 3 we present the main result of this work, and in Section 4 we present our application to data clustering.

2 Preliminaries and Notation

In this section we will introduce some definitions and notations that will be used throughout this paper.

2.1 Spectral Graph Theory

Let $G=(V,E,w)$ be an undirected and weighted graph with $n$ vertices and $m$ edges. Let $\{E_{i}\}_{i\geq 1}$ be family of subsets of $E$ . We denote by $G_{i}=(V,E_{i},w_{i})$ the subgraph induced by $E_{i}$ , where $w_{i}:E_{i}\rightarrow\mathbb{R}^{+}$ is defined as $w_{i}(e)=w(e)$ for all $e\in E_{i}$ and 0 otherwise. Every graph $G$ has an associated matrix called its Laplacian matrix, or simply Laplacian, which is define as

\displaystyle L_{G}=D_{G}-W_{G},

where $W_{G}$ is the weighted adjacency matrix and $D_{G}$ is the weighted degree matrix. We will omit the subindex $G$ from $L_{G},W_{G}$ and $D_{G}$ when it is clear from the context.

The normalized Laplacian is defined as $\mathcal{L}=D^{-1/2}LD^{-1/2}$ . The Laplacian matrix (and normalized Laplacian) is positive semidefinite (PSD) with its first eigenvalue $\lambda_{1}$ always equals zero with multiplicity equal to the number of connected components of $G$ luxburg:07 . Indeed, if there exists a multicut of size $k$ in $G$ then the $k$ -th smallest eigenvalue $\lambda_{k}$ of $L$ gives useful information to find a multicut.

One of the fastest methods to approximate an optimal multicut in a graph is the so-called spectral clustering algorithm. This technique uses $k$ eigenvectors of $L$ or $\mathcal{L}$ associated to the first $k$ smallest eigenvalues in order to construct a matrix $X$ with the eigenvectors as columns, and then, it applies a simpler clustering algorithm (like k-means) to the rows of $X$ luxburg:07 . Lee, Gharan, and Trevisan lee2:2014 proved that $\lambda_{k}$ approximates the optimal value of a multicut of size $k$ in $G$ and the eingevectors give the corresponding partition over $V$ .

2.2 Spectral Sparsification

Spectral sparsification is a technique used to reduce the density of a given PSD matrix changing its spectra only by a constant factor of approximation. Given a matrix $M$ , spectral sparsification constructs another matrix which is “similar” to $M$ in some well-defined way. We will use a notion of similarity defined in spielman:2011 . A subgraph $H$ of $G$ is called an $\epsilon$ -spectral sparsifier of $G$ if for any $x\in\mathbb{R}^{n}$ we have that

\displaystyle(1-\epsilon)x^{T}L_{G}x\leq x^{T}L_{H}x\leq(1+\epsilon)x^{T}L_{G}x.

The importance of a spectral sparsifier lies on the sparseness of $L_{H}$ , for example, some computations are easier over an sparse matrix. There are deterministic and probabilistic algorithms to find spectral sparsifiers of a given graph. The algorithm of Batson, Spielman and Srivastava batson:12 is currently the best deterministic algorithm. The algorithm of batson:12 constructs a graph with $O(\frac{qn}{\epsilon^{2}})$ edges in $O(\frac{qmn^{5/q}}{\epsilon^{4+4/q}})$ time, where $\epsilon$ is the approximation factor and $q\geq 10$ is a constant.

3 A Distributed Algorithm for Spectral Sparsification

In this section we present our main result. In particular, given a graph $G$ and a family of induced subgraphs of $G$ , we show that the union of spectral sparsifiers of the induced subgraphs is a spectral sparsifier of $G$ . In contrast to other work, however, we give explicit bounds on the approximation factor and a construction of the new weight function.

First we introduce some definitions which will help us understand the overlapping of data among the sites. We denote by $[n]$ the set $\{1,2,\dots,n\}$ .

Definition 1 (Occurrence Number)

Let ${\mathcal{E}}=\{E_{1},\dots,E_{t}\}$ be a family of subsets of $[n]$ . For any $a\in[n]$ , the occurrence number of $a$ in $\mathcal{E}$ , denoted $\#(a)$ , is the maximum number of sets from $\mathcal{E}$ in which $a$ appears.

Example 1

Let $n=7$ and ${\mathcal{E}}=\{\{1,2,3\},\{2,3,4\},\{4,5,1\},\{3,2,6\},\{4,7,1\}$ , $\{2,3\}$ , $\{5,6,7\},\{1,3,5\},\{2,4\}\}$ . Here we have that $\#(1)=4$ , $\#(2)=5$ , $\#(3)=5$ , and so on.∎

Definition 2 (Overlapping Cardinality)

Let $\mathcal{E}=\{E_{1},\dots,E_{t}\}$ be a family of subsets of $[n]$ for some fixed $n$ and $E=\bigcup_{i=1}^{t}E_{i}$ . The overlapping cardinality of a subset $E^{\prime}\subseteq E$ in $\mathcal{E}$ is a positive integer $k$ such that for each $a\in E^{\prime}$ its ocurrence number $\#(a)=k$ ; otherwise the overlapping cardinality of $E^{\prime}$ in $\mathcal{E}$ is $0$ .

The overlapping cardinality identifies the maximum number of times the elements of a subset appears in a family of subsets.

Example 2

Let $n=7$ and $\mathcal{E}$ be as in Example 1. Here we have that $E=\bigcup_{i=1}^{t}E_{i}=[n]$ . Now consider the sets $\{1,4\}$ and $\{1,2,3\}$ .

•

The overlapping cardinality of $\{1,4\}$ in $\mathcal{E}$ is 4, because $\#(1)=\#(4)=4$ .
•

The overlapping cardinality of $\{1,2,3\}$ in $\mathcal{E}$ is $0$ because the occurrence number of one of the elements of the set is different from the others, namely, $\#(1)=4$ , $\#(2)=5$ , and $\#(3)=5$ .∎

Now we use the idea of overlapping cardinality to construct a partition on the set $\mathcal{E}$ of subsets of $[n]$ .

Definition 3 (Overlapping Cardinality Partition)

Given a family ${\mathcal{E}}$ as in Definition 2, an overlapping cardinality partition over $E$ on $\mathcal{E}$ is a partition $\{E_{1}^{\prime},\dots,E_{k}^{\prime}\}$ of $E$ where each $E_{i}^{\prime}$ has overlapping cardinality $c_{i}$ on $\mathcal{E}$ . We call the sequence $(c_{1},c_{2},\dots,c_{k})$ , with $1\leq c_{1}<c_{2}<\cdots<c_{k}$ , the overlapping cardinalities over the family $\mathcal{E}$ .

Example 3

Take $\mathcal{E}$ from examples 1 and 2. An overlapping cardinality partition is

\displaystyle\{\{6,7\},\{5\},\{1,4\},\{2,3\}\}.

Here, $\{6,7\}$ has overlapping cardinality equal to $2$ because $\#(6)=\#(7)=2$ . The subset $\{5\}$ has overlapping cardinality equal to $3$ because $\#(5)=3$ . In Example 2 we saw that the subset $\{1,4\}$ has overlapping cardinality $4$ . Finally, the subset $\{2,3\}$ has overlapping cardinality equal to $5$ because $\#(2)=\#(3)=5$ .∎

Our main technical lemma shows that the Laplacian of an input graph can be rewritten as a linear combination of Laplacians corresponding to induced subgraphs constructed from an overlapping cardinality partition of the set of edges.

For the rest of this section we make the following assumptions. Let $G=(V,E,w)$ be an undirected and weighted graph with a function $w:E\rightarrow\mathbb{R}^{+}$ , let $\mathcal{E}=\{E_{1},\dots,E_{t}\}$ be a collection of subsets of $E$ such that $\bigcup_{i=1}^{t}E_{i}=E$ where $E_{i}\neq\emptyset$ and $G_{i}=(V,E_{i},w_{i})$ is an induced subgraph of $G$ where $w_{i}:E_{i}\rightarrow\mathbb{R}^{+}$ and $w_{i}(e)=w(e)$ for all $e\in E_{i}$ and 0 otherwise.

Lemma 1

If $1\leq c_{1}<c_{2}<\dots<c_{k}$ are the overlapping cardinalities over the family $\mathcal{E}$ with an overlapping cardinality partition $\{E_{c_{j}}^{\prime}\}_{j\leq k}$ , then $\sum_{i=1}^{t}L_{G_{i}}=\sum_{j=1}^{k}c_{j}L_{G_{c_{j}}^{\prime}}$ where $L_{G_{c_{j}}^{\prime}}$ is the Laplacian of $G_{c_{j}}^{\prime}=(V,E_{c_{j}}^{\prime},w_{c_{j}}^{\prime})$ .

Proof

First notice that, for all $e=xy\in E_{c_{j}}^{\prime}$ there exists a subfamily of $\mathcal{E}$ with cardinality equal to $c_{j}$ such that $e$ belongs to every member of it and its associated subgraph. Take any $xy\in E_{c_{j}}^{\prime}$ for some $j\in\{1,\dots,k\}$ . There exists $c_{j}$ induced subgraphs $G_{i_{1}},\dots,G_{i_{c_{j}}}$ of $G$ that has $xy$ as an edge, and all other induced subgraphs $G_{k_{1}},\dots,G_{k_{\ell}}$ do not have $xy$ as and edge, where $c_{j}+\ell=t$ . This means that

\sum_{i=1}^{t}L_{G_{i}}(x,y)=c_{j}\cdot L_{G_{c_{j}}^{\prime}}(x,y)=-c_{j}\cdot w(x,y).

(1)

Now, let $d_{G}(x)$ denote the degree of $x$ in $G$ . We know that $d_{G}(x)=\sum_{y}w(x,y)$ where $xy\in E$ . Since $\{E_{c_{j}}^{\prime}\}_{j\leq k}$ is a partition of $E$ , we can rewrite the degree of $x$ as

\displaystyle d_{G}(x)=\sum_{xy_{c_{1}}\in E_{c_{1}}^{\prime}}w(x,y_{c_{1}})+\dots+\sum_{xy_{c_{k}}\in E_{c_{k}}^{\prime}}w(x,y_{c_{k}}).

Then, the degree of $x$ in the graph $G_{c_{j}}^{\prime}$ is

\displaystyle L_{G_{c_{j}}^{\prime}}(x,x)=\sum_{xy_{c_{j}}\in E_{c_{j}}^{\prime}}w(x,y_{c_{j}})=d_{G_{c_{j}}^{\prime}}(x).

If we take an edge $xy_{c_{j}}\in E_{c_{j}}^{\prime}$ , where $x$ is fixed, we know that $xy_{c_{j}}$ appears only in the induced subgraphs $G_{i_{1}},\dots,G_{i_{c_{j}}}$ , and hence, we obtain

\sum_{i=1}^{t}\left(\sum_{xy_{c_{j}}\in E_{c_{j}}^{\prime}}w_{i}(x,y_{c_{j}})\right)=c_{j}\cdot d_{G_{c_{j}}^{\prime}}(x).

(2)

If we take another edge $uv\in E_{c_{m}}^{\prime}$ , with $m\neq j$ , note that $uv$ does not belong to any of the graphs $G_{i_{1}},\dots,G_{i_{c_{j}}}$ and each Laplacian matrix $L_{G_{i_{1}}},\dots,L_{G_{i_{c_{j}}}}$ has 0 in its $(u,v)$ -entry. Therefore, adding $uv$ to Eq.(1) we have that

\sum_{i=1}^{t}\left(L_{G_{i}}(x,y)+L_{G_{i}}(u,v)\right)=c_{j}\cdot L_{G_{c_{j}}^{\prime}}(x,y)+c_{m}\cdot L_{G_{c_{m}}^{\prime}}(u,v).

Extending this argument to all equivalent classes in $\{E_{c_{j}}^{\prime}\}_{j\leq k}$ , for each non-diagonal entry $(x,y)$ , with $xy\in E$ , it holds

\sum_{i=1}^{t}L_{G_{i}}(x,y)=\sum_{j=1}^{k}c_{j}\cdot L_{G_{c_{j}}^{\prime}}(x,y).

(3)

A similar argument can be made for the diagonal entries with Eq.(2), thus obtaining

\sum_{i=1}^{t}\left(\sum_{xy_{c_{1}}\in E_{c_{1}}^{\prime}}w_{i}(x,y_{c_{1}})+\cdots+\sum_{xy_{c_{k}}\in E_{c_{k}}^{\prime}}w_{i}(x,y_{c_{k}})\right)=\sum_{i=1}^{t}L_{G_{i}}(x,x)=\sum_{j=1}^{k}c_{j}\cdot L_{G_{c_{j}}^{\prime}}(x,x).

(4)

Equations (3) and (4) imply the lemma.∎

Now we will use Lemma 1 to show that the spectral sparsifier of $\sum_{j=1}^{k}c_{j}L_{G_{c_{j}}^{\prime}}$ is an spectral sparsifier of the Laplacian $L_{G}$ of an input graph $G$ .

Theorem 3.1

Let $(1=c_{1}<c_{2}<\dots<c_{k})$ be the overlapping cardinalities over the family $\mathcal{E}$ with $\{E_{c_{j}}^{\prime}\}_{j\leq k}$ its associated overlapping cardinality partition and $L_{G_{1}},\dots,L_{G_{t}}$ the Laplacians of $G_{1},\dots,G_{t}$ . If $H_{i}=(V,D_{i},h_{i})$ is an $\epsilon$ -spectral sparsifier of $G_{i}$ , then $H=(V,\bigcup_{i=1}^{t}D_{i},h)$ is an $\epsilon^{\prime}$ -spectral sparsifier of $G$ where $h(e)=\frac{\sum_{i=1}^{t}h_{i}(e)}{c_{1}c_{k}}$ and $\epsilon^{\prime}\geq 1-\frac{1-\epsilon}{c_{k}}$ .

Proof

Let $L_{H_{i}}$ be the Laplacian of $H_{i}$ . By hypothesis we have that for every $i\in[t]$ and $x\in\mathbb{R}^{V}$

\displaystyle(1-\epsilon)x^{T}L_{G_{i}}x\leq x^{T}L_{H_{i}}x\leq(1+\epsilon)x^{T}L_{G_{i}}x.

Then we may take the summation over all $i\in[t]$ to get

\displaystyle(1-\epsilon)\sum_{i=1}^{t}x^{T}L_{G_{i}}x\leq\sum_{i=1}^{t}x^{T}L_{H_{i}}x\leq(1+\epsilon)\sum_{i=1}^{t}x^{T}L_{G_{i}}x.

(5)

Now, lets consider the left hand side of the Equation (5). Using Lemma 1 we get

$\displaystyle(1-\epsilon)\sum_{i=1}^{t}x^{T}L_{G_{i}}x$	$\displaystyle=(1-\epsilon)\sum_{i=1}^{k}c_{i}\cdot x^{T}L_{G_{c_{i}}^{\prime}}x$
	$\displaystyle\geq(1-\epsilon)c_{1}\sum_{i=1}^{k}x^{T}L_{G_{c_{i}}^{\prime}}x$
	$\displaystyle=(1-\epsilon)c_{1}x^{T}L_{G}x,$	(6)

where the last equality follows from the fact that $\{E_{c_{j}}^{\prime}\}_{j\leq k}$ is a partition of $E$ . Similarly for the right hand side of Equation (5) we have that

\displaystyle(1+\epsilon)\sum_{i=1}^{t}x^{T}L_{G_{i}}x\leq(1+\epsilon)c_{k}x^{T}L_{G}x.

(7)

Therefore, by multiplying equations (6) and (7) by $\frac{1}{c_{1}c_{k}}$ we obtain

\displaystyle(1-\epsilon)\frac{x^{T}L_{G}x}{c_{k}}\leq x^{T}L_{H}x\leq(1+\epsilon)\frac{x^{T}L_{G}x}{c_{1}},

where $x^{T}L_{H}x=(\sum_{i=1}^{t}x^{T}L_{H_{i}}x)/(c_{1}c_{k})$ .

To finish the proof, note that we want $1-\epsilon^{\prime}\leq(1-\epsilon)/c_{k}$ and $(1+\epsilon)/c_{1}\leq 1+\epsilon^{\prime}$ with $\epsilon\leq\epsilon^{\prime}<1$ . In order to solve this, we choose an $\epsilon^{\prime}\geq 1-\frac{1-\epsilon}{c_{k}}$ . First notice that $1-\epsilon^{\prime}\leq 1-1+\frac{1-\epsilon}{c_{k}}=\frac{1-\epsilon}{c_{k}}$ . Then we have that $\frac{1+\epsilon}{c_{1}}\leq\frac{1+\epsilon^{\prime}}{c_{1}}=1+\epsilon^{\prime}$ . ∎

From Theorem 3.1, a distributed algorithm for computing spectral sparsifiers is natural. Just let every site compute a spectral sparsifier of its own input and then each site sends its result to a coordinator that will construct the union of all spectral sparsifiers.

4 Data Clustering in the Number-On-Forehead Model

In this section we will show an application of Theorem 3.1 to distributed data clustering in the Number-On-Forehead model of communication complexity for the case when the input data is allocated as a sunflower among sites.

Clustering is an unsupervised machine learning task that involves finding a partition over a given set of points $x_{1},\dots,x_{n}\in\mathbb{R}^{d}$ . Such a partition must fulfill two conditions, (i) every two points in the same set must be “similar” in some way and (ii) every two points on different sets must be far from being similar. Each equivalence class from the partition is also called a cluster. Clustering can be accomplished by different kinds of techniques, where spectral clustering luxburg:07 is one of the fastest methods.

It is easy to see clustering as a graph problem, where each point corresponds to a vertex in a complete graph and the cost of each edge is interpreted as a similarity between points. Thus, finding a set of optimal clusters in data is equivalent to finding an optimal multicut in a graph. Since the optimal multicut depends on the spectrum of the graph’s Laplacian lee2:2014 and we want to keep the communication costs low, each site must be capable of constructing sparse induced subgraphs of its own data while preserving the spectrum of its graph Laplacian.

In our communication protocol, each site is assigned an induced subgraph of $G$ , and we want each site to be aware of all clusters in the data. Consequently, each site must be capable of running a clustering algorithm on its own data, communicate its results to the other sites, and then use the exchanged messages to construct an approximation to the original graph $G$ . This is where the distributed spectral sparsification algorithm is relevant.

First, we will construct a protocol to verify if the input data in every site is a sunflower. If the input is indeed allocated in a sunflower structure, then a party can take advantage of the sunflower to find an approximation of clusters in the data.

4.1 Models of Communication and their Complexity Measure

We will introduce some standard notations from communication complexity—we refer the interested reader to the textbook by Kushilevitz and Nisan kushilevitz:97 for more details. Let $P_{1},P_{2},\dots,P_{s}$ be a set of sites where a site $P_{j}$ has an input $x_{j}\in\{0,1\}^{r}$ , with $r$ a positive integer. In a multiparty communication protocol, with $s\geq 3$ , the sites want to jointly compute a function $f:\{0,1\}^{r}\times\dots\times\{0,1\}^{r}\rightarrow Z$ for some finite codomain $Z$ . In the Number-On-Forehead model of communication, or NOF model, each site only has access to the other sites’s input but not its own, that is, a site $P_{j}$ has access to $(x_{1},...,x_{j-1},x_{j+1},...,x_{s})$ . In order to compute $f$ the sites must communicate, and they do so by writing bits on a blackboard which can be accessed by all sites in the party. This is the so-called blackboard model of communication.

The maximum number of bits exchanged in the protocol over the worst-case input is the cost of the protocol. The deterministic communication complexity of the function $f$ is the minimum cost over all protocols which compute $f$ .

Let $G=(V,E)$ be an input graph and $\{E_{j}\}_{j\leq s}$ be a family of subsets of $E$ . In order to study communication protocols for graph problems we assume that $E_{j}$ is the input data to site $P_{j}$ . In the NOF model, we let $F_{j}=\{E_{1},E_{2},...,E_{j-1},E_{j+1},...,E_{s}\}$ be the set of edges which $P_{j}$ can access. Given a site $P_{j}$ , the symmetric difference on $P_{j}$ , denoted $\Delta_{j}$ , is defined as the symmetric difference among all sets $P_{j}$ has access to, that is, $\Delta_{j}$ is the symmetric difference between each set in $F_{j}$ .

For the rest of this paper, we use as a shorthand $\mathcal{E}$ for the set $\{E_{1},\dots,E_{s}\}$ of subsets of the set of edges $E$ of an input graph $G=(V,E)$ with $\bigcup_{i=1}^{s}E_{i}=E$ , and $\mathcal{F}$ for the set $\{F_{1},\dots,F_{s}\}$ where $F_{j}=\{E_{1},\dots,E_{j-1},E_{j+1},\dots,E_{s}\}$ . Here $\mathcal{F}$ captures the idea of the NOF model where every site have access to the other’s sites input but not its own.

4.2 Sunflowers and NOF Communication

A sunflower or $\Delta$ -System is a family of sets $\mathcal{A}=\{A_{1},...,A_{t}\}$ where $(A_{i}\cap A_{j})=\bigcap_{k=1}^{t}A_{k}=K$ for all $i\neq j$ . We call $K$ the kernel of $\mathcal{A}$ . The family $\mathcal{A}$ is a weak $\Delta$ -System if $|A_{i}\cap A_{j}|=\lambda$ for all $i\neq j$ for some constant $\lambda$ kostochka:00 . It is known that if $\mathcal{A}$ is a weak $\Delta$ -System and $|\mathcal{A}|\geq\ell^{2}-\ell+2$ , where $\ell=\max_{i=1}^{t}\{A_{i}\}$ , then $\mathcal{A}$ is a $\Delta$ -System deza:74 .

We start with a simple fact that ensures the existence of $\Delta$ -Systems with the same kernel in the NOF model if input data in a communication protocol is allocated as a sunflower among sites.

Lemma 2

If $s=|\mathcal{E}|\geq 3$ and $\mathcal{E}$ is a $\Delta$ -System with kernel $K$ , then any $F_{i}$ is a $\Delta$ -System with kernel $K$ .

The following lemma states a sufficient condition for the existence of a $\Delta$ -System in the input data in the NOF model with the requirement, however, that we need at least four or more sites

Lemma 3

Let $s=|\mathcal{E}|\geq 4$ . If, for all $i\in[s]$ , we have that $F_{i}$ is a $\Delta$ -System, then $\mathcal{E}$ is a $\Delta$ -System.

Proof

Suppose that $\mathcal{E}$ is no a $\Delta$ -System, and we want to prove that for some $1\leq i\leq s$ , $F_{i}$ is not a $\Delta$ -System.

With no loss of generality, suppose that there exists exactly two sets $E_{i}$ and $E_{j}$ that certify that $\mathcal{E}$ is not a $\Delta$ -System; that is, there exists $E_{i}$ and $E_{j}$ such that $E_{i}\cap E_{j}=K^{\prime}$ , and, for any $a\neq i$ and $b\neq j$ , it holds that $E_{a}\cap E_{j}=E_{b}\cap E_{i}=K$ , with $K\neq K^{\prime}$ . Now take any $F_{c}$ , with $c$ different from $i$ and $j$ . Then $F_{c}$ cannot be a $\Delta$ -System because $E_{i}$ and $E_{j}$ belong to $F_{c}$ and there is at least another set in $F_{c}$ because $|\mathcal{E}|\geq 4$ .∎

Lemma 3 implies that we only need to know if all sites in a communication protocol have access to a $\Delta$ -System to ensure that an entire family of input sets is a $\Delta$ -System, provided there are at least 4 sites.

Proposition 1

There exists a protocol that verifies if $\mathcal{E}$ , with $|\mathcal{E}|\geq 4$ , is a $\Delta$ -System or not with $s-1$ bits of communication exchanged.

With Proposition 1, a multiparty communication protocol with a number of sites $s\geq 4$ can check for the existence of a sunflower structure in its input data. Furthermore, if input data is allocated among sites as a sunflower, then, by Lemma 2, any site immediately knows the kernel of the sunflower.

4.3 Data Clustering with Sunflowers

In this section, we present a NOF communication protocol that exploits the sunflower structure in input data. First, we start by defining an overlapping coefficient of the edges of $G$ which can be seen as a measure of how well spread out are the edges among sites.

Definition 4

The overlapping coefficient on site $P_{j}$ is defined as $\delta(j)=\frac{|\bigcap_{i\neq j}E_{i}|}{|\bigcup_{i\neq j}E_{i}|}$ and the greatest overlapping coefficient is defined as $\delta=\max_{j\in[s]}\delta(j).$

The following proposition presents a simple protocol that makes every site aware of the entire input graph.

Proposition 2

Let $P_{j}$ be a site and let ${\mathcal{E}}$ be a weak $\Delta$ -System with each $|E_{k}|=\ell$ for $k=1,2,\dots,s$ , with a kernel of size $\lambda$ . Suppose that $s\geq\ell^{2}-\ell+3$ . If site $P_{j}$ sends all the edges in $\Delta_{j}$ , then every other site will know the entire graph $G$ . The number of edges this communication protocol sends is at most $|\bigcup_{i\neq j}E_{i}|(1-\delta)+\ell$ .

Proof

We will prove this proposition by showing how each site constructs the graph $G$ . First, a given site $P_{j}$ computes $\Delta_{j}$ and writes it on the blackboard. Since $s\geq\ell^{2}-\ell+3$ , by the result of Deza deza:74 , we known that ${\mathcal{E}}$ is a sunflower with kernel $K$ and by Lemma 2 this kernel is the same in all sites. At this point all sites $i\neq j$ know $\Delta_{j}$ , therefore, they can construct $G$ by its own using the kernel $K$ of $\mathcal{E}$ . In one more round, one of the sites $i\neq j$ writes $E_{j}$ so that site $P_{j}$ can also construct $G$ .

In order to compute the communication cost of the protocol, first notice that $\delta={\lambda}/({|\bigcup_{i\neq j}E_{i}|})={\lambda}/({|\Delta_{j}|+\lambda})$ ,where we used the fact that the union of all edges in every site equals the union of the symmetric difference and the kernel $K$ . Then we have that $\delta|\Delta_{j}|=\lambda-\delta\lambda$ , which implies $|\Delta_{j}|=\frac{\lambda-\delta\lambda}{\delta}=|\bigcup_{i\neq j}E_{i}||(1-\delta)$ , where the last equality follows from the fact that $|\bigcup_{i\neq j}E_{i}|=\lambda/\delta$ . Finally, after $E_{j}$ was sent to the blackboard the communication cost is $|\bigcup_{i\neq j}E_{i}||(1-\delta)+\ell$ .∎

Theorem 4.1

Let ${\mathcal{E}}$ be a weak $\Delta$ -system with each $|E_{k}|=\ell$ for $k=1,2,\dots,s$ , and suppose that $s\geq\ell^{2}-\ell+3$ . There exists a communication protocol such that after two rounds of communication every site knows an $\epsilon$ -spectral sparsifier of the entire graph $G$ with communication cost $O\left(\log\left(\frac{n}{\epsilon^{2}}\sqrt{1-\delta}\right)\right)$ .

Proof

From deza:74 we know that $\mathcal{E}$ is a sunflower with a kernel $K$ of size $\lambda$ and, by Lemma 2, $K$ is equal in all sites. First, a site $P_{j}$ computes a spectral sparsifier $H_{j}=(V,\hat{\Delta}_{j})$ of the induced subgraph $G_{j}=(V,\Delta_{j})$ using the spectral sparsification algorithm of lee:2018 . This way we have that $|\hat{\Delta}_{j}|=O(n/\epsilon^{2})$ where $0<\epsilon\leq 1/120$ . Then site $P_{j}$ writes $\hat{\Delta}_{j}$ on the blackboard. Any other site $i\neq j$ constructs an $\epsilon$ -spectral sparsifier $H_{i}^{\prime}=(V,\hat{E}_{j})$ of $G_{i}^{\prime}=(V,E_{j})$ . By Theorem 3.1, the graph $H=(V,\hat{\Delta}_{j}\cup\hat{E}_{j})$ is a $\epsilon^{\prime}$ -spectral sparsifier of $G$ . In a second round, a given site $P_{i}$ writes $\hat{E}_{j}$ on the blackboard. Finally, site $P_{j}$ receives $\hat{E}_{j}$ and by Theorem 3.1 it can also construct an $\epsilon^{\prime}$ -spectral sparsifier for $G$ . Finally, the communication complexity is upper-bounded by $O\left(\log\left(\frac{n}{\epsilon^{2}}(1-\delta)\right)+\log\left(\frac{n}{\epsilon^{2}}\right)\right)=O\left(\log\left(\frac{n}{\epsilon^{2}}\sqrt{1-\delta}\right)\right).$ ∎

Acknowledgement. We give our thanks to the reviewers of CTW 2020 for their comments that helped improve this paper. This work is supported by Conacyt research grants POSG17-62 and PINV15-208.

References

(1) Batson, J.D., Spielman, D.A., Srivastava, N.: Twice-ramanujan sparsifiers. In: Proceedings of the 41st annual ACM symposium on Theory of computing (STOC), pp. 255–262 (2009)
(2) Chen, J., Sun, H., Woodruff, D., Zhang, Q.: Communication-optimal distributed clustering. In: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS), pp. 3727–3735 (2016)
(3) Deza, M.: Solution d’un problème de Erdös-Lovász. Journal of Combinatorial Theory, Series B 16(2), 166–167 (1974)
(4) Erdös, P., Chao, Rado, R.: Intersection theorems for systems op finite sets. Quarterly Journal of Mathematics 12(1), 313–320 (1961)
(5) Kostochka, A.: Extremal problems on $\Delta$ -systems. Numbers, Information and Complexity pp. 143–150 (2000)
(6) Kushilevitz, E., Nisan, N.: Communication complexity. Cambridge University Press (2006)
(7) Lee, J.R., Gharan, S.O., Trevisan, L.: Multiway spectral partitioning and higher-order cheeger inequalities. Journal of the ACM (JACM) 61(6), 37 (2014)
(8) Lee, Y.T., Sun, H.: Constructing linear-sized spectral sparsification in almost-linear time. SIAM Journal on Computing 47(6), 2315–2336 (2018)
(9) von Luxburg, U.: A tutorial on spectral clustering. Statistics and computing 17(4), 395–416 (2007)
(10) Nisan, N.: The communication complexity of threshold gates. Combinatorics, Paul Erdös is Eighty 1, 301–315 (1993)
(11) Spielman, D.A., Teng, S.H.: Spectral sparsification of graphs. SIAM Journal On Computing 40(4), 981–1025 (2011)