
A Compressed Sensing Based Least Squares Approach to Semi-supervised Local Cluster Extraction

Ming-Jun Lai (Department of Mathematics, University of Georgia, Athens, GA 30602. [email protected])    Zhaiming Shen (Department of Mathematics, University of Georgia, Athens, GA 30602. [email protected])
Abstract

A least squares semi-supervised local clustering algorithm based on the idea of compressed sensing is proposed to extract clusters from a graph with known adjacency matrix. The algorithm uses a two-stage approach similar to the one in [26]. However, under a weaker assumption and with less computational complexity than the algorithm in [26], our algorithm is shown to find a desired cluster with high probability. The “one cluster at a time” feature of our method distinguishes it from other global clustering methods. Several numerical experiments are conducted on synthetic data such as the stochastic block model and on real data such as MNIST, the political blogs network, and the AT&T and YaleB human face data sets to demonstrate the effectiveness and efficiency of our algorithm.

1 Introduction

Informally speaking, graph clustering aims at dividing the set of vertices of a graph into subsets in such a way that there are more edges within each subset and fewer edges between different subsets. When analyzing a graph, one of the primary interests is to find its underlying clustered structure, as vertices in the same cluster can reasonably be assumed to share some latent similarity. For data sets which are not presented as graphs, we can create a suitable auxiliary graph such as the $K$-nearest-neighbors ($K$-NN) graph based on the given data, for example, see [42], [30] and [21]. Then we can apply graph clustering techniques on this auxiliary graph.
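To make the auxiliary-graph idea concrete, the following is a minimal sketch of building a symmetric $K$-NN adjacency matrix with Gaussian edge weights. It is only an illustrative construction (the function name `knn_graph` and the parameters `K`, `sigma` are assumptions), not the exact preprocessing used for the data sets in Section 5.

```python
import numpy as np

def knn_graph(X, K=10, sigma=1.0):
    """Build a symmetric K-NN adjacency matrix from the rows of X.

    Illustrative sketch: Gaussian edge weights and symmetrization by max."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(D2, np.inf)                     # no self-loops
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[:K]                 # K nearest neighbors of vertex i
        A[i, nbrs] = np.exp(-D2[i, nbrs] / (2 * sigma**2))
    return np.maximum(A, A.T)                        # undirected weighted graph
```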

The graph clustering problem has recently become prevalent in areas such as social network analysis [16], [20] and [24], image classification [6], [7] and [42], and natural language processing [11] and [31]. For example, suppose the vertices of a social network graph represent users of a social network (e.g. Facebook, LinkedIn); then the edges could represent users who are connected to each other. Sets of nodes with high inter-connectivity, which we call communities or clusters, could represent friendship groups or co-workers. By identifying those communities we can suggest new connections to users. Note that some networks are directed (e.g. Twitter, citation networks), which can make community detection more subtle. For the scope of this paper, we only focus on weighted undirected graphs.

The classical graph based clustering problem is a global clustering problem which assigns every vertex to a unique cluster, assuming there are no multi-class vertices. It is usually treated as an unsupervised learning problem which can be solved by methods such as spectral clustering [29], [35] and [50] or by finding an optimal cut of the graph [12], [13]; these approaches are generally computationally expensive and hard to implement for large data sets. It can also be done in a semi-supervised fashion, as in [25], [21] and [49]. However, sometimes one is only interested in finding a single cluster which contains the target vertices, given prior knowledge of a small portion of labels from the true cluster, which is usually attainable for real data. This type of problem is called local clustering, or local cluster extraction, which, loosely speaking, takes a set of vertices $\Gamma$ with given labels, called seeds, as input and returns a cluster $C^{\#}$ such that $\Gamma\subset C^{\#}$. The local clustering problem has not been studied exhaustively, and many of its aspects remain open. Some recent related works are [18], [47], [48], [44] and [26]; in particular, the work in [26] shares the same problem setting as this paper. In this paper, we propose a new semi-supervised local clustering approach using the ideas of compressed sensing and the method of least squares to make the clustering effective and efficient. Indeed, as we will see in the numerical experiments section, our approach outperforms the work in [26] in terms of both accuracy and efficiency.

The main contribution of this paper is the local cluster extraction Algorithms 3 and 4, which improve the performance of the algorithms in [26] and also slightly improve the state-of-the-art result in [2] for the political blogs network [3]. They also achieve better or comparable results on the synthetic stochastic block model, human face data, and MNIST data compared with several other modern local clustering or semi-supervised algorithms.

The subsequent sections of this paper are structured as follows. In Section 2, we give a brief introduction to spectral clustering and the concept of the graph Laplacian, and we state the assumptions on the graph model which we will use later for the theoretical analysis. In Section 3, we explain the main two-stage algorithms for solving the local cluster extraction problem and show the correctness of our algorithms asymptotically. In Section 4, we analyze the computational complexity of our algorithms. In Section 5, several synthetic and real data sets are used to evaluate the performance of our algorithms, and we compare their performance with state-of-the-art results.

2 Preliminaries and Models

2.1 Notations and Definitions

We use the standard notation $G=(V,E)$ to denote the graph $G$ with vertex set $V$ and edge set $E$. For the case $|V|=n$, we identify $V$ with the set of integers $[n]:=\{1,2,\cdots,n\}$. We use $A$ to denote the adjacency matrix (possibly non-negatively weighted) of $G$, so in the undirected case $A$ is a symmetric matrix. Let $D$ be the diagonal matrix $D=\mathrm{diag}(d_{1},d_{2},\cdots,d_{n})$, where each $d_{i}$ is the degree of vertex $i$. We have the following definition.

Definition 2.1.

The unnormalized graph Laplacian is defined as $L=D-A$. There are also two other normalized graph Laplacians: the symmetric graph Laplacian $L_{sym}:=I-D^{-1/2}AD^{-1/2}$ and the random walk graph Laplacian $L_{rw}:=I-D^{-1}A$.
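As a quick illustration of Definition 2.1, the sketch below computes the three Laplacians from a dense adjacency matrix; it assumes every vertex has positive degree so that $D^{-1}$ and $D^{-1/2}$ exist.

```python
import numpy as np

def graph_laplacians(A):
    """Return (L, L_sym, L_rw) for a (weighted) adjacency matrix A."""
    d = A.sum(axis=1)                              # degrees d_1, ..., d_n
    n = len(d)
    L = np.diag(d) - A                             # unnormalized Laplacian
    L_sym = np.eye(n) - np.diag(1.0 / np.sqrt(d)) @ A @ np.diag(1.0 / np.sqrt(d))
    L_rw = np.eye(n) - np.diag(1.0 / d) @ A        # random walk Laplacian
    return L, L_sym, L_rw
```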

The following result serves as the foundation of our approach for solving the graph clustering problem; we omit its proof here and refer directly to [8] and [29].

Lemma 2.1.

Let $G$ be an undirected graph of size $n$ with non-negative weights. Then the multiplicity $k$ of the eigenvalue $0$ of $L$ ($L_{rw}$) equals the number of connected components $C_{1},C_{2},\cdots,C_{k}$ in $G$, and the indicator vectors $\mathbf{1}_{C_{1}},\cdots,\mathbf{1}_{C_{k}}\in\mathbb{R}^{n}$ of these components span the kernel of $L$ ($L_{rw}$).
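A small numerical check of Lemma 2.1 on a toy graph with two connected components (two disjoint triangles); the construction below is only illustrative.

```python
import numpy as np

# Two disjoint triangles: eigenvalue 0 of L_rw has multiplicity 2, and the
# indicator vector of the first component lies in the kernel.
A = np.zeros((6, 6))
for (i, j) in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
L_rw = np.eye(6) - np.diag(1.0 / A.sum(axis=1)) @ A
print(np.round(np.sort(np.linalg.eigvals(L_rw).real), 6))   # first two are 0
print(np.allclose(L_rw @ np.r_[np.ones(3), np.zeros(3)], 0))  # 1_{C_1} in kernel
```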

Let us introduce some more notation which we will use later. Suppose for the moment that we have information about the underlying cluster of each vertex; then it is useful to write $G$ as a union of two edge-disjoint subgraphs $G=G^{in}\cup G^{out}$, where $G^{in}=(V,E^{in})$ consists of only intra-connection edges and $G^{out}=(V,E^{out})$ consists of only inter-connection edges. We will use $d_{i}^{in}$ to denote the degree of vertex $i$ in the subgraph $G^{in}$, and $d_{i}^{out}$ to denote the degree of vertex $i$ in the subgraph $G^{out}$. We will also use $A^{in}$ and $L^{in}$ to denote the adjacency matrix and graph Laplacian associated with $G^{in}$, and $A^{out}$ and $L^{out}$ to denote the adjacency matrix and graph Laplacian associated with $G^{out}$. Note that these notations are only for convenience in the analysis of the next section; in reality we have no assurance about which cluster each individual vertex belongs to, so we have no access to $A^{in}$ and $L^{in}$. It is also worth pointing out that $A=A^{in}+A^{out}$ but $L\neq L^{in}+L^{out}$ in general. Furthermore, we will use $|L|$ or $|\mathbf{y}|$ to denote the matrix or vector where each entry is replaced by its absolute value, and we will use $|V|$ to denote the size of $V$ whenever $V$ is a set. In the later sections, we will use $L$ and $L^{in}$ to indicate $L_{rw}$ and $L^{in}_{rw}$ respectively, and use $L_{C}$ and $L^{in}_{C}$ to denote the submatrices of $L$ and $L^{in}$ with column indices in the subset $C\subset V=[n]$. For convenience, the notation used throughout the paper is summarized in Table 1.

Table 1: Table of Notations
Symbols | Descriptions
$G$ | A general graph
$|G|$ | Size of $G$
$V$ | Set of vertices of graph $G$ (we identify $V=\{1,2,\cdots,n\}$ if $|G|=n$ throughout the paper)
$|V|$ | Size of $V$
$E$ | Set of edges of graph $G$
$E^{in}$ | Subset of $E$ which consists of only intra-connection edges
$E^{out}$ | Subset of $E$ which consists of only inter-connection edges
$G^{in}$ | Subgraph of $G$ on $V$ with edge set $E^{in}$
$G^{out}$ | Subgraph of $G$ on $V$ with edge set $E^{out}$
$A$ | Adjacency matrix of graph $G$
$A^{in}$ | Adjacency matrix of graph $G^{in}$
$A^{out}$ | Adjacency matrix of graph $G^{out}$
$L$ | Random walk graph Laplacian of $G$
$L^{in}$ | Random walk graph Laplacian of $G^{in}$
$L^{out}$ | Random walk graph Laplacian of $G^{out}$
$\mathbf{1}_{C}$ | Indicator vector on subset $C\subset V$
$L_{C}$ | Submatrix of $L$ with column indices $C\subset V$
$L^{in}_{C}$ | Submatrix of $L^{in}$ with column indices $C\subset V$
$|L|$ | Entrywise absolute value of matrix $L$
$|\mathbf{y}|$ | Entrywise absolute value of vector $\mathbf{y}$
$\|M\|$ | $\|\cdot\|_{2}$ norm of matrix $M$
$\|\mathbf{y}\|$ | $\|\cdot\|_{2}$ norm of vector $\mathbf{y}$

2.2 Graph Model Assumptions

We make the following assumptions on our graph model from the asymptotic perspective.

Assumption 1.

Suppose $G=(V,E)$ can be partitioned into $k=O(1)$ connected components such that $V=C_{1}\cup\cdots\cup C_{k}$, where each $C_{i}$ is the underlying vertex set of a connected component of $G$.

  • (I)

    The degree of each vertex is asymptotically the same for vertices belonging to the same cluster $C_{i}$.

  • (II)

    The degree $d^{out}_{i}$ is asymptotically small relative to the degree $d^{in}_{i}$ for each vertex $i\in V$.

Random graphs which satisfy assumption (I) are not uncommon; for example, the Erdös-Rényi (ER) model $G(n,p)$ with $p\sim\frac{\omega(n)\log(n)}{n}$ for any $\omega(n)\to\infty$, see [15] and [9]. A natural generalization of the ER model is the stochastic block model (SBM) [19], which is a generative model for random graphs with certain edge densities within and between the underlying clusters, such that the edges within clusters are denser than the edges between clusters. In the case where each cluster has the same size and the intra- and inter-connection probabilities are the same among all the vertices, we have the symmetric stochastic block model (SSBM). It is worthwhile to note that the information theoretic bounds for exact cluster recovery in the SBM are given in [1] and [2]. It was also shown in [26] that a general SBM under certain assumptions on the parameters can be clustered using a compressed sensing approach. Our model requires a weaker assumption than the one in [26]; indeed, we remove the assumption imposed on the eigenvalues of $L$ in [26]. Therefore, our model is applicable to a broader range of random graphs.

3 Main Algorithms

Our analysis is based on the following key observation. Suppose that the graph $G$ has $k$ connected components $C_{1},\cdots,C_{k}$, i.e., $L=L^{in}$, and suppose further that we temporarily have access to the information about the structure of $L^{in}$. Then we can write the graph Laplacian $L^{in}$ in the block diagonal form

L=L^{in}=\left(\begin{array}[]{cccc}L^{in}_{1}&&&\\ &L^{in}_{2}&&\\ &&\ddots&\\ &&&L^{in}_{k}\\ \end{array}\right). (1)

Suppose now we are interested in finding the cluster with the smallest number of vertices, say $C_{1}$, which corresponds to $L^{in}_{1}$. By Lemma 2.1, $\{\mathbf{1}_{C_{1}},\cdots,\mathbf{1}_{C_{k}}\}$ forms a basis of the kernel $W_{0}$ of $L$. Note that all the $\mathbf{1}_{C_{i}}$ have disjoint supports, so for $\mathbf{w}\in W_{0}$ with $\mathbf{w}\neq 0$, we can write

\mathbf{w}=\sum_{i=1}^{k}\alpha_{i}\mathbf{1}_{C_{i}} (2)

with some $\alpha_{i}\neq 0$. Therefore, if $\mathbf{1}_{C_{1}}$ has the fewest non-zero entries among all elements of $W_{0}\setminus\{\mathbf{0}\}$, then we can find it by solving the following minimization problem:

\min\|\mathbf{w}\|_{0}\quad\text{s.t.}\quad L^{in}\mathbf{w}=\mathbf{0}\quad\text{and}\quad\mathbf{w}\neq\mathbf{0}. (3)

Here the $\ell_{0}$ norm $\|\cdot\|_{0}$ counts the number of nonzero components of the input vector. Problem (3) can be solved using methods such as the greedy algorithms from compressed sensing, as explained in [26]. However, we propose a different approach to tackle it in this paper and demonstrate that the new approach is more effective numerically and requires fewer assumptions.

3.1 Least Squares Cluster Pursuit

Let us consider problem (3) again. Instead of finding $C_{1}$ directly, let us try to find which vertices are not in $C_{1}$. Suppose there is a superset $\Omega\subset V$ such that $C_{1}\subset\Omega$ and $C_{i}\not\subset\Omega$ for all $i=2,\cdots,k$. Since $L^{in}\mathbf{1}_{C_{1}}=\mathbf{0}$, we have

L^{in}\mathbf{1}_{\Omega}=L^{in}(\mathbf{1}_{\Omega\setminus C_{1}}+\mathbf{1}_{C_{1}})=L^{in}\mathbf{1}_{\Omega\setminus C_{1}}+L^{in}\mathbf{1}_{C_{1}}=L^{in}\mathbf{1}_{\Omega\setminus C_{1}}. (4)

Letting $\mathbf{y}:=L^{in}\mathbf{1}_{\Omega}$, finding which vertices within $\Omega$ are not in $C_{1}$ is then equivalent to solving the following problem (5):

\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{n}}\|L^{in}\mathbf{x}-\mathbf{y}\|_{2}. (5)

Note that solving problem (5) directly gives both $\mathbf{x}^{*}=\mathbf{1}_{\Omega}\in\mathbb{R}^{n}$ and $\mathbf{x}^{*}=\mathbf{1}_{\Omega\setminus C_{1}}\in\mathbb{R}^{n}$ as solutions. By setting the columns $L^{in}_{V\setminus\Omega}$ to zero, solving problem (5) is equivalent to solving

\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{|\Omega|}}\|L^{in}_{\Omega}\mathbf{x}-\mathbf{y}\|_{2}. (6)

Directly solving problem (6) gives at least two solutions, $\mathbf{x}^{*}=\mathbf{1}\in\mathbb{R}^{|\Omega|}$ and $\mathbf{x}^{*}=\mathbf{1}_{C_{1}^{\mathsf{c}}}\in\mathbb{R}^{|\Omega|}$, where $C_{1}^{\mathsf{c}}$ denotes the complement of $C_{1}$. Between these two solutions, the latter is much more informative for extracting $C_{1}$ from $\Omega$ than the former. We need a way to avoid the non-informative solution $\mathbf{x}^{*}=\mathbf{1}$ while keeping the informative solution $\mathbf{x}^{*}=\mathbf{1}_{C_{1}^{\mathsf{c}}}$.

We can achieve this by removing a subset of columns with indices in $\Omega$. Let us use $T\subset\Omega$ to denote the indices of the columns we aim to remove, and suppose we could choose $T$ such that $T\subset C_{1}$. Now consider the following variation (7) of the minimization problem (6):

\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{|\Omega|-|T|}}\|L^{in}_{\Omega\setminus T}\mathbf{x}-\mathbf{y}\|_{2}. (7)

Unlike (6), which has two solutions, (7) has only one: $\mathbf{x}^{*}=\mathbf{1}$ is no longer a solution because of the removal of $T$, while $\mathbf{x}^{*}=\mathbf{1}_{C_{1}^{\mathsf{c}}}$ is still a solution to (7) because $L^{in}_{\Omega\setminus T}\mathbf{1}_{C_{1}^{\mathsf{c}}}=L^{in}\mathbf{1}_{\Omega\setminus C_{1}}=0$. Furthermore, the solution to (7) is unique since it is a least squares problem whose matrix $L^{in}_{\Omega\setminus T}$ has full column rank; therefore $\mathbf{x}^{*}=\mathbf{1}_{C_{1}^{\mathsf{c}}}$ is the unique solution to (7).

However, there is no way in theory to select $T$ while guaranteeing the condition $T\subset C_{1}$. In practice, we choose $T$ based on the following observation. Suppose $L=L^{in}$, $\Omega\supset C_{1}$ and $\Omega\not\supset C_{i}$ for $i=2,\cdots,k$. Then $|L_{a}^{\top}|\cdot|\mathbf{y}|=0$ for all $a\in C_{1}$, and $|L_{a}^{\top}|\cdot|\mathbf{y}|>0$ for all $a\in\Omega\setminus C_{1}$. Therefore, we choose $T$ in such a way that $|L_{t}^{\top}|\cdot|\mathbf{y}|$ is small for all $t\in T\subset\Omega$. These ideas lead to Algorithm 1.

Algorithm 1 Least Squares Cluster Pursuit
Input: Adjacency matrix $A$, vertex subset $\Omega\subset V$, least squares threshold parameter $\gamma\in(0,1)$, and rejection parameter $0.1\leq R\leq 0.9$.

  1. Compute $L=I-D^{-1}A$ and $\mathbf{y}=L\mathbf{1}_{\Omega}$.

  2. Let $T$ be the set of column indices of the $\gamma\cdot|\Omega|$ smallest components of the vector $|L_{\Omega}^{\top}|\cdot|\mathbf{y}|$.

  3. Let $\mathbf{x}^{\#}$ be the solution to

     \operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{|\Omega|-|T|}}\|L_{\Omega\setminus T}\mathbf{x}-\mathbf{y}\|_{2} (8)

     obtained by using an iterative least squares solver.

  4. Let $W^{\#}=\{i:\mathbf{x}_{i}^{\#}>R\}$.

Output: $C^{\#}_{1}=\Omega\setminus W^{\#}$.
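The following is a minimal Python sketch of Algorithm 1, assuming the adjacency matrix is a SciPy CSR sparse matrix; it mirrors MATLAB's lsqr with scipy.sparse.linalg.lsqr, and the default values of `gamma` and `R` are illustrative choices, not the tuned settings used in the experiments.

```python
import numpy as np
from scipy.sparse import identity, diags
from scipy.sparse.linalg import lsqr

def least_squares_cluster_pursuit(A, Omega, gamma=0.2, R=0.5):
    """Sketch of Algorithm 1. A: sparse adjacency matrix (CSR),
    Omega: array of candidate vertex indices."""
    n = A.shape[0]
    Omega = np.asarray(Omega)
    d = np.asarray(A.sum(axis=1)).ravel()
    L = identity(n, format='csr') - diags(1.0 / d) @ A        # L = I - D^{-1} A
    y = np.asarray(L[:, Omega].sum(axis=1)).ravel()           # y = L 1_Omega

    # Step 2: T = indices of the gamma*|Omega| smallest entries of |L_Omega|^T |y|
    scores = np.asarray(abs(L[:, Omega]).T @ np.abs(y)).ravel()
    keep = np.argsort(scores)[int(gamma * Omega.size):]       # columns kept: Omega \ T
    cols = Omega[keep]

    # Step 3: iterative least squares on L_{Omega \ T} (analogue of MATLAB's lsqr)
    x = lsqr(L[:, cols], y)[0]

    # Step 4: reject vertices with x_i^# > R; the rest of Omega form C_1^#
    W = cols[x > R]
    return np.setdiff1d(Omega, W)
```
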
Remark 3.1.

We take the absolute value rather than the direct dot product in order to reduce cancellation between vector components when summing the entrywise products. In practice, the value of $\gamma\in(0,1)$ does not affect the performance much as long as it is not too extreme. We find that $0.15\leq\gamma\leq 0.4$ works well for our numerical experiments.

Remark 3.2.

In practice, we choose to use MATLAB's lsqr function to solve the least squares problem (8). As we will see in Lemma 3.2, our problem is well conditioned, so it is also possible to solve the normal equations exactly for problems that are not of very large scale. However, we choose to solve it iteratively rather than exactly because a highly accurate numerical solution is not essential for our task; we are only interested in an approximate solution, since the cutoff value $R$ is used for clustering.

Remark 3.3.

As indicated in [26], we can reformulate problem (3) as solving

\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{n}}\{\|L\mathbf{x}-\mathbf{y}\|_{2}:\ \|\mathbf{x}\|_{0}\leq s\} (9)

by applying greedy algorithms such as subspace pursuit [10] and compressive sampling matching pursuit (CoSaMP) [34]. Alternatively, we can consider the LASSO formulation of the problem, see [41] and [43],

\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{n}}\{\|L\mathbf{x}-\mathbf{y}\|^{2}_{2}+\lambda\|\mathbf{x}\|_{1}\}=\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{n}}\{\|L\mathbf{x}-\mathbf{y}\|^{2}_{2}+\lambda\|\mathbf{x}\|_{0}\}. (10)

The reason that the LASSO is a natural way to interpret this problem is that the solution $\mathbf{x}^{*}$ we seek is a sparse indicator vector, which satisfies $\|\mathbf{x}^{*}\|_{1}=\|\mathbf{x}^{*}\|_{0}$. We do not analyze this formulation further here.

However, in reality we have no access to $L^{in}$; all we know is $L$, and in general $L\neq L^{in}$. We argue that the solution to the perturbed problem (8) associated with $L$ will not be too far away from the solution to the unperturbed problem (7) associated with $L^{in}$, provided the difference between $L$ and $L^{in}$ is relatively small. Let us make this precise by first quoting the following standard result in numerical analysis.

Lemma 3.1.

Let $\|\cdot\|$ be an operator norm, $A\in\mathbb{R}^{n\times n}$ a non-singular square matrix, and $\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}$. Let $\tilde{A}$, $\tilde{\mathbf{x}}$, $\tilde{\mathbf{y}}$ be perturbed measurements of $A$, $\mathbf{x}$, $\mathbf{y}$ respectively. Suppose $A\mathbf{x}=\mathbf{y}$, $\tilde{A}\tilde{\mathbf{x}}=\tilde{\mathbf{y}}$, and suppose further that $\mathrm{cond}(A)<\frac{\|A\|}{\|\tilde{A}-A\|}$. Then

\frac{\|\mathbf{\tilde{x}}-\mathbf{x}\|}{\|\mathbf{x}\|}\leq\frac{\mathrm{cond}(A)}{1-\mathrm{cond}(A)\frac{\|\tilde{A}-A\|}{\|A\|}}\Big{(}\frac{\|\tilde{A}-A\|}{\|A\|}+\frac{\|\mathbf{\tilde{y}}-\mathbf{y}\|}{\|\mathbf{y}\|}\Big{)}.

The above lemma asserts that the size of $\mathrm{cond}(A)$ is significant in determining the stability of the solution $\mathbf{x}$ with respect to small perturbations of $A$ and $\mathbf{y}$. From now on, we use $\|\cdot\|$ to denote the standard vector or induced matrix two-norm $\|\cdot\|_{2}$ unless stated otherwise. The next lemma establishes the invertibility of $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$ and gives a bound on its condition number.

Lemma 3.2.

Let $V=\cup_{i=1}^{k}C_{i}$ be the disjoint union of $k=O(1)$ underlying clusters of sizes $n_{i}$ and assume (I). Let $d_{j}$ be the degree of vertex $j\in V=[n]$, let $n_{1}=\min_{i\in[k]}n_{i}$, and suppose $\Omega\subset V$ is such that $\Omega\supset C_{1}$ and $\Omega\not\supset C_{i}$ for $i=2,\cdots,k$. Then

  • (i)

    If $T\subset C_{1}$, then $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$ is invertible.

  • (ii)

    Suppose further that $\lceil\frac{3n_{1}}{4}\rceil\leq|T|<n_{1}$ and $\lceil\frac{5n_{1}}{4}\rceil\leq|\Omega|\leq\lceil\frac{7n_{1}}{4}\rceil$. Then

    cond\big{(}(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{)}\leq 4

    almost surely as $n_{1}\to\infty$, e.g. when $n\to\infty$.

Proof.

Without loss of generality, assume that the column indices of $L^{in}_{\Omega\setminus T}$ are already permuted so that they are ordered according to their underlying clusters. The invertibility of $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$ follows directly from the fact that $L^{in}_{\Omega\setminus T}$ has full column rank, so let us show that $L^{in}_{\Omega\setminus T}$ has full column rank. Because of the reordering, $L^{in}_{\Omega\setminus T}$ is in the block diagonal form

L_{\Omega\setminus T}^{in}=\left(\begin{array}[]{cccc}L^{in}_{C_{1}\setminus T}&&&\\ &L^{in}_{\Omega\cap C_{2}}&&\\ &&L^{in}_{\Omega\cap C_{3}}&\\ &&&\ddots\\ \end{array}\right).

It then suffices to show that each block has full column rank. By Lemma 2.1, each $L^{in}_{C_{i}}$ has $\lambda=0$ as an eigenvalue of multiplicity one, and the corresponding eigenspace is spanned by $\mathbf{1}_{C_{i}}$. Hence $rank(L^{in}_{C_{i}})=|C_{i}|-1$. Now suppose by contradiction that the columns of $L^{in}_{C_{1}\setminus T}$ are linearly dependent, so there exists $\mathbf{v}\neq\mathbf{0}$ such that $L^{in}_{C_{1}\setminus T}\mathbf{v}=\mathbf{0}$, or $L^{in}_{C_{1}\setminus T}\mathbf{v}+L^{in}_{T}\cdot\mathbf{0}=\mathbf{0}$. This means that $\mathbf{u}=(\mathbf{v},\mathbf{0})$ is an eigenvector associated with the eigenvalue zero which is not a multiple of $\mathbf{1}_{C_{1}}$, contradicting the fact that the eigenspace is spanned by $\mathbf{1}_{C_{1}}$. Therefore the columns of $L^{in}_{C_{1}\setminus T}$ are linearly independent, hence $L^{in}_{C_{1}\setminus T}$ has full column rank. For $C_{i}$ with $i\geq 2$, since $C_{i}\not\subset\Omega$, $\Omega\cap C_{i}$ is a proper subset of $C_{i}$, and the same argument applies. Therefore all blocks in $L^{in}_{\Omega\setminus T}$ have full column rank, so $L^{in}_{\Omega\setminus T}$ has full column rank.

Now, since $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$ is in block form, to estimate its condition number we only need to estimate the largest and smallest eigenvalues of each block. Write $L^{in}_{\Omega\setminus T}=[l_{ij}]$ and $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}=[s_{ij}]$. For each $i\in C_{1}\setminus T$, $s_{ii}=\sum_{k=1}^{n}l_{ki}l_{ki}=\sum_{k=1}^{n}l_{ki}^{2}=\sum_{k=1}^{n_{1}}l_{ki}^{2}=1+\frac{1}{d_{i}^{in}}$, and for $i,j\in C_{1}\setminus T$ with $i\neq j$, $s_{ij}=\sum_{k=1}^{n}l_{ki}l_{kj}=\sum_{k=1}^{n_{1}}l_{ki}l_{kj}$. Note that the probability of having an edge between $i$ and $j$ given the degree sequence $d_{1},\cdots,d_{n_{1}}$ equals $\frac{d_{i}d_{j}}{\sum_{i\in C_{1}}d_{i}}$, as the existence of an edge between two vertices is proportional to their degrees. So $l_{ij}$ equals $-\frac{1}{d_{i}}$ with probability $\frac{d_{i}d_{j}}{\sum_{i\in C_{1}}d_{i}}$, which implies $\mathbb{E}(l_{ij})=-\frac{d_{j}}{\sum_{i\in C_{1}}d_{i}}$; and $l_{ji}$ equals $-\frac{1}{d_{j}}$ with probability $\frac{d_{i}d_{j}}{\sum_{i\in C_{1}}d_{i}}$, which implies $\mathbb{E}(l_{ji})=-\frac{d_{i}}{\sum_{i\in C_{1}}d_{i}}$. Hence the expectation

\mathbb{E}(s_{ij})=\mathbb{E}\Big{(}\sum_{k=1}^{n}l_{ki}l_{kj}\Big{)}=\sum_{k=1}^{n}\mathbb{E}(l_{ki})\mathbb{E}(l_{kj})=\sum_{k=1}^{n_{1}}\mathbb{E}(l_{ki})\mathbb{E}(l_{kj})
=\frac{d_{i}d_{j}}{\sum_{i\in C_{1}}d_{i}}\cdot\Big{(}-\frac{1}{d_{i}}\Big{)}+\frac{d_{i}d_{j}}{\sum_{i\in C_{1}}d_{i}}\cdot\Big{(}-\frac{1}{d_{j}}\Big{)}+\frac{d_{k}d_{i}}{\sum_{i\in C_{1}}d_{i}}\cdot\frac{d_{k}d_{j}}{\sum_{i\in C_{1}}d_{i}}\cdot\Big{(}\frac{1}{d_{k}}\Big{)}^{2}
=-\frac{d_{i}+d_{j}}{\sum_{i\in C_{1}}d_{i}}+\frac{d_{i}d_{j}}{(\sum_{i\in C_{1}}d_{i})^{2}}=-\frac{2}{n_{1}}+\frac{1}{n_{1}^{2}}.

By the law of large numbers, $s_{ij}\to-\frac{2}{n_{1}}+\frac{1}{n_{1}^{2}}$ almost surely as $n_{1}\to\infty$. Therefore, for $i\in C_{1}\setminus T$, we have

\sum_{j\in C_{1}\setminus T,j\neq i}|s_{ij}|\to|C_{1}\setminus T|\cdot\Big{(}\frac{2}{n_{1}}-\frac{1}{n_{1}^{2}}\Big{)}\leq\frac{n_{1}}{4}\cdot\Big{(}\frac{2}{n_{1}}-\frac{1}{n_{1}^{2}}\Big{)}\leq\frac{1}{2}

almost surely as $n_{1}\to\infty$. Similarly, for each $i\in C_{k}\cap(\Omega\setminus C_{1})$, $k\geq 2$, we have $s_{ii}=1+\frac{1}{d^{in}_{i}}$ and $\sum_{j\in C_{k}\cap(\Omega\setminus C_{1}),j\neq i}|s_{ij}|\to\frac{n_{1}}{4}\cdot(\frac{2}{n_{k}}-\frac{1}{n_{k}^{2}})\leq\frac{1}{2}$ almost surely as $n_{1}\to\infty$.
Now we apply Gershgorin's circle theorem to bound the spectrum of $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$. For all $i\in\Omega\setminus T$, the Gershgorin discs are centered at $1+\frac{1}{d_{i}}$ with radius at most $\frac{1}{2}$ almost surely; hence $\sigma_{\min}((L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T})\geq\frac{1}{2}$ and $\sigma_{\max}((L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T})\leq\frac{3}{2}+\frac{1}{d_{i}}\leq 2$ almost surely. Therefore we have

\text{cond}\big{(}(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{)}=\frac{\sigma_{\max}((L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T})}{\sigma_{\min}((L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T})}\leq 4

almost surely, as desired. ∎
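A small simulation consistent with Lemma 3.2(ii): we build $L^{in}$ for two equal clusters with no inter-cluster edges (so $L=L^{in}$), pick $\Omega$ and $T$ of the prescribed sizes, and check the condition number of the resulting Gram matrix. The cluster size and edge probability below are illustrative assumptions, not the exact experimental setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n1 = 400                                    # two equal clusters, no inter-cluster edges
p = 8 * np.log(n1) / n1                     # assumed intra-cluster edge probability
blocks = []
for _ in range(2):
    B = np.triu(rng.random((n1, n1)) < p, 1).astype(float)
    blocks.append(B + B.T)
A_in = np.block([[blocks[0], np.zeros((n1, n1))],
                 [np.zeros((n1, n1)), blocks[1]]])
d = A_in.sum(axis=1)
L_in = np.eye(2 * n1) - np.diag(1.0 / d) @ A_in

C1 = np.arange(n1)
Omega = np.arange(int(np.ceil(1.25 * n1)))  # C1 plus a proper part of C2
T = rng.choice(C1, size=int(np.ceil(0.75 * n1)), replace=False)
cols = np.setdiff1d(Omega, T)
Gram = L_in[:, cols].T @ L_in[:, cols]
print(np.linalg.cond(Gram))                 # typically a small constant below 4
```
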

Remark 3.4.

Note that there is a minor difficulty in estimating the expectation of the inner product between two different columns of $L^{in}_{\Omega\setminus T}$. The computation assumes the independence of the degree distributions of the individual vertices within each cluster, which may not hold for an arbitrary graph. However, independence does occur if asymptotic uniformity of the degree distribution within each cluster is assumed, which is why our model needs this assumption.

Now the perturbed problem (8) is equivalent to solving $(L_{\Omega\setminus T}^{\top}L_{\Omega\setminus T})\mathbf{x}^{\#}=L_{\Omega\setminus T}^{\top}\tilde{\mathbf{y}}=L_{\Omega\setminus T}^{\top}(L\mathbf{1}_{\Omega})$, while the unperturbed problem (7) is to solve $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\mathbf{x}^{*}=(L^{in}_{\Omega\setminus T})^{\top}\mathbf{y}=(L^{in}_{\Omega\setminus T})^{\top}(L^{in}\mathbf{1}_{\Omega})$. Let $M:=L-L^{in}$, $M_{\Omega}:=L_{\Omega}-L^{in}_{\Omega}$, and $M_{\Omega\setminus T}:=L_{\Omega\setminus T}-L^{in}_{\Omega\setminus T}$. Let us first give an estimate for $\|M\|$.

Lemma 3.3.

Let $L$ be the graph Laplacian of $G$ and $M:=L-L^{in}$. Let $\epsilon_{i}:=\frac{d^{out}_{i}}{d_{i}}$ for all $i$ and $\epsilon_{\max}:=\max_{i\in[n]}\epsilon_{i}$. Then $\|M\|\leq 2\epsilon_{\max}$.

Proof.

Let $\delta_{ij}$ denote the Kronecker delta and observe that

L_{ij}=\delta_{ij}-\frac{1}{d_{i}}A_{ij}=\delta_{ij}-\frac{1}{d^{in}_{i}+d^{out}_{i}}(A^{in}_{ij}+A^{out}_{ij}).

Since $\epsilon_{i}=\frac{d^{out}_{i}}{d_{i}}$, we have $\frac{1}{d_{i}}=\frac{1}{d^{in}_{i}+d^{out}_{i}}=\frac{1}{d^{in}_{i}}-\frac{\epsilon_{i}}{d^{in}_{i}}$. So we have

L_{ij}=\delta_{ij}-\Big{(}\frac{1}{d^{in}_{i}}-\frac{\epsilon_{i}}{d^{in}_{i}}\Big{)}(A^{in}_{ij}+A^{out}_{ij})
=\Big{(}\delta_{ij}-\frac{1}{d^{in}_{i}}A^{in}_{ij}\Big{)}-\frac{1}{d^{in}_{i}}A^{out}_{ij}+\frac{\epsilon_{i}}{d^{in}_{i}}(A^{in}_{ij}+A^{out}_{ij})
=L^{in}_{ij}-\frac{1-\epsilon_{i}}{d^{in}_{i}}A^{out}_{ij}+\frac{\epsilon_{i}}{d^{in}_{i}}A^{in}_{ij}.

Therefore $M_{ij}=-\frac{1-\epsilon_{i}}{d^{in}_{i}}A^{out}_{ij}+\frac{\epsilon_{i}}{d^{in}_{i}}A^{in}_{ij}$. To bound the spectral norm we apply Gershgorin's circle theorem, noting that $M_{ii}=0$ for all $i$; hence

\|M\|=\max\{|\lambda_{i}|:\lambda_{i}\ \text{eigenvalue of}\ M\}\leq\max_{i}\sum_{j}|M_{ij}|
=\max_{i}\sum_{j}\Big{|}-\frac{1-\epsilon_{i}}{d^{in}_{i}}A_{ij}^{out}+\frac{\epsilon_{i}}{d^{in}_{i}}A_{ij}^{in}\Big{|}
\leq\max_{i}\sum_{j}\Big{|}\frac{1-\epsilon_{i}}{d^{in}_{i}}\Big{|}A_{ij}^{out}+\Big{|}\frac{\epsilon_{i}}{d^{in}_{i}}\Big{|}A_{ij}^{in}
\leq\max_{i}\Big{\{}\frac{1-\epsilon_{i}}{d^{in}_{i}}\sum_{j}A^{out}_{ij}+\frac{\epsilon_{i}}{d^{in}_{i}}\sum_{j}A^{in}_{ij}\Big{\}}
=\max_{i}\Big{\{}\frac{1-\epsilon_{i}}{d^{in}_{i}}d^{out}_{i}+\frac{\epsilon_{i}}{d^{in}_{i}}d^{in}_{i}\Big{\}}=2\max_{i}\epsilon_{i}=2\epsilon_{\max}.

This completes the proof. ∎

Next, we establish the following result.

Lemma 3.4.

$\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}\mathbf{1}_{\Omega}\|\geq\frac{\sqrt{|\Omega\setminus C_{1}|}}{2}$ almost surely.

Proof.

Note that $\|(L^{in}_{\Omega\setminus T})^{\top}(L^{in}\mathbf{1}_{\Omega})\|=\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}\mathbf{1}\|$, so we want to estimate $\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}\mathbf{1}\|$. Similar to the computation in Lemma 3.2, for each $i\in C_{1}\setminus T$ we have $s_{ii}=1+\frac{1}{d^{in}_{i}}$, $\sum_{j\in C_{1}}s_{ij}=0$, and $\sum_{j\in\Omega\setminus C_{1}}s_{ij}=0$. For each $i\in C_{k}\cap(\Omega\setminus C_{1})$, $k\geq 2$, we have $s_{ii}=1+\frac{1}{d^{in}_{i}}$, $\sum_{j\in C_{1}}s_{ij}=0$, and $\sum_{j\in C_{k}\cap(\Omega\setminus C_{1}),j\neq i}s_{ij}\to\frac{n_{1}}{4}\cdot(-\frac{2}{n_{k}}+\frac{1}{n_{k}^{2}})\geq-\frac{1}{2}$ almost surely. Therefore, the row sum of $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}$ for each row $i\in C_{1}\setminus T$ equals zero, and the row sum of $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}$ for each row $i\in\Omega\setminus C_{1}$ is larger than $\frac{1}{2}$ almost surely. Hence $\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}\mathbf{1}\|\geq\frac{\sqrt{|\Omega\setminus C_{1}|}}{2}$ almost surely. ∎

Now let us use the previous lemmas to establish that the difference between the perturbed and unperturbed solutions is small, of order $\epsilon_{\max}$.

Theorem 3.1.

Under the same assumptions as Lemma 3.2, let $\mathbf{x}^{\#}$ be the solution to the perturbed problem (8), and let $\mathbf{x}^{*}=\mathbf{1}_{C_{1}^{\mathsf{c}}}\in\mathbb{R}^{|\Omega|-|T|}$ be the solution to the unperturbed problem (7). Then

\frac{\|\mathbf{x}^{\#}-\mathbf{x}^{*}\|}{\|\mathbf{x}^{*}\|}=O(\epsilon_{\max})

almost surely for large $n_{1}$.

Proof.

Let $B=(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$, $\tilde{B}=(L_{\Omega\setminus T})^{\top}L_{\Omega\setminus T}$, $\mathbf{y}=L^{in}\mathbf{1}_{\Omega}$, and $\tilde{\mathbf{y}}=L\mathbf{1}_{\Omega}$. We will apply Lemma 3.1 with $B$, $\tilde{B}$, $\mathbf{y}$, $\tilde{\mathbf{y}}$.

First, by Lemma 3.3, we have $\|M\|\leq 2\epsilon_{\max}$. Therefore

\|\tilde{B}-B\|=\big{\|}(L_{\Omega\setminus T})^{\top}L_{\Omega\setminus T}-(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{\|}
=\big{\|}(L^{in}_{\Omega\setminus T})^{\top}M_{\Omega\setminus T}+M_{\Omega\setminus T}^{\top}L^{in}_{\Omega\setminus T}+M_{\Omega\setminus T}^{\top}M_{\Omega\setminus T}\big{\|}
\leq\big{\|}(L^{in}_{\Omega\setminus T})^{\top}M_{\Omega\setminus T}\big{\|}+\big{\|}M_{\Omega\setminus T}^{\top}L^{in}_{\Omega\setminus T}\big{\|}+\big{\|}M_{\Omega\setminus T}^{\top}M_{\Omega\setminus T}\big{\|}
\leq\big{(}2\|L^{in}_{\Omega\setminus T}\|+\|M_{\Omega\setminus T}\|\big{)}\cdot\|M_{\Omega\setminus T}\|
\leq\big{(}2\|L^{in}_{\Omega\setminus T}\|+\|M\|\big{)}\cdot\|M\|
\leq 4\epsilon_{\max}\cdot\big{(}\|L^{in}_{\Omega\setminus T}\|+\epsilon_{\max}\big{)}.

For each $i\in\Omega\setminus T$, we have $\|L_{i}\|\geq 1$, and $\sigma_{\max}((L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T})=\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\|=\sigma_{\max}^{2}(L^{in}_{\Omega\setminus T})=\|L^{in}_{\Omega\setminus T}\|^{2}\geq\max_{i\in\Omega\setminus T}\|L_{i}\|^{2}\geq 1$. Hence

\frac{\big{\|}(L_{\Omega\setminus T})^{\top}L_{\Omega\setminus T}-(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{\|}}{\big{\|}(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{\|}}\leq\frac{\big{(}2\|L^{in}_{\Omega\setminus T}\|+\|M\|\big{)}\cdot\|M\|}{\|L^{in}_{\Omega\setminus T}\|^{2}} (11)
\leq\frac{4\epsilon_{\max}}{\|L^{in}_{\Omega\setminus T}\|}+\frac{4\epsilon_{\max}^{2}}{\|L^{in}_{\Omega\setminus T}\|^{2}} (12)
\leq 4(\epsilon_{\max}+\epsilon_{\max}^{2}). (13)

We also have

\|\tilde{\mathbf{y}}-\mathbf{y}\|=\|(L_{\Omega\setminus T})^{\top}(L\mathbf{1}_{\Omega})-(L^{in}_{\Omega\setminus T})^{\top}(L^{in}\mathbf{1}_{\Omega})\|
=\|(L^{in}_{\Omega\setminus T}+M_{\Omega\setminus T})^{\top}(L_{\Omega}\mathbf{1}_{\Omega})-(L^{in}_{\Omega\setminus T})^{\top}(L^{in}_{\Omega}\mathbf{1}_{\Omega})\|
=\|\big{(}(L^{in}_{\Omega\setminus T})^{\top}M_{\Omega}+M_{\Omega\setminus T}^{\top}L^{in}_{\Omega}+M_{\Omega\setminus T}^{\top}M_{\Omega}\big{)}\cdot\mathbf{1}_{\Omega}\|
\leq\sqrt{|\Omega|}\cdot\big{(}\|(L^{in}_{\Omega\setminus T})^{\top}M_{\Omega}\|+\|M_{\Omega\setminus T}^{\top}L^{in}_{\Omega}\|+\|M_{\Omega\setminus T}^{\top}M_{\Omega}\|\big{)}
\leq\sqrt{|\Omega|}\cdot\big{(}2\|L^{in}_{\Omega}\|+\|M_{\Omega}\|\big{)}\cdot\|M_{\Omega}\|
\leq 4\sqrt{|\Omega|}\cdot\big{(}\|L^{in}_{\Omega}\|+\epsilon_{\max}\big{)}\cdot\epsilon_{\max}.

Next, by Lemma 3.4, $\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}\mathbf{1}_{\Omega}\|\geq\frac{\sqrt{|\Omega\setminus C_{1}|}}{2}$ almost surely. Therefore

\frac{\|(L_{\Omega\setminus T})^{\top}L\mathbf{1}_{\Omega}-(L^{in}_{\Omega\setminus T})^{\top}L^{in}\mathbf{1}_{\Omega}\|}{\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}\mathbf{1}_{\Omega}\|}\leq\frac{4\sqrt{|\Omega|}\cdot\big{(}\|L^{in}_{\Omega}\|+\epsilon_{\max}\big{)}\cdot\epsilon_{\max}}{\sqrt{|\Omega\setminus C_{1}|}/2}
\leq 8\sqrt{5}\epsilon_{\max}\cdot\big{(}\|L^{in}_{\Omega}\|+\epsilon_{\max}\big{)}
\leq 8\sqrt{5}\epsilon_{\max}\cdot\big{(}\sqrt{2}+\epsilon_{\max}\big{)}
=8\sqrt{10}\epsilon_{\max}+8\sqrt{5}\epsilon_{\max}^{2}.

The second inequality holds since $|\Omega|\geq\lceil\frac{5n_{1}}{4}\rceil$. The third inequality holds since $\sigma_{\max}((L^{in}_{\Omega})^{\top}L^{in}_{\Omega})\leq 2$, which follows from the same reasoning as in Lemma 3.2 using Gershgorin's circle theorem; consequently, $\|L^{in}_{\Omega}\|\leq\sqrt{2}$. Now putting Lemma 3.2 and Lemma 3.1 together with $B=(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$, $\tilde{B}=(L_{\Omega\setminus T})^{\top}L_{\Omega\setminus T}$, $\mathbf{y}=L^{in}\mathbf{1}_{\Omega}$, $\tilde{\mathbf{y}}=L\mathbf{1}_{\Omega}$, we have

\frac{\|\mathbf{x}^{\#}-\mathbf{x}^{*}\|}{\|\mathbf{x}^{*}\|}\leq\frac{\mathrm{cond}\big{(}(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{)}\cdot\big{(}4\epsilon_{\max}+4\epsilon_{\max}^{2}+8\sqrt{10}\epsilon_{\max}+8\sqrt{5}\epsilon_{\max}^{2}\big{)}}{1-\mathrm{cond}\big{(}(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{)}\cdot\big{(}4\epsilon_{\max}+4\epsilon_{\max}^{2}\big{)}}
\leq\frac{16\big{(}(1+2\sqrt{10})\epsilon_{\max}+(1+2\sqrt{5})\epsilon_{\max}^{2}\big{)}}{1-16\epsilon_{\max}(1+\epsilon_{\max})}=O(\epsilon_{\max}). ∎

Next, we estimate the size of the symmetric difference between the output $C_{1}^{\#}$ and the true cluster $C_{1}$ relative to the size of $C_{1}$, where the symmetric difference is defined as $C_{1}^{\#}\triangle C_{1}:=(C_{1}^{\#}\setminus C_{1})\cup(C_{1}\setminus C_{1}^{\#})$. Let us state another lemma before we establish the result.

Lemma 3.5.

Let $T\subset[n]$, $\mathbf{v}\in\mathbb{R}^{n}$, and $W^{\#}=\{i:\mathbf{v}_{i}>R\}$. Suppose $\|\mathbf{1}_{T}-\mathbf{v}\|\leq D$; then $|T\triangle W^{\#}|\leq\frac{D^{2}}{\min\{R^{2},(1-R)^{2}\}}$.

Proof.

Let $U^{\#}=[n]\setminus W^{\#}$ and write $\mathbf{v}=\mathbf{v}_{U^{\#}}+\mathbf{v}_{W^{\#}}$, where $\mathbf{v}_{U^{\#}}$ and $\mathbf{v}_{W^{\#}}$ are the parts of $\mathbf{v}$ supported on $U^{\#}$ and $W^{\#}$. Then we can write

\|\mathbf{1}_{T}-\mathbf{v}\|^{2}=\|\mathbf{1}_{T\cap W^{\#}}-(\mathbf{v}_{W^{\#}})_{T\cap W^{\#}}\|^{2}+\|(\mathbf{v}_{W^{\#}})_{W^{\#}\setminus T}\|^{2}+\|\mathbf{1}_{T\setminus W^{\#}}-\mathbf{v}_{U^{\#}}\|^{2}.

Note that $\|(\mathbf{v}_{W^{\#}})_{W^{\#}\setminus T}\|^{2}\geq R^{2}\cdot|W^{\#}\setminus T|$ and $\|\mathbf{1}_{T\setminus W^{\#}}-\mathbf{v}_{U^{\#}}\|^{2}\geq(1-R)^{2}\cdot|T\setminus W^{\#}|$. We have

\|\mathbf{1}_{T}-\mathbf{v}\|^{2}\geq\|(\mathbf{v}_{W^{\#}})_{W^{\#}\setminus T}\|^{2}+\|\mathbf{1}_{T\setminus W^{\#}}-\mathbf{v}_{U^{\#}}\|^{2}
\geq R^{2}\cdot|W^{\#}\setminus T|+(1-R)^{2}\cdot|T\setminus W^{\#}|
\geq\min\{R^{2},(1-R)^{2}\}\cdot(|W^{\#}\setminus T|+|T\setminus W^{\#}|)
=\min\{R^{2},(1-R)^{2}\}\cdot|T\triangle W^{\#}|.

Therefore $|T\triangle W^{\#}|\leq\frac{\|\mathbf{1}_{T}-\mathbf{v}\|^{2}}{\min\{R^{2},(1-R)^{2}\}}\leq\frac{D^{2}}{\min\{R^{2},(1-R)^{2}\}}$ as desired. ∎

Theorem 3.2.

Under the same assumptions as Theorem 3.1, we have

\frac{|C^{\#}_{1}\triangle C_{1}|}{|C_{1}|}\leq O(\epsilon^{2}_{\max}).

In other words, the error rate of recovering $C_{1}$ is at most a constant multiple of $\epsilon^{2}_{\max}$.

Proof.

From Theorem 3.1, we have $\|\mathbf{x}^{\#}-\mathbf{x}^{*}\|=\|\mathbf{x}^{\#}-\mathbf{1}_{\Omega\setminus C_{1}}\|\leq O(\epsilon_{\max})\cdot\|\mathbf{x}^{*}\|\leq O(\epsilon_{\max}\sqrt{n_{1}})$. By Lemma 3.5, we get $|W^{\#}\triangle(\Omega\setminus C_{1})|\leq O(\epsilon_{\max}^{2}n_{1})$. Since $C_{1}^{\#}=\Omega\setminus W^{\#}$, it follows that $|C_{1}^{\#}\triangle C_{1}|\leq O(\epsilon^{2}_{\max}n_{1})$, hence $\frac{|C^{\#}_{1}\triangle C_{1}|}{|C_{1}|}=O(\epsilon_{\max}^{2})$ as desired. ∎

3.2 Random Walk Threshold

In order to apply Algorithm 1, we need a “nice” superset $\Omega$ which contains $C_{1}$. The task of this subsection is to find such a superset from the given seeds $\Gamma$. We apply a simple diffusion based random walk algorithm on $G$ to find such an $\Omega$. This leads to Algorithm 2, which is also described in [26]. However, the difference between the random walk threshold algorithm in this paper and the one in [26] is that the threshold parameter $\delta$ here is heuristically chosen to be larger than the corresponding threshold parameter in [26]. This is another advantage of our method, as it increases the chance of having $C_{1}$ entirely contained in $\Omega$; such a choice is made possible by the natural differences between our approaches. It is worthwhile to point out that there are also other, more sophisticated algorithms, such as the ones described in [5], [23] and [45], which can achieve the same goal. We avoid using these methods here as our purpose is just to implement a fast way of obtaining a set $\Omega\supset C_{1}$.

Algorithm 2 Random Walk Threshold
Input: Adjacency matrix $A$, a random walk threshold parameter $\delta\in(0,1)$, a set of seed vertices $\Gamma\subset C_{1}$, estimated size $\hat{n}_{1}\approx|C_{1}|$, and depth of random walk $t\in\mathbb{Z}^{+}$.

  1. Compute $P=AD^{-1}$ and $\mathbf{v}^{(0)}=D\mathbf{1}_{\Gamma}$.

  2. Compute $\mathbf{v}^{(t)}=P^{t}\mathbf{v}^{(0)}$.

  3. Define $\Omega={\mathcal{L}}_{(1+\delta)\hat{n}_{1}}(\mathbf{v}^{(t)})$.

Output: $\Omega=\Omega\cup\Gamma$.

The thresholding operator $\mathcal{L}_{s}(\cdot)$ is defined as

\mathcal{L}_{s}(\mathbf{v}):=\{i\in[n]:v_{i}\ \text{is among the $s$ largest entries of}\ \mathbf{v}\}.
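A minimal Python sketch of Algorithm 2, with the thresholding operator implemented by sorting; the default values of `delta` and `t` are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import diags

def random_walk_threshold(A, Gamma, n1_hat, delta=0.6, t=3):
    """Sketch of Algorithm 2. A: sparse adjacency matrix, Gamma: seed indices,
    n1_hat: estimated size of the target cluster."""
    Gamma = np.asarray(Gamma)
    d = np.asarray(A.sum(axis=1)).ravel()
    P = A @ diags(1.0 / d)                        # P = A D^{-1}
    v = np.zeros(A.shape[0])
    v[Gamma] = d[Gamma]                           # v^(0) = D 1_Gamma
    for _ in range(t):
        v = P @ v                                 # v^(t) = P^t v^(0)
    s = int(np.ceil((1 + delta) * n1_hat))
    Omega = np.argsort(-v)[:s]                    # thresholding operator L_s(v^(t))
    return np.union1d(Omega, Gamma)
```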

The motivation for Algorithm 2 is the following intuitive observation. Suppose we are given seed vertices $\Gamma\subset C_{1}$. Starting from $\Gamma$, since the edges within each cluster are denser than those between different clusters, the probability of staying within $C_{1}$ is much higher than that of entering another cluster $C_{i}$, $i\neq 1$, within a small number of steps. Therefore, by performing a random walk up to a certain depth $t$, e.g., $t=3$, we obtain a well approximated set $\Omega$ such that $C_{1}$ is almost surely contained in $\Omega$. Let us make this more precise in Theorem 3.3.

Theorem 3.3.

Assume $|\Gamma|=O(1)$ and $t=O(1)$ in Algorithm 2. Then the probability $\mathbb{P}\big{(}C_{1}\subset\Omega\big{)}\geq\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t)}_{j}=\|\mathbf{v}^{(t)}\|_{1}\big{)}\geq 1-O(\epsilon_{\max})$. In other words, the probability that the $t$-step random walk started from the seed vertices $\Gamma$ leaves $C_{1}$ is at most a constant multiple of $\epsilon_{\max}$.

Proof.

Let us first consider the case $|\Gamma|=1$ and suppose $\Gamma=\{s\}$. Then we have $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(0)}_{j}=\|\mathbf{v}^{(0)}\|_{1}\big{)}=\mathbb{P}\big{(}\mathbf{v}^{(0)}_{s}=\|\mathbf{v}^{(0)}\|_{1}\big{)}=1$. It is also easy to see that $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(1)}_{j}=\|\mathbf{v}^{(1)}\|_{1}\big{)}=d_{i}^{in}/d_{i}=1-\epsilon_{i}\geq 1-\epsilon_{\max}$. For $t\geq 2$, we have $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t)}_{j}=\|\mathbf{v}^{(t)}\|_{1}\big{)}\geq(1-\epsilon_{\max})\cdot\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t-1)}_{j}=\|\mathbf{v}^{(t-1)}\|_{1}\big{)}$. So, assuming $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t-1)}_{j}=\|\mathbf{v}^{(t-1)}\|_{1}\big{)}\geq(1-\epsilon_{\max})^{t-1}\geq 1-(t-1)\epsilon_{\max}$, we have $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t)}_{j}=\|\mathbf{v}^{(t)}\|_{1}\big{)}\geq(1-\epsilon_{\max})^{t}\geq 1-t\epsilon_{\max}=1-O(\epsilon_{\max})$.

Suppose now $|\Gamma|>1$. We can apply the above argument to each individual vertex in $\Gamma$, where the random walks starting from different seeds are considered independently; therefore $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t)}_{j}=\|\mathbf{v}^{(t)}\|_{1}\big{)}\geq(1-t\epsilon_{\max})^{|\Gamma|}\geq 1-t\epsilon_{\max}|\Gamma|=1-O(\epsilon_{\max})$. ∎

Remark 3.5.

It is worthwhile to note that we do not want $t$ to be too large. One reason is that Theorem 3.3 tells us the probability of staying within the target cluster $C_{1}$ decreases as $t$ increases. An alternative interpretation is that we can treat our graph $G$, assumed connected, as a time homogeneous finite state Markov chain whose transition probabilities between adjacent vertices are determined by the vertex degrees. Since $G$ is connected, the chain is irreducible and aperiodic. By the fundamental theorem of Markov chains, the limiting distribution does not depend on the seed set $\Gamma$.

Meanwhile, we do not want $t$ to be too small either; otherwise the random walk will not be able to explore all the reachable vertices. There is also a trade-off between the size of $\Gamma$ and the random walk depth $t$: a smaller $\Gamma$ usually requires a larger $t$ in order to fully explore the target cluster.

3.3 Local Cluster Extraction

Let us now combine the previous two subroutines into Algorithm 3. In practice, we may want to vary the number of iterations $MaxIter$ based on the number of examples in the data set in order to achieve better performance. For the purpose of theoretical analysis, let us fix $MaxIter=1$.

Algorithm 3 Least Squares Clustering (LSC)
Input: Adjacency matrix $A$, a random walk threshold parameter $\delta\in(0,1)$, a set of seed vertices $\Gamma\subset C_{1}$, estimated size $\hat{n}_{1}\approx|C_{1}|$, depth of random walk $t\in\mathbb{Z}^{+}$, least squares parameter $\gamma\in(0,0.8)$, and rejection parameter $R\in[0,1)$.

  1. for $i=1,\cdots,MaxIter$

  2.   $\Omega\longleftarrow$ Random Walk Threshold($A$, $\Gamma$, $\hat{n}_{1}$, $\delta$, $t$).

  3.   $\Gamma\longleftarrow$ Least Squares Cluster Pursuit($A$, $\Omega$, $R$, $\gamma$).

  4. end

  5. Let $C_{1}^{\#}=\Gamma$.

Output: $C_{1}^{\#}$.
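A short sketch of Algorithm 3, which simply alternates the two helpers sketched earlier (random_walk_threshold and least_squares_cluster_pursuit are the hypothetical names introduced in the earlier code sketches).

```python
import numpy as np

def least_squares_clustering(A, Gamma, n1_hat, delta, t, gamma, R, max_iter=1):
    """Sketch of Algorithm 3 (LSC)."""
    seeds = np.asarray(Gamma)
    for _ in range(max_iter):
        Omega = random_walk_threshold(A, seeds, n1_hat, delta, t)
        seeds = least_squares_cluster_pursuit(A, Omega, gamma, R)
    return seeds                                   # C_1^#
```
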
Remark 3.6.

The hyperparameter $MaxIter$ in the algorithm is usually chosen based on the size of the initial seed set $\Gamma$ relative to $n$; we do not have a formal way of choosing the best $MaxIter$ and instead choose it heuristically. In practice, we find that $MaxIter\leq 3$ works very well most of the time.

The analysis in the previous two subsections shows that the difference between the true cluster $C_{1}$ and the estimate $C_{1}^{\#}$ is relatively small compared to the size of $C_{1}$. This can be stated more formally using asymptotic notation.

Theorem 3.4.

Suppose $\epsilon_{\max}=o(1)$ and $MaxIter=1$. Then, under the assumptions of Theorems 3.2 and 3.3, we have $\mathbb{P}\Big{(}\frac{|C^{\#}_{1}\triangle C_{1}|}{|C_{1}|}\leq o(1)\Big{)}=1-o(1)$.

Proof.

By Theorem 3.3, the probability that $\Omega\supset C_{1}$ after performing Algorithm 2 is $1-O(\epsilon_{\max})=1-o(1)$. By Theorem 3.2, the error rate after performing Algorithm 1 is at most a constant multiple of $\epsilon^{2}_{\max}$. Putting these together, we have $\mathbb{P}\Big{(}\frac{|C^{\#}_{1}\triangle C_{1}|}{|C_{1}|}\leq o(1)\Big{)}=1-o(1)$. ∎

3.4 From Local to Global

We can go one step further by applying Algorithm 3 iteratively on the entire graph to extract all the underlying clusters. That is, we remove $C_{i}^{\#}$ each time Algorithm 3 finds it, and update the graph $G$ by successively removing the subgraph spanned by the vertices in $C_{i}^{\#}$. This leads to Algorithm 4. We will not analyze the theoretical guarantees of this iterative version of the algorithm, but rather provide numerical examples in a later section to show its effectiveness and efficiency.

Algorithm 4 Iterative Least Squares Clustering (ILSC)
Input: Adjacency matrix $A$, random walk threshold parameter $\delta\in(0,1)$, least squares parameter $\gamma\in(0,0.8)$, rejection parameter $R\in[0,1)$, depth of random walk $t\in\mathbb{Z}^{+}$, seed vertices $\Gamma_{i}\subset C_{i}$ and estimated sizes $\hat{n}_{i}\approx|C_{i}|$ for $i=1,\cdots,k$.

  1. for $i=1,\cdots,k$

  2.   Let $C_{i}^{\#}$ be the output of Algorithm 3 applied to $G$ with seeds $\Gamma_{i}$ and size estimate $\hat{n}_{i}$.

  3.   Let $G^{(i)}$ be the subgraph spanned by $C_{i}^{\#}$.

  4.   Update $G\leftarrow G\setminus G^{(i)}$.

  5. end

Output: $C_{1}^{\#},\cdots,C_{k}^{\#}$.
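A sketch of Algorithm 4 built on the Algorithm 3 sketch above; the index bookkeeping assumes a SciPy CSR adjacency matrix and that the seeds of later clusters are not removed by earlier extractions. It is illustrative only.

```python
import numpy as np

def iterative_lsc(A, seed_sets, size_estimates, delta, t, gamma, R):
    """Sketch of Algorithm 4 (ILSC): extract one cluster at a time, then remove it."""
    remaining = np.arange(A.shape[0])              # global ids of vertices still in G
    clusters = []
    for Gamma_i, n_i in zip(seed_sets, size_estimates):
        # positions of this cluster's seeds inside the reduced graph
        local_seeds = np.flatnonzero(np.isin(remaining, Gamma_i))
        A_sub = A[remaining, :][:, remaining]
        C_local = least_squares_clustering(A_sub, local_seeds, n_i,
                                           delta, t, gamma, R)
        C_global = remaining[np.asarray(C_local, dtype=int)]
        clusters.append(C_global)
        remaining = np.setdiff1d(remaining, C_global)   # update G <- G \ G^(i)
    return clusters
```
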
Remark 3.7.

It is worth noting that Algorithm 4 extracts one cluster at a time, which differs from most other global unsupervised clustering algorithms. In practice, global clustering methods can have impractically high run time [33] or be tricky to implement [2]. In contrast, our method requires much less computational time and can be implemented easily. In addition, the “one cluster at a time” feature of our method provides more flexibility for problems under certain circumstances.

4 Computational Complexity

In this section, let us discuss the run time of the algorithms introduced previously.

Theorem 4.1.

Algorithm 2 requires $O(nd_{\max}t+n\log(n))$ operations, where $t$ is the depth of the random walk.

Proof.

Notice that if $A$, $D$, $P$ are stored as sparse matrices, then each of the $t$ steps in the second stage of Algorithm 2 requires $O(nd_{\max})$ operations, where $d_{\max}$ is the maximal degree among all the vertices. Therefore the algorithm requires $O(nd_{\max}t+n\log(n))$ operations, where the $O(n\log(n))$ part comes from the sorting in the third step. In practice, the random walk depth $t$ is $O(1)$ with respect to the graph size $n$, so we have $O(nd_{\max}+n\log(n))$. ∎

Theorem 4.2.

Algorithm 1 requires $O(nd_{\max}+n\log(n))$ operations.

Proof.

The first step of Algorithm 1 requires $O(nd_{\max})$ operations, and the second step requires $O(nd_{\max}+n\log(n))$, where the $O(nd_{\max})$ part comes from matrix-vector multiplication and the $O(n\log(n))$ part comes from sorting. For the third step, to avoid solving the normal equations exactly for large scale problems, we recommend using an iterative method, for example conjugate gradient descent (we use MATLAB's lsqr in our implementation). As we have shown, the matrices involved have well behaved condition numbers, so only a constant number of iterations is required to obtain a well approximated least squares solution to problem (8). Since the cost of each iteration of conjugate gradient descent equals a few matrix-vector multiplications, which is $O(nd_{\max})$, the total cost of Algorithm 1 is $O(nd_{\max}+n\log(n))$. ∎

As a consequence, the total run time of Algorithm 3 is $O(nd_{\max}+n\log(n))$. If the number of clusters is $k=O(1)$, then Algorithm 4 also runs in $O(nd_{\max}+n\log(n))$.

Remark 4.1.

The computational scheme of our method follows a framework similar to CP+RWT in [26]. However, one of the differences between the two approaches is that we apply lsqr to solve the least squares problem (8) directly, whereas CP+RWT applies $O(\log n)$ iterations of a subspace pursuit algorithm to solve (8), and each subspace pursuit iteration uses lsqr as a subroutine. So essentially, our proposed method is $O(\log n)$ times cheaper than CP+RWT. We can also see this difference by comparing the run times in our numerical experiments in the next section.

5 Numerical Experiments

In this section, we evaluate the performance of our algorithms on the synthetic symmetric stochastic block model (SSBM), the political blogs network data [3], the AT&T Database of Faces (https://git-disl.github.io/GTDLBench/datasets/att_face_dataset/), the Extended Yale Face Database B (YaleB) (http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html), and the MNIST data (http://yann.lecun.com/exdb/mnist/).

For single cluster extraction tasks, we consider the diffusion based methods with a possible refinement procedure, namely CP+RWT [26], HK-Grow [23], PPR [5], and LBSA [36], as our baseline methods. For multi-cluster extraction tasks, we consider ICP+RWT [26], LapRF and TVRF [49], Multi-class MBO Auction Dynamics [21], AtlasRBF [37], Pseudo-label [27], DGN [22] and Ladder Networks [39] as the baseline methods. The standard spectral clustering algorithm [35] is also applied in some of the experiments. For the implementation of Algorithms 3 and 4, we use MATLAB's lsqr function as our iterative least squares solver for equation (8). We tune the rejection parameter $R$ of each algorithm appropriately so that the outputs $C^{\#}_{i}$ of the different algorithms are approximately the same size, for comparison purposes. For the evaluation metrics of our experiments, we use the Jaccard index for the symmetric stochastic block model, the F1 score for the human face data, and classification accuracy for the political blogs data and MNIST data. More implementation details are summarized in the supplementary material.
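For reference, the evaluation metrics for a recovered vertex set can be computed as follows; this is a generic sketch, and the exact averaging protocol used in the experiments may differ.

```python
def jaccard_index(C_hat, C_true):
    """Jaccard index |C_hat ∩ C_true| / |C_hat ∪ C_true| of two vertex sets."""
    C_hat, C_true = set(C_hat), set(C_true)
    return len(C_hat & C_true) / len(C_hat | C_true)

def f1_score_sets(C_hat, C_true):
    """F1 = 2PR/(P+R), with precision P and recall R of the recovered set."""
    C_hat, C_true = set(C_hat), set(C_true)
    tp = len(C_hat & C_true)
    precision, recall = tp / len(C_hat), tp / len(C_true)
    return 2 * precision * recall / (precision + recall)
```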

5.1 Stochastic Block Model Data

We first test Algorithm 3 on $SSBM(n,k,p,q)$ with different choices of parameters. The parameter $n$ indicates the total number of vertices, $k$ indicates the number of clusters, $p$ is the probability of having an edge between any two vertices within each cluster, and $q$ is the probability of having an edge between any two vertices from different clusters. The left panel of Figure 1 shows such a synthetic random graph model with three underlying clusters, and the right panel illustrates the adjacency matrix of a random graph generated from the symmetric stochastic block model with three underlying clusters. In our experiments, we fix $k=3$ and choose $n=600,1200,1800,2400,3000$ respectively. The connection probabilities are chosen as $p=\frac{8\log n}{n}$ and $q=\frac{\log n}{n}$. By choosing the parameters carefully, we obtain the Jaccard index and the logarithm of the running time of each method shown in Figure 2. We also ran the experiments on the stochastic block model in the non-symmetric case and obtained similar gaps in accuracy and run time. For the implementation on the symmetric stochastic block model, we use three vertices with given labels as our seeds, and we focus on recovering only the target cluster, say $C_{1}$. The experiments are performed with 500 repetitions.
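A minimal sketch of sampling an adjacency matrix from $SSBM(n,k,p,q)$ (assuming $n$ is divisible by $k$), with the parameter choices of this subsection shown as an example.

```python
import numpy as np
from scipy.sparse import csr_matrix

def ssbm(n, k, p, q, seed=None):
    """Sample an adjacency matrix from SSBM(n, k, p, q): k equal-size clusters,
    intra-cluster edge probability p, inter-cluster edge probability q."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), n // k)
    probs = np.where(labels[:, None] == labels[None, :], p, q)
    upper = np.triu(rng.random((n, n)) < probs, 1)     # sample each edge once
    A = (upper | upper.T).astype(float)
    return csr_matrix(A), labels

# e.g. the setting used here: n = 600, k = 3, p = 8 log(n)/n, q = log(n)/n
A, labels = ssbm(600, 3, 8 * np.log(600) / 600, np.log(600) / 600, seed=1)
```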

Figure 1: Left: A Random SSBM Graph with Three Underlying Clusters. Right: Adjacency Matrix of A Random Graph Generated From SSBM.
Figure 2: Left: Average Jaccard Index. Right: Logarithm of the Average Run Time.
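For concreteness, a minimal MATLAB sketch of sampling an SSBM(n,k,p,q) adjacency matrix as described above is shown below; it is an illustrative helper under the assumption that k divides n, not the code used for our experiments.

% Minimal sketch: sample the adjacency matrix of SSBM(n,k,p,q), assuming
% k divides n. Illustrative only; not the exact experiment code.
n = 600;  k = 3;
p = 8*log(n)/n;  q = log(n)/n;
labels = repelem(1:k, n/k);             % ground-truth cluster assignment
sameCluster = (labels' == labels);      % n-by-n within-cluster mask
P = q*ones(n) + (p - q)*sameCluster;    % edgewise connection probabilities
U = triu(rand(n) < P, 1);               % sample upper triangle, no self-loops
A = U + U';                             % symmetric 0/1 adjacency matrix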

5.2 Political Blogs Network Data

Next we test Algorithm 3 on the data from “The political blogosphere and the 2004 US Election” [3], which contains a list of political blogs classified as liberal or conservative, together with the links between the blogs. See Figure 3 for an illustration of the community structure (figure source: [3]).

Figure 3: Community structure of political blogs. The colors reflect political orientation, red for conservative, and blue for liberal. Orange links go from liberal to conservative, and purple ones from conservative to liberal. The size of each blog reflects the number of other blogs that link to it [3].

As explained by Abbe and Sandon in [2], their simplified algorithm gave a reasonably good classification in 37 out of 40 trials. Each of these trials classified all but 56 to 67 of the 1222 vertices in the main component of the graph correctly. According to [2], the state-of-the-art method described in [51] before the work in [2] achieves a lowest error count of around 60, while regularized spectral algorithms such as the one in [38] make about 80 errors. In our experiments, given three labeled seeds, Algorithm 3 succeeds in 35 out of 40 trials. Among these 35 trials, the average number of misclassified nodes in the main component of the graph is 55, which compares slightly favorably with the state-of-the-art result. We also tested CP+RWT on this dataset and found its results were not very satisfactory.

5.3 AT&T Database of Faces

The AT&T Database of Faces contains grayscale images of 40 different people, each image of pixel size 92\times 112. Images of each person are taken under 10 different conditions, varying three aspects: the facial perspective, the lighting conditions, and the facial expression.

Figure 4: Left: Randomly Permuted AT&T Faces. Right: Desired Recovery of all Faces into Correct Clusters.

We use part of this data set by randomly selecting 10 people, each with 10 images. We randomly permute the images as shown on the left side of Figure 4 and compute the adjacency matrix A based on the preprocessing method discussed in Appendix B. Then we iteratively apply Algorithm 3 to recover all the 10 images belonging to the same individual. The desired arrangement of the individual images after iteratively performing Algorithm 3 is shown on the right side of Figure 4. More details of the implementation regarding hyperparameter tuning are summarized in Appendix A. The performance of our algorithm compared with CP+RWT and spectral clustering (SC) is summarized in Table 2 under 500 repetitions. Note that spectral clustering is unsupervised, hence its accuracy is not affected by the percentage of labeled data.

Table 2: Average F1 Scores of Recovering All Clusters for AT&T Data.
Labeled Data %    10       20       30
LSC               96.5%    97.5%    98.2%
CP+RWT [26]       92.2%    95.7%    97.1%
SC [35]           95.8%    95.8%    95.8%

5.4 Extended Yale Face Database B (YaleB)

The YaleB dataset contains 16128 grayscale images of 28 human subjects under 9 poses and 64 illumination conditions. We use part of this data set by randomly selecting 20 images of each person after the preprocessing described in Appendix B. The images are randomly permuted and we aim to recover all the clusters by iteratively performing Algorithm 3. Figure 5 shows this dataset with the randomly permuted images on the left side and the desired clustering result on the right side. Figure 6 enlarges the small regions marked with red boxes in the pictures of Figure 5. The performance of our algorithm compared with CP+RWT and spectral clustering (SC) is summarized in Table 3 under 500 repetitions.

Table 3: Average F1 Scores of Recovering All Clusters for YaleB Data.
Labeled Data %    5        10       20
LSC               92.1%    96.0%    96.1%
CP+RWT [26]       89.2%    93.7%    93.9%
SC [35]           93.8%    93.8%    93.8%
Figure 5: Left: Randomly Permuted YaleB Faces. Right: Desired Recovery of all Faces into Correct Clusters.
Figure 6: Enlarged YaleB Human Faces.

5.5 MNIST Data

We also test Algorithm 4 on the MNIST data, a well-known machine learning benchmark for classification that consists of 70000 grayscale images of the handwritten digits 0-9 of size 28\times 28, with approximately 7000 images of each digit. We use a certain percentage of vertices with given labels within each of the ten clusters as our seed vertices; the performance of ILSC and ICP+RWT is summarized in Table 4 under 100 repetitions.

Table 4: Average Accuracy of Recovering All Clusters for MNIST Data (Time is measured in seconds).
Labeled Data %    ILSC      Run Time    ICP+RWT [26]    Run Time
0.5               97.30%    15.5        96.41%          18.1
1                 97.73%    15.3        97.32%          19.1
1.5               98.03%    15.4        97.44%          19.8
2                 98.17%    15.5        97.52%          21.4
2.5               98.27%    15.4        97.50%          22.1

We also compare ILSC with several other state-of-the-art semi-supervised methods on MNIST. As shown in Table 5, ILSC outperforms the other algorithms except for Ladder Networks, which uses more label information and involves a deep neural network architecture that requires several hours of training on GPUs.

Table 5: Accuracy of ILSC and other Semi-supervised Algorithms on MNIST.
Methods                                        # Labeled Data    Accuracy
LapRF [49]                                     600               95.6%
TVRF [49]                                      600               96.8%
ICP+RWT [26]                                   700               97.3%
Multi-class MBO with Auction Dynamics [21]     700               97.4%
ILSC (this paper)                              700               97.7%
AtlasRBF [37]                                  1000              96.4%
Pseudo-label [27]                              1000              96.6%
DGN [22]                                       1000              97.6%
ILSC (this paper)                              1000              98.0%
Ladder Networks [39]                           1000              99.2%

References

  • [1] E. Abbe, Community Detection and Stochastic Block Models: Recent Developments, Journal of Machine Learning Research. vol.18, no.177, pp.1-86, 2018
  • [2] E. Abbe and C. Sandon, Recovering communities in the general stochastic block model without knowing the parameters, In Advances in Neural Information Processing Systems, 676–684, 2015
  • [3] L. A. Adamic and N. Glance, The political blogosphere and the 2004 US election: Divided they blog, In Proceedings of the 3rd International Workshop on Link Discovery, 36-43, 2005.
  • [4] D. J. Aldous, Random walks on finite groups and rapidly mixing Markov chains, In Seminaire de Probabilites XVII, pages 243–297. Springer Verlag, 1983. Lecture Notes in Mathematics 986.
  • [5] R. Andersen, F. Chung, and K. Lang, Using pagerank to locally partition a graph, Internet Mathematics, 4(1):35-64, 2007.
  • [6] G. Camps-Valls, T. Bandos and D. Zhou, Semisupervised graph-based hyperspectral image classification, IEEE Trans. Geosci. Remote Sensing, vol. 45, no. 10, pp. 3044-3054, 2007.
  • [7] Y. Chen, J. Z. Wang, and R. Krovetz, Clue: Cluster-based retrieval of images by unsupervised learning, IEEE Trans. Image Process. 14, 8, 1187–1201, 2005.
  • [8] F. Chung, Spectral Graph Theory, Vol. 92. American Mathematical Society., 1997.
  • [9] F. Chung and L. Lu, Complex Graphs and Networks, Vol. 107. American Mathematical Soc., 2006.
  • [10] W. Dai and O. Milenkovic, Subspace pursuit for compressive sensing signal reconstruction, IEEE Trans. Inform. Theory, vol. 55, no. 5, pp. 2230–2249, May 2009.
  • [11] S. Dhillon, Co-clustering documents and words using bipartite spectral graph partitioning, In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 269-274, 2001.
  • [12] S. Dhillon, Y. Guan, and B. Kulis, Kernel k-means: spectral clustering and normalized cuts, In Proc. of the 10th ACM SIGKDD Conference, 2004.
  • [13] C. Ding, X. He, H. Zha, M. Gu, and H. D. Simon, A min-max cut algorithm for graph partitioning and data clustering, In Proceedings of IEEE ICDM 2001, pages 107–114, 2001.
  • [14] W. Doeblin, Expose de la theorie des chaınes simples constantes de Markov a un nombre fini d’etats, Mathematique de l’Union Interbalkanique, 2:77–105, 1938
  • [15] P. Erdős and A. Rényi, On random graphs, I. Publ. Math. Debrecen 6 (1959), 290-297.
  • [16] S. Fortunato, Community detection in graphs, Physics Reports, 486 (3–5), 75–174, 2010.
  • [17] A. Georghiades, P. Belhumeur, and D. Kriegman, From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose, PAMI, 2001, Vol 23, pp 643-660.
  • [18] W. Ha, K. Fountoulakis, and M. W. Mahoney, Statistical guarantees for local graph clustering, The 23rd International Conference on Artificial Intelligence and Statistics, 2020.
  • [19] P. W. Holland, K. Laskey, and S. Leinhardt, Stochastic blockmodels: First steps, Social Networks 5 (1983), no. 2,109–137
  • [20] D. Hric, R. K. Darst, and S. Fortunato, Community detection in networks: Structural clusters versus ground truth, Physical Review E, 9, 062805, 2014
  • [21] M. Jacobs, E. Merkurjev, and S. Esedoglu, Auction Dynamics: A Volume Constrained MBO Scheme, Journal of Computational Physics, 354:288-310, 2018.
  • [22] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, Semi-supervised learning with deep generative models, In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 3581–3589, 2014.
  • [23] K. Kloster and D. F. Gleich, Heat kernel based community detection, In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 1386-1395, 2014.
  • [24] G. Kossinets and D. J. Watts, Empirical analysis of an evolving social network, Science, 311 (5757), 88–90, 2016
  • [25] B. Kulis, S. Basu, I. Dhillon, and R. Mooney, Semi-supervised graph clustering: A kernel approach, Proc. ICML-2005.
  • [26] M.-J. Lai, D. Mckenzie, Compressive Sensing Approach to Cut Improvement and Local Clustering, SIAM Journal on Mathematics of Data Science, 2(2020), 368–395.
  • [27] D.-H. Lee, Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, In Workshop on Challenges in Representation Learning, ICML 2013.
  • [28] T. Lindvall, Lectures on the coupling method, John Wiley & Sons Inc., New York, 1992.
  • [29] U. V. Luxburg, A tutorial on spectral clustering, Statistics and computing, 17(4):395– 416, 2007.
  • [30] M. W. Mahoney, L. Orecchia, and N. K. Vishnoi, A local spectral method for graphs: With applications to improving graph partitions and exploring data graphs locally, Journal of Machine Learning Research, 13(Aug):2339-2365, 2012.
  • [31] R. Mihalcea and D. Radev, Graph-based natural language processing and information retrieval, Cambridge university press, 2011.
  • [32] T. Miyato, S-i. Maeda, M. Koyama, K. Nakae, and S. Ishii, Distributional smoothing by virtual adversarial examples, arXiv:1507.00677, 2015
  • [33] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. Combinatorica 38.3 (2018): 665-708.
  • [34] D. Needell and J. A. Tropp, CoSaMP: Iterative signal recovery from incomplete and inaccurate samples, Applied and Computational Harmonic Analysis, 26(3):301-321, 2009.
  • [35] A. Y. Ng, M. I. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm, In Advances in Neural Information Processing Systems: 849-856, 2002.
  • [36] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, Locally-biased spectral approximation for community detection, Knowledge-Based Systems, 164:459–472, 2019.
  • [37] N. Pitelis, C. Russell, and L. Agapito, Semi-supervised learning using an unsupervised atlas, In Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2014), pages 565–580. Springer, 2014.
  • [38] T. Qin and K. Rohe, Regularized spectral clustering under the degree-corrected stochastic block model, Advances in Neural Information Processing Systems 26 (C.j.c. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.q. Weinberger, eds.), 2013, pp. 3120–3128.
  • [39] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, Semi-supervised Learning with Ladder Networks, In Advances in Neural Information Processing Systems: 3546-3554, 2015.
  • [40] F. Samaria and A. Harter, Parameterisation of a Stochastic Model for Human Face Identification, Proceedings of 2nd IEEE Workshop on Applications of Computer Vision, 1994.
  • [41] F. Santosa and W. Symes, Linear inversion of band-limited reflection seismograms, SIAM Journal on Scientific and Statistical Computing. SIAM. 7 (4): 1307–1330. doi:10.1137/0907087, 1986.
  • [42] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000.
  • [43] R. Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society. Series B (methodological). Wiley. 58 (1): 267–88. JSTOR 2346178, 1996
  • [44] N. Veldt, C. Klymko, and D. F. Gleich, Flow-Based Local Graph Clustering with Better Seed Set Inclusion, In Proceedings of the SIAM International Conference on Data Mining, 2019
  • [45] D. Wang, K. Fountoulakis, M. Henzinger, M. W. Mahoney, and S. Rao, Capacity releasing diffusion for speed and locality, Proceedings of the 34th International Conference on Machine Learning, 70:3598-3607, 2017.
  • [46] Y. Wu, M. Rosca, and T. Lillicrap, Deep compressed sensing. International Conference on Machine Learning, pp. 6850–6860, 2019.
  • [47] Y. Yan, Y. Bian, D. Luo, D. Lee, and X. Zhang, Constrained Local Graph Clustering by Colored Random Walk, in Proc. World Wide Web Conf., 2019, pp. 2137–2146.
  • [48] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich, Local higher-order graph clustering, In Proceedings of the 23rd ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2017, pp. 555–564.
  • [49] K. Yin, X.-C. Tai, An Effective Region Force for Some Variational Models for Learning and Clustering, Journal of Scientific Computing, 74 (2018), 175-196.
  • [50] L. Zelnik-Manor and P. Perona, Self-tuning spectral clustering, In Advances in neural information processing systems, pages 1601–1608, 2004.
  • [51] A. Y. Zhang, H. H. Zhou, C. Gao, Z. Ma, Achieving optimal misclassification proportion in stochastic block model, arXiv:1505.03772 (2015).

Declarations

Funding

The first author is supported by the Simons Foundation collaboration grant #864439.

Competing Interests

The authors declare that they have no competing interests.

Data Availability

A sample demo program for reproducing Fig. 4 in this paper can be found at https://github.com/zzzzms/LeastSquareClustering. All other demonstration codes or data are available upon request.

Appendix A Hyperparameters for Numerical Experiments

For each cluster to be recovered, we sample the seed vertices \Gamma_i uniformly from C_i in all implementations. We fix the random walk depth at t=3, and use the random walk threshold parameter \delta=0.8 for the political blogs network and \delta=0.6 for all the other experiments. We vary the rejection parameter R\in(0,1) for each specific experiment based on the estimated sizes of the clusters. In the case where neither the estimated cluster sizes nor the number of clusters is given, we may refer to the spectrum of the graph Laplacian and use the largest gap between two consecutive eigenvalues to estimate the number of clusters, and then use the average size to estimate the size of each cluster.
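As an illustration of this eigengap heuristic, the following is a minimal MATLAB sketch; the adjacency matrix A and the number m of eigenvalues inspected are placeholders, and the sketch assumes the graph has no isolated vertices.

% Minimal sketch of the eigengap heuristic: estimate the number of
% clusters from the smallest eigenvalues of the normalized graph
% Laplacian. A is a placeholder adjacency matrix with no isolated
% vertices; m is a placeholder for how many eigenvalues to inspect.
n     = size(A, 1);
d     = sum(A, 2);                          % vertex degrees
Dhalf = spdiags(1./sqrt(d), 0, n, n);       % D^{-1/2}
L     = speye(n) - Dhalf*A*Dhalf;           % symmetric normalized Laplacian
m     = 20;
ev    = sort(real(eigs(L, m, 'smallestabs')));
[~, kEst] = max(diff(ev));                  % index of largest consecutive gap
% kEst is the estimated number of clusters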

We fix the least squares threshold parameter \gamma=0.2 for all the experiments; this choice is heuristic. However, our experiments show that the performance of the algorithms is not affected much by varying \gamma\in[0.15,0.4]. The hyperparameter MaxIter is chosen according to the size of the initial seed set relative to the total number of vertices in the cluster. Purely for comparison purposes, we keep MaxIter=1 for the MNIST data. By experimenting with different choices of MaxIter, we find that the best performance for the AT&T data occurs at MaxIter=2 for 10\% seeds and MaxIter=1 for 20\% and 30\% seeds. For the YaleB data, the best performance occurs at MaxIter=2 for 5\%, 10\%, and 20\% seeds. All the numerical experiments are implemented in MATLAB and can be run on a local personal machine; for the reproducibility of our results, we provide a sample demo code at https://github.com/zzzzms/LeastSquareClustering for verification purposes.

Appendix B Image Data Preprocessing

For the YaleB human faces data, we performed some data preprocessing to remove poor-quality images. Specifically, we discarded the pictures which are too dark, and we cropped each image to size 54\times 46 to reduce the effect of background noise. From the remaining qualified pictures, we randomly selected 20 images for each person.
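A minimal MATLAB sketch of this preprocessing step is given below; the directory layout, the darkness threshold, and the centered-crop choice are illustrative assumptions rather than the exact procedure we used.

% Minimal sketch of the preprocessing described above. The folder name,
% the darkness threshold, and the centered crop are illustrative
% assumptions, not the exact settings used in our experiments.
files = dir(fullfile('YaleB', '*.pgm'));          % hypothetical image folder
kept  = {};
for t = 1:numel(files)
    I = double(imread(fullfile(files(t).folder, files(t).name))) / 255;
    if mean(I(:)) < 0.15                          % discard overly dark images
        continue;
    end
    [h, w] = size(I);
    r0 = floor((h - 54)/2);  c0 = floor((w - 46)/2);
    kept{end+1} = I(r0+1:r0+54, c0+1:c0+46);      % centered 54-by-46 crop
end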

All the image data in MNIST, AT&T, and YaleB need to be converted into an auxiliary graph before our algorithms can be applied. Let \mathbf{x}_i\in\mathbb{R}^n be the vectorization of an image from the original data set. Following [21] and [50], we define the following affinity matrix of the K-NN auxiliary graph based on the Gaussian kernel:

A_{ij}=\begin{cases}e^{-\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}/(\sigma_{i}\sigma_{j})}&\text{if }\mathbf{x}_{j}\in NN(\mathbf{x}_{i},K)\\ 0&\text{otherwise}\end{cases}

The notation NN(\mathbf{x}_i,K) denotes the set of K nearest neighbours of \mathbf{x}_i, and \sigma_i:=\|\mathbf{x}_i-\mathbf{x}^{(r)}_i\|, where \mathbf{x}^{(r)}_i is the r-th closest point to \mathbf{x}_i. Note that the above A is not necessarily symmetric, so we consider \tilde{A}=A^{T}A for symmetrization. Alternatively, one may also consider \tilde{A}_{ij}=\max\{A_{ij},A_{ji}\} or \tilde{A}_{ij}=(A_{ij}+A_{ji})/2. We use \tilde{A} as the input adjacency matrix for our algorithms.

We fix the local scaling parameters K=15, r=10 for the MNIST data, K=8, r=5 for the YaleB data, and K=5, r=3 for the AT&T data in our implementation.
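To make the construction concrete, here is a minimal MATLAB sketch of building the locally scaled K-NN affinity matrix and symmetrizing it as above. The data matrix X (one vectorized image per row) is a placeholder, pdist2 is from the Statistics and Machine Learning Toolbox, and the dense O(N^2) distance computation is for illustration only.

% Minimal sketch: locally scaled Gaussian K-NN affinity matrix and its
% symmetrization, as defined above. X is a placeholder data matrix with
% one vectorized image per row; dense distances are for illustration.
K = 15;  r = 10;                          % e.g. the MNIST setting
N = size(X, 1);
D = pdist2(X, X);                         % pairwise Euclidean distances
[Dsorted, order] = sort(D, 2);            % neighbors of each point, ascending
sigma = Dsorted(:, r + 1);                % distance to the r-th closest point
A = zeros(N);
for i = 1:N
    nbrs  = order(i, 2:K+1);              % K nearest neighbors, excluding x_i
    scale = sigma(i) * reshape(sigma(nbrs), 1, []);
    A(i, nbrs) = exp(-D(i, nbrs).^2 ./ scale);
end
Atilde = A' * A;                          % symmetrization used in the text
% alternatives: Atilde = max(A, A');  or  Atilde = (A + A')/2;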