
A Compressed Sensing Based Least Squares Approach to Semi-supervised Local Cluster Extraction

Ming-Jun Lai (Department of Mathematics, University of Georgia, Athens, GA 30602. [email protected])    Zhaiming Shen (Department of Mathematics, University of Georgia, Athens, GA 30602. [email protected])
Abstract

A least squares semi-supervised local clustering algorithm based on the idea of compressed sensing is proposed to extract clusters from a graph with known adjacency matrix. The algorithm uses a two-stage approach similar to the one in [26]. However, under a weaker assumption and with less computational complexity than the algorithm in [26], our algorithm is shown to find a desired cluster with high probability. The “one cluster at a time” feature of our method distinguishes it from other global clustering methods. Several numerical experiments are conducted on synthetic data such as the stochastic block model and on real data such as MNIST, the political blogs network, and the AT&T and YaleB human face data sets to demonstrate the effectiveness and efficiency of our algorithm.

1 Introduction

Informally speaking, graph clustering aims at dividing the set of vertices of a graph into subsets in such a way that there are more edges within each subset and fewer edges between different subsets. When analyzing a graph, one of the primary interests is to find its underlying clustered structure, as vertices in the same cluster can reasonably be assumed to share some latent similarity. For data sets which are not presented as graphs, we can create a suitable auxiliary graph such as the $K$-nearest-neighbors ($K$-NN) graph based on the given data, for example, see [42], [30] and [21]. Then we can apply graph clustering techniques on this auxiliary graph.
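To make the auxiliary-graph idea concrete, the following is a minimal sketch of building a symmetric $K$-NN adjacency matrix with Gaussian edge weights. It is only an illustrative construction (the function name `knn_graph` and the parameters `K`, `sigma` are assumptions), not the exact preprocessing used for the data sets in Section 5.

```python
import numpy as np

def knn_graph(X, K=10, sigma=1.0):
    """Build a symmetric K-NN adjacency matrix from the rows of X.

    Illustrative sketch: Gaussian edge weights and symmetrization by max."""
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(D2, np.inf)                     # no self-loops
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[:K]                 # K nearest neighbors of vertex i
        A[i, nbrs] = np.exp(-D2[i, nbrs] / (2 * sigma**2))
    return np.maximum(A, A.T)                        # undirected weighted graph
```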

The graph clustering problem has recently become prevalent in areas such as social network analysis [16], [20] and [24], image classification [6], [7] and [42], and natural language processing [11] and [31]. For example, suppose the vertices of a social network graph represent users of a social network (e.g. Facebook, LinkedIn); then the edges could represent users who are connected to each other. Sets of nodes with high inter-connectivity, which we call communities or clusters, could represent friendship groups or co-workers. By identifying those communities we can suggest new connections to users. Note that some networks are directed (e.g. Twitter, citation networks), which can make community detection more subtle. For the scope of this paper, we only focus on weighted undirected graphs.

The classical graph based clustering problem is a global clustering problem which assigns every vertex to a unique cluster, assuming there are no multi-class vertices. It is usually treated as an unsupervised learning problem which can be solved by methods such as spectral clustering [29], [35] and [50] or by finding an optimal cut of the graph [12], [13]; these approaches are generally computationally expensive and hard to implement for large data sets. It can also be done in a semi-supervised fashion, as in [25], [21] and [49]. However, sometimes one is only interested in finding a single cluster which contains the target vertices, given prior knowledge of a small portion of labels from the true cluster, which is usually attainable for real data. This type of problem is called local clustering, or local cluster extraction, which, loosely speaking, takes a set of vertices $\Gamma$ with given labels, called seeds, as input and returns a cluster $C^{\#}$ such that $\Gamma\subset C^{\#}$. The local clustering problem has not been studied exhaustively, and many of its aspects remain open. Some recent related works are [18], [47], [48], [44] and [26]; in particular, the work in [26] shares the same problem setting as this paper. In this paper, we propose a new semi-supervised local clustering approach using the ideas of compressed sensing and the method of least squares to make the clustering effective and efficient. Indeed, as we will see in the numerical experiments section, our approach outperforms the work in [26] in terms of both accuracy and efficiency.

The main contribution of this paper is the local cluster extraction Algorithms 3 and 4, which improve the performance of the algorithms in [26] and also slightly improve the state-of-the-art result in [2] for the political blogs network [3]. They also achieve better or comparable results on the synthetic stochastic block model, human face data, and MNIST data compared with several other modern local clustering or semi-supervised algorithms.

The subsequent sections of this paper are structured as follows. In Section 2, we give a brief introduction to spectral clustering and the concept of the graph Laplacian, and we state the assumptions on the graph model which we will use later for the theoretical analysis. In Section 3, we explain the main two-stage algorithms for solving the local cluster extraction problem and show the correctness of our algorithms asymptotically. In Section 4, we analyze the computational complexity of our algorithms. In Section 5, several synthetic and real data sets are used to evaluate the performance of our algorithms, and we compare their performance with state-of-the-art results.

2 Preliminaries and Models

2.1 Notations and Definitions

We use the standard notation $G=(V,E)$ to denote the graph $G$ with vertex set $V$ and edge set $E$. For the case $|V|=n$, we identify $V$ with the set of integers $[n]:=\{1,2,\cdots,n\}$. We use $A$ to denote the adjacency matrix (possibly non-negatively weighted) of $G$, so in the undirected case $A$ is a symmetric matrix. Let $D$ be the diagonal matrix $D=\mathrm{diag}(d_{1},d_{2},\cdots,d_{n})$, where each $d_{i}$ is the degree of vertex $i$. We have the following definition.

Definition 2.1.

The unnormalized graph Laplacian is defined as $L=D-A$. There are also two other normalized graph Laplacians: the symmetric graph Laplacian $L_{sym}:=I-D^{-1/2}AD^{-1/2}$ and the random walk graph Laplacian $L_{rw}:=I-D^{-1}A$.
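As a quick illustration of Definition 2.1, the sketch below computes the three Laplacians from a dense adjacency matrix; it assumes every vertex has positive degree so that $D^{-1}$ and $D^{-1/2}$ exist.

```python
import numpy as np

def graph_laplacians(A):
    """Return (L, L_sym, L_rw) for a (weighted) adjacency matrix A."""
    d = A.sum(axis=1)                              # degrees d_1, ..., d_n
    n = len(d)
    L = np.diag(d) - A                             # unnormalized Laplacian
    L_sym = np.eye(n) - np.diag(1.0 / np.sqrt(d)) @ A @ np.diag(1.0 / np.sqrt(d))
    L_rw = np.eye(n) - np.diag(1.0 / d) @ A        # random walk Laplacian
    return L, L_sym, L_rw
```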

The following result serves as the foundation of our approach for solving the graph clustering problem; we omit its proof here and refer directly to [8] and [29].

Lemma 2.1.

Let $G$ be an undirected graph of size $n$ with non-negative weights. Then the multiplicity $k$ of the eigenvalue $0$ of $L$ ($L_{rw}$) equals the number of connected components $C_{1},C_{2},\cdots,C_{k}$ in $G$, and the indicator vectors $\mathbf{1}_{C_{1}},\cdots,\mathbf{1}_{C_{k}}\in\mathbb{R}^{n}$ of these components span the kernel of $L$ ($L_{rw}$).
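A small numerical check of Lemma 2.1 on a toy graph with two connected components (two disjoint triangles); the construction below is only illustrative.

```python
import numpy as np

# Two disjoint triangles: eigenvalue 0 of L_rw has multiplicity 2, and the
# indicator vector of the first component lies in the kernel.
A = np.zeros((6, 6))
for (i, j) in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
L_rw = np.eye(6) - np.diag(1.0 / A.sum(axis=1)) @ A
print(np.round(np.sort(np.linalg.eigvals(L_rw).real), 6))   # first two are 0
print(np.allclose(L_rw @ np.r_[np.ones(3), np.zeros(3)], 0))  # 1_{C_1} in kernel
```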

Let us introduce some more notation which we will use later. Suppose for the moment that we have information about the underlying cluster of each vertex; then it is useful to write $G$ as a union of two edge-disjoint subgraphs $G=G^{in}\cup G^{out}$, where $G^{in}=(V,E^{in})$ consists of only intra-connection edges and $G^{out}=(V,E^{out})$ consists of only inter-connection edges. We will use $d_{i}^{in}$ to denote the degree of vertex $i$ in the subgraph $G^{in}$, and $d_{i}^{out}$ to denote the degree of vertex $i$ in the subgraph $G^{out}$. We will also use $A^{in}$ and $L^{in}$ to denote the adjacency matrix and graph Laplacian associated with $G^{in}$, and $A^{out}$ and $L^{out}$ to denote the adjacency matrix and graph Laplacian associated with $G^{out}$. Note that these notations are only for convenience in the analysis of the next section; in reality we have no assurance about which cluster each individual vertex belongs to, so we have no access to $A^{in}$ and $L^{in}$. It is also worth pointing out that $A=A^{in}+A^{out}$ but $L\neq L^{in}+L^{out}$ in general. Furthermore, we will use $|L|$ or $|\mathbf{y}|$ to denote the matrix or vector where each entry is replaced by its absolute value, and we will use $|V|$ to denote the size of $V$ whenever $V$ is a set. In the later sections, we will use $L$ and $L^{in}$ to indicate $L_{rw}$ and $L^{in}_{rw}$ respectively, and use $L_{C}$ and $L^{in}_{C}$ to denote the submatrices of $L$ and $L^{in}$ with column indices in the subset $C\subset V=[n]$. For convenience, the notation used throughout the paper is summarized in Table 1.

Table 1: Table of Notations
Symbols | Descriptions
$G$ | A general graph
$|G|$ | Size of $G$
$V$ | Set of vertices of graph $G$ (we identify $V=\{1,2,\cdots,n\}$ if $|G|=n$ throughout the paper)
$|V|$ | Size of $V$
$E$ | Set of edges of graph $G$
$E^{in}$ | Subset of $E$ which consists of only intra-connection edges
$E^{out}$ | Subset of $E$ which consists of only inter-connection edges
$G^{in}$ | Subgraph of $G$ on $V$ with edge set $E^{in}$
$G^{out}$ | Subgraph of $G$ on $V$ with edge set $E^{out}$
$A$ | Adjacency matrix of graph $G$
$A^{in}$ | Adjacency matrix of graph $G^{in}$
$A^{out}$ | Adjacency matrix of graph $G^{out}$
$L$ | Random walk graph Laplacian of $G$
$L^{in}$ | Random walk graph Laplacian of $G^{in}$
$L^{out}$ | Random walk graph Laplacian of $G^{out}$
$\mathbf{1}_{C}$ | Indicator vector on subset $C\subset V$
$L_{C}$ | Submatrix of $L$ with column indices $C\subset V$
$L^{in}_{C}$ | Submatrix of $L^{in}$ with column indices $C\subset V$
$|L|$ | Entrywise absolute value of matrix $L$
$|\mathbf{y}|$ | Entrywise absolute value of vector $\mathbf{y}$
$\|M\|$ | $\|\cdot\|_{2}$ norm of matrix $M$
$\|\mathbf{y}\|$ | $\|\cdot\|_{2}$ norm of vector $\mathbf{y}$

2.2 Graph Model Assumptions

We make the following assumptions on our graph model from the asymptotic perspective.

Assumption 1.

Suppose $G=(V,E)$ can be partitioned into $k=O(1)$ connected components such that $V=C_{1}\cup\cdots\cup C_{k}$, where each $C_{i}$ is the underlying vertex set of a connected component of $G$.

  • (I)

    The degree of each vertex is asymptotically the same for vertices belonging to the same cluster $C_{i}$.

  • (II)

    The degree $d^{out}_{i}$ is asymptotically small relative to the degree $d^{in}_{i}$ for each vertex $i\in V$.

Random graphs which satisfy assumption (I) are not uncommon; for example, the Erdös-Rényi (ER) model $G(n,p)$ with $p\sim\frac{\omega(n)\log(n)}{n}$ for any $\omega(n)\to\infty$, see [15] and [9]. A natural generalization of the ER model is the stochastic block model (SBM) [19], which is a generative model for random graphs with certain edge densities within and between the underlying clusters, such that the edges within clusters are denser than the edges between clusters. In the case where each cluster has the same size and the intra- and inter-connection probabilities are the same among all the vertices, we have the symmetric stochastic block model (SSBM). It is worthwhile to note that the information theoretic bounds for exact cluster recovery in the SBM are given in [1] and [2]. It was also shown in [26] that a general SBM under certain assumptions on the parameters can be clustered using a compressed sensing approach. Our model requires a weaker assumption than the one in [26]; indeed, we remove the assumption imposed on the eigenvalues of $L$ in [26]. Therefore, our model is applicable to a broader range of random graphs.

3 Main Algorithms

Our analysis is based on the following key observation. Suppose that the graph $G$ has $k$ connected components $C_{1},\cdots,C_{k}$, i.e., $L=L^{in}$, and suppose further that we temporarily have access to the information about the structure of $L^{in}$. Then we can write the graph Laplacian $L^{in}$ in the block diagonal form

L=L^{in}=\left(\begin{array}[]{cccc}L^{in}_{1}&&&\\ &L^{in}_{2}&&\\ &&\ddots&\\ &&&L^{in}_{k}\\ \end{array}\right). (1)

Suppose now we are interested in finding the cluster with the smallest number of vertices, say $C_{1}$, which corresponds to $L^{in}_{1}$. By Lemma 2.1, $\{\mathbf{1}_{C_{1}},\cdots,\mathbf{1}_{C_{k}}\}$ forms a basis of the kernel $W_{0}$ of $L$. Note that all the $\mathbf{1}_{C_{i}}$ have disjoint supports, so for $\mathbf{w}\in W_{0}$ with $\mathbf{w}\neq 0$, we can write

\mathbf{w}=\sum_{i=1}^{k}\alpha_{i}\mathbf{1}_{C_{i}} (2)

with some $\alpha_{i}\neq 0$. Therefore, if $\mathbf{1}_{C_{1}}$ has the fewest non-zero entries among all elements of $W_{0}\setminus\{\mathbf{0}\}$, then we can find it by solving the following minimization problem:

\min\|\mathbf{w}\|_{0}\quad\text{s.t.}\quad L^{in}\mathbf{w}=\mathbf{0}\quad\text{and}\quad\mathbf{w}\neq\mathbf{0}. (3)

Here the $\ell_{0}$ norm $\|\cdot\|_{0}$ counts the number of nonzero components of the input vector. Problem (3) can be solved using methods such as the greedy algorithms from compressed sensing, as explained in [26]. However, we propose a different approach to tackle it in this paper and demonstrate that the new approach is more effective numerically and requires fewer assumptions.

3.1 Least Squares Cluster Pursuit

Let us consider problem (3) again. Instead of finding $C_{1}$ directly, let us try to find which vertices are not in $C_{1}$. Suppose there is a superset $\Omega\subset V$ such that $C_{1}\subset\Omega$ and $C_{i}\not\subset\Omega$ for all $i=2,\cdots,k$. Since $L^{in}\mathbf{1}_{C_{1}}=\mathbf{0}$, we have

L^{in}\mathbf{1}_{\Omega}=L^{in}(\mathbf{1}_{\Omega\setminus C_{1}}+\mathbf{1}_{C_{1}})=L^{in}\mathbf{1}_{\Omega\setminus C_{1}}+L^{in}\mathbf{1}_{C_{1}}=L^{in}\mathbf{1}_{\Omega\setminus C_{1}}. (4)

Letting $\mathbf{y}:=L^{in}\mathbf{1}_{\Omega}$, finding which vertices within $\Omega$ are not in $C_{1}$ is then equivalent to solving the following problem (5):

\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{n}}\|L^{in}\mathbf{x}-\mathbf{y}\|_{2}. (5)

Note that solving problem (5) directly gives both $\mathbf{x}^{*}=\mathbf{1}_{\Omega}\in\mathbb{R}^{n}$ and $\mathbf{x}^{*}=\mathbf{1}_{\Omega\setminus C_{1}}\in\mathbb{R}^{n}$ as solutions. By setting the columns $L^{in}_{V\setminus\Omega}$ to zero, solving problem (5) is equivalent to solving

\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{|\Omega|}}\|L^{in}_{\Omega}\mathbf{x}-\mathbf{y}\|_{2}. (6)

Directly solving problem (6) gives at least two solutions, $\mathbf{x}^{*}=\mathbf{1}\in\mathbb{R}^{|\Omega|}$ and $\mathbf{x}^{*}=\mathbf{1}_{C_{1}^{\mathsf{c}}}\in\mathbb{R}^{|\Omega|}$, where $C_{1}^{\mathsf{c}}$ denotes the complement of $C_{1}$. Between these two solutions, the latter is much more informative for extracting $C_{1}$ from $\Omega$ than the former. We need a way to avoid the non-informative solution $\mathbf{x}^{*}=\mathbf{1}$ while keeping the informative solution $\mathbf{x}^{*}=\mathbf{1}_{C_{1}^{\mathsf{c}}}$.

We can achieve this by removing a subset of columns with indices in $\Omega$. Let us use $T\subset\Omega$ to denote the indices of the columns we aim to remove, and suppose we could choose $T$ such that $T\subset C_{1}$. Now consider the following variation (7) of the minimization problem (6):

\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{|\Omega|-|T|}}\|L^{in}_{\Omega\setminus T}\mathbf{x}-\mathbf{y}\|_{2}. (7)

Unlike (6), which has two solutions, (7) has only one: $\mathbf{x}^{*}=\mathbf{1}$ is no longer a solution because of the removal of $T$, while $\mathbf{x}^{*}=\mathbf{1}_{C_{1}^{\mathsf{c}}}$ is still a solution to (7) because $L^{in}_{\Omega\setminus T}\mathbf{1}_{C_{1}^{\mathsf{c}}}=L^{in}\mathbf{1}_{\Omega\setminus C_{1}}=0$. Furthermore, the solution to (7) is unique since it is a least squares problem whose matrix $L^{in}_{\Omega\setminus T}$ has full column rank; therefore $\mathbf{x}^{*}=\mathbf{1}_{C_{1}^{\mathsf{c}}}$ is the unique solution to (7).

However, there is no way in theory to select $T$ while guaranteeing the condition $T\subset C_{1}$. In practice, we choose $T$ based on the following observation. Suppose $L=L^{in}$, $\Omega\supset C_{1}$ and $\Omega\not\supset C_{i}$ for $i=2,\cdots,k$. Then $|L_{a}^{\top}|\cdot|\mathbf{y}|=0$ for all $a\in C_{1}$, and $|L_{a}^{\top}|\cdot|\mathbf{y}|>0$ for all $a\in\Omega\setminus C_{1}$. Therefore, we choose $T$ in such a way that $|L_{t}^{\top}|\cdot|\mathbf{y}|$ is small for all $t\in T\subset\Omega$. These ideas lead to Algorithm 1.

Algorithm 1 Least Squares Cluster Pursuit
Input: Adjacency matrix $A$, vertex subset $\Omega\subset V$, least squares threshold parameter $\gamma\in(0,1)$, and rejection parameter $0.1\leq R\leq 0.9$.

  1. Compute $L=I-D^{-1}A$ and $\mathbf{y}=L\mathbf{1}_{\Omega}$.

  2. Let $T$ be the set of column indices of the $\gamma\cdot|\Omega|$ smallest components of the vector $|L_{\Omega}^{\top}|\cdot|\mathbf{y}|$.

  3. Let $\mathbf{x}^{\#}$ be the solution to

     \operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{|\Omega|-|T|}}\|L_{\Omega\setminus T}\mathbf{x}-\mathbf{y}\|_{2} (8)

     obtained by using an iterative least squares solver.

  4. Let $W^{\#}=\{i:\mathbf{x}_{i}^{\#}>R\}$.

Output: $C^{\#}_{1}=\Omega\setminus W^{\#}$.
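The following is a minimal Python sketch of Algorithm 1, assuming the adjacency matrix is a SciPy CSR sparse matrix; it mirrors MATLAB's lsqr with scipy.sparse.linalg.lsqr, and the default values of `gamma` and `R` are illustrative choices, not the tuned settings used in the experiments.

```python
import numpy as np
from scipy.sparse import identity, diags
from scipy.sparse.linalg import lsqr

def least_squares_cluster_pursuit(A, Omega, gamma=0.2, R=0.5):
    """Sketch of Algorithm 1. A: sparse adjacency matrix (CSR),
    Omega: array of candidate vertex indices."""
    n = A.shape[0]
    Omega = np.asarray(Omega)
    d = np.asarray(A.sum(axis=1)).ravel()
    L = identity(n, format='csr') - diags(1.0 / d) @ A        # L = I - D^{-1} A
    y = np.asarray(L[:, Omega].sum(axis=1)).ravel()           # y = L 1_Omega

    # Step 2: T = indices of the gamma*|Omega| smallest entries of |L_Omega|^T |y|
    scores = np.asarray(abs(L[:, Omega]).T @ np.abs(y)).ravel()
    keep = np.argsort(scores)[int(gamma * Omega.size):]       # columns kept: Omega \ T
    cols = Omega[keep]

    # Step 3: iterative least squares on L_{Omega \ T} (analogue of MATLAB's lsqr)
    x = lsqr(L[:, cols], y)[0]

    # Step 4: reject vertices with x_i^# > R; the rest of Omega form C_1^#
    W = cols[x > R]
    return np.setdiff1d(Omega, W)
```
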
Remark 3.1.

We take the absolute value rather than the direct dot product in order to reduce cancellation between vector components when summing the entrywise products. In practice, the value of $\gamma\in(0,1)$ does not affect the performance much as long as it is not too extreme. We find that $0.15\leq\gamma\leq 0.4$ works well for our numerical experiments.

Remark 3.2.

In practice, we choose to use MATLAB's lsqr function to solve the least squares problem (8). As we will see in Lemma 3.2, our problem is well conditioned, so it is also possible to solve the normal equations exactly for problems that are not of very large scale. However, we choose to solve it iteratively rather than exactly because a highly accurate numerical solution is not essential for our task; we are only interested in an approximate solution, since the cutoff value $R$ is used for clustering.

Remark 3.3.

As indicated in [26], we can reformulate problem (3) as solving

\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{n}}\{\|L\mathbf{x}-\mathbf{y}\|_{2}:\ \|\mathbf{x}\|_{0}\leq s\} (9)

by applying greedy algorithms such as subspace pursuit [10] and compressive sampling matching pursuit (CoSaMP) [34]. Alternatively, we can consider the LASSO formulation of the problem, see [41] and [43],

\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{n}}\{\|L\mathbf{x}-\mathbf{y}\|^{2}_{2}+\lambda\|\mathbf{x}\|_{1}\}=\operatorname*{arg\,min}_{\mathbf{x}\in\mathbb{R}^{n}}\{\|L\mathbf{x}-\mathbf{y}\|^{2}_{2}+\lambda\|\mathbf{x}\|_{0}\}. (10)

The reason that the LASSO is a natural way to interpret this problem is that the solution $\mathbf{x}^{*}$ we seek is a sparse indicator vector, which satisfies $\|\mathbf{x}^{*}\|_{1}=\|\mathbf{x}^{*}\|_{0}$. We do not analyze this formulation further here.

However, in reality we have no access to $L^{in}$; all we know is $L$, and in general $L\neq L^{in}$. We argue that the solution to the perturbed problem (8) associated with $L$ will not be too far away from the solution to the unperturbed problem (7) associated with $L^{in}$, provided the difference between $L$ and $L^{in}$ is relatively small. Let us make this precise by first quoting the following standard result in numerical analysis.

Lemma 3.1.

Let $\|\cdot\|$ be an operator norm, $A\in\mathbb{R}^{n\times n}$ a non-singular square matrix, and $\mathbf{x},\mathbf{y}\in\mathbb{R}^{n}$. Let $\tilde{A}$, $\tilde{\mathbf{x}}$, $\tilde{\mathbf{y}}$ be perturbed measurements of $A$, $\mathbf{x}$, $\mathbf{y}$ respectively. Suppose $A\mathbf{x}=\mathbf{y}$, $\tilde{A}\tilde{\mathbf{x}}=\tilde{\mathbf{y}}$, and suppose further that $\mathrm{cond}(A)<\frac{\|A\|}{\|\tilde{A}-A\|}$. Then

\frac{\|\mathbf{\tilde{x}}-\mathbf{x}\|}{\|\mathbf{x}\|}\leq\frac{\mathrm{cond}(A)}{1-\mathrm{cond}(A)\frac{\|\tilde{A}-A\|}{\|A\|}}\Big{(}\frac{\|\tilde{A}-A\|}{\|A\|}+\frac{\|\mathbf{\tilde{y}}-\mathbf{y}\|}{\|\mathbf{y}\|}\Big{)}.

The above lemma asserts that the size of $\mathrm{cond}(A)$ is significant in determining the stability of the solution $\mathbf{x}$ with respect to small perturbations of $A$ and $\mathbf{y}$. From now on, we use $\|\cdot\|$ to denote the standard vector or induced matrix two-norm $\|\cdot\|_{2}$ unless stated otherwise. The next lemma establishes the invertibility of $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$ and gives a bound on its condition number.

Lemma 3.2.

Let $V=\cup_{i=1}^{k}C_{i}$ be the disjoint union of $k=O(1)$ underlying clusters of sizes $n_{i}$ and assume (I). Let $d_{j}$ be the degree of vertex $j\in V=[n]$, let $n_{1}=\min_{i\in[k]}n_{i}$, and suppose $\Omega\subset V$ is such that $\Omega\supset C_{1}$ and $\Omega\not\supset C_{i}$ for $i=2,\cdots,k$. Then

  • (i)

    If $T\subset C_{1}$, then $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$ is invertible.

  • (ii)

    Suppose further that $\lceil\frac{3n_{1}}{4}\rceil\leq|T|<n_{1}$ and $\lceil\frac{5n_{1}}{4}\rceil\leq|\Omega|\leq\lceil\frac{7n_{1}}{4}\rceil$. Then

    cond\big{(}(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{)}\leq 4

    almost surely as $n_{1}\to\infty$, e.g. when $n\to\infty$.

Proof.

Without loss of generality, assume that the column indices of $L^{in}_{\Omega\setminus T}$ are already permuted so that they are ordered according to their underlying clusters. The invertibility of $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$ follows directly from the fact that $L^{in}_{\Omega\setminus T}$ has full column rank, so let us show that $L^{in}_{\Omega\setminus T}$ has full column rank. Because of the reordering, $L^{in}_{\Omega\setminus T}$ is in the block diagonal form

L_{\Omega\setminus T}^{in}=\left(\begin{array}[]{cccc}L^{in}_{C_{1}\setminus T}&&&\\ &L^{in}_{\Omega\cap C_{2}}&&\\ &&L^{in}_{\Omega\cap C_{3}}&\\ &&&\ddots\\ \end{array}\right).

It then suffices to show that each block has full column rank. By Lemma 2.1, each $L^{in}_{C_{i}}$ has $\lambda=0$ as an eigenvalue of multiplicity one, and the corresponding eigenspace is spanned by $\mathbf{1}_{C_{i}}$. Hence $rank(L^{in}_{C_{i}})=|C_{i}|-1$. Now suppose by contradiction that the columns of $L^{in}_{C_{1}\setminus T}$ are linearly dependent, so there exists $\mathbf{v}\neq\mathbf{0}$ such that $L^{in}_{C_{1}\setminus T}\mathbf{v}=\mathbf{0}$, or $L^{in}_{C_{1}\setminus T}\mathbf{v}+L^{in}_{T}\cdot\mathbf{0}=\mathbf{0}$. This means that $\mathbf{u}=(\mathbf{v},\mathbf{0})$ is an eigenvector associated with the eigenvalue zero which is not a multiple of $\mathbf{1}_{C_{1}}$, contradicting the fact that the eigenspace is spanned by $\mathbf{1}_{C_{1}}$. Therefore the columns of $L^{in}_{C_{1}\setminus T}$ are linearly independent, hence $L^{in}_{C_{1}\setminus T}$ has full column rank. For $C_{i}$ with $i\geq 2$, since $C_{i}\not\subset\Omega$, $\Omega\cap C_{i}$ is a proper subset of $C_{i}$, and the same argument applies. Therefore all blocks in $L^{in}_{\Omega\setminus T}$ have full column rank, so $L^{in}_{\Omega\setminus T}$ has full column rank.

Now, since $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$ is in block form, to estimate its condition number we only need to estimate the largest and smallest eigenvalues of each block. Write $L^{in}_{\Omega\setminus T}=[l_{ij}]$ and $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}=[s_{ij}]$. For each $i\in C_{1}\setminus T$, $s_{ii}=\sum_{k=1}^{n}l_{ki}l_{ki}=\sum_{k=1}^{n}l_{ki}^{2}=\sum_{k=1}^{n_{1}}l_{ki}^{2}=1+\frac{1}{d_{i}^{in}}$, and for $i,j\in C_{1}\setminus T$ with $i\neq j$, $s_{ij}=\sum_{k=1}^{n}l_{ki}l_{kj}=\sum_{k=1}^{n_{1}}l_{ki}l_{kj}$. Note that the probability of having an edge between $i$ and $j$ given the degree sequence $d_{1},\cdots,d_{n_{1}}$ equals $\frac{d_{i}d_{j}}{\sum_{i\in C_{1}}d_{i}}$, as the existence of an edge between two vertices is proportional to their degrees. So $l_{ij}$ equals $-\frac{1}{d_{i}}$ with probability $\frac{d_{i}d_{j}}{\sum_{i\in C_{1}}d_{i}}$, which implies $\mathbb{E}(l_{ij})=-\frac{d_{j}}{\sum_{i\in C_{1}}d_{i}}$; and $l_{ji}$ equals $-\frac{1}{d_{j}}$ with probability $\frac{d_{i}d_{j}}{\sum_{i\in C_{1}}d_{i}}$, which implies $\mathbb{E}(l_{ji})=-\frac{d_{i}}{\sum_{i\in C_{1}}d_{i}}$. Hence the expectation

\mathbb{E}(s_{ij})=\mathbb{E}\Big{(}\sum_{k=1}^{n}l_{ki}l_{kj}\Big{)}=\sum_{k=1}^{n}\mathbb{E}(l_{ki})\mathbb{E}(l_{kj})=\sum_{k=1}^{n_{1}}\mathbb{E}(l_{ki})\mathbb{E}(l_{kj})
=\frac{d_{i}d_{j}}{\sum_{i\in C_{1}}d_{i}}\cdot\Big{(}-\frac{1}{d_{i}}\Big{)}+\frac{d_{i}d_{j}}{\sum_{i\in C_{1}}d_{i}}\cdot\Big{(}-\frac{1}{d_{j}}\Big{)}+\frac{d_{k}d_{i}}{\sum_{i\in C_{1}}d_{i}}\cdot\frac{d_{k}d_{j}}{\sum_{i\in C_{1}}d_{i}}\cdot\Big{(}\frac{1}{d_{k}}\Big{)}^{2}
=-\frac{d_{i}+d_{j}}{\sum_{i\in C_{1}}d_{i}}+\frac{d_{i}d_{j}}{(\sum_{i\in C_{1}}d_{i})^{2}}=-\frac{2}{n_{1}}+\frac{1}{n_{1}^{2}}.

By the law of large numbers, $s_{ij}\to-\frac{2}{n_{1}}+\frac{1}{n_{1}^{2}}$ almost surely as $n_{1}\to\infty$. Therefore, for $i\in C_{1}\setminus T$, we have

\sum_{j\in C_{1}\setminus T,j\neq i}|s_{ij}|\to|C_{1}\setminus T|\cdot\Big{(}\frac{2}{n_{1}}-\frac{1}{n_{1}^{2}}\Big{)}\leq\frac{n_{1}}{4}\cdot\Big{(}\frac{2}{n_{1}}-\frac{1}{n_{1}^{2}}\Big{)}\leq\frac{1}{2}

almost surely as $n_{1}\to\infty$. Similarly, for each $i\in C_{k}\cap(\Omega\setminus C_{1})$, $k\geq 2$, we have $s_{ii}=1+\frac{1}{d^{in}_{i}}$ and $\sum_{j\in C_{k}\cap(\Omega\setminus C_{1}),j\neq i}|s_{ij}|\to\frac{n_{1}}{4}\cdot(\frac{2}{n_{k}}-\frac{1}{n_{k}^{2}})\leq\frac{1}{2}$ almost surely as $n_{1}\to\infty$.
Now we apply Gershgorin's circle theorem to bound the spectrum of $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$. For all $i\in\Omega\setminus T$, the Gershgorin discs are centered at $1+\frac{1}{d_{i}}$ with radius at most $\frac{1}{2}$ almost surely; hence $\sigma_{\min}((L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T})\geq\frac{1}{2}$ and $\sigma_{\max}((L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T})\leq\frac{3}{2}+\frac{1}{d_{i}}\leq 2$ almost surely. Therefore we have

\text{cond}\big{(}(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{)}=\frac{\sigma_{\max}((L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T})}{\sigma_{\min}((L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T})}\leq 4

almost surely, as desired. ∎
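A small simulation consistent with Lemma 3.2(ii): we build $L^{in}$ for two equal clusters with no inter-cluster edges (so $L=L^{in}$), pick $\Omega$ and $T$ of the prescribed sizes, and check the condition number of the resulting Gram matrix. The cluster size and edge probability below are illustrative assumptions, not the exact experimental setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n1 = 400                                    # two equal clusters, no inter-cluster edges
p = 8 * np.log(n1) / n1                     # assumed intra-cluster edge probability
blocks = []
for _ in range(2):
    B = np.triu(rng.random((n1, n1)) < p, 1).astype(float)
    blocks.append(B + B.T)
A_in = np.block([[blocks[0], np.zeros((n1, n1))],
                 [np.zeros((n1, n1)), blocks[1]]])
d = A_in.sum(axis=1)
L_in = np.eye(2 * n1) - np.diag(1.0 / d) @ A_in

C1 = np.arange(n1)
Omega = np.arange(int(np.ceil(1.25 * n1)))  # C1 plus a proper part of C2
T = rng.choice(C1, size=int(np.ceil(0.75 * n1)), replace=False)
cols = np.setdiff1d(Omega, T)
Gram = L_in[:, cols].T @ L_in[:, cols]
print(np.linalg.cond(Gram))                 # typically a small constant below 4
```
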

Remark 3.4.

Note that there is a minor difficulty in estimating the expectation of the inner product between two different columns of $L^{in}_{\Omega\setminus T}$. The computation assumes the independence of the degree distributions of the individual vertices within each cluster, which may not hold for an arbitrary graph. However, independence does occur if asymptotic uniformity of the degree distribution within each cluster is assumed, which is why our model needs this assumption.

Now the perturbed problem (8) is equivalent to solving $(L_{\Omega\setminus T}^{\top}L_{\Omega\setminus T})\mathbf{x}^{\#}=L_{\Omega\setminus T}^{\top}\tilde{\mathbf{y}}=L_{\Omega\setminus T}^{\top}(L\mathbf{1}_{\Omega})$, while the unperturbed problem (7) is to solve $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\mathbf{x}^{*}=(L^{in}_{\Omega\setminus T})^{\top}\mathbf{y}=(L^{in}_{\Omega\setminus T})^{\top}(L^{in}\mathbf{1}_{\Omega})$. Let $M:=L-L^{in}$, $M_{\Omega}:=L_{\Omega}-L^{in}_{\Omega}$, and $M_{\Omega\setminus T}:=L_{\Omega\setminus T}-L^{in}_{\Omega\setminus T}$. Let us first give an estimate for $\|M\|$.

Lemma 3.3.

Let $L$ be the graph Laplacian of $G$ and $M:=L-L^{in}$. Let $\epsilon_{i}:=\frac{d^{out}_{i}}{d_{i}}$ for all $i$ and $\epsilon_{\max}:=\max_{i\in[n]}\epsilon_{i}$. Then $\|M\|\leq 2\epsilon_{\max}$.

Proof.

Let $\delta_{ij}$ denote the Kronecker delta and observe that

L_{ij}=\delta_{ij}-\frac{1}{d_{i}}A_{ij}=\delta_{ij}-\frac{1}{d^{in}_{i}+d^{out}_{i}}(A^{in}_{ij}+A^{out}_{ij}).

Since $\epsilon_{i}=\frac{d^{out}_{i}}{d_{i}}$, we have $\frac{1}{d_{i}}=\frac{1}{d^{in}_{i}+d^{out}_{i}}=\frac{1}{d^{in}_{i}}-\frac{\epsilon_{i}}{d^{in}_{i}}$. So we have

L_{ij}=\delta_{ij}-\Big{(}\frac{1}{d^{in}_{i}}-\frac{\epsilon_{i}}{d^{in}_{i}}\Big{)}(A^{in}_{ij}+A^{out}_{ij})
=\Big{(}\delta_{ij}-\frac{1}{d^{in}_{i}}A^{in}_{ij}\Big{)}-\frac{1}{d^{in}_{i}}A^{out}_{ij}+\frac{\epsilon_{i}}{d^{in}_{i}}(A^{in}_{ij}+A^{out}_{ij})
=L^{in}_{ij}-\frac{1-\epsilon_{i}}{d^{in}_{i}}A^{out}_{ij}+\frac{\epsilon_{i}}{d^{in}_{i}}A^{in}_{ij}.

Therefore $M_{ij}=-\frac{1-\epsilon_{i}}{d^{in}_{i}}A^{out}_{ij}+\frac{\epsilon_{i}}{d^{in}_{i}}A^{in}_{ij}$. To bound the spectral norm we apply Gershgorin's circle theorem, noting that $M_{ii}=0$ for all $i$; hence

\|M\|=\max\{|\lambda_{i}|:\lambda_{i}\ \text{eigenvalue of}\ M\}\leq\max_{i}\sum_{j}|M_{ij}|
=\max_{i}\sum_{j}\Big{|}-\frac{1-\epsilon_{i}}{d^{in}_{i}}A_{ij}^{out}+\frac{\epsilon_{i}}{d^{in}_{i}}A_{ij}^{in}\Big{|}
\leq\max_{i}\sum_{j}\Big{|}\frac{1-\epsilon_{i}}{d^{in}_{i}}\Big{|}A_{ij}^{out}+\Big{|}\frac{\epsilon_{i}}{d^{in}_{i}}\Big{|}A_{ij}^{in}
\leq\max_{i}\Big{\{}\frac{1-\epsilon_{i}}{d^{in}_{i}}\sum_{j}A^{out}_{ij}+\frac{\epsilon_{i}}{d^{in}_{i}}\sum_{j}A^{in}_{ij}\Big{\}}
=\max_{i}\Big{\{}\frac{1-\epsilon_{i}}{d^{in}_{i}}d^{out}_{i}+\frac{\epsilon_{i}}{d^{in}_{i}}d^{in}_{i}\Big{\}}=2\max_{i}\epsilon_{i}=2\epsilon_{\max}.

This completes the proof. ∎

Next, we establish the following result.

Lemma 3.4.

$\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}\mathbf{1}_{\Omega}\|\geq\frac{\sqrt{|\Omega\setminus C_{1}|}}{2}$ almost surely.

Proof.

Note that $\|(L^{in}_{\Omega\setminus T})^{\top}(L^{in}\mathbf{1}_{\Omega})\|=\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}\mathbf{1}\|$, so we want to estimate $\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}\mathbf{1}\|$. Similar to the computation in Lemma 3.2, for each $i\in C_{1}\setminus T$ we have $s_{ii}=1+\frac{1}{d^{in}_{i}}$, $\sum_{j\in C_{1}}s_{ij}=0$, and $\sum_{j\in\Omega\setminus C_{1}}s_{ij}=0$. For each $i\in C_{k}\cap(\Omega\setminus C_{1})$, $k\geq 2$, we have $s_{ii}=1+\frac{1}{d^{in}_{i}}$, $\sum_{j\in C_{1}}s_{ij}=0$, and $\sum_{j\in C_{k}\cap(\Omega\setminus C_{1}),j\neq i}s_{ij}\to\frac{n_{1}}{4}\cdot(-\frac{2}{n_{k}}+\frac{1}{n_{k}^{2}})\geq-\frac{1}{2}$ almost surely. Therefore, the row sum of $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}$ for each row $i\in C_{1}\setminus T$ equals zero, and the row sum of $(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}$ for each row $i\in\Omega\setminus C_{1}$ is larger than $\frac{1}{2}$ almost surely. Hence $\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}\mathbf{1}\|\geq\frac{\sqrt{|\Omega\setminus C_{1}|}}{2}$ almost surely. ∎

Now let us use the previous lemmas to establish that the difference between the perturbed and unperturbed solutions is small, of order $\epsilon_{\max}$.

Theorem 3.1.

Under the same assumptions as Lemma 3.2, let $\mathbf{x}^{\#}$ be the solution to the perturbed problem (8), and let $\mathbf{x}^{*}=\mathbf{1}_{C_{1}^{\mathsf{c}}}\in\mathbb{R}^{|\Omega|-|T|}$ be the solution to the unperturbed problem (7). Then

\frac{\|\mathbf{x}^{\#}-\mathbf{x}^{*}\|}{\|\mathbf{x}^{*}\|}=O(\epsilon_{\max})

almost surely for large $n_{1}$.

Proof.

Let $B=(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$, $\tilde{B}=(L_{\Omega\setminus T})^{\top}L_{\Omega\setminus T}$, $\mathbf{y}=L^{in}\mathbf{1}_{\Omega}$, and $\tilde{\mathbf{y}}=L\mathbf{1}_{\Omega}$. We will apply Lemma 3.1 with $B$, $\tilde{B}$, $\mathbf{y}$, $\tilde{\mathbf{y}}$.

First, by Lemma 3.3, we have $\|M\|\leq 2\epsilon_{\max}$. Therefore

\|\tilde{B}-B\|=\big{\|}(L_{\Omega\setminus T})^{\top}L_{\Omega\setminus T}-(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{\|}
=\big{\|}(L^{in}_{\Omega\setminus T})^{\top}M_{\Omega\setminus T}+M_{\Omega\setminus T}^{\top}L^{in}_{\Omega\setminus T}+M_{\Omega\setminus T}^{\top}M_{\Omega\setminus T}\big{\|}
\leq\big{\|}(L^{in}_{\Omega\setminus T})^{\top}M_{\Omega\setminus T}\big{\|}+\big{\|}M_{\Omega\setminus T}^{\top}L^{in}_{\Omega\setminus T}\big{\|}+\big{\|}M_{\Omega\setminus T}^{\top}M_{\Omega\setminus T}\big{\|}
\leq\big{(}2\|L^{in}_{\Omega\setminus T}\|+\|M_{\Omega\setminus T}\|\big{)}\cdot\|M_{\Omega\setminus T}\|
\leq\big{(}2\|L^{in}_{\Omega\setminus T}\|+\|M\|\big{)}\cdot\|M\|
\leq 4\epsilon_{\max}\cdot\big{(}\|L^{in}_{\Omega\setminus T}\|+\epsilon_{\max}\big{)}.

For each $i\in\Omega\setminus T$, we have $\|L_{i}\|\geq 1$, and $\sigma_{\max}((L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T})=\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\|=\sigma_{\max}^{2}(L^{in}_{\Omega\setminus T})=\|L^{in}_{\Omega\setminus T}\|^{2}\geq\max_{i\in\Omega\setminus T}\|L_{i}\|^{2}\geq 1$. Hence

\frac{\big{\|}(L_{\Omega\setminus T})^{\top}L_{\Omega\setminus T}-(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{\|}}{\big{\|}(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{\|}}\leq\frac{\big{(}2\|L^{in}_{\Omega\setminus T}\|+\|M\|\big{)}\cdot\|M\|}{\|L^{in}_{\Omega\setminus T}\|^{2}} (11)
\leq\frac{4\epsilon_{\max}}{\|L^{in}_{\Omega\setminus T}\|}+\frac{4\epsilon_{\max}^{2}}{\|L^{in}_{\Omega\setminus T}\|^{2}} (12)
\leq 4(\epsilon_{\max}+\epsilon_{\max}^{2}). (13)

We also have

\|\tilde{\mathbf{y}}-\mathbf{y}\|=\|(L_{\Omega\setminus T})^{\top}(L\mathbf{1}_{\Omega})-(L^{in}_{\Omega\setminus T})^{\top}(L^{in}\mathbf{1}_{\Omega})\|
=\|(L^{in}_{\Omega\setminus T}+M_{\Omega\setminus T})^{\top}(L_{\Omega}\mathbf{1}_{\Omega})-(L^{in}_{\Omega\setminus T})^{\top}(L^{in}_{\Omega}\mathbf{1}_{\Omega})\|
=\|\big{(}(L^{in}_{\Omega\setminus T})^{\top}M_{\Omega}+M_{\Omega\setminus T}^{\top}L^{in}_{\Omega}+M_{\Omega\setminus T}^{\top}M_{\Omega}\big{)}\cdot\mathbf{1}_{\Omega}\|
\leq\sqrt{|\Omega|}\cdot\big{(}\|(L^{in}_{\Omega\setminus T})^{\top}M_{\Omega}\|+\|M_{\Omega\setminus T}^{\top}L^{in}_{\Omega}\|+\|M_{\Omega\setminus T}^{\top}M_{\Omega}\|\big{)}
\leq\sqrt{|\Omega|}\cdot\big{(}2\|L^{in}_{\Omega}\|+\|M_{\Omega}\|\big{)}\cdot\|M_{\Omega}\|
\leq 4\sqrt{|\Omega|}\cdot\big{(}\|L^{in}_{\Omega}\|+\epsilon_{\max}\big{)}\cdot\epsilon_{\max}.

Next, by Lemma 3.4, $\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega}\mathbf{1}_{\Omega}\|\geq\frac{\sqrt{|\Omega\setminus C_{1}|}}{2}$ almost surely. Therefore

\frac{\|(L_{\Omega\setminus T})^{\top}L\mathbf{1}_{\Omega}-(L^{in}_{\Omega\setminus T})^{\top}L^{in}\mathbf{1}_{\Omega}\|}{\|(L^{in}_{\Omega\setminus T})^{\top}L^{in}\mathbf{1}_{\Omega}\|}\leq\frac{4\sqrt{|\Omega|}\cdot\big{(}\|L^{in}_{\Omega}\|+\epsilon_{\max}\big{)}\cdot\epsilon_{\max}}{\sqrt{|\Omega\setminus C_{1}|}/2}
\leq 8\sqrt{5}\epsilon_{\max}\cdot\big{(}\|L^{in}_{\Omega}\|+\epsilon_{\max}\big{)}
\leq 8\sqrt{5}\epsilon_{\max}\cdot\big{(}\sqrt{2}+\epsilon_{\max}\big{)}
=8\sqrt{10}\epsilon_{\max}+8\sqrt{5}\epsilon_{\max}^{2}.

The second inequality holds since $|\Omega|\geq\lceil\frac{5n_{1}}{4}\rceil$. The third inequality holds since $\sigma_{\max}((L^{in}_{\Omega})^{\top}L^{in}_{\Omega})\leq 2$, which follows from the same reasoning as in Lemma 3.2 using Gershgorin's circle theorem; consequently, $\|L^{in}_{\Omega}\|\leq\sqrt{2}$. Now putting Lemma 3.2 and Lemma 3.1 together with $B=(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}$, $\tilde{B}=(L_{\Omega\setminus T})^{\top}L_{\Omega\setminus T}$, $\mathbf{y}=L^{in}\mathbf{1}_{\Omega}$, $\tilde{\mathbf{y}}=L\mathbf{1}_{\Omega}$, we have

\frac{\|\mathbf{x}^{\#}-\mathbf{x}^{*}\|}{\|\mathbf{x}^{*}\|}\leq\frac{\mathrm{cond}\big{(}(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{)}\cdot\big{(}4\epsilon_{\max}+4\epsilon_{\max}^{2}+8\sqrt{10}\epsilon_{\max}+8\sqrt{5}\epsilon_{\max}^{2}\big{)}}{1-\mathrm{cond}\big{(}(L^{in}_{\Omega\setminus T})^{\top}L^{in}_{\Omega\setminus T}\big{)}\cdot\big{(}4\epsilon_{\max}+4\epsilon_{\max}^{2}\big{)}}
\leq\frac{16\big{(}(1+2\sqrt{10})\epsilon_{\max}+(1+2\sqrt{5})\epsilon_{\max}^{2}\big{)}}{1-16\epsilon_{\max}(1+\epsilon_{\max})}=O(\epsilon_{\max}). ∎

Next, we estimate the size of the symmetric difference between the output $C_{1}^{\#}$ and the true cluster $C_{1}$ relative to the size of $C_{1}$, where the symmetric difference is defined as $C_{1}^{\#}\triangle C_{1}:=(C_{1}^{\#}\setminus C_{1})\cup(C_{1}\setminus C_{1}^{\#})$. Let us state another lemma before we establish the result.

Lemma 3.5.

Let $T\subset[n]$, $\mathbf{v}\in\mathbb{R}^{n}$, and $W^{\#}=\{i:\mathbf{v}_{i}>R\}$. Suppose $\|\mathbf{1}_{T}-\mathbf{v}\|\leq D$; then $|T\triangle W^{\#}|\leq\frac{D^{2}}{\min\{R^{2},(1-R)^{2}\}}$.

Proof.

Let $U^{\#}=[n]\setminus W^{\#}$ and write $\mathbf{v}=\mathbf{v}_{U^{\#}}+\mathbf{v}_{W^{\#}}$, where $\mathbf{v}_{U^{\#}}$ and $\mathbf{v}_{W^{\#}}$ are the parts of $\mathbf{v}$ supported on $U^{\#}$ and $W^{\#}$. Then we can write

\|\mathbf{1}_{T}-\mathbf{v}\|^{2}=\|\mathbf{1}_{T\cap W^{\#}}-(\mathbf{v}_{W^{\#}})_{T\cap W^{\#}}\|^{2}+\|(\mathbf{v}_{W^{\#}})_{W^{\#}\setminus T}\|^{2}+\|\mathbf{1}_{T\setminus W^{\#}}-\mathbf{v}_{U^{\#}}\|^{2}.

Note that $\|(\mathbf{v}_{W^{\#}})_{W^{\#}\setminus T}\|^{2}\geq R^{2}\cdot|W^{\#}\setminus T|$ and $\|\mathbf{1}_{T\setminus W^{\#}}-\mathbf{v}_{U^{\#}}\|^{2}\geq(1-R)^{2}\cdot|T\setminus W^{\#}|$. We have

\|\mathbf{1}_{T}-\mathbf{v}\|^{2}\geq\|(\mathbf{v}_{W^{\#}})_{W^{\#}\setminus T}\|^{2}+\|\mathbf{1}_{T\setminus W^{\#}}-\mathbf{v}_{U^{\#}}\|^{2}
\geq R^{2}\cdot|W^{\#}\setminus T|+(1-R)^{2}\cdot|T\setminus W^{\#}|
\geq\min\{R^{2},(1-R)^{2}\}\cdot(|W^{\#}\setminus T|+|T\setminus W^{\#}|)
=\min\{R^{2},(1-R)^{2}\}\cdot|T\triangle W^{\#}|.

Therefore $|T\triangle W^{\#}|\leq\frac{\|\mathbf{1}_{T}-\mathbf{v}\|^{2}}{\min\{R^{2},(1-R)^{2}\}}\leq\frac{D^{2}}{\min\{R^{2},(1-R)^{2}\}}$ as desired. ∎

Theorem 3.2.

Under the same assumptions as Theorem 3.1, we have

\frac{|C^{\#}_{1}\triangle C_{1}|}{|C_{1}|}\leq O(\epsilon^{2}_{\max}).

In other words, the error rate of recovering $C_{1}$ is at most a constant multiple of $\epsilon^{2}_{\max}$.

Proof.

From Theorem 3.1, we have $\|\mathbf{x}^{\#}-\mathbf{x}^{*}\|=\|\mathbf{x}^{\#}-\mathbf{1}_{\Omega\setminus C_{1}}\|\leq O(\epsilon_{\max})\cdot\|\mathbf{x}^{*}\|\leq O(\epsilon_{\max}\sqrt{n_{1}})$. By Lemma 3.5, we get $|W^{\#}\triangle(\Omega\setminus C_{1})|\leq O(\epsilon_{\max}^{2}n_{1})$. Since $C_{1}^{\#}=\Omega\setminus W^{\#}$, it follows that $|C_{1}^{\#}\triangle C_{1}|\leq O(\epsilon^{2}_{\max}n_{1})$, hence $\frac{|C^{\#}_{1}\triangle C_{1}|}{|C_{1}|}=O(\epsilon_{\max}^{2})$ as desired. ∎

3.2 Random Walk Threshold

In order to apply Algorithm 1, we need a “nice” superset $\Omega$ which contains $C_{1}$. The task of this subsection is to find such a superset from the given seeds $\Gamma$. We apply a simple diffusion based random walk algorithm on $G$ to find such an $\Omega$. This leads to Algorithm 2, which is also described in [26]. However, the difference between the random walk threshold algorithm in this paper and the one in [26] is that the threshold parameter $\delta$ here is heuristically chosen to be larger than the corresponding threshold parameter in [26]. This is another advantage of our method, as it increases the chance of having $C_{1}$ entirely contained in $\Omega$; such a choice is made possible by the natural differences between our approaches. It is worthwhile to point out that there are also other, more sophisticated algorithms, such as the ones described in [5], [23] and [45], which can achieve the same goal. We avoid using these methods here as our purpose is just to implement a fast way of obtaining a set $\Omega\supset C_{1}$.

Algorithm 2 Random Walk Threshold
Input: Adjacency matrix $A$, a random walk threshold parameter $\delta\in(0,1)$, a set of seed vertices $\Gamma\subset C_{1}$, estimated size $\hat{n}_{1}\approx|C_{1}|$, and depth of random walk $t\in\mathbb{Z}^{+}$.

  1. Compute $P=AD^{-1}$ and $\mathbf{v}^{(0)}=D\mathbf{1}_{\Gamma}$.

  2. Compute $\mathbf{v}^{(t)}=P^{t}\mathbf{v}^{(0)}$.

  3. Define $\Omega={\mathcal{L}}_{(1+\delta)\hat{n}_{1}}(\mathbf{v}^{(t)})$.

Output: $\Omega=\Omega\cup\Gamma$.

The thresholding operator $\mathcal{L}_{s}(\cdot)$ is defined as

\mathcal{L}_{s}(\mathbf{v}):=\{i\in[n]:v_{i}\ \text{is among the $s$ largest entries of}\ \mathbf{v}\}.
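A minimal Python sketch of Algorithm 2, with the thresholding operator implemented by sorting; the default values of `delta` and `t` are illustrative assumptions.

```python
import numpy as np
from scipy.sparse import diags

def random_walk_threshold(A, Gamma, n1_hat, delta=0.6, t=3):
    """Sketch of Algorithm 2. A: sparse adjacency matrix, Gamma: seed indices,
    n1_hat: estimated size of the target cluster."""
    Gamma = np.asarray(Gamma)
    d = np.asarray(A.sum(axis=1)).ravel()
    P = A @ diags(1.0 / d)                        # P = A D^{-1}
    v = np.zeros(A.shape[0])
    v[Gamma] = d[Gamma]                           # v^(0) = D 1_Gamma
    for _ in range(t):
        v = P @ v                                 # v^(t) = P^t v^(0)
    s = int(np.ceil((1 + delta) * n1_hat))
    Omega = np.argsort(-v)[:s]                    # thresholding operator L_s(v^(t))
    return np.union1d(Omega, Gamma)
```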

The motivation for Algorithm 2 is the following intuitive observation. Suppose we are given seed vertices $\Gamma\subset C_{1}$. Starting from $\Gamma$, since the edges within each cluster are denser than those between different clusters, the probability of staying within $C_{1}$ is much higher than that of entering another cluster $C_{i}$, $i\neq 1$, within a small number of steps. Therefore, by performing a random walk up to a certain depth $t$, e.g., $t=3$, we obtain a well approximated set $\Omega$ such that $C_{1}$ is almost surely contained in $\Omega$. Let us make this more precise in Theorem 3.3.

Theorem 3.3.

Assume $|\Gamma|=O(1)$ and $t=O(1)$ in Algorithm 2. Then the probability $\mathbb{P}\big{(}C_{1}\subset\Omega\big{)}\geq\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t)}_{j}=\|\mathbf{v}^{(t)}\|_{1}\big{)}\geq 1-O(\epsilon_{\max})$. In other words, the probability that the $t$-step random walk started from the seed vertices $\Gamma$ leaves $C_{1}$ is at most a constant multiple of $\epsilon_{\max}$.

Proof.

Let us first consider the case $|\Gamma|=1$ and suppose $\Gamma=\{s\}$. Then we have $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(0)}_{j}=\|\mathbf{v}^{(0)}\|_{1}\big{)}=\mathbb{P}\big{(}\mathbf{v}^{(0)}_{s}=\|\mathbf{v}^{(0)}\|_{1}\big{)}=1$. It is also easy to see that $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(1)}_{j}=\|\mathbf{v}^{(1)}\|_{1}\big{)}=d_{i}^{in}/d_{i}=1-\epsilon_{i}\geq 1-\epsilon_{\max}$. For $t\geq 2$, we have $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t)}_{j}=\|\mathbf{v}^{(t)}\|_{1}\big{)}\geq(1-\epsilon_{\max})\cdot\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t-1)}_{j}=\|\mathbf{v}^{(t-1)}\|_{1}\big{)}$. So, assuming $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t-1)}_{j}=\|\mathbf{v}^{(t-1)}\|_{1}\big{)}\geq(1-\epsilon_{\max})^{t-1}\geq 1-(t-1)\epsilon_{\max}$, we have $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t)}_{j}=\|\mathbf{v}^{(t)}\|_{1}\big{)}\geq(1-\epsilon_{\max})^{t}\geq 1-t\epsilon_{\max}=1-O(\epsilon_{\max})$.

Suppose now $|\Gamma|>1$. We can apply the above argument to each individual vertex in $\Gamma$, where the random walks starting from different seeds are considered independently; therefore $\mathbb{P}\big{(}\sum_{j\in C_{1}}\mathbf{v}^{(t)}_{j}=\|\mathbf{v}^{(t)}\|_{1}\big{)}\geq(1-t\epsilon_{\max})^{|\Gamma|}\geq 1-t\epsilon_{\max}|\Gamma|=1-O(\epsilon_{\max})$. ∎

Remark 3.5.

It is worthwhile to note that we do not want $t$ to be too large. One reason is that Theorem 3.3 tells us the probability of staying within the target cluster $C_{1}$ decreases as $t$ increases. An alternative interpretation is that we can treat our graph $G$, assumed connected, as a time homogeneous finite state Markov chain whose transition probabilities between adjacent vertices are determined by the vertex degrees. Since $G$ is connected, the chain is irreducible and aperiodic. By the fundamental theorem of Markov chains, the limiting distribution does not depend on the seed set $\Gamma$.

Meanwhile, we do not want $t$ to be too small either; otherwise the random walk will not be able to explore all the reachable vertices. There is also a trade-off between the size of $\Gamma$ and the random walk depth $t$: a smaller $\Gamma$ usually requires a larger $t$ in order to fully explore the target cluster.

3.3 Local Cluster Extraction

Let us now combine the previous two subroutines into Algorithm 3. In practice, we may want to vary the number of iterations $MaxIter$ based on the number of examples in the data set in order to achieve better performance. For the purpose of theoretical analysis, let us fix $MaxIter=1$.

Algorithm 3 Least Squares Clustering (LSC)
Input: Adjacency matrix $A$, a random walk threshold parameter $\delta\in(0,1)$, a set of seed vertices $\Gamma\subset C_{1}$, estimated size $\hat{n}_{1}\approx|C_{1}|$, depth of random walk $t\in\mathbb{Z}^{+}$, least squares parameter $\gamma\in(0,0.8)$, and rejection parameter $R\in[0,1)$.

  1. for $i=1,\cdots,MaxIter$

  2.   $\Omega\longleftarrow$ Random Walk Threshold($A$, $\Gamma$, $\hat{n}_{1}$, $\delta$, $t$).

  3.   $\Gamma\longleftarrow$ Least Squares Cluster Pursuit($A$, $\Omega$, $R$, $\gamma$).

  4. end

  5. Let $C_{1}^{\#}=\Gamma$.

Output: $C_{1}^{\#}$.
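A short sketch of Algorithm 3, which simply alternates the two helpers sketched earlier (random_walk_threshold and least_squares_cluster_pursuit are the hypothetical names introduced in the earlier code sketches).

```python
import numpy as np

def least_squares_clustering(A, Gamma, n1_hat, delta, t, gamma, R, max_iter=1):
    """Sketch of Algorithm 3 (LSC)."""
    seeds = np.asarray(Gamma)
    for _ in range(max_iter):
        Omega = random_walk_threshold(A, seeds, n1_hat, delta, t)
        seeds = least_squares_cluster_pursuit(A, Omega, gamma, R)
    return seeds                                   # C_1^#
```
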
Remark 3.6.

The hyperparameter $MaxIter$ in the algorithm is usually chosen based on the size of the initial seed set $\Gamma$ relative to $n$; we do not have a formal way of choosing the best $MaxIter$ and instead choose it heuristically. In practice, we find that $MaxIter\leq 3$ works very well most of the time.

The analysis in the previous two subsections shows that the difference between the true cluster $C_{1}$ and the estimate $C_{1}^{\#}$ is relatively small compared to the size of $C_{1}$. This can be stated more formally using asymptotic notation.

Theorem 3.4.

Suppose $\epsilon_{\max}=o(1)$ and $MaxIter=1$. Then, under the assumptions of Theorems 3.2 and 3.3, we have $\mathbb{P}\Big{(}\frac{|C^{\#}_{1}\triangle C_{1}|}{|C_{1}|}\leq o(1)\Big{)}=1-o(1)$.

Proof.

By Theorem 3.3, the probability that $\Omega\supset C_{1}$ after performing Algorithm 2 is $1-O(\epsilon_{\max})=1-o(1)$. By Theorem 3.2, the error rate after performing Algorithm 1 is at most a constant multiple of $\epsilon^{2}_{\max}$. Putting these together, we have $\mathbb{P}\Big{(}\frac{|C^{\#}_{1}\triangle C_{1}|}{|C_{1}|}\leq o(1)\Big{)}=1-o(1)$. ∎

3.4 From Local to Global

We can go one step further by applying Algorithm 3 iteratively on the entire graph to extract all the underlying clusters. That is, we remove $C_{i}^{\#}$ each time Algorithm 3 finds it, and update the graph $G$ by successively removing the subgraph spanned by the vertices in $C_{i}^{\#}$. This leads to Algorithm 4. We will not analyze the theoretical guarantees of this iterative version of the algorithm, but rather provide numerical examples in a later section to show its effectiveness and efficiency.

Algorithm 4 Iterative Least Squares Clustering (ILSC)
Input: Adjacency matrix $A$, random walk threshold parameter $\delta\in(0,1)$, least squares parameter $\gamma\in(0,0.8)$, rejection parameter $R\in[0,1)$, depth of random walk $t\in\mathbb{Z}^{+}$, seed vertices $\Gamma_{i}\subset C_{i}$ and estimated sizes $\hat{n}_{i}\approx|C_{i}|$ for $i=1,\cdots,k$.

  1. for $i=1,\cdots,k$

  2.   Let $C_{i}^{\#}$ be the output of Algorithm 3 applied to $G$ with seeds $\Gamma_{i}$ and size estimate $\hat{n}_{i}$.

  3.   Let $G^{(i)}$ be the subgraph spanned by $C_{i}^{\#}$.

  4.   Update $G\leftarrow G\setminus G^{(i)}$.

  5. end

Output: $C_{1}^{\#},\cdots,C_{k}^{\#}$.
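A sketch of Algorithm 4 built on the Algorithm 3 sketch above; the index bookkeeping assumes a SciPy CSR adjacency matrix and that the seeds of later clusters are not removed by earlier extractions. It is illustrative only.

```python
import numpy as np

def iterative_lsc(A, seed_sets, size_estimates, delta, t, gamma, R):
    """Sketch of Algorithm 4 (ILSC): extract one cluster at a time, then remove it."""
    remaining = np.arange(A.shape[0])              # global ids of vertices still in G
    clusters = []
    for Gamma_i, n_i in zip(seed_sets, size_estimates):
        # positions of this cluster's seeds inside the reduced graph
        local_seeds = np.flatnonzero(np.isin(remaining, Gamma_i))
        A_sub = A[remaining, :][:, remaining]
        C_local = least_squares_clustering(A_sub, local_seeds, n_i,
                                           delta, t, gamma, R)
        C_global = remaining[np.asarray(C_local, dtype=int)]
        clusters.append(C_global)
        remaining = np.setdiff1d(remaining, C_global)   # update G <- G \ G^(i)
    return clusters
```
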
Remark 3.7.

It is worth noting that Algorithm 4 extracts one cluster at a time, which differs from most other global unsupervised clustering algorithms. In practice, global clustering methods can have impractically high run time [33] or be tricky to implement [2]. In contrast, our method requires much less computational time and can be implemented easily. In addition, the “one cluster at a time” feature of our method provides more flexibility for problems under certain circumstances.

4 Computational Complexity

In this section, let us discuss the run time of the algorithms introduced previously.

Theorem 4.1.

Algorithm 2 requires $O(nd_{\max}t+n\log(n))$ operations, where $t$ is the depth of the random walk.

Proof.

Notice that if $A$, $D$, $P$ are stored as sparse matrices, then each of the $t$ steps in the second stage of Algorithm 2 requires $O(nd_{\max})$ operations, where $d_{\max}$ is the maximal degree among all the vertices. Therefore the algorithm requires $O(nd_{\max}t+n\log(n))$ operations, where the $O(n\log(n))$ part comes from the sorting in the third step. In practice, the random walk depth $t$ is $O(1)$ with respect to the graph size $n$, so we have $O(nd_{\max}+n\log(n))$. ∎

Theorem 4.2.

Algorithm 1 requires $O(nd_{\max}+n\log(n))$ operations.

Proof.

The first step of Algorithm 1 requires $O(nd_{\max})$ operations, and the second step requires $O(nd_{\max}+n\log(n))$, where the $O(nd_{\max})$ part comes from matrix-vector multiplication and the $O(n\log(n))$ part comes from sorting. For the third step, to avoid solving the normal equations exactly for large scale problems, we recommend using an iterative method, for example conjugate gradient descent (we use MATLAB's lsqr in our implementation). As we have shown, the matrices involved have well behaved condition numbers, so only a constant number of iterations is required to obtain a well approximated least squares solution to problem (8). Since the cost of each iteration of conjugate gradient descent equals a few matrix-vector multiplications, which is $O(nd_{\max})$, the total cost of Algorithm 1 is $O(nd_{\max}+n\log(n))$. ∎

As a consequence, the total run time of Algorithm 3 is $O(nd_{\max}+n\log(n))$. If the number of clusters is $k=O(1)$, then Algorithm 4 also runs in $O(nd_{\max}+n\log(n))$.

Remark 4.1.

The computational scheme of our method follows a framework similar to CP+RWT in [26]. However, one of the differences between the two approaches is that we apply lsqr to solve the least squares problem (8) directly, whereas CP+RWT applies $O(\log n)$ iterations of a subspace pursuit algorithm to solve (8), and each subspace pursuit iteration uses lsqr as a subroutine. So essentially, our proposed method is $O(\log n)$ times cheaper than CP+RWT. We can also see this difference by comparing the run times in our numerical experiments in the next section.

5 Numerical Experiments

In this section, we evaluate the performance of our algorithms on the synthetic symmetric stochastic block model (SSBM), the political blogs network data [3], the AT&T Database of Faces (https://git-disl.github.io/GTDLBench/datasets/att_face_dataset/), the Extended Yale Face Database B (YaleB) (http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html), and the MNIST data (http://yann.lecun.com/exdb/mnist/).

For single cluster extraction tasks, we consider the diffusion based methods with a possible refinement procedure, namely CP+RWT [26], HK-Grow [23], PPR [5], and LBSA [36], as our baseline methods. For multi-cluster extraction tasks, we consider ICP+RWT [26], LapRF and TVRF [49], Multi-class MBO Auction Dynamics [21], AtlasRBF [37], Pseudo-label [27], DGN [22] and Ladder Networks [39] as the baseline methods. The standard spectral clustering algorithm [35] is also applied in some of the experiments. For the implementation of Algorithms 3 and 4, we use MATLAB's lsqr function as our iterative least squares solver for equation (8). We tune the rejection parameter $R$ of each algorithm appropriately so that the outputs $C^{\#}_{i}$ of the different algorithms are approximately the same size, for comparison purposes. For the evaluation metrics of our experiments, we use the Jaccard index for the symmetric stochastic block model, the F1 score for the human face data, and classification accuracy for the political blogs data and MNIST data. More implementation details are summarized in the supplementary material.
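For reference, the evaluation metrics for a recovered vertex set can be computed as follows; this is a generic sketch, and the exact averaging protocol used in the experiments may differ.

```python
def jaccard_index(C_hat, C_true):
    """Jaccard index |C_hat ∩ C_true| / |C_hat ∪ C_true| of two vertex sets."""
    C_hat, C_true = set(C_hat), set(C_true)
    return len(C_hat & C_true) / len(C_hat | C_true)

def f1_score_sets(C_hat, C_true):
    """F1 = 2PR/(P+R), with precision P and recall R of the recovered set."""
    C_hat, C_true = set(C_hat), set(C_true)
    tp = len(C_hat & C_true)
    precision, recall = tp / len(C_hat), tp / len(C_true)
    return 2 * precision * recall / (precision + recall)
```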

5.1 Stochastic Block Model Data

We first test Algorithm 3 on $SSBM(n,k,p,q)$ with different choices of parameters. The parameter $n$ indicates the total number of vertices, $k$ indicates the number of clusters, $p$ is the probability of having an edge between any two vertices within each cluster, and $q$ is the probability of having an edge between any two vertices from different clusters. The left panel of Figure 1 shows such a synthetic random graph model with three underlying clusters, and the right panel illustrates the adjacency matrix of a random graph generated from the symmetric stochastic block model with three underlying clusters. In our experiments, we fix $k=3$ and choose $n=600,1200,1800,2400,3000$ respectively. The connection probabilities are chosen as $p=\frac{8\log n}{n}$ and $q=\frac{\log n}{n}$. By choosing the parameters carefully, we obtain the Jaccard index and the logarithm of the running time of each method shown in Figure 2. We also ran the experiments on the stochastic block model in the non-symmetric case and obtained similar gaps in accuracy and run time. For the implementation on the symmetric stochastic block model, we use three vertices with given labels as our seeds, and we focus on recovering only the target cluster, say $C_{1}$. The experiments are performed with 500 repetitions.
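A minimal sketch of sampling an adjacency matrix from $SSBM(n,k,p,q)$ (assuming $n$ is divisible by $k$), with the parameter choices of this subsection shown as an example.

```python
import numpy as np
from scipy.sparse import csr_matrix

def ssbm(n, k, p, q, seed=None):
    """Sample an adjacency matrix from SSBM(n, k, p, q): k equal-size clusters,
    intra-cluster edge probability p, inter-cluster edge probability q."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(k), n // k)
    probs = np.where(labels[:, None] == labels[None, :], p, q)
    upper = np.triu(rng.random((n, n)) < probs, 1)     # sample each edge once
    A = (upper | upper.T).astype(float)
    return csr_matrix(A), labels

# e.g. the setting used here: n = 600, k = 3, p = 8 log(n)/n, q = log(n)/n
A, labels = ssbm(600, 3, 8 * np.log(600) / 600, np.log(600) / 600, seed=1)
```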

Figure 1: Left: A Random SSBM Graph with Three Underlying Clusters. Right: Adjacency Matrix of A Random Graph Generated From SSBM.
Figure 2: Left: Average Jaccard Index. Right: Logarithm of the Average Run Time.
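For concreteness, a minimal MATLAB sketch of sampling an SSBM(n,k,p,q) adjacency matrix as described above is shown below; it is an illustrative helper under the assumption that k divides n, not the code used for our experiments.

% Minimal sketch: sample the adjacency matrix of SSBM(n,k,p,q), assuming
% k divides n. Illustrative only; not the exact experiment code.
n = 600;  k = 3;
p = 8*log(n)/n;  q = log(n)/n;
labels = repelem(1:k, n/k);             % ground-truth cluster assignment
sameCluster = (labels' == labels);      % n-by-n within-cluster mask
P = q*ones(n) + (p - q)*sameCluster;    % edgewise connection probabilities
U = triu(rand(n) < P, 1);               % sample upper triangle, no self-loops
A = U + U';                             % symmetric 0/1 adjacency matrix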

5.2 Political Blogs Network Data

Next we test Algorithm 3 on the data from “The political blogosphere and the 2004 US Election” [3], which contains a list of political blogs classified as liberal or conservative, together with the links between the blogs. See Figure 3 for an illustration of the community structure (figure source: [3]).

Figure 3: Community structure of political blogs. The colors reflect political orientation, red for conservative, and blue for liberal. Orange links go from liberal to conservative, and purple ones from conservative to liberal. The size of each blog reflects the number of other blogs that link to it [3].

As explained by Abbe and Sandon in [2], their simplified algorithm gave a reasonably good classification in 37 out of 40 trials. Each of these trials classified all but 56 to 67 of the 1222 vertices in the main component of the graph correctly. According to [2], the state-of-the-art method described in [51] before the work in [2] achieves a lowest error count of around 60, while regularized spectral algorithms such as the one in [38] make about 80 errors. In our experiments, given three labeled seeds, Algorithm 3 succeeds in 35 out of 40 trials. Among these 35 trials, the average number of misclassified nodes in the main component of the graph is 55, which compares slightly favorably with the state-of-the-art result. We also tested CP+RWT on this dataset and found its results were not very satisfactory.

5.3 AT&T Database of Faces

The AT&T Database of Faces contains grayscale images of 40 different people, each image of pixel size 92\times 112. Images of each person are taken under 10 different conditions, varying three aspects: the facial perspective, the lighting conditions, and the facial expression.

Figure 4: Left: Randomly Permuted AT&T Faces. Right: Desired Recovery of all Faces into Correct Clusters.

We use part of this data set by randomly selecting 10 people, each with 10 images. We randomly permute the images as shown on the left side of Figure 4 and compute the adjacency matrix A based on the preprocessing method discussed in Appendix B. Then we iteratively apply Algorithm 3 to recover all the 10 images belonging to the same individual. The desired arrangement of the individual images after iteratively performing Algorithm 3 is shown on the right side of Figure 4. More details of the implementation regarding hyperparameter tuning are summarized in Appendix A. The performance of our algorithm compared with CP+RWT and spectral clustering (SC) is summarized in Table 2 under 500 repetitions. Note that spectral clustering is unsupervised, hence its accuracy is not affected by the percentage of labeled data.

Table 2: Average F1 Scores of Recovering All Clusters for AT&T Data.
Labeled Data %    10       20       30
LSC               96.5%    97.5%    98.2%
CP+RWT [26]       92.2%    95.7%    97.1%
SC [35]           95.8%    95.8%    95.8%

5.4 Extended Yale Face Database B (YaleB)

The YaleB dataset contains 16128 grayscale images of 28 human subjects under 9 poses and 64 illumination conditions. We use part of this data set by randomly selecting 20 images of each person after the preprocessing described in Appendix B. The images are randomly permuted and we aim to recover all the clusters by iteratively performing Algorithm 3. Figure 5 shows this dataset with the randomly permuted images on the left side and the desired clustering result on the right side. Figure 6 enlarges the small regions marked with red boxes in the pictures of Figure 5. The performance of our algorithm compared with CP+RWT and spectral clustering (SC) is summarized in Table 3 under 500 repetitions.

Table 3: Average F1 Scores of Recovering All Clusters for YaleB Data.
Labeled Data %    5        10       20
LSC               92.1%    96.0%    96.1%
CP+RWT [26]       89.2%    93.7%    93.9%
SC [35]           93.8%    93.8%    93.8%
Figure 5: Left: Randomly Permuted YaleB Faces. Right: Desired Recovery of all Faces into Correct Clusters.
Figure 6: Enlarged YaleB Human Faces.

5.5 MNIST Data

We also test Algorithm 4 on the MNIST data, a well-known machine learning benchmark for classification that consists of 70000 grayscale images of the handwritten digits 0-9 of size 28\times 28, with approximately 7000 images of each digit. We use a certain percentage of vertices with given labels within each of the ten clusters as our seed vertices; the performance of ILSC and ICP+RWT is summarized in Table 4 under 100 repetitions.

Table 4: Average Accuracy of Recovering All Clusters for MNIST Data (Time is measured in seconds).
Labeled Data %    ILSC      Run Time    ICP+RWT [26]    Run Time
0.5               97.30%    15.5        96.41%          18.1
1                 97.73%    15.3        97.32%          19.1
1.5               98.03%    15.4        97.44%          19.8
2                 98.17%    15.5        97.52%          21.4
2.5               98.27%    15.4        97.50%          22.1

We also compare ILSC with several other state-of-the-art semi-supervised methods on MNIST. As shown in Table 5, ILSC outperforms the other algorithms except for Ladder Networks, which uses more label information and involves a deep neural network architecture that requires several hours of training on GPUs.

Table 5: Accuracy of ILSC and other Semi-supervised Algorithms on MNIST.
Methods                                        # Labeled Data    Accuracy
LapRF [49]                                     600               95.6%
TVRF [49]                                      600               96.8%
ICP+RWT [26]                                   700               97.3%
Multi-class MBO with Auction Dynamics [21]     700               97.4%
ILSC (this paper)                              700               97.7%
AtlasRBF [37]                                  1000              96.4%
Pseudo-label [27]                              1000              96.6%
DGN [22]                                       1000              97.6%
ILSC (this paper)                              1000              98.0%
Ladder Networks [39]                           1000              99.2%

References

  • [1] E. Abbe, Community Detection and Stochastic Block Models: Recent Developments, Journal of Machine Learning Research. vol.18, no.177, pp.1-86, 2018
  • [2] E. Abbe and C. Sandon, Recovering communities in the general stochastic block model without knowing the parameters, In Advances in Neural Information Processing Systems, 676–684, 2015
  • [3] L. A. Adamic and N. Glance, The political blogosphere and the 2004 US election: Divided they blog, In Proceedings of the 3rd International Workshop on Link Discovery, 36-43, 2005.
  • [4] D. J. Aldous, Random walks on finite groups and rapidly mixing Markov chains, In Seminaire de Probabilites XVII, pages 243–297. Springer Verlag, 1983. Lecture Notes in Mathematics 986.
  • [5] R. Andersen, F. Chung, and K. Lang, Using pagerank to locally partition a graph, Internet Mathematics, 4(1):35-64, 2007.
  • [6] G. Camps-Valls, T. Bandos and D. Zhou, Semisupervised graph-based hyperspectral image classification, IEEE Trans. Geosci. Remote Sensing, vol. 45, no. 10, pp. 3044-3054, 2007.
  • [7] Y. Chen, J. Z. Wang, and R. Krovetz, Clue: Cluster-based retrieval of images by unsupervised learning, IEEE Trans. Image Process. 14, 8, 1187–1201, 2005.
  • [8] F. Chung, Spectral Graph Theory, Vol. 92. American Mathematical Society., 1997.
  • [9] F. Chung and L. Lu, Complex Graphs and Networks, Vol. 107. American Mathematical Soc., 2006.
  • [10] W. Dai and O. Milenkovic, Subspace pursuit for compressive sensing signal reconstruction, IEEE Trans. Inform. Theory, vol. 55, no. 5, pp. 2230–2249, May 2009.
  • [11] S. Dhillon, Co-clustering documents and words using bipartite spectral graph partitioning, In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 269-274, 2001.
  • [12] S. Dhillon, Y. Guan, and B. Kulis, Kernel k-means: spectral clustering and normalized cuts, In Proc. of the 10th ACM SIGKDD Conference, 2004.
  • [13] C. Ding, X. He, H. Zha, M. Gu, and H. D. Simon, A min-max cut algorithm for graph partitioning and data clustering, In Proceedings of IEEE ICDM 2001, pages 107–114, 2001.
  • [14] W. Doeblin, Expose de la theorie des chaınes simples constantes de Markov a un nombre fini d’etats, Mathematique de l’Union Interbalkanique, 2:77–105, 1938
  • [15] P. Erdős and A. Rényi, On random graphs, I. Publ. Math. Debrecen 6 (1959), 290-297.
  • [16] S. Fortunato, Community detection in graphs, Physics Reports, 486 (3–5), 75–174, 2010.
  • [17] A. Georghiades, P. Belhumeur, and D. Kriegman, From Few to Many: Illumination Cone Models for Face Recognition under Variable Lighting and Pose, PAMI, 2001, Vol 23, pp 643-660.
  • [18] W. Ha, K. Fountoulakis, and M. W. Mahoney, Statistical guarantees for local graph clustering, The 23rd International Conference on Artificial Intelligence and Statistics, 2020.
  • [19] P. W. Holland, K. Laskey, and S. Leinhardt, Stochastic blockmodels: First steps, Social Networks 5 (1983), no. 2,109–137
  • [20] D. Hric, R. K. Darst, and S. Fortunato, Community detection in networks: Structural clusters versus ground truth, Physical Review E, 9, 062805, 2014
  • [21] M. Jacobs, E. Merkurjev, and S. Esedoglu, Auction Dynamics: A Volume Constrained MBO Scheme, Journal of Computational Physics, 354:288-310, 2018.
  • [22] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, Semi-supervised learning with deep generative models, In Advances in Neural Information Processing Systems 27 (NIPS 2014), pages 3581–3589, 2014.
  • [23] K. Kloster and D. F. Gleich, Heat kernel based community detection, In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: 1386-1395, 2014.
  • [24] G. Kossinets and D. J. Watts, Empirical analysis of an evolving social network, Science, 311 (5757), 88–90, 2016
  • [25] B. Kulis, S. Basu, I. Dhillon, and R. Mooney, Semi-supervised graph clustering: A kernel approach, Proc. ICML-2005.
  • [26] M.-J. Lai, D. Mckenzie, Compressive Sensing Approach to Cut Improvement and Local Clustering, SIAM Journal on Mathematics of Data Science, 2(2020), 368–395.
  • [27] D.-H. Lee, Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, In Workshop on Challenges in Representation Learning, ICML 2013.
  • [28] T. Lindvall, Lectures on the coupling method, John Wiley & Sons Inc., New York, 1992.
  • [29] U. V. Luxburg, A tutorial on spectral clustering, Statistics and computing, 17(4):395– 416, 2007.
  • [30] M. W. Mahoney, L. Orecchia, and N. K. Vishnoi, A local spectral method for graphs: With applications to improving graph partitions and exploring data graphs locally, Journal of Machine Learning Research, 13(Aug):2339-2365, 2012.
  • [31] R. Mihalcea and D. Radev, Graph-based natural language processing and information retrieval, Cambridge university press, 2011.
  • [32] T. Miyato, S-i. Maeda, M. Koyama, K. Nakae, and S. Ishii, Distributional smoothing by virtual adversarial examples, arXiv:1507.00677, 2015
  • [33] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. Combinatorica 38.3 (2018): 665-708.
  • [34] D. Needell and J. A. Tropp, CoSaMP: Iterative signal recovery from incomplete and inaccurate samples, Applied and Computational Harmonic Analysis, 26(3):301-321, 2009.
  • [35] A. Y. Ng, M. I. Jordan, and Y. Weiss, On spectral clustering: Analysis and an algorithm, In Advances in Neural Information Processing Systems: 849-856, 2002.
  • [36] P. Shi, K. He, D. Bindel, and J. E. Hopcroft, Locally-biased spectral approximation for community detection, Knowledge-Based Systems, 164:459–472, 2019.
  • [37] N. Pitelis, C. Russell, and L. Agapito, Semi-supervised learning using an unsupervised atlas, In Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2014), pages 565–580. Springer, 2014.
  • [38] T. Qin and K. Rohe, Regularized spectral clustering under the degree-corrected stochastic block model, Advances in Neural Information Processing Systems 26 (C.j.c. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.q. Weinberger, eds.), 2013, pp. 3120–3128.
  • [39] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko, Semi-supervised Learning with Ladder Networks, In Advances in Neural Information Processing Systems: 3546-3554, 2015.
  • [40] F. Samaria and A. Harter, Parameterisation of a Stochastic Model for Human Face Identification, Proceedings of 2nd IEEE Workshop on Applications of Computer Vision, 1994.
  • [41] F. Santosa and W. Symes, Linear inversion of band-limited reflection seismograms, SIAM Journal on Scientific and Statistical Computing. SIAM. 7 (4): 1307–1330. doi:10.1137/0907087, 1986.
  • [42] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000.
  • [43] R. Tibshirani, Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society. Series B (methodological). Wiley. 58 (1): 267–88. JSTOR 2346178, 1996
  • [44] N. Veldt, C. Klymko, and D. F. Gleich, Flow-Based Local Graph Clustering with Better Seed Set Inclusion, In Proceedings of the SIAM International Conference on Data Mining, 2019
  • [45] D. Wang, K. Fountoulakis, M. Henzinger, M. W. Mahoney, and S. Rao, Capacity releasing diffusion for speed and locality, Proceedings of the 34th International Conference on Machine Learning, 70:3598-3607, 2017.
  • [46] Y. Wu, M. Rosca, and T. Lillicrap, Deep compressed sensing. International Conference on Machine Learning, pp. 6850–6860, 2019.
  • [47] Y. Yan, Y. Bian, D. Luo, D. Lee, and X. Zhang, Constrained Local Graph Clustering by Colored Random Walk, in Proc. World Wide Web Conf., 2019, pp. 2137–2146.
  • [48] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich, Local higher-order graph clustering, In Proceedings of the 23rd ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2017, pp. 555–564.
  • [49] K. Yin, X.-C. Tai, An Effective Region Force for Some Variational Models for Learning and Clustering, Journal of Scientific Computing, 74 (2018), 175-196.
  • [50] L. Zelnik-Manor and P. Perona, Self-tuning spectral clustering, In Advances in neural information processing systems, pages 1601–1608, 2004.
  • [51] A. Y. Zhang, H. H. Zhou, C. Gao, Z. Ma, Achieving optimal misclassification proportion in stochastic block model, arXiv:1505.03772 (2015).

Declarations

Funding

The first author is supported by the Simons Foundation collaboration grant #864439.

Competing Interests

The authors declare that they have no competing interests.

Data Availability

A sample demo program for reproducing Fig. 4 in this paper can be found at https://github.com/zzzzms/LeastSquareClustering. All other demonstration codes or data are available upon request.

Appendix A Hyperparameters for Numerical Experiments

For each cluster to be recovered, we sample the seed vertices \Gamma_i uniformly from C_i in all implementations. We fix the random walk depth at t=3, and use the random walk threshold parameter \delta=0.8 for the political blogs network and \delta=0.6 for all the other experiments. We vary the rejection parameter R\in(0,1) for each specific experiment based on the estimated sizes of the clusters. In the case where neither the estimated cluster sizes nor the number of clusters is given, we may refer to the spectrum of the graph Laplacian and use the largest gap between two consecutive eigenvalues to estimate the number of clusters, and then use the average size to estimate the size of each cluster.
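As an illustration of this eigengap heuristic, the following is a minimal MATLAB sketch; the adjacency matrix A and the number m of eigenvalues inspected are placeholders, and the sketch assumes the graph has no isolated vertices.

% Minimal sketch of the eigengap heuristic: estimate the number of
% clusters from the smallest eigenvalues of the normalized graph
% Laplacian. A is a placeholder adjacency matrix with no isolated
% vertices; m is a placeholder for how many eigenvalues to inspect.
n     = size(A, 1);
d     = sum(A, 2);                          % vertex degrees
Dhalf = spdiags(1./sqrt(d), 0, n, n);       % D^{-1/2}
L     = speye(n) - Dhalf*A*Dhalf;           % symmetric normalized Laplacian
m     = 20;
ev    = sort(real(eigs(L, m, 'smallestabs')));
[~, kEst] = max(diff(ev));                  % index of largest consecutive gap
% kEst is the estimated number of clusters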

We fix the least squares threshold parameter \gamma=0.2 for all the experiments; this choice is heuristic. However, our experiments show that the performance of the algorithms is not affected much by varying \gamma\in[0.15,0.4]. The hyperparameter MaxIter is chosen according to the size of the initial seed set relative to the total number of vertices in the cluster. Purely for comparison purposes, we keep MaxIter=1 for the MNIST data. By experimenting with different choices of MaxIter, we find that the best performance for the AT&T data occurs at MaxIter=2 for 10\% seeds and MaxIter=1 for 20\% and 30\% seeds. For the YaleB data, the best performance occurs at MaxIter=2 for 5\%, 10\%, and 20\% seeds. All the numerical experiments are implemented in MATLAB and can be run on a local personal machine; for the reproducibility of our results, we provide a sample demo code at https://github.com/zzzzms/LeastSquareClustering for verification purposes.

Appendix B Image Data Preprocessing

For the YaleB human faces data, we performed some data preprocessing to remove poor-quality images. Specifically, we discarded the pictures which are too dark, and we cropped each image to size 54\times 46 to reduce the effect of background noise. From the remaining qualified pictures, we randomly selected 20 images for each person.
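A minimal MATLAB sketch of this preprocessing step is given below; the directory layout, the darkness threshold, and the centered-crop choice are illustrative assumptions rather than the exact procedure we used.

% Minimal sketch of the preprocessing described above. The folder name,
% the darkness threshold, and the centered crop are illustrative
% assumptions, not the exact settings used in our experiments.
files = dir(fullfile('YaleB', '*.pgm'));          % hypothetical image folder
kept  = {};
for t = 1:numel(files)
    I = double(imread(fullfile(files(t).folder, files(t).name))) / 255;
    if mean(I(:)) < 0.15                          % discard overly dark images
        continue;
    end
    [h, w] = size(I);
    r0 = floor((h - 54)/2);  c0 = floor((w - 46)/2);
    kept{end+1} = I(r0+1:r0+54, c0+1:c0+46);      % centered 54-by-46 crop
end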

All the image data in MNIST, AT&T, and YaleB need to be converted into an auxiliary graph before our algorithms can be applied. Let \mathbf{x}_i\in\mathbb{R}^n be the vectorization of an image from the original data set. Following [21] and [50], we define the following affinity matrix of the K-NN auxiliary graph based on the Gaussian kernel:

A_{ij}=\begin{cases}e^{-\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}/(\sigma_{i}\sigma_{j})}&\text{if }\mathbf{x}_{j}\in NN(\mathbf{x}_{i},K)\\ 0&\text{otherwise}\end{cases}

The notation NN(\mathbf{x}_i,K) denotes the set of K nearest neighbours of \mathbf{x}_i, and \sigma_i:=\|\mathbf{x}_i-\mathbf{x}^{(r)}_i\|, where \mathbf{x}^{(r)}_i is the r-th closest point to \mathbf{x}_i. Note that the above A is not necessarily symmetric, so we consider \tilde{A}=A^{T}A for symmetrization. Alternatively, one may also consider \tilde{A}_{ij}=\max\{A_{ij},A_{ji}\} or \tilde{A}_{ij}=(A_{ij}+A_{ji})/2. We use \tilde{A} as the input adjacency matrix for our algorithms.

We fix the local scaling parameters K=15, r=10 for the MNIST data, K=8, r=5 for the YaleB data, and K=5, r=3 for the AT&T data in our implementation.
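To make the construction concrete, here is a minimal MATLAB sketch of building the locally scaled K-NN affinity matrix and symmetrizing it as above. The data matrix X (one vectorized image per row) is a placeholder, pdist2 is from the Statistics and Machine Learning Toolbox, and the dense O(N^2) distance computation is for illustration only.

% Minimal sketch: locally scaled Gaussian K-NN affinity matrix and its
% symmetrization, as defined above. X is a placeholder data matrix with
% one vectorized image per row; dense distances are for illustration.
K = 15;  r = 10;                          % e.g. the MNIST setting
N = size(X, 1);
D = pdist2(X, X);                         % pairwise Euclidean distances
[Dsorted, order] = sort(D, 2);            % neighbors of each point, ascending
sigma = Dsorted(:, r + 1);                % distance to the r-th closest point
A = zeros(N);
for i = 1:N
    nbrs  = order(i, 2:K+1);              % K nearest neighbors, excluding x_i
    scale = sigma(i) * reshape(sigma(nbrs), 1, []);
    A(i, nbrs) = exp(-D(i, nbrs).^2 ./ scale);
end
Atilde = A' * A;                          % symmetrization used in the text
% alternatives: Atilde = max(A, A');  or  Atilde = (A + A')/2;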