Learning with Semi-Definite Programming: new statistical bounds based on fixed point analysis and excess risk curvature
Abstract
Many statistical learning problems have recently been shown to be amenable to Semi-Definite Programming (SDP), with community detection and clustering in Gaussian mixture models as the most striking instances [60]. Given the growing range of applications of SDP-based techniques to machine learning problems, and the rapid progress in the design of efficient algorithms for solving SDPs, an intriguing question is to understand how the recent advances from empirical process theory can be put to work in order to provide a precise statistical analysis of SDP estimators.
In the present paper, we borrow cutting edge techniques and concepts from the learning theory literature, such as fixed point equations and excess risk curvature arguments, which yield general estimation and prediction results for a wide class of SDP estimators. From this perspective, we revisit some classical results in community detection from [54] and [23], and we obtain statistical guarantees for SDP estimators used in signed clustering, group synchronization and MAXCUT.
1 Introduction
Many statistical learning problems have recently been shown to be amenable to Semi-Definite Programming (SDP), with community detection and clustering in Gaussian mixture models as the most striking instances where SDP performs significantly better than other current approaches [60]. SDP is a class of convex optimisation problems generalising linear programming to linear problems over semi-definite matrices [94], [107], [21], which has proved to be an important tool in the computational approach to difficult challenges in automatic control, combinatorial optimisation, polynomial optimisation, data mining, high dimensional statistics and the numerical solution of partial differential equations. The goal of the present paper is to introduce a new fixed point approach to the statistical analysis of SDP-based estimators and to illustrate our method on four current problems of interest, namely MAX-CUT, community detection, signed clustering and angular group synchronization. The rest of this section gives some historical background and presents the mathematical definition of SDP-based estimators.
1.1 Historical background
SDP is a class of optimisation problems which includes linear programming as a particular case and can be written as the set of problems over symmetric (resp. Hermitian) positive semi-definite matrix variables, with linear cost function and affine constraints, i.e. optimization problems of the form
(1.1)
\[
\max_{Z \succeq 0} \ \langle C, Z \rangle \quad \text{subject to} \quad \langle A_i, Z \rangle = b_i, \quad i = 1, \dots, m,
\]
where $C, A_1, \dots, A_m$ are given symmetric (resp. Hermitian) matrices and $b \in \mathbb{R}^m$. SDPs are convex programming problems which can be solved in polynomial time when the constraint set is compact, and they play a paramount role in a large number of convex and nonconvex problems, for which they often appear as a convex relaxation [6].
We will sometimes use the notation (resp. ) for the cone of positive semi-definite matrices (resp. negative semi-definite matrices).
1.1.1 Early history
Early use of Semi-Definite Programming in statistics can be traced back to [88] and [42]. In the same year, Shapiro used SDP in factor analysis [89]. The study of the mathematical properties of SDP then gained momentum with the introduction of Linear Matrix Inequalities (LMI) and their numerous applications in control theory, system identification and signal processing. The book [20] is the standard reference for this type of result, mostly obtained in the 90’s.
1.1.2 The Goemans-Williamson SDP relaxation of Max-Cut and its legacy
A notable turning point is the publication of [49], where SDP was shown to provide a 0.87 approximation to the NP-hard problem known as Max-Cut. The Max-Cut problem is a clustering problem on graphs which consists in finding two complementary subsets and of nodes such that the sum of the weights of the edges between and is maximal. In [49], the authors approach this difficult combinatorial problem by using what is now known as the Goemans-Williamson SDP relaxation, and use the Cholesky factorization of the optimal solution to this SDP in order to produce a randomized scheme achieving the 0.87 bound in expectation. Moreover, this problem can be seen as a first instance where the Laplacian of a graph is employed in order to provide an optimal bi-clustering in a graph, and it certainly represents the first chapter of a long and fruitful relationship between clustering, embedding and Laplacians. Other SDP schemes for approximating hard combinatorial problems include, to name a few, the graph coloring problem [61] and the satisfiability problem [49, 48]. These results were later surveyed in [70, 47] and [106]. The randomized scheme introduced by Goemans and Williamson was then further improved in order to study more general Quadratically Constrained Quadratic Programmes (QCQP) in various references, most notably [79, 110], and further extended in [56]. Many applications to signal processing are discussed in [81], [74]; one specific reduced complexity implementation in the form of an eigenvalue minimisation problem and its application to binary least-squares recovery and denoising is presented in [27].
1.1.3 Relaxation of machine learning and high dimensional statistical estimation problems
Applications of SDP to problems related to machine learning are more recent and probably started with the SDP relaxation of k-means in [83, 82] and later in [5]. This approach was then further improved using a refined statistical analysis by [87] and [46]. Similar methods have also been applied to community detection [55, 3] and, from the weak recovery viewpoint, in [54]. This last approach was also re-used via the kernel trick for point cloud clustering [28]. Another incarnation of SDP in machine learning is the extensive use of nuclear norm-penalized least-squares costs as a surrogate for rank-penalization in low-rank recovery problems such as matrix completion in recommender systems, matrix compressed sensing, natural language processing and quantum state tomography; these topics are surveyed in [35].
The problem of manifold learning was also addressed using SDP and is often mentioned as one of the most accurate approaches to the problem, leaving aside its computational complexity; see [103, 105, 104, 57]. Connections with the design of fast converging Markov chains were also exhibited in [93].
In a different direction, A. Singer and collaborators have recently promoted the use of SDP relaxations for estimation under group invariance, an active area with many applications [92, 8]. SDP-based relaxations have also been considered in [30] in the context of group synchronization in signed multiplex networks with constraints, and in [31] in the setting of ranking from inconsistent and incomplete pairwise comparisons, where an SDP-based relaxation of angular synchronization over SO(2) outperformed a suite of state-of-the-art algorithms from the literature. Phase recovery using SDP was studied in e.g. [102] and [40]. An extension to multi-partite clustering based on SDP was then proposed in [61]. Other important applications of SDP include information theory [73], estimation in power networks [67], quantum tomography [77], [50] and polynomial optimization via sums-of-squares relaxations [66, 14]. Sums-of-squares relaxations were recently applied to statistical problems in [38, 58, 39]. The extension to the field of complex numbers, with the Hermitian inner product, has been less extensively studied but has many interesting applications and comes with efficient algorithms [49, 45].
1.2 Mathematical formulation of the problem
The general problem we want to study can be stated as follows. Let $A$ be a random matrix in $\mathbb{R}^{n\times n}$ (or $\mathbb{C}^{n\times n}$) and let $\mathcal{C}$ be a constraint set. The object that we want to recover, for instance the community membership vector in community detection, is related to an oracle defined as
(1.2)
\[
Z^* \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle \mathbb{E}A, Z \rangle,
\]
where $\langle A, B \rangle = \operatorname{Tr}(A \bar B^\top)$ and $\bar B$ is the (entrywise) conjugate of $B$ (so that $\langle A, B \rangle = \operatorname{Tr}(A B^\top)$ when $B$ is real-valued). We would like to estimate $Z^*$, from which we can ultimately retrieve the object that really matters to us (for instance, by considering a singular vector associated with the largest singular value of $Z^*$). To that end, we consider the following estimator
(1.3)
\[
\hat Z \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle A, Z \rangle,
\]
which is simply obtained by replacing the unobserved quantity $\mathbb{E}A$ by the observation $A$.
As pointed out, in many situations, $Z^*$ is not the object we want to estimate, but there is a straightforward relation between $Z^*$ and this object. For instance, consider the community detection problem, where the goal is to recover the community membership vector of the nodes. Here, when $\mathcal{C}$ is well chosen, there is a close relation between $Z^*$ and the membership vector. We therefore need a final step to estimate the latter from $\hat Z$, for instance by taking a top eigenvector of $\hat Z$ and then using the Davis-Kahan “sin-theta” theorem [36, 108] to control the estimation error on the membership vector by the estimation error of $Z^*$ by $\hat Z$.
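To make this two-step pipeline concrete (solve the SDP, then extract a leading eigenvector), here is a minimal cvxpy sketch; it is not taken from the original text, and the elliptope used in `elliptope` is only an illustrative choice of constraint set, not the one prescribed for any particular application below.

```python
import cvxpy as cp
import numpy as np

def sdp_estimator(A, build_constraints):
    """Solve the plug-in SDP (1.3): maximize <A, Z> over the constraint set C.

    `build_constraints(Z)` must return the cvxpy constraints describing C."""
    n = A.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    objective = cp.Maximize(cp.sum(cp.multiply(A, Z)))   # <A, Z> = sum_ij A_ij Z_ij
    cp.Problem(objective, build_constraints(Z)).solve(solver=cp.SCS)
    return Z.value

def elliptope(Z):
    # purely illustrative choice of C: PSD matrices with unit diagonal
    return [Z >> 0, cp.diag(Z) == 1]

def top_eigenvector(Z_hat):
    """Final step: a leading eigenvector of the SDP solution (to be fed to a
    Davis-Kahan type argument, as described above)."""
    _, vecs = np.linalg.eigh(Z_hat)
    return vecs[:, -1]

# Example usage:
#   A = np.random.randn(20, 20); A = (A + A.T) / 2
#   x_hat = top_eigenvector(sdp_estimator(A, elliptope))
```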
1.3 Goal of the paper
The aim of the present work is to present a general approach to the study of the statistical properties of SDP-based estimators defined in (1.3). In particular, using our framework, one is able to obtain new (non-asymptotic) rates of convergence or exact reconstruction properties for a wide class of estimators obtained as a solution of a semidefinite program like (1.3). Specifically, our goal is to show that the solution to (1.3) can be analyzed in a statistical way when $\mathbb{E}A$ is only partially and noisily observed through $A$. Even though the constraint set may not necessarily be the intersection of the set of PSD (or Hermitian) matrices with linear spaces – such as in the definition of SDPs – in the following, a solution of (1.3) will be called a SDP estimator because, in all our examples, it will be the solution of a SDP. For the sake of generality, however, we will only assume minimal requirements on the shape of the constraint set $\mathcal{C}$. We also illustrate our results on a number of specific machine learning problems such as various forms of clustering problems and angular group synchronization. Three out of the four examples worked out here are concerned with real-valued matrices. Only the angular synchronization problem is approached using complex matrices.
2 Main general results for the statistical analysis of SDP estimators
From a statistical point of view, the task remains to estimate in the most efficient way the oracle $Z^*$, and to that end $\hat Z$ is our candidate estimator. The point of view we will use to evaluate how far $\hat Z$ is from $Z^*$ comes from the learning theory literature. We therefore see $\hat Z$ as an empirical risk minimization (ERM) procedure built on a single observation $A$, where the loss function is the linear one $Z \mapsto -\langle A, Z \rangle$, and the oracle $Z^*$ is indeed the one minimizing the risk function $Z \mapsto -\langle \mathbb{E}A, Z \rangle$ over $\mathcal{C}$. Having this setup in mind, we can use all the machinery developed in learning theory (see for instance [99, 63, 76, 97]) to obtain rates of convergence for the ERM (here $\hat Z$) toward the oracle (here $Z^*$).
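For convenience, this ERM framing can be written out explicitly as follows, with $A$ the observation, $\mathcal{C}$ the constraint set, $Z^*$ the oracle of (1.2) and $\hat Z$ the estimator of (1.3); the last quantity, the excess risk, is the one that the fixed point of Definition 1 below localizes.
\[
\ell_Z(A) = -\langle A, Z \rangle, \qquad R(Z) = \mathbb{E}\,\ell_Z(A) = -\langle \mathbb{E}A, Z \rangle,
\]
\[
Z^* \in \operatorname*{argmin}_{Z \in \mathcal{C}} R(Z), \qquad \hat Z \in \operatorname*{argmin}_{Z \in \mathcal{C}} \ell_Z(A), \qquad R(Z) - R(Z^*) = \langle \mathbb{E}A,\, Z^* - Z \rangle \ \ge\ 0 .
\]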
There is one key quantity driving the rate of convergence of the ERM: a fixed point complexity parameter. This type of parameter carries all the statistical complexity of the problem, and even though it is usually easy to set up, its computation can be tedious since it requires controlling, with large probability, the supremum of empirical processes indexed by “localized classes”. We now define this complexity fixed point for the problem we are considering here.
Definition 1.
Let . The fixed point complexity parameter at deviation is
(2.1) |
Fixed point complexity parameters have been extensively used in learning theory since the introduction of the localization argument [76, 64, 97, 13]. When they can be computed, they are preferred to the (global) analysis developed by Chervonenkis and Vapnik [99] to study ERM, since the latter analysis always yields slower rates, given that the Vapnik-Chervonenkis bound is a high probability bound on the non-localized empirical process , which is an upper bound for since . The gap between the global and the local analyses can be important, since fast rates cannot be obtained using the VC approach, whereas the localization argument resulting in fixed points such as the one in Definition 1 may yield fast rates of convergence or even exact recovery results.
An example of a Vapnik-Chervonenkis type of analysis of SDP estimators can be found in [54] for the community detection problem. An improvement of the latter approach has been obtained in [41] thanks to a localization argument – even though it is not stated in these words (we elaborate more on the two approaches from [54, 41] in Section 3). In short, a fixed point such as (2.1) is a sharp way to measure the statistical performance of ERM estimators, and in particular of the SDP estimators that we are considering here. They can even be proved to be optimal (in a minimax sense) when the noise is Gaussian [69] and under mild conditions on the structure of .
Before stating our general result, we first recall a definition of a minimal structural assumption on the constraint .
Definition 2.
We say that the set is star-shaped in when for all , the segment is in .
This is a pretty mild assumption satisfied, for instance, when is convex, which is the setup we will always encounter in practical applications, given that SDP estimators are usually introduced after a “convex relaxation” argument. Our main general statistical bound on SDP estimators is as follows.
Theorem 1.
We assume that the constraint is star-shaped in . Then, for all , with probability at least , it holds true that .
Theorem 1 applies to any type of setup where an oracle is estimated by an estimator such as in (1.3). Its result shows that is almost a maximizer of the true objective function over up to . In particular, when , is exactly a maximizer such as and, in that case, we can work with as if we were working with without any loss.
Theorem 1 may be applied in many different settings; in the following, we study four such instances. We will apply Theorem 1 (or one of its corollaries stated below) to some popular problems in the networks and graph signal processing literature, namely, community detection [43] (we will mostly revisit the results in [54] and [41] from our perspective), signed clustering [32], group synchronization [90] and MAX-CUT.
The proof of Theorem 1 is straightforward (mostly because the loss function is linear). Its importance stems from the fact that it puts forward two important concepts originally introduced in Learning Theory, namely that the complexity of the problem comes from the one of the local subset and that the “radius” of the localization is solution of a fixed point equation. For a setup given by a random matrix and a constraint , we should try to understand how these two ideas apply to obtain estimation properties of SDP estimators such as . That is to understand the shape of the local subsets and the maximal oscillations of the empirical process indexed by these local subsets. We will consider this task in three distinct problem instances. We now provide a proof for Theorem 1.
Proof of Theorem 1. Denote . Assume first that (the case is analyzed later). Let be the event on which, for all , if then . By definition of , we have .
Let be such that and define such that . We have and because is star-shaped in . Therefore, on the event , and so . It therefore follows that on the event , if is such that then
which implies that and therefore does not maximize over . As a consequence, we necessarily have on the event (which holds with probability at least ).
Let us now assume that . There exists a decreasing sequence of positive real numbers tending to such that for all , where is the event where for all ,
Since is star-shaped in , is a non-increasing function and so is a decreasing sequence (i.e. for all ). It follows that . Let us now place ourselves on the event . For all , since holds and , we can use the same argument as in the first case to conclude that . Since the latter inequality is true for all (on the event ) and tends to zero, we conclude that .
The main conclusion of Theorem 1 is that all the information for the problem of estimating via is contained in the fixed point . We therefore have to compute or upper bound such a fixed point. This might be difficult in great generality but there are some tools that can help to find upper bounds on .
A first approach is to understand the shape of the local sets , and to that end it is helpful to characterize the curvature of the excess risk around its maximizer . This type of local characterization of the excess risk is also a tool used in learning theory that goes back to classical conditions such as the margin assumption [96, 75] or the Bernstein condition [11]. The latter condition was initially introduced as an upper bound of the variance term by its expectation: for all for some absolute constant , but it has now been better understood as a way to discriminate the oracle from the other points in the model . These assumptions were global assumptions in the sense that they concern all points in . It has recently been shown [26] that only the local curvature of the excess risk needs to be understood. We now introduce this tool in our setup.
We characterize the local curvature of the excess risk by some function . Most of the time, this function is a norm, like the -norm, or a power of a norm, such as a squared norm. The radius defining the local subset on which we need to understand the curvature of the excess risk is also the solution of a fixed point equation:
(2.2) |
The difference between the two fixed points and is that the local subsets are not defined using the same proximity function to the oracle ; the first one uses the excess risk as a proximity function while the second one uses the function as a proximity function. The function should play the role of a simple description of the curvature of the excess risk function locally around ; that is formalized in the next assumption.
Assumption 1.
For all , if then .
Typical examples of curvature functions will have the form for some , and some norm . In that case, the parameter was initially called the margin parameter [95, 75]. Even though the relation given in Assumption 1 has typically been referred to as a margin condition or Bernstein condition in the learning theory literature, we will rather call it a local curvature assumption, following [54] and [26], since this type of relation describes the behavior of the risk function locally around its oracle. The main advantage of finding a local curvature function is that should be easier to compute than and , because of the definition of and (thanks to Assumption 1). We can therefore state the following corollary.
Corollary 1.
We assume that the constraint is star-shaped in and that the “local curvature” Assumption 1 holds for some . With probability at least , it holds true that
When it is possible to describe the local curvature of the excess risk around its oracle by some function and when some estimate of can be obtained, Corollary 1 applies, and estimation results of by (w.r.t. both the “excess risk” metric and the metric) follow. If not, either because understanding the local curvature of the excess risk or the computation of is difficult, it is still possible to apply Theorem 1 with the global VC approach, which boils down to simply upper bounding the fixed point used in Theorem 1 by a global parameter that is a complexity measure of the entire set :
(2.3) |
Interestingly, if the latter last-resort approach is used then, following the approach from [54], Grothendieck’s inequality [51, 85] appears to be a powerful tool to upper bound the right-hand side of (2.3), in the case of the community detection problem such as in [53] as well as in the MAX-CUT problem. Of course, when it is possible to avoid this ultimate global approach, one should do so, because the local approach will always provide better results.
Finally, proving a “local curvature” property such as in Assumption 1 may be difficult because it requires understanding the shape of the local subsets . It is however possible to simplify this assumption if getting estimation results of only w.r.t. the function (and not necessarily an upper bound on the excess risk ) is enough. In that case, Assumption 1 may be replaced by the following one.
Assumption 2.
For all , if then .
Assumption 2 assumes a curvature of the excess risk function in a neighborhood of , unlike Assumption 1 which grants this curvature in an “excess risk neighborhood”. The shape of a neighborhood defined by the function may be easier to understand (for instance, when the function is a norm, such a neighborhood is the ball of a norm centered at the oracle with radius ). In general, the latter assumption and Assumption 1 are not comparable. If Assumption 2 holds, then one can estimate the oracle w.r.t. the curvature function.
Theorem 2.
We assume that the constraint is star-shaped in and that the “local curvature” Assumption 2 holds for some . We assume that the function is continuous, and for all . With probability at least , it holds true that .
Proof of Theorem 2. Let . First assume that . Let be such that . Let . We have , and is continuous. Therefore, there exists such that . We let be such that . Since is star-shaped in and , we have . Moreover, . As a consequence, on the event such that for all if then , we have . The latter and Assumption 2 imply that, on ,
and so . Finally, using the definition of , we obtain
In particular, cannot be a maximizer of over and so necessarily, on the event , .
Let us now consider the case where . Using the same approach as in the proof of Theorem 1, we only have to check that the function
is non-increasing. Let . W.l.o.g. we may assume that there exists some such that and . If then . If then there exists such that for we have and . Moreover, and so . It follows that
where we used that for all because for all .
As a consequence, Theorem 1, Corollary 1 and Theorem 2 are the three tools at our disposal to study the performance of SDP estimators, depending on the depth of understanding we have of the problem. The best approach is given by Theorem 1 when it is possible to compute efficiently the complexity fixed point . If the latter approach is too complicated (likely because understanding the geometry of the local subset may be difficult), then one may resort to finding a curvature function of the excess risk locally around . In that case, both Corollary 1 and Theorem 2 may apply, depending on the difficulty of finding a local curvature function on an “excess risk neighborhood” (see Assumption 1) or a “-neighborhood” (see Assumption 2). Finally, if no local approach can be handled (likely because describing the curvature of the excess risk in any neighborhood of , or controlling the maximal oscillations of the empirical process locally, is too difficult), then one may ultimately resort to a global approach, which follows from Theorem 1 as explained in (2.3). In the following, we will use these tools for various problems.
Results like Theorem 1, Corollary 1 and Theorem 2 appeared in many papers on ERM in learning theory, such as [64, 11, 76, 69]. In all these results, the typical loss functions, such as the quadratic or logistic loss functions, were not linear like the one we are using here. From that point of view, our problem is easier, and this can be seen from the simplicity of the proofs of our three general results in this section. What is much more complicated here than in other, more classical problems in learning theory is the computation of the fixed point , because 1) the stochastic process may be far from being a Gaussian process if the noise matrix is complicated and 2) the local sets or for may be very hard to describe in a simple way. Fortunately, we will rely on several previous papers such as [41] to solve such problems.
3 Revisiting two results from the community detection literature [41, 54]
The rapid growth of social networks on the Internet has led many statisticians and computer scientists to focus their research on data coming from graphs. One important topic that has attracted particular interest during the last decades is that of community detection [43, 86], where the goal is to recover mesoscopic structures in a network, the so-called communities. A community consists of a group of nodes that are relatively densely connected to each other, but sparsely connected to other dense groups present within the network. The motivation for this line of work stems not only from the fact that finding communities in a network is an interesting and challenging problem of its own, as it leads to understanding structural properties of networks, but community detection is also used as a data pre-processing step for other statistical inference tasks on large graphs, as it facilitates parallelization and allows one to distribute time-consuming processes on several smaller subgraphs (i.e., the extracted communities).
One challenging aspect of the community detection problem arises in the setting of sparse graphs. Many of the existing algorithms, which enjoy theoretical guarantees, do so in the relatively dense regime for the edge sampling probability, where the expected average degree is of the order . The problem becomes challenging in very sparse graphs with bounded average degree. To this end, Guédon and Vershynin proposed a semidefinite relaxation for a discrete optimization problem [54], an instance of which encompasses the community detection problem, and showed that it can recover a solution with any given relative accuracy even in the setting of very sparse graphs with average degree of order .
A subset of the existing literature for community detection and clustering relies on spectral methods, which consider the adjacency matrix associated to a graph, and employ its eigenvalues, and especially eigenvectors, in the analysis process or to propose efficient algorithms to solve the task at hand. Along these lines, Can et al. [68] proposed a general framework for optimizing a general function of the graph adjacency matrix over discrete label assignments by projecting onto a low-dimensional subspace spanned by vectors that approximate the top eigenvectors of the expected adjacency matrix. The authors consider the problem of community detection with communities, which they frame as an instance of their proposed framework, combined with a regularization step that shifts each entry in the adjacency matrix by a small constant , which renders their methodology applicable in the sparse regime as well.
In the remainder of this section, we focus on the community detection problem on random graphs under the general stochastic block model. We will mostly revisit the work in [54] and [41] from the perspective given by Theorem 1, which simplifies the proof in [41] since the peeling argument is no longer required, and neither is the a priori upper bound from [54] that [41] relies on, thanks to the homogeneity argument hidden in Theorem 1 (which underlies the localization argument).
We first recall the definition of the generalized stochastic block model (SBM). We consider a set of vertices , and assume it is partitioned into communities of arbitrary sizes .
Definition 3.
For any pair of nodes , we denote by when and belong to the same community (i.e., there exists such that ), and we denote by if and do not belong to the same community.
For each pair of nodes from , we draw an edge between and with a fixed probability independently from the other edges. We assume that there exist numbers and satisfying , such that
(3.1) |
We denote by the observed symmetric adjacency matrix, such that, for all , is distributed according to a Bernoulli of parameter . The community structure of such a graph is captured by the membership matrix , defined by if , and otherwise. The main goal in community detection is to reconstruct from the observation .
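As an illustration only (this is not part of the original analysis), a minimal sampler for the homogeneous two-parameter version of this model, with within-community edge probability p and across-community probability q, can be written as follows; it returns both the adjacency matrix and the membership matrix.

```python
import numpy as np

def sample_sbm(community_sizes, p, q, rng=None):
    """Draw (A, Zbar) from the homogeneous stochastic block model sketched
    above: an edge appears with probability p inside a community and with
    probability q across communities; Zbar is the membership matrix
    (entry 1 when the two nodes are in the same community)."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.repeat(np.arange(len(community_sizes)), community_sizes)
    same = labels[:, None] == labels[None, :]
    n = labels.size
    upper = np.triu(rng.random((n, n)) < np.where(same, p, q), k=1)
    A = (upper | upper.T).astype(int)      # symmetric adjacency, zero diagonal
    return A, same.astype(int)
```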
Spectral methods for community detection are very popular in the literature [54, 41, 100, 15, 29]. There are many ways to introduce such methods, one of which being via convex relaxations of certain graph cut problems aiming to minimize a modularity function such as the RatioCut [80]. Such relaxations often lead to SDP estimators, such as those introduced in Section 1.
Considering a random graph distributed according to the generalized stochastic block model, and its associated adjacency matrix (i.e. and for and as defined in (3.1)), we will estimate its membership matrix via the following SDP estimator
where and denotes the number of nonzero elements in the membership matrix . The motivation for this approach stems from the fact that the membership matrix is actually the oracle, i.e., (see Lemma 7.1 in [54] or Lemma 1 below), where
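For concreteness, here is a hedged cvxpy sketch of an estimator of this type; it assumes the constraint set described in the prose above (PSD matrices with entries in [0,1] and total sum equal to the number of nonzero entries of the membership matrix, denoted `lam` below and assumed known), which is an illustrative reading of [54] rather than a verbatim reproduction.

```python
import cvxpy as cp

def community_sdp(A, lam):
    """Sketch of the SDP estimator for community detection: maximize <A, Z>
    over {Z PSD, 0 <= Z_ij <= 1, sum_ij Z_ij = lam}, with lam assumed known."""
    n = A.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    constraints = [Z >> 0, Z >= 0, Z <= 1, cp.sum(Z) == lam]
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(A, Z))), constraints).solve(solver=cp.SCS)
    return Z.value
```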
Following the strategy from Theorem 1 and from our point of view, the upper bound on from [54] is the one that is based on the global approach – that is, without localization. Indeed, [54] uses the observation that, for all , it holds true that
(3.2) |
where is the cut-norm (footnote: the cut-norm of a real matrix with a set of rows indexed by and a set of columns indexed by , is the maximum, over all and , of the quantity ; it is also the operator norm of from to and the “injective norm” in the original Grothendieck “résumé” [52, 85]) and is the Grothendieck constant (Grothendieck’s inequality is used in (b), see [85, 100]). Therefore, the localization around the oracle by the excess risk “band” is simply removed in inequality (a). As a consequence, the resulting statistical bound is based on the complexity of the entire class whereas, in a localized approach, only the complexity of matters. The next step in the proof of [54] is a high probability upper bound on the cut-norm of the noise, which follows from Bernstein’s inequality and a union bound since one has ; then for all , with probability at least , where . The resulting upper bound on the fixed point obtained in [54] is,
(3.3) |
Finally, under the assumption of Theorem 1 in [54] (i.e., for some , , , and ), for we obtain (using the general result in Theorem 1), with probability at least , the bound , which is the result from Theorem 1 in [54]. In addition, [54] uses a (global) curvature property of the excess risk in its Lemma 7.2:
Lemma 1 (Lemma 7.2 in [54]).
For all , .
Therefore, a (global – that is, for all ) curvature assumption holds for a function which is here the norm, a margin parameter and for the community detection problem. However, this curvature property is not used to compute a “better” fixed point parameter but only to obtain an estimation bound since
The latter bound, together with the sin-Theta theorem, allows the authors of [54] to obtain an estimation bound for the community membership vector .
The approach from [41] improves upon the one in [54] because it uses a localization argument: the curvature property of the excess risk function from Lemma 1 is used to improve the upper bound in (3.3) obtained following a global approach. Indeed, the authors of [41] obtain a high probability upper bound on the quantity
depending on . This yields an exact reconstruction result in the “dense” case and exponentially decaying rates of convergence in the “sparse” case. This is a typical example where the localization argument shows its advantage over the global approach. The price to pay is usually a more technical proof for the local approach compared with the global one. However, the argument from [41] also uses an unnecessary peeling argument together with an unnecessary a priori upper bound on (which is actually the one from [54]). It appears that this peeling argument and this a priori upper bound can be avoided thanks to our approach from Theorem 1. This improves the probability estimate and simplifies the proofs (since neither the result from [54] nor the peeling argument is required anymore). For the signed clustering problem we consider below as an application of our main results, we will mostly adapt the probabilistic tools from [41] (in the “dense” case) to the methodology associated with Theorem 1 (without these two unnecessary arguments).
4 Contributions of the paper
4.1 Application to signed clustering
Much of the clustering literature, including both spectral and non-spectral methods, has focused on unsigned graphs, where each edge carries a non-negative scalar weight that encodes a measure of affinity (similarity, trust) between pairs of nodes. However, in numerous instances, the above-mentioned affinity takes negative values, and encodes a measure of dissimilarity or distrust. Such applications arise in social networks where users’ relationships denote trust-distrust or friendship-enmity, bipartite shopping networks which capture like-dislike relationships between users and products [10], online news and review websites, such as Epinions [1] and Slashdot [2], that allow users to approve or denounce others [71], and clustering of financial or economic time series data [4]. Such applications have spurred interest in the analysis of signed networks, which has recently become an increasingly important research topic [72], with relevant lines of work in the context of clustering signed networks including, in chronological order, [65, 24, 32].
The second application of our proposed methodology is an extension of the community detection and clustering problems to the setting of signed graphs, where, for simplicity, we assume that an edge connecting two nodes can take either +1 or -1 values.
4.1.1 A Signed Stochastic Block Model (SSBM)
We focus on the problem of clustering K-weakly balanced graphs (a signed graph is K-weakly balanced if and only if all the edges are positive, or the nodes can be partitioned into disjoint sets such that positive edges exist only within clusters, and negative edges are only present across clusters [37]). We consider a signed stochastic block model (SSBM) similar to the one introduced in [32], where we are given a graph with nodes which are divided into communities, , such that, in the noiseless setting, edges within each community are positive and edges between communities are negative.
The only information available to the user is given by a sparse adjacency matrix constructed as follows: is symmetric, with for all , and for all , where
for some and . Moreover, all the variables for are independent.
We remark that this SSBM model is similar to the one considered in [32], which was governed by two parameters, the sampling probability as above, and the noise level , which may flip entries of the adjacency matrix.
Our aim is to recover the community membership matrix or cluster matrix where when and when using only the observed censored adjacency matrix .
Our approach is similar in nature to the one used by spectral methods in community detection. We first observe that for and we have where
(4.1) |
Since we do not know and , we should estimate both of them. We will estimate with but, for simplicity, we will assume that is known. The resulting estimator of the cluster matrix is
(4.2) |
which is indeed a SDP estimator and therefore Theorem 1 (or Corollary 1 and Theorem 2) may be used to obtain statistical bounds for the estimation of from (4.1) by .
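Below is a cvxpy sketch in the spirit of (4.2); the exact constraint set of the paper is not reproduced here, and both the set used (PSD, unit diagonal, entries in [0,1], which contains the cluster matrix) and the input matrix B are illustrative assumptions.

```python
import cvxpy as cp

def signed_clustering_sdp(B):
    """Sketch of an estimator in the spirit of (4.2): maximize <B, Z> over an
    illustrative constraint set containing the cluster matrix.  B stands for
    the observed (possibly debiased/rescaled) signed adjacency matrix."""
    n = B.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    constraints = [Z >> 0, cp.diag(Z) == 1, Z >= 0, Z <= 1]
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(B, Z))), constraints).solve(solver=cp.SCS)
    return Z.value
```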
We will use the following notations: , , , , for all , for all , , and . We also use the notation to denote absolute constants whose values may change from one line to another.
4.1.2 Main result for the estimation of the cluster matrix in signed clustering
Our main result concerns the reconstruction of the communities from the observation of the matrix . In order to avoid solutions with some communities of degenerate size (too small or too large), we consider the following assumption.
Assumption 3.
The elements of the partition of are of the same size up to constants: there are absolute constants such that for any , .
We are now ready to state the main result on the estimation of the cluster matrix from (4.1) by the SDP estimator from (4.2).
Theorem 3.
There is an absolute positive constant such that the following holds. Grant Assumption 3. Assume that
(4.3) |
(4.4) |
(4.5) |
Then, with probability at least , exact recovery holds true, i.e.,
Therefore, we have exact reconstruction in the dense case (that is, under assumption (4.3)), which stems from condition (4.4). The latter condition is in the same spirit as the one in Theorem 1 of [41]: it measures the SNR (signal-to-noise ratio) of the model, which captures the hardness of the SSBM. As mentioned in [41], it is related to the Kesten-Stigum threshold [78]. The last condition (4.5) basically requires that the number of clusters is at most . If this condition is dropped, then we no longer have exact reconstruction but only a certified rate of convergence: with probability at least ,
This shows that there is a phase transition in the dense case: exact reconstruction is possible when and, otherwise, when we only have a control of the estimation error.
4.2 Application to angular group synchronization
In this section, we introduce the group synchronization problem as well as a stochastic model for this problem. We consider a SDP relaxation of the original problem (which is exact) and construct the associated SDP estimator as in (1.3).
The angular synchronization problem consists of estimating unknown angles (up to a global shift angle) given a noisy subset of their pairwise offsets , where is the modulo operation. The pairwise measurements can be realized as the edge set of a graph , typically modeled as an Erdös-Renyi random graph [90].
The aim of this section is to show that the angular synchronization problem can be analyzed using our methodology. In order to keep the presentation as simple as possible, we assume that all pairwise offsets are observed up to some Gaussian noise: we are given for all where are i.i.d. standard Gaussian variables and is the noise variance. We may rewrite the problem as follows: we observe a complex matrix defined by
(4.6) |
Here, denotes the imaginary number such that , , , denotes the conjugate vector of and is the element-wise product. In particular, is a Hermitian matrix (i.e. ) and for and if . We want to estimate (up to a global shift) from the matrix of data . Unlike in classical statistical models, the noise here is multiplicative; we show that our approach covers this type of problem as well.
The first step is to find a (vectorial) optimization problem whose solutions are given by (up to a global angle shift) or some bijective function of it. Estimating up to a global angle shift is equivalent to estimating the vector . The latter is, up to a global rotation of its coordinates, the unique solution of the following maximization problem
(4.7) |
A proof of (4.7) is given in Section 6. Let us now rewrite (4.7) as a SDP problem. For all , we have where and where is the set of all Hermitian matrices and is the vector with all coordinates equal to . It is therefore straightforward to construct a SDP relaxation of (4.7) by dropping the rank constraint. It appears that this relaxation is exact since, for ,
(4.8) |
where . A proof of (4.8) can be found in Section 6. Finally, as we only observe , we consider the following SDP estimator of
(4.9) |
In the next section, we use the strategy from Corollary 1 to obtain statistical guarantees for the estimation of by .
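As an illustration of (4.9), here is a cvxpy sketch over complex Hermitian matrices; it assumes the constraint set described before (4.8), namely Hermitian PSD matrices with unit diagonal, and takes as input the noisy offset matrix of (4.6).

```python
import cvxpy as cp
import numpy as np

def synchronization_sdp(A):
    """Sketch of the SDP estimator (4.9): maximize Re<A, Z> over complex
    Hermitian PSD matrices with unit diagonal."""
    n = A.shape[0]
    Z = cp.Variable((n, n), hermitian=True)
    objective = cp.Maximize(cp.real(cp.trace(A.conj().T @ Z)))   # Re <A, Z>
    cp.Problem(objective, [Z >> 0, cp.diag(Z) == 1]).solve(solver=cp.SCS)
    return Z.value
```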
Intuitively, the above maximization problem (4.8) attempts to preserve the given angle offsets as best as possible
(4.10) |
where the objective function is incremented by whenever an assignment of angles and perfectly satisfies the given edge constraint (i.e., for a clean edge for which ), while the contribution of an incorrect assignment (i.e., of a very noisy edge) will be almost uniformly distributed on the unit circle in the complex plane. Due to the non-convexity of the optimization in (4.10), it is difficult to solve computationally [109]; one way to overcome this problem is to consider the SDP relaxation from (4.8), but it is also possible to consider a spectral relaxation such as the one proposed by Singer [90], which replaces the individual constraints that all ’s should have unit magnitude by the much weaker single constraint , leading to
(4.11) |
The solution to the resulting maximization problem is simply given by a top eigenvector of the Hermitian matrix , followed by a normalization step. We remark that the main advantage of the SDP relaxation (4.8) is that it explicitly imposes the unit magnitude constraint for , which we cannot otherwise enforce in the spectral relaxation solved via the eigenvector method in (4.11). The above SDP program (4.8) is very similar to the well-known Goemans-Williamson SDP relaxation for the seminal MAX-CUT problem of finding the maximal cut of a graph (the MAX-CUT problem is one of the four applications considered in this work, see Section 4.3), with the main difference being that here we optimize over the cone of complex-valued Hermitian positive semidefinite matrices, not just real symmetric matrices.
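For comparison, a sketch of the eigenvector method (4.11) takes only a few lines of numpy; the same extraction (leading eigenvector followed by an entrywise normalization) is what Corollary 2 below applies to the SDP solution.

```python
import numpy as np

def spectral_synchronization(A):
    """Sketch of the spectral relaxation (4.11): take a top eigenvector of the
    Hermitian data matrix and project its entries onto the unit circle."""
    _, vecs = np.linalg.eigh(A)              # eigenvalues in ascending order
    v = vecs[:, -1]                          # eigenvector of the largest eigenvalue
    v = v / np.maximum(np.abs(v), 1e-12)     # entrywise normalization
    return np.angle(v)                       # estimated angles, up to a global shift
```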
4.2.1 Main results for phase recovery in the synchronization problem
Our main result concerns the estimation of the matrix of offsets from the observation of the matrix . This result is then used to estimate (up to global phase shift) the angular vector .
Theorem 4.
Let . If then, with probability at least , it holds true that
(4.12) |
Once we have an estimator for the oracle , we can extract an estimator for the vector of phases by considering a top eigenvector (i.e. an eigenvector associated with the largest eigenvalue) of . It is then possible to quantify the estimation properties of by using a sin-theta theorem and Theorem 4.
Corollary 2.
Let be a top eigenvector of with Euclidean norm . Let and assume that . There exists a universal constant (which is the constant in the Davis-Kahan theorem for Hermitian matrices) such that, with probability at least , it holds true that
(4.13) |
It follows from Corollary 2 that we can estimate (up to a global rotation ) with a -estimation error of the order of with exponential deviations. Given that , this means that a constant proportion of the entries are well estimated. For values of , the rate of estimation is like ; we therefore get a much better estimation of , but only with constant probability. It is important to recall that and can both be efficiently computed by solving a SDP problem and then considering a top eigenvector of its solution (for instance, using the power method).
4.3 Application to the MAX-CUT problem
Let be the adjacency (symmetric) matrix of an undirected graph , where is the set of vertices and the set of edges is where and . We assume that has no self-loops, so that for all . A cut of is any subset of vertices in . For a cut , we define its weight by , that is, the number of edges in between and its complement . The MAX-CUT problem is to find the cut with maximal weight:
(4.14) |
The MAX-CUT problem is an NP-complete problem, but [49] constructed an approximating solution via a SDP relaxation. Indeed, one can write the MAX-CUT problem in the following way. For a cut , we define the membership vector associated with it by setting if and if for all . The weight of the cut can be written in terms of this vector, and so solving (4.14) is equivalent to solving
(4.15) |
Since , the latter problem is also equivalent to solve
(4.16) |
which admits a SDP relaxation by removing the rank constraint. This yields the following SDP relaxation problem of MAX-CUT from [49]:
(4.17)
\[
\max\Big\{ \tfrac{1}{4}\,\big\langle A, \mathbf{1}\mathbf{1}^\top - Z \big\rangle \ :\ Z \in \mathcal{E} \Big\},
\]
where $\mathcal{E} := \{ Z \in \mathbb{R}^{n \times n} : Z = Z^\top,\ Z \succeq 0,\ Z_{ii} = 1 \text{ for all } i \}$ is the elliptope and $\mathbf{1}$ is the all-ones vector.
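A cvxpy sketch of the relaxation (4.17) is given below; in the partially observed setting introduced next, one would feed it the (suitably rescaled) masked observation rather than the full adjacency matrix.

```python
import cvxpy as cp
import numpy as np

def maxcut_sdp(A):
    """Sketch of the Goemans-Williamson relaxation (4.17):
    maximize (1/4) <A, 1 1^T - Z> over the elliptope {Z PSD, Z_ii = 1}."""
    n = A.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    objective = cp.Maximize(0.25 * cp.sum(cp.multiply(A, np.ones((n, n)) - Z)))
    cp.Problem(objective, [Z >> 0, cp.diag(Z) == 1]).solve(solver=cp.SCS)
    return Z.value
```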
Unlike the other examples from the previous sections, the SDP relaxation in (4.17) is not exact, except for bipartite graphs; see [62, 44] for more details. Nevertheless, thanks to the approximation result from [49], we can use our methodology to estimate and then deduce some approximate optimal cut. But first, we introduce a stochastic model, because in many situations the adjacency matrix is only partially observed, and it might still be interesting to find an approximating solution to the MAX-CUT problem.
We observe a “masked” version of , where is symmetric with upper triangular matrix filled with i.i.d. Bernoulli entries: for all such that , where is a family of i.i.d. Bernoulli random variables with parameter . Let so that . We can write as an oracle since and so we estimate via the SDP estimator . Our first aim is to quantify the cost we pay by using instead of in our final choice of cut. It appears that the fixed point used in Theorem 1 may be used to quantify this loss:
(4.18) |
Our second result is an explicit high probability upper bound on the latter fixed point.
4.3.1 Main results for the MAX-CUT problem
In this section, we gather the two results on the estimation of from and on the approximate optimality of the final cut constructed from . Let us now explicitly provide the construction of this cut. We consider the same strategy as in [49]. Assume that has been constructed. Let be a centered Gaussian vector with covariance matrix . Let be the sign vector of . Using the statistical properties of , it is possible to prove near optimality of .
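The rounding step just described can be sketched as follows; the small ridge added to the covariance matrix is a numerical safeguard of this sketch, not part of the analysis.

```python
import numpy as np

def gw_rounding(Z_hat, rng=None):
    """Randomized rounding of [49]: draw g ~ N(0, Z_hat) and output sign(g)."""
    rng = np.random.default_rng() if rng is None else rng
    n = Z_hat.shape[0]
    cov = Z_hat + 1e-9 * np.eye(n)     # tiny ridge, for numerical PSD-ness only
    g = rng.multivariate_normal(np.zeros(n), cov)
    x = np.sign(g)
    x[x == 0] = 1                      # break (measure-zero) ties
    return x
```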
We denote the optimal values of the maxcut problem and its SDP relaxation by
where is a solution of (4.14) and . Our first result shows how the approximation result from [49] is downgraded by the incomplete information we have on the graph (we only partially observe the adjacency matrix via ).
Theorem 5.
For all . With probability at least (with respect to the masked ),
To make the notation precise, is the sign vector of , which is a centered Gaussian variable with covariance . In that context, is the conditional expectation according to for a fixed . Moreover, the probability bound that we obtain is w.r.t. the mask, that is, w.r.t. the randomness in .
Let us put Theorem 5 into some perspective. If we had known the entire adjacency matrix (which is the case when ), then we could have used instead of . In that case, for the sign vector of , we know from [49] that
(4.19) |
Therefore, Theorem 5 characterizes the price we pay for not observing the entire adjacency matrix but only a masked version of it. It is an interesting output of Theorem 5 that the fixed point measures this loss in a quantitative way. If we were able to identify scenarios of and for which , that would prove that there is no loss for partially observing in the MAX-CUT problem. Unfortunately, the approach we use to control is the global one, which does not allow for exact reconstruction (that is, to show that ).
Let us now turn to an estimation result of by via an upper bound on .
Theorem 6.
With probability at least :
In particular, it follows from the approximation result from Theorem 5 and the high probability upper bound on from Theorem 6 that, with probability at least ,
(4.20) |
This result is nontrivial only when the right-hand side term is strictly larger than , which is the performance of a random cut. As a consequence, (4.20) shows that one can still do better than randomness, even in an incomplete information setup for the MAX-CUT problem, when , and are such that
For instance, when is like a constant, it requires to be larger than (for some absolute constant ) and when , it requires to be at least (for some absolute constant ).
5 Proof of Theorem 3 (signed clustering)
The aim of this paper is to put forward a methodology developed in learning theory for the study of SDP estimators. In each example, we follow this methodology. For a problem, such as signed clustering, where it is possible to characterize the curvature of the excess risk, we start by identifying this curvature, because the curvature function coming out of it defines the local subsets of driving the complexity of the problem. Then, we turn to the stochastic part of the proof, which is entirely summarized in the complexity fixed point from (2.2). Finally, we put the two pieces together and apply the main general result from Corollary 1 to obtain estimation results for the SDP estimator (4.2) in the signed clustering problem; this is Theorem 3.
5.1 Curvature equation
In this section, we show that the objective function satisfies a curvature assumption around its maximizer with respect to the -norm given by with parameter (and margin exponent ).
Proposition 1.
For , we have for all , .
Proof.
Let be in . We have
Moreover, for all and , so . We also have for all , because in that case and . Hence,
5.2 Computation of the complexity fixed point
Define the noise matrix of the problem. Since is symmetric, its entries are not independent. In order to work only with independent random variables, we define the following matrix :
(5.1) |
where entries are considered as independent Bernoulli variables with parameter and therefore, has independent entries, and satisfies the relation .
In order to obtain upper bounds on the fixed point complexity parameter associated with the signed clustering problem, we need to prove a high probability upper bound on the quantity
(5.2) |
and then find a radius as small as possible such that the quantity in (5.2) is less than . We denote where is the unit -ball of .
We follow the strategy from [41] by decomposing the inner product into two parts according to the SVD of . This observation is a key point in the work of [41] compared with the analysis from [54]; it allows one to perform the localization argument efficiently. Up to a reindexing of the nodes, is a block matrix with diagonal blocks of ’s. It therefore admits singular vectors with multiplicity associated with the singular value for all . We can therefore write
where has column vectors given by and . We define the following projection operator
and its orthogonal projection by
where are such that is an orthonormal basis of .
We use the same decomposition as in [41]: for all ,
The next step is to control with large probability the two terms and uniformly over all . To that end, we use the two following propositions, where we recall that and . The proofs of Propositions 2 and 3 can be found in Section 8; they are based on [41].
Proposition 2.
There are absolute positive constants and such that the following holds. If then we have
Proposition 3.
There exists an absolute constant such that the following holds. When , with probability at least ,
It follows from Proposition 2 and Proposition 3 that when , for all such that we have, for , with probability at least ,
Moreover, we have
(5.3) |
for when and . In particular, when and , we conclude that for all (5.3) is true. Therefore, one can take meaning that we have exact reconstruction of : if and then with probability at least , .
6 Proofs of Theorem 4 and Corollary 2 (angular synchronization)
Proof of (4.7): For all we have for all if and only if for all . We therefore have
(6.1) |
Moreover, for all such that for , we have
where denotes the real part of . On the other side, we have
Hence, minimizing over all such that is equivalent to maximizing over all such that . This concludes the proof with (6.1).
Proof of (4.8): Let . We first prove that . Let . Since , there exists such that . For all , denote by the -th row vector of . We have since . Moreover, for all , we have . This proves that and so .
Let . Since is convex and the objective function is linear (for real coefficients), is one of the extreme points of . Extreme points of are matrices such that for all . We can then write each entry of as for some and now we obtain
The maximal value is attained only for for all , that is for . But we have and , so is the only maximizer of on . But for all we have , then is the only maximizer of over .
6.1 Curvature of the objective function
Proposition 4.
For , we have for all , .
Proof.
Let where and for all . Since for all , we have, on one side, , and so
(6.2) |
On the other side, we proved in the proof of (4.8) that . So we have for all and
(6.3) |
In fact, it follows from the proof of Proposition 4 that we have the following equality: for all ,
where (in particular, ). We therefore know exactly how to characterize the curvature of the excess risk for the angular synchronization problem in terms of the (to the square) and the norms. Nevertheless, we will not use the extra term in the following.
6.2 Computation of the complexity fixed point
It follows from the (global) curvature property of the excess risk for the angular synchronization problem obtained in Proposition 4 that, for the curvature function defined by , we just have to compute the fixed point and then apply Corollary 1 in order to obtain statistical properties of (w.r.t. both the excess risk and the function). In this section, we compute the complexity fixed point for .
Following Proposition 4, the natural “local” subsets of around which drive the statistical complexity of the synchronization problem are defined for all by .
Let . Denote by (resp. ) the real (resp. imaginary) part of for all . Since we also have and so
where we used that for . Now it remains to get a high probability upper bound on the sum of the centered cosines of . We use Bernstein’s inequality (see Equation (7.3) below) to get such a bound. For all , with probability at least ,
for and (because when ).
We now have all the ingredients to compute the fixed point for : for and ,
In particular, using and for (where ) for some , if then .
6.3 End of the proof of Theorem 4 and Corollary 2: application of Corollary 1
Take (for ), we have when (which holds for instance when ) and so it follows from Corollary 1 (together with the curvature property in Section 6.1 and the computation of the fixed point from Section 6.2), that with probability at least , , which is the statement of Theorem 4.
Proof of Corollary 2: The oracle is the rank one matrix which has largest eigenvalue with associated eigenspace . In particular, has a spectral gap . Let be a top eigenvector of with norm . It follows from the Davis-Kahan theorem (see, for example, Theorem 4.5.5 in [100] or Theorem 4 in [101]) that there exists a universal constant such that
where is the spectral gap of . We conclude the proof of Corollary 2 using the upper bound on from Theorem 4.
7 Proofs of Theorem 5 and 6 (MAX-CUT)
In this section, we prove the two main results from Section 4.3 using our general methodology for Theorem 6 and the technique from [49] for Theorem 5.
7.1 Proof of Theorem 5
The proof of Theorem 5 follows the one from [49] up to a minor modification due to the fact that we use the SDP estimator instead of the oracle . It is based on two tools. The first one is Grothendieck’s identity: let and , we have:
(7.1) |
and the identity: for all
(7.2) |
7.2 Proof of Theorem 6
For the MAX-CUT problem, we do not use any localization argument; we therefore use the (likely sub-optimal) global approach. The methodology is very close to the one used in [54] for the community detection problem. In particular, we use both the Bernstein and Grothendieck inequalities to compute a high probability upper bound for the global complexity parameter. We recall these two tools now. First, Bernstein's inequality: if $X_1, \ldots, X_N$ are independent centered random variables such that $|X_i| \le M$ a.s. for all $i$, then for all $t > 0$, with probability at least $1 - e^{-t}$,
(7.3)
\[
\sum_{i=1}^N X_i \le \sigma \sqrt{2t} + \frac{Mt}{3},
\]
where $\sigma^2 = \sum_{i=1}^N \mathbb{E} X_i^2$. The second tool is Grothendieck's inequality [52] (see also [85] or Theorem 3.4 in [54]): if $B \in \mathbb{R}^{n \times n}$ then
(7.4)
\[
\max\big\{ \langle B, Z \rangle \ :\ Z \succeq 0,\ Z_{ii} \le 1 \text{ for all } i \big\} \ \le\ K_G \, \|B\|_{\infty \to 1},
\]
where $\|B\|_{\infty \to 1} = \max_{s, t \in \{-1, 1\}^n} \sum_{i,j} B_{ij} s_i t_j$ and $K_G$ is an absolute constant, called the Grothendieck constant.
In order to apply Theorem 1, we just have to compute the fixed point . As announced, we use the global approach and Grothendieck inequality (7.4) to get
(7.5) |
because . It follows from Bernstein’s inequality (7.3) and a union bound that for all , with probability at least ,
Therefore, for , with probability at least ,
for . Then the result follows from Theorem 1.
8 Annex 1: Signed clustering
8.1 Proof of Equation (4.1)
We recall that the cluster matrix is defined by if and when and . For any matrix , we have
The latter quantity is maximal for such that for and for , that is when . As a consequence . Moreover, so we also have that is the only solution to the problem and so .
8.2 Proof of Proposition 2: control of adapted from [41]
The noise matrix is symmetric and has been decomposed as where has been defined in (5.1). For all , we have
(8.1) |
An upper bound on follows from an upper bound on the three terms in the right side of (8.2). Let us show how to bound the first term. Similar arguments can be used to control the other two terms.
Let . Let us find a high probability upper bound on the term uniformly over . For all , and , we have
Therefore, given the ’s are all equal for . We can therefore fix some arbitrary index and have for all . Moreover, is a family of independent random variables. We now have
which is a weighted sum of independent centered random variables with weights for . We now identify some properties of the weights .
The weights are such that
which is equivalent to saying that the weights vector is in the -ball . It is also in the unit -ball since for all and ,
We therefore have and so
(8.2) |
It remains to find a high probability upper bound on the latter term. We can use the following lemma to that end.
Lemma 1.
Let for . For all , if then with probability at least ,
Proof of Lemma 1. Let and assume that . We denote by (resp. ) the non-decreasing rearrangement of (resp. ) for . We have
Moreover, for all , using a union bound, we have
Let us now control each term of the latter sum thanks to Bernstein inequality. The random variables are independent with variances at most since when and for . Moreover, when and for because . It follows from Bernstein’s inequality that for all satisfying and that
when . Therefore, with probability at least
when
which is a non-vacuous condition since . The result follows, in the case , by taking and using that when .
For , we have
and using Bernstein's inequality as above we get that with probability at least , when , which is a non-vacuous condition when . By taking , we obtain that for all , if then with probability at least ,
Using the same methodology, we can prove exactly the same result for . We can also use the same method to upper bound ; we simply have to check that the weights vector where is also in for any such that . This is indeed the case since we have for all , and , which is therefore constant for all elements in . Therefore, we have
and for all and ,
Therefore, and we obtain exactly the same upper bound for the three terms in (8.1). This concludes the proof of Proposition 2.
8.3 Proof of Proposition 3: control of the term from [41]
In this section, we prove Proposition 3. We follow the proof from [41] but we only consider the “dense case” which is when – we recall that . For a similar uniform control of in the “sparse case ”, when for some absolute constant , we refer the reader to [41].
For all , we have because, by construction of the projection operator, . Therefore, where denotes the nuclear norm (i.e. the sum of the singular values) and denotes the operator norm (i.e. the largest singular value). In the following Lemma 2, we prove an upper bound for and then we obtain a high probability upper bound on .
Lemma 2.
For all , we have
Proof.
Since , the same holds for , and therefore . Next, we bound the trace of .
Since is a projection operator, it is symmetric and , moreover, when so
where we used in (i) that for and in the same community, we have and , thus . Finally, when is in the localized set , we have , which concludes the proof.
Now, we obtain a high probability upper bound on . In the following, we apply this result in the “dense case” (i.e. ) to get the uniform bound on over .
Lemma 3 (Lemma 4 in [41]).
With probability at least , .
Proof.
Let be an independent copy of and be a random symmetric matrix independent from both and whose sub-diagonal entries are independent Rademacher random variables. Using a symmetrization argument (see Chapter 2 in [64] or Chapter 2.3 in [98]), we obtain for ,
where is the entry-wise matrix multiplication operation, (i) comes from Jensen's inequality, (ii) holds since and are identically distributed and (iii) is the triangle inequality. Next, we obtain an upper bound on .
We define the family of independent random variables where for all
(8.3) |
We also put for all . It is straightforward to see that and have the same distribution. As a consequence, and have the same distribution, where is a symmetric matrix with for . An upper bound on follows from the next result due to [9].
Theorem 7 (Corollary 3.6 in [9]).
Let be independent symmetric random variables with unit variance and be a family of real numbers. Let be the random symmetric matrix defined by for all . Let . Then, for any ,
where, for any , denotes the -norm.
Since are independent and symmetric such that , we can apply Theorem 7 to upper bound . We have for any and . It therefore follows from Theorem 7 applied with that
(8.4) |
The final step in the proof of Lemma 3 is a concentration argument showing that is close to its expectation with high probability. To that end, we rely on a general result for Lipschitz and separately convex functions from [16]. We first recall that a real-valued function of variables is said to be separately convex when, for every , it is a convex function of the -th variable once the remaining variables are held fixed.
Theorem 8 (Theorem 6.10 in [16]).
Let be a convex compact set in with diameter . Let be independent random variables taking values in . Let be a separately convex and -Lipschitz function, w.r.t. the -norm (i.e. for all ). Then satisfies, for all , with probability at least , .
We apply Theorem 8 to , where is -Lipschitz w.r.t. the -norm (for ) and separately convex, and is a family of independent random variables. Moreover, for each , , which is a convex compact set with diameter . Therefore, it follows from Theorem 8 that for all , with probability at least , . In particular, this finishes the proof of Lemma 3 for and , using the bound from (8.4).
9 Annex 2: Solving SDPs in practice
The practical implementation of our approach to the synchronization, signed clustering and MAX-CUT problems requires solving a convex optimization problem. In the present section, we describe the various algorithms we used for solving these SDPs.
9.1 Pierra’s method
For SDPs with constraints on the entries, we propose a simple modification of the method initially proposed by Pierra in [84]. Let : be a convex function. Let denote a convex set which can be written as the intersection of convex sets . Let us define ( times) and let denote the (diagonal) subspace of of vectors of the form . In this formalism, the problem can be reformulated as a minimisation problem over the intersection of only two sets, i.e.
Define . The algorithm proposed by Pierra in [84] consists of performing the following iterations
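To make the scheme concrete, here is a minimal sketch of a projected-gradient instantiation of this product-space idea; it is not the exact extrapolated iteration of [84], only an illustration of its structure, and the helper names and step size are ours. A gradient step on the cost is followed by the projection onto C = C_1 x ... x C_r, applied componentwise, and then by the projection back onto the diagonal subspace D, which amounts to averaging the copies.

```python
import numpy as np

def product_space_step(x_bar, grad_f, projections, step=1.0):
    """One iteration of a projected-gradient scheme in Pierra's product space.

    x_bar       : current iterate (a numpy array), living on the diagonal subspace D
    grad_f      : callable returning the gradient of the cost at x_bar
    projections : list of callables, the projections onto C_1, ..., C_r
    step        : illustrative gradient step size
    """
    # Gradient step on the cost, performed on the diagonal subspace.
    y = x_bar - step * grad_f(x_bar)
    # Projection onto C = C_1 x ... x C_r: apply each projector to its own copy of y.
    copies = [proj(y) for proj in projections]
    # Projection back onto the diagonal subspace D: average the copies.
    return sum(copies) / len(copies)

def solve(x0, grad_f, projections, n_iter=500, step=1.0):
    x = x0
    for _ in range(n_iter):
        x = product_space_step(x, grad_f, projections, step)
    return x
```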
9.1.1 Application to community detection
Let us now present the computational details of Pierra’s method for the community detection problem. We will estimate its membership matrix via the following SDP estimator
where and denotes the number of nonzero elements in the membership matrix . The motivation for this approach stems from the fact that the membership matrix is actually the oracle, i.e., , where . The function to minimize in the Pierra algorithm is defined as .
Let us denote by the set of symmetric positive semi-definite matrices in . The set is the intersection of the sets
We now compute for all and ( here)
We have for
On the other hand, the projection operators are given by
To sum up, Pierra’s method can be formulated as follows.
At each iteration , compute the SVD of . Then compute for all
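For illustration, here is a minimal sketch of the two projectors such an SDP typically requires: the projection onto the cone of positive semi-definite matrices, computed from the eigendecomposition of the symmetrized iterate (which is what the SVD step of the iteration above amounts to for a symmetric matrix), and an entrywise projection onto box constraints. The box [0, 1] and the helper names are assumptions made purely for illustration.

```python
import numpy as np

def project_psd(Z):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    Z = (Z + Z.T) / 2                      # enforce symmetry before the eigendecomposition
    eigvals, eigvecs = np.linalg.eigh(Z)
    eigvals = np.clip(eigvals, 0.0, None)  # keep only the non-negative part of the spectrum
    return (eigvecs * eigvals) @ eigvecs.T

def project_box(Z, lo=0.0, hi=1.0):
    """Project entrywise onto the box [lo, hi] (illustrative choice of constraints)."""
    return np.clip(Z, lo, hi)
```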
9.1.2 Application to signed clustering
Let us now turn to the signed clustering problem. We will estimate its membership matrix via the following SDP estimator, where . As in the community detection case, the function to minimize in the Pierra algorithm is defined as .
Let us denote by the set of symmetric positive semi-definite matrices in . The set is the intersection of the sets , and
As before, for
and the projection operators , are given by
To sum up, Pierra’s method can be formulated as follows.
At each iteration , compute the SVD of . Then compute for all
9.2 The Burer-Monteiro approach and the Manopt Solver
To solve the MAX-CUT and Angular Synchronization problems we rely on Manopt, a freely available Matlab toolbox for optimization on manifolds [17]. Manopt runs the Riemannian Trust-Region method on the corresponding Burer-Monteiro non-convex problem with rank bounded by , as follows. The Burer-Monteiro approach consists of replacing the optimization of a linear function over the convex set with the optimization of the quadratic function over the non-convex set .
In the context of the MAX-CUT problem, the Burer-Monteiro approach amounts to the following steps. Denoting by the positive semidefinite matrix , note that both the cost function and the constraints can be expressed linearly in terms of . Dropping the NP-hard rank-1 constraint on , we arrive at the well-known convex relaxation of MAX-CUT from [49]
(9.1) |
where .
If a solution of this SDP has rank 1, then for some , which then gives the optimal cut. Recall that in the general case of higher rank , Goemans and Williamson [49] introduced the celebrated rounding scheme that extracts approximately optimal cuts within a ratio of 0.878 from . The corresponding Burer-Monteiro non-convex problem with rank bounded by is given by
(9.2) |
where . Note that the constraint requires each row of to have unit norm, rendering to be a point on the Cartesian product of unit spheres in , which is a smooth manifold. Also note that the search space of the SDP is compact, since all feasible for the SDP have identical trace equal to .
If the convex set is compact, and denotes the number of constraints, it holds true that whenever satisfies , the two problems share the same global optimum [12, 22]. Building on the pioneering work of Burer and Monteiro [22], Boumal et al. [18] showed that if the set is compact and the set is a smooth manifold, then implies that, for almost all cost matrices , global optimality is achieved by any satisfying second-order necessary optimality conditions. Following [18], for , for almost all matrices , even though (9.2) is non-convex, any local optimum is a global optimum (and so is ), and all saddle points have an escape direction (the Hessian has a negative eigenvalue). Note that for the same statement holds true for all , and was previously established by [19].
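The sketch below illustrates the Burer-Monteiro factorization for the MAX-CUT relaxation (9.1) together with the random hyperplane rounding of Goemans and Williamson [49]. It uses a plain projected gradient ascent on the product of unit spheres rather than the Riemannian trust-region method run by Manopt, so it should be read as a toy illustration of the approach, not as the solver used in our experiments; the cost matrix C (e.g. a scaled graph Laplacian), the rank p and all parameter values are assumptions made for illustration.

```python
import numpy as np

def burer_monteiro(C, p=10, n_iter=2000, step=1e-2, seed=0):
    """Toy projected gradient ascent for  max <C, V V^T>  over V with unit-norm rows."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    V = rng.standard_normal((n, p))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    for _ in range(n_iter):
        V = V + step * (C @ V)                         # gradient of <C, V V^T> is 2 C V for symmetric C
        V /= np.linalg.norm(V, axis=1, keepdims=True)  # retract each row back onto the unit sphere
    return V

def goemans_williamson_rounding(A, V, n_rounds=100, seed=0):
    """Random hyperplane rounding of [49]: keep the sign vector with the largest cut weight."""
    rng = np.random.default_rng(seed)
    best_x, best_cut = None, -np.inf
    for _ in range(n_rounds):
        g = rng.standard_normal(V.shape[1])
        x = np.sign(V @ g)
        x[x == 0] = 1.0
        cut = 0.25 * (A.sum() - x @ A @ x)             # weight of the cut induced by the signs of x
        if cut > best_cut:
            best_x, best_cut = x, cut
    return best_x, best_cut
```

On a given instance one would call V = burer_monteiro(C) and then x, cut = goemans_williamson_rounding(A, V), where A is the adjacency matrix of the graph.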
10 Numerical experiments
This section contains the outcome of numerical experiments on the three application problems considered: signed clustering, MAX-CUT, and angular synchronization.
10.1 Signed Clustering
To assess the effectiveness of the SDP relaxation, we consider the following experimental setup. We generate synthetic networks following the signed stochastic block model (SSBM) previously described in Section 4.1.1, with communities. To quantify the effectiveness of the SDP relaxation, we compare the accuracy of a suite of algorithms from the signed clustering literature, before (that is, when we run these algorithms directly on ) and after the SDP relaxation (that is, when we run the very same algorithms on ). To measure the recovery quality of the clustering results, for a given indicator set , we rely on the error rate considered in [24], defined as
(10.1) |
where denotes a cluster indicator vector, () is the complete -weakly balanced ground truth network – with ’s on the diagonal blocks corresponding to inter-cluster edges, and otherwise – with , and denotes the combinatorial graph Laplacian corresponding to . Note that counts the number of violations within the clusters (since negative edges should not be placed within clusters) and counts the number of violations across clusters (since positive edges should not belong to the cut). Overall, (10.1) essentially counts the fraction of intra-cluster and inter-cluster edge violations, with respect to the full ground truth matrix. Note that this definition can also be easily adjusted to work on real data sets, where the ground truth matrix is not available, which one can replace with the empirical observation .
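Since the precise Laplacian-based expression in (10.1) is best read off the display above, the following sketch only reproduces its interpretation as a fraction of edge violations (negative edges placed inside clusters and positive edges cut between clusters); the signed adjacency matrix with entries in {-1, 0, +1}, the label vector, and the normalisation by the number of observed edges are our illustrative choices.

```python
import numpy as np

def signed_cut_error(A, labels):
    """Fraction of edge violations: negative edges within clusters plus positive edges across clusters.

    A      : symmetric signed adjacency matrix with entries in {-1, 0, +1}
    labels : array of cluster assignments, one per node
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]          # indicator of intra-cluster pairs
    upper = np.triu(np.ones_like(A, dtype=bool), k=1)  # count each edge once
    intra_viol = np.sum((A < 0) & same & upper)        # negative edges placed inside a cluster
    inter_viol = np.sum((A > 0) & ~same & upper)       # positive edges cut between clusters
    n_edges = np.sum((A != 0) & upper)
    return (intra_viol + inter_viol) / max(n_edges, 1)
```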
In terms of the signed clustering algorithms compared, we consider the following algorithms from the literature. One straightforward approach is to simply rely on the spectrum of the observed adjacency matrix . Kunegis et al. [65] proposed spectral tools for clustering, link prediction, and visualization of signed graphs, by solving a 2-way “signed” ratio-cut problem based on the combinatorial Signed Laplacian [59] , where is a diagonal matrix with . The same authors proposed signed extensions for the case of the random-walk Laplacian , and the symmetric graph Laplacian , the latter of which is particularly suitable for skewed degree distributions. Finally, the last algorithm we considered is BNC of Chiang et al. [25], who introduced a formulation based on the Balanced Normalized Cut objective
(10.2) |
which, in light of the decomposition , is effectively minimizing the number of violations in the clustering procedure.
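For concreteness, here is a minimal sketch of the signed-Laplacian spectral baseline of [65]: form the signed degree matrix, build the combinatorial signed Laplacian (or its symmetrically normalised version), and cluster the eigenvectors attached to the smallest eigenvalues. The use of scikit-learn's KMeans on the spectral embedding is our illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def signed_spectral_clustering(A, k, normalised=True):
    """Spectral clustering with the (optionally normalised) signed Laplacian of Kunegis et al. [65]."""
    d = np.abs(A).sum(axis=1)                               # signed degrees: sum_j |A_ij|
    L = np.diag(d) - A                                      # combinatorial signed Laplacian
    if normalised:
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        L = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]   # symmetrically normalised version
    eigvals, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, :k]                              # eigenvectors of the k smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```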
In our experiments, we first compute the error rate of all algorithms on the original SSBM graph (shown in Column 1 of Figure 1), and then we repeat the procedure but with the input to all signed clustering algorithms being given by the output of the SDP relaxation, and denote the resulting recovery error by . The third column of the same Figure 1 shows the difference in errors between the first and second columns, while the fourth column contains a histogram of the error differences . This altogether illustrates the fact that the SDP relaxation does improve the performance of all signed clustering algorithms, except , and could effectively be used as a denoising pre-processing step.
Figure 1: Recovery error of the signed clustering algorithms considered (including BNC), before the SDP relaxation (first column), after the SDP relaxation (second column), the difference between the two (third column), and a histogram of the error differences (fourth column).
10.2 Max-Cut
For the MAX-CUT problem, we consider two sets of numerical experiments. First, we consider a version of the stochastic block model which essentially perturbs a complete bipartite graph
(10.3) |
where (respectively, ) denotes an matrix of all ones, respectively, all zeros. In our experiments, we set and fix . We perturb by deleting edges across the two partitions and inserting edges within each partition. More specifically, we generate the full adjacency matrix from by adding edges independently with probability within each partition (i.e., along the diagonal blocks in (10.3)). Finally, we denote by the masked version we observe, , where denotes the adjacency matrix of an Erdős-Rényi(, ) graph. The graph shown in Figure 2 is an instance of the above generative model.
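The following sketch reproduces this generative process; since the particular values of the partition size, the noise probability and the sampling density are given in the text above, they appear here as free parameters. Flipping every entry of the clean bipartite adjacency matrix with the noise probability covers both the insertions within the partitions and the deletions across them.

```python
import numpy as np

def noisy_bipartite_maxcut_instance(n, eta, delta, seed=0):
    """Perturbed complete bipartite graph followed by an Erdos-Renyi mask of density delta.

    n     : size of each of the two partitions (2n nodes in total)
    eta   : probability of flipping an entry of the clean bipartite adjacency matrix
    delta : sampling density of the Erdos-Renyi mask
    """
    rng = np.random.default_rng(seed)
    # Clean complete bipartite graph: all edges go across the two partitions.
    B = np.block([[np.zeros((n, n)), np.ones((n, n))],
                  [np.ones((n, n)), np.zeros((n, n))]])
    # Symmetric noise pattern: flipped entries add edges within partitions, delete edges across.
    flip = rng.random((2 * n, 2 * n)) < eta
    flip = np.triu(flip, k=1)
    flip = flip | flip.T
    A_full = np.where(flip, 1.0 - B, B)
    # Symmetric Erdos-Renyi mask of density delta: the observed, sparsified graph.
    mask = rng.random((2 * n, 2 * n)) < delta
    mask = np.triu(mask, k=1)
    mask = mask | mask.T
    A_obs = A_full * mask
    np.fill_diagonal(A_obs, 0.0)
    return A_obs
```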
Figure 2: An instance of the perturbed complete bipartite generative model described above.
Note that for small values of we expect the maximum cut to occur across the initial partition in the clean bipartite graph , which we aim to recover as we sparsify the observed graph . The heatmap on the left of Figure 3 shows the Adjusted Rand Index (ARI) between the initial partition and the partition obtained from the Max-Cut SDP relaxation in (9.1), as we vary the noise parameter and the sparsity . As expected, for a fixed level of noise , we are able to recover the hypothetically optimal Max-Cut for suitable levels of the sparsity parameter. The heatmap on the right of Figure 3 shows the computational running time as we vary the two parameters, showing that the Manopt solver takes the longest to solve dense noisy problems, as one would expect.
Figure 3: Left: Adjusted Rand Index between the initial partition and the Max-Cut SDP partition, as the noise and sparsity parameters vary. Right: the corresponding computational running times.
In the second set of experiments shown in Figure 4, we consider a graph chosen at random from the collection of graphs known in the literature as the Gset (http://web.stanford.edu/~yyye/yyye/Gset/), where we vary the sparsity level , and show the Max-Cut value attained on the original full graph , but using the Max-Cut partition computed by the SDP relaxation (9.1) on the sparsified graph .
Figure 4: Max-Cut value attained on the original Gset graph using the partition computed by the SDP relaxation (9.1) on the sparsified graph, as the sparsity level varies.
10.3 Angular Synchronization
For the angular synchronization problem, we consider the following experimental setup, assessing the quality of the angular solution recovered from the SDP relaxation as we vary the two parameters of interest. On the -axis of the plots in Figures 5 and 6 we vary the noise level , under two different noise models, Gaussian and outliers. On the -axis, we vary the sparsity of the sampling graph.
We measure the quality of the recovered angles via the Mean Squared Error (MSE), defined as follows. Since a solution can only be recovered up to a global shift, one needs an MSE that mods out this degree of freedom. The following MSE is also more broadly applicable to the case when the underlying group is the orthogonal group , as opposed to as in the present work, where one can replace the unknown angles with their respective representations as rotation matrices . To that end, we look for an optimal orthogonal transformation that minimizes the sum of squared distances between the estimated orthogonal transformations and the ground truth measurements
(10.4) |
where denote the rotation matrix representation of the estimated angles . In other words, is the optimal solution to the alignment problem between two sets of orthogonal transformations, in the least-squares sense. Following the analysis of [91], and making use of properties of the trace, one arrives at
(10.5) |
If we let denote the matrix
(10.6) |
it follows from (10.5) that the MSE is given by minimizing
(10.7) |
In [7] it is proven that , for all , where is the singular value decomposition of . Therefore, the MSE is minimized by the orthogonal matrix and is given by
(10.8) |
where are the singular values of . Therefore, whenever is an orthogonal matrix for which , the MSE vanishes. Indeed, the numerical experiments (on a log scale) in Figures 5 and 6 confirm that for noiseless data the MSE is very close to zero. Furthermore, as one would expect, under favorable noise regimes and sparsity levels we have almost perfect recovery, by both the SDP and the spectral relaxations, under both noise models.
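A minimal sketch of this MSE computation for the angular case follows, assuming the convention Q = (1/n) * sum_i R_i \hat{R}_i^T for the matrix in (10.6); the displayed convention is the one that matters, ours is only illustrative. The optimal alignment is obtained from the SVD of Q, and the MSE is a simple function of its singular values, consistent with the structure of (10.8).

```python
import numpy as np

def rotation(theta):
    """2 x 2 rotation matrix representation of an angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def angular_mse(theta_true, theta_hat):
    """MSE between estimated and ground-truth angles, modding out the global rotation."""
    n = len(theta_true)
    # Assumed convention: Q = (1/n) * sum_i R_i \hat{R}_i^T.
    Q = sum(rotation(t) @ rotation(th).T for t, th in zip(theta_true, theta_hat)) / n
    # Optimal alignment O = U V^T for the SVD Q = U Sigma V^T; the aligned MSE then only
    # depends on the singular values: (1/n) sum_i ||R_i - O \hat{R}_i||_F^2 = 4 - 2 * sum(sigma).
    sigma = np.linalg.svd(Q, compute_uv=False)
    return 4.0 - 2.0 * np.sum(sigma)
```

In particular, when the estimated angles equal the true ones up to a global rotation, Q is orthogonal, its singular values are all equal to one, and the returned MSE is zero.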
Figures 5 and 6: Recovery MSE (on a log scale) of the SDP and spectral relaxations for angular synchronization, as the noise level and the sparsity of the sampling graph vary, under the Gaussian and outlier noise models respectively.
11 Conclusions and future work
There are a number of other graph-based problems amenable to SDP relaxations for which a similar theoretical analysis of their SDP-based estimators could be carried out. For example, the recent work of [34] considered a problem, motivated by geoscience and engineering applications, of recovering a smooth unknown function (where is known) from noisy observations of its mod 1 values, which is also amenable to a solution based on an SDP relaxation solved via a Burer-Monteiro approach. Another potential application concerns the problem of clustering directed graphs, as in the very recent work of [33] which proposed a spectral algorithm based on Hermitian matrices; this problem is also amenable to an SDP relaxation.
Our theoretical and practical findings show that running algorithms (such as spectral methods) directly on may be improved by first computing an SDP estimator such as and running the very same algorithms on (instead of ). In effect, the SDP performs a de-noising pre-processing step which improves the recovery of hidden signals such as community membership vectors.
Acknowledgements
Mihai Cucuringu acknowledges support from the EPSRC grant EP/N510129/1 at The Alan Turing Institute. Guillaume Lecué acknowledges support from a grant of the French National Research Agency (ANR), “Investissements d’Avenir” (LabEx Ecodec/ANR-11-LABX-0047).
References
- [1] Epinions data set. https://snap.stanford.edu/data/soc-Epinions1.html. Accessed: 2010-09-30.
- [2] Slashdot data set. https://snap.stanford.edu/data/soc-Slashdot0902.html. Accessed: 2010-09-30.
- [3] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471–487, 2015.
- [4] Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Ying Wah. Time-series clustering–a decade review. Information Systems, 53:16–38, 2015.
- [5] Brendan PW Ames. Guaranteed clustering and biclustering via semidefinite programming. Mathematical Programming, 147(1-2):429–465, 2014.
- [6] Miguel F Anjos and Jean B Lasserre. Handbook on semidefinite, conic and polynomial optimization, volume 166. Springer Science & Business Media, 2011.
- [7] K. Arun, T. Huang, and S. Bolstein. Least-squares fitting of two 3-D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):698–700, 1987.
- [8] Afonso S Bandeira, Moses Charikar, Amit Singer, and Andy Zhu. Multireference alignment using semidefinite programming. In Proceedings of the 5th conference on Innovations in theoretical computer science, pages 459–470. ACM, 2014.
- [9] Afonso S Bandeira, Ramon Van Handel, et al. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. The Annals of Probability, 44(4):2479–2506, 2016.
- [10] Sujogya Banerjee, Kaushik Sarkar, Sedat Gokalp, Arunabha Sen, and Hasan Davulcu. Partitioning signed bipartite graphs for classification of individuals and organizations. In International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, pages 196–204. Springer, 2012.
- [11] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probab. Theory Related Fields, 135(3):311–334, 2006.
- [12] A. I. Barvinok. Problems of distance geometry and convex properties of quadratic maps. Discrete & Computational Geometry, 13(2):189–202, Mar 1995.
- [13] Lucien Birgé and Pascal Massart. Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields, 97(1-2):113–150, 1993.
- [14] Grigoriy Blekherman, Pablo A Parrilo, and Rekha R Thomas. Semidefinite optimization and convex algebraic geometry. SIAM, 2012.
- [15] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.
- [16] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0-19-953525-5.
- [17] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15:1455–1459, 2014.
- [18] N. Boumal, V. Voroninski, and A.S. Bandeira. The non-convex Burer-Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems 29, pages 2757–2765. 2016.
- [19] Nicolas Boumal. A Riemannian low-rank method for optimization over semidefinite matrices with block-diagonal constraints. arXiv preprint arXiv:1506.00575, 2015.
- [20] Stephen Boyd, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear matrix inequalities in system and control theory, volume 15. SIAM, 1994.
- [21] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
- [22] S. Burer and R.D.C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–444, 2005.
- [23] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. The Journal of Machine Learning Research, 17(1):882–938, 2016.
- [24] K. Chiang, J. Whang, and I. Dhillon. Scalable clustering of signed networks using Balance Normalized Cut. CIKM, 2012.
- [25] Kai-Yang Chiang, Joyce Whang, and Inderjit S. Dhillon. Scalable Clustering of Signed Networks using Balance Normalized Cut. In ACM Conference on Information and Knowledge Management (CIKM), oct 2012.
- [26] Geoffrey Chinot, Guillaume Lecué, and Matthieu Lerasle. Statistical learning with Lipschitz and convex loss functions. arXiv preprint arXiv:1810.01090, 2018.
- [27] Stéphane Chrétien and Franck Corset. Using the eigenvalue relaxation for binary least-squares estimation problems. Signal Processing, 89(11):2079–2091, 2009.
- [28] Stéphane Chrétien, Clément Dombry, and Adrien Faivre. A semi-definite programming approach to low dimensional embedding for unsupervised clustering. Frontiers in Applied Mathematics and Statistics, to appear.
- [29] Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community structure in very large networks. Physical review E, 70(6):066111, 2004.
- [30] M. Cucuringu. Synchronization over Z2 and community detection in multiplex networks with constraints. Journal of Complex Networks, 3:469–506, 2015.
- [31] M. Cucuringu. Sync-Rank: Robust Ranking, Constrained Ranking and Rank Aggregation via Eigenvector and Semidefinite Programming Synchronization. IEEE Transactions on Network Science and Engineering, 3(1):58–79, 2016.
- [32] M. Cucuringu, P. Davies, A. Glielmo, and H. Tyagi. SPONGE: A generalized eigenproblem for clustering signed networks. AISTATS 2019.
- [33] M. Cucuringu, H. Li, H. Sun, and L. Zanetti. Hermitian matrices for clustering directed graphs: insights and applications. to appear in AISTATS 2020.
- [34] M. Cucuringu and H. Tyagi. Provably robust estimation of modulo 1 samples of a smooth function with applications to phase unwrapping. to appear in Journal of Machine Learning Research (JMLR), 2020.
- [35] Mark A Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.
- [36] Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
- [37] J. A. Davis. Clustering and structural balance in graphs. Human Relations, 20(2):181–187, 1967.
- [38] Yohann De Castro, Fabrice Gamboa, Didier Henrion, Roxana Hess, Jean-Bernard Lasserre, et al. Approximate optimal designs for multivariate polynomial regression. The Annals of Statistics, 47(1):127–155, 2019.
- [39] Yohann de Castro, Fabrice Gamboa, Didier Henrion, and Jean B. Lasserre. Exact solutions to super resolution on semi-algebraic domains in higher dimensions. IEEE Trans. Information Theory, 63(1):621–630, 2017.
- [40] Laurent Demanet and Paul Hand. Stable optimizationless recovery from phaseless linear measurements. Journal of Fourier Analysis and Applications, 20(1):199–221, 2014.
- [41] Yingjie Fei and Yudong Chen. Exponential error rates of SDP for block models: beyond Grothendieck’s inequality. IEEE Trans. Inform. Theory, 65(1):551–571, 2019.
- [42] Roger Fletcher. A nonlinear programming problem in statistics (educational testing). SIAM Journal on Scientific and Statistical Computing, 2(3):257–267, 1981.
- [43] Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
- [44] Bernd Gärtner and Jiří Matoušek. Semidefinite programming. In Approximation Algorithms and Semidefinite Programming, pages 15–25. Springer, 2012.
- [45] Jean Charles Gilbert and Cédric Josz. Plea for a semidefinite optimization solver in complex numbers. 2017.
- [46] Christophe Giraud and Nicolas Verzelen. Partial recovery bounds for clustering with the relaxed means. arXiv preprint arXiv:1807.07547, 2018.
- [47] Michel X Goemans. Semidefinite programming in combinatorial optimization. Mathematical Programming, 79(1-3):143–161, 1997.
- [48] Michel X Goemans and David P Williamson. New 34-approximation algorithms for the maximum satisfiability problem. SIAM Journal on Discrete Mathematics, 7(4):656–666, 1994.
- [49] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
- [50] David Gross, Yi-Kai Liu, Steven T Flammia, Stephen Becker, and Jens Eisert. Quantum state tomography via compressed sensing. Physical review letters, 105(15):150401, 2010.
- [51] A. Grothendieck. Résumé de la théorie métrique des produits tensoriels topologiques. Bol. Soc. Mat. São Paulo, 8:1–79, 1953.
- [52] Alexandre Grothendieck. Résumé de la théorie métrique des produits tensoriels topologiques. Soc. de Matemática de São Paulo, 1956.
- [53] Olivier Guédon and Roman Vershynin. Community detection in sparse networks via Grothendieck’s inequality. Probab. Theory Related Fields, 165(3-4):1025–1049, 2016.
- [54] Olivier Guédon and Roman Vershynin. Community detection in sparse networks via grothendieck’s inequality. Probability Theory and Related Fields, 165(3-4):1025–1049, 2016.
- [55] Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via semidefinite programming. IEEE Transactions on Information Theory, 62(5):2788–2797, 2016.
- [56] Simai He, Zhi-Quan Luo, Jiawang Nie, and Shuzhong Zhang. Semidefinite relaxation bounds for indefinite homogeneous quadratic optimization. SIAM Journal on Optimization, 19(2):503–523, 2008.
- [57] Chinmay Hegde, Aswin C Sankaranarayanan, and Richard G Baraniuk. Near-isometric linear embeddings of manifolds. In 2012 IEEE Statistical Signal Processing Workshop (SSP), pages 728–731. IEEE, 2012.
- [58] Samuel B. Hopkins. Sub-gaussian mean estimation in polynomial time. CoRR, abs/1809.07425, 2018.
- [59] Jao Ping Hou. Bounds for the least Laplacian eigenvalue of a signed graph. Acta Mathematica Sinica, 21(4):955–960, 2005.
- [60] Adel Javanmard, Andrea Montanari, and Federico Ricci-Tersenghi. Phase transitions in semidefinite relaxations. Proceedings of the National Academy of Sciences, 113(16):E2218–E2223, 2016.
- [61] David Karger, Rajeev Motwani, and Madhu Sudan. Approximate graph coloring by semidefinite programming. Journal of the ACM (JACM), 45(2):246–265, 1998.
- [62] Subhash Khot and Assaf Naor. Approximate kernel clustering. Mathematika, 55(1-2):129–165, 2009.
- [63] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593–2656, 2006.
- [64] Vladimir Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems, volume 2033 of Lecture Notes in Mathematics. Springer, Heidelberg, 2011. Lectures from the 38th Probability Summer School held in Saint-Flour, 2008, École d’Été de Probabilités de Saint-Flour. [Saint-Flour Probability Summer School].
- [65] Jérôme Kunegis, Stephan Schmidt, Andreas Lommatzsch, Jürgen Lerner, Ernesto William De Luca, and Sahin Albayrak. Spectral analysis of signed graphs for clustering, prediction and visualization. SDM, 10:559–570, 2010.
- [66] Jean Bernard Lasserre. An Introduction to Polynomial and Semi-Algebraic Optimization. Number 52. Cambridge University Press, 2015.
- [67] Javad Lavaei and Steven H Low. Zero duality gap in optimal power flow problem. IEEE Transactions on Power Systems, 27(1):92–107, 2011.
- [68] Can M. Le, Elizaveta Levina, and Roman Vershynin. Optimization via low-rank approximation for community detection in networks. Ann. Statist., 44(1):373–400, 2016.
- [69] Guillaume Lecué and Shahar Mendelson. Learning subgaussian classes: Upper and minimax bounds. Technical report, CNRS, Ecole polytechnique and Technion, 2013.
- [70] Claude Lemaréchal, Arkadii Nemirovskii, and Yurii Nesterov. New variants of bundle methods. Mathematical programming, 69(1-3):111–147, 1995.
- [71] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in online social networks. In WWW, pages 641–650, 2010.
- [72] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. Signed Networks in Social Media. In CHI, pages 1361–1370, 2010.
- [73] László Lovász. On the Shannon capacity of a graph. IEEE Transactions on Information Theory, 25(1):1–7, 1979.
- [74] Wing-Kin Ken Ma. Semidefinite relaxation of quadratic optimization problems and applications. IEEE Signal Processing Magazine, 1053(5888/10), 2010.
- [75] Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. Ann. Statist., 27(6):1808–1829, 1999.
- [76] Pascal Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003, With a foreword by Jean Picard.
- [77] David A Mazziotti. Large-scale semidefinite programming for many-electron quantum mechanics. Physical review letters, 106(8):083001, 2011.
- [78] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and estimation in the planted partition model. Probab. Theory Related Fields, 162(3-4):431–461, 2015.
- [79] Y Nesterov. Semidefinite relaxation and non-convex quadratic optimization. Optimization Methods and Software, 12:1–20, 1997.
- [80] Mark EJ Newman. Finding community structure in networks using the eigenvectors of matrices. Physical review E, 74(3):036104, 2006.
- [81] Carl Olsson, Anders P Eriksson, and Fredrik Kahl. Solving large scale binary quadratic problems: Spectral methods vs. semidefinite programming. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
- [82] Jiming Peng and Yu Wei. Approximating k-means-type clustering via semidefinite programming. SIAM journal on optimization, 18(1):186–205, 2007.
- [83] Jiming Peng and Yu Xia. A new theoretical framework for k-means-type clustering. In Foundations and advances in data mining, pages 79–96. Springer, 2005.
- [84] Guy Pierra. Decomposition through formalization in a product space. Mathematical Programming, 28(1):96–115, 1984.
- [85] Gilles Pisier. Grothendieck’s theorem, past and present. Bulletin of the American Mathematical Society, 49(2):237–323, 2012.
- [86] Mason A Porter, Jukka-Pekka Onnela, and Peter J Mucha. Communities in networks. Notices of the AMS, 56(9):1082–1097, 2009.
- [87] Martin Royer. Adaptive clustering through semidefinite programming. In Advances in Neural Information Processing Systems, pages 1795–1803, 2017.
- [88] P Scobey and DG Kabe. Vector quadratic programming problems and inequality constrained least squares estimation. J. Indust. Math. Soc., 28:37–49, 1978.
- [89] Alexander Shapiro. Weighted minimum trace factor analysis. Psychometrika, 47(3):243–264, 1982.
- [90] A. Singer. Angular synchronization by eigenvectors and semidefinite programming. Appl. Comput. Harmon. Anal., 30(1):20–36, 2011.
- [91] A. Singer and Y. Shkolnisky. Three-dimensional structure determination from common lines in Cryo-EM by eigenvectors and semidefinite programming. SIAM Journal on Imaging Sciences, 4(2):543–572, 2011.
- [92] Amit Singer. Angular synchronization by eigenvectors and semidefinite programming. Applied and computational harmonic analysis, 30(1):20–36, 2011.
- [93] Jun Sun, Stephen Boyd, Lin Xiao, and Persi Diaconis. The fastest mixing markov process on a graph and a connection to a maximum variance unfolding problem. SIAM review, 48(4):681–699, 2006.
- [94] Michael J Todd. Semidefinite optimization. Acta Numerica, 10:515–560, 2001.
- [95] Alexandre B. Tsybakov. Optimal rate of aggregation. In Computational Learning Theory and Kernel Machines (COLT-2003), volume 2777 of Lecture Notes in Artificial Intelligence, pages 303–313. Springer, Heidelberg, 2003.
- [96] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32(1):135–166, 2004.
- [97] Sara A. van de Geer. Applications of empirical process theory, volume 6 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2000.
- [98] Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics.
- [99] Vladimir N. Vapnik. Statistical learning theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons, Inc., New York, 1998. A Wiley-Interscience Publication.
- [100] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
- [101] V Vu. Singular vectors under random perturbation. Preprint available in arXiv:104.2000, 2010.
- [102] Irène Waldspurger, Alexandre d’Aspremont, and Stéphane Mallat. Phase recovery, maxcut and complex semidefinite programming. Mathematical Programming, 149(1-2):47–81, 2015.
- [103] Kilian Q Weinberger, Benjamin Packer, and Lawrence K Saul. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In AISTATS, volume 2, page 6. Citeseer, 2005.
- [104] Kilian Q Weinberger and Lawrence K Saul. An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In AAAI, volume 6, pages 1683–1686, 2006.
- [105] Kilian Q Weinberger and Lawrence K Saul. Unsupervised learning of image manifolds by semidefinite programming. International journal of computer vision, 70(1):77–90, 2006.
- [106] Henry Wolkowicz. Semidefinite and lagrangian relaxations for hard combinatorial problems. In IFIP Conference on System Modeling and Optimization, pages 269–309. Springer, 1999.
- [107] Henry Wolkowicz, Romesh Saigal, and Lieven Vandenberghe. Handbook of semidefinite programming: theory, algorithms, and applications, volume 27. Springer Science & Business Media, 2012.
- [108] Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015.
- [109] S. Zhang and Y. Huang. Complex quadratic optimization and semidefinite programming. SIAM Journal on Optimization, 16(3):871–890, 2006.
- [110] Shuzhong Zhang. Quadratic maximization and semidefinite relaxation. Mathematical Programming, 87(3):453–465, 2000.