Learning with Semi-Definite Programming: new statistical bounds based on fixed point analysis and excess risk curvature
Abstract
Many statistical learning problems have recently been shown to be amenable to Semi-Definite Programming (SDP), with community detection and clustering in Gaussian mixture models as the most striking instances [60]. Given the growing range of applications of SDP-based techniques to machine learning problems, and the rapid progress in the design of efficient algorithms for solving SDPs, an intriguing question is to understand how the recent advances from empirical process theory can be put to work in order to provide a precise statistical analysis of SDP estimators.
In the present paper, we borrow cutting edge techniques and concepts from the learning theory literature, such as fixed point equations and excess risk curvature arguments, which yield general estimation and prediction results for a wide class of SDP estimators. From this perspective, we revisit some classical results in community detection from [54] and [23], and we obtain statistical guarantees for SDP estimators used in signed clustering, group synchronization and MAXCUT.
1 Introduction
Many statistical learning problems have recently been shown to be amenable to Semi-Definite Programming (SDP), with community detection and clustering in Gaussian mixture models as the most striking instances where SDP performs significantly better than other current approaches [60]. SDP is a class of convex optimisation problems generalising linear programming to linear problems over semi-definite matrices [94], [107], [21], which has proved to be an important tool in the computational approach to difficult challenges in automatic control, combinatorial optimisation, polynomial optimisation, data mining, high dimensional statistics and the numerical solution of partial differential equations. The goal of the present paper is to introduce a new fixed point approach to the statistical analysis of SDP-based estimators and to illustrate our method on four current problems of interest, namely MAX-CUT, community detection, signed clustering and angular group synchronization. The rest of this section gives some historical background and presents the mathematical definition of SDP-based estimators.
1.1 Historical background
SDP is a class of optimisation problems which includes linear programming as a particular case and can be written as the set of problems over symmetric (resp. Hermitian) positive semi-definite matrix variables, with linear cost function and affine constraints, i.e. optimization problems of the form
(1.1)
\[
\max_{Z \succeq 0} \ \langle C, Z \rangle \quad \text{subject to} \quad \langle A_i, Z \rangle = b_i, \quad i = 1, \dots, m,
\]
where $C, A_1, \dots, A_m$ are given symmetric (resp. Hermitian) matrices and $b \in \mathbb{R}^m$. SDPs are convex programming problems which can be solved in polynomial time when the constraint set is compact, and they play a paramount role in a large number of convex and nonconvex problems, for which they often appear as a convex relaxation [6].
We will sometimes use the notation (resp. ) for the cone of positive semi-definite matrices (resp. negative semi-definite matrices).
1.1.1 Early history
Early use of Semi-Definite Programming in statistics can be traced back to [88] and [42]. In the same year, Shapiro used SDP in factor analysis [89]. The study of the mathematical properties of SDP then gained momentum with the introduction of Linear Matrix Inequalities (LMI) and their numerous applications in control theory, system identification and signal processing. The book [20] is the standard reference for this type of result, mostly obtained in the 90’s.
1.1.2 The Goemans-Williamson SDP relaxation of Max-Cut and its legacy
A notable turning point is the publication of [49], where SDP was shown to provide a 0.87 approximation to the NP-hard problem known as Max-Cut. The Max-Cut problem is a clustering problem on graphs which consists in finding two complementary subsets and of nodes such that the sum of the weights of the edges between and is maximal. In [49], the authors approach this difficult combinatorial problem by using what is now known as the Goemans-Williamson SDP relaxation, and use the Cholesky factorization of the optimal solution to this SDP in order to produce a randomized scheme achieving the 0.87 bound in expectation. Moreover, this problem can be seen as a first instance where the Laplacian of a graph is employed in order to provide an optimal bi-clustering in a graph, and it certainly represents the first chapter of a long and fruitful relationship between clustering, embedding and Laplacians. Other SDP schemes for approximating hard combinatorial problems include, to name a few, the graph coloring problem [61] and the satisfiability problem [49, 48]. These results were later surveyed in [70, 47] and [106]. The randomized scheme introduced by Goemans and Williamson was then further improved in order to study more general Quadratically Constrained Quadratic Programmes (QCQP) in various references, most notably [79, 110], and further extended in [56]. Many applications to signal processing are discussed in [81], [74]; one specific reduced complexity implementation in the form of an eigenvalue minimisation problem and its application to binary least-squares recovery and denoising is presented in [27].
1.1.3 Relaxation of machine learning and high dimensional statistical estimation problems
Applications of SDP to problems related to machine learning are more recent and probably started with the SDP relaxation of k-means in [83, 82] and later in [5]. This approach was then further improved using a refined statistical analysis by [87] and [46]. Similar methods have also been applied to community detection [55, 3] and, from the weak recovery viewpoint, in [54]. This last approach was also re-used via the kernel trick for point cloud clustering [28]. Another incarnation of SDP in machine learning is the extensive use of nuclear norm-penalized least-squares costs as a surrogate for rank-penalization in low-rank recovery problems such as matrix completion in recommender systems, matrix compressed sensing, natural language processing and quantum state tomography; these topics are surveyed in [35].
The problem of manifold learning was also addressed using SDP and is often mentioned as one of the most accurate approaches to the problem, leaving aside its computational complexity; see [103, 105, 104, 57]. Connections with the design of fast converging Markov chains were also exhibited in [93].
In a different direction, A. Singer and collaborators have recently promoted the use of SDP relaxations for estimation under group invariance, an active area with many applications [92, 8]. SDP-based relaxations have also been considered in [30] in the context of group synchronization in signed multiplex networks with constraints, and in [31] in the setting of ranking from inconsistent and incomplete pairwise comparisons, where an SDP-based relaxation of angular synchronization over SO(2) outperformed a suite of state-of-the-art algorithms from the literature. Phase recovery using SDP was studied in e.g. [102] and [40]. An extension to multi-partite clustering based on SDP was then proposed in [61]. Other important applications of SDP include information theory [73], estimation in power networks [67], quantum tomography [77], [50] and polynomial optimization via sums-of-squares relaxations [66, 14]. Sums-of-squares relaxations were recently applied to statistical problems in [38, 58, 39]. The extension to the field of complex numbers, with the Hermitian inner product, has been less extensively studied but has many interesting applications and comes with efficient algorithms [49, 45].
1.2 Mathematical formulation of the problem
The general problem we want to study can be stated as follows. Let $A$ be a random matrix in $\mathbb{R}^{n\times n}$ (or $\mathbb{C}^{n\times n}$) and let $\mathcal{C}$ be a constraint set. The object that we want to recover, for instance the community membership vector in community detection, is related to an oracle defined as
(1.2)
\[
Z^* \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle \mathbb{E}A, Z \rangle,
\]
where $\langle A, B \rangle = \operatorname{Tr}(A \bar B^\top)$ and $\bar B$ is the (entrywise) conjugate of $B$ (so that $\langle A, B \rangle = \operatorname{Tr}(A B^\top)$ when $B$ is real-valued). We would like to estimate $Z^*$, from which we can ultimately retrieve the object that really matters to us (for instance, by considering a singular vector associated with the largest singular value of $Z^*$). To that end, we consider the following estimator
(1.3)
\[
\hat Z \in \operatorname*{argmax}_{Z \in \mathcal{C}} \langle A, Z \rangle,
\]
which is simply obtained by replacing the unobserved quantity $\mathbb{E}A$ by the observation $A$.
As pointed out, in many situations, $Z^*$ is not the object we want to estimate, but there is a straightforward relation between $Z^*$ and this object. For instance, consider the community detection problem, where the goal is to recover the community membership vector of the nodes. Here, when $\mathcal{C}$ is well chosen, there is a close relation between $Z^*$ and the membership vector. We therefore need a final step to estimate the latter from $\hat Z$, for instance by taking a top eigenvector of $\hat Z$ and then using the Davis-Kahan “sin-theta” theorem [36, 108] to control the estimation error on the membership vector by the estimation error of $Z^*$ by $\hat Z$.
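To make this two-step pipeline concrete (solve the SDP, then extract a leading eigenvector), here is a minimal cvxpy sketch; it is not taken from the original text, and the elliptope used in `elliptope` is only an illustrative choice of constraint set, not the one prescribed for any particular application below.

```python
import cvxpy as cp
import numpy as np

def sdp_estimator(A, build_constraints):
    """Solve the plug-in SDP (1.3): maximize <A, Z> over the constraint set C.

    `build_constraints(Z)` must return the cvxpy constraints describing C."""
    n = A.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    objective = cp.Maximize(cp.sum(cp.multiply(A, Z)))   # <A, Z> = sum_ij A_ij Z_ij
    cp.Problem(objective, build_constraints(Z)).solve(solver=cp.SCS)
    return Z.value

def elliptope(Z):
    # purely illustrative choice of C: PSD matrices with unit diagonal
    return [Z >> 0, cp.diag(Z) == 1]

def top_eigenvector(Z_hat):
    """Final step: a leading eigenvector of the SDP solution (to be fed to a
    Davis-Kahan type argument, as described above)."""
    _, vecs = np.linalg.eigh(Z_hat)
    return vecs[:, -1]

# Example usage:
#   A = np.random.randn(20, 20); A = (A + A.T) / 2
#   x_hat = top_eigenvector(sdp_estimator(A, elliptope))
```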
1.3 Goal of the paper
The aim of the present work is to present a general approach to the study of the statistical properties of SDP-based estimators defined in (1.3). In particular, using our framework, one is able to obtain new (non-asymptotic) rates of convergence or exact reconstruction properties for a wide class of estimators obtained as a solution of a semidefinite program like (1.3). Specifically, our goal is to show that the solution to (1.3) can be analyzed in a statistical way when $\mathbb{E}A$ is only partially and noisily observed through $A$. Even though the constraint set may not necessarily be the intersection of the set of PSD (or Hermitian) matrices with linear spaces – such as in the definition of SDPs – in the following, a solution of (1.3) will be called a SDP estimator because, in all our examples, it will be the solution of a SDP. For the sake of generality, however, we will only assume minimal requirements on the shape of the constraint set $\mathcal{C}$. We also illustrate our results on a number of specific machine learning problems such as various forms of clustering problems and angular group synchronization. Three out of the four examples worked out here are concerned with real-valued matrices. Only the angular synchronization problem is approached using complex matrices.
2 Main general results for the statistical analysis of SDP estimators
From a statistical point of view, the task remains to estimate in the most efficient way the oracle $Z^*$, and to that end $\hat Z$ is our candidate estimator. The point of view we will use to evaluate how far $\hat Z$ is from $Z^*$ comes from the learning theory literature. We therefore see $\hat Z$ as an empirical risk minimization (ERM) procedure built on a single observation $A$, where the loss function is the linear one $Z \mapsto -\langle A, Z \rangle$, and the oracle $Z^*$ is indeed the one minimizing the risk function $Z \mapsto -\langle \mathbb{E}A, Z \rangle$ over $\mathcal{C}$. Having this setup in mind, we can use all the machinery developed in learning theory (see for instance [99, 63, 76, 97]) to obtain rates of convergence for the ERM (here $\hat Z$) toward the oracle (here $Z^*$).
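For convenience, this ERM framing can be written out explicitly as follows, with $A$ the observation, $\mathcal{C}$ the constraint set, $Z^*$ the oracle of (1.2) and $\hat Z$ the estimator of (1.3); the last quantity, the excess risk, is the one that the fixed point of Definition 1 below localizes.
\[
\ell_Z(A) = -\langle A, Z \rangle, \qquad R(Z) = \mathbb{E}\,\ell_Z(A) = -\langle \mathbb{E}A, Z \rangle,
\]
\[
Z^* \in \operatorname*{argmin}_{Z \in \mathcal{C}} R(Z), \qquad \hat Z \in \operatorname*{argmin}_{Z \in \mathcal{C}} \ell_Z(A), \qquad R(Z) - R(Z^*) = \langle \mathbb{E}A,\, Z^* - Z \rangle \ \ge\ 0 .
\]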
There is one key quantity driving the rate of convergence of the ERM: a fixed point complexity parameter. This type of parameter carries all the statistical complexity of the problem, and even though it is usually easy to set up, its computation can be tedious since it requires controlling, with large probability, the supremum of empirical processes indexed by “localized classes”. We now define this complexity fixed point for the problem we are considering here.
Definition 1.
Let . The fixed point complexity parameter at deviation is
(2.1) |
Fixed point complexity parameters have been extensively used in learning theory since the introduction of the localization argument [76, 64, 97, 13]. When they can be computed, they are preferred to the (global) analysis developed by Chervonenkis and Vapnik [99] to study ERM, since the latter analysis always yields slower rates, given that the Vapnik-Chervonenkis bound is a high probability bound on the non-localized empirical process , which is an upper bound for since . The gap between the global and the local analyses can be important, since fast rates cannot be obtained using the VC approach, whereas the localization argument resulting in fixed points such as the one in Definition 1 may yield fast rates of convergence or even exact recovery results.
An example of a Vapnik-Chervonenkis type of analysis of SDP estimators can be found in [54] for the community detection problem. An improvement of the latter approach has been obtained in [41] thanks to a localization argument – even though it is not stated in these words (we elaborate more on the two approaches from [54, 41] in Section 3). In short, a fixed point such as (2.1) is a sharp way to measure the statistical performance of ERM estimators, and in particular of the SDP estimators that we are considering here. They can even be proved to be optimal (in a minimax sense) when the noise is Gaussian [69] and under mild conditions on the structure of .
Before stating our general result, we first recall a definition of a minimal structural assumption on the constraint .
Definition 2.
We say that the set is star-shaped in when for all , the segment is in .
This is a pretty mild assumption satisfied, for instance, when is convex, which is the setup we will always encounter in practical applications, given that SDP estimators are usually introduced after a “convex relaxation” argument. Our main general statistical bound on SDP estimators is as follows.
Theorem 1.
We assume that the constraint is star-shaped in . Then, for all , with probability at least , it holds true that .
Theorem 1 applies to any type of setup where an oracle is estimated by an estimator such as in (1.3). Its result shows that is almost a maximizer of the true objective function over up to . In particular, when , is exactly a maximizer such as and, in that case, we can work with as if we were working with without any loss.
Theorem 1 may be applied in many different settings; in the following, we study four such instances. We will apply Theorem 1 (or one of its corollaries stated below) to some popular problems in the networks and graph signal processing literature, namely, community detection [43] (we will mostly revisit the results in [54] and [41] from our perspective), signed clustering [32], group synchronization [90] and MAX-CUT.
The proof of Theorem 1 is straightforward (mostly because the loss function is linear). Its importance stems from the fact that it puts forward two important concepts originally introduced in Learning Theory, namely that the complexity of the problem comes from the one of the local subset and that the “radius” of the localization is solution of a fixed point equation. For a setup given by a random matrix and a constraint , we should try to understand how these two ideas apply to obtain estimation properties of SDP estimators such as . That is to understand the shape of the local subsets and the maximal oscillations of the empirical process indexed by these local subsets. We will consider this task in three distinct problem instances. We now provide a proof for Theorem 1.
Proof of Theorem 1. Denote . Assume first that (the case is analyzed later). Let be the event on which, for all , if then . By definition of , we have .
Let be such that and define such that . We have and because is star-shaped in . Therefore, on the event , and so . It therefore follows that on the event , if is such that then
which implies that and therefore does not maximize over . As a consequence, we necessarily have on the event (which holds with probability at least ).
Let us now assume that . There exists a decreasing sequence of positive real numbers tending to such that for all , where is the event where for all ,
Since is star-shaped in , is a non-increasing function and so is a decreasing sequence (i.e. for all ). It follows that . Let us now place ourselves on the event . For all , since holds and , we can use the same argument as in the first case to conclude that . Since the latter inequality is true for all (on the event ) and tends to zero, we conclude that .
The main conclusion of Theorem 1 is that all the information for the problem of estimating via is contained in the fixed point . We therefore have to compute or upper bound such a fixed point. This might be difficult in great generality but there are some tools that can help to find upper bounds on .
A first approach is to understand the shape of the local sets , and to that end it is helpful to characterize the curvature of the excess risk around its maximizer . This type of local characterization of the excess risk is also a tool used in learning theory that goes back to classical conditions such as the margin assumption [96, 75] or the Bernstein condition [11]. The latter condition was initially introduced as an upper bound of the variance term by its expectation: for all for some absolute constant , but it has now been better understood as a way to discriminate the oracle from the other points in the model . These assumptions were global assumptions in the sense that they concern all points in . It has recently been shown [26] that only the local curvature of the excess risk needs to be understood. We now introduce this tool in our setup.
We characterize the local curvature of the excess risk by some function . Most of the time, this function is a norm, like the -norm, or a power of a norm, such as a squared norm. The radius defining the local subset on which we need to understand the curvature of the excess risk is also the solution of a fixed point equation:
(2.2) |
The difference between the two fixed points and is that the local subsets are not defined using the same proximity function to the oracle ; the first one uses the excess risk as a proximity function while the second one uses the function as a proximity function. The function should play the role of a simple description of the curvature of the excess risk function locally around ; that is formalized in the next assumption.
Assumption 1.
For all , if then .
Typical examples of curvature functions will have the form for some , and some norm . In that case, the parameter was initially called the margin parameter [95, 75]. Even though the relation given in Assumption 1 has typically been referred to as a margin condition or Bernstein condition in the learning theory literature, we will rather call it a local curvature assumption, following [54] and [26], since this type of relation describes the behavior of the risk function locally around its oracle. The main advantage of finding a local curvature function is that should be easier to compute than and , because of the definition of and (thanks to Assumption 1). We can therefore state the following corollary.
Corollary 1.
We assume that the constraint is star-shaped in and that the “local curvature” Assumption 1 holds for some . With probability at least , it holds true that
When it is possible to describe the local curvature of the excess risk around its oracle by some function and when some estimate of can be obtained, Corollary 1 applies, and estimation results of by (w.r.t. both the “excess risk” metric and the metric) follow. If not, either because understanding the local curvature of the excess risk or the computation of is difficult, it is still possible to apply Theorem 1 with the global VC approach, which boils down to simply upper bounding the fixed point used in Theorem 1 by a global parameter that is a complexity measure of the entire set :
(2.3) |
Interestingly, if the latter last-resort approach is used then, following the approach from [54], Grothendieck’s inequality [51, 85] appears to be a powerful tool to upper bound the right-hand side of (2.3), in the case of the community detection problem such as in [53] as well as in the MAX-CUT problem. Of course, when it is possible to avoid this ultimate global approach, one should do so, because the local approach will always provide better results.
Finally, proving a “local curvature” property such as in Assumption 1 may be difficult because it requires understanding the shape of the local subsets . It is however possible to simplify this assumption if getting estimation results of only w.r.t. the function (and not necessarily an upper bound on the excess risk ) is enough. In that case, Assumption 1 may be replaced by the following one.
Assumption 2.
For all , if then .
Assumption 2 assumes a curvature of the excess risk function in a neighborhood of , unlike Assumption 1 which grants this curvature in an “excess risk neighborhood”. The shape of a neighborhood defined by the function may be easier to understand (for instance, when the function is a norm, such a neighborhood is the ball of a norm centered at the oracle with radius ). In general, the latter assumption and Assumption 1 are not comparable. If Assumption 2 holds, then one can estimate the oracle w.r.t. the curvature function.
Theorem 2.
We assume that the constraint is star-shaped in and that the “local curvature” Assumption 2 holds for some . We assume that the function is continuous, and for all . With probability at least , it holds true that .
Proof of Theorem 2. Let . First assume that . Let be such that . Let . We have , and is continuous. Therefore, there exists such that . We let be such that . Since is star-shaped in and , we have . Moreover, . As a consequence, on the event such that for all if then , we have . The latter and Assumption 2 imply that, on ,
and so . Finally, using the definition of , we obtain
In particular, cannot be a maximizer of over and so necessarily, on the event , .
Let us now consider the case where . Using the same approach as in the proof of Theorem 1, we only have to check that the function
is non-increasing. Let . W.l.o.g. we may assume that there exists some such that and . If then . If then there exists such that for we have and . Moreover, and so . It follows that
where we used that for all because for all .
As a consequence, Theorem 1, Corollary 1 and Theorem 2 are the three tools at our disposal to study the performance of SDP estimators, depending on the depth of understanding we have of the problem. The best approach is given by Theorem 1 when it is possible to compute efficiently the complexity fixed point . If the latter approach is too complicated (likely because understanding the geometry of the local subset may be difficult), then one may resort to finding a curvature function of the excess risk locally around . In that case, both Corollary 1 and Theorem 2 may apply, depending on the difficulty of finding a local curvature function on an “excess risk neighborhood” (see Assumption 1) or a “-neighborhood” (see Assumption 2). Finally, if no local approach can be handled (likely because describing the curvature of the excess risk in any neighborhood of , or controlling the maximal oscillations of the empirical process locally, is too difficult), then one may ultimately resort to a global approach, which follows from Theorem 1 as explained in (2.3). In the following, we will use these tools for various problems.
Results like Theorem 1, Corollary 1 and Theorem 2 appeared in many papers on ERM in learning theory, such as [64, 11, 76, 69]. In all these results, the typical loss functions, such as the quadratic or logistic loss functions, were not linear like the one we are using here. From that point of view, our problem is easier, and this can be seen from the simplicity of the proofs of our three general results in this section. What is much more complicated here than in other, more classical problems in learning theory is the computation of the fixed point , because 1) the stochastic process may be far from being a Gaussian process if the noise matrix is complicated and 2) the local sets or for may be very hard to describe in a simple way. Fortunately, we will rely on several previous papers such as [41] to solve such problems.
3 Revisiting two results from the community detection literature [41, 54]
The rapid growth of social networks on the Internet has led many statisticians and computer scientists to focus their research on data coming from graphs. One important topic that has attracted particular interest during the last decades is that of community detection [43, 86], where the goal is to recover mesoscopic structures in a network, the so-called communities. A community consists of a group of nodes that are relatively densely connected to each other, but sparsely connected to other dense groups present within the network. The motivation for this line of work stems not only from the fact that finding communities in a network is an interesting and challenging problem of its own, as it leads to understanding structural properties of networks, but community detection is also used as a data pre-processing step for other statistical inference tasks on large graphs, as it facilitates parallelization and allows one to distribute time-consuming processes on several smaller subgraphs (i.e., the extracted communities).
One challenging aspect of the community detection problem arises in the setting of sparse graphs. Many of the existing algorithms, which enjoy theoretical guarantees, do so in the relatively dense regime for the edge sampling probability, where the expected average degree is of the order . The problem becomes challenging in very sparse graphs with bounded average degree. To this end, Guédon and Vershynin proposed a semidefinite relaxation for a discrete optimization problem [54], an instance of which encompasses the community detection problem, and showed that it can recover a solution with any given relative accuracy even in the setting of very sparse graphs with average degree of order .
A subset of the existing literature for community detection and clustering relies on spectral methods, which consider the adjacency matrix associated to a graph, and employ its eigenvalues, and especially eigenvectors, in the analysis process or to propose efficient algorithms to solve the task at hand. Along these lines, Can et al. [68] proposed a general framework for optimizing a general function of the graph adjacency matrix over discrete label assignments by projecting onto a low-dimensional subspace spanned by vectors that approximate the top eigenvectors of the expected adjacency matrix. The authors consider the problem of community detection with communities, which they frame as an instance of their proposed framework, combined with a regularization step that shifts each entry in the adjacency matrix by a small constant , which renders their methodology applicable in the sparse regime as well.
In the remainder of this section, we focus on the community detection problem on random graphs under the general stochastic block model. We will mostly revisit the work in [54] and [41] from the perspective given by Theorem 1, which simplifies the proof in [41] since the peeling argument is no longer required, and neither is the a priori upper bound from [54] that [41] relies on, thanks to the homogeneity argument hidden in Theorem 1 (which underlies the localization argument).
We first recall the definition of the generalized stochastic block model (SBM). We consider a set of vertices , and assume it is partitioned into communities of arbitrary sizes .
Definition 3.
For any pair of nodes , we denote by when and belong to the same community (i.e., there exists such that ), and we denote by if and do not belong to the same community.
For each pair of nodes from , we draw an edge between and with a fixed probability independently from the other edges. We assume that there exist numbers and satisfying , such that
(3.1) |
We denote by the observed symmetric adjacency matrix, such that, for all , is distributed according to a Bernoulli of parameter . The community structure of such a graph is captured by the membership matrix , defined by if , and otherwise. The main goal in community detection is to reconstruct from the observation .
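As an illustration only (this is not part of the original analysis), a minimal sampler for the homogeneous two-parameter version of this model, with within-community edge probability p and across-community probability q, can be written as follows; it returns both the adjacency matrix and the membership matrix.

```python
import numpy as np

def sample_sbm(community_sizes, p, q, rng=None):
    """Draw (A, Zbar) from the homogeneous stochastic block model sketched
    above: an edge appears with probability p inside a community and with
    probability q across communities; Zbar is the membership matrix
    (entry 1 when the two nodes are in the same community)."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.repeat(np.arange(len(community_sizes)), community_sizes)
    same = labels[:, None] == labels[None, :]
    n = labels.size
    upper = np.triu(rng.random((n, n)) < np.where(same, p, q), k=1)
    A = (upper | upper.T).astype(int)      # symmetric adjacency, zero diagonal
    return A, same.astype(int)
```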
Spectral methods for community detection are very popular in the literature [54, 41, 100, 15, 29]. There are many ways to introduce such methods, one of which being via convex relaxations of certain graph cut problems aiming to minimize a modularity function such as the RatioCut [80]. Such relaxations often lead to SDP estimators, such as those introduced in Section 1.
Considering a random graph distributed according to the generalized stochastic block model, and its associated adjacency matrix (i.e. and for and as defined in (3.1)), we will estimate its membership matrix via the following SDP estimator
where and denotes the number of nonzero elements in the membership matrix . The motivation for this approach stems from the fact that the membership matrix is actually the oracle, i.e., (see Lemma 7.1 in [54] or Lemma 1 below), where
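For concreteness, here is a hedged cvxpy sketch of an estimator of this type; it assumes the constraint set described in the prose above (PSD matrices with entries in [0,1] and total sum equal to the number of nonzero entries of the membership matrix, denoted `lam` below and assumed known), which is an illustrative reading of [54] rather than a verbatim reproduction.

```python
import cvxpy as cp

def community_sdp(A, lam):
    """Sketch of the SDP estimator for community detection: maximize <A, Z>
    over {Z PSD, 0 <= Z_ij <= 1, sum_ij Z_ij = lam}, with lam assumed known."""
    n = A.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    constraints = [Z >> 0, Z >= 0, Z <= 1, cp.sum(Z) == lam]
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(A, Z))), constraints).solve(solver=cp.SCS)
    return Z.value
```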
Following the strategy from Theorem 1 and from our point of view, the upper bound on from [54] is the one that is based on the global approach – that is, without localization. Indeed, [54] uses the observation that, for all , it holds true that
(3.2) |
where is the cut-norm (footnote: the cut-norm of a real matrix with a set of rows indexed by and a set of columns indexed by , is the maximum, over all and , of the quantity ; it is also the operator norm of from to and the “injective norm” in the original Grothendieck “résumé” [52, 85]) and is the Grothendieck constant (Grothendieck’s inequality is used in (b), see [85, 100]). Therefore, the localization around the oracle by the excess risk “band” is simply removed in inequality (a). As a consequence, the resulting statistical bound is based on the complexity of the entire class whereas, in a localized approach, only the complexity of matters. The next step in the proof of [54] is a high probability upper bound on the cut-norm of the noise, which follows from Bernstein’s inequality and a union bound since one has ; then for all , with probability at least , where . The resulting upper bound on the fixed point obtained in [54] is,
(3.3) |
Finally, under the assumption of Theorem 1 in [54] (i.e., for some , , , and ), for we obtain (using the general result in Theorem 1), with probability at least , the bound , which is the result from Theorem 1 in [54]. In addition, [54] uses a (global) curvature property of the excess risk in its Lemma 7.2:
Lemma 1 (Lemma 7.2 in [54]).
For all , .
Therefore, a (global – that is, for all ) curvature assumption holds for a function which is here the norm, a margin parameter and for the community detection problem. However, this curvature property is not used to compute a “better” fixed point parameter but only to obtain an estimation bound since
The latter bound, together with the sin-Theta theorem, allows the authors of [54] to obtain an estimation bound for the community membership vector .
The approach from [41] improves upon the one in [54] because it uses a localization argument: the curvature property of the excess risk function from Lemma 1 is used to improve the upper bound in (3.3) obtained following a global approach. Indeed, the authors of [41] obtain a high probability upper bound on the quantity
depending on . This yields an exact reconstruction result in the “dense” case and exponentially decaying rates of convergence in the “sparse” case. This is a typical example where the localization argument shows its advantage over the global approach. The price to pay is usually a more technical proof for the local approach compared with the global one. However, the argument from [41] also uses an unnecessary peeling argument together with an unnecessary a priori upper bound on (which is actually the one from [54]). It appears that this peeling argument and this a priori upper bound can be avoided thanks to our approach from Theorem 1. This improves the probability estimate and simplifies the proofs (since neither the result from [54] nor the peeling argument is required anymore). For the signed clustering problem we consider below as an application of our main results, we will mostly adapt the probabilistic tools from [41] (in the “dense” case) to the methodology associated with Theorem 1 (without these two unnecessary arguments).
4 Contributions of the paper
4.1 Application to signed clustering
Much of the clustering literature, including both spectral and non-spectral methods, has focused on unsigned graphs, where each edge carries a non-negative scalar weight that encodes a measure of affinity (similarity, trust) between pairs of nodes. However, in numerous instances, the above-mentioned affinity takes negative values, and encodes a measure of dissimilarity or distrust. Such applications arise in social networks where users’ relationships denote trust-distrust or friendship-enmity, bipartite shopping networks which capture like-dislike relationships between users and products [10], online news and review websites, such as Epinions [1] and Slashdot [2], that allow users to approve or denounce others [71], and clustering of financial or economic time series data [4]. Such applications have spurred interest in the analysis of signed networks, which has recently become an increasingly important research topic [72], with relevant lines of work in the context of clustering signed networks including, in chronological order, [65, 24, 32].
The second application of our proposed methodology is an extension of the community detection and clustering problems to the setting of signed graphs, where, for simplicity, we assume that an edge connecting two nodes can take either +1 or -1 values.
4.1.1 A Signed Stochastic Block Model (SSBM)
We focus on the problem of clustering K-weakly balanced graphs (a signed graph is K-weakly balanced if and only if all the edges are positive, or the nodes can be partitioned into disjoint sets such that positive edges exist only within clusters, and negative edges are only present across clusters [37]). We consider a signed stochastic block model (SSBM) similar to the one introduced in [32], where we are given a graph with nodes which are divided into communities, , such that, in the noiseless setting, edges within each community are positive and edges between communities are negative.
The only information available to the user is given by a sparse adjacency matrix constructed as follows: is symmetric, with for all , and for all , where
for some and . Moreover, all the variables for are independent.
We remark that this SSBM model is similar to the one considered in [32], which was governed by two parameters, the sampling probability as above, and the noise level , which may flip entries of the adjacency matrix.
Our aim is to recover the community membership matrix or cluster matrix where when and when using only the observed censored adjacency matrix .
Our approach is similar in nature to the one used by spectral methods in community detection. We first observe that for and we have where
(4.1) |
Since we do not know and , we should estimate both of them. We will estimate with but, for simplicity, we will assume that is known. The resulting estimator of the cluster matrix is
(4.2) |
which is indeed a SDP estimator and therefore Theorem 1 (or Corollary 1 and Theorem 2) may be used to obtain statistical bounds for the estimation of from (4.1) by .
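Below is a cvxpy sketch in the spirit of (4.2); the exact constraint set of the paper is not reproduced here, and both the set used (PSD, unit diagonal, entries in [0,1], which contains the cluster matrix) and the input matrix B are illustrative assumptions.

```python
import cvxpy as cp

def signed_clustering_sdp(B):
    """Sketch of an estimator in the spirit of (4.2): maximize <B, Z> over an
    illustrative constraint set containing the cluster matrix.  B stands for
    the observed (possibly debiased/rescaled) signed adjacency matrix."""
    n = B.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    constraints = [Z >> 0, cp.diag(Z) == 1, Z >= 0, Z <= 1]
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(B, Z))), constraints).solve(solver=cp.SCS)
    return Z.value
```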
We will use the following notations: , , , , for all , for all , , and . We also use the notation to denote absolute constants whose values may change from one line to another.
4.1.2 Main result for the estimation of the cluster matrix in signed clustering
Our main result concerns the reconstruction of the communities from the observation of the matrix . In order to avoid solutions with some communities of degenerate size (too small or too large), we consider the following assumption.
Assumption 3.
The elements of the partition of are of the same size up to constants: there are absolute constants such that for any , .
We are now ready to state the main result on the estimation of the cluster matrix from (4.1) by the SDP estimator from (4.2).
Theorem 3.
There is an absolute positive constant such that the following holds. Grant Assumption 3. Assume that
(4.3) |
(4.4) |
(4.5) |
Then, with probability at least , exact recovery holds true, i.e.,
Therefore, we have exact reconstruction in the dense case (that is, under assumption (4.3)), which stems from condition (4.4). The latter condition is in the same spirit as the one in Theorem 1 of [41]: it measures the SNR (signal-to-noise ratio) of the model, which captures the hardness of the SSBM. As mentioned in [41], it is related to the Kesten-Stigum threshold [78]. The last condition (4.5) basically requires that the number of clusters is at most . If this condition is dropped, then we no longer have exact reconstruction but only a certified rate of convergence: with probability at least ,
This shows that there is a phase transition in the dense case: exact reconstruction is possible when and, otherwise, when we only have a control of the estimation error.
4.2 Application to angular group synchronization
In this section, we introduce the group synchronization problem as well as a stochastic model for this problem. We consider a SDP relaxation of the original problem (which is exact) and construct the associated SDP estimator as in (1.3).
The angular synchronization problem consists of estimating unknown angles (up to a global shift angle) given a noisy subset of their pairwise offsets , where is the modulo operation. The pairwise measurements can be realized as the edge set of a graph , typically modeled as an Erdös-Renyi random graph [90].
The aim of this section is to show that the angular synchronization problem can be analyzed using our methodology. In order to keep the presentation as simple as possible, we assume that all pairwise offsets are observed up to some Gaussian noise: we are given for all where are i.i.d. standard Gaussian variables and is the noise variance. We may rewrite the problem as follows: we observe a complex matrix defined by
(4.6) |
Here, denotes the imaginary number such that , , , denotes the conjugate vector of and is the element-wise product. In particular, is a Hermitian matrix (i.e. ) and for and if . We want to estimate (up to a global shift) from the matrix of data . Unlike in classical statistical models, the noise here is multiplicative; we show that our approach covers this type of problem as well.
The first step is to find a (vectorial) optimization problem whose solutions are given by (up to a global angle shift) or some bijective function of it. Estimating up to a global angle shift is equivalent to estimating the vector . The latter is, up to a global rotation of its coordinates, the unique solution of the following maximization problem
(4.7) |
A proof of (4.7) is given in Section 6. Let us now rewrite (4.7) as a SDP problem. For all , we have where and where is the set of all Hermitian matrices and is the vector with all coordinates equal to . It is therefore straightforward to construct a SDP relaxation of (4.7) by dropping the rank constraint. It appears that this relaxation is exact since, for ,
(4.8) |
where . A proof of (4.8) can be found in Section 6. Finally, as we only observe , we consider the following SDP estimator of
(4.9) |
In the next section, we use the strategy from Corollary 1 to obtain statistical guarantees for the estimation of by .
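As an illustration of (4.9), here is a cvxpy sketch over complex Hermitian matrices; it assumes the constraint set described before (4.8), namely Hermitian PSD matrices with unit diagonal, and takes as input the noisy offset matrix of (4.6).

```python
import cvxpy as cp
import numpy as np

def synchronization_sdp(A):
    """Sketch of the SDP estimator (4.9): maximize Re<A, Z> over complex
    Hermitian PSD matrices with unit diagonal."""
    n = A.shape[0]
    Z = cp.Variable((n, n), hermitian=True)
    objective = cp.Maximize(cp.real(cp.trace(A.conj().T @ Z)))   # Re <A, Z>
    cp.Problem(objective, [Z >> 0, cp.diag(Z) == 1]).solve(solver=cp.SCS)
    return Z.value
```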
Intuitively, the above maximization problem (4.8) attempts to preserve the given angle offsets as best as possible
(4.10) |
where the objective function is incremented by whenever an assignment of angles and perfectly satisfies the given edge constraint (i.e., for a clean edge for which ), while the contribution of an incorrect assignment (i.e., of a very noisy edge) will be almost uniformly distributed on the unit circle in the complex plane. Due to the non-convexity of the optimization in (4.10), it is difficult to solve computationally [109]; one way to overcome this problem is to consider the SDP relaxation from (4.8), but it is also possible to consider a spectral relaxation such as the one proposed by Singer [90], which replaces the individual constraints that all ’s should have unit magnitude by the much weaker single constraint , leading to
(4.11) |
The solution to the resulting maximization problem is simply given by a top eigenvector of the Hermitian matrix , followed by a normalization step. We remark that the main advantage of the SDP relaxation (4.8) is that it explicitly imposes the unit magnitude constraint for , which we cannot otherwise enforce in the spectral relaxation solved via the eigenvector method in (4.11). The above SDP program (4.8) is very similar to the well-known Goemans-Williamson SDP relaxation for the seminal MAX-CUT problem of finding the maximal cut of a graph (the MAX-CUT problem is one of the four applications considered in this work, see Section 4.3), with the main difference being that here we optimize over the cone of complex-valued Hermitian positive semidefinite matrices, not just real symmetric matrices.
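For comparison, a sketch of the eigenvector method (4.11) takes only a few lines of numpy; the same extraction (leading eigenvector followed by an entrywise normalization) is what Corollary 2 below applies to the SDP solution.

```python
import numpy as np

def spectral_synchronization(A):
    """Sketch of the spectral relaxation (4.11): take a top eigenvector of the
    Hermitian data matrix and project its entries onto the unit circle."""
    _, vecs = np.linalg.eigh(A)              # eigenvalues in ascending order
    v = vecs[:, -1]                          # eigenvector of the largest eigenvalue
    v = v / np.maximum(np.abs(v), 1e-12)     # entrywise normalization
    return np.angle(v)                       # estimated angles, up to a global shift
```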
4.2.1 Main results for phase recovery in the synchronization problem
Our main result concerns the estimation of the matrix of offsets from the observation of the matrix . This result is then used to estimate (up to global phase shift) the angular vector .
Theorem 4.
Let . If then, with probability at least , it holds true that
(4.12) |
Once we have an estimator for the oracle , we can extract an estimator for the vector of phases by considering a top eigenvector (i.e. an eigenvector associated with the largest eigenvalue) of . It is then possible to quantify the estimation properties of by using a sin-theta theorem and Theorem 4.
Corollary 2.
Let be a top eigenvector of with Euclidean norm . Let and assume that . There exists a universal constant (which is the constant in the Davis-Kahan theorem for Hermitian matrices) such that, with probability at least , it holds true that
(4.13) |
It follows from Corollary 2 that we can estimate (up to a global rotation ) with a -estimation error of the order of with exponential deviations. Given that , this means that a constant proportion of the entries are well estimated. For values of , the rate of estimation is like ; we therefore get a much better estimation of , but only with constant probability. It is important to recall that and can both be efficiently computed by solving a SDP problem and then considering a top eigenvector of its solution (for instance, using the power method).
4.3 Application to the MAX-CUT problem
Let be the adjacency (symmetric) matrix of an undirected graph , where is the set of vertices and the set of edges is where and . We assume that has no self-loops, so that for all . A cut of is any subset of vertices in . For a cut , we define its weight by , that is, the number of edges in between and its complement . The MAX-CUT problem is to find the cut with maximal weight:
(4.14) |
The MAX-CUT problem is an NP-complete problem, but [49] constructed an approximating solution via a SDP relaxation. Indeed, one can write the MAX-CUT problem in the following way. For a cut , we define the membership vector associated with it by setting if and if for all . The weight of the cut can be written in terms of this vector, and so solving (4.14) is equivalent to solving
(4.15) |
Since , the latter problem is also equivalent to solve
(4.16) |
which admits a SDP relaxation by removing the rank constraint. This yields the following SDP relaxation problem of MAX-CUT from [49]:
(4.17)
\[
\max\Big\{ \tfrac{1}{4}\,\big\langle A, \mathbf{1}\mathbf{1}^\top - Z \big\rangle \ :\ Z \in \mathcal{E} \Big\},
\]
where $\mathcal{E} := \{ Z \in \mathbb{R}^{n \times n} : Z = Z^\top,\ Z \succeq 0,\ Z_{ii} = 1 \text{ for all } i \}$ is the elliptope and $\mathbf{1}$ is the all-ones vector.
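A cvxpy sketch of the relaxation (4.17) is given below; in the partially observed setting introduced next, one would feed it the (suitably rescaled) masked observation rather than the full adjacency matrix.

```python
import cvxpy as cp
import numpy as np

def maxcut_sdp(A):
    """Sketch of the Goemans-Williamson relaxation (4.17):
    maximize (1/4) <A, 1 1^T - Z> over the elliptope {Z PSD, Z_ii = 1}."""
    n = A.shape[0]
    Z = cp.Variable((n, n), symmetric=True)
    objective = cp.Maximize(0.25 * cp.sum(cp.multiply(A, np.ones((n, n)) - Z)))
    cp.Problem(objective, [Z >> 0, cp.diag(Z) == 1]).solve(solver=cp.SCS)
    return Z.value
```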
Unlike the other examples from the previous sections, the SDP relaxation in (4.17) is not exact, except for bipartite graphs; see [62, 44] for more details. Nevertheless, thanks to the approximation result from [49], we can use our methodology to estimate and then deduce some approximate optimal cut. But first, we introduce a stochastic model, because in many situations the adjacency matrix is only partially observed, and it might still be interesting to find an approximating solution to the MAX-CUT problem.
We observe a “masked” version of , where is symmetric with upper triangular matrix filled with i.i.d. Bernoulli entries: for all such that , where is a family of i.i.d. Bernoulli random variables with parameter . Let so that . We can write as an oracle since and so we estimate via the SDP estimator . Our first aim is to quantify the cost we pay by using instead of in our final choice of cut. It appears that the fixed point used in Theorem 1 may be used to quantify this loss:
(4.18) |
Our second result is an explicit high probability upper bound on the latter fixed point.
4.3.1 Main results for the MAX-CUT problem
In this section, we gather the two results on the estimation of from and on the approximate optimality of the final cut constructed from . Let us now explicitly provide the construction of this cut. We consider the same strategy as in [49]. Assume that has been constructed. Let be a centered Gaussian vector with covariance matrix . Let be the sign vector of . Using the statistical properties of , it is possible to prove near optimality of .
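The rounding step just described can be sketched as follows; the small ridge added to the covariance matrix is a numerical safeguard of this sketch, not part of the analysis.

```python
import numpy as np

def gw_rounding(Z_hat, rng=None):
    """Randomized rounding of [49]: draw g ~ N(0, Z_hat) and output sign(g)."""
    rng = np.random.default_rng() if rng is None else rng
    n = Z_hat.shape[0]
    cov = Z_hat + 1e-9 * np.eye(n)     # tiny ridge, for numerical PSD-ness only
    g = rng.multivariate_normal(np.zeros(n), cov)
    x = np.sign(g)
    x[x == 0] = 1                      # break (measure-zero) ties
    return x
```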
We denote the optimal values of the maxcut problem and its SDP relaxation by
where is a solution of (4.14) and . Our first result shows how the approximation result from [49] is downgraded by the incomplete information we have on the graph (we only partially observe the adjacency matrix via ).
Theorem 5.
For all . With probability at least (with respect to the masked ),
To make the notation precise, is the sign vector of , which is a centered Gaussian variable with covariance . In that context, is the conditional expectation according to for a fixed . Moreover, the probability bound that we obtain is w.r.t. the mask, that is, w.r.t. the randomness in .
Let us put Theorem 5 into some perspective. If we had known the entire adjacency matrix (which is the case when ), then we could have used instead of . In that case, for the sign vector of , we know from [49] that
(4.19) |
Therefore, Theorem 5 characterizes the price we pay for not observing the entire adjacency matrix but only a masked version of it. It is an interesting output of Theorem 5 that the fixed point measures this loss in a quantitative way. If we were able to identify scenarios of and for which , that would prove that there is no loss for partially observing in the MAX-CUT problem. Unfortunately, the approach we use to control is the global one, which does not allow for exact reconstruction (that is, to show that ).
Let us now turn to an estimation result of by via an upper bound on .
Theorem 6.
With probability at least :
In particular, it follows from the approximation result from Theorem 5 and the high probability upper bound on from Theorem 6 that, with probability at least ,
(4.20) |
This result is nontrivial only when the right-hand side term is strictly larger than , which is the performance of a random cut. As a consequence, (4.20) shows that one can still do better than randomness, even in an incomplete information setup for the MAX-CUT problem, when , and are such that
For instance, when is like a constant, it requires to be larger than (for some absolute constant ) and when , it requires to be at least (for some absolute constant ).
5 Proof of Theorem 3 (signed clustering)
The aim of this paper is to put forward a methodology developed in learning theory for the study of SDP estimators. In each example, we follow this methodology. For a problem, such as signed clustering, where it is possible to characterize the curvature of the excess risk, we start by identifying this curvature, because the curvature function coming out of it defines the local subsets of driving the complexity of the problem. Then, we turn to the stochastic part of the proof, which is entirely summarized in the complexity fixed point from (2.2). Finally, we put the two pieces together and apply the main general result from Corollary 1 to obtain estimation results for the SDP estimator (4.2) in the signed clustering problem; this is Theorem 3.
5.1 Curvature equation
In this section, we show that the objective function satisfies a curvature assumption around its maximizer with respect to the -norm given by with parameter (and margin exponent ).
Proposition 1.
For , we have for all , .
Proof.
Let be in . We have
Moreover, for all and , so . We also have for all , because in that case and . Hence,
5.2 Computation of the complexity fixed point
Define the noise matrix of the problem. Since is symmetric, its entries are not independent. In order to work only with independent random variables, we define the following matrix :
(5.1) |
where entries are considered as independent Bernoulli variables with parameter and therefore, has independent entries, and satisfies the relation .
In order to obtain upper bounds on the fixed point complexity parameter associated with the signed clustering problem, we need to prove a high probability upper bound on the quantity
(5.2) |
and then find a radius as small as possible such that the quantity in (5.2) is less than . We denote where is the unit -ball of .
We follow the strategy from [41] by decomposing the inner product into two parts according to the SVD of . This observation is a key point in the work of [41] compared with the analysis from [54]; it allows one to perform the localization argument efficiently. Up to a reindexing of the nodes, is a block matrix with diagonal blocks of ’s. It therefore admits singular vectors with multiplicity associated with the singular value for all . We can therefore write
where has column vectors given by and . We define the following projection operator
and its orthogonal projection by
where are such that is an orthonormal basis of .
We use the same decomposition as in [41]: for all ,
The next step is to control with large probability the two terms and uniformly over all . To that end, we use the two following propositions, where we recall that and . The proofs of Propositions 2 and 3 can be found in Section 8; they are based on [41].
Proposition 2.
There are absolute positive constants and such that the following holds. If then we have
Proposition 3.
There exists an absolute constant such that the following holds. When , with probability at least ,
It follows from Proposition 2 and Proposition 3 that when , for all such that we have, for , with probability at least ,
Moreover, we have
(5.3) |
for when and . In particular, when and , we conclude that for all (5.3) is true. Therefore, one can take meaning that we have exact reconstruction of : if and then with probability at least , .
6 Proofs of Theorem 4 and Corollary 2 (angular synchronization)
Proof of (4.7): For all we have for all if and only if for all . We therefore have
(6.1) |
Moreover, for all such that for , we have
where denotes the real part of . On the other side, we have
Hence, minimizing over all such that is equivalent to maximizing over all such that . This concludes the proof with (6.1).
Proof of (4.8): Let . We first prove that . Let . Since , there exists such that . For all , denote by the -th row vector of . We have since . Moreover, for all , we have . This proves that and so .
Let . Since is convex and the objective function is linear (for real coefficients), is one of the extreme points of . Extreme points of are matrices such that for all . We can then write each entry of as for some and now we obtain
The maximal value is attained only for for all , that is for . But we have and , so is the only maximizer of on . But for all we have , then is the only maximizer of over .
6.1 Curvature of the objective function
Proposition 4.
For , we have for all , .
Proof.
Let where and for all . Since for all , we have, on one side, , and so
(6.2) |
On the other side, we proved in the proof of (4.8) that . So we have for all and
(6.3) |
In fact, it follows from the proof of Proposition 4 that we have the following equality: for all ,
where (in particular, ). We therefore know exactly how to characterize the curvature of the excess risk for the angular synchronization problem in terms of the (to the square) and the norms. Nevertheless, we will not use the extra term in the following.
6.2 Computation of the complexity fixed point
It follows from the (global) curvature property of the excess risk for the angular synchronization problem obtained in Proposition 4 that, for the curvature function defined by , we just have to compute the fixed point and then apply Corollary 1 in order to obtain statistical properties of (w.r.t. both the excess risk and the function). In this section, we compute the complexity fixed point for .
Following Proposition 4, the natural “local” subsets of around which drive the statistical complexity of the synchronization problem are defined for all by .
Let . Denote by (resp. ) the real (resp. imaginary) part of for all . Since we also have and so
where we used that for . Now it remains to get a high probability upper bound on the sum of the centered cosines of . We use Bernstein’s inequality (see Equation (7.3) below) to get such a bound. For all , with probability at least ,
for and (because when ).
We now have all the ingredients to compute the fixed point for : for and ,
In particular, using and for (where ) for some , if then .
6.3 End of the proof of Theorem 4 and Corollary 2: application of Corollary 1
Take (for ), we have when (which holds for instance when ) and so it follows from Corollary 1 (together with the curvature property in Section 6.1 and the computation of the fixed point from Section 6.2), that with probability at least , , which is the statement of Theorem 4.
Proof of Corollary 2: The oracle is the rank one matrix which has largest eigenvalue with associated eigenspace . In particular, has a spectral gap . Let be a top eigenvector of with norm . It follows from the Davis-Kahan theorem (see, for example, Theorem 4.5.5 in [100] or Theorem 4 in [101]) that there exists a universal constant such that
where is the spectral gap of . We conclude the proof of Corollary 2 using the upper bound on from Theorem 4.
7 Proofs of Theorem 5 and 6 (MAX-CUT)
In this section, we prove the two main results from Section 4.3 using our general methodology for Theorem 6 and the technique from [49] for Theorem 5.
7.1 Proof of Theorem 5
The proof of Theorem 5 follows the one from [49] up to a minor modification due to the fact that we use the SDP estimator instead of the oracle . It is based on two tools. The first one is Grothendieck’s identity: let and , we have:
(7.1) |
and the identity: for all
(7.2) |
7.2 Proof of Theorem 6
For the MAX-CUT problem, we do not use any localization argument; we therefore use the (likely sub-optimal) global approach. The methodology is very close to the one used in [54] for the community detection problem. In particular, we use both the Bernstein and Grothendieck inequalities to compute a high probability upper bound for the global complexity parameter. We recall these two tools now. First, Bernstein's inequality: if $X_1, \ldots, X_N$ are independent centered random variables such that $|X_i| \le M$ a.s. for all $i$, then for all $t > 0$, with probability at least $1 - e^{-t}$,
(7.3)
\[
\sum_{i=1}^N X_i \le \sigma \sqrt{2t} + \frac{Mt}{3},
\]
where $\sigma^2 = \sum_{i=1}^N \mathbb{E} X_i^2$. The second tool is Grothendieck's inequality [52] (see also [85] or Theorem 3.4 in [54]): if $B \in \mathbb{R}^{n \times n}$ then
(7.4)
\[
\max\big\{ \langle B, Z \rangle \ :\ Z \succeq 0,\ Z_{ii} \le 1 \text{ for all } i \big\} \ \le\ K_G \, \|B\|_{\infty \to 1},
\]
where $\|B\|_{\infty \to 1} = \max_{s, t \in \{-1, 1\}^n} \sum_{i,j} B_{ij} s_i t_j$ and $K_G$ is an absolute constant, called the Grothendieck constant.
In order to apply Theorem 1, we just have to compute the fixed point . As announced, we use the global approach and Grothendieck inequality (7.4) to get
(7.5) |
because . It follows from Bernstein’s inequality (7.3) and a union bound that for all , with probability at least ,
Therefore, for , with probability at least ,
for . Then the result follows from Theorem 1.
8 Annex 1: Signed clustering
8.1 Proof of Equation (4.1)
We recall that the cluster matrix is defined by if and when and . For any matrix , we have
The latter quantity is maximal for such that for and for , that is when . As a consequence . Moreover, so we also have that is the only solution to the problem and so .
8.2 Proof of Proposition 2: control of adapted from [41]
The noise matrix is symmetric and has been decomposed as where has been defined in (5.1). For all , we have
(8.1) |
An upper bound on follows from an upper bound on the three terms in the right side of (8.2). Let us show how to bound the first term. Similar arguments can be used to control the other two terms.
Let . Let us find a high probability upper bound on the term uniformly over . For all , and , we have
Therefore, given the ’s are all equal for . We can therefore fix some arbitrary index and have for all . Moreover, is a family of independent random variables. We now have
which is a weighted sum of independent centered random variables with weights for . We now identify some properties of the weights .
The weights are such that
which is equivalent to saying that the weights vector is in the -ball . It is also in the unit -ball since for all and ,
We therefore have and so
(8.2) |
It remains to find a high probability upper bound on the latter term. We can use the following lemma to that end.
Lemma 1.
Let for . For all , if then with probability at least ,
Proof of Lemma 1. Let and assume that . We denote by (resp. ) the non-decreasing rearrangement of (resp. ) for . We have
Moreover, for all , using a union bound, we have
Let us now control each term of the latter sum thanks to Bernstein inequality. The random variables are independent with variances at most since when and for . Moreover, when and for because . It follows from Bernstein’s inequality that for all satisfying and that
when . Therefore, with probability at least
when
which is a non-vacuous condition since . The result follows, in the case , by taking and using that when .
For , we have
and using Bernstein's inequality as above we get that with probability at least , when , which is a non-vacuous condition when . By taking , we obtain that for all , if then with probability at least ,
Using the same methodology, we can prove exactly the same result for . We can also use the same method to upper bound ; we simply have to check that the weights vector where is also in for any such that . This is indeed the case since we have for all , and , which is therefore constant for all elements in . Therefore, we have
and for all and ,
Therefore, and we obtain exactly the same upper bound for the three terms in (8.1). This concludes the proof of Proposition 2.
8.3 Proof of Proposition 3: control of the term from [41]
In this section, we prove Proposition 3. We follow the proof from [41] but we only consider the “dense case” which is when – we recall that . For a similar uniform control of in the “sparse case ”, when for some absolute constant , we refer the reader to [41].
For all , we have because, by construction of the projection operator, . Therefore, where denotes the nuclear norm (i.e. the sum of the singular values) and denotes the operator norm (i.e. the largest singular value). In the following Lemma 2, we prove an upper bound for and then we obtain a high probability upper bound on .
Lemma 2.
For all , we have
Proof.
Since , the same holds for , and therefore . Next, we bound the trace of .
Since is a projection operator, it is symmetric and , moreover, when so
where we used in (i) that for and in the same community, we have and , thus . Finally, when is in the localized set , we have , which concludes the proof.
Now, we obtain a high probability upper bound on . In the following, we apply this result in the “dense case” (i.e. ) to get the uniform bound on over .
Lemma 3 (Lemma 4 in [41]).
With probability at least , .
Proof.
Let be an independent copy of and be a random symmetric matrix independent from both and whose sub-diagonal entries are independent Rademacher random variables. Using a symmetrization argument (see Chapter 2 in [64] or Chapter 2.3 in [98]), we obtain for ,
where is the entry-wise matrix multiplication operation, (i) comes from Jensen's inequality, (ii) holds since and are identically distributed and (iii) is the triangle inequality. Next, we obtain an upper bound on .
We define the family of independent random variables where for all
(8.3) |
We also put for all . It is straightforward to see that and have the same distribution. As a consequence, and have the same distribution, where is a symmetric matrix with for . An upper bound on follows from the next result due to [9].
Theorem 7 (Corollary 3.6 in [9]).
Let be independent symmetric random variables with unit variance and be a family of real numbers. Let be the random symmetric matrix defined by for all . Let . Then, for any ,
where, for any , denotes the -norm.
Since are independent and symmetric such that , we can apply Theorem 7 to upper bound . We have for any and . It therefore follows from Theorem 7 applied with that
(8.4) |
The final step in the proof of Lemma 3 is a concentration argument showing that is close to its expectation with high probability. To that end, we rely on a general result for Lipschitz and separately convex functions from [16]. We first recall that a real-valued function of variables is said to be separately convex when, for every , it is a convex function of the -th variable once the remaining variables are held fixed.
Theorem 8 (Theorem 6.10 in [16]).
Let be a convex compact set in with diameter . Let be independent random variables taking values in . Let be a separately convex and -Lipschitz function, w.r.t. the -norm (i.e. for all ). Then satisfies, for all , with probability at least , .
We apply Theorem 8 to , where is -Lipschitz w.r.t. the -norm (for ) and separately convex, and is a family of independent random variables. Moreover, for each , , which is a convex compact set with diameter . Therefore, it follows from Theorem 8 that for all , with probability at least , . In particular, this finishes the proof of Lemma 3 for and , using the bound from (8.4).
9 Annex 2: Solving SDPs in practice
The practical implementation of our approach to the synchronization, signed clustering and MAX-CUT problems requires solving a convex optimization problem. In the present section, we describe the various algorithms we used for solving these SDPs.
9.1 Pierra’s method
For SDPs with constraints on the entries, we propose a simple modification of the method initially proposed by Pierra in [84]. Let : be a convex function. Let denote a convex set which can be written as the intersection of convex sets . Let us define ( times) and let denote the (diagonal) subspace of of vectors of the form . In this formalism, the problem can be reformulated as a minimisation problem over the intersection of only two sets, i.e.
Define . The algorithm proposed by Pierra in [84] consists of performing the following iterations
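To make the scheme concrete, here is a minimal sketch of a projected-gradient instantiation of this product-space idea; it is not the exact extrapolated iteration of [84], only an illustration of its structure, and the helper names and step size are ours. A gradient step on the cost is followed by the projection onto C = C_1 x ... x C_r, applied componentwise, and then by the projection back onto the diagonal subspace D, which amounts to averaging the copies.

```python
import numpy as np

def product_space_step(x_bar, grad_f, projections, step=1.0):
    """One iteration of a projected-gradient scheme in Pierra's product space.

    x_bar       : current iterate (a numpy array), living on the diagonal subspace D
    grad_f      : callable returning the gradient of the cost at x_bar
    projections : list of callables, the projections onto C_1, ..., C_r
    step        : illustrative gradient step size
    """
    # Gradient step on the cost, performed on the diagonal subspace.
    y = x_bar - step * grad_f(x_bar)
    # Projection onto C = C_1 x ... x C_r: apply each projector to its own copy of y.
    copies = [proj(y) for proj in projections]
    # Projection back onto the diagonal subspace D: average the copies.
    return sum(copies) / len(copies)

def solve(x0, grad_f, projections, n_iter=500, step=1.0):
    x = x0
    for _ in range(n_iter):
        x = product_space_step(x, grad_f, projections, step)
    return x
```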
9.1.1 Application to community detection
Let us now present the computational details of Pierra’s method for the community detection problem. We will estimate its membership matrix via the following SDP estimator
where and denotes the number of nonzero elements in the membership matrix . The motivation for this approach stems from the fact that the membership matrix is actually the oracle, i.e., , where . The function to minimize in the Pierra algorithm is defined as .
Let us denote by the set of symmetric positive semi-definite matrices in . The set is the intersection of the sets
We now compute for all and ( here)
We have for
On the other hand, the projection operators are given by
To sum up, Pierra’s method can be formulated as follows.
At each iteration , compute the SVD of . Then compute for all
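For illustration, here is a minimal sketch of the two projectors such an SDP typically requires: the projection onto the cone of positive semi-definite matrices, computed from the eigendecomposition of the symmetrized iterate (which is what the SVD step of the iteration above amounts to for a symmetric matrix), and an entrywise projection onto box constraints. The box [0, 1] and the helper names are assumptions made purely for illustration.

```python
import numpy as np

def project_psd(Z):
    """Project a symmetric matrix onto the PSD cone by clipping negative eigenvalues."""
    Z = (Z + Z.T) / 2                      # enforce symmetry before the eigendecomposition
    eigvals, eigvecs = np.linalg.eigh(Z)
    eigvals = np.clip(eigvals, 0.0, None)  # keep only the non-negative part of the spectrum
    return (eigvecs * eigvals) @ eigvecs.T

def project_box(Z, lo=0.0, hi=1.0):
    """Project entrywise onto the box [lo, hi] (illustrative choice of constraints)."""
    return np.clip(Z, lo, hi)
```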
9.1.2 Application to signed clustering
Let us now turn to the signed clustering problem. We will estimate its membership matrix via the following SDP estimator, where . As in the community detection case, the function to minimize in the Pierra algorithm is defined as .
Let us denote by the set of symmetric positive semi-definite matrices in . The set is the intersection of the sets , and
As before, for
and the projection operators , are given by
To sum up, Pierra’s method can be formulated as follows.
At each iteration , compute the SVD of . Then compute for all
9.2 The Burer-Monteiro approach and the Manopt Solver
To solve the MAX-CUT and Angular Synchronization problems we rely on Manopt, a freely available Matlab toolbox for optimization on manifolds [17]. Manopt runs the Riemannian Trust-Region method on the corresponding Burer-Monteiro non-convex problem with rank bounded by , as follows. The Burer-Monteiro approach consists of replacing the optimization of a linear function over the convex set with the optimization of the quadratic function over the non-convex set .
In the context of the MAX-CUT problem, the Burer-Monteiro approach amounts to the following steps. Denoting by the positive semidefinite matrix , note that both the cost function and the constraints can be expressed linearly in terms of . Dropping the NP-hard rank-1 constraint on , we arrive at the well-known convex relaxation of MAX-CUT from [49]
(9.1) |
where .
If a solution of this SDP has rank 1, then for some , which then gives the optimal cut. Recall that in the general case of higher rank , Goemans and Williamson [49] introduced the celebrated rounding scheme that extracts approximately optimal cuts within a ratio of 0.878 from . The corresponding Burer-Monteiro non-convex problem with rank bounded by is given by
(9.2) |
where . Note that the constraint requires each row of to have unit norm, rendering to be a point on the Cartesian product of unit spheres in , which is a smooth manifold. Also note that the search space of the SDP is compact, since all feasible for the SDP have identical trace equal to .
If the convex set is compact, and denotes the number of constraints, it holds true that whenever satisfies , the two problems share the same global optimum [12, 22]. Building on the pioneering work of Burer and Monteiro [22], Boumal et al. [18] showed that if the set is compact and the set is a smooth manifold, then implies that, for almost all cost matrices , global optimality is achieved by any satisfying second-order necessary optimality conditions. Following [18], for , for almost all matrices , even though (9.2) is non-convex, any local optimum is a global optimum (and so is ), and all saddle points have an escape direction (the Hessian has a negative eigenvalue). Note that for the same statement holds true for all , and was previously established by [19].
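The sketch below illustrates the Burer-Monteiro factorization for the MAX-CUT relaxation (9.1) together with the random hyperplane rounding of Goemans and Williamson [49]. It uses a plain projected gradient ascent on the product of unit spheres rather than the Riemannian trust-region method run by Manopt, so it should be read as a toy illustration of the approach, not as the solver used in our experiments; the cost matrix C (e.g. a scaled graph Laplacian), the rank p and all parameter values are assumptions made for illustration.

```python
import numpy as np

def burer_monteiro(C, p=10, n_iter=2000, step=1e-2, seed=0):
    """Toy projected gradient ascent for  max <C, V V^T>  over V with unit-norm rows."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    V = rng.standard_normal((n, p))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    for _ in range(n_iter):
        V = V + step * (C @ V)                         # gradient of <C, V V^T> is 2 C V for symmetric C
        V /= np.linalg.norm(V, axis=1, keepdims=True)  # retract each row back onto the unit sphere
    return V

def goemans_williamson_rounding(A, V, n_rounds=100, seed=0):
    """Random hyperplane rounding of [49]: keep the sign vector with the largest cut weight."""
    rng = np.random.default_rng(seed)
    best_x, best_cut = None, -np.inf
    for _ in range(n_rounds):
        g = rng.standard_normal(V.shape[1])
        x = np.sign(V @ g)
        x[x == 0] = 1.0
        cut = 0.25 * (A.sum() - x @ A @ x)             # weight of the cut induced by the signs of x
        if cut > best_cut:
            best_x, best_cut = x, cut
    return best_x, best_cut
```

On a given instance one would call V = burer_monteiro(C) and then x, cut = goemans_williamson_rounding(A, V), where A is the adjacency matrix of the graph.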
10 Numerical experiments
This section contains the outcome of numerical experiments on the three application problems considered: signed clustering, MAX-CUT, and angular synchronization.
10.1 Signed Clustering
To assess the effectiveness of the SDP relaxation, we consider the following experimental setup. We generate synthetic networks following the signed stochastic block model (SSBM) previously described in Section 4.1.1, with communities. To quantify the effectiveness of the SDP relaxation, we compare the accuracy of a suite of algorithms from the signed clustering literature, before (that is, when we run these algorithms directly on ) and after the SDP relaxation (that is, when we run the very same algorithms on ). To measure the recovery quality of the clustering results, for a given indicator set , we rely on the error rate considered in [24], defined as
(10.1) |
where denotes a cluster indicator vector, () is the complete -weakly balanced ground truth network – with ’s on the diagonal blocks corresponding to inter-cluster edges, and otherwise – with , and denotes the combinatorial graph Laplacian corresponding to . Note that counts the number of violations within the clusters (since negative edges should not be placed within clusters) and counts the number of violations across clusters (since positive edges should not belong to the cut). Overall, (10.1) essentially counts the fraction of intra-cluster and inter-cluster edge violations, with respect to the full ground truth matrix. Note that this definition can also be easily adjusted to work on real data sets, where the ground truth matrix is not available, which one can replace with the empirical observation .
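Since the precise Laplacian-based expression in (10.1) is best read off the display above, the following sketch only reproduces its interpretation as a fraction of edge violations (negative edges placed inside clusters and positive edges cut between clusters); the signed adjacency matrix with entries in {-1, 0, +1}, the label vector, and the normalisation by the number of observed edges are our illustrative choices.

```python
import numpy as np

def signed_cut_error(A, labels):
    """Fraction of edge violations: negative edges within clusters plus positive edges across clusters.

    A      : symmetric signed adjacency matrix with entries in {-1, 0, +1}
    labels : array of cluster assignments, one per node
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]          # indicator of intra-cluster pairs
    upper = np.triu(np.ones_like(A, dtype=bool), k=1)  # count each edge once
    intra_viol = np.sum((A < 0) & same & upper)        # negative edges placed inside a cluster
    inter_viol = np.sum((A > 0) & ~same & upper)       # positive edges cut between clusters
    n_edges = np.sum((A != 0) & upper)
    return (intra_viol + inter_viol) / max(n_edges, 1)
```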
In terms of the signed clustering algorithms compared, we consider the following algorithms from the literature. One straightforward approach is to simply rely on the spectrum of the observed adjacency matrix . Kunegis et al. [65] proposed spectral tools for clustering, link prediction, and visualization of signed graphs, by solving a 2-way “signed” ratio-cut problem based on the combinatorial Signed Laplacian [59] , where is a diagonal matrix with . The same authors proposed signed extensions for the case of the random-walk Laplacian , and the symmetric graph Laplacian , the latter of which is particularly suitable for skewed degree distributions. Finally, the last algorithm we considered is BNC of Chiang et al. [25], who introduced a formulation based on the Balanced Normalized Cut objective
(10.2) |
which, in light of the decomposition , is effectively minimizing the number of violations in the clustering procedure.
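For concreteness, here is a minimal sketch of the signed-Laplacian spectral baseline of [65]: form the signed degree matrix, build the combinatorial signed Laplacian (or its symmetrically normalised version), and cluster the eigenvectors attached to the smallest eigenvalues. The use of scikit-learn's KMeans on the spectral embedding is our illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def signed_spectral_clustering(A, k, normalised=True):
    """Spectral clustering with the (optionally normalised) signed Laplacian of Kunegis et al. [65]."""
    d = np.abs(A).sum(axis=1)                               # signed degrees: sum_j |A_ij|
    L = np.diag(d) - A                                      # combinatorial signed Laplacian
    if normalised:
        d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
        L = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]   # symmetrically normalised version
    eigvals, eigvecs = np.linalg.eigh(L)
    embedding = eigvecs[:, :k]                              # eigenvectors of the k smallest eigenvalues
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```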
In our experiments, we first compute the error rate of all algorithms on the original SSBM graph (shown in Column 1 of Figure 1), and then we repeat the procedure but with the input to all signed clustering algorithms being given by the output of the SDP relaxation, and denote the resulting recovery error by . The third column of the same Figure 1 shows the difference in errors between the first and second columns, while the fourth column contains a histogram of the error differences . This altogether illustrates the fact that the SDP relaxation does improve the performance of all signed clustering algorithms, except , and could effectively be used as a denoising pre-processing step.
Figure 1: Recovery error of the signed clustering algorithms considered (including BNC), before the SDP relaxation (first column), after the SDP relaxation (second column), the difference between the two (third column), and a histogram of the error differences (fourth column).
10.2 Max-Cut
For the MAX-CUT problem, we consider two sets of numerical experiments. First, we consider a version of the stochastic block model which essentially perturbs a complete bipartite graph
(10.3) |
where (respectively, ) denotes an matrix of all ones, respectively, all zeros. In our experiments, we set and fix . We perturb by deleting edges across the two partitions and inserting edges within each partition. More specifically, we generate the full adjacency matrix from by adding edges independently with probability within each partition (i.e., along the diagonal blocks in (10.3)). Finally, we denote by the masked version we observe, , where denotes the adjacency matrix of an Erdős-Rényi(, ) graph. The graph shown in Figure 2 is an instance of the above generative model.
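The following sketch reproduces this generative process; since the particular values of the partition size, the noise probability and the sampling density are given in the text above, they appear here as free parameters. Flipping every entry of the clean bipartite adjacency matrix with the noise probability covers both the insertions within the partitions and the deletions across them.

```python
import numpy as np

def noisy_bipartite_maxcut_instance(n, eta, delta, seed=0):
    """Perturbed complete bipartite graph followed by an Erdos-Renyi mask of density delta.

    n     : size of each of the two partitions (2n nodes in total)
    eta   : probability of flipping an entry of the clean bipartite adjacency matrix
    delta : sampling density of the Erdos-Renyi mask
    """
    rng = np.random.default_rng(seed)
    # Clean complete bipartite graph: all edges go across the two partitions.
    B = np.block([[np.zeros((n, n)), np.ones((n, n))],
                  [np.ones((n, n)), np.zeros((n, n))]])
    # Symmetric noise pattern: flipped entries add edges within partitions, delete edges across.
    flip = rng.random((2 * n, 2 * n)) < eta
    flip = np.triu(flip, k=1)
    flip = flip | flip.T
    A_full = np.where(flip, 1.0 - B, B)
    # Symmetric Erdos-Renyi mask of density delta: the observed, sparsified graph.
    mask = rng.random((2 * n, 2 * n)) < delta
    mask = np.triu(mask, k=1)
    mask = mask | mask.T
    A_obs = A_full * mask
    np.fill_diagonal(A_obs, 0.0)
    return A_obs
```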
Figure 2: An instance of the perturbed complete bipartite generative model described above.
Note that for small values of we expect the maximum cut to occur across the initial partition in the clean bipartite graph , which we aim to recover as we sparsify the observed graph . The heatmap on the left of Figure 3 shows the Adjusted Rand Index (ARI) between the initial partition and the partition obtained from the Max-Cut SDP relaxation in (9.1), as we vary the noise parameter and the sparsity . As expected, for a fixed level of noise , we are able to recover the hypothetically optimal Max-Cut for suitable levels of the sparsity parameter. The heatmap on the right of Figure 3 shows the computational running time as we vary the two parameters, showing that the Manopt solver takes the longest to solve dense noisy problems, as one would expect.
Figure 3: Left: Adjusted Rand Index between the initial partition and the Max-Cut SDP partition, as the noise and sparsity parameters vary. Right: the corresponding computational running times.
In the second set of experiments shown in Figure 4, we consider a graph chosen at random from the collection of graphs known in the literature as the Gset (http://web.stanford.edu/~yyye/yyye/Gset/), where we vary the sparsity level , and show the Max-Cut value attained on the original full graph , but using the Max-Cut partition computed by the SDP relaxation (9.1) on the sparsified graph .
Figure 4: Max-Cut value attained on the original Gset graph using the partition computed by the SDP relaxation (9.1) on the sparsified graph, as the sparsity level varies.
10.3 Angular Synchronization
For the angular synchronization problem, we consider the following experimental setup, assessing the quality of the angular solution recovered from the SDP relaxation as we vary the two parameters of interest. On the -axis of the plots in Figures 5 and 6 we vary the noise level , under two different noise models, Gaussian and outliers. On the -axis, we vary the sparsity of the sampling graph.
We measure the quality of the recovered angles via the Mean Squared Error (MSE), defined as follows. Since a solution can only be recovered up to a global shift, one needs an MSE that mods out this degree of freedom. The following MSE is also more broadly applicable to the case when the underlying group is the orthogonal group , as opposed to as in the present work, where one can replace the unknown angles with their respective representations as rotation matrices . To that end, we look for an optimal orthogonal transformation that minimizes the sum of squared distances between the estimated orthogonal transformations and the ground truth measurements
(10.4) |
where denote the rotation matrix representation of the estimated angles . In other words, is the optimal solution to the alignment problem between two sets of orthogonal transformations, in the least-squares sense. Following the analysis of [91], and making use of properties of the trace, one arrives at
(10.5) |
If we let denote the matrix
(10.6) |
it follows from (10.5) that the MSE is given by minimizing
(10.7) |
In [7] it is proven that , for all , where is the singular value decomposition of . Therefore, the MSE is minimized by the orthogonal matrix and is given by
(10.8) |
where are the singular values of . Therefore, whenever is an orthogonal matrix for which , the MSE vanishes. Indeed, the numerical experiments (on a log scale) in Figures 5 and 6 confirm that for noiseless data the MSE is very close to zero. Furthermore, as one would expect, under favorable noise regimes and sparsity levels we have almost perfect recovery, by both the SDP and the spectral relaxations, under both noise models.
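A minimal sketch of this MSE computation for the angular case follows, assuming the convention Q = (1/n) * sum_i R_i \hat{R}_i^T for the matrix in (10.6); the displayed convention is the one that matters, ours is only illustrative. The optimal alignment is obtained from the SVD of Q, and the MSE is a simple function of its singular values, consistent with the structure of (10.8).

```python
import numpy as np

def rotation(theta):
    """2 x 2 rotation matrix representation of an angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def angular_mse(theta_true, theta_hat):
    """MSE between estimated and ground-truth angles, modding out the global rotation."""
    n = len(theta_true)
    # Assumed convention: Q = (1/n) * sum_i R_i \hat{R}_i^T.
    Q = sum(rotation(t) @ rotation(th).T for t, th in zip(theta_true, theta_hat)) / n
    # Optimal alignment O = U V^T for the SVD Q = U Sigma V^T; the aligned MSE then only
    # depends on the singular values: (1/n) sum_i ||R_i - O \hat{R}_i||_F^2 = 4 - 2 * sum(sigma).
    sigma = np.linalg.svd(Q, compute_uv=False)
    return 4.0 - 2.0 * np.sum(sigma)
```

In particular, when the estimated angles equal the true ones up to a global rotation, Q is orthogonal, its singular values are all equal to one, and the returned MSE is zero.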
Figures 5 and 6: Recovery MSE (on a log scale) of the SDP and spectral relaxations for angular synchronization, as the noise level and the sparsity of the sampling graph vary, under the Gaussian and outlier noise models respectively.
11 Conclusions and future work
There are a number of other graph-based problems amenable to SDP relaxations for which a similar theoretical analysis of their SDP-based estimators could be carried out. For example, the recent work of [34] considered a problem, motivated by geoscience and engineering applications, of recovering a smooth unknown function (where is known) from noisy observations of its mod 1 values, which is also amenable to a solution based on an SDP relaxation solved via a Burer-Monteiro approach. Another potential application concerns the problem of clustering directed graphs, as in the very recent work of [33] which proposed a spectral algorithm based on Hermitian matrices; this problem is also amenable to an SDP relaxation.
Our theoretical and practical findings show that running algorithms (such as spectral methods) directly on may be improved by first computing an SDP estimator such as and running the very same algorithms on (instead of ). In effect, the SDP performs a de-noising pre-processing step which improves the recovery of hidden signals such as community membership vectors.
Acknowledgements
Mihai Cucuringu acknowledges support from the EPSRC grant EP/N510129/1 at The Alan Turing Institute. Guillaume Lecué acknowledges support from a grant of the French National Research Agency (ANR), “Investissements d’Avenir” (LabEx Ecodec/ANR-11-LABX-0047).
References
- [1] Epinions data set. https://snap.stanford.edu/data/soc-Epinions1.html. Accessed: 2010-09-30.
- [2] Slashdot data set. https://snap.stanford.edu/data/soc-Slashdot0902.html. Accessed: 2010-09-30.
- [3] Emmanuel Abbe, Afonso S Bandeira, and Georgina Hall. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory, 62(1):471–487, 2015.
- [4] Saeed Aghabozorgi, Ali Seyed Shirkhorshidi, and Teh Ying Wah. Time-series clustering–a decade review. Information Systems, 53:16–38, 2015.
- [5] Brendan PW Ames. Guaranteed clustering and biclustering via semidefinite programming. Mathematical Programming, 147(1-2):429–465, 2014.
- [6] Miguel F Anjos and Jean B Lasserre. Handbook on semidefinite, conic and polynomial optimization, volume 166. Springer Science & Business Media, 2011.
- [7] K. Arun, T. Huang, and S. Bolstein. Least-squares fitting of two 3-D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):698–700, 1987.
- [8] Afonso S Bandeira, Moses Charikar, Amit Singer, and Andy Zhu. Multireference alignment using semidefinite programming. In Proceedings of the 5th conference on Innovations in theoretical computer science, pages 459–470. ACM, 2014.
- [9] Afonso S Bandeira, Ramon Van Handel, et al. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. The Annals of Probability, 44(4):2479–2506, 2016.
- [10] Sujogya Banerjee, Kaushik Sarkar, Sedat Gokalp, Arunabha Sen, and Hasan Davulcu. Partitioning signed bipartite graphs for classification of individuals and organizations. In International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction, pages 196–204. Springer, 2012.
- [11] Peter L. Bartlett and Shahar Mendelson. Empirical minimization. Probab. Theory Related Fields, 135(3):311–334, 2006.
- [12] A. I. Barvinok. Problems of distance geometry and convex properties of quadratic maps. Discrete & Computational Geometry, 13(2):189–202, Mar 1995.
- [13] Lucien Birgé and Pascal Massart. Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields, 97(1-2):113–150, 1993.
- [14] Grigoriy Blekherman, Pablo A Parrilo, and Rekha R Thomas. Semidefinite optimization and convex algebraic geometry. SIAM, 2012.
- [15] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.
- [16] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013. ISBN 978-0-19-953525-5.
- [17] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre. Manopt, a Matlab toolbox for optimization on manifolds. Journal of Machine Learning Research, 15:1455–1459, 2014.
- [18] N. Boumal, V. Voroninski, and A.S. Bandeira. The non-convex Burer-Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems 29, pages 2757–2765. 2016.
- [19] Nicolas Boumal. A Riemannian low-rank method for optimization over semidefinite matrices with block-diagonal constraints. arXiv preprint arXiv:1506.00575, 2015.
- [20] Stephen Boyd, Laurent El Ghaoui, Eric Feron, and Venkataramanan Balakrishnan. Linear matrix inequalities in system and control theory, volume 15. SIAM, 1994.
- [21] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
- [22] S. Burer and R.D.C. Monteiro. Local minima and convergence in low-rank semidefinite programming. Mathematical Programming, 103(3):427–444, 2005.
- [23] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. The Journal of Machine Learning Research, 17(1):882–938, 2016.
- [24] K. Chiang, J. Whang, and I. Dhillon. Scalable clustering of signed networks using Balance Normalized Cut. CIKM, 2012.
- [25] Kai-Yang Chiang, Joyce Whang, and Inderjit S. Dhillon. Scalable Clustering of Signed Networks using Balance Normalized Cut. In ACM Conference on Information and Knowledge Management (CIKM), oct 2012.
- [26] Geoffrey Chinot, Guillaume Lecué, and Matthieu Lerasle. Statistical learning with Lipschitz and convex loss functions. arXiv preprint arXiv:1810.01090, 2018.
- [27] Stéphane Chrétien and Franck Corset. Using the eigenvalue relaxation for binary least-squares estimation problems. Signal Processing, 89(11):2079–2091, 2009.
- [28] Stéphane Chrétien, Clément Dombry, and Adrien Faivre. A semi-definite programming approach to low dimensional embedding for unsupervised clustering. Frontiers in Applied Mathematics and Statistics, to appear.
- [29] Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community structure in very large networks. Physical review E, 70(6):066111, 2004.
- [30] M. Cucuringu. Synchronization over Z2 and community detection in multiplex networks with constraints. Journal of Complex Networks, 3:469–506, 2015.
- [31] M. Cucuringu. Sync-Rank: Robust Ranking, Constrained Ranking and Rank Aggregation via Eigenvector and Semidefinite Programming Synchronization. IEEE Transactions on Network Science and Engineering, 3(1):58–79, 2016.
- [32] M. Cucuringu, P. Davies, A. Glielmo, and H. Tyagi. SPONGE: A generalized eigenproblem for clustering signed networks. AISTATS 2019.
- [33] M. Cucuringu, H. Li, H. Sun, and L. Zanetti. Hermitian matrices for clustering directed graphs: insights and applications. to appear in AISTATS 2020.
- [34] M. Cucuringu and H. Tyagi. Provably robust estimation of modulo 1 samples of a smooth function with applications to phase unwrapping. to appear in Journal of Machine Learning Research (JMLR), 2020.
- [35] Mark A Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622, 2016.
- [36] Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. iii. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.
- [37] J. A. Davis. Clustering and structural balance in graphs. Human Relations, 20(2):181–187, 1967.
- [38] Yohann De Castro, Fabrice Gamboa, Didier Henrion, Roxana Hess, Jean-Bernard Lasserre, et al. Approximate optimal designs for multivariate polynomial regression. The Annals of Statistics, 47(1):127–155, 2019.
- [39] Yohann de Castro, Fabrice Gamboa, Didier Henrion, and Jean B. Lasserre. Exact solutions to super resolution on semi-algebraic domains in higher dimensions. IEEE Trans. Information Theory, 63(1):621–630, 2017.
- [40] Laurent Demanet and Paul Hand. Stable optimizationless recovery from phaseless linear measurements. Journal of Fourier Analysis and Applications, 20(1):199–221, 2014.
- [41] Yingjie Fei and Yudong Chen. Exponential error rates of SDP for block models: beyond Grothendieck’s inequality. IEEE Trans. Inform. Theory, 65(1):551–571, 2019.
- [42] Roger Fletcher. A nonlinear programming problem in statistics (educational testing). SIAM Journal on Scientific and Statistical Computing, 2(3):257–267, 1981.
- [43] Santo Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
- [44] Bernd Gärtner and Jiří Matoušek. Semidefinite programming. In Approximation Algorithms and Semidefinite Programming, pages 15–25. Springer, 2012.
- [45] Jean Charles Gilbert and Cédric Josz. Plea for a semidefinite optimization solver in complex numbers. 2017.
- [46] Christophe Giraud and Nicolas Verzelen. Partial recovery bounds for clustering with the relaxed means. arXiv preprint arXiv:1807.07547, 2018.
- [47] Michel X Goemans. Semidefinite programming in combinatorial optimization. Mathematical Programming, 79(1-3):143–161, 1997.
- [48] Michel X Goemans and David P Williamson. New 34-approximation algorithms for the maximum satisfiability problem. SIAM Journal on Discrete Mathematics, 7(4):656–666, 1994.
- [49] Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
- [50] David Gross, Yi-Kai Liu, Steven T Flammia, Stephen Becker, and Jens Eisert. Quantum state tomography via compressed sensing. Physical review letters, 105(15):150401, 2010.
- [51] A. Grothendieck. Résumé de la théorie métrique des produits tensoriels topologiques. Bol. Soc. Mat. São Paulo, 8:1–79, 1953.
- [52] Alexandre Grothendieck. Résumé de la théorie métrique des produits tensoriels topologiques. Soc. de Matemática de São Paulo, 1956.
- [53] Olivier Guédon and Roman Vershynin. Community detection in sparse networks via Grothendieck’s inequality. Probab. Theory Related Fields, 165(3-4):1025–1049, 2016.
- [54] Olivier Guédon and Roman Vershynin. Community detection in sparse networks via grothendieck’s inequality. Probability Theory and Related Fields, 165(3-4):1025–1049, 2016.
- [55] Bruce Hajek, Yihong Wu, and Jiaming Xu. Achieving exact cluster recovery threshold via semidefinite programming. IEEE Transactions on Information Theory, 62(5):2788–2797, 2016.
- [56] Simai He, Zhi-Quan Luo, Jiawang Nie, and Shuzhong Zhang. Semidefinite relaxation bounds for indefinite homogeneous quadratic optimization. SIAM Journal on Optimization, 19(2):503–523, 2008.
- [57] Chinmay Hegde, Aswin C Sankaranarayanan, and Richard G Baraniuk. Near-isometric linear embeddings of manifolds. In 2012 IEEE Statistical Signal Processing Workshop (SSP), pages 728–731. IEEE, 2012.
- [58] Samuel B. Hopkins. Sub-gaussian mean estimation in polynomial time. CoRR, abs/1809.07425, 2018.
- [59] Jao Ping Hou. Bounds for the least Laplacian eigenvalue of a signed graph. Acta Mathematica Sinica, 21(4):955–960, 2005.
- [60] Adel Javanmard, Andrea Montanari, and Federico Ricci-Tersenghi. Phase transitions in semidefinite relaxations. Proceedings of the National Academy of Sciences, 113(16):E2218–E2223, 2016.
- [61] David Karger, Rajeev Motwani, and Madhu Sudan. Approximate graph coloring by semidefinite programming. Journal of the ACM (JACM), 45(2):246–265, 1998.
- [62] Subhash Khot and Assaf Naor. Approximate kernel clustering. Mathematika, 55(1-2):129–165, 2009.
- [63] Vladimir Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Statist., 34(6):2593–2656, 2006.
- [64] Vladimir Koltchinskii. Oracle inequalities in empirical risk minimization and sparse recovery problems, volume 2033 of Lecture Notes in Mathematics. Springer, Heidelberg, 2011. Lectures from the 38th Probability Summer School held in Saint-Flour, 2008, École d’Été de Probabilités de Saint-Flour. [Saint-Flour Probability Summer School].
- [65] Jérôme Kunegis, Stephan Schmidt, Andreas Lommatzsch, Jürgen Lerner, Ernesto William De Luca, and Sahin Albayrak. Spectral analysis of signed graphs for clustering, prediction and visualization. SDM, 10:559–570, 2010.
- [66] Jean Bernard Lasserre. An Introduction to Polynomial and Semi-Algebraic Optimization. Number 52. Cambridge University Press, 2015.
- [67] Javad Lavaei and Steven H Low. Zero duality gap in optimal power flow problem. IEEE Transactions on Power Systems, 27(1):92–107, 2011.
- [68] Can M. Le, Elizaveta Levina, and Roman Vershynin. Optimization via low-rank approximation for community detection in networks. Ann. Statist., 44(1):373–400, 2016.
- [69] Guillaume Lecué and Shahar Mendelson. Learning subgaussian classes: Upper and minimax bounds. Technical report, CNRS, Ecole polytechnique and Technion, 2013.
- [70] Claude Lemaréchal, Arkadii Nemirovskii, and Yurii Nesterov. New variants of bundle methods. Mathematical programming, 69(1-3):111–147, 1995.
- [71] J. Leskovec, D. Huttenlocher, and J. Kleinberg. Predicting positive and negative links in online social networks. In WWW, pages 641–650, 2010.
- [72] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg. Signed Networks in Social Media. In CHI, pages 1361–1370, 2010.
- [73] László Lovász. On the Shannon capacity of a graph. IEEE Transactions on Information Theory, 25(1):1–7, 1979.
- [74] Wing-Kin Ken Ma. Semidefinite relaxation of quadratic optimization problems and applications. IEEE Signal Processing Magazine, 1053(5888/10), 2010.
- [75] Enno Mammen and Alexandre B. Tsybakov. Smooth discrimination analysis. Ann. Statist., 27(6):1808–1829, 1999.
- [76] Pascal Massart. Concentration inequalities and model selection, volume 1896 of Lecture Notes in Mathematics. Springer, Berlin, 2007. Lectures from the 33rd Summer School on Probability Theory held in Saint-Flour, July 6–23, 2003, With a foreword by Jean Picard.
- [77] David A Mazziotti. Large-scale semidefinite programming for many-electron quantum mechanics. Physical review letters, 106(8):083001, 2011.
- [78] Elchanan Mossel, Joe Neeman, and Allan Sly. Reconstruction and estimation in the planted partition model. Probab. Theory Related Fields, 162(3-4):431–461, 2015.
- [79] Y Nesterov. Semidefinite relaxation and non-convex quadratic optimization. Optimization Methods and Software, 12:1–20, 1997.
- [80] Mark EJ Newman. Finding community structure in networks using the eigenvectors of matrices. Physical review E, 74(3):036104, 2006.
- [81] Carl Olsson, Anders P Eriksson, and Fredrik Kahl. Solving large scale binary quadratic problems: Spectral methods vs. semidefinite programming. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2007.
- [82] Jiming Peng and Yu Wei. Approximating k-means-type clustering via semidefinite programming. SIAM journal on optimization, 18(1):186–205, 2007.
- [83] Jiming Peng and Yu Xia. A new theoretical framework for k-means-type clustering. In Foundations and advances in data mining, pages 79–96. Springer, 2005.
- [84] Guy Pierra. Decomposition through formalization in a product space. Mathematical Programming, 28(1):96–115, 1984.
- [85] Gilles Pisier. Grothendieck’s theorem, past and present. Bulletin of the American Mathematical Society, 49(2):237–323, 2012.
- [86] Mason A Porter, Jukka-Pekka Onnela, and Peter J Mucha. Communities in networks. Notices of the AMS, 56(9):1082–1097, 2009.
- [87] Martin Royer. Adaptive clustering through semidefinite programming. In Advances in Neural Information Processing Systems, pages 1795–1803, 2017.
- [88] P Scobey and DG Kabe. Vector quadratic programming problems and inequality constrained least squares estimation. J. Indust. Math. Soc., 28:37–49, 1978.
- [89] Alexander Shapiro. Weighted minimum trace factor analysis. Psychometrika, 47(3):243–264, 1982.
- [90] A. Singer. Angular synchronization by eigenvectors and semidefinite programming. Appl. Comput. Harmon. Anal., 30(1):20–36, 2011.
- [91] A. Singer and Y. Shkolnisky. Three-dimensional structure determination from common lines in Cryo-EM by eigenvectors and semidefinite programming. SIAM Journal on Imaging Sciences, 4(2):543–572, 2011.
- [92] Amit Singer. Angular synchronization by eigenvectors and semidefinite programming. Applied and computational harmonic analysis, 30(1):20–36, 2011.
- [93] Jun Sun, Stephen Boyd, Lin Xiao, and Persi Diaconis. The fastest mixing markov process on a graph and a connection to a maximum variance unfolding problem. SIAM review, 48(4):681–699, 2006.
- [94] Michael J Todd. Semidefinite optimization. Acta Numerica, 10:515–560, 2001.
- [95] Alexandre B. Tsybakov. Optimal rate of aggregation. In Computational Learning Theory and Kernel Machines (COLT-2003), volume 2777 of Lecture Notes in Artificial Intelligence, pages 303–313. Springer, Heidelberg, 2003.
- [96] Alexandre B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Ann. Statist., 32(1):135–166, 2004.
- [97] Sara A. van de Geer. Applications of empirical process theory, volume 6 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2000.
- [98] Aad W. van der Vaart and Jon A. Wellner. Weak convergence and empirical processes. Springer Series in Statistics. Springer-Verlag, New York, 1996. With applications to statistics.
- [99] Vladimir N. Vapnik. Statistical learning theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons, Inc., New York, 1998. A Wiley-Interscience Publication.
- [100] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
- [101] V Vu. Singular vectors under random perturbation. Preprint available in arXiv:104.2000, 2010.
- [102] Irène Waldspurger, Alexandre d’Aspremont, and Stéphane Mallat. Phase recovery, maxcut and complex semidefinite programming. Mathematical Programming, 149(1-2):47–81, 2015.
- [103] Kilian Q Weinberger, Benjamin Packer, and Lawrence K Saul. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In AISTATS, volume 2, page 6. Citeseer, 2005.
- [104] Kilian Q Weinberger and Lawrence K Saul. An introduction to nonlinear dimensionality reduction by maximum variance unfolding. In AAAI, volume 6, pages 1683–1686, 2006.
- [105] Kilian Q Weinberger and Lawrence K Saul. Unsupervised learning of image manifolds by semidefinite programming. International journal of computer vision, 70(1):77–90, 2006.
- [106] Henry Wolkowicz. Semidefinite and lagrangian relaxations for hard combinatorial problems. In IFIP Conference on System Modeling and Optimization, pages 269–309. Springer, 1999.
- [107] Henry Wolkowicz, Romesh Saigal, and Lieven Vandenberghe. Handbook of semidefinite programming: theory, algorithms, and applications, volume 27. Springer Science & Business Media, 2012.
- [108] Y. Yu, T. Wang, and R. J. Samworth. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015.
- [109] S. Zhang and Y. Huang. Complex quadratic optimization and semidefinite programming. SIAM Journal on Optimization, 16(3):871–890, 2006.
- [110] Shuzhong Zhang. Quadratic maximization and semidefinite relaxation. Mathematical Programming, 87(3):453–465, 2000.