
Population-level Balance in Signed Networks

Weijing Tang
Department of Statistics and Data Science, Carnegie Mellon University
and
Ji Zhu
Department of Statistics, University of Michigan
Abstract

Statistical network models are useful for understanding the underlying formation mechanism and characteristics of complex networks. However, statistical models for signed networks have been largely unexplored. In signed networks, there exist both positive (e.g., like, trust) and negative (e.g., dislike, distrust) edges, which are commonly seen in real-world scenarios. The positive and negative edges in signed networks lead to unique structural patterns, which pose challenges for statistical modeling. In this paper, we introduce a statistically principled latent space approach for modeling signed networks and accommodating the well-known balance theory, i.e., “the enemy of my enemy is my friend” and “the friend of my friend is my friend”. The proposed approach treats both edges and their signs as random variables, and characterizes the balance theory with a novel and natural notion of population-level balance. This approach guides us towards building a class of balanced inner-product models, and towards developing scalable algorithms via projected gradient descent to estimate the latent variables. We also establish non-asymptotic error rates for the estimates, which are further verified through simulation studies. In addition, we apply the proposed approach to an international relation network, which provides an informative and interpretable model-based visualization of countries during World War II.


Keywords: Signed Networks, Balance Theory, Latent Space Models, Projected Gradient Descent.

1 Introduction

Networks characterize connectivity relationships between individuals of a complex system and are ubiquitous in various fields, such as social science, biology, transportation, and information technology [22]. In a network, a node represents an individual and an edge between two nodes indicates the presence of a certain interaction or relation. Given the unique relational information represented by networks, many statistical models have been developed to understand the underlying mechanism of the system and help explain the observed phenomena on networks; see for example [10] for a comprehensive overview. One important class of statistical models is the latent variable model, where the presence/absence of an edge depends on the node latent variables. For example, stochastic block models use latent categorical variables to describe the block structure among nodes [1]; latent space models map nodes into a low-dimensional metric space while accounting for transitivity, homophily in node attributes, node heterogeneity, and clustering [14, 19]. Such latent variable models are attractive due to their interpretable structure, their suitability for network visualization, and their ability to support downstream network-assisted learning tasks such as node clustering, node classification, and network link prediction.

Nonetheless, most statistical network models focus only on the presence/absence of edges while ignoring different types of edges, which makes them inadequate for modeling signed networks. A signed network consists of two types of edges, positive edges and negative edges, and such polarized relationships are common in real-world networks. For example, positive and negative edges may respectively correspond to relationships of like and dislike in social networks, collaboration and competition in trading networks, or alliance and militarized dispute in international relation networks. Modeling signed networks has its own unique challenges, not merely due to the additional sign for each edge, but more importantly, because positive and negative edges affect each other's presence in certain ways. There have been various social theories that describe the structural patterns of signed networks [11, 20], an important one being the structural balance theory [12]. Specifically, the balance theory describes the distribution of different types of triangles (i.e., three nodes that are mutually connected). A triangle in a signed network is called balanced if the product of its three edge signs is positive, and it is called unbalanced otherwise (see Figure 1 for examples). The balance theory postulates that balanced triangles should be more prevalent than unbalanced triangles in signed networks, which directly coincides with the proverbs "the enemy of my enemy is my friend" and "the friend of my friend is my friend". Moreover, recent studies have found empirical evidence of the balance property in many real-world signed networks [20, 17, 9].

Figure 1: Four types of triangles in signed networks, where the left two are balanced and the right two are unbalanced.

On the other hand, there have been very few statistical models for signed networks that incorporate the balance theory into modeling. To the best of our knowledge, [8] is the only recent work; specifically, it extends the configuration model [6] to signed networks with a focus on matching not only the node degree distribution but also the sign distribution and the proportion of balanced triangles. Besides statistical models, there is a collection of work using low-rank matrix completion approaches induced by the balance theory for learning tasks such as sign prediction and clustering [15, 5]. These works assume that there are underlying signed edges (not allowing for zeros) between all possible pairs of $n$ nodes, and view the network as a fixed $n\times n$ adjacency matrix with entries in $\{+1,-1\}$. In comparison, statistical network models can provide statistically interpretable structures and account for noise in both signs and edges by modeling network distributions that precisely quantify the randomness in the observed data.

In this paper, we propose a latent space approach to accommodate the balance theory for signed networks in a statistically principled way. Specifically, we introduce a novel definition of balance at the population level, which matches the balance theory in nature while viewing an observed network as the realization of a random quantity. For concreteness, we consider an undirected signed network with $n$ nodes denoted by a symmetric signed adjacency matrix $A$, with $A_{ij}=A_{ji}=1$ if node $i$ and node $j$ are linked by a positive edge, $A_{ij}=A_{ji}=-1$ if node $i$ and node $j$ are linked by a negative edge, and $A_{ij}=A_{ji}=0$ if there is no edge between $i$ and $j$. We assume there is no self-loop and thereby the diagonal elements of $A$ are zeros. We assume the $A_{ij}$ to be random variables taking values in $\{-1,0,1\}$ and define the notion of balance at the population level as follows.

Definition 1 (Population-level balanced network).

A network is population-level balanced if

\[
E\big(A_{ij}A_{j\ell}A_{\ell i}\,\big|\,|A_{ij}A_{j\ell}A_{\ell i}|=1\big)>0, \text{ for any three different nodes } (i,j,\ell).
\]

This definition requires the expected product of signs on any triangle to be positive, but it does not require all triangles to be balanced in an observed signed network. Furthermore, the stochastic notion in Definition 1 allows us to investigate which generating mechanisms of signed networks are inherently of population-level balance. Specifically, we will focus on a general class of latent space models, due to the aforementioned merits of latent space models. Rigorous descriptions are provided in Section 2. The key finding is that, if there exists a partition of the latent space such that edges tend to be positive within the same subset and negative between different subsets, then the network generated from such a latent space model is population-level balanced.
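To make Definition 1 concrete, the following Python sketch gives a Monte Carlo check of population-level balance for a single node triple; it assumes a hypothetical user-supplied function `sample_network(rng)` that draws a signed adjacency matrix from the model under study, and it is illustrative rather than part of the proposed methodology.

```python
import numpy as np

def balance_check(sample_network, triple=(0, 1, 2), n_draws=2000, seed=0):
    """Monte Carlo estimate of E(A_ij A_jl A_li | |A_ij A_jl A_li| = 1)."""
    rng = np.random.default_rng(seed)
    i, j, l = triple
    products = []
    for _ in range(n_draws):
        A = sample_network(rng)          # n x n matrix with entries in {-1, 0, 1}
        p = A[i, j] * A[j, l] * A[l, i]  # product of signs on the triangle (i, j, l)
        if abs(p) == 1:                  # condition on the triangle being present
            products.append(p)
    return np.mean(products) if products else np.nan

# Under a population-level balanced model, the returned value should be
# positive (up to Monte Carlo error) for every node triple.
```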

Based on this finding, we further propose a class of balanced inner-product models that directly capture the population-level balance. A unique difference from latent space models for unsigned networks is that we introduce an additional latent polar variable for each node. In particular, when the product of the latent polar variables of two nodes has a large positive value, the sign of an edge between them is more likely to be positive; a node whose latent polar variable is zero has no preference between signs when forming edges with other nodes. We note that it is this novel introduction of latent polar variables that enables modeling signed networks with the population-level balance.

This paper is organized as follows. We introduce the latent space approach for signed networks in Section 2, where we also provide a sufficient condition for the population-level balance. We present the proposed balanced inner-product models in Section 3. We develop two scalable estimation methods in Section 4 and establish their non-asymptotic error rates in Section 5, which are further validated by simulation studies in Section 6. We demonstrate the effectiveness of the proposed approach in modeling a real-world signed network for international relations in Section 7. All proofs are given in the Supplemental Material.

2 A Latent Space Approach for Signed Networks

In this section, we propose a probabilistic generative process for undirected signed networks with $n$ nodes. Recall that $A\in\{-1,0,1\}^{n\times n}$ is the signed adjacency matrix. Suppose the latent space $\mathcal{U}_0$ is endowed with the probability measure $P_u$; $B(\cdot,\cdot):\mathcal{U}_0\times\mathcal{U}_0\rightarrow(0,1)$ and $f(\cdot,\cdot):\mathcal{U}_0\times\mathcal{U}_0\rightarrow(-\infty,\infty)$ are two measurable symmetric functions.

Definition 2 (A general latent space model for signed networks $G(n,\mathcal{U}_0,P_u,B,f)$).

For $1\leq i\leq n$, let $u_i\in\mathcal{U}_0$ be the latent vector independently sampled from the distribution $P_u$. Given the latent vectors of a pair of nodes $i$ and $j$, independently of other pairs, an edge between node $i$ and node $j$ is drawn with probability $B(u_i,u_j)$, i.e.,

\[
|A_{ij}|\overset{\text{ind.}}{\sim}\text{Bernoulli}(P_{ij})\ \text{ with }\ P_{ij}=B(u_i,u_j);
\]

then for each edge (i.e., $|A_{ij}|=1$), independently of all others, it takes the positive sign with logit $f(u_i,u_j)$ and the negative sign otherwise, i.e.,

\[
\text{logit}\big(A_{ij}=1\,\big|\,|A_{ij}|=1\big)=f(u_i,u_j).
\]

We write $A\sim G(n,\mathcal{U}_0,P_u,B,f)$ to denote a signed network with $n$ nodes generated from the above procedure.
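A minimal sketch of the two-stage generative process in Definition 2 is given below; the particular choices of $P_u$, $B$, and $f$ in the example are illustrative assumptions, not specifications from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_signed_network(n, sample_u, B, f, rng):
    """Draw A ~ G(n, U_0, P_u, B, f): first edges, then signs (Definition 2)."""
    u = [sample_u(rng) for _ in range(n)]            # latent vectors u_i ~ P_u
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < B(u[i], u[j]):         # edge with probability B(u_i, u_j)
                # positive sign with probability sigmoid(f(u_i, u_j))
                sign = 1.0 if rng.random() < sigmoid(f(u[i], u[j])) else -1.0
                A[i, j] = A[j, i] = sign
    return A

# Illustrative choices: Gaussian latent vectors on R^2, a distance-based edge
# probability, and f(a, b) = phi(a) * phi(b) with phi(u) = u[0].
rng = np.random.default_rng(0)
A = sample_signed_network(
    n=50,
    sample_u=lambda rng: rng.normal(size=2),
    B=lambda a, b: 0.1 + 0.5 * np.exp(-np.sum((a - b) ** 2)),
    f=lambda a, b: a[0] * b[0],
    rng=rng,
)
```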

Note that in the network generative process in Definition 2, the first part for generating edges covers many existing latent variable models for unsigned networks as special cases by specifying different functions $B(\cdot,\cdot)$; the second part for generating signs further models the sign distribution through specifying the logit-transformed probability $f(\cdot,\cdot)$, as the sign of $f$ is the same as that of the conditional expectation of $A_{ij}$. Given this general class of latent space models for signed networks, next we identify the connection between the population-level balance and the key components of the model $(P_u,B,f)$. As we will see, this connection serves as the foremost step for incorporating the balance theory into modeling signed networks. The following proposition provides a sufficient condition on the symmetric function $f(\cdot,\cdot)$ such that the generated network is population-level balanced.

Proposition 1.

Suppose a symmetric function $f(\cdot,\cdot)$ satisfies that

\[
f(a,b)\cdot f(b,c)\cdot f(c,a)>0,\ \text{ for any } a,b,c\in\mathcal{U}, \tag{1}
\]

where $\mathcal{U}$ is a subset of $\mathcal{U}_0$ with probability $1$, i.e., $P_u(\mathcal{U})=1$. Then, a network $A\sim G(n,\mathcal{U}_0,P_u,B,f)$ with a symmetric measurable function $B(\cdot,\cdot):\mathcal{U}_0\times\mathcal{U}_0\rightarrow(0,1)$ is population-level balanced.

The proof of Proposition 1 is based on the fact that $E(A_{ij}|u_i,u_j)>0$ if and only if the logit $f(u_i,u_j)>0$, provided the probability that an edge appears between node $i$ and node $j$ is nonzero. The details are provided in the Supplemental Material. We note that though there is room for relaxing the requirement (1), its simplicity provides a feasible direction for further analyses on the form of $f$.

Since not every symmetric function $f$ satisfies (1), it is desirable to study what characteristics the function $f$ should have. To this end, we establish in Theorem 1 the necessary and sufficient conditions for the function $f$ to satisfy (1).

Theorem 1.

For a symmetric function $f(\cdot,\cdot):\mathcal{U}_0\times\mathcal{U}_0\rightarrow(-\infty,\infty)$, $f(a,b)\cdot f(b,c)\cdot f(c,a)>0$ holds for any $a,b,c\in\mathcal{U}$, where $\mathcal{U}\subset\mathcal{U}_0$ with $P_u(\mathcal{U})=1$, if and only if

(i) the function $f$ is positive on $\mathcal{U}\times\mathcal{U}$, i.e., $f(a,b)>0$ for any $a,b\in\mathcal{U}$; or

(ii) there exist two nonempty subsets $S$ and $T$, with $S\cup T=\mathcal{U}$ and $S\cap T=\varnothing$, such that $\text{sign}(f(a,b))=\mathbbm{1}(a\in S)\cdot\mathbbm{1}(b\in S)$ for any $a,b\in\mathcal{U}$, where $\mathbbm{1}(\text{event})$ is not the usual indicator function, but rather equals $1$ if the event holds and $-1$ otherwise.

Theorem 1 implies that, for a function $f$ to satisfy (1), if it is not always positive, then $\mathcal{U}$ can be divided into two nonempty disjoint subsets such that the function $f$ is positive when the two arguments belong to the same subset and negative otherwise. On the other hand, if the function $f$ is always positive, it corresponds to a trivial case in which the expected signs between all pairs of nodes are positive. Note that when $\mathcal{U}$ is discrete and finite, Theorem 1 is a direct result of [12], while our theorem can be applied to more general latent spaces.

Next, we further illustrate the implication of Theorem 1 in choosing the function $f$ to describe the population-level balance by taking commonly used latent spaces as examples.

Example 1 The latent space $\mathcal{U}_0$ can be a finite set as in stochastic block models [1]. Let $\mathcal{U}_0=\{1,\cdots,K\}$ and let $u_i$ denote the community that node $i$ belongs to. Theorem 1 implies that the $K$ communities can be further combined into two groups such that edges tend to be positive within the same group and negative between different groups, as shown in the left panel of Figure 2. We provide a rigorous description of the above result in the following corollary.

Corollary 1.1.

For a finite set $\mathcal{U}_0=\{1,\dots,K\}$, the symmetric function $f(\cdot,\cdot):\mathcal{U}_0\times\mathcal{U}_0\rightarrow(-\infty,\infty)$ satisfies $f(a,b)\cdot f(b,c)\cdot f(c,a)>0$ for any $a,b,c\in\mathcal{U}_0$ if and only if there exist a grouping function $g:\{1,\dots,K\}\rightarrow\{-1,1\}$ and constants $q_{ab}=q_{ba}>0$ such that $f(a,b)=q_{ab}\cdot g(a)\cdot g(b)$ holds for $1\leq a,b\leq K$.

Note that here the grouping function $g$ identifies two antagonistic groups in the signed network, where nodes from different groups tend to "dislike" each other.

Figure 2: Illustration of the latent space partition in Theorem 1. Left: $\mathcal{U}_0$ is a finite set, where each color corresponds to a possible state in $\mathcal{U}_0$ and the two ellipses correspond to the partition. Right: $\mathcal{U}_0$ is a Euclidean space, where the two colors correspond to the partition.

Example 2 The latent space can also be a Euclidean space as in the latent distance model and the latent projection model [14]. The following proposition provides an important class of continuous symmetric functions for which the requirement (1) is satisfied.

Proposition 2.

For the Euclidean space $\mathcal{U}_0=\mathbb{R}^k$, the requirement (1) holds if $f(a,b)=\phi(a)\phi(b)$, where $\phi(\cdot)$ is a real-valued continuous function and $P_u(u:\phi(u)\neq 0)=1$.

From the mathematical perspective, Proposition 2 is an obvious result based on the inequality in (1). However, in combination with Theorem 1, it leads us to a useful and interesting interpretation of the function $\phi$. Specifically, $\phi(\cdot)$ can be viewed as the logit (or score) of any binary "classifier" that separates $\mathcal{U}_0$ into two disjoint regions; the function $f$ is positive if the two arguments are classified into the same region and negative otherwise. For instance, a linear classifier $\phi(u)=w^\top u+\gamma$ with some $w\in\mathbb{R}^k$ and $\gamma\in\mathbb{R}$ separates $\mathcal{U}_0$ into two half-spaces. As shown in the right panel of Figure 2, the boundary of the classifier tries to cut as many negative edges as possible while retaining most positive edges within the same region.

Remark 1.

For the special case $f(a,b)=a^\top b$, [13] discussed the use of the inner product of two latent variables to capture various third-order dependence patterns among residuals in the context of regression, including the balance theory. Specifically, when the dimension $k$ is one, [13] observed that all triads among residuals are balanced, which allows for clustering nodes into two groups based on the signs of latent variables. This observation aligns with our findings on the separation of the latent space. Nonetheless, the balance concept studied in [13] is at the triangle level and does not extend to the entire network structure. In this work, we have introduced a rigorous stochastic definition of population-level balance for signed networks. Furthermore, the condition in Proposition 1 for population-level balance is more general and includes the inner product as an example.

Remark 2.

Moreover, our findings in this section can be generalized beyond triangles. In the Supplemental Material, we generalize the notion of population-level balance and extend Proposition 1 to any $\ell$-loops as well as general weighted networks. In addition, we extend the infinitely exchangeable graphon model to signed networks and expand Theorem 1 accordingly.

3 Balanced Inner-product Models

Motivated by the key finding in Theorem 1, we propose two inner-product models for signed networks that fall within the general class of latent space models in Definition 2. Both models are inherently of population-level balance. We first present the separate inner-product model and then introduce the joint inner-product model by adding an additional structural assumption. We will demonstrate the usefulness of this structural assumption in estimating the latent variables both theoretically (if it is correctly specified) and empirically in subsequent sections.

3.1 Separate inner-product model

We assume that for any $1\leq i<j\leq n$, we have

\[
|A_{ij}|=|A_{ji}|\overset{\text{ind.}}{\sim}\text{Bernoulli}(P_{ij}),\ \text{ with }\ \text{logit}(P_{ij})=\Theta_{ij}=\alpha_i+\alpha_j+z_i^\top z_j, \tag{2}
\]

where $\alpha_i\in\mathbb{R}$ and $z_i\in\mathbb{R}^k$ (for $i=1,\ldots,n$) are latent variables. Further, independently of all others, we also assume

\[
\text{logit}\big(A_{ij}=1\,\big|\,|A_{ij}|=1\big)=\eta_{ij}=v_iv_j, \tag{3}
\]

where $v_i\in\mathbb{R}$ for $i=1,\ldots,n$ are also latent variables.

The proposed separate inner-product model has the capacity to capture various commonly observed characteristics of signed networks. Specifically, the parameter $\alpha_i$ enables modeling node degree heterogeneity: a larger value of $\alpha_i$ leads to a higher probability of connecting with other nodes when other parameters are fixed. Thus, we call $\{\alpha_i\}_{i=1}^n$ degree heterogeneity parameters. Next, the inner-product formation between the latent position vectors $z_i$ and $z_j$ inherently models transitivity, i.e., nodes with common neighbors (regardless of friend or enemy) are more likely to be linked. This is because the closer the latent position vectors of two nodes are in the latent space, the larger their inner product is, and the more likely the two nodes are to connect with each other. Finally, the parameters $\{v_i\}_{i=1}^n$ model the distribution of signs through their product, which satisfies the sufficient condition (1) for the population-level balance. In particular, an edge between two nodes tends to have a positive sign when their latent variables $v_i$ and $v_j$ have the same sign, and a negative sign otherwise. Moreover, the magnitude of $v_i$ controls the discrepancy level between positive and negative signs. Therefore, we name them latent polar variables. When all latent polar variables are zeros, negative and positive signs are exchangeable.
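The model components above translate directly into edge and sign probabilities. The following sketch computes them in the natural parameterization of (2)-(3); it is for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def model_probabilities(alpha, Z, v):
    """Edge and positive-sign probabilities under the separate model (2)-(3).

    alpha: (n,) degree heterogeneity; Z: (n, k) latent positions;
    v: (n,) latent polar variables."""
    Theta = alpha[:, None] + alpha[None, :] + Z @ Z.T  # logit of P(|A_ij| = 1)
    eta = np.outer(v, v)                               # logit of P(A_ij = 1 | edge)
    return sigmoid(Theta), sigmoid(eta)
```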

Following [29], which considers unsigned networks, we impose no distributional assumptions (priors) on the latent variables $(\alpha_i,z_i,v_i)$ for the sake of modeling flexibility and estimation scalability, in contrast to treating them as random and using Bayesian estimation approaches as in existing works [19].

For presentation simplicity, we rewrite the model in matrix form. Specifically, we have $\Theta=\alpha 1_n^\top+1_n\alpha^\top+ZZ^\top$ and $\eta=vv^\top$, where $\alpha=(\alpha_1,\cdots,\alpha_n)^\top$, $1_n$ is the all-one vector in $\mathbb{R}^n$, $Z=(z_1,\cdots,z_n)^\top\in\mathbb{R}^{n\times k}$, and $v=(v_1,\cdots,v_n)^\top\in\mathbb{R}^n$.

Identifiability To ensure identifiability of the parameters $(\alpha,Z,v)$, we provide additional constraints in Proposition 3. Given centered latent position variables, that is, $J_nZ=Z$ where $J_n=I_n-\frac{1}{n}1_n1_n^\top$, the parameters are identifiable up to an orthogonal transformation and a sign flip.

Proposition 3.

Suppose two sets of parameters $(\alpha,Z,v)$ and $(\bar{\alpha},\bar{Z},\bar{v})$ satisfy A1) $J_nZ=Z$ and $J_n\bar{Z}=\bar{Z}$; and A2) $Z\in\mathbb{R}^{n\times k}$ is of full rank. Then, they specify the same network distribution through (2) and (3) if and only if there exist an orthogonal matrix $O\in\mathbb{R}^{k\times k}$ with $O^\top O=OO^\top=I_k$ and $\kappa\in\{-1,1\}$ such that $\alpha=\bar{\alpha}$, $Z=\bar{Z}O$, and $v=\kappa\bar{v}$.

3.2 Joint inner-product model

Based on the above separate inner-product model, we further consider the dependency of the latent polar variable $v_i$ on the latent position variable $z_i$. The idea of introducing their relationship originates naturally from Proposition 2, where we view the latent polar variable $v_i$ as a function of the latent position variable $z_i$, i.e., $v_i=\phi(z_i)$ with some link function $\phi$. Modeling such a link function $\phi$ can provide more structural information about the network. On the other hand, there is flexibility in choosing the family of the link function $\phi$, which would lead to different shapes of the latent space partition induced by $\phi$. For the scope of this paper, we assume $\phi$ is a linear function of $z_i$ in the joint inner-product model, i.e., $v_i=w^\top z_i+\gamma$ with $w\in\mathbb{R}^k$ and $\gamma\in\mathbb{R}$, and discuss nonlinear alternatives in Remark 3. Specifically, the joint inner-product model is given by (2) and replaces (3) with

\[
\text{logit}\big(A_{ij}=1\,\big|\,|A_{ij}|=1\big)=\eta_{ij}=(w^\top z_i+\gamma)(w^\top z_j+\gamma). \tag{4}
\]

In particular, the hyperplane $\{z\in\mathbb{R}^k:w^\top z+\gamma=0\}$ separates the latent space into two regions. A pair of nodes tends to have a positive edge when their latent positions are located on the same side of the hyperplane and a negative edge when their latent positions are located on different sides of the hyperplane. If $w=0$ and $\gamma\neq 0$, the sign of each edge has a homogeneous logit $\gamma^2$ of being positive.

Identifiability For the joint inner-product model, the identifiability condition for the parameters $(\alpha,Z,w,\gamma)$ is established correspondingly in Proposition 4.

Proposition 4.

Suppose two sets of parameters $(\alpha,Z,w,\gamma)$ and $(\bar{\alpha},\bar{Z},\bar{w},\bar{\gamma})$ satisfy A1) $J_nZ=Z$ and $J_n\bar{Z}=\bar{Z}$; and A2) $Z\in\mathbb{R}^{n\times k}$ is of full rank. Then, they specify the same network distribution through (2) and (4) if and only if there exist an orthogonal matrix $O\in\mathbb{R}^{k\times k}$ with $O^\top O=OO^\top=I_k$ and $\kappa\in\{-1,1\}$ such that $\alpha=\bar{\alpha}$, $Z=\bar{Z}O$, $w=\kappa O^\top\bar{w}$, and $\gamma=\kappa\bar{\gamma}$.

Remark 3.

Though we use a linear link function in the joint inner-product model (4), more flexible nonlinear functions can be considered. For example, we may assume $\phi$ belongs to a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ associated with an inner product $\langle\cdot,\cdot\rangle_\mathcal{H}$ under which $\mathcal{H}$ is complete. There is a positive semidefinite kernel function $\mathbb{K}(\cdot,\cdot):\mathbb{R}^k\times\mathbb{R}^k\rightarrow\mathbb{R}_+$ such that $\phi(z_i)=\langle\phi,\mathbb{K}(\cdot,z_i)\rangle_\mathcal{H}$. Multiple choices of RKHS are available for practical use, including those with the polynomial kernel, Gaussian kernel, and Laplacian kernel [23].

4 Model Estimation

In this section, we develop two methods for fitting the proposed models (2)-(4). Both methods minimize the negative log-likelihood function of the balanced inner-product models through projected gradient descent.

Under balanced inner-product models, the negative log-likelihood function consists of two parts. The first part is derived from the probability of forming edges:

\[
\mathcal{L}_e(\alpha,Z)=\sum_{i<j}\Big\{|A_{ij}|\Theta_{ij}+\log\big(1-\sigma(\Theta_{ij})\big)\Big\},
\]

where $\Theta=\alpha 1_n^\top+1_n\alpha^\top+ZZ^\top$, and $\sigma(x)=1/(1+\exp(-x))$ is the sigmoid function, which is the inverse of the logit function. The second part is derived from the probability of assigning signs:

\[
\mathcal{L}_s(v)=\sum_{i<j}\Big\{|A_{ij}|\frac{1+A_{ij}}{2}\eta_{ij}+|A_{ij}|\log\big(1-\sigma(\eta_{ij})\big)\Big\},
\]

where $\eta=vv^\top$; under the joint inner-product model, we further have $v=Zw+\gamma 1_n$, or equivalently, $v$ belongs to the column space of $(1_n,Z)$.
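For concreteness, the two likelihood parts can be computed as follows; this is a plain transcription of the displayed formulas, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loglik_edges(A, alpha, Z):
    """L_e(alpha, Z): edge-formation part, summed over i < j."""
    Theta = alpha[:, None] + alpha[None, :] + Z @ Z.T
    L = np.abs(A) * Theta + np.log(1.0 - sigmoid(Theta))
    return np.triu(L, k=1).sum()

def loglik_signs(A, v):
    """L_s(v): sign-assignment part; only edges with |A_ij| = 1 contribute."""
    eta = np.outer(v, v)
    L = np.abs(A) * ((1.0 + A) / 2.0 * eta + np.log(1.0 - sigmoid(eta)))
    return np.triu(L, k=1).sum()
```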

The first method estimates the parameters $(\alpha,Z)$ and $v$ separately by minimizing $\mathcal{L}_e(\alpha,Z)$ and $\mathcal{L}_s(v)$ respectively; hence we name it the separate estimation method. Note that the separate estimation method does not depend on a specific relationship between the latent polar variables $v$ and the latent position vectors $Z$. Therefore, the separate estimation method can always be applied regardless of the underlying link function $\phi$. Alternatively, we also propose a joint estimation method tailored for the joint inner-product model, which exploits the structural assumption for more accurate estimation. Specifically, we jointly estimate the parameters $(\alpha,Z,v)$ by minimizing a weighted sum of $\mathcal{L}_e(\alpha,Z)$ and $\mathcal{L}_s(v)$, while constraining $v$ to be in the column space of $(1_n,Z)$.

Notation Before presenting the algorithm details, we first introduce the following general notation to be used hereafter. For any $X\in\mathbb{R}^{d_1\times d_2}$, $X_{i*}$ and $X_{*j}$ denote the $i$-th row and $j$-th column of the matrix $X$ respectively, and for any function $\omega(\cdot)$, $\omega(X)$ represents applying the function $\omega(\cdot)$ element-wise to $X$, that is, $\omega(X)\in\mathbb{R}^{d_1\times d_2}$ and $[\omega(X)]_{ij}=\omega(X_{ij})$. We use $\circ$ to denote the Hadamard product, that is, for any two matrices $X,Y\in\mathbb{R}^{d_1\times d_2}$, $X\circ Y\in\mathbb{R}^{d_1\times d_2}$ and $[X\circ Y]_{ij}=X_{ij}Y_{ij}$. Moreover, we use $\|X\|_F$, $\|X\|_{op}$, $\|X\|_*$, and $\|X\|_{\max}$ to denote the Frobenius norm, the operator norm, the nuclear norm, and the max norm of a matrix respectively. We use $col(X)$ to denote the column space of $X$. For a vector $x\in\mathbb{R}^d$, we use $\|x\|$ to denote the Euclidean norm.

4.1 Separate estimation method

First, to estimate parameters (α,Z)(\alpha,Z), we solve the non-convex optimization problem below:

\[
\min_{\alpha\in\mathbb{R}^n,\,Z\in\mathbb{R}^{n\times k}}\ -\sum_{i,j}\Big\{|A_{ij}|\Theta_{ij}+\log\big(1-\sigma(\Theta_{ij})\big)\Big\}, \tag{5}
\]

subject to $\Theta=\alpha 1_n^\top+1_n\alpha^\top+ZZ^\top$ and $Z=J_nZ$. In particular, the signed adjacency matrix enters the objective function through its absolute value, which leads to the same optimization problem studied in [29] when there is no edge covariate. Here we adopt the projected gradient descent algorithm along with the initialization method proposed in [29] because of their theoretical guarantees and scalability to large networks. We provide the detailed description of the method in Algorithm S1 and the initialization algorithm in the Supplemental Material.
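The following Python sketch illustrates projected gradient descent for (5) in the spirit of Algorithm S1; the step size schedule and the random initialization are simplifying assumptions (the paper's algorithm and initialization are detailed in its Supplemental Material).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_edges_pgd(A, k, M1, steps=500, tau=0.2, seed=0):
    """Projected gradient descent on (5): a sketch of Algorithm S1."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    absA = np.abs(A)
    Z = rng.normal(scale=0.1, size=(n, k))
    alpha = np.zeros(n)
    J = np.eye(n) - np.ones((n, n)) / n
    for _ in range(steps):
        Theta = alpha[:, None] + alpha[None, :] + Z @ Z.T
        G = sigmoid(Theta) - absA                 # gradient of (5) w.r.t. Theta
        np.fill_diagonal(G, 0.0)                  # exclude self-loops
        Z = Z - (tau / n) * (2.0 * G @ Z)
        alpha = alpha - (tau / n) * (2.0 * G @ np.ones(n))
        # projections onto the constraint sets C_Z and C_alpha
        Z = J @ Z                                               # center: J_n Z = Z
        norms = np.linalg.norm(Z, axis=1, keepdims=True)
        Z = Z * np.minimum(1.0, np.sqrt(M1 / 2.0) / np.maximum(norms, 1e-12))
        alpha = np.clip(alpha, -M1 / 4.0, M1 / 4.0)
    return alpha, Z
```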

Next, to estimate the latent polar variables $v$, we solve another non-convex optimization problem, i.e.,

\[
\min_{v\in\mathbb{R}^n}\ -\sum_{i,j}|A_{ij}|\Big\{\frac{1+A_{ij}}{2}\eta_{ij}+\log\big(1-\sigma(\eta_{ij})\big)\Big\}\ \text{ subject to }\ \eta=vv^\top. \tag{6}
\]

Similarly, we develop a fast gradient descent algorithm, which is summarized in Algorithm S2. We also use an initialization algorithm based on the universal singular value thresholding proposed by [26] (see the Supplemental Material).
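A corresponding sketch of the gradient descent for (6), in the spirit of Algorithm S2, is given below; the step size and initialization are again simplifying assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_signs_gd(A, M3, steps=500, tau=0.2, seed=0):
    """Gradient descent on (6): a sketch of Algorithm S2.
    Only observed edges (|A_ij| = 1) enter the gradient."""
    rng = np.random.default_rng(seed)
    absA = np.abs(A)
    B = absA * (1.0 + A) / 2.0            # 1 for positive edges, 0 otherwise
    v = rng.normal(scale=0.1, size=A.shape[0])
    for _ in range(steps):
        eta = np.outer(v, v)
        grad = 2.0 * (absA * sigmoid(eta) - B) @ v      # gradient w.r.t. v
        v = v - tau / max(v @ v, 1e-12) * grad          # step size tau / ||v||^2
        v = np.clip(v, -np.sqrt(M3), np.sqrt(M3))       # projection onto C_v
    return v
```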

Remark 4.

We note that, although we use gradient descent algorithms for estimating both $(\alpha,Z)$ and $v$, the subtle difference in their objectives makes the theory in [29] not directly applicable to (6). Specifically, unlike the objective in (5), not all elements of the signed adjacency matrix contribute to the objective in (6). Instead, only nonzero entries, i.e., $\{(i,j):|A_{ij}|=1\}$, are used for inferring the latent polar variables through (6). In this case, one key step in establishing the improvement in errors of iterates in [29] no longer holds. Therefore, we establish a new error bound for Algorithm S2 (see Section 5).

Remark 5.

We also note that our optimization problem in (6) is closely related to the line of research on low-rank matrix estimation; see [18, 2, 7, 4, 25, 24] for a sample of references. In particular, (6) can be viewed as a one-bit matrix completion problem, where we observe a random subset of binary entries generated from a distribution determined by a low-rank matrix. To solve this problem, [7] considered a convex relaxation that replaces the low-rank constraint by nuclear norm penalization. Though it becomes a convex optimization problem, in general, solving such a nuclear-norm penalized optimization problem requires a singular value decomposition at each iteration, which is computationally expensive for large matrices. Alternatively, gradient descent algorithms have been used to improve computational efficiency. [4] and [24] established convergence guarantees and statistical errors for gradient descent algorithms applied to low-rank matrix estimation problems, which in particular cover the one-bit matrix completion problem. However, the theories in the aforementioned works are based on the uniform random sampling assumption, i.e., each entry of the matrix is observed independently with a uniform probability $p$, while in our case, entries are observed with different probabilities $P_{ij}$. Thus, our theoretical analysis of the proposed gradient descent algorithm in Section 5 provides new results for one-bit matrix completion under non-uniform random sampling.

4.2 Joint estimation method

Under the joint inner-product model (2) and (4), we propose to jointly estimate the parameters $(\alpha,Z,v)$ by re-parameterizing $v=Zw+\gamma 1_n$ with $w\in\mathbb{R}^k$ and $\gamma\in\mathbb{R}$. Introducing a hyperparameter $\lambda$, we minimize the following weighted negative log-likelihood,

\[
\mathcal{L}_\lambda(\alpha,Z,w,\gamma)=-\sum_{i,j}\Big\{(1-\lambda)\big[|A_{ij}|\Theta_{ij}+\log\big(1-\sigma(\Theta_{ij})\big)\big]+\lambda|A_{ij}|\Big[\frac{1+A_{ij}}{2}\eta_{ij}+\log\big(1-\sigma(\eta_{ij})\big)\Big]\Big\},
\]

subject to $\Theta=\alpha 1_n^\top+1_n\alpha^\top+ZZ^\top$, $Z=J_nZ$, and $\eta=(Zw+\gamma 1_n)(Zw+\gamma 1_n)^\top$.

Here $\lambda$ controls the relative weight of information from the edge formation and the sign assignment respectively. In particular, when $\lambda=0$, no information from the edge signs is used and the joint estimation reduces to the separate estimation of $(\alpha,Z)$ in (5). Later in Section 5.2, we will theoretically show that, under certain conditions, any positive $\lambda$ below some threshold yields more accurate estimation of the latent position variables $Z$ than the separate estimation (i.e., $\lambda=0$), but the magnitude of the improvement depends on the choice of $\lambda$. In principle, we can select $\lambda$ in a data-driven manner by performing cross-validation on the observed signed adjacency matrix, where we randomly mask a subset of entries, fit the joint inner-product model using the remaining entries, repeat the process multiple times, and then select from a candidate set the $\lambda$ with the best average prediction performance on the held-out entries. In practice, we find that simply setting $\lambda=1/2$ also works generally well, in which case the solution becomes the usual maximum likelihood estimator.

To solve the above constrained minimization problem, we develop a projected gradient descent algorithm, whose details are given in Algorithm S5 in the Supplemental Material. We initialize the algorithm with $(\alpha_0,Z_0)=(\hat{\alpha},\hat{Z})$ obtained from Algorithm S1 and $(w_0,\gamma_0)=\operatorname*{arg\,min}_{w\in\mathbb{R}^k,\gamma\in\mathbb{R}}\mathcal{L}_\lambda(\alpha_0,Z_0,w,\gamma)$.
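One iteration of the joint update can be sketched as follows; the gradients follow from the chain rule through $\Theta=\alpha 1_n^\top+1_n\alpha^\top+ZZ^\top$ and $\eta=vv^\top$ with $v=Zw+\gamma 1_n$, while the step sizes are simplifying assumptions (the exact schedule of Algorithm S5 is in the Supplemental Material).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def joint_pgd_step(A, alpha, Z, w, gamma, lam=0.5, tau=0.2):
    """One projected-gradient step on L_lambda: a sketch of Algorithm S5."""
    n = Z.shape[0]
    absA = np.abs(A)
    B = absA * (1.0 + A) / 2.0
    ones = np.ones(n)
    v = Z @ w + gamma * ones
    Theta = alpha[:, None] + alpha[None, :] + Z @ Z.T
    G_theta = sigmoid(Theta) - absA                  # gradient w.r.t. Theta
    np.fill_diagonal(G_theta, 0.0)                   # exclude self-loops
    G_eta = absA * sigmoid(np.outer(v, v)) - B       # gradient w.r.t. eta
    grad_v = 2.0 * G_eta @ v                         # chain rule through eta = v v^T
    Z_new = Z - (tau / n) * (2.0 * (1 - lam) * G_theta @ Z + lam * np.outer(grad_v, w))
    alpha_new = alpha - (tau / n) * (2.0 * (1 - lam) * G_theta @ ones)
    w_new = w - (tau / n) * lam * (Z.T @ grad_v)
    gamma_new = gamma - (tau / n) * lam * (ones @ grad_v)
    J = np.eye(n) - np.ones((n, n)) / n
    return alpha_new, J @ Z_new, w_new, gamma_new    # re-center Z after the step
```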

5 Theoretical Results

In this section, we establish high probability error bounds for the two proposed estimation methods. Note that for the separate estimation method, the error bound for estimating the latent position vectors $Z$ under model (2) and that for estimating the latent polar variables $v$ under model (3) are derived separately. Thus, the separate estimation method is robust in the sense that, when one of models (2) and (3) is mis-specified, our theoretical results still hold for the other. On the other hand, for the joint estimation method that utilizes the relationship between $v$ and $Z$, we further discuss how incorporating their dependency can help reduce the estimation error of the latent variables under the joint inner-product model (4).

5.1 Results for the separate estimation method

We present theoretical guarantees of Algorithms S1 and S2 under the separate inner-product model (2) and model (3) respectively. Note that the error bound for the outputs of Algorithm S1 is a straightforward result of [29, Theorem 9] when there is no edge covariate; we adjust it in Proposition 5 for presentation coherence. Nonetheless, their theory cannot be directly applied to the setting of Algorithm S2, because only nonzero entries of the signed adjacency matrix are included in the objective (6), which breaks an important step towards establishing the estimation improvements for successive iterations in their proof. Thus, our established error bound for the outputs of Algorithm S2 is a new result for a more general setting, where entries are observed with non-uniform probabilities.

We describe the error bounds for the outputs of Algorithms S1 and S2 in detail below. We first define the parameter spaces as

\[
\mathcal{F}_\theta(n,k,M_1,M_2)=\Big\{\alpha\in\mathbb{R}^n,\ Z\in\mathbb{R}^{n\times k},\ \Theta\in\mathbb{R}^{n\times n}\ \big|\ \Theta=\alpha 1_n^\top+1_n\alpha^\top+ZZ^\top,\ J_nZ=Z,\ \max_{1\leq i\leq n}\|Z_{i*}\|^2,\ 2\|\alpha\|_{\max}\leq\frac{M_1}{2},\ \max_{1\leq i\neq j\leq n}\Theta_{ij}\leq-M_2\Big\} \tag{7}
\]

and

\[
\mathcal{F}_\eta(n,M_3)=\Big\{v\in\mathbb{R}^n,\ \eta\in\mathbb{R}^{n\times n}\ \big|\ \eta=vv^\top,\ \|v\|_{\max}^2\leq M_3\Big\}, \tag{8}
\]

where $M_i>0$ for $i=1,2,3$. We allow $k$, $M_1$, $M_2$, and $M_3$ in (7)-(8) to change with the network size $n$, similarly as in [29]. Note that, given the inequalities in (7), it is straightforward to see that, for any $\Theta\in\mathcal{F}_\theta(n,k,M_1,M_2)$, we have $-M_1\leq\Theta_{ij}\leq-M_2$ for $1\leq i\neq j\leq n$. Therefore, $M_2$ controls the network sparsity through the upper bound $-M_2$ on the logit-transformed probabilities of observing edges: a larger value leads to a sparser network. The true parameters are denoted by $(\alpha^*,Z^*,v^*)$, $\Theta^*=\alpha^*1_n^\top+1_n{\alpha^*}^\top+Z^*{Z^*}^\top$, and $\eta^*=v^*{v^*}^\top$.

Error bound for Algorithm S1 Let $(\alpha_t,Z_t)$ be the updated parameters at the $t$-th iteration in Algorithm S1 and $\Theta_t=\alpha_t1_n^\top+1_n\alpha_t^\top+Z_tZ_t^\top$. Since the latent position vectors $Z\in\mathbb{R}^{n\times k}$ are identifiable up to an orthogonal transformation, we define the distance between two latent matrices $Z_1$ and $Z_2$ as $dist(Z_1,Z_2)=\min_{O\in O(k)}\|Z_1-Z_2O\|_F$, where $O(k)$ is the collection of all $k\times k$ orthogonal matrices. Let $O_t=\arg\min_{O\in O(k)}\|Z_t-Z^*O\|_F$, $\Delta_{Z_t}=Z_t-Z^*O_t$, and $\Delta_{\Theta_t}=\Theta_t-\Theta^*$.
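The distance $dist(Z_1,Z_2)$ is an orthogonal Procrustes problem with a closed-form solution via the singular value decomposition; a short sketch:

```python
import numpy as np

def procrustes_dist(Z1, Z2):
    """dist(Z1, Z2) = min over orthogonal O of ||Z1 - Z2 O||_F.
    The minimizer is O = U V^T, where Z2^T Z1 = U S V^T."""
    U, _, Vt = np.linalg.svd(Z2.T @ Z1)
    return np.linalg.norm(Z1 - Z2 @ (U @ Vt))
```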

For theoretical justification, in Algorithm S1, we further assume projection onto the constraint sets $\mathcal{C}_Z=\{Z\in\mathbb{R}^{n\times k},J_nZ=Z,\max_{1\leq i\leq n}\|Z_{i*}\|^2\leq M_1/2\}$ and $\mathcal{C}_\alpha=\{\alpha\in\mathbb{R}^n,2\|\alpha\|_{\max}\leq M_1/2\}$ at each iteration. The following proposition establishes the high probability error bounds for estimating both the latent position matrix $Z$ and the logit-transformed probability matrix $\Theta$.

Proposition 5.

Suppose 1) the initializers $\alpha_0,Z_0$ in Algorithm S1 satisfy $\|Z^*\|_{op}^2\|\Delta_{Z_0}\|_F^2+\|\Delta_{\alpha_0}1_n^\top\|_F^2\leq c_0e^{-2M_1}\|Z^*\|_{op}^4/\kappa_{Z^*}^4$ for a sufficiently small positive constant $c_0$, where $\kappa_{Z^*}$ is the condition number of $Z^*$; and 2) $\|Z^*\|_{op}^2\geq C_1\kappa_{Z^*}^2\sqrt{n}e^{M_1-M_2/2}\max\{\sqrt{\tau k}e^{M_1},1\}$ for a sufficiently large constant $C_1$. Then there exist positive constants $\rho$, $c_1$, and $C$ uniformly over $\mathcal{F}_\theta(n,k,M_1,M_2)$ such that, with probability at least $1-n^{-c_1}$, we have

\[
\|Z^*\|_{op}^2\|\Delta_{Z_T}\|_F^2,\ \|\Delta_{\Theta_T}\|_F^2\leq C\kappa_{Z^*}^2e^{2M_1}nk\cdot\max\Big\{e^{-M_2},\frac{\log n}{n}\Big\},
\]

for some $T\leq\log\big(\frac{M_1^2}{\kappa_{Z^*}^2e^{4M_1-M_2}}\frac{n}{k^3}\big)\big/\log\big(1-\frac{\tau}{e^{M_1}\kappa_{Z^*}^2}\rho\big)^{-1}$.

Error bound for Algorithm S2 Let $v_t$ be the updated parameters at the $t$-th iteration in Algorithm S2 and $\eta_t=v_tv_t^\top$. Similarly, as the latent polar variables $v\in\mathbb{R}^n$ are identifiable up to a sign, we define the distance between two latent vectors $v_1$ and $v_2$ as $dist(v_1,v_2)=\min_{\kappa\in\{-1,1\}}\|v_1-\kappa v_2\|$. Let $\kappa_t=\arg\min_{\kappa\in\{-1,1\}}\|v_t-\kappa v^*\|$ and $\Delta_{v_t}=v_t-\kappa_tv^*$, and further let $\Delta_{\eta_t}=\eta_t-\eta^*$.

Although the error bound presented below does not rely on a specific generating process of edges such as model (2) and the parameter space $\mathcal{F}_\theta(n,k,M_1,M_2)$ in (7), it depends on the lower bound of the probability of observing an edge. For notational consistency, we use $-M_1$ to denote the lower bound of the logit-transformed probability matrix, i.e., $\Theta_{ij}\geq-M_1$ for $1\leq i,j\leq n$. Similarly, for theoretical justification, we constrain $v$ to be in the set $\mathcal{C}_v=\{v\in\mathbb{R}^n,\|v\|_{\max}^2\leq M_3\}$ at each iteration of Algorithm S2. The following theorem establishes the high probability error bounds for estimating the latent polar variables $v$ and the logit-transformed probability matrix $\eta$.

Theorem 2.

Set the step size as $\tau_v=\tau/\|v_0\|^2$ for any $\tau\leq c$, where $c>0$ is a universal constant. Suppose 1) the initializer $v_0$ in Algorithm S2 satisfies $\|\Delta_{v_0}\|\leq c_0e^{-(M_1+M_3)/2}\|v^*\|$ for a sufficiently small positive constant $c_0$; and 2) $\|v^*\|^2\geq C_1\sqrt{n}e^{M_1+M_3}\max\{\sqrt{\tau e^{M_1+M_3}},1\}$ for a sufficiently large constant $C_1$. Then there exist positive constants $\rho$, $c_1$, and $C$ uniformly over $\mathcal{F}_\eta(n,M_3)$ and $M_1$ such that, with probability at least $1-n^{-c_1}$, we have

\[
\|v^*\|^2\|\Delta_{v_T}\|^2,\ \|\Delta_{\eta_T}\|_F^2\leq Ce^{2(M_1+M_3)}n,
\]

for some $T\leq\log\big(\frac{M_3^2}{e^{3(M_1+M_3)}}n\big)\big/\log\big(1-\frac{\tau}{e^{M_1+M_3}}\rho\big)^{-1}$.

Theorem 2 implies that the mean squared error $\|\Delta_{\eta_T}\|_F^2/n^2$ is of order $\mathcal{O}(1/n)$, which coincides with the existing error rate for one-bit rank-$1$ matrix completion problems [7, 4], while our result can be viewed as their extension to the case where entries of the one-bit matrix are randomly observed with non-uniform probabilities. In particular, for the more general non-uniform case, the key ingredient in our proof is to derive a lower bound for the sampling operator $|A|\in\{0,1\}^{n\times n}$. We prove that the sampling operator $|A|$ has a positive lower bound, i.e., $\||A|\circ\eta\|_F\geq c\|\eta\|_F$ with some $c>0$, as long as $\eta$ belongs to a specific data-dependent set. This positive lower bound enables us to extend the proof in [29] when establishing iterative improvements. The proof of Theorem 2 is given in the Supplemental Material.

Remark 6.

Note that the error bounds for $v$ and $\eta$ in Theorem 2 hold regardless of the concrete form of model (2); therefore, the above results still hold even if model (2) is mis-specified. However, the bound still depends on the probability of observing edges. Recall that the probabilities of forming edges are bounded as $e^{-M_1}\asymp 1/(1+e^{M_1})\leq P_{ij}\leq 1/(1+e^{M_2})\asymp e^{-M_2}$. For $M_1\asymp M_2$, the network density is of the order $\rho=e^{-M_1}$. Theorem 2 implies that the error bound decreases as the network density $\rho$ grows. Specifically, the impact of the network density lies in the multiplier $1/\rho^2$. Intuitively, a sparse network, with fewer observed edges, leads to larger estimation errors for $v$ and $\eta$ due to the lack of sign observations. Similarly, Proposition 5 implies that the network density affects the error bound of $Z$ through the multiplier $\max\{\rho,\log n/n\}/\rho^2$. For networks with modest density ($\rho\geq\log n/n$), the multiplier is $1/\rho$. For sparser networks where $\rho\leq\log n/n$, the multiplier adjusts to $\log n/(n\rho^2)$.

Remark 7.

The assumptions in both Proposition 5 and Theorem 2 require relatively good initializations of $(\alpha,Z,v)$. We note that the conditions for $\alpha_0$ and $Z_0$ can be achieved with theoretical justification by the initialization algorithm proposed in [29], which is based on universal singular value thresholding [26]. We further extend this algorithm to initialize $v_0$. Based on our simulation studies (see the Supplemental Material), we find that simple random initialization also achieves similar estimation errors after the algorithm converges, while requiring more iterations for convergence.

5.2 Results for the joint estimation method

We first present the convergence guarantee and the error bound for the estimators obtained by Algorithm S5. Then we further investigate how the joint estimation method could improve the estimation of $Z$ on top of the separate estimation.

Under the joint inner-product model, we redefine the parameter space as

\[
\begin{aligned}
\mathcal{F}(n,k,M_1,M_2,M_3)=\Big\{&\alpha,v\in\mathbb{R}^n,\ Z\in\mathbb{R}^{n\times k},\ \Theta,\eta\in\mathbb{R}^{n\times n}\ \big|\ \Theta=\alpha 1_n^\top+1_n\alpha^\top+ZZ^\top,\ J_nZ=Z,\\
&\eta=vv^\top,\ v=Zw+\gamma 1_n,\ \max_{1\leq i\leq n}\|Z_{i*}\|^2,\ \|\alpha\|_{\max}\leq\frac{M_1}{2},\\
&\max_{1\leq i\neq j\leq n}\Theta_{ij}\leq-M_2,\ \|v\|_{\max}^2\leq M_3,\ \|w\|\leq M,\ |\gamma|\leq M'\Big\},
\end{aligned}
\]

where $k$, $M_1$, $M_2$, and $M_3$ are allowed to change with the network size $n$. Let $(\alpha^*,Z^*,v^*)$ be the true parameters, where $v^*=Z^*w^*+\gamma^*1_n$ with some $w^*\in\mathbb{R}^k$ and $\gamma^*\in\mathbb{R}$.

Error bound for Algorithm S5 Let $(\alpha_t,Z_t,v_t)$ be the updated parameters at the $t$-th iteration in Algorithm S5. We assume projection onto the same constraint sets $\mathcal{C}_\alpha$, $\mathcal{C}_Z$, and $\mathcal{C}_v$ at the end of each iteration as those for Algorithms S1 and S2. The following theorem first guarantees that the error of the iterates $\{(\alpha_t,Z_t)\}_{t\geq 0}$ converges up to a statistical error and then gives the high probability error bounds for the estimators of $Z$ and $\Theta$.

Theorem 3.

Set the step sizes as $\tau_Z=r_0\tau/\|Z_0\|_{op}^2$ and $\tau_\alpha=\tau/(2n)$, and the weight $\lambda=\tilde{\lambda}r_0/(e^{M_1}\kappa_{Z^*}^2)$ with $r_0=\min\{1,\|Z_0\|_{op}^2/\|v_0\|^2\}$, for any $\tau\leq c_\tau$ and $\tilde{\lambda}\leq c_\lambda$, where $c_\tau$ and $c_\lambda$ are universal constants. Let $\zeta_n=\max\{\||A|-P\|_{op},1\}$ and $\varphi_n=\max\{\||A|\circ((1+A)/2-Q)\|_{op},1\}$. Denote the error metric for the iterates as $\tilde{e}_{Z,t}=\|\Delta_{Z_t}\|_F^2\|Z_0\|_{op}^2+\|\Delta_{\alpha_t}1_n^\top\|_F^2$. Suppose the initializers $\alpha_0,Z_0$ in Algorithm S5 satisfy $\tilde{e}_{Z,0}\leq c_0e^{-2M_1-3M_3}\|Z^*\|_{op}^4/\kappa_{Z^*}^4$ for a sufficiently small positive constant $c_0$, where $\kappa_{Z^*}$ is the condition number of $Z^*$. Then, we have

1. (Deterministic bounds for iterative errors) If $\|Z^*\|_{op}^2\geq C_0e^{M_1}\kappa_{Z^*}^2\zeta_n\max\{\sqrt{\tau k}e^{M_1+3M_3/2}\kappa_{Z^*},1\}$ and $\|v^*\|^2\geq C_0e^{M_1+M_3}\varphi_n\max\{\sqrt{\tau}e^{M_1/2+M_3},1\}$ for a sufficiently large constant $C_0$, then there exist universal positive constants $\rho_1$, $\rho_2$, $C'$, and $C''$ such that for all $t\geq 0$,

\[
\begin{aligned}
\tilde{e}_{Z,t+1}\leq\ &\Big(1-\frac{r_0\tau\rho_1}{e^{M_1}\kappa_{Z^*}^2}\Big)\tilde{e}_{Z,t}-\lambda\frac{r_0\tau\rho_2}{e^{M_3}}\min\big\{\||A|\circ\Delta_{\eta_t}\|_F^2,\ e^{-M_1}\|\Delta_{\eta_t}\|_F^2\big\}\\
&+r_0\tau C'e^{M_1}\zeta_n^2k+\lambda r_0\tau C''e^{M_1+M_3}\varphi_n^2.
\end{aligned}
\]
2. (High-probability bounds) Suppose $\|Z^*\|_{op}^2\geq C_0e^{M_1-M_2/2}\kappa_{Z^*}^2\sqrt{n}\max\{\sqrt{\tau k}e^{M_1+3M_3/2}\kappa_{Z^*},1\}$ and $\|v^*\|^2\geq C_0e^{M_1+M_3}\sqrt{n}\max\{\sqrt{\tau}e^{M_1/2+M_3},1\}$ for a sufficiently large constant $C_0$. Then there exist positive constants $\rho_1$, $c$, and $C$ uniformly over $\mathcal{F}(n,k,M_1,M_2,M_3)$ such that, with probability at least $1-n^{-c}$, we have

\[
\|Z^*\|_{op}^2\|\Delta_{Z_T}\|_F^2,\ \|\Delta_{\Theta_T}\|_F^2\leq C\kappa_{Z^*}^2e^{2M_1}nk\cdot\max\Big\{e^{-M_2},\frac{\log n}{n},e^{M_3-M_1}\frac{1}{\kappa_{Z^*}^2k}\Big\},
\]

for some $T\leq\log\big(\frac{M_1^2}{\kappa_{Z^*}^2e^{4M_1+3M_3-M_2}}\frac{n}{k^3}\big)\big/\log\big(1-\frac{r_0\tau\rho_1}{e^{M_1}\kappa_{Z^*}^2}\big)^{-1}$.

The first part of Theorem 3 indicates that, compared to the separate estimation method, the joint method, by involving the gradient of the sign likelihood, yields an extra improvement in the error bound of the iterates, which depends on $\Delta_{\eta_t}$, while introducing another statistical error term $\varphi_n$. As a result, in the second part, the high probability error bounds depend on the maximum of three terms, among which the first two are the same as in Proposition 5 and the third results from $\varphi_n$. When $M_3\leq M_1-M_2+\log(\kappa_{Z^*}^2k)$, the maximum multiplier reduces to the one in Proposition 5. Overall, the error bounds of $Z$ and $\Theta$ for the joint estimation method are still of order $\mathcal{O}(nk)$, the same as for the separate estimation method. In addition, the following corollary gives the error bounds of $v_T$ and $\eta_T$ obtained from line 5 of Algorithm S5.

Corollary 3.1.

For $v_t=Z_tw_t+\gamma_t1_n$ with $(w_t,\gamma_t)=\operatorname*{arg\,min}_{w\in\mathbb{R}^k,\gamma\in\mathbb{R}}\mathcal{L}_\lambda(\alpha_t,Z_t,w,\gamma)$, we have $\|\Delta_{\eta_t}\|_F\leq 16e^{M_1+M_3}\max\{\zeta_n,\varphi_n\}+e^{M_1/2+M_3}(2+\|\Delta_{Z_t}\|_F\|w^*\|)\|v^*\|\|\Delta_{Z_t}\|_F\|w^*\|$ for $t\geq 0$. Suppose the conditions for the high probability bounds in Theorem 3 hold; then there exist positive constants $\rho_1$, $c$, and $C$ uniformly over $\mathcal{F}(n,k,M_1,M_2,M_3)$ such that, with probability at least $1-n^{-c}$, we have

\[
\|v^*\|^2\|\Delta_{v_T}\|^2,\ \|\Delta_{\eta_T}\|^2\leq Ce^{3M_1+2M_3}nk\cdot\max\Big\{\frac{e^{M_3-M_1}}{k},\ \kappa_{Z^*}^2\max\big\{e^{-M_2},\frac{\log n}{n}\big\}\Big\},
\]

for some $T\leq\log\big(\frac{M_1^2}{\kappa_{Z^*}^2e^{4M_1+3M_3-M_2}}\frac{n}{k^3}\big)\big/\log\big(1-\frac{r_0\tau\rho_1}{e^{M_1}\kappa_{Z^*}^2}\big)^{-1}$.

In particular, the deterministic error bound for $\Delta_{\eta_t}$ consists of the statistical error term $\max\{\zeta_n,\varphi_n\}$ and the estimation error of $Z_t$, and with high probability $\|\Delta_{\eta_T}\|^2$ is dominated by the estimation error of $Z_t$ and thus is also of order $\mathcal{O}(nk)$.

One-step improvement Although Theorem 3 guarantees the convergence of Algorithm S5 up to certain statistical errors, the achieved error rate is of the same order as that for the separate estimation method. To further investigate how exploiting the extra structural information in the joint inner-product model helps estimate the latent variables, we consider below the estimation error after moving one step against the gradient from the estimators obtained by the separate method.

Suppose we are given estimators $(\bar{\alpha},\bar{Z})$ of the latent variables obtained from the separate estimation Algorithm S1 and an estimator $\bar{v}=\bar{Z}\bar{w}+\bar{\gamma}1_n$. Then we update the estimator of $Z$ by one step through Algorithm S5 as below:

\[
\hat{Z}=\bar{Z}-2\tau_z(1-\lambda)\big(\sigma(\bar{\Theta})-|A|\big)\bar{Z}-2\tau_z\lambda\big(|A|\circ\sigma(\bar{\eta})-B\big)\bar{v}\bar{w}^\top, \tag{9}
\]

where $\bar{\Theta}=\bar{\alpha}1_n^\top+1_n\bar{\alpha}^\top+\bar{Z}\bar{Z}^\top$, $\bar{\eta}=\bar{v}\bar{v}^\top$, and $B=|A|\circ(A+1)/2$. Note that $[B]_{ij}=|A_{ij}|b_{ij}$ with $b_{ij}$ independently following Bernoulli$(\sigma(\eta_{ij}^*))$ conditional on $|A|$. The following proposition provides insights on the scenarios in which the one-step update in the joint estimation method could lead to better estimates of the latent position vectors $Z$. Below, for ease of derivation, we consider the parameter space $\mathcal{F}(n,k,M_1,M_2,M_3)$ with fixed $M_i$ ($i=1,2,3$) and $k$.
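The one-step update (9) is straightforward to transcribe; the following sketch assumes given $(\bar{\alpha},\bar{Z},\bar{w},\bar{\gamma})$ and user-chosen $\lambda$ and $\tau_z$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def one_step_update(A, alpha_bar, Z_bar, w_bar, gamma_bar, lam, tau_z):
    """One-step improvement of Z via equation (9)."""
    n = Z_bar.shape[0]
    absA = np.abs(A)
    B = absA * (A + 1.0) / 2.0
    v_bar = Z_bar @ w_bar + gamma_bar * np.ones(n)
    Theta_bar = alpha_bar[:, None] + alpha_bar[None, :] + Z_bar @ Z_bar.T
    eta_bar = np.outer(v_bar, v_bar)
    return (Z_bar
            - 2.0 * tau_z * (1 - lam) * (sigmoid(Theta_bar) - absA) @ Z_bar
            - 2.0 * tau_z * lam * np.outer((absA * sigmoid(eta_bar) - B) @ v_bar, w_bar))
```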

Proposition 6.

Suppose we are given the estimators $(\bar{\alpha},\bar{Z})$ obtained from Algorithm S1 and estimators $(\bar{w},\bar{\gamma})$ that are independent of $B$ conditional on $|A|$ and satisfy $\|\bar{w}-w^*\|^2+\|\bar{\gamma}-\gamma^*\|^2=\mathcal{O}(1/n)$, and that we update $\bar{Z}$ for one step by (9) to obtain $\hat{Z}$. Suppose further that the conditions in Proposition 5 hold, and that the singular values of the sample covariance ${Z^*}^\top Z^*/n$ are of constant order. Then there exists an optimal $\lambda_{opt}$ that minimizes $\mathbb{E}\|\hat{Z}-Z^*\|_F^2$. Furthermore, if $\lambda_{opt}>0$, we have $\mathbb{E}\|\Delta_{\hat{Z}}\|_F^2<\mathbb{E}\|\Delta_{\bar{Z}}\|_F^2$ for any $\lambda\in(0,2\lambda_{opt})$, and the improvement $\mathbb{E}\|\Delta_{\bar{Z}}\|_F^2-\mathbb{E}\|\Delta_{\hat{Z}}\|_F^2$ with $\lambda_{opt}$ is at least

\[
\frac{\big\||A|\circ\xi\circ T_1\big\|_F^2\left(\big\||A|\circ\xi\circ T_1\big\|_F-\big\||A|\circ\xi\circ T_2\big\|_F-\big\||A|\circ\xi\circ T_3\big\|_F\right)^2}{16\left(\big\||A|\circ\xi\circ\xi\circ(T_1+T_2-T_3)\big\|_{op}^2+\mathbb{E}\big\|B-|A|\circ\sigma(\eta^*)\big\|_{op}^2\big/\big(\|\bar{Z}\|_{op}^2\|\bar{w}\|^4\big)\right)},
\]

where TiT_{i}’s are given in (S42)-(S44) respectively for i=1,2,3i=1,2,3 in the Supplemental Material with T1F=𝒪(1)\|T_{1}\|_{F}=\mathcal{O}(1), T2F=𝒪(1)/w\|T_{2}\|_{F}=\mathcal{O}(1)/\|w^{*}\|, and T3F=𝒪(1/n)\|T_{3}\|_{F}=\mathcal{O}(1/\sqrt{n}), and ξ\xi is an element-wise positive constant matrix. Here 𝔼\mathbb{E} represents the conditional expectation of BB given |A||A|.

We provide the expression of the optimal $\lambda_{opt}$ that minimizes $\mathbb{E}\|\hat{Z}-Z^*\|_F^2$ and the proof of Proposition 6 in the Supplemental Material, where we also discuss when the conditional independence assumption and the prerequisite error rate of $(\bar{w},\bar{\gamma})$ in Proposition 6 hold. Since a positive $\lambda_{opt}$ implies a strict decrease in the mean squared error of $Z$ after the one-step update, we further investigate in which cases $\lambda_{opt}$ tends to be positive. In particular, $\lambda_{opt}>0$ if and only if $\big\||A|\circ\xi\circ T_1\big\|_F-\big\||A|\circ\xi\circ T_2\big\|_F-\big\||A|\circ\xi\circ T_3\big\|_F$ is strictly positive. Our analysis of the upper bounds of the three terms suggests that the first two terms are the dominating terms and that a larger $\|w^*\|$ more likely results in a positive $\lambda_{opt}$. Therefore, when the signal from the edge sign distribution is strong, incorporating information from the observed signs in the joint estimation method is useful for improving the estimation of $Z$. Moreover, the magnitude of the improvement also depends on the levels of the signal and the noise in the sign distribution. Specifically, as $\|w^*\|\asymp\|\bar{w}\|$ increases, the difference between the upper bounds of the two dominating terms in the numerator increases while the upper bound of the denominator decreases, so overall the improvement is likely to increase. This implies that larger signals in the edge signs would lead to greater improvement in estimating $Z$. On the other hand, we find that the improvement decreases when the noise $\mathbb{E}\|B-|A|\circ\sigma(\eta^*)\|_{op}^2/\|\bar{Z}\|_{op}^2$ in the edge sign distribution increases in the denominator.

6 Simulation Studies

In this section, we conduct simulation studies to investigate how the estimation errors of the proposed methods depend on: 1) the network size and the dimension of the latent position vectors; 2) the network density; and 3) the proportion of positive edges.

Estimation methods We compare three estimation methods. In addition to the separate estimation method and the joint estimation method introduced in Section 4, we add an intermediate method, one-step-joint estimation, to illustrate the one-step improvement discussed in Proposition 6. Specifically, given $\bar{Z}$ and $\tilde{v}$ obtained from Algorithms S1 and S2 respectively, we compute the one-step-joint estimators $(J_{n}\hat{Z},\bar{v})$ by first updating $\bar{v}=\bar{Z}\bar{w}+\bar{\gamma}1_{n}$ with $(\bar{w},\bar{\gamma})=\operatorname*{arg\,min}_{w\in\mathbb{R}^{k},\gamma\in\mathbb{R}}\|\tilde{v}-\bar{Z}w-\gamma 1_{n}\|$ and then obtaining $\hat{Z}$ by plugging $(\bar{Z},\bar{v})$ into (9). We set $\lambda=1/2$, so that the joint estimation coincides with maximum likelihood estimation.
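For concreteness, the least-squares update for $(\bar{w},\bar{\gamma})$ admits a one-line implementation; below is a minimal NumPy sketch, where the helper name one_step_joint_v and the variables Z_bar and v_tilde (standing for $\bar{Z}$ and $\tilde{v}$) are ours:

import numpy as np

def one_step_joint_v(Z_bar, v_tilde):
    # Regress v_tilde on the columns of Z_bar plus an intercept, i.e.,
    # (w_bar, gamma_bar) = argmin_{w, gamma} ||v_tilde - Z_bar w - gamma 1_n||.
    n = Z_bar.shape[0]
    X = np.hstack([Z_bar, np.ones((n, 1))])      # design matrix [Z_bar, 1_n]
    coef, *_ = np.linalg.lstsq(X, v_tilde, rcond=None)
    w_bar, gamma_bar = coef[:-1], coef[-1]
    v_bar = Z_bar @ w_bar + gamma_bar            # v_bar = Z_bar w_bar + gamma_bar 1_n
    return w_bar, gamma_bar, v_bar

The returned v_bar is then plugged, together with $\bar{Z}$, into the one-step update (9).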

Simulation settings For a given network size $n$ and latent dimension $k$, we set the model parameters as follows. We first generate the latent positions $Z_{ij}\overset{\text{iid}}{\sim}\mathcal{N}(0,1)$ for $1\leq i\leq n$, $1\leq j\leq k$. Centering the columns of $Z$ gives $Z^{*}=J_{n}Z$, where $J_{n}=I_{n}-\frac{1}{n}1_{n}1_{n}^{\top}$. We then rescale $Z^{*}$ so that $\|Z^{*}{Z^{*}}^{\top}\|_{F}=n$. Next, we generate the node degree heterogeneity parameters $\alpha_{i}^{*}=-\alpha_{i}/\sum_{i'=1}^{n}\alpha_{i'}$, where $\alpha_{i}\overset{\text{iid}}{\sim}U(1,3)$ for $1\leq i\leq n$. Finally, we set $w^{*}=1_{k}/\sqrt{k}$, $\gamma^{*}=0$, and $v^{*}_{i}={w^{*}}^{\top}Z^{*}_{i}+\gamma^{*}$.
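This generation procedure can be written compactly; the following NumPy sketch mirrors the description above (the helper name generate_parameters is ours):

import numpy as np

def generate_parameters(n, k, seed=0):
    rng = np.random.default_rng(seed)
    # Latent positions: iid N(0,1) entries, column-centered (Z* = J_n Z),
    # then rescaled so that ||Z* Z*^T||_F = n.
    Z = rng.standard_normal((n, k))
    Z_star = Z - Z.mean(axis=0)
    Z_star *= np.sqrt(n / np.linalg.norm(Z_star @ Z_star.T, 'fro'))
    # Degree heterogeneity: alpha_i* = -alpha_i / sum_j alpha_j, alpha_i ~ U(1,3).
    alpha = rng.uniform(1.0, 3.0, size=n)
    alpha_star = -alpha / alpha.sum()
    # Polar variables: w* = 1_k / sqrt(k), gamma* = 0, v*_i = w*^T Z*_i + gamma*.
    w_star = np.ones(k) / np.sqrt(k)
    gamma_star = 0.0
    v_star = Z_star @ w_star + gamma_star
    return Z_star, alpha_star, w_star, gamma_star, v_star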

Given the true latent variables $Z^{*}$, $\alpha^{*}$, and $v^{*}$, we randomly generate 20 replications of the signed adjacency matrix following (2) and (3), and fit models with the three estimation methods. For each method, we measure the relative errors for $Z$, $v$, $\Theta$, and $\eta$. Due to the identifiability conditions in Proposition 4, we define the relative error for $Z$ as $\|\hat{Z}-Z^{*}\bar{Q}\|_{F}/\|Z^{*}\|_{F}$, where $\bar{Q}=\operatorname*{arg\,min}_{Q\in O(k)}\|\hat{Z}-Z^{*}Q\|_{F}$ and $O(k)$ is the set of $k\times k$ orthogonal matrices. We define the relative error for $v$ as $\|\hat{v}-\bar{\kappa}v^{*}\|/\|v^{*}\|$, where $\bar{\kappa}=\operatorname*{arg\,min}_{\kappa\in\{1,-1\}}\|\hat{v}-\kappa v^{*}\|$. The relative errors for $\Theta$ and $\eta$ are defined as $\|\hat{\Theta}-\Theta^{*}\|_{F}/\|\Theta^{*}\|_{F}$ and $\|\hat{\eta}-\eta^{*}\|_{F}/\|\eta^{*}\|_{F}$, respectively, where $\hat{\Theta}=\hat{\alpha}1_{n}^{\top}+1_{n}\hat{\alpha}^{\top}+\hat{Z}\hat{Z}^{\top}$ and $\hat{\eta}=\hat{v}\hat{v}^{\top}$.
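Both alignment problems admit closed-form solutions, which we use when computing the errors. The sketch below (helper names are ours) obtains $\bar{Q}$ from the singular value decomposition of ${Z^{*}}^{\top}\hat{Z}$, by the standard orthogonal Procrustes argument, and $\bar{\kappa}$ by comparing the two sign choices:

import numpy as np

def relative_error_Z(Z_hat, Z_star):
    # argmin_{Q in O(k)} ||Z_hat - Z_star Q||_F is attained at Q = U V^T,
    # where Z_star^T Z_hat = U S V^T.
    U, _, Vt = np.linalg.svd(Z_star.T @ Z_hat)
    Q = U @ Vt
    return np.linalg.norm(Z_hat - Z_star @ Q) / np.linalg.norm(Z_star)

def relative_error_v(v_hat, v_star):
    # argmin_{kappa in {1,-1}} ||v_hat - kappa v_star||.
    err = min(np.linalg.norm(v_hat - v_star), np.linalg.norm(v_hat + v_star))
    return err / np.linalg.norm(v_star)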

Figure 3: Log-log plots of relative errors with respect to the network size $n$. The dimension of the latent position vector is fixed at $k=2$.

6.1 Varying the network size and the dimension of the latent space

In Figure 3, we summarize how the estimation errors vary with the network size. We fix $k=2$ and vary $n\in\{500,1000,2000,4000\}$. For a fixed dimension of the latent space, the relative errors of all three estimation methods decrease at the rate of $1/\sqrt{n}$ as the network size $n$ grows, in agreement with the theoretical error rates given in Section 5. Next, compared to the separate estimation method, the joint estimation method consistently achieves smaller estimation errors on all four quantities of interest across different network sizes. In addition, the one-step-joint estimation, which updates the estimates by a single gradient step, already reduces the estimation errors relative to the separate estimation method.

We further summarize how the estimation errors of $Z$ and $\Theta$ vary with the dimension of the latent position vector in Figure S3 in the Supplemental Material, for the sake of space. We fix $n=2000$ and vary $k\in\{2,4,8\}$. For a fixed network size, the relative errors increase at the rate of $\sqrt{k}$ as the dimension $k$ grows, which also agrees with our theoretical results. The relative ordering of the three estimation methods for different $k$ is similar to that in Figure 3, with the joint estimation method consistently the best.

6.2 Varying the network density

We investigate how the estimation errors of the three estimation methods vary with the network density. To this end, we generate the node degree heterogeneity parameters $\alpha_{i}^{*}=-\bar{\alpha}-\alpha_{i}/\sum_{i'=1}^{n}\alpha_{i'}$, where $\alpha_{i}\overset{\text{iid}}{\sim}U(1,3)$ for $1\leq i\leq n$. We fix $n=2000$ and $k=4$, and vary $\bar{\alpha}\in\{0,0.25,0.5,0.75,1,1.25\}$, which leads to network densities ranging from $0.1$ to $0.5$.
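As a sanity check on this construction, the expected density implied by a given offset $\bar{\alpha}$ can be computed directly; the sketch below assumes, consistent with the logistic form of the model, that $P(|A_{ij}|=1)=\sigma(\Theta^{*}_{ij})$ with $\Theta^{*}=\alpha^{*}1_{n}^{\top}+1_{n}{\alpha^{*}}^{\top}+Z^{*}{Z^{*}}^{\top}$ (the helper name is ours):

import numpy as np

def expected_density(Z_star, alpha_star, alpha_bar):
    # Shift the degree parameters by -alpha_bar and average the edge
    # probabilities sigma(Theta_ij) over distinct pairs i < j.
    a = alpha_star - alpha_bar
    Theta = a[:, None] + a[None, :] + Z_star @ Z_star.T
    P = 1.0 / (1.0 + np.exp(-Theta))
    iu = np.triu_indices(len(a), k=1)
    return P[iu].mean()

Increasing alpha_bar makes every $\Theta^{*}_{ij}$ smaller and hence the network sparser.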

Figure S4 in the Supplemental Material summarizes the relative estimation errors of $Z$ and $v$ over 20 replications under different network densities. The estimation errors of both $Z$ and $v$ decrease for all three methods as the network gets denser, and the joint estimation method consistently achieves lower errors than the other two methods across network densities. In particular, when the network is dense, the improvement in estimating $Z$ from the joint estimation over the separate estimation increases. This is expected because, in joint estimation, the observed edge signs are also useful for inferring $Z$, and denser networks provide more sign information. In addition, regarding the estimation error of $v$, the joint and one-step-joint estimation methods, which use the additional structural relation between $Z$ and $v$, perform more stably than the separate estimation method as the network gets sparser.

Figure 4: Relative errors with respect to the proportion of positive edges. The network size is fixed at $n=2000$ and the dimension of the latent position vector is fixed at $k=4$.

6.3 Varying the proportion of positive edges

We also investigate the effect of the proportion of positive edges on the three estimation methods. For this purpose, we modify the simulation setting: we fix $n=2000$ and $k=4$, and vary $\gamma^{*}\in\{0,0.3,0.6,0.9,1.2,1.5,1.8\}$, which results in proportions of positive edges ranging from $0.52$ to $0.91$. To eliminate the artificial effect of varying $\gamma^{*}$ when evaluating the estimation error of $v$, we focus on the estimation error of the centered $v$, i.e., $v_{cen}=J_{n}v$. We define the relative error for $v_{cen}$ as $\|J_{n}\hat{v}-\bar{\kappa}J_{n}v^{*}\|/\|J_{n}v^{*}\|=\|J_{n}\hat{v}-\bar{\kappa}Z^{*}w^{*}\|/\|Z^{*}w^{*}\|$, where $\bar{\kappa}=\operatorname*{arg\,min}_{\kappa\in\{1,-1\}}\|J_{n}\hat{v}-\kappa J_{n}v^{*}\|$.
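The centered error can be evaluated analogously to the earlier alignment-based errors; a small sketch (the helper name is ours), using the fact that $J_{n}v$ simply subtracts the mean of $v$ from each entry:

import numpy as np

def relative_error_v_centered(v_hat, v_star):
    # J_n v = v - mean(v) 1_n; then sign-align as before.
    c_hat = v_hat - v_hat.mean()
    c_star = v_star - v_star.mean()
    err = min(np.linalg.norm(c_hat - c_star), np.linalg.norm(c_hat + c_star))
    return err / np.linalg.norm(c_star)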

Figure 4 summarizes the relative estimation errors of $Z$ and $v_{cen}$ over 20 replications under different proportions of positive edges. Overall, the joint estimation method performs the best among the three methods and is robust across different sign distributions. Note that the estimation error of $Z$ for the separate method does not change, because the generated absolute adjacency matrix $|A|$ is unaffected by $\gamma^{*}$, so the separate estimates $\hat{Z}$ stay the same. We also observe that the best performance of the separate method for estimating $v$ is not achieved at around 50% positive signs. Intuitively, in the extreme case where $v\approx 0$, the negative and positive signs become indistinguishable, with the proportion of positive edges close to 50%, which makes estimating $v$ challenging. The proportion that minimizes the estimation error may vary case by case; we investigate an SBM example in the Supplemental Material for illustration.

7 International Relation Data

In this section, we apply the proposed method to an international relation dataset, the Correlates of War (COW) [16], to demonstrate how it can be used to produce informative interpretations and visualizations of signed networks. The COW dataset records various types of international relations among countries, such as wars, alliances, and militarized interstate disputes. Following [17], we construct a signed network of countries, where positive edges represent alliance relationships and negative edges represent militarized disputes between countries. We take the snapshot of the records during World War II (WWII), i.e., from 1939 to 1945. In particular, if two countries were involved in both alliances and militarized disputes, we set the sign of their edge to positive if the number of years of alliance exceeds that of militarized disputes, and to negative otherwise. According to the COW records, 68 countries were involved in alliances or militarized disputes during WWII, and the resulting signed network contains 566 positive edges and 519 negative edges. The network contains 2,148 balanced triangles, outnumbering the 613 unbalanced triangles by more than threefold. To further assess the balance property, we applied a stratified permutation test [9]: under the null hypothesis, the sign of an edge is independent of its position in the graph, which is regarded as balance-free. The p-value of the test is less than $0.001$, indicating strong evidence of the balance property in this international relation network.
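The triangle counts, and a simplified variant of the test, can be computed directly from the signed adjacency matrix $A$ with entries in $\{-1,0,+1\}$. The sketch below (helper names are ours) uses the identity that $\mathrm{trace}(A^{3})$ counts each triangle six times weighted by the product of its edge signs; note that it replaces the stratified scheme of [9] with a plain sign shuffle for illustration:

import numpy as np

def triangle_counts(A):
    # trace(|A|^3)/6 = number of triangles; trace(A^3)/6 = balanced - unbalanced.
    total = np.trace(np.abs(A) @ np.abs(A) @ np.abs(A)) / 6.0
    signed = np.trace(A @ A @ A) / 6.0
    return (total + signed) / 2.0, (total - signed) / 2.0  # (balanced, unbalanced)

def sign_permutation_pvalue(A, n_perm=1000, seed=0):
    # Unstratified illustration: shuffle the observed signs over the fixed
    # edge positions and compare balanced-triangle counts.
    rng = np.random.default_rng(seed)
    rows, cols = np.where(np.triu(np.abs(A), k=1))
    signs = A[rows, cols]
    observed = triangle_counts(A)[0]
    hits = 0
    for _ in range(n_perm):
        B = np.zeros_like(A)
        B[rows, cols] = rng.permutation(signs)
        B = B + B.T
        hits += triangle_counts(B)[0] >= observed
    return (hits + 1) / (n_perm + 1)

Applied to the WWII network, triangle_counts would return the 2,148 balanced and 613 unbalanced triangles reported above.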

To incorporate the balance theory, we fit the joint inner-product model to the COW dataset. We set the dimension of the latent position vectors to $k=2$, so that the estimated $\hat{Z}$ can be directly visualized on a 2-dimensional plane. The model fitted by the joint estimation method is visualized in Figure 5, where each node represents a country and its coordinates are given by $\hat{Z}$. The size of each node is determined by the estimated degree heterogeneity parameter $\hat{\alpha}$, with larger nodes corresponding to larger $\hat{\alpha}_{i}$ values. The color and shape of each node $i$ indicate the estimated latent polar variable $\hat{v}_{i}$: if $\hat{v}_{i}>0$, the node is drawn as a red circle, and if $\hat{v}_{i}<0$, as a blue square. For both red and blue nodes, the larger the magnitude $|\hat{v}_{i}|$, the darker the color. The sign of each edge is also indicated, with dashed green lines being positive and solid purple lines being negative.

Figure 5: Visualization of the model fitted by the joint estimation method on the COW dataset. The nodes are countries involved in WWII. Green dashed lines represent positive edges (alliance) and purple solid lines represent negative edges (militarized dispute). Node sizes are determined by the estimated node degree heterogeneity parameters $\hat{\alpha}$. Node colors and shapes are determined by the estimated latent polar variables $\hat{v}$: node $i$ is a red circle if $\hat{v}_{i}>0$ and a blue square otherwise, and the larger the magnitude of $\hat{v}_{i}$, the darker the color. The grey dash-dotted line represents the linear boundary between the red and blue nodes. To keep the figure readable, only countries with degree greater than 5 are labeled.

Figure 5 shows that the proposed model fitted by the joint estimation method captures important information in the signed network of countries during WWII. First, in terms of the estimated node degree heterogeneity $\hat{\alpha}$, the top 11 countries are Germany, Italy, Japan, the United Kingdom (UK), Romania, the United States (USA), Brazil, Bulgaria, Hungary, France, and the Soviet Union (USSR). The UK, the USA, and the USSR were the three leading countries of the Allies of WWII; France played important and complicated roles in both the Allies and the Axis; Brazil was the only South American country that actively participated in WWII; all other countries were members of the Axis. Since members of the Axis were on average more active than those of the Allies, it is reasonable that even small Axis members have high values of $\hat{\alpha}$. Second, regarding the estimated latent polar variable $\hat{v}$, the model divides the countries into two groups (blue and red) that mostly align with the division between the Axis and the Allies. As we assume a linear transformation from $Z$ to $v$, the plane (the space of $Z$) is linearly separated into two areas, with the boundary illustrated by the grey dash-dotted line. Edges crossing the boundary are mainly negative (purple), while edges within the same side are mainly positive (green). Finally, the estimated latent position vectors $\hat{Z}$ capture various other interesting aspects. Within the Allies, $\hat{Z}$ further clusters the countries in the Americas together (see the right part of the figure), among which most edges are positive (green). The Middle East countries form a cluster as well (at the top of the figure), though not as tight as the countries in the Americas. It is also interesting that France, one of the major Allied powers in WWII, is positioned on the same side as Germany, Italy, and Japan. This is because, after being occupied by Germany, France was divided into two political powers, Free France and Vichy France, with the latter collaborating with Germany and fighting against the Allies in several campaigns from 1940 to 1944. We perform an additional sensitivity analysis in the Supplemental Material to assess the impact of splitting France into two political entities.

For comparison, we also fit the separate inner-product model and visualize the fitted model in Figure S6 in the Supplemental Material, for the sake of space. We find that the model fitted by the separate estimation method captures useful information from the network but is not as interpretable as the one fitted by the joint estimation method (see the Supplemental Material for a more detailed discussion).

8 Discussion

In this paper, we propose a latent space approach that accommodates the structural balance theory for modeling signed networks. In particular, we introduce a novel notion of population-level balance, which is a natural choice to characterize the structural balance theory when we treat both the edges and their signs as random variables. We develop sufficient conditions for a latent space model to be balanced at the population level, and propose two balanced inner-product models following the conditions. We also provide scalable estimation algorithms with theoretical guarantees.

In Section 3, we proposed balanced inner-product models that use a one-dimensional polar variable. This design offers a direct and intuitive interpretation: the sign of the polar variable indicates membership in one of two antagonistic groups. Beyond this, multi-dimensional polar variables, i.e., $\eta_{ij}=v_{i}^{\top}v_{j}$ for $v_{i}\in\mathbb{R}^{d}$ with $d>1$, could offer more flexibility in capturing multiple latent factors in real-world applications. We have verified that our estimation method and high-probability error bounds remain applicable in the multi-dimensional case. However, it is important to note that, with multi-dimensional polar variables, the conditions in Theorem 1 no longer hold; such variables also have the potential drawback of complicating interpretation.

There are a few other directions to explore in the future. First, the joint inner-product model could be extended to include a nonlinear link function $\phi$, which would increase the model's flexibility. Second, we may generalize the proposed approach to weighted signed networks to better leverage the richer edge information available in real-world networks. Finally, it is desirable to extend the latent space approach from undirected to directed networks, which could potentially be used to model other interesting social theories for signed networks, such as social status theory.


References

  • [1] Emmanuel Abbe “Community Detection and Stochastic Block Models: Recent Developments” In Journal of Machine Learning Research 18.177, 2018, pp. 1–86
  • [2] Emmanuel J. Candes, Yonina C. Eldar, Thomas Strohmer and Vladislav Voroninski “Phase Retrieval via Matrix Completion” In SIAM Journal on Imaging Sciences 6.1, 2013, pp. 199–225 DOI: 10.1137/110848074
  • [3] Sourav Chatterjee “Matrix estimation by Universal Singular Value Thresholding” In Annals of Statistics 43.1 The Institute of Mathematical Statistics, 2015, pp. 177–214 DOI: 10.1214/14-AOS1272
  • [4] Yudong Chen and Martin J Wainwright “Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees” In arXiv preprint arXiv:1509.03025, 2015
  • [5] Kai-Yang Chiang, Cho-Jui Hsieh, Nagarajan Natarajan, Inderjit S. Dhillon and Ambuj Tewari “Prediction and Clustering in Signed Networks: A Local to Global Perspective” In Journal of Machine Learning Research 15.34, 2014, pp. 1177–1213
  • [6] Fan Chung and Linyuan Lu “Complex Graphs and Networks (CBMS Regional Conference Series in Mathematics)” USA: American Mathematical Society, 2006
  • [7] Mark A Davenport, Yaniv Plan, Ewout Van Den Berg and Mary Wootters “1-bit matrix completion” In Information and Inference: A Journal of the IMA 3.3 OUP, 2014, pp. 189–223
  • [8] Tyler Derr, Charu Aggarwal and Jiliang Tang “Signed network modeling based on structural balance theory” In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 557–566
  • [9] Derek Feng, Randolf Altmeyer, Derek Stafford, Nicholas A. Christakis and Harrison H. Zhou “Testing for Balance in Social Networks” In Journal of the American Statistical Association, Taylor & Francis, 2020, pp. 1–19 DOI: 10.1080/01621459.2020.1764850
  • [10] Anna Goldenberg, Alice X. Zheng, Stephen E. Fienberg and Edoardo M. Airoldi “A Survey of Statistical Network Models” In Foundations and Trends® in Machine Learning 2.2, 2010, pp. 129–233 DOI: 10.1561/2200000005
  • [11] Ramanthan Guha, Ravi Kumar, Prabhakar Raghavan and Andrew Tomkins “Propagation of trust and distrust” In Proceedings of the 13th International Conference on World Wide Web, 2004, pp. 403–412
  • [12] Frank Harary “On the notion of balance of a signed graph.” In The Michigan Mathematical Journal 2.2 The University of Michigan, 1953, pp. 143–146
  • [13] Peter D Hoff “Bilinear mixed-effects models for dyadic data” In Journal of the American Statistical Association 100.469 Taylor & Francis, 2005, pp. 286–295
  • [14] Peter D Hoff, Adrian E Raftery and Mark S Handcock “Latent Space Approaches to Social Network Analysis” In Journal of the American Statistical Association 97.460 Taylor & Francis, 2002, pp. 1090–1098 DOI: 10.1198/016214502388618906
  • [15] Cho-Jui Hsieh, Kai-Yang Chiang and Inderjit S Dhillon “Low rank modeling of signed networks” In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 507–515
  • [16] Ahmet Izmirlioglu “The Correlates of War Dataset” In Journal of World-Historical Information 3.1 University Library System, University of Pittsburgh, 2017
  • [17] Alec Kirkley, George T. Cantwell and M. E. J. Newman “Balance in signed networks” In Physical Review E 99 American Physical Society, 2019, pp. 012320 DOI: 10.1103/PhysRevE.99.012320
  • [18] Vladimir Koltchinskii, Karim Lounici and Alexandre B. Tsybakov “Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion” In Annals of Statistics 39.5 The Institute of Mathematical Statistics, 2011, pp. 2302–2329 DOI: 10.1214/11-AOS894
  • [19] Pavel N. Krivitsky, Mark S. Handcock, Adrian E. Raftery and Peter D. Hoff “Representing degree distributions, clustering, and homophily in social networks with latent cluster random effects models” In Social Networks 31.3, 2009, pp. 204–213 DOI: https://doi.org/10.1016/j.socnet.2009.04.001
  • [20] Jure Leskovec, Daniel Huttenlocher and Jon Kleinberg “Signed networks in social media” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2010, pp. 1361–1370
  • [21] Zhuang Ma, Zongming Ma and Hongsong Yuan “Universal Latent Space Model Fitting for Large Networks with Edge Covariates.” In Journal of Machine Learning Research 21.4, 2020, pp. 1–67
  • [22] Mark Newman “Networks: An Introduction” Oxford University Press, 2010
  • [23] Bernhard Scholkopf and Alexander J Smola “Learning with kernels: support vector machines, regularization, optimization, and beyond” Adaptive Computation and Machine Learning series, 2018
  • [24] Lingxiao Wang, Xiao Zhang and Quanquan Gu “A unified computational and statistical framework for nonconvex low-rank matrix estimation” In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, 2017, pp. 981–990 PMLR
  • [25] Qinqing Zheng and John Lafferty “Convergence analysis for rectangular matrix completion using Burer-Monteiro factorization and gradient descent” In arXiv preprint arXiv:1605.07051, 2016
