
Network resampling for estimating uncertainty

Qianhua Shan and Elizaveta Levina
Department of Statistics, University of Michigan
[email protected], [email protected]
Abstract

With network data becoming ubiquitous in many applications, many models and algorithms for network analysis have been proposed. Yet methods for providing uncertainty estimates in addition to point estimates of network parameters are much less common. While bootstrap and other resampling procedures have been an effective general tool for estimating uncertainty from i.i.d. samples, adapting them to networks is highly nontrivial. In this work, we study three different network resampling procedures for uncertainty estimation, and propose a general algorithm to construct confidence intervals for network parameters through network resampling. We also propose an algorithm for selecting the sampling fraction, which has a substantial effect on performance. We find that, unsurprisingly, no one procedure is empirically best for all tasks, but that selecting an appropriate sampling fraction substantially improves performance in many cases. We illustrate this on simulated networks and on Facebook data.

1 Introduction

With network data becoming common in many fields, from gene interactions to communications networks, much research has been done on fitting various models to networks and estimating parameters of interest. In contrast, few methods are available for evaluating uncertainty associated with these estimates, which is key to statistical inference. In classical settings, when asymptotic analysis for obtaining error estimates is intractable, various resampling schemes have been successfully developed for estimating uncertainty, such as the jackknife [34], the bootstrap [10], and subsampling or the m-out-of-n bootstrap [6, 31]. Most of the classical resampling methods were developed for i.i.d. samples and are not easily transferred to the network setting, where nodes are connected by edges and have a complex dependence structure. Earlier work on bootstrap for dependent data focused on settings with a natural ordering, such as time series [30] or spatial random fields [7].

More recently, a few analogues of the bootstrap have been proposed specifically for network data. One general approach is to extract some features of the network that can reasonably be treated as i.i.d., and resample those following the classical bootstrap. For instance, under a latent space model assumption, latent node positions are assumed to be drawn i.i.d. from some unknown distribution, and the network model is based on these positions. Then, as long as the latent positions can be estimated from the given data, they can be resampled in the usual bootstrap fashion to generate new bootstrap network samples, as proposed in [23]. Another network feature that easily lends itself to this kind of approach is subgraph counts, because they are calculated from small network “patches” that can be extracted and then treated as i.i.d. for resampling purposes. This approach has been applied to estimating uncertainty in the mean degree of the graph [35], and extended to bootstrapping the distribution of node degrees [14, 15]. Related work [3] proposed estimating the variance of network count statistics by bootstrapping subgraphs isomorphic to the pattern of interest. This approach works well for network quantities that can be computed from small patches, and does not require the latent space model assumption, but it does not generalize easily to other, more global inference tasks.

A more general approach, analogous to the jackknife, was proposed in [26], which analyzed a leave-one-node-out procedure for network data. This procedure can be applied to most network statistics, and the authors show that the variance estimates are conservative in expectation under some regularity conditions. However, for a network of size N, this procedure requires computing the statistic of interest on N networks with N-1 nodes each, which can be computationally expensive.

Another class of methods works by subsampling the network, which we can think of as analogous to the m-out-of-n bootstrap, e.g., [5]. Inference based on a partially observed or subsampled network was first considered for computational reasons, when it is either not computationally feasible to conduct inference on the whole network, or only part of the network was observed in the first place. Most commonly, subsampling is implemented through node sampling, where we first select a fraction of all nodes at random and then observe the induced subgraph, but there are alternative subsampling strategies, discussed below. Examples in this line of work include subsampling under a sparse graphon model [27], estimating the number of connected components in an unknown parent graph from a subsampled subgraph [13, 19], and estimating the degree distribution of a network from a subgraph generated by node sampling or a random walk [38]. Similar methods have also been employed for cross-validation on networks, subsampling either nodes [9] or edges [25] multiple times.

Our goal in this paper is to study general subsampling schemes that can be used to estimate uncertainty in most network statistics with minimal model assumptions. We study three different subsampling procedures, discussed in detail in the next section: node sampling, row sampling, and node pair sampling. We aim to understand how the choice of subsampling procedure affects performance for different inference tasks, and how each subsampling method performs under different models. One of our main findings is that the choice of the fraction of the network to sample has a substantial effect on performance. We therefore propose a data-driven procedure to select the resampling fraction and show that it performs close to optimally.

The paper is organized as follows: In Section 2, we present the proposed resampling procedures and the data-driven method to choose the subsampling fraction. In Section 3, we compare numerical performance of different subsampling methods for different inference tasks. Section 4 presents an application of our methods to triangle density estimation in Facebook networks, and Section 5 concludes with discussion.

2 Network subsampling methods

We start by fixing notation. Let G=(V,E) be the observed network, with node set V, edge set E, and n=|V| nodes. We represent G by its n \times n adjacency matrix A, where A_{ij}=1 if there is an edge between nodes i and j, i.e., (i,j) \in E, and A_{ij}=0 otherwise. For the purposes of this paper, we focus on binary undirected networks. Generalization to weighted undirected networks should be straightforward, as the resampling mechanisms would not change. Directed networks will, generally speaking, require a different approach: in the undirected setting a single row of the adjacency matrix contains all the information about the corresponding node, whereas for directed networks this is not the case.

In classical statistics, there is a well-established standard bootstrap procedure [11]: given an i.i.d. sample \mathcal{X}=\{X_1,\dots,X_n\} from an underlying distribution F, we can construct a confidence set for a statistic of interest T(F) by estimating the sampling distribution of \hat{T}_n(\mathcal{X}) and, typically, using its quantiles to construct a confidence interval for T(F). By drawing new samples \mathcal{X}_1,\dots,\mathcal{X}_B with replacement from \mathcal{X}, or in other words sampling i.i.d. from the empirical distribution \hat{F}_n, we obtain the empirical distribution of \hat{T}_n(\mathcal{X}_b), b=1,\dots,B, which approximates the distribution of \hat{T}_n(\mathcal{X}).
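As a point of reference, the classical procedure can be sketched in a few lines of Python; the function name and the use of the sample mean as the statistic are our own illustrative choices.

```python
# A minimal sketch of the classical percentile bootstrap; illustrative only.
import numpy as np

def percentile_bootstrap_ci(x, stat=np.mean, B=2000, alpha=0.05, rng=None):
    """Percentile bootstrap confidence interval for stat(F) from an i.i.d. sample x."""
    rng = np.random.default_rng(rng)
    n = len(x)
    # Draw B resamples of size n with replacement, i.e., sample from F_hat.
    boot = np.array([stat(rng.choice(x, size=n, replace=True)) for _ in range(B)])
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi

x = np.random.default_rng(0).exponential(size=100)
print(percentile_bootstrap_ci(x))  # approximate 95% CI for the mean
```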

Thinking about applying this procedure to a network immediately raises a number of questions: what are the units of resampling, nodes or edges? What does it mean to sample a node or an edge with replacement? Sampling without replacement seems more amenable to working with the resulting induced subgraphs, and thus we turn to the m-out-of-n bootstrap, typically done without replacement, which provides asymptotically consistent estimation not only for i.i.d. but also for weakly dependent stationary samples [31]. Instead of drawing samples of size n with replacement, we now draw new samples of size m < n without replacement.

For networks, we argue that the best unit of resampling, and of comparison across procedures, is node pairs (whether connected by an edge or not). Resampling algorithms can be constructed in different ways and operate either on nodes or on node pairs, but in the end it is the number of node pairs that determines the amount of useful information available. Resampling node pairs with replacement does not really work, as it is not clear how to incorporate duplicate node pairs. Thus we focus on network resampling without replacement, similar to the m-out-of-n bootstrap approach. The fraction of resampled node pairs, denoted by q, turns out to be an important tuning parameter, and we will propose a data-driven way of choosing it. First, we describe the three resampling procedures we study.

2.1 Subsampling procedures

Next, we describe in detail the three subsampling procedures we study. While there are other subsampling strategies and some may have advantages in specific applications, we focus on these three because they are popular and general.

Figure 1: Different subsampling methods: (a) node sampling, (b) row sampling, (c) node pair sampling. In all cases, the fraction of node pairs sampled is q ≈ 0.5 (exact q = 0.5 is not always possible with only 10 nodes).

Subsampling by node sampling

This is perhaps the most common subsampling scheme, also known as p-sampling or induced subgraph sampling. A subset of nodes \tilde{V} is first sampled from the node set V through n independent Bernoulli trials with success probability p. We then observe the subgraph induced by these nodes, which consists of all edges between nodes in \tilde{V}, that is, \{(i,j) \in E : i \in \tilde{V}, j \in \tilde{V}\}. Under this scheme, the expected fraction of node pairs sampled (entries of the adjacency matrix observed) is q = p^2.

Sampling nodes is arguably the most intuitive way to subsample a network, as it is analogous to first selecting the objects of interest and then observing the relationships among them, a common way of collecting network data in the real world. The intuition behind node sampling is that the induced network should inherit the global structure of its parent network, such as communities. A variant where inference is performed on node-level statistics, such as node degree, works well if these statistics are subsampled directly with the nodes themselves rather than recomputed from the induced subgraph [20]. At the same time, since each subsampled network is on a different set of nodes, it is less clear how well this method can work for more general inference tasks. Recently, node sampling was shown to consistently estimate the distribution of some network statistics, such as triangle density [27], under a sparse graphon model satisfying some regularity conditions. But this result only holds asymptotically, and the question of how to choose p for finite samples was not considered.

Subsampling by row sampling

This subsampling method was first proposed for cross-validation on networks in [9]. In row sampling, one first chooses a subset of nodes \tilde{V} by independent Bernoulli trials with success probability p; then all edges between nodes in \tilde{V} and nodes in V are observed. This is equivalent to sampling whole rows of the adjacency matrix. This way of sampling is related to star sampling, a common model for social network data collected through surveys, where a set of nodes is sampled first and each sampled node then reports all of its connections; it is especially relevant in situations where we may not know all of the nodes in V. We assume that both 1s and 0s are reported accurately, meaning that a 0 indicates an absent edge rather than an unknown connection status. For undirected graphs with symmetric adjacency matrices, row sampling is equivalent to masking the square submatrix of the adjacency matrix corresponding to the unsampled nodes, so the proportion of node pairs observed is q = 1-(1-p)^2.

Subsampling by node pair sampling

Finally, a network can be subsampled by directly selecting a subset of node pairs to observe. This sampling method was used in [25] for cross-validation on networks, and was shown to be superior to row sampling in most cross-validation tasks considered. A subset of node pairs (i,j) is selected from V \times V through independent Bernoulli trials with success probability p, resulting in a proportion q = p of observed entries in the adjacency matrix. Again, zeros are treated as truly absent edges, and for undirected networks the sampling is adjusted to draw only from entries with i < j, filling in the symmetric matrix accordingly. In [25], noisy matrix completion was applied first to create a low-rank approximation to the full matrix before proceeding with cross-validation; in principle, we have a choice of whether to use the subsampled matrix or its completed version for downstream tasks.
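For concreteness, here is a sketch of all three schemes operating on a symmetric binary adjacency matrix. The function names and return conventions (an induced submatrix for node sampling; the full-size matrix plus an observation mask for the other two schemes) are our own choices, not fixed by the text above.

```python
# Sketches of the three subsampling schemes; A is a symmetric 0/1 matrix.
import numpy as np

def node_sampling(A, p, rng):
    """Induced subgraph on a Bernoulli(p) node subset; q = p**2."""
    keep = rng.random(A.shape[0]) < p
    return A[np.ix_(keep, keep)]

def row_sampling(A, p, rng):
    """Observe all pairs touching a Bernoulli(p) node subset; q = 1-(1-p)**2.
    Returns the masked matrix and the mask of observed node pairs."""
    keep = rng.random(A.shape[0]) < p
    mask = keep[:, None] | keep[None, :]
    return A * mask, mask

def node_pair_sampling(A, p, rng):
    """Observe each pair i < j independently with probability p; q = p."""
    n = A.shape[0]
    upper = np.triu(rng.random((n, n)) < p, k=1)
    mask = upper | upper.T
    return A * mask, mask

rng = np.random.default_rng(0)
A = np.triu((rng.random((10, 10)) < 0.3).astype(int), k=1)
A = A + A.T                                      # symmetric, zero diagonal
sub = node_sampling(A, p=np.sqrt(0.5), rng=rng)  # q ~ 0.5, as in Figure 1
```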

2.2 Constructing confidence intervals by resampling

Before we proceed to the data-driven method of choosing the settings for resampling, we summarize the algorithm we employ for constructing confidence intervals for network statistics based on a given resampling method with a given sampling fraction qq. This is a general bootstrap-style algorithm that can be applied to any network statistic TT computable on a subgraph.

Algorithm 1: constructing a bootstrap confidence interval.
Input: a graph G, a network statistic T(G), a subsampling method, a fraction q, and the number of bootstrap samples B.

1. For b = 1, \dots, B:
   (a) Randomly subsample a subgraph G_b, using the chosen method and fraction q.
   (b) Calculate the estimate \hat{T}_b = \hat{T}(G_b).
2. Sort the B estimates, \hat{T}_{(1)} \leq \dots \leq \hat{T}_{(B)}, and construct a 100(1-\alpha)\% confidence interval for T(G) as [\hat{T}_{(l)}, \hat{T}_{(u)}], where l = \lfloor (\alpha/2) B \rfloor and u = \lceil (1-\alpha/2) B \rceil.

We present this version of the bootstrap as the most commonly used one, but any other version based on the empirical cumulative distribution function (symmetrized quantiles, etc.) can be applied equally well.
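A direct sketch of Algorithm 1 in Python follows. Here subsample is assumed to be any routine taking (A, q, rng) and returning a subgraph on which T can be computed, for example one of the schemes of Section 2.1 with its p converted from the node pair fraction q (p = \sqrt{q} for node sampling, p = 1 - \sqrt{1-q} for row sampling, p = q for node pair sampling).

```python
# A sketch of Algorithm 1: percentile bootstrap interval from subsampled graphs.
import numpy as np

def bootstrap_ci(A, T, subsample, q, B=500, alpha=0.1, rng=None):
    rng = np.random.default_rng(rng)
    estimates = np.sort([T(subsample(A, q, rng)) for _ in range(B)])
    l = max(int(np.floor(alpha / 2 * B)) - 1, 0)           # 0-based index of T_(l)
    u = min(int(np.ceil((1 - alpha / 2) * B)) - 1, B - 1)  # 0-based index of T_(u)
    return estimates[l], estimates[u]
```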

2.3 Choosing the subsampling fraction q

For any method of network subsampling, we have to choose the sampling fraction; for consistency across methods, we focus on the choice of the node pair fraction q, which can be converted to p for node or row sampling. For the classical analogue, the m-out-of-n bootstrap, consistency results have been obtained for m \to \infty and m/n \to 0 as n \to \infty. These do not offer much practical guidance on how to choose m for a given finite sample size n. The situation is similar for the network bootstrap, with no practical rules for choosing q implied by consistency analysis.

The iterated bootstrap has been commonly used to calibrate bootstrap results since it was proposed and developed in the 1980s [16, 1, 17]. An especially relevant algorithm for our purposes is the double bootstrap, originally proposed to improve the coverage of bootstrap intervals [28]. A similar double bootstrap approach was proposed by [22] for the m-out-of-n bootstrap, to correct the discrepancy between nominal and actual coverage of bootstrap confidence intervals due to the different sample size m.

A generic double bootstrap algorithm proceeds as follows. Given the original i.i.d. sample \mathcal{X}=\{X_1,\dots,X_n\} and an estimator \hat{\theta} based on \mathcal{X}, consider a size-n bootstrap sample \mathcal{X}^*, drawn from \mathcal{X} with replacement, and the corresponding estimator \hat{\theta}^*. If the distribution of \hat{\theta} can be approximated by the distribution of \hat{\theta}^* conditional on \mathcal{X}, a confidence interval of nominal level 1-\alpha can be constructed as I(\alpha) = [\hat{t}_{\alpha/2}, \hat{t}_{1-\alpha/2}], with \hat{t}_\alpha defined as \sup\{t : \mathbf{P}(\hat{\theta}^* \leq t \mid \mathcal{X}) \leq \alpha\}. This confidence interval typically has correct coverage up to O(n^{-1}), and the double bootstrap improves on this by using I(\beta_\alpha) instead, where \beta_\alpha is defined as the solution to \mathbf{P}(\theta \in I(\beta_\alpha \mid \mathcal{X})) = 1-\alpha and can be approximated by \hat{\beta}_\alpha, the solution to \mathbf{P}(\hat{\theta} \in I(\hat{\beta}_\alpha \mid \mathcal{X}^*) \mid \mathcal{X}) = 1-\alpha. In practice, this is implemented by drawing multiple size-n samples \mathcal{X}^{**} from each \mathcal{X}^* with replacement, and estimating \mathbf{P}(\hat{\theta} \in I(\beta_j \mid \mathcal{X}^*) \mid \mathcal{X}) as the proportion of confidence intervals based on (1-\beta_j) percentiles of \mathcal{X}^{**} that cover \hat{\theta}, over a set of candidate \beta_j's. We then choose \hat{\beta}_\alpha = \beta_j such that \mathbf{P}(\hat{\theta} \in I(\beta_j \mid \mathcal{X}^*) \mid \mathcal{X}) is closest to 1-\alpha among all candidate \beta_j's.

Since different choices of the subsampling fraction q lead to different coverage levels, we propose an analogous double bootstrap method for network resampling. Compared to the i.i.d. case, instead of size-n resamples drawn with replacement, we generate resamples using the network subsampling methods described above, and instead of choosing an appropriate \alpha, we aim to choose an appropriate q. The algorithm is as follows:

Algorithm 2: choosing the sampling fraction q.

Input: a graph G, a network statistic T(G), a subsampling method, a set of candidate fractions q_1, \dots, q_J, and the number of bootstrap samples B.

For j = 1, \dots, J:

1. For b = 1, \dots, B:
   (a) Split all nodes randomly into two sets \mathcal{V}_1^b, \mathcal{V}_2^b.
   (b) On the subgraph induced by \mathcal{V}_1^b, calculate the estimate \hat{T}_{b,j}^{(1)}.
   (c) On the subgraph induced by \mathcal{V}_2^b, run Algorithm 1 with q = q_j to construct a confidence interval R_{b,j}.
2. Calculate the empirical coverage rate \pi_j = \frac{1}{B}\sum_{b=1}^{B}\mathbf{1}\{\hat{T}_{b,j}^{(1)} \in R_{b,j}\}.

Output: q_{\hat{j}}, where \hat{j} = \arg\min_j (\pi_j - (1-\alpha))^2.

One key difference from the i.i.d. case is that we estimate empirical coverage by checking whether the confidence interval constructed on each subsample covers \hat{T}_{b,j}^{(1)}, not \hat{T}(G), the estimate on the original observed graph G. This is because with network subsampling we cannot preserve the original sample size n, and if we are going to compare network statistics on two subsamples, we need to match the graph sizes (in this case n/2).
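A sketch of Algorithm 2, reusing the bootstrap_ci sketch above; the node split and the coverage check mirror steps 1(a)-(c) and 2, and all names are our own.

```python
# A sketch of Algorithm 2: pick q by matching empirical coverage to 1 - alpha.
import numpy as np

def choose_q(A, T, subsample, qs, B=100, alpha=0.1, rng=None):
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    coverage = []
    for q in qs:
        hits = 0
        for _ in range(B):
            perm = rng.permutation(n)
            V1, V2 = perm[: n // 2], perm[n // 2:]
            t1 = T(A[np.ix_(V1, V1)])                    # estimate on G[V1]
            lo, hi = bootstrap_ci(A[np.ix_(V2, V2)], T,  # Algorithm 1 on G[V2]
                                  subsample, q, B=B, alpha=alpha, rng=rng)
            hits += (lo <= t1 <= hi)
        coverage.append(hits / B)
    j = int(np.argmin((np.asarray(coverage) - (1 - alpha)) ** 2))
    return qs[j]
```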

3 Empirical results

We evaluate the three subsampling techniques and our method for choosing the subsampling fraction on four distinct tasks. Three focus on estimating uncertainty: (1) in the normalized triangle density, a global network summary statistic; (2) in the number of communities under the stochastic block model, a model selection parameter; and (3) in coefficients estimated by regression with network cohesion, i.e., regression parameters estimated with the use of network information. Task (4) is to choose a tuning parameter for regression with network cohesion with a lasso penalty, and to evaluate subsampling as a tuning method in this setting.

The networks for all tasks are generated from the stochastic block model (SBM). To briefly review, the SBM assigns each node i to one of K communities; the node labels c_i \in \{1,\dots,K\}, i=1,\dots,n, are sometimes viewed as randomly generated from a multinomial distribution with given probabilities for each community, but for the purposes of this evaluation we treat them as fixed, assigning a fixed number of nodes n_k, k=1,\dots,K, to each community k. Edges are then generated as independent Bernoulli variables, with success probabilities determined by the communities of their incident nodes, P(A_{ij}=1)=B_{c_i c_j}, where B is a K \times K symmetric probability matrix. We investigate the effects of varying the number of communities, the number of nodes in each community, the expected overall edge density \rho, and the ratio t of within- to between-community edge probabilities, defined below.

3.1 Normalized Triangle Density

Count statistics are functions of subgraph counts in a network, and are commonly viewed as key statistical summaries of a network [3, 4], analogous to moments for i.i.d. data. Here we examine a simple and commonly studied statistic, the density of triangles. A triangle is a set of three vertices all connected to each other. Triangle density is frequently studied because it is related to transitivity, a key property of social networks whereby two nodes with a “friend” in common are more likely to be connected themselves. We apply resampling to construct confidence intervals for the normalized triangle density of a given network.

Definition 1 (edge density).

Let G be a graph on n vertices with adjacency matrix A. The edge density is defined as

\rho = \binom{n}{2}^{-1} \sum_{i<j} A_{ij}.

Definition 2 (normalized triangle density).

Let G be a graph on n vertices with adjacency matrix A. The normalized triangle density is defined as

T = \rho^{-3} \binom{n}{3}^{-1} \sum_{i<j<k} A_{ij} A_{jk} A_{ik}.
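These definitions translate directly into code. The sketch below assumes a symmetric 0/1 adjacency matrix with zero diagonal and uses the identity that the number of triangles equals trace(A^3)/6:

```python
# Edge density (Definition 1) and normalized triangle density (Definition 2).
import numpy as np
from math import comb

def normalized_triangle_density(A):
    n = A.shape[0]
    rho = A[np.triu_indices(n, k=1)].mean()     # edge density, Definition 1
    triangles = np.trace(A @ A @ A) / 6         # number of triangles in G
    return triangles / comb(n, 3) / rho ** 3    # Definition 2
```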

We generate networks from the SBM with the number of communities K set to either 1 (no communities, the Erdős-Rényi graph) or 3, with equal community sizes. We consider two values of n, 300 and 600. The matrix of probabilities B for the SBM is defined, for all k,l = 1,\dots,K, by

B_{kl} = \begin{cases} \rho\gamma_1 & k = l, \\ \rho\gamma_2 & k \neq l. \end{cases}

The difficulty of community detection in this setting is controlled by the edge density \rho and the ratio t = \gamma_1/\gamma_2. We set t = 5 and vary the expected edge density \rho from 0.01 to 0.1, with higher \rho corresponding to more information available for community detection.
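As an illustration, one way to generate these networks, with \gamma_1 and \gamma_2 calibrated so that the expected edge density is approximately \rho for equal-sized communities (the calibration \gamma_2 = K/(t + K - 1) is our own choice, implied but not spelled out above):

```python
# A sketch of the SBM generator used in this section; assumes K divides n.
import numpy as np

def sbm(n, K, rho, t, rng=None):
    rng = np.random.default_rng(rng)
    labels = np.repeat(np.arange(K), n // K)      # fixed, equal-sized communities
    gamma2 = K / (t + K - 1)                      # calibrate E[edge density] ~ rho
    gamma1 = t * gamma2
    B = rho * np.where(np.eye(K, dtype=bool), gamma1, gamma2)
    P = B[labels][:, labels]                      # pairwise edge probabilities
    upper = np.triu(rng.random((n, n)) < P, k=1)  # independent Bernoulli edges
    A = (upper | upper.T).astype(int)
    return A, labels
```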

For a fair comparison, we match the proportion q of the adjacency matrix observed across all subsampling schemes, and vary q from 0.2 to 0.8. Figures 2 and 3 show the widths and coverage rates of the confidence intervals constructed by the three subsampling procedures over the full range of q, as well as the value of q chosen by our proposed double bootstrap procedure.

There are several general conclusions we can draw from the results in Figures 2 and 3.

(1) The confidence intervals generally get shorter as q increases, since the subsampled graphs overlap more, and this dependence reduces variance. The coverage rate correspondingly goes down as q goes up. Still, all resampling procedures achieve nominal or higher coverage for q \leq 0.5. The one exception is node pair sampling at the lowest value of q combined with the lowest edge density, where most subsampled graphs are too sparse to contain any triangles. In this case the procedure fails, producing an estimated triangle density of zero and confidence intervals with width and coverage both equal to zero.

(2) For a given value of q, a larger effective sample size, from either higher density \rho or larger network size n, leads to shorter intervals, as one would expect.

(3) Node sampling produces shorter intervals than the other two schemes, while maintaining similar coverage rates for smaller values of q. The coverage rate falls faster for node sampling at larger values of q outside the optimal range.

(4) The double bootstrap procedure selects a value of q that achieves a good balance of coverage rate and confidence interval width, maintaining close to nominal coverage without being overly conservative. It does get misled by the pathological sparse case with no triangles, which highlights the need to check that there is enough data to preserve the feature of interest (here, triangles) in the subsampled graphs before deploying a resampling algorithm.

Figure 4 shows the value of q chosen by the double bootstrap procedure as a function of graph size and edge density; within the range considered, neither appears to have much effect on the chosen value. This suggests that, as long as the structure of a network is preserved, a reasonable value of q could be chosen on a smaller subsampled network, resulting in computational savings with little loss in accuracy. We can also see that the q chosen for node sampling is always smaller than that for the other two subsampling methods, so node sampling can be more computationally efficient, giving smaller subsampled graphs.

Figure 2: Confidence interval width (left) and coverage (right) for the three subsampling schemes on Erdős-Rényi graphs, for (n, \rho) = (300, 0.01), (300, 0.05), (300, 0.10), and (600, 0.05). The value of q chosen by our algorithm is shown as a red dot.
Figure 3: Confidence interval width (left) and coverage (right) for the three subsampling schemes on SBM graphs with K = 3 equal-sized communities, for (n, \rho) = (300, 0.01), (300, 0.05), (300, 0.10), and (600, 0.05). The value of q chosen by our algorithm is shown as a red dot.
Figure 4: The optimal value of q chosen by double bootstrap, as a function of edge density \rho (left) and number of nodes n (right): (a) Erdős-Rényi, n = 300; (b) Erdős-Rényi, \rho = 0.05; (c) SBM, n = 300; (d) SBM, \rho = 0.05.

3.2 Number of Communities

Finding communities in a network often helps us understand and interpret network patterns, and the number of communities K is required as input for most community detection algorithms [2, 12]. A number of methods have been proposed to estimate K: as a function of the observed network's spectrum [21], through sequential hypothesis testing [8], or through subsampling [25, 9]. Subsampling has been shown to be one of the most reliable methods for estimating K, but little attention has been paid to choosing the subsampling fraction and understanding its effects. Here we investigate how the three subsampling schemes and the choice of subsampling fraction perform on the task of quantifying uncertainty in K, going beyond the earlier focus on point estimation. Again, we match the proportion q of the adjacency matrix observed across all subsampling schemes.

For this task, we generate networks from an SBM with K = 3 equal-sized communities. We vary the number of nodes n, the expected edge density \rho, and the ratio t of within- to between-community edge probabilities, which together control the problem difficulty.

For subgraphs generated by node sampling and row sampling, we estimate the number of communities using the spectral method based on the Bethe-Hessian matrix [21, 33]; a sketch of this estimator is given below. For a discrete parameter like K, confidence intervals make little sense, so we look at the bootstrap distributions instead. These distributions are not symmetric, and rarely overestimate the number of communities. For subgraphs generated by node pair sampling, there is no complete adjacency matrix on which to compute the Bethe-Hessian, and computing it with missing values replaced by zeros would clearly introduce additional noise. Instead, we estimate the number of communities from node pair samples using the network cross-validation approach of [25]: we first apply low-rank matrix completion with candidate ranks K from 1 to 6, and choose the K that maximizes the AUC score for reconstructing the missing edges.
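For illustration, a minimal sketch of the Bethe-Hessian estimate of K: with H(r) = (r^2 - 1)I - rA + D, the number of communities is estimated by the number of negative eigenvalues of H(r). We take r to be the square root of the average degree, a common default; the exact choice of r in [21, 33] differs, so this is a simplified stand-in rather than a reimplementation of their method.

```python
# A hedged sketch of estimating K from the Bethe-Hessian spectrum.
import numpy as np

def estimate_K_bethe_hessian(A):
    n = A.shape[0]
    d = A.sum(axis=1)
    r = np.sqrt(d.mean())                          # a common choice of r
    H = (r ** 2 - 1) * np.eye(n) - r * A + np.diag(d)
    eigvals = np.linalg.eigvalsh(H)                # H is symmetric
    return int((eigvals < 0).sum())                # K_hat = # negative eigenvalues
```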

Figures 5 and 6 show the results under the different subsampling schemes. Since K is discrete, instead of considering coverage rate, we compare methods by the proportion of correctly estimated K. It is clear that whenever the community estimation problem itself is hard (the lowest values of \rho, t, and n), the number of communities is also hard to estimate. In these hardest cases, K tends to be severely underestimated by node and row sampling and is highly variable for node pair sampling, and there is no good choice of q. In all other scenarios, our proposed algorithm does choose a good value of q, balancing the bias and variability of \hat{K} by maintaining empirical coverage above 90% while providing a set of candidate rank estimates.

Figure 5: Simulation results for SBMs with K = 3 and n = 300; panels correspond to \rho \in \{0.05, 0.1\} and t \in \{2, 3, 4, 5\}.
Figure 6: Simulation results for SBMs with K = 3 and n = 600; panels correspond to \rho \in \{0.05, 0.1\} and t \in \{2, 3, 4, 5\}.

3.3 Regression with node cohesion

In many network applications, we also observe covariates associated with the nodes. For example, in a school friendship network survey, one may also have the students' demographic information, grades, family background, and so on [18]. In such settings, the network itself is often used as an aid in answering the main question of interest, rather than as the primary object of analysis. For instance, a network cohesion penalty was proposed in [24] as a tool for improving linear regression when observations are connected by a network. The authors consider the predictive model

Y = \alpha + X\beta + \epsilon,

where X \in \mathbb{R}^{n \times p}, Y \in \mathbb{R}^n, and \beta \in \mathbb{R}^p are the usual matrix of predictors, response vector, and regression coefficients, respectively, and \alpha \in \mathbb{R}^n is a vector of individual node effects associated with the network. They propose fitting this model by minimizing the objective function

\|Y - \alpha - X\beta\|_2^2 + \lambda_1 \alpha^T L \alpha, \qquad (1)

where \lambda_1 is a tuning parameter, and L = D - A is the Laplacian of the network, with D = diag(d_1, \dots, d_n) the degree matrix. The Laplacian-based network cohesion penalty has the effect of shrinking the individual effects of connected nodes towards each other, ensuring more similar behavior for neighbors and improving prediction [24].
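As a sketch, the joint minimizer of (1) over (\alpha, \beta) can be computed in closed form by stacking the design matrix [I_n, X] and solving the penalized normal equations. This is our own plain least-squares sketch of the objective, not the implementation of [24].

```python
# Fit the network cohesion regression (1) by a single penalized linear solve.
import numpy as np

def fit_cohesion(Y, X, A, lam1):
    n, p = X.shape
    L = np.diag(A.sum(axis=1)) - A               # graph Laplacian L = D - A
    M = np.hstack([np.eye(n), X])                # design for theta = (alpha, beta)
    Omega = np.zeros((n + p, n + p))
    Omega[:n, :n] = lam1 * L                     # penalize alpha only
    theta = np.linalg.lstsq(M.T @ M + Omega, M.T @ Y, rcond=None)[0]
    return theta[:n], theta[n:]                  # alpha_hat, beta_hat
```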

This method provides a point estimate of the coefficients \beta, but no measure of uncertainty, which makes interpretation difficult. Our subsampling methods can add a measure of uncertainty to the point estimates \hat{\beta}. Since different subsampling methods lead to different numbers of remaining nodes and thus different sample sizes for regression, we only replace the penalty term in (1) with \lambda_1 \alpha^T \tilde{L} \alpha, where \tilde{L} is the Laplacian of the subsampled graph, and retain the full size-n sample of Y and X for the regression, in order to isolate the uncertainty associated with the network penalty. Alternatively, one could resample (X, Y) pairs together with the network to assess overall uncertainty. If we view the observed graph A as a realization from a distribution F, then the subsampled graphs under the different subsampling methods can also be considered smaller samples from the same distribution F, and estimates based on these smaller samples should still be consistent.

In addition to the three subsampling methods, we also include a naive bootstrap method. Let A be the original adjacency matrix with n nodes. Sample n nodes with replacement, obtaining indices n_1, n_2, \dots, n_n, and create a new adjacency matrix \tilde{A} by setting \tilde{A}_{ij} = 1 if A_{n_i n_j} = 1 or n_i = n_j, and \tilde{A}_{ij} = 0 otherwise. Note that with this method we cannot isolate the network as a source of uncertainty, so the uncertainty assessment will include the usual bootstrap sampling variability of the regression coefficients.
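A sketch of this naive bootstrap follows (we additionally zero the diagonal of \tilde{A}, a convention not spelled out above):

```python
# Naive node bootstrap: sample nodes with replacement; duplicates get an edge.
import numpy as np

def naive_bootstrap_adjacency(A, rng=None):
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    idx = rng.integers(n, size=n)                 # n_1, ..., n_n with replacement
    A_tilde = A[np.ix_(idx, idx)] | (idx[:, None] == idx[None, :])
    A_tilde = A_tilde.astype(int)
    np.fill_diagonal(A_tilde, 0)                  # keep zero diagonal
    return A_tilde
```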

We generate the adjacency matrix A from an SBM with K = 3 communities of 200 nodes each. The individual effects \alpha_i are generated from the normal distribution N(c_k, \sigma_\alpha^2), where c_k is -1, 0, or 1 for nodes from communities 1, 2, and 3, respectively. We vary the ratio t of within- to between-community edge probabilities and the standard deviation \sigma_\alpha to obtain different signal-to-noise ratios. The regression coefficients \beta_j are drawn independently from N(1,1), and each response y_i is drawn from N(\alpha_i + \beta^T x_i, 1). Figures 7, 8, and 9 show the results of this simulation.

Once again, as q increases the confidence intervals get narrower, as the subsampled graphs become more similar to the full network. However, for node and row sampling the coverage is also lower at the smallest value q = 0.1: the variance of the bootstrap estimates of \beta is larger for smaller q, and the center of the constructed confidence interval moves further from the true \beta. This may be because only a small fraction of nodes contribute to the penalty term. Node pair sampling behaves differently: it gives better performance in terms of mean squared error for nearly all choices of q, but since its confidence intervals are much narrower than those of node and row sampling, their coverage rate is very poor. All subsampling methods are anti-conservative and exceed the nominal coverage rate only for small q. One possible explanation is that in a regression setting, the impact of the node covariates dominates any potential impact of the network. In all cases, Algorithm 2 chooses q = 0.1, which is either the only choice of q that exceeds the nominal coverage rate or the choice whose coverage rate is closest to nominal.

Figure 7: Resampling performance as a function of \sigma_\alpha \in \{0, 0.1, 0.5\}, with \rho = 0.2 and t = 10. Coverage rate (left), widths of confidence intervals along the longest and shortest directions (middle), and mean squared error of the mean of the resampling estimates of \beta (right).
Figure 8: Resampling performance as a function of t \in \{2, 5, 10\}, with \rho = 0.2 and \sigma_\alpha = 0.1. From left to right: coverage rate, widths of the confidence interval along the longest and shortest directions, and mean squared error of the mean of the resampling estimates of \beta.
Figure 9: Resampling performance as a function of \rho \in \{0.05, 0.1, 0.2\}, with t = 10 and \sigma_\alpha = 0.1. From left to right: coverage rate, widths of the confidence interval along the longest and shortest directions, and mean squared error of the mean of the resampling estimates of \beta.

3.4 Regression with node cohesion and variable selection

We now consider the case where the dimension of the node covariates is larger. The lasso [36] is a popular variable selection technique for high-dimensional i.i.d. data. By minimizing

\|Y - X\beta\|_2^2 + \lambda_2 \|\beta\|_1,

the resulting estimator \hat{\beta} has some entries shrunk to exactly zero, and the set S = \{j : \hat{\beta}_j \neq 0\} is known as the active set. However, the goodness of fit and the complexity of the model need to be properly balanced by tuning \lambda_2, and model selection by cross-validation is not always stable. To address this, [29] proposed stability selection, which uses subsampling to provide variable selection that is less sensitive to the choice of \lambda.

Here we propose a similar procedure for network data. Given an observed graph G, we first generate B subsampled copies g_b using one of the resampling schemes. Then we minimize the objective function

\|Y - \alpha - X\beta\|_2^2 + \lambda_1 \alpha^T L \alpha + \lambda_2 \|\beta\|_1,

on each g_b to construct an active set S_b, and for each predictor 1 \leq j \leq p calculate its selection probability \frac{1}{B}\sum_b \mathbf{1}\{j \in S_b\}. A sketch of this procedure is given below.
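We fit the objective by block coordinate descent: the \alpha-step is the linear solve (I + \lambda_1 L)\alpha = Y - X\beta, and the \beta-step is a lasso on the residual, for which we use scikit-learn (whose lasso objective is scaled by 1/(2n), hence the \lambda_2/(2n) conversion). The subsampling helper subsample_nodes, which returns the retained node indices, is hypothetical; here we subsample the corresponding rows of (X, Y) along with the graph, one of the choices discussed in Section 3.3.

```python
# A hedged sketch of stability selection with the cohesion + lasso objective.
import numpy as np
from sklearn.linear_model import Lasso

def fit_cohesion_lasso(Y, X, A, lam1, lam2, n_iter=20):
    n, p = X.shape
    L = np.diag(A.sum(axis=1)) - A
    M = np.eye(n) + lam1 * L                       # alpha-step system matrix
    lasso = Lasso(alpha=lam2 / (2 * n), fit_intercept=False, max_iter=5000)
    beta = np.zeros(p)
    for _ in range(n_iter):                        # block coordinate descent
        alpha = np.linalg.solve(M, Y - X @ beta)   # minimize over alpha
        beta = lasso.fit(X, Y - alpha).coef_       # minimize over beta
    return beta

def selection_probabilities(Y, X, A, subsample_nodes, lam1, lam2, B=50, rng=None):
    rng = np.random.default_rng(rng)
    counts = np.zeros(X.shape[1])
    for _ in range(B):
        keep = subsample_nodes(A, rng)             # hypothetical: retained node indices
        counts += fit_cohesion_lasso(Y[keep], X[keep], A[np.ix_(keep, keep)],
                                     lam1, lam2) != 0
    return counts / B                              # per-predictor selection probability
```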

The simulation setup is as follows. A graph with K = 3 communities of 200 nodes each is generated from an SBM. Again, we vary the ratio t of within- to between-community edge probabilities and the edge density of the graph. The individual effects \alpha_i are generated from N(c_k, \sigma_\alpha^2), where c_k is -1, 0, or 1 for nodes from the three communities, respectively. Out of 100 coefficients, 25 \beta_j's are drawn independently from N(1,1), while the remaining 75 are set to 0. Each y_i is then drawn from N(\alpha_i + \beta^T x_i, 1). The results are shown in Figures 10, 11, and 12.

The three resampling schemes perform differently in terms of AUC as the subsampling proportion q varies: node sampling is not sensitive to the choice of q, while row sampling improves and node pair sampling worsens as q increases.

Similar to what we observed in the previous section, although the subsampling step does introduce uncertainty into the inference, in a regression setting the node covariates often play a more important role, and the performance of resampling only the network structure is not as good as one might expect.

Figure 10: Resampling performance as a function of \sigma_\alpha \in \{0, 0.1, 0.5\}, with fixed \rho = 0.2 and t = 10.
Figure 11: Resampling performance as a function of t \in \{2, 5, 10\}, with fixed \rho = 0.2 and \sigma_\alpha = 0.1.
Figure 12: Resampling performance as a function of \rho \in \{0.05, 0.1, 0.2\}, with fixed \sigma_\alpha = 0.1 and t = 10.

4 A data analysis: triangle density in Facebook networks

The data we study are Facebook networks of 95 American colleges and universities from a single-day snapshot taken in September 2005 [32, 37]. The sizes of these college networks range from about 800 to 30,000 nodes, and their edge densities range from 0.0025 to 0.06, as shown in Figure 13. Questions of interest include the distribution of normalized triangle density, whether there are significant differences between colleges, and relationships with other important network characteristics, such as size or edge density. Answering these and similar questions requires uncertainty estimates in addition to point estimates. We construct these uncertainty estimates by applying Algorithm 1 to obtain 90% confidence intervals, with the subsampling fraction q chosen by Algorithm 2. The results are shown in Figures 14 and 15.

There are several conclusions to draw from Figure 14. First, larger networks have higher normalized triangle density, despite the normalization, and the confidence intervals are short enough to make these differences highly significant. Similarly, denser networks have lower normalized triangle density. This may be because the edge density of these social networks goes down as the size of the network goes up, as shown in Figure 13, since the number of connections of each individual node grows much more slowly than the network size. Generally, the intervals from all three schemes have similar centers, and those obtained by node pair sampling are somewhat shorter than those constructed with node and row sampling. Algorithm 2 generally chooses small values of the subsampling proportion q; Table 1 shows the percentage of each q chosen across the 95 graphs.

q=0.1 q=0.2
node 81.1% 18.9%
row 54.7% 45.3%
pair 96.8% 3.2%
Table 1: Percentage of each value of q chosen across the 95 college networks, for each subsampling method.

It is evident from Figure 14 that there is a strong positive correlation between normalized triangle density and the number of nodes, so it is instructive to look at triangle density while controlling for network size. Figure 15 shows the residuals from regressing normalized triangle density on n. The residuals are mostly centered around zero, with larger variance as n grows, suggesting there is no longer a dependence on n. However, colleges such as UCF, USF, and UC do have larger normalized triangle density than colleges of similar size, as seen in both Figures 14 and 15. A similar conclusion can be drawn for edge density.

Figure 13: Network size and edge density of the 95 college Facebook networks.
Figure 14: Confidence intervals of normalized triangle density constructed for the 95 college Facebook networks.
Figure 15: Normalized triangle density residuals after regressing on the number of nodes and on log edge density, respectively.

While the example we gave is a simple exploratory analysis, it demonstrates the possibilities of much richer network data analysis when point estimates of network statistics are accompanied by uncertainty estimates.

5 Discussion

The three subsampling methods we studied (node, row, and node pair sampling) are all reasonably good at achieving nominal coverage if the sampling proportion q is selected appropriately, both for estimating uncertainty in network summary statistics and in estimated parameters of a model fitted to network data. While there is no universally best resampling scheme for all tasks, node sampling is a safe choice for most, especially compared to row sampling. The double bootstrap algorithm we proposed for choosing q works well empirically over a range of settings, especially for network statistics that are normalized with respect to the size or the edge density of the graph. Our algorithm is written explicitly for confidence intervals, and would need to be modified for use with other losses, for instance prediction intervals, but such a modification would be straightforward.

An obvious question arises about theoretical guarantees of coverage for bootstrap confidence intervals. We have intentionally made no assumptions on the underlying network model when developing these bootstrap methods. Proving any such guarantee requires assuming an underlying probability model that generated the network. If such a model is assumed, then, for all common network models, there will be an alternative model-based bootstrap approach tailor-made for that model. For example, for latent variable models where network edge probabilities are functions of unobserved latent variables in \mathbb{R}^d, a bootstrap approach based on estimating the latent variables, resampling those estimates, and generating networks from them has been shown to have good asymptotic properties [23]. Similarly, specialized methods for network moments have been proposed in [3, 27]. There is an inevitable tradeoff between keeping a method assumption-free and establishing theoretical guarantees. In this computational paper, we have focused on the algorithms themselves, and leave establishing guarantees for special cases to future work.

References

  • [1] R. Beran. Prepivoting to reduce level error of confidence sets. Biometrika, 74(3):457–468, 1987.
  • [2] S. Bhattacharyya and P. J. Bickel. Community detection in networks using graph distance. arXiv preprint arXiv:1401.3915, 2014.
  • [3] S. Bhattacharyya and P. J. Bickel. Subsampling bootstrap of count features of networks. The Annals of Statistics, 43(6):2384 – 2411, 2015.
  • [4] P. J. Bickel, A. Chen, and E. Levina. The method of moments and degree distributions for network models. The Annals of Statistics, 39(5):2280 – 2301, 2011.
  • [5] P. J. Bickel and D. A. Freedman. Some Asymptotic Theory for the Bootstrap. The Annals of Statistics, 9(6):1196 – 1217, 1981.
  • [6] P. J. Bickel, F. Götze, and W. R. van Zwet. Resampling fewer than n observations: Gains, losses, and remedies for losses. Statistica Sinica, 7(1):1–31, 1997.
  • [7] P. J. Bickel and E. Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577 – 2604, 2008.
  • [8] P. J. Bickel and P. Sarkar. Hypothesis testing for automated community detection in networks. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 78(1):253–273, 2016.
  • [9] K. Chen and J. Lei. Network cross-validation for determining the number of communities in network data. Journal of the American Statistical Association, 113(521):241–251, 2018.
  • [10] B. Efron and R. Tibshirani. Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1(1):54 – 75, 1986.
  • [11] B. Efron and R. J. Tibshirani. An introduction to the bootstrap. CRC press, 1994.
  • [12] S. Fortunato. Community detection in graphs. Physics reports, 486(3-5):75–174, 2010.
  • [13] O. Frank. Estimation of the Number of Connected Components in a Graph by Using a Sampled Subgraph. Scandinavian Journal of Statistics, 5:177–188, 1978.
  • [14] Y. R. Gel, V. Lyubchich, and L. L. R. Ramirez. Bootstrap quantification of estimation uncertainties in network degree distributions. 7(1):1–12, 2017.
  • [15] A. Green and C. R. Shalizi. Bootstrapping exchangeable random graphs. arXiv preprint arXiv:1711.00813, 2017.
  • [16] P. Hall. On the Bootstrap and Confidence Intervals. The Annals of Statistics, 14(4):1431 – 1452, 1986.
  • [17] P. Hall and M. A. Martin. On bootstrap resampling and iteration. Biometrika, 75(4):661–671, 1988.
  • [18] K. M. Harris. The national longitudinal study of adolescent to adult health (add health), waves i & ii, 1994–1996; wave iii, 2001–2002; wave iv, 2007-2009; wave v, 2016-2018 [machine-readable data file and documentation]. 2018.
  • [19] J. M. Klusowski and Y. Wu. Estimating the number of connected components in a graph via subgraph sampling. Bernoulli, 26(3):1635–1664, Aug. 2020.
  • [20] E. D. Kolaczyk. Sampling and Estimation in Network Graphs, pages 1–30. Springer New York, New York, NY, 2009.
  • [21] C. M. Le and E. Levina. Estimating the number of communities by spectral methods. Electronic Journal of Statistics, 16(1):3315 – 3342, 2022.
  • [22] S. M. S. Lee. On a Class of m out of n Bootstrap Confidence Intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 61(4):901–911, 1999.
  • [23] K. Levin and E. Levina. Bootstrapping networks with latent space structure, 2019.
  • [24] T. Li, E. Levina, and J. Zhu. Prediction models for network-linked data. The Annals of Applied Statistics, 13(1):132 – 164, 2019.
  • [25] T. Li, E. Levina, and J. Zhu. Network cross-validation by edge sampling. Biometrika, 107(2):257–276, 2020.
  • [26] Q. Lin, R. Lunde, and P. Sarkar. On the theoretical properties of the network jackknife. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 6105–6115. PMLR, 13–18 Jul 2020.
  • [27] R. Lunde and P. Sarkar. Subsampling sparse graphons under minimal assumptions. arXiv preprint arXiv:1907.12528, 2019.
  • [28] M. A. Martin. On Bootstrap Iteration for Coverage Correction in Confidence Intervals. Journal of the American Statistical Association, 85(412):1105–1118, 1990.
  • [29] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
  • [30] D. N. Politis and J. P. Romano. The stationary bootstrap. Journal of the American Statistical Association, 89(428):1303–1313, 1994.
  • [31] D. N. Politis, J. P. Romano, and M. Wolf. Subsampling. Springer Series in Statistics. Springer New York, New York, NY, 1999.
  • [32] R. A. Rossi and N. K. Ahmed. The network data repository with interactive graph analytics and visualization. 2015.
  • [33] A. Saade, F. Krzakala, and L. Zdeborová. Matrix Completion from Fewer Entries: Spectral Detectability and Rank Estimation. arXiv:1506.03498 [cond-mat, stat], Jan. 2016.
  • [34] J. Shao and C. F. J. Wu. A General Theory for Jackknife Variance Estimation. The Annals of Statistics, 17(3):1176 – 1197, 1989.
  • [35] M. E. Thompson, L. L. Ramirez Ramirez, V. Lyubchich, and Y. R. Gel. Using the bootstrap for statistical inference on random graphs. Canadian Journal of Statistics, 44(1):3–24, 2016.
  • [36] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 58(1):267–288, 1996.
  • [37] A. L. Traud, P. J. Mucha, and M. A. Porter. Social structure of Facebook networks. Physica A: Statistical Mechanics and its Applications, 391(16):4165–4180, Aug. 2012.
  • [38] Y. Zhang, E. D. Kolaczyk, B. D. Spencer, et al. Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks. The Annals of Applied Statistics, 9(1):166–199, 2015.