
Diffusion Source Identification on General Networks with Statistical Confidence

Quinlan Dawkins, Tianxi Li, Haifeng Xu
University of Virginia

Abstract

Diffusion source identification on networks is a problem of fundamental importance in a broad class of applications, including rumor control and virus identification. Though this problem has received significant recent attention, most studies have focused only on very restrictive settings and lack theoretical guarantees for more realistic networks. We introduce a statistical framework for the study of diffusion source identification and develop a confidence set inference approach inspired by hypothesis testing. Our method efficiently produces a small subset of nodes that provably covers the source node at any pre-specified confidence level, without restrictive assumptions on the network structure. Moreover, we propose multiple Monte Carlo strategies for the inference procedure, based on network topology and probabilistic properties of the diffusion process, that significantly improve scalability. To our knowledge, this is the first diffusion source identification method with a practically useful theoretical guarantee on general networks. We demonstrate our approach via extensive synthetic experiments on well-known random network models and on a mobility network between cities concerning the spread of COVID-19.

1 Introduction

One pressing problem today is the spread of misinformation and malicious attacks or viruses in various cyberspaces. For example, rumors and fake news on social networks may result in serious political, economic, and social issues (Vosoughi et al., 2018). Viruses that spread via emails and computer communication may cause severe privacy and data-leakage problems (Newman et al., 2002; Halperin & Almogy, 2002; Xu & Ren, 2016). In such events, the negative impacts stem from a few source users or locations and then spread over the network via a diffusion process. One crucial step to reduce the loss from such an event is to quickly identify the sources so that counter-measures can be taken in a timely fashion.

Though early attempts have been made on this important problem with motivations from various domains, systematic research only began recently, arguably starting from the seminal work of Shah & Zaman (2011), which proposed a rumor center estimator that can be located by an efficient message-passing algorithm with linear time complexity. Despite the significant interest and progress on this problem in recent years (Shah & Zaman, 2012; Dong et al., 2013; Khim & Loh, 2016; Bubeck et al., 2017; Yu et al., 2018; Crane & Xu, 2020), many challenges remain unaddressed. First, the theoretical understanding of these methods is currently available only under very restrictive and somewhat unrealistic structural assumptions on the networks, such as regular trees. This is perhaps partially explained by the well-known computational hardness of probabilistic inference for diffusion processes on general graphs (Shapiro & Delgado-Eckert, 2012). Therefore, intuitive approximations have been used for general networks (Nguyen et al., 2016; Kazemitabar & Amini, 2020). However, such methods lack theoretical guarantees. Second, even for regular trees, the available performance guarantees are far from useful in practice. Even in the most idealized situation of infinite regular trees, the probability that the rumor center is correct is almost always below 0.3 (Shah & Zaman, 2011; Dong et al., 2013; Yu et al., 2018). For general graphs, as we show later, the success rate of such single-point estimation methods is simply too low to be practical.

To guarantee a higher success probability, a typical approach, as in both machine learning theory (Valiant, 1984) and data-driven applied models (LeCun et al., 2015), is to obtain more data. However, a fundamental challenge in diffusion source identification (DSI) is that the problem by nature has only one snapshot of the network information, i.e., the earliest observation of the infection status of the network. (Since infected nodes are usually indistinguishable and equally infectious, any additional information in later observations only tells us which new or additional nodes are infected, and does not help infer the source node.) Therefore, compared to classic learning tasks, DSI poses a fundamentally different challenge for inference. It is this crucial understanding that motivates our adoption of a different statistical inference technique, the confidence set. Previous systematic statistical studies adopted the confidence set approach for DSI on trees (Bubeck et al., 2017; Khim & Loh, 2016; Crane & Xu, 2020). Though they enjoy good theoretical properties, these methods are applicable only to infinite trees.

This paper aims to bridge the gap between practically useful algorithms and theoretical guarantees for the DSI problem. We introduce a new statistical inference framework that provably includes many previous methods (Shah & Zaman, 2011; Nguyen et al., 2016) as special cases. Our new framework not only highlights the drawbacks of the previous methods but, more importantly, also leads to the design of our confidence set inference approach with a finite-sample theoretical guarantee on any network structure.

As a demonstration, consider the example of the spread of COVID-19 in early 2020. Figure 1 shows a travel mobility network between 49 major cities in China, constructed from the two-week travel volume (Lab, 2020; Hu et al., 2020) before the virus caught wide attention. The square nodes (21 out of 49) are all cities with at least five confirmed cases of the virus on Jan 24, 2020. The DSI problem is: given only knowledge of the mobility network and which cities have detected a notable number of confirmed cases (in this case, at least 5), can we identify in which city the virus was first detected?

Figure 1: The mobility network and the COVID-19 infection status of major Chinese cities on Jan 24, 2020. Colored square nodes are cities with at least five confirmed cases.

This problem turns out to be too difficult for precise identification. None of the single-point source identification methods under evaluation can successfully identify Wuhan, due to its relatively non-central position in the network (details in Section 5). Nevertheless, both our 80% and 90% confidence sets cover Wuhan correctly, giving recommendations of 6 and 11 nodes (out of 49 cities), respectively. In fact, evaluation over the whole week after the lockdown of Wuhan reveals that both confidence sets correctly cover Wuhan on all seven days, while the single-point estimation methods are rarely effective. Such a result clearly shows the necessity of adopting the confidence set approach and the effectiveness of our solution. Our contributions in this paper are threefold.

  1.

    We introduce an innovative statistical framework for the DSI problem. It includes several previous methods as special cases, but has the potential for more effective inference.

  2.

    Under our framework, we propose a general way to construct the source node confidence set, whose validity is guaranteed for finite sample sizes and arbitrary network structures. To the best of our knowledge, it is the first DSI method with a theoretical performance guarantee on general networks.

  3.

    We propose techniques that dramatically improve the computational efficiency of our inference algorithm. En route, we develop a generalized importance sampling method, which may be of independent interest.

A high-level message of this paper is that the confidence set approach, which has not received adequate attention in the machine learning literature, can be an important tool for inference tasks, especially for challenging problems with limited available data.

2 Preliminaries

We start by formalizing the Diffusion Source Identification (DSI) problem, introduced in the seminal work of Shah & Zaman (2011). Consider a network G with node set V=\{1,\cdots,n\} and edge set E. For ease of presentation, we focus on unweighted and undirected networks, but it is straightforward to generalize the model and our framework to weighted networks. We write (u,v)\in E if nodes u and v are connected. The network can be equivalently represented by its n\times n binary adjacency matrix A, where A_{uv}=A_{vu}=1 if and only if (u,v)\in E.

There is a source node s^{*}\in V on the network G initiating a diffusion of a certain effect (rumor, fake news, or a virus) over the network. We embed our inference of the diffusion procedure in the widely-adopted “Susceptible-Infected” (SI) model (Anderson & May, 1992; Shah & Zaman, 2011), though our approach can be easily tailored to other diffusion procedures as well. In the SI model, the source node s^{*} is the only “infected” node initially. The infection diffuses as follows: given the set of currently infected nodes after t-1 infections, the next infection happens by sampling uniformly at random one of the edges connecting an infected node and a susceptible node. Consequently, a full diffusion path with T infections can be represented by a sequence of T+1 nodes in the infection order. We define the diffusion path space to be

\mathcal{Z}_{T}=\{\boldsymbol{v}=\{s^{*}=v_{0},v_{1},\cdots,v_{T}\}:v_{t}\in V,\ v_{t_{1}}\neq v_{t_{2}}\text{ if }t_{1}\neq t_{2},\text{ and }(v_{t},v_{t^{\prime}})\in E\text{ for some }t<t^{\prime}\}
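For concreteness, the SI sampling step described above can be sketched in a few lines; the adjacency-dict representation and the function name `sample_si_path` are our own illustrative choices, not part of the paper's formal development.

```python
import random

def sample_si_path(adj, source, T, rng=random):
    """Sample one SI diffusion path with T infections starting from `source`.

    adj: dict mapping each node to a list of neighbors (undirected graph).
    Returns [source, v_1, ..., v_T] in infection order. At each step, one
    edge between an infected and a susceptible node is drawn uniformly at
    random and its susceptible endpoint becomes infected.
    """
    infected = {source}
    path = [source]
    for _ in range(T):
        # All (infected, susceptible) boundary edges at this step.
        frontier = [(u, v) for u in infected for v in adj[u] if v not in infected]
        if not frontier:  # the infected set has no susceptible neighbors left
            break
        _, v = rng.choice(frontier)
        infected.add(v)
        path.append(v)
    return path
```

On a path graph the draw is forced at every step, e.g. `sample_si_path({0: [1], 1: [0, 2], 2: [1]}, 0, 2)` returns `[0, 1, 2]`.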

However, in practice, when the occurrence of the infection is noticed, we have already lost the information about the diffusion path. Instead, the available data only contain the snapshot of the current infection status on the network, without the infection order. Formally, the data can be represented as an n-dimensional binary vector y with y_{i}=\mathbf{1}(i\text{ is infected})\in\{0,1\}, where \mathbf{1} is the standard indicator function. Therefore, the sample space of the DSI problem can be defined as

\mathcal{Y}_{T}=\{y\in\{0,1\}^{n}:\|y\|_{1}=T,\text{ such that }\{i:y_{i}=1\}\text{ induces a connected subgraph of }G\}.

Equivalently, we will also think of any y\in\mathcal{Y}_{T} as an infected subset of nodes V_{I}\subset V of size T. The DSI problem can then be defined as follows.

Definition 1 (Diffusion Source Identification).

Given one sample y\in\mathcal{Y}_{T}, identify the source node s^{*} of the diffusion process that generated y.

Challenges. The challenge of DSI intrinsically arises from the loss of information in the observed data. Specifically, by definition, we have a many-to-one mapping \zeta:\mathcal{Z}_{T}\to\mathcal{Y}_{T}, such that \zeta(\cdot) maps a diffusion path to the corresponding infection snapshot of the network. Information about the infection order has been lost upon the observation of data y. Nevertheless, the DSI problem seeks to identify the first node in the infection order, with access to only one snapshot of the infection status. Note that obtaining multiple snapshots over time does not reduce the difficulty of DSI. This is because, given the current snapshot, later observed data carry no additional information about the source node due to the Markov property of the SI model.

3 A General Statistical Framework for DSI with Confidence Guarantees

3.1 DSI as Parameter Estimation

We start by formulating DSI under a systematic statistical framework, which will help in our design of better inference methods later on. Treating the network G as fixed and s^{*} as the model parameter, the probability of generating data y\in\mathcal{Y}_{T} can be represented by \mathbb{P}_{s^{*}}(Y=y)=p(y|s^{*}), where the random variable Y denotes the observed data. The identification of s^{*} can then be treated as a parameter estimation problem. Specifically, we consider the following general parameter estimation framework. Given any discrepancy function \ell:\mathcal{Y}_{T}\times\mathcal{Z}_{T}\to[0,\infty), we want to find an estimator of s^{*} based on the following optimization problem:

\text{minimize}_{s}~~\mathbb{E}_{s}\ell(y,Z) \qquad (1)

in which Z\in\mathcal{Z}_{T} is the random diffusion path following the SI model starting from parameter s, and \mathbb{E}_{s} denotes the expectation over Z. That is, we select the s whose generated random diffusion path Z has the minimum expected discrepancy from the observed data y.

Remark 1.

An important design choice here is that the discrepancy function \ell is defined on \mathcal{Y}_{T}\times\mathcal{Z}_{T}, not on \mathcal{Y}_{T}\times\mathcal{Y}_{T}. That is, y will be compared with the random diffusion path, not merely with the snapshot induced by the path. This is because Z contains richer information about the diffusion process. As we show later, this turns out to be crucial for designing effective discrepancy functions.

Notice that our framework includes a few previous methods as special cases. Due to the space limit, all formal proofs in this paper are deferred to the Appendix. Instead, intuition and explanations are provided as needed.

Proposition 1.
  1.

    If \ell_{rc}(y,z)=1-\mathbf{1}(y=\zeta(z)), then when the network is an infinite regular tree, procedure (1) gives the rumor center of Shah & Zaman (2011).

  2.

    If \ell_{se}(y,z)=\|y-\zeta(z)\|_{2}^{2}, the squared Euclidean distance between y and \zeta(z), the discrepancy is equivalent to the symmetric difference in Nguyen et al. (2016). (However, different from our framework, Nguyen et al. (2016) used an approximation to this discrepancy for DSI.)

Proposition 1 also reveals some key drawbacks of the rumor center method and its variants. First, the discrepancy function \ell_{rc} only takes two values, and it treats all configurations z with \zeta(z)\neq y equally. Therefore, such a function may not be sufficiently sensitive for general networks. From this perspective, \ell_{se} is potentially better. Second, and more importantly, both of the above discrepancy functions depend only on \zeta(z), failing to leverage the diffusion order of z. Ignoring such information may also undermine the performance. To overcome these drawbacks, we propose the following family of discrepancy functions as a better alternative. We call this family the canonical family of discrepancy functions.

Definition 2 (Canonical Discrepancy Functions).

Consider a class of discrepancy functions \ell that can be written in the following form

\ell(y,z)=-\sum_{v:y_{v}=1}\mathbf{1}(v\in z)h(t_{z}(v)), \qquad (2)

in which t_{z}(v) is the infection order of node v in path z and h is a non-increasing weight function. When v\notin z, we define t_{z}(v)=\infty.

The canonical form (2) is essentially a negative similarity function. It incorporates both the infection status and the infection order of z. The weight function h incorporates the diffusion order such that if z deviates from y at an early stage, the deviation is treated as a stronger signal of their discrepancy than when they deviate only at a later stage of the diffusion. Conceptually, this canonical family is general enough to incorporate the needed information about the diffusion process. In addition, as shown in Section 3.4, it admits fundamental properties that make the computation very efficient. As a special case, we demonstrate that \ell_{se} is equivalent to a discrepancy function with h(t_{z}(v))\equiv 2, as follows:

\|y-\zeta(z)\|_{2}^{2}=\sum_{i=1}^{n}\mathbf{1}(y_{i}\neq\zeta(z)_{i})=2T-2\sum_{v:y_{v}=1}\mathbf{1}(v\in z).

Therefore, \ell_{se} is equivalent to Eq. (2) with h(t_{z}(v))\equiv 2.
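This identity is easy to verify numerically; the six-node example below is purely illustrative.

```python
def hamming_sq(y, y_z):
    # ||y - zeta(z)||_2^2 for binary vectors equals the Hamming distance.
    return sum((a - b) ** 2 for a, b in zip(y, y_z))

n = 6
y_nodes, z_nodes = {0, 1, 2, 3}, {0, 1, 4, 5}    # observed set vs. path snapshot
y   = [1 if i in y_nodes else 0 for i in range(n)]
y_z = [1 if i in z_nodes else 0 for i in range(n)]
T = len(y_nodes)
overlap = len(y_nodes & z_nodes)                 # sum of 1(v in z) over y_v = 1
assert hamming_sq(y, y_z) == 2 * T - 2 * overlap  # both sides equal 4 here
```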

In this paper, we are particularly interested in the following natural configuration as the discrepancy function, which we call the “Averaged Deviation - inverse Time” (ADiT), which takes the canonical family form (2) with the inverse time weights:

h(t_{z}(v))=\frac{1}{t_{z}(v)}. \qquad (3)
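In code, ADiT is a one-liner over the path. One caveat: the text leaves the infection order of the source implicit, so we adopt the assumption that the source has order t = 1, which keeps h(t) = 1/t finite.

```python
def adit_loss(y_nodes, z):
    """ADiT discrepancy: canonical form (2) with inverse-time weights h(t) = 1/t.

    y_nodes: set of observed infected nodes; z: diffusion path [v_0, ..., v_T].
    Nodes absent from z contribute nothing (h(infinity) = 0). We index the
    source at t = 1, an assumption since the paper leaves t_z(v_0) implicit.
    """
    return -sum(1.0 / t for t, v in enumerate(z, start=1) if v in y_nodes)
```

For example, with observed set {0, 1, 2} and path [0, 1, 3], only nodes 0 and 1 match, giving -(1/1 + 1/2) = -1.5.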

In Table 1 of Section 5, we show the simulation performance of single-point estimation by our framework compared to other methods. Though our methods demonstrate improvements, the accuracy is universally low in all situations for all methods. This observation indicates that it is generally impossible to recover the source node by a single estimator with high accuracy. Indeed, as shown in Shah & Zaman (2011); Dong et al. (2013); Yu et al. (2018), even on the ideal infinite regular tree, for which the rumor center is proved to be optimal in the MLE sense, the probability of correct source node identification turns out to still be low (\leq 0.3). Such a low accuracy is far from useful in real-world applications, suggesting the necessity of developing alternative forms of inference, which we embark on in the next section.

3.2 Confidence Set

As mentioned previously, single-point estimators suffer from low success rates, rendering them unsatisfactory in real-world applications. To identify the source node with a nontrivial performance guarantee, we propose constructing a small subset of nodes that provably contains the source node with any pre-defined confidence. This insight motivates our use of the confidence set as the DSI method.

Definition 3.

Let Y be the random infection status of the stochastic diffusion process starting from s^{*}. A level 1-\alpha confidence set of the source node is a random set S(Y)\subset V depending on Y for which

\mathbb{P}(s^{*}\in S(Y))\geq 1-\alpha.

Surprisingly, the idea of using a confidence set to infer the diffusion source – though arguably a natural one in statistics – has not been explored much in the context of DSI. The works most relevant to ours are probably Bubeck et al. (2017); Khim & Loh (2016) and Crane & Xu (2020). Bubeck et al. (2017) considered identifying the first node of a growing tree, but not a diffusion process. Khim & Loh (2016) extended the idea to the SI model, but only for infinite regular trees in an asymptotic setting. Despite its theoretical merits, this method is not practical. For example, even in the situation of an infinite 4-regular tree as the network structure, applying the method of Khim & Loh (2016) would indicate a confidence set of size 4^{11}\approx 4.2\times 10^{6}, regardless of the infected size T. This is far too large for almost any application, let alone the fact that the infinite regular tree itself is unrealistic. Crane & Xu (2020) make the inference more effective, but still rely on the tree-structure assumption.

We instead take a completely different yet natural approach based on our statistical framework for the problem. To ensure the validity of the inference for any network structure, we rely on the general statistical inference strategy for confidence set construction. We first introduce a testing procedure for the null hypothesis H_{0}:s^{*}=s against the alternative hypothesis H_{1}:s^{*}\neq s. Given a discrepancy function \ell, data y, and the node s under evaluation, define the test statistic to be our loss T_{s}(y)=\mathbb{E}_{s}\ell(y,Z) for any data y. Then the p-value of the test is defined to be

\psi_{s}=\mathbb{P}_{s}(T_{s}(\zeta(Z))\geq T_{s}(y)), \qquad (4)

where the probability \mathbb{P}_{s} is over the randomness of the path Z generated from the random diffusion process starting from s. The p-value is the central concept in statistical hypothesis testing, and it gives a probabilistic characterization of how far the observed y deviates from the expected range for random paths that are truly from s (Lehmann & Romano, 2006). For a level 1-\alpha confidence set, we compute \psi_{s} for all nodes s and construct the confidence set by

S(y)=\{s:\psi_{s}(y)>\alpha\}. \qquad (5)

The following result guarantees the validity of the confidence set constructed above.

Theorem 1.

The confidence set constructed by (5) is a valid 1-\alpha confidence set.

Notice that Theorem 1 is a general result, independent of the network structure or the specific test statistic we use. However, the validity of the confidence set only gives one aspect of the inference. We would like to have small confidence sets in practice since such a small set would narrow down our investigation more effectively. The confidence set size would depend on the network structure and the corresponding effectiveness of the discrepancy function (the test statistic). We will use the proposed ADiT to define our test statistic. As shown in our empirical study, it gives excellent and robust performance across various settings.

3.3 Algorithmic Construction of Confidence Sets

The exact evaluation of the statistic T_{s}(y) and the p-value \psi_{s} is infeasible for general graphs since the probability mass function of the SI model is intractable. To overcome this barrier, we resort to the Monte Carlo (MC) method for approximate calculation, with details in Algorithm 1. This vanilla version turns out to be computationally inefficient; however, we will introduce techniques to significantly improve its computational efficiency afterwards.

Remark 2 (Monte Carlo setup).

Note that we have two layers of Monte Carlo evaluations. The first layer is the loss function calculation in (6) and (7), while the second layer is the p-value evaluation (8). The first layer shares the same m samples. This is different from the classical Monte Carlo, but does not break the validity of the p-value calculation. The properties of p-value calculation by the Monte Carlo method have been studied in detail by Jockel (1986); Besag & Clifford (1989).

1:  Input: MC sample number m, confidence level \alpha
2:  Input: Network G, data y, discrepancy function \ell
3:  for each infected node s\in V_{I} do
4:     Generate 2m samples z_{i}\in\mathcal{Z}_{T},i=1,\cdots,2m from the T-round diffusion process with source s on G.
5:     Estimate the expected loss T_{s}(y) of data y as
\hat{T}_{s}(y)=\frac{1}{m}\sum_{i=m+1}^{2m}\ell(y,z_{i}). \qquad (6)
6:     For each path z_{j},j=1,\cdots,m, estimate T_{s}(\zeta(z_{j})) as
\hat{T}_{s}(\zeta(z_{j}))=\frac{1}{m}\sum_{i=m+1}^{2m}\ell(\zeta(z_{j}),z_{i}). \qquad (7)
7:     Estimate the p-value \psi_{s}(y) as
\hat{\psi}_{s}(y)=\frac{1}{m}\sum_{j=1}^{m}\mathbf{1}(\hat{T}_{s}(\zeta(z_{j}))\geq\hat{T}_{s}(y)). \qquad (8)
8:  end for
9:  return the level 1-\alpha confidence set:
\mathcal{C}_{\alpha}(y)=\{s\in V_{I}:\hat{\psi}_{s}(y)>\alpha\}.
Algorithm 1 Vanilla MC for Confidence Set Construction
Remark 3 (Choice of the sample number m).

In theory, the computation in Algorithm 1 is exact when m\to\infty. In practice, simple guidance on the choice of m can be derived as follows. The critical step in Algorithm 1 is Step 7, the p-value calculation, since the MC errors from the previous steps are usually of a lower order. For correctness, we only need to worry about the evaluation at node s^{*} when the true p-value is close to \alpha. Step 7 averages over m indicators. By the central limit theorem, the MC estimate misses the true p-value by at most roughly 2\sqrt{\alpha(1-\alpha)/m}. For example, if we are aiming for a 90% confidence set where \alpha=0.1, setting m=10000 would indicate that the MC misses the targeted confidence level by at most about 0.006, which is usually good enough for most applications. In our experiments, we use m=10000 and it has been sufficient in all situations. Notice that this recommendation is more conservative than the ones used in classical statistical inference problems (Jockel, 1986). In our experience, it may still be acceptable to use a smaller m.
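The two-sigma bound in the remark is a direct CLT computation; `mc_pvalue_error` is our own name for it.

```python
import math

def mc_pvalue_error(alpha, m):
    """Two-sigma CLT bound on the MC error of the Step-7 p-value estimate:
    an average of m Bernoulli(alpha) indicators has sd sqrt(alpha(1-alpha)/m)."""
    return 2 * math.sqrt(alpha * (1 - alpha) / m)
```

For alpha = 0.1 and m = 10000 this gives 0.006, matching the remark.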

Remark 4 (Time complexity of the vanilla MC, and its trivial parallelization).

The time complexity of a standard sequential implementation of Algorithm 1 is \tilde{O}(mT^{2}+m^{2}T^{2}) (as a convention, the \tilde{O} notation omits logarithmic terms): (1) the first term is due to the MC sampling (Bringmann & Panagiotou, 2017); (2) the second term is from the statistic calculation (7) given the MC samples. However, our algorithm can be trivially parallelized. In particular, the for-loop in Step 3 can be distributed across different s\in V_{I} without any communication. This leads to a parallel time complexity of \tilde{O}(mT+m^{2}T). It is worthwhile to compare this time cost with the rumor center of Shah & Zaman (2011), which has \tilde{O}(dT) complexity, where d is the maximum node degree, but has to be computed sequentially (thus is non-parallelizable). In summary, Algorithm 1 has a better dependence on the network density captured by d but an additional quadratic dependence on the number of samples m.

3.4 Fast Loss Estimation for the Canonical Family

A major computational bottleneck of Algorithm 1 is the O(m^{2}T) time for estimating \mathbb{E}_{s}(\ell(y,Z)) in Equation (7) for every j, since we have to compute \hat{\psi} for m samples, and each \hat{\psi} is the average over another m samples. Fortunately, it turns out that, for the canonical discrepancy family, this step can be done in O(mT) time, highlighting another advantage of our proposed family of discrepancy functions.

Instead of computing \hat{T}_{s} in Equation (7) by summing over the samples i=m+1,\cdots,2m, we can compute \hat{T}_{s} directly using only the “summary information” of these samples, which can be computed and cached in advance. This insight is possible due to the following alternative representation of the \hat{T}_{s}(y) function in Equation (7):

\hat{T}_{s}(y)=-\frac{1}{m}\sum_{v:y_{v}=1}\sum_{i=m+1}^{2m}\sum_{k=1}^{T}h(k)\mathbf{1}(t_{z_{i}}(v)=k)=-\frac{1}{m}\sum_{v:y_{v}=1}\sum_{k=1}^{T}M_{v,k}h(k) \qquad (9)

where M_{v,k} counts the total number of samples among z_{m+1},\cdots,z_{2m} in which node v is the k'th infected node in the infection path. Let M\in\mathbb{R}^{n\times T} be the matrix containing the entries M_{v,k}. Note that there are at most mT nonzero entries in M since each sample only has T nodes. These entries can be computed in O(mT) time simply by updating the corresponding M_{v,k} entries during sampling. With these nonzero M_{v,k} entries, we can then compute \hat{h}(v)=\sum_{k=1}^{T}M_{v,k}h(k) for all the v that appear in our samples in O(mT) time. Finally, given \hat{h}(v), we can compute any \hat{T}_{s}(y) in O(T) time for y=\zeta(z_{1}),\cdots,\zeta(z_{m}), which in total takes an additional O(mT) time. The overall procedure thus takes O(mT) time.
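The count-matrix trick can be sketched as follows; we store M sparsely as nested dicts, and `build_counts` / `fast_T_hat` are illustrative names of our own.

```python
from collections import defaultdict

def build_counts(samples):
    """M[v][k]: number of cached sample paths in which node v is the k'th
    infection (k = 1 for the source). Built in O(mT) during sampling."""
    M = defaultdict(lambda: defaultdict(int))
    for z in samples:
        for k, v in enumerate(z, start=1):
            M[v][k] += 1
    return M

def fast_T_hat(y_nodes, M, h, m):
    """Evaluate Eq. (9): -(1/m) * sum_{v: y_v = 1} sum_k M[v][k] h(k),
    touching only the nonzero entries of M."""
    return -sum(M[v][k] * h(k) for v in y_nodes for k in M[v]) / m
```

With samples [[0, 1], [0, 2]] and h(k) = 1/k, the observed set {0, 1} gives -(2·1 + 1·(1/2))/2 = -1.25, identical to averaging the canonical loss over the two samples directly.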

4 Monte Carlo Acceleration via Pooled Sampling

In Section 3.4, we reduced the computation time for estimating a single p-value to \tilde{O}(mT), which is arguably the minimum possible in our framework since even generating m samples already takes \tilde{O}(mT) time. In this section, we introduce efficient strategies to reduce another major computational cost in our algorithm – the MC sampling. Our techniques “borrow” the MC samples of one node for the inference task of another node by leveraging the network structure and properties of the SI model. Consequently, we only need to generate MC samples for a subset of the infected nodes, which may effectively reduce the computational cost.

4.1 Surjective Importance Sampling for Single-Degree Nodes

A node with only one connection in the network is called a single-degree node. Suppose node u\in V_{I} is a single-degree node whose only neighbor v_{0} is also infected. Since any diffusion process starting from u must pass through v_{0}, we can use the distribution of paths from v_{0} to infer the distribution of paths from u. However, the converse is not true: a diffusion path from v_{0} may not pass through u, and even if it does, this may not occur as the first infection. Therefore, a certain mapping is needed to connect the two diffusion processes. The following theorem formalizes this intuition.

Theorem 2.

Let u be a single-degree node in the graph G whose only neighbor is v_{0}. If a path z\in\mathcal{Z}_{T} starting from v_{0} contains u,

z=\{v_{0},s_{1},s_{2},\cdots,s_{K-1},u,s_{K+1},\cdots,s_{T}\},

define z’s matching path from u as

f_{u}(z)=\{u,v_{0},s_{1},\cdots,s_{K-1},s_{K+1},\cdots,s_{T}\}. \qquad (10)

In this case, the likelihood ratio between z and f_{u}(z) is

\frac{p(f_{u}(z)|u)}{p(z|v_{0})}=\frac{1}{\mathbb{P}(u|v_{0},s_{1}\cdots s_{K-1})}\times\frac{1}{\prod_{k=1}^{K-1}(1-\mathbb{P}(s_{k}|v_{0},s_{1}\cdots s_{k-1}))} \qquad (11)

If the path z from v_{0} does not contain u, we define the ratio p(f_{u}(z)|u)/p(z|v_{0}) to be 0.

Notice that all terms on the right-hand side of (11) are available when we sample a path from the diffusion process starting at v_{0}; thus, given a sampled path z, computing the likelihood ratio only introduces negligible computational cost. Intuitively, according to Theorem 2, when the MC samples of v_{0} are available, they can be used to compute the p-value for node u based on an idea similar to importance sampling (L’Ecuyer & Owen, 2009). However, regular importance sampling cannot be directly applied because the likelihood ratio is only available between z and f_{u}(z) under the mapping f_{u}. Therefore, we need a generalized version of importance sampling. We name this procedure surjective importance sampling and give its property in the following theorem. We believe this theorem could be of general interest beyond our context.
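Under the SI model, each conditional probability in (11) is a ratio of boundary-edge counts, so the likelihood ratio can be accumulated while replaying the path; the helper names below are ours, and the adjacency-dict representation is an assumption.

```python
def si_step_prob(adj, infected, v):
    """P(v | infected): chance that susceptible v is the next SI infection,
    i.e., (# edges from infected nodes into v) / (# infected-susceptible edges)."""
    boundary = sum(1 for u in infected for w in adj[u] if w not in infected)
    into_v = sum(1 for u in infected if v in adj[u])
    return into_v / boundary

def likelihood_ratio(adj, u, z):
    """p(f_u(z) | u) / p(z | v_0) from Theorem 2, where z starts at v_0, the
    sole neighbor of single-degree node u. Returns 0 if z does not contain u."""
    if u not in z:
        return 0.0
    K = z.index(u)                     # u is the K'th infection along z
    infected, denom = {z[0]}, 1.0
    for k in range(1, K):
        denom *= 1.0 - si_step_prob(adj, infected, z[k])
        infected.add(z[k])
    return 1.0 / (si_step_prob(adj, infected, u) * denom)
```

On the path graph 0–1–2 with u = 0 and v_0 = 1, the path z = [1, 2, 0] has p(z|v_0) = (1/2)·1 while f_u(z) = [0, 1, 2] is forced, so the ratio is 2, which the code reproduces.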

Theorem 3 (Surjective Importance Sampling).

Suppose p_{1} and p_{2} are two probability mass functions for a discrete random vector Z, defined on \mathcal{C}_{1} and \mathcal{C}_{2}, respectively. Let \mathbb{E}_{1} and \mathbb{E}_{2} denote the expectation with respect to p_{1} and p_{2}, respectively. Given a surjection \phi:\mathcal{C}_{1}^{\prime}\to\mathcal{C}_{2}, defined on a subset \mathcal{C}_{1}^{\prime}\subset\mathcal{C}_{1}, we define the inverse mapping by \phi^{-1}(\tilde{z})=\{z\in\mathcal{C}_{1}^{\prime}:\phi(z)=\tilde{z}\} for any \tilde{z}\in\mathcal{C}_{2}. For a given bounded real function of interest, g, define

\eta=\mathbb{E}_{2}[g(Z)]\text{ and }\hat{\eta}=\frac{1}{m}\sum_{i=1}^{m}\frac{g(\phi(Z_{i}))}{|\phi^{-1}(\phi(Z_{i}))|}\frac{p_{2}(\phi(Z_{i}))}{p_{1}(Z_{i})}

where Z_{1},Z_{2},\cdots,Z_{m} is a size-m i.i.d. sample from distribution p_{1}, and if Z_{i}\not\in\mathcal{C}_{1}^{\prime}, we define p_{2}(\phi(Z_{i}))=0. We have

\lim_{m\to\infty}\hat{\eta}=\eta\quad\text{a.s.}

Notice that standard importance sampling is a special case of Theorem 3 when ϕ\phi is the identity mapping. Theorems 2 and 3 together serve as the cornerstone for using the MC samples from v0v_{0} to make inference about uu.
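As an illustration (a toy example of ours, not from the paper), the estimator η̂ of Theorem 3 can be verified numerically on a small discrete space where the surjection collapses several points of 𝒞1 onto 𝒞2:

```python
import random

def surjective_is(samples, phi, inv_size, p1, p2, g):
    """Surjective importance sampling estimate of E_{p2}[g(Z)] from an
    i.i.d. sample drawn under p1 (Theorem 3).  phi returns None for
    points outside the domain C1' of the surjection; those points
    contribute 0 to the sum."""
    total = 0.0
    for z in samples:
        zt = phi(z)
        if zt is not None:
            total += g(zt) / inv_size[zt] * p2[zt] / p1[z]
    return total / len(samples)

# Toy example: C1 = {0,1,2,3} with p1 uniform; C1' = {0,1,2};
# phi(z) = z mod 2, so C2 = {0,1} with |phi^{-1}(0)| = 2, |phi^{-1}(1)| = 1.
rng = random.Random(1)
samples = [rng.randrange(4) for _ in range(50000)]
est = surjective_is(
    samples,
    phi=lambda z: z % 2 if z != 3 else None,
    inv_size={0: 2, 1: 1},
    p1={z: 0.25 for z in range(4)},
    p2={0: 0.3, 1: 0.7},
    g=lambda z: z + 1,
)
# est approaches eta = E_{p2}[g] = 0.3 * 1 + 0.7 * 2 = 1.7
```

Taking ϕ to be the identity with 𝒞1′ = 𝒞1 recovers the standard importance sampling estimator.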

Corollary 1.

For a single-degree node uu and its neighbor v0v_{0}, let zi,i=1,,mz_{i},i=1,\cdots,m be mm i.i.d. paths generated from the diffusion process with source v0v_{0}. For any bounded function gg, we have

limm1mi=1mg(fu(zi))(fu(zi)|u)(zi|v0)1T=𝔼u[g(Z)] a.s.\lim_{m\to\infty}\frac{1}{m}\sum_{i=1}^{m}g\left(f_{u}(z_{i})\right)\frac{\mathbb{P}\left(f_{u}(z_{i})|u\right)}{\mathbb{P}\left(z_{i}|v_{0}\right)}\frac{1}{T}=\mathbb{E}_{u}[g(Z)]\text{~{}~{}~{}~{}~{}~{}a.s.}

in which fu(zi)f_{u}(z_{i}) and the likelihood ratio are given by Theorem 2.

Based on Corollary 1, when g(z)=(y,z)g(z)=\ell(y,z) or g(z)=𝟏(Tu(ζ(z))Tu(y))g(z)={\mathbf{1}}(T_{u}(\zeta(z))\geq T_{u}(y)), 𝔼u[g]\mathbb{E}_{u}[g] corresponds to the test statistic Tu(y)T_{u}(y) or the p-value ψu(y)\psi_{u}(y), respectively. Consequently, the MC sampling for uu can be avoided. Instead, to find the p-value for uu, Equation (7) in Algorithm 1 can be replaced by T^u(ζ(fu(zj)))\hat{T}_{u}\left(\zeta\left(f_{u}\left(z_{j}\right)\right)\right) equalling the following

1mi=m+12m(ζ(fu(zj)),fu(zi))(fu(zi)|u)(zi|v0)1T\frac{1}{m}\sum_{i=m+1}^{2m}\ell\left(\zeta\left(f_{u}\left(z_{j}\right)\right),f_{u}(z_{i})\right)\frac{\mathbb{P}\left(f_{u}(z_{i})|u\right)}{\mathbb{P}\left(z_{i}|v_{0}\right)}\frac{1}{T}

and Equation (8) can be replaced by ψ^u(y)\hat{\psi}_{u}(y) equalling the following

1mi=1m𝟏(T^u(ζ(fu(zj)))T^u(y))(fu(zi)|u)(zi|v0)1T,\frac{1}{m}\sum_{i=1}^{m}{\mathbf{1}}\bigg{(}\hat{T}_{u}\left(\zeta\left(f_{u}(z_{j})\right)\right)\geq\hat{T}_{u}(y)\bigg{)}\frac{\mathbb{P}\left(f_{u}(z_{i})|u\right)}{\mathbb{P}\left(z_{i}|v_{0}\right)}\frac{1}{T},

where zj,j=1,,2mz_{j},j=1,\cdots,2m are the MC samples generated from v0v_{0}. The same operation can be used for T^u(y)\hat{T}_{u}\left(y\right). The computational strategy for canonical discrepancy functions can also be extended to this setting (see Appendix C).

4.2 Permuted Sampling for Isomorphic Nodes

When the network structure is in some sense “symmetric” for two nodes, the MC samples from one node can be viewed as stochastically equivalent to the MC samples from the other node after the symmetric reflection. We call such a property isomorphism. Denote by Nk(u)N_{k}(u) the kkth-order neighborhood of node uu, that is, the set of all nodes exactly kk hops away from uu. The following definition rigorously formulates the aforementioned idea.

Definition 4.

Any two nodes u,vu,v in a network are first-order isomorphic if there exists a permutation π:{1,2,,n}{1,2,,n}\pi:\{1,2,\cdots,n\}\to\{1,2,\cdots,n\}, such that: (1) π(u)=v\pi(u)=v; (2) π(i)=i,if i{u,v}N1(u)N1(v)\pi(i)=i,\text{if~{}~{}}i\notin\{u,v\}\cup N_{1}(u)\cup N_{1}(v); (3) A=Aπ~,π~A=A_{\tilde{\pi},\tilde{\pi}}, where Aπ~,π~A_{\tilde{\pi},\tilde{\pi}} is the resulting matrix by applying permutation π\pi on the rows and columns of AA simultaneously.

For illustration, consider a simplified case of the isomorphism where uu and vv have exactly the same connections. In this case, π\pi only swaps uu and vv and is the identity mapping for all other nodes. For this pair of u,vu,v, the diffusion process properties would be the same if we swapped the positions of uu and vv. Definition 4 is more general than this simplified case, as it also allows permutations of the first-order neighbors. Under this definition of isomorphism, the following theorem shows that we can use the MC samples from one node to make inference about its isomorphic nodes after applying the permutation.
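The three conditions of Definition 4 can be checked mechanically for a candidate permutation (a sketch; the edge-list representation and function name are ours):

```python
def is_first_order_isomorphic(edges, n, perm, u, v):
    """Check the three conditions of Definition 4 for nodes u, v of a
    graph on {0, ..., n-1} under a candidate permutation perm (a list
    with perm[i] = pi(i)).  Condition 3, A = A_pi,pi, is equivalent to
    the permuted edge set equalling the original edge set."""
    nbrs = {i: set() for i in range(n)}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)
    if perm[u] != v:                                        # condition 1
        return False
    movable = {u, v} | nbrs[u] | nbrs[v]
    if any(perm[i] != i for i in set(range(n)) - movable):  # condition 2
        return False
    edge_set = {frozenset(e) for e in edges}
    return {frozenset((perm[a], perm[b])) for a, b in edges} == edge_set  # condition 3
```

For example, in the path graph 0–1–2, the two end nodes are first-order isomorphic under the permutation swapping them, while an end node and the middle node are not.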

Theorem 4.

Suppose uu and vv are first-order isomorphic under the permutation π\pi, and Z={u,v1,v2,,vT1}Z=\{u,v_{1},v_{2},\cdots,v_{T-1}\} is a random diffusion path from source uu. Define the permuted path

Zπ={π(u),π(v1),,π(vT1)}.Z_{\pi}=\{\pi(u),\pi(v_{1}),\cdots,\pi(v_{T-1})\}.

Then ZπZ_{\pi} has the same distribution as a random diffusion path from source vv.

To reuse the MC samples of one node for its isomorphic nodes according to Theorem 4, we need an efficient algorithm to identify all isomorphic pairs and the corresponding permutations. Directly checking Definition 4 is costly. To speed up the computation, we identify necessary conditions for isomorphism in Proposition 2.

Proposition 2.

If uu and vv are first-order isomorphic, we must have du=dvd_{u}=d_{v} and D1(u)=D1(v)D_{1}(u)=D_{1}(v), where dud_{u} and dvd_{v} are the degrees of uu and vv, and D1(u)D_{1}(u) and D1(v)D_{1}(v) are the degree sequences (sorted in ascending order) of N1(u)N_{1}(u) and N1(v)N_{1}(v). Furthermore, uu and vv have the same second-order neighbor sets. That is, N2(u)=N2(v)N_{2}(u)=N_{2}(v).

Based on Proposition 2, we can efficiently identify isomorphism using pre-screening steps, which turns out to significantly speed up the computation. Details are given in Algorithm 2 in Appendix B. With the isomorphic relations available, we can partition the nodes into isomorphic groups. MC sampling is then needed for only one node in each group, and the MC samples can be shared within the group according to Theorem 4. Specifically, suppose Z1,Z2,,Z2mZ_{1},Z_{2},\cdots,Z_{2m} are sampled from the diffusion process starting at uu. If uu and vv are isomorphic with permutation π\pi, we can use (Z1)π,(Z2)π,,(Z2m)π(Z_{1})_{\pi},(Z_{2})_{\pi},\cdots,(Z_{2m})_{\pi} as the MC samples of vv in Algorithm 1.
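The pre-screening step can be sketched as follows (names are ours; Algorithm 2 additionally restricts candidate pairs to nodes within four hops and finishes with an exhaustive matching check):

```python
from itertools import combinations

def prescreen_pairs(adj):
    """Filter candidate isomorphic pairs by the necessary conditions of
    Proposition 2: equal degree and equal sorted degree sequence of the
    one-hop neighborhood.  Surviving pairs still require the exhaustive
    matching check of Algorithm 2 before being declared isomorphic."""
    deg = {u: len(ns) for u, ns in adj.items()}
    sig = {u: (deg[u], tuple(sorted(deg[w] for w in adj[u]))) for u in adj}
    return [(u, v) for u, v in combinations(sorted(adj), 2) if sig[u] == sig[v]]
```

On a star graph, for instance, the leaves survive the screening (and are in fact isomorphic) while any pair involving the center is excluded immediately.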

Remark 5.

Definition 4 can be extended to higher-order neighborhoods, identifying more isomorphic pairs. However, the complexity of identifying such pairs increases exponentially with the order of the neighborhood, which can outweigh the time saved on the MC side. The first-order isomorphism turns out to give the most desirable tradeoff in computational efficiency.

5 Experimental Studies

In this section, we evaluate our proposed methods on well-studied random network models. We generate networks from three random network models: random 4-regular trees, the preferential attachment model (Barabási & Albert, 1999), and the small-world (S-W) network model (Watts & Strogatz, 1998). In network science, the preferential attachment model is usually used to model the scale-free property, widely conjectured to be ubiquitous in real-world networks (Barabási, 2013). The small-world property is believed to be prevalent in social networks (Watts & Strogatz, 1998). The network size is N=1365N=1365 (the size of a regular tree with degree 4 and depth 6) with T=150T=150. The networks are sparse, with average degree below 4. The Monte Carlo size mm is 4000. To be realistic, the source node in all settings is set to be the one with median eigen-centrality. All reported results are averaged over 200 independent replications.

Refer to caption
Figure 2: The average size of confidence sets for different confidence levels.

We first evaluate the single-point source estimation accuracy of the rumor center and distance center of Shah & Zaman (2011); Khim & Loh (2016); Bubeck et al. (2017); Yu et al. (2018), as well as estimators using our proposed framework with discrepancy functions se\ell_{se} and ADiT. The result is shown in Table 1. Though the two estimators based on our framework are always better, the overall message from the table is not promising. All of the methods, including ours, give accuracy too low to be useful in applications. Such a negative result convincingly shows that the DSI problem is generally too difficult for the single-point estimation strategy to work, and exploring the alternative confidence set inference is necessary.

Table 1: The correct rate of single-point estimation methods across 200 replications.
reg. tree Pref. Att. S-W
Rumor center 0 0 0.08
Dist. center 0 0 0.07
Euclidean (ours) 0 0.04 0.20
ADiT (ours) 0 0.02 0.16

Table 2 shows the coverage rate of the confidence sets, with the squared Euclidean distance and the ADiT as the discrepancy functions. Notably, the proposed confidence set procedure delivers the desired coverage in all settings (up to the simulation error). Meanwhile, the size of the confidence set varies substantially depending on the network structure. For regular trees and scale-free networks, the ADiT works much better than the Euclidean distance, indicating that the diffusion order is informative in these network structures. For the small-world networks, the two are very similar, which may indicate that the diffusion order is less informative for well-connected networks. In general, we believe the adaptivity of the ADiT-based confidence set makes it preferable.

Table 2: The coverage rate of the confidence sets across 200 replications.
reg. tree Pref. Att. S-W
Euclidean-90% 93% 91% 92%
size 79.1 86.2 16.0
ADiT-90% 93% 89% 91%
size 60.4 70.5 17.7
Euclidean-80% 85% 84% 83%
size 68.6 71.9 10.5
ADiT-80% 84% 82% 81%
size 50.0 57.5 11.3

To obtain a comprehensive view of the tradeoff between the set size and confidence level, we show the relationship between the confidence set’s average size and the confidence level in Figure 2. The relation is slightly super-linear. In connection with the single-point estimation results, notice that for small-world networks, the confidence set with confidence level 20% has average size around 1. In contrast, the regular tree and preferential attachment networks are more difficult: to guarantee even a 10% confidence level, the average size of the confidence set is already about 5. These observations verify the results in Table 1 and support our argument that, in general, inferring the source by a single-point estimator is hopeless.

Table 3: The timing comparison of the sequential running time for the proposed pooled MC strategies (in sec.).
reg. tree Pref. Att. S-W
Vanilla MC 898.3 1060.5 1052.9
Import. Sampl. 602.3 462.1 1057.5
Isomorphism 617.8 516.8 1059.4
Both 411.9 430.9 1057.7

Finally, we also evaluate the timing improvements achieved by the pooled MC strategies. As expected, their power depends on the network structure. The timing comparison is included in Table 3. Only the sequential version of our method is timed, for a fair comparison with the rumor center. As can be seen, with both pooled MC strategies, we reduce the timing by about 60% for the tree structure and the preferential attachment networks, but the effects on small-world networks are negligible. In applications, network structures tend to exhibit core-periphery patterns (Miao & Li, 2021), for which the pooled MC strategies are expected to be effective. For reference, the average timing for finding the rumor center is about 1.25 seconds. However, with the extra computational cost, our method provides confidence sets at any specified level, with guaranteed coverage for any network structure. We believe this is generally a wise tradeoff.

Meanwhile, notice that our inference procedure can be parallelized; we give a parallel algorithm in the appendix (Algorithm 3 in Appendix D). It needs MC sampling for only one node in each group, and the calculations for the other nodes can be done using the pooled MC methods. Table 4 includes the timing results of the parallel implementation on 20 cores in the same settings as Table 3. With 20 cores, a confidence set can be constructed in under 30 seconds in the cases where the pooled MC methods are effective.
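The outer loop of Algorithm 3 parallelizes naturally because isomorphic groups are processed independently. A minimal sketch (names are ours; pvalues_for_group is a stand-in for lines 7–15 of Algorithm 3, and a CPU-bound implementation would use ProcessPoolExecutor rather than threads):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_pvalues(groups, pvalues_for_group, workers=4):
    """Run one worker per isomorphic group; each worker performs MC
    sampling once for a representative node and returns p-values for
    every node in its group, pooled via the isomorphism and surjective
    importance sampling strategies."""
    pvals = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for group_result in pool.map(pvalues_for_group, groups):
            pvals.update(group_result)
    return pvals

def confidence_set(pvals, alpha):
    """Level 1 - alpha confidence set: nodes whose p-value exceeds alpha."""
    return {s for s, p in pvals.items() if p > alpha}
```

Because the groups share no state, the speedup is limited only by the size of the largest group and the number of available cores.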

Table 4: Comparison of the parallel running time for the proposed pooled MC strategies (in sec.) on 20 cores.
reg. tree Pref. Att. S-W
Vanilla MC 59.3 61.9 71.5
Import. Sampl. 33.4 32.9 73.3
Isomorphism 44.0 33.4 73.4
Both 26.3 28.4 72.0

6 Summary

We have introduced a statistical inference framework for diffusion source identification on networks. Compared with previous methods, our framework is more general and renders salient insights about the problem. More importantly, within this framework, we can construct the confidence set for the source node in a more natural and principled way such that the success rate can be guaranteed on any network structure. To our knowledge, our method is the first DSI method with theoretical guarantees for general network structures. We also propose efficient computational strategies that are potentially useful in other problems as well.

Acknowledgements

The work is supported by the Quantitative Collaborative Award from the College of Arts and Sciences at the University of Virginia. T. Li is supported in part by the NSF grant DMS-2015298. H. Xu is supported by a GIDI award from the UVA Global Infectious Diseases Institute.

References

  • Anderson & May (1992) Anderson, R. M. and May, R. M. Infectious diseases of humans: dynamics and control. Oxford university press, 1992.
  • Barabási (2013) Barabási, A.-L. Network science. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1987):20120375, 2013.
  • Barabási & Albert (1999) Barabási, A.-L. and Albert, R. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
  • Besag & Clifford (1989) Besag, J. and Clifford, P. Generalized monte carlo significance tests. Biometrika, 76(4):633–642, 1989.
  • Bringmann & Panagiotou (2017) Bringmann, K. and Panagiotou, K. Efficient sampling methods for discrete distributions. Algorithmica, 79(2):484–508, 2017.
  • Bubeck et al. (2017) Bubeck, S., Devroye, L., and Lugosi, G. Finding adam in random growing trees. Random Structures & Algorithms, 50(2):158–172, 2017.
  • Crane & Xu (2020) Crane, H. and Xu, M. Inference on the history of a randomly growing tree. arXiv preprint arXiv:2005.08794, 2020.
  • Dong et al. (2013) Dong, W., Zhang, W., and Tan, C. W. Rooting out the rumor culprit from suspects. In 2013 IEEE International Symposium on Information Theory, pp.  2671–2675. IEEE, 2013.
  • Halperin & Almogy (2002) Halperin, A. and Almogy, G. System and method of virus containment in computer networks, December 19 2002. US Patent App. 10/058,809.
  • Hu et al. (2020) Hu, T., Guan, W. W., Zhu, X., Shao, Y., Liu, L., Du, J., Liu, H., Zhou, H., Wang, J., She, B., et al. Building an open resources repository for covid-19 research. Data and Information Management, 4(3):130–147, 2020.
  • Jockel (1986) Jockel, K.-H. Finite sample properties and asymptotic efficiency of monte carlo tests. The Annals of Statistics, pp.  336–347, 1986.
  • Kazemitabar & Amini (2020) Kazemitabar, S. J. and Amini, A. A. Approximate identification of the optimal epidemic source in complex networks. In International Conference on Network Science, pp. 107–125. Springer, 2020.
  • Khim & Loh (2016) Khim, J. and Loh, P.-L. Confidence sets for the source of a diffusion in regular trees. IEEE Transactions on Network Science and Engineering, 4(1):27–40, 2016.
  • Lab (2020) Lab, C. D. Baidu Mobility Data, 2020. URL https://doi.org/10.7910/DVN/FAEZIO.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.
  • L’Ecuyer & Owen (2009) L’Ecuyer, P. and Owen, A. B. Monte Carlo and Quasi-Monte Carlo Methods 2008. Springer, 2009.
  • Lehmann & Romano (2006) Lehmann, E. L. and Romano, J. P. Testing statistical hypotheses. Springer Science & Business Media, 2006.
  • Miao & Li (2021) Miao, R. and Li, T. Informative core identification in complex networks. arXiv preprint arXiv:2101.06388, 2021.
  • Newman et al. (2002) Newman, M. E., Forrest, S., and Balthrop, J. Email networks and the spread of computer viruses. Physical Review E, 66(3):035101, 2002.
  • Nguyen et al. (2016) Nguyen, H. T., Ghosh, P., Mayo, M. L., and Dinh, T. N. Multiple infection sources identification with provable guarantees. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp.  1663–1672, 2016.
  • Shah & Zaman (2011) Shah, D. and Zaman, T. Rumors in a network: Who’s the culprit? IEEE Transactions on Information Theory, 57(8):5163–5181, 2011.
  • Shah & Zaman (2012) Shah, D. and Zaman, T. Finding rumor sources on random graphs. 2012.
  • Shapiro & Delgado-Eckert (2012) Shapiro, M. and Delgado-Eckert, E. Finding the probability of infection in an sir network is np-hard. Mathematical biosciences, 240(2):77–84, 2012.
  • Valiant (1984) Valiant, L. G. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  • Vosoughi et al. (2018) Vosoughi, S., Roy, D., and Aral, S. The spread of true and false news online. Science, 359(6380):1146–1151, 2018.
  • Watts & Strogatz (1998) Watts, D. J. and Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.
  • Xu & Ren (2016) Xu, Y. and Ren, J. Propagation effect of a virus outbreak on a network with limited anti-virus ability. PloS one, 11(10):e0164415, 2016.
  • Yu et al. (2018) Yu, P.-D., Tan, C. W., and Fu, H.-L. Rumor source detection in finite graphs with boundary effects by message-passing algorithms. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp.  175–192. Springer, 2018.

Appendix A Proofs

A.1 Proof of Theorem 1

Define qT,sα=inft{t:s(Ts(ζ(Z))t)α}q_{T,s^{*}}^{\alpha}=\inf_{t}\{t:\mathbb{P}_{s^{*}}(T_{s^{*}}(\zeta(Z))\geq t)\leq\alpha\}. Notice that qT,sαq_{T,s^{*}}^{\alpha} can be seen as one generalized definition for the right quantile of the distribution of the random variable Ts(Y~)T_{s^{*}}(\tilde{Y}), where Y~:=ζ(Z)\tilde{Y}:=\zeta(Z) is a random infection status of the network generated by the diffusion process starting from ss^{*}.

Now assume YY is a random infection status from the diffusion process from ss^{*}. According to the definition of the p-value, we have

s(sS(Y))\displaystyle\mathbb{P}_{s^{*}}\left(s^{*}\in S(Y)\right) =s(ψs(Y)>α)\displaystyle=\mathbb{P}_{s^{*}}\left(\psi_{s^{*}}(Y)>\alpha\right)
=s(s(Ts(Y~)Ts(Y))>α)\displaystyle=\mathbb{P}_{s^{*}}\left(\mathbb{P}_{s^{*}}\left(T_{s^{*}}(\tilde{Y})\geq T_{s^{*}}(Y)\right)>\alpha\right)
s(Ts(Y)<qT,sα)\displaystyle\geq\mathbb{P}_{s^{*}}\left(T_{s^{*}}(Y)<q_{T,s^{*}}^{\alpha}\right)
=1s(Ts(Y)qT,sα).\displaystyle=1-\mathbb{P}_{s^{*}}\left(T_{s^{*}}(Y)\geq q_{T,s^{*}}^{\alpha}\right).

Note that since ss^{*} is the true source node, Ts(Y)T_{s^{*}}(Y) and Ts(Y~)T_{s^{*}}(\tilde{Y}) follow exactly the same distribution, thus

s(sS(Y))1s(Ts(Y)qT,sα)1α.\mathbb{P}_{s^{*}}\left(s^{*}\in S(Y)\right)\geq 1-\mathbb{P}_{s^{*}}\left(T_{s^{*}}(Y)\geq q_{T,s^{*}}^{\alpha}\right)\geq 1-\alpha.

A.2 Proof of Theorem 2

Given a path generated by the diffusion process starting from v0v_{0} containing uu, denoted by

z={v0,s1,s2,,sK1,u,sK+1,,sT},z=\{v_{0},s_{1},s_{2},\cdots,s_{K-1},u,s_{K+1},\cdots,s_{T}\},

we match it to the path fu(Z)f_{u}(Z) defined as

fu(z)={u,v0,s1,,sK1,sK+1,,sT}.f_{u}(z)=\{u,v_{0},s_{1},\cdots,s_{K-1},s_{K+1},\cdots,s_{T}\}.

We start from the probability mass of zz starting from v0v_{0}. By using the Markov property, we have

p(z|v0)=\displaystyle p(z|v_{0})= (s1|v0)(s2|v0,s1)(sK1|v0,s1,,sK2)\displaystyle\mathbb{P}(s_{1}|v_{0})\mathbb{P}(s_{2}|v_{0},s_{1})\cdots\mathbb{P}(s_{K-1}|v_{0},s_{1},\cdots,s_{K-2})
×(sK+1|v0,,u)(sT|v0,,sT1)\displaystyle~{}~{}\times\mathbb{P}(s_{K+1}|v_{0},\cdots,u)\cdots\mathbb{P}(s_{T}|v_{0},\cdots,s_{T-1}) (12)
×(u|v0,s1,,sK1)\displaystyle~{}~{}~{}~{}~{}~{}\times\mathbb{P}(u|v_{0},s_{1},\cdots,s_{K-1})

In contrast, for the path fu(z)f_{u}(z), we have

(fu(z)|u)=\displaystyle\mathbb{P}(f_{u}(z)|u)= (v0|u)(s1|u,v0)(s2|u,v0,s1)(sK1|u,v0,s1,,sK2)\displaystyle\mathbb{P}(v_{0}|u)\mathbb{P}(s_{1}|u,v_{0})\mathbb{P}(s_{2}|u,v_{0},s_{1})\cdots\mathbb{P}(s_{K-1}|u,v_{0},s_{1},\cdots,s_{K-2})
×(sK+1|u,v0,,sK1)(sT|u,v0,,sT1)\displaystyle~{}~{}\times\mathbb{P}(s_{K+1}|u,v_{0},\cdots,s_{K-1})\cdots\mathbb{P}(s_{T}|u,v_{0},\cdots,s_{T-1})
=\displaystyle= (s1|u,v0)(s2|u,v0,s1)(sK1|u,v0,s1,,sK2)\displaystyle\mathbb{P}(s_{1}|u,v_{0})\mathbb{P}(s_{2}|u,v_{0},s_{1})\cdots\mathbb{P}(s_{K-1}|u,v_{0},s_{1},\cdots,s_{K-2})
×(sK+1|u,v0,,sK1)(sT|u,v0,,sT1).\displaystyle~{}~{}\times\mathbb{P}(s_{K+1}|u,v_{0},\cdots,s_{K-1})\cdots\mathbb{P}(s_{T}|u,v_{0},\cdots,s_{T-1}). (13)

Notice that the conditional probability (sk+1|v0,,u,sK+1,sk),k>K\mathbb{P}(s_{k+1}|v_{0},\cdots,u,s_{K+1},\cdots s_{k}),k>K only depends on the set of infected nodes before the (k+1)(k+1)th infection and is invariant to the infection order. This property indicates that all terms after the (K+1)(K+1)th (in the second rows) of (12) and (13) are equal.

Next, we compare the terms in the first line of each of (12) and (13). Notice that for each k<Kk<K, the term (sk|v0,s1,,sk1)\mathbb{P}(s_{k}|v_{0},s_{1},\cdots,s_{k-1}) is uniform over the available infecting edges given the infected nodes v0,s1,,sk1v_{0},s_{1},\cdots,s_{k-1}, while the term (sk|u,v0,s1,,sk1)\mathbb{P}(s_{k}|u,v_{0},s_{1},\cdots,s_{k-1}) is uniform over the available infecting edges given the infected nodes u,v0,s1,,sk1u,v_{0},s_{1},\cdots,s_{k-1}. The only difference between the two infected sets is uu. Since uu has only one connection, to v0v_{0}, the number of available infecting edges is always one more in the former case. Therefore, we have

(sk|u,v0,s1,,sk1)=11(u|v0,s1,,sk1)(sk|v0,s1,,sk1),k<K.\mathbb{P}(s_{k}|u,v_{0},s_{1},\cdots,s_{k-1})=\frac{1}{1-\mathbb{P}(u|v_{0},s_{1},\cdots,s_{k-1})}\mathbb{P}(s_{k}|v_{0},s_{1},\cdots,s_{k-1}),k<K.

In addition, notice that the third line of (12) contains one extra term, (u|v0,s1,,sK1)\mathbb{P}(u|v_{0},s_{1},\cdots,s_{K-1}), that does not appear in (13). Combining the aforementioned relations, we finally obtain the likelihood ratio

(fu(z)|S0=u)(z|S0=v0)=1(1(u|v0))(1(u|v0,s1))(1(u|v0,s1,,sK2))(u|v0,s1,,sK1)\frac{\mathbb{P}(f_{u}(z)|S_{0}=u)}{\mathbb{P}(z|S_{0}=v_{0})}=\frac{1}{(1-\mathbb{P}(u|v_{0}))(1-\mathbb{P}(u|v_{0},s_{1}))\cdots(1-\mathbb{P}(u|v_{0},s_{1},\cdots,s_{K-2}))\mathbb{P}(u|v_{0},s_{1},\cdots,s_{K-1})} (14)

Moreover, if zz does not contain uu, we set the ratio to be 0.

A.3 Proof of Theorem 3

Since ZiZ_{i}’s are a random sample from p1p_{1}, under the current assumption, by the strong law of large numbers, we have

(η^𝔼1[g(ϕ(Z))|ϕ1(ϕ(Z))|p2(ϕ(Z))p1(Z)])=1.\mathbb{P}\left(\hat{\eta}\to\mathbb{E}_{1}[\frac{g(\phi(Z))}{|\phi^{-1}(\phi(Z))|}\frac{p_{2}(\phi(Z))}{p_{1}(Z)}]\right)=1.

Notice that ϕ\phi is a surjection. Therefore, the term 𝔼1[g(ϕ(Z))|ϕ1(ϕ(Z))|p2(ϕ(Z))p1(Z)]\mathbb{E}_{1}[\frac{g(\phi(Z))}{|\phi^{-1}(\phi(Z))|}\frac{p_{2}(\phi(Z))}{p_{1}(Z)}] can be rewritten as

𝔼1[g(ϕ(Z))|ϕ1(ϕ(Z))|p2(ϕ(Z))p1(Z)]\displaystyle\mathbb{E}_{1}[\frac{g(\phi(Z))}{|\phi^{-1}(\phi(Z))|}\frac{p_{2}(\phi(Z))}{p_{1}(Z)}] =z𝒞1g(ϕ(z))|ϕ1(ϕ(z))|p2(ϕ(z))p1(z)p1(z)\displaystyle=\sum_{z\in\mathcal{C}_{1}}\frac{g(\phi(z))}{|\phi^{-1}(\phi(z))|}\frac{p_{2}(\phi(z))}{p_{1}(z)}p_{1}(z)
=z𝒞1g(ϕ(z))|ϕ1(ϕ(z))|p2(ϕ(z))\displaystyle=\sum_{z\in\mathcal{C}_{1}}\frac{g(\phi(z))}{|\phi^{-1}(\phi(z))|}p_{2}(\phi(z))
=z𝒞1g(ϕ(z))|ϕ1(ϕ(z))|p2(ϕ(z))+z𝒞1/𝒞1g(ϕ(z))|ϕ1(ϕ(z))|p2(ϕ(z))\displaystyle=\sum_{z\in\mathcal{C}_{1}^{\prime}}\frac{g(\phi(z))}{|\phi^{-1}(\phi(z))|}p_{2}(\phi(z))+\sum_{z\in\mathcal{C}_{1}/\mathcal{C}_{1}^{\prime}}\frac{g(\phi(z))}{|\phi^{-1}(\phi(z))|}p_{2}(\phi(z))
=z𝒞1g(ϕ(z))|ϕ1(ϕ(z))|p2(ϕ(z))\displaystyle=\sum_{z\in\mathcal{C}_{1}^{\prime}}\frac{g(\phi(z))}{|\phi^{-1}(\phi(z))|}p_{2}(\phi(z))
=z~𝒞2z:ϕ(z)=z~g(ϕ(z))|ϕ1(ϕ(z))|p2(ϕ(z))\displaystyle=\sum_{\tilde{z}\in\mathcal{C}_{2}}\sum_{z:\phi(z)=\tilde{z}}\frac{g(\phi(z))}{|\phi^{-1}(\phi(z))|}p_{2}(\phi(z))
=z~𝒞2z:ϕ(z)=z~g(z~)|ϕ1(z~)|p2(z~)\displaystyle=\sum_{\tilde{z}\in\mathcal{C}_{2}}\sum_{z:\phi(z)=\tilde{z}}\frac{g(\tilde{z})}{|\phi^{-1}(\tilde{z})|}p_{2}(\tilde{z})
=z~𝒞2zϕ1(z~)g(z~)|ϕ1(z~)|p2(z~)\displaystyle=\sum_{\tilde{z}\in\mathcal{C}_{2}}\sum_{z\in\phi^{-1}(\tilde{z})}\frac{g(\tilde{z})}{|\phi^{-1}(\tilde{z})|}p_{2}(\tilde{z})
=z~𝒞2|ϕ1(z~)|g(z~)|ϕ1(z~)|p2(z~)\displaystyle=\sum_{\tilde{z}\in\mathcal{C}_{2}}|\phi^{-1}(\tilde{z})|\frac{g(\tilde{z})}{|\phi^{-1}(\tilde{z})|}p_{2}(\tilde{z})
=z~𝒞2g(z~)p2(z~)\displaystyle=\sum_{\tilde{z}\in\mathcal{C}_{2}}g(\tilde{z})p_{2}(\tilde{z})
=𝔼2[g(Z)].\displaystyle=\mathbb{E}_{2}[g(Z)].

A.4 Proof of Corollary 1

We apply Theorem 3 with ϕ=fu\phi=f_{u}. The only remaining step is to find |fu1(fu(z))||f_{u}^{-1}(f_{u}(z))|. By the definition of fuf_{u} in Theorem 2, a path in the preimage of fu(z)f_{u}(z) is determined by the position at which uu is inserted after v0v_{0}; since uu's only neighbor v0v_{0} is the source, all TT positions give valid diffusion paths. Therefore, |fu1(fu(z))||f_{u}^{-1}(f_{u}(z))| is always TT in this situation.

A.5 Proof of Theorem 4

Notice that, given TT, the probability mass function of a diffusion path depends only on the network AA and the source node. Condition 1 of Definition 4 indicates that ZπZ_{\pi} starts from vv. Conditions 2 and 3 of Definition 4 together indicate that ZπZ_{\pi} is also a valid diffusion path. Condition 3, in particular, indicates that

pu(Z)=pv(Zπ).p_{u}(Z)=p_{v}(Z_{\pi}).

Now define π1\pi^{-1} to be the inverse of π\pi. For any path Z~\tilde{Z} from vv, by the same reasoning, Z~π1\tilde{Z}_{\pi^{-1}} is also a valid diffusion path starting from uu. In particular, we have Z=(Zπ)π1Z=(Z_{\pi})_{\pi^{-1}}. Therefore, ZπZ_{\pi} has the same sample space and probability mass function as a random diffusion path from vv.

Appendix B Details for Isomorphism Identification in Section 4.2

Algorithm 2 Identification of First-Order Isomorphic Pairs
  Input: Graph G=(V,E)G=(V,E)
  Initialize L=L=\emptyset to store the list of isomorphic node pairs
  for every node uVu\in V do
     Compute N14(u)N_{1-4}(u) as the set of all neighbors of uu within 44 hops
      // N14(u)N_{1-4}(u) includes all nodes that are possible to be isomorphic to uu
     for vN14(u)v\in N_{1-4}(u) do
        if dv==dud_{v}==d_{u} then
           Compute D1(u)={du:uN1(u)}D_{1}(u)=\{d_{u^{\prime}}:u^{\prime}\in N_{1}(u)\}, the multi-set of degrees of nodes in N1(u)N_{1}(u)
           Similarly, compute D1(v)D_{1}(v), the multi-set of degrees of all one-hop neighbors of vv
           if  D1(u)==D1(v)D_{1}(u)==D_{1}(v)  then
              Compute N~2(u)=N2(u)N1(u)N1(v){u,v}\tilde{N}_{2}(u)=N_{2}(u)-N_{1}(u)-N_{1}(v)-\{u,v\}
               // N~2(u)\tilde{N}_{2}(u) contains all (exactly) two-hop neighbors of uu, but with all (exactly) one-hop neighbors of u,vu,v removed
              Compute N~2(v)=N2(v)N1(u)N1(v){u,v}\tilde{N}_{2}(v)=N_{2}(v)-N_{1}(u)-N_{1}(v)-\{u,v\}
              if  N~2(u)==N~2(v)\tilde{N}_{2}(u)==\tilde{N}_{2}(v)  then
                 Do exhaustive search to check whether u,vu,v are isomorphic by enumerating all possible matchings of their neighbors, and if so, add (u,v)(u,v) to list LL
                  // Hopefully, not many pairs need to go through this step
              end if
           end if
        end if
     end for
  end for

Based on the properties in Proposition 2, Algorithm 2 finds all isomorphic pairs and the permutations in the network. Next, we provide a proof of Proposition 2.

Proof of Proposition 2.

du=dvd_{u}=d_{v} because π\pi gives a 1-1 mapping from N1(u)N_{1}(u) to N1(v)N_{1}(v). Furthermore, we have π(N1(u))=N1(v)\pi(N_{1}(u))=N_{1}(v). By condition 3 of Definition 4, we also have D1(u)=D1(v)D_{1}(u)=D_{1}(v).

The last claim can be proved by contradiction. Suppose there exists a node ww such that wN2(u)w\in N_{2}(u) but wN2(v)w\notin N_{2}(v). Since π(N1(u))=N1(v)\pi(N_{1}(u))=N_{1}(v) while π(w)=w\pi(w)=w, after applying the permutation to the network, ww becomes disconnected from π(N1(w))\pi(N_{1}(w)). This contradicts condition 3 of Definition 4. ∎

Appendix C Loss Function Computation Acceleration for Surjective Importance Sampling

The calculation strategy for canonical discrepancy functions can also be further generalized to the weighted averaging scenario used for the single-degree nodes in Section 4.1. Specifically, there we need to calculate terms like

1mi=m+12m(y,fu(zi))(fu(zi)|u)(zi|v0)1T\displaystyle\frac{1}{m}\sum_{i=m+1}^{2m}\ell\left(y,f_{u}(z_{i})\right)\frac{\mathbb{P}\left(f_{u}(z_{i})|u\right)}{\mathbb{P}\left(z_{i}|v_{0}\right)}\frac{1}{T} =1mi=m+12mv:yv=1𝟏(vfu(zi))h(tfu(zi)(v))(fu(zi)|u)(zi|v0)1T\displaystyle=-\frac{1}{m}\sum_{i=m+1}^{2m}\sum_{v:y_{v}=1}{\mathbf{1}}(v\in f_{u}(z_{i}))h(t_{f_{u}(z_{i})}(v))\frac{\mathbb{P}\left(f_{u}(z_{i})|u\right)}{\mathbb{P}\left(z_{i}|v_{0}\right)}\frac{1}{T}
=1mv:yv=1i=m+12m𝟏(vfu(zi))h(tfu(zi)(v))(fu(zi)|u)(zi|v0)1T\displaystyle=-\frac{1}{m}\sum_{v:y_{v}=1}\sum_{i=m+1}^{2m}{\mathbf{1}}(v\in f_{u}(z_{i}))h(t_{f_{u}(z_{i})}(v))\frac{\mathbb{P}\left(f_{u}(z_{i})|u\right)}{\mathbb{P}\left(z_{i}|v_{0}\right)}\frac{1}{T}
=1mv:yv=1i=m+12m𝟏(vfu(zi))[k=1T𝟏(tfu(zi)(v)=k)h(k)](fu(zi)|u)(zi|v0)1T\displaystyle=-\frac{1}{m}\sum_{v:y_{v}=1}\sum_{i=m+1}^{2m}{\mathbf{1}}(v\in f_{u}(z_{i}))[\sum_{k=1}^{T}{\mathbf{1}}(t_{f_{u}(z_{i})}(v)=k)h(k)]\frac{\mathbb{P}\left(f_{u}(z_{i})|u\right)}{\mathbb{P}\left(z_{i}|v_{0}\right)}\frac{1}{T}
=1mv:yv=1i=m+12mk=1T𝟏(vfu(zi))𝟏(tfu(zi)(v)=k)h(k)(fu(zi)|u)(zi|v0)1T\displaystyle=-\frac{1}{m}\sum_{v:y_{v}=1}\sum_{i=m+1}^{2m}\sum_{k=1}^{T}{\mathbf{1}}(v\in f_{u}(z_{i})){\mathbf{1}}(t_{f_{u}(z_{i})}(v)=k)h(k)\frac{\mathbb{P}\left(f_{u}(z_{i})|u\right)}{\mathbb{P}\left(z_{i}|v_{0}\right)}\frac{1}{T}
=1mv:yv=1k=1Th(k)[i=m+12m𝟏(tfu(zi)(v)=k)(fu(zi)|u)(zi|v0)1T].\displaystyle=-\frac{1}{m}\sum_{v:y_{v}=1}\sum_{k=1}^{T}h(k)[\sum_{i=m+1}^{2m}{\mathbf{1}}(t_{f_{u}(z_{i})}(v)=k)\frac{\mathbb{P}\left(f_{u}(z_{i})|u\right)}{\mathbb{P}\left(z_{i}|v_{0}\right)}\frac{1}{T}].

Therefore, to use this strategy in Section 4.1 when pooling the MC samples from v0v_{0}, in addition to caching MM, we also cache the matrix adjusted by the likelihood-ratio factor

Mv,k(v0u)=i=m+12m𝟏(tfu(zi)(v)=k)(fu(zi)|u)(zi|v0)1T.M^{(v_{0}\to u)}_{v,k}=\sum_{i=m+1}^{2m}{\mathbf{1}}(t_{f_{u}(z_{i})}(v)=k)\frac{\mathbb{P}\left(f_{u}(z_{i})|u\right)}{\mathbb{P}\left(z_{i}|v_{0}\right)}\frac{1}{T}.
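A sketch of this cache (names and data layout are ours; infection times are 0-indexed): the matrix is accumulated once from the mapped paths and their likelihood ratios, after which the weighted discrepancy sum for any observed yy reduces to table lookups.

```python
def cache_weighted_counts(mapped_paths, ratios, T):
    """Build M[v][k] = sum_i 1(t_i(v) = k) * w_i / T over the mapped
    paths, where t_i(v) is the (0-indexed) infection time of node v in
    mapped path i and w_i is the likelihood ratio of Theorem 2 for
    that path.  Nodes never infected in any path get no entry."""
    M = {}
    for path, w in zip(mapped_paths, ratios):
        for k, v in enumerate(path[:T]):
            M.setdefault(v, [0.0] * T)[k] += w / T
    return M
```

The weighted discrepancy for a given yy is then obtained by summing h(k) · M[v][k] over the infected nodes vv and times kk, exactly as in the derivation above.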

Appendix D Parallel Algorithm for Confidence Set Construction

As discussed in Section 3.3, our confidence set construction algorithm can be implemented in parallel, further boosting its speed. In the main paper, we only include the details and timing for the sequential version. The parallelized algorithm is described in Algorithm 3.

1:  Input: MC sample number mm, confidence level α\alpha, Network GG, data yy, discrepancy function \ell
2:  Compute S={g1,g2,,gM}S=\{g_{1},g_{2},\cdots,g_{M}\}, the isomorphic groups for infected nodes with degree at least 2.
3:  for each gSg\in S do
4:     Extend gg by including all of its single-degree neighbors.
5:  end for
6:  for  each infected isomorphic group gSg\in S in parallel do
7:     Select any sgs\in g with degree at least 2
8:     Generate 2m2m samples zi𝒵,i=1,,2mz_{i}\in\mathcal{Z},i=1,\cdots,2m from the TT-step diffusion process from source ss.
9:     Calculate the p-value for ss following (6), (7), and (8)
10:     for each vgv\in g that is isomorphic to ss do
11:        Calculate ψ^v(y)\hat{\psi}_{v}(y) according to Theorem 4.
12:     end for
13:     for each single-degree node vgv\in g do
14:        Calculate the p-value ψ^v(y)\hat{\psi}_{v}(y) according to the surjective importance sampling in Section 4.1.
15:     end for
16:  end for
17:  return the level 1α1-\alpha confidence set:
𝒞α(y)={sVI:ψ^s(y)>α}.\mathcal{C}_{\alpha}(y)=\{s\in V_{I}:\hat{\psi}_{s}(y)>\alpha\}.
Algorithm 3 Parallel Confidence Set Construction