Diffusion Source Identification on General Networks with Statistical Confidence
Abstract
Diffusion source identification on networks is a problem of fundamental importance in a broad class of applications, including rumor control and virus identification. Though this problem has received significant recent attention, most studies have focused only on very restrictive settings and lack theoretical guarantees for more realistic networks. We introduce a statistical framework for the study of diffusion source identification and develop a confidence set inference approach inspired by hypothesis testing. Our method efficiently produces a small subset of nodes, which provably covers the source node with any pre-specified confidence level without restrictive assumptions on network structures. Moreover, we propose multiple Monte Carlo strategies for the inference procedure, based on network topology and the probabilistic properties of the diffusion model, that significantly improve scalability. To our knowledge, this is the first diffusion source identification method with a practically useful theoretical guarantee on general networks. We demonstrate our approach via extensive synthetic experiments on well-known random network models and a mobility network between cities concerning the spread of COVID-19.
1 Introduction
One pressing problem today is the spread of misinformation and malicious attacks/viruses in various cyberspaces. For example, rumors and fake news on social networks may result in serious political, economic, and social issues (Vosoughi et al., 2018). Viruses that spread via emails and computer communication may cause severe privacy and leakage problems (Newman et al., 2002; Halperin & Almogy, 2002; Xu & Ren, 2016). In such events, the negative impacts stem from a few source users/locations and then spread over the network via a diffusion process. One crucial step to reduce the loss from such an event is to quickly identify the sources so that counter-measures can be taken in a timely fashion.
Though early attempts have been made on this important problem with motivations from various domains, systematic research only began very recently, arguably starting from the seminal work of Shah & Zaman (2011), which proposed a rumor center estimator that can be located by an efficient message-passing algorithm with linear time complexity. Despite the significant interest and progress on this problem in recent years (Shah & Zaman, 2012; Dong et al., 2013; Khim & Loh, 2016; Bubeck et al., 2017; Yu et al., 2018; Crane & Xu, 2020), many challenges remain unaddressed. First, the theoretical understanding of these methods is currently only available under very restrictive and somewhat unrealistic structural assumptions on the networks, such as regular trees. This is perhaps partially explained by the well-known computational hardness of probabilistic inference for diffusion processes on general graphs (Shapiro & Delgado-Eckert, 2012). Therefore, intuitive approximations have been used for general networks (Nguyen et al., 2016; Kazemitabar & Amini, 2020). However, such methods lack theoretical guarantees. Second, even for regular trees, the available performance guarantee is far from useful in practice. Even in the most idealized situation of infinite regular trees, the correct probability of the rumor center is almost always low (Shah & Zaman, 2011; Dong et al., 2013; Yu et al., 2018). For general graphs, as we show later, the correct rate of such single-point estimation methods is simply too low to be practical.
To guarantee a higher success probability, a typical approach, as in both machine learning theory (Valiant, 1984) and data-driven applied models (LeCun et al., 2015), is to obtain more data. However, a fundamental challenge in diffusion source identification (DSI) is that the problem by nature has only one snapshot of the network information, i.e., the earliest observation about the infection status of the network (since infected nodes are usually indistinguishable and equally infectious, any additional information in later observations only tells us which new or additional nodes are infected, and is not helpful for inferring the source node). Therefore, compared to classic learning tasks, DSI poses a fundamentally different challenge for inference. It is this crucial understanding that motivates our adoption of a different statistical inference technique, the confidence set. A few previous systematic statistical studies have adopted the confidence set approach for DSI on trees (Bubeck et al., 2017; Khim & Loh, 2016; Crane & Xu, 2020). Though they enjoy good theoretical properties, these methods are applicable only on infinite trees.
This paper aims to bridge the gap between practically useful algorithms and theoretical guarantees for the DSI problem. We introduce a new statistical inference framework which provably includes many previous methods (Shah & Zaman, 2011; Nguyen et al., 2016) as special cases. Our new framework not only highlights the drawbacks of the previous methods but, more importantly, also leads to the design of our confidence set inference approach with finite-sample theoretical guarantees on any network structure.
As a demonstration, consider the example of the COVID-19 spreading process in early 2020. Figure 1 shows a travel mobility network between 49 major cities in China, constructed from the two-week travel volumes (Lab, 2020; Hu et al., 2020) before the virus caught wide attention. The square nodes (21 out of 49) are the cities with at least five confirmed cases of the virus on Jan 24, 2020. The DSI problem asks: given only knowledge of the mobility network and of which cities have detected a notable number of confirmed cases (in this case, at least 5), can we identify in which city the virus was first detected?

This problem turns out to be too difficult for precise identification. None of the single-point source identification methods under evaluation can successfully identify Wuhan due to its relatively non-central position in the network (details in Section 5). Nevertheless, both our 80% and 90% confidence sets cover Wuhan correctly, giving recommendations of 6 and 11 nodes (out of 49 cities), respectively. In fact, the evaluation over the whole week after the lockdown of Wuhan reveals that both confidence sets correctly cover Wuhan on all seven days, while the single-point estimation methods are rarely effective. Such a result clearly shows the necessity of adopting the confidence set approach and the effectiveness of our solution. Our contributions in this paper are three-fold.
1. We introduce an innovative statistical framework for the DSI problem. It includes several previous methods as special cases, but has the potential for more effective inference.
2. Under our framework, we propose a general way to construct the source node confidence set, whose validity can be guaranteed for finite sample sizes and any network structure. To the best of our knowledge, it is the first DSI method with a theoretical performance guarantee on general networks.
3. We propose techniques that dramatically improve the computational efficiency of our inference algorithm. En route, we develop a generalized importance sampling method, which may be of independent interest.
A high-level message of this paper is that the confidence set approach, which has not received adequate attention in the machine learning literature, can be an important tool for inference tasks, especially for challenging problems with limited available data.
2 Preliminaries
We start by formalizing the Diffusion Source Identification (DSI) problem, introduced in the seminal work of Shah & Zaman (2011). Consider a network G with node set V and edge set E, and let n denote the number of nodes. For ease of presentation, we focus on unweighted and undirected networks, but it is straightforward to generalize the model and our framework to weighted networks. We write u ~ v if nodes u and v are connected. The network can be equivalently represented by its binary adjacency matrix A, where A_{uv} = 1 if and only if u ~ v.
There is a source node s* on the network initiating a diffusion of a certain effect (a rumor, fake news, or some virus) over the network. We embed our inference of the diffusion process under the widely-adopted "Susceptible-Infected" (SI) model (Anderson & May, 1992; Shah & Zaman, 2011), though our approach can be easily tailored to other diffusion processes as well. In the SI model, the source node is the only "infected" node initially. The infection diffuses as follows: given the set of currently infected nodes, the next infection happens by sampling uniformly at random one of the edges connecting an infected node and a susceptible node. Consequently, a full diffusion path with t infections can be represented by a sequence P = (p_1, ..., p_t) of nodes in the infection order, and we define the diffusion path space to be the set of all such sequences.
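To make the sampling mechanism concrete, here is a minimal sketch of an SI path sampler (our own illustrative code, not the paper's implementation; the names `adj` and `sample_si_path` and the adjacency-list representation are our choices):

```python
import random

def sample_si_path(adj, source, t, rng=random):
    """Sample one SI diffusion path of t infections starting at `source`.

    `adj` maps each node to a list of its neighbors. At every step we pick,
    uniformly at random, one edge between an infected and a susceptible
    node; the susceptible endpoint becomes the next infection.
    """
    infected = {source}
    path = [source]
    # current boundary edges as (infected endpoint, susceptible endpoint)
    boundary = [(source, w) for w in adj[source]]
    while len(path) < t and boundary:
        _, new = rng.choice(boundary)
        infected.add(new)
        path.append(new)
        # drop boundary edges whose susceptible endpoint was just infected,
        # then add the new node's edges toward still-susceptible nodes
        boundary = [(a, b) for (a, b) in boundary if b != new]
        boundary += [(new, w) for w in adj[new] if w not in infected]
    return path
```

The returned list records the infection order, i.e., a diffusion path in the sense above; discarding the order recovers the snapshot.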
However, in practice, when the occurrence of the infection is noticed, we have already lost the information about the diffusion path. Instead, the available data only contain the snapshot of the current infection status of the network, without the infection order. Formally, the data can be represented as an n-dimensional binary vector y with y_v = 1{node v is infected}, where 1{.} is the standard indicator function. The sample space Y of the DSI problem is then the set of all snapshot vectors that can arise in this way.
Equivalently, we will also think of any y in Y as the infected subset of nodes, of size t. The DSI problem can then be defined as follows.
Definition 1 (Diffusion Source Identification).
Given one sample y in Y, identify the source node s* of the diffusion process that generated y.
Challenges. The challenge of DSI intrinsically arises from the loss of information in the observed data. Specifically, by definition, we have a many-to-one mapping that sends a diffusion path P to the corresponding infection snapshot y_P of the network. Information about the infection order has been lost upon the observation of the data y. Nevertheless, the DSI problem seeks to identify the first node in the infection order, with access to only one snapshot of the infection status. Note that obtaining multiple snapshots over time does not reduce the difficulty of DSI. This is because, given the current snapshot, later observed data carry no additional information about the source node due to the Markov property of the SI model.
3 A General Statistical Framework for DSI with Confidence Guarantees
3.1 DSI as Parameter Estimation
We start by formulating DSI under a systematic statistical framework, which will help in our design of better inference methods later on. Treating the network as fixed and the source node s as the model parameter, the probability of generating data y can be represented by P_s(Y = y), where the random variable Y denotes the observed data. The identification of s* can then be treated as a parameter estimation problem. Specifically, we consider the following general parameter estimation framework. Given any discrepancy function ℓ, we want to find an estimator of s* based on the following optimization problem:
ŝ = argmin_{s ∈ V} E_s[ℓ(y, P)]    (1)
in which P is the random diffusion path following the SI model starting from parameter s, and E_s denotes the expectation over P. That is, we look to select the s whose generated diffusion paths have the minimum expected discrepancy from our observed data y.
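The optimization above can be sketched directly with Monte Carlo averages (illustrative code; `estimate_source`, `sample_path`, and `discrepancy` are hypothetical names we introduce, and a practical implementation would reuse samples as discussed later in the paper):

```python
def estimate_source(candidates, sample_path, discrepancy, y, m=200):
    """Single-point estimator in the spirit of Eq. (1): pick the candidate
    source whose random diffusion paths have the smallest average
    discrepancy from the observed snapshot y.

    `sample_path(s)` draws one diffusion path from source s;
    `discrepancy(y, path)` is the loss between a snapshot and a path.
    """
    best, best_loss = None, float("inf")
    for s in candidates:
        loss = sum(discrepancy(y, sample_path(s)) for _ in range(m)) / m
        if loss < best_loss:
            best, best_loss = s, loss
    return best
```

Any discrepancy function defined on snapshot-path pairs can be plugged in, which is exactly the flexibility the framework is designed for.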
Remark 1.
An important design choice here is that the discrepancy function ℓ is defined on snapshot-path pairs, not on pairs of snapshots. That is, y will be compared with the random diffusion path P, not merely with the snapshot y_P induced by the path. This is because P contains richer information about the diffusion process. As we show later, this turns out to be crucial for designing effective discrepancy functions.
Notice that our framework includes a few previous methods as special cases. Due to space limits, all formal proofs in this paper have been deferred to the Appendix. Instead, intuition and explanations are provided as needed.
Proposition 1.
1. The rumor center estimator of Shah & Zaman (2011) is a special case of (1), obtained with a 0-1 discrepancy function that only indicates whether the snapshot induced by the path exactly matches y.
2. The estimator of Nguyen et al. (2016) is a special case of (1), obtained with the squared Euclidean distance between y and the snapshot induced by the path as the discrepancy function.
Proposition 1 also reveals some key drawbacks of the rumor center method and its variants. First, the 0-1 discrepancy function only takes two values, and it treats all snapshots that differ from y equally. Therefore, such a function may not be sufficiently sensitive for general networks. From this perspective, the squared Euclidean discrepancy is potentially better. Second, and more importantly, both of the above discrepancy functions only depend on the snapshot induced by the path, failing to leverage the diffusion order of the path. Ignoring such information may also undermine the performance. To overcome these drawbacks, we propose the following family of discrepancy functions as a better alternative. We call this family the canonical family of discrepancy functions.
Definition 2 (Canonical Discrepancy Functions).
Consider a class of discrepancy functions that can be written in the following form
ℓ_w(y, P) = − Σ_{v: y_v = 1} w(τ_P(v))    (2)
in which τ_P(v) is the infection order of node v in path P and w is a non-increasing weight function. When v is not infected in P, we define w(τ_P(v)) = 0.
The canonical form (2) is essentially a negative similarity function. It incorporates both the infection status and the infection order of P. The weight function incorporates the diffusion order such that if P deviates from y at an early stage, the deviation is treated as a stronger signal for their discrepancy, compared with the case when they only deviate at a later stage of the diffusion. Conceptually, this canonical family is general enough to incorporate the needed information about the diffusion process. In addition, as shown in Section 3.4, it admits fundamental properties that make the computation very efficient. As a special case, since y and the snapshot y_P are binary vectors with the same number of ones, the squared Euclidean distance satisfies ||y − y_P||² = 2t − 2 Σ_v y_v (y_P)_v, a monotone transformation of the negative overlap between y and y_P. Therefore the squared Euclidean distance is equivalent to Eq. (2) with the constant weight function w ≡ 1.
In this paper, we are particularly interested in the following natural configuration of the discrepancy function, which we call the "Averaged Deviation - inverse Time" (ADiT); it takes the canonical family form (2) with the inverse-time weights w(τ) = 1/τ:
ℓ_ADiT(y, P) = − Σ_{v: y_v = 1} 1 / τ_P(v)    (3)
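Under our reading of the canonical family as a negative weighted similarity, with weight w(τ) for the node infected at time τ and weight zero for nodes absent from the path, the ADiT discrepancy can be sketched as follows (illustrative code with hypothetical names):

```python
def canonical_discrepancy(y_infected, path, weight):
    """Canonical-family discrepancy (as reconstructed here): the negative
    weighted similarity between the observed infected set and a path.
    `y_infected` is the set of observed infected nodes; nodes absent from
    `path` contribute weight 0.
    """
    order = {v: k for k, v in enumerate(path, start=1)}  # 1-based infection times
    return -sum(weight(order[v]) for v in y_infected if v in order)

def adit(y_infected, path):
    """'Averaged Deviation - inverse Time' (ADiT): inverse-time weights w(t) = 1/t,
    so early agreements between the snapshot and the path count the most."""
    return canonical_discrepancy(y_infected, path, lambda t: 1.0 / t)
```

With the constant weight `lambda t: 1.0`, the same function recovers the (negative) overlap underlying the squared Euclidean case.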
In Table 1 of Section 5, we show the simulation performance of the single-point estimation by our framework compared to other methods. Though our methods demonstrate improvements, the accuracy is universally low in all situations for all methods. Such an observation indicates that it is generally impossible to recover the source node by a single estimator with high accuracy. Indeed, as shown in Shah & Zaman (2011); Dong et al. (2013); Yu et al. (2018), even in the ideal infinite regular tree for which the rumor center is proved to be optimal in the MLE sense, the probability of correct source node identification turns out to still be low. Such a low accuracy is far from useful in real-world applications, suggesting the necessity of developing alternative forms of inference, which we embark on in the next section.
3.2 Confidence Set
As mentioned previously, single-point estimators suffer from low success rates, rendering them unsatisfactory in real-world applications. To identify the source node with a nontrivial performance guarantee, we propose constructing a small subset of nodes that provably contains the source node with any pre-specified confidence. This insight motivates our use of the confidence set as the DSI method.
Definition 3.
Let Y be the random infection status of the stochastic diffusion process starting from s*. A level 1 − α confidence set of the source node is a random set C(Y), depending on Y, for which P(s* ∈ C(Y)) ≥ 1 − α.
Surprisingly, the idea of using a confidence set to infer the diffusion source, though arguably a natural one in statistics, has not been explored much in the context of DSI. The most relevant works to ours are probably Bubeck et al. (2017); Khim & Loh (2016) and Crane & Xu (2020). Bubeck et al. (2017) considered identifying the first node of a growing tree, but not a diffusion process. Khim & Loh (2016) extended the idea to the SI model, but only for infinite regular trees and an asymptotic setting. Despite its theoretical merits, this method is not practical. For example, even in the situation of an infinite 4-regular tree as the network structure, applying the method of Khim & Loh (2016) would indicate an enormous confidence set, regardless of the infected size. This is far too large for almost any application, let alone the fact that the infinite regular tree itself is unrealistic. Crane & Xu (2020) make the inference more effective, but still rely on the tree-structure assumption.
We instead take a completely different yet natural approach based on our statistical framework for the problem. To ensure the validity of the inference for any network structure, we rely on the general statistical inference strategy for confidence set construction. We first introduce a testing procedure for the null hypothesis H0: s* = v against the alternative hypothesis H1: s* ≠ v. Given a discrepancy function ℓ, data y, and the node v under evaluation, define the test statistic to be the expected loss L(y, v) = E_v[ℓ(y, P)]. Then the p-value of the test is defined to be
p_v(y) = P_v( L(y_P, v) ≥ L(y, v) )    (4)
where the probability is over the randomness of the path P generated from the random diffusion process starting from v. The p-value is a central concept in statistical hypothesis testing, and it gives a probabilistic characterization of how extremely the observed y deviates from the expected range for random paths that are truly from v (Lehmann & Romano, 2006). For a level 1 − α confidence set, we compute p_v(y) for all nodes v and construct the confidence set by
C(y) = { v ∈ V : p_v(y) > α }    (5)
The following result guarantees the validity of the confidence set constructed above.
Theorem 1.
The confidence set constructed by (5) is a valid level 1 − α confidence set.
Notice that Theorem 1 is a general result, independent of the network structure or the specific test statistic we use. However, the validity of the confidence set only gives one aspect of the inference. We would like to have small confidence sets in practice since such a small set would narrow down our investigation more effectively. The confidence set size would depend on the network structure and the corresponding effectiveness of the discrepancy function (the test statistic). We will use the proposed ADiT to define our test statistic. As shown in our empirical study, it gives excellent and robust performance across various settings.
3.3 Algorithmic Construction of Confidence Sets
The exact evaluation of the test statistic and the p-value is infeasible for general graphs since the probability mass function of the SI model is intractable. To overcome this barrier, we resort to the Monte Carlo (MC) method for approximate calculation, with details in Algorithm 1. This vanilla version turns out to be computationally inefficient; however, we will introduce techniques that significantly improve its computational efficiency afterwards.
Remark 2 (Monte Carlo setup).
Note that we have two layers of Monte Carlo evaluations. The first layer is the loss function calculation in (6) and (7), while the second layer is the p-value evaluation (8). The calculations in the first layer share the same MC samples. This is different from the classical Monte Carlo method, but it does not break the validity of the p-value calculation. The properties of p-value calculation by the Monte Carlo method have been studied in detail by Jockel (1986); Besag & Clifford (1989).
L̂(y, v) = (1/m) Σ_{i=1}^m ℓ(y, P_i)    (6)

L̂(y_{P_j}, v) = (1/m) Σ_{i=1}^m ℓ(y_{P_j}, P_i),  j = 1, ..., m    (7)

p̂_v(y) = (1/m) Σ_{j=1}^m 1{ L̂(y_{P_j}, v) ≥ L̂(y, v) }    (8)

where P_1, ..., P_m are the i.i.d. MC sample paths drawn from source v.
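The two-layer Monte Carlo procedure of (6)-(8), together with the confidence set construction of (5), can be sketched as follows (illustrative code with names of our own choosing; the shared sample pool follows Remark 2):

```python
def p_value(v, y_infected, sample_path, discrepancy, m=400):
    """Vanilla two-layer MC p-value for candidate source v: all losses are
    estimated against one shared pool of m sampled paths."""
    paths = [sample_path(v) for _ in range(m)]

    def loss(snapshot):  # counterpart of (6)/(7): average discrepancy
        return sum(discrepancy(snapshot, q) for q in paths) / m

    observed = loss(y_infected)
    # counterpart of (8): fraction of sampled snapshots at least as extreme
    return sum(loss(set(q)) >= observed for q in paths) / m

def confidence_set(nodes, y_infected, sample_path, discrepancy, alpha, m=400):
    """Level 1-alpha confidence set: keep every node whose null hypothesis
    is not rejected at level alpha."""
    return {v for v in nodes
            if p_value(v, y_infected, sample_path, discrepancy, m) > alpha}
```

In practice the loop over candidate nodes parallelizes trivially, since each p-value uses its own sample pool.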
Remark 3 (Choice of the sample number m).
In theory, the computation in Algorithm 1 is exact when m goes to infinity. In practice, simple guidance about the choice of m can be derived as follows. The critical step in Algorithm 1 is Step 7, the p-value calculation, since the MC errors from the previous steps are usually of a lower order. For correctness, we only need to worry about the evaluation at a node v when the true p-value is close to α. Step 7 averages over m indicators. By the central limit theorem, the MC estimate misses the true p-value by an error of order 1/sqrt(m). For example, if we are aiming for a 90% confidence set where α = 0.1, setting m = 4000 (the value used in our experiments) would indicate that the MC at most misses the targeted confidence level by about 0.006%, which is usually good enough in most applications. This choice has been sufficient in all our settings. Notice that this recommendation is more conservative than the ones used in classical statistical inference problems (Jockel, 1986). In our experience, it might still be acceptable to use a smaller m.
Remark 4 (Time complexity of the vanilla MC, and its trivial parallelization).
The time complexity of a standard sequential implementation of Algorithm 1 consists of two terms (as a convention, logarithmic factors are omitted): (1) the first term is due to the MC sampling (Bringmann & Panagiotou, 2017); (2) the second term is from the statistic calculation (7) given the MC samples. However, our algorithm can be trivially parallelized. In particular, the for-loop in Step 3 can be distributed across different nodes without any communication, leading to a much smaller parallel time complexity. It is worthwhile to compare this time cost with the rumor center of Shah & Zaman (2011), which has linear complexity (in the network size and the maximum node degree) but whose message-passing algorithm has to be sequential (thus non-parallelizable). In summary, Algorithm 1 has a better dependence on the network density, captured by the maximum node degree, but an additional quadratic dependence on the number of samples m.
3.4 Fast Loss Estimation for the Canonical Family
A major computational bottleneck of Algorithm 1 is the time for estimating the losses in Equation (7) for every sampled snapshot, since we have to compute the loss for m samples, and each is the average over another m samples, i.e., on the order of m² discrepancy evaluations. Fortunately, it turns out that, for the canonical discrepancy family, this step can be done in time linear in m, highlighting another advantage of our proposed family of cost functions.
Instead of computing the average in Equation (7) by brute force over all pairs of samples, we can compute it directly using only "summary information" of these samples that can be computed and cached in advance. This insight is possible due to the following alternative representation of the quantity in Equation (7):
L̂(y_{P_j}, v) = −(1/m) Σ_{u: (y_{P_j})_u = 1} Σ_{r=1}^{t} w(r) c_{u,r}    (9)
where c_{u,r} counts the total number of samples in P_1, ..., P_m in which node u is the r-th infected node in the infection path. Let C be the matrix containing the entries c_{u,r}. Note that there are at most mt nonzero entries in C since each sample only has t nodes. These entries can be computed in O(mt) time simply by updating the corresponding entries during sampling. With these nonzero entries, we can then compute the per-node scores S(u) = Σ_r w(r) c_{u,r} for all the nodes u that showed up in our samples in O(mt) time. Finally, given these scores, we can compute any L̂(y_{P_j}, v) in O(t) time by summing S(u) over the t infected nodes of y_{P_j}, which thus in total takes an additional O(mt) time. This overall takes O(mt) time.
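A sketch of the cached-count computation described above, assuming the count representation c_{u,r} and per-node scores as we have reconstructed them (all names are ours):

```python
from collections import defaultdict

def cache_counts(paths):
    """c[u][r] = number of sampled paths in which node u is the r-th
    infection; at most m*t nonzero entries overall."""
    c = defaultdict(lambda: defaultdict(int))
    for path in paths:
        for r, u in enumerate(path, start=1):
            c[u][r] += 1
    return c

def cached_scores(c, weight):
    """S[u] = sum_r w(r) * c[u][r]: the per-node summary of the
    representation in (9)."""
    return {u: sum(weight(r) * n for r, n in rows.items())
            for u, rows in c.items()}

def fast_loss(infected_set, S, m):
    """Average canonical discrepancy of one snapshot against all m cached
    paths, in time linear in the snapshot size."""
    return -sum(S.get(u, 0.0) for u in infected_set) / m
```

The counts can be accumulated during sampling itself, so the cache adds no extra pass over the data.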
4 Monte Carlo Acceleration via Pooled Sampling
In subsection 3.4, we reduced the computation time for estimating a single p-value to be linear in the number of MC samples, which is arguably the minimum possible in our framework since even generating the m samples already takes that order of time. In this section, we introduce efficient strategies to reduce another major computational cost in our algorithm: the MC sampling itself. Our techniques "borrow" the MC samples of one node for the inference task of another node by leveraging the network structure and the properties of the SI model. Consequently, we only need to generate MC samples for a subset of the infected nodes, which may effectively reduce the computational cost.
4.1 Surjective Importance Sampling for Single-Degree Nodes
A node with only one connection in the network is called a single-degree node. Suppose node u is a single-degree node whose only neighbor v is also infected. Since any diffusion process starting from u must pass v, we can use the distribution of paths from v to infer the distribution of paths from u. However, the converse is not true: a diffusion path from v may not pass u, and even if it passes u, this may not occur as the first infection. Therefore, a certain mapping is needed to connect the two diffusion processes. The following theorem formulates this intuition.
Theorem 2.
Let u be a single-degree node in the graph with its only neighbor node v. If a path P = (p_1, p_2, ..., p_t) starting from p_1 = v contains u as its K-th infection (p_K = u), define u's matching path from P as

σ(P) = (u, p_1, ..., p_{K−1}, p_{K+1}, ..., p_t).    (10)

In this case, the likelihood ratio between σ(P) and P is

P_u(σ(P)) / P_v(P) = B_K Π_{j=2}^{K−1} B_j / (B_j − 1),    (11)

where B_j is the number of edges between infected and susceptible nodes right before the j-th infection of P. If the path P from v does not contain u, we define the ratio to be 0.
Notice that all terms on the right-hand side of (11) are available when we sample a path from the diffusion process starting at v; thus, given a sampled path P, computing the likelihood ratio only introduces negligible computational cost. Intuitively, according to Theorem 2, when the MC samples of v are available, they can be used to compute the p-value for node u based on an idea similar to importance sampling (L'Ecuyer & Owen, 2009). However, regular importance sampling cannot be directly applied because the likelihood ratio is only available between σ(P) and P under the mapping σ. Therefore, we need a generalized version of importance sampling. We name this procedure surjective importance sampling and give its property in the following theorem. We believe this theorem could be of general interest beyond our context.
Theorem 3 (Surjective Importance Sampling).
Suppose q and p are two probability mass functions for discrete random vectors defined on Ω_q and Ω_p, respectively. Let E_q and E_p denote the expectation with respect to q and p. Given a surjection σ: Ω̃_q → Ω_p defined on a subset Ω̃_q of Ω_q, we define the inverse mapping by σ^{-1}(y) = {x : σ(x) = y} for any y in Ω_p. For a given bounded real function of interest f, define

μ̂_m = (1/m) Σ_{i=1}^m f(σ(X_i)) p(σ(X_i)) / q(σ^{-1}(σ(X_i))),

where X_1, ..., X_m is a size-m i.i.d. sample from distribution q, q(A) denotes the total q-mass of a set A, and if X_i is outside Ω̃_q, we define the corresponding summand to be 0. We have E[μ̂_m] = E_p[f(X)].
Notice that the standard importance sampling is a special case of Theorem 3 when σ is the identity mapping. Theorems 2 and 3 together serve as a cornerstone for our use of the MC samples from v to make inference about u.
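As a sanity check of the surjective importance sampling identity, under our assumption that the importance weight is p(σ(x)) divided by the total q-mass of the preimage of σ(x), here is an exact toy computation (all names and the toy distributions are ours):

```python
def surjective_is_weight(x, sigma, p, q, domain):
    """Importance weight for surjective sampling (as reconstructed here):
    p(sigma(x)) divided by the total q-mass of the preimage of sigma(x)."""
    y = sigma(x)
    pre_mass = sum(q[z] for z in domain if sigma(z) == y)
    return p[y] / pre_mass

# Toy example: q uniform on {0,1,2,3}, sigma(x) = x % 2 surjects onto
# {0,1}, and p lives on {0,1}. Summing q(x) * f(sigma(x)) * weight over
# the whole support of q should equal the p-expectation of f.
q = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
p = {0: 0.3, 1: 0.7}
sigma = lambda x: x % 2
f = lambda y: float(y)
lhs = sum(q[x] * f(sigma(x)) * surjective_is_weight(x, sigma, p, q, q)
          for x in q)
rhs = sum(p[y] * f(y) for y in p)  # = 0.7
```

When sigma is the identity, the preimage mass is just q(x) and the weight reduces to the usual ratio p(x)/q(x).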
Corollary 1.
For a single-degree node u and its neighbor v, let P_1, ..., P_m be the i.i.d. paths generated from the diffusion process with source v. For any bounded function f, the weighted average (1/m) Σ_{j=1}^m f(σ(P_j)) R_j is an unbiased estimate of E_u[f(P)], in which σ is the matching of Theorem 2 and R_j is the importance weight of Theorem 3, computed from the likelihood ratio given by Theorem 2.
Based on Corollary 1, when f is taken to be the discrepancy loss or the p-value indicator, the weighted average corresponds to the test statistic L(y, u) or the p-value p_u(y), respectively. Consequently, the MC sampling for u can be avoided. Instead, to find the p-value for u, Equations (7) and (8) in Algorithm 1 can be replaced by their importance-weighted counterparts, in which each sampled path P_j is replaced by its matched path σ(P_j) and each term is weighted by the ratio R_j, where P_1, ..., P_m are the MC samples generated from v. The same operation can be used for Equation (6). The computational strategy for canonical discrepancy functions can also be extended in this setting (see Appendix C).
4.2 Permuted Sampling for Isomorphic Nodes
When the network structure is in some sense "symmetric" for two nodes, the MC samples from one node can be viewed as stochastically equivalent to the MC samples from the other node after the symmetric reflection. We call such a property isomorphism. Denote by N_k(v) the k-th order neighborhood of node v, that is, the set of all nodes exactly k hops away from v. The following definition rigorously formulates the aforementioned idea.
Definition 4.
Any two nodes u and v in a network are first-order isomorphic if there exists a permutation π on the node set, such that: (1) π(u) = v; (2) π(w) = w for any node w outside u, v, and their first-order neighborhoods; (3) A_{π,π} = A, where A_{π,π} is the resulting matrix by applying permutation π on the rows and columns of A simultaneously.
For illustration, consider a simplified case of the isomorphism where u and v have exactly the same connections. In this case, π only swaps u and v and remains the identity mapping for all other nodes. For such a pair (u, v), the diffusion process properties would be the same if we swapped the positions of u and v. Definition 4 is more general than this simplified case, as it also allows permutation of the first-order neighbors. Under this definition of isomorphism, the following theorem shows that we can use the MC samples from one node to make inference about its isomorphic nodes after applying the permutation.
Theorem 4.
Suppose u and v are first-order isomorphic under the permutation π, and let P = (p_1, ..., p_t) be a random diffusion path from source u. Define the permuted path π(P) = (π(p_1), π(p_2), ..., π(p_t)). Then π(P) has the same distribution as a random diffusion path from source v.
To share the MC samples of one node with its isomorphic nodes according to Theorem 4, we need an efficient algorithm to identify all isomorphic pairs and the corresponding permutations. Directly checking Definition 4 is costly. To speed up the computation, we identify necessary conditions for isomorphism in Proposition 2.
Proposition 2.
If u and v are first-order isomorphic, we must have d_u = d_v and D_u = D_v, where d_u and d_v are the degrees of u and v, and D_u and D_v are the degree sequences (sorted in ascending order) of their neighborhoods N_1(u) and N_1(v). Furthermore, u and v have the same second-order neighbor sets; that is, N_2(u) = N_2(v).
Based on Proposition 2, we can efficiently identify isomorphism using pre-screening steps, which turns out to significantly speed up our computation. Details of the algorithm are described by Algorithm 2 in Appendix B. With the isomorphic relations available, we can partition the nodes into isomorphic groups. Then MC sampling is only needed for one node in each group, and the MC samples can be shared within the group according to Theorem 4. Specifically, suppose P_1, ..., P_m are sampled from the diffusion process from u. If u and v are isomorphic with permutation π, we can use π(P_1), ..., π(P_m) as the MC samples of v in Algorithm 1.
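The pre-screening of Proposition 2 and the sample reuse of Theorem 4 can be sketched as follows (illustrative code; the check below covers only the degree-based necessary conditions and would be followed by the exact verification of Definition 4):

```python
def degree(adj, v):
    return len(adj[v])

def neighbor_degree_seq(adj, v):
    """Sorted degree sequence of v's first-order neighborhood."""
    return sorted(degree(adj, w) for w in adj[v])

def may_be_isomorphic(adj, u, v):
    """Cheap necessary conditions of Proposition 2, used as a pre-screen
    before the exact (and more expensive) isomorphism check."""
    return (degree(adj, u) == degree(adj, v)
            and neighbor_degree_seq(adj, u) == neighbor_degree_seq(adj, v))

def permute_path(path, pi):
    """Reuse MC samples across isomorphic nodes (Theorem 4): apply the
    permutation pi (a dict; nodes not listed map to themselves)."""
    return [pi.get(x, x) for x in path]
```

Nodes failing the pre-screen can be discarded immediately, so the expensive check runs only on a small set of candidate pairs.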
Remark 5.
Definition 4 can be extended to higher-order neighborhoods, identifying more isomorphic pairs. However, the complexity of identifying such pairs increases exponentially with the order of neighbors, which may overwhelm the saved time on the MC side. The first-order isomorphism turns out to give the most desirable tradeoff in terms of computational efficiency.
5 Experimental Studies
In this section, we evaluate our proposed methods on well-studied random network models. We generate networks from three random network models: random 4-regular trees, the preferential attachment model (Barabási & Albert, 1999), and the small-world (S-W) network model (Watts & Strogatz, 1998). In network science, the preferential attachment model is usually used to model the scale-free property conjectured by many to be ubiquitous in real-world networks (Barabási, 2013). The small-world property is believed to be prevalent in social networks (Watts & Strogatz, 1998). The network size is set to the size of a regular tree with degree 4 and depth 6, and the networks are sparse with average degree below 4. The Monte Carlo size is 4000. The source node in all settings is set to be the one with median eigen-centrality, to be more realistic. All reported results are averaged over 200 independent replications.

We first evaluate the single-point source estimation accuracy of the rumor center and the distance center (Shah & Zaman, 2011; Khim & Loh, 2016; Bubeck et al., 2017; Yu et al., 2018), as well as estimators using our proposed framework with the squared Euclidean and ADiT discrepancy functions. The results are shown in Table 1. Though the two estimators based on our framework are always better, the overall message from the table is not promising. All of the methods, including ours, give accuracy that is too low to be useful in applications. Such a negative result convincingly shows that the DSI problem is generally too difficult for the single-point estimation strategy, and that exploring the alternative confidence set inference is necessary.
|  | reg. tree | Pref. Att. | S-W |
| --- | --- | --- | --- |
| Rumor center | 0 | 0 | 0.08 |
| Dist. center | 0 | 0 | 0.07 |
| Euclidean (ours) | 0 | 0.04 | 0.20 |
| ADiT (ours) | 0 | 0.02 | 0.16 |
Table 2 shows the coverage rates of the confidence sets, with the squared Euclidean distance and the ADiT as the discrepancy functions. Notably, the proposed confidence set procedure delivers the desired coverage in all settings (up to the simulation error). Meanwhile, the size of the confidence set varies substantially depending on the network structure. For regular trees and scale-free networks, the ADiT works much better than the Euclidean distance, indicating that the diffusion order is informative in these network structures. For the small-world networks, the two are very similar. This may indicate that for well-connected networks, the diffusion order is less informative. In general, we believe the adaptivity of the ADiT-based confidence set is always preferable.
Table 2: Coverage rate and average size of confidence sets at the 90% and 80% nominal levels.

| | reg. tree | Pref. Att. | S-W |
|---|---|---|---|
| Euclidean-90% | 93% | 91% | 92% |
| size | 79.1 | 86.2 | 16.0 |
| ADiT-90% | 93% | 89% | 91% |
| size | 60.4 | 70.5 | 17.7 |
| Euclidean-80% | 85% | 84% | 83% |
| size | 68.6 | 71.9 | 10.5 |
| ADiT-80% | 84% | 82% | 81% |
| size | 50.0 | 57.5 | 11.3 |
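The construction behind Table 2 can be sketched schematically: for each candidate source, simulate diffusions of the observed length, rank the observed snapshot's average discrepancy against the simulated ones, and keep the candidate whenever its Monte Carlo p-value exceeds α. The code below is a deliberately simplified illustration, not the authors' exact implementation: it uses a basic SI diffusion and the symmetric-difference discrepancy (which equals the squared Euclidean distance on 0/1 infection vectors); all names are ours.

```python
import random

def simulate_si(adj, source, t, rng):
    """Simulate t infection steps of an SI diffusion from `source`.

    At each step, a uniformly random susceptible-infected edge fires
    and infects one new node.
    """
    infected = {source}
    for _ in range(t):
        frontier = [(u, v) for u in infected for v in adj[u] if v not in infected]
        if not frontier:
            break
        _, v = rng.choice(frontier)
        infected.add(v)
    return infected

def confidence_set(adj, observed, alpha, m=100, rng=None):
    """Candidate sources whose Monte Carlo p-value exceeds alpha (schematic)."""
    rng = rng or random.Random(0)
    t = len(observed) - 1  # infection steps after the source
    conf = []
    for s in observed:  # the source must itself be infected
        draws = [simulate_si(adj, s, t, rng) for _ in range(m)]
        def psi(x):  # average discrepancy of snapshot x to the MC sample
            return sum(len(x ^ d) for d in draws) / m
        obs_psi = psi(observed)
        pval = sum(psi(d) >= obs_psi for d in draws) / m
        if pval > alpha:
            conf.append(s)
    return conf

# Toy usage on a small tree-shaped graph.
g = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5, 6], 3: [1], 4: [1], 5: [2], 6: [2]}
obs = simulate_si(g, 0, 4, random.Random(1))
cs = confidence_set(g, obs, alpha=0.1, m=100, rng=random.Random(2))
```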
To obtain a comprehensive view of the tradeoff between the set size and confidence level, we show the relationship between the confidence set’s average size and the confidence level in Figure 2. The relation is slightly super-linear. In connection with the single-point estimation results, notice that for small-world networks, the confidence set at confidence level 20% has an average size of around 1. In contrast, the regular tree and preferential attachment networks are more difficult: to guarantee even 10% coverage, the average size of the confidence set is already about 5. These observations verify the results in Table 1 and support our argument that, in general, inferring the source by a single-point estimator is hopeless.
Table 3: Average timing (in seconds) of the sequential confidence set construction with different Monte Carlo strategies.

| | reg. tree | Pref. Att. | S-W |
|---|---|---|---|
| Vanilla MC | 898.3 | 1060.5 | 1052.9 |
| Import. Sampl. | 602.3 | 462.1 | 1057.5 |
| Isomorphism | 617.8 | 516.8 | 1059.4 |
| Both | 411.9 | 430.9 | 1057.7 |
Finally, we evaluate the timing improvements achieved by the pooled MC strategies. As expected, the power of the pooled MC strategies depends on the network structure. The timing comparison is included in Table 3; we report only the sequential version of our method for a fair comparison with the rumor center. As can be seen, with both pooled MC strategies in use, we reduce the timing by about 60% for the tree structure and the preferential attachment networks, but the effect on small-world networks is negligible. In applications, network structures tend to exhibit core-periphery patterns (Miao & Li, 2021), for which the pooled MC strategies are expected to be effective. For reference, the average timing for finding the rumor center is about 1.25 seconds. However, with the extra computational cost, our method provides confidence sets at any specified level, with guaranteed accuracy for any network structure. We believe this is generally a wise tradeoff.
Meanwhile, notice that our inference procedure can be parallelized. We give a parallel algorithm in the appendix (see Algorithm 3 in Appendix D). It needs MC sampling for only one node in each group, and the calculations for the other nodes can be done using the pooled MC methods. Table 4 includes the timing results of the parallel implementation based on 20 cores, in the same settings as Table 3. With 20 cores, constructing a confidence set takes less than 30 seconds in the cases where the pooled MC methods are effective.
Table 4: Average timing (in seconds) of the parallel implementation with 20 cores.

| | reg. tree | Pref. Att. | S-W |
|---|---|---|---|
| Vanilla MC | 59.3 | 61.9 | 71.5 |
| Import. Sampl. | 33.4 | 32.9 | 73.3 |
| Isomorphism | 44.0 | 33.4 | 73.4 |
| Both | 26.3 | 28.4 | 72.0 |
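A minimal illustration of the scheme behind Algorithm 3: candidate nodes are partitioned into groups, the expensive Monte Carlo step runs concurrently for one representative per group, and the remaining members reuse the representative's samples through the pooled MC methods. The sketch below parallelizes the per-representative work with a thread pool; all names are illustrative, the p-value routine is a deterministic stand-in, and the pooled MC reweighting is stubbed out.

```python
from concurrent.futures import ThreadPoolExecutor

def mc_pvalue(node):
    """Placeholder for the expensive Monte Carlo p-value of one node."""
    # In the real algorithm this runs the diffusion sampler; here we
    # return a deterministic stand-in value for illustration.
    return (node * 37 % 100) / 100.0

def parallel_pvalues(groups, workers=4):
    """Compute p-values for one representative per group in parallel.

    `groups` maps a representative node to the list of nodes that can
    reuse its Monte Carlo samples (e.g., isomorphic nodes).
    """
    reps = list(groups)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        rep_pvals = dict(zip(reps, pool.map(mc_pvalue, reps)))
    pvals = {}
    for rep, members in groups.items():
        for v in members:
            # Stub: members inherit the representative's value; the real
            # pooled MC step would apply the appropriate reweighting.
            pvals[v] = rep_pvals[rep]
    return pvals

groups = {0: [0, 3, 4], 1: [1, 5], 2: [2]}
result = parallel_pvalues(groups)
```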
6 Summary
We have introduced a statistical inference framework for diffusion source identification on networks. Compared with previous methods, our framework is more general and offers salient insights into the problem. More importantly, within this framework, we can construct the confidence set for the source node in a natural and principled way such that the success rate is guaranteed on any network structure. To our knowledge, our method is the first DSI method with theoretical guarantees for general network structures. We also propose efficient computational strategies that are potentially useful in other problems as well.
Acknowledgements
The work is supported by the Quantitative Collaborative Award from the College of Arts and Sciences at the University of Virginia. T. Li is supported in part by the NSF grant DMS-2015298. H. Xu is supported by a GIDI award from the UVA Global Infectious Diseases Institute.
References
- Anderson & May (1992) Anderson, R. M. and May, R. M. Infectious diseases of humans: dynamics and control. Oxford university press, 1992.
- Barabási (2013) Barabási, A.-L. Network science. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1987):20120375, 2013.
- Barabási & Albert (1999) Barabási, A.-L. and Albert, R. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
- Besag & Clifford (1989) Besag, J. and Clifford, P. Generalized monte carlo significance tests. Biometrika, 76(4):633–642, 1989.
- Bringmann & Panagiotou (2017) Bringmann, K. and Panagiotou, K. Efficient sampling methods for discrete distributions. Algorithmica, 79(2):484–508, 2017.
- Bubeck et al. (2017) Bubeck, S., Devroye, L., and Lugosi, G. Finding adam in random growing trees. Random Structures & Algorithms, 50(2):158–172, 2017.
- Crane & Xu (2020) Crane, H. and Xu, M. Inference on the history of a randomly growing tree. arXiv preprint arXiv:2005.08794, 2020.
- Dong et al. (2013) Dong, W., Zhang, W., and Tan, C. W. Rooting out the rumor culprit from suspects. In 2013 IEEE International Symposium on Information Theory, pp. 2671–2675. IEEE, 2013.
- Halperin & Almogy (2002) Halperin, A. and Almogy, G. System and method of virus containment in computer networks, December 19 2002. US Patent App. 10/058,809.
- Hu et al. (2020) Hu, T., Guan, W. W., Zhu, X., Shao, Y., Liu, L., Du, J., Liu, H., Zhou, H., Wang, J., She, B., et al. Building an open resources repository for covid-19 research. Data and Information Management, 4(3):130–147, 2020.
- Jockel (1986) Jockel, K.-H. Finite sample properties and asymptotic efficiency of monte carlo tests. The Annals of Statistics, pp. 336–347, 1986.
- Kazemitabar & Amini (2020) Kazemitabar, S. J. and Amini, A. A. Approximate identification of the optimal epidemic source in complex networks. In International Conference on Network Science, pp. 107–125. Springer, 2020.
- Khim & Loh (2016) Khim, J. and Loh, P.-L. Confidence sets for the source of a diffusion in regular trees. IEEE Transactions on Network Science and Engineering, 4(1):27–40, 2016.
- Lab (2020) Lab, C. D. Baidu Mobility Data, 2020. URL https://doi.org/10.7910/DVN/FAEZIO.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.
- L’Ecuyer & Owen (2009) L’Ecuyer, P. and Owen, A. B. Monte Carlo and Quasi-Monte Carlo Methods 2008. Springer, 2009.
- Lehmann & Romano (2006) Lehmann, E. L. and Romano, J. P. Testing statistical hypotheses. Springer Science & Business Media, 2006.
- Miao & Li (2021) Miao, R. and Li, T. Informative core identification in complex networks. arXiv preprint arXiv:2101.06388, 2021.
- Newman et al. (2002) Newman, M. E., Forrest, S., and Balthrop, J. Email networks and the spread of computer viruses. Physical Review E, 66(3):035101, 2002.
- Nguyen et al. (2016) Nguyen, H. T., Ghosh, P., Mayo, M. L., and Dinh, T. N. Multiple infection sources identification with provable guarantees. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 1663–1672, 2016.
- Shah & Zaman (2011) Shah, D. and Zaman, T. Rumors in a network: Who’s the culprit? IEEE Transactions on Information Theory, 57(8):5163–5181, 2011.
- Shah & Zaman (2012) Shah, D. and Zaman, T. Finding rumor sources on random graphs. 2012.
- Shapiro & Delgado-Eckert (2012) Shapiro, M. and Delgado-Eckert, E. Finding the probability of infection in an SIR network is NP-hard. Mathematical Biosciences, 240(2):77–84, 2012.
- Valiant (1984) Valiant, L. G. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
- Vosoughi et al. (2018) Vosoughi, S., Roy, D., and Aral, S. The spread of true and false news online. Science, 359(6380):1146–1151, 2018.
- Watts & Strogatz (1998) Watts, D. J. and Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature, 393(6684):440–442, 1998.
- Xu & Ren (2016) Xu, Y. and Ren, J. Propagation effect of a virus outbreak on a network with limited anti-virus ability. PloS one, 11(10):e0164415, 2016.
- Yu et al. (2018) Yu, P.-D., Tan, C. W., and Fu, H.-L. Rumor source detection in finite graphs with boundary effects by message-passing algorithms. In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pp. 175–192. Springer, 2018.
Appendix A Proofs
A.1 Proof of Theorem 1
Define . Notice that can be seen as a generalized definition of the right quantile of the distribution of the random variable , where is a random infection status of the network generated by the diffusion process starting from .
Now assume is a random infection status from the diffusion process from . According to the definition of the p-value, we have
Note that since is the true source node, and follow exactly the same distribution; thus
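The coverage argument can be checked numerically on a toy discrete distribution: when the p-value is defined as a right-tail probability, it is stochastically no smaller than uniform, so the rejection probability at level α never exceeds α and the confidence set covers the true source with the desired probability. A small exhaustive check (the distribution below is made up purely for illustration):

```python
# A toy distribution over discrepancy values and its right-tail p-value.
pmf = {0: 0.5, 1: 0.3, 2: 0.15, 3: 0.05}

def pvalue(x):
    """Right-tail p-value: probability of a discrepancy at least x."""
    return sum(p for y, p in pmf.items() if y >= x)

# Exhaustively verify super-uniformity: P(pvalue(X) <= alpha) <= alpha.
for alpha in [0.01, 0.05, 0.1, 0.2, 0.5, 0.9]:
    reject_prob = sum(p for x, p in pmf.items() if pvalue(x) <= alpha)
    assert reject_prob <= alpha + 1e-12
```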
A.2 Proof of Theorem 2
Given a path generated by the diffusion process starting from containing , denoted by
we match it to the path defined as
We start from the probability mass of starting from . By using the Markov property, we have
(12)
In contrast, for the path , we have
(13)
Notice that the conditional probability only depends on the infection status before the th infection and is invariant to the infection order. This property indicates that all terms after the th (in the second rows) of (12) and (13) are equal.
Next, we compare the terms in the first line of each of (12) and (13). Notice that for each , the term is uniform over the available connections given the infected nodes , while the term is uniform over all available edges given the infected nodes . The only difference between the two infected sets is . Since has only one connection to , at each point the number of available infecting edges is one more in the former case. Therefore, we have
A.3 Proof of Theorem 3
Since ’s are a random sample from , under the current assumption, by the strong law of large numbers, we have
Notice that is a surjection. Therefore, the term can be rewritten as
A.4 Proof of Corollary 1
A.5 Proof of Theorem 4
Notice that, given , the probability mass function of a diffusion path only depends on the network and the source node. Condition 1 of Definition 4 indicates that starts from . Condition 2 and condition 3 of Definition 4 together indicate that is also a valid diffusion path. Condition 3, in particular, indicates that
Now define to be the inverse of . For any from , for the same reason, is also a valid diffusion path starting from . In particular, we have . Therefore, has the same sample space and probability mass function as the random diffusion path from .
Appendix B Details for Isomorphism Identification in Section 4.2
Based on the properties in Proposition 2, Algorithm 2 finds all isomorphic pairs and the permutations in the network. Next, we provide a proof of Proposition 2.
Proof of Proposition 2.
because gives a 1-1 mapping from to . Furthermore, we have . By condition 3 of Definition 4, we also have .
For the last one, we prove by contradiction. Suppose there exists a node such that but . Since while , applying the permutation to the network leaves disconnected from . This contradicts condition 3 of Definition 4. ∎
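In a common special case of Proposition 2, degree-one nodes attached to the same neighbor are isomorphic: the transposition swapping two such leaves (and fixing all other nodes) maps the network onto itself and satisfies the conditions of Definition 4. A small sketch of grouping such exchangeable leaves; the function name and examples are ours:

```python
from collections import defaultdict

def exchangeable_leaf_groups(adj):
    """Group degree-one nodes by their unique neighbor.

    Leaves in the same group are isomorphic: swapping any two of them
    is a graph automorphism, so they can share Monte Carlo samples in
    the confidence set construction.
    """
    groups = defaultdict(list)
    for v, nbrs in adj.items():
        if len(nbrs) == 1:
            (parent,) = tuple(nbrs)
            groups[parent].append(v)
    return {p: sorted(ls) for p, ls in groups.items() if len(ls) > 1}

# Star graph: all leaves hang off the hub and form one exchangeable group.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
print(exchangeable_leaf_groups(star))  # {0: [1, 2, 3, 4]}
```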
Appendix C Loss Function Computation Acceleration for Surjective Importance Sampling
The calculation strategy for canonical discrepancy functions can also be further generalized to the weighted averaging scenario used for the single-degree nodes in Section 4.1. Specifically, there we need to calculate terms like
Therefore, to use this strategy in Section 4.1, when generating MC samples from , in addition to caching , we also want to cache the matrix adjusted by the factor
Appendix D Parallel Algorithm for Confidence Set Construction
As discussed in Section 3.3, our confidence set construction algorithm can be implemented in parallel, further boosting its speed. In the main paper, we only include the details and timing for the sequential version. The parallelized algorithm is described in Algorithm 3.