
Group Testing with Correlation under Edge-Faulty Graphs

Hesam Nikpey    Jungyeol Kim    Xingran Chen    Saswati Sarkar    Shirin Saeedi Bidokhti
Email: {hesam,jungyeol,xingranc,swati,saeedi}@seas.upenn.edu
Abstract

In applications of group testing in networks, e.g. identifying individuals who are infected by a disease spread over a network, exploiting correlation among network nodes provides fundamental opportunities in reducing the number of tests needed. We model and analyze group testing on $n$ correlated nodes whose interactions are specified by a graph $G$. We model correlation through an edge-faulty random graph formed from $G$ in which each edge is dropped with probability $1-r$, and all nodes in the same component have the same state.

We consider three classes of graphs: cycles and trees, $d$-regular graphs, and stochastic block models (SBM), and obtain lower and upper bounds on the number of tests needed to identify the defective nodes. Our results are expressed in terms of the number of tests needed when the nodes are independent, as functions of $n$, $r$, and the target error. In particular, we quantify the fundamental improvements that exploiting correlation offers by the ratio between the total number of nodes $n$ and the equivalent number of independent nodes in a classic group testing algorithm.

The lower bounds are derived by illustrating a strong dependence of the number of tests needed on the expected number of components. In this regard, we establish a new approximation for the distribution of component sizes in “$d$-regular trees” which may be of independent interest and leads to a lower bound on the expected number of components in $d$-regular graphs.

The upper bounds are found by forming dense subgraphs in which nodes are more likely to be in the same state. When $G$ is a cycle or tree, we show an improvement by a factor of $\log(1/r)$. For the grid, a graph with almost $2n$ edges, the improvement is by a factor of $(1-r)\log(1/r)$, indicating a drastic improvement compared to trees. When $G$ has a larger number of edges, as in the SBM, the improvement can scale in $n$.

1 Introduction

Group testing [DHH00] is a well studied problem at the intersection of many fields, including computer science [Dor43, CHKV11, COGHKL20, CGM21, CTB+20], information theory [AJS19, BJ21, GH08] and computational biology [KM95, FKKM97]. The goal is to find an unknown subset of $n$ items that are different from the rest using the least number of tests. The target subset is often referred to as defective, corrupted or infected, depending on the field of study. In this work, we use the term defective. To find the subset of defectives, items are tested in groups. The result of a test is positive if and only if at least one item in the group is defective.

Over the years, this problem has been formulated via two approaches: the combinatorial approach and the information theoretic approach. In the “combinatorial” version of the problem, it is assumed that there are $d$ defective items that are to be detected with zero error [DHH00]. Using adaptive group testing (i.e., when the choice of whom to test next depends on the results of the previous tests), there are matching upper and lower bounds on the number of tests of the form $d\log n+O(d)$ [DHH00]. Using non-adaptive group testing (i.e., when the testing sequence is pre-determined), there is an upper bound of $O(d^{2}\log(n/d))$ and an almost matching lower bound of $\Omega(\frac{d^{2}\log n}{\log d})$. The “information theoretic” approach, on the other hand, assumes a prior statistic on the defectiveness of items, i.e., item $i$ is assumed to be defective with probability $p_{i}$. The aim in this case is to identify the defective set with high probability [LCH+14]. Roughly speaking, there is a lower bound in terms of the underlying entropy of the unknowns, and an almost matching upper bound within a $\log n$ factor of the lower bound.

In most existing works, it is assumed that the states of the items, whether or not they are defective, are independent of each other, which is not realistic in many applications. Group testing, for example, can identify the infected individuals using fewer tests, and therefore in a more timely manner, than individual testing during the spread of an infectious disease (e.g., COVID-19) [BMR21, VFH+21, MNB+21, Ald20, GG20]. But the infection states of individuals are in general correlated, with correlation levels ranging from high to low, depending on how close they live: same household (high), same neighborhood, same city, same country (low). Correlation levels also depend on other factors such as frequency of contact and the number of paths between the individuals in the network of interactions. We elaborate on this further in Section 1.1. Another example is the multiaccess channel problem: here a number of users want to communicate through a common channel and we want to assign time slots to them to avoid any conflicts. Before assigning, we aim to find the number of active users that want to send a message. Using group testing, we can identify the number of active users faster by asking a subset of users if any of them is active [BMTW84, Chl01, GH08]. But again, nodes are often not independent. Generally, some subset of users might communicate among themselves more often and hence be more correlated. With this motivation, we aim to model such correlation, design group testing techniques that exploit it, and quantify the gain they provide in reducing the number of tests needed.

The closest to our work are [ACO21, AU21, NSG+21, NSFD21], where specific correlation models are considered and group testing methods are designed and analyzed. In [ACO21], the authors consider correlation that is imposed by a one-day spread of an infectious disease in a clustered network modeled by a stochastic block model. Each node is initially defective (infected) with some probability and in the next time step, its neighbors become defective probabilistically. The authors provide a simple adaptive algorithm and prove optimality in some regimes of operation and under some assumptions. In [AU21], the authors model correlation through a random edge-faulty graph $G$. Each edge $e$ is realized in the graph with a given probability $r_{e}$. So depending on how the graph is realized, it is partitioned into some connected components. Each connected component is assumed defective with probability $p$ (in which case, all the nodes in that component are defective) and otherwise non-defective with probability $1-p$. The authors focus only on a subset of the realizations by studying the case in which the set of connected components across realizations forms a particular nested structure. More specifically, they only consider a subset of realizations such that each realization is instantiated from another realization in the subset, i.e., two realizations have the same components except one component that is partitioned into two components. Only one realization does not obey this rule, the one with the least number of connected components. They find a near-optimal non-adaptive testing strategy for any distribution over the considered realizations of the graph and show optimality for specific graphs.

The correlation model we consider is close to the work of [AU21]. We consider a random (edge-faulty) graph $G$ where each edge is realized with probability $r$. In a realized graph, each connected component is assumed defective with probability $p$. As opposed to [AU21], we do not constrain our study to a subset of the realizations and instead consider all possible realizations of the graph $G$. Despite its simplicity, our model captures two key features. First, given a network of nodes, one expects that when a path between two nodes gets shorter, the probability that they are in the same state increases. Our proposed model captures this intuition: by adding one edge to a path, the probability of being in the same state reduces by a factor of $r$. Second, two nodes might have long distances from each other, but when there are many edge-distinct paths between them, it is more likely that they are in the same state. Under our model, adding distinct paths between two nodes increases the probability that they are in the same state.

In other related works, a graph could represent potential constraints on testing (among independent nodes) [CKMS12, SW18]. This can be viewed as a complementary direction to our setting, in which a graph models the inherent dependency of the nodes but there is no constraint on how the groups can be formed for testing. In [CKMS12], the authors design optimal non-adaptive testing strategies when each group is constrained to be path connected. In particular, they use random walks to obtain the pool of tests. In a follow-up work, [SW18] shows that either the constraints are too strong and no algorithm can do better than testing most of the nodes, or optimal algorithms can be designed matching the unconstrained version of the problem. This is attained by sampling each edge with probability $r$ ($0<r<1$, optimized). If a component is large enough, the algorithm tests the entire component. Our approach in this paper has similarities with [SW18] in aiming to find parts of the graph that are large and connected enough so that they remain connected with a decent probability after realizing the edges, but our techniques to find the dense subgraphs and the corresponding analysis are different.

1.1 Our Model

We start by motivating the key attributes that we capture in our model through an example of testing for infections in a network of people (nodes). Consider the interaction network for the spread of an infectious disease (e.g., COVID-19): there is an edge between two nodes if the corresponding individuals are in physical proximity for a minimum amount of time each week. Such individuals are more likely to be in the same state than those who have been distant throughout. Thus, first, the probability of being in the same state decreases as the length of paths (i.e., distance in the interaction network) between nodes increases. Second, infection is more likely to spread from one node to another if there are many distinct paths between them. Thus, the probability that two nodes are in the same state increases with the number of distinct paths between them.

We capture correlation through an edge-faulty graph model. Consider a graph $G=(V,E)$ where the node set $V$, $|V|=n$, represents the items and the edge set $E$ represents connections/correlations between them. Suppose each edge is realized with probability $0\leq r\leq 1$. After the sampling, we have a random graph that we denote by $G_{r}$. Each node is either defective or non-defective. All nodes in the same component of $G_{r}$ are in the same state, rendering defectiveness a component property. We consider that each component is defective with probability (w.p.) $p$ independent of others. As an example, consider graph $G$ with five nodes and eight edges, and a sampled graph realization $G_{r}$ as shown in Figure 1 (left) and Figure 1 (right) respectively. When $r=1/3$, $G_{r}$ is realized w.p. $(\frac{1}{3})^{3}(\frac{2}{3})^{5}$. There are two components in $G_{r}$, namely $\{v_{1},v_{4},v_{5}\}$ and $\{v_{2},v_{3}\}$; $v_{1},v_{4},v_{5}$ are in the same state, which is defective w.p. $p$, independent of the state of $v_{2},v_{3}$.
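The sampling model is simple to simulate. The following sketch (plain Python with a small union-find; the function names and interface are ours, not from any reference implementation) samples $G_{r}$ from an edge list and assigns a common random state to each resulting component.

```python
import random

def sample_states(n, edges, r, p, seed=None):
    """Sample G_r from G = (range(n), edges) and assign component states.

    Each edge survives independently with probability r; every connected
    component of the realized graph is defective with probability p.
    Returns a list of booleans, one per node (True = defective).
    """
    rng = random.Random(seed)
    parent = list(range(n))  # union-find over the nodes

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        if rng.random() < r:  # edge survives in G_r
            parent[find(u)] = find(v)

    # One defectiveness draw per component root, shared by all its members.
    state = {}
    return [state.setdefault(find(v), rng.random() < p) for v in range(n)]
```

For the example of Figure 1, one would pass $n=5$ and the eight edges of $G$; with $r=1/3$, each realization such as the one shown arises w.p. $(\frac{1}{3})^{3}(\frac{2}{3})^{5}$.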

Two nodes are guaranteed to be in the same state in $G_{r}$ if there exists at least one path that connects them in $G_{r}$. The probability that a path in $G$ survives in $G_{r}$ increases with $r$. Thus both the parameter $r$ and the graph $G$ determine the correlation between states of different nodes; the correlation is higher if $r$ is higher: the states of all nodes are independent for $r=0$, while the correlation is the highest possible for a given $G$ when $r=1$.

This model importantly captures the two attributes we discussed. Clearly, a long path between two nodes in $G$ has a smaller chance of survival in $G_{r}$ compared to a short path, making the end nodes less likely to be in the same state as the length of the path between them in $G$ increases. Moreover, the probability that at least one path between two nodes survives in $G_{r}$ increases with the number of distinct paths between them in $G$, so having distinct paths between a pair of nodes in $G$ makes them more likely to be in the same state.

We aim to find the minimum expected number of tests needed to find the defective items with at most $\epsilon n$ errors, where $\epsilon$ can potentially be of order $o(1)$. To be precise, let $\#\mathrm{ERR}(H)$ be the number of nodes mispredicted by an algorithm on graph $H$. We then require

$$\mathbb{E}_{H\sim G_{r}}[\#\mathrm{ERR}(H)]\leq\epsilon n \qquad (1)$$

where the expectation is taken over $G_{r}$ and possible randomization of the algorithm. We refer to this as the average error.

Remark 1.1.

The definition of error in classic probabilistic group testing such as [LCH+14] is a stronger notion of error probability where the goal is to correctly predict all nodes with probability $1-\epsilon$, and with probability $\epsilon$ one or more nodes are mispredicted. This is stronger than our definition of average error in (1) because, with probability $\epsilon$, at most $n$ nodes are mispredicted in classic group testing, so the average error would be less than $\epsilon n$, the allowed error in our model.

We mostly work with the notion of average error in this paper. In the last section (Section 5), we consider a stronger notion of error to limit the maximum error: the group testing schemes now need to upper bound the number of mispredicted nodes by $\epsilon n$ with high probability. We recover all the results of the paper for this stronger constraint on error as well.

Methodologically, we relate the problem of group testing with correlation to that in a network with fewer nodes in which the states of nodes are independent. We obtain bounds on the number of group tests required in the former, to satisfy the constraints we consider on errors, in terms of known bounds in the latter. The relative quantification provides a basis for comparison and determination of the improvements that can be obtained by exploiting correlation.

Figure 1: Graph $G$ (a) before realization and (b) after a realization with 3 edges.

The tests cannot be designed with the knowledge of $G_{r}$; only the value of $r$ is known a priori. In the extreme case of $r=0$, the problem is reduced to the classic group testing with $|V|=n$ independent nodes. In the extreme case of $r=1$, all components of $G$ remain connected and have the same state, and hence the problem is reduced to testing a single node. When $0<r<1$, the problem is non-trivial, because there can be multiple components, some with more than one node, and the number and composition of the components are a priori unknown. Thus, it is not known a priori which nodes will be in the same state. Our group testing strategies will seek to circumvent this challenge by identifying parts of $G$ that are connected enough so that they remain connected in $G_{r}$ with a high probability.

1.2 Contributions

We obtain upper and lower bounds on the number of group tests needed to determine the states, defective or otherwise, of individual nodes in a large class of interaction graphs in the presence of correlation among the states of nodes. We progressively consider 1) cycles and trees (about $n$ links), 2) $d$-regular graphs (about $dn/2$ links) and 3) stochastic block models or SBM (with potentially $\Theta(n^{2})$ links). The correlation is captured by the factor $r$ (see Section 1.1). The bounds are obtained in terms of the number of tests needed when the states are independent, and help us quantify the efficiency brought forth by group testing in terms of $r$. In particular, our group testing strategies exploit correlation and build upon classical group testing strategies, but with a smaller number of nodes. The ratio between this number and the total number of nodes ($n$) determines the benefits of correlation in our strategies, and we refer to it as the (multiplicative) improvement factor. Note that this is not the ratio between the numbers of tests that are needed, but the ratio between the numbers of nodes that a classical group testing algorithm gets as input. As such, our results are valid for any group testing algorithm, and they can be translated to the ratio between the total numbers of tests, accordingly, as needed.

For trees and cycles, we prove an upper bound on the optimal number of tests in terms of the number of group tests when there are $n\log(1/r)$ independent nodes. Note that one can trivially determine the states of each node by disregarding correlation and testing among $n$ nodes (e.g. using classic group testing techniques). Our upper bound therefore shows that group testing can reduce the number of tests. The (multiplicative) improvement factor of $\log(1/r)$ is meaningful (less than 1) when $r>1/2$. As $r$ approaches $1$, the multiplicative factor reduces even further, implying even greater benefits due to group testing. Our lower bound, on the other hand, shows an improvement factor of $(1-r)$.

For $d$-regular graphs, we prove new results for the distribution of components. This leads to a lower bound that is expressed as a sum series depending on $r$ and $n$. We further prove an upper bound for a specific $4$-regular graph, namely the grid, in terms of the number of group tests when there are $n(1-r)\log(1/r)$ independent items. Thus, the improvement factor is $(1-r)\log(1/r)$, as opposed to only $\log(1/r)$ for trees; this hints that group testing gets drastically more efficient for denser graphs.

The stochastic block model divides the network into communities such that nodes in the same community are more connected than nodes in different communities. We show that the reduction in the test count due to group testing can be classified into three regimes: 1) strong intra-community connectivity but sparse inter-community connectivity, which reduces the effective number of independent nodes to the number of communities, 2) strong connectivity leading to an (almost) fully connected graph, in which case all nodes have the same state and one independent test is representative, 3) sparse connectivity leading to many isolated nodes, in which case the states of all nodes are independent. The first case reduces to independent group testing with the number of nodes equal to the number of communities, the second regime needs a constant number of tests, and the third regime reduces to independent group testing with $n$ nodes. The tight upper and lower bounds that are known in the literature for independent group testing subsequently apply for the first and third cases; for the second case, the analysis is rather simple as only a constant number of tests is needed.

1.3 Our Methods and Ideas

We now briefly describe the mathematical techniques that we follow to obtain the bounds. The techniques constitute a contribution in themselves, as a graphical structure was not investigated earlier for group testing except under significant restrictions as described earlier. For the upper bound for a cycle, we divide the cycle $G$ into subgraphs of size $l$ ($l$ nodes), where $l$ is a parameter. The subgraphs are connected in $G$, but need not be connected in $G_{r}$, which we do not know a priori. For every subgraph, we select a node which we consider as representative of the subgraph and determine the states of the representatives of all the subgraphs using group testing strategies deployed when states of nodes are independent (i.e., we do not exploit possible correlation between the representatives). We consider the state of each node in each subgraph as that of the representative; this is indeed the case if each subgraph is connected, otherwise the states of some nodes are determined in error. The probability of each subgraph being connected decreases with increase in $l$; thus the expected number of errors, which can be computed as a function of $r,l,n$, increases with increase in $l$. The number of representatives, and therefore the number of nodes subjected to the group tests described above, is $n/l$. Thus the number of group tests is non-increasing with $l$. Thus $l$ represents a tradeoff between the expected number of errors and the number of group tests, and $l$ is selected appropriately to ensure a low number of group tests subject to ensuring that the expected number of errors does not exceed the specified limit. The number of group tests for the $l$ that satisfies the specified error constraints provides the upper bound on the number of tests for a cycle.

The upper bound for an arbitrary tree can be obtained similarly, with the additional significant complication that for an arbitrary tree $G$ and an $l$ that satisfies constraints on error, one may not be able to obtain subgraphs in $G$ of size $l$ that are connected in $G$ (in contrast, when $G$ is a cycle, a path of size $l$ constitutes such a subgraph). Refer to Figure 2 for an illustration of this challenge. We get around this challenge by using subgraphs that are not connected in $G$ by themselves, but become connected in $G$ through at most $l$ additional nodes of $G$ (which are not in the subgraph). Construction of such subgraphs is not a priori clear and constitutes an innovation needed for upper bounding the number of tests needed for trees, above and beyond the overall methodology. Each such subgraph is connected in $G_{r}$ if the links in $G$ among these $2l$ nodes survive in $G_{r}$; the probability of this event can again be expressed as a function of $l,r$, and as before this probability decreases with increase in $l$. The rest of the methodology is similar to that for cycles.

The overall methodology for obtaining the upper bound for grids is the same as that for cycles. The subgraphs in question constitute sub-grids of $l$ nodes. We determine the probability that each such sub-grid is connected in $G_{r}$ through a recursive decomposition which is not a priori obvious.

The characterization of the lower bound on the number of tests needed for cycles or trees constitutes another innovation. To obtain the lower bound, one can assume the knowledge of the components of $G_{r}$, which one does not know in reality. Nodes in each component of $G_{r}$ have the same state, which is independent of the states of nodes in other components. Thus each component can be considered a super-node, and the states of the super-nodes are independent of each other. Thus, one needs at least as many tests as when there are $C(G_{r})$ nodes whose states are independent, where $C(G_{r})$ is the number of components of $G_{r}$. A lower bound can now be obtained if the random variable $C(G_{r})$ can be bounded. We accomplish this objective by observing that the number of components in a stochastic graph constitutes an edge-exposure martingale and the value of this random variable is concentrated around its expectation, courtesy of Azuma’s inequality which holds for such martingales.

Figure 2: A star cannot be partitioned into trees of size $l$ for any $l>2$. This is because any partition of a star of $n$ nodes into smaller trees consists of a star of $m$ nodes, for an $m$ of our choice, and $n-m$ isolated nodes. The figure shows an example where $n=10$, $m=5$. Nodes $C,1,2,3,4$ constitute the star, and nodes $5,\ldots,9$ constitute isolated nodes. We can however partition the star into subgraphs of size $l$ (i.e., having $l$ nodes) each of which can be connected through the central node even when that node does not belong to the subgraph. The figure shows such a partition with $l=5$. The subgraph consisting of nodes $C,1,2,3,4$ is connected; the remaining $5$ nodes constitute a subgraph that can be connected through the central node.

We initially present the above results for constraints on the expected number of errors. We subsequently generalize them, for cycles and trees, to a stronger constraint on the number of errors that must hold with high probability (rather than in expectation). Lower bounds can be obtained following the same methodology as before, except that lower bounds on the number of tests needed for independent nodes under this stronger constraint do not exist in the literature. We derive the latter lower bound first, adapting some existing proof techniques relying on information-theoretic inequalities. We show that the lower bounds can be improved for some specific structures of $G$, such as when $G$ is a star. This stronger notion of error allows us to include the structure of $G$ more heavily and directly in this proof, rather than going via an analysis of the number of components; we resorted to the latter in obtaining the lower bounds for cycles and trees under the weaker constraint on errors. The upper bound can be obtained following the same broad structure, though some additional technical challenges arise. The number of errors over all subgraphs can be bounded with high probability using Hoeffding’s inequality if the numbers of errors in different subgraphs are independent. This happens for cycles, where the subgraphs are non-overlapping paths and the events that they remain connected in $G_{r}$ are independent. Nonetheless, this does not happen for trees, because nodes in one subgraph may be connected through those in other subgraphs. We surmount this technical challenge by invoking an innovative exposure of the nodes of the tree such that the number of connected subgraphs constitutes a node-exposure martingale which satisfies the requisite Lipschitz condition for Azuma’s inequality to hold. The high probability bound on the number of errors over all subgraphs now follows via Azuma’s inequality.

2 Preliminaries and Notations

We use the following notations for the rest of the paper. Let $\textsc{CrltOPT}(G,r,p,\epsilon)$ be the expected number of tests in an optimal algorithm on graph $G$ with correlation parameter $r$, probability of defectiveness $p$, and an error of $\epsilon n$. Let $\textsc{IndepOPT}(n,p,\epsilon)$ be the minimum expected number of tests needed for $n$ items in order to find the defective set with error probability at most $\epsilon$, where each item is independently defective with probability $p$. Notice that $\textsc{IndepOPT}(n,p,\epsilon)$ does not depend on $G$. Note that group testing is potentially beneficial when the number of defective items is $o(n)$. It is often assumed that the (expected) number of defective items is $n^{\alpha}$, $\alpha<1$, so from now on we assume $p=o(1)$. It is also noteworthy that the definitions of error in IndepOPT and CrltOPT are different, as mentioned in Remark 1.1. When clear from the context, we may drop $p,r,\epsilon$ from the notations. We write $A\simeq f(n)$ if $A=f(n)+o(f(n))$. We write $A\lesssim f(n)$ when $A\leq f(n)\pm o(f(n))$.

The following lemma provides a lower bound on CrltOPT in terms of IndepOPT, the minimum number of group tests needed to discover the defective set among independent nodes.

Lemma 2.1.

Let $C(G_{r})$ be the number of connected components of $G_{r}$. Then

$$\textsc{IndepOPT}(C(G_{r}),p,\epsilon n)\leq\textsc{CrltOPT}(G,r,p,\epsilon). \qquad (2)$$
Proof.

Consider the idealized scenario in which one knows the components of $G_{r}$. Suppose that we have $C(G_{r})$ components. All nodes in the same component are in the same state, and the states of nodes in different components are independent. Thus, each component can be replaced by one node, and the minimum number of tests is that needed to test a graph with $C(G_{r})$ independent nodes such that the expected number of errors is at most $\epsilon n$. This number corresponds to $\textsc{IndepOPT}(C(G_{r}),p,\gamma)$ for some $\gamma$ such that the expected number of errors is at most $\epsilon n$. A classical group testing algorithm with $\textsc{IndepOPT}(C(G_{r}),p,\gamma)$ tests may make (at least) an error in finding one node’s state with probability $\gamma$; thus the expected number of errors is at least $\gamma$. This implies that we need $\gamma\leq\epsilon n$. Clearly $\textsc{IndepOPT}(C(G_{r}),p,\gamma)$ is a non-increasing function of $\gamma$. Thus at least $\textsc{IndepOPT}(C(G_{r}),p,\epsilon n)$ tests are needed when one knows the components of $G_{r}$. If one does not know the components, the expected number of tests can only increase. The lemma follows. ∎

Remark 2.2.

At first glance, the bound may come across as counter-intuitive because the number of tests in a graph in which the states of nodes are independent provides a lower bound on that when the states are correlated. The apparent contradiction is resolved when we note that the lower bound is obtained in terms of the number of tests in a graph with fewer nodes than the original: the number of components in $G_{r}$ instead of the number of nodes in $G_{r}$.

Lemma 2.1 illustrates a connection between the number of connected components and the minimum number of tests CrltOPT. We next discuss concentration lemmas for a class of graph theoretic functions, including the number of connected components. With these concentration results, we will be able to replace $C(G_{r})$ by its expectation in (2).

2.1 Concentration Results

Definition 2.3.

[AS16] A graph theoretic function $f$ is said to satisfy the edge Lipschitz condition if, whenever $H$ and $H^{\prime}$ differ in only one edge, $|f(H)-f(H^{\prime})|\leq 1$.

Note that the number of components $C(G)$ is edge Lipschitz: when two graphs differ in only one edge, they either have the same number of components, or the graph with one fewer edge has one additional component. One can define a node Lipschitz condition by replacing edge with node [AS16].

The Edge Exposure Martingale. Let $e_{1},e_{2},\ldots,e_{m}$ be an arbitrary order of the edges. We define a martingale $X_{0},X_{1},\ldots,X_{m}$ where $X_{i}$ is the conditional expectation of a graph theoretic function $f(H)$ after exposing $e_{1},e_{2},\ldots,e_{i}$. Note that $X_{0}$ is a constant, namely the expected value of $f(H)$ where $H$ is drawn as $G_{r}$. This is a special case of martingales sometimes referred to as a Doob martingale process, where $X_{i}$ is the conditional expectation of $f(H)$, as long as the information known at time $i$ includes the information at time $i-1$ [AS16]. The same process can be defined for node exposure martingales, where the nodes are exposed one by one [AS16]: at the $i$-th step the graph has $i$ nodes along with the edges between them. More on the topic can be found in [AS16, Chapter 7]. We have the following theorem.

Theorem 2.4.

[AS16] When $f$ satisfies the edge (resp. node) Lipschitz condition, the corresponding edge (resp. node) exposure martingale satisfies $|X_{i+1}-X_{i}|\leq 1$.

We then have Azuma’s inequality.

Theorem 2.5.

[AS16] Let $X_{0}=c,X_{1},\ldots,X_{m}$ be a martingale with

$$|X_{i+1}-X_{i}|\leq 1$$

for all $0\leq i<m$. Then

$$\Pr\left[|X_{m}-c|>\lambda\sqrt{m}\right]<2e^{-\lambda^{2}/2}.$$
Corollary 2.6.

Let $\delta>0$. Then with probability $1-\delta$ we have

$$|C(G_{r})-\mathbb{E}[C(G_{r})]|\leq O(\sqrt{m\log 1/\delta}).$$

Specifically, in the case that $\mathbb{E}[C(G_{r})]=cn$ for a constant $c$, and $G$ has $m=o(n^{2})$ edges, then with high probability the number of connected components is within $cn\pm o(n)\sqrt{\log 1/\delta}$.

Proof.

Consider the number of components $C(\cdot)$, which is an edge-Lipschitz function of the graph. Now define $X_{0}=\mathbb{E}[C(G_{r})]$ and $X_{m}=C(G_{r})$. Applying Theorem 2.5 concludes the proof. ∎
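As a quick numerical sanity check of this concentration (and of the component counts used for Theorem 3.1 below), one can simulate $C(G_{r})$ for a cycle, where the count has a closed form: dropping $k\geq 1$ of the $n$ cycle edges leaves exactly $k$ paths. The sketch below is our own code, not from [AS16].

```python
import random, statistics

def count_components_cycle(n, r, rng):
    """Number of components of an n-cycle after each edge survives w.p. r."""
    dropped = sum(rng.random() >= r for _ in range(n))
    # Dropping k >= 1 of the n cycle edges leaves exactly k path components;
    # if no edge is dropped, the cycle stays one component.
    return dropped if dropped > 0 else 1

rng = random.Random(0)
n, r = 10_000, 0.7
samples = [count_components_cycle(n, r, rng) for _ in range(200)]
print(statistics.mean(samples))   # close to (1 - r) * n = 3000
print(statistics.stdev(samples))  # O(sqrt(n)), far smaller than n
```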

2.2 Classic Group Testing Results

Lemma 2.1 provides a lower bound on CrltOPT in terms of IndepOPT. Indeed, any lower bound on the latter leads to a lower bound on the former. We next review some useful lower and upper bounds on $\textsc{IndepOPT}(n,p,\epsilon)$.

In the probabilistic group testing of [LCH+14], there are $n$ individuals and every individual $i$ is independently defective with probability $p_{i}$. Let $\mathbf{X}$ be the indicator vector of the items’ defectiveness and $H(\mathbf{X})=\sum_{i}p_{i}\log\frac{1}{p_{i}}$ be the entropy of $\mathbf{X}$. The testing can be adaptive or non-adaptive, meaning the testing at each step can depend on the results of the previous tests, or not, respectively. [LCH+14] proves the following lower bound on the required number of tests (for both adaptive and non-adaptive scenarios).

Theorem 2.7.

[LCH+14] Any probabilistic group testing algorithm whose probability of error is at most $\epsilon$ requires at least $(1-\epsilon)H(\mathbf{X})$ tests.
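The bound is immediate to evaluate. The following few lines (our own helper, using the entropy expression $H(\mathbf{X})=\sum_{i}p_{i}\log\frac{1}{p_{i}}$ as stated above) compute the minimum number of tests for a given prior.

```python
import math

def entropy_lower_bound(p_list, eps):
    """(1 - eps) * H(X) with H(X) = sum_i p_i * log2(1/p_i), per Theorem 2.7."""
    h = sum(p * math.log2(1.0 / p) for p in p_list if p > 0)
    return (1.0 - eps) * h

# n = 1000 items, each defective w.p. p = 0.01, error probability at most 0.05:
print(entropy_lower_bound([0.01] * 1000, 0.05))  # about 63 tests
```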

For the upper bounds in this paper, roughly speaking, we partition the graph into groups of nodes and assume that the states of nodes in different groups are independent. Subsequently, we obtain bounds for group testing on the original graph in terms of those when states of nodes are independent. We therefore utilize the existing bounds for smaller networks in our context. Below, we summarize the probabilistic group testing results of [LCH+14] for upper bounds on the number of tests needed when items are independent.

Theorem 2.8.

[LCH+14] There is an adaptive algorithm with probability of error at most $\exp\left(-2\delta^{2}n^{1/4}\right)$ such that the number of tests is less than

$$2(1+\delta)(H(\mathbf{X})+3\mathbb{E}[X]).$$
Theorem 2.9.

[LCH+14] For any $0<\epsilon^{\prime}\leq 1$ and $\delta>0$, if the entropy of $\mathbf{X}$ satisfies

$$H(\mathbf{X})\geq\Gamma_{\gamma}^{2}$$

where

$$\Gamma_{\gamma}:=\log_{2}\left(\log_{1/\gamma}\left(\frac{2n}{\epsilon^{\prime}}\right)\right),$$

then with probability of error at most

$$\epsilon\leq\Gamma_{\gamma}^{-\delta+1}+\frac{1}{2}\epsilon^{\prime}$$

there is a non-adaptive algorithm that requires no more than

$$T\leq\frac{e\ln n}{\log_{2}(1/\gamma)}(1+\delta)H(\mathbf{X})+\Gamma_{\gamma}^{2}+2\mathbb{E}[X]$$

tests.

3 Graphs with a Low Number of Edges

In this section, we consider graphs that have $n+c$ edges, where $c$ is a constant. Specifically, we prove lower and upper bounds on the number of tests needed when the underlying graph is a cycle or tree, where the lower bound can be generalized to graphs with $n+c$ edges. We obtain these bounds by formulating the problem as one with independent nodes, potentially with fewer nodes than the original problem. First, we use the lemmas introduced in Section 2 to obtain lower bounds for cycles and trees. Next, we propose group testing algorithms and prove upper bounds for cycles and trees.

3.1 A Lower Bound for Cycles and Trees

By Corollary 2.6, if we know the expected number of components in a graph, we can lower bound the minimum number of tests needed via Lemma 2.1. The following theorem proves a lower bound for cycles and trees.

Theorem 3.1.

Let $G$ be a cycle or a tree. Then we have

$$\textsc{IndepOPT}((1-r)n-10\sqrt{n\log n},p,\epsilon n)\leq\textsc{CrltOPT}(G,r,p,\epsilon)+O(1/n).$$
Proof.

In a tree, removing an edge creates one more component, so after removing $k$ edges the tree has $k+1$ components and the cycle has $k$ components.

Each edge is removed with probability $1-r$, so the expected number of components is $1+(1-r)(n-1)$ for trees, and $(1-r)n$ for cycles. By Corollary 2.6, the number of components is $(1-r)n\pm O(\sqrt{n\log n})$ with probability $1-1/n^{2}$, and with probability $1/n^{2}$, the difference in tests is at most $n$, hence $O(1/n)$ additional tests. Applying Lemma 2.1 thus completes the proof. ∎

The above proof also works for any graph with $n+c$ edges, where $c$ is a constant. In other words, when the number of edges is less than $n+c$, a lower bound on the number of tests needed for almost $(1-r)n$ independent nodes is also a lower bound on the number of tests needed under our model with correlation $r$.

3.2 An Upper Bound for Cycles and Trees

In this section, we provide algorithms to find the defective set and prove theoretical bounds. We start by considering the case in which $G$ is a simple cycle, and subsequently generalize the ideas to arbitrary trees. Note that an algorithm for trees immediately gives an algorithm for general graphs, by simply running it on a spanning tree; but that algorithm might be far from optimal.

The general idea is to partition the graph $G$ into subgraphs that will remain connected in $G_{r}$ with high probability. The nodes in those connected subgraphs will thus have the same states. We can then select a candidate node for each subgraph to be tested. Knowing the probability of each subgraph being connected and the probability of error in classic group testing, we can estimate the error in our problem as a function of the size of the subgraphs and design the subgraphs accordingly.

First, we provide our results for when $G$ is a cycle.

Theorem 3.2.

Consider a cycle of length $n$. Let $l=\max\{\frac{\log[1/(1-\epsilon/2)]}{\log 1/r},1\}$ and $\epsilon^{\prime}<\epsilon/2$. Then there is an algorithm that uses $\textsc{IndepOPT}(\lceil n/l\rceil,p,\epsilon^{\prime})$ tests and finds the defective set with error at most $\epsilon n$.

Proof.

Consider the following algorithm:

  1. Let $l=\max\{\frac{\log[1/(1-\epsilon/2)]}{\log 1/r},1\}$. Partition the cycle into $\lceil n/l\rceil$ paths $P_{1},P_{2},\ldots,P_{\lceil n/l\rceil}$ of the same length $l$, except one path that may be shorter.

  2. For each path, choose one of its nodes at random and let the corresponding nodes be $v_{P_{1}},v_{P_{2}},\ldots,v_{P_{\lceil n/l\rceil}}$.

  3. Use an $\textsc{IndepOPT}(\lceil n/l\rceil,p,\epsilon^{\prime})$ algorithm (by Theorem 2.8 for adaptive or Theorem 2.9 for non-adaptive group testing) to find the defective items among $v_{P_{1}},v_{P_{2}},\ldots,v_{P_{\lceil n/l\rceil}}$, where $\epsilon^{\prime}<\frac{\epsilon}{2}$ and the probability of being defective equals $p$.

  4. Assign to all the nodes in $P_{i}$ the state of $v_{P_{i}}$, for all $i$.

Note that for each $i$, the defectiveness probability of $v_{P_{i}}$ is $p$. The probability that $P_{i}$ is actually connected after a realization is $r^{l-1}$, so the probability that $P_{i}$ is not in the same state as $v_{P_{i}}$ is $1-r^{l-1}$. Then, assuming that we detect all $v_{P_{i}}$’s correctly, the error in $G$ is at most $\lceil n/l\rceil\cdot(1-r^{l-1})\cdot l$. Substituting $l=\max\{\frac{\log[1/(1-\epsilon/2)]}{\log 1/r},1\}$, the error becomes less than $\epsilon n/2$. Moreover, we might also have $\epsilon^{\prime}$ probability of error for the $v_{P_{i}}$’s (given the criteria set in IndepOPT), meaning that with probability $1-\epsilon^{\prime}$ all the nodes are predicted correctly, and with probability $\epsilon^{\prime}$ we have at least one mispredicted node, and at most $n$ mispredicted nodes. So the total error from this part is at most $\epsilon^{\prime}n<\epsilon n/2$. The total error is therefore at most $(\epsilon^{\prime}+\epsilon/2)n<\epsilon n$, which proves the theorem. ∎
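The following sketch makes the algorithm concrete. The inner classic tester is abstracted as a callback standing in for any IndepOPT routine (e.g., those of Theorems 2.8 and 2.9); the function names and interface are ours.

```python
import math
import random

def cycle_group_testing(n, r, eps, indep_tester, rng=random.Random(0)):
    """Sketch of the algorithm of Theorem 3.2 on a cycle with nodes 0..n-1.

    indep_tester(reps) is any classic group-testing routine that returns
    predicted states for the given representatives, treating them as
    independent items; it plays the role of IndepOPT(ceil(n/l), p, eps').
    """
    l = max(int(math.log(1.0 / (1.0 - eps / 2)) / math.log(1.0 / r)), 1)
    # Step 1: partition the cycle into ceil(n/l) consecutive paths.
    paths = [list(range(i, min(i + l, n))) for i in range(0, n, l)]
    # Step 2: pick one representative per path at random.
    reps = [rng.choice(path) for path in paths]
    # Step 3: run the black-box independent tester on the representatives.
    rep_states = indep_tester(reps)
    # Step 4: every node inherits the state of its path's representative.
    states = [None] * n
    for path, s in zip(paths, rep_states):
        for v in path:
            states[v] = s
    return states
```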

Corollary 3.3.

Consider the case in which $p=c/n$ where $c$ is a constant. Let $\mathbf{X}^{\prime}$ be the vector of candidate nodes. Then $H(\mathbf{X}^{\prime})=\frac{n}{l}H(p)=O(\log n)/l$, where $H(\cdot)$ is the binary entropy. Note that the average number of infected nodes in $\mathbf{X}^{\prime}$ is $\mu=c/l$; hence by Theorem 2.8, the number of tests is upper bounded by

$$O(H(\mathbf{X}^{\prime})+\mu)=O\left(\frac{\log n+c}{l}\right)=O\left(\frac{\log n\log 1/r}{\log[1/(1-\epsilon/2)]}\right)\simeq O\left(\frac{\log n\log 1/r}{\epsilon}\right).$$

Note that when correlations are strong, i.e., $r\geq 1-1/\log n$, the algorithm uses a constant number of tests, as expected. Note that using classic group testing without incorporating correlation we would need $O(nH(c/n))=O(\log n)$ tests.

We now generalize the ideas to derive an upper bound when $G$ is a tree. We partition $G$ into $\lceil n/l\rceil$ subgraphs (which we sometimes refer to as groups) of $l$ nodes, find the probability of each subgraph being connected in a random realization, and then optimize by choosing $l$. At a high level, we partition $G$ such that the nodes within a subgraph have short paths among each other. This is because shorter paths remain connected in $G_{r}$ with higher probability, maximizing the probability of the nodes being in the same state. Finding the probability of error is not straightforward here, because the subgraphs we form may not necessarily be connected in the original graph $G$ but could become connected through the realization of other edges/nodes.

We first give a definition to formalize the number of nodes needed to make a subset of nodes connected.

Definition 3.4.

Let $S\subseteq V$ be a subset of the nodes of graph $G$. The smallest connecting closure of $S$ is a smallest subset $S^{\prime}\subseteq V$ such that the induced graph over $S\cup S^{\prime}$ is connected.

For example, consider the graph $G_{r}$ in Figure 1. If $S=\{v_{1},v_{5}\}$, then the smallest connecting closure of $S$ is $\{v_{4}\}$, as by adding $v_{4}$ to $S$ we make the induced graph connected.

Note that if $G$ is a tree, then every connected subgraph of $G$ is also a tree, and the number of links in the induced graph over $S\cup S^{\prime}$ is one fewer than the number of nodes in it.
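In a tree, the smallest connecting closure of $S$ is simply the node set of the Steiner tree of $S$ (the union of the pairwise paths) minus $S$ itself, which can be computed by repeatedly pruning leaves that are not in $S$. A small sketch (our own code) is below.

```python
from collections import defaultdict

def smallest_connecting_closure(n, edges, S):
    """Smallest connecting closure of S in a tree (Definition 3.4).

    Repeatedly pruning leaves that are not in S leaves exactly the minimal
    subtree spanning S; the closure is that subtree minus S itself.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    S, alive = set(S), set(range(n))
    leaves = [v for v in alive if len(adj[v]) <= 1 and v not in S]
    while leaves:
        v = leaves.pop()
        alive.discard(v)
        for u in adj[v]:
            adj[u].discard(v)
            if len(adj[u]) <= 1 and u not in S and u in alive:
                leaves.append(u)
        adj[v].clear()
    return alive - S

# Path 0-1-2-3-4: the closure of {0, 4} is the interior {1, 2, 3}.
print(smallest_connecting_closure(5, [(0, 1), (1, 2), (2, 3), (3, 4)], {0, 4}))
```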

We now provide a partition of the nodes of a tree such that for each subgraph, only a few additional nodes, and thereby additional links, are needed to make it connected. Formally:

Lemma 3.5.

Let $G$ be a tree. There is a partition of the graph into $\lceil n/l\rceil$ subgraphs each with $l$ nodes (one subgraph may have fewer than $l$ nodes), such that the number of nodes in the smallest connecting closure of each subgraph is less than or equal to $l$, for each $l\leq n$.

Proof.

We prove the lemma by induction on the number of nodes of $G$. For $n=1,2,3$, the statement is trivially true. Now suppose the lemma is true for any number of nodes less than $n$; we prove it for $n$.

We aim to find a set of nodes of size $l$ such that, first, by removing the set the graph remains connected, and second, the smallest connecting closure of the set has at most $l$ nodes. Then, removing this set and considering it as one of the subgraphs, we use the induction hypothesis for the rest of the graph.

To do that, suppose the tree is rooted at an arbitrary node. Let $v$ be one of the deepest leaves, that is, any other node is at a level higher than or equal to that of $v$. Let $u$ be the first ancestor of $v$ such that the subtree rooted at $u$, including $u$, has $l$ or more nodes. If it has exactly $l$ nodes, then the subtree rooted at $u$ is the desired subgraph.

Now suppose that the subtree rooted at $u$ has more than $l$ nodes. Note that the number of nodes on the path from $v$ to $u$ is $l$ or fewer; otherwise $v$ would have had an ancestor lower than $u$ such that the subtree rooted at it would have at least $l$ nodes. Thus the distance from $v$ to $u$ is less than $l$. Now we form a subgraph $S$ with $l$ nodes and a connecting closure of $l$ or fewer nodes. We build $S$ progressively. Starting with the empty set $S$, we add the subtree of the child of $u$ that $v$ is a descendant of, including the child itself; call this subtree $s_{1}$. Note that $s_{1}$ has $k<l$ nodes, otherwise $v$ would have had an ancestor lower than $u$ such that the subtree rooted at it would have at least $l$ nodes. Since $s_{1}$ is an entire subtree rooted at a node, $G$ remains connected even when $s_{1}$ is removed from it.

Consider $l_{1}=l-k$. Recursively, do the same process for the other subtrees of $u$ with $l_{1}$ instead of $l$. Note that $u$ has subtrees other than $s_{1}$, as the subtree rooted at $u$ has more than $l$ nodes and $s_{1}$ has at most $l-1$ nodes. Then consider another subtree of $u$, called $s_{2}$. If $|s_{2}|\leq l_{1}$, we update $l_{2}=l_{1}-|s_{2}|$, add $s_{2}$ to $S$, and continue with another subtree of $u$, which by the same argument exists. If $|s_{2}|>l_{1}$, then again we choose a deepest leaf of $s_{2}$ and proceed with the same process as before to find another group of nodes, i.e., we start with the deepest leaf and go up in the tree until the subtree size exceeds $l_{1}$, and repeat the procedure. Note that after moving to the subtrees of $u$, we disregard the rest of the graph, so $u$ is an ancestor of all the nodes we encounter next.

Again, for the next recursion, the subtree of the node that exceeds the updated $l$ is an ancestor of the rest of the nodes. Let us call $u$ and the other nodes at which we recurse “breaking points”. Then any pair of breaking points are ancestor and descendant, and all the nodes added to $S$ belong to subtrees of breaking points. So by connecting all the breaking points by a single path, which has length at most $l$ (as its length is at most the distance from $v$ to $u$), we connect $S$; so the smallest connecting closure of $S$ has $l$ or fewer nodes. Moreover, we have only included entire subtrees of $G$, so by removing $S$, $G$ remains connected and we can use the induction hypothesis for the remaining tree.

To illustrate the algorithm, consider the tree in Figure 3 with $l=5$. We start with $v$, which is a deepest leaf, and move up; now the subtree is $\{7,v\}$ and we add node 7 to the subgraph as $l\geq 2$. We move up again, and this time we cannot add $u$ and its subtrees, as the size of the subtree rooted at $u$ would exceed $l$. So we update $l_{1}=l-2=3$ and proceed with another subtree of $u$, say the right subtree containing node $8$. The size of the subtree at $8$ is one (only $\{8\}$), so we can add it to the subgraph ($l_{1}\geq 1$) and update $l_{2}=l_{1}-1=2$. Now we continue with the updated $l$ and the left subtree of $u$. Node 10 is a deepest leaf that we start with; we move up to $u^{\prime}$, but we cannot add the subtree rooted at $u^{\prime}$, as its size is bigger than $l_{2}$. So we add 10 to the set, update $l_{3}=l_{2}-1=1$, and proceed with $u^{\prime}$. Finally our updated $l$ is 1 and we only add 9 to the subgraph. So the final subgraph is $\{v,7,8,9,10\}$, and we can connect all of it by adding $u$ and $u^{\prime}$.

Figure 3: An example of the procedure in Lemma 3.5.

We have now found $S$ that has a smallest connecting closure of size at most $l$ and includes only subtrees of $G$, so removing it does not disconnect the graph. We then save $S$ as one of the desired subgraphs and use the induction hypothesis on the rest of the graph. The other formed subgraphs have size $l$ (except one) and have small connecting closures. By adding $S$ to these subgraphs, we get the desired grouping of all nodes and the proof is complete. ∎

Now we are ready to prove the upper bound for trees.

Theorem 3.6.

Consider a tree with $n$ nodes and let $l=\max\{\frac{\log[1/(1-\epsilon/2)]}{2\log 1/r},1\}$. Let $\epsilon^{\prime}<\epsilon/2$. Then there is an algorithm that uses $\textsc{IndepOPT}(\lceil n/l\rceil,p,\epsilon^{\prime})$ tests and finds the defective set with at most $\epsilon n$ errors. I.e.,

$$\textsc{CrltOPT}(G,r,p,\epsilon)\leq\textsc{IndepOPT}(\lceil n/l\rceil,p,\epsilon^{\prime}).$$
Proof.

Consider the following algorithm:

  1. By Lemma 3.5, partition the tree into $\lceil n/l\rceil$ subgraphs $g_{1},g_{2},\ldots,g_{\lceil n/l\rceil}$ of the same size $l$; one subgraph might be smaller than the others.

  2. For each group, choose one of its nodes at random and let them be $v_{g_{1}},v_{g_{2}},\ldots,v_{g_{\lceil n/l\rceil}}$.

  3. Use an $\textsc{IndepOPT}(\lceil n/l\rceil,p,\epsilon^{\prime})$ algorithm to find the defective set among $v_{g_{1}},v_{g_{2}},\ldots,v_{g_{\lceil n/l\rceil}}$.

  4. Assign to all the nodes in $g_{i}$ the state of $v_{g_{i}}$, for all $i$.

First, we calculate the probability that $g_{i}$ is connected. By Lemma 3.5, each $g_{i}$ has the property that its smallest connecting closure has $l$ or fewer nodes. Thus, $g_{i}$ together with its connecting closure has at most $2l-1$ edges in $G$ and constitutes a connected subgraph of $G$. Therefore, the probability of $g_{i}$ being connected in $G_{r}$ is at least the probability that the above edges are retained in $G_{r}$, which is at least $r^{2l-1}\geq r^{2l}$. So the probability that $g_{i}$ is not in the same state as $v_{g_{i}}$ is at most $1-r^{2l}$. The rest of the proof revolves around showing that the total error is less than $\epsilon n$, as was done for the cycle, and this completes the proof. ∎

Corollary 3.7.

Corollary 3.3 can be recovered for trees with an additional factor of 2.

4 An Upper Bound for Graphs with More Edges: Grids and SBMs

In this section, we focus on graphs that potentially have many edges. As the number of edges increases, the correlation between nodes increases even when $r$ is not large. As mentioned earlier, we need to target those components that are more likely to appear in various realizations.

We know that there is a threshold phenomenon in some edge-faulty graphs, meaning that when $r$ is below a threshold, there are many isolated nodes (and hence many independent tests are needed) and when $r$ is above that threshold, we have a giant component (and hence a single test suffices). Most famously, this threshold is $\frac{\log n}{n}$ for Erdős-Rényi graphs. For random $d$-regular graphs, [Goe97] has shown that when a graph is drawn uniformly from the set of all $d$-regular graphs with $n$ nodes and then each edge is realized with probability $r$, $\frac{1}{d-1}$ is a threshold almost surely.

For the rest of this section, we first study a (deterministic) $4$-regular graph, known as the grid (a subtle difference is worth mentioning here: the degree regularity does not hold on the boundaries of the grid), and then provide near-optimal results for the stochastic block model. When we consider (deterministic) $d$-regular graphs, we cannot use the results of [Goe97] for random $d$-regular graphs, because we cannot be sure that the specific chosen graph is among the “good” graphs that constitute the almost sure result. So we need to develop new results on the number of connected components and their distribution for our purposes.

4.1 The grid

We first formally define a grid. A grid with $n$ nodes and side length $\sqrt{n}$ is a graph whose nodes are of the form $(a,b)$, $1\leq a,b\leq\sqrt{n}$. Node $(a,b)$ is connected to its four closest neighbors (if they exist), namely $(a-1,b),(a+1,b),(a,b+1),(a,b-1)$. Border nodes (with $a\in\{1,\sqrt{n}\}$ or $b\in\{1,\sqrt{n}\}$) might have only three or two neighbors. Note that the side length is defined by the number of nodes on the side.

In order to derive a lower bound, we need to know the expected number of components in $G_{r}$, and equivalently the expected size of the component a node belongs to [AS16, Goe97]. To find the expected component size, we describe a random process that forms the components and analyze its expected stopping time. While this is well known for ER graphs, to the best of our knowledge there are no bounds when the graph is not complete. We derive a lower bound on the expected number of components in a grid by analyzing the problem for 3-regular infinite trees. This approach can be generalized to any $d$-regular graph.

Consider the following process. Pick a node $v\in V(G)$, mark it as processed, and let it be the root of a tree. For each $u\in V(G)$ that is not processed and is a neighbor of $v$, the edge $uv$ is realized with probability $r$, in which case $u$ is added as a child of $v$. The same process is repeated for each realized $u$ in Breadth-First Search (BFS) order. When the process ends, there is a tree with root $v$, and the expected size of the tree is the expected size of the component that $v$ ends up in.

Figure 4: An example of the procedure described in Section 4.1, starting with border node $v_{11}$.

An example of this process is shown in Figure 4. Node $v_{11}$ is the root (colored in blue), the children that are realized are marked in green, and the children that are not realized are marked in red. The component would be $\{v_{11},v_{12},v_{7},v_{2}\}$.

By repeating the process for each node that is not processed yet, we get a spanning forest. The expected number of components in the forest is the expected number of components in the original random graph.

Here, the challenge is that we do not know the number of available (unprocessed) neighbors of a node. It highly depends on the previously chosen nodes, especially when $d$ is small, as in the grid. Note that this is more complicated than the process for ER graphs, as the number of unprocessed neighbors in an ER graph is independent of the previously processed nodes (due to its homogeneous nature). We circumvent this issue by analyzing an infinite regular tree process that effectively corresponds to a more connected graph and therefore leads to a lower bound on the expected number of connected components for the grid.
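The exploration process itself is easy to simulate on a finite grid, which gives a useful reference point for the analytical bounds that follow. The sketch below (our own code) runs the BFS process of Section 4.1 over all nodes and returns the number of trees in the resulting spanning forest, which is distributed as the number of components of $G_{r}$.

```python
import random
from collections import deque

def grid_spanning_forest_components(side, r, rng=random.Random(0)):
    """Simulate the BFS process of Section 4.1 on a side x side grid.

    Each unprocessed neighbor is attached with probability r; repeating
    from every still-unprocessed node yields a spanning forest whose
    number of trees is distributed as the number of components of G_r.
    """
    processed = [[False] * side for _ in range(side)]
    components = 0
    for sa in range(side):
        for sb in range(side):
            if processed[sa][sb]:
                continue
            components += 1          # start a new tree of the forest
            processed[sa][sb] = True
            queue = deque([(sa, sb)])
            while queue:             # grow the component in BFS order
                a, b = queue.popleft()
                for na, nb in ((a - 1, b), (a + 1, b), (a, b + 1), (a, b - 1)):
                    if (0 <= na < side and 0 <= nb < side
                            and not processed[na][nb] and rng.random() < r):
                        processed[na][nb] = True
                        queue.append((na, nb))
    return components

# With r = 0.25 < 1/3, compare against n / E|C(v)| from Theorem 4.6 below.
print(grid_spanning_forest_components(100, 0.25))
```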

4.1.1 3-regular Trees

Consider an infinite tree with root $v$ such that each node in the tree has three children. Consider the process where each edge is realized with probability $r$. Let $C(v)$ be the component that $v$ ends up in. The following lemmas approximate the distribution of $|C(v)|$.

Lemma 4.1.

Under the above process and for $t\in\mathbb{N}$,

$$P(|C(v)|=t)=\frac{1}{2t+1}\binom{3t}{t}r^{t-1}(1-r)^{2t+1}.$$
Proof.

Let $T$ be an embedded tree in $G$ with $t$ nodes. In order for $T$ to be realized in the process, all the edges in $T$ should be realized, and the remaining edges that have a node in $T$ should not be realized. There are $t-1$ edges in $T$, and each node has three potential child edges, so there are $2t+1$ edges that must not be realized. So the probability that $T$ is realized is $r^{t-1}(1-r)^{2t+1}$.

Let $C_{t}$ be the number of trees with $t$ nodes and $v$ as the root. We have

$$P(|C(v)|=t)=C_{t}\cdot r^{t-1}(1-r)^{2t+1}.$$

Note that $C_{0}=C_{1}=1$. We find a closed form of $C_{t}$ by recursion. Node $v$ has three potential subtrees, where the sum of the sizes of the subtrees is $t-1$. We thus get the following recursion:

$$C_{t}=\sum_{\substack{i+j+k=t-1\\ i,j,k\geq 0}}C_{i}C_{j}C_{k}.$$

This recursion has the same initial values and the same form as the recursion of the order-3 Fuss-Catalan numbers:

Lemma 4.2.

[Ava08] The order-$d$ Fuss-Catalan numbers, which follow the recursion

$$C^{d}_{n}=\sum_{i_{1}+\cdots+i_{d}=n-1}C^{d}_{i_{1}}\cdots C^{d}_{i_{d}}$$

with $C^{d}_{0}=C^{d}_{1}=1$, have the closed form $C^{d}_{n}=\frac{1}{(d-1)n+1}\binom{dn}{n}$.

So the solution has the form $C_{t}=\frac{1}{2t+1}\binom{3t}{t}$, and the proof is complete. ∎

Lemma 4.3.

Let $P_{\infty}=P(|C(v)|=\infty)$. Then, under the above process,

$$P_{\infty}=\begin{cases}0&r\leq 1/3\\ \frac{3r-\sqrt{r(4-3r)}}{2r^{2}}&\text{otherwise}\end{cases}$$
Proof.

In order for $v$ to be in an infinite component, at least one of its children should be in an infinite component. There are three cases: (i) $v$ has one child, and this child lies in an infinite component, which happens with probability $3r(1-r)^{2}P_{\infty}$; (ii) $v$ has two children and at least one of them lies in an infinite component, which happens with probability $3r^{2}(1-r)(1-(1-P_{\infty})^{2})$; or (iii) $v$ has three children and at least one of them lies in an infinite component, which happens with probability $r^{3}(1-(1-P_{\infty})^{3})$. So in total we have

$$P_{\infty}=3r(1-r)^{2}P_{\infty}+3r^{2}(1-r)(1-(1-P_{\infty})^{2})+r^{3}(1-(1-P_{\infty})^{3}).$$

The solutions of this equation are 0 and P_{\infty}=\frac{3r\pm\sqrt{r(4-3r)}}{2r^{2}}. Since P_{\infty}<1, the root \frac{3r+\sqrt{r(4-3r)}}{2r^{2}} is not valid. Which of the remaining solutions is correct depends on whether r>1/3 or r\leq 1/3. Indeed, when r>1/3, the probabilities of the finite component sizes (as found in Lemma 4.1) do not add up to one, so the correct solution is \frac{3r-\sqrt{r(4-3r)}}{2r^{2}}. When r\leq 1/3, we have \frac{3r-\sqrt{r(4-3r)}}{2r^{2}}\leq 0, so zero is the correct solution and the lemma is proved. ∎
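Numerically (a sketch of ours, not from the paper), the closed form can be checked by iterating the equivalent fixed-point equation P = 1 - (1 - rP)^3, which is what the three-case expansion above collapses to; starting from P = 1 converges to the largest fixed point in [0, 1].

```python
from math import sqrt

def p_inf_fixed_point(r, iters=100_000):
    p = 1.0  # start at 1 to converge to the largest fixed point
    for _ in range(iters):
        p = 1 - (1 - r * p) ** 3  # expanding the cube recovers the
                                  # three-case equation in the proof
    return p

for r in (0.2, 0.5, 0.8):
    closed = 0.0 if r <= 1 / 3 else (3 * r - sqrt(r * (4 - 3 * r))) / (2 * r ** 2)
    print(r, round(p_inf_fixed_point(r), 6), round(closed, 6))
```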

Theorem 4.4.

When r1/3r\leq 1/3, the expected component size is

𝔼(|C(v)|)=t=1t2t+1(3tt)rt1(1r)2t+1\displaystyle\mathbb{E}(|C(v)|)=\sum_{t=1}^{\infty}\frac{t}{2t+1}{3t\choose t}r^{t-1}(1-r)^{2t+1} (3)
1rr34πt=1t(2t+1)(274r(1r)2)t.\displaystyle\simeq\frac{1-r}{r}\sqrt{\frac{3}{4\pi}}\sum_{t=1}^{\infty}\frac{\sqrt{t}}{(2t+1)}(\frac{27}{4}r(1-r)^{2})^{t}.

Note that when r<1/3, we have \frac{27}{4}r(1-r)^{2}<1, and hence the sum converges.

Proof.

The proof follows from Lemma 4.1 and Lemma 4.3 by applying Stirling's approximation to {3t\choose t}. ∎
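The series converges quickly for r bounded away from 1/3 and can be evaluated directly. The sketch below (ours) sums it in log-space to avoid overflowing the binomial coefficients, and cross-checks against 1/(1-3r), the mean size of a subcritical branching process with offspring mean 3r, which the series should reproduce.

```python
from math import exp, lgamma, log

def log_comb(n, k):
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def expected_size_series(r, T=5000):
    # Exact series of Theorem 4.4:
    # term t = t/(2t+1) * C(3t,t) * r^(t-1) * (1-r)^(2t+1).
    total = 0.0
    for t in range(1, T):
        total += exp(log(t) - log(2 * t + 1) + log_comb(3 * t, t)
                     + (t - 1) * log(r) + (2 * t + 1) * log(1 - r))
    return total

for r in (0.1, 0.2, 0.3):
    print(r, round(expected_size_series(r), 6), round(1 / (1 - 3 * r), 6))
```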

It is worthwhile to remark that the proof generalizes to general dd-regular tree processes.

Remark 4.5.

When the underlying graph is an infinite dd-regular tree, under the defined process and for tt\in\mathbb{N},

P(|C(v)|=t)=1(d1)t+1(dtt)rt1(1r)(d1)t+1P(|C(v)|=t)=\frac{1}{(d-1)t+1}{dt\choose t}r^{t-1}(1-r)^{(d-1)t+1}

4.1.2 Lower Bound on the Expected Number of Components in a Grid

For the grid, we consider 3-regular trees: every node after the root has at most 3 unprocessed potential neighbors, and if we choose the root on the border, the root also has at most 3 potential children, as illustrated in Figure 4. The 3-regular tree process that we analyzed therefore corresponds to a more connected graph than the grid, so the expected component size found in (3) yields a lower bound on the expected number of connected components of the grid.

Let NC be the number of connected components. Note that NC is a stopping time: knowing C_{1}+\cdots+C_{NC}, where C_{i} is the size of the i'th component, we can decide whether the process has finished. The random process on the 3-regular tree is symmetric over all the nodes, so since NC is a stopping time, Wald's identity gives \mathbb{E}[NC]=|V(G)|/\mathbb{E}[|C(v)|]. By Theorem 4.4, we immediately have the following result.

Theorem 4.6.

For a grid with n nodes and r\leq 1/3, the expected number of components is at least

𝔼(NC)=nt=1t2t+1(3tt)rt1(1r)2t+1\mathbb{E}(NC)=\frac{n}{\sum_{t=1}^{\infty}\frac{t}{2t+1}{3t\choose t}r^{t-1}(1-r)^{2t+1}}
n1rr34πt=1t(2t+1)(274r(1r)2)t.\simeq\frac{n}{\frac{1-r}{r}\sqrt{\frac{3}{4\pi}}\sum_{t=1}^{\infty}\frac{\sqrt{t}}{(2t+1)}(\frac{27}{4}r(1-r)^{2})^{t}}.
Corollary 4.7.

Using Lemma 2.1 in conjunction with Theorem 4.6, any lower bound on the number of tests for n1rr34πt=1t(2t+1)(274r(1r)2)t+O(1/n)\frac{n}{\frac{1-r}{r}\sqrt{\frac{3}{4\pi}}\sum_{t=1}^{\infty}\frac{\sqrt{t}}{(2t+1)}(\frac{27}{4}r(1-r)^{2})^{t}}+O(1/n) independent nodes is also a lower bound on the number of tests on a grid under our model.
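To see how conservative this bound is, one can count components of a percolated grid directly. The union-find sketch below is ours, and it assumes the series sums to the branching-process mean 1/(1-3r), so that the tree-process bound reads n(1-3r).

```python
import random

def grid_components(m, r, seed=0):
    # Union-find over the m*m grid; keep each edge with probability r.
    rng = random.Random(seed)
    parent = list(range(m * m))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    for i in range(m):
        for j in range(m):
            u = i * m + j
            if j + 1 < m and rng.random() < r:
                union(u, u + 1)  # edge to the right neighbor
            if i + 1 < m and rng.random() < r:
                union(u, u + m)  # edge to the bottom neighbor
    return len({find(x) for x in range(m * m)})

m, r = 200, 0.2
n = m * m
print("simulated components:", grid_components(m, r))
print("tree-process lower bound n(1-3r):", n * (1 - 3 * r))
```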

4.1.3 An Upper Bound for the Grid

In this subsection, we provide an upper bound on the number of tests for the grid. At a high level, we partition the grid into subgrids and assume each subgrid is connected, so that one representative node per subgrid can be tested. In other words, the algorithmic idea is similar to the proofs of Theorems 3.2 and 3.6, where the graph G was partitioned into subgraphs that were likely to remain connected in G_{r}; here, we choose the subgraphs to be subgrids. To calculate the error, we need the probability that a subgrid of side length k is connected, where k is to be optimized later. We first estimate this probability.

Lemma 4.8.

Let P_{k} be the probability that a grid of side length k>1 becomes connected when each of its edges is realized with probability r. We have:

PkPk1rΘ((1r)k).P_{k}\geq P_{k-1}r^{\Theta((1-r)k)}.
Proof.

Consider the subgrid of side length k-1 that contains the bottom-left corner node. The main grid then consists of this subgrid and a path with 2k-1 nodes, where each node of the path has one edge to the subgrid. For example, in Figure 4 with k=5, the subgrid is the grid with corner nodes v_{6},v_{9},v_{24},v_{21}, and the path with 9 nodes is v_{1},v_{2},v_{3},v_{4},v_{5},v_{10},v_{15},v_{20},v_{25}. With probability P_{k-1} the subgrid is connected. In expectation, (2k-2)(1-r) edges of the path are removed, and by the Chernoff bound this number is concentrated around its mean. The path is thus decomposed into (1-r)2k\pm o(rk)=\Theta((1-r)k) subpaths with probability at least 1-1/k^{10}. Each subpath has at least one edge to the subgrid, so each connects to the subgrid with probability at least r. The probability that all of them connect to the subgrid is then at least r^{\Theta((1-r)k)}, and the lemma is proved. ∎

Theorem 4.9.

Let P_{k} be the probability that a grid of side length k>1 becomes connected when each of its edges is realized with probability r. We have:

P_{k}\geq r^{\Theta((1-r)k^{2})}=e^{-\Theta((1-r)k^{2}\log(1/r))}
Proof.

The proof follows by repeatedly applying Lemma 4.8, replacing P_{k-1} with P_{k-2}, then P_{k-2} with P_{k-3}, and so on down to P_{1}=1; the exponents \Theta((1-r)j) for j=2,\ldots,k sum to \Theta((1-r)k^{2}). ∎

We now partition the grid into subgrids of side length k, where k will be set later, pick a candidate node for each subgrid, and run independent group testing on the candidate nodes. Similar to Theorem 3.6, by making the error probability of each subgraph small enough, that is 1-P_{k}<\epsilon/2, we get k<\sqrt{\frac{\log(1-\epsilon/1000)}{(1-r)\log r}} (note that both the numerator and the denominator are negative). Then the error is less than \epsilon n with at most n/k^{2} independent node tests with error \epsilon^{\prime}<\epsilon/2.
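The following sketch (ours) makes the parameter choice concrete, using the bound of Theorem 4.9 with the hidden constant set to 1, which is an assumption on our part: it picks the subgrid side k from the requirement 1-P_{k}<\epsilon/2 and reports how many candidate nodes remain to be tested.

```python
from math import floor, log, sqrt

def subgrid_side(eps, r):
    # From P_k >= r^{(1-r)k^2}, requiring 1 - P_k < eps/2 gives
    # k < sqrt( log(1/(1 - eps/2)) / ((1 - r) * log(1/r)) ).
    k = sqrt(log(1 / (1 - eps / 2)) / ((1 - r) * log(1 / r)))
    return max(1, floor(k))

n, eps, r = 10**6, 0.01, 0.99
k = subgrid_side(eps, r)
print("subgrid side:", k, "candidate nodes to test:", -(-n // k**2))
```

For r=0.99 and \epsilon=0.01 this gives k=7, so roughly n/49 candidates; up to the \epsilon-dependent factor, the number of candidates scales as n(1-r)\log(1/r).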

4.2 An Optimal Algorithm for the Stochastic Block Model

In this section, we study our model on SBM graphs. We apply the same techniques used for Erdős-Rényi graphs to find the connectivity threshold and, from it, the structure of the connected components.

A stochastic block model has g clusters of size k=n/g, where any pair of nodes in the same cluster is connected with probability q_{1}, and any pair of nodes in different clusters is connected with probability q_{2}<q_{1}. After realizing each edge with probabilities q_{1} and q_{2}, we obtain our graph G. Then, based on our correlation model, each edge remains in G_{r} with probability r. So an edge inside a cluster remains with probability r_{1}=rq_{1}, and an edge between two different clusters remains with probability r_{2}=rq_{2}. Here, we assume the size of the clusters is much bigger than \log n, i.e., k\gg\log n. We determine the number of connected components based on r_{1} and r_{2} with high probability, say 99% for simplicity; the probability can be improved with a slight change in the parameters.

Theorem 4.10.
  • If r_{1}\geq\frac{100\log n}{k} and 1-(1-r_{2})^{k^{2}}\geq\frac{100\log g}{g}, then with high probability G_{r} is connected. (first regime, one test needed)

  • If r1100lognkr_{1}\geq\frac{100\log n}{k} and 1(1r2)k21100g1-(1-r_{2})^{k^{2}}\leq\frac{1}{100g}, then with high probability each cluster is connected but most of the clusters are isolated. (second regime, gg independent tests needed)

  • If r11100kr_{1}\leq\frac{1}{100k} and r21100nr_{2}\leq\frac{1}{100n}, then with high probability GrG_{r} has many isolated nodes. (third regime, Ω(n)\Omega(n) independent tests needed)

  • If r11100kr_{1}\leq\frac{1}{100k} and r2100lognnr_{2}\geq\frac{100\log n}{n}, and g>1g>1, then with high probability GrG_{r} is connected. (fourth regime, one test needed)

Proof.

First, suppose r_{1}\geq\frac{100\log n}{k}. A cut is a partition of the nodes into two sets (parts), and its size is the number of nodes in the smaller set. We say a cut is disconnected if there is no realized edge between the two parts. A cut of size i\leq k/2 in a single cluster has i(k-i) potential edges between the parts. Then the probability that this specific cut is disconnected is

(1-r_{1})^{i(k-i)}\overset{(i)}{\leq}e^{-r_{1}i(k-i)}\overset{(ii)}{\leq}e^{-100i\log n(1-i/k)}\overset{(iii)}{\leq}e^{-50i\log(gk)}=\left(\frac{1}{gk}\right)^{50i}.

The first inequality, (i), holds because 1-x\leq e^{-x} for x\geq 0; (ii) holds by r_{1}\geq\frac{100\log n}{k}; and (iii) holds by i\leq k/2 and n=gk. The number of cuts of size i is {k\choose i}\leq k^{i}, so by the union bound, the probability that some cut of size i is disconnected is at most \sum_{i=1}^{k/2}(\frac{1}{gk})^{50i}k^{i}<(\frac{1}{gk})^{48}=(1/n)^{48}. So with probability 1-(\frac{1}{n})^{48}, a single cluster of size k is connected. Again by the union bound, the probability that some cluster is disconnected is at most (\frac{1}{n})^{48}g<(\frac{1}{n})^{47}, so with probability 1-(\frac{1}{n})^{47}, all clusters are connected.

Now, assuming all clusters are connected, if there is an edge between two clusters, then those two clusters are connected. So consider the graph whose nodes represent the clusters, with two nodes adjacent if there is at least one realized edge between the corresponding clusters; we need to understand the connectivity of this new graph. The probability that there is at least one edge between two clusters is 1-(1-r_{2})^{k^{2}}, and again, if this value is more than \frac{100\log g}{g}, then G_{r} is connected with high probability. If 1-(1-r_{2})^{k^{2}}<\frac{1}{100g}, then the probability that a given cluster is isolated is more than (1-1/(100g))^{g-1}\simeq e^{-1/100}\simeq 0.99, so most of the clusters are isolated. This proves the first two parts of the theorem.

If r_{1}\leq\frac{1}{100k}, then by the same argument, with high probability most of the nodes in all clusters have no neighbor inside their own cluster. If we also have r_{2}\leq\frac{1}{100n}, then most of these nodes have no neighbor outside their cluster either, with high probability, so in total the graph has \Omega(n) isolated nodes, which proves the third part.

Now suppose r_{2}\geq\frac{100\log n}{n}; we prove the last part of the theorem. We assume each cluster is empty, i.e., there is no edge inside any cluster; even then, with r_{2}\geq\frac{100\log n}{n}, the graph is connected with high probability. Consider a cut in G with i\leq n/2 nodes on the smaller side. Each node has n-k potential neighbors in other clusters, so it has at least n-k-i potential neighbors outside of its cluster and the chosen side of the cut. Then, similarly to the first part of the theorem, the probability that this cut is disconnected is at most

(1r2)i(nki)er2i(nki)(1-r_{2})^{i\cdot(n-k-i)}\leq e^{-r_{2}\cdot i\cdot(n-k-i)}
e100logni(nkin)(i)e100logni(1/21/g)\leq e^{-100\log n\cdot i\cdot(\frac{n-k-i}{n})}\overset{(i)}{\leq}e^{-100\log n\cdot i\cdot(1/2-1/g)}
=(1n)100i(1/21/g).=(\frac{1}{n})^{100i(1/2-1/g)}.

Here, (i) holds by i\leq n/2 and k/n=1/g. Again, there are {n\choose i}\leq n^{i} cuts of size i, so the probability that some cut of size i is disconnected is at most n^{i}\cdot(\frac{1}{n})^{100i(1/2-1/g)}=(\frac{1}{n})^{100i(1/2-1/g-0.01)}. It is not hard to see that if g>2, then (\frac{1}{n})^{100i(1/2-1/g-0.01)}=o(1/n^{2}). So the probability that some cut is disconnected is bounded by

i=1n/2(1n)100i(1/21/g0.01)no(1/n2)=o(1/n).\sum_{i=1}^{n/2}(\frac{1}{n})^{100i(1/2-1/g-0.01)}\leq n\cdot o(1/n^{2})=o(1/n).

So in the case of g>2, we have proved the last part of the theorem. If g=2, for i>2, the smaller side of a cut of size i has at most i^{2}/4 edges inside it, as the graph is bipartite and the number of edges within the set is maximized when i/2 nodes are chosen from each part. So the number of potential edges to the other side of the cut is at least i(n-k)-i^{2}/4=i(\frac{n}{2}-\frac{i}{4})\geq i\cdot 3n/8, as i\leq n/2, and we can repeat the reasoning to show that with high probability no cut is disconnected. It is also easy to verify that when the smaller side of the cut is a single node or a pair of nodes, the cut is disconnected with probability at most o(1/n^{4}), and this completes the proof. ∎

Based on the previous theorem, we can now design a simple algorithm based on the parameters r_{1} and r_{2}. In the first and the fourth regimes, a single node is tested and the result generalizes to all the nodes. In the second regime, we pick a candidate node from each cluster and perform independent group testing on the candidates; the result for each candidate generalizes to all the nodes in the corresponding cluster. Finally, in the third regime, we perform independent group testing on all n nodes of the graph.
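A minimal regime-selection sketch (ours; the function name sbm_strategy is hypothetical, and parameters falling between the regimes get no guarantee from the theorem) could look as follows.

```python
from math import log

def sbm_strategy(n, g, r1, r2):
    # Pick the testing strategy according to the four regimes of
    # Theorem 4.10; the constant 100 is taken from the theorem statement.
    k = n // g  # cluster size
    p_between = 1 - (1 - r2) ** (k * k)  # P(some edge between two clusters)
    if r1 >= 100 * log(n) / k:
        if p_between >= 100 * log(g) / g:
            return "regime 1: connected; test a single node"
        if p_between <= 1 / (100 * g):
            return "regime 2: connected clusters, mostly isolated; test g candidates"
    if r1 <= 1 / (100 * k):
        if r2 <= 1 / (100 * n):
            return "regime 3: many isolated nodes; test all n nodes independently"
        if r2 >= 100 * log(n) / n and g > 1:
            return "regime 4: connected across clusters; test a single node"
    return "between regimes: Theorem 4.10 gives no guarantee"

print(sbm_strategy(n=10**6, g=100, r1=0.2, r2=1e-13))  # -> regime 2
```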

5 A Stronger Notion of Error

So far, we have focused on bounding the expected error of the algorithms, meaning the error is low (only) on average. But in many applications, we need the error to be low with high probability. As an example, suppose we need the error to be less than e. Under the weaker notion of error (bounding the average), we might have an error of 1.5e half of the time and an error of 0.5e the other half, so half of the time the requirement is not satisfied. Similar to [LCH+14], we introduce a probabilistic error with a relaxation on error-free prediction. Precisely, [LCH+14] seeks to find the defective set with probability 1-\delta such that all the nodes are correctly predicted, whereas we allow up to \epsilon n mispredicted nodes.

We consider the following stronger notion of error: we require at most \epsilon n mispredicted nodes with probability 1-\delta, and on the remaining \delta fraction of realizations any number of nodes may be mispredicted. We refer to this notion of error as maximum error with parameters \epsilon and \delta. It is a relaxation of the criterion of [LCH+14], where \epsilon=0, i.e., with probability 1-\delta the recovery is perfect. Let \textsc{CrltOPT}(G,r,p,\delta,\epsilon) be the expected number of tests of an optimal algorithm under maximum error with parameters \epsilon and \delta. Parameters r and p are defined as in Section 1.1.

In the following, we will provide lower bounds on the number of tests under the above notion of error. Then we will provide matching upper bounds for some of the graphs discussed so far.

5.1 Lower Bounds for the Stronger Error

We first find a lower bound for \textsc{CrltOPT}(G,r,p,\delta,\epsilon). Note that we can no longer use the lower bounds of classic group testing directly, such as Lemma 2.1, because an error in classic group testing (which affects the average error) might not count as an error under the maximum error criterion. In other words, when an error happens with probability \delta in classic group testing, there is no guarantee on the number of mispredicted nodes: it might be one node, a constant number, or all of them, whereas only \epsilon n mispredicted nodes are allowed under maximum error. So we need to lower bound the problem with the maximum error definition and independent nodes directly. We adapt the approach of [LCH+14] to derive the following lower bound:

Theorem 5.1.

Let \textsc{IndepOPT}(n,p,\delta,\epsilon) be the minimum number of tests for n independent nodes under maximum error with parameters (\delta,\epsilon). Then any probabilistic group testing algorithm that, with probability 1-\delta, correctly predicts all independent nodes but \epsilon n of them needs n(1-\delta)(H(p)-H(\epsilon))-O(1) tests, where H is the binary entropy function, i.e.

IndepOPT(n,p,δ,ϵ)n(1δ)(H(p)H(ϵ))O(1).\textsc{IndepOPT}(n,p,\delta,\epsilon)\geq n(1-\delta)(H(p)-H(\epsilon))-O(1).
Proof.

The proof is a modified version of the proof of Theorem 2.7 in [LCH+14]. Let \mathbf{X} be the vector of states of the nodes, \mathbf{B} be the vector of results of the group tests for a testing strategy of choice, and \mathbf{Y} be the estimated states of the nodes. Then \mathbf{X}\rightarrow\mathbf{B}\rightarrow\mathbf{Y} forms a Markov chain. Thus, by the data processing inequality, I(\mathbf{X};\mathbf{B})\geq I(\mathbf{X};\mathbf{Y}). Also, I(\mathbf{X};\mathbf{B})=H(\mathbf{B})-H(\mathbf{B}\mid\mathbf{X})\leq H(\mathbf{B}), and H(\mathbf{B})\leq\log_{2}|\mathbf{B}|, where |\mathbf{B}| denotes the number of possible values of the random vector \mathbf{B}. Since \mathbf{B} is the vector of results of T tests, it takes at most 2^{T} values, so T\geq\log_{2}|\mathbf{B}|. Combining the above inequalities, T\geq I(\mathbf{X};\mathbf{Y}).

Moreover,

H(𝐗)=H(𝐗𝐘)+I(𝐗;𝐘).H(\mathbf{X})=H(\mathbf{X}\mid\mathbf{Y})+I(\mathbf{X};\mathbf{Y}).

Thus, T\geq H(\mathbf{X})-H(\mathbf{X}\mid\mathbf{Y}). Since \mathbf{X} represents the states of n independent nodes, each of which is defective with probability p, we have H(\mathbf{X})=nH(p). Thus,

TnH(p)H(𝐗𝐘).T\geq nH(p)-H(\mathbf{X}\mid\mathbf{Y}). (4)

We now obtain an upper bound for H(𝐗𝐘).H(\mathbf{X}\mid\mathbf{Y}). Define the error random variable EE such that

E={1, if 𝐘𝐗0>ϵn0, if 𝐘𝐗0ϵnE=\begin{cases}1,&\text{ if }\|\mathbf{Y}-\mathbf{X}\|_{0}>\epsilon n\\ 0,&\text{ if }\|\mathbf{Y}-\mathbf{X}\|_{0}\leq\epsilon n\end{cases}

where .0\|.\|_{0} is the number of non-zero elements in a vector. We can bound the conditional entropy as follows

H(𝐗𝐘)\displaystyle H(\mathbf{X}\mid\mathbf{Y}) =H(E,𝐗𝐘)\displaystyle=H(E,\mathbf{X}\mid\mathbf{Y})
=H(E𝐘)+Pr[E=0]H(𝐗𝐘,E=0)\displaystyle=H(E\mid\mathbf{Y})+\operatorname{Pr}[E=0]H(\mathbf{X}\mid\mathbf{Y},E=0)
+Pr[E=1]H(𝐗𝐘,E=1)\displaystyle+\operatorname{Pr}[E=1]H(\mathbf{X}\mid\mathbf{Y},E=1)
1+(1δ)H(𝐗𝐘,E=0)+δH(𝐗)\displaystyle\leq 1+(1-\delta)H(\mathbf{X}\mid\mathbf{Y},E=0)+\delta H(\mathbf{X})
1+(1δ)H(𝐗𝐘,E=0)+nδH(p).\displaystyle\leq 1+(1-\delta)H(\mathbf{X}\mid\mathbf{Y},E=0)+n\delta H(p). (5)

We now upper bound H(𝐗𝐘,E=0)H(\mathbf{X}\mid\mathbf{Y},E=0):

H(\mathbf{X}\mid\mathbf{Y},E=0)=\sum_{i}\operatorname{Pr}[Y=y_{i}\mid E=0]H(\mathbf{X}\mid Y=y_{i},E=0)
\leq\sum_{i}\operatorname{Pr}[Y=y_{i}\mid E=0]\log\left(c^{\prime}{n\choose\epsilon n}\right)
=\log{n\choose\epsilon n}+\log c^{\prime}\simeq nH(\epsilon). (6)

To obtain the inequality, note that given E=0, \mathbf{X} differs from \mathbf{Y} in at most \epsilon n coordinates. Thus, given a value of \mathbf{Y} and that E=0, there are at most \sum_{i=0}^{\epsilon n}{n\choose i}\leq c^{\prime}{n\choose\epsilon n} possible values of \mathbf{X}, where c^{\prime} is a constant. The result follows by recalling that the entropy of any random variable taking m values is at most \log m. The theorem follows by putting together (4), (5) and (6). ∎
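Numerically (our own reading, ignoring the O(1) term), the bound is easy to evaluate; for instance, 10^4 independent nodes with p=0.1, \delta=0.05 and \epsilon=10^{-4} already require over four thousand tests.

```python
from math import log2

def H(x):
    # binary entropy in bits
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def indep_lower_bound(n, p, delta, eps):
    # Theorem 5.1 without the O(1) correction.
    return n * (1 - delta) * (H(p) - H(eps))

print(round(indep_lower_bound(10**4, 0.1, 0.05, 1e-4)))  # ~4441 tests
```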

Analogous to Lemma 2.1, we immediately get the following lemma:

Lemma 5.2.

Let C(G) be the expected number of connected components of G_{r}. Then, under a maximum error target, we have

C(G)(1δ)(H(p)H(ϵ))CrltOPT(G,r,p,δ,ϵ).\displaystyle C(G)(1-\delta)(H(p)-H(\epsilon))\leq\textsc{CrltOPT}(G,r,p,\delta,\epsilon). (7)

Using Lemma 5.2 along with our concentration results on the number of connected components of cycles, trees and grids, we find lower bounds on the number of tests under our maximum error criterion. Recall that the lower bound for the average error \epsilon has the form C(G)(1-\epsilon n)H(p) for a graph G, while under the stronger notion of error we have C(G)(1-\delta)(H(p)-H(\epsilon)). In the regime \epsilon=c/n, c<1, the lower bounds for the average error and the stronger error simplify to C(G)(1-c)H(p) and C(G)(1-\delta)(H(p)-\frac{\log n}{n})\simeq C(G)(1-\delta)H(p), respectively, so when \delta<c we get an improvement.

As discussed, there is a gap between our lower and upper bounds. We next show that the lower bound can be improved by capturing the underlying topology of the graph more fully.

5.2 An Improved Lower Bound for the Star Graph

Theorem 5.1 does not depend on the underlying graph G. However, if we take G into account, we can infer information about the states of the nodes, at least for some graphs. Here, we give an improved lower bound for star graphs, in which one node has degree n-1 and all the other nodes are leaves.

Theorem 5.3.

When the underlying graph GG is a star, we have

n(1δ)(H(r)+(1r)H(p)H(ϵ)1+p(1p)(1r)H(r)+o(1))O(1)\displaystyle n(1-\delta)\left(H(r)+(1-r)H(p)-H(\epsilon)-1+p(1-p)(1-r)H(r^{\prime})+o(1)\right)-O(1)
CrltOPT(G,r,p,δ,ϵ)\displaystyle\leq\textsc{CrltOPT}(G,r,p,\delta,\epsilon) (8)

where r=rr+(1r)(p2+(1p)2)r^{\prime}=\frac{r}{r+(1-r)(p^{2}+(1-p)^{2})}.

Proof.

Let \mathbf{X}, \mathbf{B} and \mathbf{Y} be defined as in the proof of Theorem 5.1. Let \mathbf{G} be a random binary vector whose i'th coordinate is 1 iff the i'th edge of G is realized (which happens with probability r). Then (\mathbf{X},\mathbf{G})\rightarrow\mathbf{B}\rightarrow\mathbf{Y} forms a Markov chain. We are interested in I(\mathbf{X},\mathbf{G};\mathbf{Y}) because

I(𝐗,𝐆;𝐘)I(𝐗,𝐆;𝐁)H(𝐁)log|𝐁|=TI(\mathbf{X},\mathbf{G};\mathbf{Y})\leq I(\mathbf{X},\mathbf{G};\mathbf{B})\leq H(\mathbf{B})\leq\log|\mathbf{B}|=T

where T is the number of tests. Write

I(𝐗,𝐆;𝐘)=H(𝐗,𝐆)H(𝐗,𝐆|𝐘).\displaystyle I(\mathbf{X},\mathbf{G};\mathbf{Y})=H(\mathbf{X},\mathbf{G})-H(\mathbf{X},\mathbf{G}|\mathbf{Y}). (9)

We now bound H(\mathbf{X},\mathbf{G}|\mathbf{Y}). Let E=1 if \|\mathbf{X}-\mathbf{Y}\|_{0}>\epsilon n and E=0 otherwise, as in Theorem 5.1. Then

H(\mathbf{X},\mathbf{G}|\mathbf{Y})=H(\mathbf{X},\mathbf{G},E|\mathbf{Y})
\leq H(E|\mathbf{Y})+\delta H(\mathbf{X},\mathbf{G}|\mathbf{Y},E=1)+(1-\delta)H(\mathbf{X},\mathbf{G}|\mathbf{Y},E=0). (10)

By writing H(E|\mathbf{Y})\leq 1 and H(\mathbf{X},\mathbf{G}|\mathbf{Y},E=1)\leq H(\mathbf{X},\mathbf{G}) and substituting Eq. (10) into Eq. (9), we get

I(𝐗,𝐆;𝐘)(1δ)(H(𝐗,𝐆)H(𝐗,𝐆|𝐘,E=0))1.\displaystyle I(\mathbf{X},\mathbf{G};\mathbf{Y})\geq(1-\delta)(H(\mathbf{X},\mathbf{G})-H(\mathbf{X},\mathbf{G}|\mathbf{Y},E=0))-1. (11)

As G is a tree and every component is independently defective with probability p, we can write

H(𝐗,𝐆)=H(𝐆)+H(𝐗|𝐆)nH(r)+(1r)nH(p)=n(H(r)+(1r)H(p)).H(\mathbf{X},\mathbf{G})=H(\mathbf{G})+H(\mathbf{X}|\mathbf{G})\simeq nH(r)+(1-r)nH(p)=n\left(H(r)+(1-r)H(p)\right).

Now we bound H(\mathbf{X},\mathbf{G}|\mathbf{Y},E=0)\leq H(\mathbf{X}|\mathbf{Y},E=0)+H(\mathbf{G}|\mathbf{X}). For the first term, in Theorem 5.1 we found

H(𝐗|𝐘,E=0)nH(ϵ).\displaystyle H(\mathbf{X}|\mathbf{Y},E=0)\lesssim nH(\epsilon). (12)

To bound the second term, note that the underlying graph G is a star and, with probability 1-1/n^{2}, the number of nodes in a different state than the center is 2np(1-p)(1-r)\pm o(n); the contribution of the remaining realizations to H(\mathbf{G}|\mathbf{X}) is only o(1/n). The edges incident to such nodes cannot be realized, so n(1-2p(1-p)(1-r)) edges remain uncertain, and knowing that the two endpoints are in the same state, each such edge is realized with probability r^{\prime}=\frac{r}{r+(1-r)(p^{2}+(1-p)^{2})}. So

H(𝐆|𝐗)n(12p(1p)(1r))H(r).\displaystyle H(\mathbf{G}|\mathbf{X})\leq n\left(1-2p(1-p)(1-r)\right)H(r^{\prime}). (13)

Finally, by substituting all of the above in Eq. (11), we get a lower bound on the number of tests:

T\displaystyle T I(𝐗,𝐆;𝐘)\displaystyle{\geq}I(\mathbf{X},\mathbf{G};\mathbf{Y})
a(1δ)(H(𝐗,𝐆)H(𝐗,𝐆|𝐘,E=0))\displaystyle\stackrel{{\scriptstyle a}}{{\geq}}(1-\delta)(H(\mathbf{X},\mathbf{G})-H(\mathbf{X},\mathbf{G}|\mathbf{Y},E=0))
=(1δ)(H(𝐆)+H(𝐗|𝐆)H(𝐗|𝐘,E=0)H(𝐆|𝐗))1\displaystyle=(1-\delta)(H(\mathbf{G})+H(\mathbf{X}|\mathbf{G})-H(\mathbf{X}|\mathbf{Y},E=0)-H(\mathbf{G}|\mathbf{X}))-1
b(1δ)(nH(r)+n(1r)H(p)nH(ϵ)H(𝐆|𝐗)+o(nH(r)))1\displaystyle\stackrel{{\scriptstyle b}}{{\geq}}(1-\delta)(nH(r)+n(1-r)H(p)-nH(\epsilon)-H(\mathbf{G}|\mathbf{X})+o(nH(r)))-1
c(1δ)(n(H(r)+(1r)H(p))(nH(ϵ)+n(1p(1p)(1r))H(r))+o(n))1\displaystyle\stackrel{{\scriptstyle c}}{{\geq}}(1-\delta)(n(H(r)+(1-r)H(p))-(nH(\epsilon)+n(1-p(1-p)(1-r))H(r^{\prime}))+o(n))-1
=n(1δ)(H(r)+(1r)H(p)H(ϵ)1+p(1p)(1r)H(r)+o(1))O(1)\displaystyle=n(1-\delta)(H(r)+(1-r)H(p)-H(\epsilon)-1+p(1-p)(1-r)H(r^{\prime})+o(1))-O(1)

where (a) follows from Eq. (11), (b) from Eq. (12), and (c) from Eq. (13). ∎

Figure 5: A comparison of the star-specific lower bound (8) and the generic lower bound (7) for p=0.1 and \epsilon=0.0001

We compare the lower bound in (8) (the star-specific bound) with the lower bound in (7) (the generic bound) in Figure 5. One sees that the lower bound in (8) is strictly larger than (7) for a range of r around r=\frac{1}{2}. This range corresponds to cases where the uncertainty about the edge set of G_{r} is high, and therefore knowledge about the existence (or absence) of an edge is very informative, as captured in our way of upper bounding H(\mathbf{G}|\mathbf{X}). Even though the improvement offered by the lower bound in (8) is relatively small, it suggests that generic lower bounds are likely to be loose for our correlated group testing problem, and that the structure of the underlying correlation graph G needs to be taken into account to obtain tighter lower bounds.
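The comparison itself is straightforward to reproduce. The sketch below (ours) evaluates both bounds per node, ignoring the o(1) and O(1) terms, and approximates the generic bound for the star by C(G)\simeq n(1-r) components (each leaf edge is absent with probability 1-r); this approximation is an assumption of the sketch rather than a statement from the paper.

```python
from math import log2

def H(x):
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def star_bound_per_node(r, p, delta, eps):
    # Theorem 5.3, per node, without the o(1)/O(1) terms.
    rp = r / (r + (1 - r) * (p ** 2 + (1 - p) ** 2))
    return (1 - delta) * (H(r) + (1 - r) * H(p) - H(eps) - 1
                          + p * (1 - p) * (1 - r) * H(rp))

def generic_bound_per_node(r, p, delta, eps):
    # Lemma 5.2 with roughly n(1-r) components for the star.
    return (1 - r) * (1 - delta) * (H(p) - H(eps))

p, eps, delta = 0.1, 1e-4, 0.05
for r in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(r, round(star_bound_per_node(r, p, delta, eps), 4),
          round(generic_bound_per_node(r, p, delta, eps), 4))
```

Around r=1/2 the star-specific value exceeds the generic one, while away from it the generic bound dominates (the star bound can even go negative, i.e., become vacuous), consistent with Figure 5.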

5.3 Upper Bounds

We now find upper bounds similar to those provided in Sections 3.2 and 4.1. Under an average error target, we partitioned the graph and computed the incurred average error. Under a maximum error target, however, we do not know what fraction of the realizations with average error \epsilon n actually has fewer than \epsilon n mispredicted nodes, so we cannot reuse those results and instead need to prove concentration around the average. For cycles and grids, the subgraphs we designed were independent of each other, in the sense that the connectivity of one subgraph does not change the probability of connectivity of the other subgraphs. Hence, using Hoeffding's bound, we prove the following theorem for a maximum error target.

Theorem 5.4.

Consider a cycle of length nn and let l=max{log[1/(1ϵ/2)]log1/r,1}l=\max\{\frac{\log[1/(1-\epsilon/2)]}{\log 1/r},1\}, δ>2eΘ(ϵ2nlog(1/r)log(11ϵ/2))\delta>2e^{-\Theta(\frac{\epsilon^{2}n\log(1/r)}{\log(\frac{1}{1-\epsilon/2})})}, and ϵ<δ/2\epsilon^{\prime}<\delta/2. There is an algorithm that uses IndepOPT(n/l,p,ϵ)\textsc{IndepOPT}(\lceil n/l\rceil,p,\epsilon^{\prime}) tests and finds the defective set with maximum error with parameters ϵ\epsilon and δ\delta, i.e.

CrltOPT(Cycle,r,p,δ,ϵ)IndepOPT(n/l,p,ϵ)\textsc{CrltOPT}(Cycle,r,p,\delta,\epsilon)\leq\textsc{IndepOPT}(\lceil n/l\rceil,p,\epsilon^{\prime})
Proof.

We use the same algorithm and the same subgraphs as in Theorem 3.2. Recall that each subgraph is connected with probability r^{l-1}, so the average error is at most \epsilon n/2. The number of mispredicted nodes in each subgraph is in [0,l], and the subgraphs are independent of each other, meaning the connectivity of one group does not change the connectivity of another. So by Hoeffding's bound, with probability at least 1-\exp[-\Theta(\frac{\epsilon^{2}n\log(1/r)}{\log(\frac{1}{1-\epsilon/2})})]>1-\delta/2, the error is at most \epsilon n. Also, with probability more than 1-\epsilon^{\prime}>1-\delta/2, the classic group tests detect all the defective candidate nodes with no error. On the intersection of these two events, which has probability more than 1-\delta, the number of mispredicted nodes is at most \epsilon n, and we are done. ∎

The same reasoning works for the subgrids of a grid and leads to the same upper bound with parameters \delta>2e^{-\Theta(\frac{\epsilon^{2}n\log(1/r)}{\log(\frac{1}{1-\epsilon/2})})} and \epsilon^{\prime}<\delta/2. But for trees, we cannot use Hoeffding's bound, because the connectivity of one subgraph can affect the others; for instance, the absence of a single edge might disconnect several groups, so the groups are no longer independent. This dependency violates the conditions needed to apply Hoeffding's inequality. To fix the issue, we use the node exposure martingale process with a suitable graph function to prove the desired concentration.

Before stating the new theorem for trees, we need the following lemma in order to apply the concentration result for the node exposure martingale defined in Section 3.1.

Lemma 5.5.

Let g_{1},g_{2},\ldots,g_{\lceil n/l\rceil} be the subgraphs formed in Lemma 3.5. Let f(H) be the number of connected subgraphs g_{i} in graph H. Then there is an order of node exposure such that at each step, the value of f changes by at most one.

Proof.

Let g_{1},g_{2},\ldots,g_{\lceil n/l\rceil} be the subgraphs formed in Lemma 3.5, in this order. Consider the following order of node exposure: first expose all the nodes of the last subgraph, g_{\lceil n/l\rceil}, in some order; then expose the nodes of the subgraph before it, g_{\lceil n/l\rceil-1}, and so on, until the nodes of the first subgraph, g_{1}, are exposed.

Consider a node v in subgraph g_{i} at the moment it is exposed. Note that no subgraph g_{j} with j>i can become connected after exposing v: by the construction of the subgraphs, all the nodes of the connecting closure of g_{j} lie in g_{k} with k>j, and those nodes are already exposed. Also, no subgraph g_{j} with j<i can become connected, because none of g_{j}'s nodes has been exposed yet. So f(H) can change by at most one, namely when the last exposed node of g_{i} makes g_{i} connected, and the lemma is proved. ∎

The above lemma allows us to use Azuma's inequality for the groups constructed for trees, as described in the following theorem.

Theorem 5.6.

Consider a tree with nn nodes and let l=max{log[1/(1ϵ/2)]2log1/r,1}l=\max\{\frac{\log[1/(1-\epsilon/2)]}{2\log 1/r},1\}, δ>2eΘ(ϵ2nlog(1/r)log(11ϵ/2))\delta>2e^{-\Theta(\frac{\epsilon^{2}n\log(1/r)}{\log(\frac{1}{1-\epsilon/2})})}, and ϵ<δ/2\epsilon^{\prime}<\delta/2. Then there is an algorithm that uses IndepOPT(n/l,p,ϵ)\textsc{IndepOPT}(\lceil n/l\rceil,p,\epsilon^{\prime}) tests and finds the defective set with maximum error with parameters ϵ\epsilon and δ\delta, i.e.

CrltOPT(Tree,r,p,δ,ϵ)IndepOPT(n/l,p,ϵ)\textsc{CrltOPT}(Tree,r,p,\delta,\epsilon)\leq\textsc{IndepOPT}(\lceil n/l\rceil,p,\epsilon^{\prime})
Proof.

We use the algorithm and the same subgraphs introduced in Theorem 3.2. Recall from the proof of Theorem 3.6 that each group is connected with probability at least r^{2l}, so the average error is at most \epsilon n/2. We cannot use Hoeffding's bound to show concentration around the average here; instead, we use a node exposure martingale. Note that Theorem 2.5 for the node exposure martingale only applies when the graph function is node Lipschitz, meaning that when H_{1} and H_{2} differ in one node, |f(H_{1})-f(H_{2})|\leq 1. But if we can find an order of exposed nodes v_{1},\ldots,v_{n} such that the graph H_{i} on the first i nodes satisfies |f(H_{i+1})-f(H_{i})|\leq 1, then we can still use Azuma's inequality. Let f(H) be the number of connected subgraphs g_{i} in H. By Lemma 5.5, there is an order such that |f(H_{i+1})-f(H_{i})|\leq 1, so we can use the concentration theorem (Theorem 2.5) for the number of connected groups in the random graph G_{r}. We then have the same error concentration as in Theorem 5.4 (equivalent to Hoeffding's bound), and repeating that argument completes the proof. ∎

6 Conclusion

In this paper, we consider group testing strategies for identifying defective items when the defects of different nodes are correlated. The correlation is modeled through an underlying graph, in which the degree of correlation between the defects of two items depends on the distance between the corresponding nodes. We relate the problem of designing testing strategies in the presence of such correlation to the problem in which the defects are independent. We subsequently obtain testing strategies in terms of those already known for independent defects for a large class of underlying graphs, namely trees, cycles, grids, and stochastic block models; this provides an upper bound on the number of tests needed to ensure the desired error bounds. We also obtain fundamental limits, i.e., lower bounds on the minimum number of tests required to ensure the same error bounds, using bounds already known for independent defects. The bounds are obtained through a novel combination of edge exposure martingale theory and graph partitioning techniques.

We now describe some directions for future research stemming from the above results. There is a gap between the upper and lower bounds, which may be because there is scope for improvement in the lower bound, possibly by utilizing the specific structure of the underlying correlation graphs. Another important area for future work is the development of testing strategies for general graphs. Toward that end, one may envision partitioning the graph into structures such as trees, cycles, grids, and stochastic block models, for which we have identified in this paper testing strategies with guarantees on error rates and the number of tests required. Furthermore, designing group testing strategies when the defects evolve dynamically over time on the graphs in question remains open; this problem has started receiving attention [SNFD22, AU21].

References

  • [ACO21] Surin Ahn, Wei-Ning Chen, and Ayfer Ozgur. Adaptive group testing on networks with community structure. arXiv preprint arXiv:2101.02405, 2021.
  • [AJS19] Matthew Aldridge, Oliver Johnson, and Jonathan Scarlett. Group testing: an information theory perspective. arXiv preprint arXiv:1902.06002, 2019.
  • [Ald20] Matthew Aldridge. Conservative two-stage group testing. arXiv preprint arXiv:2005.06617, 2020.
  • [AS16] Noga Alon and Joel H Spencer. The probabilistic method. John Wiley & Sons, 2016.
  • [AU21] Batuhan Arasli and Sennur Ulukus. Group testing with a graph infection spread model. arXiv preprint arXiv:2101.05792, 2021.
  • [Ava08] Jean-Christophe Aval. Multivariate Fuss–Catalan numbers. Discrete Mathematics, 308(20):4660–4669, 2008.
  • [BJ21] Paolo Bertolotti and Ali Jadbabaie. Network group testing. 2021.
  • [BMR21] Vincent Brault, Bastien Mallein, and Jean-François Rupprecht. Group testing as a strategy for covid-19 epidemiological monitoring and community surveillance. PLoS computational biology, 17(3):e1008726, 2021.
  • [BMTW84] Toby Berger, Nader Mehravari, Don Towsley, and Jack Wolf. Random multiple-access communication and group testing. IEEE Transactions on Communications, 32(7):769–779, 1984.
  • [CGM21] Mahdi Cheraghchi, Ryan Gabrys, and Olgica Milenkovic. Semiquantitative group testing in at most two rounds. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 1973–1978. IEEE, 2021.
  • [CHKV11] Mahdi Cheraghchi, Ali Hormati, Amin Karbasi, and Martin Vetterli. Group testing with probabilistic tests: Theory, design and application. IEEE Transactions on Information Theory, 57(10):7057–7067, 2011.
  • [Chl01] Bogdan S Chlebus. Randomized communication in radio networks. Combinatorial Optimization, 9(1):401–456, 2001.
  • [CKMS12] Mahdi Cheraghchi, Amin Karbasi, Soheil Mohajer, and Venkatesh Saligrama. Graph-constrained group testing. IEEE Transactions on Information Theory, 58(1):248–262, 2012.
  • [COGHKL20] Amin Coja-Oghlan, Oliver Gebhard, Max Hahn-Klimroth, and Philipp Loick. Optimal group testing. In Conference on Learning Theory, pages 1374–1388. PMLR, 2020.
  • [CTB+20] Marco Cuturi, Olivier Teboul, Quentin Berthet, Arnaud Doucet, and Jean-Philippe Vert. Noisy adaptive group testing using bayesian sequential experimental design. arXiv preprint arXiv:2004.12508, 2020.
  • [DHH00] Dingzhu Du, Frank K Hwang, and Frank Hwang. Combinatorial group testing and its applications, volume 12. World Scientific, 2000.
  • [Dor43] Robert Dorfman. The detection of defective members of large populations. The Annals of Mathematical Statistics, 14(4):436–440, 1943.
  • [FKKM97] Martin Farach, Sampath Kannan, Emanuel Knill, and Shan Muthukrishnan. Group testing problems with sequences in experimental molecular biology. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 357–367. IEEE, 1997.
  • [GG20] Christian Gollier and Olivier Gossner. Group testing against covid-19. Technical report, EconPol Policy Brief, 2020.
  • [GH08] Michael T Goodrich and Daniel S Hirschberg. Improved adaptive group testing algorithms with applications to multiple access channels and dead sensor diagnosis. Journal of Combinatorial Optimization, 15(1):95–121, 2008.
  • [Goe97] Andreas Goerdt. The giant component threshold for random regular graphs with edge faults. In International Symposium on Mathematical Foundations of Computer Science, pages 279–288. Springer, 1997.
  • [KM95] Emanuel Knill and S Muthukrishnan. Group testing problems in experimental molecular biology. arXiv preprint math/9505211, 1995.
  • [LCH+14] Tongxin Li, Chun Lam Chan, Wenhao Huang, Tarik Kaced, and Sidharth Jaggi. Group testing with prior statistics. In 2014 IEEE International Symposium on Information Theory, pages 2346–2350. IEEE, 2014.
  • [MNB+21] Leon Mutesa, Pacifique Ndishimye, Yvan Butera, Jacob Souopgui, Annette Uwineza, Robert Rutayisire, Ella Larissa Ndoricimpaye, Emile Musoni, Nadine Rujeni, Thierry Nyatanyi, et al. A pooled testing strategy for identifying sars-cov-2 at low prevalence. Nature, 589(7841):276–280, 2021.
  • [NSFD21] Pavlos Nikolopoulos, Sundara Rajan Srinivasavaradhan, Christina Fragouli, and Suhas N Diggavi. Group testing for community infections. IEEE BITS the Information Theory Magazine, 1(1):57–68, 2021.
  • [NSG+21] Pavlos Nikolopoulos, Sundara Rajan Srinivasavaradhan, Tao Guo, Christina Fragouli, and Suhas Diggavi. Group testing for connected communities. In International Conference on Artificial Intelligence and Statistics, pages 2341–2349. PMLR, 2021.
  • [SNFD22] Sundara Rajan Srinivasavaradhan, Pavlos Nikolopoulos, Christina Fragouli, and Suhas Diggavi. Dynamic group testing to control and monitor disease progression in a population. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 2255–2260. IEEE, 2022.
  • [SW18] Bruce Spang and Mary Wootters. Unconstraining graph-constrained group testing. arXiv preprint arXiv:1809.03589, 2018.
  • [VFH+21] Claudio M Verdun, Tim Fuchs, Pavol Harar, Dennis Elbrächter, David S Fischer, Julius Berner, Philipp Grohs, Fabian J Theis, and Felix Krahmer. Group testing for sars-cov-2 allows for up to 10-fold efficiency increase across realistic scenarios and testing strategies. Frontiers in Public Health, page 1205, 2021.