Model Counting meets $F_0$ Estimation
Abstract.
Constraint satisfaction problems (CSP's) and data stream models are two powerful abstractions to capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSP's and computation of zeroth frequency moments ($F_0$) for data streams.
Our investigations lead us to observe a striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and $F_0$ computation. We design a recipe for translating algorithms developed for $F_0$ estimation to those for model counting, resulting in new algorithms for model counting. We then observe that algorithms in the context of distributed streaming can be transformed into distributed algorithms for model counting. We next turn our attention to viewing streaming from the lens of counting and show that framing $F_0$ estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works. In particular, our view yields a state-of-the-art algorithm for multidimensional range-efficient $F_0$ estimation with a simpler analysis.
1. Introduction
Constraint Satisfaction Problems (CSP's) and the data stream model are two core themes in computer science with a diverse set of applications, ranging over probabilistic reasoning, networks, databases, verification, and the like. Model counting and computation of the zeroth frequency moment ($F_0$) are fundamental problems for CSP's and the data stream model respectively. This paper is motivated by our observation that despite the usage of similar algorithmic techniques for the two problems, the developments in the two communities have, surprisingly, evolved separately, and rarely has a paper from one community been cited by the other.
Given a set of constraints $\varphi$ over a set of variables in a finite domain $\mathcal{D}$, the problem of model counting is to estimate the number of solutions of $\varphi$. We are often interested in the case where $\varphi$ is restricted to a special class of representations such as Conjunctive Normal Form (CNF) and Disjunctive Normal Form (DNF). A data stream over a domain $[N]$ is represented by $\mathbf{a} = a_1, a_2, \ldots, a_m$ wherein each item $a_i \in [N]$. The zeroth frequency moment, denoted $F_0$, of $\mathbf{a}$ is the number of distinct elements appearing in $\mathbf{a}$, i.e., $|\{a_1, \ldots, a_m\}|$ (traditionally, the $a_i$'s are singletons; we will also be interested in the case when the $a_i$'s are sets). The fundamental nature of model counting and $F_0$ computation over data streams has led to intense interest from theoreticians and practitioners alike in the respective communities for the past few decades.
The starting point of this work is the confluence of two viewpoints. The first viewpoint contends that some of the algorithms for model counting can conceptually be thought of as operating on the stream of the solutions of the constraints. The second viewpoint contends that a stream can be viewed as a DNF formula, and the problem of $F_0$ estimation is similar to model counting. These viewpoints make it natural to believe that algorithms developed in the streaming setting can be directly applied to model counting, and vice versa. We explore this connection and indeed, design new algorithms for model counting inspired by algorithms for estimating $F_0$ in data streams. By exploring this connection further, we design new algorithms to estimate $F_0$ for streaming sets that are succinctly represented by constraints. To put our contributions in context, we briefly survey the historical development of algorithmic frameworks in both model counting and $F_0$ estimation and point out the similarities.
Model Counting
The complexity-theoretic study of model counting was initiated by Valiant, who showed that this problem, in general, is #P-complete (Valiant, 1979). This motivated researchers to investigate approximate model counting, and in particular to achieve $(\varepsilon,\delta)$-approximation schemes. The complexity of approximate model counting depends on the representation of $\varphi$. When $\varphi$ is represented as a CNF formula, designing an efficient $(\varepsilon,\delta)$-approximation is NP-hard (Stockmeyer, 1983). In contrast, when it is represented as a DNF formula, model counting admits an FPRAS (fully polynomial-time randomized approximation scheme) (Karp and Luby, 1983; Karp et al., 1989). We will use #CNF to refer to the case when $\varphi$ is a CNF formula, and #DNF to refer to the case when $\varphi$ is a DNF formula.
For #CNF, Stockmeyer (Stockmeyer, 1983) provided a hashing-based randomized procedure that can compute an $(\varepsilon,\delta)$-approximation of $|Sol(\varphi)|$ within time polynomial in $|\varphi|$ and $1/\varepsilon$, given access to an NP oracle. Building on Stockmeyer's approach and motivated by the unprecedented breakthroughs in the design of SAT solvers, researchers have proposed a series of algorithmic improvements that have allowed the hashing-based techniques for approximate model counting to scale to formulas involving hundreds of thousands of variables (Gomes et al., 2007b; Chakraborty et al., 2013b; Ermon et al., 2013; Ivrii et al., 2015; Chistikov et al., 2015; Chakraborty et al., 2016; Achlioptas and Theodoropoulos, 2017; Soos and Meel, 2019; Soos et al., 2020). The practical implementations substitute the NP oracle with SAT solvers. In the context of model counting, we are primarily interested in time complexity and therefore, the number of NP queries is of key importance. The emphasis on the number of NP calls also stems from practice, as practical implementations of model counting algorithms have been shown to spend over 99% of their time in the underlying SAT calls (Soos and Meel, 2019).
Karp and Luby (Karp and Luby, 1983) proposed the first FPRAS for #DNF, which was subsequently improved in follow-up works (Karp et al., 1989; Dagum et al., 2000). Chakraborty, Meel, and Vardi (Chakraborty et al., 2016) demonstrated that the hashing-based framework can be extended to #DNF, thereby providing a unified framework for both #CNF and #DNF. Meel, Shrotri, and Vardi (Meel et al., 2017, 2018, 2019) subsequently improved the complexity of the hashing-based approach for #DNF and observed that hashing-based techniques achieve better scalability than that of Monte Carlo techniques.
Zeroth Frequency Moment Estimation
Estimating $(\varepsilon,\delta)$-approximations of the frequency moments ($F_k$) is a central problem in the data streaming model (Alon et al., 1999). In particular, considerable work has been done in designing algorithms for estimating the zeroth frequency moment ($F_0$), the number of distinct elements in the stream. While designing streaming algorithms, the primary resource concerns are two-fold: space complexity and processing time per element. For an algorithm to be considered efficient, these should be $\mathrm{poly}(\log N, \frac{1}{\varepsilon})$, where $N$ is the size of the universe. (We ignore the $O(\log\frac{1}{\delta})$ factor in this discussion.)
The first algorithm for computing $F_0$ with a constant factor approximation was proposed by Flajolet and Martin, who assumed the existence of hash functions with ideal properties, resulting in an algorithm with undesirable space complexity (Flajolet and Martin, 1985). In their seminal work, Alon, Matias, and Szegedy designed an $O(\log N)$ space algorithm for $F_0$ with a constant approximation ratio that employs 2-universal hash functions (Alon et al., 1999). Subsequent investigations into hashing-based schemes by Gibbons and Tirthapura (Gibbons and Tirthapura, 2001) and Bar-Yossef, Kumar, and Sivakumar (Bar-Yossef et al., 2002) provided $(\varepsilon,\delta)$-approximation algorithms with space and time complexity $\tilde{O}(\frac{1}{\varepsilon^2}\log N)$. Subsequently, Bar-Yossef et al. proposed three algorithms with improved space and time complexity (Bar-Yossef et al., [n. d.]). While the three algorithms employ hash functions, they differ conceptually in the usage of relevant random variables for the estimation of $F_0$. This line of work resulted in the development of an algorithm with optimal space complexity $O(\frac{1}{\varepsilon^2} + \log N)$ and $O(1)$ update time (Kane et al., 2010).
The above-mentioned works are in the setting where each data item is an element of the universe. Subsequently, there has been a series of results on estimating $F_0$ in richer scenarios, with particular focus on handling cases where each item is a structured object such as a list or a multidimensional range (Bar-Yossef et al., 2002; Pavan and Tirthapura, 2007; Tirthapura and Woodruff, 2012; Sun and Poon, 2009).
The Road to a Unifying Framework
As mentioned above, the algorithmic developments for model counting and $F_0$ estimation have largely relied on the usage of hashing-based techniques, and yet these developments have, surprisingly, been separate, and rarely has a work from one community been cited by the other. In this context, we wonder whether it is possible to bridge this gap, and whether such an exercise would contribute new algorithms for model counting as well as for $F_0$ estimation. The main conceptual contribution of this work is an affirmative answer to the above question. First, we point out that two well-known algorithms, Stockmeyer's #CNF algorithm (Stockmeyer, 1983), further refined by Chakraborty et al. (Chakraborty et al., 2016), and Gibbons and Tirthapura's $F_0$ estimation algorithm (Gibbons and Tirthapura, 2001), are essentially the same.
The core idea of the hashing-based technique of Stockmeyer's and Chakraborty et al.'s scheme is to use pairwise independent hash functions to partition the solution space (the satisfying assignments of a CNF formula) into roughly equal and small cells, wherein a cell is small if the number of solutions in it is less than a pre-computed threshold, denoted thresh. Then a good estimate for the number of solutions is the number of solutions in an arbitrary cell multiplied by the number of cells. To determine the appropriate number of cells, the solution space is iteratively partitioned as follows. At the $m$-th iteration, a hash function with range $\{0,1\}^m$ is considered, resulting in the cells $h^{-1}(y)$ for each $y \in \{0,1\}^m$. An NP oracle can be employed to check whether a particular cell (for example, $h^{-1}(0^m)$) is small by enumerating its solutions one by one until we have either obtained thresh$+1$ solutions or we have exhaustively enumerated all of them. If the cell is small, then the algorithm outputs $t \times 2^m$ as its estimate, where $t$ is the number of solutions in the cell. If the cell is not small, then the algorithm moves on to the next iteration, where a hash function with range $\{0,1\}^{m+1}$ is considered.
We now describe Gibbons and Tirthapura's algorithm for $F_0$ estimation, which we call the Bucketing algorithm. We will assume the universe $[N] = \{0,1\}^n$. The algorithm maintains a bucket of size thresh and starts by picking a hash function $h : \{0,1\}^n \to \{0,1\}^n$. It iterates over sampling levels. At level $m$, when a data item $x$ comes in, if $h(x)$ starts with $0^m$, then $x$ is added to the bucket. If the bucket overflows, then the sampling level is increased to $m+1$ and all elements $x$ in the bucket for which $h(x)$ does not start with $0^{m+1}$ are deleted. At the end of the stream, the value $t \times 2^m$ is output as the estimate, where $t$ is the number of elements in the bucket and $m$ is the final sampling level.
These two algorithms are conceptually the same. In the Bucketing algorithm, at sampling level $m$, only the first $m$ bits of the hashed value are examined; this is equivalent to considering a hash function with range $\{0,1\}^m$. Thus the bucket is nothing but all the elements in the stream that belong to the cell $h^{-1}(0^m)$. The final estimate is the number of elements in the bucket times the number of cells, identical to Chakraborty et al.'s algorithm. In both algorithms, to obtain an $(\varepsilon,\delta)$-approximation, the value thresh is chosen as $O(\frac{1}{\varepsilon^2})$ and the median of $O(\log\frac{1}{\delta})$ independent estimations is output. A sketch of the streaming side of this correspondence follows.
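To make the correspondence concrete, the following is a minimal Python sketch of a single run of the Bucketing algorithm, under stated assumptions: the hash function is realized as a random XOR hash $h(x) = Ax + b$ over GF(2) (a stand-in for the 2-wise independent families of Section 2), and all function and variable names are ours, not from the literature. An $(\varepsilon,\delta)$-estimate is obtained by setting thresh to $O(1/\varepsilon^2)$ and taking the median of $O(\log(1/\delta))$ independent runs.

```python
import random

def bucketing_f0(stream, n, thresh):
    """One run of the Gibbons-Tirthapura bucketing F0 estimator (sketch).

    Elements of `stream` are n-bit integers. Uses a random XOR hash
    h(x) = Ax + b over GF(2) as the pairwise independent hash function.
    """
    A = [random.getrandbits(n) for _ in range(n)]  # rows of A as bitmasks
    b = random.getrandbits(n)

    def h(x):
        y = 0
        for i in range(n):  # bit i of h(x) = <A[i], x> XOR b_i over GF(2)
            y |= ((bin(A[i] & x).count("1") & 1) ^ ((b >> i) & 1)) << i
        return y

    m = 0            # current sampling level
    bucket = set()   # hashed values whose first m bits are zero
    for x in stream:
        hx = h(x)
        if hx >> (n - m) == 0:              # h(x) starts with 0^m
            bucket.add(hx)
            while len(bucket) > thresh:     # overflow: raise the level
                m += 1
                bucket = {y for y in bucket if y >> (n - m) == 0}
    return len(bucket) * 2 ** m             # |cell| times number of cells
```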
Our Contributions
Motivated by the conceptual identity between the two algorithms, we further explore the connections between algorithms for model counting and $F_0$ estimation.
(1) We formalize a recipe to transform streaming algorithms for $F_0$ estimation into algorithms for model counting. Such a transformation yields new $(\varepsilon,\delta)$-approximate algorithms for model counting, which differ from currently known algorithms. Recent studies in the field of automated reasoning have highlighted the need for diverse approaches (Xu et al., 2008), and similar studies in the context of #DNF provided strong evidence for the power of diversity of approaches (Meel et al., 2018). In this context, these newly obtained algorithms open up several new interesting directions of research, ranging from the development of MaxSAT solvers with native XOR support to open problems in designing FPRAS schemes.
(2) Given the central importance of #DNF (and its weighted variant) due to a recent surge of interest in scalable techniques for provenance in probabilistic databases (Senellart, 2018, 2019), a natural question is whether one can design efficient techniques in the distributed setting. In this work, we initiate the study of distributed #DNF. We then show that the transformation recipe from $F_0$ estimation to model counting allows us to view the design of distributed #DNF algorithms through the lens of distributed functional monitoring, which is well studied in the data streaming literature.
(3) Building upon the connection between model counting and $F_0$ estimation, we design new algorithms to estimate $F_0$ over structured set streams where each element of the stream is a (succinct representation of a) subset of the universe. Thus, the stream is $S_1, S_2, \ldots$ where each $S_i \subseteq [N]$, and the goal is to estimate the $F_0$ of the stream, i.e., the size of $\cup_i S_i$. In this scenario, a traditional streaming algorithm that processes each element of each set incurs high per-item processing time and is inefficient. Thus the goal is to design algorithms whose per-item time (the time to process each $S_i$) is poly-logarithmic in the size of the universe. Structured set streams considered in the literature include 1-dimensional and multidimensional ranges (Pavan and Tirthapura, 2007; Tirthapura and Woodruff, 2012). Several interesting problems, such as max-dominance norm (Cormode and Muthukrishnan, 2003), counting triangles in graphs (Bar-Yossef et al., 2002), and the distinct summation problem (Considine et al., 2004), can be reduced to computing $F_0$ over such ranges.
We observe that several structured sets can be represented as small DNF formulae, and thus $F_0$ computation over these structured set data streams can be viewed as a special case of #DNF. Using the hashing-based techniques for #DNF, we obtain a general recipe for a rich class of structured sets that includes multidimensional ranges, multidimensional arithmetic progressions, and affine spaces. Prior work on single and multidimensional ranges (please refer to Remark 1 in Section 5 for a discussion of the earlier work on multidimensional ranges (Tirthapura and Woodruff, 2012)) had to rely on involved analysis for each of the specific instances, while our work provides a general recipe for both analysis and implementation.
Organization
We present notation and preliminaries in Section 2. We then present the transformation of $F_0$ estimation to model counting in Section 3. We then focus on distributed #DNF in Section 4. We then present the transformation of model counting algorithms to structured set streaming algorithms in Section 5. We conclude in Section 6 with a discussion of future research directions.
We would like to emphasize that the primary objective of this work is to provide a unifying framework for $F_0$ estimation and model counting. Therefore, when designing new algorithms based on the transformation recipes, we intentionally focus on conceptually cleaner algorithms and leave potential improvements in time and space complexity for future work.
2. Notation
We assume the universe $[N] = \{0,1\}^n$, i.e., $N = 2^n$. We write $\Pr_{\Omega}[E]$ to denote the probability of outcome $E$ when sampling from a probability space $\Omega$. For brevity, we omit $\Omega$ when it is clear from the context.
$F_0$ Estimation
A data stream $\mathbf{a} = a_1, a_2, \ldots, a_m$ over domain $[N]$ can be represented as a sequence wherein each item $a_i \in [N]$. Let $\mathbf{a}_u = \{a_1, \ldots, a_m\}$ denote the set of distinct elements of the stream. The $F_0$ of the stream $\mathbf{a}$ is $|\mathbf{a}_u|$. We are often interested in a probably approximately correct scheme that returns an $(\varepsilon,\delta)$-estimate $c$ of $F_0$, i.e., $\Pr\big[(1-\varepsilon) F_0 \leq c \leq (1+\varepsilon) F_0\big] \geq 1-\delta$.
Model Counting
Let $X = \{x_1, \ldots, x_n\}$ be a set of Boolean variables. For a Boolean formula $\varphi$, let $Vars(\varphi)$ denote the set of variables appearing in $\varphi$. Throughout the paper, unless otherwise stated, we will assume that the relationship $n = |Vars(\varphi)|$ holds. We denote the set of all satisfying assignments of $\varphi$ by $Sol(\varphi)$.
The propositional model counting problem is to compute $|Sol(\varphi)|$ for a given formula $\varphi$. A probably approximately correct (or PAC) counter is a probabilistic algorithm that takes as inputs a formula $\varphi$, a tolerance $\varepsilon > 0$, and a confidence parameter $\delta \in (0,1]$, and returns an $(\varepsilon,\delta)$-estimate $c$, i.e., $\Pr\big[(1-\varepsilon)|Sol(\varphi)| \leq c \leq (1+\varepsilon)|Sol(\varphi)|\big] \geq 1-\delta$.
PAC guarantees are also sometimes referred to as $(\varepsilon,\delta)$-guarantees. We use #CNF (resp. #DNF) to refer to the model counting problem when $\varphi$ is represented in CNF (resp. DNF).
k-wise Independent hash functions
Let $n, m \in \mathbb{N}$ and let $\mathcal{H}(n,m)$ be a family of hash functions mapping $\{0,1\}^n$ to $\{0,1\}^m$. We use $h \xleftarrow{R} \mathcal{H}(n,m)$ to denote the probability space obtained by choosing a function $h$ uniformly at random from $\mathcal{H}(n,m)$.
Definition 1.
A family of hash functions $\mathcal{H}(n,m)$ is $k$-wise independent if for all distinct $x_1, \ldots, x_k \in \{0,1\}^n$ and all $\alpha_1, \ldots, \alpha_k \in \{0,1\}^m$,
$$\Pr_{h \xleftarrow{R} \mathcal{H}(n,m)}\big[(h(x_1) = \alpha_1) \wedge \cdots \wedge (h(x_k) = \alpha_k)\big] = \frac{1}{2^{km}}. \qquad (1)$$
We will use $\mathcal{H}_{k\text{-wise}}(n,m)$ to refer to a $k$-wise independent family of hash functions mapping $\{0,1\}^n$ to $\{0,1\}^m$.
Explicit families
In this work, one hash family of particular interest is $\mathcal{H}_{Toeplitz}(n,m)$, which is known to be 2-wise independent (Carter and Wegman, 1977). The family is defined as follows: $\mathcal{H}_{Toeplitz}(n,m)$ is the family of functions of the form $h(x) = Ax + b$, where $A$ is a uniformly randomly chosen Toeplitz matrix of size $m \times n$ over $\mathbb{F}_2$ while $b$ is a uniformly randomly chosen vector of size $m \times 1$. Another related hash family of interest is $\mathcal{H}_{xor}(n,m)$, wherein $h$ is again of the form $h(x) = Ax + b$, where $A$ and $b$ are uniformly randomly chosen matrices of sizes $m \times n$ and $m \times 1$ respectively. Both $\mathcal{H}_{Toeplitz}$ and $\mathcal{H}_{xor}$ are 2-wise independent, but it is worth noticing that a member of $\mathcal{H}_{Toeplitz}$ can be represented with $\Theta(m+n)$ bits while one of $\mathcal{H}_{xor}$ requires $\Theta(mn)$ bits of representation.
For every $m' \in \{1, \ldots, m\}$, the $m'$-th prefix-slice of $h$, denoted $h_{m'}$, is a map from $\{0,1\}^n$ to $\{0,1\}^{m'}$, where $h_{m'}(y)$ is the first $m'$ bits of $h(y)$. Observe that when $h(x) = Ax + b$, we have $h_{m'}(x) = A_{m'}x + b_{m'}$, where $A_{m'}$ denotes the submatrix formed by the first $m'$ rows of $A$ and $b_{m'}$ is the first $m'$ entries of the vector $b$.
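The following Python sketch illustrates the $\mathcal{H}_{xor}(n,m)$ family and its prefix slices; it is an illustrative rendering (the class and method names are ours), and it stores the full random matrix, so it uses $\Theta(mn)$ bits as noted above, whereas a Toeplitz variant would share diagonals and need only $\Theta(m+n)$ bits.

```python
import random

class XorHash:
    """A member h(x) = Ax + b of H_xor(n, m), which is 2-wise independent.

    Rows of A are stored as n-bit integers. prefix_slice evaluates the
    prefix-slice h_{m'}: it reuses the first m' rows of A and entries of b.
    """
    def __init__(self, n, m):
        self.n, self.m = n, m
        self.A = [random.getrandbits(n) for _ in range(m)]
        self.b = [random.getrandbits(1) for _ in range(m)]

    def prefix_slice(self, x, m_prime=None):
        """Return h_{m'}(x), the first m' bits of h(x), as a bit tuple."""
        m_prime = self.m if m_prime is None else m_prime
        return tuple((bin(self.A[i] & x).count("1") + self.b[i]) & 1
                     for i in range(m_prime))

# example: a hash from H_xor(8, 4); h_2(x) is the first two bits of h(x)
h = XorHash(8, 4)
assert h.prefix_slice(0b10110011, 2) == h.prefix_slice(0b10110011)[:2]
```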
3. From Streaming to Counting
As a first step, we present a unified view of the three hashing-based algorithms proposed in Bar-Yossef et al. (Bar-Yossef et al., [n. d.]). The first algorithm is the Bucketing algorithm discussed above, with the observation that instead of keeping the elements in the bucket, it suffices to keep their hashed values. Since in the context of model counting our primary concern is time complexity, we will focus on Gibbons and Tirthapura's algorithm in (Gibbons and Tirthapura, 2001) rather than Bar-Yossef et al.'s modification. The second algorithm, which we call Minimum, is based on the idea that if we hash all the items of the stream, then the $t$-th minimum of the hash values can be used to compute a good estimate of $F_0$. The third algorithm, which we call Estimation, chooses a set of $t$ functions $\{h_1, \ldots, h_t\}$, such that each $h_j$ is picked randomly from an $O(\log(1/\varepsilon))$-independent hash family. For each hash function $h_j$, we say that $h_j$ is not lonely if there exists some stream element $a_i$ such that $h_j(a_i)$ has at least $r$ trailing zeros, for a suitably chosen level $r$. One can then estimate the $F_0$ of the stream by estimating the number of hash functions that are not lonely.
Algorithm 1, called ComputeF0, presents the overarching architecture of the three proposed algorithms. Each of these algorithms first picks an appropriate set of hash functions $H$ and initializes the sketch $\mathcal{S}$. The architecture of ComputeF0 is fairly simple: it chooses a collection of hash functions using ChooseHashFunctions, calls the subroutine ProcessUpdate for every incoming element of the stream, and invokes ComputeEst at the end of the stream to return the $F_0$ approximation.
As shown in Algorithm 2, the hash functions depend on the strategy being implemented. The subroutine ChooseHashFunctions returns a collection of independent hash functions from the appropriate family. We use $H$ to denote the collection of hash functions returned; this collection is viewed as either a 1-dimensional or a 2-dimensional array. When $H$ is a 1-dimensional array, $H[i]$ denotes the $i$-th hash function of the collection, and when $H$ is a 2-dimensional array, $H[i][j]$ is the $(i,j)$-th hash function.
Sketch Properties
For each of the three algorithms, the corresponding sketch can be viewed as an array of size $T = O(\log\frac{1}{\delta})$. The parameter thresh (respectively $t$) is set to $O(\frac{1}{\varepsilon^2})$.
- Bucketing: The element $\mathcal{S}[i]$ is a tuple $\langle \ell_i, m_i \rangle$ where $\ell_i$ is a list of size at most thresh holding the hashed values of the stream elements that fall into the current cell, and $m_i$ is the current sampling level. We use $\mathcal{S}[i](0)$ to denote $\ell_i$ and $\mathcal{S}[i](1)$ to denote $m_i$.
- Minimum: The element $\mathcal{S}[i]$ holds the thresh many lexicographically smallest distinct elements of $\{H[i](x) \mid x \in \mathbf{a}\}$.
- Estimation: The element $\mathcal{S}[i]$ holds a tuple of size $t$. The $j$-th entry of this tuple is the largest number of trailing zeros in any element of $\{H[i][j](x) \mid x \in \mathbf{a}\}$.
For a new item $x$, the update of $\mathcal{S}$, as shown in Algorithm 3, is as follows (a Python sketch of the Minimum and Estimation updates appears after this list):
- Bucketing: For a new item $x$, if $h_{i,m_i}(x) = 0^{m_i}$, then we add $h_i(x)$ to $\ell_i$ if it is not already present in $\ell_i$. If the size of $\ell_i$ now exceeds thresh (which is set to $O(\frac{1}{\varepsilon^2})$), then we increment $m_i$ and retain only those elements of $\ell_i$ whose prefix at the new level is all zeros, as in line 8.
- Minimum: For a new item $x$, if $H[i](x)$ is smaller than the largest element of $\mathcal{S}[i]$, then we replace that largest element with $H[i](x)$.
- Estimation: For a new item $x$, compute $z = \mathsf{TrailZero}(H[i][j](x))$, i.e., the number of trailing zeros in $H[i][j](x)$, and replace $\mathcal{S}[i][j]$ with $z$ if $z$ is larger than $\mathcal{S}[i][j]$.
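Since Bucketing was sketched in Section 1, the following is a minimal Python sketch of the Minimum and Estimation updates only; the container choices (a sorted list for the minima, a nested list for the trailing-zero table) are our own illustrative assumptions.

```python
import bisect

def update_minimum(mins, hx, t):
    """Minimum: keep the t lexicographically smallest distinct hashed values.
    `mins` is a sorted list of bit tuples; `hx` is H[i](x) as a bit tuple."""
    i = bisect.bisect_left(mins, hx)
    if i < len(mins) and mins[i] == hx:
        return                              # value already recorded
    if len(mins) < t or hx < mins[-1]:
        mins.insert(i, hx)
        if len(mins) > t:
            mins.pop()                      # evict the current maximum

def update_estimation(S, i, j, hx_bits):
    """Estimation: S[i][j] tracks the max number of trailing zeros seen."""
    tz = 0
    for bit in reversed(hx_bits):           # count trailing zeros of h(x)
        if bit:
            break
        tz += 1
    S[i][j] = max(S[i][j], tz)
```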
Finally, for each of the algorithms, we estimate $F_0$ based on the sketch $\mathcal{S}$ as described in the subroutine ComputeEst, presented as Algorithm 4. It is crucial to note that the estimation of $F_0$ is performed solely using the sketch $\mathcal{S}$ for the Bucketing and Minimum algorithms. The Estimation algorithm requires an additional parameter $r$ that depends on a loose estimate of $F_0$; we defer details to Section 3.4.
3.1. A Recipe For Transformation
Observe that for each of the algorithms, the final computation of the $F_0$ estimate depends only on the sketch $\mathcal{S}$. Therefore, for two streams $\mathbf{a}$ and $\hat{\mathbf{a}}$, if their corresponding sketches are equivalent, the three schemes presented above would return the same estimates. The recipe for a transformation of streaming algorithms to model counting algorithms is based on the following insight:
(1) Capture the relationship $P(\mathcal{S}, H, \mathbf{a}_u)$ among the sketch $\mathcal{S}$, the set of hash functions $H$, and the set $\mathbf{a}_u$ at the end of the stream. Recall that $\mathbf{a}_u$ is the set of all distinct elements of the stream $\mathbf{a}$.
(2) View the formula $\varphi$ as a symbolic representation of the unique set $\mathbf{a}_u$ represented by the stream $\mathbf{a}$, such that $Sol(\varphi) = \mathbf{a}_u$.
(3) Given a formula $\varphi$ and a set of hash functions $H$, design an algorithm to construct a sketch $\mathcal{S}$ such that $P(\mathcal{S}, H, Sol(\varphi))$ holds. We can then estimate $|Sol(\varphi)|$ from $\mathcal{S}$.
In the rest of this section, we will apply the above recipe to the three types of $F_0$ estimation algorithms and derive corresponding model counting algorithms. In particular, we show how applying the recipe to the Bucketing algorithm leads us to reproduce the state-of-the-art hashing-based model counting algorithm, ApproxMC, proposed by Chakraborty et al. (Chakraborty et al., 2016). Applying the recipe to Minimum and Estimation allows us to obtain fundamentally different schemes. In particular, we observe that while the model counting algorithms based on Bucketing and Minimum provide FPRAS's when $\varphi$ is a DNF formula, such is not the case for the algorithm derived from Estimation.
3.2. Bucketing-based Algorithm
The Bucketing algorithm chooses a set of pairwise independent hash functions and maintains a sketch that we will now describe. Here we use $\mathcal{H}_{Toeplitz}(n,n)$ as our choice of pairwise independent hash functions. The sketch is an array $\mathcal{S}$ where each $\mathcal{S}[i]$ is of the form $\langle \ell_i, m_i \rangle$. We say that the relation $P_1(\mathcal{S}, H, \mathbf{a}_u)$ holds if, for every $i$:
(1) $\ell_i = \{x \in \mathbf{a}_u \mid h_{i,m_i}(x) = 0^{m_i}\}$, i.e., $\ell_i$ is the set of elements of $\mathbf{a}_u$ falling in the cell $0^{m_i}$ of the prefix-slice $h_{i,m_i}$;
(2) $|\ell_i| \leq \text{thresh}$, and $m_i$ is the smallest level with this property.
The following lemma due to Bar-Yossef et al. (Bar-Yossef et al., [n. d.]) and Gibbons and Tirthapura (Gibbons and Tirthapura, 2001) captures the relationship among the sketch , the relation and the number of distinct elements of a multiset.
Lemma 1.
(Bar-Yossef et al., [n. d.]; Gibbons and Tirthapura, 2001) Let $\mathbf{a}_u \subseteq \{0,1\}^n$, let $H = \{h_1, \ldots, h_T\}$ where $T = O(\log\frac{1}{\delta})$ and each $h_i$ is independently drawn from $\mathcal{H}_{Toeplitz}(n,n)$, let thresh $= O(\frac{1}{\varepsilon^2})$, and let $\mathcal{S}$ be such that $P_1(\mathcal{S}, H, \mathbf{a}_u)$ holds. Let $c = \mathrm{Median}_i\big(|\ell_i| \cdot 2^{m_i}\big)$. Then $\Pr\big[(1-\varepsilon)|\mathbf{a}_u| \leq c \leq (1+\varepsilon)|\mathbf{a}_u|\big] \geq 1-\delta$.
To design an algorithm for model counting based on the bucketing strategy, we turn to the subroutine introduced by Chakraborty, Meel, and Vardi: BoundedSAT, whose properties are formalized as follows:
Proposition 2.
There is an algorithm, BoundedSAT, that takes as input $\varphi$ over $n$ variables, a hash function $h \in \mathcal{H}_{Toeplitz}(n,n)$, a level $m$, and a threshold thresh, and returns $\min(\text{thresh}, |\{x \in Sol(\varphi) \mid h_m(x) = 0^m\}|)$. If $\varphi$ is a CNF formula, then BoundedSAT makes $O(\text{thresh})$ calls to an NP oracle; if $\varphi$ is a DNF formula with $k$ terms, then BoundedSAT runs in time polynomial in $n$, $k$, and thresh.
Equipped with Proposition 2, we now turn to designing an algorithm for model counting based on the Bucketing strategy. The algorithm proceeds in a fashion similar to its streaming counterpart: the level $m$ is iteratively incremented until the number of solutions of $\varphi$ in the cell (i.e., solutions $x$ with $h_m(x) = 0^m$) is less than thresh. Interestingly, an approximate model counting algorithm based on the bucketing strategy, called ApproxMC, was discovered independently by Chakraborty et al. (Chakraborty et al., 2013b) in 2013. We reproduce an adaptation in Algorithm 5 to showcase how ApproxMC can be viewed as a transformation of the Bucketing algorithm. In the spirit of Bucketing, ApproxMC seeks to construct a sketch of size $O(\log\frac{1}{\delta})$. To this end, for every iteration $i$ of the loop, we continue to increment the level $m_i$ until the conditions specified by the relation $P_1$ are met. For every iteration $i$, the estimate of the model count is $|\ell_i| \times 2^{m_i}$. Finally, the estimate of the model count is simply the median of the estimates over all iterations. Since in the context of model counting we are concerned with time complexity, $\mathcal{H}_{Toeplitz}(n,n)$ and $\mathcal{H}_{xor}(n,n)$ lead to the same time complexity. Furthermore, Chakraborty et al. (Chakraborty et al., 2013a) observed no difference in empirical runtime behavior between $\mathcal{H}_{Toeplitz}$ and $\mathcal{H}_{xor}$.
The following theorem establishes the correctness of ApproxMC; the proof follows from Lemma 1 and Proposition 2.
Theorem 3.
Given a formula $\varphi$, $\varepsilon$, and $\delta$, ApproxMC returns $c$ such that $\Pr\big[(1-\varepsilon)|Sol(\varphi)| \leq c \leq (1+\varepsilon)|Sol(\varphi)|\big] \geq 1-\delta$. If $\varphi$ is a CNF formula, then this algorithm makes $O(\frac{n}{\varepsilon^2}\log\frac{1}{\delta})$ calls to an NP oracle. If $\varphi$ is a DNF formula, then ApproxMC is an FPRAS. In particular, for a DNF formula with $k$ terms, ApproxMC runs in time polynomial in $n$, $k$, $\frac{1}{\varepsilon}$, and $\log\frac{1}{\delta}$.
Further Optimizations
We now discuss how the setting of model counting allows for further optimizations. Observe that for all $i$ and $m$, $\ell_{i,m+1} \subseteq \ell_{i,m}$, where $\ell_{i,m}$ denotes the set of solutions in the cell at level $m$. Since we are interested in finding the smallest $m$ such that $|\ell_{i,m}| \leq \text{thresh}$, we can perform a binary search for $m$ instead of the linear search performed in the loop at lines 8-10. Indeed, this observation was at the core of Chakraborty et al.'s follow-up work (Chakraborty et al., 2016), which proposed ApproxMC2, thereby reducing the number of calls to the NP oracle from $O(n)$ to $O(\log n)$ per iteration. Furthermore, the reduction in NP oracle calls led to significant runtime improvement in practice. It is worth commenting that the usage of ApproxMC2 as an FPRAS for DNF has been shown to achieve runtime efficiency over the alternatives based on Monte Carlo methods (Meel et al., 2017, 2018, 2019). The optimization is sketched below.
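The optimization amounts to a binary search over a monotone predicate; the sketch below is a schematic illustration (the query function is a stand-in for a BoundedSAT-style call, not an implementation of it) and assumes the cell at level $n$ is already small.

```python
def smallest_small_level(cell_size, n, thresh):
    """Smallest m with cell_size(m) < thresh, in O(log n) queries.

    cell_size(m) stands in for a BoundedSAT-style query that counts (up to
    thresh) the solutions x of phi with h_m(x) = 0^m; it is non-increasing
    in m, which makes the binary search sound. Assumes cell_size(n) < thresh.
    """
    if cell_size(0) < thresh:
        return 0
    lo, hi = 0, n        # invariant: cell_size(lo) >= thresh > cell_size(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if cell_size(mid) < thresh:
            hi = mid
        else:
            lo = mid
    return hi
```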
3.3. Minimum-based Algorithm
For a given multiset $S$ (e.g., a data stream or the set of solutions to a formula), we now specify the property $P_2$. The sketch $\mathcal{S}$ is an array of sets, indexed by $i \in \{1, \ldots, T\}$, where $\mathcal{S}[i]$ holds the $t$ lexicographically minimum elements of $\{h_i(x) \mid x \in S\}$, with $t = O(\frac{1}{\varepsilon^2})$. $P_2$ is the property that specifies this relationship. More formally, the relationship $P_2(\mathcal{S}, H, S)$ holds if, for every $i$, the following conditions are met:
(1) $\mathcal{S}[i] \subseteq \{h_i(x) \mid x \in S\}$ and $|\mathcal{S}[i]| = \min(t, |\{h_i(x) \mid x \in S\}|)$;
(2) every element of $\mathcal{S}[i]$ is lexicographically no larger than every element of $\{h_i(x) \mid x \in S\} \setminus \mathcal{S}[i]$.
The following lemma due to Bar-Yossef et al. (Bar-Yossef et al., [n. d.]) establishes the relationship between the property $P_2$ and the number of distinct elements of a multiset. Let $\max(\mathcal{S}[i])$ denote the largest element of the set $\mathcal{S}[i]$.
Lemma 4.
(Bar-Yossef et al., [n. d.]) Let $S \subseteq \{0,1\}^n$ be a multiset, let $H = \{h_1, \ldots, h_T\}$ where $T = O(\log\frac{1}{\delta})$ and each $h_i$ is independently drawn from $\mathcal{H}_{Toeplitz}(n, 3n)$, let $t = O(\frac{1}{\varepsilon^2})$, and let $\mathcal{S}$ be such that $P_2(\mathcal{S}, H, S)$ holds. Let $c = \mathrm{Median}_i\big(\frac{t \cdot 2^{3n}}{\max(\mathcal{S}[i])}\big)$, interpreting each hashed value as an integer. Then $\Pr\big[(1-\varepsilon)|S| \leq c \leq (1+\varepsilon)|S|\big] \geq 1-\delta$.
Therefore, we can transform the Minimum algorithm for $F_0$ estimation into a model counting algorithm, given access to a subroutine that can compute $\mathcal{S}$ such that $P_2(\mathcal{S}, H, Sol(\varphi))$ holds. The following proposition establishes the existence and complexity of such a subroutine, called FindMin:
Proposition 5.
There is an algorithm, FindMin, that, given $\varphi$ over $n$ variables, $h \in \mathcal{H}_{Toeplitz}(n,m)$, and $t$ as input, returns a set $\mathcal{B} \subseteq h(Sol(\varphi))$ so that if $|h(Sol(\varphi))| \leq t$, then $\mathcal{B} = h(Sol(\varphi))$; otherwise $\mathcal{B}$ is the set of $t$ lexicographically minimum elements of $h(Sol(\varphi))$. Moreover, if $\varphi$ is a CNF formula, then FindMin makes $O(t \cdot m)$ calls to an NP oracle, and if $\varphi$ is a DNF formula with $k$ terms, then FindMin takes $O(m^3 \cdot n \cdot t \cdot k)$ time.
Equipped with Proposition 5, we are now ready to present the algorithm, called ApproxModelCountMin, for model counting. Since the complexity of FindMin is polynomial when $\varphi$ is in DNF, ApproxModelCountMin is an FPRAS for DNF formulas.
Theorem 6.
Given $\varphi$, $\varepsilon$, and $\delta$, ApproxModelCountMin returns $c$ such that
$\Pr\big[(1-\varepsilon)|Sol(\varphi)| \leq c \leq (1+\varepsilon)|Sol(\varphi)|\big] \geq 1-\delta$.
If $\varphi$ is a CNF formula, then ApproxModelCountMin is a polynomial-time algorithm that makes $O(\frac{n}{\varepsilon^2}\log\frac{1}{\delta})$ calls to an NP oracle. If $\varphi$ is a DNF formula, then ApproxModelCountMin is an FPRAS.
Implementing the Min-based Algorithm
We now give a proof of Proposition 5 by describing an implementation of the FindMin subroutine.
Proof.
We first present the algorithm for the case when $\varphi$ is a DNF formula. Adapting the algorithm to the case of CNF can be done using similar ideas.
Let $\varphi = T_1 \vee \cdots \vee T_k$ be a DNF formula over $n$ variables, where each $T_i$ is a term. Let $h : \{0,1\}^n \to \{0,1\}^m$ be a linear hash function defined by an $m \times n$ binary matrix $A$, i.e., $h(x) = Ax$. Let $\mathcal{C}$ be the set of hashed values of the satisfying assignments of $\varphi$: $\mathcal{C} = \{h(x) \mid x \models \varphi\} \subseteq \{0,1\}^m$. Let $\mathcal{B}$ be the first $t$ elements of $\mathcal{C}$ in lexicographic order. Our goal is to compute $\mathcal{B}$.
We will give an algorithm with running time $O(m^3 \cdot n \cdot t)$ to compute $\mathcal{B}$ when the formula is just a term $T$. Using this algorithm, we can compute $\mathcal{B}$ for a formula with $k$ terms by iteratively merging the sets computed for the individual terms. The time complexity increases by a factor of $k$, resulting in an $O(m^3 \cdot n \cdot t \cdot k)$ time algorithm.
Let $T$ be a term with width $w$ (the number of literals in it) and $\mathcal{C}_T = \{h(x) \mid x \models T\}$. By fixing the variables in $T$, we get a vector $b_T \in \{0,1\}^m$ and an $m \times (n-w)$ matrix $A_T$ so that $\mathcal{C}_T = \{A_T x + b_T \mid x \in \{0,1\}^{n-w}\}$. Both $A_T$ and $b_T$ can be computed from $A$ and $T$ in linear time. Let $h_T$ denote the transformation $x \mapsto A_T x + b_T$.
We will compute $\mathcal{B}$ (the $t$ lexicographically minimum elements of $\mathcal{C}_T$) iteratively as follows: assuming we have computed the $(i-1)$ smallest elements of $\mathcal{C}_T$, we compute the $i$-th smallest using a prefix-searching strategy. We use a subroutine to solve the following basic prefix-searching primitive: given any bit string $y_1 \cdots y_l$ with $l \leq m$, is there an $x \in \{0,1\}^{n-w}$ so that $y_1 \cdots y_l$ is a prefix of some string in the range of $h_T$? This task can be performed using Gaussian elimination over an $l \times (n-w)$ binary matrix (augmented with the corresponding right-hand side) and can be implemented in time $O(m^2 n)$.
Let $y = y_1 \cdots y_m$ be the $(i-1)$-th minimum in $\mathcal{C}_T$. Consider the rightmost 0 of $y$, say at position $l$. Using the above-mentioned procedure, we can find the lexicographically smallest string in the range of $h_T$ that extends $y_1 \cdots y_{l-1}1$, if it exists. If no such string exists, we find the index of the next 0 to the left in $y$ and repeat the procedure. In this manner, the $i$-th minimum can be computed using $O(m)$ calls to the prefix-searching primitive, resulting in an $O(m^3 n)$ time algorithm. Invoking the above procedure $t$ times results in an algorithm to compute $\mathcal{B}$ in $O(m^3 \cdot n \cdot t)$ time.
If $\varphi$ is a CNF formula, we can employ the same prefix-searching strategy with the help of an NP oracle that decides, given $\varphi$, $h$, and a prefix $y_1 \cdots y_l$, whether some $\sigma \models \varphi$ has $y_1 \cdots y_l$ as a prefix of $h(\sigma)$. With $O(m)$ calls to this oracle, we can compute the smallest string in $\mathcal{C}$ that is lexicographically greater than a given string. So, with $O(m \cdot t)$ calls to the oracle, we can compute $\mathcal{B}$.
∎
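The following Python sketch mirrors the DNF case of the proof for a single term, assuming the rows of $A_T$ are stored as integer bitmasks over the $n - w$ free variables; merging across the $k$ terms and the CNF variant follow the text above, and the function names are ours.

```python
def gf2_consistent(rows, rhs, n):
    """Is the GF(2) system {<rows[i], x> = rhs[i]} over n variables solvable?
    Plain Gaussian elimination; rows are n-bit integers, rhs is a bit list."""
    rows, rhs, r = list(rows), list(rhs), 0
    for col in range(n):
        piv = next((i for i in range(r, len(rows)) if (rows[i] >> col) & 1), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        rhs[r], rhs[piv] = rhs[piv], rhs[r]
        for i in range(len(rows)):
            if i != r and (rows[i] >> col) & 1:
                rows[i] ^= rows[r]
                rhs[i] ^= rhs[r]
        r += 1
    return all(row != 0 or y == 0 for row, y in zip(rows, rhs))  # no 0 = 1

def min_extension(A, b, n, prefix):
    """Lexicographically smallest string in {A x + b} extending `prefix`,
    or None if no such string exists; tries bit 0 before bit 1 greedily."""
    bits = list(prefix)
    if not gf2_consistent(A[:len(bits)],
                          [b[i] ^ bits[i] for i in range(len(bits))], n):
        return None
    while len(bits) < len(A):
        for bit in (0, 1):
            cand = bits + [bit]
            if gf2_consistent(A[:len(cand)],
                              [b[j] ^ cand[j] for j in range(len(cand))], n):
                bits = cand
                break
        else:
            return None  # unreachable when the prefix itself is consistent
    return bits

def find_min_for_term(A, b, n, t):
    """The t lexicographically smallest elements of the range of x -> Ax + b."""
    out, cur = [], min_extension(A, b, n, [])
    while cur is not None and len(out) < t:
        out.append(cur)
        nxt = None                                  # successor: flip the
        for l in range(len(cur) - 1, -1, -1):       # rightmost feasible 0
            if cur[l] == 0:                         # to 1, then minimize
                nxt = min_extension(A, b, n, cur[:l] + [1])
                if nxt is not None:
                    break
        cur = nxt
    return out
```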
Further Optimizations
As mentioned in Section 1, the problem of model counting has witnessed significant interest from practitioners owing to its practical usages, and the recent developments have been fueled by breakthrough progress in SAT solving, wherein calls to NP oracles are replaced by invocations of a SAT solver in practice. Motivated by the progress in SAT solving, there has been significant interest in the design of efficient algorithmic frameworks for related problems such as MaxSAT and its variants. State-of-the-art MaxSAT solvers, based on sophisticated strategies such as implicit hitting sets, have been shown to significantly outperform algorithms based on merely invoking a SAT solver iteratively. Of particular interest to us is the recent progress in the design of MaxSAT solvers that handle lexicographic objective functions. In this context, it is worth remarking that we expect a practical implementation of ApproxModelCountMin to invoke a MaxSAT solver once per minimum computed, i.e., $O(t)$ times per hash function.
3.4. Estimation-based Algorithm
We now adapt the Estimation algorithm to model counting. For a given stream $\mathbf{a}$ and chosen hash functions $H$, the sketch $\mathcal{S}$ corresponding to the Estimation-based algorithm satisfies the following relation $P_3$, for each $i \in [T]$ and $j \in [t]$:
$$\mathcal{S}[i][j] = \max_{x \in \mathbf{a}_u} \mathsf{TrailZero}\big(H[i][j](x)\big) \qquad (2)$$
where $\mathsf{TrailZero}(z)$ is the length of the longest all-zero suffix of $z$. Bar-Yossef et al. (Bar-Yossef et al., [n. d.]) show the following relationship between the property $P_3$ and $F_0$.
Lemma 7.
(Bar-Yossef et al., [n. d.]) Let $\mathbf{a}_u \subseteq \{0,1\}^n$ be a multiset. For $1 \leq i \leq T$ and $1 \leq j \leq t$, suppose $h_{i,j}$ is drawn independently from $\mathcal{H}_{k\text{-wise}}(n,n)$, where $T = O(\log\frac{1}{\delta})$, $t = O(\frac{1}{\varepsilon^2})$, and $k = O(\log\frac{1}{\varepsilon})$. Let $H$ denote the collection of these hash functions. Suppose $\mathcal{S}$ satisfies $P_3(\mathcal{S}, H, \mathbf{a}_u)$. For any integer $r$, define $\rho_i = \frac{1}{t}\,\big|\{j \mid \mathcal{S}[i][j] < r\}\big|$ (the fraction of hash functions in row $i$ that are lonely at level $r$) and
$c = \mathrm{Median}_i\Big(\frac{\ln \rho_i}{\ln(1 - 2^{-r})}\Big).$
Then, whenever $2^r$ is within a constant factor of $|\mathbf{a}_u|$:
$\Pr\big[(1-\varepsilon)|\mathbf{a}_u| \leq c \leq (1+\varepsilon)|\mathbf{a}_u|\big] \geq 1-\delta.$
Following the recipe outlined above, we can transform a streaming algorithm into a model counting algorithm by designing a subroutine that can compute the sketch for the set of all solutions described by $\varphi$, together with a subroutine to find a suitable level $r$. The following proposition achieves the first objective for CNF formulas using a small number of calls to an NP oracle:
Proposition 8.
There is an algorithm that, given $\varphi$ over $n$ variables and a hash function $h \in \mathcal{H}_{k\text{-wise}}(n,n)$, returns $z$ such that
(1) there is a $\sigma \models \varphi$ for which $h(\sigma)$ has $z$ least significant bits equal to zero, and
(2) no $\sigma \models \varphi$ is such that $h(\sigma)$ has $z+1$ least significant bits equal to zero.
If $\varphi$ is a CNF formula, then the algorithm makes $O(\log n)$ calls to an NP oracle.
Proof.
Consider an NP oracle $O = \{\langle \varphi, h, z \rangle \mid \exists \sigma,\ \sigma \models \varphi$ and the $z$ least significant bits of $h(\sigma)$ are zero$\}$. Note that $h$ can be implemented as a degree-$k$ polynomial over a finite field, so that $h(\sigma)$ can be evaluated in polynomial time. A binary search over $z$, requiring $O(\log n)$ calls to $O$, suffices to find the largest value of $z$ for which $\langle \varphi, h, z \rangle$ belongs to $O$. ∎
We note that, unlike Propositions 2 and 5, we do not know whether the algorithm of Proposition 8 can be implemented efficiently when $\varphi$ is a DNF formula. For a degree-$k$ polynomial $h$, we can efficiently test whether $h$ has a root, but it is not clear how to simultaneously constrain some variables according to a DNF term.
Equipped with Proposition 8, we obtain an algorithm, which we call ApproxModelCountEst, that takes in a formula $\varphi$ and a suitable value of $r$ and returns an estimate $c$. The key idea of ApproxModelCountEst is to repeatedly invoke the algorithm of Proposition 8 for each of the chosen hash functions and to compute the estimate based on the sketch $\mathcal{S}$ and the value of $r$. The following theorem summarizes the time complexity and guarantees of ApproxModelCountEst for CNF formulas.
Theorem 9.
Given a CNF formula $\varphi$, parameters $\varepsilon$ and $\delta$, and $r$ such that $2^r$ is within a constant factor of $|Sol(\varphi)|$, the algorithm ApproxModelCountEst returns $c$ satisfying
$\Pr\big[(1-\varepsilon)|Sol(\varphi)| \leq c \leq (1+\varepsilon)|Sol(\varphi)|\big] \geq 1-\delta$.
ApproxModelCountEst makes $O(\frac{\log n}{\varepsilon^2}\log\frac{1}{\delta})$ calls to an NP oracle.
In order to obtain $r$, we run in parallel another counting algorithm based on the simple $F_0$-estimation algorithm (Flajolet and Martin, 1985; Alon et al., 1999), which we call the FM algorithm. Given a stream $\mathbf{a}$, the FM algorithm chooses a random pairwise-independent hash function $h$, computes the largest $r$ such that for some $a_i$ in the stream the $r$ least significant bits of $h(a_i)$ are zero, and outputs $2^r$. Alon, Matias and Szegedy (Alon et al., 1999) showed that $2^r$ is a 5-factor approximation of $F_0$ with probability at least $3/5$. Using our recipe, we can convert the FM algorithm into an algorithm that approximates the number of solutions to a CNF formula within a factor of 5 with probability $3/5$. It is easy to check that, using the same idea as in Proposition 8, this algorithm requires $O(\log n)$ calls to an NP oracle.
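For completeness, the following is a minimal sketch of the rough estimator in the streaming setting (names are ours); via Proposition 8, the maximization over the stream is replaced by a binary search with an NP oracle in the model counting setting.

```python
def fm_rough_estimate(stream, h, n):
    """Output 2^r, where r is the largest number of trailing zeros of h(x)
    over the stream; a constant-factor F0 estimate with constant probability."""
    r = 0
    for x in stream:
        y = h(x)
        tz = n if y == 0 else (y & -y).bit_length() - 1  # trailing zeros of y
        r = max(r, tz)
    return 2 ** r
```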
3.5. The Opportunities Ahead
As noted in Section 3.2, the algorithm based on Bucketing was already known and has witnessed detailed technical development from both applied and algorithmic perspectives. The model counting algorithms based on Minimum and Estimation are new. We discuss some potential implications of these new algorithms for SAT solvers and other aspects.
MaxSAT solvers with native support for XOR constraints
When the input formula is represented as CNF, ApproxMC, the model counting algorithm based on the Bucketing strategy, invokes the NP oracle over CNF-XOR formulas, i.e., formulas expressed as a conjunction of CNF and XOR constraints. The significant improvement in the runtime performance of ApproxMC owes to the design of SAT solvers with native support for CNF-XOR formulas (Soos et al., 2009; Soos and Meel, 2019; Soos et al., 2020). Such solvers have now found applications in other domains such as cryptanalysis. It is perhaps worth emphasizing that the proposal of ApproxMC was crucial to the renewed interest in the design of SAT solvers with native support for CNF-XOR formulas. As observed in Section 3.3, the algorithm based on the Minimum strategy would ideally invoke a MaxSAT solver that can handle XOR constraints natively. We believe that the Minimum-based algorithm will ignite interest in the design of MaxSAT solvers with native support for XOR constraints.
FPRAS for DNF based on Estimation
In Section 3.4, we were unable to show that the model counting algorithm obtained from the Estimation strategy is an FPRAS when $\varphi$ is represented as DNF. The algorithms based on Estimation have been shown to achieve optimal space efficiency in the context of $F_0$ estimation. In this context, an open problem is to investigate whether the Estimation-based strategy lends itself to an FPRAS for DNF counting.
Empirical Study of FPRAS for DNF Based on Minimum
Meel et al. (Meel et al., 2018, 2019) observed that the FPRAS for DNF based on Bucketing has superior performance, in terms of the number of instances solved, to that of FPRAS schemes based on the Monte Carlo framework. In this context, a natural direction of future work would be to conduct an empirical study to understand the behavior of the FPRAS scheme based on the Minimum strategy.
4. Distributed DNF Counting
Consider the problem of distributed DNF counting. In this setting, there are $k$ sites that can each communicate with a central coordinator. The input DNF formula $\varphi$ is partitioned into DNF subformulas $\varphi_1, \ldots, \varphi_k$, where each $\varphi_i$ is a subset of the terms of the original $\varphi$, with the $i$-th site receiving only $\varphi_i$. The goal is for the coordinator to obtain an $(\varepsilon,\delta)$-approximation of the number of solutions to $\varphi$, while minimizing the total number of bits communicated between the sites and the coordinator. Distributed algorithms for sampling and counting solutions to CSP's have been studied recently in other models of distributed computation (Feng et al., 2018b, a; Feng and Yin, 2018; Fischer and Ghaffari, 2018). From a practical perspective, given the centrality of #DNF in the context of probabilistic databases (Senellart, 2018; Ré and Suciu, 2008), distributed DNF counting would entail applications in distributed probabilistic databases.
From our perspective, distributed DNF counting falls within the distributed functional monitoring framework formalized by Cormode et al. (Cormode et al., 2011). Here, the input is a stream $\mathbf{a}$ which is partitioned arbitrarily into sub-streams $\mathbf{a}_1, \ldots, \mathbf{a}_k$ that arrive at each of the $k$ sites. Each site can communicate with the central coordinator, and the goal is for the coordinator to compute a function of the joint stream while minimizing the total communication. This general framework has several direct applications and has been studied extensively (Babcock and Olston, 2003; Cormode et al., 2005; Manjhi et al., 2005; Keralapura et al., 2006; Arackaparambil et al., 2009; Sharfman et al., 2010; Keren et al., 2011; Cormode et al., 2012; Woodruff and Zhang, 2012; Yi and Zhang, 2013; Huang et al., 2012; Woodruff and Zhang, 2017). In distributed DNF counting, each sub-stream corresponds to the set of satisfying assignments of each subformula $\varphi_i$, while the function to be computed is $F_0$.
The model counting algorithms discussed in Section 3 can be extended to the distributed setting. We briefly indicate the distributed implementations for each of the three algorithms. As above, we set thresh and $t$ to $O(\frac{1}{\varepsilon^2})$ and the number of repetitions $T$ to $O(\log\frac{1}{\delta})$. Correctness follows from Bar-Yossef et al. (Bar-Yossef et al., [n. d.]) and the earlier discussion.
We consider an adaptation of BoundedSAT that takes in $\varphi$ over $n$ variables, a hash function $h$, a level $m$, and a threshold $t$ as inputs, and returns a set $U \subseteq \{x \in Sol(\varphi) \mid h_m(x) = 0^m\}$ of solutions with $|U| = \min(t, |\{x \in Sol(\varphi) \mid h_m(x) = 0^m\}|)$.
- Bucketing.: The coordinator chooses a hash function $h$ from $\mathcal{H}_{Toeplitz}(n,n)$, together with a second hash function $h'$ used to compress the $n$-bit hashed values that the sites communicate, and sends them to the sites. Using the adapted BoundedSAT above, the $i$-th site computes the smallest level $m_i$ such that the size of the set $\{x \in Sol(\varphi_i) \mid h_{m_i}(x) = 0^{m_i}\}$ is smaller than thresh, and sends the coordinator the tuple $\langle h'(x), z_x \rangle$ for each $x$ in this set, where $z_x$ is the largest $m$ with $h_m(x) = 0^m$. Note that each site only sends tuples for at most thresh choices of $x$, so that $h'$ hashes these to distinct values with high probability. It is easy to verify that the coordinator can then execute the rest of the ApproxMC algorithm. The communication cost is $\tilde{O}(\frac{k}{\varepsilon^2}\log\frac{1}{\delta})$ (suppressing factors logarithmic in $n$, $k$, and $\frac{1}{\varepsilon\delta}$), and the time complexity for each site is polynomial in $|\varphi_i|$, $\frac{1}{\varepsilon}$, and $\log\frac{1}{\delta}$.
- Minimum.: The coordinator chooses $T$ hash functions $h_1, \ldots, h_T$ from $\mathcal{H}_{Toeplitz}(n, 3n)$ and sends them to the sites. Each site runs the FindMin algorithm for each hash function and sends the outputs to the coordinator. So, the coordinator receives sets $\mathcal{B}_{i,j}$ for $i \in [T]$ and $j \in [k]$, each consisting of the $t$ lexicographically smallest hash values of the solutions to $\varphi_j$. The coordinator then extracts $\mathcal{B}_i$, the $t$ lexicographically smallest elements of $\mathcal{B}_{i,1} \cup \cdots \cup \mathcal{B}_{i,k}$, and proceeds with the rest of the algorithm ApproxModelCountMin. The communication cost is $O(\frac{kn}{\varepsilon^2}\log\frac{1}{\delta})$ to account for the sites sending the outputs of their FindMin invocations. The time complexity for each site is polynomial in $|\varphi_j|$, $\frac{1}{\varepsilon}$, and $\log\frac{1}{\delta}$.
- Estimation.: For each $1 \leq i \leq T$ and $1 \leq j \leq t$, the coordinator chooses hash functions $h_{i,j}$, drawn independently from an $O(\log\frac{1}{\varepsilon})$-wise independent family, and sends them to the sites. Each site runs the algorithm of Proposition 8 for each hash function and sends the output to the coordinator. Suppose the coordinator receives $z_{i,j,l}$ for each $i \in [T]$, $j \in [t]$, and site $l \in [k]$. It computes $\mathcal{S}[i][j] = \max_l z_{i,j,l}$. The rest of ApproxModelCountEst is then executed by the coordinator. The communication cost is $O(\frac{k \log n}{\varepsilon^2}\log\frac{1}{\delta})$. However, as earlier, we do not know a polynomial-time algorithm to implement this procedure for DNF formulas.
Lower Bound
The communication cost for the Bucketing- and Estimation-based algorithms is nearly optimal in its dependence on $k$ and $\varepsilon$. Woodruff and Zhang (Woodruff and Zhang, 2012) showed that the randomized communication complexity of estimating $F_0$ up to a $(1+\varepsilon)$ factor in the distributed functional monitoring setting is $\Omega(\frac{k}{\varepsilon^2})$. We can reduce the $F_0$ estimation problem to distributed DNF counting. Namely, if, for the $F_0$ estimation problem, the $i$-th site receives items $a_1, \ldots, a_j \in [N]$, then for the distributed DNF counting problem, $\varphi_i$ is a DNF formula on $\log_2 N$ variables whose solutions are exactly $a_1, \ldots, a_j$ in their binary encoding. Thus, we immediately get an $\Omega(\frac{k}{\varepsilon^2})$ lower bound for the distributed DNF counting problem. Finding the optimal dependence on $n$ for $k > 1$ remains an interesting open question. (Note that if $k = 1$, then $O(n)$ bits suffice, as the site can solve the problem on its own and send the coordinator the binary encoding of a $(1+\varepsilon)$-approximation of $|Sol(\varphi)|$.)
5. From Counting to Streaming: Structured Set Streaming
In this section, we consider the structured set streaming model, where each item $S_i$ of the stream is a succinct representation of a set over the universe $U = \{0,1\}^n$. Our goal is to design efficient algorithms (both in terms of memory and processing time per item) for computing $F_0 = |\cup_i S_i|$, the number of distinct elements in the union of all the sets in the stream. We call this problem $F_0$ computation over structured set streams.
DNF Sets
A particular representation we are interested in is one where each set is presented as the set of satisfying assignments of a DNF formula. Let $\varphi$ be a DNF formula over $n$ variables. Then the DNF set corresponding to $\varphi$ is the set of satisfying assignments of $\varphi$. The size of this representation is the number of terms in the formula $\varphi$.
A stream over DNF sets is a stream of DNF formulas $\varphi_1, \varphi_2, \ldots$. Given such a DNF stream, the goal is to estimate $F_0 = |\cup_i S_i|$, where $S_i$ is the DNF set represented by $\varphi_i$. This quantity is the same as the number of satisfying assignments of the formula $\vee_i \varphi_i$. We show that the algorithms described in the previous section carry over to obtain $F_0$ estimation algorithms for this problem with space and per-item time $\mathrm{poly}(n, k, \frac{1}{\varepsilon}, \log\frac{1}{\delta})$, where $k$ is the size of the formula.
Notice that this model generalizes the traditional streaming model, where each item of the stream is an element $x$ of the universe: such an item can be represented as a single-term DNF formula whose only satisfying assignment is $x$. This model also subsumes certain other models considered in the streaming literature that we discuss later.
Theorem 1.
There is a streaming algorithm to compute an $(\varepsilon,\delta)$-approximation of $F_0$ over DNF sets. This algorithm takes space $O(\frac{n}{\varepsilon^2}\log\frac{1}{\delta})$ and processing time $O(\frac{n^4 k}{\varepsilon^2}\log\frac{1}{\delta})$ per item, where $k$ is the size (number of terms) of the corresponding DNF formula.
Proof.
We show how to adapt the Minimum-value based algorithm from Section 3.3 to this setting. The algorithm picks a hash function $h \in \mathcal{H}_{Toeplitz}(n, 3n)$ and maintains the set $\mathcal{B}$ consisting of the $t$ lexicographically minimum elements of $\{h(x) \mid x \models \varphi_1 \vee \cdots \vee \varphi_i\}$ after processing $i$ items. When $\varphi_{i+1}$ arrives, it computes the set $\mathcal{B}'$ consisting of the $t$ lexicographically minimum values of $\{h(x) \mid x \models \varphi_{i+1}\}$ and subsequently updates $\mathcal{B}$ by computing the $t$ lexicographically smallest elements of $\mathcal{B} \cup \mathcal{B}'$. By Proposition 5 (with $m = 3n$), computation of $\mathcal{B}'$ can be done in time $O(n^4 k t)$ where $k$ is the number of terms in $\varphi_{i+1}$. Updating $\mathcal{B}$ can be done in $O(t \cdot n)$ time. Thus the update time for item $\varphi_{i+1}$ is $O(n^4 k t)$. For obtaining an $(\varepsilon,\delta)$-approximation, we set $t = O(\frac{1}{\varepsilon^2})$, repeat the procedure $O(\log\frac{1}{\delta})$ times, and take the median value. Thus the update time per item is $O(\frac{n^4 k}{\varepsilon^2}\log\frac{1}{\delta})$. For analyzing space, each hash function uses $O(n)$ bits, and to store the $t$ minimums we require $O(tn)$ bits, resulting in overall space usage of $O(\frac{n}{\varepsilon^2}\log\frac{1}{\delta})$. The proof of correctness follows from Lemma 4. ∎
Instead of using the Minimum-value based algorithm, we could adapt the Bucketing-based algorithm to obtain an algorithm with similar space and time complexities. As noted earlier, some of the set streaming models considered in the literature can be reduced to DNF set streaming. We discuss them next.
Multidimensional Ranges
A $d$-dimensional range over a universe $[N]^d$ is defined as $[a_1, b_1] \times \cdots \times [a_d, b_d]$. Such a range represents the set of tuples $(x_1, \ldots, x_d)$ where each $x_i$ is an integer with $a_i \leq x_i \leq b_i$. Note that every $d$-dimensional range can be represented succinctly by the tuple $\langle a_1, b_1, \ldots, a_d, b_d \rangle$. A multidimensional stream is a stream where each item is a $d$-dimensional range. The goal is to compute the $F_0$ of the union of the $d$-dimensional ranges efficiently. We will show that $F_0$ computation over multidimensional ranges can be reduced to $F_0$ computation over DNF sets. Using this reduction, we arrive at a simple algorithm to compute $F_0$ over multidimensional ranges.
Lemma 2.
Any $d$-dimensional range $R$ over $[N]^d$, with $N = 2^n$, can be represented as a DNF formula $\varphi_R$ over $d \cdot n$ variables whose size is at most $(2n)^d$. There is an algorithm that takes $R$ as input and outputs the $i$-th term of $\varphi_R$ using $O(d \cdot n)$ space, for $1 \leq i \leq (2n)^d$.
Proof.
Let $R = [a_1, b_1] \times \cdots \times [a_d, b_d]$ be a $d$-dimensional range over $[N]^d$. We will first describe the formula representing the multi-dimensional range as a conjunction $\varphi_1 \wedge \cdots \wedge \varphi_d$ of DNF formulae, each with at most $2n$ terms, where $\varphi_i$ represents $[a_i, b_i]$, the range in the $i$-th dimension. Converting this into a DNF formula will result in the formula $\varphi_R$ with at most $(2n)^d$ terms.
For any $n$-bit number $b$, it is straightforward to write a DNF formula $\varphi_{\leq b}$, of size at most $n$, that represents the range $[0, b]$ (or equivalently the set $\{x \mid 0 \leq x \leq b\}$). Similarly we can write a DNF formula $\varphi_{\geq a}$, of size at most $n$, for the range $[a, N-1]$. Now we construct a formula to represent the range $[a, b]$ over $\{0,1\}^n$ as follows. Let $a_1 \cdots a_n$ and $b_1 \cdots b_n$ be the binary representations of $a$ and $b$ respectively. Let $\ell$ be the largest integer such that $a_1 \cdots a_\ell = b_1 \cdots b_\ell$. Hence $a_{\ell+1} = 0$ and $b_{\ell+1} = 1$. Let $a'$ and $b'$ denote the integers represented by $a_{\ell+2} \cdots a_n$ and $b_{\ell+2} \cdots b_n$. Also, let $\varphi_{com}$ denote the formula (a single term) that represents the string $a_1 \cdots a_\ell$. Then the formula representing $[a, b]$ is $\varphi_{com} \wedge \big((\bar{x}_{\ell+1} \wedge \varphi_{\geq a'}) \vee (x_{\ell+1} \wedge \varphi_{\leq b'})\big)$. This can be written as a DNF formula by distributing $\varphi_{com}$, and the number of terms in the resulting formula is at most $2n$; it has $n$ variables. Note that each $\varphi_i$ can be constructed using $O(n)$ space. To obtain the final DNF $\varphi_R$ representing the range $R$, we need to convert $\varphi_1 \wedge \cdots \wedge \varphi_d$ into a DNF formula. It is easy to see that for any $1 \leq i \leq (2n)^d$, the $i$-th term of this DNF can be computed using space $O(d \cdot n)$. Note that this formula has $d \cdot n$ variables, $n$ variables per dimension.
∎
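A minimal Python sketch of the one-dimensional building blocks follows (the representation and names are ours: a term is a partial assignment, with free variables omitted); this simple variant emits at most $n + 1$ terms per range endpoint, matching the $O(n)$ bound up to the extra term for the endpoint itself.

```python
def dnf_leq(b):
    """DNF terms for {x : x <= b}; b is a list of n bits, MSB first.
    A term maps bit positions to forced values; absent positions are free."""
    terms = []
    for i, bit in enumerate(b):
        if bit == 1:
            t = {j: b[j] for j in range(i)}   # agree with b before position i
            t[i] = 0                          # ... then fall below b
            terms.append(t)
    terms.append({j: b[j] for j in range(len(b))})  # x = b itself
    return terms

def dnf_geq(a):
    """DNF terms for {x : x >= a}, symmetric to dnf_leq."""
    terms = []
    for i, bit in enumerate(a):
        if bit == 0:
            t = {j: a[j] for j in range(i)}
            t[i] = 1                          # rise above a
            terms.append(t)
    terms.append({j: a[j] for j in range(len(a))})
    return terms

def satisfies(x_bits, term):
    return all(x_bits[j] == v for j, v in term.items())

# sanity check: b = 5 over 3 bits covers exactly {0, ..., 5}
n, b = 3, [1, 0, 1]
sols = {x for x in range(2 ** n)
        if any(satisfies([(x >> (n - 1 - j)) & 1 for j in range(n)], t)
               for t in dnf_leq(b))}
assert sols == set(range(6))
```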
Using the above reduction and Theorem 1, we obtain an algorithm for estimating $F_0$ over multidimensional ranges in a range-efficient manner.
Theorem 3.
There is a streaming algorithm to compute an $(\varepsilon,\delta)$-approximation of $F_0$ over $d$-dimensional ranges that takes space $O(\frac{dn}{\varepsilon^2}\log\frac{1}{\delta})$ and processing time $O\big((dn)^4 (2n)^d \frac{1}{\varepsilon^2}\log\frac{1}{\delta}\big)$ per item.
Remark 1.
Tirthapura and Woodruff (Tirthapura and Woodruff, 2012) studied the problem of range-efficient estimation of $F_k$ (the $k$-th frequency moment) over $d$-dimensional ranges. They claimed an algorithm to estimate $F_0$ with space and per-item time complexity $\mathrm{poly}(\log N, \frac{1}{\varepsilon}, d)$. However, they have retracted their claim (Woodruff, [n. d.]). Their method only yields per-item time exponential in $d$. Their proof appears to be involved, requiring range-efficient implementations of the count sketch algorithm (Charikar et al., 2004) and recursive sketches (Indyk and Woodruff, 2005; Braverman and Ostrovsky, 2010). We obtain the same complexity bounds with a much simpler analysis and a practically efficient algorithm that can use off-the-shelf available implementations (Meel et al., 2018).
Remark 2.
Subsequent to the present work, an improved algorithm for $F_0$ over structured sets is presented in (Meel, Vinodchandran, and Chakraborty, 2021) (to appear in PODS 2021). In particular, the paper presents an $F_0$ estimation algorithm for streams over Delphic sets. A set $S$ belongs to the Delphic family if the following queries can be done in time polynomial in $\log N$: (1) know the size of the set $S$, (2) draw a uniform random sample from $S$, and (3) given any $x$, check whether $x \in S$. The authors design a streaming algorithm that, given a stream $S_1, \ldots, S_M$ wherein each $S_i$ belongs to the Delphic family, computes an $(\varepsilon,\delta)$-approximation of $|\cup_i S_i|$ with worst-case space complexity and per-item time polynomial in $\log M$, $\log N$, $\frac{1}{\varepsilon}$, and $\log\frac{1}{\delta}$. Their algorithm, when applied to $d$-dimensional ranges, gives per-item time and space complexity bounds that are polynomial in $d$ and $n$. While this brings down the dependency on $d$ from exponential to polynomial, it works under the assumption that the length of the stream is known. The general setup presented in (Meel, Vinodchandran, and Chakraborty, 2021), however, can be applied to other structured sets considered in this paper, including multidimensional arithmetic progressions.
Representing Multidimensional Ranges as CNF Formulas.
Since the algorithm presented in (Meel, Vinodchandran, and Chakraborty, 2021) employs a sampling-based technique, a natural question is whether there exists a hashing-based technique that achieves per-item time polynomial in $d$ and $n$. We note that the above approach of representing a multi-dimensional range as a DNF formula does not yield such an algorithm. This is because there exist $d$-dimensional ranges whose DNF representation requires size $n^{\Omega(d)}$.
Observation 1.
There exist $d$-dimensional ranges whose smallest DNF representation has size $n^{\Omega(d)}$.
Proof.
The observation follows by considering the range $R = [1, 2^n - 1]^d$ (only 0 is missing from the interval in each dimension). We will argue that any DNF formula for this range has size (number of terms) at least $n^d$. For any $1 \leq i \leq d$, we use the set of variables $X_i = \{x_{i1}, \ldots, x_{in}\}$ for representing the $i$-th coordinate. Then $R$ can be represented as the formula $\varphi = \varphi_1 \wedge \cdots \wedge \varphi_d$, where $\varphi_i = (x_{i1} \vee \cdots \vee x_{in})$. Expanding this formula into a DNF yields $n^d$ terms, one term $x_{1j_1} \wedge \cdots \wedge x_{dj_d}$ for every choice of $j_1, \ldots, j_d$. Let $\psi$ be any other DNF formula representing $R$. The main observation is that any term of $\psi$ is completely contained (in terms of the set of solutions) in one of the terms of the expansion of $\varphi$. This implies that $\psi$ must have at least $n^d$ terms. Now we argue that every term $T'$ of $\psi$ is contained in one of the terms of the expansion. $T'$ must have at least one variable from each $X_i$ appearing as a positive literal: suppose $T'$ has no positive literal over $X_i$ for some $i$; then $T'$ contains a solution with all the variables in $X_i$ set to 0, and hence a point not in $R$, a contradiction. Now let $x_{ij_i}$ be a variable from $X_i$ that appears positively in $T'$. Then clearly $T'$ is contained in the term $x_{1j_1} \wedge \cdots \wedge x_{dj_d}$ of the expansion of $\varphi$. ∎
This leads to the question of whether we can obtain a super-polynomial lower bound on the time per item. We observe that such a lower bound would imply that P differs from NP. For this, we note the following.
Observation 2.
Any $d$-dimensional range over $[N]^d$, with $N = 2^n$, can be represented as a CNF formula of size $O(d \cdot n)$ over $d \cdot n$ variables.
This is because a single-dimensional range can be represented as a CNF formula of size $O(n)$ (Chakraborty et al., 2015), and thus the CNF formula for a $d$-dimensional range is a conjunction of the formulas along each dimension. Thus the problem of computing $F_0$ over $d$-dimensional ranges reduces to computing $F_0$ over a stream where each item of the stream is a CNF formula. As in the proof of Theorem 1, we can adapt the Minimum-value based algorithm for CNF streams. When a CNF formula $\varphi_i$ arrives, we need to compute the $t$ lexicographically smallest elements of $\{h(x) \mid x \in Sol(\varphi_i)\}$. By Proposition 5, this can be done in polynomial time by making $O(t \cdot d \cdot n)$ calls to an NP oracle, since $\varphi_i$ is a CNF formula over $d \cdot n$ variables. Thus, if P equals NP, then the time taken per range is polynomial in $n$, $d$, $\frac{1}{\varepsilon}$, and $\log\frac{1}{\delta}$. Hence a super-polynomial lower bound on the time per item would imply that P differs from NP.
From Weighted #DNF to d-Dimensional Ranges
Designing a streaming algorithm with per-item time polynomial in $n$ and $d$ is a very interesting open problem with implications for weighted DNF counting. Consider a formula $\varphi$ defined on the set of variables $X = \{x_1, \ldots, x_n\}$ and a weight function $\rho : X \to (0,1)$, with the weight of an assignment $\sigma$ defined as follows:
$$W(\sigma) = \prod_{\sigma(x_i) = 1} \rho(x_i) \cdot \prod_{\sigma(x_i) = 0} \big(1 - \rho(x_i)\big).$$
Furthermore, we define the weight of a formula $\varphi$ as
$$W(\varphi) = \sum_{\sigma \models \varphi} W(\sigma).$$
Given $\varphi$ and $\rho$, the problem of weighted counting is to compute $W(\varphi)$. We consider the case where, for each $x_i$, $\rho(x_i)$ is represented using $m_i$ bits in binary representation, i.e., $\rho(x_i) = c_i \cdot 2^{-m_i}$ for an integer $c_i$. Inspired by the key idea of the weighted-to-unweighted reduction due to Chakraborty et al. (Chakraborty et al., 2015), we show how the problem of weighted DNF counting can be reduced to $F_0$ estimation over $n$-dimensional ranges. The reduction is as follows: we transform every term of $\varphi$ into an $n$-dimensional range, wherein every positive literal $x_i$ is replaced with the interval $[1, c_i]$, every negative literal $\neg x_i$ is replaced with $[c_i + 1, 2^{m_i}]$, and every variable $x_i$ not appearing in the term is replaced with $[1, 2^{m_i}]$. For example, the term $x_1 \wedge \neg x_3$ (over variables $x_1, x_2, x_3$) is replaced with the range $[1, c_1] \times [1, 2^{m_2}] \times [c_3 + 1, 2^{m_3}]$. Given the $F_0$ of the resulting stream, we can compute the weight of $\varphi$ simply as $W(\varphi) = F_0 \cdot 2^{-\sum_i m_i}$. Thus a hashing-based streaming algorithm with per-item time polynomial in $n$ and $d$ would yield a hashing-based FPRAS for weighted DNF counting, an open problem from (Abboud et al., 2019).
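The following toy Python sketch illustrates the reduction on a single term (the encoding of terms and the helper names are our own); the assertion checks that the box volume, normalized by $2^{\sum_i m_i}$, recovers the term's weight.

```python
from fractions import Fraction

def term_to_box(term, c, m):
    """Map a DNF term to an n-dimensional range (a box of intervals).

    term: dict var_index -> bool (True for x_i, False for not x_i);
          variables absent from the dict are free.
    c, m: lists encoding rho(x_i) = c[i] / 2**m[i].
    """
    box = []
    for i in range(len(c)):
        if i not in term:
            box.append((1, 2 ** m[i]))          # free variable
        elif term[i]:
            box.append((1, c[i]))               # x_i: weight c_i / 2^{m_i}
        else:
            box.append((c[i] + 1, 2 ** m[i]))   # not x_i: weight 1 - c_i/2^{m_i}
    return box

# toy check on the term x_1 and not x_3, with rho = (3/4, 1/2, 5/8)
c, m = [3, 1, 5], [2, 1, 3]
box = term_to_box({0: True, 2: False}, c, m)
points = 1
for lo, hi in box:
    points *= hi - lo + 1                       # F0 of this single-term stream
weight = Fraction(points, 2 ** sum(m))
assert weight == Fraction(3, 4) * 1 * (1 - Fraction(5, 8))
```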
Multidimensional Arithmetic Progressions.
We will now generalize Theorem 3 to handle arithmetic progressions instead of ranges. Let $P(a, b, \eta)$ represent the arithmetic progression with common difference $\eta$ whose first element is $a$ and whose last element is the largest integer of the form $a + j\eta$ that is at most $b$. Here, we consider $d$-dimensional arithmetic progressions $P(a_1, b_1, \eta_1) \times \cdots \times P(a_d, b_d, \eta_d)$ where each $\eta_i$ is a power of two. We first observe that the set represented by $P(a, b, 2^r)$ can be expressed as a DNF formula as follows. Let $\varphi_{[a,b]}$ be the DNF formula representing the range $[a, b]$, and let $a_{n-r+1} \cdots a_n$ be the $r$ least significant bits of $a$; an integer $x \in [a, b]$ belongs to the progression precisely when its $r$ least significant bits equal those of $a$. Let $T$ be the term that represents this bit sequence. Now the formula representing the arithmetic progression is $\varphi_{[a,b]} \wedge T$, which can be converted to a DNF formula of size at most $2n$ (a sketch of this construction, reusing the range terms from above, appears after the corollary). Thus the $d$-dimensional arithmetic progression can be represented as a DNF formula of size at most $(2n)^d$. Note that the time and space required to construct each term of this DNF formula are as before, i.e., $O(d \cdot n)$ space. This leads us to the following corollary.
Corollary 4.
There is a streaming algorithm to compute an $(\varepsilon,\delta)$-approximation of $F_0$ over $d$-dimensional arithmetic progressions, whose common differences are powers of two, that takes space $O(\frac{dn}{\varepsilon^2}\log\frac{1}{\delta})$ and processing time $O\big((dn)^4 (2n)^d \frac{1}{\varepsilon^2}\log\frac{1}{\delta}\big)$ per item.
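As referenced above, the following sketch (reusing dnf_leq from the earlier range example, and restricted for brevity to progressions starting at 0) conjoins each range term with the forced low-order bits and drops inconsistent terms; the names are ours.

```python
def progression_leq_terms(b, r):
    """DNF terms for P(0, b, 2^r) = {x : x <= b and x = 0 (mod 2^r)}:
    conjoin each term of dnf_leq(b) with the r low-order bits forced to 0."""
    n = len(b)
    low = {j: 0 for j in range(n - r, n)}
    out = []
    for t in dnf_leq(b):
        if all(t.get(j, 0) == 0 for j in low):  # consistent with low bits = 0?
            merged = dict(t)
            merged.update(low)
            out.append(merged)                  # at most |dnf_leq(b)| terms
    return out

# b = 5 = 101 in binary, step 2: the progression {0, 2, 4}
assert len(progression_leq_terms([1, 0, 1], 1)) == 2  # covers {0,2} and {4}
```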
Affine Spaces
Another example of a structured stream is one where each item of the stream is an affine space represented by $\langle A, b \rangle$, where $A$ is a Boolean matrix and $b$ is a zero-one vector. Without loss of generality, we may assume that $A$ is an $n \times n$ matrix over $\mathbb{F}_2$. Thus an affine stream consists of $\langle A_1, b_1 \rangle, \langle A_2, b_2 \rangle, \ldots$, where each $\langle A_i, b_i \rangle$ succinctly represents a set $S_{A_i, b_i}$.
For a Boolean matrix $A$ and a zero-one vector $b$, let $S_{A,b}$ denote the set of all $x \in \{0,1\}^n$ that satisfy $Ax = b$.
Proposition 5.
Given $\langle A, b \rangle$, $h \in \mathcal{H}_{Toeplitz}(n,m)$, and $t$ as input, there is an algorithm that returns a set $\mathcal{B} \subseteq h(S_{A,b})$ so that if $|h(S_{A,b})| \leq t$, then $\mathcal{B} = h(S_{A,b})$; otherwise $\mathcal{B}$ is the set of $t$ lexicographically minimum elements of $h(S_{A,b})$. The time taken by this algorithm is polynomial in $n$, $m$, and $t$, and the space taken by the algorithm is $O(n \cdot (n+m))$.
Proof.
Let be the matrix that specifies the hash function . Let , and the goal is to compute the smallest element of . Note that if , then it must be the case that where is the matrix obtained by appending rows of to the rows of (at the end), and is the vector obtained by appending to . Note that is a matrix with rows. Now the proof is very similar to the proof of Proposition 5. We can do a prefix search as before and this involves doing Gaussian elimination using sub matrices of . ∎
Theorem 6.
There is a streaming algorithm that computes an $(\varepsilon,\delta)$-approximation of $F_0$ over affine spaces. This algorithm takes $O(\frac{n}{\varepsilon^2}\log\frac{1}{\delta})$ space and per-item processing time polynomial in $n$, $\frac{1}{\varepsilon}$, and $\log\frac{1}{\delta}$.
6. Conclusion and Future Outlook
To summarize, our investigation led to a diverse set of results that unify over two decades of work in model counting and $F_0$ estimation. We believe that the viewpoint presented in this work has the potential to spur several new interesting research directions. We sketch some of these directions below:
- Sampling: The problems of counting and sampling are closely related. In particular, the seminal work of Jerrum, Valiant, and Vazirani (Jerrum et al., 1986) showed that the problems of approximate counting and almost-uniform sampling are inter-reducible for self-reducible NP problems. Concurrent to developments in approximate model counting, there has been significant interest in the design of efficient sampling algorithms. A natural direction would be to launch a similar investigation.
- Higher Moments: There has been a long line of work on the estimation of higher frequency moments, i.e., $F_k$, in the streaming context. A natural direction of future research is to adapt the notion of $F_k$ to the context of CSP's. For example, in the context of DNF, one can view $F_1$ as simply the sum, over the terms, of the number of satisfying assignments of each term, but it remains to understand the relevance and potential applications of higher moments such as $F_2$ in the context of CSP's. Given the similarity of the core algorithmic frameworks for higher moments, we expect the framework and recipe presented in this paper to extend to algorithms for higher moments in the context of CSP's.
- Sparse XORs: In the context of model counting, the performance of the underlying SAT solvers strongly depends on the size of the XORs. The standard constructions of $\mathcal{H}_{Toeplitz}$ and $\mathcal{H}_{xor}$ lead to XORs of expected size $n/2$, and an interesting line of research has focused on the design of sparse XOR-based hash functions (Gomes et al., 2007a; Ivrii et al., 2015; Ermon et al., 2014; Achlioptas and Theodoropoulos, 2017; Asteris and Dimakis, 2016), culminating in the result that one can use hash functions of the form $h(x) = Ax + b$ wherein each entry of the $m$-th row of $A$ is 1 with probability $O(\frac{\log m}{m})$ (Meel and Akshay, 2020). Such XORs were shown to improve runtime efficiency. In this context, a natural direction would be to explore the usage of sparse XORs in the context of $F_0$ estimation.
Acknowledgements.
We thank the anonymous reviewers of PODS 21 for valuable comments. Bhattacharyya was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme [NRF-NRFFAI1-2019-0002] and an Amazon Research Award. Meel was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme [NRF-NRFFAI1-2019-0004] and AI Singapore Programme [AISG-RP-2018-005], and NUS ODPRT Grant [R-252-000-685-13]. Vinod was supported in part by NSF CCF-184908 and NSF HDR:TRIPODS-1934884 awards. Pavan was supported in part by NSF CCF-1849053 and NSF HDR:TRIPODS-1934884 awards.
References
- Abboud et al. (2019) Ralph Abboud, Ismail Ilkan Ceylan, and Thomas Lukasiewicz. 2019. Learning to Reason: Leveraging Neural Networks for Approximate DNF Counting. arXiv preprint arXiv:1904.02688 (2019).
- Achlioptas and Theodoropoulos (2017) Dimitris Achlioptas and Panos Theodoropoulos. 2017. Probabilistic model counting with short XORs. In Proc. of SAT. Springer, 3–19.
- Alon et al. (1999) Noga Alon, Yossi Matias, and Mario Szegedy. 1999. The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58, 1 (1999), 137–147.
- Arackaparambil et al. (2009) Chrisil Arackaparambil, Joshua Brody, and Amit Chakrabarti. 2009. Functional monitoring without monotonicity. In International Colloquium on Automata, Languages, and Programming. Springer, 95–106.
- Asteris and Dimakis (2016) Megasthenis Asteris and Alexandros G Dimakis. 2016. LDPC codes for discrete integration. Technical Report. Technical report, UT Austin.
- Babcock and Olston (2003) Brian Babcock and Chris Olston. 2003. Distributed top-k monitoring. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data. 28–39.
- Bar-Yossef et al. ([n. d.]) Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. [n. d.]. Counting Distinct Elements in a Data Stream. In Proc. of RANDOM, Vol. 2483. 1–10.
- Bar-Yossef et al. (2002) Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. 2002. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. of SODA. ACM/SIAM, 623–632.
- Braverman and Ostrovsky (2010) Vladimir Braverman and Rafail Ostrovsky. 2010. Recursive Sketching For Frequency Moments. CoRR abs/1011.2571 (2010).
- Carter and Wegman (1977) J Lawrence Carter and Mark N Wegman. 1977. Universal classes of hash functions. In Proceedings of the ninth annual ACM symposium on Theory of computing. ACM, 106–112.
- Chakraborty et al. (2015) Supratik Chakraborty, Dror Fried, Kuldeep S Meel, and Moshe Y Vardi. 2015. From weighted to unweighted model counting. In Proceedings of AAAI. 689–695.
- Chakraborty et al. (2013a) S. Chakraborty, K. S. Meel, and M. Y. Vardi. 2013a. A Scalable and Nearly Uniform Generator of SAT Witnesses. In Proc. of CAV. 608–623.
- Chakraborty et al. (2013b) S. Chakraborty, K. S. Meel, and M. Y. Vardi. 2013b. A Scalable Approximate Model Counter. In Proc. of CP. 200–216.
- Chakraborty et al. (2016) S. Chakraborty, K. S. Meel, and M. Y. Vardi. 2016. Algorithmic Improvements in Approximate Counting for Probabilistic Inference: From Linear to Logarithmic SAT Calls. In Proc. of IJCAI.
- Charikar et al. (2004) Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. 2004. Finding frequent items in data streams. Theor. Comput. Sci. 312, 1 (2004), 3–15.
- Chistikov et al. (2015) Dmitry Chistikov, Rayna Dimitrova, and Rupak Majumdar. 2015. Approximate counting in SMT and value estimation for probabilistic programs. In Proc. of TACAS. Springer, 320–334.
- Considine et al. (2004) Jeffrey Considine, Feifei Li, George Kollios, and John W. Byers. 2004. Approximate Aggregation Techniques for Sensor Databases. In Proc. of ICDE, Z. Meral Özsoyoglu and Stanley B. Zdonik (Eds.). IEEE Computer Society, 449–460.
- Cormode et al. (2005) Graham Cormode, Minos Garofalakis, Shanmugavelayutham Muthukrishnan, and Rajeev Rastogi. 2005. Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In Proc. of SIGMOD. 25–36.
- Cormode and Muthukrishnan (2003) Graham Cormode and S. Muthukrishnan. 2003. Estimating Dominance Norms of Multiple Data Streams. In Proc. of ESA (Lecture Notes in Computer Science), Giuseppe Di Battista and Uri Zwick (Eds.), Vol. 2832. Springer, 148–160.
- Cormode et al. (2011) Graham Cormode, Shanmugavelayutham Muthukrishnan, and Ke Yi. 2011. Algorithms for distributed functional monitoring. ACM Transactions on Algorithms (TALG) 7, 2 (2011), 1–20.
- Cormode et al. (2012) Graham Cormode, Shanmugavelayutham Muthukrishnan, Ke Yi, and Qin Zhang. 2012. Continuous sampling from distributed streams. Journal of the ACM (JACM) 59, 2 (2012), 1–25.
- Dagum et al. (2000) P. Dagum, R. Karp, M. Luby, and S. Ross. 2000. An optimal algorithm for Monte Carlo estimation. SIAM Journal on computing 29, 5 (2000), 1484–1496.
- Ermon et al. (2013) Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, and Bart Selman. 2013. Taming the Curse of Dimensionality: Discrete Integration by Hashing and Optimization. In Proc. of ICML. 334–342.
- Ermon et al. (2014) S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman. 2014. Low-density Parity Constraints for Hashing-Based Discrete Integration. In Proc. of ICML. 271–279.
- Feng et al. (2018a) Weiming Feng, Thomas P Hayes, and Yitong Yin. 2018a. Distributed symmetry breaking in sampling (optimal distributed randomly coloring with fewer colors). arXiv preprint arXiv:1802.06953 (2018).
- Feng et al. (2018b) Weiming Feng, Yuxin Sun, and Yitong Yin. 2018b. What can be sampled locally? Distributed Computing (2018), 1–27.
- Feng and Yin (2018) Weiming Feng and Yitong Yin. 2018. On local distributed sampling and counting. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing. 189–198.
- Fischer and Ghaffari (2018) Manuela Fischer and Mohsen Ghaffari. 2018. A Simple Parallel and Distributed Sampling Technique: Local Glauber Dynamics. In 32nd International Symposium on Distributed Computing.
- Flajolet and Martin (1985) Philippe Flajolet and G. Nigel Martin. 1985. Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31, 2 (1985), 182–209.
- Gibbons and Tirthapura (2001) Phillip B. Gibbons and Srikanta Tirthapura. 2001. Estimating simple functions on the union of data streams. In Proc. of SPAA, Arnold L. Rosenberg (Ed.). ACM, 281–291.
- Gomes et al. (2007b) C.P. Gomes, A. Sabharwal, and B. Selman. 2007b. Near-Uniform sampling of combinatorial spaces using XOR constraints. In Proc. of NIPS. 670–676.
- Gomes et al. (2007a) Carla P Gomes, Joerg Hoffmann, Ashish Sabharwal, and Bart Selman. 2007a. From Sampling to Model Counting.. In Proc. of IJCAI. 2293–2299.
- Huang et al. (2012) Zengfeng Huang, Ke Yi, and Qin Zhang. 2012. Randomized algorithms for tracking distributed count, frequencies, and ranks. In Proc. of PODS. 295–306.
- Indyk and Woodruff (2005) Piotr Indyk and David P. Woodruff. 2005. Optimal approximations of the frequency moments of data streams. In Proc. of STOC. ACM, 202–208.
- Ivrii et al. (2015) Alexander Ivrii, Sharad Malik, Kuldeep S. Meel, and Moshe Y. Vardi. 2015. On computing minimal independent support and its applications to sampling and counting. Constraints (2015), 1–18. https://doi.org/10.1007/s10601-015-9204-z
- Jerrum et al. (1986) M.R. Jerrum, L.G. Valiant, and V.V. Vazirani. 1986. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science 43, 2-3 (1986), 169–188.
- Kane et al. (2010) Daniel M. Kane, Jelani Nelson, and David P. Woodruff. 2010. An optimal algorithm for the distinct elements problem. In Proc. of PODS. ACM, 41–52.
- Karp and Luby (1983) R.M. Karp and M. Luby. 1983. Monte-Carlo algorithms for enumeration and reliability problems. Proc. of FOCS (1983).
- Karp et al. (1989) Richard M Karp, Michael Luby, and Neal Madras. 1989. Monte-Carlo approximation algorithms for enumeration problems. Journal of Algorithms 10, 3 (1989), 429–448. https://doi.org/10.1016/0196-6774(89)90038-2
- Keralapura et al. (2006) Ram Keralapura, Graham Cormode, and Jeyashankher Ramamirtham. 2006. Communication-efficient distributed monitoring of thresholded counts. In Proc. of SIGMOD. 289–300.
- Keren et al. (2011) Daniel Keren, Izchak Sharfman, Assaf Schuster, and Avishay Livne. 2011. Shape sensitive geometric monitoring. IEEE Transactions on Knowledge and Data Engineering 24, 8 (2011), 1520–1535.
- Manjhi et al. (2005) Amit Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, and Christopher Olston. 2005. Finding (recently) frequent items in distributed data streams. In Proc. of ICDE. IEEE, 767–778.
- Meel and Akshay (2020) Kuldeep S. Meel and S. Akshay. 2020. Sparse Hashing for Scalable Approximate Model Counting: Theory and Practice. In Proc. of LICS.
- Meel et al. (2017) Kuldeep S Meel, Aditya A Shrotri, and Moshe Y Vardi. 2017. On Hashing-Based Approaches to Approximate DNF-Counting. In In Proc. of FSTTCS.
- Meel et al. (2018) Kuldeep S. Meel, Aditya A. Shrotri, and Moshe Y. Vardi. 2018. Not All FPRASs are Equal: Demystifying FPRASs for DNF-Counting. (Dec. 2018).
- Meel et al. (2019) Kuldeep S. Meel, Aditya A. Shrotri, and Moshe Y. Vardi. 2019. Not All FPRASs are Equal: Demystifying FPRASs for DNF-Counting (Extended Abstract). In Proc. of IJCAI.
- Pavan and Tirthapura (2007) A. Pavan and Srikanta Tirthapura. 2007. Range-Efficient Counting of Distinct Elements in a Massive Data Stream. SIAM J. Comput. 37, 2 (2007), 359–379.
- Ré and Suciu (2008) Christopher Ré and Dan Suciu. 2008. Approximate lineage for probabilistic databases. Proceedings of the VLDB Endowment 1, 1 (2008), 797–808.
- Senellart (2018) Pierre Senellart. 2018. Provenance and probabilities in relational databases. ACM SIGMOD Record 46, 4 (2018), 5–15.
- Senellart (2019) Pierre Senellart. 2019. Provenance in Databases: Principles and Applications. In Reasoning Web. Explainable Artificial Intelligence. Springer, 104–109.
- Sharfman et al. (2010) Izchak Sharfman, Assaf Schuster, and Daniel Keren. 2010. A geometric approach to monitoring threshold functions over distributed data streams. In Ubiquitous knowledge discovery. Springer, 163–186.
- Soos et al. (2020) Mate Soos, Stephan Gocht, and Kuldeep S. Meel. 2020. Tinted, Detached, and Lazy CNF-XOR solving and its Applications to Counting and Sampling. In Proceedings of International Conference on Computer-Aided Verification (CAV).
- Soos and Meel (2019) Mate Soos and Kuldeep S Meel. 2019. BIRD: Engineering an Efficient CNF-XOR SAT Solver and its Applications to Approximate Model Counting. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI) (Jan. 2019).
- Soos et al. (2009) Mate Soos, Karsten Nohl, and Claude Castelluccia. 2009. Extending SAT Solvers to Cryptographic Problems. In Proc. of SAT. 244–257.
- Stockmeyer (1983) L. Stockmeyer. 1983. The complexity of approximate counting. In Proc. of STOC. 118–126.
- Sun and Poon (2009) He Sun and Chung Keung Poon. 2009. Two improved range-efficient algorithms for $F_0$ estimation. Theor. Comput. Sci. 410, 11 (2009), 1073–1080.
- Meel et al. (2021) Kuldeep S. Meel, N. V. Vinodchandran, and Sourav Chakraborty. 2021. Estimating Size of Union of Sets in Streaming Model. In Proc. of PODS.
- Tirthapura and Woodruff (2012) Srikanta Tirthapura and David P. Woodruff. 2012. Rectangle-efficient aggregation in spatial data streams. In Proc. of PODS. ACM, 283–294.
- Valiant (1979) L.G. Valiant. 1979. The complexity of enumeration and reliability problems. SIAM J. Comput. 8, 3 (1979), 410–421.
- Woodruff ([n. d.]) David Woodruff. [n. d.]. Personal communication.
- Woodruff and Zhang (2012) David P Woodruff and Qin Zhang. 2012. Tight bounds for distributed functional monitoring. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing. 941–960.
- Woodruff and Zhang (2017) David P Woodruff and Qin Zhang. 2017. When distributed computation is communication expensive. Distributed Computing 30, 5 (2017), 309–323.
- Xu et al. (2008) Lin Xu, Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. 2008. SATzilla: portfolio-based algorithm selection for SAT. Journal of artificial intelligence research 32 (2008), 565–606.
- Yi and Zhang (2013) Ke Yi and Qin Zhang. 2013. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica 65, 1 (2013), 206–223.