
Model Counting meets F_0 Estimation

A. Pavan ⓡ [email protected] Iowa State University N. V. Vinodchandran ⓡ [email protected] University of Nebraska, Lincoln Arnab Bhattacharyya ⓡ [email protected] National University of Singapore  and  Kuldeep S. Meel [email protected] National University of Singapore
(2021)
Abstract.

Constraint satisfaction problems (CSPs) and data stream models are two powerful abstractions that capture a wide variety of problems arising in different domains of computer science. Developments in the two communities have mostly occurred independently and with little interaction between them. In this work, we seek to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights. To this end, we focus on two foundational problems: model counting for CSPs and computation of zeroth frequency moments (F_0) for data streams.

Our investigations lead us to observe a striking similarity in the core techniques employed in the algorithmic frameworks that have evolved separately for model counting and F_0 computation. We design a recipe for translating algorithms developed for F_0 estimation into algorithms for model counting, resulting in new model counting algorithms. We then observe that algorithms in the context of distributed streaming can be transformed into distributed algorithms for model counting. We next turn our attention to viewing streaming from the lens of counting and show that framing F_0 estimation as a special case of #DNF counting allows us to obtain a general recipe for a rich class of streaming problems, which had been subjected to case-specific analysis in prior works. In particular, our view yields a state-of-the-art algorithm for multidimensional range-efficient F_0 estimation with a simpler analysis.

Model Counting, Streaming Algorithms, F_0 Computation, DNF Counting
ccs: Theory of computation → Streaming models; Theory of computation → Sketching and sampling
journalyear: 2021; copyright: acmcopyright
conference: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems; June 20–25, 2021; Virtual Event, China
booktitle: Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS '21), June 20–25, 2021, Virtual Event, China
price: 15.00; isbn: 978-1-4503-8381-3/21/06; doi: 10.1145/3452021.3458311

1. Introduction

Constraint Satisfaction Problems (CSPs) and the data stream model are two core themes in computer science with a diverse set of applications, spanning probabilistic reasoning, networks, databases, verification, and the like. Model counting and computation of the zeroth frequency moment (F_0) are fundamental problems for CSPs and the data stream model, respectively. This paper is motivated by our observation that despite the usage of similar algorithmic techniques for the two problems, the developments in the two communities have, surprisingly, evolved separately, and rarely has a paper from one community been cited by the other.

Given a set of constraints φ over a set of variables in a finite domain 𝒟, the problem of model counting is to estimate the number of solutions of φ. We are often interested in the case when φ is restricted to a special class of representations such as Conjunctive Normal Form (CNF) and Disjunctive Normal Form (DNF). A data stream over a domain [N] is represented by 𝐚 = a_1, a_2, …, a_m wherein each item a_i ⊆ [N]. The zeroth frequency moment of 𝐚, denoted F_0, is the number of distinct elements appearing in 𝐚, i.e., |∪_i a_i| (traditionally, the a_i's are singletons; we will also be interested in the case when the a_i's are sets). The fundamental nature of model counting and F_0 computation over data streams has led to intense interest from theoreticians and practitioners alike in the respective communities for the past few decades.

The starting point of this work is the confluence of two viewpoints. The first viewpoint contends that some of the algorithms for model counting can conceptually be thought of as operating on the stream of solutions of the constraints. The second viewpoint contends that a stream can be viewed as a DNF formula, and the problem of F_0 estimation is similar to model counting. These viewpoints make it natural to believe that algorithms developed in the streaming setting can be directly applied to model counting, and vice versa. We explore this connection and indeed design new algorithms for model counting inspired by algorithms for estimating F_0 in data streams. By exploring this connection further, we design new algorithms to estimate F_0 for streaming sets that are succinctly represented by constraints. To put our contributions in context, we briefly survey the historical development of algorithmic frameworks in both model counting and F_0 estimation and point out the similarities.

Model Counting

The complexity-theoretic study of model counting was initiated by Valiant, who showed that this problem, in general, is #P-complete (Valiant, 1979). This motivated researchers to investigate approximate model counting, and in particular (ε,δ)-approximation schemes. The complexity of approximate model counting depends on the representation of φ. When φ is represented as a CNF formula, designing an efficient (ε,δ)-approximation is NP-hard (Stockmeyer, 1983). In contrast, when it is represented as a DNF formula, model counting admits an FPRAS (fully polynomial-time randomized approximation scheme) (Karp and Luby, 1983; Karp et al., 1989). We will use #CNF to refer to the case when φ is a CNF formula and #DNF to the case when φ is a DNF formula.

For #CNF, Stockmeyer (Stockmeyer, 1983) provided a hashing-based randomized procedure that computes an (ε,δ)-approximation in time polynomial in |φ|, 1/ε, and log(1/δ), given access to an NP oracle. Building on Stockmeyer's approach and motivated by the unprecedented breakthroughs in the design of SAT solvers, researchers have proposed a series of algorithmic improvements that have allowed hashing-based techniques for approximate model counting to scale to formulas involving hundreds of thousands of variables (Gomes et al., 2007b; Chakraborty et al., 2013b; Ermon et al., 2013; Ivrii et al., 2015; Chistikov et al., 2015; Chakraborty et al., 2016; Achlioptas and Theodoropoulos, 2017; Soos and Meel, 2019; Soos et al., 2020). Practical implementations substitute the NP oracle with a SAT solver. In the context of model counting, we are primarily interested in time complexity, and therefore the number of NP queries is of key importance. The emphasis on the number of NP calls also stems from practice, as practical implementations of model counting algorithms have been shown to spend over 99% of their time in the underlying SAT calls (Soos and Meel, 2019).

Karp and Luby (Karp and Luby, 1983) proposed the first FPRAS for #DNF, which was subsequently improved in follow-up works (Karp et al., 1989; Dagum et al., 2000). Chakraborty, Meel, and Vardi (Chakraborty et al., 2016) demonstrated that the hashing-based framework can be extended to #DNF, thereby providing a unified framework for both #CNF and #DNF. Meel, Shrotri, and Vardi (Meel et al., 2017, 2018, 2019) subsequently improved the complexity of the hashing-based approach for #DNF and observed that hashing-based techniques achieve better scalability than Monte Carlo techniques.

Zeroth Frequency Moment Estimation

Estimating an (ε,δ)-approximation of the k-th frequency moment (F_k) is a central problem in the data streaming model (Alon et al., 1999). In particular, considerable work has been done on designing algorithms for estimating the 0-th frequency moment (F_0), the number of distinct elements in the stream. While designing streaming algorithms, the primary resource concerns are two-fold: space complexity and processing time per element. For an algorithm to be considered efficient, these should be poly(log N, 1/ε), where N is the size of the universe. (We ignore O(log(1/δ)) factors in this discussion.)

The first algorithm for computing F_0 with a constant factor approximation was proposed by Flajolet and Martin, who assumed the existence of hash functions with ideal properties, resulting in an algorithm with undesirable space complexity (Flajolet and Martin, 1985). In their seminal work, Alon, Matias, and Szegedy designed an O(log N) space algorithm for F_0 with a constant approximation ratio that employs 2-universal hash functions (Alon et al., 1999). Subsequent investigations into hashing-based schemes by Gibbons and Tirthapura (Gibbons and Tirthapura, 2001) and Bar-Yossef, Kumar, and Sivakumar (Bar-Yossef et al., 2002) provided (ε,δ)-approximation algorithms with space and time complexity log N · poly(1/ε). Subsequently, Bar-Yossef et al. proposed three algorithms with improved space and time complexity (Bar-Yossef et al., [n. d.]). While the three algorithms employ hash functions, they differ conceptually in the usage of relevant random variables for the estimation of F_0. This line of work resulted in the development of an algorithm with optimal space complexity O(log N + 1/ε²) and O(log N) update time (Kane et al., 2010).

The above-mentioned works are in the setting where each data item a_i is an element of the universe. Subsequently, there has been a series of results on estimating F_0 in richer scenarios, with a particular focus on handling the case a_i ⊆ [N], such as a list or a multidimensional range (Bar-Yossef et al., 2002; Pavan and Tirthapura, 2007; Tirthapura and Woodruff, 2012; Sun and Poon, 2009).

The Road to a Unifying Framework

As mentioned above, the algorithmic developments for model counting and F_0 estimation have largely relied on hashing-based techniques, and yet these developments have, surprisingly, been separate; rarely has a work from one community been cited by the other. In this context, we wondered whether it is possible to bridge this gap and whether such an exercise would lead to new algorithms for model counting as well as for F_0 estimation. The main conceptual contribution of this work is an affirmative answer to this question. First, we point out that two well-known algorithms, Stockmeyer's #CNF algorithm (Stockmeyer, 1983) as refined by Chakraborty et al. (Chakraborty et al., 2016) and Gibbons and Tirthapura's F_0 estimation algorithm (Gibbons and Tirthapura, 2001), are essentially the same.

The core idea of the hashing-based technique of Stockmeyer's and Chakraborty et al.'s scheme is to use pairwise independent hash functions to partition the solution space (the satisfying assignments of a CNF formula) into roughly equal and small cells, wherein a cell is small if the number of solutions in it is at most a pre-computed threshold, denoted 𝖳𝗁𝗋𝖾𝗌𝗁. A good estimate for the number of solutions is then the number of solutions in an arbitrary cell multiplied by the number of cells. To determine the appropriate number of cells, the solution space is iteratively partitioned as follows. At the m-th iteration, a hash function h with range {0,1}^m is considered, resulting in cells h^{-1}(y) for each y ∈ {0,1}^m. An NP oracle can be employed to check whether a particular cell (for example, h^{-1}(0^m)) is small by enumerating solutions one by one until we have either obtained 𝖳𝗁𝗋𝖾𝗌𝗁 + 1 solutions or exhaustively enumerated all of them. If the cell h^{-1}(0^m) is small, then the algorithm outputs t × 2^m as the estimate, where t is the number of solutions in the cell h^{-1}(0^m). Otherwise, the algorithm moves on to the next iteration, where a hash function with range {0,1}^{m+1} is considered.
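The iterative search described above can be condensed into a few lines. In the sketch below, brute-force enumeration over an explicit solution set stands in for the NP oracle, and all names are illustrative rather than the authors' implementation; the hash is h(x) = Ax + b over GF(2), restricted to its first m rows at level m.

```python
import random

def approx_count_core(solutions, n, thresh):
    """Estimate |solutions| by iteratively halving the space with a random
    2-wise independent hash h(x) = Ax + b over GF(2). `solutions` is a set
    of n-bit integers; scanning it stands in for the NP oracle."""
    A = [random.getrandbits(n) for _ in range(n)]  # rows of a random matrix
    b = random.getrandbits(n)

    def h_prefix(x, m):
        # first m bits of h(x): row i is <A[i], x> + b_i over GF(2)
        return tuple((bin(A[i] & x).count('1') + (b >> i)) % 2
                     for i in range(m))

    for m in range(n + 1):
        # enumerate the cell h_m^{-1}(0^m), stopping at thresh + 1 solutions
        cell = []
        for x in solutions:
            if h_prefix(x, m) == (0,) * m:
                cell.append(x)
                if len(cell) > thresh:
                    break
        if len(cell) <= thresh:
            return len(cell) * 2 ** m  # |cell| x (number of cells)
    return len(solutions)  # fallback: every level overflowed (tiny thresh)
```

At m = 0 the single cell is the whole space, so small instances are counted exactly; each further level roughly halves the cell.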

We now describe Gibbons and Tirthapura's algorithm for F_0 estimation, which we call the 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀 algorithm. We will assume the universe [N] = {0,1}^n. The algorithm maintains a bucket of size 𝖳𝗁𝗋𝖾𝗌𝗁 and starts by picking a hash function h : {0,1}^n → {0,1}^n. It iterates over sampling levels. At level m, when a data item x arrives, if h(x) starts with 0^m, then x is added to the bucket. If the bucket overflows, then the sampling level is increased to m+1 and all elements x in the bucket except those with h(x) starting with 0^{m+1} are deleted. At the end of the stream, the value t × 2^m is output as the estimate, where t is the number of elements in the bucket and m is the sampling level.
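A minimal single-estimator sketch of this procedure follows (illustrative only: a full implementation runs O(log(1/δ)) independent copies and takes the median, and the function names are ours, not the original authors'):

```python
import random

def bucketing_f0(stream, n, thresh):
    """One Bucketing-style estimator of F0 over a stream of n-bit integers.
    Hash h(x) = Ax + b over GF(2); x stays in the bucket at level m
    iff the first m bits of h(x) are all zero."""
    A = [random.getrandbits(n) for _ in range(n)]
    b = random.getrandbits(n)

    def h_bit(x, i):  # i-th bit of h(x)
        return (bin(A[i] & x).count('1') + (b >> i)) % 2

    def in_level(x, m):
        return all(h_bit(x, i) == 0 for i in range(m))

    m, bucket = 0, set()
    for x in stream:
        if in_level(x, m):
            bucket.add(x)
            while len(bucket) > thresh:  # overflow: raise sampling level
                m += 1
                bucket = {y for y in bucket if in_level(y, m)}
    return len(bucket) * 2 ** m
```

Note the `set` makes duplicates in the stream harmless, matching the fact that F_0 depends only on the distinct elements.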

These two algorithms are conceptually the same. In the 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀 algorithm, at sampling level m, only the first m bits of the hashed value are examined; this is equivalent to considering a hash function with range {0,1}^m. Thus the bucket is nothing but the set of all elements in the stream that belong to the cell h^{-1}(0^m). The final estimate is the number of elements in the bucket times the number of cells, identical to Chakraborty et al.'s algorithm. In both algorithms, to obtain an (ε,δ)-approximation, the 𝖳𝗁𝗋𝖾𝗌𝗁 value is chosen as O(1/ε²) and the median of O(log(1/δ)) independent estimates is output.

Our Contributions

Motivated by the conceptual identity between the two algorithms, we further explore the connections between algorithms for model counting and F_0 estimation.

  (1)

    We formalize a recipe to transform streaming algorithms for F_0 estimation into algorithms for model counting. Such a transformation yields new (ε,δ)-approximation algorithms for model counting that differ from currently known algorithms. Recent studies in the field of automated reasoning have highlighted the need for diverse approaches (Xu et al., 2008), and similar studies in the context of #DNF provided strong evidence for the power of a diversity of approaches (Meel et al., 2018). In this context, these newly obtained algorithms open up several interesting directions of research, ranging from the development of MaxSAT solvers with native XOR support to open problems in designing FPRAS schemes.

  (2)

    Given the central importance of #DNF (and its weighted variant) due to a recent surge of interest in scalable techniques for provenance in probabilistic databases (Senellart, 2018, 2019), a natural question is whether one can design efficient techniques in the distributed setting. In this work, we initiate the study of distributed #DNF. We then show that the transformation recipe from F_0 estimation to model counting allows us to view the design of distributed #DNF algorithms through the lens of distributed functional monitoring, which is well studied in the data streaming literature.

  (3)

    Building upon the connection between model counting and F_0 estimation, we design new algorithms to estimate F_0 over structured set streams where each element of the stream is a (succinct representation of a) subset of the universe. Thus, the stream is S_1, S_2, …, where each S_i ⊆ [N], and the goal is to estimate the F_0 of the stream, i.e., the size of ∪_i S_i. In this scenario, a traditional F_0 streaming algorithm that processes each element of each set incurs high per-item processing time and is inefficient. Thus the goal is to design algorithms whose per-item time (the time to process each S_i) is poly-logarithmic in the size of the universe. Structured set streams considered in the literature include 1-dimensional and multidimensional ranges (Pavan and Tirthapura, 2007; Tirthapura and Woodruff, 2012). Several interesting problems, such as max-dominance norm (Cormode and Muthukrishnan, 2003), counting triangles in graphs (Bar-Yossef et al., 2002), and the distinct summation problem (Considine et al., 2004), can be reduced to computing F_0 over such ranges.

    We observe that several structured sets can be represented as small DNF formulae, and thus F_0 counting over these structured set data streams can be viewed as a special case of #DNF. Using the hashing-based techniques for #DNF, we obtain a general recipe for a rich class of structured sets that includes multidimensional ranges, multidimensional arithmetic progressions, and affine spaces. Prior work on single and multidimensional ranges (see Remark 1 in Section 5 for a discussion of the earlier work on multidimensional ranges (Tirthapura and Woodruff, 2012)) had to rely on involved analyses for each specific instance, while our work provides a general recipe for both analysis and implementation.

Organization

We present notation and preliminaries in Section 2. We then present the transformation of F_0 estimation to model counting in Section 3. We focus on distributed #DNF in Section 4, and present the transformation of model counting algorithms to structured set streaming algorithms in Section 5. We conclude in Section 6 with a discussion of future research directions.

We would like to emphasize that the primary objective of this work is to provide a unifying framework for F0F_{0} estimation and model counting. Therefore, when designing new algorithms based on the transformation recipes, we intentionally focus on conceptually cleaner algorithms and leave potential improvements in time and space complexity for future work.

2. Notation

We will assume the universe [N] = {0,1}^n. We write Pr[𝒵 : Ω] to denote the probability of outcome 𝒵 when sampling from a probability space Ω. For brevity, we omit Ω when it is clear from the context.

F_0 Estimation

A data stream 𝐚 over domain [N] can be represented as 𝐚 = a_1, a_2, …, a_m wherein each item a_i ∈ [N]. Let 𝐚_u = ∪_i {a_i}. The F_0 of the stream 𝐚 is |𝐚_u|. We are often interested in a probably approximately correct scheme that returns an (ε,δ)-estimate c, i.e.,

Pr[ |𝐚_u|/(1+ε) ≤ c ≤ (1+ε)·|𝐚_u| ] ≥ 1 − δ

Model Counting

Let {x_1, x_2, …, x_n} be a set of Boolean variables. For a Boolean formula φ, let 𝖵𝖺𝗋𝗌(φ) denote the set of variables appearing in φ. Throughout the paper, unless otherwise stated, we will assume that n = |𝖵𝖺𝗋𝗌(φ)|. We denote the set of all satisfying assignments of φ by 𝖲𝗈𝗅(φ).

The propositional model counting problem is to compute |𝖲𝗈𝗅(φ)| for a given formula φ. A probably approximately correct (PAC) counter is a probabilistic algorithm 𝖠𝗉𝗉𝗋𝗈𝗑𝖢𝗈𝗎𝗇𝗍(·,·,·) that takes as input a formula φ, a tolerance ε > 0, and a confidence δ ∈ (0,1], and returns an (ε,δ)-estimate c, i.e.,

Pr[ |𝖲𝗈𝗅(φ)|/(1+ε) ≤ c ≤ (1+ε)·|𝖲𝗈𝗅(φ)| ] ≥ 1 − δ.

PAC guarantees are also sometimes referred to as (ε,δ)-guarantees. We use #CNF (resp. #DNF) to refer to the model counting problem when φ is represented as a CNF (resp. DNF) formula.

k-wise independent hash functions

Let n, m ∈ ℕ and let ℋ(n,m) ≜ {h : {0,1}^n → {0,1}^m} be a family of hash functions mapping {0,1}^n to {0,1}^m. We use h ←_R ℋ(n,m) to denote the probability space obtained by choosing a function h uniformly at random from ℋ(n,m).

Definition 1.

A family of hash functions ℋ(n,m) is k-wise independent if for all α_1, α_2, …, α_k ∈ {0,1}^m and all distinct x_1, x_2, …, x_k ∈ {0,1}^n, with h ←_R ℋ(n,m),

(1)   Pr[(h(x_1) = α_1) ∧ (h(x_2) = α_2) ∧ ⋯ ∧ (h(x_k) = α_k)] = 1/2^{km}

We will use ℋ_{𝗄𝗐𝗂𝗌𝖾}(n,m) to refer to a k-wise independent family of hash functions mapping {0,1}^n to {0,1}^m.
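As a toy illustration of Definition 1, consider the classic family h_{a,b}(x) = (ax + b) mod p over a prime field (not one of the families used in this paper). For any two distinct inputs and any two targets, exactly one pair (a,b) realizes both, so the definition's probability 1/p² can be verified by exhaustive counting:

```python
from itertools import product

# Family h_{a,b}(x) = (a*x + b) mod p, mapping Z_p -> Z_p, is 2-wise
# independent: for distinct x1 != x2 and any targets (t1, t2), exactly one
# of the p^2 pairs (a, b) maps x1 -> t1 and x2 -> t2.
p = 5
x1, x2 = 0, 3          # distinct inputs
t1, t2 = 2, 4          # arbitrary targets
hits = sum(1 for a, b in product(range(p), repeat=2)
           if (a * x1 + b) % p == t1 and (a * x2 + b) % p == t2)
print(hits, p * p)     # 1 hit out of 25 pairs -> probability 1/p^2
```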

Explicit families

In this work, one hash family of particular interest is ℋ_𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n,m), which is known to be 2-wise independent (Carter and Wegman, 1977). The family is defined as follows: ℋ_𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n,m) ≜ {h : {0,1}^n → {0,1}^m} is the family of functions of the form h(x) = Ax + b with A ∈ F_2^{m×n} and b ∈ F_2^{m×1}, where A is a uniformly randomly chosen Toeplitz matrix of size m×n and b is a uniformly randomly chosen matrix of size m×1. Another related hash family of interest is ℋ_𝗑𝗈𝗋(n,m), wherein h(x) is again of the form Ax + b, where A and b are uniformly randomly chosen matrices of sizes m×n and m×1, respectively. Both ℋ_𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓 and ℋ_𝗑𝗈𝗋 are 2-wise independent, but it is worth noting that ℋ_𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓 can be represented with Θ(n) bits while ℋ_𝗑𝗈𝗋 requires Θ(n²) bits of representation.

For every m ∈ {1, …, n}, the m-th prefix-slice of h, denoted h_m, is the map from {0,1}^n to {0,1}^m where h_m(y) is the first m bits of h(y). Observe that when h(x) = Ax + b, we have h_m(x) = A_m x + b_m, where A_m denotes the submatrix formed by the first m rows of A and b_m is the first m entries of the vector b.
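The prefix-slice property can be seen concretely in a small sketch that samples from ℋ_𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓 (illustrative code with our own names; a Toeplitz matrix is constant along its diagonals, so n + m − 1 random bits determine A):

```python
import random

def random_toeplitz_hash(n, m):
    """Sample h(x) = Ax + b from H_Toeplitz(n, m); A is an m x n Toeplitz
    matrix, defined by n + m - 1 random bits, and b is a random m-bit
    vector. x is given as a list of n bits."""
    diag = [random.getrandbits(1) for _ in range(n + m - 1)]
    A = [[diag[i - j + n - 1] for j in range(n)] for i in range(m)]
    b = [random.getrandbits(1) for _ in range(m)]

    def h(x, prefix=m):
        # the prefix-slice h_prefix uses only the first `prefix` rows
        return [(sum(A[i][j] * x[j] for j in range(n)) + b[i]) % 2
                for i in range(prefix)]
    return h

h = random_toeplitz_hash(8, 4)
x = [1, 0, 1, 1, 0, 0, 1, 0]
assert h(x, 2) == h(x)[:2]   # h_m(x) agrees with the first m bits of h(x)
```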

3. From Streaming to Counting

As a first step, we present a unified view of the three hashing-based algorithms proposed by Bar-Yossef et al. (Bar-Yossef et al., [n. d.]). The first algorithm is the 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀 algorithm discussed above, with the observation that instead of keeping the elements in the bucket, it suffices to keep their hashed values. Since in the context of model counting our primary concern is time complexity, we will focus on Gibbons and Tirthapura's 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀 algorithm (Gibbons and Tirthapura, 2001) rather than Bar-Yossef et al.'s modification. The second algorithm, which we call 𝖬𝗂𝗇𝗂𝗆𝗎𝗆, is based on the idea that if we hash all the items of the stream, then the O(1/ε²)-th minimum of the hash values can be used to compute a good estimate of F_0. The third algorithm, which we call 𝖤𝗌𝗍𝗂𝗆𝖺𝗍𝗂𝗈𝗇, chooses a set of k functions {h_1, h_2, …, h_k} such that each h_j is picked randomly from an O(log(1/ε))-independent hash family. For each hash function h_j, we say that h_j is not lonely if there exists a_i ∈ 𝐚 such that h_j(a_i) = 0. One can then estimate F_0 of 𝐚 by estimating the number of hash functions that are not lonely.
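A minimal sketch of the 𝖬𝗂𝗇𝗂𝗆𝗎𝗆 idea follows. This is illustrative only: a fixed cryptographic hash stands in for the pairwise independent family with range {0,1}^{3n} used later, the constants are arbitrary, and a full implementation would take a median over independent repetitions.

```python
import hashlib

def minimum_f0(stream, t=100, w=64):
    """Estimate F0 via the t-th smallest distinct hash value v:
    if hashes are (near-)uniform in [0, 2^w), then v ~ t * 2^w / F0,
    so F0 ~ t * 2^w / v."""
    def h(x):
        # pseudo-uniform 64-bit hash (stand-in for a pairwise family)
        return int.from_bytes(
            hashlib.sha256(str(x).encode()).digest()[:8], 'big')

    smallest = sorted({h(x) for x in stream})[:t]
    if len(smallest) < t:
        return len(smallest)        # fewer than t distinct items: exact
    return t * 2 ** w // smallest[-1]
```

The intuition: t uniform points below v partition [0, 2^w) at density t/v, and that density times 2^w recovers the number of distinct items.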

Algorithm 1, called 𝖢𝗈𝗆𝗉𝗎𝗍𝖾𝖥𝟢, presents the overarching architecture of the three proposed algorithms. Each of these algorithms first picks an appropriate set of hash functions H and initializes the sketch 𝒮. The architecture of 𝖢𝗈𝗆𝗉𝗎𝗍𝖾𝖥𝟢 is fairly simple: it chooses a collection of hash functions using 𝖢𝗁𝗈𝗈𝗌𝖾𝖧𝖺𝗌𝗁𝖥𝗎𝗇𝖼𝗍𝗂𝗈𝗇𝗌, calls the subroutine 𝖯𝗋𝗈𝖼𝖾𝗌𝗌𝖴𝗉𝖽𝖺𝗍𝖾 for every incoming element of the stream, and invokes 𝖢𝗈𝗆𝗉𝗎𝗍𝖾𝖤𝗌𝗍 at the end of the stream to return the F_0 approximation.

𝖢𝗁𝗈𝗈𝗌𝖾𝖧𝖺𝗌𝗁𝖥𝗎𝗇𝖼𝗍𝗂𝗈𝗇𝗌\mathsf{ChooseHashFunctions}

As shown in Algorithm 2, the hash functions depend on the strategy being implemented. The subroutine 𝖯𝗂𝖼𝗄𝖧𝖺𝗌𝗁𝖥𝗎𝗇𝖼𝗍𝗂𝗈𝗇𝗌(ℋ, t) returns a collection of t hash functions chosen independently from the family ℋ. We use H to denote the returned collection, viewed as either a 1-dimensional or a 2-dimensional array. When H is a 1-dimensional array, H[i] denotes the i-th hash function of the collection, and when H is a 2-dimensional array, H[i][j] is the [i,j]-th hash function.

Sketch Properties

For each of the three algorithms, the corresponding sketch can be viewed as an array of size 35 log(1/δ). The parameter 𝖳𝗁𝗋𝖾𝗌𝗁 is set to 96/ε².

𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀\mathsf{Bucketing}:

The element 𝒮[i] is a tuple ⟨ℓ_i, m_i⟩ where ℓ_i is a list of size at most 𝖳𝗁𝗋𝖾𝗌𝗁, with ℓ_i = {x ∈ 𝐚 | H[i]_{m_i}(x) = 0^{m_i}}. We use 𝒮[i](0) to denote ℓ_i and 𝒮[i](1) to denote m_i.

𝖬𝗂𝗇𝗂𝗆𝗎𝗆\mathsf{Minimum}:

The element 𝒮[i] holds the 𝖳𝗁𝗋𝖾𝗌𝗁 lexicographically smallest distinct elements of {H[i](x) | x ∈ 𝐚}.

𝖤𝗌𝗍𝗂𝗆𝖺𝗍𝗂𝗈𝗇\mathsf{Estimation}:

The element 𝒮[i] holds a tuple of size 𝖳𝗁𝗋𝖾𝗌𝗁. The j-th entry of this tuple is the largest number of trailing zeros in any element of H[i,j](𝐚).

𝖯𝗋𝗈𝖼𝖾𝗌𝗌𝖴𝗉𝖽𝖺𝗍𝖾\mathsf{ProcessUpdate}

For a new item x, the update of 𝒮, shown in Algorithm 3, is as follows:

Bucketing:

For a new item x, if H[i]_{m_i}(x) = 0^{m_i}, then we add x to 𝒮[i] if it is not already present. If the size of 𝒮[i] then exceeds 𝖳𝗁𝗋𝖾𝗌𝗁 (which is set to O(1/ε²)), we increment m_i as in line 8.

Minimum:

For a new item x, if H[i](x) is smaller than max 𝒮[i], then we replace max 𝒮[i] with H[i](x).

Estimation:

For a new item x, compute z = 𝖳𝗋𝖺𝗂𝗅𝖹𝖾𝗋𝗈(H[i,j](x)), i.e., the number of trailing zeros in H[i,j](x), and replace 𝒮[i,j] with z if z is larger than 𝒮[i,j].
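The 𝖳𝗋𝖺𝗂𝗅𝖹𝖾𝗋𝗈 subroutine admits a direct implementation (an illustrative helper; w, the bit-width of the hash value, is our parameter, and the all-zero value is treated as having w trailing zeros):

```python
def trail_zero(v, w=64):
    """Number of trailing zeros in the w-bit value v (w if v == 0)."""
    if v == 0:
        return w
    z = 0
    while v & 1 == 0:   # strip low-order zero bits one at a time
        v >>= 1
        z += 1
    return z

assert trail_zero(8) == 3   # 8 = 0b1000
assert trail_zero(7) == 0   # 7 = 0b111
```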

𝖢𝗈𝗆𝗉𝗎𝗍𝖾𝖤𝗌𝗍\mathsf{ComputeEst}

Finally, for each of the algorithms, we estimate F_0 based on the sketch 𝒮 as described in the subroutine 𝖢𝗈𝗆𝗉𝗎𝗍𝖾𝖤𝗌𝗍, presented as Algorithm 4. It is crucial to note that for the 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀 and 𝖬𝗂𝗇𝗂𝗆𝗎𝗆 algorithms, the estimation of F_0 is performed solely from the sketch 𝒮. The 𝖤𝗌𝗍𝗂𝗆𝖺𝗍𝗂𝗈𝗇 algorithm requires an additional parameter r that depends on a loose estimate of F_0; we defer the details to Section 3.4.

Algorithm 1 𝖢𝗈𝗆𝗉𝗎𝗍𝖾𝖥𝟢(n, ε, δ)
1: 𝖳𝗁𝗋𝖾𝗌𝗁 ← 96/ε²
2: t ← 35 log(1/δ)
3: H ← 𝖢𝗁𝗈𝗈𝗌𝖾𝖧𝖺𝗌𝗁𝖥𝗎𝗇𝖼𝗍𝗂𝗈𝗇𝗌(n, 𝖳𝗁𝗋𝖾𝗌𝗁, t)
4: 𝒮 ← {}
5: while true do
6:     if 𝖤𝗇𝖽𝖲𝗍𝗋𝖾𝖺𝗆 then exit
7:     x ← input()
8:     𝖯𝗋𝗈𝖼𝖾𝗌𝗌𝖴𝗉𝖽𝖺𝗍𝖾(𝒮, H, x, 𝖳𝗁𝗋𝖾𝗌𝗁)
9: Est ← 𝖢𝗈𝗆𝗉𝗎𝗍𝖾𝖤𝗌𝗍(𝒮, 𝖳𝗁𝗋𝖾𝗌𝗁)
10: return Est
Algorithm 2 𝖢𝗁𝗈𝗈𝗌𝖾𝖧𝖺𝗌𝗁𝖥𝗎𝗇𝖼𝗍𝗂𝗈𝗇𝗌(n, 𝖳𝗁𝗋𝖾𝗌𝗁, t)
1: switch AlgorithmType do
2:     case 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀
3:         H ← 𝖯𝗂𝖼𝗄𝖧𝖺𝗌𝗁𝖥𝗎𝗇𝖼𝗍𝗂𝗈𝗇𝗌(ℋ_𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n, n), t)
4:     case 𝖬𝗂𝗇𝗂𝗆𝗎𝗆
5:         H ← 𝖯𝗂𝖼𝗄𝖧𝖺𝗌𝗁𝖥𝗎𝗇𝖼𝗍𝗂𝗈𝗇𝗌(ℋ_𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n, 3n), t)
6:     case 𝖤𝗌𝗍𝗂𝗆𝖺𝗍𝗂𝗈𝗇
7:         s ← 10 log(1/ε)
8:         H ← 𝖯𝗂𝖼𝗄𝖧𝖺𝗌𝗁𝖥𝗎𝗇𝖼𝗍𝗂𝗈𝗇𝗌(ℋ_{s-wise}(n, n), t × 𝖳𝗁𝗋𝖾𝗌𝗁)
return H
Algorithm 3 𝖯𝗋𝗈𝖼𝖾𝗌𝗌𝖴𝗉𝖽𝖺𝗍𝖾(𝒮, H, x, 𝖳𝗁𝗋𝖾𝗌𝗁)
1: for i ∈ [1, |H|] do
2:     switch AlgorithmType do
3:         case 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀
4:             m_i = 𝒮[i](1)
5:             if H[i]_{m_i}(x) == 0^{m_i} then
6:                 𝒮[i](0) ← 𝒮[i](0) ∪ {x}
7:                 if size(𝒮[i](0)) > 𝖳𝗁𝗋𝖾𝗌𝗁 then
8:                     𝒮[i](1) ← 𝒮[i](1) + 1
9:                     for y ∈ 𝒮[i](0) do
10:                         if H[i]_{m_i+1}(y) ≠ 0^{m_i+1} then
11:                             Remove(𝒮[i](0), y)
12:         case 𝖬𝗂𝗇𝗂𝗆𝗎𝗆
13:             if size(𝒮[i]) < 𝖳𝗁𝗋𝖾𝗌𝗁 then
14:                 𝒮[i].Append(H[i](x))
15:             else
16:                 j ← argmax(𝒮[i])
17:                 if 𝒮[i](j) > H[i](x) then
18:                     𝒮[i](j) ← H[i](x)
19:         case 𝖤𝗌𝗍𝗂𝗆𝖺𝗍𝗂𝗈𝗇
20:             for j ∈ [1, 𝖳𝗁𝗋𝖾𝗌𝗁] do
21:                 𝒮[i,j] ← max(𝒮[i,j], 𝖳𝗋𝖺𝗂𝗅𝖹𝖾𝗋𝗈(H[i,j](x)))
22: return 𝒮
Algorithm 4 𝖢𝗈𝗆𝗉𝗎𝗍𝖾𝖤𝗌𝗍(𝒮, 𝖳𝗁𝗋𝖾𝗌𝗁)
1: switch AlgorithmType do
2:     case 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀
3:         return Median({size(𝒮[i](0)) × 2^{𝒮[i](1)}}_i)
4:     case 𝖬𝗂𝗇𝗂𝗆𝗎𝗆
5:         return Median({(𝖳𝗁𝗋𝖾𝗌𝗁 × 2^m) / max{𝒮[i]}}_i)
6:     case 𝖤𝗌𝗍𝗂𝗆𝖺𝗍𝗂𝗈𝗇(r)
7:         return Median({ ln(1 − (1/𝖳𝗁𝗋𝖾𝗌𝗁) · Σ_{j=1}^{𝖳𝗁𝗋𝖾𝗌𝗁} 1[𝒮[i,j] ≥ r]) / ln(1 − 2^{−r}) }_i)

3.1. A Recipe For Transformation

Observe that for each of the algorithms, the final F_0 estimate depends only on the sketch 𝒮. Therefore, if for two streams 𝐚 and 𝐚̂ the corresponding sketches, say 𝒮 and 𝒮̂ respectively, are equivalent, then the three schemes presented above return the same estimates. The recipe for transforming streaming algorithms into model counting algorithms is based on the following insight:

  1. (1)

    Capture the relationship 𝒫(𝒮,H,𝐚u)\mathcal{P}(\mathcal{S},H,\mathbf{a}_{u}) between the sketch 𝒮\mathcal{S}, the set of hash functions HH, and the set 𝐚u\mathbf{a}_{u} at the end of the stream. Recall that 𝐚u\mathbf{a}_{u} is the set of all distinct elements of the stream 𝐚\mathbf{a}.

  2. (2)

    The formula φ\varphi is viewed as a symbolic representation of the unique set 𝐚u\mathbf{a}_{u} represented by the stream 𝐚\mathbf{a} such that 𝖲𝗈𝗅(φ)=𝐚u\mathsf{Sol}(\varphi)=\mathbf{a}_{u}.

  3. (3)

    Given a formula φ\varphi and a set of hash functions HH, design an algorithm to construct a sketch 𝒮\mathcal{S} such that 𝒫(𝒮,H,𝖲𝗈𝗅(φ))\mathcal{P}(\mathcal{S},H,\mathsf{Sol}(\varphi)) holds. We can then estimate |𝖲𝗈𝗅(φ)||\mathsf{Sol}(\varphi)| from 𝒮\mathcal{S}.

In the rest of this section, we apply the above recipe to the three types of F0F_{0} estimation algorithms and derive the corresponding model counting algorithms. In particular, we show how applying the recipe to the 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀\mathsf{Bucketing} algorithm leads us to reproduce the state-of-the-art hashing-based model counting algorithm, 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC}, proposed by Chakraborty et al. (Chakraborty et al., 2016). Applying the recipe to 𝖬𝗂𝗇𝗂𝗆𝗎𝗆\mathsf{Minimum} and 𝖤𝗌𝗍𝗂𝗆𝖺𝗍𝗂𝗈𝗇\mathsf{Estimation} yields fundamentally different schemes. In particular, we observe that while the model counting algorithms based on 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀\mathsf{Bucketing} and 𝖬𝗂𝗇𝗂𝗆𝗎𝗆\mathsf{Minimum} provide FPRAS's when φ\varphi is a DNF formula, such is not the case for the algorithm derived from 𝖤𝗌𝗍𝗂𝗆𝖺𝗍𝗂𝗈𝗇\mathsf{Estimation}.

3.2. Bucketing-based Algorithm

The 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀\mathsf{Bucketing} algorithm chooses a set HH of pairwise independent hash functions and maintains a sketch 𝒮\mathcal{S} that we describe next. Here we use 𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓\mathcal{H}_{\mathsf{Toeplitz}} as our choice of pairwise independent hash family. The sketch 𝒮\mathcal{S} is an array where each 𝒮[i]\mathcal{S}[i] is of the form ci,mi\langle c_{i},m_{i}\rangle. We say that the relation 𝒫1(𝒮,H,𝐚u)\mathcal{P}_{1}(\mathcal{S},H,\mathbf{a}_{u}) holds if

  1. (1)

    |𝐚u{x|H[i]mi1(x)=0mi1}|96ε2|\mathbf{a}_{u}\cap\{x~{}|~{}H[i]_{m_{i}-1}(x)=0^{m_{i}-1}\}|\geq\frac{96}{\varepsilon^{2}}

  2. (2)

    ci=|𝐚u{x|H[i]mi(x)=0mi}|<96ε2c_{i}=|\mathbf{a}_{u}\cap\{x~{}|~{}H[i]_{m_{i}}(x)=0^{m_{i}}\}|<\frac{96}{\varepsilon^{2}}

The following lemma due to Bar-Yossef et al. (Bar-Yossef et al., [n. d.]) and Gibbons and Tirthapura (Gibbons and Tirthapura, 2001) captures the relationship among the sketch 𝒮\mathcal{S}, the relation 𝒫1\mathcal{P}_{1} and the number of distinct elements of a multiset.

Lemma 0.

(Bar-Yossef et al., [n. d.]; Gibbons and Tirthapura, 2001) Let 𝐚{0,1}n\mathbf{a}\subseteq\{0,1\}^{n} be a multiset and H𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n,n)H\subseteq\mathcal{H}_{\mathsf{Toeplitz}}(n,n) where each H[i]H[i] is drawn independently and |H|=O(log1/δ)|H|=O(\log 1/\delta), and let 𝒮\mathcal{S} be such that 𝒫1(𝒮,H,𝐚u)\mathcal{P}_{1}(\mathcal{S},H,\mathbf{a}_{u}) holds. Let c=Median {ci×2mi}ic=\mbox{\rm Median }\{c_{i}\times 2^{m_{i}}\}_{i}. Then

Pr[|𝐚u|(1+ε)c(1+ε)|𝐚u|]1δ.\Pr\left[\frac{|\mathbf{a}_{u}|}{(1+\varepsilon)}\leq c\leq(1+\varepsilon)|\mathbf{a}_{u}|\right]\geq 1-\delta.
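To make the role of the sketch concrete, the following is a minimal Python sketch (our illustration, not from the paper) of how the estimate in Lemma 1 is assembled from the per-repetition entries ci,mi\langle c_{i},m_{i}\rangle; the helper name bucketing_estimate is ours.

```python
from statistics import median

def bucketing_estimate(sketch):
    """Combine sketch entries (c_i, m_i) into c = median_i { c_i * 2**m_i }.

    Each c_i counts the distinct elements falling in the cell
    H[i]_{m_i}(x) = 0^{m_i}, so c_i * 2**m_i estimates |a_u|.
    """
    return median(c * 2 ** m for c, m in sketch)

# Three repetitions whose cells stopped at depths m_i = 3, 3, 2.
print(bucketing_estimate([(24, 3), (20, 3), (30, 2)]))  # -> 160
```

Taking the median over O(log 1/δ) independent repetitions is what boosts the per-repetition constant success probability to 1 - δ.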

To design an algorithm for model counting, based on the bucketing strategy, we turn to the subroutine introduced by Chakraborty, Meel, and Vardi: 𝖡𝗈𝗎𝗇𝖽𝖾𝖽𝖲𝖠𝖳\mathsf{BoundedSAT}, whose properties are formalized as follows:

Proposition 0.

(Chakraborty et al., 2013b, 2016) There is an algorithm 𝖡𝗈𝗎𝗇𝖽𝖾𝖽𝖲𝖠𝖳\mathsf{BoundedSAT} that takes a formula φ\varphi over nn variables, a hash function h𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n,m)h\in\mathcal{H}_{\mathsf{Toeplitz}}(n,m), and a number pp as input, and returns min(p,|𝖲𝗈𝗅(φh(x)=0m)|)\min(p,|\mathsf{Sol}(\varphi\wedge h(x)={0^{m}})|). If φ\varphi is a CNF formula, then 𝖡𝗈𝗎𝗇𝖽𝖾𝖽𝖲𝖠𝖳\mathsf{BoundedSAT} makes 𝒪(p)\mathcal{O}(p) calls to an NP oracle. If φ\varphi is a DNF formula with kk terms, then 𝖡𝗈𝗎𝗇𝖽𝖾𝖽𝖲𝖠𝖳\mathsf{BoundedSAT} takes 𝒪(n3kp)\mathcal{O}(n^{3}\cdot k\cdot p) time.

Equipped with Proposition 2, we now turn to designing an algorithm for model counting based on the Bucketing strategy. The algorithm proceeds in a similar fashion to its streaming counterpart: mim_{i} is iteratively incremented until the number of solutions of the formula φH[i]mi(x)=0mi\varphi\wedge H[i]_{m_{i}}(x)=0^{m_{i}} is less than 𝖳𝗁𝗋𝖾𝗌𝗁\mathsf{Thresh}. Interestingly, an approximate model counting algorithm based on the bucketing strategy, called 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC}, was discovered independently by Chakraborty et al. (Chakraborty et al., 2013b) in 2013. We reproduce an adaptation of 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC} in Algorithm 5 to showcase how 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC} can be viewed as a transformation of the 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀\mathsf{Bucketing} algorithm. In the spirit of 𝖡𝗎𝖼𝗄𝖾𝗍𝗂𝗇𝗀\mathsf{Bucketing}, 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC} seeks to construct a sketch 𝒮\mathcal{S} of size t𝒪(log(1/δ))t\in\mathcal{O}(\log(1/\delta)). To this end, in every iteration ii of the loop, we continue to increment the value of mim_{i} until the conditions specified by the relation 𝒫1(𝒮,H,𝖲𝗈𝗅(φ))\mathcal{P}_{1}(\mathcal{S},H,\mathsf{Sol}(\varphi)) are met. For every iteration ii, the estimate of the model count is ci×2mic_{i}\times 2^{m_{i}}. The final estimate of the model count is simply the median of the estimates over all iterations. Since, in the context of model counting, we are concerned with time complexity, both 𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓\mathcal{H}_{\mathsf{Toeplitz}} and 𝗑𝗈𝗋\mathcal{H}_{\mathsf{xor}} lead to the same time complexity. Furthermore, Chakraborty et al. (Chakraborty et al., 2013a) observed no difference in empirical runtime behavior between 𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓\mathcal{H}_{\mathsf{Toeplitz}} and 𝗑𝗈𝗋\mathcal{H}_{\mathsf{xor}}.

The following theorem establishes the correctness of 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC}, and the proof follows from Lemma 1 and Proposition 2.

Theorem 3.

Given a formula φ\varphi, ε\varepsilon, and δ\delta, 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC} returns EstEst such that Pr[|𝖲𝗈𝗅(φ)|1+εEst(1+ε)|𝖲𝗈𝗅(φ)|]1δ\Pr[\frac{|\mathsf{Sol}(\varphi)|}{1+\varepsilon}\leq Est\leq(1+\varepsilon)|\mathsf{Sol}(\varphi)|]\geq 1-\delta. If φ\varphi is a CNF formula, then this algorithm makes 𝒪(n1ε2log(1/δ))\mathcal{O}(n\cdot\frac{1}{\varepsilon^{2}}\log(1/\delta)) calls to an NP oracle. If φ\varphi is a DNF formula, then 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC} is an FPRAS. In particular, for a DNF formula with kk terms, 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC} takes 𝒪(n4k1ε2log(1/δ))\mathcal{O}(n^{4}\cdot k\cdot\frac{1}{\varepsilon^{2}}\cdot\log(1/\delta)) time.

Algorithm 5 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢(φ,ε,δ)\mathsf{ApproxMC}(\varphi,\varepsilon,\delta)
1:t35log(1δ)t\leftarrow 35\log(\frac{1}{\delta})
2:H𝖯𝗂𝖼𝗄𝖧𝖺𝗌𝗁𝖥𝗎𝗇𝖼𝗍𝗂𝗈𝗇𝗌(𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n,n),t)H\leftarrow\mathsf{PickHashFunctions}(\mathcal{H}_{\mathsf{Toeplitz}}(n,n),t)
3:𝒮{}\mathcal{S}\leftarrow\{\};
4:𝖳𝗁𝗋𝖾𝗌𝗁96ε2\mathsf{Thresh}\leftarrow\frac{96}{\varepsilon^{2}}
5:for i[1,t]i\in[1,t] do
6:     mi0m_{i}\leftarrow 0
7:     ci𝖡𝗈𝗎𝗇𝖽𝖾𝖽𝖲𝖠𝖳(φ,H[i]|mi,𝖳𝗁𝗋𝖾𝗌𝗁)c_{i}\leftarrow\mathsf{BoundedSAT}(\varphi,H[i]|_{m_{i}},\mathsf{Thresh})
8:     while ci𝖳𝗁𝗋𝖾𝗌𝗁c_{i}\geq\mathsf{Thresh}  do
9:         mimi+1m_{i}\leftarrow m_{i}+1
10:         ci𝖡𝗈𝗎𝗇𝖽𝖾𝖽𝖲𝖠𝖳(φ,H[i]|mi(x),𝖳𝗁𝗋𝖾𝗌𝗁)c_{i}\leftarrow\mathsf{BoundedSAT}(\varphi,H[i]|_{m_{i}}(x),\mathsf{Thresh})      
11:     𝒮[i](ci,mi)\mathcal{S}[i]\leftarrow(c_{i},m_{i})
12:EstMedian({𝒮[i](0)×2𝒮[i](1)}i)Est\leftarrow Median(\{\mathcal{S}[i](0)\times 2^{\mathcal{S}[i](1)}\}_{i})
13:return EstEst

Further Optimizations

We now discuss how the setting of model counting allows for further optimizations. Observe that for all ii, 𝖲𝗈𝗅(φ(H[i]mi)(x)=0mi)𝖲𝗈𝗅(φ(H[i]mi1)(x)=0mi1)\mathsf{Sol}(\varphi\wedge(H[i]_{m_{i}})(x)=0^{m_{i}})\subseteq\mathsf{Sol}(\varphi\wedge(H[i]_{m_{i}-1})(x)=0^{m_{i}-1}), i.e., the solution sets shrink as mim_{i} grows. Since we are interested in finding the value of mim_{i} such that |𝖲𝗈𝗅(φ(H[i]mi1)(x)=0mi1)|96ε2|\mathsf{Sol}(\varphi\wedge(H[i]_{m_{i}-1})(x)=0^{m_{i}-1})|\geq\frac{96}{\varepsilon^{2}} and |𝖲𝗈𝗅(φ(H[i]mi)(x)=0mi)|<96ε2|\mathsf{Sol}(\varphi\wedge(H[i]_{m_{i}})(x)=0^{m_{i}})|<\frac{96}{\varepsilon^{2}}, we can perform a binary search for mim_{i} instead of the linear search performed in lines 8–10. Indeed, this observation was at the core of Chakraborty et al.'s follow-up work (Chakraborty et al., 2016), which proposed 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢𝟤\mathsf{ApproxMC2}, thereby reducing the number of calls to the NP oracle from 𝒪(n1ε2log(1/δ))\mathcal{O}(n\cdot\frac{1}{\varepsilon^{2}}\log(1/\delta)) to 𝒪(logn1ε2log(1/δ))\mathcal{O}(\log n\cdot\frac{1}{\varepsilon^{2}}\log(1/\delta)). Furthermore, the reduction in NP oracle calls led to significant runtime improvement in practice. It is worth commenting that the usage of 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢𝟤\mathsf{ApproxMC2} as an FPRAS for DNF has been shown to achieve runtime efficiency over the alternatives based on Monte Carlo methods (Meel et al., 2017, 2018, 2019).
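The binary-search optimization can be sketched as follows (an illustrative Python fragment of our own; bounded_count stands in for the capped 𝖡𝗈𝗎𝗇𝖽𝖾𝖽𝖲𝖠𝖳 call). Soundness rests on the monotonicity just noted: adding hash constraints only shrinks the solution set.

```python
def find_m_binary(bounded_count, n, thresh):
    """Smallest m in [0, n] with bounded_count(m) < thresh.

    bounded_count(m) plays the role of BoundedSAT(phi, H[i]|_m, thresh);
    it is nonincreasing in m because each extra hash constraint only
    shrinks the solution set.  Uses O(log n) calls instead of O(n).
    """
    lo, hi = 0, n  # assumes bounded_count(n) < thresh
    while lo < hi:
        mid = (lo + hi) // 2
        if bounded_count(mid) < thresh:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Simulated (nonincreasing) cell sizes for m = 0..8 with thresh = 64.
counts = [200, 120, 70, 33, 18, 9, 5, 2, 1]
print(find_m_binary(lambda m: counts[m], 8, 64))  # -> 3
```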

3.3. Minimum-based Algorithm

For a given multiset 𝐚\mathbf{a} (e.g., a data stream or the set of solutions to a formula), we now specify the property 𝒫2(𝒮,H,𝐚u)\mathcal{P}_{2}(\mathcal{S},H,\mathbf{a}_{u}). The sketch 𝒮\mathcal{S} is an array of sets indexed by members of HH, where 𝒮[i]\mathcal{S}[i] holds the pp lexicographically minimum elements of H[i](𝐚u)H[i](\mathbf{a}_{u}), for p=min(96ε2,|𝐚u|)p=\min(\frac{96}{\varepsilon^{2}},|\mathbf{a}_{u}|). More formally, the relationship 𝒫2\mathcal{P}_{2} holds if the following conditions are met.

  1. (1)

    i,|𝒮[i]|=min(96ε2,|𝐚u|)\forall i,|\mathcal{S}[i]|=\min(\frac{96}{\varepsilon^{2}},|\mathbf{a}_{u}|)

  2. (2)

    i,y𝒮[i],y𝒮[i] it holds that H[i](y)H[i](y)\forall i,\forall y\notin\mathcal{S}[i],\forall y^{\prime}\in\mathcal{S}[i]\text{ it holds that }H[i](y^{\prime})\preceq H[i](y)

The following lemma due to Bar-Yossef et al. (Bar-Yossef et al., [n. d.]) establishes the relationship between the property 𝒫2\mathcal{P}_{2} and the number of distinct elements of a multiset. Let max(Si)\max(S_{i}) denote the largest element of the set SiS_{i}.

Lemma 0.

(Bar-Yossef et al., [n. d.]) Let 𝐚{0,1}n\mathbf{a}\subseteq\{0,1\}^{n} be a multiset and H𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n,m)H\subseteq\mathcal{H}_{\mathsf{Toeplitz}}(n,m) where m=3nm=3n, each H[i]H[i] is drawn independently, and |H|=O(log1/δ)|H|=O(\log 1/\delta), and let 𝒮\mathcal{S} be such that 𝒫2(𝒮,H,𝐚u)\mathcal{P}_{2}(\mathcal{S},H,\mathbf{a}_{u}) holds. Let c=Median {p2mmax(S[i])}ic=\mbox{\rm Median }\{{p\cdot 2^{m}\over\max(S[i])}\}_{i}. Then

Pr[|𝐚u|(1+ε)c(1+ε)|𝐚u|]1δ.\Pr\left[\frac{|\mathbf{a}_{u}|}{(1+\varepsilon)}\leq c\leq(1+\varepsilon)|\mathbf{a}_{u}|\right]\geq 1-\delta.
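As a concrete illustration (a Python sketch of our own, not part of the paper), the estimate of Lemma 4 can be computed from the stored minima as follows, treating each mm-bit hash value as an integer:

```python
from statistics import median

def minimum_estimate(sketches, p, m):
    """c = median_i { p * 2**m / max(S[i]) }, where each S[i] holds the p
    lexicographically smallest m-bit hash values (here: integers).

    Intuition: if |a_u| elements hash uniformly into [0, 2**m), the p-th
    smallest hash value is around p * 2**m / |a_u|.
    """
    return median(p * 2 ** m / max(s) for s in sketches)

# p = 4 smallest hash values per repetition, m = 8 bit hashes.
print(minimum_estimate([[3, 7, 10, 16], [2, 5, 8, 32], [1, 6, 9, 64]], 4, 8))
# -> 32.0
```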

Therefore, we can transform the 𝖬𝗂𝗇𝗂𝗆𝗎𝗆\mathsf{Minimum} algorithm for F0F_{0} estimation to that of model counting given access to a subroutine that can compute 𝒮\mathcal{S} such that 𝒫2(𝒮,H,𝖲𝗈𝗅(φ))\mathcal{P}_{2}(\mathcal{S},H,\mathsf{Sol}(\varphi)) holds true. The following proposition establishes the existence and complexity of such a subroutine, called 𝖥𝗂𝗇𝖽𝖬𝗂𝗇\mathsf{FindMin}:

Proposition 0.

There is an algorithm 𝖥𝗂𝗇𝖽𝖬𝗂𝗇\mathsf{FindMin} that, given φ\varphi over nn variables, h𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n,m)h\in\mathcal{H}_{\mathsf{Toeplitz}}(n,m), and pp as input, returns a set h(𝖲𝗈𝗅(φ))\mathcal{B}\subseteq h(\mathsf{Sol}(\varphi)) such that if |h(𝖲𝗈𝗅(φ))|p|h(\mathsf{Sol}(\varphi))|\leq p, then =h(𝖲𝗈𝗅(φ))\mathcal{B}=h(\mathsf{Sol}(\varphi)); otherwise \mathcal{B} consists of the pp lexicographically minimum elements of h(𝖲𝗈𝗅(φ))h(\mathsf{Sol}(\varphi)). Moreover, if φ\varphi is a CNF formula, then 𝖥𝗂𝗇𝖽𝖬𝗂𝗇\mathsf{FindMin} makes 𝒪(pm)\mathcal{O}(p\cdot m) calls to an NP oracle, and if φ\varphi is a DNF formula with kk terms, then 𝖥𝗂𝗇𝖽𝖬𝗂𝗇\mathsf{FindMin} takes 𝒪(m3nkp)\mathcal{O}(m^{3}\cdot n\cdot k\cdot p) time.

Equipped with Proposition 5, we are now ready to present the algorithm, called 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖬𝗂𝗇\mathsf{ApproxModelCountMin}, for model counting. Since the complexity of 𝖥𝗂𝗇𝖽𝖬𝗂𝗇\mathsf{FindMin} is PTIME when φ\varphi is in DNF, 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖬𝗂𝗇\mathsf{ApproxModelCountMin} is an FPRAS for DNF formulas.

Theorem 6.

Given φ\varphi, ε\varepsilon, and δ\delta, 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖬𝗂𝗇\mathsf{ApproxModelCountMin} returns EstEst such that

Pr[|𝖲𝗈𝗅(φ)|1+εEst(1+ε)|𝖲𝗈𝗅(φ)|]1δ.\Pr\left[\frac{|\mathsf{Sol}(\varphi)|}{1+\varepsilon}\leq Est\leq(1+\varepsilon)|\mathsf{Sol}(\varphi)|\right]\geq 1-\delta.

If φ\varphi is a CNF formula, then 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖬𝗂𝗇\mathsf{ApproxModelCountMin} is a polynomial-time algorithm that makes 𝒪(1ε2nlog(1δ))\mathcal{O}(\frac{1}{\varepsilon^{2}}n\log(\frac{1}{\delta})) calls to NP oracle. If φ\varphi is a DNF formula, then 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖬𝗂𝗇\mathsf{ApproxModelCountMin} is an FPRAS.

Algorithm 6 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖬𝗂𝗇(φ,ε,δ)\mathsf{ApproxModelCountMin}(\varphi,\varepsilon,\delta)
1:t35log(1/δ)t\leftarrow 35\log(1/\delta)
2:H𝖯𝗂𝖼𝗄𝖧𝖺𝗌𝗁𝖥𝗎𝗇𝖼𝗍𝗂𝗈𝗇𝗌(𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n,3n),t)H\leftarrow\mathsf{PickHashFunctions}(\mathcal{H}_{\mathsf{Toeplitz}}(n,3n),t)
3:S{}S\leftarrow\{\}
4:𝖳𝗁𝗋𝖾𝗌𝗁96ε2\mathsf{Thresh}\leftarrow\frac{96}{\varepsilon^{2}}
5:for i[1,t]i\in[1,t] do
6:     S[i]𝖥𝗂𝗇𝖽𝖬𝗂𝗇(φ,H[i],𝖳𝗁𝗋𝖾𝗌𝗁)S[i]\leftarrow\mathsf{FindMin}(\varphi,H[i],\mathsf{Thresh})
7:EstMedian({𝖳𝗁𝗋𝖾𝗌𝗁×23nmax{S[i]}}i)Est\leftarrow{\rm Median}\left(\left\{\frac{\mathsf{Thresh}\times 2^{3n}}{\max\{S[i]\}}\right\}_{i}\right)
8:return EstEst

Implementing the Min-based Algorithm

We now give a proof of Proposition 5 by giving an implementation of 𝖥𝗂𝗇𝖽𝖬𝗂𝗇\mathsf{FindMin} subroutine.

Proof.

We first present the algorithm for the case when the formula φ\varphi is a DNF formula. Adapting the algorithm to the case of CNF can be done using similar ideas.

Let φ=T1T2Tk\varphi=T_{1}\vee T_{2}\vee\ \cdots\vee T_{k} be a DNF formula over nn variables where each TiT_{i} is a term. Let h:{0,1}n{0,1}mh:\{0,1\}^{n}\rightarrow\{0,1\}^{m} be a linear hash function in xor(n,m)\mathcal{H}_{xor}(n,m) defined by an m×nm\times n binary matrix AA. Let 𝒞\mathcal{C} be the set of hashed values of the satisfying assignments for φ\varphi: 𝒞={h(x)xφ}{0,1}m\mathcal{C}=\{h(x)\mid x\models\varphi\}\subseteq\{0,1\}^{m}. Let 𝒞p\mathcal{C}_{p} be the first pp elements of 𝒞\mathcal{C} in lexicographic order. Our goal is to compute 𝒞p\mathcal{C}_{p}.

We will give an algorithm with running time O(m3np)O(m^{3}np) to compute 𝒞p\mathcal{C}_{p} when the formula is a single term TT. Using this algorithm, we can compute 𝒞p\mathcal{C}_{p} for a formula with kk terms by iteratively merging the 𝒞p\mathcal{C}_{p}'s computed for each term. The time complexity increases by a factor of kk, resulting in an O(m3nkp)O(m^{3}nkp) time algorithm.

Let TT be a term with width ww (the number of literals) and 𝒞={AxxT}\mathcal{C}=\{Ax\mid x\models T\}. By fixing the variables in TT we get a vector bTb_{T} and an m×(nw)m\times(n-w) matrix ATA_{T} so that 𝒞={ATx+bTx{0,1}(nw)}\mathcal{C}=\{A_{T}x+b_{T}\mid x\in\{0,1\}^{(n-w)}\}. Both ATA_{T} and bTb_{T} can be computed from AA and TT in linear time. Let hT(x)h_{T}(x) denote the transformation ATx+bTA_{T}x+b_{T}.

We will compute 𝒞p\mathcal{C}_{p} (the pp lexicographically minimum elements of 𝒞\mathcal{C}) iteratively as follows: assuming we have computed the (q1)th(q-1)^{th} minimum of 𝒞\mathcal{C}, we compute the qthq^{th} minimum using a prefix-searching strategy. We use a subroutine to solve the following basic prefix-searching primitive: given an ll-bit string y1yly_{1}\ldots y_{l}, is there an x{0,1}nwx\in\{0,1\}^{n-w} so that y1yly_{1}\ldots y_{l} is a prefix of some string in {hT(x)}\{h_{T}(x)\}? This task can be performed using Gaussian elimination over an (l+1)×(nw)(l+1)\times(n-w) binary matrix and can be implemented in time O(l2(nw))O(l^{2}(n-w)).

Let y=y1ymy=y_{1}\ldots y_{m} be the (q1)th(q-1)^{th} minimum in 𝒞\mathcal{C}. Let r1r_{1} be the position of the rightmost 0 in yy. Then using the above-mentioned procedure we can find the lexicographically smallest string in the range of hTh_{T} that extends y1yr111y_{1}\ldots y_{r_{1}-1}1, if it exists. If no such string exists in 𝒞\mathcal{C}, we find the position of the next 0 in yy and repeat the procedure. In this manner the qthq^{th} minimum can be computed using O(m)O(m) calls to the prefix-searching primitive, resulting in an O(m3n)O(m^{3}n) time algorithm. Invoking the above procedure pp times results in an algorithm to compute 𝒞p\mathcal{C}_{p} in O(m3np)O(m^{3}np) time.

If φ\varphi is a CNF formula, we can employ the same prefix-searching strategy. Consider the following NP oracle: O={φ,h,y,yx,y′′, so that xφ,yy′′>y,h(x)=yy′′}O=\{\langle\varphi,h,y,y^{\prime}\rangle\mid\exists x,\exists y^{\prime\prime},\mbox{ so that }x\models\varphi,y^{\prime}y^{\prime\prime}>y,h(x)=y^{\prime}y^{\prime\prime}\}. With mm calls to OO, we can compute the smallest string in 𝒞\mathcal{C} that is lexicographically greater than yy. So with pmp\cdot m calls to OO, we can compute 𝒞p\mathcal{C}_{p}. ∎
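The prefix-searching primitive in the proof above reduces to solving a linear system over GF(2). The following Python sketch is our own illustration (the function names are ours): it checks, via Gaussian elimination on the first ll rows, whether a prefix y1yly_{1}\ldots y_{l} can be extended to an element of {ATx+bT}\{A_{T}x+b_{T}\}.

```python
def gf2_solvable(rows, rhs):
    """Is there x over GF(2) with rows[r] . x = rhs[r] (mod 2) for all r?
    Plain Gaussian elimination; rows are 0/1 coefficient lists."""
    rows = [r[:] for r in rows]
    rhs = rhs[:]
    ncols = len(rows[0]) if rows else 0
    pivot = 0
    for col in range(ncols):
        sel = next((r for r in range(pivot, len(rows)) if rows[r][col]), None)
        if sel is None:
            continue
        rows[pivot], rows[sel] = rows[sel], rows[pivot]
        rhs[pivot], rhs[sel] = rhs[sel], rhs[pivot]
        for r in range(len(rows)):
            if r != pivot and rows[r][col]:
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[pivot])]
                rhs[r] ^= rhs[pivot]
        pivot += 1
    # Inconsistent iff an all-zero row is forced to equal 1.
    return all(any(row) or b == 0 for row, b in zip(rows, rhs))

def prefix_feasible(A_T, b_T, prefix):
    """Can prefix y_1...y_l be extended to some string in {A_T x + b_T}?
    Constrains only the first l output coordinates: A_T[r] . x = y_r ^ b_r."""
    l = len(prefix)
    return gf2_solvable(A_T[:l], [y ^ b for y, b in zip(prefix, b_T[:l])])

# h_T(x) = A_T x + b_T over GF(2); its range is {010, 111, 001, 100}.
A_T, b_T = [[1, 0], [0, 1], [1, 1]], [0, 1, 0]
print(prefix_feasible(A_T, b_T, [0, 1, 1]))  # -> False (011 not in range)
print(prefix_feasible(A_T, b_T, [1, 1, 1]))  # -> True
```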

Further Optimizations

As mentioned in Section 1, the problem of model counting has witnessed significant interest from practitioners owing to its practical applications, and the recent developments have been fueled by breakthrough progress in SAT solving, wherein calls to NP oracles are replaced by invocations of a SAT solver in practice. Motivated by this progress, there has been significant interest in the design of efficient algorithmic frameworks for related problems such as MaxSAT and its variants. State-of-the-art MaxSAT solvers, based on sophisticated strategies such as implicit hitting sets, have been shown to significantly outperform algorithms based on merely invoking a SAT solver iteratively. Of particular interest to us is the recent progress in the design of MaxSAT solvers that handle lexicographic objective functions. In this context, it is worth remarking that we expect a practical implementation of 𝖥𝗂𝗇𝖽𝖬𝗂𝗇\mathsf{FindMin} would invoke a MaxSAT solver 𝒪(p)\mathcal{O}(p) times.

3.4. Estimation-based Algorithm

We now adapt the 𝖤𝗌𝗍𝗂𝗆𝖺𝗍𝗂𝗈𝗇\mathsf{Estimation} algorithm to model counting. For a given stream 𝐚\mathbf{a} and chosen hash functions HH, the sketch 𝒮\mathcal{S} corresponding to the estimation-based algorithm satisfies the following relation 𝒫3(𝒮,H,𝐚u)\mathcal{P}_{3}(\mathcal{S},H,\mathbf{a}_{u}):

(2) 𝒫3(𝒮,H,𝐚u):=(S[i,j]=maxx𝐚u𝖳𝗋𝖺𝗂𝗅𝖹𝖾𝗋𝗈(H[i,j](x)))\displaystyle\mathcal{P}_{3}(\mathcal{S},H,\mathbf{a}_{u}):=\left(S[i,j]=\max_{x\in\mathbf{a}_{u}}\mathsf{TrailZero}(H[i,j](x))\right)

where the procedure 𝖳𝗋𝖺𝗂𝗅𝖹𝖾𝗋𝗈(z)\mathsf{TrailZero}(z) returns the length of the longest all-zero suffix of zz. Bar-Yossef et al. (Bar-Yossef et al., [n. d.]) show the following relationship between the property 𝒫3\mathcal{P}_{3} and F0F_{0}.

Lemma 0.

(Bar-Yossef et al., [n. d.]) Let 𝐚{0,1}n\mathbf{a}\subseteq\{0,1\}^{n} be a multiset. For i[T]i\in[T] and j[M]j\in[M], suppose H[i,j]H[i,j] is drawn independently from swise(n,n)\mathcal{H}_{s-{\rm wise}}(n,n) where s=O(log(1/ε))s=O(\log(1/\varepsilon)), T=O(log(1/δ))T=O(\log(1/\delta)), and M=O(1/ε2)M=O(1/\varepsilon^{2}). Let HH denote the collection of these hash functions. Suppose 𝒮\mathcal{S} satisfies 𝒫3(𝒮,H,𝐚u)\mathcal{P}_{3}(\mathcal{S},H,\mathbf{a}_{u}). For any integer rr, define:

cr=Median{ln(11Mj=1M𝟙{𝒮[i,j]r})ln(12r)}ic_{r}=\mathrm{Median}\left\{\frac{\ln\left(1-\frac{1}{M}\sum_{j=1}^{M}\mathbb{1}\{\mathcal{S}[i,j]\geq r\}\right)}{\ln(1-2^{-r})}\right\}_{i}

Then, if 2F02r50F02F_{0}\leq 2^{r}\leq 50F_{0}:

Pr[(1ε)F0cr(1+ε)F0]1δ\Pr\left[(1-\varepsilon)F_{0}\leq c_{r}\leq(1+\varepsilon)F_{0}\right]\geq 1-\delta

Following the recipe outlined above, we can transform an F0F_{0} streaming algorithm into a model counting algorithm by designing a subroutine that computes the sketch for the set of all solutions described by φ\varphi and a subroutine to find rr. The following proposition achieves the first objective for CNF formulas using a small number of calls to an NP oracle:

Proposition 0.

There is an algorithm 𝖥𝗂𝗇𝖽𝖬𝖺𝗑𝖱𝖺𝗇𝗀𝖾\mathsf{FindMaxRange} that given φ\varphi over nn variables and hash function hswise(n,n)h\in\mathcal{H}_{s-{\rm wise}}(n,n), returns tt such that

  1. (1)

    z,zφ\exists z,z\models\varphi and h(z)h(z) has tt least significant bits equal to zero.

  2. (2)

    for every zφz\models\varphi, h(z)h(z) has at most tt least significant bits equal to zero.

If φ\varphi is a CNF formula, then 𝖥𝗂𝗇𝖽𝖬𝖺𝗑𝖱𝖺𝗇𝗀𝖾\mathsf{FindMaxRange} makes 𝒪(logn)\mathcal{O}(\log n) calls to an NP oracle.

Proof.

Consider an NP oracle O={φ,h,tx,y,xφ,h(x)=y0t}O=\{\langle\varphi,h,t\rangle\mid\exists x,\exists y,x\models\varphi,h(x)=y0^{t}\}. Note that hh can be implemented as a degree-ss polynomial h:𝔽2n𝔽2nh:\mathbb{F}_{2^{n}}\to\mathbb{F}_{2^{n}}, so that h(x)h(x) can be evaluated in polynomial time. A binary search, requiring O(logn)O(\log n) calls to OO, suffices to find the largest value of tt for which φ,h,t\langle\varphi,h,t\rangle belongs to OO. ∎
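The binary search in the proof can be sketched as follows (illustrative Python of our own; the NP oracle is abstracted as a callable predicate). The search is sound because membership is monotone: if some hashed solution ends in at least tt zeros, it also ends in at least tt' zeros for every ttt'\leq t.

```python
def find_max_range(oracle, n):
    """Largest t in [0, n] with oracle(t) true, using O(log n) calls.

    oracle(t) stands for "<phi, h, t> in O", i.e. some z |= phi has h(z)
    ending in at least t zeros; it is monotone (true up to some t*, then
    false).  Assumes oracle(0) holds, i.e. phi is satisfiable.
    """
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if oracle(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Pretend the best hashed solution ends in exactly five zeros.
print(find_max_range(lambda t: t <= 5, 16))  # -> 5
```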

We note that unlike Propositions 2 and 5, we do not know whether 𝖥𝗂𝗇𝖽𝖬𝖺𝗑𝖱𝖺𝗇𝗀𝖾\mathsf{FindMaxRange} can be implemented efficiently when φ\varphi is a DNF formula. For a degree-ss polynomial h:𝔽2n𝔽2nh:\mathbb{F}_{2^{n}}\to\mathbb{F}_{2^{n}}, we can efficiently test whether hh has a root by computing 𝗀𝖼𝖽(h(x),x2nx)\mathsf{gcd}(h(x),x^{2^{n}}-x), but it is not clear how to simultaneously constrain some variables according to a DNF term.

Equipped with Proposition 8, we obtain 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖤𝗌𝗍\mathsf{ApproxModelCountEst}, which takes in a formula φ\varphi and a suitable value of rr and returns an estimate of |𝖲𝗈𝗅(φ)||\mathsf{Sol}(\varphi)|. The key idea of 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖤𝗌𝗍\mathsf{ApproxModelCountEst} is to repeatedly invoke 𝖥𝗂𝗇𝖽𝖬𝖺𝗑𝖱𝖺𝗇𝗀𝖾\mathsf{FindMaxRange} for each of the chosen hash functions and compute the estimate based on the sketch 𝒮\mathcal{S} and the value of rr. The following theorem summarizes the time complexity and guarantees of 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖤𝗌𝗍\mathsf{ApproxModelCountEst} for CNF formulas.

Theorem 9.

Given a CNF formula φ\varphi, parameters ε\varepsilon and δ\delta, and rr such that 2F02r50F02F_{0}\leq 2^{r}\leq 50F_{0}, the algorithm 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖤𝗌𝗍\mathsf{ApproxModelCountEst} returns cc satisfying

Pr[|𝖲𝗈𝗅(φ)|1+εc(1+ε)|𝖲𝗈𝗅(φ)|]1δ.\Pr\left[\frac{|\mathsf{Sol}(\varphi)|}{1+\varepsilon}\leq c\leq(1+\varepsilon)|\mathsf{Sol}(\varphi)|\right]\geq 1-\delta.

𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖤𝗌𝗍\mathsf{ApproxModelCountEst} makes 𝒪(1ε2lognlog(1δ))\mathcal{O}(\frac{1}{\varepsilon^{2}}\log n\log(\frac{1}{\delta})) calls to an NP oracle.

In order to obtain rr, we run in parallel another counting algorithm based on the simple F0F_{0}-estimation algorithm (Flajolet and Martin, 1985; Alon et al., 1999) which we call 𝖥𝗅𝖺𝗃𝗈𝗅𝖾𝗍𝖬𝖺𝗋𝗍𝗂𝗇\mathsf{FlajoletMartin}. Given a stream 𝐚\mathbf{a}, the 𝖥𝗅𝖺𝗃𝗈𝗅𝖾𝗍𝖬𝖺𝗋𝗍𝗂𝗇\mathsf{FlajoletMartin} algorithm chooses a random pairwise-independent hash function hHxor(n,n)h\in H_{xor}(n,n), computes the largest rr so that for some x𝐚ux\in\mathbf{a}_{u}, the rr least significant bits of h(x)h(x) are zero, and outputs rr. Alon, Matias and Szegedy (Alon et al., 1999) showed that 2r2^{r} is a 5-factor approximation of F0F_{0} with probability 3/53/5. Using our recipe, we can convert 𝖥𝗅𝖺𝗃𝗈𝗅𝖾𝗍𝖬𝖺𝗋𝗍𝗂𝗇\mathsf{FlajoletMartin} into an algorithm that approximates the number of solutions to a CNF formula φ\varphi within a factor of 5 with probability 3/5. It is easy to check that using the same idea as in Proposition 8, this algorithm requires O(logn)O(\log n) calls to an NP oracle.

Algorithm 7 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖤𝗌𝗍(φ,ε,δ,r)\mathsf{ApproxModelCountEst}(\varphi,\varepsilon,\delta,r)
1:𝖳𝗁𝗋𝖾𝗌𝗁96/ε2\mathsf{Thresh}\leftarrow 96/\varepsilon^{2}
2:t35log(1/δ)t\leftarrow 35\log(1/\delta)
3:H𝖯𝗂𝖼𝗄𝖧𝖺𝗌𝗁𝖥𝗎𝗇𝖼𝗍𝗂𝗈𝗇𝗌(swise(n,n),t×𝖳𝗁𝗋𝖾𝗌𝗁)H\leftarrow\mathsf{PickHashFunctions}(\mathcal{H}_{s-{\rm wise}}(n,n),t\times\mathsf{Thresh})
4:S{}S\leftarrow\{\}
5:for i[1,t]i\in[1,t] do
6:     for j[1,𝖳𝗁𝗋𝖾𝗌𝗁]j\in[1,\mathsf{Thresh}] do
7:         S[i,j]𝖥𝗂𝗇𝖽𝖬𝖺𝗑𝖱𝖺𝗇𝗀𝖾(φ,𝖳𝗋𝖺𝗂𝗅𝖹𝖾𝗋𝗈(H[i,j]))S[i,j]\leftarrow\mathsf{FindMaxRange}(\varphi,\mathsf{TrailZero}(H[i,j]))      
8:EstMedian{ln(11𝖳𝗁𝗋𝖾𝗌𝗁j=1𝖳𝗁𝗋𝖾𝗌𝗁𝟙{𝒮[i,j]r})ln(12r)}iEst\leftarrow\mathrm{Median}\left\{\frac{\ln\left(1-\frac{1}{\mathsf{Thresh}}\sum_{j=1}^{\mathsf{Thresh}}\mathbb{1}\{\mathcal{S}[i,j]\geq r\}\right)}{\ln(1-2^{-r})}\right\}_{i}
9:return EstEst

3.5. The Opportunities Ahead

As noted in Section 3.2, the algorithm based on Bucketing was already known and has witnessed detailed technical development from both applied and algorithmic perspectives. The model counting algorithms based on Minimum and Estimation are new. We discuss some potential implications of these new algorithms for SAT solvers and other aspects.

MaxSAT solvers with native support for XOR constraints

When the input formula φ\varphi is represented as CNF, 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC}, the model counting algorithm based on the Bucketing strategy, invokes an NP oracle over CNF-XOR formulas, i.e., formulas expressed as a conjunction of CNF and XOR constraints. The significant improvement in the runtime performance of 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC} owes to the design of SAT solvers with native support for CNF-XOR formulas (Soos et al., 2009; Soos and Meel, 2019; Soos et al., 2020). Such solvers have now found applications in other domains such as cryptanalysis. It is perhaps worth emphasizing that the proposal of 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC} was crucial to the renewed interest in the design of SAT solvers with native support for CNF-XOR formulas. As observed in Section 3.3, the algorithm based on the Minimum strategy would ideally invoke a MaxSAT solver that can handle XOR constraints natively. We believe that the Minimum-based algorithm will ignite interest in the design of MaxSAT solvers with native support for XOR constraints.

FPRAS for DNF based on Estimation

In Section 3.4, we were unable to show that the model counting algorithm obtained from Estimation is an FPRAS when φ\varphi is represented as DNF. The algorithms based on Estimation have been shown to achieve optimal space efficiency in the context of F0F_{0} estimation. In this context, an open problem is to investigate whether the Estimation-based strategy lends itself to an FPRAS for DNF counting.

Empirical Study of FPRAS for DNF Based on Minimum

Meel et al. (Meel et al., 2018, 2019) observed that the FPRAS for DNF based on Bucketing has superior performance, in terms of the number of instances solved, to that of FPRAS schemes based on the Monte Carlo framework. In this context, a natural direction of future work would be to conduct an empirical study to understand the behavior of the FPRAS scheme based on the Minimum strategy.

4. Distributed DNF Counting

Consider the problem of distributed DNF counting. In this setting, there are kk sites that can each communicate with a central coordinator. The input DNF formula φ\varphi is partitioned into kk DNF subformulas φ1,,φk\varphi_{1},\dots,\varphi_{k}, where each φi\varphi_{i} is a subset of the terms of the original φ\varphi, with the jj’th site receiving only φj\varphi_{j}. The goal is for the coordinator to obtain an (ε,δ)(\varepsilon,\delta)-approximation of the number of solutions to φ\varphi, while minimizing the total number of bits communicated between the sites and the coordinator. Distributed algorithms for sampling and counting solutions to CSP’s have been studied recently in other models of distributed computation (Feng et al., 2018b, a; Feng and Yin, 2018; Fischer and Ghaffari, 2018). From a practical perspective, given the centrality of #DNF in the context of probabilistic databases (Senellart, 2018; Ré and Suciu, 2008), a distributed DNF counting algorithm would entail applications in distributed probabilistic databases.

From our perspective, distributed DNF counting falls within the distributed functional monitoring framework formalized by Cormode et al. (Cormode et al., 2011). Here, the input is a stream 𝐚\mathbf{a} which is partitioned arbitrarily into sub-streams 𝐚1,,𝐚k\mathbf{a}_{1},\dots,\mathbf{a}_{k} that arrive at each of kk sites. Each site can communicate with the central coordinator, and the goal is for the coordinator to compute a function of the joint stream 𝐚\mathbf{a} while minimizing the total communication. This general framework has several direct applications and has been studied extensively (Babcock and Olston, 2003; Cormode et al., 2005; Manjhi et al., 2005; Keralapura et al., 2006; Arackaparambil et al., 2009; Sharfman et al., 2010; Keren et al., 2011; Cormode et al., 2012; Woodruff and Zhang, 2012; Yi and Zhang, 2013; Huang et al., 2012; Woodruff and Zhang, 2017). In distributed DNF counting, each sub-stream 𝐚i\mathbf{a}_{i} corresponds to the set of satisfying assignments of the subformula φi\varphi_{i}, while the function to be computed is F0F_{0}.

The model counting algorithms discussed in Section 3 can be extended to the distributed setting. We briefly indicate the distributed implementations for each of the three algorithms. As above, we set the parameter 𝖳𝗁𝗋𝖾𝗌𝗁\mathsf{Thresh} to O(1/ε2)O(1/\varepsilon^{2}). Correctness follows from Bar-Yossef et al. (Bar-Yossef et al., [n. d.]) and the earlier discussion.

We consider an adaptation of 𝖡𝗈𝗎𝗇𝖽𝖾𝖽𝖲𝖠𝖳\mathsf{BoundedSAT} that takes in φ\varphi over nn variables, a hash function h𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n,m)h\in\mathcal{H}_{\mathsf{Toeplitz}}(n,m), and a threshold tt as inputs, and returns a set UU of solutions such that |U|=min(t,|𝖲𝗈𝗅(φh(x)=0m)|)|U|=\min(t,|\mathsf{Sol}(\varphi\wedge h(x)={0^{m}})|).

Bucketing.:

Setting m=O(log(k/δε2))m=O(\log(k/\delta\varepsilon^{2})), the coordinator chooses H[1],,H[t]H[1],\dots,H[t] from 𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n,n)\mathcal{H}_{\mathsf{Toeplitz}}(n,n) and GG from 𝗑𝗈𝗋(n,m)\mathcal{H}_{\mathsf{xor}}(n,m). It then sends them to the kk sites. Let mi,jm_{i,j} be the smallest mm such that the size of the set 𝖡𝗈𝗎𝗇𝖽𝖾𝖽𝖲𝖠𝖳(φj,H[i]m,𝖳𝗁𝗋𝖾𝗌𝗁)\mathsf{BoundedSAT}(\varphi_{j},H[i]_{m},\mathsf{Thresh}) is smaller than 𝖳𝗁𝗋𝖾𝗌𝗁\mathsf{Thresh}. The jj’th site sends the coordinator the following tuples: G(x),𝖳𝗋𝖺𝗂𝗅𝖹𝖾𝗋𝗈(H[i](x))\langle G(x),\mathsf{TrailZero}(H[i](x))\rangle for each i[t]i\in[t] and for each xx in 𝖡𝗈𝗎𝗇𝖽𝖾𝖽𝖲𝖠𝖳(φj,H[i]mi,j,𝖳𝗁𝗋𝖾𝗌𝗁)\mathsf{BoundedSAT}(\varphi_{j},H[i]_{m_{i,j}},\mathsf{Thresh}). Note that each site only sends tuples for at most O(1/δε2)O(1/\delta\varepsilon^{2}) choices of xx, so that GG hashes these xx to distinct values with probability 1δ/21-\delta/2. It is easy to verify that the coordinator can then execute the rest of the algorithm 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝖢\mathsf{ApproxMC}. The communication cost is O~(k(n+1/ε2)log(1/δ))\tilde{O}(k(n+1/\varepsilon^{2})\cdot\log(1/\delta)), and the time complexity for each site is polynomial in nn, ε1\varepsilon^{-1}, and log(δ1)\log(\delta^{-1}).

Minimum.:

The coordinator chooses hash functions H[1],,H[t]H[1],\dots,H[t] from 𝖳𝗈𝖾𝗉𝗅𝗂𝗍𝗓(n,3n)\mathcal{H}_{\mathsf{Toeplitz}}(n,3n) and sends them to the kk sites. Each site runs the 𝖥𝗂𝗇𝖽𝖬𝗂𝗇\mathsf{FindMin} algorithm for each hash function and sends the outputs to the coordinator. So, the coordinator receives sets S[i,j]S[i,j], consisting of the 𝖳𝗁𝗋𝖾𝗌𝗁\mathsf{Thresh} lexicographically smallest hash values of the solutions to φj\varphi_{j}. The coordinator then extracts S[i]S[i], the 𝖳𝗁𝗋𝖾𝗌𝗁\mathsf{Thresh} lexicographically smallest elements of S[i,1]S[i,k]S[i,1]\cup\dots\cup S[i,k], and proceeds with the rest of the algorithm 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖬𝗂𝗇\mathsf{ApproxModelCountMin}. The communication cost is O(kn/ε2log(1/δ))O(kn/\varepsilon^{2}\cdot\log(1/\delta)) to account for the kk sites sending the outputs of their 𝖥𝗂𝗇𝖽𝖬𝗂𝗇\mathsf{FindMin} invocations. The time complexity for each site is polynomial in nn, ε1\varepsilon^{-1}, and log(δ1)\log(\delta^{-1}).

Estimation.:

For each i[t]i\in[t], the coordinator chooses 𝖳𝗁𝗋𝖾𝗌𝗁\mathsf{Thresh} hash functions H[i,1],,H[i,𝖳𝗁𝗋𝖾𝗌𝗁]H[i,1],\dots,H[i,\mathsf{Thresh}], drawn independently from swise(n,n)\mathcal{H}_{s-{\rm wise}}(n,n) (for s=O(log(1/ε))s=O(\log(1/\varepsilon))), and sends them to the kk sites. Each site runs the 𝖥𝗂𝗇𝖽𝖬𝖺𝗑𝖱𝖺𝗇𝗀𝖾\mathsf{FindMaxRange} algorithm for each hash function and sends the output to the coordinator. Suppose the coordinator receives S[i,j,][n]S[i,j,\ell]\in[n] for each i[t],j[𝖳𝗁𝗋𝖾𝗌𝗁]i\in[t],j\in[\mathsf{Thresh}] and [k]\ell\in[k]. It computes S[i,j]=maxS[i,j,]S[i,j]=\max_{\ell}S[i,j,\ell]. The rest of 𝖠𝗉𝗉𝗋𝗈𝗑𝖬𝗈𝖽𝖾𝗅𝖢𝗈𝗎𝗇𝗍𝖤𝗌𝗍\mathsf{ApproxModelCountEst} is then executed by the coordinator. The communication cost is O~(k(n+1/ε2)log(1/δ))\tilde{O}(k(n+1/\varepsilon^{2})\log(1/\delta)). However, as earlier, we do not know a polynomial-time algorithm to implement the 𝖥𝗂𝗇𝖽𝖬𝖺𝗑𝖱𝖺𝗇𝗀𝖾\mathsf{FindMaxRange} algorithm for DNF terms.

Lower Bound

The communication cost of the Bucketing- and Estimation-based algorithms is nearly optimal in its dependence on $k$ and $\varepsilon$. Woodruff and Zhang (Woodruff and Zhang, 2012) showed that the randomized communication complexity of estimating $F_0$ up to a $1+\varepsilon$ factor in the distributed functional monitoring setting is $\Omega(k/\varepsilon^2)$. We can reduce the $F_0$ estimation problem to distributed DNF counting: if, for the $F_0$ estimation problem, the $j$-th site receives items $a_1,\dots,a_m \in [N]$, then for the distributed DNF counting problem, $\varphi_j$ is a DNF formula on $\lceil \log_2 N\rceil$ variables whose solutions are exactly $a_1,\dots,a_m$ in their binary encoding. Thus, we immediately get an $\Omega(k/\varepsilon^2)$ lower bound for the distributed DNF counting problem. Finding the optimal dependence on $N$ for $k>1$ remains an interesting open question. (Note that if $k=1$, then $\log(n/\varepsilon)$ bits suffice, as the site can solve the problem on its own and send the coordinator the binary encoding of a $(1+\varepsilon)$-approximation of $F_0$.)
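The reduction in this lower bound can be made concrete: a site holding items $a_1,\dots,a_m\in[N]$ builds one DNF term per item, fixing all $\lceil\log_2 N\rceil$ variables to the item's binary encoding. A minimal sketch (the brute-force solution enumeration is for checking only, and the function names are ours):

```python
import math

def items_to_dnf(items, N):
    """Each item a in [N] becomes one term, a bit string fixing every
    variable, satisfied exactly by the binary encoding of a.  Hence
    Sol(phi_j) = {a_1, ..., a_m}."""
    n = max(1, math.ceil(math.log2(N)))
    terms = [format(a, f'0{n}b') for a in items]
    return n, terms

def dnf_solutions(n, terms):
    """Brute-force the satisfying assignments of the DNF (small n only).
    Every term fixes all n variables, so an assignment satisfies the
    formula iff its bit string equals some term."""
    term_set = set(terms)
    return {x for x in range(2 ** n) if format(x, f'0{n}b') in term_set}
```

Distinct solutions of $\vee_j \varphi_j$ then correspond one-to-one to distinct items across the sites, so an $F_0$ estimate for the DNF stream is an $F_0$ estimate for the original items.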

5. From Counting to Streaming: Structured Set Streaming

In this section we consider the structured set streaming model, in which each item $S_i$ of the stream is a succinct representation of a set over the universe $U = \{0,1\}^n$. Our goal is to design algorithms that are efficient both in memory and in processing time per item for computing $|\cup_i S_i|$, the number of distinct elements in the union of all the sets in the stream. We call this problem $F_0$ computation over structured set streams.

DNF Sets

A particular representation we are interested in is one where each set is presented as the set of satisfying assignments of a DNF formula. Let $\varphi$ be a DNF formula over $n$ variables. The DNF set corresponding to $\varphi$ is the set of satisfying assignments of $\varphi$. The size of this representation is the number of terms in the formula $\varphi$.

A stream over DNF sets is a stream of DNF formulas $\varphi_1, \varphi_2, \ldots$. Given such a DNF stream, the goal is to estimate $|\bigcup_i S_i|$, where $S_i$ is the DNF set represented by $\varphi_i$. This quantity is the same as the number of satisfying assignments of the formula $\vee_i \varphi_i$. We show that the algorithms described in the previous section carry over to this setting, yielding $(\epsilon,\delta)$-estimation algorithms with space and per-item time $\mathrm{poly}(1/\epsilon, n, k, \log(1/\delta))$, where $k$ is the size of the formula.

Notice that this model generalizes the traditional streaming model, in which each item of the stream is an element $x \in U$: such an item can be represented as a single-term DNF formula $\phi_x$ whose only satisfying assignment is $x$. This model also subsumes certain other models considered in the streaming literature, which we discuss later.

Theorem 1.

There is a streaming algorithm to compute an $(\epsilon,\delta)$ approximation of $F_0$ over DNF sets. This algorithm takes space $O(\frac{n}{\varepsilon^2}\cdot\log\frac{1}{\delta})$ and processing time $O(n^4\cdot k\cdot\frac{1}{\varepsilon^2}\cdot\log\frac{1}{\delta})$ per item, where $k$ is the size (number of terms) of the corresponding DNF formula.

Proof.

We show how to adapt the Minimum-value based algorithm from Section 3.3 to this setting. The algorithm picks a hash function $h \in \mathcal{H}_{\mathsf{Toeplitz}}(n,3n)$ and, after processing $i-1$ items, maintains the set $\mathcal{B}$ consisting of the $t$ lexicographically minimum elements of $h(\mathsf{Sol}(\varphi_1\vee\ldots\vee\varphi_{i-1}))$. When $\varphi_i$ arrives, it computes the set $\mathcal{B}'$ consisting of the $t$ lexicographically minimum values of $h(\mathsf{Sol}(\varphi_i))$ and subsequently updates $\mathcal{B}$ by computing the $t$ lexicographically smallest elements of $\mathcal{B}\cup\mathcal{B}'$. By Proposition 5, the computation of $\mathcal{B}'$ can be done in time $O(n^4\cdot k\cdot t)$, where $k$ is the number of terms in $\varphi_i$. Updating $\mathcal{B}$ can be done in $O(t\cdot n)$ time. Thus the update time for the item $\varphi_i$ is $O(n^4\cdot k\cdot t)$. To obtain an $(\varepsilon,\delta)$ approximation, we set $t = O(\frac{1}{\varepsilon^2})$, repeat the procedure $O(\log\frac{1}{\delta})$ times, and take the median value. Thus the update time per item is $O(n^4\cdot k\cdot\frac{1}{\varepsilon^2}\cdot\log\frac{1}{\delta})$. As for space, each hash function uses $O(n)$ bits, and storing the $O(\frac{1}{\epsilon^2})$ minima requires $O(\frac{n}{\epsilon^2})$ space, resulting in an overall space usage of $O(\frac{n}{\varepsilon^2}\cdot\log\frac{1}{\delta})$. The proof of correctness follows from Lemma 4. ∎
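The $t$-minimum-values logic of this proof can be sketched in isolation. In the sketch below, each stream item is an explicit set of elements rather than a DNF formula (so no $\mathsf{FindMin}$ subroutine is needed), and a seeded affine map over 64-bit integers stands in for the Toeplitz hash family; both are simplifying assumptions for illustration.

```python
import heapq
import random

M = 2 ** 64  # hash range, standing in for {0,1}^{3n}

def kmv_f0_estimate(stream_of_sets, t, seed=0):
    """Keep the t smallest hash values over the union of all sets seen;
    estimate F0 as t * M / (t-th smallest value).  If fewer than t
    distinct hash values exist, the union was small and |B| is exact."""
    rng = random.Random(seed)
    a = rng.randrange(1, M) | 1   # odd a => x -> (a*x + b) % M is injective
    b = rng.randrange(M)
    h = lambda x: (a * x + b) % M
    minima = set()
    for s in stream_of_sets:      # s plays the role of Sol(phi_i)
        minima.update(h(x) for x in s)             # B' for this item
        minima = set(heapq.nsmallest(t, minima))   # t smallest of B u B'
    if len(minima) < t:           # union has fewer than t elements: exact
        return len(minima)
    return t * M // max(minima)   # max of the kept set = t-th smallest
```

Since the stand-in hash is injective, streams whose union has fewer than $t$ elements are counted exactly; larger unions are estimated up to the usual $(1\pm\varepsilon)$ factor when $t = O(1/\varepsilon^2)$.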

Instead of the Minimum-value based algorithm, we could adapt the Bucketing-based algorithm to obtain an algorithm with similar space and time complexities. As noted earlier, some of the set streaming models considered in the literature can be reduced to DNF set streaming. We discuss them next.

Multidimensional Ranges

A $d$-dimensional range over a universe $U$ is defined as $[a_1,b_1]\times[a_2,b_2]\times\ldots\times[a_d,b_d]$. Such a range represents the set of tuples $(x_1,\ldots,x_d)$ where each $x_i$ is an integer with $a_i \leq x_i \leq b_i$. Note that every $d$-dimensional range can be succinctly represented by the tuple $\langle a_1,b_1,\cdots,a_d,b_d\rangle$. A multidimensional stream is a stream in which each item is a $d$-dimensional range, and the goal is to compute $F_0$ of the union of the $d$-dimensional ranges efficiently. We show that $F_0$ computation over multidimensional ranges can be reduced to $F_0$ computation over DNF sets. Using this reduction, we arrive at a simple algorithm to compute $F_0$ over multidimensional ranges.

Lemma 0.

Any $d$-dimensional range $R$ over $U$ can be represented as a DNF formula $\varphi_R$ over $nd$ variables whose size is at most $(2n)^d$. There is an algorithm that takes $R$ as input and outputs the $i$-th term of $\varphi_R$ using $O(nd)$ space, for $1 \leq i \leq (2n)^d$.

Proof.

Let $R = [a_1,b_1]\times[a_2,b_2]\times\ldots\times[a_d,b_d]$ be a $d$-dimensional range over $U^d$. We first represent the multidimensional range as a conjunction of $d$ DNF formulas $\phi_1,\cdots,\phi_d$, each with at most $2n$ terms, where $\phi_i$ represents $[a_i,b_i]$, the range in the $i$-th dimension. Converting this conjunction into a DNF formula results in the formula $\phi_R$ with at most $(2n)^d$ terms.

For any $\ell$-bit number $c$, it is straightforward to write a DNF formula $\varphi_{\leq c}$, of size at most $\ell$, that represents the range $[0,c]$ (equivalently, the set $\{x \mid 0\leq x\leq c\}$). Similarly, we can write a DNF formula $\varphi_{\geq c}$, of size at most $\ell$, for the range $[c, 2^{\ell}-1]$. We now construct a formula to represent the range $[a,b]$ over $U$ as follows. Let $a_1a_2\cdots a_n$ and $b_1b_2\cdots b_n$ be the binary representations of $a$ and $b$, respectively. Let $\ell$ be the largest integer such that $a_1a_2\cdots a_\ell = b_1b_2\cdots b_\ell$; hence $a_{\ell+1}=0$ and $b_{\ell+1}=1$. Let $a'$ and $b'$ denote the integers represented by $a_{\ell+2}\cdots a_n$ and $b_{\ell+2}\cdots b_n$, and let $\psi$ denote the formula (a single term) that represents the string $a_1\cdots a_\ell$. Then the formula representing $[a,b]$ is $\psi\wedge(\overline{x_{\ell+1}}\,\varphi_{\geq a'}\vee x_{\ell+1}\varphi_{\leq b'})$. This can be written as a DNF formula by distributing $\psi$; the resulting formula has at most $2n$ terms over $n$ variables. Note that each $\varphi_i$ can be constructed using $O(n)$ space. To obtain the final DNF representing the range $R$, we need to convert $\varphi_1\wedge\cdots\wedge\varphi_d$ into a DNF formula. It is easy to see that, for any $i$, the $i$-th term of this DNF can be computed using $O(nd)$ space. Note that this formula has $nd$ variables, $n$ per dimension. ∎
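The one-dimensional construction in this proof can be written out directly. In the sketch below a term is a string over $\{0,1,*\}$ (one character per variable, with $*$ marking an unconstrained variable); the brute-force `solutions` check is for validation only, and the function names are ours.

```python
def terms_le(c):
    """DNF terms for {x : 0 <= x <= c}, where c is a bit string."""
    terms = [c]                          # x == c itself
    for i, bit in enumerate(c):
        if bit == '1':                   # prefix of c, a 0 where c has 1, rest free
            terms.append(c[:i] + '0' + '*' * (len(c) - i - 1))
    return terms

def terms_ge(c):
    """DNF terms for {x : c <= x <= 2^len(c) - 1}."""
    terms = [c]
    for i, bit in enumerate(c):
        if bit == '0':                   # prefix of c, a 1 where c has 0, rest free
            terms.append(c[:i] + '1' + '*' * (len(c) - i - 1))
    return terms

def range_to_dnf(a, b, n):
    """At most 2n terms representing {x : a <= x <= b} over n-bit x."""
    abits, bbits = format(a, f'0{n}b'), format(b, f'0{n}b')
    if a == b:
        return [abits]
    l = 0                                # length of the common prefix
    while abits[l] == bbits[l]:
        l += 1
    prefix = abits[:l]                   # the single term psi
    lower = [prefix + '0' + t for t in terms_ge(abits[l + 1:])]
    upper = [prefix + '1' + t for t in terms_le(bbits[l + 1:])]
    return lower + upper

def solutions(terms, n):
    """Brute-force the satisfying assignments (small n only)."""
    sols = set()
    for x in range(2 ** n):
        xb = format(x, f'0{n}b')
        if any(all(t[i] in ('*', xb[i]) for i in range(n)) for t in terms):
            sols.add(x)
    return sols
```

For example, `range_to_dnf(3, 10, 4)` yields the four terms `0011`, `01**`, `1010`, `100*`, whose solutions are exactly the integers 3 through 10.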

Using the above reduction and Theorem 1, we obtain an algorithm for estimating $F_0$ over multidimensional ranges in a range-efficient manner.

Theorem 3.

There is a streaming algorithm to compute an $(\epsilon,\delta)$ approximation of $F_0$ over $d$-dimensional ranges that takes space $O(\frac{nd}{\varepsilon^2}\cdot\log(1/\delta))$ and processing time $O((nd)^4\cdot n^d\cdot\frac{1}{\varepsilon^2}\cdot\log(1/\delta))$ per item.

Remark 1.

Tirthapura and Woodruff (Tirthapura and Woodruff, 2012) studied the problem of range-efficient estimation of $F_k$ (the $k$-th frequency moments) over $d$-dimensional ranges. They claimed an algorithm to estimate $F_0$ with space and per-item time complexity $\mathrm{poly}(n,d,1/\epsilon,\log 1/\delta)$, but have since retracted this claim (Woodruff, [n. d.]); their method only yields $\mathrm{poly}(n^d, 1/\epsilon, \log 1/\delta)$ time per item. Their proof is involved, requiring range-efficient implementations of the count sketch algorithm (Charikar et al., 2004) and of recursive sketches (Indyk and Woodruff, 2005; Braverman and Ostrovsky, 2010). We obtain the same complexity bounds with a much simpler analysis and a practically efficient algorithm that can use off-the-shelf implementations (Meel et al., 2018).

Remark 2.

Subsequent to the present work, an improved algorithm for $F_0$ over structured sets was presented in (Vinodchandran et al., 2021) (to appear in PODS 2021). In particular, that paper presents an $F_0$ estimation algorithm, called $\mathsf{APS\text{-}Estimator}$, for streams over Delphic sets. A set $S \subseteq \{0,1\}^n$ belongs to the Delphic family if the following queries can be answered in $O(n)$ time: (1) compute the size of $S$, (2) draw a uniform random sample from $S$, and (3) given any $x$, check whether $x \in S$. The authors design a streaming algorithm that, given a stream $\mathcal{S} = \langle S_1, S_2,\cdots, S_M\rangle$ wherein each $S_i \subseteq \{0,1\}^n$ belongs to the Delphic family, computes an $(\varepsilon,\delta)$-approximation of $|\bigcup_{i=1}^M S_i|$ with worst-case space complexity $O(n\cdot\log(M/\delta)\cdot\varepsilon^{-2})$ and per-item time $\widetilde{O}(n\cdot\log(M/\delta)\cdot\varepsilon^{-2})$. The algorithm $\mathsf{APS\text{-}Estimator}$, when applied to $d$-dimensional ranges, gives per-item time and space complexity bounds that are $\mathrm{poly}(n, d, \log M, 1/\varepsilon, \log 1/\delta)$. While $\mathsf{APS\text{-}Estimator}$ brings the dependency on $d$ down from exponential to polynomial, it works under the assumption that the length $M$ of the stream is known. The general setup presented in (Vinodchandran et al., 2021), however, can be applied to other structured sets considered in this paper, including multidimensional arithmetic progressions.

Representing Multidimensional Ranges as CNF Formulas.

Since the algorithm $\mathsf{APS\text{-}Estimator}$ presented in (Vinodchandran et al., 2021) employs a sampling-based technique, a natural question is whether there exists a hashing-based technique that achieves per-item time polynomial in $n$ and $d$. We note that the above approach of representing a multidimensional range as a DNF formula does not yield such an algorithm, because there exist $d$-dimensional ranges whose DNF representation requires size $\Omega(n^d)$.

Observation 1.

There exist $d$-dimensional ranges whose DNF representation has size $\geq n^d$.

Proof.

The observation follows by considering the range $R = [1, 2^n-1]^d$ (only $0$ is missing from the interval in each dimension). We argue that any DNF formula $\varphi$ for this range has size (number of terms) $\geq n^d$. For $1\leq j\leq d$, we use the set of variables $X^j = \{x^j_1, x^j_2,\ldots,x^j_n\}$ to represent the $j$-th coordinate of $R$. Then $R$ can be represented as the formula $\varphi_R = \vee_{(i_1,i_2,\ldots,i_d)}\, x^1_{i_1}x^2_{i_2}\ldots x^d_{i_d}$, where $1\leq i_j\leq n$. This formula has $n^d$ terms. Let $\varphi$ be any other DNF formula representing $R$. The main observation is that any term $T$ of $\varphi$ is completely contained (in terms of its set of solutions) in one of the terms of $\varphi_R$; this implies that $\varphi$ must have at least $n^d$ terms. We now argue that $T$ is contained in one of the terms of $\varphi_R$. $T$ must have at least one variable from each $X^j$ appearing as a positive literal: if $T$ had no such variable from some $X^j$, then $T$ would contain a solution with all the variables in $X^j$ set to $0$, which is not in $R$. Now let $x^j_{i_j}$ be a variable from $X^j$ that appears positively in $T$. Then clearly $T$ is contained in the term $x^1_{i_1}x^2_{i_2}\ldots x^d_{i_d}$ of $\varphi_R$. ∎

This leads to the question of whether we can obtain a super-polynomial lower bound on the time per item. We observe that such a lower bound would imply $\mathrm{P}\neq\mathrm{NP}$. For this, we note the following.

Observation 2.

Any $d$-dimensional range $R$ can be represented as a CNF formula of size $O(nd)$ over $nd$ variables.

This is because a one-dimensional range $[a,b]$ can also be represented as a CNF formula of size $O(n)$ (Chakraborty et al., 2015), and thus the CNF formula for $R$ is a conjunction of the formulas along each dimension. The problem of computing $F_0$ over $d$-dimensional ranges therefore reduces to computing $F_0$ over a stream where each item is a CNF formula. As in the proof of Theorem 1, we can adapt the Minimum-value based algorithm to CNF streams. When a CNF formula $\varphi_i$ arrives, we need to compute the $t$ lexicographically smallest elements of $h(\mathsf{Sol}(\varphi_i))$, where $h \in \mathcal{H}_{\mathsf{Toeplitz}}(n,3n)$. By Proposition 5, this can be done in polynomial time by making $O(tnd)$ calls to an NP oracle, since $\phi_i$ is a CNF formula over $nd$ variables. Thus, if P equals NP, the time taken per range is polynomial in $n$, $d$, and $1/\varepsilon^2$; consequently, a super-polynomial lower bound on the time per item would imply that P differs from NP.

From Weighted #DNF to d-Dimensional Ranges

Designing a streaming algorithm with per-item time polynomial in $n$ and $d$ is a very interesting open problem, with implications for weighted DNF counting. Consider a formula $\varphi$ defined on the set of variables $x = \{x_1, x_2,\ldots,x_n\}$ and a weight function $\rho: x \to (0,1)$. The weight of an assignment $\sigma$ is defined as follows:

$$W(\sigma)=\prod_{x_i:\sigma(x_i)=1}\rho(x_i)\prod_{x_i:\sigma(x_i)=0}(1-\rho(x_i))$$

Furthermore, we define the weight of a formula $\varphi$ as

$$W(\varphi)=\sum_{\sigma\models\varphi}W(\sigma)$$

Given $\varphi$ and $\rho$, the problem of weighted counting is to compute $W(\varphi)$. We consider the case where, for each $x_i$, $\rho(x_i)$ is represented using $m_i$ bits in binary, i.e., $\rho(x_i) = \frac{k_i}{2^{m_i}}$. Inspired by the weighted-to-unweighted reduction due to Chakraborty et al. (Chakraborty et al., 2015), we show how weighted DNF counting can be reduced to $F_0$ estimation over $n$-dimensional ranges. The reduction is as follows: we transform every term of $\varphi$ into a product of multidimensional ranges, where every variable $x_i$ is replaced with the interval $[1, k_i]$, every $\neg x_i$ is replaced with $[k_i+1, 2^{m_i}]$, and every $\wedge$ is replaced with $\times$. For example, the term $(x_1\wedge\neg x_2\wedge\neg x_3)$ is replaced with $[1,k_1]\times[k_2+1,2^{m_2}]\times[k_3+1,2^{m_3}]$. Given $F_0$ of the resulting stream, we can compute the weight of $\varphi$ simply as $W(\varphi) = \frac{F_0}{2^{\sum_i m_i}}$. Thus, a hashing-based streaming algorithm with $\mathrm{poly}(n,d)$ time per item would yield a hashing-based FPRAS for weighted DNF counting, an open problem from (Abboud et al., 2019).
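A small sketch of this reduction, with brute-force computations standing in for both the streaming $F_0$ estimate and the weighted count (the encodings and function names are ours; a variable absent from a term maps to the full interval $[1, 2^{m_i}]$, which is implicit in the reduction above):

```python
from itertools import product
from math import prod

def term_to_box(term, k, m):
    """term: dict mapping variable index -> polarity (True for x_i).
    Returns the product of intervals corresponding to the term."""
    box = []
    for i in range(len(k)):
        if i not in term:
            box.append((1, 2 ** m[i]))          # free variable: full range
        elif term[i]:
            box.append((1, k[i]))               # x_i   -> [1, k_i]
        else:
            box.append((k[i] + 1, 2 ** m[i]))   # ~x_i  -> [k_i+1, 2^{m_i}]
    return box

def f0_of_boxes(boxes):
    """Brute-force |union of boxes|: the F0 a streaming algorithm
    would estimate over the resulting range stream (small cases only)."""
    points = set()
    for box in boxes:
        points.update(product(*[range(lo, hi + 1) for lo, hi in box]))
    return len(points)

def weighted_count(phi, rho):
    """Brute-force W(phi) for a DNF phi under literal weights rho."""
    n = len(rho)
    total = 0.0
    for bits in product((0, 1), repeat=n):
        if any(all(bits[v] == int(pol) for v, pol in t.items()) for t in phi):
            total += prod(rho[i] if bits[i] else 1 - rho[i] for i in range(n))
    return total
```

For $\varphi = (x_1\wedge\neg x_2)\vee x_3$ with $\rho = (1/4, 1/2, 3/4)$, i.e., $k = (1,1,3)$ and $m = (2,1,2)$, both sides give $25/32$.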

Multidimensional Arithmetic Progressions.

We now generalize Theorem 3 to handle arithmetic progressions instead of ranges. Let $[a,b,c]$ represent the arithmetic progression with common difference $c$ in the range $[a,b]$, i.e., $a, a+c, a+2c,\ldots, a+ic$, where $i$ is the largest integer such that $a+ic \leq b$. Here, we consider $d$-dimensional arithmetic progressions $R = [a_1,b_1,c_1]\times\cdots\times[a_d,b_d,c_d]$ where each $c_i$ is a power of two. We first observe that the set represented by $[a,b,2^\ell]$ can be expressed as a DNF formula as follows. Let $\phi$ be the DNF formula representing the range $[a,b]$, let $a_1,\cdots,a_\ell$ be the $\ell$ least significant bits of $a$, and let $\psi$ be the term that represents the bit sequence $a_1\cdots a_\ell$. Then the formula representing the arithmetic progression $[a,b,2^\ell]$ is $\phi\wedge\psi$, which can be converted to a DNF formula of size $O(2n)$. Thus the multidimensional arithmetic progression $R$ can be represented as a DNF formula of size $(2n)^d$. The time and space required to convert $R$ into a DNF formula are as before, i.e., $O(n^d)$ time and $O(nd)$ space. This leads us to the following corollary.

Corollary 0.

There is a streaming algorithm to compute an $(\epsilon,\delta)$ approximation of $F_0$ over $d$-dimensional arithmetic progressions, whose common differences are powers of two, that takes space $O(\frac{nd}{\varepsilon^2}\cdot\log(1/\delta))$ and processing time $O((nd)^4\cdot n^d\cdot\frac{1}{\varepsilon^2}\cdot\log(1/\delta))$ per item.

Affine Spaces

Another example of a structured stream is one where each item is an affine space represented by $Ax = B$, where $A$ is a Boolean matrix and $B$ is a zero-one vector. Without loss of generality, we may assume that $A$ is an $n\times n$ matrix. Thus an affine stream consists of $\langle A_1, B_1\rangle, \langle A_2, B_2\rangle,\cdots$, where each $\langle A_i, B_i\rangle$ succinctly represents the set $\{x \in \{0,1\}^n \mid A_i x = B_i\}$.

For an $n\times n$ Boolean matrix $A$ and a zero-one vector $B$, let $\mathsf{Sol}(\langle A,B\rangle)$ denote the set of all $x$ that satisfy $Ax = B$.

Proposition 0.

Given $(A,B)$, $h \in \mathcal{H}_{\mathsf{Toeplitz}}(n,3n)$, and $t$ as input, there is an algorithm, $\mathsf{AffineFindMin}$, that returns a set $\mathcal{B}\subseteq h(\mathsf{Sol}(\langle A,B\rangle))$ such that if $|h(\mathsf{Sol}(\langle A,B\rangle))| \leq t$, then $\mathcal{B} = h(\mathsf{Sol}(\langle A,B\rangle))$; otherwise $\mathcal{B}$ consists of the $t$ lexicographically minimum elements of $h(\mathsf{Sol}(\langle A,B\rangle))$. The time taken by this algorithm is $O(n^4 t)$ and the space taken is $O(tn)$.

Proof.

Let $D$ be the matrix that specifies the hash function $h$; the goal is to compute the $t$ smallest elements of $\mathcal{C} = \{Dx \mid Ax = B\}$. Note that if $y \in \mathcal{C}$, then it must be the case that $(D|A)\,x = (y|B)$, where $D|A$ is the matrix obtained by appending the rows of $A$ to the rows of $D$ (at the end), and $y|B$ is the vector obtained by appending $B$ to $y$. Note that $D|A$ is a matrix with $4n$ rows. The rest of the proof is very similar to the proof of Proposition 5: we can do a prefix search as before, which involves Gaussian elimination on submatrices of $D|A$. ∎
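The prefix search alluded to in this proof can be sketched explicitly over GF(2). In the sketch below, rows of $D$ and $A$ are bit masks (integers), each prefix is checked for feasibility by Gaussian elimination, and the search fixes the output bits $y = Dx$ one at a time, branching on 0 before 1 so that leaves are visited in lexicographic order. This is an illustrative implementation under those encoding assumptions, not the paper's exact pseudocode.

```python
def solvable(rows):
    """Check consistency of a GF(2) linear system; rows are (mask, rhs)."""
    basis = {}                      # pivot bit -> (mask, rhs)
    for mask, rhs in rows:
        for piv in sorted(basis, reverse=True):
            if mask >> piv & 1:     # eliminate against existing pivots
                mask ^= basis[piv][0]
                rhs ^= basis[piv][1]
        if mask:
            basis[mask.bit_length() - 1] = (mask, rhs)
        elif rhs:                   # reduced to 0 = 1: inconsistent
            return False
    return True

def affine_find_min(D, A, B, t):
    """Return the t lexicographically smallest bit strings in
    {Dx : Ax = B}, via prefix search with a feasibility check per node."""
    out = []
    def dfs(j, prefix, cons):
        if len(out) >= t or not solvable(cons):
            return                  # prune: enough outputs, or infeasible
        if j == len(D):             # all output bits fixed and feasible
            out.append(prefix)
            return
        for bit in (0, 1):          # 0 first => lexicographic order
            dfs(j + 1, prefix + str(bit), cons + [(D[j], bit)])
    dfs(0, '', [(A[i], B[i]) for i in range(len(A))])
    return out
```

Each search node performs one Gaussian elimination over a submatrix of $D|A$, matching the polynomial per-node cost in the proposition.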

Theorem 6.

There is a streaming algorithm that computes an $(\epsilon,\delta)$ approximation of $F_0$ over affine spaces. This algorithm takes space $O(\frac{n}{\epsilon^2}\cdot\log(1/\delta))$ and processing time $O(n^4\cdot\frac{1}{\epsilon^2}\cdot\log(1/\delta))$ per item.

6. Conclusion and Future Outlook

To summarize, our investigation led to a diverse set of results that unify over two decades of work in model counting and $F_0$ estimation. We believe that the viewpoint presented in this work has the potential to spur several new interesting research directions. We sketch some of these directions below:

Sampling:

The problems of counting and sampling are closely related. In particular, the seminal work of Jerrum, Valiant, and Vazirani (Jerrum et al., 1986) showed that approximate counting and almost-uniform sampling are inter-reducible for self-reducible NP problems. Concurrent with developments in approximate model counting, there has been significant interest in the design of efficient sampling algorithms. A natural direction would be to launch a similar investigation on the sampling side, connecting streaming and model-counting techniques.

Higher Moments:

There has been a long line of work on the estimation of higher frequency moments, i.e., $F_k$, in the streaming context. A natural direction of future research is to adapt the notion of $F_k$ to the context of CSPs. For example, in the context of DNF, one can view $F_1$ as simply the sum of the sizes of the terms, but the relevance and potential applications of higher moments such as $F_2$ in the context of CSPs remain to be understood. Given the similarity of the core algorithmic frameworks for higher moments, we expect the framework and recipe presented in this paper to extend to algorithms for higher moments in the context of CSPs.

Sparse XORs:

In the context of model counting, the performance of the underlying SAT solvers strongly depends on the size of the XORs. The standard constructions of $\mathcal{H}_{\mathsf{Toeplitz}}$ and $\mathcal{H}_{\mathsf{xor}}$ lead to XORs of expected size $n/2$, and an interesting line of research has focused on the design of sparse XOR-based hash functions (Gomes et al., 2007a; Ivrii et al., 2015; Ermon et al., 2014; Achlioptas and Theodoropoulos, 2017; Asteris and Dimakis, 2016), culminating in the result that one can use hash functions of the form $h(x) = Ax+b$ wherein each entry of the $m$-th row of $A$ is 1 with probability $\mathcal{O}(\frac{\log m}{m})$ (Meel and Akshay, 2020). Such XORs were shown to improve runtime efficiency. In this context, a natural direction would be to explore the use of sparse XORs in $F_0$ estimation.

Acknowledgements.
We thank the anonymous reviewers of PODS 21 for valuable comments. Bhattacharyya was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme [NRF-NRFFAI1-2019-0002] and an Amazon Research Award. Meel was supported in part by National Research Foundation Singapore under its NRF Fellowship Programme [NRF-NRFFAI1-2019-0004] and AI Singapore Programme [AISG-RP-2018-005], and NUS ODPRT Grant [R-252-000-685-13]. Vinod was supported in part by NSF CCF-184908 and NSF HDR:TRIPODS-1934884 awards. Pavan was supported in part by NSF CCF-1849053 and NSF HDR:TRIPODS-1934884 awards.

References

  • Abboud et al. (2019) Ralph Abboud, Ismail Ilkan Ceylan, and Thomas Lukasiewicz. 2019. Learning to Reason: Leveraging Neural Networks for Approximate DNF Counting. arXiv preprint arXiv:1904.02688 (2019).
  • Achlioptas and Theodoropoulos (2017) Dimitris Achlioptas and Panos Theodoropoulos. 2017. Probabilistic model counting with short XORs. In Proc. of SAT. Springer, 3–19.
  • Alon et al. (1999) Noga Alon, Yossi Matias, and Mario Szegedy. 1999. The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58, 1 (1999), 137–147.
  • Arackaparambil et al. (2009) Chrisil Arackaparambil, Joshua Brody, and Amit Chakrabarti. 2009. Functional monitoring without monotonicity. In International Colloquium on Automata, Languages, and Programming. Springer, 95–106.
  • Asteris and Dimakis (2016) Megasthenis Asteris and Alexandros G Dimakis. 2016. LDPC codes for discrete integration. Technical Report. Technical report, UT Austin.
  • Babcock and Olston (2003) Brian Babcock and Chris Olston. 2003. Distributed top-k monitoring. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data. 28–39.
  • Bar-Yossef et al. ([n. d.]) Ziv Bar-Yossef, T. S. Jayram, Ravi Kumar, D. Sivakumar, and Luca Trevisan. [n. d.]. Counting Distinct Elements in a Data Stream. In Proc. of RANDOM, Vol. 2483. 1–10.
  • Bar-Yossef et al. (2002) Ziv Bar-Yossef, Ravi Kumar, and D. Sivakumar. 2002. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. of SODA. ACM/SIAM, 623–632.
  • Braverman and Ostrovsky (2010) Vladimir Braverman and Rafail Ostrovsky. 2010. Recursive Sketching For Frequency Moments. CoRR abs/1011.2571 (2010).
  • Carter and Wegman (1977) J Lawrence Carter and Mark N Wegman. 1977. Universal classes of hash functions. In Proceedings of the ninth annual ACM symposium on Theory of computing. ACM, 106–112.
  • Chakraborty et al. (2015) Supratik Chakraborty, Dror Fried, Kuldeep S Meel, and Moshe Y Vardi. 2015. From weighted to unweighted model counting. In Proceedings of AAAI. 689–695.
  • Chakraborty et al. (2013a) S. Chakraborty, K. S. Meel, and M. Y. Vardi. 2013a. A Scalable and Nearly Uniform Generator of SAT Witnesses. In Proc. of CAV. 608–623.
  • Chakraborty et al. (2013b) S. Chakraborty, K. S. Meel, and M. Y. Vardi. 2013b. A Scalable Approximate Model Counter. In Proc. of CP. 200–216.
  • Chakraborty et al. (2016) S. Chakraborty, K. S. Meel, and M. Y. Vardi. 2016. Algorithmic Improvements in Approximate Counting for Probabilistic Inference: From Linear to Logarithmic SAT Calls. In Proc. of IJCAI.
  • Charikar et al. (2004) Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. 2004. Finding frequent items in data streams. Theor. Comput. Sci. 312, 1 (2004), 3–15.
  • Chistikov et al. (2015) Dmitry Chistikov, Rayna Dimitrova, and Rupak Majumdar. 2015. Approximate counting in SMT and value estimation for probabilistic programs. In Proc. of TACAS. Springer, 320–334.
  • Considine et al. (2004) Jeffrey Considine, Feifei Li, George Kollios, and John W. Byers. 2004. Approximate Aggregation Techniques for Sensor Databases. In Proc. of ICDE, Z. Meral Özsoyoglu and Stanley B. Zdonik (Eds.). IEEE Computer Society, 449–460.
  • Cormode et al. (2005) Graham Cormode, Minos Garofalakis, Shanmugavelayutham Muthukrishnan, and Rajeev Rastogi. 2005. Holistic aggregates in a networked world: Distributed tracking of approximate quantiles. In Proc. of SIGMOD. 25–36.
  • Cormode and Muthukrishnan (2003) Graham Cormode and S. Muthukrishnan. 2003. Estimating Dominance Norms of Multiple Data Streams. In Proc. of ESA (Lecture Notes in Computer Science), Giuseppe Di Battista and Uri Zwick (Eds.), Vol. 2832. Springer, 148–160.
  • Cormode et al. (2011) Graham Cormode, Shanmugavelayutham Muthukrishnan, and Ke Yi. 2011. Algorithms for distributed functional monitoring. ACM Transactions on Algorithms (TALG) 7, 2 (2011), 1–20.
  • Cormode et al. (2012) Graham Cormode, Shanmugavelayutham Muthukrishnan, Ke Yi, and Qin Zhang. 2012. Continuous sampling from distributed streams. Journal of the ACM (JACM) 59, 2 (2012), 1–25.
  • Dagum et al. (2000) P. Dagum, R. Karp, M. Luby, and S. Ross. 2000. An optimal algorithm for Monte Carlo estimation. SIAM Journal on computing 29, 5 (2000), 1484–1496.
  • Ermon et al. (2013) Stefano Ermon, Carla P. Gomes, Ashish Sabharwal, and Bart Selman. 2013. Taming the Curse of Dimensionality: Discrete Integration by Hashing and Optimization. In Proc. of ICML. 334–342.
  • Ermon et al. (2014) S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman. 2014. Low-density Parity Constraints for Hashing-Based Discrete Integration. In Proc. of ICML. 271–279.
  • Feng et al. (2018a) Weiming Feng, Thomas P Hayes, and Yitong Yin. 2018a. Distributed symmetry breaking in sampling (optimal distributed randomly coloring with fewer colors). arXiv preprint arXiv:1802.06953 (2018).
  • Feng et al. (2018b) Weiming Feng, Yuxin Sun, and Yitong Yin. 2018b. What can be sampled locally? Distributed Computing (2018), 1–27.
  • Feng and Yin (2018) Weiming Feng and Yitong Yin. 2018. On local distributed sampling and counting. In Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing. 189–198.
  • Fischer and Ghaffari (2018) Manuela Fischer and Mohsen Ghaffari. 2018. A Simple Parallel and Distributed Sampling Technique: Local Glauber Dynamics. In 32nd International Symposium on Distributed Computing.
  • Flajolet and Martin (1985) Philippe Flajolet and G. Nigel Martin. 1985. Probabilistic Counting Algorithms for Data Base Applications. J. Comput. Syst. Sci. 31, 2 (1985), 182–209.
  • Gibbons and Tirthapura (2001) Phillip B. Gibbons and Srikanta Tirthapura. 2001. Estimating simple functions on the union of data streams. In Proc. of SPAA, Arnold L. Rosenberg (Ed.). ACM, 281–291.
  • Gomes et al. (2007b) C.P. Gomes, A. Sabharwal, and B. Selman. 2007b. Near-Uniform sampling of combinatorial spaces using XOR constraints. In Proc. of NIPS. 670–676.
  • Gomes et al. (2007a) Carla P Gomes, Joerg Hoffmann, Ashish Sabharwal, and Bart Selman. 2007a. From Sampling to Model Counting.. In Proc. of IJCAI. 2293–2299.
  • Huang et al. (2012) Zengfeng Huang, Ke Yi, and Qin Zhang. 2012. Randomized algorithms for tracking distributed count, frequencies, and ranks. In Proc. of PODS. 295–306.
  • Indyk and Woodruff (2005) Piotr Indyk and David P. Woodruff. 2005. Optimal approximations of the frequency moments of data streams. In Proc. of STOC. ACM, 202–208.
  • Ivrii et al. (2015) Alexander Ivrii, Sharad Malik, Kuldeep S. Meel, and Moshe Y. Vardi. 2015. On computing minimal independent support and its applications to sampling and counting. Constraints (2015), 1–18. https://doi.org/10.1007/s10601-015-9204-z
  • Jerrum et al. (1986) M.R. Jerrum, L.G. Valiant, and V.V. Vazirani. 1986. Random generation of combinatorial structures from a uniform distribution. Theoretical Computer Science 43, 2-3 (1986), 169–188.
  • Kane et al. (2010) Daniel M. Kane, Jelani Nelson, and David P. Woodruff. 2010. An optimal algorithm for the distinct elements problem. In Proc. of PODS. ACM, 41–52.
  • Karp and Luby (1983) R.M. Karp and M. Luby. 1983. Monte-Carlo algorithms for enumeration and reliability problems. Proc. of FOCS (1983).
  • Karp et al. (1989) Richard M Karp, Michael Luby, and Neal Madras. 1989. Monte-Carlo approximation algorithms for enumeration problems. Journal of Algorithms 10, 3 (1989), 429 – 448. https://doi.org/10.1016/0196-6774(89)90038-2
  • Keralapura et al. (2006) Ram Keralapura, Graham Cormode, and Jeyashankher Ramamirtham. 2006. Communication-efficient distributed monitoring of thresholded counts. In Proc. of SIGMOD. 289–300.
  • Keren et al. (2011) Daniel Keren, Izchak Sharfman, Assaf Schuster, and Avishay Livne. 2011. Shape sensitive geometric monitoring. IEEE Transactions on Knowledge and Data Engineering 24, 8 (2011), 1520–1535.
  • Manjhi et al. (2005) Amit Manjhi, Vladislav Shkapenyuk, Kedar Dhamdhere, and Christopher Olston. 2005. Finding (recently) frequent items in distributed data streams. In Proc. of ICDE. IEEE, 767–778.
  • Meel and Akshay (2020) Kuldeep S. Meel and S. Akshay. 2020. Sparse Hashing for Scalable Approximate Model Counting: Theory and Practice. In Proc. of LICS.
  • Meel et al. (2017) Kuldeep S. Meel, Aditya A. Shrotri, and Moshe Y. Vardi. 2017. On Hashing-Based Approaches to Approximate DNF-Counting. In Proc. of FSTTCS.
  • Meel et al. (2018) Kuldeep S. Meel, Aditya A. Shrotri, and Moshe Y. Vardi. 2018. Not All FPRASs are Equal: Demystifying FPRASs for DNF-Counting. (12 2018).
  • Meel et al. (2019) Kuldeep S. Meel, Aditya A. Shrotri, and Moshe Y. Vardi. 2019. Not All FPRASs are Equal: Demystifying FPRASs for DNF-Counting (Extended Abstract). In Proc. of IJCAI.
  • Pavan and Tirthapura (2007) A. Pavan and Srikanta Tirthapura. 2007. Range-Efficient Counting of Distinct Elements in a Massive Data Stream. SIAM J. Comput. 37, 2 (2007), 359–379.
  • Ré and Suciu (2008) Christopher Ré and Dan Suciu. 2008. Approximate lineage for probabilistic databases. Proceedings of the VLDB Endowment 1, 1 (2008), 797–808.
  • Senellart (2018) Pierre Senellart. 2018. Provenance and probabilities in relational databases. ACM SIGMOD Record 46, 4 (2018), 5–15.
  • Senellart (2019) Pierre Senellart. 2019. Provenance in Databases: Principles and Applications. In Reasoning Web. Explainable Artificial Intelligence. Springer, 104–109.
  • Sharfman et al. (2010) Izchak Sharfman, Assaf Schuster, and Daniel Keren. 2010. A geometric approach to monitoring threshold functions over distributed data streams. In Ubiquitous knowledge discovery. Springer, 163–186.
  • Soos et al. (2020) Mate Soos, Stephan Gocht, and Kuldeep S. Meel. 2020. Tinted, Detached, and Lazy CNF-XOR solving and its Applications to Counting and Sampling. In Proceedings of International Conference on Computer-Aided Verification (CAV).
  • Soos and Meel (2019) Mate Soos and Kuldeep S. Meel. 2019. BIRD: Engineering an Efficient CNF-XOR SAT Solver and its Applications to Approximate Model Counting. In Proc. of AAAI.
  • Soos et al. (2009) Mate Soos, Karsten Nohl, and Claude Castelluccia. 2009. Extending SAT Solvers to Cryptographic Problems. In Proc. of SAT. 244–257.
  • Stockmeyer (1983) L. Stockmeyer. 1983. The complexity of approximate counting. In Proc. of STOC. 118–126.
  • Sun and Poon (2009) He Sun and Chung Keung Poon. 2009. Two improved range-efficient algorithms for F0 estimation. Theor. Comput. Sci. 410, 11 (2009), 1073–1080.
  • Meel et al. (2021) Kuldeep S. Meel, N. V. Vinodchandran, and Sourav Chakraborty. 2021. Estimating Size of Union of Sets in Streaming Model. In Proc. of PODS.
  • Tirthapura and Woodruff (2012) Srikanta Tirthapura and David P. Woodruff. 2012. Rectangle-efficient aggregation in spatial data streams. In Proc. of PODS. ACM, 283–294.
  • Valiant (1979) L.G. Valiant. 1979. The complexity of enumeration and reliability problems. SIAM J. Comput. 8, 3 (1979), 410–421.
  • Woodruff ([n. d.]) David Woodruff. [n. d.]. Personal communication.
  • Woodruff and Zhang (2012) David P. Woodruff and Qin Zhang. 2012. Tight bounds for distributed functional monitoring. In Proc. of STOC. 941–960.
  • Woodruff and Zhang (2017) David P Woodruff and Qin Zhang. 2017. When distributed computation is communication expensive. Distributed Computing 30, 5 (2017), 309–323.
  • Xu et al. (2008) Lin Xu, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. 2008. SATzilla: portfolio-based algorithm selection for SAT. Journal of Artificial Intelligence Research 32 (2008), 565–606.
  • Yi and Zhang (2013) Ke Yi and Qin Zhang. 2013. Optimal tracking of distributed heavy hitters and quantiles. Algorithmica 65, 1 (2013), 206–223.