Guaranteeing the Õ(AGM/OUT) Runtime for Uniform Sampling and Size Estimation over Joins
Abstract.
We propose a new method for estimating the number of answers OUT of a small join query Q in a large database D, and for uniform sampling over joins. Our method is the first to satisfy all of the following statements.
• Support arbitrary join queries, which can be either acyclic or cyclic, and contain binary and non-binary relations.
• Guarantee an arbitrarily small error with a high probability, always in Õ(AGM/OUT) time, where AGM is the AGM bound of Q (an upper bound of OUT), and Õ hides a polylogarithmic factor of the input size.
We also explain previous join size estimators in a unified framework. All methods including ours rely on certain indexes on the relations in D, which take linear time to build offline. Additionally, we extend our method using generalized hypertree decompositions (GHDs) to achieve a lower complexity than Õ(AGM/OUT) when OUT is small, and present optimization techniques for improving estimation efficiency and accuracy.
1. Introduction
The evaluation of join queries is one of the most fundamental problems in databases (Veldhuizen, 2012; Ngo et al., 2012, 2014; Li et al., 2016; Zhao et al., 2018; Aberger et al., 2017). In theory, reducing the time complexity of evaluation algorithms has been the main goal (Yannakakis, 1981; Ngo et al., 2014; Gottlob et al., 2016; Abo Khamis et al., 2017). This also applies to the counting problem (Joglekar et al., 2016), which we study in this paper. Given a join query Q and a database D, an answer of Q in D is a mapping from the free variables of Q to attribute values in D (or inversely as stated in (Focke et al., 2022)) constrained by the relations in D. For example, if a database of a social network has a binary relation R1 of friendship, where a tuple of R1 indicates that two people are friends, then finding all triples of people who are pairwise friends is equivalent to finding the answers of the triangle join query (R2 and R3 are renamed relations of R1):
(1)  Q△ = R1(A, B) ⋈ R2(B, C) ⋈ R3(A, C)
The complexity of evaluating join queries is commonly expressed as Õ(IN^w) where IN is the input size and w is a width of Q (see section 2 for the definition of IN and example widths). While acyclic queries can be evaluated with w = 1 (Yannakakis, 1981), cyclic queries have higher w values, i.e., they are more difficult to solve. If w is the fractional edge cover number ρ* (see Definition 1), IN^w becomes the AGM bound (Atserias et al., 2013), which is an upper bound of OUT. Instead of evaluating join queries, the counting problem is to compute OUT, the join size or the number of answers (e.g., the number of triples above). It has its own application for answering COUNT queries in DBMSs. Then, the complexity can be reduced from Õ(IN^w + OUT) to Õ(IN^w) (Joglekar et al., 2016).
The approximate counting problem is to approximate OUT. Approximate algorithms are used whenever exact counting is expensive and approximation is enough. In practice, a famous application is cost-based query optimization, which requires efficient approximation of thousands of sub-queries (Leis et al., 2015). In theory, going beyond the polynomial-time algorithms, approximate algorithms established a complexity inversely proportional to OUT, under the requirement of guaranteeing an arbitrarily small error with a high probability (see section 2 for formal definitions). In a limited setting where every relation in Q is binary and identical (e.g., as in (1)), the complexity has been reduced to Õ(AGM/OUT) (Assadi et al., 2019; Fichtenberger et al., 2020). However, in a general setting where Q contains higher-arity relations, the complexity has been reduced to Õ(IN·AGM/OUT) or Õ(IN + AGM/OUT) only (Chen and Yi, 2020), i.e., with an additional multiplicative or additive factor of IN. Furthermore, all these methods consider w = ρ* only, and are not extended to other widths such as the fractional hypertree width (see Definition 3), which can be smaller than ρ*. In this paper, we propose a new method that achieves Õ(AGM/OUT) complexity for Q containing any-arity relations, under the same w = ρ*. As a corollary, ours achieves a linear or even constant complexity for a sufficiently large OUT. We also extend our method to incorporate the fractional hypertree width by using generalized hypertree decompositions (GHDs), achieving a lower complexity than Õ(AGM/OUT) when OUT is small.
A closely related problem to approximate counting is uniform sampling, which has its own important applications such as training set generation for machine learning (Zhao et al., 2018). Our method also supports uniform sampling over joins. We define the problems and technical concepts in section 2, explain related work in section 3, and give an overview of our results in section 4. We present our method of size estimation and uniform sampling over joins in section 5, including a unified framework of previous work, extend our method using GHDs in section 6, propose optimization techniques in section 7, and discuss future work in section 8. More details of existing algorithms and the extension of our algorithms to join-project (a.k.a. conjunctive) queries are presented in the Appendix.
2. Technical Background
Hypergraph. A join query Q is commonly represented as a hypergraph H = (V, E), where the hypernodes V are the set of variables in Q and E is the set of hyperedges (Ngo et al., 2014). Each hyperedge e ∈ E is a subset of V and specifies the relation R_e (or R for brevity) in D. For example, the query in (1) has V = {A, B, C} and E = {{A, B}, {B, C}, {A, C}}. For any I ⊆ V, we define two notations used throughout the paper: E_I = {e ∈ E : e ∩ I ≠ ∅} and E|I = {e ∩ I : e ∈ E_I}, i.e., the set of hyperedges in E that intersect with I and the projection of each such edge onto I.
In relational algebra, Q is equivalent to ⋈_{e ∈ E} R_e. Here, join (⋈) is a commutative binary operator between relations, allowing us to use the multi-way form ⋈_{e ∈ E}. Table 1 summarizes the operators used in this paper.
⋈ | natural join operator
⋉ | semi-join operator
π | projection operator
σ | selection operator
∥ | list concatenation operator
Since all variables in V are free (output) variables in join queries, join queries correspond to full conjunctive queries. For the more general class of join-project (a.k.a. conjunctive) queries, the set of output variables is a subset of V. We assume Q is a join query and extend to join-project queries in appendix C.
Input and output size. We denote IN = Σ_{e ∈ E} |R_e| as the input size (|R_e| is the number of tuples in R_e) and OUT = |Q(D)|, the number of query answers, as the output size of Q (or H) in D. We drop Q and D if the context is clear, as well as in upcoming notations.
Complexity of join. The time complexity of join evaluation algorithms is commonly expressed as Õ(IN^w) where w is a width of Q, and Õ hides a polylogarithmic factor of IN (Ngo et al., 2014). We assume that the query size is extremely small compared to IN and regard |V| and |E| as constants as in (Ngo et al., 2014; Assadi et al., 2019; Chen and Yi, 2020). Worst-case optimal (WCO) algorithms such as GenericJoin (Ngo et al., 2014) (Algorithm 4) achieve w = ρ*, called the fractional edge cover number:
Definition 1.
(Ngo et al., 2014): The fractional edge cover number ρ* is the optimal value of the linear program (LP): minimize Σ_{e ∈ E} x_e · log_IN |R_e| s.t. Σ_{e: v ∈ e} x_e ≥ 1 for every v ∈ V and x_e ≥ 0 for every e ∈ E.
It is well known that ∏_{e ∈ E} |R_e|^{x_e} is an upper bound of OUT for any fractional edge cover {x_e} satisfying the constraints of the LP (Atserias et al., 2013; Friedgut and Kahn, 1998). The optimal value, ∏_{e ∈ E} |R_e|^{x_e} at the optimum (equivalently IN^{ρ*}), is called the AGM bound of Q on D, denoted as AGM(Q) (Atserias et al., 2013). For example, the query in (1) has ρ* = 3/2 from the optimal solution x_e = 1/2 for every e ∈ E of the LP in Definition 1. Therefore, its AGM bound is (|R1|·|R2|·|R3|)^{1/2}.
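As a quick illustration, the AGM inequality for the triangle query can be checked numerically: with the cover x_e = 1/2 for all three edges, the bound is (|R1|·|R2|·|R3|)^{1/2}. The sketch below (illustrative Python; the random relations are made up) brute-forces OUT and verifies it never exceeds the bound:

```python
import itertools, math, random

random.seed(0)
dom = range(8)
# three random binary relations over attributes (A,B), (B,C), (A,C)
R1 = {(a, b) for a in dom for b in dom if random.random() < 0.3}
R2 = {(b, c) for b in dom for c in dom if random.random() < 0.3}
R3 = {(a, c) for a in dom for c in dom if random.random() < 0.3}

# brute-force join size: OUT = number of triangles
OUT = sum((a, b) in R1 and (b, c) in R2 and (a, c) in R3
          for a, b, c in itertools.product(dom, repeat=3))

# AGM bound under the optimal fractional edge cover x_e = 1/2 per edge
AGM = math.sqrt(len(R1) * len(R2) * len(R3))
assert OUT <= AGM
```

The same inequality holds for any fractional edge cover, not only the optimal one; the optimum merely gives the tightest such bound.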
Generalized hypertree decomposition. As a special case, the Yannakakis algorithm (Yannakakis, 1981) achieves w = 1 if Q is acyclic. It builds a join tree by assigning each hyperedge to a tree node and performs dynamic programming (DP)-like join over the nodes’ outputs (i.e., the R_e’s) in a bottom-up manner.
To bridge the gap between 1 and ρ* when Q is cyclic, generalized hypertree decompositions (GHDs, see Definition 2) (Gottlob et al., 2002) extend the tree for arbitrary joins by assigning multiple hyperedges to a tree node, where the DP-like join is performed over these nodes’ outputs (i.e., answers of the corresponding sub-queries) computed by WCO algorithms. This way, w decreases from ρ* to the fractional hypertree width (see Definition 3). We refer to a comprehensive survey (Ngo et al., 2014) for more details about these concepts and appendix A for our explanations of algorithms.
Definition 2.
(Gottlob et al., 2002): A GHD of H = (V, E) is a pair (T, χ) where 1) T is a tree, 2) χ maps each node t of T to a subset of V, 3) for each e ∈ E, there exists a node t s.t. e ⊆ χ(t), and 4) for each v ∈ V, the set of nodes {t : v ∈ χ(t)} forms a non-empty connected subtree of T.
Here, χ(t) is called the bag of t, and 4) is called the running intersection property. Each node t corresponds to a sub-hypergraph H_t of H induced by the bag χ(t). We then define the fractional hypertree width (Grohe and Marx, 2014).
Definition 3.
(Grohe and Marx, 2014): The fractional hypertree width 1) of a GHD (T, χ) of H, fhw(T, χ), is defined as max_{t ∈ T} ρ*(H_t), and 2) of H, fhw(H), is defined as the minimum of fhw(T, χ) over all GHDs (T, χ) of H.
Given a GHD (T, χ), GHDJoin (Joglekar et al., 2016) performs GenericJoin for each H_t and Yannakakis (Yannakakis, 1981) on these results along the join tree (see Algorithms 5-7 in section A.2-A.3). Since Yannakakis runs in linear, Õ(IN + OUT), time for acyclic join queries (Brault-Baron, 2016), the total runtime of GHDJoin is Õ(IN^fhw + OUT), since the input size for Yannakakis is Õ(IN^fhw) after GenericJoin. It is sufficient to execute a brute-force algorithm to find an fhw-optimal GHD since the number of possible GHDs is bounded by the query size.
Counting. For the counting problem, the complexity is reduced to Õ(IN^fhw) without the +OUT term. This is a direct result of Joglekar et al. (Joglekar et al., 2016), who solve the more general problem of evaluating aggregation queries (e.g., count, sum, max). We explain this in detail in section 6.1.
Approximate counting. Approximate counting achieves even smaller complexities. The primary goal of approximate counting, especially using sampling, is to obtain an arbitrarily small error with a high probability, formally defined as (Assadi et al., 2019; Chen and Yi, 2020):
Definition 4.
(Assadi et al., 2019): For a given query, error bound ε, and probability threshold δ, if Pr[|Z − OUT| ≤ ε·OUT] ≥ 1 − δ for a random variable Z approximating OUT, then Z is an (ε, δ)-approximation of OUT.
The challenge is how far we can reduce the complexity while achieving the above goal. Assume that we approximate OUT using an unbiased estimator with a random variable Z, i.e., E[Z] = OUT, where E[Z] is the expectation of Z. Assume that the variance of Z is bounded and an upper bound V_ub is known, and the time complexity of instantiating (or computing) Z is upper-bounded by T. Then, we can trade off the approximation accuracy and efficiency by controlling the number of runs k for sampling. Formally, let Z̄ = (1/k)·Σ_{i ∈ [k]} Z_i, where every Z_i (i ∈ [k]) is an equivalent random variable to Z. Then, E[Z̄] = OUT and Var[Z̄] = Var[Z]/k. Hence, by setting k = V_ub/(δ·ε²·OUT²), we can achieve the following from Chebyshev’s inequality:
(2)  Pr[|Z̄ − OUT| ≥ ε·OUT] ≤ Var[Z̄]/(ε²·OUT²) = Var[Z]/(k·ε²·OUT²) ≤ δ
Since the complexity of computing each Z_i is bounded by T, the complexity of computing Z̄ is bounded by k·T = V_ub·T/(δ·ε²·OUT²). Therefore, the key challenge is to implement an unbiased estimator with V_ub·T as small as possible. ε and δ are regarded as constants as in (Assadi et al., 2019; Chen and Yi, 2020). One can easily replace the 1/δ factor in k with log(1/δ) using the median trick (Jerrum et al., 1986).
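The averaging argument can be sketched in a few lines. The toy estimator below (hypothetical, not any of the paper's algorithms) estimates the number OUT of ones in an array of size n: a single draw n·x[i] is unbiased with Var[Z] ≤ n·OUT, so averaging k = V_ub/(δ·ε²·OUT²) draws satisfies (2). OUT is used here only to size k; section 5.5 explains how a geometric search removes this assumption:

```python
import random
random.seed(1)

n = 10_000
x = [1 if random.random() < 0.2 else 0 for _ in range(n)]
OUT = sum(x)                    # ground truth, used here only to size k

def one_run():
    # unbiased single-run estimator Z: E[n * x[i]] = OUT
    return n * x[random.randrange(n)]

eps, delta = 0.1, 0.1
V_ub = n * OUT                  # Var[Z] = n*OUT - OUT^2 <= n*OUT
k = int(V_ub / (delta * eps**2 * OUT**2)) + 1
Z_bar = sum(one_run() for _ in range(k)) / k
# Chebyshev: Pr[|Z_bar - OUT| >= eps*OUT] <= Var[Z]/(k * eps^2 * OUT^2) <= delta
print(Z_bar, OUT)
```

Note how the required k shrinks quadratically as OUT grows, which is the root of the inverse dependence on OUT throughout the paper.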
Uniform sampling. If every query answer has the same probability p to be sampled, then the sampler is a uniform sampler (Chen and Yi, 2020; Fichtenberger et al., 2020). Since there are OUT answers, p can be at most 1/OUT. If p < 1/OUT, the sampling has a probability to fail (to sample any of the query answers), which is 1 − p·OUT.
For acyclic queries, uniform sampling can be done in Õ(1) time after Õ(IN)-time preprocessing (Zhao et al., 2018), by additionally storing the intermediate join sizes of tuples during the bottom-up DP in Yannakakis (see section A.2 for more details).
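A minimal sketch of this idea for a two-relation path join R(A,B) ⋈ S(B,C) (illustrative; the relation contents are made up): a bottom-up pass stores, for each R-tuple, how many answers extend it, and sampling proportionally to these counts followed by a uniform extension yields a uniform answer:

```python
import random
from collections import defaultdict
random.seed(2)

# acyclic (path) join R(A,B) |><| S(B,C) with skewed degrees on B
R = [(a, b) for a in range(4) for b in range(3)]
S = [(b, c) for b in range(3) for c in range(b + 1)]

# bottom-up pass: cnt[b] = number of S-tuples extending value b
cnt = defaultdict(int)
for b, c in S:
    cnt[b] += 1

weights = [cnt[b] for (a, b) in R]   # answers below each R-tuple
OUT = sum(weights)                   # exact join size, for free

def sample_answer():
    # pick an R-tuple proportional to its extension count, then extend uniformly
    (a, b), = random.choices(R, weights=weights, k=1)
    c = random.choice([c2 for (b2, c2) in S if b2 == b])
    return (a, b, c)
```

Every answer (a, b, c) is returned with probability (cnt[b]/OUT)·(1/cnt[b]) = 1/OUT, i.e., uniformly.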
3. Related Work
In the context of approximate counting (and uniform sampling, which is closely related), there have been two lines of work: 1) determining the classes of tractable and intractable queries and 2) reducing the degree of the polynomial-time complexity for tractable cases, especially for join or join-project queries.
The first class of work widens the class of queries that admit fully polynomial-time randomized approximation scheme (FPRAS) and fully polynomial-time almost uniform sampler (FPAUS) (Arenas et al., 2021). Arenas et al. (Arenas et al., 2021) showed that every class of conjunctive queries with bounded hypertree width admits FPRAS and FPAUS. Focke et al. (Focke et al., 2022) extended this to conjunctive queries with disequalities and negations, under some relaxations from FPRAS. These works focus on showing the existence of a polynomial-time algorithm over a wider class of queries, but not on reducing the exact degree of the polynomial complexity.
The second class of work (Assadi et al., 2019; Fichtenberger et al., 2020; Chen and Yi, 2020; Bera and Chakrabarti, 2017; Aliakbarpour et al., 2018; Eden et al., 2020) focuses on reducing the degree for a specific range of queries, and even further on making the complexity inversely proportional to OUT. However, they all consider w = ρ* only, and the Õ(AGM/OUT) complexity has been achieved for limited classes of queries. Assadi et al. (Assadi et al., 2019) proposed a Simple Sublinear-Time Estimator (SSTE for short) with the complexity of Õ(AGM/OUT) for unlabeled graph queries, i.e., where Q contains identical binary relations only as in the query in (1). However, we found that a missing factor in the proof by Assadi et al. (Assadi et al., 2019) prevented SSTE from satisfying the bound and thus notified the authors to modify the proof (see appendix E). They also proposed some conjectures on labeled graphs (i.e., removing the identity assumption), but 1) no concrete algorithms were given and 2) the proposed bound for labeled graphs is larger than Õ(AGM/OUT) (see section D.2). Therefore, it is not known whether these methods can be extended to labeled graphs within Õ(AGM/OUT) time. Fichtenberger et al. (Fichtenberger et al., 2020) extended SSTE to sampling Subgraphs Uniformly in Sublinear Time (SUST for short) for uniform sampling while achieving the same complexity. Here, they said sublinear under the assumption that OUT is sufficiently large; unlike join processing, it is then unnecessary to scan the inputs for each query, which would require Ω(IN) time.
Kim et al. (Kim et al., 2021) proposed Alley, a hybrid method that combines sampling and synopsis. We analyze the sampling part of Alley since analyzing the synopsis is out of scope. Alley solves the approximate subgraph counting problem (but not uniform sampling) for labeled graphs, where the relations in Q are binary but not necessarily identical. Its complexity, however, is higher than Õ(AGM/OUT); we explain why in section 5.
For a more general setting where relations in Q can have higher arities, Chen & Yi (Chen and Yi, 2020) proposed GJ-Sample that also performs uniform sampling, and achieved Õ(IN·AGM/max(1, OUT)) complexity. The additional factor of IN compared to SSTE and SUST results from computing sample probabilities (or weights) prior to sampling, where the number of candidates is O(IN). We explain this in detail in section 5. Therefore, they do not achieve Õ(AGM/OUT) complexity for arbitrary join queries and leave it as an open problem (Chen and Yi, 2020). Note that it is important to remove the IN factor, for IN can be higher than AGM/OUT when relation sizes are highly skewed. For example, for a triangle query where one relation has size Θ(IN) and the other two are much smaller, AGM/OUT can be far smaller than IN.
We notice that Deng et al. (Deng et al., 2023) have independently pursued the same primary objective as ours, which is to attain Õ(AGM/OUT) for uniform sampling and size estimation over joins. Both papers have been accepted to this PODS. The authors recursively 1) partition the input space into a constant number of sub-spaces and 2) sample a sub-space, to reduce the number of sample candidates from O(IN) to a constant. Our approach differs from theirs since we do not compute probabilities for candidates prior to sampling nor modify the set of candidates. Instead, we exploit known probability distributions to sample a candidate.
4. Our Results
We first analyze existing sampling-based approximation algorithms in a new unified framework in section 5. We present a new algorithm that bounds Var[Z]·T by Õ(AGM·OUT) for arbitrary join queries for the first time, achieving the complexity of Õ(AGM/OUT) (see section 2). To avoid the pre-sampling overheads in GJ-Sample, we propose degree-based rejection sampling (DRS) that first samples a sample space from a meta sample space and then samples a value from the sampled space, where the value can be rejected based on its degree. We explain the details in section 5. The results of Chen & Yi then directly hold for DRS. The removed overhead reduces the complexity down to Õ(AGM/OUT) for arbitrary joins. We also extend DRS using GHDs in section 6. The following theorems state our results, which we prove in the subsequent sections.
Theorem 1.
DRS performs (ε, δ)-approximation of OUT in Õ(AGM/OUT) time.
Theorem 2.
If OUT is small enough, DRS with GHDs can perform (ε, δ)-approximation of OUT with a lower complexity than Õ(AGM/OUT). A sufficient condition on the magnitude of OUT is given in section 6.
Additionally, we extend Alley to support arbitrary join queries in section 5 and a core lemma of SSTE/SUST to hold for labeled graphs in appendix D. We discuss our extension to join-project (a.k.a. conjunctive) queries in appendix C and the extension of the Õ(1)-time uniform sampler (Zhao et al., 2018) in section 2 to support cyclic queries using our framework in section A.2.
5. Achieving the Õ(AGM/OUT) Bound
5.1. Generic Sampling-based Framework
We can classify existing sampling-based methods into three groups: variable-at-a-time (a.k.a. vertex-at-a-time) (Kim et al., 2021; Chen and Yi, 2020), edge-at-a-time (Li et al., 2016), and component-at-a-time (Assadi et al., 2019; Fichtenberger et al., 2020). The first/second group samples a variable/hyperedge at a time until every variable/hyperedge in Q is sampled, while the third group samples a component, either a cycle with an odd number of edges or a star, at a time, having a larger sample granularity than the other two groups.
To analyze these groups in a unified manner, we propose a powerful sampling-based estimation framework GenericCardEst (Algorithm 1). It returns an unbiased estimate of the output size of the residual query on the residual hypergraph, given a current sample s which is initially empty.
An invariant is that s is an answer to the sub-query of Q on the bound variables sampled so far. This is key to proving Lemma 1. Due to space limitations, we prove all lemmas and propositions in appendix E.
Lemma 1.
GenericCardEst returns an unbiased estimate of the residual output size for any current sample s and its residual hypergraph.
We now explain each line of Algorithm 1. If there is no variable left to sample, just return one in Lines 1-1. Otherwise, Line 1 takes a subset I of the remaining variables to sample at this step. I is 1) a singleton set for variable-at-a-time methods, 2) an edge or its projection for edge-at-a-time methods, or 3) a part of a component for component-at-a-time methods. Line 1 defines the sample space Ω and the sample size. Line 1 computes P, a probability distribution over Ω ∪ {⊥}; ⊥ is a null tuple that does not belong to any join answer. Line 1 samples a set S of tuples from Ω according to P. Line 1 optionally rejects samples in S. Here, S, P, and the sample size are adjusted according to the accepted samples; P is multiplied by the probability of accepting each sample. The recursive call to GenericCardEst in Line 1 returns an unbiased estimate of the residual query. The unbiasedness of the final return value at Line 1 is guaranteed by the inverse probability factor 1/P, which is the key idea of the Horvitz-Thompson estimator (Horvitz and Thompson, 1952). 1[·] denotes the indicator, which is 1 if its argument is true and 0 otherwise.
5.2. Query Model
GenericCardEst relies on a certain query model used in previous WCO join and sampling algorithms (Ngo et al., 2014; Veldhuizen, 2012; Ngo et al., 2012; Assadi et al., 2019; Chen and Yi, 2020), following the property testing model (Goldreich, 2017). To avoid any confusion with the join query Q, we refer to the queries here as operations. For a relation R, the query model allows the sample spaces used in GenericCardEst to be readily obtained in Õ(1) time. We explain the underlying index structures that enable this later in section 5.6 for brevity.
The same indexes provide the following Õ(1)-time operations as well. Here, R can be a projection/selection of another relation, which is also a relation.
• Degree(R, t): the number of tuples in R that match a tuple t, for a relation R and a tuple t
• Access(R, I, i): the i-th element of π_I(R), for a relation R and attributes I
• Exist(R, t): test whether or not a row t exists in R
• Sample(R): uniformly sample a tuple from R
Eden et al. (Eden et al., 2017) have developed their algorithms using operations on binary tables. The above four operations are generalizations of theirs to n-ary tables.
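As an illustration, all four operations can be served in logarithmic time by a sorted-array index per attribute order, using binary search over tuple prefixes. This is a simplified stand-in for the trie-like indexes of section 5.6 (assumes numeric attributes; the class name is ours):

```python
import bisect, random

class RelationIndex:
    """Sorted-array index over an n-ary relation; serves Degree/Access/
    Exist/Sample for any prefix t of the attribute order in O(log N)."""
    def __init__(self, tuples):
        self.rows = sorted(tuples)

    def _range(self, t):
        # half-open range of rows whose prefix equals t (t = () matches all)
        lo = bisect.bisect_left(self.rows, t)
        hi = bisect.bisect_left(self.rows, t + (float("inf"),))
        return lo, hi

    def degree(self, t):            # number of rows matching prefix t
        lo, hi = self._range(t)
        return hi - lo

    def access(self, t, i):         # i-th row matching prefix t
        lo, hi = self._range(t)
        return self.rows[lo + i]

    def exist(self, t):             # does any row match prefix t?
        lo, hi = self._range(t)
        return lo < hi

    def sample(self, t):            # uniform row among those matching prefix t
        lo, hi = self._range(t)
        return self.rows[random.randrange(lo, hi)]
```

One such index supports only prefixes of its attribute order, which is exactly why section 5.6 builds one index per attribute order.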
5.3. State-of-the-art Sampling-based Estimators
We explain state-of-the-art sampling-based estimators as instances of GenericCardEst.
WanderJoin (Li et al., 2016) is a practical edge-at-a-time method. The edges in E are first ordered as e_1, e_2, and so on. At the i-th step, I is the set of not-yet-sampled variables of e_i, Ω is the set of tuples of R_{e_i} matching the current sample, and one tuple is uniformly sampled from Ω. Note that Ω can reduce to ∅ before proceeding to the last edge, e.g., for the triangle query. However, (Li et al., 2016) proceeds first with sampling for every edge and then checks the join predicates, resulting in unnecessary computations.
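A minimal WanderJoin-style walk for the triangle query (1) might look as follows (illustrative, deterministically generated relations; the product of inverse sampling probabilities makes each walk an unbiased estimate of OUT):

```python
import random
from collections import defaultdict
random.seed(3)

n = 6
R1 = [(a, b) for a in range(n) for b in range(n) if (a + b) % 2 == 0]
R2 = [(b, c) for b in range(n) for c in range(n) if (b + c) % 2 == 0]
R3 = {(a, c) for a in range(n) for c in range(n) if (a + c) % 3 != 0}

adj2 = defaultdict(list)             # extensions through R2: b -> [c]
for b, c in R2:
    adj2[b].append(c)

def wander_join():
    # random walk R1 -> R2 -> R3 in a fixed edge order, tracking 1/probability
    a, b = random.choice(R1)         # sampled with p = 1/|R1|
    ext = adj2[b]
    if not ext:                      # walk got stuck: contributes 0
        return 0
    c = random.choice(ext)           # sampled with p = 1/|ext|
    return len(R1) * len(ext) * ((a, c) in R3)

runs = 20000
est = sum(wander_join() for _ in range(runs)) / runs
```

The last factor shows the wasted work the text mentions: the walk samples through R1 and R2 first and only then checks the R3 predicate.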
Alley+. Alley (Kim et al., 2021) is a variable-at-a-time method (i.e., I is a singleton) and assumes that every relation is binary. We here explain Alley+, our extension of Alley to n-ary relations: Ω is the intersection of the matching projections of the relations in E_I, and a fixed portion b ∈ (0, 1] of Ω is sampled. Here, the sampling is done without replacement. Due to the heavy intersection (∩) operation, the runtime of Alley+ exceeds Õ(AGM/OUT) (Proposition 6). At the extreme case of b = 1, Alley+ becomes an instance of GenericJoin.
SSTE (Assadi et al., 2019) and SUST (Fichtenberger et al., 2020) are component-at-a-time methods. Assuming that every relation is binary and identical, Lemma 8 in appendix D decomposes Q into a set of components, where each component is either an odd cycle or a star. SSTE and SUST slightly differ in sampling odd cycles and stars, and SUST performs uniform sampling over joins. We explain the details in appendix D due to space limitations.
GJ-Sample (Chen and Yi, 2020) is a variable-at-a-time method (i.e., I is a singleton). Each value is sampled with probability proportional to an upper bound of its residual join size, derived from a fractional edge cover of the original query (Chen and Yi, 2020). This avoids the intersection operation in Alley+, allowing GJ-Sample to achieve Õ(1) runtime per sampling step (Chen and Yi, 2020), and the per-step bounds compose over the recursion due to Lemma 7 in appendix B.
More importantly, this choice of sampling probabilities enables GJ-Sample to perform uniform sampling over the answers, with the same probability for each answer of Q. Let s be a final sample that reaches Line 1, and consider the series of residual hypergraphs at Line 1. Multiplying the per-step probabilities gives
(3)  P[s] = ∏_i P[s_i | s_1, ..., s_{i−1}] = 1/AGM(Q)
Since P[s] = 1/AGM(Q), a call to GenericCardEst succeeds in sampling some answer of Q with probability OUT/AGM(Q) and fails with probability 1 − OUT/AGM(Q). Note that if the per-step bound is set to the exact residual join size, every call to GenericCardEst will succeed. However, computing the exact residual size for every candidate is intractable, which is the reason for GJ-Sample to use upper bounds that can be computed in Õ(1) time.
However, computing the sampling weights at Line 1 takes Õ(IN) time. This results in an IN factor in GJ-Sample’s runtime, Õ(IN·AGM/max(1, OUT)) (Chen and Yi, 2020). In order to remove this IN factor, Chen & Yi (Chen and Yi, 2020) separated out sequenceable queries, where all weights required in the sampling procedure can be precomputed in Õ(IN) time. This results in Õ(IN + AGM/max(1, OUT)) runtime for sequenceable queries and still Õ(IN·AGM/max(1, OUT)) for non-sequenceable queries. Determining whether a query is sequenceable or not requires a brute-force algorithm whose cost depends on the query size only, since the search space is bounded by the query size (Chen and Yi, 2020).
5.4. Beyond the State of the Art with Degree-based Rejection Sampling
We propose degree-based rejection sampling (DRS) to avoid computing the per-candidate weights in GJ-Sample while achieving a similar variance bound, only a constant factor smaller. While we set I as in GJ-Sample, we sample an edge e (a counterpart of the weight-minimizing edge in GJ-Sample) uniformly from E_I and let Ω be the matching projection of R_e onto I. Therefore, E_I is our meta sample space. For each candidate value and each edge e' ∈ E_I, we define the relative degree of the value in e' as the fraction of matching tuples of R_{e'} that contain the value.
In order to sample in Line 1, we 1) uniformly sample a row from Ω and 2) let its projection onto I be the candidate value in Line 1, without computing weights for every candidate. Then, the relative degree of the candidate in every e' ∈ E_I is computed, and the candidate is rejected if the sampled edge e does not attain its maximum relative degree (Line 1). We first assume that exactly one edge has a higher relative degree than all other edges in E_I. Then, the probability of each accepted candidate is multiplied by |E_I| at Line 1, compensating for the uniform choice of e. Line 1 further uses a keeping probability (justified by Lemma 2) to make the final probability of sampling each answer uniform.
Lemma 2.
if .
From (3), any final sample that reaches Line 1 has the same probability of being sampled, indicating a uniform sampling over the answers. Note that the added factor is regarded as a constant since it depends on the query size only. If we break our initial assumption so that multiple edges in E_I can have the same maximum relative degree, the above probability will increase by up to |E_I| times. Therefore, we simply decrease the keeping probability by |E_I| times to preserve uniformity.
Example 1.
Let and be two arbitrary numbers. Let be a binary relation and is obtained by renaming to in . Suppose we have a join query . For ease of explanation, we use a hyperedge and its corresponding relation interchangeably. Then, we have an optimal fractional edge cover of as . We choose , , and in turn for DRS. For and , , so we sample a relation from . Assume that is sampled, and then a row is sampled from . Then, and , so is not rejected. Since and tie w.r.t. , we keep with probability . Then, . Next, for and , , and for any . Then, . Similarly, for . In total, a final sample that reaches Line 1 of GenericCardEst has .
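The acceptance mechanism can be illustrated on a toy instance: to sample a value v with probability proportional to its maximum relative degree across two multisets, uniformly pick a multiset, uniformly pick a value from it, and keep it only if the picked multiset attains v's maximum relative degree (down-weighting ties). The sketch below (not the full DRS over joins; the data is made up) checks the resulting acceptance probabilities empirically:

```python
import random
from collections import Counter
random.seed(4)

L = [["a", "a", "b", "c"], ["a", "b", "b", "b"]]   # two "edges" (multisets)
freq = [Counter(Li) for Li in L]

def reldeg(i, v):                     # relative degree of value v in edge i
    return freq[i][v] / len(L[i])

def drs_trial():
    i = random.randrange(len(L))      # 1) uniformly sample an edge (meta space)
    v = random.choice(L[i])           # 2) uniformly sample a value from it
    degs = [reldeg(j, v) for j in range(len(L))]
    m = max(degs)
    if reldeg(i, v) < m:              # reject: another edge dominates v
        return None
    if random.random() >= 1 / degs.count(m):   # down-weight ties
        return None
    return v                          # accepted with prob max-reldeg(v) / |L|

hits = Counter(v for _ in range(40000) if (v := drs_trial()) is not None)
```

Here "a", "b", and "c" should be accepted with probabilities 0.25, 0.375, and 0.125 respectively, i.e., proportional to their maximum relative degrees 0.5, 0.75, and 0.25, without ever enumerating per-value weights up front.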
5.5. Unified Analysis
This section analyzes the bounds of variance and runtime (V_ub and T in section 2) of sampling-based estimators in a unified manner. Note that Lemma 1 already states the unbiasedness of the estimators. Theorem 1 can be proved from Propositions 4 and 8; Var[Z]·T is Õ(AGM·OUT).
We first define two random variables, 1) Z for the output of GenericCardEst given s = ∅ and 2) Z̄ for our final estimate, the mean of k outputs. Here, the Z_i’s are independent and identical to Z, and k is the number of initial calls to GenericCardEst with s = ∅. We also define a random variable for the number of core operations (including the four operations in section 5.2) in GenericCardEst and another for the total runtime of obtaining Z̄; the latter is proportional to k times the former.
Proposition 1.
for SSTE and SUST where is the number of odd cycles and stars in .
Proposition 2.
for Alley+ where .
The upper bound for Alley+ explains its unique property: the variance approaches 0 as the sampling ratio b approaches 1. In fact, this bound is tighter than the original bound proved in (Kim et al., 2021).
Proposition 3.
for GJ-Sample.
Proposition 4.
Var[Z] is O(AGM·OUT) for DRS. Note that the resulting number of runs k = V_ub/(δ·ε²·OUT²) is then Õ(AGM/OUT).
Proposition 5.
is for SSTE.
Proposition 6.
is for Alley+.
Proposition 7.
For GJ-Sample, T is Õ(1) for sequenceable queries and Õ(IN) for non-sequenceable queries.
Proposition 8.
T is Õ(1) for SUST and DRS.
Until now, we have assumed that OUT is given in computing k before invoking GenericCardEst, which is unrealistic since our goal is to estimate OUT itself. To tackle this, we use the geometric search by Assadi et al. (Assadi et al., 2019). They first assume that OUT is as large as AGM and run the estimation with a small number of runs k (Assadi et al., 2019). Then, they perform a geometric search on OUT: assume a smaller OUT (halving the assumed OUT), increase k, and run the estimation again. They repeat this until the assumed OUT becomes consistent with the algorithm output. While we implicitly assume that OUT ≥ 1, we can detect the case OUT = 0 in real applications whenever the assumed OUT falls below 1, in Õ(AGM) total time.
This geometric search adds only a constant factor (= 4) to the bound of the total number of runs, not the additional factor explained by Chen & Yi (Chen and Yi, 2020). Starting with the assumption OUT = AGM and halving it until the search stops, the numbers of runs per round form a geometric series dominated by the last round, i.e., only a constant factor (= 4) is introduced.
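The doubling argument can be sketched with the toy estimator from section 2's setting (hypothetical data; the loop halves the assumed OUT and doubles the work until the estimate is consistent, so the total work stays within a constant factor of the final round):

```python
import random
random.seed(5)

n = 2048
x = [1] * 100 + [0] * (n - 100)      # true OUT = 100, unknown to the search
random.shuffle(x)

def estimate(k):
    # mean of k unbiased one-run estimates n * x[i]
    return sum(n * x[random.randrange(n)] for _ in range(k)) / k

guess = n                            # start from the largest possible OUT
while guess >= 1:
    k = max(1, 4 * n // guess)       # number of runs inversely prop. to guess
    est = estimate(k)
    if est >= guess:                 # estimate consistent with the assumption
        break
    guess //= 2
```

Since the per-round cost doubles with each halving, the sum over all rounds is at most a constant multiple of the cost of the last round.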
Since GJ-Sample can perform uniform sampling, it instead uses a simpler approach (Chen and Yi, 2020), by repeating the sampling until a constant number of trials succeed. Then, the number of total trials becomes a random variable, having O(AGM/max(1, OUT)) as its expectation (Chen and Yi, 2020). Therefore, the runtime is Õ(IN + AGM/max(1, OUT)) for sequenceable queries and Õ(IN·AGM/max(1, OUT)) for non-sequenceable queries. Using the same approach, DRS requires O(AGM/max(1, OUT)) trials in expectation and thus runs in Õ(AGM/max(1, OUT)) time, asymptotically faster than GJ-Sample.
Finally, we give a simple explanation of why the runtime of SSTE, SUST, GJ-Sample, and DRS is inversely proportional to OUT. Intuitively, if OUT is as large as AGM, a set of samples is highly likely to contain actual join answers of Q. In other words, the join answers are spread over a dense database. A small number of samples would then give us enough information about the distribution of the join answers for estimating OUT accurately. In contrast, if OUT is small, the join answers would be skewed and sparsely distributed in certain regions of the database, and most samples would not be join answers. Therefore, a small number of samples cannot effectively decrease the uncertainty of the estimation, which may require a lot more sampling for an accurate estimation.
5.6. Underlying Index Structure
We explain the underlying index for each relation that evaluates the operations in section 5.2 in Õ(1) time. For a relation R, we partition its attributes into the bound attributes (those sampled so far) and the free attributes; the two sets are disjoint in GenericCardEst.
A mild assumption in (Ngo et al., 2014) is to build a B-tree-like index (e.g., a trie index in (Aberger et al., 2017)) under a global attribute order over the whole database. Then, if all bound attributes precede all free attributes, the required projection is readily available as an index lookup. Furthermore, if no free attribute lies between the bound attributes, it is an index lookup up to a bounded depth. To exploit these, the selection of I in each step of GenericCardEst should be consistent with the global order. Instead, we can build multiple indexes per relation, one for each possible attribute order as in (Aberger et al., 2017), to enable arbitrary selection of I.
Remark. The complexity of building an index for a relation is linear, i.e., Õ(IN) (Aberger et al., 2017). However, in contrast to the pre-computing overheads in GJ-Sample, the indexing occurs once per database instead of once per query. Therefore, its complexity is excluded from the runtime analysis in section 5.5. Due to their simple structures, the indexes can be easily extended to a dynamic setting where tuples can be inserted and deleted, with a constant or logarithmic update time.
6. Achieving a Generalized Bound
In this section, we generalize our bound for join size estimation using generalized hypertree decompositions (GHDs, see section 2) and achieve a better bound than Õ(IN^fhw), which is the bound of an exact aggregation algorithm AggroGHDJoin using GHDs (see section A.3). We leave uniform sampling using GHDs as future work in section 8. Our new bound may be better than our previous bound Õ(AGM/OUT), especially when OUT is small (see Example 2). We first provide background about GHDs and related algorithms.
6.1. Aggregations over GHDs
Instead of performing joins, OUT can be more efficiently computed by solving an AJAR (Aggregations and Joins over Annotated Relations) query (Joglekar et al., 2016). AJAR queries assume that each relation is annotated; each tuple t has an annotation w(t) (from some domain). If two tuples t1 and t2 are joined, the joined tuple has w(t1) ⊗ w(t2) as its annotation. Hence, joining annotated relations results in another annotated relation:
(4)  R1 ⋈ R2 = { (t1 ⋈ t2, w(t1) ⊗ w(t2)) : t1 ∈ R1, t2 ∈ R2, t1 and t2 join }
Furthermore, we can define an aggregation over an annotated relation R with attributes A by a pair (A', ⊕) of attributes and a sum operator (Joglekar et al., 2016). Let A'' = A ∖ A'. Then, the (A', ⊕)-aggregation of R generates an annotated relation of attributes A'', where A'' is the grouping/output attributes, and the annotation of each output tuple is an aggregated result:
(5)  ⊕_{A'} R = { (π_{A''}(t), ⊕_{t' ∈ R: π_{A''}(t') = π_{A''}(t)} w(t')) : t ∈ R }
By definition, no tuple is duplicated in the aggregation result. If multiple aggregation attributes share the same operator ⊕, we simply write the aggregation over the attribute set. Here, the attributes in A' are marginalized out and removed from the resulting relation, and the output attributes become A''. Altogether, if (⊕, ⊗) forms a commutative semiring (Green et al., 2007), we can aggregate over joins (of annotated relations) (Joglekar et al., 2016). If ⊕ = +, ⊗ = ×, every annotation is 1, and every attribute is aggregated, the query generates an annotated relation with a single tuple without any attribute, whose annotation is OUT. We hereafter omit the operators whenever possible without ambiguity. In addition, AJAR queries are known to be more general than FAQ queries (Abo Khamis et al., 2016), for a query can have multiple aggregation operators (Joglekar et al., 2016).
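The semiring machinery above can be sketched directly: annotate tuples with counts, multiply annotations in joins, and add them when marginalizing. The toy relations below are made up; aggregating out all attributes of R(A,B) ⋈ S(B,C) yields the join size as a single annotation, without materializing the answers as output:

```python
from collections import defaultdict

def annotate(tuples):
    # count-annotated relation: each tuple gets annotation 1 (duplicates add up)
    r = defaultdict(int)
    for t in tuples:
        r[t] += 1
    return dict(r)

def join(schema1, r1, schema2, r2):
    # natural join of annotated relations: annotations multiply
    shared = [a for a in schema1 if a in schema2]
    schema = schema1 + [a for a in schema2 if a not in schema1]
    out = defaultdict(int)
    for t1, w1 in r1.items():
        for t2, w2 in r2.items():
            row1, row2 = dict(zip(schema1, t1)), dict(zip(schema2, t2))
            if all(row1[a] == row2[a] for a in shared):
                merged = {**row1, **row2}
                out[tuple(merged[a] for a in schema)] += w1 * w2
    return schema, dict(out)

def marginalize(schema, r, attrs):
    # (+)-aggregation over attrs: annotations of collapsing tuples add up
    keep = [a for a in schema if a not in attrs]
    out = defaultdict(int)
    for t, w in r.items():
        row = dict(zip(schema, t))
        out[tuple(row[a] for a in keep)] += w
    return keep, dict(out)

R = annotate([(1, 2), (1, 3), (2, 3)])          # R(A,B)
S = annotate([(2, 5), (3, 5), (3, 6)])          # S(B,C)
schema, J = join(["A", "B"], R, ["B", "C"], S)
_, total = marginalize(schema, J, ["A", "B", "C"])
print(total[()])   # → 5, the join size of R |><| S
```

Pushing the marginalization inside the join (rather than after it, as here) is exactly what the GHD-based algorithms below exploit.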
Joglekar et al. (Joglekar et al., 2016) present AggroGHDJoin and AggroYannakakis to solve AJAR queries in Õ(IN^w + OUT_agg) time, where w is the width of the algorithms, and OUT_agg is the output size of the aggregation results, which is typically smaller than OUT, e.g., 1 in our case. We prove in section A.3 that w ≥ fhw for general AJAR queries and w = fhw for computing OUT. Since we show in section 6.3 that our new bound using GHDs is smaller than Õ(IN^fhw), it is also smaller than the bound of AggroGHDJoin.
6.2. Sampling over GHDs
From now, we use the same setting from section 6.1 to compute the AJAR query . For a set of grouping attributes , GroupByCardEst (Algorithm 2) returns an approximate answer to the AJAR query , i.e., an annotated relation (Lines 2-2); and represent the exact and approximate value of , where is the residual hypergraph of given (see section 5.1). Therefore, .
GHDCardEst (Algorithm 3) is our entry function given a GHD (). For each node , we define , i.e., the sub-query of on (Line 3), and as the set of grouping attributes of , determined as the shared attributes across the nodes (Line 3). GroupByCardEst at Line 3 takes this as the grouping attributes and returns an annotated relation , where . Therefore, is an approximation of the aggregation of with output attributes . Finally, SimpleAggroYannakakis at Line 3 takes these ’s and runs a simple version of AggroYannakakis (see Algorithm 5 in section A.2) to compute for , the union of grouping attributes of all GHD nodes. is used as join attributes between ’s in Line 3. Lemma 3 states the unbiasedness of GHDCardEst.
Lemma 3.
GHDCardEst is an unbiased estimator of .
Example 0.
We use and in Figure 1 as an example. Here, , and are grouping attributes. (d) shows examples of annotated relations ’s, which are the outputs of GroupByCardEst on the three nodes. Since we allow duplicate tuples in any base relation , annotations for can be larger than one, even if is covered by a single hyperedge and . (e) and (f) show the join and aggregation on ’s performed in SimpleAggroYannakakis. Note that the annotations are multiplied in joins, and added in aggregations. The final result, 8450, is an approximation of OUT.

The key idea of GHDCardEst is to push down the partial sums from to each node in computing , which is similar to the key idea of AggroGHDJoin that pushes down the aggregation attributes in the GHD as deeply as possible. However, it is slightly different: we shrink each relation obtained from a node before the join phase of AggroYannakakis, whereas AggroYannakakis aggregates after joining each pair of parent-child relations in the GHD.
We also argue that our definition of is minimal; if an attribute in is excluded from , the node loses a join relationship with another node. Our also minimizes the runtime of a non-sampling procedure, GenericJoin at Line 2 of Algorithm 2. Its runtime increases with the output attributes due to the node-monotone property of the AGM bound (Joglekar et al., 2016).
6.3. Analysis of GHDCardEst
Using the analyses in section 5.5 as building blocks, we analyze the variance and runtime of GHDCardEst. For each for a GHD node , we attach two additional annotations and (the runtime of obtaining ), and let (or simply ) denote the residual hypergraph of given in Line 2 of Algorithm 2. Then, is an unbiased estimate of from Lemma 1. We also define as an annotated tuple in so .
Recall that our primary goal for join size estimation is to perform ()-approximation of OUT. To use Chebyshev’s inequality, we should bound by as in section 5.5. Then the following questions arise: 1) Can we make this inequality hold for any and ? 2) How would the runtime be expressed? To answer these questions, we first express using values.
(6) |
We arbitrarily order and regard each as an ordered set . Since values for are mutually independent, we have
(7) |
(8) |
While can be simply decomposed into , has the internal term which prevents a simple decomposition. For any two different tuples , and are not independent, since they can have the same sub-tuple, i.e., and for some . Therefore, we have to consider the covariance when expanding below.
(9) |
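The decomposition above hinges on two standard facts: the product of independent unbiased estimates is unbiased, and its second moment factorizes, so only dependent pairs contribute covariance terms. A small exact check with two hypothetical two-point distributions:

```python
from fractions import Fraction as F
from itertools import product

def moments(dist):
    """First and second moments of a finite distribution {value: probability}."""
    m1 = sum(p * v for v, p in dist.items())
    m2 = sum(p * v * v for v, p in dist.items())
    return m1, m2

# Two independent unbiased estimates (values/probabilities illustrative,
# e.g., inverse-probability estimates from two sampled attributes).
X = {F(0): F(1, 2), F(4): F(1, 2)}   # E[X] = 2
Y = {F(1): F(3, 4), F(5): F(1, 4)}   # E[Y] = 2

ex, ex2 = moments(X)
ey, ey2 = moments(Y)

# Exact distribution of the product X * Y under independence.
XY = {}
for (x, px), (y, py) in product(X.items(), Y.items()):
    XY[x * y] = XY.get(x * y, F(0)) + px * py
exy, exy2 = moments(XY)

# Unbiasedness and second-moment factorization:
# E[XY] = E[X] E[Y] and E[(XY)^2] = E[X^2] E[Y^2].
```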
From the analysis in section 5.5, we can arbitrarily control and with the sample size, under the condition that . In particular, we set = and = .
Lemma 4.
= if = = for every .
Lemma 5.
if the same condition of Lemma 4 holds for and .
(10) |
The last equality holds from . As a result, .
We can make arbitrarily small, even less than the we desire, by setting the constant factor of arbitrarily small for every . We omit the trivial details of setting these constants.
Proposition 9.
If approaches 0 for every , approaches 0.
We next answer our second question. GroupByCardEst for each node takes time at Line 2 and at Line 2 of Algorithm 2. SimpleAggroYannakakis at Line 3 of Algorithm 3 takes time. Therefore, is + . By setting , becomes + .
We now prove that our new bound is smaller than = ; since is node-monotone, and from Lemma 6 in appendix B. Therefore, if OUT is , is asymptotically smaller than (), proving Theorem 2. We recommend using GenericCardEst if OUT is expected to be large, and using GHDCardEst if OUT is expected to be small.
Example 0.
We use the in Figure 1 without an edge and assume that every base relation has size . Then, our previous bound becomes , from an optimal fractional edge cover : for and otherwise. Since our new bound is smaller than , it is also smaller than if OUT is asymptotically smaller than .
7. Optimizations
This section explains two optimization techniques to enhance estimation efficiency or accuracy.
7.1. Increasing Sampling Probabilities
If we focus on join size estimation without insisting on uniform sampling, we can further reduce the variance. Since (Kim et al., 2021; Chen and Yi, 2020) for our method in section 5.4, increasing reduces the variance. First, when edges tie and have the same maximum in section 5.4, we do not decrease the keeping probability by times. This increases the final by times. In fact, we can use any sampled from ; increases from to . Second, we can interleave GJ-Sample to remove the factor in . If the of GJ-Sample is small enough as a constant, e.g., , we can compute for every as in GJ-Sample.
7.2. Skipping Attributes and GHD Nodes
In GenericCardEst, sampling for any non-join attribute is unnecessary and thus can be safely skipped. For our example query in Figure 1, is the only non-join attribute. Let the current sample (attributes are ) and . Then, sampling any does not affect the sample space and sample size for the remaining vertices, i.e., , , and . Hence, we skip and just return at Line 1 instead of 1. If there are multiple non-join attributes , we skip all of them, sample for the join attributes only, and then return at Line 1.
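The effect of skipping a non-join attribute can be seen on a toy instance (names and data are illustrative): every candidate value of the skipped attribute leads to the same downstream count, so the inverse-probability estimate degenerates to the exact degree and sampling adds no information:

```python
from collections import Counter
from random import Random

# Toy instance: count |R(A,B) JOIN S(B,C)| where C appears only in S,
# i.e., C is a non-join attribute. Data are illustrative.
R = [(a, a % 3) for a in range(9)]                     # B = A mod 3
S = [(b, c) for b in range(3) for c in range(b + 1)]   # deg(b) = b + 1

deg = Counter(b for b, _ in S)

# Skipping C: each R-tuple contributes exactly deg(b) join answers.
exact = sum(deg[b] for _, b in R)

# Sampling C anyway: choose one matching c uniformly, weight by deg(b).
# Since c constrains nothing downstream, every trial returns exactly
# deg(b), so the sampling step is pure overhead and can be skipped.
rng = Random(0)
est = 0
for _, b in R:
    cs = [c for (b2, c) in S if b2 == b]
    rng.choice(cs)      # sampled value, never used afterwards
    est += len(cs)      # inverse-probability weight deg(b)
```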
In GHDCardEst, calling GroupByCardEst for single-edge GHD nodes can be safely skipped. If a GHD node is covered by a single edge, i.e., , we can directly obtain for every in without sampling in GroupByCardEst. To be consistent, we regard as (with annotations) in order to remove any duplicates in as mentioned in section 6.1. of in Figure 1 is an example of a single-edge node, where for . Therefore, and , reducing both and .
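For a single-edge GHD node, the annotated relation can be read off directly by projecting onto the grouping attributes and counting duplicates; a minimal sketch (attribute names are illustrative):

```python
from collections import Counter

def single_edge_node(rel, attrs, group_attrs):
    """Exact annotated output of a GHD node covered by a single edge.

    rel is a list of tuples with attribute names attrs. Projecting onto
    the grouping attributes and counting duplicates yields the
    annotations directly -- no sampling, hence no estimation error.
    """
    idx = [attrs.index(x) for x in group_attrs]
    return Counter(tuple(t[i] for i in idx) for t in rel)

# E(A, B) with duplicate tuples (duplicates are allowed in base relations).
E = [(1, 10), (1, 11), (1, 11), (2, 10)]
ann = single_edge_node(E, ["A", "B"], ["A"])
```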
8. Research Opportunities
We present six research opportunities based on our study. First, we could express our bound in section 6 in a more succinct form by removing internal terms, e.g., to . Then, we could more clearly characterize the gap between this bound and or .
Second, we could dynamically change the sampling approach based on a guess of OUT. Our bound in section 5 can be as small as a constant if OUT is as large as , while the bound in section 6 has advantages for small OUT (e.g., Example 2). As stated in section 5.5, we could start with GenericCardEst assuming a large OUT, increase the sample size, adjust the assumption, and switch to GHDCardEst if the estimated OUT becomes small.
Third, we could even switch between sampling and join, then apply the same analysis. While join algorithms have zero error but a +OUT term in their runtime, sampling algorithms have non-zero errors but a factor in their runtime. Therefore, it might be possible to balance between sampling and join to reduce the runtime and error for both large- and small-OUT cases, i.e., select a sampling-like algorithm for large OUT and a join-like algorithm for small OUT.
Fourth, we could develop join size estimators with lower complexities using more information about relations, e.g., degree constraints or functional dependencies (Abo Khamis et al., 2017), beyond the cardinality constraints in Definition 1.
Fifth, we could extend GHDCardEst to perform uniform sampling over arbitrary GHDs from the observations in section A.2 that 1) ExactWeight (Zhao et al., 2018) performs uniform sampling over certain types of GHDs for acyclic joins, and 2) GHDCardEst can be easily modified for uniform sampling over the same GHDs. In Algorithm 6, the initial value of (sampling weight for a tuple in ) is set to . Therefore, we could extend ExactWeight and GHDCardEst for arbitrary GHDs for even cyclic joins, starting from initializing with an estimate instead of .
Sixth, we could extend our algorithms to more general problems, approximate counting and uniform sampling for conjunctive queries (Arenas et al., 2021; Focke et al., 2022) or join-project queries , where is the set of output attributes (Chen and Yi, 2020). We explain in appendix C that our degree-based rejection sampling can be easily extended for join-project queries, following the same analysis in Section 3 of (Chen and Yi, 2020). We achieve runtime which is again smaller than that of GJ-Sample.
9. Conclusion
In this paper, we have presented a new sampling-based method for join size estimation and uniform sampling. Our method solves an open problem in the literature (Chen and Yi, 2020), achieving (1)-approximation of OUT in time for arbitrary join queries. We have also presented a unified framework that explains and analyzes four known methods: SSTE, SUST, Alley, and GJ-Sample. We then extended our method using GHDs to achieve a better bound than for small OUT, and optimized estimation efficiency and accuracy using two techniques. Finally, we have highlighted several interesting research opportunities building on our results. We believe that our work may facilitate studies on achieving lower bounds for uniform sampling and size estimation over joins.
10. Acknowledgement
We thank Yufei Tao for pointing out the condition of Lemma 2.
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2021-0-00859, Development of a distributed graph DBMS for intelligent processing of big graphs, 34%), Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2018-0-01398, Development of a Conversational, Self-tuning DBMS, 33%), and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.NRF-2021R1A2B5B03001551, 33%).
References
- Aberger et al. (2017) Christopher R Aberger, Andrew Lamb, Susan Tu, Andres Nötzli, Kunle Olukotun, and Christopher Ré. 2017. EmptyHeaded: A relational engine for graph processing. ACM Transactions on Database Systems (TODS) 42, 4 (2017), 1–44.
- Abo Khamis et al. (2016) Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. 2016. FAQ: Questions Asked Frequently. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (San Francisco, California, USA) (PODS ’16). Association for Computing Machinery, New York, NY, USA, 13–28. https://doi.org/10.1145/2902251.2902280
- Abo Khamis et al. (2017) Mahmoud Abo Khamis, Hung Q Ngo, and Dan Suciu. 2017. What do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog have to do with one another?. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 429–444.
- Aliakbarpour et al. (2018) Maryam Aliakbarpour, Amartya Shankha Biswas, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee. 2018. Sublinear-time algorithms for counting star subgraphs via edge sampling. Algorithmica 80, 2 (2018), 668–697.
- Arenas et al. (2021) Marcelo Arenas, Luis Alberto Croquevielle, Rajesh Jayaram, and Cristian Riveros. 2021. When is approximate counting for conjunctive queries tractable?. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing. 1015–1027.
- Assadi et al. (2018) Sepehr Assadi, Michael Kapralov, and Sanjeev Khanna. 2018. A simple sublinear-time algorithm for counting arbitrary subgraphs via edge sampling. arXiv preprint arXiv:1811.07780 (2018).
- Assadi et al. (2019) Sepehr Assadi, Michael Kapralov, and Sanjeev Khanna. 2019. A simple sublinear-time algorithm for counting arbitrary subgraphs via edge sampling. In 10th Innovations in Theoretical Computer Science, ITCS 2019. Schloss Dagstuhl-Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, 6.
- Atserias et al. (2013) Albert Atserias, Martin Grohe, and Dániel Marx. 2013. Size bounds and query plans for relational joins. SIAM J. Comput. 42, 4 (2013), 1737–1767.
- Banerjee (2012) Moulinath Banerjee. 2012. Simple Random Sampling. Unpublished Manuscript, University of Michigan, Michigan (2012).
- Bera and Chakrabarti (2017) Suman K Bera and Amit Chakrabarti. 2017. Towards tighter space bounds for counting triangles and other substructures in graph streams. In 34th Symposium on Theoretical Aspects of Computer Science.
- Brault-Baron (2016) Johann Brault-Baron. 2016. Hypergraph acyclicity revisited. ACM Computing Surveys (CSUR) 49, 3 (2016), 1–26.
- Carmeli et al. (2020) Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Benny Kimelfeld, and Nicole Schweikardt. 2020. Answering (unions of) conjunctive queries using random access and random-order enumeration. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 393–409.
- Chen and Yi (2020) Yu Chen and Ke Yi. 2020. Random Sampling and Size Estimation Over Cyclic Joins. In 23rd International Conference on Database Theory (ICDT 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
- Deng et al. (2023) Shiyuan Deng, Shangqi Lu, and Yufei Tao. 2023. On Join Sampling and Hardness of Combinatorial Output-Sensitive Join Algorithms. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2023).
- Eden et al. (2017) Talya Eden, Amit Levi, Dana Ron, and C Seshadhri. 2017. Approximately counting triangles in sublinear time. SIAM J. Comput. 46, 5 (2017), 1603–1646.
- Eden et al. (2020) Talya Eden, Dana Ron, and C Seshadhri. 2020. On approximating the number of k-cliques in sublinear time. SIAM J. Comput. 49, 4 (2020), 747–771.
- Fichtenberger et al. (2020) Hendrik Fichtenberger, Mingze Gao, and Pan Peng. 2020. Sampling Arbitrary Subgraphs Exactly Uniformly in Sublinear Time. In ICALP.
- Focke et al. (2022) Jacob Focke, Leslie Ann Goldberg, Marc Roth, and Stanislav Zivnỳ. 2022. Approximately counting answers to conjunctive queries with disequalities and negations. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 315–324.
- Friedgut and Kahn (1998) Ehud Friedgut and Jeff Kahn. 1998. On the number of copies of one hypergraph in another. Israel Journal of Mathematics 105, 1 (1998), 251–256.
- Goldreich (2017) Oded Goldreich. 2017. Introduction to property testing. Cambridge University Press.
- Gottlob et al. (2016) Georg Gottlob, Gianluigi Greco, Nicola Leone, and Francesco Scarcello. 2016. Hypertree decompositions: Questions and answers. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 57–74.
- Gottlob et al. (2002) Georg Gottlob, Nicola Leone, and Francesco Scarcello. 2002. Hypertree decompositions and tractable queries. J. Comput. System Sci. 64, 3 (2002), 579–627.
- Green et al. (2007) Todd J Green, Grigoris Karvounarakis, and Val Tannen. 2007. Provenance semirings. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 31–40.
- Grohe and Marx (2014) Martin Grohe and Dániel Marx. 2014. Constraint solving via fractional edge covers. ACM Transactions on Algorithms (TALG) 11, 1 (2014), 1–20.
- Horvitz and Thompson (1952) Daniel G Horvitz and Donovan J Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association 47, 260 (1952), 663–685.
- Jerrum et al. (1986) Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. 1986. Random generation of combinatorial structures from a uniform distribution. Theoretical computer science 43 (1986), 169–188.
- Joglekar et al. (2016) Manas R. Joglekar, Rohan Puttagunta, and Christopher Ré. 2016. AJAR: Aggregations and Joins over Annotated Relations. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (San Francisco, California, USA) (PODS ’16). Association for Computing Machinery, New York, NY, USA, 91–106. https://doi.org/10.1145/2902251.2902293
- Kim et al. (2021) Kyoungmin Kim, Hyeonji Kim, George Fletcher, and Wook-Shin Han. 2021. Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality Estimation. In Proceedings of the 2021 International Conference on Management of Data. 964–976.
- Leis et al. (2015) Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment 9, 3 (2015), 204–215.
- Li et al. (2016) Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander join: Online aggregation via random walks. In Proceedings of the 2016 International Conference on Management of Data. 615–629.
- Ngo et al. (2012) Hung Q Ngo, Ely Porat, Christopher Ré, and Atri Rudra. 2012. Worst-case optimal join algorithms. In Proceedings of the 31st ACM Symposium on Principles of Database Systems (PODS ’12).
- Ngo et al. (2014) Hung Q Ngo, Christopher Ré, and Atri Rudra. 2014. Skew strikes back: New developments in the theory of join algorithms. ACM SIGMOD Record 42, 4 (2014), 5–16.
- Veldhuizen (2012) Todd L Veldhuizen. 2012. Leapfrog triejoin: a worst-case optimal join algorithm. arXiv preprint arXiv:1210.0481 (2012).
- Yannakakis (1981) Mihalis Yannakakis. 1981. Algorithms for acyclic database schemes. In VLDB, Vol. 81. 82–94.
- Zhao et al. (2018) Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. 2018. Random sampling over joins revisited. In Proceedings of the 2018 International Conference on Management of Data. 1525–1539.
Appendix
Appendix A Algorithms
A.1. GenericJoin
GenericJoin (Algorithm 4) (Ngo et al., 2014) is a representative worst-case optimal join algorithm. From the set of output attributes (initially ), it selects a subset where (Line 4). For each join answer of the induced hypergraph of projected onto (Line 4), it binds the attributes to and proceeds to match the residual hypergraph (Line 4). If , the results are obtained from the intersection () operation (Lines 4-4). The runtime of GenericJoin is , which can be proven from Lemma 6 (Ngo et al., 2014).
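As a concrete illustration of the attribute-at-a-time strategy, the following is a minimal sketch for the triangle query; it performs the per-level intersections of GenericJoin but hard-codes the attribute order and omits the subset selection that yields the worst-case optimal bound:

```python
from collections import defaultdict

def triangle_join(R, S, T):
    """Attribute-at-a-time join for Q(A,B,C) = R(A,B), S(B,C), T(A,C).

    A minimal GenericJoin-style sketch: each recursion level binds one
    attribute via intersection of the candidate sets the relations allow.
    """
    Ra = defaultdict(set)   # A -> {B} from R
    for a, b in R:
        Ra[a].add(b)
    Sb = defaultdict(set)   # B -> {C} from S
    for b, c in S:
        Sb[b].add(c)
    Ta = defaultdict(set)   # A -> {C} from T
    for a, c in T:
        Ta[a].add(c)

    out = []
    for a in set(Ra) & set(Ta):        # bind A
        for b in Ra[a]:                # bind B consistently with R
            for c in Sb[b] & Ta[a]:    # bind C by intersecting S and T
                out.append((a, b, c))
    return out

# A 4-vertex graph with exactly one directed triangle 0 -> 1 -> 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
tris = triangle_join(edges, edges, edges)
```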
A.2. Yannakakis, SimpleAggroYannakakis, and ExactWeight
In this section and in Algorithms 5-6, for ease of explanation, we assume that the root node of any GHD has a virtual parent node where 1) , 2) , and 3) (the input relation for ) contains a single tuple that joins with all tuples in . As mentioned in section 6.1, we assume no duplicate tuples in each .
Given a join tree of a -acyclic query (Brault-Baron, 2016), Yannakakis (Yannakakis, 1981) performs the join in time using dynamic programming. Algorithm 5 without Lines 5-5 and setting is Yannakakis, where the bottom-up and top-down semi-join reductions (Lines 5-5) are followed by the bottom-up join (Lines 5-5). Since the semi-join reductions remove dangling tuples (that do not participate in the final join answers), the intermediate join size monotonically increases up to OUT in the bottom-up join. Therefore, the total runtime is .
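The bottom-up semi-join pass can be sketched on a path join (the top-down pass and final bottom-up join are analogous); relation contents are illustrative:

```python
def semijoin(parent, child, p_idx, c_idx):
    """Keep the parent tuples having at least one join partner in child."""
    keys = {tuple(t[i] for i in c_idx) for t in child}
    return [t for t in parent if tuple(t[i] for i in p_idx) in keys]

# Path join R(A,B), S(B,C), T(C,D) with join tree R - S - T
# (illustrative data; T intentionally matches only part of S).
R = [(1, 10), (2, 20), (3, 30)]
S = [(10, 100), (20, 200), (99, 900)]
T = [(100, 7)]

# Bottom-up semi-join pass: reduce S by T, then R by the reduced S.
# Dangling tuples disappear before any join is materialized, so the
# later bottom-up join never exceeds the final output size.
S1 = semijoin(S, T, [1], [0])   # S semijoin T on C
R1 = semijoin(R, S1, [1], [0])  # R semijoin S1 on B
```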
SimpleAggroYannakakis (Algorithm 5) is a simplified version of AggroYannakakis (Joglekar et al., 2016), where 1) all attributes are used for aggregation and 2) sum is the only aggregation operator. AggroYannakakis can handle a more general class of aggregation queries, e.g., with sum and max operators used for different attributes. In Line 5, denotes the top node of an attribute , which is the closest node to the root among . is the set of output attributes of . Line 5 aggregates on the attributes in having as their top node, since these attributes do not appear in ancestors and are marginalized out. This early marginalization is the key to keeping the runtime of SimpleAggroYannakakis at (Joglekar et al., 2016), where is the output size of the aggregation. For example, if all attributes are marginalized out in computing the join size.
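Early marginalization can be illustrated on a two-relation count: summing out the non-shared attributes inside each relation first means only per-join-value counts are ever joined, instead of the full intermediate result. A sketch with illustrative data:

```python
from collections import Counter

# Join size of R(A,B) JOIN S(B,C) via early marginalization: A and C
# are summed out inside their own relations first, so only per-B
# counts are joined. Data are illustrative.
R = [(a, a % 4) for a in range(10)]
S = [(b, c) for b in range(4) for c in range(2 * b)]

cntR = Counter(b for _, b in R)              # marginalize A out of R
cntS = Counter(b for b, _ in S)              # marginalize C out of S
out = sum(cntR[b] * cntS[b] for b in cntR)   # |R JOIN S|

# Sanity check against the naive pairwise join.
naive = sum(1 for (_, b1) in R for (b2, _) in S if b1 == b2)
```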
We explain ExactWeight (Zhao et al., 2018) in our context of using GHDs (Algorithm 6), but assume that ’s are not annotated and compute each weight explicitly. ExactWeight performs uniform sampling over acyclic joins, where the GHDs of single-edge nodes are given. ExactWeight computes the sampling weights and for each tuple and a child relation bottom-up (Lines 6-6) and samples tuples proportional to their weights top-down (Lines 6-6). The sample is then replaced with the tuples sampled from the base relations (Lines 6-6). The bottom-up part is the preprocessing step that takes time, and the top-down sampling and replacement takes for each sample and can be performed multiple times after the preprocessing.
We now briefly explain why ExactWeight returns a uniform sample. corresponds to the join size between and the child subtree of rooted at , and corresponds to the join size between and all child subtrees of . Hence, which is the join size of the whole tree. For a node , its children and its parent , and for . Therefore, . We have seen a similar elimination of probability terms in eq. 3, which leads to after Line 6. By uniformly sampling a tuple from each at Line 6, becomes at Line 6.
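The weight-then-sample idea can be sketched for the simplest join tree, a two-relation acyclic join R(A,B) and S(B,C): the bottom-up pass gives each R-tuple a weight equal to its number of S partners (so the weights sum to the join size), and the top-down pass draws proportionally to these weights. Data and helper names are illustrative:

```python
from collections import Counter, defaultdict
from random import Random

# Illustrative acyclic join R(A,B) JOIN S(B,C) with join tree R - S.
R = [(1, "x"), (2, "x"), (3, "y")]
S = [("x", 10), ("x", 11), ("y", 12)]

# Bottom-up pass: the weight of an R-tuple is its number of S partners,
# so the weights sum to the exact join size.
degS = Counter(b for b, _ in S)
W = [degS[b] for _, b in R]     # [2, 2, 1]
total = sum(W)                  # join size = 5

byB = defaultdict(list)
for b, c in S:
    byB[b].append(c)

def sample_join(rng):
    """Top-down pass: pick an R-tuple proportional to its weight, then a
    uniform matching S-tuple; the join answer is uniform overall."""
    a, b = rng.choices(R, weights=W)[0]
    return (a, b, rng.choice(byB[b]))

rng = Random(42)
counts = Counter(sample_join(rng) for _ in range(5000))   # ~1000 each
```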
We can also make GHDCardEst in Algorithm 3 perform uniform sampling for the same GHD by 1) iterating bottom-up at Line 3, 2) adding the same initialization and updates of sampling weights, and 3) adding the same sampling part. Then, the preprocessing step of GHDCardEst also takes as ExactWeight. From this perspective, ExactWeight is similar to a specific instance of GHDCardEst for GHDs of single-edge nodes and acyclic queries. Extending the above modifications of GHDCardEst to work for arbitrary GHDs and cyclic queries would be interesting future work. We could also extend our work to unions of acyclic conjunctive queries and sampling without replacement, as what (Carmeli et al., 2020) is to (Zhao et al., 2018).
A.3. GHDJoin and AggroGHDJoin
Given a generalized hypertree decomposition (GHD) of a query, GHDJoin (Algorithm 7) (Joglekar et al., 2016) uses as a join tree and performs GenericJoin to obtain the join answers of each node (Lines 7-7). is the subquery defined on node (Line 7). Then, it performs Yannakakis over these join answers to evaluate the original query (Line 7). Therefore, the runtime of GHDJoin is from Definition 3.
AggroGHDJoin (Joglekar et al., 2016) is an aggregation version of GHDJoin. It is different from GHDJoin in two aspects: 1) calling AggroYannakakis instead of Yannakakis and 2) including some extra work to ensure that each annotation of a tuple is passed to exactly one call of GenericJoin (Joglekar et al., 2016). Note that a hyperedge can be visited in multiple nodes, thus, its annotations may be aggregated multiple times, leading to an incorrect output.
From the runtime of GenericJoin and AggroYannakakis, the runtime of AggroGHDJoin is . In (Joglekar et al., 2016), the runtime of AggroGHDJoin is expressed as , where . Here, denotes the set of valid GHDs of that preserve the aggregation ordering of the specified aggregation ; changing the order of and might vary the aggregation result (Joglekar et al., 2016). Since from Definition 3 and is a subset of the set of all GHDs, . However, in our case, where all aggregation operators are the same () and thus commutative, any aggregation order gives the same result. Therefore, becomes the set of all GHDs. This leads to .
Appendix B Query Decomposition Lemmas
This section explains two similar query decomposition lemmas proved by Ngo et al. (Ngo et al., 2014) and Chen & Yi (Chen and Yi, 2020).
Lemma 6.
(Ngo et al., 2014): Given a query hypergraph , let be any fractional edge cover of . Let be an arbitrary non-empty proper subset of , , and . Then,
(11) |
Using this lemma, Ngo et al. (Ngo et al., 2014) prove that GenericJoin has runtime (i.e., the RHS above) using an induction hypothesis on . If is an optimal fractional edge cover, the RHS becomes .
We briefly explain this with Algorithm 4. If we replace 1) in the lemma with in Algorithm 4 and 2) with , we obtain and
(12) |
If , Line 4 takes time where for . The last term is equivalent to the RHS above.
If , Lines 4-4 take time where from the induction hypothesis (, and ) and from the lemma, which is again the RHS above.
Therefore, in both cases, the runtime of Algorithm 4 is the big-Oh of the RHS above.
Lemma 7.
(Chen and Yi, 2020): Given a query hypergraph , let be any fractional edge cover of . Let for an arbitrary attribute , , and for . Then,
(13) |
Since Chen & Yi (Chen and Yi, 2020) state Lemma 7 without naming it, we also call it a query decomposition lemma due to its similarity to Lemma 6. However, it differs from Lemma 6 in that 1) must be 1 and 2) is obtained without an actual join or intersection in GenericJoin. If in Lemma 6, then . Since , . Therefore, Lemma 7 is stronger than Lemma 6 if .
Lemma 7 is the key to show that for GJ-Sample. By replacing 1) in the above lemma with in GJ-Sample in section 5.3 and 2) with , we obtain for and
(14) |
resulting in
(15) |
Appendix C Extension to Conjunctive Queries
From the introduction, the join query we have considered so far is a full conjunctive query where all attributes are output attributes. In this section, we easily extend our degree-based rejection sampling to approximate counting and uniform sampling for join-project queries (a.k.a. conjunctive queries) (Arenas et al., 2021; Focke et al., 2022; Chen and Yi, 2020), following the same analysis of Section 3 in (Chen and Yi, 2020).
Let be a join-project query, where is the set of projection/output attributes and is a join query. We can use the two-step approach in (Chen and Yi, 2020): (Step 1) use Algorithm 1 to sample a join result from , i.e., join only the relations that contain some attribute in . Here, if is disconnected, we take a sample from each connected component. (Step 2) check whether is empty. If not, we return as a sampled result; otherwise, we repeat. Then, the final returned is a result in . Using the same analysis of (Chen and Yi, 2020), sampling a tuple in takes time in expectation for our method.
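The two-step approach can be sketched with a toy join-project query (relation contents are illustrative); the sketch shows only the accept/reject loop and that every returned value is a genuine answer, not the full uniformity and runtime analysis of (Chen and Yi, 2020):

```python
from random import Random

# Toy join-project query Q(A) = project_A( R(A,B) JOIN S(B,C) ):
# Q1 = R (the relation containing the output attribute A), and the
# residual query checks whether the sampled B value joins with S.
R = [(1, 10), (2, 20), (3, 30)]
S = [(10, "u"), (30, "v")]
S_keys = {b for b, _ in S}

def sample_answer(rng, max_tries=1000):
    """Step 1: sample from Q1; Step 2: accept iff the residual query is
    non-empty. Every returned value is a genuine answer of Q."""
    for _ in range(max_tries):
        a, b = rng.choice(R)
        if b in S_keys:
            return a
    return None

rng = Random(7)
seen = {sample_answer(rng) for _ in range(200)}   # answers of Q: {1, 3}
```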
Step 1 of our algorithm takes time in expectation. Step 2 takes time , using any worst-case optimal join algorithm, e.g., GenericJoin. Because each is sampled with probability , the expected runtime of Step 2 is (the big-Oh of) where the inequality follows from the query decomposition lemma (Lemma 6). Please refer to Section 3 of (Chen and Yi, 2020) for the remaining part of the analysis proving the time.
Here, we would like to address the complexity of computing before calling Algorithm 1. If there exists an index for with the attribute order such that the attributes in precede the attributes in , then is readily available (i.e., in time) as explained in section 5.6. This results in total time. Otherwise, we need an additional linear-time preprocessing to obtain from , resulting in total time. Nevertheless, we emphasize that this is still lower than the complexity of GJ-Sample for non-sequenceable join-project queries, which is (Chen and Yi, 2020).
We also note that the proposed algorithm in (Arenas et al., 2021) requires computing in their “High-Level Sampling Template”, which is the (approximate) size of the -th subset of a set . This has to be computed for every possible since is used as the probability of sampling . This is analogous to computing for each candidate in GJ-Sample. A difference between GJ-Sample and (Arenas et al., 2021) is that computing takes constant time in GJ-Sample while computing takes polynomial time according to Lemma 4.5 in the full version of (Arenas et al., 2021). Therefore, (Arenas et al., 2021) inherently has the overhead of computing ’s analogous to GJ-Sample. Removing such overhead by extending our idea to conjunctive queries would be interesting future work.
Appendix D Component-at-a-time Sampling
For each component, GenericCardEst (Algorithm 1) is called in two recursions, e.g., if consists of four odd cycles and three stars, the maximum recursion depth is , including the calls that reach Line 1. Since the components are vertex-disjoint, i.e., variable-disjoint, whenever and (the attributes of) are contained in different components. Therefore, we can safely drop from the following explanations for the component-at-a-time methods.
We first explain for a star of edges . Let be the center attribute, i.e., . For SSTE, in the first call to GenericCardEst, becomes the (singleton set of) center attribute, i.e., , and . To sample a vertex from , SSTE first samples an edge from , and let . Then, since edges in have as their projection onto . When GenericCardEst is called for the second time, becomes the remaining attributes of , i.e., , , and ; a sequence of vertices is sampled from once (i.e., is sampled from ). If forms a star in the database, for these two calls become 1, and the recursion proceeds to the remaining components, with and .
For SUST, in the first call to GenericCardEst, is simply the set of all attributes of and , , and ; a sequence of edges is sampled once. is 1 iff they share a common vertex for the center. The second call to GenericCardEst is a dummy that directly proceeds to the remaining components.
We next explain for an odd cycle of edges with the vertex (attribute) sequence where for and . A constraint is that the sampled vertex (attribute value) for must have the smallest degree among the sampled vertices, where the degree of a vertex is (note that for every ). In the first call to GenericCardEst, contains all ’s and ’s except , , and ; a sequence of edges is sampled where and map to and . When GenericCardEst is called for the second time, . SSTE and SUST differ here.
For SSTE, , , and is . Therefore, vertices are uniformly sampled from with replacement. For SUST, is fixed to 1, and is defined based on the degree of . If the degree is smaller than , then and . If the degree is larger than , then and . Then, at Line 1, 1) any with a smaller degree than is rejected due to the constraint, and 2) an accepted is further kept with probability , which is less than 1 from . Therefore, is again . For both SSTE and SUST, if forms a cycle in the database, the recursion proceeds to the remaining components.
SUST guarantees the uniformity of sampling odd cycles and stars. To sample an odd cycle, SUST relies on the constraint that has the smallest degree. This is allowed in unlabeled graphs since other non-qualifying cycles (e.g., having the smallest degree) can be found by permuting the vertices and edges (e.g., constrain w.r.t. and let ). The new proof of SSTE also uses this constraint to bound the variance (see appendix E). However, in non-identical relations, we cannot permute the edges with different labels.
The essence of component-at-a-time sampling is a divide-and-conquer approach: break a query into smaller components (i.e., odd cycles and stars), solve simpler problems for these components, and estimate the cardinality of the original query from the component estimates. Lemmas 8 and 9 allow us to set an induction hypothesis that bounds the variance of estimates in the simpler problems. We explain these lemmas in section D.1 and extend them to labeled graphs in section D.2.
D.1. For Unlabeled Graphs
Lemma 8.
(Decomposition Lemma) (Assadi et al., 2019): Given a query hypergraph of identical binary relations (i.e., ), the LP in Definition 1 admits a half-integral optimal solution where 1) and 2) the support of , , forms a set of vertex-disjoint odd cycles (i.e., each cycle has an odd number of edges) and stars.
Lemma 9.
D.2. For Labeled Graphs
Proposition 10.
The LP in Definition 1 admits an integral optimal solution if is a bipartite graph, where relations are binary but not necessarily identical.
Proof.
Let , , and be the incidence matrix of , where . It is well known that if is bipartite, is a so-called totally unimodular matrix (https://en.wikipedia.org/wiki/Unimodular_matrix). We can rewrite the objective of the LP, i.e., , as ( is a vector of , ) and the constraint as ( is a vector of 1’s ). Then, it is also known that an LP with objective and constraint admits an integral optimal solution if is totally unimodular and is a vector of integers, which is exactly our case.
Therefore, we can easily drop the assumption of identical relations if is bipartite. The following proposition fills the remaining non-bipartite case. We omit the proof since it is the same as the proof by Assadi et al. (Assadi et al., 2019).
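The total-unimodularity claim can be checked directly on tiny instances. The following self-contained Python sketch (a brute-force check, feasible only for very small matrices) verifies that the incidence matrix of a bipartite 4-cycle is totally unimodular, while that of a triangle (non-bipartite) is not:

```python
from fractions import Fraction
from itertools import combinations

def det(mat):
    """Exact determinant via Fraction-based Gaussian elimination."""
    m = [[Fraction(x) for x in row] for row in mat]
    n, sign, d = len(m), 1, Fraction(1)
    for col in range(n):
        piv = next((r for r in range(col, n) if m[r][col] != 0), None)
        if piv is None:
            return Fraction(0)
        if piv != col:
            m[col], m[piv] = m[piv], m[col]
            sign = -sign
        d *= m[col][col]
        inv = 1 / m[col][col]
        for r in range(col + 1, n):
            f = m[r][col] * inv
            for c in range(col, n):
                m[r][c] -= f * m[col][c]
    return sign * d

def is_totally_unimodular(B):
    """Check every square submatrix has determinant in {-1, 0, 1}."""
    rows, cols = len(B), len(B[0])
    for k in range(1, min(rows, cols) + 1):
        for rs in combinations(range(rows), k):
            for cs in combinations(range(cols), k):
                if abs(det([[B[r][c] for c in cs] for r in rs])) > 1:
                    return False
    return True

# incidence matrix of a bipartite 4-cycle on {u1, u2} x {v1, v2}
B = [[1, 1, 0, 0],   # u1
     [0, 0, 1, 1],   # u2
     [1, 0, 1, 0],   # v1
     [0, 1, 0, 1]]   # v2

# incidence matrix of a triangle: its 3x3 determinant is -2
T = [[1, 1, 0],
     [1, 0, 1],
     [0, 1, 1]]
```

The triangle case shows why the non-bipartite setting only admits half-integral (rather than integral) optimal solutions.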
Proposition 11.
The LP in Definition 1 admits a half-integral optimal solution if is a non-bipartite graph, where relations are not necessarily identical.
Finally, we state Lemma 10. The extension of Lemma 9 to labeled graphs can then be proven directly from Lemma 10, following the proof by Assadi et al. (Assadi et al., 2019).
Lemma 10.
Proof.
The proof is similar to the proof by Assadi et al. (Assadi et al., 2019), but we consider which may be different over . Assume that any half-integral optimal solution is given.
We first show that any edge in a cycle (odd or even) in has . Otherwise if (cannot be 0 due to the definition of ), we can always reduce it to without breaking the feasibility of , since both endpoints of have at least two edges in the cycle. This results in a smaller objective, contradicting the optimality of .
We then remove any even cycles. If an even cycle with edges , , …, exists in , we argue that = where . If LHS < RHS, we subtract from each and add to each . Again, this does not break the feasibility of but decreases the objective by , contradicting the optimality of . Similarly, LHS > RHS does not hold. Thus, LHS = RHS, indicating:
$\sum_{\text{odd } i} \log|R_{e_i}| \;=\; \sum_{\text{even } i} \log|R_{e_i}|$ (16)
Then, we can break the cycle by safely subtracting the minimum edge value from each odd-indexed edge and adding it to each even-indexed edge (or vice versa), preserving the objective while dropping at least one edge value to 0.
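As a concrete illustration of this ±ε shift (for the unweighted case, where the objective is the plain sum of edge values; the function name and toy values are ours), moving the minimum value around an even cycle preserves the objective while zeroing out an edge:

```python
def break_even_cycle(x):
    """Given LP values x[0..2l-1] on an even cycle's edges (in cycle
    order), shift +/- eps alternately so at least one edge drops to 0
    while the total sum (the unweighted objective) is preserved."""
    eps = min(x)
    # subtract on the alternating side that contains the minimum edge
    sign = 1 if x.index(eps) % 2 == 0 else -1
    return [v - sign * eps if i % 2 == 0 else v + sign * eps
            for i, v in enumerate(x)]

# even 6-cycle with a half-integral solution
x = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
y = break_even_cycle(x)
```

Here the sum (objective) is unchanged, all values stay non-negative, and the zeroed edges disconnect the cycle, as the proof requires.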
We next remove any odd cycles connected to other components. If an odd cycle with edges , , …, exists in such that a vertex in has an incident edge outside , we again argue that = using the same logic as for the even cycles. Then, we can remove such odd cycles.
We now have a set of vertex-disjoint odd cycles and trees in . We will break trees into stars. For each tree, let be any path connecting two leaf vertices. Note that to satisfy the constraint in the LP for these two vertices. Since any midpoint has two incident edges in the path, from the same logic used for cycles.
If (even), we argue = using the same logic as above. Then, we can subtract from each and add to each . If (odd), = , and we subtract from each and add to each except . In both cases, the path is decomposed into segments of at most two edges each. We repeat this for every path connecting two leaf vertices, and the final forest consists of trees of a maximum diameter of two, i.e., stars.
Assadi et al. (Assadi et al., 2019) do not present a concrete join size estimator for labeled graphs but prove a lower bound . Here, represents without labels, and since for each ; ’s are subsets of the set of all data edges .
Appendix E Proofs of Lemmas and Propositions
To prove Lemmas 1-5 and Propositions 1-8, we first expand and , then derive inequalities on . We define another random variable where is a random variable in . Then, , i.e., and are mutually recursively defined. From (Banerjee, 2012):
(17)
This holds since consists of identically distributed samples . Note that may not be independent due to sampling without replacement, e.g., in Alley. Then, above is 1 and the variance is decreased by portion (Banerjee, 2012).
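For intuition on the averaging step in (17), the following sketch (toy numbers of our own, not from the paper) computes exactly, by enumeration, that averaging independent identically distributed estimates divides the variance by the number of samples; negatively correlated samples, as produced by sampling without replacement, can only do better:

```python
from itertools import product

def var(dist):
    """Exact variance of a finite distribution {value: prob}."""
    mean = sum(v * p for v, p in dist.items())
    return sum(p * (v - mean) ** 2 for v, p in dist.items())

def mean_of_n(dist, n):
    """Exact distribution of the average of n independent copies."""
    out = {}
    for combo in product(dist.items(), repeat=n):
        v = sum(x for x, _ in combo) / n
        p = 1.0
        for _, q in combo:
            p *= q
        out[v] = out.get(v, 0.0) + p
    return out

# a toy per-sample estimator: unbiased but noisy
dist = {0: 0.5, 4: 0.25, 8: 0.25}
v1 = var(dist)               # variance of a single sample
v3 = var(mean_of_n(dist, 3)) # variance after averaging 3 samples
```

The exact ratio v1 / v3 = 3 matches the 1/n_I reduction for independent samples.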
From the law of total variance:
$\mathbb{V}[Z_s] = \mathbb{E}\big[\mathbb{V}[Z_s \mid s]\big] + \mathbb{V}\big[\mathbb{E}[Z_s \mid s]\big]$ (18)
Here, as , but is a fixed value in instead of a random variable. In the second term above, we show that in the proof of Lemma 1. Then,
(19)
The (> 1) factor has been missing in the original paper of Assadi et al. (Assadi et al., 2019), incorrectly reducing an upper bound of .
Proof of Lemma 1.
We use an induction hypothesis that GenericCardEst (Algorithm 1) returns an unbiased estimate of . For the base case where , Line 1 returns , where . Here, since the recursive call to GenericCardEst reaches Line 1. Since ’s follow an identical distribution, and
(20)
Note that the assumption in Lemma 1 is used. If , then < , leading to a negative bias.
For the inductive case where , we use the law of total expectation and our induction hypothesis at Line 1, achieving
(21)
The assumption is used again. The last equality holds from Lines 4-4 of GenericJoin (Algorithm 4); iterates over , not sampled from . Therefore, .
Proof of Lemma 2.
We first rewrite the AGM bounds in LHS using their definitions. Given an optimal fractional edge cover of , the LHS is
(22)
We partition 1) into and 2) into since any must contain an attribute in or .
Since , . Chen & Yi (Chen and Yi, 2020) also compute the probability of only if contributes to any join answer.
(23)
Proof of Proposition 1.
We use an induction hypothesis on , the number of odd cycles and stars in .
For the base case when , consists of either a single odd cycle or a star. As explained in appendix D, the maximum recursion depth is 2 + 1. We use (and ) and (and ) to denote the (and ) values of the first two recursions, where in the last recursion so Lines 1-1 of Algorithm 1 are executed.
If is an odd cycle of vertices , from Lemma 8, , and , where . Recall that for .
We explain SSTE first. Since and , from (18). For the first term,
(24)
For the second term of the total variance, the new proof of SSTE reduces the factor in (19) into . We have for denoting the sampled data vertex for . If , then . Otherwise, recall that is chosen to have the smallest degree among the cycle vertices. Then, there can be at most vertices in the data graph with degrees larger than ; by contradiction, the total number of edges would be larger than if more than vertices had degrees larger than . Hence, we have at most results for the cycle given , i.e., :
(25)
For SUST, is bounded by from (25) and the following bound for completes the proof for SUST given an odd cycle.
(26)
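The counting step in the argument above is an instance of the handshake lemma: since degrees sum to 2|E|, fewer than 2|E|/d vertices can have degree above d, and taking d = √(2|E|) caps the number of high-degree vertices at √(2|E|). A small self-contained check (the degree sequence is our own toy example):

```python
import math

def high_degree_count(degrees, d):
    """Number of vertices with degree strictly larger than d."""
    return sum(1 for x in degrees if x > d)

# two-hub-plus-leaves toy graph; sum(degrees) = 2|E| = 34
degrees = [7, 7, 6] + [1] * 14
m2 = sum(degrees)       # 2|E|
d = math.sqrt(m2)       # threshold sqrt(2|E|)
hi = high_degree_count(degrees, d)
# if more than 2|E|/d vertices had degree > d, the degree sum
# would exceed 2|E| -- the contradiction used in the proof
```

Here hi = 3, comfortably below the cap 2|E|/d = √(2|E|) ≈ 5.83.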
If is a star of edges , from Lemma 8. For SSTE, contains the center attribute of only, , is the set of attributes of except , and for a sampled center vertex .
Again, since . For the first term,
(27)
For the second term of the total variance, from (19). Therefore, we have for SSTE given a star.
For SUST, is the set of attributes of the star, , and . Thus, .
For the inductive case when , consists of multiple odd cycles and stars. The proofs for both SSTE and SUST are naive expansions of (24)-(27) conditioned on multiple components instead of a part of an odd cycle or a star. Please refer to (Assadi et al., 2019) for more details.
Proof of Proposition 2.
We use an induction hypothesis .
For the base case and , , and regardless of . Therefore, .
For the inductive case ,
(28)
(29)
Therefore, from (17),
(30)
Proof of Proposition 3.
We use an induction hypothesis .
For the base case and , , for . Hence,
(31)
For the inductive case , we again use the induction hypothesis:
(32)
From (19), we have
(33)
Therefore, plugging these into (18) and setting in (17) give . In addition, in the proposition can be removed with a simple proof by Chen & Yi (Chen and Yi, 2020) using the variance of Binomial distribution.
Proof of Proposition 4.
We use an induction hypothesis , similar to the proof of Proposition 3.
For the base case and ,
(34)
Note that, compared to (31) of GJ-Sample, the upper bound of has increased by a constant factor . Similarly, we can easily show that the upper bounds of both (32) and (33) increase by times, proving the inductive case of Proposition 4. By using the proof by Chen & Yi (Chen and Yi, 2020), we can remove the factor from the proposition as well.
Now, we prove the bounds for (or ). In Algorithm 1, Lines 1-1 take . Line 1 takes time for Alley+ due to the intersection and for the others, i.e., is readily available from the query model in section 5.2. Line 1 takes for GJ-Sample for non-sequenceable queries and for other cases. Line 1 takes . Line 1 takes ; computing relative degrees takes in DRS. Line 1 takes . Note that takes time to evaluate.
From this, we can recursively define , the runtime of computing :
(35)
Proof of Proposition 5.
In SSTE, if consists of a single vertex of an odd cycle and otherwise. We prove for the first case. Then, if there are cycles in with each, from (35).
for depends solely on the sampled data edge for . Recall that and since has the smallest degree among the cycle vertices. The last equality below holds from a well-known fact in graph theory (Assadi et al., 2018).
(36)
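The "well-known fact" referenced above — that the edge-wise sum of minimum endpoint degrees is O(|E|^{3/2}) — can be sanity-checked numerically; the constant 3 below is only illustrative slack, not the tight constant from (Assadi et al., 2018):

```python
from itertools import combinations

def min_degree_sum(edges):
    """Sum of min(deg(u), deg(v)) over all edges (u, v)."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return sum(min(deg[u], deg[v]) for u, v in edges)

# complete graph K_8: every edge has two high-degree endpoints,
# close to the extremal case for this quantity
edges = list(combinations(range(8), 2))   # |E| = 28, all degrees 7
total = min_degree_sum(edges)             # 28 * 7 = 196
bound = 3 * len(edges) ** 1.5             # 3 * 28^{3/2} ~ 444.5
```

Complete (or complete bipartite) graphs show the |E|^{3/2} order is the best possible, which is why it appears in the runtime bound.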
Proof of Proposition 6.
Simply, Alley+ takes portion of all paths searched by GenericJoin in expectation. Therefore, . is in the worst case, since the computation cost of GenericCardEst for Alley+ is subsumed by GenericJoin.
Proof of Proposition 7.
In GJ-Sample, is fixed to 1 regardless of . Therefore, for sequenceable queries and for non-sequenceable queries from (35). Since the maximum recursion depth is , for sequenceable queries (note the excluded preprocessing cost) and for non-sequenceable queries.
Proof of Proposition 8.
In DRS, is also fixed to 1, and . Hence, . For SUST, is also fixed to 1, and . Hence, .
Proof of Lemma 3.
The unbiasedness of GHDCardEst can be explained by the following (37), since is an unbiased estimate of . Note that if we replace GenericCardEst with GenericJoin in Algorithm 2, then GHDCardEst will return OUT.
(37)
Proof of Lemma 4.
We use an induction hypothesis on .
For the base case so for a node , = . Therefore, = .
For the inductive case , we use an induction hypothesis that for every for some . For any with , we take any . Since and are independent,
(38)
Proof of Lemma 5.
If and are independent (), then . If not, , , and are independent, and and . Therefore,
(39)
(40)