\useunder

Guaranteeing the $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ Runtime for Uniform Sampling and Size Estimation over Joins

Kyoungmin Kim [email protected] Pohang University of Science and Technology (POSTECH)PohangRepublic of Korea , Jaehyun Ha [email protected] Pohang University of Science and Technology (POSTECH)PohangRepublic of Korea , George Fletcher [email protected] Eindhoven University of Technology (TU/e)EindhovenNetherlands and Wook-Shin Han [email protected] Pohang University of Science and Technology (POSTECH)PohangRepublic of Korea

(2023)

Abstract.

We propose a new method for estimating the number of answers OUT of a small join query $Q$ in a large database $D$ , and for uniform sampling over joins. Our method is the first to satisfy all the following statements.

•

Support arbitrary $Q$ , which can be either acyclic or cyclic, and contain binary and non-binary relations.
•

Guarantee an arbitrary small error with a high probability always in $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ time, where $AGM$ is the AGM bound (an upper bound of OUT), and $\tilde{O}$ hides the polylogarithmic factor of input size.

We also explain previous join size estimators in a unified framework. All methods including ours rely on certain indexes on relations in $D$ , which take linear time to build offline. Additionally, we extend our method using generalized hypertree decompositions (GHDs) to achieve a lower complexity than $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ when OUT is small, and present optimization techniques for improving estimation efficiency and accuracy.

Join size estimation, Uniform sampling, Worst-case optimal

^†^†journalyear: 2023^†^†copyright: acmlicensed^†^†conference: Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems; June 18–23, 2023; Seattle, WA, USA.^†^†booktitle: Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS ’23), June 18–23, 2023, Seattle, WA, USA^†^†price: 15.00^†^†isbn: 979-8-4007-0127-6/23/06^†^†doi: 10.1145/3584372.3588676^†^†ccs: Information systems Database query processing^†^†ccs: Mathematics of computing Probability and statistics

1. Introduction

The evaluation of join queries is one of the most fundamental problems in databases (Veldhuizen, 2012; Ngo et al., 2012, 2014; Li et al., 2016; Zhao et al., 2018; Aberger et al., 2017). In theory, reducing the time complexity of evaluation algorithms has been the main goal (Yannakakis, 1981; Ngo et al., 2014; Gottlob et al., 2016; Abo Khamis et al., 2017). This also applies to the counting problem (Joglekar et al., 2016), which we study in this paper. Given a join query $Q$ and a database $D$ , an answer of $Q$ in $D$ is a mapping from free variables of $Q$ to attribute values in $D$ (or inversely as stated in (Focke et al., 2022)) constrained by the relations in $Q$ . For example, if a database $D$ of a social network has a binary relation $R(A,B)$ of friendship and a tuple $(a,b)\in R(A,B)$ indicates that two people $a,b$ are friends, finding all triples $(a,b,c)$ of people who are all friends, is equivalent to finding the answers of the triangle join query ( $R(B,C)$ and $R(A,C)$ are renamed relations of $R(A,B)$ ):

(1)

\begin{split}Q(A,B,C)=R(A,B)\wedge R(B,C)\wedge R(A,C)\end{split}

The complexity of evaluating join queries is commonly expressed as $\tilde{O}(\text{IN}^{w}+\text{OUT})$ where IN is the input size and $w(\geq 1)$ is a width of $Q$ (see section 2 for definition of IN and example widths). While acyclic queries can be evaluated with $w=1$ (Yannakakis, 1981), cyclic queries have higher $w$ values, i.e., more difficult to solve. If $w$ is fractional edge cover number $\rho$ (see Definition 1), $\text{IN}^{w}$ becomes $AGM$ , the AGM bound (Atserias et al., 2013) which is an upper bound of OUT. Instead of evaluating join queries, the counting problem is to compute $|Q|$ , the join size or the number of answers (e.g., triples above). It has its own application for answering COUNT queries in DBMS. Then, the complexity can be reduced from $\tilde{O}(\text{IN}^{w}+\text{OUT})$ to $\tilde{O}(\text{IN}^{w})$ (Joglekar et al., 2016).

Approximate counting problem is to approximate $|Q|$ . Approximate algorithms are alternatively used whenever the exact counting is expensive and approximation is enough. In practice, a famous application is cost-based query optimization which requires efficient approximation of thousands of sub-queries (Leis et al., 2015). In theory, going beyond the polynomial time algorithms, approximate algorithms established the complexity inversely proportional to OUT, under the requirement that guarantees an arbitrary small error with a high probability (see section 2 for formal definitions). In a limited setting where every relation in $Q$ is binary and identical (e.g., as in (1)), the complexity of $\tilde{O}(\text{IN}^{w})$ has reduced to $\tilde{O}\big{(}\frac{\text{IN}^{w}}{\text{OUT}}\big{)}$ (Assadi et al., 2019; Fichtenberger et al., 2020). However, in a general setting where $Q$ contains higher-arity relations, the complexity has reduced to $\tilde{O}\big{(}\frac{\text{IN}^{w+1}}{\text{OUT}}\big{)}$ or $\tilde{O}\big{(}\frac{\text{IN}^{w}}{\text{OUT}}+\text{IN}\big{)}$ only (Chen and Yi, 2020), with an additional multiplicative or additive factor of IN. Furthermore, all these methods have $w=\rho$ only (i.e., $\tilde{O}\big{(}\frac{\text{IN}^{w}}{\text{OUT}}\big{)}$ = $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ ), and are not extended to other widths such as fractional hypertree width (see Definition 3) smaller than $\rho$ . In this paper, we propose a new method that achieves $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ complexity for $Q$ containing any-arity relations, under the same $w=\rho$ . As a corollary, ours achieve a linear or even constant complexity for a sufficiently large OUT. We also extend ours to incorporate fractional hypertree width by using generalized hypertree decompositions (GHDs), achieving a lower complexity than $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ when OUT is small.

A closely related problem to approximate counting is uniform sampling, which has its own important applications such as training set generation for machine learning (Zhao et al., 2018). Our method also supports uniform sampling over joins. We define the problems and technical concepts in section 2, explain related work in section 3 and overview of our results in section 4. We present our method of size estimation and uniform sampling over joins in section 5 including a unified framework of previous work, extend our method using GHDs in section 6, propose optimization techniques in section 7 and future work in section 8. More details of existing algorithms and extension of our algorithms to join-project (a.k.a. conjunctive) queries are presented in Appendix.

2. Technical Background

Hypergraph. A join query $Q$ is commonly represented as a hypergraph $\mathcal{H}(\mathcal{V},\mathcal{E})$ , where hypernodes $\mathcal{V}$ is the set of variables in $Q$ and hyperedges $\mathcal{E}\subseteq 2^{\mathcal{V}}$ (Ngo et al., 2014). Each hyperedge $F\in\mathcal{E}$ is a subset of $\mathcal{V}$ and specifies the relation $R(F)$ (or $R_{F}$ for brevity) in $Q$ . For example, the query in (1) has $\mathcal{V}=\{A,B,C\}$ and $\mathcal{E}=\{\{A,B\},\{B,C\},\{A,C\}\}$ . For any $I\subset\mathcal{V}$ , we define two notations used throughout the paper: $\mathcal{E}_{I}\coloneqq\{F\in\mathcal{E}|F\cap I\neq\emptyset\}$ and $\mathcal{E}_{I,\pi}\coloneqq\{\pi_{I}{F}|F\in\mathcal{E}_{I}\}$ , i.e., the set of hyperedges in $\mathcal{E}$ that intersect with $I$ and the projection of each edge in $\mathcal{E}_{I}$ onto $I$ .

In relational algebra, $Q$ is equivalent to $\Join_{F\in\mathcal{E}}R_{F}={R_{F_{1}}\Join...\Join R_{F_{\absolutevalue{\mathcal{E}}}}}$ . Here, join ( $\Join$ ) is a commutative binary operator between relations, allowing us to use $\Join_{F\in\mathcal{E}}$ . Table 1 summarizes the operators used in this paper.

Table 1. Operators used in the paper.

$\Join$	natural join operator
$\ltimes$	semi-join operator
$\pi$	projection operator
$\sigma$	selection operator
$\uplus$	list concatenation operator

Since all variables in $\mathcal{V}$ are free (output) variables in join queries, join queries are subsumed by full conjunctive queries. For a more general class of join-project (a.k.a. conjunctive) queries, the set of output variables is a subset of $\mathcal{V}$ . We assume $Q$ is a join query and extend to join-project queries in appendix C.

Input and output size. We denote $\text{IN}(\mathcal{H},D)$ = $\max_{F\in\mathcal{E}}\absolutevalue{R_{F}}$ as the input size ( $|R_{F}|$ is the number of tuples in $R_{F}$ ) and $\text{OUT}(\mathcal{H},D)=|Q|$ , the number of query answers, as the output size of $Q$ (or $\mathcal{H}$ ) in $D$ . We drop $\mathcal{H}$ and $D$ if the context is clear, as well as in upcoming notations.

Complexity of join. The time complexity of join evaluation algorithms is commonly expressed as $\tilde{O}(\text{IN}^{w}+\text{OUT})$ where $w$ is a width of $Q$ , and $\tilde{O}$ hides a polylogarithmic factor of IN (Ngo et al., 2014). We assume that $Q$ is extremely small compared to $D$ and regard $|\mathcal{V}|$ and $|\mathcal{E}|$ as constants as in (Ngo et al., 2014; Assadi et al., 2019; Chen and Yi, 2020). Worst-case optimal (WCO) algorithms such as GenericJoin (Ngo et al., 2014) (Algorithm 4) achieve $w=\rho(\mathcal{H},D)$ , called fractional edge cover number:

Definition 1.

(Ngo et al., 2014): The fractional edge cover number ${\rho(\mathcal{H},D)}$ is the optimal solution of the linear program (LP): $\min_{{\{x_{F}|F\in\mathcal{E}\}}}$ $\sum_{F\in\mathcal{E}}x_{F}$ $\cdot\log_{\text{IN}}\absolutevalue{R_{F}}$ s.t. $\sum_{F\ni v}x_{F}\geq 1:\forall v\in\mathcal{V}$ and $0\leq x_{F}\leq 1:\forall F\in\mathcal{E}$ .

It is well known that $\prod_{F\in\mathcal{E}}\absolutevalue{R_{F}}^{x_{F}}$ is an upper bound of OUT for any fractional edge cover $x$ satisfying the constraints of the LP (Atserias et al., 2013; Friedgut and Kahn, 1998). The optimal value, $\text{IN}^{\rho}=\min_{x}\prod_{F\in\mathcal{E}}\absolutevalue{R_{F}}^{x_{F}}$ , is called the AGM bound of $\mathcal{H}$ on $D$ denoted as $AGM(\mathcal{H},D)$ (Atserias et al., 2013). For example, the query in (1) has $\text{IN}=|R_{\{A,B\}}|=|R_{\{B,C\}}|=|R_{\{A,C\}}|$ and $\rho=3/2$ from the optimal solution $x_{\{A,B\}}=x_{\{B,C\}}=x_{\{A,C\}}=1/2$ for the LP in Definition 1. Therefore, its AGM bound is $\text{IN}^{\rho}=\text{IN}^{3/2}$ .

Generalized hypertree decomposition. As a special case, the Yannakakis algorithm (Yannakakis, 1981) achieves $w=1$ if $Q$ is acyclic. It builds a join tree by assigning each hyperedge $F\in\mathcal{E}$ to a tree node and performs dynamic programming (DP)-like join over the nodes’ outputs (i.e., $R_{F}$ ’s) in a bottom-up manner.

To bridge the gap between $w={\rho}$ and $w=1$ when $Q$ is cyclic, generalized hypertree decompositions (GHDs, see Definition 2) (Gottlob et al., 2002) extend the tree for arbitrary joins by assigning multiple hyperedges to a tree node, where the DP-like join is performed over these nodes’ outputs (i.e., answers of the corresponding sub-query) computed by WCO algorithms. This way, $w$ decreases from ${\rho}$ to fractional hypertree width ${fhtw}$ (see Definition 3). We refer to a comprehensive survey (Ngo et al., 2014) for more details about these concepts and appendix A for our explanations of algorithms.

Definition 2.

(Gottlob et al., 2002): A GHD of $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is a pair ( $\mathcal{T},\chi$ ) where 1) $\mathcal{T}$ is a tree of $(\mathcal{V}_{\mathcal{T}},\mathcal{E}_{\mathcal{T}})$ , 2) $\chi:\mathcal{V}_{\mathcal{T}}\rightarrow 2^{\mathcal{V}}$ , 3) for each $F\in\mathcal{E}$ , there exists $t\in\mathcal{V}_{\mathcal{T}}$ s.t. $F\subseteq\chi(t)$ , and 4) for each $v\in\mathcal{V}$ , $\{t\in\mathcal{V}_{\mathcal{T}}|v\in\chi(t)\}$ forms a non-empty connected subtree of $\mathcal{T}$ .

Here, $\chi(t)$ is called the bag of $t$ , and 4) is called the running intersection property. Each node $t$ corresponds to a sub-hypergraph $\mathcal{H}_{\chi(t)}$ of $\mathcal{H}$ where $\mathcal{H}_{\chi(t)}\coloneqq(\chi(t),\mathcal{E}_{\chi(t),\pi})$ . We then define the fractional hypertree width (Grohe and Marx, 2014).

Definition 3.

(Grohe and Marx, 2014): The fractional hypertree width 1) of a GHD ( $\mathcal{T},\chi$ ) of $\mathcal{H}$ , $fhtw(\mathcal{T},\mathcal{H})$ , is defined as $\max_{t\in\mathcal{V}_{\mathcal{T}}}\rho(\mathcal{H}_{\chi(t)})$ , and 2) of $\mathcal{H}$ , $fhtw(\mathcal{H})$ , is defined as $\min_{(\mathcal{T},\chi)}fhtw(\mathcal{T},\mathcal{H})$ .

Given a GHD ( $\mathcal{T},\chi$ ), GHDJoin (Joglekar et al., 2016) performs GenericJoin for each $\mathcal{H}_{\chi(t)}(t\in\mathcal{V}_{\mathcal{T}})$ and Yannakakis (Yannakakis, 1981) on these results along the join tree (see Algorithms 5-7 in section A.2-A.3). Since Yannakakis runs in linear - $\tilde{O}(\text{IN}+\text{OUT})$ - time for $\alpha$ -acyclic join queries (Brault-Baron, 2016), the total runtime of GHDJoin is $\tilde{O}(\text{IN}^{fhtw(\mathcal{T},\mathcal{H})}+\text{OUT})$ , since the input size for Yannakakis is $\tilde{O}\big{(}\text{IN}^{\max_{t\in\mathcal{V}_{\mathcal{T}}}\rho(\mathcal{H}_{\chi(t)})}\big{)}$ after GenericJoin. It is sufficient to execute a brute-force algorithm to find $fhtw(\mathcal{H})$ since the number of possible GHDs is bounded by the query size.

Counting. For the counting problem, the complexity is reduced to $\tilde{O}(\text{IN}^{w})$ without the +OUT term. This is a direct result of Joglekar et al. (Joglekar et al., 2016), solving a more general problem of evaluating aggregation queries (e.g., count, sum, max). We explain in detail in section 6.1.

Approximate counting. Approximate counting achieves even smaller complexities. The primary goal of approximate counting, especially using sampling, is to obtain an arbitrarily small error with a high probability, formally defined as (Assadi et al., 2019; Chen and Yi, 2020):

Definition 4.

(Assadi et al., 2019): For a given $\mathcal{H}$ , error bound $\epsilon\in(0,1)$ , and probability threshold $\delta\in(0,1)$ , if $P(\absolutevalue{Z-\text{OUT}}\geq\epsilon\text{OUT})\leq\delta$ for a random variable $Z$ approximating OUT, then $Z$ is a $(1\pm\epsilon)$ -approximation of OUT.

The challenge is, how far can we reduce the complexity while achieving the above goal. Assume that we approximate OUT using an unbiased estimator with a random variable $Y$ , i.e., $\operatorname*{\mathbb{E}}[Y]=\text{OUT}$ , where $\operatorname*{\mathbb{E}}[Y]$ is the expectation of $Y$ . Assume that the variance $\operatorname*{\mathbb{V}ar}[Y]$ of $Y$ is bounded and an upper bound $U_{var}$ is known, and the time complexity of instantiating (or computing) $Y$ is upper-bounded by $U_{time}$ . Then, we can trade off the approximation accuracy and efficiency by controlling the number of runs $N$ for sampling. Formally, let $Z=(Y_{1}+Y_{2}+...+Y_{N})/N$ where every $Y_{i}$ ( $1\leq i\leq N$ ) is an equivalent random variable to $Y$ . Then, $\operatorname*{\mathbb{E}}[Z]=\operatorname*{\mathbb{E}}[Y]=\text{OUT}$ and $\operatorname*{\mathbb{V}ar}[Z]=\operatorname*{\mathbb{V}ar}[Y]/N$ . Hence, by setting $N=\frac{U_{var}}{\epsilon^{2}\delta\text{OUT}^{2}}$ , we can achieve the following from the Chebyshev’s inequality:

(2)

\begin{split}P(|Z-\text{OUT}|\geq\epsilon\text{OUT})\leq\frac{\operatorname*{\mathbb{V}ar}[Z]}{\epsilon^{2}\operatorname*{\mathbb{E}}[Z]^{2}}=\frac{\operatorname*{\mathbb{V}ar}[Y]}{N\epsilon^{2}\text{OUT}^{2}}\leq\frac{U_{var}}{N\epsilon^{2}\text{OUT}^{2}}=\delta\end{split}

Since the complexity of computing each $Y_{i}$ is bounded by $U_{time}$ , the complexity of computing $Z$ is bounded by $N\cdot U_{time}=\frac{U_{var}U_{time}}{\epsilon^{2}\delta\text{OUT}^{2}}$ . Therefore, the key challenge is to implement an unbiased estimator with $U_{var}U_{time}$ as smallest as possible. $\epsilon$ and $\delta$ are regarded as constants as in $\tilde{O}$ (Assadi et al., 2019; Chen and Yi, 2020). One can easily replace $\frac{1}{\delta}$ in $N$ with $\log\frac{1}{\delta}$ using the median trick (Jerrum et al., 1986).

Uniform sampling. If every query answer has the same probability $p$ to be sampled, then the sampler is a uniform sampler (Chen and Yi, 2020; Fichtenberger et al., 2020). Since there are $|Q|=\text{OUT}$ answers, $p$ can be at most $1/\text{OUT}$ . If $p<1/\text{OUT}$ , the sampling has the probability to fail (to sample any of the query answers) which is $1-p\cdot\text{OUT}$ .

For acyclic queries, uniform sampling can be done in $\tilde{O}(1)$ time after $\tilde{O}(\text{IN})$ -time preprocessing (Zhao et al., 2018), by additionally storing the intermediate join sizes of tuples during the bottom-up DP in Yannakakis (see section A.2 for more details).

3. Related Work

In the context of approximate counting (and uniform sampling, which is closely related), there have been two lines of work: 1) determining the classes of queries of tractable and intractable cases and 2) reducing the degree of polynomial-time complexity for tractable cases, especially for join or join-project queries.

The first class of work widens the class of queries that admit fully polynomial-time randomized approximation scheme (FPRAS) and fully polynomial-time almost uniform sampler (FPAUS) (Arenas et al., 2021). Arenas et al. (Arenas et al., 2021) showed that every class of conjunctive queries with bounded hypertree width admits FPRAS and FPAUS. Focke et al. (Focke et al., 2022) extended this to conjunctive queries with disequalities and negations, under some relaxations from FPRAS. These works focus on showing the existence of a polynomial-time algorithm over a wider class of queries, but not on reducing the exact degree of the polynomial complexity.

The second class of work (Assadi et al., 2019; Fichtenberger et al., 2020; Chen and Yi, 2020; Bera and Chakrabarti, 2017; Aliakbarpour et al., 2018; Eden et al., 2020) focuses on reducing the degree for a specific range of queries, and even further to making the complexity inversely proportional to OUT, e.g., $\tilde{O}\big{(}\frac{\text{IN}^{w}}{\text{OUT}}\big{)}$ . However, they all consider $w=\rho$ only (i.e., $\tilde{O}\big{(}\frac{\text{IN}^{w}}{\text{OUT}}\big{)}$ is $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ ), and the complexity $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ has been achieved for limited classes of queries. Assadi et al. (Assadi et al., 2019) proposed a Simple Sublinear-Time Estimator (SSTE for short) with the complexity of $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ for unlabeled graph queries, i.e., where $Q$ contains identical binary relations only as in the query in (1). However, we found that a missing factor in the proof by Assadi et al. (Assadi et al., 2019) prevented SSTE from satisfying the $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ bound and thus notified the author to modify the proof (see appendix E). They also proposed some conjectures on labeled graphs (i.e., removing the identity assumption), but 1) no concrete algorithms were given and 2) the proposed bound for labeled graphs is larger than $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ (see section D.2). Therefore, it is not known whether these methods can be extended to labeled graphs with $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ time. Fichtenberger et al. (Fichtenberger et al., 2020) extended SSTE to sampling Subgraphs Uniformly in Sublinear Time (SUST for short) for uniform sampling while achieving the same complexity. Here, they said sublinear from the assumption that OUT is large as $\Omega(\text{IN}^{\rho-1})$ . Unlike join processing, it is unnecessary to scan the inputs for each query that requires $\tilde{O}(\text{IN})$ time.

Kim et al. (Kim et al., 2021) proposed Alley, a hybrid method that combines sampling and synopsis. We analyze the sampling part of Alley since analyzing the synopsis is out of scope. Alley solves the approximate subgraph counting problem (but not uniform sampling) for labeled graphs, where the relations in $Q$ are binary but not necessarily identical. The complexity, however, is $\tilde{O}(AGM)$ instead of $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ , which we explain why in section 5.

For a more general setting where relations in $Q$ can have higher arities, Chen & Yi (Chen and Yi, 2020) proposed GJ-Sample that also performs uniform sampling, and achieved $\tilde{O}\big{(}\frac{\text{IN}\,AGM}{\text{OUT}}\big{)}$ complexity. The additional factor of IN compared to SSTE and SUST results from computing sample probabilities (or weights) prior to sampling, where the number of candidates is $\tilde{O}(\text{IN})$ . We explain in detail in section 5. Therefore, they do not achieve $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ complexity for arbitrary join queries and leave it as an open problem (Chen and Yi, 2020). Note that it is important to remove the IN factor, for IN can be higher than $\frac{AGM}{\text{OUT}}$ when relation sizes are highly skewed. For example, for a triangle query $\mathcal{H}$ with $\mathcal{E}=\{F_{1},F_{2},F_{3}\}$ , $|R_{F_{1}}|=N^{10},|R_{F_{2}}|=|R_{F_{3}}|=N$ , and $\text{OUT}=N$ , then IN is $N^{10}$ and $\frac{AGM}{\text{OUT}}$ is $\frac{N^{2}}{N}=N$ for $\rho=\frac{1}{5}$ .

We notice that Deng et al. (Deng et al., 2023) have independently pursued the same primary objective as ours, which is to attain $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ for uniform sampling and size estimation over joins. Both papers have been accepted in this PODS. The authors recursively 1) partition the input space into a constant number of sub-spaces and 2) sample a sub-space, to reduce the number of sample candidates from $\tilde{O}(\text{IN})$ to $\tilde{O}(1)$ . Our approach differs from theirs since we do not compute probabilities for candidates prior to sampling nor modify the set of candidates. Instead, we exploit known probability distributions to sample a candidate.

4. Our Results

We first analyze existing sampling-based approximation algorithms in a new unified framework in section 5. We present a new algorithm that bounds $U_{var}U_{time}$ by $\tilde{O}(AGM\cdot\text{OUT})$ for arbitrary join queries for the first time, achieving the complexity of $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ (see section 2). To avoid pre-sampling overheads in GJ-Sample, we propose degree-based rejection sampling (DRS) that first samples a sample space from a meta sample space and then samples a value from the sampled space, where the value can be rejected based on its degree. We explain the details in section 5. The results of Chen & Yi then directly hold for DRS. The removed overhead reduces the complexity down to $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ for arbitrary joins. We also extend DRS using GHDs in section 6. The following theorems state our results which we prove in the subsequent sections.

Theorem 1.

DRS performs $(1\pm\epsilon)$ -approximation of OUT in $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ time.

Theorem 2.

If OUT is small enough, DRS with GHDs can perform $(1\pm\epsilon)$ -approximation of OUT with a lower complexity than $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ . A sufficient condition is when OUT is $\tilde{O}(\text{IN}^{\rho-fhtw})$ .

Additionally, we extend Alley to support arbitrary join queries in section 5 and a core lemma of SSTE/SUST to hold for labeled graphs in appendix D. We discuss our extension to join-project (a.k.a. conjunctive queries) in appendix C and extension of the $\tilde{O}(1)$ -time uniform sampler (Zhao et al., 2018) in section 2 to support cyclic queries using our framework in section A.2.

5. Achieving $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ Bound

5.1. Generic Sampling-based Framework

We can classify existing sampling-based methods into three groups: variable-at-a-time (a.k.a., vertex-at-a-time) (Kim et al., 2021; Chen and Yi, 2020), edge-at-a-time (Li et al., 2016), and component-at-a-time (Assadi et al., 2019; Fichtenberger et al., 2020). The first/second group samples a variable/hyperedge at a time until every variable/hyperedge in $\mathcal{H}$ is sampled, while the third group samples a component, either a cycle with an odd number of edges or a star, at a time, having a larger sample granularity than the other two groups.

To analyze these groups in a unified manner, we propose a powerful sampling-based estimation framework GenericCardEst (Algorithm 1). It returns an unbiased estimate of $\text{OUT}(\mathcal{H}_{s})$ , where $\mathcal{H}_{s}$ is the residual hypergraph given a current sample $s$ which is initially empty. Formally, $\mathcal{H}_{s}\coloneqq\Join_{F\in\mathcal{E}_{\mathcal{O}}}\pi_{\mathcal{O}}(R_{F}\ltimes s)$ and $\mathcal{H}_{\emptyset}=\mathcal{H}$ for $s=\emptyset$ .

Input: Query hypergraph

\mathcal{H}

, variables

\mathcal{O}

to sample, and current sample

s

2if (

\mathcal{O}=\emptyset

) then

3 return 1

I\leftarrow\textsc{{GetNonEmptySubset}}(\mathcal{O})

I\neq\emptyset

\Omega_{I},k\leftarrow\textsc{{GetSampleSpaceAndSize}}(I,\mathcal{H},s)

P\leftarrow{\textsc{{GetDistribution}}}(\Omega_{I})

P(\perp)=1-\sum_{s_{I}\in\Omega_{I}}P(s_{I})

S_{I}\leftarrow{\textsc{{SampleFrom}}}(\Omega_{I},k,P)

|S_{I}|=k

{S_{I},k(=|S_{I}|),\{P(s_{I})|s_{I}\in S_{I}\}\leftarrow\textsc{{RejectAndAdjust}}(S_{I},P,\mathcal{H},s)}

return

\frac{1}{k}\sum_{s_{I}\in S_{I}}\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]}\cdot{\textsc{{GenericCardEst}}(\mathcal{H},\mathcal{O}\setminus I,s\uplus s_{I})}

Algorithm 1 GenericCardEst(

\mathcal{H}(\mathcal{V},\mathcal{E}),\mathcal{O},s

)

An invariance is that $s\in\,\Join_{F\in\mathcal{E}_{\mathcal{V}\setminus\mathcal{O}}}\pi_{\mathcal{V}\setminus\mathcal{O}}R_{F}$ , i.e., $s$ is an answer to the sub-query of $Q$ on the bound variables $\mathcal{V}\setminus\mathcal{O}$ . This is a key to prove Lemma 1. Due to space limitations, we prove all lemmas and propositions in appendix E.

Lemma 1.

GenericCardEst returns an unbiased estimate of $\absolutevalue{\Join_{F\in\mathcal{E}}R_{F}}$ for $\mathcal{O}=\mathcal{V}$ and $s=\emptyset$ .

We now explain each line of Algorithm 1. If there is no variable to sample for, just return one in Lines 1-1. Otherwise, Line 1 takes a subset $I$ of $\mathcal{O}$ to sample at this step. $I$ is 1) a singleton set for variable-at-a-time methods, 2) an edge or its projection for edge-at-a-time methods, or 3) a part of a component for component-at-a-time methods. Line 1 defines the sample space $\Omega_{I}$ and sample size $k$ . Line 1 computes $P$ , a probability distribution over $\Omega_{I}\cup\{\perp\}$ ; $\perp$ is a null tuple that does not belong to any join answer. Line 1 samples $S_{I}$ from $\Omega_{I}$ where $|S_{I}|=k$ . Line 1 optionally rejects samples in $S_{I}$ . Here, $S_{I}$ , $k$ , and $\{P(s_{I})|s_{I}\in S_{I}\}$ are adjusted according to the accepted samples; $P(s_{I})$ is multiplied by the probability of accepting $s_{I}$ . GenericCardEst in Line 1 returns an unbiased estimate of the residual query $\mathcal{H}_{s\uplus s_{I}}$ . The unbiasedness of the final return value at Line 1 is guaranteed by the inverse probability factor, $\frac{1}{P(s_{I})}\operatorname*{\mathbb{I}}[\cdot]$ , which is the key idea of the Horvitz-Thompson estimator (Horvitz and Thompson, 1952). $\operatorname*{\mathbb{I}}[c]$ denotes the indicator which is 1 if $c$ is true and 0 otherwise.

5.2. Query Model

GenericCardEst relies on a certain query model used in previous WCO join and sampling algorithms (Ngo et al., 2014; Veldhuizen, 2012; Ngo et al., 2012; Assadi et al., 2019; Chen and Yi, 2020), following the property testing model (Goldreich, 2017). To avoid any confusion with the join query $Q$ , we refer to the queries here as operations. For a relation $R_{F}$ , the query model allows that $\pi_{I}(R_{F}\ltimes s)$ in GenericCardEst can be readily obtained in $\tilde{O}(1)$ time. We explain the underlying index structures that enable this later in section 5.6 for brevity.

The same indexes provide the following $\tilde{O}(1)$ -time operations as well. Here, $T$ can be a projection/selection of another relation, which is also a relation.

•

Degree( $T,s$ ): $|T\ltimes s|$ for a relation $T$ and a tuple $s$
•

Access( $T,I,i$ ): $i$ -th element of $\pi_{I}(T)$ for a relation $T$ and attributes $I$
•

Exist( $T,r$ ): test whether or not a row $r$ exists in $T$
•

Sample( $T$ ): uniformly sample a tuple $r$ from $T$

Eden et al. (Eden et al., 2017) have developed their algorithms using operations on binary tables. The above four operations are generalizations of theirs to $n$ -ary tables.

5.3. State-of-the-art Sampling-based Estimators

We explain state-of-the-art sampling-based estimators as instances of GenericCardEst.

WanderJoin (Li et al., 2016) is a practical edge-at-a-time method. The edges in $\mathcal{E}$ are first ordered by an index $i\in[1,|\mathcal{E}|]$ . At the $i$ -th step, $I=F_{i}\cap\mathcal{O}$ , $\Omega_{I}=\pi_{I}(R_{F_{i}}\ltimes s)$ , $k=1$ , and $P(s_{I})=\frac{1}{|\Omega_{I}|}$ . That is, one tuple $s_{I}$ is uniformly sampled from $\Omega_{I}$ . Note that $\mathcal{O}$ can reduce to $\emptyset$ before proceeding to the last edge, e.g., the triangle query. However, (Li et al., 2016) proceeds first with sampling for every edge and then checks the join predicates, resulting in unnecessary computations.

Alley+. Alley (Kim et al., 2021) is a variable-at-a-time method (i.e., $|I|=1$ ) and assumes that every $R_{F}$ is binary. We here explain Alley+, our extension of Alley to $n$ -ary relations: $\Omega_{I}=\cap_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$ , $k=\lceil b\cdot|\Omega_{I}|\rceil$ for some fixed $b\in(0,1]$ , and $P(s_{I})=\frac{1}{|\Omega_{I}|}$ . Here, the sampling is done without replacement. Due to the heavy intersection ( $\cap$ ) operation, the runtime of Alley+ is $\tilde{O}(AGM(\mathcal{H}))$ (Proposition 6). At an extreme case of $b=1$ , Alley+ becomes an instance of GenericJoin.

SSTE (Assadi et al., 2019) and SUST (Fichtenberger et al., 2020) are component-at-a-time methods. Assuming that every $R_{F}$ is binary and identical, Lemma 8 in appendix D decomposes $\mathcal{H}$ into a set of components, where each component is either an odd cycle or a star. SSTE and SUST slightly differ in sampling odd cycles and stars, and SUST performs uniform sampling over joins. We explain the details in appendix D due to space limitations.

GJ-Sample (Chen and Yi, 2020) is a variable-at-a-time method (i.e., $|I|=1$ ). Here, $\Omega_{I}=\pi_{I}(R_{F^{GJ}}\ltimes s)$ where $F^{GJ}=\operatorname*{arg\,min}_{F\in\mathcal{E}_{I}}|\pi_{I}(R_{F}\ltimes s)|$ . This avoids the intersection operation in Alley+, allowing GJ-Sample to achieve $\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)}$ runtime for sampling instead of $\tilde{O}(AGM(\mathcal{H}))$ (Chen and Yi, 2020). $k=1$ and $P(s_{I})={\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}}$ where $AGM(\mathcal{H}_{s})=\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}}$ and $AGM(\mathcal{H}_{s\uplus s_{I}})=\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|R_{F}\ltimes(s\uplus s_{I})|^{x_{F}}$ , where $x$ is a fractional edge cover of the original query $\mathcal{H}$ (Chen and Yi, 2020). Then, due to Lemma 7 in appendix B, $\sum_{s_{I}\in\Omega_{I}}P(s_{I})\leq 1$ .

More importantly, setting $P(s_{I})={\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}}$ enables GJ-Sample to perform uniform sampling over $Q$ , with $\frac{1}{AGM(\mathcal{H})}$ probability for each answer of $Q$ . Let $s$ be a final sample that reaches Line 1, and $\mathcal{H}_{s_{1}},\mathcal{H}_{s_{2}},...,\mathcal{H}_{s_{n}}$ be the series of hypergraphs $\mathcal{H}_{s\uplus s_{I}}$ at Line 1. Then, $AGM(\mathcal{H}_{s_{n}})=1$ from $\mathcal{E}_{\mathcal{O}}=\mathcal{E}_{\emptyset}=\emptyset$ for $\mathcal{H}_{s_{n}}$ , and

(3)

\begin{split}P(s)=\frac{AGM(\mathcal{H}_{s_{1}})}{AGM(\mathcal{H})}...\frac{AGM(\mathcal{H}_{s_{n}})}{AGM(\mathcal{H}_{s_{n-1}})}=\frac{AGM(\mathcal{H}_{s_{n}})}{AGM(\mathcal{H})}=\frac{1}{AGM(\mathcal{H})}.\end{split}

Since $|Q|=\text{OUT}$ , a call to GenericCardEst succeeds to sample any answer of $Q$ with $\frac{\text{OUT}}{AGM(\mathcal{H})}$ probability and fails with $1-\frac{\text{OUT}}{AGM(\mathcal{H})}$ probability. Note that if $P(s_{I})$ is set to $\text{OUT}(\mathcal{H}_{s\uplus s_{I}})/\text{OUT}(\mathcal{H}_{s})$ , then $P(s)=\frac{1}{\text{OUT}}$ , so every call to GenericCardEst will succeed. However, computing $\text{OUT}(\mathcal{H}_{s\uplus s_{I}})$ for every $s_{I}\in\Omega_{I}$ is intractable, which is the reason for GJ-Sample to use ${\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}}$ that can be computed in $\tilde{O}(1)$ time.

However, computing ${\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}}$ values at Line 1 takes $\tilde{O}(|\Omega_{I}|)=\tilde{O}(|\pi_{I}(R_{F^{GJ}}\ltimes s)|)=\tilde{O}(\text{IN})$ time. This results in an IN factor in GJ-Sample’s runtime, $\tilde{O}\big{(}\frac{\text{IN}\cdot AGM(\mathcal{H})}{\text{OUT}}\big{)}$ (Chen and Yi, 2020). In order to remove this IN factor, Chen & Yi (Chen and Yi, 2020) separated out sequenceable queries, where all $P(s_{I})$ values required in the sampling procedure can be precomputed in $\tilde{O}(\text{IN})$ time. This results in $\tilde{O}\big{(}\text{IN}+\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)}$ runtime for sequenceable queries and still $\tilde{O}\big{(}\frac{\text{IN}\cdot AGM(\mathcal{H})}{\text{OUT}}\big{)}$ for non-sequenceable queries. Determining whether a query is sequenceable or not requires a brute-force algorithm of $\tilde{O}(1)$ time, since the search space is bounded by the query size (Chen and Yi, 2020).

5.4. Beyond the State of the Art with Degree-based Rejection Sampling

We propose degree-based rejection sampling (DRS) to avoid computing ${\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}}$ values in GJ-Sample while achieving a similar $P(s_{I})$ , only a constant factor smaller. While we set $|I|=1$ and $k=1$ as in GJ-Sample, we sample $F^{*}$ (a counterpart of $F^{GJ}$ in GJ-Sample) uniformly from $\mathcal{E}_{I}$ and let $\Omega_{I}=\pi_{I}(R_{F^{*}}\ltimes s)$ . Therefore, $\{\pi_{I}(R_{F}\ltimes s)\,|\,F\in\mathcal{E}_{I}\}$ is our meta sample space. For each $s_{I}\in\Omega_{I}$ and $F\in\mathcal{E}_{I}$ , we define the relative degree of $s_{I}$ in $R_{F}\ltimes s$ as $rdeg_{F,s}(s_{I})\coloneqq\frac{|R_{F}\ltimes(s\uplus s_{I})|}{|R_{F}\ltimes s|}$ .

In order to use $P(s_{I})=rdeg_{F^{*},s}(s_{I})$ in Line 1, we 1) uniformly sample a row $t$ from $R_{F^{*}}\ltimes s$ and 2) let $s_{I}=\pi_{I}t$ in Line 1, without computing $P(s_{I})$ for every $s_{I}\in\Omega_{I}$ . Then, $rdeg_{F,s}(s_{I})$ for every $F\in\mathcal{E}_{I}$ is computed, and $s_{I}$ is rejected if $F^{*}\neq\operatorname*{arg\,max}_{F\in\mathcal{E}_{I}}rdeg_{F,s}(s_{I})$ (Line 1). We first assume that one $F$ has a higher $rdeg_{F,s}(s_{I})$ than all other edges in $\mathcal{E}_{I}$ . Then, every $P(s_{I})$ is multiplied by $\frac{1}{|\mathcal{E}_{I}|}$ at Line 1, resulting in $P(s_{I})=\frac{1}{|\mathcal{E}_{I}|}rdeg_{F^{*},s}(s_{I})$ . Line 1 further uses $p=\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{rdeg_{F^{*},s}(s_{I})\cdot AGM(\mathcal{H}_{s})}$ ( $\leq 1$ from Lemma 2) as the keeping probability of $s_{I}$ to make the final $P(s_{I})=\frac{1}{|\mathcal{E}_{I}|}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}$ .

Lemma 2.

$\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}\leq\max_{F\in\mathcal{E}_{I}}rdeg_{F,s}(s_{I})=rdeg_{F^{*},s}(s_{I})$ if $s_{I}\in\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$ .

From (3), any final sample $s$ that reaches Line 1 has $P(s)=\frac{1}{AGM(\mathcal{H})}\prod_{I\subset\mathcal{V},|I|=1}\frac{1}{|\mathcal{E}_{I}|}$ probability to be sampled, indicating a uniform sampling over $Q$ . Note that the added term $\prod_{I}\frac{1}{|\mathcal{E}_{I}|}$ is regarded as a constant since it depends on the query size only. If we break our initial assumption so that $m>1$ edges in $\mathcal{E}_{I}$ can have the same maximum relative degree, the $P(s_{I})$ above will increase by $m$ times. Therefore, we simply decrease the keeping probability $p$ by $m$ times to preserve $P(s_{I})=\frac{1}{|\mathcal{E}_{I}|}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}$ .

Example 0.

Let $N$ and $M$ be two arbitrary numbers. Let $R(A,B)=\{(i,1)\,|\,i\in[1,M]\}\cup\{(1,j)\,|\,j\in[2,N]\}$ be a binary relation and $T(A,C)$ is obtained by renaming $B$ to $C$ in $R$ . Suppose we have a join query $Q(A,B,C)=R(A,B)\Join T(A,C)$ . For ease of explanation, we use a hyperedge $F$ and its corresponding relation $R_{F}$ interchangeably. Then, we have an optimal fractional edge cover of $Q$ as $x_{R}=x_{T}=1$ . We choose $I=\{A\}$ , $I=\{B\}$ , and $I=\{C\}$ in turn for DRS. For $I=\{A\}$ and $s=\emptyset$ , $\mathcal{E}_{I}=\{R,T\}$ , so we sample a relation from $\mathcal{E}_{I}$ . Assume that $R$ is sampled, and then a row $t=(1,1)$ is sampled from $R$ . Then, $s_{I}=1$ and $rdeg_{R,s}(s_{I})=\frac{N}{N+M-1}=rdeg_{T,s}(s_{I})$ , so $R$ is not rejected. Since $R$ and $T$ tie w.r.t. $s_{I}$ , we keep $s_{I}$ with probability $p=\frac{1}{2}\frac{1}{rdeg_{R,s}(s_{I})}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}=\frac{1}{2}\frac{1}{rdeg_{R,s}(s_{I})}\frac{N^{2}}{(N+M-1)^{2}}$ . Then, $P(s_{I})=P(s_{\{A\}})=rdeg_{R,s}(s_{I})\frac{1}{2}\frac{1}{rdeg_{R,s}(s_{I})}\frac{N^{2}}{(N+M-1)^{2}}$ $=\frac{1}{2}\frac{N^{2}}{(N+M-1)^{2}}$ . Next, for $I=\{B\}$ and $s=\{A=1\}$ , $\mathcal{E}_{I}=\{R\}$ , and $rdeg_{R,s}(s_{I})=\frac{1}{N}$ for any $s_{I}\in\pi_{I}(R\ltimes s)$ . Then, $P(s_{\{B\}})=\frac{1}{|\mathcal{E}_{I}|}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}=\frac{N}{N^{2}}=\frac{1}{N}$ . Similarly, $P(s_{\{C\}})=\frac{1}{N}$ for $I=\{C\}$ . In total, a final sample $s$ that reaches Line 1 of GenericCardEst has $P(s)=P(s_{\{A\}})P(s_{\{B\}})P(s_{\{C\}})=\frac{1}{2}\frac{1}{(N+M-1)^{2}}=\prod_{I}\frac{1}{|\mathcal{E}_{I}|}\frac{1}{AGM(\mathcal{H})}$ .

5.5. Unified Analysis

This section analyzes the bounds of variance and runtime ( $U_{var}$ and $U_{time}$ in section 2) of sampling-based estimators in a unified manner. Note that Lemma 1 already states the unbiasedness of estimators. Theorem 1 can be proved from Propositions 4 and 8; $U_{var}U_{time}$ is $\tilde{O}(AGM\cdot\text{OUT})$ .

We first define two random variables, 1) $Z_{\mathcal{H}_{s}}$ for the output of GenericCardEst given $s$ and 2) $Z=\sum_{1\leq i\leq N}\frac{Z^{i}_{\mathcal{H}}}{N}$ for our final estimate. Here, $Z^{i}_{\mathcal{H}}$ ’s are independent and identical to $Z_{\mathcal{H}}$ $=Z_{\mathcal{H}_{\emptyset}}$ , and $N$ is the number of initial calls to GenericCardEst with $s=\emptyset$ . Let $T_{\mathcal{H}_{s}}$ be the random variable for the number of core operations (including the four operations in section 5.2) in GenericCardEst and $T$ for the total runtime of obtaining $Z$ . Then, $\operatorname*{\mathbb{V}ar}[Z]=\frac{\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]}{N}$ and $\operatorname*{\mathbb{E}}[T]=N\cdot\operatorname*{\mathbb{E}}[T_{\mathcal{H}}]$ .

Proposition 1.

$\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq 2^{d}\cdot AGM(\mathcal{H})\cdot\text{OUT}$ for SSTE and SUST where $d$ is the number of odd cycles and stars in $\mathcal{H}$ .

Proposition 2.

$\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq\frac{t^{|\mathcal{V}|}-t}{t-1}\cdot\text{OUT}^{2}$ for Alley+ where $t=\frac{2(1-b)}{b}$ .

The upper bound $U_{var}$ of Alley+ explains the unique property of Alley+, that $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]$ approaches to 0 as $b$ approaches to 1. That is, $\lim_{b\to 1}U_{var}=0$ since $\lim_{b\to 1}t=0$ . In fact, this $U_{var}$ is tighter than the original bound $U_{org}=\frac{b}{1-b}(\frac{1}{b^{|\mathcal{V}|}}-1)\cdot\text{OUT}^{2}$ proved in (Kim et al., 2021), where $\lim_{b\to 1}U_{org}=\lim_{b\to 1}\frac{1+b+b^{2}+...+b^{|\mathcal{V}|-1}}{b^{|\mathcal{V}|-1}}$ $\text{OUT}^{2}=|\mathcal{V}|\cdot\text{OUT}^{2}$ .

Proposition 3.

$\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq|\mathcal{V}|\cdot AGM(\mathcal{H})\cdot\text{OUT}$ for GJ-Sample.

Proposition 4.

$\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq|\mathcal{V}|\cdot\prod_{I}|\mathcal{E}_{I}|\cdot AGM(\mathcal{H})\cdot\text{OUT}$ for DRS. Note that $\prod_{I}|\mathcal{E}_{I}|$ is $\tilde{O}(1)$ .

Proposition 5.

$\operatorname*{\mathbb{E}}[T_{\mathcal{H}}]$ is $\tilde{O}(1)$ for SSTE.

Proposition 6.

$\operatorname*{\mathbb{E}}[T_{\mathcal{H}}]$ is $\tilde{O}(b^{|\mathcal{V}|}AGM(\mathcal{H}))$ for Alley+.

Proposition 7.

For GJ-Sample, $T_{\mathcal{H}}$ is $\tilde{O}(1)$ for sequenceable queries and $\tilde{O}(\text{IN})$ for non-sequenceable queries.

Proposition 8.

$T_{\mathcal{H}}$ is $\tilde{O}(1)$ for SUST and DRS.

Until now, we have assumed that OUT is given in computing $N=\frac{U_{var}}{\epsilon^{2}\delta\text{OUT}^{2}}$ before invoking GenericCardEst, which is unrealistic since our goal is to estimate OUT itself. To tackle this, we use the geometric search by Assadi et al. (Assadi et al., 2019). They first assume that OUT is large as $AGM(\mathcal{H})$ and run estimation with a small $N$ (Assadi et al., 2019). Then, they perform a geometric search on OUT; assume a smaller OUT (decreasing OUT by 2), increase $N$ , and run the estimation again. They repeat this until the assumed OUT becomes consistent with the algorithm output. While we implicitly assume that $\text{OUT}>0$ , we can detect when $\text{OUT}=0$ in real applications whenever the assumed OUT falls below 1, in total $\tilde{O}(AGM)$ time.

This geometric search adds a constant factor (= 4) in the $\tilde{O}$ bound of $T_{\mathcal{H}}$ , not a $\log\text{IN}$ factor explained by Chen & Yi (Chen and Yi, 2020). Starting with an assumption $1\leq\frac{AGM(\mathcal{H})}{\text{OUT}}<2$ , assume that the geometric search stops at $2^{k}\leq\frac{AGM(\mathcal{H})}{\text{OUT}}<2^{k+1}$ . Then, the total sample size used so far is $\sum_{1\leq i\leq k+1}\frac{2^{i}}{\epsilon^{2}\delta}\leq\frac{2^{k+2}}{\epsilon^{2}\delta}\leq\frac{4AGM(\mathcal{H})}{\epsilon^{2}\delta\text{OUT}}$ , i.e., only a constant factor (= 4) is introduced.

Since GJ-Sample can perform uniform sampling, it instead uses a simpler approach (Chen and Yi, 2020), by repeating the sampling until a constant number $c$ of trials succeed. Then, the number of total trials becomes a random variable, having $\frac{AGM(\mathcal{H})\cdot c}{\text{OUT}}$ as its expectation (Chen and Yi, 2020). Therefore, $\operatorname*{\mathbb{E}}[T]=\tilde{O}\big{(}\text{IN}+\frac{AGM(\mathcal{H})\cdot c}{\text{OUT}}\big{)}$ for sequenceable queries and $\tilde{O}\big{(}\frac{\text{IN}\cdot AGM(\mathcal{H})\cdot c}{\text{OUT}}\big{)}$ for non-sequenceable queries. Using the same approach, DRS requires $\prod_{I}|\mathcal{E}_{I}|\frac{AGM(\mathcal{H})\cdot c}{\text{OUT}}$ trials in expectation and thus $\operatorname*{\mathbb{E}}[T]=\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)}$ , asymptotically faster than GJ-Sample.

Finally, we give a simple explanation of why $T$ (or $\operatorname*{\mathbb{E}}[T]$ ) of SSTE, SUST, GJ-Sample, and DRS, is inversely proportional to OUT. Intuitively, if OUT is as large as $AGM(\mathcal{H})$ , a set of samples $S_{I}$ are highly likely to be the actual join answers of $Q$ . In other words, the join answers are spread over a dense database. A small number of samples would give us enough information about the distribution of the join answers and estimating OUT accurately. In contrast, if OUT is small, the join answers would be skewed and sparsely distributed in certain regions of a database, and most samples would not be the join answers. Therefore, a small number of samples cannot effectively decrease the uncertainty of estimation, which may require a lot more sampling for an accurate estimation.

5.6. Underlying Index Structure

We explain the underlying index $Index_{F}$ for each relation $R_{F}$ to evaluate the operation $\pi_{I}(R_{F}\ltimes s)$ in $\tilde{O}(1)$ time, as mentioned in section 5.2. We define $F_{I}=F\cap I$ and $F_{s}=F\cap Attr(s)$ , where $Attr(s)$ is the set of attributes of $s$ . Note that $F_{I}$ and $F_{s}$ are disjoint in GenericCardEst. If $F_{s}=\{A_{1},A_{2},...,A_{m}\}$ , then $R_{F}\ltimes s=R_{F}\ltimes\pi_{F_{s}}(s)=\sigma_{A_{1}=\pi_{A_{1}}s}\sigma_{A_{2}=\pi_{A_{2}}s}...\sigma_{A_{m}=\pi_{A_{m}}s}(R_{F})$ .

A mild assumption in (Ngo et al., 2014) is to build a B-tree-like index (e.g., a trie index in (Aberger et al., 2017)) under a global attribute order over the whole database. Then, if all attributes in $F_{s}$ precede all attributes in $F_{I}$ , $R_{F}\ltimes s$ is readily available as an index lookup of $Index_{F}(s)=Index_{F}(\pi_{F_{s}}(s))$ . Furthermore, if no attribute in $F$ lies between $F_{s}$ and $F_{I}$ , $\pi_{I}(R_{F}\ltimes s)$ is an index lookup $Index_{F}(s)$ up to depth $|F_{I}|$ . To exploit these, the selection of $I$ in each step of GenericCardEst should be consistent with the global order. Instead, we can build $|F|!$ indexes for $R_{F}$ , one for each possible attribute order as in (Aberger et al., 2017), to enable arbitrary selection of $I$ .

Remark. The complexity of building an index for $R_{F}$ is linear, i.e., $\tilde{O}(|R_{F}|)$ (Aberger et al., 2017). However, in contrast to the pre-computing overheads in GJ-Sample, the indexing occurs once per database instead of per query. Therefore, its complexity is ignored from the analysis of $T_{\mathcal{H}}$ in section 5.5. Due to their simple structures, indexes can be easily extended to dynamic setting where tuples can be inserted and deleted, with a constant or logarithmic update time.

6. Achieving a Generalized Bound

In this section, we generalize our $\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)}$ bound for join size estimation using generalized hypertree decompositions (GHDs, see section 2) and achieve a better bound than $\tilde{O}(\text{IN}^{fhtw(\mathcal{H})})$ , which is the bound of an exact aggregation algorithm AggroGHDJoin using GHDs (see section A.3). We leave uniform sampling using GHDs as a future work in section 8. Our new bound may be better than our previous bound $\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)}$ , especially when OUT is small (see Example 2). We first provide background about GHDs and related algorithms.

6.1. Aggregations over GHDs

Instead of performing joins, OUT can be more efficiently computed by solving an AJAR (Aggregations and Joins over Annotated Relations) query (Joglekar et al., 2016). AJAR queries assume that each relation $R_{F}$ is annotated; $R_{F}=\{(r,\lambda(r))\}$ where each tuple $r$ has an annotation $\lambda(r)$ (from some domain $\operatorname*{\mathbb{K}}$ ). If two tuples $r_{1}$ and $r_{2}$ are joined, the joined tuple $r_{1}\Join r_{2}$ has $\lambda(r_{1})\otimes\lambda(r_{2})$ as its annotation. Hence, joining annotated relations results in another annotated relation:

(4)

\Join_{F\in\mathcal{E}}R_{F}=\{(r,\lambda)\,|\,r\text{ is a join answer},\lambda=\otimes_{F\in\mathcal{E}}\lambda(\pi_{F}r)\}.

Furthermore, we can define an aggregation over an annotated relation $R$ with attributes $\mathcal{A}$ by a pair of attribute $A$ and sum operator $\oplus$ (Joglekar et al., 2016). Let $G=\mathcal{A}\setminus\{A\}$ . Then, the ( $A,\oplus$ )-aggregation of $R$ generates an annotated relation $R_{G}$ of attributes $G$ , where $G$ is the grouping/output attributes, and $\lambda_{G}$ is an aggregated result for $r_{G}$ :

(5)

\sum_{A,\oplus}R=\{(r_{G},\lambda_{G}):r_{G}\in\pi_{G}R\text{ and }\lambda_{G}=\bigoplus_{(r,\lambda)\in R:\pi_{G}r=r_{G}}\lambda\}.

By definition, no tuple $r_{G}$ is duplicated in the aggregation result. If multiple aggregation attributes $M=\{A_{1},A_{2},...,A_{n}\}$ share the same $\oplus$ operator, we simply write $\sum_{A_{1},\oplus}\sum_{A_{2},\oplus}...\sum_{A_{n},\oplus}$ as $\sum_{M,\oplus}$ . Here, attributes in $M$ are marginalized out and removed from the resulting relation. The output attributes $G$ becomes $\mathcal{A}\setminus M$ . Altogether, if ( $\operatorname*{\mathbb{K}},\oplus,\otimes$ ) forms a commutative semiring (Green et al., 2007), we can aggregate over joins (of annotated relations) (Joglekar et al., 2016) as $\sum_{M,\oplus}\Join_{F\in\mathcal{E}}R_{F}$ . If $\operatorname*{\mathbb{K}}=\operatorname*{\mathbb{Z}}$ , $\lambda(r)=1\,\forall r\in R_{F}$ , $\oplus=+$ , $\otimes=\times$ , and $M=\mathcal{V}$ , the query $\sum_{\mathcal{V},+}\Join_{F\in\mathcal{E}}R_{F}$ generates an annotated relation with a single tuple $r$ without any attribute. Here, $\lambda(r)=\text{OUT}=\absolutevalue{\Join_{F\in\mathcal{E}}R_{F}}$ . We hereafter omit $\oplus$ whenever possible without ambiguity. In addition, AJAR queries are known to be more general than FAQ queries (Abo Khamis et al., 2016), for a query can have multiple aggregation operators (Joglekar et al., 2016).

Joglekar et al. (Joglekar et al., 2016) present AggroGHDJoin and AggroYannakakis to solve the AJAR queries in $\tilde{O}(\text{IN}^{w^{*}}+\text{OUT}_{agg})$ time, where $w^{*}$ is their width of algorithms, and $\text{OUT}_{agg}$ is the output size of the aggregation results which is typically smaller than OUT, e.g., 1 in our case. We prove in section A.3 that $w^{*}\geq fhtw$ for general AJAR queries and $w^{*}=fhtw$ for computing OUT. Since we show in section 6.3 that our new bound using GHDs is smaller than $\tilde{O}(\text{IN}^{fhtw})$ , it is also smaller than $\tilde{O}(\text{IN}^{w^{*}})$ .

6.2. Sampling over GHDs

From now, we use the same setting from section 6.1 to compute the AJAR query $\sum_{\mathcal{V}}\Join_{F\in\mathcal{E}}R_{F}$ . For a set of grouping attributes $G$ , GroupByCardEst (Algorithm 2) returns an approximate answer to the AJAR query $\mathcal{H}^{G}\coloneqq\sum_{\mathcal{V}\setminus G}\mathcal{H}=\sum_{\mathcal{V}\setminus G}\Join_{F\in\mathcal{E}}R_{F}$ , i.e., an annotated relation $\{(g,Z[g])|(g,\lambda(g))\in\mathcal{H}^{G}\}$ (Lines 2-2); $\lambda(g)$ and $Z[g]$ represent the exact and approximate value of $\text{OUT}(\mathcal{H}_{g})$ , where $\mathcal{H}_{g}$ is the residual hypergraph of $\mathcal{H}$ given $g$ (see section 5.1). Therefore, $\sum_{(g,\lambda(g))\in\mathcal{H}^{G}}\lambda(g)=\text{OUT}(\mathcal{H})$ .

Input: Hypergraph

\mathcal{H}

and a set of attributes

G\subseteq\mathcal{V}

R\leftarrow\emptyset

/* compute

\mathcal{H}^{G}\coloneqq\sum_{\mathcal{V}\setminus G}\mathcal{H}

, skip any duplicated

g

3 foreach (

g\in\textsc{{GenericJoin}}(\mathcal{H},G,\emptyset)

) do

Z[g]\leftarrow\textsc{{GenericCardEst}}(\mathcal{H},\mathcal{V}\setminus G,g)

/* approximation of the output size of the residual query given

g

R\leftarrow R\cup\{(g,Z[g])\}

return

R

Algorithm 2 GroupByCardEst (

\mathcal{H}(\mathcal{V},\mathcal{E}),G

)

GHDCardEst (Algorithm 3) is our entry function given a GHD ( $\mathcal{T},\chi$ ). For each node $t\in\mathcal{V}_{\mathcal{T}}$ , we define $\mathcal{H}_{t}\coloneqq(\chi(t),\mathcal{E}_{\chi(t),\pi})$ , i.e., sub-query of $\mathcal{H}$ on $t$ (Line 3), and $G(t)$ as the set of grouping attributes of $t$ , determined as the shared attributes across the nodes (Line 3). GroupByCardEst at Line 3 takes this $G(t)$ as the grouping attributes $G$ and returns an annotated relation $R_{t}$ $\coloneqq\{(g_{t},Z[g_{t}])|(g_{t},\lambda(g_{t}))\in\mathcal{H}^{G(t)}_{t}\}$ , where $\mathcal{H}^{G(t)}_{t}=\sum_{\chi(t)\setminus G(t)}\mathcal{H}_{t}$ . Therefore, $R_{t}$ is an approximation of the aggregation of $\mathcal{H}_{t}$ with output attributes $G(t)$ . Finally, SimpleAggroYannakakis at Line 3 takes these $R_{t}$ ’s and runs a simple version of AggroYannakakis (see Algorithm 5 in section A.2) to compute $\sum_{G(\mathcal{T})}\Join_{t\in\mathcal{V}_{\mathcal{T}}}R_{t}$ for $G(\mathcal{T})\coloneqq\cup_{t\in\mathcal{V}_{\mathcal{T}}}G(t)$ , the union of grouping attributes of all GHD nodes. $G(\mathcal{T})$ is used as join attributes between $R_{t}$ ’s in Line 3. Finally, Lemma 3 stats the unbiasedness of GHDCardEst.

Input: Query hypergraph

\mathcal{H}

and a GHD (

\mathcal{T}

\chi

) of

\mathcal{H}

S_{\mathcal{T}}\leftarrow\emptyset

3foreach (

t\in\mathcal{V}_{\mathcal{T}}

) do

\mathcal{H}_{t}\leftarrow(\chi(t),\mathcal{E}_{\chi(t),\pi})

G(t)\leftarrow\{v\in\chi(t)|\exists t^{\prime}\in\mathcal{V}_{\mathcal{T}}\text{ s.t., }t^{\prime}\neq t,v\in\chi(t^{\prime})\}

R_{t}\leftarrow\textsc{{GroupByCardEst}}(\mathcal{H}_{t},G(t))

/* approximate annotated relation of

\mathcal{H}^{G(t)}_{t}\coloneqq\sum_{\chi(t)\setminus G(t)}\mathcal{H}_{t}

S_{\mathcal{T}}\leftarrow S_{\mathcal{T}}\cup R_{t}

return

\textsc{{SimpleAggroYannakakis}}((\mathcal{T},\chi),S_{\mathcal{T}})

Algorithm 3 GHDCardEst(

\mathcal{H}(\mathcal{V},\mathcal{E})

, (

\mathcal{T}(\mathcal{V}_{\mathcal{T}},\mathcal{E}_{\mathcal{T}})

\chi

))

Lemma 3.

GHDCardEst is an unbiased estimator of $\absolutevalue{\Join_{F\in\mathcal{E}}R_{F}}$ .

Example 0.

We use $\mathcal{H}$ and $\mathcal{T}_{2}$ in Figure 1 as an example. Here, $G(t_{0})=\{X,X^{\prime}\},G(t_{1})=\{X\}$ , and $G(t_{2})=\{X^{\prime}\}$ are grouping attributes. (d) shows examples of annotated relations $R_{t}$ ’s, which are the outputs of GroupByCardEst on the three nodes. Since we allow duplicate tuples in any base relation $R_{F}:F\in\mathcal{E}$ , annotations for $R_{t_{0}}$ can be larger than one, even if $\chi(t_{0})$ is covered by a single hyperedge and $\chi(t_{0})=G(t_{0})$ . (e) and (f) show the join and aggregation on $R_{t}$ ’s performed in SimpleAggroYannakakis. Note that the annotations are multiplied in joins, and added in aggregations. The final result, 8450, is an approximation of OUT.

Refer to caption — Figure 1. Example of GHDCardEst using GHD $\mathcal{T}_{2}$ in (c). Attributes in $G(t)$ are underlined in (c).

The key idea of GHDCardEst is to pushdown the partial sum $\sum_{\chi(t)\setminus G(t)}$ from $\sum_{\mathcal{V}}$ to each node $t$ in computing $R_{t}$ , which is similar to the key idea of AggroGHDJoin that pushes down the aggregation attributes in GHD as deeply as possible. However, it is slightly different since we shrink each relation obtained from a node before the join phase in AggroYannakakis; AggroYannakakis aggregates after joining each pair of parent-child relations in GHD.

We also argue that our definition of $G(t)$ is minimal; if an attribute in $G(t)$ is excluded from $R_{t}$ , the node $t$ loses a join relationship with another node. Our $G(t)$ also minimizes the runtime of a non-sampling procedure, GenericJoin at Line 2 of Algorithm 2. Its runtime $\tilde{O}(AGM(\mathcal{H}^{G(t)}_{t}))$ increases with the output attributes $G(t)$ due to the node-monotone property of the AGM bound (Joglekar et al., 2016).

6.3. Analysis of GHDCardEst

Using the analyses in section 5.5 as building blocks, we analyze the variance and runtime of GHDCardEst. For each $(g_{t},Z[g_{t}])\in R_{t}$ for a GHD node $t$ , we attach two additional annotations $\operatorname*{\mathbb{V}ar}[Z[g_{t}]]$ and $T[g_{t}]$ (the runtime of obtaining $Z[g_{t}]$ ), and let $\mathcal{H}_{t,g_{t}}$ (or simply $\mathcal{H}_{g_{t}}$ ) denote the residual hypergraph of $\mathcal{H}_{t}$ given $g_{t}$ in Line 2 of Algorithm 2. Then, $Z[g_{t}]$ is an unbiased estimate of $\lambda(g_{t})=\text{OUT}(\mathcal{H}_{g_{t}})$ from Lemma 1. We also define $(f,Z[f])$ as an annotated tuple in $\Join_{t\in\mathcal{V}_{\mathcal{T}}}R_{t}$ so $Z[f]=\prod_{t\in\mathcal{V}_{\mathcal{T}}}Z[\pi_{G(t)}f]$ .

Recall that our primary goal for join size estimation is to perform ( $1\pm\epsilon$ )-approximation of OUT. To use Chebyshev’s inequality, we should bound $\operatorname*{\mathbb{V}ar}[Z]$ by $\epsilon^{2}\delta\text{OUT}^{2}$ as in section 5.5. Then the following questions arise: 1) Can we make this inequality hold for any $\epsilon$ and $\delta$ ? 2) How would the runtime $T$ be expressed? To answer these questions, we first express $Z$ using $Z[f]$ values.

(6)

Z=\sum_{(f,Z[f])\in\Join_{t\in\mathcal{V}_{\mathcal{T}}}R_{t}}Z[f].

We arbitrarily order $\mathcal{V}_{\mathcal{T}}$ and regard each $f$ as an ordered set $\{\pi_{G(t)}f|t\in\mathcal{V}_{\mathcal{T}}\}$ . Since $Z[g_{t}]$ values for $g_{t}\in f$ are mutually independent, we have

(7)

\begin{split}\operatorname*{\mathbb{E}}[Z[f]]&=\operatorname*{\mathbb{E}}\Big{[}\prod_{g_{t}\in f}Z[g_{t}]\Big{]}=\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]],\end{split}

(8)

\begin{split}\operatorname*{\mathbb{V}ar}&[Z[f]]=\operatorname*{\mathbb{V}ar}\Big{[}\prod_{g_{t}\in f}Z[g_{t}]\Big{]}\\ &=\prod_{g_{t}\in f}(\operatorname*{\mathbb{V}ar}[Z[g_{t}]]+\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2})-\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}.\end{split}

While $\operatorname*{\mathbb{E}}[Z[f]]$ can be simply decomposed into $\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]]$ , $\operatorname*{\mathbb{V}ar}[Z[f]]$ has the internal term $\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}$ which prevents a simple decomposition. For any two different tuples $f_{1},f_{2}\in\,\Join_{t}R_{t}$ , $Z[f_{1}]$ and $Z[f_{2}]$ are not independent, since they can have the same sub-tuple, i.e., $g_{t}\in f_{1}$ and $g_{t}\in f_{2}$ for some $t\in\mathcal{V}_{\mathcal{T}}$ . Therefore, we have to consider the covariance $\operatorname*{\mathbb{C}ov}(Z[f_{1}],Z[f_{2}])$ when expanding $\operatorname*{\mathbb{V}ar}[Z]$ below.

(9)

\begin{split}\operatorname*{\mathbb{V}ar}[Z]=&\sum_{f}\operatorname*{\mathbb{V}ar}[Z[f]]+\sum_{f_{1}\neq f_{2}}\operatorname*{\mathbb{C}ov}(Z[f_{1}],Z[f_{2}])\end{split}

From the analysis in section 5.5, we can arbitrarily control $\operatorname*{\mathbb{V}ar}[Z[g_{t}]]$ and $T[g_{t}]$ with the sample size, under a condition that $\operatorname*{\mathbb{V}ar}[Z[g_{t}]]\cdot T[g_{t}]=\tilde{O}(AGM(\mathcal{H}_{g_{t}})\cdot\text{OUT}(\mathcal{H}_{g_{t}}))$ . In particular, we set $\operatorname*{\mathbb{V}ar}[Z[g_{t}]]$ = $\tilde{O}(\text{OUT}(\mathcal{H}_{g_{t}})^{2})$ and $T[g_{t}]$ = $\tilde{O}\big{(}\frac{AGM(\mathcal{H}_{g_{t}})}{\text{OUT}(\mathcal{H}_{g_{t}})}\big{)}$ .

Lemma 4.

$\operatorname*{\mathbb{V}ar}[Z[f]]$ = $\tilde{O}(\operatorname*{\mathbb{E}}[Z[f]]^{2})$ if $\operatorname*{\mathbb{V}ar}[Z[g_{t}]]$ = $\tilde{O}(\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2})$ = $\tilde{O}(\text{OUT}(\mathcal{H}_{g_{t}})^{2})$ for every $g_{t}\in f$ .

Lemma 5.

$\operatorname*{\mathbb{C}ov}(Z[f_{1}],Z[f_{2}])=\tilde{O}\big{(}\prod_{g_{t}\in f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\cdot\prod_{g_{t}\in f_{2}}\\ \operatorname*{\mathbb{E}}[Z[g_{t}]]\big{)}$ if the same condition of Lemma 4 holds for $f_{1}$ and $f_{2}$ .

Now we are ready to answer our first question. By applying Lemmas 4 and 5 to (9), we have

(10)

\begin{split}&\operatorname*{\mathbb{V}ar}[Z]=\sum_{f}\tilde{O}\Big{(}\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\Big{)}\\ &\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\sum_{f_{1}\neq f_{2}}\tilde{O}\Big{(}\prod_{g_{t}\in f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\prod_{g_{t}\in f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\Big{)}\\ &=\tilde{O}\Big{(}\sum_{f}\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}+\sum_{f_{1}\neq f_{2}}\prod_{g_{t}\in f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\prod_{g_{t}\in f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\Big{)}\\ &=\tilde{O}\Big{(}\Big{(}\sum_{f}\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]]\Big{)}^{2}\Big{)}=\tilde{O}(\text{OUT}^{2}).\end{split}

The last equality holds from $\sum_{f}\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]]=\sum_{f}\operatorname*{\mathbb{E}}[Z[f]]=\operatorname*{\mathbb{E}}[Z]=\text{OUT}$ . As a result, $\operatorname*{\mathbb{V}ar}[Z]=\tilde{O}(\text{OUT}^{2})$ .

We can make $\operatorname*{\mathbb{V}ar}[Z]$ arbitrarily small, even less than the $\epsilon^{2}\delta\text{OUT}^{2}$ we desire, by setting the constant factor of $\operatorname*{\mathbb{V}ar}[Z[g_{t}]]=\tilde{O}(\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2})$ arbitrary small for every $g_{t}$ . We omit setting the constants which is trivial.

Proposition 9.

If $\operatorname*{\mathbb{V}ar}[Z[g_{t}]]$ approaches to 0 for every $g_{t}$ , $\operatorname*{\mathbb{V}ar}[Z]$ approaches to 0.

We next answer our second question. GroupByCardEst for each node $t$ takes $\tilde{O}(AGM(\mathcal{H}^{G(t)}_{t}))$ time at Line 2 and $\sum_{g_{t}\in R_{t}}T[g_{t}]$ at Line 2 of Algorithm 2. SimpleAggroYannakakis at Line 3 of Algorithm 3 takes $\tilde{O}(\max_{t\in\mathcal{V}_{\mathcal{T}}}AGM(\mathcal{H}^{G(t)}_{t}))$ time. Therefore, $T$ is $\tilde{O}\Big{(}\max_{t\in\mathcal{V}_{\mathcal{T}}}AGM(\mathcal{H}^{G(t)}_{t})$ + $\sum_{t\in\mathcal{V}_{\mathcal{T}}}\sum_{(g_{t},Z[g_{t}])\in R_{t}}T[g_{t}]\Big{)}$ . By setting $T[g_{t}]=\tilde{O}\big{(}\frac{AGM(\mathcal{H}_{g_{t}})}{\text{OUT}(\mathcal{H}_{g_{t}})}\big{)}$ , $T$ becomes $\tilde{O}\Big{(}\max_{t}AGM(\mathcal{H}^{G(t)}_{t})$ + $\sum_{t}\sum_{(g_{t},Z[g_{t}])\in R_{t}}\frac{AGM(\mathcal{H}_{g_{t}})}{\text{OUT}(\mathcal{H}_{g_{t}})}\Big{)}$ .

We now prove that our new bound is smaller than $\tilde{O}(\text{IN}^{fhtw(\mathcal{H})})$ = $\tilde{O}(\max_{t}AGM(\mathcal{H}_{t}))$ ; $AGM(\mathcal{H}^{G(t)}_{t})\leq AGM(\mathcal{H}_{t})$ since $AGM$ is node-monotone, and $\sum_{t}\sum_{g_{t}}\frac{AGM(\mathcal{H}_{g_{t}})}{\text{OUT}(\mathcal{H}_{g_{t}})}$ $\leq$ $\sum_{t}\sum_{g_{t}}AGM(\mathcal{H}_{g_{t}})$ $\leq$ $AGM(\mathcal{H}_{t})$ from Lemma 6 in appendix B. Therefore, if OUT is $\tilde{O}(\text{IN}^{\rho-fhtw})$ , $\text{IN}^{fhtw}$ is asymptotically smaller than $\frac{AGM}{\text{OUT}}$ ( $AGM=\text{IN}^{\rho}$ ), proving Theorem 2. We recommend using GenericCardEst if OUT is expected to be large, and using GHDCardEst if OUT is expected to be small.

Example 0.

We use the $\mathcal{H}$ in Figure 1 without an edge $\{Z^{\prime},W^{\prime}\}$ and assume that every base relation has size $N$ . Then, our previous bound $\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)}$ becomes $\tilde{O}\big{(}\frac{N^{3}}{\text{OUT}}\big{)}$ , from an optimal fractional edge cover $x$ : $x_{F}=0$ for $F=\{X,X^{\prime}\}$ and $0.5$ otherwise. Since our new bound is smaller than $\tilde{O}(\text{IN}^{fhtw(\mathcal{H})})=N^{1.5}$ , it is also smaller than $\tilde{O}\big{(}\frac{N^{3}}{\text{OUT}}\big{)}$ if OUT is asymptotically smaller than $N^{1.5}$ .

7. Optimizations

This section explains two optimization techniques to enhance estimation efficiency or accuracy.

7.1. Increasing Sampling Probabilities

If we focus on the join size estimation without persisting uniform sampling, we can further reduce the variance. Since $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]=\sum_{t\in Q}\frac{1}{P(t)}-\text{OUT}^{2}$ (Kim et al., 2021; Chen and Yi, 2020) for our method in section 5.4, increasing $P(t):t\in Q$ reduces the variance. First, when $m$ edges tie and have the same maximum $rdeg$ in section 5.4, we do not decrease the keeping probability $p$ by $m$ times. This increases the final $P(t)$ by $m$ times. In fact, we can use any sampled $F^{*}$ from $\mathcal{E}_{I}$ ; $P(s_{I})$ increases from $\frac{1}{|\mathcal{E}_{I}|}\cdot\max_{F^{\prime}\in\mathcal{E}_{I}}rdeg_{F^{\prime},s}(s_{I})\cdot\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{\max_{F^{\prime}\in\mathcal{E}_{I}}rdeg_{F^{\prime},s}(s_{I})AGM(\mathcal{H}_{s})}$ to $\sum_{F\in\mathcal{E}_{I}}\frac{1}{|\mathcal{E}_{I}|}\cdot rdeg_{F,s}(s_{I})\cdot\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{\max_{F^{\prime}\in\mathcal{E}_{I}}rdeg_{F^{\prime},s}(s_{I})AGM(\mathcal{H}_{s})}$ . Second, we can interleave GJ-Sample to remove the $\frac{1}{|\mathcal{E}_{I}|}$ factor in $P(s_{I})$ . If the $|\Omega_{I}|$ of GJ-Sample is small enough as a constant, e.g., $\prod_{I}|\mathcal{E}|$ , we can compute $P(s_{I})=\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}$ for every $s_{I}\in\Omega_{I}$ as GJ-Sample.

7.2. Skipping Attributes and GHD Nodes

In GenericCardEst, sampling $s_{I}$ for any non-join attribute is unnecessary and thus, can be safely skipped. For our example query in Figure 1, $W^{\prime}$ is the only non-join attribute. Let the current sample $s=\{x^{\prime},y^{\prime},z^{\prime}\}$ (attributes are $X^{\prime},Y^{\prime},Z^{\prime}$ ) and $I=\{W^{\prime}\}$ . Then, sampling any $s_{I}\in\Omega_{I}=\pi_{W^{\prime}}(R_{\{Z^{\prime},W^{\prime}\}}\ltimes z^{\prime})$ does not affect the sample space and sample size for the remaining vertices, i.e., $X$ , $Y$ , and $Z$ . Hence, we skip $W^{\prime}$ and just return $\frac{1}{P(s_{\{W^{\prime}\}})}=|\Omega_{\{W^{\prime}\}}|=|\pi_{W^{\prime}}(R_{\{Z^{\prime},W^{\prime}\}}\ltimes z^{\prime})|$ at Line 1 instead of 1. If there are multiple non-join attributes $X_{1},X_{2},...,X_{n}$ , we skip all of these, sample for all join attributes only, then return $\prod_{1\leq i\leq n}|\Omega_{\{X_{i}\}}|$ at Line 1.

In GHDCardEst, calling GroupByCardEst for single-edge GHD nodes can be safely skipped. If a GHD node $t$ is covered by a single edge, i.e., $\exists F\in\mathcal{E}:\chi(t)\subseteq F$ , we can directly obtain $\operatorname*{\mathbb{E}}[Z[g_{t}]]=|R_{F}\ltimes g_{t}|$ for every $g_{t}$ in $R_{t}$ without sampling in GroupByCardEst. To be consistent, we regard $R_{t}$ as $\pi_{\chi(t)}R_{F}$ (with annotations) in order to remove any duplicates in $R_{t}$ as mentioned in section 6.1. $t_{0}$ of $\mathcal{T}_{2}$ in Figure 1 is an example of a single-edge node, where $Z[g_{t_{0}}]=|R_{\{X,X^{\prime}\}}\ltimes g_{t_{0}}|$ for $g_{t_{0}}\in R_{t_{0}}$ . Therefore, $\operatorname*{\mathbb{V}ar}[Z[g_{t_{0}}]]=0$ and $T[g_{t_{0}}]=\tilde{O}(1)$ , reducing both $\operatorname*{\mathbb{V}ar}[Z]$ and $T$ .

8. Research Opportunities

We present six research opportunities based on our study. First, we could express our bound $\tilde{O}\Big{(}\max_{t}AGM(\mathcal{H}^{G(t)}_{t})+\sum_{t}\sum_{(g_{t},Z[g_{t}])\in R_{t}}\\ \frac{AGM(\mathcal{H}_{g_{t}})}{\text{OUT}(\mathcal{H}_{g_{t}})}\Big{)}$ in section 6 in a more succinct form by removing internal $g_{t}$ terms, e.g., to $\tilde{O}\big{(}\frac{\text{IN}^{fhtw(\mathcal{H})}}{\text{OUT}}\big{)}$ . Then, we could more clearly characterize the gap between this bound and $\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)}$ or $\tilde{O}(\text{IN}^{fhtw(\mathcal{H})})$ .

Second, we could dynamically change the sampling approach based on the guess of OUT. Our bound $\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)}$ in section 5 can be small as a constant if OUT is large as $AGM(\mathcal{H})$ , while the bound in section 6 has advantages over small OUT (e.g., Example 2). As stated in section 5.5, we could start with GenericCardEst assuming a large OUT, increase the sample size, adjust the assumption, and change to GHDCardEst if the estimated OUT becomes small.

Third, we could even change between sampling and join, then apply the same analysis. While the join algorithms have zero error but +OUT term in their runtime, sampling algorithms have non-zero errors but $\frac{1}{\text{OUT}}$ factor in their runtime. Therefore, it might be possible to balance between sampling and join to reduce the runtime and error for both large- and small-OUT cases, i.e., select a sampling-like algorithm for large OUT and a join-like algorithm for small OUT.

Fourth, we could develop join size estimators with lower complexities using more information about relations, e.g., degree constraints or functional dependencies (Abo Khamis et al., 2017) other than the cardinality constrains in Definition 1.

Fifth, we could extend GHDCardEst to perform uniform sampling over arbitrary GHDs from the observations in section A.2 that 1) ExactWeight (Zhao et al., 2018) performs uniform sampling over certain types of GHDs for acyclic joins, and 2) GHDCardEst can be easily modified for uniform sampling over the same GHDs. In Algorithm 6, the initial value of $W(g_{t})$ (sampling weight for a tuple $g_{t}$ in $R_{t}$ ) is set to $\operatorname*{\mathbb{E}}[Z[g_{t}]]=|R_{F_{t}}\ltimes g_{t}|$ . Therefore, we could extend ExactWeight and GHDCardEst for arbitrary GHDs for even cyclic joins, starting from initializing $W(g_{t})$ with an estimate $Z[g_{t}]$ instead of $\operatorname*{\mathbb{E}}[Z[g_{t}]]$ .

Sixth, we could extend our algorithms to more general problems, approximate counting and uniform sampling for conjunctive queries (Arenas et al., 2021; Focke et al., 2022) or join-project queries $\pi_{O}(Q)$ , where $O$ is the set of output attributes (Chen and Yi, 2020). We explain in appendix C that our degree-based rejection sampling can be easily extended for join-project queries, following the same analysis in Section 3 of (Chen and Yi, 2020). We achieve $\tilde{O}\big{(}\frac{AGM(Q)}{|\pi_{O}(Q)|}\big{)}$ runtime which is again smaller than that of GJ-Sample.

9. Conclusion

In this paper, we have presented a new sampling-based method for join size estimation and uniform sampling. Our method solves an open problem in the literature (Chen and Yi, 2020), achieving (1 $\pm\epsilon$ )-approximation of OUT in $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ time for arbitrary join queries. We presented a unified approach that explains and analyzes four known methods: SSTE, SUST, Alley, and GJ-Sample. We then extended our method using GHDs to achieve a better bound than $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ for small OUT and optimized the efficiency and accuracy using two approaches. Finally, we have highlighted several interesting research opportunities building on our results. We believe that our work may facilitate studies on achieving lower bounds on the uniform sampling and size estimation over joins.

10. Acknowledgement

We thank Yufei Tao for pointing out the condition of Lemma 2.

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2021-0-00859, Development of a distributed graph DBMS for intelligent processing of big graphs, 34%), Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2018-0-01398, Development of a Conversational, Self-tuning DBMS, 33%), and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.NRF-2021R1A2B5B03001551, 33%).

References

(1)
Aberger et al. (2017) Christopher R Aberger, Andrew Lamb, Susan Tu, Andres Nötzli, Kunle Olukotun, and Christopher Ré. 2017. Emptyheaded: A relational engine for graph processing. ACM Transactions on Database Systems (TODS) 42, 4 (2017), 1–44.
Abo Khamis et al. (2016) Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. 2016. FAQ: Questions Asked Frequently. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (San Francisco, California, USA) (PODS ’16). Association for Computing Machinery, New York, NY, USA, 13–28. https://doi.org/10.1145/2902251.2902280
Abo Khamis et al. (2017) Mahmoud Abo Khamis, Hung Q Ngo, and Dan Suciu. 2017. What do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog have to do with one another?. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 429–444.
Aliakbarpour et al. (2018) Maryam Aliakbarpour, Amartya Shankha Biswas, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee. 2018. Sublinear-time algorithms for counting star subgraphs via edge sampling. Algorithmica 80, 2 (2018), 668–697.
Arenas et al. (2021) Marcelo Arenas, Luis Alberto Croquevielle, Rajesh Jayaram, and Cristian Riveros. 2021. When is approximate counting for conjunctive queries tractable?. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing. 1015–1027.
Assadi et al. (2018) Sepehr Assadi, Michael Kapralov, and Sanjeev Khanna. 2018. A simple sublinear-time algorithm for counting arbitrary subgraphs via edge sampling. arXiv preprint arXiv:1811.07780 (2018).
Assadi et al. (2019) Sepehr Assadi, Michael Kapralov, and Sanjeev Khanna. 2019. A simple sublinear-time algorithm for counting arbitrary subgraphs via edge sampling. In 10th Innovations in Theoretical Computer Science, ITCS 2019. Schloss Dagstuhl-Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, 6.
Atserias et al. (2013) Albert Atserias, Martin Grohe, and Dániel Marx. 2013. Size bounds and query plans for relational joins. SIAM J. Comput. 42, 4 (2013), 1737–1767.
Banerjee (2012) Moulinath Banerjee. 2012. Simple Random Sampling. Unpublished Manuscript, University of Michigan, Michigan (2012).
Bera and Chakrabarti (2017) Suman K Bera and Amit Chakrabarti. 2017. Towards tighter space bounds for counting triangles and other substructures in graph streams. In 34th Symposium on Theoretical Aspects of Computer Science.
Brault-Baron (2016) Johann Brault-Baron. 2016. Hypergraph acyclicity revisited. ACM Computing Surveys (CSUR) 49, 3 (2016), 1–26.
Carmeli et al. (2020) Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Benny Kimelfeld, and Nicole Schweikardt. 2020. Answering (unions of) conjunctive queries using random access and random-order enumeration. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 393–409.
Chen and Yi (2020) Yu Chen and Ke Yi. 2020. Random Sampling and Size Estimation Over Cyclic Joins. In 23rd International Conference on Database Theory (ICDT 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
Deng et al. (2023) Shiyuan Deng, Shangqi Lu, and Yufei Tao. 2023. On Join Sampling and Hardness of Combinatorial Output-Sensitive Join Algorithms. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2023).
Eden et al. (2017) Talya Eden, Amit Levi, Dana Ron, and C Seshadhri. 2017. Approximately counting triangles in sublinear time. SIAM J. Comput. 46, 5 (2017), 1603–1646.
Eden et al. (2020) Talya Eden, Dana Ron, and C Seshadhri. 2020. On approximating the number of k-cliques in sublinear time. SIAM J. Comput. 49, 4 (2020), 747–771.
Fichtenberger et al. (2020) Hendrik Fichtenberger, Mingze Gao, and Pan Peng. 2020. Sampling Arbitrary Subgraphs Exactly Uniformly in Sublinear Time. In ICALP.
Focke et al. (2022) Jacob Focke, Leslie Ann Goldberg, Marc Roth, and Stanislav Zivnỳ. 2022. Approximately counting answers to conjunctive queries with disequalities and negations. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 315–324.
Friedgut and Kahn (1998) Ehud Friedgut and Jeff Kahn. 1998. On the number of copies of one hypergraph in another. Israel Journal of Mathematics 105, 1 (1998), 251–256.
Goldreich (2017) Oded Goldreich. 2017. Introduction to property testing. Cambridge University Press.
Gottlob et al. (2016) Georg Gottlob, Gianluigi Greco, Nicola Leone, and Francesco Scarcello. 2016. Hypertree decompositions: Questions and answers. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 57–74.
Gottlob et al. (2002) Georg Gottlob, Nicola Leone, and Francesco Scarcello. 2002. Hypertree decompositions and tractable queries. J. Comput. System Sci. 64, 3 (2002), 579–627.
Green et al. (2007) Todd J Green, Grigoris Karvounarakis, and Val Tannen. 2007. Provenance semirings. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 31–40.
Grohe and Marx (2014) Martin Grohe and Dániel Marx. 2014. Constraint solving via fractional edge covers. ACM Transactions on Algorithms (TALG) 11, 1 (2014), 1–20.
Horvitz and Thompson (1952) Daniel G Horvitz and Donovan J Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association 47, 260 (1952), 663–685.
Jerrum et al. (1986) Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. 1986. Random generation of combinatorial structures from a uniform distribution. Theoretical computer science 43 (1986), 169–188.
Joglekar et al. (2016) Manas R. Joglekar, Rohan Puttagunta, and Christopher Ré. 2016. AJAR: Aggregations and Joins over Annotated Relations. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (San Francisco, California, USA) (PODS ’16). Association for Computing Machinery, New York, NY, USA, 91–106. https://doi.org/10.1145/2902251.2902293
Kim et al. (2021) Kyoungmin Kim, Hyeonji Kim, George Fletcher, and Wook-Shin Han. 2021. Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality Estimation. In Proceedings of the 2021 International Conference on Management of Data. 964–976.
Leis et al. (2015) Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment 9, 3 (2015), 204–215.
Li et al. (2016) Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander join: Online aggregation via random walks. In Proceedings of the 2016 International Conference on Management of Data. 615–629.
Ngo et al. (2012) Hung Q Ngo, Ely Porat, Christopher Ré, and Atri Rudra. 2012. Worst-case Optimal Join Algorithms. (2012).
Ngo et al. (2014) Hung Q Ngo, Christopher Ré, and Atri Rudra. 2014. Skew strikes back: New developments in the theory of join algorithms. ACM SIGMOD Record 42, 4 (2014), 5–16.
Veldhuizen (2012) Todd L Veldhuizen. 2012. Leapfrog triejoin: a worst-case optimal join algorithm. arXiv preprint arXiv:1210.0481 (2012).
Yannakakis (1981) Mihalis Yannakakis. 1981. Algorithms for acyclic database schemes. In VLDB, Vol. 81. 82–94.
Zhao et al. (2018) Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. 2018. Random sampling over joins revisited. In Proceedings of the 2018 International Conference on Management of Data. 1525–1539.

Appendix

Appendix A Algorithms

A.1. GenericJoin

GenericJoin (Algorithm 4) (Ngo et al., 2014) is a representative worst-case optimal join algorithm. From the set of output attributes $\mathcal{O}$ (initially $\mathcal{V}$ ), it selects a subset $I$ where $1\leq|I|<|\mathcal{O}|$ (Line 4). For each join answer $s_{I}$ of an induced hypergraph of $\mathcal{H}$ projected onto $I$ (Line 4), it bounds attributes $I$ to $s_{I}$ and proceeds to match the residual hypergraph (Line 4). If $|\mathcal{O}|=1$ , the results are obtained from the intersection ( $\cap$ ) operation (Lines 4-4). The runtime of GenericJoin is $\tilde{O}(AGM(\mathcal{H}))$ , which can be proven from Lemma 6 (Ngo et al., 2014).

Input: Query hypergraph

\mathcal{H}

, output attributes

\mathcal{O}\subseteq\mathcal{V}

, and current tuple

s

2if (

|\mathcal{O}|=1

) then

3 return

\cap_{F\in\mathcal{E}_{\mathcal{O}}}\pi_{\mathcal{O}}(R_{F}\ltimes s)

R\leftarrow\emptyset

I\leftarrow

a non-empty proper subset of

\mathcal{O}

8foreach (

s_{I}\in\textsc{{GenericJoin}}(\mathcal{H},I,s)

) do

R[s_{I}]\leftarrow\textsc{{GenericJoin}}(\mathcal{H},\mathcal{O}\setminus I,s\uplus s_{I})

R\leftarrow R\cup\{s_{I}\}\times R[s_{I}]

13return

R

Algorithm 4 GenericJoin(

\mathcal{H}(\mathcal{V},\mathcal{E})

\mathcal{O}

s

)

A.2. Yannakakis, SimpleAggroYannakakis, and ExactWeight

In this section and in Algorithms 5-6, we assume that the root node $r$ of any GHD $\mathcal{T}$ has a virtual parent node $pr$ where 1) $pr\not\in\mathcal{V}_{\mathcal{T}}$ , 2) $\chi(pr)=\emptyset$ , and 3) $R_{pr}$ (input relation for $pr$ ) contains a single tuple $g_{pr}$ that joins with all tuples in $R_{r}$ , for ease of explanation. As mentioned in section 6.1, we assume no duplicate tuples in each $R_{t}$ .

Given a join tree $\mathcal{T}$ of a $\alpha$ -acyclic query $Q$ (Brault-Baron, 2016), Yannakakis (Yannakakis, 1981) performs the join in $\tilde{O}(\text{IN}+\text{OUT})$ time using dynamic programming. Algorithm 5 without Lines 5-5 and setting $R^{\prime}=R_{t}$ is Yannakakis, where the bottom-up and top-down semi-join reductions (Lines 5-5) are followed by the bottom-up join (Lines 5-5). Since the semi-join reductions remove dangling tuples (that do not participate in the final join answers), the intermediate join size monotonically increases up to OUT in the bottom-up join. Therefore, the total runtime is $\tilde{O}(\text{IN}+\text{OUT})$ .

Input: GHD (

\mathcal{T}

\chi

) and relations

R_{t}

for each

t\in\mathcal{V}_{\mathcal{T}}

2foreach (

t\in\mathcal{V}_{\mathcal{T}}

in some bottom-up order) do

p\leftarrow

parent of

t

R_{p}\leftarrow R_{p}\ltimes R_{t}

6foreach (

t\in\mathcal{V}_{\mathcal{T}}

in some top-down order) do

p\leftarrow

parent of

t

R_{t}\leftarrow R_{t}\ltimes R_{p}

10foreach (

t\in\mathcal{V}_{\mathcal{T}}

in some bottom-up order) do

\beta\leftarrow\{A\in G(t)|TOP_{\mathcal{T}}(A)=t\}

R^{\prime}\leftarrow\sum_{\beta}{R_{t}}

p\leftarrow

parent of

t

R_{p}\leftarrow R_{p}\Join R^{\prime}

17return

R_{r}

for the root

r

Algorithm 5 SimpleAggroYannakakis(

(\mathcal{T}(\mathcal{V}_{\mathcal{T}},\mathcal{E}_{\mathcal{T}}),\chi)

\{R_{t}|t\in\mathcal{V}_{\mathcal{T}}\}

)

SimpleAggroYannakakis (Algorithm 5) is a simplified version of AggroYannakakis (Joglekar et al., 2016), where 1) all attributes are used for aggregation and 2) sum is the only aggregation operator. AggroYannakakis can handle a more general class of aggregation queries, e.g., sum and max operators are used for different attributes. In Line 5, $TOP_{\mathcal{T}}(A)$ denotes the top node of an attribute $A$ , which is the closest node to the root among $\{t\in\mathcal{V}_{\mathcal{T}}|\chi(t)\ni A\}$ . $G(t)$ is the set of output attributes of $R_{t}$ . Line 5 aggregates on the attributes in $G(t)$ having $t$ as their top node, since these attributes do not appear in ancestors and are marginalized out. This early marginalization is the key idea of maintaining the runtime of SimpleAggroYannakakis to be $\tilde{O}(\text{IN}+\text{OUT}_{agg})$ (Joglekar et al., 2016), where $\text{OUT}_{agg}$ is the output size of aggregation. For example, $\text{OUT}_{agg}=1$ if all attributes are marginalized out in computing the join size.

Input: GHD (

\mathcal{T}

\chi

) where

\exists F_{t}\in\mathcal{E}:\chi(t)\subseteq F_{t}

for every

t\in\mathcal{V}_{\mathcal{T}}

and relations

R_{t}

for each

t\in\mathcal{V}_{\mathcal{T}}

W(g_{t})\leftarrow|R_{F_{t}}\ltimes g_{t}|

for every

t\in\mathcal{V}_{\mathcal{T}}

and

g_{t}\in R_{t}

4foreach (

t\in\mathcal{V}_{\mathcal{T}}

in some bottom-up order) do

p\leftarrow

parent of

t

8 foreach (

g_{p}\in R_{p}

) do

W(g_{p},R_{t})\leftarrow\sum_{g_{t}\in R_{t}\ltimes g_{p}}W(g_{t})

W(g_{p})\leftarrow W(g_{p})\cdot W(g_{p},R_{t})

R_{p}\leftarrow R_{p}\ltimes R_{t}

s\leftarrow g_{pr}

15foreach (

t\in\mathcal{V}_{\mathcal{T}}

in some top-down order) do

p\leftarrow

parent of

t

s_{p}\leftarrow\pi_{\chi(p)}s

\{s_{t}\}\leftarrow{\textsc{{SampleFrom}}}(R_{t}\ltimes s_{p},1,\{\frac{W(g_{t})}{W(s_{p},R_{t})}\,|\,g_{t}\in R_{t}\ltimes s_{p}\})

/* sample a tuple

s_{t}

from

R_{t}\ltimes s_{p}

with probability

\frac{W(s_{t})}{W(s_{p},R_{t})}

s\leftarrow s\uplus s_{t}

20foreach (

t\in\mathcal{V}_{\mathcal{T}}

in some order) do

21 Replace

s_{t}\in s

in-place with a uniform sample from

R_{F_{t}}\ltimes s_{t}

return

s

Algorithm 6 ExactWeight(

(\mathcal{T}(\mathcal{V}_{\mathcal{T}},\mathcal{E}_{\mathcal{T}}),\chi)

\{R_{t}|t\in\mathcal{V}_{\mathcal{T}}\}

)

We explain ExactWeight (Zhao et al., 2018) in our context of using GHDs (Algorithm 6), but assume that $R_{t}$ ’s are not annotated and compute each weight $W$ explicitly. ExactWeight performs uniform sampling over acyclic joins, where the GHDs of single-edge nodes are given. ExactWeight computes the sampling weights $W(g_{p})$ and $W(g_{p},R_{t})$ for each tuple $g_{p}$ and a child relation $R_{t}$ bottom-up (Lines 6-6) and samples tuples proportional to their weights top-down (Lines 6-6). The sample is then replaced with the tuples sampled from the base relations (Lines 6-6). The bottom-up part is the preprocessing step that takes $\tilde{O}(|\mathcal{V}_{\mathcal{T}}|\sum_{t\in\mathcal{V}_{\mathcal{T}}}|R_{t}|)=\tilde{O}(\max_{t}|R_{F_{t}}|)=\tilde{O}(\text{IN})$ time, and the top-down sampling and replacement takes $\tilde{O}(1)$ for each sample and can be performed multiple times after the preprocessing.

We now briefly explain why ExactWeight returns a uniform sample. $W(g_{p},R_{t})$ corresponds to the join size between $g_{p}$ and the child subtree of $p$ rooted at $t$ , and $W(g_{p})$ corresponds to the join size between $g_{p}$ and all child subtrees of $p$ . Hence, $W(g_{pr})=\text{OUT}$ which is the join size of the whole tree. For a node $p$ , its children $\{t_{1},t_{2},...,t_{m}\}$ and its parent $pp$ , $P(s_{p})=\frac{W(s_{p})}{W(s_{pp},R_{p})}$ and $P(s_{t_{i}})=\frac{W(s_{t_{i}})}{W(s_{p},R_{t_{i}})}$ for $i\in[1,m]$ . Therefore, $P(s_{p},s_{t_{1}},...,s_{t_{m}})=\frac{W(s_{p})}{W(s_{pp},R_{p})}\prod_{1\leq i\leq m}\frac{W(s_{t_{i}})}{W(s_{p},R_{t_{i}})}=\frac{|R_{F_{p}}\ltimes s_{p}|\prod_{i}W(s_{p},R_{t_{i}})}{W(s_{pp},R_{p})}\frac{\prod_{i}W(s_{t_{i}})}{\prod_{i}W(s_{p},R_{t_{i}})}=\frac{|R_{F_{p}}\ltimes s_{p}|\prod_{i}W(s_{t_{i}})}{W(s_{pp},R_{p})}$ . We have seen a similar elimination of probability terms in eq. 3, which leads to $P(s)=\frac{\prod_{t}|R_{F_{t}}\ltimes s_{t}|}{W(g_{pr},R_{r})}=\frac{\prod_{t}|R_{F_{t}}\ltimes s_{t}|}{\text{OUT}}$ after Line 6. By uniformly sampling a tuple from each $R_{F_{t}}\ltimes s_{t}$ at Line 6, $P(s)$ becomes $\frac{1}{\text{OUT}}$ at Line 6.

We can also make GHDCardEst in Algorithm 3 perform uniform sampling for the same GHD by 1) iterating $t$ bottom-up at Line 3, 2) adding the same initialization and updates of sampling weights, and 3) adding the same sampling part. Then, the preprocessing step of GHDCardEst also takes $\tilde{O}(\text{IN})$ as ExactWeight. From this perspective, ExactWeight is similar to a specific instance of GHDCardEst for GHDs of single-edge nodes and acyclic queries. Extending the above modifications of GHDCardEst to work for arbitrary GHDs and cyclic queries, would be an interesting future work. We could also extend our work to unions of acyclic conjunctive queries and sampling without replacement, as what (Carmeli et al., 2020) is to (Zhao et al., 2018).

A.3. GHDJoin and AggroGHDJoin

Given a generalized hypertree decomposition (GHD) of a query, GHDJoin (Algorithm 7) (Joglekar et al., 2016) uses $\mathcal{T}$ as a join tree and performs GenericJoin to obtain the join answers of each node (Lines 7-7). $\mathcal{H}_{t}$ is the subquery defined on node $t$ (Line 7). Then, it performs Yannakakis over these join answers to evaluate the original query (Line 7). Therefore, the runtime of GHDJoin is $\tilde{O}(\max_{t\in\mathcal{V}_{\mathcal{T}}}\text{IN}^{\rho(\mathcal{H}_{t})}+\text{OUT})=\tilde{O}(\text{IN}^{fhtw(\mathcal{T},\mathcal{H})}+\text{OUT})$ from Definition 3.

Input: Query hypergraph

\mathcal{H}

and a GHD (

\mathcal{T}

\chi

) of

\mathcal{H}

S_{\mathcal{T}}\leftarrow\emptyset

3 foreach (

t\in\mathcal{V}_{\mathcal{T}}

) do

\mathcal{H}_{t}\leftarrow(\chi(t),\mathcal{E}_{\chi(t),\pi})

/* compute

\mathcal{H}_{t}

S_{\mathcal{T}}\leftarrow S_{\mathcal{T}}\cup\textsc{{GenericJoin}}(\mathcal{H}_{t},\chi(t),\emptyset)

8return

\textsc{{Yannakakis}}((\mathcal{T},\chi),S_{\mathcal{T}})

Algorithm 7 GHDJoin (

\mathcal{H}(\mathcal{V},\mathcal{E})

(\mathcal{T}(\mathcal{V}_{\mathcal{T}},\mathcal{E}_{\mathcal{T}}),\chi)

)

AggroGHDJoin (Joglekar et al., 2016) is an aggregation version of GHDJoin. It is different from GHDJoin in two aspects: 1) calling AggroYannakakis instead of Yannakakis and 2) including some extra work to ensure that each annotation of a tuple is passed to exactly one call of GenericJoin (Joglekar et al., 2016). Note that a hyperedge $F\in\mathcal{E}$ can be visited in multiple nodes, thus, its annotations may be aggregated multiple times, leading to an incorrect output.

From the runtime of GenericJoin and AggroYannakakis, the runtime of AggroGHDJoin is $\tilde{O}(\text{IN}^{fhtw(\mathcal{T},\mathcal{H})}+\text{OUT}_{agg})$ . In (Joglekar et al., 2016), the runtime of AggroGHDJoin is expressed as $\tilde{O}(\text{IN}^{w^{*}}+\text{OUT}_{agg})$ , where $w^{*}\coloneqq\min_{(\mathcal{T},\chi)\in Val}fhtw(\mathcal{T},\mathcal{H})$ . Here, $Val$ denotes the set of valid GHDs of $\mathcal{H}$ , that preserve the aggregation ordering of the specified aggregation $\sum_{A_{1},\oplus_{1}}\sum_{A_{2},\oplus_{2}}...\sum_{A_{n},\oplus_{n}}$ ; changing the order of $\sum_{A_{i},\oplus_{i}}$ and $\sum_{A_{j},\oplus_{j}}$ might vary the aggregation result (Joglekar et al., 2016). Since $fhtw(\mathcal{H})=\min_{(\mathcal{T},\chi)}fhtw(\mathcal{T},\mathcal{H})$ from Definition 3 and $Val$ is a subset of the set of all GHDs, $w^{*}\geq fhtw(\mathcal{H})$ . However, as in our case where all aggregation operators are the same ( $\oplus_{i}=+$ ), thus commutative, any aggregation order gives the same aggregation result. Therefore, $Val$ becomes the set of all GHDs. This leads to $w^{*}=fhtw(\mathcal{H})$ .

Appendix B Query Decomposition Lemmas

This section explains two similar query decomposition lemmas proved by Ngo et al. (Ngo et al., 2014) and Chen & Yi (Chen and Yi, 2020).

Lemma 6.

(Ngo et al., 2014): Given a query hypergraph $\mathcal{H}(\mathcal{V},\mathcal{E})$ , let $x$ be any fractional edge cover of $\mathcal{H}$ . Let $I$ be an arbitrary non-empty proper subset of $\mathcal{V}$ , $J=\mathcal{V}\setminus I$ , and $L=\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}R_{F}$ . Then,

(11)

\vspace{-2pt}\sum_{s_{I}\in L}\prod_{F\in\mathcal{E}_{J}}|R_{F}\ltimes s_{I}|^{x_{F}}\leq\prod_{F\in\mathcal{E}}|R_{F}|^{x_{F}}.\vspace{-1pt}

Using this lemma, Ngo et al. (Ngo et al., 2014) prove that GenericJoin has $\tilde{O}(\prod_{F\in\mathcal{E}}|R_{F}|^{x_{F}})$ runtime (i.e., the RHS above) using an induction hypothesis on $|\mathcal{V}|$ . If $x$ is an optimal fractional edge cover, the RHS becomes $AGM(\mathcal{H})$ .

We briefly explain this with Algorithm 4. If we replace 1) $\mathcal{V}$ in the lemma with $\mathcal{O}$ in Algorithm 4 and 2) $R_{F}$ with $R_{F}\ltimes s$ , we obtain $L=\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$ and

(12)

\vspace{-2pt}\sum_{s_{I}\in L}\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|R_{F}\ltimes s\uplus s_{I}|^{x_{F}}\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}}.\vspace{-1pt}

If $|\mathcal{O}|=1$ , Line 4 takes $\tilde{O}(\min_{F\in\mathcal{E}_{\mathcal{O}}}|\pi_{\mathcal{O}}(R_{F}\ltimes s)|)$ time where $\min_{F\in\mathcal{E}_{\mathcal{O}}}|\pi_{\mathcal{O}}(R_{F}\ltimes s)|\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|\pi_{\mathcal{O}}(R_{F}\ltimes s)|^{x_{F}}\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}}$ for $\sum_{F\in\mathcal{E}_{\mathcal{O}}}x_{F}\geq 1$ . The last term is equivalent to the RHS above.

If $|\mathcal{O}|>1$ , Lines 4-4 take $\tilde{O}(|L|+\sum_{s_{I}\in L}\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|\pi_{\mathcal{O}\setminus I}(R_{F}\ltimes s\uplus s_{I})|^{x_{F}})$ time where $|L|\leq\prod_{F\in\mathcal{E}_{I}}|R_{F}\ltimes s|^{x_{F}}$ from the induction hypothesis ( $\because|I|<|\mathcal{O}|$ , and $\prod_{F\in\mathcal{E}_{I}}|R_{F}\ltimes s|^{x_{F}}\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}}$ ) and $\sum_{s_{I}\in L}\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|\pi_{\mathcal{O}\setminus I}(R_{F}\ltimes s\uplus s_{I})|^{x_{F}}\leq\sum_{s_{I}\in L}\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|R_{F}\ltimes s\uplus s_{I}|^{x_{F}}\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}}$ from the lemma, which is again the RHS above.

Therefore, in both cases, the runtime of Algorithm 4 is the big-Oh of the RHS above.

Lemma 7.

(Chen and Yi, 2020): Given a query hypergraph $\mathcal{H}(\mathcal{V},\mathcal{E})$ , let $x$ be any fractional edge cover of $\mathcal{H}$ . Let $I=\{A\}$ for an arbitrary attribute $A\in\mathcal{V}$ , $J=\mathcal{V}\setminus I$ , and $L^{\prime}=\pi_{I}R_{F^{GJ}}$ for $F^{GJ}=\operatorname*{arg\,min}_{F\in\mathcal{E}_{I}}|\pi_{I}R_{F}|$ . Then,

(13)

\sum_{s_{I}\in L^{\prime}}\prod_{F\in\mathcal{E}_{J}}|R_{F}\ltimes s_{I}|^{x_{F}}\leq\prod_{F\in\mathcal{E}}|R_{F}|^{x_{F}}.

Since Chen & Yi (Chen and Yi, 2020) explain Lemma 7 without a name, we also call it the query decomposition lemma since it is similar to Lemma 6. However, it is different from Lemma 6 in that 1) $|I|$ must be 1 and 2) $L^{\prime}$ is obtained without an actual join or intersection in GenericJoin. If $|I|=1$ in Lemma 6, then $L=\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}R_{F}=\cap_{F\in\mathcal{E}_{I}}\pi_{I}R_{F}$ . Since $F^{GJ}\in\mathcal{E}_{I}$ , $L^{\prime}\supseteq L$ . Therefore, Lemma 7 is stronger than Lemma 6 if $|I|=1$ .

Lemma 7 is the key to show that $\sum_{s_{I}\in\Omega_{I}}P(s_{I})\leq 1$ for GJ-Sample. By replacing 1) $\mathcal{V}$ in the above lemma with $\mathcal{O}$ in GJ-Sample in section 5.3 and 2) $R_{F}$ with $R_{F}\ltimes s$ , we obtain $L^{\prime}=\pi_{I}(R_{F^{GJ}}\ltimes s)=\Omega_{I}$ for $F^{GJ}=\operatorname*{arg\,min}_{F\in\mathcal{E}_{I}}|\pi_{I}(R_{F}\ltimes s)|$ and

(14)

\sum_{s_{I}\in L^{\prime}}\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|R_{F}\ltimes s\uplus s_{I}|^{x_{F}}\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}},

resulting in

(15)

\begin{split}\sum_{s_{I}\in\Omega_{I}}P(s_{I})&=\sum_{s_{I}\in L^{\prime}}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}\\ &=\sum_{s_{I}\in L^{\prime}}\frac{\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|R_{F}\ltimes s\uplus s_{I}|^{x_{F}}}{\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}}}\leq 1.\end{split}

Appendix C Extension to Conjunctive Queries

From the introduction, the join query we have considered so far is a full conjunctive query where all attributes are output attributes. In this section, we easily extend our degree-based rejection sampling to approximate counting and uniform sampling for join-project queries (a.k.a. conjunctive queries) (Arenas et al., 2021; Focke et al., 2022; Chen and Yi, 2020), following the same analysis of Section 3 in (Chen and Yi, 2020).

Let $\pi_{O}(Q)$ be a join-project query, where $O$ is the set of projection/output attributes and $Q$ is a join query. We can use the two-step approach in (Chen and Yi, 2020): (Step 1) use Algorithm 1 to sample a join result $s$ from $Q_{O}\coloneqq\Join_{F\in E_{O}}\pi_{O}(R_{F})$ , i.e., join the relations that contain any attribute in $O$ only. Here, if $Q_{O}$ is disconnected, we take a sample from each connected component. (Step 2) check if $Q_{\mathcal{V}\setminus O}(s)\coloneqq\Join_{F\in E_{\mathcal{V}\setminus O}}(\pi_{\mathcal{V}\setminus O}(R_{F}\ltimes s))$ is empty. If not, we return $s$ as a sampled result, otherwise, we repeat. Then, the final $s$ returned is a result in $\pi_{O}(Q)$ . Using the same analysis of (Chen and Yi, 2020), sampling a tuple in $\pi_{O}(Q)$ takes $\tilde{O}\big{(}\frac{AGM(Q)}{\text{OUT}(\pi_{O}(Q))}\big{)}$ time in expectation for our method.

Step 1 of our algorithm takes time $\tilde{O}\big{(}\frac{AGM(Q_{O})}{\text{OUT}({Q_{O}})}\big{)}$ in expectation. Step 2 takes time $\tilde{O}(AGM(Q_{\mathcal{V}\setminus O}(s)))$ , using any worst-case optimal join algorithm, e.g., GenericJoin. Because each $s\in Q_{O}$ is sampled with probability $\frac{1}{\text{OUT}(Q_{O})}$ , the expected runtime of Step 2 is (the big-Oh of) $\sum_{s\in Q_{O}}\frac{AGM(Q_{\mathcal{V}\setminus O}(s))}{\text{OUT}(Q_{O})}\leq\frac{AGM(Q)}{\text{OUT}(Q_{O})}$ where the inequality follows from the query decomposition lemma (Lemma 6). Please refer to Section 3 of (Chen and Yi, 2020) for the remaining part of the analysis proving the $\tilde{O}\big{(}\frac{AGM(Q)}{\text{OUT}(\pi_{O}(Q))}\big{)}$ time.

Here, we would like to address the complexity of computing $\pi_{O}(R_{F})$ before calling Algorithm 1. If there exists an index for $R_{F}$ with the attribute order such that the attributes in $O\cap F$ precede the attributes in $F\setminus O$ , then $\pi_{O}(R_{F})$ is readily available (i.e., in $\tilde{O}(1)$ time) as explained in section 5.6. This results in total $\tilde{O}\big{(}\frac{AGM(Q)}{\text{OUT}(\pi_{O}(Q))}\big{)}$ time. Otherwise, we need an additional linear-time preprocessing to obtain $\pi_{O}(R_{F})$ from $R_{F}$ , resulting in total $\tilde{O}\big{(}\frac{AGM(Q)}{\text{OUT}(\pi_{O}(Q))}+\text{IN}\big{)}$ time. Nevertheless, we emphasize that this is still lower than the complexity of GJ-Sample for non-sequenceable join-project queries, which is $\tilde{O}\big{(}\text{IN}\cdot\frac{AGM(Q)}{\text{OUT}(\pi_{O}(Q))}+\text{IN}\big{)}$ (Chen and Yi, 2020).

We also note that the proposed algorithm in (Arenas et al., 2021) requires computing $N(T^{\prime}_{i})$ in their “High-Level Sampling Template” which is the (approximate) size of the $i$ -th subset $T^{\prime}_{i}$ of a set $T$ . This has to be computed for every possible $i$ since $\frac{N(T^{\prime}_{i})}{\sum_{i}N(T^{\prime}_{i})}$ is used as the probability of sampling $T^{\prime}_{i}$ . This is analogous to computing $AGM(\mathcal{H}_{s\uplus s_{I}})$ for each candidate $s_{I}$ in GJ-Sample. A difference between GJ-Sample and (Arenas et al., 2021) is that computing $AGM(\mathcal{H}_{s\uplus s_{I}})$ takes a constant time in GJ-Sample while computing $N(T^{\prime}_{i})$ takes a polynomial time according to Lemma 4.5 in the full version of (Arenas et al., 2021). Therefore, (Arenas et al., 2021) inherently has the overhead of computing $N(T^{\prime}_{i})$ ’s analogous to GJ-Sample. Removing such overhead by extending our idea to conjunctive queries, would be an interesting future work.

Appendix D Component-at-a-time Sampling

For each component, GenericCardEst (Algorithm 1) is called in two recursions, e.g., if $\mathcal{H}$ consists of four odd cycles and three stars, the maximum recursion depth is $2\cdot(4+3)+1=15$ , including the calls that reach Line 1. Since the components are vertex-disjoint, i.e., variable-disjoint, $R_{F}\ltimes s=R_{F}$ whenever $F$ and (the attributes of) $s$ are contained in different components. Therefore, we can safely drop $\ltimes s$ from the following explanations for the component-at-a-time methods.

We first explain for a star $S$ of $n$ edges $F_{1},F_{2},...,F_{n}$ . Let $A$ be the center attribute, i.e., $\{A\}=\cap_{i\in[1,n]}F_{i}$ . For SSTE, in the first call to GenericCardEst, $I$ becomes the (singleton set of) center attribute, i.e., $I=\{A\}$ , $\Omega_{I}=\pi_{A}R_{F_{1}}$ and $k=1$ . To sample a vertex $a$ from $\Omega_{I}$ , SSTE first samples an edge $e$ from $R_{F_{1}}$ , and let $a=\pi_{A}(e)$ . Then, $P(a)=\frac{|R_{F_{1}}\ltimes a|}{|R_{F_{1}}|}$ since $|R_{F_{1}}\ltimes a|$ edges in $R_{F_{1}}$ have $a$ as their projection onto $A$ . When GenericCardEst is called for the second time, $I$ becomes the remaining attributes of $S$ , i.e., $I=\cup_{i}F_{i}\setminus\{A\}$ , $\Omega_{I}=\prod_{i}\pi_{I}(R_{F_{i}}\ltimes a)$ , and $k=1$ ; a sequence of $n$ vertices $(b_{1},b_{2},...,b_{n})$ is sampled from $\Omega_{I}$ once (i.e., $b_{i}$ is sampled from $\pi_{I}(R_{F_{i}}\ltimes a)$ ). If $(a,b_{1},b_{2},...,b_{n})$ forms a star in the database, $\operatorname*{\mathbb{I}}[\cdot]$ for these two calls become 1, and the recursion proceeds to the remaining components, with $\mathcal{O}\leftarrow\mathcal{O}\setminus(\cup_{i}F_{i})$ and $s\leftarrow s\uplus(a,b_{1},b_{2},...,b_{n})$ .

For SUST, in the first call to GenericCardEst, $I$ is simply the set of all attributes of $S$ and $\Omega_{I}=\prod_{i}R_{F_{i}}$ , $P(\cdot)=\frac{1}{|\Omega_{I}|}$ , and $k=1$ ; a sequence of $n$ edges is sampled once. $\operatorname*{\mathbb{I}}[\cdot]$ is 1 iff they share a common vertex for the center. The second call to GenericCardEst is a dummy that directly proceeds to the remaining components.

We next explain for an odd cycle $C$ of $2n+1$ edges with the vertex (attribute) sequence $(u_{1},v_{1},u_{2},v_{2},...,u_{n},v_{n},w,u_{1})$ where $F_{i}=\{u_{i},v_{i}\}$ for $i\in[1,n]$ and $F_{0}=\{u_{1},w\}$ . A constraint is that the sampled vertex (attribute value) for $u_{1}$ must have the smallest degree among the sampled vertices, where the degree of a vertex $a$ is $|R\ltimes a|$ (note that $R=R_{F_{i}}$ for every $i$ ). In the first call to GenericCardEst, $I$ contains all $u_{i}$ ’s and $v_{i}$ ’s except $w$ , $\Omega_{I}=\prod_{i\in[1,n]}R_{F_{i}}$ , and $k=1$ ; a sequence of $n$ edges $(\{a_{1},b_{1}\},...,\{a_{n},b_{n}\})$ is sampled where $u_{i}$ and $v_{i}$ map to $a_{i}$ and $b_{i}$ . When GenericCardEst is called for the second time, $I=\{w\}$ . SSTE and SUST differ here.

For SSTE, $\Omega_{I}=\pi_{I}(R_{F_{0}}\ltimes a_{1})$ , $P(\cdot)=\frac{1}{|\Omega_{I}|}$ , and $k$ is $\lceil\frac{|\Omega_{I}|}{\sqrt{|R|}}\rceil$ . Therefore, $k$ vertices are uniformly sampled from $\Omega_{I}$ with replacement. For SUST, $k$ is fixed to 1, and $\Omega_{I}$ is defined based on the degree of $a_{1}$ . If the degree is smaller than $\sqrt{|R|}$ , then $\Omega_{I}=\pi_{I}(R_{F_{0}}\ltimes a_{1})$ and $P(\cdot)=\frac{1}{\sqrt{|R|}}$ . If the degree is larger than $\sqrt{|R|}$ , then $\Omega_{I}=\pi_{I}R_{F_{0}}$ and $P(s_{I})=\frac{|R\ltimes s_{I}|}{|R|}$ . Then, at Line 1, 1) any $s_{I}$ with a smaller degree than $a_{1}$ is rejected due to the constraint, and 2) an accepted $s_{I}$ is further kept with probability $\frac{\sqrt{|R|}}{|R\ltimes s_{I}|}$ , which is less than 1 from $|R\ltimes s_{I}|\geq|R\ltimes a_{1}|\geq\sqrt{|R|}$ . Therefore, $P(s_{I})$ is again $\frac{|R\ltimes s_{I}|}{|R|}\frac{\sqrt{|R|}}{|R\ltimes s_{I}|}=\frac{1}{\sqrt{|R|}}$ . For both SSTE and SUST, if $(a_{1},b_{1},...,a_{n},b_{n},s_{\{w\}},a_{1})$ forms a cycle in the database, the recursion proceeds to the remaining components.

SUST guarantees the uniformity of sampling odd cycles and stars. To sample an odd cycle, SUST relies on the constraint that $a_{1}$ has the smallest degree. This is allowed in unlabeled graphs since other non-qualifying cycles (e.g., $a_{2}$ having the smallest degree) can be found by permuting the vertices and edges (e.g., constrain w.r.t. $a_{1}$ and let $a_{2}=a_{1}$ ). The new proof of SSTE also uses this constraint to bound the variance (see appendix E). However, in non-identical relations, we cannot permute the edges with different labels.

The essence of component-at-a-time sampling is to use the divide-and-conquer approach by breaking a query $\mathcal{H}$ into smaller components (i.e., odd cycles and stars), solve simpler problems for these components, and estimate back the cardinality of the original query. Lemmas 8 and 9 allow us to set an induction hypothesis that bounds the variance of estimates in simpler problems. We explain these lemmas in section D.1 and extend them for labeled graphs in section D.2.

D.1. For Unlabeled Graphs

Lemma 8.

(Decomposition Lemma) (Assadi et al., 2019): Given a query hypergraph $\mathcal{H}(\mathcal{V},\mathcal{E})$ of identical binary relations (i.e., $R_{F}=R:\forall F\in\mathcal{E}$ ), the LP in Definition 1 admits a half-integral optimal solution $x$ where 1) $x_{F}\in\{0,\frac{1}{2},1\}:\forall F\in\mathcal{E}$ and 2) the support of $x$ , $supp(x)\coloneqq\{F\in\mathcal{E}|x_{F}>0\}$ , forms a set of vertex-disjoint odd cycles (i.e., each cycle has an odd number of edges) and stars.

Lemma 9.

(Assadi et al., 2019): Suppose that $\mathcal{H}$ (with identical binary relations as $\mathcal{E}$ ) is decomposed into odd cycles and stars by a half-integral optimal solution $x$ in Lemma 8. For arbitrary subsets of odd cycles $\mathbb{C}$ and stars $\mathbb{S}$ of $\mathcal{H}$ , let $\mathcal{H}_{s}=(\mathcal{V}_{s},\mathcal{E}_{s})$ be the induced subquery of $\mathcal{H}$ on vertices of $\mathbb{C}$ and $\mathbb{S}$ . Then, for output size $\text{OUT}_{s}$ of $\mathcal{H}_{s}$ , $\text{OUT}_{s}\leq|R|^{\sum_{F\in\mathcal{E}_{s}}x_{F}}$ , and $\sum_{F\in\mathcal{E}_{s}}x_{F}=\sum_{C(\mathcal{V}_{C},\mathcal{E}_{C})\in\mathbb{C}}\sum_{F\in\mathcal{E}_{C}}x_{F}+\sum_{S(\mathcal{V}_{S},\mathcal{E}_{S})\in\mathbb{S}}\sum_{F\in\mathcal{E}_{S}}x_{F}$ .

D.2. For Labeled Graphs

We extend Lemma 8 to Lemma 10 for labeled graphs. First, we prove Proposition 10.

Proposition 10.

The LP in Definition 1 admits an integral optimal solution if $\mathcal{H}$ is a bipartite graph, where relations are binary but not necessarily identical.

Proof.

Let $\mathcal{V}=\{v_{1},v_{2},...,v_{n}\}$ , $\mathcal{E}=\{F_{1},F_{2},...,F_{m}\}$ , and $A\in\mathbb{R}^{n\times m}$ be the incidence matrix of $\mathcal{H}$ , where $A_{ij}=\operatorname*{\mathbb{I}}[v_{i}\in F_{j}]$ . It is well known that if $\mathcal{H}$ is bipartite, $A$ is so-called a totally unimodular matrix^*^**https://en.wikipedia.org/wiki/Unimodular_matrix. We can rewrite the objective of LP, i.e., $\min_{x}\sum_{F\in\mathcal{E}}x_{F}\log_{\text{IN}}\absolutevalue{R_{F}}$ , as $\min_{x}{cx}$ ( $c$ is a vector of $(\log_{\text{IN}}{|R_{F}|}:F\in\mathcal{E})\in\mathbb{R}^{1\times m}$ , $x\in\mathbb{R}^{m\times 1}$ ) and constraint $Ax\geq b$ ( $b$ is a vector of 1’s $\in\mathbb{R}^{n\times 1}$ ). Then, it is also known that an LP of objective $\min_{x}{cx}$ and constraint $Ax\geq b$ admits an integral optimal solution if $A$ is totally unimodular and $b$ is a vector of integers, which is exactly our case. $\square$

Therefore, we can easily remove the identity assumption on relations if $\mathcal{H}$ is bipartite. The following proposition fills the remaining non-bipartite case. We omit the proof since it is the same as the proof by Assadi et al. (Assadi et al., 2019).

Proposition 11.

The LP in Definition 1 admits a half-integral optimal solution if $\mathcal{H}$ is a non-bipartite graph, where relations are not necessarily identical.

Finally, we state Lemma 10. The extension of Lemma 9 for labeled graphs, can be directly proven by Lemma 10 as the proof by Assadi et al. (Assadi et al., 2019).

Lemma 10.

(Extended Decomposition Lemma) Given a query hypergraph $\mathcal{H}$ of binary relations, the LP in Definition 1 admits a half-integral optimal solution that satisfies the conditions in Lemma 8.

Proof.

The proof is similar to the proof by Assadi et al. (Assadi et al., 2019), but we consider $\log{|R_{F}|}$ which may be different over $F\in\mathcal{E}$ . Assume that any half-integral optimal solution $x$ is given.

We first show that any edge $F$ in a cycle (odd or even) in $supp(x)$ has $x_{F}=\frac{1}{2}$ . Otherwise if $x_{F}=1$ (cannot be 0 due to the definition of $supp(x)$ ), we can always reduce it into $\frac{1}{2}$ without breaking the feasibility of $x$ , since both endpoints of $F$ have at least two edges in the cycle. This results in a smaller objective, contradicting that $x$ is an optimal solution.

We then remove any even cycles. If an even cycle $C$ with edges $\{F_{1}$ , $F_{2}$ , …, $F_{2k}\}$ exists in $supp(x)$ , we argue that $N_{1}+N_{3}+...+N_{2k-1}$ = $N_{2}+N_{4}+...+N_{2k}$ where $N_{i}=\log{|R_{F_{i}}|}$ . If LHS ¡ RHS, we subtract $\frac{1}{2}$ from each $x_{F_{2i}}$ and add $\frac{1}{2}$ to each $x_{F_{2i-1}}$ . Again, this does not break the feasibility of $x$ but decreases the objective by $\frac{1}{2}\cdot(\text{RHS}-\text{LHS})$ , contradicting the optimality of $x$ . Similarly, LHS ¿ RHS does not hold. Thus, LHS = RHS, indicating:

(16)

|R_{F_{1}}||R_{F_{3}}|...|R_{F_{2k-1}}|=|R_{F_{2}}||R_{F_{4}}|...|R_{F_{2k}}|

Then, we can break $C$ by safely subtracting $\frac{1}{2}$ from each $x_{F_{2i}}$ and adding $\frac{1}{2}$ to each $x_{F_{2i-1}}$ (or vice versa), preserving the objective.

We next remove any odd cycles connected to any other components. If an odd cycle $C$ with edges $\{F_{1}$ , $F_{2}$ , …, $F_{2k-1}\}$ exists in $supp(x)$ such that a vertex $v$ in $C$ has an incident edge $F_{2k}$ outside $C$ , we again argue that $N_{1}+N_{3}+...+N_{2k-1}$ = $N_{2}+N_{4}+...+N_{2k}$ using the same logic as we remove the even cycles. Then, we can remove such odd cycles connected to any other components.

We now have a set of vertex-disjoint odd cycles and trees in $supp(x)$ . We will break trees into stars. For each tree, let $\langle F_{1},F_{2},...,F_{j}\rangle$ be any path connecting two leaf vertices. Note that $x_{F_{1}}=x_{F_{j}}=1$ to satisfy the constraint in the LP for these two vertices. $x_{F_{2}}=x_{F_{3}}=...=x_{F_{j-1}}=\frac{1}{2}$ since any midpoint has two incident edges in the path, and from the same logic used for cycles.

If $j=2k$ (even), we argue $N_{2}+N_{4}+...+N_{2k-2}$ = $N_{3}+N_{5}+...+N_{2k-1}$ using the same logic as above. Then, we can subtract $\frac{1}{2}$ from each $x_{F_{2i-1}}$ and add $\frac{1}{2}$ to each $x_{F_{2i-2}}$ . If $j=2k+1$ (odd), $N_{2}+N_{4}+...+N_{2k}$ = $N_{3}+N_{5}+...+N_{2k-1}$ , and we subtract $\frac{1}{2}$ from each $x_{F_{2i-1}}$ and add $\frac{1}{2}$ to each $x_{F_{2i}}$ except $F_{2}$ . In both cases, the path is decomposed into segments of at most two edges each. We repeat this for every path connecting two leaf vertices, and the final forest consists of trees of a maximum diameter of two, i.e., stars. $\square$

Assadi et al. (Assadi et al., 2019) do not present a concrete join size estimator for labeled graphs but prove a lower bound $\Omega\big{(}\frac{AGM(\mathcal{H}_{U})}{\text{OUT}}\big{)}$ . Here, $\mathcal{H}_{U}$ represents $\mathcal{H}$ without labels, and $AGM(\mathcal{H}_{U})\geq AGM(\mathcal{H})$ since $|R_{F}|\leq|R|$ for each $F\in\mathcal{E}$ ; $R_{F}$ ’s are subsets of the set of all data edges $R$ .

Appendix E Proofs of Lemmas and Propositions

In order to prove Lemmas 1-5 and Propositions 1-8, we first expand $Z_{\mathcal{H}_{s}}$ and $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]$ , then derive inequalities on $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]$ . We define another random variable $Z_{\mathcal{H}_{s}}(s_{I})\coloneqq\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]}\cdot Z_{\mathcal{H}_{s\uplus s_{I}}}$ where $s_{I}$ is a random variable in $Z_{\mathcal{H}_{s}}(s_{I})$ . Then, $Z_{\mathcal{H}_{s}}=\frac{1}{k}\sum_{s_{I}\in S_{I}}Z_{\mathcal{H}_{s}}(s_{I})$ , i.e., $Z_{\mathcal{H}_{s}}$ and $Z_{\mathcal{H}_{s}}(s_{I})$ are mutually recursively defined. Then, from (Banerjee, 2012):

(17)

\begin{split}\operatorname*{\mathbb{V}ar}[&Z_{\mathcal{H}_{s}}]=\frac{\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}(s_{I})]}{k}\Big{(}1-\frac{k-1}{|\Omega_{I}|-1}\operatorname*{\mathbb{I}}[S_{I}\sim\Omega_{I}\text{ w/o replacement}]\Big{)}.\end{split}

This holds since $S_{I}$ consists of identically distributed samples $\{s_{I}\}$ . Note that $s_{I}\in S_{I}$ may not be independent due to sampling without replacement, e.g., in Alley. Then, $\operatorname*{\mathbb{I}}[\cdot]$ above is 1 and the variance is decreased by $\frac{k-1}{|\Omega_{I}|-1}$ portion (Banerjee, 2012).

From the law of total variance:

(18)

\begin{split}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}(s_{I})]=&\operatorname*{\mathbb{E}}_{s_{I}}[\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}|s_{I}]]+\operatorname*{\mathbb{V}ar}_{s_{I}}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]].\\ \end{split}

Here, $Z_{\mathcal{H}_{s}}|s_{I}=\frac{1}{P(s_{I})}\operatorname*{\mathbb{I}}[s_{I}]Z_{\mathcal{H}_{s\uplus s_{I}}}$ as $Z_{\mathcal{H}_{s}}(s_{I})$ , but $s_{I}$ is a fixed value in $Z_{\mathcal{H}_{s}}|s_{I}$ instead of a random variable. In the second term above, we show that $\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]=\frac{1}{P(s_{I})}\cdot\operatorname*{\mathbb{I}}[s_{I}]\cdot\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})$ in the proof of Lemma 1. Then,

(19)

\begin{split}\operatorname*{\mathbb{V}ar}_{s_{I}}&[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]]\leq\operatorname*{\mathbb{E}}_{s_{I}}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]^{2}]\\ =&\sum_{s_{I}}P(s_{I})\cdot\frac{1}{P(s_{I})^{2}}\cdot\operatorname*{\mathbb{I}}[s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]\cdot\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\\ =&\sum_{s_{I}}\frac{1}{P(s_{I})}\operatorname*{\mathbb{I}}[s_{I}]\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}.\end{split}

The $\frac{1}{P(s_{I})}$ (¿ 1) factor has been missing in the original paper of Assadi et al. (Assadi et al., 2019), reducing an upper bound of $\operatorname*{\mathbb{V}ar}_{s_{I}}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]]$ incorrectly.

Proof of Lemma 1.

We use an induction hypothesis that GenericCardEst (Algorithm 1) returns an unbiased estimate of $\text{OUT}(\mathcal{H}_{s})$ . For the base case where $I=\mathcal{O}$ , Line 1 returns $Z_{\mathcal{H}_{s}}=\frac{\sum_{s_{I}\in S_{I}}Z_{\mathcal{H}_{s}}(s_{I})}{k}$ , where $Z_{\mathcal{H}_{s}}(s_{I})=\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot Z_{\mathcal{H}_{s\uplus s_{I}}}$ . Here, $Z_{\mathcal{H}_{s\uplus s_{I}}}=1$ since the recursive call to GenericCardEst reaches Line 1. Since $s_{I}$ ’s follow an identical distribution, $\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}]=\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}(s_{I})]$ and

(20)

\begin{split}\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}(s_{I})]&=\sum_{s_{I}\in\Omega_{I}}P(s_{I})\cdot\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot 1\\ &=\sum_{s_{I}\in\Omega_{I}}{\operatorname*{\mathbb{I}}[s_{I}]}\\ &=\sum_{s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)}1\,\,\,(\because\Omega_{I}\supseteq\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s))\\ &=\absolutevalue{\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)}\\ &=\text{OUT}(\mathcal{H}_{s}).\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,(\because I=\mathcal{O})\\ \end{split}

Note that the assumption $\Omega_{I}\supseteq\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$ in Lemma 1 is used. If $\Omega_{I}\not\supseteq\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$ , then $\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}(s_{I})]=\sum_{s_{I}\in\Omega_{I}}\operatorname*{\mathbb{I}}[s_{I}\in\,\allowbreak\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]$ ¡ $\absolutevalue{\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)}$ , leading to a negative bias.

For the inductive case where $I\subsetneq\mathcal{O}$ , we use the law of total expectation and our induction hypothesis at Line 1, achieving

(21)

\begin{split}\operatorname*{\mathbb{E}}[&Z_{\mathcal{H}_{s}}(s_{I})]=\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]]\\ &=\operatorname*{\mathbb{E}}\Big{[}\operatorname*{\mathbb{E}}\Big{[}\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot Z_{\mathcal{H}_{s\uplus s_{I}}}\Big{|}s_{I}\Big{]}\Big{]}\\ &=\operatorname*{\mathbb{E}}\Big{[}\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s\uplus s_{I}}}|s_{I}]\Big{]}\\ &=\operatorname*{\mathbb{E}}\Big{[}\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\Big{]}\,\,\,(\because\text{induction hypothesis})\\ &=\sum_{s_{I}\in\Omega_{I}}P(s_{I})\cdot\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &=\sum_{s_{I}\in\Omega_{I}}{\operatorname*{\mathbb{I}}[s_{I}]}\cdot\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &=\sum_{s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)}\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\,\,\,(\because\Omega_{I}\supseteq\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s))\\ &=\text{OUT}(\mathcal{H}_{s}).\end{split}

The assumption $\Omega_{I}\supseteq\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$ is used again. The last equality holds from Lines 4-4 of GenericJoin (Algorithm 4); $s_{I}$ iterates over $\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$ , not sampled from $\Omega_{I}$ . Therefore, $\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}]=\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}(s_{I})]=\text{OUT}({\mathcal{H}_{s}})$ .

$\square$

Proof of Lemma 2.

We first rewrite the AGM bounds in LHS using their definitions. Given an optimal fractional edge cover $x$ of $\mathcal{H}$ , the LHS is

(22)

\begin{split}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}=\frac{\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{\mathcal{O}}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}.\end{split}

We partition 1) $\mathcal{E}_{\mathcal{O}\setminus I}$ into $(\mathcal{E}_{I}\cap\mathcal{E}_{\mathcal{O}\setminus I})\cup(\mathcal{E}_{\mathcal{O}\setminus I}\setminus\mathcal{E}_{I})$ and 2) $\mathcal{E}_{\mathcal{O}}$ into $\mathcal{E}_{I}\cup(\mathcal{E}_{\mathcal{O}\setminus I}\setminus\mathcal{E}_{I})$ since any $F\in\mathcal{E}_{\mathcal{O}}$ must contain an attribute in $I$ or $\mathcal{O}\setminus I$ .

Since $s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$ , $\absolutevalue{R_{F}\ltimes s\uplus s_{I}}\geq 1:\forall F\in\mathcal{E}_{I}$ . Chen & Yi (Chen and Yi, 2020) also compute the probability of $s_{I}$ only if $s_{I}$ contributes to any join answer.

(23)

\begin{split}&\frac{\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{\mathcal{O}}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}\\ &=\frac{\prod_{F\in\mathcal{E}_{I}\cap\mathcal{E}_{\mathcal{O}\setminus I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}\cdot\frac{\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}\setminus\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}\setminus\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}\\ &=\frac{\prod_{F\in\mathcal{E}_{I}\cap\mathcal{E}_{\mathcal{O}\setminus I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}\\ &\;\;\;\;\;\;(\because R_{F}\ltimes s\uplus s_{I}=R_{F}\ltimes s:\forall F\in\mathcal{E}_{\mathcal{O}\setminus I}\setminus\mathcal{E}_{I})\\ &\leq\frac{\prod_{F\in\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}\;\;\;(\because\absolutevalue{R_{F}\ltimes s\uplus s_{I}}\geq 1:\forall F\in\mathcal{E}_{I}\setminus\mathcal{E}_{\mathcal{O}\setminus I})\\ &=\prod_{F\in\mathcal{E}_{I}}rdeg_{F,s}(s_{I})^{x_{F}}\\ &\leq\prod_{F\in\mathcal{E}_{I}}rdeg_{F^{*},s}(s_{I})^{x_{F}}\;\;(\because rdeg_{F,s}(s_{I})\leq rdeg_{F^{*},s}(s_{I}))\\ &=rdeg_{F^{*},s}(s_{I})^{\sum_{F\in\mathcal{E}_{I}}x_{F}}\\ &\leq rdeg_{F^{*},s}(s_{I})\;\;\Big{(}\because\sum_{F\in\mathcal{E}_{I}}x_{F}\geq 1,rdeg_{F^{*},s}(s_{I})\leq 1\Big{)}.\end{split}

$\square$

Proof of Proposition 1.

We use an induction hypothesis $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq 2^{d}\cdot AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H})$ on $d$ , the number of odd cycles and stars in $\mathcal{H}$ .

For the base case when $d=1$ , $\mathcal{H}$ consists of either a single odd cycle or a star. As explained in appendix D, the maximum recursion depth is 2 + 1. We use $I_{1}$ (and $k_{1}$ ) and $I_{2}$ (and $k_{2}$ ) to denote the $I$ (and $k$ ) values of the first two recursions, where $\mathcal{O}=\emptyset$ in the last recursion so Lines 1-1 of Algorithm 1 are executed.

If $\mathcal{H}$ is an odd cycle $C$ of $2n+1$ vertices $\{u_{1},v_{1},u_{2},v_{2},...,u_{n},v_{n},w\}$ , $AGM(\mathcal{H})=|R|^{n+\frac{1}{2}}$ from Lemma 8, $I_{1}=\{u_{1},v_{1},...,u_{n},v_{n}\},I_{2}=\{w\}$ , $\Omega_{I_{1}}=\prod_{i\in[1,n]}R_{F_{i}}$ and $P(s_{I_{1}})=\frac{1}{|R|^{n}}:\forall s_{I}\in\Omega_{I_{1}}$ , where $F_{i}=\{u_{i},v_{i}\}$ . Recall that $R_{F_{i}}=R$ for $i\in[1,n]$ .

We explain SSTE first. Since $k_{1}=1$ and $s=\emptyset$ , $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]=\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}|s_{I_{1}}]]+\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}}|s_{I_{1}}]]$ from (18). For the first term,

(24)

\begin{split}\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[&Z_{\mathcal{H}}|s_{I_{1}}]]\\ =&\,\frac{1}{|R|^{n}}\sum_{s_{I_{1}}}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}|s_{I_{1}}]\,\,\,\Big{(}\because P(s_{I_{1}})=\frac{1}{|R|^{n}}\Big{)}\\ =&\,\frac{1}{|R|^{n}}\sum_{s_{I_{1}}}\operatorname*{\mathbb{V}ar}\Big{[}\frac{1}{P(s_{I})}\operatorname*{\mathbb{I}}[s_{I_{1}}]Z_{\mathcal{H}_{s_{I_{1}}}}\Big{]}\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s_{I_{1}}}}]\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\frac{\operatorname*{\mathbb{I}}[s_{I_{1}}]}{k_{I_{2}}}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s_{I_{1}}}}(s_{I_{2}})]\,\,\,(\because(\ref{eq:totalvariance}))\\ \leq&\,|R|^{n}\sum_{s_{I_{1}}}\frac{\operatorname*{\mathbb{I}}[s_{I_{1}}]}{k_{I_{2}}}\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}_{s_{I_{1}}}}(s_{I_{2}})]\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\frac{\operatorname*{\mathbb{I}}[s_{I_{1}}]}{k_{I_{2}}}\sum_{s_{I_{2}}}\frac{1}{|\Omega_{I_{2}}|}Z^{2}_{\mathcal{H}_{s_{I_{1}}}}(s_{I_{2}})\,\,\,\Big{(}\because P(s_{I_{2}})=\frac{1}{|\Omega_{I_{2}}|}\Big{)}\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\frac{\operatorname*{\mathbb{I}}[s_{I_{1}}]}{k_{I_{2}}}\sum_{s_{I_{2}}}|\Omega_{I_{2}}|\operatorname*{\mathbb{I}}[s_{I_{2}}]\,\,\,\Big{(}\because Z^{2}_{\mathcal{H}_{s_{I_{1}}}}(s_{I_{2}})=|\Omega_{I_{2}}|\operatorname*{\mathbb{I}}[s_{I_{2}}]1\Big{)}\\ \leq&\,|R|^{n+\frac{1}{2}}\sum_{s_{I_{1}}}\sum_{s_{I_{2}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{I}}[s_{I_{2}}]\,\,\,\Bigg{(}\because k_{I_{2}}=\lceil\frac{|\Omega_{I_{2}}|}{|R|^{\frac{1}{2}}}\rceil\geq\frac{|\Omega_{I_{2}}|}{|R|^{\frac{1}{2}}}\Bigg{)}\\ =&\,|R|^{n+\frac{1}{2}}\text{OUT}(\mathcal{H})=AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H})\,\,\,(\because\text{Lemma \ref{lemma:decomposition}}).\end{split}

For the second term of the total variance, the new proof of SSTE reduces the $\text{OUT}({\mathcal{H}_{s_{I_{1}}}})$ factor in (19) into $\sqrt{|R|}$ . We have $\text{OUT}({\mathcal{H}_{s_{I_{1}}}})\leq|\Omega_{I_{2}}|=|\pi_{w}(R_{F_{0}}\ltimes a_{1})|$ for $a_{1}$ denoting the sampled data vertex for $u_{1}$ . If $|\pi_{w}(R_{F_{0}}\ltimes a_{1})|\leq\sqrt{|R|}$ , then $\text{OUT}({\mathcal{H}_{s_{I_{1}}}})\leq\sqrt{|R|}$ . Otherwise, recall that $a_{1}$ is chosen to have the smallest degree among the cycle vertices. Then, there can be at most $\sqrt{|R|}$ vertices in the data graph with degrees larger than $a_{1}$ ; using the proof by contradiction, the total number of edges will be larger than $|R|$ if more than $\sqrt{|R|}$ vertices have larger degrees than $a_{1}$ . Hence, we have at most $\sqrt{|R|}$ results for the cycle given $s_{I_{1}}$ , i.e., $\text{OUT}({\mathcal{H}_{s_{I_{1}}}})\leq\sqrt{|R|}$ :

(25)

\begin{split}\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[&Z_{\mathcal{H}}|s_{I_{1}}]]\leq\sum_{s_{I_{1}}}\frac{1}{P(s_{I_{1}})}\operatorname*{\mathbb{I}}[s_{I_{1}}]\text{OUT}({\mathcal{H}_{s_{I_{1}}}})^{2}\,\,\,\,\,(\because(\ref{eq:varexp}))\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\text{OUT}({\mathcal{H}_{s_{I_{1}}}})^{2}\leq|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\text{OUT}_{\mathcal{H}_{s_{I_{1}}}}\sqrt{|R|}\\ =&\,|R|^{n+\frac{1}{2}}\text{OUT}(\mathcal{H})=AGM(\mathcal{H})\cdot\text{OUT}({\mathcal{H}}).\end{split}

Adding (24) and (25) completes the proof for SSTE given an odd cycle.

For SUST, $\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}}|s_{I_{1}}]$ is bounded by $AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H})$ from (25) and the following bound for $\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}|s_{I_{1}}]]$ completes the proof for SUST given an odd cycle.

(26)

\begin{split}\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[&Z_{\mathcal{H}}|s_{I_{1}}]]\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s_{I_{1}}}}]\,\,\,(\because(\ref{eq:sste:expvar:cycle}))\\ \leq&\,|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}_{s_{I_{1}}}}]=|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\sum_{s_{I_{2}}}\frac{1}{P(s_{I_{2}})}\operatorname*{\mathbb{I}}[s_{I_{2}}]\\ =&\,|R|^{n+\frac{1}{2}}\sum_{s_{I_{1}}}\sum_{s_{I_{2}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{I}}[s_{I_{2}}]\Big{(}\because P(s_{I_{2}})=\frac{1}{\sqrt{|R|}}\Big{)}\\ =&\,AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H}).\end{split}

If $\mathcal{H}$ is a star $S$ of $n$ edges $\{F_{1},F_{2},...,F_{n}\}$ , $AGM(\mathcal{H})=|R|^{n}$ from Lemma 8. For SSTE, $I_{1}$ contains the center attribute $A$ of $S$ only, $\Omega_{I_{1}}=\pi_{A}R$ , $I_{2}$ is the set of attributes of $S$ except $I_{1}$ , and $\Omega_{I_{2}}=\prod_{i}\pi_{I_{2}}(R_{F_{i}}\ltimes s_{I_{1}})$ for a sampled center vertex $s_{I_{1}}$ .

Again, $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]=\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}|s_{I_{1}}]]+\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}}|s_{I_{1}}]]$ since $k_{I_{1}}=1$ . For the first term,

(27)

\begin{split}\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[&Z_{\mathcal{H}}|s_{I_{1}}]]\\ =&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}|s_{I_{1}}]\,\,\,\Big{(}\because P(s_{I_{1}})=\frac{|R\ltimes s_{I_{1}}|}{|R|}\Big{)}\\ \leq&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}}|s_{I_{1}}]\\ =&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\sum_{s_{I_{2}}}\frac{1}{|R\ltimes s_{I_{1}}|^{n}}Z^{2}_{\mathcal{H}}|s_{I_{1}}\,\,\,\Big{(}\because P(s_{I_{2}})=\frac{1}{|R\ltimes s_{I_{1}}|^{n}}\Big{)}\\ =&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\sum_{s_{I_{2}}}\frac{1}{|R\ltimes s_{I_{1}}|^{n}}\Big{(}\frac{1}{P(s_{I_{1}})}\operatorname*{\mathbb{I}}[s_{I_{1}}]Z_{\mathcal{H}_{s_{I_{1}}}}\Big{)}^{2}\\ =&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\sum_{s_{I_{2}}}\frac{1}{|R\ltimes s_{I_{1}}|^{n}}\Big{(}\frac{1}{P(s_{I_{1}})}\operatorname*{\mathbb{I}}[s_{I_{1}}]\frac{1}{P(s_{I_{2}})}\operatorname*{\mathbb{I}}[s_{I_{2}}]\Big{)}^{2}\\ =&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\sum_{s_{I_{2}}}\frac{1}{|R\ltimes s_{I_{1}}|^{n}}\Big{(}\frac{|R|}{|R\ltimes s_{I_{1}}|}\operatorname*{\mathbb{I}}[s_{I_{1}}]|R\ltimes s_{I_{1}}|^{n}\operatorname*{\mathbb{I}}[s_{I_{2}}]\Big{)}^{2}\\ =&\,|R|\sum_{s_{I_{1}}}|R\ltimes s_{I_{1}}|^{n-1}\sum_{s_{I_{2}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{I}}[s_{I_{2}}]\\ \leq&\,|R||R|^{n-1}\sum_{s_{I_{1}}}\sum_{s_{I_{2}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{I}}[s_{I_{2}}]\\ =&\,|R|^{n}\text{OUT}(\mathcal{H})=AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H})\,\,\,(\because\text{Lemma \ref{lemma:decomposition}}).\end{split}

For the second term of the total variance, $\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}}|s_{I_{1}}]]\leq\text{OUT}(\mathcal{H})^{2}\leq AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H})$ from (19). Therefore, we have $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq 2\cdot AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H})$ for SSTE given a star.

For SUST, $I_{1}$ is the set of attributes of the star, $\Omega_{I_{1}}=R^{n}$ , and $P(s_{I_{1}})=\frac{1}{|R|^{n}}$ . Thus, $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}}]=\sum_{s_{I_{1}}}\frac{1}{P(s_{I_{1}})}\cdot\operatorname*{\mathbb{I}}[s_{I_{1}}]\cdot 1=|R|^{n}\cdot\text{OUT}(\mathcal{H})=AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H})$ .

For the inductive case when $d>1$ , $\mathcal{H}$ consists of multiple odd cycles and stars. The proofs for both SSTE and SUST are naive expansions of (24)-(27) conditioned on multiple components instead of a part of an odd cycle or a star. Please refer to (Assadi et al., 2019) for more details.

$\square$

Proof of Proposition 2.

We use an induction hypothesis $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]\leq\frac{t^{|\mathcal{O}|}-t}{t-1}\cdot{\text{OUT}(\mathcal{H}_{s})^{2}}$ .

For the base case $I=\mathcal{O}$ and $|I|=1$ , $s_{I}\in\Omega_{I}=\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$ , and $Z_{\mathcal{H}_{s}}(s_{I})=\frac{1}{P(s_{I})}=|\Omega_{I}|$ regardless of $s_{I}$ . Therefore, $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]=0\leq\frac{t^{|\mathcal{O}|}-t}{t-1}\cdot\text{OUT}(\mathcal{H}_{s})^{2}$ .

For the inductive case $I\subsetneq\mathcal{O}$ ,

(28)

\begin{split}\operatorname*{\mathbb{E}}[&\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}|s_{I}]]\\ &=\sum_{s_{I}}\frac{1}{|\Omega_{I}|}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}|s_{I}]\\ &=\sum_{s_{I}}\frac{1}{|\Omega_{I}|}\operatorname*{\mathbb{V}ar}[|\Omega_{I}|\cdot\operatorname*{\mathbb{I}}[s_{I}]\cdot Z_{\mathcal{H}_{s\uplus s_{I}}}]\\ &=\sum_{s_{I}}|\Omega_{I}|\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s\uplus s_{I}}}]\,\,\,(\because\operatorname*{\mathbb{I}}[s_{I}]=1)\\ &\leq\sum_{s_{I}}|\Omega_{I}|\frac{t^{|\mathcal{O}\setminus I|}-t}{t-1}\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})^{2}\,\,\,(\because\text{induction hypothesis}).\end{split}

From (19), $\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]]\leq\sum_{s_{I}}|\Omega_{I}|\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}$ . By plugging this and (28) into (18), we have

(29)

\begin{split}\operatorname*{\mathbb{V}ar}&[Z_{\mathcal{H}_{s}}(s_{I})]\\ &\leq\sum_{s_{I}}\Big{(}|\Omega_{I}|\frac{t^{|\mathcal{O}\setminus I|}-t}{t-1}+|\Omega_{I}|\Big{)}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\\ &=|\Omega_{I}|\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}.\end{split}

Therefore, from (17),

(30)

\begin{split}\operatorname*{\mathbb{V}ar}&[Z_{\mathcal{H}_{s}}]=\frac{1}{k}\Big{(}1-\frac{k-1}{|\Omega_{I}|-1}\Big{)}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}(s_{I})]\\ &\leq\frac{1}{k}\Big{(}1-\frac{k-1}{|\Omega_{I}|-1}\Big{)}|\Omega_{I}|\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\\ &=\frac{|\Omega_{I}|-k}{k(|\Omega_{I}|-1)}|\Omega_{I}|\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\\ &\leq\frac{1-b}{b}\frac{|\Omega_{I}|}{|\Omega_{I}|-1}\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\\ &\;\;\;\;\;\;(\because k=\lceil b\cdot|\Omega_{I}|\rceil\geq b|\Omega_{I}|)\\ &\leq\frac{2(1-b)}{b}\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\,\,\,\Big{(}\because\frac{|\Omega_{I}|}{|\Omega_{I}|-1}\leq 2\Big{)}\\ &=t\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\,\,\,\Big{(}\because t=\frac{2(1-b)}{b}\Big{)}\\ &\leq t\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\Big{(}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})\Big{)}^{2}\\ &=\frac{t^{|\mathcal{O}|}-t}{t-1}\text{OUT}(\mathcal{H}_{s})^{2}.\end{split}

$\square$

Proof of Proposition 3.

We use an induction hypothesis $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]\leq|\mathcal{O}|\cdot AGM(\mathcal{H}_{s})\cdot\text{OUT}(\mathcal{H}_{s})$ .

For the base case $I=\mathcal{O}$ and $|I|=1$ , $AGM(\mathcal{H}_{s\uplus s_{I}})=1$ , for $\mathcal{O}\setminus I=\emptyset$ . Hence,

(31)

\begin{split}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]&\leq\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}_{s}}]\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s\uplus s_{I}}})}{AGM({\mathcal{H}_{s}})}\Big{(}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\Big{)}^{2}\operatorname*{\mathbb{I}}[s_{I}]\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]\\ &=AGM({\mathcal{H}_{s}})\sum_{s_{I}}\operatorname*{\mathbb{I}}[s_{I}]\;\;\;(\because AGM(\mathcal{H}_{s\uplus s_{I}})=1)\\ &=AGM({\mathcal{H}_{s}})\cdot\text{OUT}({\mathcal{H}_{s}}).\end{split}

For the inductive case $I\subsetneq\mathcal{O}$ , we again use the induction hypothesis:

(32)

\begin{split}\operatorname*{\mathbb{E}}&[\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}|s_{I}]]\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s\uplus s_{I}}})}{AGM({\mathcal{H}_{s}})}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}|s_{I}]\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s\uplus s_{I}}})}{AGM({\mathcal{H}_{s}})}\operatorname*{\mathbb{V}ar}\Big{[}\frac{1}{P(s_{I})}\cdot\operatorname*{\mathbb{I}}[s_{I}]\cdot Z_{\mathcal{H}_{s\uplus s_{I}}}\Big{]}\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s\uplus s_{I}}}]\;\;\Big{(}\because P(s_{I})=\frac{AGM({\mathcal{H}_{s\uplus s_{I}}})}{AGM({\mathcal{H}_{s}})}\Big{)}\\ &\leq\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]|\mathcal{O}\setminus I|AGM({\mathcal{H}_{s\uplus s_{I}}})\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &\;\;\;\;\;(\because\text{induction hypothesis})\\ &=|\mathcal{O}\setminus I|AGM({\mathcal{H}_{s}})\sum_{s_{I}}\operatorname*{\mathbb{I}}[s_{I}]\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &=|\mathcal{O}\setminus I|\cdot AGM({\mathcal{H}_{s}})\cdot\text{OUT}({\mathcal{H}_{s}}).\end{split}

From (19), we have

(33)

\begin{split}\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[&Z_{\mathcal{H}_{s}}|s_{I}]]\\ &\leq\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})^{2}\\ &\leq\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]AGM({\mathcal{H}_{s\uplus s_{I}}})\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &=AGM({\mathcal{H}_{s}})\sum_{s_{I}}\operatorname*{\mathbb{I}}[s_{I}]\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &=AGM({\mathcal{H}_{s}})\cdot\text{OUT}({\mathcal{H}_{s}}).\end{split}

Therefore, plugging these into (18) and setting $k=1$ in (17) give $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]\leq|\mathcal{O}|\cdot AGM({\mathcal{H}_{s}})\cdot\text{OUT}({\mathcal{H}_{s}})$ . In addition, $|\mathcal{V}|$ in the proposition can be removed with a simple proof by Chen & Yi (Chen and Yi, 2020) using the variance of Binomial distribution.

$\square$

Proof of Proposition 4.

We use an induction hypothesis $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]\leq|\mathcal{O}|\cdot\prod_{I\subset\mathcal{O},|I|=1}|\mathcal{E}_{I}|\cdot AGM(\mathcal{H}_{s})\cdot\text{OUT}(\mathcal{H}_{s})$ , similar to the proof of Proposition 3.

For the base case $I=\mathcal{O}$ and $|I|=1$ ,

(34)

\begin{split}\operatorname*{\mathbb{V}ar}[&Z_{\mathcal{H}_{s}}]\leq\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}_{s}}]\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s\uplus s_{I}}})}{|\mathcal{E}_{I}|AGM({\mathcal{H}_{s}})}\Big{(}\frac{|\mathcal{E}_{I}|AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\Big{)}^{2}\operatorname*{\mathbb{I}}[s_{I}]\\ &=\sum_{s_{I}}\frac{|\mathcal{E}_{I}|AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]=|\mathcal{E}_{I}|\cdot AGM({\mathcal{H}_{s}})\cdot\text{OUT}({\mathcal{H}_{s}}).\end{split}

Note that, compared to (31) of GJ-Sample, the upper bound of $\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]$ has increased by a constant factor $\prod_{I}|\mathcal{E}_{I}|$ . Similarly, we can easily show that the upper bounds of both (32) and (33) increase by $\prod_{I\subset\mathcal{O},|I|=1}|\mathcal{E}_{I}|$ times, proving the inductive case of Proposition 4. By using the proof by Chen & Yi (Chen and Yi, 2020), we can remove the $|\mathcal{V}|$ factor from the proposition as well.

$\square$

Now, we prove the bounds for $T_{\mathcal{H}}$ (or $\operatorname*{\mathbb{E}}[T_{\mathcal{H}}]$ ). In Algorithm 1, Lines 1-1 take $\tilde{O}(1)$ . Line 1 takes $\tilde{O}(\min_{F\in\mathcal{E}_{I}}|\pi_{I}(R_{F}\ltimes s)|)$ time for Alley+ due to the intersection and $\tilde{O}(1)$ for the others, i.e., $\Omega_{I}$ is readily available from the query model in section 5.2. Line 1 takes $\tilde{O}(\text{IN})$ for GJ-Sample for non-sequenceable queries and $\tilde{O}(1)$ for other cases. Line 1 takes $\tilde{O}(k)$ . Line 1 takes $\tilde{O}(1)$ ; computing relative degrees takes $\tilde{O}(|\mathcal{E}_{I}|)=\tilde{O}(1)$ in DRS. Line 1 takes $\tilde{O}(k)+\sum_{s_{I}\in S_{I}}\operatorname*{\mathbb{I}}[s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]T_{\mathcal{H}_{s\uplus s_{I}}}$ . Note that $\operatorname*{\mathbb{I}}[s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]=\prod_{F\in\mathcal{E}_{I}}\operatorname*{\mathbb{I}}[s_{I}\in\pi_{I}(R_{F}\ltimes s)]$ takes $\tilde{O}(|\mathcal{E}_{I}|)=\tilde{O}(1)$ time to evaluate.

From this, we can recursively define $T_{\mathcal{H}_{s}}$ , the runtime of computing $Z_{\mathcal{H}_{s}}$ :

(35)

T_{\mathcal{H}_{s}}=\left\{\begin{array}[]{ll}\tilde{O}(1)&\text{if }I=\mathcal{O}\\ \tilde{O}(\text{IN})+\operatorname*{\mathbb{I}}[s_{I}]T_{\mathcal{H}_{s\uplus s_{I}}}&\text{if }I\subsetneq\mathcal{O},\\ &\text{non-sequenceable }\\ &\text{queries for {{GJ-Sample}}}\\ \tilde{O}(k)+\sum_{s_{I}\in S_{I}}\operatorname*{\mathbb{I}}[s_{I}]T_{\mathcal{H}_{s\uplus s_{I}}}&\text{otherwise}\\ \end{array}\right.

Proof of Proposition 5.

In SSTE, $k=\lceil\frac{|\Omega_{I}|}{\sqrt{|R|}}\rceil$ if $I$ consists of a single vertex $w$ of an odd cycle and $k=1$ otherwise. We prove $\operatorname*{\mathbb{E}}[k]=\tilde{O}(1)$ for the first case. Then, if there are $n$ cycles in $\mathcal{H}$ with $k=k_{1},k_{2},...,k_{n}$ each, $\operatorname*{\mathbb{E}}[T_{\mathcal{H}}]=\tilde{O}(\prod_{i}\operatorname*{\mathbb{E}}[k_{i}])=\tilde{O}(1)$ from (35).

$k$ for $I=\{w\}$ depends solely on the sampled data edge $\{a_{1},b_{1}\}$ for $F_{1}=\{u_{1},u_{2}\}$ . Recall that $\Omega_{I}=\pi_{I}(R_{F_{0}}\ltimes a_{1})$ and $|\pi_{I}(R_{F_{0}}\ltimes a_{1})|\leq|\pi_{I}(R_{F_{0}}\ltimes b_{1})|$ since $a_{1}$ has the smallest degree among the cycle vertices. The last equality below holds from a well-known fact in graph theory (Assadi et al., 2018).

(36)

\vspace*{-0.2cm}\begin{split}\operatorname*{\mathbb{E}}[k]&=\sum_{\{a_{1},b_{1}\}\in R}\frac{1}{|R|}\lceil\frac{\min(|\pi_{I}(R\ltimes a_{1})|,|\pi_{I}(R\ltimes b_{1})|)}{\sqrt{|R|}}\rceil\\ &\leq 1+\sum_{\{a_{1},b_{1}\}}{\frac{1}{|R|}\frac{\min(|\pi_{I}(R\ltimes a_{1})|,|\pi_{I}(R\ltimes b_{1})|)}{\sqrt{|R|}}}\\ &=\tilde{O}(1)\,\,\,(\because\text{Proposition 2.2 in Assadi et al. \cite[citep]{(\@@bibref{AuthorsPhrase1Year}{SSTEFull}{\@@citephrase{, }}{})}})\end{split}

$\square$

Proof of Proposition 6.

Simply, Alley+ takes $b^{|\mathcal{V}|}$ portion of all paths searched by GenericJoin in expectation. Therefore, $\operatorname*{\mathbb{E}}[T_{\mathcal{H}}]=\tilde{O}(b^{|\mathcal{V}|}AGM(\mathcal{H}))$ . $T_{\mathcal{H}}$ is $\tilde{O}(AGM(\mathcal{H}))$ in the worst case, since the computation cost of GenericCardEst for Alley+ is subsumed by GenericJoin.

$\square$

Proof of Proposition 7.

In GJ-Sample, $k$ is fixed to 1 regardless of $s_{I}$ . Therefore, $T_{\mathcal{H}_{s}}=\tilde{O}(1)+T_{\mathcal{H}_{s\uplus s_{I}}}$ for sequenceable queries and $T_{\mathcal{H}_{s}}=\tilde{O}(\text{IN})+T_{\mathcal{H}_{s\uplus s_{I}}}$ for non-sequenceable queries from (35). Since the maximum recursion depth is $|\mathcal{V}|$ , $T_{\mathcal{H}}=\tilde{O}(|\mathcal{V}|)=\tilde{O}(1)$ for sequenceable queries (note the excluded preprocessing cost) and $T_{\mathcal{H}}=\tilde{O}(|\mathcal{V}|\cdot\text{IN})=\tilde{O}(\text{IN})$ for non-sequenceable queries.

$\square$

Proof of Proposition 8.

In DRS, $k$ is also fixed to 1, and $T_{\mathcal{H}_{s}}=\tilde{O}(|\mathcal{E}_{I}|)+T_{\mathcal{H}_{s\uplus s_{I}}}$ . Hence, $T_{\mathcal{H}}=\tilde{O}(\sum_{I\subset\mathcal{V},|I|=1}|\mathcal{E}_{I}|)=\tilde{O}(1)$ . For SUST, $k$ is also fixed to 1, and $T_{\mathcal{H}_{s}}=\tilde{O}(1)+T_{\mathcal{H}_{s\uplus s_{I}}}$ . Hence, $T=\tilde{O}(1)$ .

$\square$

Proof of Lemma 3.

The unbiasedness of GHDCardEst can be explained by the following (37), since $R_{t}$ is an unbiased estimate of $\mathcal{H}^{G(t)}_{t}=\sum_{\chi(t)\setminus G(t)}\mathcal{H}_{t}=\sum_{\chi(t)\setminus G(t)}{\Join_{F\in\mathcal{E}_{\chi(t)}}\pi_{\chi(t)}R_{F}}$ . Note that if we replace GenericCardEst with GenericJoin in Algorithm 2, then GHDCardEst will return OUT.

(37)

\begin{split}\text{OUT}&=\sum_{\mathcal{V}}\Join_{F\in\mathcal{E}}R_{F}\\ &=\sum_{\mathcal{V}}\Join_{t\in\mathcal{V}_{\mathcal{T}}}\big{(}{\Join_{F\in\mathcal{E}_{\chi(t)}}\pi_{\chi(t)}R_{F}}\big{)}\,\,\,(\because\text{using GHD }(\mathcal{T},\chi))\\ &=\sum_{G(\mathcal{T})}\Join_{t\in\mathcal{V}_{\mathcal{T}}}\Big{(}\sum_{\chi(t)\setminus G(t)}\Join_{F\in\mathcal{E}_{\chi(t)}}\pi_{\chi(t)}R_{F}\Big{)}\\ &=\sum_{G(\mathcal{T})}\Join_{t}\operatorname*{\mathbb{E}}[R_{t}].\end{split}

$\square$

Proof of Lemma 4.

We use an induction hypothesis on $|f|$ . For the base case $|f|=1$ so $f=\{g_{t}\}$ for a node $t$ , $\operatorname*{\mathbb{V}ar}[Z[f]]$ =

$\operatorname*{\mathbb{V}ar}[Z[g_{t}]]$ . Therefore, $\operatorname*{\mathbb{V}ar}[Z[f]]$ = $\tilde{O}(\text{OUT}(\mathcal{H}_{g_{t}})^{2})$ .

For the inductive case $|f|>1$ , we use an induction hypothesis that $\operatorname*{\mathbb{V}ar}[Z[f]]=\tilde{O}(\prod_{g_{t}\in f}\text{OUT}(\mathcal{H}_{g_{t}})^{2})$ for every $|f|\leq n$ for

some $n\geq 1$ . For any $f$ with $|f|=n+1$ , we take any $g_{t}\in f$ . Since $g_{t}$ and $f\setminus g_{t}$ are independent,

(38)

\begin{split}\operatorname*{\mathbb{V}ar}&[Z[f]]\\ &=(\operatorname*{\mathbb{V}ar}[Z[g_{t}]]+\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2})(\operatorname*{\mathbb{V}ar}[Z[f\setminus g_{t}]]+\operatorname*{\mathbb{E}}[Z[f\setminus g_{t}]]^{2})\\ &\;\;\;\;\;\;\;-\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\operatorname*{\mathbb{E}}[Z[f\setminus g_{t}]]^{2}\\ &=\tilde{O}(\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2})\tilde{O}(\operatorname*{\mathbb{E}}[Z[f\setminus g_{t}]]^{2})-\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\operatorname*{\mathbb{E}}[Z[f\setminus g_{t}]]^{2}\\ &\;\;\;\;\;\;(\because\text{induction hypothesis})\\ &=\tilde{O}(\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\cdot\operatorname*{\mathbb{E}}[Z[f\setminus g_{t}]]^{2})\\ &=\tilde{O}(\operatorname*{\mathbb{E}}[Z[f]]^{2})\,\,\,(\because\text{(\ref{eq:multiexp:prod})}).\end{split}

$\square$

Proof of Lemma 5.

If $f_{1}$ and $f_{2}$ are independent ( $f_{1}\cap f_{2}=\emptyset$ ), then $\operatorname*{\mathbb{C}ov}(Z[f_{1}],Z[f_{2}])=0$ . If not, $f_{1}\cap f_{2}$ , $f_{1}\setminus f_{2}$ , and $f_{2}\setminus f_{1}$ are independent, and $Z[f_{1}]=Z[f_{1}\cap f_{2}]\cdot Z[f_{1}\setminus f_{2}]$ and $Z[f_{2}]=Z[f_{1}\cap f_{2}]\cdot Z[f_{2}\setminus f_{1}]$ . Therefore,

(39)

\begin{split}\operatorname*{\mathbb{C}ov}(Z[f_{1}],\,&Z[f_{2}])=\operatorname*{\mathbb{E}}[Z[f_{1}]\cdot Z[f_{2}]]-\operatorname*{\mathbb{E}}[Z[f_{1}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}]]\\ =&\operatorname*{\mathbb{E}}[Z[f_{1}\cap f_{2}]^{2}\cdot Z[f_{1}\setminus f_{2}]\cdot Z[f_{2}\setminus f_{1}]]\\ &-\operatorname*{\mathbb{E}}[Z[f_{1}\cap f_{2}]]^{2}\cdot\operatorname*{\mathbb{E}}[Z[f_{1}\setminus f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}\setminus f_{1}]]\\ =&\operatorname*{\mathbb{E}}[Z[f_{1}\cap f_{2}]^{2}]\cdot\operatorname*{\mathbb{E}}[Z[f_{1}\setminus f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}\setminus f_{1}]]\\ &-\operatorname*{\mathbb{E}}[Z[f_{1}\cap f_{2}]]^{2}\cdot\operatorname*{\mathbb{E}}[Z[f_{1}\setminus f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}\setminus f_{1}]]\\ =&\operatorname*{\mathbb{V}ar}[Z[f_{1}\cap f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{1}\setminus f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}\setminus f_{1}]].\end{split}

From Lemma 4 and (7), the last term is

(40)

\begin{split}\tilde{O}(&\operatorname*{\mathbb{E}}[Z[f_{1}\cap f_{2}]]^{2})\cdot\operatorname*{\mathbb{E}}[Z[f_{1}\setminus f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}\setminus f_{1}]]\\ &=\tilde{O}\Big{(}\prod_{g_{t}\in f_{1}\cap f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\Big{)}\cdot\prod_{g_{t}\in f_{1}\setminus f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\cdot\prod_{g_{t}\in f_{2}\setminus f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\\ &=\tilde{O}\Big{(}\prod_{g_{t}\in f_{1}\cap f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\cdot\prod_{g_{t}\in f_{1}\setminus f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\cdot\prod_{g_{t}\in f_{2}\setminus f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\Big{)}\\ &=\tilde{O}\Big{(}\prod_{g_{t}\in f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\cdot\prod_{g_{t}\in f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\Big{)}.\end{split}

$\square$

Proof of Proposition 9.

If $\operatorname*{\mathbb{V}ar}[Z[g_{t}]]$ approaches to 0 for every $g_{t}\in f$ , $\operatorname*{\mathbb{V}ar}[Z[f]]$ approaches to 0 from (8). For any $f_{1}$ and $f_{2}$ , if $\operatorname*{\mathbb{V}ar}[Z[g_{t}]]$ approaches to 0 for every $g_{t}\in f_{1}\cap f_{2}$ , $\operatorname*{\mathbb{V}ar}[Z[f_{1}\cap f_{2}]]$ approaches to 0. If $\operatorname*{\mathbb{V}ar}[Z[f_{1}\cap f_{2}]]$ approaches to 0, $\operatorname*{\mathbb{C}ov}(Z[f_{1}],Z[f_{2}])$ approaches to 0 from (39). Therefore, $\operatorname*{\mathbb{V}ar}[Z]$ also approaches to 0 from (9).

$\square$

Guaranteeing the O~​(A​G​MOUT)\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)} Runtime for Uniform Sampling and Size Estimation over Joins

Abstract.

1. Introduction

2. Technical Background

Definition 1.

Definition 2.

Definition 3.

Definition 4.

3. Related Work

4. Our Results

Theorem 1.

Theorem 2.

5. Achieving O~​(A​G​MOUT)\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)} Bound

5.1. Generic Sampling-based Framework

Lemma 1.

5.2. Query Model

5.3. State-of-the-art Sampling-based Estimators

5.4. Beyond the State of the Art with Degree-based Rejection Sampling

Lemma 2.

Example 0.

5.5. Unified Analysis

Proposition 1.

Proposition 2.

Proposition 3.

Proposition 4.

Proposition 5.

Proposition 6.

Proposition 7.

Proposition 8.

5.6. Underlying Index Structure

6. Achieving a Generalized Bound

6.1. Aggregations over GHDs

6.2. Sampling over GHDs

Lemma 3.

Example 0.

6.3. Analysis of GHDCardEst

Lemma 4.

Lemma 5.

Proposition 9.

Example 0.

7. Optimizations

7.1. Increasing Sampling Probabilities

7.2. Skipping Attributes and GHD Nodes

8. Research Opportunities

9. Conclusion

10. Acknowledgement

References

Appendix

Appendix A Algorithms

A.1. GenericJoin

A.2. Yannakakis, SimpleAggroYannakakis, and ExactWeight

A.3. GHDJoin and AggroGHDJoin

Appendix B Query Decomposition Lemmas

Lemma 6.

Lemma 7.

Appendix C Extension to Conjunctive Queries

Appendix D Component-at-a-time Sampling

D.1. For Unlabeled Graphs

Lemma 8.

Lemma 9.

D.2. For Labeled Graphs

Proposition 10.

Proof.

Proposition 11.

Lemma 10.

Proof.

Appendix E Proofs of Lemmas and Propositions

Proof of Lemma 1.

Proof of Lemma 2.

Proof of Proposition 1.

Proof of Proposition 2.

Proof of Proposition 3.

Proof of Proposition 4.

Proof of Proposition 5.

Proof of Proposition 6.

Proof of Proposition 7.

Proof of Proposition 8.

Proof of Lemma 3.

Proof of Lemma 4.

Proof of Lemma 5.

Guaranteeing the $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ Runtime for Uniform Sampling and Size Estimation over Joins

5. Achieving $\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)}$ Bound