

Guaranteeing the $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ Runtime for Uniform Sampling and Size Estimation over Joins

Kyoungmin Kim ([email protected]), Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea; Jaehyun Ha ([email protected]), Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea; George Fletcher ([email protected]), Eindhoven University of Technology (TU/e), Eindhoven, Netherlands; and Wook-Shin Han ([email protected]), Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea
(2023)
Abstract.

We propose a new method for estimating the number of answers OUT of a small join query $Q$ in a large database $D$, and for uniform sampling over joins. Our method is the first to satisfy all of the following:

  • Support an arbitrary $Q$, which can be either acyclic or cyclic and can contain binary and non-binary relations.

  • Guarantee an arbitrarily small error with high probability, always in $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ time, where $AGM$ is the AGM bound (an upper bound of OUT) and $\tilde{O}$ hides a polylogarithmic factor of the input size.

We also explain previous join size estimators in a unified framework. All methods, including ours, rely on certain indexes on the relations in $D$, which take linear time to build offline. Additionally, we extend our method using generalized hypertree decompositions (GHDs) to achieve a lower complexity than $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ when OUT is small, and present optimization techniques for improving estimation efficiency and accuracy.

Join size estimation, Uniform sampling, Worst-case optimal
journalyear: 2023; copyright: acmlicensed; conference/booktitle: Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS '23), June 18-23, 2023, Seattle, WA, USA; price: 15.00; isbn: 979-8-4007-0127-6/23/06; doi: 10.1145/3584372.3588676; ccs: Information systems, Database query processing; ccs: Mathematics of computing, Probability and statistics

1. Introduction

The evaluation of join queries is one of the most fundamental problems in databases (Veldhuizen, 2012; Ngo et al., 2012, 2014; Li et al., 2016; Zhao et al., 2018; Aberger et al., 2017). In theory, reducing the time complexity of evaluation algorithms has been the main goal (Yannakakis, 1981; Ngo et al., 2014; Gottlob et al., 2016; Abo Khamis et al., 2017). This also applies to the counting problem (Joglekar et al., 2016), which we study in this paper. Given a join query $Q$ and a database $D$, an answer of $Q$ in $D$ is a mapping from the free variables of $Q$ to attribute values in $D$ (or inversely, as stated in (Focke et al., 2022)) constrained by the relations in $Q$. For example, if a database $D$ of a social network has a binary relation $R(A,B)$ of friendship and a tuple $(a,b)\in R(A,B)$ indicates that two people $a,b$ are friends, finding all triples $(a,b,c)$ of people who are all friends is equivalent to finding the answers of the triangle join query ($R(B,C)$ and $R(A,C)$ are renamed relations of $R(A,B)$):

(1) $Q(A,B,C) = R(A,B) \wedge R(B,C) \wedge R(A,C)$
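As a concrete illustration (the data below is our own toy example, not from the paper), the answers of this triangle query can be enumerated directly:

```python
# Toy instance of the triangle query Q(A,B,C) = R(A,B) ∧ R(B,C) ∧ R(A,C);
# R(B,C) and R(A,C) are renamed copies of the same friendship relation R.
R = {(1, 2), (2, 1), (1, 3), (3, 1), (2, 3), (3, 2), (3, 4), (4, 3)}

answers = [(a, b, c)
           for (a, b) in R                           # bind (A, B) from R(A, B)
           for c in {y for (x, y) in R if x == b}    # extend via (B, C) ∈ R(B, C)
           if (a, c) in R]                           # check (A, C) ∈ R(A, C)

# People 1, 2, 3 are pairwise friends, so all 6 orderings of (1, 2, 3) appear.
```

Here OUT = 6: each unordered triangle of friends contributes one answer per ordering of its three members.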

The complexity of evaluating join queries is commonly expressed as $\tilde{O}(\text{IN}^{w}+\text{OUT})$, where IN is the input size and $w\,(\geq 1)$ is a width of $Q$ (see section 2 for the definition of IN and example widths). While acyclic queries can be evaluated with $w=1$ (Yannakakis, 1981), cyclic queries have higher $w$ values, i.e., they are more difficult to solve. If $w$ is the fractional edge cover number $\rho$ (see Definition 1), $\text{IN}^{w}$ becomes $AGM$, the AGM bound (Atserias et al., 2013), which is an upper bound of OUT. Instead of evaluating join queries, the counting problem is to compute $|Q|$, the join size or the number of answers (e.g., the triples above). It has its own applications, such as answering COUNT queries in a DBMS. For counting, the complexity can be reduced from $\tilde{O}(\text{IN}^{w}+\text{OUT})$ to $\tilde{O}(\text{IN}^{w})$ (Joglekar et al., 2016).

The approximate counting problem is to approximate $|Q|$. Approximate algorithms are used whenever exact counting is expensive and an approximation suffices. In practice, a prominent application is cost-based query optimization, which requires efficient approximation for thousands of sub-queries (Leis et al., 2015). In theory, going beyond polynomial-time algorithms, approximate algorithms have established complexities inversely proportional to OUT, under the requirement of guaranteeing an arbitrarily small error with high probability (see section 2 for formal definitions). In a limited setting where every relation in $Q$ is binary and identical (e.g., as in (1)), the complexity of $\tilde{O}(\text{IN}^{w})$ has been reduced to $\tilde{O}\big(\frac{\text{IN}^{w}}{\text{OUT}}\big)$ (Assadi et al., 2019; Fichtenberger et al., 2020). However, in a general setting where $Q$ contains higher-arity relations, the complexity has been reduced only to $\tilde{O}\big(\frac{\text{IN}^{w+1}}{\text{OUT}}\big)$ or $\tilde{O}\big(\frac{\text{IN}^{w}}{\text{OUT}}+\text{IN}\big)$ (Chen and Yi, 2020), with an additional multiplicative or additive factor of IN. Furthermore, all these methods consider $w=\rho$ only (i.e., $\tilde{O}\big(\frac{\text{IN}^{w}}{\text{OUT}}\big)$ = $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$) and do not extend to other widths, such as the fractional hypertree width (see Definition 3), which is smaller than $\rho$. In this paper, we propose a new method that achieves $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ complexity for $Q$ containing relations of any arity, under the same $w=\rho$. As a corollary, ours achieves a linear or even constant complexity for a sufficiently large OUT.
We also extend our method to incorporate the fractional hypertree width by using generalized hypertree decompositions (GHDs), achieving a lower complexity than $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ when OUT is small.

A closely related problem to approximate counting is uniform sampling, which has its own important applications, such as training set generation for machine learning (Zhao et al., 2018). Our method also supports uniform sampling over joins. We define the problems and technical concepts in section 2, explain related work in section 3, and give an overview of our results in section 4. We present our method for size estimation and uniform sampling over joins in section 5, including a unified framework for previous work; extend our method using GHDs in section 6; propose optimization techniques in section 7; and discuss future work in section 8. More details of existing algorithms and the extension of our algorithms to join-project (a.k.a. conjunctive) queries are presented in the appendix.

2. Technical Background

Hypergraph. A join query $Q$ is commonly represented as a hypergraph $\mathcal{H}(\mathcal{V},\mathcal{E})$, where the set of hypernodes $\mathcal{V}$ is the set of variables in $Q$ and the set of hyperedges satisfies $\mathcal{E}\subseteq 2^{\mathcal{V}}$ (Ngo et al., 2014). Each hyperedge $F\in\mathcal{E}$ is a subset of $\mathcal{V}$ and specifies the relation $R(F)$ (or $R_F$ for brevity) in $Q$. For example, the query in (1) has $\mathcal{V}=\{A,B,C\}$ and $\mathcal{E}=\{\{A,B\},\{B,C\},\{A,C\}\}$. For any $I\subset\mathcal{V}$, we define two notations used throughout the paper: $\mathcal{E}_{I}\coloneqq\{F\in\mathcal{E} \mid F\cap I\neq\emptyset\}$ and $\mathcal{E}_{I,\pi}\coloneqq\{\pi_{I}F \mid F\in\mathcal{E}_{I}\}$, i.e., the set of hyperedges in $\mathcal{E}$ that intersect $I$, and the projection of each edge in $\mathcal{E}_{I}$ onto $I$, respectively.

In relational algebra, $Q$ is equivalent to $\Join_{F\in\mathcal{E}}R_{F} = R_{F_{1}}\Join\cdots\Join R_{F_{|\mathcal{E}|}}$. Here, join ($\Join$) is a commutative binary operator between relations, allowing us to write $\Join_{F\in\mathcal{E}}$. Table 1 summarizes the operators used in this paper.

Table 1. Operators used in the paper.
$\Join$: natural join operator
$\ltimes$: semi-join operator
$\pi$: projection operator
$\sigma$: selection operator
$\uplus$: list concatenation operator
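To make the operators concrete, here is a minimal set-semantics sketch in Python, with relations represented as lists of attribute-to-value dicts (our own illustrative representation, not the paper's index-based one):

```python
# Minimal versions of the operators in Table 1 over relations as lists of dicts.

def join(R, S):                       # natural join ⋈
    return [{**r, **s} for r in R for s in S
            if all(r[a] == s[a] for a in r.keys() & s.keys())]

def semijoin(R, S):                   # semi-join ⋉: tuples of R that join with S
    return [r for r in R
            if any(all(r[a] == s[a] for a in r.keys() & s.keys()) for s in S)]

def project(attrs, R):                # projection π (set semantics)
    seen = {tuple(sorted((a, r[a]) for a in attrs)) for r in R}
    return [dict(t) for t in seen]

def select(pred, R):                  # selection σ
    return [r for r in R if pred(r)]

R = [{"A": 1, "B": 2}, {"A": 1, "B": 3}]
S = [{"B": 2, "C": 5}]
```

For example, `join(R, S)` yields the single tuple `{"A": 1, "B": 2, "C": 5}`, while `semijoin(R, S)` keeps only the first tuple of `R`.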

Since all variables in $\mathcal{V}$ are free (output) variables in join queries, join queries are subsumed by full conjunctive queries. For the more general class of join-project (a.k.a. conjunctive) queries, the set of output variables is a subset of $\mathcal{V}$. We assume $Q$ is a join query and extend to join-project queries in appendix C.

Input and output size. We denote $\text{IN}(\mathcal{H},D) = \max_{F\in\mathcal{E}}|R_{F}|$ as the input size ($|R_{F}|$ is the number of tuples in $R_{F}$) and $\text{OUT}(\mathcal{H},D)=|Q|$, the number of query answers, as the output size of $Q$ (or $\mathcal{H}$) in $D$. We drop $\mathcal{H}$ and $D$ if the context is clear, here and in upcoming notations.

Complexity of join. The time complexity of join evaluation algorithms is commonly expressed as $\tilde{O}(\text{IN}^{w}+\text{OUT})$, where $w$ is a width of $Q$ and $\tilde{O}$ hides a polylogarithmic factor of IN (Ngo et al., 2014). We assume that $Q$ is extremely small compared to $D$ and regard $|\mathcal{V}|$ and $|\mathcal{E}|$ as constants, as in (Ngo et al., 2014; Assadi et al., 2019; Chen and Yi, 2020). Worst-case optimal (WCO) algorithms such as GenericJoin (Ngo et al., 2014) (Algorithm 4) achieve $w=\rho(\mathcal{H},D)$, called the fractional edge cover number:

Definition 1.

(Ngo et al., 2014): The fractional edge cover number $\rho(\mathcal{H},D)$ is the optimal solution of the linear program (LP): $\min_{\{x_{F}|F\in\mathcal{E}\}} \sum_{F\in\mathcal{E}} x_{F}\cdot\log_{\text{IN}}|R_{F}|$ s.t. $\sum_{F\ni v}x_{F}\geq 1$ for all $v\in\mathcal{V}$, and $0\leq x_{F}\leq 1$ for all $F\in\mathcal{E}$.

It is well known that $\prod_{F\in\mathcal{E}}|R_{F}|^{x_{F}}$ is an upper bound of OUT for any fractional edge cover $x$ satisfying the constraints of the LP (Atserias et al., 2013; Friedgut and Kahn, 1998). The optimal value, $\text{IN}^{\rho}=\min_{x}\prod_{F\in\mathcal{E}}|R_{F}|^{x_{F}}$, is called the AGM bound of $\mathcal{H}$ on $D$, denoted as $AGM(\mathcal{H},D)$ (Atserias et al., 2013). For example, the query in (1) has $\text{IN}=|R_{\{A,B\}}|=|R_{\{B,C\}}|=|R_{\{A,C\}}|$ and $\rho=3/2$ from the optimal solution $x_{\{A,B\}}=x_{\{B,C\}}=x_{\{A,C\}}=1/2$ of the LP in Definition 1. Therefore, its AGM bound is $\text{IN}^{\rho}=\text{IN}^{3/2}$.
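A quick numeric sanity check of this bound on a toy instance (our own data): for the triangle query on a complete directed graph, OUT indeed stays below $\text{IN}^{3/2}$.

```python
import itertools

# Verify OUT ≤ AGM = IN^(3/2) for the triangle query, using the half-weight
# fractional edge cover x_{A,B} = x_{B,C} = x_{A,C} = 1/2 (toy K6 digraph).
V = range(1, 7)
R = {(a, b) for a in V for b in V if a != b}          # IN = |R| = 30
OUT = sum(1 for a, b, c in itertools.product(V, repeat=3)
          if (a, b) in R and (b, c) in R and (a, c) in R)
AGM = len(R) ** 1.5                                   # |R|^(1/2) · |R|^(1/2) · |R|^(1/2)
assert OUT <= AGM                                     # 120 ≤ 30^(3/2) ≈ 164.3
```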

Generalized hypertree decomposition. As a special case, the Yannakakis algorithm (Yannakakis, 1981) achieves $w=1$ if $Q$ is acyclic. It builds a join tree by assigning each hyperedge $F\in\mathcal{E}$ to a tree node and performs a dynamic programming (DP)-like join over the nodes' outputs (i.e., the $R_{F}$'s) in a bottom-up manner.

To bridge the gap between $w=\rho$ and $w=1$ when $Q$ is cyclic, generalized hypertree decompositions (GHDs, see Definition 2) (Gottlob et al., 2002) extend the join tree to arbitrary joins by assigning multiple hyperedges to a tree node, where the DP-like join is performed over these nodes' outputs (i.e., the answers of the corresponding sub-queries) computed by WCO algorithms. This way, $w$ decreases from $\rho$ to the fractional hypertree width $fhtw$ (see Definition 3). We refer to a comprehensive survey (Ngo et al., 2014) for more details about these concepts and to appendix A for our explanations of the algorithms.

Definition 2.

(Gottlob et al., 2002): A GHD of $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is a pair $(\mathcal{T},\chi)$ where 1) $\mathcal{T}=(\mathcal{V}_{\mathcal{T}},\mathcal{E}_{\mathcal{T}})$ is a tree, 2) $\chi:\mathcal{V}_{\mathcal{T}}\rightarrow 2^{\mathcal{V}}$, 3) for each $F\in\mathcal{E}$, there exists $t\in\mathcal{V}_{\mathcal{T}}$ s.t. $F\subseteq\chi(t)$, and 4) for each $v\in\mathcal{V}$, $\{t\in\mathcal{V}_{\mathcal{T}} \mid v\in\chi(t)\}$ forms a non-empty connected subtree of $\mathcal{T}$.

Here, $\chi(t)$ is called the bag of $t$, and 4) is called the running intersection property. Each node $t$ corresponds to a sub-hypergraph $\mathcal{H}_{\chi(t)}$ of $\mathcal{H}$, where $\mathcal{H}_{\chi(t)}\coloneqq(\chi(t),\mathcal{E}_{\chi(t),\pi})$. We then define the fractional hypertree width (Grohe and Marx, 2014).

Definition 3.

(Grohe and Marx, 2014): The fractional hypertree width 1) of a GHD $(\mathcal{T},\chi)$ of $\mathcal{H}$, $fhtw(\mathcal{T},\mathcal{H})$, is defined as $\max_{t\in\mathcal{V}_{\mathcal{T}}}\rho(\mathcal{H}_{\chi(t)})$, and 2) of $\mathcal{H}$, $fhtw(\mathcal{H})$, is defined as $\min_{(\mathcal{T},\chi)}fhtw(\mathcal{T},\mathcal{H})$.

Given a GHD $(\mathcal{T},\chi)$, GHDJoin (Joglekar et al., 2016) performs GenericJoin for each $\mathcal{H}_{\chi(t)}$ ($t\in\mathcal{V}_{\mathcal{T}}$) and Yannakakis (Yannakakis, 1981) on these results along the join tree (see Algorithms 5-7 in sections A.2-A.3). Since Yannakakis runs in linear, i.e., $\tilde{O}(\text{IN}+\text{OUT})$, time for $\alpha$-acyclic join queries (Brault-Baron, 2016), and its input size after GenericJoin is $\tilde{O}\big(\text{IN}^{\max_{t\in\mathcal{V}_{\mathcal{T}}}\rho(\mathcal{H}_{\chi(t)})}\big)$, the total runtime of GHDJoin is $\tilde{O}(\text{IN}^{fhtw(\mathcal{T},\mathcal{H})}+\text{OUT})$. It is sufficient to execute a brute-force algorithm to find $fhtw(\mathcal{H})$, since the number of possible GHDs is bounded by a function of the query size.

Counting. For the counting problem, the complexity is reduced to $\tilde{O}(\text{IN}^{w})$, without the $+\text{OUT}$ term. This is a direct result of Joglekar et al. (Joglekar et al., 2016), who solve the more general problem of evaluating aggregation queries (e.g., count, sum, max). We explain this in detail in section 6.1.

Approximate counting. Approximate counting achieves even smaller complexities. The primary goal of approximate counting, especially using sampling, is to obtain an arbitrarily small error with high probability, formally defined as follows (Assadi et al., 2019; Chen and Yi, 2020):

Definition 4.

(Assadi et al., 2019): For a given $\mathcal{H}$, error bound $\epsilon\in(0,1)$, and probability threshold $\delta\in(0,1)$, if $P(|Z-\text{OUT}|\geq\epsilon\,\text{OUT})\leq\delta$ for a random variable $Z$ approximating OUT, then $Z$ is a $(1\pm\epsilon)$-approximation of OUT.

The challenge is how far we can reduce the complexity while achieving the above goal. Assume that we approximate OUT using an unbiased estimator with a random variable $Y$, i.e., $\mathbb{E}[Y]=\text{OUT}$, where $\mathbb{E}[Y]$ is the expectation of $Y$. Assume that the variance $\mathbb{V}ar[Y]$ of $Y$ is bounded and an upper bound $U_{var}$ is known, and that the time complexity of instantiating (or computing) $Y$ is upper-bounded by $U_{time}$. Then, we can trade off approximation accuracy and efficiency by controlling the number of sampling runs $N$. Formally, let $Z=(Y_{1}+Y_{2}+\cdots+Y_{N})/N$, where every $Y_{i}$ ($1\leq i\leq N$) is a random variable equivalent to $Y$. Then, $\mathbb{E}[Z]=\mathbb{E}[Y]=\text{OUT}$ and $\mathbb{V}ar[Z]=\mathbb{V}ar[Y]/N$. Hence, by setting $N=\frac{U_{var}}{\epsilon^{2}\delta\,\text{OUT}^{2}}$, we can achieve the following from Chebyshev's inequality:

(2) $P(|Z-\text{OUT}|\geq\epsilon\,\text{OUT}) \leq \frac{\mathbb{V}ar[Z]}{\epsilon^{2}\,\mathbb{E}[Z]^{2}} = \frac{\mathbb{V}ar[Y]}{N\epsilon^{2}\,\text{OUT}^{2}} \leq \frac{U_{var}}{N\epsilon^{2}\,\text{OUT}^{2}} = \delta$

Since the complexity of computing each $Y_{i}$ is bounded by $U_{time}$, the complexity of computing $Z$ is bounded by $N\cdot U_{time}=\frac{U_{var}U_{time}}{\epsilon^{2}\delta\,\text{OUT}^{2}}$. Therefore, the key challenge is to implement an unbiased estimator with $U_{var}U_{time}$ as small as possible. $\epsilon$ and $\delta$ are regarded as constants, as in $\tilde{O}$ (Assadi et al., 2019; Chen and Yi, 2020). One can easily replace $\frac{1}{\delta}$ in $N$ with $\log\frac{1}{\delta}$ using the median trick (Jerrum et al., 1986).
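The averaging argument can be sketched with a toy unbiased estimator (all names and numbers below are illustrative, not the paper's estimator):

```python
import random
import statistics

# Toy unbiased estimator: OUT is a sum of per-tuple counts; Y picks one count
# uniformly and rescales, so E[Y] = OUT. Averaging N i.i.d. copies yields Z
# with Var[Z] = Var[Y]/N, as used in the Chebyshev bound (2).
degrees = [1, 9, 2, 8, 5, 5, 3, 7]
OUT = sum(degrees)                                   # 40
n = len(degrees)

# Exact unbiasedness: averaging Y over all equally likely choices gives OUT.
assert sum(n * d for d in degrees) / n == OUT

U_var = n ** 2 * statistics.pvariance(degrees)       # Var[Y] = 464 here
eps, delta = 0.1, 0.1
N_needed = U_var / (eps ** 2 * delta * OUT ** 2)     # ≈ 290 runs suffice by (2)

rng = random.Random(0)
N = 4000                                             # extra runs for a tight estimate
Z = sum(n * rng.choice(degrees) for _ in range(N)) / N
assert abs(Z - OUT) < eps * OUT                      # well within the ε-error band
```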

Uniform sampling. If every query answer has the same probability $p$ of being sampled, then the sampler is a uniform sampler (Chen and Yi, 2020; Fichtenberger et al., 2020). Since there are $|Q|=\text{OUT}$ answers, $p$ can be at most $1/\text{OUT}$. If $p<1/\text{OUT}$, the sampling may fail (to sample any of the query answers) with probability $1-p\cdot\text{OUT}$.

For acyclic queries, uniform sampling can be done in $\tilde{O}(1)$ time after $\tilde{O}(\text{IN})$-time preprocessing (Zhao et al., 2018), by additionally storing the intermediate join sizes of tuples during the bottom-up DP in Yannakakis (see section A.2 for more details).
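For intuition, here is a minimal sketch of this idea for the two-relation acyclic join $R(A,B)\Join S(B,C)$ (toy data and helper names are ours):

```python
import bisect
import itertools
import random
from collections import defaultdict

# Preprocessing (Õ(IN)): for each tuple (a, b) ∈ R, store how many answers
# extend it, i.e., w(b) = |S ⋉ (b)|; prefix sums then allow Õ(1) sampling.
R = [(1, 10), (2, 10), (2, 20)]
S = [(10, 100), (10, 200), (20, 300)]

S_by_b = defaultdict(list)
for b, c in S:
    S_by_b[b].append(c)

weights = [len(S_by_b[b]) for (_, b) in R]        # per-R-tuple answer counts
cum = list(itertools.accumulate(weights))         # prefix sums
OUT = cum[-1]                                      # join size

def sample_answer(rng):
    i = bisect.bisect_right(cum, rng.randrange(OUT))  # pick an R-tuple ∝ weight
    a, b = R[i]
    return (a, b, rng.choice(S_by_b[b]))              # extend uniformly over S ⋉ (b)

# Each answer has probability (w(b)/OUT) · (1/w(b)) = 1/OUT: uniform.
```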

3. Related Work

In the context of approximate counting (and the closely related problem of uniform sampling), there have been two lines of work: 1) determining which classes of queries are tractable and which are intractable, and 2) reducing the degree of the polynomial-time complexity for tractable cases, especially for join or join-project queries.

The first class of work widens the class of queries that admit a fully polynomial-time randomized approximation scheme (FPRAS) and a fully polynomial-time almost uniform sampler (FPAUS) (Arenas et al., 2021). Arenas et al. (Arenas et al., 2021) showed that every class of conjunctive queries with bounded hypertree width admits an FPRAS and an FPAUS. Focke et al. (Focke et al., 2022) extended this to conjunctive queries with disequalities and negations, under some relaxations of FPRAS. These works focus on showing the existence of a polynomial-time algorithm over a wider class of queries, but not on reducing the exact degree of the polynomial complexity.

The second class of work (Assadi et al., 2019; Fichtenberger et al., 2020; Chen and Yi, 2020; Bera and Chakrabarti, 2017; Aliakbarpour et al., 2018; Eden et al., 2020) focuses on reducing the degree for specific classes of queries, and even further on making the complexity inversely proportional to OUT, e.g., $\tilde{O}\big(\frac{\text{IN}^{w}}{\text{OUT}}\big)$. However, they all consider $w=\rho$ only (i.e., $\tilde{O}\big(\frac{\text{IN}^{w}}{\text{OUT}}\big)$ is $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$), and the $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ complexity has been achieved only for limited classes of queries. Assadi et al. (Assadi et al., 2019) proposed a Simple Sublinear-Time Estimator (SSTE for short) with the complexity of $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ for unlabeled graph queries, i.e., where $Q$ contains identical binary relations only, as in the query in (1). However, we found that a missing factor in the proof by Assadi et al. (Assadi et al., 2019) prevented SSTE from satisfying the $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ bound, and we notified the authors so that the proof could be modified (see appendix E). They also proposed some conjectures on labeled graphs (i.e., removing the identity assumption), but 1) no concrete algorithms were given, and 2) the proposed bound for labeled graphs is larger than $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ (see section D.2). Therefore, it is not known whether these methods can be extended to labeled graphs within $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ time. Fichtenberger et al. (Fichtenberger et al., 2020) extended SSTE to sampling Subgraphs Uniformly in Sublinear Time (SUST for short) for uniform sampling while achieving the same complexity. Here, sublinear refers to the assumption that OUT is as large as $\Omega(\text{IN}^{\rho-1})$; unlike join processing, these methods need not scan the inputs for each query, which would require $\tilde{O}(\text{IN})$ time.

Kim et al. (Kim et al., 2021) proposed Alley, a hybrid method that combines sampling and synopses. We analyze the sampling part of Alley, since analyzing the synopsis is out of scope. Alley solves the approximate subgraph counting problem (but not uniform sampling) for labeled graphs, where the relations in $Q$ are binary but not necessarily identical. Its complexity, however, is $\tilde{O}(AGM)$ instead of $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$; we explain why in section 5.

For a more general setting where relations in $Q$ can have higher arities, Chen & Yi (Chen and Yi, 2020) proposed GJ-Sample, which also performs uniform sampling and achieves $\tilde{O}\big(\frac{\text{IN}\cdot AGM}{\text{OUT}}\big)$ complexity. The additional factor of IN compared to SSTE and SUST results from computing sample probabilities (or weights) prior to sampling, where the number of candidates is $\tilde{O}(\text{IN})$. We explain this in detail in section 5. Therefore, they do not achieve $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ complexity for arbitrary join queries and leave it as an open problem (Chen and Yi, 2020). Note that removing the IN factor is important, for IN can be higher than $\frac{AGM}{\text{OUT}}$ when relation sizes are highly skewed. For example, for a triangle query $\mathcal{H}$ with $\mathcal{E}=\{F_{1},F_{2},F_{3}\}$, $|R_{F_{1}}|=N^{10}$, $|R_{F_{2}}|=|R_{F_{3}}|=N$, and $\text{OUT}=N$, IN is $N^{10}$ while $\frac{AGM}{\text{OUT}}$ is $\frac{N^{2}}{N}=N$ for $\rho=\frac{1}{5}$.
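A quick numeric check of this skewed example, with $N=10$ for concreteness (the candidate covers below are the extreme points of the triangle's LP):

```python
import math

# Skewed triangle: F1 = {A,B} is huge, F2 = {B,C} and F3 = {A,C} are small.
N = 10
size = {"F1": N ** 10, "F2": N, "F3": N}
covers = [{"F1": 0.5, "F2": 0.5, "F3": 0.5},   # symmetric half-weight cover
          {"F1": 0.0, "F2": 1.0, "F3": 1.0},   # drop the huge relation entirely
          {"F1": 1.0, "F2": 0.0, "F3": 1.0},
          {"F1": 1.0, "F2": 1.0, "F3": 0.0}]
AGM = min(math.prod(size[F] ** x[F] for F in size) for x in covers)
IN, OUT = max(size.values()), N

assert AGM == N ** 2                            # optimal cover avoids F1: AGM = N·N
assert math.isclose(math.log(AGM, IN), 1 / 5)   # ρ = log_IN AGM = 1/5
assert IN > AGM / OUT                           # N^10 ≫ AGM/OUT = N
```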

We note that Deng et al. (Deng et al., 2023) have independently pursued the same primary objective as ours, which is to attain $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ for uniform sampling and size estimation over joins; both papers have been accepted at this PODS. The authors recursively 1) partition the input space into a constant number of sub-spaces and 2) sample a sub-space, to reduce the number of sample candidates from $\tilde{O}(\text{IN})$ to $\tilde{O}(1)$. Our approach differs from theirs in that we neither compute probabilities for candidates prior to sampling nor modify the set of candidates. Instead, we exploit known probability distributions to sample a candidate.

4. Our Results

We first analyze existing sampling-based approximation algorithms in a new unified framework in section 5. We then present a new algorithm that bounds $U_{var}U_{time}$ by $\tilde{O}(AGM\cdot\text{OUT})$ for arbitrary join queries for the first time, achieving the complexity of $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ (see section 2). To avoid the pre-sampling overheads of GJ-Sample, we propose degree-based rejection sampling (DRS), which first samples a sample space from a meta sample space and then samples a value from the sampled space, where the value can be rejected based on its degree. We explain the details in section 5. The results of Chen & Yi then directly hold for DRS, and the removed overhead reduces the complexity to $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ for arbitrary joins. We also extend DRS using GHDs in section 6. The following theorems state our results, which we prove in the subsequent sections.

Theorem 1.

DRS performs a $(1\pm\epsilon)$-approximation of OUT in $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ time.

Theorem 2.

If OUT is small enough, DRS with GHDs can perform a $(1\pm\epsilon)$-approximation of OUT with a lower complexity than $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$. A sufficient condition is that OUT is $\tilde{O}(\text{IN}^{\rho-fhtw})$.

Additionally, we extend Alley to support arbitrary join queries in section 5, and a core lemma of SSTE/SUST to hold for labeled graphs in appendix D. We discuss our extension to join-project (a.k.a. conjunctive) queries in appendix C, and the extension of the $\tilde{O}(1)$-time uniform sampler (Zhao et al., 2018) of section 2 to support cyclic queries using our framework in section A.2.

5. Achieving the $\tilde{O}\big(\frac{AGM}{\text{OUT}}\big)$ Bound

5.1. Generic Sampling-based Framework

We can classify existing sampling-based methods into three groups: variable-at-a-time (a.k.a. vertex-at-a-time) (Kim et al., 2021; Chen and Yi, 2020), edge-at-a-time (Li et al., 2016), and component-at-a-time (Assadi et al., 2019; Fichtenberger et al., 2020). The first/second group samples a variable/hyperedge at a time until every variable/hyperedge in $\mathcal{H}$ is sampled, while the third group samples a component, either a cycle with an odd number of edges or a star, at a time, having a larger sample granularity than the other two groups.

To analyze these groups in a unified manner, we propose a powerful sampling-based estimation framework, GenericCardEst (Algorithm 1). It returns an unbiased estimate of $\text{OUT}(\mathcal{H}_{s})$, where $\mathcal{H}_{s}$ is the residual hypergraph given a current sample $s$, which is initially empty. Formally, $\mathcal{H}_{s}\coloneqq\,\Join_{F\in\mathcal{E}_{\mathcal{O}}}\pi_{\mathcal{O}}(R_{F}\ltimes s)$ and $\mathcal{H}_{\emptyset}=\mathcal{H}$ for $s=\emptyset$.

Input: query hypergraph $\mathcal{H}$, variables $\mathcal{O}$ to sample, and current sample $s$
1 if $\mathcal{O}=\emptyset$ then
2     return 1
3 $I\leftarrow$ GetNonEmptySubset($\mathcal{O}$)  /* $I\neq\emptyset$ */
4 $\Omega_{I},k\leftarrow$ GetSampleSpaceAndSize($I,\mathcal{H},s$)
5 $P\leftarrow$ GetDistribution($\Omega_{I}$)  /* $P(\perp)=1-\sum_{s_{I}\in\Omega_{I}}P(s_{I})$ */
6 $S_{I}\leftarrow$ SampleFrom($\Omega_{I},k,P$)  /* $|S_{I}|=k$ */
7 $S_{I},\,k(=|S_{I}|),\,\{P(s_{I})\mid s_{I}\in S_{I}\}\leftarrow$ RejectAndAdjust($S_{I},P,\mathcal{H},s$)
8 return $\frac{1}{k}\sum_{s_{I}\in S_{I}}\frac{1}{P(s_{I})}\cdot\mathbb{I}[s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]\cdot$ GenericCardEst($\mathcal{H},\mathcal{O}\setminus I,s\uplus s_{I}$)
Algorithm 1 GenericCardEst($\mathcal{H}(\mathcal{V},\mathcal{E}),\mathcal{O},s$)

An invariant is that $s\in\,\Join_{F\in\mathcal{E}_{\mathcal{V}\setminus\mathcal{O}}}\pi_{\mathcal{V}\setminus\mathcal{O}}R_{F}$, i.e., $s$ is an answer to the sub-query of $Q$ on the bound variables $\mathcal{V}\setminus\mathcal{O}$. This is key to proving Lemma 1. Due to space limitations, we prove all lemmas and propositions in appendix E.

Lemma 1.

GenericCardEst returns an unbiased estimate of $|\Join_{F\in\mathcal{E}}R_{F}|$ for $\mathcal{O}=\mathcal{V}$ and $s=\emptyset$.

We now explain each line of Algorithm 1. If there is no variable to sample for, we just return one (Lines 1-2). Otherwise, Line 3 takes a subset $I$ of $\mathcal{O}$ to sample at this step. $I$ is 1) a singleton set for variable-at-a-time methods, 2) an edge or its projection for edge-at-a-time methods, or 3) a part of a component for component-at-a-time methods. Line 4 defines the sample space $\Omega_{I}$ and sample size $k$. Line 5 computes $P$, a probability distribution over $\Omega_{I}\cup\{\perp\}$; $\perp$ is a null tuple that does not belong to any join answer. Line 6 samples $S_{I}$ from $\Omega_{I}$, where $|S_{I}|=k$. Line 7 optionally rejects samples in $S_{I}$. Here, $S_{I}$, $k$, and $\{P(s_{I}) \mid s_{I}\in S_{I}\}$ are adjusted according to the accepted samples; $P(s_{I})$ is multiplied by the probability of accepting $s_{I}$. The recursive call to GenericCardEst in Line 8 returns an unbiased estimate of the residual query $\mathcal{H}_{s\uplus s_{I}}$. The unbiasedness of the final return value in Line 8 is guaranteed by the inverse probability factor, $\frac{1}{P(s_{I})}\mathbb{I}[\cdot]$, which is the key idea of the Horvitz-Thompson estimator (Horvitz and Thompson, 1952). $\mathbb{I}[c]$ denotes the indicator, which is 1 if $c$ is true and 0 otherwise.
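To make the framework concrete, here is a runnable sketch of a variable-at-a-time instance for the triangle query, with $\Omega_{I}=\cap_{F}\pi_{I}(R_{F}\ltimes s)$, uniform $P$, and sampling without replacement (the data and helper names are ours; since every value in the intersection joins, the indicator is always 1 here, and with $k=|\Omega_{I}|$ the estimate is exact):

```python
import random

# Triangle query: three renamed copies of one relation T over variables A, B, C.
E = {"AB": ("A", "B"), "BC": ("B", "C"), "AC": ("A", "C")}
T = {(x, y) for x in range(4) for y in range(4) if x != y} - {(0, 1), (1, 0)}
R = {F: T for F in E}

def omega(v, s):
    """Ω_{v} = ∩_F π_{v}(R_F ⋉ s) over edges F containing variable v."""
    sets = []
    for F, attrs in E.items():
        if v not in attrs:
            continue
        i = attrs.index(v)
        other = attrs[1 - i]
        if other in s:                 # semi-join with the bound value, then project
            sets.append({t[i] for t in R[F] if t[1 - i] == s[other]})
        else:
            sets.append({t[i] for t in R[F]})
    out = sets[0]
    for x in sets[1:]:
        out &= x
    return sorted(out)

def card_est(order, s, k, rng):
    if not order:                      # O = ∅: s is a full answer
        return 1.0
    v, rest = order[0], order[1:]
    Om = omega(v, s)
    if not Om:
        return 0.0
    kk = len(Om) if k is None else min(k, len(Om))
    S = Om if kk == len(Om) else rng.sample(Om, kk)   # without replacement
    p = 1.0 / len(Om)                  # uniform P(s_v); every s_v joins, so 𝕀 = 1
    return sum(card_est(rest, {**s, v: sv}, k, rng) / p for sv in S) / kk

exact = card_est(("A", "B", "C"), {}, None, None)     # k = |Ω|: exact count
```

With `k=1`, each call returns one random Horvitz-Thompson estimate whose expectation is the true count.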

5.2. Query Model

GenericCardEst relies on a certain query model used in previous WCO join and sampling algorithms (Ngo et al., 2014; Veldhuizen, 2012; Ngo et al., 2012; Assadi et al., 2019; Chen and Yi, 2020), following the property testing model (Goldreich, 2017). To avoid any confusion with the join query $Q$, we refer to the queries here as operations. For a relation $R_{F}$, the query model allows $\pi_{I}(R_{F}\ltimes s)$ in GenericCardEst to be readily obtained in $\tilde{O}(1)$ time. For brevity, we explain the underlying index structures that enable this later, in section 5.6.

The same indexes provide the following $\tilde{O}(1)$-time operations as well. Here, $T$ can be a projection/selection of another relation, which is also a relation.

  • Degree($T,s$): $|T\ltimes s|$ for a relation $T$ and a tuple $s$

  • Access($T,I,i$): the $i$-th element of $\pi_{I}(T)$ for a relation $T$ and attributes $I$

  • Exist($T,r$): tests whether or not a row $r$ exists in $T$

  • Sample($T$): uniformly samples a tuple $r$ from $T$

Eden et al. (Eden et al., 2017) developed their algorithms using operations on binary tables. The above four operations are generalizations of theirs to $n$-ary tables.
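A simplified Python sketch of such an index for an $n$-ary relation (illustrative only; for clarity it uses plain scans in places where the actual indexes precompute sorted projections and degree counters to answer in $\tilde{O}(1)$ time):

```python
import random

class RelationIndex:
    """Toy index over an n-ary relation supporting the four operations.
    Degree and Access scan here for simplicity; real indexes precompute
    the needed projections and counts offline in linear time."""

    def __init__(self, attrs, tuples):
        self.attrs = list(attrs)
        self.tuples = sorted(set(tuples))
        self.rows = set(self.tuples)

    def _matches(self, t, s):          # s binds a subset of attributes to values
        return all(t[self.attrs.index(a)] == v for a, v in s.items())

    def degree(self, s):               # Degree(T, s) = |T ⋉ s|
        return sum(1 for t in self.tuples if self._matches(t, s))

    def access(self, I, i):            # Access(T, I, i): i-th element of π_I(T)
        proj = sorted({tuple(t[self.attrs.index(a)] for a in I)
                       for t in self.tuples})
        return proj[i]

    def exist(self, r):                # Exist(T, r)
        return tuple(r) in self.rows

    def sample(self, rng):             # Sample(T): uniform tuple from T
        return self.tuples[rng.randrange(len(self.tuples))]
```

For example, on $R(A,B,C)=\{(1,2,3),(1,2,4),(2,2,3)\}$, `degree({"A": 1})` returns 2 and `access(("A","B"), 0)` returns `(1, 2)`.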

5.3. State-of-the-art Sampling-based Estimators

We explain state-of-the-art sampling-based estimators as instances of GenericCardEst.

WanderJoin (Li et al., 2016) is a practical edge-at-a-time method. The edges in $\mathcal{E}$ are first ordered by an index $i\in[1,|\mathcal{E}|]$. At the $i$-th step, $I=F_{i}\cap\mathcal{O}$, $\Omega_{I}=\pi_{I}(R_{F_{i}}\ltimes s)$, $k=1$, and $P(s_{I})=\frac{1}{|\Omega_{I}|}$. That is, one tuple $s_{I}$ is uniformly sampled from $\Omega_{I}$. Note that $\mathcal{O}$ can reduce to $\emptyset$ before proceeding to the last edge, e.g., for the triangle query. However, (Li et al., 2016) first samples for every edge and only then checks the join predicates, resulting in unnecessary computation.
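A minimal sketch of a WanderJoin-style random walk for the triangle query with edge order $(\{A,B\},\{B,C\},\{A,C\})$ (toy data and names are ours; since $\mathcal{O}=\emptyset$ after the second edge, the last edge is only checked, not sampled):

```python
import random

R = {(x, y) for x in range(5) for y in range(5) if x != y}   # shared relation, |R| = 20
R_list = sorted(R)

def wander_once(rng):
    a, b = R_list[rng.randrange(len(R_list))]       # uniform over R(A,B): p1 = 1/|R|
    cand = sorted(c for (b2, c) in R if b2 == b)    # Ω = π_C(R(B,C) ⋉ (b))
    if not cand:
        return 0.0                                  # dead-end walk estimates 0
    c = cand[rng.randrange(len(cand))]              # p2 = 1/|Ω|
    hit = (a, c) in R                               # O = ∅ here: just check R(A,C)
    return len(R_list) * len(cand) * hit            # 1/(p1·p2) · 𝕀[answer]

rng = random.Random(7)
est = sum(wander_once(rng) for _ in range(5000)) / 5000   # E[·] = OUT = 60
```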

Alley+. Alley (Kim et al., 2021) is a variable-at-a-time method (i.e., $|I|=1$) and assumes that every $R_{F}$ is binary. We here explain Alley+, our extension of Alley to $n$-ary relations: $\Omega_{I}=\cap_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$, $k=\lceil b\cdot|\Omega_{I}|\rceil$ for some fixed $b\in(0,1]$, and $P(s_{I})=\frac{1}{|\Omega_{I}|}$. Here, the sampling is done without replacement. Due to the heavy intersection ($\cap$) operation, the runtime of Alley+ is $\tilde{O}(AGM(\mathcal{H}))$ (Proposition 6). In the extreme case of $b=1$, Alley+ becomes an instance of GenericJoin.

SSTE (Assadi et al., 2019) and SUST (Fichtenberger et al., 2020) are component-at-a-time methods. Assuming that every $R_{F}$ is binary and identical, Lemma 8 in appendix D decomposes $\mathcal{H}$ into a set of components, where each component is either an odd cycle or a star. SSTE and SUST differ slightly in sampling odd cycles and stars, and SUST performs uniform sampling over joins. We explain the details in appendix D due to space limitations.

GJ-Sample (Chen and Yi, 2020) is a variable-at-a-time method (i.e., |I|=1|I|=1). Here, ΩI=πI(RFGJs)\Omega_{I}=\pi_{I}(R_{F^{GJ}}\ltimes s) where FGJ=argminFI|πI(RFs)|F^{GJ}=\operatorname*{arg\,min}_{F\in\mathcal{E}_{I}}|\pi_{I}(R_{F}\ltimes s)|. This avoids the intersection operation in Alley+, allowing GJ-Sample to achieve O~(AGM()OUT)\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)} runtime for sampling instead of O~(AGM())\tilde{O}(AGM(\mathcal{H})) (Chen and Yi, 2020). k=1k=1 and P(sI)=AGM(ssI)AGM(s)P(s_{I})={\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}} where AGM(s)=F𝒪|RFs|xFAGM(\mathcal{H}_{s})=\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}} and AGM(ssI)=F𝒪I|RF(ssI)|xFAGM(\mathcal{H}_{s\uplus s_{I}})=\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|R_{F}\ltimes(s\uplus s_{I})|^{x_{F}}, where xx is a fractional edge cover of the original query \mathcal{H} (Chen and Yi, 2020). Then, due to Lemma 7 in appendix B, sIΩIP(sI)1\sum_{s_{I}\in\Omega_{I}}P(s_{I})\leq 1.

More importantly, setting P(sI)=AGM(ssI)AGM(s)P(s_{I})={\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}} enables GJ-Sample to perform uniform sampling over QQ, with 1AGM()\frac{1}{AGM(\mathcal{H})} probability for each answer of QQ. Let ss be a final sample that reaches Line 1, and s1,s2,,sn\mathcal{H}_{s_{1}},\mathcal{H}_{s_{2}},...,\mathcal{H}_{s_{n}} be the series of hypergraphs ssI\mathcal{H}_{s\uplus s_{I}} at Line 1. Then, AGM(sn)=1AGM(\mathcal{H}_{s_{n}})=1 from 𝒪==\mathcal{E}_{\mathcal{O}}=\mathcal{E}_{\emptyset}=\emptyset for sn\mathcal{H}_{s_{n}}, and

(3) P(s)=AGM(s1)AGM()AGM(sn)AGM(sn1)=AGM(sn)AGM()=1AGM().\begin{split}P(s)=\frac{AGM(\mathcal{H}_{s_{1}})}{AGM(\mathcal{H})}...\frac{AGM(\mathcal{H}_{s_{n}})}{AGM(\mathcal{H}_{s_{n-1}})}=\frac{AGM(\mathcal{H}_{s_{n}})}{AGM(\mathcal{H})}=\frac{1}{AGM(\mathcal{H})}.\end{split}

Since |Q|=OUT|Q|=\text{OUT}, a call to GenericCardEst succeeds in sampling an answer of QQ with probability OUTAGM()\frac{\text{OUT}}{AGM(\mathcal{H})} and fails with probability 1OUTAGM()1-\frac{\text{OUT}}{AGM(\mathcal{H})}. Note that if P(sI)P(s_{I}) is set to OUT(ssI)/OUT(s)\text{OUT}(\mathcal{H}_{s\uplus s_{I}})/\text{OUT}(\mathcal{H}_{s}), then P(s)=1OUTP(s)=\frac{1}{\text{OUT}}, so every call to GenericCardEst will succeed. However, computing OUT(ssI)\text{OUT}(\mathcal{H}_{s\uplus s_{I}}) for every sIΩIs_{I}\in\Omega_{I} is intractable, which is why GJ-Sample uses AGM(ssI)AGM(s){\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}}, which can be computed in O~(1)\tilde{O}(1) time.

However, computing AGM(ssI)AGM(s){\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}} values at Line 1 takes O~(|ΩI|)=O~(|πI(RFGJs)|)=O~(IN)\tilde{O}(|\Omega_{I}|)=\tilde{O}(|\pi_{I}(R_{F^{GJ}}\ltimes s)|)=\tilde{O}(\text{IN}) time. This results in an IN factor in GJ-Sample’s runtime, O~(INAGM()OUT)\tilde{O}\big{(}\frac{\text{IN}\cdot AGM(\mathcal{H})}{\text{OUT}}\big{)} (Chen and Yi, 2020). In order to remove this IN factor, Chen & Yi (Chen and Yi, 2020) separated out sequenceable queries, where all P(sI)P(s_{I}) values required in the sampling procedure can be precomputed in O~(IN)\tilde{O}(\text{IN}) time. This results in O~(IN+AGM()OUT)\tilde{O}\big{(}\text{IN}+\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)} runtime for sequenceable queries and still O~(INAGM()OUT)\tilde{O}\big{(}\frac{\text{IN}\cdot AGM(\mathcal{H})}{\text{OUT}}\big{)} for non-sequenceable queries. Determining whether a query is sequenceable takes O~(1)\tilde{O}(1) time using a brute-force algorithm, since the search space is bounded by the query size (Chen and Yi, 2020).
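For intuition about the quantities AGM(s)AGM(\mathcal{H}_{s}) used above, the sketch below computes F|RFs|xF\prod_{F}|R_{F}\ltimes s|^{x_{F}} for a toy triangle query under the optimal cover xF=1/2x_{F}=1/2; the relations and helper names are our own illustrative assumptions.

```python
# AGM(H_s) = ∏_{F} |R_F ⋉ s|^{x_F} for a fractional edge cover x of the
# *original* query, as used in GJ-Sample's acceptance probabilities.
R = ("A", "B"), {(1, 2), (1, 3), (2, 3)}
S = ("B", "C"), {(2, 3), (3, 1), (3, 4)}
T = ("C", "A"), {(3, 1), (1, 1), (4, 2)}
COVER = [(R, 0.5), (S, 0.5), (T, 0.5)]  # optimal cover of the triangle

def semijoin_size(rel, s):
    """|R_F ⋉ s|: rows agreeing with the partial tuple s on shared attributes."""
    schema, rows = rel
    return sum(all(row[schema.index(a)] == v
                   for a, v in s.items() if a in schema) for row in rows)

def agm(s):
    result = 1.0
    for rel, x in COVER:
        result *= semijoin_size(rel, s) ** x
    return result

# Empty partial tuple: AGM = (|R||S||T|)^{1/2} = 27^{1/2}, an upper bound on
# OUT = 3; binding A = 1 shrinks the residual bound accordingly.
print(agm({}), agm({"A": 1}))
```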

5.4. Beyond the State of the Art with Degree-based Rejection Sampling

We propose degree-based rejection sampling (DRS) to avoid computing AGM(ssI)AGM(s){\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}} values in GJ-Sample while achieving a similar P(sI)P(s_{I}), only a constant factor smaller. While we set |I|=1|I|=1 and k=1k=1 as in GJ-Sample, we sample FF^{*} (a counterpart of FGJF^{GJ} in GJ-Sample) uniformly from I\mathcal{E}_{I} and let ΩI=πI(RFs)\Omega_{I}=\pi_{I}(R_{F^{*}}\ltimes s). Therefore, {πI(RFs)|FI}\{\pi_{I}(R_{F}\ltimes s)\,|\,F\in\mathcal{E}_{I}\} is our meta sample space. For each sIΩIs_{I}\in\Omega_{I} and FIF\in\mathcal{E}_{I}, we define the relative degree of sIs_{I} in RFsR_{F}\ltimes s as rdegF,s(sI)|RF(ssI)||RFs|rdeg_{F,s}(s_{I})\coloneqq\frac{|R_{F}\ltimes(s\uplus s_{I})|}{|R_{F}\ltimes s|}.

In order to use P(sI)=rdegF,s(sI)P(s_{I})=rdeg_{F^{*},s}(s_{I}) in Line 1, we 1) uniformly sample a row tt from RFsR_{F^{*}}\ltimes s and 2) let sI=πIts_{I}=\pi_{I}t in Line 1, without computing P(sI)P(s_{I}) for every sIΩIs_{I}\in\Omega_{I}. Then, rdegF,s(sI)rdeg_{F,s}(s_{I}) for every FIF\in\mathcal{E}_{I} is computed, and sIs_{I} is rejected if FargmaxFIrdegF,s(sI)F^{*}\neq\operatorname*{arg\,max}_{F\in\mathcal{E}_{I}}rdeg_{F,s}(s_{I}) (Line 1). We first assume that one FF has a higher rdegF,s(sI)rdeg_{F,s}(s_{I}) than all other edges in I\mathcal{E}_{I}. Then, every P(sI)P(s_{I}) is multiplied by 1|I|\frac{1}{|\mathcal{E}_{I}|} at Line 1, resulting in P(sI)=1|I|rdegF,s(sI)P(s_{I})=\frac{1}{|\mathcal{E}_{I}|}rdeg_{F^{*},s}(s_{I}). Line 1 further uses p=AGM(ssI)rdegF,s(sI)AGM(s)p=\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{rdeg_{F^{*},s}(s_{I})\cdot AGM(\mathcal{H}_{s})} (1\leq 1 from Lemma 2) as the keeping probability of sIs_{I} to make the final P(sI)=1|I|AGM(ssI)AGM(s)P(s_{I})=\frac{1}{|\mathcal{E}_{I}|}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}.

Lemma 2.

AGM(ssI)AGM(s)maxFIrdegF,s(sI)=rdegF,s(sI)\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}\leq\max_{F\in\mathcal{E}_{I}}rdeg_{F,s}(s_{I})=rdeg_{F^{*},s}(s_{I}) if sIFIπI(RFs)s_{I}\in\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s).

From (3), any final sample ss that reaches Line 1 is sampled with probability P(s)=1AGM()I𝒱,|I|=11|I|P(s)=\frac{1}{AGM(\mathcal{H})}\prod_{I\subset\mathcal{V},|I|=1}\frac{1}{|\mathcal{E}_{I}|}, indicating a uniform sampling over QQ. Note that the added term I1|I|\prod_{I}\frac{1}{|\mathcal{E}_{I}|} is regarded as a constant since it depends only on the query size. If we relax our initial assumption and allow m>1m>1 edges in I\mathcal{E}_{I} to have the same maximum relative degree, the P(sI)P(s_{I}) above increases by a factor of mm. Therefore, we simply decrease the keeping probability pp by a factor of mm to preserve P(sI)=1|I|AGM(ssI)AGM(s)P(s_{I})=\frac{1}{|\mathcal{E}_{I}|}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}.

Example 0.

Let NN and MM be two arbitrary numbers. Let R(A,B)={(i,1)|i[1,M]}{(1,j)|j[2,N]}R(A,B)=\{(i,1)\,|\,i\in[1,M]\}\cup\{(1,j)\,|\,j\in[2,N]\} be a binary relation and T(A,C)T(A,C) is obtained by renaming BB to CC in RR. Suppose we have a join query Q(A,B,C)=R(A,B)T(A,C)Q(A,B,C)=R(A,B)\Join T(A,C). For ease of explanation, we use a hyperedge FF and its corresponding relation RFR_{F} interchangeably. Then, we have an optimal fractional edge cover of QQ as xR=xT=1x_{R}=x_{T}=1. We choose I={A}I=\{A\}, I={B}I=\{B\}, and I={C}I=\{C\} in turn for DRS. For I={A}I=\{A\} and s=s=\emptyset, I={R,T}\mathcal{E}_{I}=\{R,T\}, so we sample a relation from I\mathcal{E}_{I}. Assume that RR is sampled, and then a row t=(1,1)t=(1,1) is sampled from RR. Then, sI=1s_{I}=1 and rdegR,s(sI)=NN+M1=rdegT,s(sI)rdeg_{R,s}(s_{I})=\frac{N}{N+M-1}=rdeg_{T,s}(s_{I}), so RR is not rejected. Since RR and TT tie w.r.t. sIs_{I}, we keep sIs_{I} with probability p=121rdegR,s(sI)AGM(ssI)AGM(s)=121rdegR,s(sI)N2(N+M1)2p=\frac{1}{2}\frac{1}{rdeg_{R,s}(s_{I})}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}=\frac{1}{2}\frac{1}{rdeg_{R,s}(s_{I})}\frac{N^{2}}{(N+M-1)^{2}}. Then, P(sI)=P(s{A})=rdegR,s(sI)121rdegR,s(sI)N2(N+M1)2P(s_{I})=P(s_{\{A\}})=rdeg_{R,s}(s_{I})\frac{1}{2}\frac{1}{rdeg_{R,s}(s_{I})}\frac{N^{2}}{(N+M-1)^{2}} =12N2(N+M1)2=\frac{1}{2}\frac{N^{2}}{(N+M-1)^{2}}. Next, for I={B}I=\{B\} and s={A=1}s=\{A=1\}, I={R}\mathcal{E}_{I}=\{R\}, and rdegR,s(sI)=1Nrdeg_{R,s}(s_{I})=\frac{1}{N} for any sIπI(Rs)s_{I}\in\pi_{I}(R\ltimes s). Then, P(s{B})=1|I|AGM(ssI)AGM(s)=NN2=1NP(s_{\{B\}})=\frac{1}{|\mathcal{E}_{I}|}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}=\frac{N}{N^{2}}=\frac{1}{N}. Similarly, P(s{C})=1NP(s_{\{C\}})=\frac{1}{N} for I={C}I=\{C\}. 
In total, a final sample ss that reaches Line 1 of GenericCardEst has P(s)=P(s{A})P(s{B})P(s{C})=121(N+M1)2=I1|I|1AGM()P(s)=P(s_{\{A\}})P(s_{\{B\}})P(s_{\{C\}})=\frac{1}{2}\frac{1}{(N+M-1)^{2}}=\prod_{I}\frac{1}{|\mathcal{E}_{I}|}\frac{1}{AGM(\mathcal{H})}.
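The arithmetic of the worked example above can be verified numerically for concrete NN and MM; the script below re-traces the probabilities using our own encoding of RR and TT (all helper names are ours).

```python
N, M = 5, 4
R = {(i, 1) for i in range(1, M + 1)} | {(1, j) for j in range(2, N + 1)}
T = set(R)                 # T(A,C) is R with B renamed to C
AGM = len(R) * len(T)      # cover x_R = x_T = 1, so AGM = (N + M - 1)^2

# Step I = {A}, s = ∅, s_I = (A = 1): relative degree in R (same in T).
deg_A1 = sum(1 for (a, _) in R if a == 1)          # = N
rdeg = deg_A1 / len(R)                             # = N / (N + M - 1)
# R and T tie on the maximum relative degree, so p carries an extra 1/2:
p = 0.5 * (1 / rdeg) * (deg_A1 * deg_A1) / AGM
P_A = rdeg * p                                     # = (1/2) N^2 / (N+M-1)^2

# Steps I = {B} and I = {C} given A = 1: E_I is a single edge each, so
# P = AGM(H_{s ⊎ s_I}) / AGM(H_s) = N / N^2 = 1/N.
P_B = P_C = deg_A1 / (deg_A1 * deg_A1)

P_s = P_A * P_B * P_C
print(P_s, 0.5 / AGM)      # ∏_I 1/|E_I| · 1/AGM with ∏_I 1/|E_I| = 1/2
```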

5.5. Unified Analysis

This section analyzes the bounds of variance and runtime (UvarU_{var} and UtimeU_{time} in section 2) of sampling-based estimators in a unified manner. Note that Lemma 1 already states the unbiasedness of estimators. Theorem 1 can be proved from Propositions 4 and 8; UvarUtimeU_{var}U_{time} is O~(AGMOUT)\tilde{O}(AGM\cdot\text{OUT}).

We first define two random variables, 1) ZsZ_{\mathcal{H}_{s}} for the output of GenericCardEst given ss and 2) Z=1iNZiNZ=\sum_{1\leq i\leq N}\frac{Z^{i}_{\mathcal{H}}}{N} for our final estimate. Here, ZiZ^{i}_{\mathcal{H}}’s are independent and identically distributed as ZZ_{\mathcal{H}} =Z=Z_{\mathcal{H}_{\emptyset}}, and NN is the number of initial calls to GenericCardEst with s=s=\emptyset. Let TsT_{\mathcal{H}_{s}} be the random variable for the number of core operations (including the four operations in section 5.2) in GenericCardEst and TT for the total runtime of obtaining ZZ. Then, 𝕍ar[Z]=𝕍ar[Z]N\operatorname*{\mathbb{V}ar}[Z]=\frac{\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]}{N} and 𝔼[T]=N𝔼[T]\operatorname*{\mathbb{E}}[T]=N\cdot\operatorname*{\mathbb{E}}[T_{\mathcal{H}}].

Proposition 1.

𝕍ar[Z]2dAGM()OUT\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq 2^{d}\cdot AGM(\mathcal{H})\cdot\text{OUT} for SSTE and SUST where dd is the number of odd cycles and stars in \mathcal{H}.

Proposition 2.

𝕍ar[Z]t|𝒱|tt1OUT2\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq\frac{t^{|\mathcal{V}|}-t}{t-1}\cdot\text{OUT}^{2} for Alley+ where t=2(1b)bt=\frac{2(1-b)}{b}.

The upper bound UvarU_{var} of Alley+ explains a unique property of Alley+: 𝕍ar[Z]\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}] approaches 0 as bb approaches 1. That is, limb1Uvar=0\lim_{b\to 1}U_{var}=0 since limb1t=0\lim_{b\to 1}t=0. In fact, near b=1b=1, this UvarU_{var} is tighter than the original bound Uorg=b1b(1b|𝒱|1)OUT2U_{org}=\frac{b}{1-b}(\frac{1}{b^{|\mathcal{V}|}}-1)\cdot\text{OUT}^{2} proved in (Kim et al., 2021), where limb1Uorg=limb11+b+b2++b|𝒱|1b|𝒱|1\lim_{b\to 1}U_{org}=\lim_{b\to 1}\frac{1+b+b^{2}+...+b^{|\mathcal{V}|-1}}{b^{|\mathcal{V}|-1}} OUT2=|𝒱|OUT2\text{OUT}^{2}=|\mathcal{V}|\cdot\text{OUT}^{2}.

Proposition 3.

𝕍ar[Z]|𝒱|AGM()OUT\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq|\mathcal{V}|\cdot AGM(\mathcal{H})\cdot\text{OUT} for GJ-Sample.

Proposition 4.

𝕍ar[Z]|𝒱|I|I|AGM()OUT\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq|\mathcal{V}|\cdot\prod_{I}|\mathcal{E}_{I}|\cdot AGM(\mathcal{H})\cdot\text{OUT} for DRS. Note that I|I|\prod_{I}|\mathcal{E}_{I}| is O~(1)\tilde{O}(1).

Proposition 5.

𝔼[T]\operatorname*{\mathbb{E}}[T_{\mathcal{H}}] is O~(1)\tilde{O}(1) for SSTE.

Proposition 6.

𝔼[T]\operatorname*{\mathbb{E}}[T_{\mathcal{H}}] is O~(b|𝒱|AGM())\tilde{O}(b^{|\mathcal{V}|}AGM(\mathcal{H})) for Alley+.

Proposition 7.

For GJ-Sample, TT_{\mathcal{H}} is O~(1)\tilde{O}(1) for sequenceable queries and O~(IN)\tilde{O}(\text{IN}) for non-sequenceable queries.

Proposition 8.

TT_{\mathcal{H}} is O~(1)\tilde{O}(1) for SUST and DRS.

Until now, we have assumed that OUT is given in computing N=Uvarϵ2δOUT2N=\frac{U_{var}}{\epsilon^{2}\delta\text{OUT}^{2}} before invoking GenericCardEst, which is unrealistic since our goal is to estimate OUT itself. To tackle this, we use the geometric search by Assadi et al. (Assadi et al., 2019). They first assume that OUT is as large as AGM()AGM(\mathcal{H}) and run the estimation with a small NN (Assadi et al., 2019). Then, they perform a geometric search on OUT; assume a smaller OUT (dividing the assumed OUT by 2), increase NN, and run the estimation again. They repeat this until the assumed OUT becomes consistent with the algorithm output. While we implicitly assume that OUT>0\text{OUT}>0, we can detect that OUT=0\text{OUT}=0 in real applications whenever the assumed OUT falls below 1, in O~(AGM)\tilde{O}(AGM) total time.

This geometric search adds a constant factor (= 4) to the O~\tilde{O} bound of TT_{\mathcal{H}}, not the logIN\log\text{IN} factor stated by Chen & Yi (Chen and Yi, 2020). Starting from the initial assumption 1AGM()OUT<21\leq\frac{AGM(\mathcal{H})}{\text{OUT}}<2, suppose the geometric search stops when 2kAGM()OUT<2k+12^{k}\leq\frac{AGM(\mathcal{H})}{\text{OUT}}<2^{k+1}. Then, the total sample size used so far is 1ik+12iϵ2δ2k+2ϵ2δ4AGM()ϵ2δOUT\sum_{1\leq i\leq k+1}\frac{2^{i}}{\epsilon^{2}\delta}\leq\frac{2^{k+2}}{\epsilon^{2}\delta}\leq\frac{4AGM(\mathcal{H})}{\epsilon^{2}\delta\text{OUT}}, i.e., only a constant factor (= 4) is introduced.
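A minimal sketch of this accounting, taking ϵ2δ=1\epsilon^{2}\delta=1 and idealizing the consistency check by comparing against the true OUT (both simplifying assumptions of ours; a real run compares against the running estimate), confirms that the doubling rounds never draw more than 4AGM/OUT4\cdot AGM/\text{OUT} samples in total.

```python
def geometric_search(agm, out):
    """Run the doubling rounds of the geometric search and return the total
    number of samples drawn. With ε²δ = 1, round i draws 2^i samples."""
    assumed, total, i = agm, 0, 0
    while True:
        i += 1
        total += 2 ** i            # round i: sample size doubles
        if assumed < 2 * out:      # assumed OUT now consistent with OUT: stop
            return total
        assumed /= 2               # assume a smaller OUT and retry

agm, out = 1 << 20, 37
print(geometric_search(agm, out), 4 * agm / out)
```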

Since GJ-Sample can perform uniform sampling, it instead uses a simpler approach (Chen and Yi, 2020), by repeating the sampling until a constant number cc of trials succeed. Then, the number of total trials becomes a random variable, having AGM()cOUT\frac{AGM(\mathcal{H})\cdot c}{\text{OUT}} as its expectation (Chen and Yi, 2020). Therefore, 𝔼[T]=O~(IN+AGM()cOUT)\operatorname*{\mathbb{E}}[T]=\tilde{O}\big{(}\text{IN}+\frac{AGM(\mathcal{H})\cdot c}{\text{OUT}}\big{)} for sequenceable queries and O~(INAGM()cOUT)\tilde{O}\big{(}\frac{\text{IN}\cdot AGM(\mathcal{H})\cdot c}{\text{OUT}}\big{)} for non-sequenceable queries. Using the same approach, DRS requires I|I|AGM()cOUT\prod_{I}|\mathcal{E}_{I}|\frac{AGM(\mathcal{H})\cdot c}{\text{OUT}} trials in expectation and thus 𝔼[T]=O~(AGM()OUT)\operatorname*{\mathbb{E}}[T]=\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)}, asymptotically faster than GJ-Sample.

Finally, we give a simple explanation of why TT (or 𝔼[T]\operatorname*{\mathbb{E}}[T]) of SSTE, SUST, GJ-Sample, and DRS is inversely proportional to OUT. Intuitively, if OUT is as large as AGM()AGM(\mathcal{H}), the sampled tuples SIS_{I} are highly likely to be actual join answers of QQ. In other words, the join answers are spread over a dense database. A small number of samples then gives us enough information about the distribution of the join answers to estimate OUT accurately. In contrast, if OUT is small, the join answers are skewed and sparsely distributed in certain regions of the database, and most samples are not join answers. Therefore, a small number of samples cannot effectively decrease the uncertainty of estimation, and many more samples may be required for an accurate estimate.

5.6. Underlying Index Structure

We explain the underlying index IndexFIndex_{F} for each relation RFR_{F} to evaluate the operation πI(RFs)\pi_{I}(R_{F}\ltimes s) in O~(1)\tilde{O}(1) time, as mentioned in section 5.2. We define FI=FIF_{I}=F\cap I and Fs=FAttr(s)F_{s}=F\cap Attr(s), where Attr(s)Attr(s) is the set of attributes of ss. Note that FIF_{I} and FsF_{s} are disjoint in GenericCardEst. If Fs={A1,A2,,Am}F_{s}=\{A_{1},A_{2},...,A_{m}\}, then RFs=RFπFs(s)=σA1=πA1sσA2=πA2sσAm=πAms(RF)R_{F}\ltimes s=R_{F}\ltimes\pi_{F_{s}}(s)=\sigma_{A_{1}=\pi_{A_{1}}s}\sigma_{A_{2}=\pi_{A_{2}}s}...\sigma_{A_{m}=\pi_{A_{m}}s}(R_{F}).

A mild assumption in (Ngo et al., 2014) is to build a B-tree-like index (e.g., a trie index in (Aberger et al., 2017)) under a global attribute order over the whole database. Then, if all attributes in FsF_{s} precede all attributes in FIF_{I}, RFsR_{F}\ltimes s is readily available as an index lookup of IndexF(s)=IndexF(πFs(s))Index_{F}(s)=Index_{F}(\pi_{F_{s}}(s)). Furthermore, if no attribute in FF lies between FsF_{s} and FIF_{I}, πI(RFs)\pi_{I}(R_{F}\ltimes s) is an index lookup IndexF(s)Index_{F}(s) up to depth |FI||F_{I}|. To exploit these, the selection of II in each step of GenericCardEst should be consistent with the global order. Instead, we can build |F|!|F|! indexes for RFR_{F}, one for each possible attribute order as in (Aberger et al., 2017), to enable arbitrary selection of II.
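A nested-dictionary sketch of such an index (our own simplification, with hash lookups standing in for the B-tree's logarithmic access) shows how πI(RFs)\pi_{I}(R_{F}\ltimes s) becomes a single lookup when ss binds a prefix of the attribute order.

```python
# A nested-dict trie over the global attribute order (A, B): the first level
# is keyed by A-values, the second by B-values. An index lookup Index_F(s)
# returns the subtrie for the bound prefix.
def build_index(rows):
    index = {}
    for a, b in rows:                   # insert under the order (A, B)
        index.setdefault(a, {})[b] = ()
    return index

R = {(1, 2), (1, 3), (2, 3)}
idx = build_index(R)

# π_{B}(R ⋉ (A = 1)): one lookup for the prefix A = 1, then the keys at
# depth |F_I| = 1 are exactly the projected B-values.
print(sorted(idx[1].keys()))
# π_{A}(R): the lookup with an empty s is the root's keys.
print(sorted(idx.keys()))
```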

Remark. The complexity of building an index for RFR_{F} is linear, i.e., O~(|RF|)\tilde{O}(|R_{F}|) (Aberger et al., 2017). However, in contrast to the pre-computing overheads in GJ-Sample, the indexing occurs once per database instead of per query. Therefore, its complexity is excluded from the analysis of TT_{\mathcal{H}} in section 5.5. Due to their simple structures, the indexes can be easily extended to a dynamic setting where tuples can be inserted and deleted, with constant or logarithmic update time.

6. Achieving a Generalized Bound

In this section, we generalize our O~(AGM()OUT)\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)} bound for join size estimation using generalized hypertree decompositions (GHDs, see section 2) and achieve a better bound than O~(INfhtw())\tilde{O}(\text{IN}^{fhtw(\mathcal{H})}), which is the bound of an exact aggregation algorithm AggroGHDJoin using GHDs (see section A.3). We leave uniform sampling using GHDs as future work in section 8. Our new bound may be better than our previous bound O~(AGM()OUT)\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)}, especially when OUT is small (see Example 2). We first provide background on GHDs and related algorithms.

6.1. Aggregations over GHDs

Instead of performing joins, OUT can be more efficiently computed by solving an AJAR (Aggregations and Joins over Annotated Relations) query (Joglekar et al., 2016). AJAR queries assume that each relation RFR_{F} is annotated; RF={(r,λ(r))}R_{F}=\{(r,\lambda(r))\} where each tuple rr has an annotation λ(r)\lambda(r) (from some domain 𝕂\operatorname*{\mathbb{K}}). If two tuples r1r_{1} and r2r_{2} are joined, the joined tuple r1r2r_{1}\Join r_{2} has λ(r1)λ(r2)\lambda(r_{1})\otimes\lambda(r_{2}) as its annotation. Hence, joining annotated relations results in another annotated relation:

(4) FRF={(r,λ)|r is a join answer,λ=Fλ(πFr)}.\Join_{F\in\mathcal{E}}R_{F}=\{(r,\lambda)\,|\,r\text{ is a join answer},\lambda=\otimes_{F\in\mathcal{E}}\lambda(\pi_{F}r)\}.

Furthermore, we can define an aggregation over an annotated relation RR with attributes 𝒜\mathcal{A} by a pair of attribute AA and sum operator \oplus (Joglekar et al., 2016). Let G=𝒜{A}G=\mathcal{A}\setminus\{A\}. Then, the (A,A,\oplus)-aggregation of RR generates an annotated relation RGR_{G} of attributes GG, where GG is the grouping/output attributes, and λG\lambda_{G} is an aggregated result for rGr_{G}:

(5) A,R={(rG,λG):rGπGR and λG=(r,λ)R:πGr=rGλ}.\sum_{A,\oplus}R=\{(r_{G},\lambda_{G}):r_{G}\in\pi_{G}R\text{ and }\lambda_{G}=\bigoplus_{(r,\lambda)\in R:\pi_{G}r=r_{G}}\lambda\}.

By definition, no tuple rGr_{G} is duplicated in the aggregation result. If multiple aggregation attributes M={A1,A2,,An}M=\{A_{1},A_{2},...,A_{n}\} share the same \oplus operator, we simply write A1,A2,An,\sum_{A_{1},\oplus}\sum_{A_{2},\oplus}...\sum_{A_{n},\oplus} as M,\sum_{M,\oplus}. Here, attributes in MM are marginalized out and removed from the resulting relation. The output attributes GG become 𝒜M\mathcal{A}\setminus M. Altogether, if (𝕂,,\operatorname*{\mathbb{K}},\oplus,\otimes) forms a commutative semiring (Green et al., 2007), we can aggregate over joins (of annotated relations) (Joglekar et al., 2016) as M,FRF\sum_{M,\oplus}\Join_{F\in\mathcal{E}}R_{F}. If 𝕂=\operatorname*{\mathbb{K}}=\operatorname*{\mathbb{Z}}, λ(r)=1rRF\lambda(r)=1\,\forall r\in R_{F}, =+\oplus=+, =×\otimes=\times, and M=𝒱M=\mathcal{V}, the query 𝒱,+FRF\sum_{\mathcal{V},+}\Join_{F\in\mathcal{E}}R_{F} generates an annotated relation with a single tuple rr without any attribute. Here, λ(r)=OUT=|FRF|\lambda(r)=\text{OUT}=\absolutevalue{\Join_{F\in\mathcal{E}}R_{F}}. We hereafter omit \oplus whenever possible without ambiguity. In addition, AJAR queries are known to be more general than FAQ queries (Abo Khamis et al., 2016), since a query can have multiple aggregation operators (Joglekar et al., 2016).
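The counting-semiring instantiation above can be sketched in a few lines; the relations and helper names are illustrative assumptions of ours. Annotations multiply on join (=×\otimes=\times) and add on aggregation (=+\oplus=+), so 𝒱FRF\sum_{\mathcal{V}}\Join_{F\in\mathcal{E}}R_{F} collapses to a single number equal to OUT.

```python
# Annotated relations over the counting semiring (ℤ, +, ×): each base tuple
# carries the annotation λ(r) = 1. R(A, B) and S(B, C) are toy relations.
R = {(1, 2): 1, (1, 3): 1, (2, 3): 1}
S = {(2, 4): 1, (3, 4): 1, (3, 5): 1}

def join(r, s):
    """Join on the shared attribute B; annotations combine with ⊗ = ×."""
    out = {}
    for (a, b), lam1 in r.items():
        for (b2, c), lam2 in s.items():
            if b == b2:
                out[(a, b, c)] = out.get((a, b, c), 0) + lam1 * lam2
    return out

def aggregate_all(rel):
    """Σ_{V, +}: marginalize out every attribute, adding annotations (⊕ = +)."""
    return sum(rel.values())

# Σ_V (R ⋈ S) produces a single annotation equal to OUT = |R ⋈ S|.
print(aggregate_all(join(R, S)))
```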

Joglekar et al. (Joglekar et al., 2016) present AggroGHDJoin and AggroYannakakis to solve the AJAR queries in O~(INw+OUTagg)\tilde{O}(\text{IN}^{w^{*}}+\text{OUT}_{agg}) time, where ww^{*} is the width of their algorithms, and OUTagg\text{OUT}_{agg} is the output size of the aggregation results, which is typically smaller than OUT, e.g., 1 in our case. We prove in section A.3 that wfhtww^{*}\geq fhtw for general AJAR queries and w=fhtww^{*}=fhtw for computing OUT. Since we show in section 6.3 that our new bound using GHDs is smaller than O~(INfhtw)\tilde{O}(\text{IN}^{fhtw}), it is also smaller than O~(INw)\tilde{O}(\text{IN}^{w^{*}}).

6.2. Sampling over GHDs

From now, we use the same setting from section 6.1 to compute the AJAR query 𝒱FRF\sum_{\mathcal{V}}\Join_{F\in\mathcal{E}}R_{F}. For a set of grouping attributes GG, GroupByCardEst (Algorithm 2) returns an approximate answer to the AJAR query G𝒱G=𝒱GFRF\mathcal{H}^{G}\coloneqq\sum_{\mathcal{V}\setminus G}\mathcal{H}=\sum_{\mathcal{V}\setminus G}\Join_{F\in\mathcal{E}}R_{F}, i.e., an annotated relation {(g,Z[g])|(g,λ(g))G}\{(g,Z[g])|(g,\lambda(g))\in\mathcal{H}^{G}\} (Lines 2-2); λ(g)\lambda(g) and Z[g]Z[g] represent the exact and approximate value of OUT(g)\text{OUT}(\mathcal{H}_{g}), where g\mathcal{H}_{g} is the residual hypergraph of \mathcal{H} given gg (see section 5.1). Therefore, (g,λ(g))Gλ(g)=OUT()\sum_{(g,\lambda(g))\in\mathcal{H}^{G}}\lambda(g)=\text{OUT}(\mathcal{H}).

Input: Hypergraph \mathcal{H} and a set of attributes G𝒱G\subseteq\mathcal{V}
1
2RR\leftarrow\emptyset
/* compute G𝒱G\mathcal{H}^{G}\coloneqq\sum_{\mathcal{V}\setminus G}\mathcal{H}, skip any duplicated gg */
3 foreach (gGenericJoin(,G,)g\in\textsc{{GenericJoin}}(\mathcal{H},G,\emptyset)) do
       Z[g]GenericCardEst(,𝒱G,g)Z[g]\leftarrow\textsc{{GenericCardEst}}(\mathcal{H},\mathcal{V}\setminus G,g) /* approximation of the output size of the residual query given gg */
4      
5      RR{(g,Z[g])}R\leftarrow R\cup\{(g,Z[g])\}
return RR
Algorithm 2 GroupByCardEst ((𝒱,),G\mathcal{H}(\mathcal{V},\mathcal{E}),G)

GHDCardEst (Algorithm 3) is our entry function given a GHD (𝒯,χ\mathcal{T},\chi). For each node t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}}, we define t(χ(t),χ(t),π)\mathcal{H}_{t}\coloneqq(\chi(t),\mathcal{E}_{\chi(t),\pi}), i.e., the sub-query of \mathcal{H} on tt (Line 3), and G(t)G(t) as the set of grouping attributes of tt, determined as the shared attributes across the nodes (Line 3). GroupByCardEst at Line 3 takes this G(t)G(t) as the grouping attributes GG and returns an annotated relation RtR_{t} {(gt,Z[gt])|(gt,λ(gt))tG(t)}\coloneqq\{(g_{t},Z[g_{t}])|(g_{t},\lambda(g_{t}))\in\mathcal{H}^{G(t)}_{t}\}, where tG(t)=χ(t)G(t)t\mathcal{H}^{G(t)}_{t}=\sum_{\chi(t)\setminus G(t)}\mathcal{H}_{t}. Therefore, RtR_{t} is an approximation of the aggregation of t\mathcal{H}_{t} with output attributes G(t)G(t). Finally, SimpleAggroYannakakis at Line 3 takes these RtR_{t}’s and runs a simple version of AggroYannakakis (see Algorithm 5 in section A.2) to compute G(𝒯)t𝒱𝒯Rt\sum_{G(\mathcal{T})}\Join_{t\in\mathcal{V}_{\mathcal{T}}}R_{t} for G(𝒯)t𝒱𝒯G(t)G(\mathcal{T})\coloneqq\cup_{t\in\mathcal{V}_{\mathcal{T}}}G(t), the union of grouping attributes of all GHD nodes. G(𝒯)G(\mathcal{T}) is used as the join attributes between RtR_{t}’s in Line 3. Lemma 3 states the unbiasedness of GHDCardEst.

Input: Query hypergraph \mathcal{H} and a GHD (𝒯\mathcal{T}, χ\chi) of \mathcal{H}
1
2S𝒯S_{\mathcal{T}}\leftarrow\emptyset
3foreach (t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}}) do
4       t(χ(t),χ(t),π)\mathcal{H}_{t}\leftarrow(\chi(t),\mathcal{E}_{\chi(t),\pi})
5      G(t){vχ(t)|t𝒱𝒯 s.t., tt,vχ(t)}G(t)\leftarrow\{v\in\chi(t)|\exists t^{\prime}\in\mathcal{V}_{\mathcal{T}}\text{ s.t., }t^{\prime}\neq t,v\in\chi(t^{\prime})\}
      RtGroupByCardEst(t,G(t))R_{t}\leftarrow\textsc{{GroupByCardEst}}(\mathcal{H}_{t},G(t)) /* approximate annotated relation of tG(t)χ(t)G(t)t\mathcal{H}^{G(t)}_{t}\coloneqq\sum_{\chi(t)\setminus G(t)}\mathcal{H}_{t} */
6      
7      S𝒯S𝒯RtS_{\mathcal{T}}\leftarrow S_{\mathcal{T}}\cup R_{t}
return SimpleAggroYannakakis((𝒯,χ),S𝒯)\textsc{{SimpleAggroYannakakis}}((\mathcal{T},\chi),S_{\mathcal{T}})
Algorithm 3 GHDCardEst((𝒱,)\mathcal{H}(\mathcal{V},\mathcal{E}), (𝒯(𝒱𝒯,𝒯)\mathcal{T}(\mathcal{V}_{\mathcal{T}},\mathcal{E}_{\mathcal{T}}), χ\chi))
Lemma 3.

GHDCardEst is an unbiased estimator of |FRF|\absolutevalue{\Join_{F\in\mathcal{E}}R_{F}}.

Example 0.

We use \mathcal{H} and 𝒯2\mathcal{T}_{2} in Figure 1 as an example. Here, G(t0)={X,X},G(t1)={X}G(t_{0})=\{X,X^{\prime}\},G(t_{1})=\{X\}, and G(t2)={X}G(t_{2})=\{X^{\prime}\} are grouping attributes. (d) shows examples of annotated relations RtR_{t}’s, which are the outputs of GroupByCardEst on the three nodes. Since we allow duplicate tuples in any base relation RF:FR_{F}:F\in\mathcal{E}, annotations for Rt0R_{t_{0}} can be larger than one, even if χ(t0)\chi(t_{0}) is covered by a single hyperedge and χ(t0)=G(t0)\chi(t_{0})=G(t_{0}). (e) and (f) show the join and aggregation on RtR_{t}’s performed in SimpleAggroYannakakis. Note that the annotations are multiplied in joins, and added in aggregations. The final result, 8450, is an approximation of OUT.

Refer to caption
Figure 1. Example of GHDCardEst using GHD 𝒯2\mathcal{T}_{2} in (c). Attributes in G(t)G(t) are underlined in (c).

The key idea of GHDCardEst is to push down the partial sum χ(t)G(t)\sum_{\chi(t)\setminus G(t)} from 𝒱\sum_{\mathcal{V}} to each node tt in computing RtR_{t}, similar to the key idea of AggroGHDJoin, which pushes down the aggregation attributes in the GHD as deeply as possible. However, it is slightly different since we shrink each relation obtained from a node before the join phase of AggroYannakakis; AggroYannakakis aggregates after joining each pair of parent-child relations in the GHD.
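The push-down can be illustrated on a two-node decomposition (a toy instance of ours): pre-aggregating each node to its grouping attribute BB and then joining and adding annotations yields the same OUT as aggregating after the full join.

```python
from collections import defaultdict

R = {(1, 2), (1, 3), (2, 3), (4, 3)}   # node t1: χ(t1) = {A, B}, G(t1) = {B}
S = {(2, 4), (3, 4), (3, 5)}           # node t2: χ(t2) = {B, C}, G(t2) = {B}

def group_count(rows, key_pos):
    """Σ_{χ(t)∖G(t)}: count rows per grouping value (counting semiring)."""
    counts = defaultdict(int)
    for row in rows:
        counts[row[key_pos]] += 1
    return counts

# Push the partial sums into each node, then join on G(T) = {B} and add.
r_b = group_count(R, 1)                # B → #rows of R with that B-value
s_b = group_count(S, 0)
pushed = sum(r_b[b] * s_b[b] for b in r_b.keys() & s_b.keys())

# Direct computation of OUT (aggregate after the full join), for reference.
direct = sum(1 for (a, b) in R for (b2, c) in S if b == b2)
print(pushed, direct)
```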

We also argue that our definition of G(t)G(t) is minimal; if an attribute in G(t)G(t) is excluded from RtR_{t}, the node tt loses a join relationship with another node. Our G(t)G(t) also minimizes the runtime of a non-sampling procedure, GenericJoin at Line 2 of Algorithm 2. Its runtime O~(AGM(tG(t)))\tilde{O}(AGM(\mathcal{H}^{G(t)}_{t})) increases with the output attributes G(t)G(t) due to the node-monotone property of the AGM bound (Joglekar et al., 2016).

6.3. Analysis of GHDCardEst

Using the analyses in section 5.5 as building blocks, we analyze the variance and runtime of GHDCardEst. For each (gt,Z[gt])Rt(g_{t},Z[g_{t}])\in R_{t} for a GHD node tt, we attach two additional annotations 𝕍ar[Z[gt]]\operatorname*{\mathbb{V}ar}[Z[g_{t}]] and T[gt]T[g_{t}] (the runtime of obtaining Z[gt]Z[g_{t}]), and let t,gt\mathcal{H}_{t,g_{t}} (or simply gt\mathcal{H}_{g_{t}}) denote the residual hypergraph of t\mathcal{H}_{t} given gtg_{t} in Line 2 of Algorithm 2. Then, Z[gt]Z[g_{t}] is an unbiased estimate of λ(gt)=OUT(gt)\lambda(g_{t})=\text{OUT}(\mathcal{H}_{g_{t}}) from Lemma 1. We also define (f,Z[f])(f,Z[f]) as an annotated tuple in t𝒱𝒯Rt\Join_{t\in\mathcal{V}_{\mathcal{T}}}R_{t} so Z[f]=t𝒱𝒯Z[πG(t)f]Z[f]=\prod_{t\in\mathcal{V}_{\mathcal{T}}}Z[\pi_{G(t)}f].

Recall that our primary goal for join size estimation is to perform (1±ϵ1\pm\epsilon)-approximation of OUT. To use Chebyshev’s inequality, we should bound 𝕍ar[Z]\operatorname*{\mathbb{V}ar}[Z] by ϵ2δOUT2\epsilon^{2}\delta\text{OUT}^{2} as in section 5.5. Then the following questions arise: 1) Can we make this inequality hold for any ϵ\epsilon and δ\delta? 2) How would the runtime TT be expressed? To answer these questions, we first express ZZ using Z[f]Z[f] values.

(6) Z=(f,Z[f])t𝒱𝒯RtZ[f].Z=\sum_{(f,Z[f])\in\Join_{t\in\mathcal{V}_{\mathcal{T}}}R_{t}}Z[f].

We arbitrarily order 𝒱𝒯\mathcal{V}_{\mathcal{T}} and regard each ff as an ordered set {πG(t)f|t𝒱𝒯}\{\pi_{G(t)}f|t\in\mathcal{V}_{\mathcal{T}}\}. Since Z[gt]Z[g_{t}] values for gtfg_{t}\in f are mutually independent, we have

(7) 𝔼[Z[f]]=𝔼[gtfZ[gt]]=gtf𝔼[Z[gt]],\begin{split}\operatorname*{\mathbb{E}}[Z[f]]&=\operatorname*{\mathbb{E}}\Big{[}\prod_{g_{t}\in f}Z[g_{t}]\Big{]}=\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]],\end{split}
(8) 𝕍ar[Z[f]]=𝕍ar[gtfZ[gt]]=gtf(𝕍ar[Z[gt]]+𝔼[Z[gt]]2)gtf𝔼[Z[gt]]2.\begin{split}\operatorname*{\mathbb{V}ar}&[Z[f]]=\operatorname*{\mathbb{V}ar}\Big{[}\prod_{g_{t}\in f}Z[g_{t}]\Big{]}\\ &=\prod_{g_{t}\in f}(\operatorname*{\mathbb{V}ar}[Z[g_{t}]]+\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2})-\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}.\end{split}
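Equation (8) is the standard variance identity for a product of independent random variables; the sketch below (the discrete distributions are our own toy choices) verifies it exactly on finite value-to-probability maps.

```python
from itertools import product

# Two independent discrete estimators, given as value → probability maps.
Z1 = {0: 0.25, 2: 0.5, 5: 0.25}
Z2 = {1: 0.5, 3: 0.5}

def mean(dist):
    return sum(v * p for v, p in dist.items())

def var(dist):
    m = mean(dist)
    return sum(p * (v - m) ** 2 for v, p in dist.items())

# Exact distribution of the product Z1 * Z2 (by independence).
prod_dist = {}
for (v1, p1), (v2, p2) in product(Z1.items(), Z2.items()):
    prod_dist[v1 * v2] = prod_dist.get(v1 * v2, 0) + p1 * p2

lhs = var(prod_dist)
rhs = (var(Z1) + mean(Z1) ** 2) * (var(Z2) + mean(Z2) ** 2) \
      - mean(Z1) ** 2 * mean(Z2) ** 2
print(lhs, rhs)
```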

While 𝔼[Z[f]]\operatorname*{\mathbb{E}}[Z[f]] can be simply decomposed into gtf𝔼[Z[gt]]\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]], 𝕍ar[Z[f]]\operatorname*{\mathbb{V}ar}[Z[f]] has the internal term 𝔼[Z[gt]]2\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2} which prevents a simple decomposition. For any two different tuples f1,f2tRtf_{1},f_{2}\in\,\Join_{t}R_{t}, Z[f1]Z[f_{1}] and Z[f2]Z[f_{2}] are not independent, since they can have the same sub-tuple, i.e., gtf1g_{t}\in f_{1} and gtf2g_{t}\in f_{2} for some t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}}. Therefore, we have to consider the covariance ov(Z[f1],Z[f2])\operatorname*{\mathbb{C}ov}(Z[f_{1}],Z[f_{2}]) when expanding 𝕍ar[Z]\operatorname*{\mathbb{V}ar}[Z] below.

(9) 𝕍ar[Z]=f𝕍ar[Z[f]]+f1f2ov(Z[f1],Z[f2])\begin{split}\operatorname*{\mathbb{V}ar}[Z]=&\sum_{f}\operatorname*{\mathbb{V}ar}[Z[f]]+\sum_{f_{1}\neq f_{2}}\operatorname*{\mathbb{C}ov}(Z[f_{1}],Z[f_{2}])\end{split}

From the analysis in section 5.5, we can arbitrarily control 𝕍ar[Z[gt]]\operatorname*{\mathbb{V}ar}[Z[g_{t}]] and T[gt]T[g_{t}] with the sample size, under a condition that 𝕍ar[Z[gt]]T[gt]=O~(AGM(gt)OUT(gt))\operatorname*{\mathbb{V}ar}[Z[g_{t}]]\cdot T[g_{t}]=\tilde{O}(AGM(\mathcal{H}_{g_{t}})\cdot\text{OUT}(\mathcal{H}_{g_{t}})). In particular, we set 𝕍ar[Z[gt]]\operatorname*{\mathbb{V}ar}[Z[g_{t}]] = O~(OUT(gt)2)\tilde{O}(\text{OUT}(\mathcal{H}_{g_{t}})^{2}) and T[gt]T[g_{t}] = O~(AGM(gt)OUT(gt))\tilde{O}\big{(}\frac{AGM(\mathcal{H}_{g_{t}})}{\text{OUT}(\mathcal{H}_{g_{t}})}\big{)}.

Lemma 4.

𝕍ar[Z[f]]\operatorname*{\mathbb{V}ar}[Z[f]] = O~(𝔼[Z[f]]2)\tilde{O}(\operatorname*{\mathbb{E}}[Z[f]]^{2}) if 𝕍ar[Z[gt]]\operatorname*{\mathbb{V}ar}[Z[g_{t}]] = O~(𝔼[Z[gt]]2)\tilde{O}(\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}) = O~(OUT(gt)2)\tilde{O}(\text{OUT}(\mathcal{H}_{g_{t}})^{2}) for every gtfg_{t}\in f.

Lemma 5.

ov(Z[f1],Z[f2])=O~(gtf1𝔼[Z[gt]]gtf2𝔼[Z[gt]])\operatorname*{\mathbb{C}ov}(Z[f_{1}],Z[f_{2}])=\tilde{O}\big{(}\prod_{g_{t}\in f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\cdot\prod_{g_{t}\in f_{2}}\\ \operatorname*{\mathbb{E}}[Z[g_{t}]]\big{)} if the same condition of Lemma 4 holds for f1f_{1} and f2f_{2}.

Now we are ready to answer our first question. By applying Lemmas 4 and 5 to (9), we have

(10) 𝕍ar[Z]=fO~(gtf𝔼[Z[gt]]2)+f1f2O~(gtf1𝔼[Z[gt]]gtf2𝔼[Z[gt]])=O~(fgtf𝔼[Z[gt]]2+f1f2gtf1𝔼[Z[gt]]gtf2𝔼[Z[gt]])=O~((fgtf𝔼[Z[gt]])2)=O~(OUT2).\begin{split}&\operatorname*{\mathbb{V}ar}[Z]=\sum_{f}\tilde{O}\Big{(}\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\Big{)}\\ &\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;+\sum_{f_{1}\neq f_{2}}\tilde{O}\Big{(}\prod_{g_{t}\in f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\prod_{g_{t}\in f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\Big{)}\\ &=\tilde{O}\Big{(}\sum_{f}\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}+\sum_{f_{1}\neq f_{2}}\prod_{g_{t}\in f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\prod_{g_{t}\in f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\Big{)}\\ &=\tilde{O}\Big{(}\Big{(}\sum_{f}\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]]\Big{)}^{2}\Big{)}=\tilde{O}(\text{OUT}^{2}).\end{split}

The last equality holds from fgtf𝔼[Z[gt]]=f𝔼[Z[f]]=𝔼[Z]=OUT\sum_{f}\prod_{g_{t}\in f}\operatorname*{\mathbb{E}}[Z[g_{t}]]=\sum_{f}\operatorname*{\mathbb{E}}[Z[f]]=\operatorname*{\mathbb{E}}[Z]=\text{OUT}. As a result, 𝕍ar[Z]=O~(OUT2)\operatorname*{\mathbb{V}ar}[Z]=\tilde{O}(\text{OUT}^{2}).

We can make 𝕍ar[Z]\operatorname*{\mathbb{V}ar}[Z] arbitrarily small, even smaller than the desired ϵ2δOUT2\epsilon^{2}\delta\text{OUT}^{2}, by setting the constant factor of 𝕍ar[Z[gt]]=O~(𝔼[Z[gt]]2)\operatorname*{\mathbb{V}ar}[Z[g_{t}]]=\tilde{O}(\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}) arbitrarily small for every gtg_{t}. We omit the details of setting these constants, which are trivial.

Proposition 9.

If 𝕍ar[Z[gt]]\operatorname*{\mathbb{V}ar}[Z[g_{t}]] approaches 0 for every gtg_{t}, then 𝕍ar[Z]\operatorname*{\mathbb{V}ar}[Z] approaches 0.
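To make the constant setting above concrete, the following minimal Python sketch (our illustration, not the paper's estimator) applies Chebyshev's inequality: averaging n ≥ σ²/(ϵ²δOUT²) i.i.d. unbiased trials pushes the variance below ϵ²δOUT² and hence the failure probability of a (1±ϵ)-approximation below δ. The two-point toy estimator and all constants are assumptions for illustration.

```python
import math
import random
import statistics

def required_trials(var_single, out, eps, delta):
    # Chebyshev: Var[mean of n trials] = var_single / n. Requiring
    # var_single / n <= eps^2 * delta * out^2 gives
    # P(|Z - OUT| >= eps * OUT) <= Var / (eps^2 * OUT^2) <= delta.
    return math.ceil(var_single / (eps**2 * delta * out**2))

# toy unbiased estimator of OUT = 100 (a hypothetical single sampling
# trial): returns OUT +/- sqrt(VAR_SINGLE), so its variance is VAR_SINGLE
OUT, VAR_SINGLE = 100.0, 400.0
def trial(rng):
    return OUT + math.sqrt(VAR_SINGLE) * rng.choice([-1.0, 1.0])

eps, delta = 0.1, 0.05
n = required_trials(VAR_SINGLE, OUT, eps, delta)   # 80 trials suffice here
rng = random.Random(0)
reps = 2000
fails = sum(
    abs(statistics.fmean(trial(rng) for _ in range(n)) - OUT) > eps * OUT
    for _ in range(reps)
)
print(n, fails / reps)   # empirical failure rate stays below delta
```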

We next answer our second question. GroupByCardEst for each node tt takes O~(AGM(tG(t)))\tilde{O}(AGM(\mathcal{H}^{G(t)}_{t})) time at Line 2 and gtRtT[gt]\sum_{g_{t}\in R_{t}}T[g_{t}] time at Line 2 of Algorithm 2. SimpleAggroYannakakis at Line 3 of Algorithm 3 takes O~(maxt𝒱𝒯AGM(tG(t)))\tilde{O}(\max_{t\in\mathcal{V}_{\mathcal{T}}}AGM(\mathcal{H}^{G(t)}_{t})) time. Therefore, TT is O~(maxt𝒱𝒯AGM(tG(t))\tilde{O}\Big{(}\max_{t\in\mathcal{V}_{\mathcal{T}}}AGM(\mathcal{H}^{G(t)}_{t}) + t𝒱𝒯(gt,Z[gt])RtT[gt])\sum_{t\in\mathcal{V}_{\mathcal{T}}}\sum_{(g_{t},Z[g_{t}])\in R_{t}}T[g_{t}]\Big{)}. By setting T[gt]=O~(AGM(gt)OUT(gt))T[g_{t}]=\tilde{O}\big{(}\frac{AGM(\mathcal{H}_{g_{t}})}{\text{OUT}(\mathcal{H}_{g_{t}})}\big{)}, TT becomes O~(maxtAGM(tG(t))\tilde{O}\Big{(}\max_{t}AGM(\mathcal{H}^{G(t)}_{t}) + t(gt,Z[gt])RtAGM(gt)OUT(gt))\sum_{t}\sum_{(g_{t},Z[g_{t}])\in R_{t}}\frac{AGM(\mathcal{H}_{g_{t}})}{\text{OUT}(\mathcal{H}_{g_{t}})}\Big{)}.

We now prove that our new bound is smaller than O~(INfhtw())\tilde{O}(\text{IN}^{fhtw(\mathcal{H})}) = O~(maxtAGM(t))\tilde{O}(\max_{t}AGM(\mathcal{H}_{t})); AGM(tG(t))AGM(t)AGM(\mathcal{H}^{G(t)}_{t})\leq AGM(\mathcal{H}_{t}) since AGMAGM is node-monotone, and tgtAGM(gt)OUT(gt)\sum_{t}\sum_{g_{t}}\frac{AGM(\mathcal{H}_{g_{t}})}{\text{OUT}(\mathcal{H}_{g_{t}})} \leq tgtAGM(gt)\sum_{t}\sum_{g_{t}}AGM(\mathcal{H}_{g_{t}}) \leq AGM(t)AGM(\mathcal{H}_{t}) from Lemma 6 in appendix B. Therefore, if OUT is O~(INρfhtw)\tilde{O}(\text{IN}^{\rho-fhtw}), INfhtw\text{IN}^{fhtw} is asymptotically smaller than AGMOUT\frac{AGM}{\text{OUT}} (AGM=INρAGM=\text{IN}^{\rho}), proving Theorem 2. We recommend using GenericCardEst if OUT is expected to be large, and using GHDCardEst if OUT is expected to be small.

Example 0.

We use the \mathcal{H} in Figure 1 without the edge {Z,W}\{Z^{\prime},W^{\prime}\} and assume that every base relation has size NN. Then, our previous bound O~(AGM()OUT)\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)} becomes O~(N3OUT)\tilde{O}\big{(}\frac{N^{3}}{\text{OUT}}\big{)}, from an optimal fractional edge cover xx: xF=0x_{F}=0 for F={X,X}F=\{X,X^{\prime}\} and 0.50.5 otherwise. Since our new bound is smaller than O~(INfhtw())\tilde{O}(\text{IN}^{fhtw(\mathcal{H})}) = O~(N1.5)\tilde{O}(N^{1.5}), it is also smaller than O~(N3OUT)\tilde{O}\big{(}\frac{N^{3}}{\text{OUT}}\big{)} if OUT is asymptotically smaller than N1.5N^{1.5}.
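The numbers in this example can be checked mechanically. The sketch below assumes a concrete edge set for Figure 1 (two triangles bridged by {X,X'}; this reading is our assumption), verifies that the stated x is a fractional edge cover, and evaluates the AGM bound ∏_F |R_F|^{x_F} = N³.

```python
# Assumed edge set (our reading of Figure 1): two triangles on {X,Y,Z}
# and {X',Y',Z'} bridged by {X,X'}, with the edge {Z',W'} removed.
edges = [("X", "Y"), ("Y", "Z"), ("Z", "X"),
         ("Xp", "Yp"), ("Yp", "Zp"), ("Zp", "Xp"), ("X", "Xp")]
x = {F: (0.0 if set(F) == {"X", "Xp"} else 0.5) for F in edges}

# x must be a fractional edge cover: every attribute gets total weight >= 1
attrs = {a for F in edges for a in F}
assert all(sum(w for F, w in x.items() if a in F) >= 1 for a in attrs)

# AGM bound = prod_F |R_F|^{x_F} with |R_F| = N for every edge
N = 10
agm = 1.0
for F in edges:
    agm *= N ** x[F]
print(agm)   # N^3, up to floating-point rounding
```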

7. Optimizations

This section explains two optimization techniques to enhance estimation efficiency or accuracy.

7.1. Increasing Sampling Probabilities

If we focus on join size estimation and do not require uniform sampling, we can further reduce the variance. Since 𝕍ar[Z]=tQ1P(t)OUT2\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]=\sum_{t\in Q}\frac{1}{P(t)}-\text{OUT}^{2} (Kim et al., 2021; Chen and Yi, 2020) for our method in section 5.4, increasing each P(t):tQP(t):t\in Q reduces the variance. First, when mm edges tie at the maximum rdegrdeg in section 5.4, we do not decrease the keeping probability pp by mm times. This increases the final P(t)P(t) by mm times. In fact, we can use any sampled FF^{*} from I\mathcal{E}_{I}; P(sI)P(s_{I}) increases from 1|I|maxFIrdegF,s(sI)AGM(ssI)maxFIrdegF,s(sI)AGM(s)\frac{1}{|\mathcal{E}_{I}|}\cdot\max_{F^{\prime}\in\mathcal{E}_{I}}rdeg_{F^{\prime},s}(s_{I})\cdot\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{\max_{F^{\prime}\in\mathcal{E}_{I}}rdeg_{F^{\prime},s}(s_{I})AGM(\mathcal{H}_{s})} to FI1|I|rdegF,s(sI)AGM(ssI)maxFIrdegF,s(sI)AGM(s)\sum_{F\in\mathcal{E}_{I}}\frac{1}{|\mathcal{E}_{I}|}\cdot rdeg_{F,s}(s_{I})\cdot\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{\max_{F^{\prime}\in\mathcal{E}_{I}}rdeg_{F^{\prime},s}(s_{I})AGM(\mathcal{H}_{s})}. Second, we can interleave GJ-Sample to remove the 1|I|\frac{1}{|\mathcal{E}_{I}|} factor in P(sI)P(s_{I}). If the |ΩI||\Omega_{I}| of GJ-Sample is small enough as a constant, e.g., I||\prod_{I}|\mathcal{E}|, we can compute P(sI)=AGM(ssI)AGM(s)P(s_{I})=\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})} for every sIΩIs_{I}\in\Omega_{I} as in GJ-Sample.
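The effect of increasing P(t) can be computed exactly from the variance formula above; the toy join and its probabilities below are hypothetical.

```python
# One sampling trial returns a join answer t with probability P(t) (or
# fails) and outputs Z = 1/P(t), so E[Z] = sum_t P(t) * (1/P(t)) = OUT and
# Var[Z] = sum_t 1/P(t) - OUT^2, matching the formula above.
def ht_variance(probs):
    out = len(probs)                  # each join answer contributes 1
    return sum(1.0 / p for p in probs) - out**2

# hypothetical join with OUT = 4 answers; probabilities must sum to <= 1
P_before = [0.05, 0.05, 0.05, 0.05]  # before increasing P(t)
P_after  = [0.10, 0.10, 0.10, 0.10]  # keeping probabilities doubled
assert sum(P_after) <= 1.0

v_before, v_after = ht_variance(P_before), ht_variance(P_after)
print(v_before, v_after)             # the variance drops
```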

7.2. Skipping Attributes and GHD Nodes

In GenericCardEst, sampling sIs_{I} for any non-join attribute is unnecessary and thus, can be safely skipped. For our example query in Figure 1, WW^{\prime} is the only non-join attribute. Let the current sample s={x,y,z}s=\{x^{\prime},y^{\prime},z^{\prime}\} (attributes are X,Y,ZX^{\prime},Y^{\prime},Z^{\prime}) and I={W}I=\{W^{\prime}\}. Then, sampling any sIΩI=πW(R{Z,W}z)s_{I}\in\Omega_{I}=\pi_{W^{\prime}}(R_{\{Z^{\prime},W^{\prime}\}}\ltimes z^{\prime}) does not affect the sample space and sample size for the remaining vertices, i.e., XX, YY, and ZZ. Hence, we skip WW^{\prime} and just return 1P(s{W})=|Ω{W}|=|πW(R{Z,W}z)|\frac{1}{P(s_{\{W^{\prime}\}})}=|\Omega_{\{W^{\prime}\}}|=|\pi_{W^{\prime}}(R_{\{Z^{\prime},W^{\prime}\}}\ltimes z^{\prime})| at Line 1 instead of 1. If there are multiple non-join attributes X1,X2,,XnX_{1},X_{2},...,X_{n}, we skip all of these, sample for all join attributes only, then return 1in|Ω{Xi}|\prod_{1\leq i\leq n}|\Omega_{\{X_{i}\}}| at Line 1.
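A minimal sketch (with made-up relation content) of why the skip is safe: a non-join value is never used downstream, so the Horvitz-Thompson contribution is the constant |Ω_{W'}| no matter which value is drawn, and sampling adds no information.

```python
import random

# Ω_{W'} = π_{W'}(R_{Z'W'} ⋉ z') for the current z' (made-up content)
omega = ["w1", "w2", "w3"]

def with_sampling(rng):
    # sample W' uniformly, then apply Horvitz-Thompson: divide by P(s_{W'})
    _w = rng.choice(omega)            # the value is never used downstream
    return 1.0 / (1.0 / len(omega))  # always |Ω|, whichever value was drawn

def skipping():
    # skip W' and return |Ω_{W'}| directly: same value, zero variance
    return float(len(omega))

rng = random.Random(1)
estimates = [with_sampling(rng) for _ in range(5)]
skip_val = skipping()
print(estimates, skip_val)
```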

In GHDCardEst, calling GroupByCardEst for single-edge GHD nodes can be safely skipped. If a GHD node tt is covered by a single edge, i.e., F:χ(t)F\exists F\in\mathcal{E}:\chi(t)\subseteq F, we can directly obtain 𝔼[Z[gt]]=|RFgt|\operatorname*{\mathbb{E}}[Z[g_{t}]]=|R_{F}\ltimes g_{t}| for every gtg_{t} in RtR_{t} without sampling in GroupByCardEst. To be consistent, we regard RtR_{t} as πχ(t)RF\pi_{\chi(t)}R_{F} (with annotations) in order to remove any duplicates in RtR_{t} as mentioned in section 6.1. t0t_{0} of 𝒯2\mathcal{T}_{2} in Figure 1 is an example of a single-edge node, where Z[gt0]=|R{X,X}gt0|Z[g_{t_{0}}]=|R_{\{X,X^{\prime}\}}\ltimes g_{t_{0}}| for gt0Rt0g_{t_{0}}\in R_{t_{0}}. Therefore, 𝕍ar[Z[gt0]]=0\operatorname*{\mathbb{V}ar}[Z[g_{t_{0}}]]=0 and T[gt0]=O~(1)T[g_{t_{0}}]=\tilde{O}(1), reducing both 𝕍ar[Z]\operatorname*{\mathbb{V}ar}[Z] and TT.

8. Research Opportunities

We present six research opportunities based on our study. First, we could express our bound O~(maxtAGM(tG(t))+t(gt,Z[gt])RtAGM(gt)OUT(gt))\tilde{O}\Big{(}\max_{t}AGM(\mathcal{H}^{G(t)}_{t})+\sum_{t}\sum_{(g_{t},Z[g_{t}])\in R_{t}}\\ \frac{AGM(\mathcal{H}_{g_{t}})}{\text{OUT}(\mathcal{H}_{g_{t}})}\Big{)} in section 6 in a more succinct form by removing internal gtg_{t} terms, e.g., to O~(INfhtw()OUT)\tilde{O}\big{(}\frac{\text{IN}^{fhtw(\mathcal{H})}}{\text{OUT}}\big{)}. Then, we could more clearly characterize the gap between this bound and O~(AGM()OUT)\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)} or O~(INfhtw())\tilde{O}(\text{IN}^{fhtw(\mathcal{H})}).

Second, we could dynamically change the sampling approach based on the guess of OUT. Our bound O~(AGM()OUT)\tilde{O}\big{(}\frac{AGM(\mathcal{H})}{\text{OUT}}\big{)} in section 5 can be as small as a constant if OUT is as large as AGM()AGM(\mathcal{H}), while the bound in section 6 has advantages for small OUT (e.g., Example 2). As stated in section 5.5, we could start with GenericCardEst assuming a large OUT, increase the sample size, adjust the assumption, and switch to GHDCardEst if the estimated OUT becomes small.

Third, we could even switch between sampling and join, then apply the same analysis. While join algorithms have zero error but a +OUT term in their runtime, sampling algorithms have non-zero errors but a 1OUT\frac{1}{\text{OUT}} factor in their runtime. Therefore, it might be possible to balance between sampling and join to reduce the runtime and error for both large- and small-OUT cases, i.e., select a sampling-like algorithm for large OUT and a join-like algorithm for small OUT.

Fourth, we could develop join size estimators with lower complexities using more information about relations, e.g., degree constraints or functional dependencies (Abo Khamis et al., 2017), beyond the cardinality constraints in Definition 1.

Fifth, we could extend GHDCardEst to perform uniform sampling over arbitrary GHDs from the observations in section A.2 that 1) ExactWeight (Zhao et al., 2018) performs uniform sampling over certain types of GHDs for acyclic joins, and 2) GHDCardEst can be easily modified for uniform sampling over the same GHDs. In Algorithm 6, the initial value of W(gt)W(g_{t}) (sampling weight for a tuple gtg_{t} in RtR_{t}) is set to 𝔼[Z[gt]]=|RFtgt|\operatorname*{\mathbb{E}}[Z[g_{t}]]=|R_{F_{t}}\ltimes g_{t}|. Therefore, we could extend ExactWeight and GHDCardEst for arbitrary GHDs for even cyclic joins, starting from initializing W(gt)W(g_{t}) with an estimate Z[gt]Z[g_{t}] instead of 𝔼[Z[gt]]\operatorname*{\mathbb{E}}[Z[g_{t}]].

Sixth, we could extend our algorithms to more general problems: approximate counting and uniform sampling for conjunctive queries (Arenas et al., 2021; Focke et al., 2022) or join-project queries πO(Q)\pi_{O}(Q), where OO is the set of output attributes (Chen and Yi, 2020). We explain in appendix C that our degree-based rejection sampling can be easily extended to join-project queries, following the same analysis in Section 3 of (Chen and Yi, 2020). We achieve O~(AGM(Q)|πO(Q)|)\tilde{O}\big{(}\frac{AGM(Q)}{|\pi_{O}(Q)|}\big{)} runtime, which is again smaller than that of GJ-Sample.

9. Conclusion

In this paper, we have presented a new sampling-based method for join size estimation and uniform sampling. Our method solves an open problem in the literature (Chen and Yi, 2020), achieving a (1±ϵ\pm\epsilon)-approximation of OUT in O~(AGMOUT)\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)} time for arbitrary join queries. We also presented a unified framework that explains and analyzes four known methods: SSTE, SUST, Alley, and GJ-Sample. We then extended our method using GHDs to achieve a better bound than O~(AGMOUT)\tilde{O}\big{(}\frac{AGM}{\text{OUT}}\big{)} for small OUT, and improved its efficiency and accuracy with two optimization techniques. Finally, we have highlighted several interesting research opportunities building on our results. We believe that our work may facilitate studies on achieving lower bounds for uniform sampling and size estimation over joins.

10. Acknowledgements

We thank Yufei Tao for pointing out the condition of Lemma 2.

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2021-0-00859, Development of a distributed graph DBMS for intelligent processing of big graphs, 34%), Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2018-0-01398, Development of a Conversational, Self-tuning DBMS, 33%), and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.NRF-2021R1A2B5B03001551, 33%).

References

  • Aberger et al. (2017) Christopher R Aberger, Andrew Lamb, Susan Tu, Andres Nötzli, Kunle Olukotun, and Christopher Ré. 2017. Emptyheaded: A relational engine for graph processing. ACM Transactions on Database Systems (TODS) 42, 4 (2017), 1–44.
  • Abo Khamis et al. (2016) Mahmoud Abo Khamis, Hung Q. Ngo, and Atri Rudra. 2016. FAQ: Questions Asked Frequently. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (San Francisco, California, USA) (PODS ’16). Association for Computing Machinery, New York, NY, USA, 13–28. https://doi.org/10.1145/2902251.2902280
  • Abo Khamis et al. (2017) Mahmoud Abo Khamis, Hung Q Ngo, and Dan Suciu. 2017. What do Shannon-type Inequalities, Submodular Width, and Disjunctive Datalog have to do with one another?. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 429–444.
  • Aliakbarpour et al. (2018) Maryam Aliakbarpour, Amartya Shankha Biswas, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, and Anak Yodpinyanee. 2018. Sublinear-time algorithms for counting star subgraphs via edge sampling. Algorithmica 80, 2 (2018), 668–697.
  • Arenas et al. (2021) Marcelo Arenas, Luis Alberto Croquevielle, Rajesh Jayaram, and Cristian Riveros. 2021. When is approximate counting for conjunctive queries tractable?. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing. 1015–1027.
  • Assadi et al. (2018) Sepehr Assadi, Michael Kapralov, and Sanjeev Khanna. 2018. A simple sublinear-time algorithm for counting arbitrary subgraphs via edge sampling. arXiv preprint arXiv:1811.07780 (2018).
  • Assadi et al. (2019) Sepehr Assadi, Michael Kapralov, and Sanjeev Khanna. 2019. A simple sublinear-time algorithm for counting arbitrary subgraphs via edge sampling. In 10th Innovations in Theoretical Computer Science, ITCS 2019. Schloss Dagstuhl-Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, 6.
  • Atserias et al. (2013) Albert Atserias, Martin Grohe, and Dániel Marx. 2013. Size bounds and query plans for relational joins. SIAM J. Comput. 42, 4 (2013), 1737–1767.
  • Banerjee (2012) Moulinath Banerjee. 2012. Simple Random Sampling. Unpublished Manuscript, University of Michigan, Michigan (2012).
  • Bera and Chakrabarti (2017) Suman K Bera and Amit Chakrabarti. 2017. Towards tighter space bounds for counting triangles and other substructures in graph streams. In 34th Symposium on Theoretical Aspects of Computer Science.
  • Brault-Baron (2016) Johann Brault-Baron. 2016. Hypergraph acyclicity revisited. ACM Computing Surveys (CSUR) 49, 3 (2016), 1–26.
  • Carmeli et al. (2020) Nofar Carmeli, Shai Zeevi, Christoph Berkholz, Benny Kimelfeld, and Nicole Schweikardt. 2020. Answering (unions of) conjunctive queries using random access and random-order enumeration. In Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 393–409.
  • Chen and Yi (2020) Yu Chen and Ke Yi. 2020. Random Sampling and Size Estimation Over Cyclic Joins. In 23rd International Conference on Database Theory (ICDT 2020). Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
  • Deng et al. (2023) Shiyuan Deng, Shangqi Lu, and Yufei Tao. 2023. On Join Sampling and Hardness of Combinatorial Output-Sensitive Join Algorithms. In Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS 2023).
  • Eden et al. (2017) Talya Eden, Amit Levi, Dana Ron, and C Seshadhri. 2017. Approximately counting triangles in sublinear time. SIAM J. Comput. 46, 5 (2017), 1603–1646.
  • Eden et al. (2020) Talya Eden, Dana Ron, and C Seshadhri. 2020. On approximating the number of k-cliques in sublinear time. SIAM J. Comput. 49, 4 (2020), 747–771.
  • Fichtenberger et al. (2020) Hendrik Fichtenberger, Mingze Gao, and Pan Peng. 2020. Sampling Arbitrary Subgraphs Exactly Uniformly in Sublinear Time. In ICALP.
  • Focke et al. (2022) Jacob Focke, Leslie Ann Goldberg, Marc Roth, and Stanislav Zivnỳ. 2022. Approximately counting answers to conjunctive queries with disequalities and negations. In Proceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 315–324.
  • Friedgut and Kahn (1998) Ehud Friedgut and Jeff Kahn. 1998. On the number of copies of one hypergraph in another. Israel Journal of Mathematics 105, 1 (1998), 251–256.
  • Goldreich (2017) Oded Goldreich. 2017. Introduction to property testing. Cambridge University Press.
  • Gottlob et al. (2016) Georg Gottlob, Gianluigi Greco, Nicola Leone, and Francesco Scarcello. 2016. Hypertree decompositions: Questions and answers. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. 57–74.
  • Gottlob et al. (2002) Georg Gottlob, Nicola Leone, and Francesco Scarcello. 2002. Hypertree decompositions and tractable queries. J. Comput. System Sci. 64, 3 (2002), 579–627.
  • Green et al. (2007) Todd J Green, Grigoris Karvounarakis, and Val Tannen. 2007. Provenance semirings. In Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 31–40.
  • Grohe and Marx (2014) Martin Grohe and Dániel Marx. 2014. Constraint solving via fractional edge covers. ACM Transactions on Algorithms (TALG) 11, 1 (2014), 1–20.
  • Horvitz and Thompson (1952) Daniel G Horvitz and Donovan J Thompson. 1952. A generalization of sampling without replacement from a finite universe. Journal of the American statistical Association 47, 260 (1952), 663–685.
  • Jerrum et al. (1986) Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. 1986. Random generation of combinatorial structures from a uniform distribution. Theoretical computer science 43 (1986), 169–188.
  • Joglekar et al. (2016) Manas R. Joglekar, Rohan Puttagunta, and Christopher Ré. 2016. AJAR: Aggregations and Joins over Annotated Relations. In Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (San Francisco, California, USA) (PODS ’16). Association for Computing Machinery, New York, NY, USA, 91–106. https://doi.org/10.1145/2902251.2902293
  • Kim et al. (2021) Kyoungmin Kim, Hyeonji Kim, George Fletcher, and Wook-Shin Han. 2021. Combining Sampling and Synopses with Worst-Case Optimal Runtime and Quality Guarantees for Graph Pattern Cardinality Estimation. In Proceedings of the 2021 International Conference on Management of Data. 964–976.
  • Leis et al. (2015) Viktor Leis, Andrey Gubichev, Atanas Mirchev, Peter Boncz, Alfons Kemper, and Thomas Neumann. 2015. How good are query optimizers, really? Proceedings of the VLDB Endowment 9, 3 (2015), 204–215.
  • Li et al. (2016) Feifei Li, Bin Wu, Ke Yi, and Zhuoyue Zhao. 2016. Wander join: Online aggregation via random walks. In Proceedings of the 2016 International Conference on Management of Data. 615–629.
  • Ngo et al. (2012) Hung Q Ngo, Ely Porat, Christopher Ré, and Atri Rudra. 2012. Worst-case Optimal Join Algorithms. (2012).
  • Ngo et al. (2014) Hung Q Ngo, Christopher Ré, and Atri Rudra. 2014. Skew strikes back: New developments in the theory of join algorithms. ACM SIGMOD Record 42, 4 (2014), 5–16.
  • Veldhuizen (2012) Todd L Veldhuizen. 2012. Leapfrog triejoin: a worst-case optimal join algorithm. arXiv preprint arXiv:1210.0481 (2012).
  • Yannakakis (1981) Mihalis Yannakakis. 1981. Algorithms for acyclic database schemes. In VLDB, Vol. 81. 82–94.
  • Zhao et al. (2018) Zhuoyue Zhao, Robert Christensen, Feifei Li, Xiao Hu, and Ke Yi. 2018. Random sampling over joins revisited. In Proceedings of the 2018 International Conference on Management of Data. 1525–1539.

Appendix

Appendix A Algorithms

A.1. GenericJoin

GenericJoin (Algorithm 4) (Ngo et al., 2014) is a representative worst-case optimal join algorithm. From the set of output attributes 𝒪\mathcal{O} (initially 𝒱\mathcal{V}), it selects a subset II where 1|I|<|𝒪|1\leq|I|<|\mathcal{O}| (Line 4). For each join answer sIs_{I} of an induced hypergraph of \mathcal{H} projected onto II (Line 4), it bounds attributes II to sIs_{I} and proceeds to match the residual hypergraph (Line 4). If |𝒪|=1|\mathcal{O}|=1, the results are obtained from the intersection (\cap) operation (Lines 4-4). The runtime of GenericJoin is O~(AGM())\tilde{O}(AGM(\mathcal{H})), which can be proven from Lemma 6 (Ngo et al., 2014).

Input: Query hypergraph \mathcal{H}, output attributes 𝒪𝒱\mathcal{O}\subseteq\mathcal{V}, and current tuple ss
1
2if (|𝒪|=1|\mathcal{O}|=1) then
3       return F𝒪π𝒪(RFs)\cap_{F\in\mathcal{E}_{\mathcal{O}}}\pi_{\mathcal{O}}(R_{F}\ltimes s)
4      
5RR\leftarrow\emptyset
6 II\leftarrow a non-empty proper subset of 𝒪\mathcal{O}
7
8foreach (sIGenericJoin(,I,s)s_{I}\in\textsc{{GenericJoin}}(\mathcal{H},I,s)) do
9       R[sI]GenericJoin(,𝒪I,ssI)R[s_{I}]\leftarrow\textsc{{GenericJoin}}(\mathcal{H},\mathcal{O}\setminus I,s\uplus s_{I})
10      
11      RR{sI}×R[sI]R\leftarrow R\cup\{s_{I}\}\times R[s_{I}]
12      
13return RR
Algorithm 4 GenericJoin((𝒱,)\mathcal{H}(\mathcal{V},\mathcal{E}), 𝒪\mathcal{O}, ss)
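A minimal executable sketch of the recursion in Algorithm 4, under our simplifying assumptions: relations are plain (schema, tuple-set) pairs with no index structures, and II is always the first remaining attribute. This illustrates the control flow, not a worst-case optimal implementation.

```python
# Relations are (schema, tuple-set) pairs; each call returns full
# tuples (dicts) extending the current tuple s.
def generic_join(rels, attrs, s):
    if len(attrs) == 1:
        a, cands = attrs[0], None
        for schema, tuples in rels:
            if a not in schema:
                continue
            # π_O(R_F ⋉ s): a-values of the tuples consistent with s
            vals = {t[schema.index(a)] for t in tuples
                    if all(s.get(b, t[j]) == t[j] for j, b in enumerate(schema))}
            cands = vals if cands is None else cands & vals
        return [{**s, a: v} for v in sorted(cands or set())]
    out = []
    for s_I in generic_join(rels, attrs[:1], s):        # join answers on I
        out.extend(generic_join(rels, attrs[1:], s_I))  # residual query
    return out

# triangle query R(A,B) ⋈ S(B,C) ⋈ T(A,C) on tiny made-up relations
rels = [(("A", "B"), {(1, 2)}),
        (("B", "C"), {(2, 3)}),
        (("A", "C"), {(1, 3), (1, 4)})]
result = generic_join(rels, ["A", "B", "C"], {})
print(result)   # [{'A': 1, 'B': 2, 'C': 3}]
```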

A.2. Yannakakis, SimpleAggroYannakakis, and ExactWeight

In this section and in Algorithms 5-6, we assume that the root node rr of any GHD 𝒯\mathcal{T} has a virtual parent node prpr where 1) pr𝒱𝒯pr\not\in\mathcal{V}_{\mathcal{T}}, 2) χ(pr)=\chi(pr)=\emptyset, and 3) RprR_{pr} (input relation for prpr) contains a single tuple gprg_{pr} that joins with all tuples in RrR_{r}, for ease of explanation. As mentioned in section 6.1, we assume no duplicate tuples in each RtR_{t}.

Given a join tree 𝒯\mathcal{T} of an α\alpha-acyclic query QQ (Brault-Baron, 2016), Yannakakis (Yannakakis, 1981) performs the join in O~(IN+OUT)\tilde{O}(\text{IN}+\text{OUT}) time using dynamic programming. Algorithm 5 without Lines 5-5 and with R=RtR^{\prime}=R_{t} is Yannakakis, where the bottom-up and top-down semi-join reductions (Lines 5-5) are followed by the bottom-up join (Lines 5-5). Since the semi-join reductions remove dangling tuples (those that do not participate in the final join answers), the intermediate join size monotonically increases up to OUT in the bottom-up join. Therefore, the total runtime is O~(IN+OUT)\tilde{O}(\text{IN}+\text{OUT}).
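The two semi-join passes can be sketched on a two-node join tree; the relations and the semijoin helper below are our illustration.

```python
# semi-join R ⋉ S on the shared attributes: keep R-tuples with a match in S
def semijoin(R, S, shared):
    keys = {tuple(t[a] for a in shared) for t in S}
    return [t for t in R if tuple(t[a] for a in shared) in keys]

root  = [{"A": 1, "B": 2}, {"A": 5, "B": 9}]   # (5, 9) is dangling
child = [{"B": 2, "C": 3}, {"B": 7, "C": 4}]   # (7, 4) is dangling

root  = semijoin(root, child, ["B"])   # bottom-up pass reduces the parent
child = semijoin(child, root, ["B"])   # top-down pass reduces the child
print(root, child)   # only the single joining pair survives
```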

Input: GHD (𝒯\mathcal{T}, χ\chi) and relations RtR_{t} for each t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}}
1
2foreach (t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}} in some bottom-up order) do
3       pp\leftarrow parent of tt
4       RpRpRtR_{p}\leftarrow R_{p}\ltimes R_{t}
5      
6foreach (t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}} in some top-down order) do
7       pp\leftarrow parent of tt
8       RtRtRpR_{t}\leftarrow R_{t}\ltimes R_{p}
9      
10foreach (t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}} in some bottom-up order) do
11       β{AG(t)|TOP𝒯(A)=t}\beta\leftarrow\{A\in G(t)|TOP_{\mathcal{T}}(A)=t\}
12       RβRtR^{\prime}\leftarrow\sum_{\beta}{R_{t}}
13      
14       pp\leftarrow parent of tt
15       RpRpRR_{p}\leftarrow R_{p}\Join R^{\prime}
16      
17return RrR_{r} for the root rr
Algorithm 5 SimpleAggroYannakakis((𝒯(𝒱𝒯,𝒯),χ)(\mathcal{T}(\mathcal{V}_{\mathcal{T}},\mathcal{E}_{\mathcal{T}}),\chi), {Rt|t𝒱𝒯}\{R_{t}|t\in\mathcal{V}_{\mathcal{T}}\})

SimpleAggroYannakakis (Algorithm 5) is a simplified version of AggroYannakakis (Joglekar et al., 2016), where 1) all attributes are used for aggregation and 2) sum is the only aggregation operator. AggroYannakakis can handle a more general class of aggregation queries, e.g., where sum and max operators are used for different attributes. In Line 5, TOP𝒯(A)TOP_{\mathcal{T}}(A) denotes the top node of an attribute AA, which is the closest node to the root among {t𝒱𝒯|χ(t)A}\{t\in\mathcal{V}_{\mathcal{T}}|\chi(t)\ni A\}. G(t)G(t) is the set of output attributes of RtR_{t}. Line 5 aggregates over the attributes in G(t)G(t) having tt as their top node, since these attributes do not appear in ancestors and can be marginalized out. This early marginalization is the key to keeping the runtime of SimpleAggroYannakakis at O~(IN+OUTagg)\tilde{O}(\text{IN}+\text{OUT}_{agg}) (Joglekar et al., 2016), where OUTagg\text{OUT}_{agg} is the output size of the aggregation. For example, OUTagg=1\text{OUT}_{agg}=1 if all attributes are marginalized out in computing the join size.
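Early marginalization for a pure count query can be sketched on a two-node tree with made-up relations: A and C have their top nodes at the leaves, so they are summed out early and only per-B count annotations flow up the tree.

```python
from collections import Counter

# count |R(A,B) ⋈ S(B,C)| without materializing the join
R = [(1, 2), (4, 2), (5, 9)]   # tuples (A, B)
S = [(2, 3), (2, 6), (9, 7)]   # tuples (B, C)

cnt_R = Counter(b for _, b in R)   # Σ_A R, kept as count annotations
cnt_S = Counter(b for b, _ in S)   # Σ_C S
join_size = sum(cnt_R[b] * cnt_S[b] for b in cnt_R)
print(join_size)   # 5: b=2 gives 2*2 answers, b=9 gives 1*1
```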

Input: GHD (𝒯\mathcal{T}, χ\chi) where Ft:χ(t)Ft\exists F_{t}\in\mathcal{E}:\chi(t)\subseteq F_{t} for every t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}} and relations RtR_{t} for each t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}}
1
2W(gt)|RFtgt|W(g_{t})\leftarrow|R_{F_{t}}\ltimes g_{t}| for every t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}} and gtRtg_{t}\in R_{t}
3
4foreach (t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}} in some bottom-up order) do
5      
6      pp\leftarrow parent of tt
7      
8      foreach (gpRpg_{p}\in R_{p}) do
9             W(gp,Rt)gtRtgpW(gt)W(g_{p},R_{t})\leftarrow\sum_{g_{t}\in R_{t}\ltimes g_{p}}W(g_{t})
10             W(gp)W(gp)W(gp,Rt)W(g_{p})\leftarrow W(g_{p})\cdot W(g_{p},R_{t})
11            
12      RpRpRtR_{p}\leftarrow R_{p}\ltimes R_{t}
13      
14sgprs\leftarrow g_{pr}
15foreach (t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}} in some top-down order) do
16       pp\leftarrow parent of tt
17      spπχ(p)ss_{p}\leftarrow\pi_{\chi(p)}s
      {st}SampleFrom(Rtsp,1,{W(gt)W(sp,Rt)|gtRtsp})\{s_{t}\}\leftarrow{\textsc{{SampleFrom}}}(R_{t}\ltimes s_{p},1,\{\frac{W(g_{t})}{W(s_{p},R_{t})}\,|\,g_{t}\in R_{t}\ltimes s_{p}\}) /* sample a tuple sts_{t} from RtspR_{t}\ltimes s_{p} with probability W(st)W(sp,Rt)\frac{W(s_{t})}{W(s_{p},R_{t})} */
18      
19      sssts\leftarrow s\uplus s_{t}
20foreach (t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}} in some order) do
21       Replace stss_{t}\in s in-place with a uniform sample from RFtstR_{F_{t}}\ltimes s_{t}
return ss
Algorithm 6 ExactWeight((𝒯(𝒱𝒯,𝒯),χ)(\mathcal{T}(\mathcal{V}_{\mathcal{T}},\mathcal{E}_{\mathcal{T}}),\chi), {Rt|t𝒱𝒯}\{R_{t}|t\in\mathcal{V}_{\mathcal{T}}\})

We explain ExactWeight (Zhao et al., 2018) in our context of using GHDs (Algorithm 6), but assume that RtR_{t}’s are not annotated and compute each weight WW explicitly. ExactWeight performs uniform sampling over acyclic joins, where GHDs of single-edge nodes are given. ExactWeight computes the sampling weights W(gp)W(g_{p}) and W(gp,Rt)W(g_{p},R_{t}) for each tuple gpg_{p} and child relation RtR_{t} bottom-up (Lines 6-6) and samples tuples proportional to their weights top-down (Lines 6-6). The sample is then replaced with tuples sampled from the base relations (Lines 6-6). The bottom-up part is the preprocessing step that takes O~(|𝒱𝒯|t𝒱𝒯|Rt|)=O~(maxt|RFt|)=O~(IN)\tilde{O}(|\mathcal{V}_{\mathcal{T}}|\sum_{t\in\mathcal{V}_{\mathcal{T}}}|R_{t}|)=\tilde{O}(\max_{t}|R_{F_{t}}|)=\tilde{O}(\text{IN}) time, while the top-down sampling and replacement take O~(1)\tilde{O}(1) time per sample and can be performed multiple times after the preprocessing.

We now briefly explain why ExactWeight returns a uniform sample. W(gp,Rt)W(g_{p},R_{t}) corresponds to the join size between gpg_{p} and the child subtree of pp rooted at tt, and W(gp)W(g_{p}) corresponds to the join size between gpg_{p} and all child subtrees of pp. Hence, W(gpr)=OUTW(g_{pr})=\text{OUT} which is the join size of the whole tree. For a node pp, its children {t1,t2,,tm}\{t_{1},t_{2},...,t_{m}\} and its parent pppp, P(sp)=W(sp)W(spp,Rp)P(s_{p})=\frac{W(s_{p})}{W(s_{pp},R_{p})} and P(sti)=W(sti)W(sp,Rti)P(s_{t_{i}})=\frac{W(s_{t_{i}})}{W(s_{p},R_{t_{i}})} for i[1,m]i\in[1,m]. Therefore, P(sp,st1,,stm)=W(sp)W(spp,Rp)1imW(sti)W(sp,Rti)=|RFpsp|iW(sp,Rti)W(spp,Rp)iW(sti)iW(sp,Rti)=|RFpsp|iW(sti)W(spp,Rp)P(s_{p},s_{t_{1}},...,s_{t_{m}})=\frac{W(s_{p})}{W(s_{pp},R_{p})}\prod_{1\leq i\leq m}\frac{W(s_{t_{i}})}{W(s_{p},R_{t_{i}})}=\frac{|R_{F_{p}}\ltimes s_{p}|\prod_{i}W(s_{p},R_{t_{i}})}{W(s_{pp},R_{p})}\frac{\prod_{i}W(s_{t_{i}})}{\prod_{i}W(s_{p},R_{t_{i}})}=\frac{|R_{F_{p}}\ltimes s_{p}|\prod_{i}W(s_{t_{i}})}{W(s_{pp},R_{p})}. We have seen a similar elimination of probability terms in eq. 3, which leads to P(s)=t|RFtst|W(gpr,Rr)=t|RFtst|OUTP(s)=\frac{\prod_{t}|R_{F_{t}}\ltimes s_{t}|}{W(g_{pr},R_{r})}=\frac{\prod_{t}|R_{F_{t}}\ltimes s_{t}|}{\text{OUT}} after Line 6. By uniformly sampling a tuple from each RFtstR_{F_{t}}\ltimes s_{t} at Line 6, P(s)P(s) becomes 1OUT\frac{1}{\text{OUT}} at Line 6.
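The cancellation of probability terms above can be verified on a tiny two-node instance; the relations and node layout are our assumptions.

```python
from collections import Counter

# R is the root node, S its only child; both are single-edge nodes
R = [(1, 2), (4, 2), (5, 9)]   # tuples (A, B)
S = [(2, 3), (2, 6), (9, 7)]   # tuples (B, C)

# bottom-up: W(r) = |S ⋉ r|, and the weight at the virtual root is OUT
w_child = Counter(b for b, _ in S)       # W(r, S), keyed by the B value
W = {r: w_child[r[1]] for r in R}
OUT = sum(W.values())

# top-down: draw r with probability W(r)/OUT, then a partner s uniformly
# among its W(r) matches, so P(r, s) = (W(r)/OUT) * (1/W(r)) = 1/OUT
probs = [(W[r] / OUT) * (1.0 / W[r])
         for r in R for s in S if s[0] == r[1]]
print(OUT, probs)   # every join answer has the same probability
```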

We can also make GHDCardEst in Algorithm 3 perform uniform sampling for the same GHD by 1) iterating tt bottom-up at Line 3, 2) adding the same initialization and updates of sampling weights, and 3) adding the same sampling part. Then, the preprocessing step of GHDCardEst also takes O~(IN)\tilde{O}(\text{IN}) as ExactWeight. From this perspective, ExactWeight is similar to a specific instance of GHDCardEst for GHDs of single-edge nodes and acyclic queries. Extending the above modifications of GHDCardEst to work for arbitrary GHDs and cyclic queries would be an interesting future work. We could also extend our work to unions of acyclic conjunctive queries and sampling without replacement, as what (Carmeli et al., 2020) is to (Zhao et al., 2018).

A.3. GHDJoin and AggroGHDJoin

Given a generalized hypertree decomposition (GHD) of a query, GHDJoin (Algorithm 7) (Joglekar et al., 2016) uses 𝒯\mathcal{T} as a join tree and performs GenericJoin to obtain the join answers of each node (Lines 7-7). t\mathcal{H}_{t} is the subquery defined on node tt (Line 7). Then, it performs Yannakakis over these join answers to evaluate the original query (Line 7). Therefore, the runtime of GHDJoin is O~(maxt𝒱𝒯INρ(t)+OUT)=O~(INfhtw(𝒯,)+OUT)\tilde{O}(\max_{t\in\mathcal{V}_{\mathcal{T}}}\text{IN}^{\rho(\mathcal{H}_{t})}+\text{OUT})=\tilde{O}(\text{IN}^{fhtw(\mathcal{T},\mathcal{H})}+\text{OUT}) from Definition 3.

Input: Query hypergraph \mathcal{H} and a GHD (𝒯\mathcal{T}, χ\chi) of \mathcal{H}
1
2S𝒯S_{\mathcal{T}}\leftarrow\emptyset
3 foreach (t𝒱𝒯t\in\mathcal{V}_{\mathcal{T}}) do
4       t(χ(t),χ(t),π)\mathcal{H}_{t}\leftarrow(\chi(t),\mathcal{E}_{\chi(t),\pi})
5      
      /* compute t\mathcal{H}_{t} */
6       S𝒯S𝒯GenericJoin(t,χ(t),)S_{\mathcal{T}}\leftarrow S_{\mathcal{T}}\cup\textsc{{GenericJoin}}(\mathcal{H}_{t},\chi(t),\emptyset)
7      
8return Yannakakis((𝒯,χ),S𝒯)\textsc{{Yannakakis}}((\mathcal{T},\chi),S_{\mathcal{T}})
Algorithm 7 GHDJoin ((𝒱,)\mathcal{H}(\mathcal{V},\mathcal{E}), (𝒯(𝒱𝒯,𝒯),χ)(\mathcal{T}(\mathcal{V}_{\mathcal{T}},\mathcal{E}_{\mathcal{T}}),\chi))

AggroGHDJoin (Joglekar et al., 2016) is an aggregation version of GHDJoin. It is different from GHDJoin in two aspects: 1) calling AggroYannakakis instead of Yannakakis and 2) including some extra work to ensure that each annotation of a tuple is passed to exactly one call of GenericJoin (Joglekar et al., 2016). Note that a hyperedge FF\in\mathcal{E} can be visited in multiple nodes, thus, its annotations may be aggregated multiple times, leading to an incorrect output.

From the runtime of GenericJoin and AggroYannakakis, the runtime of AggroGHDJoin is O~(INfhtw(𝒯,)+OUTagg)\tilde{O}(\text{IN}^{fhtw(\mathcal{T},\mathcal{H})}+\text{OUT}_{agg}). In (Joglekar et al., 2016), the runtime of AggroGHDJoin is expressed as O~(INw+OUTagg)\tilde{O}(\text{IN}^{w^{*}}+\text{OUT}_{agg}), where wmin(𝒯,χ)Valfhtw(𝒯,)w^{*}\coloneqq\min_{(\mathcal{T},\chi)\in Val}fhtw(\mathcal{T},\mathcal{H}). Here, ValVal denotes the set of valid GHDs of \mathcal{H} that preserve the aggregation ordering of the specified aggregation A1,1A2,2An,n\sum_{A_{1},\oplus_{1}}\sum_{A_{2},\oplus_{2}}...\sum_{A_{n},\oplus_{n}}; changing the order of Ai,i\sum_{A_{i},\oplus_{i}} and Aj,j\sum_{A_{j},\oplus_{j}} may change the aggregation result (Joglekar et al., 2016). Since fhtw()=min(𝒯,χ)fhtw(𝒯,)fhtw(\mathcal{H})=\min_{(\mathcal{T},\chi)}fhtw(\mathcal{T},\mathcal{H}) from Definition 3 and ValVal is a subset of the set of all GHDs, wfhtw()w^{*}\geq fhtw(\mathcal{H}). However, when all aggregation operators are the same (i=+\oplus_{i}=+) and thus commutative, as in our case, any aggregation order gives the same result. Therefore, ValVal becomes the set of all GHDs, which leads to w=fhtw()w^{*}=fhtw(\mathcal{H}).

Appendix B Query Decomposition Lemmas

This section explains two similar query decomposition lemmas proved by Ngo et al. (Ngo et al., 2014) and Chen & Yi (Chen and Yi, 2020).

Lemma 6.

(Ngo et al., 2014): Given a query hypergraph (𝒱,)\mathcal{H}(\mathcal{V},\mathcal{E}), let xx be any fractional edge cover of \mathcal{H}. Let II be an arbitrary non-empty proper subset of 𝒱\mathcal{V}, J=𝒱IJ=\mathcal{V}\setminus I, and L=FIπIRFL=\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}R_{F}. Then,

(11) sILFJ|RFsI|xFF|RF|xF.\vspace{-2pt}\sum_{s_{I}\in L}\prod_{F\in\mathcal{E}_{J}}|R_{F}\ltimes s_{I}|^{x_{F}}\leq\prod_{F\in\mathcal{E}}|R_{F}|^{x_{F}}.\vspace{-1pt}

Using this lemma, Ngo et al. (Ngo et al., 2014) prove that GenericJoin has O~(F|RF|xF)\tilde{O}(\prod_{F\in\mathcal{E}}|R_{F}|^{x_{F}}) runtime (i.e., the RHS above) using an induction hypothesis on |𝒱||\mathcal{V}|. If xx is an optimal fractional edge cover, the RHS becomes AGM()AGM(\mathcal{H}).

We briefly explain this with Algorithm 4. If we replace 1) 𝒱\mathcal{V} in the lemma with 𝒪\mathcal{O} in Algorithm 4 and 2) RFR_{F} with RFsR_{F}\ltimes s, we obtain L=FIπI(RFs)L=\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s) and

(12) sILF𝒪I|RFssI|xFF𝒪|RFs|xF.\vspace{-2pt}\sum_{s_{I}\in L}\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|R_{F}\ltimes s\uplus s_{I}|^{x_{F}}\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}}.\vspace{-1pt}

If |𝒪|=1|\mathcal{O}|=1, Line 4 takes O~(minF𝒪|π𝒪(RFs)|)\tilde{O}(\min_{F\in\mathcal{E}_{\mathcal{O}}}|\pi_{\mathcal{O}}(R_{F}\ltimes s)|) time where minF𝒪|π𝒪(RFs)|F𝒪|π𝒪(RFs)|xFF𝒪|RFs|xF\min_{F\in\mathcal{E}_{\mathcal{O}}}|\pi_{\mathcal{O}}(R_{F}\ltimes s)|\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|\pi_{\mathcal{O}}(R_{F}\ltimes s)|^{x_{F}}\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}} for F𝒪xF1\sum_{F\in\mathcal{E}_{\mathcal{O}}}x_{F}\geq 1. The last term is equivalent to the RHS above.

If |𝒪|>1|\mathcal{O}|>1, Lines 4-4 take O~(|L|+sILF𝒪I|π𝒪I(RFssI)|xF)\tilde{O}(|L|+\sum_{s_{I}\in L}\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|\pi_{\mathcal{O}\setminus I}(R_{F}\ltimes s\uplus s_{I})|^{x_{F}}) time where |L|FI|RFs|xF|L|\leq\prod_{F\in\mathcal{E}_{I}}|R_{F}\ltimes s|^{x_{F}} from the induction hypothesis (|I|<|𝒪|\because|I|<|\mathcal{O}|, and FI|RFs|xFF𝒪|RFs|xF\prod_{F\in\mathcal{E}_{I}}|R_{F}\ltimes s|^{x_{F}}\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}}) and sILF𝒪I|π𝒪I(RFssI)|xFsILF𝒪I|RFssI|xFF𝒪|RFs|xF\sum_{s_{I}\in L}\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|\pi_{\mathcal{O}\setminus I}(R_{F}\ltimes s\uplus s_{I})|^{x_{F}}\leq\sum_{s_{I}\in L}\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|R_{F}\ltimes s\uplus s_{I}|^{x_{F}}\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}} from the lemma, which is again the RHS above.

Therefore, in both cases, the runtime of Algorithm 4 is the big-Oh of the RHS above.

Lemma 7.

(Chen and Yi, 2020): Given a query hypergraph (𝒱,)\mathcal{H}(\mathcal{V},\mathcal{E}), let xx be any fractional edge cover of \mathcal{H}. Let I={A}I=\{A\} for an arbitrary attribute A𝒱A\in\mathcal{V}, J=𝒱IJ=\mathcal{V}\setminus I, and L=πIRFGJL^{\prime}=\pi_{I}R_{F^{GJ}} for FGJ=argminFI|πIRF|F^{GJ}=\operatorname*{arg\,min}_{F\in\mathcal{E}_{I}}|\pi_{I}R_{F}|. Then,

(13) sILFJ|RFsI|xFF|RF|xF.\sum_{s_{I}\in L^{\prime}}\prod_{F\in\mathcal{E}_{J}}|R_{F}\ltimes s_{I}|^{x_{F}}\leq\prod_{F\in\mathcal{E}}|R_{F}|^{x_{F}}.

Chen & Yi (Chen and Yi, 2020) state Lemma 7 without a name; because it is similar to Lemma 6, we also call it the query decomposition lemma. However, it differs from Lemma 6 in that 1) $|I|$ must be 1 and 2) $L^{\prime}$ is obtained without an actual join or intersection in GenericJoin. If $|I|=1$ in Lemma 6, then $L=\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}R_{F}=\cap_{F\in\mathcal{E}_{I}}\pi_{I}R_{F}$. Since $F^{GJ}\in\mathcal{E}_{I}$, $L^{\prime}\supseteq L$. Therefore, Lemma 7 is stronger than Lemma 6 when $|I|=1$.
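The relationship between $L$ and $L^{\prime}$ can be made concrete. In the made-up triangle instance below, $L^{\prime}$ strictly contains $L$, yet the bound of Eq. (13) still holds because non-joining values contribute zero terms:

```python
# Toy instance of Lemma 7 on the triangle, I = {A}: L' = π_A R_{F^GJ} is the
# smallest projection onto A, a superset of L = ∩_F π_A R_F (data is made up).
R_AB = {(0, 1), (1, 1), (2, 2)}
R_BC = {(1, 2), (2, 0), (1, 0)}
R_CA = {(0, 0), (1, 3)}              # π_A R_CA = {0, 3} is the smallest
x = 0.5                              # optimal fractional edge cover of the triangle

proj_AB = {a for (a, _) in R_AB}
proj_CA = {a for (_, a) in R_CA}
L_prime = min((proj_AB, proj_CA), key=len)   # F^GJ = CA here
L = proj_AB & proj_CA
assert L < L_prime                   # L' strictly contains L in this instance

def deg(R, idx, v):
    return sum(1 for t in R if t[idx] == v)

# LHS of Eq. (13) still respects the AGM bound: values a ∈ L' \ L contribute
# 0 because some semijoin R_F ⋉ a becomes empty
lhs = sum((deg(R_AB, 0, a) * len(R_BC) * deg(R_CA, 1, a)) ** x for a in L_prime)
rhs = (len(R_AB) * len(R_BC) * len(R_CA)) ** x
assert lhs <= rhs
```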

Lemma 7 is the key to show that sIΩIP(sI)1\sum_{s_{I}\in\Omega_{I}}P(s_{I})\leq 1 for GJ-Sample. By replacing 1) 𝒱\mathcal{V} in the above lemma with 𝒪\mathcal{O} in GJ-Sample in section 5.3 and 2) RFR_{F} with RFsR_{F}\ltimes s, we obtain L=πI(RFGJs)=ΩIL^{\prime}=\pi_{I}(R_{F^{GJ}}\ltimes s)=\Omega_{I} for FGJ=argminFI|πI(RFs)|F^{GJ}=\operatorname*{arg\,min}_{F\in\mathcal{E}_{I}}|\pi_{I}(R_{F}\ltimes s)| and

(14) sILF𝒪I|RFssI|xFF𝒪|RFs|xF,\sum_{s_{I}\in L^{\prime}}\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|R_{F}\ltimes s\uplus s_{I}|^{x_{F}}\leq\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}},

resulting in

(15) sIΩIP(sI)=sILAGM(ssI)AGM(s)=sILF𝒪I|RFssI|xFF𝒪|RFs|xF1.\begin{split}\sum_{s_{I}\in\Omega_{I}}P(s_{I})&=\sum_{s_{I}\in L^{\prime}}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}\\ &=\sum_{s_{I}\in L^{\prime}}\frac{\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}|R_{F}\ltimes s\uplus s_{I}|^{x_{F}}}{\prod_{F\in\mathcal{E}_{\mathcal{O}}}|R_{F}\ltimes s|^{x_{F}}}\leq 1.\end{split}
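Equation (15) can be verified on a toy instance: below, GJ-Sample's probabilities $P(s_I)=\frac{AGM(\mathcal{H}_{s\uplus s_I})}{AGM(\mathcal{H}_s)}$ over $\Omega_I$ (the smallest projection onto the first attribute, with made-up data) sum to at most 1:

```python
# Toy triangle with x_F = 1/2: probabilities GJ-Sample assigns to the first
# attribute A, using the minimum-projection edge as Ω_I (data is made up).
R_AB = {(0, 1), (0, 2), (1, 2)}
R_BC = {(1, 0), (2, 0), (2, 1), (0, 1)}
R_CA = {(0, 0), (1, 0), (0, 1)}

def deg(R, idx, v):
    return sum(1 for t in R if t[idx] == v)

agm_empty = (len(R_AB) * len(R_BC) * len(R_CA)) ** 0.5   # AGM(H_s) for s = ∅
proj_AB = {a for (a, _) in R_AB}
proj_CA = {a for (_, a) in R_CA}
omega = min((proj_AB, proj_CA), key=len)                 # Ω_I = π_A R_{F^GJ}

# P(a) = AGM(H_a) / AGM(H_∅) with AGM(H_a) = (|R_AB⋉a|·|R_BC|·|R_CA⋉a|)^{1/2}
P = {a: (deg(R_AB, 0, a) * len(R_BC) * deg(R_CA, 1, a)) ** 0.5 / agm_empty
     for a in omega}
assert sum(P.values()) <= 1.0 + 1e-9   # Eq. (15), up to float rounding
```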

Appendix C Extension to Conjunctive Queries

From the introduction, the join query we have considered so far is a full conjunctive query, where all attributes are output attributes. In this section, we extend our degree-based rejection sampling to approximate counting and uniform sampling for join-project queries (a.k.a. conjunctive queries) (Arenas et al., 2021; Focke et al., 2022; Chen and Yi, 2020), following the same analysis as Section 3 of (Chen and Yi, 2020).

Let $\pi_{O}(Q)$ be a join-project query, where $O$ is the set of projection/output attributes and $Q$ is a join query. We can use the two-step approach in (Chen and Yi, 2020): (Step 1) use Algorithm 1 to sample a join result $s$ from $Q_{O}\coloneqq\Join_{F\in E_{O}}\pi_{O}(R_{F})$, i.e., join only the relations that contain at least one attribute in $O$. Here, if $Q_{O}$ is disconnected, we take a sample from each connected component. (Step 2) check whether $Q_{\mathcal{V}\setminus O}(s)\coloneqq\Join_{F\in E_{\mathcal{V}\setminus O}}(\pi_{\mathcal{V}\setminus O}(R_{F}\ltimes s))$ is empty. If it is non-empty, we return $s$ as a sampled result; otherwise, we repeat. The final $s$ returned is then a result in $\pi_{O}(Q)$. Using the same analysis as (Chen and Yi, 2020), sampling a tuple in $\pi_{O}(Q)$ takes $\tilde{O}\big{(}\frac{AGM(Q)}{\text{OUT}(\pi_{O}(Q))}\big{)}$ time in expectation for our method.

Step 1 of our algorithm takes time O~(AGM(QO)OUT(QO))\tilde{O}\big{(}\frac{AGM(Q_{O})}{\text{OUT}({Q_{O}})}\big{)} in expectation. Step 2 takes time O~(AGM(Q𝒱O(s)))\tilde{O}(AGM(Q_{\mathcal{V}\setminus O}(s))), using any worst-case optimal join algorithm, e.g., GenericJoin. Because each sQOs\in Q_{O} is sampled with probability 1OUT(QO)\frac{1}{\text{OUT}(Q_{O})}, the expected runtime of Step 2 is (the big-Oh of) sQOAGM(Q𝒱O(s))OUT(QO)AGM(Q)OUT(QO)\sum_{s\in Q_{O}}\frac{AGM(Q_{\mathcal{V}\setminus O}(s))}{\text{OUT}(Q_{O})}\leq\frac{AGM(Q)}{\text{OUT}(Q_{O})} where the inequality follows from the query decomposition lemma (Lemma 6). Please refer to Section 3 of (Chen and Yi, 2020) for the remaining part of the analysis proving the O~(AGM(Q)OUT(πO(Q)))\tilde{O}\big{(}\frac{AGM(Q)}{\text{OUT}(\pi_{O}(Q))}\big{)} time.
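The two-step approach can be sketched as follows. This is a minimal illustration with made-up relations; Step 1 here simply materializes $Q_O$ and samples uniformly from it, standing in for Algorithm 1:

```python
import random

# Toy instance: Q = R1(A,B) ⋈ R2(B,C), output attributes O = {A}.
R1 = {(0, 0), (0, 1), (1, 1), (2, 2)}
R2 = {(1, 5)}                           # only B = 1 extends to a full answer

Q_O = sorted({a for (a, b) in R1})      # Q_O = π_A R1 (materialized here,
                                        # standing in for Algorithm 1)

def residual_nonempty(a):
    # Step 2: is Q_{V\O}(a) = π_B(R1 ⋉ a) ⋈ R2 non-empty?
    return any(a2 == a and any(b2 == b for (b2, c) in R2) for (a2, b) in R1)

def sample_projected():
    while True:
        a = random.choice(Q_O)          # Step 1: uniform sample from Q_O
        if residual_nonempty(a):
            return a                    # a is uniform over π_A(Q)

answers = {a for a in Q_O if residual_nonempty(a)}
samples = {sample_projected() for _ in range(200)}
assert samples <= answers               # only true projected answers survive
```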

Here, we would like to address the complexity of computing πO(RF)\pi_{O}(R_{F}) before calling Algorithm 1. If there exists an index for RFR_{F} with the attribute order such that the attributes in OFO\cap F precede the attributes in FOF\setminus O, then πO(RF)\pi_{O}(R_{F}) is readily available (i.e., in O~(1)\tilde{O}(1) time) as explained in section 5.6. This results in total O~(AGM(Q)OUT(πO(Q)))\tilde{O}\big{(}\frac{AGM(Q)}{\text{OUT}(\pi_{O}(Q))}\big{)} time. Otherwise, we need an additional linear-time preprocessing to obtain πO(RF)\pi_{O}(R_{F}) from RFR_{F}, resulting in total O~(AGM(Q)OUT(πO(Q))+IN)\tilde{O}\big{(}\frac{AGM(Q)}{\text{OUT}(\pi_{O}(Q))}+\text{IN}\big{)} time. Nevertheless, we emphasize that this is still lower than the complexity of GJ-Sample for non-sequenceable join-project queries, which is O~(INAGM(Q)OUT(πO(Q))+IN)\tilde{O}\big{(}\text{IN}\cdot\frac{AGM(Q)}{\text{OUT}(\pi_{O}(Q))}+\text{IN}\big{)} (Chen and Yi, 2020).

We also note that the algorithm proposed in (Arenas et al., 2021) requires computing $N(T^{\prime}_{i})$ in their "High-Level Sampling Template," which is the (approximate) size of the $i$-th subset $T^{\prime}_{i}$ of a set $T$. This has to be computed for every possible $i$, since $\frac{N(T^{\prime}_{i})}{\sum_{i}N(T^{\prime}_{i})}$ is used as the probability of sampling $T^{\prime}_{i}$. This is analogous to computing $AGM(\mathcal{H}_{s\uplus s_{I}})$ for each candidate $s_{I}$ in GJ-Sample. A difference between GJ-Sample and (Arenas et al., 2021) is that computing $AGM(\mathcal{H}_{s\uplus s_{I}})$ takes constant time in GJ-Sample, while computing $N(T^{\prime}_{i})$ takes polynomial time according to Lemma 4.5 in the full version of (Arenas et al., 2021). Therefore, (Arenas et al., 2021) inherently carries the overhead of computing the $N(T^{\prime}_{i})$'s, analogous to GJ-Sample. Removing this overhead by extending our idea to conjunctive queries would be interesting future work.

Appendix D Component-at-a-time Sampling

For each component, GenericCardEst (Algorithm 1) is invoked at two levels of recursion; e.g., if $\mathcal{H}$ consists of four odd cycles and three stars, the maximum recursion depth is $2\cdot(4+3)+1=15$, including the calls that reach Line 1. Since the components are vertex-disjoint, i.e., variable-disjoint, $R_{F}\ltimes s=R_{F}$ whenever $F$ and (the attributes of) $s$ are contained in different components. Therefore, we can safely drop $\ltimes s$ from the following explanations of the component-at-a-time methods.

We first explain for a star SS of nn edges F1,F2,,FnF_{1},F_{2},...,F_{n}. Let AA be the center attribute, i.e., {A}=i[1,n]Fi\{A\}=\cap_{i\in[1,n]}F_{i}. For SSTE, in the first call to GenericCardEst, II becomes the (singleton set of) center attribute, i.e., I={A}I=\{A\}, ΩI=πARF1\Omega_{I}=\pi_{A}R_{F_{1}} and k=1k=1. To sample a vertex aa from ΩI\Omega_{I}, SSTE first samples an edge ee from RF1R_{F_{1}}, and let a=πA(e)a=\pi_{A}(e). Then, P(a)=|RF1a||RF1|P(a)=\frac{|R_{F_{1}}\ltimes a|}{|R_{F_{1}}|} since |RF1a||R_{F_{1}}\ltimes a| edges in RF1R_{F_{1}} have aa as their projection onto AA. When GenericCardEst is called for the second time, II becomes the remaining attributes of SS, i.e., I=iFi{A}I=\cup_{i}F_{i}\setminus\{A\}, ΩI=iπI(RFia)\Omega_{I}=\prod_{i}\pi_{I}(R_{F_{i}}\ltimes a), and k=1k=1; a sequence of nn vertices (b1,b2,,bn)(b_{1},b_{2},...,b_{n}) is sampled from ΩI\Omega_{I} once (i.e., bib_{i} is sampled from πI(RFia)\pi_{I}(R_{F_{i}}\ltimes a)). If (a,b1,b2,,bn)(a,b_{1},b_{2},...,b_{n}) forms a star in the database, 𝕀[]\operatorname*{\mathbb{I}}[\cdot] for these two calls become 1, and the recursion proceeds to the remaining components, with 𝒪𝒪(iFi)\mathcal{O}\leftarrow\mathcal{O}\setminus(\cup_{i}F_{i}) and ss(a,b1,b2,,bn)s\leftarrow s\uplus(a,b_{1},b_{2},...,b_{n}).
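SSTE's first step for a star induces exactly the stated distribution $P(a)=\frac{|R_{F_1}\ltimes a|}{|R_{F_1}|}$; a small check with a made-up relation $R_{F_1}$:

```python
from collections import Counter

# SSTE's first step on a star: sample an edge e uniformly from R_{F_1} and
# let a = π_A(e); then P(a) = |R_{F_1} ⋉ a| / |R_{F_1}| (toy relation below).
R_F1 = [(0, 10), (0, 11), (1, 12), (2, 13)]   # tuples (A, leaf attribute)

deg = Counter(a for (a, _) in R_F1)           # |R_{F_1} ⋉ a| for each a
P = {a: d / len(R_F1) for a, d in deg.items()}

assert abs(sum(P.values()) - 1.0) < 1e-9      # a valid distribution over π_A R_{F_1}
assert P[0] == 2 / 4                          # vertex 0 lies on two edges
```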

For SUST, in the first call to GenericCardEst, II is simply the set of all attributes of SS and ΩI=iRFi\Omega_{I}=\prod_{i}R_{F_{i}}, P()=1|ΩI|P(\cdot)=\frac{1}{|\Omega_{I}|}, and k=1k=1; a sequence of nn edges is sampled once. 𝕀[]\operatorname*{\mathbb{I}}[\cdot] is 1 iff they share a common vertex for the center. The second call to GenericCardEst is a dummy that directly proceeds to the remaining components.

We next explain for an odd cycle CC of 2n+12n+1 edges with the vertex (attribute) sequence (u1,v1,u2,v2,,un,vn,w,u1)(u_{1},v_{1},u_{2},v_{2},...,u_{n},v_{n},w,u_{1}) where Fi={ui,vi}F_{i}=\{u_{i},v_{i}\} for i[1,n]i\in[1,n] and F0={u1,w}F_{0}=\{u_{1},w\}. A constraint is that the sampled vertex (attribute value) for u1u_{1} must have the smallest degree among the sampled vertices, where the degree of a vertex aa is |Ra||R\ltimes a| (note that R=RFiR=R_{F_{i}} for every ii). In the first call to GenericCardEst, II contains all uiu_{i}’s and viv_{i}’s except ww, ΩI=i[1,n]RFi\Omega_{I}=\prod_{i\in[1,n]}R_{F_{i}}, and k=1k=1; a sequence of nn edges ({a1,b1},,{an,bn})(\{a_{1},b_{1}\},...,\{a_{n},b_{n}\}) is sampled where uiu_{i} and viv_{i} map to aia_{i} and bib_{i}. When GenericCardEst is called for the second time, I={w}I=\{w\}. SSTE and SUST differ here.

For SSTE, ΩI=πI(RF0a1)\Omega_{I}=\pi_{I}(R_{F_{0}}\ltimes a_{1}), P()=1|ΩI|P(\cdot)=\frac{1}{|\Omega_{I}|}, and kk is |ΩI||R|\lceil\frac{|\Omega_{I}|}{\sqrt{|R|}}\rceil. Therefore, kk vertices are uniformly sampled from ΩI\Omega_{I} with replacement. For SUST, kk is fixed to 1, and ΩI\Omega_{I} is defined based on the degree of a1a_{1}. If the degree is smaller than |R|\sqrt{|R|}, then ΩI=πI(RF0a1)\Omega_{I}=\pi_{I}(R_{F_{0}}\ltimes a_{1}) and P()=1|R|P(\cdot)=\frac{1}{\sqrt{|R|}}. If the degree is larger than |R|\sqrt{|R|}, then ΩI=πIRF0\Omega_{I}=\pi_{I}R_{F_{0}} and P(sI)=|RsI||R|P(s_{I})=\frac{|R\ltimes s_{I}|}{|R|}. Then, at Line 1, 1) any sIs_{I} with a smaller degree than a1a_{1} is rejected due to the constraint, and 2) an accepted sIs_{I} is further kept with probability |R||RsI|\frac{\sqrt{|R|}}{|R\ltimes s_{I}|}, which is less than 1 from |RsI||Ra1||R||R\ltimes s_{I}|\geq|R\ltimes a_{1}|\geq\sqrt{|R|}. Therefore, P(sI)P(s_{I}) is again |RsI||R||R||RsI|=1|R|\frac{|R\ltimes s_{I}|}{|R|}\frac{\sqrt{|R|}}{|R\ltimes s_{I}|}=\frac{1}{\sqrt{|R|}}. For both SSTE and SUST, if (a1,b1,,an,bn,s{w},a1)(a_{1},b_{1},...,a_{n},b_{n},s_{\{w\}},a_{1}) forms a cycle in the database, the recursion proceeds to the remaining components.
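Both branches of SUST's second step land on the same final probability $\frac{1}{\sqrt{|R|}}$; a quick numeric check of the high-degree branch with assumed toy sizes:

```python
import math

# High-degree branch of SUST's second step (toy numbers): s_I is drawn with
# P(s_I) = |R ⋉ s_I|/|R|, then kept with probability sqrt(|R|)/|R ⋉ s_I|.
R_size = 100                          # |R| (assumed)
products = []
for degree in (10, 25, 100):          # degrees ≥ sqrt(|R|) = 10
    p_draw = degree / R_size          # P(s_I) = |R ⋉ s_I| / |R|
    p_keep = math.sqrt(R_size) / degree
    assert p_keep <= 1.0              # valid acceptance probability
    products.append(p_draw * p_keep)

# every degree yields the same final probability 1/sqrt(|R|)
assert all(abs(p - 1 / math.sqrt(R_size)) < 1e-12 for p in products)
```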

SUST guarantees the uniformity of sampling odd cycles and stars. To sample an odd cycle, SUST relies on the constraint that $a_{1}$ has the smallest degree. This is allowed in unlabeled graphs, since the other non-qualifying cycles (e.g., those where $a_{2}$ has the smallest degree) can be recovered by permuting the vertices and edges (e.g., constrain w.r.t. $a_{1}$ and let $a_{2}=a_{1}$). The new proof of SSTE also uses this constraint to bound the variance (see appendix E). However, with non-identical relations, we cannot permute edges that have different labels.

The essence of component-at-a-time sampling is a divide-and-conquer approach: break a query $\mathcal{H}$ into smaller components (i.e., odd cycles and stars), solve simpler problems for these components, and combine the results into a cardinality estimate for the original query. Lemmas 8 and 9 allow us to set an induction hypothesis that bounds the variance of estimates in the simpler problems. We explain these lemmas in section D.1 and extend them to labeled graphs in section D.2.

D.1. For Unlabeled Graphs

Lemma 8.

(Decomposition Lemma) (Assadi et al., 2019): Given a query hypergraph (𝒱,)\mathcal{H}(\mathcal{V},\mathcal{E}) of identical binary relations (i.e., RF=R:FR_{F}=R:\forall F\in\mathcal{E}), the LP in Definition 1 admits a half-integral optimal solution xx where 1) xF{0,12,1}:Fx_{F}\in\{0,\frac{1}{2},1\}:\forall F\in\mathcal{E} and 2) the support of xx, supp(x){F|xF>0}supp(x)\coloneqq\{F\in\mathcal{E}|x_{F}>0\}, forms a set of vertex-disjoint odd cycles (i.e., each cycle has an odd number of edges) and stars.

Lemma 9.

(Assadi et al., 2019): Suppose that \mathcal{H} (with identical binary relations as \mathcal{E}) is decomposed into odd cycles and stars by a half-integral optimal solution xx in Lemma 8. For arbitrary subsets of odd cycles \mathbb{C} and stars 𝕊\mathbb{S} of \mathcal{H}, let s=(𝒱s,s)\mathcal{H}_{s}=(\mathcal{V}_{s},\mathcal{E}_{s}) be the induced subquery of \mathcal{H} on vertices of \mathbb{C} and 𝕊\mathbb{S}. Then, for output size OUTs\text{OUT}_{s} of s\mathcal{H}_{s}, OUTs|R|FsxF\text{OUT}_{s}\leq|R|^{\sum_{F\in\mathcal{E}_{s}}x_{F}}, and FsxF=C(𝒱C,C)FCxF+S(𝒱S,S)𝕊FSxF\sum_{F\in\mathcal{E}_{s}}x_{F}=\sum_{C(\mathcal{V}_{C},\mathcal{E}_{C})\in\mathbb{C}}\sum_{F\in\mathcal{E}_{C}}x_{F}+\sum_{S(\mathcal{V}_{S},\mathcal{E}_{S})\in\mathbb{S}}\sum_{F\in\mathcal{E}_{S}}x_{F}.

D.2. For Labeled Graphs

We extend Lemma 8 to Lemma 10 for labeled graphs. First, we prove Proposition 10.

Proposition 10.

The LP in Definition 1 admits an integral optimal solution if \mathcal{H} is a bipartite graph, where relations are binary but not necessarily identical.

Proof.

Let $\mathcal{V}=\{v_{1},v_{2},...,v_{n}\}$, $\mathcal{E}=\{F_{1},F_{2},...,F_{m}\}$, and $A\in\mathbb{R}^{n\times m}$ be the incidence matrix of $\mathcal{H}$, where $A_{ij}=\operatorname*{\mathbb{I}}[v_{i}\in F_{j}]$. It is well known that if $\mathcal{H}$ is bipartite, $A$ is a so-called totally unimodular matrix (https://en.wikipedia.org/wiki/Unimodular_matrix). We can rewrite the objective of the LP, i.e., $\min_{x}\sum_{F\in\mathcal{E}}x_{F}\log_{\text{IN}}|R_{F}|$, as $\min_{x}{cx}$ ($c$ is the vector $(\log_{\text{IN}}{|R_{F}|}:F\in\mathcal{E})\in\mathbb{R}^{1\times m}$, $x\in\mathbb{R}^{m\times 1}$) with constraint $Ax\geq b$ ($b$ is the all-ones vector in $\mathbb{R}^{n\times 1}$). It is also known that an LP with objective $\min_{x}{cx}$ and constraint $Ax\geq b$ admits an integral optimal solution if $A$ is totally unimodular and $b$ is a vector of integers, which is exactly our case. \square
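As a sanity check of the total-unimodularity claim, the sketch below enumerates every square submatrix of the incidence matrix of the bipartite 4-cycle (a toy instance) and confirms that all determinants lie in $\{-1,0,1\}$:

```python
from itertools import combinations

# Incidence matrix of the 4-cycle C4 (bipartite), a toy instance:
# rows = vertices v1..v4, cols = edges {v1v2, v2v3, v3v4, v4v1}.
A = [[1, 0, 0, 1],
     [1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]

def det(M):
    # integer determinant by Laplace expansion along the first row
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

n = len(A)
dets = {det([[A[i][j] for j in cols] for i in rows])
        for k in range(1, n + 1)
        for rows in combinations(range(n), k)
        for cols in combinations(range(n), k)}
assert dets <= {-1, 0, 1}   # every minor is in {-1, 0, 1}: A is totally unimodular
```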

Therefore, we can easily remove the identity assumption on relations if \mathcal{H} is bipartite. The following proposition fills the remaining non-bipartite case. We omit the proof since it is the same as the proof by Assadi et al. (Assadi et al., 2019).

Proposition 11.

The LP in Definition 1 admits a half-integral optimal solution if \mathcal{H} is a non-bipartite graph, where relations are not necessarily identical.

Finally, we state Lemma 10. The extension of Lemma 9 to labeled graphs can be proven directly from Lemma 10, following the proof by Assadi et al. (Assadi et al., 2019).

Lemma 10.

(Extended Decomposition Lemma) Given a query hypergraph \mathcal{H} of binary relations, the LP in Definition 1 admits a half-integral optimal solution that satisfies the conditions in Lemma 8.

Proof.

The proof is similar to the proof by Assadi et al. (Assadi et al., 2019), but we must account for $\log{|R_{F}|}$, which may differ across $F\in\mathcal{E}$. Assume that an arbitrary half-integral optimal solution $x$ is given.

We first show that any edge $F$ in a cycle (odd or even) in $supp(x)$ has $x_{F}=\frac{1}{2}$. Otherwise, if $x_{F}=1$ (it cannot be 0 by the definition of $supp(x)$), we can always reduce it to $\frac{1}{2}$ without breaking the feasibility of $x$, since both endpoints of $F$ are covered by at least two edges of the cycle. This yields a smaller objective, contradicting the optimality of $x$.

We then remove any even cycles. If an even cycle $C$ with edges $\{F_{1},F_{2},...,F_{2k}\}$ exists in $supp(x)$, we argue that $N_{1}+N_{3}+...+N_{2k-1}=N_{2}+N_{4}+...+N_{2k}$, where $N_{i}=\log{|R_{F_{i}}|}$. If LHS $<$ RHS, we subtract $\frac{1}{2}$ from each $x_{F_{2i}}$ and add $\frac{1}{2}$ to each $x_{F_{2i-1}}$. Again, this does not break the feasibility of $x$ but decreases the objective by $\frac{1}{2}\cdot(\text{RHS}-\text{LHS})$, contradicting the optimality of $x$. Similarly, LHS $>$ RHS cannot hold. Thus, LHS $=$ RHS, indicating:

(16) |RF1||RF3||RF2k1|=|RF2||RF4||RF2k||R_{F_{1}}||R_{F_{3}}|...|R_{F_{2k-1}}|=|R_{F_{2}}||R_{F_{4}}|...|R_{F_{2k}}|

Then, we can break CC by safely subtracting 12\frac{1}{2} from each xF2ix_{F_{2i}} and adding 12\frac{1}{2} to each xF2i1x_{F_{2i-1}} (or vice versa), preserving the objective.
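The weight shift on an even cycle can be illustrated with toy values of $N_i$ satisfying $N_1+N_3=N_2+N_4$: the shifted cover stays feasible and the objective is unchanged.

```python
# Toy check of the even-cycle weight shift in the proof of Lemma 10:
# on C4 with x_F = 1/2 and N1 + N3 = N2 + N4 (values made up), shifting 1/2
# from the even-indexed edges to the odd-indexed ones keeps x feasible and
# leaves the objective unchanged.
N = [2.0, 3.0, 4.0, 3.0]            # N_i = log|R_{F_i}|; N1+N3 = N2+N4 = 6
x_old = [0.5, 0.5, 0.5, 0.5]
x_new = [1.0, 0.0, 1.0, 0.0]        # +1/2 on F1, F3; -1/2 on F2, F4

# vertex i of the cycle is covered by the consecutive edges F_i and F_{i-1}
for i in range(4):
    assert x_new[i] + x_new[(i - 1) % 4] >= 1   # cover constraint still holds

obj_old = sum(n * xf for n, xf in zip(N, x_old))
obj_new = sum(n * xf for n, xf in zip(N, x_new))
assert obj_old == obj_new           # objective preserved by the shift
```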

We next remove any odd cycles connected to other components. If an odd cycle $C$ with edges $\{F_{1},F_{2},...,F_{2k-1}\}$ exists in $supp(x)$ such that a vertex $v$ in $C$ has an incident edge $F_{2k}$ outside $C$, we again argue that $N_{1}+N_{3}+...+N_{2k-1}=N_{2}+N_{4}+...+N_{2k}$, using the same logic as for removing even cycles. Then, we can remove all such odd cycles connected to other components.

We now have a set of vertex-disjoint odd cycles and trees in $supp(x)$. We will break the trees into stars. For each tree, let $\langle F_{1},F_{2},...,F_{j}\rangle$ be any path connecting two leaf vertices. Note that $x_{F_{1}}=x_{F_{j}}=1$ to satisfy the LP constraint for these two leaf vertices, and $x_{F_{2}}=x_{F_{3}}=...=x_{F_{j-1}}=\frac{1}{2}$ since every midpoint has two incident edges in the path, by the same logic used for cycles.

If j=2kj=2k (even), we argue N2+N4++N2k2N_{2}+N_{4}+...+N_{2k-2} = N3+N5++N2k1N_{3}+N_{5}+...+N_{2k-1} using the same logic as above. Then, we can subtract 12\frac{1}{2} from each xF2i1x_{F_{2i-1}} and add 12\frac{1}{2} to each xF2i2x_{F_{2i-2}}. If j=2k+1j=2k+1 (odd), N2+N4++N2kN_{2}+N_{4}+...+N_{2k} = N3+N5++N2k1N_{3}+N_{5}+...+N_{2k-1}, and we subtract 12\frac{1}{2} from each xF2i1x_{F_{2i-1}} and add 12\frac{1}{2} to each xF2ix_{F_{2i}} except F2F_{2}. In both cases, the path is decomposed into segments of at most two edges each. We repeat this for every path connecting two leaf vertices, and the final forest consists of trees of a maximum diameter of two, i.e., stars. \square

Assadi et al. (Assadi et al., 2019) do not present a concrete join size estimator for labeled graphs but prove a lower bound Ω(AGM(U)OUT)\Omega\big{(}\frac{AGM(\mathcal{H}_{U})}{\text{OUT}}\big{)}. Here, U\mathcal{H}_{U} represents \mathcal{H} without labels, and AGM(U)AGM()AGM(\mathcal{H}_{U})\geq AGM(\mathcal{H}) since |RF||R||R_{F}|\leq|R| for each FF\in\mathcal{E}; RFR_{F}’s are subsets of the set of all data edges RR.

Appendix E Proofs of Lemmas and Propositions

In order to prove Lemmas 1-5 and Propositions 1-8, we first expand ZsZ_{\mathcal{H}_{s}} and 𝕍ar[Zs]\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}], then derive inequalities on 𝕍ar[Zs]\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]. We define another random variable Zs(sI)1P(sI)𝕀[sIFIπI(RFs)]ZssIZ_{\mathcal{H}_{s}}(s_{I})\coloneqq\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]}\cdot Z_{\mathcal{H}_{s\uplus s_{I}}} where sIs_{I} is a random variable in Zs(sI)Z_{\mathcal{H}_{s}}(s_{I}). Then, Zs=1ksISIZs(sI)Z_{\mathcal{H}_{s}}=\frac{1}{k}\sum_{s_{I}\in S_{I}}Z_{\mathcal{H}_{s}}(s_{I}), i.e., ZsZ_{\mathcal{H}_{s}} and Zs(sI)Z_{\mathcal{H}_{s}}(s_{I}) are mutually recursively defined. Then, from (Banerjee, 2012):

(17) 𝕍ar[Zs]=𝕍ar[Zs(sI)]k(1k1|ΩI|1𝕀[SIΩI w/o replacement]).\begin{split}\operatorname*{\mathbb{V}ar}[&Z_{\mathcal{H}_{s}}]=\frac{\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}(s_{I})]}{k}\Big{(}1-\frac{k-1}{|\Omega_{I}|-1}\operatorname*{\mathbb{I}}[S_{I}\sim\Omega_{I}\text{ w/o replacement}]\Big{)}.\end{split}

This holds since $S_{I}$ consists of identically distributed samples $\{s_{I}\}$. Note that the samples $s_{I}\in S_{I}$ may not be independent due to sampling without replacement, e.g., in Alley. In that case, $\operatorname*{\mathbb{I}}[\cdot]$ above is 1, and the variance is reduced by a fraction of $\frac{k-1}{|\Omega_{I}|-1}$ (Banerjee, 2012).
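The without-replacement correction in Eq. (17) is the classical finite-population factor; the sketch below verifies it exactly on a toy population by enumerating all size-$k$ samples:

```python
from itertools import permutations
from statistics import pvariance, mean

pop = [1, 2, 5, 7, 10]              # toy values of Z(s_I) over Ω_I
N, k = len(pop), 2

# exact variance of the k-sample mean without replacement, by enumeration
means = [mean(p) for p in permutations(pop, k)]
var_enum = pvariance(means)

# Eq. (17): Var[Z] = Var[Z(s_I)]/k * (1 - (k-1)/(|Ω_I|-1))
var_formula = pvariance(pop) / k * (1 - (k - 1) / (N - 1))
assert abs(var_enum - var_formula) < 1e-9
```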

From the law of total variance:

(18) 𝕍ar[Zs(sI)]=𝔼sI[𝕍ar[Zs|sI]]+𝕍arsI[𝔼[Zs|sI]].\begin{split}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}(s_{I})]=&\operatorname*{\mathbb{E}}_{s_{I}}[\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}|s_{I}]]+\operatorname*{\mathbb{V}ar}_{s_{I}}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]].\\ \end{split}

Here, Zs|sI=1P(sI)𝕀[sI]ZssIZ_{\mathcal{H}_{s}}|s_{I}=\frac{1}{P(s_{I})}\operatorname*{\mathbb{I}}[s_{I}]Z_{\mathcal{H}_{s\uplus s_{I}}} as Zs(sI)Z_{\mathcal{H}_{s}}(s_{I}), but sIs_{I} is a fixed value in Zs|sIZ_{\mathcal{H}_{s}}|s_{I} instead of a random variable. In the second term above, we show that 𝔼[Zs|sI]=1P(sI)𝕀[sI]OUT(ssI)\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]=\frac{1}{P(s_{I})}\cdot\operatorname*{\mathbb{I}}[s_{I}]\cdot\text{OUT}({\mathcal{H}_{s\uplus s_{I}}}) in the proof of Lemma 1. Then,

(19) 𝕍arsI[𝔼[Zs|sI]]𝔼sI[𝔼[Zs|sI]2]=sIP(sI)1P(sI)2𝕀[sIFIπI(RFs)]OUT(ssI)2=sI1P(sI)𝕀[sI]OUT(ssI)2.\begin{split}\operatorname*{\mathbb{V}ar}_{s_{I}}&[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]]\leq\operatorname*{\mathbb{E}}_{s_{I}}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]^{2}]\\ =&\sum_{s_{I}}P(s_{I})\cdot\frac{1}{P(s_{I})^{2}}\cdot\operatorname*{\mathbb{I}}[s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]\cdot\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\\ =&\sum_{s_{I}}\frac{1}{P(s_{I})}\operatorname*{\mathbb{I}}[s_{I}]\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}.\end{split}

The $\frac{1}{P(s_{I})}$ factor ($>1$) was missing in the original paper of Assadi et al. (Assadi et al., 2019), incorrectly reducing the upper bound of $\operatorname*{\mathbb{V}ar}_{s_{I}}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]]$.

Proof of Lemma 1.

We use an induction hypothesis that GenericCardEst (Algorithm 1) returns an unbiased estimate of OUT(s)\text{OUT}(\mathcal{H}_{s}). For the base case where I=𝒪I=\mathcal{O}, Line 1 returns Zs=sISIZs(sI)kZ_{\mathcal{H}_{s}}=\frac{\sum_{s_{I}\in S_{I}}Z_{\mathcal{H}_{s}}(s_{I})}{k}, where Zs(sI)=1P(sI)𝕀[sI]ZssIZ_{\mathcal{H}_{s}}(s_{I})=\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot Z_{\mathcal{H}_{s\uplus s_{I}}}. Here, ZssI=1Z_{\mathcal{H}_{s\uplus s_{I}}}=1 since the recursive call to GenericCardEst reaches Line 1. Since sIs_{I}’s follow an identical distribution, 𝔼[Zs]=𝔼[Zs(sI)]\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}]=\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}(s_{I})] and

(20) 𝔼[Zs(sI)]=sIΩIP(sI)1P(sI)𝕀[sI]1=sIΩI𝕀[sI]=sIFIπI(RFs)1(ΩIFIπI(RFs))=|FIπI(RFs)|=OUT(s).(I=𝒪)\begin{split}\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}(s_{I})]&=\sum_{s_{I}\in\Omega_{I}}P(s_{I})\cdot\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot 1\\ &=\sum_{s_{I}\in\Omega_{I}}{\operatorname*{\mathbb{I}}[s_{I}]}\\ &=\sum_{s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)}1\,\,\,(\because\Omega_{I}\supseteq\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s))\\ &=\absolutevalue{\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)}\\ &=\text{OUT}(\mathcal{H}_{s}).\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,(\because I=\mathcal{O})\\ \end{split}

Note that the assumption $\Omega_{I}\supseteq\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$ in Lemma 1 is used. If $\Omega_{I}\not\supseteq\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$, then $\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}(s_{I})]=\sum_{s_{I}\in\Omega_{I}}\operatorname*{\mathbb{I}}[s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]<|\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)|$, leading to a negative bias.
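The base case, including the negative bias when $\Omega_I$ fails to cover the join, can be checked exactly with toy sets (exact arithmetic via fractions):

```python
from fractions import Fraction

# Base case of Lemma 1 (I = O): E[Z] = Σ_{s∈Ω} P(s)·(1/P(s))·1[s ∈ J] = |J|,
# provided the sampling domain Ω covers the true join J (toy sets below).
J = {1, 2, 3}                          # stands in for ⋈_F π_I(R_F ⋉ s)
omega_good = {0, 1, 2, 3, 4}           # Ω_I ⊇ J
omega_bad = {1, 2}                     # Ω_I misses 3 ∈ J

def expected_Z(omega):
    P = Fraction(1, len(omega))        # uniform proposal, exact arithmetic
    return sum(P * (1 / P) * (1 if s in J else 0) for s in omega)

assert expected_Z(omega_good) == len(J)      # unbiased
assert expected_Z(omega_bad) < len(J)        # negative bias when Ω ⊉ J
```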

For the inductive case where I𝒪I\subsetneq\mathcal{O}, we use the law of total expectation and our induction hypothesis at Line 1, achieving

(21) 𝔼[Zs(sI)]=𝔼[𝔼[Zs|sI]]=𝔼[𝔼[1P(sI)𝕀[sI]ZssI|sI]]=𝔼[1P(sI)𝕀[sI]𝔼[ZssI|sI]]=𝔼[1P(sI)𝕀[sI]OUT(ssI)](induction hypothesis)=sIΩIP(sI)1P(sI)𝕀[sI]OUT(ssI)=sIΩI𝕀[sI]OUT(ssI)=sIFIπI(RFs)OUT(ssI)(ΩIFIπI(RFs))=OUT(s).\begin{split}\operatorname*{\mathbb{E}}[&Z_{\mathcal{H}_{s}}(s_{I})]=\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]]\\ &=\operatorname*{\mathbb{E}}\Big{[}\operatorname*{\mathbb{E}}\Big{[}\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot Z_{\mathcal{H}_{s\uplus s_{I}}}\Big{|}s_{I}\Big{]}\Big{]}\\ &=\operatorname*{\mathbb{E}}\Big{[}\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s\uplus s_{I}}}|s_{I}]\Big{]}\\ &=\operatorname*{\mathbb{E}}\Big{[}\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\Big{]}\,\,\,(\because\text{induction hypothesis})\\ &=\sum_{s_{I}\in\Omega_{I}}P(s_{I})\cdot\frac{1}{P(s_{I})}\cdot{\operatorname*{\mathbb{I}}[s_{I}]}\cdot\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &=\sum_{s_{I}\in\Omega_{I}}{\operatorname*{\mathbb{I}}[s_{I}]}\cdot\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &=\sum_{s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)}\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\,\,\,(\because\Omega_{I}\supseteq\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s))\\ &=\text{OUT}(\mathcal{H}_{s}).\end{split}

The assumption ΩIFIπI(RFs)\Omega_{I}\supseteq\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s) is used again. The last equality holds from Lines 4-4 of GenericJoin (Algorithm 4); sIs_{I} iterates over FIπI(RFs)\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s), not sampled from ΩI\Omega_{I}. Therefore, 𝔼[Zs]=𝔼[Zs(sI)]=OUT(s)\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}]=\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}(s_{I})]=\text{OUT}({\mathcal{H}_{s}}).

\square

Proof of Lemma 2.

We first rewrite the AGM bounds on the LHS using their definitions. Given an optimal fractional edge cover $x$ of $\mathcal{H}$, the LHS is

(22) AGM(ssI)AGM(s)=F𝒪I|RFssI|xFF𝒪|RFs|xF.\begin{split}\frac{AGM(\mathcal{H}_{s\uplus s_{I}})}{AGM(\mathcal{H}_{s})}=\frac{\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{\mathcal{O}}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}.\end{split}

We partition 1) 𝒪I\mathcal{E}_{\mathcal{O}\setminus I} into (I𝒪I)(𝒪II)(\mathcal{E}_{I}\cap\mathcal{E}_{\mathcal{O}\setminus I})\cup(\mathcal{E}_{\mathcal{O}\setminus I}\setminus\mathcal{E}_{I}) and 2) 𝒪\mathcal{E}_{\mathcal{O}} into I(𝒪II)\mathcal{E}_{I}\cup(\mathcal{E}_{\mathcal{O}\setminus I}\setminus\mathcal{E}_{I}) since any F𝒪F\in\mathcal{E}_{\mathcal{O}} must contain an attribute in II or 𝒪I\mathcal{O}\setminus I.

Since $s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)$, we have $|R_{F}\ltimes s\uplus s_{I}|\geq 1$ for every $F\in\mathcal{E}_{I}$. Chen & Yi (Chen and Yi, 2020) likewise compute the probability of $s_{I}$ only when $s_{I}$ contributes to some join answer.

(23) F𝒪I|RFssI|xFF𝒪|RFs|xF=FI𝒪I|RFssI|xFFI|RFs|xFF𝒪II|RFssI|xFF𝒪II|RFs|xF=FI𝒪I|RFssI|xFFI|RFs|xF(RFssI=RFs:F𝒪II)FI|RFssI|xFFI|RFs|xF(|RFssI|1:FI𝒪I)=FIrdegF,s(sI)xFFIrdegF,s(sI)xF(rdegF,s(sI)rdegF,s(sI))=rdegF,s(sI)FIxFrdegF,s(sI)(FIxF1,rdegF,s(sI)1).\begin{split}&\frac{\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{\mathcal{O}}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}\\ &=\frac{\prod_{F\in\mathcal{E}_{I}\cap\mathcal{E}_{\mathcal{O}\setminus I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}\cdot\frac{\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}\setminus\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{\mathcal{O}\setminus I}\setminus\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}\\ &=\frac{\prod_{F\in\mathcal{E}_{I}\cap\mathcal{E}_{\mathcal{O}\setminus I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}\\ &\;\;\;\;\;\;(\because R_{F}\ltimes s\uplus s_{I}=R_{F}\ltimes s:\forall F\in\mathcal{E}_{\mathcal{O}\setminus I}\setminus\mathcal{E}_{I})\\ &\leq\frac{\prod_{F\in\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s\uplus s_{I}}^{x_{F}}}{\prod_{F\in\mathcal{E}_{I}}\absolutevalue{R_{F}\ltimes s}^{x_{F}}}\;\;\;(\because\absolutevalue{R_{F}\ltimes s\uplus s_{I}}\geq 1:\forall F\in\mathcal{E}_{I}\setminus\mathcal{E}_{\mathcal{O}\setminus I})\\ &=\prod_{F\in\mathcal{E}_{I}}rdeg_{F,s}(s_{I})^{x_{F}}\\ &\leq\prod_{F\in\mathcal{E}_{I}}rdeg_{F^{*},s}(s_{I})^{x_{F}}\;\;(\because rdeg_{F,s}(s_{I})\leq rdeg_{F^{*},s}(s_{I}))\\ &=rdeg_{F^{*},s}(s_{I})^{\sum_{F\in\mathcal{E}_{I}}x_{F}}\\ &\leq rdeg_{F^{*},s}(s_{I})\;\;\Big{(}\because\sum_{F\in\mathcal{E}_{I}}x_{F}\geq 1,rdeg_{F^{*},s}(s_{I})\leq 1\Big{)}.\end{split}

\square

Proof of Proposition 1.

We use an induction hypothesis 𝕍ar[Z]2dAGM()OUT()\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq 2^{d}\cdot AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H}) on dd, the number of odd cycles and stars in \mathcal{H}.

For the base case when $d=1$, $\mathcal{H}$ consists of either a single odd cycle or a single star. As explained in appendix D, the maximum recursion depth is $2\cdot 1+1=3$. We use $I_{1}$ (and $k_{1}$) and $I_{2}$ (and $k_{2}$) to denote the $I$ (and $k$) values of the first two recursions; $\mathcal{O}=\emptyset$ in the last recursion, so Lines 1-1 of Algorithm 1 are executed.

If \mathcal{H} is an odd cycle CC of 2n+12n+1 vertices {u1,v1,u2,v2,,un,vn,w}\{u_{1},v_{1},u_{2},v_{2},...,u_{n},v_{n},w\}, AGM()=|R|n+12AGM(\mathcal{H})=|R|^{n+\frac{1}{2}} from Lemma 8, I1={u1,v1,,un,vn},I2={w}I_{1}=\{u_{1},v_{1},...,u_{n},v_{n}\},I_{2}=\{w\}, ΩI1=i[1,n]RFi\Omega_{I_{1}}=\prod_{i\in[1,n]}R_{F_{i}} and P(sI1)=1|R|n:sIΩI1P(s_{I_{1}})=\frac{1}{|R|^{n}}:\forall s_{I}\in\Omega_{I_{1}}, where Fi={ui,vi}F_{i}=\{u_{i},v_{i}\}. Recall that RFi=RR_{F_{i}}=R for i[1,n]i\in[1,n].

We explain SSTE first. Since k1=1k_{1}=1 and s=s=\emptyset, 𝕍ar[Z]=𝔼[𝕍ar[Z|sI1]]+𝕍ar[𝔼[Z|sI1]]\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]=\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}|s_{I_{1}}]]+\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}}|s_{I_{1}}]] from (18). For the first term,

(24) \begin{split}\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[&Z_{\mathcal{H}}|s_{I_{1}}]]\\ =&\,\frac{1}{|R|^{n}}\sum_{s_{I_{1}}}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}|s_{I_{1}}]\,\,\,\Big{(}\because P(s_{I_{1}})=\frac{1}{|R|^{n}}\Big{)}\\ =&\,\frac{1}{|R|^{n}}\sum_{s_{I_{1}}}\operatorname*{\mathbb{V}ar}\Big{[}\frac{1}{P(s_{I_{1}})}\operatorname*{\mathbb{I}}[s_{I_{1}}]Z_{\mathcal{H}_{s_{I_{1}}}}\Big{]}\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s_{I_{1}}}}]\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\frac{\operatorname*{\mathbb{I}}[s_{I_{1}}]}{k_{I_{2}}}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s_{I_{1}}}}(s_{I_{2}})]\,\,\,(\because(\ref{eq:totalvariance}))\\ \leq&\,|R|^{n}\sum_{s_{I_{1}}}\frac{\operatorname*{\mathbb{I}}[s_{I_{1}}]}{k_{I_{2}}}\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}_{s_{I_{1}}}}(s_{I_{2}})]\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\frac{\operatorname*{\mathbb{I}}[s_{I_{1}}]}{k_{I_{2}}}\sum_{s_{I_{2}}}\frac{1}{|\Omega_{I_{2}}|}Z^{2}_{\mathcal{H}_{s_{I_{1}}}}(s_{I_{2}})\,\,\,\Big{(}\because P(s_{I_{2}})=\frac{1}{|\Omega_{I_{2}}|}\Big{)}\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\frac{\operatorname*{\mathbb{I}}[s_{I_{1}}]}{k_{I_{2}}}\sum_{s_{I_{2}}}|\Omega_{I_{2}}|\operatorname*{\mathbb{I}}[s_{I_{2}}]\,\,\,\Big{(}\because Z_{\mathcal{H}_{s_{I_{1}}}}(s_{I_{2}})=|\Omega_{I_{2}}|\operatorname*{\mathbb{I}}[s_{I_{2}}]\cdot 1\Big{)}\\ \leq&\,|R|^{n+\frac{1}{2}}\sum_{s_{I_{1}}}\sum_{s_{I_{2}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{I}}[s_{I_{2}}]\,\,\,\Bigg{(}\because k_{I_{2}}=\lceil\frac{|\Omega_{I_{2}}|}{|R|^{\frac{1}{2}}}\rceil\geq\frac{|\Omega_{I_{2}}|}{|R|^{\frac{1}{2}}}\Bigg{)}\\ =&\,|R|^{n+\frac{1}{2}}\text{OUT}(\mathcal{H})=AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H})\,\,\,(\because\text{Lemma \ref{lemma:decomposition}}).\end{split}

For the second term of the total variance, the new proof of SSTE reduces the \text{OUT}(\mathcal{H}_{s_{I_{1}}}) factor in (19) to \sqrt{|R|}. We have \text{OUT}(\mathcal{H}_{s_{I_{1}}})\leq|\Omega_{I_{2}}|=|\pi_{w}(R_{F_{0}}\ltimes a_{1})|, where a_{1} denotes the sampled data vertex for u_{1}. If |\pi_{w}(R_{F_{0}}\ltimes a_{1})|\leq\sqrt{|R|}, then \text{OUT}(\mathcal{H}_{s_{I_{1}}})\leq\sqrt{|R|} immediately. Otherwise, the degree of a_{1} exceeds \sqrt{|R|}; recall that a_{1} is chosen to have the smallest degree among the sampled cycle vertices. Then, at most \sqrt{|R|} vertices in the data graph can have degrees larger than that of a_{1}: by contradiction, if more than \sqrt{|R|} vertices had degrees larger than \sqrt{|R|}, the total number of edges would exceed |R|. Hence, there are at most \sqrt{|R|} results for the cycle given s_{I_{1}}, i.e., \text{OUT}(\mathcal{H}_{s_{I_{1}}})\leq\sqrt{|R|}:

(25) \begin{split}\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[&Z_{\mathcal{H}}|s_{I_{1}}]]\leq\sum_{s_{I_{1}}}\frac{1}{P(s_{I_{1}})}\operatorname*{\mathbb{I}}[s_{I_{1}}]\text{OUT}(\mathcal{H}_{s_{I_{1}}})^{2}\,\,\,\,\,(\because(\ref{eq:varexp}))\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\text{OUT}(\mathcal{H}_{s_{I_{1}}})^{2}\leq|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\text{OUT}(\mathcal{H}_{s_{I_{1}}})\sqrt{|R|}\\ =&\,|R|^{n+\frac{1}{2}}\text{OUT}(\mathcal{H})=AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H}).\end{split}

Adding (24) and (25) completes the proof for SSTE given an odd cycle.
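Explicitly, summing the two bounds matches the induction hypothesis for d=1:

```latex
\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]
= \operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}|s_{I_1}]]
+ \operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}}|s_{I_1}]]
\leq 2\cdot AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H})
= 2^{d}\cdot AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H}) \quad (d=1).
```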

For SUST, \operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}}|s_{I_{1}}]] is bounded by AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H}) as in (25), and the following bound on \operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}|s_{I_{1}}]] completes the proof for SUST given an odd cycle.

(26) 𝔼[𝕍ar[Z|sI1]]=|R|nsI1𝕀[sI1]𝕍ar[ZsI1]((24))|R|nsI1𝕀[sI1]𝔼[ZsI12]=|R|nsI1𝕀[sI1]sI21P(sI2)𝕀[sI2]=|R|n+12sI1sI2𝕀[sI1]𝕀[sI2](P(sI2)=1|R|)=AGM()OUT().\begin{split}\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[&Z_{\mathcal{H}}|s_{I_{1}}]]\\ =&\,|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s_{I_{1}}}}]\,\,\,(\because(\ref{eq:sste:expvar:cycle}))\\ \leq&\,|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}_{s_{I_{1}}}}]=|R|^{n}\sum_{s_{I_{1}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\sum_{s_{I_{2}}}\frac{1}{P(s_{I_{2}})}\operatorname*{\mathbb{I}}[s_{I_{2}}]\\ =&\,|R|^{n+\frac{1}{2}}\sum_{s_{I_{1}}}\sum_{s_{I_{2}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{I}}[s_{I_{2}}]\Big{(}\because P(s_{I_{2}})=\frac{1}{\sqrt{|R|}}\Big{)}\\ =&\,AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H}).\end{split}

If \mathcal{H} is a star SS of nn edges {F1,F2,,Fn}\{F_{1},F_{2},...,F_{n}\}, AGM()=|R|nAGM(\mathcal{H})=|R|^{n} from Lemma 8. For SSTE, I1I_{1} contains the center attribute AA of SS only, ΩI1=πAR\Omega_{I_{1}}=\pi_{A}R, I2I_{2} is the set of attributes of SS except I1I_{1}, and ΩI2=iπI2(RFisI1)\Omega_{I_{2}}=\prod_{i}\pi_{I_{2}}(R_{F_{i}}\ltimes s_{I_{1}}) for a sampled center vertex sI1s_{I_{1}}.
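Analogously to the cycle case, the AGM bound for the star follows from the integral edge cover x_{F_{i}}=1: each leaf is covered by its own edge, and the center by all n edges.

```latex
% Integral edge cover check for the star S with n edges:
AGM(\mathcal{H}) \;=\; \prod_{i\in[1,n]} |R_{F_{i}}|^{1} \;=\; |R|^{n}.
```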

Again, 𝕍ar[Z]=𝔼[𝕍ar[Z|sI1]]+𝕍ar[𝔼[Z|sI1]]\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]=\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}|s_{I_{1}}]]+\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}}|s_{I_{1}}]] since kI1=1k_{I_{1}}=1. For the first term,

(27) 𝔼[𝕍ar[Z|sI1]]=sI1|RsI1||R|𝕍ar[Z|sI1](P(sI1)=|RsI1||R|)sI1|RsI1||R|𝔼[Z2|sI1]=sI1|RsI1||R|sI21|RsI1|nZ2|sI1(P(sI2)=1|RsI1|n)=sI1|RsI1||R|sI21|RsI1|n(1P(sI1)𝕀[sI1]ZsI1)2=sI1|RsI1||R|sI21|RsI1|n(1P(sI1)𝕀[sI1]1P(sI2)𝕀[sI2])2=sI1|RsI1||R|sI21|RsI1|n(|R||RsI1|𝕀[sI1]|RsI1|n𝕀[sI2])2=|R|sI1|RsI1|n1sI2𝕀[sI1]𝕀[sI2]|R||R|n1sI1sI2𝕀[sI1]𝕀[sI2]=|R|nOUT()=AGM()OUT()(Lemma 8).\begin{split}\operatorname*{\mathbb{E}}[\operatorname*{\mathbb{V}ar}[&Z_{\mathcal{H}}|s_{I_{1}}]]\\ =&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}|s_{I_{1}}]\,\,\,\Big{(}\because P(s_{I_{1}})=\frac{|R\ltimes s_{I_{1}}|}{|R|}\Big{)}\\ \leq&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}}|s_{I_{1}}]\\ =&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\sum_{s_{I_{2}}}\frac{1}{|R\ltimes s_{I_{1}}|^{n}}Z^{2}_{\mathcal{H}}|s_{I_{1}}\,\,\,\Big{(}\because P(s_{I_{2}})=\frac{1}{|R\ltimes s_{I_{1}}|^{n}}\Big{)}\\ =&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\sum_{s_{I_{2}}}\frac{1}{|R\ltimes s_{I_{1}}|^{n}}\Big{(}\frac{1}{P(s_{I_{1}})}\operatorname*{\mathbb{I}}[s_{I_{1}}]Z_{\mathcal{H}_{s_{I_{1}}}}\Big{)}^{2}\\ =&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\sum_{s_{I_{2}}}\frac{1}{|R\ltimes s_{I_{1}}|^{n}}\Big{(}\frac{1}{P(s_{I_{1}})}\operatorname*{\mathbb{I}}[s_{I_{1}}]\frac{1}{P(s_{I_{2}})}\operatorname*{\mathbb{I}}[s_{I_{2}}]\Big{)}^{2}\\ =&\sum_{s_{I_{1}}}\frac{|R\ltimes s_{I_{1}}|}{|R|}\sum_{s_{I_{2}}}\frac{1}{|R\ltimes s_{I_{1}}|^{n}}\Big{(}\frac{|R|}{|R\ltimes s_{I_{1}}|}\operatorname*{\mathbb{I}}[s_{I_{1}}]|R\ltimes s_{I_{1}}|^{n}\operatorname*{\mathbb{I}}[s_{I_{2}}]\Big{)}^{2}\\ =&\,|R|\sum_{s_{I_{1}}}|R\ltimes s_{I_{1}}|^{n-1}\sum_{s_{I_{2}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{I}}[s_{I_{2}}]\\ \leq&\,|R||R|^{n-1}\sum_{s_{I_{1}}}\sum_{s_{I_{2}}}\operatorname*{\mathbb{I}}[s_{I_{1}}]\operatorname*{\mathbb{I}}[s_{I_{2}}]\\ 
=&\,|R|^{n}\text{OUT}(\mathcal{H})=AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H})\,\,\,(\because\text{Lemma \ref{lemma:decomposition}}).\end{split}

For the second term of the total variance, 𝕍ar[𝔼[Z|sI1]]OUT()2AGM()OUT()\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}}|s_{I_{1}}]]\leq\text{OUT}(\mathcal{H})^{2}\leq AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H}) from (19). Therefore, we have 𝕍ar[Z]2AGM()OUT()\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq 2\cdot AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H}) for SSTE given a star.

For SUST, I1I_{1} is the set of attributes of the star, ΩI1=Rn\Omega_{I_{1}}=R^{n}, and P(sI1)=1|R|nP(s_{I_{1}})=\frac{1}{|R|^{n}}. Thus, 𝕍ar[Z]𝔼[Z2]=sI11P(sI1)𝕀[sI1]1=|R|nOUT()=AGM()OUT()\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}}]\leq\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}}]=\sum_{s_{I_{1}}}\frac{1}{P(s_{I_{1}})}\cdot\operatorname*{\mathbb{I}}[s_{I_{1}}]\cdot 1=|R|^{n}\cdot\text{OUT}(\mathcal{H})=AGM(\mathcal{H})\cdot\text{OUT}(\mathcal{H}).

For the inductive case d>1, \mathcal{H} consists of multiple odd cycles and stars. The proofs for both SSTE and SUST are straightforward extensions of (24)-(27), conditioned on multiple components rather than on part of a single odd cycle or star. Please refer to (Assadi et al., 2019) for more details.

\square

Proof of Proposition 2.

We use an induction hypothesis 𝕍ar[Zs]t|𝒪|tt1OUT(s)2\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]\leq\frac{t^{|\mathcal{O}|}-t}{t-1}\cdot{\text{OUT}(\mathcal{H}_{s})^{2}}.

For the base case I=𝒪I=\mathcal{O} and |I|=1|I|=1, sIΩI=FIπI(RFs)s_{I}\in\Omega_{I}=\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s), and Zs(sI)=1P(sI)=|ΩI|Z_{\mathcal{H}_{s}}(s_{I})=\frac{1}{P(s_{I})}=|\Omega_{I}| regardless of sIs_{I}. Therefore, 𝕍ar[Zs]=0t|𝒪|tt1OUT(s)2\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]=0\leq\frac{t^{|\mathcal{O}|}-t}{t-1}\cdot\text{OUT}(\mathcal{H}_{s})^{2}.

For the inductive case I𝒪I\subsetneq\mathcal{O},

(28) 𝔼[𝕍ar[Zs|sI]]=sI1|ΩI|𝕍ar[Zs|sI]=sI1|ΩI|𝕍ar[|ΩI|𝕀[sI]ZssI]=sI|ΩI|𝕍ar[ZssI](𝕀[sI]=1)sI|ΩI|t|𝒪I|tt1OUT(ssI)2(induction hypothesis).\begin{split}\operatorname*{\mathbb{E}}[&\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}|s_{I}]]\\ &=\sum_{s_{I}}\frac{1}{|\Omega_{I}|}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}|s_{I}]\\ &=\sum_{s_{I}}\frac{1}{|\Omega_{I}|}\operatorname*{\mathbb{V}ar}[|\Omega_{I}|\cdot\operatorname*{\mathbb{I}}[s_{I}]\cdot Z_{\mathcal{H}_{s\uplus s_{I}}}]\\ &=\sum_{s_{I}}|\Omega_{I}|\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s\uplus s_{I}}}]\,\,\,(\because\operatorname*{\mathbb{I}}[s_{I}]=1)\\ &\leq\sum_{s_{I}}|\Omega_{I}|\frac{t^{|\mathcal{O}\setminus I|}-t}{t-1}\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})^{2}\,\,\,(\because\text{induction hypothesis}).\end{split}

From (19), 𝕍ar[𝔼[Zs|sI]]sI|ΩI|OUT(ssI)2\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[Z_{\mathcal{H}_{s}}|s_{I}]]\leq\sum_{s_{I}}|\Omega_{I}|\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}. By plugging this and (28) into (18), we have

(29) 𝕍ar[Zs(sI)]sI(|ΩI|t|𝒪I|tt1+|ΩI|)OUT(ssI)2=|ΩI|t|𝒪I|1t1sIOUT(ssI)2.\begin{split}\operatorname*{\mathbb{V}ar}&[Z_{\mathcal{H}_{s}}(s_{I})]\\ &\leq\sum_{s_{I}}\Big{(}|\Omega_{I}|\frac{t^{|\mathcal{O}\setminus I|}-t}{t-1}+|\Omega_{I}|\Big{)}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\\ &=|\Omega_{I}|\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}.\end{split}
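The coefficient in (29) combines the two terms by one step of algebra:

```latex
|\Omega_I|\,\frac{t^{|\mathcal{O}\setminus I|}-t}{t-1} + |\Omega_I|
= |\Omega_I|\,\frac{t^{|\mathcal{O}\setminus I|}-t+t-1}{t-1}
= |\Omega_I|\,\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}.
```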

Therefore, from (17),

(30) 𝕍ar[Zs]=1k(1k1|ΩI|1)𝕍ar[Zs(sI)]1k(1k1|ΩI|1)|ΩI|t|𝒪I|1t1sIOUT(ssI)2=|ΩI|kk(|ΩI|1)|ΩI|t|𝒪I|1t1sIOUT(ssI)21bb|ΩI||ΩI|1t|𝒪I|1t1sIOUT(ssI)2(k=b|ΩI|b|ΩI|)2(1b)bt|𝒪I|1t1sIOUT(ssI)2(|ΩI||ΩI|12)=tt|𝒪I|1t1sIOUT(ssI)2(t=2(1b)b)tt|𝒪I|1t1(sIOUT(ssI))2=t|𝒪|tt1OUT(s)2.\begin{split}\operatorname*{\mathbb{V}ar}&[Z_{\mathcal{H}_{s}}]=\frac{1}{k}\Big{(}1-\frac{k-1}{|\Omega_{I}|-1}\Big{)}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}(s_{I})]\\ &\leq\frac{1}{k}\Big{(}1-\frac{k-1}{|\Omega_{I}|-1}\Big{)}|\Omega_{I}|\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\\ &=\frac{|\Omega_{I}|-k}{k(|\Omega_{I}|-1)}|\Omega_{I}|\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\\ &\leq\frac{1-b}{b}\frac{|\Omega_{I}|}{|\Omega_{I}|-1}\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\\ &\;\;\;\;\;\;(\because k=\lceil b\cdot|\Omega_{I}|\rceil\geq b|\Omega_{I}|)\\ &\leq\frac{2(1-b)}{b}\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\,\,\,\Big{(}\because\frac{|\Omega_{I}|}{|\Omega_{I}|-1}\leq 2\Big{)}\\ &=t\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})^{2}\,\,\,\Big{(}\because t=\frac{2(1-b)}{b}\Big{)}\\ &\leq t\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}\Big{(}\sum_{s_{I}}\text{OUT}(\mathcal{H}_{s\uplus s_{I}})\Big{)}^{2}\\ &=\frac{t^{|\mathcal{O}|}-t}{t-1}\text{OUT}(\mathcal{H}_{s})^{2}.\end{split}
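The final equality in (30) is the geometric identity below; it matches the induction hypothesis because each recursion fixes a single attribute (|I|=1, as in the base case), so |\mathcal{O}\setminus I|+1=|\mathcal{O}|:

```latex
t\,\frac{t^{|\mathcal{O}\setminus I|}-1}{t-1}
= \frac{t^{|\mathcal{O}\setminus I|+1}-t}{t-1}
= \frac{t^{|\mathcal{O}|}-t}{t-1}.
```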

\square

Proof of Proposition 3.

We use an induction hypothesis 𝕍ar[Zs]|𝒪|AGM(s)OUT(s)\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]\leq|\mathcal{O}|\cdot AGM(\mathcal{H}_{s})\cdot\text{OUT}(\mathcal{H}_{s}).

For the base case I=\mathcal{O} and |I|=1, AGM(\mathcal{H}_{s\uplus s_{I}})=1 since \mathcal{O}\setminus I=\emptyset. Hence,

(31) 𝕍ar[Zs]𝔼[Zs2]=sIAGM(ssI)AGM(s)(AGM(s)AGM(ssI))2𝕀[sI]=sIAGM(s)AGM(ssI)𝕀[sI]=AGM(s)sI𝕀[sI](AGM(ssI)=1)=AGM(s)OUT(s).\begin{split}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]&\leq\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}_{s}}]\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s\uplus s_{I}}})}{AGM({\mathcal{H}_{s}})}\Big{(}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\Big{)}^{2}\operatorname*{\mathbb{I}}[s_{I}]\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]\\ &=AGM({\mathcal{H}_{s}})\sum_{s_{I}}\operatorname*{\mathbb{I}}[s_{I}]\;\;\;(\because AGM(\mathcal{H}_{s\uplus s_{I}})=1)\\ &=AGM({\mathcal{H}_{s}})\cdot\text{OUT}({\mathcal{H}_{s}}).\end{split}

For the inductive case I𝒪I\subsetneq\mathcal{O}, we again use the induction hypothesis:

(32) 𝔼[𝕍ar[Zs|sI]]=sIAGM(ssI)AGM(s)𝕍ar[Zs|sI]=sIAGM(ssI)AGM(s)𝕍ar[1P(sI)𝕀[sI]ZssI]=sIAGM(s)AGM(ssI)𝕀[sI]𝕍ar[ZssI](P(sI)=AGM(ssI)AGM(s))sIAGM(s)AGM(ssI)𝕀[sI]|𝒪I|AGM(ssI)OUT(ssI)(induction hypothesis)=|𝒪I|AGM(s)sI𝕀[sI]OUT(ssI)=|𝒪I|AGM(s)OUT(s).\begin{split}\operatorname*{\mathbb{E}}&[\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}|s_{I}]]\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s\uplus s_{I}}})}{AGM({\mathcal{H}_{s}})}\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}|s_{I}]\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s\uplus s_{I}}})}{AGM({\mathcal{H}_{s}})}\operatorname*{\mathbb{V}ar}\Big{[}\frac{1}{P(s_{I})}\cdot\operatorname*{\mathbb{I}}[s_{I}]\cdot Z_{\mathcal{H}_{s\uplus s_{I}}}\Big{]}\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s\uplus s_{I}}}]\;\;\Big{(}\because P(s_{I})=\frac{AGM({\mathcal{H}_{s\uplus s_{I}}})}{AGM({\mathcal{H}_{s}})}\Big{)}\\ &\leq\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]|\mathcal{O}\setminus I|AGM({\mathcal{H}_{s\uplus s_{I}}})\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &\;\;\;\;\;(\because\text{induction hypothesis})\\ &=|\mathcal{O}\setminus I|AGM({\mathcal{H}_{s}})\sum_{s_{I}}\operatorname*{\mathbb{I}}[s_{I}]\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &=|\mathcal{O}\setminus I|\cdot AGM({\mathcal{H}_{s}})\cdot\text{OUT}({\mathcal{H}_{s}}).\end{split}

From (19), we have

(33) 𝕍ar[𝔼[Zs|sI]]sIAGM(s)AGM(ssI)𝕀[sI]OUT(ssI)2sIAGM(s)AGM(ssI)𝕀[sI]AGM(ssI)OUT(ssI)=AGM(s)sI𝕀[sI]OUT(ssI)=AGM(s)OUT(s).\begin{split}\operatorname*{\mathbb{V}ar}[\operatorname*{\mathbb{E}}[&Z_{\mathcal{H}_{s}}|s_{I}]]\\ &\leq\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})^{2}\\ &\leq\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]AGM({\mathcal{H}_{s\uplus s_{I}}})\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &=AGM({\mathcal{H}_{s}})\sum_{s_{I}}\operatorname*{\mathbb{I}}[s_{I}]\text{OUT}({\mathcal{H}_{s\uplus s_{I}}})\\ &=AGM({\mathcal{H}_{s}})\cdot\text{OUT}({\mathcal{H}_{s}}).\end{split}

Therefore, plugging these into (18) and setting k=1 in (17) gives \operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]\leq|\mathcal{O}|\cdot AGM(\mathcal{H}_{s})\cdot\text{OUT}(\mathcal{H}_{s}). In addition, the |\mathcal{V}| factor in the proposition can be removed with a simple proof by Chen & Yi (Chen and Yi, 2020) using the variance of the Binomial distribution.
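Spelling out this last step, the bounds (32) and (33) plugged into (18) combine as follows, using |I|\geq 1:

```latex
\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_s}]
\leq |\mathcal{O}\setminus I|\cdot AGM(\mathcal{H}_s)\cdot\text{OUT}(\mathcal{H}_s)
 + AGM(\mathcal{H}_s)\cdot\text{OUT}(\mathcal{H}_s)
= (|\mathcal{O}\setminus I|+1)\cdot AGM(\mathcal{H}_s)\cdot\text{OUT}(\mathcal{H}_s)
\leq |\mathcal{O}|\cdot AGM(\mathcal{H}_s)\cdot\text{OUT}(\mathcal{H}_s).
```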

\square

Proof of Proposition 4.

We use an induction hypothesis 𝕍ar[Zs]|𝒪|I𝒪,|I|=1|I|AGM(s)OUT(s)\operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}]\leq|\mathcal{O}|\cdot\prod_{I\subset\mathcal{O},|I|=1}|\mathcal{E}_{I}|\cdot AGM(\mathcal{H}_{s})\cdot\text{OUT}(\mathcal{H}_{s}), similar to the proof of Proposition 3.

For the base case I=𝒪I=\mathcal{O} and |I|=1|I|=1,

(34) 𝕍ar[Zs]𝔼[Zs2]=sIAGM(ssI)|I|AGM(s)(|I|AGM(s)AGM(ssI))2𝕀[sI]=sI|I|AGM(s)AGM(ssI)𝕀[sI]=|I|AGM(s)OUT(s).\begin{split}\operatorname*{\mathbb{V}ar}[&Z_{\mathcal{H}_{s}}]\leq\operatorname*{\mathbb{E}}[Z^{2}_{\mathcal{H}_{s}}]\\ &=\sum_{s_{I}}\frac{AGM({\mathcal{H}_{s\uplus s_{I}}})}{|\mathcal{E}_{I}|AGM({\mathcal{H}_{s}})}\Big{(}\frac{|\mathcal{E}_{I}|AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\Big{)}^{2}\operatorname*{\mathbb{I}}[s_{I}]\\ &=\sum_{s_{I}}\frac{|\mathcal{E}_{I}|AGM({\mathcal{H}_{s}})}{AGM({\mathcal{H}_{s\uplus s_{I}}})}\operatorname*{\mathbb{I}}[s_{I}]=|\mathcal{E}_{I}|\cdot AGM({\mathcal{H}_{s}})\cdot\text{OUT}({\mathcal{H}_{s}}).\end{split}

Note that, compared to (31) of GJ-Sample, the upper bound of \operatorname*{\mathbb{V}ar}[Z_{\mathcal{H}_{s}}] has increased by a factor of |\mathcal{E}_{I}|. Similarly, we can show that the upper bounds of both (32) and (33) increase by at most \prod_{I\subset\mathcal{O},|I|=1}|\mathcal{E}_{I}| times, proving the inductive case of Proposition 4. By using the proof by Chen & Yi (Chen and Yi, 2020), we can remove the |\mathcal{V}| factor from the proposition as well.

\square

Now, we prove the bounds for T_{\mathcal{H}} (or \operatorname*{\mathbb{E}}[T_{\mathcal{H}}]). In Algorithm 1, Lines 1-1 take \tilde{O}(1). Line 1 takes \tilde{O}(\min_{F\in\mathcal{E}_{I}}|\pi_{I}(R_{F}\ltimes s)|) time for Alley+ due to the intersection, and \tilde{O}(1) for the others, since \Omega_{I} is readily available from the query model in Section 5.2. Line 1 takes \tilde{O}(\text{IN}) for GJ-Sample on non-sequenceable queries and \tilde{O}(1) in all other cases. Line 1 takes \tilde{O}(k). Line 1 takes \tilde{O}(1); computing relative degrees takes \tilde{O}(|\mathcal{E}_{I}|)=\tilde{O}(1) in DRS. Line 1 takes \tilde{O}(k)+\sum_{s_{I}\in S_{I}}\operatorname*{\mathbb{I}}[s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]T_{\mathcal{H}_{s\uplus s_{I}}}. Note that \operatorname*{\mathbb{I}}[s_{I}\in\,\Join_{F\in\mathcal{E}_{I}}\pi_{I}(R_{F}\ltimes s)]=\prod_{F\in\mathcal{E}_{I}}\operatorname*{\mathbb{I}}[s_{I}\in\pi_{I}(R_{F}\ltimes s)] takes \tilde{O}(|\mathcal{E}_{I}|)=\tilde{O}(1) time to evaluate.

From this, we can recursively define TsT_{\mathcal{H}_{s}}, the runtime of computing ZsZ_{\mathcal{H}_{s}}:

(35) Ts={O~(1)if I=𝒪O~(IN)+𝕀[sI]TssIif I𝒪,non-sequenceable queries for GJ-SampleO~(k)+sISI𝕀[sI]TssIotherwiseT_{\mathcal{H}_{s}}=\left\{\begin{array}[]{ll}\tilde{O}(1)&\text{if }I=\mathcal{O}\\ \tilde{O}(\text{IN})+\operatorname*{\mathbb{I}}[s_{I}]T_{\mathcal{H}_{s\uplus s_{I}}}&\text{if }I\subsetneq\mathcal{O},\\ &\text{non-sequenceable }\\ &\text{queries for {{GJ-Sample}}}\\ \tilde{O}(k)+\sum_{s_{I}\in S_{I}}\operatorname*{\mathbb{I}}[s_{I}]T_{\mathcal{H}_{s\uplus s_{I}}}&\text{otherwise}\\ \end{array}\right.
Proof of Proposition 5.

In SSTE, k=\lceil\frac{|\Omega_{I}|}{\sqrt{|R|}}\rceil if I consists of the single vertex w of an odd cycle, and k=1 otherwise. We prove \operatorname*{\mathbb{E}}[k]=\tilde{O}(1) for the first case. Then, if \mathcal{H} contains n cycles with corresponding values k_{1},k_{2},...,k_{n}, \operatorname*{\mathbb{E}}[T_{\mathcal{H}}]=\tilde{O}(\prod_{i}\operatorname*{\mathbb{E}}[k_{i}])=\tilde{O}(1) from (35).

k for I=\{w\} depends solely on the sampled data edge \{a_{1},b_{1}\} for F_{1}=\{u_{1},v_{1}\}. Recall that \Omega_{I}=\pi_{I}(R_{F_{0}}\ltimes a_{1}) and |\pi_{I}(R_{F_{0}}\ltimes a_{1})|\leq|\pi_{I}(R_{F_{0}}\ltimes b_{1})|, since a_{1} has the smallest degree among the sampled cycle vertices. The last equality below holds from a well-known fact in graph theory (Assadi et al., 2018).

(36) 𝔼[k]={a1,b1}R1|R|min(|πI(Ra1)|,|πI(Rb1)|)|R|1+{a1,b1}1|R|min(|πI(Ra1)|,|πI(Rb1)|)|R|=O~(1)(Proposition 2.2 in Assadi et al. (Assadi et al., 2018))\vspace*{-0.2cm}\begin{split}\operatorname*{\mathbb{E}}[k]&=\sum_{\{a_{1},b_{1}\}\in R}\frac{1}{|R|}\lceil\frac{\min(|\pi_{I}(R\ltimes a_{1})|,|\pi_{I}(R\ltimes b_{1})|)}{\sqrt{|R|}}\rceil\\ &\leq 1+\sum_{\{a_{1},b_{1}\}}{\frac{1}{|R|}\frac{\min(|\pi_{I}(R\ltimes a_{1})|,|\pi_{I}(R\ltimes b_{1})|)}{\sqrt{|R|}}}\\ &=\tilde{O}(1)\,\,\,(\because\text{Proposition 2.2 in Assadi et al. \cite[citep]{(\@@bibref{AuthorsPhrase1Year}{SSTEFull}{\@@citephrase{, }}{})}})\end{split}

\square

Proof of Proposition 6.

Simply, Alley+ explores a b^{|\mathcal{V}|} fraction of all paths searched by GenericJoin in expectation. Therefore, \operatorname*{\mathbb{E}}[T_{\mathcal{H}}]=\tilde{O}(b^{|\mathcal{V}|}AGM(\mathcal{H})). T_{\mathcal{H}} is \tilde{O}(AGM(\mathcal{H})) in the worst case, since the computation cost of GenericCardEst for Alley+ is subsumed by that of GenericJoin.

\square

Proof of Proposition 7.

In GJ-Sample, kk is fixed to 1 regardless of sIs_{I}. Therefore, Ts=O~(1)+TssIT_{\mathcal{H}_{s}}=\tilde{O}(1)+T_{\mathcal{H}_{s\uplus s_{I}}} for sequenceable queries and Ts=O~(IN)+TssIT_{\mathcal{H}_{s}}=\tilde{O}(\text{IN})+T_{\mathcal{H}_{s\uplus s_{I}}} for non-sequenceable queries from (35). Since the maximum recursion depth is |𝒱||\mathcal{V}|, T=O~(|𝒱|)=O~(1)T_{\mathcal{H}}=\tilde{O}(|\mathcal{V}|)=\tilde{O}(1) for sequenceable queries (note the excluded preprocessing cost) and T=O~(|𝒱|IN)=O~(IN)T_{\mathcal{H}}=\tilde{O}(|\mathcal{V}|\cdot\text{IN})=\tilde{O}(\text{IN}) for non-sequenceable queries.

\square

Proof of Proposition 8.

In DRS, k is also fixed to 1, and T_{\mathcal{H}_{s}}=\tilde{O}(|\mathcal{E}_{I}|)+T_{\mathcal{H}_{s\uplus s_{I}}}. Hence, T_{\mathcal{H}}=\tilde{O}(\sum_{I\subset\mathcal{V},|I|=1}|\mathcal{E}_{I}|)=\tilde{O}(1). For SUST, k is likewise fixed to 1, and T_{\mathcal{H}_{s}}=\tilde{O}(1)+T_{\mathcal{H}_{s\uplus s_{I}}}. Hence, T_{\mathcal{H}}=\tilde{O}(1).

\square

Proof of Lemma 3.

The unbiasedness of GHDCardEst follows from (37) below, since R_{t} is an unbiased estimate of \mathcal{H}^{G(t)}_{t}=\sum_{\chi(t)\setminus G(t)}\mathcal{H}_{t}=\sum_{\chi(t)\setminus G(t)}{\Join_{F\in\mathcal{E}_{\chi(t)}}\pi_{\chi(t)}R_{F}}. Note that if we replace GenericCardEst with GenericJoin in Algorithm 2, then GHDCardEst returns OUT exactly.

(37) OUT=𝒱FRF=𝒱t𝒱𝒯(Fχ(t)πχ(t)RF)(using GHD (𝒯,χ))=G(𝒯)t𝒱𝒯(χ(t)G(t)Fχ(t)πχ(t)RF)=G(𝒯)t𝔼[Rt].\begin{split}\text{OUT}&=\sum_{\mathcal{V}}\Join_{F\in\mathcal{E}}R_{F}\\ &=\sum_{\mathcal{V}}\Join_{t\in\mathcal{V}_{\mathcal{T}}}\big{(}{\Join_{F\in\mathcal{E}_{\chi(t)}}\pi_{\chi(t)}R_{F}}\big{)}\,\,\,(\because\text{using GHD }(\mathcal{T},\chi))\\ &=\sum_{G(\mathcal{T})}\Join_{t\in\mathcal{V}_{\mathcal{T}}}\Big{(}\sum_{\chi(t)\setminus G(t)}\Join_{F\in\mathcal{E}_{\chi(t)}}\pi_{\chi(t)}R_{F}\Big{)}\\ &=\sum_{G(\mathcal{T})}\Join_{t}\operatorname*{\mathbb{E}}[R_{t}].\end{split}
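The last equality in (37) additionally uses that the R_{t} are produced by independent estimator invocations across the nodes t (an assumption consistent with GHDCardEst), so expectation distributes over the join:

```latex
\operatorname*{\mathbb{E}}\Big[\sum_{G(\mathcal{T})}\Join_{t\in\mathcal{V}_{\mathcal{T}}} R_{t}\Big]
= \sum_{G(\mathcal{T})}\Join_{t\in\mathcal{V}_{\mathcal{T}}}\operatorname*{\mathbb{E}}[R_{t}]
\quad (\because \text{the } R_{t} \text{ are mutually independent}).
```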

\square

Proof of Lemma 4.

We use induction on |f|. For the base case |f|=1, f=\{g_{t}\} for a node t, so \operatorname*{\mathbb{V}ar}[Z[f]]=\operatorname*{\mathbb{V}ar}[Z[g_{t}]]=\tilde{O}(\text{OUT}(\mathcal{H}_{g_{t}})^{2}).

For the inductive case, assume \operatorname*{\mathbb{V}ar}[Z[f]]=\tilde{O}(\prod_{g_{t}\in f}\text{OUT}(\mathcal{H}_{g_{t}})^{2}) for every f with |f|\leq n, for some n\geq 1. For any f with |f|=n+1, we take any g_{t}\in f. Since g_{t} and f\setminus g_{t} are independent,

(38) 𝕍ar[Z[f]]=(𝕍ar[Z[gt]]+𝔼[Z[gt]]2)(𝕍ar[Z[fgt]]+𝔼[Z[fgt]]2)𝔼[Z[gt]]2𝔼[Z[fgt]]2=O~(𝔼[Z[gt]]2)O~(𝔼[Z[fgt]]2)𝔼[Z[gt]]2𝔼[Z[fgt]]2(induction hypothesis)=O~(𝔼[Z[gt]]2𝔼[Z[fgt]]2)=O~(𝔼[Z[f]]2)((7)).\begin{split}\operatorname*{\mathbb{V}ar}&[Z[f]]\\ &=(\operatorname*{\mathbb{V}ar}[Z[g_{t}]]+\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2})(\operatorname*{\mathbb{V}ar}[Z[f\setminus g_{t}]]+\operatorname*{\mathbb{E}}[Z[f\setminus g_{t}]]^{2})\\ &\;\;\;\;\;\;\;-\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\operatorname*{\mathbb{E}}[Z[f\setminus g_{t}]]^{2}\\ &=\tilde{O}(\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2})\tilde{O}(\operatorname*{\mathbb{E}}[Z[f\setminus g_{t}]]^{2})-\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\operatorname*{\mathbb{E}}[Z[f\setminus g_{t}]]^{2}\\ &\;\;\;\;\;\;(\because\text{induction hypothesis})\\ &=\tilde{O}(\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\cdot\operatorname*{\mathbb{E}}[Z[f\setminus g_{t}]]^{2})\\ &=\tilde{O}(\operatorname*{\mathbb{E}}[Z[f]]^{2})\,\,\,(\because\text{(\ref{eq:multiexp:prod})}).\end{split}
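The first equality in (38) is the standard variance identity for a product of independent random variables; for X=Z[g_{t}] and Y=Z[f\setminus g_{t}]:

```latex
\operatorname*{\mathbb{V}ar}[XY]
= \operatorname*{\mathbb{E}}[X^{2}Y^{2}]-\operatorname*{\mathbb{E}}[XY]^{2}
= \operatorname*{\mathbb{E}}[X^{2}]\operatorname*{\mathbb{E}}[Y^{2}]
 -\operatorname*{\mathbb{E}}[X]^{2}\operatorname*{\mathbb{E}}[Y]^{2}
= (\operatorname*{\mathbb{V}ar}[X]+\operatorname*{\mathbb{E}}[X]^{2})
  (\operatorname*{\mathbb{V}ar}[Y]+\operatorname*{\mathbb{E}}[Y]^{2})
 -\operatorname*{\mathbb{E}}[X]^{2}\operatorname*{\mathbb{E}}[Y]^{2}.
```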

\square

Proof of Lemma 5.

If f1f_{1} and f2f_{2} are independent (f1f2=f_{1}\cap f_{2}=\emptyset), then ov(Z[f1],Z[f2])=0\operatorname*{\mathbb{C}ov}(Z[f_{1}],Z[f_{2}])=0. If not, f1f2f_{1}\cap f_{2}, f1f2f_{1}\setminus f_{2}, and f2f1f_{2}\setminus f_{1} are independent, and Z[f1]=Z[f1f2]Z[f1f2]Z[f_{1}]=Z[f_{1}\cap f_{2}]\cdot Z[f_{1}\setminus f_{2}] and Z[f2]=Z[f1f2]Z[f2f1]Z[f_{2}]=Z[f_{1}\cap f_{2}]\cdot Z[f_{2}\setminus f_{1}]. Therefore,

(39) ov(Z[f1],Z[f2])=𝔼[Z[f1]Z[f2]]𝔼[Z[f1]]𝔼[Z[f2]]=𝔼[Z[f1f2]2Z[f1f2]Z[f2f1]]𝔼[Z[f1f2]]2𝔼[Z[f1f2]]𝔼[Z[f2f1]]=𝔼[Z[f1f2]2]𝔼[Z[f1f2]]𝔼[Z[f2f1]]𝔼[Z[f1f2]]2𝔼[Z[f1f2]]𝔼[Z[f2f1]]=𝕍ar[Z[f1f2]]𝔼[Z[f1f2]]𝔼[Z[f2f1]].\begin{split}\operatorname*{\mathbb{C}ov}(Z[f_{1}],\,&Z[f_{2}])=\operatorname*{\mathbb{E}}[Z[f_{1}]\cdot Z[f_{2}]]-\operatorname*{\mathbb{E}}[Z[f_{1}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}]]\\ =&\operatorname*{\mathbb{E}}[Z[f_{1}\cap f_{2}]^{2}\cdot Z[f_{1}\setminus f_{2}]\cdot Z[f_{2}\setminus f_{1}]]\\ &-\operatorname*{\mathbb{E}}[Z[f_{1}\cap f_{2}]]^{2}\cdot\operatorname*{\mathbb{E}}[Z[f_{1}\setminus f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}\setminus f_{1}]]\\ =&\operatorname*{\mathbb{E}}[Z[f_{1}\cap f_{2}]^{2}]\cdot\operatorname*{\mathbb{E}}[Z[f_{1}\setminus f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}\setminus f_{1}]]\\ &-\operatorname*{\mathbb{E}}[Z[f_{1}\cap f_{2}]]^{2}\cdot\operatorname*{\mathbb{E}}[Z[f_{1}\setminus f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}\setminus f_{1}]]\\ =&\operatorname*{\mathbb{V}ar}[Z[f_{1}\cap f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{1}\setminus f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}\setminus f_{1}]].\end{split}

From Lemma 4 and (7), the last term is

(40) O~(𝔼[Z[f1f2]]2)𝔼[Z[f1f2]]𝔼[Z[f2f1]]=O~(gtf1f2𝔼[Z[gt]]2)gtf1f2𝔼[Z[gt]]gtf2f1𝔼[Z[gt]]=O~(gtf1f2𝔼[Z[gt]]2gtf1f2𝔼[Z[gt]]gtf2f1𝔼[Z[gt]])=O~(gtf1𝔼[Z[gt]]gtf2𝔼[Z[gt]]).\begin{split}\tilde{O}(&\operatorname*{\mathbb{E}}[Z[f_{1}\cap f_{2}]]^{2})\cdot\operatorname*{\mathbb{E}}[Z[f_{1}\setminus f_{2}]]\cdot\operatorname*{\mathbb{E}}[Z[f_{2}\setminus f_{1}]]\\ &=\tilde{O}\Big{(}\prod_{g_{t}\in f_{1}\cap f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\Big{)}\cdot\prod_{g_{t}\in f_{1}\setminus f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\cdot\prod_{g_{t}\in f_{2}\setminus f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\\ &=\tilde{O}\Big{(}\prod_{g_{t}\in f_{1}\cap f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]^{2}\cdot\prod_{g_{t}\in f_{1}\setminus f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\cdot\prod_{g_{t}\in f_{2}\setminus f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\Big{)}\\ &=\tilde{O}\Big{(}\prod_{g_{t}\in f_{1}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\cdot\prod_{g_{t}\in f_{2}}\operatorname*{\mathbb{E}}[Z[g_{t}]]\Big{)}.\end{split}

\square

Proof of Proposition 9.

If \operatorname*{\mathbb{V}ar}[Z[g_{t}]] approaches 0 for every g_{t}\in f, then \operatorname*{\mathbb{V}ar}[Z[f]] approaches 0 by (8). For any f_{1} and f_{2}, if \operatorname*{\mathbb{V}ar}[Z[g_{t}]] approaches 0 for every g_{t}\in f_{1}\cap f_{2}, then \operatorname*{\mathbb{V}ar}[Z[f_{1}\cap f_{2}]] approaches 0, and so \operatorname*{\mathbb{C}ov}(Z[f_{1}],Z[f_{2}]) approaches 0 by (39). Therefore, \operatorname*{\mathbb{V}ar}[Z] also approaches 0 by (9).

\square