PANDA: Query Evaluation in Submodular Width
Abstract
In recent years, several information-theoretic upper bounds have been introduced on the output size and evaluation cost of database join queries. These bounds vary in their power depending on both the type of statistics on input relations and the query plans that they support. This motivated the search for algorithms that can compute the output of a join query in time bounded by the corresponding information-theoretic bounds. In this paper, we describe PANDA, an algorithm that takes a Shannon inequality underlying such a bound, and translates each step of its proof into an algorithmic step corresponding to some database operation. PANDA computes answers to a conjunctive query in time given by the submodular width plus the output size of the query. The version in this paper represents a significant simplification of the original version in [6].
1 Introduction
Answering conjunctive queries efficiently is a fundamental problem in the theory and practice of database management, graph algorithms, logic, constraint satisfaction, and graphical model inference, among others [4, 16, 20, 1]. In a full conjunctive query, the input is a set of relations (or tables, or constraints), each with a set of attributes (or variables), and the task is to list all satisfying assignments, i.e. assignments to all variables that simultaneously satisfy all input relations. Each such assignment is called an output tuple, and their number is the output size. For example, the query
(1)   $Q(x,y,z) \text{ :- } E(x,y) \wedge E(y,z) \wedge E(z,x)$
is a full conjunctive query asking for the list of all triangles in a (directed) graph with edge relation $E$. In contrast, in a Boolean conjunctive query, we only ask whether one such assignment exists. The query $Q() \text{ :- } E(x,y) \wedge E(y,z) \wedge E(z,x)$
asks whether there is a triangle in the graph. More generally, in a proper conjunctive query, any subset of the variables can occur in the head of the query. These are called free variables. For example, the query $Q(x) \text{ :- } E(x,y) \wedge E(y,z) \wedge E(z,x)$
asks for the list of all nodes that are part of a triangle. Other variants of the query evaluation problem include counting output tuples, performing some aggregation over them, or enumerating them under some restrictions.
In the case of a full conjunctive query, the runtime of any algorithm is at least as large as the size of the output. This has motivated the study of upper bounds on the sizes of query outputs. The corresponding graph-theoretic problem is to bound the number of homomorphic images of a small graph within a larger graph; this problem has a long history in graph theory and isoperimetric inequalities [24, 7, 13, 12, 27, 15]. One such bound is the AGM bound, which is a formula that, given only the cardinalities of the input relations, returns an upper bound on the output size of the query [10]. Moreover, the bound is tight, meaning that there exist relations of the given cardinalities where the query’s output has size equal to the AGM bound (up to a query-dependent factor). This immediately implies that no algorithm can run in time lower than the AGM bound in the worst case. Thus, an algorithm that runs in this time is called a Worst-Case Optimal Join (WCOJ) algorithm. The AGM bound for query (1) is $|E|^{3/2}$; algorithms for listing all triangles within this amount of time have been known for decades [21]. For general full conjunctive queries, Grohe and Marx [19], and Atserias, Grohe, and Marx [10, 9] devised a join-project query plan that can compute the output of a full conjunctive query to within a linear factor of the AGM bound, which is very close to optimality. They also showed that a join-only query plan cannot achieve this bound. A few years later, a new class of join algorithms, not based on join and project operators, was able to achieve the AGM bound [33, 29] and thus achieve worst-case optimality.
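To make the WCOJ idea concrete, the following minimal Python sketch evaluates the triangle query (1) in the variable-at-a-time style of generic join, intersecting candidate values for each variable instead of joining two relations at a time. It is our own illustration, not the algorithm of [33, 29]: R, S, T stand for three (copies of) edge relations, and a production implementation would use sorted indexes or tries rather than hash sets.

```python
from collections import defaultdict

def triangle_join(R, S, T):
    """Generic-join-style listing of all (x, y, z) with (x, y) in R,
    (y, z) in S, (z, x) in T. A sketch only."""
    R_x = defaultdict(set)          # x -> set of y
    for x, y in R:
        R_x[x].add(y)
    S_y = defaultdict(set)          # y -> set of z
    for y, z in S:
        S_y[y].add(z)
    T_z = defaultdict(set)          # z -> set of x
    for z, x in T:
        T_z[z].add(x)
    T_x = {x for (z, x) in T}       # x-values appearing in T

    out = []
    for x in set(R_x) & T_x:            # candidate x: must appear in R and T
        for y in R_x[x] & set(S_y):     # candidate y: must appear in R and S
            for z in S_y[y]:            # candidate z: must extend S and T
                if x in T_z.get(z, ()):
                    out.append((x, y, z))
    return out

print(triangle_join({(1, 2)}, {(2, 3)}, {(3, 1)}))   # [(1, 2, 3)]
```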
In practice, especially in database systems, the input cardinalities are not sufficient to model what we know about the data, and definitely not sufficient to predict how good or bad a query plan is. Other data characteristics such as functional dependencies and distinct value counts are often collected and used to optimize queries [28, 22]; furthermore, practical queries often have “relations” of infinite cardinality. For example, the output size of this query
is obviously at most , but the AGM bound is . The relation has infinite cardinality. There has thus been a line of research to strengthen the AGM bound to account for increasingly finer classes of input statistics. Specifically, Gottlob, Lee, Valiant and Valiant [18, 17] applied the entropy argument to derive a bound on the output size of a full conjunctive query under general functional dependencies. Their bound, generalizing the AGM bound, is an optimization problem whose constraints are the Shannon inequalities. This idea planted the seed for a series of works that extended the entropy argument to account for finer classes of constraints, including degree constraints [5, 6, 3], and $\ell_p$-norm bounds on degree sequences [3].
Designing WCOJ algorithms that match these stronger bounds becomes harder, since the bounds under richer constraints are tighter. An algorithm called CSMA, described in [5], accounts only for relation cardinalities and functional dependencies.
In database systems jargon, a WCOJ algorithm is a multi-way join operator; this operator, for some classes of queries, is asymptotically faster than the traditional binary join operator [29]. Given a complicated conjunctive query, however, before applying a WCOJ algorithm, it is often beneficial to come up with a query plan that decomposes the query into simpler subqueries. Query plans guided by (hyper)tree decompositions have proved to be very useful both in theory and in practice [16]. In particular, query plans represented by a single tree decomposition can be used to answer a Boolean conjunctive query or a full conjunctive query in time bounded by $\tilde O(N^{\mathsf{fhtw}} + Z)$, where fhtw is the fractional hypertree width [20] of the query, $N$ is the input size, and $Z$ is the size of the output. This runtime is sensitive to the output size, and thus it is more adaptive than a single WCOJ application. (It should be noted, however, that the sub-queries of the plan are still being answered with a WCOJ operator.) For example, using tree-decomposition-based query plans, we can identify whether a large graph with $N$ edges contains the homomorphic images of a $k$-cycle in time $O(N^2)$ (assuming a constant $k$).
Motivated by finding the fixed-parameter tractability boundary for Boolean conjunctive queries, or equivalently constraint satisfaction problems, under the regime of unbounded-arity inputs, Marx [25] came up with a beautifully novel idea: instead of fixing a single query plan (i.e. tree-decomposition) up front, we can consider multiple query plans, and partition the data to make use of different plans for different parts of the data. The improved complexity measure that Marx introduced is called the submodular width, subw, which is always less than or equal to fhtw.
The subw algorithm from Marx [25] has some limitations. First, it assumes all input relations have the same cardinality $N$; in particular, it is not known how to define the width and design similar algorithms under functional dependencies or degree constraints. Second, the runtime of the algorithm is not $\tilde O(N^{\mathsf{subw}})$, but a polynomial in this quantity. Third, the subw notion and the algorithm were not defined for general conjunctive queries with arbitrary free variables.
Contributions. This paper connects all three of these lines of research: WCOJ algorithms, fractional hypertree width, and submodular width into a single framework, while dealing with arbitrary input degree constraints (which is a superset of functional dependencies and cardinality constraints), and arbitrary conjunctive queries. The bridge that connects these algorithms and concepts is information theory. In particular, our contributions include the following:
We show how a generalized version of the classic Shearer’s inequality [15] can be used to derive upper bounds on the output size of a disjunctive datalog rule (DDR), which is a generalization of a conjunctive query. The upper bound is a generalization of the AGM bound to include degree constraints and DDRs. The introduction of DDR and its information theoretic output size bound in studying conjunctive queries is our first major technical novelty. DDRs are interesting in their own right. They form the building blocks of disjunctive datalog [14], which is a significant extension of datalog. Disjunctive datalog has a long history: it emerged in logic programming [23, 26], and is used for knowledge representation, representing incomplete information, and constraint satisfaction. Our bound on the output size of a DDR represents a bound on the output size of the minimal model of the disjunctive datalog program consisting of that single rule.
Next, we show that certain symbolic manipulations of the information inequality can be converted into a query evaluation plan for the DDR that runs in time bounded by the predicted upper bound. This idea of converting a proof of an information inequality into a query evaluation plan is our second major technical novelty. The algorithm is called PANDA, which stands for “Proof-Assisted eNtropic Degree-Aware”. In particular, PANDA is worst-case optimal for DDRs under arbitrary degree constraints. Even when restricted to only conjunctive queries, this is already beyond the state of the art in WCOJ algorithms because previous WCOJ algorithms cannot meet the tightened bounds under degree constraints.
Lastly, we explain how to define the notions of fhtw and subw under degree constraints and for conjunctive queries with arbitrary free variables. We show how PANDA can be used to answer arbitrary conjunctive queries with arbitrary degree constraints, in time bounded by the submodular-width bound plus the output size. These results close the gaps left by Marx’s work. For example, with PANDA, the $k$-cycle query can now be answered in time $\tilde O(N^{2-1/\lceil k/2 \rceil})$, which is sub-quadratic, and matches the specialized cycle detection algorithm from Alon, Yuster, and Zwick [8].
The results in this paper were first announced in a conference paper [6]. The current paper makes several significant improvements:
• In [6], we used both the primal and the dual linear program to guide PANDA: the primal gives an optimal polymatroid, while the dual represents the basic Shannon inequalities. In the new version, we use only the dual, which significantly simplifies the algorithm. The algorithm is described only in terms of an information inequality and its proof (called a witness), which correspond precisely to a feasible solution to the dual program. We only need to describe the primal and dual programs later, in Sec. 6, where we introduce the degree-aware submodular width, which is defined in terms of the primal.
• Previously, we needed a proof sequence to drive the algorithm; it was difficult to prove that a proof sequence exists; for example, no proof sequence existed in our earlier framework [5]. In the new version, we describe PANDA without the need for an explicit proof sequence, which again simplifies it. If needed, a proof sequence can still be extracted from the new version of the algorithm.
• One difficulty in the earlier presentation of PANDA was the need to recompute the proof sequence after a reset step. This is no longer necessary here.
Paper Outline This paper is organized as follows. Section 2 presents background concepts needed to understand the techniques and results of the paper; in particular, it introduces disjunctive datalog rules (DDR), a generalization of conjunctive queries, reviews necessary background on information theory, and defines the class of statistics on input relations that the PANDA algorithm supports, which are called degree constraints. Section 3 discusses the class of information inequalities that are at the center of our work, where they are used to both derive upper bounds on the output size of a query and guide the PANDA algorithm. Section 4 states the main algorithmic result, which says that the PANDA algorithm meets this information-theoretic upper bound if the bound is a Shannon inequality, i.e. it does not involve non-Shannon inequalities [36, 37]. Section 5 presents the core PANDA algorithm, which constructs a step-by-step proof of the Shannon inequality, and converts each step into a database operation. Section 6 defines the degree-constraint aware submodular width, and shows how to use disjunctive datalog rules to compute a conjunctive query in time given by the submodular width. We conclude in Section 7.
2 Preliminaries
2.1 Database instances and conjunctive queries (CQ)
Fix a set of variables (or attributes). An atom is an expression of the form where is a relation name and is a set of attributes. A schema, , is a set of atoms. We shall assume throughout that distinct atoms in have distinct attribute sets. If is a relation name in , we write for its attribute set, and define to be the set of all attributes in the schema .
Given a countably infinite domain Dom, we use to denote the set of tuples with attributes . A -instance is a map that assigns to each relation name in a finite subset . Technically, we should use to denote the -instance; however, to reduce notational burden, instead of writing and , we will often write and when the instance is clear from the context. Given and a tuple , we write to denote the projection of onto the variables .
The full natural join (or full join for short) of the -instance is the set of tuples that satisfy all atoms in :
(2) |
In the literature, this set of tuples is sometimes called the universal table of the instance .
Given a schema , a conjunctive query is the expression
(3) |
where is called the set of free variables, and is the head atom of the query. Atoms in are called the body atoms of the query. The output of an input instance is the projection of the full join (2) onto the free variables : .
When , we call the query a full conjunctive query. When , we call the query a Boolean conjunctive query, whose answer is either true or false, and it is true if and only if the full join (2) is non-empty.
Our complexity results are expressed in terms of data complexity; in particular, we consider the number of atoms and variables to be a constant, i.e. , and the complexity is a function of the instance size. We define the size of a -instance as follows. (Alternatively, the size of a -instance could have been defined as the sum of the sizes of all relations in the instance. However, the two definitions are within a constant factor of one another in data complexity. Moreover, taking the maximum instead of the sum is more convenient for our purposes. For example, we will see later that our size upper bound from Theorem 3.1 only holds if we define the size as the maximum.)
(4) |
The notation is used in part to distinguish it from which counts the number of atoms in .
2.2 Tree decompositions and free-connex queries
Consider a conjunctive query in the form (3). A tree decomposition of is a pair , where is a tree and is a map from the nodes of to subsets of that satisfies the following properties: for all atoms in , there is a node such that ; and, for any variable , the set forms a connected sub-tree of . Each set is called a bag of the tree-decomposition, and we will assume w.l.o.g. that the bags are distinct, i.e. when are nodes in .
A free-connex tree decomposition for is a tree-decomposition for with an additional property that there is a connected sub-tree of for which . The query is free-connex acyclic iff there is a free-connex tree decomposition in which every bag is covered by an input atom; namely, for every , there exists an input atom where .
Lemma 2.1.
If is a free-connex acyclic conjunctive query of the form (3), then we can compute its output in time . Furthermore, we can list the output tuples one by one with constant-delay between them.
In particular, for this class of queries, the runtime is the best one can hope for: input size plus output size.
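To illustrate why free-connex acyclicity enables this runtime, here is a small Python sketch of the semijoin-reduce-then-enumerate idea for the simplest acyclic example, the path query Q(x, y, z) :- R(x, y) ∧ S(y, z). It is only a sketch of the principle behind Lemma 2.1 (which works over an arbitrary free-connex tree decomposition), not the general algorithm: after the reduction pass, every partial assignment extends to an output tuple, so the enumeration never hits a dead end.

```python
from collections import defaultdict

def path_join(R, S):
    """Enumerate Q(x, y, z) :- R(x, y), S(y, z) with no dead ends:
    a semijoin pass removes R-tuples that cannot be extended, so the
    nested loops below emit one output tuple per iteration."""
    S_y = defaultdict(list)
    for y, z in S:
        S_y[y].append(z)
    R_reduced = [(x, y) for (x, y) in R if y in S_y]   # semijoin of R with S
    for x, y in R_reduced:                             # enumeration phase
        for z in S_y[y]:
            yield (x, y, z)

print(list(path_join({(1, 2), (4, 5)}, {(2, 3)})))     # [(1, 2, 3)]
```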
2.3 Disjunctive Datalog rules (DDR)
Disjunctive Datalog rules [14] are a generalization of conjunctive queries, where the head of the query can be a disjunction, formalized as follows. Let and be two schemas, called input and output schema respectively. We associate to these two schemas the following disjunctive Datalog rule (DDR)
(5) |
The DDR is uniquely defined by the two schemas and . The syntax in (5) does not add any new information, but is intended to be suggestive for the following semantics:
Definition 2.2.
Let be an input instance. A model (or feasible output) for the rule (5) is an instance , such that the following condition holds: for every tuple , there is an output atom for which . Similar to (4), the size of the output instance is defined as
(6) |
A model is minimal if we cannot remove a tuple from any output relation without violating the feasibility condition.
In English, a feasible output needs to store each tuple in at least one of the output relations . The query evaluation problem is to compute a minimal model. Note that a conjunctive query is a disjunctive Datalog rule where has a single atom.
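The following Python sketch makes the semantics concrete: it scans an (already materialized) full join and greedily covers each tuple by one head atom. This mirrors the model construction used later in the proof of Theorem 3.1; it illustrates feasibility only. PANDA never materializes the full join, and a final pruning pass would be needed to guarantee minimality. All names here are ours.

```python
def greedy_model(full_join, head_schemas):
    """Greedily build a feasible model of a DDR: for each tuple of the
    full join that is not yet covered by any head atom, add its
    projection to the currently smallest head relation.
    `head_schemas` maps each output name to the column positions it keeps."""
    model = {name: set() for name in head_schemas}
    for t in full_join:
        covered = any(tuple(t[i] for i in pos) in model[name]
                      for name, pos in head_schemas.items())
        if not covered:
            name = min(model, key=lambda n: len(model[n]))
            model[name].add(tuple(t[i] for i in head_schemas[name]))
    return model

# A(x, y) ∨ B(y, z), with the full join tuples (x, y, z) given directly:
fj = [(1, 2, 3), (1, 2, 4), (5, 2, 3)]
print(greedy_model(fj, {"A": (0, 1), "B": (1, 2)}))
# {'A': {(1, 2)}, 'B': {(2, 3)}} -- every join tuple is covered
```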
DDRs are interesting. We illustrate the concept here with a few examples.
Example 2.3.
Consider the following DDR,
where and . A model (or feasible output) to the DDR is any superset of . Consider now the following DDR:
One model is , . Another model is , . Both are minimal. Many other models exist.
A non-trivial DDR is the following:
(7) |
To compute the rule, for each tuple in the full join, we must either insert into , or insert into . The output size is the larger of the resulting relations and . We shall see later in the paper that, for this rule, there is a model of size , which is a non-trivial result.
2.4 Entropic vectors and polymatroids
For general background on information theory and polymatroids, we refer the reader to [35]. Given a discrete (tuple of) random variable(s) $X$ over a domain $\mathcal{D}$ with distribution $p$, the (Shannon) entropy of $X$ is defined as
$H(X) := -\sum_{x \in \mathcal{D}} p(x) \log p(x)$
The support of the distribution $p$ is the set of all $x$ where $p(x) > 0$. We will only work with distributions of finite support. Let $n$ be the support size; then it is known that $0 \le H(X) \le \log n$. The upper bound follows from the concavity of the $\log$ function and Jensen’s inequality. Moreover, $H(X) = 0$ iff $X$ is deterministic, i.e. it has a singleton support, and $H(X) = \log n$ iff $X$ is uniform, meaning that $p(x) = 1/n$ for all $x$ in the support.
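A minimal numeric check of these facts in Python, with entropies measured in bits (base-2 logarithms, the convention used for the bounds in this paper):

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a distribution given as a list of
    probabilities; zero-probability outcomes contribute nothing."""
    return sum(-q * math.log2(q) for q in p if q > 0)

print(entropy([0.5, 0.25, 0.25]))   # 1.5
print(entropy([1.0]))               # 0.0: deterministic variable
print(entropy([1 / 4] * 4))         # 2.0 = log2(4): uniform attains the max
```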
If $V$ is a set of jointly distributed random variables, then the vector $(H(X))_{X \subseteq V}$ is called an entropic vector. (Given two sets $A$ and $B$, we use $B^A$ to denote the set of all functions from $A$ to $B$. Implicitly, $H(\emptyset) = 0$; thus, this vector is effectively of dimension $2^{|V|} - 1$.) We will often write $XY$ for the union $X \cup Y$; in particular, $H(XY)$ denotes $H(X \cup Y)$.
Starting with Shannon himself, the study of linear functions on entropic vectors has been a central topic in information theory. The two basic linear functions are defined by the so-called information measures. An information measure is an expression $h(Y \mid X)$ or $I_h(Y; Z \mid X)$, where $X, Y, Z$ are disjoint subsets of the set of variables $V$. We call $h(Y \mid X)$ a monotonicity and $I_h(Y; Z \mid X)$ a submodularity information measure, respectively. A monotonicity measure is called unconditional iff $X = \emptyset$. Similarly, a submodularity measure is called unconditional iff $X = \emptyset$. For any vector $h$, we define the linear functions:
$h(Y \mid X) := h(XY) - h(X)$ and $I_h(Y; Z \mid X) := h(XY) + h(XZ) - h(X) - h(XYZ)$
We write MON for the set of monotonicity measures, i.e. the set of all $(Y \mid X)$ where $X, Y$ are disjoint. Similarly, we write SUB for the set of submodularity measures.
A polymatroid is a vector that satisfies the basic Shannon inequalities:
(8)   $h(\emptyset) = 0, \qquad h(Y \mid X) \ge 0, \qquad I_h(Y; Z \mid X) \ge 0$, for all disjoint $X, Y, Z \subseteq V$
The latter two are called monotonicity and submodularity constraints respectively. Every entropic vector is a polymatroid, but the converse is not true [36]. A Shannon inequality is a linear inequality that is derived from the basic Shannon inequalities; equivalently, a linear inequality is a Shannon inequality iff it is satisfied by all polymatroids.
Discussion. In the literature, the basic Shannon inequalities are often restricted to elemental ones, which are submodularity measures of the form $I_h(A; B \mid X)$ and monotonicity measures of the form $h(A \mid V \setminus \{A\})$, where $A, B$ are single variables and $X \subseteq V \setminus \{A, B\}$. The reason is that the elemental inequalities are sufficient to prove every Shannon inequality. Given $n = |V|$, the total number of elemental basic Shannon inequalities is $\binom{n}{2} 2^{n-2} + n$. (Specifically, there are $\binom{n}{2}$ ways to choose $A$ and $B$, and $2^{n-2}$ ways to choose $X$, resulting in $\binom{n}{2} 2^{n-2}$ elemental submodularities. Additionally, there are $n$ elemental monotonicities.) However, for the purpose of the PANDA algorithm, it is preferable to consider all basic Shannon inequalities, because this may lead to a smaller exponent of the polylog factor of the algorithm, as we will see in Section 5.
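A small sanity check of this count, enumerating the elemental measures explicitly for n = 4 (our own encoding of measures as tuples):

```python
from itertools import combinations
from math import comb

def elemental_measures(n):
    """Enumerate the elemental measures over variables V = {0,...,n-1}:
    submodularities I(A; B | X) with A, B single variables and X a subset
    of V minus {A, B}, plus monotonicities h(A | V minus {A})."""
    V = set(range(n))
    subs = [(a, b, X)
            for a, b in combinations(range(n), 2)
            for k in range(n - 1)                      # |X| ranges over 0..n-2
            for X in combinations(sorted(V - {a, b}), k)]
    mons = [(a, tuple(sorted(V - {a}))) for a in range(n)]
    return subs, mons

subs, mons = elemental_measures(4)
print(len(subs), len(mons))               # 24 4
print(comb(4, 2) * 2 ** (4 - 2) + 4)      # 28, matching the formula above
```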
2.5 Statistics on the data
In typical database engines, various statistics about the data are maintained and used for cardinality estimation, query optimization, and other purposes. Common statistics include the number of distinct values in a column, the number of tuples in a relation, and functional dependencies. A robust abstraction called “degree constraints” was introduced in [6] that captures many of these statistics. This section provides a formal definition of degree constraints, which are also the constraints that PANDA can support.
Let be a monotonicity measure, be a database instance with schema , be a relation in this instance, and be a tuple with schema . We define the quantity degree of with respect to in , denoted by , as follows:
• When both and , then is the number of times the given -tuple occurs in .
• When and are arbitrary with respect to , then we define the restriction of to to be the relation (where $\ltimes$ denotes the semi-join reduce operator), and set
Finally, define the degree of the monotonicity measure in , denoted by , to be
(9) |
Note that, for infinite Dom, if , then , unless , in which case . We say that is a guard of if , and note that, if , then is a guard of iff .
If and , then the degree is the cardinality of , . If then there is a functional dependency in from to . If the number of unique values in a column of is , then . Given a schema instance , define the degree of in the instance as:
(10) |
Let be a set of monotonicity measures and be numerical values for each . Throughout this paper, we write instead of , and define for all . We view the pair as a set of degree constraints, and we say that an instance satisfies the degree constraints iff for all . In that case, we write .
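The following Python sketch (with our own naming, not the paper’s notation) computes the degree of a monotonicity measure in a single relation: the maximum, over bindings of the X-attributes, of the number of distinct Y-projections co-occurring with that binding. Cardinalities (X empty) and functional dependencies (degree 1) fall out as special cases, as described above.

```python
from collections import defaultdict

def degree(R, schema, X, Y):
    """deg_R(Y | X): max over X-bindings of the number of distinct
    Y-projections occurring with that binding in table R.
    `schema` lists the attribute names of R's columns, in order."""
    xi = [schema.index(a) for a in X]
    yi = [schema.index(a) for a in Y]
    groups = defaultdict(set)
    for t in R:
        groups[tuple(t[i] for i in xi)].add(tuple(t[i] for i in yi))
    return max(len(v) for v in groups.values())

R = [(1, 'a'), (1, 'b'), (2, 'a')]
print(degree(R, ['x', 'y'], [], ['x', 'y']))   # 3: the cardinality |R|
print(degree(R, ['x', 'y'], ['x'], ['y']))     # 2: max number of y per x
print(degree(R, ['x', 'y'], ['y'], ['x']))     # 2: max number of x per y
```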
3 On a Class of Information Inequalities
The entropy argument, and Shearer’s lemma in particular [13], is a powerful information-theoretic tool in extremal combinatorics [30]. Friedgut and Kahn [15] applied the argument to bound the number of homomorphic copies of a graph in a larger graph; this is a special case of the full conjunctive query problem. Grohe and Marx [19], with further elaboration in [9], showed how Shearer’s lemma can be used to bound the output size of a full CQ given input cardinality constraints. Briefly, let $\Sigma$ be the input schema of a full CQ of the form (3),
(11)   $Q(V) \text{ :- } \bigwedge_{R \in \Sigma} R(X_R)$
where the head atom has all the variables. Given any non-negative weight vector $(w_R)_{R \in \Sigma}$ that forms a fractional edge cover of the hypergraph $(V, \{X_R \mid R \in \Sigma\})$, where $X_R$ denotes the attribute set of $R$, the output size of the full CQ is bounded by
(12)   $|Q| \le \prod_{R \in \Sigma} |R|^{w_R}$
This is known as the AGM bound [10]. The bound is a direct consequence of Shearer’s inequality [13], which states that the inequality
(13)   $h(V) \le \sum_{R \in \Sigma} w_R \cdot h(X_R)$
holds for every entropic vector $h$ if and only if the weights $(w_R)_{R \in \Sigma}$ form a fractional edge cover of the hypergraph defined above.
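For instance, for the triangle query (1), taking weight 1/2 on each atom gives the bound $h(XYZ) \le \frac{1}{2}\big(h(XY) + h(YZ) + h(ZX)\big)$. A sketch of its proof from two basic submodularities follows (with $X, Y, Z$ single variables; Example 4.2 later uses a two-submodularity derivation in the same style):

```latex
% Shearer's inequality for the triangle query, from two submodularities:
\begin{align*}
h(XY) + h(YZ) &\ge h(XYZ) + h(Y)   && \text{since } I_h(X; Z \mid Y) \ge 0 \\
h(ZX) + h(Y)  &\ge h(XYZ)          && \text{since } I_h(XZ; Y) \ge 0 \\
\intertext{adding the two and canceling $h(Y)$:}
h(XY) + h(YZ) + h(ZX) &\ge 2\, h(XYZ)
\end{align*}
```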
The above results can only deal with cardinality constraints on the input relations, and only apply when the input query is a conjunctive query. We extend these results to disjunctive Datalog rules and handle general degree constraints. We start in the next section with a generalization of Shearer’s inequality and show how it implies an output cardinality bound for DDRs. In later sections, we use the information inequality to drive the PANDA algorithm.
3.1 Size Bound for DDRs from Information Inequalities
This section develops an information-theoretic bound for the output size of a DDR under general degree constraints. Inequality (14) below is a generalization of Shearer’s inequality (13), and the bound (15) is a generalization of the AGM bound (12) for DDRs and general degree constraints. (Recall the notation for the size of a model in (6).) In what follows, for a given schema and an atom , for brevity we will also write as we assume a one-to-one correspondence between atoms and their variables.
Theorem 3.1.
Consider a DDR of the form (5) with input and output schemas and respectively. Let be a set of monotonicity measures. Suppose that there exist two non-negative weight vectors and with , where the following inequality holds for all entropic vectors :
(14) |
Then, for any input instance for the DDR (5), there exists a model for the DDR that satisfies the bound (15) below. (Recall our definition of the model size from Eq. (6). Inequality (15) only holds because we define the size as the maximum, not the sum, of the sizes of all relations in the instance; otherwise, we would have needed to multiply the right-hand side by the number of relations in the output schema.)
(15) |
Proof.
The plan is to use the entropy argument [13], where we define a uniform probability distribution on a certain subset , denote its entropic vector, then use (14) to prove (15).
Let be the set of all input variables. Notice that, for any joint distribution on with entropic vector , we have for all . This is trivially true if is not guarded by any relation in (see Sec 2.5 for the notion of guardedness). Otherwise, the inequality follows from the fact that the uniform distribution on a finite support has the maximum entropy:
Next, we construct both the set and a model for the DDR as follows. Initially, set , and for all . Iterate over the tuples in some arbitrary order. For each : if s.t. , then ignore ; otherwise, insert into and insert into for every . In the end, we have constructed a model for the DDR. Furthermore, .
Finally, consider the uniform distribution on , i.e. the distribution where each tuple in is chosen randomly with probability . Let be the entropy vector of this distribution. Notice that for each , the marginal distribution on is also uniform, because ; in particular, for all . Then, noting that , the following holds:
∎
In order to obtain the best bound, we need to choose the weights and to minimize the right-hand side of (15). Specifically, we want to minimize the linear objective
(16) |
subject to the constraints that , , , and that inequality (14) holds for all entropic vectors .
For general monotonicity measure , it is an open problem to characterize the weight vectors and for which (14) holds for all entropic vectors . In particular, the difficulty is related to the problem of characterizing the entropic region in information theory [35]. Hence, to make the problem tractable, we relax the upper bound by requiring (14) to hold for all polymatroids .
Definition 3.2 (Polymatroid bound).
The following bound is called the polymatroid bound for disjunctive datalog rules:
(17)   minimize the objective (16)
subject to:
(18)   inequality (14) holds for all polymatroids , along with the non-negativity and normalization constraints on the weights
Note that the bound is stated in log-scale, so that the objective is linear. The constraint (18) is a linear constraint, as explained below. In particular, the polymatroid bound is a linear program.
Proposition 3.3.
The constraint that (14) holds for all polymatroids is a linear constraint in the weight vectors and and auxiliary variables.
Proof.
In the space , the set of polymatroids is a polyhedron of the form for an appropriately defined matrix . Inequality (14) is a linear inequality of the form , where is a linear function of the weights and . From the Gale-Kuhn-Tucker variant of Farkas’ lemma (https://en.wikipedia.org/wiki/Farkas%27_lemma), inequality (14) holds for all polymatroids if and only if there is no for which and , which holds if and only if there is a vector such that and . The last condition is a linear constraint in the weights and , and in the dual variables . ∎
3.2 Equivalent Formulations of Inequality (14)
Shearer’s result [13] states that inequality (13) holds for all polymatroids iff the weights form a fractional edge cover of a certain hypergraph. This section shows an analogous characterization for the generalization (14), and states an “integral” version of this characterization that shall be used by the PANDA algorithm. The following lemma is a variant of Farkas’ lemma [31] applied to our specific setting.
Lemma 3.4.
Let be a coefficient vector. The following inequality is a Shannon inequality:
(19) |
if and only if there exist non-negative coefficients and such that the following equality holds as an identity over symbolic variables , :
(20) |
We call the tuple a witness of (19). If is rational, then there exists a rational witness. Furthermore, and are a function of and .
Proof.
Clearly if (20) holds as an identity, then (19) follows trivially. The converse is a direct consequence of a variant of Farkas’ lemma, in particular Corollary 7.1h in Schrijver’s classic text [31], which states: if the polyhedron is not empty, and if inequality holds for all , then there exists for which is a non-negative linear combination of the inequalities in . (We thank the anonymous reviewer for pointing out this specific variant, which helps simplify our proof.)
In our context, is the polyhedron defined by (8) and , and the inequality is defined by and . Farkas’ lemma thus implies that if (19) is a Shannon inequality, then is a non-negative linear combination of the Shannon inequalities in (8), namely there are coefficients such that the following identity holds over the variables , :
(21) |
where and are coefficients associated with and . Setting for all in (21), we conclude that , and thus (20) holds.
The fact that and are a function of and (and the bounds on their representation sizes) can be found in Chapter 10 of Schrijver’s book [31]. ∎
In words, Lemma 3.4 says that the LHS of (19) is a positive linear combination of monotonicity and submodularity measures. From the lemma, it is not hard to show that Shearer’s inequality (13) holds whenever the weights form a fractional edge cover of the corresponding hypergraph.
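Lemma 3.4 also yields a mechanical way to test whether a candidate inequality is a Shannon inequality: minimize its left-hand side over the polymatroid cone and check that the optimum is zero. The sketch below is our own encoding (using numpy/scipy and the elemental description of the cone from Section 2.4; all names are ours). Since the cone is closed under scaling, boxing each coordinate into [0, 1] does not change the sign of the optimum.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def is_shannon(n, c):
    """Test whether sum_S c[S] * h(S) >= 0 for all polymatroids h over
    variables {0,...,n-1}. `c` maps frozensets to coefficients."""
    sets = [frozenset(S) for k in range(n + 1)
            for S in combinations(range(n), k)]
    idx = {S: i for i, S in enumerate(sets)}
    rows = []
    for a, b in combinations(range(n), 2):      # elemental submodularities
        rest = sorted(set(range(n)) - {a, b})
        for k in range(len(rest) + 1):
            for X in combinations(rest, k):
                X = frozenset(X)
                row = np.zeros(len(sets))
                row[idx[X | {a}]] += 1          # I(a;b|X) = h(Xa) + h(Xb)
                row[idx[X | {b}]] += 1          #          - h(X) - h(Xab) >= 0
                row[idx[X]] -= 1
                row[idx[X | {a, b}]] -= 1
                rows.append(-row)               # linprog wants A_ub @ h <= 0
    V = frozenset(range(n))
    for a in range(n):                          # elemental monotonicities
        row = np.zeros(len(sets))
        row[idx[V]] += 1
        row[idx[V - {a}]] -= 1
        rows.append(-row)
    obj = np.zeros(len(sets))
    for S, coef in c.items():
        obj[idx[frozenset(S)]] = coef
    res = linprog(obj, A_ub=np.array(rows), b_ub=np.zeros(len(rows)),
                  bounds=[(0, 0)] + [(0, 1)] * (len(sets) - 1))
    return res.fun > -1e-9                      # optimum 0 <=> Shannon

# Shearer for the triangle: h(XY) + h(YZ) + h(ZX) - 2 h(XYZ) >= 0.
c = {frozenset({0, 1}): 1, frozenset({1, 2}): 1,
     frozenset({0, 2}): 1, frozenset({0, 1, 2}): -2}
print(is_shannon(3, c))                          # True
```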
To guide the algorithm, we will need an “integral” version of the lemma above. Let be any set; a multiset over is a collection whose members belong to , counted with multiplicity. The size of a multiset , denoted by , is the number of its members, counting multiplicity. If the coefficients in (14) are rational, then inequality (14) has a rational witness. By multiplying the inequality with the least common multiple of the denominators of the coefficients, we can obtain an equivalent inequality with integer coefficients. The following corollary thus follows.
Corollary 3.5.
For rational coefficients , inequality (14) holds for all polymatroids if and only if there exist multisets , , , and over , , MON, and SUB respectively, such that the following identity holds symbolically over the variables , :
(22) |
In particular, if we set , then the identity (22) becomes:
(23) |
The sizes of the multisets are bounded by functions of and .
Definition 3.6.
We call the terms in (22) and (23) statistics terms, and call the terms and witness terms. Specifically, we call terms monotonicity terms, and terms submodularity terms. After removing the common denominator in (14), the resulting inequality is called an integral Shannon inequality and has the format:
(24) |
3.3 The Reset Lemma
To conclude this section on information inequalities, we present a combinatorial lemma that plays a key role in our algorithm. The lemma says that, given a valid integral Shannon inequality (24), if we would like to remove an unconditional term from the RHS while retaining its validity, then it suffices to remove at most one term from the LHS. We may be able to remove even more terms from the RHS, in addition to , but, importantly, it suffices to remove a single term from the LHS.
Lemma 3.7 (Reset Lemma).
Consider an integral Shannon inequality (24):
Suppose some term is unconditional. Then there are two multisets and with such that the following is also an integral Shannon inequality:
Proof.
By Corollary 3.5, there exists an integral witness such that equation (23) is an identity (with set to ). We prove the lemma by induction on the “potential” quantity . The base case when is trivial. Consider the case when . Suppose , i.e. .
Since (23) is an identity, the term must cancel out with some other term. If , then we can remove from both sides (i.e. setting and ). Otherwise, there are three cases:
- Case 1: cancels with a conditional statistics term; that is, the RHS of (23) contains, besides the unconditional term, a conditional statistics term, and the two combine into a single statistics term:
(25)
We replace the two statistics terms with the combined one, decreasing the potential by 1, and proceed inductively to eliminate the resulting unconditional statistics term.
- Case 2: cancels with a monotonicity term where . In other words, , and the RHS of (23) contains the terms:
(26)
We add to , remove from , and remove from , thus decreasing the potential by 1. Then, we proceed inductively to eliminate the newly added statistics term .
- Case 3: cancels with a submodularity term where and . In particular, the RHS of (23) contains the terms:
(27)
We add to , drop from , drop from , and add a new monotonicity measure to . Overall, the potential decreases by . The proof follows by induction where we can eliminate the newly added statistics term from the RHS.
∎
We illustrate the reset lemma with the following simple examples:
Example 3.8.
Consider Shearer’s inequality , which we write in the form (24) as follows. (To be consistent with (24), we should write it as .)
(28) |
To drop the term on the RHS of (28) while retaining the validity of the inequality, we can also delete one term on the LHS (we can choose any term since they are identical), and obtain the following Shannon inequality:
The proof of the Reset Lemma “explains” how we can do this dropping by reasoning from the identity counterpart of (28) (i.e. an identity of the form (23)):
(29) |
By canceling out from both sides, we obtain a different identity:
(30) |
which is what Case 3 from the proof of the Reset Lemma does. The resulting identity witnesses the fact that is a Shannon inequality.
Example 3.9.
Consider the following Shannon inequality:
which follows from the following identity of the form (23):
Suppose that we want to drop from the RHS. The first step is going to follow Case 3 of the proof of the Reset Lemma by replacing the submodularity with an additional monotonicity , thus leading to the identity:
where our target now is to remove the term from the RHS. But now, our only option is to use Case 1 and combine with , leading to
And now, we drop from both sides. The resulting inequality is .
We will also need the following simple proposition to explain a step in PANDA.
Proposition 3.10.
Consider an integral Shannon inequality of the form (24). The number of unconditional statistics terms in (counting multiplicities) is at least .
4 Overview of PANDA and statement of main result
This section states and explains the main result in this paper, which shows that the Shannon inequality (14) is not only useful for bounding the output size of a DDR as shown in Theorem 3.1, but it can also be used to guide an algorithm to evaluate a DDR in time proportional to the bound (15), modulo a polylogarithmic factor. We formally state this result in Section 4.1, and present an illustrative example in Section 4.2. The detailed algorithm description and its proof of correctness are presented in Section 5.
4.1 An Efficient Algorithm to Evaluate Disjunctive Datalog Rules
Given the coefficients and of inequality (14) (where recall ), we denote by:
(31) |
Theorem 3.1 implies that, if satisfies the degree constraints (see Sec 2.5), i.e. , then there exists a feasible output that satisfies . Our main result states that such an output can be computed in time $\tilde O$ of this bound, provided that inequality (14) is a Shannon inequality. (We use $\tilde O$ to hide a polylogarithmic factor in the input size.)
Theorem 4.1 (The PANDA algorithm for a DDR).
Given the following inputs:
• A disjunctive datalog rule (DDR) of the form (5).
• An input instance to the DDR.
• Degree constraints that are satisfied by the instance , i.e. .
• Coefficients , and , with , that define a valid Shannon inequality (14).
Let be the bound defined in (31), in terms of the statistics and the coefficients . Then, we can compute a feasible output to the DDR in time . This feasible output is of size .
We will prove the theorem over the next few sections. We illustrate with a simple example:
Example 4.2.
Continuing the example in Eq. (7), assume that . Consider the following Shannon inequality, proved by applying two submodularities:
Rearranging it to the form (14), we obtain:
(32) |
Theorem 3.1 proves that there exists a feasible solution of size , while Theorem 4.1 says that we can compute a feasible solution of size in time .
Trivially, the theorem (and PANDA) can be used to compute a full conjunctive query, because a full conjunctive query is a special case of a DDR where the output schema consists of a single atom containing all variables. After computing a feasible output of the DDR, one can semijoin-reduce it with every relation in the body to obtain the answer to the full conjunctive query.
Corollary 4.3 (The PANDA algorithm for a full conjunctive query).
Suppose that the DDR from Theorem 4.1 corresponds to a full conjunctive query , i.e. the output schema of the DDR consists of a single atom containing all variables , in which case the DDR (5) collapses back to (11), where . Moreover, suppose that the conditions of Theorem 4.1 are satisfied. Then, the output of the full conjunctive query satisfies , and can be computed in time .
For the best-possible runtime, we want to seed PANDA with parameters minimizing the quantity (or equivalently, ), which is the polymatroid bound (17) for the query . In the case of full CQs, there is only one -parameter, set to , and it remains to find minimizing . The minimum value is , defined by
The second equality follows from simple duality arguments. PANDA would be worst-case optimal if is tight, in the sense that we can construct a database instance that satisfies the degree constraints and has output size . There are two special cases where is known to be tight in data complexity: when all degree constraints are “simple” (see [2, 32]), or when the degree constraints are “acyclic” (see [5, 27]). The degree constraints are simple if for all , and they are acyclic if there is a global order of the variables in such that for every , all variables in precede all variables in in the order. If contains only cardinality constraints, then it is both simple and acyclic, and is exactly the AGM bound because (14) reduces back to (13).
4.2 Example: Preview of PANDA
We illustrate the algorithm with the DDR (7):
Following Example 4.2, we assume we only have cardinality statistics / constraints:
We further assume that is a power of 2. Each constraint has a “guard”, i.e. a relation that “sponsors” (satisfies) the constraint: . We name all guards to be consistent with the formal description of the algorithm presented in Section 5.
Suppose that the input Shannon inequality given is the one in Eq (32). This inequality is in the shape of (14) with and . The fact that this is a Shannon inequality was shown above in Example 4.2.
From Corollary 3.5, the Shannon inequality (32) must have a corresponding identity (23) over symbolic variables where . We show one such identity below. (Our witness is not elemental, because is not elemental: if we replaced the latter with , then we would obtain an elemental witness. Also note that the witness is not unique; for example, we could have used the witness .)
(33) |
The bound from (31) is where , or equivalently, ; this is the runtime budget the algorithm cannot exceed, in order to compute a feasible solution to the DDR.
Our algorithm operates by observing and manipulating identity (33), while maintaining its identity invariant. Every step of the algorithm applies a modification to the identity and mirrors this modification with an algorithmic computation to create one or more sub-problems. Each sub-problem is then equipped with the newly created identity and the newly created input data, and carries on the next step on its own. The spawned sub-problems form a (sub-problem) tree. In the end, we gather results from the leaves of this tree to answer the original query.
In identity (33), the statistics terms , , and correspond to input data (input relations) , , and . When we modify the identity, we will be transforming some subsets of these statistics terms into different statistics terms; correspondingly, the input relations are manipulated to create new input relations that are associated with the new statistics terms. We next describe this idea in more detail. The steps of the algorithm are also shown in Figure 1.
In the first step, we grab an unconditional statistics term (from ) in (33). Let’s say the term we grab is . Since (33) is an identity, there must be another term that cancels out the symbolic variable . In this case, it is ; so we combine with the canceling term to obtain new statistics terms:
This rewriting-by-cancellation changed our identity (33) into a new one:
(34) |
The change amounts to replacing a statistics term with two new ones: . We mimic this symbolic change with an actual relational change that shall alter the structure of the problem. The rough idea is to create inputs to the new problem by creating two new guards for the two new statistics terms:
• The guard for will be the table .
• The guard for is obtained by using to construct a “dictionary” (a lookup table) which supports the following operation: given a tuple , return the list of all -values such that . (Note that this operation does not use the given -value.) We denote this dictionary as .
The aim is for us to be able to join all guards (corresponding to the four statistics terms on the RHS of (34)) to answer the same query as before. However, implementing this idea directly does not work, because the new degree constraint for is and for is , and they are too large: the product can be much greater than the original statistics of (guarding ) that we started with. If this product is too large, we cannot apply induction to solve the sub-problem within our time budget of .
The reason this product is too large is that counts both high-degree and low-degree -values, while the new statistics is the maximum degree. Thus, the product is an over-estimate of the size of the join. To remedy the situation, we uniformize the problem by partitioning into a logarithmic number of sub-relations , where each sub-relation contains tuples whose -values have similar degrees. In effect, uniformization is an algorithmic technique for dealing with skew, a notorious reason why query plans blow up in practice.
Concretely, the partitioning is done as follows. Relation contains all tuples where:
We set , thus . We further partition into two equal-sized buckets (which we will continue to name “buckets ”, with some abuse). The two new guards and have statistics and , which (by our partition-into-two trick) satisfy the following:
To be consistent with the notation used in describing PANDA, all guards will be called : in particular, the two new guards and are called and respectively.
After partitioning, we now have sub-problems, each equipped with identity (34). In the second step, we again grab an unconditional statistics term , and find a term from (34) to cancel it. (Note how “greedy” the algorithm is; this is one of the reasons for the large polylog factor it suffers.) This time, the cancellation is:
leading to a new identity
(35) |
The change is suggestive of a join, where for the th subproblem, we will join the corresponding guards : . Recall that ; hence, for , performing this join will not exceed the budget of . On the other hand, when , we will need an additional idea: use the reset lemma (Lemma 3.7) to reroute the sub-problem away from the “heavy-hitter hotspot”, i.e. the join over the high-degree -values. In addition to uniformization, the reset lemma is our second ingredient for dealing with skew.
Concretely, for the th sub-problem, we do the following:
• When , then PANDA performs the join . The output size of this join is within the bound . After computing this join, there will be no more sub-problems, because we have computed a relation that fits the output schema; namely, it corresponds to . We add tuples from to the output relation .
• In the case where , the algorithm attempts to perform the same join , but its output size now exceeds the bound . Therefore, the algorithm does not compute a guard for , but instead uses the reset lemma to cancel out this term with the term on the LHS. The new identity (for these sub-problems) is now
We again grab an unconditional statistics term and cancel it with :
The guard for has size , thanks to the fact that and . The above step replaces with a new statistics term . Its guard is computed from , by extending it into a dictionary : given , this dictionary returns all -values for which . In particular, this silly dictionary always returns the entire table , no matter what the given are. (To be more precise, before PANDA extends into , it will “uniformize” by partitioning the table into buckets based on the degree . However, this partition is vacuous since only one bucket will be non-empty.) After the dictionary extension, the algorithm performs a join . This join is feasible since the output size is within the bound . Now, the algorithm reaches a terminal node since the join result is in the output schema; namely, it corresponds to .
At the end of the algorithm, the tables from all branches are unioned into the output relation , while tables from all branches are unioned into the other output relation . These two relations are the final output of the algorithm for the DDR (7).
Note that, for the particular query (7), there is a way to compute a feasible output without the polylog factor incurred above; the above strategy is only meant to illustrate the PANDA algorithm in its full generality.
5 Detailed Description of PANDA
This section describes the PANDA algorithm in detail. Section 5.1 presents the main data structures (called tables and dictionaries) used by the algorithm. Section 5.2 presents the algorithm, and Section 5.3 proves its correctness and analyzes its runtime.
5.1 Tables and Dictionaries
For a given statistics term , PANDA uses two kinds of data structures: tables and dictionaries, denoted . When , then we call it a table; otherwise, we call it a dictionary. There will be at most one table/dictionary for a given . As usual, we abbreviate with just , and statistics with just and . Specifically, a table is a set of tuples over the variables, and a dictionary is a function that gives a set of tuples over the variables given a specific binding of the -variables.
For a statistics term , each table/dictionary is associated with a statistics . We say that satisfies the statistics, and we write , iff for all . As a special case, a table satisfies a statistics iff .
The algorithm performs the following operations on tables and dictionaries: join, projection, extension, construction, and partition. Each operation yields the corresponding statistics on the results, as described below.
• Join of a table with a dictionary, . This operation constructs a new table with statistics . The join takes time .
• Projection of a table, . This operation takes a table over variables , with statistics , and constructs a new table with statistics . The projection takes time .
• Extension of a dictionary into another dictionary , where is disjoint from . This operation takes as input a dictionary and returns a new dictionary defined as for all . Its statistics is . This operation takes time, because the operation does not touch the data, but only constructs a new function that calls the existing function .
• Construction. Given a table over variables with statistics , construct a dictionary , with statistics . This operation takes time because it involves scanning the table and constructing an index. The operation returns a function that looks up in this index.
• Partition. Given a table over variables with statistics , partition into many sub-tables satisfying the conditions stated in Lemma 5.1 below. This operation takes time . (A small sketch of the construction and join operations appears right after this list.)
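As promised above, here is a minimal Python sketch of the construction and join operations, in our own simplified encoding (hash maps stand in for the dictionaries; all names are ours):

```python
from collections import defaultdict

def construct(table, x_cols, y_cols):
    """Construction: index a table on its X-columns, producing a dictionary
    that maps an X-binding to the list of Y-projections occurring with it."""
    d = defaultdict(list)
    for t in table:
        d[tuple(t[i] for i in x_cols)].append(tuple(t[i] for i in y_cols))
    return d

def join(table, x_cols, dictionary):
    """Join of a table (guarding an unconditional term) with a dictionary
    (guarding a conditional term): one dictionary lookup per table tuple.
    The output size is at most |table| times the dictionary's max degree."""
    return [t + y
            for t in table
            for y in dictionary.get(tuple(t[i] for i in x_cols), [])]

S = [(2, 3), (2, 4), (5, 6)]              # S(y, z)
d = construct(S, [0], [1])                # a dictionary for "z given y"
R = [(1, 2), (7, 8)]                      # R(x, y)
print(join(R, [1], d))                    # [(1, 2, 3), (1, 2, 4)]
```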
Lemma 5.1.
Let be a table over variables with statistics . Then, can be partitioned into at most sub-tables satisfying
(36) |
Proof.
To obtain the sub-tables , observe that the number of tuples with -degree in the interval is at most . Hence, if we partition based on which of the buckets the -degree falls into, we will almost attain the required inequality, just off by a factor of :
To get rid of the factor of , we partition each into two tables whose projections onto are equally sized. Overall, we need partitions. ∎
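The uniformization step can be sketched as follows (our own rendering in Python; we omit the final split of each bucket into two equal halves that the proof uses to remove the extra factor of 2):

```python
from collections import defaultdict
import math

def uniformize(table, x_cols):
    """Partition `table` into O(log |table|) buckets by X-degree: bucket j
    holds the tuples whose X-binding has degree in [2^j, 2^(j+1))."""
    groups = defaultdict(list)
    for t in table:
        groups[tuple(t[i] for i in x_cols)].append(t)
    buckets = defaultdict(list)
    for binding, ts in groups.items():
        j = int(math.log2(len(ts)))         # degree bucket of this binding
        buckets[j].extend(ts)
    return buckets

T = [(1, v) for v in range(8)] + [(2, 0), (3, 0), (3, 1)]
for j, ts in sorted(uniformize(T, [0]).items()):
    nx = len({t[0] for t in ts})
    print(f"bucket {j}: {len(ts)} tuples, {nx} x-values, degree < {2**(j+1)}")
```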
5.2 Algorithm
This section describes PANDA and proves Theorem 4.1. The main input to PANDA contains an input instance , an output schema , and statistics satisfied by the input instance, i.e. as defined in Section 2.5. In addition, we are also given the coefficients and for which (14) is a Shannon inequality. From Corollary 3.5, we can equivalently assume that PANDA was given the multisets for which identity (23) holds. Both the algorithm and the analysis will be described in terms of these multisets.
We shall show that a feasible output to the DDR (5) can be computed in time , defined in (31). In terms of these multisets, the bounds defined in (31) have the following equivalent expressions:
(37) |
For each statistics term , there is a guard which is a table/dictionary instance with statistics , i.e. as defined in Section 5.1. Creating the initial guards needs to be done offline, because the cost of creating them may exceed . We do not include this in the cost of the algorithm. We adopt the convention that lower-case letters represent the logarithms of the corresponding upper-case quantities.
Summarized in Algorithm 1, PANDA works as follows. Starting from an initial node, it grows a tree of sub-problems. New sub-problems are spawned from a “non-terminal” leaf node of the tree. The process stops when every leaf node is terminal, a concept we shall define shortly. After the tree growing stops, a feasible output of the problem is gathered from the leaves of the tree.
Every node is associated with a sub-problem parameterized by a tuple where:
• The multisets , and are over the base sets , , MON, and SUB, respectively. These multisets maintain the invariant that identity (23) holds, as we shall show in the next section.
• is a collection of non-negative real numbers (logs of statistics), one statistics for each . (Recall that , for convenience.)
• is a collection of dictionaries, one dictionary for each ; these shall be the guards for . In particular, for each .
Definition 5.2 (Terminal leaf).
A leaf node of the tree is “terminal” (it will not spawn new sub-problems) if its parameters are such that there is an unconditional statistics term for which .
We next briefly explain in words the key steps of Algorithm 1. A proof that the algorithm is correct and an analysis of its runtime to show Theorem 4.1 are described in Section 5.3. The main loop looks for a non-terminal leaf node of the sub-problem tree. If there is no such leaf node, then the code-block starting at line 36 gathers the results from the parameters of the (terminal) leaf nodes to construct the final output .
If there is a non-terminal leaf node , then we pick an arbitrary unconditional statistics term (line 5) (we shall show in Section 5.3 that an unconditional one exists) and start considering cases in parallel with the cases in the proof of Lemma 3.7, except that Case 3 will be handled differently. The major insight of our work is that we can design an algorithm where the algorithmic steps mirror the Shannon inequality inference steps:
- Case 1: Join or Reset. There exists a statistics term (line 6). If the statistics are such that the join-size of is smaller than the budget , checked at line 7, then we compute the join result as shown, and let it guard the new statistics term . This case corresponds precisely to applying Eq. (25), and the new multiset defined in line 9 reflects that. After joining and to form , we remove the old dictionaries from and add the new dictionary to . Similarly, the statistics in are replaced in the same way.
On the other hand, if the join is larger than the budget, then we perform a reset step. The Shannon inequality reasoning is as follows. Initially holds, with witness terms and . After applying the change in Eq. (25), the inequality is still a Shannon inequality with the same witness. (Note that is defined in line 13.) Now, we apply Lemma 3.7 to drop from , obtaining new parameters to make progress in our algorithm. We remove from and the terms that were dropped from .
- Case 2: Projection. There exists a monotonicity term that cancels ; this corresponds to applying Eq. (26): we compute a projection of the guard of , and let the projected table guard the newly created statistics term .
- Case 3: Partition. There exists a submodularity measure with (line 24). In this case, instead of applying identity (27) from Lemma 3.7, we apply the following identity to maintain the Shannon inequality:
(38)
Identity (38) says that we have two new statistics terms, and , which need guards; this is where we will need to create a logarithmic number of sub-problems in order to create the guards of the right statistics.
In particular, the algorithm performs a “partitioning step”: it uniformizes by applying Lemma 5.1. Each of the sub-problems has the corresponding statistics as defined in the for-loop starting at line 26. The table is on variables . Thus, in order to have it guard the new term , we will need to apply a dictionary extension to it (see Section 5.1). Note that (36) guarantees that .
5.3 Proof of correctness and runtime analysis
In this section, we prove that the algorithm is correct and analyze its runtime, completing the proof of Theorem 4.1. Both tasks are accomplished by showing that PANDA maintains the invariants described below.
Recall that every sub-problem in the tree is parameterized by a tuple . For any sub-problem that is parameterized by , define
to be the join of all its dictionary guards. At time in the algorithm, let denote the set of current leaf nodes in the sub-problem tree. While spawning new sub-problems, (we will show that) PANDA maintains the following invariants over all tuples : (Recall that for all .)
(39) Shannon inequality: identity (23) holds w.r.t. the tuple .
(40) Non-empty output: .
(41) Lossless join: .
(42) Small tables: .
(43) Guarded statistics: .
(44) Upper bound: .
We start by showing that, if the invariants hold, then the answer is correct with the desired runtime.
Proposition 5.3.
Let . Suppose the invariants above are satisfied. Then PANDA returns a feasible solution to the input disjunctive datalog rule in time , where is the input multiset of submodularity measures.
Proof.
For a given non-terminal leaf , the number of unconditional statistics terms in is at least . This follows from Proposition 3.10 and invariant (39). Invariant (40) ensures that , and thus the unconditional statistics term exists for line 5 of the algorithm to proceed. Thanks to invariant (39), at least one of the three cases considered in the main loop must hold. Invariant (43) guarantees that we can proceed to perform the join or the partition steps using corresponding tables / dictionaries when we need them in the algorithm. In summary, the body of the main loop can proceed as explained without getting stuck somewhere.
For every node in the sub-problem tree that the algorithm creates, with parameter tuple , define the “potential” quantity as in the proof of Lemma 3.7. Then, similar to what happens in the lemma’s proof, the potential of the child is at least one less than the potential of the parent. (Unlike Case 3 of Lemma 3.7, where applying Eq. (27) reduces by one and increases by one, in Case 3 of the algorithm we apply Eq. (38), which reduces by one and increases by one. Note also that the else branch of Case 1 (line 12) reduces the potential by at least one, and applying the reset lemma in line 14 never increases the potential.) Thus, the depth of the sub-problem tree is bounded by the original potential . This proves that the algorithm terminates.
The time spent within each node of the tree is dominated by either the join step in Case 1, the projection in Case 2, or the partition step in Case 3 whose cost is . In Case 2 and Case 3, the cost is bounded by , thanks to invariant (42). In Case 1, the join is only computed if , thus it is also within the budget of . Overall, the total time spent on all the nodes of the tree is bounded by times the number of nodes. As we observed above, the depth of the tree is at most . Every node has a fanout of either , or . Every time the fanout is more than , the number of submodularity measures in is reduced by . Thus, the total runtime in data complexity is .
Last but not least, we prove that the answer computed starting at line 36 is correct, which means, for every tuple , there must exist for which ; recall Definition 2.2. This property holds thanks to invariant (41). Let denote the set of all final leaf nodes in the sub-problem tree. Then, every belongs to for some . Since is a terminal leaf node, there must exist such that , and so thanks to line 40 of Algorithm 1. ∎
Proof (that the invariants are maintained).
We verify that every invariant from (39) to (44) holds, one by one, by induction. Initially, only invariant (42) may fail, because some input tables may be larger than the desired size bound. We deal with this situation by repeatedly applying the reset lemma, as is done in the else branch of Case 1 (line 12), dropping input tables that are too large. After this pre-processing step ensures that all invariants are satisfied initially, we verify that they remain satisfied by induction.
Invariant (39) is guaranteed by constructing the multisets for the sub-problems while keeping identity (23) intact. This is easy to verify in all three cases, as we apply Eqs. (25), (26), and (38), or Lemma 3.7.
For invariant (40), the only place where the relevant quantity is changed is at line 16, which is reached only in the corresponding sub-case of Case 1. Since inequality (44) holds at the previous iteration, the updated quantity remains non-negative. Thus, when applying Lemma 3.7, we end up with a sub-problem that still satisfies (40), which means its required output is not empty.
6 Answering Conjunctive Queries in Submodular-Width Time
The main aim of this section is to explain how PANDA can be used to compute a conjunctive query in time given by its submodular width (plus the output size). Recall the definition of a conjunctive query in Equation (3). In Marx's work [25], the submodular width was defined only for Boolean conjunctive queries (i.e., the set of free variables is empty) in which all input relations have the same size $N$. We begin by generalizing this notion to the case where the set of free variables is arbitrary and the inputs satisfy given degree constraints, which are much more general than a single relation size.
6.1 Width parameters for conjunctive queries under degree constraints
Given a conjunctive query $Q$ of the form (3), let $\mathsf{TD}(Q)$ denote the set of all free-connex tree decompositions of the query. (For a fixed query, there are at most doubly exponentially many tree decompositions, since any two trees that have the same set of bags can be considered equal for our purposes. See Section 2.2 for the definition of a free-connex tree decomposition.) Let $\Delta$ be a set of degree constraints, as defined in Sec. 2.5. As usual, we denote by $n_\sigma$, for $\sigma\in\Delta$, the associated log-degree constraints. We say that a polymatroid $h$ satisfies the constraints, and write $h \models \Delta$, if $h(Y \mid X) \le n_{Y\mid X}$ for all $(X, Y, N_{Y\mid X}) \in \Delta$.
Definition 6.1.
The degree-aware fractional hypertree width and the degree-aware submodular width of $Q$ under the degree constraints $\Delta$ are:
$$\mathsf{fhtw}(Q,\Delta) \;:=\; \min_{(T,\chi)\in\mathsf{TD}(Q)}\ \max_{h\in\Gamma_n,\ h\models\Delta}\ \max_{t\in V(T)} h(\chi(t)) \tag{45}$$
$$\mathsf{subw}(Q,\Delta) \;:=\; \max_{h\in\Gamma_n,\ h\models\Delta}\ \min_{(T,\chi)\in\mathsf{TD}(Q)}\ \max_{t\in V(T)} h(\chi(t)) \tag{46}$$
(Note that the constraint $h\in\Gamma_n,\ h\models\Delta$ requires $h$ to both be a polymatroid and satisfy the degree constraints.)
Remark 6.2.
Eqs. (45) and (46) collapse back to the standard definitions of the fractional hypertree width [20] and submodular width [25], respectively, when the degree constraints only contain cardinality constraints of the form $|R| \le N$ for a single number $N$ that represents the input size; see [6] for a proof. Note, however, that in the standard definitions of fhtw and subw, the base of the log function was $N$, the input size, and thus runtimes were stated in the form $\tilde O(N^{\mathsf{fhtw}})$ and $\tilde O(N^{\mathsf{subw}})$. In our generalization, the base of the log function is $2$, and thus runtimes are stated in the form $\tilde O(2^{\mathsf{fhtw}})$ and $\tilde O(2^{\mathsf{subw}})$.
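As a sanity check that the two conventions agree (a one-line reconciliation; $w$ here denotes the classical, dimensionless width value, a symbol of ours):
$$\mathsf{subw}(Q,\Delta) \;=\; w\cdot\log_2 N \quad\Longrightarrow\quad \tilde O\big(2^{\mathsf{subw}(Q,\Delta)}\big) \;=\; \tilde O\big(2^{w\log_2 N}\big) \;=\; \tilde O\big(N^{w}\big).$$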
It is straightforward to use PANDA to answer a conjunctive query in time $\tilde O(2^{\mathsf{fhtw}(Q,\Delta)})$, but we will only describe the algorithm for the submodular width. The key advantage of fhtw over subw is that we can answer sum-product queries over any semiring in time $\tilde O(2^{\mathsf{fhtw}(Q,\Delta)})$ using variable elimination [4], since the query plan involves only one (optimal) tree decomposition.
The two definitions (45) and (46) differ only in the order of the first two operations: $\min\max$ vs. $\max\min$. It is easy to see that $\mathsf{subw}(Q,\Delta)\le\mathsf{fhtw}(Q,\Delta)$, as it follows immediately from the max-min inequality; more precisely, in any complete lattice, $\max\min \le \min\max$ holds. As mentioned above, the degree-aware submodular width generalizes the submodular width by considering richer sets of statistics on the input data. The original definition assumed only cardinality constraints: there is one cardinality constraint for each input relation, and they are all equal. In that case, both subw and fhtw are finite. In our generalization, subw can be $\infty$, for example when $\Delta = \emptyset$. When no confusion arises, we will simply call fhtw and subw the fractional hypertree width and submodular width, dropping the term degree-aware.
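For completeness, here is the standard max-min argument behind $\mathsf{subw}\le\mathsf{fhtw}$, stated for an arbitrary function $f(h,T)$ (the notation is generic, not specific to this paper): for every $h_0$ and $T_0$,
$$\min_{T} f(h_0,T) \;\le\; f(h_0,T_0) \;\le\; \max_{h} f(h,T_0);$$
taking the maximum over $h_0$ on the left and then the minimum over $T_0$ on the right yields $\max_h \min_T f(h,T) \le \min_T \max_h f(h,T)$.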
Example 6.3 (fhtw and subw of the 4-cycle).
Consider the Boolean 4-cycle query:
$$Q() \text{ :- } R_1(X_1,X_2)\wedge R_2(X_2,X_3)\wedge R_3(X_3,X_4)\wedge R_4(X_4,X_1)$$
Suppose we have input cardinality statistics $|R_i|\le N$ for all $i\in[4]$. For instance, in the query where all input relations are the edge set of a graph, $N$ is the number of edges. We will show that $\mathsf{fhtw} = 2\log N$ and $\mathsf{subw} = \frac{3}{2}\log N$ for this query.
To show the lower bound on the fhtw, note that, in any tree decomposition, there must be at least one bag that contains some three consecutive variables on the cycle. Fix such a bag $B$; after renaming the variables, we may assume $\{X_1,X_2,X_3\}\subseteq B$. Consider the polymatroid defined by:
$$h(S) \;=\; \log N\cdot \big|S\cap\{X_1,X_3\}\big| \qquad \text{for all } S\subseteq\{X_1,\dots,X_4\}.$$
It is straightforward to verify that this is a polymatroid with $h\models\Delta$ and $h(B)\ge 2\log N$.
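To make the verification explicit (a routine check, spelled out here for convenience): the map $S\mapsto |S\cap\{X_1,X_3\}|$ is modular, i.e.,
$$h(S\cup T) + h(S\cap T) \;=\; h(S) + h(T) \qquad \text{for all } S,T,$$
so $h$ is in particular submodular; it is monotone with $h(\emptyset)=0$; every edge of the cycle contains exactly one of $X_1,X_3$, so $h(X_iX_{i+1}) = \log N$ and hence $h\models\Delta$; and $h(B)\ge h(X_1X_2X_3) = 2\log N$.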
To prove the upper bound on subw, consider any polymatroid $h$ with $h\models\Delta$. Let $\delta\ge 0$ be a parameter to be determined. Consider two cases.
• There exists $i$ for which $h(X_i)\le\delta$. Without loss of generality, assume $h(X_1)\le\delta$. Consider the tree decomposition with bags $\{X_1,X_2,X_3\}$ and $\{X_1,X_3,X_4\}$. For every bag $B$ of this decomposition, we have
$$h(B) \;\le\; h(X_1) + h\big(B\setminus\{X_1\}\big) \;\le\; \delta + \log N.$$
• $h(X_i) > \delta$ for all $i$. Consider the same tree decomposition. From submodularity,
$$h(X_1X_2X_3) \;\le\; h(X_1X_2) + h(X_2X_3) - h(X_2) \;\le\; 2\log N - \delta,$$
and symmetrically $h(X_1X_3X_4) \le h(X_3X_4) + h(X_4X_1) - h(X_4) \le 2\log N - \delta$.
Setting $\delta = \frac{1}{2}\log N$ to balance the two cases, we conclude that $\mathsf{subw} \le \frac{3}{2}\log N$.
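The case analysis above can be cross-checked numerically. The following is a minimal sketch (assuming scipy is available; all variable and helper names are ours) that maximizes $\min(h(B_1),h(B_2))$ over the Shannon cone $\Gamma_4$ for each of the four bag pairs arising from the two decompositions, with $\log N$ normalized to $1$. Since $\min(\max(a,b),\max(c,d)) = \max$ over pairs of $\min$, the largest of the four LP values is the submodular width (in units of $\log N$).

```python
# Numerical check of subw for the 4-cycle: optimize over the Shannon cone
# Gamma_4 (elemental inequalities), with every edge constrained to log N = 1.
from itertools import combinations
from scipy.optimize import linprog

V = [1, 2, 3, 4]
SETS = [frozenset(s) for k in range(5) for s in combinations(V, k)]
IDX = {S: i for i, S in enumerate(SETS)}   # variable index of h(S)
T = len(SETS)                              # 16 h-variables, plus 1 for t

def row(coeffs):
    """Dense LP row over the variables (h(S) for S in SETS, then t)."""
    r = [0.0] * (T + 1)
    for S, c in coeffs:
        r[IDX[frozenset(S)]] += c
    return r

A_ub, b_ub = [], []
# Elemental Shannon inequalities, written as "<= 0":
for i in V:                                # monotonicity: h(V\{i}) - h(V) <= 0
    A_ub.append(row([(set(V) - {i}, 1.0), (V, -1.0)])); b_ub.append(0.0)
for i, j in combinations(V, 2):            # submodularity (elemental form)
    rest = set(V) - {i, j}
    for k in range(len(rest) + 1):
        for K in combinations(rest, k):
            K = set(K)
            A_ub.append(row([(K | {i, j}, 1.0), (K, 1.0),
                             (K | {i}, -1.0), (K | {j}, -1.0)]))
            b_ub.append(0.0)
for e in [{1, 2}, {2, 3}, {3, 4}, {4, 1}]:  # cardinality: h(edge) <= 1
    A_ub.append(row([(e, 1.0)])); b_ub.append(1.0)

c = [0.0] * T + [-1.0]                      # maximize t
GROUP1 = [{1, 2, 3}, {1, 3, 4}]             # bags of the first decomposition
GROUP2 = [{2, 3, 4}, {1, 2, 4}]             # bags of the second decomposition
best = 0.0
for B1 in GROUP1:
    for B2 in GROUP2:
        extra = [row([(B1, -1.0)]), row([(B2, -1.0)])]
        for r in extra:
            r[T] = 1.0                      # t - h(B) <= 0
        res = linprog(c, A_ub=A_ub + extra, b_ub=b_ub + [0.0, 0.0],
                      bounds=[(0, None)] * (T + 1), method="highs")
        print(sorted(B1), sorted(B2), "->", -res.fun)
        best = max(best, -res.fun)
print("subw (in units of log N):", best)    # expected: 1.5
```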
6.2 Achieving submodular width runtime with PANDA
Before explaining how PANDA can be used to achieve the submodular width runtime, we need a technical lemma.
Lemma 6.4.
Let $\mathcal{B}$ denote a finite collection of subsets of $[n]$; note that $\mathcal{B}$ is not a multiset here. Let $\Delta$ denote the given input degree constraints. If the following quantity is finite:
$$\mathsf{opt} \;:=\; \max_{h\in\Gamma_n,\ h\models\Delta}\ \min_{B\in\mathcal{B}} h(B) \tag{47}$$
then we can compute coefficients $(\lambda_B)_{B\in\mathcal B}$ and $(\delta_\sigma)_{\sigma\in\Delta}$ such that the following are satisfied:
(a) $\lambda_B\ge 0$ for all $B$, $\delta_\sigma\ge 0$ for all $\sigma$, and $\sum_{B\in\mathcal B}\lambda_B = 1$,
(b) the inequality $\sum_{B\in\mathcal B}\lambda_B\, h(B) \;\le\; \sum_{\sigma=(X,Y,N_\sigma)\in\Delta}\delta_\sigma\, h(Y\mid X)$ is a Shannon inequality,
(c) $\sum_{\sigma\in\Delta}\delta_\sigma\, n_\sigma \;\le\; \mathsf{opt}$.
Proof.
Let $\Gamma_n$ denote the (polyhedral) set of all polymatroids over $[n]$. We write opt in a slightly different form, where we introduce a new unconstrained variable $w$ to replace the inner $\min$:
$$\mathsf{opt} \;=\; \max\big\{\, w \;:\; h\in\Gamma_n,\ h\models\Delta,\ w\le h(B)\ \text{for all } B\in\mathcal B \,\big\} \tag{48}$$
Introduce a Lagrangian multiplier $\lambda_B$ for every constraint $w\le h(B)$, and $\delta_\sigma$ for every constraint $h(Y\mid X)\le n_\sigma$. The Lagrange dual function is
$$g(\lambda,\delta) \;=\; \max_{h\in\Gamma_n,\ w}\ \Big( w\big(1-\textstyle\sum_B\lambda_B\big) \;+\; \sum_B \lambda_B\, h(B) \;-\; \sum_\sigma \delta_\sigma\, h(Y\mid X) \;+\; \sum_\sigma \delta_\sigma\, n_\sigma \Big).$$
Let $\lambda^*$ and $\delta^*$ denote an optimal solution to the Lagrangian dual problem $\min_{\lambda\ge 0,\ \delta\ge 0} g(\lambda,\delta)$; then, by strong duality (which holds because the problem is linear), $\mathsf{opt} = g(\lambda^*,\delta^*)$.
From the assumption that opt is finite, it follows that $\sum_B\lambda^*_B = 1$, because $w$ is unconstrained. Furthermore, if there is any polymatroid $h$ for which $\sum_B\lambda^*_B\, h(B) - \sum_\sigma\delta^*_\sigma\, h(Y\mid X) > 0$, then $g(\lambda^*,\delta^*)$ is unbounded, because any positive multiple of a polymatroid is a polymatroid. Thus, condition (b) is satisfied. Furthermore, as the $h$-dependent expression inside the $\max$ is then non-positive, the maximum it can achieve is $\sum_\sigma\delta^*_\sigma\, n_\sigma$, attained with $h=\mathbf 0$. Consequently, $\lambda^*$ and $\delta^*$ satisfy the three conditions (a), (b), and (c) above, and we can compute them with standard linear programming algorithms. ∎
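As an illustration of how the coefficients of the lemma can be extracted in practice, here is a minimal sketch (assuming scipy ≥ 1.7 with the HiGHS backend, which exposes dual values; all names are ours). It solves the primal LP (48) for the two-bag collection $\mathcal B = \{X_1X_2X_3,\ X_2X_3X_4\}$ on four variables, with the four cycle edges as cardinality constraints normalized to $1$, and reads $\lambda$ and $\delta$ off the dual solution:

```python
# Extracting the Lemma 6.4 coefficients from the dual of the primal LP (48).
from itertools import combinations
from scipy.optimize import linprog

V = [1, 2, 3, 4]
SETS = [frozenset(s) for k in range(5) for s in combinations(V, k)]
IDX = {S: i for i, S in enumerate(SETS)}
T = len(SETS)                                   # h-variables, then w

def row(coeffs, w=0.0):
    r = [0.0] * (T + 1); r[T] = w
    for S, c in coeffs:
        r[IDX[frozenset(S)]] += c
    return r

A_ub, b_ub = [], []
BAGS = [{1, 2, 3}, {2, 3, 4}]
for B in BAGS:                                  # rows 0..1:  w - h(B) <= 0
    A_ub.append(row([(B, -1.0)], w=1.0)); b_ub.append(0.0)
for i in V:                                     # monotonicity
    A_ub.append(row([(set(V) - {i}, 1.0), (V, -1.0)])); b_ub.append(0.0)
for i, j in combinations(V, 2):                 # submodularity
    rest = set(V) - {i, j}
    for k in range(len(rest) + 1):
        for K in combinations(rest, k):
            K = set(K)
            A_ub.append(row([(K | {i, j}, 1.0), (K, 1.0),
                             (K | {i}, -1.0), (K | {j}, -1.0)]))
            b_ub.append(0.0)
EDGES = [{1, 2}, {2, 3}, {3, 4}, {4, 1}]
deg_start = len(A_ub)                           # degree rows start here
for e in EDGES:
    A_ub.append(row([(e, 1.0)])); b_ub.append(1.0)

c = [0.0] * T + [-1.0]                          # maximize w
bounds = [(0, None)] * T + [(None, None)]       # w is unconstrained
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
# HiGHS reports nonpositive multipliers for "<=" rows of a minimization;
# negating them gives the nonnegative coefficients of the lemma.
lam = [-res.ineqlin.marginals[i] for i in range(len(BAGS))]
delta = [-res.ineqlin.marginals[deg_start + i] for i in range(len(EDGES))]
print("opt    =", -res.fun)                     # expected: 1.5
print("lambda =", lam, "sum =", sum(lam))       # item (a): sums to 1
print("delta  =", delta, "sum =", sum(delta))   # item (c): = opt (n_sigma = 1)
```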
Equipped with this tool, we are now ready to show how PANDA can be used to answer a conjunctive query in submodular width time:
Theorem 6.5.
Given a set of degree constraints $\Delta$, a conjunctive query $Q$ of the form (3) can be computed in time
$$\tilde O\big(2^{\mathsf{subw}(Q,\Delta)} + |\text{output}|\big)$$
on any database instance that satisfies the degree constraints.
Proof.
Let $(T_1,\chi_1),\dots,(T_k,\chi_k)$ be all the free-connex tree decompositions of $Q$. For every tree decomposition $T_j$ and every node $t\in V(T_j)$, create a fresh atom $B_{j,t}$ over the variables $\chi_j(t)$. In other words, every bag of every tree decomposition is associated with an atom, and the atoms of the $j$-th tree decomposition form a schema of their own. The algorithm will compute relation instances for all atoms $B_{j,t}$, $j\in[k]$, $t\in V(T_j)$, such that the tree decompositions together cover the full join of the input relations:
$$\Join_{R\in\mathrm{body}(Q)} R \;\subseteq\; \bigcup_{j=1}^{k}\ \Join_{t\in V(T_j)} B_{j,t} \tag{49}$$
From these instances, we obtain $k$ separate free-connex acyclic conjunctive queries of the form
$$Q_j(\text{free variables}) \text{ :- } \bigwedge_{t\in V(T_j)} B_{j,t}(\chi_j(t)),$$
each of which can be computed in time quasilinear in its input plus output, using Lemma 2.1. Before computing these queries, we semijoin-reduce each $B_{j,t}$ with all the input relations it covers, in order to ensure that the output of each query $Q_j$ is a subset of the full join of the input relations. This turns (49) into an equality.
It remains to show that we can compute all the instances $B_{j,t}$ satisfying (49) in time $\tilde O(2^{\mathsf{subw}(Q,\Delta)})$. Obviously, for them to be computable within this time, it is necessary that every instance has size at most $\tilde O(2^{\mathsf{subw}(Q,\Delta)})$. To this end, for every combination of nodes $\mathbf b = (t_1,\dots,t_k)\in V(T_1)\times\cdots\times V(T_k)$, we will compute a feasible output to the following DDR (whose input schema is that of the body of $Q$):
$$\bigwedge_{R\in\mathrm{body}(Q)} R \;\Rightarrow\; B_{1,t_1}(\chi_1(t_1)) \vee \cdots \vee B_{k,t_k}(\chi_k(t_k)) \qquad \text{(the $\mathbf b$-th DDR)} \tag{50}$$
In words, this DDR has one representative bag from each tree decomposition. After feasible solutions to all these DDRs are computed, we set each $B_{j,t}$ to be the union of the outputs it receives over all the DDRs in which it occurs.
We first prove that the instances defined as such satisfy property (49). Suppose there is a tuple of the full join that is not in the RHS of (49). Then, for every $j$, there exists a node $t_j\in V(T_j)$ such that the tuple's projection onto $\chi_j(t_j)$ is missing from $B_{j,t_j}$. Collect these nodes into a tuple $\mathbf b = (t_1,\dots,t_k)$; this implies that we did not compute a feasible output to the $\mathbf b$-th DDR (50), a contradiction.
Last but not least, we show that each of the DDRs (50) can be computed in time $\tilde O(2^{\mathsf{subw}(Q,\Delta)})$. Fix a tuple of nodes $\mathbf b = (t_1,\dots,t_k)$, and let $\mathcal B = \{\chi_j(t_j) \mid j\in[k]\}$ denote the set of all its bags. Define
$$\mathsf{opt}(\mathbf b) \;:=\; \max_{h\in\Gamma_n,\ h\models\Delta}\ \min_{B\in\mathcal B} h(B).$$
Then, since every bag in $\mathcal B$ is a bag of the corresponding decomposition, $\min_{B\in\mathcal B} h(B) \le \min_j \max_{t\in V(T_j)} h(\chi_j(t))$ for every $h$, and hence $\mathsf{opt}(\mathbf b) \le \mathsf{subw}(Q,\Delta)$ by (46). In particular, $\mathsf{opt}(\mathbf b)$ is finite, so by Lemma 6.4 and Theorem 4.1, PANDA computes a feasible output to the $\mathbf b$-th DDR in time $\tilde O(2^{\mathsf{opt}(\mathbf b)}) \subseteq \tilde O(2^{\mathsf{subw}(Q,\Delta)})$. ∎
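The overall accounting, made explicit (where $k$ and the product below depend only on the query, not on the data):
$$\underbrace{\prod_{j=1}^{k}|V(T_j)|}_{\#\text{DDRs}}\cdot \tilde O\big(2^{\mathsf{subw}(Q,\Delta)}\big) \;+\; \underbrace{k\cdot \tilde O\big(2^{\mathsf{subw}(Q,\Delta)} + |\text{output}|\big)}_{\text{Lemma 2.1 on the $k$ acyclic queries}} \;=\; \tilde O\big(2^{\mathsf{subw}(Q,\Delta)} + |\text{output}|\big).$$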
6.3 Example: Solving a conjunctive query in submodular width time
Consider the following query whose body is a 4-cycle:
$$Q(X_1,X_2,X_3,X_4) \text{ :- } R_1(X_1,X_2)\wedge R_2(X_2,X_3)\wedge R_3(X_3,X_4)\wedge R_4(X_4,X_1) \tag{51}$$
Suppose we only have cardinality statistics, where all input relation sizes are upper bounded by the same number $N$, i.e. $|R_i|\le N$ for all $i\in[4]$. Let $n = \log_2 N$. This query has two free-connex tree decompositions, depicted in Figure 2 (ignoring the trivial tree decomposition with a single bag):
• One with two bags $\{X_1,X_2,X_3\}$ and $\{X_1,X_3,X_4\}$.
• One with two bags $\{X_2,X_3,X_4\}$ and $\{X_1,X_2,X_4\}$.
It is not hard to see that the degree-aware fractional hypertree width of $Q$, given by (45), is exactly $2n$, and we leave this as an exercise. Next, we show that the degree-aware submodular width is $\frac32 n$. In particular, Eq. (46) for this query becomes:
$$\mathsf{subw}(Q,\Delta) \;=\; \max_{h\in\Gamma_4,\ h\models\Delta}\ \min\Big(\max\big(h(X_1X_2X_3),\,h(X_1X_3X_4)\big),\ \max\big(h(X_2X_3X_4),\,h(X_1X_2X_4)\big)\Big)$$
By distributing the $\min$ over the two inner $\max$ operators, and then swapping the resulting outer $\max$ with $\max_h$, we get that $\mathsf{subw}(Q,\Delta)$ is the maximum of the following four quantities:
$$\max_{h\in\Gamma_4,\ h\models\Delta}\ \min\big(h(X_1X_2X_3),\ h(X_2X_3X_4)\big) \tag{52}$$
$$\max_{h\in\Gamma_4,\ h\models\Delta}\ \min\big(h(X_1X_2X_3),\ h(X_1X_2X_4)\big) \tag{53}$$
$$\max_{h\in\Gamma_4,\ h\models\Delta}\ \min\big(h(X_1X_3X_4),\ h(X_2X_3X_4)\big) \tag{54}$$
$$\max_{h\in\Gamma_4,\ h\models\Delta}\ \min\big(h(X_1X_3X_4),\ h(X_1X_2X_4)\big) \tag{55}$$
Note that each of the expressions (52)–(55) has the same format as the optimization problem (47) in Lemma 6.4 and is equivalent to the linear program (48). Let's take the first expression (52); the other three are similar. For this expression, a linear program solver gives the value $\frac32 n$. Lemma 6.4 guarantees for us the existence of the following Shannon inequality, which is the same as (32) from Section 4.2:
$$h(X_1X_2X_3) + h(X_2X_3X_4) \;\le\; h(X_1X_2) + h(X_2X_3) + h(X_3X_4)$$
In particular, by item (c) of the lemma (after normalizing the coefficients so that the left-hand side has total weight one), we have:
$$\tfrac12\big(n_{X_1X_2} + n_{X_2X_3} + n_{X_3X_4}\big) \;=\; \tfrac32\, n.$$
The other three expressions (53)–(55) also have value $\frac32 n$, leading to $\mathsf{subw}(Q,\Delta) = \frac32 n$.
Next, we describe the algorithm to compute the query (51) in time $\tilde O(N^{3/2} + |\text{output}|)$, as claimed in Theorem 6.5. The algorithm starts by constructing the following four DDRs, which mirror the four expressions (52)–(55):
$$R_1\wedge R_2\wedge R_3\wedge R_4 \;\Rightarrow\; B_{123}(X_1,X_2,X_3) \vee B_{234}(X_2,X_3,X_4) \tag{56}$$
$$R_1\wedge R_2\wedge R_3\wedge R_4 \;\Rightarrow\; B_{123}(X_1,X_2,X_3) \vee B_{124}(X_1,X_2,X_4) \tag{57}$$
$$R_1\wedge R_2\wedge R_3\wedge R_4 \;\Rightarrow\; B_{134}(X_1,X_3,X_4) \vee B_{234}(X_2,X_3,X_4) \tag{58}$$
$$R_1\wedge R_2\wedge R_3\wedge R_4 \;\Rightarrow\; B_{134}(X_1,X_3,X_4) \vee B_{124}(X_1,X_2,X_4) \tag{59}$$
Let’s take the first DDR (56) as an example. We can compute a feasible output to this DDR by computing a feasible output to the following DDR instead, whose body drops the atom $R_4$:
$$R_1(X_1,X_2)\wedge R_2(X_2,X_3)\wedge R_3(X_3,X_4) \;\Rightarrow\; B_{123}(X_1,X_2,X_3) \vee B_{234}(X_2,X_3,X_4)$$
(Any feasible output to the latter is feasible for (56), because the body of (56) implies the body above.) The above DDR is identical to (7), and we saw in Section 4.2 that we can compute a feasible output to it in time $\tilde O(N^{3/2})$. The other 3 DDRs (57)–(59) can be computed in the same way. Afterwards, we compute each bag relation as the union of its outputs from the two DDRs in which it occurs; for example, $B_{123}$ is the union of its outputs from (56) and (57).
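To make the partition step concrete, here is a self-contained sketch of one way to produce a feasible output to the three-atom DDR above within the $\tilde O(N^{3/2})$ budget, using a heavy/light partition of $X_2$ with threshold $\sqrt N$. This mirrors the spirit of the Section 4.2 computation (the exact steps taken there may differ), and all function names are ours:

```python
# One way to solve  R1(X1,X2), R2(X2,X3), R3(X3,X4)  =>  B123 | B234
# feasibly within O(N^{3/2}) output size, via a heavy/light split on X2.
from collections import defaultdict
from math import isqrt

def solve_ddr(R1, R2, R3):
    """Return (B123, B234) such that every (x1,x2,x3,x4) with (x1,x2) in R1,
    (x2,x3) in R2, (x3,x4) in R3 has (x1,x2,x3) in B123 or (x2,x3,x4) in B234."""
    N = max(len(R1), len(R2), len(R3), 1)
    thr = isqrt(N)
    deg = defaultdict(int)                  # degree of each x2 value in R2
    for x2, _ in R2:
        deg[x2] += 1
    # Light x2 (degree <= thr): materialize R1 join R2 restricted to light x2;
    # its size is at most |R1| * thr <= N^{3/2}.
    by_x2 = defaultdict(list)
    for x2, x3 in R2:
        if deg[x2] <= thr:
            by_x2[x2].append(x3)
    B123 = {(x1, x2, x3) for x1, x2 in R1 for x3 in by_x2.get(x2, ())}
    # Heavy x2 (degree > thr): there are at most N / thr <= sqrt(N) of them,
    # so their cross product with R3 has size at most N^{3/2}.
    heavy = {x2 for x2 in deg if deg[x2] > thr}
    B234 = {(x2, x3, x4) for x2 in heavy for x3, x4 in R3}
    return B123, B234

# Tiny usage example:
R1 = {(1, 2), (1, 5)}; R2 = {(2, 3), (5, 3)}; R3 = {(3, 4)}
print(solve_ddr(R1, R2, R3))
```

Feasibility holds because every joining tuple either has a light $X_2$-value (and is then covered by $B_{123}$) or a heavy one (and is then covered by $B_{234}$), and both branches respect the $N^{3/2}$ size budget.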
Using Lemma 2.1, we compute the following two free-connex acyclic conjunctive queries (after semijoin-reducing each of the relations $B_{123}, B_{134}, B_{234}, B_{124}$ with the input relations covered by the corresponding bag):
$$Q_1(X_1,X_2,X_3,X_4) \text{ :- } B_{123}(X_1,X_2,X_3)\wedge B_{134}(X_1,X_3,X_4)$$
$$Q_2(X_1,X_2,X_3,X_4) \text{ :- } B_{234}(X_2,X_3,X_4)\wedge B_{124}(X_1,X_2,X_4)$$
Finally, we take the union of $Q_1$ and $Q_2$ above as the output to the query in (51). The overall runtime is $\tilde O(N^{3/2} + |\text{output}|)$.
In order to prove the correctness of this algorithm, we show that the full join $R_1\Join R_2\Join R_3\Join R_4$ is identical to the set of tuples $(x_1,x_2,x_3,x_4)$ that satisfy:
$$\big(B_{123}(x_1,x_2,x_3)\wedge B_{134}(x_1,x_3,x_4)\big) \;\vee\; \big(B_{234}(x_2,x_3,x_4)\wedge B_{124}(x_1,x_2,x_4)\big) \tag{60}$$
The containment $\supseteq$ is immediate from the definition of the relations $B$ (and thanks to the semijoin reduction of each $B$ with the input relations it covers). For the other containment $\subseteq$, note that condition (60) is equivalent to the following, obtained by distributing $\vee$ over $\wedge$ (with the same arguments as in (60)):
$$\big(B_{123}\vee B_{234}\big)\wedge\big(B_{123}\vee B_{124}\big)\wedge\big(B_{134}\vee B_{234}\big)\wedge\big(B_{134}\vee B_{124}\big) \tag{61}$$
By (56)–(59), every tuple that satisfies the conjunction $R_1\wedge R_2\wedge R_3\wedge R_4$ must also satisfy (61). This completes the proof of correctness.
7 Conclusion
We presented PANDA, an algorithm that computes a conjunctive query in time given by its submodular width. For this purpose, we used a generalization of the notion of submodular width in [25] that incorporates a rich class of statistics on the input relations, including cardinality constraints and degree constraints; the latter can also express functional dependencies. The PANDA algorithm described here is a significant simplification of its preliminary version in [6]. PANDA can also be used as a Worst-Case-Optimal-Join algorithm to compute the output of a full conjunctive query in time bounded by the information-theoretic upper bound on the output size. A recent extension shows that PANDA can also account for $\ell_p$-norms of degree sequences in the input [3].
We leave some open problems. The first is an analysis of the complexity of the witness of a Shannon inequality. The number of submodularities in the Shannon inequality appears as the exponent of a logarithmic factor in the runtime of PANDA, and it would be very useful to study this number as a function of the query. Another question concerns the number of tree decompositions needed to compute the submodular width: our current bound is double exponential, and the question is whether this can be reduced. Finally, one open problem is whether PANDA can be generalized to achieve information-theoretic bounds corresponding to non-Shannon inequalities [36, 37].
References
- [1] S. Abiteboul, R. Hull, and V. Vianu, Foundations of Databases, Addison-Wesley, 1995.
- [2] M. Abo Khamis, P. G. Kolaitis, H. Q. Ngo, and D. Suciu, Bag query containment and information theory, ACM Trans. Database Syst., 46 (2021), pp. 12:1–12:39.
- [3] M. Abo Khamis, V. Nakos, D. Olteanu, and D. Suciu, Join size bounds using $\ell_p$-norms on degree sequences, Proc. ACM Manag. Data, 2 (2024).
- [4] M. Abo Khamis, H. Q. Ngo, and A. Rudra, FAQ: questions asked frequently, in Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, T. Milo and W. Tan, eds., ACM, 2016, pp. 13–28.
- [5] M. Abo Khamis, H. Q. Ngo, and D. Suciu, Computing join queries with functional dependencies, in Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, T. Milo and W. Tan, eds., ACM, 2016, pp. 327–342.
- [6] M. Abo Khamis, H. Q. Ngo, and D. Suciu, What do Shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another?, in Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017, E. Sallinger, J. V. den Bussche, and F. Geerts, eds., ACM, 2017, pp. 429–444. Extended version available at http://arxiv.org/abs/1612.02503.
- [7] N. Alon, On the number of subgraphs of prescribed type of graphs with a given number of edges, Israel J. Math., 38 (1981), pp. 116–130.
- [8] N. Alon, R. Yuster, and U. Zwick, Finding and counting given length cycles, Algorithmica, 17 (1997), pp. 209–223.
- [9] A. Atserias, M. Grohe, and D. Marx, Size bounds and query plans for relational joins, in 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, October 25-28, 2008, Philadelphia, PA, USA, IEEE Computer Society, 2008, pp. 739–748.
- [10] A. Atserias, M. Grohe, and D. Marx, Size bounds and query plans for relational joins, SIAM J. Comput., 42 (2013), pp. 1737–1767.
- [11] G. Bagan, A. Durand, and E. Grandjean, On acyclic conjunctive queries and constant delay enumeration, in Computer Science Logic, 21st International Workshop, CSL 2007, 16th Annual Conference of the EACSL, Lausanne, Switzerland, September 11-15, 2007, Proceedings, J. Duparc and T. A. Henzinger, eds., vol. 4646 of Lecture Notes in Computer Science, Springer, 2007, pp. 208–222.
- [12] B. Bollobás and A. Thomason, Projections of bodies and hereditary properties of hypergraphs, Bull. London Math. Soc., 27 (1995), pp. 417–424.
- [13] F. R. K. Chung, R. L. Graham, P. Frankl, and J. B. Shearer, Some intersection theorems for ordered sets and graphs, J. Combin. Theory Ser. A, 43 (1986), pp. 23–37.
- [14] T. Eiter, G. Gottlob, and H. Mannila, Disjunctive datalog, ACM Trans. Database Syst., 22 (1997), pp. 364–418.
- [15] E. Friedgut and J. Kahn, On the number of copies of one hypergraph in another, Israel J. Math., 105 (1998), pp. 251–256.
- [16] G. Gottlob, G. Greco, N. Leone, and F. Scarcello, Hypertree decompositions: Questions and answers, in Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, T. Milo and W. Tan, eds., ACM, 2016, pp. 57–74.
- [17] G. Gottlob, S. T. Lee, and G. Valiant, Size and treewidth bounds for conjunctive queries, in Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2009, June 19 - July 1, 2009, Providence, Rhode Island, USA, J. Paredaens and J. Su, eds., ACM, 2009, pp. 45–54.
- [18] G. Gottlob, S. T. Lee, G. Valiant, and P. Valiant, Size and treewidth bounds for conjunctive queries, J. ACM, 59 (2012), pp. 16:1–16:35.
- [19] M. Grohe and D. Marx, Constraint solving via fractional edge covers, in Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006, ACM Press, 2006, pp. 289–298.
- [20] M. Grohe and D. Marx, Constraint solving via fractional edge covers, ACM Trans. Algorithms, 11 (2014), pp. 4:1–4:20.
- [21] A. Itai and M. Rodeh, Finding a minimum circuit in a graph, in Proceedings of the 9th Annual ACM Symposium on Theory of Computing, May 4-6, 1977, Boulder, Colorado, USA, J. E. Hopcroft, E. P. Friedman, and M. A. Harrison, eds., ACM, 1977, pp. 1–10.
- [22] V. Leis, A. Gubichev, A. Mirchev, P. A. Boncz, A. Kemper, and T. Neumann, How good are query optimizers, really?, Proc. VLDB Endow., 9 (2015), pp. 204–215.
- [23] J. Lobo, J. Minker, and A. Rajasekar, Foundations of disjunctive logic programming, Logic Programming, MIT Press, 1992.
- [24] L. H. Loomis and H. Whitney, An inequality related to the isoperimetric inequality, Bull. Amer. Math. Soc, 55 (1949), pp. 961–962.
- [25] D. Marx, Tractable hypergraph properties for constraint satisfaction and conjunctive queries, J. ACM, 60 (2013), pp. 42:1–42:51.
- [26] J. Minker, Overview of disjunctive logic programming, Ann. Math. Artif. Intell., 12 (1994), pp. 1–24.
- [27] H. Q. Ngo, Worst-case optimal join algorithms: Techniques, results, and open problems, in Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS '18, New York, NY, USA, 2018, Association for Computing Machinery, pp. 111–124.
- [28] H. Q. Ngo, On an information theoretic approach to cardinality estimation (invited talk), in 25th International Conference on Database Theory, ICDT 2022, March 29 to April 1, 2022, Edinburgh, UK (Virtual Conference), D. Olteanu and N. Vortmeier, eds., vol. 220 of LIPIcs, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022, pp. 1:1–1:21.
- [29] H. Q. Ngo, E. Porat, C. Ré, and A. Rudra, Worst-case optimal join algorithms: [extended abstract], in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, M. Benedikt, M. Krötzsch, and M. Lenzerini, eds., ACM, 2012, pp. 37–48.
- [30] J. Radhakrishnan, Entropy and counting, Computational Mathematics, Modelling and Algorithms, (2003).
- [31] A. Schrijver, Combinatorial Optimization - Polyhedra and Efficiency, Springer, 2003.
- [32] D. Suciu, Applications of information inequalities to database theory problems, in 2023 38th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), 2023, pp. 1–30.
- [33] T. L. Veldhuizen, Leapfrog triejoin: A simple, worst-case optimal join algorithm, in Proc. 17th International Conference on Database Theory (ICDT), Athens, Greece, March 24-28, 2014, N. Schweikardt, V. Christophides, and V. Leroy, eds., OpenProceedings.org, 2014, pp. 96–106.
- [34] M. Yannakakis, Algorithms for acyclic database schemes, in Very Large Data Bases, 7th International Conference, September 9-11, 1981, Cannes, France, Proceedings, IEEE Computer Society, 1981, pp. 82–94.
- [35] R. W. Yeung, Information Theory and Network Coding, Springer Publishing Company, Incorporated, 1 ed., 2008.
- [36] Z. Zhang and R. W. Yeung, A non-shannon-type conditional inequality of information quantities, IEEE Trans. Information Theory, 43 (1997), pp. 1982–1986.
- [37] Z. Zhang and R. W. Yeung, On characterization of entropy function via information inequalities, IEEE Transactions on Information Theory, 44 (1998), pp. 1440–1452.