
PANDA: Query Evaluation in Submodular Width

Mahmoud Abo Khamis
RelationalAI
   Hung Q. Ngo
RelationalAI
   Dan Suciu
University of Washington
Part of this work was conducted while the authors participated in the Fall 2023 Simons Program on Logic and Algorithms in Databases and AI.
Abstract

In recent years, several information-theoretic upper bounds have been introduced on the output size and evaluation cost of database join queries. These bounds vary in their power depending on both the type of statistics on input relations and the query plans that they support. This motivated the search for algorithms that can compute the output of a join query in time bounded by the corresponding information-theoretic bound. In this paper, we describe PANDA, an algorithm that takes a Shannon inequality that underlies the bound, and translates each proof step into an algorithmic step corresponding to some database operation. PANDA computes answers to a conjunctive query in time given by the submodular width plus the output size of the query. The version in this paper represents a significant simplification of the original version in [6].

1 Introduction

Answering conjunctive queries efficiently is a fundamental problem in the theory and practice of database management, graph algorithms, logic, constraint satisfaction, and graphical model inference, among others [4, 16, 20, 1]. In a full conjunctive query, the input is a set of relations (or tables, or constraints), each with a set of attributes (or variables), and the task is to list all satisfying assignments, i.e. assignments to all variables that simultaneously satisfy all input relations. Each such assignment is called an output tuple, and their number is the output size. For example, the query

$Q(a,b,c) \text{ :- } E(a,b) \wedge E(b,c) \wedge E(c,a)$   (1)

is a full conjunctive query asking for the list of all triangles in a (directed) graph with edge relation $E$. In contrast, in a Boolean conjunctive query, we only ask whether one such assignment exists. The query

$Q() \text{ :- } E(a,b) \wedge E(b,c) \wedge E(c,a)$

asks whether there is a triangle in the graph. More generally, in a proper conjunctive query, any subset of variables can occur in the head of the query. These are called free variables. For example, the query

$Q(a) \text{ :- } E(a,b) \wedge E(b,c) \wedge E(c,a)$

asks for the list of all nodes $a$ that are part of a triangle. Other variants of the query evaluation problem include counting output tuples, performing some aggregation over them, or enumerating them under some restrictions.

In the case of a full conjunctive query, the runtime of any algorithm is at least as large as the size of the output. This has motivated the study of upper bounds on the sizes of query outputs. The corresponding graph-theoretic problem is to bound the number of homomorphic images of a small graph within a larger graph; this problem has a long history in graph theory and isoperimetric inequalities [24, 7, 13, 12, 27, 15]. One such bound is the AGM bound, which is a formula that, given only the cardinalities of the input relations, returns an upper bound on the output size of the query [10]. Moreover, the bound is tight, meaning that there exist relations of the given cardinalities where the query's output has size equal to the AGM bound (up to a query-dependent factor). This immediately implies that no algorithm can run in time lower than the AGM bound in the worst case. Thus, an algorithm that runs in this time is called a Worst-Case Optimal Join (WCOJ) algorithm. The AGM bound for query (1) is $O(|E|^{3/2})$; algorithms for listing all triangles within this amount of time have been known for decades [21]. For general full conjunctive queries, Grohe and Marx [19], and Atserias, Grohe, and Marx [10, 9] devised a join-project query plan that can compute the output of a full conjunctive query to within a linear factor of the AGM bound, which is very close to optimality. They also showed that a join-only query plan cannot achieve this bound. A few years later, a new class of join algorithms, not based on join and project operators, was able to achieve the AGM bound [33, 29] and thus achieve worst-case optimality.
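To make the discussion concrete, here is a minimal Python sketch of triangle listing in the spirit of a worst-case optimal join (this is our own illustrative sketch, not any specific published algorithm): for each edge $(a,b)$, the candidates for $c$ are obtained by intersecting two adjacency sets, always scanning the smaller one.

```python
from collections import defaultdict

def list_triangles(E):
    """List all triples (a, b, c) with (a,b), (b,c), (c,a) in E.

    For each edge (a, b), intersect the out-neighbors of b with the
    in-neighbors of a, iterating over the smaller of the two sets.
    With hash indexes, this stays within the O(|E|^{3/2}) AGM bound
    for the triangle query.
    """
    out_nbrs = defaultdict(set)  # a -> {b : (a,b) in E}
    in_nbrs = defaultdict(set)   # b -> {a : (a,b) in E}
    for a, b in E:
        out_nbrs[a].add(b)
        in_nbrs[b].add(a)
    result = []
    for a, b in E:
        # candidates c must satisfy (b,c) in E and (c,a) in E
        small, large = sorted((out_nbrs[b], in_nbrs[a]), key=len)
        for c in small:
            if c in large:
                result.append((a, b, c))
    return result
```

The key design choice, shared with WCOJ algorithms, is that the intersection cost per edge is bounded by the smaller adjacency set, rather than materializing a binary join first.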

In practice, especially in database systems, the input cardinalities are not sufficient to model what we know about the data, and definitely not sufficient to predict how good or bad a query plan is. Other data characteristics such as functional dependencies and distinct value counts are often collected and used to optimize queries [28, 22]; furthermore, practical queries often involve “relations” that have infinite cardinality. For example, the output size of this query

$Q(a,b) \text{ :- } R(a) \wedge S(b) \wedge a+b=10$

is obviously at most $\min\{|R|,|S|\}$, but the AGM bound is $|R|\cdot|S|$. The relation $a+b=10$ has infinite cardinality. There has thus been a line of research to strengthen the AGM bound to account for increasingly finer classes of input statistics. Specifically, Gottlob, Lee, Valiant and Valiant [18, 17] applied the entropy argument to derive a bound on the output size of a full conjunctive query under general functional dependencies. Their bound, generalizing the AGM bound, is an optimization problem whose constraints are the Shannon inequalities. This idea was a seed for a series of works that extended the entropy argument to account for finer classes of constraints, including degree constraints [5, 6, 3], and $\ell_{p}$-norm bounds on degree sequences [3].

Designing WCOJ algorithms that match these stronger bounds becomes harder since the bounds under more constraints are tighter. An algorithm called CSMA is described in [5], which only accounts for relation cardinalities and functional dependencies.

In database management systems jargon, a WCOJ algorithm is a multi-way join operator; this operator, for some classes of queries, is asymptotically faster than the traditional binary join operator [29]. Given a complicated conjunctive query, however, before applying a WCOJ algorithm it is often beneficial to come up with a query plan that decomposes the query into simpler subqueries. Query plans guided by (hyper)tree decompositions have proved to be very useful both in theory and in practice [16]. In particular, query plans represented by a single tree decomposition can be used to answer a Boolean conjunctive query or a full conjunctive query in time bounded by $\tilde{O}(N^{\textsf{fhtw}}+|\text{output}|)$, where fhtw is the fractional hypertree width [20] of the query, and $|\text{output}|$ is the size of the output. This runtime is sensitive to the output size, and thus it is more adaptive than a single WCOJ application. (It should be noted, however, that the sub-queries of the plan are still being answered with a WCOJ operator.) For example, using tree-decomposition-based query plans, we can identify whether a large graph with $N$ edges contains a homomorphic image of a $k$-cycle in time $O(N^{2})$ (assuming a constant $k$).

Motivated by finding the fixed-parameter tractability boundary for Boolean conjunctive queries, or equivalently constraint satisfaction problems, under the regime of unbounded-arity inputs, Marx [25] came up with a beautifully novel idea: instead of fixing a single query plan (i.e. tree-decomposition) up front, we can consider multiple query plans, and partition the data to make use of different plans for different parts of the data. The improved complexity measure that Marx introduced is called the submodular width, subw, which is always less than or equal to fhtw.

The subw algorithm from Marx [25] has some limitations. First, it assumes all input relations have the same cardinality $N$; in particular, it is not known how to define the width and design similar algorithms under functional dependencies or degree constraints. Second, the runtime of the algorithm is not $\tilde{O}(N^{\textsf{subw}})$, but a polynomial in this quantity. Third, the subw notion and the algorithm were not defined for general conjunctive queries with arbitrary free variables.

Contributions. This paper connects three lines of research (WCOJ algorithms, fractional hypertree width, and submodular width) into a single framework, while dealing with arbitrary input degree constraints (which generalize functional dependencies and cardinality constraints) and arbitrary conjunctive queries. The bridge that connects these algorithms and concepts is information theory. In particular, our contributions include the following:

We show how a generalized version of the classic Shearer’s inequality [15] can be used to derive upper bounds on the output size of a disjunctive datalog rule (DDR), which is a generalization of a conjunctive query. The upper bound is a generalization of the AGM bound that accounts for degree constraints and DDRs. The introduction of DDRs and their information-theoretic output size bound in the study of conjunctive queries is our first major technical novelty. DDRs are interesting in their own right. They form the building blocks of disjunctive datalog [14], which is a significant extension of datalog. Disjunctive datalog has a long history: it emerged in logic programming [23, 26], and is used for knowledge representation, representing incomplete information, and constraint satisfaction. Our bound on the output size of a DDR represents a bound on the output size of the minimal model of the disjunctive datalog program consisting of that single rule.

Next, we show that certain symbolic manipulations of the information inequality can be converted into a query evaluation plan for the DDR that runs in time bounded by the predicted upper bound. This idea of converting a proof of an information inequality into a query evaluation plan is our second major technical novelty. The algorithm is called PANDA, which stands for “Proof-Assisted eNtropic Degree-Aware”. In particular, PANDA is worst-case optimal for DDRs under arbitrary degree constraints. Even when restricted to only conjunctive queries, this is already beyond the state of the art in WCOJ algorithms because previous WCOJ algorithms cannot meet the tightened bounds under degree constraints.

Lastly, we explain how to define the notions of fhtw and subw under degree constraints and for conjunctive queries with arbitrary free variables. We show how PANDA can be used to answer arbitrary conjunctive queries with arbitrary degree constraints, in time bounded by $\tilde{O}(N^{\textsf{subw}}+|\text{output}|)$. These results close the gaps left by Marx’s work. For example, with PANDA, the $k$-cycle query can now be answered in $\tilde{O}(N^{2-1/\lceil k/2\rceil})$ time, which is sub-quadratic and matches the specialized cycle detection algorithm of Alon, Yuster, and Zwick [8].

The results in this paper were first announced in a conference paper [6]. The current paper makes several significant improvements:

  • In [6], we used both the primal and the dual linear program to guide PANDA: the primal gives an optimal polymatroid $\bm{h}$, while the dual represents the basic Shannon inequalities. In the new version, we use only the dual, which significantly simplifies the algorithm. The algorithm is described only in terms of an information inequality and its proof (called a witness), which correspond precisely to a feasible solution to the dual program. We only need to describe the primal and dual programs later, in Sec. 6, where we introduce the degree-aware submodular width, which is defined in terms of the primal.

  • Previously, we needed a proof sequence to drive the algorithm; it was difficult to prove that a proof sequence exists; for example, no proof sequence existed in our earlier framework [5]. In the new version, we describe PANDA without the need for an explicit proof sequence, which again simplifies it. If needed, a proof sequence can still be extracted from the new version of the algorithm.

  • One difficulty in the earlier presentation of PANDA was the need to recompute the proof sequence after a reset step. This is no longer necessary here.

Paper Outline. This paper is organized as follows. Section 2 presents background concepts needed to understand the techniques and results of the paper; in particular, it introduces disjunctive datalog rules (DDR), a generalization of conjunctive queries, reviews necessary background on information theory, and defines the class of statistics on input relations that the PANDA algorithm supports, which are called degree constraints. Section 3 discusses the class of information inequalities that are at the center of our work, where they are used to both derive upper bounds on the output size of a query and guide the PANDA algorithm. Section 4 states the main algorithmic result, which says that the PANDA algorithm meets this information-theoretic upper bound if the bound is a Shannon inequality, i.e. it does not involve non-Shannon inequalities [36, 37]. Section 5 presents the core PANDA algorithm, which constructs a step-by-step proof of the Shannon inequality, and converts each step into a database operation. Section 6 defines the degree-constraint aware submodular width, and shows how to use disjunctive datalog rules to compute a conjunctive query in time given by the submodular width. We conclude in Section 7.

2 Preliminaries

2.1 Database instances and conjunctive queries (CQ)

Fix a set $\bm{V}$ of variables (or attributes). An atom is an expression of the form $R(\bm{X})$ where $R$ is a relation name and $\bm{X}\subseteq\bm{V}$ is a set of attributes. A schema, $\Sigma$, is a set of atoms. We shall assume throughout that distinct atoms in $\Sigma$ have distinct attribute sets. If $R$ is a relation name in $\Sigma$, we write $\textsf{vars}(R)$ for its attribute set, and define $\textsf{vars}(\Sigma)\stackrel{\text{def}}{=}\bm{V}$ to be the set of all attributes in the schema $\Sigma$.

Given a countably infinite domain $\textsf{Dom}$, we use $\textsf{Dom}^{\bm{X}}$ to denote the set of tuples with attributes $\bm{X}\subseteq\bm{V}$. A $\Sigma$-instance is a map $D$ that assigns to each relation name $R$ in $\Sigma$ a finite subset $R^{D}\subseteq\textsf{Dom}^{\textsf{vars}(R)}$. Technically, we should use $\Sigma^{D}\stackrel{\text{def}}{=}(R^{D})_{R\in\Sigma}$ to denote the $\Sigma$-instance; however, to reduce notational burden, instead of writing $R^{D}$ and $\Sigma^{D}$, we will often write $R$ and $\Sigma$ when the instance is clear from the context. Given $\bm{X}\subseteq\bm{V}$ and a tuple $\bm{t}\in\textsf{Dom}^{\bm{V}}$, we write $\pi_{\bm{X}}(\bm{t})$ to denote the projection of $\bm{t}$ onto the variables $\bm{X}$.

The full natural join (or full join for short) of the $\Sigma$-instance is the set of tuples $\bm{t}\in\textsf{Dom}^{\bm{V}}$ that satisfy all atoms in $\Sigma$:

${\Join}\Sigma \stackrel{\text{def}}{=} \{\bm{t}\in\textsf{Dom}^{\bm{V}} \mid \pi_{\textsf{vars}(R)}(\bm{t})\in R,\ \forall R\in\Sigma\}$   (2)

This set of tuples is sometimes called in the literature the universal table of the instance $\Sigma$.

Given a schema $\Sigma$, a conjunctive query is the expression

$Q(\bm{F}) \text{ :- } \bigwedge_{R(\bm{X})\in\Sigma} R(\bm{X})$   (3)

where $\bm{F}\subseteq\bm{V}$ is called the set of free variables, and $Q(\bm{F})$ is the head atom of the query. Atoms in $\Sigma$ are called the body atoms of the query. The output $Q(\bm{F})$ on an input instance $\Sigma$ is the projection of the full join (2) onto the free variables $\bm{F}$: $Q(\bm{F})\stackrel{\text{def}}{=}\pi_{\bm{F}}({\Join}\Sigma)$.

When $\bm{F}=\bm{V}$, we call the query a full conjunctive query. When $\bm{F}=\emptyset$, we call the query a Boolean conjunctive query, whose answer $Q()$ is either true or false, and it is true if and only if the full join (2) is non-empty.

Our complexity results are expressed in terms of data complexity; in particular, we consider the number of atoms and variables to be a constant, i.e. $|\Sigma|+|\bm{V}|=O(1)$, and the complexity is a function of the instance size. We define the size of a $\Sigma$-instance as:

$\|\Sigma\| \stackrel{\text{def}}{=} \max_{R(\bm{X})\in\Sigma} |R|.$   (4)

The notation $\|\Sigma\|$ is used in part to distinguish it from $|\Sigma|$, which counts the number of atoms in $\Sigma$. (Alternatively, the size of a $\Sigma$-instance could have been defined as the sum of the sizes of all relations in the instance; the two definitions are within a constant factor of one another in data complexity. Taking the maximum instead of the sum is more convenient for our purposes: for example, we will see later that our size upper bound from Theorem 3.1 only holds if we define the size as the maximum.)

2.2 Tree decompositions and free-connex queries

Consider a conjunctive query in the form (3). A tree decomposition of $Q$ is a pair $(T,\chi)$, where $T$ is a tree and $\chi:\textsf{nodes}(T)\rightarrow 2^{\textsf{vars}(Q)}$ is a map from the nodes of $T$ to subsets of $\textsf{vars}(Q)$ that satisfies the following properties: for all atoms $R(\bm{X})$ in $\Sigma$, there is a node $t\in\textsf{nodes}(T)$ such that $\bm{X}\subseteq\chi(t)$; and, for any variable $X\in\textsf{vars}(Q)$, the set $\{t \mid X\in\chi(t)\}$ forms a connected sub-tree of $T$. Each set $\chi(t)$ is called a bag of the tree decomposition, and we will assume w.l.o.g. that the bags are distinct, i.e. $\chi(t)\neq\chi(t')$ when $t\neq t'$ are nodes in $T$.

A free-connex tree decomposition for $Q$ is a tree decomposition for $Q$ with the additional property that there is a connected sub-tree $T'$ of $T$ for which $\bm{F}=\bigcup_{t\in\textsf{nodes}(T')}\chi(t)$. The query $Q$ is free-connex acyclic iff there is a free-connex tree decomposition $(T,\chi)$ in which every bag is covered by an input atom; namely, for every $t\in\textsf{nodes}(T)$, there exists an input atom $R(\bm{X})\in\Sigma$ where $\chi(t)\subseteq\bm{X}$.

The following result is well-known [34, 11].

Lemma 2.1.

If $Q$ is a free-connex acyclic conjunctive query of the form (3), then we can compute its output in time $\tilde{O}(\|\Sigma\|+|Q(\bm{F})|)$. Furthermore, we can list the output tuples one by one with constant delay between them.

In particular, for this class of queries, the runtime is the best one can hope for: input size plus output size.

2.3 Disjunctive Datalog rules (DDR)

Disjunctive Datalog rules [14] are a generalization of conjunctive queries, where the head of the query can be a disjunction, formalized as follows. Let $\Sigma_{\textsf{in}}$ and $\Sigma_{\textsf{out}}$ be two schemas, called input and output schema respectively. We associate to these two schemas the following disjunctive Datalog rule (DDR)

$\bigvee_{Q(\bm{Z})\in\Sigma_{\textsf{out}}} Q(\bm{Z}) \text{ :- } \bigwedge_{R(\bm{X})\in\Sigma_{\textsf{in}}} R(\bm{X})$   (5)

The DDR is uniquely defined by the two schemas $\Sigma_{\textsf{in}}$ and $\Sigma_{\textsf{out}}$. The syntax in (5) does not add any new information, but is intended to be suggestive of the following semantics:

Definition 2.2.

Let $\Sigma_{\textsf{in}}$ be an input instance. A model (or feasible output) for the rule (5) is an instance $\Sigma_{\textsf{out}}$, such that the following condition holds: for every tuple $\bm{t}\in{\Join}\Sigma_{\textsf{in}}$, there is an output atom $Q(\bm{Z})\in\Sigma_{\textsf{out}}$ for which $\pi_{\bm{Z}}(\bm{t})\in Q$. Similarly to (4), the size of the output instance $\Sigma_{\textsf{out}}$ is defined as

$\|\Sigma_{\textsf{out}}\| \stackrel{\text{def}}{=} \max_{Q(\bm{Z})\in\Sigma_{\textsf{out}}} |Q|.$   (6)

A model is minimal if we cannot remove a tuple from any output relation without violating the feasibility condition.

In English, a feasible output needs to store each tuple $\bm{t}\in{\Join}\Sigma_{\textsf{in}}$ in at least one of the output relations $Q(\bm{Z})\in\Sigma_{\textsf{out}}$. The query evaluation problem is to compute a minimal model. Note that a conjunctive query is a disjunctive Datalog rule where $\Sigma_{\textsf{out}}$ has a single atom.
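The feasibility condition of Definition 2.2 is easy to state operationally. The following Python sketch (our own illustration, with a naive exponential full join used only for clarity) checks whether a candidate output instance is a model; relations are given as pairs of an attribute tuple and a set of tuples.

```python
from itertools import product

def full_join(instance):
    """Naive full natural join of a {name: (vars, tuples)} instance.

    Enumerates all assignments over the active domain; exponential,
    for illustration only.
    """
    all_vars = sorted({v for vars_, _ in instance.values() for v in vars_})
    domain = sorted({x for _, rows in instance.values() for t in rows for x in t})
    out = []
    for t in product(domain, repeat=len(all_vars)):
        assign = dict(zip(all_vars, t))
        if all(tuple(assign[v] for v in vars_) in rows
               for vars_, rows in instance.values()):
            out.append(assign)
    return out

def is_model(instance, output):
    """Definition 2.2: every tuple of the full join must be covered by
    at least one output relation, projected onto that relation's vars."""
    for assign in full_join(instance):
        if not any(tuple(assign[v] for v in vars_) in rows
                   for vars_, rows in output.values()):
            return False
    return True
```

For the rule $A(X) \vee B(Y)$ :- $R(X,Y)$ with $R=\{(1,2),(1,3)\}$, the output $A=\{(1)\}$, $B=\emptyset$ is a model, while $A=\emptyset$, $B=\{(2)\}$ is not, since the join tuple with $Y=3$ is uncovered.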

DDRs are interesting. We illustrate the concept here with a few examples.

Example 2.3.

Consider the following DDR,

$Q(X,Z) \text{ :- } R(X,Y) \wedge S(Y,Z)$

where $\Sigma_{\textsf{in}}=\{R(X,Y),S(Y,Z)\}$ and $\Sigma_{\textsf{out}}=\{Q(X,Z)\}$. A model (or feasible output) to the DDR is any superset of $\pi_{XZ}(R\Join S)$. Consider now the following DDR:

$A(X) \vee B(Y) \text{ :- } R(X,Y)$

One model is $A:=\pi_{X}(R)$, $B:=\emptyset$. Another model is $A:=\emptyset$, $B:=\pi_{Y}(R)$. Both are minimal. Many other models exist.

A non-trivial DDR is the following:

$A(X,Y,Z) \vee B(Y,Z,W) \text{ :- } R(X,Y) \wedge S(Y,Z) \wedge U(Z,W)$   (7)

To compute the rule, for each tuple $(x,y,z,w)$ in the full join, we must either insert $(x,y,z)$ into $A$, or insert $(y,z,w)$ into $B$. The output size is the larger of the resulting relations $A$ and $B$. We shall see later in the paper that, for this rule, there is a model of size $O(\sqrt{|R|\cdot|S|\cdot|U|})$, which is a non-trivial result.
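One way to obtain such a model is a heavy/light split on the degree of $Y$ in $R$: tuples whose $y$-value has few $x$-partners go to $A$, the rest go to $B$. The sketch below is our own hand-rolled illustration of this partitioning idea (it is not the PANDA algorithm itself); balancing the threshold $\Delta$ so that $\Delta\cdot|S| = (|R|/\Delta)\cdot|U|$ gives the claimed $O(\sqrt{|R|\cdot|S|\cdot|U|})$ size.

```python
import math
from collections import defaultdict

def ddr_model(R, S, U):
    """A model (A, B) of the DDR (7):
        A(X,Y,Z) v B(Y,Z,W)  :-  R(X,Y), S(Y,Z), U(Z,W)
    Light y (at most Delta x-partners in R) contribute (x,y,z) to A,
    so |A| <= Delta*|S|.  Heavy y number at most |R|/Delta, and each
    contributes at most |U| tuples (y,z,w) to B, so |B| <= |R|*|U|/Delta.
    """
    xs_of = defaultdict(list)            # y -> all x with (x,y) in R
    for x, y in R:
        xs_of[y].append(x)
    ws_of = defaultdict(list)            # z -> all w with (z,w) in U
    for z, w in U:
        ws_of[z].append(w)
    # balance Delta*|S| against (|R|/Delta)*|U|
    delta = math.sqrt(len(R) * len(U) / max(1, len(S)))
    A, B = set(), set()
    for y, z in S:
        if len(xs_of[y]) <= delta:       # light y: goes to A
            for x in xs_of[y]:
                A.add((x, y, z))
        else:                            # heavy y: goes to B
            for w in ws_of[z]:
                B.add((y, z, w))
    return A, B
```

Feasibility holds because every full-join tuple $(x,y,z,w)$ has either a light $y$ (then $(x,y,z)\in A$) or a heavy $y$ (then $(y,z,w)\in B$).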

2.4 Entropic vectors and polymatroids

For general background on information theory and polymatroids, we refer the reader to [35]. Given a discrete (tuple of) random variable(s) $\bm{X}$ over a domain $\textsf{Dom}(\bm{X})$ with distribution $p$, the (Shannon) entropy of $\bm{X}$ is defined as

$h(\bm{X}) \stackrel{\text{def}}{=} -\sum_{\bm{x}\in\textsf{Dom}(\bm{X})} p(\bm{X}=\bm{x}) \log p(\bm{X}=\bm{x})$

The support of the distribution is the set of all $\bm{x}$ where $p(\bm{X}=\bm{x})>0$. We will only work with distributions of finite support. Let $N$ be the support size; then it is known that $0\leq h(\bm{X})\leq\log N$. The upper bound follows from the concavity of $h$ and Jensen's inequality. Moreover, $h(\bm{X})=0$ iff $\bm{X}$ is deterministic, i.e. it has a singleton support, and $h(\bm{X})=\log N$ iff $\bm{X}$ is uniform, meaning that $p(\bm{X}=\bm{x})=1/N$ for all $\bm{x}$ in the support.
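These two extremes are easy to check numerically. A minimal sketch (ours, in bits rather than nats) computing the entropy of an empirical distribution:

```python
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of a
    list of observed values; tuples can be used for joint variables."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(samples).values())
```

A uniform distribution over $N$ values attains the maximum $\log N$, while a deterministic variable has entropy $0$.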

If $\bm{V}$ is a set of jointly distributed random variables, then the vector $\bm{h}=(h(\bm{X}))_{\bm{X}\subseteq\bm{V}}\in\mathbb{R}_{+}^{2^{\bm{V}}}$ is called an entropic vector. (Given two sets $A$ and $B$, we use $B^{A}$ to denote the set of all functions $f:A\rightarrow B$.) Implicitly, $h(\emptyset)=0$; thus, this vector is actually of dimension $2^{|\bm{V}|}-1$. We will often write $\bm{X}\bm{Y}$ for the union $\bm{X}\cup\bm{Y}$; in particular, $h(\bm{X}\bm{Y})=h(\bm{X}\cup\bm{Y})$.

Starting with Shannon himself, the study of linear functions on entropic vectors has been a central topic in information theory. The two basic linear functions are defined by the so-called information measures. An information measure is an expression $\mu=(\bm{Y}|\bm{X})$ or $\sigma=(\bm{Y};\bm{Z}|\bm{X})$, where $\bm{X},\bm{Y},\bm{Z}$ are disjoint subsets of the set of variables $\bm{V}$. We call $\mu$ a monotonicity measure and $\sigma$ a submodularity measure, respectively. A monotonicity measure $\mu=(\bm{Y}|\bm{X})$ is called unconditional iff $\bm{X}=\emptyset$. Similarly, a submodularity measure $\sigma=(\bm{Y};\bm{Z}|\bm{X})$ is called unconditional iff $\bm{X}=\emptyset$. For any vector $\bm{h}\in\mathbb{R}_{+}^{2^{\bm{V}}}$, we define the linear functions:

$h(\mu) \stackrel{\text{def}}{=} h(\bm{X}\bm{Y}) - h(\bm{X})$   for $\mu=(\bm{Y}|\bm{X})$
$h(\sigma) \stackrel{\text{def}}{=} h(\bm{X}\bm{Y}) + h(\bm{X}\bm{Z}) - h(\bm{X}) - h(\bm{X}\bm{Y}\bm{Z})$   for $\sigma=(\bm{Y};\bm{Z}|\bm{X})$

We write $\textsf{MON}$ for the set of monotonicity measures, i.e. the set of all $\mu=(\bm{Y}|\bm{X})$ where $\bm{X},\bm{Y}\subseteq\bm{V}$ are disjoint. Similarly, we write $\textsf{SUB}$ for the set of submodularity measures.

A polymatroid is a vector $\bm{h}$ that satisfies the basic Shannon inequalities:

$h(\emptyset) = 0; \qquad h(\mu) \geq 0,\ \forall\mu\in\textsf{MON}; \qquad h(\sigma) \geq 0,\ \forall\sigma\in\textsf{SUB}$   (8)

The latter two are called monotonicity and submodularity constraints respectively. Every entropic vector is a polymatroid, but the converse is not true [36]. A Shannon inequality is a linear inequality that is derived from the basic Shannon inequalities; equivalently, a linear inequality is a Shannon inequality iff it is satisfied by all polymatroids.

Discussion. In the literature, the basic Shannon inequalities are often restricted to elemental ones, which are submodularity measures of the form $(A;B|\bm{X})$ and monotonicity measures of the form $(A|\bm{V}-\{A\})$, where $A,B\in\bm{V}$ are single variables, and $\bm{X}\subseteq\bm{V}-\{A,B\}$. The reason is that the elemental inequalities are sufficient to prove every Shannon inequality; for example, $h(B;CD|A)=h(B;C|A)+h(B;D|AC)\geq 0$ when both $h(B;C|A)\geq 0$ and $h(B;D|AC)\geq 0$. Given $m\stackrel{\text{def}}{=}|\bm{V}|$, the total number of elemental basic Shannon inequalities is $m(m-1)2^{m-3}+m$. (Specifically, there are $\binom{m}{2}$ ways to choose $A$ and $B$ and $2^{m-2}$ ways to choose $\bm{X}$, resulting in $\binom{m}{2}\cdot 2^{m-2}$ elemental submodularities; additionally, there are $m$ elemental monotonicities.) However, for the purpose of the PANDA algorithm, it is preferable to consider all basic Shannon inequalities, because this may lead to a smaller exponent of the polylog factor of the algorithm, as we will see in Section 5.
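The elemental inequalities can be checked mechanically. The following sketch (our own helper, not part of PANDA) tests whether a set function, given as a dict from frozensets of variables to reals, is a polymatroid by verifying exactly the elemental basic Shannon inequalities described above.

```python
from itertools import combinations

def is_polymatroid(h, V):
    """Check the elemental basic Shannon inequalities (8) for h:
    h(empty)=0, elemental monotonicity h(V) >= h(V-{A}), and
    elemental submodularity h(XA)+h(XB) >= h(X)+h(XAB).
    Elemental checks suffice to imply all basic Shannon inequalities."""
    V = frozenset(V)
    if h[frozenset()] != 0:
        return False
    for A in V:                          # elemental monotonicity
        if h[V] < h[V - {A}]:
            return False
    for A, B in combinations(sorted(V), 2):  # elemental submodularity
        rest = sorted(V - {A, B})
        for r in range(len(rest) + 1):
            for X in combinations(rest, r):
                X = frozenset(X)
                if h[X | {A}] + h[X | {B}] < h[X] + h[X | {A, B}] - 1e-12:
                    return False
    return True
```

For example, the modular function $h(\bm{X})=|\bm{X}|$ is a polymatroid, while raising $h(\{a,b\})$ above $h(\{a\})+h(\{b\})$ violates submodularity.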

2.5 Statistics on the data

In typical database engines, various statistics about the data are maintained and used for cardinality estimation, query optimization, and other purposes. Common statistics include the number of distinct values in a column, the number of tuples in a relation, and functional dependencies. A robust abstraction called “degree constraints” was introduced in [6] that captures many of these statistics. This section provides a formal definition of degree constraints, which are also the constraints that PANDA can support.

Let $\delta=(\bm{Y}|\bm{X})$ be a monotonicity measure, $\Sigma$ be a database instance (recall that we use the same symbol for the schema and the instance), $R\in\Sigma$ be a relation in this instance, and $\bm{x}\in\textsf{Dom}^{\bm{X}}$ be a tuple with schema $\bm{X}$. We define the degree of $\bm{x}$ with respect to $\bm{Y}$ in $R$, denoted by $\textsf{deg}_{R}(\bm{Y}|\bm{X}=\bm{x})$, as follows:

  • When both $\bm{X}\subseteq\textsf{vars}(R)$ and $\bm{Y}\subseteq\textsf{vars}(R)$, then $\textsf{deg}_{R}(\bm{Y}|\bm{X}=\bm{x})$ is the number of times the given $\bm{X}$-tuple $\bm{x}$ occurs in $\pi_{\bm{X}\bm{Y}}(R)$.

  • When $\bm{X}$ and $\bm{Y}$ are arbitrary with respect to $\textsf{vars}(R)$, then we define the restriction of $R$ to $\bm{X}\cup\bm{Y}$ to be the relation $R'(\bm{X}\bm{Y}):=\textsf{Dom}^{\bm{X}\cup\bm{Y}}\ltimes R$ (here $S\ltimes T$ denotes the semi-join reduce operator defined by $S\ltimes T\stackrel{\text{def}}{=}\pi_{\textsf{vars}(S)}(S\Join T)$), and set

    $\textsf{deg}_{R}(\bm{Y}|\bm{X}=\bm{x}) \stackrel{\text{def}}{=} \textsf{deg}_{R'}(\bm{Y}|\bm{X}=\bm{x})$

Finally, define the degree of the monotonicity measure $\delta=(\bm{Y}|\bm{X})$ in $R$, denoted by $\textsf{deg}_{R}(\delta)$, to be

$\textsf{deg}_{R}(\bm{Y}|\bm{X}) \stackrel{\text{def}}{=} \max_{\bm{x}\in\textsf{Dom}^{\bm{X}}} \textsf{deg}_{R}(\bm{Y}|\bm{X}=\bm{x})$   (9)

Note that, for infinite $\textsf{Dom}$, if $\bm{Y}\not\subseteq\textsf{vars}(R)$, then $\textsf{deg}_{R}(\bm{Y}|\bm{X})=\infty$, unless $R=\emptyset$, in which case $\textsf{deg}_{R}(\bm{Y}|\bm{X})=0$. We say that $R$ is a guard of $\delta=(\bm{Y}|\bm{X})$ if $\bm{Y}\subseteq\textsf{vars}(R)$, and note that, if $R\neq\emptyset$, then $R$ is a guard of $\delta$ iff $\textsf{deg}_{R}(\bm{Y}|\bm{X})<\infty$.

If $\bm{X}=\emptyset$ and $\bm{Y}=\textsf{vars}(R)$, then the degree is the cardinality of $R$: $\textsf{deg}_{R}(\bm{Y}|\emptyset)=|R|$. If $\textsf{deg}_{R}(\delta)=1$, then there is a functional dependency in $R$ from $\bm{X}$ to $\bm{Y}$. If the number of distinct values in a column $A$ of $R$ is $k$, then $\textsf{deg}_{R}(A|\emptyset)=k$. Given an instance $\Sigma$, define the degree of $\delta=(\bm{Y}|\bm{X})$ in the instance $\Sigma$ as:

$\textsf{deg}_{\Sigma}(\delta) \stackrel{\text{def}}{=} \min_{R\in\Sigma} \textsf{deg}_{R}(\delta).$   (10)

Let $\Delta\subseteq\textsf{MON}$ be a set of monotonicity measures and $\bm{N}:\Delta\rightarrow\mathbb{R}_{+}$ be numerical values for each $\delta\in\Delta$. Throughout this paper, we write $N_{\delta}$ instead of $N(\delta)$, and define $n_{\delta}\stackrel{\text{def}}{=}\log N_{\delta}$ for all $\delta\in\Delta$. We view the pair $(\Delta,\bm{N})$ as a set of degree constraints, and we say that an instance $\Sigma$ satisfies the degree constraints $(\Delta,\bm{N})$ iff $\textsf{deg}_{\Sigma}(\delta)\leq N_{\delta}$ for all $\delta\in\Delta$. In that case, we write $\Sigma\models(\Delta,\bm{N})$.
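The definitions above can be summarized in a few lines of code. The sketch below (our own helper; the function name `degree` is ours) computes $\textsf{deg}_{R}(\bm{Y}|\bm{X})$ per (9) in the guarded case $\bm{Y}\subseteq\textsf{vars}(R)$; variables of $\bm{X}$ outside $\textsf{vars}(R)$ are fixed by the tuple $\bm{x}$ and do not affect the maximum, so we group on $\bm{X}\cap\textsf{vars}(R)$.

```python
from collections import defaultdict

def degree(R, vars_R, Y, X=()):
    """deg_R(Y | X) as in (9), assuming R guards (Y|X), i.e. Y is a
    subset of vars(R); otherwise the degree is infinite for nonempty R
    over an infinite domain. R is a collection of tuples over the
    attribute list vars_R."""
    assert set(Y) <= set(vars_R), "R must guard (Y|X)"
    pos = {v: i for i, v in enumerate(vars_R)}
    groups = defaultdict(set)            # X-tuple -> distinct Y-tuples
    for t in R:
        key = tuple(t[pos[v]] for v in X if v in pos)
        groups[key].add(tuple(t[pos[v]] for v in Y))
    return max((len(ys) for ys in groups.values()), default=0)
```

The special cases match the text: with $\bm{X}=\emptyset$ and $\bm{Y}=\textsf{vars}(R)$ the degree is $|R|$; a degree of $1$ witnesses a functional dependency; and the degree of a single column with empty $\bm{X}$ is its distinct value count.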

3 On a Class of Information Inequalities

The entropy argument, and Shearer's lemma in particular [13], is a powerful information-theoretic tool in extremal combinatorics [30]. Friedgut and Kahn [15] applied the argument to bound the number of homomorphic copies of a graph in a larger graph; this is a special case of the full conjunctive query problem. Grohe and Marx [19], with further elaboration in [9], showed how Shearer's lemma can be used to bound the output size of a full CQ given input cardinality constraints. Briefly, let $\Sigma$ be the input schema of a full CQ of the form (3),

$Q(\bm{V}) \text{ :- } \bigwedge_{R(\bm{X})\in\Sigma} R(\bm{X}).$   (11)

where the head atom $Q(\bm{V})$ has all the variables. Given any non-negative weight vector $\bm{w}=(w_{\bm{X}})_{R(\bm{X})\in\Sigma}$ that forms a fractional edge cover of the hypergraph $(\bm{V},\bm{E})$, where $\bm{E}:=\{\bm{X}\mid R(\bm{X})\in\Sigma\}$, the output size of the full CQ is bounded by

$|Q| \leq \prod_{R(\bm{X})\in\Sigma} |R|^{w_{\bm{X}}}.$   (12)

This is known as the AGM bound [10]. The bound is a direct consequence of Shearer's inequality [13], which states that the inequality

h(𝑽)\displaystyle h(\bm{V}) R(𝑿)Σw𝑿h(𝑿),\displaystyle\leq\sum_{R(\bm{X})\in\Sigma}w_{\bm{X}}\cdot h(\bm{X}), (13)

holds for every entropic vector 𝒉+2𝑽\bm{h}\in\mathbb{R}_{+}^{2^{\bm{V}}} if and only if the weights 𝒘\bm{w} form a fractional edge cover of the hypergraph (𝑽,𝑬)(\bm{V},\bm{E}) defined above.
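As a sanity check of this characterization, the following Python sketch (a hypothetical encoding, with hyperedges represented as attribute sets) verifies the fractional edge cover condition for the triangle query (1) and evaluates the AGM bound (12) with weights w = (1/2, 1/2, 1/2):

```python
from math import prod

def is_frac_edge_cover(variables, edges, w):
    """Fractional edge cover condition: for every variable v, the total
    weight of the hyperedges containing v is at least 1."""
    return all(sum(wx for X, wx in zip(edges, w) if v in X) >= 1.0 - 1e-9
               for v in variables)

def agm_bound(sizes, w):
    """AGM bound (12): the product of |R|^{w_R} over all relations."""
    return prod(n ** wx for n, wx in zip(sizes, w))

# Triangle query Q(a,b,c) :- E(a,b) ∧ E(b,c) ∧ E(c,a), each atom of size N:
V = {'a', 'b', 'c'}
edges = [{'a', 'b'}, {'b', 'c'}, {'c', 'a'}]
N = 10_000
w = [0.5, 0.5, 0.5]
assert is_frac_edge_cover(V, edges, w)
assert abs(agm_bound([N, N, N], w) - N ** 1.5) < 1e-6  # bound is N^{3/2}
```

The weights (1/2, 1/2, 1/2) cover each of the three variables with total weight 1, which is why the triangle query has output size at most N^{3/2}.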

The above results can only handle cardinality constraints on the input relations, and only apply when the input query is a conjunctive query. We extend these results to disjunctive Datalog rules and handle general degree constraints. We start in the next section with a generalization of Shearer’s inequality and show how that implies an output cardinality bound for DDRs. In later sections, we use the information inequality to drive the PANDA algorithm.

3.1 Size Bound for DDRs from Information Inequalities

This section develops an information-theoretic bound for the output size of a DDR under general degree constraints. Inequality (14) below is a generalization of Shearer’s inequality (13), and the bound (15) is a generalization of the AGM bound (12) for DDRs and general degree constraints. (Recall the notation for the size of a model Σout\Sigma_{\text{\sf out}} in (6).) In what follows, given a schema Σ\Sigma and an atom R(𝑿)ΣR(\bm{X})\in\Sigma, we will for brevity also write 𝑿Σ\bm{X}\in\Sigma, as we assume a one-to-one correspondence between atoms and their variables.

Theorem 3.1.

Consider a DDR of the form (5) with input and output schemas Σin\Sigma_{\text{\sf in}} and Σout\Sigma_{\text{\sf out}} respectively. Let ΔMON\Delta\subseteq\text{\sf MON} be a set of monotonicity measures. Suppose that there exist two non-negative weight vectors 𝐰:=(wδ)δΔ\bm{w}:=(w_{\delta})_{\delta\in\Delta} and 𝛌:=(λ𝐙)𝐙Σout\bm{\lambda}:=(\lambda_{\bm{Z}})_{\bm{Z}\in\Sigma_{\text{\sf out}}} with 𝛌1=1\|\bm{\lambda}\|_{1}=1, such that the following inequality holds for all entropic vectors 𝐡\bm{h}:

𝒁Σoutλ𝒁h(𝒁)\displaystyle\sum_{\bm{Z}\in\Sigma_{\text{\sf out}}}\lambda_{\bm{Z}}\cdot h(\bm{Z}) δΔwδh(δ)\displaystyle\leq\sum_{\delta\in\Delta}w_{\delta}\cdot h(\delta) (14)

Then, for any input instance Σin\Sigma_{\text{\sf in}} for the DDR (5), there exists a model Σout\Sigma_{\text{\sf out}} for the DDR that satisfies: (Recall our definition of Σout\|\Sigma_{\text{\sf out}}\| from Eq. (6). Inequality (15) only holds if we define the size Σout\|\Sigma_{\text{\sf out}}\| as the maximum (not the sum) of the sizes of all relations in the instance. Otherwise, we would have needed to multiply the right-hand side by the number of relations in Σout\Sigma_{\text{\sf out}}.)

Σout\displaystyle\|\Sigma_{\text{\sf out}}\| δΔ(degΣin(δ))wδ\displaystyle\leq\prod_{\delta\in\Delta}\left(\text{\sf deg}_{\Sigma_{\text{\sf in}}}(\delta)\right)^{w_{\delta}} (15)
Proof.

The plan is to use the entropy argument [13]: we define a uniform probability distribution on a certain subset Q¯Σin\bar{Q}\subseteq{{{\Join}}}\Sigma_{\text{\sf in}}, denote by 𝒉\bm{h} its entropic vector, and then use (14) to prove (15).

Let 𝑽=defvars(Σin)\bm{V}\stackrel{{\scriptstyle\text{def}}}{{=}}\text{\sf vars}(\Sigma_{\text{\sf in}}) be the set of all input variables. Notice that, for any joint distribution on 𝑽\bm{V} with entropic vector 𝒉\bm{h}, we have h(δ)logdegΣin(δ)h(\delta)\leq\log\text{\sf deg}_{\Sigma_{\text{\sf in}}}(\delta) for all δΔ\delta\in\Delta. This is trivially true if δΔ\delta\in\Delta is not guarded by any relation in Σin\Sigma_{\text{\sf in}} (see Sec 2.5 for the notion of guardedness). Otherwise, the inequality follows from the fact that the uniform distribution on a finite support has the maximum entropy:

h(δ)=h(𝒀|𝑿)\displaystyle h(\delta)=h(\bm{Y}|\bm{X}) =xDom𝑿p(𝑿=𝒙)h(𝒀|𝑿=𝒙)\displaystyle=\sum_{x\in\textsf{Dom}^{\bm{X}}}p(\bm{X}=\bm{x})h(\bm{Y}|\bm{X}=\bm{x})
xDom𝑿p(𝑿=𝒙)logdegΣin(𝒀|𝑿=𝒙)\displaystyle\leq\sum_{x\in\textsf{Dom}^{\bm{X}}}p(\bm{X}=\bm{x})\log\text{\sf deg}_{\Sigma_{\text{\sf in}}}(\bm{Y}|\bm{X}=\bm{x})
xDom𝑿p(𝑿=𝒙)logdegΣin(δ)\displaystyle\leq\sum_{x\in\textsf{Dom}^{\bm{X}}}p(\bm{X}=\bm{x})\log\text{\sf deg}_{\Sigma_{\text{\sf in}}}(\delta)
=logdegΣin(δ)\displaystyle=\log\text{\sf deg}_{\Sigma_{\text{\sf in}}}(\delta)

Next, we construct both the set Q¯\bar{Q} and a model Σout\Sigma_{\text{\sf out}} for the DDR as follows. Initially, set Q¯=\bar{Q}=\emptyset, and Q=Q=\emptyset for all QΣoutQ\in\Sigma_{\text{\sf out}}. Iterate over the tuples 𝒕Σin\bm{t}\in{{{\Join}}}\Sigma_{\text{\sf in}} in some arbitrary order. For each 𝒕\bm{t}: if QΣout\exists Q\in\Sigma_{\text{\sf out}} s.t. πvars(Q)(𝒕)Q\pi_{\text{\sf vars}(Q)}(\bm{t})\in Q, then ignore 𝒕\bm{t}; otherwise, insert 𝒕\bm{t} into Q¯\bar{Q} and insert πvars(Q)(𝒕)\pi_{\text{\sf vars}(Q)}(\bm{t}) into QQ for every QΣoutQ\in\Sigma_{\text{\sf out}}. In the end, we have constructed a model Σout\Sigma_{\text{\sf out}} for the DDR. Furthermore, |Q¯|=Σout|\bar{Q}|=\|\Sigma_{\text{\sf out}}\|.
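A minimal Python sketch of this construction, assuming the join ⋈Σin has already been materialized as a list of tuples (the encoding of schemas and tuples is hypothetical):

```python
def build_model(tuples, in_vars, out_schemas):
    """One pass of the construction in the proof of Theorem 3.1: keep a
    tuple t in Q_bar only if none of its projections is already present
    in the corresponding output relation; otherwise insert all of its
    projections into all output relations."""
    pos = {v: i for i, v in enumerate(in_vars)}
    Q_bar = []
    outputs = {name: set() for name in out_schemas}
    for t in tuples:
        projs = {name: tuple(t[pos[v]] for v in vars_)
                 for name, vars_ in out_schemas.items()}
        if any(projs[name] in outputs[name] for name in out_schemas):
            continue  # t is already "covered" by some output relation
        Q_bar.append(t)
        for name in out_schemas:
            outputs[name].add(projs[name])
    return Q_bar, outputs

# DDR A(X,Y,Z) ∨ B(Y,Z,W), with join tuples over (X,Y,Z,W):
tuples = [(1, 1, 1, 1), (1, 1, 1, 2), (2, 1, 1, 3)]
out = {'A': ('X', 'Y', 'Z'), 'B': ('Y', 'Z', 'W')}
Q_bar, model = build_model(tuples, ('X', 'Y', 'Z', 'W'), out)
assert all(len(R) == len(Q_bar) for R in model.values())
```

Since a tuple enters Q_bar only when it contributes a fresh tuple to every output relation, each output relation ends up with exactly |Q̄| tuples, which is the equality used in the rest of the proof.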

Finally, consider the uniform distribution on Q¯\bar{Q}, i.e. the distribution where each tuple in Q¯\bar{Q} is chosen randomly with probability 1/|Q¯|1/|\bar{Q}|. Let 𝒉\bm{h} be the entropy vector of this distribution. Notice that for each QΣoutQ\in\Sigma_{\text{\sf out}}, the marginal distribution on vars(Q)\text{\sf vars}(Q) is also uniform, because |Q|=|Q¯||Q|=|\bar{Q}|; in particular, h(𝒁)=log|Q|=log|Q¯|h(\bm{Z})=\log|Q|=\log|\bar{Q}| for all Q(𝒁)ΣoutQ(\bm{Z})\in\Sigma_{\text{\sf out}}. Then, noting that 𝝀1=1\|\bm{\lambda}\|_{1}=1, the following holds:

logΣout=log|Q¯|=𝒁Σoutλ𝒁h(𝒁)δΔwδh(δ)δΔwδlogdegΣin(δ)\displaystyle\log\|\Sigma_{\text{\sf out}}\|=\log|\bar{Q}|=\sum_{\bm{Z}\in\Sigma_{\text{\sf out}}}\lambda_{\bm{Z}}\cdot h(\bm{Z})\leq\sum_{\delta\in\Delta}w_{\delta}\cdot h(\delta)\leq\sum_{\delta\in\Delta}w_{\delta}\cdot\log\text{\sf deg}_{\Sigma_{\text{\sf in}}}(\delta) ∎

In order to obtain the best bound, we need to choose the weights 𝒘\bm{w} and 𝝀\bm{\lambda} to minimize the right-hand side of (15). Specifically, we want to minimize the linear objective

min𝝀,𝒘δΔwδlogdegΣin(δ)\displaystyle\min_{\bm{\lambda},\bm{w}}\sum_{\delta\in\Delta}w_{\delta}\cdot\log\text{\sf deg}_{\Sigma_{\text{\sf in}}}(\delta) (16)

subject to the constraints that 𝒘0\bm{w}\geq 0, 𝝀0\bm{\lambda}\geq 0, 𝝀1=1\|\bm{\lambda}\|_{1}=1, and that inequality (14) holds for all entropic vectors 𝒉\bm{h}.

For a general set of monotonicity measures Δ\Delta, it is an open problem to characterize the weight vectors 𝒘\bm{w} and 𝝀\bm{\lambda} for which (14) holds for all entropic vectors 𝒉\bm{h}. In particular, the difficulty is related to the problem of characterizing the entropic region in information theory [35]. Hence, to make the problem tractable, we relax the upper bound by requiring (14) to hold for all polymatroids 𝒉\bm{h}.

Definition 3.2 (Polymatroid bound).

The following bound is called the polymatroid bound for disjunctive Datalog rules:

bΔ,𝑵\displaystyle b^{*}_{\Delta,\bm{N}} =defminδΔwδlogdegΣin(δ)\displaystyle\stackrel{{\scriptstyle\text{def}}}{{=}}\min\sum_{\delta\in\Delta}w_{\delta}\cdot\log\text{\sf deg}_{\Sigma_{\text{\sf in}}}(\delta) (17)
subject to: 𝝀1=1\displaystyle\qquad\|\bm{\lambda}\|_{1}=1
inequality (14) holds for all polymatroids 𝒉+2𝑽\displaystyle\qquad\text{inequality }\eqref{eqn:ddr:shearer}\text{ holds for all polymatroids }\bm{h}\in{\mathbb{R}}_{\tiny+}^{2^{\bm{V}}} (18)
𝒘0,𝝀0\displaystyle\qquad\bm{w}\geq 0,\bm{\lambda}\geq 0

Note that the bound is stated in log-scale, so that the objective is linear. The constraint (18) is a linear constraint, as explained below. In particular, the polymatroid bound is a linear program.

Proposition 3.3.

The constraint that (14) holds for all polymatroids 𝐡\bm{h} is a linear constraint in the weight vectors 𝐰\bm{w} and 𝛌\bm{\lambda} and auxiliary variables.

Proof.

In the space 2𝑽\mathbb{R}^{2^{\bm{V}}}, the set of polymatroids is a polyhedron of the form {𝒉𝑨𝒉𝟎}\{\bm{h}\mid\bm{A}\bm{h}\geq\bm{0}\} for an appropriately defined matrix 𝑨\bm{A}. Inequality (14) is a linear inequality of the form 𝒃𝒉𝟎\bm{b}^{\top}\bm{h}\geq\bm{0}, where 𝒃\bm{b} is a linear function of the weights 𝒘\bm{w} and 𝝀\bm{\lambda}. From the Gale-Kuhn-Tucker variant of Farkas’ lemma (https://en.wikipedia.org/wiki/Farkas%27_lemma), inequality (14) holds for all polymatroids 𝒉\bm{h} if and only if there is no 𝒉\bm{h} for which 𝒃𝒉<0\bm{b}^{\top}\bm{h}<0 and 𝑨𝒉𝟎\bm{A}\bm{h}\geq\bm{0}, which holds if and only if there is a vector 𝒙\bm{x} such that 𝑨𝒙=𝒃\bm{A}^{\top}\bm{x}=\bm{b} and 𝒙𝟎\bm{x}\geq\bm{0}. The last condition is a linear constraint in the weights 𝒘\bm{w} and 𝝀\bm{\lambda}, and in the dual variables 𝒙\bm{x}. ∎

The question of how tight the polymatroid bound is (and its entropic counterpart) has an intriguing connection to information theory. We refer the reader to [6, 32] for more in-depth discussions and results.

3.2 Equivalent Formulations of Inequality (14)

Shearer’s result [13] states that inequality (13) holds for all polymatroids 𝒉\bm{h} iff the weights form a fractional edge cover of a certain hypergraph. This section shows an analogous characterization for the generalization (14), and states an “integral” version of this characterization that will be used by the PANDA algorithm. The following lemma is a variant of Farkas’ lemma [31] applied to our specific setting.

Lemma 3.4.

Let 𝐚2𝐕\bm{a}\in\mathbb{R}^{2^{\bm{V}}} be a coefficient vector. The following inequality is a Shannon inequality:

𝑿𝑽a𝑿h(𝑿)\displaystyle\sum_{\bm{X}\subseteq\bm{V}}a_{\bm{X}}h(\bm{X})\geq 0\displaystyle 0 (19)

if and only if there exist non-negative coefficients 𝐦=(mμ)μMON\bm{m}=(m_{\mu})_{\mu\in\text{\sf MON}} and 𝐬=(sσ)σSUB\bm{s}=(s_{\sigma})_{\sigma\in\text{\sf SUB}} such that the following equality holds as an identity over 2|𝐕|2^{|\bm{V}|} symbolic variables h(𝐗)h(\bm{X}), 𝐗𝐕\bm{X}\subseteq\bm{V}:

𝑿𝑽a𝑿h(𝑿|)\displaystyle\sum_{\bm{X}\subseteq\bm{V}}a_{\bm{X}}h(\bm{X}|\emptyset) =μMONmμh(μ)+σSUBsσh(σ)\displaystyle=\sum_{\mu\in\text{\sf MON}}m_{\mu}h(\mu)+\sum_{\sigma\in\text{\sf SUB}}s_{\sigma}h(\sigma) (20)

We call the tuple ((mμ)μMON,(sσ)σSUB)((m_{\mu})_{\mu\in\text{\sf MON}},(s_{\sigma})_{\sigma\in\text{\sf SUB}}) a witness of (19). If 𝐚\bm{a} is rational, then there exists a rational witness. Furthermore, 𝐦\bm{m} and 𝐬\bm{s} can be chosen as functions of 𝐚\bm{a} and |𝐕||\bm{V}|.

Proof.

Clearly if (20) holds as an identity, then (19) follows trivially. The converse is a direct consequence of a variant of Farkas’ lemma, in particular Corollary 7.1h in Schrijver’s classic text [31], which states: if the polyhedron P={𝒙|𝑨𝒙𝒃}P=\{\bm{x}\;|\;\bm{A}\bm{x}\geq\bm{b}\} is not empty, and if inequality 𝒄,𝒙d\langle\bm{c},\bm{x}\rangle\geq d holds for all 𝒙P\bm{x}\in P, then there exists ddd^{\prime}\geq d for which 𝒄,𝒙d\langle\bm{c},\bm{x}\rangle\geq d^{\prime} is a non-negative linear combination of the inequalities in 𝑨𝒙𝒃\bm{A}\bm{x}\geq\bm{b}. (We thank the anonymous reviewer for pointing out this specific variant, which helps simplify our proof.)

In our context, PP is the polyhedron defined by (8) and 𝒃=𝟎\bm{b}=\bm{0}, and the inequality is defined by 𝒄=𝒂\bm{c}=\bm{a} and d=0d=0. Farkas’ lemma thus implies that if (19) is a Shannon inequality, then 𝒂,𝒉0\langle\bm{a},\bm{h}\rangle\geq 0 is a non-negative linear combination of the Shannon inequalities in (8), namely there are coefficients 𝒎,𝒔,e+,e\bm{m},\bm{s},e^{+},e^{-} such that the following identity holds over the variables h(𝑿)h(\bm{X}), 𝑿𝑽\bm{X}\subseteq\bm{V}:

𝒂,𝒉=μMONmμh(μ)+σSUBsσh(σ)+(e+e)h().\displaystyle\langle\bm{a},\bm{h}\rangle=\sum_{\mu\in\text{\sf MON}}m_{\mu}h(\mu)+\sum_{\sigma\in\text{\sf SUB}}s_{\sigma}h(\sigma)+(e^{+}-e^{-})h(\emptyset). (21)

where e+e^{+} and ee^{-} are coefficients associated with h()0h(\emptyset)\geq 0 and h()0-h(\emptyset)\geq 0. Setting h(𝑿)=1h(\bm{X})=1 for all 𝑿\bm{X} in (21), we conclude that 𝒂,𝟏=e+e\langle\bm{a},\bm{1}\rangle=e^{+}-e^{-}, and thus (20) holds.

The fact that 𝒎\bm{m} and 𝒔\bm{s} can be chosen as functions of 𝒂\bm{a} and |𝑽||\bm{V}| (and the bounds on their representation sizes) can be found in Chapter 10 of Schrijver’s book [31]. ∎

In words, Lemma 3.4 says that the LHS of (19) is a non-negative linear combination of monotonicity and submodularity measures. From the lemma, it is not hard to show that Shearer’s inequality (13) holds whenever the weights form a fractional edge cover of the corresponding hypergraph.
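Such a witness can be checked mechanically. The following Python sketch (using a hypothetical encoding of linear combinations of entropy terms as coefficient maps over subsets of the variables) verifies an identity of the form (20) for Shearer’s triangle inequality 2h(XYZ) ≤ h(XY) + h(YZ) + h(XZ), whose witness consists of two submodularity terms:

```python
from collections import Counter

def combo(*terms):
    """Linear combination of entropy terms h(S), as {frozenset: coeff}."""
    out = Counter()
    for coeff, S in terms:
        out[frozenset(S)] += coeff
    return out

def h_cond(Y, X):    # h(Y|X) = h(XY) - h(X)
    return combo((1, set(X) | set(Y)), (-1, set(X)))

def h_sub(Y, Z, X):  # h(Y;Z|X) = h(XY) + h(XZ) - h(X) - h(XYZ)
    return combo((1, set(X) | set(Y)), (1, set(X) | set(Z)),
                 (-1, set(X)), (-1, set(X) | set(Y) | set(Z)))

def total(*combos):
    out = Counter()
    for c in combos:
        out.update(c)
    return {S: v for S, v in out.items() if v != 0}

# LHS of (20) for Shearer's triangle inequality:
#   h(XY|∅) + h(YZ|∅) + h(XZ|∅) - 2·h(XYZ|∅)
lhs = total(h_cond('XY', ''), h_cond('YZ', ''), h_cond('XZ', ''),
            combo((-2, 'XYZ'), (2, '')))
# Candidate witness: submodularities h(Y;Z|X) and h(X;YZ|∅), no monotonicities
rhs = total(h_sub('Y', 'Z', 'X'), h_sub('X', 'YZ', ''))
assert lhs == rhs  # (20) holds as a symbolic identity
```

Expanding both sides into coefficients over subsets and comparing them is exactly the "identity over symbolic variables h(X)" condition of the lemma.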

To guide the algorithm, we will need an “integral” version of the lemma above. Let SS be any set; a multiset 𝒮\mathcal{S} over SS is a multiset whose members are in SS. The size of a multiset 𝒮\mathcal{S}, denoted by |𝒮||\mathcal{S}|, is the number of its members, counting multiplicity. If the coefficients 𝒘\bm{w} in (14) are rational, then inequality (14) has a rational witness. By multiplying the inequality by the least common multiple of the denominators of the coefficients, we obtain an equivalent inequality with integer coefficients. The following corollary thus follows.

Corollary 3.5.

For rational coefficients 𝐰\bm{w}, inequality (14) holds for all polymatroids if and only if there exist multisets 𝒵\mathcal{Z}, 𝒟\mathcal{D}, \mathcal{M}, and 𝒮\mathcal{S} over Σout\Sigma_{\text{\sf out}}, Δ\Delta, MON, and SUB respectively, such that the following identity holds symbolically over the variables h(𝐗)h(\bm{X}), 𝐗𝐕\bm{X}\subseteq\bm{V}:

𝒁𝒵h(𝒁|)\displaystyle\sum_{\bm{Z}\in\mathcal{Z}}h(\bm{Z}|\emptyset) =δ𝒟h(δ)μh(μ)σ𝒮h(σ)\displaystyle=\sum_{\delta\in\mathcal{D}}h(\delta)-\sum_{\mu\in\mathcal{M}}h(\mu)-\sum_{\sigma\in\mathcal{S}}h(\sigma) (22)

In particular, if we set h()=0h(\emptyset)=0, then the identity (22) becomes:

𝒁𝒵h(𝒁)=δ𝒟h(δ)μh(μ)σ𝒮h(σ)\displaystyle\sum_{\bm{Z}\in\mathcal{Z}}h(\bm{Z})=\sum_{\delta\in\mathcal{D}}h(\delta)-\sum_{\mu\in\mathcal{M}}h(\mu)-\sum_{\sigma\in\mathcal{S}}h(\sigma) (23)

The sizes of the multisets 𝒵,𝒟,,𝒮\mathcal{Z},\mathcal{D},\mathcal{M},\mathcal{S} are bounded by functions of 𝒘\bm{w} and |𝑽||\bm{V}|.

Definition 3.6.

We call the terms h(δ)h(\delta) in (22) and (23) statistics terms, and call the terms h(μ)h(\mu) and h(σ)h(\sigma) witness terms. Specifically, we call terms h(μ)h(\mu) monotonicity terms, and terms h(σ)h(\sigma) submodularity terms. After clearing the common denominator in (14), the resulting inequality is called an integral Shannon inequality and has the form:

𝒁𝒵h(𝒁)δ𝒟h(δ)\displaystyle\sum_{\bm{Z}\in\mathcal{Z}}h(\bm{Z})\leq\sum_{\delta\in\mathcal{D}}h(\delta) (24)

3.3 The Reset Lemma

To conclude this section on information inequalities, we present a combinatorial lemma that plays a key role in our algorithm. The lemma says that, given a valid integral Shannon inequality (24), if we would like to remove an unconditional term h(δ0)h(\delta_{0}) from the RHS while retaining its validity, then it suffices to remove at most one term h(𝒁)h(\bm{Z}) from the LHS. We may be able to remove even more terms from the RHS, in addition to h(δ0)h(\delta_{0}), but, importantly, it suffices to remove a single term from the LHS.

Lemma 3.7 (Reset Lemma).

Consider an integral Shannon inequality (24):

𝒁𝒵h(𝒁)δ𝒟h(δ)\displaystyle\sum_{\bm{Z}\in\mathcal{Z}}h(\bm{Z})\leq\sum_{\delta\in\mathcal{D}}h(\delta)

Suppose some term δ0𝒟\delta_{0}\in\mathcal{D} is unconditional. Then there are two multisets 𝒟𝒟{δ0}\mathcal{D}^{\prime}\subseteq\mathcal{D}\setminus\{\delta_{0}\} and 𝒵𝒵\mathcal{Z}^{\prime}\subseteq\mathcal{Z} with |𝒵||𝒵|1|\mathcal{Z}^{\prime}|\geq|\mathcal{Z}|-1 such that the following is also an integral Shannon inequality:

𝒁𝒵h(𝒁)δ𝒟h(δ)\displaystyle\sum_{\bm{Z}\in\mathcal{Z}^{\prime}}h(\bm{Z})\leq\sum_{\delta\in\mathcal{D}^{\prime}}h(\delta)
Proof.

By Corollary 3.5, there exists an integral witness such that equation (23) is an identity (with h()h(\emptyset) set to 0). We prove the lemma by induction on the “potential” quantity q:=|𝒟|+||+2|𝒮|q:=|\mathcal{D}|+|\mathcal{M}|+2|\mathcal{S}|. The base case when q1q\leq 1 is trivial. Consider the case when q2q\geq 2. Suppose δ0=(𝑾)\delta_{0}=(\bm{W}\mid\emptyset), i.e. h(δ0)=h(𝑾)h(\delta_{0})=h(\bm{W}).

Since (23) is an identity, the term h(𝑾)h(\bm{W}) must cancel out with some other term. If 𝑾𝒵\bm{W}\in\mathcal{Z}, then we can remove h(𝑾)h(\bm{W}) from both sides (i.e. setting 𝒵=𝒵{𝑾}\mathcal{Z}^{\prime}=\mathcal{Z}-\{\bm{W}\} and 𝒟=𝒟{(𝑾|)}\mathcal{D}^{\prime}=\mathcal{D}-\{(\bm{W}|\emptyset)\}). Otherwise, there are three cases:

Case 1

There exists (𝒀|𝑾)𝒟(\bm{Y}|\bm{W})\in\mathcal{D}. Then, from

h(𝒀|𝑾)+h(𝑾|)=h(𝒀𝑾|)\displaystyle h(\bm{Y}|\bm{W})+h(\bm{W}|\emptyset)=h(\bm{Y}\bm{W}|\emptyset) (25)

identity (23) remains an identity if we add (𝒀𝑾|)(\bm{Y}\bm{W}|\emptyset) to and remove {(𝒀|𝑾),(𝑾|)}\{(\bm{Y}|\bm{W}),(\bm{W}|\emptyset)\} from 𝒟\mathcal{D}. The potential qq decreases by 1, and thus by induction, we can drop the newly added statistics term h(𝒀𝑾)h(\bm{Y}\bm{W}) from the RHS of (24) while dropping at most one term from the LHS.

Case 2

h(𝑾)h(\bm{W}) cancels with a monotonicity term h(μ)h(\mu) where μ=(𝒀|𝑿)\mu=(\bm{Y}|\bm{X}). In other words, 𝑾=𝑿𝒀\bm{W}=\bm{X}\bm{Y}, and the RHS of (23) contains the terms:

h(𝑾)h(𝒀|𝑿)\displaystyle h(\bm{W})-h(\bm{Y}|\bm{X}) =h(𝑾)h(𝑿𝒀)+h(𝑿)=h(𝑿)\displaystyle=h(\bm{W})-h(\bm{X}\bm{Y})+h(\bm{X})=h(\bm{X}) (26)

We add (𝑿|)(\bm{X}|\emptyset) to 𝒟\mathcal{D}, remove (𝑾|)(\bm{W}|\emptyset) from 𝒟\mathcal{D}, and remove μ=(𝒀|𝑿)\mu=(\bm{Y}|\bm{X}) from \mathcal{M}, thus decreasing the potential qq by 1. Then, we proceed inductively to eliminate the newly added statistics term h(𝑿)h(\bm{X}).

Case 3

h(𝑾)h(\bm{W}) cancels with a submodularity term h(σ)h(\sigma) where σ=(𝒀;𝒁|𝑿)\sigma=(\bm{Y};\bm{Z}|\bm{X}) and 𝑾=𝑿𝒀\bm{W}=\bm{X}\bm{Y}. In particular, the RHS of (23) contains the terms:

h(𝑾)h(𝒀;𝒁|𝑿)\displaystyle h(\bm{W})-h(\bm{Y};\bm{Z}|\bm{X}) =h(𝑾)h(𝑿𝒀)h(𝑿𝒁)+h(𝑿)+h(𝑿𝒀𝒁)\displaystyle=h(\bm{W})-h(\bm{X}\bm{Y})-h(\bm{X}\bm{Z})+h(\bm{X})+h(\bm{X}\bm{Y}\bm{Z})
=h(𝑿𝒀𝒁)h(𝒁|𝑿)\displaystyle=h(\bm{X}\bm{Y}\bm{Z})-h(\bm{Z}|\bm{X}) (27)

We add (𝑿𝒀𝒁|)(\bm{X}\bm{Y}\bm{Z}|\emptyset) to 𝒟\mathcal{D}, drop (𝑾|)(\bm{W}|\emptyset) from 𝒟\mathcal{D}, drop σ=(𝒀;𝒁|𝑿)\sigma=(\bm{Y};\bm{Z}|\bm{X}) from 𝒮\mathcal{S}, and add a new monotonicity measure (𝒁|𝑿)(\bm{Z}|\bm{X}) to \mathcal{M}. Overall, the potential qq decreases by 11. The proof follows by induction where we can eliminate the newly added statistics term h(𝑿𝒀𝒁)h(\bm{X}\bm{Y}\bm{Z}) from the RHS. ∎

We illustrate the Reset Lemma with the following simple examples:

Example 3.8.

Consider Shearer’s inequality 2h(XYZ)h(XY)+h(YZ)+h(XZ)2h(XYZ)\leq h(XY)+h(YZ)+h(XZ), which we write in the form (24) as: (To be fully consistent with (24), we should write it as h(XYZ)+h(XYZ)h(XY|)+h(YZ|)+h(XZ|)h(XYZ)+h(XYZ)\leq h(XY|\emptyset)+h(YZ|\emptyset)+h(XZ|\emptyset).)

h(XYZ)+h(XYZ)\displaystyle h(XYZ)+h(XYZ)\leq h(XY)+h(YZ)+h(XZ)\displaystyle h(XY)+h(YZ)+h(XZ) (28)

To drop the term h(XZ)h(XZ) on the RHS of (28) while retaining the validity of the inequality, we can also delete one term on the LHS (we can choose any term since they are identical), and obtain the following Shannon inequality:

h(XYZ)\displaystyle h(XYZ) h(XY)+h(YZ)\displaystyle\leq h(XY)+h(YZ)

The proof of the Reset Lemma “explains” how we can do this dropping by reasoning from the identity counterpart of (28) (i.e. an identity of the form (23)):

h(XYZ)+h(XYZ)\displaystyle h(XYZ)+h(XYZ) =h(XY)+h(YZ)+h(XZ)\displaystyle=h(XY)+h(YZ)+h(XZ)
(h(XY)+h(XZ)h(X)h(XYZ))\displaystyle-(h(XY)+h(XZ)-h(X)-h(XYZ)) this is h(Y;Z|X)\displaystyle\text{this is }h(Y;Z|X)
(h(X)+h(YZ)h(XYZ)h())\displaystyle-(h(X)+h(YZ)-h(XYZ)-h(\emptyset)) this is h(X;YZ|)\displaystyle\text{this is }h(X;YZ|\emptyset) (29)

By canceling out h(XYZ)h(XYZ) from both sides, we obtain a different identity:

h(XYZ)\displaystyle h(XYZ) =h(XY)+h(YZ)\displaystyle=h(XY)+h(YZ)
(h(XY)h(X))\displaystyle-(h(XY)-h(X)) this is a monotonicity h(Y|X)\displaystyle\text{this is a monotonicity }h(Y|X)
(h(X)+h(YZ)h(XYZ)h())\displaystyle-(h(X)+h(YZ)-h(XYZ)-h(\emptyset)) this is h(X;YZ|)\displaystyle\text{this is }h(X;YZ|\emptyset) (30)

which is what Case 3 from the proof of the Reset Lemma does. The resulting identity witnesses the fact that h(XYZ)h(XY)+h(YZ)h(XYZ)\leq h(XY)+h(YZ) is a Shannon inequality.
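The case analysis in the proof of the Reset Lemma can be replayed mechanically. Below is a Python sketch (with a hypothetical encoding: statistics and monotonicity terms (Y|X) as pairs of frozensets, and submodularity terms (A;B|X) as triples) that eliminates an unconditional statistics term exactly as in Example 3.8:

```python
fs = frozenset

def reset(Z, D, M, S, W):
    """Eliminate the unconditional statistics term (W|∅) from identity (23),
    following the case analysis in the proof of the Reset Lemma."""
    Z, D, M, S = list(Z), list(D), list(M), list(S)
    D.remove((fs(), W))
    while True:
        if W in Z:                            # h(W) cancels an output term
            Z.remove(W)
            return Z, D, M, S
        hit = next(((X, Y) for (X, Y) in D if X == W), None)
        if hit:                               # Case 1: h(W)+h(Y|W) = h(WY)
            D.remove(hit); W = W | hit[1]; continue
        hit = next(((X, Y) for (X, Y) in M if X | Y == W), None)
        if hit:                               # Case 2: h(W)-h(Y|X) = h(X)
            M.remove(hit); W = hit[0]; continue
        X, A, B = next(s for s in S           # Case 3: W = X∪A (or X∪B)
                       if s[0] | s[1] == W or s[0] | s[2] == W)
        S.remove((X, A, B))
        if X | B == W:
            A, B = B, A                       # normalize so that W = X∪A
        M.append((X, B))                      # h(W)-h(A;B|X) = h(XAB)-h(B|X)
        W = X | A | B

# Example 3.8: drop h(XZ) from 2h(XYZ) ≤ h(XY)+h(YZ)+h(XZ), whose witness
# consists of the submodularities h(Y;Z|X) and h(X;YZ|∅):
Z = [fs('XYZ'), fs('XYZ')]
D = [(fs(), fs('XY')), (fs(), fs('YZ')), (fs(), fs('XZ'))]
S = [(fs('X'), fs('Y'), fs('Z')), (fs(), fs('X'), fs('YZ'))]
Z2, D2, M2, S2 = reset(Z, D, [], S, fs('XZ'))
assert Z2 == [fs('XYZ')] and D2 == [(fs(), fs('XY')), (fs(), fs('YZ'))]
```

Running the sketch reproduces the derivation above: the submodularity h(Y;Z|X) is traded for the monotonicity h(Y|X), and one copy of h(XYZ) is dropped from each side, leaving h(XYZ) ≤ h(XY) + h(YZ).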

Example 3.9.

Consider the following Shannon inequality:

h(XYZW)+h(Y)h(XY)+h(YZ)+h(W|XYZ),h(XYZW)+h(Y)\leq h(XY)+h(YZ)+h(W|XYZ),

which follows from the following identity of the form (23):

h(XYZW)+h(Y)=h(XY)+h(YZ)+h(W|XYZ)h(X;Z|Y)h(XYZW)+h(Y)=h(XY)+h(YZ)+h(W|XYZ)-h(X;Z|Y)

Suppose that we want to drop h(XY)h(XY) from the RHS. The first step is going to follow Case 3 of the proof of the Reset Lemma by replacing the submodularity h(X;Z|Y)h(X;Z|Y) with an additional monotonicity h(Z|Y)h(Z|Y), thus leading to the identity:

h(XYZW)+h(Y)=h(XYZ)+h(YZ)+h(W|XYZ)h(Z|Y)h(XYZW)+h(Y)=h(XYZ)+h(YZ)+h(W|XYZ)-h(Z|Y)

where our target now is to remove the term h(XYZ)h(XYZ) from the RHS. But now, our only option is to use Case 1 and combine h(XYZ)h(XYZ) with h(W|XYZ)h(W|XYZ), leading to

h(XYZW)+h(Y)=h(XYZW)+h(YZ)h(Z|Y)h(XYZW)+h(Y)=h(XYZW)+h(YZ)-h(Z|Y)

And now, we drop h(XYZW)h(XYZW) from both sides. The resulting inequality is h(Y)h(YZ)h(Y)\leq h(YZ).

We will also need the following simple proposition to explain a step in PANDA.

Proposition 3.10.

Consider an integral Shannon inequality of the form (24). The number of unconditional statistics terms in 𝒟\mathcal{D} (counting multiplicities) is at least |𝒵||\mathcal{Z}|.

Proof.

It is straightforward to verify that the following vector 𝒉\bm{h} is a polymatroid, i.e. satisfies (8):

h(𝑾)=\displaystyle h(\bm{W})= {0if 𝑾=1otherwise\displaystyle\begin{cases}0&\mbox{if $\bm{W}=\emptyset$}\\ 1&\mbox{otherwise}\end{cases}

Applying 𝒉\bm{h} to (23), the LHS is |𝒵||\mathcal{Z}|, while the RHS is at most the number of unconditional terms, because h(𝒀|)=1h(\bm{Y}|\emptyset)=1, and h(𝒀|𝑿)=0h(\bm{Y}|\bm{X})=0 when 𝑿\bm{X}\neq\emptyset, while all witness terms are non-negative: h(μ)0h(\mu)\geq 0, h(σ)0h(\sigma)\geq 0. ∎

4 Overview of PANDA and statement of main result

This section states and explains the main result in this paper, which shows that the Shannon inequality (14) is not only useful for bounding the output size of a DDR as shown in Theorem 3.1, but it can also be used to guide an algorithm to evaluate a DDR in time proportional to the bound (15), modulo a polylogarithmic factor. We formally state this result in Section 4.1, and present an illustrative example in Section 4.2. The detailed algorithm description and its proof of correctness are presented in Section 5.

4.1 An Efficient Algorithm to Evaluate Disjunctive Datalog Rules

Given the coefficients 𝒘\bm{w} and 𝝀\bm{\lambda} of inequality (14) (where recall 𝝀1=1\|\bm{\lambda}\|_{1}=1), we denote by:

BΔ,𝑵=def\displaystyle B_{\Delta,\bm{N}}\stackrel{{\scriptstyle\text{def}}}{{=}} δΔNδwδ\displaystyle\prod_{\delta\in\Delta}N^{w_{\delta}}_{\delta} bΔ,𝒏=def\displaystyle b_{\Delta,\bm{n}}\stackrel{{\scriptstyle\text{def}}}{{=}} logBΔ,𝑵=δΔwδnδ\displaystyle\log B_{\Delta,\bm{N}}=\sum_{\delta\in\Delta}w_{\delta}n_{\delta} (31)

Theorem 3.1 implies that, if Σin\Sigma_{\text{\sf in}} satisfies the degree constraints (see Sec 2.5), i.e. Σin(Δ,𝑵)\Sigma_{\text{\sf in}}\models(\Delta,\bm{N}), then there exists a feasible output Σout\Sigma_{\text{\sf out}} that satisfies ΣoutBΔ,𝑵\|\Sigma_{\text{\sf out}}\|\leq B_{\Delta,\bm{N}}. Our main result states that such an output can be computed in time O~(Σin+BΔ,𝑵)\tilde{O}(\|\Sigma_{\text{\sf in}}\|+B_{\Delta,\bm{N}}) if inequality (14) is a Shannon inequality: (We use O~\tilde{O} to hide a polylogarithmic factor in the input size Σin\|\Sigma_{\text{\sf in}}\|.)

Theorem 4.1 (The PANDA algorithm for a DDR).

Given the following inputs:

  • A disjunctive datalog rule (DDR) of the form (5).

  • An input instance Σin\Sigma_{\text{\sf in}} to the DDR.

  • Degree constraints (Δ,𝑵)(\Delta,\bm{N}) that are satisfied by the instance Σin\Sigma_{\text{\sf in}}, i.e. Σin(Δ,𝑵)\Sigma_{\text{\sf in}}\models(\Delta,\bm{N}).

  • Coefficients 𝒘=(wδ)δΔ\bm{w}=(w_{\delta})_{\delta\in\Delta}, and 𝝀=(λ𝒁)𝒁Σout\bm{\lambda}=(\lambda_{\bm{Z}})_{\bm{Z}\in\Sigma_{\text{\sf out}}}, with 𝝀1=1\|\bm{\lambda}\|_{1}=1 that define a valid Shannon inequality (14).

Let BΔ,𝐍B_{\Delta,\bm{N}} be the bound defined in (31), in terms of the statistics (Δ,𝐍)(\Delta,\bm{N}) and the coefficients 𝐰\bm{w}. Then, we can compute a feasible output Σout\Sigma_{\text{\sf out}} to the DDR in time O~(Σin+BΔ,𝐍)\tilde{O}(\|\Sigma_{\text{\sf in}}\|+B_{\Delta,\bm{N}}). This feasible output is of size O~(BΔ,𝐍)\tilde{O}(B_{\Delta,\bm{N}}).

We will prove the theorem over the next few sections. We illustrate with a simple example:

Example 4.2.

Continuing the example in Eq. (7), assume that |R|=|S|=|U|=N|R|=|S|=|U|=N. Consider the following Shannon inequality, proved by applying two submodularities:

h(XY)+h(YZ)+h(ZW)\displaystyle h(XY)+h(YZ)+h(ZW) h(XYZ)+h(Y)+h(ZW)h(XYZ)+h(YZW)\displaystyle\geq h(XYZ)+h(Y)+h(ZW)\geq h(XYZ)+h(YZW)

Rearranging it to the form (14), we obtain:

12h(XYZ)+12h(YZW)12h(XY)+12h(YZ)+12h(ZW)\displaystyle\frac{1}{2}h(XYZ)+\frac{1}{2}h(YZW)\leq\frac{1}{2}h(XY)+\frac{1}{2}h(YZ)+\frac{1}{2}h(ZW) (32)

Theorem 3.1 proves that there exists a feasible solution (A,B)(A,B) of size max(|A|,|B|)N3/2\max(|A|,|B|)\leq N^{3/2}, while Theorem 4.1 says that we can compute a feasible solution of size O~(N3/2)\tilde{O}(N^{3/2}) in time O~(N3/2)\tilde{O}(N^{3/2}).

Trivially, the theorem (and PANDA) can be used to compute a full conjunctive query, because a full conjunctive query is a special case of a DDR where the output schema consists of a single atom containing all variables. After computing a feasible output of the DDR, one can semijoin-reduce it with every relation in the body to obtain a solution to the full conjunctive query.
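The semijoin-reduction step can be sketched in Python as follows (the relation encoding and the candidate superset are hypothetical; in the actual algorithm the candidates come from the feasible DDR output):

```python
def semijoin_reduce(candidates, cand_vars, body):
    """Filter a feasible DDR output down to the answers of the full CQ:
    keep a candidate tuple only if each of its projections onto a body
    atom's variables appears in that relation (one semijoin per atom)."""
    pos = {v: i for i, v in enumerate(cand_vars)}
    def survives(t):
        return all(tuple(t[pos[v]] for v in vars_) in R
                   for vars_, R in body)
    return [t for t in candidates if survives(t)]

# Triangle query Q(a,b,c) :- E(a,b) ∧ E(b,c) ∧ E(c,a) on a toy graph:
E = {(1, 2), (2, 3), (3, 1), (1, 3)}
# Hypothetical superset of the answers, e.g. produced by PANDA:
candidates = [(1, 2, 3), (2, 3, 1), (1, 3, 2), (3, 2, 1)]
body = [(('a', 'b'), E), (('b', 'c'), E), (('c', 'a'), E)]
answers = semijoin_reduce(candidates, ('a', 'b', 'c'), body)
assert answers == [(1, 2, 3), (2, 3, 1)]
```

Each semijoin only shrinks the candidate set and takes linear time in its size, so this post-processing stays within the stated runtime.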

Corollary 4.3 (The PANDA algorithm for a full conjunctive query).

Suppose that the DDR from Theorem 4.1 corresponds to a full conjunctive query QQ, i.e. the output schema of the DDR consists of a single atom containing all variables Σout={Q(𝐕)}\Sigma_{\text{\sf out}}=\{Q(\bm{V})\}, in which case the DDR (5) collapses back to (11), where Σ:=Σin\Sigma:=\Sigma_{\text{\sf in}}. Moreover, suppose that the conditions of Theorem 4.1 are satisfied. Then, the output of the full conjunctive query QQ satisfies |Q|BΔ,𝐍|Q|\leq B_{\Delta,\bm{N}}, and can be computed in time O~(Σ+BΔ,𝐍)\tilde{O}(\|\Sigma\|+B_{\Delta,\bm{N}}).

For the best-possible runtime, we want to seed PANDA with parameters minimizing the quantity BΔ,𝑵B_{\Delta,\bm{N}} (or equivalently, bΔ,𝑵b_{\Delta,\bm{N}}), which is the polymatroid bound (17) for the query QQ. In the case of full CQs, there is only one 𝝀\bm{\lambda}-parameter, set to λ=1\lambda=1, and it remains to find 𝒘\bm{w} minimizing BΔ,𝑵B_{\Delta,\bm{N}}. The minimum value is BΔ,𝑵=def2bΔ,𝑵B^{*}_{\Delta,\bm{N}}\stackrel{{\scriptstyle\text{def}}}{{=}}2^{b^{*}_{\Delta,\bm{N}}}, defined by

bΔ,𝑵\displaystyle b^{*}_{\Delta,\bm{N}} =defmin𝒘𝟎{δΔwδnδh(𝑽)δΔwδh(δ) for all polymatroids 𝒉+2𝑽}\displaystyle\stackrel{{\scriptstyle\text{def}}}{{=}}\min_{\bm{w}\geq\bm{0}}\left\{\sum_{\delta\in\Delta}w_{\delta}n_{\delta}\mid h(\bm{V})\leq\sum_{\delta\in\Delta}w_{\delta}h(\delta)\text{ for all polymatroids }\bm{h}\in\mathbb{R}_{+}^{2^{\bm{V}}}\right\}
=max𝒉+2𝑽{h(𝑽)h(δ)nδ for all δΔ and 𝒉 is a polymatroid}\displaystyle=\max_{\bm{h}\in\mathbb{R}_{+}^{2^{\bm{V}}}}\biggl{\{}h(\bm{V})\mid h(\delta)\leq n_{\delta}\text{ for all }\delta\in\Delta\text{ and }\bm{h}\text{ is a polymatroid}\biggr{\}}

The second equality follows from simple duality arguments. PANDA would be worst-case optimal if BΔ,𝑵B^{*}_{\Delta,\bm{N}} is tight, in the sense that we can construct a database instance Σ\Sigma that satisfies the degree constraints (Δ,𝑵)(\Delta,\bm{N}) and has output size |Q|=Ω(BΔ,𝑵)|Q|=\Omega(B^{*}_{\Delta,\bm{N}}). There are two special cases where BΔ,𝑵B^{*}_{\Delta,\bm{N}} is known to be tight in data complexity: when all degree constraints are “simple” (see [2, 32]), or when the degree constraints are “acyclic” (see [5, 27]). The degree constraints are simple if |𝑿|1|\bm{X}|\leq 1 for all δ=(𝒀|𝑿)Δ\delta=(\bm{Y}|\bm{X})\in\Delta, and they are acyclic if there is a global order of the variables in 𝑽\bm{V} such that for every δ=(𝒀|𝑿)Δ\delta=(\bm{Y}|\bm{X})\in\Delta, all variables in 𝑿\bm{X} precede all variables in 𝒀\bm{Y} in the order. If Δ\Delta contains only cardinality constraints, then it is both simple and acyclic, and BΔ,𝑵B^{*}_{\Delta,\bm{N}} is exactly the AGM bound because (14) reduces back to (13).
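For instance, for the triangle query with three relations of size N (so each statistic is n = log N, normalized to 1 below), the "half-rank" vector h(W) = |W|/2 is a feasible polymatroid in the maximization above, certifying b* ≥ 3n/2; together with the matching AGM upper bound, this gives B* = N^{3/2}. A quick Python check of feasibility (with subsets encoded as frozensets, our own encoding):

```python
from itertools import combinations

def subsets(V):
    return [frozenset(c) for r in range(len(V) + 1)
            for c in combinations(V, r)]

V = ['X', 'Y', 'Z']
h = {S: len(S) / 2 for S in subsets(V)}   # the "half-rank" vector

# h is a polymatroid: h(∅)=0, monotone, and submodular (here: modular) ...
assert h[frozenset()] == 0
assert all(h[A] <= h[B] for A in subsets(V) for B in subsets(V) if A <= B)
assert all(h[A] + h[B] >= h[A | B] + h[A & B]
           for A in subsets(V) for B in subsets(V))
# ... and it satisfies the statistics h(pair) <= n, with n = log N = 1:
assert all(h[frozenset(p)] <= 1 for p in combinations(V, 2))
# Hence the maximization yields b* >= h(XYZ) = 3/2.
assert h[frozenset(V)] == 1.5
```

This dual witness is the reason the AGM bound N^{3/2} for the triangle cannot be improved under cardinality constraints alone.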

4.2 Example: Preview of PANDA

We illustrate the algorithm with the DDR (7):

A(X,Y,Z)B(Y,Z,W) :-\displaystyle A(X,Y,Z)\vee B(Y,Z,W)\text{ :- } R(X,Y)S(Y,Z)U(Z,W)\displaystyle R(X,Y)\wedge S(Y,Z)\wedge U(Z,W)

Following Example 4.2, we assume we only have cardinality statistics / constraints:

Δ=\displaystyle\Delta= {(XY|),(YZ|),(ZW|)},\displaystyle\{(XY|\emptyset),(YZ|\emptyset),(ZW|\emptyset)\}, NXY=\displaystyle N_{XY}= NYZ=NZW=N.\displaystyle N_{YZ}=N_{ZW}=N.

We further assume that NN is a power of 2. Each constraint has a “guard”, i.e. a relation that “sponsors” (satisfies) the constraint: TXY:=R,TYZ:=S,TZW:=UT_{XY}:=R,T_{YZ}:=S,T_{ZW}:=U. We name all guards TT_{\cdot} to be consistent with the formal description of the algorithm presented in Section 5.

Suppose that the given input Shannon inequality is the one in Eq. (32). This inequality has the shape of (14) with λXYZ=λYZW=1/2\lambda_{XYZ}=\lambda_{YZW}=1/2 and wXY|=wYZ|=wZW|=1/2w_{XY|\emptyset}=w_{YZ|\emptyset}=w_{ZW|\emptyset}=1/2. The fact that this is a Shannon inequality was shown above in Example 4.2.

From Corollary 3.5, the Shannon inequality (32) must have a corresponding identity (23) over symbolic variables h(𝑿)h(\bm{X}) where h()=0h(\emptyset)=0. We show one such identity below: (Our witness is not elemental, because h(Y;ZW|)h(Y;ZW|\emptyset) is not elemental: if we replaced the latter with h(Y;W|)+h(Y;Z|W)h(Y;W|\emptyset)+h(Y;Z|W), then we would obtain an elemental witness. Also note that the witness is not unique; for example, we could have used the witness h(XY;Z|)+h(Y;W|Z)h(XY;Z|\emptyset)+h(Y;W|Z).)

h(XYZ)+h(YZW)\displaystyle h(XYZ)+h(YZW) =h(XY|)+h(YZ|)+h(ZW|)h(X;Z|Y)h(Y;ZW|)\displaystyle=h(XY|\emptyset)+h(YZ|\emptyset)+h(ZW|\emptyset)-h(X;Z|Y)-h(Y;ZW|\emptyset) (33)

The bound from (31) is bΔ,𝑵=(nXY+nYZ+nZW)/2=3n/2b_{\Delta,\bm{N}}=(n_{XY}+n_{YZ}+n_{ZW})/2=3n/2 where n=deflogNn\stackrel{{\scriptstyle\text{def}}}{{=}}\log N, or equivalently, BΔ,𝑵=N3/2B_{\Delta,\bm{N}}=N^{3/2}; this is the runtime budget the algorithm cannot exceed, in order to compute a feasible solution (A,B)(A,B) to the DDR.

[Figure 1 appears here. It shows a tree of sub-problems β0,β1,,βn\beta_{0},\beta_{1},\ldots,\beta_{n}, where each node is annotated with its current identity of the form (23) (in blue) and with the algorithmic operation performed at that node (in red). The operations along the depicted branch are: Partition TXYT_{XY} into TYiTX|YiT^{i}_{Y}\Join T^{i}_{X|Y}; Extend TX|YiT^{i}_{X|Y} to TX|YZiT^{i}_{X|YZ}; then either Join TX|YZiTYZT^{i}_{X|YZ}\Join T_{YZ} (reaching terminal leaves) or Reset; on the Reset branch, Extend TYiT^{i}_{Y} to TY|ZWiT^{i}_{Y|ZW} and Join TY|ZWiTZWT^{i}_{Y|ZW}\Join T_{ZW}, again reaching terminal leaves.]
Figure 1: An example illustrating the PANDA algorithm over the disjunctive rule (7). Here, n=deflogNn\stackrel{{\scriptstyle\text{def}}}{{=}}\log N and m=defn2m\stackrel{{\scriptstyle\text{def}}}{{=}}\frac{n}{2}. For each node in the sub-problem tree, the corresponding identity (23) is in blue while the corresponding algorithmic operation is in red.

Our algorithm operates by manipulating identity (33) while maintaining it as an invariant. Every step applies a modification to the identity and mirrors this modification with an algorithmic computation, creating one or more sub-problems. Each sub-problem is then equipped with the newly created identity and the newly created input data, and carries on with the next step on its own. The spawned sub-problems form a (sub-problem) tree. In the end, we gather results from the leaves of this tree to answer the original query.

In identity (33), the statistics terms h(XY|∅), h(YZ|∅), and h(ZW|∅) correspond to the input relations R, S, and U. When we modify the identity, we transform some subset of these statistics terms into different statistics terms; correspondingly, the input relations are manipulated to create new input relations that are associated with the new statistics terms. We next describe this idea in more detail. The steps of the algorithm are also shown in Figure 1.

In the first step, we grab an unconditional statistics term (from 𝒟) in (33). Let's say the term we grab is h(XY|∅) = h(XY). Since (33) is an identity, there must be another term that cancels out the symbolic variable h(XY); in this case, it is −h(X;Z|Y). We combine h(XY) with the canceling term to obtain new statistics terms:

h(XY¯|)h(X¯;Z|Y¯)\displaystyle h(\underline{XY}|\emptyset)-h(\underline{X};Z|\underline{Y}) =h(XY)h()h(XY)h(YZ)+h(XYZ)+h(Y)\displaystyle=h(XY)-h(\emptyset)-h(XY)-h(YZ)+h(XYZ)+h(Y)
=h(Y)h()+h(XYZ)h(YZ)\displaystyle=h(Y)-h(\emptyset)+h(XYZ)-h(YZ)
=h(Y|)+h(X|YZ)\displaystyle=h(Y|\emptyset)+h(X|YZ)

This rewriting-by-cancellation changed our identity (33) into a new one:

h(XYZ)+h(YZW)\displaystyle h(XYZ)+h(YZW) =h(Y|)+h(X|YZ)+h(YZ|)+h(ZW|)h(Y;ZW|)\displaystyle=h(Y|\emptyset)+h(X|YZ)+h(YZ|\emptyset)+h(ZW|\emptyset)-h(Y;ZW|\emptyset) (34)

The change amounts to replacing a statistics term with two new ones: h(XY|)h(Y|)+h(X|YZ)h(XY|\emptyset)\to h(Y|\emptyset)+h(X|YZ). We mimic this symbolic change with an actual relational change that shall alter the structure of the problem. The rough idea is to create inputs to the new problem by creating two new guards for the two new statistics terms:

  • The guard for h(Y|)h(Y|\emptyset) will be the table TY=defπY(R)T_{Y}\stackrel{{\scriptstyle\text{def}}}{{=}}\pi_{Y}(R).

  • The guard for h(X|YZ)h(X|YZ) is obtained by using R(X,Y)R(X,Y) to construct a “dictionary” (a lookup table) which supports the following operation: given a tuple (y,z)(y,z), return the list of all xx-values such that (x,y)R(x,y)\in R. (Note that this operation does not use the given zz-value.) We denote this dictionary as TX|YZT_{X|YZ}.
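To make the dictionary concrete, here is a minimal Python sketch (the data and function names are hypothetical, not from the paper): R(X,Y) is indexed by its y-values, and the lookup simply ignores the given z-value, as described above.

```python
from collections import defaultdict

def build_t_x_given_y(R):
    """Index the x-values of R(X, Y) by their y-value."""
    index = defaultdict(list)
    for x, y in R:
        index[y].append(x)
    return index

def t_x_given_yz(index, y, z):
    """The dictionary T_{X|YZ}: the given z-value is ignored on lookup."""
    return index.get(y, [])

# Hypothetical data: given (y, z), list all x with (x, y) in R.
R = [(1, 'a'), (2, 'a'), (3, 'b')]
idx = build_t_x_given_y(R)
assert t_x_given_yz(idx, 'a', 99) == [1, 2]   # the z-value plays no role
assert t_x_given_yz(idx, 'c', 0) == []
```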

The aim is to be able to join all four guards (corresponding to the four statistics terms on the RHS of (34)) to answer the same query as before. However, implementing this idea directly does not work, because the new degree constraint for T_Y is N_Y := |π_Y(R)| and for T_{X|YZ} is N_{X|YZ} := deg_R(X|Y), and these are too large: the product N_Y · N_{X|YZ} can be much greater than the original statistics N_XY = N = |R| (guarding the h(XY) that we started with). If this product is too large, we cannot apply induction to solve the sub-problem within our time budget of B_{Δ,𝑵} = N^{3/2}.

The product N_Y · N_{X|YZ} is too large because N_Y counts both high-degree and low-degree y-values, while the new statistics N_{X|YZ} is the maximum degree; thus, the product over-estimates the possible size of the join T_Y ⋈ T_{X|YZ}. To remedy the situation, we uniformize the problem by partitioning R into a logarithmic number of sub-relations R = R^1 ∪ ⋯ ∪ R^k, where each sub-relation contains tuples whose y-values have similar degrees. In effect, uniformization is an algorithmic technique for dealing with skew, a well-known reason why query plans blow up in practice.

Concretely, the partitioning is done as follows. Relation RiR^{i} contains all tuples (x,y)R(x,y)\in R where:

N2i\displaystyle\frac{N}{2^{i}}\leq degR(X|Y=y)<N2i1\displaystyle\text{\sf deg}_{R}(X|Y=y)<\frac{N}{2^{i-1}}

We set T^i_Y := π_Y(R^i), thus |T^i_Y| ≤ 2^i. We further partition T^i_Y into two equal-sized buckets (which we will continue to call "bucket i", with some abuse of notation). The two new guards T^i_Y and R^i have statistics N^i_Y and N^i_{X|YZ}, which (by our partition-into-two trick) satisfy the following:

NYi=def|TYi|\displaystyle N^{i}_{Y}\stackrel{{\scriptstyle\text{def}}}{{=}}|T_{Y}^{i}| 2i1\displaystyle\leq 2^{i-1}
NX|YZi=defdegRi(X|Y)\displaystyle N^{i}_{X|YZ}\stackrel{{\scriptstyle\text{def}}}{{=}}\text{\sf deg}_{R^{i}}(X|Y) N2i1\displaystyle\leq\frac{N}{2^{i-1}}
NYiNX|YZi\displaystyle N^{i}_{Y}\cdot N^{i}_{X|YZ} N\displaystyle\leq N

To be consistent with the notation used in describing PANDA, all guards will be called TT: in particular, the two new guards πY(Ri)\pi_{Y}(R^{i}) and RiR^{i} are called TYiT^{i}_{Y} and TX|YZiT^{i}_{X|YZ} respectively.
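This degree-based partitioning can be sketched in a few lines of Python (the relation contents and function names here are hypothetical, for illustration only): bucket i collects the tuples whose y-degree lies in [N/2^i, N/2^{i−1}), and the pre-halving guarantee |π_Y(R^i)| · deg_{R^i}(X|Y) ≤ 2N can be verified directly.

```python
from collections import Counter, defaultdict

def uniformize(R):
    """Partition R(X, Y) into buckets where bucket i holds the tuples
    whose y-value has degree in [N/2^i, N/2^(i-1)), with N = |R|."""
    N = len(R)
    deg = Counter(y for _, y in R)       # deg_R(X | Y = y)
    buckets = defaultdict(list)
    for x, y in R:
        q = -(-N // deg[y])              # ceil(N / deg[y])
        i = (q - 1).bit_length()         # smallest i with deg[y] >= N / 2^i
        buckets[i].append((x, y))
    return buckets

# Skewed hypothetical data: y = 'hub' has degree 6, the others degree 1.
R = [(x, 'hub') for x in range(6)] + [(7, 'a'), (8, 'b')]
buckets = uniformize(R)
N = len(R)
for Ri in buckets.values():
    degs = Counter(y for _, y in Ri)
    # Pre-halving guarantee: |pi_Y(R^i)| * deg_{R^i}(X|Y) <= 2N.
    assert len(degs) * max(degs.values()) <= 2 * N
```

Splitting each bucket into two equal halves, as described above, removes the remaining factor of 2.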

After partitioning, we now have O(log N) sub-problems, each equipped with identity (34). In the second step, we again grab an unconditional statistics term h(YZ|∅), and find a term from (34) to cancel it. (Note how "greedy" the algorithm is; this is one of the reasons for the large polylog factor it suffers.) This time, the cancellation is:

h(X|YZ¯)+h(YZ¯|)\displaystyle h(X|\underline{YZ})+h(\underline{YZ}|\emptyset) =h(XYZ)h(YZ)+h(YZ)h()=h(XYZ|),\displaystyle=h(XYZ)-h(YZ)+h(YZ)-h(\emptyset)=h(XYZ|\emptyset),

leading to a new identity

h(XYZ)+h(YZW)\displaystyle h(XYZ)+h(YZW) =h(Y|)+h(XYZ|)+h(ZW|)h(Y;ZW|)\displaystyle=h(Y|\emptyset)+h(XYZ|\emptyset)+h(ZW|\emptyset)-h(Y;ZW|\emptyset) (35)

The change h(X|YZ) + h(YZ|∅) → h(XYZ) is suggestive of a join: for the i-th subproblem, we join the corresponding guards, T^i_{X|YZ} ⋈ T_YZ. Recall that N^i_{X|YZ} ≤ N/2^{i−1}; hence, for i > ½ log N, performing this join stays within the budget of O(N^{3/2}). On the other hand, when i ≤ ½ log N, we need an additional idea: the reset lemma (Lemma 3.7), which reroutes the sub-problem away from the "heavy-hitter hotspot", i.e., the join over the high-degree Y-values. In addition to uniformization, the reset lemma is our second ingredient for dealing with skew.
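As a quick numeric check of this threshold (with a hypothetical N = 2^16), the join-size bound N · N/2^{i−1} fits within the budget N^{3/2} exactly when i − 1 ≥ ½ log N:

```python
import math

N = 2 ** 16                 # hypothetical input relation size
budget = N ** 1.5           # the runtime budget B = N^{3/2}

for i in range(1, int(math.log2(N)) + 1):
    # Bound on |T^i_{X|YZ} join T_YZ|: N_YZ * N^i_{X|YZ} = N * N / 2^(i-1).
    join_bound = N * (N / 2 ** (i - 1))
    # The join fits the budget exactly when i - 1 >= (1/2) log N.
    assert (join_bound <= budget) == (i - 1 >= math.log2(N) / 2)
```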

Concretely, for the iith sub-problem, we do the following:

  • When i>12logNi>\frac{1}{2}\log N, then PANDA performs a join TXYZi:=TX|YZiTYZT^{i}_{XYZ}:=T^{i}_{X|YZ}\Join T_{YZ}. The output size of this join is within the bound BΔ,𝑵=N3/2B_{\Delta,\bm{N}}=N^{3/2}. After computing this join, there will be no more sub-problems because we have computed a relation that fits the output schema, namely it corresponds to A(X,Y,Z)A(X,Y,Z). We add tuples from TXYZiT^{i}_{XYZ} to the output relation A(X,Y,Z)A(X,Y,Z).

  • In the case where i12logNi\leq\frac{1}{2}\log N, the algorithm attempts to perform the same join TXYZi:=TX|YZiTYZT^{i}_{XYZ}:=T^{i}_{X|YZ}\Join T_{YZ}, but its output size now exceeds the bound BΔ,𝑵=N3/2B_{\Delta,\bm{N}}=N^{3/2}. Therefore, the algorithm does not compute a guard TXYZiT^{i}_{XYZ} for h(XYZ|)h(XYZ|\emptyset), but instead uses the reset lemma to cancel out this term with the term h(XYZ|)h(XYZ|\emptyset) on the LHS. The new identity (for these sub-problems) is now

    h(YZW)\displaystyle h(YZW) =h(Y|)+h(ZW|)h(Y;ZW|)\displaystyle=h(Y|\emptyset)+h(ZW|\emptyset)-h(Y;ZW|\emptyset)

    We again grab an unconditional statistics term h(Y|)h(Y|\emptyset) and cancel it with h(Y¯;ZW|)h(\underline{Y};ZW|\emptyset):

    h(Y|)h(Y;ZW|)\displaystyle h(Y|\emptyset)-h(Y;ZW|\emptyset) =h(Y|ZW)\displaystyle=h(Y|ZW)

    The guard T^i_Y for h(Y|∅) has size |T^i_Y| ≤ N^{1/2}, thanks to the fact that |T^i_Y| ≤ 2^{i−1} and i ≤ ½ log N. The above step replaces h(Y|∅) with a new statistics term h(Y|ZW). Its guard is computed from T^i_Y by extending it into a dictionary T^i_{Y|ZW}: given (z,w), this dictionary returns all y-values for which (y) ∈ T^i_Y. In particular, this silly dictionary always returns the entire table T^i_Y, no matter what the given (z,w) are. (To be more precise, before PANDA extends T^i_Y into T^i_{Y|ZW}, it "uniformizes" T^i_Y by partitioning the table into log|T^i_Y| buckets based on the "degree" deg_{T^i_Y}(Y|∅); however, this partition is vacuous since only one bucket will be non-empty.) After the dictionary extension, the algorithm performs a join T^i_{YZW} := T^i_{Y|ZW} ⋈ T_ZW. This join is feasible since its output size is within the bound B_{Δ,𝑵} = N^{3/2}. The algorithm then reaches a terminal node, since the join result T^i_{YZW} is in the output schema; namely, it corresponds to B(Y,Z,W).

At the end of the algorithm, the tables TXYZiT^{i}_{XYZ} from all branches are unioned into the output relation A(X,Y,Z)A(X,Y,Z), while tables TYZWiT^{i}_{YZW} from all branches are unioned into the other output relation B(Y,Z,W)B(Y,Z,W). These two relations are the final output of the algorithm for the DDR (7).

Note that, for the particular query (7), there is a way to compute a feasible output without the polylog factor incurred above. The strategy above is only meant to illustrate the PANDA algorithm in its full generality.

5 Detailed Description of PANDA

This section describes the PANDA algorithm in detail. Section 5.1 presents the main data structures (called tables and dictionaries) used by the algorithm. Section 5.2 presents the algorithm itself, and Section 5.3 proves its correctness and analyzes its runtime.

5.1 Tables and Dictionaries

For a given statistics term δ=(𝒀|𝑿)\delta=(\bm{Y}|\bm{X}), PANDA uses two kinds of data structures: tables and dictionaries, denoted TδT_{\delta}. When 𝑿=\bm{X}=\emptyset, then we call it a table; otherwise, we call it a dictionary. There will be at most one table/dictionary TδT_{\delta} for a given δ\delta. As usual, we abbreviate 𝒀|\bm{Y}|\emptyset with just 𝒀\bm{Y}, and statistics N𝒀|,n𝒀|N_{\bm{Y}|\emptyset},n_{\bm{Y}|\emptyset} with just N𝒀N_{\bm{Y}} and n𝒀n_{\bm{Y}}. Specifically, a table is a set T𝒀Dom𝒀T_{\bm{Y}}\subseteq\textsf{Dom}^{\bm{Y}} of tuples over the 𝒀\bm{Y} variables, and a dictionary is a function T𝒀|𝑿:Dom𝑿2Dom𝒀T_{\bm{Y}|\bm{X}}:\textsf{Dom}^{\bm{X}}\rightarrow 2^{\textsf{Dom}^{\bm{Y}}} that gives a set of tuples over the 𝒀\bm{Y} variables given a specific binding 𝑿=𝒙\bm{X}=\bm{x} of the 𝑿\bm{X}-variables.

For a statistics term δ=(𝒀|𝑿)\delta=(\bm{Y}|\bm{X}), each table/dictionary TδT_{\delta} is associated with a statistics NδN_{\delta}. We say that TδT_{\delta} satisfies the statistics, and we write TδNδT_{\delta}\models N_{\delta}, iff |Tδ(𝒙)|Nδ|T_{\delta}(\bm{x})|\leq N_{\delta} for all 𝒙Dom𝑿\bm{x}\in\textsf{Dom}^{\bm{X}}. As a special case, a table T𝒀T_{\bm{Y}} satisfies a statistics N𝒀N_{\bm{Y}} iff |T𝒀|N𝒀|T_{\bm{Y}}|\leq N_{\bm{Y}}.

The algorithm performs the following operations on tables and dictionaries: join, projection, extension, construction, and partition. Each operation also determines the statistics of its result, as described below.

  • Join of a table with a dictionary, T𝑿𝒀:=T𝑿T𝒀|𝑿T_{\bm{X}\bm{Y}}:=T_{\bm{X}}\Join T_{\bm{Y}|\bm{X}}. This operation constructs a new table with statistics N𝑿𝒀=defN𝑿N𝒀|𝑿N_{\bm{X}\bm{Y}}\stackrel{{\scriptstyle\text{def}}}{{=}}N_{\bm{X}}N_{\bm{Y}|\bm{X}}. The join takes time O(N𝑿N𝒀|𝑿)O(N_{\bm{X}}N_{\bm{Y}|\bm{X}}).

  • Projection of a table, T𝑿:=π𝑿(T𝑿𝒀)T_{\bm{X}}:=\pi_{\bm{X}}(T_{\bm{X}\bm{Y}}). This operation takes a table T𝑿𝒀T_{\bm{X}\bm{Y}} over variables 𝑿𝒀\bm{X}\cup\bm{Y}, with statistics N𝑿𝒀N_{\bm{X}\bm{Y}}, and constructs a new table T𝑿T_{\bm{X}} with statistics N𝑿=defN𝑿𝒀N_{\bm{X}}\stackrel{{\scriptstyle\text{def}}}{{=}}N_{\bm{X}\bm{Y}}. The projection takes time O(N𝑿𝒀)O(N_{\bm{X}\bm{Y}}).

  • Extension of a dictionary T𝒀|𝑿T_{\bm{Y}|\bm{X}} into another dictionary T𝒀|𝑿𝒁T_{\bm{Y}|\bm{X}\bm{Z}}, where 𝒁\bm{Z} is disjoint from 𝑿𝒀\bm{X}\bm{Y}. This operation takes as input a dictionary T𝒀|𝑿T_{\bm{Y}|\bm{X}} and returns a new dictionary T𝒀|𝑿𝒁T_{\bm{Y}|\bm{X}\bm{Z}} defined as T𝒀|𝑿𝒁(𝒙,𝒛)=defT𝒀|𝑿(𝒙)T_{\bm{Y}|\bm{X}\bm{Z}}(\bm{x},\bm{z})\stackrel{{\scriptstyle\text{def}}}{{=}}T_{\bm{Y}|\bm{X}}(\bm{x}) for all (𝒙,𝒛)Dom𝑿𝒁\bm{(}\bm{x},\bm{z})\in\textsf{Dom}^{\bm{X}\bm{Z}}. Its statistics is N𝒀|𝑿𝒁=defN𝒀|𝑿N_{\bm{Y}|\bm{X}\bm{Z}}\stackrel{{\scriptstyle\text{def}}}{{=}}N_{\bm{Y}|\bm{X}}. This operation takes O(1)O(1) time, because the operation does not touch the data, but only constructs a new function that calls the existing function T𝒀|𝑿T_{\bm{Y}|\bm{X}}.

  • Construction. Given a table T𝑿𝒀T_{\bm{X}\bm{Y}} over variables 𝑿𝒀\bm{X}\cup\bm{Y} with statistics N𝑿𝒀N_{\bm{X}\bm{Y}}, construct a dictionary T𝒀|𝑿T_{\bm{Y}|\bm{X}}, with statistics N𝒀|𝑿=defdegT𝑿𝒀(𝒀|𝑿)N_{\bm{Y}|\bm{X}}\stackrel{{\scriptstyle\text{def}}}{{=}}\text{\sf deg}_{T_{\bm{X}\bm{Y}}}(\bm{Y}|\bm{X}). This operation takes O(N𝑿𝒀)O(N_{\bm{X}\bm{Y}}) time because it involves scanning the table T𝑿𝒀T_{\bm{X}\bm{Y}} and constructing an index. The operation returns a function that looks up in this index.

  • Partition. Given a table T𝑿𝒀T_{\bm{X}\bm{Y}} over variables 𝑿𝒀\bm{X}\cup\bm{Y} with statistics N𝑿𝒀N_{\bm{X}\bm{Y}}, partition T𝑿𝒀T_{\bm{X}\bm{Y}} into k:=2log|T𝑿𝒀|k:=2\lceil\log|T_{\bm{X}\bm{Y}}|\rceil many sub-tables T1,,TkT^{1},\dots,T^{k} satisfying the conditions stated in Lemma 5.1 below. This operation takes time O(N𝑿𝒀)O(N_{\bm{X}\bm{Y}}).
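As an illustration of these operations, the following is a minimal Python sketch (the representations and function names are our own, not part of PANDA): tables are sets of tuples and dictionaries are functions, mirroring the definitions above.

```python
from collections import defaultdict

# Tables are sets of (x, y) pairs; dictionaries are functions from an
# x-binding to a set of y-values.

def construct(T_XY):
    """Construction: from table T_XY, build the dictionary T_{Y|X}."""
    index = defaultdict(set)
    for x, y in T_XY:
        index[x].add(y)
    return lambda x: index.get(x, set())

def join(T_X, T_Y_given_X):
    """Join of a table with a dictionary: T_XY := T_X join T_{Y|X}."""
    return {(x, y) for x in T_X for y in T_Y_given_X(x)}

def project(T_XY):
    """Projection: T_X := pi_X(T_XY)."""
    return {x for x, _ in T_XY}

def extend(T_Y_given_X):
    """Extension: T_{Y|XZ}(x, z) := T_{Y|X}(x); O(1), the z is ignored."""
    return lambda x, z: T_Y_given_X(x)

# Round-trip check on hypothetical data: project, construct, then join.
R = {(1, 'a'), (1, 'b'), (2, 'c')}
T_Y_given_X = construct(R)
T_X = project(R)
assert join(T_X, T_Y_given_X) == R
assert extend(T_Y_given_X)(1, 'ignored z') == {'a', 'b'}
```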

Lemma 5.1.

Let T𝐗𝐘T_{\bm{X}\bm{Y}} be a table over variables 𝐗𝐘\bm{X}\cup\bm{Y} with statistics N𝐗𝐘N_{\bm{X}\bm{Y}}. Then, T𝐗𝐘T_{\bm{X}\bm{Y}} can be partitioned into at most k:=2log|T𝐗𝐘|k:=2\lceil\log|T_{\bm{X}\bm{Y}}|\rceil sub-tables T1,,TkT^{1},\dots,T^{k} satisfying

|π𝑿(Ti)|degTi(𝒀|𝑿)N𝑿𝒀,i[k].\displaystyle|\pi_{\bm{X}}(T^{i})|\cdot\deg_{T^{i}}(\bm{Y}|\bm{X})\leq N_{\bm{X}\bm{Y}},\quad\quad\forall i\in[k]. (36)
Proof.

To obtain the sub-tables TiT^{i}, observe that the number of tuples 𝒙π𝑿(T𝑿𝒀)\bm{x}\in\pi_{\bm{X}}(T_{\bm{X}\bm{Y}}) with log\log-degree in the interval [i,i+1)[i,i+1) is at most |T𝑿𝒀|/2i2n𝑿𝒀i|T_{\bm{X}\bm{Y}}|/2^{i}\leq 2^{n_{\bm{X}\bm{Y}}-i}. Hence, if we partition T𝑿𝒀T_{\bm{X}\bm{Y}} based on which of the buckets [i,i+1)[i,i+1) the log\log-degree falls into, we will almost attain the required inequality, just off by a factor of 22:

|π𝑿(Ti)|degTi(𝒀|𝑿)2(n𝑿𝒀i)2(i+1)=2N𝑿𝒀.|\pi_{\bm{X}}(T^{i})|\cdot\deg_{T^{i}}(\bm{Y}|\bm{X})\leq 2^{(n_{\bm{X}\bm{Y}}-i)}\cdot 2^{(i+1)}=2N_{\bm{X}\bm{Y}}.

To get rid of the factor of 22, we partition each TiT^{i} into two tables whose projections onto 𝑿\bm{X} are equally sized. Overall, we need k:=2log|T𝑿𝒀|k:=2\lceil\log|T_{\bm{X}\bm{Y}}|\rceil partitions. ∎
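The counting step in this proof (at most |T_XY|/2^i x-values have log-degree in the interval [i, i+1)) can be checked empirically; the sketch below uses hypothetical random data.

```python
import random
from collections import Counter

random.seed(0)
# A hypothetical random table T(X, Y).
T = {(random.randrange(50), random.randrange(200)) for _ in range(400)}
deg = Counter(x for x, _ in T)       # deg_T(Y | X = x)

N = len(T)
for i in range(N.bit_length()):
    # Number of x-values whose degree falls in [2^i, 2^(i+1)).
    cnt = sum(1 for d in deg.values() if 2 ** i <= d < 2 ** (i + 1))
    # The counting step in the proof: at most |T| / 2^i such x-values.
    assert cnt <= N / 2 ** i
```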

5.2 Algorithm

This section describes PANDA and proves Theorem 4.1. The main input to PANDA contains an input instance Σin\Sigma_{\text{\sf in}}, an output schema Σout\Sigma_{\text{\sf out}}\neq\emptyset, and statistics (Δ,𝑵)(\Delta,\bm{N}) satisfied by the input instance, i.e. Σin(Δ,𝑵)\Sigma_{\text{\sf in}}\models(\Delta,\bm{N}) as defined in Section 2.5. In addition, we are also given the coefficients 𝒘\bm{w} and 𝝀\bm{\lambda} for which (14) is a Shannon inequality. From Corollary 3.5, we can equivalently assume that PANDA was given the multisets 𝒵,𝒟,,𝒮\mathcal{Z},\mathcal{D},\mathcal{M},\mathcal{S} for which identity (23) holds. Both the algorithm and the analysis will be described in terms of these multisets.

We shall show that a feasible output Σout\Sigma_{\text{\sf out}} to the DDR (5) can be computed in time O~(BΔ,𝑵)\tilde{O}(B_{\Delta,\bm{N}}), defined in (31). In terms of these multisets, the bounds defined in (31) have the following equivalent expressions:

BΔ,𝑵\displaystyle B_{\Delta,\bm{N}} =(δ𝒟Nδ)1/|𝒵|\displaystyle=\left(\prod_{\delta\in\mathcal{D}}N_{\delta}\right)^{1/|\mathcal{Z}|} bΔ,𝒏=logBΔ,𝑵\displaystyle b_{\Delta,\bm{n}}=\log B_{\Delta,\bm{N}} =1|𝒵|δ𝒟nδ\displaystyle=\frac{1}{|\mathcal{Z}|}\sum_{\delta\in\mathcal{D}}n_{\delta} (37)

For each statistics term δ=(𝒀|𝑿)𝒟\delta=(\bm{Y}|\bm{X})\in\mathcal{D}, there is a guard which is a table/dictionary instance T𝒀|𝑿T_{\bm{Y}|\bm{X}} with statistics N𝒀|𝑿N_{\bm{Y}|\bm{X}}, i.e. T𝒀|𝑿N𝒀|𝑿T_{\bm{Y}|\bm{X}}\models N_{\bm{Y}|\bm{X}} as defined in Section 5.1. Creating the initial guards needs to be done offline, because the cost of creating them may exceed BΔ,𝑵B_{\Delta,\bm{N}}. We do not include this in the cost of the algorithm. We adopt the convention that the lower case letters nδ,bΔ,𝒏n_{\delta},b_{\Delta,\bm{n}} represent the logarithms of the upper case Nδ,BΔ,𝑵N_{\delta},B_{\Delta,\bm{N}} respectively.
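As a sanity check, the expressions in (37) are straightforward to compute; the sketch below (with hypothetical inputs) reproduces the budget B_{Δ,𝑵} = N^{3/2} from the running example (7), where 𝒟 = {XY, YZ, ZW} and |𝒵| = 2.

```python
import math

def budget(stats, num_outputs):
    """Compute b = (1/|Z|) * sum_delta n_delta and B = 2^b, as in (37).
    `stats` maps each statistics term in D to its value N_delta."""
    b = sum(math.log2(v) for v in stats.values()) / num_outputs
    return b, 2 ** b

# The running example (7): D = {XY, YZ, ZW}, |Z| = 2 (outputs A and B).
N = 2 ** 20
b, B = budget({'XY': N, 'YZ': N, 'ZW': N}, num_outputs=2)
assert b == 1.5 * 20      # b = 3n/2 with n = log N
assert B == N ** 1.5      # B = N^{3/2}
```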

Summarized in Algorithm 1, PANDA works as follows. Starting from an initial node, it grows a tree of sub-problems. New sub-problems are spawned from a “non-terminal” leaf node of the tree. The process stops when every leaf node is terminal, a concept we shall define shortly. After the tree growing stops, a feasible output of the problem is gathered from the leaves of the tree.

Algorithm 1 PANDA
1:Initialize the sub-problem tree with a single node with input parameters (𝒵,𝒟,,𝒮,𝑻,𝒏)(\mathcal{Z},\mathcal{D},\mathcal{M},\mathcal{S},\bm{T},\bm{n})
2:
3:while \exists a non-terminal leaf β\beta do \triangleright Tree-growing Loop
4:     Let (𝒵,𝒟,,𝒮,𝑻,𝒏)(\mathcal{Z},\mathcal{D},\mathcal{M},\mathcal{S},\bm{T},\bm{n}) be the parameters of this leaf
5:     Pick δ=(𝑾|)𝒟\delta=(\bm{W}|\emptyset)\in\mathcal{D} arbitrarily \triangleright See Prop 5.3 for why such δ\delta exists
6:     if \exists (𝒀|𝑾)𝒟(\bm{Y}|\bm{W})\in\mathcal{D} then \triangleright Case 1 of Lemma 3.7: apply (25)
7:         if n𝑾+n𝒀|𝑾bΔ,𝒏n_{\bm{W}}+n_{\bm{Y}|\bm{W}}\leq b_{\Delta,\bm{n}} then
8:              𝒏=𝒏{n𝒀𝑾}{n𝑾+n𝒀|𝑾}\bm{n}^{\prime}=\bm{n}\cup\{n_{\bm{Y}\bm{W}}\}-\{n_{\bm{W}}+n_{\bm{Y}|\bm{W}}\}, where n𝒀𝑾:=n𝑾+n𝒀|𝑾n_{\bm{Y}\bm{W}}:=n_{\bm{W}}+n_{\bm{Y}|\bm{W}}
9:              𝒟=𝒟{(𝒀𝑾|)}{(𝑾|),(𝒀|𝑾)}\mathcal{D}^{\prime}=\mathcal{D}\cup\{(\bm{Y}\bm{W}|\emptyset)\}-\{(\bm{W}|\emptyset),(\bm{Y}|\bm{W})\}
10:              𝑻=𝑻{T𝒀𝑾}{T𝑾,T𝒀|𝑾}\bm{T}^{\prime}=\bm{T}\cup\{T_{\bm{Y}\bm{W}}\}-\{T_{\bm{W}},T_{\bm{Y}|\bm{W}}\}, where T𝒀𝑾:=T𝑾T𝒀|𝑾T_{\bm{Y}\bm{W}}:=T_{\bm{W}}\Join T_{\bm{Y}|\bm{W}}, which guards (𝒀𝑾|)(\bm{Y}\bm{W}|\emptyset)
11:              Create a child of β\beta, with parameters (𝒵,𝒟,,𝒮,𝑻,𝒏)(\mathcal{Z},\mathcal{D}^{\prime},\mathcal{M},\mathcal{S},\bm{T}^{\prime},\bm{n}^{\prime})
12:         else
13:              𝒟¯=𝒟{(𝒀𝑾|)}{(𝑾|),(𝒀|𝑾)}\overline{\mathcal{D}}=\mathcal{D}\cup\{(\bm{Y}\bm{W}|\emptyset)\}-\{(\bm{W}|\emptyset),(\bm{Y}|\bm{W})\}
14:              Apply Lemma 3.7 to (𝒵,𝒟¯,,𝒮)(\mathcal{Z},\overline{\mathcal{D}},\mathcal{M},\mathcal{S}) to obtain (𝒵,𝒟,,𝒮)(\mathcal{Z}^{\prime},\mathcal{D}^{\prime},\mathcal{M}^{\prime},\mathcal{S}^{\prime}) where 𝒟𝒟¯{(𝒀𝑾|)}\mathcal{D}^{\prime}\subseteq\overline{\mathcal{D}}-\{(\bm{Y}\bm{W}|\emptyset)\}
15:              Let 𝑻,𝒏\bm{T}^{\prime},\bm{n}^{\prime} be 𝑻,𝒏\bm{T},\bm{n} with only the terms corresponding to 𝒟\mathcal{D}^{\prime} retained
16:              Create a child of β\beta, with parameters (𝒵,𝒟,,𝒮,𝑻,𝒏)(\mathcal{Z}^{\prime},\mathcal{D}^{\prime},\mathcal{M}^{\prime},\mathcal{S}^{\prime},\bm{T}^{\prime},\bm{n}^{\prime})
17:         end if
18:     else if (𝒀|𝑿)\exists(\bm{Y}|\bm{X})\in\mathcal{M} where 𝑾=𝑿𝒀\bm{W}=\bm{X}\bm{Y} then \triangleright Case 2 of Lemma 3.7: apply (26)
19:         𝒏=𝒏{n𝑿}{n𝑾}\bm{n}^{\prime}=\bm{n}\cup\{n_{\bm{X}}\}-\{n_{\bm{W}}\} where n𝑿:=n𝑾n_{\bm{X}}:=n_{\bm{W}}
20:         ={(𝒀|𝑿)}\mathcal{M}^{\prime}=\mathcal{M}-\{(\bm{Y}|\bm{X})\}
21:         𝒟=𝒟{(𝑿|)}{(𝑾|)}\mathcal{D}^{\prime}=\mathcal{D}\cup\{(\bm{X}|\emptyset)\}-\{(\bm{W}|\emptyset)\}
22:         𝑻=𝑻{T𝑿}{T𝑾}\bm{T}^{\prime}=\bm{T}\cup\{T_{\bm{X}}\}-\{T_{\bm{W}}\} where T𝑿:=π𝑿(T𝑾)T_{\bm{X}}:=\pi_{\bm{X}}(T_{\bm{W}}), which guards (𝑿|)(\bm{X}|\emptyset)
23:         Create a child of β\beta, with parameters (𝒵,𝒟,,𝒮,𝑻,𝒏)(\mathcal{Z},\mathcal{D}^{\prime},\mathcal{M}^{\prime},\mathcal{S},\bm{T}^{\prime},\bm{n}^{\prime})
24:     else if (𝒀;𝒁|𝑿)𝒮\exists(\bm{Y};\bm{Z}|\bm{X})\in\mathcal{S} with 𝑾=𝑿𝒀\bm{W}=\bm{X}\bm{Y} then \triangleright Case 3 of Lemma 3.7: apply (38) instead of (27)
25:         Partition T𝑾=T1T2TkT_{\bm{W}}=T^{1}\cup T^{2}\cup\cdots\cup T^{k} with k:=O(log|T𝑾|)k:=O(\log|T_{\bm{W}}|), using Lemma 5.1
26:         for i1i\leftarrow 1 to kk do
27:              𝒏i=𝒏{n𝑿i,n𝒀|𝑿𝒁i}{n𝑾}\bm{n}^{i}=\bm{n}\cup\{n^{i}_{\bm{X}},n^{i}_{\bm{Y}|\bm{X}\bm{Z}}\}-\{n_{\bm{W}}\} where n𝑿i:=log|π𝑿(Ti)|n^{i}_{\bm{X}}:=\log|\pi_{\bm{X}}(T^{i})| and n𝒀|𝑿𝒁i:=logdegTi(𝒀|𝑿)n^{i}_{\bm{Y}|\bm{X}\bm{Z}}:=\log\deg_{T^{i}}(\bm{Y}|\bm{X})
28:              𝒟i=𝒟{(𝑿|),(𝒀|𝑿𝒁)}{(𝑾|)}\mathcal{D}^{i}=\mathcal{D}\cup\{(\bm{X}|\emptyset),(\bm{Y}|\bm{X}\bm{Z})\}-\{(\bm{W}|\emptyset)\}
29:              𝒮i=𝒮{(𝒀;𝒁|𝑿)}\mathcal{S}^{i}=\mathcal{S}-\{(\bm{Y};\bm{Z}|\bm{X})\}
30:              𝑻i=𝑻{T𝑿i,T𝒀|𝑿𝒁i}{T𝑾}\bm{T}^{i}=\bm{T}\cup\{T^{i}_{\bm{X}},T^{i}_{\bm{Y}|\bm{X}\bm{Z}}\}-\{T_{\bm{W}}\}, where T𝑿i:=π𝑿(Ti)T^{i}_{\bm{X}}:=\pi_{\bm{X}}(T^{i}) and T𝒀|𝑿𝒁iT^{i}_{\bm{Y}|\bm{X}\bm{Z}} is an extension of T𝒀|𝑿iT^{i}_{\bm{Y}|\bm{X}}
31:              Create the iith-child of β\beta, with parameters (𝒵,𝒟i,,𝒮i,𝑻i,𝒏i)(\mathcal{Z},\mathcal{D}^{i},\mathcal{M},\mathcal{S}^{i},\bm{T}^{i},\bm{n}^{i})
32:         end for
33:     end if
34:end while
35:
36:Q(𝒁)Q(\bm{Z})\leftarrow\emptyset for all Q(𝒁)ΣoutQ(\bm{Z})\in\Sigma_{\text{\sf out}}
37:for each terminal leaf β\beta do \triangleright Result-gathering phase
38:     Let (𝒵,𝒟,,𝒮,𝑻,𝒏)(\mathcal{Z},\mathcal{D},\mathcal{M},\mathcal{S},\bm{T},\bm{n}) be the parameters of this leaf
39:     Pick δ=(𝒁|)𝒟\delta=(\bm{Z}|\emptyset)\in\mathcal{D} such that 𝒁Σout\bm{Z}\in\Sigma_{\text{\sf out}}\triangleright By Definition 5.2
40:     Q(𝒁)Q(𝒁)T𝒁Q(\bm{Z})\leftarrow Q(\bm{Z})\cup T_{\bm{Z}}
41:end for

Every node is associated with a sub-problem parameterized by a tuple (𝒵, 𝒟, ℳ, 𝒮, 𝑻, 𝒏), where

  • The multisets 𝒵,𝒟,\mathcal{Z},\mathcal{D},\mathcal{M}, and 𝒮\mathcal{S} are over the base sets Σout\Sigma_{\text{\sf out}}, Δ\Delta, MON, and SUB, respectively. These multisets will maintain the invariant that identity (23) holds, as we shall show in the next section.

  • 𝒏 is a collection of non-negative real numbers (logs of statistics), one statistics n_δ for each δ ∈ 𝒟. (Recall that n_δ := log N_δ.)

  • 𝑻\bm{T} is a collection of dictionaries, one dictionary TδT_{\delta} for each δ𝒟\delta\in\mathcal{D}; these shall be the guards for Nδ=2nδN_{\delta}=2^{n_{\delta}}. In particular, TδNδT_{\delta}\models N_{\delta} for each δ𝒟\delta\in\mathcal{D}.

Definition 5.2 (Terminal leaf).

A leaf node of the tree is "terminal" (it will not spawn new sub-problems) if its parameters (𝒵, 𝒟, ℳ, 𝒮, 𝑻, 𝒏) are such that there is an unconditional statistics term (𝒁|∅) ∈ 𝒟 for which 𝒁 ∈ 𝒵.

We next briefly explain in words the key steps of Algorithm 1. A proof that the algorithm is correct and an analysis of its runtime to show Theorem 4.1 are described in Section 5.3. The main loop looks for a non-terminal leaf node β\beta of the sub-problem tree. If there is no such leaf node, then the code-block starting at line 36 gathers the results from the parameters of the (terminal) leaf nodes to construct the final output Σout\Sigma_{\text{\sf out}}.

If there is a non-terminal leaf node β, then we pick an arbitrary unconditional statistics term δ = (𝑾|∅) ∈ 𝒟 (line 5); we shall show in Section 5.3 that such an unconditional δ exists. We then consider cases in parallel with the cases in the proof of Lemma 3.7, except that Case 3 will be handled differently. The major insight of our work is that we can design an algorithm where the algorithmic steps mirror the Shannon-inequality inference steps:

Case 1: Join or Reset

There exists a statistics term (𝒀|𝑾)𝒟(\bm{Y}|\bm{W})\in\mathcal{D} (line 6). If the statistics are such that the join-size of T𝑾T𝒀|𝑾T_{\bm{W}}\Join T_{\bm{Y}|\bm{W}} is smaller than the budget BΔ,𝑵B_{\Delta,\bm{N}}, checked at line 7, then we compute the join result T𝒀𝑾T_{\bm{Y}\bm{W}} as shown, and let it guard the new statistics term (𝒀𝑾|)(\bm{Y}\bm{W}|\emptyset). This case corresponds precisely to applying Eq. (25), and the new multiset 𝒟\mathcal{D}^{\prime} defined in line 9 reflects that. After joining T𝑾T_{\bm{W}} and T𝒀|𝑾T_{\bm{Y}|\bm{W}} to form T𝒀𝑾T_{\bm{Y}\bm{W}}, we remove the old dictionaries from 𝑻\bm{T} and add the new dictionary to 𝑻\bm{T}. Similarly, the statistics in 𝒏\bm{n} are replaced in the same way.

On the other hand, if the join T𝑾T𝒀|𝑾T_{\bm{W}}\Join T_{\bm{Y}|\bm{W}} is larger than the budget, then we perform a reset step. The Shannon inequality reasoning is as follows. Initially 𝒁𝒵h(𝒁)δ𝒟h(δ)\sum_{\bm{Z}\in\mathcal{Z}}h(\bm{Z})\leq\sum_{\delta\in\mathcal{D}}h(\delta) holds, with witness terms \mathcal{M} and 𝒮\mathcal{S}. After applying the change in Eq. (25), inequality 𝒁𝒵h(𝒁)δ𝒟¯h(δ)\sum_{\bm{Z}\in\mathcal{Z}}h(\bm{Z})\leq\sum_{\delta\in\overline{\mathcal{D}}}h(\delta) is still a Shannon inequality with the same witness. (Note that 𝒟¯\overline{\mathcal{D}} is defined in line 13). Now, we apply Lemma 3.7 to drop δ0=(𝒀𝑾|)\delta_{0}=(\bm{Y}\bm{W}|\emptyset) from 𝒟¯\overline{\mathcal{D}}, obtaining new parameters (𝒵,𝒟,,𝒮)(\mathcal{Z}^{\prime},\mathcal{D}^{\prime},\mathcal{M}^{\prime},\mathcal{S}^{\prime}) to make progress on our algorithm. We remove from 𝑻\bm{T} and 𝒏\bm{n} the terms that were dropped from 𝒟\mathcal{D}.

Case 2: Projection

There exists a monotonicity term (𝒀|𝑿)(\bm{Y}|\bm{X})\in\mathcal{M} with 𝑾=𝑿𝒀\bm{W}=\bm{X}\bm{Y} (line 18). Here, the Shannon inequality is modified by applying the monotonicity-statistics cancellation in Eq. (26). This change is reflected in \mathcal{M}^{\prime} and 𝒟\mathcal{D}^{\prime}. The new statistics term (𝑿|)𝒟(\bm{X}|\emptyset)\in\mathcal{D}^{\prime} has a guard which is the projection T𝑿T_{\bm{X}}.

Case 3: Partition

There exists a submodularity measure (𝒀;𝒁|𝑿)𝒮(\bm{Y};\bm{Z}|\bm{X})\in\mathcal{S} with 𝑾=𝑿𝒀\bm{W}=\bm{X}\bm{Y} (line 24). In this case, instead of applying identity (27) from Lemma 3.7, we apply the following identity to maintain the Shannon inequality:

h(𝑾)h(𝒀;𝒁|𝑿)\displaystyle h(\bm{W})-h(\bm{Y};\bm{Z}|\bm{X}) =h(𝑾)h(𝑿𝒀)h(𝑿𝒁)+h(𝑿)+h(𝑿𝒀𝒁)\displaystyle=h(\bm{W})-h(\bm{X}\bm{Y})-h(\bm{X}\bm{Z})+h(\bm{X})+h(\bm{X}\bm{Y}\bm{Z})
=h(𝑿)+h(𝒀|𝑿𝒁)\displaystyle=h(\bm{X})+h(\bm{Y}|\bm{X}\bm{Z}) (38)

Identity (38) says that we have two new statistics terms, h(𝑿|)h(\bm{X}|\emptyset) and h(𝒀|𝑿𝒁)h(\bm{Y}|\bm{X}\bm{Z}), which need guards; this is where we will need to create a logarithmic number of sub-problems in order to create the guards of the right statistics.

In particular, the algorithm performs a “partitioning step”: it uniformizes T𝑾(=T𝑿𝒀)T_{\bm{W}}(=T_{\bm{X}\bm{Y}}) by applying Lemma 5.1. Each of the sub-problems has the corresponding statistics as defined in the for-loop starting at line 26. The table TiT^{i} is on variables 𝑾=𝑿𝒀\bm{W}=\bm{X}\bm{Y}. Thus, in order to have it guard the new term (𝒀|𝑿𝒁)(\bm{Y}|\bm{X}\bm{Z}), we will need to apply a dictionary extension to it (see Section 5.1). Note that (36) guarantees that n𝑿i+n𝒀|𝑿𝒁in𝑿𝒀=n𝑾n^{i}_{\bm{X}}+n^{i}_{\bm{Y}|\bm{X}\bm{Z}}\leq n_{\bm{X}\bm{Y}}=n_{\bm{W}}.

5.3 Proof of correctness and runtime analysis

In this section, we prove that the algorithm is correct and analyze its runtime, completing the proof of Theorem 4.1. Both tasks are accomplished by showing that PANDA maintains the invariants described below.

Recall that every sub-problem in the tree is parameterized by a tuple (𝒵,𝒟,,𝒮,𝑻,𝒏)(\mathcal{Z},\mathcal{D},\mathcal{M},\mathcal{S},\bm{T},\bm{n}). For any sub-problem β\beta that is parameterized by (𝒵,𝒟,,𝒮,𝑻,𝒏)(\mathcal{Z},\mathcal{D},\mathcal{M},\mathcal{S},\bm{T},\bm{n}), define

Jβ:=Dom𝑽δ𝒟TδJ_{\beta}:=\textsf{Dom}^{\bm{V}}\Join{{{\Join}}}_{\delta\in\mathcal{D}}T_{\delta}

to be the join of all its dictionary guards. At time tt in the algorithm, let t\mathcal{L}_{t} denote the set of current leaf nodes in the sub-problem tree. While spawning new sub-problems, (we will show that) PANDA maintains the following invariants over all tuples (𝒵,𝒟,,𝒮,𝑻,𝒏)(\mathcal{Z},\mathcal{D},\mathcal{M},\mathcal{S},\bm{T},\bm{n}): (Recall that Nδ=def2nδN_{\delta}\stackrel{{\scriptstyle\text{def}}}{{=}}2^{n_{\delta}} for all δ𝒟\delta\in\mathcal{D}.)

Shannon inequality: Identity (23) holds w.r.t. the tuple (𝒵,𝒟,,𝒮)(\mathcal{Z},\mathcal{D},\mathcal{M},\mathcal{S}) (39)
Non-empty output: 𝒵\displaystyle\mathcal{Z} \displaystyle\neq\emptyset (40)
Lossless join: Σin\displaystyle{{{\Join}}}\Sigma_{\text{\sf in}} βtJβ\displaystyle\subseteq\bigcup_{\beta\in\mathcal{L}_{t}}J_{\beta} t\displaystyle\forall t (41)
Small tables: n𝒀\displaystyle n_{\bm{Y}} bΔ,𝒏,\displaystyle\leq b_{\Delta,\bm{n}}, (𝒀|)𝒟\displaystyle\forall(\bm{Y}|\emptyset)\in\mathcal{D} (42)
Guarded statistics: Tδ\displaystyle T_{\delta} Nδ,\displaystyle\models N_{\delta}, δ𝒟\displaystyle\forall\delta\in\mathcal{D} (43)
Upper bound: 1|𝒵|δ𝒟nδ\displaystyle\frac{1}{|\mathcal{Z}|}\sum_{\delta\in\mathcal{D}}n_{\delta} bΔ,𝒏\displaystyle\leq b_{\Delta,\bm{n}} (44)

We start by showing that, if the invariants hold, then the answer is correct with the desired runtime.

Proposition 5.3.

Let N := max_{δ∈Δ} N_δ. Suppose the invariants above are satisfied. Then PANDA returns a feasible solution to the input disjunctive datalog rule in time O(B_{Δ,𝑵} · (log N)^{|𝒮|}), where 𝒮 is the input multiset of submodularity measures.

Proof.

For a given non-terminal leaf β, the number of unconditional statistics terms in 𝒟 is at least |𝒵|; this follows from Proposition 3.10 and invariant (39). Invariant (40) ensures that |𝒵| ≥ 1, and thus the unconditional statistics term δ needed in line 5 of the algorithm exists. Thanks to invariant (39), at least one of the three cases considered in the main loop must hold. Invariant (43) guarantees that we can perform the join or the partition steps using the corresponding tables/dictionaries whenever the algorithm needs them. In summary, the body of the main loop proceeds as described without getting stuck.

For every node in the sub-problem tree that the algorithm creates, with parameter tuple (𝒵,𝒟,,𝒮,𝑻,𝒏)(\mathcal{Z},\mathcal{D},\mathcal{M},\mathcal{S},\bm{T},\bm{n}), define the “potential” quantity |𝒟|+||+2|𝒮||\mathcal{D}|+|\mathcal{M}|+2|\mathcal{S}| as in the proof of Lemma 3.7. Then, similar to what happens in the lemma’s proof, the potential of the child is at least 11 less than the potential of the parent.141414Unlike Case 3 of Lemma 3.7 where applying Eq. (27) reduces |𝒮||\mathcal{S}| by one and increases |||\mathcal{M}| by one, in Case 3 of the algorithm, we apply Eq. (38) which reduces |𝒮||\mathcal{S}| by one and increases |𝒟||\mathcal{D}| by one.151515Note that the else branch of Case 1 (line 12) also reduces the potential by at least one, and applying the reset lemma in line 14 never increases the potential. Thus, the depth of the sub-problem tree is bounded by the original potential |𝒟|+||+2|𝒮||\mathcal{D}|+|\mathcal{M}|+2|\mathcal{S}|. This proves that the algorithm terminates.

The time spent within each node β\beta of the tree is dominated by either the join step T𝒀𝑾:=T𝑾T𝒀|𝑾T_{\bm{Y}\bm{W}}:=T_{\bm{W}}\Join T_{\bm{Y}|\bm{W}} in Case 1, the projection T𝑿:=π𝑿(T𝑾)T_{\bm{X}}:=\pi_{\bm{X}}(T_{\bm{W}}) in Case 2, or the partition step in Case 3 whose cost is O(|T𝑾|)O(|T_{\bm{W}}|). In Case 2 and Case 3, the cost is bounded by BΔ,𝑵B_{\Delta,\bm{N}}, thanks to invariant (42). In Case 1, the join is only computed if n𝑾+n𝒀|𝑾bΔ,𝒏n_{\bm{W}}+n_{\bm{Y}|\bm{W}}\leq b_{\Delta,\bm{n}}, thus it is also within the budget of BΔ,𝑵B_{\Delta,\bm{N}}. Overall, the total time spent on all the nodes of the tree is bounded by BΔ,𝑵B_{\Delta,\bm{N}} times the number of nodes. As we observed above, the depth of the tree is at most |𝒟|+||+2|𝒮||\mathcal{D}|+|\mathcal{M}|+2|\mathcal{S}|. Every node has a fanout of either 11, or k=O(logBΔ,𝑵)=O(logN)k=O(\log B_{\Delta,\bm{N}})=O(\log N). Every time the fanout is more than 11, the number of submodularity measures in 𝒮\mathcal{S} is reduced by 11. Thus, the total runtime in data complexity is O(BΔ,𝑵(logN)|𝒮|)O(B_{\Delta,\bm{N}}\cdot(\log N)^{|\mathcal{S}|}).

Last but not least, we prove that the answer computed starting at line 36 is correct, which means, for every tuple 𝒕Σin\bm{t}\in{{{\Join}}}\Sigma_{\text{\sf in}}, there must exist Q(𝒁)ΣoutQ(\bm{Z})\in\Sigma_{\text{\sf out}} for which π𝒁(𝒕)Q(𝒁)\pi_{\bm{Z}}(\bm{t})\in Q(\bm{Z}); recall Definition 2.2. This property holds thanks to invariant (41). Let \mathcal{L} denote the set of all final leaf nodes in the sub-problem tree. Then, every 𝒕Σin\bm{t}\in{{{\Join}}}\Sigma_{\text{\sf in}} belongs to JβJ_{\beta} for some β\beta\in\mathcal{L}. Since β\beta is a terminal leaf node, there must exist δ=(𝒁|)𝒟\delta=(\bm{Z}|\emptyset)\in\mathcal{D} such that Q(𝒁)ΣoutQ(\bm{Z})\in\Sigma_{\text{\sf out}}, and so π𝒁(𝒕)Q(𝒁)\pi_{\bm{Z}}(\bm{t})\in Q(\bm{Z}) thanks to line 40 of Algorithm 1. ∎

Proposition 5.4.

Algorithm 1 maintains invariants (39) to (44) throughout its execution.

Proof.

We verify the invariants (39) to (44) one by one, by induction. For the input, only invariant (42) may fail, because some input tables may be larger than the desired bound BΔ,𝑵B_{\Delta,\bm{N}}. We deal with this situation by repeatedly applying the reset lemma, as was done in the else branch of Case 1 (line 12), to drop input tables that are too large. After this pre-processing step, all invariants hold initially; we now verify that every step of the algorithm preserves them.

Invariant (39) is guaranteed by constructing the multisets (𝒵,𝒟,,𝒮)(\mathcal{Z}^{\prime},\mathcal{D}^{\prime},\mathcal{M}^{\prime},\mathcal{S}^{\prime}) for the sub-problems while keeping (23) intact. This is easy to verify in all three cases as we apply Eq. (25),  (26), (38) or Lemma 3.7.

For invariant (40), the only place where 𝒵\mathcal{Z} is changed is at line 16. This happens when n𝑾+n𝒀|𝑾>bΔ,𝒏n_{\bm{W}}+n_{\bm{Y}|\bm{W}}>b_{\Delta,\bm{n}}. Since inequality (44) holds (at the previous iteration), we have

bΔ,𝒏1|𝒵|δ𝒟nδ1|𝒵|(n𝑾+n𝒀|𝑾)>1|𝒵|bΔ,𝒏b_{\Delta,\bm{n}}\geq\frac{1}{|\mathcal{Z}|}\sum_{\delta\in\mathcal{D}}n_{\delta}\geq\frac{1}{|\mathcal{Z}|}(n_{\bm{W}}+n_{\bm{Y}|\bm{W}})>\frac{1}{|\mathcal{Z}|}\cdot b_{\Delta,\bm{n}}

It follows that |𝒵|2|\mathcal{Z}|\geq 2. Thus, when applying Lemma 3.7, we end up with |𝒵||𝒵|11|\mathcal{Z}^{\prime}|\geq|\mathcal{Z}|-1\geq 1, which means 𝒵\mathcal{Z}^{\prime} is not empty.

Invariants (41), (42), (43), and (44) can be verified one by one by simple case analysis. Below, we highlight the most prominent cases:

  • Case 1 of the algorithm maintains invariant (42) specifically because we only do the join T𝒀𝑾:=T𝑾T𝒀|𝑾T_{\bm{Y}\bm{W}}:=T_{\bm{W}}\Join T_{\bm{Y}|\bm{W}} when n𝒀𝑾:=n𝑾+n𝒀|𝑾bΔ,𝒏n_{\bm{Y}\bm{W}}:=n_{\bm{W}}+n_{\bm{Y}|\bm{W}}\leq b_{\Delta,\bm{n}}.

  • The else branch of Case 1 (line 12) maintains invariant (44) because of the following:

    δ𝒟nδ=δ𝒟nδ\displaystyle\sum_{\delta\in\mathcal{D}^{\prime}}n^{\prime}_{\delta}=\sum_{\delta\in\mathcal{D}^{\prime}}n_{\delta} δ𝒟nδn𝑾n𝒀|𝑾\displaystyle\leq\sum_{\delta\in\mathcal{D}}n_{\delta}-n_{\bm{W}}-n_{\bm{Y}|\bm{W}} (by Lemma 3.7)
    <δ𝒟nδbΔ,𝒏\displaystyle<\sum_{\delta\in\mathcal{D}}n_{\delta}-b_{\Delta,\bm{n}} (n𝑾+n𝒀|𝑾>bΔ,𝒏n_{\bm{W}}+n_{\bm{Y}|\bm{W}}>b_{\Delta,\bm{n}})
    |𝒵|bΔ,𝒏bΔ,𝒏\displaystyle\leq|\mathcal{Z}|\cdot b_{\Delta,\bm{n}}-b_{\Delta,\bm{n}} (by invariant (44) before)
    |𝒵|bΔ,𝒏\displaystyle\leq|\mathcal{Z}^{\prime}|\cdot b_{\Delta,\bm{n}} (because |𝒵||𝒵|1|\mathcal{Z}^{\prime}|\geq|\mathcal{Z}|-1)
  • Case 3 of the algorithm maintains invariant (44) because Eq. (36) implies that n𝑿i+n𝒀|𝑿𝒁in𝑿𝒀=n𝑾n_{\bm{X}}^{i}+n_{\bm{Y}|\bm{X}\bm{Z}}^{i}\leq n_{\bm{X}\bm{Y}}=n_{\bm{W}}. (Recall the definition of n𝑿in_{\bm{X}}^{i} and n𝒀|𝑿𝒁in_{\bm{Y}|\bm{X}\bm{Z}}^{i} in line 27.)

6 Answering Conjunctive Queries in Submodular-Width Time

The main aim of this section is to explain how PANDA can be used to compute a conjunctive query in time given by its submodular width (plus the output size). Recall the definition of a conjunctive query in Equation (3). In Marx’s work [25], the submodular width was only defined for Boolean conjunctive queries (i.e. 𝑭\bm{F} is empty) where all input relations are set to be of size NN. We begin by generalizing this notion to the case when 𝑭\bm{F} is arbitrary and the inputs satisfy given degree constraints, which are much more general than a single relation size.

6.1 Width parameters for conjunctive queries under degree constraints

Given a conjunctive query QQ in the form (3):

Q(𝑭) :- R(𝑿)ΣR(𝑿),Q(\bm{F})\text{ :- }\bigwedge_{R(\bm{X})\in\Sigma}R(\bm{X}),

let 𝒯(Q)\mathcal{T}(Q) denote the set of all free-connex tree decompositions of the query.161616 For a fixed query QQ, there are at most 22|𝑽|2^{2^{|\bm{V}|}} tree decompositions, since any two trees that have the same set of bags can be considered equal for our purposes. (See Section 2.2 for the definition of a free-connex tree decomposition.) Let (Δ,𝑵)(\Delta,\bm{N}) be a set of degree constraints, as defined in Sec. 2.5. As usual, we denote by (Δ,𝒏)(\Delta,\bm{n}) the associated log-degree constraints. We say that a polymatroid 𝒉\bm{h} satisfies the constraints, and write 𝒉(Δ,𝒏)\bm{h}\models(\Delta,\bm{n}), if h(δ)nδh(\delta)\leq n_{\delta} for all δΔ\delta\in\Delta.

Definition 6.1.

The degree-aware fractional hypertree width and the degree-aware submodular width of QQ under the degree constraints (Δ,𝒏)(\Delta,\bm{n}) are:

fhtw(Q,Δ,𝒏)=def\displaystyle\textsf{fhtw}(Q,\Delta,\bm{n})\stackrel{{\scriptstyle\text{def}}}{{=}} min(T,χ)𝒯(Q)max𝒉(Δ,𝒏)maxtnodes(T)h(χ(t))\displaystyle\min_{(T,\chi)\in\mathcal{T}(Q)}\max_{\bm{h}\models(\Delta,\bm{n})}\max_{t\in\text{\sf nodes}(T)}h(\chi(t)) (45)
subw(Q,Δ,𝒏)=def\displaystyle\textsf{subw}(Q,\Delta,\bm{n})\stackrel{{\scriptstyle\text{def}}}{{=}} max𝒉(Δ,𝒏)min(T,χ)𝒯(Q)maxtnodes(T)h(χ(t))\displaystyle\max_{\bm{h}\models(\Delta,\bm{n})}\min_{(T,\chi)\in\mathcal{T}(Q)}\max_{t\in\text{\sf nodes}(T)}h(\chi(t)) (46)

(Note that 𝒉(Δ,𝒏)\bm{h}\models(\Delta,\bm{n}) requires 𝒉\bm{h} to both be a polymatroid and satisfy the degree constraints.)

Remark 6.2.

Eq. (45) and (46) collapse back to the standard definitions of the fractional hypertree width [20] and submodular width [25], respectively, when the degree constraints (Δ,𝑵)(\Delta,\bm{N}) only contain cardinality constraints of the form |T𝒀|N|T_{\bm{Y}}|\leq N for a single number NN that represents the input size; see [6] for a proof. Note, however, that in the standard definitions of fhtw and subw, the base of the log function was NN, the input size, and thus runtimes were stated in the form O(Nfhtw)O(N^{\textsf{fhtw}}) and O(Nsubw)O(N^{\textsf{subw}}). In our generalization, the base of the log function is 22, and thus runtimes are stated in the form O(2fhtw)O(2^{\textsf{fhtw}}) and O(2subw)O(2^{\textsf{subw}}).

It is straightforward to use PANDA to answer a conjunctive query in time O~(Σ+2fhtw)\tilde{O}(\|\Sigma\|+2^{\textsf{fhtw}}), but we will only describe the algorithm for the submodular width. The key advantage of fhtw over subw is that we can answer sum-product queries over any semiring in time O~(Σ+2fhtw)\tilde{O}(\|\Sigma\|+2^{\textsf{fhtw}}) using variable elimination [4], since the query plan involves only one (optimal) tree decomposition.

The two definitions (45) and (46) differ only in the first two operations: minmax\min\max vs. maxmin\max\min. It is easy to see that subwfhtw\textsf{subw}\leq\textsf{fhtw}, as it follows immediately from maxminminmax\max\min\leq\min\max; more precisely, in any complete lattice, the inequality ijxijjixij\bigvee_{i}\bigwedge_{j}x_{ij}\leq\bigwedge_{j}\bigvee_{i}x_{ij} holds. As mentioned above, the degree-aware submodular width generalizes the submodular width by allowing richer sets of statistics on the input data. The original definition assumed only cardinality constraints: there is one cardinality constraint for each input relation, and they are all equal. In that case, both subw and fhtw are finite. In our generalization, subw can be \infty, for example when Δ=\Delta=\emptyset. When no confusion arises, we will simply call fhtw and subw the fractional hypertree width and submodular width, dropping the term degree-aware.
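The max-min vs. min-max relationship can be illustrated with a small numeric sketch (hypothetical values): rows play the role of polymatroids, columns the role of tree decompositions, and each entry the largest bag value of that decomposition under that polymatroid.

```python
# For any finite matrix x[i][j]:  max_i min_j x[i][j] <= min_j max_i x[i][j].
# Rows i ~ polymatroids h, columns j ~ tree decompositions,
# x[i][j] ~ the largest bag of decomposition j under polymatroid i.
x = [
    [3, 7, 5],
    [6, 2, 8],
    [4, 9, 1],
]

max_min = max(min(row) for row in x)                       # "subw"-style quantity
min_max = min(max(row[j] for row in x) for j in range(3))  # "fhtw"-style quantity

assert max_min <= min_max  # mirrors subw <= fhtw
```

The assertion holds for every matrix, which is exactly the lattice inequality quoted above.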

Example 6.3 (fhtw and subw of the kk-cycle with k4k\geq 4).

Consider the Boolean kk-cycle query with k4k\geq 4:

Q()\displaystyle Q() :- R1,2(X1,X2)Rk1,k(Xk1,Xk)Rk,1(Xk,X1).\displaystyle\text{ :- }R_{1,2}(X_{1},X_{2})\wedge\cdots\wedge R_{k-1,k}(X_{k-1},X_{k})\wedge R_{k,1}(X_{k},X_{1}).

Suppose we have input cardinality statistics N:=|R1,2|==|Rk1,k|=|Rk,1|N:=|R_{1,2}|=\cdots=|R_{k-1,k}|=|R_{k,1}|. For instance, in the query where all input relations are the edge set of a graph, NN is the number of edges. We will show that subw(21/k/2)logN\textsf{subw}\leq(2-1/\lceil k/2\rceil)\log N and fhtw2logN\textsf{fhtw}\geq 2\log N for this query.

To show the bound on the fhtw, note that, in any tree decomposition (T,χ)(T,\chi), there must be at least one bag χ(t)\chi(t) that contains some three consecutive variables {Xi1,Xi,Xi+1}\{X_{i-1},X_{i},X_{i+1}\} on the cycle. Fix ii accordingly and consider the polymatroid 𝒉\bm{h} defined by:

h(𝑿)=|𝑿{Xi1,Xi+1}|logN\displaystyle h(\bm{X})=|\bm{X}\cap\{X_{i-1},X_{i+1}\}|\cdot\log N for all 𝑿{X1,,Xk}\bm{X}\subseteq\{X_{1},\dots,X_{k}\}

It is straightforward to verify that this is a polymatroid with h(χ(t))=2logNh(\chi(t))=2\log N and 𝒉(Δ,𝒏)\bm{h}\models(\Delta,\bm{n}).
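The verification can also be carried out mechanically. The sketch below (for k = 5, logN = 1, and i = 3, so the counted variables are X2 and X4) brute-force checks the polymatroid axioms and the cardinality constraints; the helper names are ours, not the paper's.

```python
from itertools import combinations

def subsets(V):
    """All subsets of V, as frozensets."""
    return [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]

def is_polymatroid(h, V):
    """Brute-force check of h(empty)=0, monotonicity, and submodularity."""
    S = subsets(V)
    if h[frozenset()] != 0:
        return False
    monotone = all(h[A] <= h[B] for A in S for B in S if A <= B)
    submodular = all(h[A] + h[B] >= h[A | B] + h[A & B] for A in S for B in S)
    return monotone and submodular

k, logN = 5, 1.0
V = [f"X{i}" for i in range(1, k + 1)]
# i = 3: the polymatroid counts the neighbors X2 and X4 of X3 (a modular function)
h = {A: len(A & {"X2", "X4"}) * logN for A in subsets(V)}

assert is_polymatroid(h, V)
# every edge {Xj, Xj+1} of the 5-cycle contains at most one of X2, X4,
# so all cardinality constraints h(edge) <= logN are satisfied:
edges = [{"X1", "X2"}, {"X2", "X3"}, {"X3", "X4"}, {"X4", "X5"}, {"X5", "X1"}]
assert all(h[frozenset(e)] <= logN for e in edges)
# while the bag {X2, X3, X4} gets weight 2*logN:
assert h[frozenset({"X2", "X3", "X4"})] == 2 * logN
```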

To prove the bound on subw, consider any polymatroid 𝒉(Δ,𝒏)\bm{h}\models(\Delta,\bm{n}). Let θ\theta be a parameter to be determined. Consider two cases.

  • There exists XiX_{i} for which h(Xi)θh(X_{i})\leq\theta. Without loss of generality, assume h(X1)θh(X_{1})\leq\theta. Consider the tree decomposition:

    A path of bags: \{X_{1},X_{2},X_{3}\}-\{X_{1},X_{3},X_{4}\}-\cdots-\{X_{1},X_{k-1},X_{k}\}

    For every bag B={X1,Xi,Xi+1}B=\{X_{1},X_{i},X_{i+1}\}, we have h(B)h(X1)+h(XiXi+1)θ+logNh(B)\leq h(X_{1})+h(X_{i}X_{i+1})\leq\theta+\log N.

  • h(Xi)>θh(X_{i})>\theta for all i[k]i\in[k]. Consider the tree decomposition

    Two bags: B_{1}=\{X_{1},X_{2},\dots,X_{\lceil k/2\rceil+1}\} and B_{2}=\{X_{\lceil k/2\rceil+1},X_{\lceil k/2\rceil+2},\dots,X_{k},X_{1}\}

    From submodularity,

    h(B1)\displaystyle h(B_{1})\leq h(X1X2)+i=3k/2+1h(Xi|Xi1)\displaystyle h(X_{1}X_{2})+\displaystyle{\sum_{i=3}^{\lceil k/2\rceil+1}h(X_{i}|X_{i-1})} k/2logN(k/21)θ\displaystyle\leq\lceil k/2\rceil\log N-(\lceil k/2\rceil-1)\theta
    h(B2)\displaystyle h(B_{2})\leq h(XkX1)+i=k/2+1k1h(Xi|Xi+1)\displaystyle h(X_{k}X_{1})+\displaystyle{\sum_{i=\lceil k/2\rceil+1}^{k-1}h(X_{i}|X_{i+1})} k/2logN(k/21)θ.\displaystyle\leq\lfloor k/2\rfloor\log N-(\lfloor k/2\rfloor-1)\theta.

    Setting θ=(11/k/2)logN\theta=(1-1/\lceil k/2\rceil)\log N to balance the two cases, we conclude that subw(21/k/2)logN\textsf{subw}\leq(2-1/\lceil k/2\rceil)\log N.
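The balancing computation can be double-checked numerically: with θ = (1 − 1/⌈k/2⌉) log N, the bound θ + log N from the first case and the bound ⌈k/2⌉ log N − (⌈k/2⌉ − 1)θ from the second case coincide at (2 − 1/⌈k/2⌉) log N. A quick sketch, taking log N = 1:

```python
import math

logN = 1.0  # work in units of log N

for k in range(4, 12):
    c = math.ceil(k / 2)
    theta = (1 - 1 / c) * logN
    case1 = theta + logN                # bound for a bag {X1, Xi, Xi+1}
    case2 = c * logN - (c - 1) * theta  # bound for bag B1 (the larger of B1, B2)
    target = (2 - 1 / c) * logN         # claimed subw bound
    assert math.isclose(case1, target) and math.isclose(case2, target)
```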

6.2 Achieving submodular width runtime with PANDA

Before explaining how PANDA can be used to achieve the submodular width runtime, we need a technical lemma.

Lemma 6.4.

Let 𝒵\mathcal{Z} denote a finite collection171717Note that 𝒵\mathcal{Z} is not a multiset here. of subsets of 𝐕\bm{V}. Let (Δ,𝐧)(\Delta,\bm{n}) denote given input degree constraints. If the following quantity is finite:

opt:=max𝒉(Δ,𝒏)min𝒁𝒵h(𝒁),\displaystyle\text{\sf opt}:=\max_{\bm{h}\models(\Delta,\bm{n})}\min_{\bm{Z}\in\mathcal{Z}}h(\bm{Z}), (47)

then we can compute coefficients 𝛌=(λ𝐙)𝐙𝒵\bm{\lambda}=(\lambda_{\bm{Z}})_{{\bm{Z}}\in\mathcal{Z}} and 𝐰=(wδ)δΔ\bm{w}=(w_{\delta})_{\delta\in\Delta} such that the following are satisfied:

  • (a)

    𝝀1=1\|\bm{\lambda}\|_{1}=1, 𝝀𝟎\bm{\lambda}\geq\bm{0}, and 𝒘𝟎\bm{w}\geq\bm{0},

  • (b)

    Inequality 𝒁𝒵λ𝒁h(𝒁)δΔwδh(δ)\sum_{{\bm{Z}}\in\mathcal{Z}}\lambda_{\bm{Z}}\cdot h({\bm{Z}})\leq\sum_{\delta\in\Delta}w_{\delta}\cdot h(\delta) is a Shannon inequality,

  • (c)

    opt=δΔwδnδ\text{\sf opt}=\sum_{\delta\in\Delta}w_{\delta}n_{\delta}.

Proof.

Let Γ\Gamma denote the (polyhedral) set of all polymatroids over 𝑽\bm{V}. We write opt in a slightly different form, where we introduce a new unconstrained variable tt to replace the inner min\min:

opt=maxt,𝒉Γ{t𝒁𝒵:th(𝒁), and δΔ:h(δ)nδ}\displaystyle\text{\sf opt}=\max_{t,\bm{h}\in\Gamma}\{t\mid\forall{\bm{Z}}\in\mathcal{Z}:t\leq h({\bm{Z}}),\text{ and }\forall\delta\in\Delta:h(\delta)\leq n_{\delta}\} (48)

Introduce a Lagrangian multiplier λ𝒁\lambda_{\bm{Z}} for every constraint th(𝒁)t\leq h({\bm{Z}}), and wδw_{\delta} for every constraint h(δ)nδh(\delta)\leq n_{\delta}. The Lagrange dual function is

(𝝀,𝒘)\displaystyle\mathcal{L}(\bm{\lambda},\bm{w}) =maxt,𝒉Γ(t+𝒁𝒵λ𝒁(h(𝒁)t)+δΔwδ(nδh(δ)))\displaystyle=\max_{t,\bm{h}\in\Gamma}\left(t+\sum_{{\bm{Z}}\in\mathcal{Z}}\lambda_{\bm{Z}}(h({\bm{Z}})-t)+\sum_{\delta\in\Delta}w_{\delta}(n_{\delta}-h(\delta))\right)
=δΔwδnδ+maxt(1𝝀1)t+max𝒉Γ(𝒁𝒵λ𝒁h(𝒁)δΔwδh(δ))\displaystyle=\sum_{\delta\in\Delta}w_{\delta}n_{\delta}+\max_{t}(1-\|\bm{\lambda}\|_{1})t+\max_{\bm{h}\in\Gamma}\left(\sum_{{\bm{Z}}\in\mathcal{Z}}\lambda_{\bm{Z}}h({\bm{Z}})-\sum_{\delta\in\Delta}w_{\delta}h(\delta)\right)

Let 𝝀\bm{\lambda}^{*} and 𝒘\bm{w}^{*} denote an optimal solution to the Lagrangian dual problem min{(𝝀,𝒘)𝝀𝟎,𝒘𝟎}\min\{\mathcal{L}(\bm{\lambda},\bm{w})\mid\bm{\lambda}\geq\bm{0},\bm{w}\geq\bm{0}\}, then by strong duality181818Which holds because the problem is linear.

opt=(𝝀,𝒘)=δΔwδnδ+maxt(1𝝀1)t+max𝒉Γ(𝒁𝒵λ𝒁h(𝒁)δΔwδh(δ))\text{\sf opt}=\mathcal{L}(\bm{\lambda}^{*},\bm{w}^{*})=\sum_{\delta\in\Delta}w^{*}_{\delta}n_{\delta}+\max_{t}(1-\|\bm{\lambda}^{*}\|_{1})t+\max_{\bm{h}\in\Gamma}\left(\sum_{{\bm{Z}}\in\mathcal{Z}}\lambda^{*}_{\bm{Z}}h({\bm{Z}})-\sum_{\delta\in\Delta}w^{*}_{\delta}h(\delta)\right)

From the assumption that opt is finite, it follows that 𝝀1=1\|\bm{\lambda}^{*}\|_{1}=1 because tt is unconstrained. Furthermore, if there is any polymatroid 𝒉\bm{h} for which 𝒁𝒵λ𝒁h(𝒁)δΔwδh(δ)>0\sum_{{\bm{Z}}\in\mathcal{Z}}\lambda^{*}_{\bm{Z}}h({\bm{Z}})-\sum_{\delta\in\Delta}w^{*}_{\delta}h(\delta)>0 then (𝝀,𝒘)\mathcal{L}(\bm{\lambda}^{*},\bm{w}^{*}) is unbounded, because any positive multiple of a polymatroid is a polymatroid. Thus, (b)(b) is satisfied. Furthermore, as the expression inside max𝒉Γ\max_{\bm{h}\in\Gamma} is non-positive, the maximum it can achieve is 0 with 𝒉=𝟎\bm{h}=\bm{0}. Consequently, 𝝀\bm{\lambda}^{*} and 𝒘\bm{w}^{*} satisfy the three conditions (a), (b), and (c) above, and we can compute them with standard linear programming algorithms. ∎

Equipped with this tool, we are now ready to show how PANDA can be used to answer a conjunctive query in submodular width time:

Theorem 6.5.

Given a set of degree constraints (Δ,𝐍)(\Delta,\bm{N}), a conjunctive query QQ of the form (3) can be computed in time

O~(Σ+2subw(Q,Δ,𝒏)+|output|)\tilde{O}(\|\Sigma\|+2^{\textsf{subw}(Q,\Delta,\bm{n})}+|\text{output}|)

on any database instance Σ\Sigma that satisfies the degree constraints.

Proof.

Let 𝒯(Q)={(T1,χ1),,(Tm,χm)}\mathcal{T}(Q)=\{(T_{1},\chi_{1}),\ldots,(T_{m},\chi_{m})\} be all free-connex tree decompositions of QQ. For every tree decomposition (Ti,χi)𝒯(Q)(T_{i},\chi_{i})\in\mathcal{T}(Q) and every node jnodes(Ti)j\in\text{\sf nodes}(T_{i}), create a fresh atom Aij(𝒁ij)A_{ij}(\bm{Z}_{ij}) over variables 𝒁ij:=χi(j)\bm{Z}_{ij}:=\chi_{i}(j). In other words, every bag of every tree decomposition is associated with an atom. Let Σi:={Aij(𝒁ij)jnodes(Ti)}\Sigma^{i}:=\{A_{ij}(\bm{Z}_{ij})\mid j\in\text{\sf nodes}(T_{i})\} denote a schema corresponding to the bags of the iith tree decomposition. The algorithm will compute relation instances Aij(𝒁ij)A_{ij}(\bm{Z}_{ij}), for all i[m]i\in[m] and jnodes(Ti)j\in\text{\sf nodes}(T_{i}), such that the tree decompositions together cover the full join of the input relations:

Σi[m]Σi\displaystyle{{{\Join}}}\Sigma\subseteq\bigcup_{i\in[m]}{{{\Join}}}\Sigma^{i} (49)

From these instances, there are mm separate free-connex acyclic conjunctive queries of the form

Qi(𝑭) :- Aij(𝒁ij)ΣiAij(𝒁ij),Q^{i}(\bm{F})\text{ :- }\bigwedge_{A_{ij}(\bm{Z}_{ij})\in\Sigma^{i}}A_{ij}(\bm{Z}_{ij}),

which can be computed in O~(Σi+|Qi(𝑭)|)\tilde{O}(\|\Sigma^{i}\|+|Q^{i}(\bm{F})|) time, using Lemma 2.1. Before computing these queries, we semijoin reduce each Aij(𝒁ij)A_{ij}(\bm{Z}_{ij}) in Σi\Sigma^{i} with all the input relations in Σ\Sigma in order to ensure that the output of each query Qi(𝑭)Q^{i}(\bm{F}) is a subset of the full join of the input relations Σ{{{\Join}}}\Sigma. This turns (49) into an equality.

It remains to show that we can compute all the instances Σi\Sigma^{i} satisfying (49) in time O~(2subw(Q,Δ,𝒏))\tilde{O}(2^{\textsf{subw}(Q,\Delta,\bm{n})}). Obviously, to compute them in time O~(2subw(Q,Δ,𝒏))\tilde{O}(2^{\textsf{subw}(Q,\Delta,\bm{n})}), it is necessary that Σi=O~(2subw(Q,Δ,𝒏))\|\Sigma^{i}\|=\tilde{O}(2^{\textsf{subw}(Q,\Delta,\bm{n})}). To this end, for every combination of nodes 𝒋=(j1,j2,,jm)nodes(T1)××nodes(Tm)\bm{j}=(j_{1},j_{2},\dots,j_{m})\in\text{\sf nodes}(T_{1})\times\cdots\times\text{\sf nodes}(T_{m}), we will compute a feasible output B1𝒋,Bm𝒋B^{\bm{j}}_{1},\dots B^{\bm{j}}_{m} to the following DDR (whose input schema is Σ\Sigma):

i[m]Bi𝒋(𝒁iji)\displaystyle\bigvee_{i\in[m]}B^{\bm{j}}_{i}(\bm{Z}_{ij_{i}}) :- R(𝑿)ΣR(𝑿).\displaystyle\text{ :- }\bigwedge_{R(\bm{X})\in\Sigma}R(\bm{X}). (the 𝒋\bm{j}th DDR) (50)

In words, for this DDR, there is a representative bag Bi𝒋B_{i}^{\bm{j}} from each tree decomposition (Ti,χi)(T_{i},\chi_{i}). After feasible solutions to all these DDRs are computed, then we set Aij:=𝒋:ji=jBi𝒋.A_{ij}:=\bigcup_{\bm{j}:j_{i}=j}B_{i}^{\bm{j}}.

We first prove that the instances AijA_{ij} defined as such satisfy property (49). Suppose there is a tuple 𝒕Σ\bm{t}\in{{{\Join}}}\Sigma that is not in the RHS of (49). Then, for every i[m]i\in[m], there exists a node jinodes(Ti)j_{i}\in\text{\sf nodes}(T_{i}) such that π𝒁iji(𝒕)Aiji\pi_{\bm{Z}_{ij_{i}}}(\bm{t})\not\in A_{ij_{i}}. Collect these jij_{i} into a tuple 𝒋\bm{j}, then this implies that we did not compute a feasible output to the 𝒋\bm{j}th DDR (50), a contradiction.

Last but not least, we show that the DDRs (50) can be computed in time O~(2subw(Q,Δ,𝒏))\tilde{O}(2^{\textsf{subw}(Q,\Delta,\bm{n})}). Fix a tuple of nodes 𝒋=(j1,,jm)\bm{j}=(j_{1},\dots,j_{m}). Let 𝒵\mathcal{Z} denote the set of all bags 𝒁iji\bm{Z}_{ij_{i}} for i[m]i\in[m]. Define

opt:=max𝒉(Δ,𝒏)minZ𝒵h(Z).\text{\sf opt}:=\max_{\bm{h}\models(\Delta,\bm{n})}\min_{Z\in\mathcal{Z}}h(Z).

Then,

opt=max𝒉(Δ,𝒏)mini[m]h(Ziji)max𝒉(Δ,𝒏)mini[m]maxtnodes(Ti)h(χi(t))=subw(Q,Δ,𝒏).\displaystyle\text{\sf opt}=\max_{\bm{h}\models(\Delta,\bm{n})}\min_{i\in[m]}h(Z_{ij_i})\leq\max_{\bm{h}\models(\Delta,\bm{n})}\min_{i\in[m]}\max_{t\in\text{\sf nodes}(T_{i})}h(\chi_{i}(t))=\textsf{subw}(Q,\Delta,\bm{n}).

WLOG, we assume that subw is finite, which means opt is finite. By Lemma 6.4, we can compute coefficients 𝝀\bm{\lambda} and 𝒘\bm{w} such that the three conditions in Lemma 6.4 are satisfied. Hence, from Theorem 4.1, we can compute a feasible output to the DDR (50) in time O~(2opt)=O~(2subw(Q,Δ,𝒏))\tilde{O}(2^{\text{\sf opt}})=\tilde{O}(2^{\textsf{subw}(Q,\Delta,\bm{n})}). ∎

6.3 Example: Solving a conjunctive query in submodular width time

Consider the following query QQ whose body is a 4-cycle:

Q(X,Y) :-\displaystyle Q(X,Y)\text{ :- } R(X,Y)S(Y,Z)U(Z,W)V(W,X)\displaystyle R(X,Y)\wedge S(Y,Z)\wedge U(Z,W)\wedge V(W,X) (51)

Suppose we only have cardinality statistics where all input relation sizes are upper bounded by NN for some number NN, i.e.

Δ=\displaystyle\Delta= {(XY|),(YZ|),(ZW|),(WX|)},\displaystyle\{(XY|\emptyset),(YZ|\emptyset),(ZW|\emptyset),(WX|\emptyset)\}, NXY=\displaystyle N_{XY}= NYZ=NZW=NWX=N.\displaystyle N_{YZ}=N_{ZW}=N_{WX}=N.

Let n:=logNn:=\log N. This query has two free-connex tree decompositions, depicted in Figure 2 (ignoring the trivial tree decomposition with a single bag):

  • One with two bags A11(X,Y,Z)A_{11}(X,Y,Z) and A12(Z,W,X)A_{12}(Z,W,X).

  • One with two bags A21(Y,Z,W)A_{21}(Y,Z,W) and A22(W,X,Y)A_{22}(W,X,Y).

(Figure: the 4-cycle query graph on variables X, Y, Z, W with edge relations R, S, U, V, alongside the two free-connex tree decompositions: one with bags {X,Y,Z} and {Z,W,X}, the other with bags {Y,Z,W} and {W,X,Y}.)
Figure 2: Query (51) with the two free-connex tree decompositions.

It is not hard to see that the degree-aware fractional hypertree width of QQ, given by (45), is exactly 2n2n, and we leave this as an exercise. Next, we show that the degree-aware submodular width is 3n/23n/2. In particular, Eq. (46) for this query becomes:

subw(Q,Δ,𝒏)=\displaystyle\textsf{subw}(Q,\Delta,\bm{n})= max𝒉(Δ,𝒏)min(max(h(XYZ),h(ZWX)),max(h(YZW),h(WXY)))\displaystyle\max_{\bm{h}\models(\Delta,\bm{n})}\min(\max(h(XYZ),h(ZWX)),\max(h(YZW),h(WXY)))

By distributing the min\min over the inner max\max, and then swapping the two max\max operators, we get:

subw(Q,Δ,𝒏)=max(\displaystyle\textsf{subw}(Q,\Delta,\bm{n})=\max(
max𝒉(Δ,𝒏)min(h(XYZ),h(YZW)),\displaystyle\max_{\bm{h}\models(\Delta,\bm{n})}\min(h(XYZ),h(YZW)), (52)
max𝒉(Δ,𝒏)min(h(XYZ),h(WXY)),\displaystyle\max_{\bm{h}\models(\Delta,\bm{n})}\min(h(XYZ),h(WXY)), (53)
max𝒉(Δ,𝒏)min(h(ZWX),h(YZW)),\displaystyle\max_{\bm{h}\models(\Delta,\bm{n})}\min(h(ZWX),h(YZW)), (54)
max𝒉(Δ,𝒏)min(h(ZWX),h(WXY)))\displaystyle\max_{\bm{h}\models(\Delta,\bm{n})}\min(h(ZWX),h(WXY))) (55)

Note that each of the expressions (52)–(55) has the same format as the optimization problem (47) in Lemma 6.4 and is equivalent to the linear program (48). Let's take the first expression (52). (The other three are similar.) For this expression, a linear program solver gives opt=3n/2\text{\sf opt}=3n/2. Lemma 6.4 guarantees the existence of the following Shannon inequality, which is the same as (32) from Section 4.2:

12h(XYZ)+12h(YZW)\displaystyle\frac{1}{2}h(XYZ)+\frac{1}{2}h(YZW)\leq 12h(XY|)+12h(YZ|)+12h(ZW|)\displaystyle\frac{1}{2}h(XY|\emptyset)+\frac{1}{2}h(YZ|\emptyset)+\frac{1}{2}h(ZW|\emptyset)

In particular, by item (c) of the lemma, we have:

opt=12nXY+12nYZ+12nZW=32n.\text{\sf opt}=\frac{1}{2}n_{XY}+\frac{1}{2}n_{YZ}+\frac{1}{2}n_{ZW}=\frac{3}{2}n.

The other three expressions (53)–(55) also have opt=3n/2\text{\sf opt}=3n/2, leading to subw(Q,Δ,𝒏)=3n/2\textsf{subw}(Q,\Delta,\bm{n})=3n/2.
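As a numeric sanity check (separate from the proof), the Shannon inequality above can be tested on entropic polymatroids: the entropy function of any joint distribution on (X, Y, Z, W) is a polymatroid, so the inequality must hold for every random distribution; four independent uniform bits then show that the bound 3n/2 is attained (with n = 2 bits per edge relation).

```python
import itertools, math, random

def H(pmf, coords):
    """Shannon entropy (in bits) of the marginal on the given coordinates."""
    marg = {}
    for point, p in pmf.items():
        key = tuple(point[i] for i in coords)
        marg[key] = marg.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

X, Y, Z, W = 0, 1, 2, 3
random.seed(0)
for _ in range(200):
    # a random joint distribution over {0,1}^4; its entropy is a polymatroid
    weights = [random.random() for _ in range(16)]
    total = sum(weights)
    pmf = {pt: w / total
           for pt, w in zip(itertools.product((0, 1), repeat=4), weights)}
    lhs = 0.5 * H(pmf, (X, Y, Z)) + 0.5 * H(pmf, (Y, Z, W))
    rhs = 0.5 * (H(pmf, (X, Y)) + H(pmf, (Y, Z)) + H(pmf, (Z, W)))
    assert lhs <= rhs + 1e-9  # the Shannon inequality (32)

# Tightness: four independent uniform bits give h(edge) = 2 bits = n,
# while h(XYZ) = h(YZW) = 3 = 3n/2.
uniform = {pt: 1 / 16 for pt in itertools.product((0, 1), repeat=4)}
assert math.isclose(H(uniform, (X, Y)), 2.0)
assert math.isclose(H(uniform, (X, Y, Z)), 3.0)
```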

Next, we describe the algorithm to compute the query (51) in time O~(N3/2+|output|)\tilde{O}(N^{3/2}+|\text{output}|), as claimed in Theorem 6.5. The algorithm starts by constructing the following four DDRs, which mirror the four expressions (52)–(55):

B111(X,Y,Z)B211(Y,Z,W)\displaystyle B_{1}^{11}(X,Y,Z)\vee B_{2}^{11}(Y,Z,W) :- R(X,Y)S(Y,Z)U(Z,W)V(W,X)\displaystyle\text{ :- }R(X,Y)\wedge S(Y,Z)\wedge U(Z,W)\wedge V(W,X) (56)
B112(X,Y,Z)B212(W,X,Y)\displaystyle B_{1}^{12}(X,Y,Z)\vee B_{2}^{12}(W,X,Y) :- R(X,Y)S(Y,Z)U(Z,W)V(W,X)\displaystyle\text{ :- }R(X,Y)\wedge S(Y,Z)\wedge U(Z,W)\wedge V(W,X) (57)
B121(Z,W,X)B221(Y,Z,W)\displaystyle B_{1}^{21}(Z,W,X)\vee B_{2}^{21}(Y,Z,W) :- R(X,Y)S(Y,Z)U(Z,W)V(W,X)\displaystyle\text{ :- }R(X,Y)\wedge S(Y,Z)\wedge U(Z,W)\wedge V(W,X) (58)
B122(Z,W,X)B222(W,X,Y)\displaystyle B_{1}^{22}(Z,W,X)\vee B_{2}^{22}(W,X,Y) :- R(X,Y)S(Y,Z)U(Z,W)V(W,X)\displaystyle\text{ :- }R(X,Y)\wedge S(Y,Z)\wedge U(Z,W)\wedge V(W,X) (59)

Let’s take the first DDR (56) as an example. We can compute a feasible output to this DDR by computing a feasible output to the following DDR instead:

B111(X,Y,Z)B211(Y,Z,W)\displaystyle B_{1}^{11}(X,Y,Z)\vee B_{2}^{11}(Y,Z,W) :- R(X,Y)S(Y,Z)U(Z,W)\displaystyle\text{ :- }R(X,Y)\wedge S(Y,Z)\wedge U(Z,W)

The above DDR is identical to (7), and we saw in Section 4.2 that we can compute a feasible output to it in time O~(N3/2)\tilde{O}(N^{3/2}). The other three DDRs (57)–(59) can be computed in the same way. Afterwards, we compute:

A11\displaystyle A_{11} :=B111B112\displaystyle:=B_{1}^{11}\cup B_{1}^{12}
A12\displaystyle A_{12} :=B121B122\displaystyle:=B_{1}^{21}\cup B_{1}^{22}
A21\displaystyle A_{21} :=B211B221\displaystyle:=B_{2}^{11}\cup B_{2}^{21}
A22\displaystyle A_{22} :=B212B222\displaystyle:=B_{2}^{12}\cup B_{2}^{22}

Using Lemma 2.1, we compute the following two free-connex acyclic conjunctive queries (after semijoin-reducing each of the AijA_{ij} relations with the input relations R,S,UR,S,U and VV):

Q1(X,Y)\displaystyle Q^{1}(X,Y) :- A11(X,Y,Z)A12(Z,W,X)\displaystyle\text{ :- }A_{11}(X,Y,Z)\wedge A_{12}(Z,W,X)
Q2(X,Y)\displaystyle Q^{2}(X,Y) :- A21(Y,Z,W)A22(W,X,Y)\displaystyle\text{ :- }A_{21}(Y,Z,W)\wedge A_{22}(W,X,Y)

Finally, we take the union of Q1Q^{1} and Q2Q^{2} above as the output QQ to the query in (51). The overall runtime is O~(N3/2+|output|)\tilde{O}(N^{3/2}+|\text{output}|).
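The single-DDR step above, computing a feasible output to B1(X,Y,Z) ∨ B2(Y,Z,W) :- R(X,Y) ∧ S(Y,Z) ∧ U(Z,W) within the N^{3/2} budget, can be illustrated by a standard heavy/light split on Y. The following is a self-contained sketch of that idea, not PANDA's general machinery: Y-values with R-degree above √N are "heavy", and there are at most √N of them.

```python
import math, random
from collections import Counter

def ddr_feasible(R, S, U):
    """Feasible output (B1, B2) to  B1(X,Y,Z) v B2(Y,Z,W) :- R(X,Y),S(Y,Z),U(Z,W).

    Heavy/light split on Y: at most sqrt(N) values of Y have R-degree above
    sqrt(N), hence |B1| <= N*sqrt(N) and |B2| <= sqrt(N)*N.
    """
    N = max(len(R), len(S), len(U))
    thresh = math.isqrt(N)
    deg = Counter(y for (_, y) in R)
    heavy = {y for y, d in deg.items() if d > thresh}

    # light y: materialize the join R(X,Y) |><| S(Y,Z) into B1
    B1 = {(x, y, z) for (x, y) in R if y not in heavy
                    for (y2, z) in S if y2 == y}
    # heavy y: cross the few heavy values with U into B2
    B2 = {(y, z, w) for y in heavy for (z, w) in U}
    return B1, B2

# random instance, then check feasibility: every tuple of the full join
# R |><| S |><| U is covered by B1 or B2
random.seed(1)
def rnd():
    return {(random.randrange(8), random.randrange(8)) for _ in range(20)}
R, S, U = rnd(), rnd(), rnd()
B1, B2 = ddr_feasible(R, S, U)
for (x, y) in R:
    for (y2, z) in S:
        for (z2, w) in U:
            if y == y2 and z == z2:
                assert (x, y, z) in B1 or (y, z, w) in B2
```

If y is light, (x, y, z) lands in B1 at join cost at most √N per tuple of S; if y is heavy, (y, z, w) is covered by B2 regardless of R, which is what makes the disjunction cheaper than any single join plan.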

In order to prove the correctness of this algorithm, we show that the full join R(X,Y)S(Y,Z)U(Z,W)V(W,X)R(X,Y)\Join S(Y,Z)\Join U(Z,W)\Join V(W,X) is identical to the set of tuples (x,y,z,w)(x,y,z,w) that satisfy:

(A11(x,y,z)A12(z,w,x))(A21(y,z,w)A22(w,x,y))\displaystyle(A_{11}(x,y,z)\wedge A_{12}(z,w,x))\vee(A_{21}(y,z,w)\wedge A_{22}(w,x,y)) (60)

The containment \supseteq is immediate from the definition of AijA_{ij} (and thanks to the semijoin reduction of AijA_{ij} with the input relations R,S,UR,S,U and VV). For the other containment \subseteq, note that condition (60) is equivalent to the following:

(A11(x,y,z)A21(y,z,w))\displaystyle(A_{11}(x,y,z)\vee A_{21}(y,z,w))\wedge (61)
(A11(x,y,z)A22(w,x,y))\displaystyle(A_{11}(x,y,z)\vee A_{22}(w,x,y))\wedge
(A12(z,w,x)A21(y,z,w))\displaystyle(A_{12}(z,w,x)\vee A_{21}(y,z,w))\wedge
(A12(z,w,x)A22(w,x,y))\displaystyle(A_{12}(z,w,x)\vee A_{22}(w,x,y))

By (56)–(59), every tuple (x,y,z,w)(x,y,z,w) that satisfies the conjunction R(x,y)S(y,z)U(z,w)V(w,x)R(x,y)\wedge S(y,z)\wedge U(z,w)\wedge V(w,x) must also satisfy (61). This completes the proof of correctness.
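The step from (60) to (61) is the propositional distributivity (a ∧ b) ∨ (c ∧ d) ≡ (a ∨ c) ∧ (a ∨ d) ∧ (b ∨ c) ∧ (b ∨ d), which can be verified by exhausting all truth assignments:

```python
from itertools import product

# a = A11(x,y,z), b = A12(z,w,x), c = A21(y,z,w), d = A22(w,x,y)
for a, b, c, d in product((False, True), repeat=4):
    lhs = (a and b) or (c and d)                           # condition (60)
    rhs = (a or c) and (a or d) and (b or c) and (b or d)  # condition (61)
    assert lhs == rhs
```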

7 Conclusion

We presented PANDA, an algorithm that computes a conjunctive query in time given by its submodular width. For this purpose, we used a generalization of the notion of submodular width in [25], incorporating a rich class of statistics on the input relations, including cardinality constraints and degree constraints; the latter can also express functional dependencies. The PANDA algorithm described here is a significant simplification of its preliminary version in [6]. PANDA can also be used as a Worst-Case-Optimal-Join algorithm to compute the output of a full conjunctive query in time bounded by the information-theoretic upper bound on the output size. A recent extension shows that PANDA can also account for p\ell_{p}-norms of degree sequences in the input [3].

We leave some open problems. The first is an analysis of the complexity of the witness of a Shannon inequality. The number of submodularities in the Shannon inequality appears as the exponent of a logarithmic factor in the runtime of PANDA, and it would be very useful to study this number as a function of the query. Another question concerns the number of tree decompositions needed to compute the submodular width: our current bound is double exponential, and the question is whether this can be reduced. Finally, one open problem is whether PANDA can be generalized to achieve information-theoretic bounds corresponding to non-Shannon inequalities [36, 37].

References

  • [1] S. Abiteboul, R. Hull, and V. Vianu, Foundations of Databases, Addison-Wesley, 1995.
  • [2] M. Abo Khamis, P. G. Kolaitis, H. Q. Ngo, and D. Suciu, Bag query containment and information theory, ACM Trans. Database Syst., 46 (2021), pp. 12:1–12:39.
  • [3] M. Abo Khamis, V. Nakos, D. Olteanu, and D. Suciu, Join size bounds using lp-norms on degree sequences, Proc. ACM Manag. Data, 2 (2024).
  • [4] M. Abo Khamis, H. Q. Ngo, and A. Rudra, FAQ: questions asked frequently, in Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, T. Milo and W. Tan, eds., ACM, 2016, pp. 13–28.
  • [5] M. Abo Khamis, H. Q. Ngo, and D. Suciu, Computing join queries with functional dependencies, in Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, T. Milo and W. Tan, eds., ACM, 2016, pp. 327–342.
  • [6] M. Abo Khamis, H. Q. Ngo, and D. Suciu, What do shannon-type inequalities, submodular width, and disjunctive datalog have to do with one another?, in Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017, E. Sallinger, J. V. den Bussche, and F. Geerts, eds., ACM, 2017, pp. 429–444. Extended version available at http://arxiv.org/abs/1612.02503.
  • [7] N. Alon, On the number of subgraphs of prescribed type of graphs with a given number of edges, Israel J. Math., 38 (1981), pp. 116–130.
  • [8] N. Alon, R. Yuster, and U. Zwick, Finding and counting given length cycles, Algorithmica, 17 (1997), pp. 209–223.
  • [9] A. Atserias, M. Grohe, and D. Marx, Size bounds and query plans for relational joins, in 49th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2008, October 25-28, 2008, Philadelphia, PA, USA, IEEE Computer Society, 2008, pp. 739–748.
  • [10] A. Atserias, M. Grohe, and D. Marx, Size bounds and query plans for relational joins, SIAM J. Comput., 42 (2013), pp. 1737–1767.
  • [11] G. Bagan, A. Durand, and E. Grandjean, On acyclic conjunctive queries and constant delay enumeration, in Computer Science Logic, 21st International Workshop, CSL 2007, 16th Annual Conference of the EACSL, Lausanne, Switzerland, September 11-15, 2007, Proceedings, J. Duparc and T. A. Henzinger, eds., vol. 4646 of Lecture Notes in Computer Science, Springer, 2007, pp. 208–222.
  • [12] B. Bollobás and A. Thomason, Projections of bodies and hereditary properties of hypergraphs, Bull. London Math. Soc., 27 (1995), pp. 417–424.
  • [13] F. R. K. Chung, R. L. Graham, P. Frankl, and J. B. Shearer, Some intersection theorems for ordered sets and graphs, J. Combin. Theory Ser. A, 43 (1986), pp. 23–37.
  • [14] T. Eiter, G. Gottlob, and H. Mannila, Disjunctive datalog, ACM Trans. Database Syst., 22 (1997), pp. 364–418.
  • [15] E. Friedgut and J. Kahn, On the number of copies of one hypergraph in another, Israel J. Math., 105 (1998), pp. 251–256.
  • [16] G. Gottlob, G. Greco, N. Leone, and F. Scarcello, Hypertree decompositions: Questions and answers, in Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2016, San Francisco, CA, USA, June 26 - July 01, 2016, T. Milo and W. Tan, eds., ACM, 2016, pp. 57–74.
  • [17] G. Gottlob, S. T. Lee, and G. Valiant, Size and treewidth bounds for conjunctive queries, in Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2009, June 19 - July 1, 2009, Providence, Rhode Island, USA, J. Paredaens and J. Su, eds., ACM, 2009, pp. 45–54.
  • [18] G. Gottlob, S. T. Lee, G. Valiant, and P. Valiant, Size and treewidth bounds for conjunctive queries, J. ACM, 59 (2012), pp. 16:1–16:35.
  • [19] M. Grohe and D. Marx, Constraint solving via fractional edge covers, in Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, Miami, Florida, USA, January 22-26, 2006, ACM Press, 2006, pp. 289–298.
  • [20] M. Grohe and D. Marx, Constraint solving via fractional edge covers, ACM Trans. Algorithms, 11 (2014), pp. 4:1–4:20.
  • [21] A. Itai and M. Rodeh, Finding a minimum circuit in a graph, in Proceedings of the 9th Annual ACM Symposium on Theory of Computing, May 4-6, 1977, Boulder, Colorado, USA, J. E. Hopcroft, E. P. Friedman, and M. A. Harrison, eds., ACM, 1977, pp. 1–10.
  • [22] V. Leis, A. Gubichev, A. Mirchev, P. A. Boncz, A. Kemper, and T. Neumann, How good are query optimizers, really?, Proc. VLDB Endow., 9 (2015), pp. 204–215.
  • [23] J. Lobo, J. Minker, and A. Rajasekar, Foundations of disjunctive logic programming, Logic Programming, MIT Press, 1992.
  • [24] L. H. Loomis and H. Whitney, An inequality related to the isoperimetric inequality, Bull. Amer. Math. Soc., 55 (1949), pp. 961–962.
  • [25] D. Marx, Tractable hypergraph properties for constraint satisfaction and conjunctive queries, J. ACM, 60 (2013), pp. 42:1–42:51.
  • [26] J. Minker, Overview of disjunctive logic programming, Ann. Math. Artif. Intell., 12 (1994), pp. 1–24.
  • [27] H. Q. Ngo, Worst-case optimal join algorithms: Techniques, results, and open problems, in Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’18, New York, NY, USA, 2018, Association for Computing Machinery, pp. 111–124.
  • [28] H. Q. Ngo, On an information theoretic approach to cardinality estimation (invited talk), in 25th International Conference on Database Theory, ICDT 2022, March 29 to April 1, 2022, Edinburgh, UK (Virtual Conference), D. Olteanu and N. Vortmeier, eds., vol. 220 of LIPIcs, Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2022, pp. 1:1–1:21.
  • [29] H. Q. Ngo, E. Porat, C. Ré, and A. Rudra, Worst-case optimal join algorithms: [extended abstract], in Proceedings of the 31st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2012, Scottsdale, AZ, USA, May 20-24, 2012, M. Benedikt, M. Krötzsch, and M. Lenzerini, eds., ACM, 2012, pp. 37–48.
  • [30] J. Radhakrishnan, Entropy and counting, Computational Mathematics, Modelling and Algorithms, (2003).
  • [31] A. Schrijver, Combinatorial Optimization - Polyhedra and Efficiency, Springer, 2003.
  • [32] D. Suciu, Applications of information inequalities to database theory problems, in 2023 38th Annual ACM/IEEE Symposium on Logic in Computer Science (LICS), 2023, pp. 1–30.
  • [33] T. L. Veldhuizen, Triejoin: A simple, worst-case optimal join algorithm, in Proc. 17th International Conference on Database Theory (ICDT), Athens, Greece, March 24-28, 2014, N. Schweikardt, V. Christophides, and V. Leroy, eds., OpenProceedings.org, 2014, pp. 96–106.
  • [34] M. Yannakakis, Algorithms for acyclic database schemes, in Very Large Data Bases, 7th International Conference, September 9-11, 1981, Cannes, France, Proceedings, IEEE Computer Society, 1981, pp. 82–94.
  • [35] R. W. Yeung, Information Theory and Network Coding, Springer Publishing Company, Incorporated, 1 ed., 2008.
  • [36] Z. Zhang and R. W. Yeung, A non-Shannon-type conditional inequality of information quantities, IEEE Trans. Information Theory, 43 (1997), pp. 1982–1986.
  • [37] Z. Zhang and R. W. Yeung, On characterization of entropy function via information inequalities, IEEE Trans. Information Theory, 44 (1998), pp. 1440–1452.