Conjunctive Queries: Unique Characterizations and Exact Learnability
Abstract.
We answer the question of which conjunctive queries are uniquely characterized by polynomially many positive and negative examples, and how to construct such examples efficiently. As a consequence, we obtain a new efficient exact learning algorithm for a class of conjunctive queries. At the core of our contributions lie two new polynomial-time algorithms for constructing frontiers in the homomorphism lattice of finite structures. We also discuss implications for the unique characterizability and learnability of schema mappings and of description logic concepts.
1. Introduction
Conjunctive queries (CQs) are an extensively studied database query language and fragment of first-order logic. They correspond precisely to Datalog programs with a single non-recursive rule. In this paper, we study two problems related to CQs. The first problem is concerned with the existence and constructability of unique characterizations. For which CQs is it the case that can be characterized (up to logical equivalence) by its behavior on a small set of data examples? And, when such a set of data examples exists, can it be constructed efficiently? The second problem pertains to exact learnability of CQs in an interactive setting where the learner has access to a “membership oracle” that, given any database instance and a tuple of values, answers whether the tuple belongs to the answer of the goal CQ (that is, the hidden CQ that the learner is trying to learn). We can think of the membership oracle as a black-box, compiled version of the goal query, which the learner can execute on any number of examples. The task of the learner, then, is to reverse engineer the query based on the observed behavior.
Note that these two problems (unique characterizability and exact learnability) are closely related to each other: a learner can identify the goal query with certainty, only when the set of examples that it has seen so far constitutes a unique characterization of the goal query. In other words, unique characterizability (by a polynomially large set of examples) is necessary, but not sufficient, for (polynomial-time) exact learnability with membership queries.
Motivating Example 1.
This example, although stylized and described at a high level, aims to convey a use case that motivated the present work. The Google Knowledge Graph is a large database of entities and facts, gathered from a variety of sources. It is used to enhance the search engine’s results for queries such as “where was Barack Obama born” with factual information in the form of knowledge panels (KGBlog2020, ). When a query triggers a specific knowledge panel, this may be the result of different triggering and fulfillment mechanisms, each of which may involve a combination of structured queries to the knowledge graph, hard-coded business logic (in a Turing-complete language), and machine learned models. This makes it difficult to understand interactions between knowledge panels (e.g., whether the two knowledge panels are equivalent or one is subsumed by the other, in terms of content and triggering). If a declarative specification of (an approximation of) the triggering and fulfillment logic for a knowledge panel can be constructed programmatically, specified in a sufficiently restrictive formalism such as Datalog rules, this provides an avenue to the above, and other relevant static analysis tasks. The efficient exact learnability with membership queries that we study in this paper, can be viewed as an idealized form of such a programmatic approach, where the membership oracle is the existing, black box, implementation of the knowledge panel, and the learning algorithm aims to produce a CQ that exactly captures it.
The above example provides a motivation for studying efficient exact learnability of CQs, and hence, for studying unique characterizability. However, we would like to emphasize that unique characterizations are of independent interest, outside the context of exact learning algorithms. Indeed, uniquely characterizing examples can be used, for instance, for elementary query engine debugging, and query visualization and explanation.
As it turns out, the above problems about CQs are intimately linked to fundamental properties of the homomorphism lattice of finite structures. In particular, the existence of a unique characterization for a CQ can be reduced to the existence of a frontier in the homomorphism lattice for an associated structure , where, by a “frontier” for , we mean a finite set of structures that cover precisely the set of structures homomorphically strictly weaker than , that is, such that (cf. Figure 1).
Known results (FoniokNT08, ; AlexeCKT2011, ) imply that not every finite structure has such a frontier, and, moreover, a finite structure has a frontier if and only if the structure (modulo homomorphic equivalence) satisfies a structural property called c-acyclicity. These known results, however, are based on exponential constructions, and no polynomial algorithms for constructing frontiers were previously known.
Main Contribution 1 (Polynomial-time algorithms for constructing frontiers).
We show that, for c-acyclic structures, a frontier can in fact be computed in polynomial time. More specifically, we present two polynomial-time algorithms. The first algorithm takes any c-acyclic structure and produces a frontier consisting of structures that are themselves not necessarily c-acyclic (Sect. 3.2). The second algorithm applies to a more restricted class of acyclic structures but yields a frontier consisting entirely of structures belonging to the same class, that is, the class of structures in question is frontier-closed (Sect. 3.3).
We use these to obtain new results on the existence and efficient constructability of unique characterizations for CQs:
Main Contribution 2 (Polynomial Unique Characterizations for Conjunctive Queries).
We show that a CQ is uniquely characterizable by polynomially many examples, precisely if (modulo logical equivalence) it is c-acyclic. Furthermore, for c-acyclic CQs, a uniquely characterizing set of examples can be constructed in polynomial time. In the special case of acyclic and c-connected CQs, a uniquely characterizing set of examples can be constructed consisting entirely of queries from the same class (Sect. 4).
Using the above results as a stepping stone, we obtain a polynomial-time exact learning algorithm for the class of c-acyclic CQs.
Main Contribution 3 (Polynomial-Time Learnability with Membership Queries).
We show that c-acyclic CQs are efficiently exactly learnable in Angluin’s model of exact learnability with membership queries (Angluin88, ) (Sect. 5).
The restriction to c-acyclic CQs in this learnability result is natural, given that, as we mentioned above, exact learnability with membership queries requires the existence of a finite uniquely characterizing set of examples. Note however, that our results do not preclude the possibility that there exist larger efficiently exactly learnable classes of CQs: even if a class includes non-c-acyclic CQs, it may still be possible for every CQ to be uniquely characterizable within the class .
We mainly focus on positive and negative examples in this paper. Another natural type of data example is a pair where is an input instance and is the entire relation that is computed by the query on . We discuss this in Section 6, where we point out that all our results on characterizability and learnability remain true also when considering such data examples.
Finally, although our primary interest is in conjunctive queries, we show that our results also have implications for schema mappings and description logic concepts:
Main Contribution 4 (Schema Mappings and Description Logic Concepts).
As a further corollary to the above, in Sect. 7, we obtain a number of results regarding the existence of polynomial unique characterizations, as well as exact learnability, for LAV (“Local-As-View”) schema mappings and for description logic concepts for the lightweight description logic (Baader2017:introduction, ).
Related Work
Unique characterizations for CQs were first studied in (MannilaR86, ) in the context of automatic test data generation. More precisely, the authors propose the concept of an “adequate test case”, which is a database instance that can be used to distinguish a given CQ from all other, non-equivalent CQs from a given class. In our terminology, this corresponds to a uniquely characterizing input-output example (cf. Section 6). A positive result was obtained in (MannilaR86, ) for restricted classes of self-join-free CQs, by establishing a relationship to Armstrong databases (Armstrong1974, ). In (AlexeCKT2011, ), the authors study unique characterizations for various classes of schema mappings; we will make use of some of the technical results from (AlexeCKT2011, ), and in Section 7, we will discuss an application of our results to LAV schema mappings. In (Staworko15:characterizing, ), the authors study unique characterizability for XML twig queries.
Related work on learning CQs will be discussed in Section 5.
An earlier, extended-abstract version of this paper was published in (tCD2021:conjunctive, ). The present paper extends this conference version with additional results. In particular, Theorem 3.12 was shown in (tCD2021:conjunctive, ) only for the case of (i.e., for unary CQs), and is generalized here to all . Furthermore, the treatment of input-output examples in Section 6 has been added.
Outline
Section 2 reviews basic facts and definitions. In Section 3, we present our two new polynomial-time algorithms for constructing frontiers for finite structures with distinguished elements. We also review a result by (NesetrilOssona2008, ), which implies the existence of (not necessarily polynomially computable) frontiers w.r.t. classes of structures of bounded expansion. In Section 4, we apply these algorithms to show that a CQ is uniquely characterizable by polynomially many examples, precisely if (modulo logical equivalence) it is c-acyclic. Furthermore, for c-acyclic CQs, a uniquely characterizing set of examples can be constructed in polynomial time. In the special case of unary, acyclic, connected CQs, a uniquely characterizing set of examples can be constructed consisting entirely of queries from the same class. In Section 5, we further build on these results, and we study the exact learnability of CQs. In Section 6, we consider another type of data examples, namely input-output examples. Section 7, finally, presents applications to schema mappings and description logic concepts.
2. Preliminaries
Schemas, Structures, Homomorphisms, Cores
A schema (or, relational signature) is a finite set of relation symbols , where each relation as an associated arity . For , by a structure over with distinguished elements we will mean a tuple , where is a finite structure (in the traditional, model-theoretic sense) over the schema , and are elements of the domain of . Note that all structures, in this paper, are assumed to be finite, and we will drop the adjective “finite”. By a fact of a structure we mean an expression of the form where the tuple belongs to the relation in . Given two structures and over the same schema, where and , a homomorphism is a map from the domain of to the domain of , such that preserves all facts (i.e., for each fact of , is a fact of ), and such that for . When such a homomorphism exists, we will also say that “homomorphically maps to” and we will write . We say that and are homomorphically equivalent if and . We occasionally write to say that and .
A structure is said to be a core if there is no homomorphism from the structure in question to a proper substructure (HN92, ). It is known (HN92, ) that every structure has a substructure to which it is homomorphically equivalent and that is a core. This substructure, moreover, is unique up to isomorphism, and it is known as the core of .
We will make use of the following technical lemma in several places later on in this paper:
Lemma 2.1.
Let be a core structure, and let . If there is a homomorphism , then must be injective, and moreover, in this case, there is such with the additional property that the composition of with is the identity map on .
Proof.
The composition of with is an endomorphism of (that is, a homomorphism from the structure to itself). It is a well-known property of cores that every endomorphism is an automorphism (that is, an isomorphism from the structure to itself). Therefore, must be injective. Furthermore, by composing with the inverse of the automorphism, we ensure that its composition with is the identity function. ∎
Fact Graph, FG-Connectedness, FG-Disjoint Union
The fact graph of a structure is the undirected graph whose nodes are the facts of , and such that there is an edge between two distinct facts if they share a non-distinguished element, i.e., there exists an element of the domain of that is distinct from the distinguished elements a, such that occurs in both facts. We say that is fg-connected if the fact graph is connected. A fg-connected component of is a maximal fg-connected substructure of . If and are structures with the same distinguished elements, and whose domains are otherwise (except for these distinguished elements) disjoint, then the union of these two structures will be called a fg-disjoint union and will be denoted as . The same construction naturally extends to finite sets of structures. It is easy to see that every structure is equal to the fg-disjoint union of its fg-connected components. See also (Fagin2008towards, ; tencate2009laconic, ), where fg-connected components are called fact blocks.
Direct Product, Homomorphism Lattice
Given two structures and over the same schema, where and , the direct product is defined, as usual, as , where the domain of is the Cartesian product of the domains of and , and where the facts of are all facts for which it holds that is a fact of and is a fact of . The direct product of a finite collection of structures is defined analogously.
For a fixed schema and , the collection of homomorphic-equivalence classes of structures over with distinguished elements, ordered by homomorphism, forms a lattice. Specifically, the above direct product operation is a meet operation in the lattice-theoretic sense: homomorphically maps to both and , and a structure homomorphically maps to if and only if it homomorphically maps to both and . The join operation of the lattice is a little more tedious to define, and we only sketch it here, as it is not used in the remainder of the paper. For a structure with , by the isomorphism type of the distinguished elements we will mean the equivalence relation over induced by the tuple . When two structures have the same isomorphism type of distinguished elements, their join is simply the fg-disjoint union as defined earlier. In the general case, one must first compute the smallest equivalence relation over that refines the isomorphism type of distinguished elements of both structures, and factor both structures through this equivalence relation, before taking their fg-disjoint union.
For structures without distinguished elements, this lattice has been studied extensively (cf. for instance (HellNesetril2004, ; nesetril2012sparsity, )). The above exposition shows how to lift some of the fundamental constructions to structures with distinguished elements. As we will see, it will be important in much of this paper to consider structures with distinguished elements, as these distinguished elements, intuitively, correspond to the free variables of a CQ.
Incidence Graph, Acyclicity, C-Acyclicity
Given a structure , the incidence graph of is the bipartite multi-graph containing all elements of the domain of as well as all facts of , and an edge whenever is an element and is a fact in which occurs. Whenever an element occurs more than once in the same fact , the incidence graph contains a distinct edge for every occurrence of in . We will call a structure acyclic (also known as Berge-acyclic (Fagin83:acyclicity, )) if the incidence graph of is acyclic; is said to be c-acyclic if every cycle in its incidence graph contains at least one distinguished element, i.e., at least one element in a. In particular, acyclicity implies that no element occurs twice in the same fact, and c-acyclicity implies that no non-distinguished element occurs twice in the same fact. In the case without distinguished elements, c-acyclicity is equivalent to acyclicity. The concept of c-acyclicity was first introduced in (AlexeCKT2011, ) in the study of unique characterizability of GAV schema mappings (cf. Section 7 for more details). A straightforward dynamic-programming argument shows (Dalmau02:constraint, ):
Proposition 2.2.
For c-acyclic structures and (over the same schema and with the same number of distinguished elements), we can test in polynomial time whether . The core of a c-acyclic structure can be computed in polynomial time.
C-Connectedness
We say that a structure is c-connected if every connected component of its incidence graph contains at least one distinguished element. Note that this condition is only meaningful for structures with at least one distinguished element, and that it differs subtly from the condition of fg-connectedness we defined above. For example, the structure consisting of the facts and with distinguished elements , is c-connected but is not fg-connected. For any structure , we denote by the (unique) maximal c-connected substructure, that is, the substructure containing everything reachable from the distinguished elements.
Proposition 2.3.
If is c-connected, then iff .
Conjunctive Queries
Let . A -ary conjunctive query (CQ) over a schema is an expression of the form where is a sequence of variables, and where each is an atomic formula using a relation from and using variables as arguments only. Note that may use variables from x as well as other variables. In addition, it is required that each variable in x occurs in at least one conjunct . This requirement is referred to as the safety condition.
Note that, for simplicity, this definition of CQ does not allow the use of constants. Many of the results in this paper, however, can be extended in a straightforward way to CQs with a fixed finite number of constants (which can be simulated using additional free variables).
If is a structure over the same schema as , we denote by the set of all -tuples of values that satisfy the query in . We write if and are queries over the same schema, and of the same arity, and holds for all structures . We say that and are logically equivalent if and both hold. We refer to any textbook on database theory for a more detailed exposition of the semantics of CQs, and we will restrict ourselves to giving an equivalent presentation of the semantics of CQs through canonical structures and the Chandra-Merlin theorem.
There is a well-known correspondence between -ary CQs over a schema and structures over with distinguished elements. In one direction, we can associate to each -ary CQ over the schema a corresponding structure over with distinguished elements, namely , where the domain of is the set of variables occurring in , and the facts of are the conjuncts of . We will call this structure the canonical structure of . Note that every distinguished element of occurs in at least one fact, as follows from the safety condition of CQs. Conversely, consider any structure , with , such that every distinguished element occurs in at least one fact of . We can associate to a -ary canonical CQ, namely the CQ that has a variable for every value in the domain of occurring in at least one fact, and a conjunct for every fact of .
By the classic Chandra-Merlin Theorem (CM77, ), a tuple a belongs to if and only if there is a homomorphism from to ; and holds if and only if there is a homomorphism from to . Finally, and are logically equivalent if and only if and are homomorphically equivalent.
Exact Learning Models, Conjunctive Queries as a Concept Class
Informally, an exact learning algorithm is an algorithm that identifies an unknown goal concept by asking a number of queries about it. The queries are answered by an oracle that has access to the goal concept. This model of learning was introduced by Dana Angluin, cf. (Angluin88, ). In this paper, we consider the two most extensively studied kinds of oracle queries: membership queries and equivalence queries. We will first review basic notions from computational learning theory, such as the notion of a concept, and then explain what it means for a concept class to be efficiently exactly learnable with membership and/or equivalence queries.
Let be a (possibly infinite) set of examples. A concept over is a function , and a concept class is a collection of such concepts. We say that is a positive example for a concept if , and that is a negative example for if .
Conjunctive queries (over a fixed schema and with a fixed arity ) are a particular example of such a concept class, where the example space is the class of all structures over with distinct elements, and where an example is labeled as positive if the tuple a belongs to , and negative otherwise.
It is always assumed that concepts are specified using some representation system so that one can speak of the length of the specification of a concept. More formally, a representation system for is a string language over some finite alphabet, together with a surjective function . By the size of a concept , we will mean the length of the smallest representation. Similarly, we assume a representation system, with a corresponding notion of length, for the examples in . When there is no risk of confusion, we may conflate concepts (and examples) with their representations.
Specifically, for us, when it comes to structures, any natural choice of representation will do; we only assume that the length of the specification of a structure (for a fixed schema) is polynomial in the domain size, the number of facts and the number of distinguished elements. Likewise for CQs, we assume that the length of the representation of a CQ is polynomial in that of its canonical structure.
For every concept , we denote by the membership oracle for , that is, the oracle that takes as input an example and returns its label, , according to . Similarly, for every concept , we denote by , the equivalence oracle for , that is, the oracle that takes as input the representation of a concept and returns “yes”, if , or returns a counterexample otherwise (that is, an example such that ). An exact learning algorithm with membership and/or equivalence queries for a concept class is an algorithm alg that takes no input but has access to the membership oracle and/or equivalence oracle for some concept , which will be called the goal concept. Importantly, while the algoritm may interact with the oracle(s), it does not know which concept is the goal concept. 111It is common in the learning theory literature to assume that the learning algorithm is given an upper bound on the size of the goal concept as input. However, it turns out that such an assumption is not needed for any of our positive results concerning learnability. Intuitively, the algorithm alg must determine by asking oracle queries. More precisely, for every choice of , alg must terminate after a finite amount of time, and output (some representation of) the goal concept . This notion was introduced by Angluin (Angluin88, ), who also introduced the notion of a polynomial-time exact learning algorithm. We say that an exact learning algorithm alg with membership and/or equivalence queries runs in polynomial time if there exists a two-variable polynomial such that at any point during the run of the algorithm, the time used by alg up to that point (counting one step per oracle call) is bounded by , where is the size of the goal concept and the size of the largest counterexample returned by calls to the equivalence oracle up to that point in the run ( if no equivalence queries have been used). A concept class is efficiently exactly learnable with membership and/or equivalence queries if there is an exact learning algorithm with membership and/or equivalence queries for that runs in polynomial time.
There is a delicate issue about this notion of polynomial time that we now discuss. One might be tempted to relax the previous definition by requiring merely that the total running time is bounded by . However, this change in the definition would give rise to a wrong notion of a polynomial-time algorithms in this context by way of a loophole in the definition. Indeed, under this change, one could design a learning algorithm that, in a first stage, identifies the goal hypothesis by (expensive) exhaustive search and that, once this is achieved, forces —by asking equivalence queries with hypotheses that are appropriate modifications of the goal concept— the equivalence oracle to return large counterexamples that would make up for the time spent during the exhaustive search phase.
3. Frontiers in the homomorphism lattice of structures
In this section, we define frontiers, as well as the relation notions of gap pairs and (restricted) homomorphism dualities, and we will discuss their relationships. We present two polynomial-time methods for constructing frontiers.
For the applications in the next sections, it is important to consider structures with distinguished elements. These distinguished elements, intuitively, correspond to the free variables of a CQ. Specifically, Proposition 4.2 in Section 4 will link unique characterizations for -ary CQs to frontiers for structures with distinguished elements. For this reason, all the results in this section are stated for structures with distinguished elements.
Definition 3.1.
Fix a schema and , and let be a class of structures with distinguished elements and let be a structure with distinguished elements. A frontier for w.r.t. , is a finite set of structures such that
-
(1)
for all .
-
(2)
for all .
-
(3)
For all with and , we have that for some .
See Figure 1 for a graphical depiction of a frontier.
The notion of a frontier is closely related to that of a gap pair. While gap pairs will not play an important role, we will explain the relationship here to provide context. A pair of structures with is said to be a gap pair if , and every structure satisfying and is homomorphically equivalent to either or (NesetrilTardif2000, ). The same concept applies to structures with distinguished elements. It is easy to see that any frontier for a structure must contain (modulo homomorphic equivalence) all structures such that is a gap pair.
Example 3.2.
Let . The structure consisting of facts and (with distinguished element ) has a frontier of size 2 (w.r.t. the class of all finite structures), namely where consists of the facts and consists of the facts , respectively. Note that and are gap pairs. It can be shown that the structure has no frontier of size 1 (as such a frontier would have to consist of a structure that contains both facts and ).
For another example, consider the structure consisting of facts and . It is the right hand side of a gap pair (the left hand side being the structure consisting of the facts and ), but has no frontier as follows from Theorem 3.7 below.
Frontiers are also closely related to (generalized) homomorphism dualities (FoniokNT08, ), and we will be making use of results about homomorphism dualities. We say that a structure has a finite duality w.r.t. a class if there is a finite set of structures such that for all , iff for all , . The set may contain structures that are not in . If is the set of all structures (over the same schema as ), we simply say that has a finite duality.222We note here that in the literature on Constraint Satisfaction, it is usual to consider the ’other side’ of the duality, i.e, a structure is said to have finite duality if there exists a finite set of structures such that for every structure , iff for all , .
Example 3.3.
In the realm of digraphs, viewed as relational structures without distinguished elements with a single binary relation, every directed path of, say, nodes has finite duality (w.r.t. the class of digraphs). Indeed, it is not difficult to verify that for every digraph , iff where is the digraph with nodes and edges . (This example is known as the Gallai-Hasse-Roy-Vitaver Theorem. Amusingly, this result was obtained and published independently by all these four researchers, each in a different language, in the 1960s).
The next lemma is a minor variation of a result from (NesetrilTardif2000, ).
Lemma 3.4.
Let be any class of structures.
-
(1)
If a structure has a finite duality w.r.t. then has a frontier w.r.t. .
-
(2)
If a structure has a frontier w.r.t. and is closed under direct products, then has a finite duality w.r.t. .
Proof.
1. Let be a finite set of structures that forms a duality for . Then is a frontier for . This follows immediately from the fact that, for all with , we have that if and only if .
2. We use a construction from (NesetrilTardif2000, ) involving an exponentiation operation on structures. Let and be structures (without distinguished elements) over the same schema. Then we denote by the structure where
-
•
the domain of is the set of all functions from the domain of to the domain of
-
•
a fact belongs to if for every fact of the form , the fact belongs to .
This construction is characterized by the property that, for all structures , if and only if (HellNesetril2004, ).
Let be a frontier for . Let
We claim that forms a finite duality for . Consider any structure . We must show that homomorphically maps to a structure in iff .
Suppose that there is a homomorphism for some . Then where . Note that, indeed, . Since and is a frontier, it follows that and therefore, by the properties of direct products, .
Conversely, suppose . Then . Hence, there is a homomorphism for some . It follows that where with the function given by . Note that and hence . ∎
Note that the construction of the frontier from the duality is polynomial, while the construction of the duality from the frontier involves an exponential blowup. The following example shows that this is unavoidable.
Example 3.5.
The path , viewed as a structure without any distinguished elements, has a frontier (w.r.t. the class of all finite structures) of size polynomial in , as will follow from Theorem 3.8 below. It is known, however, that any finite duality for this structure must involve a structure whose size is exponential in , and the example can be modified to use a fixed schema (cf. (NesetrilTardif2005, )).
3.1. Frontiers for classes with bounded expansion
The notion of a class of graphs with bounded expansion was introduced in (nesetril2018grad, ). Intuitively, a class of graphs has bounded expansion if all of its shallow minors are sparse. We will not give a precise definition here, but important examples include graphs of bounded degree, graphs of bounded treewidth, planar graphs, and any class of graphs excluding a minor. The same concept of bounded expansion can be applied also to arbitrary structures: a class of structures is said to have bounded expansion if the class of Gaifman graphs of structures in has bounded expansion. We refer to (nesetril2012sparsity, ) for more details. Classes of structures of bounded expansion are in many ways computationally well-behaved (cf. for example (kazana2020firsts, )).
Nešetřil and Ossona de Mendez (NesetrilOssona2008, ; nesetril2012sparsity, ) show that if is any class of structures with bounded expansion, then every structure has a finite duality w.r.t. . It follows by Lemma 3.4 that also every structure has a fontier w.r.t. . Nešetřil and Ossona de Mendez (NesetrilOssona2008, ; nesetril2012sparsity, ) only consider connected structures without distinguished elements,333Note that the various notions of connectedness, such as based on the incidence graph, fact graph, or Gaifman graph, all coincide for relational structures without distinguished elements. but their result extends in a straightforward way to the general case of structures with distinguished elements. Furthermore, it yields an effective procedure for constructing frontiers, although non-elementary (i.e., not bounded by a fixed tower of exponentials).
Theorem 3.6 (from (NesetrilOssona2008, ; nesetril2012sparsity, )).
Let be any class of structures that has bounded expansion. Then every structure has a frontier w.r.t. , which can be effectively constructed.
Proof.
Nešetřil and Ossona de Mendez (NesetrilOssona2008, ) stated their result for the case where is a connected structure without distinguished elements. They show that, every such structure has a finite duality w.r.t. , and hence, by Lemma 3.4, also a fontier w.r.t. . The result extends to disconnected structures through standard arguments: let be any structure (without distinguished elements). We may assume without loss of generality that is a core (because every structure is homomorphically equivalent to its core, and hence every frontier for the latter is a frontier for the former). Let be the connected components of . Since is a core, are pairwise homomorphically incomparable. Take all structures of the form , for a structure belonging to the frontier of (for ). It is straightforward to show that this yields a frontier for .
Next, we show how to extend this to structures with distinguished elements. For any structure with , let be the structure (without distinguished elements) over expanded schema with additional unary predicates , where each denotes . Since and have the same Gaifman graph, and since has bounded expansion, the same holds for . Therefore, it suffices to show that whenever has a frontier w.r.t. , then has a frontier w.r.t. .
Let be a frontier for w.r.t. . Now consider all ways of taking a structure and choosing one element per unary predicate . In this way we obtain a set of structures with distinguished elements, that we claim is a frontier for w.r.t. . Note that any homomorphism for is a homomorphism from to , therefore since is a frontier for , there is no such homomorphism . It is also clear that each maps homomorphically to . Finally, consider any that maps to but not vice versa. Then and . Hence, there is a homomorphism for some . It follows that where each . ∎
3.2. Polynomial frontiers for c-acyclic structures
Alexe et al. (AlexeCKT2011, ), building on Foniok et al. (FoniokNT08, ), show that a structure has a finite duality if and only if its core is c-acyclic. By Lemma 3.4, this implies that a structure has a frontier if and only if its core is c-acyclic.
Theorem 3.7 (from (FoniokNT08, ; AlexeCKT2011, )).
For all structures , the following are equivalent:
-
(1)
has a frontier w.r.t. the class of all structures,
-
(2)
is homomorphically equivalent to a c-acyclic structure,
-
(3)
The core of is c-acyclic
One of our main results is a new proof of the right-to-left direction, which, unlike the original, provides a polynomial-time construction of a frontier from a c-acyclic structure:
Theorem 3.8.
Fix a schema and . Given a c-acyclic structure over with distinguished elements, we can construct in polynomial time a frontier w.r.t. the class of all structures over that have distinguished elements.
Note that the size of the smallest frontier is in general exponential in . Indeed, consider the single-element structure where consists of the single fact and has length . It is not hard to show that every frontier of this (c-acyclic) structure must contain, up to homomorphic equivalence, all structures of the form where consists of two facts, and , and is a sequence in which both and occur. There are exponentially many pairwise homomorphically incomparable such structures.
The proof of Theorem 3.8 is based on a construction that improves over a similar but exponential construction of gap pairs for acyclic structures given in (NesetrilTardif2000, , Def. 3.9). Our results also shed new light on a question posed in the same paper: after presenting a double-exponential construction of duals (for connected structures without distinguished elements), involving first constructing an exponential-sized gap pair, the authors ask: “It would be interesting to know to what extent the characterisation of duals can be simplified, and whether the indirect approach via density is optimal.” This question appeared to have been answered in (NesetrilTardif2005, ), where a direct method was established for constructing single-exponential size duals. Theorem 3.8 together with Lemma 3.4, however, gives another answer: single-exponential duals can be constructed by going through frontiers (i.e., “via density”) as well.
Recall the definition of fg-connectedness from the preliminaries. We first prove a restricted version of Theorem 3.8 for the special case of core, fg-connected, c-acyclic structures with the Unique Names Property. We subsequently lift these extra assumptions. A structure with has the Unique Names Property (UNP) if for all (cf. (baader2003basic, )).
Proposition 3.9.
Given a core, fg-connected, c-acyclic structure with UNP, we can construct in polynomial time (for fixed schema and number of distinguished elements ) a frontier w.r.t. the class of all finite structures. Furthermore, the frontier consists of a single structure, which has the UNP. 444 An earlier conference version of this paper had a bug in the proof of this proposition, as was pointed out to us by Raoul Koudijs (p.c.).
Proof.
Let a core fg-connected c-acyclic structure with UNP be given. To reduce notational complexity in the remainder of this proof, we will simply write instead of , even when referring to the structure including the distinguished elements.
Note that each fg-connected structure either (i) consists of a single fact containing only distinguished elements, or (ii) consists of a number of facts that all contain at least one non-distinguished element. Therefore, we can distinguish two cases:
Case 1. consists of a single fact without non-distinguished elements. Let be the structure whose domain is , where and is a fresh value distinct from ; and which contains all facts over this domain except . It is easy to see that is a homomorphism dual for . Indeed, consider any structure , and let be a copy of the fact in which each element is replaced by the corresponding element . If then contains , therefore, ; if, on the other hand, , then omits , and hence, . It follows that, the direct product constitutes a singleton frontier for . Note that this construction is polynomial because we assume that the schema and are both fixed.
Case 2. consists of one or more facts that each contain a non-distinguished element. In this case, we construct a singleton frontier where
-
•
the domain of consists of
-
(1)
all pairs where is a non-distinguished element of and is a fact of in which occurs, and
-
(2)
All pairs and , where is a distinguished element of
-
(1)
-
•
a fact holds in if and only if holds in and at least one is either a fact that is different from the fact itself, or is nd
-
•
The distinguished elements b are , for .
Note that, in the above construction, id and nd are symbols (not functions), used to simplify notation by ensuring that every element of can be written as a pair. The symbols id and nd, intuitively, stand for “identity” and ”non-distinguished copy”,
We claim that is a frontier for .
It is clear that the natural projection is a homomorphism.
We claim that there is no homomorphism . Assume, for the sake of a contradiction, that there was such a homomorphism. By Lemma 2.1, we may assume that the composition of and is the identity function on . In particular, this means that maps each distinguished element to and for each non-distinguished element of , for some fact . For a non-distinguished element , let us denote by the unique fact for which .
We will consider “walks” in of the form
with , where
-
(1)
are non-distinguished elements,
-
(2)
, and
-
(3)
and co-occur in fact ,
Since is c-acyclic, the length of any such sequence is bounded by the diameter of (otherwise some fact would have to be traversed twice in succession, which would violate condition 2). Furthermore, trivially, such a walk of length exists: just choose as an arbitrary non-distinguished element of . Furthermore, we claim that any such finite sequence can be extended to a longer one: let the fact be of the form (where for some ). Since is a homomorphism, it must map to some fact of , where some is a fact that is different from . We can choose as our element . Thus, we reach our desired contradiction.
Finally, consider any with and . We construct a function as follows: consider any element of , and let . If is a distinguished element (in which case is, too), we set . If is not a distinguished element but is, we set . Otherwise, we proceed as follows: since is c-acyclic and fg-connected, for each non-distinguished element of (other than itself) there is a unique minimal path in the incidence graph, containing only non-distinguished elements, from to . We can represent this path by a sequence of the form
where each is a fact of in which occurs in the -th position and occurs in the -th position. We can partition the non-distinguished elements of (other than itself) according to the last fact on this path, that is, . Furthermore, it follows from fg-connectedness that each fact of contains a non-distinguished element. It is easy to see that if a fact contains multiple non-distinguished elements (other than ) then they must all belong to the same part of the partition as defined above. Therefore, the above partition on non-distinguished elements naturally extends to a partition on the facts of . Note that if contains any facts in which is the only non-distinguished element, we will refer to these facts as “local facts” and they will be handled separately. In this way, we have essentially decomposed into a union , where contains all local facts and each “component” is a substructure of consisting of non-local facts, in such a way that (i) different substructures do not share any facts with each other, (ii) different substructures do not share any elements with each other, except for and distinguished elements (from a), (iii) each contains precisely one fact involving .
Since we know that , it follows that either some local fact of does not map to (when sending to ), or some “component” of does not map to through any homomorphism sending to . In the first case, we choose such local fact and set . In the second case, we choose such a component (if there are multiple, we choose one of minimal size) and let be the unique fact in that component containing (that is, is the fact that by construction connected the non-distinguished elements of the component in question to ). We set . Intuitively, when , then is a fact of involving that “points in a direction where homomorphism from back to fails”.
We claim that is a homomorphism from to : let be a fact of . Then holds in . Let as constructed above (where, we recall, if is a distinguished element, and if is not a distinguished element but is). Also recall that at least one is a non-distinguished element. To show that holds in , it suffices to show that some is different from the fact itself, or is equal to nd. If some non-distinguished is mapped by to a distinguished element, then , and we are done. This leaves us with the case where some is a non-distinguished element, and for all non-distinguished , is of the form for a fact . If one of these is a local fact, then it follows immediately from the construction that . Otherwise, let be the size of the smallest “component” (as defined above) of that does not homomorphically map to , and choose an element with minimal . Then, clearly, must be different from the fact itself. ∎
Next, we remove the assumptions of fg-connectedness and being a core.
Proposition 3.10.
Given a c-acyclic structure with UNP, we can construct in polynomial time a frontier w.r.t. the class of all finite structures. Furthermore, the frontier consists of structures that have the UNP.
Proof.
By Proposition 2.2, we may assume that is a core. Note that the c-acyclicity and UNP properties are preserved under the passage from a structure to its core.
Let be a structure with distinguished elements that is UNP and that is a fg-disjoint union of homomorphically incomparable fg-connected structures . By Proposition 3.9, have, respectively, frontiers , each consisting of a single structure with the UNP. We may assume without loss of generality that each consists of a structure that have the same distinguished elements a (we know that the structures in question have the UNP, and therefore, modulo isomorphism, we can assume that the distinguished elements are precisely a). Let .
We claim that is a frontier for w.r.t. .
Clearly, each structure in maps homomorphically to .
Suppose, for the sake of contradiction, that there is a homomorphism for some . Observe that must send each distinguished element to itself, and it must send each non-distinguished element to a non-distinguished element (otherwise, the composition of with the backward homomorphism would be a non-injective endomorphism on which would contradict the fact that is a core). Since is fg-connected (and because cannot send non-distinguished elements to distinguished elements), its -image must be contained either in some () or in . The former cannot happen because and are homomorphically incomparable. The latter cannot happen either, because belongs to a frontier of .
Finally, let be any structure such that there is a homomorphism but . Let be a fg-connected component of such that . Since is fg-connected, we can partition our structure as where the -image of is contained in while the -image of is disjoint from except possibly for the distinguished elements. We know that and therefore . Furthermore, we have that . Therefore, . ∎
Finally, we can prove Theorem 3.8 itself.
Proof of Theorem 3.8.
Let be c-acyclic. If it has the UNP, we are done. Consider the other case, where the sequence a contains repetitions. Let consists of the same elements without repetition (in some order). We construct a frontier for it as follows:
-
(1)
Consider the structure , which, by construction, has the UNP. Let be a frontier for (again consisting of structures with the UNP), using Proposition 3.10. Note that, through isomorphism, we may assume that each structure in has the same distinguished elements . For each , we take the structure .
-
(2)
Let be the length of the tuple a. For each function , whose range has size strictly greater than , consider structure where contains all facts over the domain , and . We take its direct product with .
It is easy to see that the set of all structures constructed above, constitutes a frontier for . Indeed, suppose a structure maps to but not vice versa. If the tuple of distinguished elements of the structure in question has the same identity type as the tuple a (i.e., the same equalities hold between values at different indices in the tuple) then it is easy to see that the structure in question must map to some structure as constructed under item 1 above. Otherwise, if the tuple of distinguished elements of the structure in question does not have the same identity type, then it is easy to see that the structure in question must map to , as constructed under item 2 above, where reflects the identity type of the distinguished elements of the structure in question. ∎
As a corollary of Theorem 3.8, we obtain the following interesting by-product:
Theorem 3.11.
Fix a schema and . The following problem is solvable in NP: given a finite set of structures and a structure (all with distinguished elements), is a frontier for w.r.t. the class of all structures? If contains a binary relation and , then it is NP-complete.
Proof.
For the upper bound, we use the fact that, if is homomorphically equivalent to a c-acyclic structure , then the core of is c-acyclic (cf. Theorem 3.7). The problem can therefore be solved in non-deterministic polynomial time as follows:
First we guess a substructure and we verify that is c-acyclic and homomorphically equivalent to . Note that the existence of such is a necessary precondition for to be a frontier of . Furthermore, c-acyclicity can be checked in polynomial time using any PTIME algorithm for graph acyclicity (recall that a structure is c-acyclic if and only if its incidence graph is acyclic after removing all nodes corresponding to distinguished elements).
Next, we apply Theorem 3.8 to construct a frontier for (and hence for ). Finally, we verify that each homomorphically maps to some and, vice versa, every homomorphically maps to some . It is not hard to see that this non-deterministic algorithm has an accepting run if and only if is a frontier for .
For the lower bound, we reduce from graph 3-colorability. Let be the structure, over a 3-element domain, that consists of the facts for all pairs with . In addition, each of the three elements is named by a constant symbol. Since is c-acyclic, by Theorem 3.7, it has a frontier . Now, given any graph (viewed as a relational structure with binary relation and without constant symbols), we have that is 3-colorable if and only if is a frontier for the disjoint union of with . To see that this is the case, note that if is 3-colorable, then the disjoint union of with is homomorphically equivalent to itself, whereas if is not 3-colorable, then the disjoint union of with is strictly greater than in the homomorphism order. ∎
3.3. A polynomially frontier-closed class of structures
We call a class of structures frontier-closed if every structure has a frontier w.r.t. , consisting of structures belonging to . If, moreover, the frontier in question can be constructed from in polynomial time, then we say that is polynomially frontier-closed.
Theorem 3.12.
Fix a schema and . The class of c-connected acyclic structures with distinguished elements is polynomially frontier-closed. 555Note that for structures with one distinguished element, c-connectedness is the same as connectedness.
In fact, the construction presented below shows that the polynomial bound holds even when the schema is treated as part of the input of the problem (although does need to be fixed, as the size of the constructed frontier depends exponentially on ).
The special case with binary relations only and
The remainder of this section is dedicated to the proof of Theorem 3.12. To simplify the presentation of the proof, we will first assume that the schema consists of binary relations only, and that . Afterwards, we will show how to lift these restrictions.
Let be a schema consisting of binary relation symbols, and fix a finite structure that is c-connected and acyclic. We will assume, in addition, that is a core, which we may do without loss of generality, because the core of an acyclic structure can be computed in polynomial time (Proposition 2.2) and the properties of c-connectedness and acyclicity are preserved under passage from a structure to its core:
Proposition 3.13.
The properties of c-connectedness and acyclicity are preserved when passing from a structure to its core.
Proof.
That acyclicity is preserved follows immediately from the fact that the core is a substructure of the original structure. For c-connectedness, the argument is as follows: let be a c-connected structure, and let be its core. It follows from the definition of a core that . By Proposition 2.3, this implies that , and hence, . It follows by the minimality property of cores that , i.e., is c-connected. ∎
We shall slightly abuse notation and, for every relation symbol and every we shall say that holds in (or is a fact of) if is a fact of . We can think of as an (oriented) tree rooted at , where every edge has been oriented away from and is labelled or depending on whether or is a fact of .
Definition 3.14 ().
For any element of , we denote by the substructure of consisting of the oriented subtree rooted at .
Definition 3.15 (rank).
For any element of , is the depth of the oriented tree .
The construction of frontiers that we will describe below, involves a recursion on rank. Note that when is maximally far from (in other words, the leafs of the oriented tree have rank 0). Furthermore, if there is an edge (in the oriented tree) then .
The frontier construction below is split into two steps: for each element , we first construct a set of structures , and, subsequently, we construct a modified set of structures .
Definition 3.16 ().
Fix any node of . Let be the children of (in the oriented tree), and let be the corresponding edge labels. We can depict as follows:
If (that is, if ) we define . Otherwise, we define where is obtained from in the following way. Set to be the result of removing from . Then, we add to a fresh isomorphic copy of each structure in , joining with the newly created copies of with a -edge. See Figure 2 for an illustration.
Observe that, for , there is a natural mapping (indeed, following the induction of the construction, we can see that each element of is either an element of or else was introduced as an isomorphic copy of an element of ).
Definition 3.17 ().
We define , where is defined as follows. Let be the natural mapping. Then, is obtained from by adding, for each element of with , a fresh isomorphic copy of together with a connecting -edge from to , where is (the newly created copy of) the parent of and is the edge label of the edge from to in .
Input (uni-relational) | single structure | single structure |
---|---|---|
structure |
Input structure | single structure | single structure |
---|---|---|
The examples in Figures 3 and 4 illustrate the construction of and . In these figures, for clarity, the distinguished element is marked by a circle. In Figure 3, all edges represent the same binary relation , and edge labels are omitted for the sake of readability. In both examples, it happens (coincidentally) that the frontier consists of a single structure.
Definition 3.23 and 3.24 further down may provide additional intuition on the operation used in Definition 3.17.
Theorem 3.18.
Let be a finite structure that is core, c-connected and acyclic, let be any node of , and let be as defined above.
-
(1)
For each , .
-
(2)
For each , .
-
(3)
Let be acyclic and c-connected. If and , then maps homomorphically to for some structure .
In particular (since ), is a frontier for w.r.t. the class of c-connected, acyclic structures with one distinguished element.
Proof.
Item 1 follows immediately from the construction: the natural projection from to is a homomorphism.
For item 2, we proceed by induction on . Item 2 holds true, trivially, when , as is empty in this case. For the inductive case, now, assume that . In this case, by definition, we know that is of the form for some child of . In other words, was obtained from by removing the subtree , and (provided ), adding a fresh isomorphic copy of each structure in joining with the newly created copy of with a -edge, where is the label of the edge in the original structure.
Consider the homomorphism given by item 1. Suppose for the sake of a contradiction that there were also a homomorphism . Let . Towards our contradiction, we perform a case distinction on . Clearly, must be one of the neighbours of in . By construction, these neighbours of in are: (i) the children in ; (ii) the copies of belonging to isomorphic copies of structures in ; and, provided , (iii) the parent, , of in , as well as the copy of introduced in Definition 3.17.
Let us first consider case (i) and (iii). In both cases, let be the composition of and . Then is a homomorphism from to whose range omits . Furthermore, extends straightforwardly to a non-injective endomorphism on by mapping all elements outside to themselves. This is a contradiction with the fact that is a core.
Finally consider case (ii). By a similar argument as the above, no element from can be mapped by to . For, in this case, the composition of and would be a homomorphism from to whose range excludes , and could then be extended to a non-injective endomorphism on , contradicting the fact that is a core. It follows that the restriction of to must be such that its range is entirely contained in for some in , i.e., it defines a homomorphism from to , which contradicts the inductive hypothesis.
Item 3 is proved by induction on . Again, item 3 holds true, trivially, when . Note that when , is a single-node structure without any relations. Therefore, it is impossible that .
Now, consider the inductive case with . Since , it must be the case that, for some child of there is no element in such that and holds in where is the label of the edge joining and . Let be the corresponding structure as constructed in Definition 3.16.
Let be the homomorphism from to given by the hypothesis. We shall construct a homomorphism from to . We have . Note that, since is acyclic and c-connected, we can similarly regard as an ordered tree rooted at . Then, for each child, , of , we define on as follows. If , then since has no homomorphism to , we apply the inductive hypothesis to define on . If is the parent of then we map entirely to the isomorphic copy of attached to introduced in Definition 3.17. If is some child of different than then we define on starting at and by increasing depth as in the homomorphism, , from to until we find some edge in such that its -image is not an edge in . When we found such edge , then we define to be the copy of attached to in according to Definition 3.17 and we extend to the rest of elements in mapping them as well to the same isomorphic copy. ∎
Definition 3.19.
The size of a structure , denoted by , is the number of facts. The total size of a set of structures is
(The definition of total size is conveniently chosen so that the total size of a set of tree-shaped structures is equal to the size of a tree-shaped structure consisting of the given forest with an additional root connected to the root of each original tree.)
Theorem 3.20 (Our construction is polynomial).
.
Proof.
We will show that . The proposition then follows immediately, by construction of . More precisely, we show that, for all elements , . The claim is proved by induction on .
If , then . If , it follows from the construction of that
where is the number of successors of . Note that we are using here the fact that , and the general fact that . ∎
Extending the result to
We now extend the result to . For the time being, we still assume that the schema consists of binary relations only.
Definition 3.21 (Skeleton and offshoots).
Let be a c-connected, acyclic structure with at least one distinguished element. A skeleton node of (with ) is any element that is either a distinguished element (i.e., ) or such that lies on a minimal path between two distinguished elements. We denote by the substructure of (with the same distinguished elements), consisting of the skeleton nodes only. It is easy to see that, in a c-connected and acyclic structure, for every element there is a unique skeleton node that is closest to , i.e., such that is connected to and such that every path from to any other skeleton node passes through . In this case, we say that is affiliated with the skeleton node . For any skeleton node of , the offshoot of in , which we will denote by , 666We acknowledge that this notation is slightly misleading as it does not reflect the dependence of on the distinguished elements a. is the structure where is the substructure of consisting of all elements affiliated with (including itself), and where is the only distinguished element.
Thus, every c-connected and acyclic structure can be thought of as consisting of a skeleton with offshoots, similar to the example depicted schematically in Figure 5.
Our frontier construction for structures with multiple distinguished elements will make use of two operations that we now define.
Definition 3.22 (Splitting).
Consider a structure with . Let be a proper partition of , that is, and are disjoint non-empty sets such that . We will denote by the structure where
-
•
is a disjoint union of two isomorphic copies of . For each element of , we will denote its two copies in by and , respectively.
-
•
For , we set where if and otherwise.
For the next definition, we need to introduce some auxiliary notation. If are elements belonging to the same connected component of some structure , we will denote by the length of the shortest path from to in the incidence graph of (where the length may be counted by the number of facts on the path). If are elements that all belong to the same connected component of we will write if (“ is closer to than is”).
Definition 3.23 (Radial extension).
Let be a homomorphism between acyclic structures with one distinguished element. We denote by the structure obtained from as follows: for each fact containing elements with , we add a disjoint isomorphic copy of (without distinguished elements), and we add a connecting fact that is a copy of in which is replaced by the isomorphic copy of in .
Note that the way in which was constructed from in Definition 3.17, is a concrete instance of the radial-extension operation.
In the statement of the next lemma, we write , where is a structure with distinguished elements, to denote the corresponding structure with distinguished elements.
Lemma 3.24.
Let and be acyclic structures with , and such that is a core. Let for some . Then .
Proof.
Clearly, extends . It is also clear from the construction that the map naturally extends to a homomorphism (using the natural projection for the additional elements of ).
Let us show now that does not map homomorphically to . For the sake of a contradiction, assume that . We can assume from Lemma 2.1 that the composition of and is injective. Since , must map some element of to , where is an element of that does not belong to . Then belongs to an isomorphic copy of that was added for some fact of , where contains elements with . Let be the copy of belonging to the respective isomorphic copy of . Since it then follows that the image according to of the nodes in the shortest path connecting with must necessarily contain all nodes in the shortest path (in ) connecting and . Note that the path connecting and must contain both and . Since it follows that the composition of and is non-injective, a contradiction. ∎
Now, let be c-connected and acyclic. We may also assume that is a core. We define to be the set of all structures obtained as follows:777Note that depends on , even though we did not make this dependence explicit in our notation here.
-
(1)
For each proper partition of the distinguished elements, we add to , provided there are and such that and are connected in . (Note that, by construction, the corresponding distinguished elements and are disconnected in .)
-
(2)
For each proper partition of the distinguished elements, and for each fact of , let be the structure obtained by extending with the fact . We add to , provided that there exists distinguished elements and , such that and are connected in by a path of some length , but there is no path of length connecting the corresponding distinguished elements and in .
-
(3)
For each skeleton node , and for every in the frontier of the offshot (as in Theorem 3.18) we add to the structure obtained from by: (i) removing the offshoot and replacing it , resulting in a new structure , and (ii) taking , where is the homomorphism from , extended to the entire structure by mapping every element outside the offshoot to itself.
The intuition behind this construction is that (1)–(3) capture different ways of homomorphically weakening an acyclic structure with designated elements: in (1), we take two connected distinguished elements and force them to become disconnected. In (2), we increase the length of the shortest path between two distinguished elements, by forcing the minimal path to go through a specified edge. In (3), we make no change to the skeleton of the structure but we weaken one of the offshoots.
Proposition 3.25.
If is c-connected, acyclic, and is a core, then the set of structures constructed above is a frontier for w.r.t. the class of c-connected acyclic structures.
Proof.
It is clear that each structure in homomorphically maps to .
We also claim that there is no homomorphism from to any structure . For (1) and (2) this follows immediately from the construction. For (3), the argument is as follows: let be the skeleton node whose offshoot was replaced, and let be the frontier structure was used as the replacement of . Suppose, for the sake of a contradiction, that . By Lemma 3.24, then, there is already a homomorphism . By Lemma 2.1, we may assume that the composition of with the natural homomorphism from to is the identity function on . Since is follows that there exists some element in such that does not belong to . However, this contradicts the fact that is the identity.
It remains to establish the last property of frontiers: let be any c-connected and acyclic structure such that there is a homomorphism and such that .
We can distinguish three cases:
The first case is where differs from in the number of connected components. Since both structures are c-connected and , this can only happen if there are distinguished elements that belong to different components in but such that are connected in . In this case, we use (1) above. Specifically, let , and let . Note that . Then it is easy to see that .
The second case is there there exists distinguished elements connected in such that the image of the path in connecting and is not injective. It follows from the acyclicity of that there exists such that . Let be the fact in joining and and let be the set containing all elements in that remain connected to after removing fact . Then consider the structure obtained as in (2) above by setting to be the set of distinguished elements in , to be the rest of distinguished elements in , and the fact of used in the construction to be the -image of . If we let and it follows directly from the construction that the distance of and in is larger than the distance of and in , and hence . Finally, it follows easily that the map sending every element to if and to otherwise defines a homomorphism from to .
The third case is where none of the previous two cases hold. We first note that if the second case does not hold then it must be the case that the restriction of to each connected component of must be injective. Furthermore, since and have the same number of connected components it follows that maps isomorphically to . It follows that there exists some node in the skeleton of such that the offshoot does not homomorphically map to , since otherwise, . Consider the offshoot of and let be the maximal substructure of that contains , is connected, and satisfies the property that is contained in and that no other element, besides , is mapped by to . Since it follows that there exists some homomorphism from to some structure in the frontier of . Now consider the structure in produced in step (3) above for and . We shall construct a homomorphism from to .
Let as constructed in step (3) and let be the maximal -connected substructure of containing such that no other element in besides is mapped by to . Note that contains . We shall start by defining on so that defines a homomorphism from to . If then we define to be . If does not belong to then it follows that cannot be in the offshot . In this case we define to be . Clearly defines as well a homomorphism from to (since contains a copy of ). It only remains to extend to all the elements in . For every maximal connected substructure of not containing any element in we extend to as follows. Since contains and is -connected it follows that there is some fact in joining elements and . By the maximality of it follows that . Necessarily and, consequently, we can define so that and every other element in is mapped according to in the copy of in introduced due to .
Incidentally, it may be worth noting that we did not even make full use of the radial extension: for the purpose of this proof, it would have sufficed in case (3) to use a restricted version of radial extension where a copy of the original structure is added only for every fact connected with . ∎
Lifting the restriction to binary relations
This concludes the proof for the case with binary relations only. We now show how to lift the result to schemas containing relations of arbitrary arity.
For a schema , let be the schema containing for each -ary relation , binary relations . For any structure over schema , let be the structure over schema whose domain consists of all elements in the domain of as well as all facts of , and containing all facts of the form where is a fact of , of the form with . Intuitively, we can think of as a bipartite encoding of the structure . Conversely, we associate to every structure over the schema a corresponding structure over the original schema , namely the structure whose domain is the same as that of and containing all facts of the form for which it is the case that satisfies . Note that but need not be isomorphic to .
Lemma 3.26.
For all structures over schema and structures over schema :
-
(1)
If then .
-
(2)
iff .
-
(3)
iff .
-
(4)
If is core, c-connected, and acyclic, then so is
-
(5)
If is acyclic, then so is .
Proof.
1. Let . It is easy to see that, for each element of that participates in at least one fact, it must be the case that belongs to the domain of . Note that elements of that do not participate in any fact can be ignored, as they can be mapped to an arbitrary element of . Finally, it is clear from the constructions that, whenever holds true in , then holds true in .
2. Suppose . We can extend to the entire domain of as follows: let be any fact of of the form . Since satisfies , this means that must satisfy . Choose any such as the image of the fact . Doing this for each fact, we obtain a mapping that extends to the entire domain of . Moreover, whenever holds in , then, by construction, holds true in . In other words is a homomorphism from to
Conversely, suppose . Let be the restriction of to elements of . We claim that the mapping is a homomorphism. Let be any fact of . Then satisfies for all . Therefore, by construction, satisfies .
3. Every homomorphism natural extends to a homomorphism from to by sending each fact to . Conversely, every homomorphism , when restricted to elements of , is clearly a homomorphism from to .
4. For acyclicity, this holds by definition (recall that acyclicity was defined by reference to the incidence graph, in the first place; and note that the incidence graph of is obtained from that of by subdividing every edge in two). Similarly, it is easy to see that whenever is c-connected, then so is . Finally, to show that core-ness is preserved, we proceed by contraposition: suppose is not a core, i.e., admits a proper endomorphism . It is not hard to see that must map elements of to elements of , and fact of to facts of . Therefore, must either map two distinct elements of to the same element, or two distinct facts of to the same fact. However, even in the latter case, the only way that this can happen is if also maps two distinct elements to the same element. It follows that also induces a proper endomorphism of .
5. By contraposition: suppose is not acyclic. Take a minimal cycle in the incidence graph of . Each edge that is part of the cycle (being a fact of ), by construction of , gives rise to a path of length 2 in . Therefore, we obtain a cycle in . ∎
Now, let be any acyclic, c-connected structure over schema . We may again assume that is a core. By Lemma 3.26(4), is also acyclic, c-connected, and core. Let be an acyclic c-connected frontier for , let . We claim that is a frontier for .
First, note that each structure in homomorphically maps to . This follows from Lemma 3.26(1), because each homomorphically maps to . Second, note that does not homomorphically map to any structure in . Indeed, suppose with . Then, by Lemma 3.26(2), , which contradicts the fact that is a frontier for . Finally, for the third property of frontiers, suppose and . Then, by Lemma 3.26(3), and . Therefore, since is a frontier for , we have that for some . It follows by Lemma 3.26(2) that . Since , this shows that maps to a structure in .
Note that consists of acyclic structures (by Lemma 3.26(5)), but may contain structures that are not c-connected. However, this is easily addressed by the following observation, which follows from Proposition 2.3: if a set of structures is a frontier for a structure w.r.t. a class of c-connected structures , then is also a frontier for w.r.t. .
This concludes the proof of Theorem 3.12.
4. Unique Characterizations for Conjunctive Queries
In this section, we study the question of when a CQ is uniquely characterizable by a finite set of positive and/or negative examples.
Definition 4.1 (Data Examples, Fitting, Unique Characterizations).
Let be a class of -ary CQs over a schema (for some ), and let be a -ary query over .
-
(1)
A data example is a structure over schema with distinguished elements. If , we call a positive example (for ), otherwise a negative example.
-
(2)
Let be finite sets of data examples. We say that fits if every example in is a positive example for and every example in is a negative example for . We say that uniquely characterizes w.r.t. if fits and every that fits is logically equivalent to .
It turns out that there is a precise correspondence between unique characterizations and frontiers. Recall that the canonical structure of a query is denoted by . Similarly, for any class of CQs , we will denote by the class of structures .
Proposition 4.2 (Frontiers vs Unique Characterizations).
Fix a schema and . Let be any -ary CQ over and a class of -ary CQs over .
-
(1)
If is a frontier for w.r.t. , then uniquely characterizes w.r.t. .
-
(2)
Conversely, if uniquely characterizes w.r.t. , then is a frontier for w.r.t. .
Proof.
1. Let be a frontier for w.r.t. , let be a conjunctive query that fits . From the fact that the canonical structure is a positive example for , it follows that there is a homomorphism from to . Furthermore, since all structures in the set are negative examples for , we know that does not homomorphically map to any of these structures. Since is a frontier w.r.t. and , we can conclude that homomorphically maps to . Therefore, and are homomorphically equivalent, which implies that and are logically equivalent.
2. Let uniquely characterize w.r.t, , and let . It follows from the basic properties of the direct product operation that each structure in homomorphically maps to . Furthermore, if there were a homomorphism from to some structure , then there would be a homomorphism from to , which would imply that is a positive example for , which we know is not the case. Finally, consider any such that and . This implies that and are not logically equivalent, and hence, since , the two queries must disagree on some example in or . However, it follows from the fact that , that all positive examples for are also positive examples for . Therefore, some structure must be a positive example for , that is, . It follows that . ∎
Proposition 4.2 allows us to take the results on frontiers from the previous section, and rephrase them in terms of unique characterizations. Incidentally, note that results in (AlexeCKT2011, ) imply an analogous relationship between finite dualities and uniquely characterizing sets of examples for unions of conjunctive queries. We need two more lemmas. Recall that a structure corresponds to a conjunctive query only if every distinguished element occurs in at least one fact. Let us call such structures safe. The following lemmas, essentially, allow us to ignore unsafe structures, thereby bridging the gap between structures and CQs.
Lemma 4.3.
Let be a -ary CQ over schema and a class of -ary CQs over . If is uniquely characterized w.r.t. by positive and negative examples , then consists of safe structures and is uniquely characterized w.r.t. by .
Proof.
It suffices to observe that if is not safe, then is a negative example for every conjunctive query, and therefore, cannot meaningfully contribute to characterizing any given conjunctive query w.r.t. a class of CQs. ∎
Lemma 4.4.
A safe structure has a frontier w.r.t. all structures if and only if it has a frontier w.r.t. the class of all safe structures.
Proof.
The left-to-right direction is trivial. The right-to-left direction relies on a homomorphism duality argument of sorts: let be any safe structure that has a frontier w.r.t. the class of all safe structures. Let be its schema and the number of distinguished elements. For every non-empty set , we will denote by the structure with two elements, denoted and , that contains all possible facts involving only and no other facts; and is the tuple , where if , and otherwise. Note that, by construction, none of these structures is safe. It is not hard to see that a structure is unsafe if and only if it admits a homomorphism to a structure in the set . Now, let . Then is a frontier for w.r.t. all structures. To see this, note that each structure in homomorphically maps to . Furthermore, does not map to any structure in (by initial assumption) or in (because consists of unsafe structures while is safe). Finally, consider any . If is safe, then it maps to a structure in . Otherwise, it maps to a structure in and hence also to the corresponding structure in . In either case, it maps to a structure in . ∎
Putting everything together, we obtain the main result of this section. We call a CQ c-acyclic (or acyclic, or c-connected) if the structure is c-acyclic (resp. acyclic, c-connected).
Theorem 4.5.
Fix a schema and fix .
-
(1)
If is a class of -ary CQs such that has bounded expansion, then every CQ is uniquely characterizable w.r.t. by finitely many positive and negative examples (which can be effectively constructed from the query).
-
(2)
A -ary CQ is uniquely characterizable by finitely many positive and negative examples (w.r.t. the class of all -ary CQs) iff is logically equivalent to a c-acyclic CQ. Moreover, for a c-acyclic CQ, a uniquely characterizing set of examples can be constructed in polynomial time.
-
(3)
Assume and let be the class of -ary CQs that are c-connected and acyclic. Then every is uniquely characterizable w.r.t. by finitely many positive and negative examples belonging to . Moreover, the set of examples in question can be constructed in polynomial time.
Remark 1.
Examples showing that Theorem 4.5(3) cannot easily be generalized.
Theorem 4.5(3) applies to CQs of arity that are c-connected, and acyclic. None of these restrictions can be dropped. Recall that a CQ of arity zero is called a Boolean CQ.
Theorem 4.6.
The Boolean acyclic connected CQ is not characterized, w.r.t. the class of Boolean acyclic connected CQs, by finitely many acyclic positive and negative examples.
Proof.
For the sake of a contradiction, assume that is a finite collection of acyclic positive and negative examples that uniquely characterizes within the class of Boolean acyclic connected CQs. Let be a bound on the size of the examples in and . Consider now the following Boolean acyclic connected conjunctive query (cf. Figure 6):
Note that and are not logically equivalent since does not contain any directed path of length . The mapping that sends to defines a homomorphism from (the canonical structure of) to (the canonical structure of) . This implies that , that is, every positive example for is also a positive example for , and hence, in particular, fits all the positive examples in . Since and are not logically equivalent, must therefore disagree with on one of the negative examples, that is, some example is a positive example for . Let be a witnessing homomorphism.
Claim: There are two elements and in at distance (distance, here, is measured in the underlying tree of ) such that .
Proof of claim: Since has at most elements it follows that is not injective, so there are two elements , of that are mapped by to the same element. Now, consider the unique path in (the underlying tree of) connecting and . Then, is a walk in (the underlying tree of) . Indeed, it is a closed walk since . Now, since is a closed walk in a tree, it must backtrack in some vertex . This means that . End of proof of claim.
Let and be the two elements at distance that are mapped by to the same element in . It follows from the definition of that, since and are at distance , . We first consider the case where . In this case, we can assume wlog. that . It then follows that is a directed cycle in , a contradiction because , and was assumed to consist of acyclic structures. Next, consider the case where . Then . We can assume without loss of generality that is even and is odd. Again it follows directly by the construction of that . Note that is a directed path in of length . Similarly is a directed path in of length . Since it follows that we can concatenate the two directed paths obtaining a directed path of length . Consequently homomorphically maps to , a contradiction because . ∎
This shows that in Theorem 4.5(3), the restriction to non-Boolean queries cannot be dropped. Similarly, the restriction to c-connected queries cannot be dropped, and acyclicity cannot be replaced by the weaker condition of c-acyclicity.
Theorem 4.7.
The unary acyclic CQ is not uniquely characterizable, w.r.t. the class of unary acyclic CQs, by finitely many acyclic positive and negative examples.
Proof.
Assume for the sake of a contradiction that there are finitely many acyclic positive and negative examples that uniquely characterizes w.r.t. the class of unary acyclic CQs. We will construct acyclic positive and negative examples that uniquely characterizes w.r.t. the class of Boolean acyclic c-connected CQs, contradicting Theorem 4.6. For each example in , we take every connected component of (without any distinguished element) and add it to or , depending on whether the component in question satisfies . Note that each of these examples is acyclic. By construction, the query fits . We claim that uniquely characterizes w.r.t. the class of Boolean acyclic connected CQs. To see this, let be any Boolean acyclic connected query that fits . Now, let be the unary acyclic (disconnected) query that is the conjunction of with . Then it is not hard to see that fits , and therefore, is equivalent to . From this, it easily follows that must be equivalent to : any counterexample to the equivalence of and can be extended to a counterexample to the equivalence of and simply by adding an isolated element satisfying . ∎
Theorem 4.8.
The unary c-acyclic c-connected CQ is not uniquely characterizable, w.r.t. the class of unary c-acyclic c-connected CQs, by finitely many c-acyclic positive and negative examples.
Proof.
Assume towards a contradiction that is uniquely characterized, w.r.t. the class of unary c-acyclic c-connected CQs, by a finite collection of c-acyclic positive and negative examples . Let be the set that contains, for each , the substructure of consisting of all elements satisfying . Note that this substructure does not include itself (because if was true in , then would have been a positive example for ) and therefore must be acyclic. We claim that uniquely characterizes w.r.t. the class of Boolean acyclic connected CQs, contradicting Theorem 4.6.
Let be any Boolean acyclic connected conjunctive query that fits . Let be the conjunctive query expressing that holds true in the substructure consisting of elements reachable by an -edge from . Note that can be obtained by extending with an additional conjunct for every variable occurring in . Also, note that is c-acyclic and c-connected. Since is a positive example for , there is a homomorphism . By extending in the obvious way, we have that . Therefore, , and hence every positive example for is also a positive example for , and hence fits all the positive examples in . Similarly, fits all negative examples in , because, if it did not, then there would be a homomorphism from to some , from which it would clearly follow that homomorphically maps to the corresponding , which we know is not the case. Therefore, fits all examples in , and hence, is logically equivalent to . It follows that must be logically equivalent to : any counterexample for the equivalence of and can be extended to a counterexample for the equivalence of and by adding a fresh distinguished element connected to all existing elements by means of an -edge. ∎
5. Exact learnability with membership queries
The unique characterization results in the previous section immediately imply (not-necessarily-efficient) exact learnability results:
Theorem 5.1.
Fix a schema and . Let be a computably enumerable class of -ary CQs. If has bounded expansion, then is exactly learnable with membership queries.
The learning algorithm in question simply enumerates all queries and uses membership queries to test if the goal query fits the uniquely characterizing set of examples of (cf. Theorem 4.5(1)). Unfortunately, this learning algorithm does not run in polynomial time. Indeed, the number of membership queries is not known to be bounded by any fixed tower of exponentials (even for classes for which membership can be tested in polynomial time). For the special case of c-acyclic queries, we can do a little better by taking advantage of the fact that a uniquely characterizing set of examples can be constructed in polynomial time. Indeed, the class of c-acyclic -ary CQs is exponential-time exactly learnable with membership queries: the learner can simply enumerates all c-acyclic queries in order of increasing size. For each query (starting with the smallest query), it uses Theorem 4.5(2) to test, using polynomially many membership queries, whether the goal query is equivalent to . After at most many attempts (where is the size of the goal query), the algorithm is guaranteed to find a query that is equivalent to the goal query.888Similarly, by Theorem 4.5(3), the class of unary acyclic c-connected queries is exponential-time exactly learnable with subset queries, where a subset query is an oracle query asking whether a given CQ from the concept class is implied by the goal query. Subset queries correspond precisely to membership queries where the example is the canonical structure of a query from the concept class. Our main result in this section improves on this by establishing efficient (i.e., polynomial-time) exact learnability:
Theorem 5.2.
For each schema and , the class of c-acyclic -ary CQs is efficiently exactly learnable with membership queries.
At a high level, the learning algorithm works by maintaining a c-acyclic hypothesis that is an over-approximation of the actual goal query. At each iteration, the hypothesis is strengthened by replacing it with one of the elements of its frontier, a process that is shown to terminate and yield a query that is logically equivalent to the goal query. Note, however, that the frontier of a c-acyclic structure does not, in general, consist of c-acyclic structures. At the heart of the proof of Theorem 5.2 lies a non-trivial argument showing how to turn an arbitrary hypothesis into a c-acyclic one with polynomially many membership queries. The detailed proof is given in Section 5.1 below.
The class of all -ary queries is not exactly learnable with membership queries (even with unbounded amount of time and the ability to ask an unbounded number of oracle queries), because exact learnability with membership queries would imply that every query in the class is uniquely characterizable, which we know is not the case. On the other hand, we have:
Theorem 5.3 (from (CateDK13:learning, )).
For each schema and , the class of all -ary CQs over is efficiently exactly learnable with membership and equivalence queries.
In fact, it follows from results in (CateDK13:learning, ) that the larger class of all unions of conjunctive queries is efficiently exactly learnable with membership and equivalence queries (for fixed and fixed schema). Efficient exact learnability with membership and equivalence queries is not a monotone property of concept classes, but the result from (CateDK13:learning, ) transfers to CQs as well. For the sake of completeness, a self-contained proof of Theorem 5.3 is given below as well.
Remark 2.
Related Work
There has been considerable prior work that formally studies the task of identifying some unknown goal query from examples. Work in this direction includes learning CQs, Xpath queries, Sparql, tree patterns, description logic concepts, ontologies, and schema mappings among others (BonifatiCL15, ; CateDK13:learning, ; GottlobS10, ; StaworkoW12, ). We shall describe mostly the previous work regarding learning CQs. Some of the work in this direction ((Barcelo017, ; CateD15, ; Cohen94a, ; Hirata00, ; Willard10, ) for example) assumes that a background structure is fixed and known by the algorithm. In this setting, an example is a -ary tuple of elements in , labelled positively or negatively depending on whether it belongs or not to . In the present paper (as in (CateDK13:learning, ; Haussler89, )) we do not fix any background structure (i.e., examples are pairs of the form ). Our setting corresponds also to the extended instances with empty background in (Cohen95, ).
In both cases a number of different learning protocols has been considered. In the reverse-engineering problem (as defined in (WeissC17, )) it is only required that the algorithm produces a query consistent with the examples. In a similar direction, the problem of determining whether such a query exists has been intensively studied under some variants (satisfiability, query-by-example, definability, inverse satisfiability) (Barcelo017, ; CateD15, ; Willard10, ). In some scenarios, it is desirable that the query produced by the learner not only explains the examples received during the training phase, but also has predictive power. In particular, the model considered in (BonifatiCS16, ) follows the paradigm of identification in the limit by Gold and requires that, additionally, there exists a finite set of examples that uniquely determines the target query . In a different direction, the model introduced in (GottlobS10, ), inspired by the minimum description length principle, requires to produce a hypothesis consistent after some repairs. A third line of work (see (Cohen93, ; Haussler89, ; Hirata00, )) studies this problem under Valiant’s probably approximately correct (PAC) model. The present paper is part of a fourth direction based on the exact model of query identification by Angluin. In this model, instead of receiving labelled examples, the learner obtains information about the target query by mean of calls to an oracle. As far as we know, we are the first to study the exact learnability of CQs using a membership oracle.
5.1. Proofs for Theorem 5.2 and Theorem 5.3
To warm up, we first establish Theorem 5.3, because its proof is simpler. The proof relies on the following lemmas. Recall that we denote by the canonical structure of a conjunctive query .
Lemma 5.4.
Let be any structure such that , where is the goal conjunctive query. Then, using membership oracle queries, we can compute in polynomial time (in the size of ) a substructure of such that , and such that does not homomorphically map to any strict substructure of .
Proof.
It suffices to iteratively remove one of the facts from the structure, and use a membership query to test if still admits a homomorphism to the structure after removing the fact in question. Once no further fact can be removed, we have arrived at . ∎
Recall that we call a structure safe if every distinguished element occurs in a fact, that is, the structure is the canonical structure of a conjunctive query. The proof of the next lemma is obvious.
Lemma 5.5.
Let be a -ary conjunctive query over a schema , and let be any structure over schema with distinguished elements. If , then is safe.
We now present the proof of Theorem 5.3.
Proof of Theorem 5.3.
The learning algorithm maintains a structure , which we can intuitively think of as (the canonical structure of) the algorithm’s guess of the goal query. The structure is refined in a series of iterations in such a way that at each iteration , its value, , satisfies the following properties: (i) , and (ii) the size of is bounded by the size of the .
We start by considering where is the structure containing a single node that satisfies all possible facts over the schema and a is the -ary tuple containing only element . Clearly, homomorphically maps to this structure. We apply Lemma 5.4 to find a minimal substructure of it into which the goal query maps. We will denote it be .
Next, at each stage we perform an equivalence oracle query to test if the canonical conjunctive query of is logically equivalent to . Note that, by Lemma 5.5, the structure indeed has a canonical conjunctive query. If the answer to the equivalence oracle query is “yes”, then we are done. Otherwise, we receive a counterexample . This counterexample must be a structure in which the goal query is true but the hypothesis is false. Thus, we have and . Recall that we also have . It follows that . We now set to be a minimal substructure of into which maps (using Lemma 5.4 again). It is clear from the construction that , and that the size of is bounded by the size of (otherwise would not be a minimal substructure of into which maps).
All that remains to be shown is that this algorithm terminates after polynomially many iterations. We show that with each iteration, the domain size of the structure strictly increases. Suppose that the domain size of and is the same. We know that the homomorphism (natural projection) is surjective, and that every fact of is the -image of a fact of (otherwise the composition with the homomorphism from to would constitute a non-surjective homomorphism from to , which would contradict the minimality of ). Therefore, it cannot also be injective, otherwise it would be an isomorphism. Therefore, the number of elements of is strictly greater than the number of elements in . Since the size of each is bounded by the size of , this shows that the algorithm terminates after at most rounds, where is the size of .
Incidentally, note that this algorithm runs in polynomial time (in the size of ), even if the schema and the arity of the conjunctive query are not fixed but treated as part of the input. ∎
Next, in the remainder of this section, we prove Theorem 5.2.
First, we argue that we may restrict attention to schemas consisting of binary relations only. Let be any c-acyclic goal conjunctive query over an arbitrary schema and consider the corresponding conjunctive query over the binary schema , as in Lemma 3.26. By Lemma 3.26(2), every membership query w.r.t. the goal query can be efficiently reduced to a membership query w.r.t. . Therefore, if can be efficiently identified using membership queries for , then can also be efficiently identified using membership queries for , and consequently also can be efficiently identified using membership queries for (namely, by first computing a conjunctive query over that is logically equivalent to , and then returning as the final answer (note that, by Lemma 3.26(3), must then be logically equivalent to ). Finally, it is easy to see that is c-acyclic if and only if is c-acyclic. Therefore, it suffices to prove Theorem 5.2 for the special case of schemas consisting of binary relations. In the remainder of this section, we will therefore restrict ourselves to binary relations.
Proposition 5.6 (Any positive example can be transformed into a c-acyclic one).
Let be a c-acyclic goal query. Given a structure satisfying , using membership queries, we can construct, in time polynomial in , a structure , denoted , such that
-
(1)
is c-acyclic
-
(2)
-
(3)
-
(4)
is not homomorphic to any structure obtained removing some fact of
Proof.
We say that a structure is -c-acyclic if every cycle of length at most goes through a distinguished element. (Here, by a “cycle” we mean a cycle in the incidence graph of the structure, and, to simplify the exposition, by the length of the cycle we refer to the number of facts that lie on the cycle). Note that when this holds for , where is the size of , then every homomorphic image of contained in the structure in question must be c-acyclic. We will describe a method (using membership queries) that takes a -c-acyclic structure with and turns it into a -c-acyclic structure with and . By applying this method repeatedly for increasing (and always minimizing w.r.t. using membership queries, cf. Lemma 5.4), we are guaranteed to reach the situation where we have a structure that is -c-acyclic and therefore, is in fact, c-acyclic.
Let be -c-acyclic with . First, we use membership queries to minimize (cf. Lemma 5.4) and ensure that its size is at most . Next, we say that an edge is bad if it is part of a cycle of length that does not contain a distinguished element, and good otherwise. If there are no bad edges then we are done. Otherwise, let be a bad edge. Let be isomorphic copies of that are disjoint except for the distinguished elements. Now, let be the structure obtained by extending the fg-disjoint union with additional “special edges” and . Clearly, .
Claim 1: for each good edge of , its isomorphic copies belonging to are good in .
Claim 2: the special edges and are both good in .
Claim 1 is obvious from the construction of , as no new short cycles are introduced. To see that Claim 2 holds, consider any minimal cycle in that does not contain any distinguished element, and that goes through one of these edges. Then, clearly, the cycle must go through both of these edges. That is, it must be of the form
where is a path contained in and is a path contained in . Now, we know that the paths and must have length at least (because otherwise would not be -c-acyclic). Therefore, the entire cycle must have length at least .
Claim 3: .
Claim 3 is essentially proved by an induction on the tree structure of , after removing all distinguished elements. More precisely, let . Let be the substructure of obtained by removing all distinguished elements and facts involving distinguished elements. Clearly, is acyclic, i.e., can be oriented as a forest. By induction on this forest, we can construct a homomorphism , with the additional property that, for each element of , is equal to either or . Recall that is the copy of in and that is the copy of in . Next, let be the extension of to that agrees with on all distinguished elements. We can show that . To see this, consider any fact of . If that fact involves at least one distinguished element, then the -image of that fact involves at least one distinguished element of . Therefore, by construction, all the facts obtainable by replacing every non-distinguished element (if there is such) by one of its two copies, are present in , and therefore, no matter how acts on the non-distinguished element of that fact, the -image will be present in . Next, consider the case where the fact doesn’t involve any distinguished element of . In this case, the fact belongs to and hence we know that the -image of that fact (which is also the -image) belongs to . This concludes the proof of claim 3.
Next, let be a minimal substructure of into which homomorphically maps (obtained using membership queries, cf. Lemma 5.4). Clearly, Claim 1-3 above are preserved under the passage from to its substructure . We claim that (1) must contain at least one of the edges and , and (2) for every edge of , must contain at least one of its two isomorphic copies. Because if not, then the homomorphism from to could be composed with the natural projection from to to obtain a homomorphism from to a proper substructure of , a contradiction with the assumed minimality of . Combined with Claim 1 and 2, this allows us to conclude that has strictly more good edges than . Since the number of edges (good or bad) is bounded by , by repeating the above procedure, we obtain a structure that has no bad edges, and therefore, is -c-acyclic. ∎
Proof of Theorem 5.2:.
The algorithm maintains a structure, denoted , which can be interpreted as (the canonical structure of) the algorithm’s guess of the goal query. At every moment in the execution of the algorithm, , satisfies the following properties:
-
(1)
-
(2)
does not homomorphically map to any structure obtained by removing a fact from
-
(3)
is c-acyclic
Note that conditions (1) and (2) imply that cannot have more elements than and that
Initially, is defined to be where is defined as in Proposition 5.6 and is the structure containing a single node that satisfies all possible facts over the schema and is the -ary tuple containing only element . The algorithm refines in a sequence of iterations. At each iteration, the algorithm first constructs the frontier of . Note that by condition (3) is -acyclic. Hence, by Theorem 3.8, can be computed in time polynomial in (and, hence, in ). Then, the algorithm asks a membership query for each in until either it receives a ’yes’ answer or, otherwise, it exhausts all structures in without receiving a ’yes’ answer. In the latter case the algorithm stops and returns the canonical query of , cf. Lemma 5.5, as it is immediate from the fact that and that does not homomorphically map to any structure in that is homomorphically equivalent to , and therefore, the canonical query of is logically equivalent to . In the former case, the algorithm picks any structure in that (when asked as a membership query) produces a ’yes’ answer, updates , by setting and starts a new iteration. It follows immediately that preserves properties (1-3) above.
To show the correctness of the algorithm it only remains to see that the number of iterations is polynomially bounded in . This follows directly from the following claim: the domain size of increases at each iteration. To prove the claim, let be the value of at the th iteration, and note that, by design, we have where belongs to the frontier of . It follows that and . Let be the homomorphism from to . It follows from the fact that and condition (2) that the image of according to is precisely . Therefore, must be non-injective (otherwise it would be an isomorphism, contradicting the fact that ). Since is surjective and non-injective, we can conclude that the domain size of is larger than the domain size of . ∎
6. Another type of data examples: input-output examples
In the previous sections, we have focused on positive and negative data examples. However, there is also another type of data example that is natural to consider, namely a pair where is an input instance (over a schema of the query), and is a -ary relation over the domain of , where is the arity of the query. A conjunctive query fits if is precisely the set of all tuples over the domain of that satisfy . Input-output examples are analogous to universal examples for schema mappings as studied in (AlexeCKT2011, ), in that they capture the complete behavior of the concept (in our case, the conjunctive query) on a database instance.
One input-output example, intuitively, captures the same information as a polynomial number of positive and negative data examples (if we treat as a constant), namely all positive data examples for and all negative data examples for . It follows that a CQ is uniquely characterizable by a finite set of input-output data examples if and only if it is uniquely characterizable by a finite set of positive and negative data examples. In the case of connected CQs, in fact, a single input-output example suffices:
Theorem 6.1.
Fix a schema and , and let be any class of -ary CQs over the given schema. Then the following are equivalent for all queries :
-
(1)
is uniquely characterized w.r.t. by a finite collection of positive and negative data examples.
-
(2)
is uniquely characterized w.r.t. by a finite collection of input-output examples.
And, if consists of c-connected CQs (or and consists of connected CQs):
-
(3)
is uniquely characterized w.r.t. by a single input-output example.
Moreover, the equivalences are witnessed by polynomial-time transformations in each direction.
Proof.
For the direction from (1) to (2): for given , let . Note that whenever a query fits , then it must also fit . For the converse direction, from (2) to (1), the construction was already described above.
The direction (3) to (2) is trivial.
Finally, for the direction from (2) to (3), let . Let be the disjoint union . For every tuple a from the domain of , and for every c-connected query , if and only if . It follows that, whenever two -connected queries agree on their output on (that is to say, ), then they must also agree on their output on each . Therefore, if uniquely characterizes w.r.t. , then also uniquely characterizes w.r.t. . The same argument applies to connected Boolean CQs.
Incidentally, note that the above argument only works for c-connected CQs. For instance, the CQ cannot be uniquely characterized by a single input-output example , because if is non-empty, then where is the query ; whereas if is empty, then , where is the query . ∎
It follows that all results regarding the existence and polynomial-time computability of unique characterizations in Section 4 remain true when considering input-output data examples instead of (or even in addition to) positive and negative data examples.
Similarly, in the exact learning context, we can also consider a different type of oracle query, namely where the algorithm provides the oracle with an input instance and the oracle responds with an input-output example of the form that fits the target CQ. We could call such oracle queries input-output queries. They naturally capture a scenario in which we have black-box access to an executable version of the target CQ. For the same reasons discussed above, input-output queries are no more powerful than membership queries, since one input-output query can be simulated by a polynomial number of membership queries (assuming is fixed). Therefore, also, all our results on exact learnability remain true when considering input-output queries instead of membership queries.
7. Further Applications
While our main focus in this paper is on unique characterizability and exact learnability for CQs, in this section, we explore some implications for other application domains.
7.1. Characterizability and learnability of LAV schema mappings
A schema mapping is a high-level declarative specification of the relationships between two database schemas (Kolaitis05schema, ). Two of the most well-studied schema mapping specification languages are LAV (“Local-as-View”) and GAV (“Global-as-View”) schema mappings.
In (AlexeCKT2011, ), the authors studied the question of when a schema mapping can be uniquely characterized by a finite set of data examples. Different types of data examples were introduced and studied, namely positive examples, negative examples, and “universal” examples. In particular, it was shown in (AlexeCKT2011, ) that a GAV schema mapping can be uniquely characterized by a finite set of positive and negative examples (or, equivalently, by a finite set of universal examples) if and only if the schema mapping in question is logically equivalent to one that is specified using c-acyclic GAV constraints.
It was shown in (AlexeCKT2011, ) that every LAV schema mapping is uniquely characterized by a finite set of universal examples, and that there are LAV schema mappings that are not uniquely characterized by any finite set of positive and negative examples. In this section, we will consider the question which LAV schema mappings are uniquely characterizable by a finite set of positive and negative examples, and how to construct such a set of examples efficiently.
We will also consider the exact learnability of LAV schema mappings with membership queries. Exact learnability of GAV schema mappings was studied in (CateDK13:learning, ), where it was shown that GAV schema mappings are learnable with membership and equivalence queries (and, subsequently, also in a variant of the PAC model) but is not exactly learnable with membership queries alone or with equivalence queries alone. The exact learning algorithm for GAV schema mappings from (CateDK13:learning, ) was further put to use and validated experimentally in (Kun18active, ). Here, we consider exact learnability of LAV schema mappings with membership queries.
Definition 7.1.
A LAV (“Local-As-View”) schema mapping is a triple where and are disjoint schemas (the “source schema” and “target schema”), and is a finite set of LAV constraints, that is, first-order sentences of the form , where is an atomic formula using a relation from , and is a conjunction of atomic formulas using relations from .
By a schema-mapping example we will mean a pair where is a structure over schema without distinguished elements, and is a structure over schema without distinguished elements. We say that is a positive example for a schema mapping if , viewed as a single structure over the joint schema , satisfies all constraints in , and we call a negative example for otherwise. Note that schema-mapping examples were called data examples in (AlexeCKT2011, ). Unique characterizations and learnability with membership queries are defined as before. In particular, by a membership query, in the context of learning LAV schema mappings, we will mean an oracle query that consists of a schema-mapping example, which the oracle then labels as positive or negative depending on whether it satisfies the constraints of the goal LAV schema mapping. It is assumed here, that the source and target schemas are fixed and known to the learner.
Given a fixed source schema , there are only finitely many different possible left-hand sides for a LAV constraint, up to renaming of variables. Furthermore, if a schema mapping contains two LAV constraints with the same left-hand side, then they can be combined into a single LAV constraint by conjoining the respective right-hand sides. Since the right-hand side of a LAV constraint can be thought of as a CQ, this means that, intuitively, a LAV schema mapping can be thought of as a finite collection of CQs (one for each possible left-hand side). In the light of this observation, it is no surprise that questions about the unique characterizability and learnability of LAV schema mappings can be reduced to questions about the unique characterizability and learnability of CQs.
Let us capture this observation a little more precisely. Let be the maximum arity of a relation in , and let be the finite set of all atomic formulas using a relation from and variables from . Given a LAV schema mapping and an , we denote by the following first-order formula over schema :
For example, if consists of the LAV constraints and , and is , then . Similarly, for then . Note that can be equivalently written as a not-necessarily-safe CQ over (by pulling the existential quantifies to the front).
Lemma 7.2.
Let be any LAV schema mapping, and let have distinct variables. For every structure , over schema and with distinguished elements, the following are equivalent:
-
(1)
is a positive data example for ,
-
(2)
The schema-mapping example is a positive example for , where is the structure over consisting of the single fact , and .
We omit the proof, which is straightforward (note that the left-hand side of a LAV constraint can have at most one homomorphism to , and the latter can be extended to the right-hand side of the constraint to iff the respective conjunct of is satisfied. Also note that if is a positive example for a LAV schema mapping then, so is for ).
Intuitively, Lemma 7.2 shows that the behavior of on arbitrary data examples, is fully determined by the behavior of on arbitrary schema-mapping examples. The converse turns out to be true as well, that is, the semantics of a LAV schema mapping is determined (up to logical equivalence) by its associated queries for :
Lemma 7.3.
Two LAV schema mappings , are logically equivalent iff, for every , and are logically equivalent.
Proof.
The left-to-right direction follows immediately from the preceding Lemma. For the right-to-left direction: suppose and are not logically equivalent. Then they disagree on some schema-mapping example . Without loss of generality, we may assume that is a positive example for and a negative example for . In particular, one of the LAV constraints in is false in . Since the left-hand side of a LAV constraint consists of a single atom, it follows that, for some fact of , the schema-mapping example is a negative example for . Moreover, an easy monotonicity argument shows that is a positive example for . Let be obtained from the fact by replacing each distinct element by a corresponding variable . It follows from Lemma 7.2 that and disagree on the structure , and are not logically equivalent. ∎
It follows directly from the above Lemmas that the unique characterizability of a LAV schema mapping reduces to the unique characterizability of each query :
Lemma 7.4.
For all LAV schema mappings , the following are equivalent:
-
(1)
is uniquely characterizable by finitely many positive and negative schema-mapping examples (w.r.t. the class of all LAV schema mappings over ).
-
(2)
For each , is uniquely characterizable by finitely many positive and negative data examples w.r.t. the class of all -ary not-necessarily-safe CQs over .
Intuitively, this shows that a LAV schema mapping is uniquely characterizable iff each of its constraints (joined together according to their left-hand side atom) are. By combining these lemmas with Theorem 4.5 (cf. Remark 1), we can link the unique characterizability of a LAV schema mapping to the condition of c-acyclicity. We say that a LAV schema mapping is c-acyclic if the right-hand side of each of its LAV constraints is a c-acyclic not-necessarily-safe CQ. Note that, in this case, also is c-acyclic, for each .
Theorem 7.5.
Fix a source schema and a target schema . A LAV schema mapping is uniquely characterizable by a finite set of positive and negative schema-mapping examples if and only if is logically equivalent to a c-acyclic LAV schema mapping. Moreover, if is c-acyclic, then a uniquely characterizing set of positive and negative schema-mapping examples can be constructed in polynomial time (for fixed ).
Proof.
The direction going from c-acyclicity to the uniquely characterizing set of schema-mapping examples, follows immediately from the above lemmas together with Theorem 4.5. For the other direction, assume that is uniquely characterizable by finitely many positive and negative schema-mapping examples. It follows by Lemma 7.4 that each is uniquely characterizable by finitely many positive and negative data examples. Hence, each is logically equivalent to a c-acyclic not-necessarily-safe conjunctive query . Finally, let , where consists of all LAV constraints of the form for . Then is c-acyclic and logically equivalent to . ∎
Theorem 7.6.
Fix a source schema and a target schema . The class of c-acyclic LAV schema mappings over is efficiently exactly learnable with membership queries.
Note that the class of all LAV schema mappings over is not exactly learnable with membership queries (assuming that is non-empty and contains a relation of arity at least 2). This follows immediately from the existence of LAV schema mappings that are not uniquely characterizable by finitely many positive and negative schema-mapping examples.
As mentioned earlier, LAV schema mappings and GAV schema mappings are two of the most well-studied schema mapping languages. GLAV (“Global-and-Local-As-Views”) schema mappings is another, which forms a common generalization. An important remaining open question in the area of example-driven approaches to schema mapping design is the following (AlexeCKT2011, ): which GLAV schema mappings are uniquely characterizable by a finite set of examples?
7.2. Learning description logic concept expressions and ABoxes
Description logics are formal specification languages used to represent domain knowledge. Example-driven and machine-learning based approaches have a long history in this area, and have received renewed interest in the last years (Ozaki2020:learning, ), in particular, for ontologies specified in the lightweight description logic , and focusing on the exact learnability of ontologies using entailment queries and equivalence queries. As we show in this section, our results on c-acyclic CQs have some implications for the exact learnability of concept expressions.
Definition 7.7 ( Concept expressions, ABoxes, TBoxes).
Let be fixed, disjoint sets, whose members we will refer to as “concept names”, “role names”, and ”individual names”, respectively. The sets and are assumed to be finite, while is assumed to be infinite.
A concept expression is an expression built up from from concept names in and , using conjunction () and existential restriction ( or , where ).
An ABox is a finite set of ABox axioms of the form and/or , where , , and .
A TBox is a finite set of TBox axioms , where are concept expressions.
The semantics of these expressions can be explained by translation to first-order logic:
Definition 7.8.
The correspondence schema is the schema that contains a unary relation for every and a binary relation for every . Through the standard translation from description logic to first-order logic (cf. Table 1), every concept expression translates to a first-order formula over the correspondence schema. By extension, every TBox translates to a finite first-order theory , where translates to .
An ABox can equivalently be viewed as a finite structure (without distinguished elements), whose domain consists of individual names from , and whose facts are the ABox assertions. Since is assumed to be infinite, every finite structure over the correspondence schema can (up to isomorphism) be represented as an ABox. Therefore, in what follows we will use ABoxes and structures interchangeably.
We can think of an ABox as a (possibly incomplete) list of facts, and a TBox as domain knowledge in the form of rules for deriving more facts. This idea underlies the next definition:
Definition 7.9.
A QA-example is a pair where is an ABox and . We say that is a positive QA-example for a concept expression relative to a TBox if where . If , we say that is a negative QA-example for relative to .
The name QA-example, here, reflects the fact that the task of computing is commonly known as query answering. It is one of the core inference tasks studied in the description logic literature. In general, there are two variants of the definition of : one where ranges over finite structures, and one where ranges over all, finite or infinite, structures. The description logic that we consider here has been shown to be finitely controllable (Barany2013querying, ), meaning that both definitions are equivalent. For more expressive description logics, this is in general not the case.
Example 7.10.
Consider the ABox, TBox, and concept expression in Table 2. Every model of containing the facts in must contain also and for some . It follows that . In other words, is a positive QA-example for relative to . On the other hand, is a negative QA-example for relative to .
See (Baader2017:introduction, ) for more details on description logic syntax and semantics. We now explain how our results from Section 4 and 5 can be applied here. Although a QA-example is just a data example with one distinguished element, over the correspondence schema, the definition of positive/negative QA-examples diverges from the definition of positive/negative data examples, because of the TBox . For the special case where , the two coincide:
Lemma 7.11.
Let . A QA-example is a positive (negative) QA-example for a concept expression relative to iff is a positive (negative) data example for .
Lemma 7.11 follows from the well-known monotonicity property of CQs (i.e., whenever , then ), which implies that .
Concept expressions turn out to correspond precisely to unary, acyclic, c-connected CQs:
Lemma 7.12.
The standard translation of every concept expression is equivalent to a not-necessarily-safe unary CQ that is acyclic and c-connected. Conversely, every unary, acyclic, c-connected not-necessarily-safe CQ over the correspondence schema is logically equivalent to for some concept expression .
Both directions of Lemma 7.12 can be proved using a straightforward induction.
The above two lemmas, together with Theorem 4.5(2) and Theorem 5.2 (cf. Remark 1 and Remark 2) immediately yield our main result here. We say that a collection of positive and negative QA-examples uniquely characterizes a concept expression relative to a TBox if fits the examples (relative to ) and every other concept expression that does so is equivalent (relative to ) to . By a QA-membership query we mean an oracle query consisting of a QA example, where the oracle answers yes or no depending on whether the input is a positive QA example or a negative QA example for the goal concept, relative to the TBox. It is assumed that the TBox is fixed and known to the learner.
Theorem 7.13.
Let . Every concept expression is uniquely characterizable by a finite collection of positive and negative QA examples (relative to ), which can be computed in polynomial time. Furthermore, the class of concept expressions is efficiently exactly learnable with QA-membership queries.
Moreover, by Theorem 4.5(3), the uniquely characterizing examples can be constructed so that each example is the canonical QA-example of a concept expression. By the canonical QA-example of a concept expression , here, we mean the QA-example that (viewed as a structure with one distinguished element) is the canonical structure of the not-necessarily-safe CQ .
Theorem 7.13 remains true when the concept language is extended with unrestricted existential quantification (of the form ) and a restricted form of the I-me self-reference construct introduced in (Marx2002narcissists, ), namely where the I operator can only occur once, and in the very front of the concept expression. Indeed, it can be shown that this extended concept language (by a straightforward extension of the standard translation) captures precisely the class of c-acyclic unary not-necessarily-safe CQs over the correspondence schema.
This raises the question if Theorem 7.13 holds true for arbitrary TBoxes. Since publication of the conference version of this paper (tCD2021:conjunctive, ) (in which we asked the same question), some answers have been obtained. In (Funk2021Actively, ), it is shown that the answer to this question is No if the TBox is treated as part of the input to the learning algorithm. Indeed, it is shown that the problem becomes not efficiently exactly learnable with membership and equivalence queries. On the other hand, a positive answer is given in (Funk2021Actively, ) for a weaker version of the question, namely for the description logic , when the learning algorithm is also allowed to ask equivalence queries. In (funk2022:frontiers, ), furthermore, a positive answer is given for another variant of the above question where the TBox is specified in the description logic DL-Lite, and where the learning algorithm is allowed to ask membership and equivalence queries. At the heart of this learning algorithm lies an extension of our frontier construction from Section 3.3, also obtained in (funk2022:frontiers, ).
8. Acknowledgements
This paper largely grew out of discussions at Dagstuhl Seminar 19361 (“Logic and Learning”) in Sept. 2019. Victor Dalmau was supported by MICCIN grants TIN2016-76573-C2-1P and PID2019-109137GB-C22. Balder ten Cate was supported by the European Union’s Horizon 2020 research and innovation programme (MSCA-101031081). We thank Carsten Lutz, Raoul Koudijs, and Phokion Kolaitis for helpful discussions and for pointing out some flaws in earlier versions of this paper.
References
- (1) Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang-Chiew Tan. Characterizing schema mappings via data examples. ACM Trans. Database Syst., 36(4):23:1–23:48, December 2011. doi:10.1145/2043652.2043656.
- (2) Dana Angluin. Queries and concept learning. Mach. Learn., 2(4):319–342, April 1988. doi:10.1023/A:1022821128753.
- (3) William Ward Armstrong. Dependency structures of data base relationships. In IFIP Congress, 1974.
- (4) Franz Baader, Ian Horrocks, Carsten Lutz, and Ulrike Sattler. An Introduction to Description Logic. Cambridge University Press, 2017.
- (5) Franz Baader and Werner Nutt. Basic Description Logics, pages 43–95. Cambridge University Press, USA, 2003.
- (6) Vince Bárány, Georg Gottlob, and Martin Otto. Querying the guarded fragment. Logical Methods in Computer Science, 10(2), 2014. doi:10.2168/LMCS-10(2:3)2014.
- (7) Pablo Barceló and Miguel Romero. The complexity of reverse engineering problems for conjunctive queries. In Michael Benedikt and Giorgio Orsi, editors, 20th International Conference on Database Theory, ICDT 2017, March 21-24, 2017, Venice, Italy, volume 68 of LIPIcs, pages 7:1–7:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017.
- (8) Google Blog. A reintroduction to our knowledge graph and knowledge panels, 2020.
- (9) Angela Bonifati, Radu Ciucanu, and Aurélien Lemay. Learning path queries on graph databases. In Gustavo Alonso, Floris Geerts, Lucian Popa, Pablo Barceló, Jens Teubner, Martín Ugarte, Jan Van den Bussche, and Jan Paredaens, editors, Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015, pages 109–120. OpenProceedings.org, 2015.
- (10) Angela Bonifati, Radu Ciucanu, and Slawek Staworko. Learning join queries from user examples. ACM Trans. Database Syst., 40(4):24:1–24:38, 2016.
- (11) Ashok K. Chandra Chandra and Philip M. Merlin. Optimal Implementation of Conjunctive Queries in Relational Data Bases. In ACM Symposium on Theory of Computing (STOC), pages 77–90, 1977.
- (12) William W. Cohen. Cryptographic limitations on learning one-clause logic programs. In Richard Fikes and Wendy G. Lehnert, editors, Proceedings of the 11th National Conference on Artificial Intelligence. Washington, DC, USA, July 11-15, 1993, pages 80–85. AAAI Press / The MIT Press, 1993.
- (13) William W. Cohen. Pac-learning nondeterminate clauses. In Barbara Hayes-Roth and Richard E. Korf, editors, Proceedings of the 12th National Conference on Artificial Intelligence, Seattle, WA, USA, July 31 - August 4, 1994, Volume 1, pages 676–681. AAAI Press / The MIT Press, 1994.
- (14) William W. Cohen. Pac-learning non-recursive prolog clauses. Artif. Intell., 79(1):1–38, 1995.
- (15) Víctor Dalmau, Phokion G. Kolaitis, and Moshe Y. Vardi. Constraint satisfaction, bounded treewidth, and finite-variable logics. In Pascal Van Hentenryck, editor, Principles and Practice of Constraint Programming - CP 2002, pages 310–326, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.
- (16) Ronald Fagin. Degrees of acyclicity for hypergraphs and relational database schemes. J. ACM, 30(3):514–550, July 1983. doi:10.1145/2402.322390.
- (17) Ronald Fagin, Phokion Kolaitis, Alan Nash, and Lucian Popa. Towards a theory of schema-mapping optimization. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 33–42, 01 2008. doi:10.1145/1376916.1376922.
- (18) Jan Foniok, Jaroslav Nesetril, and Claude Tardif. Generalised dualities and maximal finite antichains in the homomorphism order of relational structures. Eur. J. Comb., 29(4):881–899, 2008.
- (19) Maurice Funk, Jean Christoph Jung, and Carsten Lutz. Actively learning concepts and conjunctive queries under elr-ontologies. In IJCAI, 2021.
- (20) Maurice Funk, Jean Christoph Jung, and Carsten Lutz. Frontiers and exact learning of eli queries under dl-lite ontologies. In Lud De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 2627–2633. International Joint Conferences on Artificial Intelligence Organization, 7 2022. Main Track. doi:10.24963/ijcai.2022/364.
- (21) Georg Gottlob and Pierre Senellart. Schema mapping discovery from data instances. J. ACM, 57(2):6:1–6:37, 2010.
- (22) David Haussler. Learning conjunctive concepts in structural domains. Mach. Learn., 4:7–40, 1989.
- (23) Pavol Hell and Jaroslav Nešetřil. The Core of a Graph. Discrete Mathematics, 109:117–126, 1992.
- (24) Pavol Hell and Jaroslav Nešetřil. Graphs and homomorphisms, volume 28 of Oxford lecture series in mathematics and its applications. Oxford University Press, 2004.
- (25) Kouichi Hirata. On the hardness of learning acyclic conjunctive queries. In Hiroki Arimura, Sanjay Jain, and Arun Sharma, editors, Algorithmic Learning Theory, 11th International Conference, ALT 2000, Sydney, Australia, December 11-13, 2000, Proceedings, volume 1968 of Lecture Notes in Computer Science, pages 238–251. Springer, 2000.
- (26) Wojtek Kazana and Luc Segoufin. First-order queries on classes of structures with bounded expansion. Logical Methods in Computer Science, Volume 16, Issue 1, February 2020.
- (27) Phokion G. Kolaitis. Schema mappings, data exchange, and metadata management. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’05, page 61–75, New York, NY, USA, 2005. Association for Computing Machinery. doi:10.1145/1065167.1065176.
- (28) Heikki Mannila and Kari-Jouko Räihä. Test data for relational queries. In PODS, pages 217–223, 1986.
- (29) Maarten Marx. Narcissists, stepmothers and spies. In Ian Horrocks and Sergio Tessaris, editors, Proceedings of the 2002 International Workshop on Description Logics (DL2002), Toulouse, France, April 19-21, 2002, volume 53 of CEUR Workshop Proceedings. CEUR-WS.org, 2002.
- (30) Jaroslav Nešetřil and Patrice Ossona de Mendez. Grad and classes with bounded expansion iii. restricted graph homomorphism dualities. European Journal of Combinatorics, 29(4):1012 – 1024, 2008. Homomorphisms: Structure and Highlights. doi:https://doi.org/10.1016/j.ejc.2007.11.019.
- (31) Jaroslav Nešetřil and Patrice Ossona de Mendez. Grad and classes with bounded expansion i. decompositions. European Journal of Combinatorics, 29(3):760 – 776, 2008. doi:https://doi.org/10.1016/j.ejc.2006.07.013.
- (32) Jaroslav Nešetřil and Patrice Ossona de Mendez. Sparsity (Graphs, Structures, and Algorithms), volume 28. Springer, 01 2012. doi:10.1007/978-3-642-27875-4.
- (33) Jaroslav Nešetřil and Claude Tardif. Duality theorems for finite structures (characterising gaps and good characterisations). Journal of Combinatorial Theory, Series B, 80(1):80 – 97, 2000. doi:https://doi.org/10.1006/jctb.2000.1970.
- (34) Jaroslav Nešetřil and Claude Tardif. Short answers to exponentially long questions: Extremal aspects of homomorphism duality. SIAM J. Discret. Math., 19(4):914–920, August 2005. doi:10.1137/S0895480104445630.
- (35) Ana Ozaki. Learning description logic ontologies: Five approaches. where do they stand? KI - Künstliche Intelligenz, 04 2020. doi:10.1007/s13218-020-00656-9.
- (36) Slawek Staworko and Piotr Wieczorek. Learning twig and path queries. In Alin Deutsch, editor, 15th International Conference on Database Theory, ICDT ’12, Berlin, Germany, March 26-29, 2012, pages 140–154. ACM, 2012.
- (37) Slawomir Staworko and Piotr Wieczorek. Characterizing XML Twig Queries with Examples. In International Conference on Database Theory, International Conference on Database Theory, Brussels, Belgium, March 2015. URL: https://hal.inria.fr/hal-01205417.
- (38) Balder ten Cate, Laura Chiticariu, Phokion Kolaitis, and Wang-Chiew Tan. Laconic schema mappings: Computing the core with sql queries. Proc. VLDB Endow., 2(1):1006–1017, August 2009. doi:10.14778/1687627.1687741.
- (39) Balder ten Cate and Víctor Dalmau. The product homomorphism problem and applications. In Marcelo Arenas and Martín Ugarte, editors, 18th International Conference on Database Theory, ICDT 2015, March 23-27, 2015, Brussels, Belgium, volume 31 of LIPIcs, pages 161–176. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015.
- (40) Balder ten Cate and Victor Dalmau. Conjunctive queries: Unique characterizations and exact learnability. In Proc. of ICDT. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
- (41) Balder ten Cate, Víctor Dalmau, and Phokion G. Kolaitis. Learning schema mappings. ACM Trans. Database Syst., 38(4):28:1–28:31, 2013. doi:10.1145/2539032.2539035.
- (42) Balder ten Cate, Phokion G. Kolaitis, Kun Qian, and Wang-Chiew Tan. Active learning of GAV schema mappings. In Jan Van den Bussche and Marcelo Arenas, editors, Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 355–368. ACM, 2018. doi:10.1145/3196959.3196974.
- (43) Yaacov Y. Weiss and Sara Cohen. Reverse engineering spj-queries from examples. In Emanuel Sallinger, Jan Van den Bussche, and Floris Geerts, editors, Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017, pages 151–166. ACM, 2017.
- (44) Ross Willard. Testing expressibility is hard. In David Cohen, editor, Principles and Practice of Constraint Programming - CP 2010 - 16th International Conference, CP 2010, St. Andrews, Scotland, UK, September 6-10, 2010. Proceedings, volume 6308 of Lecture Notes in Computer Science, pages 9–23. Springer, 2010.