This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Conjunctive Queries: Unique Characterizations and Exact Learnability

Balder ten Cate https://orcid.org/0000-0002-2538-5846 Universiteit van AmsterdamILLCAmsterdamThe Netherlands [email protected]  and  Victor Dalmau https://orcid.org/0000-0002-9365-7372 Universitat Pompeu FabraDepartment of Information and Communication TechnologiesBarcelonaSpain [email protected]
Abstract.

We answer the question of which conjunctive queries are uniquely characterized by polynomially many positive and negative examples, and how to construct such examples efficiently. As a consequence, we obtain a new efficient exact learning algorithm for a class of conjunctive queries. At the core of our contributions lie two new polynomial-time algorithms for constructing frontiers in the homomorphism lattice of finite structures. We also discuss implications for the unique characterizability and learnability of schema mappings and of description logic concepts.

Conjunctive Queries, Homomorphisms, Frontiers, Unique Characterizations, Exact Learnability, Schema Mappings, Description Logic
journal: TODSccs: Theory of computation Machine learning theoryccs: Theory of computation Logicccs: Information systems Query languages

1. Introduction

Conjunctive queries (CQs) are an extensively studied database query language and fragment of first-order logic. They correspond precisely to Datalog programs with a single non-recursive rule. In this paper, we study two problems related to CQs. The first problem is concerned with the existence and constructability of unique characterizations. For which CQs qq is it the case that qq can be characterized (up to logical equivalence) by its behavior on a small set of data examples? And, when such a set of data examples exists, can it be constructed efficiently? The second problem pertains to exact learnability of CQs in an interactive setting where the learner has access to a “membership oracle” that, given any database instance and a tuple of values, answers whether the tuple belongs to the answer of the goal CQ (that is, the hidden CQ that the learner is trying to learn). We can think of the membership oracle as a black-box, compiled version of the goal query, which the learner can execute on any number of examples. The task of the learner, then, is to reverse engineer the query based on the observed behavior.

\Tree\top\edgeAA\edgeF1F_{1}\edge\edge\edge\bot\edgeFnF_{n}
Figure 1. A frontier in the homomorphism lattice of structures

Note that these two problems (unique characterizability and exact learnability) are closely related to each other: a learner can identify the goal query with certainty, only when the set of examples that it has seen so far constitutes a unique characterization of the goal query. In other words, unique characterizability (by a polynomially large set of examples) is necessary, but not sufficient, for (polynomial-time) exact learnability with membership queries.

Motivating Example 1.

This example, although stylized and described at a high level, aims to convey a use case that motivated the present work. The Google Knowledge Graph is a large database of entities and facts, gathered from a variety of sources. It is used to enhance the search engine’s results for queries such as “where was Barack Obama born” with factual information in the form of knowledge panels (KGBlog2020, ). When a query triggers a specific knowledge panel, this may be the result of different triggering and fulfillment mechanisms, each of which may involve a combination of structured queries to the knowledge graph, hard-coded business logic (in a Turing-complete language), and machine learned models. This makes it difficult to understand interactions between knowledge panels (e.g., whether the two knowledge panels are equivalent or one is subsumed by the other, in terms of content and triggering). If a declarative specification of (an approximation of) the triggering and fulfillment logic for a knowledge panel can be constructed programmatically, specified in a sufficiently restrictive formalism such as Datalog rules, this provides an avenue to the above, and other relevant static analysis tasks. The efficient exact learnability with membership queries that we study in this paper, can be viewed as an idealized form of such a programmatic approach, where the membership oracle is the existing, black box, implementation of the knowledge panel, and the learning algorithm aims to produce a CQ that exactly captures it.

The above example provides a motivation for studying efficient exact learnability of CQs, and hence, for studying unique characterizability. However, we would like to emphasize that unique characterizations are of independent interest, outside the context of exact learning algorithms. Indeed, uniquely characterizing examples can be used, for instance, for elementary query engine debugging, and query visualization and explanation.

As it turns out, the above problems about CQs are intimately linked to fundamental properties of the homomorphism lattice of finite structures. In particular, the existence of a unique characterization for a CQ can be reduced to the existence of a frontier in the homomorphism lattice for an associated structure AA, where, by a “frontier” for AA, we mean a finite set of structures F1,,FnF_{1},\ldots,F_{n} that cover precisely the set of structures homomorphically strictly weaker than AA, that is, such that {BBA and A↛B}=i{BBFi}\{B\mid B\to A\text{ and }A\not\to B\}=\bigcup_{i}\{B\mid B\to F_{i}\} (cf. Figure 1).

Known results (FoniokNT08, ; AlexeCKT2011, ) imply that not every finite structure has such a frontier, and, moreover, a finite structure has a frontier if and only if the structure (modulo homomorphic equivalence) satisfies a structural property called c-acyclicity. These known results, however, are based on exponential constructions, and no polynomial algorithms for constructing frontiers were previously known.

Main Contribution 1 (Polynomial-time algorithms for constructing frontiers).

We show that, for c-acyclic structures, a frontier can in fact be computed in polynomial time. More specifically, we present two polynomial-time algorithms. The first algorithm takes any c-acyclic structure and produces a frontier consisting of structures that are themselves not necessarily c-acyclic (Sect. 3.2). The second algorithm applies to a more restricted class of acyclic structures but yields a frontier consisting entirely of structures belonging to the same class, that is, the class of structures in question is frontier-closed (Sect. 3.3).

We use these to obtain new results on the existence and efficient constructability of unique characterizations for CQs:

Main Contribution 2 (Polynomial Unique Characterizations for Conjunctive Queries).

We show that a CQ is uniquely characterizable by polynomially many examples, precisely if (modulo logical equivalence) it is c-acyclic. Furthermore, for c-acyclic CQs, a uniquely characterizing set of examples can be constructed in polynomial time. In the special case of acyclic and c-connected CQs, a uniquely characterizing set of examples can be constructed consisting entirely of queries from the same class (Sect. 4).

Using the above results as a stepping stone, we obtain a polynomial-time exact learning algorithm for the class of c-acyclic CQs.

Main Contribution 3 (Polynomial-Time Learnability with Membership Queries).

We show that c-acyclic CQs are efficiently exactly learnable in Angluin’s model of exact learnability with membership queries (Angluin88, ) (Sect. 5).

The restriction to c-acyclic CQs in this learnability result is natural, given that, as we mentioned above, exact learnability with membership queries requires the existence of a finite uniquely characterizing set of examples. Note however, that our results do not preclude the possibility that there exist larger efficiently exactly learnable classes of CQs: even if a class CC includes non-c-acyclic CQs, it may still be possible for every CQ qCq\in C to be uniquely characterizable within the class CC.

We mainly focus on positive and negative examples in this paper. Another natural type of data example is a pair (I,R)(I,R) where II is an input instance and RR is the entire relation that is computed by the query on II. We discuss this in Section 6, where we point out that all our results on characterizability and learnability remain true also when considering such data examples.

Finally, although our primary interest is in conjunctive queries, we show that our results also have implications for schema mappings and description logic concepts:

Main Contribution 4 (Schema Mappings and Description Logic Concepts).

As a further corollary to the above, in Sect. 7, we obtain a number of results regarding the existence of polynomial unique characterizations, as well as exact learnability, for LAV (“Local-As-View”) schema mappings and for description logic concepts for the lightweight description logic \mathcal{ELI} (Baader2017:introduction, ).

Related Work

Unique characterizations for CQs were first studied in (MannilaR86, ) in the context of automatic test data generation. More precisely, the authors propose the concept of an “adequate test case”, which is a database instance that can be used to distinguish a given CQ from all other, non-equivalent CQs from a given class. In our terminology, this corresponds to a uniquely characterizing input-output example (cf. Section 6). A positive result was obtained in (MannilaR86, ) for restricted classes of self-join-free CQs, by establishing a relationship to Armstrong databases (Armstrong1974, ). In (AlexeCKT2011, ), the authors study unique characterizations for various classes of schema mappings; we will make use of some of the technical results from (AlexeCKT2011, ), and in Section 7, we will discuss an application of our results to LAV schema mappings. In (Staworko15:characterizing, ), the authors study unique characterizability for XML twig queries.

Related work on learning CQs will be discussed in Section 5.

An earlier, extended-abstract version of this paper was published in (tCD2021:conjunctive, ). The present paper extends this conference version with additional results. In particular, Theorem 3.12 was shown in (tCD2021:conjunctive, ) only for the case of k=1k=1 (i.e., for unary CQs), and is generalized here to all k1k\geq 1. Furthermore, the treatment of input-output examples in Section 6 has been added.

Outline

Section 2 reviews basic facts and definitions. In Section 3, we present our two new polynomial-time algorithms for constructing frontiers for finite structures with distinguished elements. We also review a result by (NesetrilOssona2008, ), which implies the existence of (not necessarily polynomially computable) frontiers w.r.t. classes of structures of bounded expansion. In Section 4, we apply these algorithms to show that a CQ is uniquely characterizable by polynomially many examples, precisely if (modulo logical equivalence) it is c-acyclic. Furthermore, for c-acyclic CQs, a uniquely characterizing set of examples can be constructed in polynomial time. In the special case of unary, acyclic, connected CQs, a uniquely characterizing set of examples can be constructed consisting entirely of queries from the same class. In Section 5, we further build on these results, and we study the exact learnability of CQs. In Section 6, we consider another type of data examples, namely input-output examples. Section 7, finally, presents applications to schema mappings and description logic concepts.

2. Preliminaries

Schemas, Structures, Homomorphisms, Cores

A schema (or, relational signature) is a finite set of relation symbols 𝒮={R1,,Rn}\mathcal{S}=\{R_{1},\ldots,R_{n}\}, where each relation RiR_{i} as an associated arity arity(Ri)1\text{arity}(R_{i})\geq 1. For k0k\geq 0, by a structure over 𝒮\mathcal{S} with kk distinguished elements we will mean a tuple (A,a1,,ak)(A,a_{1},\ldots,a_{k}), where A=(dom(A),R1A,,RnA)A=(\text{dom}(A),R^{A}_{1},\ldots,R^{A}_{n}) is a finite structure (in the traditional, model-theoretic sense) over the schema 𝒮\mathcal{S}, and a1,,aka_{1},\ldots,a_{k} are elements of the domain of AA. Note that all structures, in this paper, are assumed to be finite, and we will drop the adjective “finite”. By a fact of a structure AA we mean an expression of the form R(a1,,an)R(a_{1},\ldots,a_{n}) where the tuple (a1,,an)(a_{1},\ldots,a_{n}) belongs to the relation RR in AA. Given two structures (A,a)(A,\textbf{a}) and (B,b)(B,\textbf{b}) over the same schema, where a=a1,ak\textbf{a}=a_{1},\ldots a_{k} and b=b1,,bk\textbf{b}=b_{1},\ldots,b_{k}, a homomorphism h:(A,a)(B,b)h:(A,\textbf{a})\to(B,\textbf{b}) is a map hh from the domain of AA to the domain of BB, such that hh preserves all facts (i.e., for each fact R(a1,,an)R(a_{1},\ldots,a_{n}) of AA, R(h(a1),,h(an))R(h(a_{1}),\ldots,h(a_{n})) is a fact of BB), and such that h(ai)=bih(a_{i})=b_{i} for i=1ki=1\ldots k. When such a homomorphism exists, we will also say that (A,a)(A,\textbf{a}) “homomorphically maps to” (B,b)(B,\textbf{b}) and we will write (A,a)(B,b)(A,\textbf{a})\to(B,\textbf{b}). We say that (A,a)(A,\textbf{a}) and (B,b)(B,\textbf{b}) are homomorphically equivalent if (A,a)(B,b)(A,\textbf{a})\to(B,\textbf{b}) and (B,b)(A,a)(B,\textbf{b})\to(A,\textbf{a}). We occasionally write ABA\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}B to say that ABA\to B and B↛AB\not\to A.

A structure is said to be a core if there is no homomorphism from the structure in question to a proper substructure (HN92, ). It is known (HN92, ) that every structure (A,a)(A,\textbf{a}) has a substructure to which it is homomorphically equivalent and that is a core. This substructure, moreover, is unique up to isomorphism, and it is known as the core of (A,a)(A,\textbf{a}).

We will make use of the following technical lemma in several places later on in this paper:

Lemma 2.1.

Let (A,a)(A,\textbf{a}) be a core structure, and let h:(B,b)(A,a)h:(B,\textbf{b})\to(A,\textbf{a}). If there is a homomorphism h:(A,a)(B,b)h^{\prime}:(A,\textbf{a})\to(B,\textbf{b}), then hh^{\prime} must be injective, and moreover, in this case, there is such hh^{\prime} with the additional property that the composition of hh with hh^{\prime} is the identity map on AA.

Proof.

The composition of hh with hh^{\prime} is an endomorphism of (A,a)(A,\textbf{a}) (that is, a homomorphism from the structure to itself). It is a well-known property of cores that every endomorphism is an automorphism (that is, an isomorphism from the structure to itself). Therefore, hh^{\prime} must be injective. Furthermore, by composing hh^{\prime} with the inverse of the automorphism, we ensure that its composition with hh is the identity function. ∎

Fact Graph, FG-Connectedness, FG-Disjoint Union

The fact graph of a structure (A,a)(A,\textbf{a}) is the undirected graph whose nodes are the facts of AA, and such that there is an edge between two distinct facts if they share a non-distinguished element, i.e., there exists an element bb of the domain of AA that is distinct from the distinguished elements a, such that bb occurs in both facts. We say that (A,a)(A,\textbf{a}) is fg-connected if the fact graph is connected. A fg-connected component of (A,a)(A,\textbf{a}) is a maximal fg-connected substructure (A,a)(A^{\prime},\textbf{a}) of (A,a)(A,\textbf{a}). If (A1,a)(A_{1},\textbf{a}) and (A2,a)(A_{2},\textbf{a}) are structures with the same distinguished elements, and whose domains are otherwise (except for these distinguished elements) disjoint, then the union (A1A2,a)(A_{1}\cup A_{2},\textbf{a}) of these two structures will be called a fg-disjoint union and will be denoted as (A1,a)(A2,a)(A_{1},\textbf{a})\uplus(A_{2},\textbf{a}). The same construction naturally extends to finite sets of structures. It is easy to see that every structure (A,a)(A,\textbf{a}) is equal to the fg-disjoint union of its fg-connected components. See also (Fagin2008towards, ; tencate2009laconic, ), where fg-connected components are called fact blocks.

Direct Product, Homomorphism Lattice

Given two structures (A,a)(A,\textbf{a}) and (B,b)(B,\textbf{b}) over the same schema, where a=a1,ak\textbf{a}=a_{1},\ldots a_{k} and b=b1,,bk\textbf{b}=b_{1},\ldots,b_{k}, the direct product (A,a)×(B,b)(A,\textbf{a})\times(B,\textbf{b}) is defined, as usual, as (A×B,a1,b1,,ak,bk)(A\times B,\langle a_{1},b_{1}\rangle,\ldots,\langle a_{k},b_{k}\rangle), where the domain of A×BA\times B is the Cartesian product of the domains of AA and BB, and where the facts of A×BA\times B are all facts R(c1,d1,,cn,dn)R(\langle c_{1},d_{1}\rangle,\ldots,\langle c_{n},d_{n}\rangle) for which it holds that R(c1,,cn)R(c_{1},\ldots,c_{n}) is a fact of AA and R(d1,,dn)R(d_{1},\ldots,d_{n}) is a fact of BB. The direct product of a finite collection of structures is defined analogously.

For a fixed schema 𝒮\mathcal{S} and k0k\geq 0, the collection of homomorphic-equivalence classes of structures over 𝒮\mathcal{S} with kk distinguished elements, ordered by homomorphism, forms a lattice. Specifically, the above direct product operation is a meet operation in the lattice-theoretic sense: (A,a)×(B,b)(A,\textbf{a})\times(B,\textbf{b}) homomorphically maps to both (A,a)(A,\textbf{a}) and (B,b)(B,\textbf{b}), and a structure (C,c)(C,\textbf{c}) homomorphically maps to (A,a)×(B,b)(A,\textbf{a})\times(B,\textbf{b}) if and only if it homomorphically maps to both (A,a)(A,\textbf{a}) and (B,b)(B,\textbf{b}). The join operation of the lattice is a little more tedious to define, and we only sketch it here, as it is not used in the remainder of the paper. For a structure (A,a)(A,\textbf{a}) with a=a1,,ak\textbf{a}=a_{1},\ldots,a_{k}, by the isomorphism type of the distinguished elements we will mean the equivalence relation over {1,,k}\{1,\ldots,k\} induced by the tuple a1,,aka_{1},\ldots,a_{k}. When two structures have the same isomorphism type of distinguished elements, their join is simply the fg-disjoint union as defined earlier. In the general case, one must first compute the smallest equivalence relation over {1,,k}\{1,\ldots,k\} that refines the isomorphism type of distinguished elements of both structures, and factor both structures through this equivalence relation, before taking their fg-disjoint union.

For structures without distinguished elements, this lattice has been studied extensively (cf. for instance (HellNesetril2004, ; nesetril2012sparsity, )). The above exposition shows how to lift some of the fundamental constructions to structures with distinguished elements. As we will see, it will be important in much of this paper to consider structures with distinguished elements, as these distinguished elements, intuitively, correspond to the free variables of a CQ.

Incidence Graph, Acyclicity, C-Acyclicity

Given a structure (A,a)(A,\textbf{a}), the incidence graph of AA is the bipartite multi-graph containing all elements of the domain of AA as well as all facts of AA, and an edge (a,f)(a,f) whenever aa is an element and ff is a fact in which aa occurs. Whenever an element aa occurs more than once in the same fact ff, the incidence graph contains a distinct edge for every occurrence of aa in ff. We will call a structure (A,a)(A,\textbf{a}) acyclic (also known as Berge-acyclic (Fagin83:acyclicity, )) if the incidence graph of AA is acyclic; (A,a)(A,\textbf{a}) is said to be c-acyclic if every cycle in its incidence graph contains at least one distinguished element, i.e., at least one element in a. In particular, acyclicity implies that no element occurs twice in the same fact, and c-acyclicity implies that no non-distinguished element occurs twice in the same fact. In the case without distinguished elements, c-acyclicity is equivalent to acyclicity. The concept of c-acyclicity was first introduced in (AlexeCKT2011, ) in the study of unique characterizability of GAV schema mappings (cf. Section 7 for more details). A straightforward dynamic-programming argument shows (Dalmau02:constraint, ):

Proposition 2.2.

For c-acyclic structures (A,a)(A,\textbf{a}) and (B,b)(B,\textbf{b}) (over the same schema and with the same number of distinguished elements), we can test in polynomial time whether (A,a)(B,b)(A,\textbf{a})\to(B,\textbf{b}). The core of a c-acyclic structure can be computed in polynomial time.

C-Connectedness

We say that a structure (A,a)(A,\textbf{a}) is c-connected if every connected component of its incidence graph contains at least one distinguished element. Note that this condition is only meaningful for structures with at least one distinguished element, and that it differs subtly from the condition of fg-connectedness we defined above. For example, the structure consisting of the facts R(a1,a2)R(a_{1},a_{2}) and S(a2,a1)S(a_{2},a_{1}) with distinguished elements a1,a2a_{1},a_{2}, is c-connected but is not fg-connected. For any structure (A,a)(A,\textbf{a}), we denote by (A,a)reach(A,\textbf{a})^{\textrm{reach}} the (unique) maximal c-connected substructure, that is, the substructure containing everything reachable from the distinguished elements.

Proposition 2.3.

If (A,a)(A,\textbf{a}) is c-connected, then (A,a)(B,b)reach(A,\textbf{a})\to(B,\textbf{b})^{\textrm{reach}} iff (A,a)(B,b)(A,\textbf{a})\to(B,\textbf{b}).

Conjunctive Queries

Let k0k\geq 0. A kk-ary conjunctive query (CQ) qq over a schema 𝒮\mathcal{S} is an expression of the form   q(x) :- α1αnq(\textbf{x})\text{ :- }\alpha_{1}\land\cdots\land\alpha_{n}   where x=x1,,xk\textbf{x}=x_{1},\ldots,x_{k} is a sequence of variables, and where each αi\alpha_{i} is an atomic formula using a relation from 𝒮\mathcal{S} and using variables as arguments only. Note that αi\alpha_{i} may use variables from x as well as other variables. In addition, it is required that each variable in x occurs in at least one conjunct αi\alpha_{i}. This requirement is referred to as the safety condition.

Note that, for simplicity, this definition of CQ does not allow the use of constants. Many of the results in this paper, however, can be extended in a straightforward way to CQs with a fixed finite number of constants (which can be simulated using additional free variables).

If AA is a structure over the same schema as qq, we denote by q(A)q(A) the set of all kk-tuples of values that satisfy the query qq in AA. We write qqq\subseteq q^{\prime} if qq and qq^{\prime} are queries over the same schema, and of the same arity, and q(A)q(A)q(A)\subseteq q^{\prime}(A) holds for all structures AA. We say that qq and qq^{\prime} are logically equivalent if qqq\subseteq q^{\prime} and qqq^{\prime}\subseteq q both hold. We refer to any textbook on database theory for a more detailed exposition of the semantics of CQs, and we will restrict ourselves to giving an equivalent presentation of the semantics of CQs through canonical structures and the Chandra-Merlin theorem.

There is a well-known correspondence between kk-ary CQs over a schema 𝒮\mathcal{S} and structures over 𝒮\mathcal{S} with kk distinguished elements. In one direction, we can associate to each kk-ary CQ q(x)q(\textbf{x}) over the schema 𝒮\mathcal{S} a corresponding structure over 𝒮\mathcal{S} with kk distinguished elements, namely q^=(Aq,x)\widehat{q}=(A_{q},\textbf{x}), where the domain of AqA_{q} is the set of variables occurring in qq, and the facts of AqA_{q} are the conjuncts of qq. We will call this structure q^\widehat{q} the canonical structure of qq. Note that every distinguished element of q^\widehat{q} occurs in at least one fact, as follows from the safety condition of CQs. Conversely, consider any structure (A,a)(A,\textbf{a}), with a=a1,,ak\textbf{a}=a_{1},\ldots,a_{k}, such that every distinguished element aia_{i} occurs in at least one fact of AA. We can associate to (A,a)(A,\textbf{a}) a kk-ary canonical CQ, namely the CQ that has a variable xax_{a} for every value aa in the domain of AA occurring in at least one fact, and a conjunct for every fact of AA.

By the classic Chandra-Merlin Theorem (CM77, ), a tuple a belongs to q(A)q(A) if and only if there is a homomorphism from q^\widehat{q} to (A,a)(A,\textbf{a}); and qqq\subseteq q^{\prime} holds if and only if there is a homomorphism from q^\widehat{q^{\prime}} to q^\widehat{q}. Finally, qq and qq^{\prime} are logically equivalent if and only if q^\widehat{q} and q^\widehat{q^{\prime}} are homomorphically equivalent.

Exact Learning Models, Conjunctive Queries as a Concept Class

Informally, an exact learning algorithm is an algorithm that identifies an unknown goal concept by asking a number of queries about it. The queries are answered by an oracle that has access to the goal concept. This model of learning was introduced by Dana Angluin, cf. (Angluin88, ). In this paper, we consider the two most extensively studied kinds of oracle queries: membership queries and equivalence queries. We will first review basic notions from computational learning theory, such as the notion of a concept, and then explain what it means for a concept class to be efficiently exactly learnable with membership and/or equivalence queries.

Let XX be a (possibly infinite) set of examples. A concept over XX is a function c:X{0,1}c:X\to\{0,1\}, and a concept class 𝒞{\mathcal{C}} is a collection of such concepts. We say that xXx\in X is a positive example for a concept cc if c(x)=1c(x)=1, and that xx is a negative example for cc if c(x)=0c(x)=0.

Conjunctive queries (over a fixed schema 𝒮\mathcal{S} and with a fixed arity kk) are a particular example of such a concept class, where the example space is the class of all structures over 𝒮\mathcal{S} with kk distinct elements, and where an example (A,a)(A,\textbf{a}) is labeled as positive if the tuple a belongs to q(A)q(A), and negative otherwise.

It is always assumed that concepts are specified using some representation system so that one can speak of the length of the specification of a concept. More formally, a representation system for 𝒞\mathcal{C} is a string language \mathcal{L} over some finite alphabet, together with a surjective function r:𝒞r:\mathcal{L}\to\mathcal{C}. By the size of a concept c𝒞c\in\mathcal{C}, we will mean the length of the smallest representation. Similarly, we assume a representation system, with a corresponding notion of length, for the examples in XX. When there is no risk of confusion, we may conflate concepts (and examples) with their representations.

Specifically, for us, when it comes to structures, any natural choice of representation will do; we only assume that the length of the specification of a structure (for a fixed schema) is polynomial in the domain size, the number of facts and the number of distinguished elements. Likewise for CQs, we assume that the length of the representation of a CQ is polynomial in that of its canonical structure.

For every concept cc, we denote by MEMc\text{\rm MEM}_{c} the membership oracle for cc, that is, the oracle that takes as input an example xx and returns its label, c(x)c(x), according to cc. Similarly, for every concept c𝒞c\in\mathcal{C}, we denote by EQc\text{\rm EQ}_{c}, the equivalence oracle for cc, that is, the oracle that takes as input the representation of a concept hh and returns “yes”, if h=ch=c, or returns a counterexample xx otherwise (that is, an example xx such that h(x)c(x)h(x)\neq c(x)). An exact learning algorithm with membership and/or equivalence queries for a concept class CC is an algorithm alg that takes no input but has access to the membership oracle MEMc\text{\rm MEM}_{c} and/or equivalence oracle EQc\text{\rm EQ}_{c} for some concept cCc\in C, which will be called the goal concept. Importantly, while the algoritm may interact with the oracle(s), it does not know which concept cCc\in C is the goal concept. 111It is common in the learning theory literature to assume that the learning algorithm is given an upper bound on the size of the goal concept as input. However, it turns out that such an assumption is not needed for any of our positive results concerning learnability. Intuitively, the algorithm alg must determine cc by asking oracle queries. More precisely, for every choice of cCc\in C, alg must terminate after a finite amount of time, and output (some representation of) the goal concept cc. This notion was introduced by Angluin (Angluin88, ), who also introduced the notion of a polynomial-time exact learning algorithm. We say that an exact learning algorithm alg with membership and/or equivalence queries runs in polynomial time if there exists a two-variable polynomial p(n,m)p(n,m) such that at any point during the run of the algorithm, the time used by alg up to that point (counting one step per oracle call) is bounded by p(n,m)p(n,m), where nn is the size of the goal concept and mm the size of the largest counterexample returned by calls to the equivalence oracle up to that point in the run (m=0m=0 if no equivalence queries have been used). A concept class 𝒞{\mathcal{C}} is efficiently exactly learnable with membership and/or equivalence queries if there is an exact learning algorithm with membership and/or equivalence queries for 𝒞\mathcal{C} that runs in polynomial time.

There is a delicate issue about this notion of polynomial time that we now discuss. One might be tempted to relax the previous definition by requiring merely that the total running time is bounded by p(n,m)p(n,m). However, this change in the definition would give rise to a wrong notion of a polynomial-time algorithms in this context by way of a loophole in the definition. Indeed, under this change, one could design a learning algorithm that, in a first stage, identifies the goal hypothesis by (expensive) exhaustive search and that, once this is achieved, forces —by asking equivalence queries with hypotheses that are appropriate modifications of the goal concept— the equivalence oracle to return large counterexamples that would make up for the time spent during the exhaustive search phase.

3. Frontiers in the homomorphism lattice of structures

In this section, we define frontiers, as well as the relation notions of gap pairs and (restricted) homomorphism dualities, and we will discuss their relationships. We present two polynomial-time methods for constructing frontiers.

For the applications in the next sections, it is important to consider structures with distinguished elements. These distinguished elements, intuitively, correspond to the free variables of a CQ. Specifically, Proposition 4.2 in Section 4 will link unique characterizations for kk-ary CQs to frontiers for structures with kk distinguished elements. For this reason, all the results in this section are stated for structures with distinguished elements.

Definition 3.1.

Fix a schema and k0k\geq 0, and let 𝒞\mathcal{C} be a class of structures with kk distinguished elements and let (A,a)(A,\textbf{a}) be a structure with kk distinguished elements. A frontier for (A,a)(A,\textbf{a}) w.r.t. 𝒞\mathcal{C}, is a finite set of structures FF such that

  1. (1)

    (B,b)(A,a)(B,\textbf{b})\to(A,\textbf{a}) for all (B,b)F(B,\textbf{b})\in F.

  2. (2)

    (A,a)↛(B,b)(A,\textbf{a})\not\to(B,\textbf{b}) for all (B,b)F(B,\textbf{b})\in F.

  3. (3)

    For all (C,c)𝒞(C,\textbf{c})\in\mathcal{C} with (C,c)(A,a)(C,\textbf{c})\to(A,\textbf{a}) and (A,a)↛(C,c)(A,\textbf{a})\not\to(C,\textbf{c}), we have that (C,c)(B,b)(C,\textbf{c})\to(B,\textbf{b}) for some (B,b)F(B,\textbf{b})\in F.

See Figure 1 for a graphical depiction of a frontier.

The notion of a frontier is closely related to that of a gap pair. While gap pairs will not play an important role, we will explain the relationship here to provide context. A pair of structures (B,A)(B,A) with BAB\to A is said to be a gap pair if A↛BA\not\to B, and every structure CC satisfying BCB\to C and CAC\to A is homomorphically equivalent to either BB or AA (NesetrilTardif2000, ). The same concept applies to structures with distinguished elements. It is easy to see that any frontier for a structure AA must contain (modulo homomorphic equivalence) all structures BB such that (B,A)(B,A) is a gap pair.

Example 3.2.

Let 𝒮={R,P,Q}\mathcal{S}=\{R,P,Q\}. The structure (A,a1)(A,a_{1}) consisting of facts P(a1)P(a_{1}) and Q(a1)Q(a_{1}) (with distinguished element a1a_{1}) has a frontier of size 2 (w.r.t. the class of all finite structures), namely F={(B,a1),(C,a1)}F=\{(B,a_{1}),(C,a_{1})\} where BB consists of the facts P(a1),P(b),Q(b)P(a_{1}),P(b),Q(b) and CC consists of the facts Q(a1),P(b),Q(b)Q(a_{1}),P(b),Q(b), respectively. Note that ((B,a1),(A,a1))((B,a_{1}),(A,a_{1})) and ((C,a1),(A,a1))((C,a_{1}),(A,a_{1})) are gap pairs. It can be shown that the structure (A,a1)(A,a_{1}) has no frontier of size 1 (as such a frontier would have to consist of a structure that contains both facts P(a1)P(a_{1}) and Q(a1)Q(a_{1})).

For another example, consider the structure (A,a1)(A^{\prime},a_{1}) consisting of facts P(a1)P(a_{1}) and R(b,b)R(b,b). It is the right hand side of a gap pair (the left hand side being the structure (B,a1)(B^{\prime},a_{1}) consisting of the facts R(b,b)R(b,b) and P(b)P(b^{\prime})), but (A,a1)(A^{\prime},a_{1}) has no frontier as follows from Theorem 3.7 below.

Frontiers are also closely related to (generalized) homomorphism dualities (FoniokNT08, ), and we will be making use of results about homomorphism dualities. We say that a structure (A,a)(A,\textbf{a}) has a finite duality w.r.t.  a class 𝒞\mathcal{C} if there is a finite set of structures DD such that for all (C,c)𝒞(C,\textbf{c})\in\mathcal{C}, (A,a)(C,c)(A,\textbf{a})\to(C,\textbf{c}) iff for all (B,b)D(B,\textbf{b})\in D, (C,c)↛(B,b)(C,\textbf{c})\not\to(B,\textbf{b}). The set DD may contain structures that are not in 𝒞\mathcal{C}. If 𝒞{\mathcal{C}} is the set of all structures (over the same schema as (A,a)(A,\textbf{a})), we simply say that (A,a)(A,\textbf{a}) has a finite duality.222We note here that in the literature on Constraint Satisfaction, it is usual to consider the ’other side’ of the duality, i.e, a structure AA is said to have finite duality if there exists a finite set of structures FF such that for every structure CC, CAC\to A iff for all BFB\in F, B↛CB\not\to C.

Example 3.3.

In the realm of digraphs, viewed as relational structures without distinguished elements with a single binary relation, every directed path AA of, say, k>1k>1 nodes has finite duality (w.r.t. the class of digraphs). Indeed, it is not difficult to verify that for every digraph CC, ACA\to C iff C↛DC\not\to D where DD is the digraph with nodes {1,k1}\{1,\dots k-1\} and edges {(i,j)i<j}\{(i,j)\mid i<j\}. (This example is known as the Gallai-Hasse-Roy-Vitaver Theorem. Amusingly, this result was obtained and published independently by all these four researchers, each in a different language, in the 1960s).

The next lemma is a minor variation of a result from (NesetrilTardif2000, ).

Lemma 3.4.

Let 𝒞\mathcal{C} be any class of structures.

  1. (1)

    If a structure (A,a)(A,\textbf{a}) has a finite duality w.r.t. 𝒞\mathcal{C} then (A,a)(A,\textbf{a}) has a frontier w.r.t. 𝒞\mathcal{C}.

  2. (2)

    If a structure (A,a)𝒞(A,\textbf{a})\in\mathcal{C} has a frontier w.r.t. 𝒞\mathcal{C} and 𝒞\mathcal{C} is closed under direct products, then (A,a)(A,\textbf{a}) has a finite duality w.r.t. 𝒞\mathcal{C}.

Proof.

1. Let DD be a finite set of structures that forms a duality for (A,a)(A,\textbf{a}). Then {(A,a)×(B,b)(B,b)D}\{(A,\textbf{a})\times(B,\textbf{b})\mid(B,\textbf{b})\in D\} is a frontier for (A,a)(A,\textbf{a}). This follows immediately from the fact that, for all (C,c)(C,\textbf{c}) with (C,c)(A,a)(C,\textbf{c})\to(A,\textbf{a}), we have that (C,c)(A,a)×(B,b)(C,\textbf{c})\to(A,\textbf{a})\times(B,\textbf{b}) if and only if (C,c)(B,b)(C,\textbf{c})\to(B,\textbf{b}).

2. We use a construction from (NesetrilTardif2000, ) involving an exponentiation operation on structures. Let BB and CC be structures (without distinguished elements) over the same schema. Then we denote by BCB^{C} the structure where

  • the domain of BCB^{C} is the set of all functions from the domain of CC to the domain of BB

  • a fact R(f1,,fn)R(f_{1},\ldots,f_{n}) belongs to BCB^{C} if for every fact of the form R(a1,,an)CR(a_{1},\ldots,a_{n})\in C, the fact R(f1(a1),,fn(an))R(f_{1}(a_{1}),\ldots,f_{n}(a_{n})) belongs to BB.

This construction is characterized by the property that, for all structures DD, DBCD\to B^{C} if and only if D×CBD\times C\to B (HellNesetril2004, ).

Let FF be a frontier for (A,a)(A,\textbf{a}). Let

D={(BA,h)(B,b)F and h is a tuple of functions such that hi(ai)=bi}D=\{(B^{A},\textbf{h})\mid\text{$(B,\textbf{b})\in F$ and $\textbf{h}$ is a tuple of functions such that $h_{i}(a_{i})=b_{i}$}\}

We claim that DD forms a finite duality for (A,a)(A,\textbf{a}). Consider any structure (C,c)𝒞(C,\textbf{c})\in\mathcal{C}. We must show that (C,c)(C,\textbf{c}) homomorphically maps to a structure in DD iff (A,a)↛(C,c)(A,\textbf{a})\not\to(C,\textbf{c}).

Suppose that there is a homomorphism h:(C,c)(BA,h)h:(C,\textbf{c})\to(B^{A},\textbf{h}) for some (BA,h)D(B^{A},\textbf{h})\in D. Then h~:(A,a)×(C,c)(B,b)\tilde{h}:(A,\textbf{a})\times(C,\textbf{c})\to(B,\textbf{b}) where h~(a,c)=h(c)(a)\tilde{h}(\langle a,c\rangle)=h(c)(a). Note that, indeed, h~(ai,ci)=bi\tilde{h}(\langle a_{i},c_{i}\rangle)=b_{i}. Since (B,b)F(B,\textbf{b})\in F and FF is a frontier, it follows that (A,a)↛(A,a)×(C,c)(A,\textbf{a})\not\to(A,\textbf{a})\times(C,\textbf{c}) and therefore, by the properties of direct products, (A,a)↛(C,c)(A,\textbf{a})\not\to(C,\textbf{c}).

Conversely, suppose (A,a)↛(C,c)(A,\textbf{a})\not\to(C,\textbf{c}). Then (A,a)↛(A,a)×(C,c)(A,\textbf{a})\not\to(A,\textbf{a})\times(C,\textbf{c}). Hence, there is a homomorphism h:(A,a)×(C,c)(B,b)h:(A,\textbf{a})\times(C,\textbf{c})\to(B,\textbf{b}) for some (B,b)F(B,\textbf{b})\in F. It follows that (C,c)(BA,h)(C,\textbf{c})\to(B^{A},\textbf{h}) where h=h1,,hk\textbf{h}=h_{1},\ldots,h_{k} with hih_{i} the function given by hi(x)=h(x,ai)h_{i}(x)=h(x,a_{i}). Note that hi(ai)=bih_{i}(a_{i})=b_{i} and hence (BA,h)D(B^{A},\textbf{h})\in D. ∎

Note that the construction of the frontier from the duality is polynomial, while the construction of the duality from the frontier involves an exponential blowup. The following example shows that this is unavoidable.

Example 3.5.

The path 𝑅R1𝑅R2Rn𝑅\circ\xrightarrow{R}\circ\xrightarrow{R_{1}}\circ\xrightarrow{R}\circ\xrightarrow{R_{2}}\circ\cdots\circ\xrightarrow{R_{n}}\circ\xrightarrow{R}\circ, viewed as a structure without any distinguished elements, has a frontier (w.r.t. the class of all finite structures) of size polynomial in nn, as will follow from Theorem 3.8 below. It is known, however, that any finite duality for this structure must involve a structure whose size is exponential in nn, and the example can be modified to use a fixed schema (cf. (NesetrilTardif2005, )).

3.1. Frontiers for classes with bounded expansion

The notion of a class of graphs with bounded expansion was introduced in (nesetril2018grad, ). Intuitively, a class of graphs has bounded expansion if all of its shallow minors are sparse. We will not give a precise definition here, but important examples include graphs of bounded degree, graphs of bounded treewidth, planar graphs, and any class of graphs excluding a minor. The same concept of bounded expansion can be applied also to arbitrary structures: a class of structures 𝒞\mathcal{C} is said to have bounded expansion if the class of Gaifman graphs of structures in 𝒞\mathcal{C} has bounded expansion. We refer to (nesetril2012sparsity, ) for more details. Classes of structures of bounded expansion are in many ways computationally well-behaved (cf. for example (kazana2020firsts, )).

Nešetřil and Ossona de Mendez (NesetrilOssona2008, ; nesetril2012sparsity, ) show that if 𝒞\mathcal{C} is any class of structures with bounded expansion, then every structure has a finite duality w.r.t. 𝒞\mathcal{C}. It follows by Lemma 3.4 that also every structure has a fontier w.r.t. 𝒞\mathcal{C}. Nešetřil and Ossona de Mendez (NesetrilOssona2008, ; nesetril2012sparsity, ) only consider connected structures without distinguished elements,333Note that the various notions of connectedness, such as based on the incidence graph, fact graph, or Gaifman graph, all coincide for relational structures without distinguished elements. but their result extends in a straightforward way to the general case of structures with distinguished elements. Furthermore, it yields an effective procedure for constructing frontiers, although non-elementary (i.e., not bounded by a fixed tower of exponentials).

Theorem 3.6 (from (NesetrilOssona2008, ; nesetril2012sparsity, )).

Let 𝒞\mathcal{C} be any class of structures that has bounded expansion. Then every structure (A,a)(A,\textbf{a}) has a frontier w.r.t. 𝒞\mathcal{C}, which can be effectively constructed.

Proof.

Nešetřil and Ossona de Mendez (NesetrilOssona2008, ) stated their result for the case where AA is a connected structure without distinguished elements. They show that, every such structure AA has a finite duality w.r.t. 𝒞\mathcal{C}, and hence, by Lemma 3.4, also a fontier w.r.t. 𝒞\mathcal{C}. The result extends to disconnected structures through standard arguments: let AA be any structure (without distinguished elements). We may assume without loss of generality that AA is a core (because every structure is homomorphically equivalent to its core, and hence every frontier for the latter is a frontier for the former). Let A1,,AnA_{1},\ldots,A_{n} be the connected components of AA. Since AA is a core, A1,,AnA_{1},\ldots,A_{n} are pairwise homomorphically incomparable. Take all structures of the form A1Ai1BAi+1AnA_{1}\uplus\cdots\uplus A_{i-1}\uplus B\uplus A_{i+1}\uplus\cdots\uplus A_{n}, for BB a structure belonging to the frontier of AiA_{i} (for i=1ni=1\ldots n). It is straightforward to show that this yields a frontier for AA.

Next, we show how to extend this to structures with distinguished elements. For any structure (A,a)(A,\textbf{a}) with a=a1,,an\textbf{a}=a_{1},\ldots,a_{n}, let AaA^{\textbf{a}} be the structure (without distinguished elements) over expanded schema with additional unary predicates P1,,PnP_{1},\ldots,P_{n}, where each PiP_{i} denotes {ai}\{a_{i}\}. Since AA and AaA^{\textbf{a}} have the same Gaifman graph, and since 𝒞\mathcal{C} has bounded expansion, the same holds for {Aa(A,a)}\{A^{\textbf{a}}\mid(A,\textbf{a})\}. Therefore, it suffices to show that whenever AaA^{\textbf{a}} has a frontier w.r.t. {Cc(C,c)𝒞}\{C^{\textbf{c}}\mid(C,\textbf{c})\in\mathcal{C}\}, then (A,a)(A,\textbf{a}) has a frontier w.r.t. 𝒞\mathcal{C}.

Let F={B1,,Bm}F=\{B_{1},\ldots,B_{m}\} be a frontier for AaA^{\textbf{a}} w.r.t. {Cc(C,c)𝒞}\{C^{\textbf{c}}\mid(C,\textbf{c})\in\mathcal{C}\}. Now consider all ways of taking a structure BFB\in F and choosing one element per unary predicate PiP_{i}. In this way we obtain a set of structures FF^{\prime} with distinguished elements, that we claim is a frontier for (A,a)(A,\textbf{a}) w.r.t. 𝒞\mathcal{C}. Note that any homomorphism h:(A,a)(B,b)h:(A,\textbf{a})\to(B,\textbf{b}) for (B,b)F(B,\textbf{b})\in F^{\prime} is a homomorphism from AaA^{\textbf{a}} to BbB^{\textbf{b}}, therefore since FF is a frontier for AaA^{\textbf{a}}, there is no such homomorphism hh. It is also clear that each (B,b)F(B,\textbf{b})\in F^{\prime} maps homomorphically to (A,a)(A,\textbf{a}). Finally, consider any (C,c)𝒞(C,\textbf{c})\in\mathcal{C} that maps to (A,a)(A,\textbf{a}) but not vice versa. Then CcAaC^{\textbf{c}}\to A^{\textbf{a}} and Aa↛CcA^{\textbf{a}}\not\to C^{\textbf{c}}. Hence, there is a homomorphism h:CcBh:C^{\textbf{c}}\to B for some BFB\in F. It follows that h:(C,c)(B,b)Fh:(C,\textbf{c})\to(B,\textbf{b})\in F^{\prime} where each bi=h(ci)b_{i}=h(c_{i}). ∎

3.2. Polynomial frontiers for c-acyclic structures

Alexe et al. (AlexeCKT2011, ), building on Foniok et al. (FoniokNT08, ), show that a structure has a finite duality if and only if its core is c-acyclic. By Lemma 3.4, this implies that a structure has a frontier if and only if its core is c-acyclic.

Theorem 3.7 (from (FoniokNT08, ; AlexeCKT2011, )).

For all structures (A,a)(A,\textbf{a}), the following are equivalent:

  1. (1)

    (A,a)(A,\textbf{a}) has a frontier w.r.t. the class of all structures,

  2. (2)

    (A,a)(A,\textbf{a}) is homomorphically equivalent to a c-acyclic structure,

  3. (3)

    The core of (A,a)(A,\textbf{a}) is c-acyclic

One of our main results is a new proof of the right-to-left direction, which, unlike the original, provides a polynomial-time construction of a frontier from a c-acyclic structure:

Theorem 3.8.

Fix a schema 𝒮\mathcal{S} and k0k\geq 0. Given a c-acyclic structure over 𝒮\mathcal{S} with kk distinguished elements, we can construct in polynomial time a frontier w.r.t. the class of all structures over 𝒮\mathcal{S} that have kk distinguished elements.

Note that the size of the smallest frontier is in general exponential in kk. Indeed, consider the single-element structure (A,a)(A,\textbf{a}) where AA consists of the single fact P(a)P(a) and a=a,,a\textbf{a}=a,\ldots,a has length kk. It is not hard to show that every frontier of this (c-acyclic) structure must contain, up to homomorphic equivalence, all structures of the form (B,b)(B,\textbf{b}) where BB consists of two facts, P(a1)P(a_{1}) and P(a2)P(a_{2}), and b{a1,a2}k\textbf{b}\in\{a_{1},a_{2}\}^{k} is a sequence in which both a1a_{1} and a2a_{2} occur. There are exponentially many pairwise homomorphically incomparable such structures.

The proof of Theorem 3.8 is based on a construction that improves over a similar but exponential construction of gap pairs for acyclic structures given in (NesetrilTardif2000, , Def. 3.9). Our results also shed new light on a question posed in the same paper: after presenting a double-exponential construction of duals (for connected structures without distinguished elements), involving first constructing an exponential-sized gap pair, the authors ask: “It would be interesting to know to what extent the characterisation of duals can be simplified, and whether the indirect approach via density is optimal.” This question appeared to have been answered in (NesetrilTardif2005, ), where a direct method was established for constructing single-exponential size duals. Theorem 3.8 together with Lemma 3.4, however, gives another answer: single-exponential duals can be constructed by going through frontiers (i.e., “via density”) as well.

Recall the definition of fg-connectedness from the preliminaries. We first prove a restricted version of Theorem 3.8 for the special case of core, fg-connected, c-acyclic structures with the Unique Names Property. We subsequently lift these extra assumptions. A structure (A,a)(A,\textbf{a}) with a=a1,ak\textbf{a}=a_{1},\ldots a_{k} has the Unique Names Property (UNP) if aiaja_{i}\neq a_{j} for all i<ji<j (cf. (baader2003basic, )).

Proposition 3.9.

Given a core, fg-connected, c-acyclic structure with UNP, we can construct in polynomial time (for fixed schema 𝒮\mathcal{S} and number of distinguished elements kk) a frontier w.r.t. the class of all finite structures. Furthermore, the frontier consists of a single structure, which has the UNP. 444 An earlier conference version of this paper had a bug in the proof of this proposition, as was pointed out to us by Raoul Koudijs (p.c.).

Proof.

Let a core fg-connected c-acyclic structure (A,a)(A,\textbf{a}) with UNP be given. To reduce notational complexity in the remainder of this proof, we will simply write AA instead of (A,a)(A,\textbf{a}), even when referring to the structure including the distinguished elements.

Note that each fg-connected structure either (i) consists of a single fact containing only distinguished elements, or (ii) consists of a number of facts that all contain at least one non-distinguished element. Therefore, we can distinguish two cases:

Case 1. AA consists of a single fact ff without non-distinguished elements. Let (B,a)(B,\textbf{a}) be the structure whose domain is {a1,,ak,b}\{a_{1},\ldots,a_{k},b\}, where a=a1,,ak\textbf{a}=a_{1},\ldots,a_{k} and bb is a fresh value distinct from a1,,aka_{1},\ldots,a_{k}; and which contains all facts over this domain except ff. It is easy to see that (B,a)(B,\textbf{a}) is a homomorphism dual for (A,a)(A,\textbf{a}). Indeed, consider any structure (C,c)(C,\textbf{c}), and let ff^{\prime} be a copy of the fact ff in which each element aia_{i} is replaced by the corresponding element cic_{i}. If h:(A,a)(C,c)h:(A,\textbf{a})\to(C,\textbf{c}) then CC contains ff^{\prime}, therefore, (C,c)↛(B,a)(C,\textbf{c})\not\to(B,\textbf{a}); if, on the other hand, (A,a)↛(C,c)(A,\textbf{a})\not\to(C,\textbf{c}), then CC omits ff^{\prime}, and hence, (C,c)(B,a)(C,\textbf{c})\to(B,\textbf{a}). It follows that, the direct product (A,a)×(B,a)(A,\textbf{a})\times(B,\textbf{a}) constitutes a singleton frontier for (A,a)(A,\textbf{a}). Note that this construction is polynomial because we assume that the schema 𝒮\mathcal{S} and kk are both fixed.

Case 2. AA consists of one or more facts that each contain a non-distinguished element. In this case, we construct a singleton frontier F={(B,b)}F=\{(B,\textbf{b})\} where

  • the domain of BB consists of

    1. (1)

      all pairs (a,f)(a,f) where aa is a non-distinguished element of AA and ff is a fact of AA in which aa occurs, and

    2. (2)

      All pairs (a,id)(a,\textsf{id}) and (a,nd)(a,\textsf{nd}), where aa is a distinguished element of AA

  • a fact R((a1,f1),,(an,fn))R((a_{1},f_{1}),\ldots,(a_{n},f_{n})) holds in BB if and only if R(a1,,an)R(a_{1},\ldots,a_{n}) holds in AA and at least one fif_{i} is either a fact that is different from the fact R(a1,,an)R(a_{1},\ldots,a_{n}) itself, or is nd

  • The distinguished elements b are b1=(a1,id),,bn=(an,id)b_{1}=(a_{1},\textsf{id}),\ldots,b_{n}=(a_{n},\textsf{id}), for a=a1,,an\textbf{a}=a_{1},\ldots,a_{n}.

Note that, in the above construction, id and nd are symbols (not functions), used to simplify notation by ensuring that every element of BB can be written as a pair. The symbols id and nd, intuitively, stand for “identity” and ”non-distinguished copy”,

We claim that F={(B,b)}F=\{(B,\textbf{b})\} is a frontier for AA.

It is clear that the natural projection h:(B,b)(A,a)h:(B,\textbf{b})\to(A,\textbf{a}) is a homomorphism.

We claim that there is no homomorphism h:(A,a)(B,b)h^{\prime}:(A,\textbf{a})\to(B,\textbf{b}). Assume, for the sake of a contradiction, that there was such a homomorphism. By Lemma 2.1, we may assume that the composition of hh and hh^{\prime} is the identity function on AA. In particular, this means that hh^{\prime} maps each distinguished element aa to (a,id)(a,\textsf{id}) and for each non-distinguished element aa of AA, h(a)=(a,f)h^{\prime}(a)=(a,f) for some fact ff. For a non-distinguished element aa, let us denote by faf_{a} the unique fact ff for which h(a)=(a,fa)h^{\prime}(a)=(a,f_{a}).

We will consider “walks” in AA of the form

a1fa1a2fa2ana_{1}\xrightarrow{f_{a_{1}}}a_{2}\xrightarrow{f_{a_{2}}}\ldots a_{n}

with n1n\geq 1, where

  1. (1)

    a1,,ana_{1},\ldots,a_{n} are non-distinguished elements,

  2. (2)

    faifai+1f_{a_{i}}\neq f_{a_{i+1}}, and

  3. (3)

    aia_{i} and ai+1a_{i+1} co-occur in fact faif_{a_{i}},

Since AA is c-acyclic, the length of any such sequence is bounded by the diameter of AA (otherwise some fact would have to be traversed twice in succession, which would violate condition 2). Furthermore, trivially, such a walk of length n=1n=1 exists: just choose as a1a_{1} an arbitrary non-distinguished element of AA. Furthermore, we claim that any such finite sequence can be extended to a longer one: let the fact fanf_{a_{n}} be of the form R(b1,,bm)R(b_{1},\ldots,b_{m}) (where an=bia_{n}=b_{i} for some imi\leq m). Since hh is a homomorphism, it must map fanf_{a_{n}} to some fact R((b1,fb1),,(bm,fbm))R((b_{1},f_{b_{1}}),\ldots,(b_{m},f_{b_{m}})) of BB, where some fbjf_{b_{j}} is a fact that is different from fanf_{a_{n}}. We can choose bjb_{j} as our element an+1a_{n+1}. Thus, we reach our desired contradiction.

Finally, consider any CC with h:CAh:C\to A and A↛CA\not\to C. We construct a function h:CBh^{\prime}:C\to B as follows: consider any element cc of CC, and let h(c)=ah(c)=a. If cc is a distinguished element (in which case aa is, too), we set h(c)=(a,id)h^{\prime}(c)=(a,\textsf{id}). If cc is not a distinguished element but aa is, we set h(c)=(a,nd)h^{\prime}(c)=(a,\textsf{nd}). Otherwise, we proceed as follows: since AA is c-acyclic and fg-connected, for each non-distinguished element aa^{\prime} of AA (other than aa itself) there is a unique minimal path in the incidence graph, containing only non-distinguished elements, from aa^{\prime} to aa. We can represent this path by a sequence of the form

a=a0(f0,i0,j0)a1(f1,i1,j1)a2(fn1,in1,jn1)an=aa^{\prime}=a_{0}\xrightarrow{(f_{0},i_{0},j_{0})}a_{1}\xrightarrow{(f_{1},i_{1},j_{1})}a_{2}\cdots\xrightarrow{(f_{n-1},i_{n-1},j_{n-1})}a_{n}=a

where each ff_{\ell} is a fact of AA in which aa_{\ell} occurs in the ii_{\ell}-th position and a+1a_{\ell+1} occurs in the jj_{\ell}-th position. We can partition the non-distinguished elements aa^{\prime} of AA (other than aa itself) according to the last fact on this path, that is, fn1f_{n-1}. Furthermore, it follows from fg-connectedness that each fact of AA contains a non-distinguished element. It is easy to see that if a fact contains multiple non-distinguished elements (other than aa) then they must all belong to the same part of the partition as defined above. Therefore, the above partition on non-distinguished elements naturally extends to a partition on the facts of AA. Note that if AA contains any facts in which aa is the only non-distinguished element, we will refer to these facts as “local facts” and they will be handled separately. In this way, we have essentially decomposed AA into a union AlocaliAiA_{\text{local}}\cup\bigcup_{i}A_{i}, where AlocalA_{\text{local}} contains all local facts and each “component” AiA_{i} is a substructure of AA consisting of non-local facts, in such a way that (i) different substructures AiA_{i} do not share any facts with each other, (ii) different substructures do not share any elements with each other, except for aa and distinguished elements (from a), (iii) each AiA_{i} contains precisely one fact involving aa.

Since we know that (A,a)↛(C,c)(A,a)\not\to(C,c), it follows that either some local fact ff of AA does not map to CC (when sending a,a\textbf{a},a to c,c\textbf{c},c), or some “component” AiA_{i} of AA does not map to CC through any homomorphism sending a,a\textbf{a},a to c,c\textbf{c},c. In the first case, we choose such local fact ff and set h(a)=(a,f)h^{\prime}(a)=(a,f). In the second case, we choose such a component (if there are multiple, we choose one of minimal size) and let ff be the unique fact in that component containing aa (that is, ff is the fact fn1f_{n-1} that by construction connected the non-distinguished elements of the component in question to aa). We set h(c)=(a,f)h^{\prime}(c)=(a,f). Intuitively, when h(c)=(h(c),f)h^{\prime}(c)=(h(c),f), then ff is a fact of AA involving h(c)h(c) that “points in a direction where homomorphism from (A,h(c))(A,h(c)) back to (C,c)(C,c) fails”.

We claim that hh^{\prime} is a homomorphism from CC to BB: let R(c1,,cn)R(c_{1},\ldots,c_{n}) be a fact of CC. Then R(h(c1),,h(cn))R(h(c_{1}),\ldots,h(c_{n})) holds in AA. Let h(ci)=(h(ci),fi)h^{\prime}(c_{i})=(h(c_{i}),f_{i}) as constructed above (where, we recall, fi=idf_{i}=\textsf{id} if cic_{i} is a distinguished element, and fi=ndf_{i}=\textsf{nd} if cic_{i} is not a distinguished element but h(ci)h(c_{i}) is). Also recall that at least one cic_{i} is a non-distinguished element. To show that R(h(c1),,h(cn))R(h^{\prime}(c_{1}),\ldots,h^{\prime}(c_{n})) holds in BB, it suffices to show that some fif_{i} is different from the fact R(h(c1),,h(cn))R(h(c_{1}),\ldots,h(c_{n})) itself, or is equal to nd. If some non-distinguished cic_{i} is mapped by hh to a distinguished element, then h(ci)=(ci,nd)h^{\prime}(c_{i})=(c_{i},\textsf{nd}), and we are done. This leaves us with the case where some cic_{i} is a non-distinguished element, and for all non-distinguished cic_{i}, h(ci)h^{\prime}(c_{i}) is of the form (h(ci),fi)(h(c_{i}),f_{i}) for a fact fif_{i}. If one of these fif_{i} is a local fact, then it follows immediately from the construction that fiR(h(c1),,h(cn))f_{i}\neq R(h(c_{1}),\ldots,h(c_{n})). Otherwise, let nin_{i} be the size of the smallest “component” (as defined above) of (A,h(ci))(A,h(c_{i})) that does not homomorphically map to (C,ci)(C,c_{i}), and choose an element cic_{i} with minimal nin_{i}. Then, clearly, fif_{i} must be different from the fact R(h(c1),,h(cn))R(h(c_{1}),\ldots,h(c_{n})) itself. ∎

Next, we remove the assumptions of fg-connectedness and being a core.

Proposition 3.10.

Given a c-acyclic structure with UNP, we can construct in polynomial time a frontier w.r.t. the class of all finite structures. Furthermore, the frontier consists of structures that have the UNP.

Proof.

By Proposition 2.2, we may assume that (A,a)(A,\textbf{a}) is a core. Note that the c-acyclicity and UNP properties are preserved under the passage from a structure to its core.

Let (A,a)(A,\textbf{a}) be a structure with distinguished elements that is UNP and that is a fg-disjoint union of homomorphically incomparable fg-connected structures (A1,a),,(An,a)(A_{1},\textbf{a}),\ldots,(A_{n},\textbf{a}). By Proposition 3.9, (A1,a),,(An,a)(A_{1},\textbf{a}),\ldots,(A_{n},\textbf{a}) have, respectively, frontiers F1,,FnF_{1},\ldots,F_{n}, each consisting of a single structure with the UNP. We may assume without loss of generality that each FiF_{i} consists of a structure that have the same distinguished elements a (we know that the structures in question have the UNP, and therefore, modulo isomorphism, we can assume that the distinguished elements are precisely a). Let Fi={(Bi,a)}F_{i}=\{(B_{i},\textbf{a})\}.

We claim that F={(ji(Aj,a))(Bi,a)1in}F=\{\big{(}\biguplus_{j\neq i}(A_{j},\textbf{a})\big{)}\uplus(B_{i},\textbf{a})\mid 1\leq i\leq n\} is a frontier for (A,a)(A,\textbf{a}) w.r.t. 𝒞\mathcal{C}.

Clearly, each structure in FF maps homomorphically to AA.

Suppose, for the sake of contradiction, that there is a homomorphism h:(A,a)(ji(Aj,a))(Bi,a)h:(A,\textbf{a})\to\big{(}\biguplus_{j\neq i}(A_{j},\textbf{a})\big{)}\uplus(B_{i},\textbf{a}) for some ii. Observe that hh must send each distinguished element to itself, and it must send each non-distinguished element to a non-distinguished element (otherwise, the composition of hh with the backward homomorphism would be a non-injective endomorphism on (A,a)(A,\textbf{a}) which would contradict the fact that (A,a)(A,\textbf{a}) is a core). Since (Ai,a)(A_{i},\textbf{a}) is fg-connected (and because hh cannot send non-distinguished elements to distinguished elements), its hh-image must be contained either in some (Aj,a)(A_{j},\textbf{a}) (jij\neq i) or in BB. The former cannot happen because AiA_{i} and AjA_{j} are homomorphically incomparable. The latter cannot happen either, because BB belongs to a frontier of AiA_{i}.

Finally, let (C,c)𝒞(C,\textbf{c})\in\mathcal{C} be any structure such that there is a homomorphism h:(C,c)(A,a)h:(C,\textbf{c})\to(A,\textbf{a}) but (A,a)↛(C,c)(A,\textbf{a})\not\to(C,\textbf{c}). Let (Ai,a)(A_{i},\textbf{a}) be a fg-connected component of (A,a)(A,\textbf{a}) such that (Ai,a)↛(C,c)(A_{i},\textbf{a})\not\to(C,\textbf{c}). Since (Ai,a)(A_{i},\textbf{a}) is fg-connected, we can partition our structure (C,c)(C,\textbf{c}) as (C1,c)(C2,c)(C_{1},\textbf{c})\uplus(C_{2},\textbf{c}) where the hh-image of C1C_{1} is contained in (Ai,a)(A_{i},\textbf{a}) while the hh-image of C2C_{2} is disjoint from AiA_{i} except possibly for the distinguished elements. We know that (Ai,a)↛(C1,c)(A_{i},\textbf{a})\not\to(C_{1},\textbf{c}) and therefore (C1,c)(Bi,a)(C_{1},\textbf{c})\to(B_{i},\textbf{a}). Furthermore, we have that (C2,c)ji(Aj,a)(C_{2},\textbf{c})\to\biguplus_{j\neq i}(A_{j},\textbf{a}). Therefore, (C,c)(ji(Aj,a))(Bi,a)(C,\textbf{c})\to\big{(}\biguplus_{j\neq i}(A_{j},\textbf{a})\big{)}\uplus(B_{i},\textbf{a}). ∎

Finally, we can prove Theorem 3.8 itself.

Proof of Theorem 3.8.

Let (A,a)(A,\textbf{a}) be c-acyclic. If it has the UNP, we are done. Consider the other case, where the sequence a contains repetitions. Let a=a1,,an\textbf{a}^{\prime}=a^{\prime}_{1},\ldots,a^{\prime}_{n} consists of the same elements without repetition (in some order). We construct a frontier for it as follows:

  1. (1)

    Consider the structure (A,a)(A,\textbf{a}^{\prime}), which, by construction, has the UNP. Let FF be a frontier for (A,a)(A,\textbf{a}^{\prime}) (again consisting of structures with the UNP), using Proposition 3.10. Note that, through isomorphism, we may assume that each structure in FF has the same distinguished elements a\textbf{a}^{\prime}. For each (B,a)F(B,\textbf{a}^{\prime})\in F, we take the structure (B,a)(B,\textbf{a}).

  2. (2)

    Let kk be the length of the tuple a. For each function f:{1,,k}{1,,k}f:\{1,\ldots,k\}\to\{1,\ldots,k\}, whose range has size strictly greater than nn, consider structure (C,cf)(C,\textbf{c}^{f}) where CC contains all facts over the domain {1,,k}\{1,\ldots,k\}, and cif=f(i)c^{f}_{i}=f(i). We take its direct product with (A,a)(A,\textbf{a}).

It is easy to see that the set of all structures constructed above, constitutes a frontier for (A,a)(A,\textbf{a}). Indeed, suppose a structure maps to (A,a)(A,\textbf{a}) but not vice versa. If the tuple of distinguished elements of the structure in question has the same identity type as the tuple a (i.e., the same equalities hold between values at different indices in the tuple) then it is easy to see that the structure in question must map to some structure (B,a)(B,\textbf{a}) as constructed under item 1 above. Otherwise, if the tuple of distinguished elements of the structure in question does not have the same identity type, then it is easy to see that the structure in question must map to (Cf,cf)×(A,a)(C^{f},\textbf{c}^{f})\times(A,\textbf{a}), as constructed under item 2 above, where ff reflects the identity type of the distinguished elements of the structure in question. ∎

As a corollary of Theorem 3.8, we obtain the following interesting by-product:

Theorem 3.11.

Fix a schema 𝒮\mathcal{S} and k0k\geq 0. The following problem is solvable in NP: given a finite set of structures FF and a structure AA (all with kk distinguished elements), is FF a frontier for AA w.r.t. the class of all structures? If 𝒮\mathcal{S} contains a binary relation and k=3k=3, then it is NP-complete.

Proof.

For the upper bound, we use the fact that, if AA is homomorphically equivalent to a c-acyclic structure AA^{\prime}, then the core of AA is c-acyclic (cf. Theorem 3.7). The problem can therefore be solved in non-deterministic polynomial time as follows:

First we guess a substructure AA^{\prime} and we verify that AA^{\prime} is c-acyclic and homomorphically equivalent to AA. Note that the existence of such AA^{\prime} is a necessary precondition for FF to be a frontier of AA. Furthermore, c-acyclicity can be checked in polynomial time using any PTIME algorithm for graph acyclicity (recall that a structure is c-acyclic if and only if its incidence graph is acyclic after removing all nodes corresponding to distinguished elements).

Next, we apply Theorem 3.8 to construct a frontier FF^{\prime} for AA^{\prime} (and hence for AA). Finally, we verify that each BFB\in F homomorphically maps to some BFB^{\prime}\in F^{\prime} and, vice versa, every BFB^{\prime}\in F^{\prime} homomorphically maps to some BFB\in F. It is not hard to see that this non-deterministic algorithm has an accepting run if and only if FF is a frontier for AA.

For the lower bound, we reduce from graph 3-colorability. Let AA be the structure, over a 3-element domain, that consists of the facts R(a,b)R(a,b) for all pairs a,ba,b with aba\neq b. In addition, each of the three elements is named by a constant symbol. Since AA is c-acyclic, by Theorem 3.7, it has a frontier FF. Now, given any graph GG (viewed as a relational structure with binary relation RR and without constant symbols), we have that GG is 3-colorable if and only if FF is a frontier for the disjoint union of AA with GG. To see that this is the case, note that if GG is 3-colorable, then the disjoint union of AA with GG is homomorphically equivalent to AA itself, whereas if GG is not 3-colorable, then the disjoint union of AA with GG is strictly greater than AA in the homomorphism order. ∎

3.3. A polynomially frontier-closed class of structures

We call a class 𝒞\mathcal{C} of structures frontier-closed if every structure (A,a)𝒞(A,\textbf{a})\in\mathcal{C} has a frontier w.r.t. 𝒞\mathcal{C}, consisting of structures belonging to 𝒞\mathcal{C}. If, moreover, the frontier in question can be constructed from (A,a)(A,\textbf{a}) in polynomial time, then we say that 𝒞\mathcal{C} is polynomially frontier-closed.

Theorem 3.12.

Fix a schema 𝒮\mathcal{S} and k1k\geq 1. The class of c-connected acyclic structures with kk distinguished elements is polynomially frontier-closed. 555Note that for structures with one distinguished element, c-connectedness is the same as connectedness.

In fact, the construction presented below shows that the polynomial bound holds even when the schema is treated as part of the input of the problem (although kk does need to be fixed, as the size of the constructed frontier depends exponentially on kk).

As will follow from results in Section 4 (cf. Theorems 4.6-4.8 below) the theorem fails if we drop any of the three restrictions in the statement (i.e., c-connectedness, acyclicity, and k1k\geq 1).

The special case with binary relations only and k=1k=1

The remainder of this section is dedicated to the proof of Theorem 3.12. To simplify the presentation of the proof, we will first assume that the schema consists of binary relations only, and that k=1k=1. Afterwards, we will show how to lift these restrictions.

Let 𝒮\mathcal{S} be a schema consisting of binary relation symbols, and fix a finite structure (A,a0)(A,a_{0}) that is c-connected and acyclic. We will assume, in addition, that (A,a0)(A,a_{0}) is a core, which we may do without loss of generality, because the core of an acyclic structure can be computed in polynomial time (Proposition 2.2) and the properties of c-connectedness and acyclicity are preserved under passage from a structure to its core:

Proposition 3.13.

The properties of c-connectedness and acyclicity are preserved when passing from a structure to its core.

Proof.

That acyclicity is preserved follows immediately from the fact that the core is a substructure of the original structure. For c-connectedness, the argument is as follows: let (B,b)(B,\textbf{b}) be a c-connected structure, and let (B,b)(B^{\prime},\textbf{b}) be its core. It follows from the definition of a core that (B,b)(B,b)(B,\textbf{b})\leftrightarrow(B^{\prime},\textbf{b}). By Proposition 2.3, this implies that (B,b)(B,b)reach(B,\textbf{b})\leftrightarrow(B^{\prime},\textbf{b})^{\text{reach}}, and hence, (B,b)(B,b)reach(B^{\prime},\textbf{b})\leftrightarrow(B^{\prime},\textbf{b})^{\text{reach}}. It follows by the minimality property of cores that (B,b)=(B,b)reach(B,\textbf{b})=(B^{\prime},\textbf{b})^{\text{reach}}, i.e., (B,b)(B^{\prime},\textbf{b}) is c-connected. ∎

We shall slightly abuse notation and, for every relation symbol RR and every a,bAa,b\in A we shall say that R(a,b)R^{-}(a,b) holds in (or is a fact of) AA if R(b,a)R(b,a) is a fact of AA. We can think of AA as an (oriented) tree rooted at a0a_{0}, where every edge bcb\to c has been oriented away from a0a_{0} and is labelled RR or RR^{-} depending on whether R(b,c)R(b,c) or R(c,b)R(c,b) is a fact of AA.

Definition 3.14 (A|aA|a).

For any element aa of AA, we denote by A|aA|a the substructure of AA consisting of the oriented subtree rooted at aa.

Definition 3.15 (rank).

For any element aa of AA, rank(a)rank(a) is the depth of the oriented tree A|aA|a.

The construction of frontiers that we will describe below, involves a recursion on rank. Note that rank(a)=0rank(a)=0 when aa is maximally far from a0a_{0} (in other words, the leafs of the oriented tree have rank 0). Furthermore, if there is an edge aba\to b (in the oriented tree) then rank(b)<rank(a)rank(b)<rank(a).

The frontier construction below is split into two steps: for each element aa, we first construct a set of structures a\mathcal{F}_{a}, and, subsequently, we construct a modified set of structures a\mathcal{F}^{*}_{a}.

Definition 3.16 (a\mathcal{F}_{a}).

Fix any node aa of AA. Let a1,,ana^{\prime}_{1},\ldots,a^{\prime}_{n} be the children of aa (in the oriented tree), and let S1,,SnS_{1},\ldots,S_{n} be the corresponding edge labels. We can depict A|aA|a as follows:

\Treeaa\edgeS1S_{1}a1a^{\prime}_{1}\edge\edgeS2S_{2}a2a^{\prime}_{2}\edge\edge\edgeSnS_{n}ana^{\prime}_{n}\edge

If rank(a)=0rank(a)=0 (that is, if n=0n=0) we define a=\mathcal{F}_{a}=\emptyset. Otherwise, we define a={H1,,Hn}\mathcal{F}_{a}=\{H^{1},\ldots,H^{n}\} where HiH^{i} is obtained from A|aA|a in the following way. Set HiH^{i} to be the result of removing A|aiA|a^{\prime}_{i} from A|aA|a. Then, we add to HiH^{i} a fresh isomorphic copy FF of each structure in ai\mathcal{F}_{a^{\prime}_{i}}, joining aa with the newly created copies of aia^{\prime}_{i} with a SiS_{i}-edge. See Figure 2 for an illustration.

\Treea\edgeS1S_{1}a1a^{\prime}_{1}\edge\edge\edgeSi1S_{i-1}ai1a^{\prime}_{i-1}\edge\edge\edge\edge\edge\edge\edge
FaiF\in\mathcal{F}_{a^{\prime}_{i}}
\edge\edge
\edge\edgeSi+1S_{i+1}ai+1a^{\prime}_{i+1}\edge\edge\edgeSnS_{n}ana^{\prime}_{n}\edgeSiS_{i}SiS_{i}
Figure 2. A depiction of the construction of a\mathcal{F}_{a} in Definition 3.16.

Observe that, for FaF\in\mathcal{F}_{a}, there is a natural mapping h:FAh:F\to A (indeed, following the induction of the construction, we can see that each element of FF is either an element of AA or else was introduced as an isomorphic copy of an element of AA).

Definition 3.17 (a\mathcal{F}^{*}_{a}).

We define a={FFa}\mathcal{F}^{*}_{a}=\{F^{*}\mid F\in\mathcal{F}_{a}\}, where FF^{*} is defined as follows. Let h:FAh:F\to A be the natural mapping. Then, FF^{*} is obtained from FF by adding, for each element bb of FF with h(b)ah(b)\neq a, a fresh isomorphic copy of AA together with a connecting SS-edge from cc to bb, where cc is (the newly created copy of) the parent of h(b)h(b) and SS is the edge label of the edge from cc to h(b)h(b) in AA.

\Treea0a_{0}\edgea1a_{1}\edgea2a_{2}\edgea3a_{3}\edgea4a_{4}        \Treea0a_{0}\edgea1a_{1}\edgea2a_{2}\edgea3a_{3}        \Treea0a_{0}\edgea1a_{1}\edgea2a_{2}\edgea3a_{3}\Treea0a^{\prime}_{0}\edgea1a^{\prime}_{1}\edgea2a^{\prime}_{2}\edgea3a^{\prime}_{3}\edgea4a^{\prime}_{4}\Treea0′′a^{\prime\prime}_{0}\edgea1′′a^{\prime\prime}_{1}\edgea2′′a^{\prime\prime}_{2}\edgea3′′a^{\prime\prime}_{3}\edgea4′′a^{\prime\prime}_{4}\Treea0′′′a^{\prime\prime\prime}_{0}\edgea1′′′a^{\prime\prime\prime}_{1}\edgea2′′′a^{\prime\prime\prime}_{2}\edgea3′′′a^{\prime\prime\prime}_{3}\edgea4′′′a^{\prime\prime\prime}_{4}
Input (uni-relational) a0\mathcal{F}_{a_{0}} single structure a0\mathcal{F}^{*}_{a_{0}} single structure
structure (A,a0)(A,a_{0})
Figure 3. First example illustrating Definitions 3.16 and 3.17.
\Treeaa\edgeRRbb\edgeRRcc\edgeSSdd        \Treeaa\edgeRRbb\edgeRRcc\edgeRRbb^{\prime}\edgeSSdd        \Treeaa\edgeRRbb\edgeRRcc\edgeRRbb^{\prime}\edgeSSdd\Tree\edgeRR\edgeRR\edgeSS\Tree\edgeRR\edgeRR\edgeSS\Tree\edgeRR\edgeRR\edgeSS\Tree\edgeRR\edgeRR\edgeSSRRRRRRSS
Input structure a\mathcal{F}_{a} single structure a\mathcal{F}^{*}_{a} single structure
(A,a)(A,a)
Figure 4. Second example illustrating Definitions 3.16 and 3.17.

The examples in Figures 3 and 4 illustrate the construction of a\mathcal{F}_{a} and a\mathcal{F}^{*}_{a}. In these figures, for clarity, the distinguished element is marked by a circle. In Figure 3, all edges represent the same binary relation RR, and edge labels are omitted for the sake of readability. In both examples, it happens (coincidentally) that the frontier consists of a single structure.

Definition 3.23 and 3.24 further down may provide additional intuition on the ()(\cdot)^{*} operation used in Definition 3.17.

Theorem 3.18.

Let (A,a0)(A,a_{0}) be a finite structure that is core, c-connected and acyclic, let aa be any node of AA, and let a\mathcal{F}^{*}_{a} be as defined above.

  1. (1)

    For each FaF^{*}\in\mathcal{F}^{*}_{a}, (F,a)(A,a)(F^{*},a)\to(A,a).

  2. (2)

    For each FaF^{*}\in\mathcal{F}^{*}_{a}, (A|a,a)↛(F,a)(A|a,a)\not\to(F^{*},a).

  3. (3)

    Let (B,b)(B,b) be acyclic and c-connected. If (B,b)(A,a)(B,b)\to(A,a) and (A|a,a)↛(B,b)(A|a,a)\not\to(B,b), then (B,b)(B,b) maps homomorphically to (F,a)(F,a) for some structure FaF\in\mathcal{F}^{*}_{a}.

In particular (since A|a0=AA|a_{0}=A), a0\mathcal{F}^{*}_{a_{0}} is a frontier for (A,a0)(A,a_{0}) w.r.t. the class of c-connected, acyclic structures with one distinguished element.

Proof.

Item 1 follows immediately from the construction: the natural projection from (F,a)(F^{*},a) to (A,a)(A,a) is a homomorphism.

For item 2, we proceed by induction on rank(a)rank(a). Item 2 holds true, trivially, when rank(a)=0rank(a)=0, as a\mathcal{F}^{*}_{a} is empty in this case. For the inductive case, now, assume that rank(a)>0rank(a)>0. In this case, by definition, we know that FaF\in\mathcal{F}_{a} is of the form HiH_{i} for some child aia^{\prime}_{i} of aa. In other words, FF was obtained from A|aA|a by removing the subtree A|aiA|a^{\prime}_{i}, and (provided rank(ai)>0rank(a^{\prime}_{i})>0), adding a fresh isomorphic copy of each structure in ai\mathcal{F}_{a^{\prime}_{i}} joining aa with the newly created copy of aia^{\prime}_{i} with a SiS_{i}-edge, where SiS_{i} is the label of the edge aaia\to a^{\prime}_{i} in the original structure.

Consider the homomorphism h:(F,a)(A,a)h:(F^{*},a)\to(A,a) given by item 1. Suppose for the sake of a contradiction that there were also a homomorphism h:(A|a,a)(F,a)h^{\prime}:(A|a,a)\to(F^{*},a). Let h(ai)=bh^{\prime}(a^{\prime}_{i})=b. Towards our contradiction, we perform a case distinction on bb. Clearly, bb must be one of the neighbours of aa in FF^{*}. By construction, these neighbours of aa in FF^{*} are: (i) the children a1,,ai1,ai+1,,ana^{\prime}_{1},\ldots,a^{\prime}_{i-1},a^{\prime}_{i+1},\ldots,a^{\prime}_{n} in FF; (ii) the copies of aia^{\prime}_{i} belonging to isomorphic copies of structures in ai\mathcal{F}_{a^{\prime}_{i}}; and, provided aa0a\neq a_{0}, (iii) the parent, pp, of aa in AA, as well as the copy of pp introduced in Definition 3.17.

Let us first consider case (i) and (iii). In both cases, let h′′h^{\prime\prime} be the composition of hh^{\prime} and hh. Then h′′h^{\prime\prime} is a homomorphism from (A|a,a)(A|a,a) to (A,a)(A,a) whose range omits aia^{\prime}_{i}. Furthermore, h′′h^{\prime\prime} extends straightforwardly to a non-injective endomorphism on (A,a0)(A,a_{0}) by mapping all elements outside A|aA|a to themselves. This is a contradiction with the fact that (A,a0)(A,a_{0}) is a core.

Finally consider case (ii). By a similar argument as the above, no element cc from A|aiA|a^{\prime}_{i} can be mapped by hh^{\prime} to aa. For, in this case, the composition h′′h^{\prime\prime} of hh^{\prime} and hh would be a homomorphism from (A|a,a)(A|a,a) to (A,a)(A,a) whose range excludes cc, and h′′h^{\prime\prime} could then be extended to a non-injective endomorphism on (A,a0)(A,a_{0}), contradicting the fact that (A,a0)(A,a_{0}) is a core. It follows that the restriction of hh^{\prime} to A|aiA|a^{\prime}_{i} must be such that its range is entirely contained in (F,ai)(F^{\prime*},a^{\prime}_{i}) for some FF^{\prime} in ai\mathcal{F}^{*}_{a^{\prime}_{i}}, i.e., it defines a homomorphism from A|aA|a^{\prime} to (F,ai)(F^{\prime*},a^{\prime}_{i}), which contradicts the inductive hypothesis.

Item 3 is proved by induction on rank(a)rank(a). Again, item 3 holds true, trivially, when rank(a)=0rank(a)=0. Note that when rank(a)=0rank(a)=0, A|aA|a is a single-node structure without any relations. Therefore, it is impossible that (A|a,a)↛(B,b)(A|a,a)\not\to(B,b).

Now, consider the inductive case with rank(a)>0rank(a)>0. Since (A|a,a)↛(B,b)(A|a,a)\not\to(B,b), it must be the case that, for some child aia^{\prime}_{i} of aa there is no element bb^{\prime} in BB such that (A|ai,ai)(B,b)(A|a^{\prime}_{i},a^{\prime}_{i})\to(B,b^{\prime}) and Si(b,b)S_{i}(b,b^{\prime}) holds in BB where SiS_{i} is the label of the edge joining aa and aia^{\prime}_{i}. Let F=HiaF=H_{i}\in\mathcal{F}_{a} be the corresponding structure as constructed in Definition 3.16.

Let hh be the homomorphism from (B,b)(B,b) to (A,a)(A,a) given by the hypothesis. We shall construct a homomorphism hh^{\prime} from (B,b)(B,b) to (F,a)(F^{*},a). We have h(b)=ah^{\prime}(b)=a. Note that, since BB is acyclic and c-connected, we can similarly regard BB as an ordered tree rooted at bb. Then, for each child, bb^{\prime}, of bb, we define hh^{\prime} on B|bB|b^{\prime} as follows. If h(b)=aih(b^{\prime})=a^{\prime}_{i}, then since (A|ai,ai)(A|a^{\prime}_{i},a^{\prime}_{i}) has no homomorphism to (B,b)(B,b^{\prime}), we apply the inductive hypothesis to define hh^{\prime} on B|bB|b^{\prime}. If h(b)h(b^{\prime}) is the parent of aa then we map B|bB|b^{\prime} entirely to the isomorphic copy of AA attached to aa introduced in Definition 3.17. If h(b)h(b^{\prime}) is some child of aa different than aia^{\prime}_{i} then we define hh^{\prime} on B|bB|b^{\prime} starting at bb^{\prime} and by increasing depth as in the homomorphism, hh, from (B,b)(B,b) to (A,a)(A,a) until we find some edge cdc\to d in B|bB|b^{\prime} such that its hh-image is not an edge in AA. When we found such edge cdc\to d, then we define h(d)h^{\prime}(d) to be the copy of h(d)h(d) attached to h(c)h(c) in FF^{*} according to Definition 3.17 and we extend hh^{\prime} to the rest of elements in B|dB|d mapping them as well to the same isomorphic copy. ∎

Definition 3.19.

The size of a structure AA, denoted by size(A)size(A), is the number of facts. The total size of a set of structures is totalsize(𝒜)=ΣA𝒜(size(A)+1)totalsize(\mathcal{A})=\Sigma_{A\in\mathcal{A}}(size(A)+1)

(The definition of total size is conveniently chosen so that the total size of a set of tree-shaped structures is equal to the size of a tree-shaped structure consisting of the given forest with an additional root connected to the root of each original tree.)

Theorem 3.20 (Our construction is polynomial).

totalsize(a0)=O(size(A)4)totalsize(\mathcal{F}^{*}_{a_{0}})=O(size(A)^{4}).

Proof.

We will show that totalsize(a0)size(A)3totalsize(\mathcal{F}_{a_{0}})\leq size(A)^{3}. The proposition then follows immediately, by construction of \mathcal{F}^{*}. More precisely, we show that, for all elements aa, totalsize(a)size(A|a)3totalsize(\mathcal{F}_{a})\leq size(A|a)^{3}. The claim is proved by induction on rank(a)rank(a).

If rank(a)=0rank(a)=0, then totalsize(a)=0totalsize(\mathcal{F}_{a})=0. If rank(a)>0rank(a)>0, it follows from the construction of a\mathcal{F}_{a} that

totalsize(a)nsize(A|a)+Σi=1ntotalsize(ai)size(A|a)2+Σi=1nsize(A|ai)3size(A|a)(size(A|a)+Σi=1nsize(A|ai)2)size(A|a)(size(A|a)+(size(A|a)1)2)size(A|a)size(A|a)2size(A|a)3\begin{split}totalsize(\mathcal{F}_{a})&\leq n\cdot size(A|a)+\Sigma_{i=1}^{n}totalsize(\mathcal{F}_{a^{\prime}_{i}})\\ &\leq size(A|a)^{2}+\Sigma_{i=1}^{n}size(A|a^{\prime}_{i})^{3}\\ &\leq size(A|a)\cdot\big{(}size(A|a)+\Sigma_{i=1}^{n}size(A|a^{\prime}_{i})^{2}\big{)}\\ &\leq size(A|a)\cdot\big{(}size(A|a)+(size(A|a)-1)^{2}\big{)}\\ &\leq size(A|a)\cdot size(A|a)^{2}\\ &\leq size(A|a)^{3}\end{split}

where nn is the number of successors of aa. Note that we are using here the fact that Σi=1nsize(A|ai)size(A|a)1\Sigma_{i=1}^{n}size(A|a^{\prime}_{i})\leq size(A|a)-1, and the general fact that Σi=1nxi2(Σi=1nxi)2\Sigma_{i=1}^{n}x_{i}^{2}\leq(\Sigma_{i=1}^{n}x_{i})^{2}. ∎

Extending the result to k1k\geq 1

We now extend the result to k1k\geq 1. For the time being, we still assume that the schema consists of binary relations only.

Definition 3.21 (Skeleton and offshoots).

Let (A,a)(A,\textbf{a}) be a c-connected, acyclic structure with at least one distinguished element. A skeleton node of (A,a)(A,\textbf{a}) (with a=a1,,ak\textbf{a}=a_{1},\ldots,a_{k}) is any element ss that is either a distinguished element (i.e., s{a1,,ak}s\in\{a_{1},\ldots,a_{k}\}) or such that ss lies on a minimal path between two distinguished elements. We denote by skeleton(A,a)\operatorname{skeleton}(A,\textbf{a}) the substructure of (A,a)(A,\textbf{a}) (with the same distinguished elements), consisting of the skeleton nodes only. It is easy to see that, in a c-connected and acyclic structure, for every element cc there is a unique skeleton node ss that is closest to cc, i.e., such that cc is connected to ss and such that every path from cc to any other skeleton node passes through ss. In this case, we say that cc is affiliated with the skeleton node ss. For any skeleton node ss of (A,a)(A,\textbf{a}), the offshoot of ss in (A,a)(A,\textbf{a}), which we will denote by (A|s,s)(A|s,s), 666We acknowledge that this notation is slightly misleading as it does not reflect the dependence of (A|s,s)(A|s,s) on the distinguished elements a. is the structure (A,s)(A^{\prime},s) where AA^{\prime} is the substructure of AA consisting of all elements affiliated with ss (including ss itself), and where ss is the only distinguished element.

Thus, every c-connected and acyclic structure can be thought of as consisting of a skeleton with offshoots, similar to the example depicted schematically in Figure 5.

a1a_{1}a2a_{2}a3a_{3}a4a_{4}a5a_{5}
Figure 5. Schematic depiction of an example c-connected acyclic structure with five distinguished elements. The skeleton consists of the black nodes (which are the elements of the structure that are either distinguished or lie on a shortest path connecting two distinguished elements), while the triangles depict offshoots.

Our frontier construction for structures with multiple distinguished elements will make use of two operations that we now define.

Definition 3.22 (Splitting).

Consider a structure (A,a)(A,\textbf{a}) with a=a1,,ak\textbf{a}=a_{1},\ldots,a_{k}. Let (X,Y)(X,Y) be a proper partition of {1,,k}\{1,\ldots,k\}, that is, XX and YY are disjoint non-empty sets such that XY={1,,k}X\cup Y=\{1,\ldots,k\}. We will denote by split(X,Y)(A,a)\textrm{split}_{(X,Y)}(A,\textbf{a}) the structure (A,a)(A^{\prime},\textbf{a}^{\prime}) where

  • A=A(1)A(2)A^{\prime}=A^{(1)}\uplus A^{(2)} is a disjoint union of two isomorphic copies of AA. For each element aa of AA, we will denote its two copies in AA^{\prime} by a(1)a^{(1)} and a(2)a^{(2)}, respectively.

  • For a=a1,,an\textbf{a}=a_{1},\ldots,a_{n}, we set a=a1(j1),,an(jn)\textbf{a}^{\prime}=a_{1}^{(j_{1})},\ldots,a_{n}^{(j_{n})} where ji=1j_{i}=1 if iXi\in X and ji=2j_{i}=2 otherwise.

For the next definition, we need to introduce some auxiliary notation. If a,ba,b are elements belonging to the same connected component of some structure AA, we will denote by dist(a,b)dist(a,b) the length of the shortest path from aa to bb in the incidence graph of AA (where the length may be counted by the number of facts on the path). If a,b,ca,b,c are elements that all belong to the same connected component of AA we will write a<cba<_{c}b if dist(a,c)<dist(b,c)dist(a,c)<dist(b,c) (“aa is closer to cc than bb is”).

Definition 3.23 (Radial extension).

Let h:(A,s)(B,t)h:(A,s)\to(B,t) be a homomorphism between acyclic structures with one distinguished element. We denote by radial-extensionh:(A,s)(B,t)(A)\textrm{radial-extension}_{h:(A,s)\to(B,t)}(A) the structure obtained from AA as follows: for each fact ff containing elements a1,a2a_{1},a_{2} with a1<sa2a_{1}<_{s}a_{2}, we add a disjoint isomorphic copy B~\widetilde{B} of BB (without distinguished elements), and we add a connecting fact that is a copy of ff in which a1a_{1} is replaced by the isomorphic copy of h(a1)h(a_{1}) in B~\widetilde{B}.

Note that the way in which a\mathcal{F}^{*}_{a} was constructed from a\mathcal{F}_{a} in Definition 3.17, is a concrete instance of the radial-extension operation.

In the statement of the next lemma, we write (A,a,s)(A,\textbf{a},s), where (A,a)(A,\textbf{a}) is a structure with kk distinguished elements, to denote the corresponding structure with k+1k+1 distinguished elements.

Lemma 3.24.

Let (A,a,s)(A,\textbf{a},s) and (B,b,t)(B,\textbf{b},t) be acyclic structures with (B,b,t)(A,a,s)(B,\textbf{b},t)\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}(A,\textbf{a},s), and such that (A,a,s)(A,\textbf{a},s) is a core. Let B=radial-extensionh:(B,t)(A,s)(B)B^{\prime}=\textrm{radial-extension}_{h:(B,t)\to(A,s)}(B) for some h:(B,b,t)(A,a,s)h:(B,\textbf{b},t)\to(A,\textbf{a},s). Then (B,b,t)(B,b,t)(A,a,s)(B,\textbf{b},t)\to(B^{\prime},\textbf{b},t)\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}(A,\textbf{a},s).

Proof.

Clearly, (B,b,t)(B^{\prime},\textbf{b},t) extends (B,b,t)(B,\textbf{b},t). It is also clear from the construction that the map hh naturally extends to a homomorphism g:(B,b,t)(A,a,s)g:(B^{\prime},\textbf{b},t)\to(A,\textbf{a},s) (using the natural projection for the additional elements of BB^{\prime}).

Let us show now that (A,a,s)(A,\textbf{a},s) does not map homomorphically to (B,b,t)(B^{\prime},\textbf{b},t). For the sake of a contradiction, assume that h:(A,a,s)(B,b,t)h^{\prime}:(A,\textbf{a},s)\to(B^{\prime},\textbf{b},t). We can assume from Lemma 2.1 that the composition of gg and hh^{\prime} is injective. Since (A,a,s)↛(B,b,t)(A,\textbf{a},s)\not\to(B,\textbf{b},t), hh^{\prime} must map some element cc of AA to h(c)=dh^{\prime}(c)=d, where dd is an element of BB^{\prime} that does not belong to BB. Then dd belongs to an isomorphic copy of AA that was added for some fact ff of BB, where ff contains elements b1,b2b_{1},b_{2} with b1<tb2b_{1}<_{t}b_{2}. Let b1b_{1}^{\prime} be the copy of h(b1)h(b_{1}) belonging to the respective isomorphic copy of AA. Since h(s)=th^{\prime}(s)=t it then follows that the image according to hh^{\prime} of the nodes in the shortest path connecting cc with ss must necessarily contain all nodes in the shortest path (in BB^{\prime}) connecting dd and tt. Note that the path connecting dd and tt must contain both b1b^{\prime}_{1} and b1b_{1}. Since g(b1)=g(b1)g(b^{\prime}_{1})=g(b_{1}) it follows that the composition of gg and hh^{\prime} is non-injective, a contradiction. ∎

Now, let (A,a)(A,\textbf{a}) be c-connected and acyclic. We may also assume that (A,a)(A,\textbf{a}) is a core. We define 𝔉\mathfrak{F} to be the set of all structures obtained as follows:777Note that 𝔉\mathfrak{F} depends on (A,a)(A,\textbf{a}), even though we did not make this dependence explicit in our notation here.

  1. (1)

    For each proper partition (X,Y)(X,Y) of the distinguished elements, we add split(X,Y)(A,a)\textrm{split}_{(X,Y)}(A,\textbf{a}) to 𝔉\mathfrak{F}, provided there are aiXa_{i}\in X and ajYa_{j}\in Y such that aia_{i} and aja_{j} are connected in AA. (Note that, by construction, the corresponding distinguished elements ai(1)a_{i}^{(1)} and aj(2)a_{j}^{(2)} are disconnected in split(X,Y)(A,a)\textrm{split}_{(X,Y)}(A,\textbf{a}).)

  2. (2)

    For each proper partition (X,Y)(X,Y) of the distinguished elements, and for each fact R(c,d)R(c,d) of AA, let (B,b)(B,\textbf{b}) be the structure obtained by extending split(X,Y)(A,a)\textrm{split}_{(X,Y)}(A,\textbf{a}) with the fact R(c(1),d(2))R(c^{(1)},d^{(2)}). We add (B,b)(B,\textbf{b}) to 𝔉\mathfrak{F}, provided that there exists distinguished elements aiXa_{i}\in X and ajYa_{j}\in Y, such that aia_{i} and aja_{j} are connected in AA by a path of some length nn, but there is no path of length nn connecting the corresponding distinguished elements ai(1)a_{i}^{(1)} and aj(2)a_{j}^{(2)} in (B,b)(B,\textbf{b}).

  3. (3)

    For each skeleton node ss, and for every (F,s)(F,s) in the frontier of the offshot (A|s,s)(A|s,s) (as in Theorem 3.18) we add to 𝔉\mathfrak{F} the structure obtained from (A,a)(A,\textbf{a}) by: (i) removing the offshoot (A|s,s)(A|s,s) and replacing it (F,s)(F,s), resulting in a new structure (A,a)(A^{\prime},\textbf{a}), and (ii) taking B=radial-extensionh:(A,s)(A,s)(A)B=\textrm{radial-extension}_{h:(A^{\prime},s)\to(A,s)}(A^{\prime}), where hh is the homomorphism from (F,s)(A|s,s)(F,s)\to(A|s,s), extended to the entire structure AA^{\prime} by mapping every element outside the offshoot to itself.

The intuition behind this construction is that (1)–(3) capture different ways of homomorphically weakening an acyclic structure with designated elements: in (1), we take two connected distinguished elements and force them to become disconnected. In (2), we increase the length of the shortest path between two distinguished elements, by forcing the minimal path to go through a specified edge. In (3), we make no change to the skeleton of the structure but we weaken one of the offshoots.

Proposition 3.25.

If (A,a)(A,\textbf{a}) is c-connected, acyclic, and is a core, then the set of structures 𝔉\mathfrak{F} constructed above is a frontier for (A,a)(A,\textbf{a}) w.r.t. the class of c-connected acyclic structures.

Proof.

It is clear that each structure in 𝔉\mathfrak{F} homomorphically maps to (A,a)(A,\textbf{a}).

We also claim that there is no homomorphism from (A,a)(A,\textbf{a}) to any structure (B,b)𝔉(B,\textbf{b})\in\mathfrak{F}. For (1) and (2) this follows immediately from the construction. For (3), the argument is as follows: let ss be the skeleton node whose offshoot was replaced, and let (F,s)𝔉s(F,s)\in\mathfrak{F}^{*}_{s} be the frontier structure was used as the replacement of (A|s,s)(A|s,s). Suppose, for the sake of a contradiction, that (A,a)(B,b)(A,\textbf{a})\to(B,\textbf{b}). By Lemma 3.24, then, there is already a homomorphism h:(A,a,s)(A,a,s)h:(A,\textbf{a},s)\to(A^{\prime},\textbf{a},s). By Lemma 2.1, we may assume that the composition of hh with the natural homomorphism from h:(A,a,s)h^{\prime}:(A^{\prime},\textbf{a},s) to (A,a,s)(A,\textbf{a},s) is the identity function on AA. Since (A|s,s)↛(F,s)(A|s,s)\not\to(F,s) is follows that there exists some element aa in A|sA|s such that h(a)h(a) does not belong to (F,s)(F,s). However, this contradicts the fact that h(h(a))h^{\prime}(h(a)) is the identity.

It remains to establish the last property of frontiers: let (C,c)(C,\textbf{c}) be any c-connected and acyclic structure such that there is a homomorphism h:(C,c)(A,a)h:(C,\textbf{c})\to(A,\textbf{a}) and such that (A,a)↛(C,c)(A,\textbf{a})\not\to(C,\textbf{c}).

We can distinguish three cases:

The first case is where (C,c)(C,\textbf{c}) differs from (A,a)(A,\textbf{a}) in the number of connected components. Since both structures are c-connected and (C,c)(A,a)(C,\textbf{c})\to(A,\textbf{a}), this can only happen if there are distinguished elements ci,cjc_{i},c_{j} that belong to different components in (C,c)(C,\textbf{c}) but such that h(ci),h(cj)h(c_{i}),h(c_{j}) are connected in (A,a)(A,\textbf{a}). In this case, we use (1) above. Specifically, let X={h(ck)ckc is connected to ci}X=\{h(c_{k})\mid\text{$c_{k}\in\textbf{c}$ is connected to $c_{i}$}\}, and let Y={a}XY=\{\textbf{a}\}\setminus X. Note that h(cj)Yh(c_{j})\in Y. Then it is easy to see that (C,c)split(X,Y)(A,a)(C,\textbf{c})\to\textrm{split}_{(X,Y)}(A,\textbf{a}).

The second case is there there exists distinguished elements ci,cjc_{i},c_{j} connected in CC such that the image of the path s1,,sms_{1},\dots,s_{m} in CC connecting cic_{i} and cjc_{j} is not injective. It follows from the acyclicity of AA that there exists \ell such that h(s)=h(s+2)h(s_{\ell})=h(s_{\ell+2}). Let ff be the fact in CC joining ss_{\ell} and s+1s_{\ell+1} and let ZZ be the set containing all elements in CC that remain connected to cic_{i} after removing fact ff. Then consider the structure (B,b)(B,\textbf{b}) obtained as in (2) above by setting XX to be the set of distinguished elements in h(Z)h(Z), YY to be the rest of distinguished elements in CC, and the fact of AA used in the construction to be the hh-image of ff. If we let ai=h(ci)a_{i}=h(c_{i}) and aj=h(cj)a_{j}=h(c_{j}) it follows directly from the construction that the distance of ai(1)a_{i}^{(1)} and aj(2)a_{j}^{(2)} in (B,b)(B,\textbf{b}) is larger than the distance of aia_{i} and aja_{j} in AA, and hence (B,b)𝔉(B,\textbf{b})\in\mathfrak{F}. Finally, it follows easily that the map hh^{\prime} sending every element cCc\in C to h(c)(1)h(c)^{(1)} if cZc\in Z and to h(c)(2)h(c)^{(2)} otherwise defines a homomorphism from (C,c)(C,\textbf{c}) to (B,b)(B,\textbf{b}).

The third case is where none of the previous two cases hold. We first note that if the second case does not hold then it must be the case that the restriction of hh to each connected component of skeleton(C,c)\operatorname{skeleton}(C,\textbf{c}) must be injective. Furthermore, since (C,c)(C,\textbf{c}) and (A,a)(A,\textbf{a}) have the same number of connected components it follows that hh maps skeleton(C,c)\operatorname{skeleton}(C,\textbf{c}) isomorphically to skeleton(A,a)\operatorname{skeleton}(A,\textbf{a}). It follows that there exists some node ss in the skeleton of CC such that the offshoot (A|h(s),h(s))(A|h(s),h(s)) does not homomorphically map to (C,s)(C,s), since otherwise, (A,s)(C,s)(A,s)\to(C,s). Consider the offshoot (C|s,s)(C|s,s) of CC and let DD be the maximal substructure of CC that contains ss, is connected, and satisfies the property that h(D)h(D) is contained in A|h(s)A|h(s) and that no other element, besides ss, is mapped by hh to h(s)h(s). Since (D,s)(A|h(s),h(s))(D,s)\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}(A|h(s),h(s)) it follows that there exists some homomorphism gg from (D,s)(D,s) to some structure (F,h(s))(F,h(s)) in the frontier of (A|h(s),h(s))(A|h(s),h(s)). Now consider the structure (B,b)(B,\textbf{b}) in 𝔉\mathfrak{F} produced in step (3) above for h(s)h(s) and (F,h(s))(F,h(s)). We shall construct a homomorphism hh^{\prime} from (C,c)(C,\textbf{c}) to (B,b)(B,\textbf{b}).

Let (A,a)(A^{\prime},\textbf{a}) as constructed in step (3) and let (E,c)(E,\textbf{c}) be the maximal cc-connected substructure of (C,c)(C,\textbf{c}) containing skeleton(C,c)\operatorname{skeleton}(C,\textbf{c}) such that no other element in EE besides ss is mapped by hh to h(s)h(s). Note that EE contains DD. We shall start by defining hh^{\prime} on EE so that hh^{\prime} defines a homomorphism from (E,c)(E,\textbf{c}) to (A,a)(A^{\prime},\textbf{a}). If eDe\in D then we define h(d)h^{\prime}(d) to be g(d)g(d). If ee does not belong to DD then it follows that h(e)h(e) cannot be in the offshot A|h(s)A|h(s). In this case we define h(e)h^{\prime}(e) to be h(e)h(e). Clearly hh^{\prime} defines as well a homomorphism from (E,c)(E,\textbf{c}) to (B,a)(B,\textbf{a}) (since BB contains a copy of AA^{\prime}). It only remains to extend hh^{\prime} to all the elements in CC. For every maximal connected substructure FF of CC not containing any element in EE we extend hh^{\prime} to FF as follows. Since EE contains skeleton(C,c)\operatorname{skeleton}(C,\textbf{c}) and (C,c)(C,\textbf{c}) is cc-connected it follows that there is some fact in CC joining elements eEe\in E and fFf\in F. By the maximality of EE it follows that h(f)=h(s)h(f)=h(s). Necessarily h(f)<h(s)h(e)h(f)<_{h(s)}h(e) and, consequently, we can define hh^{\prime} so that ff and every other element in FF is mapped according to hh in the copy of AA in BB introduced due to h(f)<h(s)h(e)h(f)<_{h(s)}h(e).

Incidentally, it may be worth noting that we did not even make full use of the radial extension: for the purpose of this proof, it would have sufficed in case (3) to use a restricted version of radial extension where a copy of the original structure is added only for every fact connected with ss. ∎

Lifting the restriction to binary relations

This concludes the proof for the case with binary relations only. We now show how to lift the result to schemas containing relations of arbitrary arity.

For a schema 𝒮\mathcal{S}, let 𝒮\mathcal{S}^{*} be the schema containing for each nn-ary relation R𝒮R\in\mathcal{S}, nn binary relations R1,,RnR_{1},\ldots,R_{n}. For any structure AA over schema 𝒮\mathcal{S}, let AA^{*} be the structure over schema 𝒮\mathcal{S}^{*} whose domain consists of all elements in the domain of AA as well as all facts of AA, and containing all facts of the form Ri(b,f)R_{i}(b,f) where ff is a fact of AA, of the form R(a)R(\textbf{a}) with ai=ba_{i}=b. Intuitively, we can think of AA^{*} as a bipartite encoding of the structure AA. Conversely, we associate to every structure BB over the schema 𝒮\mathcal{S}^{*} a corresponding structure BB_{*} over the original schema 𝒮\mathcal{S}, namely the structure whose domain is the same as that of BB and containing all facts of the form R(a1,,an)R(a_{1},\ldots,a_{n}) for which it is the case that BB satisfies yi=1nRi(ai,y)\exists y\bigwedge_{i=1\ldots n}R_{i}(a_{i},y). Note that (A)=A(A^{*})_{*}=A but (B)(B_{*})^{*} need not be isomorphic to BB.

Lemma 3.26.

For all structures A,AA,A^{\prime} over schema 𝒮\mathcal{S} and structures BB over schema 𝒮\mathcal{S}^{*}:

  1. (1)

    If (B,b)(A,a)(B,\textbf{b})\to(A^{*},\textbf{a}) then (B,b)(A,a)(B_{*},\textbf{b})\to(A,\textbf{a}).

  2. (2)

    (A,a)(B,b)(A,\textbf{a})\to(B_{*},\textbf{b}) iff (A,a)(B,b)(A^{*},\textbf{a})\to(B,\textbf{b}).

  3. (3)

    (A,a)(A,a’)(A,\textbf{a})\to(A^{\prime},\textbf{a'}) iff (A,a)(A,a’)(A^{*},\textbf{a})\to({A^{\prime}}^{*},\textbf{a'}).

  4. (4)

    If (A,a)(A,\textbf{a}) is core, c-connected, and acyclic, then so is (A,a)(A^{*},\textbf{a})

  5. (5)

    If (B,b)(B,\textbf{b}) is acyclic, then so is (B,b)(B_{*},\textbf{b}).

Proof.

1. Let h:(B,b)(A,a)h:(B,\textbf{b})\to(A^{*},\textbf{a}). It is easy to see that, for each element bb of BB_{*} that participates in at least one fact, it must be the case that h(b)h(b) belongs to the domain of AA. Note that elements of BB_{*} that do not participate in any fact can be ignored, as they can be mapped to an arbitrary element of AA. Finally, it is clear from the constructions that, whenever R(b1,,bn)R(b_{1},\ldots,b_{n}) holds true in BB_{*}, then R(h(b1),,h(bn))R(h(b_{1}),\ldots,h(b_{n})) holds true in AA.

2. Suppose h:(A,a)(B,b)h:(A,\textbf{a})\to(B_{*},\textbf{b}). We can extend hh to the entire domain of AA^{*} as follows: let ff be any fact of AA of the form R(a1,,an)R(a_{1},\ldots,a_{n}). Since BB_{*} satisfies R(h(a1),,h(an))R(h(a_{1}),\ldots,h(a_{n})), this means that BB must satisfy yi=1nRi(h(ai),y)\exists y\bigwedge_{i=1\ldots n}R_{i}(h(a_{i}),y). Choose any such yy as the image of the fact ff. Doing this for each fact, we obtain a mapping hh^{\prime} that extends hh to the entire domain of AA^{*}. Moreover, whenever Ri(ai,f)R_{i}(a_{i},f) holds in AA^{*}, then, by construction, Ri(h(ai),h(f))R_{i}(h^{\prime}(a_{i}),h^{\prime}(f)) holds true in BB. In other words hh^{\prime} is a homomorphism from (A,a)(A^{*},\textbf{a}) to (B,b)(B,\textbf{b})

Conversely, suppose h:(A,a)(B,b)h:(A^{*},\textbf{a})\to(B,\textbf{b}). Let hh^{\prime} be the restriction of hh to elements of AA. We claim that the mapping h:(A,a)(B,b)h^{\prime}:(A,\textbf{a})\to(B_{*},\textbf{b}) is a homomorphism. Let f=R(a1,,an)f=R(a_{1},\ldots,a_{n}) be any fact of AA. Then BB satisfies Ri(h(ai),h(f))R_{i}(h^{\prime}(a_{i}),h(f)) for all i=1ni=1\ldots n. Therefore, by construction, BB_{*} satisfies R(h(a1),,h(an))R(h^{\prime}(a_{1}),\ldots,h^{\prime}(a_{n})).

3. Every homomorphism h:(A,a)(A,a’)h:(A,\textbf{a})\to(A^{\prime},\textbf{a'}) natural extends to a homomorphism from (A,a)(A^{*},\textbf{a}) to (A,a’)({A^{\prime}}^{*},\textbf{a'}) by sending each fact R(a1,,an)R(a_{1},\ldots,a_{n}) to R(h(a1),,h(an))R(h(a_{1}),\ldots,h(a_{n})). Conversely, every homomorphism h:(A,a)(A,a’)h:(A^{*},\textbf{a})\to({A^{\prime}}^{*},\textbf{a'}), when restricted to elements of AA, is clearly a homomorphism from (A,a)(A,\textbf{a}) to (A,a’)(A^{\prime},\textbf{a'}).

4. For acyclicity, this holds by definition (recall that acyclicity was defined by reference to the incidence graph, in the first place; and note that the incidence graph of (A,a)(A^{*},a) is obtained from that of (A,a)(A,a) by subdividing every edge in two). Similarly, it is easy to see that whenever (A,a)(A,\textbf{a}) is c-connected, then so is (A,a)(A^{*},\textbf{a}). Finally, to show that core-ness is preserved, we proceed by contraposition: suppose (A,a)(A^{*},\textbf{a}) is not a core, i.e., admits a proper endomorphism hh. It is not hard to see that hh must map elements of AA to elements of AA, and fact of AA to facts of AA. Therefore, hh must either map two distinct elements of AA to the same element, or two distinct facts of AA to the same fact. However, even in the latter case, the only way that this can happen is if hh also maps two distinct elements to the same element. It follows that hh also induces a proper endomorphism of AA.

5. By contraposition: suppose (B,b)(B_{*},\textbf{b}) is not acyclic. Take a minimal cycle in the incidence graph of (B,b)(B_{*},\textbf{b}). Each edge that is part of the cycle (being a fact of BB_{*}), by construction of BB^{*}, gives rise to a path of length 2 in BB. Therefore, we obtain a cycle in BB. ∎

Now, let (A,a)(A,\textbf{a}) be any acyclic, c-connected structure over schema 𝒮\mathcal{S}. We may again assume that (A,a)(A,\textbf{a}) is a core. By Lemma 3.26(4), (A,a)(A^{*},\textbf{a}) is also acyclic, c-connected, and core. Let FF be an acyclic c-connected frontier for (A,a)(A^{*},\textbf{a}), let F={(B,b)(B,b)F}F^{\prime}=\{(B_{*},\textbf{b})\mid(B,\textbf{b})\in F\}. We claim that FF^{\prime} is a frontier for (A,a)(A,\textbf{a}).

First, note that each structure in FF^{\prime} homomorphically maps to (A,a)(A,\textbf{a}). This follows from Lemma 3.26(1), because each (B,b)F(B,\textbf{b})\in F homomorphically maps to (A,a)(A^{*},\textbf{a}). Second, note that (A,a)(A,a) does not homomorphically map to any structure in FF^{\prime}. Indeed, suppose (A,a)(B,b)(A,\textbf{a})\to(B_{*},\textbf{b}) with (B,b)F(B_{*},\textbf{b})\in F^{\prime}. Then, by Lemma 3.26(2), (A,a)(B,b)(A^{*},\textbf{a})\to(B,\textbf{b}), which contradicts the fact that FF is a frontier for (A,a)(A^{*},\textbf{a}). Finally, for the third property of frontiers, suppose (C,c)(A,a)(C,\textbf{c})\to(A,\textbf{a}) and (A,a)↛(C,c)(A,\textbf{a})\not\to(C,\textbf{c}). Then, by Lemma 3.26(3), (A,a)(C,c)(A^{*},a)\to(C^{*},\textbf{c}) and (C,c)↛(A,a)(C^{*},\textbf{c})\not\to(A^{*},\textbf{a}). Therefore, since FF is a frontier for (A,a)(A^{*},\textbf{a}), we have that (C,c)(B,b)(C^{*},\textbf{c})\to(B,\textbf{b}) for some (B,b)F(B,\textbf{b})\in F. It follows by Lemma 3.26(2) that (C,c)(B,b)(C,\textbf{c})\to(B_{*},\textbf{b}). Since (B,b)F(B^{*},\textbf{b})\in F^{\prime}, this shows that (C,c)(C,\textbf{c}) maps to a structure in FF^{\prime}.

Note that FF^{\prime} consists of acyclic structures (by Lemma 3.26(5)), but may contain structures that are not c-connected. However, this is easily addressed by the following observation, which follows from Proposition 2.3: if a set of structures FF is a frontier for a structure (A,a)(A,\textbf{a}) w.r.t. a class of c-connected structures 𝒞\mathcal{C}, then {(B,b)reach(B,b)F}\{(B,\textbf{b})^{\textrm{reach}}\mid(B,\textbf{b})\in F\} is also a frontier for (A,a)(A,\textbf{a}) w.r.t. 𝒞\mathcal{C}.

This concludes the proof of Theorem 3.12.

4. Unique Characterizations for Conjunctive Queries

In this section, we study the question of when a CQ is uniquely characterizable by a finite set of positive and/or negative examples.

Definition 4.1 (Data Examples, Fitting, Unique Characterizations).

Let 𝒞\mathcal{C} be a class of kk-ary CQs over a schema 𝒮\mathcal{S} (for some k0k\geq 0), and let qq be a kk-ary query over 𝒮\mathcal{S}.

  1. (1)

    A data example is a structure (A,a)(A,\textbf{a}) over schema 𝒮\mathcal{S} with kk distinguished elements. If aq(A)\textbf{a}\in q(A), we call (A,a)(A,\textbf{a}) a positive example (for qq), otherwise a negative example.

  2. (2)

    Let E+,EE^{+},E^{-} be finite sets of data examples. We say that qq fits (E+,E)(E^{+},E^{-}) if every example in E+E^{+} is a positive example for qq and every example in EE^{-} is a negative example for qq. We say that (E+,E)(E^{+},E^{-}) uniquely characterizes qq w.r.t. 𝒞\mathcal{C} if qq fits (E+,E)(E^{+},E^{-}) and every q𝒞q^{\prime}\in\mathcal{C} that fits (E+,E)(E^{+},E^{-}) is logically equivalent to qq.

It turns out that there is a precise correspondence between unique characterizations and frontiers. Recall that the canonical structure of a query qq is denoted by q^\widehat{q}. Similarly, for any class of CQs 𝒞\mathcal{C}, we will denote by 𝒞^\widehat{\mathcal{C}} the class of structures {q^q𝒞}\{\widehat{q}\mid q\in\mathcal{C}\}.

Proposition 4.2 (Frontiers vs Unique Characterizations).

Fix a schema 𝒮\mathcal{S} and k0k\geq 0. Let qq be any kk-ary CQ over 𝒮\mathcal{S} and 𝒞\mathcal{C} a class of kk-ary CQs over 𝒮\mathcal{S}.

  1. (1)

    If FF is a frontier for q^\widehat{q} w.r.t. 𝒞^\widehat{\mathcal{C}}, then (E+={q^},E=F)(E^{+}=\{\widehat{q}\},E^{-}=F) uniquely characterizes qq w.r.t. 𝒞\mathcal{C}.

  2. (2)

    Conversely, if (E+,E)(E^{+},E^{-}) uniquely characterizes qq w.r.t. 𝒞\mathcal{C}, then F={q^×(B,b)(B,b)E}F=\{\widehat{q}\times(B,\textbf{b})\mid(B,\textbf{b})\in E^{-}\} is a frontier for q^\widehat{q} w.r.t. 𝒞^\widehat{\mathcal{C}}.

Proof.

1. Let FF be a frontier for q^\widehat{q} w.r.t. 𝒞^\widehat{\mathcal{C}}, let q𝒞q^{\prime}\in\mathcal{C} be a conjunctive query that fits (E+={q^},E=F)(E^{+}=\{\widehat{q}\},E^{-}=F). From the fact that the canonical structure q^\widehat{q} is a positive example for qq^{\prime}, it follows that there is a homomorphism from q^\widehat{q^{\prime}} to q^\widehat{q}. Furthermore, since all structures in the set FF are negative examples for qq^{\prime}, we know that q^\widehat{q^{\prime}} does not homomorphically map to any of these structures. Since FF is a frontier w.r.t. 𝒞^\widehat{\mathcal{C}} and q^𝒞^\widehat{q^{\prime}}\in\widehat{\mathcal{C}}, we can conclude that q^\widehat{q} homomorphically maps to q^\widehat{q^{\prime}}. Therefore, q^\widehat{q} and q^\widehat{q^{\prime}} are homomorphically equivalent, which implies that qq and qq^{\prime} are logically equivalent.

2. Let (E+,E)(E^{+},E^{-}) uniquely characterize qq w.r.t, 𝒞\mathcal{C}, and let F={q^×(B,b)(B,b)E}F=\{\widehat{q}\times(B,\textbf{b})\mid(B,\textbf{b})\in E^{-}\}. It follows from the basic properties of the direct product operation that each structure in FF homomorphically maps to q^\widehat{q}. Furthermore, if there were a homomorphism from q^\widehat{q} to some structure q^×(B,b)F\widehat{q}\times(B,\textbf{b})\in F, then there would be a homomorphism from q^\widehat{q} to (B,b)(B,\textbf{b}), which would imply that (B,b)(B,\textbf{b}) is a positive example for qq, which we know is not the case. Finally, consider any q^𝒞^\widehat{q^{\prime}}\in\widehat{\mathcal{C}} such that q^q^\widehat{q^{\prime}}\to\widehat{q} and q^↛q^\widehat{q}\not\to\widehat{q^{\prime}}. This implies that qq and qq^{\prime} are not logically equivalent, and hence, since q𝒞q^{\prime}\in\mathcal{C}, the two queries must disagree on some example in E+E^{+} or EE^{-}. However, it follows from the fact that q^q^\widehat{q}\to\widehat{q}, that all positive examples for qq are also positive examples for qq^{\prime}. Therefore, some structure (B,b)E+(B,\textbf{b})\in E^{+} must be a positive example for qq^{\prime}, that is, q^(B,b)\widehat{q^{\prime}}\to(B,\textbf{b}). It follows that q^q^×(B,b)\widehat{q^{\prime}}\to\widehat{q}\times(B,\textbf{b}). ∎

Proposition 4.2 allows us to take the results on frontiers from the previous section, and rephrase them in terms of unique characterizations. Incidentally, note that results in (AlexeCKT2011, ) imply an analogous relationship between finite dualities and uniquely characterizing sets of examples for unions of conjunctive queries. We need two more lemmas. Recall that a structure (A,a)(A,\textbf{a}) corresponds to a conjunctive query only if every distinguished element occurs in at least one fact. Let us call such structures safe. The following lemmas, essentially, allow us to ignore unsafe structures, thereby bridging the gap between structures and CQs.

Lemma 4.3.

Let qq be a kk-ary CQ over schema 𝒮\mathcal{S} and 𝒞\mathcal{C} a class of kk-ary CQs over 𝒮\mathcal{S}. If qq is uniquely characterized w.r.t. 𝒞\mathcal{C} by positive and negative examples (E+,E)(E^{+},E^{-}), then E+E^{+} consists of safe structures and qq is uniquely characterized w.r.t. 𝒞\mathcal{C} by (E+,{(A,a)E(A,a) is safe})(E^{+},\{(A,\textbf{a})\in E^{-}\mid\text{$(A,\textbf{a})$ is safe}\}).

Proof.

It suffices to observe that if (A,a)(A,\textbf{a}) is not safe, then (A,a)(A,\textbf{a}) is a negative example for every conjunctive query, and therefore, cannot meaningfully contribute to characterizing any given conjunctive query w.r.t. a class of CQs. ∎

Lemma 4.4.

A safe structure has a frontier w.r.t. all structures if and only if it has a frontier w.r.t. the class of all safe structures.

Proof.

The left-to-right direction is trivial. The right-to-left direction relies on a homomorphism duality argument of sorts: let (A,a)(A,\textbf{a}) be any safe structure that has a frontier FF w.r.t. the class of all safe structures. Let 𝒮\mathcal{S} be its schema and kk the number of distinguished elements. For every non-empty set S{1,,k}S\subseteq\{1,\ldots,k\}, we will denote by (AS,aS)(A_{S},\textbf{a}_{S}) the structure with two elements, denoted bb and cc, that contains all possible facts involving only cc and no other facts; and aS\textbf{a}_{S} is the tuple a1,,aka_{1},\ldots,a_{k}, where ai=ba_{i}=b if iSi\in S, and ai=ca_{i}=c otherwise. Note that, by construction, none of these structures is safe. It is not hard to see that a structure is unsafe if and only if it admits a homomorphism to a structure in the set G={(AS,aS)S{1,,k} is non-empty}G=\{(A_{S},\textbf{a}_{S})\mid\text{$S\subseteq\{1,\ldots,k\}$ is non-empty}\}. Now, let G={(A,a)×(B,b)(B,b)G}G^{\prime}=\{(A,\textbf{a})\times(B,\textbf{b})\mid(B,\textbf{b})\in G\}. Then FGF\cup G^{\prime} is a frontier for (A,a)(A,\textbf{a}) w.r.t. all structures. To see this, note that each structure in FGF\cup G^{\prime} homomorphically maps to (A,a)(A,\textbf{a}). Furthermore, (A,a)(A,\textbf{a}) does not map to any structure in FF (by initial assumption) or in GG^{\prime} (because GG^{\prime} consists of unsafe structures while (A,a)(A,\textbf{a}) is safe). Finally, consider any (C,c)(A,a)(C,\textbf{c})\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}(A,\textbf{a}). If (C,c)(C,\textbf{c}) is safe, then it maps to a structure in FF. Otherwise, it maps to a structure in GG and hence also to the corresponding structure in GG^{\prime}. In either case, it maps to a structure in FGF\cup G^{\prime}. ∎

Putting everything together, we obtain the main result of this section. We call a CQ qq c-acyclic (or acyclic, or c-connected) if the structure q^\widehat{q} is c-acyclic (resp. acyclic, c-connected).

Theorem 4.5.

Fix a schema and fix k0k\geq 0.

  1. (1)

    If 𝒞\mathcal{C} is a class of kk-ary CQs such that C^\widehat{C} has bounded expansion, then every CQ q𝒞q\in\mathcal{C} is uniquely characterizable w.r.t. 𝒞\mathcal{C} by finitely many positive and negative examples (which can be effectively constructed from the query).

  2. (2)

    A kk-ary CQ qq is uniquely characterizable by finitely many positive and negative examples (w.r.t. the class of all kk-ary CQs) iff qq is logically equivalent to a c-acyclic CQ. Moreover, for a c-acyclic CQ, a uniquely characterizing set of examples can be constructed in polynomial time.

  3. (3)

    Assume k1k\geq 1 and let 𝒞ca\mathcal{C}_{ca} be the class of kk-ary CQs that are c-connected and acyclic. Then every q𝒞caq\in\mathcal{C}_{ca} is uniquely characterizable w.r.t. 𝒞ca\mathcal{C}_{ca} by finitely many positive and negative examples belonging to 𝒞ca^\widehat{\mathcal{C}_{ca}}. Moreover, the set of examples in question can be constructed in polynomial time.

Remark 1.

For the purpose of applications discussed in Section 7, we note that Theorem 4.5 remains true if the safety condition for CQs were to be dropped. Indeed, the proof in this case is even simpler, as it does not require Lemma 4.3 and Lemma 4.4.

Examples showing that Theorem 4.5(3) cannot easily be generalized.

Theorem 4.5(3) applies to CQs of arity k1k\geq 1 that are c-connected, and acyclic. None of these restrictions can be dropped. Recall that a CQ of arity zero is called a Boolean CQ.

Theorem 4.6.

The Boolean acyclic connected CQ   𝐓() :- Ry1y2Ry2y3Ry3y4Ry4y5\operatorname{{\bf T}}()\textrm{ :- }Ry_{1}y_{2}\land Ry_{2}y_{3}\land Ry_{3}y_{4}\land Ry_{4}y_{5} is not characterized, w.r.t. the class of Boolean acyclic connected CQs, by finitely many acyclic positive and negative examples.

Proof.

For the sake of a contradiction, assume that (E+,E)(E^{+},E^{-}) is a finite collection of acyclic positive and negative examples that uniquely characterizes 𝐓\operatorname{{\bf T}} within the class of Boolean acyclic connected CQs. Let nn be a bound on the size of the examples in E+E^{+} and EE^{-}. Consider now the following Boolean acyclic connected conjunctive query 𝐑\operatorname{{\bf R}} (cf. Figure 6):

𝐑() :- i{1,,n} oddj{1,2,3}R(zji,zj+1i)i{1,,n} evenj{2,3,4}R(zji,zj+1i)i{1,,n} evenR(z3i,z4i1)R(z2i,z3i+1)\operatorname{{\bf R}}()\text{ :- }\mathop{\bigwedge_{i\in\{1,\ldots,n\}\text{ odd}}}_{j\in\{1,2,3\}}R(z_{j}^{i},z_{j+1}^{i})~{}~{}~{}\land\mathop{\bigwedge_{i\in\{1,\ldots,n\}\text{ even}}}_{j\in\{2,3,4\}}R(z_{j}^{i},z_{j+1}^{i})~{}~{}~{}\land\bigwedge_{i\in\{1,\ldots,n\}\text{ even}}R(z^{i}_{3},z_{4}^{i-1})\land R(z^{i}_{2},z_{3}^{i+1})

z52z54z41z42z43z44z45z31z32z33z34z35z21z22z23z24z25z11z13z15\begin{array}[]{c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c}&&z^{2}_{5}&&&&z^{4}_{5}&&&\\ &&\uparrow&&&&\uparrow&&&\\ z^{1}_{4}&&z^{2}_{4}&&z^{3}_{4}&&z^{4}_{4}&&z^{5}_{4}&\\ \uparrow&\nwarrow&\uparrow&&\uparrow&\nwarrow&\uparrow&&\uparrow&\nwarrow\\ z^{1}_{3}&&z^{2}_{3}&&z^{3}_{3}&&z^{4}_{3}&&z^{5}_{3}&&~{}~{}\cdots\\ \uparrow&&\uparrow&\nearrow&\uparrow&&\uparrow&\nearrow&\uparrow&\\ z^{1}_{2}&&z^{2}_{2}&&z^{3}_{2}&&z^{4}_{2}&&z^{5}_{2}&\\ \uparrow&&&&\uparrow&&&&\uparrow&\\ z^{1}_{1}&&&&z^{3}_{1}&&&&z^{5}_{1}&\\ \end{array}

Figure 6. The (canonical structure of the) conjunctive query 𝐑\operatorname{{\bf R}}

Note that 𝐑\operatorname{{\bf R}} and 𝐓\operatorname{{\bf T}} are not logically equivalent since 𝐑\operatorname{{\bf R}} does not contain any directed path of length 44. The mapping that sends zjiz^{i}_{j} to yjy_{j} defines a homomorphism from (the canonical structure of) 𝐑\operatorname{{\bf R}} to (the canonical structure of) 𝐓\operatorname{{\bf T}}. This implies that 𝐓𝐑\operatorname{{\bf T}}\subseteq\operatorname{{\bf R}}, that is, every positive example for 𝐓\operatorname{{\bf T}} is also a positive example for 𝐑\operatorname{{\bf R}}, and hence, in particular, 𝐑\operatorname{{\bf R}} fits all the positive examples in E+E^{+}. Since 𝐑\operatorname{{\bf R}} and 𝐓\operatorname{{\bf T}} are not logically equivalent, 𝐑\operatorname{{\bf R}} must therefore disagree with 𝐓\operatorname{{\bf T}} on one of the negative examples, that is, some example AEA\in E^{-} is a positive example for 𝐑\operatorname{{\bf R}}. Let h:𝐑^Ah:\widehat{\operatorname{{\bf R}}}\to A be a witnessing homomorphism.


Claim: There are two elements uu and vv in 𝐑^\widehat{\operatorname{{\bf R}}} at distance 22 (distance, here, is measured in the underlying tree of 𝐑^\widehat{\operatorname{{\bf R}}}) such that h(u)=h(v)h(u)=h(v).


Proof of claim: Since AA has at most nn elements it follows that hh is not injective, so there are two elements uu^{\prime}, vv^{\prime} of 𝐑^\widehat{\operatorname{{\bf R}}} that are mapped by hh to the same element. Now, consider the unique path u=x1,,xm=vu^{\prime}=x_{1},\dots,x_{m}=v^{\prime} in (the underlying tree of) 𝐑^\widehat{\operatorname{{\bf R}}} connecting uu^{\prime} and vv^{\prime}. Then, h(x1),,h(xm)h(x_{1}),\dots,h(x_{m}) is a walk in (the underlying tree of) AA. Indeed, it is a closed walk since h(x1)=h(xm)h(x_{1})=h(x_{m}). Now, since h(x1),,h(xm)h(x_{1}),\dots,h(x_{m}) is a closed walk in a tree, it must backtrack in some vertex h(xi)h(x_{i}). This means that h(xi1)=h(xi+1)h(x_{i-1})=h(x_{i+1}). End of proof of claim.


Let u=zjiu=z^{i}_{j} and v=zjiv=z^{i^{\prime}}_{j^{\prime}} be the two elements at distance 22 that are mapped by hh to the same element in AA. It follows from the definition of 𝐑\operatorname{{\bf R}} that, since uu and vv are at distance 22, |ii|1|i-i^{\prime}|\leq 1. We first consider the case where i=ii=i^{\prime}. In this case, we can assume wlog. that j=j+2j^{\prime}=j+2. It then follows that h(zji),h(zj+1i),h(zj+2i)h(z^{i}_{j}),h(z^{i}_{j+1}),h(z^{i}_{j+2}) is a directed cycle in AA, a contradiction because AEA\in E^{-}, and EE^{-} was assumed to consist of acyclic structures. Next, consider the case where iii\neq i^{\prime}. Then |ii|=1|i-i^{\prime}|=1. We can assume without loss of generality that ii is even and ii^{\prime} is odd. Again it follows directly by the construction of 𝐑\operatorname{{\bf R}} that j=jj^{\prime}=j. Note that h(z1i),,h(zji)h(z^{i^{\prime}}_{1}),\dots,h(z^{i^{\prime}}_{j^{\prime}}) is a directed path in AA of length j1j^{\prime}-1. Similarly h(zji),,h(z5i)h(z^{i}_{j}),\dots,h(z^{i}_{5}) is a directed path in AA of length 5j5-j. Since h(zji)=h(zji)h(z^{i}_{j})=h(z^{i^{\prime}}_{j^{\prime}}) it follows that we can concatenate the two directed paths obtaining a directed path of length 4+jj44+j^{\prime}-j\geq 4. Consequently 𝐓^\widehat{\operatorname{{\bf T}}} homomorphically maps to AA, a contradiction because AEA\in E^{-}. ∎

This shows that in Theorem 4.5(3), the restriction to non-Boolean queries cannot be dropped. Similarly, the restriction to c-connected queries cannot be dropped, and acyclicity cannot be replaced by the weaker condition of c-acyclicity.

Theorem 4.7.

The unary acyclic CQ    𝐓(x) :- P(x)Ry1y2Ry2y3Ry3y4Ry4y5\operatorname{{\bf T}}^{\prime}(x)\text{ :- }P(x)\land Ry_{1}y_{2}\land Ry_{2}y_{3}\land Ry_{3}y_{4}\land Ry_{4}y_{5} is not uniquely characterizable, w.r.t. the class of unary acyclic CQs, by finitely many acyclic positive and negative examples.

Proof.

Assume for the sake of a contradiction that there are finitely many acyclic positive and negative examples (E+,E)(E^{+},E^{-}) that uniquely characterizes 𝐓(x)\operatorname{{\bf T}}^{\prime}(x) w.r.t. the class of unary acyclic CQs. We will construct acyclic positive and negative examples (E+,E)(E^{\prime+},E^{\prime-}) that uniquely characterizes 𝐓\operatorname{{\bf T}} w.r.t. the class of Boolean acyclic c-connected CQs, contradicting Theorem 4.6. For each example in (A,a)E+E(A,a)\in E^{+}\cup E^{-}, we take every connected component of AA (without any distinguished element) and add it to E+E^{\prime+} or EE^{\prime-}, depending on whether the component in question satisfies 𝐓\operatorname{{\bf T}}. Note that each of these examples is acyclic. By construction, the query 𝐓\operatorname{{\bf T}} fits (E+,E)(E^{\prime+},E^{\prime-}). We claim that (E+,E)(E^{\prime+},E^{\prime-}) uniquely characterizes 𝐓\operatorname{{\bf T}} w.r.t. the class of Boolean acyclic connected CQs. To see this, let qq be any Boolean acyclic connected query that fits (E+,E)(E^{\prime+},E^{\prime-}). Now, let q(x)q^{\prime}(x) be the unary acyclic (disconnected) query that is the conjunction of qq with P(x)P(x). Then it is not hard to see that q(x)q^{\prime}(x) fits (E+,E)(E^{+},E^{-}), and therefore, q(x)q^{\prime}(x) is equivalent to 𝐓(x)\operatorname{{\bf T}}^{\prime}(x). From this, it easily follows that qq must be equivalent to 𝐓\operatorname{{\bf T}}: any counterexample to the equivalence of qq and 𝐓\operatorname{{\bf T}} can be extended to a counterexample to the equivalence of q(x)q^{\prime}(x) and 𝐓(x)\operatorname{{\bf T}}^{\prime}(x) simply by adding an isolated element satisfying PP. ∎

Theorem 4.8.

The unary c-acyclic c-connected CQ   𝐓′′(x) :- Ry1y2Ry2y3Ry3y4Ry4y5i=15Rxyi\operatorname{{\bf T}}^{\prime\prime}(x)\text{ :- }Ry_{1}y_{2}\land Ry_{2}y_{3}\land Ry_{3}y_{4}\land Ry_{4}y_{5}\land\bigwedge_{i=1\ldots 5}Rxy_{i}   is not uniquely characterizable, w.r.t. the class of unary c-acyclic c-connected CQs, by finitely many c-acyclic positive and negative examples.

Proof.

Assume towards a contradiction that 𝐓′′\operatorname{{\bf T}}^{\prime\prime} is uniquely characterized, w.r.t. the class of unary c-acyclic c-connected CQs, by a finite collection of c-acyclic positive and negative examples (E+,E)(E^{+},E^{-}). Let EE^{\prime-} be the set that contains, for each (A,a)E(A,a)\in E^{-}, the substructure AA^{\prime} of AA consisting of all elements bb satisfying R(a,b)R(a,b). Note that this substructure does not include aa itself (because if R(a,a)R(a,a) was true in AA, then (A,a)(A,a) would have been a positive example for 𝐓′′\operatorname{{\bf T}}^{\prime\prime}) and therefore must be acyclic. We claim that ({𝐓^},E)(\{\widehat{\operatorname{{\bf T}}}\},E^{\prime-}) uniquely characterizes 𝐓\operatorname{{\bf T}} w.r.t. the class of Boolean acyclic connected CQs, contradicting Theorem 4.6.

Let qq be any Boolean acyclic connected conjunctive query that fits ({𝐓^},E)(\{\widehat{\operatorname{{\bf T}}}\},E^{\prime-}). Let q(x)q^{\prime}(x) be the conjunctive query expressing that qq holds true in the substructure consisting of elements reachable by an RR-edge from xx. Note that q(x)q^{\prime}(x) can be obtained by extending qq with an additional conjunct R(x,y)R(x,y) for every variable yy occurring in qq. Also, note that q(x)q^{\prime}(x) is c-acyclic and c-connected. Since 𝐓^\widehat{\operatorname{{\bf T}}} is a positive example for qq, there is a homomorphism h:q^𝐓^h:\widehat{q}\to\widehat{\operatorname{{\bf T}}}. By extending hh in the obvious way, we have that q^𝐓′′^\widehat{q^{\prime}}\to\widehat{\operatorname{{\bf T}}^{\prime\prime}}. Therefore, 𝐓′′q\operatorname{{\bf T}}^{\prime\prime}\subseteq q^{\prime}, and hence every positive example for 𝐓′′\operatorname{{\bf T}}^{\prime\prime} is also a positive example for qq^{\prime}, and hence qq^{\prime} fits all the positive examples in E+E^{+}. Similarly, qq^{\prime} fits all negative examples in EE^{-}, because, if it did not, then there would be a homomorphism from q^\widehat{q^{\prime}} to some (A,a)E(A,a)\in E^{-}, from which it would clearly follow that qq homomorphically maps to the corresponding AEA^{\prime}\in E^{\prime-}, which we know is not the case. Therefore, qq^{\prime} fits all examples in (E+,E)(E^{+},E^{-}), and hence, qq^{\prime} is logically equivalent to 𝐓′′\operatorname{{\bf T}}^{\prime\prime}. It follows that qq must be logically equivalent to 𝐓\operatorname{{\bf T}}: any counterexample for the equivalence of qq and 𝐓\operatorname{{\bf T}} can be extended to a counterexample for the equivalence of qq^{\prime} and 𝐓′′\operatorname{{\bf T}}^{\prime\prime} by adding a fresh distinguished element connected to all existing elements by means of an RR-edge. ∎

5. Exact learnability with membership queries

The unique characterization results in the previous section immediately imply (not-necessarily-efficient) exact learnability results:

Theorem 5.1.

Fix a schema and k0k\geq 0. Let 𝒞\mathcal{C} be a computably enumerable class of kk-ary CQs. If 𝒞^\widehat{\mathcal{C}} has bounded expansion, then 𝒞\mathcal{C} is exactly learnable with membership queries.

The learning algorithm in question simply enumerates all queries q𝒞q\in\mathcal{C} and uses membership queries to test if the goal query fits the uniquely characterizing set of examples of qq (cf. Theorem 4.5(1)). Unfortunately, this learning algorithm does not run in polynomial time. Indeed, the number of membership queries is not known to be bounded by any fixed tower of exponentials (even for classes 𝒞\mathcal{C} for which membership can be tested in polynomial time). For the special case of c-acyclic queries, we can do a little better by taking advantage of the fact that a uniquely characterizing set of examples can be constructed in polynomial time. Indeed, the class of c-acyclic kk-ary CQs is exponential-time exactly learnable with membership queries: the learner can simply enumerates all c-acyclic queries in order of increasing size. For each query qq (starting with the smallest query), it uses Theorem 4.5(2) to test, using polynomially many membership queries, whether the goal query is equivalent to qq. After at most 2O(n)2^{O(n)} many attempts (where nn is the size of the goal query), the algorithm is guaranteed to find a query that is equivalent to the goal query.888Similarly, by Theorem 4.5(3), the class of unary acyclic c-connected queries is exponential-time exactly learnable with subset queries, where a subset query is an oracle query asking whether a given CQ from the concept class is implied by the goal query. Subset queries correspond precisely to membership queries where the example is the canonical structure of a query from the concept class. Our main result in this section improves on this by establishing efficient (i.e., polynomial-time) exact learnability:

Theorem 5.2.

For each schema and k0k\geq 0, the class of c-acyclic kk-ary CQs is efficiently exactly learnable with membership queries.

At a high level, the learning algorithm works by maintaining a c-acyclic hypothesis that is an over-approximation of the actual goal query. At each iteration, the hypothesis is strengthened by replacing it with one of the elements of its frontier, a process that is shown to terminate and yield a query that is logically equivalent to the goal query. Note, however, that the frontier of a c-acyclic structure does not, in general, consist of c-acyclic structures. At the heart of the proof of Theorem 5.2 lies a non-trivial argument showing how to turn an arbitrary hypothesis into a c-acyclic one with polynomially many membership queries. The detailed proof is given in Section 5.1 below.

The class of all kk-ary queries is not exactly learnable with membership queries (even with unbounded amount of time and the ability to ask an unbounded number of oracle queries), because exact learnability with membership queries would imply that every query in the class is uniquely characterizable, which we know is not the case. On the other hand, we have:

Theorem 5.3 (from (CateDK13:learning, )).

For each schema 𝒮\mathcal{S} and k0k\geq 0, the class of all kk-ary CQs over 𝒮\mathcal{S} is efficiently exactly learnable with membership and equivalence queries.

In fact, it follows from results in (CateDK13:learning, ) that the larger class of all unions of conjunctive queries is efficiently exactly learnable with membership and equivalence queries (for fixed kk and fixed schema). Efficient exact learnability with membership and equivalence queries is not a monotone property of concept classes, but the result from (CateDK13:learning, ) transfers to CQs as well. For the sake of completeness, a self-contained proof of Theorem 5.3 is given below as well.

Remark 2.

For the purpose of applications discussed in Section 7, we note that Theorems 5.15.3 remain true if the safety condition for CQs were to be dropped.

Related Work

There has been considerable prior work that formally studies the task of identifying some unknown goal query QQ from examples. Work in this direction includes learning CQs, Xpath queries, Sparql, tree patterns, description logic concepts, ontologies, and schema mappings among others (BonifatiCL15, ; CateDK13:learning, ; GottlobS10, ; StaworkoW12, ). We shall describe mostly the previous work regarding learning CQs. Some of the work in this direction ((Barcelo017, ; CateD15, ; Cohen94a, ; Hirata00, ; Willard10, ) for example) assumes that a background structure AA is fixed and known by the algorithm. In this setting, an example is a kk-ary tuple (a1,,ak)(a_{1},\dots,a_{k}) of elements in AA, labelled positively or negatively depending on whether it belongs or not to Q(A)Q(A). In the present paper (as in (CateDK13:learning, ; Haussler89, )) we do not fix any background structure (i.e., examples are pairs of the form (A,a)(A,\textbf{a})). Our setting corresponds also to the extended instances with empty background in (Cohen95, ).

In both cases a number of different learning protocols has been considered. In the reverse-engineering problem (as defined in (WeissC17, )) it is only required that the algorithm produces a query consistent with the examples. In a similar direction, the problem of determining whether such a query exists has been intensively studied under some variants (satisfiability, query-by-example, definability, inverse satisfiability) (Barcelo017, ; CateD15, ; Willard10, ). In some scenarios, it is desirable that the query produced by the learner not only explains the examples received during the training phase, but also has predictive power. In particular, the model considered in (BonifatiCS16, ) follows the paradigm of identification in the limit by Gold and requires that, additionally, there exists a finite set of examples that uniquely determines the target query QQ. In a different direction, the model introduced in (GottlobS10, ), inspired by the minimum description length principle, requires to produce a hypothesis consistent after some repairs. A third line of work (see (Cohen93, ; Haussler89, ; Hirata00, )) studies this problem under Valiant’s probably approximately correct (PAC) model. The present paper is part of a fourth direction based on the exact model of query identification by Angluin. In this model, instead of receiving labelled examples, the learner obtains information about the target query by mean of calls to an oracle. As far as we know, we are the first to study the exact learnability of CQs using a membership oracle.

5.1. Proofs for Theorem 5.2 and Theorem 5.3

To warm up, we first establish Theorem 5.3, because its proof is simpler. The proof relies on the following lemmas. Recall that we denote by q^\widehat{q} the canonical structure of a conjunctive query qq.

Lemma 5.4.

Let (A,a)(A,\textbf{a}) be any structure such that qgoal^(A,a)\widehat{q_{goal}}\to(A,\textbf{a}), where qgoalq_{goal} is the goal conjunctive query. Then, using membership oracle queries, we can compute in polynomial time (in the size of (A,a)(A,\textbf{a})) a substructure (A,a)(A^{\prime},\textbf{a}) of (A,a)(A,\textbf{a}) such that qgoal^(A,a)\widehat{q_{goal}}\to(A^{\prime},\textbf{a}), and such that qgoal^\widehat{q_{goal}} does not homomorphically map to any strict substructure of (A,a)(A^{\prime},\textbf{a}).

Proof.

It suffices to iteratively remove one of the facts from the structure, and use a membership query to test if qgoal^\widehat{q_{goal}} still admits a homomorphism to the structure after removing the fact in question. Once no further fact can be removed, we have arrived at (A,a)(A^{\prime},\textbf{a}). ∎

Recall that we call a structure safe if every distinguished element occurs in a fact, that is, the structure is the canonical structure of a conjunctive query. The proof of the next lemma is obvious.

Lemma 5.5.

Let qq be a kk-ary conjunctive query over a schema 𝒮\mathcal{S}, and let (B,b)(B,\textbf{b}) be any structure over schema 𝒮\mathcal{S} with kk distinguished elements. If   q^(B,b)\widehat{q}\to(B,\textbf{b}), then (B,b)(B,\textbf{b}) is safe.

We now present the proof of Theorem 5.3.

Proof of Theorem 5.3.

The learning algorithm maintains a structure (H,h)(H,\textbf{h}), which we can intuitively think of as (the canonical structure of) the algorithm’s guess of the goal query. The structure (H,h)(H,\textbf{h}) is refined in a series of iterations in such a way that at each iteration ii, its value, (Hi,hi)(H_{i},\textbf{h}_{i}), satisfies the following properties: (i) qgoal^(Hi,hi)\widehat{q_{goal}}\to(H_{i},\textbf{h}_{i}), and (ii) the size of (Hi,hi)(H_{i},\textbf{h}_{i}) is bounded by the size of the qgoalq_{goal}.

We start by considering (A,a)(A,\textbf{a}) where AA is the structure containing a single node aa that satisfies all possible facts over the schema and a is the kk-ary tuple (a,,a)(a,\dots,a) containing only element aa. Clearly, qgoal^\widehat{q_{goal}} homomorphically maps to this structure. We apply Lemma 5.4 to find a minimal substructure of it into which the goal query maps. We will denote it be (H0,h0)(H_{0},\textbf{h}_{0}).

Next, at each stage we perform an equivalence oracle query to test if the canonical conjunctive query of (Hi,hi)(H_{i},\textbf{h}_{i}) is logically equivalent to qgoalq_{goal}. Note that, by Lemma 5.5, the structure (Hi,hi)(H_{i},\textbf{h}_{i}) indeed has a canonical conjunctive query. If the answer to the equivalence oracle query is “yes”, then we are done. Otherwise, we receive a counterexample (A,a)(A,\textbf{a}). This counterexample must be a structure in which the goal query is true but the hypothesis is false. Thus, we have qgoal^(A,a)\widehat{q_{goal}}\to(A,\textbf{a}) and (Hi,hi)↛(A,a)(H_{i},\textbf{h}_{i})\not\to(A,\textbf{a}). Recall that we also have qgoal^(Hi,hi)\widehat{q_{goal}}\to(H_{i},\textbf{h}_{i}). It follows that qgoal^(A,a)×(Hi,hi)\widehat{q_{goal}}\to(A,\textbf{a})\times(H_{i},\textbf{h}_{i}). We now set (Hi+1,hi+1)(H_{i+1},\textbf{h}_{i+1}) to be a minimal substructure of (A,a)×(Hi,hi)(A,\textbf{a})\times(H_{i},\textbf{h}_{i}) into which qgoal^\widehat{q_{goal}} maps (using Lemma 5.4 again). It is clear from the construction that (Hi+1,hi+1)(Hi,hi)(H_{i+1},\textbf{h}_{i+1})\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}(H_{i},\textbf{h}_{i}), and that the size of (Hi+1,hi+1)(H_{i+1},\textbf{h}_{i+1}) is bounded by the size of qgoal^\widehat{q_{goal}} (otherwise qgoal^\widehat{q_{goal}} would not be a minimal substructure of (A,a)×(Hi,hi)(A,\textbf{a})\times(H_{i},\textbf{h}_{i}) into which qgoal^\widehat{q_{goal}} maps).

All that remains to be shown is that this algorithm terminates after polynomially many iterations. We show that with each iteration, the domain size of the structure (Hi,hi)(H_{i},\textbf{h}_{i}) strictly increases. Suppose that the domain size of (Hi,hi)(H_{i},\textbf{h}_{i}) and (Hi+1,hi+1)(H_{i+1},\textbf{h}_{i+1}) is the same. We know that the homomorphism (natural projection) h:(Hi+1,hi+1)(Hi,hi)h:(H_{i+1},\textbf{h}_{i+1})\to(H_{i},\textbf{h}_{i}) is surjective, and that every fact of HiH_{i} is the hh-image of a fact of Hi+1H_{i+1} (otherwise the composition with the homomorphism from qgoal^\widehat{q_{goal}} to (Hi+1,hi+1)(H_{i+1},\textbf{h}_{i+1}) would constitute a non-surjective homomorphism from qgoal^\widehat{q_{goal}} to (Hi,hi)(H_{i},\textbf{h}_{i}), which would contradict the minimality of (Hi,hi)(H_{i},\textbf{h}_{i})). Therefore, it cannot also be injective, otherwise it would be an isomorphism. Therefore, the number of elements of (Hi+1,hi+1)(H_{i+1},\textbf{h}_{i+1}) is strictly greater than the number of elements in (Hi,hi)(H_{i},\textbf{h}_{i}). Since the size of each (Hi,hi)(H_{i},\textbf{h}_{i}) is bounded by the size of qgoalq_{goal}, this shows that the algorithm terminates after at most nn rounds, where nn is the size of qgoalq_{goal}.

Incidentally, note that this algorithm runs in polynomial time (in the size of qgoalq_{goal}), even if the schema and the arity kk of the conjunctive query are not fixed but treated as part of the input. ∎

Next, in the remainder of this section, we prove Theorem 5.2.

First, we argue that we may restrict attention to schemas consisting of binary relations only. Let qgoalq_{goal} be any c-acyclic goal conjunctive query over an arbitrary schema 𝒮\mathcal{S} and consider the corresponding conjunctive query qgoalq_{goal}^{*} over the binary schema 𝒮\mathcal{S}^{*}, as in Lemma 3.26. By Lemma 3.26(2), every membership query w.r.t. the goal query qgoalq_{goal}^{*} can be efficiently reduced to a membership query w.r.t. qgoalq_{goal}. Therefore, if qgoalq_{goal}^{*} can be efficiently identified using membership queries for qgoalq_{goal}^{*}, then qgoalq_{goal}^{*} can also be efficiently identified using membership queries for qgoalq_{goal}, and consequently also qgoalq_{goal} can be efficiently identified using membership queries for qgoalq_{goal} (namely, by first computing a conjunctive query qq over 𝒮\mathcal{S}^{*} that is logically equivalent to qgoalq_{goal}^{*}, and then returning qq_{*} as the final answer (note that, by Lemma 3.26(3), qq_{*} must then be logically equivalent to qgoalq_{goal}). Finally, it is easy to see that qgoalq_{goal}^{*} is c-acyclic if and only if qgoalq_{goal} is c-acyclic. Therefore, it suffices to prove Theorem 5.2 for the special case of schemas consisting of binary relations. In the remainder of this section, we will therefore restrict ourselves to binary relations.

Proposition 5.6 (Any positive example can be transformed into a c-acyclic one).

Let qgoalq_{goal} be a c-acyclic goal query. Given a structure (A,𝐚)(A,{\bf a}) satisfying qgoal^(A,𝐚)\widehat{q_{goal}}\to(A,{\bf a}), using membership queries, we can construct, in time polynomial in size(A)+size(qgoal)size(A)+size(q_{goal}), a structure (B,𝐛)(B,{\bf b}), denoted CC(A,𝐚)\operatorname{CC}(A,{\bf a}), such that

  1. (1)

    (B,𝐛)(B,{\bf b}) is c-acyclic

  2. (2)

    (B,𝐛)(A,𝐚)(B,{\bf b})\to(A,{\bf a})

  3. (3)

    qgoal^(B,𝐛)\widehat{q_{goal}}\to(B,{\bf b})

  4. (4)

    qgoal^\widehat{q_{goal}} is not homomorphic to any structure obtained removing some fact of (B,𝐛)(B,{\bf b})

Proof.

We say that a structure is mm-c-acyclic if every cycle of length at most mm goes through a distinguished element. (Here, by a “cycle” we mean a cycle in the incidence graph of the structure, and, to simplify the exposition, by the length of the cycle we refer to the number of facts that lie on the cycle). Note that when this holds for m=nm=n, where nn is the size of qgoal^\widehat{q_{goal}}, then every homomorphic image of qgoal^\widehat{q_{goal}} contained in the structure in question must be c-acyclic. We will describe a method (using membership queries) that takes a mm-c-acyclic structure (A,𝐚)(A,{\bf a}) with qgoal^(A,𝐚)\widehat{q_{goal}}\to(A,{\bf a}) and turns it into a (m+1)(m+1)-c-acyclic structure (A,𝐚)(A^{\prime},{\bf a}) with qgoal^(A,𝐚)\widehat{q_{goal}}\to(A^{\prime},{\bf a}) and (A,𝐚)(A,𝐚)(A^{\prime},{\bf a})\to(A,{\bf a}). By applying this method repeatedly for increasing mm (and always minimizing w.r.t. qgoal^\widehat{q_{goal}} using membership queries, cf. Lemma 5.4), we are guaranteed to reach the situation where we have a structure that is nn-c-acyclic and therefore, is in fact, c-acyclic.

Let (A,𝐚)(A,{\bf a}) be mm-c-acyclic with qgoal^(A,𝐚)\widehat{q_{goal}}\to(A,{\bf a}). First, we use membership queries to minimize (A,𝐚)(A,{\bf a}) (cf. Lemma 5.4) and ensure that its size is at most nn. Next, we say that an edge is bad if it is part of a cycle of length m+1m+1 that does not contain a distinguished element, and good otherwise. If there are no bad edges then we are done. Otherwise, let e=R(c,d)e=R(c,d) be a bad edge. Let A1,A2A_{1},A_{2} be isomorphic copies of A{e}A\setminus\{e\} that are disjoint except for the distinguished elements. Now, let (B,a)(B,\textbf{a}) be the structure obtained by extending the fg-disjoint union (A1,a)(A2,a)(A_{1},\textbf{a})\uplus(A_{2},\textbf{a}) with additional “special edges” R(c1,d2)R(c_{1},d_{2}) and R(c2,d1)R(c_{2},d_{1}). Clearly, (B,𝐚)(A,𝐚)(B,{\bf a})\to(A,{\bf a}).

Claim 1: for each good edge of (A,𝐚)(A,{\bf a}), its isomorphic copies belonging to BB are good in (B,𝐚)(B,{\bf a}).

Claim 2: the special edges R(c1,d2)R(c_{1},d_{2}) and R(c2,d1)R(c_{2},d_{1}) are both good in BB.

Claim 1 is obvious from the construction of BB, as no new short cycles are introduced. To see that Claim 2 holds, consider any minimal cycle in BB that does not contain any distinguished element, and that goes through one of these edges. Then, clearly, the cycle must go through both of these edges. That is, it must be of the form

c1R(c1,d2)d2𝜋c2R(c2,d1)d1πc1c_{1}\xrightarrow{R(c_{1},d_{2})}d_{2}\xrightarrow{~{}~{}~{}\pi~{}~{}~{}}c_{2}\xrightarrow{R(c_{2},d_{1})}d_{1}\xrightarrow{~{}~{}~{}\pi^{\prime}~{}~{}~{}}c_{1}

where π\pi is a path contained in A2A_{2} and π\pi^{\prime} is a path contained in A1A_{1}. Now, we know that the paths π\pi and π\pi^{\prime} must have length at least mm (because otherwise (A,𝐚)(A,{\bf a}) would not be mm-c-acyclic). Therefore, the entire cycle must have length at least 2m+22m+2.

Claim 3: qgoal^(B,𝐚)\widehat{q_{goal}}\to(B,{\bf a}).

Claim 3 is essentially proved by an induction on the tree structure of qgoal^\widehat{q_{goal}}, after removing all distinguished elements. More precisely, let h:qgoal^(A,𝐚)h:\widehat{q_{goal}}\to(A,{\bf a}). Let GG be the substructure of qgoal^\widehat{q_{goal}} obtained by removing all distinguished elements and facts involving distinguished elements. Clearly, GG is acyclic, i.e., GG can be oriented as a forest. By induction on this forest, we can construct a homomorphism h:GBh^{\prime}:G\to B, with the additional property that, for each element gg of GG, h(g)h^{\prime}(g) is equal to either h(g)1h(g)_{1} or h(g)2h(g)_{2}. Recall that h(g)1h(g)_{1} is the copy of h(g)h(g) in A1A_{1} and that h(g)2h(g)_{2} is the copy of h(g)h(g) in A2A_{2}. Next, let h′′h^{\prime\prime} be the extension of hh^{\prime} to qgoal^\widehat{q_{goal}} that agrees with hh on all distinguished elements. We can show that h′′:qgoal^(B,𝐚)h^{\prime\prime}:\widehat{q_{goal}}\to(B,{\bf a}). To see this, consider any fact of qgoal^\widehat{q_{goal}}. If that fact involves at least one distinguished element, then the hh-image of that fact involves at least one distinguished element of AA. Therefore, by construction, all the facts obtainable by replacing every non-distinguished element (if there is such) by one of its two copies, are present in BB, and therefore, no matter how hh^{\prime} acts on the non-distinguished element of that fact, the h′′h^{\prime\prime}-image will be present in BB. Next, consider the case where the fact doesn’t involve any distinguished element of AA. In this case, the fact belongs to GG and hence we know that the hh^{\prime}-image of that fact (which is also the h′′h^{\prime\prime}-image) belongs to BB. This concludes the proof of claim 3.

Next, let (B,𝐚)(B^{\prime},{\bf a}) be a minimal substructure of (B,𝐚)(B,{\bf a}) into which qgoal^\widehat{q_{goal}} homomorphically maps (obtained using membership queries, cf. Lemma 5.4). Clearly, Claim 1-3 above are preserved under the passage from (B,𝐚)(B,{\bf a}) to its substructure (B,𝐚)(B^{\prime},{\bf a}). We claim that (1) BB^{\prime} must contain at least one of the edges R(c1,d2)R(c_{1},d_{2}) and R(c2,d1)R(c_{2},d_{1}), and (2) for every edge of A{e}A\setminus\{e\}, BB^{\prime} must contain at least one of its two isomorphic copies. Because if not, then the homomorphism from qgoal^\widehat{q_{goal}} to (B,𝐚)(B^{\prime},{\bf a}) could be composed with the natural projection from (B,𝐚)(B^{\prime},{\bf a}) to (A,𝐚)(A,{\bf a}) to obtain a homomorphism from qgoal^\widehat{q_{goal}} to a proper substructure of (A,𝐚)(A,{\bf a}), a contradiction with the assumed minimality of (A,𝐚)(A,{\bf a}). Combined with Claim 1 and 2, this allows us to conclude that (B,𝐚)(B^{\prime},{\bf a}) has strictly more good edges than (A,𝐚)(A,{\bf a}). Since the number of edges (good or bad) is bounded by nn, by repeating the above procedure, we obtain a structure that has no bad edges, and therefore, is (m+1)(m+1)-c-acyclic. ∎

Proof of Theorem 5.2:.

The algorithm maintains a structure, denoted (H,𝐡)(H,{\bf h}), which can be interpreted as (the canonical structure of) the algorithm’s guess of the goal query. At every moment in the execution of the algorithm, (H,𝐡)(H,{\bf h}), satisfies the following properties:

  1. (1)

    qgoal^(H,𝐡)\widehat{q_{goal}}\to(H,{\bf h})

  2. (2)

    qgoal^\widehat{q_{goal}} does not homomorphically map to any structure obtained by removing a fact from (H,𝐡)(H,{\bf h})

  3. (3)

    (H,𝐡)(H,{\bf h}) is c-acyclic

Note that conditions (1) and (2) imply that HH cannot have more elements than qgoal^\widehat{q_{goal}} and that size(H)size(qgoal^)size(H)\leq size(\widehat{q_{goal}})

Initially, (H,𝐡)(H,{\bf h}) is defined to be CC(A,𝐚)\operatorname{CC}(A,{\bf a}) where CC()\operatorname{CC}(\cdot) is defined as in Proposition 5.6 and AA is the structure containing a single node aa that satisfies all possible facts over the schema and 𝐚{\bf a} is the kk-ary tuple (a,,a)(a,\dots,a) containing only element aa. The algorithm refines (H,𝐡)(H,{\bf h}) in a sequence of iterations. At each iteration, the algorithm first constructs the frontier {\mathcal{F}} of (H,𝐡)(H,{\bf h}). Note that by condition (3) (H,𝐡)(H,{\bf h}) is cc-acyclic. Hence, by Theorem 3.8, {\mathcal{F}} can be computed in time polynomial in size(H)size(H) (and, hence, in size(qgoal)size(q_{goal})). Then, the algorithm asks a membership query for each (F,𝐟)(F,{\bf f}) in \mathcal{F} until either it receives a ’yes’ answer or, otherwise, it exhausts all structures in {\mathcal{F}} without receiving a ’yes’ answer. In the latter case the algorithm stops and returns the canonical query of (H,𝐡)(H,{\bf h}), cf. Lemma 5.5, as it is immediate from the fact that qgoal^(H,𝐡)\widehat{q_{goal}}\to(H,{\bf h}) and that qgoal^\widehat{q_{goal}} does not homomorphically map to any structure in {\mathcal{F}} that (H,𝐡)(H,{\bf h}) is homomorphically equivalent to qgoal^\widehat{q_{goal}}, and therefore, the canonical query of (H,𝐡)(H,{\bf h}) is logically equivalent to qgoalq_{goal}. In the former case, the algorithm picks any structure (F,𝐟)(F,{\bf f}) in {\mathcal{F}} that (when asked as a membership query) produces a ’yes’ answer, updates (H,𝐡)(H,{\bf h}), by setting (H,𝐡):=CC(F,𝐟)(H,{\bf h}):=\operatorname{CC}(F,{\bf f}) and starts a new iteration. It follows immediately that (H,𝐡)(H,{\bf h}) preserves properties (1-3) above.

To show the correctness of the algorithm it only remains to see that the number of iterations is polynomially bounded in size(G)size(G). This follows directly from the following claim: the domain size of HH increases at each iteration. To prove the claim, let (Hi,𝐡i)(H_{i},{\bf h}_{i}) be the value of (H,𝐡)(H,{\bf h}) at the iith iteration, and note that, by design, we have (Hi+1,𝐡i+1)(F,𝐟)(H_{i+1},{\bf h}_{i+1})\to(F,{\bf f}) where (F,𝐟)(F,{\bf f}) belongs to the frontier of (Hi,𝐡i)(H_{i},{\bf h}_{i}). It follows that (Hi+1,𝐡i+1)(Hi,𝐡i)(H_{i+1},{\bf h}_{i+1})\to(H_{i},{\bf h}_{i}) and (Hi,𝐡i)↛(Hi+1,𝐡i+1)(H_{i},{\bf h}_{i})\not\to(H_{i+1},{\bf h}_{i+1}). Let hh be the homomorphism from (Hi+1,𝐡i+1)(H_{i+1},{\bf h}_{i+1}) to (Hi,𝐡i)(H_{i},{\bf h}_{i}). It follows from the fact that qgoal^(Hi+1,𝐡i+1)\widehat{q_{goal}}\to(H_{i+1},{\bf h}_{i+1}) and condition (2) that the image of (Hi+1,𝐡i+1)(H_{i+1},{\bf h}_{i+1}) according to hh is precisely (Hi,𝐡i)(H_{i},{\bf h}_{i}). Therefore, hh must be non-injective (otherwise it would be an isomorphism, contradicting the fact that (Hi,𝐡i)↛(Hi+1,𝐡i+1)(H_{i},{\bf h}_{i})\not\to(H_{i+1},{\bf h}_{i+1})). Since hh is surjective and non-injective, we can conclude that the domain size of Hi+1H_{i+1} is larger than the domain size of HiH_{i}. ∎

6. Another type of data examples: input-output examples

In the previous sections, we have focused on positive and negative data examples. However, there is also another type of data example that is natural to consider, namely a pair (I,R)(I,R) where II is an input instance (over a schema 𝒮\mathcal{S} of the query), and RR is a kk-ary relation over the domain of II, where kk is the arity of the query. A conjunctive query qq fits (I,R)(I,R) if RR is precisely the set of all tuples over the domain of II that satisfy qq. Input-output examples are analogous to universal examples for schema mappings as studied in (AlexeCKT2011, ), in that they capture the complete behavior of the concept (in our case, the conjunctive query) on a database instance.

One input-output example, intuitively, captures the same information as a polynomial number of positive and negative data examples (if we treat kk as a constant), namely all positive data examples (I,a)(I,\textbf{a}) for aR\textbf{a}\in R and all negative data examples (I,a)(I,\textbf{a}) for adom(I)kR\textbf{a}\in dom(I)^{k}\setminus R. It follows that a CQ is uniquely characterizable by a finite set of input-output data examples if and only if it is uniquely characterizable by a finite set of positive and negative data examples. In the case of connected CQs, in fact, a single input-output example suffices:

Theorem 6.1.

Fix a schema and k0k\geq 0, and let CC be any class of kk-ary CQs over the given schema. Then the following are equivalent for all queries qCq\in C:

  1. (1)

    qq is uniquely characterized w.r.t. CC by a finite collection of positive and negative data examples.

  2. (2)

    qq is uniquely characterized w.r.t. CC by a finite collection of input-output examples.

And, if CC consists of c-connected CQs (or k=0k=0 and CC consists of connected CQs):

  1. (3)

    qq is uniquely characterized w.r.t. CC by a single input-output example.

Moreover, the equivalences are witnessed by polynomial-time transformations in each direction.

Proof.

For the direction from (1) to (2): for given (E+,E)(E^{+},E^{-}), let Eio={(I,q(I))(I,a)E+E}E^{io}=\{(I,q(I))\mid(I,\textbf{a})\in E^{+}\cup E^{-}\}. Note that whenever a query qCq^{\prime}\in C fits EioE^{io}, then it must also fit (E+,E)(E^{+},E^{-}). For the converse direction, from (2) to (1), the construction was already described above.

The direction (3) to (2) is trivial.

Finally, for the direction from (2) to (3), let Eio={(I1,q(I1)),,(In,q(In))}E^{io}=\{(I_{1},q(I_{1})),\ldots,(I_{n},q(I_{n}))\}. Let II be the disjoint union i=1n(Ii)\biguplus_{i=1\ldots n}(I_{i}). For every tuple a from the domain of IiI_{i}, and for every c-connected query qq^{\prime}, aq(Ii)\textbf{a}\in q^{\prime}(I_{i}) if and only if aq(I)\textbf{a}\in q^{\prime}(I). It follows that, whenever two cc-connected queries q,qq,q^{\prime} agree on their output on II (that is to say, q(I)=q(I)q(I)=q^{\prime}(I)), then they must also agree on their output on each IiI_{i}. Therefore, if EioE^{io} uniquely characterizes qq w.r.t. CC, then also (I,q(I))(I,q(I)) uniquely characterizes qq w.r.t. CC. The same argument applies to connected Boolean CQs.

Incidentally, note that the above argument only works for c-connected CQs. For instance, the CQ q(x)=y(P(x)Q(y))q(x)=\exists y(P(x)\land Q(y)) cannot be uniquely characterized by a single input-output example (I,q(I))(I,q(I)), because if q(I)q(I) is non-empty, then q(I)=q(I)q(I)=q^{\prime}(I) where q(x)q^{\prime}(x) is the query P(x)P(x); whereas if q(I)q(I) is empty, then q(I)=q′′(I)q(I)=q^{\prime\prime}(I), where q′′(x)q^{\prime\prime}(x) is the query P(x)Q(x)P(x)\land Q(x). ∎

It follows that all results regarding the existence and polynomial-time computability of unique characterizations in Section 4 remain true when considering input-output data examples instead of (or even in addition to) positive and negative data examples.

Similarly, in the exact learning context, we can also consider a different type of oracle query, namely where the algorithm provides the oracle with an input instance II and the oracle responds with an input-output example of the form (I,R)(I,R) that fits the target CQ. We could call such oracle queries input-output queries. They naturally capture a scenario in which we have black-box access to an executable version of the target CQ. For the same reasons discussed above, input-output queries are no more powerful than membership queries, since one input-output query can be simulated by a polynomial number of membership queries (assuming kk is fixed). Therefore, also, all our results on exact learnability remain true when considering input-output queries instead of membership queries.

7. Further Applications

While our main focus in this paper is on unique characterizability and exact learnability for CQs, in this section, we explore some implications for other application domains.

7.1. Characterizability and learnability of LAV schema mappings

A schema mapping is a high-level declarative specification of the relationships between two database schemas (Kolaitis05schema, ). Two of the most well-studied schema mapping specification languages are LAV (“Local-as-View”) and GAV (“Global-as-View”) schema mappings.

In (AlexeCKT2011, ), the authors studied the question of when a schema mapping can be uniquely characterized by a finite set of data examples. Different types of data examples were introduced and studied, namely positive examples, negative examples, and “universal” examples. In particular, it was shown in (AlexeCKT2011, ) that a GAV schema mapping can be uniquely characterized by a finite set of positive and negative examples (or, equivalently, by a finite set of universal examples) if and only if the schema mapping in question is logically equivalent to one that is specified using c-acyclic GAV constraints.

It was shown in (AlexeCKT2011, ) that every LAV schema mapping is uniquely characterized by a finite set of universal examples, and that there are LAV schema mappings that are not uniquely characterized by any finite set of positive and negative examples. In this section, we will consider the question which LAV schema mappings are uniquely characterizable by a finite set of positive and negative examples, and how to construct such a set of examples efficiently.

We will also consider the exact learnability of LAV schema mappings with membership queries. Exact learnability of GAV schema mappings was studied in (CateDK13:learning, ), where it was shown that GAV schema mappings are learnable with membership and equivalence queries (and, subsequently, also in a variant of the PAC model) but is not exactly learnable with membership queries alone or with equivalence queries alone. The exact learning algorithm for GAV schema mappings from (CateDK13:learning, ) was further put to use and validated experimentally in (Kun18active, ). Here, we consider exact learnability of LAV schema mappings with membership queries.

Definition 7.1.

A LAV (“Local-As-View”) schema mapping is a triple M=(S,T,Σ)M=(S,T,\Sigma) where SS and TT are disjoint schemas (the “source schema” and “target schema”), and Σ\Sigma is a finite set of LAV constraints, that is, first-order sentences of the form x(α(x)yϕ(x,y))\forall\textbf{x}(\alpha(\textbf{x})\to\exists\textbf{y}\phi(\textbf{x},\textbf{y})), where α(x)\alpha(\textbf{x}) is an atomic formula using a relation from SS, and ϕ(x,y)\phi(\textbf{x},\textbf{y}) is a conjunction of atomic formulas using relations from TT.

By a schema-mapping example we will mean a pair (I,J)(I,J) where II is a structure over schema 𝒮\mathcal{S} without distinguished elements, and JJ is a structure over schema 𝒯\mathcal{T} without distinguished elements. We say that (I,J)(I,J) is a positive example for a schema mapping M=(𝒮,𝒯,Σ)M=(\mathcal{S},\mathcal{T},\Sigma) if (I,J)(I,J), viewed as a single structure over the joint schema 𝒮𝒯\mathcal{S}\cup\mathcal{T}, satisfies all constraints in Σ\Sigma, and we call (I,J)(I,J) a negative example for MM otherwise. Note that schema-mapping examples were called data examples in (AlexeCKT2011, ). Unique characterizations and learnability with membership queries are defined as before. In particular, by a membership query, in the context of learning LAV schema mappings, we will mean an oracle query that consists of a schema-mapping example, which the oracle then labels as positive or negative depending on whether it satisfies the constraints of the goal LAV schema mapping. It is assumed here, that the source and target schemas are fixed and known to the learner.

Given a fixed source schema 𝒮\mathcal{S}, there are only finitely many different possible left-hand sides α\alpha for a LAV constraint, up to renaming of variables. Furthermore, if a schema mapping contains two LAV constraints with the same left-hand side, then they can be combined into a single LAV constraint by conjoining the respective right-hand sides. Since the right-hand side of a LAV constraint can be thought of as a CQ, this means that, intuitively, a LAV schema mapping can be thought of as a finite collection of CQs (one for each possible left-hand side). In the light of this observation, it is no surprise that questions about the unique characterizability and learnability of LAV schema mappings can be reduced to questions about the unique characterizability and learnability of CQs.

Let us capture this observation a little more precisely. Let kk be the maximum arity of a relation in 𝒮\mathcal{S}, and let ATOMS𝒮\textrm{ATOMS}_{\mathcal{S}} be the finite set of all atomic formulas using a relation from 𝒮\mathcal{S} and variables from {z1,,zk}\{z_{1},\ldots,z_{k}\}. Given a LAV schema mapping M=(𝒮,𝒯,Σ)M=(\mathcal{S},\mathcal{T},\Sigma) and an α(z)ATOMS𝒮\alpha(\textbf{z})\in\textrm{ATOMS}_{\mathcal{S}}, we denote by qM,α(z)q_{M,\alpha}(\textbf{z}) the following first-order formula over schema 𝒯\mathcal{T}:

x(β(x)yϕ(x,y))Σh:{x}{z} a function s.t. β(h(x))=α(z)yϕ(h(x),y)\mathop{\bigwedge_{\forall\textbf{x}(\beta(\textbf{x})\to\exists\textbf{y}\phi(\textbf{x},\textbf{y}))\in\Sigma}}_{\text{$h:\{\textbf{x}\}\to\{\textbf{z}\}$ a function s.t.~{}$\beta(h(\textbf{x}))=\alpha(\textbf{z})$}}\exists\textbf{y}\phi(h(\textbf{x}),\textbf{y})

For example, if MM consists of the LAV constraints x1,x2,x3.R(x1,x2,x3)S(x1,x2,x3)\forall x_{1},x_{2},x_{3}.R(x_{1},x_{2},x_{3})\to S(x_{1},x_{2},x_{3}) and x1,x2.R(x1,x2,x2)yT(x1,y)\forall x_{1},x_{2}.R(x_{1},x_{2},x_{2})\to\exists yT(x_{1},y), and α(z1)\alpha(z_{1}) is R(z1,z1,z1)R(z_{1},z_{1},z_{1}), then qM,α=S(z1,z1,z1)yT(z1,y)q_{M,\alpha}=S(z_{1},z_{1},z_{1})\land\exists yT(z_{1},y). Similarly, for α(z1,z2,z3)=R(z1,z2,z3)\alpha^{\prime}(z_{1},z_{2},z_{3})=R(z_{1},z_{2},z_{3}) then qM,α=S(z1,z2,z3)q_{M,\alpha^{\prime}}=S(z_{1},z_{2},z_{3}). Note that qM,α(z)q_{M,\alpha}(\textbf{z}) can be equivalently written as a not-necessarily-safe CQ over 𝒯\mathcal{T} (by pulling the existential quantifies to the front).

Lemma 7.2.

Let M=(𝒮,𝒯,Σ)M=(\mathcal{S},\mathcal{T},\Sigma) be any LAV schema mapping, and let α(z)ATOMS𝒮\alpha(\textbf{z})\in\textrm{ATOMS}_{\mathcal{S}} have kk distinct variables. For every structure (A,a)(A,\textbf{a}), over schema 𝒯\mathcal{T} and with kk distinguished elements, the following are equivalent:

  1. (1)

    (A,a)(A,\textbf{a}) is a positive data example for qM,α(z)q_{M,\alpha}(\textbf{z}),

  2. (2)

    The schema-mapping example (I,J)(I,J) is a positive example for MM, where II is the structure over 𝒮\mathcal{S} consisting of the single fact α(a)\alpha(\textbf{a}), and J=AJ=A.

We omit the proof, which is straightforward (note that the left-hand side of a LAV constraint can have at most one homomorphism to II, and the latter can be extended to the right-hand side of the constraint to JJ iff the respective conjunct of qM,αq_{M,\alpha} is satisfied. Also note that if (I,J)(I,J) is a positive example for a LAV schema mapping MM then, so is (I,J)(I,J^{\prime}) for JJJ\subseteq J^{\prime}).

Intuitively, Lemma 7.2 shows that the behavior of qM,αq_{M,\alpha} on arbitrary data examples, is fully determined by the behavior of MM on arbitrary schema-mapping examples. The converse turns out to be true as well, that is, the semantics of a LAV schema mapping M=(𝒮,𝒯,Σ)M=(\mathcal{S},\mathcal{T},\Sigma) is determined (up to logical equivalence) by its associated queries qM,αq_{M,\alpha} for αATOMS𝒮\alpha\in\textrm{ATOMS}_{\mathcal{S}}:

Lemma 7.3.

Two LAV schema mappings M1=(𝒮,𝒯,Σ1)M_{1}=(\mathcal{S},\mathcal{T},\Sigma_{1}), M2=(𝒮,𝒯,Σ1)M_{2}=(\mathcal{S},\mathcal{T},\Sigma_{1}) are logically equivalent iff, for every α(z)ATOMS𝒮\alpha(\textbf{z})\in\textrm{ATOMS}_{\mathcal{S}}, qM1,α(z)q_{M_{1},\alpha}(\textbf{z}) and qM2,α(z)q_{M_{2},\alpha}(\textbf{z}) are logically equivalent.

Proof.

The left-to-right direction follows immediately from the preceding Lemma. For the right-to-left direction: suppose M1M_{1} and M2M_{2} are not logically equivalent. Then they disagree on some schema-mapping example (I,J)(I,J). Without loss of generality, we may assume that (I,J)(I,J) is a positive example for M1M_{1} and a negative example for M2M_{2}. In particular, one of the LAV constraints in Σ2\Sigma_{2} is false in (I,J)(I,J). Since the left-hand side of a LAV constraint consists of a single atom, it follows that, for some fact R(a)R(\textbf{a}) of II, the schema-mapping example ({R(a)},J)(\{R(\textbf{a})\},J) is a negative example for M2M_{2}. Moreover, an easy monotonicity argument shows that ({R(a)},J)(\{R(\textbf{a})\},J) is a positive example for M1M_{1}. Let α\alpha be obtained from the fact R(a)R(\textbf{a}) by replacing each distinct element aia_{i} by a corresponding variable ziz_{i}. It follows from Lemma 7.2 that qM1,αq_{M_{1},\alpha} and qM2,αq_{M_{2},\alpha} disagree on the structure (J,a)(J,\textbf{a}), and are not logically equivalent. ∎

It follows directly from the above Lemmas that the unique characterizability of a LAV schema mapping MM reduces to the unique characterizability of each query qM,αq_{M,\alpha}:

Lemma 7.4.

For all LAV schema mappings M=(𝒮,𝒯,Σ)M=(\mathcal{S},\mathcal{T},\Sigma), the following are equivalent:

  1. (1)

    MM is uniquely characterizable by finitely many positive and negative schema-mapping examples (w.r.t. the class of all LAV schema mappings over 𝒮,𝒯\mathcal{S},\mathcal{T}).

  2. (2)

    For each α(z1,,zk)ATOMS𝒮\alpha(z_{1},\ldots,z_{k})\in\textrm{ATOMS}_{\mathcal{S}}, qM,α(z1,,zk)q_{M,\alpha}(z_{1},\ldots,z_{k}) is uniquely characterizable by finitely many positive and negative data examples w.r.t. the class of all kk-ary not-necessarily-safe CQs over 𝒯\mathcal{T}.

Intuitively, this shows that a LAV schema mapping is uniquely characterizable iff each of its constraints (joined together according to their left-hand side atom) are. By combining these lemmas with Theorem 4.5 (cf. Remark 1), we can link the unique characterizability of a LAV schema mapping to the condition of c-acyclicity. We say that a LAV schema mapping MM is c-acyclic if the right-hand side of each of its LAV constraints is a c-acyclic not-necessarily-safe CQ. Note that, in this case, also qM,αq_{M,\alpha} is c-acyclic, for each αATOMS𝒮\alpha\in\textrm{ATOMS}_{\mathcal{S}}.

Theorem 7.5.

Fix a source schema 𝒮\mathcal{S} and a target schema 𝒯\mathcal{T}. A LAV schema mapping M=(𝒮,𝒯,Σ)M=(\mathcal{S},\mathcal{T},\Sigma) is uniquely characterizable by a finite set of positive and negative schema-mapping examples if and only if MM is logically equivalent to a c-acyclic LAV schema mapping. Moreover, if MM is c-acyclic, then a uniquely characterizing set of positive and negative schema-mapping examples can be constructed in polynomial time (for fixed 𝒮,𝒯\mathcal{S},\mathcal{T}).

Proof.

The direction going from c-acyclicity to the uniquely characterizing set of schema-mapping examples, follows immediately from the above lemmas together with Theorem 4.5. For the other direction, assume that MM is uniquely characterizable by finitely many positive and negative schema-mapping examples. It follows by Lemma 7.4 that each qM,αq_{M,\alpha} is uniquely characterizable by finitely many positive and negative data examples. Hence, each qM,αq_{M,\alpha} is logically equivalent to a c-acyclic not-necessarily-safe conjunctive query qM,αq^{\prime}_{M,\alpha}. Finally, let M=(𝒮,𝒯,Σ)M^{\prime}=(\mathcal{S},\mathcal{T},\Sigma^{\prime}), where Σ\Sigma^{\prime} consists of all LAV constraints of the form z(qM,α(z)α(z))\forall\textbf{z}(q_{M,\alpha}(\textbf{z})\to\alpha(\textbf{z})) for α(z)ATOMS𝒮\alpha(\textbf{z})\in\textrm{ATOMS}_{\mathcal{S}}. Then MM^{\prime} is c-acyclic and logically equivalent to MM. ∎

Similarly, Lemma 7.2 and Lemma 7.3, together with Theorem 5.2, directly imply:

Theorem 7.6.

Fix a source schema 𝒮\mathcal{S} and a target schema 𝒯\mathcal{T}. The class of c-acyclic LAV schema mappings over 𝒮,𝒯\mathcal{S},\mathcal{T} is efficiently exactly learnable with membership queries.

Note that the class of all LAV schema mappings over 𝒮,𝒯\mathcal{S},\mathcal{T} is not exactly learnable with membership queries (assuming that 𝒮\mathcal{S} is non-empty and 𝒯\mathcal{T} contains a relation of arity at least 2). This follows immediately from the existence of LAV schema mappings that are not uniquely characterizable by finitely many positive and negative schema-mapping examples.

As mentioned earlier, LAV schema mappings and GAV schema mappings are two of the most well-studied schema mapping languages. GLAV (“Global-and-Local-As-Views”) schema mappings is another, which forms a common generalization. An important remaining open question in the area of example-driven approaches to schema mapping design is the following (AlexeCKT2011, ): which GLAV schema mappings are uniquely characterizable by a finite set of examples?

7.2. Learning description logic concept expressions and ABoxes

Description logics are formal specification languages used to represent domain knowledge. Example-driven and machine-learning based approaches have a long history in this area, and have received renewed interest in the last years (Ozaki2020:learning, ), in particular, for ontologies specified in the lightweight description logic \mathcal{ELI}, and focusing on the exact learnability of ontologies using entailment queries and equivalence queries. As we show in this section, our results on c-acyclic CQs have some implications for the exact learnability of \mathcal{ELI} concept expressions.

Definition 7.7 (\mathcal{ELI} Concept expressions, ABoxes, TBoxes).

Let NC,NR,NIN_{C},N_{R},N_{I} be fixed, disjoint sets, whose members we will refer to as “concept names”, “role names”, and ”individual names”, respectively. The sets NCN_{C} and NRN_{R} are assumed to be finite, while NIN_{I} is assumed to be infinite.

A concept expression CC is an expression built up from from concept names in NCN_{C} and \top, using conjunction (C1C2C_{1}\sqcap C_{2}) and existential restriction (r.C\exists r.C or r.C\exists r^{-}.C, where rNRr\in N_{R}).

An ABox is a finite set of ABox axioms of the form P(a)P(a) and/or r(a,b)r(a,b), where PNCP\in N_{C}, rNRr\in N_{R}, and a,bNIa,b\in N_{I}.

A TBox is a finite set of TBox axioms CDC\sqsubseteq D, where C,DC,D are concept expressions.

The semantics of these expressions can be explained by translation to first-order logic:

Definition 7.8.

The correspondence schema is the schema that contains a unary relation for every ANCA\in N_{C} and a binary relation for every rNRr\in N_{R}. Through the standard translation from description logic to first-order logic (cf. Table 1), every concept expression CC translates to a first-order formula qC(x)q_{C}(x) over the correspondence schema. By extension, every TBox 𝒯\mathcal{T} translates to a finite first-order theory 𝒯fo\mathcal{T}^{\text{fo}}, where C1C2C_{1}\sqsubseteq C_{2} translates to x(qC1(x)qC2(x))\forall x(q_{C_{1}}(x)\to q_{C_{2}}(x)).

qP(x)=P(x) for PNCq(x)=qC1C2(x)=qC1(x)qC2(x)qr.C(x)=y(r(x,y)qC(y))qr.C(x)=y(r(y,x)qC(y))\begin{array}[]{rcl}q_{P}(x)&=&P(x)~{}\text{ for $P\in N_{C}$}\\ q_{\top}(x)&=&\top\\ q_{C_{1}\sqcap C_{2}}(x)&=&q_{C_{1}}(x)\land q_{C_{2}}(x)\\ q_{\exists r.C}(x)&=&\exists y(r(x,y)\land q_{C}(y))\\ q_{\exists r^{-}.C}(x)&=&\exists y(r(y,x)\land q_{C}(y))\\ \end{array}
Table 1. Standard translation from concept expressions to first-order logic

An ABox can equivalently be viewed as a finite structure (without distinguished elements), whose domain consists of individual names from NIN_{I}, and whose facts are the ABox assertions. Since NIN_{I} is assumed to be infinite, every finite structure over the correspondence schema can (up to isomorphism) be represented as an ABox. Therefore, in what follows we will use ABoxes and structures interchangeably.

We can think of an ABox as a (possibly incomplete) list of facts, and a TBox as domain knowledge in the form of rules for deriving more facts. This idea underlies the next definition:

Definition 7.9.

A QA-example is a pair (𝒜,a)(\mathcal{A},a) where 𝒜\mathcal{A} is an ABox and aNIa\in N_{I}. We say that (𝒜,a)(\mathcal{A},a) is a positive QA-example for a concept expression CC relative to a TBox 𝒯\mathcal{T} if acertain(C,𝒜,𝒯)a\in\text{certain}(C,\mathcal{A},\mathcal{T}) where certain(C,𝒜,𝒯)={qC(B)𝒜B and B𝒯fo}\text{certain}(C,\mathcal{A},\mathcal{T})=\bigcap\{q_{C}(B)\mid\text{$\mathcal{A}\subseteq B$ and $B\models\mathcal{T}^{\text{fo}}$}\}. If acertain(C,𝒜,𝒯)a\not\in\text{certain}(C,\mathcal{A},\mathcal{T}), we say that (𝒜,a)(\mathcal{A},a) is a negative QA-example for CC relative to 𝒯\mathcal{T}.

The name QA-example, here, reflects the fact that the task of computing certain(C,𝒜,𝒯)\text{certain}(C,\mathcal{A},\mathcal{T}) is commonly known as query answering. It is one of the core inference tasks studied in the description logic literature. In general, there are two variants of the definition of certain(C,𝒜,𝒯)\text{certain}(C,\mathcal{A},\mathcal{T}): one where BB ranges over finite structures, and one where BB ranges over all, finite or infinite, structures. The description logic \mathcal{ELI} that we consider here has been shown to be finitely controllable (Barany2013querying, ), meaning that both definitions are equivalent. For more expressive description logics, this is in general not the case.

ExampleFirst-order logic translationABox:𝒜={P(a),r(a,b)}TBox:𝒯={PQr.P}𝒯fo={x(P(x)Q(x)y(r(x,y)P(y))}Concept expr:C=r.QqC(x)=y(r(x,y)Q(y))\begin{array}[]{l@{~~~~~}l@{~~~}l}\hfil~{}~{}~{}~{}~{}&\text{Example}\hfil~{}~{}~{}&\text{First-order logic translation}\\[2.84526pt] \text{ABox:}\hfil~{}~{}~{}~{}~{}&\mathcal{A}=\{P(a),r(a,b)\}\hfil~{}~{}~{}\\ \text{TBox:}\hfil~{}~{}~{}~{}~{}&\mathcal{T}=\{P\sqsubseteq Q\sqcap\exists r.P\}\hfil~{}~{}~{}&\mathcal{T}^{\text{fo}}=\{\forall x(P(x)\to Q(x)\land\exists y(r(x,y)\land P(y))\}\\ \text{Concept expr:}\hfil~{}~{}~{}~{}~{}&C=\exists r.Q\hfil~{}~{}~{}&q_{C}(x)=\exists y(r(x,y)\land Q(y))\end{array}
Table 2. Example description logic ABox, TBox and concept expression
Example 7.10.

Consider the ABox, TBox, and concept expression in Table 2. Every model of 𝒯fo\mathcal{T}^{\text{fo}} containing the facts in 𝒜\mathcal{A} must contain also r(a,c)r(a,c) and Q(c)Q(c) for some cNIc\in N_{I}. It follows that acertain(C,𝒜,𝒯)a\in\text{certain}(C,\mathcal{A},\mathcal{T}). In other words, (𝒜,a)(\mathcal{A},a) is a positive QA-example for CC relative to 𝒯\mathcal{T}. On the other hand, (𝒜,b)(\mathcal{A},b) is a negative QA-example for CC relative to 𝒯\mathcal{T}.

See (Baader2017:introduction, ) for more details on description logic syntax and semantics. We now explain how our results from Section 4 and 5 can be applied here. Although a QA-example is just a data example with one distinguished element, over the correspondence schema, the definition of positive/negative QA-examples diverges from the definition of positive/negative data examples, because of the TBox 𝒯\mathcal{T}. For the special case where 𝒯=\mathcal{T}=\emptyset, the two coincide:

Lemma 7.11.

Let 𝒯=\mathcal{T}=\emptyset. A QA-example (𝒜,a)(\mathcal{A},a) is a positive (negative) QA-example for a concept expression CC relative to 𝒯\mathcal{T} iff (𝒜,a)(\mathcal{A},a) is a positive (negative) data example for qC(x)q_{C}(x).

Lemma 7.11 follows from the well-known monotonicity property of CQs (i.e., whenever ABA\subseteq B, then q(A)q(B)q(A)\subseteq q(B)), which implies that certain(C,𝒜,)=qC(𝒜)\text{certain}(C,\mathcal{A},\emptyset)=q_{C}(\mathcal{A}).

Concept expressions turn out to correspond precisely to unary, acyclic, c-connected CQs:

Lemma 7.12.

The standard translation qC(x)q_{C}(x) of every \mathcal{ELI} concept expression CC is equivalent to a not-necessarily-safe unary CQ that is acyclic and c-connected. Conversely, every unary, acyclic, c-connected not-necessarily-safe CQ over the correspondence schema is logically equivalent to qC(x)q_{C}(x) for some \mathcal{ELI} concept expression CC.

Both directions of Lemma 7.12 can be proved using a straightforward induction.

The above two lemmas, together with Theorem 4.5(2) and Theorem 5.2 (cf. Remark 1 and Remark 2) immediately yield our main result here. We say that a collection of positive and negative QA-examples uniquely characterizes a concept expression CC relative to a TBox 𝒯\mathcal{T} if CC fits the examples (relative to 𝒯\mathcal{T}) and every other concept expression that does so is equivalent (relative to 𝒯\mathcal{T}) to CC. By a QA-membership query we mean an oracle query consisting of a QA example, where the oracle answers yes or no depending on whether the input is a positive QA example or a negative QA example for the goal concept, relative to the TBox. It is assumed that the TBox is fixed and known to the learner.

Theorem 7.13.

Let 𝒯=\mathcal{T}=\emptyset. Every \mathcal{ELI} concept expression is uniquely characterizable by a finite collection of positive and negative QA examples (relative to 𝒯\mathcal{T}), which can be computed in polynomial time. Furthermore, the class of \mathcal{ELI} concept expressions is efficiently exactly learnable with QA-membership queries.

Moreover, by Theorem 4.5(3), the uniquely characterizing examples can be constructed so that each example (𝒜,a)(\mathcal{A},a) is the canonical QA-example of a concept expression. By the canonical QA-example of a concept expression CC, here, we mean the QA-example that (viewed as a structure with one distinguished element) is the canonical structure of the not-necessarily-safe CQ qC(x)q_{C}(x).

Theorem 7.13 remains true when the concept language is extended with unrestricted existential quantification (of the form .C\exists.C) and a restricted form of the I-me self-reference construct introduced in (Marx2002narcissists, ), namely where the I operator can only occur once, and in the very front of the concept expression. Indeed, it can be shown that this extended concept language (by a straightforward extension of the standard translation) captures precisely the class of c-acyclic unary not-necessarily-safe CQs over the correspondence schema.

This raises the question if Theorem 7.13 holds true for arbitrary TBoxes. Since publication of the conference version of this paper (tCD2021:conjunctive, ) (in which we asked the same question), some answers have been obtained. In (Funk2021Actively, ), it is shown that the answer to this question is No if the TBox is treated as part of the input to the learning algorithm. Indeed, it is shown that the problem becomes not efficiently exactly learnable with membership and equivalence queries. On the other hand, a positive answer is given in (Funk2021Actively, ) for a weaker version of the question, namely for the description logic \mathcal{EL}, when the learning algorithm is also allowed to ask equivalence queries. In (funk2022:frontiers, ), furthermore, a positive answer is given for another variant of the above question where the TBox is specified in the description logic DL-Lite, and where the learning algorithm is allowed to ask membership and equivalence queries. At the heart of this learning algorithm lies an extension of our frontier construction from Section 3.3, also obtained in (funk2022:frontiers, ).

8. Acknowledgements

This paper largely grew out of discussions at Dagstuhl Seminar 19361 (“Logic and Learning”) in Sept. 2019. Victor Dalmau was supported by MICCIN grants TIN2016-76573-C2-1P and PID2019-109137GB-C22. Balder ten Cate was supported by the European Union’s Horizon 2020 research and innovation programme (MSCA-101031081). We thank Carsten Lutz, Raoul Koudijs, and Phokion Kolaitis for helpful discussions and for pointing out some flaws in earlier versions of this paper.

References

  • (1) Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang-Chiew Tan. Characterizing schema mappings via data examples. ACM Trans. Database Syst., 36(4):23:1–23:48, December 2011. doi:10.1145/2043652.2043656.
  • (2) Dana Angluin. Queries and concept learning. Mach. Learn., 2(4):319–342, April 1988. doi:10.1023/A:1022821128753.
  • (3) William Ward Armstrong. Dependency structures of data base relationships. In IFIP Congress, 1974.
  • (4) Franz Baader, Ian Horrocks, Carsten Lutz, and Ulrike Sattler. An Introduction to Description Logic. Cambridge University Press, 2017.
  • (5) Franz Baader and Werner Nutt. Basic Description Logics, pages 43–95. Cambridge University Press, USA, 2003.
  • (6) Vince Bárány, Georg Gottlob, and Martin Otto. Querying the guarded fragment. Logical Methods in Computer Science, 10(2), 2014. doi:10.2168/LMCS-10(2:3)2014.
  • (7) Pablo Barceló and Miguel Romero. The complexity of reverse engineering problems for conjunctive queries. In Michael Benedikt and Giorgio Orsi, editors, 20th International Conference on Database Theory, ICDT 2017, March 21-24, 2017, Venice, Italy, volume 68 of LIPIcs, pages 7:1–7:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017.
  • (8) Google Blog. A reintroduction to our knowledge graph and knowledge panels, 2020.
  • (9) Angela Bonifati, Radu Ciucanu, and Aurélien Lemay. Learning path queries on graph databases. In Gustavo Alonso, Floris Geerts, Lucian Popa, Pablo Barceló, Jens Teubner, Martín Ugarte, Jan Van den Bussche, and Jan Paredaens, editors, Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015, pages 109–120. OpenProceedings.org, 2015.
  • (10) Angela Bonifati, Radu Ciucanu, and Slawek Staworko. Learning join queries from user examples. ACM Trans. Database Syst., 40(4):24:1–24:38, 2016.
  • (11) Ashok K. Chandra Chandra and Philip M. Merlin. Optimal Implementation of Conjunctive Queries in Relational Data Bases. In ACM Symposium on Theory of Computing (STOC), pages 77–90, 1977.
  • (12) William W. Cohen. Cryptographic limitations on learning one-clause logic programs. In Richard Fikes and Wendy G. Lehnert, editors, Proceedings of the 11th National Conference on Artificial Intelligence. Washington, DC, USA, July 11-15, 1993, pages 80–85. AAAI Press / The MIT Press, 1993.
  • (13) William W. Cohen. Pac-learning nondeterminate clauses. In Barbara Hayes-Roth and Richard E. Korf, editors, Proceedings of the 12th National Conference on Artificial Intelligence, Seattle, WA, USA, July 31 - August 4, 1994, Volume 1, pages 676–681. AAAI Press / The MIT Press, 1994.
  • (14) William W. Cohen. Pac-learning non-recursive prolog clauses. Artif. Intell., 79(1):1–38, 1995.
  • (15) Víctor Dalmau, Phokion G. Kolaitis, and Moshe Y. Vardi. Constraint satisfaction, bounded treewidth, and finite-variable logics. In Pascal Van Hentenryck, editor, Principles and Practice of Constraint Programming - CP 2002, pages 310–326, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.
  • (16) Ronald Fagin. Degrees of acyclicity for hypergraphs and relational database schemes. J. ACM, 30(3):514–550, July 1983. doi:10.1145/2402.322390.
  • (17) Ronald Fagin, Phokion Kolaitis, Alan Nash, and Lucian Popa. Towards a theory of schema-mapping optimization. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 33–42, 01 2008. doi:10.1145/1376916.1376922.
  • (18) Jan Foniok, Jaroslav Nesetril, and Claude Tardif. Generalised dualities and maximal finite antichains in the homomorphism order of relational structures. Eur. J. Comb., 29(4):881–899, 2008.
  • (19) Maurice Funk, Jean Christoph Jung, and Carsten Lutz. Actively learning concepts and conjunctive queries under elr-ontologies. In IJCAI, 2021.
  • (20) Maurice Funk, Jean Christoph Jung, and Carsten Lutz. Frontiers and exact learning of eli queries under dl-lite ontologies. In Lud De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 2627–2633. International Joint Conferences on Artificial Intelligence Organization, 7 2022. Main Track. doi:10.24963/ijcai.2022/364.
  • (21) Georg Gottlob and Pierre Senellart. Schema mapping discovery from data instances. J. ACM, 57(2):6:1–6:37, 2010.
  • (22) David Haussler. Learning conjunctive concepts in structural domains. Mach. Learn., 4:7–40, 1989.
  • (23) Pavol Hell and Jaroslav Nešetřil. The Core of a Graph. Discrete Mathematics, 109:117–126, 1992.
  • (24) Pavol Hell and Jaroslav Nešetřil. Graphs and homomorphisms, volume 28 of Oxford lecture series in mathematics and its applications. Oxford University Press, 2004.
  • (25) Kouichi Hirata. On the hardness of learning acyclic conjunctive queries. In Hiroki Arimura, Sanjay Jain, and Arun Sharma, editors, Algorithmic Learning Theory, 11th International Conference, ALT 2000, Sydney, Australia, December 11-13, 2000, Proceedings, volume 1968 of Lecture Notes in Computer Science, pages 238–251. Springer, 2000.
  • (26) Wojtek Kazana and Luc Segoufin. First-order queries on classes of structures with bounded expansion. Logical Methods in Computer Science, Volume 16, Issue 1, February 2020.
  • (27) Phokion G. Kolaitis. Schema mappings, data exchange, and metadata management. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’05, page 61–75, New York, NY, USA, 2005. Association for Computing Machinery. doi:10.1145/1065167.1065176.
  • (28) Heikki Mannila and Kari-Jouko Räihä. Test data for relational queries. In PODS, pages 217–223, 1986.
  • (29) Maarten Marx. Narcissists, stepmothers and spies. In Ian Horrocks and Sergio Tessaris, editors, Proceedings of the 2002 International Workshop on Description Logics (DL2002), Toulouse, France, April 19-21, 2002, volume 53 of CEUR Workshop Proceedings. CEUR-WS.org, 2002.
  • (30) Jaroslav Nešetřil and Patrice Ossona de Mendez. Grad and classes with bounded expansion iii. restricted graph homomorphism dualities. European Journal of Combinatorics, 29(4):1012 – 1024, 2008. Homomorphisms: Structure and Highlights. doi:https://doi.org/10.1016/j.ejc.2007.11.019.
  • (31) Jaroslav Nešetřil and Patrice Ossona de Mendez. Grad and classes with bounded expansion i. decompositions. European Journal of Combinatorics, 29(3):760 – 776, 2008. doi:https://doi.org/10.1016/j.ejc.2006.07.013.
  • (32) Jaroslav Nešetřil and Patrice Ossona de Mendez. Sparsity (Graphs, Structures, and Algorithms), volume 28. Springer, 01 2012. doi:10.1007/978-3-642-27875-4.
  • (33) Jaroslav Nešetřil and Claude Tardif. Duality theorems for finite structures (characterising gaps and good characterisations). Journal of Combinatorial Theory, Series B, 80(1):80 – 97, 2000. doi:https://doi.org/10.1006/jctb.2000.1970.
  • (34) Jaroslav Nešetřil and Claude Tardif. Short answers to exponentially long questions: Extremal aspects of homomorphism duality. SIAM J. Discret. Math., 19(4):914–920, August 2005. doi:10.1137/S0895480104445630.
  • (35) Ana Ozaki. Learning description logic ontologies: Five approaches. where do they stand? KI - Künstliche Intelligenz, 04 2020. doi:10.1007/s13218-020-00656-9.
  • (36) Slawek Staworko and Piotr Wieczorek. Learning twig and path queries. In Alin Deutsch, editor, 15th International Conference on Database Theory, ICDT ’12, Berlin, Germany, March 26-29, 2012, pages 140–154. ACM, 2012.
  • (37) Slawomir Staworko and Piotr Wieczorek. Characterizing XML Twig Queries with Examples. In International Conference on Database Theory, International Conference on Database Theory, Brussels, Belgium, March 2015. URL: https://hal.inria.fr/hal-01205417.
  • (38) Balder ten Cate, Laura Chiticariu, Phokion Kolaitis, and Wang-Chiew Tan. Laconic schema mappings: Computing the core with sql queries. Proc. VLDB Endow., 2(1):1006–1017, August 2009. doi:10.14778/1687627.1687741.
  • (39) Balder ten Cate and Víctor Dalmau. The product homomorphism problem and applications. In Marcelo Arenas and Martín Ugarte, editors, 18th International Conference on Database Theory, ICDT 2015, March 23-27, 2015, Brussels, Belgium, volume 31 of LIPIcs, pages 161–176. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015.
  • (40) Balder ten Cate and Victor Dalmau. Conjunctive queries: Unique characterizations and exact learnability. In Proc. of ICDT. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
  • (41) Balder ten Cate, Víctor Dalmau, and Phokion G. Kolaitis. Learning schema mappings. ACM Trans. Database Syst., 38(4):28:1–28:31, 2013. doi:10.1145/2539032.2539035.
  • (42) Balder ten Cate, Phokion G. Kolaitis, Kun Qian, and Wang-Chiew Tan. Active learning of GAV schema mappings. In Jan Van den Bussche and Marcelo Arenas, editors, Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 355–368. ACM, 2018. doi:10.1145/3196959.3196974.
  • (43) Yaacov Y. Weiss and Sara Cohen. Reverse engineering spj-queries from examples. In Emanuel Sallinger, Jan Van den Bussche, and Floris Geerts, editors, Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017, pages 151–166. ACM, 2017.
  • (44) Ross Willard. Testing expressibility is hard. In David Cohen, editor, Principles and Practice of Constraint Programming - CP 2010 - 16th International Conference, CP 2010, St. Andrews, Scotland, UK, September 6-10, 2010. Proceedings, volume 6308 of Lecture Notes in Computer Science, pages 9–23. Springer, 2010.