Conjunctive Queries: Unique Characterizations and Exact Learnability

Balder ten Cate https://orcid.org/0000-0002-2538-5846 Universiteit van AmsterdamILLCAmsterdamThe Netherlands [email protected] and Victor Dalmau https://orcid.org/0000-0002-9365-7372 Universitat Pompeu FabraDepartment of Information and Communication TechnologiesBarcelonaSpain [email protected]

Abstract.

We answer the question of which conjunctive queries are uniquely characterized by polynomially many positive and negative examples, and how to construct such examples efficiently. As a consequence, we obtain a new efficient exact learning algorithm for a class of conjunctive queries. At the core of our contributions lie two new polynomial-time algorithms for constructing frontiers in the homomorphism lattice of finite structures. We also discuss implications for the unique characterizability and learnability of schema mappings and of description logic concepts.

Conjunctive Queries, Homomorphisms, Frontiers, Unique Characterizations, Exact Learnability, Schema Mappings, Description Logic

^†^†journal: TODS^†^†ccs: Theory of computation Machine learning theory^†^†ccs: Theory of computation Logic^†^†ccs: Information systems Query languages

1. Introduction

Conjunctive queries (CQs) are an extensively studied database query language and fragment of first-order logic. They correspond precisely to Datalog programs with a single non-recursive rule. In this paper, we study two problems related to CQs. The first problem is concerned with the existence and constructability of unique characterizations. For which CQs $q$ is it the case that $q$ can be characterized (up to logical equivalence) by its behavior on a small set of data examples? And, when such a set of data examples exists, can it be constructed efficiently? The second problem pertains to exact learnability of CQs in an interactive setting where the learner has access to a “membership oracle” that, given any database instance and a tuple of values, answers whether the tuple belongs to the answer of the goal CQ (that is, the hidden CQ that the learner is trying to learn). We can think of the membership oracle as a black-box, compiled version of the goal query, which the learner can execute on any number of examples. The task of the learner, then, is to reverse engineer the query based on the observed behavior.

Figure 1. A frontier in the homomorphism lattice of structures

Note that these two problems (unique characterizability and exact learnability) are closely related to each other: a learner can identify the goal query with certainty, only when the set of examples that it has seen so far constitutes a unique characterization of the goal query. In other words, unique characterizability (by a polynomially large set of examples) is necessary, but not sufficient, for (polynomial-time) exact learnability with membership queries.

Motivating Example 1.

This example, although stylized and described at a high level, aims to convey a use case that motivated the present work. The Google Knowledge Graph is a large database of entities and facts, gathered from a variety of sources. It is used to enhance the search engine’s results for queries such as “where was Barack Obama born” with factual information in the form of knowledge panels (KGBlog2020, ). When a query triggers a specific knowledge panel, this may be the result of different triggering and fulfillment mechanisms, each of which may involve a combination of structured queries to the knowledge graph, hard-coded business logic (in a Turing-complete language), and machine learned models. This makes it difficult to understand interactions between knowledge panels (e.g., whether the two knowledge panels are equivalent or one is subsumed by the other, in terms of content and triggering). If a declarative specification of (an approximation of) the triggering and fulfillment logic for a knowledge panel can be constructed programmatically, specified in a sufficiently restrictive formalism such as Datalog rules, this provides an avenue to the above, and other relevant static analysis tasks. The efficient exact learnability with membership queries that we study in this paper, can be viewed as an idealized form of such a programmatic approach, where the membership oracle is the existing, black box, implementation of the knowledge panel, and the learning algorithm aims to produce a CQ that exactly captures it.

The above example provides a motivation for studying efficient exact learnability of CQs, and hence, for studying unique characterizability. However, we would like to emphasize that unique characterizations are of independent interest, outside the context of exact learning algorithms. Indeed, uniquely characterizing examples can be used, for instance, for elementary query engine debugging, and query visualization and explanation.

As it turns out, the above problems about CQs are intimately linked to fundamental properties of the homomorphism lattice of finite structures. In particular, the existence of a unique characterization for a CQ can be reduced to the existence of a frontier in the homomorphism lattice for an associated structure $A$ , where, by a “frontier” for $A$ , we mean a finite set of structures $F_{1},\ldots,F_{n}$ that cover precisely the set of structures homomorphically strictly weaker than $A$ , that is, such that $\{B\mid B\to A\text{ and }A\not\to B\}=\bigcup_{i}\{B\mid B\to F_{i}\}$ (cf. Figure 1).

Known results (FoniokNT08, ; AlexeCKT2011, ) imply that not every finite structure has such a frontier, and, moreover, a finite structure has a frontier if and only if the structure (modulo homomorphic equivalence) satisfies a structural property called c-acyclicity. These known results, however, are based on exponential constructions, and no polynomial algorithms for constructing frontiers were previously known.

Main Contribution 1 (Polynomial-time algorithms for constructing frontiers).

We show that, for c-acyclic structures, a frontier can in fact be computed in polynomial time. More specifically, we present two polynomial-time algorithms. The first algorithm takes any c-acyclic structure and produces a frontier consisting of structures that are themselves not necessarily c-acyclic (Sect. 3.2). The second algorithm applies to a more restricted class of acyclic structures but yields a frontier consisting entirely of structures belonging to the same class, that is, the class of structures in question is frontier-closed (Sect. 3.3).

We use these to obtain new results on the existence and efficient constructability of unique characterizations for CQs:

Main Contribution 2 (Polynomial Unique Characterizations for Conjunctive Queries).

We show that a CQ is uniquely characterizable by polynomially many examples, precisely if (modulo logical equivalence) it is c-acyclic. Furthermore, for c-acyclic CQs, a uniquely characterizing set of examples can be constructed in polynomial time. In the special case of acyclic and c-connected CQs, a uniquely characterizing set of examples can be constructed consisting entirely of queries from the same class (Sect. 4).

Using the above results as a stepping stone, we obtain a polynomial-time exact learning algorithm for the class of c-acyclic CQs.

Main Contribution 3 (Polynomial-Time Learnability with Membership Queries).

We show that c-acyclic CQs are efficiently exactly learnable in Angluin’s model of exact learnability with membership queries (Angluin88, ) (Sect. 5).

The restriction to c-acyclic CQs in this learnability result is natural, given that, as we mentioned above, exact learnability with membership queries requires the existence of a finite uniquely characterizing set of examples. Note however, that our results do not preclude the possibility that there exist larger efficiently exactly learnable classes of CQs: even if a class $C$ includes non-c-acyclic CQs, it may still be possible for every CQ $q\in C$ to be uniquely characterizable within the class $C$ .

We mainly focus on positive and negative examples in this paper. Another natural type of data example is a pair $(I,R)$ where $I$ is an input instance and $R$ is the entire relation that is computed by the query on $I$ . We discuss this in Section 6, where we point out that all our results on characterizability and learnability remain true also when considering such data examples.

Finally, although our primary interest is in conjunctive queries, we show that our results also have implications for schema mappings and description logic concepts:

Main Contribution 4 (Schema Mappings and Description Logic Concepts).

As a further corollary to the above, in Sect. 7, we obtain a number of results regarding the existence of polynomial unique characterizations, as well as exact learnability, for LAV (“Local-As-View”) schema mappings and for description logic concepts for the lightweight description logic $\mathcal{ELI}$ (Baader2017:introduction, ).

Related Work

Unique characterizations for CQs were first studied in (MannilaR86, ) in the context of automatic test data generation. More precisely, the authors propose the concept of an “adequate test case”, which is a database instance that can be used to distinguish a given CQ from all other, non-equivalent CQs from a given class. In our terminology, this corresponds to a uniquely characterizing input-output example (cf. Section 6). A positive result was obtained in (MannilaR86, ) for restricted classes of self-join-free CQs, by establishing a relationship to Armstrong databases (Armstrong1974, ). In (AlexeCKT2011, ), the authors study unique characterizations for various classes of schema mappings; we will make use of some of the technical results from (AlexeCKT2011, ), and in Section 7, we will discuss an application of our results to LAV schema mappings. In (Staworko15:characterizing, ), the authors study unique characterizability for XML twig queries.

Related work on learning CQs will be discussed in Section 5.

An earlier, extended-abstract version of this paper was published in (tCD2021:conjunctive, ). The present paper extends this conference version with additional results. In particular, Theorem 3.12 was shown in (tCD2021:conjunctive, ) only for the case of $k=1$ (i.e., for unary CQs), and is generalized here to all $k\geq 1$ . Furthermore, the treatment of input-output examples in Section 6 has been added.

Outline

Section 2 reviews basic facts and definitions. In Section 3, we present our two new polynomial-time algorithms for constructing frontiers for finite structures with distinguished elements. We also review a result by (NesetrilOssona2008, ), which implies the existence of (not necessarily polynomially computable) frontiers w.r.t. classes of structures of bounded expansion. In Section 4, we apply these algorithms to show that a CQ is uniquely characterizable by polynomially many examples, precisely if (modulo logical equivalence) it is c-acyclic. Furthermore, for c-acyclic CQs, a uniquely characterizing set of examples can be constructed in polynomial time. In the special case of unary, acyclic, connected CQs, a uniquely characterizing set of examples can be constructed consisting entirely of queries from the same class. In Section 5, we further build on these results, and we study the exact learnability of CQs. In Section 6, we consider another type of data examples, namely input-output examples. Section 7, finally, presents applications to schema mappings and description logic concepts.

2. Preliminaries

Schemas, Structures, Homomorphisms, Cores

A schema (or, relational signature) is a finite set of relation symbols $\mathcal{S}=\{R_{1},\ldots,R_{n}\}$ , where each relation $R_{i}$ as an associated arity $\text{arity}(R_{i})\geq 1$ . For $k\geq 0$ , by a structure over $\mathcal{S}$ with $k$ distinguished elements we will mean a tuple $(A,a_{1},\ldots,a_{k})$ , where $A=(\text{dom}(A),R^{A}_{1},\ldots,R^{A}_{n})$ is a finite structure (in the traditional, model-theoretic sense) over the schema $\mathcal{S}$ , and $a_{1},\ldots,a_{k}$ are elements of the domain of $A$ . Note that all structures, in this paper, are assumed to be finite, and we will drop the adjective “finite”. By a fact of a structure $A$ we mean an expression of the form $R(a_{1},\ldots,a_{n})$ where the tuple $(a_{1},\ldots,a_{n})$ belongs to the relation $R$ in $A$ . Given two structures $(A,\textbf{a})$ and $(B,\textbf{b})$ over the same schema, where $\textbf{a}=a_{1},\ldots a_{k}$ and $\textbf{b}=b_{1},\ldots,b_{k}$ , a homomorphism $h:(A,\textbf{a})\to(B,\textbf{b})$ is a map $h$ from the domain of $A$ to the domain of $B$ , such that $h$ preserves all facts (i.e., for each fact $R(a_{1},\ldots,a_{n})$ of $A$ , $R(h(a_{1}),\ldots,h(a_{n}))$ is a fact of $B$ ), and such that $h(a_{i})=b_{i}$ for $i=1\ldots k$ . When such a homomorphism exists, we will also say that $(A,\textbf{a})$ “homomorphically maps to” $(B,\textbf{b})$ and we will write $(A,\textbf{a})\to(B,\textbf{b})$ . We say that $(A,\textbf{a})$ and $(B,\textbf{b})$ are homomorphically equivalent if $(A,\textbf{a})\to(B,\textbf{b})$ and $(B,\textbf{b})\to(A,\textbf{a})$ . We occasionally write $A\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}B$ to say that $A\to B$ and $B\not\to A$ .

A structure is said to be a core if there is no homomorphism from the structure in question to a proper substructure (HN92, ). It is known (HN92, ) that every structure $(A,\textbf{a})$ has a substructure to which it is homomorphically equivalent and that is a core. This substructure, moreover, is unique up to isomorphism, and it is known as the core of $(A,\textbf{a})$ .

We will make use of the following technical lemma in several places later on in this paper:

Lemma 2.1.

Let $(A,\textbf{a})$ be a core structure, and let $h:(B,\textbf{b})\to(A,\textbf{a})$ . If there is a homomorphism $h^{\prime}:(A,\textbf{a})\to(B,\textbf{b})$ , then $h^{\prime}$ must be injective, and moreover, in this case, there is such $h^{\prime}$ with the additional property that the composition of $h$ with $h^{\prime}$ is the identity map on $A$ .

Proof.

The composition of $h$ with $h^{\prime}$ is an endomorphism of $(A,\textbf{a})$ (that is, a homomorphism from the structure to itself). It is a well-known property of cores that every endomorphism is an automorphism (that is, an isomorphism from the structure to itself). Therefore, $h^{\prime}$ must be injective. Furthermore, by composing $h^{\prime}$ with the inverse of the automorphism, we ensure that its composition with $h$ is the identity function. ∎

Fact Graph, FG-Connectedness, FG-Disjoint Union

The fact graph of a structure $(A,\textbf{a})$ is the undirected graph whose nodes are the facts of $A$ , and such that there is an edge between two distinct facts if they share a non-distinguished element, i.e., there exists an element $b$ of the domain of $A$ that is distinct from the distinguished elements a, such that $b$ occurs in both facts. We say that $(A,\textbf{a})$ is fg-connected if the fact graph is connected. A fg-connected component of $(A,\textbf{a})$ is a maximal fg-connected substructure $(A^{\prime},\textbf{a})$ of $(A,\textbf{a})$ . If $(A_{1},\textbf{a})$ and $(A_{2},\textbf{a})$ are structures with the same distinguished elements, and whose domains are otherwise (except for these distinguished elements) disjoint, then the union $(A_{1}\cup A_{2},\textbf{a})$ of these two structures will be called a fg-disjoint union and will be denoted as $(A_{1},\textbf{a})\uplus(A_{2},\textbf{a})$ . The same construction naturally extends to finite sets of structures. It is easy to see that every structure $(A,\textbf{a})$ is equal to the fg-disjoint union of its fg-connected components. See also (Fagin2008towards, ; tencate2009laconic, ), where fg-connected components are called fact blocks.

Direct Product, Homomorphism Lattice

Given two structures $(A,\textbf{a})$ and $(B,\textbf{b})$ over the same schema, where $\textbf{a}=a_{1},\ldots a_{k}$ and $\textbf{b}=b_{1},\ldots,b_{k}$ , the direct product $(A,\textbf{a})\times(B,\textbf{b})$ is defined, as usual, as $(A\times B,\langle a_{1},b_{1}\rangle,\ldots,\langle a_{k},b_{k}\rangle)$ , where the domain of $A\times B$ is the Cartesian product of the domains of $A$ and $B$ , and where the facts of $A\times B$ are all facts $R(\langle c_{1},d_{1}\rangle,\ldots,\langle c_{n},d_{n}\rangle)$ for which it holds that $R(c_{1},\ldots,c_{n})$ is a fact of $A$ and $R(d_{1},\ldots,d_{n})$ is a fact of $B$ . The direct product of a finite collection of structures is defined analogously.

For a fixed schema $\mathcal{S}$ and $k\geq 0$ , the collection of homomorphic-equivalence classes of structures over $\mathcal{S}$ with $k$ distinguished elements, ordered by homomorphism, forms a lattice. Specifically, the above direct product operation is a meet operation in the lattice-theoretic sense: $(A,\textbf{a})\times(B,\textbf{b})$ homomorphically maps to both $(A,\textbf{a})$ and $(B,\textbf{b})$ , and a structure $(C,\textbf{c})$ homomorphically maps to $(A,\textbf{a})\times(B,\textbf{b})$ if and only if it homomorphically maps to both $(A,\textbf{a})$ and $(B,\textbf{b})$ . The join operation of the lattice is a little more tedious to define, and we only sketch it here, as it is not used in the remainder of the paper. For a structure $(A,\textbf{a})$ with $\textbf{a}=a_{1},\ldots,a_{k}$ , by the isomorphism type of the distinguished elements we will mean the equivalence relation over $\{1,\ldots,k\}$ induced by the tuple $a_{1},\ldots,a_{k}$ . When two structures have the same isomorphism type of distinguished elements, their join is simply the fg-disjoint union as defined earlier. In the general case, one must first compute the smallest equivalence relation over $\{1,\ldots,k\}$ that refines the isomorphism type of distinguished elements of both structures, and factor both structures through this equivalence relation, before taking their fg-disjoint union.

For structures without distinguished elements, this lattice has been studied extensively (cf. for instance (HellNesetril2004, ; nesetril2012sparsity, )). The above exposition shows how to lift some of the fundamental constructions to structures with distinguished elements. As we will see, it will be important in much of this paper to consider structures with distinguished elements, as these distinguished elements, intuitively, correspond to the free variables of a CQ.

Incidence Graph, Acyclicity, C-Acyclicity

Given a structure $(A,\textbf{a})$ , the incidence graph of $A$ is the bipartite multi-graph containing all elements of the domain of $A$ as well as all facts of $A$ , and an edge $(a,f)$ whenever $a$ is an element and $f$ is a fact in which $a$ occurs. Whenever an element $a$ occurs more than once in the same fact $f$ , the incidence graph contains a distinct edge for every occurrence of $a$ in $f$ . We will call a structure $(A,\textbf{a})$ acyclic (also known as Berge-acyclic (Fagin83:acyclicity, )) if the incidence graph of $A$ is acyclic; $(A,\textbf{a})$ is said to be c-acyclic if every cycle in its incidence graph contains at least one distinguished element, i.e., at least one element in a. In particular, acyclicity implies that no element occurs twice in the same fact, and c-acyclicity implies that no non-distinguished element occurs twice in the same fact. In the case without distinguished elements, c-acyclicity is equivalent to acyclicity. The concept of c-acyclicity was first introduced in (AlexeCKT2011, ) in the study of unique characterizability of GAV schema mappings (cf. Section 7 for more details). A straightforward dynamic-programming argument shows (Dalmau02:constraint, ):

Proposition 2.2.

For c-acyclic structures $(A,\textbf{a})$ and $(B,\textbf{b})$ (over the same schema and with the same number of distinguished elements), we can test in polynomial time whether $(A,\textbf{a})\to(B,\textbf{b})$ . The core of a c-acyclic structure can be computed in polynomial time.

C-Connectedness

We say that a structure $(A,\textbf{a})$ is c-connected if every connected component of its incidence graph contains at least one distinguished element. Note that this condition is only meaningful for structures with at least one distinguished element, and that it differs subtly from the condition of fg-connectedness we defined above. For example, the structure consisting of the facts $R(a_{1},a_{2})$ and $S(a_{2},a_{1})$ with distinguished elements $a_{1},a_{2}$ , is c-connected but is not fg-connected. For any structure $(A,\textbf{a})$ , we denote by $(A,\textbf{a})^{\textrm{reach}}$ the (unique) maximal c-connected substructure, that is, the substructure containing everything reachable from the distinguished elements.

Proposition 2.3.

If $(A,\textbf{a})$ is c-connected, then $(A,\textbf{a})\to(B,\textbf{b})^{\textrm{reach}}$ iff $(A,\textbf{a})\to(B,\textbf{b})$ .

Conjunctive Queries

Let $k\geq 0$ . A $k$ -ary conjunctive query (CQ) $q$ over a schema $\mathcal{S}$ is an expression of the form $q(\textbf{x})\text{ :- }\alpha_{1}\land\cdots\land\alpha_{n}$ where $\textbf{x}=x_{1},\ldots,x_{k}$ is a sequence of variables, and where each $\alpha_{i}$ is an atomic formula using a relation from $\mathcal{S}$ and using variables as arguments only. Note that $\alpha_{i}$ may use variables from x as well as other variables. In addition, it is required that each variable in x occurs in at least one conjunct $\alpha_{i}$ . This requirement is referred to as the safety condition.

Note that, for simplicity, this definition of CQ does not allow the use of constants. Many of the results in this paper, however, can be extended in a straightforward way to CQs with a fixed finite number of constants (which can be simulated using additional free variables).

If $A$ is a structure over the same schema as $q$ , we denote by $q(A)$ the set of all $k$ -tuples of values that satisfy the query $q$ in $A$ . We write $q\subseteq q^{\prime}$ if $q$ and $q^{\prime}$ are queries over the same schema, and of the same arity, and $q(A)\subseteq q^{\prime}(A)$ holds for all structures $A$ . We say that $q$ and $q^{\prime}$ are logically equivalent if $q\subseteq q^{\prime}$ and $q^{\prime}\subseteq q$ both hold. We refer to any textbook on database theory for a more detailed exposition of the semantics of CQs, and we will restrict ourselves to giving an equivalent presentation of the semantics of CQs through canonical structures and the Chandra-Merlin theorem.

There is a well-known correspondence between $k$ -ary CQs over a schema $\mathcal{S}$ and structures over $\mathcal{S}$ with $k$ distinguished elements. In one direction, we can associate to each $k$ -ary CQ $q(\textbf{x})$ over the schema $\mathcal{S}$ a corresponding structure over $\mathcal{S}$ with $k$ distinguished elements, namely $\widehat{q}=(A_{q},\textbf{x})$ , where the domain of $A_{q}$ is the set of variables occurring in $q$ , and the facts of $A_{q}$ are the conjuncts of $q$ . We will call this structure $\widehat{q}$ the canonical structure of $q$ . Note that every distinguished element of $\widehat{q}$ occurs in at least one fact, as follows from the safety condition of CQs. Conversely, consider any structure $(A,\textbf{a})$ , with $\textbf{a}=a_{1},\ldots,a_{k}$ , such that every distinguished element $a_{i}$ occurs in at least one fact of $A$ . We can associate to $(A,\textbf{a})$ a $k$ -ary canonical CQ, namely the CQ that has a variable $x_{a}$ for every value $a$ in the domain of $A$ occurring in at least one fact, and a conjunct for every fact of $A$ .

By the classic Chandra-Merlin Theorem (CM77, ), a tuple a belongs to $q(A)$ if and only if there is a homomorphism from $\widehat{q}$ to $(A,\textbf{a})$ ; and $q\subseteq q^{\prime}$ holds if and only if there is a homomorphism from $\widehat{q^{\prime}}$ to $\widehat{q}$ . Finally, $q$ and $q^{\prime}$ are logically equivalent if and only if $\widehat{q}$ and $\widehat{q^{\prime}}$ are homomorphically equivalent.

Exact Learning Models, Conjunctive Queries as a Concept Class

Informally, an exact learning algorithm is an algorithm that identifies an unknown goal concept by asking a number of queries about it. The queries are answered by an oracle that has access to the goal concept. This model of learning was introduced by Dana Angluin, cf. (Angluin88, ). In this paper, we consider the two most extensively studied kinds of oracle queries: membership queries and equivalence queries. We will first review basic notions from computational learning theory, such as the notion of a concept, and then explain what it means for a concept class to be efficiently exactly learnable with membership and/or equivalence queries.

Let $X$ be a (possibly infinite) set of examples. A concept over $X$ is a function $c:X\to\{0,1\}$ , and a concept class ${\mathcal{C}}$ is a collection of such concepts. We say that $x\in X$ is a positive example for a concept $c$ if $c(x)=1$ , and that $x$ is a negative example for $c$ if $c(x)=0$ .

Conjunctive queries (over a fixed schema $\mathcal{S}$ and with a fixed arity $k$ ) are a particular example of such a concept class, where the example space is the class of all structures over $\mathcal{S}$ with $k$ distinct elements, and where an example $(A,\textbf{a})$ is labeled as positive if the tuple a belongs to $q(A)$ , and negative otherwise.

It is always assumed that concepts are specified using some representation system so that one can speak of the length of the specification of a concept. More formally, a representation system for $\mathcal{C}$ is a string language $\mathcal{L}$ over some finite alphabet, together with a surjective function $r:\mathcal{L}\to\mathcal{C}$ . By the size of a concept $c\in\mathcal{C}$ , we will mean the length of the smallest representation. Similarly, we assume a representation system, with a corresponding notion of length, for the examples in $X$ . When there is no risk of confusion, we may conflate concepts (and examples) with their representations.

Specifically, for us, when it comes to structures, any natural choice of representation will do; we only assume that the length of the specification of a structure (for a fixed schema) is polynomial in the domain size, the number of facts and the number of distinguished elements. Likewise for CQs, we assume that the length of the representation of a CQ is polynomial in that of its canonical structure.

For every concept $c$ , we denote by $\text{\rm MEM}_{c}$ the membership oracle for $c$ , that is, the oracle that takes as input an example $x$ and returns its label, $c(x)$ , according to $c$ . Similarly, for every concept $c\in\mathcal{C}$ , we denote by $\text{\rm EQ}_{c}$ , the equivalence oracle for $c$ , that is, the oracle that takes as input the representation of a concept $h$ and returns “yes”, if $h=c$ , or returns a counterexample $x$ otherwise (that is, an example $x$ such that $h(x)\neq c(x)$ ). An exact learning algorithm with membership and/or equivalence queries for a concept class $C$ is an algorithm alg that takes no input but has access to the membership oracle $\text{\rm MEM}_{c}$ and/or equivalence oracle $\text{\rm EQ}_{c}$ for some concept $c\in C$ , which will be called the goal concept. Importantly, while the algoritm may interact with the oracle(s), it does not know which concept $c\in C$ is the goal concept. ¹¹1It is common in the learning theory literature to assume that the learning algorithm is given an upper bound on the size of the goal concept as input. However, it turns out that such an assumption is not needed for any of our positive results concerning learnability. Intuitively, the algorithm alg must determine $c$ by asking oracle queries. More precisely, for every choice of $c\in C$ , alg must terminate after a finite amount of time, and output (some representation of) the goal concept $c$ . This notion was introduced by Angluin (Angluin88, ), who also introduced the notion of a polynomial-time exact learning algorithm. We say that an exact learning algorithm alg with membership and/or equivalence queries runs in polynomial time if there exists a two-variable polynomial $p(n,m)$ such that at any point during the run of the algorithm, the time used by alg up to that point (counting one step per oracle call) is bounded by $p(n,m)$ , where $n$ is the size of the goal concept and $m$ the size of the largest counterexample returned by calls to the equivalence oracle up to that point in the run ( $m=0$ if no equivalence queries have been used). A concept class ${\mathcal{C}}$ is efficiently exactly learnable with membership and/or equivalence queries if there is an exact learning algorithm with membership and/or equivalence queries for $\mathcal{C}$ that runs in polynomial time.

There is a delicate issue about this notion of polynomial time that we now discuss. One might be tempted to relax the previous definition by requiring merely that the total running time is bounded by $p(n,m)$ . However, this change in the definition would give rise to a wrong notion of a polynomial-time algorithms in this context by way of a loophole in the definition. Indeed, under this change, one could design a learning algorithm that, in a first stage, identifies the goal hypothesis by (expensive) exhaustive search and that, once this is achieved, forces —by asking equivalence queries with hypotheses that are appropriate modifications of the goal concept— the equivalence oracle to return large counterexamples that would make up for the time spent during the exhaustive search phase.

3. Frontiers in the homomorphism lattice of structures

In this section, we define frontiers, as well as the relation notions of gap pairs and (restricted) homomorphism dualities, and we will discuss their relationships. We present two polynomial-time methods for constructing frontiers.

For the applications in the next sections, it is important to consider structures with distinguished elements. These distinguished elements, intuitively, correspond to the free variables of a CQ. Specifically, Proposition 4.2 in Section 4 will link unique characterizations for $k$ -ary CQs to frontiers for structures with $k$ distinguished elements. For this reason, all the results in this section are stated for structures with distinguished elements.

Definition 3.1.

Fix a schema and $k\geq 0$ , and let $\mathcal{C}$ be a class of structures with $k$ distinguished elements and let $(A,\textbf{a})$ be a structure with $k$ distinguished elements. A frontier for $(A,\textbf{a})$ w.r.t. $\mathcal{C}$ , is a finite set of structures $F$ such that

(1)

$(B,\textbf{b})\to(A,\textbf{a})$ for all $(B,\textbf{b})\in F$ .
(2)

$(A,\textbf{a})\not\to(B,\textbf{b})$ for all $(B,\textbf{b})\in F$ .
(3)

For all $(C,\textbf{c})\in\mathcal{C}$ with $(C,\textbf{c})\to(A,\textbf{a})$ and $(A,\textbf{a})\not\to(C,\textbf{c})$ , we have that $(C,\textbf{c})\to(B,\textbf{b})$ for some $(B,\textbf{b})\in F$ .

See Figure 1 for a graphical depiction of a frontier.

The notion of a frontier is closely related to that of a gap pair. While gap pairs will not play an important role, we will explain the relationship here to provide context. A pair of structures $(B,A)$ with $B\to A$ is said to be a gap pair if $A\not\to B$ , and every structure $C$ satisfying $B\to C$ and $C\to A$ is homomorphically equivalent to either $B$ or $A$ (NesetrilTardif2000, ). The same concept applies to structures with distinguished elements. It is easy to see that any frontier for a structure $A$ must contain (modulo homomorphic equivalence) all structures $B$ such that $(B,A)$ is a gap pair.

Example 3.2.

Let $\mathcal{S}=\{R,P,Q\}$ . The structure $(A,a_{1})$ consisting of facts $P(a_{1})$ and $Q(a_{1})$ (with distinguished element $a_{1}$ ) has a frontier of size 2 (w.r.t. the class of all finite structures), namely $F=\{(B,a_{1}),(C,a_{1})\}$ where $B$ consists of the facts $P(a_{1}),P(b),Q(b)$ and $C$ consists of the facts $Q(a_{1}),P(b),Q(b)$ , respectively. Note that $((B,a_{1}),(A,a_{1}))$ and $((C,a_{1}),(A,a_{1}))$ are gap pairs. It can be shown that the structure $(A,a_{1})$ has no frontier of size 1 (as such a frontier would have to consist of a structure that contains both facts $P(a_{1})$ and $Q(a_{1})$ ).

For another example, consider the structure $(A^{\prime},a_{1})$ consisting of facts $P(a_{1})$ and $R(b,b)$ . It is the right hand side of a gap pair (the left hand side being the structure $(B^{\prime},a_{1})$ consisting of the facts $R(b,b)$ and $P(b^{\prime})$ ), but $(A^{\prime},a_{1})$ has no frontier as follows from Theorem 3.7 below.

Frontiers are also closely related to (generalized) homomorphism dualities (FoniokNT08, ), and we will be making use of results about homomorphism dualities. We say that a structure $(A,\textbf{a})$ has a finite duality w.r.t. a class $\mathcal{C}$ if there is a finite set of structures $D$ such that for all $(C,\textbf{c})\in\mathcal{C}$ , $(A,\textbf{a})\to(C,\textbf{c})$ iff for all $(B,\textbf{b})\in D$ , $(C,\textbf{c})\not\to(B,\textbf{b})$ . The set $D$ may contain structures that are not in $\mathcal{C}$ . If ${\mathcal{C}}$ is the set of all structures (over the same schema as $(A,\textbf{a})$ ), we simply say that $(A,\textbf{a})$ has a finite duality.²²2We note here that in the literature on Constraint Satisfaction, it is usual to consider the ’other side’ of the duality, i.e, a structure $A$ is said to have finite duality if there exists a finite set of structures $F$ such that for every structure $C$ , $C\to A$ iff for all $B\in F$ , $B\not\to C$ .

Example 3.3.

In the realm of digraphs, viewed as relational structures without distinguished elements with a single binary relation, every directed path $A$ of, say, $k>1$ nodes has finite duality (w.r.t. the class of digraphs). Indeed, it is not difficult to verify that for every digraph $C$ , $A\to C$ iff $C\not\to D$ where $D$ is the digraph with nodes $\{1,\dots k-1\}$ and edges $\{(i,j)\mid i<j\}$ . (This example is known as the Gallai-Hasse-Roy-Vitaver Theorem. Amusingly, this result was obtained and published independently by all these four researchers, each in a different language, in the 1960s).

The next lemma is a minor variation of a result from (NesetrilTardif2000, ).

Lemma 3.4.

Let $\mathcal{C}$ be any class of structures.

(1)

If a structure $(A,\textbf{a})$ has a finite duality w.r.t. $\mathcal{C}$ then $(A,\textbf{a})$ has a frontier w.r.t. $\mathcal{C}$ .
(2)

If a structure $(A,\textbf{a})\in\mathcal{C}$ has a frontier w.r.t. $\mathcal{C}$ and $\mathcal{C}$ is closed under direct products, then $(A,\textbf{a})$ has a finite duality w.r.t. $\mathcal{C}$ .

Proof.

1. Let $D$ be a finite set of structures that forms a duality for $(A,\textbf{a})$ . Then $\{(A,\textbf{a})\times(B,\textbf{b})\mid(B,\textbf{b})\in D\}$ is a frontier for $(A,\textbf{a})$ . This follows immediately from the fact that, for all $(C,\textbf{c})$ with $(C,\textbf{c})\to(A,\textbf{a})$ , we have that $(C,\textbf{c})\to(A,\textbf{a})\times(B,\textbf{b})$ if and only if $(C,\textbf{c})\to(B,\textbf{b})$ .

2. We use a construction from (NesetrilTardif2000, ) involving an exponentiation operation on structures. Let $B$ and $C$ be structures (without distinguished elements) over the same schema. Then we denote by $B^{C}$ the structure where

•

the domain of $B^{C}$ is the set of all functions from the domain of $C$ to the domain of $B$
•

a fact $R(f_{1},\ldots,f_{n})$ belongs to $B^{C}$ if for every fact of the form $R(a_{1},\ldots,a_{n})\in C$ , the fact $R(f_{1}(a_{1}),\ldots,f_{n}(a_{n}))$ belongs to $B$ .

This construction is characterized by the property that, for all structures $D$ , $D\to B^{C}$ if and only if $D\times C\to B$ (HellNesetril2004, ).

Let $F$ be a frontier for $(A,\textbf{a})$ . Let

D=\{(B^{A},\textbf{h})\mid\text{$(B,\textbf{b})\in F$ and $\textbf{h}$ is a tuple of functions such that $h_{i}(a_{i})=b_{i}$}\}

We claim that $D$ forms a finite duality for $(A,\textbf{a})$ . Consider any structure $(C,\textbf{c})\in\mathcal{C}$ . We must show that $(C,\textbf{c})$ homomorphically maps to a structure in $D$ iff $(A,\textbf{a})\not\to(C,\textbf{c})$ .

Suppose that there is a homomorphism $h:(C,\textbf{c})\to(B^{A},\textbf{h})$ for some $(B^{A},\textbf{h})\in D$ . Then $\tilde{h}:(A,\textbf{a})\times(C,\textbf{c})\to(B,\textbf{b})$ where $\tilde{h}(\langle a,c\rangle)=h(c)(a)$ . Note that, indeed, $\tilde{h}(\langle a_{i},c_{i}\rangle)=b_{i}$ . Since $(B,\textbf{b})\in F$ and $F$ is a frontier, it follows that $(A,\textbf{a})\not\to(A,\textbf{a})\times(C,\textbf{c})$ and therefore, by the properties of direct products, $(A,\textbf{a})\not\to(C,\textbf{c})$ .

Conversely, suppose $(A,\textbf{a})\not\to(C,\textbf{c})$ . Then $(A,\textbf{a})\not\to(A,\textbf{a})\times(C,\textbf{c})$ . Hence, there is a homomorphism $h:(A,\textbf{a})\times(C,\textbf{c})\to(B,\textbf{b})$ for some $(B,\textbf{b})\in F$ . It follows that $(C,\textbf{c})\to(B^{A},\textbf{h})$ where $\textbf{h}=h_{1},\ldots,h_{k}$ with $h_{i}$ the function given by $h_{i}(x)=h(x,a_{i})$ . Note that $h_{i}(a_{i})=b_{i}$ and hence $(B^{A},\textbf{h})\in D$ . ∎

Note that the construction of the frontier from the duality is polynomial, while the construction of the duality from the frontier involves an exponential blowup. The following example shows that this is unavoidable.

Example 3.5.

The path $\circ\xrightarrow{R}\circ\xrightarrow{R_{1}}\circ\xrightarrow{R}\circ\xrightarrow{R_{2}}\circ\cdots\circ\xrightarrow{R_{n}}\circ\xrightarrow{R}\circ$ , viewed as a structure without any distinguished elements, has a frontier (w.r.t. the class of all finite structures) of size polynomial in $n$ , as will follow from Theorem 3.8 below. It is known, however, that any finite duality for this structure must involve a structure whose size is exponential in $n$ , and the example can be modified to use a fixed schema (cf. (NesetrilTardif2005, )).

3.1. Frontiers for classes with bounded expansion

The notion of a class of graphs with bounded expansion was introduced in (nesetril2018grad, ). Intuitively, a class of graphs has bounded expansion if all of its shallow minors are sparse. We will not give a precise definition here, but important examples include graphs of bounded degree, graphs of bounded treewidth, planar graphs, and any class of graphs excluding a minor. The same concept of bounded expansion can be applied also to arbitrary structures: a class of structures $\mathcal{C}$ is said to have bounded expansion if the class of Gaifman graphs of structures in $\mathcal{C}$ has bounded expansion. We refer to (nesetril2012sparsity, ) for more details. Classes of structures of bounded expansion are in many ways computationally well-behaved (cf. for example (kazana2020firsts, )).

Nešetřil and Ossona de Mendez (NesetrilOssona2008, ; nesetril2012sparsity, ) show that if $\mathcal{C}$ is any class of structures with bounded expansion, then every structure has a finite duality w.r.t. $\mathcal{C}$ . It follows by Lemma 3.4 that also every structure has a fontier w.r.t. $\mathcal{C}$ . Nešetřil and Ossona de Mendez (NesetrilOssona2008, ; nesetril2012sparsity, ) only consider connected structures without distinguished elements,³³3Note that the various notions of connectedness, such as based on the incidence graph, fact graph, or Gaifman graph, all coincide for relational structures without distinguished elements. but their result extends in a straightforward way to the general case of structures with distinguished elements. Furthermore, it yields an effective procedure for constructing frontiers, although non-elementary (i.e., not bounded by a fixed tower of exponentials).

Theorem 3.6 (from (NesetrilOssona2008, ; nesetril2012sparsity, )).

Let $\mathcal{C}$ be any class of structures that has bounded expansion. Then every structure $(A,\textbf{a})$ has a frontier w.r.t. $\mathcal{C}$ , which can be effectively constructed.

Proof.

Nešetřil and Ossona de Mendez (NesetrilOssona2008, ) stated their result for the case where $A$ is a connected structure without distinguished elements. They show that, every such structure $A$ has a finite duality w.r.t. $\mathcal{C}$ , and hence, by Lemma 3.4, also a fontier w.r.t. $\mathcal{C}$ . The result extends to disconnected structures through standard arguments: let $A$ be any structure (without distinguished elements). We may assume without loss of generality that $A$ is a core (because every structure is homomorphically equivalent to its core, and hence every frontier for the latter is a frontier for the former). Let $A_{1},\ldots,A_{n}$ be the connected components of $A$ . Since $A$ is a core, $A_{1},\ldots,A_{n}$ are pairwise homomorphically incomparable. Take all structures of the form $A_{1}\uplus\cdots\uplus A_{i-1}\uplus B\uplus A_{i+1}\uplus\cdots\uplus A_{n}$ , for $B$ a structure belonging to the frontier of $A_{i}$ (for $i=1\ldots n$ ). It is straightforward to show that this yields a frontier for $A$ .

Next, we show how to extend this to structures with distinguished elements. For any structure $(A,\textbf{a})$ with $\textbf{a}=a_{1},\ldots,a_{n}$ , let $A^{\textbf{a}}$ be the structure (without distinguished elements) over expanded schema with additional unary predicates $P_{1},\ldots,P_{n}$ , where each $P_{i}$ denotes $\{a_{i}\}$ . Since $A$ and $A^{\textbf{a}}$ have the same Gaifman graph, and since $\mathcal{C}$ has bounded expansion, the same holds for $\{A^{\textbf{a}}\mid(A,\textbf{a})\}$ . Therefore, it suffices to show that whenever $A^{\textbf{a}}$ has a frontier w.r.t. $\{C^{\textbf{c}}\mid(C,\textbf{c})\in\mathcal{C}\}$ , then $(A,\textbf{a})$ has a frontier w.r.t. $\mathcal{C}$ .

Let $F=\{B_{1},\ldots,B_{m}\}$ be a frontier for $A^{\textbf{a}}$ w.r.t. $\{C^{\textbf{c}}\mid(C,\textbf{c})\in\mathcal{C}\}$ . Now consider all ways of taking a structure $B\in F$ and choosing one element per unary predicate $P_{i}$ . In this way we obtain a set of structures $F^{\prime}$ with distinguished elements, that we claim is a frontier for $(A,\textbf{a})$ w.r.t. $\mathcal{C}$ . Note that any homomorphism $h:(A,\textbf{a})\to(B,\textbf{b})$ for $(B,\textbf{b})\in F^{\prime}$ is a homomorphism from $A^{\textbf{a}}$ to $B^{\textbf{b}}$ , therefore since $F$ is a frontier for $A^{\textbf{a}}$ , there is no such homomorphism $h$ . It is also clear that each $(B,\textbf{b})\in F^{\prime}$ maps homomorphically to $(A,\textbf{a})$ . Finally, consider any $(C,\textbf{c})\in\mathcal{C}$ that maps to $(A,\textbf{a})$ but not vice versa. Then $C^{\textbf{c}}\to A^{\textbf{a}}$ and $A^{\textbf{a}}\not\to C^{\textbf{c}}$ . Hence, there is a homomorphism $h:C^{\textbf{c}}\to B$ for some $B\in F$ . It follows that $h:(C,\textbf{c})\to(B,\textbf{b})\in F^{\prime}$ where each $b_{i}=h(c_{i})$ . ∎

3.2. Polynomial frontiers for c-acyclic structures

Alexe et al. (AlexeCKT2011, ), building on Foniok et al. (FoniokNT08, ), show that a structure has a finite duality if and only if its core is c-acyclic. By Lemma 3.4, this implies that a structure has a frontier if and only if its core is c-acyclic.

Theorem 3.7 (from (FoniokNT08, ; AlexeCKT2011, )).

For all structures $(A,\textbf{a})$ , the following are equivalent:

(1)

$(A,\textbf{a})$ has a frontier w.r.t. the class of all structures,
(2)

$(A,\textbf{a})$ is homomorphically equivalent to a c-acyclic structure,
(3)

The core of $(A,\textbf{a})$ is c-acyclic

One of our main results is a new proof of the right-to-left direction, which, unlike the original, provides a polynomial-time construction of a frontier from a c-acyclic structure:

Theorem 3.8.

Fix a schema $\mathcal{S}$ and $k\geq 0$ . Given a c-acyclic structure over $\mathcal{S}$ with $k$ distinguished elements, we can construct in polynomial time a frontier w.r.t. the class of all structures over $\mathcal{S}$ that have $k$ distinguished elements.

Note that the size of the smallest frontier is in general exponential in $k$ . Indeed, consider the single-element structure $(A,\textbf{a})$ where $A$ consists of the single fact $P(a)$ and $\textbf{a}=a,\ldots,a$ has length $k$ . It is not hard to show that every frontier of this (c-acyclic) structure must contain, up to homomorphic equivalence, all structures of the form $(B,\textbf{b})$ where $B$ consists of two facts, $P(a_{1})$ and $P(a_{2})$ , and $\textbf{b}\in\{a_{1},a_{2}\}^{k}$ is a sequence in which both $a_{1}$ and $a_{2}$ occur. There are exponentially many pairwise homomorphically incomparable such structures.

The proof of Theorem 3.8 is based on a construction that improves over a similar but exponential construction of gap pairs for acyclic structures given in (NesetrilTardif2000, , Def. 3.9). Our results also shed new light on a question posed in the same paper: after presenting a double-exponential construction of duals (for connected structures without distinguished elements), involving first constructing an exponential-sized gap pair, the authors ask: “It would be interesting to know to what extent the characterisation of duals can be simplified, and whether the indirect approach via density is optimal.” This question appeared to have been answered in (NesetrilTardif2005, ), where a direct method was established for constructing single-exponential size duals. Theorem 3.8 together with Lemma 3.4, however, gives another answer: single-exponential duals can be constructed by going through frontiers (i.e., “via density”) as well.

Recall the definition of fg-connectedness from the preliminaries. We first prove a restricted version of Theorem 3.8 for the special case of core, fg-connected, c-acyclic structures with the Unique Names Property. We subsequently lift these extra assumptions. A structure $(A,\textbf{a})$ with $\textbf{a}=a_{1},\ldots a_{k}$ has the Unique Names Property (UNP) if $a_{i}\neq a_{j}$ for all $i<j$ (cf. (baader2003basic, )).

Proposition 3.9.

Given a core, fg-connected, c-acyclic structure with UNP, we can construct in polynomial time (for fixed schema $\mathcal{S}$ and number of distinguished elements $k$ ) a frontier w.r.t. the class of all finite structures. Furthermore, the frontier consists of a single structure, which has the UNP. ⁴⁴4 An earlier conference version of this paper had a bug in the proof of this proposition, as was pointed out to us by Raoul Koudijs (p.c.).

Proof.

Let a core fg-connected c-acyclic structure $(A,\textbf{a})$ with UNP be given. To reduce notational complexity in the remainder of this proof, we will simply write $A$ instead of $(A,\textbf{a})$ , even when referring to the structure including the distinguished elements.

Note that each fg-connected structure either (i) consists of a single fact containing only distinguished elements, or (ii) consists of a number of facts that all contain at least one non-distinguished element. Therefore, we can distinguish two cases:

Case 1. $A$ consists of a single fact $f$ without non-distinguished elements. Let $(B,\textbf{a})$ be the structure whose domain is $\{a_{1},\ldots,a_{k},b\}$ , where $\textbf{a}=a_{1},\ldots,a_{k}$ and $b$ is a fresh value distinct from $a_{1},\ldots,a_{k}$ ; and which contains all facts over this domain except $f$ . It is easy to see that $(B,\textbf{a})$ is a homomorphism dual for $(A,\textbf{a})$ . Indeed, consider any structure $(C,\textbf{c})$ , and let $f^{\prime}$ be a copy of the fact $f$ in which each element $a_{i}$ is replaced by the corresponding element $c_{i}$ . If $h:(A,\textbf{a})\to(C,\textbf{c})$ then $C$ contains $f^{\prime}$ , therefore, $(C,\textbf{c})\not\to(B,\textbf{a})$ ; if, on the other hand, $(A,\textbf{a})\not\to(C,\textbf{c})$ , then $C$ omits $f^{\prime}$ , and hence, $(C,\textbf{c})\to(B,\textbf{a})$ . It follows that, the direct product $(A,\textbf{a})\times(B,\textbf{a})$ constitutes a singleton frontier for $(A,\textbf{a})$ . Note that this construction is polynomial because we assume that the schema $\mathcal{S}$ and $k$ are both fixed.

Case 2. $A$ consists of one or more facts that each contain a non-distinguished element. In this case, we construct a singleton frontier $F=\{(B,\textbf{b})\}$ where

•
the domain of $B$ consists of
1. (1)
  
  all pairs $(a,f)$ where $a$ is a non-distinguished element of $A$ and $f$ is a fact of $A$ in which $a$ occurs, and
2. (2)
  
  All pairs $(a,\textsf{id})$ and $(a,\textsf{nd})$ , where $a$ is a distinguished element of $A$
•

a fact $R((a_{1},f_{1}),\ldots,(a_{n},f_{n}))$ holds in $B$ if and only if $R(a_{1},\ldots,a_{n})$ holds in $A$ and at least one $f_{i}$ is either a fact that is different from the fact $R(a_{1},\ldots,a_{n})$ itself, or is nd
•

The distinguished elements b are $b_{1}=(a_{1},\textsf{id}),\ldots,b_{n}=(a_{n},\textsf{id})$ , for $\textbf{a}=a_{1},\ldots,a_{n}$ .

Note that, in the above construction, id and nd are symbols (not functions), used to simplify notation by ensuring that every element of $B$ can be written as a pair. The symbols id and nd, intuitively, stand for “identity” and ”non-distinguished copy”,

We claim that $F=\{(B,\textbf{b})\}$ is a frontier for $A$ .

It is clear that the natural projection $h:(B,\textbf{b})\to(A,\textbf{a})$ is a homomorphism.

We claim that there is no homomorphism $h^{\prime}:(A,\textbf{a})\to(B,\textbf{b})$ . Assume, for the sake of a contradiction, that there was such a homomorphism. By Lemma 2.1, we may assume that the composition of $h$ and $h^{\prime}$ is the identity function on $A$ . In particular, this means that $h^{\prime}$ maps each distinguished element $a$ to $(a,\textsf{id})$ and for each non-distinguished element $a$ of $A$ , $h^{\prime}(a)=(a,f)$ for some fact $f$ . For a non-distinguished element $a$ , let us denote by $f_{a}$ the unique fact $f$ for which $h^{\prime}(a)=(a,f_{a})$ .

We will consider “walks” in $A$ of the form

a_{1}\xrightarrow{f_{a_{1}}}a_{2}\xrightarrow{f_{a_{2}}}\ldots a_{n}

with $n\geq 1$ , where

(1)

$a_{1},\ldots,a_{n}$ are non-distinguished elements,
(2)

$f_{a_{i}}\neq f_{a_{i+1}}$ , and
(3)

$a_{i}$ and $a_{i+1}$ co-occur in fact $f_{a_{i}}$ ,

Since $A$ is c-acyclic, the length of any such sequence is bounded by the diameter of $A$ (otherwise some fact would have to be traversed twice in succession, which would violate condition 2). Furthermore, trivially, such a walk of length $n=1$ exists: just choose as $a_{1}$ an arbitrary non-distinguished element of $A$ . Furthermore, we claim that any such finite sequence can be extended to a longer one: let the fact $f_{a_{n}}$ be of the form $R(b_{1},\ldots,b_{m})$ (where $a_{n}=b_{i}$ for some $i\leq m$ ). Since $h$ is a homomorphism, it must map $f_{a_{n}}$ to some fact $R((b_{1},f_{b_{1}}),\ldots,(b_{m},f_{b_{m}}))$ of $B$ , where some $f_{b_{j}}$ is a fact that is different from $f_{a_{n}}$ . We can choose $b_{j}$ as our element $a_{n+1}$ . Thus, we reach our desired contradiction.

Finally, consider any $C$ with $h:C\to A$ and $A\not\to C$ . We construct a function $h^{\prime}:C\to B$ as follows: consider any element $c$ of $C$ , and let $h(c)=a$ . If $c$ is a distinguished element (in which case $a$ is, too), we set $h^{\prime}(c)=(a,\textsf{id})$ . If $c$ is not a distinguished element but $a$ is, we set $h^{\prime}(c)=(a,\textsf{nd})$ . Otherwise, we proceed as follows: since $A$ is c-acyclic and fg-connected, for each non-distinguished element $a^{\prime}$ of $A$ (other than $a$ itself) there is a unique minimal path in the incidence graph, containing only non-distinguished elements, from $a^{\prime}$ to $a$ . We can represent this path by a sequence of the form

a^{\prime}=a_{0}\xrightarrow{(f_{0},i_{0},j_{0})}a_{1}\xrightarrow{(f_{1},i_{1},j_{1})}a_{2}\cdots\xrightarrow{(f_{n-1},i_{n-1},j_{n-1})}a_{n}=a

where each $f_{\ell}$ is a fact of $A$ in which $a_{\ell}$ occurs in the $i_{\ell}$ -th position and $a_{\ell+1}$ occurs in the $j_{\ell}$ -th position. We can partition the non-distinguished elements $a^{\prime}$ of $A$ (other than $a$ itself) according to the last fact on this path, that is, $f_{n-1}$ . Furthermore, it follows from fg-connectedness that each fact of $A$ contains a non-distinguished element. It is easy to see that if a fact contains multiple non-distinguished elements (other than $a$ ) then they must all belong to the same part of the partition as defined above. Therefore, the above partition on non-distinguished elements naturally extends to a partition on the facts of $A$ . Note that if $A$ contains any facts in which $a$ is the only non-distinguished element, we will refer to these facts as “local facts” and they will be handled separately. In this way, we have essentially decomposed $A$ into a union $A_{\text{local}}\cup\bigcup_{i}A_{i}$ , where $A_{\text{local}}$ contains all local facts and each “component” $A_{i}$ is a substructure of $A$ consisting of non-local facts, in such a way that (i) different substructures $A_{i}$ do not share any facts with each other, (ii) different substructures do not share any elements with each other, except for $a$ and distinguished elements (from a), (iii) each $A_{i}$ contains precisely one fact involving $a$ .

Since we know that $(A,a)\not\to(C,c)$ , it follows that either some local fact $f$ of $A$ does not map to $C$ (when sending $\textbf{a},a$ to $\textbf{c},c$ ), or some “component” $A_{i}$ of $A$ does not map to $C$ through any homomorphism sending $\textbf{a},a$ to $\textbf{c},c$ . In the first case, we choose such local fact $f$ and set $h^{\prime}(a)=(a,f)$ . In the second case, we choose such a component (if there are multiple, we choose one of minimal size) and let $f$ be the unique fact in that component containing $a$ (that is, $f$ is the fact $f_{n-1}$ that by construction connected the non-distinguished elements of the component in question to $a$ ). We set $h^{\prime}(c)=(a,f)$ . Intuitively, when $h^{\prime}(c)=(h(c),f)$ , then $f$ is a fact of $A$ involving $h(c)$ that “points in a direction where homomorphism from $(A,h(c))$ back to $(C,c)$ fails”.

We claim that $h^{\prime}$ is a homomorphism from $C$ to $B$ : let $R(c_{1},\ldots,c_{n})$ be a fact of $C$ . Then $R(h(c_{1}),\ldots,h(c_{n}))$ holds in $A$ . Let $h^{\prime}(c_{i})=(h(c_{i}),f_{i})$ as constructed above (where, we recall, $f_{i}=\textsf{id}$ if $c_{i}$ is a distinguished element, and $f_{i}=\textsf{nd}$ if $c_{i}$ is not a distinguished element but $h(c_{i})$ is). Also recall that at least one $c_{i}$ is a non-distinguished element. To show that $R(h^{\prime}(c_{1}),\ldots,h^{\prime}(c_{n}))$ holds in $B$ , it suffices to show that some $f_{i}$ is different from the fact $R(h(c_{1}),\ldots,h(c_{n}))$ itself, or is equal to nd. If some non-distinguished $c_{i}$ is mapped by $h$ to a distinguished element, then $h^{\prime}(c_{i})=(c_{i},\textsf{nd})$ , and we are done. This leaves us with the case where some $c_{i}$ is a non-distinguished element, and for all non-distinguished $c_{i}$ , $h^{\prime}(c_{i})$ is of the form $(h(c_{i}),f_{i})$ for a fact $f_{i}$ . If one of these $f_{i}$ is a local fact, then it follows immediately from the construction that $f_{i}\neq R(h(c_{1}),\ldots,h(c_{n}))$ . Otherwise, let $n_{i}$ be the size of the smallest “component” (as defined above) of $(A,h(c_{i}))$ that does not homomorphically map to $(C,c_{i})$ , and choose an element $c_{i}$ with minimal $n_{i}$ . Then, clearly, $f_{i}$ must be different from the fact $R(h(c_{1}),\ldots,h(c_{n}))$ itself. ∎

Next, we remove the assumptions of fg-connectedness and being a core.

Proposition 3.10.

Given a c-acyclic structure with UNP, we can construct in polynomial time a frontier w.r.t. the class of all finite structures. Furthermore, the frontier consists of structures that have the UNP.

Proof.

By Proposition 2.2, we may assume that $(A,\textbf{a})$ is a core. Note that the c-acyclicity and UNP properties are preserved under the passage from a structure to its core.

Let $(A,\textbf{a})$ be a structure with distinguished elements that is UNP and that is a fg-disjoint union of homomorphically incomparable fg-connected structures $(A_{1},\textbf{a}),\ldots,(A_{n},\textbf{a})$ . By Proposition 3.9, $(A_{1},\textbf{a}),\ldots,(A_{n},\textbf{a})$ have, respectively, frontiers $F_{1},\ldots,F_{n}$ , each consisting of a single structure with the UNP. We may assume without loss of generality that each $F_{i}$ consists of a structure that have the same distinguished elements a (we know that the structures in question have the UNP, and therefore, modulo isomorphism, we can assume that the distinguished elements are precisely a). Let $F_{i}=\{(B_{i},\textbf{a})\}$ .

We claim that $F=\{\big{(}\biguplus_{j\neq i}(A_{j},\textbf{a})\big{)}\uplus(B_{i},\textbf{a})\mid 1\leq i\leq n\}$ is a frontier for $(A,\textbf{a})$ w.r.t. $\mathcal{C}$ .

Clearly, each structure in $F$ maps homomorphically to $A$ .

Suppose, for the sake of contradiction, that there is a homomorphism $h:(A,\textbf{a})\to\big{(}\biguplus_{j\neq i}(A_{j},\textbf{a})\big{)}\uplus(B_{i},\textbf{a})$ for some $i$ . Observe that $h$ must send each distinguished element to itself, and it must send each non-distinguished element to a non-distinguished element (otherwise, the composition of $h$ with the backward homomorphism would be a non-injective endomorphism on $(A,\textbf{a})$ which would contradict the fact that $(A,\textbf{a})$ is a core). Since $(A_{i},\textbf{a})$ is fg-connected (and because $h$ cannot send non-distinguished elements to distinguished elements), its $h$ -image must be contained either in some $(A_{j},\textbf{a})$ ( $j\neq i$ ) or in $B$ . The former cannot happen because $A_{i}$ and $A_{j}$ are homomorphically incomparable. The latter cannot happen either, because $B$ belongs to a frontier of $A_{i}$ .

Finally, let $(C,\textbf{c})\in\mathcal{C}$ be any structure such that there is a homomorphism $h:(C,\textbf{c})\to(A,\textbf{a})$ but $(A,\textbf{a})\not\to(C,\textbf{c})$ . Let $(A_{i},\textbf{a})$ be a fg-connected component of $(A,\textbf{a})$ such that $(A_{i},\textbf{a})\not\to(C,\textbf{c})$ . Since $(A_{i},\textbf{a})$ is fg-connected, we can partition our structure $(C,\textbf{c})$ as $(C_{1},\textbf{c})\uplus(C_{2},\textbf{c})$ where the $h$ -image of $C_{1}$ is contained in $(A_{i},\textbf{a})$ while the $h$ -image of $C_{2}$ is disjoint from $A_{i}$ except possibly for the distinguished elements. We know that $(A_{i},\textbf{a})\not\to(C_{1},\textbf{c})$ and therefore $(C_{1},\textbf{c})\to(B_{i},\textbf{a})$ . Furthermore, we have that $(C_{2},\textbf{c})\to\biguplus_{j\neq i}(A_{j},\textbf{a})$ . Therefore, $(C,\textbf{c})\to\big{(}\biguplus_{j\neq i}(A_{j},\textbf{a})\big{)}\uplus(B_{i},\textbf{a})$ . ∎

Finally, we can prove Theorem 3.8 itself.

Proof of Theorem 3.8.

Let $(A,\textbf{a})$ be c-acyclic. If it has the UNP, we are done. Consider the other case, where the sequence a contains repetitions. Let $\textbf{a}^{\prime}=a^{\prime}_{1},\ldots,a^{\prime}_{n}$ consists of the same elements without repetition (in some order). We construct a frontier for it as follows:

(1)

Consider the structure $(A,\textbf{a}^{\prime})$ , which, by construction, has the UNP. Let $F$ be a frontier for $(A,\textbf{a}^{\prime})$ (again consisting of structures with the UNP), using Proposition 3.10. Note that, through isomorphism, we may assume that each structure in $F$ has the same distinguished elements $\textbf{a}^{\prime}$ . For each $(B,\textbf{a}^{\prime})\in F$ , we take the structure $(B,\textbf{a})$ .
(2)

Let $k$ be the length of the tuple a. For each function $f:\{1,\ldots,k\}\to\{1,\ldots,k\}$ , whose range has size strictly greater than $n$ , consider structure $(C,\textbf{c}^{f})$ where $C$ contains all facts over the domain $\{1,\ldots,k\}$ , and $c^{f}_{i}=f(i)$ . We take its direct product with $(A,\textbf{a})$ .

It is easy to see that the set of all structures constructed above, constitutes a frontier for $(A,\textbf{a})$ . Indeed, suppose a structure maps to $(A,\textbf{a})$ but not vice versa. If the tuple of distinguished elements of the structure in question has the same identity type as the tuple a (i.e., the same equalities hold between values at different indices in the tuple) then it is easy to see that the structure in question must map to some structure $(B,\textbf{a})$ as constructed under item 1 above. Otherwise, if the tuple of distinguished elements of the structure in question does not have the same identity type, then it is easy to see that the structure in question must map to $(C^{f},\textbf{c}^{f})\times(A,\textbf{a})$ , as constructed under item 2 above, where $f$ reflects the identity type of the distinguished elements of the structure in question. ∎

As a corollary of Theorem 3.8, we obtain the following interesting by-product:

Theorem 3.11.

Fix a schema $\mathcal{S}$ and $k\geq 0$ . The following problem is solvable in NP: given a finite set of structures $F$ and a structure $A$ (all with $k$ distinguished elements), is $F$ a frontier for $A$ w.r.t. the class of all structures? If $\mathcal{S}$ contains a binary relation and $k=3$ , then it is NP-complete.

Proof.

For the upper bound, we use the fact that, if $A$ is homomorphically equivalent to a c-acyclic structure $A^{\prime}$ , then the core of $A$ is c-acyclic (cf. Theorem 3.7). The problem can therefore be solved in non-deterministic polynomial time as follows:

First we guess a substructure $A^{\prime}$ and we verify that $A^{\prime}$ is c-acyclic and homomorphically equivalent to $A$ . Note that the existence of such $A^{\prime}$ is a necessary precondition for $F$ to be a frontier of $A$ . Furthermore, c-acyclicity can be checked in polynomial time using any PTIME algorithm for graph acyclicity (recall that a structure is c-acyclic if and only if its incidence graph is acyclic after removing all nodes corresponding to distinguished elements).

Next, we apply Theorem 3.8 to construct a frontier $F^{\prime}$ for $A^{\prime}$ (and hence for $A$ ). Finally, we verify that each $B\in F$ homomorphically maps to some $B^{\prime}\in F^{\prime}$ and, vice versa, every $B^{\prime}\in F^{\prime}$ homomorphically maps to some $B\in F$ . It is not hard to see that this non-deterministic algorithm has an accepting run if and only if $F$ is a frontier for $A$ .

For the lower bound, we reduce from graph 3-colorability. Let $A$ be the structure, over a 3-element domain, that consists of the facts $R(a,b)$ for all pairs $a,b$ with $a\neq b$ . In addition, each of the three elements is named by a constant symbol. Since $A$ is c-acyclic, by Theorem 3.7, it has a frontier $F$ . Now, given any graph $G$ (viewed as a relational structure with binary relation $R$ and without constant symbols), we have that $G$ is 3-colorable if and only if $F$ is a frontier for the disjoint union of $A$ with $G$ . To see that this is the case, note that if $G$ is 3-colorable, then the disjoint union of $A$ with $G$ is homomorphically equivalent to $A$ itself, whereas if $G$ is not 3-colorable, then the disjoint union of $A$ with $G$ is strictly greater than $A$ in the homomorphism order. ∎

3.3. A polynomially frontier-closed class of structures

We call a class $\mathcal{C}$ of structures frontier-closed if every structure $(A,\textbf{a})\in\mathcal{C}$ has a frontier w.r.t. $\mathcal{C}$ , consisting of structures belonging to $\mathcal{C}$ . If, moreover, the frontier in question can be constructed from $(A,\textbf{a})$ in polynomial time, then we say that $\mathcal{C}$ is polynomially frontier-closed.

Theorem 3.12.

Fix a schema $\mathcal{S}$ and $k\geq 1$ . The class of c-connected acyclic structures with $k$ distinguished elements is polynomially frontier-closed. ⁵⁵5Note that for structures with one distinguished element, c-connectedness is the same as connectedness.

In fact, the construction presented below shows that the polynomial bound holds even when the schema is treated as part of the input of the problem (although $k$ does need to be fixed, as the size of the constructed frontier depends exponentially on $k$ ).

As will follow from results in Section 4 (cf. Theorems 4.6-4.8 below) the theorem fails if we drop any of the three restrictions in the statement (i.e., c-connectedness, acyclicity, and $k\geq 1$ ).

The special case with binary relations only and $k=1$

The remainder of this section is dedicated to the proof of Theorem 3.12. To simplify the presentation of the proof, we will first assume that the schema consists of binary relations only, and that $k=1$ . Afterwards, we will show how to lift these restrictions.

Let $\mathcal{S}$ be a schema consisting of binary relation symbols, and fix a finite structure $(A,a_{0})$ that is c-connected and acyclic. We will assume, in addition, that $(A,a_{0})$ is a core, which we may do without loss of generality, because the core of an acyclic structure can be computed in polynomial time (Proposition 2.2) and the properties of c-connectedness and acyclicity are preserved under passage from a structure to its core:

Proposition 3.13.

The properties of c-connectedness and acyclicity are preserved when passing from a structure to its core.

Proof.

That acyclicity is preserved follows immediately from the fact that the core is a substructure of the original structure. For c-connectedness, the argument is as follows: let $(B,\textbf{b})$ be a c-connected structure, and let $(B^{\prime},\textbf{b})$ be its core. It follows from the definition of a core that $(B,\textbf{b})\leftrightarrow(B^{\prime},\textbf{b})$ . By Proposition 2.3, this implies that $(B,\textbf{b})\leftrightarrow(B^{\prime},\textbf{b})^{\text{reach}}$ , and hence, $(B^{\prime},\textbf{b})\leftrightarrow(B^{\prime},\textbf{b})^{\text{reach}}$ . It follows by the minimality property of cores that $(B,\textbf{b})=(B^{\prime},\textbf{b})^{\text{reach}}$ , i.e., $(B^{\prime},\textbf{b})$ is c-connected. ∎

We shall slightly abuse notation and, for every relation symbol $R$ and every $a,b\in A$ we shall say that $R^{-}(a,b)$ holds in (or is a fact of) $A$ if $R(b,a)$ is a fact of $A$ . We can think of $A$ as an (oriented) tree rooted at $a_{0}$ , where every edge $b\to c$ has been oriented away from $a_{0}$ and is labelled $R$ or $R^{-}$ depending on whether $R(b,c)$ or $R(c,b)$ is a fact of $A$ .

Definition 3.14 ( $A|a$ ).

For any element $a$ of $A$ , we denote by $A|a$ the substructure of $A$ consisting of the oriented subtree rooted at $a$ .

Definition 3.15 (rank).

For any element $a$ of $A$ , $rank(a)$ is the depth of the oriented tree $A|a$ .

The construction of frontiers that we will describe below, involves a recursion on rank. Note that $rank(a)=0$ when $a$ is maximally far from $a_{0}$ (in other words, the leafs of the oriented tree have rank 0). Furthermore, if there is an edge $a\to b$ (in the oriented tree) then $rank(b)<rank(a)$ .

The frontier construction below is split into two steps: for each element $a$ , we first construct a set of structures $\mathcal{F}_{a}$ , and, subsequently, we construct a modified set of structures $\mathcal{F}^{*}_{a}$ .

Definition 3.16 ( $\mathcal{F}_{a}$ ).

Fix any node $a$ of $A$ . Let $a^{\prime}_{1},\ldots,a^{\prime}_{n}$ be the children of $a$ (in the oriented tree), and let $S_{1},\ldots,S_{n}$ be the corresponding edge labels. We can depict $A|a$ as follows:

If $rank(a)=0$ (that is, if $n=0$ ) we define $\mathcal{F}_{a}=\emptyset$ . Otherwise, we define $\mathcal{F}_{a}=\{H^{1},\ldots,H^{n}\}$ where $H^{i}$ is obtained from $A|a$ in the following way. Set $H^{i}$ to be the result of removing $A|a^{\prime}_{i}$ from $A|a$ . Then, we add to $H^{i}$ a fresh isomorphic copy $F$ of each structure in $\mathcal{F}_{a^{\prime}_{i}}$ , joining $a$ with the newly created copies of $a^{\prime}_{i}$ with a $S_{i}$ -edge. See Figure 2 for an illustration.

Figure 2. A depiction of the construction of

\mathcal{F}_{a}

in Definition 3.16.

Observe that, for $F\in\mathcal{F}_{a}$ , there is a natural mapping $h:F\to A$ (indeed, following the induction of the construction, we can see that each element of $F$ is either an element of $A$ or else was introduced as an isomorphic copy of an element of $A$ ).

Definition 3.17 ( $\mathcal{F}^{*}_{a}$ ).

We define $\mathcal{F}^{*}_{a}=\{F^{*}\mid F\in\mathcal{F}_{a}\}$ , where $F^{*}$ is defined as follows. Let $h:F\to A$ be the natural mapping. Then, $F^{*}$ is obtained from $F$ by adding, for each element $b$ of $F$ with $h(b)\neq a$ , a fresh isomorphic copy of $A$ together with a connecting $S$ -edge from $c$ to $b$ , where $c$ is (the newly created copy of) the parent of $h(b)$ and $S$ is the edge label of the edge from $c$ to $h(b)$ in $A$ .


Input (uni-relational)	$\mathcal{F}_{a_{0}}$ single structure	$\mathcal{F}^{*}_{a_{0}}$ single structure
structure $(A,a_{0})$

Figure 3. First example illustrating Definitions 3.16 and 3.17.


Input structure	$\mathcal{F}_{a}$ single structure	$\mathcal{F}^{*}_{a}$ single structure
$(A,a)$

Figure 4. Second example illustrating Definitions 3.16 and 3.17.

The examples in Figures 3 and 4 illustrate the construction of $\mathcal{F}_{a}$ and $\mathcal{F}^{*}_{a}$ . In these figures, for clarity, the distinguished element is marked by a circle. In Figure 3, all edges represent the same binary relation $R$ , and edge labels are omitted for the sake of readability. In both examples, it happens (coincidentally) that the frontier consists of a single structure.

Definition 3.23 and 3.24 further down may provide additional intuition on the $(\cdot)^{*}$ operation used in Definition 3.17.

Theorem 3.18.

Let $(A,a_{0})$ be a finite structure that is core, c-connected and acyclic, let $a$ be any node of $A$ , and let $\mathcal{F}^{*}_{a}$ be as defined above.

(1)

For each $F^{*}\in\mathcal{F}^{*}_{a}$ , $(F^{*},a)\to(A,a)$ .
(2)

For each $F^{*}\in\mathcal{F}^{*}_{a}$ , $(A|a,a)\not\to(F^{*},a)$ .
(3)

Let $(B,b)$ be acyclic and c-connected. If $(B,b)\to(A,a)$ and $(A|a,a)\not\to(B,b)$ , then $(B,b)$ maps homomorphically to $(F,a)$ for some structure $F\in\mathcal{F}^{*}_{a}$ .

In particular (since $A|a_{0}=A$ ), $\mathcal{F}^{*}_{a_{0}}$ is a frontier for $(A,a_{0})$ w.r.t. the class of c-connected, acyclic structures with one distinguished element.

Proof.

Item 1 follows immediately from the construction: the natural projection from $(F^{*},a)$ to $(A,a)$ is a homomorphism.

For item 2, we proceed by induction on $rank(a)$ . Item 2 holds true, trivially, when $rank(a)=0$ , as $\mathcal{F}^{*}_{a}$ is empty in this case. For the inductive case, now, assume that $rank(a)>0$ . In this case, by definition, we know that $F\in\mathcal{F}_{a}$ is of the form $H_{i}$ for some child $a^{\prime}_{i}$ of $a$ . In other words, $F$ was obtained from $A|a$ by removing the subtree $A|a^{\prime}_{i}$ , and (provided $rank(a^{\prime}_{i})>0$ ), adding a fresh isomorphic copy of each structure in $\mathcal{F}_{a^{\prime}_{i}}$ joining $a$ with the newly created copy of $a^{\prime}_{i}$ with a $S_{i}$ -edge, where $S_{i}$ is the label of the edge $a\to a^{\prime}_{i}$ in the original structure.

Consider the homomorphism $h:(F^{*},a)\to(A,a)$ given by item 1. Suppose for the sake of a contradiction that there were also a homomorphism $h^{\prime}:(A|a,a)\to(F^{*},a)$ . Let $h^{\prime}(a^{\prime}_{i})=b$ . Towards our contradiction, we perform a case distinction on $b$ . Clearly, $b$ must be one of the neighbours of $a$ in $F^{*}$ . By construction, these neighbours of $a$ in $F^{*}$ are: (i) the children $a^{\prime}_{1},\ldots,a^{\prime}_{i-1},a^{\prime}_{i+1},\ldots,a^{\prime}_{n}$ in $F$ ; (ii) the copies of $a^{\prime}_{i}$ belonging to isomorphic copies of structures in $\mathcal{F}_{a^{\prime}_{i}}$ ; and, provided $a\neq a_{0}$ , (iii) the parent, $p$ , of $a$ in $A$ , as well as the copy of $p$ introduced in Definition 3.17.

Let us first consider case (i) and (iii). In both cases, let $h^{\prime\prime}$ be the composition of $h^{\prime}$ and $h$ . Then $h^{\prime\prime}$ is a homomorphism from $(A|a,a)$ to $(A,a)$ whose range omits $a^{\prime}_{i}$ . Furthermore, $h^{\prime\prime}$ extends straightforwardly to a non-injective endomorphism on $(A,a_{0})$ by mapping all elements outside $A|a$ to themselves. This is a contradiction with the fact that $(A,a_{0})$ is a core.

Finally consider case (ii). By a similar argument as the above, no element $c$ from $A|a^{\prime}_{i}$ can be mapped by $h^{\prime}$ to $a$ . For, in this case, the composition $h^{\prime\prime}$ of $h^{\prime}$ and $h$ would be a homomorphism from $(A|a,a)$ to $(A,a)$ whose range excludes $c$ , and $h^{\prime\prime}$ could then be extended to a non-injective endomorphism on $(A,a_{0})$ , contradicting the fact that $(A,a_{0})$ is a core. It follows that the restriction of $h^{\prime}$ to $A|a^{\prime}_{i}$ must be such that its range is entirely contained in $(F^{\prime*},a^{\prime}_{i})$ for some $F^{\prime}$ in $\mathcal{F}^{*}_{a^{\prime}_{i}}$ , i.e., it defines a homomorphism from $A|a^{\prime}$ to $(F^{\prime*},a^{\prime}_{i})$ , which contradicts the inductive hypothesis.

Item 3 is proved by induction on $rank(a)$ . Again, item 3 holds true, trivially, when $rank(a)=0$ . Note that when $rank(a)=0$ , $A|a$ is a single-node structure without any relations. Therefore, it is impossible that $(A|a,a)\not\to(B,b)$ .

Now, consider the inductive case with $rank(a)>0$ . Since $(A|a,a)\not\to(B,b)$ , it must be the case that, for some child $a^{\prime}_{i}$ of $a$ there is no element $b^{\prime}$ in $B$ such that $(A|a^{\prime}_{i},a^{\prime}_{i})\to(B,b^{\prime})$ and $S_{i}(b,b^{\prime})$ holds in $B$ where $S_{i}$ is the label of the edge joining $a$ and $a^{\prime}_{i}$ . Let $F=H_{i}\in\mathcal{F}_{a}$ be the corresponding structure as constructed in Definition 3.16.

Let $h$ be the homomorphism from $(B,b)$ to $(A,a)$ given by the hypothesis. We shall construct a homomorphism $h^{\prime}$ from $(B,b)$ to $(F^{*},a)$ . We have $h^{\prime}(b)=a$ . Note that, since $B$ is acyclic and c-connected, we can similarly regard $B$ as an ordered tree rooted at $b$ . Then, for each child, $b^{\prime}$ , of $b$ , we define $h^{\prime}$ on $B|b^{\prime}$ as follows. If $h(b^{\prime})=a^{\prime}_{i}$ , then since $(A|a^{\prime}_{i},a^{\prime}_{i})$ has no homomorphism to $(B,b^{\prime})$ , we apply the inductive hypothesis to define $h^{\prime}$ on $B|b^{\prime}$ . If $h(b^{\prime})$ is the parent of $a$ then we map $B|b^{\prime}$ entirely to the isomorphic copy of $A$ attached to $a$ introduced in Definition 3.17. If $h(b^{\prime})$ is some child of $a$ different than $a^{\prime}_{i}$ then we define $h^{\prime}$ on $B|b^{\prime}$ starting at $b^{\prime}$ and by increasing depth as in the homomorphism, $h$ , from $(B,b)$ to $(A,a)$ until we find some edge $c\to d$ in $B|b^{\prime}$ such that its $h$ -image is not an edge in $A$ . When we found such edge $c\to d$ , then we define $h^{\prime}(d)$ to be the copy of $h(d)$ attached to $h(c)$ in $F^{*}$ according to Definition 3.17 and we extend $h^{\prime}$ to the rest of elements in $B|d$ mapping them as well to the same isomorphic copy. ∎

Definition 3.19.

The size of a structure $A$ , denoted by $size(A)$ , is the number of facts. The total size of a set of structures is $totalsize(\mathcal{A})=\Sigma_{A\in\mathcal{A}}(size(A)+1)$

(The definition of total size is conveniently chosen so that the total size of a set of tree-shaped structures is equal to the size of a tree-shaped structure consisting of the given forest with an additional root connected to the root of each original tree.)

Theorem 3.20 (Our construction is polynomial).

$totalsize(\mathcal{F}^{*}_{a_{0}})=O(size(A)^{4})$ .

Proof.

We will show that $totalsize(\mathcal{F}_{a_{0}})\leq size(A)^{3}$ . The proposition then follows immediately, by construction of $\mathcal{F}^{*}$ . More precisely, we show that, for all elements $a$ , $totalsize(\mathcal{F}_{a})\leq size(A|a)^{3}$ . The claim is proved by induction on $rank(a)$ .

If $rank(a)=0$ , then $totalsize(\mathcal{F}_{a})=0$ . If $rank(a)>0$ , it follows from the construction of $\mathcal{F}_{a}$ that

\begin{split}totalsize(\mathcal{F}_{a})&\leq n\cdot size(A|a)+\Sigma_{i=1}^{n}totalsize(\mathcal{F}_{a^{\prime}_{i}})\\ &\leq size(A|a)^{2}+\Sigma_{i=1}^{n}size(A|a^{\prime}_{i})^{3}\\ &\leq size(A|a)\cdot\big{(}size(A|a)+\Sigma_{i=1}^{n}size(A|a^{\prime}_{i})^{2}\big{)}\\ &\leq size(A|a)\cdot\big{(}size(A|a)+(size(A|a)-1)^{2}\big{)}\\ &\leq size(A|a)\cdot size(A|a)^{2}\\ &\leq size(A|a)^{3}\end{split}

where $n$ is the number of successors of $a$ . Note that we are using here the fact that $\Sigma_{i=1}^{n}size(A|a^{\prime}_{i})\leq size(A|a)-1$ , and the general fact that $\Sigma_{i=1}^{n}x_{i}^{2}\leq(\Sigma_{i=1}^{n}x_{i})^{2}$ . ∎

Extending the result to $k\geq 1$

We now extend the result to $k\geq 1$ . For the time being, we still assume that the schema consists of binary relations only.

Definition 3.21 (Skeleton and offshoots).

Let $(A,\textbf{a})$ be a c-connected, acyclic structure with at least one distinguished element. A skeleton node of $(A,\textbf{a})$ (with $\textbf{a}=a_{1},\ldots,a_{k}$ ) is any element $s$ that is either a distinguished element (i.e., $s\in\{a_{1},\ldots,a_{k}\}$ ) or such that $s$ lies on a minimal path between two distinguished elements. We denote by $\operatorname{skeleton}(A,\textbf{a})$ the substructure of $(A,\textbf{a})$ (with the same distinguished elements), consisting of the skeleton nodes only. It is easy to see that, in a c-connected and acyclic structure, for every element $c$ there is a unique skeleton node $s$ that is closest to $c$ , i.e., such that $c$ is connected to $s$ and such that every path from $c$ to any other skeleton node passes through $s$ . In this case, we say that $c$ is affiliated with the skeleton node $s$ . For any skeleton node $s$ of $(A,\textbf{a})$ , the offshoot of $s$ in $(A,\textbf{a})$ , which we will denote by $(A|s,s)$ , ⁶⁶6We acknowledge that this notation is slightly misleading as it does not reflect the dependence of $(A|s,s)$ on the distinguished elements a. is the structure $(A^{\prime},s)$ where $A^{\prime}$ is the substructure of $A$ consisting of all elements affiliated with $s$ (including $s$ itself), and where $s$ is the only distinguished element.

Thus, every c-connected and acyclic structure can be thought of as consisting of a skeleton with offshoots, similar to the example depicted schematically in Figure 5.

Figure 5. Schematic depiction of an example c-connected acyclic structure with five distinguished elements. The skeleton consists of the black nodes (which are the elements of the structure that are either distinguished or lie on a shortest path connecting two distinguished elements), while the triangles depict offshoots.

Our frontier construction for structures with multiple distinguished elements will make use of two operations that we now define.

Definition 3.22 (Splitting).

Consider a structure $(A,\textbf{a})$ with $\textbf{a}=a_{1},\ldots,a_{k}$ . Let $(X,Y)$ be a proper partition of $\{1,\ldots,k\}$ , that is, $X$ and $Y$ are disjoint non-empty sets such that $X\cup Y=\{1,\ldots,k\}$ . We will denote by $\textrm{split}_{(X,Y)}(A,\textbf{a})$ the structure $(A^{\prime},\textbf{a}^{\prime})$ where

•

$A^{\prime}=A^{(1)}\uplus A^{(2)}$ is a disjoint union of two isomorphic copies of $A$ . For each element $a$ of $A$ , we will denote its two copies in $A^{\prime}$ by $a^{(1)}$ and $a^{(2)}$ , respectively.
•

For $\textbf{a}=a_{1},\ldots,a_{n}$ , we set $\textbf{a}^{\prime}=a_{1}^{(j_{1})},\ldots,a_{n}^{(j_{n})}$ where $j_{i}=1$ if $i\in X$ and $j_{i}=2$ otherwise.

For the next definition, we need to introduce some auxiliary notation. If $a,b$ are elements belonging to the same connected component of some structure $A$ , we will denote by $dist(a,b)$ the length of the shortest path from $a$ to $b$ in the incidence graph of $A$ (where the length may be counted by the number of facts on the path). If $a,b,c$ are elements that all belong to the same connected component of $A$ we will write $a<_{c}b$ if $dist(a,c)<dist(b,c)$ (“ $a$ is closer to $c$ than $b$ is”).

Definition 3.23 (Radial extension).

Let $h:(A,s)\to(B,t)$ be a homomorphism between acyclic structures with one distinguished element. We denote by $\textrm{radial-extension}_{h:(A,s)\to(B,t)}(A)$ the structure obtained from $A$ as follows: for each fact $f$ containing elements $a_{1},a_{2}$ with $a_{1}<_{s}a_{2}$ , we add a disjoint isomorphic copy $\widetilde{B}$ of $B$ (without distinguished elements), and we add a connecting fact that is a copy of $f$ in which $a_{1}$ is replaced by the isomorphic copy of $h(a_{1})$ in $\widetilde{B}$ .

Note that the way in which $\mathcal{F}^{*}_{a}$ was constructed from $\mathcal{F}_{a}$ in Definition 3.17, is a concrete instance of the radial-extension operation.

In the statement of the next lemma, we write $(A,\textbf{a},s)$ , where $(A,\textbf{a})$ is a structure with $k$ distinguished elements, to denote the corresponding structure with $k+1$ distinguished elements.

Lemma 3.24.

Let $(A,\textbf{a},s)$ and $(B,\textbf{b},t)$ be acyclic structures with $(B,\textbf{b},t)\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}(A,\textbf{a},s)$ , and such that $(A,\textbf{a},s)$ is a core. Let $B^{\prime}=\textrm{radial-extension}_{h:(B,t)\to(A,s)}(B)$ for some $h:(B,\textbf{b},t)\to(A,\textbf{a},s)$ . Then $(B,\textbf{b},t)\to(B^{\prime},\textbf{b},t)\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}(A,\textbf{a},s)$ .

Proof.

Clearly, $(B^{\prime},\textbf{b},t)$ extends $(B,\textbf{b},t)$ . It is also clear from the construction that the map $h$ naturally extends to a homomorphism $g:(B^{\prime},\textbf{b},t)\to(A,\textbf{a},s)$ (using the natural projection for the additional elements of $B^{\prime}$ ).

Let us show now that $(A,\textbf{a},s)$ does not map homomorphically to $(B^{\prime},\textbf{b},t)$ . For the sake of a contradiction, assume that $h^{\prime}:(A,\textbf{a},s)\to(B^{\prime},\textbf{b},t)$ . We can assume from Lemma 2.1 that the composition of $g$ and $h^{\prime}$ is injective. Since $(A,\textbf{a},s)\not\to(B,\textbf{b},t)$ , $h^{\prime}$ must map some element $c$ of $A$ to $h^{\prime}(c)=d$ , where $d$ is an element of $B^{\prime}$ that does not belong to $B$ . Then $d$ belongs to an isomorphic copy of $A$ that was added for some fact $f$ of $B$ , where $f$ contains elements $b_{1},b_{2}$ with $b_{1}<_{t}b_{2}$ . Let $b_{1}^{\prime}$ be the copy of $h(b_{1})$ belonging to the respective isomorphic copy of $A$ . Since $h^{\prime}(s)=t$ it then follows that the image according to $h^{\prime}$ of the nodes in the shortest path connecting $c$ with $s$ must necessarily contain all nodes in the shortest path (in $B^{\prime}$ ) connecting $d$ and $t$ . Note that the path connecting $d$ and $t$ must contain both $b^{\prime}_{1}$ and $b_{1}$ . Since $g(b^{\prime}_{1})=g(b_{1})$ it follows that the composition of $g$ and $h^{\prime}$ is non-injective, a contradiction. ∎

Now, let $(A,\textbf{a})$ be c-connected and acyclic. We may also assume that $(A,\textbf{a})$ is a core. We define $\mathfrak{F}$ to be the set of all structures obtained as follows:⁷⁷7Note that $\mathfrak{F}$ depends on $(A,\textbf{a})$ , even though we did not make this dependence explicit in our notation here.

(1)

For each proper partition $(X,Y)$ of the distinguished elements, we add $\textrm{split}_{(X,Y)}(A,\textbf{a})$ to $\mathfrak{F}$ , provided there are $a_{i}\in X$ and $a_{j}\in Y$ such that $a_{i}$ and $a_{j}$ are connected in $A$ . (Note that, by construction, the corresponding distinguished elements $a_{i}^{(1)}$ and $a_{j}^{(2)}$ are disconnected in $\textrm{split}_{(X,Y)}(A,\textbf{a})$ .)
(2)

For each proper partition $(X,Y)$ of the distinguished elements, and for each fact $R(c,d)$ of $A$ , let $(B,\textbf{b})$ be the structure obtained by extending $\textrm{split}_{(X,Y)}(A,\textbf{a})$ with the fact $R(c^{(1)},d^{(2)})$ . We add $(B,\textbf{b})$ to $\mathfrak{F}$ , provided that there exists distinguished elements $a_{i}\in X$ and $a_{j}\in Y$ , such that $a_{i}$ and $a_{j}$ are connected in $A$ by a path of some length $n$ , but there is no path of length $n$ connecting the corresponding distinguished elements $a_{i}^{(1)}$ and $a_{j}^{(2)}$ in $(B,\textbf{b})$ .
(3)

For each skeleton node $s$ , and for every $(F,s)$ in the frontier of the offshot $(A|s,s)$ (as in Theorem 3.18) we add to $\mathfrak{F}$ the structure obtained from $(A,\textbf{a})$ by: (i) removing the offshoot $(A|s,s)$ and replacing it $(F,s)$ , resulting in a new structure $(A^{\prime},\textbf{a})$ , and (ii) taking $B=\textrm{radial-extension}_{h:(A^{\prime},s)\to(A,s)}(A^{\prime})$ , where $h$ is the homomorphism from $(F,s)\to(A|s,s)$ , extended to the entire structure $A^{\prime}$ by mapping every element outside the offshoot to itself.

The intuition behind this construction is that (1)–(3) capture different ways of homomorphically weakening an acyclic structure with designated elements: in (1), we take two connected distinguished elements and force them to become disconnected. In (2), we increase the length of the shortest path between two distinguished elements, by forcing the minimal path to go through a specified edge. In (3), we make no change to the skeleton of the structure but we weaken one of the offshoots.

Proposition 3.25.

If $(A,\textbf{a})$ is c-connected, acyclic, and is a core, then the set of structures $\mathfrak{F}$ constructed above is a frontier for $(A,\textbf{a})$ w.r.t. the class of c-connected acyclic structures.

Proof.

It is clear that each structure in $\mathfrak{F}$ homomorphically maps to $(A,\textbf{a})$ .

We also claim that there is no homomorphism from $(A,\textbf{a})$ to any structure $(B,\textbf{b})\in\mathfrak{F}$ . For (1) and (2) this follows immediately from the construction. For (3), the argument is as follows: let $s$ be the skeleton node whose offshoot was replaced, and let $(F,s)\in\mathfrak{F}^{*}_{s}$ be the frontier structure was used as the replacement of $(A|s,s)$ . Suppose, for the sake of a contradiction, that $(A,\textbf{a})\to(B,\textbf{b})$ . By Lemma 3.24, then, there is already a homomorphism $h:(A,\textbf{a},s)\to(A^{\prime},\textbf{a},s)$ . By Lemma 2.1, we may assume that the composition of $h$ with the natural homomorphism from $h^{\prime}:(A^{\prime},\textbf{a},s)$ to $(A,\textbf{a},s)$ is the identity function on $A$ . Since $(A|s,s)\not\to(F,s)$ is follows that there exists some element $a$ in $A|s$ such that $h(a)$ does not belong to $(F,s)$ . However, this contradicts the fact that $h^{\prime}(h(a))$ is the identity.

It remains to establish the last property of frontiers: let $(C,\textbf{c})$ be any c-connected and acyclic structure such that there is a homomorphism $h:(C,\textbf{c})\to(A,\textbf{a})$ and such that $(A,\textbf{a})\not\to(C,\textbf{c})$ .

We can distinguish three cases:

The first case is where $(C,\textbf{c})$ differs from $(A,\textbf{a})$ in the number of connected components. Since both structures are c-connected and $(C,\textbf{c})\to(A,\textbf{a})$ , this can only happen if there are distinguished elements $c_{i},c_{j}$ that belong to different components in $(C,\textbf{c})$ but such that $h(c_{i}),h(c_{j})$ are connected in $(A,\textbf{a})$ . In this case, we use (1) above. Specifically, let $X=\{h(c_{k})\mid\text{$c_{k}\in\textbf{c}$ is connected to $c_{i}$}\}$ , and let $Y=\{\textbf{a}\}\setminus X$ . Note that $h(c_{j})\in Y$ . Then it is easy to see that $(C,\textbf{c})\to\textrm{split}_{(X,Y)}(A,\textbf{a})$ .

The second case is there there exists distinguished elements $c_{i},c_{j}$ connected in $C$ such that the image of the path $s_{1},\dots,s_{m}$ in $C$ connecting $c_{i}$ and $c_{j}$ is not injective. It follows from the acyclicity of $A$ that there exists $\ell$ such that $h(s_{\ell})=h(s_{\ell+2})$ . Let $f$ be the fact in $C$ joining $s_{\ell}$ and $s_{\ell+1}$ and let $Z$ be the set containing all elements in $C$ that remain connected to $c_{i}$ after removing fact $f$ . Then consider the structure $(B,\textbf{b})$ obtained as in (2) above by setting $X$ to be the set of distinguished elements in $h(Z)$ , $Y$ to be the rest of distinguished elements in $C$ , and the fact of $A$ used in the construction to be the $h$ -image of $f$ . If we let $a_{i}=h(c_{i})$ and $a_{j}=h(c_{j})$ it follows directly from the construction that the distance of $a_{i}^{(1)}$ and $a_{j}^{(2)}$ in $(B,\textbf{b})$ is larger than the distance of $a_{i}$ and $a_{j}$ in $A$ , and hence $(B,\textbf{b})\in\mathfrak{F}$ . Finally, it follows easily that the map $h^{\prime}$ sending every element $c\in C$ to $h(c)^{(1)}$ if $c\in Z$ and to $h(c)^{(2)}$ otherwise defines a homomorphism from $(C,\textbf{c})$ to $(B,\textbf{b})$ .

The third case is where none of the previous two cases hold. We first note that if the second case does not hold then it must be the case that the restriction of $h$ to each connected component of $\operatorname{skeleton}(C,\textbf{c})$ must be injective. Furthermore, since $(C,\textbf{c})$ and $(A,\textbf{a})$ have the same number of connected components it follows that $h$ maps $\operatorname{skeleton}(C,\textbf{c})$ isomorphically to $\operatorname{skeleton}(A,\textbf{a})$ . It follows that there exists some node $s$ in the skeleton of $C$ such that the offshoot $(A|h(s),h(s))$ does not homomorphically map to $(C,s)$ , since otherwise, $(A,s)\to(C,s)$ . Consider the offshoot $(C|s,s)$ of $C$ and let $D$ be the maximal substructure of $C$ that contains $s$ , is connected, and satisfies the property that $h(D)$ is contained in $A|h(s)$ and that no other element, besides $s$ , is mapped by $h$ to $h(s)$ . Since $(D,s)\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}(A|h(s),h(s))$ it follows that there exists some homomorphism $g$ from $(D,s)$ to some structure $(F,h(s))$ in the frontier of $(A|h(s),h(s))$ . Now consider the structure $(B,\textbf{b})$ in $\mathfrak{F}$ produced in step (3) above for $h(s)$ and $(F,h(s))$ . We shall construct a homomorphism $h^{\prime}$ from $(C,\textbf{c})$ to $(B,\textbf{b})$ .

Let $(A^{\prime},\textbf{a})$ as constructed in step (3) and let $(E,\textbf{c})$ be the maximal $c$ -connected substructure of $(C,\textbf{c})$ containing $\operatorname{skeleton}(C,\textbf{c})$ such that no other element in $E$ besides $s$ is mapped by $h$ to $h(s)$ . Note that $E$ contains $D$ . We shall start by defining $h^{\prime}$ on $E$ so that $h^{\prime}$ defines a homomorphism from $(E,\textbf{c})$ to $(A^{\prime},\textbf{a})$ . If $e\in D$ then we define $h^{\prime}(d)$ to be $g(d)$ . If $e$ does not belong to $D$ then it follows that $h(e)$ cannot be in the offshot $A|h(s)$ . In this case we define $h^{\prime}(e)$ to be $h(e)$ . Clearly $h^{\prime}$ defines as well a homomorphism from $(E,\textbf{c})$ to $(B,\textbf{a})$ (since $B$ contains a copy of $A^{\prime}$ ). It only remains to extend $h^{\prime}$ to all the elements in $C$ . For every maximal connected substructure $F$ of $C$ not containing any element in $E$ we extend $h^{\prime}$ to $F$ as follows. Since $E$ contains $\operatorname{skeleton}(C,\textbf{c})$ and $(C,\textbf{c})$ is $c$ -connected it follows that there is some fact in $C$ joining elements $e\in E$ and $f\in F$ . By the maximality of $E$ it follows that $h(f)=h(s)$ . Necessarily $h(f)<_{h(s)}h(e)$ and, consequently, we can define $h^{\prime}$ so that $f$ and every other element in $F$ is mapped according to $h$ in the copy of $A$ in $B$ introduced due to $h(f)<_{h(s)}h(e)$ .

Incidentally, it may be worth noting that we did not even make full use of the radial extension: for the purpose of this proof, it would have sufficed in case (3) to use a restricted version of radial extension where a copy of the original structure is added only for every fact connected with $s$ . ∎

Lifting the restriction to binary relations

This concludes the proof for the case with binary relations only. We now show how to lift the result to schemas containing relations of arbitrary arity.

For a schema $\mathcal{S}$ , let $\mathcal{S}^{*}$ be the schema containing for each $n$ -ary relation $R\in\mathcal{S}$ , $n$ binary relations $R_{1},\ldots,R_{n}$ . For any structure $A$ over schema $\mathcal{S}$ , let $A^{*}$ be the structure over schema $\mathcal{S}^{*}$ whose domain consists of all elements in the domain of $A$ as well as all facts of $A$ , and containing all facts of the form $R_{i}(b,f)$ where $f$ is a fact of $A$ , of the form $R(\textbf{a})$ with $a_{i}=b$ . Intuitively, we can think of $A^{*}$ as a bipartite encoding of the structure $A$ . Conversely, we associate to every structure $B$ over the schema $\mathcal{S}^{*}$ a corresponding structure $B_{*}$ over the original schema $\mathcal{S}$ , namely the structure whose domain is the same as that of $B$ and containing all facts of the form $R(a_{1},\ldots,a_{n})$ for which it is the case that $B$ satisfies $\exists y\bigwedge_{i=1\ldots n}R_{i}(a_{i},y)$ . Note that $(A^{*})_{*}=A$ but $(B_{*})^{*}$ need not be isomorphic to $B$ .

Lemma 3.26.

For all structures $A,A^{\prime}$ over schema $\mathcal{S}$ and structures $B$ over schema $\mathcal{S}^{*}$ :

(1)

If $(B,\textbf{b})\to(A^{*},\textbf{a})$ then $(B_{*},\textbf{b})\to(A,\textbf{a})$ .
(2)

$(A,\textbf{a})\to(B_{*},\textbf{b})$ iff $(A^{*},\textbf{a})\to(B,\textbf{b})$ .
(3)

$(A,\textbf{a})\to(A^{\prime},\textbf{a'})$ iff $(A^{*},\textbf{a})\to({A^{\prime}}^{*},\textbf{a'})$ .
(4)

If $(A,\textbf{a})$ is core, c-connected, and acyclic, then so is $(A^{*},\textbf{a})$
(5)

If $(B,\textbf{b})$ is acyclic, then so is $(B_{*},\textbf{b})$ .

Proof.

1. Let $h:(B,\textbf{b})\to(A^{*},\textbf{a})$ . It is easy to see that, for each element $b$ of $B_{*}$ that participates in at least one fact, it must be the case that $h(b)$ belongs to the domain of $A$ . Note that elements of $B_{*}$ that do not participate in any fact can be ignored, as they can be mapped to an arbitrary element of $A$ . Finally, it is clear from the constructions that, whenever $R(b_{1},\ldots,b_{n})$ holds true in $B_{*}$ , then $R(h(b_{1}),\ldots,h(b_{n}))$ holds true in $A$ .

2. Suppose $h:(A,\textbf{a})\to(B_{*},\textbf{b})$ . We can extend $h$ to the entire domain of $A^{*}$ as follows: let $f$ be any fact of $A$ of the form $R(a_{1},\ldots,a_{n})$ . Since $B_{*}$ satisfies $R(h(a_{1}),\ldots,h(a_{n}))$ , this means that $B$ must satisfy $\exists y\bigwedge_{i=1\ldots n}R_{i}(h(a_{i}),y)$ . Choose any such $y$ as the image of the fact $f$ . Doing this for each fact, we obtain a mapping $h^{\prime}$ that extends $h$ to the entire domain of $A^{*}$ . Moreover, whenever $R_{i}(a_{i},f)$ holds in $A^{*}$ , then, by construction, $R_{i}(h^{\prime}(a_{i}),h^{\prime}(f))$ holds true in $B$ . In other words $h^{\prime}$ is a homomorphism from $(A^{*},\textbf{a})$ to $(B,\textbf{b})$

Conversely, suppose $h:(A^{*},\textbf{a})\to(B,\textbf{b})$ . Let $h^{\prime}$ be the restriction of $h$ to elements of $A$ . We claim that the mapping $h^{\prime}:(A,\textbf{a})\to(B_{*},\textbf{b})$ is a homomorphism. Let $f=R(a_{1},\ldots,a_{n})$ be any fact of $A$ . Then $B$ satisfies $R_{i}(h^{\prime}(a_{i}),h(f))$ for all $i=1\ldots n$ . Therefore, by construction, $B_{*}$ satisfies $R(h^{\prime}(a_{1}),\ldots,h^{\prime}(a_{n}))$ .

3. Every homomorphism $h:(A,\textbf{a})\to(A^{\prime},\textbf{a'})$ natural extends to a homomorphism from $(A^{*},\textbf{a})$ to $({A^{\prime}}^{*},\textbf{a'})$ by sending each fact $R(a_{1},\ldots,a_{n})$ to $R(h(a_{1}),\ldots,h(a_{n}))$ . Conversely, every homomorphism $h:(A^{*},\textbf{a})\to({A^{\prime}}^{*},\textbf{a'})$ , when restricted to elements of $A$ , is clearly a homomorphism from $(A,\textbf{a})$ to $(A^{\prime},\textbf{a'})$ .

4. For acyclicity, this holds by definition (recall that acyclicity was defined by reference to the incidence graph, in the first place; and note that the incidence graph of $(A^{*},a)$ is obtained from that of $(A,a)$ by subdividing every edge in two). Similarly, it is easy to see that whenever $(A,\textbf{a})$ is c-connected, then so is $(A^{*},\textbf{a})$ . Finally, to show that core-ness is preserved, we proceed by contraposition: suppose $(A^{*},\textbf{a})$ is not a core, i.e., admits a proper endomorphism $h$ . It is not hard to see that $h$ must map elements of $A$ to elements of $A$ , and fact of $A$ to facts of $A$ . Therefore, $h$ must either map two distinct elements of $A$ to the same element, or two distinct facts of $A$ to the same fact. However, even in the latter case, the only way that this can happen is if $h$ also maps two distinct elements to the same element. It follows that $h$ also induces a proper endomorphism of $A$ .

5. By contraposition: suppose $(B_{*},\textbf{b})$ is not acyclic. Take a minimal cycle in the incidence graph of $(B_{*},\textbf{b})$ . Each edge that is part of the cycle (being a fact of $B_{*}$ ), by construction of $B^{*}$ , gives rise to a path of length 2 in $B$ . Therefore, we obtain a cycle in $B$ . ∎

Now, let $(A,\textbf{a})$ be any acyclic, c-connected structure over schema $\mathcal{S}$ . We may again assume that $(A,\textbf{a})$ is a core. By Lemma 3.26(4), $(A^{*},\textbf{a})$ is also acyclic, c-connected, and core. Let $F$ be an acyclic c-connected frontier for $(A^{*},\textbf{a})$ , let $F^{\prime}=\{(B_{*},\textbf{b})\mid(B,\textbf{b})\in F\}$ . We claim that $F^{\prime}$ is a frontier for $(A,\textbf{a})$ .

First, note that each structure in $F^{\prime}$ homomorphically maps to $(A,\textbf{a})$ . This follows from Lemma 3.26(1), because each $(B,\textbf{b})\in F$ homomorphically maps to $(A^{*},\textbf{a})$ . Second, note that $(A,a)$ does not homomorphically map to any structure in $F^{\prime}$ . Indeed, suppose $(A,\textbf{a})\to(B_{*},\textbf{b})$ with $(B_{*},\textbf{b})\in F^{\prime}$ . Then, by Lemma 3.26(2), $(A^{*},\textbf{a})\to(B,\textbf{b})$ , which contradicts the fact that $F$ is a frontier for $(A^{*},\textbf{a})$ . Finally, for the third property of frontiers, suppose $(C,\textbf{c})\to(A,\textbf{a})$ and $(A,\textbf{a})\not\to(C,\textbf{c})$ . Then, by Lemma 3.26(3), $(A^{*},a)\to(C^{*},\textbf{c})$ and $(C^{*},\textbf{c})\not\to(A^{*},\textbf{a})$ . Therefore, since $F$ is a frontier for $(A^{*},\textbf{a})$ , we have that $(C^{*},\textbf{c})\to(B,\textbf{b})$ for some $(B,\textbf{b})\in F$ . It follows by Lemma 3.26(2) that $(C,\textbf{c})\to(B_{*},\textbf{b})$ . Since $(B^{*},\textbf{b})\in F^{\prime}$ , this shows that $(C,\textbf{c})$ maps to a structure in $F^{\prime}$ .

Note that $F^{\prime}$ consists of acyclic structures (by Lemma 3.26(5)), but may contain structures that are not c-connected. However, this is easily addressed by the following observation, which follows from Proposition 2.3: if a set of structures $F$ is a frontier for a structure $(A,\textbf{a})$ w.r.t. a class of c-connected structures $\mathcal{C}$ , then $\{(B,\textbf{b})^{\textrm{reach}}\mid(B,\textbf{b})\in F\}$ is also a frontier for $(A,\textbf{a})$ w.r.t. $\mathcal{C}$ .

This concludes the proof of Theorem 3.12.

4. Unique Characterizations for Conjunctive Queries

In this section, we study the question of when a CQ is uniquely characterizable by a finite set of positive and/or negative examples.

Definition 4.1 (Data Examples, Fitting, Unique Characterizations).

Let $\mathcal{C}$ be a class of $k$ -ary CQs over a schema $\mathcal{S}$ (for some $k\geq 0$ ), and let $q$ be a $k$ -ary query over $\mathcal{S}$ .

(1)

A data example is a structure $(A,\textbf{a})$ over schema $\mathcal{S}$ with $k$ distinguished elements. If $\textbf{a}\in q(A)$ , we call $(A,\textbf{a})$ a positive example (for $q$ ), otherwise a negative example.
(2)

Let $E^{+},E^{-}$ be finite sets of data examples. We say that $q$ fits $(E^{+},E^{-})$ if every example in $E^{+}$ is a positive example for $q$ and every example in $E^{-}$ is a negative example for $q$ . We say that $(E^{+},E^{-})$ uniquely characterizes $q$ w.r.t. $\mathcal{C}$ if $q$ fits $(E^{+},E^{-})$ and every $q^{\prime}\in\mathcal{C}$ that fits $(E^{+},E^{-})$ is logically equivalent to $q$ .

It turns out that there is a precise correspondence between unique characterizations and frontiers. Recall that the canonical structure of a query $q$ is denoted by $\widehat{q}$ . Similarly, for any class of CQs $\mathcal{C}$ , we will denote by $\widehat{\mathcal{C}}$ the class of structures $\{\widehat{q}\mid q\in\mathcal{C}\}$ .

Proposition 4.2 (Frontiers vs Unique Characterizations).

Fix a schema $\mathcal{S}$ and $k\geq 0$ . Let $q$ be any $k$ -ary CQ over $\mathcal{S}$ and $\mathcal{C}$ a class of $k$ -ary CQs over $\mathcal{S}$ .

(1)

If $F$ is a frontier for $\widehat{q}$ w.r.t. $\widehat{\mathcal{C}}$ , then $(E^{+}=\{\widehat{q}\},E^{-}=F)$ uniquely characterizes $q$ w.r.t. $\mathcal{C}$ .
(2)

Conversely, if $(E^{+},E^{-})$ uniquely characterizes $q$ w.r.t. $\mathcal{C}$ , then $F=\{\widehat{q}\times(B,\textbf{b})\mid(B,\textbf{b})\in E^{-}\}$ is a frontier for $\widehat{q}$ w.r.t. $\widehat{\mathcal{C}}$ .

Proof.

1. Let $F$ be a frontier for $\widehat{q}$ w.r.t. $\widehat{\mathcal{C}}$ , let $q^{\prime}\in\mathcal{C}$ be a conjunctive query that fits $(E^{+}=\{\widehat{q}\},E^{-}=F)$ . From the fact that the canonical structure $\widehat{q}$ is a positive example for $q^{\prime}$ , it follows that there is a homomorphism from $\widehat{q^{\prime}}$ to $\widehat{q}$ . Furthermore, since all structures in the set $F$ are negative examples for $q^{\prime}$ , we know that $\widehat{q^{\prime}}$ does not homomorphically map to any of these structures. Since $F$ is a frontier w.r.t. $\widehat{\mathcal{C}}$ and $\widehat{q^{\prime}}\in\widehat{\mathcal{C}}$ , we can conclude that $\widehat{q}$ homomorphically maps to $\widehat{q^{\prime}}$ . Therefore, $\widehat{q}$ and $\widehat{q^{\prime}}$ are homomorphically equivalent, which implies that $q$ and $q^{\prime}$ are logically equivalent.

2. Let $(E^{+},E^{-})$ uniquely characterize $q$ w.r.t, $\mathcal{C}$ , and let $F=\{\widehat{q}\times(B,\textbf{b})\mid(B,\textbf{b})\in E^{-}\}$ . It follows from the basic properties of the direct product operation that each structure in $F$ homomorphically maps to $\widehat{q}$ . Furthermore, if there were a homomorphism from $\widehat{q}$ to some structure $\widehat{q}\times(B,\textbf{b})\in F$ , then there would be a homomorphism from $\widehat{q}$ to $(B,\textbf{b})$ , which would imply that $(B,\textbf{b})$ is a positive example for $q$ , which we know is not the case. Finally, consider any $\widehat{q^{\prime}}\in\widehat{\mathcal{C}}$ such that $\widehat{q^{\prime}}\to\widehat{q}$ and $\widehat{q}\not\to\widehat{q^{\prime}}$ . This implies that $q$ and $q^{\prime}$ are not logically equivalent, and hence, since $q^{\prime}\in\mathcal{C}$ , the two queries must disagree on some example in $E^{+}$ or $E^{-}$ . However, it follows from the fact that $\widehat{q}\to\widehat{q}$ , that all positive examples for $q$ are also positive examples for $q^{\prime}$ . Therefore, some structure $(B,\textbf{b})\in E^{+}$ must be a positive example for $q^{\prime}$ , that is, $\widehat{q^{\prime}}\to(B,\textbf{b})$ . It follows that $\widehat{q^{\prime}}\to\widehat{q}\times(B,\textbf{b})$ . ∎

Proposition 4.2 allows us to take the results on frontiers from the previous section, and rephrase them in terms of unique characterizations. Incidentally, note that results in (AlexeCKT2011, ) imply an analogous relationship between finite dualities and uniquely characterizing sets of examples for unions of conjunctive queries. We need two more lemmas. Recall that a structure $(A,\textbf{a})$ corresponds to a conjunctive query only if every distinguished element occurs in at least one fact. Let us call such structures safe. The following lemmas, essentially, allow us to ignore unsafe structures, thereby bridging the gap between structures and CQs.

Lemma 4.3.

Let $q$ be a $k$ -ary CQ over schema $\mathcal{S}$ and $\mathcal{C}$ a class of $k$ -ary CQs over $\mathcal{S}$ . If $q$ is uniquely characterized w.r.t. $\mathcal{C}$ by positive and negative examples $(E^{+},E^{-})$ , then $E^{+}$ consists of safe structures and $q$ is uniquely characterized w.r.t. $\mathcal{C}$ by $(E^{+},\{(A,\textbf{a})\in E^{-}\mid\text{$(A,\textbf{a})$ is safe}\})$ .

Proof.

It suffices to observe that if $(A,\textbf{a})$ is not safe, then $(A,\textbf{a})$ is a negative example for every conjunctive query, and therefore, cannot meaningfully contribute to characterizing any given conjunctive query w.r.t. a class of CQs. ∎

Lemma 4.4.

A safe structure has a frontier w.r.t. all structures if and only if it has a frontier w.r.t. the class of all safe structures.

Proof.

The left-to-right direction is trivial. The right-to-left direction relies on a homomorphism duality argument of sorts: let $(A,\textbf{a})$ be any safe structure that has a frontier $F$ w.r.t. the class of all safe structures. Let $\mathcal{S}$ be its schema and $k$ the number of distinguished elements. For every non-empty set $S\subseteq\{1,\ldots,k\}$ , we will denote by $(A_{S},\textbf{a}_{S})$ the structure with two elements, denoted $b$ and $c$ , that contains all possible facts involving only $c$ and no other facts; and $\textbf{a}_{S}$ is the tuple $a_{1},\ldots,a_{k}$ , where $a_{i}=b$ if $i\in S$ , and $a_{i}=c$ otherwise. Note that, by construction, none of these structures is safe. It is not hard to see that a structure is unsafe if and only if it admits a homomorphism to a structure in the set $G=\{(A_{S},\textbf{a}_{S})\mid\text{$S\subseteq\{1,\ldots,k\}$ is non-empty}\}$ . Now, let $G^{\prime}=\{(A,\textbf{a})\times(B,\textbf{b})\mid(B,\textbf{b})\in G\}$ . Then $F\cup G^{\prime}$ is a frontier for $(A,\textbf{a})$ w.r.t. all structures. To see this, note that each structure in $F\cup G^{\prime}$ homomorphically maps to $(A,\textbf{a})$ . Furthermore, $(A,\textbf{a})$ does not map to any structure in $F$ (by initial assumption) or in $G^{\prime}$ (because $G^{\prime}$ consists of unsafe structures while $(A,\textbf{a})$ is safe). Finally, consider any $(C,\textbf{c})\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}(A,\textbf{a})$ . If $(C,\textbf{c})$ is safe, then it maps to a structure in $F$ . Otherwise, it maps to a structure in $G$ and hence also to the corresponding structure in $G^{\prime}$ . In either case, it maps to a structure in $F\cup G^{\prime}$ . ∎

Putting everything together, we obtain the main result of this section. We call a CQ $q$ c-acyclic (or acyclic, or c-connected) if the structure $\widehat{q}$ is c-acyclic (resp. acyclic, c-connected).

Theorem 4.5.

Fix a schema and fix $k\geq 0$ .

(1)

If $\mathcal{C}$ is a class of $k$ -ary CQs such that $\widehat{C}$ has bounded expansion, then every CQ $q\in\mathcal{C}$ is uniquely characterizable w.r.t. $\mathcal{C}$ by finitely many positive and negative examples (which can be effectively constructed from the query).
(2)

A $k$ -ary CQ $q$ is uniquely characterizable by finitely many positive and negative examples (w.r.t. the class of all $k$ -ary CQs) iff $q$ is logically equivalent to a c-acyclic CQ. Moreover, for a c-acyclic CQ, a uniquely characterizing set of examples can be constructed in polynomial time.
(3)

Assume $k\geq 1$ and let $\mathcal{C}_{ca}$ be the class of $k$ -ary CQs that are c-connected and acyclic. Then every $q\in\mathcal{C}_{ca}$ is uniquely characterizable w.r.t. $\mathcal{C}_{ca}$ by finitely many positive and negative examples belonging to $\widehat{\mathcal{C}_{ca}}$ . Moreover, the set of examples in question can be constructed in polynomial time.

Remark 1.

For the purpose of applications discussed in Section 7, we note that Theorem 4.5 remains true if the safety condition for CQs were to be dropped. Indeed, the proof in this case is even simpler, as it does not require Lemma 4.3 and Lemma 4.4.

Examples showing that Theorem 4.5(3) cannot easily be generalized.

Theorem 4.5(3) applies to CQs of arity $k\geq 1$ that are c-connected, and acyclic. None of these restrictions can be dropped. Recall that a CQ of arity zero is called a Boolean CQ.

Theorem 4.6.

The Boolean acyclic connected CQ $\operatorname{{\bf T}}()\textrm{ :- }Ry_{1}y_{2}\land Ry_{2}y_{3}\land Ry_{3}y_{4}\land Ry_{4}y_{5}$ is not characterized, w.r.t. the class of Boolean acyclic connected CQs, by finitely many acyclic positive and negative examples.

Proof.

For the sake of a contradiction, assume that $(E^{+},E^{-})$ is a finite collection of acyclic positive and negative examples that uniquely characterizes $\operatorname{{\bf T}}$ within the class of Boolean acyclic connected CQs. Let $n$ be a bound on the size of the examples in $E^{+}$ and $E^{-}$ . Consider now the following Boolean acyclic connected conjunctive query $\operatorname{{\bf R}}$ (cf. Figure 6):

\operatorname{{\bf R}}()\text{ :- }\mathop{\bigwedge_{i\in\{1,\ldots,n\}\text{ odd}}}_{j\in\{1,2,3\}}R(z_{j}^{i},z_{j+1}^{i})~{}~{}~{}\land\mathop{\bigwedge_{i\in\{1,\ldots,n\}\text{ even}}}_{j\in\{2,3,4\}}R(z_{j}^{i},z_{j+1}^{i})~{}~{}~{}\land\bigwedge_{i\in\{1,\ldots,n\}\text{ even}}R(z^{i}_{3},z_{4}^{i-1})\land R(z^{i}_{2},z_{3}^{i+1})

$\begin{array}[]{c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c@{}c}&&z^{2}_{5}&&&&z^{4}_{5}&&&\\ &&\uparrow&&&&\uparrow&&&\\ z^{1}_{4}&&z^{2}_{4}&&z^{3}_{4}&&z^{4}_{4}&&z^{5}_{4}&\\ \uparrow&\nwarrow&\uparrow&&\uparrow&\nwarrow&\uparrow&&\uparrow&\nwarrow\\ z^{1}_{3}&&z^{2}_{3}&&z^{3}_{3}&&z^{4}_{3}&&z^{5}_{3}&&~{}~{}\cdots\\ \uparrow&&\uparrow&\nearrow&\uparrow&&\uparrow&\nearrow&\uparrow&\\ z^{1}_{2}&&z^{2}_{2}&&z^{3}_{2}&&z^{4}_{2}&&z^{5}_{2}&\\ \uparrow&&&&\uparrow&&&&\uparrow&\\ z^{1}_{1}&&&&z^{3}_{1}&&&&z^{5}_{1}&\\ \end{array}$

Figure 6. The (canonical structure of the) conjunctive query

\operatorname{{\bf R}}

Note that $\operatorname{{\bf R}}$ and $\operatorname{{\bf T}}$ are not logically equivalent since $\operatorname{{\bf R}}$ does not contain any directed path of length $4$ . The mapping that sends $z^{i}_{j}$ to $y_{j}$ defines a homomorphism from (the canonical structure of) $\operatorname{{\bf R}}$ to (the canonical structure of) $\operatorname{{\bf T}}$ . This implies that $\operatorname{{\bf T}}\subseteq\operatorname{{\bf R}}$ , that is, every positive example for $\operatorname{{\bf T}}$ is also a positive example for $\operatorname{{\bf R}}$ , and hence, in particular, $\operatorname{{\bf R}}$ fits all the positive examples in $E^{+}$ . Since $\operatorname{{\bf R}}$ and $\operatorname{{\bf T}}$ are not logically equivalent, $\operatorname{{\bf R}}$ must therefore disagree with $\operatorname{{\bf T}}$ on one of the negative examples, that is, some example $A\in E^{-}$ is a positive example for $\operatorname{{\bf R}}$ . Let $h:\widehat{\operatorname{{\bf R}}}\to A$ be a witnessing homomorphism.

Claim: There are two elements $u$ and $v$ in $\widehat{\operatorname{{\bf R}}}$ at distance $2$ (distance, here, is measured in the underlying tree of $\widehat{\operatorname{{\bf R}}}$ ) such that $h(u)=h(v)$ .

Proof of claim: Since $A$ has at most $n$ elements it follows that $h$ is not injective, so there are two elements $u^{\prime}$ , $v^{\prime}$ of $\widehat{\operatorname{{\bf R}}}$ that are mapped by $h$ to the same element. Now, consider the unique path $u^{\prime}=x_{1},\dots,x_{m}=v^{\prime}$ in (the underlying tree of) $\widehat{\operatorname{{\bf R}}}$ connecting $u^{\prime}$ and $v^{\prime}$ . Then, $h(x_{1}),\dots,h(x_{m})$ is a walk in (the underlying tree of) $A$ . Indeed, it is a closed walk since $h(x_{1})=h(x_{m})$ . Now, since $h(x_{1}),\dots,h(x_{m})$ is a closed walk in a tree, it must backtrack in some vertex $h(x_{i})$ . This means that $h(x_{i-1})=h(x_{i+1})$ . End of proof of claim.

Let $u=z^{i}_{j}$ and $v=z^{i^{\prime}}_{j^{\prime}}$ be the two elements at distance $2$ that are mapped by $h$ to the same element in $A$ . It follows from the definition of $\operatorname{{\bf R}}$ that, since $u$ and $v$ are at distance $2$ , $|i-i^{\prime}|\leq 1$ . We first consider the case where $i=i^{\prime}$ . In this case, we can assume wlog. that $j^{\prime}=j+2$ . It then follows that $h(z^{i}_{j}),h(z^{i}_{j+1}),h(z^{i}_{j+2})$ is a directed cycle in $A$ , a contradiction because $A\in E^{-}$ , and $E^{-}$ was assumed to consist of acyclic structures. Next, consider the case where $i\neq i^{\prime}$ . Then $|i-i^{\prime}|=1$ . We can assume without loss of generality that $i$ is even and $i^{\prime}$ is odd. Again it follows directly by the construction of $\operatorname{{\bf R}}$ that $j^{\prime}=j$ . Note that $h(z^{i^{\prime}}_{1}),\dots,h(z^{i^{\prime}}_{j^{\prime}})$ is a directed path in $A$ of length $j^{\prime}-1$ . Similarly $h(z^{i}_{j}),\dots,h(z^{i}_{5})$ is a directed path in $A$ of length $5-j$ . Since $h(z^{i}_{j})=h(z^{i^{\prime}}_{j^{\prime}})$ it follows that we can concatenate the two directed paths obtaining a directed path of length $4+j^{\prime}-j\geq 4$ . Consequently $\widehat{\operatorname{{\bf T}}}$ homomorphically maps to $A$ , a contradiction because $A\in E^{-}$ . ∎

This shows that in Theorem 4.5(3), the restriction to non-Boolean queries cannot be dropped. Similarly, the restriction to c-connected queries cannot be dropped, and acyclicity cannot be replaced by the weaker condition of c-acyclicity.

Theorem 4.7.

The unary acyclic CQ $\operatorname{{\bf T}}^{\prime}(x)\text{ :- }P(x)\land Ry_{1}y_{2}\land Ry_{2}y_{3}\land Ry_{3}y_{4}\land Ry_{4}y_{5}$ is not uniquely characterizable, w.r.t. the class of unary acyclic CQs, by finitely many acyclic positive and negative examples.

Proof.

Assume for the sake of a contradiction that there are finitely many acyclic positive and negative examples $(E^{+},E^{-})$ that uniquely characterizes $\operatorname{{\bf T}}^{\prime}(x)$ w.r.t. the class of unary acyclic CQs. We will construct acyclic positive and negative examples $(E^{\prime+},E^{\prime-})$ that uniquely characterizes $\operatorname{{\bf T}}$ w.r.t. the class of Boolean acyclic c-connected CQs, contradicting Theorem 4.6. For each example in $(A,a)\in E^{+}\cup E^{-}$ , we take every connected component of $A$ (without any distinguished element) and add it to $E^{\prime+}$ or $E^{\prime-}$ , depending on whether the component in question satisfies $\operatorname{{\bf T}}$ . Note that each of these examples is acyclic. By construction, the query $\operatorname{{\bf T}}$ fits $(E^{\prime+},E^{\prime-})$ . We claim that $(E^{\prime+},E^{\prime-})$ uniquely characterizes $\operatorname{{\bf T}}$ w.r.t. the class of Boolean acyclic connected CQs. To see this, let $q$ be any Boolean acyclic connected query that fits $(E^{\prime+},E^{\prime-})$ . Now, let $q^{\prime}(x)$ be the unary acyclic (disconnected) query that is the conjunction of $q$ with $P(x)$ . Then it is not hard to see that $q^{\prime}(x)$ fits $(E^{+},E^{-})$ , and therefore, $q^{\prime}(x)$ is equivalent to $\operatorname{{\bf T}}^{\prime}(x)$ . From this, it easily follows that $q$ must be equivalent to $\operatorname{{\bf T}}$ : any counterexample to the equivalence of $q$ and $\operatorname{{\bf T}}$ can be extended to a counterexample to the equivalence of $q^{\prime}(x)$ and $\operatorname{{\bf T}}^{\prime}(x)$ simply by adding an isolated element satisfying $P$ . ∎

Theorem 4.8.

The unary c-acyclic c-connected CQ $\operatorname{{\bf T}}^{\prime\prime}(x)\text{ :- }Ry_{1}y_{2}\land Ry_{2}y_{3}\land Ry_{3}y_{4}\land Ry_{4}y_{5}\land\bigwedge_{i=1\ldots 5}Rxy_{i}$ is not uniquely characterizable, w.r.t. the class of unary c-acyclic c-connected CQs, by finitely many c-acyclic positive and negative examples.

Proof.

Assume towards a contradiction that $\operatorname{{\bf T}}^{\prime\prime}$ is uniquely characterized, w.r.t. the class of unary c-acyclic c-connected CQs, by a finite collection of c-acyclic positive and negative examples $(E^{+},E^{-})$ . Let $E^{\prime-}$ be the set that contains, for each $(A,a)\in E^{-}$ , the substructure $A^{\prime}$ of $A$ consisting of all elements $b$ satisfying $R(a,b)$ . Note that this substructure does not include $a$ itself (because if $R(a,a)$ was true in $A$ , then $(A,a)$ would have been a positive example for $\operatorname{{\bf T}}^{\prime\prime}$ ) and therefore must be acyclic. We claim that $(\{\widehat{\operatorname{{\bf T}}}\},E^{\prime-})$ uniquely characterizes $\operatorname{{\bf T}}$ w.r.t. the class of Boolean acyclic connected CQs, contradicting Theorem 4.6.

Let $q$ be any Boolean acyclic connected conjunctive query that fits $(\{\widehat{\operatorname{{\bf T}}}\},E^{\prime-})$ . Let $q^{\prime}(x)$ be the conjunctive query expressing that $q$ holds true in the substructure consisting of elements reachable by an $R$ -edge from $x$ . Note that $q^{\prime}(x)$ can be obtained by extending $q$ with an additional conjunct $R(x,y)$ for every variable $y$ occurring in $q$ . Also, note that $q^{\prime}(x)$ is c-acyclic and c-connected. Since $\widehat{\operatorname{{\bf T}}}$ is a positive example for $q$ , there is a homomorphism $h:\widehat{q}\to\widehat{\operatorname{{\bf T}}}$ . By extending $h$ in the obvious way, we have that $\widehat{q^{\prime}}\to\widehat{\operatorname{{\bf T}}^{\prime\prime}}$ . Therefore, $\operatorname{{\bf T}}^{\prime\prime}\subseteq q^{\prime}$ , and hence every positive example for $\operatorname{{\bf T}}^{\prime\prime}$ is also a positive example for $q^{\prime}$ , and hence $q^{\prime}$ fits all the positive examples in $E^{+}$ . Similarly, $q^{\prime}$ fits all negative examples in $E^{-}$ , because, if it did not, then there would be a homomorphism from $\widehat{q^{\prime}}$ to some $(A,a)\in E^{-}$ , from which it would clearly follow that $q$ homomorphically maps to the corresponding $A^{\prime}\in E^{\prime-}$ , which we know is not the case. Therefore, $q^{\prime}$ fits all examples in $(E^{+},E^{-})$ , and hence, $q^{\prime}$ is logically equivalent to $\operatorname{{\bf T}}^{\prime\prime}$ . It follows that $q$ must be logically equivalent to $\operatorname{{\bf T}}$ : any counterexample for the equivalence of $q$ and $\operatorname{{\bf T}}$ can be extended to a counterexample for the equivalence of $q^{\prime}$ and $\operatorname{{\bf T}}^{\prime\prime}$ by adding a fresh distinguished element connected to all existing elements by means of an $R$ -edge. ∎

5. Exact learnability with membership queries

The unique characterization results in the previous section immediately imply (not-necessarily-efficient) exact learnability results:

Theorem 5.1.

Fix a schema and $k\geq 0$ . Let $\mathcal{C}$ be a computably enumerable class of $k$ -ary CQs. If $\widehat{\mathcal{C}}$ has bounded expansion, then $\mathcal{C}$ is exactly learnable with membership queries.

The learning algorithm in question simply enumerates all queries $q\in\mathcal{C}$ and uses membership queries to test if the goal query fits the uniquely characterizing set of examples of $q$ (cf. Theorem 4.5(1)). Unfortunately, this learning algorithm does not run in polynomial time. Indeed, the number of membership queries is not known to be bounded by any fixed tower of exponentials (even for classes $\mathcal{C}$ for which membership can be tested in polynomial time). For the special case of c-acyclic queries, we can do a little better by taking advantage of the fact that a uniquely characterizing set of examples can be constructed in polynomial time. Indeed, the class of c-acyclic $k$ -ary CQs is exponential-time exactly learnable with membership queries: the learner can simply enumerates all c-acyclic queries in order of increasing size. For each query $q$ (starting with the smallest query), it uses Theorem 4.5(2) to test, using polynomially many membership queries, whether the goal query is equivalent to $q$ . After at most $2^{O(n)}$ many attempts (where $n$ is the size of the goal query), the algorithm is guaranteed to find a query that is equivalent to the goal query.⁸⁸8Similarly, by Theorem 4.5(3), the class of unary acyclic c-connected queries is exponential-time exactly learnable with subset queries, where a subset query is an oracle query asking whether a given CQ from the concept class is implied by the goal query. Subset queries correspond precisely to membership queries where the example is the canonical structure of a query from the concept class. Our main result in this section improves on this by establishing efficient (i.e., polynomial-time) exact learnability:

Theorem 5.2.

For each schema and $k\geq 0$ , the class of c-acyclic $k$ -ary CQs is efficiently exactly learnable with membership queries.

At a high level, the learning algorithm works by maintaining a c-acyclic hypothesis that is an over-approximation of the actual goal query. At each iteration, the hypothesis is strengthened by replacing it with one of the elements of its frontier, a process that is shown to terminate and yield a query that is logically equivalent to the goal query. Note, however, that the frontier of a c-acyclic structure does not, in general, consist of c-acyclic structures. At the heart of the proof of Theorem 5.2 lies a non-trivial argument showing how to turn an arbitrary hypothesis into a c-acyclic one with polynomially many membership queries. The detailed proof is given in Section 5.1 below.

The class of all $k$ -ary queries is not exactly learnable with membership queries (even with unbounded amount of time and the ability to ask an unbounded number of oracle queries), because exact learnability with membership queries would imply that every query in the class is uniquely characterizable, which we know is not the case. On the other hand, we have:

Theorem 5.3 (from (CateDK13:learning, )).

For each schema $\mathcal{S}$ and $k\geq 0$ , the class of all $k$ -ary CQs over $\mathcal{S}$ is efficiently exactly learnable with membership and equivalence queries.

In fact, it follows from results in (CateDK13:learning, ) that the larger class of all unions of conjunctive queries is efficiently exactly learnable with membership and equivalence queries (for fixed $k$ and fixed schema). Efficient exact learnability with membership and equivalence queries is not a monotone property of concept classes, but the result from (CateDK13:learning, ) transfers to CQs as well. For the sake of completeness, a self-contained proof of Theorem 5.3 is given below as well.

Remark 2.

For the purpose of applications discussed in Section 7, we note that Theorems 5.1–5.3 remain true if the safety condition for CQs were to be dropped.

Related Work

There has been considerable prior work that formally studies the task of identifying some unknown goal query $Q$ from examples. Work in this direction includes learning CQs, Xpath queries, Sparql, tree patterns, description logic concepts, ontologies, and schema mappings among others (BonifatiCL15, ; CateDK13:learning, ; GottlobS10, ; StaworkoW12, ). We shall describe mostly the previous work regarding learning CQs. Some of the work in this direction ((Barcelo017, ; CateD15, ; Cohen94a, ; Hirata00, ; Willard10, ) for example) assumes that a background structure $A$ is fixed and known by the algorithm. In this setting, an example is a $k$ -ary tuple $(a_{1},\dots,a_{k})$ of elements in $A$ , labelled positively or negatively depending on whether it belongs or not to $Q(A)$ . In the present paper (as in (CateDK13:learning, ; Haussler89, )) we do not fix any background structure (i.e., examples are pairs of the form $(A,\textbf{a})$ ). Our setting corresponds also to the extended instances with empty background in (Cohen95, ).

In both cases a number of different learning protocols has been considered. In the reverse-engineering problem (as defined in (WeissC17, )) it is only required that the algorithm produces a query consistent with the examples. In a similar direction, the problem of determining whether such a query exists has been intensively studied under some variants (satisfiability, query-by-example, definability, inverse satisfiability) (Barcelo017, ; CateD15, ; Willard10, ). In some scenarios, it is desirable that the query produced by the learner not only explains the examples received during the training phase, but also has predictive power. In particular, the model considered in (BonifatiCS16, ) follows the paradigm of identification in the limit by Gold and requires that, additionally, there exists a finite set of examples that uniquely determines the target query $Q$ . In a different direction, the model introduced in (GottlobS10, ), inspired by the minimum description length principle, requires to produce a hypothesis consistent after some repairs. A third line of work (see (Cohen93, ; Haussler89, ; Hirata00, )) studies this problem under Valiant’s probably approximately correct (PAC) model. The present paper is part of a fourth direction based on the exact model of query identification by Angluin. In this model, instead of receiving labelled examples, the learner obtains information about the target query by mean of calls to an oracle. As far as we know, we are the first to study the exact learnability of CQs using a membership oracle.

5.1. Proofs for Theorem 5.2 and Theorem 5.3

To warm up, we first establish Theorem 5.3, because its proof is simpler. The proof relies on the following lemmas. Recall that we denote by $\widehat{q}$ the canonical structure of a conjunctive query $q$ .

Lemma 5.4.

Let $(A,\textbf{a})$ be any structure such that $\widehat{q_{goal}}\to(A,\textbf{a})$ , where $q_{goal}$ is the goal conjunctive query. Then, using membership oracle queries, we can compute in polynomial time (in the size of $(A,\textbf{a})$ ) a substructure $(A^{\prime},\textbf{a})$ of $(A,\textbf{a})$ such that $\widehat{q_{goal}}\to(A^{\prime},\textbf{a})$ , and such that $\widehat{q_{goal}}$ does not homomorphically map to any strict substructure of $(A^{\prime},\textbf{a})$ .

Proof.

It suffices to iteratively remove one of the facts from the structure, and use a membership query to test if $\widehat{q_{goal}}$ still admits a homomorphism to the structure after removing the fact in question. Once no further fact can be removed, we have arrived at $(A^{\prime},\textbf{a})$ . ∎

Recall that we call a structure safe if every distinguished element occurs in a fact, that is, the structure is the canonical structure of a conjunctive query. The proof of the next lemma is obvious.

Lemma 5.5.

Let $q$ be a $k$ -ary conjunctive query over a schema $\mathcal{S}$ , and let $(B,\textbf{b})$ be any structure over schema $\mathcal{S}$ with $k$ distinguished elements. If $\widehat{q}\to(B,\textbf{b})$ , then $(B,\textbf{b})$ is safe.

We now present the proof of Theorem 5.3.

Proof of Theorem 5.3.

The learning algorithm maintains a structure $(H,\textbf{h})$ , which we can intuitively think of as (the canonical structure of) the algorithm’s guess of the goal query. The structure $(H,\textbf{h})$ is refined in a series of iterations in such a way that at each iteration $i$ , its value, $(H_{i},\textbf{h}_{i})$ , satisfies the following properties: (i) $\widehat{q_{goal}}\to(H_{i},\textbf{h}_{i})$ , and (ii) the size of $(H_{i},\textbf{h}_{i})$ is bounded by the size of the $q_{goal}$ .

We start by considering $(A,\textbf{a})$ where $A$ is the structure containing a single node $a$ that satisfies all possible facts over the schema and a is the $k$ -ary tuple $(a,\dots,a)$ containing only element $a$ . Clearly, $\widehat{q_{goal}}$ homomorphically maps to this structure. We apply Lemma 5.4 to find a minimal substructure of it into which the goal query maps. We will denote it be $(H_{0},\textbf{h}_{0})$ .

Next, at each stage we perform an equivalence oracle query to test if the canonical conjunctive query of $(H_{i},\textbf{h}_{i})$ is logically equivalent to $q_{goal}$ . Note that, by Lemma 5.5, the structure $(H_{i},\textbf{h}_{i})$ indeed has a canonical conjunctive query. If the answer to the equivalence oracle query is “yes”, then we are done. Otherwise, we receive a counterexample $(A,\textbf{a})$ . This counterexample must be a structure in which the goal query is true but the hypothesis is false. Thus, we have $\widehat{q_{goal}}\to(A,\textbf{a})$ and $(H_{i},\textbf{h}_{i})\not\to(A,\textbf{a})$ . Recall that we also have $\widehat{q_{goal}}\to(H_{i},\textbf{h}_{i})$ . It follows that $\widehat{q_{goal}}\to(A,\textbf{a})\times(H_{i},\textbf{h}_{i})$ . We now set $(H_{i+1},\textbf{h}_{i+1})$ to be a minimal substructure of $(A,\textbf{a})\times(H_{i},\textbf{h}_{i})$ into which $\widehat{q_{goal}}$ maps (using Lemma 5.4 again). It is clear from the construction that $(H_{i+1},\textbf{h}_{i+1})\begin{array}[]{l}\to\\[-6.99997pt] \nleftarrow\end{array}(H_{i},\textbf{h}_{i})$ , and that the size of $(H_{i+1},\textbf{h}_{i+1})$ is bounded by the size of $\widehat{q_{goal}}$ (otherwise $\widehat{q_{goal}}$ would not be a minimal substructure of $(A,\textbf{a})\times(H_{i},\textbf{h}_{i})$ into which $\widehat{q_{goal}}$ maps).

All that remains to be shown is that this algorithm terminates after polynomially many iterations. We show that with each iteration, the domain size of the structure $(H_{i},\textbf{h}_{i})$ strictly increases. Suppose that the domain size of $(H_{i},\textbf{h}_{i})$ and $(H_{i+1},\textbf{h}_{i+1})$ is the same. We know that the homomorphism (natural projection) $h:(H_{i+1},\textbf{h}_{i+1})\to(H_{i},\textbf{h}_{i})$ is surjective, and that every fact of $H_{i}$ is the $h$ -image of a fact of $H_{i+1}$ (otherwise the composition with the homomorphism from $\widehat{q_{goal}}$ to $(H_{i+1},\textbf{h}_{i+1})$ would constitute a non-surjective homomorphism from $\widehat{q_{goal}}$ to $(H_{i},\textbf{h}_{i})$ , which would contradict the minimality of $(H_{i},\textbf{h}_{i})$ ). Therefore, it cannot also be injective, otherwise it would be an isomorphism. Therefore, the number of elements of $(H_{i+1},\textbf{h}_{i+1})$ is strictly greater than the number of elements in $(H_{i},\textbf{h}_{i})$ . Since the size of each $(H_{i},\textbf{h}_{i})$ is bounded by the size of $q_{goal}$ , this shows that the algorithm terminates after at most $n$ rounds, where $n$ is the size of $q_{goal}$ .

Incidentally, note that this algorithm runs in polynomial time (in the size of $q_{goal}$ ), even if the schema and the arity $k$ of the conjunctive query are not fixed but treated as part of the input. ∎

Next, in the remainder of this section, we prove Theorem 5.2.

First, we argue that we may restrict attention to schemas consisting of binary relations only. Let $q_{goal}$ be any c-acyclic goal conjunctive query over an arbitrary schema $\mathcal{S}$ and consider the corresponding conjunctive query $q_{goal}^{*}$ over the binary schema $\mathcal{S}^{*}$ , as in Lemma 3.26. By Lemma 3.26(2), every membership query w.r.t. the goal query $q_{goal}^{*}$ can be efficiently reduced to a membership query w.r.t. $q_{goal}$ . Therefore, if $q_{goal}^{*}$ can be efficiently identified using membership queries for $q_{goal}^{*}$ , then $q_{goal}^{*}$ can also be efficiently identified using membership queries for $q_{goal}$ , and consequently also $q_{goal}$ can be efficiently identified using membership queries for $q_{goal}$ (namely, by first computing a conjunctive query $q$ over $\mathcal{S}^{*}$ that is logically equivalent to $q_{goal}^{*}$ , and then returning $q_{*}$ as the final answer (note that, by Lemma 3.26(3), $q_{*}$ must then be logically equivalent to $q_{goal}$ ). Finally, it is easy to see that $q_{goal}^{*}$ is c-acyclic if and only if $q_{goal}$ is c-acyclic. Therefore, it suffices to prove Theorem 5.2 for the special case of schemas consisting of binary relations. In the remainder of this section, we will therefore restrict ourselves to binary relations.

Proposition 5.6 (Any positive example can be transformed into a c-acyclic one).

Let $q_{goal}$ be a c-acyclic goal query. Given a structure $(A,{\bf a})$ satisfying $\widehat{q_{goal}}\to(A,{\bf a})$ , using membership queries, we can construct, in time polynomial in $size(A)+size(q_{goal})$ , a structure $(B,{\bf b})$ , denoted $\operatorname{CC}(A,{\bf a})$ , such that

(1)

$(B,{\bf b})$ is c-acyclic
(2)

$(B,{\bf b})\to(A,{\bf a})$
(3)

$\widehat{q_{goal}}\to(B,{\bf b})$
(4)

$\widehat{q_{goal}}$ is not homomorphic to any structure obtained removing some fact of $(B,{\bf b})$

Proof.

We say that a structure is $m$ -c-acyclic if every cycle of length at most $m$ goes through a distinguished element. (Here, by a “cycle” we mean a cycle in the incidence graph of the structure, and, to simplify the exposition, by the length of the cycle we refer to the number of facts that lie on the cycle). Note that when this holds for $m=n$ , where $n$ is the size of $\widehat{q_{goal}}$ , then every homomorphic image of $\widehat{q_{goal}}$ contained in the structure in question must be c-acyclic. We will describe a method (using membership queries) that takes a $m$ -c-acyclic structure $(A,{\bf a})$ with $\widehat{q_{goal}}\to(A,{\bf a})$ and turns it into a $(m+1)$ -c-acyclic structure $(A^{\prime},{\bf a})$ with $\widehat{q_{goal}}\to(A^{\prime},{\bf a})$ and $(A^{\prime},{\bf a})\to(A,{\bf a})$ . By applying this method repeatedly for increasing $m$ (and always minimizing w.r.t. $\widehat{q_{goal}}$ using membership queries, cf. Lemma 5.4), we are guaranteed to reach the situation where we have a structure that is $n$ -c-acyclic and therefore, is in fact, c-acyclic.

Let $(A,{\bf a})$ be $m$ -c-acyclic with $\widehat{q_{goal}}\to(A,{\bf a})$ . First, we use membership queries to minimize $(A,{\bf a})$ (cf. Lemma 5.4) and ensure that its size is at most $n$ . Next, we say that an edge is bad if it is part of a cycle of length $m+1$ that does not contain a distinguished element, and good otherwise. If there are no bad edges then we are done. Otherwise, let $e=R(c,d)$ be a bad edge. Let $A_{1},A_{2}$ be isomorphic copies of $A\setminus\{e\}$ that are disjoint except for the distinguished elements. Now, let $(B,\textbf{a})$ be the structure obtained by extending the fg-disjoint union $(A_{1},\textbf{a})\uplus(A_{2},\textbf{a})$ with additional “special edges” $R(c_{1},d_{2})$ and $R(c_{2},d_{1})$ . Clearly, $(B,{\bf a})\to(A,{\bf a})$ .

Claim 1: for each good edge of $(A,{\bf a})$ , its isomorphic copies belonging to $B$ are good in $(B,{\bf a})$ .

Claim 2: the special edges $R(c_{1},d_{2})$ and $R(c_{2},d_{1})$ are both good in $B$ .

Claim 1 is obvious from the construction of $B$ , as no new short cycles are introduced. To see that Claim 2 holds, consider any minimal cycle in $B$ that does not contain any distinguished element, and that goes through one of these edges. Then, clearly, the cycle must go through both of these edges. That is, it must be of the form

c_{1}\xrightarrow{R(c_{1},d_{2})}d_{2}\xrightarrow{~{}~{}~{}\pi~{}~{}~{}}c_{2}\xrightarrow{R(c_{2},d_{1})}d_{1}\xrightarrow{~{}~{}~{}\pi^{\prime}~{}~{}~{}}c_{1}

where $\pi$ is a path contained in $A_{2}$ and $\pi^{\prime}$ is a path contained in $A_{1}$ . Now, we know that the paths $\pi$ and $\pi^{\prime}$ must have length at least $m$ (because otherwise $(A,{\bf a})$ would not be $m$ -c-acyclic). Therefore, the entire cycle must have length at least $2m+2$ .

Claim 3: $\widehat{q_{goal}}\to(B,{\bf a})$ .

Claim 3 is essentially proved by an induction on the tree structure of $\widehat{q_{goal}}$ , after removing all distinguished elements. More precisely, let $h:\widehat{q_{goal}}\to(A,{\bf a})$ . Let $G$ be the substructure of $\widehat{q_{goal}}$ obtained by removing all distinguished elements and facts involving distinguished elements. Clearly, $G$ is acyclic, i.e., $G$ can be oriented as a forest. By induction on this forest, we can construct a homomorphism $h^{\prime}:G\to B$ , with the additional property that, for each element $g$ of $G$ , $h^{\prime}(g)$ is equal to either $h(g)_{1}$ or $h(g)_{2}$ . Recall that $h(g)_{1}$ is the copy of $h(g)$ in $A_{1}$ and that $h(g)_{2}$ is the copy of $h(g)$ in $A_{2}$ . Next, let $h^{\prime\prime}$ be the extension of $h^{\prime}$ to $\widehat{q_{goal}}$ that agrees with $h$ on all distinguished elements. We can show that $h^{\prime\prime}:\widehat{q_{goal}}\to(B,{\bf a})$ . To see this, consider any fact of $\widehat{q_{goal}}$ . If that fact involves at least one distinguished element, then the $h$ -image of that fact involves at least one distinguished element of $A$ . Therefore, by construction, all the facts obtainable by replacing every non-distinguished element (if there is such) by one of its two copies, are present in $B$ , and therefore, no matter how $h^{\prime}$ acts on the non-distinguished element of that fact, the $h^{\prime\prime}$ -image will be present in $B$ . Next, consider the case where the fact doesn’t involve any distinguished element of $A$ . In this case, the fact belongs to $G$ and hence we know that the $h^{\prime}$ -image of that fact (which is also the $h^{\prime\prime}$ -image) belongs to $B$ . This concludes the proof of claim 3.

Next, let $(B^{\prime},{\bf a})$ be a minimal substructure of $(B,{\bf a})$ into which $\widehat{q_{goal}}$ homomorphically maps (obtained using membership queries, cf. Lemma 5.4). Clearly, Claim 1-3 above are preserved under the passage from $(B,{\bf a})$ to its substructure $(B^{\prime},{\bf a})$ . We claim that (1) $B^{\prime}$ must contain at least one of the edges $R(c_{1},d_{2})$ and $R(c_{2},d_{1})$ , and (2) for every edge of $A\setminus\{e\}$ , $B^{\prime}$ must contain at least one of its two isomorphic copies. Because if not, then the homomorphism from $\widehat{q_{goal}}$ to $(B^{\prime},{\bf a})$ could be composed with the natural projection from $(B^{\prime},{\bf a})$ to $(A,{\bf a})$ to obtain a homomorphism from $\widehat{q_{goal}}$ to a proper substructure of $(A,{\bf a})$ , a contradiction with the assumed minimality of $(A,{\bf a})$ . Combined with Claim 1 and 2, this allows us to conclude that $(B^{\prime},{\bf a})$ has strictly more good edges than $(A,{\bf a})$ . Since the number of edges (good or bad) is bounded by $n$ , by repeating the above procedure, we obtain a structure that has no bad edges, and therefore, is $(m+1)$ -c-acyclic. ∎

Proof of Theorem 5.2:.

The algorithm maintains a structure, denoted $(H,{\bf h})$ , which can be interpreted as (the canonical structure of) the algorithm’s guess of the goal query. At every moment in the execution of the algorithm, $(H,{\bf h})$ , satisfies the following properties:

(1)

$\widehat{q_{goal}}\to(H,{\bf h})$
(2)

$\widehat{q_{goal}}$ does not homomorphically map to any structure obtained by removing a fact from $(H,{\bf h})$
(3)

$(H,{\bf h})$ is c-acyclic

Note that conditions (1) and (2) imply that $H$ cannot have more elements than $\widehat{q_{goal}}$ and that $size(H)\leq size(\widehat{q_{goal}})$

Initially, $(H,{\bf h})$ is defined to be $\operatorname{CC}(A,{\bf a})$ where $\operatorname{CC}(\cdot)$ is defined as in Proposition 5.6 and $A$ is the structure containing a single node $a$ that satisfies all possible facts over the schema and ${\bf a}$ is the $k$ -ary tuple $(a,\dots,a)$ containing only element $a$ . The algorithm refines $(H,{\bf h})$ in a sequence of iterations. At each iteration, the algorithm first constructs the frontier ${\mathcal{F}}$ of $(H,{\bf h})$ . Note that by condition (3) $(H,{\bf h})$ is $c$ -acyclic. Hence, by Theorem 3.8, ${\mathcal{F}}$ can be computed in time polynomial in $size(H)$ (and, hence, in $size(q_{goal})$ ). Then, the algorithm asks a membership query for each $(F,{\bf f})$ in $\mathcal{F}$ until either it receives a ’yes’ answer or, otherwise, it exhausts all structures in ${\mathcal{F}}$ without receiving a ’yes’ answer. In the latter case the algorithm stops and returns the canonical query of $(H,{\bf h})$ , cf. Lemma 5.5, as it is immediate from the fact that $\widehat{q_{goal}}\to(H,{\bf h})$ and that $\widehat{q_{goal}}$ does not homomorphically map to any structure in ${\mathcal{F}}$ that $(H,{\bf h})$ is homomorphically equivalent to $\widehat{q_{goal}}$ , and therefore, the canonical query of $(H,{\bf h})$ is logically equivalent to $q_{goal}$ . In the former case, the algorithm picks any structure $(F,{\bf f})$ in ${\mathcal{F}}$ that (when asked as a membership query) produces a ’yes’ answer, updates $(H,{\bf h})$ , by setting $(H,{\bf h}):=\operatorname{CC}(F,{\bf f})$ and starts a new iteration. It follows immediately that $(H,{\bf h})$ preserves properties (1-3) above.

To show the correctness of the algorithm it only remains to see that the number of iterations is polynomially bounded in $size(G)$ . This follows directly from the following claim: the domain size of $H$ increases at each iteration. To prove the claim, let $(H_{i},{\bf h}_{i})$ be the value of $(H,{\bf h})$ at the $i$ th iteration, and note that, by design, we have $(H_{i+1},{\bf h}_{i+1})\to(F,{\bf f})$ where $(F,{\bf f})$ belongs to the frontier of $(H_{i},{\bf h}_{i})$ . It follows that $(H_{i+1},{\bf h}_{i+1})\to(H_{i},{\bf h}_{i})$ and $(H_{i},{\bf h}_{i})\not\to(H_{i+1},{\bf h}_{i+1})$ . Let $h$ be the homomorphism from $(H_{i+1},{\bf h}_{i+1})$ to $(H_{i},{\bf h}_{i})$ . It follows from the fact that $\widehat{q_{goal}}\to(H_{i+1},{\bf h}_{i+1})$ and condition (2) that the image of $(H_{i+1},{\bf h}_{i+1})$ according to $h$ is precisely $(H_{i},{\bf h}_{i})$ . Therefore, $h$ must be non-injective (otherwise it would be an isomorphism, contradicting the fact that $(H_{i},{\bf h}_{i})\not\to(H_{i+1},{\bf h}_{i+1})$ ). Since $h$ is surjective and non-injective, we can conclude that the domain size of $H_{i+1}$ is larger than the domain size of $H_{i}$ . ∎

6. Another type of data examples: input-output examples

In the previous sections, we have focused on positive and negative data examples. However, there is also another type of data example that is natural to consider, namely a pair $(I,R)$ where $I$ is an input instance (over a schema $\mathcal{S}$ of the query), and $R$ is a $k$ -ary relation over the domain of $I$ , where $k$ is the arity of the query. A conjunctive query $q$ fits $(I,R)$ if $R$ is precisely the set of all tuples over the domain of $I$ that satisfy $q$ . Input-output examples are analogous to universal examples for schema mappings as studied in (AlexeCKT2011, ), in that they capture the complete behavior of the concept (in our case, the conjunctive query) on a database instance.

One input-output example, intuitively, captures the same information as a polynomial number of positive and negative data examples (if we treat $k$ as a constant), namely all positive data examples $(I,\textbf{a})$ for $\textbf{a}\in R$ and all negative data examples $(I,\textbf{a})$ for $\textbf{a}\in dom(I)^{k}\setminus R$ . It follows that a CQ is uniquely characterizable by a finite set of input-output data examples if and only if it is uniquely characterizable by a finite set of positive and negative data examples. In the case of connected CQs, in fact, a single input-output example suffices:

Theorem 6.1.

Fix a schema and $k\geq 0$ , and let $C$ be any class of $k$ -ary CQs over the given schema. Then the following are equivalent for all queries $q\in C$ :

(1)

$q$ is uniquely characterized w.r.t. $C$ by a finite collection of positive and negative data examples.
(2)

$q$ is uniquely characterized w.r.t. $C$ by a finite collection of input-output examples.

And, if $C$ consists of c-connected CQs (or $k=0$ and $C$ consists of connected CQs):

(3)

$q$ is uniquely characterized w.r.t. $C$ by a single input-output example.

Moreover, the equivalences are witnessed by polynomial-time transformations in each direction.

Proof.

For the direction from (1) to (2): for given $(E^{+},E^{-})$ , let $E^{io}=\{(I,q(I))\mid(I,\textbf{a})\in E^{+}\cup E^{-}\}$ . Note that whenever a query $q^{\prime}\in C$ fits $E^{io}$ , then it must also fit $(E^{+},E^{-})$ . For the converse direction, from (2) to (1), the construction was already described above.

The direction (3) to (2) is trivial.

Finally, for the direction from (2) to (3), let $E^{io}=\{(I_{1},q(I_{1})),\ldots,(I_{n},q(I_{n}))\}$ . Let $I$ be the disjoint union $\biguplus_{i=1\ldots n}(I_{i})$ . For every tuple a from the domain of $I_{i}$ , and for every c-connected query $q^{\prime}$ , $\textbf{a}\in q^{\prime}(I_{i})$ if and only if $\textbf{a}\in q^{\prime}(I)$ . It follows that, whenever two $c$ -connected queries $q,q^{\prime}$ agree on their output on $I$ (that is to say, $q(I)=q^{\prime}(I)$ ), then they must also agree on their output on each $I_{i}$ . Therefore, if $E^{io}$ uniquely characterizes $q$ w.r.t. $C$ , then also $(I,q(I))$ uniquely characterizes $q$ w.r.t. $C$ . The same argument applies to connected Boolean CQs.

Incidentally, note that the above argument only works for c-connected CQs. For instance, the CQ $q(x)=\exists y(P(x)\land Q(y))$ cannot be uniquely characterized by a single input-output example $(I,q(I))$ , because if $q(I)$ is non-empty, then $q(I)=q^{\prime}(I)$ where $q^{\prime}(x)$ is the query $P(x)$ ; whereas if $q(I)$ is empty, then $q(I)=q^{\prime\prime}(I)$ , where $q^{\prime\prime}(x)$ is the query $P(x)\land Q(x)$ . ∎

It follows that all results regarding the existence and polynomial-time computability of unique characterizations in Section 4 remain true when considering input-output data examples instead of (or even in addition to) positive and negative data examples.

Similarly, in the exact learning context, we can also consider a different type of oracle query, namely where the algorithm provides the oracle with an input instance $I$ and the oracle responds with an input-output example of the form $(I,R)$ that fits the target CQ. We could call such oracle queries input-output queries. They naturally capture a scenario in which we have black-box access to an executable version of the target CQ. For the same reasons discussed above, input-output queries are no more powerful than membership queries, since one input-output query can be simulated by a polynomial number of membership queries (assuming $k$ is fixed). Therefore, also, all our results on exact learnability remain true when considering input-output queries instead of membership queries.

7. Further Applications

While our main focus in this paper is on unique characterizability and exact learnability for CQs, in this section, we explore some implications for other application domains.

7.1. Characterizability and learnability of LAV schema mappings

A schema mapping is a high-level declarative specification of the relationships between two database schemas (Kolaitis05schema, ). Two of the most well-studied schema mapping specification languages are LAV (“Local-as-View”) and GAV (“Global-as-View”) schema mappings.

In (AlexeCKT2011, ), the authors studied the question of when a schema mapping can be uniquely characterized by a finite set of data examples. Different types of data examples were introduced and studied, namely positive examples, negative examples, and “universal” examples. In particular, it was shown in (AlexeCKT2011, ) that a GAV schema mapping can be uniquely characterized by a finite set of positive and negative examples (or, equivalently, by a finite set of universal examples) if and only if the schema mapping in question is logically equivalent to one that is specified using c-acyclic GAV constraints.

It was shown in (AlexeCKT2011, ) that every LAV schema mapping is uniquely characterized by a finite set of universal examples, and that there are LAV schema mappings that are not uniquely characterized by any finite set of positive and negative examples. In this section, we will consider the question which LAV schema mappings are uniquely characterizable by a finite set of positive and negative examples, and how to construct such a set of examples efficiently.

We will also consider the exact learnability of LAV schema mappings with membership queries. Exact learnability of GAV schema mappings was studied in (CateDK13:learning, ), where it was shown that GAV schema mappings are learnable with membership and equivalence queries (and, subsequently, also in a variant of the PAC model) but is not exactly learnable with membership queries alone or with equivalence queries alone. The exact learning algorithm for GAV schema mappings from (CateDK13:learning, ) was further put to use and validated experimentally in (Kun18active, ). Here, we consider exact learnability of LAV schema mappings with membership queries.

Definition 7.1.

A LAV (“Local-As-View”) schema mapping is a triple $M=(S,T,\Sigma)$ where $S$ and $T$ are disjoint schemas (the “source schema” and “target schema”), and $\Sigma$ is a finite set of LAV constraints, that is, first-order sentences of the form $\forall\textbf{x}(\alpha(\textbf{x})\to\exists\textbf{y}\phi(\textbf{x},\textbf{y}))$ , where $\alpha(\textbf{x})$ is an atomic formula using a relation from $S$ , and $\phi(\textbf{x},\textbf{y})$ is a conjunction of atomic formulas using relations from $T$ .

By a schema-mapping example we will mean a pair $(I,J)$ where $I$ is a structure over schema $\mathcal{S}$ without distinguished elements, and $J$ is a structure over schema $\mathcal{T}$ without distinguished elements. We say that $(I,J)$ is a positive example for a schema mapping $M=(\mathcal{S},\mathcal{T},\Sigma)$ if $(I,J)$ , viewed as a single structure over the joint schema $\mathcal{S}\cup\mathcal{T}$ , satisfies all constraints in $\Sigma$ , and we call $(I,J)$ a negative example for $M$ otherwise. Note that schema-mapping examples were called data examples in (AlexeCKT2011, ). Unique characterizations and learnability with membership queries are defined as before. In particular, by a membership query, in the context of learning LAV schema mappings, we will mean an oracle query that consists of a schema-mapping example, which the oracle then labels as positive or negative depending on whether it satisfies the constraints of the goal LAV schema mapping. It is assumed here, that the source and target schemas are fixed and known to the learner.

Given a fixed source schema $\mathcal{S}$ , there are only finitely many different possible left-hand sides $\alpha$ for a LAV constraint, up to renaming of variables. Furthermore, if a schema mapping contains two LAV constraints with the same left-hand side, then they can be combined into a single LAV constraint by conjoining the respective right-hand sides. Since the right-hand side of a LAV constraint can be thought of as a CQ, this means that, intuitively, a LAV schema mapping can be thought of as a finite collection of CQs (one for each possible left-hand side). In the light of this observation, it is no surprise that questions about the unique characterizability and learnability of LAV schema mappings can be reduced to questions about the unique characterizability and learnability of CQs.

Let us capture this observation a little more precisely. Let $k$ be the maximum arity of a relation in $\mathcal{S}$ , and let $\textrm{ATOMS}_{\mathcal{S}}$ be the finite set of all atomic formulas using a relation from $\mathcal{S}$ and variables from $\{z_{1},\ldots,z_{k}\}$ . Given a LAV schema mapping $M=(\mathcal{S},\mathcal{T},\Sigma)$ and an $\alpha(\textbf{z})\in\textrm{ATOMS}_{\mathcal{S}}$ , we denote by $q_{M,\alpha}(\textbf{z})$ the following first-order formula over schema $\mathcal{T}$ :

\mathop{\bigwedge_{\forall\textbf{x}(\beta(\textbf{x})\to\exists\textbf{y}\phi(\textbf{x},\textbf{y}))\in\Sigma}}_{\text{$h:\{\textbf{x}\}\to\{\textbf{z}\}$ a function s.t.~{}$\beta(h(\textbf{x}))=\alpha(\textbf{z})$}}\exists\textbf{y}\phi(h(\textbf{x}),\textbf{y})

For example, if $M$ consists of the LAV constraints $\forall x_{1},x_{2},x_{3}.R(x_{1},x_{2},x_{3})\to S(x_{1},x_{2},x_{3})$ and $\forall x_{1},x_{2}.R(x_{1},x_{2},x_{2})\to\exists yT(x_{1},y)$ , and $\alpha(z_{1})$ is $R(z_{1},z_{1},z_{1})$ , then $q_{M,\alpha}=S(z_{1},z_{1},z_{1})\land\exists yT(z_{1},y)$ . Similarly, for $\alpha^{\prime}(z_{1},z_{2},z_{3})=R(z_{1},z_{2},z_{3})$ then $q_{M,\alpha^{\prime}}=S(z_{1},z_{2},z_{3})$ . Note that $q_{M,\alpha}(\textbf{z})$ can be equivalently written as a not-necessarily-safe CQ over $\mathcal{T}$ (by pulling the existential quantifies to the front).

Lemma 7.2.

Let $M=(\mathcal{S},\mathcal{T},\Sigma)$ be any LAV schema mapping, and let $\alpha(\textbf{z})\in\textrm{ATOMS}_{\mathcal{S}}$ have $k$ distinct variables. For every structure $(A,\textbf{a})$ , over schema $\mathcal{T}$ and with $k$ distinguished elements, the following are equivalent:

(1)

$(A,\textbf{a})$ is a positive data example for $q_{M,\alpha}(\textbf{z})$ ,
(2)

The schema-mapping example $(I,J)$ is a positive example for $M$ , where $I$ is the structure over $\mathcal{S}$ consisting of the single fact $\alpha(\textbf{a})$ , and $J=A$ .

We omit the proof, which is straightforward (note that the left-hand side of a LAV constraint can have at most one homomorphism to $I$ , and the latter can be extended to the right-hand side of the constraint to $J$ iff the respective conjunct of $q_{M,\alpha}$ is satisfied. Also note that if $(I,J)$ is a positive example for a LAV schema mapping $M$ then, so is $(I,J^{\prime})$ for $J\subseteq J^{\prime}$ ).

Intuitively, Lemma 7.2 shows that the behavior of $q_{M,\alpha}$ on arbitrary data examples, is fully determined by the behavior of $M$ on arbitrary schema-mapping examples. The converse turns out to be true as well, that is, the semantics of a LAV schema mapping $M=(\mathcal{S},\mathcal{T},\Sigma)$ is determined (up to logical equivalence) by its associated queries $q_{M,\alpha}$ for $\alpha\in\textrm{ATOMS}_{\mathcal{S}}$ :

Lemma 7.3.

Two LAV schema mappings $M_{1}=(\mathcal{S},\mathcal{T},\Sigma_{1})$ , $M_{2}=(\mathcal{S},\mathcal{T},\Sigma_{1})$ are logically equivalent iff, for every $\alpha(\textbf{z})\in\textrm{ATOMS}_{\mathcal{S}}$ , $q_{M_{1},\alpha}(\textbf{z})$ and $q_{M_{2},\alpha}(\textbf{z})$ are logically equivalent.

Proof.

The left-to-right direction follows immediately from the preceding Lemma. For the right-to-left direction: suppose $M_{1}$ and $M_{2}$ are not logically equivalent. Then they disagree on some schema-mapping example $(I,J)$ . Without loss of generality, we may assume that $(I,J)$ is a positive example for $M_{1}$ and a negative example for $M_{2}$ . In particular, one of the LAV constraints in $\Sigma_{2}$ is false in $(I,J)$ . Since the left-hand side of a LAV constraint consists of a single atom, it follows that, for some fact $R(\textbf{a})$ of $I$ , the schema-mapping example $(\{R(\textbf{a})\},J)$ is a negative example for $M_{2}$ . Moreover, an easy monotonicity argument shows that $(\{R(\textbf{a})\},J)$ is a positive example for $M_{1}$ . Let $\alpha$ be obtained from the fact $R(\textbf{a})$ by replacing each distinct element $a_{i}$ by a corresponding variable $z_{i}$ . It follows from Lemma 7.2 that $q_{M_{1},\alpha}$ and $q_{M_{2},\alpha}$ disagree on the structure $(J,\textbf{a})$ , and are not logically equivalent. ∎

It follows directly from the above Lemmas that the unique characterizability of a LAV schema mapping $M$ reduces to the unique characterizability of each query $q_{M,\alpha}$ :

Lemma 7.4.

For all LAV schema mappings $M=(\mathcal{S},\mathcal{T},\Sigma)$ , the following are equivalent:

(1)

$M$ is uniquely characterizable by finitely many positive and negative schema-mapping examples (w.r.t. the class of all LAV schema mappings over $\mathcal{S},\mathcal{T}$ ).
(2)

For each $\alpha(z_{1},\ldots,z_{k})\in\textrm{ATOMS}_{\mathcal{S}}$ , $q_{M,\alpha}(z_{1},\ldots,z_{k})$ is uniquely characterizable by finitely many positive and negative data examples w.r.t. the class of all $k$ -ary not-necessarily-safe CQs over $\mathcal{T}$ .

Intuitively, this shows that a LAV schema mapping is uniquely characterizable iff each of its constraints (joined together according to their left-hand side atom) are. By combining these lemmas with Theorem 4.5 (cf. Remark 1), we can link the unique characterizability of a LAV schema mapping to the condition of c-acyclicity. We say that a LAV schema mapping $M$ is c-acyclic if the right-hand side of each of its LAV constraints is a c-acyclic not-necessarily-safe CQ. Note that, in this case, also $q_{M,\alpha}$ is c-acyclic, for each $\alpha\in\textrm{ATOMS}_{\mathcal{S}}$ .

Theorem 7.5.

Fix a source schema $\mathcal{S}$ and a target schema $\mathcal{T}$ . A LAV schema mapping $M=(\mathcal{S},\mathcal{T},\Sigma)$ is uniquely characterizable by a finite set of positive and negative schema-mapping examples if and only if $M$ is logically equivalent to a c-acyclic LAV schema mapping. Moreover, if $M$ is c-acyclic, then a uniquely characterizing set of positive and negative schema-mapping examples can be constructed in polynomial time (for fixed $\mathcal{S},\mathcal{T}$ ).

Proof.

The direction going from c-acyclicity to the uniquely characterizing set of schema-mapping examples, follows immediately from the above lemmas together with Theorem 4.5. For the other direction, assume that $M$ is uniquely characterizable by finitely many positive and negative schema-mapping examples. It follows by Lemma 7.4 that each $q_{M,\alpha}$ is uniquely characterizable by finitely many positive and negative data examples. Hence, each $q_{M,\alpha}$ is logically equivalent to a c-acyclic not-necessarily-safe conjunctive query $q^{\prime}_{M,\alpha}$ . Finally, let $M^{\prime}=(\mathcal{S},\mathcal{T},\Sigma^{\prime})$ , where $\Sigma^{\prime}$ consists of all LAV constraints of the form $\forall\textbf{z}(q_{M,\alpha}(\textbf{z})\to\alpha(\textbf{z}))$ for $\alpha(\textbf{z})\in\textrm{ATOMS}_{\mathcal{S}}$ . Then $M^{\prime}$ is c-acyclic and logically equivalent to $M$ . ∎

Similarly, Lemma 7.2 and Lemma 7.3, together with Theorem 5.2, directly imply:

Theorem 7.6.

Fix a source schema $\mathcal{S}$ and a target schema $\mathcal{T}$ . The class of c-acyclic LAV schema mappings over $\mathcal{S},\mathcal{T}$ is efficiently exactly learnable with membership queries.

Note that the class of all LAV schema mappings over $\mathcal{S},\mathcal{T}$ is not exactly learnable with membership queries (assuming that $\mathcal{S}$ is non-empty and $\mathcal{T}$ contains a relation of arity at least 2). This follows immediately from the existence of LAV schema mappings that are not uniquely characterizable by finitely many positive and negative schema-mapping examples.

As mentioned earlier, LAV schema mappings and GAV schema mappings are two of the most well-studied schema mapping languages. GLAV (“Global-and-Local-As-Views”) schema mappings is another, which forms a common generalization. An important remaining open question in the area of example-driven approaches to schema mapping design is the following (AlexeCKT2011, ): which GLAV schema mappings are uniquely characterizable by a finite set of examples?

7.2. Learning description logic concept expressions and ABoxes

Description logics are formal specification languages used to represent domain knowledge. Example-driven and machine-learning based approaches have a long history in this area, and have received renewed interest in the last years (Ozaki2020:learning, ), in particular, for ontologies specified in the lightweight description logic $\mathcal{ELI}$ , and focusing on the exact learnability of ontologies using entailment queries and equivalence queries. As we show in this section, our results on c-acyclic CQs have some implications for the exact learnability of $\mathcal{ELI}$ concept expressions.

Definition 7.7 ( $\mathcal{ELI}$ Concept expressions, ABoxes, TBoxes).

Let $N_{C},N_{R},N_{I}$ be fixed, disjoint sets, whose members we will refer to as “concept names”, “role names”, and ”individual names”, respectively. The sets $N_{C}$ and $N_{R}$ are assumed to be finite, while $N_{I}$ is assumed to be infinite.

A concept expression $C$ is an expression built up from from concept names in $N_{C}$ and $\top$ , using conjunction ( $C_{1}\sqcap C_{2}$ ) and existential restriction ( $\exists r.C$ or $\exists r^{-}.C$ , where $r\in N_{R}$ ).

An ABox is a finite set of ABox axioms of the form $P(a)$ and/or $r(a,b)$ , where $P\in N_{C}$ , $r\in N_{R}$ , and $a,b\in N_{I}$ .

A TBox is a finite set of TBox axioms $C\sqsubseteq D$ , where $C,D$ are concept expressions.

The semantics of these expressions can be explained by translation to first-order logic:

Definition 7.8.

The correspondence schema is the schema that contains a unary relation for every $A\in N_{C}$ and a binary relation for every $r\in N_{R}$ . Through the standard translation from description logic to first-order logic (cf. Table 1), every concept expression $C$ translates to a first-order formula $q_{C}(x)$ over the correspondence schema. By extension, every TBox $\mathcal{T}$ translates to a finite first-order theory $\mathcal{T}^{\text{fo}}$ , where $C_{1}\sqsubseteq C_{2}$ translates to $\forall x(q_{C_{1}}(x)\to q_{C_{2}}(x))$ .

\begin{array}[]{rcl}q_{P}(x)&=&P(x)~{}\text{ for $P\in N_{C}$}\\ q_{\top}(x)&=&\top\\ q_{C_{1}\sqcap C_{2}}(x)&=&q_{C_{1}}(x)\land q_{C_{2}}(x)\\ q_{\exists r.C}(x)&=&\exists y(r(x,y)\land q_{C}(y))\\ q_{\exists r^{-}.C}(x)&=&\exists y(r(y,x)\land q_{C}(y))\\ \end{array}

Table 1. Standard translation from concept expressions to first-order logic

An ABox can equivalently be viewed as a finite structure (without distinguished elements), whose domain consists of individual names from $N_{I}$ , and whose facts are the ABox assertions. Since $N_{I}$ is assumed to be infinite, every finite structure over the correspondence schema can (up to isomorphism) be represented as an ABox. Therefore, in what follows we will use ABoxes and structures interchangeably.

We can think of an ABox as a (possibly incomplete) list of facts, and a TBox as domain knowledge in the form of rules for deriving more facts. This idea underlies the next definition:

Definition 7.9.

A QA-example is a pair $(\mathcal{A},a)$ where $\mathcal{A}$ is an ABox and $a\in N_{I}$ . We say that $(\mathcal{A},a)$ is a positive QA-example for a concept expression $C$ relative to a TBox $\mathcal{T}$ if $a\in\text{certain}(C,\mathcal{A},\mathcal{T})$ where $\text{certain}(C,\mathcal{A},\mathcal{T})=\bigcap\{q_{C}(B)\mid\text{$\mathcal{A}\subseteq B$ and $B\models\mathcal{T}^{\text{fo}}$}\}$ . If $a\not\in\text{certain}(C,\mathcal{A},\mathcal{T})$ , we say that $(\mathcal{A},a)$ is a negative QA-example for $C$ relative to $\mathcal{T}$ .

The name QA-example, here, reflects the fact that the task of computing $\text{certain}(C,\mathcal{A},\mathcal{T})$ is commonly known as query answering. It is one of the core inference tasks studied in the description logic literature. In general, there are two variants of the definition of $\text{certain}(C,\mathcal{A},\mathcal{T})$ : one where $B$ ranges over finite structures, and one where $B$ ranges over all, finite or infinite, structures. The description logic $\mathcal{ELI}$ that we consider here has been shown to be finitely controllable (Barany2013querying, ), meaning that both definitions are equivalent. For more expressive description logics, this is in general not the case.

\begin{array}[]{l@{~~~~~}l@{~~~}l}\hfil~{}~{}~{}~{}~{}&\text{Example}\hfil~{}~{}~{}&\text{First-order logic translation}\\[2.84526pt] \text{ABox:}\hfil~{}~{}~{}~{}~{}&\mathcal{A}=\{P(a),r(a,b)\}\hfil~{}~{}~{}\\ \text{TBox:}\hfil~{}~{}~{}~{}~{}&\mathcal{T}=\{P\sqsubseteq Q\sqcap\exists r.P\}\hfil~{}~{}~{}&\mathcal{T}^{\text{fo}}=\{\forall x(P(x)\to Q(x)\land\exists y(r(x,y)\land P(y))\}\\ \text{Concept expr:}\hfil~{}~{}~{}~{}~{}&C=\exists r.Q\hfil~{}~{}~{}&q_{C}(x)=\exists y(r(x,y)\land Q(y))\end{array}

Table 2. Example description logic ABox, TBox and concept expression

Example 7.10.

Consider the ABox, TBox, and concept expression in Table 2. Every model of $\mathcal{T}^{\text{fo}}$ containing the facts in $\mathcal{A}$ must contain also $r(a,c)$ and $Q(c)$ for some $c\in N_{I}$ . It follows that $a\in\text{certain}(C,\mathcal{A},\mathcal{T})$ . In other words, $(\mathcal{A},a)$ is a positive QA-example for $C$ relative to $\mathcal{T}$ . On the other hand, $(\mathcal{A},b)$ is a negative QA-example for $C$ relative to $\mathcal{T}$ .

See (Baader2017:introduction, ) for more details on description logic syntax and semantics. We now explain how our results from Section 4 and 5 can be applied here. Although a QA-example is just a data example with one distinguished element, over the correspondence schema, the definition of positive/negative QA-examples diverges from the definition of positive/negative data examples, because of the TBox $\mathcal{T}$ . For the special case where $\mathcal{T}=\emptyset$ , the two coincide:

Lemma 7.11.

Let $\mathcal{T}=\emptyset$ . A QA-example $(\mathcal{A},a)$ is a positive (negative) QA-example for a concept expression $C$ relative to $\mathcal{T}$ iff $(\mathcal{A},a)$ is a positive (negative) data example for $q_{C}(x)$ .

Lemma 7.11 follows from the well-known monotonicity property of CQs (i.e., whenever $A\subseteq B$ , then $q(A)\subseteq q(B)$ ), which implies that $\text{certain}(C,\mathcal{A},\emptyset)=q_{C}(\mathcal{A})$ .

Concept expressions turn out to correspond precisely to unary, acyclic, c-connected CQs:

Lemma 7.12.

The standard translation $q_{C}(x)$ of every $\mathcal{ELI}$ concept expression $C$ is equivalent to a not-necessarily-safe unary CQ that is acyclic and c-connected. Conversely, every unary, acyclic, c-connected not-necessarily-safe CQ over the correspondence schema is logically equivalent to $q_{C}(x)$ for some $\mathcal{ELI}$ concept expression $C$ .

Both directions of Lemma 7.12 can be proved using a straightforward induction.

The above two lemmas, together with Theorem 4.5(2) and Theorem 5.2 (cf. Remark 1 and Remark 2) immediately yield our main result here. We say that a collection of positive and negative QA-examples uniquely characterizes a concept expression $C$ relative to a TBox $\mathcal{T}$ if $C$ fits the examples (relative to $\mathcal{T}$ ) and every other concept expression that does so is equivalent (relative to $\mathcal{T}$ ) to $C$ . By a QA-membership query we mean an oracle query consisting of a QA example, where the oracle answers yes or no depending on whether the input is a positive QA example or a negative QA example for the goal concept, relative to the TBox. It is assumed that the TBox is fixed and known to the learner.

Theorem 7.13.

Let $\mathcal{T}=\emptyset$ . Every $\mathcal{ELI}$ concept expression is uniquely characterizable by a finite collection of positive and negative QA examples (relative to $\mathcal{T}$ ), which can be computed in polynomial time. Furthermore, the class of $\mathcal{ELI}$ concept expressions is efficiently exactly learnable with QA-membership queries.

Moreover, by Theorem 4.5(3), the uniquely characterizing examples can be constructed so that each example $(\mathcal{A},a)$ is the canonical QA-example of a concept expression. By the canonical QA-example of a concept expression $C$ , here, we mean the QA-example that (viewed as a structure with one distinguished element) is the canonical structure of the not-necessarily-safe CQ $q_{C}(x)$ .

Theorem 7.13 remains true when the concept language is extended with unrestricted existential quantification (of the form $\exists.C$ ) and a restricted form of the I-me self-reference construct introduced in (Marx2002narcissists, ), namely where the I operator can only occur once, and in the very front of the concept expression. Indeed, it can be shown that this extended concept language (by a straightforward extension of the standard translation) captures precisely the class of c-acyclic unary not-necessarily-safe CQs over the correspondence schema.

This raises the question if Theorem 7.13 holds true for arbitrary TBoxes. Since publication of the conference version of this paper (tCD2021:conjunctive, ) (in which we asked the same question), some answers have been obtained. In (Funk2021Actively, ), it is shown that the answer to this question is No if the TBox is treated as part of the input to the learning algorithm. Indeed, it is shown that the problem becomes not efficiently exactly learnable with membership and equivalence queries. On the other hand, a positive answer is given in (Funk2021Actively, ) for a weaker version of the question, namely for the description logic $\mathcal{EL}$ , when the learning algorithm is also allowed to ask equivalence queries. In (funk2022:frontiers, ), furthermore, a positive answer is given for another variant of the above question where the TBox is specified in the description logic DL-Lite, and where the learning algorithm is allowed to ask membership and equivalence queries. At the heart of this learning algorithm lies an extension of our frontier construction from Section 3.3, also obtained in (funk2022:frontiers, ).

8. Acknowledgements

This paper largely grew out of discussions at Dagstuhl Seminar 19361 (“Logic and Learning”) in Sept. 2019. Victor Dalmau was supported by MICCIN grants TIN2016-76573-C2-1P and PID2019-109137GB-C22. Balder ten Cate was supported by the European Union’s Horizon 2020 research and innovation programme (MSCA-101031081). We thank Carsten Lutz, Raoul Koudijs, and Phokion Kolaitis for helpful discussions and for pointing out some flaws in earlier versions of this paper.

References

(1) Bogdan Alexe, Balder ten Cate, Phokion G. Kolaitis, and Wang-Chiew Tan. Characterizing schema mappings via data examples. ACM Trans. Database Syst., 36(4):23:1–23:48, December 2011. doi:10.1145/2043652.2043656.
(2) Dana Angluin. Queries and concept learning. Mach. Learn., 2(4):319–342, April 1988. doi:10.1023/A:1022821128753.
(3) William Ward Armstrong. Dependency structures of data base relationships. In IFIP Congress, 1974.
(4) Franz Baader, Ian Horrocks, Carsten Lutz, and Ulrike Sattler. An Introduction to Description Logic. Cambridge University Press, 2017.
(5) Franz Baader and Werner Nutt. Basic Description Logics, pages 43–95. Cambridge University Press, USA, 2003.
(6) Vince Bárány, Georg Gottlob, and Martin Otto. Querying the guarded fragment. Logical Methods in Computer Science, 10(2), 2014. doi:10.2168/LMCS-10(2:3)2014.
(7) Pablo Barceló and Miguel Romero. The complexity of reverse engineering problems for conjunctive queries. In Michael Benedikt and Giorgio Orsi, editors, 20th International Conference on Database Theory, ICDT 2017, March 21-24, 2017, Venice, Italy, volume 68 of LIPIcs, pages 7:1–7:17. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2017.
(8) Google Blog. A reintroduction to our knowledge graph and knowledge panels, 2020.
(9) Angela Bonifati, Radu Ciucanu, and Aurélien Lemay. Learning path queries on graph databases. In Gustavo Alonso, Floris Geerts, Lucian Popa, Pablo Barceló, Jens Teubner, Martín Ugarte, Jan Van den Bussche, and Jan Paredaens, editors, Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015, pages 109–120. OpenProceedings.org, 2015.
(10) Angela Bonifati, Radu Ciucanu, and Slawek Staworko. Learning join queries from user examples. ACM Trans. Database Syst., 40(4):24:1–24:38, 2016.
(11) Ashok K. Chandra Chandra and Philip M. Merlin. Optimal Implementation of Conjunctive Queries in Relational Data Bases. In ACM Symposium on Theory of Computing (STOC), pages 77–90, 1977.
(12) William W. Cohen. Cryptographic limitations on learning one-clause logic programs. In Richard Fikes and Wendy G. Lehnert, editors, Proceedings of the 11th National Conference on Artificial Intelligence. Washington, DC, USA, July 11-15, 1993, pages 80–85. AAAI Press / The MIT Press, 1993.
(13) William W. Cohen. Pac-learning nondeterminate clauses. In Barbara Hayes-Roth and Richard E. Korf, editors, Proceedings of the 12th National Conference on Artificial Intelligence, Seattle, WA, USA, July 31 - August 4, 1994, Volume 1, pages 676–681. AAAI Press / The MIT Press, 1994.
(14) William W. Cohen. Pac-learning non-recursive prolog clauses. Artif. Intell., 79(1):1–38, 1995.
(15) Víctor Dalmau, Phokion G. Kolaitis, and Moshe Y. Vardi. Constraint satisfaction, bounded treewidth, and finite-variable logics. In Pascal Van Hentenryck, editor, Principles and Practice of Constraint Programming - CP 2002, pages 310–326, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.
(16) Ronald Fagin. Degrees of acyclicity for hypergraphs and relational database schemes. J. ACM, 30(3):514–550, July 1983. doi:10.1145/2402.322390.
(17) Ronald Fagin, Phokion Kolaitis, Alan Nash, and Lucian Popa. Towards a theory of schema-mapping optimization. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 33–42, 01 2008. doi:10.1145/1376916.1376922.
(18) Jan Foniok, Jaroslav Nesetril, and Claude Tardif. Generalised dualities and maximal finite antichains in the homomorphism order of relational structures. Eur. J. Comb., 29(4):881–899, 2008.
(19) Maurice Funk, Jean Christoph Jung, and Carsten Lutz. Actively learning concepts and conjunctive queries under elr-ontologies. In IJCAI, 2021.
(20) Maurice Funk, Jean Christoph Jung, and Carsten Lutz. Frontiers and exact learning of eli queries under dl-lite ontologies. In Lud De Raedt, editor, Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 2627–2633. International Joint Conferences on Artificial Intelligence Organization, 7 2022. Main Track. doi:10.24963/ijcai.2022/364.
(21) Georg Gottlob and Pierre Senellart. Schema mapping discovery from data instances. J. ACM, 57(2):6:1–6:37, 2010.
(22) David Haussler. Learning conjunctive concepts in structural domains. Mach. Learn., 4:7–40, 1989.
(23) Pavol Hell and Jaroslav Nešetřil. The Core of a Graph. Discrete Mathematics, 109:117–126, 1992.
(24) Pavol Hell and Jaroslav Nešetřil. Graphs and homomorphisms, volume 28 of Oxford lecture series in mathematics and its applications. Oxford University Press, 2004.
(25) Kouichi Hirata. On the hardness of learning acyclic conjunctive queries. In Hiroki Arimura, Sanjay Jain, and Arun Sharma, editors, Algorithmic Learning Theory, 11th International Conference, ALT 2000, Sydney, Australia, December 11-13, 2000, Proceedings, volume 1968 of Lecture Notes in Computer Science, pages 238–251. Springer, 2000.
(26) Wojtek Kazana and Luc Segoufin. First-order queries on classes of structures with bounded expansion. Logical Methods in Computer Science, Volume 16, Issue 1, February 2020.
(27) Phokion G. Kolaitis. Schema mappings, data exchange, and metadata management. In Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’05, page 61–75, New York, NY, USA, 2005. Association for Computing Machinery. doi:10.1145/1065167.1065176.
(28) Heikki Mannila and Kari-Jouko Räihä. Test data for relational queries. In PODS, pages 217–223, 1986.
(29) Maarten Marx. Narcissists, stepmothers and spies. In Ian Horrocks and Sergio Tessaris, editors, Proceedings of the 2002 International Workshop on Description Logics (DL2002), Toulouse, France, April 19-21, 2002, volume 53 of CEUR Workshop Proceedings. CEUR-WS.org, 2002.
(30) Jaroslav Nešetřil and Patrice Ossona de Mendez. Grad and classes with bounded expansion iii. restricted graph homomorphism dualities. European Journal of Combinatorics, 29(4):1012 – 1024, 2008. Homomorphisms: Structure and Highlights. doi:https://doi.org/10.1016/j.ejc.2007.11.019.
(31) Jaroslav Nešetřil and Patrice Ossona de Mendez. Grad and classes with bounded expansion i. decompositions. European Journal of Combinatorics, 29(3):760 – 776, 2008. doi:https://doi.org/10.1016/j.ejc.2006.07.013.
(32) Jaroslav Nešetřil and Patrice Ossona de Mendez. Sparsity (Graphs, Structures, and Algorithms), volume 28. Springer, 01 2012. doi:10.1007/978-3-642-27875-4.
(33) Jaroslav Nešetřil and Claude Tardif. Duality theorems for finite structures (characterising gaps and good characterisations). Journal of Combinatorial Theory, Series B, 80(1):80 – 97, 2000. doi:https://doi.org/10.1006/jctb.2000.1970.
(34) Jaroslav Nešetřil and Claude Tardif. Short answers to exponentially long questions: Extremal aspects of homomorphism duality. SIAM J. Discret. Math., 19(4):914–920, August 2005. doi:10.1137/S0895480104445630.
(35) Ana Ozaki. Learning description logic ontologies: Five approaches. where do they stand? KI - Künstliche Intelligenz, 04 2020. doi:10.1007/s13218-020-00656-9.
(36) Slawek Staworko and Piotr Wieczorek. Learning twig and path queries. In Alin Deutsch, editor, 15th International Conference on Database Theory, ICDT ’12, Berlin, Germany, March 26-29, 2012, pages 140–154. ACM, 2012.
(37) Slawomir Staworko and Piotr Wieczorek. Characterizing XML Twig Queries with Examples. In International Conference on Database Theory, International Conference on Database Theory, Brussels, Belgium, March 2015. URL: https://hal.inria.fr/hal-01205417.
(38) Balder ten Cate, Laura Chiticariu, Phokion Kolaitis, and Wang-Chiew Tan. Laconic schema mappings: Computing the core with sql queries. Proc. VLDB Endow., 2(1):1006–1017, August 2009. doi:10.14778/1687627.1687741.
(39) Balder ten Cate and Víctor Dalmau. The product homomorphism problem and applications. In Marcelo Arenas and Martín Ugarte, editors, 18th International Conference on Database Theory, ICDT 2015, March 23-27, 2015, Brussels, Belgium, volume 31 of LIPIcs, pages 161–176. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2015.
(40) Balder ten Cate and Victor Dalmau. Conjunctive queries: Unique characterizations and exact learnability. In Proc. of ICDT. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
(41) Balder ten Cate, Víctor Dalmau, and Phokion G. Kolaitis. Learning schema mappings. ACM Trans. Database Syst., 38(4):28:1–28:31, 2013. doi:10.1145/2539032.2539035.
(42) Balder ten Cate, Phokion G. Kolaitis, Kun Qian, and Wang-Chiew Tan. Active learning of GAV schema mappings. In Jan Van den Bussche and Marcelo Arenas, editors, Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Houston, TX, USA, June 10-15, 2018, pages 355–368. ACM, 2018. doi:10.1145/3196959.3196974.
(43) Yaacov Y. Weiss and Sara Cohen. Reverse engineering spj-queries from examples. In Emanuel Sallinger, Jan Van den Bussche, and Floris Geerts, editors, Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14-19, 2017, pages 151–166. ACM, 2017.
(44) Ross Willard. Testing expressibility is hard. In David Cohen, editor, Principles and Practice of Constraint Programming - CP 2010 - 16th International Conference, CP 2010, St. Andrews, Scotland, UK, September 6-10, 2010. Proceedings, volume 6308 of Lecture Notes in Computer Science, pages 9–23. Springer, 2010.

Conjunctive Queries: Unique Characterizations and Exact Learnability

Abstract.

1. Introduction

Motivating Example 1.

Main Contribution 1 (Polynomial-time algorithms for constructing frontiers).

Main Contribution 2 (Polynomial Unique Characterizations for Conjunctive Queries).

Main Contribution 3 (Polynomial-Time Learnability with Membership Queries).

Main Contribution 4 (Schema Mappings and Description Logic Concepts).

Related Work

Outline

2. Preliminaries

Schemas, Structures, Homomorphisms, Cores

Lemma 2.1.

Proof.

Fact Graph, FG-Connectedness, FG-Disjoint Union

Direct Product, Homomorphism Lattice

Incidence Graph, Acyclicity, C-Acyclicity

Proposition 2.2.

C-Connectedness

Proposition 2.3.

Conjunctive Queries

Exact Learning Models, Conjunctive Queries as a Concept Class

3. Frontiers in the homomorphism lattice of structures

Definition 3.1.

Example 3.2.

Example 3.3.

Lemma 3.4.

Proof.

Example 3.5.

3.1. Frontiers for classes with bounded expansion

Theorem 3.6 (from (NesetrilOssona2008, ; nesetril2012sparsity, )).

Proof.

3.2. Polynomial frontiers for c-acyclic structures

Theorem 3.7 (from (FoniokNT08, ; AlexeCKT2011, )).

Theorem 3.8.

Proposition 3.9.

Proof.

Proposition 3.10.

Proof.

Proof of Theorem 3.8.

Theorem 3.11.

Proof.

3.3. A polynomially frontier-closed class of structures

Theorem 3.12.

The special case with binary relations only and k=1k=1

Proposition 3.13.

Proof.

Definition 3.14 (A|aA|a).

Definition 3.15 (rank).

Definition 3.16 (ℱa\mathcal{F}_{a}).

Definition 3.17 (ℱa∗\mathcal{F}^{*}_{a}).

Theorem 3.18.

Proof.

Definition 3.19.

Theorem 3.20 (Our construction is polynomial).

Proof.

Extending the result to k≥1k\geq 1

Definition 3.21 (Skeleton and offshoots).

Definition 3.22 (Splitting).

Definition 3.23 (Radial extension).

Lemma 3.24.

Proof.

Proposition 3.25.

Proof.

Lifting the restriction to binary relations

Lemma 3.26.

Proof.

4. Unique Characterizations for Conjunctive Queries

Definition 4.1 (Data Examples, Fitting, Unique Characterizations).

Proposition 4.2 (Frontiers vs Unique Characterizations).

Proof.

Lemma 4.3.

Proof.

Lemma 4.4.

Proof.

Theorem 4.5.

Remark 1.

Examples showing that Theorem 4.5(3) cannot easily be generalized.

Theorem 4.6.

Proof.

The special case with binary relations only and $k=1$

Definition 3.14 ( $A|a$ ).

Definition 3.16 ( $\mathcal{F}_{a}$ ).

Definition 3.17 ( $\mathcal{F}^{*}_{a}$ ).

Extending the result to $k\geq 1$

Definition 7.7 ( $\mathcal{ELI}$ Concept expressions, ABoxes, TBoxes).