2021 \papernumber2095
A Non-Deterministic Multiset Query Language
Abstract
We develop a multiset query and update language executable in a term rewriting system. Its most remarkable feature, besides non-standard approach to quantification and introduction of fresh values, is non-determinism — a query result is not uniquely determined by the database. We argue that this feature is very useful, e.g., in modelling user choices during simulation or reachability analysis of a data-centric business process — the intended application of our work. Query evaluation is implemented by converting the query into a terminating term rewriting system and normalizing the initial term which encapsulates the current database. A normal form encapsulates a query result. We prove that our language can express any relational algebra query. Finally, we present a simple business process specification framework (and an example specification). Both syntax and semantics of our query language is implemented in Maude.
keywords:
term rewriting, query languages, business process modellingA Non-deterministic Multiset Query Language
1 Introduction
In a data-centric approach to business process modelling (see, e.g., [1, 2]), specification of data transformation during case execution is an integral part of the business process model. This new paradigm requires new tools and formalisms for effective specification, simulation and validation. Task-centric models are commonly formalized using Petri nets (see, e.g., [3, 4]). Adapting Petri net-based formalizations to data-centric models is, however, problematic: While simple transformations on data can be represented directly within a coloured Petri net, Petri nets lack facilities for complex data processing and querying. Even so, there exists a large amound of literature (see e.g., [5, 6, 7]) devoted to enriching Petri nets with with data processing capabilities and automated verification of their properties. Recent paper [8] introduced DB-Nets — an attempt to integrate coloured Petri nets with relational databases. The use of two separate formalisms complicates verification and simulation (though it corresponds to actual implementations of BPM systems). In the following paper [9] a subset of database operations was implemented inside coloured Petri nets with name creation and transition priorities.
Conditional term rewriting [10] was proposed as an alternative (if less popular) generic framework for specification of dynamic systems [11]. It subsumes a variety of Petri nets [12] and their simulation is one of popular applications of the term rewriting system Maude [13, 14]). More precisely, it is well known (see e.g., [12]) that coloured Petri nets can be implemented by multiset rewriting systems, and, conversely, rewriting systems which rewrite multisets of terms representing colour tokens can be interpreted as Petri nets: just identify places with colour tokens, and each rewriting rule of the form
with a transition with input arcs from , , , and output arcs to , , , . Other constructs, such as inhibitor arcs can be easily implemented with conditional rewriting rules. Rewriting systems are more general than Petri Nets since they are not limited to rewriting multisets (on the other hand, Petri nets are much better supported by tools and programming libraries). However, rewriting systems still share with Petri nets the limitation and inconvenience of not directly supporting bulk, complex operations on data which involve some kinds of quantification. For example, suppose that the state of a data driven business process related to e-commerce is represented by a multiset of terms. In particular, terms of the form denote the presence of a product in the basket of a customer . Suppose now that cancels the case, and so we need to remove all ’s items from the multiset. Describing removal of a single (nondeterministically chosen) item is easy with the rule where is a variable, but specifiying in a rewriting system (or a Petri net) that all of them need to be removed before the next business step is more complex (though clearly possible) and would require auxilliary tokens and conditional rewrite rules corresponding to inhibitor arcs in a Petri net preventing other transitions as long as there are still some ’s items present in the rewritten multiset. Thus, a high-level query language adding quantifier constructs on top of conventional rewriting system or a Petri net is clearly desirable, particularly if rewriting systems or Petri nets are to become convenient formalisms to model data driven business processes.
In this paper we present an expressive multiset query and update language , designed to be executable in a term rewriting system, useful for a unified and somewhat “Petri nettish” formalization of data-centric business processes. The connection with Petri nets is admittedly tenuous and follows from the fact that the language acts on data represented as multisets of terms which could be viewed as tokens (see the discussion above). Since this language specifies changes to data, instead of being a reimplementation of relational calculus, it contains linear-like features fitting a term rewriting implementation. Most remarkably, is non-deterministic — the result of a query or update is not, in general, uniquely determined by the database. This permits modelling user choices, just like in the case of Petri nets. consists of three sublanguages, parametrized with respect to a signature and -algebra of facts :
-
1.
Language of conditions (Boolean queries) which can be used independently as constraints, or as components of queries in and .
-
2.
Data manipulation language. A DML query in defines new facts to be added to the database. Some of the old facts used in constructing the new ones may be deleted.
-
3.
Language of queries which only return facts but do not change the database. Both syntax, and to some extent semantics of is a restriction of syntax and semantics of .
A query in , , is given semantics by assignment of a rewriting system . To evaluate in a database we start with an initial term . A normal form of wraps a result of ’s evaluation: a Boolean value indicating validity of a condition, a query answer or a new database resulting from execution of a DML query. As remarked above, while is always terminating (i.e., there are no infinite execution paths, so we always do get some result from evaluating ), it is not confluent (i.e., divergent execution paths may not eventually converge) in general, hence we may get distinct results depending on nondeterministic choices. We identify syntactic constraints on queries in each of the sublanguages which ensure confluence for the rewriting system, and hence determinism for the results of the query. shares with the language introduced in [15] a non-standard approach to variable binding. The approach avoids problems with capture-free substitutions without dispensing with explicit variables, but at the price of non-compositionality: The surrounding context may determine whether a variable in a subterm is free or bound in this subterm. supports introduction of fresh values to the database, which is used, e.g., to generate identifiers for newly created artifacts or to simulate user input (cf. [7]).
Since queries in are converted to a term rewriting system, verification of a business process specified as a set of DML queries (see Section 7) can be assisted with symbolic reachability analysis techniques based on narrowing (see e.g., [16]) . We plan to expand on this idea in future research. Note that our results in this article regarding the confluence of the rewriting systems to which a particular subclass of queries compiles to, may be relevant for narrowing (see e.g., [17, 18], c.f. [19]). E.g., narrowing a confluent system may provide a more efficient search procedure than in the case of a non-confluent one.
1.1 Prior work
The present paper builds on the previous paper [15] (cf. [20]) where a multiset query language executed in Maude was proposed. shares many similarities with the language described in [15], particularly the treatment of quantification. It has, however, distinct syntax (with, e.g., fact markings in quantifiers) and distinct semantics. We consider both languages to be alternatives, each of which with its own strengths. A side-by-side comparison is presented in Table 1. Observe that while the language described in [15] can be defined both in the set and multiset setting (in the former case we match multisets of facts modulo idempotence in addition to commutativity, associativity and identity), here we assume exclusively multiset setting. This is because the language described here is implemented through multiset rewriting. Making multiset constructor idempotent would make it impossible to consistently replace matched terms, a feature which is crucial for our formalism to behave sensibly. E.g., given a rule , term rewrites in one step both to (intended) and (not intended).
Semantics of (possibly) non-deterministic queries is based on translation into term rewriting systems. | Semantics of queries is based on term matching on the metalevel. Queries are always deterministic. |
Queries return facts in the same signature as the database against which the query is evaluated. | Queries return objects of an arbitrary signature (as long as it contains a “union” operator). |
DML queries construct multisets of facts to be added to the current database. Some of the current facts used in the construction may be deleted. | DML expressions construct a pair of sets or multisets of facts — those to be deleted from and those to be added to the current database. |
Fresh values are introduced through “virtual fresh facts”. Actual freshness in ensured through rewrite rules which define the semantics of DML query. | Attributes of input facts can be marked by special sorts as fresh. Testing framework ensures that values injected in those columns are actually fresh. |
It is assumed that the database is a multiset of facts | The language can be defined both in set and multiset setting |
Treatment of variable binding is identical in both languages |
, has some similarities to matching logic [21, 22], as both are based on term matching. They have different purposes, however, and different syntax and semantics. Unlike matching logic statements, queries are non-deterministic. Matching logic is used for software verification, while is a query language intended to be a component of the system. Finally, matching logic has conventional quantifiers, whereas we use a non-standard quantification over “relation patterns”.
CINNI [23] is a generic calculus of substitutions implemented in Maude which combines de Bruijn indices with explicit names to solve the problem of capture-free substitutions. To avoid the associated complexity, we decided not to use conventional variable binding implemented, e.g., using CINNI.
Our quantification over “relation patterns” instead of variables, resembles quantifier constructs in description logic 111We are grateful to Prof. Andrzej Tarlecki for this observation. [24]. Our syntactic construct, however, uses explicit variable names and is not limited to binary relations where only the second column is bound by the quantifier.
Data-centric business process models are formalized in a variety of ways. First-order logic and its restricted variations (see e.g., [25, 26, 27, 28]), datalog [29], and UML [30], are popular choices. Those formalisms are excellent for the specification of data models, and they come with expressive query languages; they are, however, not so ideally suited for modelling change, because of frame problems [31], where it is not always obvious what information is modified and what stays the same. Rewriting formalisms [10], which are explicit about scope of change have a clear advantage here. On the other hand, a great deal of work has been devoted to formal verification of logic based business process formalism, see e.g., [32, 33] in the context of hierarchical artifact systems. For another example see [34] where a data aware extension of BPMN was proposed together with SMT (satisfiability modulo theories) based verification techniques.
In [35] a language called Reseda, was introduced for specification of data driven business process. The language integrates data description with behaviour. What makes it relevant as a prior work to the present paper is that Reseda’s semantics is defined by associating with a Reseda program a transition system. Such a program can then be executed by rewriting data in accordance with the transition rules, similarly to the execution of the language described here.
As we remarked earlier, a recent paper [8] introduced DB-Nets which integrate coloured Petri nets with relational databases. Since the use of two separate formalisms complicates verification and simulation in the following paper [9] a subset of database operations was implemented inside coloured Petri nets with name creation and transition priorities. Thus, the motivation of [9] is analogous to the motivation of this paper, but in the world of Petri nets instead of term rewriting systems. There are however two important differences: First, our language is meant to provide complete specification of data driven business processes, whereas in [9] the business process is still specified as a Petri net, and there is just an interface between relational queries (perhaps implemented inside the net itself) and the main net describing the process. Secondly (and this is what makes the first point possible) we do not simply implement a conventional relational dml and query language in a rewriting system. Instead, we implement a linear and non-deterministic query language which can emulate user choices and creation of new objects.
1.2 Preliminaries on term rewriting
Let be a poset (partially ordered set). A family of sets is called an -sorted set if whenever . We abbreviate as . We write iff .
An algebraic signature consists of a finite poset of sorts and a finite set of function symbols. The set of function symbols is -sorted, where is the set of finite, non-empty sequences of elements of partially ordered with iff , and for all . Traditionally we write when , where we denote by the set of function symbols of sort . This explains the somewhat confusing ordering on : we are covariant on return value and contravariant on arguments. Symbols are called constants of sort . A -algebra is an assignment of a set to each such that if , and a function to each in . Let be a -sorted set of variables. A term algebra has “sort-safe” terms as elements and function symbols interpreted by themselves. We denote by the algebra of ground -terms. We often use mixfix syntax where underscores in the function name correspond to consecutive arguments. Thus, if contains and then is a ground term of sort . Positions in a term are denoted by strings of positive integers. Denote by the empty string, and by the subterm of at position (if defined), i.e., , and . Let . If and is a term of the same sort as , then we denote by the result of replacing in with . We use a standard notation for substitutions. Let be a list of terms, a list of distinct variables. Then we denote when , , and for any variable .
A -algebra may be defined as a quotient of by a congruence generated by a set of equalities, where equalities in , referred to as equational attributes, define structural properties such as associativity, commutativity, or identity, and consists of conditional equalities interpreted as directed simplification rules on . It is assumed that simplifications terminate and are confluent, hence each has the unique (modulo ) irreducible form representing a class of in .
Simplification with respect to equalities computes values. The behaviour is represented with rewritings. A rewriting system consists of a signature , a set of equations were defines confluent and terminating (modulo ) simplifications on , and a finite set of conditional rewriting rules of the form , where optional condition is a conjunction of equalities, and is the rule’s label. A one-step rewrite from to using such a rule is possible if there exists a position , term , and a substitution such that , , and is satisfied. We write iff there exist terms and a label of a rule in such that , , and . We denote by and the transitive and reflexive-transitive closures of . We also write if and there is no such that . If is implied by the context, we omit from arrows.
Variants of the following definition and easy to prove lemma appear in the literature (see, e.g., [37]):
Definition 1.1
Let , where , be a transition system. Assume that is an equivalence on which is a bisimulation on . We call semiconfluent at modulo if for all such that there exist such that . We call semiconfluent modulo if is semiconfluent at all . We call confluent at modulo if for all such that there exist such that . We call confluent modulo if is confluent at all .
Lemma 1.2
Semiconfluence modulo equivalence implies confluence modulo equivalence.
2 Multisets of facts, fresh facts and patterns
Our queries are evaluated against, or act on, finite multisets of facts. Duplicate facts can be genuinely useful and removing them is computationally expensive. If necessary, duplicates can be removed explicitly or, better, one can ensure that no duplicates are introduced in the first place by judicious choice of DML operations. In fact, SQL is a multiset query language as well, hence by using multisets we are closer to the actual relational database practice than formal systems based on sets.
is parametrized with respect to a signature of facts and a -algebra of facts . must contain sorts Fact and Bool for facts and Booleans, respectively. All constructors for facts are contained in . Facts are reifications of predicate instances. A typical fact has the form , where is a fact constructor. defines all the data types used in facts and is specifiable in terms of directed equations and equational attributes. must define Boolean connectives and Boolean-valued equality for all . We assume that all ground terms of sort Bool simplify to either t or f. This is non-trivial: Define a function with a single equation . Then is fully reduced and distinct from both t and f.
Multisets of facts. The signature of multisets extends with sorts (Ne)FSet of finite (non-empty) multisets of facts. The subsort ordering is given by . In particular, each fact is a non-empty multiset of facts. Finite multisets of facts are constructed with an associative and commutative binary operator (cf. [38]) with identity element . Operator is subsort overloaded with the additional declaration . Thus, a multiset constructed from a multiset and a non-empty multiset is non-empty.
Freshness and nominal sorts. Support for creation of fresh values is a common requirement (cf. [7]): Identifiers for new objects must not belong to the present nor any past active domain of the database. To understand why reusing identifiers from past domains is bad consider situation where a new business object is created with the identifier of a previously deleted one. In this case the attempt to verify if the deleted object is present in the final database may yield an incorrect affirmative answer. We support creation of fresh values of nominal sorts only. Usually this suffices, and freshness for non-nominal data types is problematic (cf. [39]). A sort is nominal (relative to -algebra ) if values of this sort have no non-trivial algebraic or relational structure beside equality. In particular, for nominal , if and only if . To create values of each nominal sort we have constructor which belongs neither to nor to the signature of .
Example 2.1
Consider a client basket database. Identifiers of customers, products and baskets have sorts , , and , respectively. We use two fact constructors: and . Multiset of facts denotes the state in which customer is the owner of basket containing products and . Using multisets instead of sets can be useful: multiset denotes the situation where basket contains two items of .
Constructing values of nominal sorts from natural numbers simplifies creation of fresh values: to ensure freshness one can construct the new value with the smallest natural number which was not used so far. To keep track of those “smallest unused naturals”, and to make retrieval of fresh values similar to retrieval of data we use fresh facts of sort unrelated to Fact. For each nominal sort we have a single-argument constructor of the form which wraps value such that for all , was never used before. When fresh value of sort is requested, we return and update the fresh fact to . Fresh facts are combined into (non-empty) multisets of sort (Ne) using commutative and associative operator with identity . To facilitate bulk updates of fresh facts needed in the semantics of DML queries, we define the following function:
(1) |
Patterns. Quantifiers in quantify over patterns (of sort Pat) containing non-ground multisets of facts and fresh facts marked by modalities, which control retention of matched facts (i.e., whether upon matching they are removed temporarily, permanently, or not at all from the database), and syntactically wrap fresh facts (i.e., fresh facts can appear in the pattern only inside specialized modality). Patterns can be preserving (of sort ), semi-terminating (of sort ), terminating (of sort ), terminating and preserving (of sort ), or neither. The subsort relation is defined by and . Patterns are constructed with modalities
and associative and commutative, subsort overloaded operator:
Thus, a terminating (resp. a semi-terminating) pattern has to contain at least one fact wrapped with (resp. with either or ). A terminating and preserving pattern consists of facts marked only with or , and it contains at least one fact wrapped with . Directed equalities , where , and , are non-empty multisets of (fresh) facts guarantee that fully reduced patterns have facts gathered in groups of the same modality.
Example 2.2
Let Id be a sort, let and be fact constructors and let , , and be variables. Then
is a pattern. It is terminating because of a presence of subpattern. It is not, however, preserving since it contains facts wrapped in and .
The informal meaning of modalities is as follows: Facts matched by those marked by can be considered at most once during quantifier evaluation, but they are not removed from the database. Facts marked by are removed from the database when matched, but they are returned if the computation branch this matching leads to is unsuccessful. Thus, the presence of facts marked by in the pattern may not guarantee termination, unless one can prove that the formula under quantifier is always successful. Facts marked by are always retained in the database, and wraps fresh facts.
Remark 2.3
In what follows we use the following notation. Let be a pattern. We denote by , , , the multisets of facts consisting of those facts in which are wrapped by modalities , , , and , respectively. Thus, e.g., and .
3 Query and condition languages
This section introduces the three sublanguages of , their syntax and informal semantics. Formal semantics based on conditional term rewriting is provided in the subsequent sections.
3.1 Conditions
The language of conditions on finite multisets of facts is analogous to first-order logic with quantification restricted to the active domain.
Definition 3.1
Let be a signature and let be a -algebra of facts. Formulas of are (generally non-ground) terms of sort Cnd constructed with
Thus, , , , and are conditions if is a term of sort Bool, is a terminating and preserving pattern, and and are conditions. Consider condition . Existential quantifier binds in all the variables appearing in which were not bound by the term surrounding . Thus, the meaning of the formula may change when it is placed in a different context.
Example 3.2
Let , and suppose that terms represent rows of a relation . Let , , and be distinct variables. Then condition expresses functional dependency from the first to the second column of . The first quantifier binds and , the second one binds . The condition is closed. The subcondition taken on its own is open, but only is free and the quantifier now binds both and .
Closed formulas in are called sentences in . Let be the set of variables of , and let iff in is closed. To define by structural recursion we need to keep track of variables bound by the context of . Thus, , where, for any set of variables ,
(2) |
As the syntactic sugar we define operators , , . with equalities , , . Later we prove that, for any condition , is logically equivalent to . Then the functional dependency in Example 3.2 can be equivalently written as
Definition 3.3
A subcondition of in is a subterm of of sort Cnd.
3.2 Syntax of queries and DML queries
Syntactically is a restriction of . Therefore, we define their syntax jointly as follows:
Definition 3.4
Queries in are terms of sort Qy. DML queries in are terms of sort DQy. Success assured (DML) queries are terms of sorts . The sorts are ordered with and . Thus, every query in can be also interpreted as DML query (which inserts but doesn’t delete). Every fact is also a query. Success assured DML queries are DML queries guaranteed to return some facts (or at least ). Terms of sort DQy are constructed with
Thus, , , , , and are queries in if is a fact, is a terminating and preserving pattern, and are queries in , and is a condition. Similarly, , , , , , and are DML queries in if is a fact, is a terminating pattern, and are DML queries in , and is a condition. In addition, is a DML query if is a more general semi-terminating pattern, but is success-assured, i.e., either , a fact, or of the form where at least one of , is success assured. We force to be success assured when is semi-terminating, but not terminating because quantification over semi-terminating pattern may not terminate if the quantified expression can fail. Informal semantics of (DML) queries is given by:
-
1.
Let . A query returns . A DML query adds to the current database.
-
2.
is used to mark a branch of a DML query as successful even if it does not add any facts.
-
3.
is a query returning the empty multiset of facts or a DML query which does nothing.
-
4.
A query returns the multiset union of results of and . A DML query adds to and removes from the database a multiset union of what and add and remove, respectively. Since facts are removed immediately, is not commutative for DML queries.
-
5.
A query returns what returns if is satisfied. It returns otherwise. A DML query does what does if is satisfied. It does nothing otherwise.
-
6.
Quantifier denotes iteration over facts in the database matching . At each iteration step is executed, where is the matching substitution. If contains fresh facts, execution of a DML query may introduce fresh values.
Let if and only if the query in (or ) is closed. Similarly as in the case of conditions, , where, for any set of variables ,
Definition 3.5
A (DML) subquery of a (DML) query is a subterm of of sort (D)Qy.
Example 3.6
The following query in is closed:
It returns a multiset of facts consisting of facts of the form where and belong to the multiset we query and . When evaluating the query each source fact (such that is in the multiset and ) and some fact are matched only once during the evaluation of this query (on the other hand, facts of the form can be matched multiple times). Thus, we return at most facts , where is the number of facts in the database. For example, for multiset the query returns either or depending on whether we match and or we match twice (and then, in both cases, we run out of -facts).
Example 3.7
The following (closed) “DML query” is not syntactically correct (it cannot be assigned sort DQy):
This is because the pattern is only semiterminating, but not terminating. In this case the subquery in the scope of should be success assured, but clearly isn’t (if then it fails). If we would execute this query against a multiset it would loop forever matching , removing it, failing when executing , returning to the multiset, matching it again, and so on.
As the following example demonstrates, some incorrect DML queries of the form , where is semiterminating but not terminating and is not success assured, provably always terminate. Nevertheless, we prefer to reject them regardless, since usually making them syntactically correct is not difficult:
Example 3.8
Assume that is a variable of sort Nat (natural numbers). Execution of the following syntactically incorrect DML query always terminates (it replaces each fact of the form where with fact , and returns to the multiset all other ’s unchanged):
Termination follows from the fact that always succeeds regardless of which natural number is bound to : depending on whether or (and one of these must be true) either left or right argument of succeeds and hence the whole subquery succeeds.
Unfortunately, neither nor is syntactically success assured and hence is also not success assured. However, it is very easy to modify the above query so that it describes the same modification to the database, but is now syntactically correct and can be assigned sort DQy:
4 Rewriting semantics of
Semantics of a sentence in is given by the rewriting system . Terms rewritten by are of the form , where , is the database of facts on which is checked, and is a stack of sort implementing structural recursion. Normal forms, constructed with , encapsulate the result of ’s evaluation. Sort of terms holding the full state of evaluation must be distinct from all the other sorts and it must not have any super- or sub-sort relation to other sorts. This guarantees that there are no constructors accepting terms of sort as arguments. Consequently, if all rewrite rules have terms of this sort on the left-hand side, then no subterms can be rewritten, i.e., the defined rewriting system is top-level. Terms of sort are built from frames of sort , where , using an associative binary operator with identity . Most constructors of frames are indexed by subconditions of , and lists of distinct variable names of respective sorts ( can be of any length, even be empty, as long as and it contains free variables of ):
As can be empty, the above signature templates include and . encapsulates the result of evaluation of a subcondition. Frame negates the result of the next frame on the stack. Frames of the form , , or are called -frames, where is the current substitution. They are related to evaluation of . Marked frames occur in evaluation of disjunctions . Iterator frames represent iterative evaluation of quantifiers. Multiset , called iterator state, contains facts available for matching with . Given a database , to evaluate sentence we rewrite a state term until a normal form is reached. If then is satisfied in .
Now we present rule schemata for instantiated for a given formula in . It is important to distinguish between object variables substituted when applying actual rules, and metavariables used to define rule templates. Below, , , and are object variables. We also denote by and sequences of object variables in . Metavariables, , , and stand for arbitrary subconditions of , metavariable stands for Boolean subterms of . Metavariable stands for arbitrary patterns in , and and denote multisets of facts in the instance of pattern marked by respective modalities (see Remark 2.3).
Constant evaluates to f, and evaluates to , where is the current substitution of variables ( is ground since contains all free variables (i.e., all variables) of , hence, by assumption, it simplifies to either f or t):
(3) |
To evaluate , we “unfold” by replacing the -frame with and -frame. When is evaluated we “fold” by negating the result:
(4) |
Remark 4.1
In our notation the same symbols often play a dual role — as subterms, and as part of function symbols (in sub- and super-scripts). Consider the following instantiation of schema :
Variables in sub- and super-scripts are never substituted: with the above rule we have a one step rewrite of ground terms In schema metavariable occurs both as part of a name and as a subterm. An instantiation yields (with , , and ) a one step rewrite
To evaluate disjunction we create two frames corresponding to the disjuncts. If evaluates to t, the frame marked by is dropped (disjunctions are short circuited). If evaluates to f, the frame corresponding to drops and is evaluated normally.
(5) |
Quantifier evaluation is initialized with the whole database available for matching:
(6) |
Let be a sequence of all the distinct variables in . Let be the current substitution. Rule pushes onto the stack a -frame, where is defined by matching with iterator state, and it removes from the iterator state. We keep applying until evaluates to t or we cannot match with iterator state:
(7) | |||
Let Yes? be a sort and let . For all and patterns occurring in we define a function with the single equation
(8) |
Thus, if and only if matches with . Since facts matched by are removed from the iteration state, cannot be applied infinitely many times. The following rule schema makes evaluate to f when can no longer be applied:
(9) |
Finally, the rule finishes evaluation of .
Theorem 4.2
is a terminating rewriting system.
Proof 4.3
It suffices to define a partial well-order on terms of sort which makes rewriting strictly monotonic, i.e., such that implies for all . The order is defined by
for all , , , . Here is the lexicographic order on stacks derived from partial order on frames, i.e., for all frames , , iff either (1) for some , and for , or (2) and for . Frame ordering is defined by , for all , , , , , , , , , , , such that and is a proper subcondition of . Partial order is Noetherian because multisets of facts , and conditions , are finite terms. Hence, if the stacks are of bounded size, also is Noetherian. The size of stacks is bounded because each stack size increasing rule is of the form , where is a frame and is a proper subterm of . Since rules in are topmost, the rewriting is strictly monotonic because for each rule schema in .
The following useful observation can be trivially verified by examining the rule schemas:
Lemma 4.4
Let be a subcondition of . For any finite multiset of facts , stack , Boolean , variables and values , in iff in , where substitution .
The following example shows verification of a condition in for a given multiset of facts using rewriting semantics. It also shows non-confluence of the resulting rewrite system.
Example 4.5
Suppose . Let , , , and . To check if condition
is satisfied in a multiset we normalize
Thus, is satisfied in . However, we have also a normalizing sequence ending with :
Thus, evaluation of conditions does not necessarily lead to a unique result (the rewriting system is not confluent). This requires making the definition of logical equivalence bisimulation-like:
Definition 4.6
Let and be conditions in . We say that is logically equivalent to , writing , if and only if for all ground multisets of facts , ground substitutions such that and are closed, and a Boolean we have
The following result is an immediate consequence of Lemma 4.4:
Lemma 4.7
Logical equivalence on conditions in is an equivalence and a congruence, i.e., if is a position in a condition in such that is a condition, and , then .
A renaming is an injective substitution mapping variables to variables.
Lemma 4.8
For any closed condition in , and any renaming , .
The following result clarifies elements of rewriting semantics of sentences in :
Lemma 4.9
For each ground multiset of facts , and all sentences , , and :
-
1.
or , and these are the only possible normal forms of .
-
2.
and never .
-
3.
iff for all .
-
4.
iff or .
-
5.
iff and .
-
6.
iff there exists a substitution , and a multiset such that and .
-
7.
iff there exist two sequences of ground multisets of facts and , and a sequence of substitutions such that
-
(a)
, , and , for all ,
-
(b)
, for all ,
-
(c)
there exists no substitution and multiset of facts such that .
-
(a)
-
8.
. If both and then also .
Proof 4.10
Lemma 4.11
The following logical equivalences hold between conditions in :
Proof 4.12
The above equivalences can be proven using points 1-7 in Lemma 4.9. Only the last equivalence’s proof is non-trivial. Let be a ground multiset of facts and let be a ground substitution such that (or, equivalently, ) is closed. Denote , , for . Using Lemma 4.9, p. 6, we see that iff there exists a substitution , and a multiset of facts such that and . The latter holds iff there exists such that , by Lemma 4.9, p. 4. It follows, again using Lemma 4.9, point 6, that iff for some , i.e., iff (by Lemma 4.9, p. 4) . The part of the proof with falsity is more complex. Using Lemma 4.9, p. 7, we see that iff there exist two sequences of ground multisets of facts and , and a sequence of substitutions such that (a) , and , for all , (b) , for all , (c) there exists no substitution and multiset of facts such that . Then by Lemma 4.9, p. 5, (b) iff, for , (bj) , for all . Then (by Lemma 4.9, p. 7) (a), (b1), (b2) and (c) iff for iff, by Lemma 4.9, p. 5, .
Non-confluence of in Example 4.5 depended on patterns in with more than one fact. It turns out that may be non-confluent even if contains only single-fact patterns:
Example 4.13
Let be commutative. Consider condition evaluated in a database . Since is commutative, there are two distinct substitutions and which match with . Consequently, there are two distinct paths of evaluating :
Definition 4.14
A fully reduced term is said to have a unique matching property iff for any ground, fully reduced term there exists at most one substitution such that .
Definition 4.15
A condition in is called deterministic if and only if all quantification patterns in contain only single facts with unique matching property.
The following theorem states that while evaluation of a deterministic condition is not itself deterministic, but its results are.
Theorem 4.16
Let be a deterministic condition in . Then is confluent. In particular, given a ground multiset of facts , there is a unique such that .
Proof 4.17
We argue by induction on the complexity of formulas indexing frames on the top of a stack. First observe that only terms of the form can be rewritten in a single step into two distinct terms. As semiconfluence implies confluence it suffices to prove that if then there exists such that . By Lemmas 4.4 and 4.9, p. 1, a rewrite sequence must (1) contain , or (2) can be extended to the sequence ending in . If we prove that then we can set (if (1)) or (if (2)). It remains to prove that . Under the theorem’s assumption, , where is a fact with a unique matching property (Definition 4.14). Thus, if for some fact in there exists a substitution extending such that and (which implies, by inductive assumption, that ), it is unique, and cannot be missed during evaluation of the iterator frame. Either such exists, and then but (hence necessarily both and ), or it doesn’t, and hence, both and
Example 4.18
Suppose , , and let be a commutative operator (i.e., ). Clearly, and have unique matching property (Definition 4.14), while does not, since, e.g., if and then . Let
Then is deterministic (Definition 4.15), while and are not deterministic. Let . We now consider evaluation of all the ’s on . First, the reader will easily verify that while evaluating the existential quantifier in we can either first match with and then, upon failure, with , or first match with . Eventually, both paths yield satisfaction of on (although the first path is is longer). On the other hand, evaluation of and demonstrate two ways in which non-determinism occurs in evaluation of conditions in . First, consider two non-convergent paths of rewriting :
Here the reason for non-determinism which ultimately leads to non-convergent paths of execution is that when evaluating the existential quantifier at each attempt we have to consume both -fact and -fact: since there is only one fact, if we start from wrong -fact (i.e., ), we do not get the second chance.
Denote for brevity . Recall that . In particular, . Consider now two non-convergent paths of rewriting :
Thus, in this case the reason of non-determinism leading to non-convergent paths was the possibility of two distinct matchings of with given by and .
5 Rewriting semantics of
Let be a query in . We associate with the rewriting system . Terms rewritten with the rules of the rewriting system are of the form , where , is a database of facts against which we issue the query, is a partial answer (i.e., an answer built so far in the rewriting process), and is a stack of sort which simulates structural recursion. Normal forms encapsulating an answer to are constructed with . Terms of sort are constructed from local computation frames of sort , where , using an associative binary operator with identity element . Constructors of frames are indexed by sub-queries of , and lists of distinct variable names of respective sorts ( can be of any length as long as and it contains all free variables of ):
As can be empty, the above signature templates include , etc. As in the case of , variables in super- and sub-scripts are part of function names and are never matched or substituted — Remark 4.1 applies here and in the next section. Frames of the form , , or are called -frames, where is the current substitution. They indicate evaluation of . Conditional frames are used in evaluation of conditionals , where represents evaluation of . Iterator frames , represent iterative evaluation of . Multiset , called iterator state, contains facts available for matching with . Given a database , to evaluate a closed query we rewrite a state term until a normal form is reached. Then we conclude that evaluation of on yields as an answer.
Now we are ready to define the rules of . Literal facts are added to the partial answer multiset after applying the current substitution, and empty queries return nothing:
(10) |
Evaluation of “union” is implemented by replacing the frame corresponding to with two frames corresponding to and , respectively:
(11) |
To compute a conditional we first embed a stack representing computation of condition within the frame corresponding to the conditional. Once this condition is evaluated, we either evaluate if the condition is satisfied, or drop the conditional if it is not:
(12) |
To compute we add, for every rule in the rule schema
(13) |
Evaluation of subquery is initialized with the whole database available for matching:
(14) |
Let be a sequence of all the distinct variables in . Let be the current substitution. Rule pushes onto the stack a -frame, where is defined by matching with iterator state, and it removes from the iterator state:
(15) |
We keep applying until we cannot match with iterator state. Then we remove the iterator frame from the stack. To prevent premature application, rule schema is conditional, where the condition uses functions defined for each and pattern occurring in with the single equation (cf. Equation (8)):
(16) |
Finally, the rule finishes evaluation of .
The following result can be proven similarly to Theorem 4.2:
Theorem 5.1
is a terminating rewriting system.
The following useful observation can be trivially verified by examining the rule schemas:
Lemma 5.2
Let be a subcondition of . Then, for all multisets of facts , , , stacks , lists of variables and values , in if and only if in , where substitution .
The following example shows that evaluation of queries in does not, in general, return a unique answer, and, in particular, that for general queries is not confluent.
Example 5.3
Let and . Consider the query executed against database . Then can normalized in two ways:
Queries like are useful as a simulation of a non-deterministic choice (say, by a human agent) of a subset of values stored in the database with a fixed maximal cardinality. E.g., returns all “-facts” stored in the database. Query defined above, however, chooses (with repetitions) at most as many -facts as there are tokens . Query avoids repetitions.
The following is the bisimulation-like definition of logical equivalence between queries in :
Definition 5.4
Let and be two conditions in . Recall that . We say that is logically equivalent to , writing , if and only if, for all ground multisets of facts and , and ground substitutions such that and are closed, we have
In other words, queries are equivalent if they can match each other’s answers. The following result is an immediate consequence of Lemma 5.2:
Lemma 5.5
Logical equivalence on queries in is an equivalence relation and a congruence, i.e., if is a position in a query in such that is a query, and , then .
We leave proof of the next observation to the reader:
Lemma 5.6
For any closed query in , and any renaming , .
The following clarification of semantics of queries in is proven similarly to Lemma 4.9:
Lemma 5.7
For all ground multisets of facts , and , all closed queries , , , and in , and all closed conditions in , the following statements hold:
-
1.
If then for some ground multiset of facts .
-
2.
Let be a fact. If then . If then .
-
3.
Let . In this case iff and .
-
4.
iff either (non-exclusively) or .
-
5.
iff and for some multisets and such that .
-
6.
iff there exist lists of ground multisets of facts , , and , and a sequence of substitutions such that
-
(a)
, , and , for all ,
-
(b)
, for all ,
-
(c)
there exists no substitution and multiset of facts such that ,
-
(d)
.
-
(a)
Lemma 5.8
The following logical equivalences hold between conditions in :
The next results show that non-confluence of queries makes some natural equivalences invalid:
Lemma 5.9
Let be a condition in and let , be queries in . For all ground multisets of facts , and all substitutions such that , and are closed, we have
(17) |
however, the inverse implication does not hold in general. If is deterministic (Definition 4.15), then
(18) |
Proof 5.10
To prove the implication (17) assume . Either or . In the first case, by Lemma 5.7, p. 3 , our assumption is equivalent to and , the latter of which, in turn, is equivalent, by Lemma 5.7, p. 5, to and for some multisets and such that . But this is equivalent, by Lemma 5.7, p. 3 and 5, to The case is dealt with similarly.
To show that, in general, the inverse implication does not hold, consider , , , where and . Let . Since for , iff , whereas iff .
Assume that is deterministic. By Theorem 4.16, either or , but not both. Let , . If then iff for . If then, for , iff for some multisets and such that , .
Example 5.11
In general, equivalence is not valid. Indeed, let and , and let , , . Suppose that . Then iff . However, iff .
Here we define a class of queries in which evaluate to unique answers:
Definition 5.12
A query in is called deterministic if all quantification patterns in (including those inside conditions) contain only single facts with unique matching property.
Theorem 5.13
Let be a deterministic query in . Then is confluent, and, in particular, given a ground multiset of facts , there is a unique multiset of facts such that .
Proof 5.14
As semiconfluence implies confluence, to show confluence at it suffices to prove that if for some and , then there exists such that . Semiconfluence is immediate at irreducible terms , as well as terms which can only rewrite to . Let where , and . If there exists a unique multiset of facts such that every rewrite sequence starting with either (1) contains or (2) it can be extended to reach , then semiconfluence holds at . Indeed, in this case, either witnesses the semiconfluence (if (1) holds for ) or does (if (2) holds for ). It remains to prove the existence of the unique multiset . We argue by induction on the structure of formulas indexing the frame on the top of the stack. Most cases are dealt by trivial application of Lemmas 5.7 and 5.2. For frames related to conditionals we have to also use Lemma 4.16. The only non-trivial part of the proof concerns frames of the form .
Under the theorem’s assumption, , where is a fact with unique matching property (Definition 4.14). Let be a sequence of all variables in . If there is no fact in such that matches then necessarily . Otherwise for some , where (1) for all there exists a unique substitution such that , (2) there is no fact in such that matches . In this case, necessarily, by Lemma 5.7, p. 6, , where, for all , is the unique multiset of facts (uniqueness and existence follows from inductive assumption) such that .
There is a useful relationship between queries and conditions:
Lemma 5.15
Let , let be a query in , and let be a list of distinct variables such that and for . Then, there exists a condition in such that, for all ground multisets of facts , and for all ground substitutions such that is closed, is closed, and, moreover,
(19) |
i.e., evaluates to t (resp. f) on if and only if there is some returned by when evaluated against , such that (resp. ). Moreover, if is deterministic, then so is
Proof 5.16
We define by recursion on the structure of a query :
The easy if laborious proof that really satisfies all the conditions in the statement is left to the reader.
Example 5.17
Consider the following query in (where and are distinct variables):
Thus, for any fact of the form in the database, will output , and, if , also . Using the recursive formula from the proof of Lemma 5.15 we easily see that
The next result shows that can emulate relational algebra.
Theorem 5.18
Denote by the relational algebra over the relational schema and the domain of atomic values (which we silently identify with its algebraic representation, where types are sorts and predicates are represented with equationally defined operators into the Bool sort). Furthermore, assume that for each relational algebra expression in a function symbol is in , where is the sort (domain) of the -th column of and is the arity of . For any relational database with schema , let be a multiset corresponding to the database . More precisely,
where is the set of tuples of in . Then, for all formulas in , there exists a closed query in such that for all relational databases with schema ,
(20) |
where is the set of tuples obtained by evaluation of relational query against the relational database . Furthermore, for any relational expression , is deterministic.
Proof 5.19
To prove the theorem we define by recursion on the structure of . We consider relational algebra expressions to be constructed with base relations, projections, selections, set unions, Cartesian products, and set differences. Other well known relational operators such as joins can be defined in terms of basic operators mentioned above. Attribute renaming is not relevant in our case, since we represent relations with positional arguments only. We denote set unions, Cartesian products, and set differences with the usual mathematical notation (e.g., , and , respectively). Notation for projections and selections is less standardised (and needs to be adapted to relations with positional arguments). We denote by the projection of onto ’th, ’th, , and ’th “leg” (i.e., on a single tuple, ). We denote by the selection with condition applied to (i.e., it returns those tuples of which satisfy ). If is -ary, then we will assume that in expression corresponds to the -th column of , for .
Let be a base relation of arity (for simplicity we omit the typing information), and let be a list of distinct variables. Then We now consider selected relational operators:
-
1.
Let be a relational formula defining an -ary relation, and let be a subsequence of . We define to be with each subquery of the form replaced with .
-
2.
Let be a relational formula of arity , and let be a term of sort Bool representing condition on rows of (where in special variable , corresponds to the -th “attribute” of ). We define to be with each subquery of the form replaced with .
-
3.
.
-
4.
Let and be relational formulas. Let for some renaming such that ( by Lemma 5.6 since is closed). Further, let be with each subquery of the form replaced with . Then we define to be with each subquery of the form replaced with .
-
5.
Let and be relational formulas of arity . Let be like in the previous point. Let be a list of distinct variables such that . We define to be with each subquery of the form replaced with (see Lemma 5.15).
An easy induction on the structure of shows that is closed and deterministic, and Equation (20) is satisfied (in the case of induction step for set difference we also use Lemma 5.15)
Remark 5.20
Relational queries evaluate to sets of tuples. However, may evaluate to a multiset of facts — e.g., when evaluating unions duplicate facts are not removed.
Example 5.21
Let and be binary relations. Consider the following relational algebra expression:
Thus, iff and for some . Let represent as a query in using definition of from the proof of Theorem 5.18. First,
where are distinct variables. Then, by the point 4 in the proof
Finally, to deal with selection and projection we apply points 2 and 1, respectively, from the proof:
6 Rewriting semantics of
We associate with a DML query in the rewriting system . Terms rewritten with the rules of are of the form , where . is the database of facts against which we issue the DML query. changes during execution of the query while the facts are removed from it. A multiset , expanded during query execution, contains new facts to be added to the database. is a multiset of fresh facts (see Section 2) from which fresh values are drawn. is a stack of sort which simulates structural recursion. Normal forms encapsulating a new database and new multiset of fresh facts after successful or, respectively, failing execution of a DML query, are constructed with . We consider only terms of sort which satisfy the following freshness condition:
Definition 6.1
A term of sort satisfies the freshness condition if and only if for all positions , in such that , and .
Example 6.2
Consider the following term of sort:
It does not satisfy the freshness condition having as subterms both and with . Our general assumption which justifies the freshness condition is that a fresh fact of the form means that we never before used any value of the form with . Clearly, terms which do not satisfy freshness condition (like the term above) violate this assumption.
Terms of sort are constructed from local computation frames of sort , where , using an associative binary operator with identity . Most constructors of frames are indexed by DML sub-queries of , and lists of distinct variable names of respective sorts such that contains all free variables of :
As can be empty, the above signature templates include , etc. A constant marks successful branches of computation, i.e., those which created either new facts or . Marking such branches is necessary as facts deleted by unsuccessful branches are restored as soon as the branch finishes. Frames of the form , , , , or are called -frames, where is the current substitution. They are related to evaluation of . Marked frames occur in the execution of “unions” . Conditional frames are used in execution of conditionals , where represents evaluation of . Iterator frames and represent iterative execution of . Multiset , called iterator state, contains facts available for matching with (cf. Remark 2.3). Iteration status is equal to t iff the iteration already generated either new facts or (i.e., if the branch related to was successful). “Tentative” iterator frames store multiset of facts deleted from the database in the present step, so that they can be restored if the step is unsuccessful. Given a database , in order to execute a closed DML query we rewrite a state term until a normal form or is reached, indicating that a successful or, respectively, unsuccessful execution of in the database yielded a new database , and a new multiset of fresh facts . Now we are ready to define rule schemas of .
Two consecutive occurrences of are collapsed, and -frames are replaced with :
(21) |
An -frame is removed from the stack. An -frame, where is a fact and is the current substitution, is replaced by and is added to (since , is closed):
(22) |
An -frame is split into the -frame and the -frame. The -frame is marked with so that the evaluation of can be marked as successful when at least one of the branches is successful. When both branches are successful, this can produce two consecutive constants on the stack which are then collapsed using rule in Equation (21).
(23) |
Evaluation of a subquery is initialized with the whole database available for matching. Iteration status is f since evaluation is not successful yet.
(25) |
Let be a sequence of all the distinct variables in . Let be the current substitution. Rule pushes onto the stack a -frame, where is defined by matching with iterator state and with the multiset of fresh facts . It also removes from the iterator state and from the database state, and, finally, it updates fresh facts using defined in Equation (1). Removed facts are stored in the tentative iterator frame:
(26) |
If execution of proves unsuccessful, removed facts can be restored both to iterator state and database. Otherwise, we discard them and set the iteration status to t:
(27) |
We keep applying until we cannot match with iterator state. Then we replace the iterator frame with where is the iteration status, , and . To prevent premature application, rule schema is conditional, where the condition uses functions defined for each and pattern occurring in with the single equation (cf. Equation (8)):
(28) |
The last two rules reduce the -terms with an empty stack or stack containing only the constant into a term constructed with or , respectively:
(29) |
Theorem 6.3
is a terminating rewriting system.
Proof 6.4
The proof is similar to the proof of Theorem 4.2. Only subqueries of the form , where is semiterminating but not terminating (i.e., , but ) require special care. In this case, the signature of forces to be success assured, i.e., ’s evaluation always succeeds, and hence removed facts matching are never restored. This ensures termination of ’s execution.
The following useful observation can be trivially verified by examining the rule schemas:
Lemma 6.5
Let be a DML subquery of . Then, for all multisets of facts , , , , multisets , of fresh facts, stacks , lists of variables and of values ,
-
1.
iff ,
-
2.
iff .
where , , and .
As in the case of pure queries, we consider two DML queries equivalent if and only if they can match their results. We should not, however, distinguish results differing only by the choice of fresh values. Let denote the set of nominal sorts in . Let be the -sorted set of nominal values contained in term . For any -sorted bijection between sets of nominal values we denote by the natural extension of to terms such that . More precisely, if , if is a constant of non-nominal sort or a variable, and if is of non-nominal sort. With those notions we define equivalence on DML queries as follows:
Definition 6.6
Let and be two DML queries in . We say that is logically equivalent to , writing , if and only if, for all ground substitutions such that and are closed, all ground multisets of facts , , all ground multisets of fresh facts , , and all
-
1.
if then there exist multisets of fresh facts , , multisets of facts , , and an -sorted bijection such that , , and .
-
2.
if then there exist multisets of fresh facts , , a multiset of facts , and an -sorted bijection such that and .
The following result is an immediate consequence of Lemma 6.5:
Lemma 6.7
Logical equivalence on queries in is an equivalence relation and a congruence, i.e., if is a position in a DML query such that is a DML query, and , then .
Lemma 6.8
For any closed DML query in , and any renaming , .
Lemma 6.9, proven similarly to Lemma 4.9, clarifies elements of rewriting semantics of DML queries in (we leave evaluation where we are better off with the rewriting definition).
Lemma 6.9
For all ground multisets of facts , and of fresh facts , , as well as closed DML queries , , and all sentences , the following holds:
-
1.
If then or for some ground multiset of facts and ground multiset of fresh facts . If then .
-
2.
If then . If then .
-
3.
iff and .
-
4.
iff or .
-
5.
iff there exists a multiset of fresh facts such that and .
-
6.
iff, for some ground multisets of facts , , , such that , a ground multiset of fresh facts , and stacks where or , we have and .
Lemma 6.10
The following logical equivalences hold between queries in :
Lemma 6.10 is very similar to Lemma 5.8, except that is not commutative. may be non-equivalent with if, say, deletes a fact which is referred to in some pattern in .
We need to generalize the notion of confluence, lest rewriting paths leading to terms differing only by distinct choices of fresh values (as in the next example) are to be considered non-convergent.
Example 6.11
Let I be a nominal sort. Let , . Consider DML query , and let . Then
and, if we match and in reverse order in applications of rule, then
Here we define an equivalence relation on terms of sort which is a bisimulation:
Definition 6.12
We say that term is nominally equivalent to term , in which case we write , if and only if there exists a bijection such that .
Lemma 6.13
Nominal equivalence is an equivalence relation. When restricted to terms satisfying the freshness condition (Definition 6.1), it is also a bisimulation on , for all .
We leave an easy proof of Lemma 6.13 to the reader. The restriction to terms satisfying the freshness condition is necessary for nominal equivalence being a bisimulation, as demonstrated below:
Example 6.14
Let I be a nominal sort. Let , . Define . Consider a DML query Let and . Term does not satisfy the freshness condition (Definition 6.1). It is immediate that , , where , but the only normal form of is and .
We now define a class of queries for which is confluent modulo nominal equivalence.
Definition 6.15
Let be a DML query. We say that has no deletion conflicts if and only if for each DML subquery of , and any subterm of (resp. ) occurring inside , (resp. ) has no subterm occurring inside , or such that and are unifiable.
Definition 6.16
Let be a DML query in . is called deterministic if it has no deletion conflicts and all quantification patterns in (including those inside subterms which are conditions) contain only single facts with unique matching property (but may contain any number of fresh facts).
In Example 4.18 we shown why multiple facts in patterns lead to non-confluence, and as a result, to non-deterministic evaluation of conditions (and queries). However, we have not previously considered deletion conflicts (as they are specific to DML queries). The following example shows why deletion conflicts can prevent confluence:
Example 6.17
Consider the following DML query:
(30) |
Observe that all quantification patterns in consist of single facts, but does have deletion conflicts (facts in both patterns are unifiable, and one of them is -pattern which has the second one in its scope). Since the first pattern () is semi-terminating but not terminating, to ensure that the DML query in its scope is success assured it is of the form .
Now, let . It is easy to see that executing against removes from both -facts and either adds a single fact or nothing depending on whether pattern first matches (which makes it possible for the subquery to succeed then and return ) or (which causes all executions of subquery to fail).
The following theorem states that while evaluation of a deterministic DML query is not itself deterministic, but its results are.
Theorem 6.18
Let be a deterministic query in . Then is confluent up to a nominal equivalence. In particular, given ground multisets of facts , and of fresh facts , there is a unique (up to nominal equivalence) term of the form or such that .
Proof 6.19
The only significant difference between the proof of this theorem and Theorem 5.13 is the presence of deletions and fresh facts. The non-confluence introduced by fresh facts can be absorbed with nominal equivalence. Since has no deletion conflicts, when a DML subquery is executed, neither deletion of facts through influences execution of nor execution of decreases the pool of facts available for matching with . Moreover, if contains for some fact , then is the only fact in , hence cannot fail, facts deleted through are never returned, and behaves like .
Let us finish this section with the following remark about expressibility of :
Remark 6.20
A typical formalization of database updates is to use pairs of queries which define facts to be, respectively, deleted from, and added to the current database. This approach can be emulated in , using multiple DML queries executed in a sequence. First, let us extend the signature with function symbols and for each fact constructor . Let and be queries in which return sets of facts to be deleted and added, respectively, to the database. We assume that and contain no subterms of the form or . Let and be the same as and , respectively, except that all subqueries of sort Fact are replaced, respectively, with and . Then to update the database we execute the following DML queries (in this order):
where are fact constructors occurring in and . Thus, because of Theorem 5.18 we can express any relational database update using multiple DML queries in .
7 Reachability analysis of data-centric business processes
In this section we demonstrate the application of in specification and analysis of data-centric business processes. First, we describe a general reachability and simulation framework, and then devote the rest of the section to an extended example specification.
So far, we have specified the rules for execution of a single DML expression. A business process, in general, executes a sequence of DML expressions according to some orchestration rules. A simple example of such rules which we use here, appropriate for a data driven process, is that if a DML expression can be executed successfully then it can be chosen (non-deterministically) as the next command to be executed. Usually (c.f., [40]) such data modifying operations are launched in response to some events, such as user actions which also provide input parameters for the command. In turn, their execution may trigger further events. Here, taking inspiration from [41, 42], we interpret some of the facts as triggering events, user input, and output events (we describe this in more detail later as a part of the example). This simplifies the simulation.
Thus, we specify a business process simply as a finite set of DML expressions in . During simulation, at each “business step” a DML expression is chosen non-deterministically and is executed. Unsuccessful execution simply leaves the database of facts unchanged. Alternatively, during reachability analysis which performs a breadth-first search through all possible evolutions of the process it is more efficient to make the system stuck on unsuccessful step. This prunes spurious branches in search tree.
More precisely, a set of DML expressions determines a rewriting system defined to be the union of ’s, , augmented with two rule schemas
(31) |
for all .
Rule schema chooses non-deterministically a new DML query for execution and rewrites into an initial state for this query if the execution of the previous one was successful. Similarily, chooses a new DML query if the execution of the previous one failed. It is important to know that the failure of a randomly chosen DML expression does not usually mean that the business process execution is faulty: Instead, it may simply mean that the given DML expression was not applicable at the moment. Since here the only way to know if the DML expression is applicable is to run it, the rule is necessary lest the simulation stops prematurely. On the other hand, unsuccessful executions do not change database state, and thus are spurious, adding no useful information. This is why, when doing reachability analysis which explores using breadth-first search all possible paths of execution (in contrast to simulations, where each simulation travels just a single execution path) it is better to drop the rule .
It is assumed that all DML queries are such that a successful execution of Q consumes and emits a special fact (of sort Fact) called a token. The token does not denote any real data, but rather facilitates a non-deterministic choice of user input. Say, if in the database there were facts , where are possible user inputs for some business step, then we can simulate user choice and execution of further action (based on this choice and expressed as DML query with free variable storing user’s decision) by using the DML expression of the form . If the query wouldn’t match and remove the token then the action would be executed for every possible user input. In the example described in this section instead of a constant token we use tokens parametrized by a natural number. All DML queries consume a token with a non-zero parameter and emit a token with a parameter decreased by one. This permits limiting the number of “large business steps”, i.e., executions of DML queries, in the simulation or search procedure. Rewriting systems such as Maude permit limiting rewriting steps in the search procedure. However, execution of each DML query can take many rewriting steps, the number of which is not easy to estimate. Thus, it is not trivial to pass from the number of rewriting steps to the number of business steps (which are more natural in this context).
Given an initial database we start reachability analysis from term , where is the maximal depth of search expressed in the number of business steps, are nominal sorts for which we need fresh values, and are such that values for , , do not occur in . In case of reachability analysis we search for the term of the form where the database satisfies some condition (either desirable or undesirable one).
7.1 Example specification
We borrow an example from [28, Appendix C] to demonstrate specification of a business process as a set of DML expressions in . The example concerns the process of selecting and advertising restaurant offers of dinners by employees of mediating agency, and managing corresponding bookings. The lifecycles of two key artifact types — Offer and Booking — are presented as finite state machines in Figure 1. Each agent publishes exactly one restaurant offer — either the new one which just came or the one which was previously put on hold. The published offer is in the state available. Agent puts the offer he currently publishes on hold (state onHold) when picking up another offer for publication. Dashed arrows in Figure 1 indicate that entering a given state by an artifact may trigger state change in another artifact, e.g., there is a dashed arrow between the available state and the anonymous transition into onHold state (in a distinct artifact of type Offer). Available offer may get closed (state closed, or be picked up by a customer (transition newBooking to state beingBooked). The latter triggers creation of a new Booking artifact. Booking starts with a preliminary phase of drafting (state drafting) in which the customer chooses dinner hosts (transition addHosts). After draft submission (which changes the state to submitted) the agent computes price for the offer (transition determineProposal to state finalized) and the customer decides to either accept or reject the proposal transitioning, respectively, to the accepted or canceled state. The acceptance may in some cases go through toBeValidated state when additional validation is necessary.
States of offers and bookings are constants of sorts OState and BState, respectively, named like in Figure 1. We use the following nominal sorts for identifiers: Rest for restaurants; Person for customers, agents and hosts; Offer, Book and Url for offers, bookings, and url’s of finalized proposals, respectively. For facts we use the following constructors:
where facts , , indicate that , , and are, respectively, identifiers of a registered restaurant, agent, and customer. means that an offer in a state for a restaurant is managed by an agent . means that booking in a state for a customer is related to an offer . A fact indicates that a person is included as a host for booking . Finally, indicates that finalized proposal for booking , with details and prices, is available at the url .
We now specify selected transitions from Figure 1 in detail. Transition newOffer responsible for creation of new offers is implemented with the following DML query:
Above, denotes the successor function. We assume that each DML query is executed against a database in which there is exactly one token matching , i.e., a token holding a number greater than zero. Thus, in the above, we first choose a single fresh offer identifier, and, non-deterministically, a single registered agent and a single restaurant. The token is marked with , so it is removed from the database after matching (this guarantees that we choose no more than one agent, restaurant, and offer identifier). The query emits back a token with a number decreased by 1 ensuring the possibility (if this number is greater than zero) of executing a next query. The restaurant can be arbitrary (as long as it is registered in the system), however the agent must not manage an offer being booked, as described by the deterministic condition
inside the above DML query. If this condition is not satisfied, the quantifier step fails, the token is returned to the database and a new matching is tried. Since is marked by , we do not try the same agent again. If the correct matching is found, a fact describing new offer is added to the database. We also change the state of any available offer managed by the agent of the new offer to onHold.
An offer which was put on hold, may be resumed by any agent who is not managing an offer which is currently being booked. Agent resuming an offer becomes the new manager of the offer:
With the newBooking transition some offer changes state from available to beingBooked). It also triggers creation of a new booking (with a fresh identifier) in the drafting state for the chosen offer on behalf of some registered customer:
The customer involved in booking can add dinner hosts one by one (see transition addHosts in Figure 1) as long as the booking is in the drafting stage. The added host can be either fresh or be present in the database as a host for another offer. We use separate DML queries for each of those cases. The first case (of a fresh host) is trivial:
In the second case we have to ensure that we are not adding the same person twice:
The submit action simply changes the state of the booking from drafting to submitted. Then, if the customer’s customized booking is infeasible, it can be rejected (reject transition in Figure 1, the implementation of which we omit for brevity’s sake). Otherwise, the final proposal (which includes cost, etc.) to the customer who owns the booking is created (transition determineProposal in Figure 1). The preparation of the proposal is abstracted as (1) creating the fresh url to the proposal, and (2) removing information about hosts (which is available at the new url). As before, we pick the booking non-deterministically using the token and appropriate pattern:
A finalized booking proposal for a restaurant can be accepted either immediately (with ) or after an additional confirmation (with ). The first case applies only to golden customers of a given restaurant , i.e., those who successfully booked a dinner in at least -times, for some fixed . Accepting a proposal changes the state of the offer to which the booking belongs to closed:
Remark 7.1
In our earlier work [15] we have used the almost same example (with minor modifications) to illustrate an alternative formalism (c.f. Section 1.1 in the current paper) where queries, also implemented in the rewriting system, but using meta-level features, are deterministic. This makes them behave like a classical queries, but because of determinism it is not possible to simulate user input directly. Instead, a separate mechanism had to be introduced to simulate non-deterministic input choice. Secondly, DML expressions in [15] did not add or, more importantly, delete facts from the database directly. Instead, they return pairs (which need to be specified in the DML expression itself) of (multi)sets of facts: those to be deleted and those to be added. However, in this particular example (and we believe it is typical) facts to be deleted are matched by parts of the patterns in the query. This (in [15]) led to code duplication, and suggested natural use of rewriting rules (which, of course, replaces matched subterms), and ultimately led to the formalism described in this paper.
Remark 7.2
We have implemented both the syntax and semantics of and in Maude [44]. The implementation is available on the project’s website [45]. To test the implementation we have used specification of the business process described above (also available from [45]). The specification compiles into a Maude’s system module which contains definitions of 196 operators, 285 equations and 83 rewriting rules (in actual implementation we used equations in place of deterministic rewriting rules).
Let us now describe a simple example of a reachability analysis with the specification just described. Let
be an initial database of facts. Let
be an initial multiset of fresh facts. Finally, let
be an initial state. Note that since the token is parametrized by 7, it follows that we can make at most seven successful business steps from this state. We are interested in checking if we can reach from initState (in no more than 7 business steps) the state in which there exists a closed offer (and accepted associated booking) from the database with no bookings and offers. Formally, we want to reach the state matching
Using our implementation [45] we can easily check that a matching state is indeed reachable in 6 business steps (Maude reported 525165 actual rewritings in 2528ms).
8 Conclusion
We have presented a multiset non-deterministic query and data manipulation language based on conditional term rewriting. The intended application of this language is in specification, simulation and reachability analysis of data-centric business processes. However, the remarkable features of , particularly non-determinism and non-standard approach to variable binding, make it interesting on its own. We show that non-determinism of queries is useful for simulating user choices, but we also provide easily identifiable syntactic restrictions which ensure uniqueness of query results. Interestingly, this non-determinism leads to bisimulation-like definitions of logical equivalence between formulas. In the last section we demonstrated how sets of DML queries can be used to specify a business process and we provide a simple framework for simulation and testing.
is a multiset query language. Most formal query languages, including relational calculus and algebra, are based on sets. One under-appreciated fact is that SQL is really a multiset query language, and for a very good reason — removing duplicates is expensive. While this was not our primary reason to use multisets, we believe that using multiset languages encourages query design which avoids unnecessary expensive operations, and takes the complexity of query execution into account better than set-based languages.
The fact that closed formulas are compiled to rewriting systems permits their symbolic execution using narrowing [16]. We intend to explore this possibility in future research. This is also one of the reasons why it was important to limit the use of conditional rules as much as possible: many implementations of narrowing (see e.g., [46]) do not permit narrowing with conditions.
We have implemented , and a specification framework extending the one described at the beginning of Section 7 in Maude [46]. The implementation is available from [45]. It differs in non-essential way from the one described in the present paper, but the code is extensively documented.
Acknowledgements
The author is grateful to the anonymous reviewers for their helpful remarks
References
- [1] Hull R. Artifact-Centric Business Process Models: Brief Survey of Research Results and Challenges. In: Meersman R, Tari Z (eds.), On the Move to Meaningful Internet Systems: OTM 2008. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008 pp. 1152–1163. 10.1007/978-3-540-88873-4_17.
- [2] Calvanese D, De Giacomo G, Montali M. Foundations of Data-aware Process Analysis: A Database Theory Perspective. In: Proceedings of the 32Nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’13. ACM, New York, NY, USA, 2013 pp. 1–12. 10.1145/2463664.2467796.
- [3] van der Aalst WM. The application of Petri nets to workflow management. Journal of circuits, systems, and computers, 1998. 8(01):21–66. 10.1142/S0218126698000043.
- [4] van der Aalst WM, Ter Hofstede AH. YAWL: yet another workflow language. Information systems, 2005. 30(4):245–275. 10.1016/j.is.2004.02.002.
- [5] Rosa-Velardo F, de Frutos-Escrig D. Decidability and complexity of Petri nets with unordered data. Theoretical Computer Science, 2011. 412(34):4439–4451.
- [6] Lasota S. Decidability border for Petri nets with data: WQO dichotomy conjecture. In: International Conference on Application and Theory of Petri Nets and Concurrency. Springer, 2016 pp. 20–36. doi:10.1007/978-3-319-39086-4_3.
- [7] Montali M, Rivkin A. Model checking Petri nets with names using data-centric dynamic systems. Formal Aspects of Computing, 2016. 28(4):615–641. 10.1007/s00165-016-0370-6.
- [8] Montali M, Rivkin A. DB-Nets: On the Marriage of Colored Petri Nets and Relational Databases. In: Transactions on Petri Nets and Other Models of Concurrency XII. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017 pp. 91–118. 10.1007/978-3-662-55862-1_5.
- [9] Montali M, Rivkin A. From DB-nets to Coloured Petri Nets with Priorities. In: International Conference on Applications and Theory of Petri Nets and Concurrency. Springer, 2019 pp. 449–469. doi:10.1007/978-3-030-21571-2_24.
- [10] Meseguer J. Conditional rewriting logic as a unified model of concurrency. Theoretical computer science, 1992. 96(1):73–155. 10.1016/0304-3975(92)90182-F.
- [11] Meseguer J, Rosu G. The rewriting logic semantics project. Theoretical Computer Science, 2007. 373(3):213 – 237. 10.1016/j.tcs.2006.12.018.
- [12] Stehr MO, Meseguer J, Ölveczky PC. Rewriting Logic as a Unifying Framework for Petri Nets. In: Unifying Petri Nets: Advances in Petri Nets. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001 pp. 250–303. 10.1007/3-540-45541-8_9.
- [13] Padberg J, Schulz A. Model Checking Reconfigurable Petri Nets with Maude. In: Echahed R, Minas M (eds.), Graph Transformation. Springer International Publishing, Cham, 2016 pp. 54–70. 10.1007/978-3-319-40530-8_4.
- [14] Kheldoun A, Barkaoui K, Ioualalen M. Formal verification of complex business processes based on high-level Petri nets. Information Sciences, 2017. 385:39–54. 10.1016/j.ins.2016.12.044.
- [15] Zieliński B. A Query Language Based on Term Matching and Rewriting. Fundamenta Informaticae, 2019. 169:237–274. 10.3233/FI-2019-1845.
- [16] Meseguer J, Thati P. Symbolic reachability analysis using narrowing and its application to verification of cryptographic protocols. Higher-Order and Symbolic Computation, 2007. 20(1-2):123–160. 10.1007/s10990-007-9000-6.
- [17] Fay M. First-order unification in an equational theory. In: Proceedings of the 4th Workshop on Automated Deduction, Austin, Texas, 1979.
- [18] Hullot JM. Canonical forms and unification. In: International Conference on Automated Deduction. Springer, 1980 pp. 318–334. doi:10.1007/3-540-10009-1_25.
- [19] Alpuente M, Escobar S, Iborra J. Termination of narrowing revisited. Theoretical Computer Science, 2009. 410(46):4608–4625. doi:10.1016/j.tcs.2009.07.037.
- [20] Zieliński B, Maślanka P. Relational Transition System in Maude. In: Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation: 13th International Conference, BDAS 2017, Ustroń, Poland, May 30 - June 2, 2017, Proceedings. Springer International Publishing, Cham, 2017 pp. 497–511. 10.1007/978-3-319-58274-0_39.
- [21] Roşu G, Ellison C, Schulte W. Matching Logic: An Alternative to Hoare/Floyd Logic. In: Algebraic Methodology and Software Technology. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011 pp. 142–162. 10.1007/978-3-642-17796-5_9.
- [22] Roşu G. Matching logic. arXiv:1705.06312, 2017.
- [23] Stehr MO. CINNI-A Generic Calculus of Explicit Substitutions and its Application to --and -Calculi. Electronic Notes in Theoretical Computer Science, 2000. 36:70–92. 10.1016/S1571-0661(05)80125-2.
- [24] Baader F. The description logic handbook: Theory, implementation and applications. Cambridge University Press, New York, NY, USA, 2003. ISBN:0-521-78176-0.
- [25] De Giacomo G, De Masellis R, Rosati R. Verification of conjunctive artifact-centric services. International Journal of Cooperative Information Systems, 2012. 21(02):111–139. 10.1142/S0218843012500025.
- [26] Hariri BB, Calvanese D, De Giacomo G, De Masellis R, Felli P. Foundations of relational artifacts verification. In: International Conference on Business Process Management. Springer, 2011 pp. 379–395. 10.1007/978-3-642-23059-2_28.
- [27] Calvanese D, Montali M, Patrizi F, De Giacomo G. Description logic based dynamic systems: modeling, verification, and synthesis. In: Proceedings of the 24th International Conference on Artificial Intelligence. AAAI Press, 2015 pp. 4247–4253.
- [28] Abdulla PA, Aiswarya C, Atig MF, Montali M, Rezine O. Recency-Bounded Verification of Dynamic Database-Driven Systems. In: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS ’16. ACM, New York, NY, USA, 2016 pp. 195–210. 10.1145/2902251.2902300.
- [29] Chen-Burger YH, Robertson D. Automating business modelling: a guide to using logic to represent informal methods and support reasoning. Springer Science & Business Media, 2006. 10.1007/b138799.
- [30] Merouani H, Mokhati F, Seridi-Bouchelaghem H. Formalizing Artifact-Centric Business Processes - Towards a Conformance Testing Approach. In: Proceedings of the 16th International Conference on Enterprise Information Systems. 2014 pp. 368–374. 10.5220/0004951803680374.
- [31] McCarthy J, Hayes PJ. Some philosophical problems from the standpoint of artificial intelligence. Readings in artificial intelligence, 1969. pp. 431–450.
- [32] Deutsch A, Li Y, Vianu V. Verifas: a practical verifier for artifact systems. Proceedings of the VLDB Endowment, 2017. 11(3):283–296. doi:10.14778/3157794.3157798.
- [33] Deutsch A, Li Y, Vianu V. Verification of hierarchical artifact systems. ACM Transactions on Database Systems (TODS), 2019. 44(3):1–68. doi:10.1145/3321487.
- [34] Calvanese D, Ghilardi S, Gianola A, Montali M, Rivkin A. Formal modeling and SMT-based parameterized verification of data-aware BPMN. In: International Conference on Business Process Management. Springer, 2019 pp. 157–175. doi:10.1007/978-3-030-26619-6_12.
- [35] Seco JC, Debois S, Hildebrandt T, Slaats T. RESEDA: Declaring live event-driven computations as REactive SEmi-structured DAta. In: 2018 IEEE 22nd International enterprise distributed object computing conference (EDOC). IEEE, 2018 pp. 75–84. doi:10.1109/EDOC.2018.00020.
- [36] Meseguer J. Membership algebra as a logical framework for equational specification. In: Recent Trends in Algebraic Development Techniques. Springer Berlin Heidelberg, Berlin, Heidelberg, 1998 pp. 18–61. 10.1007/3-540-64299-4_26.
- [37] Huet G. Confluent reductions: Abstract properties and applications to term rewriting systems: Abstract properties and applications to term rewriting systems. Journal of the ACM (JACM), 1980. 27(4):797–821.
- [38] Thielscher M. Introduction to the Fluent Calculus. Electronic Transactions on Artificial Intelligence, 1998. 2(3-4):179–192.
- [39] Ochremiak J. Nominal sets over algebraic atoms. In: International Conference on Relational and Algebraic Methods in Computer Science. Springer, 2014 pp. 429–445. 10.1007/978-3-319-06251-8_26.
- [40] Hull R, Damaggio E, De Masellis R, Fournier F, Gupta M, Heath III FT, Hobson S, Linehan M, Maradugu S, Nigam A, et al. Business artifacts with guard-stage-milestone lifecycles: managing artifact interactions with conditions and events. In: Proceedings of the 5th ACM international conference on Distributed event-based system. ACM, 2011 pp. 51–62. 10.1145/2002259.2002270.
- [41] Abiteboul S, Vianu V, Fordham B, Yesha Y. Relational Transducers for Electronic Commerce. In: Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS ’98. ACM, New York, NY, USA. ISBN 0-89791-996-3, 1998 pp. 179–187. 10.1145/275487.275507.
- [42] Abiteboul S, Vianu V, Fordham B, Yesha Y. Relational transducers for electronic commerce. Journal of Computer and System Sciences, 2000. 61(2):236–269. doi:10.1006/jcss.2000.1708.
- [43] Abdulla PA, Aiswarya C, Atig MF, Montali M, Rezine O. Recency-bounded verification of dynamic database-driven systems (extended version). arXiv preprint arXiv:1604.03413, 2016.
- [44] Clavel M, Durán F, Eker S, Lincoln P, Martí-Oliet N, Meseguer J, Talcott C. The Maude 2.0 System. In: Nieuwenhuis R (ed.), Rewriting Techniques and Applications (RTA 2003), number 2706 in Lecture Notes in Computer Science. Springer-Verlag, 2003 pp. 76–87. 10.1007/3-540-44881-0_7.
-
[45]
Zieliński B.
Nondeterministic Rewriting Query Language (NDRQL).
Project website,
http://ki.wfi.uni.lodz.pl/ndrql/. - [46] Clavel M, Duràn F, Eker S, Escobar S, Lincoln P, Martì-Oliet N, Meseguer J, Talcott C. Maude Manual (Version 2.7.1), 2016.