This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\hideLIPIcs

Indian Statistical Institute, Kolkata, India and https://www.isical.ac.in/~ansuman[email protected]://orcid.org/0000-0003-0220-646X Government College of Engineering and Ceramic Technology, Kolkata, India and https://sites.google.com/view/kingshukchatterjee/home[email protected]://orcid.org/0000-0002-2617-6309 Tata Institute of Fundamental Research, Mumbai, India and https://www.tifr.res.in/~shibashis.guha/[email protected]://orcid.org/0000-0002-9814-6651 \ccsdesc[500]Theory of computation Formal languages and automata theory \relatedversion

Acknowledgements.
We thank Amaldev Manuel for providing useful comments on a preliminary version of this paper.\EventEditorsJohn Q. Open and Joan R. Access \EventNoEds2 \EventLongTitle42nd Conference on Very Important Topics (CVIT 2016) \EventShortTitleCVIT 2016 \EventAcronymCVIT \EventYear2016 \EventDateDecember 24–27, 2016 \EventLocationLittle Whinging, United Kingdom \EventLogo \SeriesVolume42 \ArticleNo23

Set Augmented Finite Automata over Infinite Alphabets

Ansuman Banerjee    Kingshuk Chatterjee    Shibashis Guha
Abstract

A data language is a set of finite words defined on an infinite alphabet. Data languages are used to express properties associated with data values (domain defined over a countably infinite set). In this paper, we introduce set augmented finite automata (SAFA), a new class of automata for expressing data languages. We investigate the decision problems, closure properties, and expressiveness of SAFA. We also study the deterministic variant of these automata.

keywords:
automata on infinite alphabet, data languages, register automata, expressiveness, closure properties

1 Introduction

A data language is a set of data words that are concatenations of attribute, data-value pairs. While the set of attributes is finite, the values that these attributes hold often come from a countably infinite set (e.g. natural numbers). With large scale availability of data in recent times, there is a need for methods for modeling and analysis of data languages. Thus, there is a demand for automated methods for recognizing attribute data relationships and languages defined on infinite alphabets. A data word is a concatenation of a finite number of attribute, data-value pairs, i.e. a data word w=(a1,d1)(a2,d2)(a|w|,d|w|)w=(a_{1},d_{1})(a_{2},d_{2})...(a_{|w|},d_{|w|}), where each aia_{i} belongs to a finite set and each did_{i} belongs to a countably infinite set. We denote by |w||w| the length of ww. This work introduces a new model for data languages on infinite alphabets.

kk-register automata (finite automata with kk registers, each capable of storing one data value[16] are finite automata with the ability to handle infinite alphabets. The nonemptiness and membership problems for register automata are both NP-complete [24]. However, the language recognizability is somewhat restricted, since it uses only a finite number of registers to store the data values. Thus, register automata cannot accept many data languages, one such being L𝖿𝖽(a)L_{{\sf fd}(a)} which is a collection of data words where all data values associated with the attribute aa have to be distinct. Pushdown versions of automata on infinite alphabets using stacks have also been introduced in [8]. However, even with the introduction of a stack, these models are unable to accept L𝖿𝖽(a)L_{{\sf fd}(a)}. Data automata are introduced in [6] and the emptiness problem is shown to be nonelementary. Further, class memory automata (CMA) and class counter automata (CCA) are introduced in [5, 19] respectively. While CMA and data automata are shown to be equivalent [5], the set of languages accepted by CCA is a subset of the set of languages accepted by CMA. The nonemptiness problem for CCA is EXPSPACE-complete [19], and the nonemptiness problem for CMA is interreducible to the reachability problem in Petri nets [20, 6, 18, 21], and hence Ackermann-complete [9].

Our Contribution:

This work introduces set augmented finite automata (SAFA) that are finite automaton models equipped with a finite number of finite sets for storing data values. Using these sets as auxiliary storage, SAFA is able to recognize many important data languages including L𝖿𝖽(𝖺)L_{\sf fd(a)}. This paper has the following contributions.

  • We present the formal definition of the SAFA model on infinite alphabets (Definition 3.1).

  • We show that nonemptiness and membership are NP-complete for SAFA. Further, we show that universality for SAFA is undecidable (Theorem 4.74.134.15).

  • We study the closure properties on the SAFA model (See Section 4.2). In order to show non-closure under complementation we introduce a pumping lemma for SAFA (Lemma 4.21).

  • We also study the deterministic variant of SAFA and show that there are languages that necessarily need nondeterminism to be accepted (Theorem 4.38).

  • We present a strict hierarchy of languages with respect to the number of sets associated with SAFA models (Theorem 3.5).

  • Finally, we study the expressiveness of SAFA models (See Section 5). While we show that the class of languages recognized by SAFA and register automata are incomparable, the set of languages accepted by SAFA is a strict subset of the set of languages accepted by CCA, and hence by CMA.

Related work:

Register automata introduced by Kaminski et.al.[16] use a finite number of registers to store data values; hence they can only accept those data languages in which membership depends on remembering properties of a finite number kk of data values where kk is bounded above by the number of registers in the register automata. An extension to finite register models are pushdown automata models for infinite alphabets [8]. Cheng et.al. [8] and Autebut et.al. [1] both describe context free languages for infinite alphabets. The membership problem for the grammar proposed by Cheng et. al. is decidable unlike Autebut’s. Sakamoto et.al.[24] show that the membership problem is PTIME-complete and nonemptiness is NP-complete for deterministic register finite automata model on infinite alphabets. Neven et.al.[22, 23] discuss the properties of register and pebble automata on infinite alphabets, in terms of their language acceptance capabilities and also establish their relationship with logic. Tan et.al. [25] introduces a weak 2-pebble automata model whose emptiness is decidable but with significant reduction in acceptance capabilities with respect to pebble automata.Kaminski et.al.[17] also develop a regular expression over infinite alphabets. Choffrut et.al.[4] define finite automata on infinite alphabets which uses first order logic on transitions. This is later extended by Iosif et.al. to an alternating automata model [14] with emptiness problem being undecidable but they give two efficient semi-algorithms for emptiness checking. Demri et.al. [11] explore the relationship between linear temporal logic (LTL) and register automata. Grumberg et.al. [13] introduce variable automata over infinite alphabets where transitions are defined over alphabets as well as over variables. Data automata are introduced in [6]. Class memory automata (CMA) is introduced in [5] and is shown to be equivalent to data automata. The set of languages accepted by CMA is a superset of Class counter automata (CCA) [19], another infinite alphabet automata introduced by Manuel et.al. Bollig [7] combines CMA and register automata and shows that local existential monadic second order (MSO) logic can be converted to class register automata in time polynomial in the size of the input automata. Figuera [12] discusses the properties of alternating register automata and also discusses restricted variants of this model where decidability is tractable. Dassow et.al. [10] introduces the P-automata model and establishes that these are equivalent to a restricted version of register automata. A detailed survey of existing finite automata models on infinite alphabets can be found in [20].

The elegance of SAFA is its simple structure that is easy to implement. Moreover, the membership and nonemptiness problems are NP-complete for our model. On one hand, this gives us an advantage over the hash-based family of models on infinite alphabets, with respect to the associated problem complexities. On the other hand, this puts our model in the same complexity class as the kk-register automata, while having the ability to accept many important data languages.

A preliminary version of this paper appeared in [3].

2 Preliminaries

Let \mathbb{N} denote the set of natural numbers, and [k][k] the set {1,,k}\{1,\dots,k\} where k>0k>0. Let Σ\Sigma be a finite alphabet which comprises a finite set of attributes, and DD be a countably infinite set of attribute values. A data word w(Σ×D)w\in(\Sigma\times D)^{*} is a concatenation of attribute, data value pairs, where * denotes zero or more repetitions. A data value is also known as an attribute value, that is the value associated with an attribute. An example data word is of the form w=(a1,d1)(a|w|,d|w|)w=(a_{1},d_{1})\cdots(a_{|w|},d_{|w|}) where a1,,a|w|Σa_{1},\dots,a_{|w|}\in\Sigma, d1,,d|w|Dd_{1},...,d_{|w|}\in D, and |w||w| denotes the length of ww. An example data word ww on Σ={a,b}\Sigma=\{a,b\} and D=D=\mathbb{N} is: (a,1)(a,2)(b,1)(b,5),(a,2)(a,5)(a,7)(a,100)(a,1)(a,2)(b,1)(b,5),(a,2)(a,5)(a,7)(a,100) with |w||w| = 8. A data language L(Σ×D)L\ \subseteq(\Sigma\times D)^{*} is a set of data words. Some example data languages with Σ={a,b}\Sigma=\{a,b\}, D=D=\mathbb{N} are mentioned below.

  • L𝖿𝖽(a)L_{{\sf fd}(a)}: language of data words, wherein the data values associated with attribute aa are all distinct.

  • L𝖼𝗇𝗍=2L_{\forall{\sf cnt}=2}: language of data words wherein all data values appear exactly twice.

  • L𝖼𝗇𝗍2L_{\exists{\sf cnt}\neq 2}: the language of data words ww where there exists a data value dd which does not appear twice. L𝖼𝗇𝗍2L_{\exists{\sf cnt}\neq 2} is the complement of L𝖼𝗇𝗍=2L_{\forall{\sf cnt}=2}.

  • LabL_{a\exists b}: the language of data words wherein the data values associated with attribute aa are those which have already appeared with the attribute bb.

For a word w(Σ×D)w\in(\Sigma\times D)^{*}, we denote by 𝗉𝗋𝗈𝗃Σ(w){{\sf proj}_{\Sigma}}(w) and 𝗉𝗋𝗈𝗃D(w){{\sf proj}_{D}}(w) the projection of ww on Σ\Sigma and DD respectively. Let L𝗉𝗋𝗈𝗃Σ(L)=regexp(r)L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)} be the set of all data words ww such that 𝗉𝗋𝗈𝗃Σ(w)Lregexp(r){{\sf proj}_{\Sigma}}(w)\in L_{regexp(r)} where Lregexp(r)ΣL_{regexp(r)}\subseteq\Sigma^{*} is the set of all words over Σ\Sigma generated by the regular expression rr.

Post Correspondence Problem (PCP)

: \chadded[id=KC]The PCP problem consists of two lists of equal length, say nn. The items of the lists are finite strings defined on an alphabet Σ\Sigma^{\prime} where |Σ|2|\Sigma^{\prime}|\geq 2. Without loss of generality, we can assume Σ={a,b}\Sigma=\{a,b\}. List 1 consists of the strings x1,,xnx_{1},\dots,x_{n}, and list 2 consists of the strings y1,,yny_{1},\dots,y_{n} where x1,,xn,y1,,ynΣx_{1},\dots,x_{n},y_{1},\dots,y_{n}\in{\Sigma}^{*}. The PCP problem is true, if there exists a sequence α1,,αm\alpha_{1},\dots,\alpha_{m} where α1,,αm[n]\alpha_{1},\dots,\alpha_{m}\in[n] such that xα1xαm=yα1yαmx_{\alpha_{1}}\cdots x_{\alpha_{m}}=y_{\alpha_{1}}\cdots y_{\alpha_{m}} and false otherwise (see Example 2.1). Each xix_{i} yiy_{i} pair is considered as a domino and the PCP solution is an arrangement of these dominoes such that the strings in the upper part and lower part of the arranged dominoes become same.

Example 2.1.

An instance of the PCP problem on Σ={a,b}\Sigma=\{a,b\} is as follows: List 1=a,baa,ba and List 2=ab,aab,a, Then one possible solution for the above PCP problem is the sequence 1,21,2, i.e. (aab)(baa)\binom{a}{ab}\binom{ba}{a}. ∎

kk-register automata [16]

: A kk-register automaton is a tuple (Q,Σ,δ,τ0,U,q0,F)(Q,\Sigma,\delta,\tau_{0},U,q_{0},F), where QQ is a finite set of states, q0Qq_{0}\in Q is an initial state and FQF\subseteq Q is a set of final states, τ0\tau_{0} is an initial register configuration given by τ0:[k]D{}\tau_{0}:[k]\rightarrow D\cup\{\bot\}, where DD is a countably infinite set, and \bot denotes an uninitialized register, and UU is a partial update function: (Q×Σ)[k](Q\times\Sigma)\rightarrow[k]. The transition relation is δ(Q×Σ×[k]×Q)\delta\subseteq(Q\times\Sigma\times[k]\times Q). The registers initially contain distinct data values other than \bot which can be present in more than one register. The automaton works as follows. Consider a register automaton MM in state qQq\in Q. Each of its registers rir_{i} holds datum did_{i} where 0ik0\leq i\leq k, diD{}d_{i}\in D\cup\{\bot\}. Let MM at some instance reads the jthj^{th} data element (aj,dj)(a_{j},d_{j}) of the input word ww where ajΣa_{j}\in\Sigma, djDd_{j}\in D. Two cases may arise.

  • Case 1: There exists an ii such that dj=did_{j}=d_{i}: In this case two situations may arise (i) (q,a,i,q)δ(q,a,i,q^{\prime})\in\delta and (ii) (q,a,i,q)δ(q,a,i,q^{\prime})\notin\delta. In situation (i) the corresponding transition is executed, and in situation (ii) the automaton stops without consuming the data element.

  • Case 2: There exists no register ii such that dj=did_{j}=d_{i}: In this case, for all ii, we have djdid_{j}\neq d_{i}. We look at the partial update function UU. If U(q,a)U(q,a) is not defined, the automaton stops without consuming the data element. If U(q,a)U(q,a) is defined, then djd_{j} is inserted in the register U(q,a)U(q,a) and the automaton executes the transition (q,a,U(q,a),q)(q,a,U(q,a),q^{\prime}) if (q,a,U(q,a),q)δ(q,a,U(q,a),q^{\prime})\in\delta, otherwise it halts if (q,a,U(q,a),q)δ(q,a,U(q,a),q^{\prime})\notin\delta.

The automaton MM accepts an input data word ww if it consumes the whole word and ends in a final state.

Class counting automata [19]

: A class counting automaton (a.k.a 1-bag CCA) is defined as a 55-tuple M=(Q,Σ,δ,q0,F)M=(Q,\Sigma,\delta,q_{0},F) where QQ is a finite set of states, q0Qq_{0}\in Q is an initial state, and FQF\subseteq Q is the set of accepting states. A constraint cc is a pair (𝗈𝗉,e)({\sf op},e), where 𝗈𝗉={<,>,=,,,}{\sf op}=\{<,>,=,\leq,\geq,\neq\}, ee\in\mathbb{N}. Let CC denote a collection of constraints. The transition relation is δ(Q×Σ×C×Inst××Q)\delta\subseteq(Q\times\Sigma\times C\times Inst\times\mathbb{N}\times Q). A bag is a finite set β(D×)\beta\subseteq(D\times\mathbb{N}). Initially, β(d)\beta(d) is set to 0 for all data values dDd\in D. The set 𝖨𝗇𝗌𝗍={+,}{\sf Inst}=\{\uparrow^{+},\downarrow\}. An element of 𝖨𝗇𝗌𝗍×{\sf Inst}\times\mathbb{N} is called an operation. When making a transition, a CCA reads an attribute, data-value pair (a,d)(a,d), and checks if β(d)𝗈𝗉e\beta(d)\;{\sf op}\;e holds. If it holds, then (i) either β(d)\beta(d) is incremented by mm if the operation is (+,m)(\uparrow^{+},m), or (ii) β(d)\beta(d) is reset to mm if the operation is (,m)(\downarrow,m), and we go to the next state. A CCA accepts a data word ww if it is in a final state after consuming ww.

A kk-bag CCA has kk bags. For a data value, constraint checking can be done on a subset of the bags. The bags can also be updated or reset independently. The set of transitions of a kk-bag CCA is a subset of (Q×Σ×Ck×(Inst×)k×Q)(Q\times\Sigma\times C^{k}\times(Inst\times\mathbb{N})^{k}\times Q). We denote by βi\beta_{i}, the ithi^{\text{th}} bag. It is shown in [19] that for every kk-bag CCA that accepts a language LL, there exists a 11-bag CCA which accepts the same language LL.

Class memory automata [5]

: A class memory automaton is a 66-tuple M=(Q,Σ,δ,q0,F,Fg)M=(Q,\Sigma,\delta,q_{0},F_{\ell},F_{g}) where QQ is a finite set of states, q0Qq_{0}\in Q is an initial state and FgFQF_{g}\subseteq F_{\ell}\subseteq Q are a set of global and local accepting states respectively. The transition relation is δ(Q×Σ×(Q{})×Q)\delta\subseteq(Q\times\Sigma\times(Q\cup\{\bot\})\times Q). The automaton keeps track of the last state where a data value dd is encountered. If a data value dd is not yet encountered, then it is associated with \bot. Each transition of a CMA is dependent on the current state of the automaton and the state the automaton was in when the data value being read currently was last encountered. A data word ww is accepted if the automaton reaches a state qFgq\in F_{g} and the last state of all the data values encountered in ww are in FF_{\ell}.

3 Set augmented finite automata

Definition 3.1.

A set augmented finite automaton (SAFA) is defined as a 66-tuple M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) where QQ is a finite set of states, Σ\Sigma is a finite alphabet, DD is a countably infinite set, q0Qq_{0}\in Q is the initial state, FQF\subseteq Q is a set of final states, HH is a finite set of finite sets of data values. The transition relation is defined as: δ\delta Q×Σ×C×OP×Q\subseteq Q\times\Sigma\times C\times OP\times Q where C={p(hi),!p(hi)|hiH}C=\{p(h_{i}),!p(h_{i})\ |\ h_{i}\in H\}, hih_{i} denotes the ithi^{th} set in HH, and OP={,𝗂𝗇𝗌(hi)|hiH}OP=\{-,\ {\sf ins}(h_{i})\ |\ h_{i}\in H\}. ∎

We call a SAFA a singleton if |H|=1|H|=1. The unary Boolean predicate p(hi)p(h_{i}) evaluates to true if the data value currently being read by the automaton is present in the ithi^{th} set hih_{i}. The predicate !p(hi)!p(h_{i}) is true if the data value currently being read is not in hih_{i}. Further, OPOP denotes a set of operations that a SAFA can execute on reading a symbol; the operation 𝗂𝗇𝗌(hi){\sf ins}(h_{i}) inserts the data value currently being read by the automaton into the set hih_{i}, while - denotes no such insertion is done. For any combination not in δ\delta, we assume the transition is absent.

For a SAFA MM, we define a configuration (q,h)Q×2DH(q,h)\in Q\times 2^{D^{H}} as follows: qQq\in Q is a state of the automaton, h=h1,h|H|h=\langle h_{1},...h_{|H|}\rangle where each hih_{i} for i[|H|]i\in[|H|] is a finite subset of DD, and hh denotes the content of the sets in HH. A run ρ\rho of MM on an input w=(a1,d1)(a|w|,d|w|)w=(a_{1},d_{1})\cdots(a_{|w|},d_{|w|}) is a sequence (q0,h0),,(q|w|,h|w|)(q_{0},h^{0}),\dots,(q_{|w|},h^{|w|}), where hj=h1j,,h|H|jh^{j}=\langle h^{j}_{1},\dots,h^{j}_{|H|}\rangle, and hijh^{j}_{i} for 1i|H|1\leq i\leq|H| is the content of the set hih_{i} after reading the prefix (a1,d1)(aj,dj)(a_{1},d_{1})\cdots(a_{j},d_{j}) for 1j|w|1\leq j\leq|w|. A configuration (qj+1,hj+1)(q_{j+1},h^{j+1}) succeeds a configuration (qj,hj)(q_{j},h^{j}) if there is a transition (qj,aj+1,α,𝗈𝗉,qj+1)(q_{j},a_{j+1},\alpha,{\sf op},q_{j+1}) where

  1. (i)

    for α=p(hi)\alpha=p(h_{i}), we have that the data value dj+1hijd_{j+1}\in{h_{i}}^{j}.

  2. (ii)

    for α=!p(hi)\alpha\ =\ !p(h_{i}), we have that the data value dj+1hijd_{j+1}\notin{h_{i}}^{j}.

The execution of the operation 𝗈𝗉OP{\sf op}\in OP takes the content of the sets of data values from hjh^{j} to hj+1h^{j+1}. If 𝗈𝗉{\sf op} is -, then hj+1=hjh^{j+1}=h^{j}. If 𝗈𝗉{\sf op} is ins(hi)(h_{i}), then hlj+1=hlj{h_{l}}^{j+1}={h_{l}}^{j} for all hlH{hi}h_{l}\in H\setminus\{h_{i}\}, and hij+1=hij{dj+1}{h_{i}}^{j+1}={h_{i}}^{j}\cup\{d_{j+1}\}. If the runrun consumes the whole word ww, and q|w|Fq_{|w|}\in F, then the run is accepting, otherwise, it is rejecting. A word ww is accepted by MM if it has an accepting run. The language L(M)L(M) accepted by MM consists of all words accepted by MM. We denote by |ρ||\rho| the length of the run which equals the number of transitions taken. Note that for the run ρ\rho on an input word ww, we have that |w||w| = |ρ||\rho|.

Definition 3.2.

A SAFA M=(Q,Σ,q0,F,H,δ)M=(Q,\Sigma,q_{0},F,H,\delta) is deterministic (DSAFA) if for every qQq\in Q and aΣa\in\Sigma, if there is a transition (q,a,α,𝗈𝗉,q)(q,a,\alpha,{\sf op},q^{\prime}), where qQq^{\prime}\in Q, 𝗈𝗉OP{\sf op}\in OP, α{p(hi),!p(hi)}\alpha\in\{p(h_{i}),!p(h_{i})\}, hiHh_{i}\in H, then there cannot be any transition of the form (q,a,p(hl),𝗈𝗉,q′′)(q,a,p(h_{l}),{\sf op}^{\prime},q^{\prime\prime}), (q,a,!p(hl),𝗈𝗉,q′′)(q,a,!p(h_{l}),{\sf op}^{\prime},q^{\prime\prime}), where q′′Qq^{\prime\prime}\in Q, hlhih_{l}\neq h_{i}, hlHh_{l}\in H, 𝗈𝗉OP{\sf op}^{\prime}\in OP. The only other allowed transition can be (q,a,α,𝗈𝗉,q′′)(q,a,\alpha^{\prime},{\sf op}^{\prime},q^{\prime\prime}) for α{p(hi),!p(hi)}\alpha^{\prime}\in\{p(h_{i}),!p(h_{i})\}, αα\alpha^{\prime}\neq\alpha, and 𝗈𝗉OP{\sf op}^{\prime}\in OP. ∎

Let 𝖲𝖠𝖥𝖠{\mathcal{L}_{\sf SAFA}} and 𝖣𝖲𝖠𝖥𝖠{\mathcal{L}_{\sf DSAFA}} denote the set of all languages accepted by nondeterministic SAFA and deterministic SAFA respectively. We illustrate the SAFA model with some instances of data languages.

q0q_{0}q1q_{1}(a,!p(h1),𝗂𝗇𝗌(h1))(a,!p(h_{1}),{\sf ins}(h_{1}))(b,p(h1),),(b,!p(h1),)(b,p(h_{1}),-),(b,!p(h_{1}),-)(a,p(h1),)(a,p(h_{1}),-)
Figure 1: SAFA for L𝖿𝖽(𝖺)L_{\sf fd(a)}
Example 3.3.

The language L𝖿𝖽(a)L_{{\sf fd}(a)} can be accepted by the DSAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) in Figure 1. Here, Q={q0,q1}Q=\{q_{0},q_{1}\}, Σ={a,b}\Sigma=\{a,b\}, DD is any countably infinite set, F={q0}F=\{q_{0}\}, H={h1}H=\{h_{1}\}, the transition relation δ\delta consists of the transitions shown in Figure 1. The automaton MM works as follows. The set h1h_{1} is used to store the data values encountered in the input word associated with aa. At q0q_{0}, if MM reads bb, it remains in q0q_{0} without modifying HH. At q0q_{0} when the automaton reads aa, it checks whether the corresponding data value is present in h1h_{1}. If present, it indicates it has already encountered this data value with attribute aa before; therefore the automaton goes to q1q_{1} which is a dead state and the input word is rejected. If the data value is not present in h1h_{1}, it implies that it has not encountered this value with aa, thus it remains in q0q_{0} and inserts the data value into h1h_{1}. Only if the automaton encounters a duplicate data value, it goes to q1q_{1}. If it does not encounter duplicate data values with respect to aa in the input, the automaton remains in q0q_{0} after consuming the entire word and it is accepted. ∎

Example 3.4.

Consider the data language L𝖼𝗇𝗍2:L_{\exists{\sf cnt}\neq 2:} over the alphabet Σ={a}\Sigma=\{a\}. A nonempty word ww is in the language if there exists a data value that appears nn times in ww with nn \neq 2. This can be accepted by the nondeterministic SAFA in Figure 2. At state q0q_{0}, the automaton nondeterministically guesses the data value that does not appear exactly twice \chadded[id=KC]and stores it in set h2h_{2} and goes to state q1q_{1}. The automaton remains in state q1q_{1} if the count of the data value is 11 or moves to state q3q_{3} (via q2q_{2}) and remains there if the count of the data value is greater than 22. In both the cases, it accepts the input word if it can be consumed entirely. If the guess is incorrect, the data value appears twice, and it is in the nonaccepting state q2q_{2} after consuming the input word. Thus, if a data word has all its data values that appear exactly twice, then all the runs end in q2q_{2} and the input is rejected. ∎

q0q_{0}q1q_{1}q2q_{2}q3q_{3}(a,!p(h1),),(a,p(h1),)(a,!p(h_{1}),-),(a,p(h_{1}),-)(a,p(h1),𝗂𝗇𝗌(h1)),(a,!p(h1),𝗂𝗇𝗌(h1))(a,p(h_{1}),{\sf ins}(h_{1})),(a,!p(h_{1}),{\sf ins}(h_{1}))(a,!p(h1),𝗂𝗇𝗌(h2))(a,!p(h_{1}),{\sf ins}(h_{2}))(a,!p(h2),)(a,!p(h_{2}),-)(a,p(h2),)(a,p(h_{2}),-)(a,!p(h2),)(a,!p(h_{2}),-)(a,p(h2),)(a,p(h_{2}),-)
Figure 2: SAFA for L𝖼𝗇𝗍2L_{\exists{\sf cnt}\neq 2}
\chadded

[id=KC]The number of sets in the SAFA model impacts the language accepting capacity. The following theorem establishes a hierarchy of accepted languages by SAFA based on the size of HH.

Theorem 3.5.

No SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) with Σ={a1,,ak+1}\Sigma=\{a_{1},\dots,a_{k+1}\}, |H|=k|H|=k can accept the language L=L𝖿𝖽(𝖺𝟣)L𝖿𝖽(𝖺𝗄+𝟣)L𝗉𝗋𝗈𝗃Σ(L)=a1ak+1L=L_{\sf fd(a_{1})}\cap\dots L_{\sf fd(a_{k+1})}\cap L_{{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}\cdots a_{k+1}^{*}}. 111The language L=L𝖿𝖽(𝖺𝟣)L𝖿𝖽(𝖺𝗄+𝟣)L^{\prime}=L_{\sf fd(a_{1})}\cap\dots L_{\sf fd(a_{k+1})} could have also been considered but the proof is relatively simpler if we instead consider LL.

Proof 3.6.

Assume that there exists a SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) with |H|=k|H|=k, Σ={a1,a2,ak+1}\Sigma=\{a_{1},a_{2},...a_{k+1}\} which accepts L=L𝖿𝖽(a1)&𝖿𝖽(ak+1)&𝗉𝗋𝗈𝗃Σ(L)=a1a2ak+1L=L_{{\sf fd}(a_{1})\&\dots\sf fd(a_{k+1})\&{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}a_{2}^{*}\cdots a_{k+1}^{*}} and MM has |Q|=n|Q|=n states. Since the automaton MM accepts LL and LL contains words which are longer than nn, the automaton MM must have at least one cycle of sequence of transitions in its structure. Thus, the automaton accepts w=xyzLw=xyz\in L where xx is the prefix of ww before entering the cycle, yy is the infix of the word ww that is consumed in the cycle and |y|>1|y|>1 and zz is suffix of the word ww that is consumed after exiting the cycle. The sequence of transitions that consume xx and zz themselves may or may not contain cycles of transitions. Let us focus on the cycle of sequence TcT_{c} of transitions that consumes yy.

  • CASE 1: TcT_{c} contains a transition tt with p(hi)p(h_{i}). The (attribute, data value) pair tt consumed in yy, can be consumed again in the next execution of TcT_{c}. The new word accepted by MM will then not be LL. Therefore TcT_{c} cannot have any transition with p(hi)p(h_{i})

  • CASE 2: Now suppose TcT_{c} has a transition tt of the form (qi,a,!p(hi),,qj)(q_{i},a,!p(h_{i}),-,q_{j}) or (qi,a,!p(hi),𝗂𝗇𝗌(hj),qj)(q_{i},a,!p(h_{i}),{\sf ins}(h_{j}),q_{j}), qi,qjQq_{i},q_{j}\in Q, aΣ,hi,hjH,ija\in\Sigma,h_{i},h_{j}\in H,i\neq j. When TcT_{c} is executed for the first time, let the transition tt consume a data value which MM has not consumed before and MM will not consume in the next execution of TcT_{c} except when it is executing tt. Since the number of transitions in TcT_{c} is finite and DD is countably infinite, it is always possible to find such a value. When TcT_{c} is executed again and tt is being executed, the same (attribute, data-value) pair can be consumed by it as the data value has not been inserted in hih_{i}, when tt executed before. Therefore, the new word which MM accepts cannot be in LL.

  • CASE 3: Since 𝗉𝗋𝗈𝗃Σ(L)=a1ak+1{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}\cdots a_{k+1}^{*}, the cycle of transitions cannot be defined on two different ai,ajΣa_{i},a_{j}\in\Sigma. This is because if TcT_{c} is executed again, then the projection of the new word on Σ\Sigma will no longer be in 𝗉𝗋𝗈𝗃Σ(L){{\sf proj}_{\Sigma}}(L).

From Cases 1 to 3, we conclude that for MM to accept LL, MM must have at least one cycle for each aiΣa_{i}\in\Sigma and the sequence of transitions in the cycle for aia_{i} must be of the form (qi,ai,!p(hj),𝗂𝗇𝗌(hj),qj)(q_{i},a_{i},!p(h_{j}),{\sf ins}(h_{j}),q_{j}), qi,qjQq_{i},q_{j}\in Q, hjHh_{j}\in H, aiΣa_{i}\in\Sigma.

As |Σ|>|H||\Sigma|>|H|, by pigeon hole principle there will be two ai,aja_{i},a_{j} such that they insert the value in the same set hkh_{k}. Now, suppose aia_{i} inserts a data value did_{i} in set hkh_{k}. Consider the data word w=(b1,d1)(bi,di)(bj,dj=di)(b|w|,d|w|)w=(b_{1},d_{1})\cdots(b_{i},d_{i})\cdots(b_{j},d_{j}=d_{i})\cdots(b_{|w|},d_{|w|}) where all the positions have unique data value except at did_{i} and djd_{j}, b1,,b|w|Σb_{1},\dots,b_{|w|}\in\Sigma and 𝗉𝗋𝗈𝗃Σ(w){{\sf proj}_{\Sigma}}(w) is of the form a1a2ak+1a_{1}^{*}a_{2}^{*}\cdots a_{k+1}^{*}. The string ww is a valid data word in LL. But the automaton MM will reject such a data word ww, because when implementing the transition (aj,!p(hk),𝗂𝗇𝗌(hk))(a_{j},!p(h_{k}),{\sf ins}(h_{k})) for a cycle of aja_{j}, when it reads (aj,di)(a_{j},d_{i}) it will fail as did_{i} is already stored in hkh_{k} when the data element (ai,di)(a_{i},d_{i}) was being consumed. It can be argued that the data word ww can be accepted by some other cycle involving aja_{j} but increasing the number of cycles will further result in aja_{j} storing its data in the same set hh_{\ell}, hHh_{\ell}\in H with some other aka_{k}, akΣa_{k}\in\Sigma. Thus, the above mentioned problem will persist. Therefore, it is not possible to construct a SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) with |H|=k|H|=k, Σ={a1,,ak+1}\Sigma=\{a_{1},\dots,a_{k+1}\} which accepts the language L=L𝖿𝖽(a1)&𝖿𝖽(ak+1)&𝗉𝗋𝗈𝗃Σ(L)=a1a2ak+1.L=L_{{\sf fd}(a_{1})\&\dots\sf fd(a_{k+1})\&{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}a_{2}^{*}\cdots a_{k+1}^{*}}.

Let 𝖲𝖠𝖥𝖠(|H|=k){\mathcal{L}_{\sf SAFA}}_{(|H|=k)} be the set of all languages accepted by SAFA with |H|=k|H|=k. Since every SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) with |H|=k|H|=k can be simulated by a SAFA M=(Q,Σ×D,q0,F,H,δ)M^{\prime}=(Q^{\prime},\Sigma\times D,q_{0},F^{\prime},H^{\prime},\delta^{\prime}) with |H|=|H^{\prime}|=\ell and >k\ell>k by using k\ell-k dummy sets that are never used in an execution of MM^{\prime}, we have the following.

Corollary 3.7.

𝖲𝖠𝖥𝖠(|H|=k)𝖲𝖠𝖥𝖠(|H|=k+1){\mathcal{L}_{\sf SAFA}}_{(|H|=k)}\subsetneq{\mathcal{L}_{\sf SAFA}}_{(|H|=k+1)}.

Corollary 3.7 shows that there is a strict hierarchy in terms of accepting capabilities of SAFA with respect to |H||H|.

4 Decision problems and closure properties

We study the nonemptiness, membership problems and closure properties of SAFA.

4.1 Nonemptiness and membership

We study the nonemptiness and the membership problems of SAFA and show that both are 𝖭𝖯{\sf NP}-complete. Given a SAFA MM and an input word ww, the membership problem is to check if wL(M)w\in L(M). Given a SAFA MM, the nonemptiness problem is to check if L(M)L(M)\neq\emptyset. We start with the nonemptiness problem. To show the NP-membership, we first show that there exists a small run if the language accepted by a given SAFA is nonempty.

Lemma 4.1.

Every SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) with L(M)L(M)\neq\emptyset has a data word in L(M)L(M) with an accepting run ρ\rho such that |ρ||\rho| \leq |Q|(|H|+2)1|Q|\cdot(|H|+2)-1.

Proof 4.2.

We prove by contradiction. Assume that L(M)L(M)\neq\emptyset and that for every wL(M)w\in L(M), for all accepting runs ρ\rho of ww, we have that |ρ|>|Q|(|H|+2)1=|Q|(|H|+1)+|Q|1|\rho|>|Q|\cdot(|H|+2)-1=|Q|\cdot(|H|+1)+|Q|-1. We define an indicator function IHI_{H} which maps HH to {0,1}H\{0,1\}^{H}, where 0 corresponding to a set hHh\in H denotes that hh is empty while 11 denotes that hh is nonempty. Since |ρ||Q|(|H|+2)|\rho|\geq|Q|\cdot(|H|+2), the run ρ\rho can be divided into |H|+2|H|+2 segments, each of length |Q||Q|, that is, each segment is an infix over |Q||Q| transitions. By pigeon hole principle, in each segment, there exists a state qQq\in Q that is visited more than once. Further, since there are |H|+2|H|+2 such segments, again by pigeon hole principle, there exists a segment and a state qQq^{\prime}\in Q such that qq^{\prime} is visited more than once in this segment and IHI_{H} does not change over the infix of the run between the two successive visits of qq^{\prime}. We now note that the sequence of transitions reading this infix yy makes a loop over qq^{\prime}, and thus yy can be removed from ww resulting into a word ww^{\prime} and the corresponding run is ρ\rho^{\prime} such that |ρ|<|ρ||\rho^{\prime}|<|\rho| and ρ\rho^{\prime} is accepting. Now there can be two cases if the suffix of ρ\rho following reading yy in ww has a transition tt with p(hi)p(h_{i}) and it reads a data value dd.

  • It may happen that the data value dd was inserted along the infix yy. Since IHI_{H} does not change while reading the infix yy, it implies that the set hih_{i} was nonempty even before the infix yy was read. Let a data value dd^{\prime} was inserted into hih_{i} while reading the prefix before yy. Then ww^{\prime} may be modified to w′′w^{\prime\prime} so that the suffix following yy in w′′w^{\prime\prime} reads the data value dd^{\prime} instead of dd. Let the run corresponding to w′′w^{\prime\prime} be ρ′′\rho^{\prime\prime} that follows the same sequence of states as ρ\rho^{\prime}. Note that |ρ′′|=|ρ|<|ρ||\rho^{\prime\prime}|=|\rho^{\prime}|<|\rho|, and that ρ′′\rho^{\prime\prime} is an accepting run.

  • If while reading ww, the transition tt reads a data value dd that was inserted while reading a prefix appearing before yy, then ww^{\prime} does not need to be modified, and we thus have the accepting run ρ\rho^{\prime}.

Since ρ\rho is an arbitrary accepting run of length |Q|(|H|+2)|Q|\cdot(|H|+2) or more, starting from ρ\rho, we can remove infixes repeatedly and modify it as mentioned above if needed until we reach an accepting run of length strictly smaller than |Q|(|H|+2)|Q|\cdot(|H|+2) without affecting acceptance, and hence the contradiction.

Using Lemma 4.1 we get the following.

Lemma 4.3.

Nonemptiness problem for SAFA is in NP.

Proof 4.4.

Consider a SAFA MM with L(M)L(M)\neq\emptyset. By Lemma 4.1, \chadded[id=KC]a Turing machine can nondeterministically guess an accepting run of polynomial length, hence the result.

We now show that the nonemptiness problem is NP-hard even for deterministic acyclic SAFA over an alphabet of size 33. Example 4.5 describes our construction.

Example 4.5.

For a 3CNF formula ϕ=(xy¯z)(xyz)\phi=(x\vee\overline{y}\vee z)\wedge(x\vee y\vee z), the corresponding SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) is shown in Figure 3. We denote by A(a,i)A(a,i) the transition (a,!p(hi),𝗂𝗇𝗌(hi))(a,!p(h_{i}),{\sf ins}(h_{i})) and by T(a,i)T(a,i) the transition (a,p(hi),)(a,p(h_{i}),-) with Q={q0,qx,qy,qz,qc1,qc2}Q=\{q_{0},q_{x},q_{y},q_{z},q_{c_{1}},q_{c_{2}}\}, Σ={a1,a2,a3}\Sigma=\{a_{1},a_{2},a_{3}\}, D=D=\mathbb{N}, H={hx,hx¯,hy,hy¯,hz,hz¯}H=\{h_{x},h_{\overline{x}},h_{y},h_{\overline{y}},h_{z},h_{\overline{z}}\}, F={qc2}F=\{q_{c_{2}}\}. In particular, if there are \ell variables in the formula, then |H|=2|H|=2\ell. ∎

q0q_{0}qxq_{x}qyq_{y}qzq_{z}qc1q_{c_{1}}qc2q_{c_{2}}A(a1,x)A(a_{1},x)A(a2,x¯)A(a_{2},\overline{x})A(a1,y)A(a_{1},y)A(a2,y¯)A(a_{2},\overline{y})A(a1,z)A(a_{1},z)A(a2,z¯)A(a_{2},\overline{z})T(a1,x)T(a_{1},x)T(a2,y¯)T(a_{2},\overline{y})T(a3,z)T(a_{3},z)T(a1,x)T(a_{1},x)T(a2,y)T(a_{2},y)T(a3,z)T(a_{3},z) A(a,i)=(a,!p(hi),𝗂𝗇𝗌(hi))A(a,i)=(a,!p(h_{i}),{\sf ins}(h_{i})), T(a,i)=(a,p(hi),)T(a,i)=(a,p(h_{i}),-)
Figure 3: The SAFA MM corresponding to a 3CNF formula ϕ\phi
Lemma 4.6.

The nonemptiness problem is NP-hard for deterministic acyclic SAFA over an alphabet of size 33.

From Lemma 4.3 and Lemma 4.6, we have the following.

Theorem 4.7.

The nonemptiness problem for SAFA is NP-complete.

We now show that for singleton SAFA, nonemptiness is NL-complete. Towards this, we first show the following lemma.

Lemma 4.8.

The nonemptiness problem for singleton SAFA is reducible to the nonemptiness of a nondeterministic finite automaton (NFA) in PTIME.

Proof 4.9.

We begin with the following observations on SAFA transitions for a run on an input word. The idea is to see how we can construct a word accepted by a given SAFA that takes it from the initial state to a final state following the transition rules.

  • Transitions with !p(h1)!p(h_{1}) can always be satisfied since we have an infinite number of data values. We can always introduce a new data value with an attribute so that it is not in the set h1h_{1}. However, transitions with p(h1)p(h_{1}) should only be executed if there exists an 𝗂𝗇𝗌(h1){\sf ins}(h_{1}) somewhere earlier on the path before reaching the transition with p(h1)p(h_{1}).

  • Given a SAFA, it is just not enough to only look for simple paths from the initial state to a final state satisfying the observations stated above (see Figure  4). In Figure 4, the only simple path is q0qfq_{0}\rightarrow q_{f} with transition (a,p(h1),)(a,p(h_{1}),-). Since the simple path does not contain any transition having 𝗂𝗇𝗌(h1){\sf ins}(h_{1}) prior to the transition containing p(h1)p(h_{1}), no word is accepted by the automaton along this simple path. However, we find that the automaton accepts the string (a,d1)(a,d1)(a,d_{1})(a,d_{1}). Thus, when checking for emptiness of a SAFA, we cannot just restrict our analysis to simple paths.

  • The language accepted by a SAFA is nonempty iff there exists a sequence of transitions that takes it from an initial state to a final state with the added condition that if the sequence of transitions contains a p(hi)p(h_{i}), then there must exist a corresponding 𝗂𝗇𝗌(hi){\sf ins}(h_{i}) prior to the p(hi)p(h_{i}) in the sequence of transitions.

q0q_{0}qfq_{f}(a,p(h1),)(a,p(h_{1}),-)(a,!p(h1),𝗂𝗇𝗌(p(h1))(a,!p(h_{1}),{\sf ins}(p(h_{1}))
Figure 4: A simple SAFA MM

Given a singleton SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta), H={h1}H=\{h_{1}\}, we check for nonemptiness by constructing three nondeterministic finite automata (NFA). The first NFA M1M_{1} accepts all those sequences of transitions which can take the SAFA MM from its initial state to any of its final states. However, M1M_{1} does not check if the sequence of transitions on the path from the initial state to a final state of MM is valid (i.e. Condition 1 above). The second NFA M2M_{2} accepts all possible sequences of transitions that have 𝗂𝗇𝗌(h1){\sf ins}(h_{1}) prior to encountering a p(h1)p(h_{1}). We construct the synchronous product [2] of M1M_{1} and M2M_{2} that gives us another NFA M3M_{3}. We check for emptiness of M3M_{3}. If M3M_{3} is not empty (final state is reachable), we can conclude that there exists at least one sequence of transitions that takes the SAFA MM from the initial state to a final state and every p(h1)p(h_{1}) encountered on that sequence of transitions has an 𝗂𝗇𝗌(h1){\sf ins}(h_{1}) prior to it. Therefore, the SAFA MM is not empty. If M3M_{3} is empty, it indicates that there exists no such sequence of transitions. Thus, if M3M_{3} is empty, we can conclude that the SAFA MM is empty as well and there is no data word that is accepted by the SAFA.

Given a SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta), H={h1}H=\{h_{1}\} we construct NFA M1=(Q,Σ,q0,F,δ)M_{1}=(Q,\Sigma^{\prime},q_{0},F,\delta^{\prime}) as below:

  • Σ={(a,b,c)}\Sigma^{\prime}=\{(a,b,c)\} where aΣa\in\Sigma, b{p(h1),!p(h1)}b\in\{p(h_{1}),!p(h_{1})\} c{𝗂𝗇𝗌(h1),}c\in\{{\sf ins}(h_{1}),-\}. The alphabet Σ\Sigma^{\prime} contains all transitions that M can have.

  • δ=δ\delta^{\prime}=\delta with the triplets in δ\delta considered as elements of Σ\Sigma^{\prime}.

q0q_{0}qfq_{f}(a,p(h1),)(a,p(h_{1}),-)(a,!p(h1),𝗂𝗇𝗌(p(h1))(a,!p(h_{1}),{\sf ins}(p(h_{1}))
Figure 5: The NFA M1M_{1} corresponding to the SAFA MM in Figure 4
q0q_{0}q1q_{1}(a,!p(h1),𝗂𝗇𝗌(h1))(a,!p(h_{1}),{\sf ins}(h_{1})),(a,!p(h1),)(a,!p(h_{1}),-) (a,α,op)(a,\alpha,op),α{p(h1),!p(h1)}\alpha\in\{p(h_{1}),!p(h_{1})\}, op{,𝗂𝗇𝗌(h1)}op\in\{-,{\sf ins}(h_{1})\}
Figure 6: The NFA M2M_{2} corresponding to the SAFA in Figure 4
q0q0q_{0}q_{0}q0q1q_{0}q_{1}qfq1q_{f}q_{1}(a,!p(h1),𝗂𝗇𝗌(h1))(a,!p(h_{1}),{\sf ins}(h_{1}))(a,!p(h1),𝗂𝗇𝗌(h1))(a,!p(h_{1}),{\sf ins}(h_{1}))(a,p(h1),)(a,p(h_{1}),-)
Figure 7: NFA M3M_{3} : Synchronous product of NFA M1M_{1} and NFA M2M_{2} for SAFA in Figure 4.

From the construction of M1M_{1}, we see that it accepts all those possible sequences of transitions that may take MM from its initial state to any of its final states. We construct NFA M2=(Q′′,Σ,q0,Q′′,δ′′)M_{2}=(Q^{\prime\prime},\Sigma^{\prime},q_{0},Q^{\prime\prime},\delta^{\prime\prime}) as below:

  • Q′′={q0,q1}Q^{\prime\prime}=\{q_{0},q_{1}\}

The transitions in δ′′\delta^{\prime\prime} are as follows:

  • δ(q0,(a,!p(h1),))={q0}\delta(q_{0},(a,!p(h_{1}),-))=\{q_{0}\}, aΣa\in\Sigma

  • δ(q0,(a,!p(h1),𝗂𝗇𝗌(h1)))={q1}\delta(q_{0},(a,!p(h_{1}),{\sf ins}(h_{1})))=\{q_{1}\}, aΣa\in\Sigma

  • δ(q1,(a,!p(h1),𝗂𝗇𝗌(h1)))={q1}\delta(q_{1},(a,!p(h_{1}),{\sf ins}(h_{1})))=\{q_{1}\}, aΣa\in\Sigma

  • δ(q1,(a,p(h1),𝗂𝗇𝗌(h1)))={q1}\delta(q_{1},(a,p(h_{1}),{\sf ins}(h_{1})))=\{q_{1}\}, aΣa\in\Sigma

  • δ(q1,(a,p(h1),))={q1}\delta(q_{1},(a,p(h_{1}),-))=\{q_{1}\}, aΣa\in\Sigma

  • δ(q1,(a,!p(h1),))=q1\delta(q_{1},(a,!p(h_{1}),-))={q_{1}}, aΣa\in\Sigma

The automaton M2M_{2} works as follows. State q0q_{0} denotes that we have not yet come across 𝗂𝗇𝗌(h1){\sf ins}(h_{1}), state q1q_{1} denotes we have seen an 𝗂𝗇𝗌(h1){\sf ins}(h_{1}). For inputs of the form (x,!p(hj),)(x,!p(h_{j}),-) where xΣx\in\Sigma, we remain in state q0q_{0}. If we come across inputs of the form (x,!p(h1),𝗂𝗇𝗌(h1))(x,!p(h_{1}),{\sf ins}(h_{1})) where xΣx\in\Sigma, we move to state q1q_{1} from q0q_{0}. At state q1q_{1}, the automaton remains in state q1q_{1} for every element μΣ.\mu\in\Sigma^{\prime}.

By construction, M2M_{2} accepts all those sequences of transitions of MM where every transition containing p(h1)p(h_{1}) is preceded by at least one transition containing 𝗂𝗇𝗌(h1){\sf ins}(h_{1}). The NFA M3M_{3} is a synchronous product of NFA M1M_{1} and NFA M2M_{2}. Therefore NFA M3M_{3} accepts all those sequences which take the SAFA MM from its initial state to a final state. Thus, if the language accepted by M3M_{3} is empty, so is the language accepted by the SAFA MM, and non-empty otherwise. The automaton M3M_{3} is nonempty if there exists a simple path from its initial state to any of its final states. This can be found out using a standard Depth First Search (DFS).

The time complexity of the emptiness check is polynomial in the size of M3M_{3}. The size of M3M_{3} depends on the size of M1M_{1} and M2M_{2}. Size of an NFA M=(Q,Σ,q0,F,δ)M=(Q,\Sigma,q_{0},F,\delta) is defined as |M|=|Q|+qQ,aΣ|δ(q,a)||M|=|Q|+\sum_{q\in Q,a\in\Sigma}|\delta(q,a)|. The number of states in M1=|Q|M_{1}=|Q| is same as MM. The number of states in M2M_{2} is 22 and the number of states in M3M_{3} is at most 2|Q|2|Q| and number of edges in M3M_{3} is at most 4|Σ|×|δ|×2|Q|4|\Sigma|\times|\delta|\times 2|Q| which is polynomial in the input size.

Example 4.10.

The NFA M1M_{1}, M2M_{2}, M3M_{3} corresponding to SAFA MM (Figure 4) are shown in Figures 5, 6, and 7 respectively. We observe that in NFA M3M_{3}, there exists a path from the initial to the final state. NFA M3M_{3} is not empty, therefore the SAFA MM corresponding to Figure 4 is also not empty, which is true. ∎

Theorem 4.11.

The nonemptiness problem for singleton SAFA is 𝖭𝖫{\sf NL}-complete.

Proof 4.12.

We first discuss 𝖭𝖫{\sf NL}-membership. By Lemma 4.8, the nonemptiness for singleton SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) is in 𝖯𝖳𝖨𝖬𝖤\sf PTIME by reducing the problem to checking the nonemptiness of a nondeterministic finite automaton (NFA) with 2|Q|2|Q| states. This NFA can be constructed on-the-fly leading to an 𝖭𝖫{\sf NL}-membership of the nonemptiness problem of singleton SAFA.

For 𝖭𝖫{\sf NL}-hardness, we show a reduction from the reachability problem on a directed graph GG having vertex set V={1,n}V=\{1,\dots n\} which is known to be 𝖭𝖫{\sf NL}-complete [15]. Let GG be a directed graph with V={1,2,,n}V=\{1,2,...,n\} and we are given the vertices 11 and nn. We define a SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) where Q=VQ=V, Σ={a}\Sigma=\{a\}, DD is a countably infinite set, q0=1q_{0}=1, F={n}F=\{n\}, H={h1}H=\{h_{1}\} i.e. |H|=1|H|=1. The transitions in δ\delta are as follows: (i,a,!p(h1),,j)δ(i,a,!p(h_{1}),-,j)\in\delta if (i,j)(i,j) is an edge in GG. It is easy to see that MM can be constructed from GG in logspace and that L(M)ϕL(M)\neq\phi iff there is a path from vertex 11 to vertex nn in GG. Hence, the result.

We now show that membership for SAFA is NP-complete. We first note that unlike the nonemptiness problem, for DSAFA, membership can be decided in 𝖯𝖳𝖨𝖬𝖤{\sf PTIME} by reading the input word and by checking if a final state is reached.

Theorem 4.13.

The membership problem for SAFA is NP-complete.

Proof 4.14.

Given a SAFA MM and an input word ww, if wL(M)w\in L(M), then a nondeterministic Turing machine can guess an accepting run in polynomial time and hence the membership problem is in NP.

For showing NP-hardness, we reduce from 33SAT for the nonemptiness problem as done for Lemma 4.6. Instead of a deterministic automaton that was constructed in the proof of Lemma 4.6, we construct a nondeterministic SAFA MM with Σ={a}\Sigma=\{a\} and all the transitions in MM are labelled with the same letter aΣa\in\Sigma. Everything else remains the same as in the construction in Lemma 4.6. Note that in the 3SAT formula ψ\psi, if there are \ell variables and kk clauses, then there is a path of length +k\ell+k from the initial state to the unique final state of MM. We consider an input word w=(a,d)(a,d)w=(a,d)\cdots(a,d), that is a word in which all the attribute, data-value pairs are identical in the whole word such that |w|=+k|w|=\ell+k. It is not difficult to see that wL(M)w\in L(M) iff ψ\psi is satisfiable.

Finally, we show that given a SAFA MM defined on Σ×D\Sigma\times D, whether L(M)=(Σ×D)L(M)=(\Sigma\times D)^{*} (universality problem) is undecidable.

Theorem 4.15.

The universality problem for SAFA is undecidable.

Proof 4.16.

The proof is similar to showing \chadded[id=KC]that the universality problem for kk-register automata is undecidable [23]. We reduce the Post Correspondence Problem (PCP) which is already known to be undecidable to the universality problem for SAFA. \chdeleted[id=KC]The PCP problem consists of two lists of equal length, say nn. The items of the lists are finite strings defined on an alphabet Σ\Sigma^{\prime} where |Σ|2|\Sigma^{\prime}|\geq 2. Without loss of generality, we can assume Σ={a,b}\Sigma^{\prime}=\{a,b\}. List 1 consists of the strings x1,,xnx_{1},\dots,x_{n}, and list 2 consists of the strings y1,,yny_{1},\dots,y_{n} where x1,,xn,y1,,ynΣx_{1},\dots,x_{n},y_{1},\dots,y_{n}\in{\Sigma^{\prime}}^{*}. The PCP problem is true, if there exists a sequence α1,,αm\alpha_{1},\dots,\alpha_{m} where α1,,αm[n]\alpha_{1},\dots,\alpha_{m}\in[n] such that xα1xαm=yα1yαmx_{\alpha_{1}}\cdots x_{\alpha_{m}}=y_{\alpha_{1}}\cdots y_{\alpha_{m}} and false otherwise (see Example 2.1). We reduce the PCP problem to the universality problem for SAFA such that the constructed SAFA does not accept a word which corresponds to a PCP solution. Thus, the SAFA is universal if and only if there does not exist a solution to the PCP problem. For the reduction, we consider input data words of the format u(#,d#)v($,d$)u(\#,d_{\#})v(\$,d_{\$}) with d#,d$Dd_{\#},d_{\$}\in D, where data item (#,d#)(\#,d_{\#}) is a separator and the data item ($,d$)(\$,d_{\$}) is an end-marker. The data words uu and vv represent a candidate solution (xα1xαm;yβ1yβmx_{\alpha_{1}}\cdots x_{\alpha_{m}};y_{\beta_{1}}\cdots y_{\beta_{m}}) where α1,,αm,β1,,βm[n]\alpha_{1},\dots,\alpha_{m},\beta_{1},\dots,\beta_{m}\in[n] of the PCP instance. Such a candidate solution is a true solution of the PCP instance if the following conditions hold.

  • αi=βi\alpha_{i}=\beta_{i} for each i[n]i\in[n] which denotes \chadded[id=KC]the fact that the corresponding strings are taken from the same domino.

  • xα1xαm=yβ1yβmx_{\alpha_{1}}\cdots x_{\alpha_{m}}=y_{\beta_{1}}\cdots y_{\beta_{m}}, i.e. both strings are same.

We now describe the format in more detail.

  • Each xαjx_{\alpha_{j}} is encoded as (αj,dγ)(a1,dδ1)(ak,dδk)(\alpha_{j},d_{\gamma})(a_{1},d_{\delta_{1}})\cdots(a_{k},d_{\delta_{k}}) where dγd_{\gamma} gives a unique data value to this particular occurrence of domino string from the first list. The symbols a1,akΣa_{1},\dots a_{k}\in\Sigma, the data values dδ1,,dδkDd_{\delta_{1}},\dots,d_{\delta_{k}}\in D represent the position of each aia_{i} in xαjx_{\alpha_{j}} uniquely and xαj=a1akx_{\alpha_{j}}=a_{1}\cdots a_{k}. Similarly, yβjy_{\beta_{j}} is also encoded. The data words u,v(([n]×D)(Σ×D))u,v\in(([n]\times D)(\Sigma\times D)^{*})^{*}. Every dγd_{\gamma} and dδd_{\delta} is unique in uu, that is even across different instances of xαjx_{\alpha_{j}} the data values dγd_{\gamma} and dδd_{\delta} used are different.

  • A string u(#,d#)v($,d$)u(\#,d_{\#})v(\$,d_{\$}) is syntactically correct if the above conditions hold and also the following two conditions are true.

    • 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(u))=𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(v)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u))={{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v)), i.e. the sequence of data values associated with the symbols in [n][n] in uu are same as that in vv. Having this same sequence of data values in both uu and vv corresponds to the fact that the xisx_{i}^{\prime}s and the yisy_{i}^{\prime}s appear in the same order i.e. αi=βi\alpha_{i}=\beta_{i} for each αi[n].\alpha_{i}\in[n].

    • 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃Σ×D(u))=𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃Σ×D(v)){{\sf proj}_{D}}({\sf proj}_{\Sigma\times D}(u))={{\sf proj}_{D}}({\sf proj}_{\Sigma\times D}(v)), i.e. the sequence of data values associated with the symbols in Σ\Sigma in uu are same as that in vv. This corresponds to the fact that the strings in uu and vv obtained by concatenating the xisx_{i}^{\prime}s and the yisy_{i}^{\prime}s respectively match, i.e. xα1xαm=yβ1yβmx_{\alpha_{1}}\cdots x_{\alpha_{m}}=y_{\beta_{1}}\cdots y_{\beta_{m}}.

A syntactically correct string u(#,d#)v($,d$)u(\#,d_{\#})v(\$,d_{\$}) is a true solution of a PCP instance if

  • for each data value in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(u)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u)) the number in [n][n] associated with that data value in both uu and vv are same. This ensures that the strings in both uu and vv are chosen from the same domino, i.e. αi=βi\alpha_{i}=\beta_{i} for each αi,βi[n]\alpha_{i},\beta_{i}\in[n], and

  • for each data value in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃Σ×D(u)){{\sf proj}_{D}}({\sf proj}_{\Sigma\times D}(u)), the letter in Σ\Sigma associated with that data value in both uu and vv are same. This ensures that the strings formed from both the list are same. i.e. xα1xαm=yβ1yβmx_{\alpha_{1}}\cdots x_{\alpha_{m}}=y_{\beta_{1}}\cdots y_{\beta_{m}}.

We now describe a nondeterministic SAFA MM which accepts an input data word w(Σ×D)w\in(\Sigma^{\prime}\times D)^{*} where Σ=[n]Σ{#,$}\Sigma^{\prime}=[n]\cup\Sigma\cup\{\#,\$\} if and only if the input data word is not in the correct format or it is not a solution of the PCP instance. The SAFA MM checks and accepts ww if the following conditions are satisfied for the input string ww.

  1. 1.

    The input strng ww is not in the format as required by a PCP instance:

    1. (a)

      The input word ww is not in the form u(#,d#)v($,d$)u(\#,d_{\#})v(\$,d_{\$}). This checking can be done using an NFA.

    2. (b)

      Consider a substring wuw_{u} between two consecutive (α1,d1),(α2,d2)(\alpha_{1},d_{1}),(\alpha_{2},d_{2}) in uu where α1[n]\alpha_{1}\in[n], α2[n]{#}\alpha_{2}\in[n]\cup\{\#\}, and d1,d2Dd_{1},d_{2}\in D, and we call 𝗉𝗋𝗈𝗃Σ(wu){\sf proj}_{\Sigma}(w_{u}) the Σ\Sigma-projection of wuw_{u}. The string uu is not in the right format if there exists a substring wuw_{u} as above whose Σ\Sigma-projection is not the same as xα1x_{\alpha_{1}}. Similarly, the string vv is not in the right format if there exists a substring wvw_{v} as above whose Σ\Sigma-projection is not the same as yα1y_{\alpha_{1}}. Corresponding to every string xαx_{\alpha} for α[n]\alpha\in[n], there is a deterministic finite automaton (DFA) that accepts Σ{xα}\Sigma^{*}\setminus\{x_{\alpha}\}. Given the α\alpha, we can use the corresponding DFA to check that the Σ\Sigma-projection of wuw_{u} is not the same as xαx_{\alpha}. The nondeterminstic SAFA guesses such a substring wuw_{u} of uu which is not in the right format.

  2. 2.

    The dγd_{\gamma} projections in uu and vv are not in the required format.

    1. (a)

      Two data values in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(u)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u)) are same. The SAFA can nondeterministically guess that a particular data value is repeated in uu and store it in a set. If it comes across that same data value again while traversing uu, it accepts the input word.

    2. (b)

      Two data values in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(v)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v)) are same. The SAFA can nondeterministically guess that a particular data value is repeated in vv and store it in a set. If it comes across that same data value again while traversing vv, it accepts the input word.

    3. (c)

      The first data value in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(u)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u)) and the first data value 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(v){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v) are not the same. The SAFA can store the first data value in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(u)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u)) in a set and match it while reading the first data value in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(v)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v)) after #\#.

    4. (d)

      The last data value in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(u)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u)) and the last data value 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(v){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v) are not the same. Again the SAFA can nondeterministically read the last two data values in each of these sequences and match them.

    5. (e)

      Two data values dγ1d_{\gamma_{1}} and dγ2d_{\gamma_{2}} are successors in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(u)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u)) but not in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(v)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v)).

      • The SAFA can again nondeterministically decide on reading such a pair of data values dγ1d_{\gamma_{1}} and dγ2d_{\gamma_{2}} in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(u)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u)) that are different from those in vv and store them in two different sets. Then as it parses through data values in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(v)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v)) and comes across the first data value it checks whether the successor data value in both cases are same or not.

  3. 3.

    The dδd_{\delta} projections in uu and vv are not in the required format. This checking can be done in a similar manner as the checking for dγd_{\gamma} projections. Again recall that this can easily be done since all dδd_{\delta} values are unique in uu.

  4. 4.

    The input word ww is not a true solution of the PCP instance.

    1. (a)

      For the input word to be a correct solution of the PCP instance, the attribute i[n]i\in[n] corresponding to each data value in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(u)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u)) needs to be same as that in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃[n]×D(v)){{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v)). This ensures the strings are chosen from the same domino of the PCP instance.

    2. (b)

      The attribute corresponding to each data value in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃Σ×D(u)){{\sf proj}_{D}}({\sf proj}_{\Sigma\times D}(u)) needs to be same as the attribute in 𝗉𝗋𝗈𝗃D(𝗉𝗋𝗈𝗃Σ×D(v)){{\sf proj}_{D}}({\sf proj}_{\Sigma\times D}(v)). This ensures the concatenation of strings chosen from List 1 and List 2 are the same.

    3. (c)

      If any of the corresponding data values in uu and vv have different attributes associated with them, then the input word ww is not a true solution of the PCP instance and the SAFA accepts the input word. We know describe how the SAFA finds that the input word is not a correct solution for (a)(a) and (b)(b) above. Based on the data value in uu the SAFA nondeterministically checks whether the same data value in vv has the same attribute associated with it. If they are not the same, the input word is accepted. The SAFA after making the guess stores the data value in a set and also remembers in its state space the attribute it encountered with the data value. When it again comes across the data value in vv, it checks the associated attribute. It accepts the input word if they are not the same.

The above mentioned nondeterministic SAFA MM accepts an input word if and only if the input data word is not in the correct format or it is not a solution to the PCP instance. Therefore if the SAFA MM does not accept an input word, i.e. its not universal, then it implies that the input word is in the required format and is also a solution to the PCP instance, implying that the PCP instance has a solution.

If the PCP instance has a solution then that solution can be represented in the correct format, and the SAFA MM does not accept the input word ww which represents such a solution to the PCP instance is in the correct format. Therefore, the SAFA is not universal.

Hence, universality of SAFA is undecidable.

4.2 Closure Properties

We now study the closure properties. We first study the closure properties of SAFA followed by those of DSAFA.

4.2.1 Closure Properties of SAFA

We start with the Boolean closure properties. We show that SAFA are closed under union, but not under intersection and complementation.

Lemma 4.17.

SAFA are closed under union.

Proof 4.18.

We show here that SAFA models are closed under union. Union of two SAFA models can be obtained by superimposing their start states together. Let us consider two SAFA M1=(Q1,Σ×D,q01,F1,H1,δ1)M_{1}=(Q_{1},\Sigma\times D,q_{01},F_{1},H_{1},\delta_{1}) and M2=(Q2,Σ×D,q02,F2,H2,δ2)M_{2}=(Q_{2},\Sigma\times D,q_{02},F_{2},H_{2},\delta_{2}). The SAFA M3=(Q3,Σ×D,q03,F3,H3,δ3)M_{3}=(Q_{3},\Sigma\times D,q_{03},F_{3},H_{3},\delta_{3}) which accepts the language L(M1)L(M2)L(M_{1})\cup L(M_{2}) is constructed as follows: Q3={q03}Q1Q2Q_{3}=\{q_{03}\}\cup Q_{1}\cup Q_{2}, F3=F1F2F_{3}=F_{1}\cup F_{2} if q01F1q_{01}\notin F_{1} and q02F2q_{02}\notin F_{2}, otherwise F3=F1F2{q03}F_{3}=F_{1}\cup F_{2}\cup\{q_{03}\}, H3=H1H2H_{3}=H_{1}\cup H_{2}. All transitions in δ1\delta_{1} and δ2\delta_{2} are in δ3\delta_{3}. Additionally, for every transition (x,y,z)(x,y,z) from state q01q_{01} to any state qiQ1q_{i}\in Q_{1} in δ1\delta_{1} such that xΣx\in\Sigma, y{p(hk),!p(hk)}y\in\{p(h_{k}),!p(h_{k})\} with hkH1h_{k}\in H_{1} and z{,𝗂𝗇𝗌(hl)}z\in\{-,{\sf ins}(h_{l})\}, hlH1h_{l}\in H_{1}, the transition (x,y,z)(x,y,z) is included in δ3\delta_{3} from state q03q_{03} to state qiq_{i}. Similarly, for every transition (x,y,z)(x,y,z) from state q02q_{02} to a state qiQ2q_{i}\in Q_{2} in δ2\delta_{2} such that xΣx\in\Sigma, y{p(hk),!p(hk)}y\in\{p(h_{k}),!p(h_{k})\}, hkH2h_{k}\in H_{2} and z{,𝗂𝗇𝗌(hl)}z\in\{-,{\sf ins}(h_{l})\}, hlH2h_{l}\in H_{2}, the transition (x,y,z)(x,y,z) is included in δ3\delta_{3} from state q03q_{03} to state qiq_{i}. The automaton M3M_{3} on an input data word nondeterministically decides to which of M1M_{1} and M2M_{2} the input word belongs. If the input word is accepted by either of the automata, it is accepted by M3M_{3}. If an input word is not accepted by both the automata, it is rejected. Thus, the SAFA M3M_{3} accepts the language L(M1)L(M2)L(M_{1})\cup L(M_{2}).

Lemma 4.19.

SAFA are not closed under intersection.

Proof 4.20.

Consider the language L=L𝖿𝖽(a)LabL=L_{{\sf fd}(a)}\cap L_{a\exists b}. We show that there exists no SAFA which accepts LL. We prove by contradiction. Assume that there exists a SAFA M=(Q,{a,b}×D,q0,F,H,δ)M=(Q,\{a,b\}\times D,q_{0},F,H,\delta) with |H|=k>0|H|=k>0 such that MM accepts LL. Then MM must accept the following word wLw\in L where w=(b,d1)(b,dk+1)(a,d1)(a,dk+1)w=(b,d_{1})\cdots(b,d_{k+1})(a,d_{1})\cdots(a,d_{k+1}) and d1,,dk+1d_{1},...,d_{k+1} are all distinct. In order to accept ww, the SAFA MM must go through a sequence T=tb1tbk+1ta1tak+1T=t_{b_{1}}...t_{b_{k+1}}t_{a_{1}}...t_{a_{k+1}} of transitions to completely consume ww and end in an accepting state. Here tbit_{b_{i}} consumes the data element (b,di)(b,d_{i}), and tajt_{a_{j}} consumes the data element (a,dj)(a,d_{j}) and 1i,jk+11\leq i,j\leq k+1. Two cases are possible:

  • There is a transition say tagt_{a_{g}} in TT consuming the data element (a,dg)(a,d_{g}) of ww where g[k+1]g\in[k+1] and tagt_{a_{g}} is of the form (a,!p(hi),)(a,!p(h_{i}),-) or (a,!p(hi),𝗂𝗇𝗌(hj))(a,!p(h_{i}),{\sf ins}(h_{j})) where hi,hjHh_{i},h_{j}\in H. The SAFA MM using the same sequence TT of transitions can accept another data word wLabw^{\prime}\notin L_{a\exists b} where (a,dg)(a,d_{g}) is replaced by (a,d)(a,d) such that ddrd\neq d_{r} for 1rk+11\leq r\leq k+1. It is always possible to get such a data value dd as kk is finite but DD is countably infinite. Therefore, at the time of executing tagt_{a_{g}} the data value dd is not present in hih_{i} and tagt_{a_{g}} executes successfully. Recall that the data values in ww that follow djd_{j} are different from djd_{j}. The data values in ww^{\prime} that follow the data value dd are not equal to dd. Therefore whether dd has been inserted to any set hjHh_{j}\in H or not while executing tagt_{a_{g}} does not impact the successful execution of the transitions in TT that follow tagt_{a_{g}}. Now, in ww^{\prime} there is a data value dd associated with attribute aa which is not associated with attribute bb, thus wLabw^{\prime}\notin L_{a\exists b}.

  • All transitions in TT following tbk+1t_{b_{k+1}} are of the form (a,p(hi),)(a,p(h_{i}),-) or (a,p(hi),𝗂𝗇𝗌(hj))(a,p(h_{i}),{\sf ins}(h_{j})) where hi,hjHh_{i},h_{j}\in H. The number of transitions in TT that follow tbk+1t_{b_{k+1}} is greater than kk. Hence, by pigeon hole principle, there must be two transitions tat_{a_{\ell}} and tamt_{a_{m}} where 1<mk+11\leq\ell<m\leq k+1 which have the same condition p(hi)p(h_{i}) for some hiHh_{i}\in H. The SAFA MM using the same sequence TT of transitions can accept another data word wL𝖿𝖽(a)w^{\prime}\notin L_{{\sf fd}(a)} where (a,dm)(a,d_{m}) is replaced by (a,dl)(a,d_{l}). The SAFA MM when executing tamt_{a_{m}} on ww^{\prime} can successfully consume the data element (a,d)(a,d_{\ell}) instead of (a,dm)(a,d_{m}). This is because tat_{a_{\ell}} and tamt_{a_{m}} have the same condition p(hi)p(h_{i}) and dd_{\ell} is already present in hih_{i} when tat_{a_{\ell}} is executed. Note that the data values in ww^{\prime} that follow the execution of tamt_{a_{m}} are not equal to dd_{\ell}. Therefore whether dd_{\ell} has been inserted to any set hjHh_{j}\in H or not does not impact the successful execution of the transitions in TT that follow tamt_{a_{m}}.

Using a pumping argument, we show that these automata are not closed under complementation.

Lemma 4.21.

Let L𝖲𝖠𝖥𝖠L\in{\mathcal{L}_{\sf SAFA}}. Then there exists a SAFA MM with nn states that accepts LL such that every data word wLw\in L of length at least nn can be written as w=xyzw=xyz and Tw=TxTyTzT_{w}=T_{x}T_{y}T_{z} corresponds to the sequence of transitions that MM takes to accept ww, where Tx=tx1tx|x|T_{x}=t_{x_{1}}\dots t_{x_{|x|}}, Ty=ty1ty|y|T_{y}=t_{y_{1}}\dots t_{y_{|y|}}, Tz=tz1tz|z|T_{z}=t_{z_{1}}\dots t_{z_{|z|}} is the sequence of transitions that MM takes to read xx, yy, zz respectively, and tujt_{u_{j}} denotes the jthj^{th} transition of the transition sequence TuT_{u} with u{x,y,z}u\in\{x,y,z\}, satisfying the following:

  • |y|1|y|\geq 1

  • |xy|n|xy|\leq n

  • for all 1\ell\geq 1, for all words w=xyy1yzw^{\prime}=xyy^{\prime}_{1}\cdots y^{\prime}_{\ell}z^{\prime} such that Tw=TxTyTyTzT_{w^{\prime}}=T_{x}T_{y}{T_{y}}^{\ell}T_{z} is the sequence of transitions that MM takes to accept ww^{\prime} and 𝗉𝗋𝗈𝗃Σ(y)=𝗉𝗋𝗈𝗃Σ(y1)==𝗉𝗋𝗈𝗃Σ(y){{\sf proj}_{\Sigma}}(y)={{\sf proj}_{\Sigma}}(y^{\prime}_{1})=\dots={{\sf proj}_{\Sigma}}(y^{\prime}_{\ell}), 𝗉𝗋𝗈𝗃Σ(z)=𝗉𝗋𝗈𝗃Σ(z){{\sf proj}_{\Sigma}}(z)={{\sf proj}_{\Sigma}}(z^{\prime}).

    • if tyjt_{y_{j}} has p(hi),hiHp(h_{i}),h_{i}\in H then the jthj^{th} datum of 𝗉𝗋𝗈𝗃D(yk)hi{{\sf proj}_{D}}(y^{\prime}_{k})\in h_{i}, 1j|y|1\leq j\leq|y|, 1k1\leq k\leq\ell.

    • if tyjt_{y_{j}} has !p(hi),hiH!p(h_{i}),h_{i}\in H then the jthj^{th} datum of 𝗉𝗋𝗈𝗃D(yk)hi{{\sf proj}_{D}}(y^{\prime}_{k})\notin h_{i}, 1j|y|1\leq j\leq|y|, 1k1\leq k\leq\ell.

    • if tzjt_{z_{j}} has p(hi),hiHp(h_{i}),h_{i}\in H then the jthj^{th} datum of 𝗉𝗋𝗈𝗃D(z)hi{{\sf proj}_{D}}(z^{\prime})\in h_{i}, 1j|z|1\leq j\leq|z|.

    • if tzjt_{z_{j}} has !p(hi),hiH!p(h_{i}),h_{i}\in H, then the jthj^{th} datum of 𝗉𝗋𝗈𝗃D(z)hi{{\sf proj}_{D}}(z^{\prime})\notin h_{i}, 1j|z|1\leq j\leq|z|.

  • wLw^{\prime}\in L.

Proof 4.22.

Since L𝖲𝖠𝖥𝖠L\in{\mathcal{L}_{\sf SAFA}}, there exists a SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta), with say nn states that accepts LL. As |w|n|w|\geq n and wLw\in L, the sequence of states that MM traverses to accept ww must contain a cycle. Let us take the first such cycle and call it cyc_{y}. Let the sequence of transitions that MM executes to traverse the cycle cyc_{y} be TyT_{y} and the infix of ww read along TyT_{y} be yy. Let TxT_{x} be the sequence of transitions that MM traverses before entering the first cycle cyc_{y} and the prefix of ww read along TxT_{x} be xx. Let TzT_{z} be the sequence of transitions that MM traverses after exiting the cycle cyc_{y} to reach a final state in MM, and let the suffix of ww read along TzT_{z} be zz. Therefore, w=xyzw=xyz and the sequence of transitions that MM traverses to accept ww is say Tw=TxTyTzT_{w}=T_{x}T_{y}T_{z}. Moreover since cyc_{y} is the first such cycle, we have |y|1|y|\geq 1 and |xy|n|xy|\leq n. Now, consider a sequence of transitions Tw=TxTyTyTzT_{w^{\prime}}=T_{x}T_{y}{T_{y}}^{\ell}T_{z}, then w=xyy1yzw^{\prime}=xyy_{1}\cdots y_{\ell}z^{\prime}, where 𝗉𝗋𝗈𝗃Σ(y)=𝗉𝗋𝗈𝗃Σ(y1)=𝗉𝗋𝗈𝗃Σ(yl){{\sf proj}_{\Sigma}}(y)={{\sf proj}_{\Sigma}}(y_{1})\dots={{\sf proj}_{\Sigma}}(y_{l}) but 𝗉𝗋𝗈𝗃D(y),𝗉𝗋𝗈𝗃D(y1),,𝗉𝗋𝗈𝗃D(yl){{\sf proj}_{D}}(y),{{\sf proj}_{D}}(y_{1}),\dots,{{\sf proj}_{D}}(y_{l}) may or may not be equal to each other. Since |y|1|y|\geq 1, the sequence TyT_{y} of transitions must have at least one transition. The transition tyjt_{y_{j}} in TyT_{y} with !p(hi)!p(h_{i}) can always be executed successfully because the SAFA MM when executing tyjt_{y_{j}} can always read a new data value which it has not read till executing tyjt_{y_{j}}. We can always find such data values as |w||w^{\prime}| is finite whereas DD is countably infinite. The transition tyjt_{y_{j}} in TyT_{y} with p(hi)p(h_{i}) is executed successfully when consuming yy, since, yy is consumed successfully by MM when accepting ww. As, yy is consumed successfully by MM, the set hih_{i} corresponding to tyjt_{y_{j}} is non-empty as there are no removal operations in SAFA. Therefore, after consuming yy, every time the sequence TyT_{y} is executed, tyjt_{y_{j}} is also executed successfully.

The sequence TzT_{z} is executed successfully due to same reasons as TyT_{y}. Thus, MM accepts a data word w=xyy1yzw^{\prime}=xyy_{1}...y_{\ell}z^{\prime}.

Lemma 4.23.

SAFA are not closed under complementation.

Proof 4.24.

To show SAFA are not closed under complementation, we first define the following functions. The function 𝖼𝗇𝗍(w,d){\sf cnt}(w^{\prime},d) gives the number of times data value dd is present in a data word ww^{\prime} and 𝗎𝗇𝗂(w){\sf uni}(w^{\prime}) gives the number of data values dd with 𝖼𝗇𝗍(w,d)=1{\sf cnt}(w^{\prime},d)=1 in ww^{\prime}. We consider the language L𝖼𝗇𝗍2L_{\exists{\sf cnt}\neq 2}, which is the language of data words ww where there exists a data value dd such that 𝖼𝗇𝗍(w,d)2{\sf cnt}(w^{\prime},d)\neq 2. Example 3.4 shows a SAFA that accepts this.

Consider the complement language L𝖼𝗇𝗍=2L_{\forall{\sf cnt}=2} wherein all data values occur exactly twice. Using Lemma 4.21 we show no SAFA can accept L𝖼𝗇𝗍=2L_{\forall{\sf cnt}=2}.

The proof is by contradiction. Suppose that there exists a SAFA MM with n states accepting L𝖼𝗇𝗍=2L_{\forall{\sf cnt}=2}. Let ww be a data word such that wL𝖼𝗇𝗍=2w\in L_{\forall{\sf cnt}=2} and |w|=2n|w|=2n.

For every decomposition of ww as w=xyzw=xyz and sequence Tw=TxTyTzT_{w}=T_{x}T_{y}T_{z} of transitions that MM takes to accept ww with |y|1|y|\geq 1, we have a w=xyy1y2y3zw^{\prime}=xyy_{1}y_{2}y_{3}z^{\prime} such that Tw=TxTyTy3TzT_{w^{\prime}}=T_{x}T_{y}{T_{y}}^{3}T_{z}. Since |y|1|y|\geq 1, we have that TyT_{y} must have either a transition tt with p(hi)p(h_{i}) for some hiHh_{i}\in H or a transition with !p(hi)!p(h_{i}) for some hiHh_{i}\in H or both. If tt has p(hi)p(h_{i}), then the first time tt is executed while consuming yy, assume that it consumes a data value dd. It is able to consume the data value dd as it is already inserted in hih_{i} before tt is executed. Now if after consuming the word xyxy, TyT_{y} is executed again, then when executing the transition tt it can again consume the same data value dd as before. So, every time TyT_{y} is executed, the SAFA MM will consume the data value dd while executing the transition tt. After executing TyT_{y} three times, the SAFA MM executes the transition sequence TzT_{z}. All the transitions with p(hi)p(h_{i}) in TzT_{z} can be executed successfully with the same data value that they consumed when MM accepted ww because ww^{\prime} and ww both have the same prefix xyxy. The transitions with !p(hi)!p(h_{i}) in TzT_{z} consume data values that MM had not encountered prior to executing these transitions. Thus, if TyT_{y} has a transition tt with p(hi)p(h_{i}) for some hiHh_{i}\in H, then MM accepts the data word w=xyy1y3zw^{\prime}=xyy_{1}\cdots y_{3}z^{\prime} where there exists a data value dd with 𝖼𝗇𝗍(w,d)>3{\sf cnt}(w^{\prime},d)>3.

If TyT_{y} has a transition, say tt with !p(hi)!p(h_{i}) for some hiHh_{i}\in H, then every time TyT_{y} is executed after consuming xyxy, the SAFA MM when executing tt can always read a new data value which it has not read till executing tt and that it will not read later. We can always find such data values as ww^{\prime} is finite whereas DD is countably infinite. The sequence TzT_{z} is executed successfully due to same reasons as before. Thus, if TyT_{y} has a transition tt with !p(hi)!p(h_{i}), then MM accepts a data word w=xyy1y3zw^{\prime}=xyy_{1}...y_{3}z^{\prime} where 𝗎𝗇𝗂(w)3{\sf uni}(w^{\prime})\geq 3. Note that the data value consumed by MM when taking the transition tt while reading yy may already be present in the prefix being read by the sequence of transitions prior to taking tt. Therefore, wL𝖼𝗇𝗍=2w^{\prime}\notin L_{\forall{\sf cnt}=2}.

Note that from the above construction, we see that SAFA with |H|2|H|\geq 2 are not closed under complementation. We observe that singleton SAFA are closed under union but not under intersection. From the hierarchy theorem (Theorem 3.5), we see singleton SAFA cannot accept L𝖿𝖽(𝖺𝟣)L𝖿𝖽(𝖺𝟤)L𝗉𝗋𝗈𝗃Σ(L)=a1a2L_{\sf fd(a_{1})}\cap L_{\sf fd(a_{2})}\cap L_{{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}a_{2}^{*}} but singleton SAFA can accept L𝖿𝖽(𝖺𝟣)L𝗉𝗋𝗈𝗃Σ(L)=a1a2L_{\sf fd(a_{1})}\cap L_{{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}a_{2}^{*}} and L𝖿𝖽(𝖺𝟤)L𝗉𝗋𝗈𝗃Σ(L)=a1a2L_{\sf fd(a_{2})}\cap L_{{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}a_{2}^{*}}. Thus, singleton SAFA are not closed under intersection, and hence also not closed under complementation.

Theorem 4.25.

SAFA are closed under concatenation.

Proof 4.26.

SAFA are shown to be closed under concatenation in a manner similar to NFA. Note that if the sets used in the two input automata AA and BB be HAH_{A} and HBH_{B} respectively, then the automaton accepting the language obtained as a result of concatenation uses the sets HAHBH_{A}\cup H_{B}.

Theorem 4.27.

SAFA are not closed under Kleene’s closure.

Proof 4.28.

Consider the language L1={w({a}×{d})2|dD}L_{1}=\{w\in(\{a\}\times\{d\})^{2}|d\in D\}. The language L1L_{1} can be accepted by a DSAFA (see Figure 8). The Kleene’s closure of L1L_{1} is the language L={w(({a}×{d})2)|dD}L=\{w\in((\{a\}\times\{d\})^{2})^{*}|d\in D\}, i.e., LL is the set of all data words where every data value appears in pairs. We show that there exists no SAFA which accepts LL. We prove by contradiction. Assume that there exists a SAFA M=(Q,{a}×D,q0,F,H,δ)M=(Q,\{a\}\times D,q_{0},F,H,\delta) with |H|=k>0|H|=k>0 such that MM accepts LL. Then MM must accept the following word wLw\in L where w=(a,d1)(a,d1)(a,di)(a,di)(a,dk+1)(a,dk+1)w=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{k+1})(a,d_{k+1}) and d1,,dk+1Dd_{1},...,d_{k+1}\in D are all distinct. In order to accept ww, the SAFA MM must go through a sequence T=t1d1t2d1t1dk+1t2dk+1T=t_{1_{d_{1}}}t_{2_{d_{1}}}...t_{1_{d_{k+1}}}t_{2_{d_{k+1}}} of transitions to completely consume ww and end in an accepting state. Here t1dit_{1_{d_{i}}} consumes the first data item of the ithi^{th} data value pair and t2dit_{2_{d_{i}}} consumes the second data item of the ithi^{th} data value pair and 1ik+11\leq i\leq k+1.

  • The transitions t2dit_{2_{d_{i}}} must be of the form (a,p(h),)(a,p(h_{\ell}),-) or (a,p(h),𝗂𝗇𝗌(hj))(a,p(h_{\ell}),{\sf ins}(h_{j})) where h,hjHh_{\ell},h_{j}\in H. This is essential because if t2dit_{2_{d_{i}}} is of the form (a,!p(h),)(a,!p(h_{\ell}),-) or (a,!p(h),𝗂𝗇𝗌(hj))(a,!p(h_{\ell}),{\sf ins}(h_{j})) then instead of consuming did_{i} it can also consume successfully a new data value dnewDd_{new}\in D which is not present in ww. It is always possible to get such a data value as DD is countably infinite. The SAFA MM will then accept the data word w=(a,d1)(a,d1)(a,di)(a,dnew)(a,dk+1)(a,dk+1)w^{\prime}=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{new})\cdots(a,d_{k+1})(a,d_{k+1}) which is not in LL.

  • The transitions t1dit_{1_{d_{i}}} must be of the form (a,!p(h),𝗂𝗇𝗌(hj))(a,!p(h_{\ell}),{\sf ins}(h_{j})) where h,hjHh_{\ell},h_{j}\in H. This is essential because if t1dit_{1_{d_{i}}} is of the form (a,p(h),)(a,p(h_{\ell}),-) or (a,p(h),𝗂𝗇𝗌(hj))(a,p(h_{\ell}),{\sf ins}(h_{j})), then t1dit_{1_{d_{i}}} will fail to consume the first instance of the data value did_{i}, since MM has not encountered did_{i} prior to the transition t1dit_{1_{d_{i}}} as all data value pairs in ww are distinct and therefore did_{i} is not present in any set. The transition t1dit_{1_{d_{i}}} cannot be of the form (a,!p(h),)(a,!p(h_{\ell}),-) because the following transition t2dit_{2_{d_{i}}} which is of the form (a,p(hj),)(a,p(h_{j}),-) or (a,p(hj),𝗂𝗇𝗌(hp))(a,p(h_{j}),{\sf ins}(h_{p})) where hj,hpHh_{j},h_{p}\in H will not be executed successfully. Note that for the transition t2dit_{2_{d_{i}}} to consume the data value did_{i}, the data value did_{i} must be inserted in a set when it was first encountered.

  • Since the number of distinct data values in ww is more than the number of sets in MM and all the distinct data values are inserted in the sets in MM, by pigeon hole principle, there are two distinct data values did_{i} and djd_{j}, with i<ji<j which are inserted in the same set hHh_{\ell}\in H. Thus, if MM accepts the data word w=(a,d1)(a,d1)(a,di)(a,di)(a,dj)(a,dj)(a,dk+1)(a,dk+1)w=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{j})(a,d_{j})\cdots(a,d_{k+1})(a,d_{k+1}) then MM also accepts the data word w=(a,d1)(a,d1)(a,di)(a,di)(a,dj)(a,di)(a,dk+1)(a,dk+1)w^{\prime}=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{j})(a,d_{i})\cdots(a,d_{k+1})(a,d_{k+1}) which is not in LL.

Thus, SAFA are not closed under Kleene’s closure.

q0q_{0}q1q_{1}q2q_{2}(a,!p(h1),𝗂𝗇𝗌(h1))(a,!p(h_{1}),{\sf ins}(h_{1}))(a,p(h1),)(a,p(h_{1}),-)
Figure 8: A SAFA MM, such that L(M)𝖲𝖠𝖥𝖠L(M)^{*}\notin{\mathcal{L}_{\sf SAFA}}
q0q_{0}q1q_{1}(a,p(h1),)(a,p(h_{1}),-)(b,p(h1),),(b,!p(h1),𝗂𝗇𝗌(h1))(b,p(h_{1}),-),(b,!p(h_{1}),{\sf ins}(h_{1}))(a,!p(h1),)(a,!p(h_{1}),-)
Figure 9: DSAFA for L𝖺𝖻L_{\sf a\exists b}
Theorem 4.29.

SAFA are not closed under reversal.

Proof 4.30.

Consider the language LabL_{a\exists b} which can be accepted by a DSAFA (see Figure 9). The reversal of LabL_{a\exists b} is the language L={wR|wLab}L=\{w^{R}|w\in L_{a\exists b}\} i.e. LL is the set of all data words where for every attribute aa there is an attribute bb which comes after it and whose data value is same as that of attribute aa. We show that there exists no SAFA which accepts LL. We prove by contradiction. Assume that there exists a SAFA M=(Q,{a,b}×D,q0,F,H,δ)M=(Q,\{a,b\}\times D,q_{0},F,H,\delta) with |H|=k>0|H|=k>0 such that MM accepts LL. Then MM must accept the following word wLw\in L where w=(a,d1)(a,dk+1)(b,d1)(b,dk+1)w=(a,d_{1})\cdots(a,d_{k+1})(b,d_{1})\cdots(b,d_{k+1}) and d1,,dk+1Dd_{1},...,d_{k+1}\in D are all distinct. In order to accept ww, the SAFA MM must go through a sequence T=tad1tadk+1tbd1tbdk+1T=t_{a_{d_{1}}}...t_{a_{d_{k+1}}}t_{b_{d_{1}}}...t_{b_{d_{k+1}}} of transitions to completely consume ww and end in an accepting state. Here tadit_{a_{d_{i}}} consumes the data item (a,di)(a,d_{i}) and tbdit_{b_{d_{i}}} consumes the data item (b,di)(b,d_{i}) of ww.

  • The transitions tbdit_{b_{d_{i}}} must be of the form (b,p(h),)(b,p(h_{\ell}),-) or (b,p(h),𝗂𝗇𝗌(hj))(b,p(h_{\ell}),{\sf ins}(h_{j})) where h,hjHh_{\ell},h_{j}\in H. This is essential because if tbdit_{b_{d_{i}}} is of the form (b,!p(h),)(b,!p(h_{\ell}),-) or (b,!p(h),𝗂𝗇𝗌(hj))(b,!p(h_{\ell}),{\sf ins}(h_{j})) then instead of consuming did_{i} it can also consume successfully a new data value dnewDd_{new}\in D which is not present in ww. It is always possible to get such a data value as DD is countably infinite. The SAFA MM will then accept the data word w=(a,d1)(a,di)(a,dk+1)(b,d1)(b,dnew)(b,dk+1)w^{\prime}=(a,d_{1})\cdots(a,d_{i})\cdots(a,d_{k+1})(b,d_{1})\cdots(b,d_{new})\cdots(b,d_{k+1}) which is not in LL.

  • The transitions tadit_{a_{d_{i}}} must be of the form (a,!p(h),𝗂𝗇𝗌(hj))(a,!p(h_{\ell}),{\sf ins}(h_{j})) where h,hjHh_{\ell},h_{j}\in H. This is essential because if tadit_{a_{d_{i}}} is of the form (a,p(h),)(a,p(h_{\ell}),-) or (a,p(h),𝗂𝗇𝗌(hj))(a,p(h_{\ell}),{\sf ins}(h_{j})), then tadit_{a_{d_{i}}} will fail to consume the first instance of the data value did_{i}, since MM has not encountered did_{i} prior to the transition tadit_{a_{d_{i}}} as all data values in ww associated with attribute aa are distinct and therefore did_{i} is not present in any set. The transition tadit_{a_{d_{i}}} cannot be of the form (a,!p(h),)(a,!p(h_{\ell}),-) because there is a following transition tbdit_{b_{d_{i}}} which is of the form (b,p(hj),)(b,p(h_{j}),-) or (b,p(hj),𝗂𝗇𝗌(hp))(b,p(h_{j}),{\sf ins}(h_{p})) where hj,hpHh_{j},h_{p}\in H that will not be executed successfully. Recall that for the transition tbdit_{b_{d_{i}}} to consume the data value did_{i}, the data value did_{i} must be inserted in a set when it was first encountered.

  • Since the number of distinct data values in ww associated with attribute aa are more than the number of sets in MM and all the distinct data values associated with attribute aa are inserted in the sets in MM, by pigeon hole principle, there are two distinct data values did_{i} and djd_{j}, with i<ji<j which are inserted in the same set hHh_{\ell}\in H. Thus, if MM accepts the data word w=(a,d1)(a,di)(a,dj)(a,dk+1)(b,d1)(b,di)(b,dj)(b,dk+1)w=(a,d_{1})\cdots(a,d_{i})\cdots(a,d_{j})\cdots(a,d_{k+1})(b,d_{1})\cdots(b,d_{i})\cdots(b,d_{j})\cdots(b,d_{k+1}) then MM also accepts the data word

    w=(a,d1)(a,di)(a,dj)(a,dk+1)(b,d1)(b,di)(b,di)(b,dk+1)w^{\prime}=(a,d_{1})\cdots(a,d_{i})\cdots(a,d_{j})\cdots(a,d_{k+1})(b,d_{1})\cdots(b,d_{i})\cdots(b,d_{i})\cdots(b,d_{k+1}) which is not in LL.

Thus, SAFA are not closed under reversal.

Theorem 4.31.

SAFA are not closed under homomorphism.

Proof 4.32.

Consider the language Lfd(a)L_{fd(a)} where the data values are taken from the set of natural numbers \mathbb{N} and Σ={a}\Sigma=\{a\}. The homomorphism function is h(ϵ)=ϵh(\epsilon)=\epsilon, h((a,d))=(a,d)(a,d)h((a,d))=(a,d)(a,d) for all aΣa\in\Sigma and dd\in\mathbb{N}. The language L=h(Lfd(a))L=h(L_{fd(a)}) is a language of data words where every data value occurs exactly twice and also consecutively. There exists no SAFA which accepts L=h(Lfd(a))L=h(L_{fd(a)}). The proof is by contradiction. Suppose that there exists a SAFA MM with n states accepting LL. Let ww be a data word such that wLw\in L and |w|=2n|w|=2n.

For every decomposition of ww as w=xyzw=xyz and sequence Tw=TxTyTzT_{w}=T_{x}T_{y}T_{z} of transitions that MM takes to accept ww with |y|1|y|\geq 1, we have a w=xyy1y2y3zw^{\prime}=xyy_{1}y_{2}y_{3}z^{\prime} such that Tw=TxTyTy3TzT_{w^{\prime}}=T_{x}T_{y}{T_{y}}^{3}T_{z}. Since |y|1|y|\geq 1, we have that TyT_{y} must have either a transition tt with p(hi)p(h_{i}) for some hiHh_{i}\in H or a transition with !p(hi)!p(h_{i}) for some hiHh_{i}\in H or both. If tt has p(hi)p(h_{i}), then the first time tt is executed while consuming yy, assume that it consumes a data value dd. It is able to consume the data value dd as it is already inserted in hih_{i} before tt is executed. Now if after consuming the word xyxy, TyT_{y} is executed again, then when executing the transition tt it can again consume the same data value dd as before. So, every time TyT_{y} is executed, the SAFA MM will consume the data value dd while executing the transition tt. After executing TyT_{y} three times, the SAFA MM executes the transition sequence TzT_{z}. All the transitions with p(hi)p(h_{i}) in TzT_{z} can be executed successfully with the same data value that they consumed when MM accepted ww because ww^{\prime} and ww both have the same prefix xyxy. The transitions with !p(hi)!p(h_{i}) in TzT_{z} consume data values that MM has not encountered prior to executing these transitions. Thus, if TyT_{y} has a transition tt with p(hi)p(h_{i}) for some hiHh_{i}\in H, then MM accepts the data word w=xyy1y3zw^{\prime}=xyy_{1}\cdots y_{3}z^{\prime} where there exists a data value dd with 𝖼𝗇𝗍(w,d)>3{\sf cnt}(w^{\prime},d)>3.

If TyT_{y} has a transition, say tt with !p(hi)!p(h_{i}) for some hiHh_{i}\in H, then every time TyT_{y} is executed after consuming xyxy, the SAFA MM when executing tt can always read a new data value which it has not read till executing tt and that it will not read later. We can always find such data values as ww^{\prime} is finite whereas DD is countably infinite. The sequence TzT_{z} is executed successfully due to same reasons as before. Thus, if TyT_{y} has a transition tt with !p(hi)!p(h_{i}), then MM accepts a data word w=xyy1y3zw^{\prime}=xyy_{1}...y_{3}z^{\prime} where 𝗎𝗇𝗂(w)3{\sf uni}(w^{\prime})\geq 3. Note that the data value consumed by MM when taking the transition tt while reading yy may already be present in the prefix being read by the sequence of transitions prior to taking tt. Therefore, wLw^{\prime}\notin L.

Theorem 4.33.

SAFA are not closed under inverse homomorphism.

Proof 4.34.

Consider the language Lϵ={ϵ}L_{\epsilon}=\{\epsilon\}. There exists a SAFA MM which accepts LϵL_{\epsilon}. The data values are taken from the set of natural numbers \mathbb{N} and Σ={a}\Sigma=\{a\}. The homomorphism function is h(ϵ)=ϵh(\epsilon)=\epsilon, h((a,1))=ϵh((a,1))=\epsilon, h((a,d))=(a,d)h((a,d))=(a,d) for all d{1}d\in\mathbb{N}\setminus\{1\}. The language L=h1(Lϵ)L=h^{-1}(L_{\epsilon}) is a language of data words where the data word is ϵ\epsilon or data words only having data value 1 present in it. Since SAFA cannot be initialized \chadded[id=KC], SAFA cannot identify that it has seen the data value 1 as it does not have the data value 1 in any of its sets at the time of beginning the computation. Hence, no SAFA can recognise L=h1(Le)L=h^{-1}(L_{e}).

4.2.2 Closure properties of Deterministic SAFA:

Here we discuss deterministic SAFA and compare their expressiveness with SAFA. Using standard complementation construction as in deterministic finite automata (DFA), by changing non-accepting states to accepting and vice versa, we can show that DSAFA are closed under complementation. Moreover, as deterministic SAFA are closed under complementation but not under intersection hence, it follows that they are also not closed under union. Since the languages used to show non-closure of SAFA under Kleene’s closure, homomorphism, and inverse homomorphism are accepted by DSAFA, we have that DSAFA are also not closed under Kleene’s closure, homomorphism and inverse homomorphism.

Theorem 4.35.

DSAFA are not closed under concatenation.

Proof 4.36.

Consider the language L1=Lfd(a)L_{1}=L_{fd(a)} with Σ={a}\Sigma=\{a\} and L2={w({a}×{d})2|dD}L_{2}=\{w\in(\{a\}\times\{d\})^{2}|d\in D\}. Both L1L_{1} and L2L_{2} can be accepted by DSAFA. The concatenation of L1L_{1} and L2L_{2} is the language L=L1.L2L=L_{1}.L_{2}. i.e. LL is the set of all data words that ends with a pair of same data values and all other data values present in the data word other than the data value in the pair are distinct. The frequency of the data value in the pair is either two or three. We show that there exists no DSAFA which accepts LL. We prove by contradiction. Assume that there exists a DSAFA M=(Q,{a}×D,q0,F,H,δ)M=(Q,\{a\}\times D,q_{0},F,H,\delta) with |Q|=n>0|Q|=n>0 such that MM accepts LL. Then MM must accept the following word wLw\in L where w=(a,d1)(a,dn)(a,dn+1)(a,dn+1)w=(a,d_{1})\cdots(a,d_{n})(a,d_{n+1})(a,d_{n+1}) and d1,,dn+1Dd_{1},...,d_{n+1}\in D are all distinct. In order to accept ww, the DSAFA MM must go through a sequence T=t1tn+2T=t_{1}...t_{n+2} of transitions to completely consume ww and end in an accepting state. Similarly, the DSAFA MM must go through a sequence S=q0qn+2S=q_{0}...q_{n+2} of states to accept ww where q0,,qn+2Qq_{0},...,q_{n+2}\in Q, qn+2Fq_{n+2}\in F and q0q_{0} is the initial state. Here tit_{i} consumes the ithi^{th} data item of the input data word ww and takes the DSAFA MM from state qi1q_{i-1} to qiq_{i}.

  • The last transition tn+2t_{n+2} must be of the form (a,p(h),)(a,p(h_{\ell}),-) or (a,p(h),𝗂𝗇𝗌(hj))(a,p(h_{\ell}),{\sf ins}(h_{j})) where h,hjHh_{\ell},h_{j}\in H. This is essential because if tn+2t_{n+2} is of the form (a,!p(h),)(a,!p(h_{\ell}),-) or (a,!p(h),𝗂𝗇𝗌(hj))(a,!p(h_{\ell}),{\sf ins}(h_{j})) then instead of consuming dn+1d_{n+1} which is the second last data value in ww repeated in the last position, it can also consume successfully a new data value dnewDd_{new}\in D which is not present in ww. It is always possible to get such a data value as DD is countably infinite. The DSAFA MM will then accept the data word w=(a,d1)(a,dn)(a,dn+1)(a,dnew)w^{\prime}=(a,d_{1})\cdots(a,d_{n})(a,d_{n+1})(a,d_{new}) which is not in LL. If the last transition is of the form (a,p(h),)(a,p(h_{\ell}),-) or (a,p(h),𝗂𝗇𝗌(hj))(a,p(h_{\ell}),{\sf ins}(h_{j})) then the second last transition must be of the form (a,!p(hp),𝗂𝗇𝗌(h))(a,!p(h_{p}),{\sf ins}(h_{\ell})) where hpHh_{p}\in H. The second last transition cannot be of the form (a,p(hp),)(a,p(h_{p}),-) or (a,p(hp),𝗂𝗇𝗌(hj))(a,p(h_{p}),{\sf ins}(h_{j})) because it consumes the data value dn+1d_{n+1} in ww, which was first encountered by MM when executing transition tn+1t_{n+1}. Hence, the data value dn+1d_{n+1} cannot be present in any set prior to executing tn+1t_{n+1}. The transition tn+1t_{n+1} and tn+2t_{n+2} consume the same data value. Moreover, tn+2t_{n+2} is of the form (a,p(h),)(a,p(h_{\ell}),-) or (a,p(h),𝗂𝗇𝗌(hj))(a,p(h_{\ell}),{\sf ins}(h_{j})), therefore to execute tn+2t_{n+2} successfully the data value dn+2d_{n+2} must be present in the set hh_{\ell}. The data value was first encountered when executing tn+1t_{n+1}, hence tn+1t_{n+1} must insert the data value into the set hh_{\ell}.

  • The transitions t1,,tnt_{1},...,t_{n} must be of the form (a,!p(h),)(a,!p(h_{\ell}),-) or (a,!p(h),𝗂𝗇𝗌(hj))(a,!p(h_{\ell}),{\sf ins}(h_{j})) where h,hjHh_{\ell},h_{j}\in H. This is essential because if tit_{i} where i[n]i\in[n] is of the form (a,p(h),)(a,p(h_{\ell}),-) or (a,p(h),𝗂𝗇𝗌(hj))(a,p(h_{\ell}),{\sf ins}(h_{j})), then tit_{i} will fail to consume the first instance of the data value did_{i}, since MM has not encountered did_{i} prior to the transition tit_{i} as all data values in ww except the last data value are distinct and therefore did_{i} is not present in any set.

  • Since |w|>|Q||w|>|Q|, there exists at least a state qiq_{i} in SS which is repeated in the sequence SS.

    • In the sequence of states SS only the last state can be an accepting state, no other intermediate state can be accepting because then the DSAFA MM will accept a data word wLfd(a)w^{\prime}\in L_{fd(a)} which is not in LL. Therefore, the accepting state qn+2q_{n+2} of MM cannot be a repeated state in SS.

    • Let us assume that the state just prior to the final accepting state qn+2q_{n+2}, i.e. the state qn+1q_{n+1} in SS is one such repeated state, i.e. suppose it is same as the qithq_{i}^{th} state in SS. Then, MM can execute the sequence Tnew=t1titi+1tn+1ti+1tn+1tn+2T_{new}=t_{1}...t_{i}t_{i+1}...t_{n+1}t_{i+1}...t_{n+1}t_{n+2} of transitions and MM will accept the data word

      w=(a,d1)(a,dn)(a,dn+1)(a,dnewi+1)(a,dnewn)(a,dnewn+1)(a,dn+1)w^{\prime}=(a,d_{1})\cdots(a,d_{n})(a,d_{n+1})(a,d_{new_{i+1}})\cdots(a,d_{new_{n}})(a,d_{new_{n+1}})(a,d_{n+1}) where dnewi+1,,dnewn+1Dd_{new_{i+1}},...,d_{new_{n+1}}\in D, all of them are distinct and also different from data values d1,,dn+1d_{1},...,d_{n+1}. We can always find such data values as DD is countably infinite. The DSAFA MM accepts the data word ww^{\prime} due to the following reasons:

      • *

        The SAFA MM accepts ww by executing the sequence of transitions TT. The transition tn+2t_{n+2} in TT is of the form (a,p(h),)(a,p(h_{\ell}),-) or (a,p(h),𝗂𝗇𝗌(hj))(a,p(h_{\ell}),{\sf ins}(h_{j})), therefore, for SAFA MM, to execute the transition tn+2t_{n+2} and consume the data value dn+1d_{n+1}, the data value must already be present in the set hh_{\ell}. The data value dn+1d_{n+1} is first encountered while executing transition tn+1t_{n+1} therefore tn+1t_{n+1} inserts the data value in hh_{\ell}.

      • *

        In consuming ww^{\prime}, the DSAFA MM executes the transition tn+1t_{n+1} in its transition sequence, so the data value dn+1d_{n+1} is already present in set hh_{\ell} which MM uses to execute the transition tn+2t_{n+2} as the last transition in the sequence of transitions TnewT_{new} to consume ww^{\prime}. The data word ww^{\prime} is not in LL as the last two data values are not the same.

      Hence the state qn+1q_{n+1} cannot be a repeated state in SS.

    • Let us assume a state qjq_{j} where qjqn+1q_{j}\neq q_{n+1} and qjqn+2q_{j}\neq q_{n+2} is one such repeated state, i.e. suppose it is same as the qithq_{i}^{th} state in SS. The transitions ti+1tjt_{i+1}...t_{j} in TT are of the form (a,!p(h),)(a,!p(h_{\ell}),-) or (a,!p(h),𝗂𝗇𝗌(hj))(a,!p(h_{\ell}),{\sf ins}(h_{j})). The SAFA MM is a DSAFA and in a DSAFA only two possible transitions can come out of a state: one with p(h)p(h_{\ell}) and the other with !p(h)!p(h_{\ell}). In the state qjq_{j} the transition of the form (a,!p(h),)(a,!p(h_{\ell}),-) or (a,!p(h),𝗂𝗇𝗌(hj))(a,!p(h_{\ell}),{\sf ins}(h_{j})) takes MM to a state qzq_{z} which is inside the cycle qiqjq_{i}...q_{j}. Only other transition allowed in qjq_{j} is a transition of the form (a,p(h),)(a,p(h_{\ell}),-) or (a,p(h),𝗂𝗇𝗌(hj))(a,p(h_{\ell}),{\sf ins}(h_{j})) that takes it to a state qkq_{k} which is not in the cycle. The state qjq_{j} is not a final state or a state prior to a final state. Therefore there must exist a transition tt from qjq_{j} in TT as TT is an accepting sequence of transitions which takes DSAFA MM from qjq_{j} to qkq_{k} such that MM is no longer in the cycle qiqjq_{i}...q_{j}. The DSAFA MM moves towards the accepting state qn+2q_{n+2} which is not present in the cycle by executing tt. In order to go from qjq_{j} to qkq_{k}, the DSAFA MM must execute a transition of the form (a,p(h),)(a,p(h_{\ell}),-) or (a,p(h),𝗂𝗇𝗌(hj))(a,p(h_{\ell}),{\sf ins}(h_{j})) as it is the only other transition available in state qjq_{j} which is a contradiction as all transitions in TT coming out of states which are not accepting states or states prior to accepting states are of the form (a,!p(h),)(a,!p(h_{\ell}),-) or (a,!p(h),𝗂𝗇𝗌(hj))(a,!p(h_{\ell}),{\sf ins}(h_{j})).

Hence, DSAFA are not closed under concatenation.

Thus, we get the following theorem.

Theorem 4.37.

DSAFA are closed under complementation but not under union, intersection, concatenation, Kleene’s closure, reversal, homomorphism and inverse homomorphism.

Table 1 provides a summary of the closure properties of SAFA and DSAFA. We use the symbols \cup for union, \cap for intersection, !! for complement, h(L)h(L) for homomorphism and h1(L)h^{-1}(L) for inverse homomorphism respectively.

Table 1: Closure properties of SAFA
\cup \cap !! . * LRL^{R} h(L)h(L) h1(L)h^{-1}(L)
SAFA \checkmark ×\times ×\times \checkmark ×\times ×\times ×\times ×\times
DSAFA ×\times ×\times \checkmark ×\times ×\times ×\times ×\times ×\times

We now show that the class of languages accepted by DSAFA is strictly contained in the class of languages accepted by SAFA.

Theorem 4.38.

𝖣𝖲𝖠𝖥𝖠𝖲𝖠𝖥𝖠{\mathcal{L}_{\sf DSAFA}}\subsetneq{\mathcal{L}_{\sf SAFA}}.

Proof 4.39.

Recall from Example 3.4 that the language L𝖼𝗇𝗍2𝖲𝖠𝖥𝖠L_{\exists{\sf cnt}\neq 2}\in{\mathcal{L}_{\sf SAFA}}. On the other hand, we show in the proof of Lemma 4.23 that there does not exist a SAFA accepting its complement language L𝖼𝗇𝗍=2L_{\forall{\sf cnt}=2}. This implies that the language L𝖼𝗇𝗍2L_{\exists{\sf cnt}\neq 2} cannot be accepted by a DSAFA since DSAFA are closed under complementation. The result follows since every deterministic SAFA is a SAFA.

Lemma 4.40.

For Every DSAFA accepting L=L𝗉𝗋𝗈𝗃Σ(L)=regexp(r)L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)} we can always get a DFA accepting L(r)L(r) where L(r)L(r) is the language expressed by the regular expression rr which has the same number of states as the DSAFA.

Proof 4.41.

Consider the language L=L𝗉𝗋𝗈𝗃Σ(L)=regexp(r)L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)}. Let MM^{\prime} be a DSAFA which accepts LL. We can always obtain a DSAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) with less or equal number of states as MM^{\prime} and transitions only of the form (a,!p(hi),)(a,!p(h_{i}),-) where aΣa\in\Sigma and hi,hjHh_{i},h_{j}\in H which accepts LL in the following manner:

  • We can remove transitions of the form (a,p(hi),)(a,p(h_{i}),-) or (a,p(hi),𝗂𝗇𝗌(hj))(a,p(h_{i}),{\sf ins}(h_{j})) where aΣa\in\Sigma and hi,hjHh_{i},h_{j}\in H from MM^{\prime} to construct MM.

  • If there are transitions of the form (a,!p(hi),𝗂𝗇𝗌(hj))(a,!p(h_{i}),{\sf ins}(h_{j})) where aΣa\in\Sigma and hi,hjHh_{i},h_{j}\in H in MM^{\prime} we convert it to (a,!p(hi),)(a,!p(h_{i}),-) in MM.

Thus all transactions in MM are of the form (a,!p(hi),)(a,!p(h_{i}),-). Observe, that MM remains a DSAFA. The removal of (a,p(hi),)(a,p(h_{i}),-) or (a,p(hi),𝗂𝗇𝗌(hj))(a,p(h_{i}),{\sf ins}(h_{j})) from MM^{\prime} may result in some unreachable states in MM which can be removed. MM accepts LL due to the following reason:

Consider the case that MM^{\prime} has a sequence TT of transitions where there is at least one transition tt of the form (a,p(hi),)(a,p(h_{i}),-) which results in accepting a data word ww in LL. The data word ww has 𝗉𝗋𝗈𝗃Σ(w)L(r){{\sf proj}_{\Sigma}}(w)\in L(r) where L(r) is the regular language expressed by the regular expression rr and 𝗉𝗋𝗈𝗃D(w){{\sf proj}_{D}}(w) has at least one data value which is repeated. But as MM^{\prime} accepts LL it should also accept the data word ww^{\prime} in LL where 𝗉𝗋𝗈𝗃Σ(w)=𝗉𝗋𝗈𝗃Σ(w){{\sf proj}_{\Sigma}}(w^{\prime})={{\sf proj}_{\Sigma}}(w) and 𝗉𝗋𝗈𝗃D(w){{\sf proj}_{D}}(w^{\prime}) have data values which are all distinct. The sequence TT^{\prime} of transitions which accepts ww^{\prime} consists of transitions with !p(hi)!p(h_{i}) where hiHh_{i}\in H only. The sequence TT^{\prime} of transitions are valid also for MM as MM retains all transitions with !p(hi)!p(h_{i}) and removes any insert operation if there are any from MM^{\prime}. This sequence TT^{\prime} of transitions accepts both ww and ww^{\prime} in MM. Given such a DSAFA MM, we can get a DFA AA which accepts L(r)L(r) by converting the transitions labelled (a,p(hi),)(a,p(h_{i}),-) to aa in AA. The number of states in the DFA AA is same as the number of states of the DSAFA MM.

Lemma 4.42.

Every DSAFA accepting L=L𝗉𝗋𝗈𝗃Σ(L)=regexp(r)L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)} has at least as many states as the smallest DFA accepting L(r)L(r) where L(r)L(r) is the language expressed by the regular expression rr.

Proof 4.43.

Let us assume there exists a DSAFA MM which accepts L=L𝗉𝗋𝗈𝗃Σ(L)=regexp(r)L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)} with k1k_{1} states and there is a minimized DFA AminA_{min} which accepts L(r)L(r) where L(r)L(r) is the language expressed by regular expression rr with k2k_{2} states such that k1<k2k_{1}<k_{2}.

From Lemma 4.40 we see that there exists a DFA AA which accepts L(r)L(r) with k1k_{1} number of states. Thus, AminA_{min} is no longer the minimized DFA, which is a contradiction.

Theorem 4.44.

There exists a language LL which is accepted by a non-deterministic SAFA with nn states but a DSAFA will require at least 2O(n)2^{O(n)} states to accept the same language LL.

Proof 4.45.

From Lemma 4.42 we see that the number of states of any DSAFA that accepts L=L𝗉𝗋𝗈𝗃Σ(L)=regexp(r)L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)} must be greater than or equal to the number of states of the minimum DFA that accepts L(r)L(r). Now consider a minimum NFA AnfaA_{nfa} which accepts L(r). We can obtain a nondeterministic SAFA MnfaM_{nfa} which accepts the language L=L𝗉𝗋𝗈𝗃Σ(L)=regexp(r)L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)} using the NFA AnfaA_{nfa} by replacing the transitions of the NFA which are of the form aa where aΣa\in\Sigma by the transition (a,!p(hi),)(a,!p(h_{i}),-). The number of states of the nondeterministic SAFA MM is the same as that of the NFA AnfaA_{nfa}. We know that there exists an NFA accepting L(re)L(r_{e}) such that every DFA accepting the same language has size at least exponential in the size of the NFA. Now for one such rer_{e}, the number of states required for any DSAFA to accept L=L𝗉𝗋𝗈𝗃Σ(L)=regexp(re)L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r_{e})} is exponential in the number of states required to accept L=L𝗉𝗋𝗈𝗃Σ(L)=regexp(er)L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(_{e}r)} by a nondeterministic SAFA.

5 Expressiveness

Let 𝖪𝖱𝖥𝖠{\mathcal{L}_{\sf KRFA}} and 𝖢𝖬𝖠{\mathcal{L}_{\sf CMA}} be the set of all languages accepted by kk-register automata and CMA respectively. We compare the computational power of SAFA with kk-register automata. We show that 𝖲𝖠𝖥𝖠{\mathcal{L}_{\sf SAFA}} and 𝖪𝖱𝖥𝖠{\mathcal{L}_{\sf KRFA}} are incomparable. Although 𝖲𝖠𝖥𝖠{\mathcal{L}_{\sf SAFA}} and 𝖪𝖱𝖥𝖠{\mathcal{L}_{\sf KRFA}} are incomparable, SAFA recognize many important languages which kk-register automata also recognize such as Ld1L_{d_{1}}: wherein the first data value is repeated, La2L_{a\geq 2}: wherein attribute aa is associated with more than two distinct data values. On the other hand, kk-register automata fail to accept languages where we have to store more than kk data values such as L𝖿𝖽(a)L_{{\sf fd}(a)}, L𝖾𝗏𝖾𝗇(a)L_{{\sf even}(a)}: wherein attribute aa is associated with an even number of distinct data values [20]. SAFA can accept both these data languages. We also show below that there are languages such as LdL_{d}: the language of data words, each of which contains the data value dd associated with some attribute at some position in the data word, that can be accepted by a 22-register automaton but not by SAFA.

Example 5.1.

A 22-register automaton can accept the language LdL_{d}. Consider the 2-register automaton AA with Σ={a}\Sigma=\{a\}, Q={q0,q1}Q=\{q_{0},q_{1}\}, τ0={d,}\tau_{0}=\{d,\bot\}, F={q1}F=\{q_{1}\}, U(q0,a)=2U(q_{0},a)=2, U(q1,a)=2U(q_{1},a)=2. The transition relation δ\delta is defined as:

{(q0,a,1,q1),(q0,a,2,q0),(q1,a,1,q1),(q1,a,2,q1)}\{(q_{0},a,1,q_{1}),(q_{0},a,2,q_{0}),(q_{1},a,1,q_{1}),(q_{1},a,2,q_{1})\}. The automaton AA accepts LdL_{d}. For an input word ww, the automaton checks whether the current data value of ww under the head of AA is equal to the content of register 11, which holds the data value dd from the time of initialization of the registers. If it is equal, the automaton goes to state q1q_{1} and consumes the word. Since q1q_{1} is a final state, the word is accepted. If the data value dd is not present in ww, the automaton remains in state q0q_{0} and rejects the input word. ∎

Theorem 5.2.

𝖲𝖠𝖥𝖠{\mathcal{L}_{\sf SAFA}} and 𝖪𝖱𝖥𝖠{\mathcal{L}_{\sf KRFA}} are incomparable.

Proof 5.3.

We first show by contradiction that no SAFA can accept LdL_{d}. Suppose there exists a SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) which accepts LdL_{d}. Now consider a word wLdw\in L_{d} where the data value dd has occurred at the first position only. Let the sequence of transitions that MM goes through to accept ww be TwT_{w}. The first transition t1t_{1} that TwT_{w} executes cannot be a transition with p(hi)p(h_{i}), hiHh_{i}\in H because t1t_{1} is the first transition of TwT_{w} and there cannot be any insertion to set hih_{i} prior to it. Therefore, t1t_{1} is a transition with !p(hi)!p(h_{i}), hiHh_{i}\in H and so t1t_{1} can consume any other data value dd^{\prime} which is not present in ww. As dd does not occur anywhere else in ww, it is safe to say that all other transitions in TwT_{w} with p(hi)p(h_{i}), hiHh_{i}\in H have data values other than dd present in their respective sets. Thus, if MM accepts w=(a,d)xw=(a,d)x, where x(Σ×D)x\in(\Sigma\times D)^{*} and xx does not have value dd in it, then MM also accepts w=(a,d)xw^{\prime}=(a,d^{\prime})x, and ww^{\prime} does not have data value dd in it, which is a contradiction. From Example 5.1, Example 3.3 and the fact that kk-register automata cannot accept L𝖿𝖽(𝖺)L_{\sf fd(a)} [20] and the above, we conclude 𝖲𝖠𝖥𝖠{\mathcal{L}_{\sf SAFA}} and 𝖪𝖱𝖥𝖠{\mathcal{L}_{\sf KRFA}} are incomparable.

\chadded

[id=KC]If we equip SAFA with initialization, that is the sets of SAFA can be initialized prior to the beginning of computation, then SAFA can accept the language LdL_{d}. Even then, kk-register automata and SAFA with initialization are incomparable as shown below.

Example 5.4.

A 22-register automaton can accept the language L={w(({a}×{d})2)|dD}L=\{w\in((\{a\}\times\{d\})^{2})^{*}|d\in D\}. Consider the 22-register automaton AA with Σ={a}\Sigma=\{a\}, Q={q0,q1,q2}Q=\{q_{0},q_{1},q_{2}\}, τ0={,}\tau_{0}=\{\bot,\bot\}, F={q0}F=\{q_{0}\}, U(q0,a)=1U(q_{0},a)=1, U(q1,a)=2U(q_{1},a)=2,

The transition relation δ\delta is defined as: {(q0,a,1,q1),(q1,a,1,q0),(q1,a,2,q2)}\{(q_{0},a,1,q_{1}),(q_{1},a,1,q_{0}),(q_{1},a,2,q_{2})\}. The automaton AA accepts L={w(({a}×{d})2)|dD}L=\{w\in((\{a\}\times\{d\})^{2})^{*}|d\in D\}. For an input word ww, the automaton in state q0q_{0} reads the first data value of a data pair and inserts it into register 11 and goes to state q2q_{2}. In state q2q_{2} if the second data value of the data pair is same as the first, the automaton goes back to state q1q_{1} to read the next data value pair. Otherwise, the automaton goes to a dead state q3q_{3} and rejects the input word ww. ∎

Let 𝖲𝖠𝖥𝖠𝗂𝗇𝗂𝗍{\mathcal{L}_{\sf SAFA_{init}}} denote the set of all languages accepted by nondeterministic SAFA with initialization. We have the following.

Theorem 5.5.

𝖲𝖠𝖥𝖠𝗂𝗇𝗂𝗍{\mathcal{L}_{\sf SAFA_{init}}} and 𝖪𝖱𝖥𝖠{\mathcal{L}_{\sf KRFA}} are incomparable.

Proof 5.6.

Consider the language L={w(({a}×{d})2)|dD}L=\{w\in((\{a\}\times\{d\})^{2})^{*}|d\in D\} i.e. LL is the set of all data words where every data value appears in pairs. We show that there exists no SAFA with initialization which accepts LL. We prove by contradiction. Assume that there exists a SAFA M=(Q,{a}×D,q0,F,H,δ)M=(Q,\{a\}\times D,q_{0},F,H,\delta) with |H|=k>0|H|=k>0 and whose sets can be initialized such that MM accepts LL. Then MM must accept the following word wLw\in L where w=(a,d1)(a,d1)(a,di)(a,di)(a,dk+1)(a,dk+1)w=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{k+1})(a,d_{k+1}) and d1,,dk+1Dd_{1},...,d_{k+1}\in D are all distinct. In order to accept ww, the SAFA MM must go through a sequence T=t1d1t2d1t1dk+1t2dk+1T=t_{1_{d_{1}}}t_{2_{d_{1}}}...t_{1_{d_{k+1}}}t_{2_{d_{k+1}}} of transitions to completely consume ww and end in an accepting state. Here t1dit_{1_{d_{i}}} consumes the first data item of the ithi^{th} data value pair and t2dit_{2_{d_{i}}} consumes the second data item of the ithi^{th} data value pair and 1ik+11\leq i\leq k+1.

  • The transitions t2dit_{2_{d_{i}}} must be of the form (a,p(h),)(a,p(h_{\ell}),-) or (a,p(h),𝗂𝗇𝗌(hj))(a,p(h_{\ell}),{\sf ins}(h_{j})) where h,hjHh_{\ell},h_{j}\in H. This is essential because if t2dit_{2_{d_{i}}} is of the form (a,!p(h),)(a,!p(h_{\ell}),-) or (a,!p(h),𝗂𝗇𝗌(hj))(a,!p(h_{\ell}),{\sf ins}(h_{j})) then instead of consuming did_{i} it can also consume successfully a new data value dnewDd_{new}\in D which is not present in ww. It is always possible to get such a data value as DD is countably infinite. The SAFA MM will then accept the data word w=(a,d1)(a,d1)(a,di)(a,dnew)(a,dk+1)(a,dk+1)w^{\prime}=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{new})\cdots(a,d_{k+1})(a,d_{k+1}) which is not in LL.

  • The transitions t1dit_{1_{d_{i}}} must either insert the data values it reads in to a set in MM or if the data value is already present in a set during initialization it may or may not insert it. Recall that for the transition t2dit_{2_{d_{i}}} to consume the data value did_{i}, the data value did_{i} must be inserted in a set when it was first encountered or it must already be present during initialization.

  • Since the number of distinct data values in ww are more than the number of sets in MM and all the distinct data values are inserted in the sets in MM, by pigeon hole principle, there are two distinct data values did_{i} and djd_{j}, with i<ji<j which are inserted in the same set hHh_{\ell}\in H. Thus, if MM accepts the data word w=(a,d1)(a,d1)(a,di)(a,di)(a,dj)(a,dj)(a,dk+1)(a,dk+1)w=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{j})(a,d_{j})\cdots(a,d_{k+1})(a,d_{k+1}) then MM will also accept the data word w=(a,d1)(a,d1)(a,di)(a,di)(a,dj)(a,di)(a,dk+1)(a,dk+1)w^{\prime}=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{j})(a,d_{i})\cdots(a,d_{k+1})(a,d_{k+1}) which is not in LL.

From Example 5.4, Example 3.3 and the fact that kk-register automata cannot accept L𝖿𝖽(𝖺)L_{\sf fd(a)} [20] and the above, we conclude 𝖲𝖠𝖥𝖠{\mathcal{L}_{\sf SAFA}} with initialization and 𝖪𝖱𝖥𝖠{\mathcal{L}_{\sf KRFA}} are incomparable.

Also we note that both SAFA and kk-register automata have the same complexity for the nonemptiness and the membership problems. Similar to SAFA, CCA and CMA also accept data languages such as L𝖿𝖽(𝖺)L_{\sf fd(a)}, L𝖾𝗏𝖾𝗇(a)L_{{\sf even}(a)} which kk-register automata cannot, but their decision problems have higher complexity [5, 19]. We can show that the class of languages accepted by SAFA is strictly contained in the class of languages accepted by CCA.

Theorem 5.7.

𝖲𝖠𝖥𝖠𝖢𝖢𝖠{\mathcal{L}_{\sf SAFA}}\subsetneq{\mathcal{L}_{\sf CCA}}

Proof 5.8.

For every SAFA M=(Q,Σ×D,q0,F,H,δ)M=(Q,\Sigma\times D,q_{0},F,H,\delta) with |H|=k|H|=k accepting a language LL, we can construct a kk-bag CCA A=(Q,Σ,Δ,{q0},F)A=(Q,\Sigma,\Delta,\{q_{0}\},F) which accepts the same language LL in the following manner:

  • The set QQ of states, the set FF of final states and the initial state are the same for both the SAFA MM and the kk-bag CCA A.

  • The transitions in the transition relation δ\delta of SAFA MM are mapped to transitions in the transition relation Δ\Delta of the kk-bag CCA AA as follows:

    • For every transition of the form (qi,a,p(hi),,qj)(q_{i},a,p(h_{i}),-,q_{j}) in the transition relation δ\delta of SAFA MM which takes the SAFA MM from state qiq_{i} to state qjq_{j} where aΣa\in\Sigma, qi,qjQq_{i},q_{j}\in Q and hiHh_{i}\in H, we have the transition (qi,a,c1,,ck,instr1,,instrk,qj)(q_{i},a,c_{1},\ldots,c_{k},instr_{1},\ldots,instr_{k},q_{j}) where cic_{i} is (=,1)(=,1) and all other cjc_{j} are (,0)(\geq,0), instri=[0]instr_{i}=[0] for all 1ik1\leq i\leq k in the transition relation Δ\Delta of the kk-bag CCA A. If a data value dd is read on the transition in SAFA MM, then CCA A checks if βi(d)=1\beta_{i}(d)=1 denoting that the kk-bag CCA AA has seen the data value dd before and it has been recorded in the ithi^{th} bag.

    • For every transition of the form (qi,a,!p(hi),,qj)(q_{i},a,!p(h_{i}),-,q_{j}) in the transition relation δ\delta of SAFA MM which takes the SAFA MM from state qiq_{i} to state qjq_{j} where aΣa\in\Sigma, qi,qjQq_{i},q_{j}\in Q and hiHh_{i}\in H, we have the transition (qi,a,c1,,ck,instr1,,instrk)(q_{i},a,c_{1},\ldots,c_{k},instr_{1},\ldots,instr_{k}) where cic_{i} is (=,0)(=,0) and all other cjc_{j} are (,0)(\geq,0), instri=[0]instr_{i}=[0] for all 1ik1\leq i\leq k in the transition relation Δ\Delta of kk-bag CCA. If a data value dd is read on the transition in SAFA MM, then CCA A checks if βi(d)=1\beta_{i}(d)=1 denoting that the kk-bag CCA AA has either not seen the data value dd or it has not been recorded in the ithi^{th} bag.

    • For every transition of the form (qi,a,p(hi)),𝗂𝗇𝗌(hj),qj)(q_{i},a,p(h_{i})),{\sf ins}(h_{j}),q_{j}) in the transition relation δ\delta of the SAFA MM which takes the SAFA MM from state qiq_{i} to state qjq_{j} where aΣa\in\Sigma, qi,qjQq_{i},q_{j}\in Q and hi,hjHh_{i},h_{j}\in H, we introduce the transition (qi,a,c1,,ck,instr1,,instrk)(q_{i},a,c_{1},\ldots,c_{k},instr_{1},\ldots,instr_{k}) where cic_{i} is (=,1)(=,1) and all other cjc_{j} are (,0)(\geq,0), instrj=[,1]instr_{j}=[\downarrow,1] and all other instri=[0]instr_{i}=[0] for all iji\neq j in the transition relation Δ\Delta of the kk-bag CCA. Similar to SAFA MM inserting the data value dd in the set hjh_{j}, the kk-bag CCA records the data value by setting βj(d)=1\beta_{j}(d)=1 while all the values for all other bags remain unchanged.

    • For every transition of the form (qi,a,!p(hi)),𝗂𝗇𝗌(hj),qj)(q_{i},a,!p(h_{i})),{\sf ins}(h_{j}),q_{j}) in the transition relation δ\delta of SAFA MM which takes the SAFA MM from state qiq_{i} to state qjq_{j} where aΣa\in\Sigma, qi,qjQq_{i},q_{j}\in Q and hi,hjHh_{i},h_{j}\in H, we introduce the transition (qi,a,c1,,ck,instr1,,instrk)(q_{i},a,c_{1},\ldots,c_{k},instr_{1},\ldots,instr_{k}) where cic_{i} is (=,0)(=,0) and all other cjc_{j} are (,0)(\geq,0), instrj=[,1]instr_{j}=[\downarrow,1] and all other instri=[0]instr_{i}=[0] for all iji\neq j in transition relation Δ\Delta of the kk-bag CCA. Similar to SAFA MM inserting the data value dd in the set hjh_{j}, the kk-bag CCA records the data value by setting βj(d)=1\beta_{j}(d)=1.

The kk-bag CCA simulates the SAFA MM. Given a data word ww accepted by the SAFA MM, there is a sequence of transitions that takes SAFA MM from an initial state to a final state. The kk-bag CCA simulating the SAFA MM can replicate the corresponding sequence of transitions to accept ww. Given a data word ww not accepted by the SAFA MM, there is no sequence of transitions that takes the SAFA MM from an initial state to a final state. The kk-bag CCA AA simulating the SAFA MM also does not have any sequence of transitions which takes it from an initial state to a final state and thus the kk-bag CCA AA also rejects ww.

We know from [19], that for every kk-bag CCA there exists a one bag CCA which accepts the same language. Therefore for every SAFA MM there exists a CCA AA which accepts the same language. Now, we show that CCA are more expressive than SAFA. Moreover, SAFA MM accepts the languages Lfd(a)L_{fd(a)} and LabL_{a\exists b} but not the language L𝖿𝖽(a)LabL_{{\sf fd}(a)}\cap L_{a\exists b}, and there exists a CCA for every language accepted by a SAFA, the languages Lfd(a),Lab𝖢𝖢𝖠L_{fd(a)},L_{a\exists b}\in{\mathcal{L}_{\sf CCA}}. Since CCA are closed under intersection [19], we have that L𝖿𝖽(a)Lab𝖢𝖢𝖠L_{{\sf fd}(a)}\cap L_{a\exists b}\in{\mathcal{L}_{\sf CCA}}. Therefore, 𝖲𝖠𝖥𝖠𝖢𝖢𝖠{\mathcal{L}_{\sf SAFA}}\subsetneq{\mathcal{L}_{\sf CCA}}.

\chadded

[id=KC]We know from [19] that 𝖢𝖢𝖠𝖢𝖬𝖠{\mathcal{L}_{\sf CCA}}\subsetneq{\mathcal{L}_{\sf CMA}} and from the above result, we get the following theorem. \chdeleted[id=KC]We can show that the class of languages accepted by SAFA is strictly contained in the class of languages accepted by CMA.

Corollary 5.9.

𝖲𝖠𝖥𝖠𝖢𝖬𝖠{\mathcal{L}_{\sf SAFA}}\subsetneq{\mathcal{L}_{\sf CMA}}.

\chdeleted

[id=KC]Proof sketch. A state of a CMA AA simultaneously keeps track of the state of the SAFA MM that it wants to simulate and also to which sets a particular data value has been inserted in MM. Since both the number of states and the number of sets in SAFA are finite, therefore, the number of states in the CMA AA is also finite. The number of states of the CMA simulating a SAFA is exponential in the size of the SAFA since a state of the CMA keeps the information of a subset of HH storing a data value. Further, one can show that a CMA can accept the language L𝖼𝗇𝗍=2L_{\forall{\sf cnt}=2}, and from Lemma 4.23, we know that no SAFA can accept L𝖼𝗇𝗍=2L_{\forall{\sf cnt}=2}. ∎

6 Conclusion

In this paper, we introduce set augmented finite automata which use a finite set of sets of data values to accept data languages. We have shown examples of several data languages that can be accepted by our model. We compare the language acceptance capabilities of SAFA with kk-register automata, CCA, and CMA. The computational power and low complexity of nonemptiness and membership of SAFA makes it an useful tool for modeling and analysis of data languages. We show that similar to register automata, the universality problem for SAFA is undecidable. We also study the deterministic variant of SAFA which are closed under complementation, and hence have decidable universality. We believe our model is robust enough and can also be extended to infinite words. This model opens up some interesting avenues for future research. We would like to explore augmentations of SAFA with Boolean combinations of tests and the feature of updating multiple sets simultaneously. This may lead to having a more well-behaved model with respect to closure properties.

References

  • [1] Jean-Michel Autebert, Joffroy Beauquier, and Luc Boasson. Langages sur des alphabets infinis. Discrete Applied Mathematics, 2(1):1–20, 1980.
  • [2] Christel Baier and Joost-Pieter Katoen. Principles of model checking. MIT press, 2008.
  • [3] Ansuman Banerjee, Kingshuk Chatterjee, and Shibashis Guha. Set augmented finite automata over infinite alphabets. In Frank Drewes and Mikhail Volkov, editors, Developments in Language Theory - 27th International Conference, DLT 2023, Umeå, Sweden, June 12-16, 2023, Proceedings, volume 13911 of Lecture Notes in Computer Science, pages 36–50. Springer, 2023.
  • [4] Alexis Bès. An application of the feferman-vaught theorem to automata and logics for words over an infinite alphabet. Logical Methods in Computer Science, 4, 2008.
  • [5] Henrik Björklund and Thomas Schwentick. On notions of regularity for data languages. Theoretical Computer Science, 411(4-5):702–715, 2010.
  • [6] Mikolaj Bojanczyk, Anca Muscholl, Thomas Schwentick, Luc Segoufin, and Claire David. Two-variable logic on words with data. In 21st Annual IEEE Symposium on Logic in Computer Science (LICS’06), pages 7–16. IEEE, 2006.
  • [7] Benedikt Bollig. An automaton over data words that captures emso logic. In International Conference on Concurrency Theory, pages 171–186. Springer, 2011.
  • [8] Edward YC Cheng and Michael Kaminski. Context-free languages over infinite alphabets. Acta Informatica, 35(3):245–267, 1998.
  • [9] W. Czerwiński and L. Orlikowski. Reachability in vector addition systems is ackermann-complete. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 1229–1240, 2022. doi:10.1109/FOCS52979.2021.00120.
  • [10] Jürgen Dassow and György Vaszil. P finite automata and regular languages over countably infinite alphabets. In International Workshop on Membrane Computing, pages 367–381. Springer, 2006.
  • [11] Stéphane Demri and Ranko Lazić. LTL with the freeze quantifier and register automata. ACM Transactions on Computational Logic (TOCL), 10(3):1–30, 2009.
  • [12] Diego Figueira. Alternating register automata on finite words and trees. Logical Methods in Computer Science, 8, 2012.
  • [13] Orna Grumberg, Orna Kupferman, and Sarai Sheinvald. Variable automata over infinite alphabets. In International Conference on Language and Automata Theory and Applications, pages 561–572. Springer, 2010.
  • [14] Radu Iosif and Xiao Xu. Abstraction refinement for emptiness checking of alternating data automata. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 93–111. Springer, 2018.
  • [15] Neil D Jones. Space-bounded reducibility among combinatorial problems. Journal of Computer and System Sciences, 11(1):68–85, 1975.
  • [16] Michael Kaminski and Nissim Francez. Finite-memory automata. Theoretical Computer Science, 134(2):329–363, 1994.
  • [17] Michael Kaminski and Tony Tan. Regular expressions for languages over infinite alphabets. Fundamenta Informaticae, 69(3):301–318, 2006.
  • [18] S. R. Kosaraju. Decidability of reachability in vector addition systems (preliminary version). In Harry R. Lewis, Barbara B. Simons, Walter A. Burkhard, and Lawrence H. Landweber, editors, Proceedings of the 14th Annual ACM Symposium on Theory of Computing, May 5-7, 1982, San Francisco, California, USA, pages 267–281. ACM, 1982.
  • [19] Amaldev Manuel and Ramaswamy Ramanujam. Class counting automata on datawords. International Journal of Foundations of Computer Science, 22(04):863–882, 2011.
  • [20] Amaldev Manuel and Ramaswamy Ramanujam. Automata over infinite alphabets. In Modern applications of automata theory, pages 529–553. World Scientific, 2012.
  • [21] E. W. Mayr. An algorithm for the general petri net reachability problem. SIAM J. Comput., 13(3):441–460, 1984.
  • [22] Frank Neven. Automata, logic, and xml. In International Workshop on Computer Science Logic, pages 2–26. Springer, 2002.
  • [23] Frank Neven, Thomas Schwentick, and Victor Vianu. Finite state machines for strings over infinite alphabets. ACM Transactions on Computational Logic (TOCL), 5(3):403–435, 2004.
  • [24] Hiroshi Sakamoto and Daisuke Ikeda. Intractability of decision problems for finite-memory automata. Theoretical Computer Science, 231(2):297–308, 2000.
  • [25] Tony Tan. On pebble automata for data languages with decidable emptiness problem. Journal of Computer and System Sciences, 76(8):778–791, 2010.