
Indian Statistical Institute, Kolkata, India, https://www.isical.ac.in/~ansuman, https://orcid.org/0000-0003-0220-646X

Government College of Engineering and Ceramic Technology, Kolkata, India, https://sites.google.com/view/kingshukchatterjee/home, https://orcid.org/0000-0002-2617-6309

Tata Institute of Fundamental Research, Mumbai, India, https://www.tifr.res.in/~shibashis.guha/, https://orcid.org/0000-0002-9814-6651

CCS concepts: Theory of computation → Formal languages and automata theory

Acknowledgements.

We thank Amaldev Manuel for providing useful comments on a preliminary version of this paper.

Set Augmented Finite Automata over Infinite Alphabets

Ansuman Banerjee Kingshuk Chatterjee Shibashis Guha

Abstract

A data language is a set of finite words defined on an infinite alphabet. Data languages are used to express properties associated with data values (domain defined over a countably infinite set). In this paper, we introduce set augmented finite automata (SAFA), a new class of automata for expressing data languages. We investigate the decision problems, closure properties, and expressiveness of SAFA. We also study the deterministic variant of these automata.

keywords:

automata on infinite alphabet, data languages, register automata, expressiveness, closure properties

1 Introduction

A data language is a set of data words, each of which is a concatenation of (attribute, data-value) pairs. While the set of attributes is finite, the values that these attributes hold often come from a countably infinite set (e.g. the natural numbers). With the large-scale availability of data in recent times, there is a need for methods for modeling and analyzing data languages, and in particular for automated methods that recognize attribute-data relationships and languages defined on infinite alphabets. Formally, a data word is $w=(a_{1},d_{1})(a_{2},d_{2})\cdots(a_{|w|},d_{|w|})$ , where each $a_{i}$ belongs to a finite set and each $d_{i}$ belongs to a countably infinite set. We denote by $|w|$ the length of $w$ . This work introduces a new model for data languages on infinite alphabets.

$k$ -register automata (finite automata with $k$ registers, each capable of storing one data value) [16] are finite automata with the ability to handle infinite alphabets. The nonemptiness and membership problems for register automata are both NP-complete [24]. However, their recognition power is somewhat restricted, since they use only a finite number of registers to store data values. Thus, register automata cannot accept many data languages, one such being $L_{{\sf fd}(a)}$ , the collection of data words in which all data values associated with the attribute $a$ are distinct. Pushdown versions of automata on infinite alphabets using stacks have also been introduced in [8]. However, even with the introduction of a stack, these models are unable to accept $L_{{\sf fd}(a)}$ . Data automata are introduced in [6] and the emptiness problem is shown to be nonelementary. Further, class memory automata (CMA) and class counter automata (CCA) are introduced in [5, 19] respectively. While CMA and data automata are shown to be equivalent [5], the set of languages accepted by CCA is a subset of the set of languages accepted by CMA. The nonemptiness problem for CCA is EXPSPACE-complete [19], and the nonemptiness problem for CMA is interreducible to the reachability problem in Petri nets [20, 6, 18, 21], and hence Ackermann-complete [9].

Our Contribution:

This work introduces set augmented finite automata (SAFA), finite automaton models equipped with a finite number of finite sets for storing data values. Using these sets as auxiliary storage, SAFA is able to recognize many important data languages, including $L_{{\sf fd}(a)}$ . This paper makes the following contributions.

•

We present the formal definition of the SAFA model on infinite alphabets (Definition 3.1).
•

We show that nonemptiness and membership are NP-complete for SAFA. Further, we show that universality for SAFA is undecidable (Theorems 4.7, 4.13, and 4.15).
•

We study the closure properties of the SAFA model (see Section 4.2). In order to show non-closure under complementation, we introduce a pumping lemma for SAFA (Lemma 4.21).
•

We also study the deterministic variant of SAFA and show that there are languages that necessarily need nondeterminism to be accepted (Theorem 4.38).
•

We present a strict hierarchy of languages with respect to the number of sets associated with SAFA models (Theorem 3.5).
•

Finally, we study the expressiveness of SAFA models (see Section 5). We show that the classes of languages recognized by SAFA and by register automata are incomparable, while the set of languages accepted by SAFA is a strict subset of the set of languages accepted by CCA, and hence by CMA.

Related work:

Register automata, introduced by Kaminski et al. [16], use a finite number of registers to store data values; hence they can only accept those data languages in which membership depends on remembering properties of a finite number $k$ of data values, where $k$ is bounded above by the number of registers in the register automaton. An extension of finite register models are pushdown automata models for infinite alphabets [8]. Cheng et al. [8] and Autebert et al. [1] both describe context-free languages for infinite alphabets. The membership problem for the grammar proposed by Cheng et al. is decidable, unlike that of Autebert et al. Sakamoto et al. [24] show that the membership problem is PTIME-complete and nonemptiness is NP-complete for the deterministic register finite automata model on infinite alphabets. Neven et al. [22, 23] discuss the properties of register and pebble automata on infinite alphabets in terms of their language acceptance capabilities, and also establish their relationship with logic. Tan et al. [25] introduce a weak 2-pebble automata model whose emptiness is decidable, but with a significant reduction in acceptance capabilities with respect to pebble automata. Kaminski et al. [17] also develop regular expressions over infinite alphabets. Choffrut et al. [4] define finite automata on infinite alphabets which use first-order logic on transitions. This is later extended by Iosif et al. to an alternating automata model [14] whose emptiness problem is undecidable, but for which they give two efficient semi-algorithms for emptiness checking. Demri et al. [11] explore the relationship between linear temporal logic (LTL) and register automata. Grumberg et al. [13] introduce variable automata over infinite alphabets, where transitions are defined over alphabets as well as over variables. Data automata are introduced in [6]. Class memory automata (CMA) are introduced in [5] and are shown to be equivalent to data automata. The set of languages accepted by CMA is a superset of the set of languages accepted by class counter automata (CCA) [19], another automata model over infinite alphabets, introduced by Manuel et al. Bollig [7] combines CMA and register automata and shows that local existential monadic second order (MSO) logic can be converted to class register automata in time polynomial in the size of the input automata. Figueira [12] discusses the properties of alternating register automata and also discusses restricted variants of this model where decidability is tractable. Dassow et al. [10] introduce the P-automata model and establish that these are equivalent to a restricted version of register automata. A detailed survey of existing finite automata models on infinite alphabets can be found in [20].

The elegance of SAFA is its simple structure that is easy to implement. Moreover, the membership and nonemptiness problems are NP-complete for our model. On one hand, this gives us an advantage over the hash-based family of models on infinite alphabets, with respect to the associated problem complexities. On the other hand, this puts our model in the same complexity class as the $k$ -register automata, while having the ability to accept many important data languages.

A preliminary version of this paper appeared in [3].

2 Preliminaries

Let $\mathbb{N}$ denote the set of natural numbers, and $[k]$ the set $\{1,\dots,k\}$ where $k>0$ . Let $\Sigma$ be a finite alphabet which comprises a finite set of attributes, and $D$ be a countably infinite set of attribute values. A data word $w\in(\Sigma\times D)^{*}$ is a concatenation of (attribute, data value) pairs, where $*$ denotes zero or more repetitions. A data value is also known as an attribute value, that is, the value associated with an attribute. A data word is of the form $w=(a_{1},d_{1})\cdots(a_{|w|},d_{|w|})$ where $a_{1},\dots,a_{|w|}\in\Sigma$ , $d_{1},\dots,d_{|w|}\in D$ , and $|w|$ denotes the length of $w$ . An example data word $w$ on $\Sigma=\{a,b\}$ and $D=\mathbb{N}$ is $(a,1)(a,2)(b,1)(b,5)(a,2)(a,5)(a,7)(a,100)$ with $|w|=8$ . A data language $L\subseteq(\Sigma\times D)^{*}$ is a set of data words. Some example data languages with $\Sigma=\{a,b\}$ , $D=\mathbb{N}$ are mentioned below.

•

$L_{{\sf fd}(a)}$ : language of data words, wherein the data values associated with attribute $a$ are all distinct.
•

$L_{\forall{\sf cnt}=2}$ : language of data words wherein all data values appear exactly twice.
•

$L_{\exists{\sf cnt}\neq 2}$ : the language of data words $w$ where there exists a data value $d$ which does not appear exactly twice. $L_{\exists{\sf cnt}\neq 2}$ is the complement of $L_{\forall{\sf cnt}=2}$ .
•

$L_{a\exists b}$ : the language of data words wherein the data values associated with attribute $a$ are those which have already appeared with the attribute $b$ .
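The four example languages can be made concrete as reference membership predicates. The following Python sketch is an illustration only: it assumes that a data word is represented as a list of (attribute, value) pairs, and the function names are hypothetical and do not appear in the paper.

```python
from collections import Counter

# Assumption of this sketch: a data word is a list of (attribute, value) pairs.

def in_fd(word, attr="a"):
    """L_fd(attr): all data values paired with `attr` are distinct."""
    values = [d for b, d in word if b == attr]
    return len(values) == len(set(values))

def in_forall_cnt_2(word):
    """L_{forall cnt = 2}: every data value in the word appears exactly twice."""
    return all(c == 2 for c in Counter(d for _, d in word).values())

def in_exists_cnt_not_2(word):
    """L_{exists cnt != 2}: some data value does not appear exactly twice."""
    return any(c != 2 for c in Counter(d for _, d in word).values())

def in_a_exists_b(word):
    """L_{a exists b}: every value read with `a` was read earlier with `b`."""
    seen_with_b = set()
    for attr, d in word:
        if attr == "b":
            seen_with_b.add(d)
        elif attr == "a" and d not in seen_with_b:
            return False
    return True

# The example data word of length 8 given in the Preliminaries.
w = [("a", 1), ("a", 2), ("b", 1), ("b", 5), ("a", 2), ("a", 5), ("a", 7), ("a", 100)]
print(in_fd(w), in_forall_cnt_2(w), in_exists_cnt_not_2(w), in_a_exists_b(w))
```

On the example word, only `in_exists_cnt_not_2` holds: the value $2$ occurs twice with attribute $a$ , the values $7$ and $100$ occur once, and the first pair $(a,1)$ is read before any $b$ .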

For a word $w\in(\Sigma\times D)^{*}$ , we denote by ${{\sf proj}_{\Sigma}}(w)$ and ${{\sf proj}_{D}}(w)$ the projection of $w$ on $\Sigma$ and $D$ respectively. Let $L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)}$ be the set of all data words $w$ such that ${{\sf proj}_{\Sigma}}(w)\in L_{regexp(r)}$ where $L_{regexp(r)}\subseteq\Sigma^{*}$ is the set of all words over $\Sigma$ generated by the regular expression $r$ .

Post Correspondence Problem (PCP)

: The PCP consists of two lists of equal length, say $n$ . The items of the lists are finite strings over an alphabet $\Sigma$ where $|\Sigma|\geq 2$ . Without loss of generality, we can assume $\Sigma=\{a,b\}$ . List 1 consists of the strings $x_{1},\dots,x_{n}$ , and list 2 consists of the strings $y_{1},\dots,y_{n}$ , where $x_{1},\dots,x_{n},y_{1},\dots,y_{n}\in{\Sigma}^{*}$ . The answer to a PCP instance is true if there exists a nonempty sequence $\alpha_{1},\dots,\alpha_{m}$ with $\alpha_{1},\dots,\alpha_{m}\in[n]$ such that $x_{\alpha_{1}}\cdots x_{\alpha_{m}}=y_{\alpha_{1}}\cdots y_{\alpha_{m}}$ , and false otherwise (see Example 2.1). Each pair $(x_{i},y_{i})$ is considered as a domino, and a PCP solution is an arrangement of these dominoes such that the strings in the upper part and the lower part of the arranged dominoes become the same.

Example 2.1.

An instance of the PCP on $\Sigma=\{a,b\}$ is as follows: List 1 $=a,ba$ and List 2 $=ab,a$ . One possible solution for this instance is the sequence $1,2$ , i.e. $\binom{a}{ab}\binom{ba}{a}$ , since both the upper and the lower strings read $aba$ . ∎
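The instance of Example 2.1 can be checked mechanically. The sketch below is illustrative: the function names are hypothetical, and the bounded search is only a convenience for small instances, since PCP is undecidable in general.

```python
from itertools import product

def is_pcp_solution(xs, ys, seq):
    """Check whether the 1-based index sequence `seq` solves the instance (xs, ys)."""
    top = "".join(xs[i - 1] for i in seq)
    bottom = "".join(ys[i - 1] for i in seq)
    return len(seq) > 0 and top == bottom

def find_pcp_solution(xs, ys, max_len):
    """Brute-force search over index sequences of length at most `max_len`.

    Returning None only means that no solution of length <= max_len exists;
    it does not show that the instance has no solution.
    """
    for m in range(1, max_len + 1):
        for seq in product(range(1, len(xs) + 1), repeat=m):
            if is_pcp_solution(xs, ys, seq):
                return list(seq)
    return None

xs, ys = ["a", "ba"], ["ab", "a"]   # List 1 and List 2 of Example 2.1
print(is_pcp_solution(xs, ys, [1, 2]))
print(find_pcp_solution(xs, ys, 3))
```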

$k$ -register automata [16]

: A $k$ -register automaton is a tuple $(Q,\Sigma,\delta,\tau_{0},U,q_{0},F)$ , where $Q$ is a finite set of states, $q_{0}\in Q$ is an initial state, $F\subseteq Q$ is a set of final states, $\tau_{0}$ is an initial register configuration given by $\tau_{0}:[k]\rightarrow D\cup\{\bot\}$ , where $D$ is a countably infinite set and $\bot$ denotes an uninitialized register, and $U$ is a partial update function $U:(Q\times\Sigma)\rightarrow[k]$ . The transition relation is $\delta\subseteq(Q\times\Sigma\times[k]\times Q)$ . The data values initially stored in the registers are pairwise distinct, whereas $\bot$ can be present in more than one register. The automaton works as follows. Consider a register automaton $M$ in state $q\in Q$ . Each of its registers $r_{i}$ holds a datum $d_{i}$ , where $1\leq i\leq k$ and $d_{i}\in D\cup\{\bot\}$ . Suppose that $M$ reads the $j^{th}$ data element $(a_{j},d_{j})$ of the input word $w$ , where $a_{j}\in\Sigma$ and $d_{j}\in D$ . Two cases may arise.

•

Case 1: There exists an $i$ such that $d_{j}=d_{i}$ : In this case two situations may arise: (i) $(q,a_{j},i,q^{\prime})\in\delta$ for some $q^{\prime}\in Q$ , and (ii) there is no such transition. In situation (i) the corresponding transition is executed, and in situation (ii) the automaton stops without consuming the data element.
•

Case 2: There exists no register $i$ such that $d_{j}=d_{i}$ , that is, $d_{j}\neq d_{i}$ for all $i$ : We look at the partial update function $U$ . If $U(q,a_{j})$ is not defined, the automaton stops without consuming the data element. If $U(q,a_{j})$ is defined, then $d_{j}$ is inserted in the register $U(q,a_{j})$ and the automaton executes a transition $(q,a_{j},U(q,a_{j}),q^{\prime})\in\delta$ if one exists; otherwise it halts.

The automaton $M$ accepts an input data word $w$ if it consumes the whole word and ends in a final state.
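The two cases above can be sketched as a small simulator. This is a minimal illustration under stated assumptions, not the construction of [16]: registers are a Python tuple with `None` playing the role of $\bot$ , `update` is a dictionary standing for the partial function $U$ , and the example automaton (two registers, accepting words over $\{a\}$ whose first data value occurs again later) is hypothetical.

```python
def run_register_automaton(delta, update, tau0, q0, finals, word):
    """Return True if some run consumes `word` and ends in a final state."""
    configs = {(q0, tuple(tau0))}            # reachable (state, registers) pairs
    for a, d in word:
        nxt = set()
        for q, regs in configs:
            if d in regs:                    # Case 1: d is held by register i
                i = regs.index(d) + 1        # registers are numbered 1..k
            elif (q, a) in update:           # Case 2: store d in register U(q, a)
                i = update[(q, a)]
                regs = regs[:i - 1] + (d,) + regs[i:]
            else:                            # U(q, a) undefined: this run stops
                continue
            for (p, b, j, p2) in delta:      # follow every matching transition
                if (p, b, j) == (q, a, i):
                    nxt.add((p2, regs))
        configs = nxt
    return any(q in finals for q, _ in configs)

# Hypothetical 2-register automaton: register 1 keeps the first value,
# register 2 is scratch space for all other values.
delta = {("q0", "a", 1, "q1"), ("q1", "a", 1, "q2"), ("q1", "a", 2, "q1"),
         ("q2", "a", 1, "q2"), ("q2", "a", 2, "q2")}
update = {("q0", "a"): 1, ("q1", "a"): 2, ("q2", "a"): 2}
tau0 = (None, None)

print(run_register_automaton(delta, update, tau0, "q0", {"q2"},
                             [("a", 1), ("a", 2), ("a", 1)]))
```

The sketch assumes that `None` never occurs as a data value, so that an uninitialized register can never match the input.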

Class counting automata [19]

: A class counting automaton (a.k.a. 1-bag CCA) is defined as a $5$ -tuple $M=(Q,\Sigma,\delta,q_{0},F)$ where $Q$ is a finite set of states, $q_{0}\in Q$ is an initial state, and $F\subseteq Q$ is the set of accepting states. A constraint $c$ is a pair $({\sf op},e)$ , where ${\sf op}\in\{<,>,=,\leq,\geq,\neq\}$ and $e\in\mathbb{N}$ . Let $C$ denote a collection of constraints. Let ${\sf Inst}=\{\uparrow^{+},\downarrow\}$ ; an element of ${\sf Inst}\times\mathbb{N}$ is called an operation. The transition relation is $\delta\subseteq(Q\times\Sigma\times C\times{\sf Inst}\times\mathbb{N}\times Q)$ . A bag is a finite set $\beta\subseteq(D\times\mathbb{N})$ . Initially, $\beta(d)$ is set to $0$ for all data values $d\in D$ . When making a transition, a CCA reads an (attribute, data-value) pair $(a,d)$ and checks if $\beta(d)\;{\sf op}\;e$ holds. If it holds, then either (i) $\beta(d)$ is incremented by $m$ if the operation is $(\uparrow^{+},m)$ , or (ii) $\beta(d)$ is reset to $m$ if the operation is $(\downarrow,m)$ , and the automaton goes to the next state. A CCA accepts a data word $w$ if it is in a final state after consuming $w$ .

A $k$ -bag CCA has $k$ bags. For a data value, constraint checking can be done on a subset of the bags. The bags can also be updated or reset independently. The set of transitions of a $k$ -bag CCA is a subset of $(Q\times\Sigma\times C^{k}\times(Inst\times\mathbb{N})^{k}\times Q)$ . We denote by $\beta_{i}$ , the $i^{\text{th}}$ bag. It is shown in [19] that for every $k$ -bag CCA that accepts a language $L$ , there exists a $1$ -bag CCA which accepts the same language $L$ .
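The behaviour of a 1-bag CCA described above can be sketched as follows. The encoding is an assumption of this illustration: the bag is a dictionary with default count $0$ , the instruction $\uparrow^{+}$ is written `"inc"` and $\downarrow$ is written `"reset"`, and the one-state example automaton for $L_{{\sf fd}(a)}$ is hypothetical.

```python
import operator

# Comparison operators allowed in constraints (op, e).
OPS = {"<": operator.lt, ">": operator.gt, "=": operator.eq,
       "<=": operator.le, ">=": operator.ge, "!=": operator.ne}

def cca_accepts(delta, q0, finals, word):
    """Nondeterministic simulation of a 1-bag CCA on a data word."""
    configs = {(q0, frozenset())}                # (state, frozen bag contents)
    for a, d in word:
        nxt = set()
        for q, bag_items in configs:
            bag = dict(bag_items)
            count = bag.get(d, 0)                # beta(d), initially 0
            for (p, b, (op, e), (inst, m), p2) in delta:
                if (p, b) != (q, a) or not OPS[op](count, e):
                    continue                     # wrong source or constraint fails
                new_bag = dict(bag)
                new_bag[d] = count + m if inst == "inc" else m
                nxt.add((p2, frozenset(new_bag.items())))
        configs = nxt
    return any(q in finals for q, _ in configs)

# Hypothetical CCA for L_fd(a): an `a` may only read a value with count 0
# and then increments the count; a `b` leaves the count unchanged.
delta = [("q", "a", ("=", 0), ("inc", 1), "q"),
         ("q", "b", (">=", 0), ("inc", 0), "q")]
print(cca_accepts(delta, "q", {"q"}, [("a", 1), ("b", 1), ("a", 2)]))
```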

Class memory automata [5]

: A class memory automaton is a $6$ -tuple $M=(Q,\Sigma,\delta,q_{0},F_{\ell},F_{g})$ where $Q$ is a finite set of states, $q_{0}\in Q$ is an initial state, and $F_{g}\subseteq F_{\ell}\subseteq Q$ are the sets of global and local accepting states, respectively. The transition relation is $\delta\subseteq(Q\times\Sigma\times(Q\cup\{\bot\})\times Q)$ . The automaton keeps track of the last state where a data value $d$ is encountered. If a data value $d$ is not yet encountered, then it is associated with $\bot$ . Each transition of a CMA is dependent on the current state of the automaton and the state the automaton was in when the data value being read currently was last encountered. A data word $w$ is accepted if the automaton reaches a state $q\in F_{g}$ after reading $w$ and the last states of all the data values encountered in $w$ are in $F_{\ell}$ .
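The role of the local accepting states can be seen in a small simulator. The sketch below rests on two assumptions that are not spelled out above: the state recorded for a data value is the target state of the transition that reads it, and `None` encodes $\bot$ . The example automaton for $L_{\forall{\sf cnt}=2}$ over $\Sigma=\{a\}$ is hypothetical.

```python
def cma_accepts(delta, q0, local_finals, global_finals, word):
    """Nondeterministic simulation of a class memory automaton."""
    configs = {(q0, frozenset())}            # (state, frozen map value -> last state)
    for a, d in word:
        nxt = set()
        for q, mem_items in configs:
            mem = dict(mem_items)
            last = mem.get(d)                # None stands for "not yet encountered"
            for (p, b, s, p2) in delta:
                if (p, b, s) == (q, a, last):
                    new_mem = dict(mem)
                    new_mem[d] = p2          # assumed: remember the target state
                    nxt.add((p2, frozenset(new_mem.items())))
        configs = nxt
    return any(q in global_finals and all(s in local_finals for _, s in mem_items)
               for q, mem_items in configs)

# Hypothetical CMA: s1 = "value seen once", s2 = "value seen twice".
states = ["q0", "s1", "s2"]
delta = ([(q, "a", None, "s1") for q in states] +
         [(q, "a", "s1", "s2") for q in states])
local_finals = {"q0", "s2"}
global_finals = {"q0", "s2"}

print(cma_accepts(delta, "q0", local_finals, global_finals,
                  [("a", 1), ("a", 2), ("a", 1), ("a", 2)]))
```

A value that occurs only once keeps the memory `s1`, which is not locally accepting, so the word is rejected even if the automaton itself ends in a globally accepting state.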

3 Set augmented finite automata

Definition 3.1.

A set augmented finite automaton (SAFA) is defined as a $6$ -tuple $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ where $Q$ is a finite set of states, $\Sigma$ is a finite alphabet, $D$ is a countably infinite set, $q_{0}\in Q$ is the initial state, $F\subseteq Q$ is a set of final states, $H$ is a finite set of finite sets of data values. The transition relation is defined as: $\delta$ $\subseteq Q\times\Sigma\times C\times OP\times Q$ where $C=\{p(h_{i}),!p(h_{i})\ |\ h_{i}\in H\}$ , $h_{i}$ denotes the $i^{th}$ set in $H$ , and $OP=\{-,\ {\sf ins}(h_{i})\ |\ h_{i}\in H\}$ . ∎

We call a SAFA a singleton if $|H|=1$ . The unary Boolean predicate $p(h_{i})$ evaluates to true if the data value currently being read by the automaton is present in the $i^{th}$ set $h_{i}$ . The predicate $!p(h_{i})$ is true if the data value currently being read is not in $h_{i}$ . Further, $OP$ denotes a set of operations that a SAFA can execute on reading a symbol; the operation ${\sf ins}(h_{i})$ inserts the data value currently being read by the automaton into the set $h_{i}$ , while $-$ denotes no such insertion is done. For any combination not in $\delta$ , we assume the transition is absent.

For a SAFA $M$ , we define a configuration $(q,h)\in Q\times 2^{D^{H}}$ as follows: $q\in Q$ is a state of the automaton, and $h=\langle h_{1},\dots,h_{|H|}\rangle$ , where each $h_{i}$ for $i\in[|H|]$ is a finite subset of $D$ , denotes the content of the sets in $H$ . All sets are initially empty. A run $\rho$ of $M$ on an input $w=(a_{1},d_{1})\cdots(a_{|w|},d_{|w|})$ is a sequence $(q_{0},h^{0}),\dots,(q_{|w|},h^{|w|})$ , where $h^{j}=\langle h^{j}_{1},\dots,h^{j}_{|H|}\rangle$ , and $h^{j}_{i}$ for $1\leq i\leq|H|$ is the content of the set $h_{i}$ after reading the prefix $(a_{1},d_{1})\cdots(a_{j},d_{j})$ for $1\leq j\leq|w|$ . A configuration $(q_{j+1},h^{j+1})$ succeeds a configuration $(q_{j},h^{j})$ if there is a transition $(q_{j},a_{j+1},\alpha,{\sf op},q_{j+1})$ where

(i)

for $\alpha=p(h_{i})$ , we have that the data value $d_{j+1}\in{h_{i}}^{j}$ .
(ii)

for $\alpha\ =\ !p(h_{i})$ , we have that the data value $d_{j+1}\notin{h_{i}}^{j}$ .

The execution of the operation ${\sf op}\in OP$ takes the content of the sets of data values from $h^{j}$ to $h^{j+1}$ . If ${\sf op}$ is $-$ , then $h^{j+1}=h^{j}$ . If ${\sf op}$ is ${\sf ins}(h_{i})$ , then ${h_{l}}^{j+1}={h_{l}}^{j}$ for all $h_{l}\in H\setminus\{h_{i}\}$ , and ${h_{i}}^{j+1}={h_{i}}^{j}\cup\{d_{j+1}\}$ . If the run consumes the whole word $w$ and $q_{|w|}\in F$ , then the run is accepting; otherwise, it is rejecting. A word $w$ is accepted by $M$ if it has an accepting run. The language $L(M)$ accepted by $M$ consists of all words accepted by $M$ . We denote by $|\rho|$ the length of the run, which equals the number of transitions taken. Note that for a run $\rho$ on an input word $w$ , we have $|w|=|\rho|$ .
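The run semantics can be sketched as a membership test that tracks all reachable configurations. The encoding is an assumption of this illustration: sets are indexed from $0$ and start empty, a condition is a pair `(present, i)` with `present=True` for $p(h_{i})$ and `False` for $!p(h_{i})$ , and an operation is `None` for $-$ or the index of the set to insert into. The single-set automaton for $L_{a\exists b}$ is a hypothetical example.

```python
def safa_accepts(delta, q0, finals, num_sets, word):
    """Return True if the SAFA has an accepting run on `word`."""
    start = (q0, tuple(frozenset() for _ in range(num_sets)))
    configs = {start}                         # reachable configurations (q, h)
    for a, d in word:
        nxt = set()
        for q, sets in configs:
            for (p, b, (present, i), op, p2) in delta:
                if (p, b) != (q, a):
                    continue
                if (d in sets[i]) != present:     # p(h_i) or !p(h_i) fails
                    continue
                new_sets = sets
                if op is not None:                # ins(h_op): add d to set op
                    new_sets = sets[:op] + (sets[op] | {d},) + sets[op + 1:]
                nxt.add((p2, new_sets))
        configs = nxt
    return any(q in finals for q, _ in configs)

# Hypothetical singleton SAFA for L_{a exists b}: values read with b are stored
# in set 0, and a may only read values that are already in set 0.
delta = [("q0", "b", (False, 0), 0, "q0"),
         ("q0", "b", (True, 0), None, "q0"),
         ("q0", "a", (True, 0), None, "q0")]
print(safa_accepts(delta, "q0", {"q0"}, 1, [("b", 1), ("a", 1)]))
```

Since the number of reachable configurations can grow with the nondeterministic choices, this direct simulation is meant for small examples only.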

Definition 3.2.

A SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ is deterministic (DSAFA) if for every $q\in Q$ and $a\in\Sigma$ , if there is a transition $(q,a,\alpha,{\sf op},q^{\prime})$ , where $q^{\prime}\in Q$ , ${\sf op}\in OP$ , $\alpha\in\{p(h_{i}),!p(h_{i})\}$ , $h_{i}\in H$ , then there cannot be any transition of the form $(q,a,p(h_{l}),{\sf op}^{\prime},q^{\prime\prime})$ , $(q,a,!p(h_{l}),{\sf op}^{\prime},q^{\prime\prime})$ , where $q^{\prime\prime}\in Q$ , $h_{l}\neq h_{i}$ , $h_{l}\in H$ , ${\sf op}^{\prime}\in OP$ . The only other allowed transition can be $(q,a,\alpha^{\prime},{\sf op}^{\prime},q^{\prime\prime})$ for $\alpha^{\prime}\in\{p(h_{i}),!p(h_{i})\}$ , $\alpha^{\prime}\neq\alpha$ , and ${\sf op}^{\prime}\in OP$ . ∎
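The condition of Definition 3.2 is syntactic and can be checked directly on the transition relation. The sketch below assumes a hypothetical encoding: a transition is a tuple `(q, a, (present, i), op, q2)`, where `(True, i)` stands for $p(h_{i})$ and `(False, i)` for $!p(h_{i})$ , and `delta` lists every transition once.

```python
from collections import defaultdict

def is_deterministic(delta):
    """Check the DSAFA condition for a list of distinct transitions."""
    conditions = defaultdict(list)            # (state, attribute) -> conditions used
    for q, a, (present, i), _op, _q2 in delta:
        conditions[(q, a)].append((present, i))
    for conds in conditions.values():
        if len({i for _, i in conds}) > 1:    # tests on two different sets
            return False
        if len(conds) != len(set(conds)):     # the same condition used twice
            return False
    return True

# Transitions on state q0 that only ever test set 0, once per condition.
fd = [("q0", "b", (True, 0), None, "q0"),
      ("q0", "b", (False, 0), None, "q0"),
      ("q0", "a", (True, 0), None, "q1"),
      ("q0", "a", (False, 0), 0, "q0")]
print(is_deterministic(fd))
print(is_deterministic(fd + [("q0", "a", (False, 1), 1, "q0")]))
```

The second call adds a transition from $q_{0}$ on $a$ that tests a different set, which the definition forbids.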

Let ${\mathcal{L}_{\sf SAFA}}$ and ${\mathcal{L}_{\sf DSAFA}}$ denote the set of all languages accepted by nondeterministic SAFA and deterministic SAFA respectively. We illustrate the SAFA model with some instances of data languages.

Figure 1: SAFA for $L_{{\sf fd}(a)}$

Example 3.3.

The language $L_{{\sf fd}(a)}$ can be accepted by the DSAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ in Figure 1. Here, $Q=\{q_{0},q_{1}\}$ , $\Sigma=\{a,b\}$ , $D$ is any countably infinite set, $F=\{q_{0}\}$ , $H=\{h_{1}\}$ , the transition relation $\delta$ consists of the transitions shown in Figure 1. The automaton $M$ works as follows. The set $h_{1}$ is used to store the data values encountered in the input word associated with $a$ . At $q_{0}$ , if $M$ reads $b$ , it remains in $q_{0}$ without modifying $H$ . At $q_{0}$ when the automaton reads $a$ , it checks whether the corresponding data value is present in $h_{1}$ . If present, it indicates it has already encountered this data value with attribute $a$ before; therefore the automaton goes to $q_{1}$ which is a dead state and the input word is rejected. If the data value is not present in $h_{1}$ , it implies that it has not encountered this value with $a$ , thus it remains in $q_{0}$ and inserts the data value into $h_{1}$ . Only if the automaton encounters a duplicate data value, it goes to $q_{1}$ . If it does not encounter duplicate data values with respect to $a$ in the input, the automaton remains in $q_{0}$ after consuming the entire word and it is accepted. ∎
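The behaviour described in Example 3.3 can be traced with a table-driven sketch. The table below is reconstructed from the prose only; in particular, the self-loops on the dead state $q_{1}$ are an assumption, since Figure 1 is not reproduced here, and the function name is hypothetical.

```python
# (state, attribute, "is the value already in h1?") -> (insert into h1?, next state)
DELTA = {
    ("q0", "b", True):  (False, "q0"),   # b never modifies h1
    ("q0", "b", False): (False, "q0"),
    ("q0", "a", True):  (False, "q1"),   # duplicate value with a: go to dead state
    ("q0", "a", False): (True,  "q0"),   # fresh value with a: remember it
    # Assumed self-loops on the dead state q1.
    ("q1", "a", True):  (False, "q1"),
    ("q1", "a", False): (False, "q1"),
    ("q1", "b", True):  (False, "q1"),
    ("q1", "b", False): (False, "q1"),
}
FINAL = {"q0"}

def accepts_fd_a(word):
    """Run the deterministic automaton on a list of (attribute, value) pairs."""
    state, h1 = "q0", set()
    for attr, d in word:
        step = DELTA.get((state, attr, d in h1))
        if step is None:                 # no transition: the run stops
            return False
        insert, state = step
        if insert:
            h1.add(d)
    return state in FINAL

print(accepts_fd_a([("a", 1), ("b", 1), ("a", 2)]))
print(accepts_fd_a([("a", 1), ("b", 3), ("a", 1)]))
```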

Example 3.4.

Consider the data language $L_{\exists{\sf cnt}\neq 2}$ over the alphabet $\Sigma=\{a\}$ . A nonempty word $w$ is in the language if there exists a data value that appears $n$ times in $w$ with $n\neq 2$ . This language can be accepted by the nondeterministic SAFA in Figure 2. At state $q_{0}$ , the automaton nondeterministically guesses the data value that does not appear exactly twice, stores it in set $h_{2}$ , and goes to state $q_{1}$ . The automaton remains in state $q_{1}$ if the count of the data value is $1$ , or moves to state $q_{3}$ (via $q_{2}$ ) and remains there if the count of the data value is greater than $2$ . In both cases, it accepts the input word if it can be consumed entirely. If the guess is incorrect, the data value appears exactly twice, and the automaton is in the nonaccepting state $q_{2}$ after consuming the input word. Thus, if all data values of a data word appear exactly twice, then all the runs end in $q_{2}$ and the input is rejected. ∎

Figure 2: SAFA for $L_{\exists{\sf cnt}\neq 2}$

The number of sets in the SAFA model impacts the language accepting capacity. The following theorem establishes a hierarchy of languages accepted by SAFA based on the size of $H$ .

Theorem 3.5.

No SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ with $\Sigma=\{a_{1},\dots,a_{k+1}\}$ , $|H|=k$ can accept the language $L=L_{{\sf fd}(a_{1})}\cap\dots\cap L_{{\sf fd}(a_{k+1})}\cap L_{{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}\cdots a_{k+1}^{*}}$ .¹

¹ The language $L^{\prime}=L_{{\sf fd}(a_{1})}\cap\dots\cap L_{{\sf fd}(a_{k+1})}$ could have also been considered, but the proof is relatively simpler if we instead consider $L$ .

Proof 3.6.

Assume that there exists a SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ with $|H|=k$ , $\Sigma=\{a_{1},a_{2},\dots,a_{k+1}\}$ and $|Q|=n$ states which accepts $L=L_{{\sf fd}(a_{1})}\cap\dots\cap L_{{\sf fd}(a_{k+1})}\cap L_{{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}a_{2}^{*}\cdots a_{k+1}^{*}}$ . Since the automaton $M$ accepts $L$ and $L$ contains words which are longer than $n$ , the automaton $M$ must have at least one cycle of transitions in its structure. Thus, the automaton accepts a word $w=xyz\in L$ where $x$ is the prefix of $w$ consumed before entering the cycle, $y$ is the infix of $w$ that is consumed in the cycle with $|y|>1$ , and $z$ is the suffix of $w$ that is consumed after exiting the cycle. The sequences of transitions that consume $x$ and $z$ may or may not themselves contain cycles. Let us focus on the cycle $T_{c}$ of transitions that consumes $y$ .

•

CASE 1: $T_{c}$ contains a transition $t$ with $p(h_{i})$ . The (attribute, data value) pair consumed by $t$ in $y$ can be consumed again in the next execution of $T_{c}$ . The new word accepted by $M$ will then not be in $L$ . Therefore, $T_{c}$ cannot have any transition with $p(h_{i})$ .
•

CASE 2: Now suppose $T_{c}$ has a transition $t$ of the form $(q_{i},a,!p(h_{i}),-,q_{j})$ or $(q_{i},a,!p(h_{i}),{\sf ins}(h_{j}),q_{j})$ , $q_{i},q_{j}\in Q$ , $a\in\Sigma,h_{i},h_{j}\in H,i\neq j$ . When $T_{c}$ is executed for the first time, let the transition $t$ consume a data value which $M$ has not consumed before and $M$ will not consume in the next execution of $T_{c}$ except when it is executing $t$ . Since the number of transitions in $T_{c}$ is finite and $D$ is countably infinite, it is always possible to find such a value. When $T_{c}$ is executed again and $t$ is being executed, the same (attribute, data-value) pair can be consumed by it as the data value has not been inserted in $h_{i}$ , when $t$ executed before. Therefore, the new word which $M$ accepts cannot be in $L$ .
•

CASE 3: Since ${{\sf proj}_{\Sigma}}(L)=a_{1}^{*}\cdots a_{k+1}^{*}$ , the cycle of transitions cannot be defined on two different $a_{i},a_{j}\in\Sigma$ . This is because if $T_{c}$ is executed again, then the projection of the new word on $\Sigma$ will no longer be in ${{\sf proj}_{\Sigma}}(L)$ .

From Cases 1 to 3, we conclude that for $M$ to accept $L$ , $M$ must have at least one cycle for each $a_{i}\in\Sigma$ and the sequence of transitions in the cycle for $a_{i}$ must be of the form $(q_{i},a_{i},!p(h_{j}),{\sf ins}(h_{j}),q_{j})$ , $q_{i},q_{j}\in Q$ , $h_{j}\in H$ , $a_{i}\in\Sigma$ .

As $|\Sigma|>|H|$ , by the pigeonhole principle there will be two attributes $a_{i},a_{j}$ whose cycles insert values into the same set $h_{k}$ . Now, suppose $a_{i}$ inserts a data value $d_{i}$ in set $h_{k}$ . Consider the data word $w=(b_{1},d_{1})\cdots(b_{i},d_{i})\cdots(b_{j},d_{j}=d_{i})\cdots(b_{|w|},d_{|w|})$ where all the positions have unique data values except at $d_{i}$ and $d_{j}$ , $b_{1},\dots,b_{|w|}\in\Sigma$ , and ${{\sf proj}_{\Sigma}}(w)$ is of the form $a_{1}^{*}a_{2}^{*}\cdots a_{k+1}^{*}$ . The string $w$ is a valid data word in $L$ . But the automaton $M$ will reject such a data word $w$ : when executing the transition $(a_{j},!p(h_{k}),{\sf ins}(h_{k}))$ in the cycle for $a_{j}$ , it fails on reading $(a_{j},d_{i})$ , as $d_{i}$ was already stored in $h_{k}$ when the data element $(a_{i},d_{i})$ was consumed. It can be argued that the data word $w$ can be accepted by some other cycle involving $a_{j}$ , but increasing the number of cycles will again result in $a_{j}$ storing its data in the same set $h_{\ell}\in H$ as some other $a_{k}\in\Sigma$ . Thus, the above mentioned problem will persist. Therefore, it is not possible to construct a SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ with $|H|=k$ , $\Sigma=\{a_{1},\dots,a_{k+1}\}$ which accepts the language $L=L_{{\sf fd}(a_{1})}\cap\dots\cap L_{{\sf fd}(a_{k+1})}\cap L_{{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}a_{2}^{*}\cdots a_{k+1}^{*}}$ .

Let ${\mathcal{L}_{\sf SAFA}}_{(|H|=k)}$ be the set of all languages accepted by SAFA with $|H|=k$ . Since every SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ with $|H|=k$ can be simulated by a SAFA $M^{\prime}=(Q^{\prime},\Sigma\times D,q_{0},F^{\prime},H^{\prime},\delta^{\prime})$ with $|H^{\prime}|=\ell$ and $\ell>k$ by using $\ell-k$ dummy sets that are never used in an execution of $M^{\prime}$ , we have the following.

Corollary 3.7.

${\mathcal{L}_{\sf SAFA}}_{(|H|=k)}\subsetneq{\mathcal{L}_{\sf SAFA}}_{(|H|=k+1)}$ .

Corollary 3.7 shows that there is a strict hierarchy in terms of accepting capabilities of SAFA with respect to $|H|$ .

4 Decision problems and closure properties

We study the nonemptiness and membership problems and the closure properties of SAFA.

4.1 Nonemptiness and membership

We study the nonemptiness and the membership problems of SAFA and show that both are ${\sf NP}$ -complete. Given a SAFA $M$ and an input word $w$ , the membership problem is to check if $w\in L(M)$ . Given a SAFA $M$ , the nonemptiness problem is to check if $L(M)\neq\emptyset$ . We start with the nonemptiness problem. To show the NP-membership, we first show that there exists a small run if the language accepted by a given SAFA is nonempty.

Lemma 4.1.

Every SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ with $L(M)\neq\emptyset$ has a data word in $L(M)$ with an accepting run $\rho$ such that $|\rho|\leq|Q|\cdot(|H|+2)-1$ .

Proof 4.2.

We prove by contradiction. Assume that $L(M)\neq\emptyset$ and that for every $w\in L(M)$ , for all accepting runs $\rho$ of $w$ , we have that $|\rho|>|Q|\cdot(|H|+2)-1=|Q|\cdot(|H|+1)+|Q|-1$ . We define an indicator function $I_{H}$ which maps $H$ to $\{0,1\}^{H}$ , where $0$ corresponding to a set $h\in H$ denotes that $h$ is empty while $1$ denotes that $h$ is nonempty. Since $|\rho|\geq|Q|\cdot(|H|+2)$ , the run $\rho$ can be divided into $|H|+2$ segments, each of length $|Q|$ , that is, each segment is an infix over $|Q|$ transitions. By the pigeonhole principle, in each segment, there exists a state $q\in Q$ that is visited more than once. Further, since there are $|H|+2$ such segments, again by the pigeonhole principle, there exists a segment and a state $q^{\prime}\in Q$ such that $q^{\prime}$ is visited more than once in this segment and $I_{H}$ does not change over the infix of the run between the two successive visits of $q^{\prime}$ (each set can change from empty to nonempty at most once, so $I_{H}$ changes at most $|H|$ times along $\rho$ ). We now note that the sequence of transitions reading this infix $y$ makes a loop over $q^{\prime}$ , and thus $y$ can be removed from $w$ , resulting in a word $w^{\prime}$ with a corresponding run $\rho^{\prime}$ such that $|\rho^{\prime}|<|\rho|$ and $\rho^{\prime}$ is accepting. Now there can be two cases if the suffix of $\rho$ that follows the reading of $y$ in $w$ has a transition $t$ with $p(h_{i})$ that reads a data value $d$ .

•

It may happen that the data value $d$ was inserted along the infix $y$ . Since $I_{H}$ does not change while reading the infix $y$ , the set $h_{i}$ was nonempty even before the infix $y$ was read. Suppose that a data value $d^{\prime}$ was inserted into $h_{i}$ while reading the prefix before $y$ . Then $w^{\prime}$ may be modified to $w^{\prime\prime}$ so that the suffix following $y$ in $w^{\prime\prime}$ reads the data value $d^{\prime}$ instead of $d$ . Let the run corresponding to $w^{\prime\prime}$ be $\rho^{\prime\prime}$ , which follows the same sequence of states as $\rho^{\prime}$ . Note that $|\rho^{\prime\prime}|=|\rho^{\prime}|<|\rho|$ , and that $\rho^{\prime\prime}$ is an accepting run.
•

If while reading $w$ , the transition $t$ reads a data value $d$ that was inserted while reading a prefix appearing before $y$ , then $w^{\prime}$ does not need to be modified, and we thus have the accepting run $\rho^{\prime}$ .

Since $\rho$ is an arbitrary accepting run of length $|Q|\cdot(|H|+2)$ or more, starting from $\rho$ , we can remove infixes repeatedly and modify it as mentioned above if needed until we reach an accepting run of length strictly smaller than $|Q|\cdot(|H|+2)$ without affecting acceptance, and hence the contradiction.

Using Lemma 4.1 we get the following.

Lemma 4.3.

Nonemptiness problem for SAFA is in NP.

Proof 4.4.

Consider a SAFA $M$ with $L(M)\neq\emptyset$ . By Lemma 4.1, a Turing machine can nondeterministically guess an accepting run of polynomial length, hence the result.

We now show that the nonemptiness problem is NP-hard even for deterministic acyclic SAFA over an alphabet of size $3$ . Example 4.5 describes our construction.

Example 4.5.

For a 3CNF formula $\phi=(x\vee\overline{y}\vee z)\wedge(x\vee y\vee z)$ , the corresponding SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ is shown in Figure 3. We denote by $A(a,i)$ the transition $(a,!p(h_{i}),{\sf ins}(h_{i}))$ and by $T(a,i)$ the transition $(a,p(h_{i}),-)$ with $Q=\{q_{0},q_{x},q_{y},q_{z},q_{c_{1}},q_{c_{2}}\}$ , $\Sigma=\{a_{1},a_{2},a_{3}\}$ , $D=\mathbb{N}$ , $H=\{h_{x},h_{\overline{x}},h_{y},h_{\overline{y}},h_{z},h_{\overline{z}}\}$ , $F=\{q_{c_{2}}\}$ . In particular, if there are $\ell$ variables in the formula, then $|H|=2\ell$ . ∎

Figure 3: The SAFA $M$ corresponding to a 3CNF formula $\phi$

Lemma 4.6.

The nonemptiness problem is NP-hard for deterministic acyclic SAFA over an alphabet of size $3$ .

From Lemma 4.3 and Lemma 4.6, we have the following.

Theorem 4.7.

The nonemptiness problem for SAFA is NP-complete.

We now show that for singleton SAFA, nonemptiness is NL-complete. Towards this, we first show the following lemma.

Lemma 4.8.

The nonemptiness problem for singleton SAFA is reducible to the nonemptiness of a nondeterministic finite automaton (NFA) in PTIME.

Proof 4.9.

We begin with the following observations on SAFA transitions for a run on an input word. The idea is to see how we can construct a word accepted by a given SAFA that takes it from the initial state to a final state following the transition rules.

•

Transitions with $!p(h_{1})$ can always be satisfied since we have an infinite number of data values. We can always introduce a new data value with an attribute so that it is not in the set $h_{1}$ . However, transitions with $p(h_{1})$ should only be executed if there exists an ${\sf ins}(h_{1})$ somewhere earlier on the path before reaching the transition with $p(h_{1})$ .
•

Given a SAFA, it is not enough to look only for simple paths from the initial state to a final state satisfying the observations stated above (see Figure 4). In Figure 4, the only simple path is $q_{0}\rightarrow q_{f}$ with transition $(a,p(h_{1}),-)$ . Since the simple path does not contain any transition having ${\sf ins}(h_{1})$ prior to the transition containing $p(h_{1})$ , no word is accepted by the automaton along this simple path. However, the automaton accepts the data word $(a,d_{1})(a,d_{1})$ . Thus, when checking for emptiness of a SAFA, we cannot restrict our analysis to simple paths.
•

The language accepted by a SAFA is nonempty iff there exists a sequence of transitions that takes it from an initial state to a final state with the added condition that if the sequence of transitions contains a $p(h_{i})$ , then there must exist a corresponding ${\sf ins}(h_{i})$ prior to the $p(h_{i})$ in the sequence of transitions.

Figure 4: A simple SAFA $M$

Given a singleton SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ , $H=\{h_{1}\}$ , we check for nonemptiness by constructing three nondeterministic finite automata (NFA). The first NFA $M_{1}$ accepts all those sequences of transitions which can take the SAFA $M$ from its initial state to any of its final states. However, $M_{1}$ does not check if the sequence of transitions on the path from the initial state to a final state of $M$ is valid (i.e. Condition 1 above). The second NFA $M_{2}$ accepts all possible sequences of transitions that have ${\sf ins}(h_{1})$ prior to encountering a $p(h_{1})$ . We construct the synchronous product [2] of $M_{1}$ and $M_{2}$ that gives us another NFA $M_{3}$ . We check for emptiness of $M_{3}$ . If $M_{3}$ is not empty (final state is reachable), we can conclude that there exists at least one sequence of transitions that takes the SAFA $M$ from the initial state to a final state and every $p(h_{1})$ encountered on that sequence of transitions has an ${\sf ins}(h_{1})$ prior to it. Therefore, the SAFA $M$ is not empty. If $M_{3}$ is empty, it indicates that there exists no such sequence of transitions. Thus, if $M_{3}$ is empty, we can conclude that the SAFA $M$ is empty as well and there is no data word that is accepted by the SAFA.

Given a SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ , $H=\{h_{1}\}$ we construct NFA $M_{1}=(Q,\Sigma^{\prime},q_{0},F,\delta^{\prime})$ as below:

•

$\Sigma^{\prime}=\{(a,b,c)\}$ where $a\in\Sigma$ , $b\in\{p(h_{1}),!p(h_{1})\}$ , and $c\in\{{\sf ins}(h_{1}),-\}$ . The alphabet $\Sigma^{\prime}$ contains all transitions that $M$ can have.
•

$\delta^{\prime}=\delta$ with the triplets in $\delta$ considered as elements of $\Sigma^{\prime}$ .

Figure 5: The NFA $M_{1}$ corresponding to the SAFA $M$ in Figure 4

Figure 6: The NFA $M_{2}$ corresponding to the SAFA in Figure 4

Figure 7: NFA $M_{3}$ : Synchronous product of NFA $M_{1}$ and NFA $M_{2}$ for SAFA in Figure 4.

From the construction of $M_{1}$ , we see that it accepts all those possible sequences of transitions that may take $M$ from its initial state to any of its final states. We construct NFA $M_{2}=(Q^{\prime\prime},\Sigma^{\prime},q_{0},Q^{\prime\prime},\delta^{\prime\prime})$ as below:

•

$Q^{\prime\prime}=\{q_{0},q_{1}\}$

The transitions in $\delta^{\prime\prime}$ are as follows:

•

$\delta^{\prime\prime}(q_{0},(a,!p(h_{1}),-))=\{q_{0}\}$ , $a\in\Sigma$
•

$\delta^{\prime\prime}(q_{0},(a,!p(h_{1}),{\sf ins}(h_{1})))=\{q_{1}\}$ , $a\in\Sigma$
•

$\delta^{\prime\prime}(q_{1},(a,!p(h_{1}),{\sf ins}(h_{1})))=\{q_{1}\}$ , $a\in\Sigma$
•

$\delta^{\prime\prime}(q_{1},(a,p(h_{1}),{\sf ins}(h_{1})))=\{q_{1}\}$ , $a\in\Sigma$
•

$\delta^{\prime\prime}(q_{1},(a,p(h_{1}),-))=\{q_{1}\}$ , $a\in\Sigma$
•

$\delta^{\prime\prime}(q_{1},(a,!p(h_{1}),-))=\{q_{1}\}$ , $a\in\Sigma$

The automaton $M_{2}$ works as follows. State $q_{0}$ denotes that we have not yet come across ${\sf ins}(h_{1})$ , and state $q_{1}$ denotes that we have seen an ${\sf ins}(h_{1})$ . For inputs of the form $(x,!p(h_{1}),-)$ where $x\in\Sigma$ , we remain in state $q_{0}$ . If we come across inputs of the form $(x,!p(h_{1}),{\sf ins}(h_{1}))$ where $x\in\Sigma$ , we move to state $q_{1}$ from $q_{0}$ . No transition is defined from $q_{0}$ on inputs containing $p(h_{1})$ , so a sequence that uses $p(h_{1})$ before any ${\sf ins}(h_{1})$ is rejected. At state $q_{1}$ , the automaton remains in state $q_{1}$ for every element $\mu\in\Sigma^{\prime}.$

By construction, $M_{2}$ accepts exactly those sequences of transitions of $M$ in which every transition containing $p(h_{1})$ is preceded by at least one transition containing ${\sf ins}(h_{1})$ . The NFA $M_{3}$ is the synchronous product of NFA $M_{1}$ and NFA $M_{2}$ . Therefore NFA $M_{3}$ accepts exactly those sequences which take the SAFA $M$ from its initial state to a final state and in which every $p(h_{1})$ is preceded by an ${\sf ins}(h_{1})$ . Thus, the language accepted by $M_{3}$ is empty if and only if the language accepted by the SAFA $M$ is empty. The automaton $M_{3}$ is nonempty iff there exists a path from its initial state to one of its final states, which can be checked using a standard depth-first search (DFS).

The time complexity of the emptiness check is polynomial in the size of $M_{3}$ , which depends on the sizes of $M_{1}$ and $M_{2}$ . The size of an NFA $M=(Q,\Sigma,q_{0},F,\delta)$ is defined as $|M|=|Q|+\sum_{q\in Q,a\in\Sigma}|\delta(q,a)|$ . The number of states in $M_{1}$ is $|Q|$ , the same as in $M$ . The number of states in $M_{2}$ is $2$ , the number of states in $M_{3}$ is at most $2|Q|$ , and the number of edges in $M_{3}$ is at most $4|\Sigma|\times|\delta|\times 2|Q|$ , which is polynomial in the input size.
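The product construction can be sketched in Python as a reachability search over pairs consisting of a state of $M$ and a flag playing the role of the states $q_{0}$ and $q_{1}$ of $M_{2}$. The tuple encoding of transitions and the function name are illustrative assumptions. The self-loop on $q_{0}$ in the sample automaton is also an assumption, inferred from the discussion of Figure 4 and not read off the figure itself.

```python
from collections import deque

def singleton_safa_nonempty(initial, finals, transitions):
    """Decide nonemptiness of a singleton SAFA (single set h1).

    transitions: list of (src, letter, cond, op, dst) with
    cond in {"p", "!p"} and op in {"ins", "-"} (hypothetical encoding).
    """
    # Product state: (state of M, has some ins(h1) been taken so far?).
    start = (initial, False)
    seen = {start}
    queue = deque([start])
    while queue:
        state, inserted = queue.popleft()
        if state in finals:
            return True
        for src, _letter, cond, op, dst in transitions:
            if src != state:
                continue
            # M2 has no move on p(h1) before the first ins(h1).
            if cond == "p" and not inserted:
                continue
            nxt = (dst, inserted or op == "ins")
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Assumed shape of the SAFA of Figure 4: an inserting self-loop on q0
# and a membership test leading to the final state qf.
fig4 = [("q0", "a", "!p", "ins", "q0"), ("q0", "a", "p", "-", "qf")]
```

The search visits at most $2|Q|$ product states, matching the bound on the number of states of $M_{3}$.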

Example 4.10.

The NFA $M_{1}$ , $M_{2}$ , $M_{3}$ corresponding to SAFA $M$ (Figure 4) are shown in Figures 5, 6, and 7 respectively. We observe that in NFA $M_{3}$ , there exists a path from the initial to the final state. NFA $M_{3}$ is not empty, therefore the SAFA $M$ corresponding to Figure 4 is also not empty, which is true. ∎

Theorem 4.11.

The nonemptiness problem for singleton SAFA is ${\sf NL}$ -complete.

Proof 4.12.

We first discuss ${\sf NL}$ -membership. By Lemma 4.8, the nonemptiness for singleton SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ is in $\sf PTIME$ by reducing the problem to checking the nonemptiness of a nondeterministic finite automaton (NFA) with $2|Q|$ states. This NFA can be constructed on-the-fly leading to an ${\sf NL}$ -membership of the nonemptiness problem of singleton SAFA.

For ${\sf NL}$ -hardness, we show a reduction from the reachability problem on directed graphs, which is known to be ${\sf NL}$ -complete [15]. Let $G$ be a directed graph with vertex set $V=\{1,2,\dots,n\}$ , and suppose we are asked whether vertex $n$ is reachable from vertex $1$ . We define a SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ where $Q=V$ , $\Sigma=\{a\}$ , $D$ is a countably infinite set, $q_{0}=1$ , $F=\{n\}$ , $H=\{h_{1}\}$ i.e. $|H|=1$ . The transitions in $\delta$ are as follows: $(i,a,!p(h_{1}),-,j)\in\delta$ if $(i,j)$ is an edge in $G$ . It is easy to see that $M$ can be constructed from $G$ in logspace and that $L(M)\neq\emptyset$ iff there is a path from vertex $1$ to vertex $n$ in $G$ . Hence, the result.
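The reduction can be illustrated by a short Python sketch. The dictionary layout and the function names are hypothetical. Since every transition is guarded by $!p(h_{1})$ and fresh data values are always available, nonemptiness of the constructed SAFA coincides with plain graph reachability, which is what the second function checks.

```python
def graph_to_singleton_safa(n, edges):
    """Build the SAFA of the reduction from a digraph on vertices 1..n."""
    return {
        "states": set(range(1, n + 1)),
        "initial": 1,
        "finals": {n},
        # One transition (i, a, !p(h1), -, j) per edge (i, j).
        "transitions": [(i, "a", "!p", "-", j) for (i, j) in edges],
    }

def safa_nonempty(safa):
    """Reachability of a final state; valid here because no transition
    uses p(h1), so every path can be realised with fresh data values."""
    stack = [safa["initial"]]
    seen = {safa["initial"]}
    while stack:
        q = stack.pop()
        if q in safa["finals"]:
            return True
        for src, _a, _cond, _op, dst in safa["transitions"]:
            if src == q and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return False
```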

We now show that membership for SAFA is NP-complete. We first note that unlike the nonemptiness problem, for DSAFA, membership can be decided in ${\sf PTIME}$ by reading the input word and by checking if a final state is reached.
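The polynomial-time membership check for DSAFA amounts to a direct simulation. The sketch below assumes the semantics used throughout this section: $p(h_{i})$ holds when the current data value is in $h_{i}$, $!p(h_{i})$ holds when it is not, and ${\sf ins}(h_{j})$ adds the value to $h_{j}$ after the test. The six-field tuple encoding and the sample automaton are illustrative assumptions.

```python
def dsafa_accepts(initial, finals, transitions, word):
    """Simulate a deterministic SAFA on a data word.

    transitions: list of (src, letter, cond, h, ins, dst), where cond is
    "p" or "!p", h is the tested set, ins is a set name or None.
    word: list of (letter, data_value) pairs.
    """
    state = initial
    sets = {}  # set name -> data values inserted so far
    for letter, value in word:
        enabled = [
            t for t in transitions
            if t[0] == state and t[1] == letter
            and (value in sets.get(t[3], set())) == (t[2] == "p")
        ]
        if not enabled:
            return False  # the run is stuck
        if len(enabled) > 1:
            raise ValueError("automaton is not deterministic")
        _src, _a, _cond, _h, ins, dst = enabled[0]
        if ins is not None:
            sets.setdefault(ins, set()).add(value)
        state = dst
    return state in finals

# Hypothetical DSAFA accepting words (a,d)(a,d) for any data value d.
pair = [("q0", "a", "!p", "h1", "h1", "q1"),
        ("q1", "a", "p", "h1", None, "q2")]
```

Each input position is processed with one pass over the transition list, so the running time is polynomial in $|w|$ and the size of the automaton.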

Theorem 4.13.

The membership problem for SAFA is NP-complete.

Proof 4.14.

Given a SAFA $M$ and an input word $w$ , if $w\in L(M)$ , then a nondeterministic Turing machine can guess an accepting run in polynomial time and hence the membership problem is in NP.

To show NP-hardness, we reduce from 3SAT, as was done for the nonemptiness problem in Lemma 4.6. Instead of the deterministic automaton constructed in the proof of Lemma 4.6, we construct a nondeterministic SAFA $M$ with $\Sigma=\{a\}$ in which all the transitions are labelled with the same letter $a\in\Sigma$ . Everything else remains the same as in the construction in Lemma 4.6. Note that if the 3SAT formula $\psi$ has $\ell$ variables and $k$ clauses, then there is a path of length $\ell+k$ from the initial state to the unique final state of $M$ . We consider the input word $w=(a,d)\cdots(a,d)$ with $|w|=\ell+k$ , that is, a word in which all the attribute, data-value pairs are identical. It is not difficult to see that $w\in L(M)$ iff $\psi$ is satisfiable.
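The reduction can be checked on small formulas with the following Python sketch. The layout of the automaton is an assumption based on the transition shapes $A(a,i)$ and $T(a,i)$ of Example 4.5, since the figure itself is not reproduced here: one step per variable inserts $d$ into the set of the chosen literal, and one step per clause tests membership of $d$ in the set of some literal of that clause. Clauses are given in DIMACS style, and all function names are hypothetical.

```python
from itertools import product

def sat_to_safa(num_vars, clauses):
    """Return (transitions, final_state); state 0 is initial.
    A transition is (src, cond, tested_set, inserted_set, dst)."""
    transitions = []
    for i in range(1, num_vars + 1):
        for lit in (i, -i):  # guess the value of variable i
            transitions.append((i - 1, "!p", lit, lit, i))
    for j, clause in enumerate(clauses):
        for lit in clause:   # pick a literal satisfying clause j
            transitions.append((num_vars + j, "p", lit, None, num_vars + j + 1))
    return transitions, num_vars + len(clauses)

def accepts_constant_word(transitions, final, length):
    """Run on (a,d)^length. A configuration is (state, sets holding d)."""
    configs = {(0, frozenset())}
    for _ in range(length):
        nxt = set()
        for state, holding in configs:
            for src, cond, h, ins, dst in transitions:
                if src != state or (h in holding) != (cond == "p"):
                    continue
                nxt.add((dst, holding | {ins} if ins is not None else holding))
        configs = nxt
    return any(state == final for state, _ in configs)

def brute_force_sat(num_vars, clauses):
    """Reference check by enumerating all assignments."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False
```

The simulation tracks all reachable configurations, whose number may grow exponentially in the number of variables. This is consistent with the NP-hardness of the problem and is not meant as an efficient procedure.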

Finally, we show that given a SAFA $M$ defined on $\Sigma\times D$ , whether $L(M)=(\Sigma\times D)^{*}$ (universality problem) is undecidable.

Theorem 4.15.

The universality problem for SAFA is undecidable.

Proof 4.16.

The proof is similar to showing that the universality problem for $k$ -register automata is undecidable [23]. We reduce the Post Correspondence Problem (PCP), which is known to be undecidable, to the universality problem for SAFA. A PCP instance consists of two lists of equal length, say $n$ . The items of the lists are finite strings defined on an alphabet $\Sigma^{\prime}$ where $|\Sigma^{\prime}|\geq 2$ . Without loss of generality, we can assume $\Sigma^{\prime}=\{a,b\}$ . List 1 consists of the strings $x_{1},\dots,x_{n}$ , and list 2 consists of the strings $y_{1},\dots,y_{n}$ where $x_{1},\dots,x_{n},y_{1},\dots,y_{n}\in{\Sigma^{\prime}}^{*}$ . The PCP instance has a solution if there exists a sequence $\alpha_{1},\dots,\alpha_{m}$ where $\alpha_{1},\dots,\alpha_{m}\in[n]$ such that $x_{\alpha_{1}}\cdots x_{\alpha_{m}}=y_{\alpha_{1}}\cdots y_{\alpha_{m}}$ , and has no solution otherwise (see Example 2.1). We reduce the PCP problem to the universality problem for SAFA such that the constructed SAFA does not accept a word which corresponds to a PCP solution. Thus, the SAFA is universal if and only if there does not exist a solution to the PCP problem. For the reduction, we consider input data words of the format $u(\#,d_{\#})v(\$,d_{\$})$ with $d_{\#},d_{\$}\in D$ , where data item $(\#,d_{\#})$ is a separator and the data item $(\$,d_{\$})$ is an end-marker. The data words $u$ and $v$ represent a candidate solution ( $x_{\alpha_{1}}\cdots x_{\alpha_{m}};y_{\beta_{1}}\cdots y_{\beta_{m}}$ ) where $\alpha_{1},\dots,\alpha_{m},\beta_{1},\dots,\beta_{m}\in[n]$ of the PCP instance. Such a candidate solution is a true solution of the PCP instance if the following conditions hold.

•

$\alpha_{i}=\beta_{i}$ for each $i\in[m]$ , which denotes the fact that the corresponding strings are taken from the same domino.
•

$x_{\alpha_{1}}\cdots x_{\alpha_{m}}=y_{\beta_{1}}\cdots y_{\beta_{m}}$ , i.e. both strings are the same.

We now describe the format in more detail.

•

Each $x_{\alpha_{j}}$ is encoded as $(\alpha_{j},d_{\gamma})(a_{1},d_{\delta_{1}})\cdots(a_{k},d_{\delta_{k}})$ where $d_{\gamma}$ gives a unique data value to this particular occurrence of domino string from the first list. The symbols $a_{1},\dots a_{k}\in\Sigma$ , the data values $d_{\delta_{1}},\dots,d_{\delta_{k}}\in D$ represent the position of each $a_{i}$ in $x_{\alpha_{j}}$ uniquely and $x_{\alpha_{j}}=a_{1}\cdots a_{k}$ . Similarly, $y_{\beta_{j}}$ is also encoded. The data words $u,v\in(([n]\times D)(\Sigma\times D)^{*})^{*}$ . Every $d_{\gamma}$ and $d_{\delta}$ is unique in $u$ , that is even across different instances of $x_{\alpha_{j}}$ the data values $d_{\gamma}$ and $d_{\delta}$ used are different.
•
A string $u(\#,d_{\#})v(\$,d_{\$})$ is syntactically correct if the above conditions hold and also the following two conditions are true.
- –
  
  ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u))={{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v))$ , i.e. the sequence of data values associated with the symbols in $[n]$ in $u$ is the same as that in $v$ . Having this same sequence of data values in both $u$ and $v$ corresponds to the fact that the $x_{i}$ 's and the $y_{i}$ 's appear in the same order, i.e. $\alpha_{i}=\beta_{i}$ for each $i\in[m].$
- –
  
  ${{\sf proj}_{D}}({\sf proj}_{\Sigma\times D}(u))={{\sf proj}_{D}}({\sf proj}_{\Sigma\times D}(v))$ , i.e. the sequence of data values associated with the symbols in $\Sigma$ in $u$ is the same as that in $v$ . This corresponds to the fact that the strings in $u$ and $v$ obtained by concatenating the $x_{i}$ 's and the $y_{i}$ 's respectively match, i.e. $x_{\alpha_{1}}\cdots x_{\alpha_{m}}=y_{\beta_{1}}\cdots y_{\beta_{m}}$ .

A syntactically correct string $u(\#,d_{\#})v(\$,d_{\$})$ is a true solution of a PCP instance if

•

for each data value in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u))$ , the number in $[n]$ associated with that data value is the same in both $u$ and $v$ . This ensures that the strings in both $u$ and $v$ are chosen from the same domino, i.e. $\alpha_{i}=\beta_{i}$ for each $i\in[m]$ , and
•

for each data value in ${{\sf proj}_{D}}({\sf proj}_{\Sigma\times D}(u))$ , the letter in $\Sigma$ associated with that data value is the same in both $u$ and $v$ . This ensures that the strings formed from both lists are the same, i.e. $x_{\alpha_{1}}\cdots x_{\alpha_{m}}=y_{\beta_{1}}\cdots y_{\beta_{m}}$ .
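The encoding of a candidate solution can be illustrated by a small Python sketch. The function names, the choice of integers as data values, and the sample PCP instance are illustrative assumptions. The check below compares only the two projections of $u$ and $v$. It presupposes that all data values are pairwise distinct and that each block spells the intended string, which holds by construction for the words produced by `encode`.

```python
def encode(strings, indices, gammas, deltas):
    """Encode the chosen strings as a data word: each block is
    (index, d_gamma) followed by one (letter, d_delta) per letter."""
    deltas = iter(deltas)
    word = []
    for i, g in zip(indices, gammas):
        word.append((i, g))
        for ch in strings[i]:
            word.append((ch, next(deltas)))
    return word

def proj(word, keep):
    """Sub-word of the items whose attribute satisfies keep."""
    return [(s, d) for s, d in word if keep(s)]

def projections_agree(u, v):
    """Same (index, value) sequence and same (letter, value) sequence."""
    is_index = lambda s: isinstance(s, int)
    is_letter = lambda s: isinstance(s, str)
    return (proj(u, is_index) == proj(v, is_index)
            and proj(u, is_letter) == proj(v, is_letter))

# Hypothetical PCP instance with solution 3, 2, 3, 1:
# bba.ab.bba.a = bb.aa.bb.baa = bbaabbbaa
xs = {1: "a", 2: "ab", 3: "bba"}
ys = {1: "baa", 2: "aa", 3: "bb"}
sol = [3, 2, 3, 1]
u = encode(xs, sol, range(100, 104), range(9))
v = encode(ys, sol, range(100, 104), range(9))
w = u + [("#", -1)] + v + [("$", -2)]  # the full word u(#,d)v($,d)
```

For this instance `projections_agree(u, v)` holds, whereas for the index sequence 1, 2 the letter projections of the two encodings already differ in length.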

We now describe a nondeterministic SAFA $M$ which accepts an input data word $w\in(\Sigma^{\prime}\times D)^{*}$ where $\Sigma^{\prime}=[n]\cup\Sigma\cup\{\#,\$\}$ if and only if the input data word is not in the correct format or it is not a solution of the PCP instance. The SAFA $M$ checks and accepts $w$ if the following conditions are satisfied for the input string $w$ .

1.
The input string $w$ is not in the format as required by a PCP instance:
1. (a)
  
  The input word $w$ is not in the form $u(\#,d_{\#})v(\$,d_{\$})$ . This checking can be done using an NFA.
2. (b)
  
  Consider a substring $w_{u}$ between two consecutive $(\alpha_{1},d_{1}),(\alpha_{2},d_{2})$ in $u$ where $\alpha_{1}\in[n]$ , $\alpha_{2}\in[n]\cup\{\#\}$ , and $d_{1},d_{2}\in D$ , and we call ${\sf proj}_{\Sigma}(w_{u})$ the $\Sigma$ -projection of $w_{u}$ . The string $u$ is not in the right format if there exists a substring $w_{u}$ as above whose $\Sigma$ -projection is not the same as $x_{\alpha_{1}}$ . Similarly, the string $v$ is not in the right format if there exists a substring $w_{v}$ as above whose $\Sigma$ -projection is not the same as $y_{\alpha_{1}}$ . Corresponding to every string $x_{\alpha}$ for $\alpha\in[n]$ , there is a deterministic finite automaton (DFA) that accepts $\Sigma^{*}\setminus\{x_{\alpha}\}$ . Given the $\alpha$ , we can use the corresponding DFA to check that the $\Sigma$ -projection of $w_{u}$ is not the same as $x_{\alpha}$ . The nondeterministic SAFA guesses such a substring $w_{u}$ of $u$ which is not in the right format.
2.
The $d_{\gamma}$ projections in $u$ and $v$ are not in the required format.
1. (a)
  
  Two data values in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u))$ are same. The SAFA can nondeterministically guess that a particular data value is repeated in $u$ and store it in a set. If it comes across that same data value again while traversing $u$ , it accepts the input word.
2. (b)
  
  Two data values in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v))$ are same. The SAFA can nondeterministically guess that a particular data value is repeated in $v$ and store it in a set. If it comes across that same data value again while traversing $v$ , it accepts the input word.
3. (c)
  
  The first data value in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u))$ and the first data value in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v))$ are not the same. The SAFA can store the first data value in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u))$ in a set and match it while reading the first data value in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v))$ after $\#$ .
4. (d)
  
  The last data value in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u))$ and the last data value in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v))$ are not the same. Again the SAFA can nondeterministically read the last two data values in each of these sequences and match them.
5. (e)
  Two data values $d_{\gamma_{1}}$ and $d_{\gamma_{2}}$ are successors in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u))$ but not in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v))$ .
  - •
    
    The SAFA can again nondeterministically decide on reading such a pair of data values $d_{\gamma_{1}}$ and $d_{\gamma_{2}}$ in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u))$ that are different from those in $v$ and store them in two different sets. Then as it parses through data values in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v))$ and comes across the first data value it checks whether the successor data value in both cases are same or not.
3.

The $d_{\delta}$ projections in $u$ and $v$ are not in the required format. This checking can be done in a similar manner as the checking for $d_{\gamma}$ projections. Again recall that this can easily be done since all $d_{\delta}$ values are unique in $u$ .
4.
The input word $w$ is not a true solution of the PCP instance.
1. (a)
  
  For the input word to be a correct solution of the PCP instance, the attribute $i\in[n]$ corresponding to each data value in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(u))$ needs to be same as that in ${{\sf proj}_{D}}({\sf proj}_{[n]\times D}(v))$ . This ensures the strings are chosen from the same domino of the PCP instance.
2. (b)
  
  The attribute corresponding to each data value in ${{\sf proj}_{D}}({\sf proj}_{\Sigma\times D}(u))$ needs to be same as the attribute in ${{\sf proj}_{D}}({\sf proj}_{\Sigma\times D}(v))$ . This ensures the concatenation of strings chosen from List 1 and List 2 are the same.
3. (c)
  
  If any of the corresponding data values in $u$ and $v$ have different attributes associated with them, then the input word $w$ is not a true solution of the PCP instance and the SAFA accepts the input word. We now describe how the SAFA finds that the input word is not a correct solution for $(a)$ and $(b)$ above. Based on the data value in $u$ the SAFA nondeterministically checks whether the same data value in $v$ has the same attribute associated with it. If they are not the same, the input word is accepted. The SAFA after making the guess stores the data value in a set and also remembers in its state space the attribute it encountered with the data value. When it again comes across the data value in $v$ , it checks the associated attribute. It accepts the input word if they are not the same.

The above mentioned nondeterministic SAFA $M$ accepts an input word if and only if the input data word is not in the correct format or it is not a solution to the PCP instance. Therefore, if the SAFA $M$ does not accept some input word, i.e. it is not universal, then that input word is in the required format and is also a solution to the PCP instance, implying that the PCP instance has a solution.

Conversely, if the PCP instance has a solution, then that solution can be represented by an input word $w$ in the correct format, and the SAFA $M$ does not accept $w$ . Therefore, the SAFA is not universal.

Hence, universality of SAFA is undecidable.

4.2 Closure Properties

We now study the closure properties. We first study the closure properties of SAFA followed by those of DSAFA.

4.2.1 Closure Properties of SAFA

We start with the Boolean closure properties. We show that SAFA are closed under union, but not under intersection and complementation.

Lemma 4.17.

SAFA are closed under union.

Proof 4.18.

The union of two SAFA can be obtained by superimposing their start states. Let us consider two SAFA $M_{1}=(Q_{1},\Sigma\times D,q_{01},F_{1},H_{1},\delta_{1})$ and $M_{2}=(Q_{2},\Sigma\times D,q_{02},F_{2},H_{2},\delta_{2})$ . The SAFA $M_{3}=(Q_{3},\Sigma\times D,q_{03},F_{3},H_{3},\delta_{3})$ which accepts the language $L(M_{1})\cup L(M_{2})$ is constructed as follows: $Q_{3}=\{q_{03}\}\cup Q_{1}\cup Q_{2}$ , $F_{3}=F_{1}\cup F_{2}$ if $q_{01}\notin F_{1}$ and $q_{02}\notin F_{2}$ , otherwise $F_{3}=F_{1}\cup F_{2}\cup\{q_{03}\}$ , $H_{3}=H_{1}\cup H_{2}$ . All transitions in $\delta_{1}$ and $\delta_{2}$ are in $\delta_{3}$ . Additionally, for every transition $(x,y,z)$ from state $q_{01}$ to any state $q_{i}\in Q_{1}$ in $\delta_{1}$ such that $x\in\Sigma$ , $y\in\{p(h_{k}),!p(h_{k})\}$ with $h_{k}\in H_{1}$ and $z\in\{-,{\sf ins}(h_{l})\}$ , $h_{l}\in H_{1}$ , the transition $(x,y,z)$ is included in $\delta_{3}$ from state $q_{03}$ to state $q_{i}$ . Similarly, for every transition $(x,y,z)$ from state $q_{02}$ to a state $q_{i}\in Q_{2}$ in $\delta_{2}$ such that $x\in\Sigma$ , $y\in\{p(h_{k}),!p(h_{k})\}$ , $h_{k}\in H_{2}$ and $z\in\{-,{\sf ins}(h_{l})\}$ , $h_{l}\in H_{2}$ , the transition $(x,y,z)$ is included in $\delta_{3}$ from state $q_{03}$ to state $q_{i}$ . On an input data word, the automaton $M_{3}$ nondeterministically chooses whether to simulate $M_{1}$ or $M_{2}$ . If the input word is accepted by either of the automata, it is accepted by $M_{3}$ . If the input word is accepted by neither of the automata, it is rejected. Thus, the SAFA $M_{3}$ accepts the language $L(M_{1})\cup L(M_{2})$ .
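The union construction can be sketched in Python together with a small simulator for nondeterministic SAFA. The tuple encoding, the tagging of states and set names to keep the two automata disjoint, and the two sample automata are assumptions made for this illustration.

```python
def union(m1, m2):
    """m = (initial, finals, transitions) with transitions
    (src, letter, cond, h, ins, dst); returns the union automaton."""
    def tag(m, k):
        # Rename states and sets of automaton k so that nothing clashes.
        init, finals, trans = m
        return ((k, init), {(k, f) for f in finals},
                [((k, s), a, c, (k, h), None if i is None else (k, i), (k, d))
                 for s, a, c, h, i, d in trans])
    (i1, f1, t1), (i2, f2, t2) = tag(m1, 1), tag(m2, 2)
    new_init = "q03"
    # Copy every transition leaving an old initial state to the new one.
    trans = t1 + t2 + [(new_init, a, c, h, i, d)
                       for s, a, c, h, i, d in t1 + t2 if s in (i1, i2)]
    finals = f1 | f2 | ({new_init} if i1 in f1 or i2 in f2 else set())
    return new_init, finals, trans

def accepts(m, word):
    """Track all configurations (state, set of (set name, value) pairs)."""
    init, finals, trans = m
    configs = {(init, frozenset())}
    for letter, value in word:
        nxt = set()
        for state, store in configs:
            for s, a, c, h, i, d in trans:
                if s == state and a == letter and ((h, value) in store) == (c == "p"):
                    nxt.add((d, store | {(i, value)} if i is not None else store))
        configs = nxt
    return any(q in finals for q, _ in configs)

# Hypothetical inputs: m1 accepts (a,d)(a,d), m2 accepts any single (b,d).
m1 = ("q0", {"q2"}, [("q0", "a", "!p", "h", "h", "q1"),
                     ("q1", "a", "p", "h", None, "q2")])
m2 = ("p0", {"p1"}, [("p0", "b", "!p", "g", None, "p1")])
m3 = union(m1, m2)
```

On these inputs, `m3` accepts $(a,1)(a,1)$ and $(b,7)$ and rejects $(a,1)(a,2)$, in agreement with $L(M_{1})\cup L(M_{2})$.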

Lemma 4.19.

SAFA are not closed under intersection.

Proof 4.20.

Consider the language $L=L_{{\sf fd}(a)}\cap L_{a\exists b}$ . We show that there exists no SAFA which accepts $L$ . We prove by contradiction. Assume that there exists a SAFA $M=(Q,\{a,b\}\times D,q_{0},F,H,\delta)$ with $|H|=k>0$ such that $M$ accepts $L$ . Then $M$ must accept the following word $w\in L$ where $w=(b,d_{1})\cdots(b,d_{k+1})(a,d_{1})\cdots(a,d_{k+1})$ and $d_{1},...,d_{k+1}$ are all distinct. In order to accept $w$ , the SAFA $M$ must go through a sequence $T=t_{b_{1}}...t_{b_{k+1}}t_{a_{1}}...t_{a_{k+1}}$ of transitions to completely consume $w$ and end in an accepting state. Here $t_{b_{i}}$ consumes the data element $(b,d_{i})$ , and $t_{a_{j}}$ consumes the data element $(a,d_{j})$ and $1\leq i,j\leq k+1$ . Two cases are possible:

•

There is a transition say $t_{a_{g}}$ in $T$ consuming the data element $(a,d_{g})$ of $w$ where $g\in[k+1]$ and $t_{a_{g}}$ is of the form $(a,!p(h_{i}),-)$ or $(a,!p(h_{i}),{\sf ins}(h_{j}))$ where $h_{i},h_{j}\in H$ . The SAFA $M$ using the same sequence $T$ of transitions can accept another data word $w^{\prime}\notin L_{a\exists b}$ where $(a,d_{g})$ is replaced by $(a,d)$ such that $d\neq d_{r}$ for $1\leq r\leq k+1$ . It is always possible to get such a data value $d$ as $k$ is finite but $D$ is countably infinite. Therefore, at the time of executing $t_{a_{g}}$ the data value $d$ is not present in $h_{i}$ and $t_{a_{g}}$ executes successfully. Recall that the data values in $w$ that follow $d_{g}$ are different from $d_{g}$ . The data values in $w^{\prime}$ that follow the data value $d$ are not equal to $d$ . Therefore whether $d$ has been inserted to any set $h_{j}\in H$ or not while executing $t_{a_{g}}$ does not impact the successful execution of the transitions in $T$ that follow $t_{a_{g}}$ . Now, in $w^{\prime}$ there is a data value $d$ associated with attribute $a$ which is not associated with attribute $b$ , thus $w^{\prime}\notin L_{a\exists b}$ .
•

All transitions in $T$ following $t_{b_{k+1}}$ are of the form $(a,p(h_{i}),-)$ or $(a,p(h_{i}),{\sf ins}(h_{j}))$ where $h_{i},h_{j}\in H$ . The number of transitions in $T$ that follow $t_{b_{k+1}}$ is greater than $k$ . Hence, by the pigeonhole principle, there must be two transitions $t_{a_{\ell}}$ and $t_{a_{m}}$ where $1\leq\ell<m\leq k+1$ which have the same condition $p(h_{i})$ for some $h_{i}\in H$ . The SAFA $M$ using the same sequence $T$ of transitions can accept another data word $w^{\prime}\notin L_{{\sf fd}(a)}$ where $(a,d_{m})$ is replaced by $(a,d_{\ell})$ . The SAFA $M$ when executing $t_{a_{m}}$ on $w^{\prime}$ can successfully consume the data element $(a,d_{\ell})$ instead of $(a,d_{m})$ . This is because $t_{a_{\ell}}$ and $t_{a_{m}}$ have the same condition $p(h_{i})$ and $d_{\ell}$ is already present in $h_{i}$ when $t_{a_{\ell}}$ is executed. Note that the data values in $w^{\prime}$ that follow the execution of $t_{a_{m}}$ are not equal to $d_{\ell}$ . Therefore whether $d_{\ell}$ has been inserted to any set $h_{j}\in H$ or not does not impact the successful execution of the transitions in $T$ that follow $t_{a_{m}}$ .

Using a pumping argument, we show that these automata are not closed under complementation.

Lemma 4.21.

Let $L\in{\mathcal{L}_{\sf SAFA}}$ . Then there exists a SAFA $M$ with $n$ states that accepts $L$ such that every data word $w\in L$ of length at least $n$ can be written as $w=xyz$ and $T_{w}=T_{x}T_{y}T_{z}$ corresponds to the sequence of transitions that $M$ takes to accept $w$ , where $T_{x}=t_{x_{1}}\dots t_{x_{|x|}}$ , $T_{y}=t_{y_{1}}\dots t_{y_{|y|}}$ , $T_{z}=t_{z_{1}}\dots t_{z_{|z|}}$ is the sequence of transitions that $M$ takes to read $x$ , $y$ , $z$ respectively, and $t_{u_{j}}$ denotes the $j^{th}$ transition of the transition sequence $T_{u}$ with $u\in\{x,y,z\}$ , satisfying the following:

•

$|y|\geq 1$
•

$|xy|\leq n$
•
for all $\ell\geq 1$ , for all words $w^{\prime}=xyy^{\prime}_{1}\cdots y^{\prime}_{\ell}z^{\prime}$ such that $T_{w^{\prime}}=T_{x}T_{y}{T_{y}}^{\ell}T_{z}$ is the sequence of transitions that $M$ takes to accept $w^{\prime}$ and ${{\sf proj}_{\Sigma}}(y)={{\sf proj}_{\Sigma}}(y^{\prime}_{1})=\dots={{\sf proj}_{\Sigma}}(y^{\prime}_{\ell})$ , ${{\sf proj}_{\Sigma}}(z)={{\sf proj}_{\Sigma}}(z^{\prime})$ .
- –
  
  if $t_{y_{j}}$ has $p(h_{i}),h_{i}\in H$ then the $j^{th}$ datum of ${{\sf proj}_{D}}(y^{\prime}_{k})\in h_{i}$ , $1\leq j\leq|y|$ , $1\leq k\leq\ell$ .
- –
  
  if $t_{y_{j}}$ has $!p(h_{i}),h_{i}\in H$ then the $j^{th}$ datum of ${{\sf proj}_{D}}(y^{\prime}_{k})\notin h_{i}$ , $1\leq j\leq|y|$ , $1\leq k\leq\ell$ .
- –
  
  if $t_{z_{j}}$ has $p(h_{i}),h_{i}\in H$ then the $j^{th}$ datum of ${{\sf proj}_{D}}(z^{\prime})\in h_{i}$ , $1\leq j\leq|z|$ .
- –
  
  if $t_{z_{j}}$ has $!p(h_{i}),h_{i}\in H$ , then the $j^{th}$ datum of ${{\sf proj}_{D}}(z^{\prime})\notin h_{i}$ , $1\leq j\leq|z|$ .
•

$w^{\prime}\in L$ .

Proof 4.22.

Since $L\in{\mathcal{L}_{\sf SAFA}}$ , there exists a SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ , with say $n$ states that accepts $L$ . As $|w|\geq n$ and $w\in L$ , the sequence of states that $M$ traverses to accept $w$ must contain a cycle. Let us take the first such cycle and call it $c_{y}$ . Let the sequence of transitions that $M$ executes to traverse the cycle $c_{y}$ be $T_{y}$ and the infix of $w$ read along $T_{y}$ be $y$ . Let $T_{x}$ be the sequence of transitions that $M$ traverses before entering the first cycle $c_{y}$ and the prefix of $w$ read along $T_{x}$ be $x$ . Let $T_{z}$ be the sequence of transitions that $M$ traverses after exiting the cycle $c_{y}$ to reach a final state in $M$ , and let the suffix of $w$ read along $T_{z}$ be $z$ . Therefore, $w=xyz$ and the sequence of transitions that $M$ traverses to accept $w$ is say $T_{w}=T_{x}T_{y}T_{z}$ . Moreover since $c_{y}$ is the first such cycle, we have $|y|\geq 1$ and $|xy|\leq n$ . Now, consider a sequence of transitions $T_{w^{\prime}}=T_{x}T_{y}{T_{y}}^{\ell}T_{z}$ , then $w^{\prime}=xyy_{1}\cdots y_{\ell}z^{\prime}$ , where ${{\sf proj}_{\Sigma}}(y)={{\sf proj}_{\Sigma}}(y_{1})\dots={{\sf proj}_{\Sigma}}(y_{l})$ but ${{\sf proj}_{D}}(y),{{\sf proj}_{D}}(y_{1}),\dots,{{\sf proj}_{D}}(y_{l})$ may or may not be equal to each other. Since $|y|\geq 1$ , the sequence $T_{y}$ of transitions must have at least one transition. The transition $t_{y_{j}}$ in $T_{y}$ with $!p(h_{i})$ can always be executed successfully because the SAFA $M$ when executing $t_{y_{j}}$ can always read a new data value which it has not read till executing $t_{y_{j}}$ . We can always find such data values as $|w^{\prime}|$ is finite whereas $D$ is countably infinite. The transition $t_{y_{j}}$ in $T_{y}$ with $p(h_{i})$ is executed successfully when consuming $y$ , since, $y$ is consumed successfully by $M$ when accepting $w$ . 
Since $y$ is consumed successfully by $M$ , the set $h_{i}$ corresponding to $t_{y_{j}}$ is non-empty, as there are no removal operations in SAFA. Therefore, after consuming $y$ , every time the sequence $T_{y}$ is executed, $t_{y_{j}}$ is also executed successfully.

The sequence $T_{z}$ is executed successfully for the same reasons as $T_{y}$ . Thus, $M$ accepts a data word $w^{\prime}=xyy_{1}...y_{\ell}z^{\prime}$ .

Lemma 4.23.

SAFA are not closed under complementation.

Proof 4.24.

To show SAFA are not closed under complementation, we first define the following functions. The function ${\sf cnt}(w^{\prime},d)$ gives the number of times data value $d$ is present in a data word $w^{\prime}$ and ${\sf uni}(w^{\prime})$ gives the number of data values $d$ with ${\sf cnt}(w^{\prime},d)=1$ in $w^{\prime}$ . We consider the language $L_{\exists{\sf cnt}\neq 2}$ , which is the language of data words $w$ where there exists a data value $d$ such that ${\sf cnt}(w,d)\neq 2$ . Example 3.4 shows a SAFA that accepts this language.

Consider the complement language $L_{\forall{\sf cnt}=2}$ wherein all data values occur exactly twice. Using Lemma 4.21 we show no SAFA can accept $L_{\forall{\sf cnt}=2}$ .
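The functions ${\sf cnt}$ and ${\sf uni}$ and the two languages can be made concrete with a few lines of Python. In this sketch a data word is a list of (attribute, value) pairs, and the data value $d$ in the definition of $L_{\exists{\sf cnt}\neq 2}$ is assumed to range over the values occurring in the word, which is the reading under which $L_{\forall{\sf cnt}=2}$ is its complement. The predicate names are hypothetical.

```python
from collections import Counter

def cnt(word, d):
    """Number of occurrences of data value d in the data word."""
    return sum(1 for _attr, value in word if value == d)

def uni(word):
    """Number of data values occurring exactly once in the data word."""
    counts = Counter(value for _attr, value in word)
    return sum(1 for c in counts.values() if c == 1)

def in_forall_cnt_eq_2(word):
    """Membership in the language where every value occurs exactly twice."""
    return all(c == 2 for c in Counter(v for _attr, v in word).values())

def in_exists_cnt_neq_2(word):
    """Membership in the complement language."""
    return not in_forall_cnt_eq_2(word)
```

For instance, the word $(a,1)(a,2)(a,1)(a,2)$ satisfies the first predicate, while appending a fresh item $(a,3)$ yields a word with ${\sf uni}=1$ that satisfies the second.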

The proof is by contradiction. Suppose that there exists a SAFA $M$ with $n$ states accepting $L_{\forall{\sf cnt}=2}$ . Let $w$ be a data word such that $w\in L_{\forall{\sf cnt}=2}$ and $|w|=2n$ .

For every decomposition of $w$ as $w=xyz$ and sequence $T_{w}=T_{x}T_{y}T_{z}$ of transitions that $M$ takes to accept $w$ with $|y|\geq 1$ , we have a $w^{\prime}=xyy_{1}y_{2}y_{3}z^{\prime}$ such that $T_{w^{\prime}}=T_{x}T_{y}{T_{y}}^{3}T_{z}$ . Since $|y|\geq 1$ , we have that $T_{y}$ must have either a transition $t$ with $p(h_{i})$ for some $h_{i}\in H$ or a transition with $!p(h_{i})$ for some $h_{i}\in H$ or both. If $t$ has $p(h_{i})$ , then the first time $t$ is executed while consuming $y$ , assume that it consumes a data value $d$ . It is able to consume the data value $d$ as it is already inserted in $h_{i}$ before $t$ is executed. Now if after consuming the word $xy$ , $T_{y}$ is executed again, then when executing the transition $t$ it can again consume the same data value $d$ as before. So, every time $T_{y}$ is executed, the SAFA $M$ will consume the data value $d$ while executing the transition $t$ . After executing $T_{y}$ three times, the SAFA $M$ executes the transition sequence $T_{z}$ . All the transitions with $p(h_{i})$ in $T_{z}$ can be executed successfully with the same data value that they consumed when $M$ accepted $w$ because $w^{\prime}$ and $w$ both have the same prefix $xy$ . The transitions with $!p(h_{i})$ in $T_{z}$ consume data values that $M$ had not encountered prior to executing these transitions. Thus, if $T_{y}$ has a transition $t$ with $p(h_{i})$ for some $h_{i}\in H$ , then $M$ accepts the data word $w^{\prime}=xyy_{1}\cdots y_{3}z^{\prime}$ where there exists a data value $d$ with ${\sf cnt}(w^{\prime},d)>3$ .

If $T_{y}$ has a transition, say $t$ , with $!p(h_{i})$ for some $h_{i}\in H$ , then every time $T_{y}$ is executed after consuming $xy$ , the SAFA $M$ when executing $t$ can always read a new data value that it has not read before executing $t$ and that it will not read later. We can always find such data values as $w^{\prime}$ is finite whereas $D$ is countably infinite. The sequence $T_{z}$ is executed successfully for the same reasons as before. Thus, if $T_{y}$ has a transition $t$ with $!p(h_{i})$ , then $M$ accepts a data word $w^{\prime}=xyy_{1}\cdots y_{3}z^{\prime}$ where ${\sf uni}(w^{\prime})\geq 3$ . Note that the data value consumed by $M$ when taking the transition $t$ while reading $y$ may already be present in the prefix being read by the sequence of transitions prior to taking $t$ . In either case, $w^{\prime}\notin L_{\forall{\sf cnt}=2}$ , which contradicts the assumption that $M$ accepts $L_{\forall{\sf cnt}=2}$ .

Note that from the above construction, we see that SAFA with $|H|\geq 2$ are not closed under complementation. We observe that singleton SAFA are closed under union but not under intersection. From the hierarchy theorem (Theorem 3.5), we see singleton SAFA cannot accept $L_{\sf fd(a_{1})}\cap L_{\sf fd(a_{2})}\cap L_{{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}a_{2}^{*}}$ but singleton SAFA can accept $L_{\sf fd(a_{1})}\cap L_{{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}a_{2}^{*}}$ and $L_{\sf fd(a_{2})}\cap L_{{{\sf proj}_{\Sigma}}(L)=a_{1}^{*}a_{2}^{*}}$ . Thus, singleton SAFA are not closed under intersection, and hence also not closed under complementation.

Theorem 4.25.

SAFA are closed under concatenation.

Proof 4.26.

SAFA are shown to be closed under concatenation in a manner similar to NFA. Note that if the sets used in the two input automata $A$ and $B$ are $H_{A}$ and $H_{B}$ respectively, then the automaton accepting the concatenation of their languages uses the sets $H_{A}\cup H_{B}$ .

Theorem 4.27.

SAFA are not closed under Kleene’s closure.

Proof 4.28.

Consider the language $L_{1}=\{w\in(\{a\}\times\{d\})^{2}|d\in D\}$ . The language $L_{1}$ can be accepted by a DSAFA (see Figure 8). The Kleene’s closure of $L_{1}$ is the language $L=\{w\in((\{a\}\times\{d\})^{2})^{*}|d\in D\}$ , i.e., $L$ is the set of all data words where every data value appears in pairs. We show that there exists no SAFA which accepts $L$ . We prove by contradiction. Assume that there exists a SAFA $M=(Q,\{a\}\times D,q_{0},F,H,\delta)$ with $|H|=k>0$ such that $M$ accepts $L$ . Then $M$ must accept the following word $w\in L$ where $w=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{k+1})(a,d_{k+1})$ and $d_{1},...,d_{k+1}\in D$ are all distinct. In order to accept $w$ , the SAFA $M$ must go through a sequence $T=t_{1_{d_{1}}}t_{2_{d_{1}}}...t_{1_{d_{k+1}}}t_{2_{d_{k+1}}}$ of transitions to completely consume $w$ and end in an accepting state. Here $t_{1_{d_{i}}}$ consumes the first data item of the $i^{th}$ data value pair and $t_{2_{d_{i}}}$ consumes the second data item of the $i^{th}$ data value pair and $1\leq i\leq k+1$ .

•

The transitions $t_{2_{d_{i}}}$ must be of the form $(a,p(h_{\ell}),-)$ or $(a,p(h_{\ell}),{\sf ins}(h_{j}))$ where $h_{\ell},h_{j}\in H$ . This is essential because if $t_{2_{d_{i}}}$ is of the form $(a,!p(h_{\ell}),-)$ or $(a,!p(h_{\ell}),{\sf ins}(h_{j}))$ then instead of consuming $d_{i}$ it can also consume successfully a new data value $d_{new}\in D$ which is not present in $w$ . It is always possible to get such a data value as $D$ is countably infinite. The SAFA $M$ will then accept the data word $w^{\prime}=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{new})\cdots(a,d_{k+1})(a,d_{k+1})$ which is not in $L$ .
•

The transitions $t_{1_{d_{i}}}$ must be of the form $(a,!p(h_{\ell}),{\sf ins}(h_{j}))$ where $h_{\ell},h_{j}\in H$ . This is essential because if $t_{1_{d_{i}}}$ is of the form $(a,p(h_{\ell}),-)$ or $(a,p(h_{\ell}),{\sf ins}(h_{j}))$ , then $t_{1_{d_{i}}}$ will fail to consume the first instance of the data value $d_{i}$ , since $M$ has not encountered $d_{i}$ prior to the transition $t_{1_{d_{i}}}$ as all data value pairs in $w$ are distinct and therefore $d_{i}$ is not present in any set. The transition $t_{1_{d_{i}}}$ cannot be of the form $(a,!p(h_{\ell}),-)$ because the following transition $t_{2_{d_{i}}}$ which is of the form $(a,p(h_{j}),-)$ or $(a,p(h_{j}),{\sf ins}(h_{p}))$ where $h_{j},h_{p}\in H$ will not be executed successfully. Note that for the transition $t_{2_{d_{i}}}$ to consume the data value $d_{i}$ , the data value $d_{i}$ must be inserted in a set when it was first encountered.
•

Since the number of distinct data values in $w$ is more than the number of sets in $M$ and all the distinct data values are inserted in the sets in $M$ , by the pigeonhole principle, there are two distinct data values $d_{i}$ and $d_{j}$ , with $i<j$ which are inserted in the same set $h_{\ell}\in H$ . Thus, if $M$ accepts the data word $w=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{j})(a,d_{j})\cdots(a,d_{k+1})(a,d_{k+1})$ then $M$ also accepts the data word $w^{\prime}=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{j})(a,d_{i})\cdots(a,d_{k+1})(a,d_{k+1})$ which is not in $L$ , because $d_{i}$ is also present in $h_{\ell}$ when $t_{2_{d_{j}}}$ is executed.

Thus, SAFA are not closed under Kleene’s closure.
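The pigeonhole step can be observed on a concrete candidate automaton. The sketch below assumes a simple encoding of SAFA transitions as tuples `(source, attribute, present, set, insert, target)`, where `present=True` stands for $p(h)$ and `present=False` for $!p(h)$ , and it assumes that the insertion is applied to the data value being read after the membership test. The simulator `safa_accepts` and the one-set automaton are illustrative and not taken from the paper.

```python
def safa_accepts(word, transitions, initial, finals, set_names):
    """Nondeterministic run of a SAFA (assumed encoding, see lead-in)."""
    index = {h: i for i, h in enumerate(set_names)}
    # A configuration is a state together with the contents of every set.
    configs = {(initial, tuple(frozenset() for _ in set_names))}
    for attr, value in word:
        successors = set()
        for state, contents in configs:
            for src, a, present, h, ins, dst in transitions:
                if src != state or a != attr:
                    continue
                # p(h) needs the value in h, !p(h) needs it absent from h.
                if (value in contents[index[h]]) != present:
                    continue
                updated = list(contents)
                if ins is not None:
                    updated[index[ins]] = updated[index[ins]] | {value}
                successors.add((dst, tuple(updated)))
        configs = successors
    return any(state in finals for state, _ in configs)

# Candidate with k = 1 set: insert the first item of a pair, test the second.
transitions = [
    ("q0", "a", False, "h1", "h1", "q1"),  # (a, !p(h1), ins(h1))
    ("q1", "a", True, "h1", None, "q0"),   # (a, p(h1), -)
]
run = lambda w: safa_accepts(w, transitions, "q0", {"q0"}, ["h1"])

w_good = [("a", 1), ("a", 1), ("a", 2), ("a", 2)]  # k + 1 = 2 distinct values
w_bad = [("a", 1), ("a", 1), ("a", 2), ("a", 1)]   # second pair is broken
print(run(w_good), run(w_bad))
```

Both words are accepted: the values 1 and 2 share the set `h1`, so the automaton cannot tell which of them closes the second pair.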

Figure 8: A SAFA $M$ such that $L(M)^{*}\notin{\mathcal{L}_{\sf SAFA}}$

Figure 9: DSAFA for $L_{\sf a\exists b}$

Theorem 4.29.

SAFA are not closed under reversal.

Proof 4.30.

Consider the language $L_{a\exists b}$ which can be accepted by a DSAFA (see Figure 9). The reversal of $L_{a\exists b}$ is the language $L=\{w^{R}|w\in L_{a\exists b}\}$ i.e. $L$ is the set of all data words where for every attribute $a$ there is an attribute $b$ which comes after it and whose data value is same as that of attribute $a$ . We show that there exists no SAFA which accepts $L$ . We prove by contradiction. Assume that there exists a SAFA $M=(Q,\{a,b\}\times D,q_{0},F,H,\delta)$ with $|H|=k>0$ such that $M$ accepts $L$ . Then $M$ must accept the following word $w\in L$ where $w=(a,d_{1})\cdots(a,d_{k+1})(b,d_{1})\cdots(b,d_{k+1})$ and $d_{1},...,d_{k+1}\in D$ are all distinct. In order to accept $w$ , the SAFA $M$ must go through a sequence $T=t_{a_{d_{1}}}...t_{a_{d_{k+1}}}t_{b_{d_{1}}}...t_{b_{d_{k+1}}}$ of transitions to completely consume $w$ and end in an accepting state. Here $t_{a_{d_{i}}}$ consumes the data item $(a,d_{i})$ and $t_{b_{d_{i}}}$ consumes the data item $(b,d_{i})$ of $w$ .

•

The transitions $t_{b_{d_{i}}}$ must be of the form $(b,p(h_{\ell}),-)$ or $(b,p(h_{\ell}),{\sf ins}(h_{j}))$ where $h_{\ell},h_{j}\in H$ . This is essential because if $t_{b_{d_{i}}}$ is of the form $(b,!p(h_{\ell}),-)$ or $(b,!p(h_{\ell}),{\sf ins}(h_{j}))$ then instead of consuming $d_{i}$ it can also consume successfully a new data value $d_{new}\in D$ which is not present in $w$ . It is always possible to get such a data value as $D$ is countably infinite. The SAFA $M$ will then accept the data word $w^{\prime}=(a,d_{1})\cdots(a,d_{i})\cdots(a,d_{k+1})(b,d_{1})\cdots(b,d_{new})\cdots(b,d_{k+1})$ which is not in $L$ .
•

The transitions $t_{a_{d_{i}}}$ must be of the form $(a,!p(h_{\ell}),{\sf ins}(h_{j}))$ where $h_{\ell},h_{j}\in H$ . This is essential because if $t_{a_{d_{i}}}$ is of the form $(a,p(h_{\ell}),-)$ or $(a,p(h_{\ell}),{\sf ins}(h_{j}))$ , then $t_{a_{d_{i}}}$ will fail to consume the first instance of the data value $d_{i}$ , since $M$ has not encountered $d_{i}$ prior to the transition $t_{a_{d_{i}}}$ as all data values in $w$ associated with attribute $a$ are distinct and therefore $d_{i}$ is not present in any set. The transition $t_{a_{d_{i}}}$ cannot be of the form $(a,!p(h_{\ell}),-)$ because there is a following transition $t_{b_{d_{i}}}$ which is of the form $(b,p(h_{j}),-)$ or $(b,p(h_{j}),{\sf ins}(h_{p}))$ where $h_{j},h_{p}\in H$ that will not be executed successfully. Recall that for the transition $t_{b_{d_{i}}}$ to consume the data value $d_{i}$ , the data value $d_{i}$ must be inserted in a set when it was first encountered.
•

Since the number of distinct data values in $w$ associated with attribute $a$ is more than the number of sets in $M$ and all the distinct data values associated with attribute $a$ are inserted in the sets in $M$ , by the pigeonhole principle, there are two distinct data values $d_{i}$ and $d_{j}$ , with $i<j$ which are inserted in the same set $h_{\ell}\in H$ . Thus, if $M$ accepts the data word $w=(a,d_{1})\cdots(a,d_{i})\cdots(a,d_{j})\cdots(a,d_{k+1})(b,d_{1})\cdots(b,d_{i})\cdots(b,d_{j})\cdots(b,d_{k+1})$ then $M$ also accepts the data word $w^{\prime}=(a,d_{1})\cdots(a,d_{i})\cdots(a,d_{j})\cdots(a,d_{k+1})(b,d_{1})\cdots(b,d_{i})\cdots(b,d_{i})\cdots(b,d_{k+1})$ which is not in $L$ .

Thus, SAFA are not closed under reversal.

Theorem 4.31.

SAFA are not closed under homomorphism.

Proof 4.32.

Consider the language $L_{fd(a)}$ where the data values are taken from the set of natural numbers $\mathbb{N}$ and $\Sigma=\{a\}$ . The homomorphism function is $h(\epsilon)=\epsilon$ , $h((a,d))=(a,d)(a,d)$ for all $a\in\Sigma$ and $d\in\mathbb{N}$ . The language $L=h(L_{fd(a)})$ is a language of data words where every data value occurs exactly twice and also consecutively. There exists no SAFA which accepts $L=h(L_{fd(a)})$ . The proof is by contradiction. Suppose that there exists a SAFA $M$ with $n$ states accepting $L$ . Let $w$ be a data word such that $w\in L$ and $|w|=2n$ .

For every decomposition of $w$ as $w=xyz$ and sequence $T_{w}=T_{x}T_{y}T_{z}$ of transitions that $M$ takes to accept $w$ with $|y|\geq 1$ , we have a $w^{\prime}=xyy_{1}y_{2}y_{3}z^{\prime}$ such that $T_{w^{\prime}}=T_{x}T_{y}{T_{y}}^{3}T_{z}$ . Since $|y|\geq 1$ , we have that $T_{y}$ must have either a transition $t$ with $p(h_{i})$ for some $h_{i}\in H$ or a transition with $!p(h_{i})$ for some $h_{i}\in H$ or both. If $t$ has $p(h_{i})$ , then the first time $t$ is executed while consuming $y$ , assume that it consumes a data value $d$ . It is able to consume the data value $d$ as it is already inserted in $h_{i}$ before $t$ is executed. Now if after consuming the word $xy$ , $T_{y}$ is executed again, then when executing the transition $t$ it can again consume the same data value $d$ as before. So, every time $T_{y}$ is executed, the SAFA $M$ will consume the data value $d$ while executing the transition $t$ . After executing $T_{y}$ three times, the SAFA $M$ executes the transition sequence $T_{z}$ . All the transitions with $p(h_{i})$ in $T_{z}$ can be executed successfully with the same data value that they consumed when $M$ accepted $w$ because $w^{\prime}$ and $w$ both have the same prefix $xy$ . The transitions with $!p(h_{i})$ in $T_{z}$ consume data values that $M$ has not encountered prior to executing these transitions. Thus, if $T_{y}$ has a transition $t$ with $p(h_{i})$ for some $h_{i}\in H$ , then $M$ accepts the data word $w^{\prime}=xyy_{1}\cdots y_{3}z^{\prime}$ where there exists a data value $d$ with ${\sf cnt}(w^{\prime},d)>3$ .
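The homomorphism used in this proof is easy to state operationally. The sketch below assumes that $L_{fd(a)}$ is the language of data words whose data values are pairwise distinct, and that data words are lists of (attribute, data value) pairs; the helper names `in_fd` and `in_image` are hypothetical.

```python
def h(word):
    """Homomorphism of the proof: every item (a, d) is mapped to (a, d)(a, d)."""
    return [item for item in word for _ in range(2)]

def in_fd(word):
    """Assumed reading of L_fd(a): all data values are pairwise distinct."""
    values = [value for _attr, value in word]
    return len(values) == len(set(values))

def in_image(word):
    """Membership in h(L_fd(a)): distinct values, each twice and consecutively."""
    if len(word) % 2 != 0:
        return False
    firsts, seconds = word[0::2], word[1::2]
    return firsts == seconds and in_fd(firsts)

w = [("a", 1), ("a", 2)]
print(h(w), in_image(h(w)))
```

A word such as $(a,1)(a,2)(a,2)(a,1)$ has every data value exactly twice but not consecutively, so `in_image` rejects it.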

Theorem 4.33.

SAFA are not closed under inverse homomorphism.

Proof 4.34.

Consider the language $L_{\epsilon}=\{\epsilon\}$ . There exists a SAFA $M$ which accepts $L_{\epsilon}$ . The data values are taken from the set of natural numbers $\mathbb{N}$ and $\Sigma=\{a\}$ . The homomorphism function is $h(\epsilon)=\epsilon$ , $h((a,1))=\epsilon$ , $h((a,d))=(a,d)$ for all $d\in\mathbb{N}\setminus\{1\}$ . The language $L=h^{-1}(L_{\epsilon})$ consists of $\epsilon$ and the data words in which only the data value 1 is present. Since the sets of a SAFA cannot be initialized, a SAFA cannot identify that it has seen the data value 1, as it does not have the data value 1 in any of its sets at the beginning of the computation. Hence, no SAFA can recognise $L=h^{-1}(L_{\epsilon})$ .
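As a small illustration of this inverse image, the following sketch applies the erasing homomorphism to data words given as lists of (attribute, data value) pairs. The function name `in_preimage_of_epsilon` is ours; the code only restates the definition of $h$ above.

```python
def h(word):
    """Erasing homomorphism: (a, 1) is mapped to the empty word, other items are kept."""
    return [(attr, value) for attr, value in word if value != 1]

def in_preimage_of_epsilon(word):
    """A word is in the inverse image of {epsilon} iff h maps it to the empty word."""
    return h(word) == []

print(h([("a", 1), ("a", 3), ("a", 1)]))
print(in_preimage_of_epsilon([("a", 1), ("a", 1)]))
```

Deciding membership therefore amounts to recognising the fixed data value 1, which is exactly what a SAFA without initialization cannot do.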

4.2.2 Closure properties of Deterministic SAFA:

Here we discuss deterministic SAFA and compare their expressiveness with SAFA. Using the standard complementation construction for deterministic finite automata (DFA), that is, by changing non-accepting states to accepting and vice versa, we can show that DSAFA are closed under complementation. Moreover, as deterministic SAFA are closed under complementation but not under intersection, it follows by De Morgan's laws that they are also not closed under union. Since the languages used to show non-closure of SAFA under Kleene’s closure, reversal, homomorphism, and inverse homomorphism are accepted by DSAFA, we have that DSAFA are also not closed under Kleene’s closure, reversal, homomorphism and inverse homomorphism.

Theorem 4.35.

DSAFA are not closed under concatenation.

Proof 4.36.

Consider the language $L_{1}=L_{fd(a)}$ with $\Sigma=\{a\}$ and $L_{2}=\{w\in(\{a\}\times\{d\})^{2}|d\in D\}$ . Both $L_{1}$ and $L_{2}$ can be accepted by DSAFA. The concatenation of $L_{1}$ and $L_{2}$ is the language $L=L_{1}.L_{2}$ , i.e. $L$ is the set of all data words that end with a pair of identical data values such that the data values preceding the pair are pairwise distinct. The data value in the pair therefore occurs either two or three times in the word. We show that there exists no DSAFA which accepts $L$ . We prove by contradiction. Assume that there exists a DSAFA $M=(Q,\{a\}\times D,q_{0},F,H,\delta)$ with $|Q|=n>0$ such that $M$ accepts $L$ . Then $M$ must accept the following word $w\in L$ where $w=(a,d_{1})\cdots(a,d_{n})(a,d_{n+1})(a,d_{n+1})$ and $d_{1},...,d_{n+1}\in D$ are all distinct. In order to accept $w$ , the DSAFA $M$ must go through a sequence $T=t_{1}...t_{n+2}$ of transitions to completely consume $w$ and end in an accepting state. Similarly, the DSAFA $M$ must go through a sequence $S=q_{0}...q_{n+2}$ of states to accept $w$ where $q_{0},...,q_{n+2}\in Q$ , $q_{n+2}\in F$ and $q_{0}$ is the initial state. Here $t_{i}$ consumes the $i^{th}$ data item of the input data word $w$ and takes the DSAFA $M$ from state $q_{i-1}$ to $q_{i}$ .

•

The last transition $t_{n+2}$ must be of the form $(a,p(h_{\ell}),-)$ or $(a,p(h_{\ell}),{\sf ins}(h_{j}))$ where $h_{\ell},h_{j}\in H$ . This is essential because if $t_{n+2}$ is of the form $(a,!p(h_{\ell}),-)$ or $(a,!p(h_{\ell}),{\sf ins}(h_{j}))$ then instead of consuming $d_{n+1}$ which is the second last data value in $w$ repeated in the last position, it can also consume successfully a new data value $d_{new}\in D$ which is not present in $w$ . It is always possible to get such a data value as $D$ is countably infinite. The DSAFA $M$ will then accept the data word $w^{\prime}=(a,d_{1})\cdots(a,d_{n})(a,d_{n+1})(a,d_{new})$ which is not in $L$ . If the last transition is of the form $(a,p(h_{\ell}),-)$ or $(a,p(h_{\ell}),{\sf ins}(h_{j}))$ then the second last transition must be of the form $(a,!p(h_{p}),{\sf ins}(h_{\ell}))$ where $h_{p}\in H$ . The second last transition cannot be of the form $(a,p(h_{p}),-)$ or $(a,p(h_{p}),{\sf ins}(h_{j}))$ because it consumes the data value $d_{n+1}$ in $w$ , which was first encountered by $M$ when executing transition $t_{n+1}$ . Hence, the data value $d_{n+1}$ cannot be present in any set prior to executing $t_{n+1}$ . The transitions $t_{n+1}$ and $t_{n+2}$ consume the same data value. Moreover, $t_{n+2}$ is of the form $(a,p(h_{\ell}),-)$ or $(a,p(h_{\ell}),{\sf ins}(h_{j}))$ , therefore to execute $t_{n+2}$ successfully the data value $d_{n+1}$ must be present in the set $h_{\ell}$ . The data value was first encountered when executing $t_{n+1}$ , hence $t_{n+1}$ must insert the data value into the set $h_{\ell}$ .
•

The transitions $t_{1},...,t_{n}$ must be of the form $(a,!p(h_{\ell}),-)$ or $(a,!p(h_{\ell}),{\sf ins}(h_{j}))$ where $h_{\ell},h_{j}\in H$ . This is essential because if $t_{i}$ where $i\in[n]$ is of the form $(a,p(h_{\ell}),-)$ or $(a,p(h_{\ell}),{\sf ins}(h_{j}))$ , then $t_{i}$ will fail to consume the first instance of the data value $d_{i}$ , since $M$ has not encountered $d_{i}$ prior to the transition $t_{i}$ as all data values in $w$ except the last data value are distinct and therefore $d_{i}$ is not present in any set.
•
Since $|w|>|Q|$ , there exists at least a state $q_{i}$ in $S$ which is repeated in the sequence $S$ .
- –
  
  In the sequence of states $S$ only the last state can be an accepting state, no other intermediate state can be accepting because then the DSAFA $M$ will accept a data word $w^{\prime}\in L_{fd(a)}$ which is not in $L$ . Therefore, the accepting state $q_{n+2}$ of $M$ cannot be a repeated state in $S$ .
- –
  
  Let us assume that the state just prior to the final accepting state $q_{n+2}$ , i.e. the state $q_{n+1}$ in $S$ is one such repeated state, i.e. suppose it is the same as the state $q_{i}$ in $S$ for some $i<n+1$ . Then, $M$ can execute the sequence $T_{new}=t_{1}...t_{i}t_{i+1}...t_{n+1}t_{i+1}...t_{n+1}t_{n+2}$ of transitions and $M$ will accept the data word
  $w^{\prime}=(a,d_{1})\cdots(a,d_{n})(a,d_{n+1})(a,d_{new_{i+1}})\cdots(a,d_{new_{n}})(a,d_{new_{n+1}})(a,d_{n+1})$ where $d_{new_{i+1}},...,d_{new_{n+1}}\in D$ , all of them are distinct and also different from data values $d_{1},...,d_{n+1}$ . We can always find such data values as $D$ is countably infinite. The DSAFA $M$ accepts the data word $w^{\prime}$ due to the following reasons:
  - *
    
    The SAFA $M$ accepts $w$ by executing the sequence of transitions $T$ . The transition $t_{n+2}$ in $T$ is of the form $(a,p(h_{\ell}),-)$ or $(a,p(h_{\ell}),{\sf ins}(h_{j}))$ , therefore, for SAFA $M$ , to execute the transition $t_{n+2}$ and consume the data value $d_{n+1}$ , the data value must already be present in the set $h_{\ell}$ . The data value $d_{n+1}$ is first encountered while executing transition $t_{n+1}$ therefore $t_{n+1}$ inserts the data value in $h_{\ell}$ .
  - *
    
    In consuming $w^{\prime}$ , the DSAFA $M$ executes the transition $t_{n+1}$ in its transition sequence, so the data value $d_{n+1}$ is already present in set $h_{\ell}$ which $M$ uses to execute the transition $t_{n+2}$ as the last transition in the sequence of transitions $T_{new}$ to consume $w^{\prime}$ . The data word $w^{\prime}$ is not in $L$ as the last two data values are not the same.
  Hence the state $q_{n+1}$ cannot be a repeated state in $S$ .
- –
  
  Let us assume a state $q_{j}$ where $q_{j}\neq q_{n+1}$ and $q_{j}\neq q_{n+2}$ is one such repeated state, i.e. suppose it is the same as the state $q_{i}$ in $S$ for some $i<j$ . The transitions $t_{i+1}...t_{j}$ in $T$ are of the form $(a,!p(h_{\ell}),-)$ or $(a,!p(h_{\ell}),{\sf ins}(h_{j}))$ . The SAFA $M$ is a DSAFA and in a DSAFA only two possible transitions can come out of a state: one with $p(h_{\ell})$ and the other with $!p(h_{\ell})$ . In the state $q_{j}$ the transition of the form $(a,!p(h_{\ell}),-)$ or $(a,!p(h_{\ell}),{\sf ins}(h_{j}))$ takes $M$ to a state $q_{z}$ which is inside the cycle $q_{i}...q_{j}$ . The only other transition allowed in $q_{j}$ is a transition of the form $(a,p(h_{\ell}),-)$ or $(a,p(h_{\ell}),{\sf ins}(h_{j}))$ that takes it to a state $q_{k}$ which is not in the cycle. The state $q_{j}$ is not a final state or a state prior to a final state. Therefore there must exist a transition $t$ from $q_{j}$ in $T$ , as $T$ is an accepting sequence of transitions, which takes the DSAFA $M$ from $q_{j}$ to $q_{k}$ such that $M$ is no longer in the cycle $q_{i}...q_{j}$ . The DSAFA $M$ moves towards the accepting state $q_{n+2}$ which is not present in the cycle by executing $t$ . In order to go from $q_{j}$ to $q_{k}$ , the DSAFA $M$ must execute a transition of the form $(a,p(h_{\ell}),-)$ or $(a,p(h_{\ell}),{\sf ins}(h_{j}))$ as it is the only other transition available in state $q_{j}$ , which is a contradiction as all transitions in $T$ coming out of states which are not accepting states or states prior to accepting states are of the form $(a,!p(h_{\ell}),-)$ or $(a,!p(h_{\ell}),{\sf ins}(h_{j}))$ .

Hence, DSAFA are not closed under concatenation.

Thus, we get the following theorem.

Theorem 4.37.

DSAFA are closed under complementation but not under union, intersection, concatenation, Kleene’s closure, reversal, homomorphism and inverse homomorphism.

Table 1 provides a summary of the closure properties of SAFA and DSAFA. We use the symbols $\cup$ for union, $\cap$ for intersection, $!$ for complement, $h(L)$ for homomorphism and $h^{-1}(L)$ for inverse homomorphism respectively.

Table 1: Closure properties of SAFA and DSAFA

	$\cup$	$\cap$	$!$	.	$*$	$L^{R}$	$h(L)$	$h^{-1}(L)$
SAFA	$\checkmark$	$\times$	$\times$	$\checkmark$	$\times$	$\times$	$\times$	$\times$
DSAFA	$\times$	$\times$	$\checkmark$	$\times$	$\times$	$\times$	$\times$	$\times$

We now show that the class of languages accepted by DSAFA is strictly contained in the class of languages accepted by SAFA.

Theorem 4.38.

${\mathcal{L}_{\sf DSAFA}}\subsetneq{\mathcal{L}_{\sf SAFA}}$ .

Proof 4.39.

Recall from Example 3.4 that the language $L_{\exists{\sf cnt}\neq 2}\in{\mathcal{L}_{\sf SAFA}}$ . On the other hand, we show in the proof of Lemma 4.23 that there does not exist a SAFA accepting its complement language $L_{\forall{\sf cnt}=2}$ . This implies that the language $L_{\exists{\sf cnt}\neq 2}$ cannot be accepted by a DSAFA since DSAFA are closed under complementation. The result follows since every deterministic SAFA is a SAFA.

Lemma 4.40.

For every DSAFA accepting $L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)}$ , we can always obtain a DFA that accepts $L(r)$ and has the same number of states as the DSAFA, where $L(r)$ is the language expressed by the regular expression $r$ .

Proof 4.41.

Consider the language $L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)}$ . Let $M^{\prime}$ be a DSAFA which accepts $L$ . We can always obtain a DSAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ that accepts $L$ , has at most as many states as $M^{\prime}$ , and has transitions only of the form $(a,!p(h_{i}),-)$ where $a\in\Sigma$ and $h_{i}\in H$ , in the following manner:

•

We can remove transitions of the form $(a,p(h_{i}),-)$ or $(a,p(h_{i}),{\sf ins}(h_{j}))$ where $a\in\Sigma$ and $h_{i},h_{j}\in H$ from $M^{\prime}$ to construct $M$ .
•

If there are transitions of the form $(a,!p(h_{i}),{\sf ins}(h_{j}))$ where $a\in\Sigma$ and $h_{i},h_{j}\in H$ in $M^{\prime}$ we convert it to $(a,!p(h_{i}),-)$ in $M$ .

Thus all transitions in $M$ are of the form $(a,!p(h_{i}),-)$ . Observe that $M$ remains a DSAFA. The removal of $(a,p(h_{i}),-)$ or $(a,p(h_{i}),{\sf ins}(h_{j}))$ from $M^{\prime}$ may result in some unreachable states in $M$ which can be removed. $M$ accepts $L$ due to the following reason:

Consider the case that $M^{\prime}$ has a sequence $T$ of transitions where there is at least one transition $t$ of the form $(a,p(h_{i}),-)$ which results in accepting a data word $w$ in $L$ . The data word $w$ has ${{\sf proj}_{\Sigma}}(w)\in L(r)$ where $L(r)$ is the regular language expressed by the regular expression $r$ and ${{\sf proj}_{D}}(w)$ has at least one data value which is repeated. But as $M^{\prime}$ accepts $L$ it should also accept the data word $w^{\prime}$ in $L$ where ${{\sf proj}_{\Sigma}}(w^{\prime})={{\sf proj}_{\Sigma}}(w)$ and ${{\sf proj}_{D}}(w^{\prime})$ has data values which are all distinct. The sequence $T^{\prime}$ of transitions which accepts $w^{\prime}$ consists of transitions with $!p(h_{i})$ where $h_{i}\in H$ only. The sequence $T^{\prime}$ of transitions is valid also for $M$ as $M$ retains all transitions with $!p(h_{i})$ and removes any insert operation if there are any from $M^{\prime}$ . This sequence $T^{\prime}$ of transitions accepts both $w$ and $w^{\prime}$ in $M$ . Given such a DSAFA $M$ , we can get a DFA $A$ which accepts $L(r)$ by converting the transitions labelled $(a,!p(h_{i}),-)$ to $a$ in $A$ . The number of states in the DFA $A$ is the same as the number of states of the DSAFA $M$ .

Lemma 4.42.

Every DSAFA accepting $L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)}$ has at least as many states as the smallest DFA accepting $L(r)$ where $L(r)$ is the language expressed by the regular expression $r$ .

Proof 4.43.

Let us assume there exists a DSAFA $M$ which accepts $L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)}$ with $k_{1}$ states and there is a minimized DFA $A_{min}$ which accepts $L(r)$ where $L(r)$ is the language expressed by regular expression $r$ with $k_{2}$ states such that $k_{1}<k_{2}$ .

From Lemma 4.40 we see that there exists a DFA $A$ which accepts $L(r)$ with $k_{1}$ number of states. Thus, $A_{min}$ is no longer the minimized DFA, which is a contradiction.

Theorem 4.44.

There exists a language $L$ which is accepted by a non-deterministic SAFA with $n$ states but for which every DSAFA requires at least $2^{\Omega(n)}$ states to accept the same language $L$ .

Proof 4.45.

From Lemma 4.42 we see that the number of states of any DSAFA that accepts $L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)}$ must be greater than or equal to the number of states of the minimum DFA that accepts $L(r)$ . Now consider a minimum NFA $A_{nfa}$ which accepts $L(r)$ . We can obtain a nondeterministic SAFA $M_{nfa}$ which accepts the language $L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r)}$ using the NFA $A_{nfa}$ by replacing the transitions of the NFA which are of the form $a$ where $a\in\Sigma$ by the transition $(a,!p(h_{i}),-)$ . The number of states of the nondeterministic SAFA $M_{nfa}$ is the same as that of the NFA $A_{nfa}$ . We know that there exists an NFA accepting $L(r_{e})$ such that every DFA accepting the same language has size at least exponential in the size of the NFA. Now for one such $r_{e}$ , the number of states required for any DSAFA to accept $L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r_{e})}$ is exponential in the number of states required to accept $L=L_{{{\sf proj}_{\Sigma}}(L)=regexp(r_{e})}$ by a nondeterministic SAFA.
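The relabelling step of this proof can be sketched in a few lines. The encoding of SAFA transitions as tuples `(source, attribute, present, set, insert, target)`, the simulators `safa_accepts` and `nfa_accepts`, and the example NFA (second-to-last symbol is $a$ ) are assumptions made for illustration. Since no transition of the resulting SAFA inserts a value, the set `h1` stays empty and $!p(h_{1})$ always holds.

```python
def nfa_to_safa(nfa_transitions, set_name="h1"):
    """Relabel every NFA transition a as (a, !p(h1), -), as in the proof."""
    return [(src, a, False, set_name, None, dst) for src, a, dst in nfa_transitions]

def nfa_accepts(symbols, transitions, initial, finals):
    states = {initial}
    for a in symbols:
        states = {dst for src, b, dst in transitions if src in states and b == a}
    return bool(states & finals)

def safa_accepts(word, transitions, initial, finals, set_names):
    """Nondeterministic SAFA run; insertion follows the membership test."""
    index = {h: i for i, h in enumerate(set_names)}
    configs = {(initial, tuple(frozenset() for _ in set_names))}
    for attr, value in word:
        successors = set()
        for state, contents in configs:
            for src, a, present, h, ins, dst in transitions:
                if src != state or a != attr:
                    continue
                if (value in contents[index[h]]) != present:
                    continue
                updated = list(contents)
                if ins is not None:
                    updated[index[ins]] = updated[index[ins]] | {value}
                successors.add((dst, tuple(updated)))
        configs = successors
    return any(state in finals for state, _ in configs)

# Example NFA over {a, b}: the second-to-last symbol is a.
nfa = [("s0", "a", "s0"), ("s0", "b", "s0"), ("s0", "a", "s1"),
       ("s1", "a", "s2"), ("s1", "b", "s2")]
safa = nfa_to_safa(nfa)

def agree(word):
    """Check that the SAFA accepts `word` iff the NFA accepts its projection."""
    projection = [attr for attr, _value in word]
    return (safa_accepts(word, safa, "s0", {"s2"}, ["h1"])
            == nfa_accepts(projection, nfa, "s0", {"s2"}))

print(agree([("a", 5), ("b", 5)]), agree([("b", 1), ("a", 2)]))
```

The SAFA has the same states as the NFA, which is the property the exponential gap relies on.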

5 Expressiveness

Let ${\mathcal{L}_{\sf KRFA}}$ , ${\mathcal{L}_{\sf CCA}}$ and ${\mathcal{L}_{\sf CMA}}$ be the sets of all languages accepted by $k$ -register automata, CCA and CMA respectively. We compare the computational power of SAFA with $k$ -register automata. We show that ${\mathcal{L}_{\sf SAFA}}$ and ${\mathcal{L}_{\sf KRFA}}$ are incomparable. Although ${\mathcal{L}_{\sf SAFA}}$ and ${\mathcal{L}_{\sf KRFA}}$ are incomparable, SAFA recognize many important languages which $k$ -register automata also recognize such as $L_{d_{1}}$ : wherein the first data value is repeated, $L_{a\geq 2}$ : wherein attribute $a$ is associated with more than two distinct data values. On the other hand, $k$ -register automata fail to accept languages where we have to store more than $k$ data values such as $L_{{\sf fd}(a)}$ , $L_{{\sf even}(a)}$ : wherein attribute $a$ is associated with an even number of distinct data values [20]. SAFA can accept both these data languages. We also show below that there are languages such as $L_{d}$ : the language of data words, each of which contains the data value $d$ associated with some attribute at some position in the data word, that can be accepted by a $2$ -register automaton but not by SAFA.

Example 5.1.

A $2$ -register automaton can accept the language $L_{d}$ . Consider the 2-register automaton $A$ with $\Sigma=\{a\}$ , $Q=\{q_{0},q_{1}\}$ , $\tau_{0}=\{d,\bot\}$ , $F=\{q_{1}\}$ , $U(q_{0},a)=2$ , $U(q_{1},a)=2$ . The transition relation $\delta$ is defined as:

$\{(q_{0},a,1,q_{1}),(q_{0},a,2,q_{0}),(q_{1},a,1,q_{1}),(q_{1},a,2,q_{1})\}$ . The automaton $A$ accepts $L_{d}$ . For an input word $w$ , the automaton checks whether the current data value of $w$ under the head of $A$ is equal to the content of register $1$ , which holds the data value $d$ from the time of initialization of the registers. If it is equal, the automaton goes to state $q_{1}$ and consumes the word. Since $q_{1}$ is a final state, the word is accepted. If the data value $d$ is not present in $w$ , the automaton remains in state $q_{0}$ and rejects the input word. ∎
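The behaviour of $A$ can be traced with a small simulator. The sketch assumes the following register semantics, which is our reading and is not spelled out in this section: if the current data value is stored in register $k$ , a transition $(q,a,k,q^{\prime})$ is taken; otherwise the value is first written into register $U(q,a)$ and a transition with that register index is taken. The symbol $\bot$ is modelled by `None`, and the concrete value `d = 7` is an arbitrary choice.

```python
def kra_accepts(word, initial_regs, U, delta, initial, finals):
    """Run a k-register automaton under the assumed semantics (see lead-in)."""
    configs = {(initial, tuple(initial_regs))}
    for attr, value in word:
        successors = set()
        for state, regs in configs:
            regs = list(regs)
            if value in regs:
                k = regs.index(value) + 1   # registers are numbered from 1
            elif (state, attr) in U:
                k = U[(state, attr)]        # overwrite register U(q, a)
                regs[k - 1] = value
            else:
                continue                    # no update defined: the run dies
            for src, a, reg, dst in delta:
                if src == state and a == attr and reg == k:
                    successors.add((dst, tuple(regs)))
        configs = successors
    return any(state in finals for state, _ in configs)

d = 7                                        # the distinguished data value
U = {("q0", "a"): 2, ("q1", "a"): 2}
delta = [("q0", "a", 1, "q1"), ("q0", "a", 2, "q0"),
         ("q1", "a", 1, "q1"), ("q1", "a", 2, "q1")]
accepts = lambda w: kra_accepts(w, (d, None), U, delta, "q0", {"q1"})

print(accepts([("a", 3), ("a", 7), ("a", 5)]), accepts([("a", 3), ("a", 5)]))
```

The first word contains the value 7 and is accepted; the second does not and the automaton stays in $q_{0}$ .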

Theorem 5.2.

${\mathcal{L}_{\sf SAFA}}$ and ${\mathcal{L}_{\sf KRFA}}$ are incomparable.

Proof 5.3.

We first show by contradiction that no SAFA can accept $L_{d}$ . Suppose there exists a SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ which accepts $L_{d}$ . Now consider a word $w\in L_{d}$ where the data value $d$ has occurred at the first position only. Let the sequence of transitions that $M$ goes through to accept $w$ be $T_{w}$ . The first transition $t_{1}$ that $T_{w}$ executes cannot be a transition with $p(h_{i})$ , $h_{i}\in H$ because $t_{1}$ is the first transition of $T_{w}$ and there cannot be any insertion to set $h_{i}$ prior to it. Therefore, $t_{1}$ is a transition with $!p(h_{i})$ , $h_{i}\in H$ and so $t_{1}$ can consume any other data value $d^{\prime}$ which is not present in $w$ . As $d$ does not occur anywhere else in $w$ , it is safe to say that all other transitions in $T_{w}$ with $p(h_{i})$ , $h_{i}\in H$ have data values other than $d$ present in their respective sets. Thus, if $M$ accepts $w=(a,d)x$ , where $x\in(\Sigma\times D)^{*}$ and $x$ does not have value $d$ in it, then $M$ also accepts $w^{\prime}=(a,d^{\prime})x$ , and $w^{\prime}$ does not have data value $d$ in it, which is a contradiction. From Example 5.1, Example 3.3 and the fact that $k$ -register automata cannot accept $L_{\sf fd(a)}$ [20] and the above, we conclude ${\mathcal{L}_{\sf SAFA}}$ and ${\mathcal{L}_{\sf KRFA}}$ are incomparable.

If we equip SAFA with initialization, that is, the sets of SAFA can be initialized prior to the beginning of computation, then SAFA can accept the language $L_{d}$ . Even then, $k$ -register automata and SAFA with initialization are incomparable as shown below.

Example 5.4.

A $2$ -register automaton can accept the language $L=\{w\in((\{a\}\times\{d\})^{2})^{*}|d\in D\}$ . Consider the $2$ -register automaton $A$ with $\Sigma=\{a\}$ , $Q=\{q_{0},q_{1},q_{2}\}$ , $\tau_{0}=\{\bot,\bot\}$ , $F=\{q_{0}\}$ , $U(q_{0},a)=1$ , $U(q_{1},a)=2$ .

The transition relation $\delta$ is defined as: $\{(q_{0},a,1,q_{1}),(q_{1},a,1,q_{0}),(q_{1},a,2,q_{2})\}$ . The automaton $A$ accepts $L=\{w\in((\{a\}\times\{d\})^{2})^{*}|d\in D\}$ . For an input word $w$ , the automaton in state $q_{0}$ reads the first data value of a data pair, inserts it into register $1$ and goes to state $q_{1}$ . In state $q_{1}$ , if the second data value of the data pair is the same as the first, the automaton goes back to state $q_{0}$ to read the next data value pair. Otherwise, the automaton goes to the dead state $q_{2}$ and rejects the input word $w$ . ∎
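The automaton of this example can be run on sample words with a short simulator that uses $\delta$ , $U$ and $\tau_{0}$ exactly as listed above. The register semantics is an assumption on our side: a value stored in register $k$ triggers a transition $(q,a,k,q^{\prime})$ , and a value stored in no register is first written into register $U(q,a)$ . The symbol $\bot$ is modelled by `None`.

```python
def kra_accepts(word, initial_regs, U, delta, initial, finals):
    """Run a k-register automaton under the assumed semantics (see lead-in)."""
    configs = {(initial, tuple(initial_regs))}
    for attr, value in word:
        successors = set()
        for state, regs in configs:
            regs = list(regs)
            if value in regs:
                k = regs.index(value) + 1   # registers are numbered from 1
            elif (state, attr) in U:
                k = U[(state, attr)]        # overwrite register U(q, a)
                regs[k - 1] = value
            else:
                continue                    # no update defined: the run dies
            for src, a, reg, dst in delta:
                if src == state and a == attr and reg == k:
                    successors.add((dst, tuple(regs)))
        configs = successors
    return any(state in finals for state, _ in configs)

U = {("q0", "a"): 1, ("q1", "a"): 2}
delta = [("q0", "a", 1, "q1"), ("q1", "a", 1, "q0"), ("q1", "a", 2, "q2")]
accepts = lambda w: kra_accepts(w, (None, None), U, delta, "q0", {"q0"})

paired = [("a", 1), ("a", 1), ("a", 2), ("a", 2)]
broken = [("a", 1), ("a", 2)]
print(accepts(paired), accepts(broken))
```

A single register suffices for the comparison because each pair is checked and then forgotten, whereas a SAFA keeps every inserted value in its sets.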

Let ${\mathcal{L}_{\sf SAFA_{init}}}$ denote the set of all languages accepted by nondeterministic SAFA with initialization. We have the following.

Theorem 5.5.

${\mathcal{L}_{\sf SAFA_{init}}}$ and ${\mathcal{L}_{\sf KRFA}}$ are incomparable.

Proof 5.6.

Consider the language $L=\{w\in((\{a\}\times\{d\})^{2})^{*}|d\in D\}$ i.e. $L$ is the set of all data words where every data value appears in pairs. We show that there exists no SAFA with initialization which accepts $L$ . We prove by contradiction. Assume that there exists a SAFA $M=(Q,\{a\}\times D,q_{0},F,H,\delta)$ with $|H|=k>0$ and whose sets can be initialized such that $M$ accepts $L$ . Then $M$ must accept the following word $w\in L$ where $w=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{k+1})(a,d_{k+1})$ and $d_{1},...,d_{k+1}\in D$ are all distinct. In order to accept $w$ , the SAFA $M$ must go through a sequence $T=t_{1_{d_{1}}}t_{2_{d_{1}}}...t_{1_{d_{k+1}}}t_{2_{d_{k+1}}}$ of transitions to completely consume $w$ and end in an accepting state. Here $t_{1_{d_{i}}}$ consumes the first data item of the $i^{th}$ data value pair and $t_{2_{d_{i}}}$ consumes the second data item of the $i^{th}$ data value pair and $1\leq i\leq k+1$ .

•

The transitions $t_{2_{d_{i}}}$ must be of the form $(a,p(h_{\ell}),-)$ or $(a,p(h_{\ell}),{\sf ins}(h_{j}))$ where $h_{\ell},h_{j}\in H$ . This is essential because if $t_{2_{d_{i}}}$ is of the form $(a,!p(h_{\ell}),-)$ or $(a,!p(h_{\ell}),{\sf ins}(h_{j}))$ then instead of consuming $d_{i}$ it can also consume successfully a new data value $d_{new}\in D$ which is not present in $w$ . It is always possible to get such a data value as $D$ is countably infinite. The SAFA $M$ will then accept the data word $w^{\prime}=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{new})\cdots(a,d_{k+1})(a,d_{k+1})$ which is not in $L$ .
•

The transitions $t_{1_{d_{i}}}$ must insert the data value they read into a set in $M$ , unless the data value is already present in a set due to initialization, in which case they may or may not insert it. Recall that for the transition $t_{2_{d_{i}}}$ to consume the data value $d_{i}$ , the data value $d_{i}$ must have been inserted in a set when it was first encountered, or it must already be present due to initialization.
•

Since the number of distinct data values in $w$ is greater than the number of sets in $M$ and all the distinct data values are inserted in the sets in $M$ , by the pigeonhole principle, there are two distinct data values $d_{i}$ and $d_{j}$ , with $i<j$ , which are inserted in the same set $h_{\ell}\in H$ . Thus, if $M$ accepts the data word $w=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{j})(a,d_{j})\cdots(a,d_{k+1})(a,d_{k+1})$ , then $M$ will also accept the data word $w^{\prime}=(a,d_{1})(a,d_{1})\cdots(a,d_{i})(a,d_{i})\cdots(a,d_{j})(a,d_{i})\cdots(a,d_{k+1})(a,d_{k+1})$ , which is not in $L$ .

From Example 5.4, Example 3.3, the fact that $k$ -register automata cannot accept $L_{\sf fd(a)}$ [20], and the argument above, we conclude that ${\mathcal{L}_{\sf SAFA_{init}}}$ and ${\mathcal{L}_{\sf KRFA}}$ are incomparable.
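The pigeonhole step of the proof can be illustrated on a small instance. The following Python sketch is a hypothetical simulator for nondeterministic SAFA, written only from the transition forms used in this proof: a test $p(h)$ or $!p(h)$ on one set and an optional insertion ${\sf ins}(h^{\prime})$ . The function name `safa_accepts`, the tuple encoding of transitions, the 0-based set indices, and the choice of empty initial sets are all assumptions made for illustration.

```python
from typing import FrozenSet, Hashable, Iterable, Optional, Sequence, Set, Tuple

# (source, attribute, test, tested set index, inserted set index or None, target)
# test is "p" (value must be present) or "!p" (value must be absent).
Transition = Tuple[str, str, str, int, Optional[int], str]
Config = Tuple[str, Tuple[FrozenSet[Hashable], ...]]


def safa_accepts(
    transitions: Iterable[Transition],
    initial: str,
    finals: Set[str],
    k: int,
    word: Sequence[Tuple[str, Hashable]],
) -> bool:
    """Explore all runs of a k-set SAFA on a data word (sets start empty)."""
    transitions = list(transitions)
    configs: Set[Config] = {(initial, tuple(frozenset() for _ in range(k)))}
    for attribute, value in word:
        successors: Set[Config] = set()
        for state, sets in configs:
            for src, a, test, h, ins, dst in transitions:
                if src != state or a != attribute:
                    continue
                present = value in sets[h]
                if (test == "p") != present:
                    continue  # the membership test on set h fails
                new_sets = list(sets)
                if ins is not None:
                    new_sets[ins] = new_sets[ins] | {value}
                successors.add((dst, tuple(new_sets)))
        configs = successors
    return any(state in finals for state, _ in configs)


# A one-set candidate for L: insert the first item of a pair, test the second.
candidate = [
    ("q0", "a", "!p", 0, 0, "q1"),
    ("q1", "a", "p", 0, None, "q0"),
]
w = [("a", 1), ("a", 1), ("a", 2), ("a", 2)]        # in L
w_prime = [("a", 1), ("a", 1), ("a", 2), ("a", 1)]  # not in L
```

With $k=1$ and two distinct data values, both values end up in the same set, so the candidate automaton accepts `w` as well as `w_prime`, mirroring the contradiction derived in the proof.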

We also note that SAFA and $k$ -register automata have the same complexity for the nonemptiness and the membership problems. Similar to SAFA, CCA and CMA also accept data languages such as $L_{\sf fd(a)}$ and $L_{{\sf even}(a)}$ , which $k$ -register automata cannot accept, but their decision problems have higher complexity [5, 19]. We can show that the class of languages accepted by SAFA is strictly contained in the class of languages accepted by CCA.

Theorem 5.7.

${\mathcal{L}_{\sf SAFA}}\subsetneq{\mathcal{L}_{\sf CCA}}$

Proof 5.8.

For every SAFA $M=(Q,\Sigma\times D,q_{0},F,H,\delta)$ with $|H|=k$ accepting a language $L$ , we can construct a $k$ -bag CCA $A=(Q,\Sigma,\Delta,\{q_{0}\},F)$ which accepts the same language $L$ in the following manner:

•

The set $Q$ of states, the set $F$ of final states and the initial state are the same for both the SAFA $M$ and the $k$ -bag CCA $A$ .
•
The transitions in the transition relation $\delta$ of SAFA $M$ are mapped to transitions in the transition relation $\Delta$ of the $k$ -bag CCA $A$ as follows:
- –
  
  For every transition of the form $(q_{i},a,p(h_{i}),-,q_{j})$ in the transition relation $\delta$ of SAFA $M$ , which takes $M$ from state $q_{i}$ to state $q_{j}$ where $a\in\Sigma$ , $q_{i},q_{j}\in Q$ and $h_{i}\in H$ , we have the transition $(q_{i},a,c_{1},\ldots,c_{k},instr_{1},\ldots,instr_{k},q_{j})$ in the transition relation $\Delta$ of the $k$ -bag CCA $A$ , where $c_{i}$ is $(=,1)$ , all other $c_{\ell}$ with $\ell\neq i$ are $(\geq,0)$ , and $instr_{\ell}=[0]$ for all $1\leq\ell\leq k$ . If a data value $d$ is read on the transition in SAFA $M$ , then the CCA $A$ checks whether $\beta_{i}(d)=1$ , which denotes that $A$ has seen the data value $d$ before and has recorded it in the $i^{th}$ bag.
- –
  
  For every transition of the form $(q_{i},a,!p(h_{i}),-,q_{j})$ in the transition relation $\delta$ of SAFA $M$ , which takes $M$ from state $q_{i}$ to state $q_{j}$ where $a\in\Sigma$ , $q_{i},q_{j}\in Q$ and $h_{i}\in H$ , we have the transition $(q_{i},a,c_{1},\ldots,c_{k},instr_{1},\ldots,instr_{k},q_{j})$ in the transition relation $\Delta$ of the $k$ -bag CCA $A$ , where $c_{i}$ is $(=,0)$ , all other $c_{\ell}$ with $\ell\neq i$ are $(\geq,0)$ , and $instr_{\ell}=[0]$ for all $1\leq\ell\leq k$ . If a data value $d$ is read on the transition in SAFA $M$ , then the CCA $A$ checks whether $\beta_{i}(d)=0$ , which denotes that $A$ has either not seen the data value $d$ or has not recorded it in the $i^{th}$ bag.
- –
  
  For every transition of the form $(q_{i},a,p(h_{i}),{\sf ins}(h_{j}),q_{j})$ in the transition relation $\delta$ of the SAFA $M$ , which takes $M$ from state $q_{i}$ to state $q_{j}$ where $a\in\Sigma$ , $q_{i},q_{j}\in Q$ and $h_{i},h_{j}\in H$ , we introduce the transition $(q_{i},a,c_{1},\ldots,c_{k},instr_{1},\ldots,instr_{k},q_{j})$ in the transition relation $\Delta$ of the $k$ -bag CCA $A$ , where $c_{i}$ is $(=,1)$ , all other $c_{\ell}$ with $\ell\neq i$ are $(\geq,0)$ , $instr_{j}=[\downarrow,1]$ , and $instr_{\ell}=[0]$ for all $\ell\neq j$ . Similar to SAFA $M$ inserting the data value $d$ in the set $h_{j}$ , the $k$ -bag CCA records the data value by setting $\beta_{j}(d)=1$ , while the values for all other bags remain unchanged.
- –
  
  For every transition of the form $(q_{i},a,!p(h_{i}),{\sf ins}(h_{j}),q_{j})$ in the transition relation $\delta$ of SAFA $M$ , which takes $M$ from state $q_{i}$ to state $q_{j}$ where $a\in\Sigma$ , $q_{i},q_{j}\in Q$ and $h_{i},h_{j}\in H$ , we introduce the transition $(q_{i},a,c_{1},\ldots,c_{k},instr_{1},\ldots,instr_{k},q_{j})$ in the transition relation $\Delta$ of the $k$ -bag CCA $A$ , where $c_{i}$ is $(=,0)$ , all other $c_{\ell}$ with $\ell\neq i$ are $(\geq,0)$ , $instr_{j}=[\downarrow,1]$ , and $instr_{\ell}=[0]$ for all $\ell\neq j$ . Similar to SAFA $M$ inserting the data value $d$ in the set $h_{j}$ , the $k$ -bag CCA records the data value by setting $\beta_{j}(d)=1$ , while the values for all other bags remain unchanged.

The $k$ -bag CCA simulates the SAFA $M$ . Given a data word $w$ accepted by the SAFA $M$ , there is a sequence of transitions that takes SAFA $M$ from an initial state to a final state. The $k$ -bag CCA simulating the SAFA $M$ can replicate the corresponding sequence of transitions to accept $w$ . Given a data word $w$ not accepted by the SAFA $M$ , there is no sequence of transitions that takes the SAFA $M$ from an initial state to a final state. The $k$ -bag CCA $A$ simulating the SAFA $M$ also does not have any sequence of transitions which takes it from an initial state to a final state and thus the $k$ -bag CCA $A$ also rejects $w$ .
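The four cases of the transition mapping in this construction can be summarized in a short Python sketch. The function name `safa_to_cca_transition` and the tuple encodings are assumptions made for illustration: `("keep", 0)` stands for the instruction $[0]$ , `("reset", 1)` stands for $[\downarrow,1]$ , and sets and bags are indexed from 0.

```python
from typing import Optional, Tuple

Constraint = Tuple[str, int]   # ("=", 1), ("=", 0) or (">=", 0)
Instruction = Tuple[str, int]  # ("keep", 0) for [0], ("reset", 1) for [\downarrow, 1]
CCATransition = Tuple[
    str, str, Tuple[Constraint, ...], Tuple[Instruction, ...], str
]


def safa_to_cca_transition(
    src: str,
    attribute: str,
    test: str,                # "p" or "!p"
    tested: int,              # index of the set h_i that is tested
    inserted: Optional[int],  # index of the set h_j for ins(h_j), or None for "-"
    dst: str,
    k: int,                   # number of sets of the SAFA = number of bags
) -> CCATransition:
    """Map one SAFA transition to a k-bag CCA transition."""
    # Every bag is unconstrained except the one matching the tested set.
    constraints = [(">=", 0)] * k
    constraints[tested] = ("=", 1) if test == "p" else ("=", 0)
    # Every bag is left unchanged except the one matching the inserted set.
    instructions = [("keep", 0)] * k
    if inserted is not None:
        instructions[inserted] = ("reset", 1)
    return (src, attribute, tuple(constraints), tuple(instructions), dst)
```

For example, a SAFA transition that tests $!p$ on the first set and inserts into the second set of a two-set SAFA is mapped to a CCA transition with constraints $(=,0),(\geq,0)$ and instructions $[0],[\downarrow,1]$ .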

We know from [19] that for every $k$ -bag CCA there exists a one-bag CCA which accepts the same language. Therefore, for every SAFA $M$ there exists a CCA $A$ which accepts the same language. We now show that CCA are strictly more expressive than SAFA. SAFA accept the languages $L_{{\sf fd}(a)}$ and $L_{a\exists b}$ but not the language $L_{{\sf fd}(a)}\cap L_{a\exists b}$ . Since there exists a CCA for every language accepted by a SAFA, we have $L_{{\sf fd}(a)},L_{a\exists b}\in{\mathcal{L}_{\sf CCA}}$ . Since CCA are closed under intersection [19], we have that $L_{{\sf fd}(a)}\cap L_{a\exists b}\in{\mathcal{L}_{\sf CCA}}$ . Therefore, ${\mathcal{L}_{\sf SAFA}}\subsetneq{\mathcal{L}_{\sf CCA}}$ .

We know from [19] that ${\mathcal{L}_{\sf CCA}}\subsetneq{\mathcal{L}_{\sf CMA}}$ . Together with the above result, we get the following corollary.

Corollary 5.9.

${\mathcal{L}_{\sf SAFA}}\subsetneq{\mathcal{L}_{\sf CMA}}$ .


6 Conclusion

In this paper, we introduce set augmented finite automata, which use a finite set of sets of data values to accept data languages. We have shown examples of several data languages that can be accepted by our model. We compare the language acceptance capabilities of SAFA with those of $k$ -register automata, CCA, and CMA. The computational power of SAFA and the low complexity of their nonemptiness and membership problems make them a useful tool for modeling and analysis of data languages. We show that, similar to register automata, the universality problem for SAFA is undecidable. We also study the deterministic variant of SAFA, which is closed under complementation and hence has decidable universality. We believe our model is robust and can also be extended to infinite words. This model opens up some interesting avenues for future research. We would like to explore augmentations of SAFA with Boolean combinations of tests and the feature of updating multiple sets simultaneously. This may lead to a model that is better behaved with respect to closure properties.

References

[1] Jean-Michel Autebert, Joffroy Beauquier, and Luc Boasson. Langages sur des alphabets infinis. Discrete Applied Mathematics, 2(1):1–20, 1980.
[2] Christel Baier and Joost-Pieter Katoen. Principles of model checking. MIT press, 2008.
[3] Ansuman Banerjee, Kingshuk Chatterjee, and Shibashis Guha. Set augmented finite automata over infinite alphabets. In Frank Drewes and Mikhail Volkov, editors, Developments in Language Theory - 27th International Conference, DLT 2023, Umeå, Sweden, June 12-16, 2023, Proceedings, volume 13911 of Lecture Notes in Computer Science, pages 36–50. Springer, 2023.
[4] Alexis Bès. An application of the Feferman-Vaught theorem to automata and logics for words over an infinite alphabet. Logical Methods in Computer Science, 4, 2008.
[5] Henrik Björklund and Thomas Schwentick. On notions of regularity for data languages. Theoretical Computer Science, 411(4-5):702–715, 2010.
[6] Mikolaj Bojanczyk, Anca Muscholl, Thomas Schwentick, Luc Segoufin, and Claire David. Two-variable logic on words with data. In 21st Annual IEEE Symposium on Logic in Computer Science (LICS’06), pages 7–16. IEEE, 2006.
[7] Benedikt Bollig. An automaton over data words that captures EMSO logic. In International Conference on Concurrency Theory, pages 171–186. Springer, 2011.
[8] Edward YC Cheng and Michael Kaminski. Context-free languages over infinite alphabets. Acta Informatica, 35(3):245–267, 1998.
[9] W. Czerwiński and L. Orlikowski. Reachability in vector addition systems is Ackermann-complete. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 1229–1240, 2022. doi:10.1109/FOCS52979.2021.00120.
[10] Jürgen Dassow and György Vaszil. P finite automata and regular languages over countably infinite alphabets. In International Workshop on Membrane Computing, pages 367–381. Springer, 2006.
[11] Stéphane Demri and Ranko Lazić. LTL with the freeze quantifier and register automata. ACM Transactions on Computational Logic (TOCL), 10(3):1–30, 2009.
[12] Diego Figueira. Alternating register automata on finite words and trees. Logical Methods in Computer Science, 8, 2012.
[13] Orna Grumberg, Orna Kupferman, and Sarai Sheinvald. Variable automata over infinite alphabets. In International Conference on Language and Automata Theory and Applications, pages 561–572. Springer, 2010.
[14] Radu Iosif and Xiao Xu. Abstraction refinement for emptiness checking of alternating data automata. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 93–111. Springer, 2018.
[15] Neil D Jones. Space-bounded reducibility among combinatorial problems. Journal of Computer and System Sciences, 11(1):68–85, 1975.
[16] Michael Kaminski and Nissim Francez. Finite-memory automata. Theoretical Computer Science, 134(2):329–363, 1994.
[17] Michael Kaminski and Tony Tan. Regular expressions for languages over infinite alphabets. Fundamenta Informaticae, 69(3):301–318, 2006.
[18] S. R. Kosaraju. Decidability of reachability in vector addition systems (preliminary version). In Harry R. Lewis, Barbara B. Simons, Walter A. Burkhard, and Lawrence H. Landweber, editors, Proceedings of the 14th Annual ACM Symposium on Theory of Computing, May 5-7, 1982, San Francisco, California, USA, pages 267–281. ACM, 1982.
[19] Amaldev Manuel and Ramaswamy Ramanujam. Class counting automata on datawords. International Journal of Foundations of Computer Science, 22(04):863–882, 2011.
[20] Amaldev Manuel and Ramaswamy Ramanujam. Automata over infinite alphabets. In Modern applications of automata theory, pages 529–553. World Scientific, 2012.
[21] E. W. Mayr. An algorithm for the general Petri net reachability problem. SIAM J. Comput., 13(3):441–460, 1984.
[22] Frank Neven. Automata, logic, and XML. In International Workshop on Computer Science Logic, pages 2–26. Springer, 2002.
[23] Frank Neven, Thomas Schwentick, and Victor Vianu. Finite state machines for strings over infinite alphabets. ACM Transactions on Computational Logic (TOCL), 5(3):403–435, 2004.
[24] Hiroshi Sakamoto and Daisuke Ikeda. Intractability of decision problems for finite-memory automata. Theoretical Computer Science, 231(2):297–308, 2000.
[25] Tony Tan. On pebble automata for data languages with decidable emptiness problem. Journal of Computer and System Sciences, 76(8):778–791, 2010.
