Synthesizing Transducers from Complex Specifications
Abstract
Automating string transformations has been one of the killer applications of program synthesis. Existing synthesizers that solve this problem produce programs in domain-specific languages (DSLs) that are engineered to help the synthesizer, and therefore lack nice formal properties. This limitation prevents the synthesized programs from being used in verification applications (e.g., to check complex pre-post conditions) and makes the synthesizers hard to modify due to their reliance on the given DSL.
We present a constraint-based approach to synthesizing transducers, a well-studied model with strong closure and decidability properties. Our approach handles three types of specifications: (i) input-output examples, (ii) input-output types expressed as regular languages, and (iii) input/output distances that bound how many characters the transducer can modify when processing an input string. Our work is the first to support such complex specifications and it does so by using the algorithmic properties of transducers to generate constraints that can be solved using off-the-shelf SMT solvers. Our synthesis approach can be extended to many transducer models and it can be used, thanks to closure properties of transducers, to compute repairs for partially correct transducers.
1 Introduction
String transformations are used in data transformations [1], sanitization of untrusted inputs [2, 3], and many other domains [4]. Because in these domains bugs may cause serious security vulnerabilities [2], there has been increased interest in building tools that can help programmers verify [2, 3] and synthesize [1, 5, 6] string transformations.
Techniques for verifying string transformations rely on automata-theoretic approaches that provide powerful decidability properties [2]. On the other hand, techniques for synthesizing string transformations rely on domain-specific languages (DSLs) [1, 5]. These DSLs are designed to make synthesis practical and have to give up the closure and decidability properties enabled by automata-theoretic models. The disconnect between the two approaches raises a natural question: Can one synthesize automata-based models and therefore retain and leverage their elegant properties?
Transducers are a natural automata-based formal model for synthesizing string transformations. A finite state transducer (FT) is an automaton where each transition reads an input character and outputs a string of output characters. For instance, Figure 1 shows a transducer that ‘escapes’ instances of the " character. So, on input a"\"a, the transducer outputs the string a\"\\"a. FTs have found wide adoption in a variety of domains [7, 3] because of their many desirable properties (e.g., decidable equivalence check and closure under composition [8]). There has been increasing work on building SMT solvers for strings that support transducers; the Ostrich tool [9] allows a user to write programs in SMT where string-transformations are modelled using transducers. One can then write constraints over such programs and use an SMT solver to automatically check for satisfiability or prove unsatisfiability of those constraints. For example, given a program like the following:
y = escapeQuotes(x)
z = escapeQuotes(y)
assert(y == z) // Checking idempotence
one can use Ostrich to write a set of constraints and use them to prove whether the assertion holds.
However, to do so, one needs to first write a transducer that implements the function escapeQuotes, and writing transducers by hand is a cumbersome and error-prone task.
Figure 1(b). Specification for escapeQuotes.
Examples: {a"a ↦ a\"a, a\\a ↦ a\\a, a\a ↦ a\a, a\"a ↦ a\"a, \ ↦ \}
Types: [a"]∗\?|([a"]∗\[a"\][a"]∗)∗ ↦ a∗\?|(a∗\[a"\]a∗)∗
Distance: at most 1 edit per input character
In this paper, we present a technique for synthesizing transducers from high-level specifications. We use three different specification mechanisms that together quickly yield desirable transducers: input-output examples, input-output types, and input-output distances. When provided with the specification in Figure 1(b), our approach yields the transducer in Figure 1(a). While none of the three specification mechanisms is effective in isolation, they work well together. Input-output examples are a natural specification mechanism that is easy to provide, but they only capture finitely many inputs. Similarly, input-output types are a natural way to prevent a transducer from generating undesired strings and can often be obtained from function/API specifications. Last, input-output distances are a natural way to specify how much of the input string should be preserved by the transformation.
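To make the escapeQuotes example concrete, here is a minimal sketch of a deterministic transducer, written as plain transition dictionaries in Python, that is consistent with the five examples in Figure 1(b). The two-state structure and state names are assumptions for illustration; the machine in Figure 1(a) may be drawn differently.

```python
# (state, char) -> next state; q1 is entered after reading a backslash.
DELTA = {
    ("q0", '"'): "q0", ("q0", "\\"): "q1", ("q0", "a"): "q0",
    ("q1", '"'): "q0", ("q1", "\\"): "q0", ("q1", "a"): "q0",
}
# (state, char) -> output string; a bare quote is escaped, an
# already-escaped quote (one preceded by a backslash) is left alone.
GAMMA = {
    ("q0", '"'): '\\"', ("q0", "\\"): "\\", ("q0", "a"): "a",
    ("q1", '"'): '"',   ("q1", "\\"): "\\", ("q1", "a"): "a",
}

def run(w, q0="q0"):
    """Run the transducer on w, concatenating the transition outputs."""
    q, out = q0, []
    for a in w:
        out.append(GAMMA[(q, a)])
        q = DELTA[(q, a)]
    return "".join(out)
```

Running this sketch on each example input from Figure 1(b) reproduces the corresponding example output.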
We show that if the size of the transducers is fixed, all such specifications can be encoded as a set of constraints whose solution directly provides a transducer. While the constraints for examples are fairly straightforward, we introduce two new ideas for encoding types and distances. To encode types and distances, we show that one can use constraints to “guess” the simulation relation and the invariants necessary to prove that the transducer has the given type and respects the given distance constraint.
Because our constraint-based approach is based on decision procedures and is modular, it can support more complex models of transducers: (i) Symbolic Finite Transducers (s-FTs), which support large alphabets [10], and (ii) FTs with lookahead, which can express functions that otherwise require non-determinism. In addition, closure properties of transducers allow us to reduce repair problems for string transformations to our synthesis problem.
Contributions: We make the following contributions.
• A constraint-based synthesis algorithm for synthesizing transducers from complex specifications (Sec. 3).
• Extensions of our synthesis algorithm to more complex models—e.g., symbolic transducers and transducers with lookahead—and problems—e.g., transducer repair—that showcase the flexibility of our approach and the power of working with transducers, which enjoy strong theoretical properties—unlike domain-specific languages (Sec. 4).
• astra: a tool that can synthesize and repair transducers and compares well with a state-of-the-art tool for synthesizing string transformations (Sec. 5).
Proofs and additional results are available in the appendix.
2 Transducer Synthesis Problem
In this section, we define the transducer synthesis problem.
A deterministic finite automaton (DFA) over an alphabet Σ is a tuple D = (Q, δ, q0, F): Q is the set of states, δ : Q × Σ → Q is the transition function, q0 ∈ Q is the initial state, and F ⊆ Q is the set of final states. The extended transition function δ∗ : Q × Σ∗ → Q is defined as δ∗(q, ε) = q and δ∗(q, aw) = δ∗(δ(q, a), w). We say that D accepts a string w if δ∗(q0, w) ∈ F. The regular language L(D) is the set of strings accepted by the DFA D.
A total finite state transducer (FT) is a tuple T = (Q, δ, γ, q0), where Q is the set of states and q0 ∈ Q is the initial state. Transducers have two transition functions: δ : Q × Σ → Q defines the target state, while γ : Q × Σ → Σ∗ defines the output string of each transition. The extended function δ∗ for states is defined analogously to the extended transition function for a DFA. The extended function γ∗ for output strings is defined as γ∗(q, ε) = ε and γ∗(q, aw) = γ(q, a) · γ∗(δ(q, a), w). Given a string w, we use T(w) to denote γ∗(q0, w), i.e., the output string generated by T on w. Given two DFAs I and O, we write T : I → O for a transducer T iff for every string w in L(I), the output string T(w) belongs to L(O).
An edit operation on a string is either an insertion/deletion of a character, or a replacement of a character with a different one. For example, editing the string ab to the string acb requires one edit operation: inserting a c after the a. The edit distance ed(w1, w2) between two strings w1 and w2 is the minimum number of edit operations required to reach w2 from w1. We use |w| to denote the length of a string w. The mean edit distance between two strings w1 and w2 is defined as ed(w1, w2)/|w1|. For example, the mean edit distance from ab to acb is 1/2.
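As a reference for these definitions, the edit distance can be computed with the standard Levenshtein dynamic program (a sketch; insertions, deletions, and replacements all have cost 1):

```python
def edit_distance(w1, w2):
    """Levenshtein distance via a single-row dynamic program."""
    m, n = len(w1), len(w2)
    d = list(range(n + 1))            # distances from w1[:0] to every prefix of w2
    for i in range(1, m + 1):
        prev, d[0] = d[0], i          # prev holds the diagonal entry d[i-1][j-1]
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,      # delete w1[i-1]
                       d[j - 1] + 1,  # insert w2[j-1]
                       prev + (w1[i - 1] != w2[j - 1]))  # replace (or match)
            prev = cur
    return d[n]

def mean_edit_distance(w1, w2):
    """Edit distance normalized by the length of the input string."""
    return edit_distance(w1, w2) / len(w1)
```

For instance, edit_distance("ab", "acb") is 1 and mean_edit_distance("ab", "acb") is 1/2, matching the example above.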
We can now formulate the transducer synthesis problem. We assume a fixed alphabet Σ. If the specification requires that a string w is translated to u, we write that as w ↦ u.
Problem Statement 1 (Transducer Synthesis).
The transducer synthesis problem has the following inputs and output:
Inputs
• A number of states k and an upper bound l on the length of the output of each transition.
• A set of input-output examples E = {w1 ↦ u1, …, wn ↦ un}.
• Input-output types I and O, given as DFAs.
• A positive upper bound μ on the mean edit distance.
Output A total transducer T with k states such that:
• Every transition of T has an output with length at most l, i.e., for every q ∈ Q and a ∈ Σ, |γ(q, a)| ≤ l.
• T is consistent with the examples: for every wi ↦ ui ∈ E, T(wi) = ui.
• T is consistent with the input-output types, i.e., T : I → O.
• For every string w ∈ L(I), ed(w, T(w))/|w| ≤ μ.
The synthesis problem that we present here is for FTs, and in Section 3, we provide a sound algorithm to solve it using a system of constraints. One of our key contributions is that our encoding can be easily adapted to synthesizing richer models than FTs (e.g., symbolic transducers [8] and transducers with regular lookahead), while still using the same encoding building blocks (Section 4).
3 Constraint-based Transducer Synthesis
In this section, we present a way to generate constraints to solve the transducer synthesis problem defined in Section 2. The synthesis problem can then be solved by invoking a Satisfiability Modulo Theories (SMT) solver on the constraints.
We use a constraint encoding, rather than a direct algorithmic approach, because of the multiple objectives to be satisfied. Synthesizing a transducer that translates a set of input-output examples is already an NP-complete problem [11]. On top of that, we also need to handle input-output types and distances. Our encoding is divided into three parts, one for each objective, which are presented in the following subsections. This division makes our encoding very modular and programmable. In Section 4 we show how it can be adapted to different transducer models and problems. We include a brief description of the size of the constraint encoding in Section B of the appendix.
3.1 Representing Transducers
The transducer T we are synthesizing has k states q0, …, qk−1, where k is part of the problem input, and q0 is the initial state of T.
We illustrate how a transition from a state q reading a character a is represented in our encoding. The target state is captured using an uninterpreted function d : Q × Σ → Q, e.g., d(q, a) = q′. Representing the output of the transition is trickier because its length is not known a priori. The synthesis problem however provides an output bound l, which allows us to limit the number of characters that may appear in the output. We use an uninterpreted function outj : Q × Σ → Σ to represent the j-th character in the output string (for 0 ≤ j < l); for a transition outputting the two-character string b0b1, we have out0(q, a) = b0 and out1(q, a) = b1. Since an output string's length can be smaller than l, we use an additional uninterpreted function len : Q × Σ → {0, …, l} to model the length of a transition's output; in our example len(q, a) = 2.
We say that an assignment to the above uninterpreted functions extends to a transducer T when T is the transducer obtained by instantiating the transition and output functions from d, outj, and len as described above.
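As an illustration of how an assignment extends to a transducer, the following sketch builds concrete transition and output maps from hypothetical values of d, outj, and len. The values shown are invented for a one-state quote-escaping machine with output bound l = 2; they are not the output of an actual solver run.

```python
# Hypothetical model values, as an SMT solver might return them.
d    = {(0, 'a'): 0, (0, '"'): 0}                 # d(q, a): target state
outc = {(0, 'a', 0): 'a', (0, 'a', 1): 'a',       # out_j(q, a): j-th output char
        (0, '"', 0): '\\', (0, '"', 1): '"'}
olen = {(0, 'a'): 1, (0, '"'): 2}                 # len(q, a): output length

def extend_to_transducer():
    """Instantiate concrete delta/gamma maps from the model values."""
    delta, gamma = {}, {}
    for (q, a), q2 in d.items():
        delta[(q, a)] = q2
        gamma[(q, a)] = "".join(outc[(q, a, j)] for j in range(olen[(q, a)]))
    return delta, gamma

def run_model(w):
    """Run the extended transducer on w."""
    delta, gamma = extend_to_transducer()
    q, out = 0, []
    for a in w:
        out.append(gamma[(q, a)])
        q = delta[(q, a)]
    return "".join(out)
```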
3.2 Input-output Examples
Goal: For each input-output example w ↦ u ∈ E, T should translate w to u.
Translating w to the correct output string means that T(w) = u. Generating constraints that capture this behavior of T on an example is challenging because we do not know a priori what parts of u are produced by what steps of the transducer's run: two candidate runs may traverse exactly the same states while producing different parts of the output string at each step. Intuitively, we need a way to "track" how much output the transducer has produced before processing the i-th character in the input and what state it has landed in. For every input example w ↦ u such that w = a1⋯an and u = b1⋯bm, we introduce an uninterpreted function cw : {0, …, n} → Q × {0, …, m} such that cw(i) = (q, j) iff after reading a1⋯ai, the transducer has produced the output b1⋯bj and reached state q—i.e., δ∗(q0, a1⋯ai) = q and γ∗(q0, a1⋯ai) = b1⋯bj.
We now describe the constraints that capture the behavior of cw. Constraint 1 states that a configuration must start at the initial state and at position 0 in the output.
cw(0) = (q0, 0)  (1)
Constraint 2 captures how the configuration is updated when reading the i-th character of the input. For every i ∈ {1, …, n}, q ∈ Q, j ∈ {0, …, m}, and t ∈ {0, …, l}:
cw(i − 1) = (q, j) ∧ len(q, ai) = t → cw(i) = (d(q, ai), j + t) ∧ ⋀0≤s<t outs(q, ai) = bj+s+1  (2)
Informally, if the i-th character is ai and the transducer has reached state q and produced the characters b1⋯bj so far, the transition reading ai from state q outputs the characters bj+1⋯bj+t, where t = len(q, ai) is the output length of the transition. The next configuration is then cw(i) = (d(q, ai), j + t).
Finally, Constraint 3 forces T to be completely done with generating u when w has been entirely read. Recall that n = |w| and m = |u|.
⋁q∈Q cw(n) = (q, m)  (3)
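For a concrete candidate transducer, the configuration function asserted by Constraints 1–3 can be computed directly, giving a simple reference check of the examples encoding. This is a sketch of the semantics, not the SMT encoding itself; the dictionary representation and the one-state example transducer are assumptions.

```python
def check_example(delta, gamma, q0, w, u):
    """Compute the configuration c(i) = (state, output position) after
    reading w[:i] and check Constraints 1-3: start at (q0, 0), advance by
    matching each transition's output against u, and finish at len(u)."""
    q, j = q0, 0                      # Constraint 1: c(0) = (q0, 0)
    for a in w:                       # Constraint 2: one step per input char
        out = gamma[(q, a)]
        if u[j:j + len(out)] != out:  # the output chars must match u here
            return False
        q, j = delta[(q, a)], j + len(out)
    return j == len(u)                # Constraint 3: all of u was produced

# A one-state transducer that escapes quotes (illustration only).
D_ESC = {("q0", 'a'): "q0", ("q0", '"'): "q0"}
G_ESC = {("q0", 'a'): "a", ("q0", '"'): '\\"'}
```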
The constraint encoding for examples is sound and complete (Proofs in A.1).
3.3 Input-Output Types
Goal: T should satisfy the property T : I → O.
Encoding this property using constraints is challenging because it requires enforcing that when T reads one of the (potentially) infinitely many strings in L(I) it always outputs a string in L(O). To solve this problem, we draw inspiration from how one proves that the property holds—i.e., using a simulation relation that relates runs over T, I = (QI, δI, q0I, FI), and O = (QO, δO, q0O, FO). Intuitively, if I has read some string w, we need to be able to encode the behavior of T in terms of I, i.e., what state this transducer is in after reading w and what output string it produced. Further, we also need to be able to encode in which state O would be after reading the output string T(w). We do this by introducing a function sim : Q × QI × QO → 𝔹, which preserves the following invariant: sim(q, qI, qO) holds if there exist strings w, u such that δI∗(q0I, w) = qI, δ∗(q0, w) = q, γ∗(q0, w) = u, and δO∗(q0O, u) = qO.
Constraint 4 states the initial condition of the simulation—i.e., T, I, and O are in their initial states.
sim(q0, q0I, q0O)  (4)
Constraint 5 encodes how we advance the simulation relation for states q ∈ Q, qI ∈ QI, qO ∈ QO and for a character a ∈ Σ, using free variables t and qO′ that are separate for each combination of q, qI, qO, and a:
sim(q, qI, qO) ∧ len(q, a) = t → sim(d(q, a), δI(qI, a), qO′), where qO′ = dO(⋯dO(qO, out0(q, a))⋯, outt−1(q, a))  (5)
Intuitively, if sim(q, qI, qO) holds and we read a character a, I moves to δI(qI, a) and T moves to d(q, a). However, we also need to advance O over the symbols produced by T. We hard-code the transition relation of O in an uninterpreted function dO, and apply it to compute the state O reaches when reading the output string. E.g., if len(q, a) = 2, the next state of O in the simulation is dO(dO(qO, out0(q, a)), out1(q, a)).
Lastly, Constraint 6 states that if we encounter a string in L(I)—i.e., I is in a final state qI ∈ FI—the relation does not relate it to a non-final state qO ∉ FO. Since O is deterministic, this means that O accepts T's output.
For every q ∈ Q, qI ∈ FI, and qO ∉ FO:  ¬sim(q, qI, qO)  (6)
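Semantically, the sim relation is the set of reachable triples of a product construction. For a concrete transducer, the type check can therefore be sketched as a fixpoint computation that verifies the condition of Constraint 6 on every reachable triple; DFAs are assumed to be given as (transition map, initial state, final states) triples.

```python
def check_type(delta, gamma, q0, I, O):
    """Compute the reachable triples (q, qI, qO) by fixpoint and check
    that whenever qI is accepting in I, qO is accepting in O."""
    dI, qI0, FI = I
    dO, qO0, FO = O
    sim, work = {(q0, qI0, qO0)}, [(q0, qI0, qO0)]
    while work:
        q, qI, qO = work.pop()
        if qI in FI and qO not in FO:   # a string of L(I) mapped outside L(O)
            return False
        for (q1, a), q2 in delta.items():
            if q1 != q:
                continue
            qO2 = qO
            for b in gamma[(q, a)]:     # advance O over the produced output
                qO2 = dO[(qO2, b)]
            t = (q2, dI[(qI, a)], qO2)
            if t not in sim:
                sim.add(t)
                work.append(t)
    return True
```

For example, the one-state transducer mapping a ↦ aa has type "any string of a's" → "even-length strings of a's", but not → "odd-length strings of a's".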
Soundness and completeness of the type constraints are proven in A.2.
3.4 Input-output Distance
Goal: The mean edit distance between any input string w in L(I) and the output string T(w) should not exceed μ.
Capturing the edit distance for all the possible inputs in the language of and the corresponding outputs produced by the transducer is challenging because these sets can be infinite. Furthermore, exactly computing the edit distance between an input and an output string may involve comparing characters appearing on different transitions in the transducer run. For example, consider the transducer shown in Figure 2(a) and suppose that we are only interested in strings in the input type . The first transition from deletes the a, therefore making 1 edit. This transducer has a cycle between states and , which can be taken any number of times. Each iteration, locally, would require that we make 2 edits: one to change the b to a, and the other to change the a to b. However, the total number of edits made over any string in the input type by this transducer is 1, because the transducer changes strings of the form to be of the form . Looking at the transitions in isolation, we are prevented from deducing that the edit distance is always 1 because the first transition delays outputting a character. If there was no such delay, as is the case for the transducer in Figure 2(b), which is equivalent on the relevant input type to the one in Figure 2(a), then this issue would not arise.
We take inspiration from Benedikt et al. [12] and focus on the simpler problem of synthesizing a transducer whose ‘aggregate cost’ satisfies the given objective.¹ For a transducer T and string w = a1⋯an, let q0 −a1/u1→ q1 ⋯ −an/un→ qn be the run of T on w. Then, the aggregate cost of T on w is the sum Σ1≤i≤n ed(ai, ui) of the edit distances over all indices i. The mean aggregate cost of T on w is the aggregate cost divided by |w|, the length of w. It follows that if T has a mean aggregate cost lower than some specified μ for every string, then it also has a mean edit distance lower than μ for every string.

¹ Benedikt et al. [12] studied a variant of the problem where the distance is bounded by some finite constant. Their work shows that when there is a transducer between two languages that has some bounded global edit distance, then there is also a transducer that is bounded (but with a different bound) under a local method of computing the edit distance—i.e., one where the computation of the edit distance is done transition by transition.
However, the mean aggregate cost overapproximates the edit distance, e.g., the transducer in Figure 2(a) has mean aggregate cost 1, while the mean edit distance when considering only strings in is less than . For this reason, if the mean edit distance objective was set to , our constraint encoding can only synthesize the transducer in Figure 2(b), and not the equivalent one in Figure 2(a).
Our encoding is complete for transducers in which the aggregate cost coincides with the actual edit distance. We leave completeness with respect to the global edit distance as an open problem. In fact, we are not even aware of an algorithm for checking (instead of synthesizing) whether a transducer satisfies a mean edit distance objective.² In Section 4.2, we present transducers with lookahead, which can mitigate this source of incompleteness. Furthermore, our evaluation shows that using the aggregate cost and enabling lookahead are both effective techniques in practice.

² The mean edit distance is similar to the mean payoff [13], which discounts a cost by the length of a string and looks at the behavior of a transducer in the limit. Our distance is different because 1) it looks at finite-length strings, and 2) it requires computing the edit distance, which cannot be done one transition at a time.
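The aggregate cost of a run can be computed directly from the transition outputs. The closed form used below for the edit distance between one input character and an output string anticipates the case analysis of Constraints 7 and 8; the dictionary representation and the one-state example transducer are assumptions for illustration.

```python
def aggregate_cost(delta, gamma, q0, w):
    """Sum, over the run on w, of the edit distance between each input
    character and the output of the transition that consumed it."""
    q, total = q0, 0
    for a in w:
        out = gamma[(q, a)]
        if a in out:
            total += len(out) - 1     # keep a, insert the other characters
        elif len(out) == 0:
            total += 1                # delete a
        else:
            total += len(out)         # replace a, then insert the rest
        q = delta[(q, a)]
    return total

def mean_aggregate_cost(delta, gamma, q0, w):
    """Aggregate cost divided by the input length."""
    return aggregate_cost(delta, gamma, q0, w) / len(w)

# A one-state quote-escaping transducer (illustration only).
D1 = {(0, 'a'): 0, (0, '"'): 0}
G1 = {(0, 'a'): 'a', (0, '"'): '\\"'}
```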
We can now present our constraints. First, we provide constraints for the edit distance of individual transitions (recall that transitions are being synthesized and we therefore need to compute their edit distances separately). Second, we provide constraints that implicitly compute state invariants to capture the aggregate cost between input and output strings at various points in the computation. We are given a rational number μ as an input to the problem, which is the allowed distance bound.
Edit Distance of Individual Transitions. To compute the edit distance between the input and the output of each transition, we introduce a function ed : Q × Σ → {0, …, l}. For a transition from state q reading a character a, ed(q, a) represents the edit distance between a and the transition's output out0(q, a)⋯outlen(q,a)−1(q, a). Notice that this quantity is bounded by the output bound l. The constraints to encode the value of this function are divided into two cases: i) the output of the transition contains the input character (Constraint 7), ii) the output of the transition does not contain the input character (Constraint 8). In both cases, the values are set via a simple case analysis on whether the length of the output is 0 (the edit distance is 1) or not (the edit distance is related to the length of the output).
For every q ∈ Q and a ∈ Σ:  ⋁0≤s<len(q,a) outs(q, a) = a → ed(q, a) = len(q, a) − 1  (7)
For every q ∈ Q and a ∈ Σ:  ⋀0≤s<len(q,a) outs(q, a) ≠ a → (len(q, a) = 0 → ed(q, a) = 1) ∧ (len(q, a) > 0 → ed(q, a) = len(q, a))  (8)
Edit Distance of Arbitrary Strings. Suppose that T has two transitions whose edit distances are 0 and 2, respectively, and that both are taken when reading the input string aa. The mean aggregate cost on aa is (0 + 2)/2 = 1, which means that the specification is not satisfied for any bound μ < 1. In general, we cannot keep track of every input string in the input type and look at its length and the number of edits that were made over it. So, how can we compute the mean aggregate cost over any input string? The first part of our solution is to scale the edit distance of a single transition by the specified mean edit distance. This operation makes it so that an input string is under the edit distance bound iff the sum of the weighted edit distances of its transitions is at most 0. The invariant we need to maintain is that the sum of the weights at any stage of the run tells us where we are with regard to the mean aggregate cost. For each transition we compute the difference between the edit distance over the transition and the specified mean edit distance μ. We introduce the uninterpreted function wed, which stands for weighted edit distance. For a transition at q reading a character a, the weighted edit distance is given by wed(q, a) = ed(q, a) − μ. The sum of the weights of all transitions in a run gives the cumulative difference. Going back to our example, the weighted edit distances of the two transitions are −μ and 2 − μ, making the cumulative distance 2 − 2μ, which is positive—and hence the specification is violated—whenever μ < 1. We can now compute the mean edit distance over a run without keeping track of the length of the run and the number of edits performed over it.
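The scaling trick can be checked on a single input string: summing the weighted edit distances of the traversed transitions and comparing against 0 is equivalent to comparing the mean aggregate cost against μ. A sketch, with a toy transducer invented for illustration:

```python
def within_distance(delta, gamma, q0, w, mu):
    """Check the scaled objective on one string: the cumulative weighted
    edit distance (ed - mu per transition) is <= 0 exactly when the mean
    aggregate cost of the run on w is at most mu."""
    q, cumulative = q0, 0.0
    for a in w:
        out = gamma[(q, a)]
        ed = len(out) - 1 if a in out else max(1, len(out))
        cumulative += ed - mu         # wed(q, a) = ed(q, a) - mu
        q = delta[(q, a)]
    return cumulative <= 0

# A toy transducer that deletes every a and copies every b.
D2 = {(0, 'a'): 0, (0, 'b'): 0}
G2 = {(0, 'a'): '', (0, 'b'): 'b'}
```

On input ab the run makes one edit over two characters (mean 0.5, within a bound of 0.5); on aa it makes two edits over two characters (mean 1, violating the bound).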
We still need to compute the weighted edit distance for every string in the possibly infinite language L(I). Building on the idea of simulation from the previous section, we introduce a new function called en (for energy), which tracks an upper bound on the sum of the weighted edit distances accumulated at that point in the simulation. This function is similar to a progress measure, which is a type of invariant used to solve energy games [14], a connection we expand on in Section 6. In particular, we already know that if there exist strings w, u such that δI∗(q0I, w) = qI, δ∗(q0, w) = q, γ∗(q0, w) = u, and δO∗(q0O, u) = qO, then sim(q, qI, qO) holds. Let this run over w be denoted by q0 −a1/u1→ q1 ⋯ −an/un→ qn, where w = a1⋯an, u = u1⋯un, and qn = q. We have that en(q, qI, qO) ≥ Σ1≤i≤n wed(qi−1, ai).
The en function is a budget on the number of edits we can still perform. At the initial states, we start with no ‘initial credit’ and the energy is 0.
en(q0, q0I, q0O) = 0  (9)
Constraint 10 bounds the energy budget according to the weighted edit distance of a transition by computing the minimum budget required at any point to still satisfy the distance bound. For each combination of q ∈ Q, qI ∈ QI, qO ∈ QO, and a ∈ Σ, the constraint uses free variables t and qO′ as in Constraint 5:
sim(q, qI, qO) ∧ len(q, a) = t → en(d(q, a), δI(qI, a), qO′) ≥ en(q, qI, qO) + wed(q, a)  (10)
In our example, Constraint 10 encodes that the energy at can be 1 less than that at , but that the energy at needs to be greater than at since we need to spend 3 edit operations over the second transition.
At any point during a run, the transducer is allowed to go below the mean edit distance and then ‘catch up’ later because we only care about the edit distance when the transducer has finished reading a string in . Therefore, when we reach a final state of , the transducer should not be in ‘energy debt’.
For every q ∈ Q, qI ∈ FI, and qO ∈ QO:  sim(q, qI, qO) → en(q, qI, qO) ≤ 0  (11)
The encoding presented in this section is sound (Proofs in A.3).
4 Richer Models and Specifications
We extend our technique to more expressive models (Sections 4.1 and 4.2) and show how our synthesis approach can be used not only to synthesize transducers, but also to repair them (Section 4.3). Furthermore, in Appendix E, we describe an encoding of an alternative distance measure.
4.1 Symbolic Transducers
Symbolic finite automata (s-FAs) and transducers (s-FTs) extend their non-symbolic counterparts by allowing transitions to carry predicates and functions to represent (potentially infinite) sets of input characters and output strings. Figure 3(a) shows an s-FT that extends the escapeQuotes transducer from Figure 1(a) to handle alphabetic characters. The bottom transition from q0 reads a character " (bound to the variable x) and outputs the string \" (i.e., a \ followed by the character stored in x). Symbolic finite automata (s-FAs) are s-FTs with no outputs. To simplify our exposition, we focus on s-FAs and s-FTs that only operate over ASCII characters that are ordered by their codes. In particular, all of our predicates are unions of intervals over characters (e.g., the predicate ¬(x = \) is really the union of the intervals [NUL-[] and []-DEL]); we often use the predicate notation instead of explicitly writing the intervals for ease of presentation. Furthermore, we only consider two types of output functions: constant characters and offset functions of the form λx. x + k that output the character obtained by taking the input and adding a constant k to it—e.g., applying λx. x − 32 to a lowercase alphabetic letter gives the corresponding uppercase letter.
In the rest of the section, we show how we can solve the transducer synthesis problem in the case where and are s-FAs and the goal is to synthesize an s-FT (instead of an FT) that meets the given specification. Intuitively, we do this by ‘finitizing’ the alphabet of the now symbolic input-output types, synthesizing a finite transducer over this alphabet using the technique presented in Section 3, and then extracting an s-FT from the solution.
Finitizing the Alphabet. The idea of finitizing the alphabet of s-FAs is a known one [8] and is based on the concept of minterms, the maximal satisfiable Boolean combinations of the predicates appearing in the s-FAs. For an s-FA A, we can define its set of predicates Preds(A) as the set of guards appearing on the transitions of A. The set of minterms Minterms(A) is the set of satisfiable Boolean combinations of all the predicates in Preds(A). For example, the predicates of the s-FT escapeQuotes in Figure 3(a) yield the minterms shown in Figure 3(c). The reader can learn more about minterms in [8]. We assign each minterm a representative witness character, as indicated in Figure 3(c), and then construct a finite automaton over the resulting finite alphabet Σmt. For a character a ∈ Σmt, we refer to its corresponding minterm by mt(a). In the other direction, for each minterm ψ, we refer to its uniquely determined representative character by wit(ψ).
Figure 3(c). The minterms of escapeQuotes and their witness characters.
For an s-FA A, we denote its corresponding FA over the alphabet Σmt with mt(A). Given an s-FA A with transitions Δ, the set of transitions of mt(A) is defined as follows: Δmt = { (p, wit(ψ), q) | (p, φ, q) ∈ Δ, ψ ∈ Minterms(A), and φ ∧ ψ is satisfiable }.
This algorithm replaces a transition guarded by a predicate φ in the given s-FA with a set of transitions reading the witnesses of the minterms where φ is satisfiable. In interval arithmetic, this is the set of intervals that intersect with the intervals specified by φ. One transition from q0 in Figure 3(a) is guarded by a predicate that intersects two minterms—the one represented by " and the one represented by a. As a result, that transition is replaced by two transitions in Figure 3(b), one that reads " and another that reads a.
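Minterm computation can be sketched by grouping characters according to which predicates they satisfy; here predicates are represented as explicit character sets over a small alphabet (an assumption for illustration; the implementation works with interval predicates).

```python
def minterms(alphabet, predicates):
    """Group characters by their membership vector across all predicates;
    each nonempty group is a satisfiable Boolean combination (a minterm)."""
    groups = {}
    for c in alphabet:
        key = tuple(c in p for p in predicates)
        groups.setdefault(key, set()).add(c)
    return sorted(map(frozenset, groups.values()), key=min)

def witness(minterm):
    """A fixed representative character for a minterm."""
    return min(minterm)
```

For example, over the alphabet {a, b, "} with predicates {"} and {a, b}, the minterms are {"} and {a, b}, with witnesses " and a respectively.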
From FTs to s-FTs. Once we have synthesized an FT, we need to extract an s-FT from it. There are many s-FTs equivalent to a given FT, and here we present one way of doing this conversion, which is used in our implementation. Let the size of an interval ι (the number of characters it contains) be given by size(ι), and the offset between two intervals ι1 and ι2 (i.e., the difference between the least elements of ι2 and ι1) be given by offset(ι1, ι2). Suppose we have a transition from q to q′ reading a witness character a and outputting the string b1⋯bt. Then, we construct a transition from q to q′ guarded by the minterm mt(a), where for each output character bs, the corresponding output function fs is determined by the following rules (x always indicates the variable bound to the input predicate):
1. If mt(bs) = mt(a), then fs = λx. x, i.e., the identity function.
2. If mt(a) and mt(bs) consist of single intervals ι1 and ι2, respectively, such that size(ι1) = size(ι2), then fs = λx. x + offset(ι1, ι2). For instance, if the input interval is [a-z] and the output interval is [A-Z], then the output function is λx. x − 32, which maps lowercase letters to uppercase ones.
3. Otherwise fs = wit(mt(bs))—i.e., the output is a constant character in the output minterm.
While our s-FT recovery algorithm is sound, it may apply case 3 more often than necessary and introduce many constants, therefore yielding a transducer that does not generalize well to unseen examples. However, our evaluation shows that our technique works well in practice. We provide a sketch of the proof of soundness of this algorithm in Appendix C.
4.2 Synthesizing Transducers with Lookahead
Deterministic transducers cannot express functions where the output at a certain transition depends on future characters in the input. Consider the problem of extracting all substrings of the form <x> (for a character x) from an input string. This is the getTags problem from [15]. A deterministic transducer cannot express this transformation because when it reads < followed by x it has to output <x if the next character is a > and nothing otherwise. However, the transducer does not have access to the next character!
Instead, we extend our technique to handle deterministic transducers with lookahead, i.e., the ability to look at the string suffix when reading a symbol. Formally, a Transducer with Regular Lookahead is a pair (T, R) where T is an FT with states QT, and R is a total DFA with states QR and transition function δR. The transducer now has another input in its transition function, although it still only outputs characters from Σ, i.e., δ : QT × Σ × QR → QT and γ : QT × Σ × QR → Σ∗. The semantics is defined as follows. Given a string w = a1⋯an, we define a function rw : {1, …, n} → QR such that rw(i) = δR∗(q0R, reverse(ai⋯an)). In other words, rw(i) gives the state reached by R on the reversed suffix starting at i. At each step i, the transducer reads the symbol (ai, rw(i)). The extended transition functions now take as input a lookahead word, which is a sequence of pairs of characters and lookahead states, i.e., from (Σ × QR)∗.
To synthesize transducers with lookahead, we introduce uninterpreted functions dR for the transition function of R, and rw for the rw-values of R on each example w. Additionally, we introduce a bound on the number of states in the lookahead automaton, as our synthesis algorithm has to synthesize both T and R at the same time. Appendix D provides the modified constraints needed to encode input-output types and input-output examples with lookahead.
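The semantics of regular lookahead can be sketched by precomputing the rw-values right to left and letting each transition read a (character, lookahead state) pair. The toy machine below (all names invented for illustration) rewrites an a to X exactly when the suffix starting at that a continues with b.

```python
def run_with_lookahead(delta, gamma, q0, dR, r0, w):
    """Run a transducer with regular lookahead. r[i] is the state the
    lookahead DFA R reaches on the reversed suffix w[i:], computed right
    to left; each transition then reads the pair (w[i], r[i])."""
    n = len(w)
    r = [r0] * (n + 1)
    for i in range(n - 1, -1, -1):
        r[i] = dR[(r[i + 1], w[i])]   # extend the reversed suffix by w[i]
    q, out = q0, []
    for i, a in enumerate(w):
        out.append(gamma[(q, a, r[i])])
        q = delta[(q, a, r[i])]
    return "".join(out)
```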
Part of the transducer with lookahead we synthesize for the getTags problem is shown in Figure 4. Notice that there are two transitions out of the same state for the same input but different lookahead states: the string <x is outputted only when the lookahead state indicates that the next character closes the tag.
Lookahead and aggregate cost: Lookahead can help represent transducers, even deterministic ones, in a way that has lower aggregate cost—i.e., the aggregate cost better approximates the actual edit distance. Suppose that we want to synthesize a transducer that translates the string abc to ab and the string abd to bd. This translation can be done using a deterministic transducer that outputs nothing while reading the first two characters, followed by two transitions that choose the correct output (ab or bd) based on the final character. Such a transducer would have a high aggregate cost of 4, even though the actual edit distance is 1. In contrast, using lookahead we can obtain a transducer that outputs each character as it reads it; this transducer has aggregate cost 1 on either string. We conjecture that for every transducer T, there always exists an equivalent transducer with regular lookahead for which the edit distance computation for aggregate cost coincides with the actual edit distance of T.
4.3 Transducer Repair
In this section, we show how our synthesis technique can also be used to “repair” buggy transducers. The key idea is to use the closure properties of automata and transducers—e.g., closure under union and sequential compositions [8]—to reduce repair problems to synthesis ones. The ability to algebraically manipulate transducers and automata is one of the key aspects that distinguishes our work from other synthesis works that use domain-specific languages [1, 5].
We describe two settings in which we can repair an incorrect transducer T: 1. Let O be an output type violated by T and let B be the finite automaton describing the set of strings T can output when fed inputs in L(I) (this is computable thanks to closure properties of transducers). We are interested in the case where L(B) ⊈ L(O)—i.e., T can produce strings that are not in the output type.
2. Let E be a set of input-output examples. We are interested in the case where there is some example w ↦ u ∈ E such that T(w) ≠ u.
We describe two ways in which we repair transducers.
Repairing from the Input Language. This approach synthesizes a new transducer for the inputs on which T is incorrect. Using properties of transducers, we can compute an automaton Bad describing the exact set of inputs for which T does not produce an output in L(O) (see pre-image computation in [10]). Let T′ be the transducer that behaves as T if the input is in L(I) \ L(Bad) and does not produce an output otherwise (transducers are closed under restriction [10]). If we synthesize a transducer T″ with type Bad → O, then the transducer T′ ∪ T″ satisfies the desired input-output type (transducers are closed under union).
Fault Localization from Examples. We use this technique when T is incorrect on some example. We can compute a set of "suspicious" transitions by taking all the transitions traversed when T(w) ≠ u for some w ↦ u ∈ E (i.e., one of these transitions is wrong) and removing all the transitions traversed when T(w) = u for some w ↦ u ∈ E (i.e., transitions that are likely correct). Essentially, this is a way of identifying where T is wrong on some examples. We can also use this technique to limit the transitions we need to synthesize when performing repair.
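The fault-localization rule can be sketched directly: collect the transitions traversed on failing examples and subtract those traversed on passing ones. The dictionary representation and the buggy example transducer below are assumptions for illustration.

```python
def trace(delta, gamma, q0, w):
    """Return the output on w and the set of transitions traversed."""
    q, out, used = q0, [], set()
    for a in w:
        used.add((q, a))
        out.append(gamma[(q, a)])
        q = delta[(q, a)]
    return "".join(out), used

def suspicious_transitions(delta, gamma, q0, examples):
    """Transitions used on some failing example minus those used on a
    passing one: a small candidate set of transitions to re-synthesize."""
    failing, passing = set(), set()
    for w, u in examples:
        out, used = trace(delta, gamma, q0, w)
        (failing if out != u else passing).update(used)
    return failing - passing
```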
5 Evaluation
We implemented our technique in a Java tool astra (Automatic Synthesis of TRAnsducers), which uses Z3 [16] to solve the generated constraints. All experiments were run on a 2.7 GHz Intel Core i5 with 8 GB of RAM and a 300s timeout.
Q1: Can astra synthesize practical transformations?
Benchmarks. Our first set of benchmarks is obtained from Optician [5, 6], a tool for synthesizing lenses, which are bidirectional programs used for keeping files in different data formats synchronized. We adapted 11 of these benchmarks to work with astra (note that we only synthesize one-directional transformations), and added one additional benchmark extrAcronym2, which is a harder variation (with a larger input type) of extrAcronym. We excluded benchmarks that require some memory, e.g., swapping words in a sentence, as they cannot be modeled with transducers.
Our second set of benchmarks (Miscellaneous) consists of 6 problems we created based on file transformation tasks (unixToDos, dosToUnix, and CSVSeparator) and s-FTs from the literature: escapeQuotes from [17], and getTags and quicktimeMerger from [15]. All of the benchmarks require synthesizing s-FTs, and getTags requires synthesizing an s-FT with lookahead (details in Table I).
To generate the sets of examples, we started with the examples that were used in the original source when available. In 5 cases, astra synthesized a transducer that was not equivalent to the one synthesized by Optician, even though it did meet the specification. In these cases, we used astra to synthesize two different transducers that met the specification, computed a string on which the two transducers differed, and added the desired output for that string as a new example. We repeated this task until astra yielded the desired transducer and we report the time for such sets of examples. The number of examples varies between 1 and 5. The ability to perform equivalence checks of two transducers is yet another reason why synthesizing transducers is useful and is in some ways preferable to synthesizing programs in a DSL. For each benchmark we chose a mean edit distance of 0.5 when the transformation could be synthesized with this distance and of 1 otherwise.
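The equivalence check used here — finding a string on which two synthesized transducers differ — can be sketched for deterministic Mealy-style transducers as a breadth-first search over the product machine. This is a simplified stand-in for the s-FT equivalence algorithm; the machine encoding (delta, out, q0) is our own:

```python
from collections import deque

def distinguishing_input(m1, m2, alphabet):
    # BFS over the product of two deterministic Mealy machines; returns a
    # shortest input on which their outputs differ, or None if equivalent.
    (d1, o1, q1), (d2, o2, q2) = m1, m2
    seen, queue = {(q1, q2)}, deque([(q1, q2, "")])
    while queue:
        p, q, w = queue.popleft()
        for ch in alphabet:
            if o1[(p, ch)] != o2[(q, ch)]:
                return w + ch            # outputs diverge on this input
            nxt = (d1[(p, ch)], d2[(q, ch)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt[0], nxt[1], w + ch))
    return None

# Toy machines: m1 copies its input; m2 copies it except that, after
# reading an 'a', it rewrites 'b' to 'X'. They differ first on "ab".
m1 = ({(0, "a"): 0, (0, "b"): 0},
      {(0, "a"): "a", (0, "b"): "b"}, 0)
m2 = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1},
      {(0, "a"): "a", (0, "b"): "b", (1, "a"): "a", (1, "b"): "X"}, 0)
```

The returned string plus its desired output is exactly the kind of disambiguating example added to the specification in the procedure described above.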
| | Benchmark | | | | | | | | | | astra Time(s) | Optician Time(s) | Optician-re Time(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Optician | extrAcronym | 6 | 3 | 10 | 3 | 3 | 2 | 1 | 1 | .5 | 0.11 | 0.05 | ✗ |
| | extrAcronym2 | 6 | 3 | 16 | 3 | 3 | 3 | 2 | 1 | 1 | 0.42 | — | — |
| | extrNum | 15 | 13 | 17 | 12 | 3 | 1 | 1 | 1 | 1 | 0.93 | 0.05 | 0.07 |
| | extrQuant | 4 | 3 | 8 | 5 | 2 | 1 | 2 | 1 | 1 | 0.19 | 0.09 | ✗ |
| | normalizeSpaces | 7 | 6 | 19 | 10 | 2 | 2 | 2 | 1 | 1 | 0.46 | 16.64 | ✗ |
| | extrOdds | 15 | 9 | 29 | 13 | 5 | 3 | 3 | 2 | 1 | 15.87 | 0.12 | ✗ |
| | capProb | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 1 | 1 | 0.05 | 0.05 | ✗ |
| | removeLast | 6 | 3 | 8 | 3 | 3 | 3 | 2 | 1 | .5 | 0.21 | 0.15 | 0.07 |
| | sourceToViews | 18 | 7 | 26 | 15 | 5 | 3 | 3 | 2 | 1 | 50.92 | 0.06 | ✗ |
| | normalizeNamePos | 19 | 7 | 35 | 24 | 13 | 1 | 6 | 2 | 1 | ✗ | 0.05 | 0.10 |
| | titleConverter | 22 | 13 | 41 | 41 | 15 | 1 | 3 | 1 | 1 | ✗ | 0.07 | ✗ |
| | bibtextToReadable | 14 | 11 | 41 | 35 | 12 | 1 | 5 | 1 | 1 | ✗ | 0.64 | 0.15 |
| Miscellaneous | unixToDos | 5 | 7 | 17 | 19 | 4 | 4 | 2 | 2 | .5 | 1.24 | — | — |
| | dosToUnix | 7 | 5 | 19 | 17 | 4 | 4 | 2 | 1 | .5 | 0.41 | — | — |
| | CSVSeparator | 5 | 5 | 9 | 9 | 4 | 1 | 1 | 1 | 1 | 0.142 | — | — |
| | escapeQuotes | 2 | 2 | 6 | 5 | 3 | 5 | 2 | 2 | 1 | 0.188 | ✗ | ✗ |
| | quicktimeMerger | 7 | 3 | 9 | 3 | 2 | 2 | 1 | 1 | .5 | 0.075 | — | — |
| | getTags | 3 | 3 | 9 | 4 | 3 | 5 | 2 | 2 | 1 | 0.95 | ✗ | ✗ |
Effectiveness of astra. astra can solve 15 of the 18 benchmarks (13 in <1s and 2 in under a minute) and times out on the 3 benchmarks where both the input and output types are big.
While the synthesized transducers have at most 3 states, we note that this is because astra synthesizes total transducers and then restricts their domains to the input type . This is advantageous because synthesizing small total transducers is easier than synthesizing transducers that require more states to define the domain. For instance, when we restrict the solution of extrAcronym2 to its input type, the resulting transducer has 11 states instead of the 2 required by the original solution!
Comparison with Optician. We do not compare astra to tools that only support input-output examples. Instead, we compare astra to Optician on the set of benchmarks common to both tools. Like astra, Optician supports input-output examples and types, but the types are expressed as regular expressions. Furthermore, Optician also attempts to produce a program that minimizes a fixed information theoretical distance between the input and output types [5].
Optician is faster on benchmarks for which the number of variables in astra's constraint encoding is large, while astra is faster on the normalizeSpaces benchmark. Optician, which uses regular expressions to express the input and output types, does not handle unstructured data well. To confirm this trend, we wrote synthesis tasks for the escapeQuotes and getTags benchmarks in Optician, and it was unable to synthesize them either; e.g., escapeQuotes requires replacing every " character with \".
To further probe Optician's reliance on regular expressions, we converted the regular expressions used in the lens synthesis benchmarks to automata and then back to regular expressions using a variant of the state elimination algorithm that acts on character intervals. This yields regular expressions that are not very concise and may contain redundancies. Optician could then solve only 4/11 of the benchmarks it previously synthesized (Optician-re in Table I).
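A state-elimination pass of this kind can be sketched as follows (our own simplified variant over plain string labels rather than character intervals). It also shows why the round-tripped expressions come out verbose: parallel edges are merged with unions and self-loops become starred groups, with no simplification.

```python
def state_elimination(states, edges, start, finals):
    # Regex via state elimination: add fresh start/final states with epsilon
    # edges, then eliminate every original state, merging parallel edges
    # with '|' and turning self-loops into starred groups.
    S, F = "_s", "_f"
    R = {}
    def add(p, q, r):
        R[(p, q)] = f"({R[(p, q)]}|{r})" if (p, q) in R else r
    for (p, q), r in edges.items():
        add(p, q, r)
    add(S, start, "")                    # epsilon into the start state
    for f in finals:
        add(f, F, "")                    # epsilon out of each final state
    for k in states:                     # eliminate every original state
        loop = f"({R[(k, k)]})*" if (k, k) in R else ""
        ins = [(p, r) for (p, q), r in R.items() if q == k and p != k]
        outs = [(q, r) for (p, q), r in R.items() if p == k and q != k]
        for p, r1 in ins:
            for q, r2 in outs:
                add(p, q, r1 + loop + r2)
        R = {pq: r for pq, r in R.items() if k not in pq}
    return R.get((S, F), "")
```

Even a two-state automaton for (ab)* comes back as `(|a(ba)*b)`, already less readable than the original regular expression, which is consistent with the degradation Optician-re exhibits.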
Answer to Q1: astra can solve real-world benchmarks and has performance comparable to that of Optician for similar tasks. Unlike Optician, astra does not suffer from variations in how the input and output types are specified.
Q2: Can astra repair transducers in practice?
Benchmarks. We considered the benchmarks in Table II. The only pre-existing benchmark that we found was escapeQuotes, through the interface of the Bek programming language used for verifying transducers [17]. We generated 11 additional faulty transducers to repair in the following two ways: (i) Introducing faults in our synthesis benchmarks: We either replaced the output string of a transition with a constant character, inserted an extra character, or deleted a transition altogether. (ii) Incorrect transducers: We intentionally provided fewer input-output examples and used only example-based constraints on some of our synthesis benchmarks.
All the benchmarks involve s-FTs. Three benchmarks are wrong on both input-output types and examples and the rest are only wrong on examples. Additionally, we note that to repair a transducer, we need the “right” set of minterms. Typically, the set of minterms extracted from the transducer predicates is the right one, but in the case of the escapeBrackets problems, astra needs a set of custom minterms we provide manually—i.e., repairing the transducer requires coming up with a new predicate. We are not aware of another tool that solves transducer repair problems and so do not show any comparisons.
| | Benchmark | | | | | | | | | | | Default Time(s) | Template Time(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fault injected | swapCase1 | 2 | 1 | 6 | 3 | 3 | 2 | 1 | 1 | 1 | 3/3 | 0.04 | 0.02 |
| | swapCase2 | 2 | 1 | 4 | 3 | 3 | 2 | 1 | 1 | 1 | 1/2 | ✗ | ✗ |
| | swapCase3 | 2 | 1 | 6 | 3 | 3 | 2 | 1 | 1 | 1 | 1/3 | 0.06 | 0.05 |
| | escapeBrackets1 | 2 | 6 | 16 | 36 | 8 | 4 | 1 | 4 | 4 | 1/3 | 0.69 | 0.42 |
| | escapeBrackets2 | 1 | 6 | 1 | 7 | 6 | 5 | 1 | 4 | 4 | 1/2 | ✗ | ✗ |
| | escapeBrackets3 | 2 | 7 | 8 | 36 | 9 | 5 | 1 | 4 | 4 | 2/3 | 1.12 | 0.34 |
| | caesarCipher | 2 | 1 | 4 | 2 | 3 | 1 | 1 | 1 | 1 | 1/1 | ✗ | ✗ |
| Synth. | extrAcronym2 | 11 | 3 | 30 | 3 | 3 | 3 | 2 | 1 | 1 | 12/30 | 0.59 | 10.15 |
| | capProb | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 1 | 1 | 3/3 | 0.04 | 0.04 |
| | extrQuant | 8 | 3 | 16 | 5 | 2 | 1 | 2 | 1 | 1 | 5/10 | 0.37 | 0.51 |
| | removeLast | 6 | 3 | 8 | 3 | 3 | 2 | 2 | 1 | .5 | 7/8 | 0.40 | 1.08 |
| | escapeQuotes | 3 | 2 | 9 | 5 | 3 | 5 | 2 | 1 | 1 | 3/5 | 0.17 | 0.10 |
Effectiveness of astra. We indicate the number of suspicious transitions identified by our fault localization procedure (Section 4.3) in Table II, in the column with entries of the form x/y. In many cases, astra can identify 50% or more of the transitions as likely correct, thereby reducing the space of unknowns.
We compare two different ways of solving repair problems in astra. The first uses the repair-from-input approach described in Section 4.3 (Default in Table II). The second uses a ‘template’: we supply the constraint solver with a partial solution to the synthesis problem, based on the transitions that were localized as potentially buggy (Template in Table II).
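The effect of a template can be illustrated with a brute-force stand-in for the constraint query (our own toy encoding; astra of course hands the constraints to Z3 rather than enumerating): fixing the template transitions shrinks the search to the localized holes.

```python
from itertools import product

def run(trans, s, q0=0):
    # trans: (state, char) -> (next_state, output_char)
    q, res = q0, []
    for ch in s:
        q, o = trans[(q, ch)]
        res.append(o)
    return "".join(res)

def complete_template(states, alphabet, out_alphabet, examples, template):
    # Search only over the transitions left unspecified by the template,
    # accepting the first completion consistent with all examples.
    holes = [(q, a) for q in states for a in alphabet if (q, a) not in template]
    options = [(q2, o) for q2 in states for o in out_alphabet]
    for choice in product(options, repeat=len(holes)):
        trans = dict(template)
        trans.update(zip(holes, choice))
        if all(run(trans, s) == t for s, t in examples):
            return trans
    return None

# Toy: a one-state uppercaser where the 'a' transition is kept from the
# buggy transducer and the localized 'b' transition is left as a hole.
sol = complete_template([0], "ab", "AB", [("ab", "AB")], {(0, "a"): (0, "A")})
```

With one hole the search space here has 2 candidates instead of 4; in the SMT setting the template analogously pins down part of the model, though encoding the partial solution itself adds constraints.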
astra can solve 9/12 repair benchmarks (most in less than 1 second). The times of the two approaches are comparable in most cases. While one might expect templates to be faster, this is not always the case: the input-output specification for the repair transducer is small, but providing a template requires actually supplying a partial solution, which in some cases involves many constraints.
Answer to Q2: astra can repair transducers with varying types of bugs.
6 Related Work
Synthesis of string transformations. String transformations are one of the main targets of program synthesis. Gulwani showed they could be synthesized from input-output examples [1] and introduced the idea of using a DSL to aid synthesis. Optician extended the DSL-based idea to synthesizing lenses [6, 5], which are programs that transform between two formats. Optician supports not only examples but also input-output types. While DSL-based approaches provide good performance, they are also monolithic as they rely on the structure of the DSL to search efficiently. astra does not rely on a DSL and can synthesize string transformations from complex specifications that cannot be handled by DSL-based tools. Moreover, transducers allow applying verification techniques to the synthesized programs (e.g., checking whether two solutions are equivalent). One limitation of transducers is that they do not have ‘memory’, and consequently astra cannot be used for data-transformation tasks where this is required—e.g., mapping the string Firstname Lastname to Lastname, Firstname—something Optician can do. We remark that there exist transducer models with such capabilities [18] and our work lays the foundations to handle complex models in the future.
Synthesis of transducers. Benedikt et al. studied the ‘bounded repair problem’, where the goal is to determine whether there exists a transducer that maps strings from an input type to an output type using a bounded number of edits [12]. Their work was the first to identify the relation between solving this problem and solving games, an idea we leverage in this paper. However, their work is not implemented and cannot handle input-output examples, and it therefore avoids the source of NP-completeness. Hamza et al. studied the problem of synthesizing minimal non-deterministic Mealy machines (transducers where every transition outputs exactly one character) from examples [11]. They prove that synthesizing such transducers is NP-complete and provide an algorithm for computing minimal Mealy machines consistent with the input-output examples. astra is a more general framework that incorporates new specification mechanisms, e.g., input-output types and distances, and uses them all together. Mealy machines are also synthesized from temporal specifications in reactive synthesis and regular model checking, where they are used to represent parameterized systems [19, 20]. This setting is orthogonal to ours, as the specification is different and the transducer is again only a Mealy machine.
The constraint encoding used in astra is inspired by the encoding presented by Neider for computing a minimal separating DFA, i.e., a DFA that separates two disjoint regular languages [21]. astra’s use of weights and energy to specify a mean edit distance is based on energy games [22], a kind of 2-player infinite game that captures the need for a player not to exceed some available resource. One way of solving such games is by defining a progress measure [14]: to determine whether a game has a winning strategy for one of the players, one can check whether such a progress measure exists. We showed that the search for such a progress measure can be encoded as an SMT problem.
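The progress-measure existence check can be sketched, for a simplified one-player variant where every path must keep the energy level nonnegative, as a least-fixpoint computation (our own toy formulation; astra instead searches for the measure symbolically via SMT):

```python
def min_initial_credit(vertices, edges, top):
    # f[v] = least energy at v that keeps the level nonnegative along every
    # outgoing path, i.e. the least fixpoint of f[v] >= max(0, f[u] - w)
    # over edges (v, u, w). Values exceeding `top` are reported as None,
    # meaning no finite initial credit suffices (no progress measure).
    INF = top + 1
    f = {v: 0 for v in vertices}
    for _ in range(len(vertices) * (top + 2)):   # enough rounds to converge
        updated = False
        for v, u, w in edges:
            need = min(INF, max(0, f[u] - w))
            if need > f[v]:
                f[v] = need
                updated = True
        if not updated:
            break
    return {v: (None if f[v] >= INF else f[v]) for v in f}
```

On a zero-sum cycle (edge weights -2 and +2) the measure exists, with 2 units of credit needed before taking the -2 edge; on a strictly negative cycle every value diverges past the bound and no measure exists.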
7 Conclusion
We presented a technique and a tool (astra) for synthesizing different types of transducers from types, examples, and distance functions, and showed astra’s ability to solve a variety of practical problems. astra uses SMT solvers and its performance is affected by input components that result in large constraints (e.g., states in the desired transducer). Because astra synthesizes transducers instead of programs in arbitrary DSLs, its output can be analyzed using transducer algorithms (e.g., equivalence and pre-post analysis). Because of this property, our approach could be beneficial in learning invariants of string-manipulating programs, where a transducer is the formalism of choice, e.g., in the Ostrich tool [9]. In terms of improvements to our technique, aside from optimizing the SMT encoding to improve scalability, our approach could be strengthened by devising ways to effectively ‘guess’ the number of states required for a transducer to work on the given inputs. We leave these directions for future work.
Acknowledgements
This work was funded by the National Science Foundation under grants 1763871, 1750965, 1918211, and 2023222, by Facebook, and by a Microsoft Research Faculty Fellowship.
References
- [1] S. Gulwani, “Automating string processing in spreadsheets using input-output examples,” in POPL ’11, Austin, Texas, USA, January 2011.
- [2] P. Hooimeijer, B. Livshits, D. Molnar, P. Saxena, and M. Veanes, “Fast and precise sanitizer analysis with bek,” in USENIX Security Symposium, vol. 58. USENIX, 2012.
- [3] L. D’Antoni and M. Veanes, “Static analysis of string encoders and decoders,” in International Workshop on Verification, Model Checking, and Abstract Interpretation. Springer, 2013, pp. 209–228.
- [4] Y. Zhang, A. Albarghouthi, and L. D’Antoni, “Robustness to programmable string transformations via augmented abstract training,” in International Conference on Machine Learning. PMLR, 2020, pp. 11 023–11 032.
- [5] A. Miltner, S. Maina, K. Fisher, B. C. Pierce, D. Walker, and S. Zdancewic, “Synthesizing symmetric lenses,” Proceedings of the ACM on Programming Languages, vol. 3, no. ICFP, pp. 1–28, 2019.
- [6] A. Miltner, K. Fisher, B. C. Pierce, D. Walker, and S. Zdancewic, “Synthesizing bijective lenses,” Proceedings of the ACM on Programming Languages, vol. 2, no. POPL, pp. 1–30, 2017.
- [7] M. Mohri, “Finite-state transducers in language and speech processing,” Computational linguistics, vol. 23, no. 2, pp. 269–311, 1997.
- [8] L. D’Antoni and M. Veanes, “Automata modulo theories,” Communications of the ACM, vol. 64, no. 5, pp. 86–95, 2021.
- [9] T. Chen, M. Hague, J. He, D. Hu, A. W. Lin, P. Rümmer, and Z. Wu, “A decision procedure for path feasibility of string manipulating programs with integer data type,” in International Symposium on Automated Technology for Verification and Analysis. Springer, 2020, pp. 325–342.
- [10] L. D’Antoni and M. Veanes, “The power of symbolic automata and transducers,” in International Conference on Computer Aided Verification. Springer, 2017, pp. 47–67.
- [11] J. Hamza and V. Kunčak, “Minimal synthesis of string to string functions from examples,” in Verification, Model Checking, and Abstract Interpretation, C. Enea and R. Piskac, Eds. Cham: Springer International Publishing, 2019, pp. 48–69.
- [12] M. Benedikt, G. Puppis, and C. Riveros, “Regular repair of specifications,” in 2011 IEEE 26th Annual Symposium on Logic in Computer Science. IEEE, 2011, pp. 335–344.
- [13] R. Bloem, K. Chatterjee, and B. Jobstmann, “Graph games and reactive synthesis,” in Handbook of Model Checking, E. M. Clarke, T. A. Henzinger, H. Veith, and R. Bloem, Eds. Springer, 2018, pp. 921–962. [Online]. Available: https://doi.org/10.1007/978-3-319-10575-8_27
- [14] L. Brim, J. Chaloupka, L. Doyen, R. Gentilini, and J.-F. Raskin, “Faster algorithms for mean-payoff games,” Formal methods in system design, vol. 38, no. 2, pp. 97–118, 2011.
- [15] M. Veanes, P. Hooimeijer, B. Livshits, D. Molnar, and N. Bjorner, “Symbolic finite state transducers: Algorithms and applications,” in Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ser. POPL ’12. New York, NY, USA: Association for Computing Machinery, 2012, p. 137–150. [Online]. Available: https://doi.org/10.1145/2103656.2103674
- [16] L. De Moura and N. Bjørner, “Z3: An efficient smt solver,” in International conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2008, pp. 337–340.
- [17] P. Hooimeijer, B. Livshits, D. Molnar, P. Saxena, and M. Veanes, “Fast and precise sanitizer analysis with bek,” http://rise4fun.com/Bek/, 2012.
- [18] R. Alur, “Streaming string transducers,” in Logic, Language, Information and Computation, L. D. Beklemishev and R. de Queiroz, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 1–1.
- [19] O. Markgraf, C.-D. Hong, A. W. Lin, M. Najib, and D. Neider, “Parameterized synthesis with safety properties,” in Asian Symposium on Programming Languages and Systems. Springer, 2020, pp. 273–292.
- [20] A. W. Lin and P. Rümmer, “Liveness of randomised parameterised systems under arbitrary schedulers,” in International Conference on Computer Aided Verification. Springer, 2016, pp. 112–133.
- [21] D. Neider, “Computing minimal separating dfas and regular invariants using sat and smt solvers,” in International Symposium on Automated Technology for Verification and Analysis. Springer, 2012, pp. 354–369.
- [22] A. Chakrabarti, L. de Alfaro, T. A. Henzinger, and M. Stoelinga, “Resource interfaces,” in Embedded Software, Third International Conference, EMSOFT 2003, Philadelphia, PA, USA, October 13-15, 2003, Proceedings, 2003, pp. 117–133. [Online]. Available: https://doi.org/10.1007/978-3-540-45212-6_9
Appendix A Proofs
A.1 Input-Output Examples
Lemmas A.1 and A.2 show that (i) the constraint encoding preserves a desired invariant, which is then used to show that (ii) the transducer encoded in a solution to the constraints is consistent with the given input-output examples.
Lemma A.1.
Proof.
We first show the forward direction.
Assume that all constraints in are satisfied by , , , and . We proceed by induction on . For the base case, when , we have that by Constraint 1. We also know that and that by definition of the extended transition function. So we are done with the base case.
For the induction step, our induction hypothesis states exactly that iff and . Assume we have that . We need to show that this is the case for as well. Now, since input words consist of letters in , we must have that for some . Then, we must also have that , by the implication in Constraint 2 and the inductive hypothesis. By Constraint 2, we have that for all , either or . This means that for all , we have . Together with the inductive hypothesis, this implies .
For showing , we can observe that this follows from the inductive hypothesis that , by the definition of , and that by the constraints of type 2, which enforce that . So, we are done with the forward direction.
The backward direction is straightforward. If we have that and , then there exists a run of the form such that each consists of characters and . We can use each transition to assign corresponding values to . ∎
Lemma A.2.
Proof.
We first prove the forward direction. Assume that an assignment to , , , and satisfies all constraints in . By Constraint 3, we have that there is some state of the encoded transducer for which . From Lemma A.1, it follows that and . This is exactly what it means for .
For the backward direction, assume that for the transducer encoded in , , and , we have . Then, there exists a run such that and . Using this run of on , we can construct a definition of such that all constraints in are satisfied.
∎
A.2 Input-Output Types
Lemmas A.3 and A.4 state that (i) the simulation relation sim preserves the desired invariant, which is then used to prove that (ii) the encoded transducer satisfies the specification for input-output types.
Lemma A.3.
Proof.
Assume that there exist strings such that , , , and . By Constraint 4, we already have that . Since , we know that there is an assignment to the functions , , and that encodes a run such that and . It follows that we eventually reach by the implication in Constraint 5.
∎
Lemma A.4.
Proof.
For the forward direction, we assume that an assignment to , , and sim satisfies all constraints in . Then, for every , if we have that for some , then it is the case that . From Constraint 5, we know that for a string such that where , it must be that for at least some and . From Lemma A.3, it also follows that for such a string , there is also a string such that and such that . In other words, for every string , outputs a string as desired.
For the backward direction, let be a transducer such that . We can use the transition relation of to construct an assignment of , , and sim such that all constraints are satisfied.
∎
A.3 Input-Output Distances
Lemmas A.5 and A.6 show that a transducer that is a solution to the set of distance constraints is such that for all .
Lemma A.5.
Let be the set of input-output distance constraints from Equations 7, 8, 9, 10 and 11 for a mean edit distance of , and be a transducer encoded into the functions , and . If all constraints in are satisfied by , , , sim, en, and wed, then for all runs over of the form , where , it holds that:
Proof.
Assume that all constraints in are satisfied by , , , sim, en and wed. Consider an arbitrary path over of the form , where . By Constraint 10, we have that
Since , we have that . By Constraint 11, it follows that . Therefore, as well. Since by Constraint 9, we have that .
∎
Lemma A.6.
Proof.
Assume that an assignment to , , , sim, en and wed satisfy all constraints in . From Lemma A.5 it follows that the run of any string is such that , where and are the states in the run. Since is the difference between and at each transition, if the total difference is , then it follows that .
∎
Appendix B Size of Constraint Encoding
Input-Output Examples.
There is one constraint of type 1, constraints of type 2, and one constraint of type 3 for one input-output example, where is the length of the input and is the length of the output.
A constraint of type 1 involves 1 variable, a constraint of type 2 involves variables, and a constraint of type 3 involves variables.
Input-Output Types.
There is one constraint of type 4, constraints of type 5, and one constraint of type 6 for the input-output types and .
A constraint of type 4 involves no variables, a constraint of type 5 involves variables, and a constraint of type 6 involves variables.
Input-Output Distances.
Appendix C Soundness of s-FT recovery algorithm
The following lemma shows that the s-FT recovery algorithm described in Section 4.1, which we use to synthesize s-FTs, is sound in the sense that finitizing the recovered s-FT results in an identical finite transducer.
Lemma C.1.
Let be an FT and let be the corresponding s-FT obtained using the s-FT extraction algorithm. Then we have that if we use the representative character for each minterm .
Proof.
Without loss of generality, consider a transition where the input character is a and the output character is b. The proof generalizes to an output string with multiple characters. Now, one of 3 cases applies for the recovered output function:

1. If a = b, then the output function that we choose is the identity function. Re-finitizing this transition results in the original transition, as expected.

2. If the minterms containing a and b are single intervals of the same size, then the output function shifts its input by the constant offset from a to b. If we finitize the transition with the same witness for each minterm, then we get back the original transition.

3. Otherwise, we have a constant function that always outputs b. Finitizing the transition again results in the original transition.
∎
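The three cases of the recovery algorithm can be sketched over integer character codes and interval minterms (a toy of our own; in particular, the shift in case 2 is our reading of the same-size-interval condition):

```python
def recover_output_fn(a, b, in_minterm, out_minterm):
    # Lift a concrete transition (input code a -> output code b) to a
    # symbolic output function on the input minterm, following the three
    # proof cases: identity, constant shift over same-size intervals,
    # and constant output otherwise. Minterms are (lo, hi) intervals.
    if a == b:
        return lambda x: x            # case 1: identity
    (lo_i, hi_i), (lo_o, hi_o) = in_minterm, out_minterm
    if hi_i - lo_i == hi_o - lo_o:    # case 2: same-size interval minterms
        off = b - a
        return lambda x: x + off      # shift every character by b - a
    return lambda x: b                # case 3: constant output

# Example: witness 'a' -> 'A' over the same-size minterms [97, 122] and
# [65, 90] recovers the -32 shift, so the whole minterm is uppercased,
# not just the witness character.
to_upper = recover_output_fn(ord("a"), ord("A"), (97, 122), (65, 90))
```

Re-finitizing either the shift or the constant function on the witness character reproduces the original transition, which is what the lemma asserts.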
Appendix D Constraints for synthesizing Transducers with Regular Lookahead
In writing the lookahead constraints for a string , we omit the subscript and just write look and for and .
Input-Output Examples and Lookahead. Constraint 12 expresses how we compute the lookahead states for a string .
(12) |
Constraint 13 shows how the output configuration for input-output examples takes into account what the lookahead state is at any point. Given an example such that and , for every , , , and :
(13) |
Input-Output Types with Lookahead. The simulation relation sim also needs to track where we are in the lookahead automaton during any possible run. We therefore extend sim to include the states of the lookahead automaton. Since we travel ‘backwards’ over the lookahead automaton while moving ‘forward’ over the input, the simulation relation uses the inverse transition function of the lookahead automaton, and we introduce the corresponding uninterpreted function for it. The desired invariant is that the extended simulation relation holds exactly when there exist a lookahead word and a string witnessing the corresponding runs.
Our starting point in the simulation can be any state of the lookahead automaton. This is because the lookahead automaton reads a string backwards, so the run could start from any of its states.
(14) |
Then, to advance the simulation, we now use the inverse transition relation of . So, if we are at state , then for each character , the simulation advances to all the states that we could have reached from by reading . For any combination of and , we use the free variables and to encode the following constraint:
(15) |
Finally, Constraint 16 says that when we reach a final state and the initial state of the lookahead automaton (since we are going backwards over it), we cannot be in a non-final state of the output automaton.
(16) |