
Synthesizing Transducers from Complex Specifications

Anvay Grover, Ruediger Ehlers, and Loris D'Antoni. A. Grover and L. D'Antoni are with the Univ. of Wisconsin-Madison. R. Ehlers is with Clausthal University of Technology. Manuscript received April 19, 2005; revised August 26, 2015.
Abstract

Automating string transformations has been one of the killer applications of program synthesis. Existing synthesizers that solve this problem produce programs in domain-specific languages (DSLs) that are engineered to help the synthesizer, and therefore lack nice formal properties. This limitation prevents the synthesized programs from being used in verification applications (e.g., to check complex pre-post conditions) and makes the synthesizers hard to modify due to their reliance on the given DSL.

We present a constraint-based approach to synthesizing transducers, a well-studied model with strong closure and decidability properties. Our approach handles three types of specifications: (i) input-output examples, (ii) input-output types expressed as regular languages, and (iii) input/output distances that bound how many characters the transducer can modify when processing an input string. Our work is the first to support such complex specifications and it does so by using the algorithmic properties of transducers to generate constraints that can be solved using off-the-shelf SMT solvers. Our synthesis approach can be extended to many transducer models and it can be used, thanks to closure properties of transducers, to compute repairs for partially correct transducers.

1 Introduction

String transformations are used in data transformations [1], sanitization of untrusted inputs [2, 3], and many other domains [4]. Because in these domains bugs may cause serious security vulnerabilities [2], there has been increased interest in building tools that can help programmers verify [2, 3] and synthesize [1, 5, 6] string transformations.

Techniques for verifying string transformations rely on automata-theoretic approaches that provide powerful decidability properties [2]. On the other hand, techniques for synthesizing string transformations rely on domain-specific languages (DSLs) [1, 5]. These DSLs are designed to make synthesis practical and have to give up the closure and decidability properties enabled by automata-theoretic models. The disconnect between the two approaches raises a natural question: Can one synthesize automata-based models and therefore retain and leverage their elegant properties?

Transducers are a natural automata-based formal model for synthesizing string transformations. A finite state transducer (FT) is an automaton where each transition reads an input character and outputs a string of output characters. For instance, Figure 1 shows a transducer that ‘escapes’ instances of the " character. So, on input a"\"a, the transducer outputs the string a\"\\"a. FTs have found wide adoption in a variety of domains [7, 3] because of their many desirable properties (e.g., decidable equivalence checking and closure under composition [8]). There has been increasing work on building SMT solvers for strings that support transducers; e.g., the Ostrich tool [9] allows a user to write programs in SMT where string transformations are modelled using transducers. One can then write constraints over such programs and use an SMT solver to automatically check for satisfiability or prove unsatisfiability of those constraints. For example, given a program like the following:

y = escapeQuotes(x)
z = escapeQuotes(y)
assert(y==z) //Checking idempotence

one can use Ostrich to write a set of constraints and use them to prove whether the assertion holds. To do so, however, one first needs to write a transducer $T$ that implements the function escapeQuotes, and writing transducers by hand is a cumbersome and error-prone task.
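To make the model concrete, the following is a minimal sketch (pure Python, not part of the paper's toolchain) of how a finite state transducer is interpreted: two tables map a (state, character) pair to a target state and an output string. The two-state escape transducer below is a hypothetical reading of Figure 1 (state q1 tracks "just read a backslash"), not the paper's exact automaton; it does reproduce the input-output examples listed in Figure 1(b).

```python
# A finite state transducer as two tables: DST (target state) and DOUT (output
# string per transition). Hypothetical escape transducer: q0 = not after a
# backslash, q1 = after a backslash.
DST = {
    ("q0", "a"): "q0", ("q0", '"'): "q0", ("q0", "\\"): "q1",
    ("q1", "a"): "q0", ("q1", '"'): "q0", ("q1", "\\"): "q0",
}
DOUT = {
    ("q0", "a"): "a", ("q0", '"'): '\\"', ("q0", "\\"): "\\",
    ("q1", "a"): "a", ("q1", '"'): '"',   ("q1", "\\"): "\\",
}

def run_ft(s, q="q0"):
    """Run the transducer on s, concatenating the outputs of the transitions."""
    out = []
    for ch in s:
        out.append(DOUT[(q, ch)])
        q = DST[(q, ch)]
    return "".join(out)

print(run_ft('a"a'))  # the bare quote is escaped: a\"a
```

Running it on the examples of Figure 1(b), the already-escaped quote in a\"a is left alone, while the bare quote in a"a becomes \".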

(States $q_0$ (initial) and $q_1$; transitions labeled a → a, " → \", \ → \ and a → a, " → ", \ → \.)
(a) Transducer EscapeQuotes

Examples: { a"a ↦ a\"a,  a\\a ↦ a\\a,  a\a ↦ a\a,  a\"a ↦ a\"a,  \ ↦ \ }
Types: [a"]\?|([a"]\[a"\][a"]) → a\?|(a\[a"\]a)
Distance: at most 1 edit per input character

(b) Specification to synthesize EscapeQuotes
Figure 1: Simplified version of EscapeQuotes from [2].

In this paper, we present a technique for synthesizing transducers from high-level specifications. We use three different specification mechanisms to quickly yield desirable transducers: input-output examples, input-output types, and input-output distances. When provided with the specification in Figure 1(b), our approach yields the transducer in Figure 1. While none of the three specification mechanisms is effective in isolation, they work well in combination. Input-output examples are a natural specification mechanism that is easy to provide, but they only capture finitely many inputs. Similarly, input-output types are a natural way to prevent a transducer from generating undesired strings and can often be obtained from function/API specifications. Last, input-output distances are a natural way to specify how much of the input string should be preserved by the transformation.

We show that if the size of the transducers is fixed, all such specifications can be encoded as a set of constraints whose solution directly provides a transducer. While the constraints for examples are fairly straightforward, we introduce two new ideas for encoding types and distances. To encode types and distances, we show that one can use constraints to “guess” the simulation relation and the invariants necessary to prove that the transducer has the given type and respects the given distance constraint.

Because our constraint-based approach is based on decision procedures and is modular, it can support more complex models of transducers: (i) Symbolic Finite Transducers (s-FTs), which support large alphabets [10], and (ii) FTs with lookahead, which can express functions that otherwise require non-determinism. In addition, closure properties of transducers allow us to reduce repair problems for string transformations to our synthesis problem.

Contributions: We make the following contributions.

  • A constraint-based synthesis algorithm for synthesizing transducers from complex specifications (Sec. 3).

  • Extensions of our synthesis algorithm to more complex models—e.g., symbolic transducers and transducers with lookahead—and problems—e.g., transducer repair—that showcase the flexibility of our approach and the power of working with transducers, which enjoy strong theoretical properties—unlike domain-specific languages (Sec. 4).

  • astra: a tool that can synthesize and repair transducers and compares well with a state-of-the-art tool for synthesizing string transformations (Sec. 5).

Proofs and additional results are available in the appendix.

2 Transducer Synthesis Problem

In this section, we define the transducer synthesis problem.

A deterministic finite automaton (DFA) over an alphabet $\Sigma$ is a tuple $D=(Q_D,\delta_D,q^{\mathit{init}}_D,F_D)$: $Q_D$ is the set of states, $\delta_D:Q_D\times\Sigma\rightarrow Q_D$ is the transition function, $q^{\mathit{init}}_D$ is the initial state, and $F_D$ is the set of final states. The extended transition function $\delta_D^{*}:Q_D\times\Sigma^{\ast}\rightarrow Q_D$ is defined as $\delta_D^{*}(q,\varepsilon)=q$ and $\delta_D^{*}(q,au)=\delta_D^{*}(\delta_D(q,a),u)$. We say that $D$ accepts a string $w$ if $\delta_D^{*}(q^{\mathit{init}}_D,w)\in F_D$. The regular language $\mathcal{L}(D)$ is the set of strings accepted by a DFA $D$.
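The extended transition function is simply a fold of $\delta_D$ over the characters of the string. The sketch below (pure Python; the DFA for "even number of a's" is a hypothetical example, not from the paper) mirrors the definitions above.

```python
def delta_star(delta, q, w):
    """Extended transition function: fold delta over the characters of w."""
    for ch in w:
        q = delta[(q, ch)]
    return q

def accepts(dfa, w):
    """D accepts w iff delta*(q_init, w) is a final state."""
    delta, q_init, finals = dfa
    return delta_star(delta, q_init, w) in finals

# Hypothetical DFA over {a, b} accepting strings with an even number of a's.
DELTA = {("e", "a"): "o", ("e", "b"): "e", ("o", "a"): "e", ("o", "b"): "o"}
EVEN_AS = (DELTA, "e", {"e"})

print(accepts(EVEN_AS, "abab"))  # True: two a's
```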

A total finite state transducer (FT) is a tuple $T=(Q_T,\delta_T^{\mathit{st}},\delta_T^{\mathit{out}},q^{\mathit{init}}_T)$, where $Q_T$ is the set of states and $q^{\mathit{init}}_T$ is the initial state. Transducers have two transition functions: $\delta_T^{\mathit{st}}:Q_T\times\Sigma\rightarrow Q_T$ defines the target state, while $\delta_T^{\mathit{out}}:Q_T\times\Sigma\rightarrow\Sigma^{\ast}$ defines the output string of each transition. The extended function for states $\delta_T^{\mathit{st*}}$ is defined analogously to the extended transition function for a DFA. The extended function for output strings is defined as $\delta_T^{\mathit{out*}}(q,\varepsilon)=\varepsilon$ and $\delta_T^{\mathit{out*}}(q,au)=\delta_T^{\mathit{out}}(q,a)\cdot\delta_T^{\mathit{out*}}(\delta_T^{\mathit{st}}(q,a),u)$. Given a string $w$, we use $T(w)$ to denote $\delta_T^{\mathit{out*}}(q^{\mathit{init}}_T,w)$, i.e., the output string generated by $T$ on $w$. Given two DFAs $P$ and $Q$, we write $\{P\}T\{Q\}$ for a transducer $T$ iff for every string $s$ in $\mathcal{L}(P)$, the output string $T(s)$ belongs to $\mathcal{L}(Q)$.

An edit operation on a string is either an insertion/deletion of a character, or a replacement of a character with a different one. For example, editing the string ab to the string acb requires one edit operation, namely inserting a c after the a. The edit distance $\mathit{ed\_dist}(s,t)$ between two strings $s$ and $t$ is the number of edit operations required to reach $t$ from $s$. We use $\mathit{len}(w)$ to denote the length of a string $w$. The mean edit distance $\mathit{mean\_ed\_dist}(s,t)$ between two strings $s$ and $t$ is defined as $\mathit{ed\_dist}(s,t)/\mathit{len}(s)$. For example, the mean edit distance from ab to acb is $1/2=0.5$.
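The edit distance above is the standard Levenshtein distance; a textbook dynamic-programming sketch (not from the paper) suffices to check the running example.

```python
def ed_dist(s, t):
    """Levenshtein distance: insertions, deletions, replacements all cost 1."""
    prev = list(range(len(t) + 1))          # distances from s[:0] to t[:j]
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # replace (or match)
        prev = cur
    return prev[-1]

def mean_ed_dist(s, t):
    """Edit distance normalized by the length of the input string s."""
    return ed_dist(s, t) / len(s)

print(ed_dist("ab", "acb"), mean_ed_dist("ab", "acb"))  # 1 0.5
```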

We can now formulate the transducer synthesis problem. We assume a fixed alphabet $\Sigma$. If the specification requires that $s$ be translated to $t$, we write $s\mapsto t$.

Problem Statement 1 (Transducer Synthesis).

The transducer synthesis problem has the following inputs and output:
Inputs

  • Number of states $k$ and upper bound $l$ on the length of the output of each transition.

  • A set of input-output examples $E=[\overline{s\mapsto t}]$.

  • Input-output types $P$ and $Q$, given as DFAs.

  • A positive upper bound $d\in\mathbb{Q}$ on the mean edit distance.

Output A total transducer $T=(Q_T,\delta_T^{\mathit{st}},\delta_T^{\mathit{out}},q^{\mathit{init}}_T)$ with $k$ states such that:

  • Every transition of $T$ has an output of length at most $l$, i.e., $\forall q\in Q_T, a\in\Sigma.\ \mathit{len}(\delta_T^{\mathit{out}}(q,a))\leq l$.

  • $T$ is consistent with the examples: $\forall s\mapsto t\in E.\ T(s)=t$.

  • $T$ is consistent with the input-output types, i.e., $\{P\}T\{Q\}$.

  • For every string $w\in\mathcal{L}(P)$, $\mathit{mean\_ed\_dist}(w,T(w))\leq d$.

The synthesis problem that we present here is for FTs, and in Section 3, we provide a sound algorithm to solve it using a system of constraints. One of our key contributions is that our encoding can be easily adapted to synthesizing richer models than FTs (e.g., symbolic transducers [8] and transducers with regular lookahead), while still using the same encoding building blocks (Section 4).

3 Constraint-based Transducer Synthesis

In this section, we present a way to generate constraints to solve the transducer synthesis problem defined in Section 2. The synthesis problem can then be solved by invoking a Satisfiability Modulo Theories (SMT) solver on the constraints.

We use a constraint encoding, rather than a direct algorithmic approach, because of the multiple objectives that must be satisfied. Synthesizing a transducer that translates a set of input-output examples is already an NP-complete problem [11]. On top of that, we also need to handle input-output types and distances. Our encoding is divided into three parts, one for each objective, which are presented in the following subsections. This division makes our encoding modular and programmable; in Section 4 we show how it can be adapted to different transducer models and problems. We include a brief description of the size of the constraint encoding in Section B of the appendix.

3.1 Representing Transducers

The transducer we are synthesizing has $k$ states (with $k$ part of the problem input) $Q_T=\{q_0,\ldots,q_{k-1}\}$. We often write $q^{\mathit{init}}_T$ for $q_0$, the initial state of $T$.

We illustrate how a transition $q_1\xrightarrow{\texttt{a}/\texttt{bc}}q_2$ is represented in our encoding. The target state is captured using an uninterpreted function ${\texttt{d}^{\texttt{st}}}:Q_T\times\Sigma\to Q_T$, e.g., ${\texttt{d}^{\texttt{st}}}(q_1,\texttt{a})=q_2$. Representing the output of the transition is trickier because its length is not known a priori. The synthesis problem, however, provides an output bound $l$, which allows us to limit the number of characters that may appear in the output. We use an uninterpreted function ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}:Q_T\times\Sigma\times\{0,\ldots,l{-}1\}\to\Sigma$ to represent each character in the output string; in our example, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_1,\texttt{a},0)=\texttt{b}$ and ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_1,\texttt{a},1)=\texttt{c}$. Since an output string's length can be smaller than $l$, we use an additional uninterpreted function ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}:Q_T\times\Sigma\to\{0,\ldots,l\}$ to model the length of a transition's output; in our example, ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_1,\texttt{a})=2$.

We say that an assignment to the above functions extends to a transducer $T$ when $T$ is the transducer obtained by instantiating $\delta^{\mathit{st}}$ and $\delta^{\mathit{out}}$ as described above.

3.2 Input-output Examples

Goal: For each input-output example $s\mapsto t\in E$, $T$ should translate $s$ to $t$.

Translating $s$ to the correct output string means that $\delta_T^{\mathit{out*}}(q^{\mathit{init}}_T,s)=t$. Generating constraints that capture this behavior of $T$ on an example is challenging because we do not know a priori which parts of $t$ are produced by which steps of the transducer's run. Suppose that we need to translate $s=a_0a_1$ to $t=b_0b_1b_2$. A possible solution is for the transducer to have the run $q_0\xrightarrow{a_0/b_0}q_1\xrightarrow{a_1/b_1b_2}q_2$. Another possible solution is to instead have $q_0\xrightarrow{a_0/b_0b_1}q_1\xrightarrow{a_1/b_2}q_2$. Notice that the two runs traverse the same states but produce different parts of the output string at each step. Intuitively, we need a way to "track" how much output the transducer has produced before processing the $i$-th character of the input, and what state it has landed in. For every example $s\mapsto t$ such that $s=a_0\cdots a_{n-1}$ and $t=b_0\cdots b_{m-1}$, we introduce an uninterpreted function ${\texttt{config}}_s:\{0,\ldots,n\}\rightarrow\{0,\ldots,m\}\times Q_T$ such that ${\texttt{config}}_s(i)=(j,q_T)$ iff after reading $a_0\cdots a_{i-1}$, the transducer $T$ has produced the output $b_0\cdots b_{j-1}$ and reached state $q_T$, i.e., $\delta_T^{\mathit{out*}}(q_0,a_0\cdots a_{i-1})=b_0\cdots b_{j-1}$ and $\delta_T^{\mathit{st*}}(q_0,a_0\cdots a_{i-1})=q_T$.

We now describe the constraints that define the behavior of ${\texttt{config}}_s$. Constraint 1 states that a run must start at the initial state and at position 0 in the output.

{\texttt{config}}_{s}(0)=(0,q^{\mathit{init}}_{T}) (1)

Constraint 2 captures how the configuration is updated when reading the $i$-th character of the input. For every $0\leq i<n$, $0\leq j<m$, $c\in\Sigma$, and $q_T\in Q_T$:

\begin{split}&{\texttt{config}}_{s}(i)=(j,q_{T})\wedge a_{i}=c\Rightarrow\\ &\quad\quad[\bigwedge_{0\leq z<l}({\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_{T},c,z)=b_{j+z}\vee z\geq{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c))\wedge\\ &\quad\quad{\texttt{config}}_{s}(i+1)=(j+{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c),{\texttt{d}^{\texttt{st}}}(q_{T},c))]\end{split} (2)

Informally, if the $i$-th character is $c$ and the transducer has reached state $q_T$ having produced the characters $b_0\cdots b_{j-1}$ so far, then the transition reading $c$ from state $q_T$ outputs the characters $b_j\cdots b_{j+f-1}$, where $f$ is the output length of the transition. The next configuration is then $(j+f,{\texttt{d}^{\texttt{st}}}(q_T,c))$.

Finally, Constraint 3 forces $T$ to be completely done generating $t$ when $s$ has been entirely read. Recall that $\mathit{len}(s)=n$ and $\mathit{len}(t)=m$.

\bigvee_{q_{T}\in Q_{T}}{\texttt{config}}_{s}(n)=(m,q_{T}) (3)

The constraint encoding for examples is sound and complete (Proofs in A.1).
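To see what the config function pins down, the following sketch (pure Python, standing in for the SMT encoding; the two-state transducer and the example are hypothetical) computes ${\texttt{config}}_s$ for a fixed transducer and checks Constraints (1)-(3) concretely.

```python
# Concrete stand-in for Constraints (1)-(3): given a fixed transducer
# (DST, DOUT) and an example s -> t, build config_s step by step and check
# that the run reproduces exactly t. Transducer and example are hypothetical.
DST = {("q0", "a"): "q1", ("q1", "a"): "q0"}
DOUT = {("q0", "a"): "x", ("q1", "a"): "yz"}

def check_example(s, t):
    config = [(0, "q0")]                 # Constraint (1): start at (0, q_init)
    for ch in s:                         # Constraint (2): one transition per char
        j, q = config[-1]
        out = DOUT[(q, ch)]
        if t[j:j + len(out)] != out:     # produced characters must match t
            return False
        config.append((j + len(out), DST[(q, ch)]))
    return config[-1][0] == len(t)       # Constraint (3): all of t produced

print(check_example("aa", "xyz"))  # True: run q0 -a/x-> q1 -a/yz-> q0
```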

3.3 Input-Output Types

Goal: $T$ should satisfy the property $\{P\}T\{Q\}$.

Encoding this property using constraints is challenging because it requires enforcing that whenever $T$ reads one of the (potentially) infinitely many strings in $\mathcal{L}(P)$, it outputs a string in $\mathcal{L}(Q)$. To solve this problem, we draw inspiration from how one proves that the property $\{P\}T\{Q\}$ holds, namely using a simulation relation that relates runs over $P$, $T$, and $Q$. Intuitively, if $P$ has read some string $w$, we need to be able to encode the behavior of $T$ in terms of $w$: what state $T$ is in after reading $w$ and what output string $w'$ it has produced. Further, we also need to encode which state $Q$ reaches after reading the output string $w'$. We do this by introducing a function ${\texttt{sim}}:Q_P\times Q_T\times Q_Q\rightarrow\{0,1\}$ that preserves the following invariant: ${\texttt{sim}}(q_P,q_T,q_Q)$ holds if there exist strings $w,w'$ such that $\delta_P^{*}(q^{\mathit{init}}_P,w)=q_P$, $\delta_T^{\mathit{st*}}(q^{\mathit{init}}_T,w)=q_T$, $\delta_T^{\mathit{out*}}(q^{\mathit{init}}_T,w)=w'$, and $\delta_Q^{*}(q^{\mathit{init}}_Q,w')=q_Q$.

Constraint 4 states the initial condition of the simulation, i.e., that $P$, $T$, and $Q$ are in their initial states.

{\texttt{sim}}(q^{\mathit{init}}_{P},q^{\mathit{init}}_{T},q^{\mathit{init}}_{Q}) (4)

Constraint 5 encodes how we advance the simulation relation from states $q_P,q_T,q_Q$ on a character $c\in\Sigma$, using free variables $c_0,\ldots,c_{l-1}$ and $q_Q^0,\ldots,q_Q^l$ that are separate for each combination of $q_P$, $q_T$, $q_Q$, and $c$:

\begin{split}{\texttt{sim}}(q_{P},q_{T},q_{Q})\Rightarrow&\bigwedge_{\mathclap{0\leq z\leq l}}({\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)=z\Rightarrow\\ &\quad[\bigwedge_{\mathclap{0\leq x<z}}{\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_{T},c,x){=}c_{x}]\wedge\\ &\quad[q_{Q}^{0}{=}q_{Q}\wedge\bigwedge_{\mathclap{1\leq x<z}}q_{Q}^{x}{=}{\texttt{d}_{Q}}(q_{Q}^{x-1},c_{x-1})]\wedge\\ &\quad{\texttt{sim}}(\delta_{P}(q_{P},c),{\texttt{d}^{\texttt{st}}}(q_{T},c),q_{Q}^{z}))\end{split} (5)

Intuitively, if ${\texttt{sim}}(q_P,q_T,q_Q)$ holds and we read a character $c$, then $P$ moves to $\delta_P(q_P,c)$ and $T$ moves to ${\texttt{d}^{\texttt{st}}}(q_T,c)$. However, we also need to advance $Q$ over the ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_T,c)$ symbols produced by ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$. We hard-code the transition relation $\delta_Q$ in an uninterpreted function ${\texttt{d}_{Q}}:Q_Q\times\Sigma\rightarrow Q_Q$ and apply it to compute the state reached when $Q$ reads the output string. E.g., if ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_T,c)=2$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_T,c,0)=c_0$, and ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_T,c,1)=c_1$, the next state in $Q$ is ${\texttt{d}_{Q}}({\texttt{d}_{Q}}(q_Q,c_0),c_1)$.

Lastly, Constraint 6 states that if we have read a string in $\mathcal{L}(P)$, i.e., $P$ is in a state $q_P\in F_P$, then the relation does not contain a state $q_Q\notin F_Q$. Since $Q$ is deterministic, this means that $Q$ accepts $T$'s output.

\bigwedge_{q_{P}\in F_{P}}\bigwedge_{q_{Q}\notin F_{Q}}\neg{\texttt{sim}}(q_{P},q_{T},q_{Q}) (6)

Soundness and completeness of the type constraints are proven in A.2.
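For a fixed transducer, the sim relation is simply the set of reachable triples in the product of $P$, $T$, and $Q$. The sketch below (pure Python; the three machines are hypothetical: $P$ accepts a*, $T$ rewrites each a to bb, $Q$ accepts an even number of b's) computes that set and checks Constraint (6) on it.

```python
# Reachable-triple computation: sim(qP, qT, qQ) holds iff (qP, qT, qQ) is
# reachable in the product of P, T, and Q. All machines are hypothetical.
P = ({("p0", "a"): "p0"}, "p0", {"p0"})              # (delta, initial, finals)
Q = ({("e", "b"): "o", ("o", "b"): "e"}, "e", {"e"})
T_DST = {("t0", "a"): "t0"}
T_DOUT = {("t0", "a"): "bb"}

def holds_hoare_triple(P, T_dst, T_dout, t_init, Q, alphabet):
    (dP, pI, fP), (dQ, qI, fQ) = P, Q
    sim = {(pI, t_init, qI)}
    todo = [(pI, t_init, qI)]
    while todo:
        p, t, q = todo.pop()
        for c in alphabet:
            q2 = q
            for ch in T_dout[(t, c)]:        # advance Q over the output string
                q2 = dQ[(q2, ch)]
            nxt = (dP[(p, c)], T_dst[(t, c)], q2)
            if nxt not in sim:
                sim.add(nxt)
                todo.append(nxt)
    # Constraint (6): no reachable triple pairs a final state of P
    # with a non-final state of Q.
    return all(not (p in fP and q not in fQ) for (p, t, q) in sim)

print(holds_hoare_triple(P, T_DST, T_DOUT, "t0", Q, ["a"]))  # True
```

Replacing the output bb with a single b makes $T$ produce an odd number of b's on some inputs, and the check fails.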

3.4 Input-output Distance

Goal: The mean edit distance between any input string $w$ in $\mathcal{L}(P)$ and the output string $T(w)$ should not exceed $d$.

Capturing the edit distance for all possible inputs in the language of $P$ and the corresponding outputs produced by the transducer is challenging because these sets can be infinite. Furthermore, exactly computing the edit distance between an input and an output string may involve comparing characters that appear on different transitions in the transducer run. For example, consider the transducer shown in Figure 2(a) and suppose that we are only interested in strings in the input type $P=$ a(ba)*a. The first transition from $q_0$ deletes the a, therefore making 1 edit. This transducer has a cycle between states $q_1$ and $q_2$, which can be taken any number of times. Each iteration, locally, would require that we make 2 edits: one to change the b to a, and the other to change the a to b. However, the total number of edits made over any string in the input type $P$ by this transducer is 1, because the transducer changes strings of the form a(ba)$^n$a into strings of the form (ab)$^n$a. Looking at the transitions in isolation prevents us from deducing that the edit distance is always 1, because the first transition delays outputting a character. If there were no such delay, as is the case for the transducer in Figure 2(b), which is equivalent on the relevant input type to the one in Figure 2(a), this issue would not arise.

We take inspiration from Benedikt et al. [12] and focus on the simpler problem of synthesizing a transducer whose ‘aggregate cost’ satisfies the given objective. (Footnote 1: Benedikt et al. [12] studied a variant of the problem where the distance is bounded by some finite constant.) Their work shows that when there is a transducer between two languages that has some bounded global edit distance, then there is also a transducer that is bounded (but with a different bound) under a local method of computing the edit distance, i.e., one where the edit distance is computed transition by transition. For a transducer $T$ and string $s=a_0\ldots a_n$, let $q^{\mathit{init}}_T\xrightarrow{a_0/y_0}q_T^1\ldots q_T^n\xrightarrow{a_n/y_n}q_T^{n+1}$ be the run of $T$ on $s$. Then, the aggregate cost of $T$ on $s$ is the sum of the edit distances $\mathit{ed\_dist}(a_i,y_i)$ over all indices $0\leq i\leq n$. The mean aggregate cost of $T$ on $s$ is the aggregate cost divided by $\mathit{len}(s)$, the length of $s$. It follows that if $T$ has a mean aggregate cost lower than some specified $d$ for every string, then it also has a mean edit distance lower than $d$ for every string.

However, the mean aggregate cost overapproximates the edit distance: the transducer in Figure 2(a) has mean aggregate cost 1, while the mean edit distance when considering only strings in $P=$ a(ba)*a is less than $1/2$. For this reason, if the mean edit distance objective were set to $1/2$, our constraint encoding could only synthesize the transducer in Figure 2(b), and not the equivalent one in Figure 2(a).

(States $q_0$ (initial), $q_1$, $q_2$, $q_3$; transitions labeled a → ε, b → a, a → b, a → a.)
(a) Transducer with delayed output
(States $q_0$ (initial), $q_1$, $q_2$, $q_3$; transitions labeled a → a, b → b, a → a, a → ε.)
(b) Transducer without delay
Figure 2: Transducers with and without delay.

Our encoding is complete for transducers in which the aggregate cost coincides with the actual edit distance. We leave the problem of being complete with respect to the global edit distance as an open problem; in fact, we are not even aware of an algorithm for checking (rather than synthesizing) whether a transducer satisfies a mean edit distance objective. (Footnote 2: The mean edit distance is similar to the mean payoff [13], which discounts a cost by the length of a string and looks at the behavior of a transducer in the limit. Our distance is different because (1) it looks at finite-length strings, and (2) it requires computing the edit distance, which cannot be done one transition at a time.) In Section 4.2, we present transducers with lookahead, which can mitigate this source of incompleteness. Furthermore, our evaluation shows that using the aggregate cost and enabling lookahead are both effective techniques in practice.

We can now present our constraints. First, we provide constraints for the edit distance of individual transitions (recall that the transitions are themselves being synthesized, so their edit distances must be computed as part of the encoding). Second, we provide constraints that implicitly compute state invariants capturing the aggregate cost between input and output strings at various points of the computation. We are given a rational number $d$ as an input to the problem, which is the allowed distance bound.

Edit Distance of Individual Transitions. To compute the edit distance between the input and the output of each transition, we introduce a function ${\texttt{ed}}:Q_T\times\Sigma\rightarrow\mathbb{Z}$. For a transition from state $q_T$ reading a character $c$, ${\texttt{ed}}(q_T,c)$ represents the edit distance between $c$ and $\delta_T^{\mathit{out}}(q_T,c)$. Notice that this quantity is bounded by the output bound $l$. The constraints encoding the value of this function are divided into two cases: (i) the output of the transition contains the input character $c$ (Constraint 7), and (ii) the output of the transition does not contain the input character $c$ (Constraint 8). In both cases, the values are set via a simple case analysis on whether the length of the output is 0 (the edit distance is 1) or not (the edit distance is determined by the length of the output).

\begin{split}&[\bigvee_{\mathclap{0\leq z<{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)}}{\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_{T},c,z)=c]\Rightarrow\\ &\quad[{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)=0\Rightarrow{\texttt{ed}}(q_{T},c)=1\wedge\\ &\quad{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)\neq 0\Rightarrow{\texttt{ed}}(q_{T},c)={\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)-1]\end{split} (7)
\begin{split}&[\bigwedge_{\mathclap{0\leq z<{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)}}{\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_{T},c,z)\neq c]\Rightarrow\\ &\quad[{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)=0\Rightarrow{\texttt{ed}}(q_{T},c)=1\wedge\\ &\quad{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)\neq 0\Rightarrow{\texttt{ed}}(q_{T},c)={\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)]\end{split} (8)
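Constraints (7) and (8) amount to a small case split. The sketch below (pure Python, not the SMT encoding itself) mirrors that case analysis for a single transition and agrees with the Levenshtein distance between a one-character input and the transition's output.

```python
def trans_ed(c, out):
    """Edit distance between a single input character c and an output string,
    following the case analysis of Constraints (7) and (8)."""
    if len(out) == 0:
        return 1                 # empty output: delete c, 1 edit
    if c in out:
        return len(out) - 1      # keep c, insert the remaining characters
    return len(out)              # replace c, insert the rest

print(trans_ed("a", "bc"))  # 2: replace a with b, then insert c
```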

Edit Distance of Arbitrary Strings. Suppose that $T$ has the transitions $q_0\xrightarrow{\texttt{a}/\texttt{a}}q_1\xrightarrow{\texttt{a}/\texttt{bc}}q_2$ and the specified mean edit distance is $d=0.5$. The edit distance is 0 for the first transition and 2 for the second one. For the input string aa, the mean aggregate cost is $2/2$, which means that the specification is not satisfied. In general, we cannot keep track of every input string in the input type together with its length and the number of edits made over it. So how can we compute the mean aggregate cost over an arbitrary input string? The first part of our solution is to scale the edit distance of each transition by the specified mean edit distance. This makes it so that an input string is within the edit distance bound iff the sum of the weighted edit distances of its transitions is $\geq 0$. The invariant we maintain is that the sum of the weights at any stage of the run tells us where we stand with respect to the mean aggregate cost. For each transition, we compute the difference between the specified mean edit distance $d$ and the edit distance of the transition. We introduce the uninterpreted function ${\texttt{wed}}:Q_T\times\Sigma\rightarrow\mathbb{Q}$, which stands for weighted edit distance. For a transition at $q_T$ reading a character $c$, the weighted edit distance is given by ${\texttt{wed}}(q_T,c)=d-{\texttt{ed}}(q_T,c)$. The sum of the weights of all transitions in a run gives the cumulative difference. Going back to our example, the weighted edit distances of the two transitions are ${\texttt{wed}}(q_0,\texttt{a})=0.5$ and ${\texttt{wed}}(q_1,\texttt{a})=-1.5$, making the cumulative difference $-1$ and implying that the specification is violated. We can thus track the mean edit distance over a run without keeping track of the length of the run and the number of edits performed over it.

We still need to compute the weighted edit distance for every string in the possibly infinite language $\mathcal{L}(P)$. Building on the idea of simulation from the previous section, we introduce a new function ${\texttt{en}}:Q_P\times Q_T\times Q_Q\rightarrow\mathbb{Q}$, which tracks an upper bound on the sum of the distances accumulated so far at that point in the simulation. This function is similar to a progress measure, a type of invariant used to solve energy games [14]; we expand on this connection in Section 6. In particular, we already know that if there exist strings $w,w'$ such that $\delta_P^{*}(q^{\mathit{init}}_P,w)=q_P$, $\delta_T^{\mathit{st*}}(q^{\mathit{init}}_T,w)=q_T$, $\delta_T^{\mathit{out*}}(q^{\mathit{init}}_T,w)=w'$, and $\delta_Q^{*}(q^{\mathit{init}}_Q,w')=q_Q$, then ${\texttt{sim}}(q_P,q_T,q_Q)$ holds. Let this run over $T$ be denoted by $q^{\mathit{init}}_T\xrightarrow{a_0/y_0}q_T^1\ldots q_T^{n-1}\xrightarrow{a_{n-1}/y_{n-1}}q_T$, where $w=a_0\cdots a_{n-1}$, $w'=y_0\cdots y_{n-1}$, and $q_T=q_T^n$. We have that ${\texttt{en}}(q_P,q_T,q_Q)\geq\sum_{i=0}^{n-1}{\texttt{wed}}(q_T^i,a_i)$.

The en function is a budget on the number of edits we can still perform. At the initial states, we start with no ‘initial credit’ and the energy is 0.

$\texttt{en}(q_P^{\mathit{init}},q_T^{\mathit{init}},q_Q^{\mathit{init}})=0$ (9)

Constraint 10 bounds the energy budget according to the weighted edit distance of a transition by computing the minimum budget required at any point to still satisfy the distance bound. For each combination of $q_P$, $q_T$, $q_Q$, and $c\in\Sigma$, the constraint uses free variables $c_0,\ldots,c_l$ and $q_Q^0,\ldots,q_Q^{l-1}$:

$$\bigwedge_{0\leq z<l}\Big(\texttt{d}^{\texttt{out}}_{\texttt{len}}(q_T,c)=z\Rightarrow\big[\bigwedge_{0\leq x<z}\texttt{d}^{\texttt{out}}_{\texttt{ch}}(q_T,c,x)=c_x\big]\wedge\big[q_Q^0=q_Q\wedge\bigwedge_{1\leq x<z}q_Q^x=\texttt{d}_Q(q_Q^{x-1},c_{x-1})\big]\wedge\texttt{en}(q_P,q_T,q_Q)\geq\texttt{en}(\delta_P(q_P,c),\texttt{d}^{\texttt{st}}(q_T,c),q_Q^z)-\texttt{wed}(q_T,c)\Big) \quad (10)$$

In our example, Constraint 10 encodes that the energy at $q_0$ can be 0.5 less than that at $q_1$ (the first transition performs no edits and earns the unused mean distance), but that the energy at $q_1$ needs to be 1.5 greater than that at $q_2$, since we spend 2 edit operations over the second transition while the mean bound grants only 0.5.

At any point during a run, the transducer is allowed to go below the mean edit distance and then ‘catch up’ later because we only care about the edit distance when the transducer has finished reading a string in $\mathcal{L}(P)$. Therefore, when we reach a final state of $P$, the transducer should not be in ‘energy debt’.

$\bigwedge_{q_P\in F_P}\texttt{sim}(q_P,q_T,q_Q)\Rightarrow\texttt{en}(q_P,q_T,q_Q)\geq 0$ (11)

The encoding presented in this section is sound (proofs in Appendix A.3).
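To illustrate the role of $\texttt{en}$ concretely, the following sketch (illustrative Python over a hypothetical product-state graph; not our SMT encoding) checks Constraints 9, 10 (its energy inequality), and 11 against a candidate energy assignment:

```python
# Product-state edges (src, dst, wed), where wed is the weighted edit
# distance of the transition. Constraint 10's energy inequality requires
# en[src] >= en[dst] - wed along every edge.
edges = [("s0", "s1", 0.5), ("s1", "s2", -1.5)]

def check_energy(en, edges, init, finals):
    if en[init] != 0:                       # Constraint 9: no initial credit
        return False
    for src, dst, w in edges:               # Constraint 10: budget bound
        if en[src] < en[dst] - w:
            return False
    return all(en[f] >= 0 for f in finals)  # Constraint 11: no final debt

# With these weights no assignment can work: en["s1"] must be both
# <= 0.5 (from the first edge) and >= 1.5 (second edge plus en["s2"] >= 0),
# mirroring the violated d = 0.5 bound of the running example.
en = {"s0": 0, "s1": -0.5, "s2": 1.0}
print(check_energy(en, edges, "s0", ["s2"]))  # prints False
```

In the actual encoding, the solver searches for such an assignment of $\texttt{en}$ (a progress measure) together with the transducer itself.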

4 Richer Models and Specifications

We extend our technique to more expressive models (Sections 4.1 and 4.2) and show how our synthesis approach can be used not only to synthesize transducers, but also to repair them (Section 4.3). Furthermore, in Appendix E, we describe an encoding of an alternative distance measure.

4.1 Symbolic Transducers

Symbolic finite automata (s-FAs) and symbolic finite transducers (s-FTs) extend their non-symbolic counterparts by allowing transitions to carry predicates and functions that represent (potentially infinite) sets of input characters and output strings. Figure 3(a) shows an s-FT that extends the escapeQuotes transducer from Figure 1(a) to handle alphabetic characters. The bottom transition from $q_0$ reads a character " (bound to the variable $x$) and outputs the string \" (i.e., a \ followed by the character stored in $x$). Symbolic finite automata (s-FAs) are s-FTs with no outputs. To simplify our exposition, we focus on s-FAs and s-FTs that only operate over ASCII characters, ordered by their codes. In particular, all of our predicates are unions of intervals over characters (i.e., $x\neq\texttt{\textbackslash}$ is really the union of the intervals [NUL-[] and []-DEL]); we often use the predicate notation instead of explicitly writing the intervals for ease of presentation. Furthermore, we only consider two types of output functions: constant characters and offset functions of the form $x+k$, which output the character obtained by adding a constant $k$ to the input $x$; e.g., applying $x+(-32)$ to a lowercase alphabetic letter gives the corresponding uppercase letter.

In the rest of the section, we show how we can solve the transducer synthesis problem in the case where $P$ and $Q$ are s-FAs and the goal is to synthesize an s-FT (instead of an FT) that meets the given specification. Intuitively, we do this by ‘finitizing’ the alphabet of the now-symbolic input-output types, synthesizing a finite transducer over this alphabet using the technique presented in Section 3, and then extracting an s-FT from the solution.

Finitizing the Alphabet. The idea of finitizing the alphabet of s-FAs is a known one [8] and is based on the concept of $\mathit{minterms}$, the maximal satisfiable Boolean combinations of the predicates appearing in the s-FAs. For an s-FA $M$, we can define its set of predicates as $\mathit{Predicates}(M)=\{\phi\mid q\xrightarrow{\phi}q'\in\delta_M\}$. The set of minterms $\mathit{mterms}(M)$ is the set of satisfiable Boolean combinations of all the predicates in $\mathit{Predicates}(M)$. For example, for the set of predicates of the s-FT escapeQuotes in Figure 3(a), we have $\mathit{mterms}(\text{escapeQuotes})=\{x\neq\texttt{"}\wedge x\neq\texttt{\textbackslash},\ x=\texttt{"},\ x=\texttt{\textbackslash}\}$. The reader can learn more about minterms in [8]. We assign each minterm a representative character, as indicated in Figure 3(c), and then construct a finite automaton over the resulting finite alphabet $\Sigma$. For a character $c\in\Sigma$, we refer to its corresponding minterm by $mt(c)$. In the other direction, for each minterm $\psi\in\mathit{mterms}(M)$, we refer to its uniquely determined representative character by $\mathit{wit}(\psi)$.
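For intuition, minterm computation can be sketched by treating each predicate as the set of characters it accepts over a (sampled) finite alphabet; characters with the same membership signature belong to the same minterm. This illustrative Python sketch, with a hypothetical four-character alphabet, is not the interval-based implementation:

```python
# Predicates of escapeQuotes over a tiny sample alphabet: each predicate
# is represented extensionally as the set of characters it accepts.
alphabet = ['a', 'b', '"', '\\']
predicates = [
    {c for c in alphabet if c not in ('"', '\\')},  # x != " and x != \
    {'"'},                                          # x == "
    {'\\'},                                         # x == \
]

def minterms(alphabet, predicates):
    """Group characters by which predicates they satisfy; each group is a
    satisfiable Boolean combination of predicates, i.e., a minterm."""
    classes = {}
    for c in alphabet:
        sig = tuple(c in p for p in predicates)
        classes.setdefault(sig, []).append(c)
    return classes

mts = minterms(alphabet, predicates)
# One representative character per minterm (mirrors wit(psi) in the text).
wit = {sig: chars[0] for sig, chars in mts.items()}
print(sorted(wit.values()))  # ['"', '\\', 'a']
```

Here three minterms emerge, matching Figure 3(c), with a standing in for all ordinary characters.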

[Diagram: the escapeQuotes s-FT over states $q_0$ (initial) and $q_1$, with predicate-guarded transitions such as $x\neq\texttt{"}\wedge x\neq\texttt{\textbackslash}\rightarrow x$ and $x=\texttt{"}\rightarrow\texttt{\textbackslash}x$.]
(a) escapeQuotes s-FT
[Diagram: the corresponding finitized FT over the witness alphabet, with concrete transitions such as a $\rightarrow$ a and " $\rightarrow$ \".]
(b) $\mathit{F}(\text{escapeQuotes})$
minterms: $[x\neq\texttt{"}\wedge x\neq\texttt{\textbackslash}]$, $[x=\texttt{"}]$, $[x=\texttt{\textbackslash}]$; witness characters: $\mathit{wit}([x\neq\texttt{"}\wedge x\neq\texttt{\textbackslash}])=\texttt{a}$, $\mathit{wit}([x=\texttt{"}])=\texttt{"}$, $\mathit{wit}([x=\texttt{\textbackslash}])=\texttt{\textbackslash}$
(c) Set of minterms and their witness elements
Figure 3: Example of Finitization

For an s-FA $M$, we denote its corresponding FA over the alphabet $\mathit{mterms}(M)$ by $\mathit{F}(M)$. Given an s-FA $M$, the set of transitions of $\mathit{F}(M)$ is defined as follows:

$$\delta_{\mathit{F}(M)}=\{q\xrightarrow{\mathit{wit}(\psi)}q'\mid q\xrightarrow{\phi}q'\in\delta_M\wedge\psi\in\mathit{mterms}(M)\wedge\text{IsSat}(\psi\wedge\phi)\}$$

This algorithm replaces a transition guarded by a predicate $\phi$ in the given s-FA with a set of transitions consisting of the witnesses of the minterms with which $\phi$ is jointly satisfiable. In interval arithmetic, this is the set of intervals that intersect the interval specified by $\phi$. The transition from $q_1$ guarded by the predicate $[x\neq\texttt{\textbackslash}]$ in Figure 3(a) intersects two minterms, $[x\neq\texttt{"}\wedge x\neq\texttt{\textbackslash}]$ and $[x=\texttt{"}]$. As a result, this transition is replaced by two transitions in Figure 3(b): one that reads " and another that reads a.
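The construction of $\delta_{\mathit{F}(M)}$ can be sketched in the same set-based style (illustrative Python; the minterm sets and state names are hypothetical samples, not the interval-based implementation):

```python
# Minterms as character sets mapped to their witness characters.
minterms = {
    frozenset('ab'): 'a',    # [x != " and x != \], sampled as {a, b}
    frozenset('"'): '"',     # [x == "]
    frozenset('\\'): '\\',   # [x == \]
}

def finitize(transitions, minterms):
    """Replace each predicate-guarded transition (src, phi, dst) with one
    witness-labeled transition per minterm that intersects the guard,
    i.e., per minterm psi with IsSat(psi and phi)."""
    out = set()
    for src, phi, dst in transitions:
        for mt, w in minterms.items():
            if mt & phi:                 # nonempty intersection = satisfiable
                out.add((src, w, dst))
    return out

# The transition from q1 guarded by [x != \] intersects two minterms.
transitions = [("q1", frozenset('ab"'), "q0")]
print(sorted(finitize(transitions, minterms)))
# [('q1', '"', 'q0'), ('q1', 'a', 'q0')]
```

This mirrors how the single guarded transition in Figure 3(a) becomes two concrete transitions in Figure 3(b).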

From FTs to s-FTs. Once we have synthesized an FT $T$, we need to extract an s-FT from it. There are many s-FTs equivalent to a given FT, and here we present the one way of doing this conversion that is used in our implementation. Let the size of an interval $I$ (the number of characters it contains) be given by $\mathit{size}(I)$, and the offset between two intervals $I_1$ and $I_2$ (i.e., the difference between the least elements of $I_1$ and $I_2$) be given by $\mathit{offset}(I_1,I_2)$. Suppose we have a transition $q\xrightarrow{c/y_0\cdots y_n}q'$, where $c,y_i\in\Sigma$. Then we construct a transition $q\xrightarrow{mt(c)/f_0\cdots f_n}q'$, where for each $y_i$ the corresponding function $f_i$ is determined by the following rules ($x$ always denotes the variable bound to the input predicate):

  1. If $c=y_i$, then $f_i=(x)$, i.e., the identity function.

  2. If $mt(c)$ and $mt(y_i)$ consist of single intervals $I_1$ and $I_2$, respectively, such that $\mathit{size}(I_1)=\mathit{size}(I_2)$, then $f_i=(x+\mathit{offset}(I_1,I_2))$. For instance, if the input interval is [a-z] and the output interval is [A-Z], then the output function is $(x+(-32))$, which maps lowercase letters to uppercase ones.

  3. Otherwise $f_i=y_i$, i.e., the output is a constant character in the output minterm.

While our s-FT recovery algorithm is sound, it may apply case 3 more often than necessary and introduce many constants, therefore yielding a transducer that does not generalize well to unseen examples. However, our evaluation shows that the technique works well in practice. We provide a sketch of the soundness proof of this algorithm in Appendix C.
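The three recovery rules can be sketched as follows (illustrative Python over single-interval minterms given as (lo, hi) character ranges; the minterm map is a hypothetical example):

```python
def size(iv):
    """Number of characters in a closed interval (lo, hi)."""
    lo, hi = iv
    return ord(hi) - ord(lo) + 1

def offset(i1, i2):
    """Difference between the least elements of two intervals."""
    return ord(i2[0]) - ord(i1[0])

def output_fn(c, y, mt):
    """Pick an output function for input character c producing y,
    where mt(z) gives the single-interval minterm of a character."""
    if c == y:
        return ("id",)                     # rule 1: identity function
    i1, i2 = mt(c), mt(y)
    if size(i1) == size(i2):
        return ("offset", offset(i1, i2))  # rule 2: x + offset
    return ("const", y)                    # rule 3: constant character

# Hypothetical minterms: lowercase letters, uppercase letters, singletons.
def mt(ch):
    if ch.islower():
        return ("a", "z")
    if ch.isupper():
        return ("A", "Z")
    return (ch, ch)

print(output_fn("x", "x", mt))   # ('id',)
print(output_fn("a", "A", mt))   # ('offset', -32)
print(output_fn("a", '"', mt))   # ('const', '"')
```

The second call reproduces the lowercase-to-uppercase example, since [a-z] and [A-Z] have equal size and offset $-32$.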

4.2 Synthesizing Transducers with Lookahead

Deterministic transducers cannot express functions in which the output at a certain transition depends on future characters in the input. Consider the problem of extracting all substrings of the form <x> (where $\texttt{x}\neq\texttt{<}$) from an input string. This is the getTags problem from [15]. A deterministic transducer cannot express this transformation because, when it reads < followed by x, it has to output <x if the next character is > and nothing otherwise. However, the transducer does not have access to the next character!

Instead, we extend our technique to handle deterministic transducers with lookahead, i.e., with the ability to look at the string suffix when reading a symbol. Formally, a transducer with regular lookahead is a pair $(T,R)$ where $T$ is an FT with $\Sigma_T=Q_R\times\Sigma$, and $R$ is a total DFA with $\Sigma_R=\Sigma$. The transducer $T$ now has another input in its transition function, although it still outputs only characters from $\Sigma$, i.e., $\delta_T^{\mathit{out}}:Q_T\times(Q_R\times\Sigma)\rightarrow\Sigma^*$ and $\delta_T^{\mathit{st}}:Q_T\times(Q_R\times\Sigma)\rightarrow Q_T$. The semantics is defined as follows. Given a string $w=a_0\cdots a_n$, we define a function $r_w$ such that $r_w(i)=\delta_R^*(q_R^{\mathit{init}},a_n\cdots a_{i+1})$. In other words, $r_w(i)$ gives the state reached by $R$ on the reversed suffix starting at position $i+1$. At each step $i$, the transducer $T$ reads the symbol $(a_i,r_w(i))$. The extended transition functions now take as input a lookahead word, which is a sequence of pairs of lookahead states and characters, i.e., an element of $(Q_R\times\Sigma)^{\ast}$.
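The values $r_w(i)$ can all be computed in one backward pass over the input, as this sketch shows (illustrative Python; the two-state DFA is a hypothetical stand-in that only tracks whether the next character is >, loosely in the spirit of the getTags lookahead):

```python
def lookahead_values(w, delta, r_init):
    """r_w(i) is the state reached by the lookahead DFA on the reversed
    suffix a_n ... a_{i+1}; computed right-to-left in a single pass."""
    n = len(w)
    r = [None] * n
    state = r_init                      # at the last position the suffix is empty
    for i in range(n - 1, -1, -1):
        r[i] = state
        state = delta[(state, w[i])]    # consuming a_i yields r_w(i - 1)
    return r

# Hypothetical DFA: state r1 means "the next input character is '>'".
delta = {}
for ch in "<x>":
    delta[("r0", ch)] = "r1" if ch == ">" else "r0"
    delta[("r1", ch)] = "r1" if ch == ">" else "r0"

print(lookahead_values("<x>", delta, "r0"))  # ['r0', 'r1', 'r0']
```

At position 1 (the x), the lookahead state is r1, which is exactly the situation in which the synthesized getTags transducer emits <x.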

To synthesize transducers with lookahead, we introduce uninterpreted functions $\texttt{d}_{\texttt{R}}$ for the transition function of $R$, and $\texttt{look}_w$ for the $r$-values of $w$ on $R$. Additionally, we introduce a bound $k_R$ on the number of states in the lookahead automaton $R$, as our synthesis algorithm has to synthesize both $T$ and $R$ at the same time. Appendix D provides the modified constraints needed to encode input-output types and input-output examples with lookahead.

Part of the transducer with lookahead that we synthesize for the getTags problem is shown in Figure 4. Notice that there are two transitions out of $q_1$ for the same input but different lookahead states: the string <x is output when the lookahead state is $r_1$.

[Diagram: transitions over states $q_0$ (initial) and $q_1$, each labeled with a guard, a lookahead state, and an output, e.g., $x=\texttt{<},r_0\rightarrow\epsilon$ and $x\neq\texttt{<}\wedge x\neq\texttt{>},r_1\rightarrow\texttt{<x}$.]
(a) Subset of transitions in $T$
[Diagram: a two-state lookahead DFA over $r_0$ (initial) and $r_1$, with transitions guarded by $x=\texttt{<}$, $x=\texttt{>}$, and $x\neq\texttt{<}\wedge x\neq\texttt{>}$.]
(b) Lookahead automaton $R$
Figure 4: Regular lookahead for getTags

Lookahead and aggregate cost: Lookahead can help represent transducers, even deterministic ones, in a way that has lower aggregate cost, i.e., such that the aggregate cost better approximates the actual edit distance. Suppose that we want to synthesize a transducer that translates the string abc to ab and the string abd to bd. This translation can be done using a deterministic transducer with transitions $q_0\xrightarrow{\texttt{a}/\epsilon}q_1\xrightarrow{\texttt{b}/\epsilon}q_2$, followed by two transitions from $q_2$ that choose the correct output based on the next character. Such a transducer has a high aggregate cost of 4, even though the actual edit distance is 1. In contrast, using lookahead we can obtain a transducer that outputs each character as it reads it; this transducer has aggregate cost 1 on either string. We conjecture that for every transducer $T$ there always exists an equivalent transducer with regular lookahead $(T',R)$ for which the edit distance computation for aggregate cost coincides with the actual edit distance of $T$.

4.3 Transducer Repair

In this section, we show how our synthesis technique can also be used to “repair” buggy transducers. The key idea is to use the closure properties of automata and transducers—e.g., closure under union and sequential compositions [8]—to reduce repair problems to synthesis ones. The ability to algebraically manipulate transducers and automata is one of the key aspects that distinguishes our work from other synthesis works that use domain-specific languages [1, 5].

We describe two settings in which we can repair an incorrect transducer $T_{\textit{bad}}$: 1. Let $\{P\}T_{\textit{bad}}\{Q\}$ be an input-output type violated by $T_{\textit{bad}}$, and let $\mathit{Out}_P(T_{\textit{bad}})$ be the finite automaton describing the set of strings $T_{\textit{bad}}$ can output when fed inputs in $P$ (this is computable thanks to closure properties of transducers). We are interested in the case where $\mathit{Out}_P(T_{\textit{bad}})\setminus Q\neq\emptyset$, i.e., $T_{\textit{bad}}$ can produce strings that are not in the output type.

2. Let $[\overline{s\mapsto t}]$ be a set of input-output examples. We are interested in the case where there is some example $s\mapsto t$ such that $T_{\textit{bad}}(s)\neq t$.

We describe two ways in which we repair transducers.

Repairing from the Input Language. This approach synthesizes a new transducer for the inputs on which $T_{\textit{bad}}$ is incorrect. Using properties of transducers, we can compute an automaton describing the exact set of inputs $P_{\textit{bad}}\subseteq P$ for which $T_{\textit{bad}}$ does not produce an output in $Q$ (see pre-image computation in [10]). Let $\mathit{restrict}(T,L)$ be the transducer that behaves as $T$ if the input is in $L$ and does not produce an output otherwise (transducers are closed under restriction [10]). If we synthesize a transducer $T_1$ with type $\{P_{\textit{bad}}\}T_1\{Q\}$, then the transducer $\mathit{restrict}(T_1,P_{\textit{bad}})\cup\mathit{restrict}(T_{\textit{bad}},P\setminus P_{\textit{bad}})$ satisfies the desired input-output type (transducers are closed under union).

Fault Localization from Examples. We use this technique when $T_{\textit{bad}}$ is incorrect on some example. We can compute a set of “suspicious” transitions by taking all the transitions traversed when $T_{\textit{bad}}(s)\neq t$ for some $s\mapsto t\in E$ (i.e., one of these transitions is wrong) and removing all the transitions traversed when $T_{\textit{bad}}(s)=t$ for some $s\mapsto t\in E$ (i.e., transitions that are likely correct). Essentially, this is a way of identifying $P_{\textit{bad}}$ when $T_{\textit{bad}}$ is wrong on some examples. We can also use this technique to limit the set of transitions we need to synthesize when performing repair.
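This localization step reduces to set operations over traversed transitions, as in the following sketch (illustrative Python; the transition identifiers and precomputed traces are hypothetical placeholders):

```python
# For each example s -> t: whether T_bad produced the expected output,
# and the set of transitions T_bad traversed while reading s.
runs = [
    {"correct": False, "trace": {"t1", "t2", "t3"}},  # failing example
    {"correct": True,  "trace": {"t1", "t4"}},        # passing example
    {"correct": True,  "trace": {"t2", "t4"}},        # passing example
]

def suspicious(runs):
    """Transitions seen on some failing run, minus transitions seen on
    any passing run: the remainder are candidates for re-synthesis."""
    failing = set().union(*(r["trace"] for r in runs if not r["correct"]))
    passing = set().union(*(r["trace"] for r in runs if r["correct"]))
    return failing - passing

print(sorted(suspicious(runs)))  # ['t3']
```

Here t1 and t2 are exonerated by the passing examples, leaving only t3 as suspicious and shrinking the space of unknowns the solver must fill.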

5 Evaluation

We implemented our technique in a Java tool astra (Automatic Synthesis of TRAnsducers), which uses Z3 [16] to solve the generated constraints. We evaluate on a 2.7 GHz Intel Core i5 with 8 GB RAM, using a 300 s timeout.

Q1: Can astra synthesize practical transformations?

Benchmarks. Our first set of benchmarks is obtained from Optician [5, 6], a tool for synthesizing lenses, which are bidirectional programs used for keeping files in different data formats synchronized. We adapted 11 of these benchmarks to work with astra (note that we only synthesize one-directional transformations), and added one additional benchmark extrAcronym2, which is a harder variation (with a larger input type) of extrAcronym. We excluded benchmarks that require some memory, e.g., swapping words in a sentence, as they cannot be modeled with transducers.

Our second set of benchmarks (Miscellaneous) consists of 6 problems we created based on file-transformation tasks (unixToDos, dosToUnix, and CSVSeparator) and s-FTs from the literature: escapeQuotes from [17], and getTags and quicktimeMerger from [15]. All of the benchmarks require synthesizing s-FTs, and getTags requires synthesizing an s-FT with lookahead (details in Table I).

To generate the sets of examples, we started with the examples that were used in the original source when available. In 5 cases, astra synthesized a transducer that was not equivalent to the one synthesized by Optician, even though it did meet the specification. In these cases, we used astra to synthesize two different transducers that met the specification, computed a string on which the two transducers differed, and added the desired output for that string as a new example. We repeated this task until astra yielded the desired transducer and we report the time for such sets of examples. The number of examples varies between 1 and 5. The ability to perform equivalence checks of two transducers is yet another reason why synthesizing transducers is useful and is in some ways preferable to synthesizing programs in a DSL. For each benchmark we chose a mean edit distance of 0.5 when the transformation could be synthesized with this distance and of 1 otherwise.

TABLE I: Metrics of astra’s performance on the set of synthesis benchmarks. The right-most set of columns gives the synthesis time for astra and Optician (under 2 different configurations). The middle set of columns gives the sizes of the parameters to the synthesis problem. In particular, QPQ_{P} and QQQ_{Q} denote the number of input and output states and δP\delta_{P} and δQ\delta_{Q} denote the number of transitions in the input and output types, respectively. A ✗ represents a benchmark that failed. — stands in for data that is not available; this is because we only re-ran Optician on the benchmarks that were already encoded in its benchmark set, plus a few additional ones for comparing between the tools that we wrote ourselves.
 Time(s)
Benchmark QPQ_{P} QQQ_{Q} δP\delta_{P} δQ\delta_{Q} Σ\Sigma EE kk ll dd astra Optician Optician-re
Optician extrAcronym 6 3 10 3 3 2 1 1 .5 0.11 0.05
extrAcronym2 6 3 16 3 3 3 2 1 1 0.42
extrNum 15 13 17 12 3 1 1 1 1 0.93 0.05 0.07
extrQuant 4 3 8 5 2 1 2 1 1 0.19 0.09
normalizeSpaces 7 6 19 10 2 2 2 1 1 0.46 16.64
extrOdds 15 9 29 13 5 3 3 2 1 15.87 0.12
capProb 3 3 3 3 2 2 2 1 1 0.05 0.05
removeLast 6 3 8 3 3 3 2 1 .5 0.21 0.15 0.07
sourceToViews 18 7 26 15 5 3 3 2 1 50.92 0.06
normalizeNamePos 19 7 35 24 13 1 6 2 1 0.05 0.10
titleConverter 22 13 41 41 15 1 3 1 1 0.07
bibtextToReadable 14 11 41 35 12 1 5 1 1 0.64 0.15
Miscellaneous unixToDos 5 7 17 19 4 4 2 2 .5 1.24
dosToUnix 7 5 19 17 4 4 2 1 .5 0.41
CSVSeparator 5 5 9 9 4 1 1 1 1 0.142
escapeQuotes 2 2 6 5 3 5 2 2 1 0.188
quicktimeMerger 7 3 9 3 2 2 1 1 .5 0.075
getTags 3 3 9 4 3 5 2 2 1 0.95

Effectiveness of astra. astra can solve 15/18 benchmarks (13 in less than 1 s and 2 in under a minute) and times out on the 3 benchmarks where both $P$ and $Q$ are big.

While the synthesized transducers have at most 3 states, we note that this is because astra synthesizes total transducers and then restricts their domains to the input type $P$. This is advantageous because synthesizing small total transducers is easier than synthesizing transducers that require more states to define the domain. For instance, when we restrict the solution of extrAcronym2 to its input type, the resulting transducer has 11 states instead of the 2 required by the original solution!

Comparison with Optician. We do not compare astra to tools that only support input-output examples. Instead, we compare astra to Optician on the set of benchmarks common to both tools. Like astra, Optician supports input-output examples and types, but the types are expressed as regular expressions. Furthermore, Optician also attempts to produce a program that minimizes a fixed information theoretical distance between the input and output types [5].

Optician is faster on benchmarks where the number of variables in astra’s constraint encoding grows large, while astra is faster on the normalizeSpaces benchmark. Optician, which uses regular expressions to express the input and output types, does not handle unstructured data well. To confirm this trend, we wrote synthesis tasks for the escapeQuotes and getTags benchmarks in Optician, and it was unable to synthesize them; e.g., escapeQuotes requires replacing every " character with \".

To further examine Optician’s reliance on regular expressions, we converted the regular expressions used in the lens-synthesis benchmarks to automata and then back to regular expressions using a variant of the state-elimination algorithm that acts on character intervals. This results in regular expressions that are not very concise and may contain redundancies. Optician could only solve 4/11 of the benchmarks it was previously able to synthesize (Optician-re in Table I).

Answer to Q1: astra can solve real-world benchmarks and has performance comparable to that of Optician for similar tasks. Unlike Optician, astra does not suffer from variations in how the input and output types are specified.

Q2: Can astra repair transducers in practice?

Benchmarks. We considered the benchmarks in Table II. The only pre-existing benchmark that we found was escapeQuotes, through the interface of the Bek programming language used for verifying transducers [17]. We generated 11 additional faulty transducers to repair in the following two ways: (i) Introducing faults in our synthesis benchmarks: We either replaced the output string of a transition with a constant character, inserted an extra character, or deleted a transition altogether. (ii) Incorrect transducers: We intentionally provided fewer input-output examples and used only example-based constraints on some of our synthesis benchmarks.

All the benchmarks involve s-FTs. Three benchmarks are wrong on both input-output types and examples and the rest are only wrong on examples. Additionally, we note that to repair a transducer, we need the “right” set of minterms. Typically, the set of minterms extracted from the transducer predicates is the right one, but in the case of the escapeBrackets problems, astra needs a set of custom minterms we provide manually—i.e., repairing the transducer requires coming up with a new predicate. We are not aware of another tool that solves transducer repair problems and so do not show any comparisons.

TABLE II: Metrics of astra’s performance on the set of repair benchmarks. The two right-most columns give the synthesis time without and with the use of templates. Default is the case where a new transducer is synthesized for PbadP_{\textit{bad}} and Template is the case where a partial solution to the solver is provided. The δTbad\delta_{T_{\textit{bad}}} column gives the number of transitions that were localized by the fault-localization procedure as a fraction of the total number of transitions in the transducer. The other columns that describe the parameters of the synthesis problem in the default case are the same as for Table I.
Time(s)
Benchmark QPQ_{P} QQQ_{Q} δP\delta_{P} δQ\delta_{Q} Σ\Sigma EE kk ll dd δTbad\delta_{T_{\textit{bad}}} Default Template
Fault injected swapCase1 2 1 6 3 3 2 1 1 1 3/3 0.04 0.02
swapCase2 2 1 4 3 3 2 1 1 1 1/2
swapCase3 2 1 6 3 3 2 1 1 1 1/3 0.06 0.05
escapeBrackets1 2 6 16 36 8 4 1 4 4 1/3 0.69 0.42
escapeBrackets2 1 6 1 7 6 5 1 4 4 1/2
escapeBrackets3 2 7 8 36 9 5 1 4 4 2/3 1.12 0.34
caesarCipher 2 1 4 2 3 1 1 1 1 1/1
Synth. extrAcronym2 11 3 30 3 3 3 2 1 1 12/30 0.59 10.15
capProb 3 3 3 3 2 2 2 1 1 3/3 0.04 0.04
extrQuant 8 3 16 5 2 1 2 1 1 5/10 0.37 0.51
removeLast 6 3 8 3 3 2 2 1 .5 7/8 0.40 1.08
escapeQuotes 3 2 9 5 3 5 2 1 1 3/5 0.17 0.10

Effectiveness of astra. We indicate the number of suspicious transitions identified by our fault localization procedure (Section 4.3) in the column labeled δTbad\delta_{T_{\textit{bad}}}. In many cases, astra can detect 50% of the transitions or more as being likely correct, therefore reducing the space of unknowns.

We compare 2 different ways of solving repair problems in astra. One uses the repair-from-input approach described in Section 4.3 (Default in Table II). The second approach involves using a ‘template’, where we supply the constraint solver with a partial solution to the synthesis problem, based on the transitions that were localized as potentially buggy (Template in Table II).

astra can solve 9/12 repair benchmarks (all in less than 1 second). The times using either approach are comparable in most cases. While one might expect templates to be faster, this is not always the case: the input-output specification for the repair transducer is small, whereas providing a template requires actually supplying a partial solution, which in some cases involves many constraints.

Answer to Q2: astra can repair transducers with varying types of bugs.

6 Related Work

Synthesis of string transformations. String transformations are one of the main targets of program synthesis. Gulwani showed they could be synthesized from input-output examples [1] and introduced the idea of using a DSL to aid synthesis. Optician extended the DSL-based idea to synthesizing lenses [6, 5], which are programs that transform between two formats. Optician supports not only examples but also input-output types. While DSL-based approaches provide good performance, they are also monolithic as they rely on the structure of the DSL to search efficiently. astra does not rely on a DSL and can synthesize string transformations from complex specifications that cannot be handled by DSL-based tools. Moreover, transducers allow applying verification techniques to the synthesized programs (e.g., checking whether two solutions are equivalent). One limitation of transducers is that they do not have ‘memory’, and consequently astra cannot be used for data-transformation tasks where this is required—e.g., mapping the string Firstname Lastname to Lastname, Firstname—something Optician can do. We remark that there exist transducer models with such capabilities [18] and our work lays the foundations to handle complex models in the future.

Synthesis of transducers. Benedikt et al. studied the ‘bounded repair problem’, where the goal is to determine whether there exists a transducer that maps strings from an input to an output type using a bounded number of edits [12]. Their work was the first to identify the relation between solving such a problem and solving games, an idea we leverage in this paper. However, their work is not implemented, cannot handle input-output examples, and therefore shies away from the source of NP-Completeness. Hamza et al. studied the problem of synthesizing minimal non-deterministic Mealy machines (transducers where every transition outputs exactly one character), from examples [11]. They prove that the problem of synthesizing such transducers is NP-complete and provide an algorithm for computing minimal Mealy machines that are consistent with the input-output examples. astra is a more general framework that incorporates new specification mechanisms, e.g., input-output types and distances, and uses them all together. Mealy machines are also synthesized from temporal specifications in reactive synthesis and regular model checking, where they are used to represent parameterized systems [19, 20]. This setting is orthogonal to ours as the specification is different and the transducer is again only a Mealy machine.

The constraint encoding used in astra is inspired by the encoding presented by Daniel Neider for computing minimal separating DFA, i.e. a DFA that separates two disjoint regular languages [21]. astra’s use of weights and energy to specify a mean edit distance is based on energy games [22], a kind of 2-player infinite game that captures the need for a player to not exceed some available resource. One way of solving such games is by defining a progress measure [14]. To determine whether a game has a winning strategy for one of the players, it can be checked whether such a progress measure exists in the game. We showed that the search for such a progress measure can be encoded as an SMT problem.

7 Conclusion

We presented a technique and a tool (astra) for synthesizing different types of transducers from types, examples, and distance functions, and showed astra’s ability to solve a variety of practical problems. astra uses SMT solvers and its performance is affected by input components that result in large constraints (e.g., states in the desired transducer). Because astra synthesizes transducers instead of programs in arbitrary DSLs, its output can be analyzed using transducer algorithms (e.g., equivalence and pre-post analysis). Because of this property, our approach could be beneficial in learning invariants of string-manipulating programs, where a transducer is the formalism of choice, e.g., in the Ostrich tool [9]. In terms of improvements to our technique, aside from optimizing the SMT encoding to improve scalability, our approach could be strengthened by devising ways to effectively ‘guess’ the number of states required for a transducer to work on the given inputs. We leave these directions for future work.

Acknowledgements

This work was funded by the National Science Foundation under grants 1763871, 1750965, 1918211, and 2023222, Facebook and a Microsoft Research Faculty Fellowship.

References

  • [1] S. Gulwani, “Automating string processing in spreadsheets using input-output examples,” in PoPL’11, January 26-28, 2011, Austin, Texas, USA, January 2011.
  • [2] P. Hooimeijer, B. Livshits, D. Molnar, P. Saxena, and M. Veanes, “Fast and precise sanitizer analysis with bek,” in USENIX Security Symposium, vol. 58.   USENIX, 2012.
  • [3] L. D’Antoni and M. Veanes, “Static analysis of string encoders and decoders,” in International Workshop on Verification, Model Checking, and Abstract Interpretation.   Springer, 2013, pp. 209–228.
  • [4] Y. Zhang, A. Albarghouthi, and L. D’Antoni, “Robustness to programmable string transformations via augmented abstract training,” in International Conference on Machine Learning.   PMLR, 2020, pp. 11 023–11 032.
  • [5] A. Miltner, S. Maina, K. Fisher, B. C. Pierce, D. Walker, and S. Zdancewic, “Synthesizing symmetric lenses,” Proceedings of the ACM on Programming Languages, vol. 3, no. ICFP, pp. 1–28, 2019.
  • [6] A. Miltner, K. Fisher, B. C. Pierce, D. Walker, and S. Zdancewic, “Synthesizing bijective lenses,” Proceedings of the ACM on Programming Languages, vol. 2, no. POPL, pp. 1–30, 2017.
  • [7] M. Mohri, “Finite-state transducers in language and speech processing,” Computational linguistics, vol. 23, no. 2, pp. 269–311, 1997.
  • [8] L. D’Antoni and M. Veanes, “Automata modulo theories,” Communications of the ACM, vol. 64, no. 5, pp. 86–95, 2021.
  • [9] T. Chen, M. Hague, J. He, D. Hu, A. W. Lin, P. Rümmer, and Z. Wu, “A decision procedure for path feasibility of string manipulating programs with integer data type,” in International Symposium on Automated Technology for Verification and Analysis.   Springer, 2020, pp. 325–342.
  • [10] L. D’Antoni and M. Veanes, “The power of symbolic automata and transducers,” in International Conference on Computer Aided Verification.   Springer, 2017, pp. 47–67.
  • [11] J. Hamza and V. Kunčak, “Minimal synthesis of string to string functions from examples,” in Verification, Model Checking, and Abstract Interpretation, C. Enea and R. Piskac, Eds.   Cham: Springer International Publishing, 2019, pp. 48–69.
  • [12] M. Benedikt, G. Puppis, and C. Riveros, “Regular repair of specifications,” in 2011 IEEE 26th Annual Symposium on Logic in Computer Science.   IEEE, 2011, pp. 335–344.
  • [13] R. Bloem, K. Chatterjee, and B. Jobstmann, “Graph games and reactive synthesis,” in Handbook of Model Checking, E. M. Clarke, T. A. Henzinger, H. Veith, and R. Bloem, Eds.   Springer, 2018, pp. 921–962. [Online]. Available: https://doi.org/10.1007/978-3-319-10575-8_27
  • [14] L. Brim, J. Chaloupka, L. Doyen, R. Gentilini, and J.-F. Raskin, “Faster algorithms for mean-payoff games,” Formal methods in system design, vol. 38, no. 2, pp. 97–118, 2011.
  • [15] M. Veanes, P. Hooimeijer, B. Livshits, D. Molnar, and N. Bjorner, “Symbolic finite state transducers: Algorithms and applications,” in Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ser. POPL ’12.   New York, NY, USA: Association for Computing Machinery, 2012, p. 137–150. [Online]. Available: https://doi.org/10.1145/2103656.2103674
  • [16] L. De Moura and N. Bjørner, “Z3: An efficient SMT solver,” in International Conference on Tools and Algorithms for the Construction and Analysis of Systems.   Springer, 2008, pp. 337–340.
  • [17] P. Hooimeijer, B. Livshits, D. Molnar, P. Saxena, and M. Veanes, “Fast and precise sanitizer analysis with bek,” http://rise4fun.com/Bek/, 2012.
  • [18] R. Alur, “Streaming string transducers,” in Logic, Language, Information and Computation, L. D. Beklemishev and R. de Queiroz, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 1–1.
  • [19] O. Markgraf, C.-D. Hong, A. W. Lin, M. Najib, and D. Neider, “Parameterized synthesis with safety properties,” in Asian Symposium on Programming Languages and Systems.   Springer, 2020, pp. 273–292.
  • [20] A. W. Lin and P. Rümmer, “Liveness of randomised parameterised systems under arbitrary schedulers,” in International Conference on Computer Aided Verification.   Springer, 2016, pp. 112–133.
  • [21] D. Neider, “Computing minimal separating dfas and regular invariants using sat and smt solvers,” in International Symposium on Automated Technology for Verification and Analysis.   Springer, 2012, pp. 354–369.
  • [22] A. Chakrabarti, L. de Alfaro, T. A. Henzinger, and M. Stoelinga, “Resource interfaces,” in Embedded Software, Third International Conference, EMSOFT 2003, Philadelphia, PA, USA, October 13-15, 2003, Proceedings, 2003, pp. 117–133. [Online]. Available: https://doi.org/10.1007/978-3-540-45212-6_9

Appendix A Proofs

A.1 Input-Output Examples

Lemmas A.1 and A.2 show that (i) the ${\texttt{config}}_{s}$ function preserves a desired invariant, which is used to show that (ii) for the transducer $T$ encoded in a solution to the constraints, we have $T(s)=t$.

Lemma A.1.

Let $\phi$ be the set of example constraints 1 and 2 for an input-output example $s\mapsto t$, where $s=a_{0}\cdots a_{n}$ and $t=b_{0}\cdots b_{m}$. Let $T$ be a transducer encoded into the functions ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, and ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$ as described in Section 3.1. If all constraints in $\phi$ are satisfied by ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, and ${\texttt{config}}_{s}$, then for all $0\leq i\leq n+1$ and $0\leq j\leq m+1$, ${\texttt{config}}_{s}(i)=(j,q_{T})$ iff $\delta_{T}^{\mathit{out*}}(q^{\mathit{init}}_{T},a_{0}\cdots a_{i-1})=b_{0}\cdots b_{j-1}$ and $\delta_{T}^{\mathit{st*}}(q^{\mathit{init}}_{T},a_{0}\cdots a_{i-1})=q_{T}$.

Proof.

We first show the forward direction.

Assume that all constraints in $\phi$ are satisfied by ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, and ${\texttt{config}}_{s}$. We proceed by induction on $i$. For the base case $i=0$, we have ${\texttt{config}}_{s}(0)=(0,q^{\mathit{init}}_{T})$ by Constraint 1. We also know that $\delta_{T}^{\mathit{out*}}(q^{\mathit{init}}_{T},\epsilon)=\epsilon$ and $\delta_{T}^{\mathit{st*}}(q^{\mathit{init}}_{T},\epsilon)=q^{\mathit{init}}_{T}$ by the definition of the extended transition function, which concludes the base case.

For the induction step, the induction hypothesis states exactly that ${\texttt{config}}_{s}(i)=(j,q_{T})$ iff $\delta_{T}^{\mathit{out*}}(q^{\mathit{init}}_{T},a_{0}\cdots a_{i-1})=b_{0}\cdots b_{j-1}$ and $\delta_{T}^{\mathit{st*}}(q^{\mathit{init}}_{T},a_{0}\cdots a_{i-1})=q_{T}$. Assume that ${\texttt{config}}_{s}(i)=(j,q_{T})$. We need to show that the claim also holds for ${\texttt{config}}_{s}(i+1)=(j^{\prime},q^{\prime}_{T})$. Since input words consist of letters in $\Sigma$, we must have $a_{i}=c$ for some $c\in\Sigma$. Then, by the implication in Constraint 2 and the inductive hypothesis, ${\texttt{config}}_{s}(i+1)=(j+{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c),{\texttt{d}^{\texttt{st}}}(q_{T},c))$. By Constraint 2, for all $0\leq z<l$, either ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_{T},c,z)=b_{j+z}$ or $z\geq{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)$. This means that for all $z<{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)$, we have ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_{T},c,z)=b_{j+z}$. Together with the inductive hypothesis, this implies $\delta_{T}^{\mathit{out*}}(q^{\mathit{init}}_{T},a_{0}\cdots a_{i})=b_{0}\cdots b_{j+{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},a_{i})-1}$.

To show $\delta_{T}^{\mathit{st*}}(q^{\mathit{init}}_{T},a_{0}\cdots a_{i})=q^{\prime}_{T}$, observe that this follows from the inductive hypothesis $\delta_{T}^{\mathit{st*}}(q^{\mathit{init}}_{T},a_{0}\cdots a_{i-1})=q_{T}$, from the definition of $\delta_{T}^{\mathit{st*}}$, and from the constraints of type 2, which enforce ${\texttt{config}}_{s}(i+1)=(j+{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c),{\texttt{d}^{\texttt{st}}}(q_{T},c))$. This concludes the forward direction.

The backward direction is straightforward. If $\delta_{T}^{\mathit{out*}}(q^{\mathit{init}}_{T},a_{0}\cdots a_{i-1})=b_{0}\cdots b_{j-1}$ and $\delta_{T}^{\mathit{st*}}(q^{\mathit{init}}_{T},a_{0}\cdots a_{i-1})=q_{T}$, then there exists a run of the form $q^{\mathit{init}}_{T}\xrightarrow{a_{0}/y_{0}}q_{T}^{1}\xrightarrow{a_{1}/y_{1}}q_{T}^{2}\cdots q_{T}^{i-1}\xrightarrow{a_{i-1}/y_{i-1}}q_{T}^{i}$ such that each $y_{i}$ consists of at most $l$ characters and $y_{0}\cdots y_{i-1}=b_{0}\cdots b_{j-1}$. We can use each transition to assign the corresponding values to ${\texttt{config}}_{s}(i)$. ∎
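The invariant of Lemma A.1 can be illustrated concretely. The following sketch (plain Python, with a hypothetical one-state transducer; the names `d_st` and `d_out` mirror the uninterpreted functions above) computes ${\texttt{config}}_{s}$ for every prefix of the input and shows that it records exactly the number of output characters produced so far and the state reached:

```python
def run_config(d_st, d_out, q_init, s):
    """Compute config_s(0..len(s)) for the transducer given by (d_st, d_out).

    config_s(i) = (number of output characters produced after reading
    the first i input characters, state reached), as in Lemma A.1.
    """
    q, out_len = q_init, 0
    configs = [(0, q_init)]
    for a in s:
        out_len += len(d_out[(q, a)])   # d_len^out(q, a)
        q = d_st[(q, a)]                # d_st(q, a)
        configs.append((out_len, q))
    return configs

# Hypothetical transducer that duplicates 'a' and copies 'b'.
d_st = {(0, 'a'): 0, (0, 'b'): 0}
d_out = {(0, 'a'): 'aa', (0, 'b'): 'b'}
print(run_config(d_st, d_out, 0, 'aba'))  # [(0, 0), (2, 0), (3, 0), (5, 0)]
```

Here the final entry $(5,0)$ corresponds to Constraint 3: after reading all of $s$, the output position must equal the length of $t$.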

Lemma A.2.

Let $T$ be a transducer encoded into ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, and ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, and let $\phi$ be the set of constraints of types 1, 2, and 3 for an input-output example $s\mapsto t$. Then ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, and ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$ can be extended to a model for $\phi$ if and only if $T(s)=t$.

Proof.

We first prove the forward direction. Assume that an assignment to ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, and ${\texttt{config}}_{s}$ satisfies all constraints in $\phi$. By Constraint 3, there is some state $q_{T}\in Q_{T}$ of the encoded transducer $T$ for which ${\texttt{config}}_{s}(\mathit{len}(s))=(\mathit{len}(t),q_{T})$. From Lemma A.1, it follows that $\delta_{T}^{\mathit{out*}}(q^{\mathit{init}}_{T},s)=t$ and $\delta_{T}^{\mathit{st*}}(q^{\mathit{init}}_{T},s)=q_{T}$, which is exactly what it means for $T(s)=t$ to hold.

For the backward direction, assume that for the transducer $T$ encoded in ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, and ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, we have $T(s)=t$. Then there exists a run $q^{\mathit{init}}_{T}\xrightarrow{a_{0}/y_{0}}q_{T}^{1}\xrightarrow{a_{1}/y_{1}}q_{T}^{2}\cdots q_{T}^{n}\xrightarrow{a_{n}/y_{n}}q_{T}^{n+1}$ such that $s=a_{0}\cdots a_{n}$ and $t=y_{0}\cdots y_{n}$. Using this run of $s$ on $T$, we can construct a definition of ${\texttt{config}}_{s}$ such that all constraints in $\phi$ are satisfied. ∎

A.2 Input-Output Types

Lemmas A.3 and A.4 state that (i) the simulation relation ${\texttt{sim}}$ preserves the desired invariant, which is then used to prove that (ii) $T$ satisfies the specification for input-output types.

Lemma A.3.

Let $\phi$ be the set of input-output constraints from Equations 4 and 5 for the types $P$ and $Q$, and let $T$ be a transducer encoded into the functions ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, and ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$. If all constraints in $\phi$ are satisfied by ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, and ${\texttt{sim}}$, then ${\texttt{sim}}(q_{P},q_{T},q_{Q})$ holds if there exist strings $w,w^{\prime}$ such that $\delta_{P}^{*}(q^{\mathit{init}}_{P},w)=q_{P}$, $\delta_{T}^{\mathit{st*}}(q^{\mathit{init}}_{T},w)=q_{T}$, $\delta_{T}^{\mathit{out*}}(q^{\mathit{init}}_{T},w)=w^{\prime}$, and $\delta_{Q}^{*}(q^{\mathit{init}}_{Q},w^{\prime})=q_{Q}$.

Proof.

Assume that there exist strings $w,w^{\prime}$ such that $\delta_{P}^{*}(q^{\mathit{init}}_{P},w)=q_{P}$, $\delta_{T}^{\mathit{st*}}(q^{\mathit{init}}_{T},w)=q_{T}$, $\delta_{T}^{\mathit{out*}}(q^{\mathit{init}}_{T},w)=w^{\prime}$, and $\delta_{Q}^{*}(q^{\mathit{init}}_{Q},w^{\prime})=q_{Q}$. By Constraint 4, we already have ${\texttt{sim}}(q^{\mathit{init}}_{P},q^{\mathit{init}}_{T},q^{\mathit{init}}_{Q})$. Since $\delta_{T}^{\mathit{out*}}(q^{\mathit{init}}_{T},w)=w^{\prime}$, there is an assignment to the functions ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, and ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$ that encodes a run $q^{\mathit{init}}_{T}\xrightarrow{a_{0}/y_{0}}q_{T}^{1}\xrightarrow{a_{1}/y_{1}}q_{T}^{2}\cdots q_{T}^{n}\xrightarrow{a_{n}/y_{n}}q_{T}^{n+1}$ such that $w=a_{0}\cdots a_{n}$ and $w^{\prime}=y_{0}\cdots y_{n}$. By repeatedly applying the implication in Constraint 5 along this run, we eventually reach ${\texttt{sim}}(q_{P},q_{T},q_{Q})$. ∎

Lemma A.4.

Let $T$ be a transducer encoded into the functions ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, and ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, and let $\phi$ be the set of input-output type constraints given in Equations 4, 5, and 6. Then ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, and ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$ can be extended to a model for $\phi$ if and only if $\{P\}T\{Q\}$.

Proof.

For the forward direction, assume that an assignment to ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, and ${\texttt{sim}}$ satisfies all constraints in $\phi$. Then, by Constraint 6, for every $q_{P}\in F_{P}$, if ${\texttt{sim}}(q_{P},q_{T},q_{Q})$ holds for some $q_{T},q_{Q}$, then $q_{Q}\in F_{Q}$. From Constraint 5, we know that for a string $w$ such that $\delta_{P}^{*}(q^{\mathit{init}}_{P},w)=q_{P}$ with $q_{P}\in F_{P}$, ${\texttt{sim}}(q_{P},q_{T},q_{Q})$ must hold for at least some $q_{T}$ and $q_{Q}$. From Lemma A.3, it also follows that for such a string $w\in P$, there is a string $w^{\prime}$ such that $\delta_{T}^{\mathit{out*}}(q^{\mathit{init}}_{T},w)=w^{\prime}$ and $\delta_{Q}^{*}(q^{\mathit{init}}_{Q},w^{\prime})=q_{Q}$ with $q_{Q}\in F_{Q}$. In other words, for every string $w\in P$, $T$ outputs a string $w^{\prime}\in Q$, as desired.

For the backward direction, let $T$ be a transducer such that $\{P\}T\{Q\}$. We can use the transition relation of $T$ to construct an assignment to ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, and ${\texttt{sim}}$ such that all constraints are satisfied. ∎
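The least simulation relation that Constraints 4–6 axiomatize can be computed explicitly for concrete automata. The following sketch (plain Python, hypothetical toy automata; this is a direct fixpoint computation, not the SMT encoding itself) explores the reachable triples $(q_{P},q_{T},q_{Q})$ and checks $\{P\}T\{Q\}$ as in Lemma A.4:

```python
def check_type(P, T, Q, alphabet):
    """P, Q are DFAs (delta, init, finals); T is (d_st, d_out, init).

    Builds the least relation sim containing the initial triple and closed
    under the product step of Constraint 5, then checks Constraint 6:
    no final P-state may be paired with a non-final Q-state.
    """
    (dP, iP, fP), (dT, dout, iT), (dQ, iQ, fQ) = P, T, Q
    sim = {(iP, iT, iQ)}
    work = [(iP, iT, iQ)]
    while work:
        qP, qT, qQ = work.pop()
        for a in alphabet:
            qQ2 = qQ
            for b in dout[(qT, a)]:      # run T's output through Q
                qQ2 = dQ[(qQ2, b)]
            nxt = (dP[(qP, a)], dT[(qT, a)], qQ2)
            if nxt not in sim:
                sim.add(nxt)
                work.append(nxt)
    return all(qQ in fQ for (qP, _, qQ) in sim if qP in fP)

# Hypothetical instance: P = Q = a*, and T duplicates every 'a'.
dP = {(0, 'a'): 0}
dQ = {(0, 'a'): 0}
dT = {(0, 'a'): 0}
dout = {(0, 'a'): 'aa'}
print(check_type((dP, 0, {0}), (dT, dout, 0), (dQ, 0, {0}), ['a']))  # True
```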

A.3 Input-Output Distances

Lemmas A.5 and A.6 show that a transducer $T$ that is a solution to the set of distance constraints satisfies $\mathit{mean\!\_\!ed\!\_\!dist}(w,T(w))\leq d$ for all $w\in\mathcal{L}(P)$.

Lemma A.5.

Let $\phi$ be the set of input-output distance constraints from Equations 7, 8, 9, 10, and 11 for a mean edit distance bound $d$, and let $T$ be a transducer encoded into the functions ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, and ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$. If all constraints in $\phi$ are satisfied by ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, ${\texttt{sim}}$, ${\texttt{en}}$, and ${\texttt{wed}}$, then for all runs of $T$ of the form $q^{\mathit{init}}_{T}\xrightarrow{a_{0}/y_{0}}q_{T}^{1}\cdots q_{T}^{n-1}\xrightarrow{a_{n-1}/y_{n-1}}q_{T}$, where $a_{0}\cdots a_{n-1}\in\mathcal{L}(P)$, it holds that:

\sum_{i=0}^{n-1}{\texttt{wed}}(q_{T}^{i},a_{i})\geq 0.
Proof.

Assume that all constraints in $\phi$ are satisfied by ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, ${\texttt{sim}}$, ${\texttt{en}}$, and ${\texttt{wed}}$. Consider an arbitrary run of $T$ of the form $q^{\mathit{init}}_{T}\xrightarrow{a_{0}/y_{0}}q_{T}^{1}\cdots q_{T}^{n-1}\xrightarrow{a_{n-1}/y_{n-1}}q_{T}$, where $a_{0}a_{1}\cdots a_{n-1}\in\mathcal{L}(P)$. By Constraint 10, we have that

\begin{split}{\texttt{en}}(q^{\mathit{init}}_{P},q^{\mathit{init}}_{T},q^{\mathit{init}}_{Q})&\geq{\texttt{en}}(q_{P}^{1},q_{T}^{1},q_{Q}^{1})-{\texttt{wed}}(q^{\mathit{init}}_{T},a_{0})\\ &\;\;\vdots\\ &\geq{\texttt{en}}(q_{P}^{n},q_{T}^{n},q_{Q}^{n})-\sum_{i=0}^{n-1}{\texttt{wed}}(q_{T}^{i},a_{i})\end{split}

Since $a_{0}\cdots a_{n-1}\in\mathcal{L}(P)$, we have that $q_{P}^{n}\in F_{P}$. By Constraint 11, it follows that ${\texttt{en}}(q_{P}^{n},q_{T}^{n},q_{Q}^{n})\geq 0$. Therefore, ${\texttt{en}}(q^{\mathit{init}}_{P},q^{\mathit{init}}_{T},q^{\mathit{init}}_{Q})+\sum_{i=0}^{n-1}{\texttt{wed}}(q_{T}^{i},a_{i})\geq 0$ as well. Since ${\texttt{en}}(q^{\mathit{init}}_{P},q^{\mathit{init}}_{T},q^{\mathit{init}}_{Q})=0$ by Constraint 9, we conclude that $\sum_{i=0}^{n-1}{\texttt{wed}}(q_{T}^{i},a_{i})\geq 0$. ∎

Lemma A.6.

Let $T$ be a transducer encoded into the functions ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, and ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, and let $\phi$ be the set of input-output distance Constraints 7, 8, 9, 10, and 11. If ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, and ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$ can be extended to a model for $\phi$, then for any string $w\in\mathcal{L}(P)$, $\mathit{mean\!\_\!ed\!\_\!dist}(w,T(w))\leq d$.

Proof.

Assume that an assignment to ${\texttt{d}^{\texttt{st}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{ch}}}$, ${\texttt{d}^{\texttt{out}}_{\texttt{len}}}$, ${\texttt{sim}}$, ${\texttt{en}}$, and ${\texttt{wed}}$ satisfies all constraints in $\phi$. From Lemma A.5, it follows that the run of any string $w\in\mathcal{L}(P)$ is such that $\sum_{i=0}^{n-1}{\texttt{wed}}(q_{T}^{i},a_{i})\geq 0$, where $a_{0}\cdots a_{n-1}=w$ and $q_{T}^{0}\cdots q_{T}^{n-1}$ are the states in the run. Since ${\texttt{wed}}(q_{T}^{i},a_{i})$ is the difference between $d$ and ${\texttt{ed}}(q_{T}^{i},a_{i})$ at each transition, if the total difference is $\geq 0$, then $\mathit{mean\!\_\!ed\!\_\!dist}(w,T(w))\leq d$. ∎
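The accounting behind Lemmas A.5 and A.6 can be sketched in a few lines (plain Python, with hypothetical per-transition edit costs `ed`): along a run, ${\texttt{wed}}(q,a)=d-{\texttt{ed}}(q,a)$, so the sum of ${\texttt{wed}}$ is nonnegative exactly when the mean edit cost over the run is at most $d$:

```python
def mean_dist_ok(run, ed, d):
    """run: list of (state, char) transitions; ed: per-transition edit cost.

    Sums wed(q, a) = d - ed(q, a) over the run; the sum is >= 0 iff
    (sum of ed) / len(run) <= d, i.e., the mean edit distance bound holds.
    """
    total_wed = sum(d - ed[(q, a)] for (q, a) in run)
    return total_wed >= 0

# Hypothetical run of length 3 with per-step edit costs 1, 0, 2 and bound d=1.
ed = {(0, 'a'): 1, (0, 'b'): 0, (1, 'a'): 2}
run = [(0, 'a'), (0, 'b'), (1, 'a')]
print(mean_dist_ok(run, ed, 1))  # True: mean cost is exactly 1
```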

Appendix B Size of Constraint Encoding

Input-Output Examples.

For a single input-output example, there is one constraint of type 1, $nm|Q_{T}||\Sigma|$ constraints of type 2, and one constraint of type 3, where $n$ is the length of the input and $m$ is the length of the output.

A constraint of type 1 involves 1 variable, a constraint of type 2 involves $4+l$ variables, and a constraint of type 3 involves $|Q_{T}|$ variables.

Input-Output Types.

There is one constraint of type 4, $|Q_{P}||Q_{T}||Q_{Q}||\Sigma|$ constraints of type 5, and one constraint of type 6 for the input-output types $P$ and $Q$.

A constraint of type 4 involves no variables, a constraint of type 5 involves $4+3l$ variables, and a constraint of type 6 involves $|F_{P}|+|Q_{Q}\setminus F_{Q}|+|Q_{T}|$ variables.

Input-Output Distances.

There are $|Q_{T}||\Sigma|l$ constraints each of types 7 and 8, one constraint of type 9, $|Q_{P}||Q_{T}||Q_{Q}||\Sigma|$ constraints of type 10, and one constraint of type 11 for a specified input-output distance.

Constraints of types 7 and 8 involve up to $2+l$ variables, a constraint of type 9 involves no variables, a constraint of type 10 involves $4+3l$ variables, and a constraint of type 11 involves $|F_{P}|+|Q_{T}|+|Q_{Q}|$ variables.
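These counts can be tabulated directly when estimating encoding size before invoking the solver. The sketch below (plain Python; the parameter names `n_P`, `n_T`, `n_Q`, `sigma`, and `l` stand for $|Q_{P}|$, $|Q_{T}|$, $|Q_{Q}|$, $|\Sigma|$, and the output-length bound $l$) computes the number of type and distance constraints:

```python
def count_type_constraints(n_P, n_T, n_Q, sigma):
    # one constraint of type 4, |Q_P||Q_T||Q_Q||Sigma| of type 5,
    # and one constraint of type 6
    return 1 + n_P * n_T * n_Q * sigma + 1

def count_distance_constraints(n_P, n_T, n_Q, sigma, l):
    # |Q_T||Sigma|l constraints each of types 7 and 8, one of type 9,
    # |Q_P||Q_T||Q_Q||Sigma| of type 10, and one of type 11
    return 2 * n_T * sigma * l + 1 + n_P * n_T * n_Q * sigma + 1

print(count_type_constraints(n_P=2, n_T=3, n_Q=2, sigma=2))           # 26
print(count_distance_constraints(n_P=2, n_T=3, n_Q=2, sigma=2, l=2))  # 50
```

Since the dominant terms are products of the automata sizes and $|\Sigma|$, the encoding grows quickly with the number of transducer states, which matches the scalability discussion in the conclusion.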

Appendix C Soundness of s-FT recovery algorithm

The following lemma shows that the s-FT recovery algorithm described in Section 4.1, which we use to synthesize s-FTs, is sound: finitizing the recovered s-FT yields a finite transducer identical to the one we started from.

Lemma C.1.

Let $T$ be an FT and let $T^{\prime}$ be the corresponding s-FT obtained using the s-FT recovery algorithm. Then $T=\mathit{F}(T^{\prime})$ if we use the representative character $\mathit{wit}(\psi)$ for each minterm $\psi$.

Proof.

Without loss of generality, consider a transition $q\xrightarrow{c/y}q^{\prime}$ where the input character is $c$ and the output character is $y$; the proof generalizes to output strings with multiple characters. One of three cases applies to $c$:

  1. If $c=y$, then the output function $f$ that we choose is the identity function. Re-finitizing this transition results in the transition $q\xrightarrow{c/c}q^{\prime}$, as expected.

  2. If $mt(c)$ and $mt(y)$ are single intervals $I_{1},I_{2}$ of the same size, then $f=x+\textit{offset}(I_{1},I_{2})$. If we finitize the transition with the same witness for each minterm, we get back $q\xrightarrow{c/y}q^{\prime}$.

  3. Otherwise, we choose the constant function $f=y$. Finitizing the transition again results in $q\xrightarrow{c/y}q^{\prime}$. ∎
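The three-way case analysis above can be sketched as follows (plain Python, with a hypothetical `mt` map that sends each character to its minterm, represented as a code-point interval; `offset` is computed from the interval endpoints as in case 2):

```python
def recover_output(c, y, mt):
    """Recover an s-FT output term from a finite transition q --c/y--> q'.

    mt maps a character to its minterm, a code-point interval (lo, hi).
    Returns one of: ('identity',), ('offset', k) meaning f(x) = x + k,
    or ('const', y) meaning f(x) = y.
    """
    if c == y:
        return ('identity',)                 # case 1: f(x) = x
    lo_c, hi_c = mt[c]
    lo_y, hi_y = mt[y]
    if hi_c - lo_c == hi_y - lo_y:           # case 2: equal-size intervals
        return ('offset', lo_y - lo_c)       # f(x) = x + offset(I1, I2)
    return ('const', y)                      # case 3: constant output

# Hypothetical minterms: lowercase and uppercase ASCII letter intervals.
mt = {'a': (97, 122), 'A': (65, 90), '0': (48, 57)}
print(recover_output('a', 'A', mt))  # ('offset', -32): the to-uppercase shift
print(recover_output('a', 'a', mt))  # ('identity',)
print(recover_output('a', '0', mt))  # ('const', '0'): intervals differ in size
```

Applying the recovered function to the minterm witness reproduces the original finite transition, which is the soundness claim of Lemma C.1.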

Appendix D Constraints for synthesizing Transducers with Regular Lookahead

In writing the lookahead constraints for a string $w$, we omit the subscript $w$ and just write ${\texttt{look}}$ and $r$ for ${\texttt{look}}_{w}$ and $r_{w}$.

Input-Output Examples and Lookahead. Constraint 12 expresses how we compute the lookahead states for a string $w=a_{0}\cdots a_{n}$.

{\texttt{look}}(n)=q^{\mathit{init}}_{R}\wedge\bigwedge_{0\leq i<n}{\texttt{look}}(i)={\texttt{d}_{\texttt{R}}}({\texttt{look}}(i+1),a_{i+1}) (12)
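Constraint 12 computes the lookahead states from right to left. The following sketch (plain Python, with a hypothetical lookahead DFA `d_R`) mirrors the recurrence: ${\texttt{look}}(n)$ is fixed to $R$'s initial state, and ${\texttt{look}}(i)={\texttt{d}_{\texttt{R}}}({\texttt{look}}(i+1),a_{i+1})$:

```python
def lookahead(d_R, q_init, w):
    """Compute look(0..n) for w = a_0 ... a_n by running R backwards."""
    n = len(w) - 1
    look = [None] * (n + 1)
    look[n] = q_init
    for i in range(n - 1, -1, -1):
        look[i] = d_R[(look[i + 1], w[i + 1])]
    return look

# Hypothetical R over {a, b}: state 1 iff a 'b' has been read (backwards),
# so look(i) = 1 means a 'b' occurs strictly after position i.
d_R = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 1, (1, 'b'): 1}
print(lookahead(d_R, 0, 'aba'))  # [1, 0, 0]
```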

Constraint 13 shows how the output configuration for input-output examples takes into account the lookahead state at each position. Given an example $s\mapsto t$ such that $s=a_{0}\cdots a_{n}$ and $t=b_{0}\cdots b_{m}$, for every $0\leq i<n$, $0\leq j\leq m$, $c\in\Sigma$, $q_{R}\in Q_{R}$, and $q_{T}\in Q_{T}$:

\begin{split}&{\texttt{config}}_{s}(i)=(j,q_{T})\wedge a_{i}=c\wedge{\texttt{look}}(i)=q_{R}\Rightarrow\\ &\quad[\bigwedge_{0\leq z<l}({\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_{T},(q_{R},c),z){=}b_{j+z}\vee z\geq{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},(q_{R},c)))\wedge\\ &\quad\ {\texttt{config}}_{s}(i+1){=}(j+{\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},(q_{R},c)),{\texttt{d}^{\texttt{st}}}(q_{T},(q_{R},c)))]\end{split} (13)

Input-Output Types with Lookahead. The simulation relation ${\texttt{sim}}$ also needs to track where we are in $R$ during any possible run of $T$. So, we extend ${\texttt{sim}}$ to include the states of $R$, i.e., ${\texttt{sim}}_{R}\subseteq Q_{P}\times Q_{T}\times Q_{Q}\times Q_{R}$. Since we travel ‘backwards’ over $R$ while moving ‘forward’ over $P$, $T$, and $Q$, the simulation relation uses the inverse transition function of $R$, which we denote by $\delta_{R}^{-1}$, i.e., $\delta_{R}^{-1}(q_{R},c)=\{q_{R}^{\prime}\mid q_{R}^{\prime}\xrightarrow{c}q_{R}\}$. We also introduce the corresponding uninterpreted function ${\texttt{d}^{-1}_{\texttt{R}}}$, which is the inverse of ${\texttt{d}_{\texttt{R}}}$. The desired invariant is that ${\texttt{sim}}_{R}(q_{P},q_{T},q_{Q},q_{R})$ holds if there exist a lookahead word $w_{R}=(q_{R}^{0},a_{0})\cdots(q_{R}^{n},a_{n})$ and a string $w^{\prime}$ such that $\delta_{P}^{*}(q^{\mathit{init}}_{P},a_{0}\cdots a_{n})=q_{P}$, $\delta_{T}^{\mathit{st*}}(q^{\mathit{init}}_{T},w_{R})=q_{T}$, $\delta_{T}^{\mathit{out*}}(q^{\mathit{init}}_{T},w_{R})=w^{\prime}$, $\delta_{Q}^{*}(q^{\mathit{init}}_{Q},w^{\prime})=q_{Q}$, and $\delta_{R}^{-1*}(q_{R},a_{0}\cdots a_{n})=q^{\mathit{init}}_{R}$.

Our starting point in the simulation can be any state $q_{R}\in Q_{R}$. This is because $R$ reads a string $w\in P$ backwards, so its run could start from any of its states.

\bigvee_{q_{R}\in Q_{R}}{\texttt{sim}}_{R}(q^{\mathit{init}}_{P},q^{\mathit{init}}_{T},q^{\mathit{init}}_{Q},q_{R}) (14)

Then, to advance the simulation, we use the inverse transition relation of $R$: if we are at state $q_{R}\in Q_{R}$, then for each character $c\in\Sigma$, the simulation advances to all states from which $q_{R}$ can be reached by reading $c$. For any combination of $q_{P},q_{T},q_{Q},q_{R},q_{R}^{\prime}$ and $c\in\Sigma$, we use the free variables $c_{0},\ldots,c_{l-1}$ and $q_{Q}^{0},\ldots,q_{Q}^{l-1}$ to encode the following constraint:

\begin{split}&{\texttt{sim}}_{R}(q_{P},q_{T},q_{Q},q_{R})\wedge{\texttt{d}^{-1}_{\texttt{R}}}(q_{R},c)=q_{R}^{\prime}\Rightarrow\\ &\bigwedge_{0\leq z\leq l}({\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},(q_{R},c))=z\Rightarrow\\ &\quad\quad[\bigwedge_{0\leq x<z}{\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_{T},(q_{R},c),x)=c_{x}]\wedge\\ &\quad\quad[q_{Q}^{0}=q_{Q}\wedge\bigwedge_{1\leq x<z}q_{Q}^{x}={\texttt{d}_{Q}}(q_{Q}^{x-1},c_{x-1})]\wedge\\ &\quad\quad{\texttt{sim}}_{R}(\delta_{P}(q_{P},c),{\texttt{d}^{\texttt{st}}}(q_{T},(q_{R},c)),q_{Q}^{z},q_{R}^{\prime}))\end{split} (15)

Finally, Constraint 16 says that when we reach a final state $q_{P}\in F_{P}$ and the initial state of $R$ (since we move backwards over $R$), we cannot be in a non-final state of $Q$.

\bigwedge_{q_{P}\in F_{P}}\bigwedge_{q_{Q}\notin F_{Q}}\neg{\texttt{sim}}_{R}(q_{P},q_{T},q_{Q},q^{\mathit{init}}_{R}) (16)

Appendix E Alternate Notions of Input-Output Distances

It is easy to modify our constraints to support different notions of input-output distance. For example, we can require that for every input $s$, the output $T(s)$ has edit distance at most $D$ from $s$. To enforce this distance, we replace Constraints 9 and 10 with the following constraints:

{\texttt{en}}(q^{\mathit{init}}_{P},q^{\mathit{init}}_{T},q^{\mathit{init}}_{Q})=D (17)
\begin{split}&\bigwedge_{0\leq z<l}({\texttt{d}^{\texttt{out}}_{\texttt{len}}}(q_{T},c)=z\Rightarrow\\ &~{}~{}[\bigwedge_{0\leq x<z}{\texttt{d}^{\texttt{out}}_{\texttt{ch}}}(q_{T},c,x)=c_{x}]\wedge\\ &~{}~{}[{\texttt{en}}(\delta_{P}(q_{P},c),{\texttt{d}^{\texttt{st}}}(q_{T},c),\delta_{Q}^{*}(q_{Q},c_{0}\cdots c_{z-1}))=\\ &~{}~{}\ {\texttt{en}}(q_{P},q_{T},q_{Q})-{\texttt{ed}}(q_{T},c)])\end{split} (18)
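A minimal sketch of this alternate budget semantics (plain Python, with hypothetical per-transition edit costs `ed`): the energy starts at $D$ as in Constraint 17, each transition subtracts its edit cost as in the modified Constraint 10, and the run satisfies the total edit-distance bound exactly when the final energy is nonnegative:

```python
def within_budget(run, ed, D):
    """run: list of (state, char) transitions; ed: per-transition edit cost.

    Starts the energy at D (Constraint 17) and subtracts ed(q, a) at each
    transition; the total edit distance is at most D iff en stays >= 0
    at acceptance.
    """
    en = D
    for (q, a) in run:
        en -= ed[(q, a)]
    return en >= 0

# Hypothetical run with per-step edit costs 1 and 2, and budget D = 3.
ed = {(0, 'a'): 1, (0, 'b'): 2}
print(within_budget([(0, 'a'), (0, 'b')], ed, 3))  # True: total cost 3 <= 3
```

Compared with the mean-distance encoding, the only change is that the budget is a fixed constant rather than $d$ per input character, which is exactly the difference between Constraints 9/10 and 17/18.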