Synthesizing Transducers from Complex Specifications
Abstract
Automating string transformations has been one of the killer applications of program synthesis. Existing synthesizers that solve this problem produce programs in domain-specific languages (DSLs) that are engineered to help the synthesizer, and therefore lack nice formal properties. This limitation prevents the synthesized programs from being used in verification applications (e.g., to check complex pre-post conditions) and makes the synthesizers hard to modify due to their reliance on the given DSL.
We present a constraint-based approach to synthesizing transducers, a well-studied model with strong closure and decidability properties. Our approach handles three types of specifications: (i) input-output examples, (ii) input-output types expressed as regular languages, and (iii) input/output distances that bound how many characters the transducer can modify when processing an input string. Our work is the first to support such complex specifications and it does so by using the algorithmic properties of transducers to generate constraints that can be solved using off-the-shelf SMT solvers. Our synthesis approach can be extended to many transducer models and it can be used, thanks to closure properties of transducers, to compute repairs for partially correct transducers.
1 Introduction
String transformations are used in data transformations [1], sanitization of untrusted inputs [2, 3], and many other domains [4]. Because in these domains bugs may cause serious security vulnerabilities [2], there has been increased interest in building tools that can help programmers verify [2, 3] and synthesize [1, 5, 6] string transformations.
Techniques for verifying string transformations rely on automata-theoretic approaches that provide powerful decidability properties [2]. On the other hand, techniques for synthesizing string transformations rely on domain-specific languages (DSLs) [1, 5]. These DSLs are designed to make synthesis practical and have to give up the closure and decidability properties enabled by automata-theoretic models. The disconnect between the two approaches raises a natural question: Can one synthesize automata-based models and therefore retain and leverage their elegant properties?
Transducers are a natural automata-based formal model for synthesizing string transformations. A finite state transducer (FT) is an automaton where each transition reads an input character and outputs a string of output characters. For instance, Figure 1 shows a transducer that ‘escapes’ instances of the " character. So, on input a"\"a, the transducer outputs the string a\"\\"a. FTs have found wide adoption in a variety of domains [7, 3] because of their many desirable properties (e.g., decidable equivalence check and closure under composition [8]). There has been increasing work on building SMT solvers for strings that support transducers; the Ostrich tool [9] allows a user to write programs in SMT where string-transformations are modelled using transducers. One can then write constraints over such programs and use an SMT solver to automatically check for satisfiability or prove unsatisfiability of those constraints. For example, given a program like the following:
y = escapeQuotes(x)
z = escapeQuotes(y)
assert(y == z) // Checking idempotence
one can use Ostrich to write a set of constraints and use them to prove whether the assertion holds.
However, to do so, one needs to first write a transducer that implements the function escapeQuotes, and writing transducers by hand is a cumbersome and error-prone task.
Figure 1(b). Specification for escapeQuotes.
Examples: {a"a ↦ a\"a, a\\a ↦ a\\a, a\a ↦ a\a, a\"a ↦ a\"a, \ ↦ \}
Types: [a"]∗\?|([a"]∗\[a"\][a"]∗)∗ ↦ a∗\?|(a∗\[a"\]a∗)∗
Distance: at most 1 edit per input character
In this paper, we present a technique for synthesizing transducers from high-level specifications. We use three different specification mechanisms that together quickly yield desirable transducers: input-output examples, input-output types, and input-output distances. When provided with the specification in Figure 1(b), our approach yields the transducer in Figure 1(a). While none of the three specification mechanisms is effective in isolation, they work well together. Input-output examples are a natural specification mechanism that is easy to provide, but they only capture finitely many inputs. Similarly, input-output types are a natural way to prevent a transducer from generating undesired strings and can often be obtained from function/API specifications. Last, input-output distances are a natural way to specify how much of the input string should be preserved by the transformation.
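To make the escapeQuotes example concrete, here is a minimal sketch of a deterministic transducer, written as plain transition dictionaries in Python, that is consistent with the five examples in Figure 1(b). The two-state structure and state names are assumptions for illustration; the machine in Figure 1(a) may be drawn differently.

```python
# (state, char) -> next state; q1 is entered after reading a backslash.
DELTA = {
    ("q0", '"'): "q0", ("q0", "\\"): "q1", ("q0", "a"): "q0",
    ("q1", '"'): "q0", ("q1", "\\"): "q0", ("q1", "a"): "q0",
}
# (state, char) -> output string; a bare quote is escaped, an
# already-escaped quote (one preceded by a backslash) is left alone.
GAMMA = {
    ("q0", '"'): '\\"', ("q0", "\\"): "\\", ("q0", "a"): "a",
    ("q1", '"'): '"',   ("q1", "\\"): "\\", ("q1", "a"): "a",
}

def run(w, q0="q0"):
    """Run the transducer on w, concatenating the transition outputs."""
    q, out = q0, []
    for a in w:
        out.append(GAMMA[(q, a)])
        q = DELTA[(q, a)]
    return "".join(out)
```

Running this sketch on each example input from Figure 1(b) reproduces the corresponding example output.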
We show that if the size of the transducers is fixed, all such specifications can be encoded as a set of constraints whose solution directly provides a transducer. While the constraints for examples are fairly straightforward, we introduce two new ideas for encoding types and distances. To encode types and distances, we show that one can use constraints to “guess” the simulation relation and the invariants necessary to prove that the transducer has the given type and respects the given distance constraint.
Because our constraint-based approach is based on decision procedures and is modular, it can support more complex models of transducers: (i) Symbolic Finite Transducers (s-FTs), which support large alphabets [10], and (ii) FTs with lookahead, which can express functions that otherwise require non-determinism. In addition, closure properties of transducers allow us to reduce repair problems for string transformations to our synthesis problem.
Contributions: We make the following contributions.
• A constraint-based synthesis algorithm for synthesizing transducers from complex specifications (Sec. 3).
• Extensions of our synthesis algorithm to more complex models—e.g., symbolic transducers and transducers with lookahead—and problems—e.g., transducer repair—that showcase the flexibility of our approach and the power of working with transducers, which enjoy strong theoretical properties—unlike domain-specific languages (Sec. 4).
• astra: a tool that can synthesize and repair transducers and compares well with a state-of-the-art tool for synthesizing string transformations (Sec. 5).
Proofs and additional results are available in the appendix.
2 Transducer Synthesis Problem
In this section, we define the transducer synthesis problem.
A deterministic finite automaton (DFA) over an alphabet Σ is a tuple D = (Q, δ, q0, F): Q is the set of states, δ : Q × Σ → Q is the transition function, q0 ∈ Q is the initial state, and F ⊆ Q is the set of final states. The extended transition function δ∗ : Q × Σ∗ → Q is defined as δ∗(q, ε) = q and δ∗(q, aw) = δ∗(δ(q, a), w). We say that D accepts a string w if δ∗(q0, w) ∈ F. The regular language L(D) is the set of strings accepted by the DFA D.
A total finite state transducer (FT) is a tuple T = (Q, δ, γ, q0), where Q is the set of states and q0 ∈ Q is the initial state. Transducers have two transition functions: δ : Q × Σ → Q defines the target state, while γ : Q × Σ → Σ∗ defines the output string of each transition. The extended function δ∗ for states is defined analogously to the extended transition function for a DFA. The extended function γ∗ for output strings is defined as γ∗(q, ε) = ε and γ∗(q, aw) = γ(q, a) · γ∗(δ(q, a), w). Given a string w, we use T(w) to denote γ∗(q0, w), i.e., the output string generated by T on w. Given two DFAs I and O, we write T : I → O for a transducer T iff for every string w in L(I), the output string T(w) belongs to L(O).
An edit operation on a string is either an insertion/deletion of a character, or a replacement of a character with a different one. For example, editing the string ab to the string acb requires one edit operation: inserting a c after the a. The edit distance ed(w1, w2) between two strings w1 and w2 is the minimum number of edit operations required to reach w2 from w1. We use |w| to denote the length of a string w. The mean edit distance between two strings w1 and w2 is defined as ed(w1, w2)/|w1|. For example, the mean edit distance from ab to acb is 1/2.
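As a reference for these definitions, the edit distance can be computed with the standard Levenshtein dynamic program (a sketch; insertions, deletions, and replacements all have cost 1):

```python
def edit_distance(w1, w2):
    """Levenshtein distance via a single-row dynamic program."""
    m, n = len(w1), len(w2)
    d = list(range(n + 1))            # distances from w1[:0] to every prefix of w2
    for i in range(1, m + 1):
        prev, d[0] = d[0], i          # prev holds the diagonal entry d[i-1][j-1]
        for j in range(1, n + 1):
            cur = d[j]
            d[j] = min(d[j] + 1,      # delete w1[i-1]
                       d[j - 1] + 1,  # insert w2[j-1]
                       prev + (w1[i - 1] != w2[j - 1]))  # replace (or match)
            prev = cur
    return d[n]

def mean_edit_distance(w1, w2):
    """Edit distance normalized by the length of the input string."""
    return edit_distance(w1, w2) / len(w1)
```

For instance, edit_distance("ab", "acb") is 1 and mean_edit_distance("ab", "acb") is 1/2, matching the example above.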
We can now formulate the transducer synthesis problem. We assume a fixed alphabet Σ. If the specification requires that a string w is translated to u, we write that as w ↦ u.
Problem Statement 1 (Transducer Synthesis).
The transducer synthesis problem has the following inputs and output:
Inputs
• A number of states k and an upper bound l on the length of the output of each transition.
• A set of input-output examples E = {w1 ↦ u1, …, wn ↦ un}.
• Input-output types I and O, given as DFAs.
• A positive upper bound μ on the mean edit distance.
Output A total transducer T with k states such that:
• Every transition of T has an output with length at most l, i.e., for every q ∈ Q and a ∈ Σ, |γ(q, a)| ≤ l.
• T is consistent with the examples: for every wi ↦ ui ∈ E, T(wi) = ui.
• T is consistent with the input-output types, i.e., T : I → O.
• For every string w ∈ L(I), ed(w, T(w))/|w| ≤ μ.
The synthesis problem that we present here is for FTs, and in Section 3, we provide a sound algorithm to solve it using a system of constraints. One of our key contributions is that our encoding can be easily adapted to synthesizing richer models than FTs (e.g., symbolic transducers [8] and transducers with regular lookahead), while still using the same encoding building blocks (Section 4).
3 Constraint-based Transducer Synthesis
In this section, we present a way to generate constraints to solve the transducer synthesis problem defined in Section 2. The synthesis problem can then be solved by invoking a Satisfiability Modulo Theories (SMT) solver on the constraints.
We use a constraint encoding, rather than a direct algorithmic approach, because of the multiple objectives to be satisfied. Synthesizing a transducer that translates a set of input-output examples is already an NP-complete problem [11]. On top of that, we also need to handle input-output types and distances. Our encoding is divided into three parts, one for each objective, which are presented in the following subsections. This division makes our encoding very modular and programmable. In Section 4 we show how it can be adapted to different transducer models and problems. We include a brief description of the size of the constraint encoding in Section B of the appendix.
3.1 Representing Transducers
The transducer T we are synthesizing has k states q0, …, qk−1, where k is part of the problem input, and q0 is the initial state of T.
We illustrate how a transition from a state q reading a character a is represented in our encoding. The target state is captured using an uninterpreted function d : Q × Σ → Q, e.g., d(q, a) = q′. Representing the output of the transition is trickier because its length is not known a priori. The synthesis problem however provides an output bound l, which allows us to limit the number of characters that may appear in the output. We use an uninterpreted function outj : Q × Σ → Σ to represent the j-th character in the output string (for 0 ≤ j < l); for a transition outputting the two-character string b0b1, we have out0(q, a) = b0 and out1(q, a) = b1. Since an output string's length can be smaller than l, we use an additional uninterpreted function len : Q × Σ → {0, …, l} to model the length of a transition's output; in our example len(q, a) = 2.
We say that an assignment to the above uninterpreted functions extends to a transducer T when T is the transducer obtained by instantiating the transition and output functions from d, outj, and len as described above.
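As an illustration of how an assignment extends to a transducer, the following sketch builds concrete transition and output maps from hypothetical values of d, outj, and len. The values shown are invented for a one-state quote-escaping machine with output bound l = 2; they are not the output of an actual solver run.

```python
# Hypothetical model values, as an SMT solver might return them.
d    = {(0, 'a'): 0, (0, '"'): 0}                 # d(q, a): target state
outc = {(0, 'a', 0): 'a', (0, 'a', 1): 'a',       # out_j(q, a): j-th output char
        (0, '"', 0): '\\', (0, '"', 1): '"'}
olen = {(0, 'a'): 1, (0, '"'): 2}                 # len(q, a): output length

def extend_to_transducer():
    """Instantiate concrete delta/gamma maps from the model values."""
    delta, gamma = {}, {}
    for (q, a), q2 in d.items():
        delta[(q, a)] = q2
        gamma[(q, a)] = "".join(outc[(q, a, j)] for j in range(olen[(q, a)]))
    return delta, gamma

def run_model(w):
    """Run the extended transducer on w."""
    delta, gamma = extend_to_transducer()
    q, out = 0, []
    for a in w:
        out.append(gamma[(q, a)])
        q = delta[(q, a)]
    return "".join(out)
```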
3.2 Input-output Examples
Goal: For each input-output example w ↦ u ∈ E, T should translate w to u.
Translating w to the correct output string means that T(w) = u. Generating constraints that capture this behavior of T on an example is challenging because we do not know a priori what parts of u are produced by what steps of the transducer's run: two candidate runs may traverse exactly the same states while producing different parts of the output string at each step. Intuitively, we need a way to "track" how much output the transducer has produced before processing the i-th character in the input and what state it has landed in. For every input example w ↦ u such that w = a1⋯an and u = b1⋯bm, we introduce an uninterpreted function cw : {0, …, n} → Q × {0, …, m} such that cw(i) = (q, j) iff after reading a1⋯ai, the transducer has produced the output b1⋯bj and reached state q—i.e., δ∗(q0, a1⋯ai) = q and γ∗(q0, a1⋯ai) = b1⋯bj.
We now describe the constraints that capture the behavior of cw. Constraint 1 states that a configuration must start at the initial state and at position 0 in the output.
cw(0) = (q0, 0)  (1)
Constraint 2 captures how the configuration is updated when reading the i-th character of the input. For every i ∈ {1, …, n}, q ∈ Q, j ∈ {0, …, m}, and t ∈ {0, …, l}:
cw(i − 1) = (q, j) ∧ len(q, ai) = t → cw(i) = (d(q, ai), j + t) ∧ ⋀0≤s<t outs(q, ai) = bj+s+1  (2)
Informally, if the i-th character is ai and the transducer has reached state q and produced the characters b1⋯bj so far, the transition reading ai from state q outputs the characters bj+1⋯bj+t, where t = len(q, ai) is the output length of the transition. The next configuration is then cw(i) = (d(q, ai), j + t).
Finally, Constraint 3 forces T to be completely done with generating u when w has been entirely read. Recall that n = |w| and m = |u|.
⋁q∈Q cw(n) = (q, m)  (3)
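For a concrete candidate transducer, the configuration function asserted by Constraints 1–3 can be computed directly, giving a simple reference check of the examples encoding. This is a sketch of the semantics, not the SMT encoding itself; the dictionary representation and the one-state example transducer are assumptions.

```python
def check_example(delta, gamma, q0, w, u):
    """Compute the configuration c(i) = (state, output position) after
    reading w[:i] and check Constraints 1-3: start at (q0, 0), advance by
    matching each transition's output against u, and finish at len(u)."""
    q, j = q0, 0                      # Constraint 1: c(0) = (q0, 0)
    for a in w:                       # Constraint 2: one step per input char
        out = gamma[(q, a)]
        if u[j:j + len(out)] != out:  # the output chars must match u here
            return False
        q, j = delta[(q, a)], j + len(out)
    return j == len(u)                # Constraint 3: all of u was produced

# A one-state transducer that escapes quotes (illustration only).
D_ESC = {("q0", 'a'): "q0", ("q0", '"'): "q0"}
G_ESC = {("q0", 'a'): "a", ("q0", '"'): '\\"'}
```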
The constraint encoding for examples is sound and complete (Proofs in A.1).
3.3 Input-Output Types
Goal: T should satisfy the property T : I → O.
Encoding this property using constraints is challenging because it requires enforcing that when T reads one of the (potentially) infinitely many strings in L(I) it always outputs a string in L(O). To solve this problem, we draw inspiration from how one proves that the property holds—i.e., using a simulation relation that relates runs over T, I = (QI, δI, q0I, FI), and O = (QO, δO, q0O, FO). Intuitively, if I has read some string w, we need to be able to encode the behavior of T in terms of I, i.e., what state this transducer is in after reading w and what output string it produced. Further, we also need to be able to encode in which state O would be after reading the output string T(w). We do this by introducing a function sim : Q × QI × QO → 𝔹, which preserves the following invariant: sim(q, qI, qO) holds if there exist strings w, u such that δI∗(q0I, w) = qI, δ∗(q0, w) = q, γ∗(q0, w) = u, and δO∗(q0O, u) = qO.
Constraint 4 states the initial condition of the simulation—i.e., T, I, and O are in their initial states.
sim(q0, q0I, q0O)  (4)
Constraint 5 encodes how we advance the simulation relation for states q ∈ Q, qI ∈ QI, qO ∈ QO and for a character a ∈ Σ, using free variables t and qO′ that are separate for each combination of q, qI, qO, and a:
sim(q, qI, qO) ∧ len(q, a) = t → sim(d(q, a), δI(qI, a), qO′), where qO′ = dO(⋯dO(qO, out0(q, a))⋯, outt−1(q, a))  (5)
Intuitively, if sim(q, qI, qO) holds and we read a character a, I moves to δI(qI, a) and T moves to d(q, a). However, we also need to advance O over the symbols produced by T. We hard-code the transition relation of O in an uninterpreted function dO, and apply it to compute the state O reaches when reading the output string. E.g., if len(q, a) = 2, the next state of O in the simulation is dO(dO(qO, out0(q, a)), out1(q, a)).
Lastly, Constraint 6 states that if we encounter a string in L(I)—i.e., I is in a final state qI ∈ FI—the relation does not relate it to a non-final state qO ∉ FO. Since O is deterministic, this means that O accepts T's output.
For every q ∈ Q, qI ∈ FI, and qO ∉ FO:  ¬sim(q, qI, qO)  (6)
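Semantically, the sim relation is the set of reachable triples of a product construction. For a concrete transducer, the type check can therefore be sketched as a fixpoint computation that verifies the condition of Constraint 6 on every reachable triple; DFAs are assumed to be given as (transition map, initial state, final states) triples.

```python
def check_type(delta, gamma, q0, I, O):
    """Compute the reachable triples (q, qI, qO) by fixpoint and check
    that whenever qI is accepting in I, qO is accepting in O."""
    dI, qI0, FI = I
    dO, qO0, FO = O
    sim, work = {(q0, qI0, qO0)}, [(q0, qI0, qO0)]
    while work:
        q, qI, qO = work.pop()
        if qI in FI and qO not in FO:   # a string of L(I) mapped outside L(O)
            return False
        for (q1, a), q2 in delta.items():
            if q1 != q:
                continue
            qO2 = qO
            for b in gamma[(q, a)]:     # advance O over the produced output
                qO2 = dO[(qO2, b)]
            t = (q2, dI[(qI, a)], qO2)
            if t not in sim:
                sim.add(t)
                work.append(t)
    return True
```

For example, the one-state transducer mapping a ↦ aa has type "any string of a's" → "even-length strings of a's", but not → "odd-length strings of a's".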
Soundness and completeness of the type constraints are proven in A.2.
3.4 Input-output Distance
Goal: The mean edit distance between any input string w in L(I) and the output string T(w) should not exceed μ.
Capturing the edit distance for all the possible inputs in the language of and the corresponding outputs produced by the transducer is challenging because these sets can be infinite. Furthermore, exactly computing the edit distance between an input and an output string may involve comparing characters appearing on different transitions in the transducer run. For example, consider the transducer shown in Figure 2(a) and suppose that we are only interested in strings in the input type . The first transition from deletes the a, therefore making 1 edit. This transducer has a cycle between states and , which can be taken any number of times. Each iteration, locally, would require that we make 2 edits: one to change the b to a, and the other to change the a to b. However, the total number of edits made over any string in the input type by this transducer is 1, because the transducer changes strings of the form to be of the form . Looking at the transitions in isolation, we are prevented from deducing that the edit distance is always 1 because the first transition delays outputting a character. If there was no such delay, as is the case for the transducer in Figure 2(b), which is equivalent on the relevant input type to the one in Figure 2(a), then this issue would not arise.
We take inspiration from Benedikt et al. [12] and focus on the simpler problem of synthesizing a transducer whose ‘aggregate cost’ satisfies the given objective.¹ For a transducer T and string w = a1⋯an, let q0 −a1/u1→ q1 ⋯ −an/un→ qn be the run of T on w. Then, the aggregate cost of T on w is the sum Σ1≤i≤n ed(ai, ui) of the edit distances over all indices i. The mean aggregate cost of T on w is the aggregate cost divided by |w|, the length of w. It follows that if T has a mean aggregate cost lower than some specified μ for every string, then it also has a mean edit distance lower than μ for every string.

¹ Benedikt et al. [12] studied a variant of the problem where the distance is bounded by some finite constant. Their work shows that when there is a transducer between two languages that has some bounded global edit distance, then there is also a transducer that is bounded (but with a different bound) under a local method of computing the edit distance—i.e., one where the computation of the edit distance is done transition by transition.
However, the mean aggregate cost overapproximates the edit distance, e.g., the transducer in Figure 2(a) has mean aggregate cost 1, while the mean edit distance when considering only strings in is less than . For this reason, if the mean edit distance objective was set to , our constraint encoding can only synthesize the transducer in Figure 2(b), and not the equivalent one in Figure 2(a).
Our encoding is complete for transducers in which the aggregate cost coincides with the actual edit distance. We leave completeness with respect to the global edit distance as an open problem. In fact, we are not even aware of an algorithm for checking (instead of synthesizing) whether a transducer satisfies a mean edit distance objective.² In Section 4.2, we present transducers with lookahead, which can mitigate this source of incompleteness. Furthermore, our evaluation shows that using the aggregate cost and enabling lookahead are both effective techniques in practice.

² The mean edit distance is similar to the mean payoff [13], which discounts a cost by the length of a string and looks at the behavior of a transducer in the limit. Our distance is different because 1) it looks at finite-length strings, and 2) it requires computing the edit distance, which cannot be done one transition at a time.
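The aggregate cost of a run can be computed directly from the transition outputs. The closed form used below for the edit distance between one input character and an output string anticipates the case analysis of Constraints 7 and 8; the dictionary representation and the one-state example transducer are assumptions for illustration.

```python
def aggregate_cost(delta, gamma, q0, w):
    """Sum, over the run on w, of the edit distance between each input
    character and the output of the transition that consumed it."""
    q, total = q0, 0
    for a in w:
        out = gamma[(q, a)]
        if a in out:
            total += len(out) - 1     # keep a, insert the other characters
        elif len(out) == 0:
            total += 1                # delete a
        else:
            total += len(out)         # replace a, then insert the rest
        q = delta[(q, a)]
    return total

def mean_aggregate_cost(delta, gamma, q0, w):
    """Aggregate cost divided by the input length."""
    return aggregate_cost(delta, gamma, q0, w) / len(w)

# A one-state quote-escaping transducer (illustration only).
D1 = {(0, 'a'): 0, (0, '"'): 0}
G1 = {(0, 'a'): 'a', (0, '"'): '\\"'}
```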
We can now present our constraints. First, we provide constraints for the edit distance of individual transitions (recall that transitions are being synthesized and we therefore need to compute their edit distances separately). Second, we provide constraints that implicitly compute state invariants to capture the aggregate cost between input and output strings at various points in the computation. We are given a rational number μ as an input to the problem, which is the allowed distance bound.
Edit Distance of Individual Transitions. To compute the edit distance between the input and the output of each transition, we introduce a function ed : Q × Σ → {0, …, l}. For a transition from state q reading a character a, ed(q, a) represents the edit distance between a and the transition's output out0(q, a)⋯outlen(q,a)−1(q, a). Notice that this quantity is bounded by the output bound l. The constraints to encode the value of this function are divided into two cases: i) the output of the transition contains the input character (Constraint 7), ii) the output of the transition does not contain the input character (Constraint 8). In both cases, the values are set via a simple case analysis on whether the length of the output is 0 (the edit distance is 1) or not (the edit distance is related to the length of the output).
For every q ∈ Q and a ∈ Σ:  ⋁0≤s<len(q,a) outs(q, a) = a → ed(q, a) = len(q, a) − 1  (7)
For every q ∈ Q and a ∈ Σ:  ⋀0≤s<len(q,a) outs(q, a) ≠ a → (len(q, a) = 0 → ed(q, a) = 1) ∧ (len(q, a) > 0 → ed(q, a) = len(q, a))  (8)
Edit Distance of Arbitrary Strings. Suppose that T has two transitions whose edit distances are 0 and 2, respectively, and that both are taken when reading the input string aa. The mean aggregate cost on aa is (0 + 2)/2 = 1, which means that the specification is not satisfied for any bound μ < 1. In general, we cannot keep track of every input string in the input type and look at its length and the number of edits that were made over it. So, how can we compute the mean aggregate cost over any input string? The first part of our solution is to scale the edit distance of a single transition by the specified mean edit distance. This operation makes it so that an input string is under the edit distance bound iff the sum of the weighted edit distances of its transitions is at most 0. The invariant we need to maintain is that the sum of the weights at any stage of the run tells us where we are with regard to the mean aggregate cost. For each transition we compute the difference between the edit distance over the transition and the specified mean edit distance μ. We introduce the uninterpreted function wed, which stands for weighted edit distance. For a transition at q reading a character a, the weighted edit distance is given by wed(q, a) = ed(q, a) − μ. The sum of the weights of all transitions in a run gives the cumulative difference. Going back to our example, the weighted edit distances of the two transitions are −μ and 2 − μ, making the cumulative distance 2 − 2μ, which is positive—and hence the specification is violated—whenever μ < 1. We can now compute the mean edit distance over a run without keeping track of the length of the run and the number of edits performed over it.
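The scaling trick can be checked on a single input string: summing the weighted edit distances of the traversed transitions and comparing against 0 is equivalent to comparing the mean aggregate cost against μ. A sketch, with a toy transducer invented for illustration:

```python
def within_distance(delta, gamma, q0, w, mu):
    """Check the scaled objective on one string: the cumulative weighted
    edit distance (ed - mu per transition) is <= 0 exactly when the mean
    aggregate cost of the run on w is at most mu."""
    q, cumulative = q0, 0.0
    for a in w:
        out = gamma[(q, a)]
        ed = len(out) - 1 if a in out else max(1, len(out))
        cumulative += ed - mu         # wed(q, a) = ed(q, a) - mu
        q = delta[(q, a)]
    return cumulative <= 0

# A toy transducer that deletes every a and copies every b.
D2 = {(0, 'a'): 0, (0, 'b'): 0}
G2 = {(0, 'a'): '', (0, 'b'): 'b'}
```

On input ab the run makes one edit over two characters (mean 0.5, within a bound of 0.5); on aa it makes two edits over two characters (mean 1, violating the bound).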
We still need to compute the weighted edit distance for every string in the possibly infinite language L(I). Building on the idea of simulation from the previous section, we introduce a new function called en (for energy), which tracks an upper bound on the sum of the weighted edit distances accumulated at that point in the simulation. This function is similar to a progress measure, which is a type of invariant used to solve energy games [14], a connection we expand on in Section 6. In particular, we already know that if there exist strings w, u such that δI∗(q0I, w) = qI, δ∗(q0, w) = q, γ∗(q0, w) = u, and δO∗(q0O, u) = qO, then sim(q, qI, qO) holds. Let this run over w be denoted by q0 −a1/u1→ q1 ⋯ −an/un→ qn, where w = a1⋯an, u = u1⋯un, and qn = q. We have that en(q, qI, qO) ≥ Σ1≤i≤n wed(qi−1, ai).
The en function is a budget on the number of edits we can still perform. At the initial states, we start with no ‘initial credit’ and the energy is 0.
en(q0, q0I, q0O) = 0  (9)
Constraint 10 bounds the energy budget according to the weighted edit distance of a transition by computing the minimum budget required at any point to still satisfy the distance bound. For each combination of q ∈ Q, qI ∈ QI, qO ∈ QO, and a ∈ Σ, the constraint uses free variables t and qO′ as in Constraint 5:
sim(q, qI, qO) ∧ len(q, a) = t → en(d(q, a), δI(qI, a), qO′) ≥ en(q, qI, qO) + wed(q, a)  (10)
In our example, Constraint 10 encodes that the energy at can be 1 less than that at , but that the energy at needs to be greater than at since we need to spend 3 edit operations over the second transition.
At any point during a run, the transducer is allowed to go below the mean edit distance and then ‘catch up’ later because we only care about the edit distance when the transducer has finished reading a string in . Therefore, when we reach a final state of , the transducer should not be in ‘energy debt’.
For every q ∈ Q, qI ∈ FI, and qO ∈ QO:  sim(q, qI, qO) → en(q, qI, qO) ≤ 0  (11)
The encoding presented in this section is sound (Proofs in A.3).
4 Richer Models and Specifications
We extend our technique to more expressive models (Sections 4.1 and 4.2) and show how our synthesis approach can be used not only to synthesize transducers, but also to repair them (Section 4.3). Furthermore, in Appendix E, we describe an encoding of an alternative distance measure.
4.1 Symbolic Transducers
Symbolic finite automata (s-FAs) and transducers (s-FTs) extend their non-symbolic counterparts by allowing transitions to carry predicates and functions to represent (potentially infinite) sets of input characters and output strings. Figure 3(a) shows an s-FT that extends the escapeQuotes transducer from Figure 1(a) to handle alphabetic characters. The bottom transition from q0 reads a character " (bound to the variable x) and outputs the string \" (i.e., a \ followed by the character stored in x). Symbolic finite automata (s-FAs) are s-FTs with no outputs. To simplify our exposition, we focus on s-FAs and s-FTs that only operate over ASCII characters that are ordered by their codes. In particular, all of our predicates are unions of intervals over characters (e.g., the predicate ¬(x = \) is really the union of the intervals [NUL-[] and []-DEL]); we often use the predicate notation instead of explicitly writing the intervals for ease of presentation. Furthermore, we only consider two types of output functions: constant characters and offset functions of the form λx. x + k that output the character obtained by taking the input and adding a constant k to it—e.g., applying λx. x − 32 to a lowercase alphabetic letter gives the corresponding uppercase letter.
In the rest of the section, we show how we can solve the transducer synthesis problem in the case where and are s-FAs and the goal is to synthesize an s-FT (instead of an FT) that meets the given specification. Intuitively, we do this by ‘finitizing’ the alphabet of the now symbolic input-output types, synthesizing a finite transducer over this alphabet using the technique presented in Section 3, and then extracting an s-FT from the solution.
Finitizing the Alphabet. The idea of finitizing the alphabet of s-FAs is a known one [8] and is based on the concept of minterms, the maximal satisfiable Boolean combinations of the predicates appearing in the s-FAs. For an s-FA A, we can define its set of predicates Preds(A) as the set of guards appearing on the transitions of A. The set of minterms Minterms(A) is the set of satisfiable Boolean combinations of all the predicates in Preds(A). For example, the predicates of the s-FT escapeQuotes in Figure 3(a) yield the minterms shown in Figure 3(c). The reader can learn more about minterms in [8]. We assign each minterm a representative witness character, as indicated in Figure 3(c), and then construct a finite automaton over the resulting finite alphabet Σmt. For a character a ∈ Σmt, we refer to its corresponding minterm by mt(a). In the other direction, for each minterm ψ, we refer to its uniquely determined representative character by wit(ψ).
Figure 3(c). The minterms of escapeQuotes and their witness characters.
For an s-FA A, we denote its corresponding FA over the alphabet Σmt with mt(A). Given an s-FA A with transitions Δ, the set of transitions of mt(A) is defined as follows: Δmt = { (p, wit(ψ), q) | (p, φ, q) ∈ Δ, ψ ∈ Minterms(A), and φ ∧ ψ is satisfiable }.
This algorithm replaces a transition guarded by a predicate φ in the given s-FA with a set of transitions reading the witnesses of the minterms where φ is satisfiable. In interval arithmetic, this is the set of intervals that intersect with the intervals specified by φ. One transition from q0 in Figure 3(a) is guarded by a predicate that intersects two minterms—the one represented by " and the one represented by a. As a result, that transition is replaced by two transitions in Figure 3(b), one that reads " and another that reads a.
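Minterm computation can be sketched by grouping characters according to which predicates they satisfy; here predicates are represented as explicit character sets over a small alphabet (an assumption for illustration; the implementation works with interval predicates).

```python
def minterms(alphabet, predicates):
    """Group characters by their membership vector across all predicates;
    each nonempty group is a satisfiable Boolean combination (a minterm)."""
    groups = {}
    for c in alphabet:
        key = tuple(c in p for p in predicates)
        groups.setdefault(key, set()).add(c)
    return sorted(map(frozenset, groups.values()), key=min)

def witness(minterm):
    """A fixed representative character for a minterm."""
    return min(minterm)
```

For example, over the alphabet {a, b, "} with predicates {"} and {a, b}, the minterms are {"} and {a, b}, with witnesses " and a respectively.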
From FTs to s-FTs. Once we have synthesized an FT, we need to extract an s-FT from it. There are many s-FTs equivalent to a given FT, and here we present one way of doing this conversion, which is used in our implementation. Let the size of an interval ι (the number of characters it contains) be given by size(ι), and the offset between two intervals ι1 and ι2 (i.e., the difference between the least elements of ι2 and ι1) be given by offset(ι1, ι2). Suppose we have a transition from q to q′ reading a witness character a and outputting the string b1⋯bt. Then, we construct a transition from q to q′ guarded by the minterm mt(a), where for each output character bs, the corresponding output function fs is determined by the following rules (x always indicates the variable bound to the input predicate):
1. If mt(bs) = mt(a), then fs = λx. x, i.e., the identity function.
2. If mt(a) and mt(bs) consist of single intervals ι1 and ι2, respectively, such that size(ι1) = size(ι2), then fs = λx. x + offset(ι1, ι2). For instance, if the input interval is [a-z] and the output interval is [A-Z], then the output function is λx. x − 32, which maps lowercase letters to uppercase ones.
3. Otherwise fs = wit(mt(bs))—i.e., the output is a constant character in the output minterm.
While our s-FT recovery algorithm is sound, it may apply case 3 more often than necessary and introduce many constants, therefore yielding a transducer that does not generalize well to unseen examples. However, our evaluation shows that our technique works well in practice. We provide a sketch of the proof of soundness of this algorithm in Appendix C.
4.2 Synthesizing Transducers with Lookahead
Deterministic transducers cannot express functions where the output at a certain transition depends on future characters in the input. Consider the problem of extracting all substrings of the form <x> (for a character x) from an input string. This is the getTags problem from [15]. A deterministic transducer cannot express this transformation because when it reads < followed by x it has to output <x if the next character is a > and nothing otherwise. However, the transducer does not have access to the next character!
Instead, we extend our technique to handle deterministic transducers with lookahead, i.e., the ability to look at the string suffix when reading a symbol. Formally, a Transducer with Regular Lookahead is a pair (T, R) where T is an FT with states QT, and R is a total DFA with states QR and transition function δR. The transducer now has another input in its transition function, although it still only outputs characters from Σ, i.e., δ : QT × Σ × QR → QT and γ : QT × Σ × QR → Σ∗. The semantics is defined as follows. Given a string w = a1⋯an, we define a function rw : {1, …, n} → QR such that rw(i) = δR∗(q0R, reverse(ai⋯an)). In other words, rw(i) gives the state reached by R on the reversed suffix starting at i. At each step i, the transducer reads the symbol (ai, rw(i)). The extended transition functions now take as input a lookahead word, which is a sequence of pairs of characters and lookahead states, i.e., from (Σ × QR)∗.
To synthesize transducers with lookahead, we introduce uninterpreted functions dR for the transition function of R, and rw for the rw-values of R on each example w. Additionally, we introduce a bound on the number of states in the lookahead automaton, as our synthesis algorithm has to synthesize both T and R at the same time. Appendix D provides the modified constraints needed to encode input-output types and input-output examples with lookahead.
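The semantics of regular lookahead can be sketched by precomputing the rw-values right to left and letting each transition read a (character, lookahead state) pair. The toy machine below (all names invented for illustration) rewrites an a to X exactly when the suffix starting at that a continues with b.

```python
def run_with_lookahead(delta, gamma, q0, dR, r0, w):
    """Run a transducer with regular lookahead. r[i] is the state the
    lookahead DFA R reaches on the reversed suffix w[i:], computed right
    to left; each transition then reads the pair (w[i], r[i])."""
    n = len(w)
    r = [r0] * (n + 1)
    for i in range(n - 1, -1, -1):
        r[i] = dR[(r[i + 1], w[i])]   # extend the reversed suffix by w[i]
    q, out = q0, []
    for i, a in enumerate(w):
        out.append(gamma[(q, a, r[i])])
        q = delta[(q, a, r[i])]
    return "".join(out)
```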
Part of the transducer with lookahead we synthesize for the getTags problem is shown in Figure 4. Notice that there are two transitions out of the same state for the same input but different lookahead states: the string <x is outputted only when the lookahead state indicates that the next character closes the tag.
Lookahead and aggregate cost: Lookahead can help represent transducers, even deterministic ones, in a way that has lower aggregate cost—i.e., the aggregate cost better approximates the actual edit distance. Suppose that we want to synthesize a transducer that translates the string abc to ab and the string abd to bd. This translation can be done using a deterministic transducer that outputs nothing while reading the first two characters, followed by two transitions that choose the correct output (ab or bd) based on the final character. Such a transducer would have a high aggregate cost of 4, even though the actual edit distance is 1. In contrast, using lookahead we can obtain a transducer that outputs each character as it reads it; this transducer has aggregate cost 1 on either string. We conjecture that for every transducer T, there always exists an equivalent transducer with regular lookahead for which the edit distance computation for aggregate cost coincides with the actual edit distance of T.
4.3 Transducer Repair
In this section, we show how our synthesis technique can also be used to “repair” buggy transducers. The key idea is to use the closure properties of automata and transducers—e.g., closure under union and sequential compositions [8]—to reduce repair problems to synthesis ones. The ability to algebraically manipulate transducers and automata is one of the key aspects that distinguishes our work from other synthesis works that use domain-specific languages [1, 5].
We describe two settings in which we can repair an incorrect transducer T: 1. Let O be an output type violated by T and let B be the finite automaton describing the set of strings T can output when fed inputs in L(I) (this is computable thanks to closure properties of transducers). We are interested in the case where L(B) ⊈ L(O)—i.e., T can produce strings that are not in the output type.
2. Let E be a set of input-output examples. We are interested in the case where there is some example w ↦ u ∈ E such that T(w) ≠ u.
We describe two ways in which we repair transducers.
Repairing from the Input Language. This approach synthesizes a new transducer for the inputs on which T is incorrect. Using properties of transducers, we can compute an automaton Bad describing the exact set of inputs for which T does not produce an output in L(O) (see pre-image computation in [10]). Let T′ be the transducer that behaves as T if the input is in L(I) \ L(Bad) and does not produce an output otherwise (transducers are closed under restriction [10]). If we synthesize a transducer T″ with type Bad → O, then the transducer T′ ∪ T″ satisfies the desired input-output type (transducers are closed under union).
Fault Localization from Examples. We use this technique when T is incorrect on some example. We can compute a set of "suspicious" transitions by taking all the transitions traversed when T(w) ≠ u for some w ↦ u ∈ E (i.e., one of these transitions is wrong) and removing all the transitions traversed when T(w) = u for some w ↦ u ∈ E (i.e., transitions that are likely correct). Essentially, this is a way of identifying where T is wrong on some examples. We can also use this technique to limit the transitions we need to synthesize when performing repair.
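The fault-localization rule can be sketched directly: collect the transitions traversed on failing examples and subtract those traversed on passing ones. The dictionary representation and the buggy example transducer below are assumptions for illustration.

```python
def trace(delta, gamma, q0, w):
    """Return the output on w and the set of transitions traversed."""
    q, out, used = q0, [], set()
    for a in w:
        used.add((q, a))
        out.append(gamma[(q, a)])
        q = delta[(q, a)]
    return "".join(out), used

def suspicious_transitions(delta, gamma, q0, examples):
    """Transitions used on some failing example minus those used on a
    passing one: a small candidate set of transitions to re-synthesize."""
    failing, passing = set(), set()
    for w, u in examples:
        out, used = trace(delta, gamma, q0, w)
        (failing if out != u else passing).update(used)
    return failing - passing
```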
5 Evaluation
We implemented our technique in a Java tool astra (Automatic Synthesis of TRAnsducers), which uses Z3 [16] to solve the generated constraints. All experiments were run on a 2.7 GHz Intel Core i5 with 8 GB of RAM and a 300s timeout.
Q1: Can astra synthesize practical transformations?
Benchmarks. Our first set of benchmarks is obtained from Optician [5, 6], a tool for synthesizing lenses, which are bidirectional programs used for keeping files in different data formats synchronized. We adapted 11 of these benchmarks to work with astra (note that we only synthesize one-directional transformations), and added one additional benchmark extrAcronym2, which is a harder variation (with a larger input type) of extrAcronym. We excluded benchmarks that require some memory, e.g., swapping words in a sentence, as they cannot be modeled with transducers.
Our second set of benchmarks (Miscellaneous) consists of 6 problems we created based on file transformation tasks (unixToDos, dosToUnix, and CSVSeparator) and s-FTs from the literature: escapeQuotes from [17], and getTags and quicktimeMerger from [15]. All of the benchmarks require synthesizing s-FTs, and getTags requires synthesizing an s-FT with lookahead (details in Table I).
To generate the sets of examples, we started with the examples that were used in the original source when available. In 5 cases, astra synthesized a transducer that was not equivalent to the one synthesized by Optician, even though it did meet the specification. In these cases, we used astra to synthesize two different transducers that met the specification, computed a string on which the two transducers differed, and added the desired output for that string as a new example. We repeated this task until astra yielded the desired transducer and we report the time for such sets of examples. The number of examples varies between 1 and 5. The ability to perform equivalence checks of two transducers is yet another reason why synthesizing transducers is useful and is in some ways preferable to synthesizing programs in a DSL. For each benchmark we chose a mean edit distance of 0.5 when the transformation could be synthesized with this distance and of 1 otherwise.
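The equivalence check used here — finding a string on which two synthesized transducers differ — can be sketched for deterministic Mealy-style transducers as a breadth-first search over the product machine. This is a simplified stand-in for the s-FT equivalence algorithm; the machine encoding (delta, out, q0) is our own:

```python
from collections import deque

def distinguishing_input(m1, m2, alphabet):
    # BFS over the product of two deterministic Mealy machines; returns a
    # shortest input on which their outputs differ, or None if equivalent.
    (d1, o1, q1), (d2, o2, q2) = m1, m2
    seen, queue = {(q1, q2)}, deque([(q1, q2, "")])
    while queue:
        p, q, w = queue.popleft()
        for ch in alphabet:
            if o1[(p, ch)] != o2[(q, ch)]:
                return w + ch            # outputs diverge on this input
            nxt = (d1[(p, ch)], d2[(q, ch)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt[0], nxt[1], w + ch))
    return None

# Toy machines: m1 copies its input; m2 copies it except that, after
# reading an 'a', it rewrites 'b' to 'X'. They differ first on "ab".
m1 = ({(0, "a"): 0, (0, "b"): 0},
      {(0, "a"): "a", (0, "b"): "b"}, 0)
m2 = ({(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 1},
      {(0, "a"): "a", (0, "b"): "b", (1, "a"): "a", (1, "b"): "X"}, 0)
```

The returned string plus its desired output is exactly the kind of disambiguating example added to the specification in the procedure described above.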
| | Benchmark | | | | | | | | | | astra Time(s) | Optician Time(s) | Optician-re Time(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Optician | extrAcronym | 6 | 3 | 10 | 3 | 3 | 2 | 1 | 1 | .5 | 0.11 | 0.05 | ✗ |
| | extrAcronym2 | 6 | 3 | 16 | 3 | 3 | 3 | 2 | 1 | 1 | 0.42 | — | — |
| | extrNum | 15 | 13 | 17 | 12 | 3 | 1 | 1 | 1 | 1 | 0.93 | 0.05 | 0.07 |
| | extrQuant | 4 | 3 | 8 | 5 | 2 | 1 | 2 | 1 | 1 | 0.19 | 0.09 | ✗ |
| | normalizeSpaces | 7 | 6 | 19 | 10 | 2 | 2 | 2 | 1 | 1 | 0.46 | 16.64 | ✗ |
| | extrOdds | 15 | 9 | 29 | 13 | 5 | 3 | 3 | 2 | 1 | 15.87 | 0.12 | ✗ |
| | capProb | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 1 | 1 | 0.05 | 0.05 | ✗ |
| | removeLast | 6 | 3 | 8 | 3 | 3 | 3 | 2 | 1 | .5 | 0.21 | 0.15 | 0.07 |
| | sourceToViews | 18 | 7 | 26 | 15 | 5 | 3 | 3 | 2 | 1 | 50.92 | 0.06 | ✗ |
| | normalizeNamePos | 19 | 7 | 35 | 24 | 13 | 1 | 6 | 2 | 1 | ✗ | 0.05 | 0.10 |
| | titleConverter | 22 | 13 | 41 | 41 | 15 | 1 | 3 | 1 | 1 | ✗ | 0.07 | ✗ |
| | bibtextToReadable | 14 | 11 | 41 | 35 | 12 | 1 | 5 | 1 | 1 | ✗ | 0.64 | 0.15 |
| Miscellaneous | unixToDos | 5 | 7 | 17 | 19 | 4 | 4 | 2 | 2 | .5 | 1.24 | — | — |
| | dosToUnix | 7 | 5 | 19 | 17 | 4 | 4 | 2 | 1 | .5 | 0.41 | — | — |
| | CSVSeparator | 5 | 5 | 9 | 9 | 4 | 1 | 1 | 1 | 1 | 0.142 | — | — |
| | escapeQuotes | 2 | 2 | 6 | 5 | 3 | 5 | 2 | 2 | 1 | 0.188 | ✗ | ✗ |
| | quicktimeMerger | 7 | 3 | 9 | 3 | 2 | 2 | 1 | 1 | .5 | 0.075 | — | — |
| | getTags | 3 | 3 | 9 | 4 | 3 | 5 | 2 | 2 | 1 | 0.95 | ✗ | ✗ |
Effectiveness of astra. astra can solve 15 of the 18 benchmarks (13 in <1s and 2 in under a minute) and times out on the 3 benchmarks where both the input and output types are big.
While the synthesized transducers have at most 3 states, we note that this is because astra synthesizes total transducers and then restricts their domains to the input type . This is advantageous because synthesizing small total transducers is easier than synthesizing transducers that require more states to define the domain. For instance, when we restrict the solution of extrAcronym2 to its input type, the resulting transducer has 11 states instead of the 2 required by the original solution!
Comparison with Optician. We do not compare astra to tools that only support input-output examples. Instead, we compare astra to Optician on the set of benchmarks common to both tools. Like astra, Optician supports input-output examples and types, but the types are expressed as regular expressions. Furthermore, Optician also attempts to produce a program that minimizes a fixed information theoretical distance between the input and output types [5].
Optician is faster on benchmarks for which the number of variables in astra's constraint encoding is large, while astra is faster on the normalizeSpaces benchmark. Optician, which uses regular expressions to express the input and output types, does not handle unstructured data well. To confirm this trend, we wrote synthesis tasks for the escapeQuotes and getTags benchmarks in Optician, and it was unable to synthesize them either; e.g., escapeQuotes requires replacing every " character with \".
To further probe Optician's reliance on regular expressions, we converted the regular expressions used in the lens synthesis benchmarks to automata and then back to regular expressions using a variant of the state elimination algorithm that acts on character intervals. This yields regular expressions that are not very concise and may contain redundancies. Optician could then solve only 4/11 of the benchmarks it previously synthesized (Optician-re in Table I).
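A state-elimination pass of this kind can be sketched as follows (our own simplified variant over plain string labels rather than character intervals). It also shows why the round-tripped expressions come out verbose: parallel edges are merged with unions and self-loops become starred groups, with no simplification.

```python
def state_elimination(states, edges, start, finals):
    # Regex via state elimination: add fresh start/final states with epsilon
    # edges, then eliminate every original state, merging parallel edges
    # with '|' and turning self-loops into starred groups.
    S, F = "_s", "_f"
    R = {}
    def add(p, q, r):
        R[(p, q)] = f"({R[(p, q)]}|{r})" if (p, q) in R else r
    for (p, q), r in edges.items():
        add(p, q, r)
    add(S, start, "")                    # epsilon into the start state
    for f in finals:
        add(f, F, "")                    # epsilon out of each final state
    for k in states:                     # eliminate every original state
        loop = f"({R[(k, k)]})*" if (k, k) in R else ""
        ins = [(p, r) for (p, q), r in R.items() if q == k and p != k]
        outs = [(q, r) for (p, q), r in R.items() if p == k and q != k]
        for p, r1 in ins:
            for q, r2 in outs:
                add(p, q, r1 + loop + r2)
        R = {pq: r for pq, r in R.items() if k not in pq}
    return R.get((S, F), "")
```

Even a two-state automaton for (ab)* comes back as `(|a(ba)*b)`, already less readable than the original regular expression, which is consistent with the degradation Optician-re exhibits.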
Answer to Q1: astra can solve real-world benchmarks and has performance comparable to that of Optician for similar tasks. Unlike Optician, astra does not suffer from variations in how the input and output types are specified.
Q2: Can astra repair transducers in practice?
Benchmarks. We considered the benchmarks in Table II. The only pre-existing benchmark that we found was escapeQuotes, through the interface of the Bek programming language used for verifying transducers [17]. We generated 11 additional faulty transducers to repair in the following two ways: (i) Introducing faults in our synthesis benchmarks: We either replaced the output string of a transition with a constant character, inserted an extra character, or deleted a transition altogether. (ii) Incorrect transducers: We intentionally provided fewer input-output examples and used only example-based constraints on some of our synthesis benchmarks.
All the benchmarks involve s-FTs. Three benchmarks are wrong on both input-output types and examples and the rest are only wrong on examples. Additionally, we note that to repair a transducer, we need the “right” set of minterms. Typically, the set of minterms extracted from the transducer predicates is the right one, but in the case of the escapeBrackets problems, astra needs a set of custom minterms we provide manually—i.e., repairing the transducer requires coming up with a new predicate. We are not aware of another tool that solves transducer repair problems and so do not show any comparisons.
| | Benchmark | | | | | | | | | | | Default Time(s) | Template Time(s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fault injected | swapCase1 | 2 | 1 | 6 | 3 | 3 | 2 | 1 | 1 | 1 | 3/3 | 0.04 | 0.02 |
| | swapCase2 | 2 | 1 | 4 | 3 | 3 | 2 | 1 | 1 | 1 | 1/2 | ✗ | ✗ |
| | swapCase3 | 2 | 1 | 6 | 3 | 3 | 2 | 1 | 1 | 1 | 1/3 | 0.06 | 0.05 |
| | escapeBrackets1 | 2 | 6 | 16 | 36 | 8 | 4 | 1 | 4 | 4 | 1/3 | 0.69 | 0.42 |
| | escapeBrackets2 | 1 | 6 | 1 | 7 | 6 | 5 | 1 | 4 | 4 | 1/2 | ✗ | ✗ |
| | escapeBrackets3 | 2 | 7 | 8 | 36 | 9 | 5 | 1 | 4 | 4 | 2/3 | 1.12 | 0.34 |
| | caesarCipher | 2 | 1 | 4 | 2 | 3 | 1 | 1 | 1 | 1 | 1/1 | ✗ | ✗ |
| Synth. | extrAcronym2 | 11 | 3 | 30 | 3 | 3 | 3 | 2 | 1 | 1 | 12/30 | 0.59 | 10.15 |
| | capProb | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 1 | 1 | 3/3 | 0.04 | 0.04 |
| | extrQuant | 8 | 3 | 16 | 5 | 2 | 1 | 2 | 1 | 1 | 5/10 | 0.37 | 0.51 |
| | removeLast | 6 | 3 | 8 | 3 | 3 | 2 | 2 | 1 | .5 | 7/8 | 0.40 | 1.08 |
| | escapeQuotes | 3 | 2 | 9 | 5 | 3 | 5 | 2 | 1 | 1 | 3/5 | 0.17 | 0.10 |
Effectiveness of astra. We indicate the number of suspicious transitions identified by our fault localization procedure (Section 4.3) in Table II, in the column with entries of the form x/y. In many cases, astra can identify 50% or more of the transitions as likely correct, thereby reducing the space of unknowns.
We compare two different ways of solving repair problems in astra. The first uses the repair-from-input approach described in Section 4.3 (Default in Table II). The second uses a ‘template’: we supply the constraint solver with a partial solution to the synthesis problem, based on the transitions that were localized as potentially buggy (Template in Table II).
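The effect of a template can be illustrated with a brute-force stand-in for the constraint query (our own toy encoding; astra of course hands the constraints to Z3 rather than enumerating): fixing the template transitions shrinks the search to the localized holes.

```python
from itertools import product

def run(trans, s, q0=0):
    # trans: (state, char) -> (next_state, output_char)
    q, res = q0, []
    for ch in s:
        q, o = trans[(q, ch)]
        res.append(o)
    return "".join(res)

def complete_template(states, alphabet, out_alphabet, examples, template):
    # Search only over the transitions left unspecified by the template,
    # accepting the first completion consistent with all examples.
    holes = [(q, a) for q in states for a in alphabet if (q, a) not in template]
    options = [(q2, o) for q2 in states for o in out_alphabet]
    for choice in product(options, repeat=len(holes)):
        trans = dict(template)
        trans.update(zip(holes, choice))
        if all(run(trans, s) == t for s, t in examples):
            return trans
    return None

# Toy: a one-state uppercaser where the 'a' transition is kept from the
# buggy transducer and the localized 'b' transition is left as a hole.
sol = complete_template([0], "ab", "AB", [("ab", "AB")], {(0, "a"): (0, "A")})
```

With one hole the search space here has 2 candidates instead of 4; in the SMT setting the template analogously pins down part of the model, though encoding the partial solution itself adds constraints.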
astra can solve 9/12 repair benchmarks (most in less than 1 second). The times of the two approaches are comparable in most cases. While one might expect templates to be faster, this is not always the case: the input-output specification for the repair transducer is small, but providing a template requires actually supplying a partial solution, which in some cases involves many constraints.
Answer to Q2: astra can repair transducers with varying types of bugs.
6 Related Work
Synthesis of string transformations. String transformations are one of the main targets of program synthesis. Gulwani showed they could be synthesized from input-output examples [1] and introduced the idea of using a DSL to aid synthesis. Optician extended the DSL-based idea to synthesizing lenses [6, 5], which are programs that transform between two formats. Optician supports not only examples but also input-output types. While DSL-based approaches provide good performance, they are also monolithic as they rely on the structure of the DSL to search efficiently. astra does not rely on a DSL and can synthesize string transformations from complex specifications that cannot be handled by DSL-based tools. Moreover, transducers allow applying verification techniques to the synthesized programs (e.g., checking whether two solutions are equivalent). One limitation of transducers is that they do not have ‘memory’, and consequently astra cannot be used for data-transformation tasks where this is required—e.g., mapping the string Firstname Lastname to Lastname, Firstname—something Optician can do. We remark that there exist transducer models with such capabilities [18] and our work lays the foundations to handle complex models in the future.
Synthesis of transducers. Benedikt et al. studied the ‘bounded repair problem’, where the goal is to determine whether there exists a transducer that maps strings from an input type to an output type using a bounded number of edits [12]. Their work was the first to identify the relation between solving this problem and solving games, an idea we leverage in this paper. However, their work is not implemented and cannot handle input-output examples, and it therefore avoids the source of NP-completeness. Hamza et al. studied the problem of synthesizing minimal non-deterministic Mealy machines (transducers where every transition outputs exactly one character) from examples [11]. They prove that synthesizing such transducers is NP-complete and provide an algorithm for computing minimal Mealy machines consistent with the input-output examples. astra is a more general framework that incorporates new specification mechanisms, e.g., input-output types and distances, and uses them all together. Mealy machines are also synthesized from temporal specifications in reactive synthesis and regular model checking, where they are used to represent parameterized systems [19, 20]. This setting is orthogonal to ours, as the specification is different and the transducer is again only a Mealy machine.
The constraint encoding used in astra is inspired by the encoding presented by Neider for computing a minimal separating DFA, i.e., a DFA that separates two disjoint regular languages [21]. astra’s use of weights and energy to specify a mean edit distance is based on energy games [22], a kind of 2-player infinite game that captures the need for a player not to exceed some available resource. One way of solving such games is by defining a progress measure [14]: to determine whether a game has a winning strategy for one of the players, one can check whether such a progress measure exists. We showed that the search for such a progress measure can be encoded as an SMT problem.
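The progress-measure existence check can be sketched, for a simplified one-player variant where every path must keep the energy level nonnegative, as a least-fixpoint computation (our own toy formulation; astra instead searches for the measure symbolically via SMT):

```python
def min_initial_credit(vertices, edges, top):
    # f[v] = least energy at v that keeps the level nonnegative along every
    # outgoing path, i.e. the least fixpoint of f[v] >= max(0, f[u] - w)
    # over edges (v, u, w). Values exceeding `top` are reported as None,
    # meaning no finite initial credit suffices (no progress measure).
    INF = top + 1
    f = {v: 0 for v in vertices}
    for _ in range(len(vertices) * (top + 2)):   # enough rounds to converge
        updated = False
        for v, u, w in edges:
            need = min(INF, max(0, f[u] - w))
            if need > f[v]:
                f[v] = need
                updated = True
        if not updated:
            break
    return {v: (None if f[v] >= INF else f[v]) for v in f}
```

On a zero-sum cycle (edge weights -2 and +2) the measure exists, with 2 units of credit needed before taking the -2 edge; on a strictly negative cycle every value diverges past the bound and no measure exists.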
7 Conclusion
We presented a technique and a tool (astra) for synthesizing different types of transducers from types, examples, and distance functions, and showed astra’s ability to solve a variety of practical problems. astra uses SMT solvers and its performance is affected by input components that result in large constraints (e.g., states in the desired transducer). Because astra synthesizes transducers instead of programs in arbitrary DSLs, its output can be analyzed using transducer algorithms (e.g., equivalence and pre-post analysis). Because of this property, our approach could be beneficial in learning invariants of string-manipulating programs, where a transducer is the formalism of choice, e.g., in the Ostrich tool [9]. In terms of improvements to our technique, aside from optimizing the SMT encoding to improve scalability, our approach could be strengthened by devising ways to effectively ‘guess’ the number of states required for a transducer to work on the given inputs. We leave these directions for future work.
Acknowledgements
This work was funded by the National Science Foundation under grants 1763871, 1750965, 1918211, and 2023222, by Facebook, and by a Microsoft Research Faculty Fellowship.
References
- [1] S. Gulwani, “Automating string processing in spreadsheets using input-output examples,” in POPL ’11, Austin, Texas, USA, January 2011.
- [2] P. Hooimeijer, B. Livshits, D. Molnar, P. Saxena, and M. Veanes, “Fast and precise sanitizer analysis with bek,” in USENIX Security Symposium, vol. 58. USENIX, 2012.
- [3] L. D’Antoni and M. Veanes, “Static analysis of string encoders and decoders,” in International Workshop on Verification, Model Checking, and Abstract Interpretation. Springer, 2013, pp. 209–228.
- [4] Y. Zhang, A. Albarghouthi, and L. D’Antoni, “Robustness to programmable string transformations via augmented abstract training,” in International Conference on Machine Learning. PMLR, 2020, pp. 11 023–11 032.
- [5] A. Miltner, S. Maina, K. Fisher, B. C. Pierce, D. Walker, and S. Zdancewic, “Synthesizing symmetric lenses,” Proceedings of the ACM on Programming Languages, vol. 3, no. ICFP, pp. 1–28, 2019.
- [6] A. Miltner, K. Fisher, B. C. Pierce, D. Walker, and S. Zdancewic, “Synthesizing bijective lenses,” Proceedings of the ACM on Programming Languages, vol. 2, no. POPL, pp. 1–30, 2017.
- [7] M. Mohri, “Finite-state transducers in language and speech processing,” Computational linguistics, vol. 23, no. 2, pp. 269–311, 1997.
- [8] L. D’Antoni and M. Veanes, “Automata modulo theories,” Communications of the ACM, vol. 64, no. 5, pp. 86–95, 2021.
- [9] T. Chen, M. Hague, J. He, D. Hu, A. W. Lin, P. Rümmer, and Z. Wu, “A decision procedure for path feasibility of string manipulating programs with integer data type,” in International Symposium on Automated Technology for Verification and Analysis. Springer, 2020, pp. 325–342.
- [10] L. D’Antoni and M. Veanes, “The power of symbolic automata and transducers,” in International Conference on Computer Aided Verification. Springer, 2017, pp. 47–67.
- [11] J. Hamza and V. Kunčak, “Minimal synthesis of string to string functions from examples,” in Verification, Model Checking, and Abstract Interpretation, C. Enea and R. Piskac, Eds. Cham: Springer International Publishing, 2019, pp. 48–69.
- [12] M. Benedikt, G. Puppis, and C. Riveros, “Regular repair of specifications,” in 2011 IEEE 26th Annual Symposium on Logic in Computer Science. IEEE, 2011, pp. 335–344.
- [13] R. Bloem, K. Chatterjee, and B. Jobstmann, “Graph games and reactive synthesis,” in Handbook of Model Checking, E. M. Clarke, T. A. Henzinger, H. Veith, and R. Bloem, Eds. Springer, 2018, pp. 921–962. [Online]. Available: https://doi.org/10.1007/978-3-319-10575-8_27
- [14] L. Brim, J. Chaloupka, L. Doyen, R. Gentilini, and J.-F. Raskin, “Faster algorithms for mean-payoff games,” Formal methods in system design, vol. 38, no. 2, pp. 97–118, 2011.
- [15] M. Veanes, P. Hooimeijer, B. Livshits, D. Molnar, and N. Bjorner, “Symbolic finite state transducers: Algorithms and applications,” in Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ser. POPL ’12. New York, NY, USA: Association for Computing Machinery, 2012, p. 137–150. [Online]. Available: https://doi.org/10.1145/2103656.2103674
- [16] L. De Moura and N. Bjørner, “Z3: An efficient smt solver,” in International conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2008, pp. 337–340.
- [17] P. Hooimeijer, B. Livshits, D. Molnar, P. Saxena, and M. Veanes, “Fast and precise sanitizer analysis with bek,” http://rise4fun.com/Bek/, 2012.
- [18] R. Alur, “Streaming string transducers,” in Logic, Language, Information and Computation, L. D. Beklemishev and R. de Queiroz, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 1–1.
- [19] O. Markgraf, C.-D. Hong, A. W. Lin, M. Najib, and D. Neider, “Parameterized synthesis with safety properties,” in Asian Symposium on Programming Languages and Systems. Springer, 2020, pp. 273–292.
- [20] A. W. Lin and P. Rümmer, “Liveness of randomised parameterised systems under arbitrary schedulers,” in International Conference on Computer Aided Verification. Springer, 2016, pp. 112–133.
- [21] D. Neider, “Computing minimal separating dfas and regular invariants using sat and smt solvers,” in International Symposium on Automated Technology for Verification and Analysis. Springer, 2012, pp. 354–369.
- [22] A. Chakrabarti, L. de Alfaro, T. A. Henzinger, and M. Stoelinga, “Resource interfaces,” in Embedded Software, Third International Conference, EMSOFT 2003, Philadelphia, PA, USA, October 13-15, 2003, Proceedings, 2003, pp. 117–133. [Online]. Available: https://doi.org/10.1007/978-3-540-45212-6_9
Appendix A Proofs
A.1 Input-Output Examples
Lemmas A.1 and A.2 show that (i) the constraint encoding preserves a desired invariant, which is then used to show that (ii) the transducer encoded in a solution to the constraints is consistent with the given input-output examples.
Lemma A.1.
Proof.
We first show the forward direction.
Assume that all constraints in are satisfied by , , , and . We proceed by induction on . For the base case, when , we have that by Constraint 1. We also know that and that by definition of the extended transition function. So we are done with the base case.
For the induction step, our induction hypothesis states exactly that iff and . Assume we have that . We need to show that this is the case for as well. Now, since input words consist of letters in , we must have that for some . Then, we must also have that , by the implication in Constraint 2 and the inductive hypothesis. By Constraint 2, we have that for all , either or . This means that for all , we have . Together with the inductive hypothesis, this implies .
For showing , we can observe that this follows from the inductive hypothesis that , by the definition of , and that by the constraints of type 2, which enforce that . So, we are done with the forward direction.
The backward direction is straightforward. If we have that and , then there exists a run of the form such that each consists of characters and . We can use each transition to assign corresponding values to . ∎
Lemma A.2.
Proof.
We first prove the forward direction. Assume that an assignment to , , , and satisfies all constraints in . By Constraint 3, we have that there is some state of the encoded transducer for which . From Lemma A.1, it follows that and . This is exactly what it means for .
For the backward direction, assume that for the transducer encoded in , , and , we have . Then, there exists a run such that and . Using this run of on , we can construct a definition of such that all constraints in are satisfied.
∎
A.2 Input-Output Types
Lemmas A.3 and A.4 state that (i) the simulation relation sim preserves the desired invariant, which is then used to prove that (ii) the encoded transducer satisfies the specification for input-output types.
Lemma A.3.
Proof.
Assume that there exist strings such that , , , and . By Constraint 4, we already have that . Since , we know that there is an assignment to the functions , , and that encodes a run such that and . It follows that we eventually reach by the implication in Constraint 5.
∎
Lemma A.4.
Proof.
For the forward direction, we assume that an assignment to , , and sim satisfies all constraints in . Then, for every , if we have that for some , then it is the case that . From Constraint 5, we know that for a string such that where , it must be that for at least some and . From Lemma A.3, it also follows that for such a string , there is also a string such that and such that . In other words, for every string , outputs a string as desired.
For the backward direction, let be a transducer such that . We can use the transition relation of to construct an assignment of , , and sim such that all constraints are satisfied.
∎
A.3 Input-Output Distances
Lemmas A.5 and A.6 show that a transducer that is a solution to the set of distance constraints is such that for all .
Lemma A.5.
Let be the set of input-output distance constraints from Equations 7, 8, 9, 10 and 11 for a mean edit distance of , and be a transducer encoded into the functions , and . If all constraints in are satisfied by , , , sim, en, and wed, then for all runs over of the form , where , it holds that:
Proof.
Assume that all constraints in are satisfied by , , , sim, en and wed. Consider an arbitrary path over of the form , where . By Constraint 10, we have that
Since , we have that . By Constraint 11, it follows that . Therefore, as well. Since by Constraint 9, we have that .
∎
Lemma A.6.
Proof.
Assume that an assignment to , , , sim, en and wed satisfy all constraints in . From Lemma A.5 it follows that the run of any string is such that , where and are the states in the run. Since is the difference between and at each transition, if the total difference is , then it follows that .
∎
Appendix B Size of Constraint Encoding
Input-Output Examples.
There is one constraint of type 1, constraints of type 2, and one constraint of type 3 for one input-output example, where is the length of the input and is the length of the output.
A constraint of type 1 involves 1 variable, a constraint of type 2 involves variables, and a constraint of type 3 involves variables.
Input-Output Types.
There is one constraint of type 4, constraints of type 5, and one constraint of type 6 for the input-output types and .
A constraint of type 4 involves no variables, a constraint of type 5 involves variables, and a constraint of type 6 involves variables.
Input-Output Distances.
Appendix C Soundness of s-FT recovery algorithm
The following lemma shows that the s-FT recovery algorithm described in Section 4.1, which we use to synthesize s-FTs, is sound in the sense that finitizing the recovered s-FT results in an identical finite transducer.
Lemma C.1.
Let be an FT and let be the corresponding s-FT obtained using the s-FT extraction algorithm. Then we have that if we use the representative character for each minterm .
Proof.
Without loss of generality, consider a transition where the input character is a and the output character is b. The proof generalizes to an output string with multiple characters. Now, one of 3 cases applies for the recovered output function:

1. If a = b, then the output function that we choose is the identity function. Re-finitizing this transition results in the original transition, as expected.

2. If the minterms containing a and b are single intervals of the same size, then the output function shifts its input by the constant offset from a to b. If we finitize the transition with the same witness for each minterm, then we get back the original transition.

3. Otherwise, we have a constant function that always outputs b. Finitizing the transition again results in the original transition.
∎
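The three cases of the recovery algorithm can be sketched over integer character codes and interval minterms (a toy of our own; in particular, the shift in case 2 is our reading of the same-size-interval condition):

```python
def recover_output_fn(a, b, in_minterm, out_minterm):
    # Lift a concrete transition (input code a -> output code b) to a
    # symbolic output function on the input minterm, following the three
    # proof cases: identity, constant shift over same-size intervals,
    # and constant output otherwise. Minterms are (lo, hi) intervals.
    if a == b:
        return lambda x: x            # case 1: identity
    (lo_i, hi_i), (lo_o, hi_o) = in_minterm, out_minterm
    if hi_i - lo_i == hi_o - lo_o:    # case 2: same-size interval minterms
        off = b - a
        return lambda x: x + off      # shift every character by b - a
    return lambda x: b                # case 3: constant output

# Example: witness 'a' -> 'A' over the same-size minterms [97, 122] and
# [65, 90] recovers the -32 shift, so the whole minterm is uppercased,
# not just the witness character.
to_upper = recover_output_fn(ord("a"), ord("A"), (97, 122), (65, 90))
```

Re-finitizing either the shift or the constant function on the witness character reproduces the original transition, which is what the lemma asserts.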
Appendix D Constraints for synthesizing Transducers with Regular Lookahead
In writing the lookahead constraints for a string , we omit the subscript and just write look and for and .
Input-Output Examples and Lookahead. Constraint 12 expresses how we compute the lookahead states for a string .
(12) |
Constraint 13 shows how the output configuration for input-output examples takes into account what the lookahead state is at any point. Given an example such that and , for every , , , and :
(13) |
Input-Output Types with Lookahead. The simulation relation sim also needs to track where we are in the lookahead automaton during any possible run. We therefore extend sim to include the states of the lookahead automaton. Since we travel ‘backwards’ over the lookahead automaton while moving ‘forward’ over the input, the simulation relation uses the inverse transition function of the lookahead automaton, and we introduce the corresponding uninterpreted function for it. The desired invariant is that the extended simulation relation holds exactly when there exist a lookahead word and a string witnessing the corresponding runs.
Our starting point in the simulation can be any state of the lookahead automaton. This is because the lookahead automaton reads a string backwards, so the run could start from any of its states.
(14) |
Then, to advance the simulation, we now use the inverse transition relation of . So, if we are at state , then for each character , the simulation advances to all the states that we could have reached from by reading . For any combination of and , we use the free variables and to encode the following constraint:
(15) |
Finally, Constraint 16 says that when we reach a final state and the initial state of the lookahead automaton (since we are going backwards over it), we cannot be in a non-final state of the output automaton.
(16) |