
Latent Execution for Neural Program Synthesis

Xinyun Chen (UC Berkeley), Dawn Song (UC Berkeley), Yuandong Tian (Facebook AI Research)
Abstract

Program synthesis from input-output (IO) examples has been a long-standing challenge. While recent works have demonstrated limited success on domain-specific languages (DSL), it remains highly challenging to apply them to real-world programming languages, such as C. Due to complicated syntax and token variation, there are three major challenges: (1) unlike many DSLs, programs in languages like C must be compiled first and are not executed via interpreters; (2) the program search space grows exponentially when the syntax and semantics of the programming language become more complex; and (3) collecting a large-scale dataset of real-world programs is non-trivial. As a first step to address these challenges, we propose LaSynth and show its efficacy in a restricted-C domain (i.e., C code with tens of tokens, with sequential, branching, loop and simple arithmetic operations but no library calls). More specifically, LaSynth learns a latent representation to approximate the execution of partially generated programs, even if they are incomplete in syntax (addressing (1)). The learned execution significantly improves the performance of next token prediction over existing approaches, facilitating search (addressing (2)). Finally, once trained with randomly generated ground-truth programs and their IO pairs, LaSynth can synthesize more concise programs that resemble human-written code. Furthermore, retraining our model with these synthesized programs yields better performance with fewer samples for both Karel and C program synthesis, indicating the promise of leveraging the learned program synthesizer to improve the dataset quality for input-output program synthesis (addressing (3)). When evaluating whether the program execution outputs match the IO pairs, LaSynth achieves 55.2% accuracy on generating simple C code with tens of tokens including loops and branches, outperforming existing approaches without executors by around 20%. The code is available at https://github.com/Jungyhuk/latent-execution.

1 Introduction

Program synthesis from input-output (IO) pairs, also called programming by example (PBE), requires high-level reasoning and remains a challenging problem for deep models. Unlike Natural Language Processing (NLP) [5, 16] and perceptual tasks such as Computer Vision (CV) [14, 22], the mapping from IO pairs to the program itself is hard to model. Many works attempt to learn a direct mapping from training samples, but often find that it is already difficult to achieve a low training error, and generalization to new problems is even harder. Alternatively, one might choose to formulate program synthesis as a search problem: to find the program that satisfies the IO pairs. Unfortunately, the search space of programs is often vast and highly non-smooth, i.e., a small perturbation of the program often leads to a complete change of the output.

While there are many previous works on programming by example tasks [6, 17, 9], they mainly focus on Domain Specific Languages (DSLs), and cannot be easily applied to popular general-purpose programming languages. For example, to synthesize C programs, we need to deal with both high-level control flow (e.g., branching and loops) and low-level operations (e.g., which variable is the target of an assignment). Moreover, unlike DSLs (e.g., Karel) for which it is feasible to implement a per-line interpreter, C programs need compilation and a partial C program cannot execute. On the other hand, some recent works investigate natural language descriptions as auxiliary information for the program specification, and they evaluate neural program synthesis models on constrained or simplified competitive programming problems [27, 3, 23, 11, 4]. Although some of these works demonstrate promising results for synthesizing Python or C code, they require manual annotations of natural language specifications [27] or large-scale pre-training on human-written programs [11, 4], and the performance significantly degrades when only input-output examples are fed into the synthesizer [3].

To synthesize C programs from input-output examples only, we propose LaSynth, which generates the program in a recurrent and token-by-token manner. As the first contribution on model architectures for program synthesis, we propose to use two latent parallel representations in the recurrent model. One representation is learned from regular recurrent models as in autoregressive language models [24], with the double attention mechanism over IO pairs proposed in RobustFill [17] and an operation predictor that models the arithmetic relationship between the program input and output. The second representation, named Latent Execution Trace (LaET), models the hypothetical input signal for the remaining partial program to execute to get to the desired output. Motivated by the line of work on execution-guided program synthesis [47, 18, 57, 12], we learn a latent representation for C programs which are not executed via interpreters, and train the model given only IO pairs without the intermediate program execution states. The two parallel representations are trained end-to-end.

As the second contribution on dataset construction, we demonstrate that it is possible to automatically construct a C codebase that is of high quality, controllable and concise through our proposed program synthesis procedure. Specifically, starting from randomly generated C programs that might contain a lot of redundant statements, we show that via iterative retraining, the subsequent generated code from our learned model becomes more concise and similar to human-written ones. Moreover, learning directly from the generated code leads to better performance given the same amount of samples, and improves the sample efficiency. We observe similar results when applying our iterative retraining technique to Karel [9], another programming by example benchmark consisting of randomly generated programs. Although the initial Karel dataset includes a large proportion of complicated programs with different control flow constructs, we demonstrate that nearly half of the problems can be solved by straight-line programs, which again confirms that randomly generated programs tend to be unnecessarily complicated. We envision that the iterative retraining procedure could greatly reduce laborious efforts in human codebase collection in future research.

As the third contribution, we show for the first time that short C code in a restricted domain (tens of tokens, no library calls) with sequential, branching, loop and simple arithmetic operations can be effectively synthesized from IO pairs only. In particular, while LaSynth tends to generate more concise programs (and thus rarely matches the randomly generated ground truth code token-for-token), when measuring whether the program execution outputs match the IO pairs, LaSynth achieves 55.2% accuracy and outperforms existing neural program synthesis models by around 20%. These results demonstrate the effectiveness of learning latent execution traces.

2 Neural Program Synthesis from Input-Output Examples

In programming by example tasks, the program specification is a set of input-output examples [17, 9]. Specifically, we provide the synthesizer with a set of $K$ input-output pairs $\{(I^{(k)}, O^{(k)})\}_{k=1}^{K}$ ($\{IO\}^{K}$ in short). These input-output pairs are annotated with a ground truth program $P^{\star}$, so that $P^{\star}(I^{(k)}) = O^{(k)}$ for any $k \in \{1, 2, ..., K\}$. To measure the program correctness, we include another set of held-out test cases $\{IO\}_{test}^{K_{test}}$ that differs from $\{IO\}^{K}$. The goal of the program synthesizer is to predict a program $P$ from $\{IO\}^{K}$, so that $P(I) = P^{\star}(I) = O$ for any $(I, O) \in \{IO\}^{K} \cup \{IO\}_{test}^{K_{test}}$.
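To make the task definition concrete, the snippet below is a minimal, self-contained check of this correctness criterion in Python; the candidate program is represented as a plain callable, and names such as satisfies, spec, and heldout are illustrative rather than part of the paper's framework.

# Minimal sketch of the programming-by-example correctness criterion.
def satisfies(program, spec_pairs, heldout_pairs):
    """Return True iff the candidate reproduces every output, both on the
    K specification pairs and on the held-out test cases."""
    for inp, out in list(spec_pairs) + list(heldout_pairs):
        if program(inp) != out:
            return False
    return True

# Toy usage: the (unknown) ground-truth program adds 2 to every list element.
spec = [([1, 0, 3], [3, 2, 5]), ([0, 0], [2, 2])]
heldout = [([4, -1], [6, 1])]
candidate = lambda xs: [x + 2 for x in xs]
print(satisfies(candidate, spec, heldout))  # True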

C Program Synthesis. In this work, we make the first attempt of synthesizing C code in a restricted domain from input-output examples only, and we focus on programs for list processing. List processing tasks have been studied in some prior works on input-output program synthesis, but they synthesize programs in restricted domain-specific languages instead of full-fledged popular programming languages [6, 39, 38].

Our C code synthesis problem brings new challenges for programming by example. Compared to domain-specific languages, the syntax and semantics of C are much more complicated, which significantly enlarges the program search space. Meanwhile, learning good representations for partially decoded programs also becomes more difficult. In particular, prior neural program synthesizers that utilize per-line interpreters of the programming language to guide the synthesis and representation learning [12, 44, 37, 18, 38] are not directly applicable to C. Although it is possible to dump some intermediate variable states during C code execution [10], since partial C programs are not executable, we can only obtain the execution states after a full C program has been generated, which is too late to include them in the program decoding process. In particular, the intermediate execution state is not available when the partial program is syntactically invalid, and this happens more frequently for C due to its syntax design.

Figure 1: Illustration of the C program synthesis pipeline. For dataset construction, we develop a random program generator to sample random C programs, then execute the program over randomly generated inputs and obtain the outputs. The input-output pairs are fed into the neural program synthesizer to predict the programs. Note that the synthesized program can be more concise than the original random program.
Figure 2: (a) An overview of LaSynth model architecture. (b), (c), and (d) present the details of the program decoder, latent executor, and the operation predictor. Note that the operation predictor is specialized for numerical calculation, and thus is not used for the Karel domain.

3 Program Synthesis with Learned Execution

In this section, we present LaSynth, which learns to represent the execution of partial programs to guide the synthesis process. Fig. 2(a) provides an overview of the LaSynth model architecture, which consists of two components: the program decoder and the latent executor. We present the core design below, and defer more details to Appendix B and Appendix C.

3.1 Model Overview

At a high level, the program decoder (Fig. 2(b)) takes a latent vector $h_{t-1}$ that represents the generated partial program and the previous (generated) program token $p_{t-1}$, and outputs the latent vector $h_t$ and the next program token $p_t$ to be generated at time step $t$:

$(h_t, p_t) = \mathrm{ProgramDecoder}(h_{t-1}, p_{t-1}; IO_{t-1})$   (1)

Here the recurrent model is conditioned on the IO pair $IO_{t-1}$. When $IO_t = IO := (I, O)$ for every $t$, i.e., $IO_t$ remains constant over the entire recurrent generation process, Eqn. 1 represents the standard recurrent architecture used in most autoregressive natural language models [24, 49], and is also used in prior works on program synthesis from input-output examples [17, 9].

For program decoding, the decoder first takes two attention vectors $s_t^I$ and $s_t^O$ computed from the IO pairs and the latent vector $h_{t-1}$ via double attention [17], and utilizes a max pooling layer to compute an aggregated embedding $m_t$ for all IO pairs (Fig. 2(b)):

$m_t = \mathrm{MaxPool}_{j \in \{1, 2, ..., K\}}(\mathrm{tanh}(W[s_t^{I(j)}; s_t^{O(j)}]))$   (2)

Here the superscript $(j)$ indicates that the representation is for the $j$-th IO pair, $[a; b]$ is the concatenation of vectors $a$ and $b$, and $W$ is a trainable matrix. To facilitate the prediction of long programs, we compute an attention vector $d_t$ over previously generated program tokens using the standard attention mechanism [5, 30]:

$d_t = \mathrm{Attention}(m_t, \{p_0, ..., p_{t-1}\})$   (3)

Finally, the next token $p_t$ is sampled from $\mathbb{P}[p_t] = \mathrm{Softmax}(V d_t)_{p_t}$, where $V$ is a trainable matrix.
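Putting Eqns. 1-3 together, the PyTorch sketch below shows one possible implementation of a single decoding step: double attention over each IO pair, max pooling across the K pairs, attention over already-decoded token embeddings, and a softmax over the program vocabulary. This is a simplification under stated assumptions, not the paper's exact code; the class name DecoderStep, the dot-product attention, and the way the attended vectors are pooled into the LSTM input are our own illustrative choices.

import torch
import torch.nn.functional as F

H, VOCAB = 512, 40          # illustrative sizes, not the paper's exact configuration

def attend(query, keys):
    """Dot-product attention; query: (B, H), keys: (B, L, H) -> context (B, H)."""
    w = F.softmax(torch.einsum('bh,blh->bl', query, keys), dim=-1)
    return torch.einsum('bl,blh->bh', w, keys)

class DecoderStep(torch.nn.Module):
    """One step of the program decoder: double attention over each IO pair,
    max pooling across the K pairs (Eqn. 2), attention over previously decoded
    tokens (Eqn. 3), and a softmax over the program vocabulary."""
    def __init__(self):
        super().__init__()
        self.cell = torch.nn.LSTMCell(3 * H, H)   # recurrent state update
        self.mix = torch.nn.Linear(2 * H, H)      # mixes [h; s_O] before input attention
        self.W = torch.nn.Linear(2 * H, H)        # Eqn. (2)
        self.V = torch.nn.Linear(H, VOCAB)        # output projection

    def forward(self, state, p_prev_emb, enc_I, enc_O, prev_tok_embs):
        h_prev, c_prev = state                    # each (B, H)
        B, K, L, _ = enc_I.shape                  # encodings of the K IO pairs
        hK = h_prev.unsqueeze(1).expand(-1, K, -1).reshape(B * K, H)
        s_O = attend(hK, enc_O.reshape(B * K, L, H))                  # attend to the output
        s_I = attend(self.mix(torch.cat([hK, s_O], -1)),
                     enc_I.reshape(B * K, L, H))                      # then to the (latent) input
        m = torch.tanh(self.W(torch.cat([s_I, s_O], -1))).reshape(B, K, H)
        m = m.max(dim=1).values                                       # max-pool over the K pairs
        d = attend(m, prev_tok_embs)                                  # attend to decoded tokens
        logits = self.V(d)                                            # distribution over p_t
        s_I_pool = s_I.reshape(B, K, H).max(1).values
        s_O_pool = s_O.reshape(B, K, H).max(1).values
        h, c = self.cell(torch.cat([p_prev_emb, s_I_pool, s_O_pool], -1),
                         (h_prev, c_prev))
        return (h, c), logits

In the full model, enc_I is replaced by the encoding of the latent input $\hat{I}_{t-1}$ introduced in the next subsection.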

3.2 Latent Executor Design

As shown in our experiments (Sec. 5), the standard program decoder architecture may not achieve strong performance when the program complexity increases. One main reason is that the standard program decoder only takes the initial IO pairs as the input without considering the program execution, so the learned representation of the partial program does not effectively guide the synthesis process. Motivated by prior works that utilize execution traces for Karel program synthesis [12, 44, 47], in this paper we introduce the latent executor (Fig. 2(c)), which maintains a second representation $\hat{I}_t$ during program decoding. Intuitively, $\hat{I}_{t-1}$ models the hypothetical input on which the remaining partial program $p_{t \ldots T}$ would execute to produce the output $O$. Given the estimated input $\hat{I}_{t-1}$ and the latent vector $h_t$, the latent executor returns $\hat{I}_t$ at the next time step $t$:

$\hat{I}_t = \mathrm{LatentExecutor}(\hat{I}_{t-1}, h_t)$   (4)

The collection $\{\hat{I}_t\}_{t=0}^{T}$ is the latent execution trace (LaET). With the help of the latent executor, we now use the IO pairs $IO_{t-1} := (\hat{I}_{t-1}, O)$ instead of $(I, O)$ in the program decoder (Eqn. 1).
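The sketch below illustrates one plausible form of this recurrence in PyTorch: a bi-directional LSTM over the L list elements, conditioned on the decoder state $h_t$, produces the next latent input $\hat{I}_t$. The class name LatentExecutor and the exact conditioning scheme (concatenating $h_t$ to every element) are assumptions; Appendix B.3 gives the paper's actual formulation.

import torch

H = 512  # illustrative hidden size

class LatentExecutor(torch.nn.Module):
    """Hypothetical sketch of Eqn. (4): given the decoder state h_t, update the
    latent per-element representation of the remaining program's input."""
    def __init__(self):
        super().__init__()
        # Bi-directional LSTM over the L list elements, conditioned on h_t.
        self.rnn = torch.nn.LSTM(2 * H, H // 2, batch_first=True,
                                 bidirectional=True)

    def forward(self, I_hat_prev, h_t):
        # I_hat_prev: (B, L, H) latent input at step t-1;  h_t: (B, H).
        B, L, _ = I_hat_prev.shape
        cond = h_t.unsqueeze(1).expand(-1, L, -1)       # broadcast h_t to each element
        exec_t, _ = self.rnn(torch.cat([I_hat_prev, cond], dim=-1))
        return exec_t                                   # (B, L, H): the new latent input

# Usage inside decoding: I_hat starts as the embedding of the true input I,
# and is fed back into the decoder as the "input" half of the IO pair.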

3.3 End-to-end Training

We train our model with supervised learning, minimizing the sum of the token prediction loss $\mathcal{L}_{Prog}$ and the latent executor loss $\mathcal{L}_{Exec}$:

$\mathcal{L} = \mathcal{L}_{Prog} + \mathcal{L}_{Exec}$   (5)

Specifically, $\mathcal{L}_{Prog} := \sum_{t=1}^{T} \mathrm{Loss}(p_t, p^{\star}_t)$ is the step-by-step cross-entropy loss between the predicted program $p_{1 \ldots T}$ and the ground truth program $p^{\star}_{1 \ldots T}$.

For the latent executor, since the semantics of partial programs (e.g., partial C programs) are not always well-defined, there is no step-by-step training supervision. However, the output of the executor should be consistent with the program specification after it takes the annotated ground truth program as the input. Therefore, we set $\hat{I}_0 = I$ (true input) and minimize the distance between $\hat{I}_T$ and $O$ (true output) after the program finishes:

$\mathcal{L}_{Exec} = \mathrm{Loss}(\hat{I}_T, O)$   (6)

Note that $\mathcal{L}_{Exec}$ does not rely on any assumptions about the partial program semantics, and is thus applicable to both domain-specific languages and general-purpose programming languages such as C. In our evaluation, equipping the model with the latent executor significantly improves the program prediction performance, where each program can include up to 256 tokens.
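Assuming the executor's final state is mapped to per-element value logits (as in Appendix B.3.1), the combined objective of Eqns. 5-6 can be sketched as below; the function name and tensor layout are illustrative.

import torch.nn.functional as F

def lasynth_loss(token_logits, gold_tokens, exec_logits, gold_output):
    """Hedged sketch of Eqn. (5): L = L_Prog + L_Exec.
    token_logits: (B, T, vocab) decoder predictions for each program step;
    gold_tokens:  (B, T)        ground-truth program tokens;
    exec_logits:  (B, L, C)     executor's per-element value logits at step T;
    gold_output:  (B, L)        the true output list O (as class indices)."""
    l_prog = F.cross_entropy(token_logits.flatten(0, 1), gold_tokens.flatten())
    # Supervision only at the end of decoding: I_hat_T should match O (Eqn. 6).
    l_exec = F.cross_entropy(exec_logits.flatten(0, 1), gold_output.flatten())
    return l_prog + l_exec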

3.4 Data Regeneration and Iterative Retraining

Interestingly, once our model is trained on the initial randomly generated programs $\mathcal{D}_0$, the predicted programs become more concise and resemble human-written code. While the exact token match accuracy is low even on the training set, the model still satisfies the IO pairs for many problems. We leverage this phenomenon to construct a new dataset $\mathcal{D}_1$ with higher-quality programs from $\mathcal{D}_0$. Specifically, we run beam search with the trained model to predict a program $p_{0 \ldots T}$ given the input-output pairs of each training sample. If the model prediction $p_{0 \ldots T}$ satisfies all the input-output examples and held-out cases, we replace the original program $p^{\star}_{0 \ldots T}$ with $p_{0 \ldots T}$ in $\mathcal{D}_1$, and keep $p^{\star}_{0 \ldots T}$ otherwise. Afterward, we re-train the model on $\mathcal{D}_1$. In Sec. 5, we demonstrate that the retraining process further improves the model performance, especially with smaller training datasets.
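A schematic version of this regeneration step is sketched below; model.beam_search, run_program, and the (spec, held-out, program) sample layout are hypothetical placeholders for whatever interfaces an implementation provides.

def regenerate_dataset(model, dataset, run_program, beam_size=8):
    """Sketch of the dataset regeneration step in Sec. 3.4. `model.beam_search`
    and the (spec_pairs, heldout_pairs, gold_program) sample layout are
    illustrative placeholders, not the paper's actual interface."""
    new_dataset = []
    for spec_pairs, heldout_pairs, gold_program in dataset:
        replacement = gold_program
        for candidate in model.beam_search(spec_pairs, beam_size=beam_size):
            if all(run_program(candidate, i) == o
                   for i, o in spec_pairs + heldout_pairs):
                replacement = candidate      # keep the (usually shorter) prediction
                break
        new_dataset.append((spec_pairs, heldout_pairs, replacement))
    return new_dataset

# Iterative retraining: D1 = regenerate_dataset(model_0, D0, run_c_program);
# retrain on D1, then optionally repeat for a second iteration.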

4 Restricted C Program Synthesis Domain

Table 1: The comparison between our restricted C domain and existing programming by example tasks.

                      Control flow   Variables   Arithmetics   No helper functions
Restricted C (Ours)        ✓             ✓            ✓                 ✓
Karel [9]                  ✓             -            -                 -
DeepCoder [6]              -             ✓            ✓                 -
FlashFill [19]             -             -            -                 -

In this section, we discuss our restricted C program synthesis domain, and our operation predictor design for improving the numerical reasoning ability of program synthesis models.

4.1 Data Generation

Collecting large-scale high-quality datasets for program synthesis requires a lot of human effort, and we aim to reduce the manual work needed for dataset construction.

Our data generator is built upon Csmith [54], a random C code generation tool originally designed for finding bugs in compilers. Following the common practice of generating input-output pairs, for each program, we randomly sample 5 numerical lists as the program inputs, and execute the program to obtain the corresponding output lists. This is similar to existing works on PBE problems that sample programs based on a probabilistic context-free grammar, randomly generate valid inputs for the programs and obtain the outputs [40, 15, 6]. This creates infinite samples for synthesizing programs in domain-specific languages. While the programs sampled in this way differ from human-written code, Sec. 3.4 shows that they can be converted to be more concise and human-like.

The subset of language features used. Each generated program has variable declarations, variable assignments, and expressions with addition or subtraction operations. The programs also have non-sequential statements, including If statements, For loops, and Continue and Break statements. Except for the input argument, which is a list, all declared variables are integers, and all program statements are integer manipulations. Each expression has at most 2 mathematical operations, and chaining them in a full C program can perform multi-step numerical calculation (e.g., p0 = p0 - p1 + p2; p0 = p0 - 1;). Looping statements other than For (i.e., While or Do-While loops) are not supported. Note that we only constrain the final program length ($\leq 256$ tokens), and a program can have nested For loops and complicated If conditions.

Post-processing. We perform a few post-processing steps to obtain our final programs from those generated by Csmith (see Fig. 1 for an example). We resample different components of the program, so that (1) each constant numerical value lies in $[-4, 4]$, (2) mathematical operators only contain addition and subtraction, and (3) the upper/lower limits of For loops are positive and within the length of the list. Programs are discarded if they are trivial (e.g., constant or identity mappings), or if the input-output examples include values out of the range $[-4, 4]$.
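For illustration, the following sketch shows how such filters might look; the thresholds follow the description above, but the function itself (keep_sample) is a hypothetical stand-in for our actual post-processing pipeline.

def keep_sample(program_inputs, program_outputs):
    """Hedged sketch of the post-processing filters in Sec. 4.1: drop samples
    whose IO values fall outside [-4, 4] and trivial (constant/identity) maps."""
    flat = [v for lst in program_inputs + program_outputs for v in lst]
    if any(v < -4 or v > 4 for v in flat):
        return False
    # Identity mapping: every output equals its input.
    if all(i == o for i, o in zip(program_inputs, program_outputs)):
        return False
    # Constant mapping: the same output regardless of the input.
    if len({tuple(o) for o in program_outputs}) == 1:
        return False
    return True

print(keep_sample([[1, 2], [0, 3]], [[3, 4], [2, 5]]))  # True: output = input + 2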

Final dataset. We reweight the program distribution so that at least half of the programs include For loops. Our full dataset includes 500K samples in the training set, 1K samples in the validation set, and 1K samples in the test set. As shown in Fig. 1, a randomly sampled program may contain redundant statements, which would be easily avoided by human programmers. We compare our restricted C domain to prior programming by example datasets in Table 1.

4.2 Program Decoding with the Operation Predictor

For the program decoder, predicting the next program token $p_t$ is non-trivial, especially when mathematical reasoning is required [43, 29]. To improve the program synthesis performance for domains involving numerical calculation, such as our restricted C domain, we design an associative memory structure named the operation predictor (Fig. 2(d)), based on the following intuition: given the input $I = 2$ and output $O = 4$, a human would infer that "$O = I + 2$" might be the desired operation and write down the code accordingly. To materialize this intuition, we create a pre-computed table that covers all possible integer addition and subtraction operations for valid input and output list values. We defer the details of the model architecture to Appendix B.2. The program decoding process remains similar to the one described in Sec. 3, and we highlight the key differences as follows.

The operation predictor takes the two attention vectors $s_t^I$ and $s_t^O$ as the representations of the input-output examples, and yields an operator embedding $\hat{op}_t$. To compute the aggregated embedding vector for all input-output examples, we modify Eqn. 2 so that the max pooling layer also takes $\hat{op}_t$ as an input:

$m_t = \mathrm{MaxPool}_{j \in \{1, 2, ..., K\}}(\mathrm{tanh}(W[s_t^{I(j)}; s_t^{O(j)}; \hat{op}_t^{(j)}]))$   (7)

To train the operation predictor, we add an additional loss $\mathcal{L}_{Op}$:

$\mathcal{L} = \mathcal{L}_{Prog} + \mathcal{L}_{Exec} + \mathcal{L}_{Op}$   (8)

$\mathcal{L}_{Op}$ is designed to ensure that the operation predictor predicts operations related to the IO pairs, and we defer the details to Appendix B.2.

Limitations. In our current implementation of the operation predictor, the operation table is only able to enumerate the arithmetic operations over a pre-defined constant set, thus it requires that the set of possible numerical values in input-output pairs is finite. One way of extending our operation predictor to support potentially unbounded numerical calculation is to combine it with the subword tokenizer, which has been commonly used in recent language models [16, 11, 4]. We consider designing general-purpose number representation for better mathematical reasoning as future work.

5 Experiments

In this section, we discuss our results on synthesizing programs in Karel and C languages. We first show that LaSynth achieves competitive performance on Karel benchmark. Then we present the results on our restricted C benchmark, and demonstrate that our approach significantly outperforms existing neural program synthesis models. Finally, we discuss the effect of iterative retraining.

Table 2: The comparison between LaSynth and baseline neural program synthesis models in our evaluation.

                        LaSynth   Exec [12]   Shin et al. [44]   Bunel et al. [9]   RobustFill [17]   Property Signatures [39]
Program execution          ✓          ✓              ✓                  -                  -                    -
No interpreter needed      ✓          -              -                  ✓                  ✓                    ✓

5.1 Karel Program Synthesis

5.1.1 Evaluation Setup

Karel domain. Karel is an educational programming language [41], and has been studied in recent works on neural program synthesis from input-output examples [15, 9, 12, 44]. A Karel program controls a robot in a 2D grid world. There are instructions that control the robot, e.g., move, turnLeft and putMarker, as well as conditionals and loops, i.e., if, repeat and while. See Appendix A for the grammar specification and the state representation.

We train and evaluate all models on the Karel dataset introduced in [9]. The dataset contains randomly sampled programs from the Karel DSL (1.1M training samples, 2.5K validation samples, and 2.5K test samples). Each program includes 5 input-output pairs as the specification, and a sixth pair as the held-out test case. Following the prior work, we evaluate two metrics: (1) Exact Match: the predicted program is the same as the ground truth; (2) Generalization: the predicted program satisfies both the input-output pairs and the held-out input-output test case.

Baselines. Bunel et al. [9] designed the first program synthesis model for the Karel benchmark with a similar high-level design as RobustFill, but they use convolutional neural networks (CNN) to encode the Karel grid maps. Compared to LaSynth, this model does not utilize any program execution information, and does not include our latent executor. Instead of directly synthesizing the program from input-output examples, the model in Shin et al. [44] first predicts the execution traces containing the robot actions from the input-output pairs, then decodes the program based on the execution traces. This model improves the prediction performance over Bunel et al., but it requires the full execution traces for model training and an interpreter for execution. Exec [12] leverages the execution states of partial generated programs to guide the subsequent synthesis process, but the execution states are obtained from the Karel interpreter rather than learned by the model, thus this approach represents the ideal scenario where the partial programs could be executable.

Our model architecture for Karel is largely similar to the model for C code synthesis, except that we employ the CNN encoder from Bunel et al. [9] in our program decoder and latent executor. The comparison with the baseline models is shown in the middle block of Table 2. All models use beam search for decoding programs, with a beam size of 64.

5.1.2 Results

Table 3: Results on Karel dataset. Gen and Exact denote generalization and exact match accuracies.
Approach Gen Exact
LaSynth 83.68% 41.12%
Exec [12] 86.04% 39.40%
Bunel et al. [9] 77.12% 32.17%
Shin et al. [44] 81.30% 42.80%
Figure 3: Generalization accuracies with different training data sizes on Karel. With the full training set, the accuracies are 86.04%, 89.28% and 89.36% for training on random programs, and retraining for 1 and 2 iterations, respectively.
Figure 4: Program distributions after iterative retraining on Karel. (a) The distributions of different program types. Seq-only: no control flow. If-only: the program includes If statements but no loops. Repeat/While-only: the program includes Repeat/While loops, but no other control flow constructs. Mixture: the program includes at least two types of control flow constructs. (b) The distributions of programs with different token lengths.

We present the results of LaSynth and the baseline model architectures in Table 3. First, LaSynth outperforms all baselines that do not incorporate partial program execution information, and achieves competitive performance compared with the Exec algorithm, which requires an interpreter to obtain the partial program execution states. In particular, LaSynth achieves a higher generalization accuracy than Shin et al. with a lower exact match accuracy, showing that the programs decoded by LaSynth differ more from the randomly generated ground truth. Although Shin et al. also model the program execution by predicting the robot actions, the prediction of the action traces does not take the program structure into account, resulting in inferior performance.

5.2 C Code Synthesis

5.2.1 Evaluation Setup

Given the variety of C programs, we observe that the exact match accuracies of the models are mostly near 0. Therefore, we focus on evaluating the generalization accuracy, and we consider a predicted program to be correct when it satisfies both the 5 input-output examples and the 5 held-out test cases.

Baselines. We compare the full LaSynth with its multiple ablated versions:

  • NoExecutor. The program decoder (Eqn. 1) always takes the initial input-output pairs as the input; i.e., $\hat{I}_t = I_0$ for every $t$.

  • NoPartialExecutor. $\hat{I}_t = I_0 = I$ for every $t$, and additionally $h_T$ is regularized so that $\mathrm{LatentExecutor}(I_0, h_T)$ matches the output $O$ under the loss $\mathcal{L}_{Exec}$. Therefore, there is no partial latent execution.

  • NoOpPredictor. The max pooling layer only takes the vectors computed by the double attention as the input (Eqn. 2).

  • NoAttentionInDecoding. There is no attention over decoded program tokens, and the output of the max pooling layer is directly fed into the output softmax layer; i.e., $\mathbb{P}[p_t] = \mathrm{Softmax}(V m_t)_{p_t}$ (compared to Eqn. 3).

We also compare with existing neural program synthesis models with good performance on related tasks, as shown in the rightmost block of Table 2. RobustFill [17] is the state-of-the-art neural network architecture on FlashFill benchmark, which synthesizes string manipulation programs in a domain-specific language. As described in Sec. 3, the input-output encoder and the program decoder architectures in RobustFill are similar to LaSynth, except that it does not include the latent executor, operation predictor, and the attention on the decoded program sequence.

Property Signatures [39] was designed for synthesizing list manipulation programs in domain-specific languages, but instead of taking the raw input and output lists as the neural network input, it designs properties that distinguish different programs, and takes the values of these properties as the model input. A sample property could be whether the program output is the same as the input, and the property values could be "All True", "All False", or "Mixed", indicating that the property always holds for any input-output pair in the specification, never holds, or holds for some pairs but not others, respectively. We customize the original design [39] for our setting. First, our property set takes the forms $O = C + I?$ and $O = C - I?$, where $C \in [-4, 4]$. For example, $O = 2 + I?$ asks whether the output $O$ can be calculated by adding $2$ to the input $I$. These properties focus more on numerical calculation, similar to our operation predictor. Second, different from the task in [39], our C programs sometimes manipulate only a subset of the input list, so encoding the list with a single property value is inappropriate. Instead, we compute the property value per element of the input-output pairs, use a bi-directional LSTM to encode the property values as a sequence, and then take the outputs of the bi-LSTM for program prediction.
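The per-element property computation can be sketched as follows; the feature layout (9 addition properties followed by 9 subtraction properties per element) is an illustrative choice rather than necessarily the exact encoding we use.

def element_properties(inp, out):
    """Hedged sketch of the per-element property values used in our adapted
    Property Signatures baseline (Sec. 5.2.1). For each aligned element pair
    (i, o) we record which of the properties 'o == c + i?' and 'o == c - i?'
    hold, for every constant c in [-4, 4]."""
    feats = []
    for i, o in zip(inp, out):
        add = [int(o == c + i) for c in range(-4, 5)]   # O = C + I ?
        sub = [int(o == c - i) for c in range(-4, 5)]   # O = C - I ?
        feats.append(add + sub)                          # 18 binary features per element
    return feats   # fed as a sequence into a bi-directional LSTM

print(element_properties([1, 2], [3, 4]))  # every element satisfies O = 2 + I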

5.2.2 Results

int * func_1(int a[])
{
int p_0 = 0;
int l_25 = 4;
a[p_0] = 1;
a[l_25];
return a;
}
int * func_1(int a[])
{
int p_0 = 2;
int l_12 = 3;
for (p_0 = 1; p_0 <= 2; p_0++)
{
a[p_0]--;
}
a[l_12] = a[l_12] + 4;
return a;
}
int * func_1(int a[])
{
int p_0 = 0;
int l_7 = 3;
int l_8 = 1;
a[l_8] = (a[l_7] - a[p_0]);
for (p_0 = 3; p_0 <= 4; p_0++)
{
for (int p_1 = 1; p_1 <= 2; p_1++)
{
a[p_1] = a[p_1] + a[p_0];
a[p_1] = a[p_1] + 2;
}
}
return a;
}
Figure 5: Sample programs that could be correctly predicted by LaSynth, but wrongly predicted by models without the latent executor. These programs require multiple different operations for different input list elements.
Table 4: Results on C dataset.
Approach Accuracy
LaSynth 55.2%
NoAttentionInDecoding 53.5%
NoOpPredictor 53.7%
NoPartialExecutor 42.9%
NoExecutor 38.6%
RobustFill [17] 37.6%
Property Signatures [39] 34.5%
Figure 6: Accuracies of different program types on C dataset.
Figure 7: Results of iterative retraining on the C dataset. (a) Accuracies with different training data sizes. With the full training set, the accuracies are 55.2%, 56.0% and 56.5% for training on random programs, and retraining for 1 and 2 iterations, respectively. (b) The program distributions after each retraining iteration.
Figure 8: Results on programs of different token lengths on the C dataset. (a) The program token length distributions after each retraining iteration. (b) The accuracies on programs of different token lengths.

Table 4 presents the results, where all models are trained on the initial random programs. The full LaSynth outperforms the other variants, and improves the performance of RobustFill and Property Signatures by around 20%. We also increased the model size of RobustFill to check whether the improvement comes from a larger model, but the results are not better. In particular, the latent executor significantly increases the prediction accuracy and achieves better results than NoPartialExecutor, which shows that learning latent execution traces leads to better partial program representations. In Fig. 5, we present sample programs that are correctly synthesized by LaSynth but predicted wrongly by models without the latent executor. We observe that the latent executor is beneficial when the program involves different manipulations of different list elements, e.g., more than one For loop and different mathematical calculations. Our breakdown results on programs of different complexity also support this observation. We first present the results on programs with different control flow constructs in Fig. 6. Specifically, Seq-only includes programs with no control flow constructs, For-only includes programs with For loops but no If statements, and Mixture includes programs with both For loops and If statements. We then present the results for different program lengths in Fig. 8(b). LaSynth achieves decent performance on long and complicated programs, while the accuracies of the baseline models drop dramatically.

5.3 Discussion of Iterative Retraining

In Fig. 3, we show the effectiveness of retraining on decoded Karel programs (Sec. 3.4). We observe that retraining for one iteration is sufficient, and it significantly improves the generalization accuracy by over 3%. To understand the differences between predicted programs and randomly generated programs, we show the changes in the dataset distribution after each retraining iteration in Fig. 4(a) and 4(b). We observe that the model learns to predict more concise programs than the ground truth for a large proportion of input-output examples, and considerably alters the dataset distribution so that it becomes more concentrated on short programs with simplified control flow structures. Specifically, from Fig. 4(a), although the initial Karel dataset seems to include a large proportion of complicated programs with different control flow constructs, our model synthesizes straight-line programs for nearly half of the samples, which means that many loops and branches in the annotated ground truth programs are unnecessary. This distribution shift also explains the gap between the exact match and generalization accuracies. The program distribution after the second retraining iteration is largely similar to that of the first iteration, so retraining for more iterations does not considerably improve the performance. Note that in the second iteration, the synthesizer tends to generate slightly more complicated programs than in the first iteration, in order to deal with cases where the input-output examples oversimplify the intended program functionality. For example, sometimes the input-output examples do not cover the edge cases that the robot may encounter, and adding additional If branches avoids crashes on the held-out cases.

Fig. 7(a) presents the results of retraining on decoded C programs. Similarly, retraining improves the prediction accuracy, especially when the training set is small. From Fig. 7(b) and 8(a), we again observe that the model tends to predict shorter programs than the random code, and it eliminates unnecessary control flows to simplify the programs. We present more examples in Appendix D.

6 Related Work

Programming by example. Programming by example problems have been widely studied with various applications, and recent works have developed deep neural networks as program synthesizers [20, 40, 17, 9]. Most prior works focus on synthesizing programs in domain-specific languages, such as FlashFill [40, 17, 51] for string transformation, Karel [9, 44, 12, 21] for simulated robot navigation, and LISP-style languages for list manipulation [6, 25, 57, 36]. In this work, we make the first attempt of synthesizing C code in a restricted domain from input-output examples only, and we focus on the list manipulation domain.

Some recent works investigate the limitations of synthetic datasets and the ambiguity of program specifications for neural program synthesis [45, 13, 46, 28]. These works focus on reducing the bias of data distributions and generating more diverse input-output pairs, while our data regeneration aims to improve the quality of programs. We consider incorporating both lines of work to further improve the dataset quality as future work. In addition, drawing inspiration from self-training and bootstrapping techniques developed for other applications [35, 1, 32, 52] to extend our iterative retraining scheme is another future direction.

Execution-guided program synthesis. To learn better program representations, some recent works incorporate the execution information to guide the synthesis process [47, 57, 44, 12, 18, 48, 7, 21, 38, 37, 31]. In particular, leveraging partial program execution states improves the performance for several program synthesis tasks [12, 57, 18, 37]. However, existing approaches rely on program interpreters to provide the intermediate execution results whenever applicable. In contrast, we demonstrate that our latent executor learns the latent execution traces (LaET) without such a requirement. Besides program synthesis, execution traces have also been utilized for other software engineering applications [2, 33].

Neural execution. Our latent executor is related to prior works on learning to execute algorithms [55, 50, 53] and programs [8]. They focus on predicting execution results for full algorithms and programs, but do not utilize them for program synthesis. Latent state prediction has also been studied in other applications such as task-oriented dialogue systems [34, 56] and robotics [42].

7 Conclusion

In this work, we propose LaSynth, which learns a latent representation to approximate the execution of partial programs, even if their semantics are not well-defined. We demonstrate the possibility of synthesizing elementary C code from input-output examples only, and leveraging learned execution significantly improves the prediction performance by around 20%. Meanwhile, compared to the randomly generated programs, LaSynth synthesizes more concise programs that resemble human-written code, and training on these synthesized programs further improves the prediction performance for both Karel and C program synthesis. Our results indicate the promise of leveraging the learned program synthesizer to improve the dataset quality for programming by example tasks.

We consider extending our approach to synthesize more complicated real-world code as future work. For example, we will integrate our latent executor into large-scale pre-trained language models, which could further improve the performance of those program synthesis models taking natural language specifications. We will also study program synthesis problems with unbounded input ranges and different type signatures, which could be approached with the usage of subword tokenizers.

References

  • [1] S. Abney. Bootstrapping. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 360–367, 2002.
  • [2] M. Alam, J. Gottschlich, N. Tatbul, J. S. Turek, T. Mattson, and A. Muzahid. A zero-positive learning approach for diagnosing software performance regressions. Advances in Neural Information Processing Systems, 32:11627–11639, 2019.
  • [3] F. Alet, J. Lopez-Contreras, J. Koppel, M. Nye, A. Solar-Lezama, T. Lozano-Perez, L. Kaelbling, and J. Tenenbaum. A large-scale benchmark for few-shot program induction and synthesis. In International Conference on Machine Learning, pages 175–186. PMLR, 2021.
  • [4] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
  • [5] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
  • [6] M. Balog, A. L. Gaunt, M. Brockschmidt, S. Nowozin, and D. Tarlow. Deepcoder: Learning to write programs. In International Conference on Learning Representations, 2017.
  • [7] M. Balog, R. Singh, P. Maniatis, and C. Sutton. Neural program synthesis with a differentiable fixer. arXiv preprint arXiv:2006.10924, 2020.
  • [8] D. Bieber, C. Sutton, H. Larochelle, and D. Tarlow. Learning to execute programs with instruction pointer attention graph neural networks. In Advances in Neural Information Processing Systems, 2020.
  • [9] R. Bunel, M. Hausknecht, J. Devlin, R. Singh, and P. Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. In International Conference on Learning Representations, 2018.
  • [10] B. Campbell. An executable semantics for compcert c. In International Conference on Certified Programs and Proofs, pages 60–75. Springer, 2012.
  • [11] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  • [12] X. Chen, C. Liu, and D. Song. Execution-guided neural program synthesis. In International Conference on Learning Representations, 2019.
  • [13] J. Clymo, H. Manukian, N. Fijalkow, A. Gascón, and B. Paige. Data generation for neural programming by example. In International Conference on Artificial Intelligence and Statistics, pages 3450–3459. PMLR, 2020.
  • [14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
  • [15] J. Devlin, R. Bunel, R. Singh, M. Hausknecht, and P. Kohli. Neural program meta-induction. In Advances in Neural Information Processing Systems, 2017.
  • [16] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [17] J. Devlin, J. Uesato, S. Bhupatiraju, R. Singh, A. Mohamed, and P. Kohli. Robustfill: Neural program learning under noisy I/O. In International Conference on Machine Learning, 2017.
  • [18] K. Ellis, M. I. Nye, Y. Pu, F. Sosa, J. B. Tenenbaum, and A. Solar-Lezama. Write, execute, assess: Program synthesis with a repl. In Advances in Neural Information Processing Systems, 2019.
  • [19] S. Gulwani. Automating string processing in spreadsheets using input-output examples. In ACM SIGPLAN Notices, volume 46, pages 317–330. ACM, 2011.
  • [20] S. Gulwani, W. R. Harris, and R. Singh. Spreadsheet data manipulation using examples. Communications of the ACM, 55(8):97–105, 2012.
  • [21] K. Gupta, P. E. Christensen, X. Chen, and D. Song. Synthesize, execute and debug: Learning to repair for neural program synthesis. In Advances in Neural Information Processing Systems, 2020.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [23] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, et al. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938, 2021.
  • [24] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [25] I. Polosukhin and A. Skidanov. Neural program search: Solving data processing tasks from description and examples, 2018.
  • [26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
  • [27] S. Kulal, P. Pasupat, K. Chandra, M. Lee, O. Padon, A. Aiken, and P. Liang. Spoc: Search-based pseudocode to code. In Advances in Neural Information Processing Systems, 2019.
  • [28] L. Laich, P. Bielik, and M. Vechev. Guiding program synthesis by learning to generate examples. In International Conference on Learning Representations, 2019.
  • [29] G. Lample and F. Charton. Deep learning for symbolic mathematics. In International Conference on Learning Representations, 2020.
  • [30] M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015.
  • [31] S. Mandal, T. Anderson, J. Turek, J. Gottschlich, S. Zhou, and A. Muzahid. Learning fitness functions for machine programming. Proceedings of Machine Learning and Systems, 3, 2021.
  • [32] D. McClosky, E. Charniak, and M. Johnson. Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 152–159, 2006.
  • [33] C. Mendis, A. Renda, S. Amarasinghe, and M. Carbin. Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks. In International Conference on machine learning, pages 4505–4515. PMLR, 2019.
  • [34] Q. Min, L. Qin, Z. Teng, X. Liu, and Y. Zhang. Dialogue state induction using neural latent variable models. In International Joint Conferences on Artificial Intelligence, 2020.
  • [35] C. Z. Mooney, C. F. Mooney, C. L. Mooney, R. D. Duval, and R. Duvall. Bootstrapping: A nonparametric approach to statistical inference. Number 95. sage, 1993.
  • [36] M. Nye, L. Hewitt, J. Tenenbaum, and A. Solar-Lezama. Learning to infer program sketches. In International Conference on Machine Learning, pages 4861–4870. PMLR, 2019.
  • [37] M. Nye, Y. Pu, M. Bowers, J. Andreas, J. B. Tenenbaum, and A. Solar-Lezama. Representing partial programs with blended abstract semantics. In International Conference on Learning Representations, 2021.
  • [38] A. Odena, K. Shi, D. Bieber, R. Singh, and C. Sutton. Bustle: Bottom-up program-synthesis through learning-guided exploration. In International Conference on Learning Representations, 2021.
  • [39] A. Odena and C. Sutton. Learning to represent programs with property signatures. In International Conference on Learning Representations, 2020.
  • [40] E. Parisotto, A.-r. Mohamed, R. Singh, L. Li, D. Zhou, and P. Kohli. Neuro-symbolic program synthesis. In International Conference on Learning Representations, 2017.
  • [41] R. E. Pattis. Karel the robot: a gentle introduction to the art of programming. John Wiley & Sons, Inc., 1981.
  • [42] C. Paxton, Y. Bisk, J. Thomason, A. Byravan, and D. Foxl. Prospection: Interpretable plans from language by predicting the future. In 2019 International Conference on Robotics and Automation (ICRA), pages 6942–6948. IEEE, 2019.
  • [43] D. Saxton, E. Grefenstette, F. Hill, and P. Kohli. Analysing mathematical reasoning abilities of neural models. 2019.
  • [44] E. C. Shin, I. Polosukhin, and D. Song. Improving neural program synthesis with inferred execution traces. In Advances in Neural Information Processing Systems, pages 8917–8926, 2018.
  • [45] R. Shin, N. Kant, K. Gupta, C. Bender, B. Trabucco, R. Singh, and D. Song. Synthetic datasets for neural program synthesis. In International Conference on Learning Representations, 2019.
  • [46] A. Suh and Y. Timen. Creating synthetic datasets via evolution for neural program synthesis. arXiv preprint arXiv:2003.10485, 2020.
  • [47] S.-H. Sun, H. Noh, S. Somasundaram, and J. Lim. Neural program synthesis from diverse demonstration videos. In Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [48] Y. Tian, A. Luo, X. Sun, K. Ellis, W. T. Freeman, J. B. Tenenbaum, and J. Wu. Learning to infer and execute 3d shape programs. In International Conference on Learning Representations, 2019.
  • [49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
  • [50] P. Veličković, R. Ying, M. Padovano, R. Hadsell, and C. Blundell. Neural execution of graph algorithms. In International Conference on Learning Representations, 2020.
  • [51] A. J. Vijayakumar, A. Mohta, O. Polozov, D. Batra, P. Jain, and S. Gulwani. Neural-guided deductive search for real-time program synthesis from examples. In International Conference on Learning Representations, 2018.
  • [52] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.
  • [53] Y. Yan, K. Swersky, D. Koutra, P. Ranganathan, and M. Heshemi. Neural execution engines: Learning to execute subroutines. In Advances in Neural Information Processing Systems, 2020.
  • [54] X. Yang, Y. Chen, E. Eide, and J. Regehr. Finding and understanding bugs in c compilers. In Proceedings of the 32nd ACM SIGPLAN conference on Programming language design and implementation, pages 283–294, 2011.
  • [55] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
  • [56] Y. Zhang, Z. Ou, M. Hu, and J. Feng. A probabilistic end-to-end task-oriented dialog model with latent belief states towards semi-supervised learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
  • [57] A. Zohar and L. Wolf. Automatic program synthesis of long programs with a learned garbage collector. In Advances in Neural Information Processing Systems, 2018.

Appendix A More Descriptions of the Karel Domain

We present the full grammar of the Karel language in Fig. 9.

To represent the execution states, each Karel grid world has a maximum size of $18 \times 18$, and each cell in the grid world is represented by a $16$-dimensional vector corresponding to the features in Table 5. Therefore, each grid world is represented as a $16 \times 18 \times 18$ tensor.

Prog p   ::= def run() : s
Stmt s   ::= while(b) : s | repeat(r) : s | s1 ; s2 | a
           | if(b) : s | ifelse(b) : s1 else : s2
Cond b   ::= frontIsClear() | leftIsClear() | rightIsClear()
           | markersPresent() | noMarkersPresent() | not b
Action a ::= move() | turnRight() | turnLeft()
           | pickMarker() | putMarker()
Cste r   ::= 0 | 1 | ... | 19
Figure 9: Grammar for the Karel task.
Robot facing North
Robot facing East
Robot facing South
Robot facing West
Obstacle
Grid boundary
1 marker
2 markers
3 markers
4 markers
5 markers
6 markers
7 markers
8 markers
9 markers
10 markers
Table 5: Representation of each cell in the Karel state.

Appendix B Details in Model Architecture

B.1 Program Decoder

Our model follows the encoder-decoder framework in prior work on neural program synthesis from input-output examples [17, 9], which includes an encoder for the input-output pairs, and a decoder to synthesize the program.

The program decoder is an LSTM (denoted as $\mathrm{LSTM}_D$), which decodes the program as a token sequence. Let $p_{t-1}$ be the decoded program token at step $t-1$, $E_p(p_{t-1})$ be the embedding vector of $p_{t-1}$, $h_{t-1}$ be the hidden state of the program decoder at step $t-1$, and $\hat{I}_{t-1}$ and $O$ be the sequences of vectors representing the input and output list elements, respectively. We first compute attention vectors over both the input and output lists, following the double attention mechanism in RobustFill:

$s_t^O = \mathrm{Attention}(h_{t-1}, O), \quad s_t^I = \mathrm{Attention}([h_{t-1}; s_t^O], \hat{I}_{t-1})$

The notation $[a; b]$ means the concatenation of vectors $a$ and $b$. Then we calculate the output vector of the program decoder at step $t$ as $h_t = \mathrm{LSTM}_D(h_{t-1}, [E_p(p_{t-1}); s_t^I; s_t^O])$.

Input-Output Encoder.

For C program synthesis, our input-output encoder architecture is similar to RobustFill [17]. For each input-output pair, we use two bi-directional LSTMs [24] to encode the input and output lists respectively. To capture the relationship between the input and output lists, the output list encoder computes attention vectors over the input list elements, using the standard attention mechanism [5, 30]. We also employ different encoder architectures for program synthesis tasks with other formats of input-output examples, as discussed in Sec. 5.

To capture the arithmetic operation required to convert the program input into the output, we also include the output of the operation predictor $\hat{op}_t$ for program decoding, and we discuss the details later. Afterwards, the max pooling layer aggregates the representations of the different IO pairs to generate a single vector:

$m_t = \mathrm{MaxPool}_{j \in \{1, 2, ..., K\}}(\mathrm{tanh}(W[s_t^{I(j)}; s_t^{O(j)}; \hat{op}_t^{(j)}]))$

Here the superscript $(j)$ indicates that the representation is for the $j$-th IO pair, and $W$ is a trainable weight matrix.

To facilitate the prediction of long programs, we compute an attention vector over previously generated program tokens as follows:

$d_t = \mathrm{Attention}(m_t, \{E_p(p_0), E_p(p_1), ..., E_p(p_{t-1})\})$

Finally, the next token $p_t$ is sampled from $\mathbb{P}[p_t] = \mathrm{Softmax}(V d_t)_{p_t}$, where $V$ is a trainable matrix.

B.2 Operation Predictor for Restricted C Domain

Training neural networks for mathematical reasoning is a challenging problem itself [43, 29], and jointly predicting the mathematical calculations and other program operations imposes an extra burden on the program decoder. To mitigate the difficulty, we include a pre-computed table as part of the model input, which describes possible mathematical operations that derive an output value from a given input number. For example, Fig. 2(d) shows that applying the operation $O = 2 + I$ to the input $I = 2$ yields the output $O = 4$. For each valid input list value $C$, we include two operations $O = C + I$ and $O = C - I$ in the table. Then for each operation $O = C + I$, we enumerate all valid integer list values $I$, and we include the row $[O = C + I, I, O]$ in the table when $O$ is also within our bounded range. In this way, the table covers all possible integer addition and subtraction operations for valid input and output list values.
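The construction of this table can be sketched as follows; build_operation_table and the string encoding of the operations are illustrative, but the enumeration mirrors the description above (operations O = C + I and O = C - I for C, I, O all within [-4, 4]).

def build_operation_table(value_range=range(-4, 5)):
    """Hedged sketch of the pre-computed table in Appendix B.2: rows of the
    form (operation, input value I, output value O), covering every addition
    and subtraction 'O = C + I' / 'O = C - I' whose values stay in range."""
    table = []
    for c in value_range:
        for i in value_range:
            for name, o in ((f"O={c}+I", c + i), (f"O={c}-I", c - i)):
                if o in value_range:
                    table.append((name, i, o))
    return table

table = build_operation_table()
print(len(table))     # number of (operation, I, O) rows
print(table[:3])      # first few (operation, I, O) rows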

With the pre-computed table, the operation predictor aims to predict the most probable program operation at the next step. First, we re-use the same embedding matrices as those in the input-output encoder, and compute the embedding vectors for each numerical element in the table. Let $R$ be the number of table rows. For the $i$-th row, we refer to the embedding vectors of the input and output values as $r^{[i]}$ and $c^{[i]}$, respectively. Then we utilize $s_t^I$ and $s_t^O$ to compute the attention weights over the table columns of input and output values as follows:

$wi_t^{[i]} = \mathrm{AttentionWeight}(s_t^I, \{r^{[i]} \mid i \in \{1, 2, ..., R\}\})$
$wo_t^{[i]} = \mathrm{AttentionWeight}(s_t^O, \{c^{[i]} \mid i \in \{1, 2, ..., R\}\})$

Let $op^{[i]}$ be the operation in row $i$; then the probability of selecting the operation in the $i$-th row at step $t$ is

$\mathbb{P}[op_t = op^{[i]}] \propto wi_t^{[i]} \cdot wo_t^{[i]}$

Let $E_{op}(op)$ be the embedding vector of the operation $op$; then the operation predictor output is

$\hat{op}_t = \sum_i \mathbb{P}[op_t = op^{[i]}] E_{op}(op^{[i]})$

To train the operation predictor, we provide the training supervision at step 0, when no transformation has been applied to the program input:

$\mathcal{L}_{Op} = \mathrm{Loss}(wi_0^{[i]}, \mathbbm{1}[r^{[i]} = \hat{I}_0 = I]) + \mathrm{Loss}(wo_0^{[i]}, \mathbbm{1}[c^{[i]} = O])$   (9)
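A hedged sketch of the forward pass is shown below, using a plain dot-product softmax for AttentionWeight; the paper does not commit to this exact attention form, so treat it as one possible instantiation with illustrative names and sizes.

import torch
import torch.nn.functional as F

def operation_predictor(s_I, s_O, row_in_emb, row_out_emb, op_emb):
    """Hedged sketch of the operation predictor (Appendix B.2).
    s_I, s_O:                (B, H) attention summaries of input / output;
    row_in_emb, row_out_emb: (R, H) embeddings of each row's I and O values;
    op_emb:                  (R, H) embeddings of each row's operation."""
    wi = F.softmax(s_I @ row_in_emb.t(), dim=-1)    # attention over the input column
    wo = F.softmax(s_O @ row_out_emb.t(), dim=-1)   # attention over the output column
    p_op = wi * wo
    p_op = p_op / p_op.sum(dim=-1, keepdim=True)    # P[op_t = op^[i]] ∝ wi * wo
    return p_op @ op_emb                            # expected operation embedding op_hat_t

B, R, H = 2, 162, 512   # R is illustrative; the real table size depends on the value range
out = operation_predictor(torch.randn(B, H), torch.randn(B, H),
                          torch.randn(R, H), torch.randn(R, H), torch.randn(R, H))
print(out.shape)        # torch.Size([2, 512])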

B.3 Latent Executor

In RobustFill, the encoder only takes the initial input-output pairs as the input. On the other hand, in recent work on execution-guided program synthesis [12, 47, 57, 18, 38, 37], the execution states of partial programs are leveraged as the model input to guide the subsequent program prediction. However, existing approaches mostly assume that the programs are sequential [57, 18], or require an interpreter of partial programs [12]. To address these limitations, Nye et al. design neural networks to represent the partial program semantics when they are not well-defined [37]. However, they need to train a separate neural module to represent each program operation, thus it is hard to scale beyond domain-specific languages.

In this work, we include another LSTM to approximate the program execution states, denoted as $\mathrm{LSTM}_{E}$. Let $\hat{I}_{t-1}$ be the input of $\mathrm{LSTM}_{E}$, which is the program input at step $t-1$. The output of $\mathrm{LSTM}_{E}$ is:

\mathrm{Exec}_{t}=\mathrm{LSTM}_{E}(h_{t},\hat{I}_{t-1})

B.3.1 Implementation for Restricted C Domain

For our restricted C domain, the length of $\mathrm{Exec}_{t}$ is the same as that of $\hat{I}_{t-1}$, i.e., the input list length. Let $L$ be the length of the input and output lists, and let $\mathbb{P}[I_{t}=v]$ be the probability that the execution result at step $t$ is $v$; then:

\mathbb{P}[I_{t,l}=v_{l}]=\mathrm{Softmax}(W_{E}\mathrm{Exec}_{t,l})_{v_{l}}

Here the subscript $l$ denotes that the representation is for the $l$-th list element, and $W_{E}$ is a trainable weight matrix.

Finally, the approximated execution state $\hat{I}_{t}$ is the weighted sum of the embedding vectors of all possible program input integers $c\in[-4,4]\cap\mathbb{Z}$ (where $\mathbb{Z}$ is the set of all integers):

\hat{I}_{t,l}=\sum_{c\in[-4,4]\cap\mathbb{Z}}\mathbb{P}[I_{t,l}=c]E_{io}(c)

Here $E_{io}(c)$ denotes the embedding vector of the list value $c$. At the next program decoding step, $\hat{I}_{t}$ is fed into the encoder in place of the previous input list $\hat{I}_{t-1}$.
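
Putting these pieces together, a minimal PyTorch sketch of the latent executor for the restricted C domain is shown below (module and tensor names are illustrative; we also assume here that the decoder state $h_{t}$ conditions $\mathrm{LSTM}_{E}$ through its initial hidden state, which is one possible realization of the equation above):

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentExecutor(nn.Module):
    """Approximates the execution state of the partial program at each decoding step."""
    def __init__(self, hidden_size, emb_size, num_values=9):   # list values in [-4, 4]
        super().__init__()
        self.lstm_e = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.W_E = nn.Linear(hidden_size, num_values)   # trainable matrix W_E
        self.E_io = nn.Embedding(num_values, emb_size)  # value embeddings E_io(c)

    def forward(self, h_t, I_hat_prev):
        # h_t: (1, hidden) decoder state at step t
        # I_hat_prev: (L, emb) approximate execution state \hat{I}_{t-1}
        h0 = h_t.unsqueeze(0)                 # use h_t as the initial hidden state of LSTM_E
        c0 = torch.zeros_like(h0)
        exec_t, _ = self.lstm_e(I_hat_prev.unsqueeze(0), (h0, c0))    # (1, L, hidden)
        probs = F.softmax(self.W_E(exec_t.squeeze(0)), dim=-1)        # P[I_{t,l} = c]
        I_hat_t = probs @ self.E_io.weight    # weighted sum of value embeddings: \hat{I}_t
        return I_hat_t, probs                 # \hat{I}_t is fed back to the encoder next step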

B.3.2 Implementation for Karel Domain

Similar to our restricted C domain, in our latent executor implementation for the Karel domain, $\hat{I}_{t,l}$ is also the weighted sum of all possible execution states. As described in Appendix A, each Karel state describes the following variables: (1) $(\text{robot}_{X},\text{robot}_{Y})$ denotes the position of the Karel robot, where $0\leq\text{robot}_{X},\text{robot}_{Y}<18$; (2) $\text{robot}_{dir}\in\{\text{North},\text{South},\text{West},\text{East}\}$ denotes the robot orientation at $(\text{robot}_{X},\text{robot}_{Y})$; and (3) the number of markers in each grid cell. Therefore, we train 3 predictors on top of $\mathrm{LSTM}_{E}$ to predict these variables: (1) a trainable layer that outputs an $(18\times 18)$-dimensional vector, representing the robot position; (2) a trainable layer that outputs a 4-dimensional vector, representing the robot orientation; and (3) an LSTM that generates an 11-dimensional vector at each step, representing the number of markers in each grid cell. We apply the softmax to all output vectors to obtain the probability distributions of the different variables.

Afterward, we combine the outputs of the predictors to construct a $(16\times 18\times 18)$-dimensional vector representing the Karel state, according to Table 5, with the value of each dimension in $[0,1]$. Note that Karel programs cannot change the grid boundary and obstacles, thus we apply a mask on the predicted intermediate execution states to ensure that the features representing the grid boundary and obstacles remain the same; these are the last 2 dimensions described in Table 5.
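
A sketch of how the three predictors and the mask could be combined is shown below (hypothetical module and tensor names; the per-cell marker LSTM is simplified here, and the plane ordering of the assembled state is an assumption apart from the boundary/obstacle planes being last, as stated above):

import torch
import torch.nn as nn
import torch.nn.functional as F

class KarelStatePredictor(nn.Module):
    """Predicts an approximate Karel state distribution from the latent executor output."""
    def __init__(self, hidden_size, grid=18, num_dirs=4, max_markers=10):
        super().__init__()
        self.grid = grid
        self.pos_head = nn.Linear(hidden_size, grid * grid)         # robot position over 18x18 cells
        self.dir_head = nn.Linear(hidden_size, num_dirs)            # robot orientation (N/S/W/E)
        self.marker_lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.marker_head = nn.Linear(hidden_size, max_markers + 1)  # 11-way marker count per cell

    def forward(self, exec_t, boundary_obstacle_feats):
        # exec_t: (hidden,) latent executor output at step t
        # boundary_obstacle_feats: (2, 18, 18) features copied from the true input grid
        pos = F.softmax(self.pos_head(exec_t), dim=-1).view(self.grid, self.grid)
        direction = F.softmax(self.dir_head(exec_t), dim=-1)              # (4,)
        # Run the marker LSTM over the grid cells (same conditioning vector at every step here).
        steps = exec_t.expand(self.grid * self.grid, -1).unsqueeze(0)      # (1, 324, hidden)
        marker_h, _ = self.marker_lstm(steps)
        markers = F.softmax(self.marker_head(marker_h.squeeze(0)), dim=-1) # (324, 11)
        # Assemble a 16x18x18 state: 4 orientation-at-position planes, 10 marker-count planes,
        # and the 2 masked boundary/obstacle planes, which are never changed by the program.
        state = torch.zeros(16, self.grid, self.grid)
        state[0:4] = direction.view(4, 1, 1) * pos.unsqueeze(0)
        state[4:14] = markers[:, 1:].t().reshape(10, self.grid, self.grid)
        state[14:16] = boundary_obstacle_feats
        return state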

Appendix C Implementation Details

All encoders and decoders in our models are 2-layer bi-directional LSTMs with a hidden size of 512. The embedding size is 1024. We use the Adam optimizer [26] for training. The learning rate starts from 1e-3, and is decayed by a factor of 0.9 every 6000 timesteps. The batch size is 8. The training converges in 200K batch updates. The norm for gradient clipping is 5.0. All models are trained on a single GPU. The beam size is 64 when evaluating the model performance, and 8 for iterative retraining due to the large size of the training set.
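
In a PyTorch implementation, these training hyperparameters could be wired up as in the following sketch (the LSTM below is only a stand-in for the full encoder/decoder stack, and the loss is a dummy placeholder):

import torch
import torch.nn as nn

# Stand-in for the encoder/decoder stack; hidden size 512, 2 layers, bi-directional.
model = nn.LSTM(input_size=1024, hidden_size=512, num_layers=2,
                bidirectional=True, batch_first=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decay the learning rate by a factor of 0.9 every 6000 timesteps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6000, gamma=0.9)

def training_step(batch):
    out, _ = model(batch)
    loss = out.pow(2).mean()        # dummy loss; the real objective is the synthesis loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # clip grad norm at 5.0
    optimizer.step()
    scheduler.step()
    return loss.item()

training_step(torch.randn(8, 10, 1024))   # batch size 8, embedding size 1024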

Regarding the implementation of the Property Signatures [39], we further illustrate the key difference between our adaptation for the restricted C domain and the original implementation in [39] with the following example. Suppose an input-output pair is $([-4,3,1,2,1],\ [-4,3,3,3,3])$. For the feature "Input == Output?", the corresponding property signature is "False" according to the implementation in [39], while the signature is "[True, True, False, False, False]" in our adapted implementation. Compared to the original implementation of property signatures, our adaptation better reveals which specific list elements are manipulated by the program. This modification makes our implementation of property signatures a much stronger baseline for the restricted C domain, because our C programs do not always perform the same manipulation steps over all elements in the input list, and sometimes change the values of only a subset of the input numbers.
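
The adapted signature in this example can be computed with the following sketch (helper names are illustrative; only the "Input == Output?" feature is shown):

def original_equal_signature(inp, out):
    """Original property signature [39]: one boolean for the whole pair."""
    return inp == out

def elementwise_equal_signature(inp, out):
    """Our adaptation: compare the input and output lists element by element."""
    return [i == o for i, o in zip(inp, out)]

inp, out = [-4, 3, 1, 2, 1], [-4, 3, 3, 3, 3]
print(original_equal_signature(inp, out))      # False
print(elementwise_equal_signature(inp, out))   # [True, True, False, False, False]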

Appendix D More Results of Iterative Retraining

I1: [2, 4, 1, 2, -3]
O1: [2, 4, 3, 2, -3]
I2: [1, 0, 1, -3, 4]
O2: [1, 0, 3, -3, 4]
I3: [2, 2, -4, 2, 0]
O3: [2, 2, 3, 2, 0]
I4: [0, -2, 3, 1, 3]
O4: [0, -2, 3, 1, 3]
I5: [-2, 1, 4, 0, 0]
O5: [-2, 1, 3, 0, 0]
int * func_1(int a[])
{
int p_0 = 4;
int l_7 = 2;
int l_8 = 4;
a[l_7] = 3;
a[l_8] = a[p_0];
return a;
}
int * func_1(int a[])
{
int p_0 = 2;
a[p_0] = 3;
return a;
}
I1: [3, 1, 3, -2, -4]
O1: [3, 1, 2, -2, -4]
I2: [2, 0, -1, -1, 3]
O2: [2, 0, 2, -1, 3]
I3: [2, 0, -1, 4, 0]
O3: [2, 0, 2, 4, 0]
I4: [-2, -1, 3, 2, -4]
O4: [-2, -1, 2, 2, -4]
I5: [-4, 0, 3, 0, 1]
O5: [-4, 0, 2, 0, 1]
int * func_1(int a[])
{
int p_0 = 2;
int l_10 = 0;
int l_1 = 4;
l_10 = 2;
for (p_0 = 2; p_0 >= 1; p_0--)
{
a[p_0] = 3;
a[p_0] = 2;
if (a[p_0])
break;
a[p_0] = a[l_1];
a[p_0]++;
}
return a;
}
// Training on random programs
int * func_1(int a[])
{
int p_0 = 2;
int l_7 = 2;
a[l_7] = 2;
return a;
}
// After iterative retraining
int * func_1(int a[])
{
int p_0 = 2;
a[p_0] = 2;
return a;
}
I1: [0, 4, 0, 4, 2]
O1: [0, 4, 0, 1, 1]
I2: [4, 0, 1, 1, 4]
O2: [4, 0, 1, 1, 1]
I3: [3, 2, 3, 0, 0]
O3: [3, 2, 3, 1, 1]
I4: [1, 1, 4, 0, 4]
O4: [1, 1, 4, 1, 1]
I5: [1, 3, 0, 1, 1]
O5: [1, 3, 0, 1, 1]
int * func_1(int a[])
{
int p_0 = 0;
int l_10 = 3;
for (p_0 = 4; p_0 >= 0; p_0--)
{
a[p_0] = 3;
a[p_0] = a[p_0];
a[p_0] = 1;
if (a[p_0])
break;
}
a[l_10] = a[l_10];
a[l_10] = a[p_0];
return a;
}
int * func_1(int a[])
{
int p_0 = 4;
for (p_0 = 3; p_0 <= 4; p_0++)
{
a[p_0] = 1;
}
return a;
}
I1: [0, 3, -1, 0, 0]
O1: [4, 3, -1, 4, 4]
I2: [4, -3, 3, 4, 2]
O2: [4, -3, 3, 4, 4]
I3: [-4, 1, 0, 4, -2]
O3: [4, 1, 0, 4, 4]
I4: [0, 4, 3, 0, 4]
O4: [4, 4, 3, 4, 4]
I5: [2, 2, 0, 3, 2]
O5: [4, 2, 0, 4, 4]
int * func_1(int a[])
{
int p_0 = 0;
int l_11 = 3;
for (p_0 = 2; p_0 >= 1; p_0--)
{
for (int p_1 = 4; p_1 >= 3; p_1--)
{
a[p_1] = 4;
}
}
a[p_0] = a[l_11];
return a;
}
int * func_1(int a[])
{
int p_0 = 3;
int l_7 = 0;
a[l_7] = 4;
for (p_0 = 4; p_0 >= 3; p_0--)
{
a[p_0] = 4;
}
return a;
}
I1: [1, 0, 0, 4, -3]
O1: [1, 1, 1, 4, -3]
I2: [-3, 0, 0, -2, 4]
O2: [1, 1, 1, -2, 4]
I3: [0, 2, -2, 4, -3]
O3: [1, 1, 1, 4, -3]
I4: [4, -2, 0, -2, 0]
O4: [1, 1, 1, -2, 0]
I5: [0, 2, -4, 2, 2]
O5: [1, 1, 1, 2, 2]
int * func_1(int a[])
{
int p_0 = 4;
for (p_0 = 1; p_0 >= 0; p_0--)
{
a[p_0] = 1;
for (int p_1 = 2; p_1 >= 1; p_1--)
{
a[p_1] = 4;
a[p_1] = a[p_0];
if (a[p_1])
break;
}
}
return a;
}
int * func_1(int a[])
{
int p_0 = 1;
for (p_0 = 2; p_0 >= 0; p_0–)
{
a[p_0] = 1;
}
return a;
}
Figure 10: More examples of predicted correct programs that are more concise than the randomly generated ground truth programs on the C dataset. Left: input-output examples. Middle: the randomly generated ground truth programs. Right: the predicted programs. Unless otherwise specified, the predicted programs come from the model trained on random programs.
def run():
repeat (5):
ifelse (rightIsClear):
move
else:
move
putMarker
def run():
repeat (5):
move
putMarker
def run():
move
turnRight
ifelse (noMarkersPresent):
repeat (2):
putMarker
else:
pickMarker
repeat (5):
turnRight
def run():
move
turnLeft
turnLeft
ifelse (markersPresent):
pickMarker
else:
putMarker
putMarker
def run():
pickMarker
move
ifelse (not rightIsClear):
putMarker
move
else:
move
putMarker
while (not rightIsClear):
move
putMarker
putMarker
turnRight
move
def run():
pickMarker
move
move
putMarker
putMarker
turnRight
move
def run():
move
turnRight
repeat (5):
pickMarker
putMarker
def run():
move
repeat (4):
pickMarker
turnRight
def run():
move
ifelse (markersPresent):
ifelse (frontIsClear):
putMarker
else:
pickMarker
else:
while (rightIsClear):
turnRight
repeat (2):
repeat (2):
putMarker
turnLeft
def run():
move
while (leftIsClear):
turnLeft
repeat (4):
putMarker
def run():
putMarker
move
ifelse (not leftIsClear):
putMarker
else:
turnRight
if (rightIsClear):
pickMarker
def run():
putMarker
move
putMarker
if (rightIsClear):
pickMarker
def run():
while (not rightIsClear):
while (noMarkersPresent):
putMarker
move
turnLeft
ifelse (rightIsClear):
while (noMarkersPresent):
putMarker
turnLeft
turnLeft
move
turnLeft
move
else:
turnLeft
turnLeft
turnLeft
repeat (4):
move
turnLeft
def run():
putMarker
move
if (noMarkersPresent):
putMarker
turnRight
move
putMarker
turnRight
while (not frontIsClear):
turnRight
move
turnRight
repeat (3):
move
turnLeft
def run():
repeat (6):
if (markersPresent):
repeat (9):
pickMarker
putMarker
move
putMarker
turnRight
def run():
repeat (6):
putMarker
move
putMarker
turnRight
def run():
turnRight
move
turnRight
move
while (not rightIsClear):
move
pickMarker
ifelse (not leftIsClear):
move
else:
move
def run():
turnRight
move
turnRight
move
pickMarker
move
def run():
while (frontIsClear):
ifelse (markersPresent):
move
else:
putMarker
def run():
while (frontIsClear):
putMarker
move
Figure 11: Examples of predicted correct programs that are more concise than the randomly generated ground truth programs on the Karel dataset. 1st and 3rd columns: the randomly generated ground truth programs. 2nd and 4th columns: the corresponding predicted programs. The predictions come from the model trained on random programs.

Figure 10 presents more examples of predicted correct programs that are more concise than the randomly generated ground truth programs on the C dataset.

Figure 11 presents more examples of predicted correct programs that are more concise than the randomly generated ground truth programs on the Karel dataset. Note that the predicted Karel program is not semantically equivalent to the annotated ground truth in many cases. The main reason is that the randomly generated ground truth program may include redundant branching statements, i.e., conditions that always evaluate to true or false for all program inputs in the specification and the held-out test cases.

We present the numerical results of iterative retraining on the Karel and C benchmarks in Table 6 and Table 7, respectively.

Table 6: Results of iterative retraining on the Karel dataset. Each column denotes the proportion of the training set used.
Iters   100%     10%      20%      30%      40%      50%
Generalization Accuracy
1       86.04%   70.92%   75.16%   78.84%   80.88%   82.08%
2       89.28%   76.20%   78.40%   81.08%   82.40%   83.40%
3       89.36%   78.12%   81.20%   83.68%   84.24%   86.32%
Exact Match Accuracy
1       39.40%   36.20%   37.20%   38.36%   40.20%   40.04%
2       41.56%   37.24%   37.28%   39.24%   39.72%   39.16%
3       41.16%   36.56%   38.16%   38.68%   38.72%   39.64%
Table 7: Results of iterative retraining on the C dataset. Each column denotes the proportion of the training set used.
Iters   100%     10%      20%      30%      40%      50%
1       55.2%    11.9%    26.4%    39.1%    45.2%    48.5%
2       56.0%    39.6%    43.9%    48.7%    51.9%    54.1%
3       56.5%    41.7%    44.4%    49.4%    52.8%    54.4%