
Non-Programmers Can Label Programs Indirectly via Active Examples:
A Case Study with Text-to-SQL

Ruiqi Zhong1        Charlie Snell1        Dan Klein1        Jason Eisner2
1 University of California, Berkeley 2 Microsoft Semantic Machines
{ruiqi-zhong, csnell22, klein}@berkeley.edu
[email protected]
Abstract

Can non-programmers annotate natural language utterances with complex programs that represent their meaning? We introduce APEL, a framework in which non-programmers select among candidate programs generated by a seed semantic parser (e.g., Codex). Since they cannot understand the candidate programs, we ask them to select indirectly by examining the programs' input-output examples. For each utterance, APEL actively searches for a simple input on which the candidate programs tend to produce different outputs. It then asks the non-programmers only to choose the appropriate output, thus allowing us to infer which program is correct and could be used to fine-tune the parser. As a case study, we recruited human non-programmers to use APEL to re-annotate Spider, a text-to-SQL dataset. Our approach achieved the same annotation accuracy as the original expert annotators (75%) and exposed many subtle errors in the original annotations.

1 Introduction

Semantic parsing often aims to map a natural language utterance to a program, which can be executed on an input (Kushman and Barzilay, 2013). For example, for the utterance “How old is the youngest person,” we can map it to the SQL program SELECT MIN(AGE) FROM PEOPLE, execute it on an input database, and return the output answer. Language models like Codex often achieve substantial few-shot performance (Chen et al., 2021a). However, user-facing applications often involve novel phrases or domain-specific values unseen during pre-training or generic instruction-tuning, so the model still needs more labels for training. Since hiring programming experts to label training data is costly, can we ask non-programmers to label it instead?

Figure 1: The outline of APEL. The orange background indicates what the annotators can see: they only need to select which output (19 or 23) is correct by reading the question $u$ and the database $i$. We downweight program candidates that are inconsistent with the annotators’ response. Figure 3 illustrates our criteria for $i$.

We introduce APEL, a framework that indirectly labels programs with non-programmer responses on carefully constructed input-output examples. APEL must be seeded with a moderately good semantic parser: we use Codex for this in our experiments, prompting it with a few demonstrations that we solicited from expert programmers. We now want to label additional training data. To construct the correct program for a new utterance, APEL uses the seed parser to generate a list of candidate programs with prior probabilities (Figure 1 top), and then asks non-programmer annotators to indicate which candidate is correct.

However, we cannot ask the annotators to select the correct program directly, since even expert programmers might find this challenging (Figure 2). Instead, APEL obtains information indirectly by asking questions about how the program should behave. We synthesize a program input, execute the candidate programs on it to obtain a list of outputs, and ask the annotators which output is correct given the utterance and the input. We could eliminate the programs that returned other outputs, but annotators may make mistakes, so we instead downweight them via Bayes’ Theorem (Figure 1 bottom). To pin down the correct program, we iteratively reduce the entropy of the distribution over candidates by repeating the process with more synthesized inputs.

Figure 2: APEL needs to obtain information about whether $s_1$ or $s_2$ is correct; it is hard even for experts to directly select the correct one by eyeballing. The error of $s_1$ is subtle: it finds the first names of the students who have dogs and the first names of the students who have cats, and intersects these two sets of first names. APEL can reveal this error by asking about a tiny student database $i$ in which Alice Jones owns a dog while Alice Smith owns a cat. As no student named Alice owns both, the annotator will select an output $o$ that does not include Alice, implying that $s_1$ is incorrect.

To reduce annotators’ workload, we propose an algorithm to search for a program input that maximizes the expected information gain of the annotator’s answer (§ 3), subject to the constraint that the input is simple enough for the annotator to reason about the correct output easily (Figure 3). Since our method resembles the programming by example (PBE) framework (Lavrac and Dzeroski, 1994) and learns an underlying program by actively choosing which examples should be labeled, we call our framework APEL—Active Programming by Examples using a prior derived from a natural Language utterance.

Figure 3: Informativity and simplicity guide our search for a useful input. $i_A$ is not informative since knowing the correct output does not tell us which program $s$ is correct. $i_B$ is not simple since the input is too large for humans to find the correct output. Figure 1 shows an ideal input.

Non-Programmer Annotator Study. As a case study, we use APEL to label SQL programs. We built an annotation GUI based on APEL and recruited 11 non-programmers to label a random subset of 240 utterances from the Spider (Yu et al., 2018) development set (§ 6). According to a new, carefully constructed gold standard, we achieve the same accuracy as the original Spider annotation performed by database experts (75%) and also substantially outperform the top-1 accuracy of Codex (59%). APEL also exposes subtle errors made by previous experts, which we analyze in § 6.6.

APEL is a preliminary step towards enabling non-programmers to label utterances with arbitrarily complex programs. Applying it to new semantic parsing tasks still faces some bottlenecks: it requires a seed semantic parser with reasonable top-$k$ accuracy and an efficient procedure to find simple and informative program inputs (§ 7). However, these bottlenecks might be alleviated by future advances in semantic parsing and program synthesis. With these advances, we hope APEL can help non-programmers label programs in a broader range of applications in the future.[1]

[1] The corrected annotations and our code for searching for informative and simple databases are on our GitHub: https://github.com/ruiqi-zhong/EMNLP23-APEL.

2 Framework

APEL aims to enable humans to indirectly annotate natural language utterances with programs. Here we use text-to-SQL as a case study.

2.1 Outline of APEL

Let $u$ be a given utterance and $c$ a known database schema, which specifies the table names, column names, and the constraints on value types, uniqueness, and foreign keys. We want to synthesize a SQL program $s$ that captures the meaning of $u$ and works properly for any database with schema $c$.

Figure 1 illustrates our pipeline. We first feed $c$ and $u$ to a seed semantic parser (e.g., Codex with a prompt) to generate a prior distribution $p$ over a list of SQL programs $s$. We then 1) synthesize an input database $i$, 2) execute the list of programs on $i$ to obtain a list of program outputs $o$, and 3) display $u$, $i$, and the outputs to an annotator $a$. The annotator generates a response $r$ indicating which output they think is correct. We will then raise the posterior probabilities of the candidates $s$ such that $s(i)=r$, i.e., $s$ returns $r$ when executed on $i$. Since we ask the annotator to select the correct output of a program given an input, we call our questions “$o$-selection questions.”

To formalize, the posterior distribution over the program $s$, once we observe the response $r$, is

$$p(s \mid u, a, i, r) \;\propto\; p(s \mid u, a, i)\, p(r \mid s, u, a, i) \;=\; p(s \mid u)\, p(r \mid a, s(i)) \qquad (1)$$

where $p(s \mid u)$ was the prior from the seed parser. Here $p(r \mid a, s(i))$ models annotator behavior: the probability that annotator $a$ would have responded with $r$ if $s$ were a correct implementation of $u$ and therefore $s(i)$ were the correct output. To the extent that annotators are accurate, this Bayesian update increases the probability of the correct program.

We can ask more $o$-selection questions to obtain further information about the correct candidate and improve the posterior. If desired, different rounds may use different annotators. For each $u$, we define

$$p_t(s) \;\stackrel{\text{def}}{=}\; p(s \mid u, a_1, i_1, r_1, \ldots, a_t, i_t, r_t) \;\propto\; p_{t-1}(s)\, p(r_t \mid a_t, s(i_t)) \qquad (2)$$

to be the posterior after $t > 0$ questions, with $p_0 \stackrel{\text{def}}{=} p(s \mid u)$ being the prior. Letting $T$ denote our total number of questions about $u$, our final posterior is $p_T$.
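To make the update concrete, the following sketch implements equations 1 and 2 under the simple annotator error model that we later adopt in § 6.3 (respond correctly with probability 0.7, otherwise pick one of the displayed options uniformly at random). All names here are illustrative; this is not the released implementation.

```python
def response_likelihood(candidate_output, response, p_correct=0.7, n_options=6):
    """p(r | a, s(i)) under a simple error model: with probability p_correct the
    annotator picks the output of the hypothetically correct program s; otherwise
    they pick one of the n_options displayed choices uniformly at random."""
    uniform = (1.0 - p_correct) / n_options
    return uniform + (p_correct if response == candidate_output else 0.0)

def update_posterior(prior, outputs, response, p_correct=0.7, n_options=6):
    """One round of equation (2): reweight each candidate program s by how likely
    the observed response r would be if s were correct, then renormalize.
    prior:   dict mapping candidate s -> p_{t-1}(s)
    outputs: dict mapping candidate s -> s(i_t), its output on database i_t
    """
    unnorm = {s: p * response_likelihood(outputs[s], response, p_correct, n_options)
              for s, p in prior.items()}
    z = sum(unnorm.values())
    return {s: w / z for s, w in unnorm.items()}

# Multiple rounds simply chain the update: start from p_0 and apply
# update_posterior once per (database, response) pair to obtain p_T.
```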

Evaluation. We output as our annotation the most probable SQL candidate $\hat{s}_T$ according to $p_T$, and compare it to a gold standard. To show that our framework is useful, $\hat{s}_T$ needs to be correct more often than $\hat{s}_0$, the most probable SQL candidate according to the seed semantic parser $p(s \mid u)$.

Relation to Weakly Supervised Learning. Appendix A explains how the “soft annotations” provided by the full distribution $p_T$ could be used to retrain the semantic parser $p(s \mid u)$. These improved models would in turn improve the estimate of $p_T$ (i.e., an EM algorithm).

2.2 Criteria for Synthesized Inputs

To generate an $o$-selection question on round $t$, APEL needs to synthesize an input database $i_t$ that is both informative and simple.

Informative. Our belief as we enter round $t$ is $p_{t-1}$. Once we observe the annotator’s response $r_t$, we will be able to update it to $p_t$. This will achieve an information gain of $\mathrm{H}(p_{t-1}) - \mathrm{H}(p_t)$, where the Shannon entropy $\mathrm{H}$ of a distribution over programs $s$ characterizes its remaining uncertainty about which program is correct.

However, $p_t$ will depend not only on our choice of question $i_t$ but also on the annotator’s response $r_t$ (equation 2). We do not know $r_t$ yet, but our current belief is that it will be distributed as

$$p_{t-1}(r_t) = \sum_{s} p_{t-1}(s)\, p(r_t \mid a_t, s(i_t)) \qquad (3)$$

So our expected information gain from asking $i_t$ is

$$\mathrm{IG}_{p_{t-1}}(i_t) \;\stackrel{\text{def}}{=}\; \mathrm{H}(p_{t-1}) - \mathbb{E}_{r_t \sim p_{t-1}}[\mathrm{H}(p_t)] \qquad (4)$$

where the expectation is taken under distribution 3. This is high if the candidates $s$ that are most plausible under $p_{t-1}$ tend to return different outputs $s(i_t)$ and hence different annotator responses $r_t$, making $r_t$ informative about $s$. By contrast, $i_A$ in Figure 3 (left) would yield an uninformative response ($\mathrm{IG}=0$).
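A minimal sketch of this computation, assuming the same uniform-error annotator model as above (the function and argument names are ours for illustration):

```python
import math

def entropy(dist):
    """Shannon entropy H(p), in bits, of a dict mapping candidates to probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_information_gain(prior, outputs, p_correct=0.7):
    """Sketch of equations (3)-(4) for one candidate database i.
    prior:   dict mapping candidate program s -> p_{t-1}(s)
    outputs: dict mapping candidate program s -> a hashable rendering of s(i)
    """
    options = set(outputs.values())          # distinct outputs the annotator would see
    n = len(options)

    def likelihood(s_output, r):             # p(r | a, s(i)) under the uniform-error model
        return (1.0 - p_correct) / n + (p_correct if r == s_output else 0.0)

    h_prior = entropy(prior)
    gain = 0.0
    for r in options:                        # possible annotator responses r_t
        unnorm = {s: p * likelihood(outputs[s], r) for s, p in prior.items()}
        p_r = sum(unnorm.values())           # p_{t-1}(r_t), equation (3)
        posterior = {s: w / p_r for s, w in unnorm.items()}
        gain += p_r * (h_prior - entropy(posterior))   # equation (4)
    return gain
```

If every plausible candidate returns the same output on $i$, the posterior equals the prior for any response and the function returns 0, matching the uninformative database $i_A$ in Figure 3.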

Simple. We would like to avoid presenting the annotator with complex inputs such as the large database $i_B$ in Figure 3 (right). The correct response might be informative, but determining it would require too much human effort. We crudely model the effort required for $i_t$ as the number of records $|i_t|$.

The next section proposes a heuristic to synthesize a simple informative database $i_t$ given schema $c$, a sample database with schema $c$, and a distribution $p_{t-1}$ over SQL programs.

3 Optimizing the Input Database

We attempt to maximize the expected information gain $\mathrm{IG}$ over all databases that conform to the given schema $c$ and have at most $R$ total records, where $R=30$ in our case study. Formally, we search for

$$i_t^{*} = \operatorname*{argmax}_{i_t : |i_t| \leq R} \mathrm{IG}_{p_{t-1}}(i_t). \qquad (5)$$

When multiple databases have the same $\mathrm{IG}$, we break ties by favoring the smallest database. Since $t$ and $p$ are fixed during the optimization process, we will write $\mathrm{IG}(i)$ instead of $\mathrm{IG}_{p_{t-1}}(i_t)$ for short.

Our method can be summarized as “fuzz-then-drop.” Fuzzing (Miller et al., 1990) is an established practice in software testing, where an algorithm generates a large number of random program inputs to search for an input that satisfies a property or reveals a bug. In our case, we want to search for an input database that maximizes the information gain. Therefore, we first perform fuzzing by randomly generating a large number of large databases as in Zhong et al. (2020)—see Appendix C for further details—and keep the database $i^0$ that maximizes the expected information gain $\mathrm{IG}(i^0)$. We then drop records from $i^0$ for $L$ iterations to satisfy the simplicity criterion.

For each record-dropping iteration $\ell$, we randomly drop 5% of the records from $i^{\ell}$ to obtain $i^{\ell+1}$. If this results in a less informative database, i.e., $\mathrm{IG}(i^{\ell+1}) < \mathrm{IG}(i^{\ell})$, we are willing to retry up to 20 times in hopes of randomly finding an $i^{\ell+1}$ that is at least as informative as $i^{\ell}$; if we fail in all 20 tries, we keep the best of the 20 despite the decrease in $\mathrm{IG}$. Of the databases with at most $R$ records that we encountered during the $L$ iterations, let $\hat{i}$ be the one with the highest $\mathrm{IG}$:

$$\hat{i} \;\stackrel{\text{def}}{=}\; \operatorname*{argmax}_{i \in \{i^{\ell} : 1 \leq \ell \leq L\},\ |i| \leq R} \mathrm{IG}(i) \qquad (6)$$

Since our procedure is randomized, we repeat it 3 times, and let $i^{*}$ be the $\hat{i}$ with the largest $\mathrm{IG}(\hat{i})$, breaking ties as before in favor of smaller $\hat{i}$. Finally, we simplify $i^{*}$ by dropping tables and columns that were not mentioned by any SQL candidates in $p$.
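The following sketch illustrates one run of the fuzz-then-drop loop, under the simplifying assumptions that a database can be treated as a flat list of records and that two callbacks are available: sample_random_db() (the fuzzer of Appendix C) and info_gain(db) (e.g., the expected_information_gain sketch above). It omits the naturalness fixes of Appendix C and the final table/column pruning.

```python
import random

def drop_records(db, frac=0.05):
    """db is represented here simply as a list of records; randomly delete ~frac of them."""
    if len(db) <= 1:
        return list(db)
    k = max(1, int(len(db) * frac))
    keep = set(random.sample(range(len(db)), len(db) - k))
    return [rec for j, rec in enumerate(db) if j in keep]

def fuzz_then_drop(info_gain, sample_random_db, n_fuzz=100, L=50, R=30, retries=20):
    """Illustrative sketch of the fuzz-then-drop search (Section 3); the callbacks
    sample_random_db() and info_gain(db) are assumptions, not part of this paper's API."""
    # Fuzzing: keep the random large database i^0 with the highest expected IG.
    db = max((sample_random_db() for _ in range(n_fuzz)), key=info_gain)

    best, best_ig = None, float("-inf")
    for _ in range(L):                        # record-dropping iterations
        current_ig = info_gain(db)
        trial, trial_ig = None, float("-inf")
        for _ in range(retries):              # retry the random 5% drop up to 20 times
            cand = drop_records(db)
            cand_ig = info_gain(cand)
            if cand_ig > trial_ig:
                trial, trial_ig = cand, cand_ig
            if cand_ig >= current_ig:         # found a non-degrading drop; accept it
                break
        db = trial                            # otherwise keep the best of the retries
        if len(db) <= R and trial_ig > best_ig:
            best, best_ig = db, trial_ig      # remember the best database under the size cap
    return best                               # the full procedure repeats this 3x and keeps the best
```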

Our procedure of dropping records from a large informative database is heavily inspired by Miao et al. (2019), which, given a database $i$ such that $s_1(i) \neq s_2(i)$, provably finds the smallest subset of records in $i$ such that programs $s_1$ and $s_2$ return different outputs. However, their algorithm works only for a restricted family of SQL programs and cannot be adapted to optimize information gain. Our procedure does not provide any provable optimality guarantee, but is more flexible and practical.

In practice, simply applying the above procedure can generate unnatural databases and lead to vacuous SQL execution, confusing the annotators. In Appendix C, we illustrate several typical confusions (Figure 6) and explain how we fix them.

4 Experimental Setup

We describe the dataset used to benchmark APEL (§ 4.1), how we obtained the prior over SQL candidates (§ 4.2), how we separated annotators across rounds (§ 4.3), and our evaluation metrics (§ 4.4).

4.1 Dataset

We benchmarked APEL on the development set of Spider (Yu et al., 2018), an English text-to-SQL dataset with 1034 utterance-SQL pairs distributed under the CC BY-SA 4.0 License. The development set is divided into 20 domains, each with a distinct database schema $c$, a collection of utterance-SQL $(u,s)$ pairs, and a sample database. We split the 1034 $(u,s)$ pairs equally into a validation split and an evaluation split. We used the validation split to tune our annotator interface (§ 6), develop our fuzz-then-drop algorithm (§ 3), and prompt Codex (§ 4.2) to create a seed parser. We used the evaluation split to evaluate APEL with simulated annotators (§ 5), and from these drew a random subset of 240 utterances balanced across domains to conduct our human evaluation (§ 6).

To make our evaluation more reliable, we 1) corrected the sample database to conform to the schema $c$ and updated the test suite correspondingly (Appendix D), and 2) established a new gold standard by correcting errors in 61 out of these 240 SQL annotations (§ 6.2). The corresponding author of Spider endorses our corrections (T. Yu, p.c.).

4.2 Obtaining the SQL Program Prior $p_0$

We generated a prior over SQL program candidates using the Codex (Chen et al., 2021b) language model with few-shot prompting. Given an utterance $u$ with a schema $c$, we created the prompt (§ E.1) by concatenating a linearization of $c$ with eight distinct $(u_k, s_k)$ pairs from the validation split associated with the same schema $c$, and finally the utterance $u$ itself. Some $(u_k, s_k)$ pairs were chosen randomly, while others were chosen because $u_k$ has the highest TF-IDF similarity to $u$. We randomly sampled 200 prompts for $u$ by choosing different $(u_k, s_k)$ pairs, and for each prompt we asked Codex to generate 20 completions (SQL programs) and filtered out non-executable candidates. To prevent surface form competition (Holtzman et al., 2021), we then merged approximately semantically equivalent candidates by finding sets of candidates that return exactly the same outputs on 1K random databases, using the implementation from Zhong et al. (2020). We define $p_0$ to be the empirical distribution, in our sample, over the 16 semantic equivalence classes that were most frequent in the sample. Thus, each $s$ in § 2 represents an approximate equivalence class rather than a single program.
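A simplified sketch of this candidate-merging step is shown below. The helper execute(sql, db) is an assumption (it should run a query and return a hashable result, or None if the query is not executable), and random_dbs stands in for the 1K random databases; the real implementation relies on the distilled test suites of Zhong et al. (2020) rather than this naive loop.

```python
from collections import Counter

def build_prior(sampled_sqls, execute, random_dbs, k=16):
    """Sketch of how p_0 could be formed from Codex samples (Section 4.2)."""
    signatures = {}                              # sql -> tuple of outputs on random_dbs
    for sql in sampled_sqls:
        outs = []
        for db in random_dbs:
            out = execute(sql, db)
            if out is None:                      # drop non-executable candidates
                outs = None
                break
            outs.append(out)
        if outs is not None:
            signatures[sql] = tuple(outs)

    # Candidates with identical output signatures are treated as approximately equivalent.
    class_counts = Counter(signatures.values())
    representative = {}
    for sql, sig in signatures.items():
        representative.setdefault(sig, sql)      # one representative program per class

    top = class_counts.most_common(k)            # keep the k most frequent equivalence classes
    total = sum(count for _, count in top)
    return {representative[sig]: count / total for sig, count in top}
```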

Treating the original Spider annotation as the ground truth, the top-1 accuracy of $p_0$ on the entire development set is 72% and the top-16 accuracy is 94%. These numbers are not comparable to prior works, which usually evaluate on unseen database domains in a zero-shot manner (harder than our setting) but do not predict string literals and DISTINCT keywords, which we need for execution. § E.2 includes more details on top-$k$ accuracy (sometimes also referred to as pass@$k$).

Note that the seed parser could be further improved, e.g. with better models (OpenAI, 2023) or prompts (Zhou et al., 2023). However, these improvements are complementary to APEL, whose goal is to improve over the seed parser with non-programmer responses. We design our evaluation metrics based on this goal in the next section.

4.3 Separation of Annotators

When optimizing the next input $i$ to show to a given annotator $a$ (§ 3), we measured its IG by conditioning only on the previous responses from $a$. That is, we treated each annotator as if they were the first annotator. (Indeed, all annotators for $u$ saw the same first question.) This approach reduced the amount of information about $u$ that our $T$ total questions could extract, but it also reduced the variance of our experiment by running multiple independent (but shorter) annotation sessions on $u$. To compute the final $p_T$ for $u$, we still aggregated the responses from all annotators using the Bayesian update in equation 2.

4.4 Evaluation Metrics

We report the following statistics to evaluate APEL:

  • Codex accuracy: how often is the argmax of $p_0$ correct? This baseline reflects the top-1 accuracy of the seed semantic parser.

  • APEL accuracy: how often is $\hat{s}_T$ (the argmax of $p_T$) correct? This reflects the improvement due to our APEL annotators.

  • candidate ceiling: how often does the support of $p_0$ contain a correct program? This reflects how much APEL is bottlenecked by the seed semantic parser, whose top 16 candidates form our candidate list.

If APEL is effective, we should see that the APEL accuracy is higher than the Codex accuracy. Since Spider categorizes its utterances by difficulty (easy/medium/hard/extra), we report the accuracy for each difficulty level. Next we evaluate APEL with both simulated and human annotators.
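As a concrete summary of these three metrics, a minimal sketch is shown below. The example dictionaries ("p0", "pT", "gold") and the is_correct(s, gold) equivalence check (e.g., a test-suite comparison) are assumptions for illustration, not the actual evaluation code.

```python
def compute_metrics(examples, is_correct):
    """Codex accuracy, APEL accuracy, and candidate ceiling as described in Section 4.4."""
    n = len(examples)
    codex = sum(is_correct(max(ex["p0"], key=ex["p0"].get), ex["gold"]) for ex in examples)
    apel = sum(is_correct(max(ex["pT"], key=ex["pT"].get), ex["gold"]) for ex in examples)
    ceiling = sum(any(is_correct(s, ex["gold"]) for s in ex["p0"]) for ex in examples)
    return {"codex": codex / n, "apel": apel / n, "ceiling": ceiling / n}
```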

5 Automated Simulation Evaluation

To automatically test the effectiveness of APEL without additional human annotation, we benchmarked APEL on the entire evaluation split by simulating an oracle annotator: assuming that the SQL annotation provided by Spider is always correct, the oracle always responds by choosing the output of that correct SQL program. Additionally, to evaluate our database generation algorithm in § 3, we compared to OrigDB, an ablated variant of APEL that directly uses the sample database from the original Spider dataset. Table 1 reports the candidate ceiling, the Codex accuracy, and the accuracy for both OrigDB and APEL.

          easy   med    hard   extra  all
Ceiling   0.98   0.96   0.98   0.81   0.94
Codex     0.87   0.80   0.56   0.45   0.72
OrigDB    0.98   0.90   0.83   0.66   0.86
APEL      0.93   0.95   0.97   0.75   0.91
Table 1: The metrics are described in § 4.4. APEL uses databases generated by the method described in § 3, while OrigDB directly uses the sample database from Spider. APEL significantly outperforms both Codex and OrigDB ($p$-value $< 1 \times 10^{-3}$), indicating that both APEL and our database generation algorithm are effective.

With 3 rounds of interactions, APEL accuracy (91%) substantially improves on the Codex accuracy (72%), validating the effectiveness of our framework; we achieved this by interacting with our oracle annotator for 1.8 rounds on average. Compared to OrigDB (86%), which uses the sample databases from the original Spider dataset to interact for 1 round, APEL’s database generation method leads to higher accuracy since it allows multiple rounds and optimizes for information gain. Furthermore, averaged across all utterances, the databases generated by APEL contained altogether 10 records across all rounds—whereas the OrigDB databases contained on average 33,295 records and would be impractical to present to human annotators. These results highlight the need for the databases to be both simple and informative.[2]

[2] APEL achieves 81% accuracy with 1 round of interaction.

Appendix H provides further information on the database sizes, the number of interaction rounds, and the robustness of our database generation method under different hyper-parameter choices.

6 Human Evaluation

We evaluate the effectiveness of APEL with real human annotators. We built an interface designed to be user-friendly (§ 6.1) and used it ourselves to establish gold annotations for 240 utterances (§ 6.2). We then recruited 11 non-programmer subjects to annotate the same utterances (§ 6.3), aggregated their responses by learning a model of annotator behavior (§ 6.4), and benchmarked their performance with the newly established gold standard (§ 6.5). We analyze errors by Spider expert annotators in § 6.6, qualitative feedback from our subjects in § 6.7, and errors made by our subjects in Appendix F.

Figure 4: A screenshot of our annotation interface (§ 6.1). Appendix J and Figure 11 include more details.

6.1 Annotation Interface

Figure 4 shows an example $o$-selection question in our interface, which displays the utterance $u$ at the top, the database $i$ on the left, and up to $M=6$ distinct outputs $o$ on the right in random order, followed by a “none of the above” option; in general, an output may be a string, a number, or a table. The annotator can also use an open-ended response field to optionally report that the question is ambiguous or confusing; we did not use these open-ended responses during data collection, but future work could potentially condition on them in equation 2 (along with the multiple-choice responses).

To reduce the annotators’ cognitive load, when the mouse pointer is over a cell, our interface highlights that cell along with all other cells of $i$ and of the outputs $o$ that have the same value. Appendix J describes more features of our interface.

We asked each annotator up to 3 consecutive questions for each utterance $u$, stopping the interaction after 1 or 2 questions if some candidate $s$ already has $p_t(s) > 0.9$, or if our heuristic (§ 3) fails to find an appropriate database $i$ for the next question.
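The per-utterance interaction could be sketched as the loop below, which reuses the update_posterior sketch from § 2.1. The callbacks find_database, execute, and ask_annotator are assumptions standing in for the search of § 3, SQL execution, and the annotation UI, respectively.

```python
def annotate_utterance(prior, execute, find_database, ask_annotator,
                       max_rounds=3, threshold=0.9):
    """Sketch of the interaction loop described above (not the deployed code)."""
    posterior = dict(prior)
    for _ in range(max_rounds):
        if max(posterior.values()) > threshold:      # some candidate is already likely enough
            break
        db = find_database(posterior)                # simple and informative database (Section 3)
        if db is None:                               # heuristic failed; stop asking
            break
        outputs = {s: execute(s, db) for s in posterior}
        response = ask_annotator(db, outputs)        # the o-selection question
        posterior = update_posterior(posterior, outputs, response)   # equation (2)
    return posterior
```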

6.2 Gold Standard Annotation

To establish a clean gold standard, two of the authors annotated all 240 utterances using our own APEL system. Whenever our two responses to an APEL question differed, we reached a consensus through discussion.

We closely examined the $o$-selection questions where our consensus response did not match the output of the SQL program provided by Spider, and corrected Spider’s program if we felt that our response was strictly better. To avoid biasing against the original annotations, we stuck to the original ones whenever there were ambiguities, and we double-checked each corrected annotation by additionally writing down reasons why we felt it was better. As mentioned in § 4.1, we ultimately corrected 61 out of the 240 SQL annotations. § 6.6 analyzes these corrections in greater detail.

6.3 Non-Programmer Annotation

Annotation Unit. We split the 240 utterances into 8 units. Each unit contained 30 utterances across 4–5 database domains and took 1–2 hours to annotate with our interface. For the annotator behavior model $p(r \mid a, s(i))$ in equation 1, we assumed that every annotator would respond correctly with probability 0.7 and would otherwise select a response uniformly at random.

Recruiting Non-Programmers. As annotators, we recruited 11 university students who 1) were not pursuing and had not received a Computer Science degree and 2) had no prior experience with SQL. Each annotator could annotate any number of units (from 1 to 8) as they wished, but had to annotate each chosen unit fully. For each unit we rewarded them with $15 as a base payment and a $5 ($10) bonus if their responses agreed with our corrected gold standard more than 85% (95%) of the time. We received 20 units of annotation in total, so each utterance was examined by 2.5 annotators on average. On average, we asked each of them 1.84 questions per utterance, and the databases that we presented contained only 8.05 records. Figure 5 visualizes the distribution of the number of questions per utterance and of the database sizes per question.

Figure 5: The distribution of database sizes (left) and number of questions (right) for the human evaluation.

Participation Procedure. We asked each annotator to: 1) sign a consent form to participate in the study, 2) watch a 12-minute video tutorial that contains our annotation instructions and explains the basics of foreign and primary keys, and 3) complete the annotation task. The tutorial can be seen at https://youtu.be/-MlIcCQ21xs and an example unit of the annotation task can be tried at http://35.225.126.31:4200/v0104_4pm_8.

6.4 Learning an Annotator Behavior Model

After collecting the annotations, we used them to improve § 6.3’s naive model of annotator behavior $p(r \mid a, s(i))$ by learning parameters $\alpha_a$ for each annotator $a$ and $\beta_d$ for each Spider domain $d$. For a given $a$ and $d$, the chance that the annotator answers at random is no longer fixed at 0.3, but is modeled as $\sigma(\alpha_a + \beta_d + b)$, where $\sigma$ is the logistic function and $b$ is a bias term. Larger $\alpha_a$ and $\beta_d$ predict higher error rates.

We maximize the incomplete data log-likelihood

$$L_u = \log p(r_1, \ldots, r_T \mid u, i_1, \ldots, i_T) = \log \sum_{s} p_0(s) \prod_{t=1}^{T} p(r_t \mid a_t, s(i_t)) \qquad (7)$$

summed over all utterances $u$. Since we explicitly model the annotator behavior, $L_u$ is sensitive to the domain $d$ of $u$ and the annotator $a_t$ who answered the question about $i_t$. Just as in other adaptive crowdsourcing research, we do not assume access to the gold value of $s$. Our behavior model will predict a lower annotation error rate for those who tend to agree with other annotators and with Codex.
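A minimal sketch of this objective for a single utterance, assuming the parameterization of the previous paragraph (the function names and argument layout are ours, not the code used to fit the reported model):

```python
import math

def p_random(alpha_a, beta_d, b):
    """sigma(alpha_a + beta_d + b): the modeled chance that annotator a answers at
    random in domain d (Section 6.4)."""
    return 1.0 / (1.0 + math.exp(-(alpha_a + beta_d + b)))

def log_likelihood_u(p0, domain, rounds, alpha, beta, b, n_options=6):
    """Incomplete-data log-likelihood L_u of equation (7) for one utterance.
    rounds: list of (annotator a_t, outputs mapping s -> s(i_t), response r_t);
    alpha, beta: dicts of per-annotator and per-domain parameters."""
    total = 0.0
    for s, p_s in p0.items():                    # marginalize over the unknown program s
        prob = p_s
        for a, outputs, r in rounds:
            eps = p_random(alpha[a], beta[domain], b)
            # p(r_t | a_t, s(i_t)): a random guess with probability eps, else the output of s
            p_r = eps / n_options + ((1.0 - eps) if outputs[s] == r else 0.0)
            prob *= p_r
        total += prob
    return math.log(total)
```

Summing log_likelihood_u over utterances gives the training objective, which could be maximized by gradient methods or by the EM procedure of Appendix A.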

Overall, our annotators chose the correct option 72% of the time. To evaluate our unsupervised annotator model, we compared its predicted probability of a correct response on each utterance to the 0/1 indicator of whether the annotator actually selected the correct option. See Appendix I for a visualization. Our model achieves an ROC AUC of 0.68 and an MSE of 0.185. For comparison, a simple supervised model that always predicted the true overall probability of 72% would achieve an MSE of 0.20.

As Appendix A explains, one way to maximize the likelihood in equation 7 is to use an EM algorithm that alternates between imputing the unknown $s$ for each utterance (using $p_T$) and re-estimating the annotator error model based on these imputed “gold” answers. Appendix A also discusses how the annotator error model could be enriched.

          easy   med    hard   extra  all
Ceiling   0.91   0.91   0.92   0.69   0.88
Spider    0.93   0.78   0.63   0.50   0.75
Codex     0.78   0.65   0.43   0.36   0.59
APEL      0.83   0.80   0.73   0.47   0.75
Table 2: Similar to Table 1, except that we now use the gold standard from § 6.2 and also report the annotation accuracy of the original Spider dataset. APEL significantly outperforms Codex ($p$-value $< 10^{-3}$).

6.5 Results

We present the results in Table 2. APEL (75%) is effective: it significantly outperforms Codex (59%) with $p$-value $< 10^{-3}$ under a one-sided paired t-test. APEL yields the largest improvement on utterances of medium or hard difficulty, implying that APEL is most effective when the utterances are hard enough that there is still room for improvement, but not so hard that they confuse the non-programmer annotators. Additionally, APEL with non-programmers is on par with the previous database experts, who worked without the help of a seed parser. We next analyze the errors made by those previous database experts.

6.6 Subtle Errors by Database Experts

APEL helped us identify annotation mistakes in the original Spider dataset (§ 6.2). We categorize and present them below to demonstrate where APEL is most helpful. The corresponding author of Spider agrees with our analyses (T. Yu, p.c.).

Interpreting Database Schema Properly. Suppose each row contains information about an orchestra, including the year the orchestra was founded and its associated recording company. For an utterance “Which recording company was founded the earliest?”, the correct annotation under this schema should be “Not enough information to tell”. Neither Spider nor APEL currently supports this annotation option, though our human annotator reported this issue in their open-ended feedback. Spider annotates this utterance incorrectly as SELECT company from TABLE WHERE YEAR = (SELECT MIN(YEAR) from TABLE), which looks plausibly correct but actually finds the recording company of the earliest-founded orchestra.

Handling All Allowed Values. The annotated SQL should behave appropriately on all plausible cell values. For example, when we are asked about the maximum value in a column that allows NULL cells, we prefer a SQL that skips any NULL cells and returns the maximum of the actual values. As another example, if the utterance is “How many countries have a republic government form?”, the clause WHERE GOVERNMENT = "Republic" will ignore any countries with the government form “Federal Republic”; the correct annotation should be WHERE GOVERNMENT LIKE "%Republic%".

INNER JOIN vs. LEFT JOIN. Suppose the utterance is “List singer names and number of concerts for each singer.” and the database contains a table of singers and a table with records $(s, c)$ if singer $s$ performed in concert $c$. The Spider annotation is incorrect because it uses INNER JOIN, which fails to return singers with count 0.
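A tiny sqlite3 reproduction of this discrepancy, in the spirit of the small databases APEL synthesizes, is shown below. The miniature schema and data are invented for illustration (they are not Spider's actual tables); the point is that the two readings return visibly different outputs on a database containing a singer with no concerts, which is exactly what an APEL question would surface.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (singer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE singer_in_concert (singer_id INTEGER, concert_id INTEGER);
    INSERT INTO singer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO singer_in_concert VALUES (1, 10), (1, 11);   -- Bob performed in no concert
""")

inner = conn.execute("""
    SELECT s.name, COUNT(c.concert_id) FROM singer s
    JOIN singer_in_concert c ON s.singer_id = c.singer_id
    GROUP BY s.singer_id ORDER BY s.singer_id
""").fetchall()

left = conn.execute("""
    SELECT s.name, COUNT(c.concert_id) FROM singer s
    LEFT JOIN singer_in_concert c ON s.singer_id = c.singer_id
    GROUP BY s.singer_id ORDER BY s.singer_id
""").fetchall()

print(inner)   # [('Ann', 2)]               -- Bob is silently dropped
print(left)    # [('Ann', 2), ('Bob', 0)]   -- the reading annotators prefer
```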

Ties for Extremals. For the utterance “Who is the youngest person?”, the Spider annotation is SELECT NAME FROM PEOPLE ORDER BY AGE LIMIT 1. As APEL discovers, in case of ties, humans prefer a SQL that will return all of the people who have the smallest age, rather than just the first one.

Remark.

Since most of the text-to-SQL models had low performance 3 years ago, Yu et al. (2018) favored short and plausible SQL annotations to make learning easier. These annotation conventions were shared between training and test sets to form a coherent structured prediction task (internal validity). Now that structured prediction is working well enough that the predictions could be used in real-world settings, we should turn to assuring that the SQL annotations actually have the desired effects (external validity). APEL can help in establishing the new gold standard (§ 6.2).

6.7 Qualitative Feedback

We informally interviewed some of our subjects to obtain feedback about APEL. Here is some example feedback that indicates room for future improvement:

  • Boredom: Some subjects complained that the annotation process was boring and they would not want to do it a second time. Future work can design better UIs and interactions to make the annotation more engaging.

  • Difficulty: While most questions were straightforward, some required onerous computations (e.g., adding 170115 and 50456). Even though we set a time limit for each question, these questions consumed a lot of mental energy. Future work could include an option for users to skim through all the questions and solve the easier ones first.

  • Vagueness: Some utterances were inherently vague. Consider the utterance “Count the number of friends Kyle has.” – what should happen when there are two students named Kyle? Without external affirmation that these questions were indeed vague, some subjects wasted too much time guessing how to interpret them. Future work could elicit confidence ratings rather than forcing a choice among discrete options.

7 Related Work and Future Directions

Our framework can potentially generalize from text-to-SQL to other semantic parsing tasks. The SQL program $s$ can generalize to other types of executable semantic parses, the input database $i$ can generalize to any well-typed input, and the database query result $o = s(i)$ can generalize to the intended effect of $u$ on $i$. Applying APEL to a new task would require 1) a seed semantic parser with high top-$k$ accuracy, and 2) an algorithm to find a simple informative program input $i$. These problems are related to semantic parsing and programming by example, respectively. We now discuss how those two lines of research benefit from APEL and also how they could make APEL more powerful in the future.

Semantic Parsing. Semantic parsers have improved significantly over the past decades (Zettlemoyer and Collins, 2007; Scholak et al., 2021a). Recent pretrained models can perform the task without task-specific architectures (Scholak et al., 2021b) or even in a zero/few-shot manner (Shin et al., 2021; Rajkumar et al., 2022).

However, collecting semantic parsing datasets is still challenging since it requires experts. Wang et al. (2015) address this by synthetically generating logical forms, using templates to explain them in natural language, and asking crowdworkers to paraphrase them. Still, the paraphrases are usually restricted in linguistic diversity (Larson et al., 2020). Another line of research (Yao et al., 2020; Elgohary et al., 2020; Mo et al., 2022) tries to learn from user interaction based on surface-form information, e.g., whether a program refers to specific fields. However, their assumption that a non-expert annotator can recognize the true program $s$ via paraphrases is unrealistic when programs have complex semantics. In contrast, APEL allows non-programmers to indirectly label text-to-SQL data via input-output examples.

Applying APEL to other programming languages requires a seed semantic parser with high top-$k$ accuracy. Such a requirement is more likely to be fulfilled in the future, at least for programming languages that are well-represented in language model training data, as language model capability is expected to improve (Ganguli et al., 2022). Additionally, the seed parser can be improved with better prompting techniques (Trummer, 2022; Wei et al., 2022; Wang et al., 2023; Zhou et al., 2023), which are complementary to our contribution and would make APEL stronger.

Programming by Example. PBE has been applied to synthesize programs for regular expressions (Gulwani, 2011), tensor manipulation (Shi et al., 2020), data analysis (Bavishi et al., 2019), and visualization (Wang et al., 2021), among other domains. Other recent works, such as Ye et al. (2020) and Baik et al. (2020), also try to combine semantic parsing with PBE. However, both of them require the users to provide the input-output examples, which can be time-consuming to write.

To reduce the users’ workload, we provide the input part of the input-output examples, and focus on only those inputs whose outputs will actually help identify the desired concept. This is a case of active learning, which is broadly used in other applications such as learning real-valued functions, sequence classification, and visual question answering (Schulz et al., 2018; Ein-Dor et al., 2020; Karamcheti et al., 2021). In APEL, for each utterance, we maintain a prior over the function (program) space and learn the desired function by querying it on a sequence of carefully chosen inputs. Similar to our work, Pasupat and Liang (2016) asked non-programmers $o$-selection questions by synthesizing table inputs, but they did not optimize for question simplicity and focused on a simpler single-table setting. Concurrently with our work, Lahiri et al. (2022) applied a similar framework to label Python functions; unlike them, we validated APEL with human annotators.

To reduce the effort required from humans, future work can also use large language models to evaluate the program outputs (Chen et al., 2023). Prompting large language models to evaluate their own outputs is a widespread idea. Schick et al. (2023) rerank ad hoc programs generated by a language model based on whether their outputs increase the probability of observed text. Bai et al. (2022) automatically critique model-generated outputs and refine them for future training.

To apply APEL to other programming languages, we need efficient methods to synthesize simple and informative program inputs. Such methods already exist for less expressive languages such as regular expressions or restricted SQL (Miao et al., 2019). Appendix M describes an additional case study where we used APEL (with simulated annotators) to label utterances with regular expressions. To extend APEL to more complicated programs, we can potentially use language models to generate test inputs (Schäfer et al., 2023) and train them to search for more informative inputs with self-supervision (Haluptzok et al., 2023).

Scalable Oversight. As AI systems become more capable of generating candidate responses, an emerging line of research supervises AI systems by providing preferences over AI-generated candidates rather than providing human demonstrations (Askell et al., 2021; Ouyang et al., 2022). Therefore, to supervise AI to perform more complex tasks, it becomes increasingly important to determine human preferences over model outputs that are expensive to verify (Amodei et al., 2016; Bowman et al., 2022), such as full-book summaries or natural language descriptions of distributional properties (Wu et al., 2021; Zhong et al., 2022). Our work presents a strategy to re-weight complex outputs from an AI system (namely, programs) by asking simple informative questions of annotators who do not have to understand the outputs directly.

8 Conclusion

We proposed APEL, enabling non-programmers to indirectly label SQL programs via input-output examples. With advances in semantic parsing and PBE, future work could potentially extend APEL to other applications. We hope non-programmers can label more complex programs in the future, hence decreasing the costs of supervising increasingly capable AI systems that can take complicated actions by generating and executing code.

Acknowledgements

The first author is funded by the NSF-Simons Theorinet Grant (NSF Award #2031985). Our human interaction study was approved by UC Berkeley’s Institutional Review Board, and our survey and interface did not collect any personally identifiable information. We thank members of the Berkeley NLP group and Jacob Steinhardt’s group, and the anonymous reviewers for their feedback on our paper draft.

Limitations

APEL is only a preliminary step towards allowing non-programmers to label utterances with programs. We are not close to enabling expert-level labeling for arbitrary programming languages. We only experimented with English utterances, SQL programs, and university students; generalizing this to more natural languages, programming languages, and annotator populations requires more future work.

Our current implementation is limited to text-to-SQL and regular expressions. As mentioned at the end of the last section, applying APEL to other semantic parsing tasks requires effective algorithms to find simple but informative program inputs and a strong seed semantic parser. While we believe that these assumptions are likely to hold in the future, they might not hold currently. We also assumed that we have access to a pool of unlabeled utterances to begin with, while in practice we might not have access to utterances from real users. More future directions are discussed in Appendices B and I.

As APEL only considers the function (i.e., input-output mapping) computed by a program, it is only able to annotate an utterance with a semantic equivalence class of correct programs. Other factors such as efficiency or readability might be needed to choose among the programs in that class.

Even though we have shown that APEL has higher accuracy than the seed parser, we have not empirically validated that the seed parser is improved by fine-tuning on the programs selected by APEL, nor did we study whether such improvements are in fact cheaper to attain with non-programmers than with expert annotators.

Finally, no semantic parser should be used to synthesize SQL queries or other semantic forms for high-stakes scenarios without a careful analysis of errors and the downstream harms that they might cause.

References

  • Amodei et al. (2016) Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
  • Andreas et al. (2020) Jacob Andreas, John Bufe, David Burkett, Charles Chen, Josh Clausman, Jean Crawford, Kate Crim, Jordan DeLoach, Leah Dorner, Jason Eisner, Hao Fang, Alan Guo, David Hall, Kristin Hayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Smriti Jha, Dan Klein, Jayant Krishnamurthy, Theo Lanman, Percy Liang, Christopher H. Lin, Ilya Lintsbakh, Andy McGovern, Aleksandr Nisnevich, Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth, Subhro Roy, Jesse Rusak, Beth Short, Div Slomin, Ben Snyder, Stephon Striplin, Yu Su, Zachary Tellman, Sam Thomson, Andrei Vorobev, Izabela Witoszko, Jason Wolfe, Abby Wray, Yuchen Zhang, and Alexander Zotov. 2020. Task-oriented dialogue as dataflow synthesis. Transactions of the Association for Computational Linguistics, 8:556–571.
  • Askell et al. (2021) Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
  • Bachrach et al. (2012) Yoram Bachrach, Thore Graepel, Thomas P. Minka, and Jo W. Guiver. 2012. How to grade a test without knowing the answers—a Bayesian graphical model for adaptive crowdsourcing and aptitude testing. In ICML.
  • Bai et al. (2022) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
  • Baik et al. (2020) Christopher Baik, Zhongjun Jin, Michael J. Cafarella, and H. V. Jagadish. 2020. Constructing expressive relational queries with dual-specification synthesis. In CIDR.
  • Bavishi et al. (2019) Rohan Bavishi, Caroline Lemieux, Roy Fox, Koushik Sen, and Ion Stoica. 2019. AutoPandas: Neural-backed generators for program synthesis. Proceedings of the ACM on Programming Languages, 3(OOPSLA):1–27.
  • Bowman et al. (2022) Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamile Lukosuite, Amanda Askell, Andy Jones, Anna Chen, et al. 2022. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540.
  • Chen et al. (2021a) Anthony Chen, Pallavi Gudipati, Shayne Longpre, Xiao Ling, and Sameer Singh. 2021a. Evaluating entity disambiguation and the role of popularity in retrieval-based NLP. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4472–4485, Online. Association for Computational Linguistics.
  • Chen et al. (2023) Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2023. Codet: Code generation with generated tests. In The Eleventh International Conference on Learning Representations.
  • Chen et al. (2021b) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021b. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
  • Chen et al. (2021c) Xinyun Chen, Linyuan Gong, Alvin Cheung, and Dawn Song. 2021c. PlotCoder: Hierarchical decoding for synthesizing visualization code in programmatic context. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2169–2181, Online. Association for Computational Linguistics.
  • Chen et al. (2018) Xinyun Chen, Chang Liu, and Dawn Song. 2018. Execution-guided neural program synthesis. In International Conference on Learning Representations.
  • Chen et al. (2015) Yuxin Chen, Shervin Javdani, Amin Karbasi, J. Andrew Bagnell, Siddhartha Srinivasa, and Andreas Krause. 2015. Submodular surrogates for value of information. In Proceedings of AAAI.
  • Chu et al. (2017) Shumo Chu, Chenglong Wang, Konstantin Weitz, and Alvin Cheung. 2017. Cosette: An automated prover for SQL. In CIDR.
  • Dempster et al. (1977) A. P. Dempster, N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39(1):1–21.
  • Ein-Dor et al. (2020) Liat Ein-Dor, Alon Halfon, Ariel Gera, Eyal Shnarch, Lena Dankin, Leshem Choshen, Marina Danilevsky, Ranit Aharonov, Yoav Katz, and Noam Slonim. 2020. Active Learning for BERT: An Empirical Study. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7949–7962, Online. Association for Computational Linguistics.
  • Elgohary et al. (2020) Ahmed Elgohary, Saghar Hosseini, and Ahmed Hassan Awadallah. 2020. Speak to your parser: Interactive text-to-SQL with natural language feedback. In Annual Meeting of the Association for Computational Linguistics.
  • Ganguli et al. (2022) Deep Ganguli, Danny Hernandez, Liane Lovitt, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova Dassarma, Dawn Drain, Nelson Elhage, et al. 2022. Predictability and surprise in large generative models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pages 1747–1764.
  • Gulwani (2011) Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. ACM Sigplan Notices, 46(1):317–330.
  • Haluptzok et al. (2023) Patrick Haluptzok, Matthew Bowers, and Adam Tauman Kalai. 2023. Language models can teach themselves to program better. In The Eleventh International Conference on Learning Representations.
  • Holtzman et al. (2021) Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. 2021. Surface form competition: Why the highest probability answer isn’t always right. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7038–7051, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  • Karamcheti et al. (2021) Siddharth Karamcheti, Ranjay Krishna, Li Fei-Fei, and Christopher Manning. 2021. Mind your outliers! investigating the negative impact of outliers on active learning for visual question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7265–7281, Online. Association for Computational Linguistics.
  • Kushman and Barzilay (2013) Nate Kushman and Regina Barzilay. 2013. Using semantic unification to generate regular expressions from natural language. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 826–836, Atlanta, Georgia. Association for Computational Linguistics.
  • Lahiri et al. (2022) Shuvendu K Lahiri, Aaditya Naik, Georgios Sakkas, Piali Choudhury, Curtis von Veh, Madanlal Musuvathi, Jeevana Priya Inala, Chenglong Wang, and Jianfeng Gao. 2022. Interactive code generation via test-driven user-intent formalization. arXiv preprint arXiv:2208.05950.
  • Larson et al. (2020) Stefan Larson, Anthony Zheng, Anish Mahendran, Rishi Tekriwal, Adrian Cheung, Eric Guldan, Kevin Leach, and Jonathan K. Kummerfeld. 2020. Iterative feature mining for constraint-based data collection to increase data diversity and model robustness. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8097–8106, Online. Association for Computational Linguistics.
  • Lavrac and Dzeroski (1994) Nada Lavrac and Saso Dzeroski. 1994. Inductive logic programming. In WLP, pages 146–160. Springer.
  • Littlestone and Warmuth (1994) N. Littlestone and M.K. Warmuth. 1994. The weighted majority algorithm. Information and Computation, 108(2):212–261.
  • Miao et al. (2019) Zhengjie Miao, Sudeepa Roy, and Jun Yang. 2019. Explaining wrong queries using small examples. In Proceedings of the 2019 International Conference on Management of Data, pages 503–520.
  • Miller et al. (1990) Barton P. Miller, Louis Fredriksen, and Bryan So. 1990. An empirical study of the reliability of UNIX utilities. Communications of the ACM, 33(12):32–44.
  • Mo et al. (2022) Lingbo Mo, Ashley Lewis, Huan Sun, and Michael White. 2022. Towards transparent interactive semantic parsing via step-by-step correction. In Findings of the Association for Computational Linguistics: ACL 2022, pages 322–342, Dublin, Ireland. Association for Computational Linguistics.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report. arXiv.
  • Ouyang et al. (2022) Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
  • Pasupat and Liang (2016) Panupong Pasupat and Percy Liang. 2016. Inferring logical forms from denotations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23–32, Berlin, Germany. Association for Computational Linguistics.
  • Poesia et al. (2022) Gabriel Poesia, Alex Polozov, Vu Le, Ashish Tiwari, Gustavo Soares, Christopher Meek, and Sumit Gulwani. 2022. Synchromesh: Reliable code generation from pre-trained language models. In International Conference on Learning Representations.
  • Rajkumar et al. (2022) Nitarshan Rajkumar, Raymond Li, and Dzmitry Bahdanau. 2022. Evaluating the text-to-SQL capabilities of large language models. arXiv preprint arXiv:2204.00498.
  • Schick et al. (2023) Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
  • Scholak et al. (2021a) Torsten Scholak, Raymond Li, Dzmitry Bahdanau, Harm de Vries, and Chris Pal. 2021a. DuoRAT: Towards simpler text-to-SQL models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1313–1321, Online. Association for Computational Linguistics.
  • Scholak et al. (2021b) Torsten Scholak, Nathan Schucher, and Dzmitry Bahdanau. 2021b. PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9895–9901, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Schulz et al. (2018) Eric Schulz, Maarten Speekenbrink, and Andreas Krause. 2018. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. Journal of Mathematical Psychology, 85:1–16.
  • Schäfer et al. (2023) Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023. Adaptive test generation using a large language model.
  • Semantic Machines et al. (2020) Semantic Machines, Jacob Andreas, John Bufe, David Burkett, Charles Chen, Josh Clausman, Jean Crawford, Kate Crim, Jordan DeLoach, Leah Dorner, Jason Eisner, Hao Fang, Alan Guo, David Hall, Kristin Hayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Smriti Jha, Dan Klein, Jayant Krishnamurthy, Theo Lanman, Percy Liang, Christopher H. Lin, Ilya Lintsbakh, Andy McGovern, Aleksandr Nisnevich, Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth, Subhro Roy, Jesse Rusak, Beth Short, Div Slomin, Ben Snyder, Stephon Striplin, Yu Su, Zachary Tellman, Sam Thomson, Andrei Vorobev, Izabela Witoszko, Jason Wolfe, Abby Wray, Yuchen Zhang, and Alexander Zotov. 2020. Task-oriented dialogue as dataflow synthesis. Transactions of the Association for Computational Linguistics, 8:556–571.
  • Shi et al. (2020) Kensen Shi, David Bieber, and Rishabh Singh. 2020. TF-Coder: Program synthesis for tensor manipulations. arXiv preprint arXiv:2003.09040.
  • Shin et al. (2021) Richard Shin, Christopher Lin, Sam Thomson, Charles Chen, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. 2021. Constrained language models yield few-shot semantic parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7699–7715, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Trummer (2022) Immanuel Trummer. 2022. CodexDB: Synthesizing code for query processing from natural language instructions using GPT-3 Codex. Proceedings of the VLDB Endowment, 15(11):2921–2928.
  • Wang et al. (2020) Bailin Wang, Richard Shin, Xiaodong Liu, Oleksandr Polozov, and Matthew Richardson. 2020. RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7567–7578, Online. Association for Computational Linguistics.
  • Wang et al. (2021) Chenglong Wang, Yu Feng, Rastislav Bodik, Isil Dillig, Alvin Cheung, and Amy J. Ko. 2021. Falx: Synthesis-powered visualization authoring. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–15.
  • Wang et al. (2017) Sida I. Wang, Samuel Ginn, Percy Liang, and Christopher D. Manning. 2017. Naturalizing a programming language via interactive learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 929–938, Vancouver, Canada. Association for Computational Linguistics.
  • Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
  • Wang et al. (2015) Yushi Wang, Jonathan Berant, and Percy Liang. 2015. Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1332–1342, Beijing, China. Association for Computational Linguistics.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  • Whitehill et al. (2009) Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier Movellan, and Paul Ruvolo. 2009. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, volume 22. Curran Associates, Inc.
  • Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
  • Yan et al. (2011) Yan Yan, Romer Rosales, Glenn Fung, and Jennifer G. Dy. 2011. Active learning from crowds. In Proceedings of the 28th International Conference on International Conference on Machine Learning, ICML’11, page 1161–1168, Madison, WI, USA. Omnipress.
  • Yao et al. (2020) Ziyu Yao, Yiqi Tang, Wen-tau Yih, Huan Sun, and Yu Su. 2020. An imitation game for learning semantic parsers from user interaction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6883–6902, Online. Association for Computational Linguistics.
  • Ye et al. (2020) Xi Ye, Qiaochu Chen, Isil Dillig, and Greg Durrett. 2020. Benchmarking multimodal regex synthesis with complex structures. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6081–6094, Online. Association for Computational Linguistics.
  • Yu et al. (2018) Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3911–3921, Brussels, Belgium. Association for Computational Linguistics.
  • Zettlemoyer and Collins (2007) Luke Zettlemoyer and Michael Collins. 2007. Online learning of relaxed CCG grammars for parsing to logical form. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 678–687, Prague, Czech Republic. Association for Computational Linguistics.
  • Zhong et al. (2022) Ruiqi Zhong, Charlie Snell, Dan Klein, and Jacob Steinhardt. 2022. Describing differences between text distributions with natural language. In International Conference on Machine Learning, pages 27099–27116. PMLR.
  • Zhong et al. (2020) Ruiqi Zhong, Tao Yu, and Dan Klein. 2020. Semantic evaluation for text-to-SQL with distilled test suites. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 396–411, Online. Association for Computational Linguistics.
  • Zhou et al. (2023) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, and Ed H. Chi. 2023. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations.

Appendix A Training on APEL Annotations

As APEL is an annotation method, its ultimate goal is to convert unlabeled utterances $u$ into "supervised" training examples $(u,s)$ for a semantic parser. In this appendix, we discuss how these training examples would be constructed and used.

(In the main paper, we used "example" to refer to examples of the form $(i,o)$, used for programming by example (§ 2). In this appendix, however, we use "example" to refer to examples of the form $(u,s)$, used as training examples for a semantic parser.)

Training on MAP Annotations

For each utterance $u$, APEL solicits indirect annotations and produces a posterior probability distribution $p_T$ over programs. The maximum a posteriori (MAP) program is defined by $\hat{s} \stackrel{\text{def}}{=} \operatorname{argmax}_s p_T(s)$.

Most simply, we could train a semantic parser on the inferred annotations $(u,\hat{s})$. Specifically, one option is to use these annotations to fine-tune the seed parser $p_0$. However, this may not be practical for a very large pretrained model such as Codex, in which case the annotations could be used to train a dedicated semantic parser from scratch.
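
As a concrete illustration, the following is a minimal Python sketch (not our released code) of how the MAP annotations could be extracted from APEL's posteriors before fine-tuning; the `posteriors` data structure is hypothetical.

# Minimal sketch: turning APEL posteriors into supervised fine-tuning data.
# `posteriors` maps each utterance u to a dict {candidate_sql: probability};
# this structure is hypothetical, for illustration only.

def map_training_examples(posteriors):
    """Keep the MAP program s_hat = argmax_s p_T(s) for each utterance."""
    examples = []
    for utterance, p_T in posteriors.items():
        s_hat = max(p_T, key=p_T.get)          # MAP annotation
        examples.append((utterance, s_hat))    # (u, s_hat) training pair
    return examples

if __name__ == "__main__":
    toy = {"How old is the youngest person?":
               {"SELECT MIN(AGE) FROM PEOPLE": 0.8,
                "SELECT MAX(AGE) FROM PEOPLE": 0.2}}
    print(map_training_examples(toy))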

Downstream Evaluation

In this paper, we evaluated APEL directly by asking whether its inferred annotations $(u,\hat{s})$ are accurate. We measured this by semantic equivalence $\llbracket\hat{s}\rrbracket=\llbracket s^*\rrbracket$ (see § 4.2) on a small test set of pairs $(u,s^*)$ that were annotated by experts (using APEL). One could instead evaluate the actual impact of the inferred annotations on parsing performance, by training a semantic parser (e.g., fine-tuning the seed parser) on a large training set of APEL-labeled utterances, and evaluating that parser on the test set.

Beyond MAP

Rather than evaluating only the MAP annotation $\hat{s}$, we could have evaluated APEL's entire distribution $p_T$ using the log-loss $-\log p_T(s^*\mid u)$, averaged over the test set of $(u,s^*)$ pairs. This is a more sensitive evaluation, though harder to interpret than exact-match accuracy.

But do we care whether the distribution $p_T$ is accurate beyond just its MAP annotation? Yes: one could train a parser $p$ by maximizing its expected log-likelihood

\mathbb{E}_{s\sim p_T}[\log p(s\mid u)] \qquad (8)

summed over the APEL-annotated utterances $u$. In effect, this treats $p_T$ as giving a set of weighted "soft annotations" for $u$, not just the MAP annotation. It is equivalent to minimizing the Kullback–Leibler divergence $\mathrm{KL}(p_T \,\|\, p)$ (summed over $u$).
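
For illustration, here is a small sketch of this soft-annotation objective, assuming the parser exposes a (hypothetical) log p(s | u) scoring function; it is not the training code we ran.

import math

# Sketch of the soft-annotation objective in equation (8). `parser_logprob` is
# a hypothetical hook returning log p(s | u) under the current parser.

def soft_annotation_loss(p_T, parser_logprob, utterance):
    """Negative expected log-likelihood -E_{s ~ p_T}[log p(s | u)];
    minimizing it (up to a constant) minimizes KL(p_T || p)."""
    return -sum(w * parser_logprob(candidate, utterance)
                for candidate, w in p_T.items() if w > 0.0)

# Toy usage with a fake parser that assigns probability 0.5 to every candidate.
p_T = {"SELECT MIN(AGE) FROM PEOPLE": 0.9, "SELECT MAX(AGE) FROM PEOPLE": 0.1}
fake_logprob = lambda s, u: math.log(0.5)
print(soft_annotation_loss(p_T, fake_logprob, "How old is the youngest person?"))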

Iterative Retraining

Once an improved parser has been trained on the inferred annotations—either the MAP annotations or the soft annotations—it can be used as an improved prior $p_0$ in the update rule (equation 1). This can inform APEL in the selection of future questions.

Furthermore, although APEL's past questions can no longer be changed, we can now reinterpret the annotators' responses to those questions, by using equation 1 to recompute an improved posterior $p_T$ from the improved prior $p_0$. The improved soft annotations from these posteriors can be used to retrain the parser again. This procedure can be iterated to convergence to improve the parser.

This iterated procedure is a principled training method. If soft annotations are used for retraining via equation 8 at each iteration, then it is an instance of the Expectation-Maximization (EM) algorithm Dempster et al. (1977). (If MAP labels are used, it is an instance of the hard or "Viterbi" approximation to EM.) EM is guaranteed to converge to a new semantic parsing model $p_0$ that locally maximizes the conditional log-likelihood (§ 6.4) of all of the observed data—in our case, the annotators' responses $r_t$ for all $o$-selection questions on all utterances $u$. Thus, it is simply a log-likelihood training method where the observed data are not direct annotations $s$, but indirect annotations $r_t$. It marginalizes over the latent programs $s$.
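
The following schematic sketch shows one way such an EM loop could be organized; the record format and the train_parser and score_prior hooks are hypothetical placeholders rather than our actual pipeline.

# Schematic EM loop over APEL annotations (a sketch, not the released pipeline).
# `records[u]["candidates"]` lists candidate programs and
# `records[u]["response_likelihoods"]` holds, for each round t, a dict mapping
# each candidate s to p(r_t | a_t, s(i_t)) -- all hypothetical names.

def e_step(prior, response_likelihoods):
    """Bayes update: posterior over candidates given the recorded responses."""
    post = {s: prior[s] for s in prior}
    for lik in response_likelihoods:          # one dict per round t
        post = {s: post[s] * lik[s] for s in post}
    Z = sum(post.values()) or 1.0
    return {s: v / Z for s, v in post.items()}

def em(records, train_parser, score_prior, n_iters=5):
    for _ in range(n_iters):
        # E-step: recompute posteriors p_T from the current prior p_0.
        posteriors = {u: e_step(score_prior(u, rec["candidates"]),
                                rec["response_likelihoods"])
                      for u, rec in records.items()}
        # M-step: retrain the parser on the soft annotations (equation 8);
        # the retrained parser becomes the prior p_0 for the next E-step.
        score_prior = train_parser(posteriors)
    return score_prior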

Richer Model of Annotator Behavior

The annotator model $p(r\mid a,s(i))$ can be jointly retrained with the semantic parser at each iteration of EM, to obtain a higher log-likelihood (§ 6.4). Improving the annotator model in this way should result in more accurate distributions $p_T$ at the next iteration of EM, and thus a better semantic parser.

Specifically, when we retrain the parser via equation 8, we can also retrain the annotator model to maximize the expected log-likelihood

\mathbb{E}_{s\sim p_T}\!\left[\sum_{t=1}^{T}\log p(r_t\mid a_t,s(i_t))\right] \qquad (9)

summed over the APEL-annotated utterances $u$. This EM procedure will converge to a maximum of the incomplete-data log-likelihood as in § 6.4.

We could fit a richer model of annotator behavior than the relatively simple one in § 6.4. For example, we could estimate the error rates of individual annotators on different types of questions and program inputs. Intuitively, equation 9 means that we will tend to judge annotators as correct on examples where they agree with the consensus of $p_T$ and the other annotators.
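
As a toy illustration, a very simple per-annotator error-rate re-estimate in the M-step might look as follows; each recorded question supplies the posterior p_T and, for each candidate s, whether the observed response matches s(i). The data structures are hypothetical, and the model ignores how wrong responses distribute over distractors.

# Sketch of re-estimating a single per-annotator correctness rate (a crude
# special case of equation 9). All inputs are hypothetical illustrations.

def reestimate_reliability(questions):
    """questions: list of (posterior, correct_given_s) pairs, where
    correct_given_s[s] is 1.0 if the observed response matches s(i), else 0.0."""
    expected_correct = 0.0
    for posterior, correct_given_s in questions:
        # Expected correctness of this response, marginalizing over s ~ p_T.
        expected_correct += sum(posterior[s] * correct_given_s[s] for s in posterior)
    return expected_correct / max(len(questions), 1)   # MLE of P(response correct)

toy = [({"s1": 0.7, "s2": 0.3}, {"s1": 1.0, "s2": 0.0}),
       ({"s1": 0.6, "s2": 0.4}, {"s1": 1.0, "s2": 1.0})]
print(reestimate_reliability(toy))   # 0.85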

Although equation 1 assumed that $p(r\mid s,u,a,i)=p(r\mid a,s(i))$, a rich model could drop this assumption.³ For example, the model might allow that the annotator is more likely to make an error when $u$ is difficult or $i$ is complex. As an annotator's errors on a given utterance may be influenced by how they answered previous questions, a rich model could even take the form $p(r_t\mid s,u,a,i_1,r_1,\ldots,i_{t-1},r_{t-1},i_t)$, where $i_1,r_1,\ldots,i_{t-1},r_{t-1}$ are the previous rounds of interaction with annotator $a$.

³ This means using $p(r_t\mid s,u,a_t,i_t)$ in place of $p(r_t\mid a_t,s(i_t))$ in equations 2, 3, 6.4, and 9.

Evaluating with Task Loss

In our discussion of evaluation thus far, we have not given the semantic parser any partial credit. That is, given an utterance $u$, we have considered any answer other than the correct program $s$ to be equally wrong.

However, more generally, it may be tolerable to find a program whose outputs are correct—or at least close to correct—for most inputs. Let $\mathrm{loss}(\hat{o}\mid u,i,o^*)$ denote the task-specific loss of predicting output $\hat{o}$ on input $i$ when the correct output is $o^*$. The loss of predicting program $\hat{s}$ when the correct program is $s^*$ is then

\mathbb{E}_i[\mathrm{loss}(\hat{s}(i)\mid u,i,s^*(i))] \qquad (10)

where the expectation $\mathbb{E}_i$ is taken over some realistic distribution of inputs for utterance $u$. This loss function can be used in supervised evaluation.

Decoding with Task Loss

Following standard Bayesian decision-theoretic methods, the semantic parser itself can aim to achieve low loss. No change to the training procedure is needed. Suppose we have trained a semantic parsing model $p(s\mid u)$ by EM as described above. We now need to extract decisions from this model. If the semantic parser is required to translate $u$ into a single program $\hat{s}$ that will be used for all inputs, then the best program to choose is the program that minimizes the Bayes risk

R_p(\hat{s}\mid u) \stackrel{\text{def}}{=} \mathbb{E}_{s\sim p}\big[\mathbb{E}_i[\mathrm{loss}(\hat{s}(i)\mid u,i,s(i))]\big] \qquad (11)

This $\hat{s}$ is not necessarily the MAP program. Once $\hat{s}$ is predicted, the output on any given input $i$ is $\hat{o}=\hat{s}(i)$.

In some applications, however, it may not be necessary to choose a single program to use for all inputs. Then the best output $\hat{o}$ to return for input $i$ is the output that minimizes the Bayes risk

R_p(\hat{o}\mid u,i) \stackrel{\text{def}}{=} \mathbb{E}_{s\sim p(\cdot\mid u)}[\mathrm{loss}(\hat{o}\mid u,i,s(i))] \qquad (12)

In other words, $\hat{o}$ is now predicted from $i$ by a consensus of multiple programs.
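
A minimal sketch of both decision rules (equations 11 and 12), using a 0-1 loss on outputs for concreteness; the execute hook standing in for running a program on an input is hypothetical.

# Minimum-Bayes-risk decoding sketch for equations (11)-(12).

def zero_one_loss(o_hat, o_star):
    return 0.0 if o_hat == o_star else 1.0

def mbr_program(posterior, inputs, execute, loss=zero_one_loss):
    """Equation (11): pick a single program minimizing E_{s~p}[E_i[loss]]."""
    def risk(s_hat):
        return sum(w * sum(loss(execute(s_hat, i), execute(s, i)) for i in inputs) / len(inputs)
                   for s, w in posterior.items())
    # For simplicity we restrict the argmin to the candidate set itself.
    return min(posterior, key=risk)

def mbr_output(posterior, i, execute, loss=zero_one_loss):
    """Equation (12): per-input consensus output, no single program required."""
    candidates = {execute(s, i) for s in posterior}
    return min(candidates,
               key=lambda o_hat: sum(w * loss(o_hat, execute(s, i))
                                     for s, w in posterior.items()))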

Appendix B Extensions to APEL Annotation

The following natural extensions could be explored in future work.

Selecting Questions with Task Loss

A model of task loss as in Appendix A can be used to improve the selection of the next question $i_t$. Instead of maximizing the expected reduction in entropy (equation 4), we can maximize the expected reduction in the Bayes risk (equation 11).

When $p$ is any distribution over programs $s$ for utterance $u$, define the minimum Bayes risk by

\mathrm{MBR}(p) \stackrel{\text{def}}{=} \min_{\hat{s}} R_p(\hat{s}\mid u) \qquad (13)

At the start of round $t$, we can already achieve a Bayes risk of $\mathrm{MBR}(p_{t-1})$. Once we have the annotator's response $r_t$ to question $i_t$, we will be able to update $p_{t-1}$ to $p_t$ and achieve a Bayes risk of $\mathrm{MBR}(p_t)$, where $p_t$ depends on $i_t$ and $r_t$ via the update rule (equation 2). Thus, the value of information from asking question $i_t$ is

\mathrm{MBR}(p_{t-1}) - \mathbb{E}_{r_t\sim p_{t-1}}[\mathrm{MBR}(p_t)] \qquad (14)

This is essentially the same as the information gain (equation 4), but with task loss replacing log-loss. It is again guaranteed to be $\geq 0$.
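
For concreteness, here is a small sketch of equations 13 and 14, with the candidate-level risk and the annotator response model passed in as hypothetical callables.

# Sketch of equation (14): expected reduction in minimum Bayes risk from asking
# question i_t. `risk(s_hat, s)` is the (hypothetical) loss of committing to
# s_hat when s is correct, i.e., E_i[loss(s_hat(i) | u, i, s(i))].

def min_bayes_risk(p, risk):                       # equation (13)
    # Restrict the argmin to the candidate set for simplicity.
    return min(sum(p[s] * risk(s_hat, s) for s in p) for s_hat in p)

def value_of_information(p_prev, response_model, responses, risk):
    """response_model(r, s) = p(r | a, s(i_t)); responses = possible answers r."""
    voi = min_bayes_risk(p_prev, risk)
    for r in responses:
        # Predictive probability of response r, then the posterior update (eq. 2).
        p_r = sum(p_prev[s] * response_model(r, s) for s in p_prev)
        if p_r == 0.0:
            continue
        p_post = {s: p_prev[s] * response_model(r, s) / p_r for s in p_prev}
        voi -= p_r * min_bayes_risk(p_post, risk)
    return voi   # guaranteed >= 0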

Richer Model of Annotator Effort

Our objective (equation 5) tries to present the annotator $a_t$ with a "simple" input database $i_t$, as measured by the number of records $|i_t|$. However, $|i_t|$ is only a crude measure of the time or effort that the annotator must expend to answer the multiple-choice question derived from $i_t$. For example, annotators typically find it more difficult to perform arithmetic or to reason across multiple tables within $i_t$.

To predict the annotator's effort on the input database $i_t$, we could attempt to simulate the annotator's mental execution of the true program $s$ on $i_t$.⁴ As we do not know the true $s$, we would need to take the expectation of this effort over $s\sim p_{t-1}$. Mental operations such as adding 3-digit numbers or visually scanning the tables for a specific string could then be given appropriately high costs in the effort model.

⁴ The same simulation could also be used to help model errors in the annotator's response $r_t$ (Appendix A).

In addition, perhaps a question that generates more multiple-choice options requires more effort. To capture this, the effort model could assign a cost not only to computing $s(i_t)$ but also to finding that answer in the list of options. More generally, the model could attempt to capture mental strategies that do not fully compute $s(i_t)$ but only do enough work to eliminate most of the options.

Finally, a question generally requires less human cognitive effort if it is related to a preceding question. Querying input database $i$ with utterance $u$ is easier for an annotator who has just queried the same $i$ with a different $u'$, or queried a different $i'$ with the same $u$.

How would we train the parameters of the effort model? We observe how quickly the annotator answered each $i_t$ with $r_t$. Taking this time as a proxy for effort, we can train the effort model to predict it from the true program $s$ (which the annotator is presumed to know) and $i_t$. If we do not know $s$, we can impute it from the observed responses—that is, evaluate the effort model's log-likelihood in expectation under $s\sim p_T$, analogously to equations 8 and 9. In short, the effort model could be trained by EM, jointly with the other models.

Choosing Who Should Annotate What

The selection objective (equation 5) manages the tradeoff between annotator effort and information gain by imposing a hard constraint on the annotator effort per $o$-selection question. We could improve this crude selection objective—especially given good models of annotator behavior and annotator effort—by selecting a question $i$ that achieves high "bang for the buck." This score is given by the ratio of value of information (equation 4 or 14) to the expected effort of acquiring that information.

Both the numerator and denominator of this ratio depend on the specific annotator $a$ who is answering $i$, if the models have annotator-specific parameters. They also depend on the utterance $u$. Thus, the score is actually a property of the triple $(u,i,a)$.

At each step of APEL annotation, we would select a high-scoring triple—one in which we expect annotator $a$ to usefully evaluate utterance $u$'s desired behavior on input database $i$ with low effort. The annotator's response $r$ then influences the scores of other triples involving $u$. We would stop annotation when no sufficiently high-scoring triple could be found.
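
A minimal sketch of such a selection rule follows; info_gain and expected_effort are hypothetical hooks standing in for equation 4 (or 14) and the effort model above.

# Sketch of the "bang for the buck" score for a triple (u, i, a): value of
# information divided by expected annotator effort.

def score_triple(u, i, a, info_gain, expected_effort):
    return info_gain(u, i, a) / max(expected_effort(u, i, a), 1e-9)

def select_next(triples, info_gain, expected_effort, threshold=0.0):
    """Return the best (u, i, a) triple, or None to stop annotation."""
    best = max(triples, key=lambda t: score_triple(*t, info_gain, expected_effort),
               default=None)
    if best is None or score_triple(*best, info_gain, expected_effort) <= threshold:
        return None
    return best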

In the same way, Bachrach et al. (2012) and Whitehill et al. (2009) model each individual annotator's capability and each question's difficulty, learning these parameters from agreement information. Yan et al. (2011) explore these ideas for active learning, similarly routing useful questions to competent annotators. Our experiment finds empirically that annotators' abilities and domain difficulties vary widely (Figure 10); therefore, future work is likely to benefit from actively selecting which annotators should annotate which utterances.

Non-Myopic Selection Policy

The procedure described just above will switch freely among utterances and inputs, if it always selects the highest-scoring triple. However, recall that if a followup question on the same utterance or input is eventually to be asked at all, asking it immediately will incur less effort from the annotator. Thus, a good heuristic is to prioritize followup questions over non-followup questions, provided that they score highly enough that it is likely that they will eventually be asked.

More generally: Greedily choosing the highest-scoring question at each step is common in interactive protocols for information acquisition, such as adaptive quadrature or Bayesian optimization Schulz et al. (2018). However, this greedy procedure is in general suboptimal. One could do better at selecting the next question by planning ahead to subsequent questions Chen et al. (2015), though at a higher computational cost.

Again, we leave all these refinements to future work.

Appendix C Other Synthesis Constraints

This appendix gives further details about our method of database construction (§ 3). Overall, we follow the recipe of Zhong et al. (2020) to generate large informative databases that conform to a given schema $c$. We draw upon the existing sample database with this schema (provided by the Spider dataset in our experiments) to obtain plausible cell values. Following Zhong et al., we first synthesize cell values for all the "parent" columns (i.e., the columns that are referenced by a child column in a foreign key relation), and then populate child columns with random elements from the parent columns.

Naturalness of $i$

As shown in Figure 6a, unrestricted random cell values can confuse annotators who are unfamiliar with databases. Therefore, rather than synthesizing completely random values, we now always copy individual cell values from the sample databases (§ 4.1), optionally with minor perturbations such as $\pm 1$ for integer values.

A database record might also be confusing even if its individual cell values are not. For example, the annotator can be confused by counterfactual information where the U.S. is in Asia, as shown in Figure 6b. Therefore, we prefer to initialize $i^0$ with the existing database. The annotator can also be confused by uncommon patterns where two people have the same name but different IDs; therefore, if the existing sample database has unique values in a column, we prefer to enforce that $i$ also has unique values in that column.

Non-vacuous Execution

Extremely small $i$ frequently leads to undefined denotations. For example, the maximum of zero elements is a special NULL value, meaning "undefined"; this confuses annotators without a computer science background (Figure 6d). Therefore, we always add a small probability mass ($\frac{1}{16}$) of RETURN NULL to the distribution $p'$ and normalize the probabilities of the other candidates proportionally, in order to incentivize our algorithm to produce $i$ such that the other SQL candidates will return non-NULL values.

Even if the returned value is well-defined, small $i$ can lead to confusion if some operators are not needed to answer the question. For example, in Figure 6e, asking the maximum over one element might appear confusing, as we do not need the max operator to obtain a correct denotation. Therefore, we always add into $p'$ a small probability mass of "neighbor queries" Zhong et al. (2020), obtained by dropping aggregation operators and WHERE clauses from the SQL candidates in $p'$. This incentivizes our algorithm to produce $i$ such that the SQL candidates will meaningfully use their operators.

Managing Tradeoffs between Two Criteria

All the above tweaks trade off the informativeness and simplicity criteria in some way: we impose restrictions on $i$ or modify $p'$ to decrease the annotator effort, while sacrificing some of the information gain we could potentially achieve. How do we decide when to apply certain tweaks?

In our paper, we always add small probabilities of neighbor queries and RETURN NULL to $p'$ and use cell values from the existing database. We then consider 3 types of tweaks that we apply if possible: 1) $i_0$ satisfies the uniqueness constraint, 2) $i_0$ is initialized with an existing database, and 3) $|i|\leq 15$ rather than 30. We apply these tweaks if they do not prevent us from succeeding. We define in total $2^3=8$ different "configurations" $0\leq c<8$, each of which specifies which subset of tweaks to apply to the algorithm described in § 3. For example, $c=6$ (binary 110) means we apply tweaks 1) and 2). We enumerate from $c=7$ down to 0 until the algorithm from § 3 returns a program input with $\mathrm{IG}(i)\neq 0$. In other words, we start by applying all the tweaks and drop them gradually until we obtain an $i$ with positive expected information gain.
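
A minimal sketch of this enumeration, with the § 3 synthesis algorithm abstracted behind a hypothetical synthesize hook that returns a candidate input and its information gain:

# Start with all three tweaks enabled (c = 7 = binary 111) and relax them until
# the synthesis algorithm of section 3 finds an input with positive information gain.

def find_input(synthesize):
    for c in range(7, -1, -1):                  # c = 7, 6, ..., 0
        tweaks = dict(
            unique_columns=bool(c & 0b100),     # tweak 1: enforce the uniqueness constraint
            init_from_existing=bool(c & 0b010), # tweak 2: initialize i_0 from the sample DB
            max_rows_15=bool(c & 0b001),        # tweak 3: |i| <= 15 instead of 30
        )
        i, info_gain = synthesize(**tweaks)     # hypothetical wrapper around section 3
        if info_gain > 0:
            return i
    return None                                 # no informative input found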

Figure 6: Examples of unnatural databases (top) and vacuous execution (bottom), which motivate several tweaks in Appendix C. In (a) the individual cell values are unnatural. In (b) the records contradict world knowledge. In (c) the database contains two persons with the same name, which is atypical (but possible). In (d) the denotation of the utterance is undefined, since we cannot take the maximum of zero elements. In (e) we do not need the max operator to obtain the correct denotation, since there is only one person in section A.

Appendix D Fixing Spider Databases

We found several issues with the Spider databases and fixed them as follows:

  • Some Spider databases do not conform to their foreign key constraints, i.e., some child columns contain values that do not appear in the parent columns they reference. We enforce the foreign key constraints by dropping the offending records.

  • In some domains, we identify missing foreign key constraints and add them.

  • The voter_1 domain lacks an appropriate foreign key design. Since fixing it would require re-annotating all 15 associated SQL queries, we chose to exclude this domain from our evaluation.

  • Some Date-typed columns are string-valued and use English words to represent values, e.g. nov1,2021. As a result, dec1,2021, which is chronologically later, is considered smaller alphabetically. We fix this by canonicalizing the date representations into yyyy-mm-dd format, as sketched at the end of this appendix.

We accordingly update the suite of test cases from Zhong et al. (2020) that we use to check whether two SQL forms are equivalent (see § 4.2), so that they conform to the new database schema.
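
For illustration, here is a hedged sketch of the date canonicalization step; we assume the string-valued dates follow a "nov1,2021"-style pattern, and real columns may require additional formats to be added to FORMATS.

from datetime import datetime

# Canonicalize English-worded date strings into yyyy-mm-dd so that alphabetical
# ordering matches chronological ordering. The format list is an assumption.
FORMATS = ["%b%d,%Y", "%b %d, %Y", "%B %d, %Y"]

def canonicalize_date(value):
    for fmt in FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return value   # leave unrecognized values untouched

print(canonicalize_date("nov1,2021"))   # -> 2021-11-01
print(canonicalize_date("dec1,2021"))   # -> 2021-12-01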

Appendix E Generating SQL Candidates

E.1 Prompting Codex

As sketched in § 4.2, we obtain SQL program candidates through few-shot prompting: the database schema is followed by 4 or 8 (each with 50% probability) pairs of natural language utterances and their corresponding SQL queries, drawn from the subset of Spider development-set utterance–SQL pairs associated with the same database schema. To select each in-context example, with probability 50% we choose the example most similar to the utterance $u$ that we want to annotate based on TF-IDF similarity, and with probability 50% we choose a random example that has not yet been selected. Finally, we append the natural language utterance $u$ to be annotated and ask Codex to continue generating text after this prompt, which yields a candidate SQL program corresponding to $u$. An example prompt can be seen in Figure 7. As mentioned in § 4.2, we sampled 200 different prompts, which varied in their selected examples, and for each prompt we sampled 20 candidates from Codex with temperature=1.0 and top_p=0.95.
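
The sketch below illustrates this sampling setup with the legacy OpenAI completions client (openai<1.0); it omits the TF-IDF example selection, the exact prompt format is illustrative (Figure 7 shows the real one), and the code-davinci-002 Codex engine has since been retired.

import random
import openai  # legacy (<1.0) client; requires openai.api_key to be set

def build_prompt(schema_text, demos, utterance, k):
    # Schema first, then k utterance/SQL demonstrations, then the target utterance.
    parts = [schema_text]
    for demo_utt, demo_sql in demos[:k]:
        parts.append(f"-- {demo_utt}\n{demo_sql}")
    parts.append(f"-- {utterance}\nSELECT")   # illustrative continuation cue
    return "\n\n".join(parts)

def sample_candidates(schema_text, demos, utterance):
    k = random.choice([4, 8])                 # 4 or 8 demonstrations, 50/50
    prompt = build_prompt(schema_text, demos, utterance, k)
    response = openai.Completion.create(
        engine="code-davinci-002",            # Codex; retired since
        prompt=prompt, max_tokens=256, temperature=1.0, top_p=0.95,
        n=20, stop=["\n\n"],
    )
    return ["SELECT" + choice.text for choice in response.choices]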

Figure 7: An example prompt we use for the Codex API. We obtain SQL program candidates through 4/8-shot prompting, where the database schema (orange) is followed by 4/8 pairs of natural language utterances and their corresponding SQL queries from the Spider development set, randomly sampled from the subset of queries associated with the same database schema. Finally, we concatenate the target natural language utterance $u$ to be annotated and ask Codex to complete the prompt, which results in a candidate SQL program corresponding to $u$.

E.2 Top-$k$ Accuracy

As mentioned in § 4.2, we defined $p_0$ as a distribution over the top-$k$ approximate equivalence classes of the 4000 sampled programs, where $k=16$. Table 3 reports the candidate ceilings (§ 4.4) for various other values of $k$, and Figure 8 graphs these as top-$k$ accuracy curves.
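
For reference, top-$k$ accuracy as reported in Table 3 can be computed with a few lines, assuming each example provides its ranked candidate equivalence classes marked as correct or incorrect:

# Sketch of a top-k accuracy computation over ranked candidate lists.

def top_k_accuracy(examples, k):
    """examples: list of ranked lists of booleans (True = semantically correct)."""
    hits = sum(any(ranked[:k]) for ranked in examples)
    return hits / len(examples)

ranked_correctness = [[False, True, False], [True], [False, False, False]]
print(top_k_accuracy(ranked_correctness, 1))   # 1/3
print(top_k_accuracy(ranked_correctness, 2))   # 2/3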

$k$ easy medium hard extra all
1 0.87 0.80 0.56 0.45 0.72
2 0.94 0.89 0.74 0.63 0.84
4 0.96 0.93 0.87 0.70 0.89
8 0.97 0.95 0.95 0.78 0.92
16 0.98 0.96 0.98 0.81 0.94
32 0.98 0.96 0.98 0.85 0.95
Table 3: The top-$k$ accuracy for the SQL candidates generated by Codex on Spider Yu et al. (2018), calculated on each difficulty split.
Figure 8: The top-$k$ accuracy for the candidate SQL programs generated by Codex, after filtering and merging candidates. On each difficulty split we plot the curve of top-$k$ accuracy ($y$-axis) against $k$ ($x$-axis, log-scaled). The numbers can be seen in Appendix Table 3.

Appendix F Errors in Non-Programmer Responses

Ambiguous Utterances.

Consider the utterance "What are the names of properties that are either houses or apartments with more than 1 room?" Should it be parsed as "(house) or (apartment and room > 1)", or as "(house or apartment) and room > 1"? Another example: "Count the number of friends Kyle has." What should the program do when there are two students named Kyle?

Heavy Computation.

It is hard for humans to do arithmetic mentally, e.g., to find the average of eight 9-digit values. To avoid demanding such computations, APEL should improve its annotator effort model beyond simply counting the number of records $|i|$ (Appendix B).

Database Constraints with Common Sense.

Database schemas sometimes omit common-sense constraints. For example, according to common sense, "BIRTHDAY + AGE" should always yield the current year, so sorting by BIRTHDAY in ascending order is equivalent to sorting by AGE in descending order. However, APEL is able to find a database where these two strategies return different outputs.⁵ Such a database is unnatural and confuses humans. A possible solution would be to induce such constraints from the sample database and/or the column names in the schema.

⁵ Even though APEL derives its databases from the original Spider sample database, that sample database contains records that do not conform to this unstated constraint.

Appendix G Computation

We did not compute the runtime in a controlled environment, so the statistics in this section are our best estimate.

Finding a single informative small database can take up to several minutes. The simulation evaluation on the evaluation split in § 5 (524 utterances) takes around 240 CPU hours in total.

For the human evaluation in § 6 (240 utterances), we must pre-compute the databases for each utterance, in order to support real-time interaction. Since we may ask the annotator up to 3 questions about the utterance, the choice of database $i_t$ is conditioned on 0–2 previous questions and responses ($i_1,r_1,\ldots,i_{t-1},r_{t-1}$). We pre-compute the database $i_t$ for each of these possible histories that we may encounter.⁶ If during the interaction with an annotator we needed a database $i_t$ that we had not precomputed, we aborted the interaction early (we considered this a failure to find an appropriate database, in the terms of § 2.1). However, this rarely happened, since at each node of the response tree we considered the most probable responses first. This pre-computation takes around 100 CPU hours in total (for the 240 utterances).

⁶ Except that we set a timeout of 40 minutes per utterance. Of the 240 utterances, 7 utterances timed out before computing all of the databases in the response tree. (These primarily came from one domain where Spider's sample database was very large.)

Appendix H Simulated Interaction Statistics

The breakdown of the candidate and interaction ceilings (see § 4.4) can be seen in Table 4, and the distributions of database sizes and numbers of interaction rounds can be seen in Figure 9.

Figure 9: The interaction statistics using the Spider annotation to simulate an ideal annotator. Left: the distribution of database sizes across each round of interaction. Right: the distribution of the number of rounds across different utterances.
easy med hard extra all
Candidate 0.98 0.96 0.98 0.81 0.94
Codex 0.87 0.80 0.56 0.45 0.72
OurDB 0.93 0.95 0.97 0.75 0.91
OrigDB 0.98 0.90 0.83 0.66 0.86
Table 4: The accuracy ceilings on each difficulty split. "med" stands for medium difficulty. Candidate is the candidate ceiling. OurDB is the accuracy ceiling achieved by querying an oracle annotator with at most 3 small informative databases that we synthesize. OrigDB is the accuracy ceiling achieved by querying with the databases released by the Spider dataset.

Robustness to Hyper-Parameter Choices.

We vary the hyper-parameters in § 3 to test the robustness of our procedure. Changing the 5% hyper-parameter to 20%, or decreasing the number of random re-runs, leads to the same performance of 91%, and the average number of interaction rounds fluctuates by at most 0.05. Additionally, even after increasing the maximal allowed database size $R$ from 15 to 30, we still obtain an average size of 10, since we prefer smaller databases given the same information gain.

Appendix I Human Annotation

Table 5 provides a breakdown of the annotation accuracy of different sources by difficulty level.

More analysis on the annotator behavior model can be found in Figure 10.

Figure 10: For each domain (left) and annotator (right), we use our annotator behavior model to predict the annotation accuracy ($x$-value) without using the gold annotation and compare it with the true annotation accuracy ($y$-value) computed using the gold annotation. The outlier at the bottom of the left figure corresponds to a domain that contains only six utterances, of which 4 were used for training and 2 for held-out evaluation.
easy medium hard extra all
Candidate Ceiling 0.91 0.91 0.92 0.69 0.88
Spider 0.93 0.78 0.63 0.50 0.75
Codex 0.78 0.65 0.43 0.36 0.59
APEL r 0.75 0.71 0.65 0.49 0.67
APEL m 0.83 0.80 0.73 0.47 0.75
Table 5: 1-best accuracy of various SQL prediction methods, broken down by the difficulty level of the utterance (as categorized by Spider). Codex returns the most probable SQL according to $p_0$. APEL r does the same, after eliminating SQLs that are inconsistent with the responses of a single randomly chosen annotator. APEL m is our full method, which returns the SQL with the highest posterior probability after we fit our model of annotator error.

Appendix J Interface

Figure 11: A detailed screenshot of our interface, and the logical flow of follow-up questions.
Figure 12: An example database description page presented to users before they start answering questions for that database.

See Figure 11 for a detailed screenshot of our interface. We implemented the front-end of our interface with Angular and the back-end with Flask and Redis. Users are presented with a sequence of 40 distinct questions, and each question may have multiple rounds. For each round, the user is given a 4-minute time limit before the interface automatically transitions to the next question. Before being asked questions on a new database, the user is presented with a page displaying all the tables in the database alongside descriptions we wrote for each table (see Figure 12 for an example screenshot). When answering questions, the user is given a link back to this page for reference.

The user can either select one of the multiple-choice options presented or select "No Answer is Correct", and depending on their selection the user is presented with a different set of follow-up questions. Regardless of their selection, we always ask the user two optional follow-ups: "if you think the question is ambiguous, tell us why" and "if the question looks confusing, tell us why." In addition to these optional questions, we sometimes ask required follow-up questions. Specifically, if the user is on their final round and selects an answer that does not agree with the Spider annotation, we ask them why they did not select the answer that Spider considers correct. If the user selects "No Answer is Correct", we ask "What is the answer you have in mind and why?" We use the users' answers to these follow-ups to collect information on their reasoning and to identify issues with the Spider dataset.

We implemented a number of features in our interface to minimize annotator effort. One of the largest challenges in this task is answering questions that require reasoning across several foreign keys. We implemented two distinct mechanisms to make this easier for users. First, we highlight all table values or foreign keys matching the value the mouse is currently hovering over. Second, we give the user the option to merge tables along foreign keys into a single table by pressing a "merge" button. We allow users to choose when to merge because there is a trade-off: while merged mode can make reasoning about foreign keys easier, it can also significantly increase the width of the tables visible to the user.

Sometimes tables presented to the user are not necessary for answering the question, so we give users the option to collapse tables to simplify the display.

Appendix K Video Transcript

Page 1

In this task, you will be asked to answer questions from several tables.

Page 2

Here is the overall idea. You will be given a question at the top of the page and several tables on the left of the page, and you need to choose the option on the right that corresponds to the correct answer. In this question, you are asked to "Show name, country, age for all singers ordered by age from the oldest to the youngest." Therefore, we expect the correct option to list the information about Joe Sharp first, since he is older. We look at the options, and B is correct. Notice that A is wrong because it does not list the information for all singers, and C is wrong because it lists the singers from the youngest to the oldest.

After you submit the answer, our system will ask you whether there is anything that appears ambiguous or confusing. We don’t need it for this question now.

Page 3

Let’s go through some more examples.

Page 4

In this question you are asked “How many singers do we have?” This is a tricky question. First notice that the tables have changed from before, so you need to re-read the table. Secondly, there are actually two singers, but they have the same name. You should consider them to be two different persons with the same name but different SSN, and hence choose B.

There is a time limit shown at the top of the page, and after 4 minutes the system will move on to the next question.

Page 5

Takeaways:

  • Names are different from IDs. Two different people can have the same name.

  • There is a time limit of 4 minutes for each question.

Page 6

In this question you are asked to find the song names of the singers above the average age. The average age is the mean of these 4 numbers, which is 34.5. The singers John and Rose are above 34.5, so we find their songs, which are Sun and Gentleman; the answer is D. Use a calculator if you need to!

Also, notice that there are other tables, but they are not relevant to the question. Feel free to ignore them. You can also choose to collapse them if that makes it easier, and you can click the button again to view them.

Page 7

Takeaways:

  • Use a calculator if you need to.

  • Not every table is needed.

Page 8

Here’s the same question and the same table. Let’s say somehow Sun and Gentleman is not one of the options, and then you should report that no answer is correct. Then we will ask you why you think no answer is correct. For example, you can write “gentleman and sun is correct. the average age is 34.5, John and Rose are above this age and have song gentleman and Sun”.

The system asks us why we didn't choose A; we can answer "Sun is by Rose, who is older than 34.5". Please tell us enough information so that we can understand why your choice is correct. For example, if you just say "Sun is also a correct answer", that only describes the difference between the two options rather than explaining why your choice is correct. Giving us more information can help you win a larger bonus.

Page 9

Takeaways:

  • Choose "No Answer is Correct" and tell us why when you think no option is correct.

  • Tell us why you didn’t choose an answer when we ask you to do so.

Page 10

The question is “What are the full names of all players, sorted by birth date?” First, notice that there are a lot of answers in this case, and you need to scroll down to read through all of them. Secondly, there are a lot of ambiguities: for example, the question didn’t mention whether we should sort from youngest to oldest, or the reverse; secondly, the question does not mention whether the first and last name should be in the same column.  For these reasons, A, B are both correct. C, D are wrong because the question does not ask for birthday information; F is wrong because it only lists one player and G is wrong for including birthday information. Then we can write in the response: “ABE are all correct; not sure if we should sort them from the oldest to youngest or reverse; also not sure whether to put the first and last name into the same column.” But still, make your best guess, let’s say, A.

Then we click submit, and the system asks us why we didn't choose C. We explain that "the question does not ask us for the birthday, and C contains redundant information".

Page 11

Takeaways:

  • There can be a lot of options. Make sure to read through every one of them.

  • When the question is ambiguous and multiple answers are plausible, tell us why it is ambiguous and which answers are plausible. But still, first make your best guess and submit it.

Page 12

The question is “Give the names of countries that are in Europe and have a population equal to 80000.” In this fictitious table, Brazil is in Europe and has a population of 80,000. Therefore, the correct answer is A, even though we know that Brazil is in fact in South America. However, it still cannot stop us from answering the question based on the table. Finally, there are many more countries in the world, beyond these three countries in the table, but we should pretend that there are only three countries in the world here.

Page 13

Takeaways:

  • Accept the information in the tables as much as possible and focus on the parts useful for answering the question.

  • If something is not present in the tables, pretend that it does not exist.

Page 14

Here are some more difficult tables. This is a database that contains information about battles and deaths. The overall description of the database can be seen at the top of the page, which says: This database contains information about battles, death events, and ships. Each table also has its own description. For example, in the ship table, each row contains information about a ship; the 4th row means that ship D was lost in the battle with ID 4, and you can look up information about battle 4 in the battle table. To make it convenient for you, whenever you move your cursor to a value, all the same values will be highlighted. Here we notice that, according to the 5th row, ship E was also lost in battle 4.

To view multiple tables at the same time, you can choose to zoom out, like this. Then you can zoom back in, like this. You can typically find this option in the Help panel of your browser. Again, if you think some tables are irrelevant, just collapse them like this.

You don’t have to study the tables in detail, since they will probably change for the next question.

Page 15

Takeaways:

  • You don’t have to study the table content in great detail, since they will be changing.

  • Zoom-in/out if you need to. You can find them in the helper panel of your browser.

Page 16

This question is “Show names, results and bulgarian commanders of the battles with no ships lost in the ’English Channel”.

The question asks for certain battles, namely those that did not lose ships in the English Channel [pause]. Let's start by finding the battles that did lose ships in the English Channel [pause]. Only Battle 5 did; it lost ship C there. So the other battles, Battles 0 and 7, lost no ships there. In fact, Battle 0 lost no ships at all, which is why it doesn't show up in the second table. We find the names of Battles 0 and 7, along with their other information. Therefore, the answer is E. One very common mistake people make is to ignore the word "no" and choose the battles that did lose ships. Be careful and pay close attention to every word!

Notice that there was originally a death table; we removed it from the display to make things easier for you.

The phrase ’Bulgarian commander’ might send you looking for a table that tells you each commander’s nationality. But actually, Bulgarian_commander is a column in the battles table. Presumably this table lists battles that Bulgaria fought. Each battle had two sides, and this column is naming the commander for the Bulgarian side. You don’t have to fully understand how the tables are set up, but you should figure out enough to answer the question.

Just to repeat, to make it easier for you to process this information, whenever your cursor moves to an ID or a piece of text, its counterpart in other tables will light up; whenever you click on a text, the counterpart in the answer will also be highlighted.

You can also choose to merge the tables. After you merge the tables, there will still be two tables. Each row in the battle table will still contain information about a battle, and each row in the ship table will still contain information about a ship. However, the information about the battle in which each ship was lost is merged into the ship table. Notice that battle 0 will not appear in the ship table, because no ship was lost in that battle, so be careful when you try to interpret the merged table. Click unmerge to return to the original view.

Finally, if you forgot what each table means, you can always view them here.

Page 17

Takeaways:

  • Pay close attention to how the question is being asked. Small differences in wording might lead to different options. Many of the mistakes people make are because they did not read the question carefully.

  • Sometimes we choose not to show you certain tables and columns if we know for sure they are not needed.

  • Use the highlight functionality if that helps you to reason across tables.

  • Use the merge functionality if you need to. Each table will contain information about the same object/entity, but the information about its related objects will be pooled in.

Page 18

The question is “List the name and date of the battle that has lost the ship named ‘Lettice’ and the ship named ’HMS Atalanta’.” Since there is no ship named “HMS atlanta”, there is no battle that lost both of these ships. So you should choose A, “no result found”.

Page 19

Takeaways: Choose no_result_found if no answer satisfies the question.

Page 20

To summarize, here are a couple of things you need to remember to answer the questions correctly:

  • Pay close attention to how the question is asked; most mistakes are made because of not reading the question carefully.

  • Accept the information in the tables even if it changes between questions and might differ from your knowledge of the real world.

  • IDs are different from names

  • Some questions might have a lot of options to choose from and you need to read through all of them.

Page 21

To make it easier for you to answer the questions:

  • Use the highlight and merge operations when you need to

  • Use a calculator if you need to

  • Zoom out to fit the tables into the screen and prevent scrolling.

  • Not every table or column is needed to answer the questions.

Page 22

For freeform response:

  • Report ambiguities or tell us why the question is confusing, but only if you need to.

  • Explain why you did not choose another option when we ask you to. Giving us more information can help you win a larger bonus.

Appendix L Beyond Text-to-SQL

Our framework can be generalized to other semantic parsing applications more broadly, where

  • the SQL program $s$ can be generalized to other types of executable semantic parses, such as tensor manipulation commands, visualization programs Chen et al. (2021c), or dataflow graphs Semantic Machines et al. (2020);

  • the database schema $c$ can be generalized to include any context that affects the mapping of $u$ to $s$, e.g., the conversational history preceding $u$, and the input type required by program $s$;

  • the input database $i$ can be generalized to any well-typed input if $s$ is a function, or $i$ can be the program state (e.g., a mapping from variable names to their values) if $s$ is a step in a procedural program;

  • the database query result $o=s(i)$ can be generalized to the intended effect of $u$ given $i$, which includes not only an answer or a table, but also output images, side effects such as file updates (e.g., updates to a database or a document), or robotic actions.

Applying APEL to a new type of semantic parsing application, rather than utterance-to-SQL, would require the following components:

A seed semantic parser that is likely to generate a short list of candidates containing the correct program. This requirement is not hard to satisfy in many applications, given that large language models often achieve high top-$k$ accuracy on generating simple Python snippets Chen et al. (2021a), JSON data Poesia et al. (2022), Lispress Shin et al. (2021), and SQL programs Scholak et al. (2021b); Rajkumar et al. (2022) with only a few training examples, and are likely to continue improving Kaplan et al. (2020). For example, we achieved 95% top-32 accuracy on Spider without any task-specific engineering beyond few-shot prompting (e.g., specialized architectures Wang et al. (2020), decoding constraints Scholak et al. (2021b), etc.).

An algorithm to find a simple informative program input $i$ that satisfies $c$. Our method in § 3 generates random databases by using an existing sample database as a reference and greedily drops rows to optimize the objective in equation 5. Future methods could potentially speed up the optimization process with a learned neural network Chen et al. (2018) or a constraint solver Chu et al. (2017).

A graphical interface that enables the annotators to easily inspect the input $i$ and choose the correct output $o$, where $i, o$ can be generalized from database tables, strings, or numbers to calendar events Andreas et al. (2020), voxel structures Wang et al. (2017), etc. Careful interface design (§ 6.1) can significantly reduce the effort required from the annotators.

In summary, APEL is a general framework for clarifying the semantics of natural language utterances. It elicits information from humans about how the semantic forms should behave when executed on particular inputs.

In this paper we demonstrated the value of APEL on a text-to-SQL task. Some future work is outlined in Appendices A–B. It would also be desirable to refine and speed up our heuristics for the challenging problem of finding simple inputs that distinguish among SQL queries. Finally, we look forward to future work that extends APEL to other semantic parsing applications.

Appendix M Simulation Experiments on Regex

We include an additional experiment on regular expressions to help the readers understand how to apply APEL to semantic parsing tasks other than SQL.

Dataset.

We used the dataset from Ye et al. (2020), which aims to synthesize a regular expression from a natural language description and a few strings that should be accepted or rejected by the regex. For example, an utterance $u$ could be

This is a list of three comma separated strings, where each string must be formed by the substrings "cz" or "rzq" followed by one to four digits.

and the corresponding program $s$ is

concat(concat(or(const(<cf>), const(<rzq>)), repeatrange(<num>, 1, 4)), concat(<,>, concat(concat(or(const(<cf>), const(<rzq>)), repeatrange(<num>, 1, 4)), concat(<,>, concat(or(const(<cf>), const(<rzq>)), repeatrange(<num>, 1, 4)))))).

The program $s$ maps any input string $i$ to a boolean output of 1 or 0. Thus, an $o$-selection question simply asks whether the string $i$ should be accepted or rejected.

We used the development split to prompt GPT-4 (OpenAI, 2023) to create the seed semantic parser. We tested APEL on the test-inclusion split, where the utterances come from the same annotators who annotated the development split.

Seed Semantic Parser.

We prompted GPT-4 with few-shot demonstrations to create a seed semantic parser. We sampled 50 demonstrations from the development split to simulate a small "training set" of expert labels. For each utterance $u$ we sampled a candidate program from the seed semantic parser 50 times, where each time we randomly selected 20 of the 50 development-split demonstrations as in-context demonstrations. We then automatically filtered out candidate programs that were syntactically invalid and merged semantically equivalent ones (semantic equivalence of regexes can be tested exactly). Finally, we define $p_0$ to be the empirical distribution over the top-10 equivalence classes. The top-1 accuracy (seed parser) is 64% and the top-10 accuracy (ceiling) is 77%.

Simple and Informative Program Input.

We wish to choose among several candidate programs (or more precisely, equivalence classes). For each pair of candidate programs $s_1$ and $s_2$, we sample 200 program input strings $i$ such that $s_1(i)\neq s_2(i)$, using the efficient implementation from Ye et al. (2020). We then collect all such input strings across all pairs of candidate programs, find the ones with the highest information gain, and break ties by choosing the shortest one as a proxy for simplicity.

To evaluate whether the above procedure is effective in generating simple and informative program input, we consider a baseline that only generates an input string that would be accepted by the highest-ranking candidate regex, again using the implementation from Ye et al. (2020). Intuitively, this baseline tries to check the highest-ranking candidate program, by determining whether a string that it accepts should in fact be accepted. The response (if assumed to be correct) may eliminate some of the programs.

Our information gain method, by contrast, actively tries to distinguish among the candidates. It hopes to gain up to 1 bit of information about the correct candidate (this is the maximum information that can be derived from the annotator's 1-bit response). An input string has IG $\approx 1$ bit if the candidates that would accept that string currently hold about half of the probability mass. In that case, either response (1 or 0) will eliminate about half of the candidates (cf. Littlestone and Warmuth, 1994).
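
Concretely, because the annotator's response is a deterministic yes/no function of the true program (under the oracle assumption), the information gain of a string equals the entropy of the predicted response, as in this small sketch (the accepts hook is hypothetical):

import math

# IG of a yes/no regex question: with an error-free annotator, the expected
# entropy reduction over candidates equals the entropy of the binary response.

def binary_information_gain(p, accepts, i):
    q = sum(w for s, w in p.items() if accepts(s, i))   # P(response = "accept")
    if q in (0.0, 1.0):
        return 0.0                                      # uninformative question
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))   # at most 1 bit

# Toy check: two equally likely candidates that disagree on i give IG = 1 bit.
p = {"s1": 0.5, "s2": 0.5}
print(binary_information_gain(p, lambda s, i: s == "s1", "cz12"))   # 1.0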

That said, any input string $i$ sampled by our procedure will distinguish between two of the candidate programs, and hence will have positive information gain. The annotator's response $r$ (if assumed to be correct) will eliminate one of those two candidate programs, and perhaps others. Therefore, it will take at most 9 rounds of oracle annotation to rule out the 9 incorrect equivalence classes from our original 10 classes, thus achieving the candidate ceiling (§ 4.4), regardless of which sampled input string we select at each round.

Results.

After interacting with a simulated oracle annotator for at most three rounds as in § 5, the APEL accuracy reaches the ceiling of 77% and outperforms the seed parser. In contrast, the baseline achieves only 68% accuracy after three rounds, illustrating the importance of optimizing information gain. Finally, APEL uses input strings with an average length of 5.3, while the baseline's average is 11.6; this suggests that optimizing for simplicity is successful.