
Assessing Data Efficiency in Task-Oriented Semantic Parsing

Shrey Desai   Akshat Shrivastava   Justin Rill   Brian Moran
  Safiyyah Saleem   Alexander Zotov   Ahmed Aly
Facebook
{shreyd, akshats, jrill, bmoran,
safisaleem, alexzotov, ahhegazy}@fb.com
Abstract

Data efficiency, despite being an attractive characteristic, is often challenging to measure and optimize for in task-oriented semantic parsing; unlike exact match, it can require both model- and domain-specific setups, which have, historically, varied widely across experiments. In our work, as a step towards providing a unified solution to data-efficiency-related questions, we introduce a four-stage protocol which gives an approximate measure of how much in-domain, “target” data a parser requires to achieve a certain quality bar. Specifically, our protocol consists of (1) sampling target subsets of different cardinalities, (2) fine-tuning parsers on each subset, (3) obtaining a smooth curve relating target subset (%) vs. exact match (%), and (4) referencing the curve to mine ad-hoc (target subset, exact match) points. We apply our protocol in two real-world case studies—model generalizability and intent complexity—illustrating its flexibility and applicability to practitioners in task-oriented semantic parsing.

Figure 1: Illustration of our data efficiency protocol’s outputs. The discrete plot (top) shows the exact match scores a task-oriented semantic parser achieves while being fine-tuned on increasingly larger subsets of (target domain) data; the logarithmic structure of this plot is typical, as model performance saturates in the presence of more in-domain data. Furthermore, the continuous plot (bottom) approximates the true shape of the data efficiency curve, enabling us to make ad-hoc $(x, y)$ queries; that is, how big of an $x$% target subset does the parser require to achieve $y$% exact match?

1 Introduction

Task-oriented, conversational assistants typically first use semantic parsers to map textual utterances to structured frames executable by downstream components Gupta et al. (2018); Einolghozati et al. (2018); Pasupat et al. (2019); Aghajanyan et al. (2020); Chen et al. (2020); Ghoshal et al. (2020); Babu et al. (2021); Desai and Aly (2021); Shrivastava et al. (2021); Desai et al. (2021). Because of the high costs associated with developing new conversational skills, there has been a surge of interest in improving the data efficiency of these parsers to bootstrap learning in low-resource settings Chen et al. (2020); Ghoshal et al. (2020); Desai et al. (2021). However, despite considerable progress in improving training, practitioners do not yet have an agreed-upon protocol to measure and optimize data efficiency in production settings Chen et al. (2020); Desai et al. (2021). Partly, this is due to the lack of straightforward methodology: unlike metrics like exact match which only require executing boolean checks across system and reference frames, metrics for data efficiency require model- or domain-based setups with numerous design decisions Chen et al. (2020); Desai et al. (2021).

In this work, we introduce a simple but effective protocol to assess the data efficiency of task-oriented semantic parsers in production settings. Depicted in Figure 1, we design our protocol to relate target subset (%) vs. exact match (%); put another way, we approximate the number of in-domain samples (from a target domain) required to achieve a quality bar. Our protocol requires four steps: (1) Uniformly sampling subsets from a target domain, each with different cardinality (e.g., 10%, 20%, 30%, etc.); (2) Fine-tuning parsers on a mix of source domain samples (out-of-domain) and subsetted, target domain samples (in-domain), then recording in-domain exact match; (3) Smoothly approximating the target subset (%) vs. exact match (%) curve with least squares regression; and (4) For a quality target of interest (e.g., exact match of 90%), computing the corresponding number of samples to achieve it (e.g., target subset of 34.03%).

We develop our protocol to be flexible in hyperparameters yet opinionated in design, seeking to meet a range of practitioners’ needs along the theme of data efficiency. As such, we leverage our protocol by conducting both prescriptive and descriptive case studies on model generalization and intent complexity, respectively. Our first case study, model generalization, compares the data efficiency of multiple task-oriented semantic parsers. Our second case study, intent complexity, evaluates the correlation between data efficiency and intent complexity (i.e., how challenging the intent is to model). Across both studies, results show that our protocol can be used as-is with minimal study-specific processing, illustrating its real-world applicability to practitioners in task-oriented semantic parsing.

2 Data Efficiency Protocol

Our goal is to devise a model-based data efficiency protocol which can complete the following statement: Model $m$ requires $x\%$ of samples from target domain $d$ for fine-tuning in order to achieve $y\%$ exact match. Here, we make an important distinction between source and target domains: source domains represent high-resource settings (10K-100K samples) while target domains represent low-resource settings (1K-10K samples). While we assume access to source domains for the purposes of bootstrapping, we are primarily interested in quantitatively measuring our parsers’ abilities to generalize to target domains.

Overview.

We begin with a high-level overview of our data efficiency protocol; Figure 1 shows examples of our protocol’s outputs. Here, we create 8 randomly-sampled subsets consisting of $x$% of data from a target domain, mimicking low-resource settings where in-domain data is scarce. We then fine-tune our parsers on a combination of source and target samples, once for each subset, then record exact match scores on the test set of the target domain. The resulting discrete plot (top) resembles a logarithmic curve: at small subset sizes, exact match improves, but at large subset sizes, exact match largely saturates. Using this intuition, we fit a polynomial function to the $(x, y)$ coordinates and subsequently obtain a continuous plot (bottom) approximating the target subset (%) vs. exact match (%) relationship.

This ultimately allows us to complete our statement; we can now query arbitrary target subset sizes required to achieve pre-defined exact match targets. For example, using our plot, we can deduce we roughly require 2.14% of target domain data in order to achieve 80% exact match.

Methodology.

In summary, our protocol consists of four steps:

  1. For a target domain, using a random sampling algorithm, create various training subsets, each with a different cardinality (e.g., 10%).

  2. Fine-tune parsers on a concatenation of source domain samples (out-of-domain) and subsetted, target domain samples (in-domain). Then, report their exact match performance on the test set of the target domain. (Here, source domain samples are largely used for bootstrapping and do not overlap with target domain samples.)

  3. Obtain a smooth, continuous approximation of the target subset (%) vs. exact match (%) curve, as the previous step only logs discrete points.

  4. For a pre-defined exact match (%) target (e.g., 90%), compute the corresponding target subset (%) required to achieve it.

Below, we elaborate on the technical details behind each step of our protocol.

2.1 Background

In task-oriented semantic parsing, we are typically given a dataset $D$ comprised of the union of multiple sub-datasets $\{D^{(1)}, \cdots, D^{(d)}\}$, where each dataset $D^{(i)} = \{(x^{(j)}, y^{(j)})\}_{j=1}^{n}$ consists of utterance/frame pairs from domain $i$. TOPv2 Chen et al. (2020), a popular open-source dataset, consists of the alarm, event, messaging, music, navigation, reminder, timer, and weather domains.

2.2 Sampling Target Subsets

Recall that we would like to assess the data efficiency of a parser in a target domain once it has been bootstrapped on non-target domains, or simply put, source domains. To achieve this, we create a “data efficiency curve” (as depicted in Figure 1) demonstrating how well a parser extends to a cross-domain setting upon being fine-tuned on increasingly larger subsets of target domain data.

Random Sampling Algorithm.

We create these subsets by randomly sampling utterance/frame pairs from the target domain. Specifically, our random sampling algorithm is a function $f$ which creates a subset $S^{(k_i)}_d = f(D^{(d)}, k_i)$ with size $k_i$. We parameterize $f$ as a uniform algorithm which selects $k_i\%$ of samples from $D^{(d)}$ without replacement. This algorithm is simple and well-defined: it requires minimal operations to implement and we can easily compute subset sizes as $|S^{(k_i)}_d| = \frac{k_i}{100} \times |D^{(d)}|$. See Section 3 for an in-depth discussion on the advantages and disadvantages of uniform sampling.
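As a concrete reference, here is a minimal Python sketch of the uniform sampler $f$, assuming target domain data arrives as a list of utterance/frame pairs (the function name and seeding scheme are ours, for illustration):

```python
import random

def uniform_sample(domain_data, k_percent, seed=0):
    """Select k% of utterance/frame pairs uniformly, without replacement."""
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    size = round(len(domain_data) * k_percent / 100)
    return rng.sample(domain_data, size)
```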

Selecting Subset Sizes.

Using our random sampling algorithm $f$, we create $n$ subsets $\{S^{(k_1)}_d, \cdots, S^{(k_n)}_d\}$ for training and evaluation (discussed next). If $n$ is too small, we may not have enough points to estimate a data efficiency curve, but if $n$ is too large, our protocol will become cost-prohibitive as fine-tuning typically requires multiple GPUs. To set $n$ and $(k_1, \cdots, k_n)$ precisely, we lean on intuition from the previous section; because exact match logarithmically improves as subsets increase in cardinality Chen et al. (2020); Desai et al. (2021), we can precisely capture data efficiency by selecting subset sizes along such a curve.

We develop a heuristic which empirically works well for a range of domains. We set $n = 10$ such that $k_1 = 0$ and $k_{10} = 100$; put another way, we sample 10 subsets where the 1st and 10th subsets have 0% and 100% of target domain data, representing the lower and upper bounds, respectively. The remaining 8 subsets are spaced out along a logarithmic curve to mimic the typical characteristics of cross-domain generalization.

Formalizing this process, we seek to build a function $g(x)$ with domain $x \in [1, 10]$ and range $g \in [0, 100]$ such that $g$ is spaced out logarithmically and $k_i = \lceil g(i) \rceil$ for ease of use. We can use the generic function $g(x) = a^{x-b} + c$ as a template with the additional constraints that $g(1) = 0$ (0% subset) and $g(10) = 100$ (100% subset). With some algebra, we can solve for the unknown variables, building the following function:

$g(x) = (\sqrt[9]{101})^{x-1} - 1$ (1)

We can now determine $k_{1:10}$ easily by evaluating $g$ over its domain; Table 1 shows the inputs and outputs of this function for reference. Therefore, using uniform sampling, we create 10 subsets with 0%, 1%, 2%, 4%, 7%, 12%, 21%, 36%, 60%, and 100% of target data.

$x$  $g(x) = a^{x-b} - 1$  $\lceil g(x) \rceil$
1  0.00  0
2  0.67  1
3  1.79  2
4  3.66  4
5  6.78  7
6  11.99  12
7  20.69  21
8  35.22  36
9  59.48  60
10  100.00  100
Table 1: We create an exponential function $g(x) = (\sqrt[9]{101})^{x-1} - 1$ (where $a = \sqrt[9]{101}$ and $b = 1$) to determine target subset sizes for uniform sampling. Note that we use the ceiling function to discretize the output space of our function.
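The subset size schedule is easy to reproduce; below is a small Python sketch of Equation 1 (variable names are ours) which recovers the sizes listed in Table 1:

```python
import math

def g(x, n_subsets=10, max_pct=100):
    """g(x) = (max_pct + 1)^((x - 1) / (n_subsets - 1)) - 1, so g(1) = 0 and g(10) = 100."""
    return (max_pct + 1) ** ((x - 1) / (n_subsets - 1)) - 1

subset_sizes = [math.ceil(g(x)) for x in range(1, 11)]
# subset_sizes == [0, 1, 2, 4, 7, 12, 21, 36, 60, 100]
```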

2.3 Training and Evaluation

Having covered how subsets of target domain data are created, we now describe how to populate the discrete points of a data efficiency curve. Because we plot target subset (%) vs. exact match (%), as shown in Figure 1, we independently fine-tune the same parser $n$ times on train/eval data—a mix of source and (subsetted) target data—then subsequently report its exact match on the target domain’s corresponding test set.

Using the notation from the previous section, for each target subset $S^{(i)}_d$, we create training and evaluation datasets $\{D^{(d')} \mid d' \neq d\} + \{S^{(i)}_d\}$ for fine-tuning. Here, note that although we homogenize source and target data, we are only interested in target domain test performance, as the source domain largely acts as a bootstrap for cross-domain generalizability.
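In code, assembling each fine-tuning dataset reduces to a concatenation; a minimal sketch, assuming datasets is a dict mapping domain names to lists of utterance/frame pairs (the names are ours):

```python
def build_finetuning_data(datasets, target_domain, target_subset):
    """Concatenate all source (non-target) domains with the sampled target subset."""
    source = [
        example
        for domain, data in datasets.items()
        if domain != target_domain
        for example in data
    ]
    return source + target_subset
```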

Once we perform fine-tuning $n$ times, once for each target subset, we create a discrete plot with two axes where the $x$-axis represents target subset (%) and the $y$-axis represents exact match (%). For example, in Figure 1, we see that the parser achieves 70% EM with 1% of target data but quickly improves to 82% EM with 3% of target data.

2.4 Continuous Approximation

The discrete plot gives a coarse picture of data efficiency, but the target subset (%) vs. exact match (%) relationship is only defined for the subset sizes we previously selected. To define this relationship for all possible subset sizes in a cost-effective fashion, one possible solution is creating a continuous plot; because we have numerous, logarithmically-spaced points, we can fit a continuous function on these points which effectively “fills in the gaps”. Note that although this solution is approximate rather than exact, we empirically find our method accounts for variance between fine-tuning runs and is reasonably accurate due to our heuristic selection of initial subset sizes.

For curve fitting, we select the general-purpose polynomial function $h(x) = \frac{a}{x^b} + c$ with parameters $\theta = [a, b, c]$ for its simplicity and flexibility. We use scipy.optimize.curve_fit, which learns $\theta$ by minimizing a least squares objective. For reference, Figure 1 shows a comparison of a discrete and continuous plot.
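The fitting step is a direct call to SciPy; a sketch with illustrative (not actual) exact match values, skipping the 0% point since $h$ is undefined at $x = 0$:

```python
import numpy as np
from scipy.optimize import curve_fit

def h(x, a, b, c):
    """General-purpose curve: h(x) = a / x^b + c."""
    return a / np.power(x, b) + c

# Discrete points: target subset (%) vs. exact match (%); EM values are illustrative.
xs = np.array([1, 2, 4, 7, 12, 21, 36, 60, 100], dtype=float)
ys = np.array([70.5, 76.0, 80.9, 84.2, 86.8, 88.6, 90.1, 91.2, 92.0])

theta, _ = curve_fit(h, xs, ys, p0=[-30.0, 0.5, 95.0])  # initial guess aids convergence
a, b, c = theta
```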

2.5 Samples vs. EM

Once we have learned a continuous function $h(x)$ with parameters $\theta$ over our discrete points, we can now achieve our original goal; that is, evaluating the number of samples required to achieve a certain exact match. We create an inverse continuous function $h^{-1}(y) = (\frac{y - c}{a})^{\frac{-1}{b}}$ and pass in pre-defined exact match targets of interest (e.g., $y = 90$ for 90% exact match). The output of $h^{-1}(y)$ can be interpreted as the $x$ such that $h(x) = y$, or in other words, the target subset (%) which results in an exact match (%). Furthermore, optionally, we can recover the exact number of samples here ($\frac{x}{100} \times |D^{(d)}|$) as target subsets are uniformly sampled.

Figure 1 illustrates the procedure described above. By fitting our polynomial function to the discrete points, we end up with the canonical function $h(x) = \frac{-27.26}{x^{0.35}} + 97.79$ and inverse function $h^{-1}(y) = (\frac{y - 97.79}{-27.26})^{\frac{-1}{0.35}}$. From here, we can plug in pre-defined $y$ values of interest; for example, $h^{-1}(80) \approx 3.33$ and $h^{-1}(90) \approx 34.03$ as shown by the dotted red lines in the bottom plot.
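Inverting the fitted curve is one line of algebra; a sketch using the rounded parameters above (outputs differ slightly from the 3.33 and 34.03 quoted in the text, which come from unrounded fits):

```python
def h_inverse(y, a, b, c):
    """Solve h(x) = y for x:  x = ((y - c) / a) ** (-1 / b)."""
    return ((y - c) / a) ** (-1.0 / b)

a, b, c = -27.26, 0.35, 97.79  # rounded parameters from Figure 1
print(h_inverse(80, a, b, c))  # ~3.4% of target data for 80% EM
print(h_inverse(90, a, b, c))  # ~35.8% of target data for 90% EM
```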

3 Uniform vs. SPIS

One important design decision we make in our protocol is using a uniform algorithm—as opposed to samples per intent slot (SPIS) Chen et al. (2020)—to sample target subsets. Similar to the uniform algorithm, the SPIS algorithm is a function parameterized by a sizing parameter $k$. However, instead of selecting a percentage of samples from $D^{(d)}$, SPIS builds up a subset $S^{(k_i)}_d$ which consists of at least $k_i$ occurrences of each ontology label (i.e., intent or slot).
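For contrast, here is one common greedy realization of SPIS-style sampling; this is a sketch under our own assumptions, and get_labels, which extracts the intent and slot labels from a frame, is hypothetical:

```python
import collections
import random

def spis_sample(domain_data, k, get_labels, seed=0):
    """Greedily build a subset with at least k occurrences of each intent/slot label."""
    rng = random.Random(seed)
    shuffled = rng.sample(domain_data, len(domain_data))
    counts = collections.Counter()
    subset = []
    for example in shuffled:
        labels = get_labels(example)
        if any(counts[label] < k for label in labels):  # still needed for coverage
            subset.append(example)
            counts.update(labels)
    return subset
```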

We discuss a couple of disadvantages of the SPIS algorithm. First, because SPIS-based subset sizes are governed by label occurrences, their sizes are dynamic as opposed to static. It is not immediately clear how large a SPIS-based subset is unless it is explicitly computed; this is not the case with uniform-based subsets, as their size only depends on the total number of samples per domain. Second, also due to the label occurrence constraint, it is often challenging to select subset sizes $(k_1, \cdots, k_n)$ for fine-tuning. Some domains are very small (10-100 samples) while other domains are very large (1K-10K samples), so we would need to choose subset sizes depending on a domain’s characteristics; for example, 1000 SPIS would be too large of a subset size to use in a small domain.

However, unlike SPIS, uniform sampling does not guarantee coverage over the entire ontology space, as it only samples utterance/frame pairs according to empirical frequency. This is an important consideration for closed-set domain adaptation where the output space must stay static Chen et al. (2020). Nonetheless, for our purposes, the characteristics of uniform sampling are a “feature” rather than a “bug”, as it creates subsets which more naturally reflect the underlying data distribution.

4 Experimental Setup

For the remainder of this paper, we shift towards leveraging our data efficiency protocol in a series of experiments, each investigating prescriptive or descriptive hypotheses practitioners may pose when building task-oriented assistants.

We experiment with a range of task-oriented semantic parsing models when conducting case studies. Each model is a seq2seq transformer with a pointer-generator-based decoder and relies on either autoregressive (AR) or non-autoregressive (NAR) generation (refer to Shrivastava et al. (2021) for training details and model hyperparameters):

BART AR.

BART is a seq2seq transformer combining a transformer encoder and autoregressive transformer decoder, and is pre-trained with a denoising autoencoder objective on monolingual corpora Lewis et al. (2020). For task-oriented semantic parsing, Aghajanyan et al. (2020) shows BART achieves state-of-the-art EM on multiple datasets.

RoBERTa NAR.

Unlike autoregressive parsers, RoBERTa NAR makes strong independence assumptions during decoding, using the mask-predict algorithm to enable non-autoregressive generation Ghazvininejad et al. (2019). We use the framework outlined in Babu et al. (2021), creating a seq2seq transformer with a RoBERTa encoder, an MLP length module, and a non-autoregressive, randomly-initialized transformer decoder (1L, 768H, 16A).

RoBERTa Span Pointer.

Compared to RoBERTa NAR, RoBERTa Span Pointer Shrivastava et al. (2021) optimizes the frame representation and model architecture to be span-based, while still relying on non-autoregressive decoding. Specifically, utterance spans are represented as index-based endpoints in frame slots (e.g., [i, j]), and the transformer decoder is modified accordingly to place a distribution over indices rather than words during generation.
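To make the reformulation concrete, a hypothetical example of the two frame formats (the utterance and its frame are ours, for illustration):

```python
# Utterance (token indices):  "what's(0) the(1) weather(2) in(3) san(4) francisco(5)"
#
# Text-based frame (RoBERTa NAR): slots copy utterance text.
#   [IN:GET_WEATHER [SL:LOCATION san francisco ] ]
#
# Span-based frame (RoBERTa Span Pointer): slots hold index-based endpoints,
# so the decoder predicts indices rather than words.
#   [IN:GET_WEATHER [SL:LOCATION 4 5 ] ]
```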

Figure 2: Data efficiency of various production-ready, transformer-based seq2seq parsers on the weather (top) and reminder (bottom) domains. The RoBERTa Span Pointer model is the most generalizable, achieving 90% exact match with 30.67% target data on weather and 80% exact match with 33.47% target data on reminder.
Figure 3: Demonstrating the variance of our model generalizability results by fine-tuning RoBERTa Span Pointer parsers with {0, 1, 2} random seeds for the weather (top) and reminder (bottom) domains. Despite fluctuations in discrete values, the continuous plots are tightly bounded together, indicating the robustness of our protocol.

5 Case Study #1: Model Generalizability

Our first case study concerns model generalization: How data efficient are production-ready, task-oriented semantic parsing models? Because crowdsourcing annotations for new domains is time-consuming and cost-prohibitive, practitioners may instead opt for using data efficient parsers which perform well given a limited amount of in-domain data Chen et al. (2020); Ghoshal et al. (2020); Desai et al. (2021). However, given the recent explosion in the number of transformer-based semantic parsers, it can be daunting to select the one which is most generalizable. Our protocol can naturally provide an answer: by bootstrapping a parser on high-resource, source data, then exposing it to increasingly larger subsets of low-resource, target data, we can evaluate generalizability in a data-driven fashion.

Following prior work Chen et al. (2020); Ghoshal et al. (2020), we use the weather and reminder domains of TOPv2 as (independent) target domains; these domains are highly distinct, as they vary across multiple axes (e.g., utterance length, ontology size, frame nesting) Desai et al. (2021).

Domain: Music
IN:ADD_TO_PLAYLIST_MUSIC open
IN:CREATE_PLAYLIST_MUSIC open
IN:DISLIKE_MUSIC closed
IN:LIKE_MUSIC closed
IN:LOOP_MUSIC closed
IN:PAUSE_MUSIC closed
IN:PLAY_MUSIC semi
IN:PREVIOUS_TRACK_MUSIC none
IN:REMOVE_FROM_PLAYLIST_MUSIC semi
IN:REPLAY_MUSIC closed
IN:SET_DEFAULT_PROVIDER_MUSIC closed
IN:SKIP_TRACK_MUSIC closed
IN:START_SHUFFLE_MUSIC closed
IN:STOP_MUSIC closed
Domain: Messaging
IN:CANCEL_MESSAGE none
IN:GET_MESSAGE semi
IN:IGNORE_MESSAGE none
IN:REACT_MESSAGE closed
IN:SEND_MESSAGE open
Domain: Reminder
IN:CREATE_REMINDER open
IN:DELETE_REMINDER open
IN:GET_RECURRING_DATE_TIME semi
IN:GET_REMINDER open
IN:GET_TODO open
IN:SEND_MESSAGE open
IN:UPDATE_REMINDER open
IN:UPDATE_REMINDER_DATE_TIME open
Domain: Timer
IN:ADD_TIME_TIMER closed
IN:CREATE_TIMER closed
IN:DELETE_TIMER none
IN:GET_TIME semi
IN:GET_TIMER none
IN:PAUSE_TIMER none
IN:RESTART_TIMER none
IN:RESUME_TIMER none
IN:SUBTRACT_TIME_TIMER closed
IN:UPDATE_TIMER closed
Domain: Weather
IN:GET_SUNRISE semi
IN:GET_SUNSET semi
IN:GET_WEATHER semi
Table 2: Intent complexity annotations for the music, messaging, reminder, timer, and weather domains following the complexity classes none, closed, (semi)-open, and open.

5.1 Results

Figure 2 shows data efficiency plots for models fine-tuned on the weather and reminder domains. We highlight a couple of key trends:

Autoregressive parsing is more data efficient than canonical non-autoregressive parsing.

We see that BART AR is more data efficient than RoBERTa NAR; on weather, to achieve 90% EM, BART AR requires 32.85% of target data while RoBERTa NAR requires 36.90%, and on reminder, to achieve 70% EM, BART AR requires 8.46% of target data while RoBERTa NAR requires 13.24%. Inspecting these results more closely, we find the length module in RoBERTa NAR is a major bottleneck for generalization; because of strong conditional independence assumptions during non-autoregressive decoding, this parser must first predict the length of the frame to later infill, which can be challenging in a few-shot setting. In contrast, autoregressive parsing does not require an intermediate step and is therefore much simpler to extend cross-domain.

However, span-based, non-autoregressive parsing is highly data efficient.

RoBERTa Span Pointer, despite being a non-autoregressive parser, achieves the best data efficiency results compared to both BART AR and RoBERTa NAR. Recall that this model carries a different inductive bias as it reformulates parsing to be span-based: utterance spans in leaf arguments are represented as index-based endpoints rather than string-based text. This model’s length module no longer has to guess the length of leaf arguments beforehand and can instead focus on the (predictable) syntactic components (e.g., the number of ontology tokens); as a result, both RoBERTa Span Pointer’s length prediction accuracy and final exact match are substantially greater than RoBERTa NAR’s. Given RoBERTa Span Pointer’s strong data efficiency results, we recommend practitioners use this model in production settings.

Data efficiency results do not change much between fine-tuning runs.

Because our protocol is approximate rather than exact, one question we investigate is how stable our protocol’s results are between fine-tuning runs. Here, we are interested in two types of variation: the change in discrete points when different random seeds are used and, subsequently, the change in continuous curves fitted on each set of discrete points. We fine-tune the RoBERTa Span Pointer model with random seeds {0, 1, 2} and create data efficiency plots using our protocol; Figure 3 shows these results. For both the weather and reminder domains, we see that our data efficiency plots are quite similar across fine-tuning runs. Though the discrete points typically fluctuate by ±1 EM, as is expected due to different random initializations, the continuous curves are largely similar. These curves could be tighter given more discrete points, especially at larger target subset sizes, but this inevitably comes at the cost of more compute.

Figure 4: Comparing data efficiency with intent complexity for the music (top) and timer (bottom) domains. For each domain, we show a discrete plot for the four complexity classes: none, closed, (semi)-open, and open; empty plots imply the domain does not have an intent annotated with the corresponding complexity class.

6 Case Study #2: Intent Complexity

Our second case study concerns intent complexity: Given a rough estimate of how “complex” an intent is (from a modeling standpoint), do we see a correlation between complexity and data efficiency? When developing new domains for task-oriented assistants with an intent-driven methodology (here, the practitioner creates an intent such as IN:SEND_MESSAGE, enumerates its possible slots such as SL:CONTACT_NAME and SL:DATE_TIME, then obtains samples for modeling), practitioners have a range of tools at their disposal to obtain data—from crowdsourcing unique samples to paraphrasing existing samples—but it is often unclear how many samples a parser requires to achieve high quality. Here, the complexity of an intent, defined precisely below, can serve as a heuristic, as more challenging intents will subsequently require more data. However, this rests on the assumption that there is a correlation between complexity and data efficiency. Our protocol can provide a solution: once we create data efficiency plots per-intent, we can create an “average” plot representing the intents from each complexity class, then visually inspect the shape of these plots; if the assumption holds, we should see these plots shift towards lower exact match (%) scores as complexity increases.

6.1 Complexity

Definition.

We allude to the notion of “complexity” above, but we have not yet precisely defined this term. In our study, complexity is a rough notion of how difficult an ontology label is to model as judged by exact match. This notion is model-agnostic: we can expect parsers, as a whole, to struggle with certain types of labels (e.g., open-text slots), even if some parsers are comparatively more accurate than others. Specifically, we define four complexity classes: (1) none: variable, consisting of no values, but because out-of-domain intents fall under this category, they can be more challenging to model; (2) closed: easy, consisting of roughly 10 values; (3) semi-open: medium, consisting of named entities, date-times, or closed classes with roughly 100 values; and (4) open: hard, consisting of long free text.

Because our goal here is to correlate intent complexity with data efficiency, we primarily focus on annotating intents with closed, semi-open, and open classes. To do so, we make the assumption that the complexity of an intent is derived from the maximum complexity of its slots. This is not a particularly strong assumption, as intents and slots are strongly intertwined during modeling. For example, the intent IN:SEND_MESSAGE is “open” since its slot SL:CONTENT_EXACT is also “open”; this particular slot maps to constituents with wide syntactic and semantic variation, therefore the overarching intent is also challenging to model.
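Under this assumption, deriving an intent’s class is a one-line maximum over its slots; a minimal sketch (the ordering dict and function are ours):

```python
COMPLEXITY_ORDER = {"none": 0, "closed": 1, "semi": 2, "open": 3}

def intent_complexity(slot_classes):
    """Intent complexity = maximum complexity among the intent's slots."""
    if not slot_classes:
        return "none"
    return max(slot_classes, key=COMPLEXITY_ORDER.__getitem__)

# e.g., IN:SEND_MESSAGE with an "open" SL:CONTENT_EXACT slot is itself "open".
intent_complexity(["closed", "semi", "open"])  # -> "open"
```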

Annotation.

Using our definition of complexity, we select 5 TOPv2 domains for annotation: music, messaging, reminder, timer, and weather. We have 2 in-house linguists annotate each intent from each domain with its respective complexity class; however, intents with fewer than 10 occurrences are excluded from our analysis. Our linguists agree on an annotation guideline beforehand and also jointly resolve tricky cases, but for the purposes of quality estimation, both linguists blindly annotate intents in the weather domain; annotator agreement is perfect as measured by Krippendorff’s alpha Krippendorff (2004). Table 2 shows these results.

6.2 Results

We primarily experiment with the music and timer domains given that their intents span a wide range of complexity classes. To correlate data efficiency with intent complexity, we use our protocol on the RoBERTa Span Pointer parser to obtain discrete plots for each domain. Because we are interested in intent complexity, we additionally break down the target subset (%) vs. exact match (%) results by intent. Then, we group the intents in each domain by their complexity class and average their results to obtain a discrete plot for each complexity class. Figure 4 shows these results. For both the music and timer domains, we see a rough correspondence between data efficiency and intent complexity; specifically, as complexity increases, the discrete plots’ heads tend to flatten, indicating the parser performs worse with less data. Because TOPv2 has a limited number of intents, it is challenging to create guidelines using these results, but our breakdown helps give a sense of how much in-domain data, on average, is required to achieve strong performance.
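The per-class averaging is straightforward; a sketch assuming each intent’s curve is a list of EM scores aligned to the shared subset sizes (the data structures are ours):

```python
from collections import defaultdict

def average_by_complexity(per_intent_em, intent_class):
    """Average per-intent EM curves within each complexity class.

    per_intent_em: dict mapping intent -> [EM at each shared subset size]
    intent_class:  dict mapping intent -> complexity class (as in Table 2)
    """
    grouped = defaultdict(list)
    for intent, curve in per_intent_em.items():
        grouped[intent_class[intent]].append(curve)
    return {
        cls: [sum(col) / len(col) for col in zip(*curves)]
        for cls, curves in grouped.items()
    }
```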

7 Conclusion

We introduce a data efficiency protocol for task-oriented semantic parsing capable of producing discrete and continuous plots with target subset (%) vs. exact match (%); this gives us both exact and approximate results, respectively, illustrating the amount of in-domain data a parser requires to achieve a quality bar. To demonstrate its real-world applicability to practitioners, we leverage our protocol in two case studies: the first study compares the data efficiency of production-ready parsers while the second study correlates data efficiency with intent complexity for the purposes of developing heuristics for data collection.

References

  • Aghajanyan et al. (2020) Armen Aghajanyan, Jean Maillard, Akshat Shrivastava, Keith Diedrick, Michael Haeger, Haoran Li, Yashar Mehdad, Veselin Stoyanov, Anuj Kumar, Mike Lewis, and Sonal Gupta. 2020. Conversational Semantic Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Babu et al. (2021) Arun Babu, Akshat Shrivastava, Armen Aghajanyan, Ahmed Aly, Angela Fan, and Marjan Ghazvininejad. 2021. Non-Autoregressive Semantic Parsing for Compositional Task-Oriented Dialog. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
  • Chen et al. (2020) Xilun Chen, Asish Ghoshal, Yashar Mehdad, Luke Zettlemoyer, and Sonal Gupta. 2020. Low-Resource Domain Adaptation for Compositional Task-Oriented Semantic Parsing. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Desai and Aly (2021) Shrey Desai and Ahmed Aly. 2021. Diagnosing Transformers in Task-Oriented Semantic Parsing. In Findings of the Association for Computational Linguistics: ACL 2021.
  • Desai et al. (2021) Shrey Desai, Akshat Shrivastava, Alexander Zotov, and Ahmed Aly. 2021. Low-Resource Task-Oriented Semantic Parsing via Intrinsic Modeling. arXiv preprint arXiv:2104.07224.
  • Einolghozati et al. (2018) Arash Einolghozati, Panupong Pasupat, Sonal Gupta, Rushin Shah, Mrinal Mohit, Mike Lewis, and Luke Zettlemoyer. 2018. Improving Semantic Parsing for Task-Oriented Dialog. In Proceedings of the Conversational AI Workshop.
  • Ghazvininejad et al. (2019) Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-Predict: Parallel Decoding of Conditional Masked Language Models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and the International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  • Ghoshal et al. (2020) Asish Ghoshal, Xilun Chen, Sonal Gupta, Luke Zettlemoyer, and Yashar Mehdad. 2020. Learning Better Structured Representations using Low-Rank Adaptive Label Smoothing. In Proceedings of the International Conference on Learning Representations (ICLR).
  • Gupta et al. (2018) Sonal Gupta, Rushin Shah, Mrinal Mohit, Anuj Kumar, and Mike Lewis. 2018. Semantic Parsing for Task Oriented Dialog using Hierarchical Representations. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
  • Krippendorff (2004) Klaus Krippendorff. 2004. Content Analysis: an Introduction to its Methodology. Sage: Thousand Oaks, CA.
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
  • Pasupat et al. (2019) Panupong Pasupat, Sonal Gupta, Karishma Mandyam, Rushin Shah, Michael Lewis, and Luke Zettlemoyer. 2019. Span-based Hierarchical Semantic Parsing for Task-Oriented Dialog. In Proceedings of the Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  • Shrivastava et al. (2021) Akshat Shrivastava, Pierce Chuang, Arun Babu, Shrey Desai, Abhinav Arora, Alexander Zotov, and Ahmed Aly. 2021. Span Pointer Networks for Non-Autoregressive Task-Oriented Semantic Parsing. arXiv preprint arXiv:2104.07275.