
Redwood: Using Collision Detection to Grow
a Large-Scale Intent Classification Dataset

Stefan Larson    Kevin Leach
Vanderbilt University
{firstname.lastname}@vanderbilt.edu
Abstract

Dialog systems must be capable of incorporating new skills via updates over time in order to reflect new use cases or deployment scenarios. Similarly, developers of such ML-driven systems need to be able to add new training data to an already-existing dataset to support these new skills. In intent classification systems, problems can arise if training data for a new skill’s intent overlaps semantically with an already-existing intent. We call such cases collisions. This paper introduces the task of intent collision detection between multiple datasets for the purposes of growing a system’s skillset. We introduce several methods for detecting collisions, and evaluate our methods on real datasets that exhibit collisions. To highlight the need for intent collision detection, we show that model performance suffers if new data is added in such a way that does not arbitrate colliding intents. Finally, we use collision detection to construct and benchmark a new dataset, Redwood, which is composed of 451 intent categories from 13 original intent classification datasets, making it the largest publicly available intent classification benchmark.



1 Introduction

As task-oriented dialog systems like Alexa and Siri have become more and more pervasive, tools enabling developers to build custom dialog systems have followed suit. Such tools—like Microsoft’s Luis (luis.ai), Twilio’s Autopilot (twilio.com/autopilot), Rasa (rasa.com), and Google’s DialogFlow (google.com/dialogflow)—enable engineers and dialog designers to craft dialog systems composed of intents, or core categories of competencies or skills in which the system is knowledgeable and to which the system can respond intelligently. New intents may be added periodically to the dialog system as part of its development and maintenance cycle, or dialog system models may be combined together (e.g., Clarke et al. (2022)).

These phenomena may occur especially in real-world deployments, where datasets for dialog models may be developed, grown, and modified by large (and even disparate) teams over the span of a project’s lifetime. Furthermore, dialog system models and their corresponding training datasets are sometimes offered as-a-service or “off-the-shelf” to dialog system builders who might not be fully familiar with the breadth or scope of the pre-existing dataset or model. If the builder adds a new intent to the dataset that overlaps with an existing intent, then the re-trained model’s performance can suffer. As such, there is a need for tools and algorithms to help detect when a new intent overlaps—that is, collides—with an already-existing intent category.

In this paper, we introduce the challenge of intent collision detection, and develop several algorithms for determining whether a candidate intent category collides with another intent category. To do so, we curate and release a meta-dataset of 722 intents from 13 existing datasets. This graph-like meta-dataset consists of annotations indicating pairs of colliding intents (examples of colliding intents can be seen in Table 1). We then introduce several collision detection algorithms and evaluate them on this meta-dataset.

We also use intent collision detection to build Redwood, a new intent classification dataset of 451 intent categories. Redwood is built by combining 13 smaller datasets. As a comparison, we also build Redwood-naïve, which is constructed by naïvely joining together all 13 datasets without arbitrating colliding intents. We find classifier performance on Redwood-naïve to be substantially worse than on Redwood, showcasing the negative effect of not addressing intent collisions in data.

Dataset | Samples

Snips | “how cold is it in princeton junction”, “will it be chilly in fiji at ten pm”, “is it foggy in shelter island”
Clinc-150 | “give me the 7 day forecast”, “what’s the temperature like in tampa”, “will it rain today”
MTOP | “what is the weather in new york today”, “how much is it going to rain tomorrow”, “give me the weather for march 13th”

Slurp | “set alarm tomorrow at 6 am”, “make an alarm for 4pm”, “set a wake up call for 10 am”
MTOP | “can you set a warning alarm for 7pm”, “set an alarm for monday at 5pm”, “make an alarm for the 5th”
Clinc-150 | “wake me up at noon tomorrow”, “set my alarm for getting up”, “i need you to set alarm for me”

HWU | “how much is 1gbp in usd”, “what’s the exchange rates”, “how much is $50 in pounds”
Clinc-150 | “tell me five dollars in yen and rubles”, “how many pesos in one dollar us”, “usd to yen is what right now”
Banking-77 | “do you know the rate of exchange”, “how is the exchange rate doing”, “what are the current exchange rates”

Clinc-150 | “please start calling me mandy”, “I want you to call me this new name”, “the name you should call me is janet”
ACID | “how do i change my name”, “need my name to be updated”, “I need to fix my name in your system”
Banking-77 | “where can I find how to change my name”, “details need to be modified after I got married”, “I need to change my name”

Snips | “play magic sam from the thirties”, “play music by blowfly from the seventies”, “play jeff pilson on youtube”
DSTC-8 | “I want to hear the song high”, “I would like to listen to touch it on tv”, “I’d like to listen to the way I talk”
HWU | “please play yesterday from beattles”, “I’d like to hear queen’s barcelona”, “play daft punk”

MetalWOz | “help me find restaurants in miami fl”, “I need help finding a place to eat”, “I need to find an italian restaurant in denver”
DSTC-8 | “can you help find a place to eat”, “I’m looking for a filipino place to eat”, “I want to find a restaurant in albany”
HWU | “find me a nice restaurant for dinner”, “where can I get shawarma in this area”, “what’s the best chicken place near me”

Outlier | “what is my balance”, “update me on my account balance”, “let me know how much money I have”
Clinc-150 | “what’s my current checking balance”, “what is the total of my bank accounts”, “how much total cash do I have in the bank”
DSTC-8 | “I want to know my checking account balance”, “I’d like to check my balance”, “man how much money do I have in the bank”
Table 1: Examples of data that will trigger collisions. Each row of the table displays three samples from a single intent in a particular dataset. Rows are grouped into sets of three; within each group, each row’s intent collides with the intents from the other two datasets.

Upon official release, Redwood will be by far the largest openly available intent classification dataset in terms of breadth of intent categories. Our hope is that the new Redwood dataset serves as a showcase for intent collision detection as well as a new, publicly available, large-scale challenge dataset for intent classification models for dialog systems. Both the collision meta-dataset and Redwood are publicly available at github.com/gxlarson/redwood.

2 Related Work

The Collision Detection Task.

We discuss three areas of work related to our proposed intent collision detection task: generalized zero-shot learning, open set classification, and out-of-domain (or out-of-scope) sample identification.

In generalized zero-shot learning (e.g., Zhang et al. (2022)), a model is trained with data from a set of “seen” label classes (e.g., intents) and, during inference, must identify test samples as belonging to either a “seen” label class or an “unseen” class for which the model has limited auxiliary knowledge (e.g., descriptions of unseen classes, but no concrete training examples).

Both open set classification and out-of-domain sample identification refer to the modeling task of classifying inference samples among label classes seen during training or identifying whether a sample belongs to an unknown or undefined label class (e.g., Larson et al. (2019b); Zhang et al. (2021)). Slot-filling models that are trained on B/I/O tags naturally predict the unknown class label as O tags, but for intent classifiers the task is much more challenging since it requires curating viable training data for an out-of-domain category (i.e., it is challenging to know in advance what types of out-of-domain inputs a system might encounter).

Our proposed task of intent collision detection differs from the aforementioned tasks because “inference” samples need not be considered one at a time, but can instead be grouped together into entire candidate intent categories. This enables considering entirely different modeling tasks like those discussed in Section 3.3. Nevertheless, both our meta-dataset of intent collisions and Redwood allow for the evaluation of both zero-shot and generalized zero-shot learning models, and the Redwood intent classification dataset includes a substantial number of out-of-domain samples for evaluating open set classification and out-of-domain sample detection.

Intent Classification Corpora.

There are several smaller corpora for evaluating intent classification models, some spanning broad domains (e.g., Liu et al. (2019), Larson et al. (2019b), Li et al. (2021)) and others focusing on fine-grained evaluation of individual domains (e.g., the Banking-77 corpus Casanueva et al. (2020) with respect to the personal banking domain). While most datasets are constructed via crowdsourcing, our new Redwood dataset is constructed from both (1) already existing datasets and (2) newly crowdsourced intents.

Dataset Derivation and Combination.

Datasets are sometimes formed from other datasets, either by deriving a new dataset from an existing one, or by combining datasets together. The former category includes translations of dialog datasets (e.g., Upadhyay et al. (2018); Xu et al. (2020)) as well as re-formulations of existing datasets into new tasks (e.g., converting a semantic role labeling (SRL) dataset to open information extraction (OIE) data as done in Solawetz and Larson (2021)).

Dataset combination has been used in other fields beyond dialog systems and conversational AI. For instance, Song et al. (2020) combined several speech recognition datasets together to form their SpeechStew dataset. As there are no target labels analogous to intents in automatic speech recognition, the creators of SpeechStew did not have to consider collisions among intent categories. In this paper, our focus is primarily on dataset combination, but we also derive intent classification data from several turn-based dialog corpora (MetalWOz and DSTC-8, discussed in Section 3.4).

3 Detecting Collisions

In this section we discuss our proposed challenge, intent collision detection. We begin with a motivating example showing why detecting collisions is important, as well as a formal problem statement. Then, we introduce and evaluate several collision detection baselines on our meta-dataset.

Figure 1: Transitive collisions.

3.1 Motivating Example

As a motivating example, suppose our intent classification system has been trained on the Clinc-150 dataset Larson et al. (2019b), an intent classification dataset consisting of 150 intents. (In this paper, dataset names are in italics and intent names are in teletype font. Example queries are in italics and in quotes if they appear in-line.) The Clinc-150 dataset includes an intent called weather, which is meant to handle weather-related queries such as “what’s the weather like today” and “tell me the weather in New York.” Suppose further that a new developer or a new team attempts to update the intent classifier with new data that contains a new intent category, such as the get_weather intent from the HWU dataset Liu et al. (2019). (Recall from Section 1 that such updates from new teams or new developers may be from routine perfective maintenance during a model’s lifetime.) In such a scenario, there are now training data samples that overlap substantially, but that are labeled with different intents (weather vs. get_weather in this example). Thus, upon updating the model by training on HWU’s get_weather data, the predictive performance on any weather-related inference queries might be split between these two intents. This disparity can also cause unintended consequences downstream in production models, such as calls to database systems that are triggered based on the user’s intent.

Indeed, when we train a BERT classifier on the original Clinc-150 training set, the accuracy on the weather test set is 100%. When we add HWU’s get_weather intent to Clinc-150 to create a new 151st intent and re-train the BERT classifier, we observe an accuracy score of 60% on the weather test set. This performance drop is a symptom of having added an intent category that collides with another intent category. Such a model—which was trained on colliding intents—could cause unexpected behavior on downstream events, especially if the weather and get_weather intents trigger different business logic workflows or system responses. We note that, while the colliding weather and get_weather intent names are quite similar in this example, other colliding pairs like Snips’ search_screening_event and MetalWOz’s movie_listings do not have lexically similar intent names, precluding straightforward string matching of intent names.

Figure 2: Non-transitive collisions.

3.2 Problem Statement

In this subsection, we formally define our collision detection problem. We first consider a scenario in which we have two intent classification datasets, $\mathcal{A}$ and $\mathcal{B}$, where $A_i \in \mathcal{A}$ and $B_j \in \mathcal{B}$ refer to specific intent categories in each. We say that intent categories $A_i$ and $B_j$ collide if there exist a sufficient number of queries in $A_i$ that semantically overlap with a sufficient number of queries in $B_j$. This semantic overlap becomes a problem when a developer attempts to add new intent categories to a starting training dataset—an intent classification model trained on the combined dataset $\mathcal{A} \cup \mathcal{B}$ will then cause queries belonging to $A_i$ to be classified as $B_j$ (and vice versa).

As an example, suppose we have an intent classifier built from a starting dataset such as Clinc-150, which, among other things, contains a weather intent category for weather-related inquiries (cf. Section 3.1). Suppose further that we seek to grow this starting dataset by adding datapoints from a candidate dataset such as HWU (see Section 3.1), which contains a get_weather intent category. If we naïvely combine these two datasets, the resulting intent classifier will classify some queries from the original weather category into the newly-added get_weather category because these two categories are semantically similar. Table 1 illustrates several example colliding intents and associated queries. Our approach addresses these collisions by detecting their prevalence and quantifying their impact automatically, aiding developers in improving the quality of their datasets and the scope of their dialogue systems.

Because the notion of semantic overlap can differ from category to category and dataset to dataset, we observe several classes of relationships among colliding intent categories in practice. In particular, intent collisions can be simple-pairwise, transitive, or hierarchical. In the simple-pairwise case, two intents collide with each other only, and not with any other intent in either dataset. However, we also observe transitivity within intent classes. Figure 1 illustrates example utterances within intent classes a, b, and c, where all intent classes are transitively related to one another in a cycle.

Lastly, we observe non-transitive hierarchies among colliding intents. In this case, a broad intent category from one dataset can collide with two or more intent categories that do not relate to each other. Figure 2 shows a hypothetical intent class x consisting of general banking queries, including balance inquiries and transfer requests, while classes y and z consist solely of balance inquiries and transfer requests, respectively. Here, because class x is broader than y and z, each of y and z collides with x, but y and z do not collide with each other. Our approach can help developers reveal such cases when managing datasets, and we consider these collision relationships in the creation of our Redwood dataset.

3.3 Approaches

We introduce two approaches for detecting collisions: Classifier Confusion and Data Coverage.

Classifier Confusion.

A column of a confusion matrix charts the distribution of predictions of a classifier for data in a particular category. We call such a distribution the classification distribution. We adapt this notion for our first collision detection approach, which identifies a candidate intent $A$ as colliding with an intent $B \in \mathcal{C}$ if a classifier model trained on dataset $\mathcal{C}$ produces a classification distribution $d$ for the queries in $A$ such that $\frac{\max(d)}{\mathrm{sum}(d)} > \tau$, where $\tau$ is a threshold set by the developer. We call this ratio the classifier confusion score.
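As a concrete illustration, the following is a minimal sketch of how such a confusion score could be computed, assuming scikit-learn. The bag-of-words features and linear SVM mirror the setup described later in Section 3.5, but the function and variable names here are illustrative and are not part of our released code.

```python
# Sketch: classifier confusion score for a candidate intent (assumes scikit-learn).
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def classifier_confusion_score(existing_data, candidate_queries):
    """existing_data: dict mapping intent name -> list of training queries (dataset C).
    candidate_queries: list of queries belonging to the candidate intent A.
    Returns (score, top_intent); a score near 1.0 suggests A collides with top_intent."""
    texts, labels = [], []
    for intent, queries in existing_data.items():
        texts.extend(queries)
        labels.extend([intent] * len(queries))

    # Bag-of-words features with a linear SVM (cf. Section 3.5).
    clf = make_pipeline(CountVectorizer(), LinearSVC())
    clf.fit(texts, labels)

    # Classification distribution d over existing intents for the candidate's queries.
    d = Counter(clf.predict(candidate_queries))
    top_intent, top_count = d.most_common(1)[0]
    return top_count / sum(d.values()), top_intent  # max(d) / sum(d)
```

A developer would then flag a collision whenever the returned score exceeds the chosen threshold $\tau$.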

Dataset # Intents # Collisions
ACID 175 36
Clinc-150 150 158
MTOP 113 60
Banking-77 77 25
HWU 64 103
New 58 5
MetalWOz 51 80
DSTC-8 34 67
ATIS 26 7
Outlier 10 9
Snips 7 20
Jobs640 1 0
Talk2Car 1 0
Total 767 570
Table 2: Number of intents with collisions. A total of 570 intents have at least one collision.

Data Coverage.

We define the coverage of one intent BB over another intent AA as

$$\text{Coverage}(A,B) = \frac{1}{|B|} \sum_{b \in B} \max_{a \in A} \text{sim}(a,b).$$

Here, $\text{sim}(a,b)$ computes the similarity between two phrases $a$ and $b$ (for instance, $\text{sim}(a,b)$ could be the cosine similarity between two phrase embeddings or the Jaccard similarity between n-gram sets). The coverage metric can be used to detect if two intents collide using a threshold rule. In other words, $A$ and $B$ collide if $\text{Coverage}(A,B) > \kappa$, where $\kappa$ is a threshold chosen by the developer. We call the coverage metric the coverage score.
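The coverage score itself is straightforward to compute once a similarity function is chosen. Below is a minimal sketch with a pluggable sim function; the function names are illustrative, and the threshold is left to the developer as described above.

```python
# Sketch: coverage of intent B by intent A, with a pluggable similarity function.
# sim(a, b) should return a similarity in [0, 1], e.g., cosine similarity between
# sentence embeddings or the n-gram Jaccard similarity defined in Section 3.5.
def coverage(intent_a, intent_b, sim):
    """intent_a, intent_b: lists of query strings; returns Coverage(A, B)."""
    return sum(max(sim(a, b) for a in intent_a) for b in intent_b) / len(intent_b)


def collides(intent_a, intent_b, sim, kappa):
    """Threshold rule: flag a collision when Coverage(A, B) exceeds kappa."""
    return coverage(intent_a, intent_b, sim) > kappa
```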

Figure 3: Example entries in the graph-like collision meta-dataset, showing collisions for Clinc-150’s weather intent and HWU’s general_praise intent.

3.4 Datasets

We evaluate the effectiveness of our intent collision approaches using several indicative datasets. These datasets can be roughly grouped into three categories: (1) intent classification datasets like Clinc-150 Larson et al. (2019b), Banking-77 Casanueva et al. (2020), ACID Acharya and Fung (2020), Outlier Larson et al. (2019a), and New (this work; a corpus that was crowdsourced in a manner similar to Larson et al. (2019b) and Larson et al. (2019a)); (2) joint slot-filling and intent classification or semantic parsing datasets like ATIS Hemphill et al. (1990); Hirschman et al. (1992, 1993); Dahl et al. (1994), Snips Coucke et al. (2018), HWU Liu et al. (2019), and MTOP Li et al. (2021); and (3) turn-based dialog datasets like DSTC-8 Kim et al. (2019) and MetalWOz Lee et al. (2019). We only consider the initial queries in the turn-based DSTC-8 and MetalWOz, and discard all subsequent dialog turns.

Queries in these datasets span a wide range of topic domains, including banking and personal finance (Banking-77 and Outlier) and insurance (ACID); other datasets cover a wide array of topic domains, such as Clinc-150 and HWU, which cover smart home, automotive, travel, banking, cooking, and others. Since we are concerned with detecting colliding intents, we do not consider any slot annotations, and we use only the first turns from the multi-turn dialog datasets. In addition, we also use the Jobs640 Califf and Mooney (1997) and Talk2Car Deruyttere et al. (2019) datasets, which, although not originally designed for intent classification tasks, are categorized in a way that admits treating each as a single-intent dataset for our purposes. Table 2 summarizes these datasets.

The Collision Meta-Dataset.

We constructed a graph-like dataset that indicates the collision relationships between intents. To build this dataset, we reviewed all intents from all of the datasets listed in Table 2 to check for collisions with intents from the other datasets. We developed a ground truth set of tuples indicating whether two intents collide among these datasets. Figure 3 shows the structure of the intent collision meta-dataset, and Table 2 displays the number of collisions that occur relative to each individual dataset. The meta-dataset includes the three types of collisions defined in Section 3.2.

3.5 Experimental Evaluation

Implementation Details.

We evaluate our intent collision detection methods on our newly-created collision meta-dataset. For evaluating the classifier confusion approach, we train a multi-class intent classifier on each individual dataset (except the single-intent datasets) and then run inference on all other intents from the other datasets. We compute and report the classifier confusion score for each run. In our experiments, we use a linear SVM classifier with bag-of-words feature representations.

For evaluating the data coverage approach, we first sample a nearly equal number of colliding and non-colliding intent pairs from the collision meta-dataset (sampling avoids a combinatorial explosion of possible intent pairs). We then compute the coverage scores for the selected pairs using several sentence representation and similarity metrics. We use the SBERT library’s SBERT-NLI and SBERT-miniLM sentence embedders Reimers and Gurevych (2019) along with cosine similarity. Additionally, we use n-gram-based similarity, defined as

$$\text{sim}(a,b) = \frac{1}{N} \sum_{n=1}^{N} \frac{|\text{$n$-grams}_a \cap \text{$n$-grams}_b|}{|\text{$n$-grams}_a \cup \text{$n$-grams}_b|}$$

where $a$ and $b$ are queries from two intents, and $N=3$ in our experiments.
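For reference, a minimal implementation of this n-gram similarity could look as follows; the simple whitespace tokenization here is an assumption on our part, and the function plugs directly into the coverage computation sketched in Section 3.3.

```python
# Sketch: n-gram Jaccard similarity averaged over n = 1..N (N = 3 in our experiments).
def ngram_set(text, n):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_similarity(a, b, N=3):
    total = 0.0
    for n in range(1, N + 1):
        grams_a, grams_b = ngram_set(a, n), ngram_set(b, n)
        union = grams_a | grams_b
        total += len(grams_a & grams_b) / len(union) if union else 0.0
    return total / N
```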

For both the data coverage and classifier confusion experiments, we only consider intents that have at least 10 queries. For the data coverage experiments, we used all 285 collision pairs and sampled 300 non-colliding pairs, since there are substantially more non-colliding pairs. The classifier confusion approach does not compare intents in a pairwise manner, and instead compares a dataset (i.e., a classifier trained on a dataset) against a single intent at a time. We ran a classifier on all multi-intent datasets, which yielded a total of 400 collision pairs and 6,802 non-collision pairs for the classifier confusion experiments.

Figure 4: Data coverage and classifier confusion score distributions for various intent collision detection approaches: (a) SBERT-NLI coverage score, (b) Mini-LM coverage score, (c) SVM classifier confusion.

Metrics.

While in actual application settings a user may wish to use thresholds for $\tau$ and $\kappa$ (defined earlier in Section 3.3) to determine whether intents collide, we evaluate both the classifier confusion and coverage methods in a threshold-free manner using the AUC score. (In practice, values for $\tau$ and $\kappa$ could be set by the practitioner via cross-validation or by using the meta-dataset provided in this work to set optimal thresholds for their application.) The AUC score allows us to judge each method’s ability to distinguish collisions versus non-collisions; an AUC score of 1.0 means perfect separability between collisions and non-collisions, while an AUC score of 0.5 means a method is unable to distinguish between colliding and non-colliding intents.
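Concretely, this threshold-free evaluation amounts to computing the AUC over per-pair scores against the meta-dataset’s collision labels. A minimal sketch, assuming scikit-learn and illustrative variable names, is shown below.

```python
# Sketch: threshold-free evaluation of a collision detector via AUC (assumes scikit-learn).
from sklearn.metrics import roc_auc_score

# scores[i]: coverage (or classifier confusion) score for the i-th intent pair;
# labels[i]: 1 if the meta-dataset marks the pair as a collision, else 0.
def collision_auc(labels, scores):
    return roc_auc_score(labels, scores)
```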

3.6 Results

Data Coverage.

Figure 4 charts coverage scores and confusion scores for various approaches. In Figure 4 (a) and (b), the coverage approaches tend to return higher coverage scores for collisions and lower coverage scores for non-collisions, which aligns with our expectations given our definition of the coverage metric and assuming the similarity metric used in the coverage computation is effective. The AUC scores allow us to quantitatively judge the performance of the various coverage-based approaches: in Table 3, the SBERT-miniLM embedding method yields the highest AUC score, and interestingly the n-gram-based coverage method performs second best, with the SBERT-NLI embedding method in third.

Approach | Coverage AUC | Confusion AUC
SBERT-NLI | 0.898 | –
token (n-gram) | 0.931 | –
SBERT-miniLM | 0.963 | –
SVM-based | – | 0.756
Table 3: AUC metrics for each intent collision detection approach.
Original Dataset | Original Intent | Sample
HWU | alarm_remove | remove the alarm set for 10pm
Clinc-150 | reminder_update | set a reminder for me to take my meds
MTOP | get_weather | should i wear a raincoat tuesday
Jobs640 | – | what systems analyst jobs are there in austin
Talk2Car | – | switch to right lane and park on right behind parked black car
Snips | add_to_playlist | add paulinho da viola to my radio rock song list
Outlier | hours | tell me the hours of operation for my bank
New | balance | do I have holiday time saved
DSTC-8 | LookupMusic | I like metal songs can you find me some
ATIS | ground_service | i’ll need to rent a car in washington dc
MetalWOz | name_suggester | I need to find a name for my new cat
Clinc-150 | find_phone | can you help me find my cell
ACID | info_amt_due | what is the current amount due on my account
Banking-77 | terminate_account | how do I deactivate my account
Clinc-150 | measurement_conversion | what amount of millimeters are in 50 kilometers
ACID | info_name_change | i need to fix my name
MTOP | play_music | find me the latest linkin park album
HWU | audio_volume_up | just increase the volume a little
Outlier | balance | how much oney do i have available

Vertanen (2017) | – | why on earth is there cereal in the fridge
Vertanen (2017) | – | who are you going to vote for in november
Vertanen (2017) | – | do you know where i put my glasses
Clinc-150 | out-of-scope | what size wipers does this car take
Clinc-150 | out-of-scope | how long is winter
Clinc-150 | out-of-scope | are any earning reports due
Table 4: Sample intents and queries from our Redwood dataset, along with the corresponding original dataset and intent (where applicable). Samples are grouped into in-scope (top) and out-of-scope (bottom).

Classifier Confusion.

Figure 4 (c) charts classifier confusion scores for the SVM-based classifier confusion approach. Our results demonstrate that actual intent collisions typically yield high classifier confusion scores, while non-collisions yield lower confusion scores. Visually, however, Figure 4 (c) seems to indicate that the classifier confusion approach is less effective than the coverage-based approaches. This is made more apparent by the AUC scores in Table 3. We note that the data coverage and classifier confusion AUC scores are not directly comparable as they use different evaluation settings. Nonetheless, the difference in performance scores does lead us to conclude that the data coverage approach is more effective.

In sum, these experimental results demonstrate that the two intent collision detection approaches introduced here are effective in detecting collisions among real datasets, with the data coverage approach being the stronger of the two.

4 Building the Redwood Dataset

With tools addressing the problem of intent collision detection in hand, we now turn our attention to combining the individual datasets from Table 2 together to form a single large-scale intent classification dataset, Redwood. This section discusses the construction of Redwood and a companion out-of-scope evaluation set, and then evaluates several benchmark intent classifiers on the dataset. These datasets and associated evaluations demonstrate the consequences of leaving colliding intents unaddressed, providing a valuable resource for the community to improve intent classification models.

4.1 Data

In-Scope Data.

After creating the collision meta-dataset, a natural extension was to combine the datasets together to form Redwood. We used the collision meta-dataset to help inform us of which intents could be combined, and which intents could stand alone in Redwood. In some cases, we removed intents that caused hierarchical collisions, as sometimes joining together intents from a hierarchical collision produced an intent that was too broad. We included only those intents that have at least 50 queries, and the resulting Redwood consists of 451 total intents and 62,216 queries. Following the terminology used in Larson et al. (2019b), we call these 451 intents in-scope.

Dataset N. Samples
Vertanen (2017) 2067
Clinc-150 1200
Total 3267
Table 5: Sources of out-of-scope data and number of samples used in Redwood’s out-of-scope test set.

By way of comparison, we also produced a “naïve” version of Redwood, called Redwood-naïve, where all the intents from the datasets listed in Table 2 were joined together without using collision detection or any other method of arbitrating or correcting colliding intents. Like the original Redwood, we included only intents that have at least 50 queries, and capped each intent at a maximum of 150 queries so as to avoid drastic class imbalances. Redwood-naïve consists of 619 intents and 85,746 total queries.

All versions of Redwood were split into train and test splits per intent: 85% training, 15% testing.

Out-of-Scope Data.

In contrast to in-scope queries, out-of-scope queries are those that do not belong to any of the in-scope intents. Considering out-of-scope queries in an evaluation of intent classification models is important because such queries occur in production settings, where end users cannot be expected to know the full range of intents when interacting with a conversational AI system. We include a collection of 3,267 out-of-scope queries in addition to the Redwood corpus. Redwood’s out-of-scope data originates from the following sources: the Clinc-150 dataset, which itself includes a set of out-of-scope queries; and Vertanen (2017), a crowdsourced dialog dataset from which we use the first dialog turns. We reviewed all candidate out-of-scope queries, removing those that were actually in-scope. Examples of queries from the Redwood dataset are shown in Table 4.

4.2 Benchmark Evaluation

Models.

We benchmark intent classification performance using the MobileBERT model Sun et al. (2020), implemented with the HuggingFace library Wolf et al. (2020). The MobileBERT model uses a softmax function to map its logits to a probability vector p, from which we can obtain confidence scores for each intent. These confidence scores can be used to predict whether a query is in- or out-of-scope, according to a decision threshold t given by

$$\text{decision rule} = \begin{cases} \text{in-scope}, & \text{if } \max(\textbf{p}) \geq t \\ \text{out-of-scope}, & \text{if } \max(\textbf{p}) < t. \end{cases}$$

Such decision rules were used in Hendrycks and Gimpel (2017) and Larson et al. (2019b).
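For illustration, a minimal sketch of this decision rule using softmax confidences is shown below, assuming the HuggingFace transformers library; the checkpoint path and threshold value are placeholders (a MobileBERT model fine-tuned on Redwood’s intents is assumed), not artifacts we release.

```python
# Sketch: in-/out-of-scope decision rule from softmax confidences (assumes transformers).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_PATH = "path/to/mobilebert-finetuned-on-redwood"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
model.eval()


def classify(query, t=0.7):  # the threshold t here is illustrative
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    confidence, intent_id = probs.max(dim=0)
    if confidence.item() >= t:
        return model.config.id2label[intent_id.item()], confidence.item()
    return "out-of-scope", confidence.item()
```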

Metrics and Experiments.

We measure intent classifier accuracy on in-scope data without considering out-of-scope inputs. We also measure each model’s ability to distinguish in-scope and out-of-scope queries by computing the AUC between in- and out-of-scope confidence scores. In this way, we use AUC to measure how separable in- and out-of-scope queries are based on their confidence scores, without having to select a confidence threshold t. An AUC score of 0.5 (chance level) implies the model cannot distinguish in- versus out-of-scope inputs. An AUC of 1.0 indicates the model can perfectly separate inputs.

4.3 Results

Training Dataset | In-Scope Accuracy | Clinc OOS AUC | Vertanen OOS AUC
Redwood | 0.913 | 0.921 | 0.928
Redwood-naïve | 0.861 | 0.909 | 0.925
Table 6: Model performance of the MobileBERT classifier on Redwood and Redwood-naïve.
Collisions | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 14
Mean Acc. | 0.91 | 0.80 | 0.81 | 0.79 | 0.81 | 0.80 | 0.89 | 0.57
Size | 322 | 74 | 42 | 51 | 15 | 11 | 13 | 8
Table 7: Accuracy scores on Redwood-naïve intents per number of collisions.

Model performance on Redwood-naïve and Redwood is shown in Table 6. First, we notice that the intent classifier performs reasonably well on the in-scope classification task, with MobileBERT classifying queries with 91% accuracy. The model also performs well on the out-of-scope task, discriminating between in- and out-of-scope queries with AUC scores of 0.921 and 0.928 on the Clinc-150 and Vertanen (2017) out-of-scope data, respectively.

The second row of Table 6 presents model performance when trained and tested on Redwood-naïve. In this case, model performance is substantially worse than that of the model trained on the carefully-crafted Redwood dataset, confirming our hypothesis from Section 3.1 that model performance suffers if trained on data with colliding intents.

We drill deeper into the impact of intent collisions on models trained on Redwood-naïve in Table 7, which charts per-intent accuracy based on the number of other intents that collide with that intent. This table groups intents based on the number of collisions, and we see that, on average, intents with no collisions exhibit higher accuracy than intents with collisions. In general, colliding intents lead to degraded accuracy: intents with one or more collisions have accuracy roughly 10 or more points lower than the no-collision group, with the exception of the 6-collision group. The relatively high average accuracy of the 6-collision group on Redwood-naïve is indeed surprising, and we posit that the MobileBERT model—a high-capacity transformer model—can learn the nuances of each individual intent, even if they do semantically collide.

5 Conclusion and Future Work

This paper introduces the task of intent collision detection when constructing or updating an intent classification model’s dataset to incorporate additional intents. Using 13 individual datasets, we constructed a meta-dataset to track intent collisions between the datasets, and then introduced and evaluated two intent collision detection techniques, finding that both perform effectively at the collision detection task. To help measure and address this problem, we constructed Redwood, a large-scale intent classification dataset consisting of 451 intents and over 60,000 queries. We used Redwood to benchmark intent classification models on the tasks of in-scope query prediction and out-of-scope detection. The new Redwood dataset is the largest publicly available intent classification benchmark in terms of number of intents, and will be made publicly available. Future work will include annotating slots to extend Redwood to joint intent classification and slot-filling, and it is likely that new tools will have to be developed for doing so. Additionally, using the collision detection methods introduced in this paper, Redwood can be periodically updated with new intents whenever other new intent classification datasets are published.

Acknowledgements

We thank the anonymous reviewers for their detailed and thoughtful feedback, and Jacob Solawetz for his feedback on early iterations of the Redwood concept.

References