
Modeling Token-level Uncertainty to Learn Unknown Concepts in SLU
via Calibrated Dirichlet Prior RNN

Yilin Shen1, Wenhu Chen2, Hongxia Jin1
1Samsung Research America, Mountain View, CA, USA
2University of California, Santa Barbara, CA, USA
{yilin.shen,hongxia.jin}@samsung.com, [email protected]
Abstract

One major task of spoken language understanding (SLU) in modern personal assistants is to extract semantic concepts from an utterance, called slot filling. Although existing slot filling models have attempted to improve the extraction of new concepts that are not seen in training data, their performance in practice is still not satisfactory. Recent research collected question and answer annotated data to learn what is unknown and should be asked, but this is not practically scalable due to the heavy data collection effort. In this paper, we incorporate softmax-based slot filling neural architectures to model the sequence uncertainty without question supervision. We design a Dirichlet Prior RNN that models high-order uncertainty and degenerates to a softmax layer during RNN model training. To further enhance the robustness of uncertainty modeling, we propose a novel multi-task training scheme to calibrate the Dirichlet concentration parameters. We collect unseen concepts to create two test datasets from the SLU benchmark datasets Snips and ATIS. On these two datasets and the existing Concept Learning benchmark dataset, we show that our approach significantly outperforms state-of-the-art approaches by up to 8.18%. Our method is generic and can be applied to any RNN or Transformer based slot filling model with a softmax layer.

1 Introduction

With the rise of modern artificially intelligent voice-enabled personal assistants (PA) such as Alexa, Google Assistant, and Siri, spoken language understanding (SLU) plays a vital role in understanding all varieties of user utterances and carrying out the intent of users. One of the major tasks in SLU is slot filling, which aims to extract semantic values of predefined slot types from a natural language utterance. Existing slot filling approaches are designed only for offline model training based on a large scale pre-collected training corpus. They explored machine learning and deep learning techniques either for an independent slot filling model Kurata et al. (2016); Hakkani-Tür et al. (2016); Liu and Lane (2016) or for a joint learning model with intent detection Liu and Lane (2016); Wang et al. (2018); Goo et al. (2018).

However, real usage data are typically very different from pre-collected training data. Many slot types can have a large number of new slot values, e.g., book, music and movie names. For example, the benchmark Snips dataset covers less than 0.01% of the existing slot values, let alone their fast growth. Also, a user could have personalized slot values with specific meanings, e.g., “happy hours” as a playlist name. In this paper, we define a concept as a new slot value for a slot type (a new slot type can be supported similarly). While some new concepts could be extracted using emerging-entity models Lin et al. (2017); Jansson and Liu (2017) or recently improved slot filling models that tackle out-of-vocabulary (OOV) words using character embeddings Liang et al. (2017a), copy mechanisms Zhao and Feng (2018) and few-shot learning Hu et al. (2019); Hou et al. (2019); Fritzler et al. (2019), these models still cannot extract a majority of new concepts, which we call unknown concepts. Thus, it is critically desirable to enable a PA to smartly detect and clarify unknown concepts (as in Figure 1) in order to take the correct action.

Figure 1: Example of Asking Clarification Question

Earlier work explored rule based approaches to ask clarification questions Hori et al. (2003); Purver (2006). However, these approaches are not applicable to modern deep learning SLU models. Recently, research communities have taken special interest in how to make machine learning models know what they do not know Rajpurkar et al. (2018) and how to ask clarification questions Rao and Daumé III (2018). However, this dataset and approach are designed under the assumption that question/answer supervision is provided. Another line of work studied uncertainty estimation for classification problems Hendrycks and Gimpel (2016); Liang et al. (2017b); Chen et al. (2018), which is not applicable to RNN based models. Unfortunately, due to the large number and fast growth of new concepts, it is not scalable to continuously collect and annotate new training data. The most relevant work Jia et al. (2017) proposed a hybrid approach that combines model confidence, the dependency tree and out-of-vocabulary words to derive the uncertainty of each word in an utterance. Dong et al. (2018) further proposed a few metrics to measure confidence. However, their performance suffers from the over-confidence of pre-trained slot filling models on unseen concepts.

In this paper, we focus on modeling token-level uncertainty to extract unknown concepts without unknown concept supervision. That is, our model is trained only on the existing concepts in the training data. Given the infinite number and fast growth of new concepts in reality, our setting is the more practical one. Our approach takes any softmax-based slot filling model and outputs an uncertainty score for each word. Inspired by the Dirichlet prior network Malinin and Gales (2018), we first design a Dirichlet Prior RNN by degenerating the Dirichlet prior into the softmax layer of slot filling models. Next, we design a multi-task algorithm that further calibrates the Dirichlet concentration parameters to enhance the robustness of uncertainty modeling by learning an adaptive calibration matrix. At last, we combine the words with high uncertainty with the utterance syntax to extract the unknown concept phrases. Our contributions can be summarized as follows:

  • We design Dirichlet prior RNN to model high-order token-level uncertainty by degenerating softmax-based slot filling models;

  • We propose a multitask learning algorithm to calibrate the Dirichlet concentration parameters to improve uncertainty estimation;

  • Our approach achieves state-of-the-art performance on three SLU benchmark datasets;

  • Our method is generic and can also be applied to Transformer/BERT based slot filling models with softmax output layers.

2 Related Work

Slot Filling: The slot filling task is to label each word in an utterance with one of the predefined slot tags. Xu and Sarikaya (2013) first proposed to use a convolutional neural network (CNN) together with a conditional random field (CRF). Recently, much research has built on recurrent neural network (RNN) models. Liu and Lane (2015); Mesnil et al. (2015); Peng and Yao (2015) first used RNNs in a straightforward way to generate semantic tags sequentially by reading in each word one by one. Later on, Liu and Lane (2016) presented an attention mechanism incorporating the utterance context. Wang et al. (2018) designed a bi-model via asynchronous training. Goo et al. (2018) proposed to learn the relationship between intent and slot attention vectors. However, these approaches are only designed to extract in-distribution (IND) concepts.

Learning to Clarify: Earlier research applied syntax rules to detect unknown phrases Schlangen et al. (2001); Hori et al. (2003); Purver (2006). However, we show in our experiments that these approaches are insufficient to achieve good performance. Dong et al. (2018) proposed statistical metrics to measure the confidence of deep learning based semantic parsers. Yet, these approaches cannot extract unseen concepts within one utterance. Recently, Rajpurkar et al. (2018) collected a QA dataset with question/answer supervision to encourage the research community to model an agent that knows what it does not know. Rao and Daumé III (2018) proposed a supervised question ranking approach to select the most relevant clarification questions. However, such supervised approaches are not scalable in practice due to the heavy data collection workload.

Uncertainty Estimation: The most relevant work Jia et al. (2017) designed a simple approach that combines the slot filling model confidence score, utterance syntax and OOV words to estimate the uncertainty score and extract unseen concepts. Dong et al. (2018) studied different uncertainty metrics for semantic parser models. Much recent research in the computer vision community has made great progress in uncertainty modeling for CNNs, including a baseline Hendrycks and Gimpel (2016), temperature scaling approaches (Liang et al., 2017b; Lee et al., 2017; Shalev et al., 2018; DeVries and Taylor, 2018), adversarial training (Lee et al., 2017) and variational inference Chen et al. (2018); Malinin and Gales (2018).

3 Background & Problem Definition

Table 1: Uncertainty Modeling Example for Slot Filling (in Label Uncertainty, L=low and H=high)

| Utterance              | reserve | at | Mario's      | Italiano     | for | 1      | pm     |
| True Labels            | O       | O  | B-restaurant | I-restaurant | O   | B-time | I-time |
| Predicted Labels       | O       | O  | B-game       | B-country    | O   | B-time | I-time |
| Label Uncertainty      | L       | L  | H            | H            | L   | L      | L      |
| Final Predicted Labels | O       | O  | B-unknown    | I-unknown    | O   | B-time | I-time |

3.1 Slot Filling

Slot filling extracts semantic concepts from a natural language utterance and has been modeled as a sequence labeling problem in the literature. Given an input utterance as a sequence of words $\boldsymbol{\mathrm{x}}=(x_{1},\ldots,x_{n})$ of length $n$, slot filling maps $\boldsymbol{\mathrm{x}}$ to the corresponding label sequence $\boldsymbol{\mathrm{y}}$ of the same length $n$. Each word is labeled with one of $K$ labels. As shown in Table 1, each output label in $\boldsymbol{\mathrm{y}}$ is in IOB format, where “B” and “I” stand for the beginning and intermediate word of a slot and “O” means the word does not belong to any slot. As a sequence labeling problem, slot filling can be solved using traditional machine learning approaches McCallum et al. (2000); Raymond and Riccardi (2007) and recent mainstream recurrent neural network (RNN) based approaches, which take and tag each word in an utterance one by one Yao et al. (2014); Mesnil et al. (2015); Peng and Yao (2015); Liu and Lane (2015); Kurata et al. (2016). These RNN based approaches designed different models to estimate the maximum likelihood:

P(\theta|D)=\max_{\theta}E_{D}\Big[\prod_{t=1}^{n}P(y_{t}|y_{1},\ldots,y_{t-1},\boldsymbol{\mathrm{x}};\theta)\Big]

by optimizing the following loss function:

\mathcal{L}(\theta)\triangleq-{1\over n}\sum_{i=1}^{|S|}\sum_{t=1}^{n}y_{t}(i)\log P_{t}(i) \quad (1)

where $D=\{\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}}\}\sim p(\boldsymbol{\mathrm{X}},\boldsymbol{\mathrm{Y}})$ is the training set and $\theta$ are the model parameters. $P_{t}(i)$ is typically computed using the softmax function. All models with such a loss function are referred to as softmax RNN slot filling models.
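To make Equation 1 concrete, below is a minimal numpy sketch of the per-utterance loss (a hedged illustration rather than the training code of any specific model; in practice the probabilities come from the softmax layer of the chosen RNN):

```python
import numpy as np
from scipy.special import softmax

def slot_filling_loss(logits, labels):
    """Sequence cross-entropy of Equation 1.

    logits: (n, K) last-layer outputs for the n words of one utterance.
    labels: (n,) integer slot-tag ids (the one-hot y_t collapsed to indices).
    """
    probs = softmax(logits, axis=-1)              # P_t(i) for every word t
    n = logits.shape[0]
    # -(1/n) * sum_t log P_t(y_t): only the true tag contributes per word
    return -np.log(probs[np.arange(n), labels]).mean()

loss = slot_filling_loss(np.random.randn(7, 5), np.array([0, 0, 1, 2, 0, 3, 4]))
```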

3.2 Problem Definition

We refer to the training data $D$ as in-distribution (IND) data; each utterance in $D$ is drawn from a fixed but unknown distribution $P_{IND}$. Our goal is to train a slot filling model only on IND data $D$ to achieve two objectives: (1) correctly label a test sequence $\boldsymbol{\mathrm{z}}$ drawn from the same distribution $P_{IND}$; (2) identify out-of-distribution elements in a test sequence $\boldsymbol{\mathrm{z}}$ drawn from a different distribution $P_{OOD}$, referred to as out-of-distribution (OOD) data. Note that an OOD sequence may contain only a subset of out-of-distribution elements. In this case, our goal is to identify only these OOD elements and correctly label the other IND elements.

Specifically, we take a pre-defined softmax RNN slot filling model architecture (described in the previous subsection) as input and train our calibrated prior RNN model by enhancing only its softmax layer. Our model outputs two sequences for each test sequence $\boldsymbol{\mathrm{z}}$ with an unseen concept $\boldsymbol{\mathrm{c}}^{*}$: (1) the traditional label sequence $\boldsymbol{\mathrm{y}}$; (2) an uncertainty sequence $\boldsymbol{\mathrm{u}}$ with an uncertainty score $u_{t}$ for each word $z_{t}$. For $\boldsymbol{\mathrm{z}}$ drawn from $P_{IND}$, all labels in $\boldsymbol{\mathrm{y}}$ are expected to be correct and all uncertainty scores in $\boldsymbol{\mathrm{u}}$ are expected to be low. For $\boldsymbol{\mathrm{z}}$ drawn from $P_{OOD}$, the model is expected to either label an element correctly or give it a high uncertainty score. At last, the model tags all concepts either with correct slot types or as unknown.

4 Calibrated Dirichlet Prior RNN

Figure 2 shows the overview of our proposed method, which is designed to work with any slot filling model with a softmax output layer. In this section, we focus on presenting the two main novel components (shown in blue in the figure).

4.1 Dirichlet Prior RNN

As discussed in Section 3, the existing approaches for slot filling are discriminative models which aim to find the best model parameters $\theta^{*}$ to fit the training data. To estimate the uncertainty of each word, a naive solution is to use its softmax confidence at each step. However, since an unseen concept is out of the training data distribution, such pretrained neural networks tend to be blindly overconfident in their predictions on completely unrecognizable Nguyen et al. (2015) or irrelevant inputs Hendrycks and Gimpel (2017); Moosavi-Dezfooli et al. (2017).

Consider the IND training set $D=\{\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}}\}\sim p(\boldsymbol{\mathrm{X}},\boldsymbol{\mathrm{Y}})$. For an utterance $\boldsymbol{\mathrm{z}}$, we model its high-order distribution $p(\boldsymbol{\omega}|\boldsymbol{\mathrm{z}},D)$ using a Bayesian framework as follows:

p(\boldsymbol{\omega}|\boldsymbol{\mathrm{z}},D)=\int\underbrace{p(\boldsymbol{\omega}|\boldsymbol{\mathrm{z}},\theta)}_{\text{Data}}\underbrace{p(\theta|D)}_{\text{Model}}d\theta \quad (2)

where, similarly to Equation 1, we have:

p(\boldsymbol{\omega}|\boldsymbol{\mathrm{z}},\theta)=\prod_{t=1}^{n}p(\boldsymbol{\omega}_{t}|\boldsymbol{\omega}_{1},\ldots,\boldsymbol{\omega}_{t-1},\boldsymbol{\mathrm{z}};\theta)

where $\boldsymbol{\omega}_{i}$ is the predicted distribution over all possible $K$ labels for each word $i$ in $\boldsymbol{\mathrm{z}}$. The data uncertainty is described by the label-level posterior $p(\boldsymbol{\omega}|\boldsymbol{\mathrm{z}},\theta)$ and the model uncertainty is described by the model-level posterior $p(\theta|D)$. However, the integral in Equation 2 is intractable in deep neural networks, so a Monte-Carlo sampling algorithm is used to approximate it as follows:

p(\boldsymbol{\omega}|\boldsymbol{\mathrm{z}},D)\approx\frac{1}{M}\sum_{i=1}^{M}p(\boldsymbol{\omega}|\boldsymbol{\mathrm{z}},\theta^{(i)});\quad\theta^{(i)}\sim p(\theta|D)

where each $p(\boldsymbol{\omega}_{t}|\boldsymbol{\omega}_{1},\ldots,\boldsymbol{\omega}_{t-1},\boldsymbol{\mathrm{z}},\theta^{(i)})$ is a categorical distribution $\boldsymbol{\mu}$ over the simplex with $\boldsymbol{\mu}=[\mu_{1},\cdots,\mu_{K}]=[p(y=\omega(1)),\cdots,p(y=\omega(K))]^{T}$. This ensemble is a collection of points on the simplex which can be viewed as an implicit distribution induced by the posterior over the model parameters $p(\theta|D)$.
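For intuition, here is a hedged sketch of this Monte-Carlo approximation (`sample_model` is a hypothetical sampler from $p(\theta|D)$, e.g., one stochastic dropout instantiation of the trained network; our method deliberately avoids this expensive step):

```python
import numpy as np

def mc_label_distribution(words, sample_model, M=10):
    """Approximate p(omega | z, D) by averaging the categorical outputs
    of M models sampled from the posterior p(theta | D).

    sample_model() -> callable mapping a word sequence to an (n, K) array
    of per-word label probabilities (hypothetical hook, e.g., MC dropout).
    """
    runs = np.stack([sample_model()(words) for _ in range(M)])  # (M, n, K)
    return runs.mean(axis=0)                                    # (n, K)
```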

Figure 2: Method Overview: the Dirichlet prior RNN is the main component of our method; it degenerates to traditional softmax based training on the given softmax slot filling model. The calibration module introduces an additional objective as a multi-task training to better separate the IND and OOD concepts. During testing, we output the label distribution $\boldsymbol{\omega}_{t}$ for the $t^{th}$ word in utterance $\boldsymbol{\mathrm{z}}$. Based on $\boldsymbol{\omega}_{t}$, the predicted label $y_{t}$ and uncertainty estimate $u_{t}$ are computed. When $u_{t}$ is larger than some threshold, we set its uncertainty label to H and consider the predicted label $y_{t}$ as unknown. At last, we extract the unknown concept by expanding the words with high uncertainty scores using the utterance's dependency tree.

In addition to the intractability of posterior estimation, such an ensemble of implicit distributions also makes uncertainty hard to measure. For example, a high entropy of $p(\boldsymbol{\omega}|\boldsymbol{\mathrm{z}},D)$ could indicate uncertainty in the prediction due to either an IND input in a region of class overlap or an OOD input far from the training data. Other measures like mutual information Gal (2016) could determine the uncertainty source, but they are very hard to use in practice due to the difficulty of selecting an appropriate prior distribution and the expensive Monte-Carlo estimation needed in RNNs.

4.1.1 Prior RNN

Inspired by the prior network in Malinin and Gales (2018), we propose a prior RNN to explicitly parameterize model-level uncertainty at each timestamp. Similarly, it is modeled with a distribution over distributions $p(\boldsymbol{\mu}|\boldsymbol{\mathrm{z}},\theta)$ as follows:

p(\boldsymbol{\omega}|\boldsymbol{\mathrm{z}},D)=\int\underbrace{p(\boldsymbol{\omega}|\boldsymbol{\mu})p(\boldsymbol{\mu}|\boldsymbol{\mathrm{z}},\theta)}_{\text{Data}}\underbrace{p(\theta|D)}_{\text{Model}}d\boldsymbol{\mu}\,d\theta
\approx\int p(\boldsymbol{\omega}|\boldsymbol{\mu})p(\boldsymbol{\mu}|\boldsymbol{\mathrm{z}},\hat{\theta})d\boldsymbol{\mu}

where

p(\boldsymbol{\mu}|\boldsymbol{\mathrm{z}},\theta)=\prod_{t=1}^{n}p(\boldsymbol{\mu}_{t}|\boldsymbol{\mu}_{1},\ldots,\boldsymbol{\mu}_{t-1},\boldsymbol{\mathrm{z}};\hat{\theta})
p(\boldsymbol{\omega}_{t}|\boldsymbol{\mu})=p(\boldsymbol{\omega}_{t}|\boldsymbol{\mu}_{t})

where each $\boldsymbol{\omega}_{t}$ depends only on $\boldsymbol{\mu}_{t}$. In addition, the approximation in the second step is based on the point estimate $p(\theta|D)=\delta(\theta-\hat{\theta})$ of the original model uncertainty:

p(\boldsymbol{\mu}|\boldsymbol{\mathrm{z}},D)\approx p(\boldsymbol{\mu}|\boldsymbol{\mathrm{z}};\hat{\theta})

4.1.2 Degenerated Dirichlet Prior RNN

Next, we focus on how to model the high-order distribution $\boldsymbol{\mu}_{t}$ at each timestamp $t$. Following Gal (2016), the probability density function (PDF) of the Dirichlet distribution over all possible values of the $K$-dimensional categorical distribution is written as:

Dir(\boldsymbol{\mu}|\boldsymbol{\alpha})=\begin{cases}\frac{1}{B(\boldsymbol{\alpha})}\prod_{i=1}^{K}\mu(i)^{\alpha(i)-1}&\text{for}\;\boldsymbol{\mu}\in\mathbb{S}_{K}\\0&\text{otherwise}\end{cases}

where $\boldsymbol{\alpha}=[\alpha(1),\cdots,\alpha(K)]^{T}$ with $\alpha(i)>0$ is the concentration parameter of the Dirichlet distribution and $B(\boldsymbol{\alpha})=\frac{\prod_{i=1}^{K}\Gamma(\alpha(i))}{\Gamma(\sum_{i=1}^{K}\alpha(i))}$ is the normalization factor. The entropy-based uncertainty measure $U(\boldsymbol{\alpha})=H(Dir(\boldsymbol{\alpha}))$ is proven in Malinin and Gales (2018) to perfectly separate model-level uncertainty from data-level uncertainty, and it is computationally efficient due to its tractable statistical properties.

The Dirichlet prior RNN is practically realized by a neural network function $g$ with parameters $\theta$ that generates $n$ $K$-dimensional vectors:

g(D,\theta)=(\boldsymbol{\alpha}_{1},\ldots,\boldsymbol{\alpha}_{n})

where $\boldsymbol{\alpha}_{t}\in\mathbb{R}^{K}$ at each timestamp $t$ follows:

Dir(\boldsymbol{\alpha}_{t})=p(\boldsymbol{\mu}_{t}|\boldsymbol{\mu}_{1},\ldots,\boldsymbol{\mu}_{t-1},\boldsymbol{\mathrm{z}};\hat{\theta})

Such a model is typically trained using both IND training data $\boldsymbol{\mathrm{x}}$ and OOD training data $\boldsymbol{\mathrm{z}}$. For example, Malinin and Gales (2018) introduced a multi-task training scheme assuming that both IND and OOD data are available (refer to the original paper for details). However, this is not applicable in our case since OOD data is not available.

As such, we propose to degenerate the Dirichlet prior RNN into traditional RNN training using only IND data. The posterior label distribution at each timestamp $t$ is then given by the mean of the Dirichlet distribution:

p(\boldsymbol{\omega}_{t}|\boldsymbol{\omega}_{1},\ldots,\boldsymbol{\omega}_{t-1},\boldsymbol{\mathrm{z}};\hat{\theta})=\Big[{\alpha_{t}(1)\over\alpha_{t}(0)},\ldots,{\alpha_{t}(K)\over\alpha_{t}(0)}\Big]

During training, the Dirichlet prior RNN $g(D,\theta)$ is optimized to maximize the empirical marginal likelihood over training pairs $(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}})\in D$:

\mathcal{L}_{\text{PriorRNN}}=E_{(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}})\in D}[\log p(\boldsymbol{\mathrm{y}}|\boldsymbol{\mathrm{x}})]
=E_{(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}})\in D}\Big[{1\over n}\sum_{j=1}^{K}\sum_{t=1}^{n}y_{t}(j)\log\int_{\boldsymbol{\omega}_{t}}p(\boldsymbol{\omega}_{t}|\boldsymbol{\mathrm{x}})d\boldsymbol{\omega}_{t}\Big]
=E_{(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}})\in D}\Big[{1\over n}\sum_{j=1}^{K}\sum_{t=1}^{n}y_{t}(j)\log\Big(\frac{\alpha_{t}(j)}{\alpha_{t}(0)}\Big)\Big]

Let $\boldsymbol{m}_{t}=RNN_{t}(\boldsymbol{\mathrm{x}};\theta)$ be the last layer output at timestamp $t$ of the RNN; then maximizing $\mathcal{L}_{\text{PriorRNN}}$ is equivalent to minimizing the sequence cross-entropy loss of the RNN if we set:

\boldsymbol{\alpha}_{t}=Me^{\boldsymbol{m}_{t}}=M\exp(RNN(x_{t};\theta)) \quad (3)

where $M$ is a scale constant (for simplicity, we set $M=1$ in our experiments to avoid fine-tuning this scaling hyperparameter).

Remarks: Thanks to the degeneration of the Dirichlet prior RNN, training is the same as for traditional RNN-based models. That is, the Dirichlet prior RNN model is exactly the same as the selected softmax based slot filling RNN model. The same applies to recent Transformer based models.
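The degeneration is easy to check numerically: with $\boldsymbol{\alpha}_{t}=M\exp(\boldsymbol{m}_{t})$, the Dirichlet mean $\alpha_{t}(j)/\alpha_{t}(0)$ equals the softmax of the logits, so a standard softmax slot filler doubles as a Dirichlet prior RNN at no extra training cost. A minimal sketch (with $M=1$ as in our experiments):

```python
import numpy as np
from scipy.special import softmax

def logits_to_concentration(logits, M=1.0):
    """Equation 3: read last-layer logits m_t as Dirichlet concentrations."""
    return M * np.exp(logits)                    # alpha_t, shape (n, K)

logits = np.random.randn(7, 5)                   # n=7 words, K=5 slot tags
alpha = logits_to_concentration(logits)
posterior = alpha / alpha.sum(-1, keepdims=True) # Dirichlet mean per word
# the Dirichlet mean is exactly the usual softmax output (M cancels out)
assert np.allclose(posterior, softmax(logits, axis=-1))
```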

4.2 Dirichlet Concentration Calibration

While the proposed degenerated Dirichlet prior RNN is sufficient in relatively simple cases, its uncertainty measure is still sensitive and erratic under noise, so it can hardly provide an accurate estimate of model-level uncertainty. This is mainly because the Dirichlet prior RNN is trained only on IND data. Specifically, the issue is caused by the well-known over-fitting of pre-trained neural networks, where the model becomes over-confident about its prediction even when that prediction is wrong. In our problem of modeling uncertainty for each word in a sequence, this drawback is further amplified by overconfidence from both the current word and the utterance context. Moreover, the classification model greatly emphasizes certain dimensions of the concentration parameter regardless of the form of the input, which causes both data sources to have indistinguishably high confidence.

Our goal is to make the IND and OOD concepts more separable using only IND data. Previous works Kurakin et al. (2016); Liang et al. (2017b) designed various approaches to perturb the original data to simulate OOD samples. Unfortunately, these approaches are not applicable to natural language processing problems due to the discrete nature of text.

4.2.1 Temperature Scaling Calibration

We are inspired by ODIN Liang et al. (2017b), which observed that temperature scaling on the last layer logits can enlarge the distance between IND and OOD samples in latent space and better separate them. Since simply using a fixed temperature $T$ to scale each parameter in $\boldsymbol{\alpha}_{t}$, i.e., $\hat{\alpha}_{t}(i)=\alpha_{t}(i)^{1/T}$, does not directly work here, we design a parameterized calibration function $\boldsymbol{\epsilon}(\boldsymbol{\alpha}_{t};\boldsymbol{\mathrm{W}}_{c})$ with a trainable parameter $\boldsymbol{\mathrm{W}}_{c}$ that learns how to scale and calibrate the concentration during model training. At each timestamp $t$, it takes as input the concentration parameter $\boldsymbol{\alpha}_{t}$ and generates a noise that widens the uncertainty gap between IND and OOD concepts:

\tilde{\boldsymbol{\alpha}}_{t}=\boldsymbol{\alpha}_{t}-\boldsymbol{\epsilon}(\boldsymbol{\alpha}_{t};\boldsymbol{\mathrm{W}}_{c})=(\boldsymbol{\mathrm{I}}-\boldsymbol{\mathrm{W}}_{c})\boldsymbol{\alpha}_{t} \quad (4)

where we apply the simplest linear calibration function $\boldsymbol{\epsilon}(\boldsymbol{\alpha}_{t};\boldsymbol{\mathrm{W}}_{c})=\boldsymbol{\mathrm{W}}_{c}\boldsymbol{\alpha}_{t}$, with $\boldsymbol{\mathrm{W}}_{c}\in\mathbb{R}^{K\times K}$ denoting the trainable calibration matrix and $K$ the number of output labels.

4.2.2 Joint Multi-task Training

Figure 3: Multitask Training: The objective functions of both tasks are optimized simultaneously and red lines show the backpropagation.

Figure 3 shows the overview of our joint multi-task training. The first task is to minimize the traditional sequence loss from Section 4.1.2 with the supervision of the sequence label $\boldsymbol{\mathrm{y}}$:

\min\mathcal{L}_{\text{PriorRNN}}=-\underset{(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}})\sim\text{IND}}{\mathbb{E}}\sum_{j=1}^{K}\sum_{t=1}^{n}y_{t}(j)\log P_{t}(j)

where $P_{t}(j)=\frac{\exp(RNN(x_{t};\theta))(j)}{\sum_{l=1}^{K}\exp(RNN(x_{t};\theta))(l)}$.

The second task is to maximize the entropy on the calibrated Dirichlet output:

\max\mathcal{L}_{\text{Calibration}}=\underset{(\boldsymbol{\mathrm{x}},\boldsymbol{\mathrm{y}})\sim\text{IND}}{\mathbb{E}}\Big[\sum_{t=1}^{n}H(\tilde{\boldsymbol{\alpha}}_{t}(x_{t}))\Big]
\text{s.t.}\;\forall t\;\;||\boldsymbol{\epsilon}_{t}||_{\infty}<\delta||\boldsymbol{\alpha}_{t}||_{\infty};\;\;\boldsymbol{\mathrm{W}}_{c}\geq 0

where $H(\cdot)$ is the entropy of the Dirichlet $Dir(\tilde{\boldsymbol{\alpha}}_{t})$. As shown above, we impose two constraints on this task. First, since we do not have OOD data during training, the magnitude of the calibration matrix is encouraged to be small so that the generated noise $\boldsymbol{\epsilon}_{t}$ does not destroy the original IND data distribution; we therefore bound the norm of the calibration noise, $||\boldsymbol{\epsilon}_{t}||_{\infty}<\delta||\boldsymbol{\alpha}_{t}||_{\infty}$, where $\delta$ denotes the maximum allowed calibration ratio. Second, we enforce the non-negativity of the noise $\boldsymbol{\epsilon}_{t}$ via $\boldsymbol{\mathrm{W}}_{c}\geq 0$, so that the calibration never increases the concentration. The first constraint is realized by re-scaling any noise whose norm exceeds the bound, while the non-negativity constraints are realized by simply adding a ReLU Nair and Hinton (2010) activation to $\boldsymbol{\mathrm{W}}_{c}$, as sketched below.
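A minimal numpy sketch of how these constraints might be enforced on the calibration noise (a hedged illustration; in our experiments $\boldsymbol{\mathrm{W}}_{c}$ is a trainable TensorFlow variable updated by the objective above, whereas here it is a plain array):

```python
import numpy as np

def constrained_calibration(alpha, W_c, delta=0.1):
    """Apply Equation 4 subject to the constraints of Section 4.2.2.

    alpha: (n, K) per-word concentrations (positive, from Equation 3).
    W_c:   (K, K) calibration matrix; delta: max allowed calibration ratio.
    """
    W_c = np.maximum(W_c, 0.0)                  # ReLU realizes W_c >= 0
    eps = alpha @ W_c.T                         # eps_t = W_c alpha_t (>= 0)
    bound = delta * alpha.max(-1)               # delta * ||alpha_t||_inf
    # re-scale any noise whose inf-norm exceeds the bound
    scale = np.minimum(1.0, bound / (eps.max(-1) + 1e-12))
    return alpha - eps * scale[:, None]         # calibrated alpha_tilde
```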

4.3 Uncertainty Decision Engine

Thanks to the Dirichlet distribution modeling of $\boldsymbol{\mu}$, we use its entropy $H$ to compute the uncertainty score for each word in an utterance. A Dirichlet distribution $Dir(\boldsymbol{\alpha})$ has the following closed-form entropy:

H(\boldsymbol{\alpha})=\log B(\boldsymbol{\alpha})+(\alpha(0)-K)\psi(\alpha(0))-\sum_{i=1}^{K}(\alpha(i)-1)\psi(\alpha(i))

where $\alpha(0)=\sum_{i=1}^{K}\alpha(i)$ denotes the sum over all $K$ dimensions and $\psi$ is the digamma function. Thus, the uncertainty score $u_{t}$ for the $t^{th}$ word in utterance $\boldsymbol{\mathrm{z}}$ can be computed as:

u_{t}=\begin{cases}H(\boldsymbol{\alpha}_{t})&\text{Dirichlet Prior RNN}\\H(\tilde{\boldsymbol{\alpha}}_{t})&\text{Calibrated Dirichlet Prior RNN}\end{cases}

where the Dirichlet Prior RNN uses $\boldsymbol{\alpha}_{t}=Me^{\boldsymbol{m}_{t}}=M\exp(RNN(x_{t};\theta))$ and the Calibrated Dirichlet Prior RNN uses the calibrated $\tilde{\boldsymbol{\alpha}}_{t}=(\boldsymbol{\mathrm{I}}-\boldsymbol{\mathrm{W}}_{c})\boldsymbol{\alpha}_{t}$. When $u_{t}$ is larger than a threshold $\theta$, we set its uncertainty label to high (H) and mark the word as “unknown”, ignoring its predicted slot label.
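The decision engine thus reduces to a closed-form entropy plus a threshold. A minimal numpy sketch (the threshold value itself is tuned on the development set, as described in Section 5):

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_entropy(alpha):
    """Closed-form entropy H(alpha) of Dir(alpha); alpha: (n, K)."""
    a0 = alpha.sum(-1)                             # alpha(0), per word
    K = alpha.shape[-1]
    log_B = gammaln(alpha).sum(-1) - gammaln(a0)   # log normalization factor
    return (log_B + (a0 - K) * digamma(a0)
            - ((alpha - 1) * digamma(alpha)).sum(-1))

def decide_unknown(alpha, threshold):
    """u_t = H(alpha_t); words with u_t above the threshold become unknown."""
    u = dirichlet_entropy(alpha)
    return u, u > threshold                        # True -> label H / "unknown"
```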

4.4 Unknown Concept Extraction

We extract unknown concepts by expanding the set of high uncertainty words using the dependency tree of the input utterance. For each word labeled as “unknown”, we traverse the dependency tree upwards along edges (including compound nouns, adjectival modifiers, and possessive modifiers) and locate the root of the noun phrase containing this word. Then, we mark this root node and all its descendants as “unknown”. This procedure improves recall by considering syntactically coherent phrases.

At last, we convert the labels to IOB format at phrase level by extracting all sets of consecutive words that are labeled as “unknown”. In each set, we label the first word as “B-unknown” and all following words as “I-unknown”.
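A hedged sketch of this extraction step, assuming spaCy as the dependency parser (the paper does not prescribe a specific parser, and the exact set of modifier relations below is illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")            # assumed parser and model
MODIFIER_DEPS = {"compound", "amod", "poss"}  # compound / adjectival / possessive

def extract_unknown_phrases(utterance, unknown_flags):
    """Expand high-uncertainty words to their noun phrase, then emit
    phrase-level IOB labels (B-unknown / I-unknown / O).

    unknown_flags: one boolean per word, assumed aligned with the tokenizer.
    """
    doc = nlp(utterance)
    flags = list(unknown_flags)
    for tok in doc:
        if not flags[tok.i]:
            continue
        head = tok
        while head.dep_ in MODIFIER_DEPS:     # climb modifier edges to phrase root
            head = head.head
        for t in head.subtree:                # mark root and all its descendants
            flags[t.i] = True
    labels, prev = [], False
    for f in flags:                           # word flags -> phrase-level IOB
        labels.append("I-unknown" if f and prev else ("B-unknown" if f else "O"))
        prev = f
    return labels
```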

5 Experimental Evaluation

Our experimental evaluation is mainly focused on the following research questions:

  1. Can our model achieve comparable performance on the IND testing set?

  2. Does the entropy of the degenerated Dirichlet Prior RNN estimate uncertainty better than the confidence of traditional RNN models?

  3. Does concentration calibration improve upon the Dirichlet Prior RNN model?

  4. Does our method outperform existing SOTA competitors without OOD training data?

Table 2: Statistics of Constructed OOD Testing Set with New Concepts (#Train/#Dev/#Test describe the IND training/testing set; the last four columns describe the OOD testing set with new concepts)

| Dataset | #Concept Types | #Train | #Dev | #Test | #Utterances | #Concepts | #Concepts/Utterance | #Words/Concept |
| Concept Learning Jia et al. (2017) | 5 | 1,534 | 219 | 440 | 594 | 140 | 1.08 | 3.67 |
| Snips Snips (2017) | 72 | 13,084 | 700 | 700 | 4,816 | 536 | 1.21 | 3.49 |
| ATIS Hemphill et al. (1990) | 79 | 4,478 | 500 | 893 | 1,487 | 242 | 2.76 | 1.87 |

5.1 Datasets & Settings

We evaluate our method on three SLU benchmark datasets: Concept Learning Jia et al. (2017), Snips Snips (2017) and ATIS Hemphill et al. (1990). For the Snips and ATIS datasets, we collect new concepts for a subset of slot types which have many potentially uncovered concepts. The dataset details are in Table 2.

Concept Learning Dataset: a dialogue dataset with clarification questions collected by Jia et al. (2017). It contains 2,193 first-turn basic utterances with pre-collected concepts and 594 first-turn utterances with unseen (personalized) concepts. We use the original split of 1,534 train, 659 development, and 440 test utterances to train our model, and all 594 unseen-concept utterances to test our approach. It can be downloaded from http://stanford.edu/~robinjia/data/concept_learning_data.tar.gz

Snips Dataset: a custom intent engine dataset Snips (2017) collected by Snips for SLU model evaluation. It originally has 13,084 train utterances and 700 basic test (IND) utterances. We further split the train utterances into 12,384 train and 700 development utterances. It can be downloaded from https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines

ATIS Dataset: short for Airline Travel Information Systems, a widely used benchmark dataset in SLU research Hemphill et al. (1990) drawn from the ATIS-2 and ATIS-3 corpora. It contains 4,478 train, 500 development and 893 basic test (IND) utterances. It can be downloaded from https://github.com/yvchen/JointSLU/tree/master/data

We separate the slot types into personalized and population types and collect the new concepts from a group of crowd-sourced workers. For personalized concepts (e.g., playlist), we allow workers to freely create any new concept. For population concepts (e.g., movie name), we further validate the collected concepts by soliciting judgments from another group of crowd-sourced workers. Each worker is presented with 50 concepts of one slot type and required to select the correct ones. To judge the trustworthiness of each worker, we include 10 gold-standard concepts internally marked by experts. Finally, we keep all population concepts that are marked correct by trustworthy workers and all collected personalized concepts. At last, we substitute these new concepts into the original basic test set (IND) to construct the new test set (OOD). In the OOD test data, we label the unseen concepts as unknown in IOB format to preserve phrase level information for evaluation. Table 2 shows the data details and statistics.

We use the state-of-the-art RNN based slot filling model Goo et al. (2018) in all of our experiments, with the hyperparameters from the original paper for the ATIS and Snips datasets. For the Concept Learning dataset, we tested hidden state sizes 32, 64, 128 and 256 through validation; the hidden state size 64 from the original paper also gives the best results, so we use it in our experiments.

For the only hyperparameter $\delta$ in our method, we experimented with 10 values of $\delta$ from 0.05 to 0.5 in uniform steps of 0.05 and found that $\delta=0.1$ yields the best results. We implement our method in TensorFlow using the Adam optimizer with mini-batch size 16. On each dataset, we train both our RNN and Calibrated Prior RNN models for 20 epochs on a single GPU. All experiments are run on an NVIDIA Tesla V100 GPU with 16GB memory.

Table 3: Slot Filling F1 Scores on Basic Test (IND) Data

| Dataset \ Model | Calibrated Prior RNN | RNN Baseline |
| Concept Learning | 80.76 | 80.89 |
| Snips | 88.56 | 88.76 |
| ATIS | 95.06 | 95.19 |
Table 4: New Concept (OOD) Extraction Results on Concept Learning Dataset without OOD Training Data (Gray: SOTA in Jia et al. (2017); Dong et al. (2018)). Thresholds are set so that F1 drops at most 1% on the dev set; Basic Test (IND) F1: Calibrated Prior RNN 79.24, RNN 79.96. All columns below report New Concept Test (OOD) results.

| Uncertainty Metric \ Model | Calibrated Prior RNN Precision | Recall | F1 | RNN Precision | Recall | F1 |
| Dropout Perturbation | - | - | - | 25.62 | 36.04 | 29.95 |
| Gaussian Noise | - | - | - | 24.98 | 36.65 | 29.71 |
| Variance of Top Candidates | - | - | - | 29.66 | 39.97 | 34.05 |
| OOV | - | - | - | 29.35 | 41.59 | 34.41 |
| Syntax+OOV | - | - | - | 35.12 | 38.96 | 36.94 |
| Confidence | 45.22 | 45.47 | 45.34 | 28.66 | 36.75 | 32.20 |
| Confidence+Syntax | 48.09 | 48.72 | 48.40 | 41.11 | 39.62 | 40.35 |
| Confidence+OOV | 33.13 | 44.64 | 38.03 | 30.50 | 40.00 | 34.61 |
| Confidence+Syntax+OOV | 40.73 | 53.22 | 46.14 | 36.12 | 40.23 | 38.06 |
| Dirichlet Entropy | 46.78 | 45.19 | 45.97 | 40.90 | 43.39 | 42.11 |
| Dirichlet Entropy+Syntax | 49.05 | 48.03 | 48.53 | 44.02 | 46.85 | 45.39 |
| Dirichlet Entropy+OOV | 33.02 | 44.22 | 37.81 | 31.67 | 43.11 | 36.51 |
| Dirichlet Entropy+Syntax+OOV | 40.72 | 42.87 | 41.77 | 39.74 | 41.97 | 40.82 |
Table 5: New Concept (OOD) Extraction Results on Snips Dataset without OOD Training Data (Gray: SOTA in Jia et al. (2017); Dong et al. (2018)). Thresholds are set so that F1 drops at most 1% on the dev set; Basic Test (IND) F1: Calibrated Prior RNN 87.67, RNN 87.81. All columns below report New Concept Test (OOD) results.

| Uncertainty Metric \ Model | Calibrated Prior RNN Precision | Recall | F1 | RNN Precision | Recall | F1 |
| Dropout Perturbation | - | - | - | 31.56 | 60.15 | 41.40 |
| Gaussian Noise | - | - | - | 30.89 | 61.09 | 41.03 |
| Variance of Top Candidates | - | - | - | 44.81 | 64.78 | 52.44 |
| OOV | - | - | - | 33.66 | 63.02 | 43.88 |
| Syntax+OOV | - | - | - | 31.80 | 57.23 | 40.88 |
| Confidence | 46.34 | 71.26 | 56.16 | 45.22 | 62.25 | 52.39 |
| Confidence+Syntax | 46.50 | 71.07 | 56.22 | 45.27 | 63.29 | 52.56 |
| Confidence+OOV | 33.20 | 62.45 | 43.35 | 32.93 | 61.70 | 42.94 |
| Confidence+Syntax+OOV | 30.95 | 56.29 | 39.94 | 30.53 | 69.06 | 42.34 |
| Dirichlet Entropy | 47.44 | 72.26 | 57.28 | 46.60 | 69.25 | 55.71 |
| Dirichlet Entropy+Syntax | 47.86 | 73.51 | 57.97 | 46.62 | 70.33 | 56.07 |
| Dirichlet Entropy+OOV | 33.54 | 63.94 | 44.00 | 33.29 | 62.08 | 43.34 |
| Dirichlet Entropy+Syntax+OOV | 31.56 | 57.01 | 40.63 | 31.45 | 56.42 | 40.39 |

5.2 IND Slot Filling Results

Table 3 reports the slot filling results on the IND test sets. On all datasets, our Calibrated Prior RNN model only slightly affects performance (at most a 0.2 F1 drop) relative to the RNN baseline, due to the calibration loss.

5.3 Evaluation of OOD New Concept Extraction (without OOD training data)

We refer to our method as Dirichlet Entropy + Syntax, applied on both the (Dirichlet Prior) RNN and the Calibrated Dirichlet Prior RNN.

5.3.1 State-of-the-art (SOTA) Competitors

Since the existing supervised approach Rao and Daumé III (2018) is not applicable in our setting, we consider several SOTA competitors on the RNN model Jia et al. (2017); Dong et al. (2018) as follows (all parameters follow the original papers; higher scores indicate more uncertainty):

Dropout Perturbation: perform $F$ forward passes through the network with parameters perturbed by dropout drawn from a Bernoulli distribution with rate 0.25; the variance of the results is used as the uncertainty metric.

Gaussian Noise: perform $F$ forward passes through the network with parameters perturbed by Gaussian noise $N(0,0.01)$; the variance of the results is used as the uncertainty metric.
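Both perturbation baselines share the same recipe, differing only in the noise source. A hedged sketch (`noisy_forward` is a hypothetical hook that runs the slot filler once with freshly perturbed parameters; reducing the per-class variances to one score per word via the max is our illustrative choice):

```python
import numpy as np

def perturbation_uncertainty(words, noisy_forward, F=10):
    """Variance of the predicted label distribution over F perturbed passes.

    noisy_forward(words) -> (n, K) per-word label probabilities computed with
    dropout- or Gaussian-perturbed parameters (hypothetical model hook).
    """
    runs = np.stack([noisy_forward(words) for _ in range(F)])  # (F, n, K)
    return runs.var(axis=0).max(-1)          # one uncertainty score per word
```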

Variance of Top Candidates: use the variance of the probabilities of the top $K$ candidates ($K=5$).

Out-of-Vocabulary (OOV, a.k.a. UNK): first construct the vocabulary of words that do not belong to any concept in the training set:

\hat{V}=\{w:\exists(x_{t},y_{t})\in\text{IND}\;\text{s.t.}\;w=x_{t},y_{t}=\text{`O'}\}

For each word $z_{t}$ in $\boldsymbol{\mathrm{z}}$, we label it as “unknown” if it is predicted as ‘O’ but $z_{t}\notin\hat{V}$.
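A minimal sketch of this baseline (assuming training utterances are given as lists of (word, label) pairs):

```python
def build_non_concept_vocab(train_data):
    """V_hat: words observed with label 'O' somewhere in the IND training set."""
    return {x_t for utterance in train_data
            for x_t, y_t in utterance if y_t == "O"}

def oov_unknowns(words, predicted, vocab):
    """Relabel a word 'unknown' if predicted 'O' but never seen as non-concept."""
    return ["unknown" if y == "O" and w not in vocab else y
            for w, y in zip(words, predicted)]
```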

In addition, we also evaluate confidence as the uncertainty metric on both $\boldsymbol{\alpha}_{t}$ in the (Dirichlet Prior) RNN and $\tilde{\boldsymbol{\alpha}}_{t}$ in the Calibrated Dirichlet Prior RNN:

u_{t}=-\Big\{\max_{i}{\alpha_{t}(i)\over\alpha_{t}(0)}\;\;\text{or}\;\;\max_{i}{\tilde{\alpha}_{t}(i)\over\tilde{\alpha}_{t}(0)}\Big\}

To be consistent, we use negative confidences so that higher values indicate more uncertainty.
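As a sketch, this metric is one line on top of the (calibrated) concentrations:

```python
import numpy as np

def negative_confidence(alpha):
    """u_t = -max_i alpha_t(i) / alpha_t(0); higher means more uncertain."""
    return -(alpha / alpha.sum(-1, keepdims=True)).max(-1)
```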

Table 6: New Concept (OOD) Extraction Results on ATIS Dataset without OOD Training Data (Gray: SOTA in Jia et al. (2017); Dong et al. (2018)). Thresholds are set so that F1 drops at most 1% on the dev set; Basic Test (IND) F1: Calibrated Prior RNN 94.14, RNN 94.26. All columns below report New Concept Test (OOD) results.

| Uncertainty Metric \ Model | Calibrated Prior RNN Precision | Recall | F1 | RNN Precision | Recall | F1 |
| Dropout Perturbation | - | - | - | 32.23 | 63.54 | 42.77 |
| Gaussian Noise | - | - | - | 31.99 | 64.12 | 42.68 |
| Variance of Top Candidates | - | - | - | 34.61 | 68.18 | 45.91 |
| OOV | - | - | - | 31.83 | 68.71 | 43.51 |
| Syntax+OOV | - | - | - | 31.53 | 67.35 | 42.95 |
| Confidence | 35.64 | 68.25 | 46.83 | 34.27 | 68.93 | 45.78 |
| Confidence+Syntax | 35.73 | 68.71 | 47.01 | 34.73 | 69.21 | 45.98 |
| Confidence+OOV | 31.92 | 68.48 | 43.54 | 32.38 | 69.27 | 44.13 |
| Confidence+Syntax+OOV | 30.32 | 66.89 | 41.73 | 30.74 | 67.23 | 42.19 |
| Dirichlet Entropy | 38.20 | 70.29 | 49.50 | 37.06 | 70.63 | 48.61 |
| Dirichlet Entropy+Syntax | 38.31 | 70.32 | 49.60 | 37.06 | 70.75 | 48.64 |
| Dirichlet Entropy+OOV | 32.51 | 69.95 | 44.39 | 32.96 | 70.29 | 44.88 |
| Dirichlet Entropy+Syntax+OOV | 31.03 | 68.25 | 42.66 | 31.50 | 68.71 | 43.20 |

5.3.2 Results

Tables 4, 5 and 6 report the experimental results on the three datasets. We follow the standard metrics for slot filling evaluation: precision, recall and F1 score. For each dataset, we choose the confidence/entropy threshold $\theta$ at the point where the F1 score on the Basic Test (IND) set is reduced by at most 1% against the original model. That is, our uncertainty model has only a slight impact on the extraction of IND concepts, which answers the first research question. Note that new concepts not predicted as “unknown” but given the correct slot label are also counted as correct predictions.
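A hedged sketch of this threshold selection (`ind_f1` is a hypothetical evaluation hook returning the Basic Test (IND) F1 when all words with $u_{t}>\theta$ are marked unknown; lower thresholds mark more words unknown and hurt IND F1 more):

```python
import numpy as np

def pick_threshold(candidate_thresholds, ind_f1, base_f1, max_drop=0.01):
    """Return the smallest threshold keeping IND F1 within 1% of the base model.

    candidate_thresholds: e.g., sorted uncertainty scores from the dev set.
    ind_f1(theta) -> IND F1 when words with u_t > theta are marked unknown.
    """
    for theta in np.sort(candidate_thresholds):   # smallest first -> max recall
        if ind_f1(theta) >= base_f1 * (1.0 - max_drop):
            return theta
    return np.inf                                 # fall back: mark nothing unknown
```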

Evaluation of Dirichlet Entropy in RNN: We observe that the Dirichlet entropy on the (Dirichlet Prior) RNN improves the F1 score by up to 5.04% over SOTA on the Concept Learning dataset when combined with utterance syntax. This is because the entropy, with its underlying high-order distribution, can extract more “unknown” words. Likewise, Dirichlet entropy improves F1 over SOTA by 3.51% and 2.66% on Snips and ATIS, respectively.

Evaluation of Calibrated Prior RNN: We observe that both the confidence and entropy metrics further improve the F1 score on all three datasets. On the Concept Learning dataset, the F1 improvement reaches 8.18% due to the OOV words among the new concepts in the test set; such OOV words make the new concepts more separable with calibrated Dirichlet concentration parameters. On Snips and ATIS, where the new concepts contain fewer OOV words, the improvement reaches 5.41% and 3.62%. This follows the intuition behind ODIN Liang et al. (2017b) for concentration calibration.

Evaluation of SOTA competitors: Dropout perturbation and Gaussian noise perform worst since perturbing the model parameters breaks the learned distribution, especially for discrete natural language data. Variance of top candidates performs slightly better since it partially captures the uncertainty over the top candidate labels of each word. Simply labeling an OOV word as ‘unknown’ when it is predicted as ‘O’ leads to bad performance on all datasets since OOV words can also appear as non-concept words.

It is interesting to observe that combining syntax with both our methods and the baselines significantly improves F1 on the Concept Learning dataset and obtains slightly better F1 scores on the Snips and ATIS datasets. This is because the new concepts in the Concept Learning dataset are typically noun phrases, each corresponding to a subtree of the dependency tree. On the other hand, the many single-word concepts in ATIS and the many non-noun-phrase concepts in Snips limit the F1 improvement.

In addition, we train each model 10 times with different random seeds and conduct statistical significance t-tests. Our best method (Calibrated Prior RNN + Dirichlet Entropy + utterance syntax) is significantly better than all SOTA competitors (gray) on F1 score in all datasets with p-value $<0.01$.

6 Conclusion

We designed a Dirichlet prior RNN with concentration calibration for token-level uncertainty modeling to extract unknown concepts. First, we degenerated the Dirichlet prior into conventional RNN training without changing the model training procedure. We further proposed a multi-task learning scheme to calibrate the Dirichlet concentration and achieved state-of-the-art performance. Moreover, our method can also be applied to cutting edge Transformer based models. In future work, we will explore joint training with utterance syntax models.

References

  • Chen et al. (2018) Wenhu Chen, Yilin Shen, Xin Wang, and William Wang. 2018. Enhancing the robustness of prior network in out-of-distribution detection. CoRR, abs/1811.07308.
  • DeVries and Taylor (2018) Terrance DeVries and Graham W Taylor. 2018. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865.
  • Dong et al. (2018) Li Dong, Chris Quirk, and Mirella Lapata. 2018. Confidence modeling for neural semantic parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 743–753. Association for Computational Linguistics.
  • Fritzler et al. (2019) Alexander Fritzler, Varvara Logacheva, and Maksim Kretov. 2019. Few-shot classification in named entity recognition task. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, SAC ’19, page 993–1000, New York, NY, USA. Association for Computing Machinery.
  • Gal (2016) Yarin Gal. 2016. Uncertainty in deep learning. University of Cambridge.
  • Goo et al. (2018) Chih-Wen Goo, Guang Gao, Yun-Kai Hsu, Chih-Li Huo, Tsung-Chieh Chen, Keng-Wei Hsu, and Yun-Nung Chen. 2018. Slot-gated modeling for joint slot filling and intent prediction. In The Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT.
  • Hakkani-Tür et al. (2016) Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional rnn-lstm. In INTERSPEECH, pages 715–719.
  • Hemphill et al. (1990) Charles T. Hemphill, John J. Godfrey, and George R. Doddington. 1990. The atis spoken language systems pilot corpus. In Proceedings of the Workshop on Speech and Natural Language, HLT ’90, pages 96–101, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136.
  • Hendrycks and Gimpel (2017) Dan Hendrycks and Kevin Gimpel. 2017. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations (ICLR).
  • Hori et al. (2003) C. Hori, T. Hori, H. Isozaki, E. Maeda, S. Katagiri, and S. Furui. 2003. Deriving disambiguous queries in a spoken interactive odqa system. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03)., volume 1, pages I–I.
  • Hou et al. (2019) Yutai Hou, Zhihan Zhou, Yijia Liu, Ning Wang, Wanxiang Che, Han Liu, and Ting Liu. 2019. Few-shot sequence labeling with label dependency transfer and pair-wise embedding.
  • Hu et al. (2019) Ziniu Hu, Ting Chen, Kai-Wei Chang, and Yizhou Sun. 2019. Few-shot representation learning for out-of-vocabulary words. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 4102–4112. Association for Computational Linguistics.
  • Jansson and Liu (2017) Patrick Jansson and Shuhua Liu. 2017. Distributed representation, LDA topic modelling and deep learning for emerging named entity recognition from social media. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 154–159, Copenhagen, Denmark. Association for Computational Linguistics.
  • Jia et al. (2017) Robin Jia, Larry Heck, Dilek Hakkani-Tür, and Georgi Nikolov. 2017. Learning concepts through conversations in spoken dialogue systems. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP).
  • Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2016. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533.
  • Kurata et al. (2016) Gakuto Kurata, Bing Xiang, Bowen Zhou, and Mo Yu. 2016. Leveraging sentence-level information with encoder lstm for semantic slot filling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP’16, pages 2077–2083.
  • Lee et al. (2017) Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. 2017. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325.
  • Liang et al. (2017a) Dongyun Liang, Weiran Xu, and Yinge Zhao. 2017a. Combining word-level and character-level representations for relation classification of informal text. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 43–47, Vancouver, Canada. Association for Computational Linguistics.
  • Liang et al. (2017b) Shiyu Liang, Yixuan Li, and R Srikant. 2017b. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690.
  • Lin et al. (2017) Bill Y. Lin, Frank Xu, Zhiyi Luo, and Kenny Zhu. 2017. Multi-channel BiLSTM-CRF model for emerging named entity recognition in social media. In Proceedings of the 3rd Workshop on Noisy User-generated Text, pages 160–165, Copenhagen, Denmark. Association for Computational Linguistics.
  • Liu and Lane (2015) Bing Liu and Ian Lane. 2015. Recurrent neural network structured output prediction for spoken language understanding. In Proc. NIPS Workshop on Machine Learning for Spoken Language Understanding and Interactions.
  • Liu and Lane (2016) Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454.
  • Malinin and Gales (2018) Andrey Malinin and Mark Gales. 2018. Predictive uncertainty estimation via prior networks. arXiv preprint arXiv:1802.10501.
  • McCallum et al. (2000) Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000. Maximum entropy markov models for information extraction and segmentation. In Proceedings of the Seventeenth International Conference on Machine Learning, ICML’00, pages 591–598, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Mesnil et al. (2015) Grégoire Mesnil, Yann Dauphin, Kaisheng Yao, Yoshua Bengio, Li Deng, Dilek Hakkani-Tur, Xiaodong He, Larry Heck, Gokhan Tur, Dong Yu, et al. 2015. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 23(3):530–539.
  • Moosavi-Dezfooli et al. (2017) S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. 2017. Universal adversarial perturbations. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 86–94.
  • Nair and Hinton (2010) Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814.
  • Nguyen et al. (2015) A. Nguyen, J. Yosinski, and J. Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 427–436.
  • Peng and Yao (2015) Baolin Peng and Kaisheng Yao. 2015. Recurrent neural networks with external memory for language understanding. arXiv preprint arXiv:1506.00195.
  • Purver (2006) Matthew Purver. 2006. Clarie: Handling clarification requests in a dialogue system. Research on Language and Computation, 4:259–288.
  • Rajpurkar et al. (2018) Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. In Association for Computational Linguistics (ACL).
  • Rao and Daumé III (2018) Sudha Rao and Hal Daumé III. 2018. Learning to ask good questions: Ranking clarification questions using neural expected value of perfect information. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2737–2746. Association for Computational Linguistics.
  • Raymond and Riccardi (2007) Christian Raymond and Giuseppe Riccardi. 2007. Generative and discriminative algorithms for spoken language understanding. In INTERSPEECH.
  • Schlangen et al. (2001) David Schlangen, Alex Lascarides, and Ann Copestake. 2001. Resolving underspecification using discourse information. In Proceedings of SemDial-5 (BiDialog).
  • Shalev et al. (2018) Gabi Shalev, Yossi Adi, and Joseph Keshet. 2018. Out-of-distribution detection using multiple semantic label representations. arXiv preprint arXiv:1808.06664.
  • Snips (2017) Snips. 2017. https://github.com/snipsco/nlu-benchmark/tree/master/2017-06-custom-intent-engines.
  • Wang et al. (2018) Yu Wang, Yilin Shen, and Hongxia Jin. 2018. A bi-model based rnn semantic frame parsing model for intent detection and slot filling. In The Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT.
  • Xu and Sarikaya (2013) Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular crf for joint intent detection and slot filling. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 78–83. IEEE.
  • Yao et al. (2014) Kaisheng Yao, Baolin Peng, Yu Zhang, Dong Yu, Geoffrey Zweig, and Yangyang Shi. 2014. Spoken language understanding using long short-term memory neural networks. In 2014 IEEE Spoken Language Technology Workshop (SLT), SLT’14, pages 189–194.
  • Zhao and Feng (2018) Lin Zhao and Zhe Feng. 2018. Improving slot filling in spoken language understanding with joint pointer and attention. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 426–431. Association for Computational Linguistics.