
Prior Knowledge Driven Label Embedding for Slot Filling in Natural Language Understanding

Su Zhu,  Zijian Zhao, Rao Ma, and Kai Yu Su Zhu, Zijian Zhao, Rao Ma and Kai Yu are supported by the National Key Research and Development Program of China (Grant No.2017YFB1002102). Experiments have been carried out on the PI supercomputer at Shanghai Jiao Tong University. (Corresponding authors: Kai Yu.) The authors are with the SpeechLab and MoE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Abstract

Traditional slot filling in natural language understanding (NLU) predicts a one-hot vector for each word. This form of label representation lacks semantic correlation modelling, which leads to a severe data sparsity problem, especially when adapting an NLU model to a new domain. To address this issue, a novel label embedding based slot filling framework is proposed in this paper. Here, a distributed label embedding is constructed for each slot using prior knowledge. Three encoding methods are investigated to incorporate different kinds of prior knowledge about slots: atomic concepts, slot descriptions, and slot exemplars. The proposed label embeddings tend to share text patterns and to reuse data across different slot labels, which makes them useful for adaptive NLU with limited data. Also, since the label embedding is independent of the NLU model, it is compatible with almost all deep learning based slot filling models. The proposed approaches are evaluated on three datasets. Experiments on single domain and domain adaptation tasks show that label embedding achieves significant performance improvements over the traditional one-hot label representation as well as advanced zero-shot approaches.

Index Terms:
Natural language understanding, slot filling, label embedding, prior knowledge, domain adaptation.

I Introduction

Nowadays, the development of the mobile internet and smart devices has led to tremendous growth of conversational dialogue systems, such as Amazon Alexa, Google Assistant, Apple Siri and Microsoft Cortana. Natural language understanding (NLU) is a key component of these systems, parsing users’ utterances into the corresponding semantic representations [1] for certain narrow domains (e.g., booking a restaurant, searching for a flight). As a main task of NLU, slot filling is typically treated as a sequence labelling problem in which contiguous sequences of words are tagged with semantic labels (slots) [2, 3, 4, 5, 6].

Deep learning has achieved great success on the slot filling task in NLU [5, 6, 7, 8, 9, 10, 11], outperforming most traditional approaches [4, 12]. However, deep learning is notorious for performing poorly with limited labelled data. This is called the data sparsity problem, which usually refers to limited data samples of feature-label pairs. In many supervised learning tasks (e.g., part-of-speech tagging, named-entity recognition), the data sparsity problem mainly lies in the feature space since the label spaces are fixed. However, the NLU slot-filling task is domain-specific, and the output semantic labels are defined over the entire semantic scope of each domain. This brings about label space data sparsity, which is an even more serious problem in the NLU slot-filling task:

  • The label space of the slot-filling task in each domain is distinct from that of other domains. Therefore, it is a challenge to exploit data from different domains efficiently.

  • Narrow domains are increasingly popular in commercial dialogue systems. Growing numbers of narrow domains lead to fine-grained semantic labels (slots). For example, the WEATHER domain only concerns city_name, while the TRAVEL domain makes a distinction between arrival_city and departure_city.

To mitigate the label space data sparsity problem, a label embedding based method for slot filling in NLU is proposed. The label embedding is driven by prior knowledge about slot definitions, which encodes relations between different semantic labels and facilitates adaptive training of slot filling. It changes the representation of each semantic label from a one-hot vector to a distributed one. For example, slots arrival_city and departure_city are orthogonal in the one-hot vector space, but partially the same in the city dimension. A label embedding that captures slot relations allows patterns to be shared and the data of associated slots to be reused.

Unlike input word embeddings [13, 14], which can be trained on large unlabelled text corpora, label embeddings involve prior knowledge from domain experts or developers. Three kinds of human knowledge and the corresponding encoding methods are investigated in this paper: atomic concepts, slot descriptions and slot exemplars. An atomic concept refers to a pulverized slot, which is a smaller semantic unit than the slot, e.g., the atomic concepts of slot arrival_city can be destination and city_name. A slot description literally interprets each slot in natural language, e.g., slot arrival_city can be described as “city name of the destination”. A slot exemplar is an annotation example for each slot.

The atomic concept was initially proposed in our previous work [15], where labels were not represented as embedding vectors. Here, we exploit label embeddings independent of the NLU model, which are compatible with most deep learning-based slot-filling models. To further model the temporal dependencies of labels, a new label transition module based on label embeddings is also proposed within the conditional random field (CRF) [16]. Our proposed framework is evaluated on the DSTC 2&3 [17] and SNIPS [18] benchmarks and a custom dataset (AICar), outperforming all baselines significantly.

The following section discusses how our method relates to previous work. We introduce the slot filling task in Section III. Then, we describe the details of label embeddings for slot filling in Sections IV and V. The results of extensive experiments are given in Section VI. We conclude and give some future research directions in Section VII.

II Related Work

Standard approaches to solving the data sparsity problem of slot filling include semi-supervised learning [19, 20, 21, 22], domain adaptation [23, 24, 25, 26, 27, 28, 29, 30, 31] and zero-shot learning [32, 33, 34, 35, 36, 37]. The semi-supervised learning and domain adaptation (mostly parameter sharing) methods focus on improving the coverage of the feature space rather than the label space. The zero-shot learning methods involve ontological descriptions to enhance slot filling. However, Ferreira et al. [32] and Yazdani et al. [33] made the assumption that all possible values of each slot are known, which is not practical for real-world applications. Our methods seek to interpret semantic labels (slots) in multiple dimensions where relations between slots can be inferred implicitly. This helps reuse data across different labels and domains.

Label embeddings for SLU have been introduced in several works. Chen et al. [34] exploit label embeddings for intent detection, but their method is not suitable for slot filling. Some works choose to extract label embeddings from data samples [38, 39, 40, 41] by exploiting the words of values tagged with a semantic label, but they only focus on the corresponding values. We consider both the value and the context information in the slot exemplar encoding. The concept tagger [35] and zero-shot adaptive transfer [36] also incorporate slot embeddings derived from slot descriptions for slot filling. However, they predict the existence of each slot one by one with a slot-conditioned layer, which takes more time for training and inference. Moreover, those methods cannot model the temporal dependencies between different slots. We propose label embeddings for slot filling without changing the basic workflow, which overcomes the above issues.

III Slot Filling in NLU

Slot filling is a major task of natural language understanding (NLU) in task-oriented dialogue systems; it automatically extracts a set of attributes, or “slots”, and the corresponding values. It is typically treated as a sequence labelling problem. An example of data annotation is provided in Fig. 1. It follows the popular inside/outside/beginning (IOB) schema, where Boston and New York are the departure and arrival cities specified as slot values in the user’s utterance, respectively. In this paper, we use the word tag to refer to the semantic label in the IOB schema, which is distinct from slot.

Figure 1: An example of annotation for slot filling in domain searching flight.

In this section, we introduce slot filling in NLU for the single-domain setting and then for domain adaptation.

III-A Single Domain

Let $\boldsymbol{w}=(w_{1},w_{2},\cdots,w_{N})$ denote an input sequence (words), and $\boldsymbol{t}=(t_{1},t_{2},\cdots,t_{N})$ denote its output sequence (tags) in domain $d$, where $N$ is the sequence length and $t_{i}$ is one of the predefined tags, i.e. $t_{i}\in\mathcal{T}_{d}$. Let $\mathcal{T}_{d}$ denote all possible labels of domain $d$. In conventional methods, $t_{i}$ is represented as a one-hot vector $\mathbf{o}(t_{i})$. Slot filling in NLU is to estimate $p(\boldsymbol{t}|\boldsymbol{w})$, the posterior probability of the output sequence $\boldsymbol{t}$ given the input $\boldsymbol{w}$.

Recently, motivated by a number of successful neural network and deep learning methods in natural language processing, many neural network architectures have been applied to this task, such as the vanilla recurrent neural network (RNN) [5, 42, 43, 44], convolutional neural network (CNN) [8, 45], long short-term memory (LSTM) [7, 46, 47, 48], and encoder-decoder [10, 6, 49, 9]. A bi-directional LSTM based RNN (BLSTM) is adopted in this paper. Every input word is mapped to a vector via $\mathbf{w}_{i}=\mathbf{W}_{in}\mathbf{o}(w_{i})$, where $\mathbf{W}_{in}$ is an embedding matrix and $\mathbf{o}(w_{i})$ a one-hot vector. Hidden vectors are recursively computed at the $i$-th time step via:

$\mathbf{h}_{i}=[\overrightarrow{\mathbf{h}}_{i};\overleftarrow{\mathbf{h}}_{i}],\quad \overrightarrow{\mathbf{h}}_{i}=\text{f}_{\text{lstm}}(\mathbf{w}_{i},\overrightarrow{\mathbf{h}}_{i-1}),\quad \overleftarrow{\mathbf{h}}_{i}=\text{b}_{\text{lstm}}(\mathbf{w}_{i},\overleftarrow{\mathbf{h}}_{i+1})$

where $[\cdot;\cdot]$ denotes vector concatenation, $\mathbf{h}_{i}\in\mathbb{R}^{2n}$ ($n$ is the hidden size), and $\text{f}_{\text{lstm}}$ and $\text{b}_{\text{lstm}}$ are the LSTM functions for the forward and backward passes respectively. At each time step, a linear output layer is applied to predict tags, i.e.

$\mathbf{z}_{i}=\mathbf{W}_{o}\mathbf{h}_{i}$ (1)

where $\mathbf{W}_{o}\in\mathbb{R}^{|\mathcal{T}_{d}|\times 2n}$ denotes trainable parameters of the output layer (bias is omitted). In the following sections, prior knowledge driven label embeddings will be included in $\mathbf{W}_{o}$.

III-A1 BLSTM-softmax

The posterior probability is factorized:

$p(\boldsymbol{t}|\boldsymbol{w})=\prod_{i=1}^{N}p(t_{i}|\mathbf{h}_{i})=\prod_{i=1}^{N}\text{softmax}(\mathbf{z}_{i})^{\top}\mathbf{o}(t_{i})$ (2)

where $\mathbf{o}(t_{i})$ is a one-hot vector with a dimension for tag $t_{i}$, so $\text{softmax}(\mathbf{z}_{i})^{\top}\mathbf{o}(t_{i})$ is precisely the $t_{i}$-th element of the distribution defined by the softmax function.
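To make the BLSTM-softmax model concrete, the sketch below is a minimal PyTorch implementation of Equations (1) and (2); the class name, hyper-parameter defaults and the choice of PyTorch are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BLSTMSoftmaxTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)               # W_in
        self.blstm = nn.LSTM(emb_dim, hidden, bidirectional=True,
                             batch_first=True)                     # f_lstm / b_lstm
        self.out = nn.Linear(2 * hidden, num_tags, bias=False)     # W_o in Eq. (1)

    def forward(self, word_ids):
        h, _ = self.blstm(self.emb(word_ids))    # h: (batch, N, 2n)
        z = self.out(h)                          # z_i = W_o h_i
        return torch.log_softmax(z, dim=-1)      # per-step tag posteriors, Eq. (2)

# Training would minimise the negative log-likelihood of the gold tag sequence,
# e.g. with nn.NLLLoss over the flattened (batch*N, num_tags) outputs.
```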

III-A2 BLSTM-CRF

To model the temporal dependencies of output labels, several methods based on encoder-decoder architectures [9, 10, 6, 49] have been proposed for slot filling. Alternatively, a CRF output layer can jointly decode the best chain of labels for a given input sentence [50, 51, 52]. The CRF output layer calculates the posterior probability at the sentence level:

$p(\boldsymbol{t}|\boldsymbol{w})=\frac{\exp(\psi(\boldsymbol{w},\boldsymbol{t}))}{\sum_{\boldsymbol{t}^{\prime}}\exp(\psi(\boldsymbol{w},\boldsymbol{t}^{\prime}))}$ (3)

The score of a sentence $\boldsymbol{w}$ along with a path of tags $\boldsymbol{t}$ is given by the sum of transition scores and network scores:

$\psi(\boldsymbol{w},\boldsymbol{t})=\sum_{i=1}^{N}([\mathbf{A}]_{t_{i-1},t_{i}}+\mathbf{z}_{i}^{\top}\mathbf{o}(t_{i}))$ (4)

where $\mathbf{A}\in\mathbb{R}^{|\mathcal{T}_{d}|\times|\mathcal{T}_{d}|}$ is a transition matrix, and its element $[\mathbf{A}]_{m,n}$ models the transition from the $m$-th to the $n$-th label for a pair of consecutive time steps.

The maximum likelihood estimation is used for training, and the negative log-likelihood is given by

$Loss=-\log p(\boldsymbol{t}|\boldsymbol{w})$

Decoding is to search for the label sequence $\boldsymbol{t}^{*}$ with the highest posterior probability, i.e.

$\boldsymbol{t}^{*}=\underset{\boldsymbol{t}^{\prime}}{\operatorname{argmax}}\ p(\boldsymbol{t}^{\prime}|\boldsymbol{w})$

The search is independent across time steps for the softmax output function, while for the CRF output layer it can be solved efficiently with the Viterbi algorithm.
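For reference, here is a small NumPy sketch of the sentence-level score in Equation (4) and of Viterbi decoding for the CRF output layer; handling the first transition through a dummy start tag is an implementation assumption that is not spelled out above.

```python
import numpy as np

def crf_sentence_score(z, tags, A, start_tag):
    """z: (N, num_tags) emission scores; tags: length-N tag ids; A: transition matrix."""
    score, prev = 0.0, start_tag
    for i, t in enumerate(tags):
        score += A[prev, t] + z[i, t]          # Eq. (4): transition score + network score
        prev = t
    return score

def viterbi_decode(z, A, start_tag):
    """Return the highest-scoring tag path under the scores of Eq. (4)."""
    N, T = z.shape
    dp = A[start_tag] + z[0]                   # best score of paths ending in each tag at step 0
    back = np.zeros((N, T), dtype=int)
    for i in range(1, N):
        cand = dp[:, None] + A + z[i][None, :] # cand[prev, cur]
        back[i] = cand.argmax(axis=0)
        dp = cand.max(axis=0)
    path = [int(dp.argmax())]
    for i in range(N - 1, 0, -1):              # follow back-pointers
        path.append(int(back[i, path[-1]]))
    return path[::-1]
```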

III-B Domain Adaptation

Domain adaptation can alleviate the data sparsity problem of slot filling by taking advantage of data from other domains, e.g., via feature augmentation [23], parameter sharing [24, 25, 26], and bag of experts [27, 28].

Sharing parameters across domains is a popular approach to transferring knowledge learned from source domains to a target domain. Suppose we have a set of source domains, $\mathcal{D}_{src}$, and a target domain $td$. We mix all data of $\mathcal{D}_{src}$ to pre-train a slot filling model, then fine-tune the model using the data of $td$. The pre-training procedure needs to be performed only once for different target domains.

In the pre-training stage, different domains in $\mathcal{D}_{src}$ may not be able to share the output layer fully, because each narrow domain has a closed and limited semantic space that differs from the others, which may lead to conflicts in data annotation. Several examples of such conflicts between domains (e.g., DSTC2 vs. DSTC3) are shown in Table I.

TABLE I: Examples about conflicts of data annotation from different domains. We use [value:slot] for annotation.
domain data samples
DSTC2 I’m looking for a [Thai:Food] restaurant.
DSTC3 I’m looking for a [Thai:Food] [restaurant:Type].
Weather I’m going to [Beijing:City], and what’s the weather?
Travel I’m going to [Beijing:ToCity], and what’s the weather?

Therefore, the output layer should be masked at different positions, according to which domain the training data is from. For all the source domains, we have a union of labels, $\mathcal{T}_{src}=\bigcup_{d\in\mathcal{D}_{src}}\mathcal{T}_{d}$. Thus, the output matrix is scaled up to $\mathbf{W}_{o}^{src}\in\mathbb{R}^{|\mathcal{T}_{src}|\times 2n}$, and the transition matrix is $\mathbf{A}_{src}\in\mathbb{R}^{|\mathcal{T}_{src}|\times|\mathcal{T}_{src}|}$. For each domain $d\in\mathcal{D}_{src}$, there is a mask vector $\mathbf{m}_{d}\in\mathbb{R}^{|\mathcal{T}_{src}|}$:

$[\mathbf{m}_{d}]_{t_{j}}=\begin{cases}0,&\text{if }t_{j}\in\mathcal{T}_{d}\\-\infty,&\text{otherwise}\end{cases}$ (5)

where each $t_{j}\in\mathcal{T}_{src}$ is covered. It is used to control the output for domain $d$ by ignoring all labels out of this domain and replacing Equation (1) with:

$\mathbf{z}_{i}=\mathbf{W}_{o}^{src}\mathbf{h}_{i}+\mathbf{m}_{d}$ (6)

At the beginning of the fine-tuning stage of the target domain $td$, parameters are copied from the model pre-trained in the source domains for initialization. The word embedding matrix and BLSTM parameters are retained, while the output and transition matrices are copied partially:

$[\mathbf{W}_{o}^{td}]_{t_{j}}=\begin{cases}[\mathbf{W}_{o}^{src}]_{t_{j}},&\text{if }t_{j}\in\mathcal{T}_{src}\cap\mathcal{T}_{td}\\\text{random init.},&\text{otherwise}\end{cases}$ (9)

$[\mathbf{A}_{td}]_{t_{j},t_{k}}=\begin{cases}[\mathbf{A}_{src}]_{t_{j},t_{k}},&\text{if }\{t_{j},t_{k}\}\subset\mathcal{T}_{src}\cap\mathcal{T}_{td}\\\text{random init.},&\text{otherwise}\end{cases}$ (12)

where $t_{j}\in\mathcal{T}_{td}$, $t_{k}\in\mathcal{T}_{td}$, $\mathbf{W}_{o}^{td}\in\mathbb{R}^{|\mathcal{T}_{td}|\times 2n}$, $\mathbf{A}_{td}\in\mathbb{R}^{|\mathcal{T}_{td}|\times|\mathcal{T}_{td}|}$, and $[\cdot]_{j}$ denotes the $j$-th row of a matrix or vector.
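A hedged sketch of this fine-tuning initialisation is given below: output rows and transition entries for tags shared by the source and target domains are copied, while the rest are randomly initialised. The dictionaries src_tag2id and td_tag2id are hypothetical helpers mapping tag names to row indices, not part of the paper.

```python
import numpy as np

def init_target_params(W_src, A_src, src_tag2id, td_tag2id, hidden2n, scale=0.2):
    """W_src: (|T_src|, 2n) output matrix; A_src: (|T_src|, |T_src|) transition matrix."""
    n_td = len(td_tag2id)
    W_td = np.random.uniform(-scale, scale, (n_td, hidden2n))
    A_td = np.random.uniform(-scale, scale, (n_td, n_td))
    shared = [t for t in td_tag2id if t in src_tag2id]
    for t in shared:                                   # Eq. (9): copy shared output rows
        W_td[td_tag2id[t]] = W_src[src_tag2id[t]]
    for tj in shared:                                  # Eq. (12): copy shared transitions
        for tk in shared:
            A_td[td_tag2id[tj], td_tag2id[tk]] = A_src[src_tag2id[tj], src_tag2id[tk]]
    return W_td, A_td
```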

IV Why Label Embedding?

As illustrated in Equation (1), before the softmax or CRF function there is a score that estimates how likely each label $t_{j}$ is to appear at the $i$-th time step:

$[\mathbf{z}_{i}]_{t_{j}}=(\mathbf{W}_{o}\mathbf{h}_{i})^{\top}\mathbf{o}(t_{j})$ (13)

Consider an alternative low-rank decomposition of $\mathbf{W}_{o}$, i.e. $\mathbf{W}_{o}=\mathbf{B}^{\top}\mathbf{C}$, where $\mathbf{W}_{o}\in\mathbb{R}^{|\mathcal{T}_{d}|\times 2n}$, $\mathbf{B}\in\mathbb{R}^{K\times|\mathcal{T}_{d}|}$, and $\mathbf{C}\in\mathbb{R}^{K\times 2n}$. Equation (13) can then be rewritten as

$[\mathbf{z}_{i}]_{t_{j}}=(\mathbf{B}^{\top}\mathbf{C}\mathbf{h}_{i})^{\top}\mathbf{o}(t_{j})=(\mathbf{C}\mathbf{h}_{i})^{\top}(\mathbf{B}\mathbf{o}(t_{j}))$ (14)

$\mathbf{B}$ is a matrix corresponding to the lookup table of label embeddings, and $\mathbf{C}$ is a linear transformation matrix; they respectively map the label $t_{j}$ and the features $\mathbf{h}_{i}$ into a joint space.

Imagine that there are already pre-trained or pre-defined label embeddings which reflect relations (similarities and differences) among labels; then the model learned from labelled data may even be adapted to labels that never appear in the data. This can help reduce the amount of required in-domain data, enabling developers to bootstrap new domains quickly.
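The factorisation in Equation (14) is easy to state in code; a minimal sketch, assuming NumPy arrays shaped as above, is:

```python
import numpy as np

def label_embedding_scores(h, B, C):
    """h: (2n,) BLSTM feature; B: (K, num_tags) label embeddings; C: (K, 2n) projection.
    Returns z with z[t_j] = (C h)^T (B o(t_j)), i.e. Eq. (14)."""
    return (C @ h) @ B          # equivalent to B^T C h, one score per label
```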

V Label Embedding for Slot Filling

In this section, we introduce the proposed slot-filling method based on label embeddings, which represents each output tag as a distributed vector. However, unlike word embeddings [13, 14], which can be pre-trained with large unlabelled text corpora, data with semantic labels is very limited and suffers from the label space sparsity issue. We therefore derive label embeddings from the semantic definitions of certain domains. Three kinds of prior knowledge about slot definitions are investigated, together with the corresponding encoding methods: atomic concepts, slot descriptions and slot exemplars.

Since the labels (tags) in slot filling are in IOB format, we define a simple notation to split a tag into its IOB symbol and slot. For each tag $t$ (e.g., “B-from_city”), $\text{IOB}(t)$ returns the IOB symbol (e.g., “B”), and $\text{SLOT}(t)$ returns the slot name (e.g., “from_city”). Note that $\text{SLOT}(\text{``O''})$ returns null.
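A minimal sketch of this notation, assuming tags of the form "B-from_city", "I-from_city" and "O":

```python
def IOB(tag):
    """Return the IOB symbol of a tag, e.g. "B-from_city" -> "B"."""
    return tag.split('-', 1)[0]

def SLOT(tag):
    """Return the slot name of a tag, or None for the "O" tag."""
    return tag.split('-', 1)[1] if '-' in tag else None

assert IOB("B-from_city") == "B" and SLOT("B-from_city") == "from_city"
assert IOB("O") == "O" and SLOT("O") is None
```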

V-A Atomic Concept

The atomic concept assumes that semantic slots have substructures composed of smaller semantic units. As pure prior knowledge, each slot is represented as a set of atoms. Atomic concepts are pulverized slots, i.e. much smaller semantic units than the slots themselves. Each slot is composed of different atomic concepts, e.g., slot from_city can be defined as the set of atoms {from_location, city_name}, and slot date_of_birth can be represented as {date, birth}, as shown in Table II.

Atomic concepts are not defined automatically; they are designed manually by domain experts or developers. Atomic concepts can be classified into two categories: one is value-aware, and the other is context-aware. A value-aware atom is often a category name of values, e.g., “New York” is a city_name. Context-aware atoms reflect implicit dependencies around the targeted value, e.g., to_location depends on context like “going to”, “fly to”, “arrive at”, as shown in Fig. 2. Modelling at the atomic-concept level helps discover the linguistic patterns of related slots through atom sharing, and it can even decrease the required amount of data.

TABLE II: Examples of slot representation by atomic concepts.
slot atomic concepts
city {city_name}
from_city {city_name, from_location}
depart_city {city_name, from_location}
arrive_airport {airport_name, to_location}
city_of_birth {city_name, birth}
date_of_birth {date, birth}
deny-to_city {city_name, to_location, deny}

V-A1 Single Domain

We represent each slot $s$ of domain $d$ as a set of atomic concepts, i.e. $\text{atoms}(s)=\{a_{1},\cdots,a_{M_{s}}\}$, $a_{l}\in\mathcal{A}_{d}$, where $\mathcal{A}_{d}$ is the vocabulary of atomic concepts and $M_{s}$ is the number of atomic concepts belonging to this slot. Thus, each slot can be viewed as a binary vector $\mathbf{b}(\text{atoms}(s))$ of length $|\mathcal{A}_{d}|$, while $\mathbf{b}(\text{atoms}(\text{SLOT}(\text{``O''})))=\mathbf{0}$. Due to the IOB schema, the label embedding consists of two parts, a 3-dimensional IOB one-hot vector and a binary vector of atoms, as illustrated in Fig. 3 (a). The label embedding of tag $t$ is

$\text{emb}(t)=[\mathbf{o}(\text{IOB}(t));\mathbf{b}(\text{atoms}(\text{SLOT}(t)))]$ (15)

where $\text{emb}(t)\in\mathbb{R}^{3+|\mathcal{A}_{d}|}$, $t\in\mathcal{T}_{d}$.

Finally, a label embedding matrix for domain $d$ is created, i.e., $\mathbf{B}_{d}\in\mathbb{R}^{(3+|\mathcal{A}_{d}|)\times|\mathcal{T}_{d}|}$, where $[\mathbf{B}_{d}^{\top}]_{t}=\text{emb}(t)$. Equation (1) can be rewritten as

$\mathbf{z}_{i}=\mathbf{B}_{d}^{\top}\mathbf{C}_{d}\mathbf{h}_{i}$ (16)

where $\mathbf{C}_{d}\in\mathbb{R}^{(3+|\mathcal{A}_{d}|)\times 2n}$ denotes trainable parameters of the output layer, but $\mathbf{B}_{d}$ is fixed as a kind of prior knowledge.
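The sketch below builds one atomic-concept label embedding as in Equation (15); the slot-to-atom mapping follows Table II, while the ordering of the atom vocabulary is an arbitrary assumption.

```python
import numpy as np

IOB_SYMBOLS = ["B", "I", "O"]

def atom_label_embedding(tag, slot2atoms, atom_vocab):
    prefix, _, slot = tag.partition('-')          # "B-from_city" -> ("B", "-", "from_city")
    iob = np.zeros(3)
    iob[IOB_SYMBOLS.index(prefix)] = 1.0          # one-hot IOB part, o(IOB(t))
    atoms = np.zeros(len(atom_vocab))             # binary atom part, b(atoms(SLOT(t)))
    if slot:                                      # the "O" tag keeps an all-zero atom part
        for a in slot2atoms[slot]:
            atoms[atom_vocab.index(a)] = 1.0
    return np.concatenate([iob, atoms])           # emb(t), length 3 + |A_d|

slot2atoms = {"from_city": ["city_name", "from_location"]}   # cf. Table II
atom_vocab = ["city_name", "from_location", "to_location"]
col = atom_label_embedding("B-from_city", slot2atoms, atom_vocab)  # one column of B_d
```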

V-A2 Domain Adaptation

Similar to Subsection III-B, domain adaptation with atomic concepts proceeds as follows. For all the source domains, we have a union of atoms, $\mathcal{A}_{src}=\bigcup_{d\in\mathcal{D}_{src}}\mathcal{A}_{d}$. Therefore, the output matrix is scaled up to $\mathbf{C}_{src}\in\mathbb{R}^{(3+|\mathcal{A}_{src}|)\times 2n}$, and $\mathbf{B}_{d}\in\mathbb{R}^{(3+|\mathcal{A}_{src}|)\times|\mathcal{T}_{d}|}$. For each domain $d\in\mathcal{D}_{src}$, there is a mask vector $\mathbf{m}_{d}\in\mathbb{R}^{3+|\mathcal{A}_{src}|}$:

$[\mathbf{m}_{d}]_{a_{j}}=\begin{cases}0,&\text{if }a_{j}\in\mathcal{A}_{d}\text{ or }a_{j}\in\text{IOB}\\-\infty,&\text{otherwise}\end{cases}$ (17)

for each $a_{j}\in\mathcal{A}_{src}\cup\text{IOB}$. It is used to ignore the atomic concepts out of domain $d$ and replace Equation (6) with:

$\mathbf{z}_{i}=\mathbf{B}_{d}^{\top}(\mathbf{C}_{src}\mathbf{h}_{i}+\mathbf{m}_{d})$ (18)

For fine-tuning in the target domain $td$, the word embedding matrix and BLSTM parameters are still retained. The initialization of $\mathbf{C}_{td}\in\mathbb{R}^{(3+|\mathcal{A}_{td}|)\times 2n}$ for $td$ is

$[\mathbf{C}_{td}]_{a_{j}}=\begin{cases}[\mathbf{C}_{src}]_{a_{j}},&\text{if }a_{j}\in(\mathcal{A}_{src}\cap\mathcal{A}_{td})\cup\text{IOB}\\\text{random init.},&\text{otherwise}\end{cases}$

for each $a_{j}\in\mathcal{A}_{td}\cup\text{IOB}$. Thus, we calculate the output scores of the target domain as $\mathbf{z}_{i}=\mathbf{B}_{td}^{\top}\mathbf{C}_{td}\mathbf{h}_{i}$, where $\mathbf{B}_{td}\in\mathbb{R}^{(3+|\mathcal{A}_{td}|)\times|\mathcal{T}_{td}|}$.

The composition of atomic concepts for each slot provides an explicit interpretation of slot relations. However, it requires considerable human effort to design atomic concepts carefully, and it is hard to keep them consistent across different domains.

Figure 2: An example of context dependency of slot deny-to_city. deny and to_location are context-aware atomic concepts, and city_name focuses on the value “New York”.
Figure 3: Three different label embeddings, each consisting of an IOB symbol and a slot vector. (a) Atomic concept: each slot is represented by finer-grained semantic units. (b) Slot description encoding: each slot has a textual description in natural language, which can be encoded into a vector by several sequence models. (c) Slot exemplar encoding: given a number of exemplars for each slot, the exemplars are encoded into vectors which contain the context and value information.

V-B Slot Description Encoding

In this method, we assume that there is a textual description for each slot in natural language. It is also pre-defined by domain experts or developers, but with less human labour than atomic concepts. For example, deny-to_city can be described as “not this arrival city”. The descriptions of related slots may include identical, synonymous or even antonymous words, which can help reveal the relations between different slots. Thus, we can extract label embeddings literally from slot descriptions.

Unlike label embeddings built from atomic concepts, an encoder is required to convert literal slot descriptions of variable lengths into fixed-dimensional vectors, as delineated in Fig. 3 (b). Suppose each slot $s$ has a description $\text{desc}(s)=x_{1}\cdots x_{L_{s}}$, where $L_{s}$ is the sequence length. Three encoders are applied: a simple mean of word embeddings, a BLSTM model and a CNN model. We introduce them for a single domain, then clarify how to perform domain adaptation with slot description based label embeddings.

V-B1 Mean Encoder of Word Embeddings

Every description word is mapped to an $m$-dimensional vector via $\mathbf{x}_{l}=\mathbf{W}_{in}\mathbf{o}(x_{l})$, where $\mathbf{W}_{in}$ is the word embedding matrix and $\mathbf{o}(x_{l})$ a one-hot vector. The embedding of slot $s$ is simply the mean of the word embeddings of all words in $\text{desc}(s)$, i.e.

$\text{mean}(\text{desc}(s))=unit\Big(\frac{1}{L_{s}}\sum_{l=1}^{L_{s}}\mathbf{x}_{l}\Big)$ (19)

where $unit(\mathbf{v})=\frac{\mathbf{v}}{||\mathbf{v}||}$ normalizes vector $\mathbf{v}$ to unit $l_{2}$ norm, which keeps slot embeddings at a limited scale.
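A small sketch of this mean encoder (Equation (19)), assuming a dictionary-style word embedding table:

```python
import numpy as np

def unit(v):
    """Normalise v to unit l2 norm (left unchanged if it is the zero vector)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def mean_desc_encoder(description, word_emb):
    """description: list of words, e.g. ["city", "name", "of", "the", "destination"];
    word_emb: dict mapping each word to its embedding vector."""
    return unit(np.mean([word_emb[w] for w in description], axis=0))   # Eq. (19)
```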

V-B2 BLSTM Encoder

The hidden vectors of the BLSTM encoder are recursively computed at the $l$-th time step via:

$\overrightarrow{\mathbf{v}}_{l}=\text{f}_{\text{lstm}}(\mathbf{x}_{l},\overrightarrow{\mathbf{v}}_{l-1}),\ l=1,\cdots,L_{s}$ (20)

$\overleftarrow{\mathbf{v}}_{l}=\text{b}_{\text{lstm}}(\mathbf{x}_{l},\overleftarrow{\mathbf{v}}_{l+1}),\ l=L_{s},\cdots,1$ (21)

where $\text{f}_{\text{lstm}}$ and $\text{b}_{\text{lstm}}$ are the LSTM functions for the forward and backward passes respectively. We concatenate the final hidden vectors of the two directions as the slot embedding, i.e.

$\text{BLSTM}(\text{desc}(s))=unit([\overrightarrow{\mathbf{v}}_{L_{s}};\overleftarrow{\mathbf{v}}_{1}])$ (22)

V-B3 CNN Encoder

The CNN is also a typical model for encoding a sequence of words into a fixed-dimensional vector [53]. A linear transformation is applied on a window of $k$ words to produce a set of new features (we set $k=3$ in this work). To avoid the window size exceeding the sequence length, zero-padding is used on both sides of the input sequence. After processing all possible windows of words, we get a sequence of new feature vectors. A max-over-time pooling operation [54] is applied over the new features, taking the maximum value of each row to form the final feature vector $\mathbf{c}$. Thus,

$\text{CNN}(\text{desc}(s))=unit(\mathbf{c})$ (23)
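A hedged PyTorch sketch of this CNN encoder (window size $k=3$, zero padding, max-over-time pooling, unit normalisation); the class name and dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DescCNNEncoder(nn.Module):
    def __init__(self, emb_dim=100, feat_dim=100, k=3):
        super().__init__()
        # padding = k // 2 zero-pads both sides, so even short descriptions yield output
        self.conv = nn.Conv1d(emb_dim, feat_dim, kernel_size=k, padding=k // 2)

    def forward(self, desc_emb):
        # desc_emb: (L_s, emb_dim) word embeddings of one slot description
        x = desc_emb.t().unsqueeze(0)              # -> (1, emb_dim, L_s)
        feats = self.conv(x)                       # linear transform over k-word windows
        c = feats.max(dim=2).values.squeeze(0)     # max-over-time pooling -> (feat_dim,)
        return c / c.norm()                        # unit(c), Eq. (23)
```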

V-B4 Single Domain and Domain Adaptation

With one of these three encoders, we can encode the description of a slot as a vector. Thus, the label embedding of tag $t\in\mathcal{T}_{d}$ is

$\text{emb}(t)=[\mathbf{o}(\text{IOB}(t));\text{ENCODER}(\text{desc}(\text{SLOT}(t)))]$ (24)

where ENCODER can be the mean, BLSTM or CNN encoder, $\text{emb}(t)\in\mathbb{R}^{K}$, and $K$ is the embedding length. Note that $\text{emb}(\text{``O''})=\mathbf{0}$.

For any domain $d\in\mathcal{D}_{src}\cup\{td\}$, we get a label embedding matrix $\mathbf{B}_{d}\in\mathbb{R}^{K\times|\mathcal{T}_{d}|}$, with $[\mathbf{B}_{d}^{\top}]_{t}=\text{emb}(t)$ for any $t\in\mathcal{T}_{d}$. Both Equation (1) and (6) can be rewritten as

$\mathbf{z}_{i}=\mathbf{B}_{d}^{\top}\mathbf{C}\mathbf{h}_{i}$ (25)

where $\mathbf{z}_{i}\in\mathbb{R}^{|\mathcal{T}_{d}|}$, $\mathbf{C}\in\mathbb{R}^{K\times 2n}$ denotes trainable parameters of the output layer, but $\mathbf{B}_{d}$ is computed by the slot description encoder on domain $d$. Since the encoder is domain-independent, $\mathbf{C}$ and the encoder can also be fully shared across domains, as well as the sentence BLSTM which produces $\mathbf{h}_{i}$.

This method only requires an unstructured short text describing each slot, which is easier to obtain, although the label encoders involved may require more computation.

V-C Slot Exemplar Encoding

Figure 4: Two methods to encode a data exemplar of a slot, e.g., from_city. (a) Word embeddings: word embeddings in the left context, the value and the right context are averaged and concatenated respectively. (b) Language models: a biLM is used to encode the exemplar into context-aware vectors, and we exploit the vectors at the positions of the slot annotation.

In this method, we propose to extract label embeddings from slot exemplars, which contain values of each slot and their neighbouring contexts, as shown in Fig. 3 (c). An exemplar of slot $s$ can be a sentence $\boldsymbol{e}=e_{1}\cdots e_{|e|}$ together with the (start, end) positions $(sp,ep)$ of the slot annotation, i.e. $\boldsymbol{e}_{sp:ep}$ is a value of $s$. For example, one exemplar of slot from_city is “flights leaving New York to Shanghai”, where “New York” at positions $(3,4)$ is its value. Two methods are introduced to obtain label embeddings by encoding slot exemplars, utilizing word embeddings and language models respectively.

V-C1 Word Embeddings

Every word in the exemplar sentence is mapped to a vector via $\mathbf{e}_{j}=\mathbf{W}_{e}\mathbf{o}(e_{j})$, where $\mathbf{W}_{e}$ is a word embedding matrix and $\mathbf{o}(e_{j})$ is a one-hot vector. To specify the value and context information of the slot, we split the exemplar sentence into three segments: left context, value and right context, as illustrated in Fig. 4 (a). Then, the word embeddings in these three parts are respectively averaged and concatenated, composing the slot embedding

$\text{emb}(s)=unit([\text{ave}(\mathbf{e}_{<sp});\text{ave}(\mathbf{e}_{sp:ep});\text{ave}(\mathbf{e}_{>ep})])$ (26)

where $\text{ave}(\mathbf{e}_{<sp})$ and $\text{ave}(\mathbf{e}_{>ep})$ are zero vectors when $sp=1$ and $ep=|e|$ respectively.
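A sketch of Equation (26), treating $sp$ and $ep$ as 1-based positions as in the example above; the word-to-vector dictionary is an assumption for illustration:

```python
import numpy as np

def unit(v):
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def exemplar_embedding(words, sp, ep, word_emb, dim):
    """words: exemplar tokens; (sp, ep): 1-based value span; word_emb: word -> vector."""
    def ave(segment):
        return np.mean([word_emb[w] for w in segment], axis=0) if segment else np.zeros(dim)
    left, value, right = words[:sp - 1], words[sp - 1:ep], words[ep:]
    return unit(np.concatenate([ave(left), ave(value), ave(right)]))   # Eq. (26)

# e.g. exemplar_embedding("flights leaving New York to Shanghai".split(), 3, 4, emb, 200)
```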

V-C2 Language Models

Different from word embeddings, LSTM-based bidirectional language models (biLMs) [55] directly encode context information into their hidden states. The input sentence $\boldsymbol{e}=e_{1}\cdots e_{|e|}$ is converted into two sequences of hidden vectors ($\overrightarrow{\mathbf{u}}_{1:|e|}$, $\overleftarrow{\mathbf{u}}_{1:|e|}$) by the forward and backward LMs respectively, as shown in Fig. 4 (b). Since the hidden states already contain information of the history and the current word, we use the hidden vectors at positions ranging from $sp$ to $ep$ to compose the slot embedding:

$\text{emb}(s)=unit([\text{ave}(\overrightarrow{\mathbf{u}}_{sp:ep});\text{ave}(\overleftarrow{\mathbf{u}}_{sp:ep})])$ (27)

V-C3 Single Domain and Domain Adaptation

Now, we can encode the exemplars of any slot as a vector (if there are multiple exemplars for one slot, an average is taken). Thus, the label embedding of tag $t\in\mathcal{T}_{d}$ is

$\text{emb}(t)=[\mathbf{o}(\text{IOB}(t));\text{emb}(\text{SLOT}(t))]$ (28)

where $\text{emb}(t)\in\mathbb{R}^{K}$, $K$ is the embedding length, and $\text{emb}(\text{SLOT}(\text{``O''}))=\mathbf{0}$.

For any domain $d\in\mathcal{D}_{src}\cup\{td\}$, we get the label embedding matrix $\mathbf{B}_{d}\in\mathbb{R}^{K\times|\mathcal{T}_{d}|}$, with $[\mathbf{B}_{d}^{\top}]_{t}=\text{emb}(t)$ for any $t\in\mathcal{T}_{d}$. Both Equation (1) and (6) can be rewritten as

$\mathbf{z}_{i}=\mathbf{B}_{d}^{\top}\mathbf{C}\mathbf{h}_{i}$ (29)

where $\mathbf{z}_{i}\in\mathbb{R}^{|\mathcal{T}_{d}|}$, $\mathbf{C}\in\mathbb{R}^{K\times 2n}$ denotes trainable parameters of the output layer which can be shared across domains. However, $\mathbf{B}_{d}$ is computed using pre-trained word embeddings or biLMs in this paper, which will be fixed.

This method requires minimal prior knowledge (only a few labelled exemplars), but it may rely heavily on the quality of the pre-trained word embeddings or biLMs.

V-D Label Embedding-based CRF Layer

The typical linear-chain CRF provides the transition between a pair of labels as a scalar score, as shown in Equation (4). However, prior knowledge about slots is not exploited in the traditional transition matrix of the CRF layer. We propose a new transition model of the CRF layer that depends on label embeddings for slot filling. It calculates the transition score using a bilinear model. The transition score of two tags $t_{i-1}$ and $t_{i}$ in Equation (4) is computed via:

$[\mathbf{A}_{d}]_{t_{i-1},t_{i}}=\text{emb}(t_{i-1})^{\top}\mathbf{W}_{a}\,\text{emb}(t_{i})$ (30)

where $t_{i-1}\in\mathcal{T}_{d}$, $t_{i}\in\mathcal{T}_{d}$, and $\mathbf{A}_{d}\in\mathbb{R}^{|\mathcal{T}_{d}|\times|\mathcal{T}_{d}|}$ is the transition matrix for domain $d$. The parameters of $\mathbf{W}_{a}$ are domain-independent and can be fully shared across domains. In the inference stage, $\mathbf{A}_{d}$ is computed only once per domain.

Compared with the traditional linear-chain CRF, whose transition scores are initialized as independent scalars and learned from scratch, we compute transition scores by matching the label embeddings of the start and end tags. Our model can thus transfer knowledge about label transitions by utilizing the implicit slot relations in label embeddings. Besides the bilinear model, there are other alternatives, such as the attention mechanism [56, 57] and multi-head attention [58].
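A minimal PyTorch sketch of the bilinear transition model in Equation (30); treating the label embedding matrix as a fixed input follows the description above, while the initialisation range of $\mathbf{W}_{a}$ is an assumption:

```python
import torch
import torch.nn as nn

class BilinearTransition(nn.Module):
    """Transition scores A_d[m, n] = emb(t_m)^T W_a emb(t_n), as in Eq. (30)."""
    def __init__(self, emb_dim):
        super().__init__()
        self.W_a = nn.Parameter(torch.empty(emb_dim, emb_dim).uniform_(-0.2, 0.2))

    def forward(self, label_emb):
        # label_emb: (num_tags, K) fixed label embeddings of one domain;
        # the returned matrix is computed once per domain at inference time.
        return label_emb @ self.W_a @ label_emb.t()
```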

VI Experiments

In this section, we first introduce the datasets and baselines. Section VI-C then gives the details of the experimental setup. In Section VI-D, we compare the performance of our proposed methods with traditional slot filling methods.

VI-A Datasets

We carry out experiments on the following NLU datasets.

  • DSTC 2&3 [17]: It contains two domains, DSTC2 (source domain) and DSTC3 (target domain). DSTC2 comprises dialogues from the restaurant information domain in Cambridge. DSTC3 introduces a tourist information domain about restaurants, pubs and coffee shops in Cambridge; it is an extension of DSTC2 but has only a seed set for training. We use manual transcripts as input, ignoring speech recognition errors. Several slots of DSTC2 are reused in DSTC3.

  • AICar: The AICar dataset is a custom dataset collected from a commercial in-car spoken dialogue system which has thousands of users. It consists of three domains: MAP navigation, WEATHER forecast and FLIGHT information. There is almost no slot overlap among these domains, yet the slots are partially associated, e.g., MAP has slot starting_point, WEATHER has city_name, and FLIGHT has departure_city.

  • SNIPS: It is a public NLU dataset [18] of crowdsourced user utterances across 7 intents, with about 2000 training samples per intent, collected to benchmark the performance of voice assistants.

All of the datasets are annotated with slot labels. (Notice that because the original semantic annotations are not aligned with words in DSTC 2&3, we simply use string matching for alignment, which converts the semantic annotations into the slot filling format.) The detailed statistics of each dataset are shown in Table III. The language of DSTC 2&3 and SNIPS is English, and AICar is in Chinese. To avoid Chinese word segmentation errors [59], the slot filling annotation of AICar is at the Chinese-character level.

TABLE III: Dataset statistics. “AL(desc.)” denotes the average length of slot descriptions.
Dataset | Domain | Train | Test | #Slot | #Atom | AL(desc.)
DSTC 2&3 | DSTC2 | 15611 | 9890 | 17 | 11 | 2.2
DSTC 2&3 | DSTC3 | 109 | 18715 | 30 | 16 | 2.4
AICar | Map | 12000 | 20661 | 30 | 16 | 4.3
AICar | Weather | 10150 | 7236 | 36 | 20 | 3.4
AICar | Flight | 1500 | 4407 | 22 | 18 | 4.0
SNIPS | - | 13784 | 700 | 39 | - | 1.8

VI-B Baselines

We compare with one traditional method, two label embedding methods and two zero-shot learning based approaches:

  • One-hot: A 2-layer BLSTM model and one linear output layer (CRF layer is optional) for slot filling, as shown in Section III. Each label is discrete and represented as a one-hot vector in this method.

  • Canonical Correlation Analysis (CCA) label embedding [38]: CCA is exploited to induce slot vectors by analysing co-occurrences of slots and words, i.e. words tagged with the corresponding slots.

  • Value word embedding [39]: We take the average of pre-trained embeddings of the value words for slot embedding, while ignoring the context words.

  • Concept Tagger (CT) [35] and Zero-shot Adaptive Transfer (ZAT) [36]: These two zero-shot methods also condition slot filling on textual slot descriptions, using a contextual BLSTM layer and a slot-dependent BLSTM layer. However, they predict an IOB sequence for each slot separately, which may lead to overlapping segmentations and costs more time.

VI-C Experimental Setup

VI-C1 Label embeddings of slot filling

Three different approaches to extracting label embeddings are investigated in this paper, via atomic concepts, slot descriptions and slot exemplars, which involve different degrees of human knowledge. Table III also provides the detailed statistics of the label representations. For atomic concepts, we act as domain experts to manually split each slot into small units for all five domains, e.g., deny-starting_point is split into “deny”, “from_location” and “poi_name”. As an extension of DSTC2, DSTC3 contains 6 new atoms that constitute about 25% of slot tags. For slot descriptions, we leverage tokenized slot names since each slot name is already a meaningful phrase in the datasets, e.g., deny-starting_point is described as “deny starting point”. The training data can be directly exploited as data exemplars for the corresponding slots.

VI-C2 Network Hyper-parameters

The experimental setup was kept the same for all three datasets with minor adjustments in hyper-parameters. We adopt a 2-layer BLSTM model to encode sentences, with 200 hidden units. For training, the network parameters are randomly initialized under the uniform distribution $[-0.2,0.2]$. We use the Adam optimizer [60] with learning rate 0.001 for all experiments. Dropout with a probability of 0.5 is applied to the non-recurrent connections during the training stage. The batch size is 20 for DSTC2, MAP and WEATHER, 5 for DSTC3 and FLIGHT, and 10 for SNIPS. The maximum norm for gradient clipping is set to 5, and we use weight decay regularization on all weights with $\lambda=1e\text{-}6$ to avoid over-fitting.

VI-C3 Network Input Embeddings

We exploit pre-trained word embeddings to initialize the input embedding layers. For AICar, we train an LSTM-based bidirectional language model (biLM) at the Chinese-character level using the zhwiki corpus (https://dumps.wikimedia.org/zhwiki/latest). The embedding dimension is 200. For DSTC 2&3 and SNIPS, we directly adopt a pre-trained English biLM from ELMo [55] (https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md). The word embedding dimension is 1024.

VI-C4 Encoder of slot exemplars

The pre-trained input embeddings and biLMs are also exploited to compute label embeddings based on slot exemplars, as shown in Section V-C. In this paper, we choose to fix the encoder of slot exemplars to reduce the training cost. Therefore, the performance of slot filling based on slot exemplars may rely heavily on the quality of the pre-trained biLMs.

VI-C5 Data settings of domain adaptation

To evaluate the efficiency of different methods in leveraging data, we set up a domain adaptation problem. In DSTC 2&3, DSTC2 is the source domain and DSTC3 is the target domain. In AICar, we choose each domain as the target one and the other two domains as source domains. Moreover, 1, 2.5, 5, 10, 20, 40, 60 and 80 percent of the training data in the target domain are randomly selected to simulate low-resource settings of the target domain. Each data selection is repeated 10 times with different random seeds, and the final evaluation results are averaged. For SNIPS, we keep one intent as the target domain and the remaining intents as the source domain. We pre-train the slot filling model on the source domains, then fine-tune and evaluate it on the target domain.

We randomly select 80% of the training data for model training and use the remaining 20% for validation. We train for 50 epochs with a fixed learning rate and save the parameters that give the best performance on the validation set. Finally, we report the $\text{F}_{1}$-score of the semantic slots on the test set with the parameters that achieved the best $\text{F}_{1}$-score on the validation set. The $\text{F}_{1}$-score is calculated using the CoNLL evaluation script (https://www.clips.uantwerpen.be/conll2000/chunking/output.html).

VI-C6 Significance Test

We use McNemar’s test to establish the statistical significance of one method over another ($p<0.05$).

VI-D Main Results

TABLE IV: Slot F1 scores on DSTC3 test set with or without the training data of DSTC2. The bold number indicates a statistically significant gain over the best baseline.
Method | DSTC3_train (single domain) | +DSTC2_train (domain adapt.)
One-hot | 75.52 | 83.14
    (-) w/o output layer sharing | 75.52 | 81.22
CCA label embedding | 76.25 | 84.10
value word embedding | 75.32 | 83.97
CT | 76.20 | 83.68
ZAT | 75.76 | 84.29
atomic concept | 76.41 | 86.03
slot description (BLSTM) | 77.41 | 87.16
slot description (CNN) | 76.23 | 85.89
slot description (emb mean) | 76.56 | 84.05
slot exemplar (word embedding) | 77.19 | 86.10
slot exemplar (BiLMs) | 78.58 | 87.55
TABLE V: Slot F1 scores on AICar dataset with three domains. We train and evaluate on each target domain and pre-train the slot filling model on the rest two source domains. Columns represent different proportions of the training data in the target domain, defined in Section VI-C5. The results in bold black are the best F1 scores. Time per batch refers to training and testing times on GeForce GTX 1080 Ti GPU.
Target Domain | Method | Time per batch | 0% | 1% | 2.5% | 5% | 10% | 20% | 40% | 60% | 80% | 100%
Map One-hot 9.2ms 0 40.81 60.24 72.67 82.63 87.94 90.99 92.92 93.61 94.79
CT 94.2ms 0.22 40.93 62.23 75.20 82.79 87.58 91.28 92.88 94.02 94.80
ZAT 118ms 0.05 45.07 63.74 73.79 82.10 87.39 91.33 92.40 93.65 94.62
atomic concept 9.5ms 1.55 46.34 66.57 77.20 84.74 89.27 92.04 93.76 94.81 95.78
slot description (BLSTM) 11.9ms 0.89 46.55 67.26 77.11 84.83 88.68 92.02 93.81 94.56 95.65
slot description (CNN) 10.4ms 0.78 43.44 67.41 77.88 84.85 88.79 91.86 93.77 94.41 95.31
slot description (emb mean) 9.9ms 0.32 45.20 70.12 79.95 85.92 89.37 92.48 93.90 94.67 95.47
slot exemplar (word embedding) 9.2ms 0 42.54 64.34 76.29 84.16 88.45 91.61 93.40 94.10 94.71
slot exemplar (BiLMs) 9.2ms 0 45.49 66.17 76.74 84.45 88.80 91.91 93.61 94.42 95.72
Weather One-hot 7.5ms 0 32.91 39.35 49.23 62.40 72.96 80.57 84.13 86.50 88.60
CT 140ms 4.97 39.47 45.46 52.26 62.67 73.39 81.42 84.85 86.96 87.85
ZAT 187ms 15.73 39.78 47.03 54.10 65.28 73.60 80.60 84.19 86.75 88.61
atomic concept 8.0ms 24.69 41.73 48.38 56.51 67.39 75.67 82.48 85.08 87.35 88.79
slot description (BLSTM) 10.8ms 6.81 42.14 47.41 56.67 66.84 75.72 82.70 85.89 88.23 89.43
slot description (CNN) 9.3ms 16.74 40.08 46.47 54.73 67.41 75.85 82.74 85.37 87.86 89.44
slot description (emb mean) 8.4ms 15.75 41.72 48.32 55.71 66.54 76.06 82.47 85.56 87.90 89.42
slot exemplar (word embedding) 7.7ms 0 41.41 46.59 55.09 65.73 75.28 82.44 85.31 87.56 88.72
slot exemplar (BiLMs) 7.8ms 0 39.35 44.28 52.41 64.50 74.61 81.62 84.75 87.56 88.00
Flight One-hot 6.2ms 0 55.69 65.84 73.22 80.68 87.58 92.26 94.14 94.92 95.26
CT 69.5ms 2.07 59.31 71.01 75.36 81.40 87.74 92.32 93.84 94.97 95.74
ZAT 99.2ms 1.89 65.91 70.44 74.80 82.03 87.81 92.37 93.93 94.77 95.34
atomic concept 6.7ms 52.37 62.72 72.74 78.58 83.42 88.76 92.79 93.99 95.02 95.27
slot description (BLSTM) 9.3ms 2.95 70.49 75.93 79.93 85.49 89.96 92.94 94.99 95.33 95.47
slot description (CNN) 8.4ms 2.41 64.78 73.15 78.45 84.92 89.58 93.17 94.69 95.41 95.83
slot description (emb mean) 7.0ms 8.45 69.97 76.02 80.77 85.79 90.33 93.33 94.60 95.29 95.76
slot exemplar (word embedding) 6.4ms 0 68.49 73.73 79.13 83.82 88.93 92.56 94.18 94.95 95.30
slot exemplar (BiLMs) 6.3ms 0 66.58 72.96 77.85 82.93 88.66 92.92 94.39 95.17 95.98

We first compare different label embeddings using BLSTM-softmax models for slot filling, except for CT and ZAT. BLSTM-CRF models with the scalar and bilinear transition modules are then investigated in Section VI-D4.

The results on the DSTC 2&3 dataset are shown in Table IV. By comparing the column “DSTC3_train” with “+DSTC2_train”, we can see that performance on the target domain is improved dramatically because the parameters are pre-trained on the source domain. Rows 1-2 show that sharing the output weights of reused labels (see Equation (9)) helps transfer knowledge from DSTC2 to DSTC3. Comparing the baselines with our proposed methods, most of our methods outperform all baselines on both the single domain and domain adaptation tasks. Notably, using label embeddings extracted from slot exemplars with the pre-trained biLMs of ELMo, the slot filling model achieves the best results (78.58% and 87.55%).

Table V shows the domain adaptation performance on the AICar dataset with three domains. We take each domain as the target domain one by one and leave the other two as the source domains. We evaluate the baselines and our methods with portions of the training data of the target domain, ranging from 1 to 100 percent. We also test the pre-trained models without any target training data (zero-shot). From Table V, we can find that:

  • The two zero-shot baselines (CT and ZAT) perform better than One-hot, since they benefit from slot descriptions. However, these slot-independent models increase the computation cost sharply, as they need to predict an IOB sequence for each slot independently. This can also lead to overlapping slot segmentations and loses the temporal dependencies of output slots.

  • Even without the target training data (0%), our methods outperform all the baselines. In particular, atomic concepts achieve the best performance, because they give an explicit interpretation of slots that can recombine associated atoms into a new slot, while slot descriptions as word sequences are more complex and may only contain synonyms implicitly. Without the target training data, there is no exemplar available for each slot.

  • With increasing training data of the target domain, the slot $\text{F}_{1}$ scores of all methods improve concurrently.

  • At all data partitions, our proposed methods achieve the best results, with statistically significant gains over the best baseline in most cases. The best performance always comes from atomic concepts and slot descriptions, since they involve more human knowledge than slot exemplars, which are just available training samples. Even so, the slot exemplar-based method also outperforms the baselines in most cases.

VI-D1 Input embeddings with pre-trained language model

Besides pre-trained word embeddings, some advanced pre-trained language models (e.g., ELMo [55], BERT [61]) can also be used to obtain input embeddings. However, this is orthogonal to the investigation of label embeddings. Table VI shows results on the DSTC3 test set using limited seed instances, indicating that our method obtains improvements over different types of input embeddings simultaneously. For BERT, the bert-base-cased model (https://github.com/google-research/bert#pre-trained-models) is utilized, which performs worse than ELMo. The reason may be that ELMo generalizes better on this dataset by using character-based word embeddings.

TABLE VI: Slot F1 scores on DSTC3 test set without the training data of DSTC2, but with different pre-trained language models.
Method | pre. word emb. | ELMo | BERT
One-hot | 75.52 | 78.13 | 76.64
slot description (BLSTM) | 77.41 | 80.03 | 78.13

VI-D2 Learning curve

Fig. 5 shows the learning curves of different methods when the target domain is FLIGHT and 100% of the training data is used for fine-tuning. It shows slot $\text{F}_{1}$ scores on the validation set at increasing numbers of data iterations. We can see that our methods converge much faster than the baselines without label embeddings. In particular, the “slot description (BLSTM)” method converges fastest.

Figure 5: Learning curves of the baselines and our methods when FLIGHT is the target domain and 100% training data is used for fine-tuning. It shows performances on the validation set at different data iterations.

VI-D3 Number of slot exemplars

Although we exploit all data samples of the training set as data exemplars for the corresponding slots, it is meaningful to investigate how many exemplars per slot are enough. Fig. 6 shows the number of slot exemplars used versus performance on the FLIGHT domain. If there is no training data for fine-tuning, more exemplars usually give better performance. If there is an amount of labelled data for training, the performance is dominated by the size of the training data in the target domain.

Figure 6: Performances of the “slot exemplar (BiLMs)” method when FLIGHT is the target domain and $\{\text{none},10\%,40\%,100\%\}$ of the training data is used for fine-tuning. The number of exemplars for each slot ranges from 1 to 10.

VI-D4 CRF layer

We also apply a CRF layer to model the temporal dependencies of output labels, which CT and ZAT cannot model. We want to know whether label embeddings can improve different slot filling models. Table VII shows the results of using different label embeddings with or without the CRF layer, in the case that FLIGHT is the target domain and 10% of the training data is used for fine-tuning. From the table, we can see that CRF layers give improvements for both the baseline and our methods. Compared to the traditional scalar transition model (as shown in Equation (4)), our proposed bilinear transition model (as shown in Section V-D) improves the results significantly. The bilinear model calculates the transition score of two labels from their label embeddings, which shares transition patterns among related labels. This helps mitigate the data sparsity problem.

VI-D5 Discussions about varied slot descriptions

Unlike atomic concepts or slot exemplars, which can be given in a unified form, slot descriptions in natural language vary across different domain experts or developers. Five participants were asked to describe each slot in the FLIGHT domain. The average length of slot descriptions from different participants varies from 6.2 to 7.7. If we use these slot descriptions in the “slot description (BLSTM)” method when FLIGHT is the target domain and 10% of the training data is used for fine-tuning, slot $\text{F}_{1}$ scores range from 84.66% to 86.43%. As shown in Table VIII, our method appears robust to varied slot descriptions.

VI-D6 Results on SNIPS dataset

We also report results on the SNIPS public dataset, where one intent is kept as the target domain and the remaining intents serve as the source domain. Only 50 training instances of the target domain, chosen at random, are utilized for fine-tuning. Each of the 7 intents serves as the target domain in turn, and the slot $\text{F}_{1}$ scores are averaged over the seven target domains. As shown in Table IX, our methods also outperform the baselines. In particular, the “slot exemplar (BiLMs)” method achieves the best slot $\text{F}_{1}$ score.

Method | w/o CRF | with CRF (scalar) | with CRF (bilinear)
One-hot | 80.68 | 82.74 | 85.49^a
atomic concept | 83.42 | 84.26 | 86.95
slot description (BLSTM) | 85.49 | 86.83 | 88.46
slot description (CNN) | 84.92 | 85.56 | 87.24
slot description (emb mean) | 85.79 | 86.76 | 87.49
slot exemplar (word embedding) | 83.82 | 85.61 | 87.10
slot exemplar (BiLMs) | 82.93 | 84.45 | 86.83

TABLE VII: Slot $\text{F}_{1}$ scores of different models with and without CRF layers on the AICar dataset, when FLIGHT is the target domain and the other two are the source domains. 10% of the training data of FLIGHT is used for fine-tuning, and results are averaged over 10 random data splits.
^a Weights of the linear output layer act as label embeddings in the bilinear transition model.

TABLE VIII: Performances of the “slot description (BLSTM)” method with varied slot descriptions on FLIGHT. FLIGHT is the target domain, and 10% of its training data is used for fine-tuning. “AL(desc.)” denotes the average length of slot descriptions.
AL(desc.) | 4.0 (ori.) | 7.7 | 7.4 | 7.0 | 6.4 | 6.2
Slot $\text{F}_{1}$ | 85.49 | 85.84 | 84.66 | 85.80 | 86.43 | 85.08
TABLE IX: Slot F1 scores on SNIPS test set in the setting of domain adaptation with only 50 instances of each target intent for fine-tuning.
Method Average Slot F1
One-hot 85.63
CT 83.79
ZAT 86.10
slot description (BLSTM) 86.94
slot description (CNN) 86.99
slot description (emb mean) 86.57
slot exemplar (word embedding) 85.67
slot exemplar (BiLMs) 87.33

VII Conclusion

This paper has proposed prior knowledge-driven label embedding for the NLU slot filling task to address the data sparsity problem (especially label space data sparsity). It uses human knowledge about slot definitions to incorporate slot relations into the label embeddings. We investigate three different kinds of prior knowledge about slots and the corresponding encoding methods: atomic concepts, slot descriptions and slot exemplars. These label embeddings are fused with different traditional slot filling models (e.g., BLSTM-softmax and BLSTM-CRF). Notably, we propose a novel label transition model based on label embeddings for the CRF. The proposed approaches are evaluated on three datasets (DSTC 2&3, AICar and SNIPS) in the settings of single domain and especially domain adaptation. From the experimental results, we find that these label embeddings involving prior knowledge can significantly improve the performance and efficiency of traditional slot filling models.

The proposed label embedding based slot filling framework shows promising perspectives of future improvements:

  • The slot exemplar encoding relies on a pre-trained, high-quality biLM which is kept fixed. In future work, we will develop a trainable encoder for slot exemplars.

  • There are a few other kinds of prior knowledge, like the hierarchical dependency of domain-intent-slot [40]. Graph neural networks (GNN) have been applied to encode structured data [62, 63]. Applying GNNs to encode structured knowledge is part of our future work.

  • Nowadays, developing NLU systems for certain narrow domains independently is the mainstream approach. Collecting data is costly, while it is convenient for domain experts or developers to provide a few domain definitions (prior knowledge). Combining deep learning-based slot filling with the prior knowledge of narrow domains will be a promising research direction.

References

  • [1] Y.-Y. Wang, L. Deng, and A. Acero, “Spoken language understanding–an introduction to the statistical framework,” IEEE Signal Processing Magazine, vol. 22, no. 5, pp. 16–31, 2005.
  • [2] Y. Wang, L. Deng, and A. Acero, “Semantic frame-based spoken language understanding,” Spoken Language Understanding: Systems for Extracting Semantic Information from Speech, pp. 41–91, 2011.
  • [3] Y. He and S. Young, “A data-driven spoken language understanding system,” in Proc. IEEE ASRU, 2003, pp. 583–588.
  • [4] C. Raymond and G. Riccardi, “Generative and discriminative algorithms for spoken language understanding,” in Proc. INTERSPEECH, 2007, pp. 1605–1608.
  • [5] G. Mesnil, Y. Dauphin, K. Yao, Y. Bengio, L. Deng, D. Hakkani-Tur, X. He, L. Heck, G. Tur, D. Yu et al., “Using recurrent neural networks for slot filling in spoken language understanding,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 23, no. 3, pp. 530–539, 2015.
  • [6] B. Liu and I. Lane, “Attention-based recurrent neural network models for joint intent detection and slot filling,” in Proc. INTERSPEECH, 2016, pp. 685–689.
  • [7] K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, “Spoken language understanding using long short-term memory neural networks,” in IEEE SLT Workshop, 2014, pp. 189–194.
  • [8] N. T. Vu, “Sequential convolutional neural networks for slot filling in spoken language understanding,” in Proc. INTERSPEECH, 2016, pp. 3250–3254.
  • [9] G. Kurata, B. Xiang, B. Zhou, and M. Yu, “Leveraging sentence-level information with encoder lstm for semantic slot filling,” in Proc. EMNLP, 2016, pp. 2077–2083.
  • [10] S. Zhu and K. Yu, “Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding,” in Proc. ICASSP, 2017, pp. 5675–5679.
  • [11] C. Li, L. Li, and J. Qi, “A self-attentive model with gate mechanism for spoken language understanding,” in Proc. EMNLP, 2018, pp. 3824–3833.
  • [12] L. S. Zettlemoyer and M. Collins, “Online learning of relaxed ccg grammars for parsing to logical form,” in Proc. EMNLP-CoNLL, 2007, pp. 678–687.
  • [13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. NIPS, 2013, pp. 3111–3119.
  • [14] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [15] S. Zhu and K. Yu, “Concept transfer learning for adaptive language understanding,” in Proc. SIGdial, 2018, pp. 391–399.
  • [16] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proc. ICML, 2001, pp. 282–289.
  • [17] M. Henderson, B. Thomson, and J. Williams, “Dialog state tracking challenge 2&3,” [online] Available: http://camdial.org/mh521/dstc/, 2013.
  • [18] A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril, M. Primet, and J. Dureau, “Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces,” arXiv preprint arXiv:1805.10190, 2018.
  • [19] G. Tur, D. Hakkani-Tür, and R. E. Schapire, “Combining active and semi-supervised learning for spoken language understanding,” Speech Communication, vol. 45, no. 2, pp. 171–186, 2005.
  • [20] A. Celikyilmaz, R. Sarikaya, D. Hakkani-Tür, X. Liu, N. Ramesh, and G. Tür, “A new pre-training method for training deep learning models with application to spoken language understanding.” in Proc. INTERSPEECH, 2016, pp. 3255–3259.
  • [21] O. Lan, S. Zhu, and K. Yu, “Semi-supervised training using adversarial multi-task learning for spoken language understanding,” in Proc. ICASSP, 2018, pp. 6049–6053.
  • [22] S. Zhu, O. Lan, and K. Yu, “Robust spoken language understanding with unsupervised asr-error adaptation,” in Proc. ICASSP, 2018, pp. 6179–6183.
  • [23] H. Daumé III, “Frustratingly easy domain adaptation,” arXiv preprint arXiv:0907.1815, 2009.
  • [24] A. Jaech, L. Heck, and M. Ostendorf, “Domain adaptation of recurrent neural networks for natural language understanding,” in Proc. INTERSPEECH, 2016, pp. 690–694.
  • [25] Y.-B. Kim, K. Stratos, and R. Sarikaya, “Frustratingly easy neural domain adaptation.” in Proc. COLING, 2016, pp. 387–396.
  • [26] B. Liu and I. Lane, “Multi-domain adversarial learning for slot filling in spoken language understanding,” in NIPS Workshop on Conversational AI, 2018.
  • [27] Y.-B. Kim, K. Stratos, and D. Kim, “Domain attention with an ensemble of experts,” in Proc. ACL, 2017, pp. 643–653.
  • [28] R. Jha, A. Marin, S. Shivaprasad, and I. Zitouni, “Bag of experts architectures for model reuse in conversational language understanding,” in Proc. NAACL, 2018, pp. 153–161.
  • [29] A. K. Goyal, A. Metallinou, and S. Matsoukas, “Fast and scalable expansion of natural language understanding functionality for intelligent agents,” in Proc. NAACL, 2018, pp. 145–152.
  • [30] H. Zhang, S. Zhu, S. Fan, and K. Yu, “Joint spoken language understanding and domain adaptive language modeling,” in Proc. IScIDE, 2018, pp. 311–324.
  • [31] Z. Zhao, S. Zhu, and K. Yu, “Data augmentation with atomic templates for spoken language understanding,” in Proc. EMNLP-IJCNLP, 2019, pp. 3628–3634.
  • [32] E. Ferreira, B. Jabaian, and F. Lefèvre, “Zero-shot semantic parser for spoken language understanding,” in Proc. INTERSPEECH, 2015, pp. 1403–1407.
  • [33] M. Yazdani and J. Henderson, “A model of zero-shot learning of spoken language understanding.” in Proc. EMNLP, 2015, pp. 244–249.
  • [34] Y.-N. Chen, D. Hakkani-Tür, and X. He, “Zero-shot learning of intent embeddings for expansion by convolutional deep structured semantic models,” in Proc. ICASSP, 2016, pp. 6045–6049.
  • [35] A. Bapna, G. Tür, D. Hakkani-Tür, and L. P. Heck, “Towards zero-shot frame semantic parsing for domain scaling,” in Proc. INTERSPEECH, 2017, pp. 2476–2480.
  • [36] S. Lee and R. Jha, “Zero-shot adaptive transfer for conversational language understanding,” in Proc. AAAI, 2019, pp. 6642–6649.
  • [37] D. J. Shah, R. Gupta, A. A. Fayazi, and D. Hakkani-Tür, “Robust zero-shot cross-domain slot filling with example values,” in Proc. ACL, 2019, pp. 5484–5490.
  • [38] Y.-B. Kim, K. Stratos, R. Sarikaya, and M. Jeong, “New transfer learning techniques for disparate label sets,” in Proc. ACL-IJCNLP, 2015, pp. 473–482.
  • [39] Y. Ma, E. Cambria, and S. Gao, “Label embedding for zero-shot fine-grained named entity typing,” in Proc. COLING, 2016, pp. 171–180.
  • [40] J. Lee, D. Kim, R. Sarikaya, and Y.-B. Kim, “Coupled representation learning for domains, intents and slots in spoken language understanding,” in Proc. IEEE SLT, 2018, pp. 714–719.
  • [41] J. Wu, L. F. D’Haro, N. F. Chen, P. Krishnaswamy, and R. E. Banchs, “Joint learning of word and label embeddings for sequence labelling in spoken language understanding,” arXiv preprint arXiv:1910.07150, 2019.
  • [42] G. Mesnil, X. He, L. Deng, and Y. Bengio, “Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding.” in INTERSPEECH, 2013, pp. 3771–3775.
  • [43] K. Yao, G. Zweig, M.-Y. Hwang, Y. Shi, and D. Yu, “Recurrent neural networks for language understanding.” in INTERSPEECH, 2013, pp. 2524–2528.
  • [44] N. T. Vu, P. Gupta, H. Adel, and H. Schütze, “Bi-directional recurrent neural network with ranking loss for spoken language understanding,” in Proc. ICASSP, 2016, pp. 6060–6064.
  • [45] P. Xu and R. Sarikaya, “Convolutional neural network based triangular crf for joint intent detection and slot filling,” in Proc. IEEE ASRU, 2013, pp. 78–83.
  • [46] X. Zhang and H. Wang, “A joint model of intent determination and slot filling for spoken language understanding,” in Proc. IJCAI, 2016, pp. 2993–2999.
  • [47] D. Hakkani-Tür, G. Tur, A. Celikyilmaz, Y.-N. Chen, J. Gao, L. Deng, and Y.-Y. Wang, “Multi-domain joint semantic frame parsing using bi-directional rnn-lstm,” in Proc. INTERSPEECH, 2016, pp. 715–719.
  • [48] N. Reimers and I. Gurevych, “Optimal hyperparameters for deep lstm-networks for sequence labeling tasks,” arXiv preprint arXiv:1707.06799, 2017.
  • [49] F. Zhai, S. Potdar, B. Xiang, and B. Zhou, “Neural models for sequence chunking.” in Proc. AAAI, 2017, pp. 3365–3371.
  • [50] K. Yao, B. Peng, G. Zweig, D. Yu, X. Li, and F. Gao, “Recurrent conditional random field for language understanding,” in Proc. ICASSP, 2014, pp. 4077–4081.
  • [51] Z. Huang, W. Xu, and K. Yu, “Bidirectional lstm-crf models for sequence tagging,” arXiv preprint arXiv:1508.01991, 2015.
  • [52] X. Ma and E. Hovy, “End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF,” in Proc. ACL, 2016, pp. 1064–1074.
  • [53] Y. Kim, “Convolutional neural networks for sentence classification,” in Proc. EMNLP, 2014, pp. 1746–1751.
  • [54] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” Journal of machine learning research, vol. 12, no. Aug, pp. 2493–2537, 2011.
  • [55] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proc. NAACL, 2018, pp. 2227–2237.
  • [56] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
  • [57] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proc. EMNLP, 2015, pp. 1412–1421.
  • [58] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. NIPS, 2017, pp. 5998–6008.
  • [59] Y. Meng, X. Li, X. Sun, Q. Han, A. Yuan, and J. Li, “Is word segmentation necessary for deep learning of chinese representations?” in Proc. ACL, 2019, pp. 3242–3252.
  • [60] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [61] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. NAACL, 2019, pp. 4171–4186.
  • [62] L. Chen, B. Tan, S. Long, and K. Yu, “Structured dialogue policy with graph neural networks,” in Proc. COLING, 2018, pp. 1257–1268.
  • [63] L. Chen, Z. Chen, B. Tan, S. Long, M. Gašić, and K. Yu, “Agentgraph: Toward universal dialogue management with structured deep reinforcement learning,” IEEE/ACM Trans. on Audio, Speech, and Lang. Process., vol. 27, no. 9, pp. 1378–1391, 2019.
Su Zhu received the B.Eng. degree in computer science from Xi’an Jiao Tong University, China in 2013, and the M.Sc. degree from the Department of Computer Science, Shanghai Jiao Tong University, Shanghai, China, in 2016. He is currently working toward the Ph.D. degree at the SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. His research interests include spoken/natural language understanding, dialogue systems, and structured deep learning.
Zijian Zhao received the B.Eng. degree in computer science from Shanghai Jiao Tong University, Shanghai, China, in 2017. He is currently working toward the M.S. degree with the SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. His research interests include dialogue systems, spoken language understanding, and machine learning.
Rao Ma received the B.Eng. degree in computer science and technology from Nanjing University, China in 2018. She is currently working toward the M.S. degree with the SpeechLab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China. Her research interests include natural language processing, language modelling and large vocabulary continuous speech recognition.
Kai Yu is a professor at the Computer Science and Engineering Department, Shanghai Jiao Tong University, China. He received his B.Eng. and M.Sc. from Tsinghua University, China in 1999 and 2002, respectively. He then joined the Machine Intelligence Lab at the Engineering Department at Cambridge University, U.K., where he obtained his Ph.D. degree in 2006. His main research interests lie in the area of speech-based human machine interaction, including speech recognition, synthesis, language understanding and dialogue management. He is a member of the IEEE Speech and Language Processing Technical Committee.