
A Unified Framework for Multi-intent Spoken Language Understanding with prompting

Feifan Song, Lianzhe Huang and Houfeng Wang
MOE Key Laboratory of Computational Linguistics, Peking University, China
[email protected]
{hlz, wanghf}@pku.edu.cn
Abstract

Multi-intent Spoken Language Understanding has great potential for widespread implementation. Jointly modeling its two sub-tasks, Intent Detection (ID) and Slot Filling (SF), provides a channel to exploit the correlation between intents and slots. However, current approaches tend to formulate these two sub-tasks differently, which leads to two issues: 1) it hinders models from effectively extracting shared features; 2) rather complicated structures are introduced to enhance expressiveness while damaging the interpretability of frameworks. In this work, we describe a Prompt-based Spoken Language Understanding (PromptSLU) framework that intuitively unifies the two sub-tasks into the same form over a shared pre-trained Seq2Seq model. Concretely, ID and SF are completed by concisely filling the utterance into task-specific prompt templates as input, and they share an output format of key-value pair sequences. Furthermore, a variable number of intents is predicted first and then naturally embedded into prompts to guide slot-value pair inference from a semantic perspective. Finally, inspired by prevalent multi-task learning, we introduce an auxiliary sub-task that helps to learn the relationships among provided labels. Experimental results show that our framework outperforms several state-of-the-art baselines on two public datasets.

1 Introduction

Spoken Language Understanding (SLU) is a fundamental part of Task-oriented Dialogue (TOD) modeling. It serves as the entrance of the pipeline and comprises two sub-tasks: detecting user intents and extracting semantic information to fill predefined slots, called Intent Detection (ID) and Slot Filling (SF), respectively. The former is usually modeled as a text classification problem, while the latter is completed by cutting out fragments of an input utterance in the form of sequence tagging.

Figure 1: Illustration of the prior paradigm for jointly modeling multi-intent Spoken Language Understanding.

Early works (Ravuri and Stolcke, 2015; Kim et al., 2017) focused on the two sub-tasks separately, but in recent years the correlation between them has been seen as the key to further improvement of SLU. Thus, joint modeling techniques have been proposed to exploit the features shared between Intent Detection and Slot Filling. Chen et al. (2019) treated BERT as a shared contextual encoder for the two sub-tasks, while Qin et al. (2019) further brought token-level intent prediction results into the Slot Filling process. Apart from these conventional NLU pathways, Zhang et al. (2021) and Tassias (2021) chose to formulate SLU as a constrained generation task.

However, real-world dialogue is more complex: each utterance can be longer and contain more than one intent. As shown in Figure 1, the utterance “Show me the cheapest fare in the database and ground transportation san francisco” contains two distinct intents (atis_cheapest and atis_ground_service). For this kind of multi-intent SLU, similar joint modeling methods have been explored (Gangadharaiah and Narayanaswamy, 2019; Qin et al., 2020; Ding et al., 2021; Qin et al., 2021; Chen et al., 2022; Cai et al., 2022) and work well on several datasets, where multi-intent detection is often formulated as a multi-label text classification problem (Gangadharaiah and Narayanaswamy, 2019), and the potential interaction among intent and slot labels also plays an essential role (Qin et al., 2020; Ding et al., 2021; Qin et al., 2021; Chen et al., 2022; Cai et al., 2022).

Although following the path of joint modeling, prior methods tend to integrate ID and SF during encoding through a shared feature extractor, but treat them individually when decoding. Figure 1 displays the different decoding processes of the two sub-tasks: multi-label classification and sequence tagging. This vast distinction in task formulation between ID and SF discourages models from effectively extracting shared features and, in turn, limits the potential for comprehensively better performance. Consequently, unifying the two sub-tasks from start to finish within a common framework is significant.

In this paper, we propose a framework that leverages pre-trained language models (PLMs) and prompting to handle the aforementioned challenges, called the Prompt-based Spoken Language Understanding (PromptSLU) framework. Instead of using complicated structures to exploit the correlation among labels across the two sub-tasks, our framework casts both of them as text generation tasks with prompts, where the respective outputs share a general format of key-value pair sequences, namely the Belief Span (Lei et al., 2018). During inference, given an utterance and a certain task requirement, PromptSLU first fills the utterance into a task-specific prompt template and then feeds it to a pre-trained Seq2Seq model.

Besides, consistency between the two sub-tasks is crucial in SLU. Compared with prior settings, the multi-intent scenario is more challenging for accurate alignment from intents to slots because of the greater length of utterances and the increased number of labels. For this issue, we explore an intuitive way in which intents constrain SF by being plugged into prompt templates, namely Semantic Intent Guidance (SIG). Besides being plain to humans in form, this design also allows the semantic information of intents to be utilized to promote the overall comprehension of our framework. Furthermore, inspired by the multi-task learning used by Paolini et al. (2021), Su et al. (2022) and Wang et al. (2022a), we introduce an auxiliary sub-task, called Slot Prediction (SP), which steers models to additionally maintain semantic consistency. Experiments on two datasets demonstrate that PromptSLU outperforms SOTA baselines on most metrics, including those using PLMs. Extensive ablation studies also show the effectiveness of each component.

In summary, our contributions are as follows:

  • We take the first step to introduce prompts into multi-intent SLU by transforming Intent Detection and Slot Filling into a common text generation formulation.

  • We present the Semantic Intent Guidance mechanism and a new sub-task, Slot Prediction, to provide more intuitive channels for interaction among texts, intents and slots.

  • Experiments on two public datasets show that PromptSLU outperforms methods with SOTA performance, and verify the effectiveness of the proposed strategies and of semantic information.

2 Related Work

2.1 Spoken Language Understanding

Spoken Language Understanding (SLU), in the task-oriented dialog system, is mainly composed of two sub-tasks, Intent Detection (ID) and Slot Filling (SF). As for task formulations, most existing methods model ID as a text classification problem and SF as a sequence tagging problem. Early works handled ID and SF separately (Schapire and Singer, 2000; Haffner et al., 2003; Raymond and Riccardi, 2007; Tur et al., 2011; Ravuri and Stolcke, 2015; Kim et al., 2017; Wang et al., 2022b). However, current approaches prefer to model them together, in consideration of the high correlation between them, which leads to substantial improvement. Joint modeling techniques basically follow two methodologies:

Parallel Model

ID and SF produce outputs separately but share an utterance encoder, in an attempt to exploit common latent features (Zhang and Wang, 2016; Hakkani-Tür et al., 2016; Zhang and Wang, 2019).

Serial Model

Intents are detected first, and the intent information is then utilized to guide SF (Zhang et al., 2019; Qin et al., 2019; Tassias, 2021).

Our framework takes both sides mentioned above: on one hand, it can be implemented serially, following the second methodology, i.e., allowing intents to benefit SF; on the other hand, the results of the two sub-tasks come from the same Seq2Seq model.

2.2 Multi-intent SLU

Multi-intent utterances appear more frequently in reality than those with a single intent, and the problem has attracted increasing attention (Kim et al., 2017; Gangadharaiah and Narayanaswamy, 2019; Qin et al., 2020; Ding et al., 2021; Qin et al., 2021; Chen et al., 2022; Cai et al., 2022). Gangadharaiah and Narayanaswamy (2019) used an attention-based neural network to jointly model ID and SF in multi-intent scenarios. Qin et al. (2020) proposed AGIF, which organizes intent labels and slot labels together in a graph structure and focuses on the interaction of labels. Ding et al. (2021) and Qin et al. (2021) utilize graph attention networks to attend to the associations among labels. Despite these attempts to build a bridge between intent and slot labels, ID and SF are still modeled in different forms, i.e., multi-label classification for ID but sequence tagging for SF; the essential problem remains that distinct modeling forms hinder the extraction of common features. Our text-generation-based framework differs from the above approaches by unifying the formulation of ID and SF. On one hand, this form is plain to humans and naturally suits ID with a variable number of intents; on the other hand, every improvement to our framework effectively benefits both sub-tasks, since the results of both come from the same Seq2Seq model.

2.3 Prompting

GPT-3 (Brown et al., 2020) provides a novel insight into Natural Language Processing through the power of a new paradigm, namely prompting (Liu et al., 2021). It has been explored in many directions by transforming the original task into a text-to-text form (Zhong et al., 2021; Lester et al., 2021; Cui et al., 2021; Wang et al., 2022a; Tassias, 2021; Su et al., 2022), where fine-tuning on downstream tasks effectively draws on knowledge from the long pre-training process and works especially well in low-resource scenarios. Zhong et al. (2021) aligned the form of fine-tuning with next-word prediction to optimize performance on text classification tasks. Alternatively, Lester et al. (2021) resorted to soft prompts to capture implicit and continuous signals. For sequence tagging, prompts appear as templates for text filling or as natural language instructions (Cui et al., 2021; Wang et al., 2022a). As for SLU in task-oriented dialogue (TOD), it consists of the two sub-tasks ID and SF, which have homologous forms. Tassias (2021) depended on prior distributions of intents and slots to generate a prompt containing all slots to be predicted before inference, lowering the difficulty of SF. Su et al. (2022) proposed a prompt-based framework, PPTOD, to integrate several parts of the TOD pipeline, including single-intent detection, which differs from our setting. However, such joint modeling fashions neglect the complexity of co-occurrence and semantic similarity among tokens, intents and slots in multi-intent SLU, consequently suppressing the potential of PLMs. To this end, we build the Semantic Intent Guidance mechanism and design an auxiliary sub-task, Slot Prediction, to jointly exploit more detailed generality.

Figure 2: The flow path of PromptSLU. We unify the formulations of the sub-tasks with a shared PLM. The dashed rectangles present transformations between intent/slot labels and the corresponding descriptions. The left arrow shows the direction of the flow path. [1] indicates that the input utterance is embedded into prompt templates for different sub-tasks. [2] and [3] are the input and output of Intent Detection. [4] is the Semantic Intent Guidance mechanism, where predicted intents are embedded into prompt templates to guide other sub-tasks during inference. [5-1] and [5-2] are the inputs of Slot Filling and Slot Prediction, respectively, both of which can be executed in parallel. [6-1] and [6-2] are their outputs.

3 Methodology

In this section, we describe the proposed PromptSLU along the inner flow of data. In this framework, we utilize different prompts to complete different sub-tasks, an intuitive approach similar to that of Su et al. (2022). However, PromptSLU is specially designed for multi-intent SLU and contains a particular intent-slot interaction mechanism, namely Semantic Intent Guidance (SIG). We also propose a new auxiliary sub-task, Slot Prediction (SP), to improve Intent Detection (ID) and Slot Filling (SF).

Given an utterance $X=\{x_1,x_2,x_3,\ldots,x_N\}$, for each sub-task (ID, SF and SP), PromptSLU feeds the backbone model the concatenation of a task-specific prefix and $X$, then makes predictions. During SF or SP, intents can also be involved. The whole framework is shown in Figure 2. The three sub-tasks are jointly trained with a shared pre-trained language model (PLM).

3.1 Intent Detection

Traditionally, intents are defined as a fixed set of labels, and a model maps input text to these labels. Differently, we model the task as text generation, i.e., the backbone model produces a sequence of intents corresponding to the input utterance.

The Belief Span was first utilized by Sequicity (Lei et al., 2018) to cover filled and requestable slots for state tracking. We transfer this structure to multi-intent SLU by defining the format of the output sequence as:

$I_X=\textrm{intent : }i_1,\ldots,\textrm{intent : }i_{N_I}$ (1)

where $i_m~(1\leq m\leq N_I)$ is the $m$-th intent and $N_I$ is the number of predicted intents. As clarified before, we integrate the task modeling by imposing similar formats on the following sub-tasks. The PLM takes the prompt “transfer sentence to intents : $X$”, then carries out Intent Detection.

Following the tricks used in Tassias (2021) and Wang et al. (2022a), we use natural language descriptions to represent non-semantic intent labels, reducing the difficulty of exploiting semantic information. We also apply this operation in the follow-up sub-tasks, based on the assumption that PromptSLU significantly relies on semantic information to complete multi-intent SLU.
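As a concrete illustration, the serialization for Intent Detection could look like the following sketch (the helper names and the example label-to-description mapping are our assumptions, not part of a released implementation):

```python
# Minimal sketch of the ID serialization; the helper name and the example
# label-to-description mapping are illustrative assumptions.
INTENT_DESC = {
    "atis_cheapest": "cheapest",
    "atis_ground_service": "ground service",
}

def build_id_example(utterance: str, intents: list[str]) -> tuple[str, str]:
    """Return the (input, target) text pair for Intent Detection."""
    # Task-specific prompt template from the text above.
    source = f"transfer sentence to intents : {utterance}"
    # Belief-Span-style target of Eq. (1): "intent : i_1, ..., intent : i_{N_I}".
    target = ", ".join(f"intent : {INTENT_DESC.get(i, i)}" for i in intents)
    return source, target
```

Under this assumed mapping, the target for the utterance of Figure 1 would read “intent : cheapest, intent : ground service”.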

3.2 Semantic Intent Guidance

It is noted that there exist close relationships among tokens, intents and slots, reflected not only in frequent co-occurrence but also in semantic resemblance, which may help inference. We take this into account in our framework.

We adopt the idea of joint modeling by parsing intents from the output sequence of ID and using them to facilitate the other sub-tasks. Specifically, together with task-specific prefixes, intents also serve as an essential part of the prompts for SF and SP in our prompt-based framework, guiding their completion and keeping semantic consistency between ID and SF. We name this the Semantic Intent Guidance (SIG) mechanism and display it in the prompt templates below.

3.3 Slot Filling

To let the PLM generate slot-value pairs directly, raw data have to be processed in advance. We first extract the golden slot-value pairs from each BIO-tagged sequence, based on the tagging rules. Then we assemble these golden pairs, in order, into one sentence in the Belief Span format, which serves as the text-generation target $SF_X$. This step is crucial for the purpose of integrating task formulations:

$SF_X=s_1\textrm{ : }v_1,\ldots,s_{N_P}\textrm{ : }v_{N_P}$ (2)

where $N_P$ denotes the number of predicted slot-value pairs.
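The extraction step described above could be sketched as follows (a minimal illustration under standard BIO conventions; the function name is ours):

```python
# Sketch of the preprocessing: golden slot-value pairs are read off a
# BIO-tagged sequence and assembled into a Belief Span target. The raw
# slot labels would subsequently be replaced by natural language
# descriptions (see below). Function and variable names are illustrative.
def bio_to_belief_span(tokens: list[str], tags: list[str]) -> str:
    pairs = []          # (slot, value), kept in order of appearance
    slot, value = None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if slot is not None:            # close the previous span
                pairs.append((slot, " ".join(value)))
            slot, value = tag[2:], [tok]
        elif tag.startswith("I-") and slot is not None:
            value.append(tok)
        else:                               # an "O" tag closes any open span
            if slot is not None:
                pairs.append((slot, " ".join(value)))
            slot, value = None, []
    if slot is not None:                    # flush a span ending the sentence
        pairs.append((slot, " ".join(value)))
    # Belief-Span format of Eq. (2): "s1 : v1, ..., s_{N_P} : v_{N_P}"
    return ", ".join(f"{s} : {v}" for s, v in pairs)

# e.g. bio_to_belief_span(["cheapest", "fare", "from", "denver"],
#                         ["O", "O", "O", "B-fromloc.city_name"])
# -> "fromloc.city_name : denver"
```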

Similar to the mapping procedure in Intent Detection, we also replace slot labels with their corresponding natural language descriptions. The dashed boxes at the top of Figure 2 demonstrate the transformation from intent and slot labels to more human-readable phrases. This procedure is applied to both input and target utterances.

We introduce the SIG mechanism here, as mentioned above, to let the predicted intents act as pilots for Slot Filling. In detail, we plug the intent phrases and the user utterance into the task-specific prompt template: “transfer sentence to pairs with $I_X$ : $X$”. We do the same in the next sub-task.
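For illustration, the SIG step at inference time could be sketched as below (the parsing helper and function names are our assumptions; the template follows the text above):

```python
# Sketch of Semantic Intent Guidance (SIG): intents are parsed from the
# ID output sequence and embedded into the SF prompt. Names are
# illustrative assumptions.
def parse_intents(id_output: str) -> list[str]:
    """Recover intents from an Eq. (1) sequence: 'intent : i1, intent : i2'."""
    return [part.split(":", 1)[1].strip()
            for part in id_output.split(",") if ":" in part]

def build_sf_input(utterance: str, id_output: str) -> str:
    intents = ", ".join(parse_intents(id_output))
    # Template from the text: "transfer sentence to pairs with I_X : X".
    return f"transfer sentence to pairs with {intents} : {utterance}"
```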

3.4 Slot Prediction

Multi-task learning offers a channel for the interaction of different tasks and enhances the supervision signals toward the training objectives. Following the multi-task settings in Su et al. (2022) and Wang et al. (2022a), we design this auxiliary sub-task for the training stage, forcing the PLM to focus on the semantic interaction among tokens, intents and slots. We also leverage it to better maintain semantic consistency. PromptSLU is required to generate the slot sequence:

$SP_X=\textrm{slot : }s_1,\ldots,\textrm{slot : }s_{N_S}$ (3)

according to the prefix “transfer sentence to slots with $I_X$ : $X$”. We use $N_S$ to denote the number of predicted slots.

3.5 Training

Similar to other text-to-text tasks, our framework is trained to minimize negative log-likelihood for each sub-task:

$-\sum_{i=1}^{|Y|}\log p_\Theta\left(y_i \mid y_{<i},X\right)$ (4)

where $Y$ denotes the target token sequence and $\Theta$ denotes the model parameters. We introduce two variants of the loss function.

Weighted Loss integrates three sub-tasks by calculating a weighted sum for each sample:

$\mathcal{L}_w=\alpha\mathcal{L}_{ID}+\beta\mathcal{L}_{SP}+\gamma\mathcal{L}_{SF}$ (5)

where $\alpha$, $\beta$ and $\gamma$ are hyper-parameters, and $\mathcal{L}_{ID}$, $\mathcal{L}_{SP}$ and $\mathcal{L}_{SF}$ represent the losses of Intent Detection, Slot Prediction and Slot Filling, respectively.
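As an illustration, $\mathcal{L}_w$ could be computed as in the following sketch (a minimal sketch assuming a Huggingface-style Seq2Seq model; the batch field names are ours, and the default weights follow the (1.0, 2.0, 1.0) setting reported in Sec. 4.3):

```python
# Sketch of the weighted loss of Eq. (5); `model` is a Seq2Seq PLM such as
# T5, and the batch field names are illustrative assumptions.
def weighted_loss(model, batch, alpha=1.0, beta=2.0, gamma=1.0):
    # One forward pass per sub-task; with `labels` given, Huggingface
    # models return the mean token-level NLL of Eq. (4) as `.loss`.
    l_id = model(input_ids=batch["id_input"], labels=batch["id_labels"]).loss
    l_sp = model(input_ids=batch["sp_input"], labels=batch["sp_labels"]).loss
    l_sf = model(input_ids=batch["sf_input"], labels=batch["sf_labels"]).loss
    return alpha * l_id + beta * l_sp + gamma * l_sf
```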

Split Loss is utilized with a shuffled dataset, built by splitting each sample in the original dataset into three text-generation samples corresponding to the three sub-task targets. That is,

$\mathcal{L}_s(X,Y)=\begin{cases}\mathcal{L}_{ID}(X,Y)&Y\in ID\\\mathcal{L}_{SP}(X,Y)&Y\in SP\\\mathcal{L}_{SF}(X,Y)&Y\in SF\end{cases}$ (6)
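The construction of this shuffled, split dataset could be sketched as follows (the sample fields are illustrative; we also assume golden intents fill the SIG slot during training, while predicted intents are used at inference):

```python
import random

# Sketch of the "split" dataset behind Eq. (6): every annotated utterance
# yields one sample per sub-task, and the mixture is shuffled so that each
# training step optimizes the plain NLL of Eq. (4) for a single sub-task.
# Field names, and the use of golden intents in the SIG slot at training
# time, are our assumptions.
def build_split_dataset(samples):
    data = []
    for ex in samples:
        u = ex["utterance"]
        intents = ", ".join(ex["intents"])      # verbalized intents
        pairs = ex["slot_pairs"]                # ordered (slot, value) tuples
        data.append((f"transfer sentence to intents : {u}",
                     ", ".join(f"intent : {i}" for i in ex["intents"])))
        data.append((f"transfer sentence to slots with {intents} : {u}",
                     ", ".join(f"slot : {s}" for s, _ in pairs)))
        data.append((f"transfer sentence to pairs with {intents} : {u}",
                     ", ".join(f"{s} : {v}" for s, v in pairs)))
    random.shuffle(data)
    return data
```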

4 Experiment

4.1 Datasets and Metrics

We conduct experiments mainly on MixATIS and MixSNIPS (Qin et al., 2020), which are widely used multi-intent SLU datasets. MixATIS contains 13,162, 759 and 828 samples for training, validation and test, respectively; MixSNIPS contains 39,776, 2,198 and 2,199.

Following Cai et al. (2022) and Qin et al. (2021), we evaluate Slot Filling with the F1 score, and Intent Detection and sentence-level semantic frame parsing with accuracy, denoted Slot F1, Intent Acc and Overall Acc, respectively. (Since there exist samples with repeated slot-value pairs, we calculate Slot F1 after eliminating duplicate pairs for PromptSLU.)
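For concreteness, the metrics could be computed roughly as in the sketch below (our illustration of a corpus-level protocol; upstream parsing of the model outputs into per-utterance sets is assumed):

```python
# Sketch of the evaluation protocol. `preds`/`golds` are lists of
# per-utterance sets of (slot, value) pairs (duplicates eliminated, per the
# note above); intent arguments are per-utterance sets of intent labels.
def slot_f1(preds, golds):
    tp = sum(len(p & g) for p, g in zip(preds, golds))
    n_pred, n_gold = sum(map(len, preds)), sum(map(len, golds))
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def overall_acc(pred_intents, gold_intents, preds, golds):
    # An utterance counts only if its intent set AND slot-value set both match.
    hits = sum(pi == gi and p == g
               for pi, gi, p, g in zip(pred_intents, gold_intents, preds, golds))
    return hits / len(golds)
```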

4.2 Baselines

For the evaluation of multi-intent SLU, we compare our approach with three baselines and their enhanced versions (if provided):
AGIF (Qin et al., 2020): This method utilizes a graph interaction module to build token-wise connections between slot tags and intent labels.
GL-GIN (Qin et al., 2021): Different from AGIF, this work exploits local coherence during slot tagging and sentence-level connections between intents and slots, and performs decoding in a non-auto-regressive way. GL-GIN+RoBERTa replaces the self-attentive encoder with RoBERTa$_{\textrm{base}}$ (128M parameters) (Liu et al., 2019).
SDJN (Chen et al., 2022): This work reformulates multi-intent detection as a weakly supervised task and designs a self-distillation mechanism to circularly refresh the pipeline. SDJN+BERT likewise replaces the self-attentive encoder with BERT, whose size is not mentioned in the source paper.

Methods                     |           MixATIS                   |           MixSNIPS
                            | Slot(F1) Intent(Acc)  Overall(Acc)  | Slot(F1)  Intent(Acc)  Overall(Acc)
----------------------------+-------------------------------------+-------------------------------------
Non-pre-trained             |                                     |
  AGIF (Qin et al., 2020)   |   86.7      74.4         40.8       |   94.2       95.1         74.2
  GL-GIN (Qin et al., 2021) |   88.3      76.3         43.5       |   94.9       95.6         75.4
  SDJN (Chen et al., 2022)  |   88.2      77.1         44.6       |   94.4       96.5         75.7
Pre-trained                 |                                     |
  SDJN+BERT                 |   87.5      78.0         46.3       |   95.4       96.7         79.3
  GL-GIN+RoBERTa            |   88.6      79.2         53.9       |   96.0       97.4         82.4
  PromptSLU (L_w)           |   88.6      83.6         53.3       |   96.2       97.7         82.8
  PromptSLU (L_s)           |   89.6*     85.8*        57.2*      |   96.5*      97.5         84.8*
Table 1: Main results on MixATIS and MixSNIPS. Methods are grouped by whether pre-trained language models are involved. L_w and L_s denote the weighted and split loss, respectively. * means that the improvement over SOTA is significant with $p<0.05$ under a t-test.

For baselines except GL-GIN+RoBERTa, we directly take the results from the literature.

4.3 Implementation Detail

In this work, we choose T5 (Raffel et al., 2020) as the backbone of PromptSLU, in its small (60M parameters) and base (220M parameters) versions, because our prompt design is coherent with the pre-training tasks of T5. We set the number of epochs and the per-processor batch size to 10 and 16, respectively. The maximum sequence length and the learning rate are uniformly set to 128 and 1e-4, while the dropout rate is 0.1 for MixATIS and 0.15 for MixSNIPS. We empirically set $\alpha$, $\beta$ and $\gamma$ to 1.0, 2.0 and 1.0, respectively. All test results are selected according to Overall Acc on the validation set. We report average scores over 4 runs with global random seeds 0, 1, 2 and 3. Following Su et al. (2022), our framework is built on the Huggingface Library (Wolf et al., 2020). The optimizer is AdamW (Loshchilov and Hutter, 2019). We conduct experiments on a Linux server with an Intel Xeon E5-2680 CPU and an Nvidia GeForce RTX 3090 GPU.
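A minimal sketch of this setup on the Huggingface Library, with the hyper-parameters above (the checkpoint name and helper structure are illustrative; data loading and the epoch loop are omitted):

```python
# Sketch of the training configuration; hyper-parameters follow the text,
# everything else (checkpoint name, helper structure) is an assumption.
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")  # 220M-parameter backbone
model = T5ForConditionalGeneration.from_pretrained(
    "t5-base", dropout_rate=0.1)                        # 0.15 for MixSNIPS
optimizer = AdamW(model.parameters(), lr=1e-4)

def train_step(src_texts, tgt_texts):
    enc = tokenizer(src_texts, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    labels = tokenizer(tgt_texts, padding=True, truncation=True,
                       max_length=128, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100     # mask padding in the NLL
    loss = model(**enc, labels=labels).loss             # Eq. (4)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```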

4.4 Main Results

Table 1 displays the experiment results of our framework and baselines. It is observed that our framework outperforms each baseline on each metric for both MixATIS and MixSNIPS, no matter whether a pre-trained model is leveraged.

It is clear that pre-trained models are beneficial to this task: in general, methods with pre-trained models outperform those without them, mainly owing to the knowledge they contain. Among them, PromptSLU is distinguished from the other NLU methods by more concise modeling yet better performance. This result verifies the efficient knowledge exploitation of the proposed prompt-based text generation paradigm for multi-intent SLU.

More specifically, when compared with the SOTA results, PromptSLU$_{\mathcal{L}_s}$ still achieves significant advances. Notably, its result on Overall Acc, which might have been limited by the relatively slight improvement on Slot F1, in fact exceeds the SOTA result by a large margin. This indicates that during inference, our framework is skilled at maintaining semantic consistency between intents and slots in the same utterance. We preliminarily attribute this ability to the proposed Semantic Intent Guidance mechanism and Slot Prediction.

We also compare PromptSLU$_{\mathcal{L}_w}$ and PromptSLU$_{\mathcal{L}_s}$. PromptSLU$_{\mathcal{L}_s}$ beats PromptSLU$_{\mathcal{L}_w}$ on almost all metrics, except for a close Intent Acc on MixSNIPS. This primarily demonstrates the superiority of $\mathcal{L}_s$ over the straightforward weighted sum of losses; we provide a careful analysis in Sec 4.5. For the other experiments, we use $\mathcal{L}_s$ as the loss function by default.

Methods                |           MixATIS                   |           MixSNIPS
                       | Slot(F1) Intent(Acc)  Overall(Acc)  | Slot(F1)  Intent(Acc)  Overall(Acc)
-----------------------+-------------------------------------+-------------------------------------
w/o golden intents     |                                     |
  PromptSLU (small)    |   87.1      80.4         50.6       |   95.4       97.4         79.4
  PromptSLU (base)     |   89.6      85.8         57.2       |   96.5       97.5         84.8
    - SP               |   87.9      82.6         50.8       |   96.2       97.3         82.9
    - SIG              |   88.8      84.6         53.3       |   96.2       97.5         82.7
    only ID            |    -        81.4          -         |    -         97.0          -
    only SF            |   88.6       -            -         |   96.1        -            -
    L_w(1,1,1)         |   88.3      81.5         51.7       |   96.3       97.3         83.3
    L_w(2,1,1)         |   89.0      80.2         52.6       |   96.3       97.6         83.1
    L_w(1,2,1)         |   88.6      83.6         53.3       |   96.2       97.7         82.8
    L_w(1,1,2)         |   88.3      81.5         51.0       |   96.4       97.5         83.9
w/ golden intents      |                                     |
  PromptSLU (small)    |   87.7      80.4         50.6       |   95.8       97.4         79.4
  PromptSLU (base)     |   89.7      85.8         57.2       |   97.0       97.5         84.8
Table 2: Ablation experiments. SP and SIG denote Slot Prediction and Semantic Intent Guidance, respectively. L_w(a,b,c) denotes PromptSLU (base) trained under the weighted loss function with $(\alpha,\beta,\gamma)$ set to $(a,b,c)$.

4.5 Ablation Study

In this part, we isolate several factors and analyze their effectiveness. Results are shown in Table 2. The improvement is relatively slight on MixSNIPS because its features and patterns are, to some extent, more straightforward to capture, so every variant achieves competitive performance. An intuitive idea is that model size is a dominant factor in performance, which is confirmed by the comparison between PromptSLU$_{\textrm{small}}$ and PromptSLU$_{\textrm{base}}$. PromptSLU$_{\textrm{small}}$ is competitive with the baselines using pre-trained models, although it has far fewer parameters; with the larger backbone, PromptSLU$_{\textrm{base}}$ makes large improvements over PromptSLU$_{\textrm{small}}$.

We next remove Slot Prediction (SP) while retaining Intent Detection and Slot Filling. We observe drops on every metric on both datasets, especially on Overall Acc. This confirms that SP serves as a catalyst that helps maintain semantic consistency between ID and SF.

We then check the effect of the SIG mechanism by changing the task-specific prompt templates and removing the slots originally prepared for predicted intents. This also causes a performance decline. Similar to the ablation of SP, SIG facilitates the consistency between the two kinds of labels, which is one of the chief strengths of joint modeling. We surprisingly observe that Intent Acc also decreases on MixATIS. We attribute this to implicit semantic supervision during the update of model parameters: although gradients cannot be propagated across sub-tasks, our framework still has the potential to revise intent predictions through SIG.

In PromptSLU, we implement joint modeling for SLU along two directions: one unifies the two sub-tasks in a shared formulation with a common Seq2Seq model, while the other allows predicted intents to assist Slot Filling. Hence, we decouple the two sub-tasks by accomplishing each with an exclusive T5 model, in order to evaluate the impact of joint modeling. Interestingly, PromptSLU with only SF achieves competitive Slot F1 compared with the joint versions; however, PromptSLU with only ID drops considerably on Intent Acc for MixATIS. This confirms the effectiveness of the joint modeling paradigm even in a text generation formulation.

We propose two kinds of loss functions for PromptSLU. The weighted $\mathcal{L}_w$ has a structure similar to those used in the baselines, while $\mathcal{L}_s$ treats each sub-task more directly as a text generation process. Here we try different combinations of weight hyper-parameters for $\mathcal{L}_w$, but $\mathcal{L}_s$ still yields better performance in general. We assume that the optimization targets of the different sub-tasks are not identical, so a trade-off among them is indispensable; different weight combinations reflect this game, but the desired combination is likely hard to find and may even change dynamically along the training process. In contrast, $\mathcal{L}_s$ lets PromptSLU avoid this trade-off at each step and focus on completing a single text generation task, achieving better improvements.

During inference, PromptSLU with SIG generates slot-value pairs based on the predicted intents, so transmission errors should be considered. We experimentally estimate their effect on Slot Filling by replacing the predicted intents with the golden ones. The last two rows in Table 2 give the results. PromptSLU with golden intents generally shows only a very slight improvement on Slot F1 and no increase in Overall Acc, indicating that the effect of transmission errors is not significant. We mainly attribute this to the fact that PromptSLU checks the intents in the prompt against the input utterance and filters out wrong or missing intents, which proves that PromptSLU learns to understand semantics. Another reason may be the relatively high Intent Acc, which reduces the impact of transmission errors.

Figure 3: Comparison between the two strategies. We use PS$_r$ and PS$_n$ to denote PromptSLU utilizing raw labels and natural language descriptions of labels, respectively.
Figure 4: Dynamic change of performance on Slot F1 and Intent Acc along the training process.
Input (User): describe pittsburgh airport and then list flights from denver to san francisco no denver to philadelphia.
Output (ID): intent: atis_airport, intent: atis_flight
Output (SF): airport_name: pittsburgh airport, fromloc.city_name: denver, toloc.city_name: san francisco, toloc.city_name: philadelphia
Output (French Translation): décrit l’aéroport pittsburgh et énumère les vols depuis la city de denver à san francisco et non pas à la city de philadelphia

Input (User): what are the costs of car rental in dallas and also ground transportation denver.
Output (ID): intent: atis_ground_service, intent: atis_ground_fare
Output (SF): transport_type: car rental, city_name: dallas, city_name: denver
Output (Romanian Translation): costul maşinii închiriate în dallas şi al transportului pe teren în city.

Table 3: Case study. We use raw labels in place of the corresponding natural language descriptions. PromptSLU accurately predicts all labels in the output sequences.

5 Discussion

5.1 Effect of Semantic Information

Our framework is equipped with descriptions of labels rather than raw labels, to better exploit the semantic information of intents and slots. To explore the effectiveness of this operation, we conduct experiments in which the descriptions are replaced with the corresponding raw labels, added to the vocabulary as special tokens. Figure 3 illustrates the comparison between the two strategies. We use PS$_n$ to denote the former variant of PromptSLU$_{\textrm{small}}$ and PS$_r$ to denote the one with raw labels; PS$_r$ uses the same hyper-parameters as PS$_n$.

We make the following observations:
1) Figure 3 shows that the absence of semantic information hinders PromptSLU from aligning intents with slots, reflected by the drop in Overall Acc. Figure 4 further demonstrates the dynamic influence on SF and ID. This indicates that semantic information is truly acquired and crucial.
2) Comparing the convergence rates of the two strategies on Slot F1 and Intent Acc, we see that the performance of PS$_n$ quickly rises to a relatively high level, whereas PS$_r$ improves only step by step in the early epochs, as its label representations, defined solely by local context, are gradually revised. This suggests that pre-trained semantic features effectively aid fine-tuning.

5.2 Case Study

Table 3 displays several typical examples. Upon receiving user utterances, PromptSLU completes Intent Detection and Slot Filling by generating key-value sequences.

It can be noted that PromptSLU predicts slot-value pairs in the order of their appearance in the user input, which may be further utilized for supplementary requirements. Moreover, there exist slot-value pairs that share a slot but have distinct values, which poses a great challenge to the comprehension of networks; PromptSLU nevertheless predicts them precisely without losing their intrinsic order.

Surprisingly, we find that PromptSLU sometimes manages to translate user utterances into other languages, given specific prefixes. This discovery suggests that fine-tuned PromptSLU may alleviate the forgetting of knowledge acquired in the pre-training stage. We present examples in Table 3.

6 Conclusion

In this paper, we propose PromptSLU, which handles Intent Detection and Slot Filling in multi-intent SLU with a prompt-based text generation framework. To the best of our knowledge, this is the first work to utilize prompts for this problem. Our framework integrates the two sub-tasks into the same formulation while distinguishing them through diverse prompts, which simplifies model structures and enhances interpretability. Moreover, based on the semantic similarity between intents and slots, predicted intents are driven to guide the process of Slot Filling, and a new auxiliary task, Slot Prediction, is introduced. Experimental results show its multi-dimensional superiority over baselines on two public datasets.

References