
A Unified Framework for Multi-intent Spoken Language Understanding with prompting

Feifan Song, Lianzhe Huang and Houfeng Wang
MOE Key Laboratory of Computational Linguistics, Peking University, China
[email protected]
{hlz, wanghf}@pku.edu.cn
Abstract

Multi-intent Spoken Language Understanding has great potential for widespread implementation. Jointly modeling its two sub-tasks, Intent Detection (ID) and Slot Filling (SF), provides a channel to exploit the correlation between intents and slots. However, current approaches tend to formulate these two sub-tasks differently, which leads to two issues: 1) it hinders models from effectively extracting shared features; 2) rather complicated structures are introduced to enhance expressiveness while damaging the interpretability of frameworks. In this work, we describe a Prompt-based Spoken Language Understanding (PromptSLU) framework that intuitively unifies the two sub-tasks into the same form over a shared pre-trained Seq2Seq model. Concretely, ID and SF are completed by concisely filling the utterance into task-specific prompt templates as input, and they share an output format of key-value pair sequences. Furthermore, a variable number of intents is predicted first and then naturally embedded into prompts to guide slot-value pair inference from a semantic perspective. Finally, inspired by prevalent multi-task learning, we introduce an auxiliary sub-task that helps to learn the relationships among provided labels. Experimental results show that our framework outperforms several state-of-the-art baselines on two public datasets.

1 Introduction

Spoken Language Understanding (SLU) is a fundamental part of Task-oriented Dialogue (TOD) modeling. It serves as the entrance of the pipeline and comprises two sub-tasks: detecting user intents and extracting semantic information to fill predefined slots, called Intent Detection (ID) and Slot Filling (SF), respectively. The former is usually modeled as a text classification problem, while the latter is completed by cutting out fragments of an input utterance in the form of sequence tagging.

Figure 1: Illustration of the prior paradigm for jointly modeling multi-intent Spoken Language Understanding.

Early works (Ravuri and Stolcke, 2015; Kim et al., 2017) focused on the two sub-tasks separately, but in recent years the correlation between them has been seen as the key to further improvement of SLU. Thus, joint modeling techniques have been proposed to exploit the features shared between Intent Detection and Slot Filling. Chen et al. (2019) treated BERT as a shared contextual encoder for the two sub-tasks, while Qin et al. (2019) further brought token-level intent prediction results into the Slot Filling process. Apart from these conventional NLU pathways, Zhang et al. (2021) and Tassias (2021) chose to formulate SLU as a constrained generation task.

However, real-world dialogue is more complex: each utterance can be longer and contain more than one intent. As shown in Figure 1, the utterance “Show me the cheapest fare in the database and ground transportation san francisco” contains two distinct intents (atis_cheapest and atis_ground_service). For this kind of multi-intent SLU, similar joint modeling methods have been explored (Gangadharaiah and Narayanaswamy, 2019; Qin et al., 2020; Ding et al., 2021; Qin et al., 2021; Chen et al., 2022; Cai et al., 2022) and work well on several datasets, where multi-intent detection is often formulated as a multi-label text classification problem (Gangadharaiah and Narayanaswamy, 2019), and the potential interaction among intent and slot labels also plays an essential role (Qin et al., 2020; Ding et al., 2021; Qin et al., 2021; Chen et al., 2022; Cai et al., 2022).

Although following the path of joint modeling, prior methods tend to integrate ID and SF during encoding through a shared feature extractor, but treat them individually when decoding. Figure 1 displays the different decoding processes of the two sub-tasks: multi-label classification and sequence tagging. This vast distinction in task formulation between ID and SF discourages models from effectively extracting shared features and, in turn, limits the potential for comprehensively better performance. Consequently, unifying the two sub-tasks from start to finish within a common framework is significant.

In this paper, we propose a framework that leverages pre-trained language models (PLMs) and prompting to handle the aforementioned challenges, called the Prompt-based Spoken Language Understanding (PromptSLU) framework. Instead of using complicated structures to exploit the correlation among labels across the two sub-tasks, our framework casts both of them as text generation tasks with prompts, where the respective outputs share a general format of key-value pair sequences, namely the Belief Span (Lei et al., 2018). During inference, given an utterance and a certain task requirement, PromptSLU first fills the utterance into a task-specific prompt template and then feeds it to a pre-trained Seq2Seq model.

Besides, consistency between the two sub-tasks is crucial in SLU. Compared with prior settings, the multi-intent scenario is more challenging for accurate alignment from intents to slots because of the greater length of utterances and the increased number of labels. For this issue, we explore an intuitive way in which intents constrain SF by being plugged into prompt templates, namely Semantic Intent Guidance (SIG). Besides being plain to humans in form, this design also allows the semantic information of intents to be utilized to promote the overall comprehension of our framework. Furthermore, inspired by the multi-task learning used by Paolini et al. (2021), Su et al. (2022) and Wang et al. (2022a), we introduce an auxiliary sub-task, called Slot Prediction (SP), which steers models to additionally maintain semantic consistency. Experiments on two datasets demonstrate that PromptSLU outperforms SOTA baselines on most metrics, including those using PLMs. Extensive ablation studies also show the effectiveness of each component.

In summary, our contributions are as follows:

  • We take the first step to introduce prompts into multi-intent SLU by transforming Intent Detection and Slot Filling into a common text generation formulation.

  • We present the Semantic Intent Guidance mechanism and a new sub-task, Slot Prediction, to provide more intuitive channels for interaction among texts, intents and slots.

  • Experiments on two public datasets show that PromptSLU outperforms methods with SOTA performance, and verify the effectiveness of the proposed strategies and of semantic information.

2 Related Work

2.1 Spoken Language Understanding

Spoken Language Understanding (SLU), in the task-oriented dialog system, is mainly composed of two sub-tasks, Intent Detection (ID) and Slot Filling (SF). As for task formulations, most existing methods model ID as a text classification problem and SF as a sequence tagging problem. Early works handled ID and SF separately (Schapire and Singer, 2000; Haffner et al., 2003; Raymond and Riccardi, 2007; Tur et al., 2011; Ravuri and Stolcke, 2015; Kim et al., 2017; Wang et al., 2022b). However, current approaches prefer to model them together, in consideration of the high correlation between them, which leads to substantial improvement. Joint modeling techniques basically follow two methodologies:

Parallel Model

ID and SF produce outputs separately but share an utterance encoder, in an attempt to exploit common latent features (Zhang and Wang, 2016; Hakkani-Tür et al., 2016; Zhang and Wang, 2019).

Serial Model

Intents are detected first, and the intent information is then utilized to guide SF (Zhang et al., 2019; Qin et al., 2019; Tassias, 2021).

Our framework takes both sides mentioned above: on one hand, it can be implemented serially, following the second methodology, i.e., allowing intents to benefit SF; on the other hand, the results of the two sub-tasks come from the same Seq2Seq model.

2.2 Multi-intent SLU

Multi-intent utterances appear more frequently in reality than those with a single intent, and the problem has attracted increasing attention (Kim et al., 2017; Gangadharaiah and Narayanaswamy, 2019; Qin et al., 2020; Ding et al., 2021; Qin et al., 2021; Chen et al., 2022; Cai et al., 2022). Gangadharaiah and Narayanaswamy (2019) used an attention-based neural network to jointly model ID and SF in multi-intent scenarios. Qin et al. (2020) proposed AGIF, which organizes intent labels and slot labels together in a graph structure and focuses on the interaction of labels. Ding et al. (2021) and Qin et al. (2021) utilize graph attention networks to attend to the associations among labels. Despite these attempts to build a bridge between intent and slot labels, ID and SF are still modeled in different forms, i.e., multi-label classification for ID but sequence tagging for SF; the essential problem remains that distinct modeling forms hinder the extraction of common features. Our text-generation-based framework differs from the above approaches by unifying the formulation of ID and SF. On one hand, this form is plain to humans and naturally suits ID with a variable number of intents; on the other hand, every improvement to our framework effectively benefits both sub-tasks, since the results of both come from the same Seq2Seq model.

2.3 Prompting

GPT-3 (Brown et al., 2020) provides a novel insight into Natural Language Processing through the power of a new paradigm, namely prompting (Liu et al., 2021). It has been explored in many directions by transforming the original task into a text-to-text form (Zhong et al., 2021; Lester et al., 2021; Cui et al., 2021; Wang et al., 2022a; Tassias, 2021; Su et al., 2022), where fine-tuning on downstream tasks effectively draws on knowledge from the long pre-training process and works especially well in low-resource scenarios. Zhong et al. (2021) aligned the form of fine-tuning with next-word prediction to optimize performance on text classification tasks. Alternatively, Lester et al. (2021) resorted to soft prompts to capture implicit and continuous signals. For sequence tagging, prompts appear as templates for text filling or as natural language instructions (Cui et al., 2021; Wang et al., 2022a). As for SLU in task-oriented dialogue (TOD), it consists of the two sub-tasks ID and SF, which have homologous forms. Tassias (2021) depended on prior distributions of intents and slots to generate a prompt containing all slots to be predicted before inference, lowering the difficulty of SF. Su et al. (2022) proposed a prompt-based framework, PPTOD, to integrate several parts of the TOD pipeline, including single-intent detection, which differs from our setting. However, such joint modeling fashions neglect the complexity of co-occurrence and semantic similarity among tokens, intents and slots in multi-intent SLU, consequently suppressing the potential of PLMs. To this end, we build the Semantic Intent Guidance mechanism and design an auxiliary sub-task, Slot Prediction, to jointly exploit more detailed generality.

Figure 2: The flow path of PromptSLU. We unify the formulations of the sub-tasks with a shared PLM. The dashed rectangles present transformations between intent/slot labels and the corresponding descriptions. The left arrow shows the direction of the flow path. [1] indicates that the input utterance is embedded into prompt templates for different sub-tasks. [2] and [3] are the input and output of Intent Detection. [4] is the Semantic Intent Guidance mechanism, where predicted intents are embedded into prompt templates to guide other sub-tasks during inference. [5-1] and [5-2] are the inputs of Slot Filling and Slot Prediction, respectively, both of which can be executed in parallel. [6-1] and [6-2] are their outputs.

3 Methodology

In this section, we describe the proposed PromptSLU along the inner flow of data. In this framework, we utilize different prompts to complete different sub-tasks, an intuitive approach similar to that of Su et al. (2022). However, PromptSLU is specially designed for multi-intent SLU and contains a particular intent-slot interaction mechanism, namely Semantic Intent Guidance (SIG). We also propose a new auxiliary sub-task, Slot Prediction (SP), to improve Intent Detection (ID) and Slot Filling (SF).

Given an utterance $X=\{x_1,x_2,x_3,\ldots,x_N\}$, for each sub-task (ID, SF and SP), PromptSLU feeds the backbone model the concatenation of a task-specific prefix and $X$, then makes predictions. During SF or SP, intents can also be involved. The whole framework is shown in Figure 2. The three sub-tasks are jointly trained with a shared pre-trained language model (PLM).

3.1 Intent Detection

Traditionally, intents are defined as a fixed set of labels, and a model maps input text to these labels. Differently, we model the task as text generation, i.e., the backbone model produces a sequence of intents corresponding to the input utterance.

The Belief Span was first utilized by Sequicity (Lei et al., 2018) to cover filled and requestable slots for state tracking. We transfer this structure to multi-intent SLU by defining the format of the output sequence as:

$I_X=\textrm{intent : }i_1,\ldots,\textrm{intent : }i_{N_I}$ (1)

where $i_m~(1\leq m\leq N_I)$ is the $m$-th intent and $N_I$ is the number of predicted intents. As clarified before, we integrate the task modeling by imposing similar formats on the following sub-tasks. The PLM takes the prompt “transfer sentence to intents : $X$”, then carries out Intent Detection.

Following the tricks used in Tassias (2021) and Wang et al. (2022a), we use natural language descriptions to represent non-semantic intent labels, reducing the difficulty of exploiting semantic information. We also apply this operation in the follow-up sub-tasks, based on the assumption that PromptSLU significantly relies on semantic information to complete multi-intent SLU.
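As a concrete illustration, the serialization for Intent Detection could look like the following sketch (the helper names and the example label-to-description mapping are our assumptions, not part of a released implementation):

```python
# Minimal sketch of the ID serialization; the helper name and the example
# label-to-description mapping are illustrative assumptions.
INTENT_DESC = {
    "atis_cheapest": "cheapest",
    "atis_ground_service": "ground service",
}

def build_id_example(utterance: str, intents: list[str]) -> tuple[str, str]:
    """Return the (input, target) text pair for Intent Detection."""
    # Task-specific prompt template from the text above.
    source = f"transfer sentence to intents : {utterance}"
    # Belief-Span-style target of Eq. (1): "intent : i_1, ..., intent : i_{N_I}".
    target = ", ".join(f"intent : {INTENT_DESC.get(i, i)}" for i in intents)
    return source, target
```

Under this assumed mapping, the target for the utterance of Figure 1 would read “intent : cheapest, intent : ground service”.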

3.2 Semantic Intent Guidance

It is noted that there exist close relationships among tokens, intents and slots, reflected not only in frequent co-occurrence but also in semantic resemblance, which may help inference. We take this into account in our framework.

We adopt the idea of joint modeling by parsing intents from the output sequence of ID and using them to facilitate the other sub-tasks. Specifically, together with task-specific prefixes, intents also serve as an essential part of the prompts for SF and SP in our prompt-based framework, guiding their completion and keeping semantic consistency between ID and SF. We name this the Semantic Intent Guidance (SIG) mechanism and display it in the prompt templates below.

3.3 Slot Filling

To let the PLM generate slot-value pairs directly, raw data have to be processed in advance. We first extract the golden slot-value pairs from each BIO-tagged sequence, based on the tagging rules. Then we assemble these golden pairs, in order, into one sentence in the Belief Span format, which serves as the text-generation target $SF_X$. This step is crucial for the purpose of integrating task formulations:

$SF_X=s_1\textrm{ : }v_1,\ldots,s_{N_P}\textrm{ : }v_{N_P}$ (2)

where $N_P$ denotes the number of predicted slot-value pairs.
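The extraction step described above could be sketched as follows (a minimal illustration under standard BIO conventions; the function name is ours):

```python
# Sketch of the preprocessing: golden slot-value pairs are read off a
# BIO-tagged sequence and assembled into a Belief Span target. The raw
# slot labels would subsequently be replaced by natural language
# descriptions (see below). Function and variable names are illustrative.
def bio_to_belief_span(tokens: list[str], tags: list[str]) -> str:
    pairs = []          # (slot, value), kept in order of appearance
    slot, value = None, []
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if slot is not None:            # close the previous span
                pairs.append((slot, " ".join(value)))
            slot, value = tag[2:], [tok]
        elif tag.startswith("I-") and slot is not None:
            value.append(tok)
        else:                               # an "O" tag closes any open span
            if slot is not None:
                pairs.append((slot, " ".join(value)))
            slot, value = None, []
    if slot is not None:                    # flush a span ending the sentence
        pairs.append((slot, " ".join(value)))
    # Belief-Span format of Eq. (2): "s1 : v1, ..., s_{N_P} : v_{N_P}"
    return ", ".join(f"{s} : {v}" for s, v in pairs)

# e.g. bio_to_belief_span(["cheapest", "fare", "from", "denver"],
#                         ["O", "O", "O", "B-fromloc.city_name"])
# -> "fromloc.city_name : denver"
```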

Similar to the mapping procedure in Intent Detection, we also replace slot labels with their corresponding natural language descriptions. The dashed boxes at the top of Figure 2 demonstrate the transformation from intent and slot labels to more human-readable phrases. This procedure is applied to both input and target utterances.

We introduce the SIG mechanism here, as mentioned above, to let the predicted intents act as pilots for Slot Filling. In detail, we plug the intent phrases and the user utterance into the task-specific prompt template: “transfer sentence to pairs with $I_X$ : $X$”. We do the same in the next sub-task.
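For illustration, the SIG step at inference time could be sketched as below (the parsing helper and function names are our assumptions; the template follows the text above):

```python
# Sketch of Semantic Intent Guidance (SIG): intents are parsed from the
# ID output sequence and embedded into the SF prompt. Names are
# illustrative assumptions.
def parse_intents(id_output: str) -> list[str]:
    """Recover intents from an Eq. (1) sequence: 'intent : i1, intent : i2'."""
    return [part.split(":", 1)[1].strip()
            for part in id_output.split(",") if ":" in part]

def build_sf_input(utterance: str, id_output: str) -> str:
    intents = ", ".join(parse_intents(id_output))
    # Template from the text: "transfer sentence to pairs with I_X : X".
    return f"transfer sentence to pairs with {intents} : {utterance}"
```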

3.4 Slot Prediction

Multi-task learning offers a channel for the interaction of different tasks and enhances the supervision signals toward the training objectives. Following the multi-task settings in Su et al. (2022) and Wang et al. (2022a), we design this auxiliary sub-task for the training stage, forcing the PLM to focus on the semantic interaction among tokens, intents and slots. We also leverage it to better maintain semantic consistency. PromptSLU is required to generate the slot sequence:

$SP_X=\textrm{slot : }s_1,\ldots,\textrm{slot : }s_{N_S}$ (3)

according to the prefix “transfer sentence to slots with $I_X$ : $X$”. We use $N_S$ to denote the number of predicted slots.

3.5 Training

Similar to other text-to-text tasks, our framework is trained to minimize negative log-likelihood for each sub-task:

$-\sum_{i=1}^{|Y|}\log p_\Theta\left(y_i \mid y_{<i},X\right)$ (4)

where $Y$ denotes the target token sequence and $\Theta$ denotes the model parameters. We introduce two variants of the loss function.

Weighted Loss integrates three sub-tasks by calculating a weighted sum for each sample:

$\mathcal{L}_w=\alpha\mathcal{L}_{ID}+\beta\mathcal{L}_{SP}+\gamma\mathcal{L}_{SF}$ (5)

where $\alpha$, $\beta$ and $\gamma$ are hyper-parameters, and $\mathcal{L}_{ID}$, $\mathcal{L}_{SP}$ and $\mathcal{L}_{SF}$ represent the losses of Intent Detection, Slot Prediction and Slot Filling, respectively.
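As an illustration, $\mathcal{L}_w$ could be computed as in the following sketch (a minimal sketch assuming a Huggingface-style Seq2Seq model; the batch field names are ours, and the default weights follow the (1.0, 2.0, 1.0) setting reported in Sec. 4.3):

```python
# Sketch of the weighted loss of Eq. (5); `model` is a Seq2Seq PLM such as
# T5, and the batch field names are illustrative assumptions.
def weighted_loss(model, batch, alpha=1.0, beta=2.0, gamma=1.0):
    # One forward pass per sub-task; with `labels` given, Huggingface
    # models return the mean token-level NLL of Eq. (4) as `.loss`.
    l_id = model(input_ids=batch["id_input"], labels=batch["id_labels"]).loss
    l_sp = model(input_ids=batch["sp_input"], labels=batch["sp_labels"]).loss
    l_sf = model(input_ids=batch["sf_input"], labels=batch["sf_labels"]).loss
    return alpha * l_id + beta * l_sp + gamma * l_sf
```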

Split Loss is utilized with a shuffled dataset, built by splitting each sample in the original dataset into three text-generation samples corresponding to the three sub-task targets. That is,

$\mathcal{L}_s(X,Y)=\begin{cases}\mathcal{L}_{ID}(X,Y)&Y\in ID\\\mathcal{L}_{SP}(X,Y)&Y\in SP\\\mathcal{L}_{SF}(X,Y)&Y\in SF\end{cases}$ (6)
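The construction of this shuffled, split dataset could be sketched as follows (the sample fields are illustrative; we also assume golden intents fill the SIG slot during training, while predicted intents are used at inference):

```python
import random

# Sketch of the "split" dataset behind Eq. (6): every annotated utterance
# yields one sample per sub-task, and the mixture is shuffled so that each
# training step optimizes the plain NLL of Eq. (4) for a single sub-task.
# Field names, and the use of golden intents in the SIG slot at training
# time, are our assumptions.
def build_split_dataset(samples):
    data = []
    for ex in samples:
        u = ex["utterance"]
        intents = ", ".join(ex["intents"])      # verbalized intents
        pairs = ex["slot_pairs"]                # ordered (slot, value) tuples
        data.append((f"transfer sentence to intents : {u}",
                     ", ".join(f"intent : {i}" for i in ex["intents"])))
        data.append((f"transfer sentence to slots with {intents} : {u}",
                     ", ".join(f"slot : {s}" for s, _ in pairs)))
        data.append((f"transfer sentence to pairs with {intents} : {u}",
                     ", ".join(f"{s} : {v}" for s, v in pairs)))
    random.shuffle(data)
    return data
```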

4 Experiment

4.1 Datasets and Metrics

We conduct experiments mainly on MixATIS and MixSNIPS (Qin et al., 2020), which are widely used multi-intent SLU datasets. MixATIS contains 13,162, 759 and 828 samples for training, validation and test, respectively; MixSNIPS contains 39,776, 2,198 and 2,199.

Following Cai et al. (2022) and Qin et al. (2021), we evaluate Slot Filling with the F1 score, and Intent Detection and sentence-level semantic frame parsing with accuracy, denoted Slot F1, Intent Acc and Overall Acc, respectively. (Since there exist samples with repeated slot-value pairs, we calculate Slot F1 after eliminating duplicate pairs for PromptSLU.)
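For concreteness, the metrics could be computed roughly as in the sketch below (our illustration of a corpus-level protocol; upstream parsing of the model outputs into per-utterance sets is assumed):

```python
# Sketch of the evaluation protocol. `preds`/`golds` are lists of
# per-utterance sets of (slot, value) pairs (duplicates eliminated, per the
# note above); intent arguments are per-utterance sets of intent labels.
def slot_f1(preds, golds):
    tp = sum(len(p & g) for p, g in zip(preds, golds))
    n_pred, n_gold = sum(map(len, preds)), sum(map(len, golds))
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def overall_acc(pred_intents, gold_intents, preds, golds):
    # An utterance counts only if its intent set AND slot-value set both match.
    hits = sum(pi == gi and p == g
               for pi, gi, p, g in zip(pred_intents, gold_intents, preds, golds))
    return hits / len(golds)
```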

4.2 Baselines

For the evaluation of multi-intent SLU, we compare our approach with three baselines and their enhanced versions (if provided):
AGIF (Qin et al., 2020): This method utilizes a graph interaction module to build token-wise connections between slot tags and intent labels.
GL-GIN (Qin et al., 2021): Different from AGIF, this work exploits local coherence during slot tagging and sentence-level connections between intents and slots, and performs decoding in a non-auto-regressive way. GL-GIN+RoBERTa replaces the self-attentive encoder with RoBERTa$_{\textrm{base}}$ (128M parameters) (Liu et al., 2019).
SDJN (Chen et al., 2022): This work reformulates multi-intent detection as a weakly supervised task and designs a self-distillation mechanism to circularly refresh the pipeline. SDJN+BERT likewise replaces the self-attentive encoder with BERT, whose size is not mentioned in the source paper.

Methods                     |           MixATIS                   |           MixSNIPS
                            | Slot(F1) Intent(Acc)  Overall(Acc)  | Slot(F1)  Intent(Acc)  Overall(Acc)
----------------------------+-------------------------------------+-------------------------------------
Non-pre-trained             |                                     |
  AGIF (Qin et al., 2020)   |   86.7      74.4         40.8       |   94.2       95.1         74.2
  GL-GIN (Qin et al., 2021) |   88.3      76.3         43.5       |   94.9       95.6         75.4
  SDJN (Chen et al., 2022)  |   88.2      77.1         44.6       |   94.4       96.5         75.7
Pre-trained                 |                                     |
  SDJN+BERT                 |   87.5      78.0         46.3       |   95.4       96.7         79.3
  GL-GIN+RoBERTa            |   88.6      79.2         53.9       |   96.0       97.4         82.4
  PromptSLU (L_w)           |   88.6      83.6         53.3       |   96.2       97.7         82.8
  PromptSLU (L_s)           |   89.6*     85.8*        57.2*      |   96.5*      97.5         84.8*
Table 1: Main results on MixATIS and MixSNIPS. Methods are grouped by whether pre-trained language models are involved. L_w and L_s denote the weighted and split loss, respectively. * means that the improvement over SOTA is significant with $p<0.05$ under a t-test.

For baselines except GL-GIN+RoBERTa, we directly take the results from the literature.

4.3 Implementation Detail

In this work, we choose T5 (Raffel et al., 2020) as the backbone of PromptSLU, in its small (60M parameters) and base (220M parameters) versions, because our prompt design is coherent with the pre-training tasks of T5. We set the number of epochs and the per-processor batch size to 10 and 16, respectively. The maximum sequence length and the learning rate are uniformly set to 128 and 1e-4, while the dropout rate is 0.1 for MixATIS and 0.15 for MixSNIPS. We empirically set $\alpha$, $\beta$ and $\gamma$ to 1.0, 2.0 and 1.0, respectively. All test results are selected according to Overall Acc on the validation set. We report average scores over 4 runs with global random seeds 0, 1, 2 and 3. Following Su et al. (2022), our framework is built on the Huggingface Library (Wolf et al., 2020). The optimizer is AdamW (Loshchilov and Hutter, 2019). We conduct experiments on a Linux server with an Intel Xeon E5-2680 CPU and an Nvidia GeForce RTX 3090 GPU.
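A minimal sketch of this setup on the Huggingface Library, with the hyper-parameters above (the checkpoint name and helper structure are illustrative; data loading and the epoch loop are omitted):

```python
# Sketch of the training configuration; hyper-parameters follow the text,
# everything else (checkpoint name, helper structure) is an assumption.
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-base")  # 220M-parameter backbone
model = T5ForConditionalGeneration.from_pretrained(
    "t5-base", dropout_rate=0.1)                        # 0.15 for MixSNIPS
optimizer = AdamW(model.parameters(), lr=1e-4)

def train_step(src_texts, tgt_texts):
    enc = tokenizer(src_texts, padding=True, truncation=True,
                    max_length=128, return_tensors="pt")
    labels = tokenizer(tgt_texts, padding=True, truncation=True,
                       max_length=128, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100     # mask padding in the NLL
    loss = model(**enc, labels=labels).loss             # Eq. (4)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```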

4.4 Main Results

Table 1 displays the experiment results of our framework and baselines. It is observed that our framework outperforms each baseline on each metric for both MixATIS and MixSNIPS, no matter whether a pre-trained model is leveraged.

It is clear that pre-trained models are beneficial to this task: in general, methods with pre-trained models outperform those without them, mainly owing to the knowledge they contain. Among them, PromptSLU is distinguished from the other NLU methods by more concise modeling yet better performance. This result verifies the efficient knowledge exploitation of the proposed prompt-based text generation paradigm for multi-intent SLU.

More specifically, when compared with the SOTA results, PromptSLU$_{\mathcal{L}_s}$ still achieves significant advances. Notably, its result on Overall Acc, which might have been limited by the relatively slight improvement on Slot F1, in fact exceeds the SOTA result by a large margin. This indicates that during inference, our framework is skilled at maintaining semantic consistency between intents and slots in the same utterance. We preliminarily attribute this ability to the proposed Semantic Intent Guidance mechanism and Slot Prediction.

We also compare PromptSLU$_{\mathcal{L}_w}$ and PromptSLU$_{\mathcal{L}_s}$. PromptSLU$_{\mathcal{L}_s}$ beats PromptSLU$_{\mathcal{L}_w}$ on almost all metrics, except for a close Intent Acc on MixSNIPS. This primarily demonstrates the superiority of $\mathcal{L}_s$ over the straightforward weighted sum of losses; we provide a careful analysis in Sec 4.5. For the other experiments, we use $\mathcal{L}_s$ as the loss function by default.

Methods                |           MixATIS                   |           MixSNIPS
                       | Slot(F1) Intent(Acc)  Overall(Acc)  | Slot(F1)  Intent(Acc)  Overall(Acc)
-----------------------+-------------------------------------+-------------------------------------
w/o golden intents     |                                     |
  PromptSLU (small)    |   87.1      80.4         50.6       |   95.4       97.4         79.4
  PromptSLU (base)     |   89.6      85.8         57.2       |   96.5       97.5         84.8
    - SP               |   87.9      82.6         50.8       |   96.2       97.3         82.9
    - SIG              |   88.8      84.6         53.3       |   96.2       97.5         82.7
    only ID            |    -        81.4          -         |    -         97.0          -
    only SF            |   88.6       -            -         |   96.1        -            -
    L_w(1,1,1)         |   88.3      81.5         51.7       |   96.3       97.3         83.3
    L_w(2,1,1)         |   89.0      80.2         52.6       |   96.3       97.6         83.1
    L_w(1,2,1)         |   88.6      83.6         53.3       |   96.2       97.7         82.8
    L_w(1,1,2)         |   88.3      81.5         51.0       |   96.4       97.5         83.9
w/ golden intents      |                                     |
  PromptSLU (small)    |   87.7      80.4         50.6       |   95.8       97.4         79.4
  PromptSLU (base)     |   89.7      85.8         57.2       |   97.0       97.5         84.8
Table 2: Ablation experiments. SP and SIG denote Slot Prediction and Semantic Intent Guidance, respectively. L_w(a,b,c) denotes PromptSLU (base) trained under the weighted loss function with $(\alpha,\beta,\gamma)$ set to $(a,b,c)$.

4.5 Ablation Study

In this part, we isolate several factors and analyze their effectiveness. Results are shown in Table 2. The improvement is relatively slight on MixSNIPS because its features and patterns are, to some extent, more straightforward to capture, so every variant achieves competitive performance. An intuitive idea is that model size is a dominant factor in performance, which is confirmed by the comparison between PromptSLU$_{\textrm{small}}$ and PromptSLU$_{\textrm{base}}$. PromptSLU$_{\textrm{small}}$ is competitive with the baselines using pre-trained models, although it has far fewer parameters; with the larger backbone, PromptSLU$_{\textrm{base}}$ makes large improvements over PromptSLU$_{\textrm{small}}$.

We next remove Slot Prediction (SP) while retaining Intent Detection and Slot Filling. We observe drops on every metric on both datasets, especially on Overall Acc. This confirms that SP serves as a catalyst that helps maintain semantic consistency between ID and SF.

We then check the effect of the SIG mechanism by changing the task-specific prompt templates and removing the slots originally prepared for predicted intents. This also causes a performance decline. Similar to the ablation of SP, SIG facilitates the consistency between the two kinds of labels, which is one of the chief strengths of joint modeling. We surprisingly observe that Intent Acc also decreases on MixATIS. We attribute this to implicit semantic supervision during the update of model parameters: although gradients cannot be propagated across sub-tasks, our framework still has the potential to revise intent predictions through SIG.

In PromptSLU, we implement joint modeling for SLU along two directions: one unifies the two sub-tasks in a shared formulation with a common Seq2Seq model, while the other allows predicted intents to assist Slot Filling. Hence, we decouple the two sub-tasks by accomplishing each with an exclusive T5 model, in order to evaluate the impact of joint modeling. Interestingly, PromptSLU with only SF achieves competitive Slot F1 compared with the joint versions; however, PromptSLU with only ID drops considerably on Intent Acc for MixATIS. This confirms the effectiveness of the joint modeling paradigm even in a text generation formulation.

We propose two kinds of loss functions for PromptSLU. The weighted $\mathcal{L}_w$ has a structure similar to those used in the baselines, while $\mathcal{L}_s$ treats each sub-task more directly as a text generation process. Here we try different combinations of weight hyper-parameters for $\mathcal{L}_w$, but $\mathcal{L}_s$ still yields better performance in general. We assume that the optimization targets of the different sub-tasks are not identical, so a trade-off among them is indispensable; different weight combinations reflect this game, but the desired combination is likely hard to find and may even change dynamically along the training process. In contrast, $\mathcal{L}_s$ lets PromptSLU avoid this trade-off at each step and focus on completing a single text generation task, achieving better improvements.

During inference, PromptSLU with SIG generates slot-value pairs based on the predicted intents, so transmission errors should be considered. We experimentally estimate their effect on Slot Filling by replacing the predicted intents with the golden ones. The last two rows in Table 2 give the results. PromptSLU with golden intents generally shows only a very slight improvement on Slot F1 and no increase in Overall Acc, indicating that the effect of transmission errors is not significant. We mainly attribute this to the fact that PromptSLU checks the intents in the prompt against the input utterance and filters out wrong or missing intents, which proves that PromptSLU learns to understand semantics. Another reason may be the relatively high Intent Acc, which reduces the impact of transmission errors.

Figure 3: Comparison between the two strategies. We use PS$_r$ and PS$_n$ to denote PromptSLU utilizing raw labels and natural language descriptions of labels, respectively.
Figure 4: Dynamic change of performance on Slot F1 and Intent Acc along the training process.
Input (User): describe pittsburgh airport and then list flights from denver to san francisco no denver to philadelphia.
Output (ID): intent: atis_airport, intent: atis_flight
Output (SF): airport_name: pittsburgh airport, fromloc.city_name: denver, toloc.city_name: san francisco, toloc.city_name: philadelphia
Output (French Translation): décrit l’aéroport pittsburgh et énumère les vols depuis la city de denver à san francisco et non pas à la city de philadelphia

Input (User): what are the costs of car rental in dallas and also ground transportation denver.
Output (ID): intent: atis_ground_service, intent: atis_ground_fare
Output (SF): transport_type: car rental, city_name: dallas, city_name: denver
Output (Romanian Translation): costul maşinii închiriate în dallas şi al transportului pe teren în city.

Table 3: Case study. We use raw labels in place of the corresponding natural language descriptions. PromptSLU accurately predicts all labels in the output sequences.

5 Discussion

5.1 Effect of Semantic Information

Our framework is equipped with descriptions of labels rather than raw labels, to better exploit the semantic information of intents and slots. To explore the effectiveness of this operation, we conduct experiments in which the descriptions are replaced with the corresponding raw labels, added to the vocabulary as special tokens. Figure 3 illustrates the comparison between the two strategies. We use PS$_n$ to denote the former variant of PromptSLU$_{\textrm{small}}$ and PS$_r$ to denote the one with raw labels; PS$_r$ uses the same hyper-parameters as PS$_n$.

We make the following observations:
1) Figure 3 shows that the absence of semantic information hinders PromptSLU from aligning intents with slots, reflected by the drop in Overall Acc. Figure 4 further demonstrates the dynamic influence on SF and ID. This indicates that semantic information is truly acquired and crucial.
2) Comparing the convergence rates of the two strategies on Slot F1 and Intent Acc, we see that the performance of PS$_n$ quickly rises to a relatively high level, whereas PS$_r$ improves only step by step in the early epochs, as its label representations, defined solely by local context, are gradually revised. This suggests that pre-trained semantic features effectively aid fine-tuning.

5.2 Case Study

Table 3 displays several typical examples. Upon receiving user utterances, PromptSLU completes Intent Detection and Slot Filling by generating key-value sequences.

It can be noted that PromptSLU predicts slot-value pairs in the order of their appearance in the user input, which may be further utilized for supplementary requirements. Moreover, there exist slot-value pairs that share a slot but have distinct values, which poses a great challenge to the comprehension of networks; PromptSLU nevertheless predicts them precisely without losing their intrinsic order.

Surprisingly, we find that PromptSLU sometimes manages to translate user utterances into other languages, given specific prefixes. This discovery suggests that fine-tuned PromptSLU may alleviate the forgetting of knowledge acquired in the pre-training stage. We present examples in Table 3.

6 Conclusion

In this paper, we propose PromptSLU, which handles Intent Detection and Slot Filling in multi-intent SLU with a prompt-based text generation framework. To the best of our knowledge, this is the first work to utilize prompts for this problem. Our framework integrates the two sub-tasks into the same formulation while distinguishing them through diverse prompts, which simplifies model structures and enhances interpretability. Moreover, based on the semantic similarity between intents and slots, predicted intents are driven to guide the process of Slot Filling, and a new auxiliary task, Slot Prediction, is introduced. Experimental results show its multi-dimensional superiority over baselines on two public datasets.

References