
Choice Fusion as Knowledge for Zero-Shot Dialogue State Tracking

Abstract

With the demand for deploying dialogue systems in new domains at low cost, zero-shot dialogue state tracking (DST), which tracks a user's requirements in task-oriented dialogues without training on the target domains, has drawn increasing attention. Although prior works have leveraged question-answering (QA) data to reduce the need for in-domain training in DST, they fail to explicitly model knowledge transfer and fusion for tracking dialogue states. To address this issue, we propose CoFunDST, which is trained on domain-agnostic QA datasets and directly uses candidate slot-value choices as knowledge for zero-shot dialogue-state generation, based on a T5 pre-trained language model. Specifically, CoFunDST selects choices that are highly relevant to the reference context and fuses them to initialize the decoder, constraining the model outputs. Our experimental results show that the proposed model achieves higher joint goal accuracy than existing zero-shot DST approaches in most domains of MultiWOZ 2.1. Extensive analyses demonstrate the effectiveness of our approach in improving zero-shot DST learning from QA.

Index Terms—  Dialogue state tracking, zero-shot, question answering, pre-trained language model, knowledge fusion

1 Introduction

Dialogue service automation has become an increasingly important part of modern commercial activities and calls for advanced dialogue systems to reduce the labor cost of human agents [1]. In particular, a task-oriented dialogue system aims to assist users in completing specific tasks, such as ticket booking, online purchasing, or travel planning, via a natural language dialogue [2]. Dialogue state tracking (DST) is the core function of a task-oriented dialogue system: it tracks and generates dialogue states based on the user's requirements in a human-machine dialogue [3]. Typical dialogue states are collections of slot-value pairs, for example, (destination, Cambridge) and (day, Wednesday) for a train ticket booking service.

Fig. 1: The overview of Choice-Fusion Dialogue State Tracking (CoFunDST). The input is a concatenated query consisting of a question (Tangerine) and reference context (Blue). A sequence of choice tokens (Orange) is used for training and inference. The golden answer tokens (Yellow) are an extra input during training, serving as extra supervision for encoding the concatenated query.

The requirement to deploy an increasing number of services across a variety of domains raises challenges for DST models in production [4]. However, existing dialogue datasets only span a few domains, making it impossible to train a DST model on all conceivable conversation flows [5]. Furthermore, dialogue systems are required to infer dialogue states with dynamic techniques and to offer diverse interfaces for different services. Although the copy mechanism [6] and dialogue acts [7] have been leveraged to efficiently track slots and values in the dialogue history, the performance of DST still relies on a large number of dialogue-state annotations, which are expensive and inefficient to collect for every new domain and service.

The lack of annotated and representative data can limit the performance of DST at scale. A line of work suggests jointly encoding the schema and the dialogue context for DST to address this challenge [8, 9]. On the other hand, based on language processing theory [10], according to which meanings in similar contexts can be understood and predicted before being encountered, large-scale QA datasets provide an option to transfer learned knowledge to DST with little to no in-domain data and without loss of performance [11, 12]. However, none of the existing works model candidate choices in DST explicitly, resulting in a lack of efficiency and interpretability when exploiting the knowledge-extraction capabilities of the QA datasets.

To tackle these issues, we propose Choice-Fusion Dialogue State Tracking (CoFunDST), which trains on extensive QA datasets with sufficient annotations for zero-shot DST, without training on the target domains (code is publicly available at https://github.com/youlandasu/Choice-Fusion). Specifically, it fuses candidate choices as knowledge to predict slot-values accurately, based on a T5 [13] pre-trained encoder-decoder language model. CoFunDST formulates both DST and QA as machine reading comprehension (MRC) [14, 15], which generates the answer given the reference context. As a part of CoFunDST, we design appreciative choice selection to assess the relevance of all available candidate choices to the reference context and compute a probability distribution over these choices. We then apply context-choice fusion [16] to incorporate the context-dependent choices as knowledge for initializing the decoder. Our work advances zero-shot DST in two ways: 1) for the first time, we model candidate value choices as a distinctive source of knowledge to supply missing details for predicting slot-values accurately; 2) we propose context-choice fusion to selectively incorporate encoded choices based on the dialogue context. The performance of our model is demonstrated on MultiWOZ 2.1 [17], where it outperforms existing zero-shot DST approaches in terms of joint goal accuracy in the "Restaurant", "Train", and "Taxi" domains. Further analysis shows the effectiveness of the choice fusion and of knowledge transfer from QA to DST when generating different types of slot-values.

Model                      | Hotel | Train | Restaurant | Attraction | Taxi
TRADE [6]                  | 14.20 | 22.39 | 12.59      | 20.06      | 59.21
MA-DST [18]                | 16.28 | 22.76 | 13.56      | 22.46      | 59.27
SUMBT [19]                 | 19.80 | 22.50 | 16.50      | 22.60      | 59.50
TransferQA (T5-Small) [11] | 21.82 | 25.66 | 17.98      | 26.14      | 59.68
CoFunDST (T5-Small)        | 21.07 | 25.95 | 18.13      | 24.79      | 60.19
Table 1: Zero-shot joint goal accuracy on MultiWOZ 2.1 [17]. Results for TRADE [6], MA-DST [18], and SUMBT [19] are taken from the reference papers. TransferQA (T5-Small) follows all setups in [11] but is trained using T5-Small.
Settings | Hotel (JGA / SGA / F1) | Train (JGA / SGA / F1) | Restaurant (JGA / SGA / F1) | Attraction (JGA / SGA / F1) | Taxi (JGA / SGA / F1)
KLD+Fuse | 21.04 / 67.13 / 39.74  | 23.84 / 62.94 / 59.80  | 19.41 / 54.61 / 31.03       | 24.57 / 51.28 / 27.24       | 60.00 / 75.56 / 68.78
KLD      | 19.63 / 65.49 / 32.68  | 24.19 / 60.91 / 52.04  | 16.55 / 54.71 / 29.54       | 18.00 / 47.04 / 19.54       | 60.06 / 74.73 / 64.49
Fuse     | 18.89 / 65.01 / 34.20  | 22.13 / 57.12 / 48.89  | 15.87 / 54.65 / 31.91       | 21.70 / 47.98 / 22.35       | 59.48 / 73.24 / 62.33
Table 2: Ablation study on the two components of the choice-fusion mechanism: appreciative choice selection with KLD loss (KLD) and context-choice fusion (Fuse). All models use T5-Small [13]; we report joint goal accuracy (JGA), slot goal accuracy (SGA), and F1 in the five domains of MultiWOZ 2.1.

2 Method

2.1 Problem Formulation

Both QA and DST are formulated as generative MRC problems, which take questions and choices as input and generate answers token by token by comprehending the reference context, as depicted in Fig. 1. For QA training, the input query combines the sequence of question tokens $q=\{q_1,q_2,\dots,q_K\}$ and the reference context tokens $c=\{c_1,c_2,\dots,c_L\}$, of lengths $K$ and $L$, respectively. In other words, the model can be regarded as filling the sequence $a$ with correct tokens given the question $q$ and the reference context $c$: "question: $q$ context: $c$ answer: [$a$]", where $a=\{a_1,a_2,\dots,a_M\}$ and $M$ is the length of the answer. Additionally, the concatenation of the $N$ candidate choices for the given question is denoted as $v=\{v_1,v_2,\dots,v_N\}$. We also encode the ground-truth answer $\tilde{a}$ from the training set into tokens and combine it with the encoded input.
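As a minimal illustration of this serialization (the helper name and the toy dialogue are our own assumptions, not taken from the paper's code), the query fed to the encoder can be built as follows:

```python
def build_query(question: str, context: str) -> str:
    # Hypothetical helper: serialize a question and its reference context
    # into the "question: q context: c answer:" template described above.
    return f"question: {question} context: {context} answer:"

# Toy example loosely in the MultiWOZ style; the wording is illustrative only.
query = build_query(
    "What is the destination of the train that the user is interested in?",
    "User: I need a train to Cambridge on Wednesday. System: Sure, where from?",
)
print(query)
```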

For DST inference, each domain-slot is re-formulated as a natural language question of the form "What is the [slot] of [domain] that the user is interested in?", or with "What time" and "How many" as prefixes for time- and number-related slots, similar to the domain-slot formulation in [11]. In particular, the state value and the dialogue context are taken as the answer and the reference context, respectively.
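A sketch of this re-formulation is shown below; the slot lists used to decide among the three templates are illustrative assumptions rather than the paper's exact mapping.

```python
# Assumed example slot groupings; the paper does not enumerate them here.
TIME_SLOTS = {"leave at", "arrive by", "book time"}
NUMBER_SLOTS = {"book people", "book stay", "stars"}

def slot_to_question(domain: str, slot: str) -> str:
    # Re-formulate a domain-slot pair as a natural language question,
    # following the template described above.
    if slot in TIME_SLOTS:
        return f"What time is the {slot} of the {domain} that the user is interested in?"
    if slot in NUMBER_SLOTS:
        return f"How many {slot} of the {domain} is the user interested in?"
    return f"What is the {slot} of {domain} that the user is interested in?"

print(slot_to_question("train", "destination"))
```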

2.2 Choice-Fusion Mechanism

Appreciative Choice Selection: The appreciative choice selection is designed to select the choices that are highly relevant to the reference context, i.e., the appreciative choices. The choice tokens $v$ are processed by the T5 encoder into $V\in\mathbb{R}^{N\times T}$, where $T$ is the output hidden dimension of the Transformer. We calculate the prior and posterior probability distributions, $p_{pri}$ and $p_{post}$, over the candidate choices $V$ given the encoded question-context concatenation $D_{pri}\in\mathbb{R}^{(K+L)\times T}$ and the encoded golden answer $D_{post}\in\mathbb{R}^{M\times T}$. Note that only the prior distribution of Eq. 1 is used during inference.

p_{pri} = \text{softmax}\big(\tanh(VW^{V})\,\tanh(D_{pri}W^{D}_{pri})\big)   (1)
p_{post} = \text{softmax}\big(\tanh(VW^{V})\,\tanh([D_{pri};D_{post}]W^{D}_{post})\big)   (2)

where $W^{D}_{pri}\in\mathbb{R}^{T\times F}$, $W^{D}_{post}\in\mathbb{R}^{2T\times F}$, and $W^{V}\in\mathbb{R}^{T\times F}$ are trainable parameter matrices, and $F$ is the intermediate dimension for projecting the context onto the choices. The objective function is the Kullback–Leibler divergence (KLD) [20], which optimizes the distance between the prior and posterior distributions over $V$, where the ground-truth answer embedding $D_{post}$ serves as posterior knowledge for choice selection (Eq. 3).

\mathcal{L}_{KLD} = \mathrm{KLD}(p_{pri}, p_{post})   (3)
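The sketch below illustrates one possible reading of Eqs. 1-3 in PyTorch; pooling the token-level encodings into single vectors before the projections, and the direction of the KL term, are our assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AppreciativeChoiceSelection(nn.Module):
    """Sketch of Eqs. 1-3: score each encoded choice against the encoded
    query (prior) or the query plus the golden answer (posterior)."""

    def __init__(self, hidden_dim: int = 512, inter_dim: int = 64):
        super().__init__()
        self.w_v = nn.Linear(hidden_dim, inter_dim, bias=False)           # W^V
        self.w_d_pri = nn.Linear(hidden_dim, inter_dim, bias=False)       # W^D_pri
        self.w_d_post = nn.Linear(2 * hidden_dim, inter_dim, bias=False)  # W^D_post

    def forward(self, V, D_pri, D_post=None):
        # V: (N, T) encoded choices; D_pri: (K+L, T) encoded query;
        # D_post: (M, T) encoded golden answer, available only in training.
        choices = torch.tanh(self.w_v(V))                        # (N, F)
        d_pri = torch.tanh(self.w_d_pri(D_pri.mean(dim=0)))      # (F,)
        p_pri = F.softmax(choices @ d_pri, dim=0)                # (N,)
        if D_post is None:
            return p_pri, None, None
        d_cat = torch.cat([D_pri.mean(dim=0), D_post.mean(dim=0)], dim=-1)  # (2T,)
        p_post = F.softmax(choices @ torch.tanh(self.w_d_post(d_cat)), dim=0)
        # Eq. 3: KL divergence between the prior and posterior distributions.
        kld = F.kl_div(p_pri.log(), p_post, reduction="sum")
        return p_pri, p_post, kld

# Toy usage: 5 choices, a 120-token query, a 4-token answer, hidden size 512.
sel = AppreciativeChoiceSelection()
p_pri, p_post, kld = sel(torch.randn(5, 512), torch.randn(120, 512), torch.randn(4, 512))
```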

Context-Choice Fusion: The context-choice fusion leverages the appreciative choices as knowledge for answer generation by fusing the context and the appreciative choices into the decoder input. To exploit the obtained appreciation over choices when generating accurate answers, the candidate choices are weighted by the posterior distribution $p_{post}$ to initialize the input of the decoder, as shown in Eq. 4.

H^{Dec} = \tanh\big([D_{pri}; p_{post}^{T}V]\,W^{Dec}\big)   (4)

where $H^{Dec}\in\mathbb{R}^{(K+L+N)\times T}$ is the fused input to the decoder and $W^{Dec}\in\mathbb{R}^{T\times T}$. At inference, the prior distribution $p_{pri}$ is used in place of $p_{post}$ in Eq. 4. In this way, the appreciative choices are contextualized and incorporated into the CoFunDST model as knowledge.
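A corresponding sketch of Eq. 4 is given below; concatenating a single weighted choice summary along the sequence dimension (yielding K+L+1 rows rather than the K+L+N stated above) is our simplification of the fusion shape.

```python
import torch
import torch.nn as nn

class ContextChoiceFusion(nn.Module):
    """Sketch of Eq. 4: fuse the encoded query with choices weighted by the
    posterior (training) or prior (inference) distribution, and use the
    result to initialize the decoder input."""

    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        self.w_dec = nn.Linear(hidden_dim, hidden_dim, bias=False)  # W^Dec

    def forward(self, D_pri, V, p_choice):
        # D_pri: (K+L, T); V: (N, T); p_choice: (N,) distribution over choices.
        choice_summary = p_choice.unsqueeze(0) @ V          # (1, T)
        fused = torch.cat([D_pri, choice_summary], dim=0)   # (K+L+1, T)
        return torch.tanh(self.w_dec(fused))                # H^Dec

# Toy usage: 120 query tokens, 5 candidate choices, hidden size 512 (T5-Small).
fusion = ContextChoiceFusion()
h_dec = fusion(torch.randn(120, 512), torch.randn(5, 512),
               torch.softmax(torch.randn(5), dim=0))
```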

The overall objective in Eq. 5 is the sum of the KLD and the cross-entropy loss between the decoder output and the ground-truth answer $\tilde{a}$, where non-categorical and categorical slots are jointly trained.

\mathcal{L} = -\log P(a=\tilde{a} \mid D_{pri}, D_{post}, V) + \mathcal{L}_{KLD}   (5)
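In code, the joint objective of Eq. 5 reduces to a plain sum of the generation cross-entropy and the KLD term, for example as below (tensor shapes and the padding ignore index are assumptions):

```python
import torch.nn.functional as F

def joint_objective(lm_logits, answer_ids, p_pri, p_post):
    # Eq. 5: cross-entropy of the decoded answer plus the KLD term of Eq. 3.
    ce = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                         answer_ids.view(-1), ignore_index=-100)
    kld = F.kl_div(p_pri.log(), p_post, reduction="sum")
    return ce + kld
```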

3 Experiments

Datasets: The model is trained on 20% of a combination of six extractive QA datasets and two multiple-choice QA datasets, following the dataset pre-processing and slicing in [11]. To verify generalization across domains, we evaluate models on MultiWOZ 2.1 [17] and follow the dataset setup in [6]. There are 30 distinct domain-slots in MultiWOZ 2.1 in total: 12 categorical slots provided with collections of values and 18 non-categorical slots.

Baselines: We select the following models as zero-shot DST baselines. (1) TRADE [6] is an encoder-decoder model that leverages slot gates and a copy mechanism and shares parameters for predicting unseen slot-values. (2) SUMBT [19] uses pre-trained BERT [21] to learn the relations between slot types and values appearing in utterances and predicts dialogue states with slot-utterance matching. (3) MA-DST [18] encodes the dialogue context and domain-slots with attention mechanisms at multiple granularities to learn representations at different semantic levels. (4) TransferQA [11] proposes a task-transfer framework and takes the combination of slot, values, and dialogue context as the input for zero-shot DST.

TRADE [6], MA-DST [18], and SUMBT [19] are evaluated in the cross-domain setting, where the models are trained on four domains of MultiWOZ 2.1 and evaluated on the held-out domain. In contrast, TransferQA [11] and our model are trained on the combined QA dataset only, so no in-domain DST data is needed. Although TransferQA and our model are both based on domain-agnostic QA training, we extend this idea with knowledge fusion and directly use values in the domain ontology as candidate choices to align DST predictions with the training procedure.

Implementation: We implement our model based on T5-Small [13], a pre-trained encoder-decoder model for natural language generation. The input text of the T5 encoder is truncated to 512 tokens. The intermediate dimension for the prior and posterior probabilities is 64. The fused input to the T5 decoder is passed through a 512-unit feed-forward layer for initialization. Adafactor [22] is used as the optimizer with an initial learning rate of 3e-4 and 100 warm-up steps. The number of training epochs is 6. For our implementation of TransferQA [11], all hyper-parameters are kept the same as the original settings except for the model size.
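A minimal training-setup sketch consistent with these hyper-parameters is shown below, using the Hugging Face transformers library; the fixed-learning-rate Adafactor flags and the constant warm-up schedule are our assumptions about how the reported settings map onto that API, and the toy batch is illustrative only.

```python
from transformers import (AutoTokenizer, T5ForConditionalGeneration,
                          Adafactor, get_constant_schedule_with_warmup)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Adafactor with a fixed initial learning rate of 3e-4 and 100 warm-up steps.
optimizer = Adafactor(model.parameters(), lr=3e-4,
                      scale_parameter=False, relative_step=False)
scheduler = get_constant_schedule_with_warmup(optimizer, num_warmup_steps=100)

# One toy step: encoder inputs truncated to 512 tokens, gold answer as labels.
batch = tokenizer("question: What is the destination of the train that the user "
                  "is interested in? context: I need a train to Cambridge. answer:",
                  truncation=True, max_length=512, return_tensors="pt")
labels = tokenizer("Cambridge", return_tensors="pt").input_ids

loss = model(**batch, labels=labels).loss  # generation cross-entropy only
loss.backward()
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```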

4 Results

Table 1 presents the zero-shot joint goal accuracy (JGA), i.e., the average accuracy of predicting all slot-values of a turn correctly, on MultiWOZ 2.1. Our model exceeds all baselines on JGA in the "Restaurant", "Taxi", and "Train" domains when the KLD loss and context-choice fusion are adopted. Unlike TransferQA, our model does not concatenate the candidate choice tokens with the input question and context, which tend to be truncated due to the lengthy inputs. Furthermore, CoFunDST maintains context information by fusing the contextualized weights of candidate choices for inference. These results demonstrate that our proposed method effectively generalizes from QA to new domains without annotated DST data. The JGA is somewhat lower in the other two domains, where categorical slots mostly have collections of choices too extensive to be used efficiently for context-choice fusion. For example, the average number of choices for categorical slots in the domain ontology of "Restaurant" is 5, while that of "Attraction" is 11. This indicates that more choices associated with a slot lead to less efficient incorporation of choice knowledge into the model. Compared to the first three rows of Table 1, which are based on leave-one-out training, our model outperforms all of them, indicating that in-domain training is not necessary.
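For reference, JGA can be computed as in the short sketch below, where each turn's state is represented as a {slot: value} dictionary (our assumed bookkeeping):

```python
def joint_goal_accuracy(predicted_states, gold_states):
    # A turn counts as correct only if its full predicted state matches the gold state.
    correct = sum(1 for p, g in zip(predicted_states, gold_states) if p == g)
    return correct / len(gold_states) if gold_states else 0.0

# Toy example: the second turn misses one value, so JGA = 0.5.
print(joint_goal_accuracy(
    [{"train-destination": "cambridge"}, {"taxi-leave at": "17:00"}],
    [{"train-destination": "cambridge"}, {"taxi-leave at": "17:15"}],
))
```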

Domain     | #non-cat | #cat | Non-Categorical | Categorical | All
Hotel      | 4911     | 7609 | 66.91           | 55.83       | 68.05
Train      | 8832     | 2514 | 61.62           | 39.24       | 68.12
Restaurant | 5800     | 4967 | 60.14           | 45.69       | 62.27
Attraction | 1249     | 3253 | 51.61           | 35.38       | 52.94
Taxi       | 1753     | 0    | 75.55           | -           | 75.55
Table 3: Slot goal accuracy of non-categorical and categorical slots on MultiWOZ 2.1. #non-cat and #cat are the total numbers of non-categorical and categorical slot instances in the dev set.

5 Discussion

We evaluate the two components of the choice-fusion mechanism, i.e., the appreciative choice selection with KLD loss (KLD) and the context-choice fusion (Fuse). Table 2 summarizes the results, where slot goal accuracy (SGA) is the average accuracy of predicting the value of a slot correctly and F1 is the harmonic mean of the precision and recall of slot-value predictions. The model adopting both modules consistently outperforms the models with only one module. SGA and F1 drop when the context-fused result is not used for the decoder, indicating that context fusion is important for constraining the generation of slot-values at the slot level. Note that, compared to applying the KLD loss only, adding the fusion slightly reduces JGA in the "Train" and "Taxi" domains, suggesting that the turn-level generation of dialogue states can be affected by re-initializing the decoder, while slot-level accuracy and F1 still improve. Moreover, all metrics drop significantly without the KLD loss, which indicates that better alignment of the context-dependent prior and posterior distributions over choices benefits accurate value generation. Empirically, leveraging ground-truth annotations is better than only fusing choices conditioned on the prior context during training, since the appreciative choice selection is superior to the context-choice fusion on most metrics.
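For clarity, SGA and F1 as reported in Table 2 can be computed roughly as follows; the per-turn {slot: value} representation and the pair-level F1 counting are our assumptions about the evaluation bookkeeping.

```python
def slot_goal_accuracy(predicted_states, gold_states):
    # Fraction of gold slot-values whose value is predicted correctly.
    total = sum(len(g) for g in gold_states)
    correct = sum(1 for p, g in zip(predicted_states, gold_states)
                  for slot, value in g.items() if p.get(slot) == value)
    return correct / total if total else 0.0

def slot_value_f1(predicted_states, gold_states):
    # Harmonic mean of precision and recall over predicted (turn, slot, value) triples.
    pred = {(t, s, v) for t, p in enumerate(predicted_states) for s, v in p.items()}
    gold = {(t, s, v) for t, g in enumerate(gold_states) for s, v in g.items()}
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```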

The SGA by domain and slot type in Table 3 shows that our model performs better in the three domains that have more non-categorical slots than categorical ones. Non-categorical slots tend to have a larger vocabulary of values and thus benefit from joint training with knowledge fusion. The higher SGA for non-categorical slots is probably owing to the larger amount of extractive data used in QA training, which biases the model towards generating extractive values. As our focus in this paper is on choice selection and fusion, we leave the study of related datasets as future work.

To explain the performance in the three domains where we outperform the baselines, we randomly sample 50 dialogues from the predictions for analysis. The average number of candidate choices of categorical slots per dialogue in the "Restaurant", "Train", and "Taxi" domains is mostly far lower than in the "Attraction" and "Hotel" domains, except for the "Train" and "Hotel" domains, where it is 7 and 5.7, respectively. However, the "Hotel" domain contains 6 different categorical slots while there is only one categorical slot in the "Train" domain. Further error analysis of the degraded "Attraction" and "Hotel" domains shows that 87.09% of slot errors in "Attraction" come from failing to predict an active (non-none) slot, and 72.22% of these are from categorical slots; in "Hotel", 72.06% of slot errors are such missing errors and 70.8% of them are from categorical slots. This suggests that the degradation in the two domains is mostly due to not predicting all related categorical slots as active, which is more likely to happen when the number of categorical slots is large, as in the "Attraction" and "Hotel" domains.

6 Conclusion

We introduce CoFunDST, a model that incorporates candidate choices and transfers knowledge from QA for zero-shot dialogue-state generation. The appreciative choice-selection module selects candidate choices that are highly relevant to the reference context, and the context-choice fusion module uses the context and the appreciative choices to initialize the decoder. Our model achieves higher zero-shot joint goal accuracy than existing approaches on multiple domains of MultiWOZ 2.1, and the choice-fusion mechanism is shown to be effective for improving domain-agnostic training and generalization to new domains. Incorporating other domain-agnostic knowledge for zero-shot dialogue state tracking is a direction we wish to explore in the future.

References

  • [1] Roberto Pieraccini, David Suendermann, Krishna Dayanidhi, and Jackson Liscombe, “Are we there yet? research in commercial spoken dialog systems,” in International Conference on Text, Speech and Dialogue. Springer, 2009, pp. 3–13.
  • [2] Tsung-Hsien Wen, David Vandyke, Nikola Mrksic, Milica Gasic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young, “A network-based end-to-end trainable task-oriented dialogue system,” arXiv preprint arXiv:1604.04562, 2016.
  • [3] Antoine Bordes, Y-Lan Boureau, and Jason Weston, “Learning end-to-end goal-oriented dialog,” arXiv preprint arXiv:1605.07683, 2016.
  • [4] Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan, “Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, pp. 8689–8696, Apr. 2020.
  • [5] Giovanni Campagna, Agata Foryciarz, Mehrad Moradshahi, and Monica S Lam, “Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking,” arXiv preprint arXiv:2005.00891, 2020.
  • [6] Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung, “Transferable multi-domain state generator for task-oriented dialogue systems,” arXiv preprint arXiv:1905.08743, 2019.
  • [7] Ruolin Su, Ting-Wei Wu, and Biing-Hwang Juang, “Act-aware slot-value predicting in multi-domain dialogue state tracking,” arXiv preprint arXiv:2208.02462, 2022.
  • [8] Yu-Ping Ruan, Zhen-Hua Ling, Jia-Chen Gu, and Quan Liu, “Fine-tuning bert for schema-guided zero-shot dialogue state tracking,” arXiv preprint arXiv:2002.00181, 2020.
  • [9] Chia-Hsuan Lee, Hao Cheng, and Mari Ostendorf, “Dialogue state tracking with a language model using schema-driven prompting,” arXiv preprint arXiv:2109.07506, 2021.
  • [10] Pia Knoeferle, “Predicting (variability of) context effects in language comprehension,” Journal of Cultural Cognitive Science, vol. 3, no. 2, pp. 141–158, 2019.
  • [11] Zhaojiang Lin, Bing Liu, Andrea Madotto, Seungwhan Moon, Zhenpeng Zhou, Paul A Crook, Zhiguang Wang, Zhou Yu, Eunjoon Cho, Rajen Subba, et al., “Zero-shot dialogue state tracking via cross-task transfer,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 7890–7900.
  • [12] Shuyang Li, Jin Cao, Mukund Sridhar, Henghui Zhu, Shang-Wen Li, Wael Hamza, and Julian McAuley, “Zero-shot generalization in dialog state tracking through generative question answering,” arXiv preprint arXiv:2101.08333, 2021.
  • [13] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
  • [14] Lynette Hirschman, Marc Light, Eric Breck, and John D Burger, “Deep read: A reading comprehension system,” in Proceedings of the 37th annual meeting of the Association for Computational Linguistics, 1999, pp. 325–332.
  • [15] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang, “SQuAD: 100,000+ questions for machine comprehension of text,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, Nov. 2016, pp. 2383–2392, Association for Computational Linguistics.
  • [16] Sixing Wu, Ying Li, Dawei Zhang, Yang Zhou, and Zhonghai Wu, “Diverse and informative dialogue generation with context-specific commonsense knowledge awareness,” in Proceedings of the 58th annual meeting of the association for computational linguistics, 2020, pp. 5811–5820.
  • [17] Mihail Eric, Rahul Goel, Shachi Paul, Abhishek Sethi, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur, “MultiWOZ 2.1: Multi-domain dialogue state corrections and state tracking baselines,” arXiv preprint arXiv:1907.01669, 2019.
  • [18] Adarsh Kumar, Peter Ku, Anuj Goyal, Angeliki Metallinou, and Dilek Hakkani-Tur, “MA-DST: Multi-attention-based scalable dialog state tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, vol. 34, pp. 8107–8114.
  • [19] Hwaran Lee, Jinsik Lee, and Tae-Yoon Kim, “SUMBT: Slot-utterance matching for universal and scalable belief tracking,” in Proceedings of the 57th Conference of the Association for Computational Linguistics, 2019, pp. 5478–5483.
  • [20] Solomon Kullback and Richard A Leibler, “On information and sufficiency,” The annals of mathematical statistics, vol. 22, no. 1, pp. 79–86, 1951.
  • [21] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [22] Noam Shazeer and Mitchell Stern, “Adafactor: Adaptive learning rates with sublinear memory cost,” in International Conference on Machine Learning. PMLR, 2018, pp. 4596–4604.