SF-DST: Few-Shot Self-Feeding Reading Comprehension Dialogue State Tracking with Auxiliary Task

Abstract

Few-shot dialogue state tracking (DST) model tracks user requests in dialogue with reliable accuracy even with a small amount of data. In this paper, we introduce an ontology-free few-shot DST with self-feeding belief state input. The self-feeding belief state input increases the accuracy in multi-turn dialogue by summarizing previous dialogue. Also, we newly developed a slot-gate auxiliary task. This new auxiliary task helps classify whether a slot is mentioned in the dialogue. Our model achieved the best score in a few-shot setting for four domains on multiWOZ 2.0.

Index Terms: multi-domain dialogue systems, dialogue state tracking, belief tracking, reading comprehension, self-feeding

1 Introduction

Task-Oriented Dialogue (TOD) system conducts a conversation with a specific purpose and is increasingly necessary due to the emergence of artificial-intelligence speakers and virtual personal assistants. In general, a TOD system is composed of three main sections: a dialogue state tracking (DST) module to track the user’s purpose, a dialogue policy module (POL) to choose system actions like API calling or ending conversation, and a natural-language generation (NLG) module to produce a response to the user [1]. The DST system is a key component of these three parts since it generates a belief state that contains information about the user’s purpose. The belief state is often represented as slot value pairs. For example, in Figure 1, the belief state has hotel information, which is needed to achieve user’s purpose for hotel reservation.

Although DST is essential in a TOD system, labeling the DST dataset is costly. Some authors have tried to train DST using only limited data (few/zero-shot DST) to solve this problem. One promising way is to adopt reading comprehension (RC) in DST [2, 3]. The RC task aims to answer the question by understanding the passage. RC and DST have a similar goal: DST aims to find the value (answer) of slot (question) by understanding the dialogue (passage). In this approach, researchers design questions (e.g., Where is hotel area that the user wants?) for slots (e.g., hotel.area) in advance, and at each turn of dialogue, the model reads the dialogue and answers the questions. These predicted answers become belief state. The first research [2] that adopted RC in a DST divided slots into multiple-choice and span-prediction types and showed knowledge transfer ability of natural language questions. However, their model requires an ontology data that contains pre-defined values for each slots. This has low scalability to new domain and values. To overcome the limitation, [3] proposed an ontology-free model by answering questions generatively and could flexibly predict the unseen domain and values. However, both [2] and [3] require additional external data for pre-training their models. Also, their models have difficulty in classifying whether a slot is mentioned in the dialogue and this problem was the main reason for the accuracy drop.

The use of an auxiliary task is another approach of a few-shot DST. The named-entity recognition task was combined with the DST task to reduce the number of network parameters and increase generalization ability across the domain [4, 5]. The language modeling task was also used with the main DST task, and the combination increased the accuracy in a long context dialogue [6].

Refer to caption — Figure 1: Example of multi-domain TOD dataset and its belief state. The dialogue state tracker (DST) keeps track of user’s requests and predicts value of slot.

In this study, we introduce a few-shot reading comprehension DST with a Self-Feeding approach (SF-DST). We used a text-to-text structure for generative, ontology-free DST and designed a self-feeding belief state input to summarize the previous turn. Applying self-feeding belief state is the first attempt in the RC format DST. Furthermore, we developed a slot-gate auxiliary task which helps to classify whether a slot is mentioned in the dialogue.

Our model achieved the new best accuracy in a few-shot experiment for four domains on MultiWOZ 2.0 [7], and achieved close to the current state-of-the-art in a supervised setting on MultiWOZ 2.1 [8]. In analysis, we investigated the effect of self-feeding belief state input and auxiliary task in various few-shot settings. To summarize our approach and contributions:

•

We propose Self-Feeding reading comprehension DST (SF-DST), an ontology-free few-shot model to track belief state. SF-DST has a text-to-text structure and includes self-feeding belief state input. We showed that the belief state helps understand multi-turn dialogue by summarizing previous turns.
•

We introduce a novel auxiliary task inspired by slot-gate in extractive DST. This new auxiliary task helps distinguish whether a slot is mentioned in the dialogue. We carefully analyzed the effect of the auxiliary task.
•

In a few-shot setting, where 1% to 10% data is available, SF-DST achieved higher accuracy than the previous methods for four domains on MultiWOZ 2.0. In a supervised setting, SF-DST was close to the current state-of-the-art on MulitiWOZ 2.1.

2 Methods

2.1 Problem statement

Conversation $C$ for time step $t$ is denoted as $C_{t}=\{x_{1},y_{1},...,y_{t-1},x_{t}\}$ , where $x_{t}$ means user utterance and $y_{t}$ means system utterance. The belief state $B_{t}$ at turn ${t}$ is composed of slot $s\in S$ and value $v\in V$ . $V$ includes don’t care and not mentioned values and notation $v_{s}$ means value for slot $s$ . Question set ${Q}$ consists of ${q_{s}}$ which is predefined before training (e.g., $s$ : attraction.name, ${q_{s}}$ : What is the attraction name?). Auxiliary set ${A}$ consists of ${a_{s}}$ , and ’Are they talking about [slot]?’ is the inquiry form. ${a_{s}}$ is also predefined before training (e.g., $s$ : attraction.name, ${a_{s}}$ : Are they talking about attraction name?).

2.2 SF-DST

SF-DST has text-to-text structure for generative, ontology-free DST (Figure 2). The input value consists of dialogue history $C_{t}$ , corresponding question for slot $q_{s}$ , and previous belief state $B_{t-1}$ . The model answers the question for all slots at each dialogue turn $t$ , and the predicted answers $v_{s}$ become $B_{t}$ . We did not use a gold belief state as input in both training and inference time; instead, we designed a self-feeding belief state method in which the predicted belief state from the previous turn becomes the current turn’s input belief state. We separate user and system utterances by using [user] and [sys], and add index words Context, Question and Belief to distinguish each input part:

v_{s}=seq2seq(C_{t},q_{s},B_{t-1}).

(1)

We use negative log-likelihood as a loss function given $C_{t}$ , $q_{s}$ and $B_{t-1}$ as

L_{belief}=-\sum_{i=1}^{n}\log p(v_{i}|C_{t},q_{i},B_{t-1}),

(2)

where $n$ denotes the total number of slots.

2.3 Auxiliary task

Some extractive DSTs use a two-step system. These systems first classify whether the slot is mentioned in dialogue, and if classified as mentioned, then finds the answer span from dialogues [9, 10, 11, 12]. This classification module is called slot-gate. By splitting the DST task into two models, this strategy can lower the strain on each model. However, this cascading approach risks error propagation and requires a relatively long inference time. Instead of a cascading strategy, we directly answered a question and trained a slot-gate task as an auxiliary task. The auxiliary task is trained as a question-answering form (Figure 2) and ’Are they talking about [slot]?’ is the inquiry format. The answer is Yes if the slot value is in belief state $B_{t}$ , and not mentioned otherwise; i.e., the main DST question aims to generate a specific value, whereas the auxiliary question aims to classify the slot’s mention in the dialogue. The auxiliary task uses the context $C_{t}$ , previous belief state $B_{t-1}$ and auxiliary question $a_{s}$ as input which has the same form as (1) except the slot question, and uses loss function (2). To train the auxiliary task with the main DST task, our model uses a joint loss function with hyperparameter $a$

L=L_{belief}+aL_{aux}.

(3)

We set $a$ empirically to 0.7.

Table 1: Joint goal accuracy on MultiWOZ 2.1 in a supervised setting. Models focused on a few/zero shot are marked with

\dagger

Model	JGA [%]	Ontology	Type
TRADE $\dagger$	46.00		G
STARC $\dagger$	49.48	need	C+S
DSTQA $\dagger$	51.17	need	C+S
DS-DST	51.21		C+S
GPT2QA $\dagger$	52.58		G
SST-2	55.23	need	C
TripPy	55.29		S
FPDSC_turn	57.88	need	C
SF-DST (ours)	56.96		G

3 Experiment

Table 2: Domain joint goal accuracy [%] on MultiWOZ 2.0. We use 2.0 version to compare with other models. Models that require ontology are marked with

\dagger

Model	Hotel			Restaurant			Attraction			Train			Taxi
Model	1%	5%	10%	1%	5%	10%	1%	5%	10%	1%	5%	10%	1%	5%	10%
TRADE	19.73	37.45	41.42	42.42	55.70	60.94	35.88	57.55	63.12	59.83	69.27	71.11	63.81	66.58	70.19
DSTQA $\dagger$	N/A	50.18	53.68	N/A	58.95	64.51	N/A	70.47	71.60	N/A	70.35	74.50	N/A	70.90	74.19
STARC $\dagger$	45.91	52.59	57.37	51.65	60.49	64.66	40.39	65.34	66.27	65.67	74.11	75.08	72.58	75.35	79.61
SF-DST	54.15	58.61	59.71	57.28	68.92	70.14	61.24	76.69	79.35	71.60	76.05	78.25	65.74	67.48	72.06

We performed experiments on MultiWOZ 2.0 and MultiWOZ 2.1 dataset, which are multi-domain TOD datasets collected using a ’wizard of oz’ setting. MultiWOZ 2.1 is a clean and accurate version of MultiWOZ 2.0. Both have seven domains (Hotel, Restaurant, Attraction, Train, Taxi, Hospital, and Police) and contain about 8,000 dialogues. We excluded the Hospital and Police domains during training, because they are only included in training data. We use joint goal accuracy (JGA) to evaluate our model’s accuracy: If all slot and value pairs in turn are correct, the turn is counted as correct, and joint goal accuracy is the average value of all turns.

In Section 3.1, we evaluate our model in a supervised setting. In this setting, we use all training dataset and compare with other DSTs. In section 3.2, we evaluate our model in a few-shot setting, where our primary focus lies. We report the parameter size of the baselines and detailed implementation in the Appendix.

3.1 DST with supervised setting

We evaluated our model in a commonly-used supervised setting to compare with other DST models. In comparison, we included few/zero-shot DST baselines—TRADE [13], STARC [2], DSTQA [14], GPT2QA¹¹1The paper did not named the model. We assign temporary name to simplify comparison. [3]—and other general DST models, including DS-DST [15], SST-2 [16], TripPy [11], and FPDSC [17]. In addition to JGA, we also report ontology usage and answer prediction type. We divide answer predict type into classification (C), span prediction (S), and generative (G) following [3]. SF-DST achieved the highest accuracy by a wide margin compared to other few/zero-shot DST (Table 1). Our accuracy was lower than the best score of the classification-type method. However, these models need fixed ontology and find answers by classifying the value in ontology [2, 14, 15, 16, 17]. Fixed ontology is hard to obtain in the real world and cannot adapt to frequently changing values. In contrast, our model is ontology-free and generates the answer, which is competitive in the real world.

3.2 DST with few-shot data setting

3.2.1 Domain across knowledge transfer in the few-shot setting

To investigate the knowledge transfer ability across the domain, we pre-trained our model on four domains and fine-tuned it with the target domain. We used domain JGA, which measures the JGA focused on the targeted domain [18]. SF-DST exceeded the previous best score in four of the five domains compared to other few-shot DST results, including TRADE [13], DSTQA [14], and STARC [2] (Table 2). This result demonstrates that our model can adapt to a new domain using only limited data by transferring knowledge from other domains. However, SF-DST showed lower accuracy than ontology-based models (marked as $\dagger$ in table) in the taxi domain. The taxi domain is generally mentioned at the end of the dialogue [14], so even with the same size of data, the chance of training taxi domain is much lower than other domains. Under this condition, the ontology classification method has an advantage over the generative method.

3.2.2 Comparison with few-shot TOD systems

Table 3: Joint goal accuracy [%] on MultiWOZ 2.1 and use of external data. The result of SimpleTOD, MinTL, SOLOIST and PPTOD are referenced from [19].

Model	External Data	Training Size (%)
Model	External Data	1	5	10
SimpleTOD	No	7.91	16.14	22.37
MinTL	No	9.25	21.28	30.32
SOLOIST	Yes	13.21	26.53	32.42
PPTOD_small	Yes	27.85	39.07	42.36
SF-DST	No	28.35	39.39	44.60

End-to-end TOD systems generally has DST, Policy, and NLG modules, and in the real world, our model could be used as a DST module of these systems. Therefore, we compared SF-DST with other DSTs in the TOD systems. We varied the training data rate as 1%, 5%, or 10% and compared with SimpleTOD [20], MinTL [21], SOLOIST [22] and PPTOD [19]. SF-DST yielded the best accuracy in all few-shot settings (Table 3). Our model does not rely on external data, so it can be simply implemented in existing TOD systems. From this result, we anticipate that our model can improve the TOD system as a plug-and-play DST module.

4 Analysis

4.1 Ablation study

Table 4: Ablation study of SF-DST, reporting few-shot (10%) JGA on MultiWOZ2.1 data.

Ablation	JGA [%]
SF-DST (this work)	44.60
– Self-feeding belief state	42.31
– Auxiliary task	42.69
– Self-feeding belief state + Gold belief state	38.55

We perform an ablation study to investigate which component contributes to accuracy in a few-shot environment (10%). We observe that both self-feeding belief state and auxiliary task are essential to increase the accuracy (Table 4). Additionally, we trained the model with the gold belief state (+ Gold belief state in table) instead of self-feeding belief state. The JGA showed a significant decrease (38.55%) compared to training with a self-feeding belief state (44.60%). When the gold belief state is given during the training, the model depends on the belief state rather than the conversation. This causes the performance degrades in the inference stage, where the gold belief state cannot be given.

4.2 Analysis of self-feeding belief state input

Table 5: Analysis of self-feeding belief state input. We separate the dialogue by the previous turn length and evaluate turn JGA.

Model	Previous dialogue turns
Model	1 to 3	4 to 6	7+
SF-DST	54.68	29.65	14.05
– belief state (ablation)	52.16	27.42	15.05

Although accurate DST requires an understanding of the entire dialogue, it is challenging when the dialogue has many turns. In this experiment, we analyzed the effect of the belief state according to the conversation length in a multi-turn circumstance. We separated the dialogue into three classes by the number of previous turns: short-length dialogue (one to three turns), medium-length dialogue (four to six turns), and long-length dialogue (seven or more turns). We trained the model with and without belief state in a few-shot setting (10%) and used the average of turn JGA. Our self-feeding belief state improved both short-length dialogue and medium-length dialogue (Table 5). This means that the belief state, which summarizes previous conversation information, helped the model to understand the multi-turn conversation. However, the JGA decreased when the dialogue was extended (more than seven turns). As the dialogue progressed, the probability of error propagating from the previous belief state increased, so the accuracy dropped. Therefore, finding a self-feeding method that reduces error propagation is a worthy future goal.

4.3 Error analysis and effect of auxiliary task

Table 6: Changed error and JGA rate by adopting auxiliary task on MultiWOZ 2.1. Upper triangle 3.48 means an error rate increases 3.48% point.

Error Type	Training Size(%)
Error Type	1	5	10
Wrong	$\triangle$ 3.48	$\triangle$ 16.47	$\triangle$ 8.97
Ignore	$\bigtriangledown$ 10.6	$\bigtriangledown$ 22.21	$\bigtriangledown$ 20.46
Spurious	$\bigtriangledown$ 10.29	$\bigtriangledown$ 1.17	$\bigtriangledown$ 4.22
JGA	$\triangle$ 13.96	$\triangle$ 12.04	$\triangle$ 4.48

To examine the effect of the slot-gate auxiliary task, we classified the errors as Wrong, Ignore, and Spurious [2, 3]. Wrong means that the model correctly predicts the existence of the answer but predicts the wrong value. Ignore means that the answer exists, but the model ignores it. Spurious means that the answer is not mentioned, but the model predicts some value. We trained our model with and without auxiliary tasks at various training sizes and calculated the changed error and JGA rate by adapting the auxiliary task (Table 6). Overall, JGA was improved in all data settings. This result indicates that the auxiliary task helped to find accurate answers. In the case of error type, ignore, and spurious errors decreased; this result means that the auxiliary task was helpful to classify whether a slot is mentioned in the dialogue. However, the wrong type error grows in all settings. Adopting the auxiliary task increases the number of attempts to find the answer in dialogue when the answer exists, and this causes the growth of the wrong type error. Future work should find an auxiliary task that decreases all types of errors.

4.4 Implicit answers and auxiliary task

Table 7: Changed JGA by adopting auxiliary task on MultiWOZ 2.1. Upper triangle 5.74 means a JGA increases 5.74% point.

Slot Type	Training Size(%)
Slot Type	1	5	10
Explicit	$\triangle$ 5.74	$\triangle$ 1.13	$\triangle$ 2.26
Implicit	$\triangle$ 5.74	$\triangle$ 1.70	$\triangle$ 2.57

Finding the proper answer becomes increasingly challenging when the dialogue does not include an exact match. For example, assume the user said, ”I want to find a place to see a movie.” In that situation, even if the attraction type is not explicitly given, the model should infer the attraction type as theater. This implicit answer circumstance is common in the real world, and we experimented to determine whether our auxiliary question assists in such conditions. We chose ten slots²²2train.day, restaurant.area, hotel.star, attraction.area, hotel.stay, hotel.area, restaurant.day, hotel.people, hotel.day, restaurant.pricerange with a high probability of exact matching answers (explicit slot) and ten slots³³3hotel.type, hotel.internet, hotel.parking, taxi.leaveat, attraction.name, taxi.departure, attraction.type, train.leaveat, taxi.destination, hotel.name with a low probability of exact matching answers (implicit slot) [2]. In the case of explicit slots, 99.12 % of the answers were exactly found in dialogue, compared to 70.34% in implicit slots. We experimented in a few-shot setting and used the JGA of targeted slots. Our auxiliary task improved JGA in all training data set and slot types (Table 7). Asking whether the slot is mentioned helps find an answer even if there is no exact match exists.

5 Conclusion

This paper proposed a generative few-shot DST that has a reading comprehension approach. Our text-to-text model is ontology-free and does not use external data. As an input, we devised a self-feeding belief state and showed that summarized information of belief state is helpful for multiple turn dialogue. Also, we developed a slot-gate auxiliary task. This task reduces the ignore and spurious type errors. As a result, in a few-shot experiment, SF-DST was more accurate than the previous methods for four domains on MultiWOZ 2.0 and was close to the state-of-the-art in a supervised experiment on MultiWOZ 2.1.

6 Acknowledgements

This work was supported by SAMSUNG Research, Samsung Electronics Co.,Ltd., and also supported by Institute of Information & communications Technology Planning & Evaluation(IITP) grant funded by the Korea government(MSIT) (No.2021-0-00575, Development of Voicepishing Prevention Technology Based on Speech and Text Deep Learning)

References

[1] S. Young, M. Gašić, B. Thomson, and J. D. Williams, “Pomdp-based statistical spoken dialog systems: A review,” Proceedings of the IEEE, vol. 101, no. 5, pp. 1160–1179, 2013.
[2] S. Gao, S. Agarwal, T. Chung, D. Jin, and D. Hakkani-Tur, “From machine reading comprehension to dialogue state tracking: Bridging the gap,” arXiv preprint arXiv:2004.05827, 2020.
[3] S. Li, J. Cao, M. Sridhar, H. Zhu, S.-W. Li, W. Hamza, and J. McAuley, “Zero-shot generalization in dialog state tracking through generative question answering,” 2021.
[4] A. Rastogi, R. Gupta, and D. Hakkani-Tur, “Multi-task learning for joint language understanding and dialogue state tracking,” arXiv preprint arXiv:1811.05408, 2018.
[5] Y. Lee, “Improving end-to-end task-oriented dialog system with a simple auxiliary task,” in Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 1296–1303.
[6] J. Quan and D. Xiong, “Modeling long context for task-oriented dialogue state generation,” arXiv preprint arXiv:2004.14080, 2020.
[7] P. Budzianowski, T.-H. Wen, B.-H. Tseng, I. Casanueva, U. Stefan, R. Osman, and M. Gašić, “Multiwoz - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
[8] M. Eric, R. Goel, S. Paul, A. Sethi, S. Agarwal, S. Gao, and D. Hakkani-Tur, “Multiwoz 2.1: Multi-domain dialogue state corrections and state tracking baselines,” arXiv preprint arXiv:1907.01669, 2019.
[9] C.-S. Wu, A. Madotto, E. Hosseini-Asl, C. Xiong, R. Socher, and P. Fung, “Transferable multi-domain state generator for task-oriented dialogue systems,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, 2019.
[10] G.-L. Chao and I. Lane, “Bert-dst: Scalable end-to-end dialogue state tracking with bidirectional encoder representations from transformer,” arXiv preprint arXiv:1907.03040, 2019.
[11] M. Heck, C. van Niekerk, N. Lubis, C. Geishauser, H.-C. Lin, M. Moresi, and M. Gašić, “Trippy: A triple copy strategy for value independent neural dialog state tracking,” arXiv preprint arXiv:2005.02877, 2020.
[12] A. Kumar, P. Ku, A. Goyal, A. Metallinou, and D. Hakkani-Tur, “Ma-dst: Multi-attention-based scalable dialog state tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8107–8114.
[13] C.-S. Wu, A. Madotto, E. Hosseini-Asl, C. Xiong, R. Socher, and P. Fung, “Transferable multi-domain state generator for task-oriented dialogue systems,” arXiv preprint arXiv:1905.08743, 2019.
[14] L. Zhou and K. Small, “Multi-domain dialogue state tracking as dynamic knowledge graph enhanced question answering,” 2020.
[15] J.-G. Zhang, K. Hashimoto, C.-S. Wu, Y. Wan, P. S. Yu, R. Socher, and C. Xiong, “Find or classify? dual strategy for slot-value predictions on multi-domain dialog state tracking,” arXiv preprint arXiv:1910.03544, 2019.
[16] L. Chen, B. Lv, C. Wang, S. Zhu, B. Tan, and K. Yu, “Schema-guided multi-domain dialogue state tracking with graph attention neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 7521–7528.
[17] J. Zhou, H. Wu, Z. Lin, G. Li, and Y. Zhang, “Dialogue state tracking with multi-level fusion of predicted dialogue states and conversations,” 2021.
[18] G. Campagna, A. Foryciarz, M. Moradshahi, and M. S. Lam, “Zero-shot transfer learning with synthesized data for multi-domain dialogue state tracking,” 2020.
[19] Y. Su, L. Shu, E. Mansimov, A. Gupta, D. Cai, Y.-A. Lai, and Y. Zhang, “Multi-task pre-training for plug-and-play task-oriented dialogue system,” 2021.
[20] E. Hosseini-Asl, B. McCann, C.-S. Wu, S. Yavuz, and R. Socher, “A simple language model for task-oriented dialogue,” 2020.
[21] Z. Lin, A. Madotto, G. I. Winata, and P. Fung, “Mintl: Minimalist transfer learning for task-oriented dialogue systems,” arXiv preprint arXiv:2009.12005, 2020.
[22] B. Peng, C. Li, J. Li, S. Shayandeh, L. Liden, and J. Gao, “Soloist: Building task bots at scale with transfer learning and machine teaching,” 2021.
[23] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” arXiv preprint arXiv:1910.10683, 2019.
[24] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[25] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush, “Huggingface’s transformers: State-of-the-art natural language processing,” 2020.

Appendix A Appendix

A.1 Parameter size

Table 8: Parameter size of baseline models and SF-DST.

Models	#Parameter
MinTL [21]	400M
GPT2QA [3]	355M
SST-2 [16]	110M
TripPy [11]	110M
SimpleTOD [20]	110M
STARC [2]	110M
SOLOIST[22]	110M
PPTOD_small [19]	60M
SF-DST(ours)	60M

A.2 Detailed implementation

We implement SF-DST using T5-small [23], which has six encoder/decoder layers, and the hidden model has size 512. All models are trained using an NVIDIA A5000 GPU for five epochs with early stopping. For optimization, we use AdaFactor [24]. The batch size is 16 in the few-shot setting and 32 in the supervised setting. We implement T5-small based on Huggingface Library [25].