AnyTOD: A Programmable Task-Oriented Dialog System
Abstract
We propose AnyTOD, an end-to-end, zero-shot task-oriented dialog (TOD) system capable of handling unseen tasks without task-specific training. We view TOD as a program executed by a language model (LM), where program logic and ontology are provided by a designer as a schema. To enable generalization to unseen schemas and programs without prior training, AnyTOD adopts a neuro-symbolic approach: a neural LM keeps track of events occurring during a conversation, and a symbolic program implementing the dialog policy is executed to recommend next actions AnyTOD should take. This approach drastically reduces data annotation and model training requirements, addressing the enduring challenge of rapidly adapting a TOD system to unseen tasks and domains. We demonstrate state-of-the-art results on the STAR (Mehri and Eskenazi, 2021), ABCD (Chen et al., 2021) and SGD (Rastogi et al., 2020) benchmarks. We also demonstrate strong zero-shot transfer ability in low-resource settings, such as zero-shot transfer onto MultiWOZ (Budzianowski et al., 2018a). In addition, we release STARv2, an updated version of the STAR dataset with richer annotations, for benchmarking zero-shot end-to-end TOD models. (The STARv2 dataset will be released soon.)
Jeffrey Zhao, Yuan Cao, Raghav Gupta, Harrison Lee, Abhinav Rastogi, Mingqiu Wang, Hagen Soltau, Izhak Shafran, Yonghui Wu Google Research {jeffreyzhao, yuancao}@google.com
1 Introduction

An enduring challenge in building and maintaining task-oriented dialog (TOD) systems is efficiently adapting to a new task or domain. For instance, if we were to add the ability to book flight tickets to an existing system that can only handle booking train tickets, this requires new conversations about flight booking to be manually collected and labelled, as well as retraining of natural language understanding (NLU) and policy models. These data efficiency and scaling problems compound for multi-task TOD systems, as each task may have its own bespoke ontology and policy.
To tackle this problem, we propose AnyTOD, an end-to-end TOD system that can be programmed to support unseen tasks or domains without prior training, significantly speeding up the TOD design process by easing data collection and training requirements. To the best of our knowledge, AnyTOD is the first end-to-end TOD system capable of zero-shot transfer. To this end, we view TOD as a program that a language model (LM) must execute throughout a conversation, and on which it can rely for guidance. Any predefined task policy, implemented as a program, can be used to control AnyTOD, allowing arbitrary business logic to be executed for a specific task. To demonstrate the efficacy of this paradigm, we experiment with the STAR (Mehri and Eskenazi, 2021), ABCD (Chen et al., 2021), SGD (Rastogi et al., 2020) and MultiWOZ (Eric et al., 2019) benchmarks. Not only does AnyTOD achieve state-of-the-art results in full-shot settings, it also achieves high accuracy in zero-shot setups.
Overview of AnyTOD To adhere to a given program, AnyTOD adopts a neuro-symbolic approach (Figure 1). A neural LM is trained for zero-shot dialog state tracking (DST) and action state tracking (AST), abstracting both states and actions into a sequence of symbols. To support zero-shot, we follow the schema-guided paradigm advocated by Rastogi et al. (2020), and provide a schema to the LM as contextual information, describing all parameters and actions that should be tracked in natural language. By training on a large corpus of diverse schemas, the LM generalizes to arbitrary and unseen schemas (Lee et al., 2021; Zhao et al., 2022). A schema should also provide a symbolic program that declares the task logic, which is executed to recommend possible next actions the agent can take, conditioned on the current dialog states. These recommendations are then reincorporated into the LM, which selects a single next action prediction (NAP), and generates a response. Note that the symbolic program forces AnyTOD to consider a dialog policy explicitly, driving zero-shot transfer onto unseen policies and allowing arbitrarily complex business logic to be employed. However, the program’s recommendations are only considered as guidelines, and it is up to the LM to make a final decision on the NAP.
STARv2 We also introduce STARv2, an improved version of the STAR dataset (Mosig et al., 2020). The original STAR dataset is very valuable for benchmarking zero-shot dialog policy and NAP across a diverse set of tasks and domains, by following a provided policy graph that outlines the intended flow of a conversation. However, the original dataset made following these policy graphs difficult due to its lack of training data for DST and AST. Moreover, we found that the schema entity descriptions provided by the original dataset were not intuitive enough to truly support zero-shot DST and AST. To resolve these limitations, the STARv2 dataset adds new belief state and action state annotations to STAR, as well as more intuitive natural language descriptions for many schema elements. In Section 4.2, we show that these changes facilitate stronger zero-shot DST and AST. However, the ground truth NAP on each system turn is left untouched, allowing direct comparison to results trained on the original STAR dataset. We hope that STARv2 can serve as a new benchmark for TOD systems and drive further research on zero-shot TOD.
2 Related Work
Zero-shot Task-oriented Dialog Fueled by the difficulty of adapting existing TOD systems to new tasks/domains, zero-shot TOD systems have recently seen increasing interest. Much of this work has been on zero-shot DST, with the primary approach being characterizing parameters through names (Wu et al., 2019) or descriptions (Lin et al., 2021; Lee et al., 2021; Zhao et al., 2022). Another approach has been through in-context finetuning (Shah et al., 2019; Gupta et al., 2022), in which a labeled exemplar conversation is given as a prompt to a LM. Mi et al. (2021) demonstrated a more comprehensive approach, including task instructions, constraints, and prompts. In general, these results follow the schema-guided paradigm advocated by Rastogi et al. (2020); Mosig et al. (2020).
By contrast, there are fewer results on zero-shot dialog policy (AST and NAP). To the best of our knowledge, the only result is SAM (Mehri and Eskenazi, 2021), which aligns an LM for an unseen dialog policy by following an explicit policy graph. While similar to the policy graph execution we demonstrate in AnyTOD, there are two differences. First, SAM lacks supervised training on DST and AST, and relies on ground truth NAP only, forcing user state and action tracking to be inextricably linked with the final system action prediction, hurting its ability to generalize to arbitrary policy graphs. Second, SAM is a classification model limited to NAP, and unlike AnyTOD, cannot support DST or natural language generation (NLG). Indeed, we show that AnyTOD is empirically more powerful than SAM in Section 4.2.
To the best of our knowledge, no method has yet combined zero-shot DST, AST, and NAP into an end-to-end TOD system. All existing end-to-end TOD systems (Hosseini-Asl et al., 2020; He et al., 2021; Yang et al., 2020; Peng et al., 2020) are trained and evaluated on the popular MultiWOZ dataset (Eric et al., 2019). As a result, these systems only learn the MultiWOZ policy and are not robust to arbitrary/unseen policies. In contrast, AnyTOD can generalize to arbitrary policies, and we demonstrate strong performance on MultiWOZ without prior training (Section 4.4).
TOD as Programming Historically, most TOD approaches use an explicit plan-based dialog policy module (Rich and Sidner, 1998; Ferguson and Allen, 1998; Bohus and Rudnicky, 2009). However, the NLU models powering these TOD systems are tightly coupled to a specific plan, and must be retrained for even slight changes to the plan. In contrast, AnyTOD enables zero-shot dialog policy by training NLU models to be robust to arbitrary programs as policies. Further, AnyTOD uses the program as contextual information to NLU, and refines its NAP with respect to the conversation, belief state, and action history instead of simply accepting the plan’s dictated next action(s).
Recent work has also focused on discovering structure within conversations, i.e. a latent schema, policy graph, or program (Shi et al., 2019; Yu et al., 2022; Xu et al., 2020). Notably, SMCalFlow (Semantic Machines et al., 2020) constructs “dataflow graphs” from a conversation, parsing semantic intents into executable programs. Cheng et al. (2020) and Shin et al. (2021) further explore this setup. However, these works aim to manipulate an external API/database rather than control the agent’s behavior.
Beyond the scope of TOD, there has been some work on general neuro-symbolic programming with LMs, in which an LM is influenced by the results of a symbolic system. Nye et al. (2021) demonstrated a symbolic reasoning module that accepts or rejects the logical consistency of generations from a neural LM. Lu et al. (2020) explored using predicate logic constraints to control lexical aspects of an LM’s generation. However, AnyTOD is the first application of such an approach to a practical TOD setting.
3 Methodology
3.1 The AnyTOD System
An overview of the AnyTOD system is presented in Fig. 1. We decompose AnyTOD into three steps, and describe each step in detail below:
1. Schema and program construction: A designer constructs a schema for AnyTOD to characterize the ontology of a specific task, as well as a policy graph that declares the task logic.
2. Zero-shot DST and AST: An LM performs zero-shot DST and AST with reference to the schema, without task-specific training.
3. Program execution and NAP: The predicted states and action history are passed to the schema program, which upon execution recommends preferred system actions to the agent. These actions are sent back to the LM, which predicts the final system action(s) conditioned on these recommendations, the conversation history, and the belief states.
Schema Construction The designer is required to construct a schema defining a task’s ontology, and provide a program describing business logic. This is all AnyTOD requires from the designer. For example, suppose the designer is creating a flight booking chatbot: they must define the parameters to be tracked (e.g. “flight id”, “name of the airline”), and enumerate possible actions the user and agent can take (“user saying they would like to search for flights”, “agent should query flight booking api”). Following the schema-guided paradigm advocated in Rastogi et al. (2020), each element in this schema is characterized by a short natural language description, allowing the LM to understand its meaning and facilitating zero-shot transfer. The schema program can be considered a function that takes in predicted belief states and actions, and dictates possible NAPs following explicit symbolic rules. Examples can be seen in Section A.1. In general, this program should infer agent actions in response to user behavior (e.g. “if user wants to search for flights, query the flight search api”).
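As an illustration, such a schema and policy program for the flight-booking example might be expressed as plain data plus an ordinary function. This is a minimal sketch under our own naming conventions (`FLIGHT_SCHEMA`, `flight_policy`, and the index labels are invented for the example, not part of any released artifact):

```python
# A minimal, hypothetical schema for a flight-booking task.
# Every element carries a natural language description so the LM
# can interpret it zero-shot; the policy is an ordinary function.

FLIGHT_SCHEMA = {
    "params": {
        "p0": "flight id",
        "p1": "name of the airline",
        "p2": "departure location",
    },
    "user_actions": {
        "u0": "user saying they would like to search for flights",
        "u1": "user is informing the departure location",
    },
    "system_actions": {
        "s0": "agent should query flight booking api",
        "s1": "agent should ask the user for the departure location",
    },
}

def flight_policy(state, user_actions):
    """Given the predicted belief state and active user actions,
    recommend possible next system actions (symbolic indices)."""
    recs = []
    if "u0" in user_actions:
        recs.append("s0")  # user wants to search -> query the API
    if "p2" not in state:
        recs.append("s1")  # departure location missing -> ask for it
    return recs
```

For instance, `flight_policy({"p1": "Emirates"}, {"u0"})` would recommend both querying the API and asking for the missing departure location.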
Zero-shot DST and AST Adaptation to novel tasks without training data critically hinges on an LM performing zero-shot DST and AST. For this purpose, we adopt and extend the D3ST approach (Zhao et al., 2022) due to its flexibility in zero-shot state and action tracking. Specifically, D3ST conducts zero-shot DST in the following way. Let p0, ..., pN be the parameters defined in the schema, and let di denote parameter pi's natural language description. Then, construct a parameter context string
[params] p0=d0 p1=d1 ... pN=dN
Note that the strings p0, ..., pN are used as indices. Similar context strings are generated for actions for AST. These context strings are concatenated with the entire conversation history, forming the input to the LM. This input is contextualized by the schema information, allowing the LM to refer to the schema and enabling zero-shot transfer. The target string contains the conversation belief state and the history of actions at each turn of the conversation, both in a parseable format. Let pi, ..., pj be the active parameters in the conversation, with corresponding values vi, ..., vj. The belief state is represented as
[state] pi=vi; ...; pj=vj
Note that inactive slots do not appear in the belief state string. In AnyTOD, D3ST is naturally extended to perform zero-shot AST. While D3ST's original formulation in Zhao et al. (2022) was limited to DST, in principle D3ST supports tracking arbitrary events that occur during a conversation, as long as their descriptions are provided. For AST, we build a target string consisting of the history of actions that were active at each turn of the conversation. Let ui and sj denote D3ST indices for user and system actions, respectively. Then, an action history string may look like
[history] u0 u9; s2; u1; s3; ...
This denotes that, on the first turn, the user was performing user actions u0 and u9; on the second turn, the system was performing system action s2, and so on. The active actions within a turn are separated by spaces, and turns are separated by a ; character.
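The serialization and parsing described above can be sketched as simple string manipulation. The helper names below are ours; the string formats follow the ones shown in the text:

```python
def build_param_context(params):
    """Serialize schema parameters as '[params] p0=<desc> p1=<desc> ...'."""
    body = " ".join(f"{idx}={desc}" for idx, desc in params.items())
    return f"[params] {body}"

def parse_action_history(target):
    """Parse '[history] u0 u9; s2; u1' into a per-turn list of actions."""
    body = target.removeprefix("[history]").strip()
    return [turn.split() for turn in body.split(";")]

ctx = build_param_context({"p0": "flight id", "p1": "name of the airline"})
# ctx == "[params] p0=flight id p1=name of the airline"
turns = parse_action_history("[history] u0 u9; s2; u1")
# turns == [["u0", "u9"], ["s2"], ["u1"]]
```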
Program Execution The LM’s predicted belief states and action history are then parsed and passed to the schema program. This program should execute the dialog policy and control AnyTOD, by recommending possible NAPs. Section A.1 showcases some example programs for STARv2 tasks. In the example shown in Figure 1, the current conversation state (“user would like to search for flights to Dubai with Emirates”) satisfies multiple dependency rules (“since the user would like to search for flights, query the flight search api” and “since the user has not provided their flight departure location, ask the user for it”). These system actions are then passed back to the LM as a string of system action indices.
[recommend] s0 s2
Finally, given the policy graph’s recommended actions as extra conditional information, the LM makes predictions about NAP with respect to the conversation, previously predicted belief states and action history. A response is also generated following the action prediction.
[selected] s2 [response] hello!
Note that the selected action need not be one of the actions recommended from the policy graph output, because actual conversations may not rigorously follow the predefined business logic, and violations are common. This step allows AnyTOD to “softly” execute the policy graph, balancing between the model’s belief before and after receiving recommendations.
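A minimal sketch of the graph-following recommendation step that feeds the LM's final soft selection might look like the following. The graph encoding (a dict of edges) and the function name are illustrative assumptions, not the paper's released code:

```python
def recommend_from_graph(policy_graph, last_user_actions):
    """Follow policy-graph edges from the most recent user actions
    to collect candidate next system actions, serialized in the
    '[recommend]' format fed back to the LM."""
    recs = []
    for act in last_user_actions:
        for nxt in policy_graph.get(act, []):
            if nxt not in recs:
                recs.append(nxt)
    return "[recommend] " + " ".join(recs)

# Edges map an observed action to plausible next system actions.
graph = {"u0": ["s0", "s2"], "u1": ["s3"]}
print(recommend_from_graph(graph, ["u0"]))  # prints: [recommend] s0 s2
```

The LM then treats this string as guidance rather than a hard constraint, which is what allows the “soft” execution described above.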
Zero-shot transfer AnyTOD’s zero-shot transfer ability is enabled by a combination of two design considerations. The first is the LM’s description-driven state and event tracking. Since this schema information is provided as context, if this LM is trained on a corpus of diverse schemas, it learns to make predictions by “reading” and understanding the schema descriptions. This leads to robustness on AnyTOD’s state and event tracking for unseen schemas, as shown in Zhao et al. (2022). Moreover, AnyTOD facilitates zero-shot policy transfer by executing the provided policy graphs as explicit rules, and by similarly training the LM with a large number of policy graphs when selecting a recommended system action.
3.2 The STARv2 Dataset
To train AnyTOD, we construct STARv2, an updated version of STAR with new ground truth belief state and action annotations, supporting supervised training on DST and AST. These annotations were generated via few-shot training with D3ST (Zhao et al., 2022): we first train D3ST on the SGD dataset, then continue finetuning on a few hand-labeled conversations from STAR (4 conversations were labeled from each task). While not the focus of this paper, the labeling of STARv2 demonstrates the use of few-shot D3ST in labeling unlabeled conversations on new tasks/domains.
Further, STARv2 adds new natural language descriptions for actions in STAR schemas. Prior work on STAR (Mosig et al., 2020; Mehri and Eskenazi, 2021) leverages template utterances as schema descriptions, which we qualitatively found do not fully capture the complexity of actions; e.g., the action user_weather_inform_city has a template utterance of just [CITY]. STARv2 instead provides “user is informing city” as a more natural action description. We show in Section 4.2 that these descriptions improve zero-shot AST.
4 Experiments
4.1 Setup
Datasets We demonstrate AnyTOD’s power in zero-shot settings on the following datasets:
STAR and STARv2: As described in Section 3.2, we upgrade the original STAR (Mehri and Eskenazi, 2021) dataset to STARv2. The dataset has 24 tasks across 13 domains, with many tasks requiring the model to adhere to a novel policy, making it an important zero-shot AST and NAP benchmark.
ABCD Chen et al. (2021): The design of the ABCD dataset follows a realistic setup, in which an agent’s actions must be balanced between the customer’s expressed desires and the constraints set by task policies. It is thus a natural fit for the AnyTOD framework for both training and evaluation.
SGD Rastogi et al. (2020): SGD is another schema-guided dataset in which schema elements are provided with natural language descriptions to facilitate task transfer. It contains 45 domains and was generated via simulation. Thus, the agent actions and responses follow pre-defined task logic.
MultiWOZ Budzianowski et al. (2018b): MultiWOZ is the standard dataset for benchmarking TOD models. It contains 7 domains and was generated through Wizard-of-Oz (Kelley, 1984) data collection, leading to natural conversations.
Training Our implementation is based on the open-source T5X codebase (Roberts et al., 2022), initialized with the public T5 1.1 checkpoints (https://github.com/google-research/text-to-text-transfer-transformer) as the LM backend. We update the LM code to execute a schema program and reincorporate the results before making the final NAP, as described in Section 3.1. We experimented with two T5 sizes: base (250M parameters, trained on 16 TPUv3 chips (Jouppi et al., 2017)) and XXL (11B parameters, trained on 64 TPUv3 chips). We otherwise adopt the default T5X finetuning hyper-parameter settings throughout our experiments.
4.2 Results on STAR
Table 1 shows AnyTOD results on the STARv2 dataset in the full-shot and zero-shot domain transfer settings, with both “happy” and “unhappy” conversations. In full-shot, models train on 80% of conversations across all tasks and evaluate on the remaining 20%. The zero-shot domain setting is a leave-one-out cross-validation across the STARv2 dataset’s 13 domains, evaluating quality on an unseen schema in a completely novel domain. The following metrics are used in our report: joint goal accuracy (JGA) to measure DST, user action F1 (UaF1) to measure AST, system action F1 (SaF1) to measure NAP, and response BLEU (see Section A.4 for details on how these metrics are calculated).
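As a reference point, simplified versions of JGA and of the action F1 metrics can be sketched as follows. This is our own simplified formulation for illustration; Section A.4 of the paper defines the exact computation:

```python
def joint_goal_accuracy(pred_states, gold_states):
    """Fraction of turns where the full predicted belief state
    exactly matches the gold belief state."""
    hits = sum(p == g for p, g in zip(pred_states, gold_states))
    return hits / len(gold_states)

def action_f1(pred_actions, gold_actions):
    """Micro-averaged F1 over per-turn action sets, pooled across turns.
    Used here for both user actions (UaF1) and system actions (SaF1)."""
    tp = sum(len(set(p) & set(g)) for p, g in zip(pred_actions, gold_actions))
    fp = sum(len(set(p) - set(g)) for p, g in zip(pred_actions, gold_actions))
    fn = sum(len(set(g) - set(p)) for p, g in zip(pred_actions, gold_actions))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```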
Each STAR task schema defines the intended dialog policy by providing a policy graph, where nodes describe conversation actions, and edges connect subsequent actions. An AnyTOD program (Figure A.2) is implemented to recommend next actions with respect to this policy graph.
Two baselines are used for comparison: BERT+S (Mosig et al., 2020) and SAM (Mehri and Eskenazi, 2021), both of which add a policy graph following module for zero-shot transfer to unseen schemas. Note that, though these models were trained on the original STAR data, their SaF1 results are directly comparable to AnyTOD trained on STARv2, as the ground truth NAP labels were left untouched. However, AnyTOD has additional training supervision on AST and DST due to STARv2’s new annotations. For a fairer comparison with SAM, we also report results on SAM-User, a modified version of SAM trained on STARv2 that also includes supervised training on user annotations (see Section A.5 for implementation details). Note that both BERT+S and SAM are based on BERT-base (110M parameters), comparable to T5 base (220M parameters).
Main Result The primary results for AnyTOD base/xxl are given in Table 1. For conciseness, we shorten AnyTOD to AT. As an ablation, we also report results with AT-norec, which removes the policy graph guidance from AnyTOD by recommending no system actions. In the full-shot setting, AnyTOD, AT-norec, and the reported baselines all achieve very high SaF1, since direct supervised training on NAP removes the need for program guidance. However, a large gap opens between AnyTOD and AT-norec in the zero-shot setting, where the guidance from the program becomes necessary: 60.6 vs. 55.8 SaF1 at base, and 68.0 vs. 62.3 SaF1 at xxl. Moreover, AnyTOD xxl’s zero-shot performance approaches the best full-shot result (75.4 SaF1).
Effect of Natural Language Descriptions As mentioned in Section 3.2, STARv2 provides new natural language descriptions that better characterize the actions within STAR. Our main result AT base/xxl takes advantage of these new descriptions; to measure their impact, we also train AT-tmpl on the original template utterances. At base we see little difference between descriptions and templates, but a sizeable improvement from descriptions appears at xxl, whose larger LM is better at NLU. This suggests that more intuitive natural language descriptions help AnyTOD better understand task semantics and perform zero-shot transfer.
AnyTOD vs. baselines To compare against available results on STARv2, we compare AT-tmpl base against SAM-User. Both use template responses provided by STAR, and both are additionally trained with the new DST and AST annotations in STARv2. AnyTOD performs far stronger than SAM or SAM-User, owing to its program execution ability and its supervised training on DST and AST. SAM is not suited to use these contextual signals, likely because it has no attention between schema elements and the conversation, and its rigid classification architecture is unsuitable for multiple losses.
Multitask Training with SGD To demonstrate further robustness for AnyTOD, we also report AnyTOD-sgd, which jointly trains with SGD as a multitask training dataset. SGD includes a large number of tasks, each defined by a schema with highly diverse parameters and actions. The -sgd results in Table 1 show that at base, SGD multitask training improves DST (JGA), AST (UaF1), and by extension NAP (SaF1). A similar but smaller improvement is seen at xxl, suggesting that the larger LM may not need more diverse training owing to its better language understanding.
Table 1(a): Full-shot results on STARv2.

Model | JGA | UaF1 | SaF1 | BLEU
BERT+S | - | - | 74.9 | -
SAM | - | - | 71.5 | -
SAM-User | - | - | 71.7 | -
AT-norec base | 81.5 | 83.8 | 73.3 | 72.8
AT-tmpl base | 82.9 | 84.6 | 70.6 | 72.7
AT base | 82.4 | 84.1 | 70.7 | 72.0
AT-norec xxl | 85.6 | 86.4 | 75.4 | 76.4
AT-tmpl xxl | 85.1 | 82.5 | 71.3 | 75.8
AT xxl | 85.7 | 84.7 | 73.3 | 73.5
Table 1(b): Zero-shot domain results on STARv2.

Model | JGA | UaF1 | SaF1 | BLEU
BERT+S | - | - | 32.3 | -
SAM* | - | - | 51.2 | -
SAM-User | - | - | 44.4 | -
AT-norec base | 57.8 | 71.0 | 55.8 | 32.4
AT-tmpl base | 62.2 | 74.0 | 61.9 | 56.0
AT base | 61.9 | 72.1 | 60.6 | 34.3
AT-sgd base | 66.1 | 74.3 | 61.3 | 34.4
AT-prog base | 61.9 | 72.1 | 61.0 | 34.4
AT-prog+sgd base | 66.1 | 74.3 | 61.9 | 34.6
AT-norec xxl | 72.7 | 80.0 | 62.3 | 41.8
AT-tmpl xxl | 66.8 | 72.9 | 60.8 | 52.9
AT xxl | 74.8 | 79.2 | 68.0 | 44.3
AT-sgd xxl | 75.8 | 80.9 | 68.5 | 43.9
AT-prog xxl | 74.4 | 79.3 | 68.4 | 44.9
AT-prog+sgd xxl | 75.7 | 81.4 | 70.7 | 44.2

*This SAM zero-shot domain SaF1 differs from the 55.7 originally reported by Mehri and Eskenazi (2021); see Section A.3 for more details.
Table 1(c): Zero-shot SaF1 on the complex-logic domains.

Model | Bank | Trip | Trivia
AT xxl | 54.3 | 52.4 | 73.8
AT-sgd xxl | 53.1 | 51.5 | 81.1
AT-prog xxl | 61.0 | 60.8 | 73.7
AT-prog+sgd xxl | 65.0 | 62.9 | 86.3
Complex Program Logic STARv2 is also a good testbed for complex zero-shot task adaptation, as it includes some tasks which are more complex than simple policy-graph following, specifically the bank, trivia, and trip domains. For instance, the trivia task requires the agent to ask the user a trivia question and extract their answer. Different system actions must be taken by the agent depending on whether or not the user’s answer is correct. This logic is not captured by the provided policy graph alone, requiring more complex logic. AnyTOD is suitable for this problem, as we need only to construct a program implementing this logic. These programs are shown in Section A.1.
We report results with these programs in Table 1 under the -prog name. There is a clear win on zero-shot domain SaF1 when averaged over all domains, with a very high 70.7 SaF1 for -prog+sgd xxl, narrowing the gap with the full-shot 75.4 SaF1. When examining the complex tasks individually (Table 1(c)), the win on NAP is even more apparent. The only exception is AT xxl on trivia, which shows little difference with or without the program. In general, however, the guidance provided by this specialized program is necessary for higher-level logic in the dialog policy, since the policy graph alone does not specify enough information to approach the task zero-shot.
4.3 Results on ABCD and SGD
Table 2: DST (JGA) and NAP (SaF1) results on SGD, on seen and unseen services.

Model | JGA seen | JGA unseen | SaF1 seen | SaF1 unseen
AT-norec base | 89.0 | 58.5 | 89.8 | 83.4
AT base | 89.9 | 62.4 | 89.8 | 86.1
AT-norec xxl | 94.8 | 80.2 | 92.1 | 87.2
AT xxl | 94.8 | 82.2 | 91.3 | 88.9
We conduct similar experiments on Action State Tracking (AST) (metric: joint action accuracy or JAA) on ABCD (Chen et al., 2021) and DST and NAP (metrics: JGA and SaF1 respectively) on SGD (Rastogi et al., 2020) datasets.
ABCD contains 10 flows, each describing the business logic for handling a customer request; the flows are relatively similar to each other. We report full-shot results by training and evaluating on all flows, and zero-shot results where the model is trained on one randomly sampled flow and evaluated on the nine remaining flows. The SGD test set consists of 21 services, 15 of which are not seen during training. The dataset is generated via simulation with a generalized policy graph (shared across all services) encoding dialog act transitions. The per-service policy graphs are then constructed by inserting intents and slots and, as a result, end up similar.
Tables 2 and 3 show AnyTOD results on SGD and ABCD, respectively. For both datasets, in both full-shot and zero-shot setups, we generally see an improvement in action prediction when using policy guidance, achieving state-of-the-art results for ABCD. However, the gain is not as large as on STARv2, as the task policies are not as diverse: even without explicit policy guidance, features from different tasks in ABCD/SGD can transfer to each other. Notably, policy guidance helps more on the one-flow setup for ABCD and on unseen services for SGD, further establishing its efficacy on unseen setups, even when the tasks are related.
Table 3: Action state tracking results (JAA) on ABCD.

Model | All Flows | One Flow
RoBERTa | 65.8 | -
AST-T5-Small | 87.9 | -
AT-norec base | 90.5 | 47.4
AT base | 90.5 | 48.9
AT-norec xxl | 91.6 | 64.3
AT xxl | 91.9 | 67.0
Table 4: Effect of removed (0rec) and corrupted (badrec) policy recommendations on STARv2 zero-shot domain SaF1.

Model | SaF1
AT base | 60.6
AT-0rec base | 31.3
AT-badrec base | 25.8
AT xxl | 68.0
AT-0rec xxl | 39.3
AT-badrec xxl | 35.0

4.4 Zero-shot Results on MultiWOZ
Table 5: Zero-shot end-to-end results on MultiWOZ 2.2. SOLOIST and Mars are trained directly on MultiWOZ.

Model | JGA | Inform | Success | BLEU
SOLOIST | 35.9 | 81.7 | 67.1 | 13.6
Mars | 35.5 | 88.9 | 78.0 | 19.6
AnyTOD-xxl | 30.8 | 73.9 | 24.4 | 3.4
To demonstrate the generalizability of the AnyTOD system, we report zero-shot transfer results on the end-to-end MultiWOZ 2.2 (Zang et al., 2020) benchmark, a popular dataset for TOD research. In this case, AnyTOD-xxl is trained on the SGD dataset, and then evaluated on MultiWOZ zero-shot with a small policy program (Section A.6). Responses from AnyTOD were constructed using the template utterance approach from Kale and Rastogi (2020). We compare against SOLOIST (Peng et al., 2020) and Mars (Sun et al., 2022), two end-to-end TOD models directly trained on MultiWOZ with supervision. Results are shown in Table 5, with metrics reported by the MultiWOZ eval script (Nekvinda and Dusek, 2021). Although no training examples from MultiWOZ were used at all, AnyTOD demonstrates JGA and Inform comparable to models that do train on MultiWOZ. Note that since we applied templates for response generation, we do not consider BLEU to be meaningful here, as the templated responses differ substantially from the ground truth labels.
5 Analysis
5.1 Impact of Policy Guidance
To see how impactful the recommendations provided by the policy graph are, we reevaluate already finetuned AnyTOD models on the STARv2 zero-shot domain setting, but with changes to the program recommendations during eval. First, to see how dependent AnyTOD is on policy graph guidance, we modify the graph to output no recommendations (denoted as 0rec), forcing the model to do NAP only using the conversation, belief state, and action history. Secondly, we modify the graph to output deliberately bad recommendations (denoted as badrec), intended to trick the model into choosing an incorrect system action. This was done by randomly sampling 1-3 system actions other than the ground truth action.
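The badrec perturbation can be sketched as follows. The sampling code and function name are our own illustration of the described procedure, not the paper's released implementation:

```python
import random

def make_bad_recommendations(all_system_actions, ground_truth, rng=random):
    """Sample 1-3 system actions other than the ground truth action,
    to be fed to the model as deliberately misleading guidance."""
    candidates = [a for a in all_system_actions if a != ground_truth]
    k = rng.randint(1, 3)
    return rng.sample(candidates, min(k, len(candidates)))

bad = make_bad_recommendations(["s0", "s1", "s2", "s3"], "s2")
assert "s2" not in bad and 1 <= len(bad) <= 3
```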
The major drops in SaF1 for both setups shown in Table 4 confirm that the model, while able to predict actions without it, does consider the policy guidance heavily. Notably, 75% and 83% of correct predictions for 0rec and badrec are actions common to all tasks e.g., hello or query.
Table 6: Policy corruption results on ABCD (JAA), varying the probability that the policy graph is replaced by one from an incorrect flow.

Model and Corruption Prob. | All Flows | One Flow
AT base, 0 | 90.5 | 48.9
AT base, 0.4 | 90.1 | 48.4
AT base, 0.8 | 89.5 | 47.4
AT-norec base, 0 | 90.5 | 47.4
AT xxl, 0 | 91.9 | 67.0
AT xxl, 0.4 | 91.5 | 66.7
AT xxl, 0.8 | 91.5 | 65.9
AT-norec xxl, 0 | 91.6 | 64.3
We conduct a similar “policy corruption” experiment on ABCD (Table 6), in which policy graphs for evaluation tasks have a 0%, 40%, and 80% chance of being replaced by graphs from incorrect flows during evaluation. We see a consistent quality drop with increasing probability of corruption for both base and xxl.
5.2 Error Analysis
We also analyze AnyTOD errors on STARv2. We classify all incorrect NAPs into three possible error categories: (1) System action error: the program recommends the correct system action, but this was not chosen by the LM, (2) Policy graph error: the predicted belief state and action history are correct, but the program’s execution of the policy graph does not recommend the expected system action, and (3) State tracking error: the predicted belief states and action history are incorrect, which leads to incorrect recommendations from the policy graph. Results are shown in Figure 2. In general, we see that the benefit to scaling the LM from base to xxl comes from improvements to state and action tracking, which aligns with better DST and AST results on xxl as in Table 1.
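This three-way taxonomy can be expressed as a simple decision rule. This is a sketch of the categorization logic; the boolean flag names are ours:

```python
def classify_nap_error(state_and_history_correct, program_recommended_correct):
    """Bucket an incorrect next-action prediction into one of the
    three error categories described above."""
    if not state_and_history_correct:
        return "state tracking error"   # bad inputs fed to the program
    if not program_recommended_correct:
        return "policy graph error"     # program failed despite good inputs
    return "system action error"        # LM ignored a correct recommendation
```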
6 Conclusion
We proposed AnyTOD, a zero-shot end-to-end TOD system that can be programmed to handle unseen tasks without domain-specific training. AnyTOD adopts a neuro-symbolic approach, in which an LM performs zero-shot DST and AST with respect to a provided schema, abstracting both into a sequence of symbols. These symbol sequences are then parsed and passed to a program expressing the task policy, which is executed to make recommendations for the next agent action(s). Agent designers are free to implement arbitrarily complex business logic within AnyTOD to determine its policy on unseen tasks or domains. To demonstrate the value of this approach, we show state-of-the-art results on zero-shot TOD benchmarks, such as STAR, ABCD, SGD and MultiWOZ. For further training and benchmarking of zero-shot end-to-end TOD systems, we also release the STARv2 dataset, an improved version of STAR.
7 Limitations
AnyTOD is a data-efficient approach designed to accelerate the building of task-oriented dialog systems. Our implementation uses a relatively large LM (up to 11B parameters, trained with T5X) to make structured predictions such as dialog states, and further controls the LM's behavior with user-designed symbolic programs. While the LM's behavior in our design is properly regulated and we apply templates to formulate responses, designers must take care when writing the policy program and templates to ensure the predictability of system actions. We do not intend this system to be used in open-domain, free-form conversation generation scenarios.
References
- Bohus and Rudnicky (2009) Dan Bohus and Alexander I Rudnicky. 2009. The RavenClaw dialog management framework: Architecture and systems. Comput. Speech Lang., 23(3):332–361.
- Budzianowski et al. (2018a) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Ultes Stefan, Ramadan Osman, and Milica Gašić. 2018a. Multiwoz - a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Budzianowski et al. (2018b) Pawel Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gasic. 2018b. Multiwoz - A large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling. CoRR, abs/1810.00278.
- Chen et al. (2021) Derek Chen, Howard Chen, Yi Yang, Alexander Lin, and Zhou Yu. 2021. Action-based conversations dataset: A corpus for building more in-depth task-oriented dialogue systems. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3002–3017, Online. Association for Computational Linguistics.
- Cheng et al. (2020) Jianpeng Cheng, Devang Agrawal, Hector Martinez Alonso, Shruti Bhargava, Joris Driesen, Federico Flego, Shaona Ghosh, Dain Kaplan, Dimitri Kartsaklis, Lin Li, Dhivya Piraviperumal, Jason D Williams, Hong Yu, Diarmuid O Seaghdha, and Anders Johannsen. 2020. Conversational semantic parsing for dialog state tracking.
- Eric et al. (2019) Mihail Eric, Rahul Goel, Shachi Paul, Adarsh Kumar, Abhishek Sethi, Peter Ku, Anuj Kumar Goyal, Sanchit Agarwal, Shuyang Gao, and Dilek Hakkani-Tur. 2019. MultiWOZ 2.1: A consolidated Multi-Domain dialogue dataset with state corrections and state tracking baselines.
- Ferguson and Allen (1998) George Ferguson and James F Allen. 1998. TRIPS: An integrated intelligent problem-solving assistant. https://www.aaai.org/Papers/AAAI/1998/AAAI98-080.pdf. Accessed: 2022-12-14.
- Gupta et al. (2022) Raghav Gupta, Harrison Lee, Jeffrey Zhao, Yuan Cao, Abhinav Rastogi, and Yonghui Wu. 2022. Show, don’t tell: Demonstrations outperform descriptions for schema-guided task-oriented dialogue. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4541–4549, Seattle, United States. Association for Computational Linguistics.
- He et al. (2021) Wanwei He, Yinpei Dai, Yinhe Zheng, Yuchuan Wu, Zheng Cao, Dermot Liu, Peng Jiang, Min Yang, Fei Huang, Luo Si, Jian Sun, and Yongbin Li. 2021. GALAXY: A generative pre-trained model for Task-Oriented dialog with Semi-Supervised learning and explicit policy injection.
- Hosseini-Asl et al. (2020) Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher. 2020. A simple language model for task-oriented dialogue.
- Jouppi et al. (2017) Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, and Doe Hyun Yoon. 2017. In-datacenter performance analysis of a tensor processing unit. SIGARCH Comput. Archit. News, 45(2):1–12.
- Kale and Rastogi (2020) Mihir Kale and Abhinav Rastogi. 2020. Few-shot natural language generation by rewriting templates. CoRR, abs/2004.15006.
- Kelley (1984) J F Kelley. 1984. An iterative design methodology for user-friendly natural language office information applications. ACM Trans. Inf. Syst., 2(1):26–41.
- Lee et al. (2021) Chia-Hsuan Lee, Hao Cheng, and Mari Ostendorf. 2021. Dialogue state tracking with a language model using Schema-Driven prompting.
- Lin et al. (2021) Zhaojiang Lin, Bing Liu, Seungwhan Moon, Paul Crook, Zhenpeng Zhou, Zhiguang Wang, Zhou Yu, Andrea Madotto, Eunjoon Cho, and Rajen Subba. 2021. Leveraging slot descriptions for zero-shot cross-domain dialogue state tracking. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5640–5648, Online. Association for Computational Linguistics.
- Lu et al. (2020) Ximing Lu, Peter West, Rowan Zellers, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2020. NeuroLogic decoding: (un)supervised neural text generation with predicate logic constraints.
- Machines et al. (2020) Semantic Machines, Jacob Andreas, John Bufe, David Burkett, Charles Chen, Josh Clausman, Jean Crawford, Kate Crim, Jordan DeLoach, Leah Dorner, Jason Eisner, Hao Fang, Alan Guo, David Hall, Kristin Hayes, Kellie Hill, Diana Ho, Wendy Iwaszuk, Smriti Jha, Dan Klein, Jayant Krishnamurthy, Theo Lanman, Percy Liang, Christopher H Lin, Ilya Lintsbakh, Andy McGovern, Aleksandr Nisnevich, Adam Pauls, Dmitrij Petters, Brent Read, Dan Roth, Subhro Roy, Jesse Rusak, Beth Short, Div Slomin, Ben Snyder, Stephon Striplin, Yu Su, Zachary Tellman, Sam Thomson, Andrei Vorobev, Izabela Witoszko, Jason Wolfe, Abby Wray, Yuchen Zhang, and Alexander Zotov. 2020. Task-Oriented dialogue as dataflow synthesis.
- Mehri and Eskenazi (2021) Shikib Mehri and Maxine Eskenazi. 2021. Schema-guided paradigm for zero-shot dialog. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 499–508, Singapore and Online. Association for Computational Linguistics.
- Mi et al. (2021) Fei Mi, Yitong Li, Yasheng Wang, Xin Jiang, and Qun Liu. 2021. CINS: Comprehensive instruction for few-shot learning in task-oriented dialog systems.
- Mosig et al. (2020) Johannes E M Mosig, Shikib Mehri, and Thomas Kober. 2020. STAR: A Schema-Guided dialog dataset for transfer learning.
- Nekvinda and Dusek (2021) Tomás Nekvinda and Ondrej Dusek. 2021. Shades of bleu, flavours of success: The case of multiwoz. CoRR, abs/2106.05555.
- Nye et al. (2021) Maxwell Nye, Michael Henry Tessler, Joshua B Tenenbaum, and Brenden M Lake. 2021. Improving coherence and consistency in neural sequence models with Dual-System, Neuro-Symbolic reasoning.
- Peng et al. (2020) Baolin Peng, Chunyuan Li, Jinchao Li, Shahin Shayandeh, Lars Liden, and Jianfeng Gao. 2020. SOLOIST: Building task bots at scale with transfer learning and machine teaching.
- Rastogi et al. (2020) Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2020. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):8689–8696.
- Rich and Sidner (1998) Charles Rich and Candace L Sidner. 1998. COLLAGEN: A collaboration manager for software interface agents. User Model. User-adapt Interact., 8(3):315–350.
- Roberts et al. (2022) Adam Roberts, Hyung Won Chung, Anselm Levskaya, Gaurav Mishra, James Bradbury, Daniel Andor, Sharan Narang, Brian Lester, Colin Gaffney, Afroz Mohiuddin, Curtis Hawthorne, Aitor Lewkowycz, Alex Salcianu, Marc van Zee, Jacob Austin, Sebastian Goodman, Livio Baldini Soares, Haitang Hu, Sasha Tsvyashchenko, Aakanksha Chowdhery, Jasmijn Bastings, Jannis Bulian, Xavier Garcia, Jianmo Ni, Andrew Chen, Kathleen Kenealy, Jonathan H. Clark, Stephan Lee, Dan Garrette, James Lee-Thorp, Colin Raffel, Noam Shazeer, Marvin Ritter, Maarten Bosma, Alexandre Passos, Jeremy Maitin-Shepard, Noah Fiedel, Mark Omernick, Brennan Saeta, Ryan Sepassi, Alexander Spiridonov, Joshua Newlan, and Andrea Gesmundo. 2022. Scaling up models and data with t5x and seqio. arXiv preprint arXiv:2203.17189.
- Shah et al. (2019) Darsh J Shah, Raghav Gupta, Amir A Fayazi, and Dilek Hakkani-Tur. 2019. Robust Zero-Shot Cross-Domain slot filling with example values.
- Shi et al. (2019) Weiyan Shi, Tiancheng Zhao, and Zhou Yu. 2019. Unsupervised dialog structure learning. CoRR, abs/1904.03736.
- Shin et al. (2021) Richard Shin, Christopher Lin, Sam Thomson, Charles Chen, Subhro Roy, Emmanouil Antonios Platanios, Adam Pauls, Dan Klein, Jason Eisner, and Benjamin Van Durme. 2021. Constrained language models yield few-shot semantic parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Sun et al. (2022) Haipeng Sun, Junwei Bao, Youzheng Wu, and Xiaodong He. 2022. Mars: Semantic-aware contrastive learning for End-to-End Task-Oriented dialog.
- Wu et al. (2019) Chien-Sheng Wu, Andrea Madotto, Ehsan Hosseini-Asl, Caiming Xiong, Richard Socher, and Pascale Fung. 2019. Transferable Multi-Domain state generator for Task-Oriented dialogue systems.
- Xu et al. (2020) Jun Xu, Zeyang Lei, Haifeng Wang, Zheng-Yu Niu, Hua Wu, Wanxiang Che, and Ting Liu. 2020. Discovering dialog structure graph for Open-Domain dialog generation.
- Yang et al. (2020) Yunyi Yang, Yunhao Li, and Xiaojun Quan. 2020. UBAR: Towards fully End-to-End Task-Oriented dialog systems with GPT-2.
- Yu et al. (2022) Dian Yu, Mingqiu Wang, Yuan Cao, Izhak Shafran, Laurent El Shafey, and Hagen Soltau. 2022. Unsupervised slot schema induction for task-oriented dialog.
- Zang et al. (2020) Xiaoxue Zang, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. MultiWOZ 2.2 : A dialogue dataset with additional annotation corrections and state tracking baselines. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 109–117, Online. Association for Computational Linguistics.
- Zhao et al. (2022) Jeffrey Zhao, Raghav Gupta, Yuan Cao, Dian Yu, Mingqiu Wang, Harrison Lee, Abhinav Rastogi, Izhak Shafran, and Yonghui Wu. 2022. Description-driven task-oriented dialog modeling.
Appendix A Appendix
A.1 AnyTOD Programs
A.2 Complete Results on STARv2
For compactness, Table 1 reported only UaF1 and SaF1. Table A.1 additionally reports user action accuracy (UaAcc) and system action accuracy (SaAcc).
| Model | JGA | UaAcc | UaF1 | SaAcc | SaF1 | BLEU |
|---|---|---|---|---|---|---|
| BERT+S | - | - | - | 73.8 | 74.9 | - |
| SAM | - | - | - | 70.4 | 71.5 | - |
| SAM-User | - | - | - | 70.4 | 71.7 | - |
| AT-noguide base | 81.5 | 74.4 | 83.8 | 73.7 | 73.3 | 72.8 |
| AT-tmpl base | 82.9 | 75.6 | 84.6 | 71.0 | 70.6 | 72.7 |
| AT base | 82.4 | 75.2 | 84.1 | 71.6 | 70.7 | 72.0 |
| AT-noguide xxl | 85.6 | 78.3 | 86.4 | 75.7 | 75.4 | 76.4 |
| AT-tmpl xxl | 85.1 | 72.6 | 82.5 | 70.7 | 71.3 | 75.8 |
| AT xxl | 85.7 | 75.9 | 84.7 | 73.8 | 73.3 | 73.5 |
| Model | JGA | UaAcc | UaF1 | SaAcc | SaF1 | BLEU |
|---|---|---|---|---|---|---|
| BERT+S | - | - | - | 29.7 | 32.3 | - |
| SAM | - | - | - | 49.8 | 51.2 | - |
| SAM-User | - | - | - | 53.9 | 44.4 | - |
| AT-noguide base | 57.8 | 55.4 | 71.0 | 56.1 | 55.8 | 32.4 |
| AT-tmpl base | 62.2 | 56.0 | 74.0 | 62.5 | 61.9 | 56.0 |
| AT base | 61.9 | 56.6 | 72.1 | 61.6 | 60.6 | 34.3 |
| AT-sgd base | 66.1 | 59.5 | 74.3 | 63.5 | 61.3 | 34.4 |
| AT-prog+reply base | 62.7 | 55.8 | 73.9 | 63.1 | 62.9 | 56.3 |
| AT-prog base | 61.9 | 56.6 | 72.1 | 61.9 | 61.0 | 34.4 |
| AT-prog+sgd base | 66.1 | 59.5 | 74.3 | 64.2 | 61.9 | 34.6 |
| AT-noguide xxl | 72.7 | 65.9 | 80.0 | 62.3 | 62.3 | 41.8 |
| AT-tmpl xxl | 66.8 | 58.9 | 72.9 | 60.9 | 60.8 | 52.9 |
| AT xxl | 74.8 | 64.6 | 79.2 | 68.0 | 68.0 | 44.3 |
| AT-sgd xxl | 75.8 | 67.8 | 80.9 | 69.3 | 68.5 | 43.9 |
| AT-prog+reply xxl | 73.7 | 61.6 | 76.6 | 65.7 | 66.3 | 63.6 |
| AT-prog xxl | 74.4 | 64.7 | 79.3 | 68.5 | 68.4 | 44.9 |
| AT-prog+sgd xxl | 75.7 | 68.5 | 81.4 | 70.8 | 70.7 | 44.2 |
A.3 Corrected SAM Results on Zero-shot Domain
During the development of AnyTOD, we found that the zero-shot domain results reported for SAM in Mehri and Eskenazi (2021) were incorrect. An annotation issue within the STAR dataset marked some conversations as having an invalid domain; due to how SAM was implemented, these conversations were always included in the training set, even if they belonged to the evaluation domain. For instance, dialog ID 102 is marked as a null domain in the original STAR dataset. Retraining SAM with this issue fixed caused a drop in SaF1 from 55.7 to 51.2. We fix these annotation errors in the STARv2 dataset.
A.4 Calculating STARv2 Metrics
Details on calculating metrics for STARv2 are as follows. For DST, JGA is calculated with an exact match on belief state parameters and values. For AST, we only consider the quality of the most recent turn within the action history prediction. This is always a user turn, which may have multiple user actions active, so it can be considered a multilabel classification problem. UaAcc is then calculated as an exact set match on the predicted user actions at the current turn, and UaF1 as a weighted multilabel F1 over the same predictions. Both SaAcc and SaF1 are calculated as described in Mosig et al. (2020).
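These metrics can be sketched in a few lines. This is a minimal illustration assuming belief states are dicts and per-turn user actions are sets of labels; "weighted" F1 averages per-label F1 scores weighted by each label's support in the references.

```python
from collections import Counter

def joint_goal_accuracy(pred_states, gold_states):
    """JGA: fraction of turns whose full predicted belief state
    (all parameter-value pairs) exactly matches the reference."""
    exact = sum(p == g for p, g in zip(pred_states, gold_states))
    return exact / len(gold_states)

def user_action_accuracy(pred_actions, gold_actions):
    """UaAcc: exact set match on the user actions of the current turn."""
    exact = sum(set(p) == set(g) for p, g in zip(pred_actions, gold_actions))
    return exact / len(gold_actions)

def weighted_multilabel_f1(pred_actions, gold_actions):
    """UaF1: per-label F1, averaged with weights proportional to each
    label's support in the references."""
    labels = set().union(*gold_actions, *pred_actions)
    support = Counter(a for g in gold_actions for a in g)
    total, score = sum(support.values()), 0.0
    for label in labels:
        tp = sum(label in p and label in g for p, g in zip(pred_actions, gold_actions))
        fp = sum(label in p and label not in g for p, g in zip(pred_actions, gold_actions))
        fn = sum(label not in p and label in g for p, g in zip(pred_actions, gold_actions))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += support[label] * f1
    return score / total
```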
A.5 Implementation of Sam-User
To implement supervised training of AST on SAM, we modify the methodology described in Mehri and Eskenazi (2021), which embeds both the conversation and schema elements to produce an attention vector $\alpha$. Here, $\alpha_i$ gives the attention weight between the conversation and the $i$-th user action of the policy graph. This weight is then interpreted as a proxy for probability, and converted into a probability for NAP over all system actions according to the policy graph edges:

$$p(s_j) = \sum_{i \,:\, \text{next}(i) = j} \alpha_i$$

Here, $\text{next}(i)$ gives the next system action of the $i$-th user action according to the policy graph. Note that $\alpha_i$ is an attention weight that is interpreted as the probability of user action $i$ being active at the current turn; however, no supervised training was done with ground-truth user action labels. To implement supervised training on these user actions, we train the $\alpha_i$ to be actual probabilities, applying a sigmoid on the attention logits to form a user action prediction head. Note that this is a multilabel binary prediction, so we calculate a binary cross-entropy loss on this head.
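A minimal sketch of this mechanism, with illustrative names (`alpha`, `next_action`) standing in for the paper's notation; the real model computes these weights with a BERT-based encoder rather than taking them as inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nap_distribution(alpha, next_action):
    """Aggregate per-user-action weights alpha[i] into scores over system
    actions by following policy-graph edges: next_action[i] is the system
    action that follows user action i in the policy graph."""
    scores = {}
    for i, a in enumerate(alpha):
        scores[next_action[i]] = scores.get(next_action[i], 0.0) + a
    return scores

def user_action_bce_loss(logits, labels):
    """Multilabel binary cross-entropy on the user-action prediction head:
    a sigmoid turns each attention logit into the probability that user
    action i is active at the current turn."""
    loss = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss / len(logits)
```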
A.6 MultiWOZ Zero-Shot Policy Program
Figure A.1 contains the AnyTOD policy program used when evaluating over MultiWOZ. This policy program was handcrafted and encodes a simplified conversation flow.
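To give a flavor of what such a handcrafted program looks like, the sketch below implements a simplified request/query/book flow in the spirit of a MultiWOZ policy. The action names and the exact flow are illustrative assumptions, not the program in Figure A.1.

```python
def simple_policy(belief_state, required_slots, db_result=None, booked=False):
    """A hand-written dialog policy sketch: request missing required slots,
    query the database once the state is complete, offer a result, then
    report a successful booking. Returns a (action, argument) pair."""
    missing = [s for s in required_slots if s not in belief_state]
    if missing:
        # Ask the user for the first unfilled required slot.
        return ("request", missing[0])
    if db_result is None:
        # All constraints are known: look up matching entities.
        return ("query_db", None)
    if not booked:
        # Offer the retrieved entity and attempt the booking.
        return ("offer_and_book", db_result)
    return ("notify_success", None)
```

In AnyTOD, a program of this shape consumes the LM's predicted belief state and action history and returns recommended next actions, which are passed back to the LM for response generation.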