A Low-Cost, Controllable and Interpretable Task-Oriented Chatbot: With Real-World After-Sale Services as Example
Abstract.
Though widely used in industry, traditional task-oriented dialogue systems suffer from three bottlenecks: (i) difficult ontology construction (e.g., intents and slots); (ii) poor controllability and interpretability; (iii) heavy reliance on annotated data. In this paper, we propose to represent utterances with a simpler concept named Dialogue Action, upon which we construct a tree-structured TaskFlow and further build a task-oriented chatbot with TaskFlow as the core component. A framework is presented to automatically construct TaskFlow from large-scale dialogues and deploy it online. Our experiments on real-world after-sale customer services show that TaskFlow can satisfy the major needs and effectively reduce the developer burden.
1. Introduction
Task-oriented chatbots, which aim to assist users in completing certain tasks, have proven valuable for real-world business, especially after-sale customer services (Li et al., 2017; Zhu, 2019; Acharya et al., 2021; Sun et al., 2021). A well-designed chatbot can help standardize the service process and alleviate the pressure on after-sale staff.
Traditional task-oriented dialogue systems require domain experts to manually develop a structured ontology (e.g., intents and slots) as the foundation, and then build separate modules for Natural Language Understanding (NLU), Dialog State Tracking (DST), Dialog Policy (DP), and Natural Language Generation (NLG). However, in industrial practice, this design and construction method suffers from three bottlenecks. (i) It is difficult to represent complex real-world utterances with combinations of dialogue acts and slots. For example, our preliminary exploration shows that it takes more than 20 slots and 100 possible values to fully express users’ semantic information for an after-sale customer service. (ii) Most dialogue policy work models the policy implicitly, which leads to unsatisfactory controllability and interpretability for industrial systems. (iii) Existing work adopts the supervised learning paradigm, which relies heavily on human-annotated data, while the labeling process can be costly and error-prone.
To address the above challenges, inspired by speech act theory (Searle, 1975; Traum and Hinkelman, 1992), we propose to represent utterances with a simpler concept, named Dialogue Action, instead of dialogue acts and slots. A dialogue action is a group of utterances that share identical semantic information, unique within the dialogue corpus, and it can be obtained automatically by clustering. In this way, most utterances can be represented by a specific dialogue action. For controllability and interpretability, we model the dialogue policy explicitly and build the task-oriented chatbot on TaskFlow, a tree structure with dialogue actions as nodes and dialogue action transitions as edges. Finally, we present a framework to automatically construct TaskFlow from large-scale dialogues and deploy it online, which further reduces the developer burden.

The contributions of our work are:
(1) We propose to build task-oriented chatbots based on dialogue actions and TaskFlow, which offers simplicity, controllability and interpretability.
(2) We present an effective framework to automatically construct TaskFlow from large-scale dialogues and deploy the TaskFlow online.
(3) Our exploration on real-world after-sale customer services shows that TaskFlow can satisfy the majority of user needs and effectively reduce the developer burden.
2. Overview
Our proposed framework consists of two parts, an offline part and an online part. We use as our running example a Chinese after-sale customer service for an electric bike rental business, where users and staff communicate online through text messages. With the offline part, we automatically construct TaskFlow from large-scale chat logs. With the online part, the TaskFlow can then be rapidly deployed and serve as the core component of the task-oriented chatbot. Both parts are detailed below.
2.1. Offline Part
As Figure 1 shows, three steps are performed sequentially in the offline part: 1) Dialogue action construction, which constructs dialogue actions for users and staff by clustering; 2) Dialogue standardization, which standardizes the dialogues by mapping each utterance to a dialogue action; 3) TaskFlow construction, which constructs TaskFlow from the standardized dialogues.
2.1.1. Dialogue Action Construction
A dialogue action is regarded as a group of utterances with unique and identical semantic information. Inspired by Lv et al. (2021), we use a popular two-stage method to cluster utterances from large-scale dialogues, as follows.
Feature Extractor. Sentence representations generated by pre-trained language models have been widely used as features for clustering. In this paper, we use ConSERT (Yan et al., 2021), which addresses the collapse issue of BERT-derived sentence representations through contrastive learning, as the feature extractor. ConSERT (https://github.com/yym6472/ConSERT) is fine-tuned on the dialogue data to make the sentence representations more task-oriented and better suited to clustering. The output of the [CLS] token is used as the feature of each utterance for clustering.
Clustering. We use K-means (Krishna and Murty, 1999) to group utterances, and each cluster is treated as a dialogue action. To improve cluster purity, annotators may check and slightly modify the clusters. In our practice, the clusters are of high quality and require limited human effort.
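As a rough illustration of the clustering step, the sketch below runs plain K-means (Lloyd's algorithm) over toy 2-D vectors. The vectors merely stand in for the [CLS] features; the function name `kmeans`, the toy data and the hand-picked initial centroids are assumptions made for this example and do not describe the production setup.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def kmeans(points, init_centroids, max_iters=100):
    """Plain Lloyd's K-means; returns (labels, centroids)."""
    centroids = [tuple(c) for c in init_centroids]
    labels = []
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


# Toy "utterance features": two well-separated groups (hypothetical data).
features = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(features, init_centroids=[features[0], features[3]])
```

Each resulting label plays the role of one dialogue action; in practice the manual check described above would follow.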
2.1.2. Dialogue Standardization
Dialogue standardization aims to standardize a dialogue by mapping each utterance to a specific dialogue action. Inspired by Yu et al. (2021), we use a retrieval-based method, which retrieves the clustered utterances most similar to the given input utterance and labels the input with the corresponding cluster. Compared with traditional classification methods (Kowsari et al., 2019), this design has two advantages: 1) a retrieval-based method is more suitable for our scenario, where the number of instances per cluster is imbalanced; 2) a retrieval-based method can adapt to modifications of the actions (e.g., creating a new action or removing an existing one) without retraining the model.
Specifically, as Figure 2 shows, given an input utterance $x$, we first use the BM25 algorithm (Robertson and Zaragoza, 2009) to recall the top $k$ utterances $\{u_1, \dots, u_k\}$ from all clustered utterances. Further, we use a BERT-based text similarity computation model to rerank the recalled utterances and select the utterance $\hat{u}$ with the highest similarity to $x$, which can be denoted by:

$$\hat{u} = \operatorname*{arg\,max}_{u_i,\ 1 \le i \le k} s(x, u_i) \tag{1}$$

where $s(x, u_i)$ denotes the similarity between $x$ and $u_i$. The text similarity computation model is based on BERT and takes the concatenation of $x$ and $u_i$ as input (the input sequence is ([CLS], $x$, [SEP], $u_i$)). It first encodes the concatenated utterances into continuous representations, and then computes the similarity from the output of the [CLS] token (denoted by $h$ in the following) as follows:

$$\begin{aligned} h &= \mathrm{BERT}([\mathrm{CLS}], x, [\mathrm{SEP}], u_i)_{[\mathrm{CLS}]} \\ s(x, u_i) &= \mathrm{sigmoid}(W h + b) \end{aligned} \tag{2}$$

where $W$ and $b$ are trainable weight parameters.
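The recall-then-rerank procedure can be sketched as follows. This is a minimal illustration under stated assumptions: BM25 is implemented directly in its common Okapi form, the reranker is a token-overlap (Jaccard) placeholder standing in for the BERT-based similarity model, and the function names, action labels and English toy utterances are all hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every tokenized doc against a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def jaccard(a, b):
    """Placeholder similarity; the paper uses a BERT-based model here."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def standardize(utterance, clustered, top_k=2, similarity=jaccard):
    """Map an utterance to the action of its most similar clustered utterance."""
    query = utterance.lower().split()
    docs = [text.lower().split() for text, _ in clustered]
    scores = bm25_scores(query, docs)
    # Stage 1: BM25 recall of the top-k candidates.
    recalled = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Stage 2: rerank the recalled candidates and keep the best one.
    best = max(recalled, key=lambda i: similarity(query, docs[i]))
    return clustered[best][1]


# (utterance, dialogue action) pairs; hypothetical toy data.
clustered_utterances = [
    ("i forgot to lock the bike", "user_forgot_lock"),
    ("please lock my bike remotely", "user_request_remote_lock"),
    ("the brake is broken", "user_brake_failure"),
    ("the bike ran out of power", "user_out_of_power"),
]
action = standardize("sorry i forgot to lock it", clustered_utterances)
```

Because the label comes from the retrieved neighbour, adding or removing an action only changes the `clustered_utterances` list and requires no retraining, which mirrors the second advantage noted above.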

2.1.3. TaskFlow Construction
TaskFlow is a tree with dialogue actions as nodes, and the edges between nodes describe how the conversation proceeds (refer to Figure 5 for a TaskFlow sample). We construct TaskFlow from the standardized dialogues, each of which can be treated as a sequence of user/staff actions. An intuitive approach is to insert the action sequence of each dialogue directly into a tree. However, due to the dynamic nature of dialogues, the action distribution at any given turn would be too scattered across the corpus, since similar conversation fragments may occur at different turns in different dialogues. Thus, instead of the direct approach, we first build an N-gram model of dialogue actions, which captures local conversation patterns more accurately. We then sample high-quality action sequences from the N-gram model and merge the sampled sequences to form the TaskFlow.
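Building N-gram Model. Given an action sequence $a_1^m = (a_1, \dots, a_m)$, the probability $P(a_1^m)$ can be approximated as follows:

$$\begin{aligned} P(a_1^m) &= \prod_{i=1}^{m} P(a_i \mid a_1^{i-1}) \\ &\approx \prod_{i=1}^{m} P(a_i \mid a_{i-N+1}^{i-1}) \end{aligned} \tag{3}$$

where $a_i^j$ denotes the subsequence with $i$ as starting index and $j$ as ending index. Following maximum likelihood estimation, we compute the N-gram probability as:

$$P(a_i \mid a_{i-N+1}^{i-1}) = \frac{C(a_{i-N+1}^{i})}{C(a_{i-N+1}^{i-1})} \tag{4}$$

where $C(a_i^j)$ denotes the count of $a_i^j$ in the corpus.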
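The maximum likelihood estimate amounts to dividing N-gram counts by context counts, as in the minimal sketch below. The function name `build_ngram_model`, the padding with N-1 [SOS] symbols and the toy action labels are assumptions for illustration.

```python
from collections import Counter


def build_ngram_model(dialogues, n=3):
    """Estimate P(action | previous n-1 actions) by maximum likelihood."""
    context_counts = Counter()
    ngram_counts = Counter()
    for actions in dialogues:
        # Pad so that the first real action also has a full-length context.
        seq = ["[SOS]"] * (n - 1) + list(actions) + ["[EOS]"]
        for i in range(n - 1, len(seq)):
            context = tuple(seq[i - n + 1:i])
            ngram_counts[context + (seq[i],)] += 1
            context_counts[context] += 1
    # P(a_i | context) = C(context, a_i) / C(context)
    model = {}
    for ngram, count in ngram_counts.items():
        context, action = ngram[:-1], ngram[-1]
        model.setdefault(context, {})[action] = count / context_counts[context]
    return model


# Standardized dialogues as action sequences (hypothetical toy corpus).
dialogues = [
    ["U:forgot_lock", "S:ask_location", "U:give_location", "S:lock_done"],
    ["U:forgot_lock", "S:ask_location", "U:give_location", "S:waive_fee"],
    ["U:forgot_lock", "S:lock_done"],
]
model = build_ngram_model(dialogues, n=2)
```

In this toy corpus, `U:forgot_lock` is followed by `S:ask_location` in two of its three occurrences, so the bigram estimate for that transition is 2/3.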
Sampling Action Sequence. We sample action sequences based on the N-gram model. Specifically, the sampling process starts with [SOS] and ends with [EOS]. We follow the beam search method (Freitag and Al-Onaizan, 2017) but with a different strategy. In each step, we extend every partial action sequence in the beam with its top K most probable next actions. Once the [EOS] symbol is appended to a partial action sequence, the sequence is removed from the beam and recorded as a complete sampled action sequence.
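The sampling strategy can be sketched as below. This is an illustrative reading of the description rather than the exact procedure: the hand-written bigram table, the `max_len` cap that guards against endless expansion, and the function name are assumptions, and the code expects N of at least 2.

```python
def sample_sequences(model, n=2, top_k=2, max_len=10):
    """Expand every partial sequence with its top-k next actions until [EOS]."""
    beam = [(("[SOS]",) * (n - 1), 1.0)]  # (partial sequence, probability)
    completed = []
    while beam:
        next_beam = []
        for seq, prob in beam:
            context = seq[-(n - 1):]
            candidates = model.get(context, {})
            top = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
            for action, p in top:
                extended = (seq + (action,), prob * p)
                if action == "[EOS]":
                    completed.append(extended)  # leaves the beam as a full sample
                elif len(extended[0]) < max_len:
                    next_beam.append(extended)
        beam = next_beam
    return completed


# Hypothetical bigram model: context tuple -> {next action: probability}.
model = {
    ("[SOS]",): {"U:forgot_lock": 1.0},
    ("U:forgot_lock",): {"S:ask_location": 0.6, "S:lock_done": 0.3, "S:greet": 0.1},
    ("S:ask_location",): {"U:give_location": 1.0},
    ("U:give_location",): {"S:lock_done": 1.0},
    ("S:lock_done",): {"[EOS]": 1.0},
}
samples = sample_sequences(model, n=2, top_k=2)
```

With `top_k=2`, the low-probability transition to `S:greet` is never expanded, so only the two dominant conversation paths are returned.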
Generating TaskFlow. TaskFlow is generated by simply inserting the sampled action sequences into a tree individually (sequence by sequence). The transition probability is recorded as the condition on the edge between action nodes.
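Inserting sequences into a tree works like building a prefix tree: shared prefixes are merged and a new branch is created where sequences diverge. The sketch below uses nested dictionaries; the node layout, the function name and the toy sequences with their per-step probabilities are assumptions for illustration.

```python
def build_taskflow(sequences):
    """Merge action sequences into a tree stored as nested dicts."""
    root = {"action": "[SOS]", "children": {}}
    for actions, step_probs in sequences:
        node = root
        for action, prob in zip(actions, step_probs):
            child = node["children"].get(action)
            if child is None:
                # New branch: record the transition probability on the edge.
                child = {"action": action, "condition": prob, "children": {}}
                node["children"][action] = child
            node = child
    return root


# (action sequence, transition probability of each step); hypothetical samples.
sampled = [
    (["U:forgot_lock", "S:ask_location", "U:give_location", "S:lock_done"],
     [1.0, 0.6, 1.0, 1.0]),
    (["U:forgot_lock", "S:lock_done"], [1.0, 0.3]),
]
taskflow = build_taskflow(sampled)
```

Both samples share the `U:forgot_lock` prefix, so the resulting tree has a single user node with two staff children, each carrying its transition probability as the edge condition.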
Post-Processing. With the generated TaskFlow, operational staff can easily determine 1) which action nodes require API calls (e.g., staff need to lock the bike remotely), and 2) which conditions on the edges require modification (e.g., requiring the API response to be a specific value). For example, if a user action node has multiple children (i.e., staff action nodes) with different semantic information, an API call is probably required, and each edge may correspond to a different API response. We manually add API call nodes and modify the conditions on the edges where necessary. As a result, TaskFlow consists of three types of nodes: user action nodes, staff action nodes and API call nodes.
2.2. Online Part
To deploy the TaskFlow online and enable interaction with users, we build an execution engine, based on the principle that an ideal conversation should follow the paths in TaskFlow. The execution engine stores and updates the path corresponding to the current conversation. Basically, it moves along the path if the condition on an edge is satisfied, and responds to the user when it encounters a staff action node. Given a new user utterance, the execution engine first categorizes the utterance into a user action with the retrieval-based model from Section 2.1.2. Then, if the children of the user action node are staff action nodes, a staff action node is selected based on the conditions on the edges. Otherwise, the user action node has a single API call node as its child. In that case, the engine first uses a Parameter Value Extraction Module (introduced in the following) to extract the required parameter values, and then executes the API call node. The execution engine keeps moving along the path until it reaches a leaf node or the conversation is closed.
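A heavily simplified sketch of such an execution engine is given below. The `Node` and `ExecutionEngine` classes, the stubbed classifier and API, and the use of the raw utterance as the API argument (in place of extracted parameter values) are all assumptions for illustration; the API name follows the Check_Status example used later in the paper.

```python
class Node:
    def __init__(self, kind, name, reply=None, condition=None, children=()):
        self.kind = kind            # "user", "staff" or "api"
        self.name = name
        self.reply = reply          # text sent to the user (staff nodes only)
        self.condition = condition  # expected API response on the incoming edge
        self.children = list(children)


class ExecutionEngine:
    def __init__(self, root, classify, apis):
        self.current = root       # position on the path of this conversation
        self.classify = classify  # utterance -> user action name
        self.apis = apis          # API name -> callable

    def step(self, utterance):
        """Consume one user utterance and return the staff replies it triggers."""
        action = self.classify(utterance)
        user_node = next((c for c in self.current.children
                          if c.kind == "user" and c.name == action), None)
        if user_node is None:
            return []  # off-path utterance; a real system needs a fallback here
        node, response, replies = user_node, None, []
        while node.children and node.children[0].kind != "user":
            child = node.children[0]
            if child.kind == "api":
                response = self.apis[child.name](utterance)
                node = child
            else:
                # Pick the staff node whose edge condition matches the response.
                match = next((c for c in node.children
                              if c.condition is None or c.condition == response), None)
                if match is None:
                    break  # no edge condition holds; stay on the current node
                node = match
                replies.append(node.reply)
        self.current = node
        return replies


root = Node("root", "[SOS]", children=[
    Node("user", "forgot_lock", children=[
        Node("api", "Check_Status", children=[
            Node("staff", "lock_done", condition=True,
                 reply="The bike has been locked remotely."),
            Node("staff", "ask_location", condition=False,
                 reply="Where did you park the bike?"),
        ]),
    ]),
])
engine = ExecutionEngine(root, classify=lambda u: "forgot_lock",
                         apis={"Check_Status": lambda u: True})
replies = engine.step("I forgot to lock the bike")
```

Since every reply is attached to a staff action node on the stored path, the engine can only emit responses that exist in the TaskFlow.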

Parameter Value Extraction Module. The parameter value extraction module extracts parameter values for the API call (e.g., time, user id) from user utterances. Specifically, we use a two-stage method to extract parameters, as Figure 3 shows. A Bi-LSTM model with a CRF layer (Lample et al., 2016; Xiangyu et al., 2019; Xi et al., 2021) is first used to perform sequence labeling and extract the mention containing the parameter value. Then a rule-based method is used to rewrite the mention and obtain the standardized parameter value.
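The two stages can be illustrated with the sketch below, in which the sequence labeler is replaced by hand-written BIO tags and the rewriting rule normalizes a 12-hour time mention. The tag set, the specific rule, the function names and the English example are hypothetical; the rules used in the deployed system are not described here.

```python
import re


def extract_mention(tokens, tags, label="TIME"):
    """Collect the tokens tagged B-label/I-label into one mention string."""
    span = [tok for tok, tag in zip(tokens, tags)
            if tag in (f"B-{label}", f"I-{label}")]
    return " ".join(span)


def normalize_time(mention):
    """Rewrite mentions such as '3 pm' or '10:30 am' to 24-hour HH:MM."""
    match = re.fullmatch(r"(\d{1,2})(?::(\d{2}))?\s*(am|pm)", mention.strip().lower())
    if match is None:
        return None  # no rule applies
    hour, minute = int(match.group(1)), int(match.group(2) or 0)
    if not 1 <= hour <= 12 or minute > 59:
        return None
    hour = hour % 12 + (12 if match.group(3) == "pm" else 0)
    return f"{hour:02d}:{minute:02d}"


tokens = ["I", "parked", "it", "at", "3", "pm", "today"]
tags = ["O", "O", "O", "O", "B-TIME", "I-TIME", "O"]  # stand-in for model output
value = normalize_time(extract_mention(tokens, tags))
```

Separating mention extraction from rewriting keeps the learned model small in scope, while the rules can be adjusted without retraining.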
3. Industrial Application
3.1. Online Deployment
TaskFlow is integrated into our online chatbot in the following three typical scenarios of after-sale customer service of electric bike rental:
(1) Forget_to_Lock_bike, where customers finished riding but forgot to lock the bike. The customers may ask the staff to lock the bike remotely and reduce the fees.
(2) Mechanical_Failure, where customers encountered mechanical failures, such as brake failure, during the ride. The customers may claim a refund due to the poor user experience.
(3) Out_Of_Power, where the electric bike ran out of power during the ride. The customers may also claim a refund.
All the scenarios involve strict and complex business logic. For example, the staff need to judge whether the fee can be waived by checking back-end APIs. We construct a TaskFlow for each scenario individually. Specifically, we sample 50,000 utterances from the corpus and group them into 100 clusters, for users and staff respectively. After manual modification, 82 user actions and 93 staff actions are retained. The text similarity computation model adopts the BERT-base model as its encoder. We build a 4-gram model and extend each action sequence in the beam with its top 5 actions.
3.2. Online Evaluation
3.2.1. Evaluation Metrics
We randomly sample 150 dialogues for each scenario. Annotators with domain knowledge are asked to grade each dialogue with “-1”, “0” or “1”, and the grading criteria can be summarized as follows: (i) Score “-1” denotes that the chatbot cannot handle the user requirements correctly. (ii) Score “0” denotes that the chatbot can handle the user requirements correctly, but may generate disfluent or incomplete responses. (iii) Score “1” denotes that the chatbot can handle the user requirements correctly and complete the conversation perfectly.

Table 1. Human evaluation results: percentage of dialogues receiving each score.

| Scenario | Score -1 (%) | Score 0 (%) | Score 1 (%) |
|---|---|---|---|
| Forget_to_Lock_bike | 7.69 | 12.82 | 79.49 |
| Mechanical_Failure | 14.20 | 10.56 | 75.24 |
| Out_Of_Power | 9.34 | 9.41 | 81.25 |
3.2.2. Human Evaluation
The results of human evaluation are shown in Table 1, from which we observe that: (1) TaskFlow performs satisfactorily at understanding and meeting user requirements, which is the core function of a task-oriented chatbot. For example, in scenario Forget_to_Lock_bike, TaskFlow correctly handles the user requirements in 92.31% of dialogues (scores “0” and “1” combined, i.e., 12.82% + 79.49%). (2) Due to diversified user expressions in real-world scenarios, TaskFlow may generate incomplete or disfluent responses (e.g., 12.82% of dialogues in Forget_to_Lock_bike). For example, users may complain or curse about unexpected brake failures.
3.3. In-Depth Analysis
3.3.1. Analysis of Efficiency Issue
To quantitatively examine the low-cost characteristic of our method, we build a traditional task-oriented chatbot for Forget_to_Lock_bike (denoted by Traditional in the following) following Yao et al. (2013), and compare the required human effort in Table 2. While achieving comparable performance, our TaskFlow requires much less human effort (i.e., 9 vs. 21 person-days).
Specifically, the building process can be divided into four steps, and the person-days (abbreviated as pds) required by each step are recorded in Table 2. Since the traditional system involves multiple modules, it requires large-scale data annotation and more person-days for training and online deployment. With the presented automatic framework, the TaskFlow-based system requires much less annotation and human effort, thus effectively reducing the developer burden.
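Table 2. Human effort, in person-days (pds), required to build the Traditional and TaskFlow-based systems.

| Step | Process | Traditional | TaskFlow |
|---|---|---|---|
| 1 | Ontology Construction | 3 pds | 1 pds |
| 2 | Data Annotation | 12 pds | 4 pds |
| 3 | Training Model | 3 pds | 2 pds |
| 4 | Online Deployment | 3 pds | 2 pds |
| Total | | 21 pds | 9 pds |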
3.3.2. Analysis of Controllability and Interpretability
Because the dialogue policy is explicit, a TaskFlow-based system naturally has the advantage of controllability and interpretability. Specifically, conversations exactly follow the paths in TaskFlow, while a traditional dialogue policy module may generate unexpected dialogue acts. We present the TaskFlow for Forget_to_Lock_bike and a concrete conversation in Figure 4. If the condition on an edge is met, the execution engine moves along the edge and responds according to the staff action node. For example, when the API Check_Status returns True, the engine moves to the staff action #2 node and responds with the 3rd utterance. In this way, an unexpected response will never be generated, and the path clearly shows how the conversation is completed.

The above advantages also lead to better flexibility. TaskFlow can be easily modified, whereas it is hard to intervene manually in a traditional dialogue system. When the customer service policy changes, a traditional dialogue system may become unusable until the model is retrained with newly labeled data. In contrast, our TaskFlow can be edited and adapted to the changes immediately (without training).

4. Conclusion
In this paper, to meet the simplicity, controllability and interpretability required by industrial dialogue systems, we propose a framework to build task-oriented chatbots based on dialogue actions and TaskFlow from large-scale dialogues. The experiments show that such a framework can satisfy the majority of user needs and reduce human effort.
In the future, we are interested in optimizing the overall framework, especially in exploring more sophisticated TaskFlow construction methods. Besides, since we construct a TaskFlow for each scenario separately, how to construct a multi-scenario TaskFlow is worth investigating.
5. Relevancy to SIRIP 2022 Themes
In this paper, we propose a framework to build task-oriented chatbots based on dialogue actions and TaskFlow, to meet the simplicity, controllability and interpretability required by industrial systems. Our practical experience shows that such a framework can satisfy the majority of user needs and reduce human effort. The retrieval system contributes greatly to the success of TaskFlow.
6. Presenter Biography
Presenter: Xiangyu Xi. He is an algorithm engineer at Meituan. His research interests include information retrieval, dialogue systems, and information extraction. He is currently working on the construction of task-oriented dialogue systems at Meituan.
7. Company Portrait
Meituan is China’s leading shopping platform for locally found consumer products and retail services including entertainment, dining, delivery, travel and other services.
References
- Acharya et al. (2021) Anish Acharya, Suranjit Adhikari, Sanchit Agarwal, Vincent Auvray, Nehal Belgamwar, Arijit Biswas, Shubhra Chandra, Tagyoung Chung, Maryam Fazel-Zarandi, Raefer Gabriel, Shuyang Gao, Rahul Goel, Dilek Hakkani-Tur, Jan Jezabek, Abhay Jha, Jiun-Yu Kao, Prakash Krishnan, Peter Ku, Anuj Goyal, Chien-Wei Lin, Qing Liu, Arindam Mandal, Angeliki Metallinou, Vishal Naik, Yi Pan, Shachi Paul, Vittorio Perera, Abhishek Sethi, Minmin Shen, Nikko Strom, and Eddie Wang. 2021. Alexa Conversations: An Extensible Data-driven Approach for Building Task-oriented Dialogue Systems. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations. Association for Computational Linguistics, Online, 125–132. https://doi.org/10.18653/v1/2021.naacl-demos.15
- Freitag and Al-Onaizan (2017) Markus Freitag and Yaser Al-Onaizan. 2017. Beam Search Strategies for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation. 56–60.
- Kowsari et al. (2019) Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. 2019. Text classification algorithms: A survey. Information 10, 4 (2019), 150.
- Krishna and Murty (1999) K Krishna and M Narasimha Murty. 1999. Genetic K-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 29, 3 (1999), 433–439.
- Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360 (2016).
- Li et al. (2017) Feng-Lin Li, Minghui Qiu, Haiqing Chen, Xiongwei Wang, Xing Gao, Jun Huang, Juwei Ren, Zhongzhou Zhao, Weipeng Zhao, Lei Wang, et al. 2017. Alime assist: An intelligent assistant for creating an innovative e-commerce experience. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 2495–2498.
- Lv et al. (2021) Chenxu Lv, Hengtong Lu, Shuyu Lei, Huixing Jiang, Wei Wu, Caixia Yuan, and Xiaojie Wang. 2021. Task-Oriented Clustering for Dialogues. In Findings of the Association for Computational Linguistics: EMNLP 2021. Association for Computational Linguistics, Punta Cana, Dominican Republic, 4338–4347. https://doi.org/10.18653/v1/2021.findings-emnlp.368
- Robertson and Zaragoza (2009) Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.
- Searle (1975) John R Searle. 1975. A taxonomy of illocutionary acts. (1975).
- Sun et al. (2021) Kai Sun, Seungwhan Moon, Paul Crook, Stephen Roller, Becka Silvert, Bing Liu, Zhiguang Wang, Honglei Liu, Eunjoon Cho, and Claire Cardie. 2021. Adding Chit-Chat to Enhance Task-Oriented Dialogues. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 1570–1583. https://doi.org/10.18653/v1/2021.naacl-main.124
- Traum and Hinkelman (1992) David R Traum and Elizabeth A Hinkelman. 1992. Conversation acts in task-oriented spoken dialogue. Computational intelligence 8, 3 (1992), 575–599.
- Xi et al. (2021) Xiangyu Xi, Wei Ye, Tong Zhang, Quanxiu Wang, Shikun Zhang, Huixing Jiang, and Wei Wu. 2021. Improving event detection by exploiting label hierarchy. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7688–7692.
- Xiangyu et al. (2019) Xi Xiangyu, Zhang Tong, Ye Wei, Zhang Jinglei, Xie Rui, and Zhang Shikun. 2019. A hybrid character representation for chinese event detection. In 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
- Yan et al. (2021) Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. 2021. ConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online, 5065–5075. https://doi.org/10.18653/v1/2021.acl-long.393
- Yao et al. (2013) Kaisheng Yao, Geoffrey Zweig, Mei-Yuh Hwang, Yangyang Shi, and Dong Yu. 2013. Recurrent neural networks for language understanding.. In Interspeech. 2524–2528.
- Yu et al. (2021) Dian Yu, Luheng He, Yuan Zhang, Xinya Du, Panupong Pasupat, and Qi Li. 2021. Few-shot Intent Classification and Slot Filling with Retrieved Examples. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 734–749. https://doi.org/10.18653/v1/2021.naacl-main.59
- Zhu (2019) Xiaoming Zhu. 2019. Case ii (part a): Jimi’s growth path: Artificial intelligence has redefined the customer service of jd. com. In Emerging Champions in the Digital Economy. Springer, 91–103.