Spoken Language Understanding for Conversational AI:
Recent Advances and Future Directions
Abstract.
When a human communicates with a machine using natural language on the web, how can the machine understand the human's intention and the semantic context of the utterance? This is an important AI task, as it enables the machine to construct a sensible answer or perform a useful action for the user. Meaning is represented at two levels: at the sentence level, where its identification is known as intent detection, and at the word level, in a labelling task called slot filling. This dual-level joint task requires innovative thinking about natural language and deep learning network design, and as a result many approaches and models have been proposed and applied.
This tutorial will discuss how the joint task is set up and introduce Spoken Language Understanding/Natural Language Understanding (SLU/NLU) with Deep Learning techniques. We will cover the datasets, experiments and metrics used in the field. We will describe how the machine uses the latest NLP and Deep Learning techniques to address the joint task, including recurrent and attention-based Transformer networks and pre-trained models (e.g. BERT). We will then look in detail at a network that allows the two levels of the task, intent classification and slot filling, to interact explicitly to boost performance. We will give a code demonstration of a Python notebook for this model, and attendees will have the opportunity to follow coding demos of this joint NLU model to further their understanding.
1. Significance of this Tutorial
The efficacy of virtual assistants becomes more important as their popularity rises. Central to their performance is the ability of the assistant to understand what the human user is saying, so it can act or reply in a way that meaningfully satisfies the requester. The human-device interface may be text-based, but it is now most frequently voice and will probably include images or videos in the near future. To put the understanding of human utterances within a framework: within the natural language processing (NLP) stack lies spoken language understanding (SLU). SLU starts with automatic speech recognition (ASR), the task of taking the sound waves or images of expressed language and transcribing them to text. Natural language understanding (NLU) then takes the text and extracts the semantics for use in further processes: information gathering, question answering, dialogue management, request fulfilment, and so on. The concept of a hierarchical semantic frame has developed to represent the levels of meaning within spoken utterances (Jeong and Lee, 2008). At the highest level is a domain, then intent, and then slots. The domain is the area of information the utterance is concerned with. The intent (a.k.a. goal in early papers) is the speaker's desired outcome from the utterance. The slots are the types of words or spans of words in the utterance that carry semantic information relevant to fulfilling the intent. An example is given in Table 1 for the domain movies. Within this domain, the example has intent find_movie, and the individual tokens are labelled with their slot tags using the inside-outside-beginning (IOB) tagging format. The NLU task is thus extracting the semantic frame elements from the utterance. NLU is central to devices that require a linguistic interface with humans: conversational agents, instruction in vehicles (driverless or otherwise), the Internet of Things (IoT), virtual assistants, online helpdesks/chatbots, robot instruction, and so on.
Improving the quality of the semantic detection will improve the quality of the experience for the user, and from here NLU draws its importance and popularity as a research topic.
Table 1. Semantic frame for an example utterance in the movies domain.

| query | find | recent | comedies | by | james | cameron |
|---|---|---|---|---|---|---|
| slots | O | B-date | B-genre | O | B-director | I-director |
| intent | find_movie | | | | | |
| domain | movies | | | | | |
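The IOB tags in Table 1 decode deterministically into typed spans: a `B-` tag opens a span, following `I-` tags of the same type extend it, and anything else closes it. A small, library-free Python sketch of that decoding (the function name is ours, for illustration only):

```python
def iob_to_slots(tokens, tags):
    """Decode IOB slot tags into (slot_type, text_span) pairs."""
    slots, current_type, current_span = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new slot span begins
            if current_type:
                slots.append((current_type, " ".join(current_span)))
            current_type, current_span = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_span.append(token)    # continue the open span
        else:                             # "O" (or inconsistent I-) closes any open span
            if current_type:
                slots.append((current_type, " ".join(current_span)))
            current_type, current_span = None, []
    if current_type:                      # flush a span still open at sentence end
        slots.append((current_type, " ".join(current_span)))
    return slots

tokens = ["find", "recent", "comedies", "by", "james", "cameron"]
tags   = ["O", "B-date", "B-genre", "O", "B-director", "I-director"]
print(iob_to_slots(tokens, tags))
# → [('date', 'recent'), ('genre', 'comedies'), ('director', 'james cameron')]
```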
In many data sets and real-world applications the domain is limited; it is concerned only with hotel bookings, or air flight information, for example. The domain level is generally not part of the analysis in these cases. However, in wider-ranging applications, for example the SNIPS data set discussed later, or the many personal voice assistants which are expected to field requests from various domains, the inclusion of domain detection in the problem can lead to better results. This leaves us with intent and slot identification: what does the human user want from the communication, and which semantic entities carry the details? The two sub-tasks are known as intent detection and slot filling. The latter is arguably a misnomer, as the task is more correctly slot labelling or slot tagging; strictly speaking, slot filling gives a slot a value of a type matching its label. For example, a slot labelled "B-city" could be filled with the value "Sydney". Intent detection is usually approached as a supervised classification task, mapping the entire input sentence to an element of a finite set of classes. Slot filling seeks to attach a class or label to each of the tokens in the utterance, placing it within the sequence labelling class of problems.
While early research looked at the tasks separately or put them in a series pipeline, it was quickly noted that the slot labels present and the intent class should and do influence each other in ways that solving the two tasks simultaneously should garner better results for both tasks (Jeong and Lee, 2008; Wang, 2010). A joint model that simultaneously addresses each sub-task must capture the joint distributions of intent and slot labels, with respect to the words in the utterance, their local context, and the global context in the sentence. A joint model has the advantage over pipeline models in that it is less susceptible to error propagation (Chen and Yu, 2019), and over separate models that there is only a single model to train and fine-tune.
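A common way to realise such a joint model in practice is to train a shared network against a weighted sum of the two sub-task losses, so both objectives shape the same parameters. A minimal pure-Python sketch of that combined objective (the function names and the weighting scheme are illustrative, not any particular paper's formulation):

```python
import math

def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold class under a probability vector."""
    return -math.log(probs[gold_index])

def joint_loss(intent_probs, intent_gold, slot_probs, slot_gold, alpha=0.5):
    """Weighted sum of the intent loss and the mean per-token slot loss.

    intent_probs: class distribution for the whole utterance.
    slot_probs:   one class distribution per token.
    alpha:        trade-off between the two sub-tasks (a tunable hyper-parameter).
    """
    intent_loss = cross_entropy(intent_probs, intent_gold)
    slot_loss = sum(
        cross_entropy(p, g) for p, g in zip(slot_probs, slot_gold)
    ) / len(slot_gold)
    return alpha * intent_loss + (1 - alpha) * slot_loss
```

Minimising this sum pushes the shared encoder towards representations useful for both sub-tasks, which is where the implicit learning of the joint distribution comes from.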
2. Topics Covered in this Tutorial
This tutorial presents a comprehensive overview of recent research and development on using Joint Natural Language Understanding for Conversational AI, focusing on chatbots, dialogue systems, and online chats. We first present our vision of using NLP techniques for Conversational Natural Language Understanding to identify user intent and semantic context. Then we introduce the four major modules of our tutorial: (1) Introduction to Joint Natural Language Understanding (Joint-NLU), (2) Joint-NLU: Feature Engineering, (3) Joint-NLU: Main Modelling, and (4) System Demos and Future Directions. The following tutorial topics and their contents are mainly based on ‘A survey of joint intent detection and slot filling models in natural language understanding’, published in ACM Computing Surveys (Weld et al., 2021b) by the tutorial instructors.
2.1. Joint Natural Language Understanding
The joint task marries the objectives of the two sub-tasks. As most papers point out, there is a relationship between the slot labels we should expect to see conditional on the intent and vice versa. A statistical view of this is that a model needs to learn the joint distributions of intent and slot labels. The model should also pay regard to the distributions of slot labels within utterances, and one would expect to inherit approaches to label dependency from the slot-labelling sub-task. Approaches to the joint task range from implicit learning of the distribution through explicit learning of the conditional distribution of slot labels over the intent label, and vice versa, to fully explicit learning of the full joint distribution.
Research in the joint task has largely come from the personal assistant and chatbot fields. Chatbots are usually task-oriented within a single domain, while a personal assistant may be single- or multi-domain. Other contributing areas are IoT instruction, robotic instruction (robotics also uses a different concept of intent, describing what action the robot is attempting), and in-vehicle dialogue for driverless vehicles. These areas also need to filter out utterances not addressed to the device. Researchers have also drawn data from question-answering systems, for example (Zhang and Wang, 2016), who annotated a Chinese question data set from Baidu Knows. A summary of the technological approaches in joint NLU can be found in Figure 1.

Figure 1. Summary of technological approaches in joint NLU.
2.2. Joint-NLU: Feature Engineering
Feature creation is a critical part of model design in NLU, as the features should ideally capture, at a minimum, the semantic information of the individual tokens, their context, and the entire sentence. Any further extension to the feature set that may enhance the result can then be considered, whether internal (syntactic, word context) or external (meta-data, sentence context).
Token embedding
The earliest models used features familiar from methods like POS tagging, including one-hot word embeddings, n-grams, affixes, etc. (Jeong and Lee, 2008). Another approach was to incorporate entity lists from sites such as IMDB (movie titles) or Trip Advisor (hotel names) (Celikyilmaz and Hakkani-Tür, 2012). Neural models enable the embedding of diverse natural language without such feature engineering. The gamut of word embedding methods has been used, including word2vec (Pan et al., 2018; Wang et al., 2018b), fastText (Firdaus et al., 2020), GloVe (Zhang and Wang, 2016; Liu et al., 2019; Dadas et al., 2019; Okur et al., 2019; Bhasin et al., 2019; Thi Do and Gaspers, 2019; Pentyala et al., 2019; Bhasin et al., 2020), ELMo (Zhang et al., 2020; Krone et al., 2020), and BERT (Zhang et al., 2019; Qin et al., 2019; Ni et al., 2020; Han et al., 2021a; Krone et al., 2020; He et al., 2021; Huang et al., 2021; and, in pre-print only, Chen et al., 2019; Castellucci et al., 2019). (Firdaus et al., 2018) and (Firdaus et al., 2019) used concatenated GloVe and word2vec embeddings to capture more word information. A recent approach from intent classification trains on triples of samples: an anchor sample, a positive sample from the same class, and a negative sample from a different class (Ren and Xue, 2020).
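Whichever pre-trained method is chosen, at inference time a non-contextual token embedding is essentially a lookup table from vocabulary items to fixed vectors, plus a policy for out-of-vocabulary (OOV) words. A toy stand-in (the table and the zero-vector OOV policy are illustrative; real systems load GloVe/word2vec/fastText files or run a contextual encoder such as BERT):

```python
# Toy 3-dimensional embedding table; real tables are loaded from pre-trained
# files (GloVe, word2vec, fastText) and typically have 100-1024 dimensions.
EMBEDDINGS = {
    "find":     [0.1, 0.3, -0.2],
    "comedies": [0.7, -0.1, 0.4],
}
UNK = [0.0, 0.0, 0.0]  # one common out-of-vocabulary policy (zero vector)

def embed(tokens):
    """Map each token to its vector, falling back to UNK for OOV words."""
    return [EMBEDDINGS.get(t.lower(), UNK) for t in tokens]

print(embed(["find", "zorblax"]))  # second token is OOV → UNK vector
```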
Sentence embedding
The final hidden state of an RNN was frequently used as the sentence embedding (Zhou et al., 2016; Liu and Lane, 2016; Wang et al., 2018b). Sentences have also been embedded using a special token for the whole sentence (Hakkani-Tür et al., 2016; Zhang and Wang, 2019), as a max pooling of the RNN hidden states (Zhang and Wang, 2016), as a learned weighted sum of Bi-RNN hidden states (Liu and Lane, 2016), as an average pooling of RNN hidden states (Ma et al., 2017), as a convolutional combination of the input word vectors (Zhao et al., 2018; Bhasin et al., 2019), and as self-attention over BERT word embeddings (Zhang et al., 2019). (Ma et al., 2017) also applies a sparse attention mechanism which evaluates word importance over a batch and applies weights within each sample utterance for intent detection.
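These pooling strategies differ only in how the per-token hidden states are collapsed into a single sentence vector. A pure-Python sketch of four of them over a toy hidden-state sequence (the weights passed to `weighted_sum` are fixed here for illustration; in the cited models they are learned by an attention mechanism):

```python
def last_state(h):       # final RNN hidden state as sentence vector
    return h[-1]

def max_pool(h):         # dimension-wise maximum over tokens
    return [max(col) for col in zip(*h)]

def mean_pool(h):        # dimension-wise average over tokens
    return [sum(col) / len(h) for col in zip(*h)]

def weighted_sum(h, w):  # attention-style weighted sum; weights w sum to 1
    return [sum(wi * hi[d] for wi, hi in zip(w, h)) for d in range(len(h[0]))]

h = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]     # three token states, dim 2
print(last_state(h))                          # [3.0, 1.0]
print(max_pool(h))                            # [3.0, 2.0]
print(weighted_sum(h, [0.5, 0.25, 0.25]))     # [1.25, 0.75]
```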
2.3. Joint-NLU: Main Modelling
Multi-task learning has been applied to both sub-tasks in search of synergistic learning from related tasks, and the joint task itself is an example of this approach. A joint model needs to learn the joint distributions of intent and slot labels, while also paying regard to the distributions of slot labels within utterances. Approaches to the joint task range from implicit learning of these distributions, through explicit learning of the conditional distribution of slot labels over the intent label (or vice versa), to fully explicit learning of the full joint distributions.

The RNN architecture has been exploited as it provides a state at each temporal token step and a final state encapsulating the sentence. A critical observation made of many purely recurrent models is that the sharing of information between the two sub-tasks is only implicit. Attention was introduced to make it more explicit. Self-attention between the word tokens has been used to learn label dependency and as a stronger alternative to learned weighted-sum attention. Transformer encoders are a prevalent non-recurrent architecture that performs self-attention amongst tokens, addresses long-range dependency, and can form a sentence representation out of the transformed token representations or by using a special sentence token. This points naturally to the pre-trained BERT architecture, which has been used both as input to classifiers and in more integrated architectures.

Another way to make the influence between the tasks explicit is hierarchical modelling. Capsule models pass slot deductions of sufficient confidence up to an intent detection capsule and vice versa, and memory networks use a similar form of explicit feedback. Explicit influence between the two sub-tasks comes even further to the forefront in bi-directional models: (E et al., 2019) shows that slot2intent and intent2slot influence each improve results, though it did not fuse the two approaches.
Fusion approaches here include alternating between slot2intent and intent2slot (Wang et al., 2018a), post-processing fusion (Bhasin et al., 2019), and (Han et al., 2021a), in which bi-directional, direct and explicit influence is central to the model architecture. Even within the joint task, some researchers highlight non-optimal handling of label dependency in slot labelling; adding CRFs to a deep joint model is the common remedy. Graph networks have been designed to extract word-slot, slot-slot, word-intent and slot-intent correlations from the training data. The use of sentence context in multi-turn dialogues provides yet more fuel for explicit influence by incorporating intent-to-intent dependency, albeit unidirectionally in time through a conversation. Multi-turn dialogue also raises the interesting, aligned problems of identifying out-of-domain utterances and changes of intent within a conversation.
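Common to most of these architectures is a shared encoding of the utterance feeding two output heads: a sentence-level intent classifier and a per-token slot tagger. A deliberately minimal sketch of that skeleton (plain linear heads over given token states, with mean pooling for the sentence; real models use RNN/Transformer/BERT encoders and add the explicit cross-task influence discussed above):

```python
def matvec(W, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

def argmax(v):
    return max(range(len(v)), key=v.__getitem__)

def joint_forward(token_states, W_intent, W_slot):
    """One shared encoding feeds two heads: a sentence-level intent
    classifier (over the mean-pooled state) and a per-token slot tagger."""
    sent = [sum(col) / len(token_states) for col in zip(*token_states)]
    intent = argmax(matvec(W_intent, sent))                    # one label per utterance
    slots = [argmax(matvec(W_slot, h)) for h in token_states]  # one label per token
    return intent, slots

# Toy run: 2 tokens with 2-dim states, 2 intent classes, 2 slot classes.
print(joint_forward([[1.0, 0.0], [0.0, 1.0]],
                    [[1.0, 0.0], [0.0, 2.0]],   # W_intent
                    [[1.0, 0.0], [0.0, 1.0]]))  # W_slot
# → (1, [0, 1])
```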
2.4. System Demos and Future Directions
Finally, we conclude our tutorial by demonstrating the system on publicly available and widely used real datasets (ATIS and SNIPS), showing how natural (spoken) language is understood via the two main tasks, intent detection and slot filling, to identify users' needs from their utterances. All the data, models, code, and resources will be publicly available. We then introduce how this can be applied to online and web chat, including online in-game chat toxicity detection, published by the instructors at ACL (Weld et al., 2021a).
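For context on the demo's evaluation, joint-NLU work conventionally reports intent accuracy, slot F1, and sentence-level semantic accuracy, where an utterance counts as correct only if the intent and every slot tag are right. The first and third are easy to sketch (the predictions and gold labels below are made up for illustration):

```python
def intent_accuracy(pred_intents, gold_intents):
    """Fraction of utterances with the correct intent label."""
    correct = sum(p == g for p, g in zip(pred_intents, gold_intents))
    return correct / len(gold_intents)

def sentence_accuracy(pred_intents, gold_intents, pred_tags, gold_tags):
    """Exact-match semantic accuracy: intent AND every slot tag correct."""
    correct = sum(
        pi == gi and pt == gt
        for pi, gi, pt, gt in zip(pred_intents, gold_intents, pred_tags, gold_tags)
    )
    return correct / len(gold_intents)

pred_i = ["find_movie", "find_movie"]
gold_i = ["find_movie", "book_flight"]
pred_t = [["O", "B-genre"], ["O", "O"]]
gold_t = [["O", "B-genre"], ["O", "B-city"]]
print(intent_accuracy(pred_i, gold_i))                    # 0.5
print(sentence_accuracy(pred_i, gold_i, pred_t, gold_t))  # 0.5
```

Slot F1 is span-based (conlleval-style) rather than token-based, which is why decoding the IOB tags into spans matters before scoring.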
3. Relevance to The Web Conference
This tutorial is highly relevant to TheWebConf on the topic of web mining and context analysis, focusing on language technologies and the Web. The tutorial instructors have rich experience in delivering tutorials at major NLP (ACL, COLING, NAACL and Interspeech), AI (AAAI and IJCAI), and web mining and knowledge management (SIGIR, WWW, and CIKM) conferences, and in journals (ACM Computing Surveys and Knowledge and Information Systems). They have also published in the SLU domain, including a joint NLU paper at Interspeech 2021 (Han et al., 2021b), NLU-applied papers at ACL 2021 (Weld et al., 2021a), and the joint NLU survey paper in ACM Computing Surveys (Weld et al., 2021b).
4. Tutorial Outlines
This tutorial is expected to be a 1.5-hour lecture-style presentation. The lectures will cover the aforementioned topics in detail, while the demo session focuses on providing a clear and practical demonstration of how to set up and implement a basic model for the joint NLU task. The overall outline is as follows:
• Introduction [20 mins]
  – Motivations [5 mins]
  – Introduction to NLP and SLU [10 mins]
  – Unique Challenges of SLU [5 mins]
• Joint Natural Language Understanding [10 mins]
  – Joint Learning Model in NLU [5 mins]
  – Explicit and Implicit Joint Learning [5 mins]
• Joint-NLU: Architecture [30 mins]
  – Joint-NLU: Feature Engineering [15 mins]
  – Joint-NLU: Main Modelling [15 mins]
• System Demo and Future Directions [30 mins]
  – Real-world Chatbot Dataset and Models Demo [10 mins]
  – Online Game Chat NLU Demo [10 mins]
  – Research Problems and Future Directions [10 mins]
5. Previous Editions
This is a cutting-edge tutorial that introduces recent advances in the emerging area of using NLU techniques for Conversational AI. Previous ACL, EMNLP, NAACL, EACL, COLING, or TheWebConf tutorials have not covered the presented topic in the past four years. This tutorial has not been presented elsewhere. We estimate that around 60% of the works covered in this tutorial are from researchers other than the tutorial instructors.
6. Targeted Audience
This tutorial is intended for researchers and practitioners in natural language processing, information retrieval, data mining, text mining, graph mining, machine learning, and their applications to other domains. While an audience with a good background in the above areas would benefit most from this tutorial, we believe the presented material will give the general audience and newcomers a complete picture of the current work, introduce important research topics in this field, and inspire them to learn more. Our tutorial is designed to be self-contained, so no specific background knowledge is assumed of the audience. However, it would be beneficial for the audience to know about basic deep learning technologies, pre-trained word embeddings (e.g. word2vec) and language models (e.g. BERT) before attending. We will provide the audience with a reading list of background knowledge on our tutorial website.
7. Tutorial Materials and Equipment
We will provide attendees with a website giving access to all the related information, including the outline, tutorial materials, references, presenter profiles, etc. All the lecture slides will be provided in Google Slides (https://www.google.com.au/slides/about/), and the practical demonstration will be conducted via Google Colab (https://colab.research.google.com/), a browser-based, hosted Jupyter notebook service that requires no setup to use. The shareable links to both materials will be available on the website.
8. Video Teaser
A video teaser of our tutorial is available on YouTube: https://youtu.be/ovw7093ogeI.
9. Organisation Details
This tutorial will be delivered both in person and online (e.g., via Zoom) during the conference. We will also provide pre-recorded videos as a backup plan that overcomes the potential occurrence of technical problems. We will release our tutorial website and all the materials one week before the tutorial.
10. Tutorial Instructors
Dr. Soyeon Caren Han
is a co-leader of AD-NLP (Australia Deep Learning NLP Group), a Senior Lecturer (Associate Professor in the U.S. system) at the University of Western Australia, and an honorary Senior Lecturer (honorary Associate Professor in the U.S. system) at the University of Sydney and the University of Edinburgh. After completing her Ph.D. in 2017, she worked for six years at the University of Sydney. Her research interests include Natural Language Processing with Deep Learning. She is broadly interested in several research topics, including visual-linguistic multi-modal learning, abusive language detection, document layout analysis, and recommender systems. More information can be found at https://drcarenhan.github.io/.
Ms. Siqu Long
is a PhD candidate at the School of Computer Science, University of Sydney (https://scholar.google.com/citations?user=zeutMxcAAAAJ). She received her Bachelor's Degree and her Master's Degree in Information Technologies in 2016 and 2017 respectively. She worked as a software engineer at IBM, China in 2019. Her research interests include Natural Language Processing and Multi-modal Representation Learning.
Dr. Henry Weld
is a PhD candidate at the School of Computer Science, University of Sydney (https://scholar.google.com/citations?user=l-u_06gAAAAJ). He worked as a data scientist specialising in machine learning and natural language processing, and as a Senior Quantitative Analyst with over 17 years of front-office experience at the Commonwealth Bank of Australia. His current research interests are in Natural Language Processing. He received a B.E. in Civil Engineering from the University of Queensland in 1987 and his first PhD, in Pure Mathematics, from the University of Sydney in 1999. He also completed a Master's Degree in Data Science at the University of Sydney in 2019.
Dr. Josiah Poon
is a co-leader of AD-NLP (Australia Deep Learning NLP Group) and a Senior Lecturer at the School of Computer Science, University of Sydney. He has been using traditional machine learning techniques, paying particular attention to learning from imbalanced datasets, short-string text classification, and data complexity analysis. He has coordinated a multidisciplinary team of computer scientists, pharmacists, and western medicine and traditional Chinese medicine researchers and practitioners since 2007. He co-leads a joint big-data laboratory for integrative medicine (Acclaim), established between the University of Sydney and the Chinese University of Hong Kong, to study medical/health problems using computational tools. https://www.sydney.edu.au/engineering/about/our-people/academic-staff/josiah-poon.html
References
- Bhasin et al. (2019) Anmol Bhasin, Bharatram Natarajan, Gaurav Mathur, Joo Hyuk Jeon, and Jun-Seong Kim. 2019. Unified Parallel Intent and Slot Prediction with Cross Fusion and Slot Masking. In Natural Language Processing and Information Systems, Elisabeth Métais, Farid Meziane, Sunil Vadera, Vijayan Sugumaran, and Mohamad Saraee (Eds.). Springer, Cham, 277–285.
- Bhasin et al. (2020) Anmol Bhasin, Bharatram Natarajan, Gaurav Mathur, and Himanshu Mangla. 2020. Parallel Intent and Slot Prediction using MLB Fusion. In 14th International Conference on Semantic Computing (ICSC). IEEE, San Diego, USA, 217–220.
- Castellucci et al. (2019) Giuseppe Castellucci, Valentina Bellomaria, Andrea Favalli, and Raniero Romagnoli. 2019. Multi-lingual Intent Detection and Slot Filling in a Joint BERT-based Model. arXiv:1907.02884
- Celikyilmaz and Hakkani-Tür (2012) Asli Celikyilmaz and Dilek Hakkani-Tür. 2012. A Joint Model for Discovery of Aspects in Utterances. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Jeju Island, Korea, 330–338. https://www.aclweb.org/anthology/P12-1035
- Chen et al. (2019) Qian Chen, Zhu Zhuo, and Wen Wang. 2019. BERT for Joint Intent Classification and Slot Filling. arXiv:1902.10909
- Chen and Yu (2019) Sixuan Chen and Shuai Yu. 2019. WAIS: Word Attention for Joint Intent Detection and Slot Filling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. AAAI Press, Honolulu, USA, 9927–9928.
- Dadas et al. (2019) Slawomir Dadas, Jaroslaw Protasiewicz, and Witold Pedrycz. 2019. A Deep Learning Model with Data Enrichment for Intent Detection and Slot Filling. In 2019 IEEE International Conference on Systems, Man and Cybernetics. IEEE, Bari, Italy, 3012–3018. https://doi.org/10.1109/SMC.2019.8914542
- E et al. (2019) Haihong E, Peiqing Niu, Zhongfu Chen, and Meina Song. 2019. A Novel Bi-directional Interrelated Model for Joint Intent Detection and Slot Filling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 5467–5471.
- Firdaus et al. (2018) Mauajama Firdaus, Shobhit Bhatnagar, Asif Ekbal, and Pushpak Bhattacharyya. 2018. A Deep Learning Based Multi-task Ensemble Model for Intent Detection and Slot Filling in Spoken Language Understanding. In Neural Information Processing, Long Cheng, Andrew Chi Sing Leung, and Seiichi Ozawa (Eds.). Springer, Cham, 647–658.
- Firdaus et al. (2020) Mauajama Firdaus, Hitesh Golchha, Asif Ekbal, and Pushpak Bhattacharyya. 2020. A Deep Multi-task Model for Dialogue Act Classification, Intent Detection and Slot Filling. Cognitive Computation 13 (2020), 626–645.
- Firdaus et al. (2019) Mauajama Firdaus, Ankit Kumar, Asif Ekbal, and Pushpak Bhattacharyya. 2019. A Multi-Task Hierarchical Approach for Intent Detection and Slot Filling. Knowledge-Based Systems 183 (2019), 104846.
- Hakkani-Tür et al. (2016) Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional RNN-LSTM. In Interspeech. ISCA, San Francisco, USA, 715–719.
- Han et al. (2021a) Soyeon Caren Han, Siqu Long, Huichun Li, Henry Weld, and Josiah Poon. 2021a. Bi-directional Joint Neural Networks for Intent Classification and Slot Filling. In Proc. Interspeech 2021. ISCA, Brno, Czech Republic, 4743–4747.
- Han et al. (2021b) Soyeon Caren Han, Siqu Long, Huichun Li, Henry Weld, and Josiah Poon. 2021b. Bi-Directional Joint Neural Networks for Intent Classification and Slot Filling. In Proc. Interspeech 2021. 4743–4747. https://doi.org/10.21437/Interspeech.2021-2044
- He et al. (2021) Ting He, Xiaohong Xu, Yating Wu, Huazhen Wang, and Jian Chen. 2021. Multitask Learning with Knowledge Base for Joint Intent Detection and Slot Filling. Applied Sciences 11, 11 (2021). https://doi.org/10.3390/app11114887
- Huang et al. (2021) Zhiqi Huang, Fenglin Liu, Peilin Zhou, and Yuexian Zou. 2021. Sentiment Injected Iteratively Co-Interactive Network for Spoken Language Understanding. In 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Toronto, Canada, 7488–7492. https://doi.org/10.1109/ICASSP39728.2021.9413885
- Jeong and Lee (2008) Minwoo Jeong and Gary Geunbae Lee. 2008. Triangular-Chain Conditional Random Fields. IEEE Transactions on Audio, Speech, and Language Processing 16, 7 (2008), 1287–1302.
- Krone et al. (2020) Jason Krone, Yi Zhang, and Mona Diab. 2020. Learning to Classify Intents and Slot Labels Given a Handful of Examples. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI. Association for Computational Linguistics, Online, 96–108. https://doi.org/10.18653/v1/2020.nlp4convai-1.12
- Liu and Lane (2016) Bing Liu and Ian Lane. 2016. Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling. In Interspeech 2016. ISCA, San Francisco, USA, 685–689. https://doi.org/10.21437/Interspeech.2016-1352
- Liu et al. (2019) Yijin Liu, Fandong Meng, Jinchao Zhang, Jie Zhou, Yufeng Chen, and Jinan Xu. 2019. CM-Net: A Novel Collaborative Memory Network for Spoken Language Understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Association for Computational Linguistics, Hong Kong, China, 1051–1060. https://doi.org/10.18653/v1/D19-1097
- Ma et al. (2017) Mingbo Ma, Kai Zhao, Liang Huang, Bing Xiang, and Bowen Zhou. 2017. Jointly Trained Sequential Labeling and Classification by Sparse Attention Neural Networks. In Interspeech. ISCA, Stockholm, Sweden, 3334–3338.
- Ni et al. (2020) Pin Ni, Yuming Li, Gangmin Li, and Victor Chang. 2020. Natural language understanding approaches based on joint task of intent detection and slot filling for IoT voice interaction. Neural Computing and Applications 32 (2020), 16149–16166.
- Okur et al. (2019) Eda Okur, Shachi H Kumar, Saurav Sahay, Asli Arslan Esme, and Lama Nachman. 2019. Natural Language Interactions in Autonomous Vehicles: Intent Detection and Slot Filling from Passenger Utterances. arXiv:1904.10500 [cs.CL]
- Pan et al. (2018) Lingfeng Pan, Yi Zhang, Feiliang Ren, Yining Hou, Yan Li, Xiaobo Liang, and Yongkang Liu. 2018. A Multiple Utterances Based Neural Network Model for Joint Intent Detection and Slot Filling. In Proceedings of the Evaluation Tasks at the China Conference on Knowledge Graph and Semantic Computing (CCKS 2018). CEUR-WS.org, Tianjin, China, 25–33.
- Pentyala et al. (2019) Shiva Pentyala, Mengwen Liu, and Markus Dreyer. 2019. Multi-Task Networks with Universe, Group, and Task Feature Learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 820–830. https://doi.org/10.18653/v1/P19-1079
- Qin et al. (2019) Libo Qin, Wanxiang Che, Yangming Li, Haoyang Wen, and Ting Liu. 2019. A Stack-Propagation Framework with Token-Level Intent Detection for Spoken Language Understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, Hong Kong, 2078–2087. https://doi.org/10.18653/v1/D19-1214
- Ren and Xue (2020) Fuji Ren and Siyuan Xue. 2020. Intention Detection Based on Siamese Neural Network With Triplet Loss. IEEE Access 8 (2020), 82242–82254.
- Thi Do and Gaspers (2019) Quynh Ngoc Thi Do and Judith Gaspers. 2019. Cross-lingual Transfer Learning for Spoken Language Understanding. In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Brighton, United Kingdom of Great Britain and Northern Ireland, 5956–5960.
- Wang et al. (2018a) Yu Wang, Yilin Shen, and Hongxia Jin. 2018a. A Bi-Model Based RNN Semantic Frame Parsing Model for Intent Detection and Slot Filling. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics, New Orleans, Louisiana, 309–314. https://doi.org/10.18653/v1/N18-2050
- Wang et al. (2018b) Yufan Wang, Li Tang, and Tingting He. 2018b. Attention-Based CNN-BLSTM Networks for Joint Intent Detection and Slot Filling. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, Maosong Sun, Ting Liu, Xiaojie Wang, Zhiyuan Liu, and Yang Liu (Eds.). Springer, Cham, 250–261.
- Wang (2010) Ye-Yi Wang. 2010. Strategies for statistical spoken language understanding with small amount of data - an empirical study. In Interspeech. ISCA, Makuhari, Japan, 2498–2501.
- Weld et al. (2021a) Henry Weld, Guanghao Huang, Jean Lee, Tongshu Zhang, Kunze Wang, Xinghong Guo, Siqu Long, Josiah Poon, and Caren Han. 2021a. CONDA: a CONtextual Dual-Annotated dataset for in-game toxicity understanding and detection. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2406–2416.
- Weld et al. (2021b) Henry Weld, Xiaoqi Huang, Siqu Long, Josiah Poon, and Soyeon Caren Han. 2021b. A survey of joint intent detection and slot filling models in natural language understanding. ACM Computing Surveys (CSUR) (2021).
- Zhang et al. (2020) Linhao Zhang, Dehong Ma, Xiaodong Zhang, Xiaohui Yan, and Hou-Feng Wang. 2020. Graph LSTM with Context-Gated Mechanism for Spoken Language Understanding. In AAAI 2020. AAAI Press, New York, USA, 9539–9546.
- Zhang and Wang (2019) Linhao Zhang and Houfeng Wang. 2019. Using Bidirectional Transformer-CRF for Spoken Language Understanding. In Natural Language Processing and Chinese Computing, Jie Tang, Min-Yen Kan, Dongyan Zhao, Sujian Li, and Hongying Zan (Eds.). Springer, Cham, 130–141.
- Zhang and Wang (2016) Xiaodong Zhang and Houfeng Wang. 2016. A Joint Model of Intent Determination and Slot Filling for Spoken Language Understanding. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16). AAAI Press, New York, USA, 2993–2999.
- Zhang et al. (2019) Zhichang Zhang, Zhenwen Zhang, Haoyuan Chen, and Zhiman Zhang. 2019. A Joint Learning Framework With BERT for Spoken Language Understanding. IEEE Access 7 (2019), 168849–168858.
- Zhao et al. (2018) Xinlu Zhao, Haihong E, and Meina Song. 2018. A Joint Model based on CNN-LSTMs in Dialogue Understanding. In 2018 International Conference on Information Systems and Computer Aided Education (ICISCAE). IEEE, Piscataway, USA, 471–475.
- Zhou et al. (2016) Qianrong Zhou, Liyun Wen, Xiaojie Wang, Long Ma, and Yue Wang. 2016. A Hierarchical LSTM Model for Joint Tasks. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, Maosong Sun, Xuanjing Huang, Hongfei Lin, Zhiyuan Liu, and Yang Liu (Eds.). Springer, Cham, 324–335.