Improved intent classification based on context information using a windows-based approach

Jeanfranco D. Farfan-Escobedo
Institute of Computing
University of Campinas
Campinas/SP, Brazil
Email: [email protected]

Julio C. Dos Reis
Institute of Computing
University of Campinas
Campinas/SP, Brazil
Email: [email protected]

Abstract

Conversational systems include a Natural Language Understanding (NLU) module. One task in this module, intent classification, aims to identify what a user is attempting to achieve from an utterance. Previous works use only the current utterance to predict the intent of a given query and do not consider the role of the context (one or a few previous utterances) in the dialog flow. In this work, we propose several approaches to investigate the role of contextual information in the intent classification task. Each approach concatenates part of the dialogue history with the current utterance. Our intent classification method follows a window-based approach in which a convolutional neural network operates on vector representations obtained from BERT. Our experiments were carried out on a real-world Brazilian Portuguese corpus with dialog flows provided by the company Wavy Global. Our results achieved substantial improvements over the baseline, isolated utterances (without context), in three approaches that use the user’s utterance and the system’s response from previous messages as dialogue context.

Index Terms:

Intent classification, context, dialogue flow, Brazilian Portuguese, BERT.

I INTRODUCTION

When a human-to-human conversation takes place, people “deduce” the intention of a sentence based on the context of the conversation (one or a few previous utterances). People do not interpret an intention based only on an isolated utterance within a dialogue flow. Consequently, the analysis of an utterance to identify a user’s intention in a dialogue system can benefit from the conversation context. However, the literature on intent classification does not address this aspect.

Within a dialogue system, two agents interact during the communicative process: the user and the system. A conversation is usually structured in turns; each turn consists of one utterance from the user and a single system response. The resulting sequence of turns forms a dialog flow.

Table I shows an example of a conversation between a user and a system, in which $U$ represents the user and $S$ the system. The conversational flow column gives the intent of each user utterance $U$.

TABLE I: An example of a conversation between a user and a system.

Utterance	Conversational flow
U: I’d like Peruvian food	food_information
S: ”La Clave del Sabor” is a delicious restaurant located near the historic downtown of Cusco.
U: But, I am allergic to shrimps	user_information
S: Don’t worry, ”La Clave del Sabor” has beef, chicken, alpaca, guinea pig and fish food
U: Is it reasonably priced?	cost_information
S: Yes, ”La Clave del Sabor” is in the moderate price range
U: What is the phone number?	request_information
S: The number of ”La Clave del Sabor” is +51974264215.
U: I need to travel the next day to Machu Picchu, is there a tour agency nearby?	another_information
S: Two minutes from ”La Clave del Sabor” restaurant is the ”Inca Travel” tourism agency

The literature has not carried out comprehensive studies on the use of context in the intent classification task. To the best of our knowledge, this problem remains open. Current studies do not examine the relevance of context during intent classification and use only the current utterance as input to predict the user’s intent [1, 2, 3]. Works on other tasks, such as Dialogue State Tracking and Dialogue Act classification, do not standardize the context they use: some use all previous information, and others use only the last system reply.

We propose a novel approach to managing context, a window-based procedure. Our strategy applies windows over the previous information in the dialog flow, selecting which earlier utterances accompany the current one.

In our work, we propose several combinations of a dialog flow’s earlier utterances to find the most useful contextual information. These approaches use previous utterances of a dialog flow to increase the contextual knowledge available to our model. This architecture makes it possible to train on the dialogue history and the current utterance jointly. Our approach addresses imbalanced datasets because we modify the loss function to penalize misclassification of the underrepresented classes more than the dominant ones. Our results achieved substantial improvements over the baseline in three approaches that use the user’s utterance and the system’s response from previous messages as dialogue context.

The remainder of this paper is organized as follows. Section II presents related works. Section III describes our methodology. Section IV presents the experimental results and discusses our findings. Section V concludes and suggests future works.

II RELATED WORKS

We organized existing investigations into three categories. First, we reviewed recent papers that address the intent classification task based on isolated utterances and disconnected sentences. Chen et al. [1] proposed a joint intent classification and slot-filling model based on BERT, aiming to address the poor generalization capability of traditional NLU models. Shridhar et al. [3] used Semantic Hashing as embedding for the task of intent classification on three frequently used benchmarks: AskUbuntu, Chatbot, and Web Application. Active learning methods have also been applied to this task. Zhang and Zhang [2] designed an ensemble deep active learning method, which constructs intent classifiers based on BERT and uses an ensemble sampling method to choose informative data for efficient training in Chinese and English. Farfan-Escobedo et al. [4] analyzed active learning techniques that minimize the amount of labeled data required to build prediction models.

Second, we reviewed whether current tools use context for the intent classification task. According to Liu et al. [5], none of the publicly available Natural Language Understanding (NLU) toolkits, such as Dialogflow, LUIS, and Rasa, use dialogue context for intent classification and NER.

Third, we analyzed proposals that use the dialog flow in other tasks. Several studies have explored the dialog flow to improve accuracy in different tasks. Khanpour et al. [6] applied a deep-stacked Long Short-Term Memory (LSTM) network with pre-trained word embeddings to classify dialogue acts (DAs) in open-domain conversations. The main reason for stacking LSTM cells is to capture longer dependencies between terms in the input chain. For the dialogue state tracking task, Gulyaev et al. [7] proposed a GOaL-Oriented Multi-task BERT-based dialogue state tracker (GOLOMB) inspired by architectures for reading comprehension question answering systems. The model uses the dialogue history to predict the next slots. Similarly, Wu et al. [8] proposed task-oriented dialogue BERT (TOD-BERT). This pre-trained model outperforms strong baselines like BERT on several downstream task-oriented dialogue applications. TOD-BERT concatenates all the utterances in the same dialogue into one sequence to capture speaker information and the underlying interaction behavior in the dialogue. In the same way, Chao and Lane [9] proposed BERT-DST, an end-to-end dialogue state tracker that directly extracts slot values from the dialogue context. BERT-DST uses BERT to identify slot values from their semantic context, and it takes the system utterance from the previous turn and the current-turn user utterance as dialogue context input.

The literature has presented solutions for intent classification based only on isolated utterances, as well as solutions for dialogue act classification and dialogue state tracking that lack a standard for selecting the context. Therefore, there is a lack of studies addressing the relevance of context and identifying the most relevant previous information for the intent classification task and other domains.

III METHODOLOGY

We propose a novel approach to managing context, a window-based procedure. Our window-based strategy applies windows over the previous information in the dialog flow. Figure 1 presents our methodology. It is composed of four components, each delimited by dotted lines of a different color: preprocessing (dotted red lines); dialogue context module (dotted blue lines); feature extraction (dotted green lines); and classifier (dotted orange lines).

Figure 1 shows a pre-processed dialog flow (red circle, A) and then the several types of trajectories encoded in our solution (blue circle, B). Our model takes a dialogue context and the current utterance as input for each user turn (purple circle, C). The BERT-based encoding module encodes the dialogue context input to produce contextualized sentence-level and token-level representations (green circle, D). The classification module then uses the sentence-level representation ( $b_{0}$ ) to generate a categorical distribution over the twenty-two categories (orange circle, E).

Figure 1: We defined five dialogue representation trajectories: ’all context’, ’user context’, ’last user and system’, ’last user context’, and ’last system context’. Each representation is used to concatenate part of the dialogue history with the current utterance. ’all context’ uses the complete previous information (user and system) as context; ’user context’ uses the user’s previous messages; ’last user and system’ uses the last utterances of the user and the system; ’last user context’ uses the user’s last utterance; and ’last system context’ uses the system’s last utterance.

III-A Preprocessing

We apply preprocessing to clean the input queries. The preprocessing step includes lower-casing, removing punctuation, removing URLs, and removing stop words; the message is finally represented as a vector of tokens. For instance, for the utterance ’Já paguei o boleto da campanha ## porque ele ainda está pendente no site https://campanha.com/5cb4abba34070929d959d32d?’ the corresponding output is ’paguei boleto campanha porque ainda pendente’.
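The cleaning steps listed above can be sketched as follows. This is a minimal illustration, not the pipeline used in the paper: the stop-word set `STOP_WORDS`, the URL pattern, and the function name `preprocess` are assumptions, since the original list and tooling are not specified.

```python
import re
import string
from typing import List

# Hypothetical, deliberately tiny stop-word list (assumption: the paper
# does not say which Portuguese stop-word list was used).
STOP_WORDS = {"já", "o", "a", "da", "de", "do", "ele", "ela", "está", "no", "na", "em"}

# Assumed URL pattern: anything starting with http:// or https:// up to whitespace.
URL_RE = re.compile(r"https?://\S+")

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(utterance: str) -> List[str]:
    """Lower-case, drop URLs, drop punctuation, drop stop words, tokenize."""
    text = utterance.lower()
    text = URL_RE.sub(" ", text)          # remove URLs before punctuation
    text = text.translate(PUNCT_TABLE)    # remove punctuation
    return [tok for tok in text.split() if tok not in STOP_WORDS]


example = (
    "Já paguei o boleto da campanha ## porque ele ainda está pendente "
    "no site https://campanha.com/5cb4abba34070929d959d32d?"
)
print(preprocess(example))
```

With this toy stop-word list the token 'site' survives, whereas the output reported above drops it, so the original stop-word list or URL handling evidently differs from the one assumed here.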

III-B Dialogue context module

To capture user information and the system’s response in a dialogue flow, we used two special tokens, $[USER]$ and $[SYSTEM]$ . We prefix the special tokens to each user utterance and system response, then concatenate all utterances in the same dialogue into one flat sequence (Figure 1). For example, for a dialogue $D$ $=$ $\{U_{1},S_{1},...,U_{n},S_{n}\}$ , where $n$ is the number of conversation turns (messages sent by the user), and each $U_{i}$ or $S_{i}$ contains a sequence of words, the input of the pre-trained model is processed as “ $[USER]$ $U_{1}$ $[SYSTEM]$ $S_{1}$ $...$ ” with standard positional embeddings and segmentation embeddings. The final representation starts with a $[CLS]$ token followed by the dialogue representation, and the $[SEP]$ token delimits the representation so that it can be fed to the feature extraction module.
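The five context windows and the flat-sequence format can be illustrated with a short sketch. The function name `build_input`, the `WINDOWS` table, and the representation of the history as (user utterance, system response) pairs are assumptions made for illustration; only the window names and the $[CLS]$, $[USER]$, $[SYSTEM]$, and $[SEP]$ markers come from the text.

```python
from typing import List, Tuple

USER, SYSTEM = "[USER]", "[SYSTEM]"

# window name -> (use only the last turn?, keep user utterances?, keep system responses?)
WINDOWS = {
    "all context": (False, True, True),
    "user context": (False, True, False),
    "last user and system": (True, True, True),
    "last user context": (True, True, False),
    "last system context": (True, False, True),
}


def build_input(history: List[Tuple[str, str]], current: str, window: str) -> str:
    """Concatenate the selected part of the history with the current utterance.

    `history` holds (user utterance, system response) pairs of earlier turns,
    oldest first; `current` is the user utterance to classify.
    """
    last_only, keep_user, keep_system = WINDOWS[window]
    turns = history[-1:] if last_only else history
    parts = ["[CLS]"]
    for user_utt, system_resp in turns:
        if keep_user:
            parts += [USER, user_utt]
        if keep_system:
            parts += [SYSTEM, system_resp]
    parts += [USER, current, "[SEP]"]
    return " ".join(parts)


history = [("consultar pontuação mundo", "consultando pontos")]
print(build_input(history, "credito boleto faço desconto", "last user and system"))
```

In a real pipeline the tokenizer would typically insert $[CLS]$ and $[SEP]$ itself, and $[USER]$ and $[SYSTEM]$ would have to be registered as special tokens; the sketch writes them out as plain strings only to make the resulting sequence visible.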

III-C Feature extraction

The feature extraction module is based on BERT [10]. Figure 1 presents the user utterance and system response from the previous turn as dialogue context (purple circle, C). Our solution also uses the current user utterance as input, represented as a token sequence in BERT input format. The first token is $[CLS]$ , followed by the tokenized user utterance and system response, then the current tokenized user utterance, and $[SEP]$ . For instance, in Figure 1 we use the following token sequence: ’ $[CLS]$ $[USER]$ consultar pontuação mundo $[SYSTEM]$ consultando pontos $[USER]$ credito boleto faço $...$ desconto $[SEP]$ ’.

Formally, let $[x_{0},x_{1},...,x_{n}]$ denote the input token sequence. The BERT input layer embeds each token $x_{i}$ into an embedding $e_{i}$ , which is the sum of three embeddings:

$$e_{i}=\mathrm{BERT}_{\mathrm{input}}(x_{i})=E_{tok}(x_{i})+E_{seg}(x_{i})+E_{pos}(x_{i}),\quad e_{i}\in\mathbb{R}^{d},\ \forall\ 0\leq i\leq n \tag{1}$$

where $E_{tok}$ is the token embedding, $E_{seg}$ is the segment embedding, and $E_{pos}$ is the position embedding for each token $x_{i}$ . The embedded input sequence $[e_{0},...,e_{n}]$ is then passed to BERT’s bidirectional Transformer encoder, whose final hidden states are denoted by $[b_{0},...,b_{n}]$ (Figure 1):

$$[b_{0},...,b_{n}]=\mathrm{BiTransformer}([e_{0},...,e_{n}]),\quad b_{i}\in\mathbb{R}^{d},\ \forall\ 0\leq i\leq n \tag{2}$$

Our solution uses the contextualized sentence-level representation $b_{0}$ , the final hidden state corresponding to the $[CLS]$ token.
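As a toy illustration of the sum in Eq. 1, the sketch below adds token, segment, and position vectors element-wise for each input position. The lookup tables, the dimension $d=2$, and the function name `embed` are hypothetical; a real BERT model learns these tables and uses a much larger $d$.

```python
from typing import Dict, List


def embed(
    tokens: List[str],
    segments: List[int],
    e_tok: Dict[str, List[float]],
    e_seg: List[List[float]],
    e_pos: List[List[float]],
) -> List[List[float]]:
    """Return e_i = Etok(x_i) + Eseg(x_i) + Epos(x_i) for every position i."""
    return [
        [t + s + p for t, s, p in zip(e_tok[tok], e_seg[seg], e_pos[i])]
        for i, (tok, seg) in enumerate(zip(tokens, segments))
    ]


# Hypothetical 2-dimensional lookup tables, chosen only for readability.
e_tok = {"[CLS]": [1.0, 0.0], "oi": [0.0, 1.0]}
e_seg = [[0.5, 0.5]]                  # a single segment id (0)
e_pos = [[0.0, 0.0], [1.0, 1.0]]      # positions 0 and 1

print(embed(["[CLS]", "oi"], [0, 0], e_tok, e_seg, e_pos))
```

The same token therefore receives a different input vector at each position, which is what lets the encoder distinguish the context utterances from the current one.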

III-D Classifier

The input for the classification module is the sentence-level representation $b_{0}$ from the feature extraction module. The classification module maps $b_{0}$ to one of the twenty-two categories using a Convolutional Neural Network (CNN) and softmax.

We used a CNN to capture intrinsic syntactic and semantic patterns from input sentences [11]. The CNN consists of convolution, pooling, and activation layers (Table II). At each convolution layer, a set of kernels, which act as filters, is convolved with the input vector. The pooling layer is a max-pooling function, which reduces the vector size and increases the receptive field size. The activation layer adds nonlinearity to the neural network. In this case, it is a Rectified Linear Unit (ReLU), which replaces negative inputs with 0 and keeps positive inputs unchanged.

TABLE II: Convolutional Neural Network configuration.

Type	Description	#filters	filter size	stride	#units	rate
BN	Batch normalization	-	-	-	-	-
conv	1D Convolution	64	(1 $\times$ 9)	1	-	-
Pool	Max-pooling	-	(1 $\times$ 2)	2	-	-
bn	Batch normalization	-	-	-	-	-
do	Dropout	-	-	-	-	0.6
conv	1D Convolution	32	(1 $\times$ 9)	1	-	-
Pool	Max-pooling	-	(1 $\times$ 2)	2	-	-
bn	Batch normalization	-	-	-	-	-
do	Dropout	-	-	-	-	0.6
FC	Fully-connected	-	-	-	30	-
bn	Batch normalization	-	-	-	-	-
do	Dropout	-	-	-	-	0.25
FC	Fully-connected	-	-	-	22	-
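To make the layer sizes in Table II concrete, the following sketch traces the feature-map length through the two convolution and pooling stages. It rests on assumptions that the text does not state: the input is $b_{0}$ treated as a single-channel 1-D signal of length 768 (the hidden size of the multilingual BERT base model), and the convolutions use no padding. Batch normalization and dropout are omitted because they do not change shapes.

```python
from typing import List, Optional, Tuple

# (kind, #filters, filter size, stride), taken from Table II.
LAYERS: List[Tuple[str, Optional[int], int, int]] = [
    ("conv", 64, 9, 1),
    ("pool", None, 2, 2),
    ("conv", 32, 9, 1),
    ("pool", None, 2, 2),
]


def trace_shapes(input_length: int = 768) -> List[Tuple[str, int, int]]:
    """Return (layer, channels, length) after each layer, assuming no padding."""
    channels, length = 1, input_length
    shapes = [("input", channels, length)]
    for kind, filters, size, stride in LAYERS:
        # Standard output-length formula for an unpadded 1-D window.
        length = (length - size) // stride + 1
        if kind == "conv" and filters is not None:
            channels = filters
        shapes.append((kind, channels, length))
    return shapes


for name, channels, length in trace_shapes():
    print(f"{name:5s} channels={channels:3d} length={length}")

_, channels, length = trace_shapes()[-1]
print("flattened features fed to the 30-unit layer:", channels * length)
```

Under these assumptions the flattened vector entering the first fully-connected layer has 32 × 186 = 5952 features; a different padding mode or input size would change that number.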

IV EXPERIMENTAL RESULTS

IV-A Dataset

For the evaluation of the proposed approach, we use the Wavy Global Dataset (WvGD), a dataset of conversations between humans and chatbots in Brazilian Portuguese. The dataset contains 7,574 conversations with at least three and at most fifteen turns, where each turn consists of a user’s query and a system’s response. In total, there are 36,056 queries, and the number of categories is 22. Table III summarizes the corpus statistics.

TABLE III: Corpus statistics for the WvGD dataset.

Average number of words per user’s query	3.75
Standard deviation per user’s query	2.92
Variance per user’s query	8.57
Average number of words per system’s response	33.76
Standard deviation per system’s response	30.65
Variance per system’s response	939.48

IV-B Experimental Settings

In our experimental procedure, we randomly divided the WvGD into 60% of samples for training the classifier, 20% for validation, and 20% for testing. We used the pre-trained multilingual BERT model as the feature extractor for our window-based approach.
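A random 60/20/20 split of this kind can be sketched as below. The function name `split_dataset` and the fixed seed are assumptions, and the text does not say whether the split was made over individual queries or over whole conversations; the sketch works for either, depending on what the list of samples contains.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split into 60% train, 20% validation, and the remainder as test."""
    items = list(samples)
    random.Random(seed).shuffle(items)   # seeded for reproducibility (assumption)
    n_train = len(items) * 60 // 100
    n_val = len(items) * 20 // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


# Using the 36,056 queries reported for WvGD as placeholder sample ids.
train, val, test = split_dataset(range(36056))
print(len(train), len(val), len(test))
```

When context windows are used, splitting by conversation rather than by query keeps utterances of the same dialogue from appearing in both training and test data, which may be worth considering when reproducing the setup.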

For the convolutional neural network, we used a cross-entropy loss on the prediction target. We updated all layers in the model using RMSprop optimization with early stopping on the validation set. During training, we used dropout rates of 25% and 60% together with batch normalization (Table II). Finally, the model was trained for 50 epochs.

IV-C Results and Discussion

Results in Table IV reveal that our approaches based on the last utterance from the user and the system (user-system context, last-user context, and last-system context) outperformed the baseline model in accuracy, precision, and F1-score. Recall also improved for the last-user and last-system contexts, while the user-system context was marginally below the baseline (82.94 vs. 83.15). In contrast, the baseline model obtained better accuracy, recall, and F1-score than the two approaches based on the whole conversation (all context and user context). These results make sense because the dialogue is not just a sequence of independent utterances, but rather a collective action performed by the user and the system.

TABLE IV: Comparison between isolated queries (without context) and window-based approaches. The highest results are highlighted in red. Baseline results are highlighted in blue.

Approach	Accuracy	Recall	Precision	F1-Score
without context	85.32	83.15	87.55	85.25
all context	66.52	55.22	80.54	65.14
user context	78.00	69.34	88.91	77.57
user-system context	86.08	82.94	90.55	86.49
last-user context	87.18	85.01	89.79	87.28
last-system context	86.56	84.18	89.85	86.88

The results above do not account for class imbalance, in which some classes appear much more often in the dataset than others. The problem is that the model is likely to learn to predict only the dominant classes. A potential strategy is to modify the loss function to penalize misclassification of the underrepresented classes more than the dominant ones. We therefore modify the cross-entropy loss by multiplying the term of each class by a loss value for that label:

$$L_{CE}=-\sum_{i=1}^{n}t_{i}\log(P_{i})\times L_{i},\quad\text{for } n \text{ classes,} \tag{3}$$

where $t_{i}$ is the ground-truth label, $P_{i}$ is the softmax probability, and $L_{i}$ is the loss value for the $i^{th}$ class. Consequently, adding loss values requires re-training the model, tuning its penalization during training. Table V presents the new results obtained with the adapted loss function.
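Eq. 3 can be written out in a few lines for a single example. The sketch below is illustrative only: the function names are hypothetical, and the inverse-frequency rule used to pick the per-class loss values $L_{i}$ is an assumption, since the text does not say how those values were chosen.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log of the softmax probabilities P_i."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def weighted_cross_entropy(
    logits: Sequence[float], target: int, class_weights: Sequence[float]
) -> float:
    """L_CE = -sum_i t_i * log(P_i) * L_i, with t the one-hot ground truth."""
    log_probs = log_softmax(logits)
    one_hot = [1.0 if i == target else 0.0 for i in range(len(logits))]
    return -sum(t * lp * w for t, lp, w in zip(one_hot, log_probs, class_weights))


def inverse_frequency_weights(class_counts: Sequence[int]) -> List[float]:
    """Assumed weighting rule: rarer classes receive proportionally larger L_i."""
    total, n = sum(class_counts), len(class_counts)
    return [total / (n * count) for count in class_counts]


weights = inverse_frequency_weights([90, 10])      # class 1 is the rare class
print(weights)
print(weighted_cross_entropy([0.0, 0.0], target=0, class_weights=weights))
print(weighted_cross_entropy([0.0, 0.0], target=1, class_weights=weights))
```

For the same predicted distribution, an error on the rare class contributes a larger loss than an error on the dominant class, which is the penalization effect described above.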

TABLE V: Comparison between isolated intent (without context) and windows-based approaches using loss values. The highest results are highlighted in red. Baseline results are highlighted in blue.

Approach	Accuracy	Recall	Precision	F1-Score
without context	85.19	84.62	86.17	85.37
all context	70.77	66.85	79.25	72.38
user context	78.64	71.55	88.67	78.92
user-system context	86.21	83.97	90.14	86.89
last-user context	87.58	85.96	89.49	87.65
last-system context	87.45	85.90	89.37	87.57

With the adapted loss (Table V), our model again obtains its best results when it uses only the last utterance from the user, the system, or both (user-system context, last-user context, and last-system context). Among these, concatenating the last-user context with the current utterance yields the highest accuracy, recall, and F1-score, making it the best context for predicting the current user’s intent.

V CONCLUSIONS AND FUTURE WORKS

Intent classification in dialog flow processing can benefit from addressing contextual information. In this work, we proposed window-based approaches for the intent classification task. Our window-based strategy made use of distinct types of windows on the contextual information in a conversation between a user and the system. We identified and defined five types of trajectories and experimented with each of them. For experimental purposes, we used the WvGD dataset, which provides dialogue flows of between 3 and 15 queries. We conducted experiments to understand the added value of our window-based approaches for intent classification. Our results achieved substantial improvements over the baseline in three approaches that use the user’s utterance and the system’s response from previous messages as dialogue context. In addition, we modified our loss function to penalize misclassification of the underrepresented classes more than the dominant ones, which enabled us to address the imbalance of our dataset. The final results showed the benefit of this decision, which increased the accuracy of all context-based approaches. As future work, we will investigate the use of our window-based approach on the Dialogue State Tracking task, because current works do not standardize the dialogue history, and we will propose and experiment with other approaches to intent classification that rely on contextual information.

VI ACKNOWLEDGMENT

Jeanfranco D. Farfan-Escobedo thanks the financial support of Sinch Latin America.

References

[1] Q. Chen, Z. Zhuo, and W. Wang, “BERT for joint intent classification and slot filling,” arXiv preprint arXiv:1902.10909, 2019.
[2] L. Zhang and L. Zhang, “An ensemble deep active learning method for intent classification,” in Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence, 2019, pp. 107–111.
[3] K. Shridhar, A. Sahu, A. Dash, P. Alonso, G. Pihlgren, V. Pondeknath, F. Simistira, and M. Liwicki, “Subword semantic hashing for intent classification on small datasets,” arXiv preprint arXiv:1810.07150, 2018.
[4] J. D. Farfan-Escobedo, K. Lopes, and J. C. Dos Reis, “Active learning approach for intent classification in Portuguese language conversations,” in 2021 IEEE 15th International Conference on Semantic Computing (ICSC). IEEE, 2021, pp. 227–232.
[5] X. Liu, A. Eshghi, P. Swietojanski, and V. Rieser, “Benchmarking natural language understanding services for building conversational agents,” in Increasing Naturalness and Flexibility in Spoken Dialogue Interaction. Springer, 2021, pp. 165–183.
[6] H. Khanpour, N. Guntakandla, and R. Nielsen, “Dialogue act classification in domain-independent conversations using a deep recurrent neural network,” in Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, 2016, pp. 2012–2021.
[7] P. Gulyaev, E. Elistratova, V. Konovalov, Y. Kuratov, L. Pugachev, and M. Burtsev, “Goal-oriented multi-task BERT-based dialogue state tracker,” arXiv preprint arXiv:2002.02450, 2020.
[8] C.-S. Wu, S. Hoi, R. Socher, and C. Xiong, “TOD-BERT: Pre-trained natural language understanding for task-oriented dialogues,” arXiv preprint arXiv:2004.06871, 2020.
[9] G.-L. Chao and I. Lane, “BERT-DST: Scalable end-to-end dialogue state tracking with bidirectional encoder representations from transformer,” arXiv preprint arXiv:1907.03040, 2019.
[10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[11] S. Schwarz, A. Theóphilo, and A. Rocha, “EMET: Embeddings from multilingual-encoder transformer for fake news detection,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 2777–2781.