KULCQ: An Unsupervised Keyword-based Utterance Level Clustering Quality Metric
Abstract
Intent discovery is crucial for both building new conversational agents and improving existing ones. While several approaches have been proposed for intent discovery, most rely on clustering to group similar utterances together. Traditional evaluation of these utterance clusters requires intent labels for each utterance, limiting scalability. Although some clustering quality metrics exist that do not require labeled data, they focus solely on cluster geometry while ignoring the linguistic nuances present in conversational transcripts. In this paper, we introduce Keyword-based Utterance Level Clustering Quality (KULCQ), an unsupervised metric that leverages keyword analysis to evaluate clustering quality. We demonstrate KULCQ’s effectiveness by comparing it with existing unsupervised clustering metrics and validate its performance through comprehensive ablation studies. Our results show that KULCQ better captures semantic relationships in conversational data while maintaining consistency with geometric clustering principles.
1 Introduction
Over the past few years, there has been a proliferation of frameworks for developing task-oriented dialog systems. Most such systems use intent classification to determine the intent of a user. Since intents are domain-specific, they vary significantly depending on the tasks a system is configured for. It is thus crucial to perform open intent discovery Zhang et al. (2021) in order to build task-oriented dialog systems. While there are metrics like Normalized Mutual Information (NMI) McDaid et al. (2011) and Adjusted Rand Index (ARI) for evaluating the performance of open intent discovery, they require annotated intent labels. Annotation is hard to scale and slows down the process of building task-oriented dialog systems. There is therefore a need to measure the goodness of the intent clusters obtained from open intent discovery in an unsupervised way.
Clustering textual data is used in various fields such as computational social science Hox (2017), conversational AI Cuayáhuitl et al. (2019), and recommendation systems Das et al. (2014). Intent detection is one of the most crucial tasks in conversational AI, implemented via classification or clustering. Classification, being supervised, requires labels, which makes clustering a natural choice for discovery and scalability. Various unsupervised clustering evaluation metrics, such as the Silhouette coefficient Rousseeuw (1987), Calinski-Harabasz index Caliński and Harabasz (1974), Davies-Bouldin index Davies and Bouldin (1979), and Dunn index Dunn (1973), accept data from any domain and apply the same algorithm each time, as long as a similarity measure is defined between pairs of data points. These metrics focus only on the geometry of the clusters and are agnostic to the domain-specific nuances of the data. It is therefore important to have metrics that evaluate clusters by leveraging aspects that matter for the domain the data belongs to.
In this paper, we introduce the Keyword-based Utterance Level Clustering Quality (KULCQ) measure, which is specific to analysing clusters of conversational data (Section 2.2). The motivation behind leveraging keywords for conversational data is to capture similar intents across utterances despite differences in sentence structure and function words, which can lead to large differences in sentence embeddings. For example, "How can I get a taxi from A to B?" and "I want a taxi from A to B, can you help?" may differ in representation even though taxi, A, and B are the most important words in both cases.
In our experiments we focus on evaluating the clustering of intents from user query utterances in dialogues between customer support agents and customers. We show in Section 3.1 that our metric can be used as a universal clustering metric: it follows the trend of a standard metric in an adverse scenario and correctly captures the overall quality of the clusters. In Sections 3.2 and 3.3, we highlight the shortcomings of a standard domain-agnostic metric such as the Silhouette coefficient in cases specific to conversational data. Our proposed metric handles these cases and provides an advantage in accurately measuring the clustering quality of conversational data.
2 Methodology
This section gives an overview of the Silhouette method, a well-known unsupervised metric for evaluating clusters of data from any domain (Section 2.1). We compare it with our algorithm, which is inspired by and builds on the Silhouette method. We then define our measure, KULCQ (Section 2.2), an enhanced metric specifically for evaluating clusters of conversational data.
2.1 Silhouette
For a given object $x$ belonging to cluster $y$, the Silhouette coefficient Rousseeuw (1987) is defined as a combination of an intra-cluster metric and an inter-cluster metric. The intra-cluster measure $a(x)$ is defined as the average dissimilarity of $x$ to all other objects of cluster $y$. The inter-cluster metric is defined as:

$$b(x) = \min_{i \in \mathcal{C} \setminus \{y\}} d_i(x)$$

in which $\mathcal{C}$ is the set of all clusters and $d_i(x)$ is defined as the average dissimilarity of object $x$ to all objects of cluster $i$. Finally, $a(x)$ and $b(x)$ are aggregated as:

$$s(x) = \frac{b(x) - a(x)}{\max\left(a(x), b(x)\right)}$$
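For reference, per-object and aggregate Silhouette values can be computed with scikit-learn; below is a minimal sketch with placeholder embeddings and labels, using cosine distance as the dissimilarity to match our later experiments:

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Placeholder inputs: X is an (n_utterances, dim) embedding matrix,
# labels holds one cluster id per utterance.
X = np.random.rand(100, 384)
labels = np.random.randint(0, 5, size=100)

# Per-object scores s(x), and their dataset-level mean.
per_utterance = silhouette_samples(X, labels, metric="cosine")
overall = silhouette_score(X, labels, metric="cosine")
```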
2.2 KULCQ
Similar to the Silhouette method, the KULCQ score is a combination of an intra-cluster measure and an inter-cluster measure. Let us say we have a set of clusters $\mathcal{C} = \{c_1, \dots, c_m\}$. Each cluster $c_j$ includes a set of utterances $U_j$. First, we extract keywords from each utterance, combining the keywords obtained from two libraries: KeyBERT (https://github.com/MaartenGr/KeyBERT) and Yake Campos et al. (2020). KeyBERT defines keywords as the words that are most similar to the document as a whole in terms of embeddings, with cosine similarity as the similarity metric. Yake, on the other hand, uses statistical features to identify and rank the most important keywords. We therefore use a combination of the two methods to get the keywords for each utterance. A keyword can be either a unigram or a bigram. Based on the utterance keywords over the entire cluster, we find the top $k$ most frequent keywords in each cluster ($K_j$ denotes the set of top $k$ most frequent keywords for cluster $c_j$, so $|K_j| = k$). A centroid $\mu_j$ for each cluster $c_j$ is calculated as the weighted average of its utterances:

$$\mu_j = \frac{\sum_{u \in U_j} w_u \, e(u)}{\sum_{u \in U_j} w_u}$$

in which $U_j$ is the set of utterances belonging to cluster $c_j$ and $e(u)$ is the representation of utterance $u$ under an embedding method. The weight $w_u$ of utterance $u$ for calculating the centroid of the cluster is the proportion of the top $k$ frequent keywords appearing in the utterance. More precisely:

$$w_u = \frac{|\mathrm{kw}(u) \cap K_j|}{k}$$

where $\mathrm{kw}(u)$ is the set of keywords extracted from $u$. The intra-cluster score is calculated for the cluster as a whole and is then assigned as the intra-cluster score of each utterance in that cluster. It is defined as the average of the distances from each utterance to the cluster centroid:

$$a(u) = \frac{1}{|U_j|} \sum_{u' \in U_j} d\left(e(u'), \mu_j\right), \quad u \in U_j$$

in which $d$ is a distance function (we use cosine distance in our experiments). The inter-cluster metric for an utterance $u$ belonging to cluster $c_j$ is defined as the weighted average of the distances from utterance $u$ to the centroids of the other clusters:

$$b(u) = \frac{\sum_{i \neq j} w_{ji} \, d\left(e(u), \mu_i\right)}{\sum_{i \neq j} w_{ji}}$$

We calculate the weight $w_{ji}$ as the reciprocal of the number of overlapping top $k$ frequent keywords between clusters $c_j$ and $c_i$:

$$w_{ji} = \frac{1}{|K_j \cap K_i|}$$

By using the reciprocal of the number of overlapping keywords between clusters as a weight when calculating the inter-cluster distance, we discourage different clusters from having a similar set of most important keywords, and encourage clusters to be far away from each other in terms of their most important keywords. Finally, the KULCQ score for an utterance is calculated by combining its inter- and intra-cluster scores, similar to Silhouette:

$$\mathrm{KULCQ}(u) = \frac{b(u) - a(u)}{\max\left(a(u), b(u)\right)}$$
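The following is a minimal sketch of this computation, assuming keyword sets and embeddings have already been produced per utterance; all names are ours, and we guard the zero-overlap case $|K_j \cap K_i| = 0$, which the formula above leaves implicit:

```python
import numpy as np
from collections import Counter

def kulcq_scores(embeddings, keywords, labels, k=10):
    """embeddings/keywords/labels: dicts keyed by utterance id, mapping to the
    embedding vector, the extracted keyword set, and the cluster id."""
    clusters = sorted(set(labels.values()))
    members = {c: [u for u, y in labels.items() if y == c] for c in clusters}

    # Top-k most frequent keywords K_j per cluster.
    top_kw = {}
    for c in clusters:
        counts = Counter(kw for u in members[c] for kw in keywords[u])
        top_kw[c] = {kw for kw, _ in counts.most_common(k)}

    # Weighted centroid mu_j with weights w_u = |kw(u) ∩ K_j| / k.
    centroids = {}
    for c in clusters:
        w = np.array([len(keywords[u] & top_kw[c]) / k for u in members[c]])
        E = np.stack([embeddings[u] for u in members[c]])
        if w.sum() == 0:                       # fallback: plain mean
            w = np.ones_like(w)
        centroids[c] = (w[:, None] * E).sum(axis=0) / w.sum()

    def cos_dist(a, b):
        return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Intra-cluster score a: computed once per cluster, shared by its utterances.
    intra = {c: float(np.mean([cos_dist(embeddings[u], centroids[c])
                               for u in members[c]])) for c in clusters}

    scores = {}
    for u, c in labels.items():
        # Inter-cluster score b(u): weighted average distance to the other
        # centroids, weight 1 / |K_j ∩ K_i| (zero overlap treated as 1).
        num = den = 0.0
        for c2 in clusters:
            if c2 == c:
                continue
            w = 1.0 / max(1, len(top_kw[c] & top_kw[c2]))
            num += w * cos_dist(embeddings[u], centroids[c2])
            den += w
        a, b = intra[c], num / den
        scores[u] = (b - a) / max(a, b)
    return scores
```

Cluster-level and dataset-level KULCQ values are then simple averages of these per-utterance scores (see Appendix B).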
3 Experiments and Results
Datasets.
We use three popular, public, human-annotated datasets, each containing gold intent labels. Detailed information about the datasets can be found in Appendix A. While we hypothesize that KULCQ is useful for any text clustering task, our application is conversational, so we limit our experiments to related datasets.
In the remainder of this section we analyze the behavior of the Silhouette and KULCQ metrics on clustered conversational data and showcase, through a few scenarios, what KULCQ brings to the table in a conversational setting. The experimental setup we used is described in Appendix B.
3.1 Noise injection
We conducted this experiment to test the correctness of KULCQ. We observe the values given by the Silhouette and KULCQ metrics when different amounts of noise are injected into the clusters. Noise here is defined as a perturbation that takes an utterance from its correctly assigned cluster and puts it into a different (wrong) cluster, and vice versa. Given the annotated clusters, we perturb the cluster of each utterance with probability $p$. In Figure 1 we observe that both the KULCQ and Silhouette scores drop as we inject more noise. This validates the behavior of our metric and shows that it follows the trend of standard metrics; it also confirms that the proposed metric reflects the noise in the clustering. Moreover, Figure 1 shows that KULCQ is more sensitive to noise injection: the drop in KULCQ values is much larger than the drop in Silhouette values as the probability of perturbing each utterance's label increases.
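The perturbation itself is straightforward; a minimal sketch follows (the function name and the uniform choice of the wrong cluster are our assumptions):

```python
import random

def inject_noise(labels, p, seed=0):
    """With probability p, replace each gold cluster id in `labels` with a
    different (wrong) cluster id chosen uniformly at random."""
    rng = random.Random(seed)
    cluster_ids = sorted(set(labels))
    noisy = []
    for y in labels:
        if rng.random() < p:
            y = rng.choice([c for c in cluster_ids if c != y])
        noisy.append(y)
    return noisy
```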
3.2 Bad Geometry, Good Semantics

Here we depict a scenario where a clustering evaluation metric specific to conversational or text data, such as KULCQ, is more helpful than a domain-agnostic metric such as the Silhouette score. We specifically focus on clusters whose utterances, due to minor differences in the words present, end up far from each other in the embedding space. In Figure 2 we show the cluster "supported cards and currencies" from the Finance dataset as a t-SNE van der Maaten and Hinton (2008) plot. At first glance it does not seem to be a great cluster, because its utterances are divided into three regions. The Silhouette coefficient reflects this immediate conclusion: based purely on the geometry of the cluster, it gives the utterances of this cluster the second lowest average score among the clusters of the Finance dataset. However, based on the KULCQ scores, this cluster ranks the highest. KULCQ's judgement turns out to be more reasonable as we look at the cluster and its utterances more closely. We observe that region A includes utterances that talk purely about credit cards (e.g. "What US cards do you accept?", "Do you accept US credit cards?"), region B includes utterances that talk purely about currencies (e.g. "What currencies can I use?", "Do you have a list of the cards and currencies supported?"), and region C's most important keyword is a specific bank's name (e.g. "Is American Express supported for adding funds?", "How do I use American Express to top up my account?"). The annotators of the dataset assigned all such utterances to the same cluster because, despite the minor differences, the intent of the utterances remains the same. By leveraging the importance of keywords in representing the intent of the utterances, KULCQ is able to gain information beyond what meets the eye at first glance: the geometry of the cluster.
3.3 Penalizing high generality and variance
We also evaluate a scenario where the Silhouette metric does not sufficiently penalize certain kinds of clusters for containing highly varied utterances, which can in turn lead to difficulties in conversational settings. In contrast, the KULCQ metric assigns a bad score to the same clusters, thus providing an advantage in conversational AI problems.
One example can be seen in the AskUbuntu dataset, in the "Software Recommendation" cluster. The Silhouette metric gives this cluster a small positive score of 0.04, whereas the KULCQ metric gives it a score of -0.01. Clearly, KULCQ penalizes this cluster much more severely than the Silhouette metric does. As seen in Table 2, the utterances of this cluster are widely varied, and it is highly unlikely that they would be clustered together by any automatic intent discovery algorithm. Moreover, in many conversational AI problems it may not be wise to group utterances as varied as those in Table 2 into one cluster with a general intent. Consider an application of conversational AI such as a chatbot: a slight change in intent can require a completely different action by the bot. In such scenarios it is important that utterances are clustered into more specific intents rather than a highly general one. Keeping this in mind, clusters such as "Software Recommendation" should be penalized significantly in a conversational setting. By leveraging keywords, KULCQ is able to reward tightly-knit, specific clusters and penalize generic, varied ones, which offers a clear advantage for conversational AI applications.
Additionally, we performed clustering over all utterances across datasets using HDBSCAN Campello et al. (2013) and K-means MacQueen (1967) and observed patterns similar to those depicted in Figures 1 and 2. Since the clusters obtained this way are not manually labeled, we report our findings solely on the basis of the ground-truth intent labels.
4 Conclusion
We define a new metric, KULCQ, for evaluating the clustering quality of text data, specifically conversational data. We compare KULCQ closely with a well-known unsupervised cluster evaluation metric, the Silhouette coefficient, and show some of the advantages KULCQ provides in a conversational setting over a domain-agnostic metric that relies solely on the geometry of the clusters. We also demonstrate the correctness of our algorithm and show that it can be used as a universal clustering metric for text data.
We recognize that using the reciprocal of an integer, such as the number of overlapping keywords, directly as a weight may lead to significant fluctuations in the scale of the overall metric even for small differences in the integer value, which in turn may not reflect the clustering quality accurately. We are investigating other ways to incorporate the amount of keyword overlap between clusters, such as 1) using the reciprocal of a probability measure instead, and 2) considering the average cosine similarity between overlapping keywords. We aim to share these findings in future work.
References
- Braun et al. (2017) Daniel Braun, Adrian Hernandez-Mendez, Florian Matthes, and Manfred Langen. 2017. Evaluating natural language understanding services for conversational question answering systems. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 174–185, Saarbrücken, Germany. Association for Computational Linguistics.
- Budzianowski et al. (2018) Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. 2018. MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 5016–5026, Brussels, Belgium. Association for Computational Linguistics.
- Caliński and Harabasz (1974) Tadeusz Caliński and J. Harabasz. 1974. A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3:1–27.
- Campello et al. (2013) Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. 2013. Density-based clustering based on hierarchical density estimates. In Advances in Knowledge Discovery and Data Mining, pages 160–172, Berlin, Heidelberg. Springer Berlin Heidelberg.
- Campos et al. (2020) Ricardo Campos, Vítor Mangaravite, Arian Pasquali, Alípio Jorge, Célia Nunes, and Adam Jatowt. 2020. Yake! keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289.
- Casanueva et al. (2020) Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. 2020. Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 38–45, Online. Association for Computational Linguistics.
- Cuayáhuitl et al. (2019) Heriberto Cuayáhuitl, Donghyeon Lee, Seonghan Ryu, Sungja Choi, Inchul Hwang, and J. Kim. 2019. Deep reinforcement learning for chatbots using clustered actions and human-likeness rewards. 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8.
- Das et al. (2014) Joydeep Das, Partha Mukherjee, Subhashis Majumder, and Prosenjit Gupta. 2014. Clustering-based recommender system using principles of voting theory. In 2014 International Conference on Contemporary Computing and Informatics (IC3I), pages 230–235.
- Davies and Bouldin (1979) David L. Davies and Donald W. Bouldin. 1979. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224–227.
- Dunn (1973) J. C. Dunn. 1973. A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32–57.
- Hox (2017) Joop Hox. 2017. Computational social science methodology, anyone? Methodology, 13:3–12.
- MacQueen (1967) J. MacQueen. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297.
- McDaid et al. (2011) Aaron McDaid, Derek Greene, and Neil Hurley. 2011. Normalized mutual information to evaluate overlapping community finding algorithms. CoRR.
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
- Rousseeuw (1987) Peter J Rousseeuw. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65.
- van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-sne. Journal of Machine Learning Research, 9(86):2579–2605.
- Zhang et al. (2021) Hanlei Zhang, Hua Xu, Ting-En Lin, and Rui Lyu. 2021. Discovering new intents with deep aligned clustering. Proceedings of the AAAI Conference on Artificial Intelligence, 35:14365–14373.
Appendix A Datasets Description
We used public conversational datasets for our experiments, in which all the utterances are annotated with their respective intent labels by humans. Table 1 contains a detailed description of all the datasets used.
| Dataset Name | Description |
|---|---|
| Finance Casanueva et al. (2020) | This dataset includes online banking queries annotated with their corresponding intents. It consists of 10003 utterances and 77 intents. |
| MultiWOZ Budzianowski et al. (2018) | The Multi-Domain Wizard-of-Oz dataset consists of human-human written conversations related to booking hotels, restaurants, and other topics. |
| AskUbuntu Braun et al. (2017) | The dataset includes 162 questions and answers from the AskUbuntu website (https://askubuntu.com). The utterances are annotated with five intent labels. |
Appendix B Experimental Setup
For calculating the KULCQ score (defined in Section 2.2), we extract keywords from each utterance using the Python libraries KeyBERT and Yake.
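A minimal sketch of the per-utterance extraction follows (parameter values here are illustrative, not our exact settings); keywords from the two libraries are pooled, restricted to unigrams and bigrams as described in Section 2.2:

```python
from keybert import KeyBERT
import yake

kb = KeyBERT()                             # embedding-based keyword extraction
yk = yake.KeywordExtractor(lan="en", n=2)  # statistical extraction, up to bigrams

def extract_keywords(utterance, top_n=5):
    # Both libraries return (keyword, score) pairs; we keep the union of
    # the top-ranked keyword strings from each.
    kb_words = {w for w, _ in kb.extract_keywords(
        utterance, keyphrase_ngram_range=(1, 2), top_n=top_n)}
    yk_words = {w for w, _ in yk.extract_keywords(utterance)[:top_n]}
    return kb_words | yk_words
```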
For sentence-level embeddings we use the "all-MiniLM-L6-v2" model from the Sentence Transformers (SBERT) library Reimers and Gurevych (2019).
The KULCQ and Silhouette scores were calculated for all the datasets mentioned above at the utterance, cluster, and dataset levels. For both metrics, a cluster-level score is the average of the utterance-level scores for that cluster, and a dataset-level score is the average of the cluster-level scores over all clusters in the dataset. This was done in order to compare and showcase the advantage KULCQ offers over another standard, unsupervised, utterance-level clustering quality evaluation metric.
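The aggregation itself is a pair of plain averages; a short sketch (names ours):

```python
import numpy as np

def aggregate(utterance_scores, labels):
    """utterance_scores and labels are parallel lists; returns the per-cluster
    means and the dataset-level score (unweighted mean over clusters)."""
    cluster_scores = {
        c: float(np.mean([s for s, y in zip(utterance_scores, labels) if y == c]))
        for c in set(labels)
    }
    dataset_score = float(np.mean(list(cluster_scores.values())))
    return cluster_scores, dataset_score
```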
Note that the actual intent names/labels are not used anywhere, and only the utterances per intent are treated as clusters, from which the keywords are extracted.
Appendix C Example Utterances for the Software Recommendation cluster in the AskUbuntu dataset
| Utterances |
|---|
| Is there an SSH connection manager? |
| MySQL GUI Tools |
| What developer text editors are available for Ubuntu? |
| Is there an application for reading mobi files? |
| Can you recommend a password generator? |