
Dynamic Contexts for Generating Suggestion Questions in RAG Based Conversational Systems

Anuja Tayal 0009-0002-3631-8195 University of Illinois, ChicagoDepartment of Computer ScienceChicagoILUSA [email protected]  and  Aman Tyagi The Procter and Gamble CompanyMasonOhioUSA [email protected]
(2024)
Abstract.

When interacting with Retrieval-Augmented Generation (RAG)-based conversational agents, users must carefully craft their queries to be understood correctly. Yet understanding the system’s capabilities can be challenging for users, leading to ambiguous questions that necessitate further clarification. This work aims to bridge that gap by developing a suggestion question generator. To generate suggestion questions, our approach utilizes dynamic context, which includes both dynamic few-shot examples and dynamically retrieved contexts. Through experiments, we show that the dynamic contexts approach generates better suggestion questions than other prompting approaches.

RAG, Conversational Systems, Question Generation, Prompting, Few-Shot
journalyear: 2024; copyright: acmlicensed; conference: Companion Proceedings of the ACM Web Conference 2024, May 13–17, 2024, Singapore, Singapore; booktitle: Companion Proceedings of the ACM Web Conference 2024 (WWW ’24 Companion), May 13–17, 2024, Singapore, Singapore; doi: 10.1145/3589335.3651905; isbn: 979-8-4007-0172-6/24/05; ccs: Computing methodologies, Natural language generation

1. Introduction

Current conversational systems encounter numerous problems. Firstly, they depend heavily on user inputs, with the agent merely responding to queries (quac, 1, 2). This method is fundamentally flawed, as it fails to actively prompt users to clarify ambiguous or poorly defined questions. As a result, such systems frequently return incorrect responses or resort to apologetic statements when dealing with these inquiries. Secondly, even when the system does provide a correct answer, the user must again formulate an effective follow-up question. This task can be challenging, as users often find it difficult to understand the full scope of the system’s capabilities (tod-cq-clarit, 3). This cycle of inadequate responses forces users to constantly refine their queries, often yielding more apologies than useful information.

Through this paper, we aim to address this challenge and fill the gap by proposing a suggestion question generator. The generator is designed to produce suggestion questions that the agent can answer, guided by the user’s initial query. Our framework encompasses several key steps. First, it identifies pertinent passages (dynamically retrieved contexts) based on the user’s initial query, as the answer to the query hinges on these passages. Second, it selects relevant (question, answer, suggestion question) triples (dynamic few-shot examples). Finally, it generates suggestion questions by constructing a dynamic context prompt consisting of the dynamically retrieved contexts and the dynamic few-shot examples. By providing suggestion questions, users are relieved of the task of question formulation, ensuring a smooth conversational flow (tod-cq-clarit, 3).
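For concreteness, the following is a minimal sketch of this pipeline; the helper and parameter names (retrieve_contexts, select_examples, call_llm, instructions) are illustrative placeholders supplied by the caller, not a released implementation.

from typing import Callable

def generate_suggestion_questions(
    user_query: str,
    instructions: str,                                   # task instruction (Appendix B)
    retrieve_contexts: Callable[[str, int], list[str]],  # RAG retrieval step
    select_examples: Callable[[str, int], list[str]],    # dynamic few-shot selection
    call_llm: Callable[[str], str],                      # ChatGPT / GPT-4 / Claude-2
    k_contexts: int = 4,
    k_examples: int = 3,
) -> str:
    # 1. Dynamically retrieved contexts: passages relevant to the initial query.
    contexts = retrieve_contexts(user_query, k_contexts)
    # 2. Dynamic few-shot: formatted (question, answer, suggestion question) triples.
    examples = select_examples(user_query, k_examples)
    # 3. Dynamic context prompt: instructions + examples + query + contexts.
    prompt = "\n\n".join([instructions, *examples,
                          f"Query: {user_query}",
                          "Contexts:\n" + "\n\n".join(contexts)])
    # 4. The LLM generates three suggestion questions answerable from the contexts.
    return call_llm(prompt)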

Conversational agents can adapt Dynamic Contexts to a variety of tasks, eliminating the need for extensive model training. In the absence of a publicly available dataset for suggestion question generation, we rely on in-context learning. In-context learning (few-shot, 4) has been recognized for achieving promising outcomes with minimal training samples.

Figure 1. Dynamic Contexts: suggestion questions are generated in RAG-based chatbots using the initial user query along with dynamic few-shot examples and dynamically retrieved contexts.

2. Problem Statement

Our primary goal is to develop a suggestion question generator. We aim to tackle cases where a RAG-based system cannot interpret a user’s query accurately, despite the presence of relevant information in the dataset. In such cases, the system fails to understand the query’s intent, leading to unsuccessful responses. Hence, we generate three suggested questions that the agent is capable of answering, based on the user’s initial query.

In summary, in this paper, we:

  • Propose to develop a suggestion question generator to address the gap in conversational systems where users frequently lack awareness of the system’s limitations.

  • Formulate a suggestion question generator for low-resource settings that can be applied across various datasets without the need for fine-tuning.

  • Propose a dynamic context prompt that combines dynamic few-shot examples and dynamically retrieved contexts to generate suggestion questions.

3. Related Work

To the best of our knowledge, there is no existing model or literature specifically addressing suggestion question generation. However, this concept is closely related to question generation (sparta, 5) and clarification question generation (tod-cq-clarit, 3, 6).

The concept of clarification questions was formally introduced in (cq1, 7), and since then, research into generating these questions has spanned a wide range of scenarios, including open-domain systems (AmbigQA) (ambigqa, 8), knowledge bases (CLAQUA) (claqua, 6), closed-book systems (CLAM) (clam, 9), information-seeking (ISEEQ) (iseeq, 10), task-oriented dialog systems (CLARIT) (tod-cq-clarit, 3), and conversational search (mixed-initiative, 11). Rahmani et al. (rahmani-etal-2023-survey, 12) surveyed the various methodologies, datasets, and different evaluation strategies used for clarification questions.

The distinct characteristics of RAG (dpr, 13) make it essential for effective information retrieval, and it is widely used for question-answering tasks (quac, 1) and clarification question generation (iseeq, 10, 11).

Advancements in large language models (LLMs) have led researchers to investigate the use of LLMs and prompting techniques for both question generation (sparta, 5) and clarification questions (deng2023prompting, 14).

4. Dynamic Contexts Approach

Dynamic Contexts, the Suggestion Question Generator (Figure 1), is designed to enhance the interaction between users and conversational systems by addressing the gap in users’ understanding of these systems’ capabilities. By reducing the frequency of apologetic responses (tod-cq-clarit, 3), it provides more informed system suggestions, thereby improving the overall user experience.

Dynamic Contexts generates suggestion questions that are answerable by the agent. This is achieved through two principal components: dynamic few-shot examples and dynamically retrieved contexts. Dynamic few-shot examples begin with the selection of contextually relevant triplets from the dataset, each consisting of a question, its answer, and a suggestion question (QAS). Dynamic few-shot diverges from traditional few-shot prompting by choosing each example dynamically based on the user’s query, rather than relying on a static set of examples. This dynamic selection is crucial for accommodating the varied formats and structures that different questions may necessitate.

To ensure that the questions generated by the Suggestion Question Generator remain pertinent and answerable, Dynamic Contexts also incorporates dynamically retrieved contexts: four passages retrieved for the initial user query, as in RAG-based systems, from which the suggestion questions should be generated. We use OpenAI embeddings to embed the examples and the contexts, and cosine similarity to select the dynamic examples and contexts. The complete prompt structure utilized for the approach is detailed in Appendix B.
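A minimal sketch of this selection step is shown below, assuming the OpenAI embeddings endpoint and an in-memory store of pre-embedded examples and passages; the embedding model name and all function names are illustrative assumptions.

import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str], model: str = "text-embedding-ada-002") -> np.ndarray:
    # Embed a batch of texts with the OpenAI embeddings endpoint.
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([d.embedding for d in resp.data])

def top_k_cosine(query_vec: np.ndarray, matrix: np.ndarray, k: int) -> list[int]:
    # Indices of the k rows of `matrix` most similar to `query_vec` by cosine similarity.
    sims = matrix @ query_vec / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec) + 1e-10)
    return np.argsort(-sims)[:k].tolist()

def select_qas_examples(query: str, qas_examples: list[dict],
                        example_vecs: np.ndarray, k: int = 3) -> list[dict]:
    # Dynamic few-shot: pick the k QAS triplets closest to the user query.
    idx = top_k_cosine(embed([query])[0], example_vecs, k)
    return [qas_examples[i] for i in idx]

def retrieve_contexts(query: str, passages: list[str],
                      passage_vecs: np.ndarray, k: int = 4) -> list[str]:
    # Dynamically retrieved contexts: pick the k passages closest to the user query.
    idx = top_k_cosine(embed([query])[0], passage_vecs, k)
    return [passages[i] for i in idx]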

                                   ChatGPT   Claude-2   GPT-4
Zero-Shot                               35         30      43
Few-Shot                                42         35      40
Dynamic Few-Shot                        42         35      43
Dynamic Contexts (our approach)         44         44      46

Table 1. Comparative analysis of Dynamic Contexts with Zero-Shot, Few-Shot, and Dynamic Few-Shot prompting

5. Evaluation

Due to the absence of a publicly accessible dataset specifically designed for suggestion question generation, we employed 228 blog posts from the Pampers Baby Sleep Coach as a practical demonstration of the approach. More details of the dataset are provided in Appendix A. To test the Dynamic Contexts approach, we used the LLMs ChatGPT, GPT-4, and Claude-2. The evaluation of the suggestion question generator involves a variety of assessment techniques to ensure a thorough analysis of performance.

Manual Evaluation To assess the effectiveness of the Dynamic Contexts approach in the Suggestion Question Generator, we conducted a manual evaluation using a set of 48 QAS (question, answer, and suggestion question) triplets. During manual assessment, we primarily judged the correctness, relevance, and soundness of the generated questions. Dynamic Contexts demonstrated its capability by successfully generating relevant suggestion questions (as shown in Table 1): ChatGPT, Claude-2, and GPT-4 produced 44, 44, and 46 correct samples, respectively. ChatGPT’s and GPT-4’s errors were related to the baby’s age, whereas Claude-2 did not make any age-related errors; Claude-2’s primary shortcoming was instead its failure to formulate questions that aligned with the provided contexts.

Comparative Analysis A comparative analysis was performed to evaluate the improvement of Dynamic Contexts over different prompting strategies, including Zero-Shot, Few-Shot, and Dynamic Few-Shot approaches. This comparison, detailed in Table 1, highlights the distinctions among the methods. The Zero-Shot method employed no examples in its prompts, while the Few-Shot method used a consistent set of three examples for each query. The Dynamic Few-Shot approach, in contrast, dynamically selected its examples. Distinct from other methods, Dynamic Contexts utilizes Retrieval-Augmented Generation (RAG) to dynamically select contexts for generating suggestion questions. This approach marks a significant shift from the static contexts used in the Zero-Shot, Few-Shot, and Dynamic Few-Shot approaches.
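For illustration, the four strategies can be sketched as follows; the formatting and variable names are our assumptions, and the instruction string stands for the prompt in Appendix B.

def format_examples(examples: list[dict]) -> str:
    # Render (question, answer, suggestion question) triplets for the prompt.
    return "\n\n".join(
        f"Query: {e['question']}\nAnswer: {e['answer']}\n"
        f"Suggestion question: {e['suggestion']}" for e in examples)

def zero_shot_prompt(instructions: str, query: str) -> str:
    # No examples and no retrieved passages.
    return f"{instructions}\n\nQuery: {query}"

def few_shot_prompt(instructions: str, query: str, static_examples: list[dict]) -> str:
    # The same three hand-picked examples for every query.
    return f"{instructions}\n\n{format_examples(static_examples)}\n\nQuery: {query}"

def dynamic_few_shot_prompt(instructions: str, query: str,
                            dynamic_examples: list[dict]) -> str:
    # Examples selected per query by embedding similarity.
    return f"{instructions}\n\n{format_examples(dynamic_examples)}\n\nQuery: {query}"

def dynamic_contexts_prompt(instructions: str, query: str,
                            dynamic_examples: list[dict], contexts: list[str]) -> str:
    # Dynamic few-shot plus dynamically retrieved passages (RAG).
    return (f"{instructions}\n\n{format_examples(dynamic_examples)}\n\n"
            f"Query: {query}\n\nContexts:\n" + "\n\n".join(contexts))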

The findings indicate that Claude-2 generated only 30 correct questions (compared to 35 for ChatGPT and 43 for GPT-4) in the zero-shot approach out of the 48 samples. While we supplied the answers, the models struggled to generate questions that could be answered from the provided context. Claude-2 notably struggled, often unable to utilize the provided context effectively, and sometimes misconstrued its role, imagining itself as a baby.

The few-shot approach, which involved providing three example questions to the models, showed an overall improvement in performance. However, the questions generated tended to replicate the structure of the examples provided too closely. The dynamic few-shot approach, aimed at introducing a variety of question types and structures, significantly improved the quality of the questions generated but still lagged behind the dynamic contexts approach.

Preference Benchmarking Further analysis was conducted through a preference benchmarking evaluation, in which GPT-4 and Claude-2 judged the suggestion questions generated by the framework. In this blind test, the judge models were prompted to assess the quality of the questions without knowing which model had generated them. The results showed that Claude-2 had no preference between the models, while GPT-4 favored Claude-2’s output 57% of the time over its own. Human evaluation was also employed to determine which output was better; it showed a preference for GPT-4’s output in 43% of cases, for Claude-2’s output in 33% of cases, and no preference in 24% of cases. This suggests a close alignment between the judgments of LLMs and human evaluators. Detailed results can be found in Table 3 of Appendix D.
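A hedged sketch of such a blind pairwise judgment with GPT-4 as the judge is shown below; the judge prompt wording, model identifiers, and function names are illustrative assumptions rather than the exact protocol.

import random
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_pair(query: str, output_gpt4: str, output_claude: str,
               judge_model: str = "gpt-4") -> str:
    # Shuffle so the judge cannot tell which model produced which set of questions.
    candidates = [("gpt-4", output_gpt4), ("claude-2", output_claude)]
    random.shuffle(candidates)
    prompt = (
        "Two sets of suggestion questions were generated for the query below.\n"
        f"Query: {query}\n\n"
        f"Set A:\n{candidates[0][1]}\n\n"
        f"Set B:\n{candidates[1][1]}\n\n"
        "Which set is better? Answer with 'A', 'B', or 'tie'."
    )
    resp = client.chat.completions.create(
        model=judge_model, messages=[{"role": "user", "content": prompt}])
    verdict = resp.choices[0].message.content.strip().upper()
    if verdict.startswith("A"):
        return candidates[0][0]
    if verdict.startswith("B"):
        return candidates[1][0]
    return "tie"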

Ablation Study To determine the optimal order of the final prompt, we conducted an ablation study, altering the arrangement of the example contexts and the query. Initially, we presumed that presenting the query before the dynamic contexts would enhance comprehension, allowing the model to suggest questions based on the given information. However, upon changing the order, we did not observe any discernible difference in the nature of the generated suggestion questions, as seen from Table 2 in Appendix C.

6. Conclusion

In this paper, we explore suggestion questions to bridge the gap in users’ comprehension of a conversational system’s capabilities. We present the development of a suggestion question generator that utilizes dynamic context prompting within a RAG-based conversational system. In the future, we plan to personalize the suggestion questions based on the user history.

Acknowledgements.
The authors would like to acknowledge the support of the Procter and Gamble Company (P&G). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of P&G.

References

  • (1) Eunsol Choi et al. “QuAC: Question Answering in Context” In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing Brussels, Belgium: Association for Computational Linguistics, 2018 DOI: 10.18653/v1/D18-1241
  • (2) Siva Reddy, Danqi Chen and Christopher D. Manning “CoQA: A Conversational Question Answering Challenge” In Transactions of the Association for Computational Linguistics 7 Cambridge, MA: MIT Press, 2019, pp. 249–266 DOI: 10.1162/tacl_a_00266
  • (3) Yue Feng, Hossein A Rahmani, Aldo Lipani and Emine Yilmaz “Towards Asking Clarification Questions for Information Seeking on Task-Oriented Dialogues” In arXiv preprint arXiv:2305.13690, 2023
  • (4) Tom Brown et al. “Language Models are Few-Shot Learners” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 1877–1901
  • (5) Hongwei Zeng, Bifan Wei, Jun Liu and Weiping Fu “Synthesize, Prompt and Transfer: Zero-shot Conversational Question Generation with Pre-trained Language Model” In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Toronto, Canada: Association for Computational Linguistics, 2023, pp. 8989–9010 DOI: 10.18653/v1/2023.acl-long.500
  • (6) Jingjing Xu et al. “Asking Clarification Questions in Knowledge-Based Question Answering” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) Hong Kong, China: Association for Computational Linguistics, 2019, pp. 1618–1629 DOI: 10.18653/v1/D19-1172
  • (7) Sudha Rao and Hal Daumé III “Learning to Ask Good Questions: Ranking Clarification Questions using Neural Expected Value of Perfect Information” In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 2737–2746 DOI: 10.18653/v1/P18-1255
  • (8) Sewon Min, Julian Michael, Hannaneh Hajishirzi and Luke Zettlemoyer “AmbigQA: Answering Ambiguous Open-domain Questions” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) Online: Association for Computational Linguistics, 2020, pp. 5783–5797 DOI: 10.18653/v1/2020.emnlp-main.466
  • (9) Lorenz Kuhn, Yarin Gal and Sebastian Farquhar “CLAM: Selective Clarification for Ambiguous Questions with Large Language Models” In CoRR abs/2212.07769, 2022 DOI: 10.48550/arXiv.2212.07769
  • (10) Manas Gaur, Kalpa Gunaratna, Vijay Srinivasan and Hongxia Jin “ISEEQ: Information Seeking Question Generation Using Dynamic Meta-Information Retrieval and Knowledge Graphs” In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022 AAAI Press, 2022, pp. 10672–10680 DOI: 10.1609/AAAI.V36I10.21312
  • (11) Yosi Mass, Doron Cohen, Asaf Yehudai and David Konopnicki “Conversational Search with Mixed-Initiative - Asking Good Clarification Questions backed-up by Passage Retrieval” In Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering Dublin, Ireland: Association for Computational Linguistics, 2022, pp. 65–71 DOI: 10.18653/v1/2022.dialdoc-1.7
  • (12) Hossein A. Rahmani et al. “A Survey on Asking Clarification Questions Datasets in Conversational Systems” In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Toronto, Canada: Association for Computational Linguistics, 2023, pp. 2698–2716 DOI: 10.18653/v1/2023.acl-long.152
  • (13) Vladimir Karpukhin et al. “Dense Passage Retrieval for Open-Domain Question Answering” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 6769–6781
  • (14) Yang Deng, Wenqiang Lei, Lizi Liao and Tat-Seng Chua “Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration”, 2023 arXiv:2305.13626 [cs.CL]
  • (15) Mor Geva, Ankit Gupta and Jonathan Berant “Injecting Numerical Reasoning Skills into Language Models” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 946–958 DOI: 10.18653/v1/2020.acl-main.89

Appendix A Dataset

Due to the absence of a publicly accessible dataset specifically designed for suggestion question generation, we employed 228 blog posts from the Pampers Baby Sleep Coach as a practical demonstration of the approach. The blog post headlines were treated as queries, while the content of the posts served as the answers to those queries. For 35 of these posts, we manually annotated suggestion questions exemplifying the types of questions users might ask that could be answered by the provided context. The post headlines from these 35 examples form the basis for identifying queries similar to the initial user query. To each of these similar queries, we then append the corresponding answer and a suggestion question, forming a set of dynamic few-shot examples, each consisting of a similar query, its answer, and a suggestion question (as shown in Figure 1). Meanwhile, the original set of 138 answers forms the basis for our answer contexts, from which we select the top 4 contexts. The complete prompt structure utilized for the approach is detailed in Appendix B.
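A minimal sketch of how this data could be organized is given below; the field names and data structures are our illustrative assumptions, not a released artifact.

from dataclasses import dataclass

@dataclass
class BlogPost:
    headline: str  # treated as the query
    body: str      # treated as the answer / retrieval passage

@dataclass
class QASExample:
    question: str             # blog-post headline
    answer: str               # blog-post body
    suggestion_question: str  # manually annotated suggestion question

def build_stores(posts: list[BlogPost],
                 annotations: dict[str, str]) -> tuple[list[QASExample], list[str]]:
    # `annotations` maps the 35 annotated headlines to their suggestion questions.
    qas_examples = [
        QASExample(p.headline, p.body, annotations[p.headline])
        for p in posts if p.headline in annotations
    ]
    passages = [p.body for p in posts]  # retrieval corpus for the top contexts
    return qas_examples, passages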

The dataset presents significant challenges, particularly in exposing the numerical reasoning capabilities of large language models (LLMs) (numerical-reasoning, 15). When handling queries related to a baby’s specific age (e.g., 13 weeks old, 4 months old), both ChatGPT and GPT-4 displayed confusion over the baby’s age in the generated suggestion questions. Despite the prompt clearly stating that a month consists of 4 weeks, as detailed in Appendix B, they at times struggled with the numerical distinction (for example, incorrectly interpreting 13 weeks as 13 months).

Appendix B Prompt Format

This section illustrates the prompt given to the LLMs.

You are an intelligent conversational agent for a smart sleep coach of baby. You are given a prompt with a query with different contexts. Each query is one question. Convert the query into question format. You have to suggest 3 different questions that can be very easily answered by only the context provided. If the questions cannot be strictly answered by only the context provided then it should not suggest. Remember in a month there are 4 weeks. The age of the baby should be the same as in the query and not be changed. Only generate the questions. Here are some examples:

{3 dynamic QAS examples}

{Query}

{3 contexts from which the suggestion questions should be generated and by which they can be answered}

Appendix C Ablation Study

             Dynamic Contexts   Order Changed
GPT-4                      46              46
Claude-2                   44              44

Table 2. Ablation study of the Dynamic Contexts approach after changing the order of the query and the contexts

To determine the optimal order of the final prompt, we conducted an ablation study, altering the arrangement of the example contexts and the query to assess performance. Initially, we presumed that presenting the query before the dynamic contexts would enhance comprehension, allowing the model to suggest questions based on the given information. However, upon changing the order, we did not observe any discernible difference in the nature of the generated suggestion questions, as can be seen from Table 2.
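A small sketch of the two orderings compared in Table 2 is given below; the helper and argument names are illustrative assumptions.

def build_prompt(instructions: str, examples: str, query: str,
                 contexts: str, query_first: bool = True) -> str:
    # Assemble the final prompt with the query either before or after the contexts.
    query_block = f"Query: {query}"
    context_block = f"Contexts:\n{contexts}"
    middle = [query_block, context_block] if query_first else [context_block, query_block]
    return "\n\n".join([instructions, examples, *middle])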

Appendix D Preference Benchmarking Results

Refer to Table 3.

                   Human Eval      GPT-4 (judge)   Claude-2 (judge)
GPT-4 output       21 (11 both)               21                 24
Claude-2 output    16                         27                 24

Table 3. Preference benchmarking to evaluate each evaluator’s preference between the models’ outputs