An Interactive Query Generation Assistant using LLM-based Prompt Modification and User Feedback
Abstract.
While search is the predominant method of accessing information, formulating effective queries remains a challenging task, especially for users who are unfamiliar with a domain, are searching for documents in other languages, or are looking for complex information, such as events, that is not easily expressed as a query. Providing example documents or passages of interest may be easier for a user; however, such query-by-example scenarios are prone to concept drift and are highly sensitive to the query generation method. This demo illustrates complementary approaches to using LLMs interactively, assisting and enabling the user to provide edits and feedback at all stages of the query formulation process. The proposed Query Generation Assistant is a novel search interface that supports automatic and interactive query generation over a monolingual or multilingual document collection. Specifically, the assistive interface enables users to refine the queries generated by different LLMs, to provide feedback on the retrieved documents or passages, and to have that feedback incorporated as prompts to generate more effective queries. The proposed interface is a valuable experimental tool for exploring fine-tuning and prompting of LLMs for query generation, for qualitatively evaluating the effectiveness of retrieval and ranking models, and for conducting Human-in-the-Loop (HITL) experiments on complex search tasks where users struggle to formulate queries without such assistance.
1. Introduction
Retrieving information from documents in multiple languages is critically important as the Internet increasingly provides access to documents across thousands of languages and domains. Creating effective search queries, however, can be a daunting task. First, users may be unfamiliar with the language of the information they need, or may not know which language it is in at all, making it hard to craft specific queries. Second, most people are not familiar with the vocabulary and jargon used in other areas or fields, which can further impair their ability to formulate good search queries. Furthermore, users may be unfamiliar with the corpus, or collection of documents being searched, making it challenging to know what information to look for and how to phrase the information need. We propose a solution to this challenge by employing “query-by-example”: allowing users to explore document collections by specifying an example document (rather than an explicit query) of what they are searching for. Although considerable advancements have been made in query-by-example (Sarwar and Allan, 2020; Zloof, 1975), especially with neural and transformer-based models, there is a clear lack of interfacing tools for performing qualitative analysis, making it an area ripe for exploration.
As query-by-example (QBE) and multilingual information retrieval (MLIR) introduce new tasks to the traditional information retrieval community, new research questions and challenges arise, such as how to provide effective search results in different languages and how to assist users in generating effective queries. Traditional methods of qualitative analysis can be time-consuming, as researchers must manually generate queries and analyze search results, making iteration difficult. Hence, an interfacing tool that can automatically generate queries and display the search results together can be invaluable for researchers and practitioners alike.
On the other hand, the success of few-shot prompting (Srivastava et al., 2023; Brown et al., 2020; Liu et al., 2023) has led large language models to play a key role in reducing the information burden on users, especially by assisting them with writing tasks such as essay writing, summarization, and transcript and dialogue generation. This success has also transferred to tasks related to query generation (Jeong et al., 2021; Nogueira et al., 2019). While large language model applications are prevalent and numerous studies have examined search interfaces (Liu et al., 2022, 2021a, 2021b; Xu et al., 2009), there has been little impetus to combine search interfaces with large language model based query generation.
In this paper, we demonstrate Query Generation Assistant, a search interface that supports automatic and interactive query generation for monolingual or multilingual interactive search. The novel contributions of the proposed interface include:
(1) The interface provides a simple document search interface that displays documents in their original language along with their translations, making it simple for researchers to navigate and analyze search results.
(2) The tool also supports diverse query generation, allowing users to explore search results more comprehensively.
(3) More importantly, it combines search with a prompting-based query generation interface which permits users to refine their queries and prompts with retrieval information.
We believe our interface can serve as an effective starting template for performing qualitative analysis over other search-related experiments and datasets, as well as a tool to incorporate retrieval feedback and conduct Human-In-The-Loop (HITL) studies. Even though our system was initially built for the BETTER search task (described in Section 2), our interface is generic in nature and transferable to other datasets and indices. We share the Python code for the interface described below, as well as a video demonstration, at https://github.com/emory-irlab/better-search.
We first briefly describe the BETTER task and dataset in Section 2. We then explicate the three main features of Query Generation Assistant in Section 3.
2. Dataset
Our system and interface were designed to investigate interactive query generation, especially for Query-By-Example (QBE) settings. We use the BETTER search datasets (https://ir.nist.gov/better/) (Mckinnon and Rubino, 2022; Soboroff, 2023) for demonstration. The BETTER dataset is a collection of natural language processing resources developed by IARPA’s BETTER program (https://www.iarpa.gov/research-programs/better) to help intelligence analysts process and analyze huge amounts of unstructured, multilingual information efficiently and effectively; it serves as an example application for multilingual QBE and document retrieval for event monitoring (event retrieval). The collection also contains ancillary information such as event span annotations from text across many languages and topics. In particular, the BETTER program seeks search systems that perform accurate retrieval of Arabic, Persian, Chinese, Korean, and Russian documents when queried with example English documents.
3. Query Generation Assistant
The Query Generation Assistant user interface is made up of three subsystems, each described below. The interface is built using HuggingFace’s Gradio platform (Wolf et al., 2020; Abid et al., 2019). Gradio is an open-source Python package to quickly create easy-to-use, customizable UI components for machine learning models.
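To make the overall layout concrete, the following is a minimal sketch of how a three-tab Gradio app of this shape can be assembled; the handler functions (search, generate_queries, prompt_and_generate) are illustrative placeholders, not our actual implementation.

```python
import gradio as gr

def search(query):
    # Placeholder: query the cross-lingual indices and return ranked documents.
    return f"Top documents for: {query}"

def generate_queries(document):
    # Placeholder: generate candidate queries from an example document.
    return f"Candidate queries for: {document[:50]}..."

def prompt_and_generate(prompt, document):
    # Placeholder: prompt an LLM with editable few-shot examples.
    return f"Query generated from a prompt of length {len(prompt)}"

with gr.Blocks() as demo:
    with gr.Tab("Manual Search"):
        q = gr.Textbox(label="Query")
        out = gr.Textbox(label="Results")
        gr.Button("Search").click(search, inputs=q, outputs=out)
    with gr.Tab("Auto Query Generator"):
        doc = gr.Textbox(label="Example document", lines=8)
        queries = gr.Textbox(label="Generated queries")
        gr.Button("Generate").click(generate_queries, inputs=doc, outputs=queries)
    with gr.Tab("Interactive Query Generation"):
        prompt = gr.Textbox(label="Editable few-shot prompt", lines=10)
        doc2 = gr.Textbox(label="Document to generate a query from", lines=8)
        gen = gr.Textbox(label="Generated query")
        gr.Button("Generate").click(prompt_and_generate, inputs=[prompt, doc2], outputs=gen)

demo.launch()
```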
3.1. Manual Search

This tab (shown in Figure 1) is the simplest interface: it permits users to write search queries themselves and returns the top-k relevant documents, their English translations, and the highlighted events for each document. The search is conducted over an index built per language from that language’s documents, using a state-of-the-art (SOTA) cross-lingual dense retrieval model, ColBERT-X (Nair et al., 2022). A document rank list is returned for each language; these rank lists are then combined and reranked using reciprocal rank fusion. All the documents are translated offline using Google Translate (https://translate.google.com/) for faster look-up at query time. To highlight event annotations, such as event triggers and argument entities, on the displayed documents, the collection is passed through a SOTA event annotator (span-finder (Xia et al., 2021)) offline, and the annotations are looked up at query time.
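For reference, reciprocal rank fusion scores each document d as the sum of 1/(k + rank_i(d)) over the per-language rank lists i. The sketch below shows this standard computation; the smoothing constant k = 60 and the document identifiers are illustrative defaults, not values taken from our system.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rank_lists, k=60):
    """Fuse several rank lists (each an ordered list of doc ids) into one."""
    scores = defaultdict(float)
    for ranking in rank_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # A higher fused score is better, so sort in descending order.
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse hypothetical rank lists from two language indices.
fused = reciprocal_rank_fusion([
    ["ar_12", "ar_7", "ar_3"],   # ranking from the Arabic index
    ["fa_5", "ar_7", "fa_9"],    # ranking from the Persian index
])
print(fused)
```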
3.2. Auto Query Generator
The BETTER task benchmarks systems on their ability to find documents in specified target languages that are similar to a user’s example document. We approach this by generating an intermediate query from the example document and performing retrieval with it. The effectiveness of the generated queries is therefore crucial for retrieving relevant documents, while also ensuring query interpretability.
Inspired by the recent success of pre-trained generation models, we fine-tune a T5 (Raffel et al., 2020) model on (document, query) pairs. To evaluate the performance of our approach, we compare the original T5 model with a docT5query (Nogueira et al., 2019) model, which has already been fine-tuned on the MS MARCO (Nguyen et al., 2016) dataset. Our results indicate that the docT5query model outperforms the original T5 model, and thus we use it for our demonstration.
The complete interface is shown in Figure 2.
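As an illustration, queries can be sampled from a document with the publicly released docT5query checkpoint on the Hugging Face Hub (castorini/doc2query-t5-base-msmarco); the document text and decoding settings below are illustrative, not the exact configuration used in the demo.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "castorini/doc2query-t5-base-msmarco"  # public docT5query checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

document = "The earthquake struck the coastal region early on Tuesday, ..."
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=512)

# Sample several diverse candidate queries, as in the doc2query recipe.
outputs = model.generate(
    **inputs, max_length=64, do_sample=True, top_k=10, num_return_sequences=3
)
for ids in outputs:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```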

3.3. Interactive Query Generation
In recent years, large language models have made excellent strides in multi-task and few-shot learning. With just a handful of examples, they have shown impressive generative capabilities, albeit with risks of hallucination. Prompting has been an effective and seemingly natural way to interact with such models.
While few-shot prompting has been a powerful approach to teaching models new tasks, such models generally produce different outputs as prompting parameters are varied. For example, varying the types of examples in the prompt, their order, and their number vastly influences the generations. We use this sensitivity to our advantage for query generation by letting users edit their prompts, either directly or through relevance feedback, to improve subsequent query generations and the corresponding retrieval.
We choose FlanT5 (Chung et al., 2022) as it has already been fine-tuned on a large number of tasks, making it arguably convenient (Aribandi et al., 2022) for learning new tasks. By default, the interface prompts FlanT5 with two editable (document, query) pairs along with an instruction. We present users with a choice of multiple instructions and their choice of document from which to generate a query. The interface is shown in Figure 3.
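The sketch below illustrates one way such a prompt can be assembled and passed to FlanT5; the instruction wording and the two exemplar pairs are illustrative assumptions, since the actual defaults are editable in the interface.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

def build_prompt(instruction, examples, target_document):
    """Assemble: instruction, (document, query) exemplars, then the target document."""
    parts = [instruction]
    for doc, query in examples:
        parts.append(f"Document: {doc}\nQuery: {query}")
    parts.append(f"Document: {target_document}\nQuery:")
    return "\n\n".join(parts)

prompt = build_prompt(
    "Generate a search query for the following document.",  # illustrative instruction
    [("Floods displaced thousands in the region.", "flood displacement"),
     ("Officials confirmed an outbreak of cholera.", "cholera outbreak")],
    "Protesters gathered outside the parliament on Friday ...",
)
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```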

Each generated query can be used individually, or all of them together, to retrieve documents. Upon retrieval, each retrieved document is provided with a checkbox permitting it to be added directly to the prompt along with its originating query. This is intended to incorporate user search feedback directly into the prompt, making it more consistent with the user’s requests. In few-shot prompting, when models generate responses based on a limited set of examples, the quality of the generations depends on the quality and relevance of the examples provided, and models are also known to be less robust to prompt perturbations (Zhao et al., 2021; Dhole et al., 2023).
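A minimal sketch of this feedback loop is shown below, with illustrative names: each checked retrieved document is appended, together with its originating query, as an additional exemplar for the next round of generation.

```python
def add_feedback_to_examples(examples, retrieved, selected_ids):
    """Append user-checked (document, query) pairs as new prompt exemplars."""
    for doc_id, doc_text, query in retrieved:
        if doc_id in selected_ids:
            examples.append((doc_text, query))
    return examples

# One exemplar pair from the initial prompt ...
examples = [("Floods displaced thousands in the region.", "flood displacement")]
# ... and one retrieved document the user marked as relevant via its checkbox.
retrieved = [("ar_7", "Heavy rains caused the river to overflow ...", "flood damage")]
examples = add_feedback_to_examples(examples, retrieved, selected_ids={"ar_7"})
# The enlarged exemplar list is then passed back to a prompt builder such as
# build_prompt(...) from the earlier sketch for the next generation round.
```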
4. Conclusion
The primary objective of Query Generation Assistant is to provide researchers with the ability to qualitatively monitor cross-lingual retrieval and to assist in generating and refining queries. Researchers and practitioners can quickly and easily perform qualitative analysis with the tool’s search interface and query generation features, allowing them to evaluate search systems more thoroughly. The prompting-based search interface also provides an avenue for performing human-in-the-loop (HITL) studies. Beyond qualitative studies, we believe Query Generation Assistant could serve as an effective starting template for more sophisticated information retrieval experiments, as well as a tool to incorporate retrieval feedback and conduct HITL studies.
5. Acknowledgements
This work was supported in part by IARPA BETTER (#2019-19051600005). The views and conclusions contained in this work are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, or endorsements of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
References
- Abid et al. (2019) Abubakar Abid, Ali Abdalla, Ali Abid, Dawood Khan, Abdulrahman Alfozan, and James Zou. 2019. Gradio: Hassle-free sharing and testing of ML models in the wild. arXiv preprint arXiv:1906.02569 (2019).
- Aribandi et al. (2022) Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, and Donald Metzler. 2022. ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning. In International Conference on Learning Representations. https://openreview.net/forum?id=Vzh1BFUCiIX
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
- Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
- Dhole et al. (2023) Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Shrivastava, Samson Tan, et al. 2023. NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation. Northern European Journal of Language Technology 9, 1 (2023). https://nejlt.ep.liu.se/article/view/4725/3874
- Jeong et al. (2021) Soyeong Jeong, Jinheon Baek, ChaeHun Park, and Jong Park. 2021. Unsupervised Document Expansion for Information Retrieval with Stochastic Text Generation. In Proceedings of the Second Workshop on Scholarly Document Processing. Association for Computational Linguistics, Online, 7–17. https://doi.org/10.18653/v1/2021.sdp-1.2
- Liu et al. (2021a) Chang Liu, Ying-Hsang Liu, Jingjing Liu, and Ralf Bierig. 2021a. Search Interface Design and Evaluation. Found. Trends Inf. Retr. 15, 3–4 (dec 2021), 243–416. https://doi.org/10.1561/1500000073
- Liu et al. (2021b) Chang Liu, Ying-Hsang Liu, Jingjing Liu, and Ralf Bierig. 2021b. Search Interface Design and Evaluation. Found. Trends Inf. Retr. 15, 3–4 (dec 2021), 243–416. https://doi.org/10.1561/1500000073
- Liu et al. (2023) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-Train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 55, 9, Article 195 (jan 2023), 35 pages. https://doi.org/10.1145/3560815
- Liu et al. (2022) Ying-Hsang Liu, Paul Thomas, Tom Gedeon, and Nicolay Rusnachenko. 2022. Search Interfaces for Biomedical Searching: How Do Gaze, User Perception, Search Behaviour and Search Performance Relate?. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval (Regensburg, Germany) (CHIIR ’22). Association for Computing Machinery, New York, NY, USA, 78–89. https://doi.org/10.1145/3498366.3505769
- Mckinnon and Rubino (2022) Timothy Mckinnon and Carl Rubino. 2022. The IARPA BETTER Program Abstract Task Four New Semantically Annotated Corpora from IARPA’s BETTER Program. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, and Stelios Piperidis (Eds.). European Language Resources Association, Marseille, France, 3595–3600. https://aclanthology.org/2022.lrec-1.384
- Nair et al. (2022) Suraj Nair, Eugene Yang, Dawn Lawrie, Kevin Duh, Paul McNamee, Kenton Murray, James Mayfield, and Douglas W. Oard. 2022. Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models. In Advances in Information Retrieval: 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part I (Stavanger, Norway). Springer-Verlag, Berlin, Heidelberg, 382–396. https://doi.org/10.1007/978-3-030-99736-6_26
- Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating Neural and Symbolic Approaches (CoCo@NIPS 2016).
- Nogueira et al. (2019) Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv:1904.08375 [cs.IR]
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
- Sarwar and Allan (2020) Sheikh Muhammad Sarwar and James Allan. 2020. Query by Example for Cross-Lingual Event Retrieval (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 1601–1604. https://doi.org/10.1145/3397271.3401283
- Soboroff (2023) Ian Soboroff. 2023. The BETTER Cross-Language Datasets. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (Taipei, Taiwan) (SIGIR ’23). Association for Computing Machinery, New York, NY, USA, 3047–3053. https://doi.org/10.1145/3539618.3591910
- Srivastava et al. (2023) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2023. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research (2023).
- Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://doi.org/10.18653/v1/2020.emnlp-demos.6
- Xia et al. (2021) Patrick Xia, Guanghui Qin, Siddharth Vashishtha, Yunmo Chen, Tongfei Chen, Chandler May, Craig Harman, Kyle Rawlins, Aaron Steven White, and Benjamin Van Durme. 2021. LOME: Large Ontology Multilingual Extraction. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations. 149–159. https://www.aclweb.org/anthology/2021.eacl-demos.19
- Xu et al. (2009) Songhua Xu, Tao Jin, and Francis C. M. Lau. 2009. A New Visual Search Interface for Web Browsing (WSDM ’09). Association for Computing Machinery, New York, NY, USA, 152–161. https://doi.org/10.1145/1498759.1498821
- Zhao et al. (2021) Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning. PMLR, 12697–12706.
- Zloof (1975) Moshé M. Zloof. 1975. Query by Example (AFIPS ’75). Association for Computing Machinery, New York, NY, USA, 431–438. https://doi.org/10.1145/1499949.1500034