Making Large Language Models Interactive: A Pioneer Study on Supporting Complex Information-Seeking Tasks with Implicit Constraints

Ali Ahmadvand [email protected] Emory University USA , Negar Arabzadeh [email protected] University of WaterlooCanada , Julia Kiseleva [email protected] Microsoft ResearchUSA , Patricio Figueroa Sanz [email protected] MicrosoftUSA , Xin Deng [email protected] MicrosoftUSA , Sujay Jauhar [email protected] Microsoft ResearchUSA , Michael Gamon [email protected] Microsoft ResearchUSA , Eugene Agichtein [email protected] Emory University USA , Ned Friend [email protected] MicrosoftUSA , Aniruddha Kulkarni [email protected] MicrosoftUSA , Ahmed Awadallah [email protected] Microsoft ResearchUSA and Ryen W. White [email protected] Microsoft ResearchUSA

Abstract.

Current interactive systems with natural language interfaces lack the ability to understand a complex information-seeking request which expresses several implicit constraints at once, and there is no prior information about user preferences, e.g., ‘find hiking trails around San Francisco which are accessible with toddlers and have beautiful scenery in summer’, where output is a list of possible suggestions for users to start their exploration. In such scenarios, user requests can be issued in one shot in the form of a complex and long query, unlike conversational and exploratory search models, where require short utterances or queries are often presented to the system step by step. This advancement provides the final user more flexibility and precision in expressing their intent through the search process, as well as greater efficiency in their interactions with the system. Such systems are inherently helpful for day-to-day user tasks requiring planning that are usually time-consuming, sometimes tricky, and cognitively taxing. We have designed and deployed a platform to collect the data from approaching such complex interactive systems.

Moreover, with the current advancement of generative language models such as GPT-based models, understanding complex user requests becomes more possible, however, these models suffer from hallucination in providing accurate factual knowledge. All language models are mostly trained in large part on web-scraped data from the past, which usually is not useful for immediate users’ needs.

In this article, we propose an Interactive Agent (IA) that leverages Large Language Models (LLM) for complex request understanding and makes it interactive using Reinforcement learning (RL) that allows intricately refine user requests by making them complete, which should lead to better retrieval and reduce LLMs hallucination problems for current user needs. To demonstrate the performance of the proposed modeling paradigm, we have adopted various pre-retrieval metrics that capture the extent to which guided interactions with our system yield better retrieval results. Through extensive experimentation, we demonstrated that our method significantly outperforms several robust baselines.

Interactive User intent modeling, Natural Language Understanding, Complex Search tasks

^†^†copyright: none^†^†ccs: Information systems Information retrieval^†^†ccs: Information systems Query intent^†^†ccs: Information systems Users and interactive retrieval^†^†ccs: Information systems Collaborative search^†^†ccs: Information systems Specialized information retrieval

1. Introduction

Recent advances in the field of Natural Language Understanding (NLU) (Devlin et al., 2018; Adiwardana et al., 2020; Brown et al., 2020; Raffel et al., 2020) have enabled natural language interfaces to help users find information beyond what typical search engines provide, through systems such as open domain and task-oriented dialogue engines (Leiter et al., 2023; Li et al., 2018, 2020b) and conversational recommenders (Christakopoulou et al., 2016), among others. However, most traditional systems still present with one or both of the following limitations: (1) answers are typically constrained to relatively simple and primarily factoid-style requests in natural language (Kwiatkowski et al., 2019; Soleimani et al., 2021), as is the case with search engines; and (2) a requirement on eliciting user preference by asking direct questions about attributes (Kostric et al., 2021).

However, user information needs, when expressed using natural language, can be inherently complex and contain many interdependent and dependent constraints, as is shown in Figure 1. When issuing such requests, users may be considered to be in exploratory mode; they are looking for suggestions to pick from, rather than a single concrete answer. However, sometimes user preferences can be elicited in multiple known steps if a user is in learning and discovery mode. The task becomes especially challenging since most real applications (Christakopoulou, 2018) need to support cold-start users (Kiseleva et al., 2016a; Sepliarskaia et al., 2018), for whom little to no preferential knowledge is known a priori. This may be due to infrequent visits, rapid changes in user preferences (Bernardi et al., 2015; Kiseleva et al., 2014, 2015), or general privacy-preserving constraints that limit the amount or type of information that can be collected or stored. In this work, we aim to bridge the described gap of processing complex information-seeking requests in natural language from unknown users by developing a new type of application, which will work as illustrated in Figure 1. Concretely, our proposed solution is capable of jointly processing complex natural language requests, inferring user preferences, and suggesting new ones for users to explore, given real-time interactions with the Interactive Agent (IA).

Refer to caption — Figure 1. An example of a user request expressed complex information needs in natural language, which is processed by IA Pluto to retrieve a list of suggestions that at least partially satisfy the specified constraints. The request contains a number of search constraints that are highlighted in yellow.

One of the major bottlenecks in tackling the proposed problem of processing complex information-seeking is the lack of an existing interactive system to collect data and observe user interactions. Therefore, we designed a pipeline, which we call Pluto, that allows users to submit complex information-seeking requests. Using Pluto, we leverage human agents in the loop to help users accomplish their informational needs while collecting data on complex search behavior and user interactions (e.g., Holzinger, 2016; Li et al., 2016).

Finally, we propose a novel IA that seeks to replace human agents in the loop in order to scale out Pluto to a significantly broader audience while simultaneously making responses a near real-time experience. The proposed IA contains a Natural Language Understanding NLU unit that extracts a semantic representation of the complex request. It also integrates a novel score that estimates the completeness of a user’s intent at each interactive step. Based on the semantic representation and completion score, the IA interacts with users through a Reinforcement Learning (RL) loop that guides them through the process of specifying intents and expressing new ones. The proposed model leverages a user interface to suggest a ranked list of suggested intents that users may not have previously thought about, or even know. Online user feedback is leveraged through these interactions with users to automatically improve and update the reinforcement learner’s policies.

Another important aspect we consider is a simple, straightforward evaluation of the proposed approach. We adopt pre-retrieval metrics (e.g., Sarnikar et al., 2014; Roitman et al., 2019) as a means to evaluate the extent to which refinement to the complex request afforded by the IA better represents the actual user intent, or narrows down the search space. Our evaluation demonstrates that a better-formulated complex request results in a more reliable and accurate retrieval process. For the retrieval phase, we break down the complex request based on the contained slots and generate a list of queries from the user intent, slots, and location. A search engine API is used to extract relevant documents, after which a GPT-3 (Brown et al., 2020) based ranker re-ranks the final results based on the actual slot values or aspects. The final re-ranker considers the user preferences through the aspects values for the slots in the reformulated query.

To summarize, the main contributions of this work are:

C1

Designing a novel interactive platform to collect data for handling complex information-seeking tasks, which enables integration with a human-in-the-loop protocol for initial processing of the user requests and search engines to retrieve relevant suggestions in response to refined user requests (Section 3).
C2

Formalizing new general problem of interactive intent modeling for retrieving a list of suggestions in response to users’ complex information-seeking requests expressed in natural language, such as presented in Figure 1, where there is no prior information about user preferences (Section 4).
C3

Proposing a hybrid model, which we name Interactive Agent (IA), consisting of an Natural Language Understanding (NLU) and a Reinforcement Learning (RL) component. This model, inspired by conversational agents, encourages and empowers users to explicitly describe their search intents so that they may be more easily satisfied (Section 5).
C4

Suggesting an evaluation metric, Completion Intent Score (CIS) that estimates the degree to which intent is expressed completely, at each step. This metric is used to continue the interactive loop so that users can express the maximum preferential information in a minimum number of steps (Section 7.1).

2. Background and Related Work

Our work is relevant to four broad strands of research on multi-armed bandits, search engines, language as an interface for interactive systems, and exploratory search and trails, which we review below.

Contextual bandits for recommendation

Multi-armed bandits are a classical exploration-exploitation framework from Reinforcement Learning (RL), where the user feedback is available in each iteration (Parapar and Radlinski, 2021; Cortes, 2018; Li et al., 2010). They are becoming popular for online applications such as ranking online advertisements and recommendation systems (e.g., Ban and He, 2021; Joachims et al., 2020), where information about user preferences is unavailable (cold-start users (Bernardi et al., 2015; Kiseleva et al., 2016a)) (Felício et al., 2017). Parapar and Radlinski (2021) proposed a multi-armed bandit model for personalized recommendations by diversifying the user preferences by changing the focus only on past user interactions. Others examined the application of contextual bandit models in healthcare, finance, dynamic pricing, and anomaly detection (Bouneffouf and Rish, 2019). Our work adapts contextual bandits paradigm to the new problem of interactive intent modeling for complex information-seeking tasks.

Search engines

Commonly used search engines such as Google and Bing provide platforms focusing on the document retrieval process through search sessions (Hassan et al., 2010; Kiseleva et al., 2014, 2015; Ageev et al., 2011). Developing retrieval models that can extract the most relevant documents from an extensive collection has been well-studied (Croft et al., 2010) for decades. The developed retrieval models focus on retrieving the most relevant documents corresponding to user intent, represented with textual and contextual information within and across search sessions (Kotov et al., 2011). Although extracting relevant documents is necessary, it is not always sufficient, especially when the users have a complex information-seeking task (Ingwersen and Järvelin, 2006).

Language as an interface for interactions

NLU have been the important direction for human-computer interaction and information search for decades (Woods et al., 1972; Codd, 1974; Hendrix et al., 1978). The recent impressive advances in capabilities of NLU (Devlin et al., 2018; Liu et al., 2019; Clark et al., 2020; Adiwardana et al., 2020; Roller et al., 2020; Brown et al., 2020) powered by large-scale deep learning and increasing demand for new applications has led to a major resurgence of natural language interfaces in the form of virtual assistants, dialog systems, semantic parsing, and question answering systems (Liu and Lane, 2017, 2018; Dinan et al., 2020; Zhang et al., 2019). The scope of natural language interfaces has been significantly expanding from databases (Copestake and Jones, 1990) to knowledge bases (Berant et al., 2013), robots (Tellex et al., 2011), virtual assistants (Kiseleva et al., 2016c, b), and other various forms of interaction (Fast et al., 2018; Desai et al., 2016; Young et al., 2013). Recently, the community has focused on continuous learning through interactions, including systems that learn a new task from instructions (Li et al., 2020a), assess their uncertainty (Yao et al., 2019) and ask feedback from humans in case of uncertainty (Aliannejadi et al., 2021, 2020) or for correcting possible mistakes (Elgohary et al., 2020).

Exploratory search, tours, and trails

Exploratory search refers to an information-seeking process in which the system assists the searcher in understanding the information space for iterative exploration and retrieval of information (Ruotsalo et al., 2018; Hassan Awadallah et al., 2014; White et al., 2008). Anomalous states of knowledge (ASKs) (Belkin, 1980) motivate the need to search and drive demand for search systems. According to the ASK hypothesis, users usually can struggle to conceptualize and formulate their information needs as search queries, which may miss some essential information (Liu and Belkin, 2015; White and Roth, 2009). In such cases, the system should assist the user in specifying their intent (Marchionini, 2006). Through a search log analysis, Odijk et al. (2015) shows that there are many searches where users may struggle to formulate their search query or they may simply be exploring to learn about a new area. New search interface designs may be required to support searchers through their information-seeking process (Villa et al., 2009). Tours and Trails are another group of tools that were developed to guide users to accomplish search tasks. Guided tours are common in hypertext systems (Trigg, 1988) and similar ideas could be applied in the context of search (Hassan and White, 2012). Surfacing common trail destinations in search interfaces can help people find information targets more quickly (White et al., 2007). Search engines may also present full trails as a way to explore, learn, and complete multi-step tasks (Singla et al., 2010). Olston and Chi (2003) proposed ScentTrails that leverage an interface that combines browsing and searching and highlights potentially relevant hyperlinks. WebWatcher (Joachims et al., 1997), like ScentTrails, underlined the relevant hyperlinks and improved the model based on the implicit feedback collected during previous tours.

To summarize, the key distinctions of our work compared to previous efforts are as follows. Similar to the exploratory search, trails, and conversational search, our model proposes an iterative information-seeking process and designs an interface for user interactions to guide struggling users and help them better understand the information space. However, that work that only focuses on user interaction modeling and limits users in issuing short and imprecise queries and utterances, our model provides a platform for users to express their information needs in the form of long and complex requests. Users can utilize this capability to express their intent more accurately and prune significant parts of the search space for the exploratory search process. Adding this capability needs an advanced NLU step and different machine learning components to understand and guide the final user through the search process. To this end, the proposed system has two new components, an intent ontology and a profile for partitioning the information space, enabling the IA to help users be more effective in exploring the search space.

3. Pluto: data collection infrastructure

Since the proposed problem is novel and requires non-trivial user interaction data, we designed a new pipeline, Pluto, to collect such data. Pluto uses a human-in-the-loop setup for data collection and curation. It is comprised of two main components depicted in Figure 2:

Phase1

Refinement of complex user request in natural language;
Phase2

Refinement of retrieved list of suggestions.

Complex user request refinement.

When a user issues a request in natural language to express their complex information needs, which potentially has many expressed constraints (see Figure 1 for several mentioned in the example), GPT-3 (Brown et al., 2020) is leveraged to understand the request’s intent and identify explicitly mentioned aspects.

Once GPT-3 has identified these aspects, they will be used as the initial set for the request. To further expand this set, this phase will proceed to identify an additional list of aspects to be presented to users as a supplemental set of relevant considerations for their request.

As stated, Pluto has integrated human-in-the-loop into its pipeline. The goal of human agents is to intervene at certain stages of the system to offer human judgment. One such intervention occurs when agents review users’ requests, at which point they can correct the aspects this phase identified in the request as well as add new ones to better serve user needs.

Suggestion refinement.

Here, Pluto performs two tasks. First, it receives the slots selected by the user for processing and suggests additional slots (so as to further narrow down the request, with the aid of the user). These new slots can be generated via GPT-3 or by intervention from the human agents. Second, Pluto leverages the search engine to produce a series of suggestions that meet the slots for the request as well as the new slot proposal. GPT-3 is leveraged at this stage to aid in determining which potential suggestions meet which aspects from the request so that the system can rank them. Human agents then make final decisions on which suggestions to present to users. Once that is done, users can either accept the suggestions if they are satisfying, or request another iteration of the retrieval phase. When users request another iteration, they may change either the wording of the request or add/remove aspects from it (including the newly suggested ones). Additionally, for any iteration of this phase, users can provide feedback that is captured via a form to help refine the system.

Finally, human agents are responsible for another very valuable and essential contribution: intent and aspect curation. In either of the phases described above, GPT-3 may suggest various aspects and intents that sometimes are not as relevant or useful. All of these are considered entries into the dynamic intent ontology. However, human agents then curate them. Intent and aspects that are considered higher quality by the agents are then given more weight when suggesting aspects in either of the two phases.

Data handling.

Users of Pluto were supplied with a consent form explaining that their requests and interactions would be viewed by human agents and some members of the development team. Further, the human agents in the loop also consented to have their interactions with the system recorded. All data and interactions were anonymized, and no personal identifiers of users or agents were retained in any of the modeling and experimentation in our work.

Next, we will describe the formal problem description and elaborate on the problem formulation and the proposed interactive agent.

4. Problem description

In this section, we formalize the problem of interactive intent modeling for supporting complex information-seeking tasks.

Notation.

We begin by formally defining used notation as follows:

•

User request $(cr)$ : is a complex information-seeking task expressed in natural language, which contains multiple functional desiderata, preferences, and conditions (e.g. Figure 1).
•

Request topic $(\tau)$ : determines the topic the request belongs to, e.g. “activity” or “service”, where $\tau\in T$ is the list of all existing topics.
•

User intent $(i)$ : for each topic $\tau$ , a list of user intents can be defined. Intents are the identification of what a user wants to find. For example, in Figure 1, the user request has an “activity” topic, with a “hiking” intent. This definition allows having identical intents in different domains, where $i\in I$ is the list of all intents.
•

Slot $(\xi)$ and aspect $(\alpha)$ : for each specific topic $\tau$ and user intent $i$ , a list of slots $\xi$ are defined that describe features and properties of the intent $i$ in topic $\tau$ , and aspect (values) $\alpha$ is a restriction on the slot $\xi$ . For example, from Figure 1, “date” is a slot related to “hiking”, with aspect value of “May 9th to May 29th, 2021”.
•

Intent completion score $(ICS_{i})$ : is a score to estimate the completeness of user intent $i$ in the interaction step $j$ .
•

Semantic representation $(\sigma)$ : is an information frame that represents an abstract representation of the $cr$ as $(\tau,i,[\xi_{0\dots n}],ICS_{(\tau,i)})$
•

Intent ontology $(\Omega)$ : is the graph structure representing relations among the defined domains, intents, and slots.
•

Intent profile $(\Phi)$ : is the list of all conditional distributions $P(\xi|i,\tau)$ , for slot $\xi$ with respect to topic $\tau$ and intent $i$ . It can be changed over time via user interactions with specific intent $i$ and topic $\tau$ .
•

List of retrieved suggestions $S=(Sug_{0},\dots,Sug_{n})$ : is the list of retrieved suggestions in response to $cr$ .

Problem formulation.

Algorithm 1 The proposed interactive user intent modeling for supporting complex information-seeking tasks.

Input: User complex search request

NLU component starts

\triangleright

/*This is a comment*/

Pre-processing of complex request

cr

missingForming prompt by concatenating

cr

with training data

\sigma=(\tau

i

\xi_{0\dots n}

ICS_{(\tau,i)})\leftarrow

NLU(

cr

)

Initialize

c

= one of the methods in section 5.3

RL model starts

\triangleright

/*This is a comment*/

while

ICS_{(\tau,i)}\leq\mu(P(\xi_{k}|i,\tau))+sd(P(\xi_{k}|i,\tau))

if User Feedback then

1- Retrain the RL policy based on user feedback

2- Update context

c

based on user feedback

3- Update

ICS_{i}

in Eq. 2

ICS_{(\tau,i)}=\Sigma^{n_{\tau,i}}_{k=1}{P(\xi_{k}|i,\tau)}+\Sigma^{m_{\tau,i}}_{j=1}{P(\xi_{j}|i,\tau)}

else break

end if

end while

Retrieval step starts

\triangleright

/*This is a comment*/

p_{s}

= []

\triangleright

/*potential suggestions*/

for slot

\xi

in context

c

query

\leftarrow i+with+\xi+in+

location

top-10 documents

\leftarrow

search engine_API (query)

Update

p_{s}\leftarrow p_{s}\cup

top-10 documents

end for

return top-K results from GPT-3 Ranker(

P_{s}

)

This section provides a high-level problem formulation. The desired IA aims to map a request expressing complex information-seeking task $cr$ to a set of relevant suggestions $S$ , as illustrated in Figure 3. The proposed model is comprised of three main components:

(1)

Natural Language Understanding (NLU) component: consists of a topic and intent classifier, and a slot tagger to extract topic $\tau$ , user intent $i$ , and a list of slots $\xi_{0\dots n}$ , respectively. The unit leverages GPT-3 to improve and generalize the predictions for unseen slots. Finally, NLU generates the semantic representation $\sigma=(\tau,i,[\xi_{0\dots n}],ICS_{(\tau,i)})$ for a complex request $cr$ .
(2)

Interactive intent modeling component: is an iterative model leveraging contextual multi-armed bandits (Cortes, 2018) that receives the semantic representation $\sigma$ and context $c$ for the request $cr$ from the NLU unit and predicts the most relevant set of slots for $c$ .
(3)

Retrieval component: generates a sequence of sub-queries based on the list of slots and their corresponding aspects. Relevant documents are retrieved from the Web using Search engine API and ranked by GPT-3 to provide the final list of retrieved suggestions $S=\{Sug_{0},\dots,Sug_{k}\}$ .

To summarize, this section formally defines a problem we intend to solve (Algorithm 1) in the next section.

5. Method Description

This section presents a detailed description of the proposed strategy to model Interactive Agent (IA).

5.1. Creating intent profile $\Phi$

Based on the intent ontology $\Omega$ created in Section 3, and historical users’ interactions with topic $\tau$ , intent $i$ , slot $\xi$ , a dynamic intent profile $\Phi$ can be formed as shown in Figure 4. To do so, for each individual $\xi$ , $i$ , and $\tau$ , the intent profile stores a conditional probability, which can be updated in real-time using new user interactions with triple $(\tau,i,\xi)$ . The conditional probability $P(\xi|i,\tau)$ is computed as follows:

(1)

P(\xi_{k}|i,\tau)=\frac{Frequency(\xi_{(\tau,i,k)})}{\Sigma^{N_{\xi}}_{j=1}{Frequency(\xi_{(\tau,i,j)})}}

where $\xi_{(\tau,i,k)}$ is the $kth$ slot for intent $i$ , and topic $\tau$ , $N_{i}$ is the number of slots for intent $i$ and topic $\tau$ in intent ontology $\Omega$ .

5.2. NLU component

The NLU unit contains three main components: (1) a topic classifier, (2) an intent classifier, and (3) a slot tagger. For each incoming complex request $cr$ , this unit generates a semantic representation as follows: $\sigma=(\tau,i,[\xi_{0\dots n)}],ICS_{(\tau,i)})$ . Figure 5 shows the NLU unit for the proposed model.

GPT-3

To generate the semantic representation, we leveraged GPT-3 (Brown et al., 2020), a generative large language model trained on massive amounts of textual data that has proven capable of natural language generalization and task-agnostic reasoning. One of the hallmarks of GPT-3 is its ability to generate realistic natural language outputs from few or even no training examples (few-shot and zero-shot learning).

The creativity of the model for generating arbitrary linguistic outputs can be controlled using a temperature hyperparameter. We use an internal deployment of GPT-3¹¹1Based on https://beta.openai.com/. as the basis for our NLU.

We leveraged the few-shot prompting technique (Brown et al., 2020; sun2022recitation) for inference, where the training data collected in section is used to form the few-shot prompt for all the GPT-3 requests. Finally, the actual request is concatenated with the data and forms the final prompt.

Intent Completion score

We propose a score Intent Completion Score (ICS) to manage the number of interactions for the interactive steps. The ICS value can be calculated using the semantic representation $\sigma$ and the generated dynamic intent profile $\Phi$ . The initial ICS value is equal to the summation over all the conditional probabilities of slots in the request. Then, in the following steps, ICS becomes updated by new slots that the user selects.

(2)

ICS_{(\tau,i)}=\Sigma^{n_{\tau,i}}_{k=1}{P(\xi_{k}|i,\tau)}+\Sigma^{m_{\tau,i}}_{j=1}{P(\xi_{j}|i,\tau)}

Where $n_{\tau,i}$ is the number of explicitly mentioned slots in the $cr$ and $m_{\tau,i}$ is the number of selected slots through the interactive steps. Also, $P(\xi|i,\tau)$ indicates the conditional probability extracted from intent profile $\Phi$ in Eq. 1.

5.3. Interactive user intent modeling

Algorithm 2 Contextual Multi-armed Bandit Model.

CBM_{(\tau,i)}

is the contextual bandit model trained on the topic

\tau

and intent

i

\Pi_{\theta}(.|c)

is the policy to train the

CBM_{(\tau,i)}

with respect to context

c

Input: semantic representation:

\sigma=(\tau,i,[\xi_{0\dots n}],ICS_{(\tau,i)})

Generate context vector

c

using Method 1,2, or 3 in section 5.3

Select the

CBM_{(\tau,i)}

based on

(\tau,i)

tuple

Initialize Policy

\Pi_{\theta}(.|c)

with random weights (a feed-forward Neural Network or a linear regression)

for each step and context of

c

Sample actions

a_{0\dots k}

from list of actions:

a_{t}\sim\Pi_{\theta}(.|c)

Receive reward

\epsilon

(user feedback by selecting actions)

Add sampled actions

a_{{0\dots k}}

to the list of observed arms

Update Policy

\Pi_{\theta}

with new reward

Update

c\leftarrow c\cup a_{s}

\triangleright

a_{s}

is the selected actions

\subset a_{{0\dots k}}

end for

We leveraged contextual multi-armed bandits to model online user interactions. In each iteration, the system interacts with users, receives user feedback, and updates its policies. Multi-armed bandits (Barraza-Urbina and Glowacka, 2020) are a type of RL model that make rewards immediately available after the interaction of an agent with the environment. Contextual multi-armed bandits are an extension of multi-armed bandits, where the context of the environment is also modeled in predicting the next step. Contextual multi-armed bandits are utilized in the interactive agent as users are capable of providing feedback for the agent in each step. We trained a separate contextual multi-armed bandit to represent each $(\tau,i)$ pair as shown in Algorithm 2. The corresponding bandit model is then invoked at the inference time, based on the semantic representation $\sigma$ . One of the main elements in designing the contextual bandits is how to represent the context $c$ . To this end, we suggested three different methods that are described in the following sections.

Method 1:

This method uses a one-hot representation of the semantic representation $\sigma$ . During the interactions with our agent, the one-hot representation is updated by adding newly selected slots. As a result, the size of the context $c$ equals the number of slots for each specific intent.

(3)

c=\vec{O_{k}}=\sum^{N}_{j=1}{\bigg{\{}^{1\hskip 14.22636ptx_{(j,k)}\in\xi_{i}}_{0\hskip 14.22636ptx_{(j,k)}\notin\xi_{i}}}

Where $\vec{O_{k}}$ is the one-hot vector of the collected slots in the interaction step $k$ . $N$ is the total number of slots in the interaction step $k$ , and $\xi_{i}$ is the slots belonging to intent $i$ .

Method 2

In method 2, the request representation is concatenated with the one-hot representation of the slots to enrich the context representation. We used the Google Universal Sentence Encoder(USE) (Cer et al., 2018), which is trained with a deep averaging network (DAN) encoder for encoding text into a 512-dimensional vector for each request.

(4)

\vspace{-0.5em}c=\vec{O_{k}}+\vec{USE_{Q}}

Where $\vec{O_{k}}$ is the one-hot vector of the collected slots at step $k$ .

Method 3

Inspired by session-based recommender systems (Wu and Yan, 2017), we developed a deep learning model in Figure 6 to extract the slot representations. users were excluded from the model as we only focused on intent modeling independent of the user. The goal is to predict the list of slots most likely to be selected by the user, given the input request and explicitly mentioned lists of slots in semantic representation $\sigma$ .

The model consists of (1) an embedding layer, (2) a representation layer, and (3) a prediction layer. We used sigmoid cross-entropy to compute the loss since the task is a multi-label problem: a subset of slots is predicted for an input list of slots and the request representation. Finally, max-pooling is done across all the slot embeddings and concatenated with the request embedding vector to represent $c$ .

(5)

c=MaxPooling(\vec{O_{k}}*\xi_{(\tau,i,j,k)})+\vec{USE_{Q}}

where $\xi_{(\tau,i,j,k)}$ is the $j^{th}$ slot embedding, with respect to intent $i$ and topic $\tau$ , and $\vec{O_{k}}$ is the one-hot vector of the collected slots in step $k$ .

Threshold to stop iterations:

We leverage the $ICS_{(\tau,i)}$ score to stop the contextual bandit iterations, which has a steady increase in its value through the interactions. To this end, when this value becomes greater than a threshold, the contextual bandit model stops iteration. The threshold varies per $(\tau,i)$ pair. Hence, we consider a threshold value of $ICS_{(\tau,i)}\leq\mu(P(\xi_{k}|i,\tau))+sd(P(\xi_{k}|i,\tau))$ the mean plus the standard deviation of the slot distribution within $(\tau,i)$ .

5.4. Retrieval Component

To extract the final recommendations for the users, we use a retrieval engine that consists of two main components: 1) search retrieval and 2) ranking. For the retrieval part, we need to collect a corpus that is representative of the search space on the Web. Then, we can evaluate the pre-retrieval metrics is discussed in section 7.1. for both initial requests and reformulated requests at inference time.

Corpus collection:

To generate the corpus, we need to issue a series of queries to a search engine that will capture the search space of the web. Algorithm 3, in section LABEL:sec:appendix, shows the steps we used to generate these queries and collect the corpus. In essence, we leveraged a pool of sub-queries derived from the internal intent ontology. To create these sub-queries, we use the idea of request refinement using request sub-topics (Nallapati and Shah, 2006), and generated a list of them by combining each selected topic/intent with the set of aspects we have associated with it.

Finally, these queries were issued to the Bing Web Search API, and the top 100 results (consisting of the page’s title, URL, and snippet) for each query were added to the corpus.

Few-shot Ranker:

A few-shot GPT-3 model, which has been fine-tuned on a limited number of training samples, is deployed on the pool of potential suggestions extracted from the Web Search API. The GPT-3 ranker then ranks all the potential suggestions concerning the evolved user intent and the actual aspect values $\alpha$ . The GPT-3 ranker considers the user preferences for the final ranking results.

6. datasets

To evaluate the proposed interactive model, we leveraged the real data collected through user interaction with Pluto. We collected more than $16,699$ user requests with $166,990$ user interactions for training, and $1,140$ user requests with 13,840 interactions for testing. In Section 8, we describe a crowd-sourcing procedure that is designed to collect annotated data, which is used to train and test the slot tagger in the Natural Language Understanding (NLU) unit. Section 6.2 describes the interactive data collected via Pluto (Section 3). ²²2The datasets contain potentially sensitive data and cannot be shared publicly due to privacy concerns. However, we believe that using the presented descriptions the dataset collection can be reproduced.More details about data collection steps and evaluating the annotation process are described in section 8 and 6.2.

6.1. Dataset Collected for NLU unit

To collect the data for training and evaluating the NLU model, we use a crowd-sourcing platform that provides an easy way for researchers to build interfaces for data collection and labeling. Using the platform, we developed a simple interface that presented annotators with a natural language request paired with up to five possible slots. Annotators were then asked to mark relevant slots and given the opportunity to highlight sections of the request that mapped to the slot $\xi$ to their corresponding aspect $\alpha$ in question. Figure 7 shows a screenshot of the labeling interface.

The set of requests and slots presented to annotators was created from a seed set of $3,246$ requests, where each request was paired with all the slots from the subsuming intent. Three annotators then used the interface to map slots to requests as appropriate.

Evaluating quality of the collected dataset.

Requests were randomly selected from two different topics and 14 user intents, respectively (Table 1). We only chose two topics as the selected intents were a part of these two topics. Three different human annotators manually labeled these queries through the data collection interface described in the previous section. Table 1 presents Krippendorff’s alpha scores (Krippendorff, 2011) across all the intents. A score above or equal to 0.667 is often considered a good reliability test. The results demonstrate an acceptable agreement among all annotators, except “Hike” intent which shows a moderate agreement (Krippendorff, 2011). After evaluating the $\Omega$ , we notice that the slots for “hike” intent have overlapped, meaning there are slots that refer to the same thing with different textual representations. These semantic overlaps happened even after normalization with the clustering, which sometimes confuses annotators.

Topic	Intent	K (dist)	Topic	Intent	K (dist)
Service	restaurants	0.74 (12%)	Service	appliance	0.71 (11%)
Service	electrician	0.79 (13%)	Service	hotel	0.71 (2%)
Service	landscaping	0.67 (16%)	Service	handyman	0.75 (2%)
Activity	hike	0.58 (10%)	Service	cleaners	0.69 (4%)
Activity	general	0.74 (8%)	Service	remodeling	0.82 (3%)
Activity	spring break	1.00 (5%)	Activity	daytrip	0.73 (2%)
Activity	campground	0.74 (6%)	Activity	summercamp	0.75 (6%)

Table 1. Krippendorff’s score across all

(\tau,i)

tuples in the

\Omega

. Where

K

and dist stand for Krippendorff’s score and the distribution of the tuple in the dateaset.

6.2. Dataset Collected via Pluto

The data for training and evaluating our proposed model is collected during six months of Pluto proprietary interactive logs described in Section 3. We used the first five months to form the training set and reserved the last month for testing. Since GPT-3 is a generative model, the suggested slots during data collection may not be expressed identically, despite representing the same underlying intent (e.g.“access to parking” and “parking availability”). To address this issue, we used a universal sentence encoder (Cer et al., 2018) to softly match a generated slot to a slot in $\Omega$ . The slot with the lowest cosine distance is considered the target slot. Figure 8 illustrates an example of a data instance.

Pluto is capable of covering hundreds of different user intents. In this study, however, we selected the $14$ most frequent search intents in the logs because we observed a sharp drop-off in frequency after that. Table. 1 represents the intent values with their corresponding topic. Each sample in the collected interactive dataset has the form of $\xi_{j\dots n}\rightarrow\xi_{k\dots m}$ , where there is no intersection between two sets of slots $\xi_{j\dots n}\cap\xi_{k\dots m}=\emptyset$ . The selected slots are the slots the user selects during the interaction with the interactive agent. We collected more than $16,699$ user requests with $166,990$ user interactions for training, and $1,140$ user requests with $13,840$ interactions for testing.

6.3. Corpus Collection

To generate the corpus, we need to issue a series of queries to a search engine that will capture the search space of the web. Algorithm 3 shows the steps used to generate these queries and collect the corpus.

Algorithm 3 Algorithm to generate corpus for evaluation. L is the name of all US major cities.

Input:

\Omega_{(\tau,i)}

, and L is the user location.

corpus = []

for each location in database L do

for each

(\tau,i)

\Omega

for each

\xi

\Omega_{(\tau,i)}

query

\leftarrow i+\text{near}+location+\text{with}+\xi

top-100 documents

\leftarrow

search engine_API (query)

corpus = corpus

\cup

top-100 documents

end for

Output: corpus

7. Experimental Setup and Results

For convenience, we summarize the methods compared for reporting the experimental results as follows:

Method 1: Popularity Method (Baseline)

The popularity-based method is a heuristic, suggesting the next set of related slots based on overall frequency (popularity) in the intent profile $\Phi$ . The order of suggestions can change over time as some slots become more popular for specific intents.

Group 1: Contextual Multi-armed Bandit Policies

We report the results on $13$ different policies for contextual bandit models, including: “Bootstrapped Upper Confidence Bound”, “Bootstrapped Thompson Sampling”, “Epsilon Greedy”, “Adaptive Greedy”, “SoftMax Explorer”, etc. which have been extensively investigated in (Cortes, 2018). The library to implement the policies is available here³³3https://contextual-bandits.readthedocs.io/en/latest/.

Group 2: Different context representation:

We report the results for the three different proposed context $c$ described in Section 5.3.

7.1. Evaluation Metrics

Evaluating complex search tasks has always been quite challenging. Since the task is not supervised and there is no available dataset or labels, we could not directly evaluate the results. In addition, our goal is to refine requests in a way that they lead to better suggestions. Therefore, we propose to employ Query Performance Prediction (QPP) metrics for evaluation purposes. QPP task is defined as predicting the performance of a retrieval method on a given input request (Carmel and Yom-Tov, 2010; Cronen-Townsend et al., 2002; He and Ounis, 2004). In other words, query performance predictors predict the quality of retrieved items w.r.t to the query. QPP methods have been used in different applications such as query reformulation, query routing, and in intelligent systems (Sarnikar et al., 2014; Roitman et al., 2019). QPP methods are a promising indicator of retrieval performance and are categorized into pre-retrieval, and post-retrieval methods (Carmel and Yom-Tov, 2010).

Post-retrieval QPP methods generally show superior performance compared to pre-retrieval ones, whereas the pre-retrieval QPP methods have been more often used in more real-life applications and can address more practical problems since their prediction occurs before the retrieval.

In addition, almost all of the post-retrieval methods work based on the relevance scores of the retrieved list of documents, and in our case, the relevance score was not available from the search engine API; thus, we only employed pre-retrieval QPP methods for this work’s evaluation purposes. Having said that, we predict and compare the performance of the initial complex requests as well as our reformulated requests using SOTA pre-retrieval QPP methods which have been shown to have a high correlation with retrieval methods on different corpora (Hashemi et al., 2019; Arabzadeh et al., 2020a, b; Zhao et al., 2008; Hauff et al., 2008, 2009; Carmel and Kurland, 2012; He and Ounis, 2004). The intuition behind evaluating our proposed method with pre-retrieval QPP methods is that QPP methods have shown to be a promising indicator of performance. Therefore, we can compare the predicted performance of the initial complex request as well as our reformulated request and predict which one is more likely to perform better. To simply put, higher QPP values mean that there is a higher probability that the request is going to be easily satisfied, and lower QPP values indicate a higher chance of poor retrieval results.

In the following, we elaborate on the SOTA pre-retrieval QPP methods that showed promising performance over different corpora and query sets, and we leveraged them for evaluating this work.

Simplified Clarity Score (SCS): SCS is a specificity-based QPP method, which captures the intuition behind that the more specific a query is, the more likely a system is to specify the query (He and Ounis, 2004; Plachouras et al., 2004). SCS measures the KL divergence between the query and the corpus language model, thereby capturing how well the query is distinguishable from the corpus.

Similarity of Corpus and Query (SCQ): SCQ leverages the intuition that if a query is more similar to the collection, there is a higher potential to find an answer in the collection (Zhao et al., 2008). Concretely, the metric measures the similarity between collection and query for each term and then aggregates over the query, reporting the average of each query term’s individual score.

Neural embedding based QPPs (Neural-CC): Neural embedding-based QPP metrics have shown excellent performance on several Information Retrieval (IR) benchmarks. They go beyond the traditional term-frequency based QPP metrics and capture the semantic aspects of terms (Zamani et al., 2018; Roy et al., 2019; Arabzadeh et al., 2019, 2020a, 2020b; Khodabakhsh and Bagheri, 2021; Roitman, 2020; Hashemi et al., 2019). We adapted one of the recently proposed QPP metrics which build a network between query terms and their most similar neighbors in the embedding space. Similar to (He and Ounis, 2004; Plachouras et al., 2004), this metric is based on query specificity. The intuition behind this metric is that specific queries play a more central and vital role in their neighborhood network than more generic ones. Here, as suggested in (Arabzadeh et al., 2020a, b, 2019), we adapted the Closeness Centrality (CC) of query terms within their neighborhood network, which has shown to have the highest correlation across different IR benchmarks.

Training Parameters:

For contextual bandits and GPT-3 models, the default parameters for the available libraries were used, and no parameter tuning was performed. To train the deep learning model described in Section 7, we use an Adam optimizer with a learning rate of $\eta=0.001$ , a mini-batch of size 8 for training, and embedding of size 100 for both words and aspects. A dropout rate of 0.5 is applied at the fully connected and ReLU layers to prevent potential overfitting. We used the default parameter for training; however, a smaller batch size was preferable based on the available dataset size.

7.2. Experimental Results

We compare the result of QPP metrics on our best policies and popular attributes with the original requests in Figure 9, where we report the percentage difference w.r.t the full form. That is, to what extent do the QPP metrics predict that the reformulated requests are likely to perform better than the original ones. We examine the difference between the average of QPP metrics on reformulated requests with the best policy (adaptive active greedy) and the full form of requests. In addition, we compared the reformulated requests with popular attributes and the full form of the request and reported them in the same figure. As shown in Figure 9, the adaptive active greedy policy showed improvements over all three QPP metrics and on all intents.

These bars in Figure 9 can be interpreted as the percentage of predicted improvement for the reformulated requests compared to full form of requests. For instance, for restaurant intent, SCQ, SCS, and neural embeddings QPP methods, have improved by 3.3%, 3.1%, and 22.5%, respectively. We measure statistical significance using a paired t-test with a p-value of 0.05. We note that while the improvement made by the adaptive active greedy policy were consistently statistically significant on all intents by the SCQ and neural embedding QPP metrics, the gains were only statistically significant 4 intents on the SCS metric: “Restaurants”, “Landscaping”, “Home cleaners”, and “Home Remodeling.”

It should be noted that while QPP methods are potential indicators of performance, every QPP method focuses on a different quality aspect of the query. Therefore, they do not necessarily agree on the predicted performance according to different queries, corpora, or retrieval methods. This observation has been made on different SOTA QPP methods and various well-known corpora such as TREC corpora or MS MARCO and their associated query set (Carmel and Yom-Tov, 2010; Arabzadeh et al., 2021; Hashemi et al., 2019). Thus, we conclude that the level of agreement could strengthen our confidence in the query performance prediction. In other words, the more QPP metrics agree on query performance, the more confidence we have in that prediction. In addition, we can interpret each QPP performance based on the intuition behind them. For example, the SCS method relies on query clarity level, while the SCQ method counts on the existence of potential answers in the corpus. When the two QPP methods do not agree on the query’s performance, we consider it as how the query satisfies the intuition behind one of the QPP methods while failing to satisfy the others. For example, take the ‘activity’ intent in Figure 9, in which the SCQ methods showed significant improvement, but the SCS method did not. We interpret this observation as the clarity of the query has not been significantly increased while refined by our method. However, the query was expanded so that the existing potential answers in the corpus has increased.

NLU evaluation

To evaluate the topic and intent classifiers, the evaluation set described in Section 6.2 is used, which contains more than $16,699$ user requests with $166,990$ user interactions for training, and $1,140$ user requests with $13,840$ interactions for testing. The model achieved a 99.3% and 95.2% accuracy for topic and intents, respectively. For evaluating the slot tagger, we leveraged the annotated data collected by three different judges described in Section 8 performing 4-fold cross-validation and achieved a 0.75 macro-F1 across all the intents and slots. The results for slot tagging are promising despite the challenges, e.g., a small amount of labeled data, a large number of slots per intent, and overlapping slots across user intents. The results indicate the ability of GPT-3 to generalize in few-shot learning.

7.3. Ablation Analysis

Broad vs. Specific:

Studying a system’s performance deeply on a per-query basis can enlighten where the systems fail, i.e., which queries a system fails to answer and which group of queries can be handled easily. Thus, it could potentially lead to future improvements to the system. As such, exploring query characteristics has always attracted lots of attention between IR and NLP communities because query broadness has shown to be a crucial factor that could potentially lead to an unsatisfactory retrieval (Song et al., 2009; Clarke et al., 2009; Sanderson, 2008; Min et al., 2020). Here, we separately study the performance of our proposed method on two groups of broad and specific queries. We are interested in examining whether our proposed method can address both requests consistently, i.e., broad and specific ones. Here we define the broad requests as the requests with less complex information-seeking tasks and fewer preferences expressed; they are short and contain a small number of slots/values ( $\leq$ 3), hence requiring more steps for the RL model to refine the user intent. On the contrary, the specific requests is defined as the longer ones which contain many slots/values, and users need fewer steps to finalize their intent.

Figure 10 presents the evaluation results of broad and specific requests. As demonstrated in this figure, although all the employed QPP metrics agreed that both types of requests had been improved, Adaptive Active Greedy would perform relatively better on broad queries compared to specific ones. It is an expected output because specific requests are more complex than broad ones, and more criteria should be addressed to satisfy them. Moreover, suggesting the popular slots have a deteriorating effect on all the metrics across the intents for the specific requests, showing a challenging reformulation process, while the proposed model in all metrics improves the QPP.

Different Context $c$ :

We compare three different proposed contexts described in section 5.3 in terms of percentage difference with respect to the original form of requests on predicted performance by QPP metrics in Figure 11 on the top-5 most popular intents. The results show all three proposed contexts outperform the original representation across all metrics and intents. We observe that QPP metrics do not consistently agree on the predicted performance between these three different methods. While neural-cc predicts that method 3 and method 2 to define the $c$ perform better than method 1. We also noticed that SCS and SCQ sometimes behave the opposite. We hypothesize that this difference could potentially be because neural-cc works based on neural embedding while SCS and SCQ work based on term frequency and corpus statistics. Therefore, each group might capture different aspects of requests. Although all the proposed contexts $c$ significantly outperform the original query, we cannot conclude which one among them outperforms the others.

Policy evaluation for contextual bandit model:

We performed an experiment for policy evaluation on contextual bandits. We selected the popular intent for off-policy evaluation. Off-line contextual bandits assessment is complicated because they interact in online environments. There are multiple methods for policy evaluation for off-line settings such as Rejection Sampling (RS) (Li et al., 2010) and Normalised Capped Importance Sampling (NCIS) (Gilotte et al., 2018). All the results are reported based on the best arm performance. The system can expose users to multiple slots. As a result, in the proposed setting, the final performance will be much better than the described results.

Avg. reward	restaurant	landscaping	hike	activity	appliance
RS	0.538	0.352	0.455	0.25	0.375
NCIS	0.413	0.469	0.555	0.407	0.654
Real	0.378	0.440	0.495	0.396	0.670

Table 2. Policy evaluation results for RS and NCIS models.

According to the results, RS sometimes underestimates the performance on intents like ”restaurant” and “appliance repair” with overestimating other intents such as ”hike.” The NCIS method provides a more accurate estimation and provides a more realistic estimate.

8. Discussion and Implications

This paper proposed a novel application of natural language interfaces, allowing cold-start users to submit and receive responses for complex information-seeking requests expressed in natural language.

Unlike traditional search engines, where a single most relevant result is expected, users of our system are presented with a set of suggestions for further exploration. We have designed and deployed a system that permitted us to conduct initial data collection and potential future online experimentation using the A/B testing paradigm.

To complement this platform for complex user requests, we leveraged the advances in generative language models and designed a platform to make them interactive for the future design of Search Engines. We developed a novel interactive agent based on contextual bandits that guide users to express the initial request more clearly by refining their intents and adding new preferential desiderata. During this guided interaction, a NLU unit, designed on top of an LLM model, is used to build a structured semantic representation of the request. The system also uses a proposed Completion Intent Score (CIS) that estimates the degree to which intent is entirely expressed at each interaction step.

To efficiently leverage the power of generative language models in designing NLU, we used the few-shot prompting technique. We proposed a pipeline named Pluto to collect high-quality samples labeled by human annotators described in section . Pluto used a human-in-the-loop setup for data collection and consisted of two main components 1) Refinement of complex user requests in natural language, and 2) Refinement of the retrieved list of suggestions. The training set along with the current user request is used to form the prompt for GPT-3, then the extracted aspects were used as an initial point for an exploratory search to elicit user preferences.

These high-quality samples are used as a few-shot prompt and concatenated with the final user query to form the final prompt for the NLU inference.

When the system determines that an optimal request has been expressed, it leverages a search API to retrieve a list of suggestions. To demonstrate the efficacy of the proposed modeling paradigm we have adopted various pre-retrieval metrics that capture the extent to which guided interactions with our system yield better retrieval results. In a suite of experiments, we demonstrated that our method significantly outperforms several robust baseline approaches. We used three SOTA pre-retrieval QPPP metrics such as SCQ, SCS, and neural embeddings QPP to evaluate our method. The proposed interactive LLM model had promising results on different user intents e.g., restaurant intent, where SCQ, SCS, and neural embeddings QPP improved by 3.3%, 3.1%, and 22.5%, respectively.

This article focused on making generative language models interactive and alleviating their problem in responding to immediate user complex information-seeking problems. The generative language models suffer from hallucinating in providing accurate factual knowledge, especially user questions about what is going on in the world currently, as nearly all of them are trained on huge amounts of data collected in the past. The proposed IA can also benefit from the fast advancements in the generative language models, such as ChatGPT (Leiter et al., 2023) and GPT-4 (OpenAI, 2023), and replace its core NLU component with the larger and more advanced models.

In future work, we plan to design an online experiment that will involve business metrics, such as user satisfaction and the ratio of returning users and interactively collect ratings for the list of suggestions made by our system. This will allow us to learn from language and rating data jointly. Another possible direction is designing intent ontologies in a more complex hierarchical form where there are more complex and hierarchical dependencies between attributes. Finally we plan to investigate the reliance of the proposed interactive LLM model on GPT-3 as compared to the more recent larger models such as GPT-4, specifically the study of current drawbacks of current LLM such as hallucination and its impact on the proposed model.

References

Adiwardana et al. (2020) Daniel Adiwardana, Minh-Thang Luong, David R So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, et al. Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977, 2020.
Ageev et al. (2011) Mikhail Ageev, Qi Guo, Dmitry Lagun, and Eugene Agichtein. Find it if you can: a game for modeling different types of web search success using interaction data. In SIGIR, 2011.
Aliannejadi et al. (2020) Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail Burtsev. Convai3: Generating clarifying questions for open-domain dialogue systems (clariq). 2020.
Aliannejadi et al. (2021) Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeffrey Dalton, and Mikhail Burtsev. Building and evaluating open-domain dialogue corpora with clarifying questions. arXiv preprint arXiv:2109.05794, 2021.
Arabzadeh et al. (2019) Negar Arabzadeh, Fattaneh Zarrinkalam, Jelena Jovanovic, and Ebrahim Bagheri. Geometric estimation of specificity within embedding spaces. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 2109–2112, 2019.
Arabzadeh et al. (2020a) Negar Arabzadeh, Fattane Zarrinkalam, Jelena Jovanovic, Feras Al-Obeidat, and Ebrahim Bagheri. Neural embedding-based specificity metrics for pre-retrieval query performance prediction. Information Processing & Management, 57(4):102248, 2020a.
Arabzadeh et al. (2020b) Negar Arabzadeh, Fattane Zarrinkalam, Jelena Jovanovic, and Ebrahim Bagheri. Neural embedding-based metrics for pre-retrieval query performance prediction. Advances in Information Retrieval, 12036:78, 2020b.
Arabzadeh et al. (2021) Negar Arabzadeh, Maryam Khodabakhsh, and Ebrahim Bagheri. Bert-qpp: Contextualized pre-trained transformers for query performance prediction. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 2857–2861, 2021.
Ban and He (2021) Yikun Ban and Jingrui He. Local clustering in contextual multi-armed bandits. In Proceedings of the Web Conference 2021, pages 2335–2346, 2021.
Barraza-Urbina and Glowacka (2020) Andrea Barraza-Urbina and Dorota Glowacka. Introduction to bandits in recommender systems. In Fourteenth ACM Conference on Recommender Systems, pages 748–750, 2020.
Belkin (1980) Nicholas J Belkin. Anomalous states of knowledge as a basis for information retrieval. Canadian journal of information science, 5(1):133–143, 1980.
Berant et al. (2013) Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1533–1544, 2013.
Bernardi et al. (2015) Lucas Bernardi, Jaap Kamps, Julia Kiseleva, and Melanie JI Müller. The continuous cold start problem in e-commerce recommender systems. arXiv preprint arXiv:1508.01177, 2015.
Bouneffouf and Rish (2019) Djallel Bouneffouf and Irina Rish. A survey on practical applications of multi-armed and contextual bandits. arXiv preprint arXiv:1904.10040, 2019.
Brown et al. (2020) Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Carmel and Kurland (2012) David Carmel and Oren Kurland. Query performance prediction for ir. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pages 1196–1197, 2012.
Carmel and Yom-Tov (2010) David Carmel and Elad Yom-Tov. Estimating the query difficulty for information retrieval. Synthesis Lectures on Information Concepts, Retrieval, and Services, 2(1):1–89, 2010.
Cer et al. (2018) Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Céspedes, Steve Yuan, Chris Tar, et al. Universal sentence encoder. arXiv preprint arXiv:1803.11175, 2018.
Christakopoulou (2018) Konstantina Christakopoulou. Towards Recommendation Systems with Real-World Constraints. PhD thesis, University of Minnesota, 2018.
Christakopoulou et al. (2016) Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 815–824, 2016.
Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555, 2020.
Clarke et al. (2009) Charles LA Clarke, Maheedhar Kolla, and Olga Vechtomova. An effectiveness measure for ambiguous and underspecified queries. In Conference on the Theory of Information Retrieval, pages 188–199. Springer, 2009.
Codd (1974) Edgar F Codd. Seven steps to rendezvous with the casual user. IBM Corporation, 1974.
Copestake and Jones (1990) Ann Copestake and Karen Sparck Jones. Natural language interfaces to databases. 1990.
Cortes (2018) David Cortes. Adapting multi-armed bandits policies to contextual bandits scenarios. arXiv preprint arXiv:1811.04383, 2018.
Croft et al. (2010) W Bruce Croft, Donald Metzler, and Trevor Strohman. Search engines: Information retrieval in practice, volume 520. Addison-Wesley Reading, 2010.
Cronen-Townsend et al. (2002) Steve Cronen-Townsend, Yun Zhou, and W Bruce Croft. Predicting query performance. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 299–306, 2002.
Desai et al. (2016) Aditya Desai, Sumit Gulwani, Vineet Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, Subhajit Roy, et al. Program synthesis using natural language. In Proceedings of the 38th International Conference on Software Engineering, pages 345–356. ACM, 2016.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2018.
Dinan et al. (2020) Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, et al. The second conversational intelligence challenge (convai2). In The NeurIPS’18 Competition, pages 187–208. Springer, Cham, 2020.
Elgohary et al. (2020) Ahmed Elgohary, Saghar Hosseini, and Ahmed Hassan Awadallah. Speak to your parser: Interactive text-to-SQL with natural language feedback. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2065–2077, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.187. URL https://www.aclweb.org/anthology/2020.acl-main.187.
Fast et al. (2018) Ethan Fast, Binbin Chen, Julia Mendelsohn, Jonathan Bassen, and Michael S Bernstein. Iris: A conversational agent for complex tasks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, page 473. ACM, 2018.
Felício et al. (2017) Crícia Z Felício, Klérisson VR Paixão, Celia AZ Barcelos, and Philippe Preux. A multi-armed bandit model selection for cold-start user recommendation. In Proceedings of the 25th Conference on User Modeling, Adaptation and Personalization, pages 32–40, 2017.
Gilotte et al. (2018) Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline a/b testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 198–206, 2018.
Hashemi et al. (2019) Helia Hashemi, Hamed Zamani, and W Bruce Croft. Performance prediction for non-factoid question answering. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pages 55–58, 2019.
Hassan and White (2012) Ahmed Hassan and Ryen W White. Task tours: helping users tackle complex search tasks. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 1885–1889, 2012.
Hassan et al. (2010) Ahmed Hassan, Rosie Jones, and Kristina Lisa Klinkner. Beyond dcg: user behavior as a predictor of a successful search. In WSDM, pages 221–230, 2010.
Hassan Awadallah et al. (2014) Ahmed Hassan Awadallah, Ryen W White, Patrick Pantel, Susan T Dumais, and Yi-Min Wang. Supporting complex search tasks. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management, pages 829–838, 2014.
Hauff et al. (2008) Claudia Hauff, Djoerd Hiemstra, and Franciska de Jong. A survey of pre-retrieval query performance predictors. In Proceedings of the 17th ACM conference on Information and knowledge management, pages 1419–1420, 2008.
Hauff et al. (2009) Claudia Hauff, Leif Azzopardi, and Djoerd Hiemstra. The combination and evaluation of query performance prediction methods. In European Conference on Information Retrieval, pages 301–312. Springer, 2009.
He and Ounis (2004) Ben He and Iadh Ounis. Inferring query performance using pre-retrieval predictors. In International symposium on string processing and information retrieval, pages 43–54. Springer, 2004.
Hendrix et al. (1978) Gary G Hendrix, Earl D Sacerdoti, Daniel Sagalowicz, and Jonathan Slocum. Developing a natural language interface to complex data. ACM Transactions on Database Systems (TODS), 3(2):105–147, 1978.
Holzinger (2016) Andreas Holzinger. Interactive machine learning for health informatics: when do we need the human-in-the-loop? Brain Informatics, 3(2):119–131, 2016.
Ingwersen and Järvelin (2006) Peter Ingwersen and Kalervo Järvelin. The turn: Integration of information seeking and retrieval in context, volume 18. Springer Science & Business Media, 2006.
Joachims et al. (1997) Thorsten Joachims, Dayne Freitag, Tom Mitchell, et al. Webwatcher: A tour guide for the world wide web. In IJCAI (1), pages 770–777. Citeseer, 1997.
Joachims et al. (2020) Thorsten Joachims, Yves Raimond, Olivier Koch, Maria Dimakopoulou, Flavian Vasile, and Adith Swaminathan. Reveal 2020: Bandit and reinforcement learning from user interactions. In Fourteenth ACM Conference on Recommender Systems, pages 628–629, 2020.
Khodabakhsh and Bagheri (2021) Maryam Khodabakhsh and Ebrahim Bagheri. Semantics-enabled query performance prediction for ad hoc table retrieval. Information Processing & Management, 58(1):102399, 2021.
Kiseleva et al. (2014) Julia Kiseleva, Eric Crestan, Riccardo Brigo, and Roland Dittel. Modelling and detecting changes in user satisfaction. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1449–1458, 2014.
Kiseleva et al. (2015) Julia Kiseleva, Jaap Kamps, Vadim Nikulin, and Nikita Makarov. Behavioral dynamics from the serp’s perspective: what are failed serps and how to fix them? In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1561–1570, 2015.
Kiseleva et al. (2016a) Julia Kiseleva, Alexander Tuzhilin, Jaap Kamps, Melanie JI Mueller, Lucas Bernardi, Chad Davis, Ivan Kovacek, Mats Stafseng Einarsen, and Djoerd Hiemstra. Beyond movie recommendations: Solving the continuous cold start problem in e-commercerecommendations. arXiv preprint arXiv:1607.07904, 2016a.
Kiseleva et al. (2016b) Julia Kiseleva, Kyle Williams, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. Predicting user satisfaction with intelligent assistants. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 45–54, 2016b.
Kiseleva et al. (2016c) Julia Kiseleva, Kyle Williams, Jiepu Jiang, Ahmed Hassan Awadallah, Aidan C Crook, Imed Zitouni, and Tasos Anastasakos. Understanding user satisfaction with intelligent assistants. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval, pages 121–130, 2016c.
Kostric et al. (2021) Ivica Kostric, Krisztian Balog, and Filip Radlinski. Soliciting user preferences in conversational recommender systems via usage-related questions. In Fifteenth ACM Conference on Recommender Systems, pages 724–729, 2021.
Kotov et al. (2011) Alexander Kotov, Paul N Bennett, Ryen W White, Susan T Dumais, and Jaime Teevan. Modeling and analysis of cross-session search tasks. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 5–14, 2011.
Krippendorff (2011) Klaus Krippendorff. Computing krippendorff’s alpha-reliability. 2011.
Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
Leiter et al. (2023) Christoph Leiter, Ran Zhang, Yanran Chen, Jonas Belouadi, Daniil Larionov, Vivian Fresen, and Steffen Eger. Chatgpt: A meta-analysis after 2.5 months. arXiv preprint arXiv:2302.13795, 2023.
Li et al. (2016) Jiwei Li, Alexander H Miller, Sumit Chopra, Marc’Aurelio Ranzato, and Jason Weston. Dialogue learning with human-in-the-loop. ICLR, 2016.
Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010.
Li et al. (2020a) Toby Jia-Jun Li, Tom Mitchell, and Brad Myers. Interactive task learning from GUI-grounded natural language instructions and demonstrations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, July 2020a.
Li et al. (2018) Ziming Li, Julia Kiseleva, and Maarten de Rijke. Dialogue generation: From imitation learning to inverse reinforcement learning. arXiv preprint arXiv:1812.03509, 2018.
Li et al. (2020b) Ziming Li, Sungjin Lee, Baolin Peng, Jinchao Li, Julia Kiseleva, Maarten de Rijke, Shahin Shayandeh, and Jianfeng Gao. Guided dialog policy learning without adversarial learning in the loop. arXiv preprint arXiv:2004.03267, 2020b.
Liu and Lane (2017) Bing Liu and Ian Lane. Iterative policy learning in end-to-end trainable task-oriented neural dialog models. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 482–489. IEEE, 2017.
Liu and Lane (2018) Bing Liu and Ian Lane. Adversarial learning of task-oriented neural dialog models. In Proceedings of the SIGDIAL 2018 Conference, pages 350–359, 2018.
Liu and Belkin (2015) Jingjing Liu and Nicholas J Belkin. Personalizing information retrieval for multi-session tasks: Examining the roles of task stage, task type, and topic knowledge on the interpretation of dwell time as an indicator of document usefulness. Journal of the Association for Information Science and Technology, 66(1):58–81, 2015.
Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019. URL http://arxiv.org/abs/1907.11692.
Marchionini (2006) Gary Marchionini. Exploratory search: from finding to understanding. Communications of the ACM, 49(4):41–46, 2006.
Min et al. (2020) Sewon Min, Julian Michael, Hannaneh Hajishirzi, and Luke Zettlemoyer. Ambigqa: Answering ambiguous open-domain questions. arXiv preprint arXiv:2004.10645, 2020.
Nallapati and Shah (2006) Ramesh Nallapati and Chirag Shah. Evaluating the quality of query refinement suggestions in information retrieval. Technical report, MASSACHUSETTS UNIV AMHERST CENTER FOR INTELLIGENT INFORMATION RETRIEVAL, 2006.
Odijk et al. (2015) Daan Odijk, Ryen W White, Ahmed Hassan Awadallah, and Susan T Dumais. Struggling and success in web search. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pages 1551–1560, 2015.
Olston and Chi (2003) Christopher Olston and Ed H Chi. Scenttrails: Integrating browsing and searching on the web. ACM Transactions on Computer-Human Interaction (TOCHI), 10(3):177–197, 2003.
OpenAI (2023) OpenAI. Gpt-4 technical report. Technical report, arXiv:2303.08774 [cs.CL], 2023.
Parapar and Radlinski (2021) Javier Parapar and Filip Radlinski. Diverse user preference elicitation with multi-armed bandits. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 130–138, 2021.
Plachouras et al. (2004) Vassilis Plachouras, Ben He, and Iadh Ounis. University of glasgow at trec 2004: Experiments in web, robust, and terabyte tracks with terrier. In TREC, 2004.
Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Roitman (2020) Haggai Roitman. Ictir tutorial: Modern query performance prediction: Theory and practice. In Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval, pages 195–196, 2020.
Roitman et al. (2019) Haggai Roitman, Shai Erera, and Guy Feigenblat. A study of query performance prediction for answer quality determination. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, pages 43–46, 2019.
Roller et al. (2020) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M Smith, et al. Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637, 2020.
Roy et al. (2019) Dwaipayan Roy, Debasis Ganguly, Mandar Mitra, and Gareth JF Jones. Estimating gaussian mixture models in the local neighbourhood of embedded word vectors for query performance prediction. Information Processing & Management, 56(3):1026–1045, 2019.
Ruotsalo et al. (2018) Tuukka Ruotsalo, Jaakko Peltonen, Manuel JA Eugster, Dorota Głowacka, Patrik Floréen, Petri Myllymäki, Giulio Jacucci, and Samuel Kaski. Interactive intent modeling for exploratory search. ACM Transactions on Information Systems (TOIS), 36(4):1–46, 2018.
Sanderson (2008) Mark Sanderson. Ambiguous queries: test collections need more sense. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, pages 499–506, 2008.
Sarnikar et al. (2014) Surendra Sarnikar, Zhu Zhang, and J Leon Zhao. Query-performance prediction for effective query routing in domain-specific repositories. Journal of the Association for Information Science and Technology, 65(8):1597–1614, 2014.
Sepliarskaia et al. (2018) Anna Sepliarskaia, Julia Kiseleva, Filip Radlinski, and Maarten de Rijke. Preference elicitation as an optimization problem. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 172–180, 2018.
Singla et al. (2010) Adish Singla, Ryen White, and Jeff Huang. Studying trailfinding algorithms for enhanced web search. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, pages 443–450, 2010.
Soleimani et al. (2021) Amir Soleimani, Christof Monz, and Marcel Worring. NLQuAD: A non-factoid long question answering data set. In Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 1245–1255, 2021.
Song et al. (2009) Ruihua Song, Zhenxiao Luo, Jian-Yun Nie, Yong Yu, and Hsiao-Wuen Hon. Identification of ambiguous queries in web search. Information Processing & Management, 45(2):216–229, 2009.
Tellex et al. (2011) Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew R Walter, Ashis Gopal Banerjee, Seth Teller, and Nicholas Roy. Understanding natural language commands for robotic navigation and mobile manipulation. In Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
Trigg (1988) Randall H Trigg. Guided tours and tabletops: Tools for communicating in a hypertext environment. ACM Transactions on Information Systems (TOIS), 6(4):398–414, 1988.
Villa et al. (2009) Robert Villa, Iván Cantador, Hideo Joho, and Joemon M Jose. An aspectual interface for supporting complex search tasks. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 379–386, 2009.
White and Roth (2009) Ryen W White and Resa A Roth. Exploratory search: Beyond the query-response paradigm. Synthesis lectures on information concepts, retrieval, and services, 1(1):1–98, 2009.
White et al. (2007) Ryen W White, Mikhail Bilenko, and Silviu Cucerzan. Studying the use of popular destinations to enhance web search interaction. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 159–166, 2007.
White et al. (2008) Ryen W White, Gary Marchionini, and Gheorghe Muresan. Evaluating exploratory search systems. Information Processing and Management, 44(2):433, 2008.
Woods et al. (1972) W. A. Woods, Ronald M Kaplan, and Bonnie L. Webber. The lunar sciences natural language information system: Final report. BBN Report 2378, 1972.
Wu and Yan (2017) Chen Wu and Ming Yan. Session-aware information embedding for e-commerce product recommendation. In Proceedings of the 2017 ACM on conference on information and knowledge management, pages 2379–2382, 2017.
Yao et al. (2019) Ziyu Yao, Yu Su, Huan Sun, and Wen-tau Yih. Model-based interactive semantic parsing: A unified framework and a text-to-SQL case study. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5447–5458, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1547. URL https://www.aclweb.org/anthology/D19-1547.
Young et al. (2013) Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. Pomdp-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179, 2013.
Zamani et al. (2018) Hamed Zamani, W Bruce Croft, and J Shane Culpepper. Neural query performance prediction using weak supervision from multiple signals. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 105–114, 2018.
Zhang et al. (2019) Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536, 2019.
Zhao et al. (2008) Ying Zhao, Falk Scholer, and Yohannes Tsegay. Effective pre-retrieval query performance prediction using similarity and variability evidence. In European conference on information retrieval, pages 52–64. Springer, 2008.

Making Large Language Models Interactive: A Pioneer Study on Supporting Complex Information-Seeking Tasks with Implicit Constraints

Abstract.

1. Introduction

2. Background and Related Work

Contextual bandits for recommendation

Search engines

Language as an interface for interactions

Exploratory search, tours, and trails

3. Pluto: data collection infrastructure

Complex user request refinement.

Suggestion refinement.

Data handling.

4. Problem description

Notation.

Problem formulation.

5. Method Description

5.1. Creating intent profile Φ\Phi

5.2. NLU component

GPT-3

Intent Completion score

5.3. Interactive user intent modeling

Method 1:

Method 2

Method 3

Threshold to stop iterations:

5.4. Retrieval Component

Corpus collection:

Few-shot Ranker:

6. datasets

6.1. Dataset Collected for NLU unit

Evaluating quality of the collected dataset.

6.2. Dataset Collected via Pluto

6.3. Corpus Collection

7. Experimental Setup and Results

Method 1: Popularity Method (Baseline)

Group 1: Contextual Multi-armed Bandit Policies

Group 2: Different context representation:

7.1. Evaluation Metrics

Training Parameters:

7.2. Experimental Results

NLU evaluation

7.3. Ablation Analysis

Broad vs. Specific:

Different Context cc:

Policy evaluation for contextual bandit model:

8. Discussion and Implications

References

5.1. Creating intent profile $\Phi$

Different Context $c$ :