Learning to Relate to Previous Turns in Conversational Search
Abstract.
Conversational search allows a user to interact with a search system over multiple turns. A query is strongly dependent on the conversation context. An effective way to improve retrieval effectiveness is to expand the current query with historical queries. However, not all previous queries are related to, and useful for expanding, the current query. In this paper, we propose a new method to select the historical queries that are useful for the current query. To cope with the lack of labeled training data, we use a pseudo-labeling approach that annotates useful historical queries based on their impact on the retrieval results. The pseudo-labeled data are used to train a selection model. We further propose a multi-task learning framework to jointly train the selector and the retriever during fine-tuning, allowing us to mitigate the possible inconsistency between the pseudo labels and the changed retriever. Extensive experiments on four conversational search datasets demonstrate the effectiveness and broad applicability of our method compared with several strong baselines.
1. Introduction
Conversational search (Gao et al., 2022) is an important emerging branch of information retrieval following the rapid development of intelligent assistants (e.g., Siri and Alexa). Compared with traditional keyword-based ad-hoc search, conversational search allows the user to interact with the system over multiple turns to seek information via a conversation in natural language. Not only is this a natural way for users to interact with an IR system, but it also has the potential to deal with complex information needs (Dalton et al., 2022).
As in any conversation in natural language, queries in conversational search may involve omissions, references to previous turns, and ambiguities (Radlinski and Craswell, 2017). Thus, a primary challenge for effective conversational search is to determine the underlying information need by understanding context-dependent query turns. Some previous studies (Voskarides et al., 2020; Vakulenko et al., 2021; Yu et al., 2020; Kumar and Callan, 2020; Wu et al., 2021) exploit a two-stage pipeline, which first trains a query rewriting (QR) model with external data or human-reformulated queries, then feeds the rewritten queries into a retriever. However, this requires sufficient manual annotations for model training, which are difficult to obtain in practice (Anantha et al., 2021; Adlakha et al., 2022). Furthermore, the manually reformulated queries may not be the best search queries because the reformulation is based on human understanding of the query rather than on the search results. We also observe that the rewriting model is usually trained separately from the retriever, making it difficult to optimize the whole system for the final retrieval performance (Lin et al., 2021; Wu et al., 2021). In particular, the learned rewriting model may not best fit the retriever used in the subsequent step.

Another research direction (Qu et al., 2020; Yu et al., 2021; Kim and Kim, 2022; Mao et al., 2022a) tries to train a conversational query encoder by leveraging all the previous context information. A common approach uses all the historical queries in the session to expand the current query. Although this expansion leads to improved results, it makes a strong assumption that all the previous queries are related to the current one. In reality, we often observe topic switches within a conversation session, which means the current query may not be related to the previous ones. Fig. 1 illustrates such an example, where only part of the historical queries is relevant to the last query (whether judged by human annotators or by our model). Simply leveraging all the historical queries will inevitably inject irrelevant information into the expanded query and result in sub-optimal queries. This is experimentally confirmed in Sec. 4. Intuitively, a better solution, which we implement, is to use only the relevant previous turns to expand/rewrite the current query. The crucial question is thus: how can we select useful information from the historical context for query expansion/reformulation?
A naive idea would be to train a classification model on human annotations to identify the relevant previous queries. Unfortunately, this approach cannot be implemented because human annotations of query relevance are usually unavailable in practice. Such relevance labels are also missing in most existing datasets for conversational search (Dalton et al., 2020, 2021; Anantha et al., 2021; Adlakha et al., 2022). Thus, identifying relevant previous queries is a non-trivial task in the absence of relevance labels.
Some existing studies (Qu et al., 2020; Kim and Kim, 2022; Li et al., 2022b; Yu et al., 2021; Lin et al., 2021) exploit powerful pre-trained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019) and use the attention mechanism (Vaswani et al., 2017) to determine how strongly historical queries are related to the current one. A common practice is to fine-tune a pre-trained dense retriever (Xiong et al., 2020; Karpukhin et al., 2020) with conversational search data. This approach would require a considerable amount of annotated conversational search sessions, with relevance labels for previous queries. To this end, some pioneering studies (Adlakha et al., 2022; Mao et al., 2022a) provided annotated conversational search sessions. However, this gives rise to two critical issues: (1) it is difficult to produce a large number of manual annotations for general conversational search sessions; and (2) it is questionable whether human annotations are the best supervision signal for training a model. In fact, when a human annotates the relevance of a previous query, she/he basically considers whether the two queries are semantically related, without being able to judge whether expanding the query with it would improve the search result. The latter would require deep knowledge about how the retrieval model will use the query, which often remains opaque to a human annotator. Therefore, a human annotator could select seemingly relevant but useless queries, while missing some semantically less related but actually useful queries.
In this paper, we investigate how to select useful historical turns to expand the current query in conversational search. The selection is made according to the usefulness of the previous queries, i.e., whether they can help improve the retrieval effectiveness of the current query. Intuitively, this can help select more useful previous queries for the search task. As we can see in Fig. 1, our system can make different judgments than a human annotator. Q1 is judged useful because it is about the same movie, although about a different character; this is a piece of useful information to expand Q4. On the other hand, Q2, which is manually judged relevant, does not provide any useful information despite being about the same character (“Snow White”).
To select useful previous turns, we are inspired by earlier work that selects useful expansion terms or pseudo-feedback documents for query expansion according to their impact (Cao et al., 2008; He and Ounis, 2009). Our situation is similar: we consider the previous turns as candidates for expanding the current query and we aim at selecting the ones that can improve the search effectiveness. Therefore, a similar principle can be used, i.e., a selection model is trained with a set of automatic labels generated according to whether the previous turn can bring improvement in retrieval.
We also observe that most previous approaches separately train a query expansion/rewriting model and a retrieval model. In fact, these two components are strongly dependent - the change in one would impact the other. To jointly optimize both, we design a multi-task learning method. We conduct extensive experiments on four widely used conversational search datasets. Experimental results demonstrate the effectiveness of our proposed method, showing significant improvements over various baselines.
Our contributions are summarized as follows:
(1) We propose an efficient and effective method to select useful previous queries to expand the current query without human annotations.
(2) We propose a multi-task learning method to jointly optimize the query selection model and the dense retrieval model.
(3) We demonstrate the effectiveness of the proposed method and show its broad applicability in various settings (i.e. with or without training data). In addition, the proposed method can select better previous queries than human annotators.
2. Related Work
Conversational Search. Conversational search is defined as iteratively retrieving documents for a user’s queries in a multi-round dialog (Dalton et al., 2020, 2021). A later query may depend on the previous (historical) ones in the session. Thus, the major challenge is to formulate a de-contextualized query for a standard search system according to the historical context.
One research direction is to explicitly reformulate queries. Typical methods include selecting important terms from the previous queries to expand the current query via a term classifier (Kumar and Callan, 2020; Voskarides et al., 2020), or rewriting the query with a generative pre-trained language model (PLM) (Yu et al., 2020; Wu et al., 2021; Mao et al., 2023a; Mo et al., 2023). Both types of methods need external data or human-rewritten oracle queries for training. However, it has been found that the manual queries may be sub-optimal for search (Wu et al., 2021; Lin et al., 2021). Therefore, training a query reformulation/rewriting model according to manually reformulated queries may lead to a sub-optimal solution.
Another approach is to fine-tune a dense retriever using conversational search data (Qu et al., 2020; Kim and Kim, 2022; Li et al., 2022b; Mao et al., 2022b, 2023b). Since there are limited supervision signals in the current conversational search datasets, the dense retriever will have difficulty identifying the relevant historical turns to enhance the current query. To address this problem, some few/zero-shot learning methods have been proposed (Yu et al., 2021; Krasakis et al., 2022; Mao et al., 2022a). They try to adapt an ad-hoc search retriever to the conversational scenario with external resources. Nevertheless, these approaches do not contain an explicit selection of relevant historical turns for the current query. As we stated earlier, this selection is critical for conversational search.
Query Expansion. For conversational search, expanding the query in the current turn based on historical context can be viewed as a type of query expansion (Efthimiadis, 1996), which is an effective technique for improving retrieval performance. The expansion terms can come from different resources such as user logs, pseudo-relevance feedback, and the document collection. It is critical to use appropriate terms to expand the query. However, it has been shown that the candidate expansion terms selected by a method (e.g., pseudo-relevance feedback) are very noisy: they could be useful, useless, or even harmful to retrieval effectiveness. Therefore, it is important to select only good terms for query expansion (Cao et al., 2008; Sordoni et al., 2015). In the case of pseudo-relevance feedback, there is no further indication or annotation about the usefulness of an expansion term. Therefore, Cao et al. (2008) proposed a method to automatically annotate the usefulness of a candidate term by testing whether incorporating it into the query improves retrieval performance: if it does, the term is labeled useful; if it degrades performance, it is labeled harmful. A similar approach has been adapted for selecting pseudo-feedback documents for query expansion (He and Ounis, 2009). Using entity overlap to select historical queries is another simple alternative. However, previous works (Vakulenko et al., 2018; Joko et al., 2021) found that entities are often omitted in later turns rather than repeated, which results in poor coverage when identifying useful turns in conversational search. In our work, we borrow the idea of pseudo-labeling to annotate the usefulness (or relevance) of a historical query. Based on the pseudo labels, a query selection model (selector) is trained.
3. Task Formulation
In conversational search, we are given a multi-turn conversation corresponding to a sequence of queries $\{q_1, q_2, \ldots, q_n\}$. The goal is to find the relevant passages from a collection $\mathcal{C}$ for the last query $q_n$. Each query turn can be ambiguous and is usually context-dependent, thus we need to understand its real search intent based on its conversational context. Several types of context have been considered in previous works: (1) the context contains historical queries and the corresponding ground-truth passages (answers) (Vakulenko et al., 2021; Krasakis et al., 2022; Kim and Kim, 2022; Mao et al., 2023b), and (2) it contains historical queries only (Qu et al., 2020; Yu et al., 2021; Mao et al., 2022a). Despite the fact that the first type of context can lead to better results, it is believed (Mandya et al., 2020; Siblini et al., 2021; Li et al., 2022a) that it is unrealistic to assume all historical queries are answered correctly. In a realistic scenario, the answers are provided by the system and are not always correct. Therefore, we consider the second type of context in this paper, i.e., our context contains only the historical queries $\mathcal{H}_n = \{q_1, \ldots, q_{n-1}\}$. Notice, however, that the method we propose can be easily extended to the first type of context.
4. Re-examination of Conversational Query Expansion
The general assumption behind existing conversational dense retrieval methods (Vakulenko et al., 2021; Li et al., 2022b; Krasakis et al., 2022; Qu et al., 2020; Yu et al., 2021; Lin et al., 2021) is that all historical queries in previous turns are useful for understanding the current query. Therefore, expanding the query with all of them should yield better retrieval results. Typically, these methods reformulate the current query by concatenating all historical queries. Despite the fact that such an expanded query leads to improved performance (compared to no expansion), we argue that the strategy is sub-optimal. On one hand, the historical queries may be on a different topic, and thus not useful for enhancing the current query. This happens especially in long sessions where the user may be interested in different topics. On the other hand, the historical queries are not equally useful for expanding the current query: some may be more related than others.
The above phenomena have been experimentally shown in (Cao et al., 2008) when expanding a query in pseudo-relevance feedback. To verify if the same phenomena exist in conversational search, we run a similar experiment to compare the following two scenarios: (1) expand the current query with all historical queries, (2) expand it with only relevant historical queries/terms. We want to demonstrate that the second strategy is much better than the first one. Notice that the first method is widely used in existing works. The second one is an idealized case because the relevance labels of previous queries are usually unavailable. Meanwhile, within the second strategy, we want to compare the effectiveness between token-level and turn-level expansion based on different types of retrievers. Following (Cao et al., 2008), we identify the useful expansion queries from the history according to whether the selected expansions can bring improvement in retrieval. We describe the process below.
4.1. Pseudo Relevance Labeling
Our goal is to generate gold Pseudo Relevance Labels (PRL) for each historical query/term with respect to a given query. We assume that a relevant historical turn/term should make a positive impact when the current query is expanded with it. The pseudo labeling is based on the impact on retrieval results of a candidate turn/term when the latter is used to expand the query.
The turn-level generation procedure is introduced in Algorithm 1. We first use the current raw query to retrieve a ranked list of documents and calculate the base retrieval score (MRR in this case). Then we iteratively concatenate the query with each of the historical queries as a new reformulated query and obtain the corresponding retrieval score. If the second score is higher than the initial one, the relevance of the historical query is labeled “positive”; otherwise, “negative”. Query expansion is produced by concatenating the expansion query $q_i$ with the current one $q_n$, i.e., $q_i \circ q_n$, where $\circ$ denotes concatenation. In the same way, we can also create term-level pseudo labels for all the terms in the historical queries.
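To make the labeling loop concrete, the following is a minimal sketch of the turn-level procedure. It assumes two hypothetical helpers, `retrieve` and `mrr`, wrapping the base retriever and the MRR computation against the gold passages; Algorithm 1 in the paper remains the authoritative description.

```python
# Sketch of turn-level pseudo relevance labeling (cf. Algorithm 1).
# `retrieve(query, k)` returns a ranked list of passage ids;
# `mrr(ranked_list, gold_passages)` scores it. Both are assumed helpers.

def label_historical_turns(history, current_query, gold_passages,
                           retrieve, mrr, k=100):
    """Return a 0/1 usefulness label for each historical query."""
    base_score = mrr(retrieve(current_query, k), gold_passages)
    labels = []
    for hist_query in history:
        # Expand the current query by concatenating one historical turn.
        expanded = f"{hist_query} {current_query}"
        score = mrr(retrieve(expanded, k), gold_passages)
        # A turn is labeled positive only if the expansion improves MRR.
        labels.append(1 if score > base_score else 0)
    return labels
```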

4.2. Effectiveness of Selecting Useful Expansions
We assume the gold PRL as the result of an oracle classifier that can correctly separate the useful and harmful historical expansion turns/terms. Before proposing an approach to select useful expansion turns/terms, we first examine whether the gold PRL can indeed improve retrieval performance when only the relevant expansion turns/terms are used. We formulate these query expansion forms as:

$$q^{\text{All}} = q_1 \circ q_2 \circ \cdots \circ q_{n-1} \circ q_n, \qquad q^{\text{Rel}} = \Big(\mathop{\circ}_{q_i \in \mathcal{H}_n^{+}} q_i\Big) \circ q_n,$$

where $\circ$ denotes concatenation and $\mathcal{H}_n^{+}$ is the set of historical turns (or terms, for token-level expansion) labeled as relevant.
Fig. 2 shows the retrieval effectiveness scores with these types of queries on the TopiOCQA dataset, using two retrievers: BM25 (sparse) (Robertson et al., 2009) and ANCE (dense) (Xiong et al., 2020). In general, expanding queries with all historical queries is beneficial, leading to improved effectiveness compared to the raw queries. This result is consistent with the observations in previous studies. However, when we use only the relevant queries annotated by the PRL, the effectiveness is much higher. This clearly demonstrates the potential benefit of selecting the relevant previous historical context and confirms our earlier assumption. We also find that token-level expansion is more suitable for the sparse retriever, while turn-level expansion performs better with the dense retriever. The comparison between the two retrievers clearly shows the advantage of the dense retriever ANCE. Therefore, in our experiments, we will use ANCE as the retrieval backbone and relevant historical queries as expansion units.
Given the high impact of relevant queries for expansion, our problem is to develop an effective selection model to identify them. The principle we use is that a selected expansion query should bring improvement in retrieval and be judged relevant to the current query turn by a neural selection model.
5. Methodology
5.1. Overview
So far, the main approaches to QR (Yu et al., 2020; Anantha et al., 2021; Wu et al., 2021; Voskarides et al., 2020) rely on external supervision signals, i.e. human-rewritten queries, to train a rewriting model. In practice, it is difficult to obtain such manually reformulated queries. Other methods expand the query with all historical queries (Qu et al., 2020; Yu et al., 2021; Lin et al., 2021; Kim and Kim, 2022), which will inject noise during expansion, as we showed in the previous section. To overcome these issues, our method learns a query selection model (selector) based on the pseudo-labeled data to select useful historical queries for expansion and produces an expanded query to be submitted to a retriever.
However, we also notice that there is a hidden dependency between the retrieval model and the expansion process. An off-the-shelf retriever such as ANCE has been fine-tuned with unexpanded queries; further fine-tuning it on expanded queries can yield better results. Similarly, the selector also relies on a retriever to determine useful expansion queries. To account for this dependency, we design a multi-task learning framework to jointly fine-tune the selector-retriever (S-R) models on conversational search data.
An overview of our method is shown in Fig. 3. It contains three main components: PRL generation, selector training with an off-the-shelf retriever, and joint multi-task fine-tuning of the S-R model.
5.2. Selector: Selecting Useful Expansion Queries
5.2.1. Classification of Historical Queries
The goal of the selector model is to identify whether a historical query $q_i$ is relevant (useful) to the current query $q_n$. We use the pseudo-labeled queries identified in Sec. 4 as training data to train the selector model.
Selector Model. The purpose of the selector is to determine which historical queries are relevant (useful) for the current one. This can be done in two possible ways: as a classification problem or as a search problem. With a classification approach, we train a classifier to separate candidate queries into relevant/irrelevant classes. A straightforward option is to directly train a powerful PLM such as BERT (Devlin et al., 2019) with the PRL data as a selector. The possible problem with this solution is that a BERT model without retrieval-specific fine-tuning might not perform well on query encoding (Qiao et al., 2019). Alternatively, we can exploit a search tool to determine a matching score between the current query and a candidate. We leverage the ANCE (Xiong et al., 2020) dense retriever, which has already been fine-tuned on ad-hoc search data (Nguyen et al., 2016), to train a new selector model. We will test both approaches in our experiments.
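As an illustration of the second option, the sketch below scores a candidate historical query against the current one by the dot product of their dense representations. The encoder name is a generic stand-in (our implementation uses the ANCE query encoder instead), and the final thresholding/classification step is omitted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Generic stand-in encoder; in our setting the ANCE query encoder is used instead.
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def encode_query(text: str) -> torch.Tensor:
    """Encode a query into a single vector (the [CLS] representation)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        return encoder(**inputs).last_hidden_state[:, 0]

def matching_score(current_query: str, hist_query: str) -> float:
    # Dot-product matching score between the two query embeddings.
    return (encode_query(current_query) * encode_query(hist_query)).sum().item()
```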
Weight Assignment to Training Samples. The negative labels annotated automatically by Algorithm 1 are about ten times more numerous than the positive labels. However, in Sec. 4 we showed that even expanding the query with all the historical queries, including more negative ones than positive ones, yields better results than not expanding it at all. Thus, we need to force the selector to pay more attention to selecting the positive ones. Therefore, we assign higher weights to positive queries than to negative ones by giving each class a weight in the training loss. The learning objective of the selector is given in Eq. 1, in which the weight assignment regularizes the model to focus more on the positive category and less on the negative category.
$$\mathcal{L}_{\text{sel}} = -\sum_{q_i \in \mathcal{H}_n} w_{y_i} \log p\big(y_i \mid q_i, q_n\big), \qquad (1)$$

where $y_i \in \{0, 1\}$ is the pseudo relevance label of the historical query $q_i$ and $w_{y_i}$ is the class-dependent weight.
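A minimal sketch of this class-weighted objective is given below, assuming the selector is a binary classifier producing two logits per (current query, historical query) pair; the weight value 10.0 merely mirrors the roughly 1:10 label imbalance and is not the value used in the paper.

```python
import torch
import torch.nn.functional as F

def selector_loss(logits: torch.Tensor, labels: torch.Tensor,
                  pos_weight: float = 10.0) -> torch.Tensor:
    """Class-weighted cross-entropy over {irrelevant, relevant} (cf. Eq. 1).

    logits: (batch, 2) selector outputs; labels: (batch,) pseudo labels in {0, 1}.
    pos_weight is an illustrative value, not the one used in the paper.
    """
    class_weights = torch.tensor([1.0, pos_weight], device=logits.device)
    return F.cross_entropy(logits, labels, weight=class_weights)
```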
5.2.2. Conversational Retrieval with Selective Expansion
After selector training, we can apply it to select useful historical queries for expansion. According to the relevance judgments produced by the selector, the current query turn is reformulated by expanding it with the predicted positive queries and fed into the (off-the-shelf) retriever (ANCE) to produce a ranked list. Such a pipeline treats each reformulated query turn as a stand-alone query, so any off-the-shelf retriever can be applied directly without being re-trained for the conversational scenario.
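At inference time this pipeline can be sketched as follows, with `selector_predict` and `retrieve` as assumed wrappers around the trained selector and an off-the-shelf dense retriever.

```python
def selective_expansion_search(history, current_query,
                               selector_predict, retrieve, k=100):
    """Expand the current turn with only the predicted-relevant history, then retrieve."""
    # Keep the historical queries the selector judges useful for this turn.
    relevant = [q for q in history if selector_predict(current_query, q) == 1]
    # Concatenate the selected turns (in conversation order) with the current query.
    reformulated = " ".join(relevant + [current_query])
    return retrieve(reformulated, k)
```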
5.3. Multi-Task Learning for Joint Selector-Retriever (S-R) Fine-tuning
We want to use conversational search data to fine-tune the retriever, expecting that the fine-tuned retriever will perform better. However, our preliminary experiment indicates that directly using the selected expansion queries as input for fine-tuning degrades the final performance of the fine-tuned retriever (Sec. 7.2). Thus, a different strategy for leveraging the automatically labeled queries is needed.
From the retrieval perspective, a historical query might always be potentially relevant to the current turn to some extent, depending on the fine-tuning stage of the retriever. In the previous strategy, we only keep the queries with positive labels and remove the others. This “hard” selection strategy is static and does not allow for changes during fine-tuning. Instead, we use a “soft” weighting of historical queries, so that during fine-tuning the status of a historical query may change depending on the fine-tuned retriever. In other words, we want to make the relevance labeling dependent on the retriever used.
To implement this idea, we propose a multi-task learning method to jointly fine-tune the selector and retriever using conversational search data. We take the current query concatenated with all the historical queries (i.e., $q^{\text{All}}$) as input and learn to assign a weight to each candidate query, while also optimizing retrieval effectiveness. This corresponds to a “soft” selection of candidate expansion queries embedded in the retrieval process. The expected advantage is that the retriever is optimized together with the soft selection of expansion queries, thereby alleviating the problem of data scarcity and yielding effective fine-tuning.
The joint training objective consists of two parts. One is the contrastive ranking loss for the retriever, shown in Eq. 2. The other is the selector training loss with the weight assignment strategy. The intuition is that if the retriever has the ability to distinguish relevant queries, it will be able to assign reasonable weights to the candidate expansion queries in $\mathcal{H}_n$. The whole learning objective of S-R is shown in Eq. 3. Since retrieval is the main task, the hyper-parameter $\alpha$ moderates the weight given to selector training.
$$\mathcal{L}_{\text{rank}} = -\log \frac{\exp\big(s(q, p^{+})\big)}{\exp\big(s(q, p^{+})\big) + \sum_{p^{-} \in \mathcal{P}^{-}} \exp\big(s(q, p^{-})\big)}, \qquad (2)$$

where $s(\cdot,\cdot)$ is the similarity between the query and passage representations, $p^{+}$ is a relevant passage, and $\mathcal{P}^{-}$ is the set of negative passages.
$$\mathcal{L}_{\text{S-R}} = \mathcal{L}_{\text{rank}} + \alpha \, \mathcal{L}_{\text{sel}}. \qquad (3)$$
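A minimal sketch of the joint objective is shown below, assuming explicit negative passages for the ranking loss and a selector head producing per-turn logits; `alpha` corresponds to the hyper-parameter in Eq. 3, and the weight values are illustrative.

```python
import torch
import torch.nn.functional as F

def ranking_loss(q_emb, pos_emb, neg_embs):
    """Contrastive ranking loss (cf. Eq. 2): one positive passage vs. negatives."""
    pos_score = (q_emb * pos_emb).sum(-1, keepdim=True)        # (B, 1)
    neg_scores = torch.einsum("bd,bnd->bn", q_emb, neg_embs)   # (B, N)
    logits = torch.cat([pos_score, neg_scores], dim=-1)        # positive at index 0
    targets = torch.zeros(q_emb.size(0), dtype=torch.long, device=q_emb.device)
    return F.cross_entropy(logits, targets)

def joint_loss(q_emb, pos_emb, neg_embs, sel_logits, sel_labels,
               alpha=1.0, class_weights=(1.0, 10.0)):
    """Multi-task S-R objective (cf. Eq. 3): ranking loss + weighted selector loss."""
    w = torch.tensor(class_weights, device=sel_logits.device)
    sel = F.cross_entropy(sel_logits, sel_labels, weight=w)
    return ranking_loss(q_emb, pos_emb, neg_embs) + alpha * sel
```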
6. Experimental Setup
6.1. Datasets and Evaluation Metrics
Datasets: We use four conversational search datasets in our experiments. TopiOCQA (Adlakha et al., 2022) and QReCC (Anantha et al., 2021), which have relatively large amounts of training samples, are used for evaluating our supervised methods (i.e., with selector training), while the two small datasets CAsT 2019 (Dalton et al., 2020) and CAsT 2020 (Dalton et al., 2021) are used for evaluating zero-shot retrieval (i.e., no further training). More details about the datasets are provided in Appendix A.1.
Evaluation: Following existing studies on conversational search (Adlakha et al., 2022; Anantha et al., 2021; Dalton et al., 2020, 2021), we evaluate retrieval effectiveness with the following metrics: MRR, NDCG@3, Recall@10, Recall@20, and Recall@100. We adopt the pytrec_eval tool (Van Gysel and de Rijke, 2018) for metric computation.
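For reference, metric computation with pytrec_eval follows the usual qrel/run dictionary interface; the query and passage identifiers below are placeholders.

```python
import pytrec_eval

# qrels: ground-truth relevance; run: retrieval scores, both keyed by query id.
qrels = {"q1": {"p3": 1}}                                    # placeholder ids
run = {"q1": {"p3": 12.4, "p7": 10.1, "p9": 9.8}}

measures = {"recip_rank", "ndcg_cut.3", "recall.10,20,100"}  # MRR, NDCG@3, Recall@k
evaluator = pytrec_eval.RelevanceEvaluator(qrels, measures)
per_query = evaluator.evaluate(run)
mrr = sum(m["recip_rank"] for m in per_query.values()) / len(per_query)
```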
6.2. Baselines
Since the experimental settings of existing works are not exactly the same as ours, it would be unfair to compare with them directly. Thus, we implement several strong baselines following the existing methods under the conditions described in Sec. 3. The compared supervised methods include: (1) No expand: the query of the current turn without expansion. (2) QuReTeC (Voskarides et al., 2020): a token-level expansion method that trains a sequence tagger to decide whether each term in the historical context should be added to the current query. (3) All expand (Yu et al., 2021): it expands the current query with all historical turns and feeds the result to the ANCE (Xiong et al., 2020) dense retriever. This is a widely used approach and the strongest baseline in the literature (Qu et al., 2020; Mao et al., 2022a; Kim and Kim, 2022). (4) Rel. expand (Human): it uses only the manually determined relevant historical queries to expand the current query. This method requires human topic annotations (Adlakha et al., 2022), where we assume two turns are related if their ground-truth topics are the same. (5) Query Rewriter (Yu et al., 2020): a GPT-2 (Radford et al., 2019) rewriter model fine-tuned on QReCC.
For the zero-shot setting, in addition to QuReTeC and Query Rewriter, the compared methods include: (6) ZeCo (Krasakis et al., 2022): a token-level representation-based method using the ColBERT (Khattab and Zaharia, 2020) dense retriever. (7) ANCE-ConvDR (Yu et al., 2021): the zero-shot version of ConvDR based on the ANCE (Xiong et al., 2020) dense retriever. (8) ConvDR Transfer (Yu et al., 2021): ConvDR fine-tuned on conversational search data using only the ranking loss (Eq. 2). For a fair comparison, we train all the compared systems on TopiOCQA without external datasets to make them comparable to our methods, except for the results directly quoted from the original papers.
Oracle Methods: We provide the retrieval evaluation by using gold PRL as the upper bound for comparison for all settings. Besides, the human-annotated relevance information from (Mao et al., 2022a) is provided for zero-shot scenarios for reference.
6.3. Implementations
We implement our selector and retriever models based on BERT (Devlin et al., 2019) and ANCE (Xiong et al., 2020), using the PyTorch and Huggingface libraries. The retriever follows the bi-encoder architecture (Karpukhin et al., 2020), and we fix the passage encoder during fine-tuning. More details are provided in Appendix A.2 and in our released code (https://github.com/fengranMark/ConvRelExpand).
7. Experiment Results
7.1. Dense Retrieval with Query Selector
Table 1. Overall results of different query reformulation methods with the ANCE retriever on TopiOCQA and QReCC.

| Group    | Method                 | TopiOCQA |        |        |        | QReCC  |        |        |        |
|          |                        | MRR      | NDCG@3 | R@20   | R@100  | MRR    | NDCG@3 | R@10   | R@100  |
|----------|------------------------|----------|--------|--------|--------|--------|--------|--------|--------|
| Baseline | No expand              | 4.08     | 3.84   | 9.67   | 13.76  | 7.50   | 6.76   | 10.77  | 17.17  |
|          | QuReTeC                | 4.68     | 4.24   | 11.01  | 15.95  | 9.62   | 8.66   | 15.52  | 24.00  |
|          | All expand             | 9.09     | 8.21   | 20.64  | 29.67  | 18.32  | 16.78  | 28.15  | 40.43  |
|          | Rel. expand (Human)    | 10.13    | 9.20   | 22.12  | 31.58  | -      | -      | -      | -      |
|          | Query Rewriter         | 10.72    | 9.52   | 23.61  | 32.17  | 15.57  | 14.01  | 24.31  | 34.94  |
| Ours     | Rel. expand (BERT PRL) | 9.34     | 8.74   | 19.93  | 26.89  | 16.17  | 14.80  | 25.00  | 35.03  |
|          | Rel. expand (ANCE PRL) | 9.87     | 9.25   | 20.92  | 28.76  | 16.88  | 15.46  | 25.99  | 36.64  |
|          | + Weight Assignment    | 10.83‡   | 9.90‡  | 24.07‡ | 33.33‡ | 18.53‡ | 16.95‡ | 28.88‡ | 41.09‡ |
| Oracle   | Rel. expand (Gold PRL) | 14.37‡   | 13.33‡ | 30.39‡ | 41.89‡ | 20.84‡ | 19.36‡ | 31.73‡ | 44.03‡ |
We first evaluate the effectiveness of conversational dense retrieval with our query selection methods for query expansion. Table 1 shows the overall results using queries reformulated in different ways and submitted to ANCE. We show three different versions of our method: “BERT PRL” selects useful historical queries using BERT, “ANCE PRL” selects the queries using ANCE, and “+ Weight Assignment” uses soft selection (i.e. by assigning a weight to each historical query according to their category during training). We have the following observations:
(1) Our proposed methods with useful expansions outperform all baseline methods on both datasets. Specifically, our best relevant expansion (Rel. expand) improves MRR and NDCG@3 by 19.1% and 20.6% over the main competitor that expands with all historical context (All expand) on TopiOCQA. This result confirms that only part of the historical queries are useful for expanding the current query, while the remaining ones are noise that hurts retrieval performance. This observation is consistent with previous studies. Our method also outperforms the Query Rewriter, validating our assumption that we can reformulate the query for conversational search based on automatically determined useful historical queries rather than manually rewritten queries. In addition, our method is much more efficient: the inference speed of our selector is hundreds of times higher than that of the Query Rewriter. Besides, relevant expansion with human annotations (Rel. expand Human) is also better than expanding with all queries (All expand), though it is clearly below the oracle method and our best method with weight assignment. This confirms our assumption that human-annotated relevant queries are not the best ones for expanding the current query, and a model can be trained to determine useful historical queries better than human annotators.
(2) The results on QReCC exhibit a similar trend to those on TopiOCQA, but the difference between the baseline and the upper bound (Oracle) is smaller, so the performance boost is not as strong as on TopiOCQA. This is mainly because the topic-switch phenomenon is uncommon in QReCC: the conversations contain fewer turns (12.3 in TopiOCQA vs. 3.6 in QReCC), and the queries in a conversation mostly revolve around the same topic (Wu et al., 2021). However, in a more realistic setting with longer conversations, we expect topic switching to appear more often, making the utility of our approach more obvious.
(3) The oracle relevant expansion with gold PRL is still far ahead of that with our selector. The PRL produced by the ANCE selector with the weight assignment strategy further improves the retrieval scores toward this upper bound; training a better selector thus remains important future work.
(4) To evaluate the selector itself, we report the precision, recall, macro F1, and accuracy of different selector training strategies on our generated PRL data in Table 2. There is no big difference in either retrieval or classification performance between BERT PRL and ANCE PRL. With the weight assignment (W.A.) strategy, all the classification metrics except recall drop considerably. Since this variant obtains better search results, we conjecture that the recall of the selector is more related to the retrieval results than its precision.
7.2. Results of Conversational Fine-tuning
Table 2. Classification performance (precision, recall, macro F1, accuracy) of different selector variants on the generated PRL data.

| Variants | TopiOCQA |      |      |      | QReCC |      |      |      |
|          | P        | R    | F1   | Acc. | P     | R    | F1   | Acc. |
|----------|----------|------|------|------|-------|------|------|------|
| BERT PRL | 0.55     | 0.43 | 0.48 | 0.93 | 0.51  | 0.58 | 0.54 | 0.84 |
| ANCE PRL | 0.50     | 0.45 | 0.47 | 0.92 | 0.53  | 0.59 | 0.56 | 0.84 |
| +W.A.    | 0.26     | 0.80 | 0.39 | 0.81 | 0.37  | 0.90 | 0.52 | 0.74 |
| Gold PRL | 1.0      | 1.0  | 1.0  | 1.0  | 1.0   | 1.0  | 1.0  | 1.0  |
Table 3. Retrieval performance on TopiOCQA after conversational fine-tuning of the retriever.

| Group    | Method                 | MRR    | N@3   | R@20   | R@100  |
|----------|------------------------|--------|-------|--------|--------|
| Baseline | No expand              | 3.94   | 3.56  | 8.95   | 13.61  |
|          | All expand             | 9.40   | 7.90  | 24.86  | 37.55  |
|          | Rel. expand (Human)    | 8.70   | 7.50  | 23.95  | 34.81  |
|          | Rel. expand (ANCE PRL) | 6.14   | 5.16  | 14.58  | 24.90  |
| Ours     | Rel. expand (Gold PRL) | 10.26‡ | 9.23‡ | 26.97‡ | 36.28  |
|          | Multi-task S-R         | 9.95‡  | 8.63‡ | 26.77‡ | 39.26‡ |
In this section, we test the idea of joint fine-tuning of the selector and the retriever. The All expand and our Multi-task S-R methods use $q^{\text{All}}$ as input for retriever fine-tuning, while No expand and the remaining methods use $q_n$ and $q^{\text{Rel}}$, respectively. The fine-tuned retriever should then be more suitable for conversational scenarios. The overall results are shown in Table 3. Together with the observations in Sec. 7.1, we make the following observations:
(1) Our multi-task S-R improves MRR and NDCG@3 by 5.5% and 9.5% over the strongest baseline (All expand) and approaches the performance of Rel. expand with gold PRL. This demonstrates the effectiveness of selector training as an auxiliary task. Different from the explicit selection methods that boost MRR and NDCG@3, the implicit ones such as All expand and our multi-task S-R are good for Recall@100. This might be due to the fact that we use MRR as the retrieval score in Algorithm 1, so Rel. expand with gold PRL optimizes MRR but not necessarily Recall@100.
(2) We find that reformulated queries with only useful expansions cannot improve the performance of the dense retriever after fine-tuning and may even hurt it. From Table 3, we can see that the Rel. expand methods with human judgment and gold PRL are worse than the All expand method on recall. The predicted PRL obtained by ANCE is even worse, indicating that the selected queries are incompatible with fine-tuning. We also find that only All expand with $q^{\text{All}}$ and our Multi-task S-R improve recall along the fine-tuning procedure, while the Rel. expand methods with $q^{\text{Rel}}$ keep dropping from the beginning. This phenomenon contradicts our intuition and previous work (Mao et al., 2022a). A possible explanation is that the gold PRL is produced by the base retriever before fine-tuning, and this old gold standard may no longer be the gold standard for the fine-tuned retriever. On the other hand, our multi-task learning method can mitigate this issue to some extent: although the gold PRL remains the same, the selector is updated together with the retriever, which prevents the model from performance degradation. A potentially better strategy is to update the gold PRL based on the fine-tuned retriever in every epoch so that query relevance is estimated in tight connection with the retriever. We leave this strategy to future work.
7.3. Zero-Shot Learning Performance
We also evaluate our approaches in the zero-shot scenario on two CAsT datasets and compare them with other dense retrieval methods. The overall results are shown in Table 4.
Our methods come in two variants. The selector approach directly applies the ANCE selector obtained in Sec. 7.1 (trained on TopiOCQA) to explicitly select useful expansion turns, then feeds the reformulated queries into the ANCE retriever. The other is the fine-tuned S-R retriever obtained in Sec. 7.2 via multi-task learning. We can see that our methods produce performance higher than or equivalent to the compared methods on various evaluation metrics. This demonstrates their applicability to a new dataset without further fine-tuning. From the oracle results provided for reference, we can see the superior quality of both the relevant expansion by humans and our gold PRL, which indicates the high potential upper bound of our produced PRL data. Besides, the scores of human annotation are still lower than those of our gold PRL. This confirms again our assumption that human-annotated relevant queries are not necessarily the best ones for expansion.
7.4. Detailed Analysis
In this section, we make further analysis of some specific aspects of our methods.
7.4.1. Effect of Multi-task Learning in Fine-tuning
Table 4. Zero-shot retrieval performance on CAsT-19 and CAsT-20.

| Method                            | CAsT-19 |        | CAsT-20 |        |
|                                   | MRR     | NDCG@3 | MRR     | NDCG@3 |
|-----------------------------------|---------|--------|---------|--------|
| ZeCo (Krasakis et al., 2022)      | -       | 23.8∗  | -       | 17.6∗  |
| ANCE-ConvDR (Yu et al., 2021)     | 42.0∗   | 24.7∗  | 23.4∗   | 15.0∗  |
| Query Rewriter (Yu et al., 2020)  | 33.4    | 15.0   | 23.5    | 14.7   |
| QuReTeC (Voskarides et al., 2020) | 45.8    | 28.2   | 21.9    | 15.3   |
| ConvDR Transfer (Yu et al., 2021) | 52.3    | 26.6   | 32.8    | 20.0   |
| Ours ANCE Selector                | 54.6‡   | 31.9‡  | 32.8    | 21.7‡  |
| Ours S-R Retriever                | 56.9‡   | 30.7‡  | 31.7    | 18.8   |
| Oracle - For Reference            |         |        |         |        |
| Rel. expand (Human)               | 65.0‡   | 40.8‡  | 42.3‡   | 28.8‡  |
| Our Rel. expand (Gold PRL)        | 70.3‡   | 45.1‡  | 47.4‡   | 32.4‡  |

The selector loss with weight assignment plays an important role in teaching the retriever to identify useful expansion queries during fine-tuning. We analyze the influence of the hyper-parameter $\alpha$. We first train our S-R retriever with multi-task learning on TopiOCQA, then evaluate it on both TopiOCQA and CAsT-19 (zero-shot). The results are shown in Fig. 4. We observe that the retrieval score improves as $\alpha$ increases until it equals one; on both datasets, $\alpha = 1$ is the best setting. The results show that the two tasks in multi-task learning have about the same importance.
Table 5. Number of topic-switch judgments of each type made by humans, the predicted PRL, and the gold PRL.

| Judgement | Topic Shift        | Topic Return      | No-switch          |
|-----------|--------------------|-------------------|--------------------|
| Human     | 546 (100%)         | 126 (100%)        | 1637 (100%)        |
| Pred. PRL | 501 (221, 40.48%)  | 1313 (81, 64.29%) | 495 (397, 24.25%)  |
| Gold PRL  | 1461 (469, 85.90%) | 581 (60, 47.62%)  | 259 (220, 13.44%)  |

Table 6. Average number of topics per conversation under different judgment methods.

| Dataset  | Human | Pred. PRL | Gold PRL |
|----------|-------|-----------|----------|
| TopiOCQA | 3.58  | 6.29      | 6.34     |
| QReCC    | -     | 2.26      | 2.70     |
| CAsT-19  | 5.02  | 5.44      | 4.85     |
| CAsT-20  | 4.56  | 4.80      | 4.70     |

7.4.2. Effect of Topic Judgment
Within a conversation, the turns on the same topic often have similar gold passages, and thus we could assume these query turns are relevant to each other. Here, we analyze the topic changes identified by different judgment methods and their impacts on retrieval performance. We follow previous work (Adlakha et al., 2022) in classifying the topic switch into three types: topic shift, topic return, and no-switch, based on topic annotations by humans or on our PRL. Both topic shift and topic return mean that the last historical query is irrelevant to the current one, while topic return additionally means that some earlier turns in the history are relevant.
Table 7. A successful case and a failure case comparing different expansion judgments (discussed in Sec. 7.5).
Judgment Analysis. Table 5 shows the number of topic-switch decisions of each type determined in different ways: Human, Gold PRL, and the PRL predicted by the selector (Pred. PRL). The difference is that humans annotate topics with concrete topic names (e.g., “Disney Movie”) for each turn, while our PRL only indicates whether two queries are relevant. We can see that the machine judgments contain more topic switches. The Gold PRL and Pred. PRL have a considerable amount of judgments consistent with the human annotators on the topic shift and topic return types, but they overlap little with the human no-switch judgments. As the machine-selected expansions perform better in search than the human-selected ones, this suggests that some queries on the same topic may still be useless for expanding the current query. It also indicates that human topic annotation might not be the best relevance indication for training a selector model. Table 6 shows the number of topics per conversation under the different judgment methods, where the number of topics in a conversation is determined by topic shifts. Our PRL tends to identify more topics in TopiOCQA than in QReCC, while the two CAsT datasets show only slight differences between our judgments and the human annotation. The previous experimental results show that the performance of our methods is much better on these three datasets than on QReCC. Thus, we conjecture that our approach has more impact on conversations with more turns, higher topic diversity, and finer-grained topics.
Impacts on Retrieval. Fig. 5 shows the impact of different judgment methods on retrieval performance by counting the successful (without shadow) and failed (with shadow) examples. A successful/failed example is one whose MRR score increases/drops when we expand the current query turn with only the relevant history queries according to the PRL, compared with expanding with all history queries. We find that our PRL has more impact on topic-switch cases than on no-switch cases. This phenomenon is not clearly observed on QReCC, where topic switches are less frequent. This result is consistent with our initial intuition and previous analysis: we might only need to expand with relevant history queries when a conversation contains several topics. Another interesting finding is that expansion selected via Gold PRL still leads to some failure cases, which indicates that the dependency between individual expansion queries might also affect retrieval effectiveness; this could be further explored.
7.5. Case Study
We finally show one successful and one failed case to help understand more intuitively the behaviors of different expansion methods. We compare the results of different expansion judgments and also show the comparison of the retrieval effectiveness between expanding all historical queries and expanding only the relevant ones.
The corresponding results are shown in Table 7. For the successful case, the model is expected to understand the “name of the series of The Man Trap”. The first-round query supplies the crucial information “The Man Trap”, while the second round, asking for characters of the fiction, would inject noise because its response involves names, which is also what the current query asks for. Although the human annotator considers both the first- and second-round queries to be on the same topic as the current one, the second round cannot provide useful information for retrieval. For the failed case, the model needs to retrieve “the members of the cast for the mentioned film”. Both the human judgment and the Gold PRL indicate some relevant queries in the context, and expanding with them could bring improvements in retrieval effectiveness. However, the selector did not select any expansion queries, which should be avoided and improved; a potential reason is that leveraging only the semantics between the current query and the relevant history queries is not enough for accurate selection. Besides, the retrieval effectiveness using Gold PRL for expansion is still better than expanding with all context. Therefore, a better selector should be able to achieve higher effectiveness for this query.
8. Conclusion and Future Work
In this paper, we investigated the problem of query expansion in conversational search. We proposed a selection model to identify useful historical queries for query expansion. The model is trained using pseudo relevance labels obtained automatically according to the impact of a candidate expansion query on retrieval effectiveness. Our experiments showed that such a selection model can determine useful queries that perform even better than human-annotated ones. The study demonstrates that we can learn to select useful historical queries automatically, without human annotation. To address the inconsistency between query selection and conversational fine-tuning, we proposed a multi-task learning framework to fine-tune the selector and retriever together. This strategy produced more robust results across different datasets.
In this work, we have not explored all aspects of the method. There is much room for improvement in the future. For example, we should learn a better selector to further approach the performance of the oracle expansion. The differences in (topic) relevance judgments between humans and the model can be further taken into account for retrieval effectiveness. Finally, in this work, we only consider historical queries, without their retrieval results. The latter could also be used.
Acknowledgements.
This work is partly supported by the National Natural Science Foundation of China (No. 61925601) and a discovery grant from the Natural Sciences and Engineering Research Council of Canada.

References
- Adlakha et al. (2022) Vaibhav Adlakha, Shehzaad Dhuliawala, Kaheer Suleman, Harm de Vries, and Siva Reddy. 2022. TopiOCQA: Open-domain Conversational Question Answering with Topic Switching. Transactions of the Association for Computational Linguistics 10 (2022), 468–483.
- Anantha et al. (2021) Raviteja Anantha, Svitlana Vakulenko, Zhucheng Tu, Shayne Longpre, Stephen Pulman, and Srinivas Chappidi. 2021. Open-Domain Question Answering Goes Conversational via Question Rewriting. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 520–534.
- Cao et al. (2008) Guihong Cao, Jian-Yun Nie, Jianfeng Gao, and Stephen Robertson. 2008. Selecting good expansion terms for pseudo-relevance feedback. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval. 243–250.
- Dalton et al. (2022) Jeffrey Dalton, Sophie Fischer, Paul Owoicho, Filip Radlinski, Federico Rossetto, Johanne R Trippas, and Hamed Zamani. 2022. Conversational Information Seeking: Theory and Application. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 3455–3458.
- Dalton et al. (2020) Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2020. TREC CAsT 2019: The conversational assistance track overview. arXiv preprint arXiv:2003.13624 (2020).
- Dalton et al. (2021) Jeffrey Dalton, Chenyan Xiong, and Jamie Callan. 2021. CAsT 2020: The Conversational Assistance Track Overview. Technical Report.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
- Efthimiadis (1996) Efthimis N Efthimiadis. 1996. Query Expansion. Annual review of information science and technology (ARIST) 31 (1996), 121–87.
- Gao et al. (2022) Jianfeng Gao, Chenyan Xiong, Paul Bennett, and Nick Craswell. 2022. Neural approaches to conversational information retrieval. arXiv preprint arXiv:2201.05176 (2022).
- He and Ounis (2009) Ben He and Iadh Ounis. 2009. Finding Good Feedback Documents. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (Hong Kong, China) (CIKM ’09). 2011–2014.
- Johnson et al. (2019) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with gpus. IEEE Transactions on Big Data 7, 3 (2019), 535–547.
- Joko et al. (2021) Hideaki Joko, Faegheh Hasibi, Krisztian Balog, and Arjen P de Vries. 2021. Conversational entity linking: problem definition and datasets. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2390–2397.
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 6769–6781.
- Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 39–48.
- Kim and Kim (2022) Sungdong Kim and Gangwoo Kim. 2022. Saving dense retriever from shortcut dependency in conversational search. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 10278–10287.
- Krasakis et al. (2022) Antonios Minas Krasakis, Andrew Yates, and Evangelos Kanoulas. 2022. Zero-shot Query Contextualization for Conversational Search. arXiv preprint arXiv:2204.10613 (2022).
- Kumar and Callan (2020) Vaibhav Kumar and Jamie Callan. 2020. Making Information Seeking Easier: An Improved Pipeline for Conversational Search. In Empirical Methods in Natural Language Processing.
- Li et al. (2022a) Huihan Li, Tianyu Gao, Manan Goenka, and Danqi Chen. 2022a. Ditch the Gold Standard: Re-evaluating Conversational Question Answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 8074–8085.
- Li et al. (2022b) Yongqi Li, Wenjie Li, and Liqiang Nie. 2022b. Dynamic Graph Reasoning for Conversational Open-Domain Question Answering. ACM Transactions on Information Systems (TOIS) 40, 4 (2022), 1–24.
- Lin et al. (2021) Sheng-Chieh Lin, Jheng-Hong Yang, and Jimmy Lin. 2021. Contextualized Query Embeddings for Conversational Search. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 1004–1015.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
- Mandya et al. (2020) Angrosh Mandya, James O’Neill, Danushka Bollegala, and Frans Coenen. 2020. Do not let the history haunt you: Mitigating Compounding Errors in Conversational Question Answering. In Proceedings of the 12th Language Resources and Evaluation Conference. 2017–2025.
- Mao et al. (2023a) Kelong Mao, Zhicheng Dou, Haonan Chen, Fengran Mo, and Hongjin Qian. 2023a. Large Language Models Know Your Contextual Search Intent: A Prompting Framework for Conversational Search. arXiv preprint arXiv:2303.06573 (2023).
- Mao et al. (2022a) Kelong Mao, Zhicheng Dou, and Hongjin Qian. 2022a. Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. 176–186.
- Mao et al. (2022b) Kelong Mao, Zhicheng Dou, Hongjin Qian, Fengran Mo, Xiaohua Cheng, and Zhao Cao. 2022b. ConvTrans: Transforming Web Search Sessions for Conversational Dense Retrieval. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 2935–2946.
- Mao et al. (2023b) Kelong Mao, Hongjin Qian, Fengran Mo, Zhicheng Dou, Bang Liu, Xiaohua Cheng, and Zhao Cao. 2023b. Learning Denoised and Interpretable Session Representation for Conversational Search. In Proceedings of the ACM Web Conference 2023. 3193–3202.
- Mo et al. (2023) Fengran Mo, Kelong Mao, Yutao Zhu, Yihong Wu, Kaiyu Huang, and Jian-Yun Nie. 2023. ConvGQR: Generative Query Reformulation for Conversational Search. arXiv preprint arXiv:2305.15645 (2023).
- Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In CoCo@NIPS.
- Qiao et al. (2019) Yifan Qiao, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. 2019. Understanding the Behaviors of BERT in Ranking. arXiv preprint arXiv:1904.07531 (2019).
- Qu et al. (2020) Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W Bruce Croft, and Mohit Iyyer. 2020. Open-retrieval conversational question answering. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 539–548.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
- Radlinski and Craswell (2017) Filip Radlinski and Nick Craswell. 2017. A theoretical framework for conversational search. In Proceedings of the 2017 conference on conference human information interaction and retrieval. 117–126.
- Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
- Siblini et al. (2021) Wissam Siblini, Baris Sayil, and Yacine Kessaci. 2021. Towards a more robust evaluation for conversational question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 1028–1034.
- Sordoni et al. (2015) Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In proceedings of the 24th ACM international on conference on information and knowledge management. 553–562.
- Vakulenko et al. (2018) Svitlana Vakulenko, Maarten de Rijke, Michael Cochez, Vadim Savenkov, and Axel Polleres. 2018. Measuring semantic coherence of a conversation. In 17th International Semantic Web Conference, ISWC 2018. Springer Verlag, 634–651.
- Vakulenko et al. (2021) Svitlana Vakulenko, Shayne Longpre, Zhucheng Tu, and Raviteja Anantha. 2021. Question rewriting for conversational question answering. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining. 355–363.
- Van Gysel and de Rijke (2018) Christophe Van Gysel and Maarten de Rijke. 2018. Pytrec_eval: An Extremely Fast Python Interface to trec_eval. In SIGIR. ACM.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
- Voskarides et al. (2020) Nikos Voskarides, Dan Li, Pengjie Ren, Evangelos Kanoulas, and Maarten de Rijke. 2020. Query resolution for conversational search with limited supervision. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 921–930.
- Wu et al. (2021) Zeqiu Wu, Yi Luan, Hannah Rashkin, David Reitter, and Gaurav Singh Tomar. 2021. CONQRR: Conversational Query Rewriting for Retrieval with Reinforcement Learning. arXiv preprint arXiv:2112.08558 (2021).
- Xiong et al. (2020) Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808 (2020).
- Yu et al. (2020) Shi Yu, Jiahua Liu, Jingqin Yang, Chenyan Xiong, Paul Bennett, Jianfeng Gao, and Zhiyuan Liu. 2020. Few-shot generative conversational query rewriting. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 1933–1936.
- Yu et al. (2021) Shi Yu, Zhenghao Liu, Chenyan Xiong, Tao Feng, and Zhiyuan Liu. 2021. Few-shot conversational dense retrieval. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. 829–838.
Appendix A Appendices
A.1. Datasets
The statistics of the four conversational search datasets used in our experiments are shown in Table 8, and the details of each are described below.
TopiOCQA focuses on the new challenge of topic switching, which often appears in realistic scenarios. Most conversations contain more than 10 turns and at least 3 topics. The dataset has been annotated with different types of topic switching: topic shift - the query starts a new topic, topic return - the query returns to a previous topic, and no-switch - the query continues on the same topic. These annotations allow us to determine which historical queries are on the same topic as the current one. Turns on the same topic often have the same or similar gold passages, thus we could consider these query turns relevant. We treat this as the manual annotation of query relevance. This is the only dataset with such annotations.
QReCC focuses on the challenge of query rewriting: it aims to reformulate the query to approach the human-rewritten query, and thus provides a gold rewritten query for each turn. Compared with TopiOCQA, the number of turns in each conversation is smaller, and most conversations stay on the same topic. We only use the queries with ground-truth passages for the retrieval tests.
CAsT-19 and CAsT-20 are two conversational search benchmarks provided by the TREC Conversational Assistance Track (CAsT). They contain 50 and 25 conversations, respectively. No training data is provided, so we use these datasets in the zero-shot setting, applying the models trained on another dataset (TopiOCQA).
Table 8. Statistics of the four conversational search datasets.

| Statistics            | TopiOCQA |       | QReCC  |       | CAsT-19 | CAsT-20 |
|                       | Train    | Test  | Train  | Test  | Test    | Test    |
|-----------------------|----------|-------|--------|-------|---------|---------|
| # Conversations       | 3,509    | 205   | 8,987  | 2,280 | 50      | 25      |
| # Turns (Qry.)        | 45,650   | 2,514 | 10,875 | 8,124 | 479     | 208     |
| # Tokens / Qry.       | 6.9      | 6.9   | 6.5    | 6.4   | 6.1     | 6.8     |
| # Qry. / Conv.        | 13.0     | 12.3  | 3.3    | 3.6   | 9.6     | 8.6     |
| # Docs in Collection  | 25M      |       | 54M    |       | 38M     |         |
A.2. Implementation Details
Our experiments are conducted on one Nvidia A100 40G GPU. For selector training, the length of the query pair and the batch size are both set to 128. For retriever and multi-task S-R conversational fine-tuning, the lengths of the query, the expansion concatenation, and the passage are truncated to 64, 512, and 384, respectively, and the batch size is set to 16. We use the Adam optimizer with a 1e-5 learning rate and set the number of training epochs to 10. The dense retrieval is performed using Faiss (Johnson et al., 2019).
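As a minimal illustration of the Faiss-based retrieval step (the embedding dimension, corpus size, and variable names below are placeholders, not the actual configuration):

```python
import faiss
import numpy as np

dim = 768                                                    # encoder embedding size (placeholder)
passage_embs = np.random.rand(10000, dim).astype("float32")  # placeholder passage embeddings

index = faiss.IndexFlatIP(dim)   # exact inner-product search over passage vectors
index.add(passage_embs)

query_embs = np.random.rand(4, dim).astype("float32")        # a batch of (expanded) query embeddings
scores, passage_ids = index.search(query_embs, 100)          # top-100 passages per query
```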