
Mitigating the Impact of False Negatives in Dense Retrieval
with Contrastive Confidence Regularization

Shiqi Wang, Yeqin Zhang, Cam-Tu Nguyen (corresponding author)
Abstract

In open-domain Question Answering (QA), dense retrieval is crucial for finding relevant passages for answer generation. Typically, contrastive learning is used to train a retrieval model that maps passages and queries to the same semantic space. The objective is to make similar ones closer and dissimilar ones further apart. However, training such a system is challenging due to the false negative issue, where relevant passages may be missed during data annotation. Hard negative sampling, which is commonly used to improve contrastive learning, can introduce more noise in training. This is because hard negatives are those closer to a given query, and thus more likely to be false negatives. To address this issue, we propose a novel contrastive confidence regularizer for Noise Contrastive Estimation (NCE) loss, a commonly used loss for dense retrieval. Our analysis shows that the regularizer helps dense retrieval models be more robust against false negatives with a theoretical guarantee. Additionally, we propose a model-agnostic method to filter out noisy negative passages in the dataset, improving any downstream dense retrieval models. Through experiments on three datasets, we demonstrate that our method achieves better retrieval performance in comparison to existing state-of-the-art dense retrieval systems.

1 Introduction

Text retrieval involves searching for relevant information in vast text collections based on user queries. Efficient and effective methods for this task have revolutionized how we interact with information systems. Recently, there has been growing interest in augmenting large language models (LLMs) with text retrieval for question answering (QA) (Lewis et al. 2020a; Guu et al. 2020; Glass et al. 2022; Borgeaud et al. 2022; Fu et al. 2022; Zhang et al. 2023). These approaches harness retrieval models to obtain external knowledge and ground LLM outputs, reducing hallucinations and the need for frequent LLM updates. Interestingly, augmenting an LM with a retriever helps reduce the number of parameters required to achieve performance similar to that of larger LMs (Mialon et al. 2023).

Text retrieval methods can be broadly categorized into two main approaches: sparse and dense retrieval. Sparse methods, such as BM25, exploit the frequency of words to measure the relevance between a passage and a query. While efficient, these methods often fall short of capturing intricate relationships and contextual nuances of language. In contrast, dense retrieval methods aim to learn meaningful representations from the semantic content of passages and queries. These models can be built on a pre-trained language model (e.g., BERT, RoBERTa) and fine-tuned for downstream QA tasks (Lewis et al. 2020b), offering easy integration. In addition, it is possible to apply approximate nearest neighbor (ANN) search with dense retrieval (Xiong et al. 2021) for efficient retrieval.

This paper focuses on dense retrieval, where contrastive learning is often employed to train passage and query encoders. The core principle of contrastive learning is to encode passages and queries such that relevant passages are closer to their corresponding query in the embedding space, while irrelevant passages are farther away. To train such encoders, we need a labeled dataset with queries annotated with relevant passages (positive samples). However, due to the vast number of candidate passages and the complexity of questions, it is common for annotators to miss relevant information (texts) during data preparation, leading to unlabeled positive examples (false negatives) in the training set. Recent studies support this assumption. For instance, Ni, Gardner, and Dasigi (2021) found that over half of 50 answerable questions from the IIRC dataset (Ferguson et al. 2020) had at least one missing piece of evidence. Similarly, Qu et al. (2021) manually reviewed top-retrieved passages not labeled as positives in MSMARCO (Nguyen et al. 2016) and detected a 70% false negative rate. On the other hand, it is essential to sample hard negatives for effective contrastive learning. Here, hard negatives refer to passages obtained from the top results of a pre-trained dense retrieval model or BM25. Unfortunately, hard negative sampling is susceptible to higher false negative rates in noisy datasets because such negative samples are more likely to be mislabeled ones. Therefore, mitigating the impact of false negatives can potentially improve the performance of dense retrieval.

Several strategies have recently emerged to address the problem of false negatives. Qu et al. (2021) use a highly effective but inefficient cross-encoder reranker to identify high-confidence negatives as true negatives, which are then used to train the retrieval model. Ni, Gardner, and Dasigi (2021) leverage answers in the downstream QA task and design several heuristics, such as lexical overlap between a gold answer and a candidate passage, to detect valid contexts. Recently, Zhou et al. (2022) suggest selecting samples that are highly similar to positive samples but not too close to the query; such samples are considered informative negatives and unlikely to be false negatives. Despite this recent progress, current methods are primarily based on heuristics and lack theoretical guarantees.

This paper formalizes the problem of training dense retrieval with false negatives within the peer loss framework (Liu and Guo 2020; Cheng et al. 2021), a theoretically sound approach to learning with label noise. We extend this framework by developing a confidence regularizer for NCE, a commonly used loss for training dense retrieval models. Our regularized loss function increases the model's confidence and is provably robust against false negatives. By encouraging confident scoring, we prevent the model from overfitting to noise, resulting in a more robust retrieval model. We then propose a new passage sieve algorithm that uses a confidence-regularized retrieval model to select true hard negatives. The cleaned dataset produced by the passage sieve is then used to train a stronger retrieval model. We prove that our method can successfully filter out false negatives from hard negatives under mild assumptions. Through experiments on three datasets, we demonstrate that our method achieves better retrieval performance compared to existing state-of-the-art dense retrieval systems that rely on heuristic false negative filtering. Supplementary materials (the appendix and code) can be found in our GitHub repository: https://github.com/wangskyGit/passage-sieve.

2 Related Works

2.1 Dense Retrieval

The dual-encoder (two-tower or bi-encoder) architecture (Huang et al. 2013; Reimers and Gurevych 2019) is the common choice for dense retrieval thanks to its high efficiency. However, vanilla dual encoders face several challenges, such as limited expressiveness compared to cross-encoders and suboptimal performance due to non-informative negative samples. As a result, various solutions have been introduced to improve vanilla dual encoders from different perspectives, such as knowledge distillation (Ren et al. 2021b; Lu et al. 2022), lightweight interaction models (Khattab and Zaharia 2020; Humeau et al. 2019), sophisticated training procedures (Zhang et al. 2021; Qu et al. 2021; Ren et al. 2021b), and negative sampling strategies (Karpukhin et al. 2020; Xiong et al. 2021; Qu et al. 2021; Zhou et al. 2022). In this paper, we focus mostly on negative sampling strategies, particularly targeting the false negative issue. Unlike the closely related works (Qu et al. 2021; Zhou et al. 2022; Ni, Gardner, and Dasigi 2021), which exploit heuristic strategies, our method leverages the peer-loss approach (Liu and Guo 2020), effectively combining practical application with a theoretically informed perspective.

2.2 Label-Noise Robust Machine Learning

Developing machine learning models that are robust against label noise is important for supervised learning. Existing methods tackle label noise based on the type of noise, such as random noise (Natarajan et al. 2013; Manwani and Sastry 2013), class-dependent noise (Liu and Tao 2015; Patrini et al. 2017; Yao et al. 2020), or instance-dependent noise (Zhu, Liu, and Liu 2021; Cheng et al. 2021; Xia et al. 2020; Yang et al. 2022; Hao et al. 2022). Unfortunately, these methods are mainly designed for multi-class classification, and thus cannot be directly applied to our task.

Also relevant to our work are (Chuang et al. 2020; Robinson et al. 2020) which consider the issue of bias in contrastive learning for multi-class classification. These studies rely on the assumption of a fixed uniform noise probability across different classes (i.e. queries in our case) to approximate the positive and negative distributions. In contrast, our work concerns noise in supervised contrastive learning for query-dependent ranking, where the distributions of positives and negatives are not the same for different queries. For example, it is more likely for queries with high recall to be associated with false negatives. In other words, the noise probability is not uniform across queries.

3 Preliminaries

3.1 Problem Formalization

Let $C$ be a collection of textual passages, and let $\tilde{\mathcal{D}}=\{(q_n, p_n^+, \tilde{\mathcal{N}}_{q_n})\}$ denote the noisy set of tuples, each consisting of a query $q_n$, an annotated positive $p_n^+ \in C$, and a set of sampled negatives $\tilde{\mathcal{N}}_{q_n} \subset C$ that may contain false negatives. Our objective is to mitigate the impact of such false negatives and learn a robust retrieval model that closely matches the model trained on the clean dataset $\mathcal{D}=\{(q_n, p_n^+, \mathcal{N}_{q_n})\}$ without false negatives.

3.2 Dual-Encoders for Dense Retrieval

A question and a passage are encoded into dense vectors separately, and their similarity is determined by the dot product of the encoder outputs:

$sim(q,p)=\langle E_{qry}(q), E_{psg}(p)\rangle$ (1)

where $E_{qry}$ and $E_{psg}$ represent distinct encoders that map the query and the passage into dense vectors, and $\langle\cdot,\cdot\rangle$ is the similarity function, such as the dot product, cosine similarity, or Euclidean distance. The encoders are often built on a pre-trained language model such as BERT (Karpukhin et al. 2020; Luan et al. 2021; Xiong et al. 2021; Oguz et al. 2022; Zhang et al. 2022), ERNIE (Qu et al. 2021; Ren et al. 2021b; Zhang et al. 2021), or RoBERTa (Oguz et al. 2022). Since passages and queries are encoded separately, passage embeddings can be precomputed and indexed using Faiss (Johnson, Douze, and Jégou 2021) for efficient search.
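For illustration, the following sketch shows one way Eq. 1 could be implemented with two independent BERT encoders and CLS pooling in PyTorch; the model name, pooling strategy, and example inputs are assumptions for illustration rather than the exact configuration of any cited system.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
query_encoder = AutoModel.from_pretrained("bert-base-uncased")    # E_qry
passage_encoder = AutoModel.from_pretrained("bert-base-uncased")  # E_psg

def encode(encoder, texts):
    # CLS-token pooling; other pooling choices are possible
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**batch).last_hidden_state[:, 0]

queries = ["who wrote on the origin of species"]
passages = ["On the Origin of Species was written by Charles Darwin.",
            "The book was first published in 1859."]

with torch.no_grad():
    q_emb = encode(query_encoder, queries)     # (num_queries, hidden)
    p_emb = encode(passage_encoder, passages)  # (num_passages, hidden)
    sim = q_emb @ p_emb.T                      # dot-product similarity, Eq. 1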

Contrastive learning is commonly used to train dual encoders (Karpukhin et al. 2020; Xiong et al. 2021; Qu et al. 2021; Zhou et al. 2022). Typically, it is assumed that we have access to the clean training set $\mathcal{D}=\{(q_n, p_n^+, \mathcal{N}_{q_n})\}$. Using this training set, we measure the NCE (Noise-Contrastive Estimation) loss for a given query $q_n$ as follows:

$\ell_{NCE}(p_n^+, q_n) = -\ln(f(p_n^+, q_n))$
$= -\ln\left(\frac{e^{sim(p_n^+, q_n)}}{\sum_{p_i^- \in \mathcal{N}_{q_n}} e^{sim(p_i^-, q_n)} + e^{sim(p_n^+, q_n)}}\right)$ (2)

The total loss function can be calculated as follows:

$\frac{1}{N}\sum_{n\in[N]} \ell_{NCE}(p_n^+, q_n)$ (3)

where $N$ is the total number of queries in the dataset, and $[N]=\{1,2,\ldots,N\}$.

Negative sampling aims to select samples for $\mathcal{N}_{q_n}$ and plays an important role in learning effective representations with contrastive learning. In the context of dense retrieval, two common strategies for negative sampling are in-batch negatives and hard negatives (Karpukhin et al. 2020; Xiong et al. 2021; Qu et al. 2021). In-batch negatives take the positive passages of other queries in the same batch as negative samples. Generally, increasing the number of in-batch negatives improves dense retrieval performance. Hard negatives, on the other hand, are informative samples that receive a high similarity score from another retrieval model (e.g., BM25 or a pre-trained DPR) (Karpukhin et al. 2020). Hard negatives can make training more effective, yet including such samples may exacerbate the issue of false negatives.
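To make the loss concrete, the following is a minimal PyTorch sketch of the NCE loss (Eqs. 2 and 3) with in-batch negatives and one hard negative per query; the tensor shapes and random embeddings are placeholders for illustration only.

import torch
import torch.nn.functional as F

B, H = 4, 768                      # illustrative batch size and embedding dimension
q_emb = torch.randn(B, H)          # query embeddings from E_qry
pos_emb = torch.randn(B, H)        # one annotated positive per query
hard_emb = torch.randn(B, H)       # one sampled hard negative per query

# Candidates for every query: all positives in the batch (acting as in-batch
# negatives for the other queries) plus all hard negatives.
cand_emb = torch.cat([pos_emb, hard_emb], dim=0)   # (2B, H)
scores = q_emb @ cand_emb.T                        # similarity matrix, (B, 2B)

# For query n, its own positive sits at column n; all other columns are negatives,
# so cross-entropy over the rows equals the NCE loss of Eqs. 2 and 3.
positive_idx = torch.arange(B)
nce_loss = F.cross_entropy(scores, positive_idx)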

3.3 Peer Loss and Confidence Regularization

The problem of learning with label noise has been extensively researched in the context of classification tasks (Liu and Guo 2020; Xia et al. 2020; Zhu, Liu, and Liu 2021; Cheng et al. 2021; Yang et al. 2022; Hao et al. 2022). Recently, two effective methods, Peer Loss (Liu and Guo 2020) and the Confidence Regularizer it inspired (Cheng et al. 2021; Zhu, Liu, and Liu 2021), have been introduced to train robust machine learning models. One advantage of these methods is that they work without requiring knowledge of the noise transition matrix or the probability of labels being flipped between classes.

The concept of peer loss was initially introduced for the binary classification problem. In this setting, we use $x$ to indicate an instance (a feature vector), and $y\in\{-1,1\}$ and $\tilde{y}\in\{-1,1\}$ to represent the clean label and noisy label, respectively. For each sample $(x_n, \tilde{y}_n)$, the peer loss is then defined for the cross-entropy loss as follows:

$\ell_{PL}(g(x_n), \tilde{y}_n) = \ell_{CE}(g(x_n), \tilde{y}_n) - \ell_{CE}(g(x_{n_1}), \tilde{y}_{n_2})$ (4)

where $g$ indicates the classification function, and $x_{n_1}$ and $\tilde{y}_{n_2}$ are drawn from different, randomly selected peer samples for $n$. Here, the first term measures the loss of the classifier prediction, whereas the second term penalizes the model when it excessively agrees with incorrect or noisy labels (Liu and Guo 2020).

Inspired by peer loss, Cheng et al. (2021) develop $\text{CORES}^2$, which extends the framework to multi-class classification and uses the first-order statistic instead of randomly selecting peer samples. Specifically, the new loss in $\text{CORES}^2$ is defined as follows:

$\ell(g(x_n), \tilde{y}_n) = \ell_{CE}(g(x_n), \tilde{y}_n) - \beta\, \mathbb{E}_{\tilde{D}_{\tilde{Y}}}[\ell_{CE}(g(x_n), \tilde{Y})]$ (5)

where $\beta$ is a hyper-parameter, $\tilde{Y}$ denotes the random variable corresponding to the noisy label, and $\tilde{D}_{\tilde{Y}}$ indicates the marginal distribution of $\tilde{Y}$. It has been shown that learning with an appropriate $\beta$ makes the loss function robust to instance-dependent label noise with a theoretical guarantee (Cheng et al. 2021).
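As a reference point for the extension below, here is a sketch of Eq. 5 for a generic multi-class classifier, where the expectation over the noisy-label prior is approximated by the empirical label distribution of the batch; this reflects our reading of the loss, not the authors' released implementation.

import torch
import torch.nn.functional as F

def confidence_regularized_ce(logits, noisy_labels, beta=0.1):
    # First term of Eq. 5: standard cross-entropy against the (possibly noisy) labels
    ce = F.cross_entropy(logits, noisy_labels, reduction="none")
    log_probs = F.log_softmax(logits, dim=1)                      # (batch, num_classes)
    # Approximate the marginal distribution of noisy labels with batch statistics
    prior = torch.bincount(noisy_labels, minlength=logits.size(1)).float()
    prior = prior / prior.sum()
    # Second term: E_{Y~}[CE(g(x), Y~)] under the empirical prior
    reg = -(log_probs * prior.unsqueeze(0)).sum(dim=1)
    return (ce - beta * reg).mean()

logits = torch.randn(8, 5, requires_grad=True)
labels = torch.randint(0, 5, (8,))
loss = confidence_regularized_ce(logits, labels)
loss.backward()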

4 Contrastive Confidence Regularizer

We can adopt the above confidence regularization by introducing a binary label $y$ associated with each query-passage pair, where $y=+1$ indicates a positive pair and $y=-1$ a negative one. Subsequently, $\text{CORES}^2$ can be used with a pairwise cross-entropy loss to mitigate the impact of false negatives on training the retrieval model. Unfortunately, the cross-entropy loss is not as effective as the NCE loss (Eq. 3) for dense retrieval, since the latter learns a good ranking function by contrasting a positive sample against a list of negatives (Karpukhin et al. 2020). As a result, we aim to extend the framework of peer loss and tailor the confidence regularization to the NCE loss.

To adopt the peer-loss framework, we extend the NCE loss to measure loss values associated with negative pairs $(p_i^-, q_n)$ besides positive ones. It is noteworthy that the original NCE loss (Eq. 3) incorporates all positive and negative pairs in the normalization factor but only calculates loss values for positive pairs $(p_n^+, q_n)$ while disregarding the negative pairs. Formally, we use $p$ without a superscript as a generic term for the positive passage $p_n^+$ and a negative passage $p_i^- \in \mathcal{N}_{q_n}$, and define a general NCE loss as follows:

$\ell_{NCE}(p, q_n) = -\ln(f(p, q_n))$
$= -\ln\left(\frac{e^{sim(p, q_n)}}{\sum_{p_i^- \in \mathcal{N}_{q_n}} e^{sim(p_i^-, q_n)} + e^{sim(p_n^+, q_n)}}\right)$ (6)

where $p$ can be either a positive or a negative passage for query $q_n$. With this convention, the peer-loss framework can be applied by introducing randomly selected pairs $(p_{n_1}, q_{n_1})$, $(p_{n_2}, q_{n_2})$ ($n_1 \neq n_2$) as peer samples to regularize the NCE loss. We then obtain the contrastive peer loss as follows:

$\ell_{PL}(p_n^+, q_n) = \ell_{NCE}(p_n^+, q_n) - \ell_{NCE}(p_{n_1}, q_{n_2})$ (7)

While our regularized loss appears similar to the original peer loss outlined in Eq. 4, the major difference lies in the structure of the loss functions. Specifically, the first term of Eq. 7 is calculated only for positive pairs, whereas the first term of Eq. 4 involves both positive ($\tilde{y}=+1$) and negative ($\tilde{y}=-1$) samples.

Similar to $\text{CORES}^2$, we introduce $P_{n_1}$ and $Q_{n_2}$ as the random variables corresponding to the peer passage $p_{n_1}$ and the peer query $q_{n_2}$, respectively. Let $\tilde{\mathcal{D}}_P$ be the distribution of $P_{n_1}$ given the distribution $\tilde{\mathcal{D}}$. Noting that $Q_{n_2}$ is a uniform random variable, i.e., $\mathbb{P}(Q_{n_2}=q_{n'}\,|\,\tilde{\mathcal{D}})=1/N$, the contrastive peer loss has the following form in expectation:

$\frac{1}{N}\sum_{n\in[N]}\mathbb{E}_{\tilde{\mathcal{D}}_{P_{n_1},Q_{n_2}}}\left[\ell_{NCE}(p_n^+,q_n)-\ell_{NCE}(P_{n_1},Q_{n_2})\right]$
$=\frac{1}{N}\sum_{n\in[N]}\Big[\ell_{NCE}(p_n^+,q_n)-\sum_{n'\in[N]}\mathbb{P}(Q_{n_2}=q_{n'}\,|\,\tilde{\mathcal{D}})\,\mathbb{E}_{\tilde{\mathcal{D}}_P}\left[\ell_{NCE}(P_{n_1},q_{n'})\right]\Big]$
$=\frac{1}{N}\sum_{n\in[N]}\left[\ell_{NCE}(p_n^+,q_n)-\mathbb{E}_{\tilde{\mathcal{D}}_P}\left[\ell_{NCE}(P,q_n)\right]\right]$
$\approx\frac{1}{N}\sum_{n\in[N]}\left[\ell_{NCE}(p_n^+,q_n)-\mathbb{E}_{\tilde{\mathcal{D}}_{P|q_n}}\left[\ell_{NCE}(P,q_n)\right]\right]$

The last approximation holds because, under batch training in dense retrieval, $P_{n_1}$ is drawn from in-batch passages, and all the passages in the batch form the conditional passage distribution of query $q_n$. From this derivation, we obtain the new noise-robust contrastive loss function $\ell_{RCL}$ with the contrastive confidence regularizer, denoted by $\ell_{CCR}$, as follows:

$\ell_{CCR}(p_n^+, q_n) = \mathbb{E}_{\tilde{\mathcal{D}}_{P|q_n}}[\ell_{NCE}(P, q_n)]$
$\ell_{RCL}(p_n^+, q_n) = \ell_{NCE}(p_n^+, q_n) - \beta\, \ell_{CCR}(p_n^+, q_n)$ (8)

In the following, we first empirically show that our regularized loss function makes the retrieval model more confident and hence more robust against false negatives. We then present the main theorems that guarantee the robustness of our regularized NCE loss.

4.1 Analysis on Simulated Data

The RCL loss (Eq. 8) is minimized by making the first term smaller and the expectation term larger. In other words, we pull the loss associated with positive passages further away from the average (expected) loss, subsequently making the model more confident in predicting positive passages. Intuitively, since label noise distorts the learning signal in clean data, a model trained on a noisy dataset often fits the noise and hence becomes less confident (Cheng et al. 2021). By making the model more confident, we can partially reverse this effect and obtain a more robust retrieval model. We verify this intuition by conducting a simulation on the Natural Questions (NQ) dataset (Kwiatkowski et al. 2019). Specifically, we randomly convert some positive passages to false negatives and observe the effect of the contrastive confidence regularizer $\ell_{CCR}$ on the distributions of the similarity scores of different groups of passages, including false negatives, hard negatives, and in-batch negatives. The experimental results in Figure 1 indicate that if we do not include the $\ell_{CCR}$ term (i.e., $\beta=0$), the distributions are close to each other. On the other hand, incorporating the $\ell_{CCR}$ term makes it easier to separate these distributions. It should be noted that a small overlap between the false negative and hard negative distributions is expected: while all samples in the simulated "false negatives" category are certainly positive passages, there exist (unknown) false negatives in the "hard negatives" category.
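A sketch of how such a simulation could be set up is given below: a fraction of annotated positives is relabeled as hard negatives before training, so the injected false negatives are known and can be tracked. The DPR-style field names and the flip rate are assumptions for illustration.

import random

def inject_false_negatives(dataset, flip_rate=0.1, seed=0):
    # dataset: list of dicts with keys 'question', 'positive_ctxs', 'hard_negative_ctxs'
    rng = random.Random(seed)
    flipped = []
    for example in dataset:
        if len(example["positive_ctxs"]) > 1 and rng.random() < flip_rate:
            fn = example["positive_ctxs"].pop()        # remove one annotated positive ...
            example["hard_negative_ctxs"].append(fn)   # ... and mislabel it as a hard negative
            flipped.append((example["question"], fn))  # track the known false negatives
    return dataset, flipped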

Figure 1 (four panels): (a) $\beta=0$; (b) $\beta=0.05$; (c) $\beta=0.1$; (d) $\beta=0.5$
Figure 1: The effect of the contrastive confidence regularizer. The four panels show results from four experiments with identical settings except for the value of $\beta$. All are continuously trained from the same pre-trained vanilla DPR. The y-axis represents the NCE loss value between queries and different types of passages. Normal DPR training squeezes the different distributions together, while $\ell_{CCR}$ has the potential to help distinguish false negatives from real negatives.

4.2 Theoretical Analysis

Theoretical analysis shows that our RCL loss enjoys similar properties to $\text{CORES}^2$. The detailed proof can be found in the Appendix; the main idea is to introduce pseudo-labels to bridge the NCE loss and the cross-entropy loss, following a proof sketch similar to that of Cheng et al. (2021). The main result is summarized in the following theorem.

Theorem 1.

Under the assumption that the probability of a noisy pair is smaller than that of a clean pair, and with a suitable selection of $\beta$, minimizing $\mathbb{E}_{\tilde{\mathcal{D}}}[\ell_{NCE}(p^+, q) - \beta\, \ell_{CCR}(p^+, q)]$ is equivalent to minimizing $\mathbb{E}_{\mathcal{D}}[\ell_{NCE}(p^+, q)]$.

Theorem 1 indicates that the contrastive confidence regularizer effectively counteracts the adverse effect of false negatives on the loss function. Specifically, we can decompose the regularized loss so that the impact of false negatives is mostly captured by the product of $\ell_{CCR}$ and a scaling factor (the last term in Eq. 14 in the Appendix). Here, the scaling factor is intuitively related to the difference between the relative clean rate of a sample (compared to the average clean rate) and $\beta$. When $\beta$ is large enough, this scaling factor becomes negative, thus reversing the gradient term associated with noisily labeled instances. Note that $\beta$ should not be too large, otherwise it will hinder learning from clean data. This raises a crucial question regarding the existence of such a parameter $\beta$. Analyzing the existence of $\beta$ is challenging when the nature of the noise is uncertain. Fortunately, in the context of false negatives in dense passage retrieval, we can derive the following theorem.

Theorem 2.

When the assumption in Theorem 1 holds, the similarity function $sim$ is bounded (e.g., cosine similarity), and there are no false positives in the dataset $\tilde{\mathcal{D}}$, a $\beta$ that satisfies Theorem 1 must exist and lies in the interval $[0, 1]$.

Theorem 2 shows that, in the setting of dense passage retrieval with false negatives, one can select a suitable $\beta$ within the interval $[0, 1]$. However, this theorem only applies when the similarity function is bounded and there is no other source of noise besides false negatives. In practical scenarios, many dense passage retrieval algorithms employ dot-product similarity instead of cosine similarity. When the loss function is not bounded and there is an excessive number of negative passages, the confidence regularizer can cause the model to overly optimize the loss associated with negative cases. Our experiments indicate that in such situations the contrastive confidence regularizer still helps, as long as a sufficiently small value of $\beta$ is selected. In general, as the batch size increases and in-batch negative sampling is utilized, $\beta$ should be decreased.

5 Passage Sieve Method

Algorithm 1 Passage Sieve Algorithm

Input: Noisy dataset $\tilde{\mathcal{D}}$ with false negatives
Parameter: Hyper-parameter $\beta$, learning rate $\alpha$, training epochs $T$
Output: Sieved dataset $\mathcal{D}^*$

1:  Get the pre-trained DPR model $\mathcal{M}$
2:  Train $\mathcal{M}$ with the robust contrastive loss for $T$ epochs:
3:  for $t \in [T]$ do
4:     for $(q, p^+, \mathcal{N}_q)$ in $\tilde{\mathcal{D}}$ do
5:        Calculate $\ell_{RCL}$ according to Eq. 8
6:        Compute gradients $\nabla \leftarrow \nabla_{\mathcal{M}} \ell_{RCL}$
7:        Update the model parameters: $\mathcal{M} \leftarrow \mathcal{M} - \alpha \cdot \nabla$
8:  Evaluate all the hard negatives:
9:  Initialize empty dataset $\mathcal{D}^*$
10:  for $(q, p^+, \mathcal{N}_q)$ in $\tilde{\mathcal{D}}$ do
11:     $\mathcal{N}_q^* \leftarrow [\,]$  {list of selected hard negatives}
12:     $th = \frac{1}{|\mathcal{D}_{p|q}|}\sum_{p \in \mathcal{D}_{p|q}} \ell_{NCE}(p, q; \mathcal{M})$  {threshold}
13:     for $p^-$ in $\mathcal{N}_q$ do
14:        $\ell_{p^-} \leftarrow \ell_{NCE}(p^-, q; \mathcal{M})$
15:        if $\ell_{p^-} \geq th$ then
16:           Append $p^-$ to $\mathcal{N}_q^*$
17:     Append $(q, p^+, \mathcal{N}_q^*)$ to $\mathcal{D}^*$
18:  Return sieved dataset $\mathcal{D}^*$ with confident negatives

The regularization proposed above is helpful for dense retrieval models trained with the contrastive NCE loss, but it is not applicable to sophisticated methods such as AR2 (Zhang et al. 2021) that do not make use of the NCE loss. We therefore design a novel passage sieve algorithm, which can be used to pre-process the noisy dataset $\tilde{\mathcal{D}}$ and obtain a relatively clean dataset for training.

Our method is presented in Algorithm 1. We first train DPR (Karpukhin et al. 2020), a dual-encoder retrieval model, on the noisy dataset. We then refine the retrieval model using the contrastive confidence regularizer. Afterward, we identify the hard negative passages whose loss value $\ell_{NCE}(p, q_n)$ is higher than the average loss value $\frac{1}{|\mathcal{N}_{q_n}|+1}\sum_{p \in \mathcal{N}_{q_n} \cup \{p_n^+\}} \ell_{NCE}(p, q_n)$, and set them as confident negatives. Finally, we discard all other passages except the confident negatives to obtain a "clean" dataset. It is noteworthy that we employ cosine similarity to satisfy the assumptions outlined in Theorem 2.
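A minimal sketch of the sieve step (lines 8-18 of Algorithm 1) is shown below, assuming a helper nce_loss_of(p, q) that returns $\ell_{NCE}(p, q; \mathcal{M})$ for the regularizer-trained DPR model; the data structures are simplified for illustration.

def sieve_hard_negatives(nce_loss_of, query, positive, hard_negatives):
    # Average NCE loss over the positive and all hard negatives defines the threshold
    losses = [nce_loss_of(p, query) for p in [positive] + hard_negatives]
    threshold = sum(losses) / len(losses)
    # A passage scoring better than a random guess has a loss below the threshold and is
    # treated as a likely false negative, so only the remaining hard negatives are kept.
    return [p for p, l in zip(hard_negatives, losses[1:]) if l >= threshold]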

Lemma 1.

Algorithm 1 ensures that a negative sample $p^-$ among the hard negatives will NOT be selected into the sieved dataset $\mathcal{D}^*$ if its score $f(p^-, q)$ given by the model $f$ (Eq. 2) is better than a random guess, i.e., its similarity score after softmax is bigger than the average value $1/(|\mathcal{N}_q|+1)$.

Lemma 1 provides a sufficient condition for accurately selecting true negatives in Algorithm 1. In this algorithm, the DPR model is trained with the robust contrastive loss and cosine similarity. According to Theorems 1 and 2, by selecting a suitable value of $\beta \in [0,1]$, minimizing $\mathbb{E}_{\tilde{\mathcal{D}}}[\ell_{NCE}(p,q) - \beta\,\ell_{CCR}(p,q)]$ is equivalent to minimizing $\mathbb{E}_{\mathcal{D}}[\ell_{NCE}(p,q)]$. This implies that false negatives are gradually given scores closer to those of the positive passage, and higher than those of hard negatives and in-batch negatives. Given the abundance of true negatives, the scores assigned to false negatives will exceed a random guess at some point. Consequently, false negatives are excluded from the sieved dataset in Algorithm 1, resulting in a cleaner dataset $\mathcal{D}^*$.

Compared to vanilla DPR, DPR with CCR introduces additional computation for the expectation term, the second term in Eq. 8. Fortunately, this can be approximated by taking the mean of the NCE losses of all passages given the query. In addition, the calculation of the NCE losses is dominated by the calculation of the similarity matrix (Eq. 2), which is also needed in vanilla DPR. Consequently, the contrastive confidence regularizer only slightly increases the time complexity when implemented appropriately. It is worth noting that when using retrieval models such as DPR, ANCE, or RocketQA that employ the NCE loss, using the robust contrastive loss of Section 4 directly should be sufficient.

6 Experiment

Method NQ TQ MS-pas
R@5 R@20 R@100 R@5 R@20 R@100 MRR@10 R@50 R@1k
DPR (Karpukhin et al. 2020) - 78.4 85.3 - 79.3 84.9 - - -
ANCE (Xiong et al. 2021) 71.8 81.9 87.5 - 80.3 85.3 33.0 81.1 95.9
COIL (Gao, Dai, and Callan 2021) - - - - - - 35.5 - 96.3
ME-BERT (Luan et al. 2021) - - - - - - 33.8 - -
Individual top-k (Sachan et al. 2021) 75.0 84.0 89.2 76.8 83.1 87.0 - - -
RocketQA (Qu et al. 2021) 74.0 82.7 88.5 - - - 37.0 85.5 97.9
RDR (Yang and Seo 2020) - 82.8 88.2 - 82.5 87.3 - - -
RocketQAv2 (Ren et al. 2021b) 75.1 83.7 89.0 - - - 38.8 86.2 98.1
PAIR (Ren et al. 2021a) 74.9 83.5 89.1 - - - 37.9 86.4 98.2
DPR-PAQ (Oguz et al. 2022) 74.2 84.0 89.2 - - - 31.1 - -
Condenser (Gao and Callan 2021) - 83.2 88.4 - 81.9 86.2 36.6 - 97.4
coCondenser (Gao and Callan 2022) 75.8 84.3 89.0 76.8 83.2 87.3 38.2 - 98.4
ERNIE-Search (Lu et al. 2022) 77.0 85.3 89.7 - - - 40.1 87.7 98.2
MVR (Zhang et al. 2022) 76.2 84.8 89.3 77.1 83.4 87.4 - - -
PROD (Lin et al. 2023) 75.6 84.7 89.6 - - - 39.3 87.0 98.4
COT-MAE (Wu et al. 2023) 75.5 84.3 89.3 - - - 39.4 87.0 98.7
AR2 (Zhang et al. 2021) 77.9 86.0 90.1 78.2 84.4 87.9 39.5 87.8 98.6
AR2+passage sieve 79.2 86.4 90.7 78.7 84.7 88.2 39.8 88.3 98.6
Table 1: Performance of passage sieve with the robust contrastive loss on NQ, TQ test sets, and MS-pas development set. The results of the baselines are from the original papers. All results with AR2 as the backbone are obtained from training AR2 with the same batch size and the same number of hard negatives as in (Zhang et al. 2021).
Method B R@5 R@20 R@100
NQ AR2 64 77.9 86.0 90.1
AR2+SimANS 64 78.6 86.2 90.3
AR2+passage sieve 32 78.6 86.1 90.5
AR2+passage sieve 64 79.2 86.4 90.6
TQ AR2 64 78.2 84.4 87.9
AR2+SimANS 64 78.6 84.6 88.1
AR2+passage sieve 32 78.6 84.8 88.3
AR2+passage sieve 64 78.7 84.7 88.2
Table 2: Comparison of AR2, AR2+SimANS and AR2+passage sieve, where B stands for batch size. Here, the number of hard negatives is set to 15 which is the same as in Section 6.2.

6.1 Experimental Setup

Datasets

We conduct experiments on three public QA datasets: Natural Question (NQ) (Kwiatkowski et al. 2019), Trivia QA (TQ) (Joshi et al. 2017) and MSMARCO Passage Ranking (MS-pas) (Nguyen et al. 2016). The detailed statistics are given in the supplementary material.

Evaluation Metrics

Following previous works, we report R@k (k = 5, 20, 100) for NQ and TQ, and MRR@10 and R@k (k = 50, 1k) for MS-pas. Here, MRR refers to the Mean Reciprocal Rank, i.e., the reciprocal of the rank at which the first relevant passage is retrieved, and R@k measures the proportion of queries for which a relevant passage appears in the top-k results (recall).
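The sketch below reflects one common convention for these metrics (per-query reciprocal rank, and the fraction of queries with at least one relevant passage in the top k); the official evaluation scripts of the benchmarks may differ in details.

def mrr_at_k(ranked_lists, relevant_sets, k=10):
    # Mean reciprocal rank of the first relevant passage within the top k
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, pid in enumerate(ranked[:k], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def recall_at_k(ranked_lists, relevant_sets, k=20):
    # Fraction of queries with at least one relevant passage in the top k
    hits = sum(1 for ranked, relevant in zip(ranked_lists, relevant_sets)
               if any(pid in relevant for pid in ranked[:k]))
    return hits / len(ranked_lists)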

6.2 Effects of Passage Sieve Method

Experimental Design

We implement our passage sieve method on top of AR2 (Zhang et al. 2021), which is referred to as AR2+passage sieve. To evaluate the effectiveness of our passage sieve method, we compare it with the original AR2 along with other contemporary baselines. The details on the baselines are given in the supplementary material.

During the passage sieve procedure in AR2+passage sieve, we set $\beta=0.5$, the number of epochs $T=1$, and the learning rate $\alpha=1e{-}7$. Additionally, we use the cosine similarity function and set the initial number of hard negative passages to two times the number of hard negatives used in the downstream AR2 model. As for AR2 training, all settings follow (Zhang et al. 2021); specifically, the batch size is set to 64 and the number of hard negatives is 15.
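For reference, these settings can be summarized as a small configuration sketch; the key names are illustrative and do not correspond to the exact argument names of the released code.

passage_sieve_config = {
    "beta": 0.5,                       # weight of the contrastive confidence regularizer
    "epochs": 1,                       # T in Algorithm 1
    "learning_rate": 1e-7,             # alpha in Algorithm 1
    "similarity": "cosine",            # bounded similarity, as required by Theorem 2
    "initial_hard_negatives": 2 * 15,  # twice the hard negatives used by downstream AR2
}

ar2_config = {
    "batch_size": 64,
    "num_hard_negatives": 15,
}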

Overall Results

Method # hn NQ
R@5 R@20 R@100
AR2 1 76.4 85.3 89.7
AR2+passage sieve 1 77.6 86.1 90.3
AR2 5 76.9 85.3 89.7
AR2+passage sieve 5 78.0 85.9 90.6
AR2 15 77.9 86.0 90.1
AR2+passage sieve 15 78.6 86.1 90.5
Table 3: Performance of AR2 + passage sieve with the robust contrastive loss on NQ test sets. Here, # hn stands for the number of hard negatives during training. We set the batch size to 32 for all compared methods. The results of the baselines are from (Zhang et al. 2021).

Table 1 presents the results of AR2+passage sieve on the NQ and TQ test sets, as well as the MS-pas development set. It can be observed that AR2 outperforms most of the baseline models on all three datasets across different evaluation metrics. Furthermore, the proposed passage sieve approach enhances the performance of AR2 on all three datasets and all evaluation metrics. These findings suggest that the sieve algorithm significantly improves the quality of hard negative samples in the dataset, enabling better results even with lower GPU memory requirements. It is worth mentioning that the passage sieve procedure accounts for only about one-tenth of the training time of AR2, the downstream retrieval model.

SimANS (Zhou et al. 2022) is a recently proposed method that addresses the issue of false negatives based on useful heuristic assumptions. When combined with AR2, AR2+SimANS achieves state-of-the-art results in dense retrieval. In contrast, we start from the theory of noise-robust machine learning and develop a new loss function. The results in Table 2 show that both SimANS and our method enhance the performance of AR2. However, compared to AR2+SimANS, the passage sieve achieves better results on all metrics of the TQ and NQ datasets, with the advantage most clearly seen on NQ. It should be noted that we do not include the SimANS result on the MS-pas dataset because it uses a batch size four times larger than AR2, which makes the comparison unfair.

Figure 2: Performance of AR2+passage sieve with different values of the hyper-parameter $\beta$ on the NQ test set. Overall, the passage sieve algorithm is not sensitive to $\beta$.

6.3 Detailed Analysis

Influence of limited batch size

We investigate the impact of smaller batch size on the performance of AR2+passage sieve. Table 2 shows that AR2+passage sieve achieves better results in terms of R@5, R@20, and R@100 on TQ dataset, even with a smaller batch size. On NQ dataset, our method achieves comparable results with AR2+SimANS while having only a slight decrease in R@20.

Influence of the number of hard negatives

To evaluate the impact of hard negatives, we experiment with varying numbers of hard negatives during training. To speed up the experiment, the batch size is set to 32 and we only work on NQ dataset. The results in Table 3 demonstrate that the passage sieve approach consistently enhances the performance of AR2 across all settings. Particularly, with only 5 hard negatives, our method can reach a performance near that of 15 hard negatives. Intuitively, when the quality of hard negative samples is higher, we do not need many negative samples, thus reducing computing resources for training.

Influence of hyper-parameter $\beta$

To study the influence of the hyper-parameter $\beta$ on the passage sieve algorithm, we conduct experiments on the NQ dataset with different values of $\beta$ while keeping all other settings the same. Specifically, we vary $\beta$ over 0, 0.25, and 0.5. Figure 2 indicates that the passage sieve method is not sensitive to $\beta$. This is because the DPR model in the passage sieve method is trained with cosine similarity, which satisfies our assumptions in Theorem 2.

Method Batch size R@5 R@20 R@100
NQ DPR 128 - 78.4 85.4
DPR+CCR 64 65.9 77.6 85.4
DPR+CCR 128 68.4 79.5 86.1
TQ DPR 128 - 79.3 84.9
DPR+CCR 64 70.5 78.7 84.9
DPR+CCR 128 71.5 79.8 85.1
Table 4: Performance of DPR+Contrastive Confidence Regularizer on the NQ and TQ test sets. The results of the baselines are from the original papers.

6.4 Effects of Contrastive Confidence Regularizer

Experimental Design

We evaluate the performance of the Contrastive Confidence Regularizer on the DPR model (Karpukhin et al. 2020). We incorporate the regularizer into the existing NCE loss function during the final 5 epochs of training. Other parameters and configurations are kept consistent with those in the original DPR paper. Since the DPR model uses a dot-product similarity function, we select $\beta$ from the range 0.0001 to 0.001, based on the performance observed on the validation set.

Experiment Results

Table 4 shows the results on the NQ and TQ datasets. The proposed contrastive confidence regularizer helps DPR achieve better performance across all metrics on both datasets. The experimental results also show that reducing the batch size has a greater impact on R@20 and R@5 than on R@100. On both datasets, the proposed CCR achieves the same R@100 value as DPR with only half the batch size.

7 Conclusion

This paper aims to mitigate the impact of false negatives on dense passage retrieval. Toward such a goal, we extend the peer-loss framework and develop a confidence regularization for training robust retrieval models. The proposed regularization is compatible with any base retrieval model that uses NCE loss, a widely used contrastive loss function in dense retrieval. Through empirical and theoretical analysis, it is demonstrated that contrastive confidence regularization leads to more robust retrieval models. Building on this regularization, a passage sieve algorithm is proposed. The algorithm leverages a dense retrieval model trained with confidence regularized NCE loss to filter out false negatives, thereby improving any downstream retrieval model including those that do not exploit NCE loss. The effectiveness of both the passage sieve algorithm and the confidence regularization method is validated through extensive experiments on three commonly used QA datasets. The results show that these methods can enhance base retrieval models, even when fewer negative samples are used.

Acknowledgments

We thank all the reviewers for their helpful comments. This work is supported by the National Key R&D Program of China (2022ZD0116600).

References

  • Borgeaud et al. (2022) Borgeaud, S.; Mensch, A.; Hoffmann, J.; Cai, T.; Rutherford, E.; et al. 2022. Improving Language Models by Retrieving from Trillions of Tokens. In International Conference on Machine Learning.
  • Cheng et al. (2021) Cheng, H.; Zhu, Z.; Li, X.; Gong, Y.; Sun, X.; and Liu, Y. 2021. Learning with Instance-Dependent Label Noise: A Sample Sieve Approach. In International Conference on Learning Representations.
  • Chuang et al. (2020) Chuang, C.-Y.; Robinson, J.; Lin, Y.-C.; Torralba, A.; and Jegelka, S. 2020. Debiased contrastive learning. Advances in Neural Information Processing Systems.
  • Ferguson et al. (2020) Ferguson, J.; Gardner, M.; Hajishirzi, H.; Khot, T.; and Dasigi, P. 2020. IIRC: A Dataset of Incomplete Information Reading Comprehension Questions. In Proceedings of the 2020 Conference on EMNLP. Association for Computational Linguistics.
  • Fu et al. (2022) Fu, H.; Zhang, Y.; Yu, H.; Sun, J.; Huang, F.; Si, L.; Li, Y.; and Nguyen, C.-T. 2022. Doc2Bot: Accessing Heterogeneous Documents via Conversational Bots. In Findings of the Association for Computational Linguistics: EMNLP.
  • Gao and Callan (2021) Gao, L.; and Callan, J. 2021. Is your language model ready for dense representation fine-tuning. arXiv preprint arXiv:2104.08253.
  • Gao and Callan (2022) Gao, L.; and Callan, J. 2022. Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.
  • Gao, Dai, and Callan (2021) Gao, L.; Dai, Z.; and Callan, J. 2021. COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List. In Proceedings of the 2021 Conference of NAACL: Human Language Technologies.
  • Glass et al. (2022) Glass, M. R.; Rossiello, G.; Chowdhury, M. F. M.; Naik, A.; Cai, P.; and Gliozzo, A. 2022. Re2G: Retrieve, Rerank, Generate. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Guu et al. (2020) Guu, K.; Lee, K.; Tung, Z.; Pasupat, P.; and Chang, M. 2020. Retrieval Augmented Language Model Pre-Training. In Proceedings of the 37th International Conference on Machine Learning.
  • Hao et al. (2022) Hao, S.; Li, P.; Wu, R.; and Chu, X. 2022. A Model-Agnostic approach for learning with noisy labels of arbitrary distributions. In International Conference on Data Engineering. IEEE.
  • Huang et al. (2013) Huang, P.-S.; He, X.; Gao, J.; et al. 2013. Learning Deep Structured Semantic Models for Web Search Using Clickthrough Data. In CIKM.
  • Humeau et al. (2019) Humeau, S.; Shuster, K.; Lachaux, M.-A.; and Weston, J. 2019. Poly-encoders: Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. In International Conference on Learning Representations.
  • Johnson, Douze, and Jégou (2021) Johnson, J.; Douze, M.; and Jégou, H. 2021. Billion-Scale Similarity Search with GPUs. IEEE Trans. Big Data, 535–547.
  • Joshi et al. (2017) Joshi, M.; Choi, E.; Weld, D. S.; and Zettlemoyer, L. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
  • Karpukhin et al. (2020) Karpukhin, V.; Oguz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; and Yih, W.-t. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on EMNLP. Association for Computational Linguistics.
  • Khattab and Zaharia (2020) Khattab, O.; and Zaharia, M. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval.
  • Kwiatkowski et al. (2019) Kwiatkowski, T.; Palomaki, J.; Redfield, O.; Collins, M.; Parikh, A.; Alberti, C.; Epstein, D.; Polosukhin, I.; Devlin, J.; Lee, K.; et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics.
  • Lewis et al. (2020a) Lewis, P. S. H.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; Riedel, S.; and Kiela, D. 2020a. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020.
  • Lewis et al. (2020b) Lewis, P. S. H.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; Riedel, S.; and Kiela, D. 2020b. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems.
  • Lin et al. (2023) Lin, Z.; Gong, Y.; Liu, X.; Zhang, H.; Lin, C.; Dong, A.; Jiao, J.; Lu, J.; Jiang, D.; Majumder, R.; et al. 2023. Prod: Progressive distillation for dense retrieval. In Proceedings of the ACM Web Conference.
  • Liu and Tao (2015) Liu, T.; and Tao, D. 2015. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Liu and Guo (2020) Liu, Y.; and Guo, H. 2020. Peer loss functions: Learning from noisy labels without knowing noise rates. In International Conference on Machine Learning. PMLR.
  • Lu et al. (2022) Lu, Y.; Liu, Y.; Liu, J.; Shi, Y.; Huang, Z.; Sun, S. F. Y.; Tian, H.; Wu, H.; Wang, S.; Yin, D.; et al. 2022. Ernie-search: Bridging cross-encoder with dual-encoder via self on-the-fly distillation for dense passage retrieval. arXiv preprint arXiv:2205.09153.
  • Luan et al. (2021) Luan, Y.; Eisenstein, J.; Toutanova, K.; and Collins, M. 2021. Sparse, dense, and attentional representations for text retrieval. Transactions of the Association for Computational Linguistics.
  • Manwani and Sastry (2013) Manwani, N.; and Sastry, P. 2013. Noise tolerance under risk minimization. IEEE Transactions on Cybernetics.
  • Mialon et al. (2023) Mialon, G.; Dessì, R.; Lomeli, M.; Nalmpantis, C.; Pasunuru, R.; Raileanu, R.; Rozière, B.; Schick, T.; Dwivedi-Yu, J.; Celikyilmaz, A.; et al. 2023. Augmented language models: a survey. arXiv preprint arXiv:2302.07842.
  • Natarajan et al. (2013) Natarajan, N.; Dhillon, I. S.; Ravikumar, P. K.; and Tewari, A. 2013. Learning with noisy labels. Advances in Neural Information Processing Systems.
  • Nguyen et al. (2016) Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; Tiwary, S.; Majumder, R.; and Deng, L. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset.
  • Ni, Gardner, and Dasigi (2021) Ni, A.; Gardner, M.; and Dasigi, P. 2021. Mitigating False-Negative Contexts in Multi-document Question Answering with Retrieval Marginalization. In Proceedings of the 2021 Conference on EMNLP. Association for Computational Linguistics.
  • Oguz et al. (2022) Oguz, B.; Lakhotia, K.; Gupta, A.; Lewis, P.; Karpukhin, V.; Piktus, A.; Chen, X.; Riedel, S.; Yih, W.; Gupta, S.; et al. 2022. Domain-matched Pre-training Tasks for Dense Retrieval. In Findings of the Association for Computational Linguistics: NAACL 2022-Findings. Association for Computational Linguistics.
  • Patrini et al. (2017) Patrini, G.; Rozza, A.; Krishna Menon, A.; Nock, R.; and Qu, L. 2017. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • Qu et al. (2021) Qu, Y.; Ding, Y.; Liu, J.; Liu, K.; Ren, R.; Zhao, W. X.; Dong, D.; Wu, H.; and Wang, H. 2021. RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2021 Conference of the NAACL: Human Language Technologies.
  • Reimers and Gurevych (2019) Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on EMNLP-IJCNLP. Association for Computational Linguistics.
  • Ren et al. (2021a) Ren, R.; Lv, S.; Qu, Y.; Liu, J.; Zhao, W. X.; She, Q.; Wu, H.; Wang, H.; and Wen, J.-R. 2021a. PAIR: Leveraging Passage-Centric Similarity Relation for Improving Dense Passage Retrieval. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
  • Ren et al. (2021b) Ren, R.; Qu, Y.; Liu, J.; Zhao, W. X.; She, Q.; Wu, H.; Wang, H.; and Wen, J.-R. 2021b. RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking. In Proceedings of the 2021 Conference on EMNLP.
  • Robinson et al. (2020) Robinson, J. D.; Chuang, C.-Y.; Sra, S.; and Jegelka, S. 2020. Contrastive Learning with Hard Negative Samples. In International Conference on Learning Representations.
  • Sachan et al. (2021) Sachan, D.; Patwary, M.; Shoeybi, M.; Kant, N.; Ping, W.; Hamilton, W. L.; and Catanzaro, B. 2021. End-to-End Training of Neural Retrievers for Open-Domain Question Answering. In Proceedings of the Association for Computational Linguistics: ACL-IJCNLP 2021.
  • Wu et al. (2023) Wu, X.; Ma, G.; Lin, M.; Lin, Z.; Wang, Z.; and Hu, S. 2023. Contextual masked auto-encoder for dense passage retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Xia et al. (2020) Xia, X.; Liu, T.; Han, B.; Wang, N.; Gong, M.; Liu, H.; Niu, G.; Tao, D.; and Sugiyama, M. 2020. Part-dependent label noise: Towards instance-dependent label noise. Advances in Neural Information Processing Systems.
  • Xiong et al. (2021) Xiong, L.; Xiong, C.; Li, Y.; Tang, K.; Liu, J.; Bennett, P. N.; Ahmed, J.; and Overwijk, A. 2021. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. In 9th International Conference on Learning Representations.
  • Yang and Seo (2020) Yang, S.; and Seo, M. 2020. Is retriever merely an approximator of reader? arXiv preprint arXiv:2010.10999.
  • Yang et al. (2022) Yang, S.; Yang, E.; Han, B.; Liu, Y.; Xu, M.; Niu, G.; and Liu, T. 2022. Estimating instance-dependent Bayes-label transition matrix using a deep neural network. In International Conference on Machine Learning. PMLR.
  • Yao et al. (2020) Yao, Y.; Liu, T.; Han, B.; Gong, M.; Deng, J.; Niu, G.; and Sugiyama, M. 2020. Dual t: Reducing estimation error for transition matrix in label-noise learning. Advances in Neural Information Processing Systems.
  • Zhang et al. (2021) Zhang, H.; Gong, Y.; Shen, Y.; Lv, J.; Duan, N.; and Chen, W. 2021. Adversarial Retriever-Ranker for Dense Text Retrieval. In International Conference on Learning Representations.
  • Zhang et al. (2022) Zhang, S.; Liang, Y.; Gong, M.; Jiang, D.; and Duan, N. 2022. Multi-View Document Representation Learning for Open-Domain Dense Retrieval. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
  • Zhang et al. (2023) Zhang, Y.; Fu, H.; Fu, C.; Yu, H.; Li, Y.; and Nguyen, C.-T. 2023. Coarse-To-Fine Knowledge Selection for Document Grounded Dialogs. In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing.
  • Zhou et al. (2022) Zhou, K.; Gong, Y.; Liu, X.; Zhao, W. X.; Shen, Y.; Dong, A.; Lu, J.; Majumder, R.; Wen, J.-R.; and Duan, N. 2022. SimANS: Simple Ambiguous Negatives Sampling for Dense Text Retrieval. In Proceedings of the 2022 Conference on EMNLP: Industry Track.
  • Zhu, Liu, and Liu (2021) Zhu, Z.; Liu, T.; and Liu, Y. 2021. A second-order approach to learning with instance-dependent label noise. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

Appendix

Appendix A Missing Details

A.1 Limitations and future work

This paper proposed a novel contrastive confidence regularizer and a passage sieve algorithm to handle the false negative problem in dense retrieval training. The theoretical analysis and experimental results have shown the effectiveness of our methods; however, there are still several limitations of our work:

  1. We only test our methods on top of AR2 (Zhang et al. 2021) and DPR (Karpukhin et al. 2020); more retrieval models could be tested to examine the performance of the passage sieve and the contrastive confidence regularizer in different cases.

  2. Theorem 1 only assumes the loss function to be the NCE loss and does not require the task to be dense retrieval. Thus, the proposed contrastive confidence regularizer has the same power on any other self-supervised task using NCE loss, e.g., self-supervised image classification or sentence embedding. More experiments on these benchmarks should be conducted in future work.

  3. The proposed methods are built upon $\text{CORES}^2$ (Cheng et al. 2021), which considers the first-order statistic (estimation) of peer loss. However, Zhu, Liu, and Liu (2021) extended this to second-order statistics and achieved better performance on CV benchmarks. More work is needed to extend the contrastive confidence regularizer to second-order statistics as well.

A.2 Details about baselines

Details about the baselines in Table 1 in the main content are shown in this subsection.
DPR (Karpukhin et al. 2020) employs a dual-encoder architecture, leveraging BM25 for the identification of hard negatives and utilizing the in-batch negatives for efficiency.
ANCE (Xiong et al. 2021) constructs hard negatives from an Approximate Nearest Neighbor (ANN) index.
COIL (Gao, Dai, and Callan 2021) integrates semantic with lexical matching, storing contextualized token representations in inverted lists.
ME-BERT (Luan et al. 2021) merges the efficiency of dual encoders with the expressiveness of attentional architectures, further venturing into sparse-dense hybrids.
Individual top-k (Sachan et al. 2021) utilizes unsupervised pre-training, focusing on the inverse cloze task and masked salient spans within question-context pairs.
RocketQA (Qu et al. 2021) explores the ability of the cross-encoder model to denoise hard negatives.
RDR (Yang and Seo 2020) seeks to distill knowledge from the reader into the retriever, aiming for the retriever to gain reader strengths without loss of inherent advantages.
RocketQAv2 (Ren et al. 2021b) distills knowledge from the cross-encoder model by narrowing the relevance distribution between the retrieval model and the cross-encoder model.
PAIR (Ren et al. 2021a) considers both query-centric similarity relations and passage-centric similarity relations.
DPR-PAQ (Oguz et al. 2022) introduces the PAQ dataset, concentrating on the question-answering pretraining task.
Condenser (Gao and Callan 2021) boosts optimization readiness by performing masked language model prediction actively conditioned on the dense representation.
coCondenser (Gao and Callan 2022) adds a corpus-level contrastive learning goal to the existing Condenser pre-training architecture.
ERNIE-Search (Lu et al. 2022) refines the distillation process by sharing the encoding layers with the dual-encoder and conducting a cascaded approach.
MVR (Zhang et al. 2022) offers multiple representations for candidate passages and proposes a global-local loss to prevent multiple embeddings from collapsing to the same.
PROD (Lin et al. 2023) enhances the cross-encoder distillation by cyclically improving teacher models and tailoring sample selections to student needs.
COT-MAE (Wu et al. 2023) utilizes an asymmetric encoder-decoder architecture with self-supervised and context-supervised masked auto-encoding tasks.
AR2 (Zhang et al. 2021) jointly optimizes the retrieval and re-ranker model according to a minimax adversarial objective.

A.3 Example implementation of contrastive confidence regularizer

In Section 5, we mentioned that CCR (Eq. 8 in the main content) only slightly increases the time complexity when implemented properly. In this subsection, we demonstrate the implementation to support this claim.

The example code is shown in Listing 1. The dimension of $s$ is the number of queries in the batch times the number of passages (the positive, in-batch negatives, and hard negatives) per query. To calculate the NCE loss, a log-softmax over the passage dimension is required, after which an NLL loss with only the positive passages as targets is computed on the post-softmax matrix $s\_$. The only additional code for CCR computes the NLL loss for all elements of $s\_$ and takes the mean of each row.

Listing 1: NCE loss with Contrastive Confidence Regularizer
import torch.nn.functional as F

s = get_scores(q, p)  # similarity matrix: (num_queries, num_passages)
s_ = F.log_softmax(s, dim=1)  # log-softmax over the passage dimension
loss = F.nll_loss(s_, positive_idx, reduction="none")  # the original NCE loss per query
loss = loss - beta * (-1) * s_.mean(dim=1)  # the only additional code: subtract beta * CCR, where CCR = -s_.mean(dim=1)
loss = loss.mean()

A.4 Efficiency analysis of passage sieve algorithm

Overall, the time taken to integrate the passage sieve (Algorithm 1) into AR2 is equivalent to that of training a vanilla DPR (Karpukhin et al. 2020) model. Compared with the time consumption of AR2 training itself, the additional training time is relatively small. We record the time consumption of one epoch of training on the TQ dataset and show it in Table 5.

It should be noted that the passage sieve does not affect inference time; adding it only requires additional data preprocessing time during training.

(values given as AR2 / AR2 + passage sieve)
Batch size: 32 / 32
Number of hard negatives: 15 / 15
Data preprocessing: 10min / 2h 10min
Retrieval & Reranker training: 2h 5min / 2h 5min
Corpus and query encoding: 6h 6min / 6h 6min
ANN index build and Retrieval: 30min / 30min
Overall: 9h / 11h
Table 5: Comparison of efficiency of AR2 + passage sieve and AR2 on the TQ dataset. We list the time consumption for one epoch of training with 4 Tesla V100s.

A.5 Intermediate experiment result

Sieve-out rate of the passage sieve

In the passage sieve algorithm, we filter negative passages by comparing their loss value with the average loss value (i.e., the threshold). We prove that any passage whose score f(p,q) exceeds the average value will be sieved out for sure; however, it remains unknown what fraction of all negative passages will be sieved out (we call this the sieve-out rate). The sieve-out rate directly determines how many candidate passages we need to examine during the passage sieve in order to meet the hard-negative requirement of the downstream retrieval model.

In the main content, we mentioned that we take twice the number of hard negatives required by the downstream retrieval model (AR2 in this case) as candidates for sieving. This ratio was chosen based on preliminary experiments, in which we observed that the passage sieve algorithm usually sieves out fewer than half of the examples. Figure 3 shows the sieve-out rate during training on the NQ dataset. The rate is usually smaller than but close to 0.5, and whenever one epoch reaches a sieve-out rate near 0.5, the rate of the next epoch drops immediately. It should be noted that this rate is not an estimate of the noisy-pair probability: the passage sieve filters aggressively and retains only confident negatives, since Lemma 1 specifies only a sufficient condition for filtering. A minimal sketch of this filtering rule is given below.
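The following sketch illustrates the threshold rule for a single query, assuming dot-product similarity scores; the function and variable names are ours for illustration and are not part of the released code. A negative is retained as a confident negative only if its per-passage NLL exceeds the mean NLL over the positive and all candidate negatives.

import torch
import torch.nn.functional as F

def sieve_negatives(sim_scores, positive_idx):
    # sim_scores: similarity scores of one query against its positive and candidate negatives
    log_probs = F.log_softmax(sim_scores, dim=0)   # scores after softmax, in log space
    per_passage_loss = -log_probs                  # NLL of each passage as if it were the positive
    threshold = per_passage_loss.mean()            # the average loss acts as the sieve threshold
    keep = per_passage_loss > threshold            # confident negatives: loss above the threshold
    keep[positive_idx] = False                     # the labeled positive is never used as a negative
    return keep

# illustrative scores for one query; index 0 is the labeled positive,
# index 2 is a likely false negative (scored almost as high as the positive) and gets sieved out
scores = torch.tensor([2.0, 0.1, 1.8, -0.5, 0.3])
print(sieve_negatives(scores, positive_idx=0))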

Figure 3: Sieve-out rate during AR2 + passage sieve training on the NQ dataset.
Learning curves

We also visualize the training procedure. Specifically, Recall@1 and Recall@5 on the TQ test set and Recall@1 and MRR@10 on the MS-Pas development set are shown in Figure 4. The figure shows that the metrics on both the TQ and MS-Pas datasets gradually increase until convergence.

Figure 4: Learning curves on (a) the TQ test set and (b) the MS-Pas development set.

A.6 Notation Table

We summarize the notation used in the main content of the paper in Table 6 to aid reading.

p: passage | \mathcal{D}: clean distribution
q: query | \mathcal{D}_{P|q}: distribution of P given q
p^{-}: negative passage | \ell_{CCR}: contrastive confidence regularizer
p^{+}: positive passage | \ell_{RCL}: robust contrastive loss
\mathcal{N}_{q}: negative set for query q | \mathcal{D}^{*}: sieved dataset
\tilde{\mathcal{D}}: noisy distribution | \mathcal{N}_{q}^{*}: confident negative list
\ell_{NCE}: NCE loss | f(p,q): similarity score of p, q after softmax
Table 6: Notation table

Appendix B Important Proofs

Before we start the proofs, we first introduce the pseudo-label and the corresponding loss function used throughout. Consider a positive pair (p^{+},q) as having label y=1 and a negative pair (p^{-},q) as having label y=-1. Recall that p denotes passages and q denotes queries. In the following, instead of using superscripts, we use this label to indicate the relation between passages and queries. As in the label-noise robust machine learning literature, y denotes the true label and \tilde{y} the noisy label in the dataset. Thus a false negative in the dataset can be written as (p,q;\tilde{y}=-1,y=1). We then re-define the contrastive loss function as:

\ell(f(p,q);y)=\begin{cases}\ell_{NCE}(p,q),&y=1,\\ 0,&y=-1,\end{cases} \quad (9)

where

\ell_{NCE}(p,q)=-\ln(f(p,q))=-\ln\left(\frac{e^{sim(p,q)}}{\sum_{p_{i}^{-}\in\mathcal{N}_{q}}e^{sim(p_{i}^{-},q)}+e^{sim(p^{+},q)}}\right) \quad (10)

Here, f is the model that scores query-passage pairs, and \ell_{NCE} is the broadened NCE loss discussed in the preliminaries. This pseudo loss is defined for proof purposes only and helps analyze the proposed methods. Note that such a definition does not affect the original contrastive learning in dense retrieval, since the loss for negative pairs is always 0; the two losses differ only by a coefficient:

\mathbb{E}_{(P,Q;Y)\in\mathcal{D}}\left[\ell(f(P,Q);Y)\right]=\mathbb{P}(Y=1)\,\mathbb{E}_{(P^{+},Q)\in\mathcal{D}}\left[\ell_{NCE}(P^{+},Q)\right]

In what follows, unless otherwise specified, \ell denotes this pseudo loss and \ell_{NCE} denotes the original NCE loss.
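For completeness, the coefficient relation above follows in one step from the case definition of the pseudo loss in eq. 9:

\mathbb{E}_{(P,Q;Y)\in\mathcal{D}}\left[\ell(f(P,Q);Y)\right]=\mathbb{P}(Y=1)\,\mathbb{E}_{\mathcal{D}|Y=1}\left[\ell_{NCE}(P,Q)\right]+\mathbb{P}(Y=-1)\cdot 0=\mathbb{P}(Y=1)\,\mathbb{E}_{(P^{+},Q)\in\mathcal{D}}\left[\ell_{NCE}(P^{+},Q)\right]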

B.1 Proof for Theorem 1 and Theorem 2

For readability, we prove Theorems 1 and 2 together in this section.
Theorem 1. Under the assumption that the probability of a pair being noisy is smaller than that of it being clean, and with a suitable choice of \beta, minimizing \mathbb{E}_{\tilde{\mathcal{D}}}[\ell(p^{+},q)-\beta\,\ell_{CCR}(p^{+},q)] is equivalent to minimizing \mathbb{E}_{\mathcal{D}}[\ell(p^{+},q)].
Theorem 2. When the assumption in Theorem 1 holds, the similarity function sim is bounded (e.g., cosine similarity), and there are no false positives in the dataset \tilde{\mathcal{D}}, a \beta satisfying Theorem 1 must exist in the interval [0,1].

Proof.

We first prove that, with the new regularizer, the Main Theorem of (Cheng et al. 2021), which decouples the expected regularized loss, still holds. For simplicity, we write X:=(p_{n},q_{n}).

The pseudo loss defined in eq. 9 has the same basic form as the classification loss with feature x and label y used by \text{CORES}^{2} (Cheng et al. 2021). We first restate the decomposition steps that do not depend on the specific form of the loss function and therefore remain unchanged from the original confidence regularizer paper, as in eq. 11:

\mathbb{E}_{\tilde{\mathcal{D}}}[\ell(f(X),\tilde{Y})]
=\sum_{j}T_{jj}\mathbb{P}(Y=j)\mathbb{E}_{\mathcal{D}|Y=j}\left[\ell(f(X),j)\right]
+\sum_{j}\sum_{i}\mathbb{P}(Y=i)\mathbb{E}_{\mathcal{D}|Y=i}[U_{ij}(X)\ell(f(X),j)]
=\underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(f(X),Y)]+\bar{\Delta}\,\mathbb{E}_{\mathcal{D}_{\Delta}}[\ell(f(X),Y)]
+\sum_{j\in\{-1,1\}}\sum_{i\in\{-1,1\}}\mathbb{P}(Y=i)\mathbb{E}_{\mathcal{D}|Y=i}[U_{ij}(X)\ell(f(X),j)], \quad (11)

where

T_{ij}:=\mathbb{E}_{\mathcal{D}|Y=i}[T_{ij}(X)],\ \forall i,j\in\{-1,1\}
T_{ij}(X):=\mathbb{P}(\tilde{Y}=j|Y=i,X),\quad\underline{T}=\min\{T_{-1-1},T_{11}\}
U_{ij}(X)=T_{ij}(X),\ \forall i\neq j,\quad U_{jj}(X)=T_{jj}(X)-T_{jj}

\bar{\Delta} and \mathcal{D}_{\Delta} follow the same definitions as in \text{CORES}^{2}; we do not restate them here since they are not required in our proof. Furthermore, since \ell(f(X);Y=-1)=0 under the pseudo loss (eq. 9), we obtain the decoupled NCE loss for our problem:

\mathbb{E}_{\tilde{\mathcal{D}}}[\ell(f(X),\tilde{Y})]
=\underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(f(X),Y)]+\bar{\Delta}\,\mathbb{E}_{\mathcal{D}_{\Delta}}[\ell(f(X),Y)]
+\sum_{i\in\{-1,1\}}\mathbb{P}(Y=i)\mathbb{E}_{\mathcal{D}|Y=i}[U_{i,1}(X)\ell(f(X),1)] \quad (12)

Our contrastive confidence regularizer differs from the original regularizer in \text{CORES}^{2}. The expectation of the new \ell_{CCR}(p_{n},q_{n}) is

\mathbb{E}_{\tilde{\mathcal{D}}}\left[\ell_{CCR}(p_{n},q_{n})\right]
=\mathbb{E}_{\mathcal{D}}\left[\ell_{CCR}(p_{n},q_{n})\right]
=\mathbb{E}_{q_{n}}\left[\mathbb{E}_{\mathcal{D}_{p|q_{n}}}[\ell(f(p,q_{n});1)]\right]
=\mathbb{E}_{\mathcal{D}_{p,q}}[\ell(p,q;1)]
=\sum_{i\in\{-1,1\}}\mathbb{P}(Y=i)\mathbb{E}_{\mathcal{D}|Y=i}[\ell(p,q;1)] \quad (13)

The first equality in eq. 13 holds because the clean and noisy distributions differ only in their labels, not in the distributions of p and q; this bridges the two distributions. With the final transformation in eq. 13, we can merge the third term of the decoupled NCE loss (eq. 12) with the expectation of CCR. Thus the expectation of the regularized loss is:

\mathbb{E}_{\tilde{\mathcal{D}}}\left[\ell(f(X),\tilde{Y})-\beta\,\ell_{CCR}(X)\right]
=\underline{T}\,\mathbb{E}_{\mathcal{D}}[\ell(f(X),Y)]+\bar{\Delta}\,\mathbb{E}_{\mathcal{D}_{\Delta}}[\ell(f(X),Y)]
+\sum_{i\in\{-1,1\}}\mathbb{P}(Y=i)\mathbb{E}_{\mathcal{D}|Y=i}[(U_{i,1}(X)-\beta)\ell(f(X),1)] \quad (14)

Given eq. 14, the next key step is to examine how each term behaves during training on noisy datasets. The first term corresponds to optimization on the clean distribution and needs no further consideration. The second term is an expectation over a shifted distribution \mathcal{D}_{\Delta}. Lemma 2 in \text{CORES}^{2} indicates that as long as the dataset is informative and the clean label y equals the Bayes optimal label y^{*}:=\arg\max_{y}\mathbb{P}(y|X), the second term does not change the Bayes optimal label. In our case, the latter condition is naturally satisfied because a sample (p,q) belongs to exactly one class (positive or negative) in the clean dataset. Hence, under the assumption that the noisy distribution is informative, i.e., "the probability of any pair being noisy is smaller than that of it being clean (Assumption I)", the second term in eq. 14 does NOT change the Bayes optimal classifier during training.

For the third term in eq. 14, the ideal condition is U_{1,1}(x)=U_{-1,1}(x)=\beta,\ \forall x, in which case the third term is always zero. However, this condition is nearly impossible to meet and contradicts the assumption of instance-dependent noise. The original confidence regularizer paper (Cheng et al. 2021) proves an interval of \beta for which the regularized loss yields the same Bayes optimal classifier as the clean distribution. Nevertheless, because the interval expression is complex, there remains a risk that no valid \beta exists, as the relationship between the interval endpoints is unclear.

In contrast, the third term in eq. 14 differs significantly from the original one, and our problem setting is more concrete. With the new assumptions of our problem, we can prove a new bound for \beta that is guaranteed to be attainable and makes the estimation unbiased for the clean dataset. Intuitively, the proof proceeds as follows. First, U_{i,1}(X) reflects the clean rate of X relative to the average clean rate, so when \beta is large enough, the scaling factor U_{i,1}(X)-\beta of the third term becomes negative, reversing the loss associated with label-noise instances. Second, \beta should not be too large, otherwise it harms learning from clean data.

On the one hand, when \beta\geq\max_{i\in\{-1,1\},X\sim\mathcal{D}_{X}}U_{i,1}(X), which implies (U_{i,1}(X)-\beta)\leq 0, all scaling factors (U_{i,1}(X)-\beta) are non-positive, so minimizing the third term amounts to minimizing the regularization term. Note that, under the assumption that "the NCE loss has an infimum, i.e., the similarity function is bounded from below (Assumption II)", Theorem 1 in \text{CORES}^{2} (Cheng et al. 2021) still holds even though we use a different form of the loss function. According to that theorem, minimizing the regularization term induces confident predictions regardless of whether the distribution is noisy or clean.

Note that, under the assumption that "there are no false positives in the distribution (Assumption III)", we have T_{-1,-1}(X)=1, T_{-1,1}(X)=0, and U_{-1,1}(X)=T_{-1,1}(X)=0 in our problem, so the lower bound becomes:

\beta\geq\max_{X\sim\mathcal{D}_{X}}U_{1,1}(X)=\max_{X\sim\mathcal{D}_{X}}T_{1,1}(X)-T_{1,1}=\max_{X\sim\mathcal{D}_{X}}T_{1,1}(X)-\mathbb{E}_{\mathcal{D}|Y=1}[T_{1,1}(x)] \quad (15)

On the other hand, we still need an upper bound on \beta to prevent the loss function from being biased on the clean dataset. Consider a sample x_{n} whose clean label is y_{n}=1 and which the model f does not yet fit well, so that its loss is \ell(f(x_{n}),1)=\ell_{0}>0. For the regularized loss to remain unbiased in this case, it should also be non-negative:

T_{1,1}\mathbb{P}(x_{n})\ell_{0}+\mathbb{P}(x_{n})(U_{1,1}(x_{n})-\beta)\ell_{0}\geq 0
\Leftrightarrow T_{1,1}+(U_{1,1}(x_{n})-\beta)\geq 0
\Leftarrow \beta\leq T_{1,1}+\min_{X\sim\mathcal{D}_{X}}U_{1,1}(x)
\Leftrightarrow \beta\leq T_{1,1}+\min_{X\sim\mathcal{D}_{X}}(T_{1,1}(x)-T_{1,1})
\Leftrightarrow \beta\leq\min_{X\sim\mathcal{D}_{X}}T_{1,1}(X) \quad (16)

With the assumption that the noise rate is bounded, i.e., T_{1,1}(X)>T_{1,-1}(X),\ \forall X\sim\mathcal{D}_{X}, and since T_{1,1}(X)+T_{1,-1}(X)=1, we have T_{1,1}(x)>1/2,\ \forall x, and therefore

\max_{X\sim\mathcal{D}_{X}}T_{1,1}(X)-\mathbb{E}_{\mathcal{D}|Y=1}[T_{1,1}(x)]\leq 1-1/2=1/2 \quad (17)

And

\min_{X\sim\mathcal{D}_{X}}T_{1,1}(X)>1/2 \quad (18)

so there must exist a feasible \beta\in[0,1] that satisfies both bounds simultaneously and thus induces confident predictions that are unbiased with respect to the clean dataset. More concretely, \beta=0.5 is always a safe choice.

In conclusion, with Assumptions I, II, and III, when

\max_{X\sim\mathcal{D}_{X}}T_{1,1}(X)-T_{1,1}\leq\beta\leq\min_{X\sim\mathcal{D}_{X}}T_{1,1}(X)

minimizing \mathbb{E}_{\tilde{\mathcal{D}}}[\ell(p,q;\tilde{Y})-\beta\,\ell_{CCR}(p,q)] is equivalent to minimizing \mathbb{E}_{\mathcal{D}}[\ell(p,q;Y)] in the dense retrieval problem, and such a \beta must exist in [0,1]. ∎
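As a purely illustrative numerical check with assumed values (not estimated from any dataset): suppose the instance-dependent clean rate satisfies T_{1,1}(X)\in[0.7,0.9] for all X. Then

\max_{X\sim\mathcal{D}_{X}}T_{1,1}(X)-T_{1,1}\leq 0.9-0.7=0.2,\qquad \min_{X\sim\mathcal{D}_{X}}T_{1,1}(X)=0.7,

so any \beta\in[0.2,0.7], in particular \beta=0.5, satisfies both bounds.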

B.2 Proof for Lemma 1

Lemma 1. The passage sieve algorithm (Algorithm 1) ensures that a negative sample p^{-} among the hard negatives will NOT be selected into the sieved dataset \mathcal{D}^{*} if its score f(p^{-},q) given by the model f exceeds a random guess, i.e., if its similarity score after softmax is larger than the average value 1/(|\mathcal{N}_{q}|+1).

Proof.

Consider the sieving process in Algorithm 1. For a negative pair (p_{n},q_{n};\tilde{y}_{n}=-1), NOT being selected into the confident negative set is equivalent to the following inequality; for simplicity, define \mathcal{N}=\mathcal{N}_{q_{n}}\cup\{p^{+}_{n}\}:

\ell(p_{n},q_{n};1)\leq\frac{1}{|\mathcal{N}_{q_{n}}|+1}\sum_{p_{i}\in\mathcal{N}}\ell(p_{i},q_{n};1)
\Leftrightarrow \ell_{NCE}(p_{n},q_{n})\leq\frac{1}{|\mathcal{N}_{q_{n}}|+1}\sum_{p_{i}\in\mathcal{N}}\ell_{NCE}(p_{i},q_{n})
\Leftrightarrow \ln(f(p_{n},q_{n}))\geq\frac{1}{|\mathcal{N}_{q_{n}}|+1}\sum_{p_{i}\in\mathcal{N}}\ln(f(p_{i},q_{n}))
\Leftrightarrow \ln(f(p_{n},q_{n}))\geq\frac{1}{|\mathcal{N}_{q_{n}}|}\sum_{p_{i}\in\mathcal{N},p_{i}\neq p_{n}}\ln(f(p_{i},q_{n})) \quad (2-1)

By Jensen’s inequality, we have:

\frac{1}{|\mathcal{N}_{q_{n}}|}\sum_{p_{i}\in\mathcal{N},p_{i}\neq p_{n}}\ln(f(p_{i},q_{n}))\leq\ln\left(\frac{\sum_{p_{i}\in\mathcal{N},p_{i}\neq p_{n}}f(p_{i},q_{n})}{|\mathcal{N}_{q_{n}}|}\right)

Given that:

f(p_{i},q_{n})=\frac{e^{sim(p_{i},q_{n})}}{\sum_{p_{j}^{-}\in\mathcal{N}_{q_{n}}}e^{sim(p_{j}^{-},q_{n})}+e^{sim(p_{n}^{+},q_{n})}}

we can further obtain:

\frac{1}{|\mathcal{N}_{q_{n}}|}\sum_{p_{i}\in\mathcal{N},p_{i}\neq p_{n}}\ln(f(p_{i},q_{n}))\leq\ln\left(\frac{1-f(p_{n},q_{n})}{|\mathcal{N}_{q_{n}}|}\right)

Hence we have:

f(p_{n},q_{n})\geq 1/(|\mathcal{N}_{q_{n}}|+1)
\Leftrightarrow |\mathcal{N}_{q_{n}}|\cdot f(p_{n},q_{n})\geq 1-f(p_{n},q_{n})
\Leftrightarrow \ln(f(p_{n},q_{n}))\geq\ln\left(\frac{1-f(p_{n},q_{n})}{|\mathcal{N}_{q_{n}}|}\right)
\Rightarrow \ln(f(p_{n},q_{n}))\geq\frac{1}{|\mathcal{N}_{q_{n}}|}\sum_{p_{i}\in\mathcal{N},p_{i}\neq p_{n}}\ln(f(p_{i},q_{n})) \quad (2-2)

Combining (2-1) and (2-2), we obtain:

f(p_{n},q_{n})\geq 1/(|\mathcal{N}_{q_{n}}|+1)\Rightarrow(p_{n},q_{n})\textit{ will NOT be selected}

Therefore, a sufficient condition for (p_{n},q_{n}) not to be selected is that the model gives p_{n} a higher score than a random guess, i.e., f(p_{n},q_{n})\geq 1/(|\mathcal{N}_{q_{n}}|+1). ∎
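This sufficient condition can also be sanity-checked numerically. The short script below is our own illustration (not part of the released code): it draws random similarity scores for one query and verifies that every passage whose softmax score is at least 1/(|\mathcal{N}_{q_{n}}|+1) has an NCE-style loss no larger than the average loss, i.e., it would be sieved out.

import torch
import torch.nn.functional as F

torch.manual_seed(0)
for _ in range(1000):
    scores = torch.randn(16)                       # 1 positive + 15 hard negatives for one query
    f = F.softmax(scores, dim=0)                   # similarity scores after softmax
    loss = -torch.log(f)                           # loss of each passage treated as the pseudo-positive
    above_random = f >= 1.0 / len(scores)          # score at least that of a random guess
    assert torch.all(loss[above_random] <= loss.mean())  # such passages fall below the average loss, hence sieved out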