
Relevance Filtering for Embedding-based Retrieval

Nicholas Rossi, Juexin Lin, Feng Liu, Zhen Yang, Tony Lee, Alessandro Magnani, and Ciya Liao
Walmart Global Technology, Sunnyvale, USA
Abstract.

In embedding-based retrieval, Approximate Nearest Neighbor (ANN) search enables efficient retrieval of similar items from large-scale datasets. While maximizing recall of relevant items is usually the goal of retrieval systems, low precision may lead to a poor search experience. Unlike lexical retrieval, which inherently limits the size of the retrieved set through keyword matching, dense retrieval via ANN search has no natural cutoff. Moreover, the cosine similarity scores of embedding vectors are often optimized via contrastive or ranking losses, which makes them difficult to interpret. Consequently, relying on a top-K or cosine-similarity cutoff is often insufficient to filter out irrelevant results effectively. This issue is prominent in product search, where the number of relevant products is often small. This paper introduces a novel relevance filtering component (called the "Cosine Adapter") for embedding-based retrieval to address this challenge. Our approach maps raw cosine similarity scores to interpretable scores using a query-dependent mapping function. We then apply a global threshold on the mapped scores to filter out irrelevant results. We are able to significantly increase the precision of the retrieved set, at the expense of a small loss of recall. The effectiveness of our approach is demonstrated through experiments on both the public MS MARCO dataset and internal Walmart product search data. Furthermore, online A/B testing on the Walmart site validates the practical value of our approach in real-world e-commerce settings.

Keywords: embedding-based retrieval, ranked list truncation, information retrieval, relevance filter

CIKM '24: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, October 21–25, 2024, Boise, ID, USA. DOI: 10.1145/3627673.3680095. ISBN: 979-8-4007-0436-9/24/10. CCS: Information systems → Retrieval models and ranking.

1. Introduction

Dense retrieval models (Bromley et al., 1993; Karpukhin et al., 2020a) have greatly improved information retrieval by learning to represent queries and documents as dense vectors, capturing semantic relationships. This enables embedding-based retrieval to efficiently retrieve semantically relevant documents using Approximate Nearest Neighbor (ANN) search (Johnson et al., 2017). This approach has proven successful across a wide range of domains, including web search (Huang et al., 2013; Shen et al., 2014), question answering (Severyn and Moschitti, 2015), and e-commerce search (Lin et al., 2024; Magnani et al., 2022; Wang et al., 2021; Zhang et al., 2020; He et al., 2023). While dense retrieval models are effective at retrieving relevant documents, their emphasis on recall can compromise precision by surfacing irrelevant documents. This issue is prominent in e-commerce product retrieval, where the number of relevant products is often small. Presenting a long list of irrelevant products may frustrate users, even if the relevant products are ranked at the top.

Traditionally, retrieval and reranking are treated as separate tasks, with the latter focused on optimizing precision. However, as the number of retrieved documents increases, the computational burden on the reranker to demote irrelevant products may grow unnecessarily. For instance, if there is only one relevant product, retrieving $K$ documents from dense retrieval will include at least $K-1$ irrelevant ones. Hence, we propose to filter out the irrelevant products before reranking, enhancing precision with a negligible sacrifice of recall.

In principle, one could filter out the documents whose cosine similarity score is below a global threshold. However, the cosine similarity scores are usually not interpretable and should not be compared across different queries. Thus, applying a global threshold to the cosine similarity score is not an optimal approach.

To address this, we propose the "Cosine Adapter", a neural network component that produces a function to transform the cosine similarity score into an interpretable relevance score, which can be compared across queries. It then becomes valid to apply a global threshold to the relevance score to filter out low-relevance documents. Importantly, the Cosine Adapter's output function is query-dependent, which enables contextual awareness of query difficulty.

Our relevance mapping technique offers two key advantages: First, it provides a clear probabilistic interpretation of document scores, enhancing transparency and interpretability. Second, it is computationally efficient, as the mapping is applied directly to the cosine similarity scores. Extensive offline benchmarking on both MS MARCO and Walmart product search datasets, along with live online tests, demonstrate improved precision across diverse queries and applications.

2. Related Work

There has been previous work on ranked list truncation, which is the problem of determining where to truncate a list of retrieved documents (Arampatzis et al., 2009; Lien et al., 2019; Bahri et al., 2020; Wang et al., 2022a; Ma et al., 2022; Zamani et al., 2022). The idea is that the sequence of document scores (e.g., BM25 scores) and document statistics may exhibit learnable patterns that indicate the appropriate cutoff position. For example, a large drop in document score may signal that the remaining documents are less relevant and that the optimal cutoff lies at that point. A model is trained to predict the cutoff position that optimizes an evaluation metric (e.g., F1) of the truncated list. Recent works include a self-attention layer in the model to capture long-range dependencies among documents (Bahri et al., 2020; Wang et al., 2022a; Ma et al., 2022). Our approach is quite different: we utilize the query embedding, which encodes rich information about the query, to learn what the cosine score of a relevant item should be. Our approach is also more computationally efficient, as discussed in Section 4.2.

Embedding-based retrieval has recently been widely adopted in production systems across various companies (Nigam et al., 2019; Yang et al., 2020; Zhang et al., 2020; Huang et al., 2020; Yao et al., 2021). While numerous successes have been reported, the challenge of controlling relevance for embedding-based retrieval systems remains an active area of research (Lin et al., 2024; Li et al., 2021). Furthermore, there is growing interest in optimizing the pre-ranking phase for enhanced efficiency and scalability. This involves using lightweight modules to filter out less promising candidates before the more computationally intensive ranking stage (Wang et al., 2020; Xu et al., 2024). However, these works do not primarily focus on improving retrieval relevance. To enhance relevance control in the pre-ranking phase, one approach leverages lexical matching, where a relevance control module filters out products based on the presence of key query terms in their titles (Li et al., 2021). A drawback of this approach is that it sacrifices some of the advantages of dense retrieval, which does not rely on text match. Research addressing both the efficiency and effectiveness of relevance control during pre-ranking remains limited.

3. Background

3.1. Embedding-based retrieval

In information retrieval, the objective is to identify the relevant candidates within a corpus for a given query. Candidates can be passages or documents. In embedding-based retrieval, a dual encoder architecture is used for this task, where two encoders convert the query and candidate, respectively, into a $d$-dimensional embedding vector (Karpukhin et al., 2020b). The relevance of a candidate to the query is measured by the cosine similarity between their embeddings. Once trained, the encoder is used to generate embeddings for all candidates in the corpus, forming an index. When the user searches for a query, the query embedding is generated, and an ANN search is conducted to retrieve the top-$K$ candidates with the highest cosine similarity scores.
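For concreteness, the sketch below shows this retrieval pattern with FAISS (an illustrative library choice, not specified in the paper); L2-normalizing the embeddings makes inner-product search equivalent to cosine-similarity search.

```python
# Minimal embedding-based retrieval sketch (illustrative; a production system
# would use an ANN index such as HNSW or IVF instead of exact search).
import numpy as np
import faiss

d, n, K = 128, 100_000, 10                    # embedding dim, corpus size, top-K
corpus = np.random.randn(n, d).astype("float32")
faiss.normalize_L2(corpus)                    # unit norm: inner product == cosine

index = faiss.IndexFlatIP(d)                  # exact inner-product index
index.add(corpus)

query = np.random.randn(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, K)          # top-K cosine scores and candidate ids
```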

A popular approach for training dual encoder models uses a contrastive loss and in-batch negative sampling (Chen et al., 2017; Henderson et al., 2017; Karpukhin et al., 2020b; Liu et al., 2021; Dong et al., 2022). Given a query $q_i$ and its corresponding positive candidate $p_i$, the contrastive learning loss is formulated as:

(1)   $loss_i = -\log \frac{\exp\left(\cos(q_i, p_i)/\tau\right)}{\exp\left(\cos(q_i, p_i)/\tau\right) + \sum_{j \in \mathcal{N}} \exp\left(\cos(q_i, p_j)/\tau\right)},$

where $\tau$ is a temperature hyperparameter, and $\mathcal{N}$ denotes the set of negatives sampled within the batch.
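For illustration, a minimal PyTorch sketch of this in-batch contrastive loss (the batch layout and temperature value are assumptions):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q: torch.Tensor, p: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    # q, p: [B, d] L2-normalized query / positive embeddings, so q @ p.T is cosine.
    sim = q @ p.T / tau                    # [B, B]; off-diagonals act as in-batch negatives
    labels = torch.arange(q.size(0))       # the positive for query i is candidate i
    return F.cross_entropy(sim, labels)    # Equation (1), averaged over the batch
```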

An alternative approach is to use a softmax listwise loss (Lin et al., 2024; Magnani et al., 2022; Zheng et al., 2022), which leverages predefined labels for each query-candidate pair. For a given query $q_i$ and a set of candidates $\mathcal{P}_i$, the ranking objective is defined as:

(2)   $loss_i = -\sum_{j \in \mathcal{P}_i} y_{ij} \log \frac{\exp\left(\cos(q_i, p_j)/\tau\right)}{\sum_{j' \in \mathcal{P}_i} \exp\left(\cos(q_i, p_{j'})/\tau\right)},$

where $y_{ij}$ is the predefined label, and $\tau$ is a temperature hyperparameter.
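A corresponding sketch of the softmax listwise loss, again with assumed shapes:

```python
import torch

def listwise_loss(scores: torch.Tensor, y: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    # scores: [B, |P|] cosine similarities per query; y: [B, |P|] predefined labels.
    log_probs = torch.log_softmax(scores / tau, dim=-1)  # log of the Eq. (2) softmax
    return -(y * log_probs).sum(dim=-1).mean()           # label-weighted, averaged over queries
```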

Contrastive loss primarily aims to distinguish the positive from the negatives, while softmax listwise loss aims to assign higher scores to candidates with higher labels. Despite their distinct formulations, both loss functions shape the embedding space based on the relative distances between candidates, rather than their absolute relevance scores. This makes it difficult to interpret the cosine similarity scores. Since the embedding space is optimized for relative comparisons within a query, cosine similarity scores cannot be compared across different queries.

3.2. Problem statement

To minimize the impact on retrieval performance, we decouple relevance filtering from the retrieval process itself, performing filtering as a post-processing step on the ANN search results. Effective relevance filtering requires scores that represent relevance in an absolute sense.

Given a query $q_i$ and its top-$K$ retrieved set $\mathcal{P}_i$ from ANN search, our goal is to identify the subset of candidates $\tilde{\mathcal{P}}_i$ that are truly relevant to the query. To achieve this, we introduce a function designed to measure the absolute relevance of each candidate and apply a cutoff threshold for filtering.

A straightforward approach would be to employ a classifier to predict the relevance scores of query-candidate pairs (Dai and Callan, 2019). However, this approach is often computationally expensive to deploy online, particularly when the number of retrieved candidates ($K$) is large.

We propose a simpler function $\mathcal{F}$ that operates directly on the existing cosine similarity scores from the retrieval system. Since our approach leverages the output of the existing dual encoders, it is cheap to compute. The parameters of $\mathcal{F}$ are query-dependent, learned by a neural network that takes the query embedding as input, as discussed below. Formally, the filtering process can be expressed as

(3)   $\tilde{\mathcal{P}}_i = \{\, p_j \mid \mathcal{F}_{\Theta}(\cos(q_i, p_j)) \geq t \,\},$

where $\Theta$ represents the query-dependent parameters of $\mathcal{F}$, and $t$ is a global threshold determined empirically.
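As a concrete illustration, the sketch below applies Equation (3) using the power transform defined later in Section 4.1, assuming its query-dependent parameters $(a, b, k)$ have already been predicted:

```python
import numpy as np

def filter_candidates(cos_scores: np.ndarray, a: float, b: float, k: float, t: float):
    # F_Theta(x) = sgn(x) * a * |x|**k + b, one of the transforms in Section 4.1
    calibrated = np.sign(cos_scores) * a * np.abs(cos_scores) ** k + b
    keep = calibrated >= t                  # global threshold on calibrated scores
    return np.flatnonzero(keep), calibrated[keep]
```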

4. Methodology

4.1. Cosine Adapter

We explore several transformation functions designed to calibrate cosine similarity scores (which range from -1 to 1) to interpretable scores. To preserve the relative ranking of candidates and minimize the impact on recall performance, these functions are chosen to be monotonic and exhibit a variety of shapes:

  1. raw score: $\mathcal{F}(x) = x$

  2. linear: $\mathcal{F}(x \mid \Theta=(a,b)) = ax + b$

  3. square root: $\mathcal{F}(x \mid \Theta=(a,b)) = \mathrm{sgn}(x)\, a\sqrt{|x|} + b$

  4. quadratic: $\mathcal{F}(x \mid \Theta=(a,b)) = \mathrm{sgn}(x)\, a x^{2} + b$

  5. power: $\mathcal{F}(x \mid \Theta=(a,b,k)) = \mathrm{sgn}(x)\, a |x|^{k} + b$, where $k \in (0, 2)$.

Function (1) serves as a baseline for comparison. The parameters $a$, $b$, and $k$ in these functions are query-dependent and learned by a neural network, which we call the "Cosine Adapter". As depicted in Figure 1, the Cosine Adapter takes the query embedding from the dual encoder as input and outputs the parameter set $\Theta$. The Cosine Adapter consists of a few feedforward layers with ReLU activation. For function (5), $k$ is constrained to the range $(0, 2)$ via a $2\sigma(\cdot)$ transformation.
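A minimal PyTorch sketch of such an adapter for the power transform (the hidden-layer width is an assumption; the paper specifies only a few feedforward layers with ReLU activation):

```python
import torch
import torch.nn as nn

class CosineAdapter(nn.Module):
    def __init__(self, d: int, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),           # outputs (a, b, raw k)
        )

    def forward(self, query_emb: torch.Tensor):
        a, b, k_raw = self.mlp(query_emb).unbind(-1)
        k = 2 * torch.sigmoid(k_raw)        # the 2*sigmoid constraint: k in (0, 2)
        return a, b, k

def calibrate(cos: torch.Tensor, a, b, k) -> torch.Tensor:
    # Power transform F(x | a, b, k) = sgn(x) * a * |x|**k + b, used as a logit in Eq. (4)
    return torch.sign(cos) * a * cos.abs() ** k + b
```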

The raw cosine similarity score from the dual encoder model is transformed into a calibrated score using the corresponding $\mathcal{F}$ and $\Theta$. We train the Cosine Adapter layers with a binary cross-entropy loss (MacKay, 2003):

(4)   $loss_i = -\left[\, y_i \log(\sigma(\mathcal{F}_i)) + (1 - y_i)\log(1 - \sigma(\mathcal{F}_i)) \,\right],$

where $y_i$ represents the label for the $i$-th query-candidate pair, and $\sigma$ denotes the sigmoid function. $\mathcal{F}_i$ and $\sigma(\mathcal{F}_i)$ are, respectively, the logit and the probability that the query-candidate pair is relevant.

Figure 1. Architecture of the Cosine Adapter. The Siamese dual encoder is frozen during training.

Note that in our proposal, we freeze the dual encoders while training the Cosine Adapter. Since the Cosine Adapter depends on the dual encoder and the relevance training data, it needs to be trained specifically for each dataset. Also, the training data for the Cosine Adapter can be different from that of the dual encoder.
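A sketch of one training step under this setup (encoder method names and batch format are illustrative, not the paper's API):

```python
import torch
import torch.nn.functional as F

def train_step(adapter, encoder, optimizer, queries, docs, labels):
    # labels: float relevance targets in [0, 1]
    with torch.no_grad():                  # the dual encoder stays frozen
        q = encoder.encode_query(queries)  # [B, d], assumed L2-normalized
        p = encoder.encode_doc(docs)       # [B, d], assumed L2-normalized
    cos = (q * p).sum(-1)                  # cosine similarity per query-candidate pair
    a, b, k = adapter(q)                   # query-dependent transform parameters
    logits = torch.sign(cos) * a * cos.abs() ** k + b
    loss = F.binary_cross_entropy_with_logits(logits, labels)  # Equation (4)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```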

4.2. Relevance filtering workflow

Figure 2 illustrates the integration of the Cosine Adapter and relevance filtering module into the retrieval system. Given a query, the query encoder generates the query embedding. The Cosine Adapter converts the query embedding into the parameters $\Theta$. The query embedding is sent to the ANN index to retrieve the top-$K$ candidates, along with their cosine similarity scores. Within the relevance filter component, the calibrated scores are computed from the transformation function and the raw cosine similarity scores. A global threshold, learned offline, is then applied to filter the results. Finally, the filtered results are forwarded to the re-ranking module.
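The following sketch summarizes this inference path (component names are illustrative; `ann_index.search` stands in for the production ANN service and is assumed to return tensors):

```python
import torch

def retrieve_and_filter(query_text, encoder, adapter, ann_index, K=1000, t=0.0):
    q = encoder.encode_query([query_text])   # 1. query embedding
    a, b, k = adapter(q)                     # 2. Cosine Adapter runs once per query
    scores, ids = ann_index.search(q, K)     # 3. top-K candidates with cosine scores
    calibrated = torch.sign(scores) * a * scores.abs() ** k + b
    keep = calibrated >= t                   # 4. global threshold, learned offline
    return ids[keep], calibrated[keep]       # 5. survivors go to the re-ranking module
```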

At inference time, the computational overhead consists of running the feedforward layers of the Cosine Adapter and computing the calibrated scores. The complexity of the feedforward layers is $O(d^2)$, where $d$ is the embedding dimension; note that the Cosine Adapter runs only once per query. The complexity of computing the calibrated scores is $O(K)$, since it involves only a few operations per candidate. For comparison, some ranked list truncation approaches (like Choppy) (Bahri et al., 2020; Wang et al., 2022a; Ma et al., 2022) include a self-attention layer among candidates, which has a computational complexity of $O(K^2 d)$ (Vaswani et al., 2017).

Figure 2. Workflow of embedding-based retrieval with the relevance filtering module. The components highlighted in blue are the newly introduced elements for relevance filtering.

5. Experiments and Results

To demonstrate the effectiveness of our proposed method, we conduct experiments on both the public MS MARCO dataset and a private Walmart product search dataset. Furthermore, we integrate the relevance filtering component into Walmart’s embedding-based retrieval system to validate its real-world utility.

5.1. Metrics and baselines

Implementing a filtering mechanism in retrieval introduces a trade-off between recall and precision. We evaluate performance using the following metrics:

  • PR AUC (Area Under the Precision-Recall Curve): A common classification metric that captures the overall balance between precision and recall across different thresholds. It is computed without any filtering applied.

  • P@R95 (Precision at 95% Recall): We establish a global cutoff threshold, applicable to all queries, that achieves 95% recall relative to no filtering, and report the corresponding precision. This highlights the precision achieved while maintaining a high level of recall.

  • Filter%: The percentage of results filtered out.

  • Null%: The percentage of queries that return zero results after filtering.

  • MRR (Mean Reciprocal Rank): Reported for the MS MARCO experiments, since it is a standard metric for that dataset.

These metrics provide a comprehensive view of the filtering mechanism’s impact on retrieval performance.
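For reference, a sketch of how P@R95 can be computed from calibrated scores and binary relevance labels pooled across queries (one interpretation of the definition above):

```python
import numpy as np

def precision_at_recall(scores: np.ndarray, labels: np.ndarray, target: float = 0.95):
    order = np.argsort(-scores)                 # sort descending by calibrated score
    labels = labels[order]
    tp = np.cumsum(labels)                      # relevant results kept at each cutoff
    recall = tp / labels.sum()
    precision = tp / (np.arange(len(labels)) + 1)
    i = np.searchsorted(recall, target)         # first cutoff reaching the target recall
    return precision[i], scores[order][i]       # precision and the matching global threshold
```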

We report several baselines for comparison. We apply a global threshold on the raw cosine score ($\mathcal{F}(x) = x$). We also apply a global threshold to the max-normalized cosine score, i.e., the cosine score divided by the maximum score for the query. For comparison with recent work on ranked list truncation, we also include results for Choppy, optimized for F1 (Bahri et al., 2020).

5.2. Experiment on MS MARCO data

| K | Method | PR AUC | P@R95 | Filter% | Null% | MRR |
|---|--------|--------|-------|---------|-------|-----|
| 10 | no filter | - | P=0.0730, R=0.6973 | - | - | 0.411 |
| 10 | Choppy | - | P=0.2775, R=0.2701 | 90.00 | 0.00 | 0.278 |
| 10 | raw score | 0.124 | 0.0830 | 16.53 | 2.02 | 0.400 |
| 10 | raw score (max-norm) | 0.194 | 0.0926 | 25.12 | 0.00 | 0.404 |
| 10 | linear | 0.206 | 0.0864 | 19.84 | 2.13 | 0.402 |
| 10 | square root | 0.208 | 0.0870 | 20.40 | 2.22 | 0.402 |
| 10 | quadratic | 0.205 | 0.0869 | 20.29 | 2.18 | 0.402 |
| 10 | power | 0.207 | 0.0871 | 20.36 | 2.15 | 0.402 |
| 1000 | no filter | - | P=0.0011, R=0.9869 | - | - | 0.422 |
| 1000 | raw score | 0.041 | 0.0028 | 64.94 | 0.60 | 0.417 |
| 1000 | raw score (max-norm) | 0.119 | 0.0047 | 78.92 | 0.00 | 0.419 |
| 1000 | linear | 0.139 | 0.0065 | 84.63 | 0.09 | 0.421 |
| 1000 | square root | 0.141 | 0.0063 | 84.08 | 0.10 | 0.421 |
| 1000 | quadratic | 0.138 | 0.0064 | 84.43 | 0.13 | 0.421 |
| 1000 | power | 0.139 | 0.0066 | 84.90 | 0.04 | 0.420 |

Table 1. Experiments on the MS MARCO passage ranking dataset. "P" and "R" denote precision and recall, respectively.

We conduct our experiment on the MS-MARCO passage ranking dataset (Nguyen et al., 2016). This dataset, based on Bing search results, comprises 503k training queries and 8.8 million passages. As the test set labels are not publicly available, we train our model on the training set and report our results on the development set, which comprises 6,980 queries.

5.2.1. Implementation

We use the publicly available simlm-base-msmarco-finetuned model (https://huggingface.co/intfloat/simlm-base-msmarco-finetuned), a SimLM model fine-tuned on the MS-MARCO dataset using contrastive loss and knowledge distillation (Wang et al., 2022b), as the dual encoder in our experiments. We train the Cosine Adapter with a 1:31 positive-to-negative ratio, where negatives are mined from the BM25 results provided by Wang et al. (2022b). The model is trained on 2 A100 GPUs with batch size 128 for 3 epochs. The dual encoder is frozen during this training.

5.2.2. Results

We evaluate the impact of our filtering mechanism on retrieval sets of varying sizes ($K = 10, 1000$), with results presented in Table 1. Notably, PR AUC and precision values are generally low, as most evaluated queries have only one relevant passage.

Our proposed methods outperform the baselines in terms of PR AUC, indicating a stronger ability to distinguish relevant passages. For $K=1000$, our methods consistently yield higher P@R95 and Filter% values, indicating better filtering performance. Among the proposed functions, the power function outperforms the others, although the differences are small.

Finally, our proposed methods significantly reduce the number of queries with zero results (Null%) compared to the raw-score baseline when $K=1000$. This indicates that our methods are more effective at identifying the correct boundary between relevant and irrelevant results, rather than indiscriminately removing all results. Interestingly, when $K=10$, filtering on raw scores exhibits a lower Null%. This is because 30% of the queries lack positive results within the top 10, whereas this percentage drops to 1% when $K=1000$. Further analysis reveals that among queries yielding zero results with the power method but at least one result with the raw-score baseline, about 70% do not contain positive results within the top 10. This observation reinforces that our proposed methods are more adept at filtering irrelevant results.

We find that Choppy has a relatively low recall because it consistently truncates the list too aggressively. This may be due to the peculiar nature of the dataset, i.e., the fact that there is usually only one relevant passage per query. Note that PR AUC and P@R95 are not reported for Choppy, because it does not have a tunable threshold: a single cutoff position is predicted for each query (Bahri et al., 2020).

5.2.3. Analysis

To illustrate the cross-query comparability of calibrated scores, we analyze how many passages are retained per query after filtering, as shown in Figure 3 ($K=1000$). When filtering on raw scores, there are two prominent spikes in the distribution: at 0 and at 1000. This indicates that a global threshold based on raw scores is ineffective, either removing all results or retaining all results for many queries. In contrast, applying a global threshold to the calibrated scores results in a more balanced distribution. This suggests that the calibrated scores enable a more consistent and effective filtering process, ensuring that a reasonable number of relevant results are retained for each query.

Since the filter threshold is tuned such that the recall is 95%, some relevant items do get filtered out. We inspected the queries that lose relevant items and found that they tend to involve rare words (e.g., "what is einstein neem oil good for") and misspellings (e.g., "sydeny climate").

Figure 3. Passage count per query after applying the filter on the calibrated score (power transformation) and on the raw score, with $K=1000$. Without filtering, the passage count is always 1000.
| Rank | Passage | Label | Raw score (K=10) | Calibrated score (K=10) | Raw score (K=1000) | Calibrated score (K=1000) |
|---|---|---|---|---|---|---|
| 1 | Average Right At Home hourly pay ranges from approximately $7.36 per hour for Caregiver/Companion to $36.48 pers hour for Registered Nurse - Infusion… | 1 | | | | |
| 2 | The average salary for work from home jobs is $64,000… | 0 | | × | | |
| 3 | According to Salary.com, the average stay-at-home mom or dad should be earning $134,121.00 per year for performing ten common daily functions… | 0 | | × | | |
| 4 | Right At Home Salaries in United States. Salary estimated from 1,967 employees, users, and past and present job advertisements on Indeed in the past 12 months. | 0 | | × | | |
| 5 | Gain Financial Freedom & Earn $10.00 - $125.00 per hour work at home! … | 0 | | × | | |
| 6 | Industry Avg. About The Price Is Right Play at Home Game TV Commercial, 'Win $4000 Cash'… | 0 | | × | | |
| 7 | The average KB Home salary ranges from approximately $39,845 per year for Purchasing Coordinator to $118,625 per year for Director… | 0 | | × | | |
| 8 | The average KB Home salary ranges from approximately $39,845 per year for Purchasing Coordinator to $121,644 per year for Director… | 0 | | × | | |
| 9 | Average Homegoods hourly pay ranges from approximately $7.50 per hour for Customer Service Representative to $25.00 per hour for Industrial Mechanic… | 0 | | × | | × |
| 10 | Work-at-home billers are generally offered much lower pay, around $0.10 to $0.25 per billing claim… | 0 | | × | | × |

Table 2. Query is "how much does right at home pay". The calibrated-score results are based on the power transformation. × denotes that the passage is filtered out.

5.2.4. Case study

To further illustrate how our relevance filtering method performs, we showcase an example for the query "how much does right at home pay". As shown in Table 2, only the first passage among the retrieved top 10 is relevant. Filtering based on raw scores struggles to differentiate relevant passages from irrelevant ones, retaining 10 and 366 passages when retrieving $K=10$ and $K=1000$ candidates, respectively. In contrast, filtering using our power-transformed score demonstrates a stark improvement. For $K=10$, it retains only the single relevant passage, effectively filtering out all irrelevant ones. Even when scaling up to $K=1000$, only 7 irrelevant passages are kept. This example reinforces our previous findings: by transforming raw scores to a more interpretable scale, our method allows a global threshold to effectively filter out irrelevant candidates.

5.3. Experiment on Walmart product search data

| Dual encoder objective | Method | PR AUC | P@R95 |
|---|---|---|---|
| - | no filter | - | P=0.4916, R=1.0 |
| Listwise loss | raw score | 0.7074 | 0.5217 |
| Listwise loss | raw score (max-norm) | 0.7374 | 0.5865 |
| Listwise loss | linear | 0.8590 | 0.6139 |
| Listwise loss | square root | 0.8588 | 0.6231 |
| Listwise loss | quadratic | 0.8609 | 0.6160 |
| Listwise loss | power | 0.8619 | 0.6225 |
| Contrastive loss | raw score | 0.7889 | 0.5702 |
| Contrastive loss | raw score (max-norm) | 0.7159 | 0.5711 |
| Contrastive loss | linear | 0.8626 | 0.6361 |
| Contrastive loss | square root | 0.8606 | 0.6302 |
| Contrastive loss | quadratic | 0.8623 | 0.6330 |
| Contrastive loss | power | 0.8567 | 0.6247 |

Table 3. Experiments on the Walmart product search dataset. "P" and "R" denote precision and recall, respectively.
| K | Lift (p-value) |
|---|---|
| 5 | +5.34% (0.00) |
| 10 | +4.00% (0.00) |

Table 4. Precision of the top-$K$ results for impacted queries.
| | Orders | GMV |
|---|---|---|
| Lift (p-value) | +0.03% (0.86) | -0.11% (0.83) |

Table 5. A/B test results.

We conduct similar experiments with Walmart product search data. The training dataset consists of 700k queries and 6 million query-product pairs, each assessed by human reviewers for relevance and categorized as "exact", "substitute", or "irrelevant", similar to the Amazon shopping dataset (Reddy et al., 2022).

We compare the performance of filtering methods using the PR AUC and P@R95 metrics defined above. In addition to the differences between various calibration functions, we are also interested in how our methods perform on dual encoders trained with different objectives: contrastive loss and listwise loss. Our previous experiments showed that models trained with listwise loss usually perform better in retrieval tasks (Magnani et al., 2022).

| Rank | Product | Label | Presence with filtering |
|---|---|---|---|
| 1 | Starbucks Paradise Drink Pineapple Passionfruit with Coconut Milk, 14 fl oz Bottle | Exact | |
| 2 | Starbucks Pink Drink Strawberry Acai Coffee Coconut Milk Drink, 14 fl oz Bottle | Substitute | |
| 3 | Starbucks Paradise Drink Pineapple Passionfruit with Coconut Milk, 14 fl oz, 12 Pack Bottles | Exact | |
| 4 | Welch's Passion Fruit Fruit Juice Drink, 59 fl oz carton | Irrelevant | |
| 5 | Starbucks Refreshers Peach Passion Fruit Sparkling Juice Blend, 12 Oz Can | Substitute | |
| 6 | Welch's Passion Fruit Fruit Juice Drink, 89 fl oz carton | Irrelevant | × |
| 7 | Welch's Passion Fruit Juice Cocktail, 64 fl oz Bottle | Irrelevant | × |
| 8 | Goya Pineapple Juice, 33.8 fl oz | Irrelevant | × |
| 9 | Dole Pineapple 100% Juice 59 oz Bottle | Irrelevant | × |
| 10 | Starbucks Refreshers Sparkling Juice Blend, Peach Passion Fruit With Coconut Water, 12 oz Can | Substitute | |

Table 6. Query is "starbucks pineapple passion fruit juice". × denotes that the product is filtered out.

5.3.1. Implementation

We train two dual-encoder models using the contrastive loss (Equation 1) and the listwise loss (Equation 2), respectively. The dual encoder training follows (Magnani et al., 2022), where DistilBERT (Sanh et al., 2019) is utilized for both the query and product encoders. During Cosine Adapter training, the dual encoder is frozen. The "exact", "substitute", and "irrelevant" query-product pairs are assigned labels of 1, 0.5, and 0, respectively. (We find that setting "substitute" to 0.5 leads to better performance, because "substitute" is more relevant than "irrelevant".) The models were trained with the binary cross-entropy loss (Equation 4) until convergence, using 4 A100 GPUs with a batch size of 512. Note that the dual encoders are trained on engagement data, while the Cosine Adapter is trained on manually annotated relevance data.

5.3.2. Results

The test dataset contains 1,000 queries with 10 products each. The query-product pairs are evaluated by human reviewers in the same manner as the training data. When computing metrics, we treat "exact" products as positive and "substitute" and "irrelevant" products as negative.

The results in Table 3 show that our proposed methods outperform the baselines. The improvement is particularly pronounced when applied to scores from the listwise-loss-trained dual encoder. For listwise loss, the square root and power transformations perform best. For contrastive loss, the linear transformation performs best.

5.3.3. Analysis

When filtering on raw cosine scores, the contrastive-loss models outperform the listwise-loss models by 9% in P@R95. This is expected, as listwise loss prioritizes relative ranking within a candidate list, potentially producing less calibrated raw scores even though they still carry important information about relative relevance. However, applying the Cosine Adapter narrows contrastive loss's lead to 2%.

As in Section 5.2.3, since the filter is tuned to achieve 95% recall across all queries, some relevant products end up being filtered out. From inspection of the results of the power-transformation filter, the queries with the lowest recall show a pattern of containing rare brand names, numbers, and/or misspellings.

5.4. Experiment on Walmart search system

5.4.1. Setup

To demonstrate the practical impact of our proposed relevance filtering component, we integrated it into Walmart's embedding-based retrieval system, following the architecture outlined in Figure 2. Walmart's search system uses a hybrid retrieval approach, combining traditional lexical retrieval with embedding-based retrieval (Magnani et al., 2022). We evaluated our proposed method on top of the production baseline, assessing its impact via two measures: relevance, assessed by human judgment of the top 10 products surfaced to customers (post-reranker), and engagement, measured through an online A/B test. The deployed model was the square root Cosine Adapter; the filtering threshold was calibrated for 99% recall to minimize the risk of filtering out relevant products.

5.4.2. Results

To evaluate relevance performance, we sampled about 700 impacted queries (queries whose top-10 ranking changed) and assessed the relevance of the top 10 products surfaced to customers, following the same rating guidelines as detailed in Section 5.3. We then computed precision by treating the "exact" products as positive. The results are presented in Table 4. While combined testing with downstream product reranking might partially dilute the relevance lift, we still observe a notable improvement in precision: over 5% for the top 5 and 4% for the top 10. The results of our A/B test are presented in Table 5, showing the impact on the number of orders and Gross Merchandise Value (GMV). Overall, we observe a neutral impact on engagement, with p-values much greater than 0.05 for both metrics. Thus, our online tests demonstrate an improvement in precision without negatively impacting engagement.

5.4.3. Case study

To illustrate the impact of the relevance filtering module in Walmart's production retrieval system, we show the top 10 products for the query "starbucks pineapple passion fruit juice" in Table 6. Without filtering, only 20% of the top-10 products are relevant. Among the irrelevant products, 80% exhibit flavor mismatches, while 50% show brand mismatches. These errors mainly come from embedding-based retrieval, which does not rely on keyword matching. By incorporating the relevance filtering module, we successfully filtered out 80% of the irrelevant products in the top 10.

6. Conclusion

In this paper, we propose a novel relevance filtering component for embedding-based retrieval systems that enhances precision at the expense of a small loss of recall. Our approach involves transforming raw cosine similarity scores into interpretable relevance probabilities using query-dependent mapping functions. This method is computationally efficient and easy to implement, requiring only a lightweight adapter module added to the existing dense retrieval system. Extensive evaluation, including experiments on both the public MS MARCO dataset and a private Walmart product search dataset, along with a live A/B test on Walmart.com, demonstrates the effectiveness of our approach in improving precision across diverse applications.

7. Resources

Code, checkpoints and instructions are available at https://github.com/juexinlin/dense_retrieval_relevance_filter.

References

  • Arampatzis et al. (2009) Avi Arampatzis, Jaap Kamps, and Stephen Robertson. 2009. Where to stop reading a ranked list? threshold optimization using truncated score distributions. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Boston, MA, USA) (SIGIR ’09). Association for Computing Machinery, New York, NY, USA, 524–531. https://doi.org/10.1145/1571941.1572031
  • Bahri et al. (2020) Dara Bahri, Yi Tay, Che Zheng, Donald Metzler, and Andrew Tomkins. 2020. Choppy: Cut Transformer for Ranked List Truncation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Virtual Event, China) (SIGIR ’20). Association for Computing Machinery, New York, NY, USA, 1513–1516. https://doi.org/10.1145/3397271.3401188
  • Bromley et al. (1993) Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. 1993. Signature verification using a "siamese" time delay neural network. Advances in Neural Information Processing Systems 6 (1993).
  • Chen et al. (2017) Ting Chen, Yizhou Sun, Yue Shi, and Liangjie Hong. 2017. On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 767–776.
  • Dai and Callan (2019) Zhuyun Dai and Jamie Callan. 2019. Deeper text understanding for IR with contextual neural language modeling. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval. 985–988.
  • Dong et al. (2022) Zhe Dong, Jianmo Ni, Daniel M Bikel, Enrique Alfonseca, Yuan Wang, Chen Qu, and Imed Zitouni. 2022. Exploring dual encoder architectures for question answering. arXiv preprint arXiv:2204.07120 (2022).
  • He et al. (2023) Yunzhong He, Yuxin Tian, Mengjiao Wang, Feier Chen, Licheng Yu, Maolong Tang, Congcong Chen, Ning Zhang, Bin Kuang, and Arul Prakash. 2023. Que2Engage: Embedding-based Retrieval for Relevant and Engaging Products at Facebook Marketplace. arXiv preprint arXiv:2302.11052 (2023).
  • Henderson et al. (2017) Matthew Henderson, Rami Al-Rfou, Brian Strope, Yun-Hsuan Sung, Laszlo Lukacs, Ruiqi Guo, Sanjiv Kumar, Balint Miklos, and Ray Kurzweil. 2017. Efficient Natural Language Response Suggestion for Smart Reply. ArXiv abs/1705.00652 (2017). https://api.semanticscholar.org/CorpusID:2449317
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 2333–2338.
  • Huang et al. (2020) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2020. Embedding-based retrieval in facebook search. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2553–2561.
  • Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
  • Karpukhin et al. (2020a) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020a. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906 (2020).
  • Karpukhin et al. (2020b) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020b. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu (Eds.). Association for Computational Linguistics, Online, 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550
  • Li et al. (2021) Sen Li, Fuyu Lv, Taiwei Jin, Guli Lin, Keping Yang, Xiaoyi Zeng, Xiao-Ming Wu, and Qianli Ma. 2021. Embedding-based product retrieval in taobao search. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3181–3189.
  • Lien et al. (2019) Yen-Chieh Lien, Daniel Cohen, and W. Bruce Croft. 2019. An Assumption-Free Approach to the Dynamic Truncation of Ranked Lists. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval (Santa Clara, CA, USA) (ICTIR ’19). Association for Computing Machinery, New York, NY, USA, 79–82. https://doi.org/10.1145/3341981.3344234
  • Lin et al. (2024) Juexin Lin, Sachin Yadav, Feng Liu, Nicholas Rossi, Praveen Reddy Suram, Satya Chembolu, Prijith Chandran, Hrushikesh Mohapatra, Tony Lee, Alessandro Magnani, and Ciya Liao. 2024. Enhancing Relevance of Embedding-based Retrieval at Walmart. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM ’24). https://doi.org/10.1145/3627673.3680047
  • Liu et al. (2021) Yiqun Liu, Kaushik Rangadurai, Yunzhong He, Siddarth Malreddy, Xunlong Gui, Xiaoyi Liu, and Fedor Borisyuk. 2021. Que2search: Fast and accurate query and document understanding for search at facebook. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 3376–3384.
  • Ma et al. (2022) Yixiao Ma, Qingyao Ai, Yueyue Wu, Yunqiu Shao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2022. Incorporating Retrieval Information into the Truncation of Ranking Lists for Better Legal Search. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Madrid, Spain) (SIGIR ’22). Association for Computing Machinery, New York, NY, USA, 438–448. https://doi.org/10.1145/3477495.3531998
  • MacKay (2003) David JC MacKay. 2003. Information theory, inference and learning algorithms. Cambridge university press.
  • Magnani et al. (2022) Alessandro Magnani, Feng Liu, Suthee Chaidaroon, Sachin Yadav, Praveen Reddy Suram, Ajit Puthenputhussery, Sijie Chen, Min Xie, Anirudh Kashi, Tony Lee, et al. 2022. Semantic retrieval at walmart. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 3495–3503.
  • Nguyen et al. (2016) Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. Ms marco: A human-generated machine reading comprehension dataset. (2016).
  • Nigam et al. (2019) Priyanka Nigam, Yiwei Song, Vijai Mohan, Vihan Lakshman, Weitian Ding, Ankit Shingavi, Choon Hui Teo, Hao Gu, and Bing Yin. 2019. Semantic product search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2876–2885.
  • Reddy et al. (2022) Chandan K. Reddy, Lluís Màrquez, Fran Valero, Nikhil Rao, Hugo Zaragoza, Sambaran Bandyopadhyay, Arnab Biswas, Anlu Xing, and Karthik Subbian. 2022. Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search. (2022). arXiv:2206.06588
  • Sanh et al. (2019) Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019).
  • Severyn and Moschitti (2015) Aliaksei Severyn and Alessandro Moschitti. 2015. Learning to rank short text pairs with convolutional deep neural networks. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. 373–382.
  • Shen et al. (2014) Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM international conference on conference on information and knowledge management. 101–110.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  • Wang et al. (2022a) Dong Wang, Jianxin Li, Tianchen Zhu, Haoyi Zhou, Qishan Zhu, Yuxin Wen, and Hongming Piao. 2022a. MtCut: A Multi-Task Framework for Ranked List Truncation. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (Virtual Event, AZ, USA) (WSDM ’22). Association for Computing Machinery, New York, NY, USA, 1054–1062. https://doi.org/10.1145/3488560.3498466
  • Wang et al. (2022b) Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022b. Simlm: Pre-training with representation bottleneck for dense passage retrieval. arXiv preprint arXiv:2207.02578 (2022).
  • Wang et al. (2021) Tian Wang, Yuri M Brovman, and Sriganesh Madhvanath. 2021. Personalized embedding-based e-commerce recommendations at ebay. arXiv preprint arXiv:2102.06156 (2021).
  • Wang et al. (2020) Zhe Wang, Liqin Zhao, Biye Jiang, Guorui Zhou, Xiaoqiang Zhu, and Kun Gai. 2020. Cold: Towards the next generation of pre-ranking system. arXiv preprint arXiv:2007.16122 (2020).
  • Xu et al. (2024) Enqiang Xu, Yiming Qiu, Junyang Bai, Ping Zhang, Dadong Miao, Songlin Wang, Guoyu Tang, Lin Liu, and Mingming Li. 2024. Optimizing E-commerce Search: Toward a Generalizable and Rank-Consistent Pre-Ranking Model. arXiv preprint arXiv:2405.05606 (2024).
  • Yang et al. (2020) Ji Yang, Xinyang Yi, Derek Zhiyuan Cheng, Lichan Hong, Yang Li, Simon Xiaoming Wang, Taibai Xu, and Ed H Chi. 2020. Mixed negative sampling for learning two-tower neural networks in recommendations. In Companion Proceedings of the Web Conference 2020. 441–447.
  • Yao et al. (2021) Shaowei Yao, Jiwei Tan, Xi Chen, Keping Yang, Rong Xiao, Hongbo Deng, and Xiaojun Wan. 2021. Learning a product relevance model from click-through data in e-commerce. In Proceedings of the Web Conference 2021. 2890–2899.
  • Zamani et al. (2022) Hamed Zamani, Michael Bendersky, Donald Metzler, Honglei Zhuang, and Xuanhui Wang. 2022. Stochastic Retrieval-Conditioned Reranking. In Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval (Madrid, Spain) (ICTIR ’22). Association for Computing Machinery, New York, NY, USA, 81–91. https://doi.org/10.1145/3539813.3545141
  • Zhang et al. (2020) Han Zhang, Songlin Wang, Kang Zhang, Zhiling Tang, Yunjiang Jiang, Yun Xiao, Weipeng Yan, and Wen-Yun Yang. 2020. Towards personalized and semantic retrieval: An end-to-end solution for e-commerce search via embedding learning. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 2407–2416.
  • Zheng et al. (2022) Yukun Zheng, Jiang Bian, Guanghao Meng, Chao Zhang, Honggang Wang, Zhixuan Zhang, Sen Li, Tao Zhuang, Qingwen Liu, and Xiaoyi Zeng. 2022. Multi-Objective Personalized Product Retrieval in Taobao Search. arXiv preprint arXiv:2210.04170 (2022).