
MAKE: Vision-Language Pre-training based Product Retrieval in Taobao Search

Xiaoyang Zheng, Zilong Wang, Sen Li
Alibaba Group, Hangzhou, China
{zhengxiaoyang.zxy,huanshi.wzl,lisen.lisen}@alibaba-inc.com

Ke Xu
City University of Hong Kong, Hong Kong, China
[email protected]

Tao Zhuang, Qingwen Liu, Xiaoyi Zeng
Alibaba Group, Hangzhou, China
{zhuangtao.zt,xiangsheng.lqw,yuanhan}@alibaba-inc.com

(2023)
Abstract.

Taobao Search consists of two phases: the retrieval phase and the ranking phase. Given a user query, the retrieval phase returns a subset of candidate products for the subsequent ranking phase. Recently, the paradigm of pre-training and fine-tuning has shown its potential in incorporating visual clues into retrieval tasks. In this paper, we focus on solving the problem of text-to-multimodal retrieval in Taobao Search. We observe that users' attention to titles or images varies across products. Hence, we propose a novel Modal Adaptation module for cross-modal fusion, which helps assign appropriate weights to texts and images across products. Furthermore, in e-commerce search, user queries tend to be brief, leading to a significant semantic imbalance between user queries and product titles. Therefore, we design a separate text encoder and a Keyword Enhancement mechanism to enrich the query representations and improve text-to-multimodal matching. Combining these designs, we present a novel vision-language (V+L) pre-training method that exploits the multimodal information of (user query, product title, product image) triplets. Extensive experiments demonstrate that our retrieval-specific pre-training model (referred to as MAKE) outperforms existing V+L pre-training methods on the text-to-multimodal retrieval task. MAKE has been deployed online and brings major improvements to the retrieval system of Taobao Search.

Multimodal Pre-training, Semantic Retrieval, Representation Learning
journalyear: 2023; copyright: acmlicensed; conference: Companion Proceedings of the ACM Web Conference 2023, April 30-May 4, 2023, Austin, TX, USA; booktitle: Companion Proceedings of the ACM Web Conference 2023 (WWW '23 Companion), April 30-May 4, 2023, Austin, TX, USA; price: 15.00; doi: 10.1145/3543873.3584627; isbn: 978-1-4503-9419-2/23/04; ccs: Information systems → Information retrieval

1. Introduction

Online shopping has become popular in our daily lives. Hundreds of millions of users visit e-commerce platforms (such as Amazon, eBay, Taobao, and JD) every day. The product search service, which displays relevant products based on user queries, is of great importance for user experience and transaction efficiency. Taobao Search consists of two phases: the retrieval phase and the ranking phase. The retrieval phase aims to select a candidate set (tens of thousands of products) from a large pool of products (at the billion level), while the ranking phase determines the display order. Hence, the retrieval phase plays an important role in the quality of search results. In Taobao (https://www.taobao.com/), a product post is composed of a title and several images, while user queries are plain texts. Therefore, the retrieval phase is formulated as a problem of text-to-multimodal matching.

There are many works (Li et al., 2021a; Nigam et al., 2019; Xiao et al., 2019; Chang et al., 2021; Zobel and Moffat, 2006; Robertson and Zaragoza, 2009) proposed for the product retrieval task, which fall into two categories: lexical matching approaches and embedding-based learning approaches. Lexical matching approaches (Zobel and Moffat, 2006; Robertson and Zaragoza, 2009) typically build the inverted indexes for products and conduct exact matching between inverted indexes and user queries. Embedding learning approaches (Li et al., 2021a; Nigam et al., 2019; Xiao et al., 2019; Chang et al., 2021) learn semantic representations (i.e., embeddings) of queries and products, and then retrieve products by measuring the similarity between the query and product embeddings.

Query: White shirt
Title: White chiffon shirt, women's long-sleeved top, 2020 spring and autumn new western style professional wear, light mature temperament
Table 1. An example of a query-title pair collected from online logs. There exists a significant imbalance between user queries and product titles.

Recently, the success of the transformer (Vaswani et al., 2017) architecture and vision-language representation learning (Zhang et al., 2021; Qi et al., 2020; Li et al., 2021b) has motivated research on pre-training for e-commerce tasks (Gao et al., 2020; Zhuge et al., 2021; Yu et al., 2022). These models, composed of a transformer-based text encoder and image encoder, are pre-trained on text-image pairs and fine-tuned on image captioning, category recognition, text-to-image retrieval, etc. Intuitively, to solve the text-to-multimodal retrieval task in Taobao Search, we can apply the text encoder to user queries and product titles, and the image encoder to product images. The representations of user queries and products are then used in an embedding retrieval framework. However, we observe sub-optimal performance due to the following two key problems. First, existing methods neglect the fact that in e-commerce search, users' attention to titles or images varies across products. For example, users pay more attention to images of clothes, whereas on electronic products, users care more about key properties described in titles, such as memory size. Second, sellers often apply search engine optimization (SEO) techniques to improve the matching probabilities and rankings of their products. As shown in Table 1, user queries are usually short and brief, while product titles tend to be long and concrete. The semantic imbalance between user queries and products is a big challenge for the retrieval task.

To handle the first problem, we propose a Modal Adaptation module that performs cross-modal fusion by introducing user queries as contextual information and assigns reasonable attention to product titles and images. To address the second issue, we design an independent text encoder to process user queries. We further design a Keyword Enhancement mechanism that jointly optimizes similar positive queries, in order to enrich the semantic information and learn better user query embeddings.

To summarize, our main contributions are as follows:

  • We propose a novel vision-language pre-training method (referred to as MAKE) tailored for the text-to-multimodal retrieval task in e-commerce search. Trained on a large-scale (query, title, image) triplet dataset from online logs of Taobao Search, MAKE is capable of effective and efficient text-to-multimodal retrieval.

  • We propose a Modal Adaptation module that learns appropriate attention weights on product titles and images by introducing user queries as the context. The module leads to stronger representation power of product embeddings.

  • We propose a Keyword Enhancement mechanism to enhance the query embeddings by jointly training similar user queries. The mechanism significantly alleviates the semantic imbalance between user queries and product titles.

  • Extensive experiments on offline datasets and online A/B tests demonstrate that MAKE outperforms existing V+L pre-training methods on the e-commerce text-to-multimodal retrieval task. Our method has been deployed on Taobao Search and serves hundreds of millions of users every day.

2. The Proposed Approach

Figure 1. Overview of our pre-trained model MAKE.

2.1. Vision-Language Pre-training Model

Model Structure.

Different from existing V+L methods with a text encoder and an image encoder, our pre-training model has a three-tower structure with a query encoder, a title encoder, and an image encoder, as shown in Figure 1. Each encoder consists of 6 transformer blocks (each containing a multi-head self-attention layer and a feed-forward layer) (Vaswani et al., 2017). The two text encoders are initialized with the StructBERT (Wang et al., 2020) model (pre-trained on a Chinese e-commerce corpus), and the image encoder is initialized with the ImageBERT (Qi et al., 2020) model (pre-trained on a Chinese corpus and corresponding images (Qiu et al., 2021)). The embeddings of products are the combination of outputs from the title encoder and the image encoder.
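For concreteness, the following is a minimal PyTorch sketch of the three-tower layout described above. It is an illustration only: the class names, hidden size, and the way title and image outputs are combined are assumptions, not the exact production implementation.

```python
import torch
import torch.nn as nn


class TransformerTower(nn.Module):
    """A stack of 6 transformer blocks (multi-head self-attention + feed-forward)."""

    def __init__(self, hidden=768, heads=12, layers=6):
        super().__init__()
        block = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)

    def forward(self, x, pad_mask=None):
        return self.encoder(x, src_key_padding_mask=pad_mask)


class ThreeTowerModel(nn.Module):
    """Separate query / title / image encoders, as in Figure 1 (illustrative)."""

    def __init__(self, hidden=768):
        super().__init__()
        self.query_encoder = TransformerTower(hidden)
        self.title_encoder = TransformerTower(hidden)
        self.image_encoder = TransformerTower(hidden)

    def forward(self, query_tokens, title_tokens, image_patches):
        q = self.query_encoder(query_tokens)    # (B, Lq, H)
        t = self.title_encoder(title_tokens)    # (B, Lt, H)
        i = self.image_encoder(image_patches)   # (B, Li, H)
        # Product tokens combine the title and image encoder outputs.
        return q, torch.cat([t, i], dim=1)
```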

Model Inputs.

The text modality (user queries and product titles) is pre-processed in the same way as BERT (Devlin et al., 2019). We adopt the Chinese vocabulary provided by EasyTransfer (Qiu et al., 2021). For the product images, we split each image into 4x4 patches and apply ResNet (He et al., 2016) as a backbone network to extract a sequence of 2048-D patch features. Segmentation marks Q, T, and I are used to distinguish the token sequences of user queries, product titles, and images.
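As a rough illustration (not the production pipeline), the sketch below splits an image into a 4x4 grid and extracts one 2048-D feature per patch with a ResNet-50 backbone; the choice of ResNet-50 and the torchvision API usage are assumptions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Assumed backbone: ResNet-50 without its classification head, used in eval mode.
resnet = models.resnet50(weights=None)
backbone = nn.Sequential(*list(resnet.children())[:-1]).eval()


@torch.no_grad()
def image_to_patch_sequence(img: torch.Tensor) -> torch.Tensor:
    """img: (3, H, W) float tensor -> (16, 2048) patch-feature sequence."""
    _, h, w = img.shape
    ph, pw = h // 4, w // 4
    patches = [
        img[:, r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(4)
        for c in range(4)
    ]
    feats = backbone(torch.stack(patches))  # (16, 2048, 1, 1) after global pooling
    return feats.flatten(1)                 # (16, 2048)
```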

Self-Supervised Pre-training Objective.

Pre-training with self-supervised tasks (Devlin et al., 2019; Yu et al., 2022) has proved effective on many downstream tasks. Following FashionBERT (Gao et al., 2020), we apply two self-supervised tasks: Masked Language Modeling (MLM) and Masked Patch Modeling (MPM). Refer to (Gao et al., 2020) for detailed information.

Query Encoder.

To start with, we pre-train ALIGN (Li et al., 2021b) on image-text pairs from Taobao, applying the same text encoder to user queries and product titles. Nevertheless, we observe sub-optimal performance on the query-to-product retrieval task. We find that in e-commerce search, it is common for sellers to apply search engine optimization (SEO) techniques to improve the rankings of their products. As a result, product titles are usually stuffed with keywords and often not grammatically well-formed. On the contrary, users tend to type short queries into the search engine. Hence, there is a non-trivial imbalance between queries and titles, and we therefore use a separate text encoder for user queries. Besides, in V+L pre-training models (Li et al., 2021b; Qi et al., 2020), Image-Text Matching (ITM) is widely used to improve performance on downstream retrieval tasks. Similar to ITM, we adopt a Query-Product Matching (QPM) loss with in-batch negative sampling to optimize the embeddings of queries and products.

(1) \mathcal{L}_{QPM}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(\boldsymbol{u}_{i}^{T}\boldsymbol{v}_{i}/\tau)}{\sum_{j=1}^{N}\exp(\boldsymbol{u}_{i}^{T}\boldsymbol{v}_{j}/\tau)},

where $\boldsymbol{u}$ and $\boldsymbol{v}$ are the normalized embeddings of queries and products, $N$ is the batch size, and $\tau$ is the temperature parameter.
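A minimal PyTorch sketch of this in-batch contrastive objective follows; the temperature value is illustrative, and the embeddings are assumed to be L2-normalized with the i-th query matching the i-th product.

```python
import torch
import torch.nn.functional as F


def qpm_loss(query_emb: torch.Tensor, product_emb: torch.Tensor, tau: float = 0.07):
    """QPM loss of Eq. (1): in-batch softmax over query-product similarities.

    query_emb, product_emb: (N, D) L2-normalized embeddings; row i of each
    tensor forms a positive pair, all other rows act as in-batch negatives.
    """
    logits = query_emb @ product_emb.t() / tau                 # (N, N) similarities
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)                    # -1/N * sum log softmax
```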

2.2. Modal Adaptation Module

In e-commerce, it is obvious that the importance of titles and images varies across different products. For instance, on clothes, users pay more attention to images, whereas on electronic products, users care more about key properties described in titles, such as memory size. Hence, the two modalities should be fused with proper weights for different products. However, we observe that with separate text and image encoders, the pre-training model focuses evenly on the text and image modalities. We believe that the lack of cross-modal fusion prevents the network from learning better representations.

Although Yu et al. (Yu et al., 2022) design a multimodal fusion encoder on top of the text/image encoders, they ignore user intentions. Therefore, by introducing user queries as contextual information, we propose a novel Modal Adaptation (MA) module to conduct modal fusion and optimize the overall representations, as shown in Figure 1. The module stacks two layers of a sub-module composed of a self-attention layer, a cross-attention layer, and a feed-forward layer, and takes the outputs of the query encoder, title encoder, and image encoder as inputs. The self-attention layer takes the outputs of the title encoder and the image encoder as keys and values, while the cross-attention layer takes the outputs of the query encoder as queries. With the Modal Adaptation module, product embeddings not only fuse the two modalities (title + image) with product-specific weights but also take the influence of user queries into account.
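The sketch below illustrates one such sub-module in PyTorch. The residual connections, layer-normalization placement, and feed-forward width are assumptions; in MAKE, two such layers are stacked and the resulting [CLS] output feeds the QPC head described next.

```python
import torch.nn as nn


class ModalAdaptationBlock(nn.Module):
    """One MA sub-module: self-attention over product (title + image) tokens,
    cross-attention conditioned on query-encoder outputs, then a feed-forward
    layer. Structural details here are illustrative assumptions."""

    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden)
        )
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.norm3 = nn.LayerNorm(hidden)

    def forward(self, query_tokens, product_tokens):
        # Self-attention over the concatenated title + image tokens.
        p = self.norm1(
            product_tokens
            + self.self_attn(product_tokens, product_tokens, product_tokens)[0]
        )
        # Cross-attention: Q comes from the query encoder outputs,
        # K/V from the self-attended product tokens.
        q = self.norm2(query_tokens + self.cross_attn(query_tokens, p, p)[0])
        return self.norm3(q + self.ffn(q))
```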

We also design a Query-Product Classification (QPC) loss for the MA module. Different from the QPM loss, which optimizes the similarity between separate embeddings, the QPC loss refines the joint representation of query-product pairs. The [CLS] outputs (representations of the whole sequence) of the MA module, followed by a fully-connected layer and a sigmoid function, are used in a two-class classification task. We construct negative query-product pairs by choosing, for each query, the in-batch negative with the maximum similarity. These similarities are already computed for the QPM loss and thus bring little extra computational cost. The QPC loss is presented as:

(2) \mathcal{L}_{\mathrm{QPC}}=\mathbb{E}_{(Q,T,I)\sim D}\,\mathrm{H}(\boldsymbol{p}_{\mathrm{QPC}}(Q,T,I),\boldsymbol{y}_{\mathrm{QPC}}),

where $\boldsymbol{p}_{\mathrm{QPC}}$ is the predicted classification probability, $\boldsymbol{y}_{\mathrm{QPC}}$ is the 0-1 ground-truth label, and $\mathrm{H}$ is the cross-entropy loss.
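The sketch below shows, under assumed tensor layouts, how the hardest in-batch negative can be selected from the similarity matrix already computed for the QPM loss and how the binary QPC objective can be formed from [CLS] logits.

```python
import torch
import torch.nn.functional as F


def hardest_inbatch_negatives(sim_matrix: torch.Tensor) -> torch.Tensor:
    """For each query (row), pick the most similar non-matching product,
    reusing the (N, N) similarity matrix computed for the QPM loss."""
    sim = sim_matrix.clone()
    sim.fill_diagonal_(float("-inf"))   # exclude the positive pair on the diagonal
    return sim.argmax(dim=1)            # indices of hard negative products


def qpc_loss(pos_logits: torch.Tensor, neg_logits: torch.Tensor) -> torch.Tensor:
    """Two-class QPC loss (Eq. 2) on [CLS] logits of matched / mismatched pairs."""
    logits = torch.cat([pos_logits, neg_logits])
    labels = torch.cat([torch.ones_like(pos_logits), torch.zeros_like(neg_logits)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```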

2.3. Keyword Enhancement Mechanism

As mentioned above, the QPM loss (Eq. 1) relies on the in-batch negative sampling (IBNS) mechanism adopted in ALIGN (Li et al., 2021b), which introduces another significant issue for our model. Unlike typical academic benchmarks, multiple user queries may be relevant to the same product. With the IBNS mechanism, those similar user queries are mistakenly treated as negative samples and thus compromise the query embeddings. Therefore, to solve this false-negative issue of IBNS, we propose a Keyword Enhancement (KE) mechanism to replace it. The proposed mechanism improves the representation learning of user queries by jointly optimizing queries related to the same product.

Instead of query-product pairs, a product with several related queries (product, $\textit{query}_{1}$, ..., $\textit{query}_{M}$) collected from Taobao Search logs is grouped as one training sample, where $M$ is the number of enhanced queries and is set to 5 in this paper. In addition, we design a new QPM loss based on the circle loss (Sun et al., 2020) and the KE mechanism:

(3) \mathcal{L}_{QPM}^{KE}=\log\Big(1+\sum_{j=1}^{N}\exp\big(\gamma(s_{neg}^{j}+\theta)\big)\sum_{m=1}^{M}\exp\big(-\gamma s_{pos}^{m}\big)\Big),

where $s(\cdot)=\boldsymbol{u}^{T}\boldsymbol{v}-\log\boldsymbol{p}$ measures the inner-product similarity between the embeddings of queries and products. Following sampled softmax (Jean et al., 2014), the $-\log\boldsymbol{p}$ term is the expected frequency of products, with which we prevent the model from focusing too much on popular products. $N$ is the batch size and $\gamma$ is the scaling factor. The hyper-parameter $\theta$ constrains the lower bound of the similarity difference between positive pairs and negative pairs. With the Keyword Enhancement mechanism, we solve the false-negative issue and narrow the distance between embeddings of similar queries.
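A minimal sketch of Eq. (3) for a single product, assuming the similarities already include the $-\log\boldsymbol{p}$ correction; the default $\gamma$ and $\theta$ values are illustrative, not the tuned ones.

```python
import torch


def qpm_ke_loss(s_pos: torch.Tensor, s_neg: torch.Tensor,
                gamma: float = 32.0, theta: float = 0.1) -> torch.Tensor:
    """Keyword-Enhanced QPM loss of Eq. (3) for one product.

    s_pos: (M,) similarities between the product and its M related queries.
    s_neg: (N,) similarities of the in-batch negative query-product pairs.
    """
    neg_term = torch.exp(gamma * (s_neg + theta)).sum()   # push negatives down
    pos_term = torch.exp(-gamma * s_pos).sum()            # pull all positives up
    return torch.log1p(neg_term * pos_term)
```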

Finally, the pre-trained model is optimized as below:

(4) \mathcal{L}=\mathcal{L}^{Q}_{MLM}+\mathcal{L}^{T}_{MLM}+\mathcal{L}^{I}_{MPM}+\mathcal{L}_{QPC}+\mathcal{L}_{QPM}^{KE}.

3. Experiments

3.1. Datasets, Implementations, and Metrics

Large-scale Industrial Dataset. We collect online click logs with user queries, product titles, and images from Taobao Search. The training set contains billions of samples, and we randomly choose 1.5 million search logs as the evaluation set.

Model Implementation. The pre-training model is composed of three encoders, each with 6 transformer layers (Devlin et al., 2019). Each layer has 768 hidden units and 12 self-attention heads. We pre-train the model for 10 epochs with a batch size of 1280 on 50 NVIDIA P100 GPUs. We apply an Adam optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.98$. The learning rate is warmed up to 1e-4 in the first 2000 iterations and then decays to 0 following a linear schedule.
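The optimizer and learning-rate schedule described above can be sketched as follows; the total number of training steps and the scheduler construction are illustrative assumptions.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR


def build_optimizer(params, total_steps: int, warmup_steps: int = 2000,
                    peak_lr: float = 1e-4):
    """Adam with linear warmup to peak_lr over warmup_steps, then linear decay to 0."""
    optimizer = Adam(params, lr=peak_lr, betas=(0.9, 0.98))

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)                            # warmup
        remaining = total_steps - step
        return max(0.0, remaining / max(1, total_steps - warmup_steps))  # linear decay

    return optimizer, LambdaLR(optimizer, lr_lambda)
```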

Online Serving. We compute embeddings of all products in Taobao with the title encoder and the image encoder. Then we adopt Proxima (Damo Academy, 2021), an ANN (approximate nearest neighbor) framework, to build indexes of product embeddings with the HC (hierarchical clustering) algorithm. Upon receiving a user request, the online query encoder encodes the user query into an embedding, which is used to retrieve the top-K relevant products from the ANN index. The model is updated on a weekly basis.
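Proxima is an Alibaba-internal ANN framework, so the sketch below uses FAISS with a clustering-based (IVF) index purely to illustrate the offline index-building and online top-K query flow; it is not the production setup.

```python
import faiss
import numpy as np


def build_index(product_emb: np.ndarray, nlist: int = 1024) -> faiss.Index:
    """Cluster and index product embeddings (float32 array of shape (num_products, d))."""
    d = product_emb.shape[1]
    quantizer = faiss.IndexFlatIP(d)
    index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(product_emb)   # learn the coarse clustering
    index.add(product_emb)     # add all product embeddings
    return index


def retrieve(index: faiss.Index, query_emb: np.ndarray, k: int = 100):
    """Return top-K inner-product scores and product ids for each query embedding."""
    scores, ids = index.search(query_emb, k)
    return scores, ids
```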

Offline Evaluation Metrics. The retrieval set is denoted as $R=\{p_{1},\ldots,p_{K}\}$. The clicked products from the evaluation set are denoted as the target set $T$. We use the metrics $P_{rel}$ and $P_{cate}$, which measure the relevance rate of the retrieval set $R$ according to a well-trained relevance model (Yao et al., 2021) (whose AUC on human-labeled data is 0.92). The former focuses on the overall relevance, while the latter compares the category predicted from the user query to the category of each retrieved product.

(5) P_{rel}=\frac{1}{NN_{R_{i}}}\sum_{i=1}^{N}\sum_{j=1}^{N_{R_{i}}}f(q_{i},p_{i,j}),\quad P_{cate}=\frac{1}{NN_{R_{i}}}\sum_{i=1}^{N}\sum_{j=1}^{N_{R_{i}}}\mathbb{I}(f_{c}(q_{i})=c_{i,j}),

where $f(\cdot,\cdot)\in[0,1]$ denotes the prediction of the relevance model and $f_{c}(\cdot)$ returns the category predicted from a query. $N$ is the size of the evaluation dataset and $N_{R_{i}}$ is the size of the retrieval set $R_{i}$. $c_{i,j}$ is the category of the retrieved product $p_{i,j}$. We also apply a Recall@K metric to evaluate the retrieval performance, computed as:

(6) \text{Recall@K}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}(\exists t\mid t\in R_{i,K}\wedge t\in T_{i}),

where $\mathbb{I}(\cdot)$ is the indicator function.
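For illustration, Recall@K as defined in Eq. (6) can be computed as below, assuming retrieved and clicked products are given as sets of product ids per query.

```python
from typing import List, Set


def recall_at_k(retrieved: List[Set[int]], targets: List[Set[int]]) -> float:
    """Eq. (6): fraction of queries whose top-K retrieval set contains at least
    one clicked (target) product."""
    hits = sum(1 for r, t in zip(retrieved, targets) if r & t)
    return hits / len(retrieved)
```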

Online Evaluation Metrics. We use the number of transactions (denoted as #Trans) and GMV (Gross Merchandise Volume, the total value of sales) as online evaluation metrics. For users with few recorded purchasing behaviors, these two metrics are denoted as $\text{\#Trans}_{n}$ and $\text{GMV}_{n}$, respectively.

3.2. Offline Experimental Results

Figure 2. Attention weights on titles and images across different product categories: MAKE vs. MAKE w/o MA.

3.2.1. Comparison with Baseline Methods

We adopt CLIP (Radford et al., 2021), FashionBERT (Gao et al., 2020), and CommerceMM (Yu et al., 2022) as strong baseline pre-training models. The latter two are proposed for downstream tasks in the e-commerce scenario. FashionBERT (Gao et al., 2020) uses a single encoder, while CLIP (Radford et al., 2021) and CommerceMM (Yu et al., 2022) have a two-tower structure. All baseline methods are pre-trained on the same training dataset. As shown in Table 2, our proposed method MAKE outperforms all baseline methods on the text-to-multimodal retrieval task of Taobao Search.

3.2.2. Modal Adaptation Module.

The MA module, associated with the Query-Product Classification (QPC) task, is proposed to conduct modal fusion and learn appropriate attention on the text and image modalities across different products. The comparison between MAKE w/o MA and MAKE reveals that the MA module significantly improves the relevance and the recall hitrate by 1.88% and 4.59%, respectively. To further evaluate the effect of the MA module, we collect attention weights on titles/images across different product categories from MAKE and MAKE w/o MA, as shown in Figure 2. After introducing the MA module, for vision-dominant categories, the network pays more attention to images (79.4% for ornaments), while for text-dominant categories, the network focuses more on titles (67.5% for laptops). With the MA module, the model assigns proper attention weights to the text and image modalities.

Methods | $P_{rel}$ ↑ | $P_{cate}$ ↑ | Recall@K ↑
FashionBERT (Gao et al., 2020) | 0.8385 | 0.8190 | 0.3867
CLIP (Radford et al., 2021) | 0.8648 | 0.8423 | 0.4675
CommerceMM (Yu et al., 2022) | 0.8710 | 0.8653 | 0.4937
MAKE | 0.9014 | 0.9295 | 0.6088
MAKE w/o MA | 0.8826 | 0.8910 | 0.5629
MAKE w/o KE | 0.8922 | 0.8803 | 0.5781
Table 2. Offline experimental results and ablation studies.

3.2.3. Keyword Enhancement Module

The KE mechanism is a modified negative sampling mechanism paired with the QPM loss. It encourages the model to jointly optimize similar queries and avoids the false-negative sampling of the widely used IBNS mechanism. Comparing MAKE with MAKE w/o KE shows that the KE mechanism effectively strengthens the query representations by reducing the distance between similar queries and the same product. The KE mechanism also helps alleviate the semantic imbalance between user queries and product titles.

3.3. Online A/B Tests

We deploy our pre-training method MAKE on Taobao Search, where it provides relevant candidates alongside the prior three-channel retrieval system consisting of embedding-based learning, collaborative filtering, and inverted-index matching. As shown in Table 3, our method outperforms the prior retrieval system by improving the overall relevance of product candidates by +2.20%. We also report 14-day average online improvements of MAKE on GMV and #Trans. As shown in Table 3, our proposed method improves GMV and #Trans by 0.79% and 0.37%, respectively. Considering the large number of transactions in Taobao Search, MAKE facilitates hundreds of thousands of additional transactions per day. Besides, MAKE obtains larger gains (2.01% on GMV and 1.58% on #Trans) on inactive and new users, who amount to tens of millions of users per day. Compared to these significant gains, the additional computational cost is negligible (2 ms). These results demonstrate that MAKE significantly improves the overall efficiency of Taobao Search.

Methods | GMV | #Trans | $\text{GMV}_{n}$ | $\text{\#Trans}_{n}$ | $P_{rel}$ | Time Cost
MAKE | +0.79% | +0.37% | +2.01% | +1.58% | +2.20% | +2 ms
Table 3. Online A/B tests of MAKE.

4. Conclusion

In this paper, we propose a novel vision-language pre-training method (MAKE) with a three-encoder structure tailored for the text-to-multimodal retrieval task of Taobao Search. We propose a Modal Adaptation module to perform cross-modal fusion and learn effective product representations. We further design a Keyword Enhancement mechanism to address the semantic imbalance and the false-negative sampling issue, thereby improving query representations. Offline ablation studies and online A/B tests verify the effectiveness of our method MAKE. We have deployed MAKE online to serve hundreds of millions of users every day, which greatly improves online transaction efficiency in e-commerce.

References

  • Chang et al. (2021) Wei-Cheng Chang, Daniel Jiang, Hsiang-Fu Yu, Choon Hui Teo, Jiong Zhang, Kai Zhong, Kedarnath Kolluri, Qie Hu, Nikhil Shandilya, Vyacheslav Ievgrafov, et al. 2021. Extreme multi-label learning for semantic matching in product search. In SIGKDD. 2643–2651.
  • Damo Academy (2021) Alibaba Group Damo Academy. 2021. http://proxima.alibaba-inc.com/.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT (1).
  • Gao et al. (2020) Dehong Gao, Linbo Jin, Ben Chen, Minghui Qiu, Peng Li, Yi Wei, Yi Hu, and Hao Wang. 2020. Fashionbert: Text and image matching with adaptive loss for cross-modal retrieval. In SIGIR. 2251–2260.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
  • Jean et al. (2014) Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. arXiv preprint arXiv:1412.2007 (2014).
  • Li et al. (2021b) Junnan Li, Ramprasaath R Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, and Steven Hoi. 2021b. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS.
  • Li et al. (2021a) Sen Li, Fuyu Lv, Taiwei Jin, Guli Lin, Keping Yang, Xiaoyi Zeng, Xiao-Ming Wu, and Qianli Ma. 2021a. Embedding-Based Product Retrieval in Taobao Search. In SIGKDD. 3181–3189.
  • Nigam et al. (2019) Priyanka Nigam, Yiwei Song, Vijai Mohan, Vihan Lakshman, Weitian Ding, Ankit Shingavi, Choon Hui Teo, Hao Gu, and Bing Yin. 2019. Semantic product search. In SIGKDD. 2876–2885.
  • Qi et al. (2020) Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. 2020. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966 (2020).
  • Qiu et al. (2021) Minghui Qiu, Peng Li, Chengyu Wang, Haojie Pan, Ang Wang, Cen Chen, Xianyan Jia, Yaliang Li, Jun Huang, Deng Cai, et al. 2021. EasyTransfer: A Simple and Scalable Deep Transfer Learning Platform for NLP Applications. In CIKM. 4075–4084.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. In ICML, Vol. 139. 8748–8763.
  • Robertson and Zaragoza (2009) Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc.
  • Sun et al. (2020) Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. 2020. Circle Loss: A Unified Perspective of Pair Similarity Optimization. In CVPR. 6398–6407.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS. 5998–6008.
  • Wang et al. (2020) Wei Wang, Bin Bi, Ming Yan, Chen Wu, Jiangnan Xia, Zuyi Bao, Liwei Peng, and Luo Si. 2020. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding. In ICLR.
  • Xiao et al. (2019) Rong Xiao, Jianhui Ji, Baoliang Cui, Haihong Tang, Wenwu Ou, Yanghua Xiao, Jiwei Tan, and Xuan Ju. 2019. Weakly supervised co-training of query rewriting and semantic matching for e-commerce. In WSDM. 402–410.
  • Yao et al. (2021) Shaowei Yao, Jiwei Tan, Xi Chen, Keping Yang, Rong Xiao, Hongbo Deng, and Xiaojun Wan. 2021. Learning a Product Relevance Model from Click-Through Data in E-Commerce. In WWW. 2890–2899.
  • Yu et al. (2022) Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao Wang, Yu Chen, Tamara L Berg, and Ning Zhang. 2022. CommerceMM: Large-scale Commerce Multimodal Representation Learning with Omni Retrieval. In SIGKDD. 4433–4442.
  • Zhang et al. (2021) Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In CVPR. 5579–5588.
  • Zhuge et al. (2021) Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Linbo Jin, Ben Chen, Haoming Zhou, Minghui Qiu, and Ling Shao. 2021. Kaleido-BERT: Vision-Language Pre-training on Fashion Domain. In CVPR. 12647–12657.
  • Zobel and Moffat (2006) Justin Zobel and Alistair Moffat. 2006. Inverted files for text search engines. ACM computing surveys (CSUR) 38, 2 (2006), 6–es.