
Modality-Balanced Embedding for Video Retrieval

  Xun Wang1   Bingqing Ke1    Xuanping Li1    Fangyu Liu2
Mingyu Zhang1    Xiao Liang1    Qiushi Xiao1    Cheng Luo1     Yue Yu1
1 Kuaishou  2 University of Cambridge
(2022)
Abstract.

Video search has become the main routine for users to discover videos relevant to a text query on large short-video sharing platforms. When training a query-video bi-encoder model on online search logs, we identify a modality bias phenomenon: the video encoder relies almost entirely on text matching, neglecting other modalities of the videos such as vision and audio. This modality imbalance results from a) a modality gap: the relevance between a query and a video's text is much easier to learn, as the query is also a piece of text with the same modality as the video text; and b) data bias: most training samples can be solved solely by text matching. Here we share our practices for improving the first retrieval stage, including our solution to the modality imbalance issue. We propose MBVR (short for Modality Balanced Video Retrieval) with two key components: manually generated modality-shuffled (MS) samples and a dynamic margin (DM) based on visual relevance. They encourage the video encoder to pay balanced attention to each modality. Through extensive experiments on a real-world dataset, we show empirically that our method is both effective and efficient in solving the modality bias problem. We have also deployed MBVR on a large video platform and observed a statistically significant boost over a highly optimized baseline in an A/B test and manual GSB evaluations.

video retrieval; modality-shuffled negatives; dynamic margin
journalyear: 2022; copyright: acmlicensed; conference: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 11–15, 2022, Madrid, Spain; booktitle: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’22), July 11–15, 2022, Madrid, Spain; price: 15.00; doi: 10.1145/3477495.3531899; isbn: 978-1-4503-8732-3/22/07; ccs: Information systems → Video search; ccs: Computing methodologies → Neural networks

1. Introduction

Video search, which aims to find the videos relevant to a query among billions of videos, is essential to video-sharing platforms (e.g., TikTok, Likee, and Kuaishou). To be efficient, most video search systems adopt a multi-stage pipeline that gradually shrinks the number of candidates. The first stage, known as retrieval, efficiently recalls thousands of candidates from billions and determines the upper bound of the overall performance of a search engine. The subsequent pre-ranking stages further shrink the candidates to the order of hundreds, and the final ranking server then scores and selects the videos to display to users. In this paper, we focus on improving the retrieval stage of video search with multimodal embedding learning.

With the recent development of embedding learning (Bengio et al., 2013) and pre-trained language models (Liu et al., 2019; Devlin et al., 2019; Zhan et al., 2020), embedding-based retrieval approaches have obtained promising results in web (i.e., document) retrieval (Liu et al., 2021b; Huang et al., 2013; Guu et al., 2020; Karpukhin et al., 2020; Zhan et al., 2021; Khattab and Zaharia, 2020) and product search (Li et al., 2021; Zhang et al., 2020). Most of them adopt a bi-encoder architecture and are trained on labeled data or online logs. However, when training a query-video bi-encoder with online search logs, we have identified a troubling modality imbalance phenomenon: a video's embedding relies overly on its associated text content, neglecting its visual information. Such models falsely recall videos that match the query only textually but have irrelevant visual content. Note that on video-sharing platforms, a video is usually composed of multiple modalities including video frames, audio, text (e.g., title and hashtags), etc. In this paper, we select two modalities to represent the video content: the text modality from the title, banners, and hashtags, and the vision modality from a key frame or the cover. (Other modalities, such as audio, are dropped here because their information is either extremely noisy or negligible in our scenario.)

This modality imbalance phenomenon results from: 1) modality gap: both the query and the video text are of the same textual modality, so their relevance is easier to grasp than that of the video's other modalities; 2) data bias: current search engines are mostly based on text matching, at the lexical or semantic level, so the online search logs used as training data are heavily biased towards examples with high query-text similarities.

Recent research in video retrieval mostly focuses on designing more sophisticated architectures (Sun et al., 2019; Lei et al., 2021; Huang et al., 2020b; Liu et al., 2021a) or stronger cross-modal fusion operations (Gabeur et al., 2020; Qu et al., 2021; Xu et al., 2021). These methods require large-scale clean training data and heavy computational resources, making them suitable only for specific settings. Moreover, in a real-world scenario with modality-biased data such as video search logs, they unavoidably suffer from the modality imbalance problem.

To bridge this gap, this paper offers a feasible solution, named MBVR, for learning modality-balanced video embeddings from noisy search logs. MBVR is a bi-encoder framework with two key components, illustrated in Fig. 1.

Modality-Shuffled negatives. To correct the modality imbalance bias, we generate novel modality-shuffled (MS) negative samples that train the model adversarially. An MS negative consists of a relevant text and an irrelevant video frame w.r.t. a query. MS negatives can be mistakenly ranked at the top if a model overly relies on a single modality (see Fig. 2(b)). We add an additional objective to explicitly penalize wrongly ranked MS negatives.

Dynamic margin. We further enhance the model with a margin that changes dynamically w.r.t. the visual relevance. The dynamic margin amplifies the loss for positive query-video pairs that have both related text and related visual content. Thus, a model trained with the dynamic margin pulls visually relevant videos closer to the query.

We conduct extensive offline experiments and an ablation study on the key components to validate the effectiveness of MBVR over a strong baseline and recent methods related to modality balancing (Kim et al., 2020; Lamb et al., 2019). Furthermore, we conduct an online A/B test and GSB evaluations on a large video-sharing platform to show that MBVR improves the relevance level and users' satisfaction of video search.

Figure 1. A graphical illustration of MBVR.

2. MBVR

In this section, we first introduce the model architecture and the training of a strong baseline model. We then illustrate the modality imbalance issue with a statistical analysis, and introduce our MBVR with generated negatives and the dynamic margin.

Figure 2. (a) Distribution of $R_{vt}$ (the ratio of the vision modality influence to the text modality influence) for the base model and MBVR. (b) Similarity scores between the queries and the positives/MS negatives for the base model and (c) those for MBVR.

2.1. Model Architecture

Our model architecture follows the popular two-tower (i.e., bi-encoder) formulation, as in (Huang et al., 2013; Li et al., 2021; Zhang et al., 2020; Liu et al., 2021b; Zhan et al., 2021; Liu et al., 2021c), with a transformer-based text encoder for the query and a multimodal encoder for the video.

Query encoder $\mathcal{F}_{q}$ can be summarised as RBT3+average+FC. We utilize the RoBERTa (Liu et al., 2019) model with three transformer layers as the backbone and use average+FC (i.e., average pooling followed by a fully connected layer) to compress the final token embeddings to $d$ dimensions ($d=64$). (We also tried larger BERT encoders with 6 and 12 transformer layers, but such larger models bring only negligible gains at a much heavier computational cost, so we choose the 3-layer RoBERTa as our text encoder.)

Multimodal encoder $\mathcal{F}_{m}$ consists of a text encoder $\mathcal{F}_{t}$, a vision encoder $\mathcal{F}_{v}$, and a fusion module $\mathcal{H}$. For a video sample $m$, its embedding is computed as

(1) $\mathcal{F}_{m}(m)=\mathcal{F}_{m}(t,v)=\mathcal{H}(\mathcal{F}_{t}(t),\,\mathcal{F}_{v}(v))$,

where $t$ is the text input and $v$ is the vision input. The text encoder $\mathcal{F}_{t}$ shares weights with $\mathcal{F}_{q}$. (Such weight sharing brings several benefits, e.g., reducing model parameters to save memory and computational cost, and introducing a prior query-text matching relation to regularize the training (Firat et al., 2016; Xia et al., 2018; Liu et al., 2021b).) The vision encoder $\mathcal{F}_{v}$ adopts the classical ResNet-50 (He et al., 2016) network. For the fusion module $\mathcal{H}$, we adopt multi-head self-attention (MSA) (Vaswani et al., 2017) to dynamically integrate the two modalities and aggregate the outputs of MSA with average pooling. We have also tried other feature fusion operations (e.g., direct addition and concatenation-MLP) and found that MSA works best.
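For concreteness, a minimal PyTorch sketch of such a fusion module is given below. It assumes 64-dimensional modality embeddings; the number of attention heads is our own choice, as it is not specified above.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Sketch of the fusion module H: multi-head self-attention over the two
    modality embeddings, aggregated by average pooling (dims/heads assumed)."""
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.msa = nn.MultiheadAttention(embed_dim=dim, num_heads=num_heads,
                                         batch_first=True)

    def forward(self, text_emb, vision_emb):
        # Treat the two modality embeddings as a length-2 token sequence: (B, 2, d)
        tokens = torch.stack([text_emb, vision_emb], dim=1)
        attn_out, _ = self.msa(tokens, tokens, tokens)
        # Average pooling over the two attended tokens -> (B, d) video embedding
        return attn_out.mean(dim=1)
```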

2.2. Base Model

Most existing works, e.g., (Huang et al., 2013; Liu et al., 2021b; Huang et al., 2020a), train bi-encoders with the approximated query-to-document retrieval objective. Specifically, given a query $q$, its relevant videos $\mathcal{M}^{+}_{q}$, and its irrelevant videos $\mathcal{M}^{-}_{q}$, the query-to-document objective is as below:

(2) $\mathcal{L}_{qm}=-\log\Big(\frac{\exp(s(q,m)/\tau)}{\exp(s(q,m)/\tau)+\sum_{\hat{m}\in\mathcal{M}^{-}_{q}}\exp(s(q,\hat{m})/\tau)}\Big)$,

where $\tau$ is a temperature hyper-parameter set to 0.07, and $s(q,m)$ is the cosine similarity of a query-video pair $(q,m)$, i.e., $s(q,m)=\cos\langle\mathcal{F}_{q}(q),\mathcal{F}_{m}(m)\rangle$. Notably, here we adopt in-batch random negative sampling, which means $\mathcal{M}^{-}_{q}$ consists of all the videos in the current mini-batch except the positive sample $m$.

Following recent works (Liu et al., 2021c; Liu et al., 2021a) that add a reversed document-to-query retrieval loss, we formulate the corresponding $\mathcal{L}_{mq}$ as below:

(3) $\mathcal{L}_{mq}=-\log\Big(\frac{\exp(s(q,m)/\tau)}{\exp(s(q,m)/\tau)+\sum_{\hat{q}\in\mathcal{Q}^{-}_{m}}\exp(s(\hat{q},m)/\tau)}\Big)$,

where $\mathcal{Q}^{-}_{m}$ denotes the irrelevant queries of the video $m$, i.e., all the queries in the current mini-batch except $q$.

The sum of the query-to-document loss and the reversed document-to-query loss results in the main bidirectional objective:

(4) $\mathcal{L}_{bi}=\mathcal{L}_{qm}+\mathcal{L}_{mq}$.
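A minimal sketch of this bidirectional in-batch objective (Eqs. 2-4) is shown below. It assumes the query and video embeddings are L2-normalized so that dot products equal cosine similarities, and that the i-th query is paired with the i-th video in the mini-batch; names are illustrative.

```python
import torch
import torch.nn.functional as F

def bidirectional_inbatch_loss(q_emb, m_emb, tau=0.07):
    """L_bi = L_qm + L_mq with in-batch random negatives.
    q_emb, m_emb: (B, d) L2-normalized embeddings of aligned query-video pairs."""
    sim = q_emb @ m_emb.t() / tau                 # (B, B) scaled cosine similarities
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_qm = F.cross_entropy(sim, targets)       # query -> video (Eq. 2)
    loss_mq = F.cross_entropy(sim.t(), targets)   # video -> query (Eq. 3)
    return loss_qm + loss_mq
```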

Besides the above bidirectional objective, which optimizes the relevance between the query embedding and the video's multimodal embedding, we also add auxiliary tasks $\mathcal{L}_{t}$ and $\mathcal{L}_{v}$ that optimize, with similar formulations, the relevance between the query and the video's text modality and vision modality, respectively. The whole objective $\mathcal{L}_{base}$ for the base model is:

(5) $\mathcal{L}_{base}=\mathcal{L}_{bi}+\alpha\mathcal{L}_{v}+\beta\mathcal{L}_{t}$,

where $\alpha=\beta=0.1$ are the weight hyper-parameters.

2.3. Statistical Analysis of Modality Imbalance

To identify the modality imbalance, we define an indicator $R_{vt}$ as below,

(6) $R_{vt}=\frac{\cos\langle\mathcal{F}_{v}(v),\mathcal{F}_{m}(m)\rangle}{\cos\langle\mathcal{F}_{t}(t),\mathcal{F}_{m}(m)\rangle}$,

where the definitions of $\mathcal{F}_{m},\mathcal{F}_{v},\mathcal{F}_{t}$ are given in Eq. (1). $R_{vt}$ is the ratio between the vision-video cosine similarity and the text-video cosine similarity, and measures the extent of modality bias of the multimodal encoder $\mathcal{F}_{m}$.

For the base model in Eq. (5), we compute $R_{vt}$ for a randomly sampled set of videos and plot the density histogram in Fig. 2(a). As observed, most values satisfy $R_{vt}<0.3$, indicating that the model suffers from the modality imbalance problem and the multimodal embeddings are heavily biased toward the text contents. Consequently, when retrieving videos with the base model, visually irrelevant videos can be falsely recalled with even higher similarities than videos relevant to the query both textually and visually. The fundamental cause of modality imbalance is that text matching provides a shortcut for the bi-encoder: the query-text relation is easier to grasp, and most samples in the training set can be solved by lexical relevance alone.
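This diagnostic is easy to reproduce given the per-modality and fused embeddings; the sketch below computes $R_{vt}$ per video (the eps guard is ours, added for numerical safety).

```python
import torch.nn.functional as F

def modality_ratio(text_emb, vision_emb, fused_emb, eps=1e-8):
    """R_vt (Eq. 6): cosine(vision, fused) / cosine(text, fused), per video."""
    cos_vm = F.cosine_similarity(vision_emb, fused_emb, dim=-1)
    cos_tm = F.cosine_similarity(text_emb, fused_emb, dim=-1)
    return cos_vm / (cos_tm + eps)
```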

2.4. Modality-Shuffled Negatives

To eliminate this shortcut, we generate novel Modality-Shuffled (MS for short) negative samples, whose vision modality is irrelevant to the query while the text is relevant to it. As illustrated in Fig. 2(b), such MS negatives are serious adversarial attacks for the base model, which cannot distinguish MS negatives from the real positives. This inspires our loss design for MS negatives as below:

(7) $\mathcal{L}_{ms}=-\log\Big(\frac{\exp(s(q,m)/\tau)}{\exp(s(q,m)/\tau)+\sum_{\hat{m}\in\mathcal{M}_{ms}}\exp(s(q,\hat{m})/\tau)}\Big)$,

where $\mathcal{M}_{ms}$ denotes the set of generated MS negatives.

$\mathcal{L}_{ms}$ directly promotes the model to disentangle the MS negatives from the real positives, as shown in Fig. 2(c). Moreover, it is not hard to see that when both $R_{vt}$ and $R_{\hat{v}t}$ are close to 0, $\mathcal{F}_{m}(t,v)$ will be close to its MS negative $\mathcal{F}_{m}(t,\hat{v})$. Thus, MBVR with $\mathcal{L}_{ms}$ also pushes $R_{vt}$ away from 0, as shown in Fig. 2(a), which indicates that $\mathcal{L}_{ms}$ effectively alleviates the modality imbalance problem. Consequently, the information of both text and vision is well preserved in the final video embedding.

How can MS negatives be generated efficiently? We re-combine text embeddings and vision embeddings within the mini-batch, as in Fig. 1. For a mini-batch of size $n$, the vision and text embeddings of the $k$-th video are $\mathcal{F}_{v}(v_{k})$ and $\mathcal{F}_{t}(t_{k})$. An MS negative of the $k$-th video can then be computed as $\mathcal{H}(\mathcal{F}_{v}(v_{l}),\mathcal{F}_{t}(t_{k}))$, where $l$ is a randomly selected integer from $1$ to $n$ except $k$. This design generates one MS negative for each video with only one MSA operation, which is extremely efficient. By repeating the above process $M$ times, we generate $M$ MS negatives for each video. Empirically, more MS negatives yield better performance; we set $M=32$ to balance effectiveness and efficiency.
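A minimal sketch of this in-batch generation is given below, assuming a batch size of at least two and reusing the fusion-module signature from the earlier sketch; the index-sampling trick is just one simple way to draw $l \neq k$ uniformly.

```python
import torch

def modality_shuffled_negatives(text_emb, vision_emb, fusion, num_repeats=32):
    """Generate M modality-shuffled negatives per video (M = num_repeats).
    text_emb, vision_emb: (B, d) embeddings; fusion: the fusion module H.
    Returns a tensor of shape (num_repeats, B, d); requires B >= 2."""
    B = text_emb.size(0)
    negatives = []
    for _ in range(num_repeats):
        # For each k, draw l uniformly from {0, ..., B-1} \ {k}
        rand = torch.randint(0, B - 1, (B,))
        perm = rand + (rand >= torch.arange(B)).long()
        # Pair each video's own text with another video's vision; one fusion pass
        negatives.append(fusion(text_emb, vision_emb[perm]))
    return torch.stack(negatives, dim=0)
```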

2.5. Dynamic Margin

To further address the modality bias, we apply a dynamic margin $\lambda$ on the positive pair $(q,m)$ of $\mathcal{L}_{qm}$ as below:

(8) $\mathcal{L}_{qm}=-\log\Big(\frac{\exp((s(q,m)-\lambda)/\tau)}{\exp((s(q,m)-\lambda)/\tau)+\sum_{\hat{m}\in\mathcal{M}^{-}_{q}}\exp(s(q,\hat{m})/\tau)}\Big)$.

$\lambda$ is computed from the visual relevance of $(q,m)$ through a scale-and-shift transformation:

(9) $\lambda=w\,\sigma(\cos\langle\mathcal{F}_{v}(v),\mathcal{F}_{q}(q)\rangle)+b$,

where $w=0.3$, $b=-0.1$, and $\sigma$ denotes the sigmoid function, i.e., $\sigma(x)=\frac{1}{1+e^{-x}}$. The margin $\lambda$ thus varies in $(-0.1,0.2)$ and monotonically increases w.r.t. the visual relevance (i.e., $\cos\langle\mathcal{F}_{v}(v),\mathcal{F}_{q}(q)\rangle$). We apply the same modification to the video-to-query loss $\mathcal{L}_{mq}$ and the MS loss $\mathcal{L}_{ms}$. The main objective $\mathcal{L}_{bi}$ in Eq. 4 with the dynamic margin is then referred to as $\widetilde{\mathcal{L}}_{bi}$, and $\mathcal{L}_{ms}$ with the dynamic margin as $\widetilde{\mathcal{L}}_{ms}$. Note that the gradient of the margin $\lambda$ is detached during model training.

To understand the effect of the dynamic margin, note that when $\lambda>0$, it can be viewed as moving the positive video $m$ a bit farther from the query before computing the loss, which results in a larger loss value, as in Fig. 1. Therefore, the dynamic margin encourages the model to produce even higher similarities for visually related query-video pairs.
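A minimal sketch of the dynamic-margin query-to-video loss (Eqs. 8-9) is given below, again assuming L2-normalized embeddings and in-batch negatives; the margin is computed from the query-vision cosine similarity and detached, as described above.

```python
import torch
import torch.nn.functional as F

def dynamic_margin_loss(q_emb, m_emb, vision_emb, tau=0.07, w=0.3, b=-0.1):
    """L_qm with dynamic margin. q_emb, m_emb: (B, d) L2-normalized query /
    fused video embeddings; vision_emb: (B, d) L2-normalized F_v(v)."""
    # lambda = w * sigmoid(cos<F_v(v), F_q(q)>) + b, detached from the gradient
    vis_rel = F.cosine_similarity(q_emb, vision_emb, dim=-1)
    margin = (w * torch.sigmoid(vis_rel) + b).detach()            # (B,)

    sim = q_emb @ m_emb.t()                                       # (B, B) cosines
    idx = torch.arange(sim.size(0), device=sim.device)
    logits = sim.clone()
    logits[idx, idx] = sim[idx, idx] - margin                     # margin on positives
    return F.cross_entropy(logits / tau, idx)
```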

Finally, the overall learning objective of our MBVR  framework is as follows,

(10) $\mathcal{L}=\widetilde{\mathcal{L}}_{bi}+\alpha\mathcal{L}_{v}+\beta\mathcal{L}_{t}+\gamma\widetilde{\mathcal{L}}_{ms}$,

where $\gamma$ is a weight hyper-parameter set to $0.01$. In summary, MBVR solves the modality imbalance issue from two collaborative aspects: penalizing videos with unrelated vision contents via the MS component, and enhancing query-video pairs with related vision contents via the dynamic margin.

3. Experiments

3.1. Experimental Setup

In this section, we describe our datasets and evaluation metrics.

Training Dataset. The training dataset contains about 5 million queries, 42 million videos, and 170 million relevant query-video pairs mined from the most recent nine months of search logs.

Evaluation Datasets. We do offline evaluations on two test datasets. Manual: a manually annotated dataset of two million query-video pairs to evaluate the relevance of recalled videos; Auto: a dataset of five million randomly sampled query-video pairs automatically collected with Wilson CTR from search logs to simulate the online performance. The datasets used in the paper will be released only as embedding vectors to protect user privacy.

For all models, we compute their Precision@K and MRR@K on Auto and PNR on Manual, which are defined as below:

Precision@K. Given a query $q$, let $\mathcal{V}_{q}$ be the set of relevant videos, and let the top $K$ documents returned by a model be denoted as $\mathcal{R}_{q}=\{r_{1},\cdots,r_{K}\}$. The metric Precision@$K$ is defined as

(11) $Precision@K=\frac{|\mathcal{R}_{q}\cap\mathcal{V}_{q}|}{K}$.

MRR@K. Mean Reciprocal Rank at K (MRR@K) is defined as

(12) $MRR@K=\frac{1}{K}\sum_{i=1}^{K}\mathcal{I}_{r_{i}\in\mathcal{V}_{q}}\cdot\frac{1}{i}$,

where $\mathcal{I}_{\mathcal{A}}$ is an indicator function: if $\mathcal{A}$ holds, it is 1, otherwise 0. (Compared with Precision@K, MRR@K reflects the order of the top-$K$ results. Note that MRR@K is usually computed when there is only one positive sample, in which case its value is always below 1; since we have more than one relevant video for each query, the value of MRR@K can exceed 1.)

PNR. For a given query $q$ and its associated videos $\mathcal{D}_{q}$, the positive-negative ratio (PNR) is defined as

(13) $PNR=\frac{\sum_{d_{i},d_{j}\in\mathcal{D}_{q}}\mathcal{I}(y_{i}>y_{j})\cdot\mathcal{I}(s(q,d_{i})>s(q,d_{j}))}{\sum_{d_{i^{\prime}},d_{j^{\prime}}\in\mathcal{D}_{q}}\mathcal{I}(y_{i^{\prime}}>y_{j^{\prime}})\cdot\mathcal{I}(s(q,d_{i^{\prime}})<s(q,d_{j^{\prime}}))}$,

where $y_{i}$ represents the manual label of $d_{i}$, and $s(q,d_{i})$ is the predicted score between $q$ and $d_{i}$ given by the model. PNR measures the consistency between labels and predictions.
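For concreteness, the three metrics can be computed per query as sketched below, with per-query values averaged over the evaluation set; function and argument names are ours.

```python
from itertools import combinations

def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Precision@K (Eq. 11): fraction of the top-K results that are relevant."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / k

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """MRR@K (Eq. 12): sum of 1/i over relevant top-K positions, divided by K.
    With several relevant videos per query, the value can exceed 1."""
    rel = set(relevant_ids)
    return sum(1.0 / i for i, r in enumerate(ranked_ids[:k], start=1) if r in rel) / k

def pnr(labels, scores):
    """PNR (Eq. 13): concordant vs. discordant pairs among a query's videos."""
    concordant = discordant = 0
    for (y_i, s_i), (y_j, s_j) in combinations(zip(labels, scores), 2):
        if y_i == y_j:
            continue
        hi, lo = (s_i, s_j) if y_i > y_j else (s_j, s_i)
        if hi > lo:
            concordant += 1
        elif hi < lo:
            discordant += 1
    return concordant / discordant if discordant else float('inf')
```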

3.2. Offline Evaluations

Table 1. Offline experimental results of compared methods on the Auto and Manual test sets.

Method                   | Auto: MRR@10 | Auto: Precision@10 (%) | Manual: PNR
Text                     | 1.406        | 45.58                  | 2.188
Vision                   | 0.603        | 17.55                  | 1.942
Base model               | 1.446        | 46.53                  | 2.230
+IAT (Lamb et al., 2019) | 1.452        | 46.22                  | 2.243
+CDF (Kim et al., 2020)  | 1.463        | 47.31                  | 2.253
+MS                      | 1.600        | 50.52                  | 2.311
+DM                      | 1.491        | 47.75                  | 2.267
MBVR                     | 1.614        | 51.10                  | 2.310

Compared Methods. We compare our method with the highly optimized baseline and with recent state-of-the-art modality-balancing techniques, IAT (Lamb et al., 2019) and CDF (Kim et al., 2020). IAT trains a robust model under the adversarial attacks of PGD (Madry et al., 2018), and CDF aims to retrieve videos with one modality missing.

  • Base is Eq. 5 without MS negatives and dynamic margin.

  • Text (Vision) only uses the text (vision) modality of videos.

  • +IAT equips Base with IAT (Lamb et al., 2019).

  • +CDF equips Base with CDF (Kim et al., 2020).

  • +MS equips Base with the MS negatives loss $\mathcal{L}_{ms}$.

  • +DM equips Base with the dynamic margin, i.e., $\widetilde{\mathcal{L}}_{bi}$.

  • MBVR is the full model with both DM and MS.

Table 1 shows the experimental results of the compared methods on both the Auto and Manual test sets. The drastic performance difference between Vision and Text results from the dominant role of the text modality among the video's modalities. +IAT and +CDF bring only marginal improvements over Base. Both +MS and +DM yield significant gains over the strong baseline Base. MS is extremely effective, bringing nearly 4% absolute improvement over Base ($46.53\% \rightarrow 50.52\%$). Furthermore, the full model MBVR, with both MS and DM, achieves the best performance. The offline evaluations verify the effectiveness and compatibility of both components of MBVR (MS and DM).

3.3. Online Evaluations

For the online test, the control baseline is the current online search engine, a highly optimized system with multiple retrieval routes (e.g., text embedding based ANN retrieval and text matching with inverted indexing) that provide thousands of candidates, followed by several pre-ranking and ranking models that rank these candidates. The experiment variant adds our MBVR multimodal embedding based retrieval as an additional route.

Online A/B Test. We conducted online A/B experiments over 10% of the entire traffic for one week. The watch time increased by 1.509%; the long watch rate increased by 2.485%; the query changed rate decreased by 1.174%. (A decrease in the query changed rate is positive, as it means users find relevant videos without changing the query to trigger a new search request.) This is a statistically significant improvement and verifies the effectiveness of MBVR. MBVR has now been deployed online and serves the main traffic.

Manual Evaluation. We conduct a manual side-by-side comparison of the top-4 videos between the baseline and the experiment. We randomly sample 200 queries whose top-4 videos differ, and then have several human experts judge whether the experiment's results are more relevant than the baseline's. The Good vs. Same vs. Bad (GSB) metric is G=45, S=126, B=29, where G (resp. B) denotes the number of queries whose results in the experiment are more relevant (resp. less relevant) than the baseline's. This GSB result indicates that MBVR recalls more relevant videos to better satisfy users' search requests.

4. Conclusions and Discussions

In this paper, we identify the challenging modality bias issue in multimodal embedding learning based on online search logs and propose our solution MBVR. The main contributions of MBVR  are the modality-shuffled (MS) negatives and the dynamic margin (DM), which force the model to pay more balanced attention to each modality. Our experiments verify that the proposed MBVR  significantly outperforms a strong baseline and recent modality balanced techniques on offline evaluation and improves the highly optimized online video search system.

As an early exploration of building a multimodal retrieval system for short-video platforms, our MBVR adopts a succinct scheme (i.e., we keep engineering/architecture design choices simple). There are potential directions to further enhance the system, e.g., using more frames, designing more sophisticated cross-modal fusion modules, and adopting smarter data cleaning techniques, which can be explored in the future.

Acknowledgments

We thank Xintong Han for early discussions of MBVR. We thank Xintong Han, Haozhi Zhang, Yu Gao, Shanlan Nie for paper review. We also thank Tong Zhao, Yue Lv for preparing training datasets for the experiments.

References

  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. TPAMI 35, 8 (2013).
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL. 4171–4186.
  • Firat et al. (2016) Orhan Firat, Baskaran Sankaran, Yaser Al-Onaizan, Fatos T Yarman Vural, and Kyunghyun Cho. 2016. Zero-Resource Translation with Multi-Lingual Neural Machine Translation. In EMNLP. 268–277.
  • Gabeur et al. (2020) Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. 2020. Multi-modal Transformer for Video Retrieval. In ECCV.
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. Realm: retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909 (2020).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
  • Huang et al. (2020a) Jui-Ting Huang, Ashish Sharma, Shuying Sun, Li Xia, David Zhang, Philip Pronin, Janani Padmanabhan, Giuseppe Ottaviano, and Linjun Yang. 2020a. Embedding-based retrieval in facebook search. In KDD. 2553–2561.
  • Huang et al. (2013) Po-Sen Huang, X. He, Jianfeng Gao, L. Deng, A. Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In CIKM.
  • Huang et al. (2020b) Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020b. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020).
  • Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. EMNLP (2020).
  • Khattab and Zaharia (2020) Omar Khattab and Matei Zaharia. 2020. Colbert: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 39–48.
  • Kim et al. (2020) Hyounghun Kim, Hao Tan, and Mohit Bansal. 2020. Modality-balanced models for visual dialogue. In AAAI, Vol. 34. 8091–8098.
  • Lamb et al. (2019) Alex Lamb, Vikas Verma, Juho Kannala, and Yoshua Bengio. 2019. Interpolated adversarial training: Achieving robust neural networks without sacrificing too much accuracy. In Proceedings of the 12th ACM Workshop on Artificial Intelligence and Security. 95–103.
  • Lei et al. (2021) Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L Berg, Mohit Bansal, and Jingjing Liu. 2021. Less is more: Clipbert for video-and-language learning via sparse sampling. In CVPR. 7331–7341.
  • Li et al. (2021) Sen Li, Fuyu Lv, Taiwei Jin, Guli Lin, Keping Yang, Xiaoyi Zeng, Xiao-Ming Wu, and Qianli Ma. 2021. Embedding-based Product Retrieval in Taobao Search. KDD (2021).
  • Liu et al. (2021a) Song Liu, Haoqi Fan, Shengsheng Qian, Yiru Chen, Wenkui Ding, and Zhongyuan Wang. 2021a. Hit: Hierarchical transformer with momentum contrast for video-text retrieval. ICCV (2021).
  • Liu et al. (2021b) Yiding Liu, Guan Huang, Weixue Lu, Suqi Cheng, Daiting Shi, Shuaiqiang Wang, Zhicong Cheng, and Dawei Yin. 2021b. Pre-trained Language Model for Web-scale Retrieval in Baidu Search. KDD (2021).
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Liu et al. (2021c) Yiqun Liu, Kaushik Rangadurai, Yunzhong He, Siddarth Malreddy, Xunlong Gui, Xiaoyi Liu, and Fedor Borisyuk. 2021c. Que2Search: Fast and Accurate Query and Document Understanding for Search at Facebook. In KDD. 3376–3384.
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. In ICLR.
  • Qu et al. (2021) Leigang Qu, Meng Liu, Jianlong Wu, Zan Gao, and Liqiang Nie. 2021. Dynamic Modality Interaction Modeling for Image-Text Retrieval. In SIGIR. 1104–1113.
  • Sun et al. (2019) Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019. Videobert: A joint model for video and language representation learning. In ICCV. 7464–7473.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In NeurIPS. 5998–6008.
  • Xia et al. (2018) Yingce Xia, Xu Tan, Fei Tian, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2018. Model-level dual learning. In ICML. PMLR, 5383–5392.
  • Xu et al. (2021) Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. 2021. Videoclip: Contrastive pre-training for zero-shot video-text understanding. EMNLP (2021).
  • Zhan et al. (2021) Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2021. Optimizing Dense Retrieval Model Training with Hard Negatives. SIGIR (2021).
  • Zhan et al. (2020) Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Min Zhang, and Shaoping Ma. 2020. RepBERT: Contextualized Text Embeddings for First-Stage Retrieval. arXiv preprint arXiv:2006.15498 (2020).
  • Zhang et al. (2020) Han Zhang, Songlin Wang, Kang Zhang, Zhiling Tang, Yunjiang Jiang, Yun Xiao, Weipeng Yan, and Wen-Yun Yang. 2020. Towards personalized and semantic retrieval: An end-to-end solution for e-commerce search via embedding learning. In SIGIR. 2407–2416.
