InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
Abstract.
Multi-modal pretraining for learning high-level multi-modal representations is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, namely InterBERT (BERT for Interaction), which is the first model in our series of multi-modal pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model has a strong capability of modeling interaction between the information flows of different modalities. The single-stream interaction module effectively processes information from multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance degradation in single-modal tasks. We pretrain the model with three pretraining tasks, including masked segment modeling (MSM), masked region modeling (MRM), and image-text matching (ITM), and finetune it on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods, and the analysis shows that MSM and MRM are effective for pretraining and that our method can achieve performances comparable to BERT in single-modal tasks. Besides, we propose a large-scale dataset for multi-modal pretraining in Chinese (which we will release to nourish further development in the community), and we develop the Chinese InterBERT, the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from the mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and we have recently deployed the model online for topic-based recommendation.
1. Introduction
Pretraining has attracted much attention in the community due to its strong generalization capability and its efficient usage of large-scale data. The development of computer vision has been closely connected with pretraining, such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan and Zisserman, 2015) and ResNet (He et al., 2016), which are pretrained on the large-scale dataset ImageNet (Deng et al., 2009) for image classification. Recent years have witnessed the burst of pretraining in natural language processing. Pretrained models (Peters et al., 2018; Howard and Ruder, 2018; Devlin et al., 2019; Liu et al., 2019; Yang et al., 2019; Dong et al., 2019) have reached state-of-the-art performances in many downstream tasks of natural language processing (NLP), including question answering (Rajpurkar et al., 2016), natural language inference (Wang et al., 2019), and even natural language generation, such as neural machine translation (Sutskever et al., 2014; Bahdanau et al., 2015; Vaswani et al., 2017) and abstractive summarization (Rush et al., 2015; Nallapati et al., 2016).
Such significant progress in this field has sparked interest in pretraining for task-agnostic multi-modal representation. A series of cross-modal pretraining methods were proposed, and self-supervised learning provides the models with a strong ability to adapt to multiple multi-modal downstream tasks through finetuning (Sun et al., 2019b; Li et al., 2019b; Tan and Bansal, 2019; Lu et al., 2019a; Su et al., 2019; Chen et al., 2019; Li et al., 2019a; Zhou et al., 2020; Miech et al., 2020a; Li et al., 2020). However, these models are mostly pretrained with simple tasks such as masked language/object modeling (MLM/MOM) and image-text matching (ITM). Beyond that, single-stream models (Su et al., 2019; Chen et al., 2019; Li et al., 2019a; Li et al., 2020) simply apply BERT and mix the information from the two streams in one model, while two-stream models (Lu et al., 2019a; Tan and Bansal, 2019) can only build interaction with co-attention, where each co-attention layer attends only to the other modality and lacks self-attention over its own context.
Motivated by this observation, we propose a novel method for multi-modal pretraining, called InterBERT, which refers to BERT for Interaction. This model is the first one of our series of pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The proposed architecture consists of a single-stream interaction module for all the inputs from different modalities, as well as a two-stream extraction module that processes the information of each modality separately. This architecture ensures sufficient interaction between modalities and generates contextualized representations of each modality. Besides, we pretrain the model with our proposed masked group modeling (MGM) and image-text matching with hard negatives (ITM-hn). These tasks are more challenging, as they force the model to predict a span or a region and to differentiate positive and negative samples of higher difficulty, which requires the model to build a stronger connection between modalities.
We pretrain our InterBERT on a series of large-scale datasets of image-text pairs, and we evaluate the effects of InterBERT on several multi-modal downstream tasks, including caption-based image retrieval (Wang et al., 2016), zero-shot caption-based image retrieval, and visual commonsense reasoning (Zellers et al., 2019). Experimental results demonstrate that our method achieves significant improvements over the baseline models, and it outperforms or rivals the recent multi-modal pretrained models. The analysis demonstrates that our pretraining tasks positively impact the model performance in different downstream tasks, and that the model can adapt to single-modal tasks without a significant performance decrease in comparison with the BERT-base model. Furthermore, we deploy the model in the mobile Taobao, the largest platform of e-commerce in China. We conduct an A/B test and achieve improvements in click-through rate and exposed category width over the single-modal baseline based on BERT. This demonstrates the potential of pretraining and the contribution of cross-modal information in recommendation.
In brief, our contributions are listed below:
- We propose a novel method for multi-modal pretraining called InterBERT. The new architecture effectively builds multi-modal interaction and preserves the independence of single-modal representation. The proposed pretraining task MGM encourages the model to learn group prediction and interaction with a larger context, and the proposed ITM-hn raises the difficulty of distinguishing positive and negative samples so as to improve the model's capability in cross-modal matching.
- Experimental results demonstrate that InterBERT enhances the performance in the downstream tasks, including image retrieval and VCR, in comparison with the current pretraining methods. Our ablation studies show the effects of our architecture in the adaptation to NLP tasks and the significant effects of our proposed MGM and ITM-hn in VCR and image retrieval.
- We deploy InterBERT in the mobile Taobao and conduct an A/B test in a traffic-intensive recommendation scenario. Our multi-modal pretrained model gains performance increases in several metrics over the single-modal BERT-based baseline.
Organization The rest of the paper is organized as follows: Section 2 provides an overview of the proposed approach, including the model architecture as well as the pretraining methods. Section 3 provides the details of our experiments, including datasets, implementation, results, and analysis. Section 4 reviews the related work, and the final section concludes the paper. Details of the Chinese data, our online deployment, and further implementation settings are provided in the appendix.
2. Approach
We detail our proposed approach InterBERT in this section. Before moving into the introduction to InterBERT, we illustrate the background of pretraining in NLP and extend it to multi-modal pretraining.
2.1. Background
Since our method follows the pretraining principle in NLP, we first introduce the background of NLP pretraining and then extend it to multi-modal pretraining.
In NLP pretraining, given an input text, i.e., a word (character or subword) sequence with classification and separation tokens, the model learns to generate its high-level representations. The input can be a sequence of multiple sentences, separated by the "[SEP]" token. For the input representation, besides the word embedding layer, positional embedding and segment embedding are applied to denote the word positions and the segments they come from. For a BERT with multiple layers, the model produces a sequence of representations at every layer, and in most cases we use the representations from the topmost layer. For further finetuning, the pretrained model, except for the topmost layer for logits, is applied as the backbone of the model for a specific downstream task.
Following this logic, such pretraining can be extended to learning multi-modal representations. In this work, we focus on the pretraining of vision and language. The dataset for multi-modal pretraining consists of paired image-text data, such as an image and its caption. To feed an image into BERT, a solution is to extract the object representations and bounding boxes with a detector, such as Faster-RCNN (Ren et al., 2015), and form a sequence of object representations together with their positions. Similar to the pretraining in NLP, we also add a global image representation, which in our implementation is the mean pooling of the object representations. The goal of the model is to learn high-level representations of both the image and the text.
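To make this input construction concrete, below is a minimal PyTorch sketch (tensor names and sizes are ours, and in practice the RoI features would come from a Faster-RCNN detector) that prepends a mean-pooled global representation to the detected object features:

```python
import torch

def build_image_inputs(roi_feats: torch.Tensor, boxes: torch.Tensor):
    """Build the image-side input sequence for the model.

    roi_feats: (num_objects, feat_dim) RoI features from an object detector.
    boxes:     (num_objects, 4) bounding boxes (x1, y1, x2, y2), normalized to [0, 1].
    Returns the object features with a prepended global token and the matching boxes.
    """
    # Global image representation: mean pooling over the detected objects,
    # analogous to the [CLS]-style summary token on the text side.
    global_feat = roi_feats.mean(dim=0, keepdim=True)     # (1, feat_dim)
    global_box = torch.tensor([[0.0, 0.0, 1.0, 1.0]])     # whole-image box

    feats = torch.cat([global_feat, roi_feats], dim=0)    # (1 + N, feat_dim)
    boxes = torch.cat([global_box, boxes], dim=0)         # (1 + N, 4)
    return feats, boxes

# Toy usage with random "detector outputs"; the numbers of objects and
# feature dimensions here are illustrative only.
roi_feats = torch.randn(36, 2048)
boxes = torch.rand(36, 4)
feats, boxes = build_image_inputs(roi_feats, boxes)
print(feats.shape, boxes.shape)  # torch.Size([37, 2048]) torch.Size([37, 4])
```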
2.2. Model overview

In this section, we illustrate the details of our proposed model InterBERT. An overview of the architecture is demonstrated in Figure 1. The simplest solution for multi-modal pretraining is to pretrain a BERT-like model on the concatenation of image and text features. Lu et al. (2019a) pointed out that such a method of information fusion ignores the different processing requirements of different modalities, and their experimental results show that the two-stream model outperforms the single-stream one in multiple tasks. We view effective interaction between modalities as the key to effective pretraining. Such interaction requires bridging the gap between image and text while maintaining the independence of each modality. Furthermore, such independence brings an extra benefit: it enables transfer to both cross-modal downstream tasks and single-modal tasks. This enhances the robustness of the model and breaks the limitation imposed by the form of the pretraining data.
Replacing Co-Attention with All-Attention While co-attention demonstrates its effects in VilBERT (Lu et al., 2019a), we find that such a method of modal interaction limits the capability of the model: the representations of one modality can only attend to those of the other, ignoring the self-context. The ideal attention should attend to the whole context. Here we replace co-attention with all-attention, a single-stream interaction module based on multi-head self-attention (MHSA) and a position-wise feed-forward network (FFN) (Vaswani et al., 2017; Devlin et al., 2019). The input of the single-stream interaction module is the concatenation of image and text embeddings, so the attention can attend to the whole context of both modalities. Each layer computes:
(1) $A^{(l)} = \mathrm{MHSA}(H^{(l-1)})$
(2) $\tilde{H}^{(l)} = \mathrm{LN}(H^{(l-1)} + A^{(l)})$
(3) $F^{(l)} = \mathrm{FFN}(\tilde{H}^{(l)})$
(4) $H^{(l)} = \mathrm{LN}(\tilde{H}^{(l)} + F^{(l)})$
where $H^{(l-1)}$ is the whole context of image and text representations, instead of the representations of a single modality. For multi-head attention, the model first transforms the inputs into query, key, and value representations with weight matrices $W_Q$, $W_K$, and $W_V$, splits them into multiple heads, and computes the attention scores between queries and keys as well as the weighted sum of the values. Layer normalization (LN) and residual connections are applied, and the activation function is GeLU (Hendrycks and Gimpel, 2016).
This architecture enables strong interaction between modalities with the attention mechanism. Compared with the two-stream co-attention layer (Lu et al., 2019a) which can only attend to the representations of the other modality, this architecture enables a combination of self attention and co-attention, and therefore the model can generate more contextualized representations. Furthermore, another advantage is that the architecture is identical to BERT and thus its weights can be initialized with the pretrained BERT’s weights, which improves the availability of the previous pretrained models.
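A minimal PyTorch sketch of one such all-attention layer is shown below; it follows the residual-plus-LayerNorm structure described above, while the hidden size, head count, and FFN size are illustrative defaults rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class InteractionLayer(nn.Module):
    """One single-stream 'all-attention' layer over the concatenated
    image and text representations (MHSA + FFN, residual + LayerNorm)."""

    def __init__(self, hidden: int = 768, heads: int = 12, ffn: int = 3072):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.ln1 = nn.LayerNorm(hidden)
        self.ffn = nn.Sequential(
            nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden)
        )
        self.ln2 = nn.LayerNorm(hidden)

    def forward(self, h: torch.Tensor, pad_mask: torch.Tensor = None):
        # h is the whole context: image embeddings concatenated with text
        # embeddings, so every position can attend to both modalities.
        attn_out, _ = self.mhsa(h, h, h, key_padding_mask=pad_mask)
        h = self.ln1(h + attn_out)       # residual + LayerNorm after attention
        h = self.ln2(h + self.ffn(h))    # residual + LayerNorm after FFN
        return h

# Toy usage: 37 image positions + 20 text positions concatenated along the
# sequence dimension (sizes are illustrative).
img = torch.randn(2, 37, 768)
txt = torch.randn(2, 20, 768)
layer = InteractionLayer()
out = layer(torch.cat([img, txt], dim=1))
print(out.shape)  # torch.Size([2, 57, 768])
```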
Extraction Module for Modal Representations Ideally, the model's outputs should consist of visual and linguistic representations as well as visual-linguistic ones. Also, a robust multi-modal pretrained model should be able to transfer to single-modal tasks. As mentioned above, the single-stream interaction module fuses the visual and linguistic representations and makes them more contextualized. To extract the representations of each modality, we therefore need a module that separates the fused information and generates representations for each modality.
We implement a two-stream extraction module, which consists of an image extractor and a text extractor. Each extractor is based on self-attention and FFN. The module is responsible for generating high-level object representations and text representations. In addition, the model generates a general image representation and a general text representation for finetuning. The image and text representations are transformed into a cross-modal representation by a multi-layer feed-forward network. To validate our hypothesis, we finetune our pretrained model and a single-stream multi-modal pretrained model (a plain BERT architecture) on natural language processing tasks to evaluate their performances on single-modal tasks. The analysis demonstrates that our architecture achieves performance similar to the original BERT-base model, while the single-stream model without the two-stream extraction module performs much worse. This shows our model's advantage in preserving modal independence. More details are described in Section 3.5.
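For illustration, the sketch below shows one possible form of the two-stream extraction module: the fused sequence from the interaction module is split back into its image and text parts, each part passes through its own small Transformer encoder, and the two leading representations are combined into a cross-modal vector by an MLP. The layer counts, sizes, and the exact combination (an element-wise product here) are assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn

class ExtractionModule(nn.Module):
    """Two-stream extraction module: separate Transformer encoders for the
    image part and the text part of the fused sequence, plus an MLP that
    turns the two leading representations into one cross-modal vector."""

    def __init__(self, hidden: int = 768, heads: int = 12, layers: int = 3):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model=hidden, nhead=heads, dim_feedforward=4 * hidden,
            activation="gelu", batch_first=True)
        self.img_stream = nn.TransformerEncoder(make_layer(), num_layers=layers)
        self.txt_stream = nn.TransformerEncoder(make_layer(), num_layers=layers)
        self.fusion_mlp = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden))

    def forward(self, fused: torch.Tensor, num_img: int):
        img = self.img_stream(fused[:, :num_img])   # image stream
        txt = self.txt_stream(fused[:, num_img:])   # text stream
        # Leading positions serve as the global image / text representations;
        # combining them via element-wise product + MLP is one simple choice.
        cross = self.fusion_mlp(img[:, 0] * txt[:, 0])
        return img, txt, cross

fused = torch.randn(2, 57, 768)   # output of the single-stream interaction module
img, txt, cross = ExtractionModule()(fused, num_img=37)
print(img.shape, txt.shape, cross.shape)
```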
Text Embedding and Image Embedding Following Devlin et al. (2019), we tokenize the input text and embed each word with an embedding layer. Positional embedding is required for the self-attention-based model to obtain the positional information, and segment embedding is required for the model to distinguish image and text.
A solution to adapt the image to Transformer is to obtain the object representations and their locations with a detector. Following Lu et al. (2019a), we apply a commonly used object detector Faster-RCNN (Ren et al., 2015) trained on Visual Genome (Krishna et al., 2017; Anderson et al., 2018). We extract the bounding boxes and the RoI (Region of Interest) features as the object representations. Similar to the aforementioned process, we apply positional embedding and segment embedding to the extracted features.
2.3. Pretraining tasks
In this section, we introduce the pretraining tasks for our multi-modal pretraining, namely masked group modeling (MGM) and image-text matching with hard negatives (ITM-hn).
Masked Group Modeling We propose MGM, which encourages the model to predict masked groups of words and image regions. We name the masked group modeling on text "masked segment modeling (MSM)" and that on image "masked region modeling (MRM)". Similar to MLM, MSM replaces the selected words with the same strategy (replacing with the "[MASK]" token, a random word, or the original word). However, MSM masks a continuous segment of text instead of random words. Different from Joshi et al. (2019), we mask multiple segments for each sample. As to MRM, it masks selected objects with zero vectors as MOM does. Yet it masks neighboring objects together to avoid the information leakage caused by overlapping objects: MRM masks objects that have a high proportion of mutual intersection.
For MSM, we randomly choose words as masking anchors with a probability of 10%, and for each anchor we mask the anchor together with 0 to 2 following words, sampled uniformly. For MRM, we also randomly choose objects as masking anchors with a probability of 10%, and we mask the objects whose IoUs with the anchors are larger than a threshold (tuned empirically). The objective of the model is to predict the masked words and the categories of the masked objects. The training minimizes the loss:
(5) $\mathcal{L}_{\mathrm{MSM}}(\theta) = -\mathbb{E}_{(w,v) \sim D} \log P_{\theta}\left(w_m \mid M_w(w), v\right)$
(6) $\mathcal{L}_{\mathrm{MRM}}(\theta) = -\mathbb{E}_{(w,v) \sim D} \log P_{\theta}\left(c(v_m) \mid w, M_v(v)\right)$
where $(w, v)$ is a random image-text pair sampled from the training set $D$, $w_m$ ($v_m$) refers to the masked segment (masked region), $M_w(w)$ ($M_v(v)$) refers to the whole masked sequence, and $M_v$ and $M_w$ refer to the masking functions for image and text. The objective functions encourage the model to predict the masked groups of words or the classes of the masked groups of objects.
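As a concrete reference for the masking strategy described above, here is a small self-contained sketch of the two masking functions; the IoU threshold is a placeholder for the empirically tuned value, and the MLM-style replacement mix (mask token / random word / original word) is reduced to plain masking for brevity:

```python
import random

def mask_segments(tokens, anchor_prob=0.1, max_extra=2, mask_token="[MASK]"):
    """Masked segment modeling: pick anchor words with prob. `anchor_prob`
    and mask the anchor plus 0-2 following words (uniformly sampled)."""
    tokens = list(tokens)
    positions = set()
    for i in range(len(tokens)):
        if random.random() < anchor_prob:
            span = random.randint(0, max_extra)
            positions.update(range(i, min(i + span + 1, len(tokens))))
    masked = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)

def iou(a, b):
    """Intersection over union of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def mask_regions(boxes, anchor_prob=0.1, iou_thresh=0.4):
    """Masked region modeling: pick anchor objects with prob. `anchor_prob`
    and also mask every object whose IoU with an anchor exceeds `iou_thresh`
    (the threshold here is a placeholder for the tuned value)."""
    anchors = [i for i in range(len(boxes)) if random.random() < anchor_prob]
    masked = set(anchors)
    for a in anchors:
        for j, b in enumerate(boxes):
            if j != a and iou(boxes[a], b) > iou_thresh:
                masked.add(j)
    return sorted(masked)

tokens, pos = mask_segments("a man riding a horse on the beach".split())
print(tokens, pos)
print(mask_regions([[0, 0, 10, 10], [1, 1, 9, 9], [50, 50, 60, 60]], anchor_prob=0.5))
```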
Image-Text Matching with Hard Negatives To learn the relation between image and text, we regard the image-text pairs in the dataset as positive samples, and we pair images with uncorrelated texts to construct negative samples. Positive and negative samples appear in equal proportion in the training set. The model is pretrained to distinguish the positives from the negatives.
In previous works (Chen et al., 2019; Li et al., 2019a; Lu et al., 2019a), the uncorrelated captions in the negative samples are randomly selected from the training dataset, which makes the negatives easy to distinguish. To force the model to learn a stronger cross-modal matching capability, we provide the model with harder negatives. For each training image, we consider the captions whose TF-IDF similarities with the image's original caption are lower than 0.5. The top 30 among them with the highest TF-IDF similarities are selected as the hard negative captions for the image. This selects captions which have more lexical overlap with the positive caption but are semantically different. An example of hard negatives is shown in Figure 2. During pretraining, we construct 20% of the negative samples with the hard negative captions mentioned above, while the other 80% of the negatives still use randomly selected captions. (We have also attempted to set a higher proportion of hard negative samples; however, this made the pretraining loss too high for the model to converge.)
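One possible implementation of this hard-negative mining step with scikit-learn is sketched below; the function and variable names are ours:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mine_hard_negatives(captions, top_k=30, sim_cap=0.5):
    """For each caption, return the indices of its hard negative captions:
    the `top_k` captions with the highest TF-IDF cosine similarity among
    those whose similarity stays below `sim_cap`."""
    tfidf = TfidfVectorizer().fit_transform(captions)   # (N, vocab)
    sims = cosine_similarity(tfidf)                      # (N, N)
    hard_negatives = []
    for i in range(len(captions)):
        # The caption itself has similarity 1.0 and is filtered out here.
        candidates = np.where(sims[i] < sim_cap)[0]
        ranked = candidates[np.argsort(-sims[i][candidates])]
        hard_negatives.append(ranked[:top_k].tolist())
    return hard_negatives

captions = [
    "a dog plays with a ball in the park",
    "a dog sleeps on the sofa",
    "two children play soccer in the park",
    "a plate of pasta on a wooden table",
]
print(mine_hard_negatives(captions, top_k=2))
```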

We add a simple MLP on top of the main architecture to compute the matching score between the inputs of the two modalities. Specifically, we first element-wise multiply the global image and text representations (the output representations at the positions of the leading special tokens) and send the resulting representation through the MLP for the matching score. The training minimizes the cross-entropy loss:
(7) $\mathcal{L}_{\mathrm{ITM}}(\theta) = -\mathbb{E}_{(w,v) \sim D} \left[ y \log s_{\theta}(w, v) + (1 - y) \log\left(1 - s_{\theta}(w, v)\right) \right]$
where $(w, v)$ is a random sample from the training set $D$, $y$ denotes whether $(w, v)$ is positive or negative, and $s_{\theta}(w, v)$ refers to the matching score computed from the masked $(w, v)$.
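A rough sketch of the matching head and its binary cross-entropy objective follows; the depth of the MLP is an assumption, since it is not specified here:

```python
import torch
import torch.nn as nn

class MatchingHead(nn.Module):
    """Image-text matching head: element-wise product of the global image
    and text representations, followed by a small MLP that outputs a logit."""

    def __init__(self, hidden: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                 nn.Linear(hidden, 1))

    def forward(self, img_cls: torch.Tensor, txt_cls: torch.Tensor):
        return self.mlp(img_cls * txt_cls).squeeze(-1)   # (batch,)

head = MatchingHead()
img_cls, txt_cls = torch.randn(4, 768), torch.randn(4, 768)
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])              # 1 = matched pair
logits = head(img_cls, txt_cls)
itm_loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
print(itm_loss.item())
```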
The overall objective function is the weighted sum of the aforementioned terms, as shown below:
(8) $\mathcal{L}(\theta) = \lambda_1 \mathcal{L}_{\mathrm{MSM}}(\theta) + \lambda_2 \mathcal{L}_{\mathrm{MRM}}(\theta) + \lambda_3 \mathcal{L}_{\mathrm{ITM}}(\theta)$
where $\lambda_i$ refers to the hyperparameter weighting each term.
2.4. Finetuning
We use the pretrained InterBERT as the backbone for the downstream tasks. We apply the pretrained model to three downstream tasks: caption-based image retrieval, zero-shot caption-based image retrieval, and visual commonsense reasoning. Finetuning is simple, as we only need to add MLP layers according to the requirements of the corresponding downstream task.
3. Experiments
In this section, we provide an introduction to our experimental details, and we demonstrate the results as well as the analysis. (The data statistics are presented in Tables 7 and 8 in the appendix.)
3.1. Downstream tasks
Caption-Based Image Retrieval Caption-based image retrieval requires the model to retrieve an image from a large pool of images based on a given caption. We conduct experiments on Flickr30K (Young et al., 2014), whose images are extracted from Flickr (https://www.flickr.com). In Flickr30K, each image is paired with five captions, which are of relatively high quality. Following Lu et al. (2019a), in the training stage we change the task to 4-way multiple choice by adding three negative images for each image-caption pair. The training set contains 29K images, and the validation and test sets contain 1K images each. The evaluation metrics are R@1, R@5, and R@10 (recall at 1, 5, and 10).
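For reference, the R@K metrics can be computed from the ranked matching scores as in the short sketch below (the scores are synthetic and only for illustration):

```python
import numpy as np

def recall_at_k(scores: np.ndarray, gold: np.ndarray, k: int) -> float:
    """scores: (num_queries, num_images) matching scores; gold: (num_queries,)
    index of the ground-truth image for each caption query."""
    topk = np.argsort(-scores, axis=1)[:, :k]              # indices of top-k images
    hits = (topk == gold[:, None]).any(axis=1)
    return float(hits.mean())

rng = np.random.default_rng(0)
scores = rng.random((5, 100))
gold = rng.integers(0, 100, size=5)
for k in (1, 5, 10):
    print(f"R@{k} = {recall_at_k(scores, gold, k):.2f}")
```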
Zero-Shot Caption-Based Image Retrieval This is the zero-shot setting for caption-based image retrieval. The model performs caption-based image retrieval without finetuning on the training data. This challenges the capability of the pretrained model to understand the relations between image and text. We use the same splits of the dataset and the same evaluation metrics as those of caption-based image retrieval.
Visual Commonsense Reasoning Visual commonsense reasoning (VCR) is a task connected with cognition and requires visual understanding (Zellers et al., 2019). There are three sub-tasks in VCR: Q→A, QA→R, and Q→AR. Q→A refers to providing the answer based on the given image and question, and QA→R refers to providing the rationale based on the given image, question, and answer. In Q→AR, given an image and a question, the model should not only answer the question but also give the correct rationale for the choice. For each question, there are 4 candidate answers and 4 candidate rationales. The training set contains 80K images and 213K questions, the validation set contains 10K images and 27K questions, and the test set contains 10K images and 25K questions. We use accuracy as the evaluation metric.
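For clarity on the joint metric, a Q→AR prediction counts as correct only when both the chosen answer and the chosen rationale are correct, as in the following snippet:

```python
import numpy as np

def vcr_accuracies(ans_pred, ans_gold, rat_pred, rat_gold):
    """Accuracy for Q->A, QA->R, and the joint Q->AR metric, where a joint
    prediction is correct only if both the answer and the rationale match."""
    ans_correct = np.asarray(ans_pred) == np.asarray(ans_gold)
    rat_correct = np.asarray(rat_pred) == np.asarray(rat_gold)
    return ans_correct.mean(), rat_correct.mean(), (ans_correct & rat_correct).mean()

qa, qar, q_ar = vcr_accuracies([0, 1, 2, 3], [0, 1, 2, 0], [1, 1, 0, 2], [1, 0, 0, 2])
print(qa, qar, q_ar)  # 0.75 0.75 0.5
```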
3.2. Baselines
For the comparison with the previous methods, we mainly compare our InterBERT with the previous models that achieved outstanding performances on the downstream tasks as well as the recent multi-modal pretrained models.
Previous Methods For image retrieval, we compare InterBERT with SCAN (Lee et al., 2018), which is an architecture based on stacked cross-attention. For VCR, we compare InterBERT with R2C (Recognition to Cognition) (Zellers et al., 2019), which contains modules for grounding, contextualizing, and reasoning.
Multi-Modal Pretrained Models We compare our model with some recent multi-modal pretrained models. Specifically, we focus on the comparison with VilBERT and VL-BERT because our implementation details are similar to theirs, including the pretraining datasets and the number of object features. Moreover, VilBERT and VL-BERT are regarded as powerful baselines of two-stream and single-stream multi-modal pretrained models, respectively. (We focus on the comparison with methods that have released the code and models for pretraining and finetuning. Both of them have released full code, and the reported results proved reproducible; please refer to https://github.com/jiasenlu/vilbert_beta and https://github.com/jackroos/VL-BERT.)
3.3. Implementation details
We pretrain our model on Conceptual Captions (CC) (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011), and COCO Captions (Lin et al., 2014). For pretraining, we first extract the object representations of the images with a trained object detector. Specifically, the object representations and their bounding boxes are generated by an object detector based on Faster R-CNN (Ren et al., 2015) with a ResNet-101 backbone (He et al., 2016), trained on Visual Genome (Krishna et al., 2017) (https://github.com/peteanderson80/bottom-up-attention). We pretrain the model with AdamW (Loshchilov and Hutter, 2019) using a linear decay learning rate scheduler with a warm-up period. For the finetuning on Flickr30K image retrieval, the model reuses the output layer of the pretraining ITM task to compute the matching scores, and we finetune it on 8 Nvidia V100 GPUs with the AdamW optimizer and the same scheduler. For the finetuning on VCR, we use a smaller learning rate and train the model for fewer epochs. Further implementation details, including the hyperparameter settings, are given in Appendix A.2.
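The optimizer and schedule described above can be set up roughly as follows; since the exact hyperparameter values are not listed in this section, the numbers below are placeholders rather than the paper's settings:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_decay(optimizer, warmup_steps: int, total_steps: int):
    """Linear warm-up followed by linear decay of the learning rate."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(768, 768)                  # stand-in for InterBERT
optimizer = AdamW(model.parameters(), lr=1e-4,     # placeholder hyperparameters
                  betas=(0.9, 0.999), weight_decay=0.01)
scheduler = linear_warmup_decay(optimizer, warmup_steps=1000, total_steps=100000)

for step in range(5):                              # one optimizer/scheduler step per batch
    optimizer.step()
    scheduler.step()
    print(scheduler.get_last_lr())
```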
3.4. Results
| Models | IR R@1 | IR R@5 | IR R@10 | Zero-shot R@1 | Zero-shot R@5 | Zero-shot R@10 | VCR Q→A | VCR QA→R | VCR Q→AR |
|---|---|---|---|---|---|---|---|---|---|
| SCAN (Lee et al., 2018) | 48.6 | 77.7 | 85.2 | - | - | - | - | - | - |
| R2C (Zellers et al., 2019) | - | - | - | - | - | - | 63.8 | 67.2 | 43.1 |
| VisualBERT (Li et al., 2019b) | - | - | - | - | - | - | 70.8 | 73.2 | 52.2 |
| VilBERT (Lu et al., 2019a) | 58.2 | 84.9 | 91.5 | 31.9 | 61.1 | 72.8 | 72.4 | 74.5 | 54.0 |
| VL-BERT (Su et al., 2019) | - | - | - | - | - | - | 73.8 | 74.4 | 54.2 |
| InterBERT (w/o pt) | 53.1 | 80.6 | 87.9 | - | - | - | 63.6 | 63.1 | 40.3 |
| InterBERT | 61.9 | 87.1 | 92.7 | 49.2 | 77.6 | 86.0 | 73.1 | 74.8 | 54.9 |
Table 3 demonstrates the experimental results of our proposed model InterBERT as well as the compared baselines on the downstream tasks. In the experiment of image retrieval, InterBERT outperforms SCAN by a large margin (+13.3 (27.4%) in R@1, +9.4 (12.1%) in R@5, and +7.5 (8.8%) in R@10), and it also outperforms VilBERT by +3.7 (6.4%) in R@1, +2.2 (2.6%) in R@5, and +1.2 (1.3%) in R@10. As for zero-shot image retrieval, the advantage is significantly larger: it outperforms VilBERT by +17.3 (54.2%) in R@1, +16.5 (27.0%) in R@5, and +13.2 (18.1%) in R@10. In the experiment of VCR, InterBERT also significantly outperforms the baseline R2C by +9.3 (14.6%) in Q→A, +7.6 (11.3%) in QA→R, and +11.8 (27.4%) in Q→AR, and it also outperforms VilBERT by +0.7 (1.0%) in Q→A, +0.3 (0.4%) in QA→R, and +0.9 (1.7%) in Q→AR. Compared with VL-BERT, InterBERT also outperforms by +0.7 (1.3%) on the overall Q→AR accuracy.
InterBERT has advantages over the baselines in the tasks, especially in zero-shot image retrieval. Also, compared with VilBERT, InterBERT has an advantage in the number of parameters (173M vs 221M), which reflects the effects of single-stream interaction. The significant advantage in zero-shot learning demonstrates that our model has a strong capability of modeling image-text relations and transferring to downstream tasks without finetuning.
Also, we directly train our InterBERT without multi-modal pretraining to evaluate the effects of pretraining on the downstream tasks. To be more specific, as our pretrained model is initialized with the weights of BERT-base, we also train the InterBERT without pretraining with the same BERT initialization. From Table 3, the model without pretraining suffers from performance degradation (IR: -8.8 (-14.2%) in R@1, -6.5 (-7.5%) in R@5, and -4.8 (-5.2%) in R@10; VCR: -9.5 (-13.0%) in Q→A, -11.7 (-15.6%) in QA→R, and -14.6 (-26.6%) in Q→AR). This shows that effective multi-modal pretraining can significantly impact the model performance in downstream tasks.
3.5. Analysis
In this section, we conduct a series of analyses to evaluate the effects of MGM, the performance on single-modal downstream tasks, and the effects of weight initialization for pretraining.
| Models | VCR Q→A | VCR QA→R | VCR Q→AR |
|---|---|---|---|
| w/o MGM | 72.3 | 74.3 | 54.0 |
| InterBERT | 73.1 | 74.8 | 54.9 |
The Effects of MGM We conduct an ablation study on the validation set of VCR to evaluate the effects of MGM, which includes MSM and MRM. Specifically, we pretrain two models with different pretraining tasks: MLM+MOM+ITM and MSM+MRM+ITM. Table 4 demonstrates the results of the evaluation. It can be found that our proposed MSM and MRM are beneficial to the pretraining effects. The model trained with MSM and MRM outperforms the baseline by +0.8 in Q→A, +0.5 in QA→R, and +0.9 in Q→AR. The model trained with our tasks gains a stronger ability to model image and text by understanding contexts and building a stronger connection, so that it reaches better performance in a task that requires reasoning over image and text.
| Models | IR R@1 | IR R@5 | IR R@10 | Zero-shot R@1 | Zero-shot R@5 | Zero-shot R@10 |
|---|---|---|---|---|---|---|
| w/o ITM-hn | 60.2 | 86.2 | 92.3 | 43.3 | 74.0 | 82.9 |
| InterBERT | 61.9 | 87.1 | 92.7 | 49.2 | 77.6 | 86.0 |
The Effects of ITM with Hard Negatives We also evaluate the effect of our designed image-text matching. Specifically, it strongly impacts the model performance on image retrieval and its zero-shot version, so we conduct the ablation study on these two tasks to figure out whether it brings a significant improvement. Table 5 demonstrates the results of the evaluation. It can be found that our designed ITM with negative sampling based on TF-IDF significantly enhances the performance on the retrieval tasks. The model trained with our ITM outperforms the baseline on caption-based image retrieval by +1.7 in R@1, +0.9 in R@5, and +0.4 in R@10, and it also outperforms the baseline on zero-shot retrieval by +5.9 in R@1, +3.6 in R@5, and +3.1 in R@10. These results demonstrate that a matching task with higher difficulty in differentiating positive and negative samples contributes to the learning through pretraining. In particular, the boost in zero-shot retrieval shows that higher difficulty in image-text matching effectively improves cross-modal understanding.
| Model | QNLI | CoLA | SST-2 | STS-B | RTE | MNLI (m/mm) | QQP | MRPC | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| BERT-base | 91.5 | 56.7 | 93.2 | 88.2 | 65.0 | 83.7 / 84.1 | 87.9 | 89.6 | 82.0 |
| Single Stream | 90.8 | 52.3 | 91.6 | 88.6 | 59.2 | 82.6 / 84.1 | 87.8 | 86.4 | 80.0 |
| InterBERT | 91.1 | 57.3 | 92.3 | 88.9 | 64.3 | 84.1 / 83.7 | 88.1 | 88.6 | 81.8 |
Performances in the Single-Modal Tasks While multi-modal pretraining demonstrates its effects in the aforementioned downstream tasks, it remains a question whether the model still preserves the knowledge of single-modal representation and whether it can still achieve comparable performances in single-modal tasks. To evaluate the model's robustness, we conduct experiments on 8 tasks of GLUE (Wang et al., 2019): QNLI, CoLA, SST-2, STS-B, RTE, MNLI, QQP, and MRPC. We compare InterBERT with BERT-base and the single-stream multi-modal pretrained model (a plain BERT architecture).

From Table 6, it can be found that InterBERT achieves performance similar to BERT-base (Avg: 82.0 vs 81.8), and it significantly outperforms the single-stream model without the two-stream extraction module (Avg: 81.8 vs 80.0). This indicates that InterBERT with the extraction module preserves the ability to model single-modal representations and can adapt to single-modal downstream tasks without a significant performance decrease.
The Effects of Initialization In our experiments, we find, to our surprise, that different weight initializations for pretraining have different impacts on the finetuning of different downstream tasks. As mentioned above, we initialize a part of our model with the weights of the pretrained BERT-base model. Here we compare the models with and without BERT initialization on the downstream tasks. While such initialization has little impact on the retrieval tasks, it is surprisingly important for the finetuning on VCR. The model without BERT initialization suffers from a severe performance downgrade: its performances in Q→A and QA→R are 65.3 (-10.7%) and 64.4 (-13.9%), and VilBERT without BERT initialization performs even worse (61.7 in Q→A and 59.7 in QA→R). This demonstrates the importance of the pretrained NLP model for multi-modal tasks concerned with reasoning, as it improves text processing and thus enhances the model's language understanding capability. It also shows that sufficient interaction between modalities through all-attention can alleviate the problem. Still, this can serve as a starting point for research into the effect of initialization on multi-modal pretraining.
4. Related work
In this section, we review the studies in pretraining methods, especially the pretraining in NLP and multi-modal pretraining.
Single-Modal Pretraining Recent years have witnessed the development of pretraining in NLP. ELMo (Peters et al., 2018), an LSTM-based (Hochreiter and Schmidhuber, 1997) language model, attracted the attention of NLP researchers as it demonstrated that pretraining is also effective for NLP tasks. Later, ULMFiT (Howard and Ruder, 2018) proposed several techniques and gained improvements in several downstream tasks. Yet these models are based on the conventional recurrent neural network architecture. GPT (Radford et al., 2018) is the first pretrained language model based on the Transformer (Vaswani et al., 2017) architecture; it is a unidirectional decoder. To make use of the full context in both directions, Devlin et al. (2019) proposed BERT, a bidirectional encoder based on the Transformer. BERT reached state-of-the-art performances in a number of NLP downstream tasks, including natural language inference (Wang et al., 2019) and question answering (Rajpurkar et al., 2016). A series of studies followed BERT (Liu et al., 2019; Yang et al., 2019; Lan et al., 2019). These models have achieved superior performances over the baselines, and some even surpassed human performance.
Multi-modal Pretraining The success of pretraining in NLP raised interest in multi-modal pretraining. VideoBERT (Sun et al., 2019b) is regarded as the first work in multi-modal pretraining. It is a model pretrained on extracted video frame features and texts. A contemporaneous work of VideoBERT is CBT (Sun et al., 2019a), which is also pretrained on video-text pairs. Miech et al. (2020b) leveraged unlabeled narrated videos for video representation learning.
Inspired by the early work in multi-modal pretraining, more researchers have turned their focus to visual-linguistic pretraining. There are mainly two streams of model architectures for this task. One is the single-stream model (Alberti et al., 2019; Chen et al., 2019; Li et al., 2019a, b; Su et al., 2019; Gan et al., 2020; Li et al., 2020; Zhou et al., 2020). Li et al. (2019a) processed the concatenation of objects and words with a BERT model and pretrained it with the three conventional tasks. Chen et al. (2019) and Qi et al. (2020) proposed similar methods but with more pretraining tasks and larger datasets. Gan et al. (2020) further improved the model with an adversarial training strategy. Su et al. (2019) used an identical architecture but pretrained the object detector and added single-modal data. Huang et al. (2020) attempted to input pixels directly instead of detected objects. Li et al. (2020) leveraged object labels to enhance cross-modal alignments. Zhou et al. (2020) proposed a unified single-stream model which jointly learns caption generation and VQA.
The other form of model architecture is the two-stream model (Tan and Bansal, 2019; Lu et al., 2019a, b; Yu et al., 2020). Tan and Bansal (2019) proposed a two-stream model with co-attention and pretrained it only on in-domain data. Lu et al. (2019a) proposed a similar architecture with a more complex co-attention and pretrained the model on out-of-domain data, and Lu et al. (2019b) further improved VilBERT with multi-task learning. Recently, Yu et al. (2020) incorporated scene graphs into the model, which brought performance gains. Aside from these works, Singh et al. (2020) discussed the impact of the choice of pretraining datasets on downstream task performance.
In this work, we focus on the design of the architecture and the pretraining tasks. The single-stream models mostly apply BERT to multi-modal pretraining in a straightforward fashion, while the two-stream models have separate encoders for the modalities and a co-attention module for cross-modal interaction. These models either lack the independence of each modality or lack sufficient interaction across modalities. Furthermore, there is still room for designing training tasks for more effective pretraining. Compared with previous work, our proposed method has several significant differences. Our model architecture is effective in capturing modal interaction with an all-attention-based module and obtaining modal independence with the two-stream extraction module. Besides, our proposed masked group modeling improves the model's ability to predict a span or a region, so that the model can be more effective.
5. Conclusion
In this paper, we propose a new approach for multi-modal pretraining, InterBERT. The model architecture consists of a single-stream interaction module for sufficient interaction and a two-stream extraction module for the separation of modal information. Furthermore, to strengthen its ability to model image and language, we pretrain the model with the tasks of MGM and ITM-hn. Experimental results demonstrate that our InterBERT can outperform the baselines and rival the recent multi-modal pretrained models in the downstream tasks, and our online deployment for recommendation shows its advantages in CTR and exposed category width over the single-modal baseline. The analyses show that the pretraining tasks enhance the model performance, and that InterBERT can adapt to single-modal tasks without significant performance downgrade. Also, we find that the weight initialization for pretraining makes a difference to downstream tasks. We hope this study can provide some insights into multi-modal pretraining, and in the future, we will endeavor to find better model architectures and training tasks for the improvement of multi-modal representation learning.
References
- Alberti et al. (2019) Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. 2019. Fusion of Detected Objects in Text for Visual Question Answering. In EMNLP-IJCNLP 2019. 2131–2140.
- Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In CVPR 2018. 6077–6086.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR 2015.
- Bentivogli et al. (2009) Luisa Bentivogli, Bernardo Magnini, Ido Dagan, Hoa Trang Dang, and Danilo Giampiccolo. 2009. The Fifth PASCAL Recognizing Textual Entailment Challenge. In TAC 2009.
- Cer et al. (2017) Daniel M. Cer, Mona T. Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In SemEval@ACL 2017. 1–14.
- Chen et al. (2019) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. UNITER: Learning UNiversal Image-TExt Representations. CoRR abs/1909.11740 (2019).
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In CVPR 2009. 248–255.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT 2019. 4171–4186.
- Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified Language Model Pre-training for Natural Language Understanding and Generation. In NeurIPS 2019. 13042–13054.
- Gan et al. (2020) Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. 2020. Large-Scale Adversarial Training for Vision-and-Language Representation Learning. arXiv preprint arXiv:2006.06195 (2020).
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In CVPR 2016. 770–778.
- Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. CoRR abs/1606.08415 (2016).
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
- Howard and Ruder (2018) Jeremy Howard and Sebastian Ruder. 2018. Universal Language Model Fine-tuning for Text Classification. In ACL 2018. 328–339.
- Huang et al. (2020) Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers. arXiv preprint arXiv:2004.00849 (2020).
- Joshi et al. (2019) Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2019. SpanBERT: Improving Pre-training by Representing and Predicting Spans. CoRR abs/1907.10529 (2019).
- Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. 2017. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations. International Journal of Computer Vision 123, 1 (2017), 32–73.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NeurIPS 2012. 1106–1114.
- Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. CoRR abs/1909.11942 (2019).
- Lee et al. (2018) Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked Cross Attention for Image-Text Matching. In ECCV 2018. 212–228.
- Li et al. (2019a) Gen Li, Nan Duan, Yuejian Fang, Daxin Jiang, and Ming Zhou. 2019a. Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training. CoRR abs/1908.06066 (2019).
- Li et al. (2019b) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. VisualBERT: A Simple and Performant Baseline for Vision and Language. CoRR abs/1908.03557 (2019).
- Li et al. (2020) Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao. 2020. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. CoRR abs/2004.06165 (2020).
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV 2014. 740–755.
- Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR abs/1907.11692 (2019).
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In ICLR 2019.
- Lu et al. (2019a) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019a. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In NeurIPS 2019. 13–23.
- Lu et al. (2019b) Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan Lee. 2019b. 12-in-1: Multi-Task Vision and Language Representation Learning. CoRR abs/1912.02315 (2019).
- Miech et al. (2020a) Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2020a. End-to-End Learning of Visual Representations From Uncurated Instructional Videos. In CVPR 2020. 9876–9886.
- Miech et al. (2020b) Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. 2020b. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9879–9889.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016. 280–290.
- Ordonez et al. (2011) Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. 2011. Im2Text: Describing Images Using 1 Million Captioned Photographs. In NeurIPS 2011. 1143–1151.
- Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. (2017).
- Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In NAACL-HLT 2018. 2227–2237.
- Qi et al. (2020) Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, and Arun Sacheti. 2020. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966 (2020).
- Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI Technical Report (2018).
- Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100, 000+ Questions for Machine Comprehension of Text. In EMNLP 2016. 2383–2392.
- Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NeurIPS 2015. 91–99.
- Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A Neural Attention Model for Abstractive Sentence Summarization. In EMNLP 2015. 379–389.
- Sharma et al. (2019) Lakshay Sharma, Laura Graesser, Nikita Nangia, and Utku Evci. 2019. Natural Language Understanding with the Quora Question Pairs Dataset. CoRR abs/1907.01041 (2019).
- Sharma et al. (2018) Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In ACL 2018. 2556–2565.
- Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR 2015.
- Singh et al. (2020) Amanpreet Singh, Vedanuj Goswami, and Devi Parikh. 2020. Are we pretraining it right? Digging deeper into visio-linguistic pretraining. arXiv preprint arXiv:2004.08744 (2020).
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013. 1631–1642.
- Su et al. (2019) Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2019. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. CoRR abs/1908.08530 (2019).
- Sun et al. (2019a) Chen Sun, Fabien Baradel, Kevin Murphy, and Cordelia Schmid. 2019a. Contrastive Bidirectional Transformer for Temporal Representation Learning. CoRR abs/1906.05743 (2019).
- Sun et al. (2019b) Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. 2019b. VideoBERT: A Joint Model for Video and Language Representation Learning. CoRR abs/1904.01766 (2019).
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In NeurIPS 2014. 3104–3112.
- Tan and Bansal (2019) Hao Tan and Mohit Bansal. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In EMNLP-IJCNLP 2019. 5099–5110.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017. 5998–6008.
- Wang et al. (2019) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In ICLR 2019.
- Wang et al. (2016) Liwei Wang, Yin Li, and Svetlana Lazebnik. 2016. Learning Deep Structure-Preserving Image-Text Embeddings. In CVPR 2016. 5005–5013.
- Warstadt et al. (2019) Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. Neural Network Acceptability Judgments. TACL 7 (2019), 625–641.
- Williams et al. (2018) Adina Williams, Nikita Nangia, and Samuel R. Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL-HLT 2018. 1112–1122.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. CoRR abs/1910.03771 (2019).
- Yang et al. (2019) Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. In NeurIPS 2019. 5754–5764.
- Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2 (2014), 67–78.
- Yu et al. (2020) Fei Yu, Jiji Tang, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang. 2020. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph. arXiv preprint arXiv:2006.16934 (2020).
- Zellers et al. (2019) Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From Recognition to Cognition: Visual Commonsense Reasoning. In CVPR 2019. 6720–6731.
- Zhou et al. (2020) Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and Jianfeng Gao. 2020. Unified Vision-Language Pre-Training for Image Captioning and VQA. In AAAI 2020. 13041–13049.
Appendix A Appendix
A.1. Data statistics
Pretraining Datasets The image caption datasets for pretraining are Conceptual Captions (CC) (Sharma et al., 2018), SBU Captions (Ordonez et al., 2011), and COCO Captions (Lin et al., 2014). The detailed data statistics are demonstrated in Table 7. In CC and SBU, each image is paired with one text as its description, while in COCO around 5 texts describe the same image. We also provide the number of images in COCO in the table.
| Datasets | Training | Validation |
|---|---|---|
| Conceptual Captions | 3.3M | 14K |
| SBU | 890K | 10K |
| COCO | 587K (117K) | 15K (3K) |

| Datasets | Training | Validation | Testing |
|---|---|---|---|
| Flickr30K | i: 29K, t: 145K | i: 1K, t: 5K | i: 1K, t: 5K |
| VCR | i: 80K, t: 213K | i: 10K, t: 27K | i: 10K, t: 25K |
Downstream Datasets We demonstrate the detailed data statistics of the datasets for finetuning in Table 8. The numbers of images (i) and texts (t) in each dataset are provided.
Dataset for Online Deployment While our deployment relies on a larger-scale dataset, we select a subset containing around 3M samples for release, which we call TaoMultimodal. Here we provide more details about the data construction and preprocessing.
We apply the same preprocessing methods to both datasets. The preprocessing includes object detection for object representations and data cleaning for texts. We obtain object representations by using an object detector based on Faster-RCNN, which is trained on data from the mobile Taobao. There are categories for object classification. We extract the bounding boxes with confidence scores larger than , and we obtain no more than objects for each image (unlike the other datasets, the images in TaoMultimodal contain relatively few objects). The size of the object representations is . As to the data cleaning for text, we first remove the titles without any Chinese character. Moreover, we conduct word segmentation with the AliNLP tool (https://data.aliyun.com/product/nlp) on the texts, and truncate the text by words in order to make sure the text is no longer than characters. Furthermore, we remove titles that trigger our spam detector, including texts concerned with pornography, abuse, politics, terrorism, etc. Finally, we obtain a dataset of 3.1M image-text pairs for pretraining and 200K for finetuning.
A.2. Implementation details
In the following, we introduce the details of our implementation in pretraining and finetuning on each downstream task, including the model architecture, optimizer, hyperparameters, etc.
Pretraining Here we provide the experimental details of our implementation for pretraining. The object representations of the images as well as their bounding boxes are generated by an object detector based on Faster R-CNN (Ren et al., 2015) with a ResNet-101 backbone (He et al., 2016), trained on Visual Genome (Krishna et al., 2017). This detector is the one used in the bottom-up and top-down attention model for image captioning (Anderson et al., 2018), and we downloaded the pretrained detector from the provided link (https://github.com/peteanderson80/bottom-up-attention). For text processing, we tokenize the texts with BERT's tokenizer and directly use BERT-base's embedding layer for word embedding. The vocabulary size is and the embedding size is . For consistency between the word embeddings and the object representations, we transform the object representations to the word embedding dimension through an MLP.
The hidden size of the multi-head attention is set to the same value. The number of attention heads is . For the FFN, both the input and output sizes equal the hidden size for stacking layers, and the intermediate size is . As to the LN layer inside each layer, we use BERT's LN with . The single-stream interaction module consists of stacked Transformer layers, and its weight parameters are initialized with the pretrained BERT-base model. The two-stream extraction module contains two Transformers, one per modality, on top of the single-stream interaction module, each consisting of Transformer layers. The new weight parameters are randomly initialized from a Gaussian distribution with zero mean and a standard deviation of , following Devlin et al. (2019). We pretrain the model with AdamW (Loshchilov and Hutter, 2019) with an initial learning rate of , , , and a weight decay of . We apply a linear decay learning rate scheduler with a warm-up period of steps. The batch size for training is . On the pretraining dataset of image-caption pairs, we pretrain our InterBERT on 8 V100 GPUs for epochs.
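A rough sketch of this initialization scheme is given below: copy the BERT-base weights into the interaction module wherever parameter names and shapes match, and draw the remaining new weight matrices from a zero-mean Gaussian. The checkpoint path and the standard deviation value are placeholders:

```python
import torch
import torch.nn as nn

def init_from_bert(model: nn.Module, bert_state: dict, std: float = 0.02):
    """Copy BERT weights into `model` wherever names and shapes match;
    re-initialize the remaining new weight matrices from N(0, std^2) and
    leave 1-d parameters (biases, LayerNorm) at their module defaults."""
    own_state = model.state_dict()
    matched = {k: v for k, v in bert_state.items()
               if k in own_state and own_state[k].shape == v.shape}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name not in matched and param.dim() > 1:
                nn.init.normal_(param, mean=0.0, std=std)
    result = model.load_state_dict(matched, strict=False)
    print(f"copied {len(matched)} tensors; "
          f"{len(result.missing_keys)} entries not found in the checkpoint")

# Hypothetical usage with a checkpoint downloaded beforehand:
#   bert_state = torch.load("bert-base-uncased.bin", map_location="cpu")
#   init_from_bert(interaction_module, bert_state)
```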
Finetuning For the finetuning on Flickr30K image retrieval, the maximum number of objects is and the actual numbers are between and . The model reuses the output layer of the pretraining ITM task to compute the matching scores. We finetune the model on 8 V100 GPUs with the AdamW optimizer, an initial learning rate of , and a linear decay learning rate scheduler with a warm-up period of steps, and we finetune the model for epochs. For the finetuning on VCR, we use hyperparameters similar to those for Flickr30K, but with a smaller learning rate and a smaller batch size, and we finetune the model for only epochs. Furthermore, we apply exponential moving average with a rate of to the finetuned models to obtain the final model, so that it is more robust and reaches better performance in testing.
Deployment We store the representation vector from the multi-modal pretrained model and from the BERT-based baseline for each item in the item pool. We then send them to a vector-based KNN service for nearest-neighbor search. For each trigger item, we collect the top nearest-neighbor items offline; this choice of hyperparameter is based on our preliminary online experiments. The online service receives the triggers from the user histories and retrieves the recalled candidate items accordingly. This query service takes around 10ms. A separate ranking system is responsible for producing the final results. Both methods share the same preprocessing and postprocessing. For the A/B test, the A and B buckets add a trail of multi-modal recall and of single-modal recall, respectively; this is the only difference between the buckets. Each bucket receives at least 5% of the traffic, and the test lasts at least 7 days.
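The offline recall step can be approximated with plain numpy as below (a stand-in for the production vector-based KNN service); the vector sizes and k are illustrative:

```python
import numpy as np

def topk_neighbors(item_vecs: np.ndarray, trigger_vecs: np.ndarray, k: int = 100):
    """Return the indices of the k nearest items (cosine similarity) for each trigger."""
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    triggers = trigger_vecs / np.linalg.norm(trigger_vecs, axis=1, keepdims=True)
    sims = triggers @ items.T                        # (num_triggers, num_items)
    return np.argsort(-sims, axis=1)[:, :k]

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(10000, 768))            # item representations from the model
trigger_vecs = item_vecs[:3]                          # items from a user's history
# In production the trigger item itself would be excluded from its own result list.
print(topk_neighbors(item_vecs, trigger_vecs, k=5))
```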
Hardware Configuration The experiments are conducted on a Linux server equipped with an Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz, 512GB RAM and 8 NVIDIA V100-SXM2-16GB GPUs.
A.3. Details of the GLUE tasks
The GLUE benchmark (Wang et al., 2019) consists of a series of NLP tasks, including QNLI, CoLA, SST-2, STS-B, RTE, MNLI, QQP, and MRPC. We use them to evaluate the robustness of InterBERT in single-modal downstream tasks.
QNLI Question Natural Language Inference is a binary classification task derived from SQuAD (the Stanford Question Answering Dataset) (Rajpurkar et al., 2016; Wang et al., 2019). Given a question-sentence pair, it requires the model to judge whether the sentence contains the correct answer to the question.
CoLA The Corpus of Linguistic Acceptability (Warstadt et al., 2019) is a task of binary sentence classification. It requires the algorithms to check whether an English sentence is linguistically acceptable (grammatical and consistent with the world knowledge).
SST-2 The Stanford Sentiment Treebank (Socher et al., 2013) is a task of binary sentiment classification. It requires the algorithms to check whether a sentence is positive or negative. The sentences are extracted from movie reviews with human annotations.
STS-B The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a task of scoring semantic similarity. The sentences are extracted from news headlines and other sources. The algorithms should learn to score the semantic similarity of two sentences.
RTE Recognizing Textual Entailment (Bentivogli et al., 2009) is a task of natural language inference. This task provides a sentence pair and requires the algorithms to check the relation between the sentences, including "entailment", "contradiction", and "neutral".
MNLI Multi-Genre Natural Language Inference (Williams et al., 2018) is a task of entailment with a large dataset. It requires the algorithm to figure out the relation of a pair of sentences. The relations include “entailment”, “contradiction”, and “neutral”.
QQP Quora Question Pairs (Sharma et al., 2019) is a task to check whether two questions are semantically identical. The questions are extracted from Quora (http://quora.com/).
MRPC Microsoft Research Paraphrase Corpus is a dataset of sentence pairs from news websites. The task is to check whether two sentences are semantically identical.
We truncate the input texts to ensure that the maximum length is . The input texts are all lower-cased. We use a batch size of and a learning rate of . We finetune the model on 8 Nvidia V100 GPUs with gradient accumulation for epochs.