
VLP: A Survey on Vision-Language Pre-training

Feilong Chen, Duzhen Zhang, Minglun Han, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu

[1] Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
[2] School of Future Technology, University of Chinese Academy of Sciences, Beijing 100049, China
[3] School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China

These authors contributed equally to this work.
Abstract

In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial works have shown they are beneficial for downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that this survey can shed light on future research in the VLP field.

keywords:
Vision and language, Pre-training, Transformers

1 Introduction

Making machines respond in ways similar to humans has been a relentless goal of AI researchers. To enable machines to perceive and think, researchers propose a series of related tasks, such as face recognition, reading comprehension, and human-machine dialogue, to train and evaluate the intelligence of machines in a particular aspect. Specifically, domain experts manually construct standard datasets and then train and evaluate relevant models on them. However, due to the limitations of related technologies, it is often necessary to train on a large amount of labelled data to obtain a better and more capable model. The recent emergence of pre-training models based on the Transformer structure vaswani2017attention has alleviated this problem. They are first pre-trained via self-supervised learning that typically exploits auxiliary tasks (pre-training objectives) to mine supervision signals from large-scale unlabelled data to train the model, thereby learning universal representations. Then they can achieve surprising effectiveness by fine-tuning with only a tiny amount of manually-labelled data on downstream tasks. Since the advent of BERT DBLP:conf/naacl/DevlinCLT19 in natural language processing (NLP), various pre-training models have sprung up in the uni-modal field, such as Vision Transformer (ViT) dosovitskiy2020image in computer vision (CV) and Wav2Vec DBLP:conf/interspeech/SchneiderBCA19 in speech. Substantial works have shown they are beneficial for downstream uni-modal tasks and avoid training a new model from scratch.

Similar to the uni-modal field, there is also a problem of less high-quality labelled data in the multi-modal field. The natural question is, can the above pre-training method be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. In this paper, we focus on mainstream vision-language pre-training (VLP), including image-text and video-text pre-training. VLP mainly learns the semantic correspondence between different modalities by pre-training on large-scale data. For example, in image-text pre-training, we expect the model to associate “dog” in text with what “dog” looks like in images. In video-text pre-training, we expect the model to map objects/actions in the text to objects/actions in the video. To achieve this goal, the VLP objectives and model architecture need to be cleverly designed to allow the model to mine the associations between different modalities.

To give readers a better global grasp of VLP, we first comprehensively review its recent advances and focus on five significant aspects:

  • Feature extraction. This section includes the preprocessing and representation methods of image, video, and text in VLP models (see Section 2).

  • Model architecture. We introduce the architecture of the VLP models from two different perspectives: Single-stream versus Dual-stream from multi-modal fusion perspective, and Encoder-only versus Encoder-decoder from the overall architectural design perspective (see Section 3).

  • Pre-training objectives. Pre-training objectives are the core of VLP, mainly used to guide the model to learn vision-language associated information. We summarize typical and characteristic pre-training objectives divided into completion, matching, temporal, and particular types (see Section 4).

  • Pre-training datasets. Data is critical for VLP. We briefly introduce mainstream corpora for VLP and their specific sizes (see Section 5).

  • Downstream tasks. Various tasks require cooperative knowledge of both vision and language. We discuss the basic details and goals of these tasks (see Section 6).

Then we summarize the specific state-of-the-art (SOTA) VLP models in detail (see Section 7). Finally, we conclude the paper and discuss new frontiers in VLP (see Section 8).

Although there are many surveys on pretrained language models qiu2020pre ; han2021pre and pretrained vision models han2022survey , to the best of our knowledge, this is the first survey focused on VLP. We hope that our survey can help researchers better understand this field and inspire them to design better models.

2 Feature Extraction

This section describes how VLP models preprocess and represent images, videos, and text to obtain the corresponding features.

2.1 Feature Extraction

2.1.1 Image Feature Extraction

(1) OD-based Region Features (OD-RFs).

Most previous work lu2019vilbert ; li2019visualbert ; li2020oscar on VLP utilizes pre-trained object detectors to extract visual features. The most commonly used object detection model is Faster R-CNN ren2015faster with bottom-up attention anderson2018bottom . It is designed to identify objects belonging to certain classes and localize them with bounding boxes. Using Faster R-CNN, VLP models obtain the OD-based region feature embedding $V=[o_{1},o_{2},\dots,o_{k}]$ of an image with $k$ selected regions. Each region feature $o_{i}$ is a 2048-d Region-of-Interest (RoI) feature with its bounding box. The bounding box is defined by the coordinates of the bottom-left and top-right corners of the region. VLP models use the bounding boxes to construct 5-d vectors, which are embedded into a high-dimensional representation (2048-d) named the visual geometry embedding. The OD-RFs are obtained by adding the OD-based region feature embedding to its visual geometry embedding. Although OD-RFs have brought impressive performance, extracting region features can be time-consuming. To relieve this problem, the pre-trained object detectors are usually frozen during pre-training, which can limit the capacity of VLP models.
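As a minimal illustration of how OD-RFs are assembled, the following sketch (in PyTorch, with placeholder tensors; the 5-d geometry vector built from normalized box corners plus relative area is one common but assumed choice, not the exact recipe of any cited model) combines RoI features with a projected visual geometry embedding:

```python
# Sketch: combine OD-based region features with a visual geometry embedding.
import torch
import torch.nn as nn

k, d = 36, 2048                           # k detected regions, 2048-d RoI features
roi_feats = torch.randn(k, d)             # placeholder Faster R-CNN RoI features

# Placeholder normalized boxes (x1, y1, x2, y2) with guaranteed positive area.
xy1 = torch.rand(k, 2) * 0.5
wh = torch.rand(k, 2) * 0.5
boxes = torch.cat([xy1, xy1 + wh], dim=-1)

# 5-d geometry vector: corner coordinates plus relative area of each box (assumption).
area = ((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])).unsqueeze(-1)
geometry = torch.cat([boxes, area], dim=-1)       # (k, 5)

geo_proj = nn.Linear(5, d)                        # embed geometry into 2048-d
od_rf = roi_feats + geo_proj(geometry)            # OD-RFs: (k, 2048)
print(od_rf.shape)                                # torch.Size([36, 2048])
```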

(2) CNN-based Grid Features (CNN-GFs).

VLP models li2020unicoder ; wang2021simvlm extract visual features by utilizing convolutional neural networks (CNNs) to obtain grid features. On the one hand, VLP models can train the CNNs end-to-end by directly using the grid features jiang2020defense . On the other hand, VLP models can also first discretize the grid features using a learned vision dictionary and then feed them into the cross-modal module.

(3) ViT-based Patch Features (ViT-PFs).

Inspired by ViT dosovitskiy2020image ; radford2021learning , VLP models reshape the image $I_{i}\in\mathbb{R}^{H\times W\times C}$ into a sequence of flattened 2D patches $I_{p}\in\mathbb{R}^{N\times(P^{2}\cdot C)}$, where $(H,W)$ is the resolution of the original image, $C$ is the number of channels, $(P,P)$ is the resolution of each image patch, and $N=HW/P^{2}$ is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. An input image $I_{i}$ is encoded into a sequence of embeddings $\{v_{\mathrm{cls}},v_{1},\dots,v_{N}\}$, where $v_{\mathrm{cls}}$ is the embedding of the [CLS] token.
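The following sketch illustrates the patchification described above (PyTorch, with illustrative resolution, patch size, and embedding dimension; the linear projection and the learnable [CLS] embedding are standard ViT-style choices assumed here, not the exact implementation of any cited model):

```python
# Sketch: split an image into N non-overlapping P x P patches and embed them.
import torch
import torch.nn as nn

H, W, C, P, D = 224, 224, 3, 16, 768
N = (H * W) // (P * P)                                # number of patches

image = torch.randn(1, C, H, W)
patches = image.unfold(2, P, P).unfold(3, P, P)       # (1, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, N, P * P * C)

proj = nn.Linear(P * P * C, D)                        # linear patch embedding
cls_token = torch.zeros(1, 1, D)                      # placeholder learnable [CLS] embedding
tokens = torch.cat([cls_token, proj(patches)], dim=1) # {v_cls, v_1, ..., v_N}
print(tokens.shape)                                   # torch.Size([1, 197, 768])
```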

2.1.2 Video Feature Extraction

A video clip is denoted as $M$ frames (images). VLP models luo2021clip4clip ; fang2021clip2video extract the frame features by using the methods mentioned above. The two most commonly used features are CNN-GFs and ViT-PFs. For CNN-GFs, VLP models first use ResNet he2016deep pre-trained on ImageNet deng2009imagenet , or SlowFast feichtenhofer2019slowfast and I3D carreira2017quo pre-trained on Kinetics kay2017kinetics , to extract 2D and 3D visual features for each video frame. These features are concatenated as visual features and fed through a fully-connected (FC) layer to be projected into the same lower-dimensional space as the token embeddings. For ViT-PFs, a video clip $V_{i}\in\mathbb{R}^{M\times H\times W\times C}$ consists of $M$ frames of resolution $H\times W$ (with $M=1$ for images). Following the protocol in ViT and TimeSformer, the input video clip is divided into $M\times N$ non-overlapping spatio-temporal patches of size $P\times P$, where $N=HW/P^{2}$.
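A minimal sketch of the spatio-temporal patching step for ViT-PFs is given below (PyTorch, illustrative sizes; the subsequent temporal/spatial attention of TimeSformer is omitted):

```python
# Sketch: divide a video clip into M x N non-overlapping spatio-temporal patches.
import torch

M, H, W, C, P = 8, 224, 224, 3, 16
N = (H * W) // (P * P)

clip = torch.randn(M, C, H, W)                        # a clip of M frames
patches = clip.unfold(2, P, P).unfold(3, P, P)        # (M, C, H/P, W/P, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(M * N, P * P * C)
print(patches.shape)                                  # (M*N, P*P*C) spatio-temporal patches
```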

2.1.3 Text Feature Extraction

For the textual features, following pre-trained language models such as BERT DBLP:conf/naacl/DevlinCLT19 , RoBERTa liu2019roberta , ALBERT lan2019albert , and XLNet yang2019xlnet , VLP models li2019visualbert ; zhang2021vinvl ; zeng2021multi first segment the input sentence into a sequence of subwords. Then, a start-of-sequence token and an end-of-sequence token are inserted at the beginning and the end of the sequence to form the input text sequence. Text input representations are computed by summing the corresponding word embedding, text position embedding, and text type embedding.
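The text-side input representation can be sketched as follows (PyTorch, with placeholder subword ids; the BERT-like vocabulary size, maximum length, and dimension are assumptions for illustration):

```python
# Sketch: text input representation = word + position + type embeddings.
import torch
import torch.nn as nn

vocab_size, max_len, n_types, d = 30522, 512, 2, 768
word_emb = nn.Embedding(vocab_size, d)
pos_emb = nn.Embedding(max_len, d)
type_emb = nn.Embedding(n_types, d)

# [CLS] w_1 ... w_L [SEP]  (placeholder subword ids)
token_ids = torch.tensor([[101, 2023, 2003, 1037, 3899, 102]])
positions = torch.arange(token_ids.size(1)).unsqueeze(0)
types = torch.zeros_like(token_ids)                   # all tokens are text (type 0)

text_repr = word_emb(token_ids) + pos_emb(positions) + type_emb(types)
print(text_repr.shape)                                # torch.Size([1, 6, 768])
```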

2.2 Feature Representation

To make full use of uni-modal pre-trained models, VLP models can send the visual or textual features to a transformer encoder vaswani2017attention . Specifically, VLP models can utilize a standard transformer encoder with random initialization to generate the visual or textual representations. In addition, VLP models can utilize a pre-trained visual transformer, such as ViT or DeiT pmlr-v139-touvron21a , to encode the ViT-PFs, and a pre-trained textual transformer, such as BERT, to encode the textual features. For simplicity, we refer to these transformers as Xformer.

3 Model Architecture

In this section, we introduce the architecture of the VLP models from two different perspectives: (1) Single-stream versus Dual-stream from multi-modal fusion perspective, and (2) Encoder-only versus Encoder-decoder from the overall architectural design perspective.

Figure 1: Illustration of two types of model architectures for VLP.

3.1 Single-stream versus Dual-stream

Single-stream Architecture.

The single-stream architecture li2019visualbert ; chen2020uniter ; chen2022improving refers to a design in which the text and visual features are concatenated and fed into a single transformer block, as shown in Figure 1 (a). The single-stream structure utilizes merged attention to fuse the multimodal inputs. It is more parameter-efficient, as the same set of parameters is used for both modalities.

Dual-stream Architecture.

The dual-stream architecture zhang2020devlbert ; dou2021empirical refers to a design in which the text and visual features are not concatenated but are sent to two different transformer blocks independently, as shown in Figure 1 (b). These two transformer blocks do not share parameters. To achieve higher performance, cross-attention (shown by the dotted line in Figure 1 (b)) is used to enable cross-modal interaction. To achieve higher efficiency, the cross-attention between the visual and textual transformer blocks can also be omitted.
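The two fusion styles can be contrasted with the following sketch (PyTorch; layer sizes are illustrative, and sharing one cross-attention module for both directions is a simplification rather than the design of any specific cited model):

```python
# Sketch: single-stream (merged attention) vs dual-stream (cross-attention) fusion.
import torch
import torch.nn as nn

d, heads = 768, 12
text = torch.randn(1, 16, d)                          # textual features
vision = torch.randn(1, 50, d)                        # visual features

# (a) Single-stream: concatenate, then self-attend over the merged sequence.
single_block = nn.TransformerEncoderLayer(d, heads, batch_first=True)
fused = single_block(torch.cat([text, vision], dim=1))   # (1, 66, d)

# (b) Dual-stream: separate blocks, optionally tied by cross-attention.
text_block = nn.TransformerEncoderLayer(d, heads, batch_first=True)
vision_block = nn.TransformerEncoderLayer(d, heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)

t, v = text_block(text), vision_block(vision)
t_fused, _ = cross_attn(query=t, key=v, value=v)      # text attends to vision
v_fused, _ = cross_attn(query=v, key=t, value=t)      # vision attends to text
print(fused.shape, t_fused.shape, v_fused.shape)
```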

3.2 Encoder-only versus Encoder-decoder

Many VLP models adopt the encoder-only architecture, where the cross-modal representations are directly fed into an output layer to generate the final outputs. In contrast, other VLP models advocate using a transformer encoder-decoder architecture, where the cross-modal representations are first fed into a decoder and then to an output layer.

4 Pre-training Objectives

This section introduces how we pre-train VLP models by using different pre-training objectives, which are crucial for learning the universal representation of vision-language. We summarize the pre-training objectives into four categories: completion, matching, temporal, and particular types.

  • Completion reconstructs the masked elements by leveraging the unmasked remainder to understand the modality (see Sections 4.1, 4.2, and 4.3).

  • Matching unifies vision and language into a shared hidden space to generate universal vision-language representations (see Sections 4.4, 4.5, and 4.6).

  • Temporal objectives learn good representations by reordering disrupted input sequences (see Section 4.7).

  • Particular types consist of other pre-training objectives, such as visual question answering and visual captioning (see Section 4.8).

Now we introduce the most commonly used pre-training objectives.

4.1 Masked Language Modeling

Masked language modeling (MLM), which was first proposed by Taylor taylor1953cloze in the literature, is widely known because the BERT model adopted it as a novel pre-training task. To model language conditioned on vision, MLM in VLP models is similar to MLM in pre-training language models (PLMs), but it predicts the masked textual tokens not only from the rest of the textual tokens but also from the visual tokens. Empirically, VLP models following BERT randomly mask each textual input token with a probability of 15% and replace a masked token with the special token [MASK] 80% of the time, a random textual token 10% of the time, and the original token 10% of the time. The formal definition is as follows:

\mathcal{L}_{\rm MLM}=-{\rm E}_{(\mathbf{v},\mathbf{w})\sim D}\log P(\mathbf{w}_{m}|\mathbf{w}_{\backslash m},\mathbf{v}), \qquad (1)

where $\mathbf{v}$ denotes the vision input, $\mathbf{w}$ denotes the textual tokens, $\mathbf{w}_{m}$ denotes the masked textual tokens, $\mathbf{w}_{\backslash m}$ denotes the remaining textual tokens, and $D$ denotes the training dataset.
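The 80%/10%/10% masking scheme can be sketched as follows (PyTorch; the [MASK] id and vocabulary size are placeholders, and the cross-modal prediction step itself is omitted):

```python
# Sketch: BERT-style 80/10/10 corruption used for MLM in VLP.
import torch

MASK_ID, VOCAB = 103, 30522

def mask_tokens(token_ids, mlm_prob=0.15):
    """Return (corrupted input ids, labels); labels are -100 at unmasked positions."""
    labels = token_ids.clone()
    selected = torch.rand_like(token_ids, dtype=torch.float) < mlm_prob
    labels[~selected] = -100                          # ignored by the MLM loss

    corrupted = token_ids.clone()
    rand = torch.rand_like(token_ids, dtype=torch.float)
    corrupted[selected & (rand < 0.8)] = MASK_ID      # 80%: [MASK]
    random_ids = torch.randint_like(token_ids, VOCAB)
    replace = selected & (rand >= 0.8) & (rand < 0.9)
    corrupted[replace] = random_ids[replace]          # 10%: random token
    return corrupted, labels                          # remaining 10%: keep original

ids = torch.randint(0, VOCAB, (2, 16))
inputs, labels = mask_tokens(ids)
# The MLM loss is then cross-entropy over the model's predictions at the masked
# positions, conditioned on both the remaining text and the visual tokens.
```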

4.2 Prefix Language Modeling

Prefix Language Modeling (PrefixLM) wang2021simvlm unifies MLM and language modeling (LM). To make the model simultaneously have good understanding and generation abilities, PrefixLM is proposed to equip the model with solid generation capability that enables text-induced zero-shot generalization without fine-tuning. PrefixLM differs from the standard LM in that it enables bi-directional attention on the prefix sequence and conducts autoregressive factorization only on the remaining tokens. PrefixLM under the sequence-to-sequence (seq2seq) framework not only enjoys the bidirectional contextualized representation as in MLM but also can perform text generation similar to LM. The formal definition is as follows:

\mathcal{L}_{\rm PrefixLM}=-{\rm E}_{(\mathbf{v},\mathbf{w})\sim D}\log P(\mathbf{w}_{\geq T_{p}}|\mathbf{w}_{<T_{p}},\mathbf{v}), \qquad (2)

where $T_{p}$ denotes the length of the prefix sequence.
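In practice, the key ingredient of PrefixLM is an attention mask that is bi-directional over the prefix and causal over the remainder; a minimal sketch is shown below (PyTorch, with the assumed convention that True marks positions a query may attend to):

```python
# Sketch: PrefixLM attention mask (bi-directional prefix, causal suffix).
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal mask
    mask[:prefix_len, :prefix_len] = True   # prefix tokens attend to each other fully
    return mask

print(prefix_lm_mask(6, 3).int())
# Rows = queries, columns = keys: the first 3 tokens see each other in both
# directions, while later tokens only attend to earlier positions.
```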

4.3 Masked Vision Modeling

To gain a good understanding of vision or to generate images/videos given text, masked vision modeling (MVM) chen2020uniter , like MLM, samples vision (image or video) regions or patches and usually masks their visual features with a probability of 15%. VLP models need to reconstruct the masked visual features given the remaining visual features and all the textual features. The masked visual features are set to zeros. Because visual features are high-dimensional and continuous, VLP models propose two variants for MVM.

(1) Masked Features Regression

learns to regress the model output at the masked features to the original visual features. VLP models first convert the model output at the masked features to a vector of the same dimension as the original visual features and then apply an L2 regression loss between the original visual features and this vector. The formal definition is as follows:

\mathcal{L}_{\rm MVM}={\rm E}_{(\mathbf{v},\mathbf{w})\sim D}f(\mathbf{v}_{m}|\mathbf{v}_{\backslash m},\mathbf{w}), \qquad (3)
f(\mathbf{v}_{m}|\mathbf{v}_{\backslash m},\mathbf{w})=\sum_{i=1}^{K}\|h(\mathbf{v}_{m}^{i})-O(\mathbf{v}_{m}^{i})\|_{2}^{2}, \qquad (4)

where $h(\mathbf{v}_{m}^{i})$ denotes the predicted vision representation and $O(\mathbf{v}_{m}^{i})$ denotes the original vision representation.

(2) Masked Feature Classification

learns to predict the object semantic class for the masked features. VLP models first feed the output of the masked features into an FC layer to predict the scores over object classes, which are further transformed into a normalized prediction distribution via a softmax function. Note that there is no ground-truth label. There are two methods to train VLP models. One is that VLP models take the most likely object class from the object detection model as the hard label (with probability 0 or 1), assuming the detected object class is the ground-truth label for the masked features, and apply a cross-entropy loss to minimize the gap between the prediction and the pseudo class. The other is that VLP models utilize a soft label as the supervision signal, which is the raw output from the detector (i.e., a distribution over object classes), and minimize the KL divergence between the two distributions. The formal definition is as follows:

\mathcal{L}_{\rm MVM}={\rm E}_{(\mathbf{v},\mathbf{w})\sim D}f(\mathbf{v}_{m}|\mathbf{v}_{\backslash m},\mathbf{w}). \qquad (5)

For the hard-label variant, the object detection output from Faster R-CNN is used, and the detected object category is taken as the label of the masked region:

f_{1}(\mathbf{v}_{m}|\mathbf{v}_{\backslash m},\mathbf{w})=\sum_{i=1}^{K}{\rm CE}(c(\mathbf{v}_{m}^{i}),g_{1}(\mathbf{v}_{m}^{i})), \qquad (6)

where $c(\mathbf{v}_{m}^{i})$ denotes the predicted class scores, $g_{1}(\mathbf{v}_{m}^{i})$ denotes the detected object category, and $K$ denotes the number of vision regions.

The soft-label variant avoids this assumption by using a soft label as the supervision signal, which is the raw output distribution from the detector:

f_{2}(\mathbf{v}_{m}|\mathbf{v}_{\backslash m},\mathbf{w})=\sum_{i=1}^{K}{\rm D}_{KL}(\hat{c}(\mathbf{v}_{m}^{i})\,\|\,g_{2}(\mathbf{v}_{m}^{i})), \qquad (7)

where $\hat{c}(\mathbf{v}_{m}^{i})$ denotes the predicted class distribution and $g_{2}(\mathbf{v}_{m}^{i})$ denotes the detected object category distribution.
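The two MVM variants can be sketched jointly as follows (PyTorch; the dimensions, the number of detector classes, and helper names such as h_out are illustrative assumptions):

```python
# Sketch: masked feature regression (Eq. 4) and masked feature classification
# with hard (Eq. 6) or soft (Eq. 7) detector labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

K, d, n_classes = 4, 768, 1600
h_out = torch.randn(K, d)                      # model outputs at masked regions
orig_feats = torch.randn(K, 2048)              # original RoI features O(v_m)
det_probs = torch.softmax(torch.randn(K, n_classes), dim=-1)  # detector distribution
hard_labels = det_probs.argmax(dim=-1)         # most likely detected class

regress_head = nn.Linear(d, 2048)
cls_head = nn.Linear(d, n_classes)

loss_regression = F.mse_loss(regress_head(h_out), orig_feats, reduction="sum")   # Eq. (4)
loss_hard = F.cross_entropy(cls_head(h_out), hard_labels)                        # Eq. (6)
loss_soft = F.kl_div(F.log_softmax(cls_head(h_out), dim=-1),
                     det_probs, reduction="batchmean")                           # Eq. (7)
```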

4.4 Vision-Language Matching

Vision-Language Matching (VLM) li2021align is the most commonly used pre-training objective to align vision and language, which aims to project vision and language into the same space. In single-stream VLP models, the representation of the special token [CLS] is used as the fused representation of both modalities. In dual-stream VLP models, the visual representation of the special visual token [CLSV] and the textual representation of the special textual token [CLST] are concatenated as the fused representation of both modalities. VLP models feed the fused representation to an FC layer and a sigmoid function to predict a score between 0 and 1, where 0 indicates the vision and language are mismatched and 1 indicates they are matched. During training, VLP models sample positive or negative pairs from the dataset at each step. A negative pair is created by replacing the vision or text in a paired sample with one randomly selected from another sample.
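A minimal sketch of the VLM head and in-batch negative construction is shown below (PyTorch; the fused [CLS] representations are placeholders standing in for the output of a hypothetical fusion encoder):

```python
# Sketch: binary vision-language matching head with sampled negative pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 768
vlm_head = nn.Linear(d, 1)

def vlm_loss(fused_cls: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """fused_cls: (B, d) fused representations; labels: 1 = matched, 0 = mismatched."""
    logits = vlm_head(fused_cls).squeeze(-1)
    return F.binary_cross_entropy_with_logits(logits, labels.float())

# Negatives: the text of a matched pair is replaced by text from another sample.
fused_pos = torch.randn(8, d)      # placeholder for fuse(v_i, t_i)
fused_neg = torch.randn(8, d)      # placeholder for fuse(v_i, t_j), j != i
loss = vlm_loss(torch.cat([fused_pos, fused_neg]),
                torch.cat([torch.ones(8), torch.zeros(8)]))
```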

4.5 Vision-Language Contrastive Learning

Vision-Language Contrastive Learning (VLC) li2021align also aims to align vision and language. Different from VLM, VLC predicts the matched vision-language pairs from the $N\times N$ possible pairs given a batch of $N$ vision-language pairs. Note that there are $N^{2}-N$ negative vision-language pairs within a training batch. VLP models use the visual representation of the special visual token [CLSV] and the textual representation of the special textual token [CLST] to denote the aggregated representations of the vision and language, respectively. VLP models compute the softmax-normalized vision (image or video)-to-text and text-to-vision similarities and leverage cross-entropy losses over the two similarities to update themselves. The similarity is often implemented with dot products. The formal definitions are as follows:

p^{v2t}_{m}(I)=\frac{\exp(s(I,T_{m})/\tau)}{\sum_{m=1}^{M}\exp(s(I,T_{m})/\tau)}, \qquad (8)
p^{t2v}_{m}(T)=\frac{\exp(s(T,I_{m})/\tau)}{\sum_{m=1}^{M}\exp(s(T,I_{m})/\tau)}, \qquad (9)
\mathcal{L}_{\rm VLC}=\frac{1}{2}{\rm E}_{(I,T)\sim D}\big[{\rm CE}(y^{v2t},p^{v2t}(I))+{\rm CE}(y^{t2v},p^{t2v}(T))\big], \qquad (10)

where $I$ and $T$ denote the images and texts, $s(\cdot)$ denotes the similarity function, and $\tau$ denotes the temperature coefficient. $y^{v2t}$ and $y^{t2v}$ denote the labels of vision-to-text retrieval and text-to-vision retrieval.
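Equations (8)-(10) can be sketched with in-batch contrastive learning as follows (PyTorch; the batch features, L2 normalization, and fixed temperature are illustrative choices):

```python
# Sketch: symmetric vision-language contrastive loss over a batch of N pairs.
import torch
import torch.nn.functional as F

B, d = 32, 256
v_cls = F.normalize(torch.randn(B, d), dim=-1)   # vision [CLSV] representations
t_cls = F.normalize(torch.randn(B, d), dim=-1)   # text [CLST] representations
tau = 0.07

sim = v_cls @ t_cls.t() / tau                    # (B, B) similarity matrix
targets = torch.arange(B)                        # matched pairs lie on the diagonal
loss_v2t = F.cross_entropy(sim, targets)         # vision-to-text, Eq. (8)
loss_t2v = F.cross_entropy(sim.t(), targets)     # text-to-vision, Eq. (9)
loss_vlc = 0.5 * (loss_v2t + loss_t2v)           # Eq. (10)
```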

4.6 Word-Region Alignment

Word-Region Alignment (WRA) chen2020uniter is an unsupervised pre-training objective to align vision regions (vision patches) and words. VLP models utilize Optimal Transport (OT) to learn the alignment between vision and language. Empirically, VLP models use the IPOT algorithm to approximate the OT distance since the exact minimization is computationally intractable. After solving the minimization, the OT distance serves as the WRA loss to train VLP models. The formal definition is as follows:

\mathcal{L}_{\rm WRA}=\min_{{\rm T}\in\Pi(\mathbf{a},\mathbf{b})}\sum_{i=1}^{T}\sum_{j=1}^{K}{\rm T}_{ij}\cdot c(\mathbf{w}_{i},\mathbf{v}_{j}), \qquad (11)

where $c(\mathbf{w}_{i},\mathbf{v}_{j})$ is the cost function evaluating the distance between $\mathbf{w}_{i}$ and $\mathbf{v}_{j}$, $\Pi(\mathbf{a},\mathbf{b})=\{{\rm T}\in\mathbb{R}^{T\times K}\,|\,{\rm T}\mathbf{1}_{m}=\mathbf{a},{\rm T}^{\top}\mathbf{1}_{n}=\mathbf{b}\}$, and $\mathbf{a}$ and $\mathbf{b}$ are the coefficients of the Dirac functions centered on $\mathbf{w}_{i}$ and $\mathbf{v}_{j}$, respectively.
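A rough sketch of approximating the transport plan with a few IPOT-style proximal iterations is given below (PyTorch; the cosine cost, step size, and iteration counts are illustrative assumptions rather than the exact recipe of any cited model):

```python
# Sketch: approximate OT-based WRA loss with IPOT-style iterations.
import torch
import torch.nn.functional as F

T_len, K, d = 12, 36, 768
w = F.normalize(torch.randn(T_len, d), dim=-1)   # word representations
v = F.normalize(torch.randn(K, d), dim=-1)       # region/patch representations

cost = 1.0 - w @ v.t()                           # cosine distance c(w_i, v_j)
a = torch.full((T_len, 1), 1.0 / T_len)          # uniform marginal over words
b = torch.full((K, 1), 1.0 / K)                  # uniform marginal over regions

beta = 0.5                                       # proximal step size (assumption)
A = torch.exp(-cost / beta)
T_plan = torch.ones(T_len, K) / (T_len * K)
sigma = torch.ones(K, 1) / K
for _ in range(50):                              # outer proximal iterations
    Q = A * T_plan
    delta = a / (Q @ sigma)                      # Sinkhorn-style scaling updates
    sigma = b / (Q.t() @ delta)
    T_plan = delta * Q * sigma.t()

wra_loss = (T_plan * cost).sum()                 # approximate OT distance
```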

4.7 Frame Order Modeling

To better model the temporal order of a video, VLP models randomly disrupt the order of some input frames and then predict the actual position of each frame. Frame Order Modeling (FOM) li2020hero is modeled as a classification task in practice.
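FOM as a classification task can be sketched as follows (PyTorch; in practice the classifier is applied to the cross-modal Transformer outputs of the shuffled frames rather than to raw frame features):

```python
# Sketch: shuffle frame features and classify each one's original position.
import torch
import torch.nn as nn
import torch.nn.functional as F

M, d = 8, 768
frame_feats = torch.randn(M, d)                  # per-frame representations
perm = torch.randperm(M)                         # disrupted order
shuffled = frame_feats[perm]                     # shuffled[i] came from position perm[i]

order_head = nn.Linear(d, M)                     # classify over M possible positions
logits = order_head(shuffled)
loss_fom = F.cross_entropy(logits, perm)         # predict each frame's original index
```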

4.8 Particular Pre-training Objects

To better adapt to downstream tasks, VLP models sometimes use the training objectives of some downstream tasks, such as visual question answering (VQA) antol2015vqa ; lei2018tvqa ; anderson2018bottom and visual captioning (VC) vinyals2015show ; bai2018survey , as pre-training objectives. As for VQA, VLP models take the fused representation mentioned above, apply an FC layer, and use the transformed representation to predict a classification over predefined answer candidates. Besides tackling the task as classification over predefined answer candidates, VLP models can also directly generate answers in their original text format. As for VC, to endow VLP models with generation capability, an auto-regressive decoder is employed to reconstruct the input sentence, i.e., to generate a corresponding textual description of the image or video.
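As an example of using a downstream objective during pre-training, the VQA-as-classification head can be sketched as follows (PyTorch; the answer vocabulary size of 3,129 is a commonly used value for VQA v2 but is an assumption here):

```python
# Sketch: VQA head as classification over predefined answer candidates.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_answers = 768, 3129
vqa_head = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, n_answers))

fused_cls = torch.randn(16, d)                   # fused vision-language representation
answer_ids = torch.randint(0, n_answers, (16,))  # placeholder answer labels
loss_vqa = F.cross_entropy(vqa_head(fused_cls), answer_ids)
```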

Note that due to space limitations, we only introduce some popular pre-training objectives. We omit some specific pre-training objectives such as grounding referring expression (GRE), image-conditioned denoising autoencoding (IDA) xia2021xgpt , text-conditioned image feature generation (TIFG) xia2021xgpt , object detection (OD) kamath2021mdetr and aligned Kaleido patch modeling (AKPM) zhuge2021kaleido . Moreover, we put masked action prediction into the category of MVM.

5 Pre-training Datasets

Table 1: Details of some popular pre-training datasets for VLP. Names of some datasets are abbreviated for the convenience of subsequent description. FLKR represents Flickr30k, and HT100M represents HowTo100M.
Dataset # Images # Image-text Pairs Duration (hrs) # Clips # Videos
SBU ordonez2011im2text 875K 875K - - -
FLKR young2014image 29K 145K - - -
COCO lin2014microsoft 113K 567K - - -
VG krishna2017visual 108K 5.4M - - -
VGQA krishna2017visual 108K 1.8M - - -
VQA goyal2017making 83K 444K - - -
Matterport3D chang2017matterport3d 104K 104K - - -
FashionGen rostamzadeh2018fashion 260K 260K - - -
CC3M sharma2018conceptual 3M 3M - - -
GQA hudson2019gqa 82K 1M - - -
LAIT qi2020imagebert 10M 10M - - -
CC12M changpinyo2021conceptual 12M 12M - - -
ALIGN jia2021scaling 1.8B 1.8B - - -
Kinetics400 kay2017kinetics - - 817 306K 306K
TVQA lei2018tvqa - - 461 22K 925
HT100M miech2019howto100m - - 134K 136M 1.2M
WebVid2M bain2021frozen - - 13K 2.5M 2.5M

Pre-training datasets are significant for the success of cross-modal representation learning. The quality and the size of pre-training datasets sometimes outweigh the importance of training strategies and algorithms. Hence, a detailed description of several widely used pre-training datasets is necessary. Table 1 shows statistics of some popular pre-training datasets for VLP.

Since VLP includes image-language pre-training and video-language pre-training, we roughly divide pre-training datasets into two main categories. In later sections, we provide more details about representative pre-training datasets for each category. It is worth noting that no matter which category pre-training datasets belong to, they differ in size and sources across different studies. In most works, the pre-training datasets for VLP are constructed by combining public datasets across different cross-modal tasks or scenarios. However, other works, such as VideoBERT sun2019videobert , ImageBERT qi2020imagebert , ALIGN jia2021scaling , and CLIP radford2021learning , conduct pre-training with self-constructed datasets. These self-constructed datasets are usually larger than most public datasets but might contain more noise.

5.1 Datasets for Image-language Pre-training

For image-language pre-training, the most widely used data form is image-text pairs. Most image-language pre-training datasets consist of a large number of image-caption pairs. SBU ordonez2011im2text and Flickr30k young2014image are collected from Flickr and labelled with human-generated annotations. COCO lin2014microsoft consists of images with five human-generated captions, filtered with special procedures to guarantee the quality of images and annotations. CC3M sharma2018conceptual and CC12M changpinyo2021conceptual are constructed by crawling images and their alt-text HTML attributes from the Internet and annotating these pictures with filtered descriptions. Due to looser filtering strategies, CC12M contains more noise than CC3M. Another data source is the visual question answering task. Many image-language datasets are organized as structured data in the context of visual question answering. The representative large-scale dataset is Visual Genome (VG) krishna2017visual . VG contains rich information in its structured data form. Its region-level descriptions and question-answer pairs are frequently used in the study of image-language pre-training. Besides VG, VQA goyal2017making and GQA hudson2019gqa are also popular datasets of visual question-answer pairs. Compared with VQA, GQA further alleviates the systematic biases.

Datasets mentioned above are suitable for most common scenarios. There are also some datasets designed for special cases. Matterport3D chang2017matterport3d consists of RGB-D images of building-scale scenes, annotated with labels for classification and segmentation. Fashion-Gen rostamzadeh2018fashion contains fashion images paired with item descriptions generated by professional stylists.

5.2 Datasets for Video-language Pre-training

Compared to image-language pre-training datasets, video-language pre-training datasets are usually more time-consuming and more difficult to collect and process. These inconveniences restrict the development of the community and the scale of pre-training. Datasets for video-language pre-training cover different scenarios and sources. Most of them, such as Kinetics-400 kay2017kinetics , HowTo100M miech2019howto100m and WebVid-2M bain2021frozen , are collected from the Internet and processed with different procedures. These kinds of videos are usually accompanied by subtitles, thus providing weak or strong alignments between video clips and text. Although those subtitles sometimes might be too weak to align modalities, they still provide useful information, especially for the pre-training on large-scale datasets. Another source of video-text pairs is television programs. TVQA lei2018tvqa is a video-language pre-training dataset generated from television shows. These television shows are collected and converted to a dataset comprised of many dialogues for understanding the videos and recognizing semantic concepts in videos.

Considering the diversity of the sources and formation of these datasets, researchers apply different annotation and processing procedures. For example, Kinetics-400 kay2017kinetics consists of many action-related videos annotated with action classes. For other datasets lei2018tvqa ; miech2019howto100m ; bain2021frozen , the accompanying captions/subtitles of video clips or the class of concepts in videos are usually processed and used as annotations.

6 Downstream Tasks

As shown in Figure 2, a diverse range of tasks requires a cooperative knowledge of vision and language. In this section, we introduce the fundamental details and goals of these tasks.

Figure 2: Illustration of downstream tasks in VLP.
Visual Question Answering (VQA) antol2015vqa ; wu2017visual ; kafle2017visual ; kafle2017analysis

Given a visual input (image or video), VQA represents the task of correctly providing an answer to a question. It is usually regarded as a classification task where the model predicts the most suitable answer from a pool of choices. To obtain accurate performance, it is important to infer logical entailments from images (or videos) based on the question posed.

Visual Reasoning and Compositional Question Answering (GQA) hudson2019gqa ; geng20192nd ; bitton2021automatic

GQA is an upgraded version of VQA and aims to advance research on the visual reasoning of natural scenes. The images, questions, and answers in its dataset have matching semantic representations. The advantage of this structured representation is that the distribution of answers can be more uniform, and we can analyze the model’s performance from more dimensions. Compared with the single evaluation metric (e.g., accuracy) of traditional VQA, GQA includes multi-dimensional evaluation metrics: consistency, validity, plausibility, distribution, and grounding.

Video-Language Inference (VLI) li2020hero ; li2021adaptive ; chaudhary2021robust

Given a video clip with aligned subtitles as a premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.

Visual Entailment (VE) xie2019visual ; song2022clip ; xie2018visual

In the VE task, the image is the premise and the text is the hypothesis. Its goal is to predict whether the text is entailed by the image. There are three labels: Entailment, Neutral, and Contradiction.

Visual Commonsense Reasoning (VCR) zellers2019recognition ; yu2019heterogeneous ; ye2021case

VCR is the task of inferring commonsense information and cognitive understanding by a machine when it sees an image. It exists in the form of multiple-choice questions. For a question posed about the image, there are several alternative answers. The model must choose an answer from several answers and then select the reason for choosing this answer from several alternative reasons. Thus, VCR can be divided into two tasks: question answering (selecting the best answer from a pool of expected answers to the question) and answer justification (providing the rationale behind the given answer). You can follow VCR’s leaderboard (https://visualcommonsense.com/leaderboard/) to track VLP’s latest ideas.

Natural Language for Visual Reasoning (NLVR) suhr2017corpus ; marasovic2020natural

NLVR is a subtask of the broader VCR category, limited to the classification paradigm. The input of the NLVR task is two images and a text description, and the output is whether the correspondence between the images and the text description is consistent (two labels: true or false). It typically differs from VQA due to longer text sequences covering various linguistic phenomena.

Grounding Referring Expressions (GRE) liu2019improving ; yang2019cross ; zhang2018grounding

The GRE task aims to localize certain regions (e.g., objects and persons) in an image given a referring expression, where the main challenge is to comprehend and align various types of information from the visual and textual domains, such as visual attributes, locations, and interactions with surrounding regions. Specifically, the model can output a score for each region, and the region with the highest score is used as the predicted region.

Category Recognition (CR) zhuge2021kaleido .

CR refers to identifying the category and sub-category of a product, such as {HOODIES, SWEATERS} or {TROUSERS, PANTS}, which are vital attributes for describing a product and are useful in many real-life applications.

Multi-modal Sentiment Analysis.

(MSA) ghosal2018contextual ; akhtar2019multi ; jiming2021summary ; zhang2020knowledge . MSA aims to detect sentiments in videos by leveraging multi-modal signals (e.g., vision, language, etc.). The goal is to predict the affective orientation of an utterance as a continuous intensity variable.

Vision-Language Retrieval (VLR) wang2016comprehensive ; mithun2018learning ; chen2020imram ; chen2022hivlp .

VLR involves understanding both vision (image or video) and language domains with appropriate matching strategies. It includes two subtasks, vision-to-text and text-to-vision retrieval, where vision-to-text retrieval fetches the top-most relevant text description from a larger pool of descriptions given a visual input, and vice versa. VLR is widely used in domain-specific searches, multiple search engines, and context-based vision retrieval design systems.

Visual Captioning (VC) xu2015show ; bai2018survey ; wang2018reconstruction .

VC aims to generate semantically and syntactically appropriate text descriptions for a given visual (image or video) input. Generating relevant and explanatory captions for a visual input requires not only a rich knowledge of language but also a consistent understanding of the scenes, entities, and their interactions appearing in the visual input.

Novel Object Captioning at Scale (NoCaps) agrawal2019nocaps ; feng2020cascaded

NoCaps extends the VC task to test a model’s capability of describing novel objects from the Open Images dataset, which are unseen in the training corpus.

Visual Dialogue (VD) das2017visual ; chen2020dmrm ; chen2021gog ; chen2021multimodal .

The specific task in VD is the following: given an image, a dialog history consisting of a sequence of question-answer pairs, and a natural language follow-up question, the goal is to respond to the question in free-form natural language (e.g., generate an answer). VD is the visual analogue of the Turing Test.

Multi-modal Machine Translation (MMT) specia2016shared ; yin2020novel ; su2019unsupervised .

MMT is a two-fold task of translation and text generation, translating text from one language to another with additional information from other modalities, e.g., image. The additional visual features aim to remove ambiguities that may arise in straightforward text machine translation and help retain the context of the text descriptions. The multi-modal representation space facilitates robust latent representations to complement the inherent semantic information preserved by visual and linguistic embeddings, respectively.

Vision-Language Navigation (VLN) wang2019reinforced ; zhu2020vision ; gu2022vision .

VLN is a grounded language task of an agent’s locomotion as it sees and explores real-world dynamics based on linguistic instructions. Like generation tasks, it is typically seen as a sequence-to-sequence transcoding task. However, VLN has unique characteristics: it usually involves longer sequences, and the dynamics of the problem are quite different since it is a real-time evolving task. Its main challenge lies in understanding the environment and making confident decisions during exploration.

Optical Character Recognition (OCR) mori1999optical ; memon2020handwritten .

OCR generally refers to extracting handwritten or printed text from images (such as street signs and photos of products) as well as documents (articles, bills, invoices, financial reports, etc.). It includes two parts: text detection (similar to regression) and text recognition (similar to classification).

In addition, there are some image-related downstream tasks for evaluating image-text pre-training models, including semantic segmentation strudel2021segmenter ; mo2022review and object detection zhao2019object ; fang2021you . There are also some video-related downstream tasks for evaluating video-text pre-training models, including action classification (AC) sun2019videobert , action segmentation (AS) sun2019learning , and action step localization (ASL) luo2020univl .

Recently, Changpinyo et al. changpinyo2021conceptual scale up pre-training data for VLP tasks and benchmark its effectiveness against Conceptual Captions 3M on multiple downstream tasks, with an emphasis on long-tail visual recognition. Rethmeier et al. rethmeier2022long study the performance of pretrained models on a challenging long-tail task and analyze the resulting long-tail learning capabilities under zero-shot, few-shot, and full supervision conditions to explore the influence of model size and the amount of self-supervision signal.

Table 2: The summary of mainstream image-text VLP models. The number of downstream tasks determines whether the model is generic or domain-specific VLP. FE: Feature Extraction. PT: Pre-training. Emb: Embedding. SC in Datasets column: self-constructed or self-collected. MTL in Datasets column: all datasets for multi-task learning in the corresponding work. See other abbreviations in the Datasets column in Table 1.
Model Domain Vision FE Language FE Multimodal Fusion Decoder PT Objectives PT Datasets Downstream Tasks
VisualBERT li2019visualbert Image OD-RFs Emb Single-stream No MLM+VLM COCO GRE+NLVR+VCR+VQA
ViLBERT lu2019vilbert Image OD-RFs Emb Dual-stream No MLM+VLM+MVM COCO+VG VLR+NLVR+VE+VQA
LXMERT tan2019lxmert Image OD-RFs+Xformer Xformer Dual-stream No MLM+VLM+MVM+VQA COCO+VG+VQA+GQA+VGQA GQA+NLVR+VQA
B2T2 alberti2019fusion Image CNN-GFs Emb Single-stream No MLM+VLM CC3M VCR
Unicoder-VL li2020unicoder Image OD-RFs Emb Single-stream No MLM+VLM+MVM CC3M+SBU VLR+VCR
VL-BERT su2019vl Image OD-RFs Emb Single-stream No MLM+MVM CC3M GRE+VCR+VQA
VLP zhou2020unified Image OD-RFs Emb Dual-stream Yes MLM+LM CC3M VC+VQA
UNITER chen2020uniter Image OD-RFs Emb Single-stream No MLM+VLM+MVM+WRA COCO+VG+SBU+CC3M GRE+VLR+NLVR+VCR+VE+VQA
12-IN-1 lu202012 Image OD-RFs Emb Single-stream No MLM+MVM MTL GQA+GRE+VC+NLVR+VE+VQA
VisDial-BERT murahari2020large Image OD-RFs Emb Dual-stream No MLM+VLM+MVM CC3M+VQA VD
ImageBERT qi2020imagebert Image OD-RFs Emb Single-stream No MLM+VLM+MVM LAIT+CC3M+SBU VLR
PREVALENT hao2020towards Image CNN-GFs+Xformer Xformer Single-stream No MLM+MVM Matterport3D VLN
XGPT xia2021xgpt Image OD-RFs Emb Dual-stream Yes MLM+IDA+VC+TIFG CC3M VC+VLR
InterBERT lin2020interbert Image OD-RFs Emb Single-stream No MLM+VLM+MVM COCO+CC3M+SBU VLR+VCR
PixelBERT huang2020pixel Image CNN-GFs Emb Single-stream No MLM+VLM COCO+VG VLR+NLVR+VQA
OSCAR li2020oscar Image OD-RFs Emb Single-stream No MLM+VLM COCO+SBU+CC3M+FLKR+VQA+GQA+VGQA GQA+VC+VLR+NLVR+NoCaps+VQA
VLN-BERT hong2021vln Image OD-RFs Emb Dual-stream No MLM+VLM+MVM CC3M VLN
FashionBERT gao2020fashionbert Image Xformer Emb Single-stream No MLM+VLM+MVM FashionGen VLR
VILLA gan2020large Image OD-RFs+Xformer Xformer Single-stream No MLM+VLM+MVM COCO+VG+CC3M+SBU GRE+VLR+NLVR+VCR+VE+VQA
ERNIE-ViL yu2020ernie Image OD-RFs Emb Single-stream No MLM+MVM CC3M+SBU GRE+VLR+VCR+VQA
RVL-BERT chiou2021visual Image OD-RFs Emb Single-stream No MLM+VLM+MVM CC3M VC+VQA
VinVL zhang2021vinvl Image OD-RFs Emb Single-stream No MLM+VLM COCO+CC3M+SBU+FLKR+VQA+GQA+VGQA GQA+VC+VLR+NLVR+NoCaps+VQA
VL-T5 cho2021vlt5 Image OD-RFs Emb Single-stream Yes MLM+VLM+VQA+GRE+VC COCO+VG+VQA+GQA+VGQA GQA+GRE+VC+MMT+NLVR+VCR+VQA
ViLT kim2021vilt Image ViT-PFs Emb Single-stream No MLM+VLM COCO+VG+SBU+CC3M VLR+NLVR+VQA
ALIGN jia2021scaling Image CNN-GFs Xformer Dual-stream No VLC ALIGN VLR
Kaleido-BERT zhuge2021kaleido Image CNN-GFs Emb Single-stream No MLM+VLM+AKPM FashionGen CR+VC+VLR
MDETR kamath2021mdetr Image Xformer Xformer Single-stream Yes OD+MLM+VLC COCO+VG+FLKR+GQA GQA+VQA
SOHO huang2021seeing Image CNN-GFs Emb Single-stream No MLM+VLM+MVM COCO+VG VLR+NLVR+VE+VQA
E2E-VLP xu2021e2e Image CNN-GFs Emb Single-stream Yes OD+MLM+VLM COCO+VG VC+VLR+NLVR+VQA
Visual Parsing xue2021probing Image Xformer Emb Single-stream No MLM+VLM+MVM COCO+VG VLR+VCR+VE+VQA
CLIP-ViL shen2021much Image CNN-GFs Emb Single-stream Yes MLM+VLM+VQA COCO+VG+VQA+GQA+VGQA VE+VLN+VQA
ALBEF li2021align Image Xformer Xformer Dual-stream No MLM+VLM+VLC COCO+VG+CC3M+SBU VLR+NLVR+VQA
SimVLM wang2021simvlm Image CNN-GFs Emb Single-stream Yes PrefixLM ALIGN VC+NLVR+VE+VQA
MURAL jain2021mural Image CNN-GFs Xformer Dual-stream No VLC CC12M+ALIGN VC+VLR
VLMO vlmo Image ViT-PFs Emb Single-stream No MLM+VLC+VLM COCO+VG+CC3M+SBU VQA+NLVR+VLR
METER dou2021empirical Image Xformer Xformer Dual-stream No MLM+VLM COCO+VG+CC3M+SBU VLR+NLVR+VE+VQA
X-VLM zeng2021multi Image Xformer Xformer Single-stream No MLM+VLM+VG COCO+VG+CC3M+SBU VLR+NLVR+VE+VQA
TCL yang2022vision Image Xformer Xformer Single-stream No MLM+VLM+TCL COCO+VG+CC3M+SBU VLR+NLVR+VE+VQA
Table 3: The summary of mainstream video-text VLP models. The number of downstream tasks determines whether the model is generic or domain-specific VLP. FE: Feature Extraction. PT: Pre-training. Emb: Embedding. SC in Datasets column: self-constructed or self-collected. MTL in Datasets column: all datasets for multi-task learning in the corresponding work. See other abbreviations in the Datasets column in Table 1.
Model Domain Vision FE Language FE Multimodal Fusion Decoder PT Objectives PT Datasets Downstream Tasks
VideoBERT sun2019videobert Video CNN-GFs Emb Single-stream No MLM+VLM+MVM SC AC+VC
CBT sun2019learning Video CNN-GFs+Xformer Xformer Single-stream No VLC Kinetics AC+AS+VC
UniVL luo2020univl Video CNN-GFs Xformer Dual-stream Yes MLM+VLM+VC HT100M AS+ASL+MSA+VC+VLR
HERO li2020hero Video CNN-GFs+Xformer Xformer Single-stream No MLM+VLM+MVM+FOM HT100M+TV VC+VLI+VQA+VLR
MMFT-BERT urooj2020mmft Video OD-RFs+Xformer Xformer Single-stream No VQA TV VQA
ActBERT zhu2020actbert Video OD-RFs+CNN Emb Single-stream No MLM+VLM+MVM HT100M AS+ASL+VC+VQA+VLR
CLIP radford2021learning Image / Video CNN/Xformer Xformer Dual-stream No VLC SC OCR +AC etc.
Frozen bain2021frozen Video ViT-PFs Emb Dual-Stream No VLC WebVid2M+CC3M VLR
Region-Learner yan2021video Video ViT-PFs Emb Dual-Stream No VLC WebVid2M+CC3M VLR
CLIP4Clip luo2021clip4clip Video ViT-PFs Emb Dual-Stream No VLC WebVid2M+CC3M VLR
CLIP2Video fang2021clip2video Video ViT-PFs Emb Dual-Stream No VLC WebVid2M+CC3M VLR

7 SOTA VLP models

Image-Text VLP models.

VisualBERT li2019visualbert , known as the first image-text pre-training model, uses the visual features extracted by Faster R-CNN, concatenates the visual features and textual embeddings, and then feeds the concatenated features into a single transformer initialized with BERT. Many VLP models li2020unicoder ; su2019vl ; chen2020uniter ; qi2020imagebert follow a similar feature extraction scheme and architecture to VisualBERT while adjusting the pre-training objectives and pre-training datasets. Recently, VDBERT wang2020vd models the implicit vision-language alignment by pre-training on large-scale image-text pairs via transfer learning dong2020can ; dong2021and . VLMO vlmo leverages patch embeddings for images and word embeddings for text, feeds the concatenated embeddings into a single transformer with modality experts, and achieves impressive performance. METER dou2021empirical explores how to use uni-modal pre-trained models and proposes a dual-stream architecture to handle multimodal fusion, achieving SOTA performance on many downstream tasks. The summary of mainstream image-text VLP models is shown in Table 2.

Video-Text VLP models.

VideoBERT sun2019videobert , known as the first video-text pre-training model, extends the BERT model to process videos and texts simultaneously. VideoBERT uses a pre-trained ConvNet and S3D xie2017rethinking to extract video features and concatenates them with textual word embeddings, which are fed into a transformer initialized with BERT. The ConvNet and S3D are frozen when training VideoBERT, which means the approach is not end-to-end. Recently, inspired by ViT, CLIP4Clip luo2021clip4clip and CLIP2Video fang2021clip2video first process video clips into frames and obtain patch embeddings for each frame following the way ViT processes images. CLIP4Clip and CLIP2Video optimize themselves in an end-to-end manner and achieve SOTA performance. The summary of mainstream video-text VLP models is shown in Table 3.

8 Conclusion and New Frontiers

In this paper, we provide the first VLP survey. We review its recent advances from five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks and summarize the specific SOTA VLP models in detail. We hope our survey can help researchers understand VLP better and inspire new works to advance this field. In the future, based on existing works, VLP can be further developed from the following aspects:

Incorporating Acoustic Information.

Most previous works on multi-modal pre-training emphasize the joint modeling of language and vision but ignore the information buried in audio zhu2021deep ; tao2021correction . Although the semantic information in audio might intersect with language, audio can provide extra emotional information, acoustic boundary information, etc. Moreover, pre-training with audio makes the model capable of handling downstream tasks with acoustic inputs. Until now, joint modeling and representation across text, vision, and audio is still an open problem left for further investigation. Several cutting-edge works have shed light on the future of this research field. Unlike previous VLP models, VATT akbari2021vatt takes raw audio as input and learns multi-modal representations with noise contrastive estimation (NCE). Differing from VATT, OPT liu2021opt learns cross-modal representations across text, image, and audio jointly with various multi-level masking strategies, and it is also capable of generating text and images. Some other works, such as AudioCLIP guzhov2021audioclip and MERLOT Reserve zellers2022merlot , also show their unique approaches to learning cross-modal representations over the three modalities.

Knowledgeable and Cognitive Learning.

Although the existing VLP models have achieved remarkable performance, their essence is to fit large-scale multimodal datasets. Making VLP models more knowledgeable is important for future VLP. For input vision and text, there is rich related external common sense world knowledge and illustrative situational knowledge chen2021kbvlp , which can be used to augment the input and accelerate the model training and inference. The solution to this problem requires unified cognitive model architectures, knowledge-guided pre-training objectives, and the support of interacting with new knowledge.

Prompt Tuning.

Currently, fine-tuning is the dominant method to transfer the knowledge of VLP to downstream tasks. However, as the scale of the model increases, each downstream task has its own set of fine-tuned parameters, leading to parameter inefficiency. Moreover, the diverse downstream tasks also make the design of the pre-training and fine-tuning stages cumbersome, leading to a gap between them. Recently, prompt tuning has been attracting more and more attention in NLP. By designing discrete or continuous prompts and using MLM for specific downstream tasks, these models can 1) reduce the computational cost of fine-tuning enormous amounts of parameters and 2) bridge the gap between pre-training and fine-tuning. Prompt tuning is a promising way to stimulate the linguistic and world knowledge distributed in PLMs. In the next step, it can be improved and transferred to multi-modal scenarios, breaking the traditional paradigm and solving the pain points of VLP tsimpoukelli2021multimodal .

Model Compression and Acceleration.

Model compression and acceleration are essential approaches to improving the efficiency of VLP models. In this setting, large models are compressed into small ones to meet the need for faster inference and deployment in various real-life scenarios such as resource-constrained devices. For general PLMs, model compression and acceleration is a hot topic, and specific methods include parameter sharing lan2019albert , model pruning fan2019reducing , knowledge distillation sanh2019distilbert , and model quantization zafrir2019q8bert . Recently, knowledge distillation has been used to compress VLP models fang2021compressing , but other methods such as pruning and quantization of VLP models remain to be explored. Furthermore, a data-efficient VLP paradigm has been constructed li2021supervision . However, only a few efforts are currently focused on improving the efficiency of VLP models, leaving much room for exploration.

Out-of-domain Pretraining.

Despite the significant progress achieved by VLP models, part of their success can be traced back to the introduction of in-domain pre-training datasets, used in both pre-training and downstream tasks. Out-of-domain pre-training will be an essential research direction, that is, VLP models transferring the learned knowledge and representations to downstream tasks with unknown data distributions. To mitigate the distribution biases between pre-training and fine-tuning, DeVLBert zhang2020devlbert performs intervention-based learning. It borrows the idea of backdoor adjustment from the research area of causality and designs several neural-network-based structures for BERT-style out-of-domain pre-training.

Advanced Model Architecture.

Nowadays, transformer-based architectures have made great progress in VLP, but is such a structure optimal for VLP? We note that the recently popular diffusion model saharia2022photorealistic for image generation has achieved great success. Some researchers li2022diffusion also extend the diffusion model to controllable text generation. Whether the diffusion model can be used in VLP is a question worth exploring in the future. Moreover, neural networks themselves are inspired by neuroscience, and we can explore next-generation VLP frameworks with support from other disciplines. Inspirations from mathematics include the framework of non-Euclidean manifolds and how to incorporate geometric priors into the model chen2022fully ; bronstein2021geometric , which are relatively new research directions. Research on energy-efficient Spiking Neural Networks maass1997networks ; zhang2022recent ; zhang2021population in the brain-inspired field may also provide insights into the exploration of novel VLP architectures.

References

  • (1) Vaswani, A., Shazeer, N., et al.: Attention is all you need. NeurIPS 30 (2017)
  • (2) Devlin, J., Chang, M., et al.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT, pp. 4171–4186 (2019)
  • (3) Dosovitskiy, A., Beyer, L., et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: ICLR (2020)
  • (4) Schneider, S., Baevski, A., et al.: Wav2Vec: Unsupervised Pre-Training for Speech Recognition. In: Interspeech, pp. 3465–3469 (2019)
  • (5) Qiu, X., Sun, T., Xu, Y., Shao, Y., Dai, N., Huang, X.: Pre-trained models for natural language processing: A survey. Science China Technological Sciences 63(10), 1872–1897 (2020)
  • (6) Han, X., Zhang, Z., Ding, N., Gu, Y., Liu, X., Huo, Y., Qiu, J., Yao, Y., Zhang, A., Zhang, L., et al.: Pre-trained models: Past, present and future. AI Open 2, 225–250 (2021)
  • (7) Han, K., Wang, Y., Chen, H., Chen, X., Guo, J., Liu, Z., Tang, Y., Xiao, A., Xu, C., Xu, Y., et al.: A survey on vision transformer. IEEE Transactions on Pattern Analysis and Machine Intelligence (2022)
  • (8) Lu, J., Batra, D., et al.: ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In: NeurIPS, pp. 13–23 (2019)
  • (9) Li, L.H., Yatskar, M., et al.: VisualBERT: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)
  • (10) Li, X., Yin, X., et al.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: ECCV (2020). https://arxiv.org/pdf/2004.06165
  • (11) Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems 28 (2015)
  • (12) Anderson, P., He, X., et al.: Bottom-up and top-down attention for image captioning and visual question answering. In: CVPR, pp. 6077–6086 (2018)
  • (13) Li, G., Duan, N., et al.: Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. In: AAAI, pp. 11336–11344 (2020)
  • (14) Wang, Z., Yu, J., et al.: SimVLM: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904 (2021)
  • (15) Jiang, H., Misra, I., Rohrbach, M., Learned-Miller, E., Chen, X.: In defense of grid features for visual question answering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10267–10276 (2020)
  • (16) Radford, A., Kim, J.W., et al.: Learning transferable visual models from natural language supervision. In: ICML, pp. 8748–8763 (2021)
  • (17) Luo, H., Ji, L., Zhong, M., Chen, Y., Lei, W., Duan, N., Li, T.: Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv preprint arXiv:2104.08860 (2021)
  • (18) Fang, H., Xiong, P., Xu, L., Chen, Y.: Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097 (2021)
  • (19) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR, pp. 770–778 (2016)
  • (20) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR, pp. 248–255 (2009)
  • (21) Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recognition. In: ICCV, pp. 6202–6211 (2019)
  • (22) Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the Kinetics dataset. In: CVPR, pp. 6299–6308 (2017)
  • (23) Kay, W., Carreira, J., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  • (24) Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
  • (25) Lan, Z., Chen, M., et al.: ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In: ICLR (2019)
  • (26) Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: XLNet: Generalized autoregressive pretraining for language understanding. NeurIPS 32 (2019)
  • (27) Zhang, P., Li, X., et al.: VinVL: Revisiting visual representations in vision-language models. In: CVPR (2021)
  • (28) Zeng, Y., Zhang, X., Li, H.: Multi-grained vision language pre-training: Aligning texts with visual concepts. arXiv preprint arXiv:2111.08276 (2021)
  • (29) Touvron, H., Cord, M., et al.: Training data-efficient image transformers & distillation through attention. In: ICML, vol. 139, pp. 10347–10357 (2021)
  • (30) Chen, Y.-C., Li, L., et al.: UNITER: Universal image-text representation learning. In: ECCV (2020)
  • (31) Chen, F., Chen, X., Xu, S., Xu, B.: Improving cross-modal understanding in visual dialog via contrastive learning. In: ICASSP, pp. 7937–7941 (2022)
  • (32) Zhang, S., Jiang, T., et al.: Devlbert: Learning deconfounded visio-linguistic representations. In: ACM MM, pp. 4373–4382 (2020)
  • (33) Dou, Z.-Y., Xu, Y., et al.: An Empirical Study of Training End-to-End Vision-and-Language Transformers. arXiv preprint arXiv:2111.02387 (2021)
  • (34) Taylor, W.L.: “Cloze procedure”: A new tool for measuring readability. Journalism quarterly 30(4), 415–433 (1953)
  • (35) Li, J., Selvaraju, R.R., et al.: Align before fuse: Vision and language representation learning with momentum distillation. arXiv preprint arXiv:2107.07651 (2021)
  • (36) Li, L., Chen, Y.-C., et al.: HERO: Hierarchical encoder for video+ language omni-representation pre-training. In: EMNLP, pp. 2046–2065 (2020)
  • (37) Antol, S., Agrawal, A., et al.: Vqa: Visual question answering. In: ICCV, pp. 2425–2433 (2015)
  • (38) Lei, J., Yu, L., et al.: TVQA: Localized, Compositional Video Question Answering. In: EMNLP, pp. 1369–1379 (2018)
  • (39) Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: CVPR, pp. 3156–3164 (2015)
  • (40) Bai, S., An, S.: A survey on automatic image caption generation. Neurocomputing 311, 291–304 (2018)
  • (41) Xia, Q., Huang, H., et al.: XGPT: Cross-modal generative pre-training for image captioning. In: NLPCC, pp. 786–797 (2020)
  • (42) Kamath, A., Singh, M., et al.: MDETR-modulated detection for end-to-end multi-modal understanding. In: ICCV, pp. 1780–1790 (2021)
  • (43) Zhuge, M., Gao, D., et al.: Kaleido-BERT: Vision-language pre-training on fashion domain. In: CVPR, pp. 12647–12657 (2021)
  • (44) Ordonez, V., Kulkarni, G., et al.: Im2text: Describing images using 1 million captioned photographs. NeurIPS 24 (2011)
  • (45) Young, P., Lai, A., et al.: From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL 2, 67–78 (2014)
  • (46) Lin, T.-Y., Maire, M., et al.: Microsoft COCO: Common objects in context. In: ECCV, pp. 740–755 (2014)
  • (47) Krishna, R., Zhu, Y., et al.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV 123(1), 32–73 (2017)
  • (48) Goyal, Y., Khot, T., et al.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: CVPR, pp. 6904–6913 (2017)
  • (49) Chang, A., Dai, A., et al.: Matterport3d: Learning from rgb-d data in indoor environments. In: 3DV, pp. 667–676 (2017)
  • (50) Rostamzadeh, N., Hosseini, S., et al.: Fashion-gen: The generative fashion dataset and challenge. arXiv preprint arXiv:1806.08317 (2018)
  • (51) Sharma, P., Ding, N., et al.: Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In: ACL, pp. 2556–2565 (2018)
  • (52) Hudson, D.A., Manning, C.D.: GQA: A new dataset for real-world visual reasoning and compositional question answering. In: CVPR, pp. 6700–6709 (2019)
  • (53) Qi, D., Su, L., et al.: ImageBERT: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966 (2020)
  • (54) Changpinyo, S., Sharma, P., et al.: Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: CVPR, pp. 3558–3568 (2021)
  • (55) Jia, C., Yang, Y., et al.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML, pp. 4904–4916 (2021)
  • (56) Miech, A., Zhukov, D., et al.: HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips. In: ICCV, pp. 2630–2640 (2019)
  • (57) Bain, M., Nagrani, A., et al.: Frozen in time: A joint video and image encoder for end-to-end retrieval. In: ICCV, pp. 1728–1738 (2021)
  • (58) Sun, C., Myers, A., et al.: VideoBERT: A joint model for video and language representation learning. In: ICCV, pp. 7464–7473 (2019)
  • (59) Wu, Q., Teney, D., Wang, P., Shen, C., Dick, A., Van Den Hengel, A.: Visual question answering: A survey of methods and datasets. Computer Vision and Image Understanding 163, 21–40 (2017)
  • (60) Kafle, K., Kanan, C.: Visual question answering: Datasets, algorithms, and future challenges. Computer Vision and Image Understanding 163, 3–20 (2017)
  • (61) Kafle, K., Kanan, C.: An analysis of visual question answering algorithms. In: ICCV, pp. 1965–1973 (2017)
  • (62) Geng, S., Zhang, J., Zhang, H., Elgammal, A., Metaxas, D.N.: 2nd place solution to the GQA challenge 2019. arXiv preprint arXiv:1907.06794 (2019)
  • (63) Bitton, Y., Stanovsky, G., Schwartz, R., Elhadad, M.: Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA. In: NAACL-HLT, pp. 94–105 (2021)
  • (64) Li, J., Tang, S., Zhu, L., Shi, H., Huang, X., Wu, F., Yang, Y., Zhuang, Y.: Adaptive hierarchical graph reasoning with semantic coherence for video-and-language inference. In: ICCV, pp. 1867–1877 (2021)
  • (65) Chaudhary, A.: Robust Vision and Language Inference via Semantics Transformed Adversarial Training. PhD thesis, Arizona State University (2021)
  • (66) Xie, N., Lai, F., et al.: Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706 (2019)
  • (67) Song, H., Dong, L., Zhang, W., Liu, T., Wei, F.: CLIP Models are Few-Shot Learners: Empirical Studies on VQA and Visual Entailment. In: ACL, pp. 6088–6100 (2022)
  • (68) Xie, N., Lai, F., Doran, D., Kadav, A.: Visual entailment task for visually-grounded language learning. arXiv preprint arXiv:1811.10582 (2018)
  • (69) Zellers, R., Bisk, Y., et al.: From recognition to cognition: Visual commonsense reasoning. In: CVPR, pp. 6720–6731 (2019)
  • (70) Yu, W., Zhou, J., Yu, W., Liang, X., Xiao, N.: Heterogeneous graph learning for visual commonsense reasoning. NeurIPS 32 (2019)
  • (71) Ye, K., Kovashka, A.: A case study of the shortcut effects in visual commonsense reasoning. In: AAAI, pp. 3181–3189 (2021)
  • (72) Suhr, A., Lewis, M., et al.: A corpus of natural language for visual reasoning. In: ACL, pp. 217–223 (2017)
  • (73) Marasović, A., Bhagavatula, C., Park, J.S., Le Bras, R., Smith, N.A., Choi, Y.: Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs. In: Findings of EMNLP 2020, pp. 2810–2829 (2020)
  • (74) Liu, X., Wang, Z., et al.: Improving referring expression grounding with cross-modal attention-guided erasing. In: CVPR, pp. 1950–1959 (2019)
  • (75) Yang, S., Li, G., Yu, Y.: Cross-modal relationship inference for grounding referring expressions. In: CVPR, pp. 4145–4154 (2019)
  • (76) Zhang, H., Niu, Y., Chang, S.-F.: Grounding referring expressions in images by variational context. In: CVPR, pp. 4158–4166 (2018)
  • (77) Ghosal, D., Akhtar, M.S., Chauhan, D., Poria, S., Ekbal, A., Bhattacharyya, P.: Contextual inter-modal attention for multi-modal sentiment analysis. In: EMNLP, pp. 3454–3466 (2018)
  • (78) Akhtar, M.S., Chauhan, D., Ghosal, D., Poria, S., Ekbal, A., Bhattacharyya, P.: Multi-task Learning for Multi-modal Emotion Recognition and Sentiment Analysis. In: NAACL-HLT, pp. 370–379 (2019)
  • (79) Jiming, L., Peixiang, Z., et al.: Summary of Multi-modal Sentiment Analysis Technology. Journal of Frontiers of Computer Science & Technology 15(7), 1165 (2021)
  • (80) Zhang, D., Chen, X., Xu, S., Xu, B.: Knowledge aware emotion recognition in textual conversations via multi-task incremental transformer. In: COLING, pp. 4429–4440 (2020)
  • (81) Wang, K., Yin, Q., et al.: A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215 (2016)
  • (82) Mithun, N.C., Li, J., Metze, F., Roy-Chowdhury, A.K.: Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In: ICMR, pp. 19–27 (2018)
  • (83) Chen, H., Ding, G., Liu, X., Lin, Z., Liu, J., Han, J.: IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In: CVPR, pp. 12655–12663 (2020)
  • (84) Chen, F., Chen, X., Shi, J., Zhang, D., Chang, J., Tian, Q.: HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval. arXiv preprint arXiv:2205.12105 (2022)
  • (85) Xu, K., Ba, J., et al.: Show, attend and tell: Neural image caption generation with visual attention. In: ICML, pp. 2048–2057 (2015)
  • (86) Wang, B., Ma, L., Zhang, W., Liu, W.: Reconstruction network for video captioning. In: CVPR, pp. 7622–7631 (2018)
  • (87) Agrawal, H., Desai, K., et al.: Nocaps: Novel object captioning at scale. In: ICCV, pp. 8948–8957 (2019)
  • (88) Feng, Q., Wu, Y., Fan, H., Yan, C., Xu, M., Yang, Y.: Cascaded revision network for novel object captioning. IEEE Transactions on Circuits and Systems for Video Technology 30(10), 3413–3421 (2020)
  • (89) Das, A., Kottur, S., et al.: Visual dialog. In: CVPR, pp. 326–335 (2017)
  • (90) Chen, F., Meng, F., Xu, J., Li, P., Xu, B., Zhou, J.: DMRM: A dual-channel multi-hop reasoning model for visual dialog. In: AAAI, pp. 7504–7511 (2020)
  • (91) Chen, F., Chen, X., Meng, F., Li, P., Zhou, J.: GoG: Relation-aware Graph-over-Graph Network for Visual Dialog. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 230–243 (2021)
  • (92) Chen, F., Meng, F., Chen, X., Li, P., Zhou, J.: Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation. In: ACL/IJCNLP (Findings) (2021)
  • (93) Specia, L., Frank, S., et al.: A shared task on multimodal machine translation and crosslingual image description. In: WMT, pp. 543–553 (2016)
  • (94) Yin, Y., Meng, F., Su, J., Zhou, C., Yang, Z., Zhou, J., Luo, J.: A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation. In: ACL, pp. 3025–3035 (2020)
  • (95) Su, Y., Fan, K., Bach, N., Kuo, C.-C.J., Huang, F.: Unsupervised multi-modal neural machine translation. In: CVPR, pp. 10482–10491 (2019)
  • (96) Wang, X., Huang, Q., Celikyilmaz, A., Gao, J., Shen, D., Wang, Y.-F., Wang, W.Y., Zhang, L.: Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In: CVPR, pp. 6629–6638 (2019)
  • (97) Zhu, F., Zhu, Y., Chang, X., Liang, X.: Vision-language navigation with self-supervised auxiliary reasoning tasks. In: CVPR, pp. 10012–10022 (2020)
  • (98) Gu, J., Stefani, E., et al.: Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions. In: ACL, pp. 7606–7623 (2022)
  • (99) Mori, S., Nishida, H., et al.: Optical Character Recognition. John Wiley & Sons, Inc. (1999)
  • (100) Memon, J., Sami, M., Khan, R.A., Uddin, M.: Handwritten optical character recognition (OCR): A comprehensive systematic literature review (SLR). IEEE Access 8, 142642–142668 (2020)
  • (101) Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV, pp. 7262–7272 (2021)
  • (102) Mo, Y., Wu, Y., Yang, X., Liu, F., Liao, Y.: Review the state-of-the-art technologies of semantic segmentation based on deep learning. Neurocomputing 493, 626–646 (2022)
  • (103) Zhao, Z.-Q., Zheng, P., Xu, S.-T., Wu, X.: Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems 30(11), 3212–3232 (2019)
  • (104) Fang, Y., Liao, B., Wang, X., Fang, J., Qi, J., Wu, R., Niu, J., Liu, W.: You only look at one sequence: Rethinking transformer in vision through object detection. NeurIPS 34, 26183–26197 (2021)
  • (105) Sun, C., Baradel, F., et al.: Learning video representations using contrastive bidirectional transformer. arXiv preprint arXiv:1906.05743 (2019)
  • (106) Luo, H., Ji, L., et al.: UniVL: A unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353 (2020)
  • (107) Rethmeier, N., Augenstein, I.: Long-tail zero and few-shot learning via contrastive pretraining on and for small data. In: Computer Sciences & Mathematics Forum, vol. 3, p. 10 (2022)
  • (108) Tan, H., Bansal, M.: LXMERT: Learning cross-modality encoder representations from transformers. In: EMNLP, pp. 5100–5111 (2019)
  • (109) Alberti, C., Ling, J., et al.: Fusion of detected objects in text for visual question answering. In: EMNLP, pp. 2131–2140 (2019)
  • (110) Su, W., Zhu, X., et al.: VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In: ICLR (2019)
  • (111) Zhou, L., Palangi, H., et al.: Unified vision-language pre-training for image captioning and vqa. In: AAAI, pp. 13041–13049 (2020)
  • (112) Lu, J., Goswami, V., et al.: 12-in-1: Multi-task vision and language representation learning. In: CVPR, pp. 10437–10446 (2020)
  • (113) Murahari, V., Batra, D., et al.: Large-scale pretraining for visual dialog: A simple state-of-the-art baseline. In: ECCV, pp. 336–352 (2020)
  • (114) Hao, W., Li, C., et al.: Towards learning a generic agent for vision-and-language navigation via pre-training. In: CVPR, pp. 13137–13146 (2020)
  • (115) Lin, J., Yang, A., et al.: InterBERT: Vision-and-language interaction for multi-modal pretraining. arXiv preprint arXiv:2003.13198 (2020)
  • (116) Huang, Z., Zeng, Z., et al.: Pixel-BERT: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849 (2020)
  • (117) Hong, Y., Wu, Q., et al.: VLN-BERT: A recurrent vision-and-language bert for navigation. In: CVPR, pp. 1643–1653 (2021)
  • (118) Gao, D., Jin, L., Chen, B., et al.: FashionBERT: Text and image matching with adaptive loss for cross-modal retrieval. In: SIGIR, pp. 2251–2260 (2020)
  • (119) Gan, Z., Chen, Y.-C., et al.: Large-scale adversarial training for vision-and-language representation learning. NeurIPS 33, 6616–6628 (2020)
  • (120) Yu, F., Tang, J., et al.: ERNIE-ViL: Knowledge enhanced vision-language representations through scene graph. In: AAAI (2020)
  • (121) Chiou, M.-J., Zimmermann, R., et al.: Visual relationship detection with visual-linguistic knowledge from multimodal representations. IEEE Access 9, 50441–50451 (2021)
  • (122) Cho, J., Lei, J., et al.: Unifying vision-and-language tasks via text generation. In: ICML (2021)
  • (123) Kim, W., Son, B., et al.: ViLT: Vision-and-language transformer without convolution or region supervision. arXiv preprint arXiv:2102.03334 (2021)
  • (124) Huang, Z., Zeng, Z., et al.: Seeing out of the box: End-to-end pre-training for vision-language representation learning. In: CVPR, pp. 12976–12985 (2021)
  • (125) Xu, H., Yan, M., et al.: E2E-VLP: End-to-end vision-language pre-training enhanced by visual learning. In: ACL, pp. 503–513 (2021)
  • (126) Xue, H., Huang, Y., et al.: Probing inter-modality: Visual parsing with self-attention for vision-language pre-training. arXiv preprint arXiv:2106.13488 (2021)
  • (127) Shen, S., Li, L.H., et al.: How much can clip benefit vision-and-language tasks? arXiv preprint arXiv:2107.06383 (2021)
  • (128) Jain, A., Guo, M., et al.: MURAL: multimodal, multitask retrieval across languages. arXiv preprint arXiv:2109.05125 (2021)
  • (129) Wang, W., Bao, H., et al.: VLMo: Unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358 (2021)
  • (130) Yang, J., Duan, J., Tran, S., Xu, Y., Chanda, S., Chen, L., Zeng, B., Chilimbi, T., Huang, J.: Vision-language pre-training with triple contrastive learning. arXiv preprint arXiv:2202.10401 (2022)
  • (131) Urooj, A., Mazaheri, A., et al.: MMFT-BERT: Multimodal fusion transformer with BERT encodings for visual question answering. In: Findings of EMNLP 2020, pp. 4648–4660 (2020)
  • (132) Zhu, L., Yang, Y.: ActBERT: Learning global-local video-text representations. In: CVPR, pp. 8746–8755 (2020)
  • (133) Yan, R., Shou, M.Z., et al.: Video-Text Pre-training with Learned Regions. arXiv preprint arXiv:2112.01194 (2021)
  • (134) Wang, Y., Joty, S., Lyu, M., King, I., Xiong, C., Hoi, S.C.: VD-BERT: A Unified Vision and Dialog Transformer with BERT. In: EMNLP, pp. 3325–3338 (2020)
  • (135) Dong, J., Cong, Y., Sun, G., Zhong, B., Xu, X.: What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation. In: CVPR, pp. 4023–4032 (2020)
  • (136) Dong, J., Cong, Y., Sun, G., Fang, Z., Ding, Z.: Where and how to transfer: knowledge aggregation-induced transferability perception for unsupervised domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021)
  • (137) Xie, S., Sun, C., et al.: Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851 (2017)
  • (138) Zhu, H., Luo, M.-D., Wang, R., Zheng, A.-H., He, R.: Deep audio-visual learning: A survey. International Journal of Automation and Computing 18(3), 351–376 (2021)
  • (139) Tao, J.-H., Huang, J., Li, Y., Lian, Z., Niu, M.-Y.: Correction to: Semi-supervised ladder networks for speech emotion recognition. International Journal of Automation and Computing 18(4), 680–680 (2021)
  • (140) Akbari, H., Yuan, L., et al.: VATT: Transformers for multimodal self-supervised learning from raw video, audio and text. NeurIPS 34 (2021)
  • (141) Liu, J., Zhu, X., et al.: OPT: Omni-perception pre-trainer for cross-modal understanding and generation. arXiv preprint arXiv:2107.00249 (2021)
  • (142) Guzhov, A., Raue, F., et al.: AudioCLIP: Extending clip to image, text and audio. arXiv preprint arXiv:2106.13043 (2021)
  • (143) Zellers, R., Lu, J., et al.: MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound. arXiv preprint arXiv:2201.02639 (2022)
  • (144) Chen, K., Huang, Q., et al.: KB-VLP: Knowledge Based Vision and Language Pretraining. In: ICML (2021)
  • (145) Tsimpoukelli, M., Menick, J., et al.: Multimodal few-shot learning with frozen language models. NeurIPS 34 (2021)
  • (146) Fan, A., Grave, E., et al.: Reducing Transformer Depth on Demand with Structured Dropout. In: ICLR (2019)
  • (147) Sanh, V., Debut, L., et al.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108 (2019)
  • (148) Zafrir, O., Boudoukh, G., et al.: Q8BERT: Quantized 8Bit BERT. In: EMC2-NIPS, pp. 36–39 (2019)
  • (149) Fang, Z., Wang, J., et al.: Compressing visual-linguistic model via knowledge distillation. In: CVPR, pp. 1428–1438 (2021)
  • (150) Li, Y., Liang, F., et al.: Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. In: ICLR (2021)
  • (151) Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S.K.S., Ayan, B.K., Mahdavi, S.S., Lopes, R.G., et al.: Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487 (2022)
  • (152) Li, X.L., Thickstun, J., Gulrajani, I., Liang, P., Hashimoto, T.B.: Diffusion-lm improves controllable text generation. arXiv preprint arXiv:2205.14217 (2022)
  • (153) Chen, W., Han, X., Lin, Y., Zhao, H., Liu, Z., Li, P., Sun, M., Zhou, J.: Fully Hyperbolic Neural Networks. In: ACL, pp. 5672–5686 (2022)
  • (154) Bronstein, M.M., Bruna, J., Cohen, T., Veličković, P.: Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478 (2021)
  • (155) Maass, W.: Networks of spiking neurons: the third generation of neural network models. Neural networks 10(9), 1659–1671 (1997)
  • (156) Zhang, D., Zhang, T., Jia, S., Wang, Q., Xu, B.: Recent Advances and New Frontiers in Spiking Neural Networks. arXiv preprint arXiv:2204.07050 (2022)
  • (157) Zhang, D., Zhang, T., Jia, S., Cheng, X., Xu, B.: Population-coding and dynamic-neurons improved spiking actor network for reinforcement learning. arXiv preprint arXiv:2106.07854 (2021)

Feilong Chen received the B.Sc. degree in computer science from Hefei University of Technology, China, in 2018. He is a Ph.D. candidate at both the Institute of Automation, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. His current interests include theoretical research on vision-language pre-training, multi-modal question answering, and dialogue.


Duzhen Zhang received the B.Sc. degree in software engineering from Shandong University, China, in 2019. He is a Ph.D. candidate at both the Institute of Automation, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. His current interests include theoretical research on reinforcement learning, natural language processing, and spiking neural networks.


Minglun Han received the B.Sc. degree in electronic and information engineering from Harbin Institute of Technology at Weihai, China, in 2018. He is a Ph.D. candidate at both the Institute of Automation, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. His current research interests include speech recognition, speech synthesis, and speech chain.


Xiuyi Chen received his Ph.D. degree (2022) in Pattern Recognition and Intelligent Systems from the Institute of Automation, Chinese Academy of Sciences, advised by Prof. Bo Xu. Previously, he received the B.Sc. degree (2017) from the Department of Control Science and Engineering, Jilin University. His current interests include cross-modal retrieval, multimodal learning, dialogue systems, knowledge-grounded generation, and speech separation.


Jing Shi is a research assistant at the Institute of Automation, Chinese Academy of Sciences, where he received his Ph.D. degree (2021) in Pattern Recognition and Intelligent Systems, advised by Prof. Bo Xu. Previously, he received the B.Sc. degree (2012) from the School of Instrumentation and Optoelectronic Engineering, Beihang University. His current interests include cross-modal modeling, multimodal learning, dialogue systems, speech recognition, and speech separation.


Shuang Xu is a professor at the Institute of Automation, Chinese Academy of Sciences. Her main research interests include natural language processing and understanding, and human-AI hybrid intelligence.


Bo Xu is a professor and the director of the Institute of Automation, Chinese Academy of Sciences, and also the deputy director of the Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences. His main research interests include brain-inspired intelligence, brain-inspired cognitive models, natural language processing and understanding, and brain-inspired robotics.

E-mail: [email protected] (Corresponding author)