
Survey of Visual-Semantic Embedding Methods
for Zero-Shot Image Retrieval

Kazuya Ueki
Department of Information Science, Meisei University
Tokyo, Japan
[email protected]
Abstract

Visual-semantic embedding is an interesting research topic because it is useful for various tasks, such as visual question answering (VQA), image-text retrieval, image captioning, and scene graph generation. In this paper, we focus on zero-shot image retrieval using sentences as queries and present a survey of the technological trends in this area. First, we provide a comprehensive overview of the history of the technology, starting with a discussion of early studies of image-to-text matching and how the technology has evolved over time. In addition, we describe the datasets commonly used in experiments and compare the evaluation results of each method. We also introduce publicly available implementations on GitHub that can be used to confirm experimental results and to develop further improvements. We hope that this survey paper will encourage researchers to further develop their research on bridging images and language.

Index Terms:
image retrieval, visual-semantic embedding, image-text matching, zero-shot learning, transformers

I Typical Approaches for Visual-Semantic Embedding

Figure 1: General architecture for visual-semantic embedding

A typical solution for visual-semantic embedding is to map images and language to a common embedding space, as shown in Fig. 1. In recent years, various methods have been proposed; these methods can be roughly classified into the following three categories:

  1. A method for computing the similarity between images and language by training a deep neural network to extract global representations of images and language.

  2. A method for training local correspondences between salient regions in an image and words in a sentence using a deep neural network and calculating their similarity.

  3. A method that utilizes pre-trained models employing a large corpus of images and text.

The history of each method is shown in Fig. 2. We introduce some of the representative methods below.

Figure 2: History of visual-semantic embedding methods

I-A Methods for Global Image-Text Matching

In the method of mapping global representations, features from the whole image and features from the text are transformed into a common space, and their similarity is measured.

An early method for mapping images and language into a common space is deep visual-semantic embedding (DeViSE) [1], proposed by Frome et al. DeViSE embeds images and their corresponding words in the same space and is trained so that the similarity between matching pairs is high. Image features are extracted from convolutional neural networks (CNNs) trained on ImageNet, and word features are extracted from a model trained with the skip-gram language model. Let $z_i = Mx$ be the vector obtained by extracting the feature $x$ from image $i$ with the CNN and multiplying it by the transformation matrix $M$, and let $z_t$ be the feature obtained from word $t$. The cosine similarity is then calculated as follows:

\mathrm{sim}(i,t)=\frac{z_i\cdot z_t}{\|z_i\|_2\,\|z_t\|_2}. \qquad (1)

For every negative example caption $\hat{t}$ for image $i$, the matrix $M$ that linearly transforms the image features into the word feature space is trained using the following hinge rank loss:

L(i,t)=\sum_{\hat{t}}\max\bigl[0,\;\alpha-\mathrm{sim}(i,t)+\mathrm{sim}(i,\hat{t})\bigr],

where $\alpha$ is the margin.
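As a concrete illustration, the following is a minimal PyTorch sketch of the similarity in Eq. (1) and the hinge rank loss above. The feature dimensions, the margin value, and the use of all other vocabulary entries as negatives are illustrative assumptions, not the exact training setup of [1].

```python
import torch
import torch.nn.functional as F

def hinge_rank_loss(image_feat, word_embs, pos_idx, M, alpha=0.1):
    """DeViSE-style hinge rank loss for a single image (minimal sketch).

    image_feat: (d_img,) CNN feature x of image i
    word_embs:  (V, d_txt) word embeddings; row pos_idx is the correct word t,
                and every other row acts as a negative example t_hat
    M:          (d_txt, d_img) learnable transformation matrix
    alpha:      margin (illustrative value, not taken from the paper)
    """
    z_i = M @ image_feat                                    # z_i = M x
    sims = F.cosine_similarity(z_i.unsqueeze(0), word_embs, dim=1)  # sim(i, .)
    pos = sims[pos_idx]                                     # sim(i, t)
    negs = torch.cat([sims[:pos_idx], sims[pos_idx + 1:]])  # sim(i, t_hat)
    return torch.clamp(alpha - pos + negs, min=0).sum()

# Toy usage with random features (all sizes are arbitrary assumptions).
d_img, d_txt, vocab = 2048, 300, 1000
M = torch.randn(d_txt, d_img, requires_grad=True)
loss = hinge_rank_loss(torch.randn(d_img), torch.randn(vocab, d_txt), 42, M)
loss.backward()  # gradients flow into the transformation matrix M
```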

Unifying visual-semantic embeddings (UVS) [2], proposed by Kiros et al., uses a CNN trained on ImageNet (AlexNet [3]) for image feature extraction and an LSTM for text feature extraction. Let $x$ and $y$ be the features of the image and text, respectively, and let $M_x$ and $M_y$ be the corresponding transformation matrices. The vectors in the common embedding space are then $z_i = M_x x$ and $z_t = M_y y$. The similarity $\mathrm{sim}(i,t)$ between image $i$ and text $t$ is calculated as the cosine similarity of the vectors $z_i$ and $z_t$ in the common space. The loss function is a bi-directional hinge rank loss, as shown below:

L(i,t)=\sum_{\hat{t}}\max\{0,\;\alpha-\mathrm{sim}(i,t)+\mathrm{sim}(i,\hat{t})\}
       +\sum_{\hat{i}}\max\{0,\;\alpha-\mathrm{sim}(i,t)+\mathrm{sim}(\hat{i},t)\},

where $\hat{i}$ and $\hat{t}$ are data unrelated to $t$ and $i$, respectively.

A multimodal recurrent neural network (m-RNN) [4] has also been proposed; it introduces a multimodal layer to a recurrent neural network (RNN) and has a mechanism to input image features obtained from the CNN.

Deep structure-preserving embedding (DSPE) [5] has also been proposed; it includes a constraint that preserves the correspondence between different modalities (image and text) even when training the correspondence between the same modality.

Improved visual-semantic embeddings (VSE++) [6] is an improved version of the UVS method described above, with the loss function changed as follows:

L(i,t)=\max_{\hat{t}}\bigl[\max\{0,\;\alpha-\mathrm{sim}(i,t)+\mathrm{sim}(i,\hat{t})\}\bigr]
       +\max_{\hat{i}}\bigl[\max\{0,\;\alpha-\mathrm{sim}(i,t)+\mathrm{sim}(\hat{i},t)\}\bigr].

UVS sums the loss over all negative examples, whereas VSE++ keeps only the negative with the largest loss for each data point. This allows VSE++ to exploit hard negatives and has been shown to significantly improve accuracy regardless of the image feature extractor used, such as VGG or ResNet.
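To make the difference concrete, the following is a minimal PyTorch sketch of the bi-directional ranking loss in both variants: summing over all negatives (UVS-style) or keeping only the hardest negative (VSE++-style). Treating the other items of a mini-batch as negatives and the margin value are illustrative assumptions, not the exact training setups of [2] or [6].

```python
import torch
import torch.nn.functional as F

def bidirectional_ranking_loss(z_img, z_txt, alpha=0.2, hard_negative=False):
    """Bi-directional hinge rank loss over a mini-batch of matched pairs.

    z_img, z_txt: (B, d) embedded images and texts; row k of each is a pair,
                  and the other rows of the batch serve as negatives.
    hard_negative=False sums over all negatives (UVS-style loss);
    hard_negative=True keeps only the hardest negative (VSE++-style loss).
    """
    z_img = F.normalize(z_img, dim=1)
    z_txt = F.normalize(z_txt, dim=1)
    sim = z_img @ z_txt.t()                        # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                  # sim(i, t) of matched pairs

    cost_t = (alpha - pos + sim).clamp(min=0)      # negative captions t_hat
    cost_i = (alpha - pos.t() + sim).clamp(min=0)  # negative images i_hat

    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t = cost_t.masked_fill(mask, 0)           # ignore the positive pairs
    cost_i = cost_i.masked_fill(mask, 0)

    if hard_negative:                              # VSE++: hardest negative only
        return cost_t.max(dim=1)[0].sum() + cost_i.max(dim=0)[0].sum()
    return cost_t.sum() + cost_i.sum()             # UVS: sum over all negatives

# Toy usage: a batch of 32 already-embedded image-text pairs.
img, txt = torch.randn(32, 512), torch.randn(32, 512)
print(bidirectional_ranking_loss(img, txt, hard_negative=True))
```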

A generative cross-modal feature learning framework (GXN) [7] has also been proposed; it incorporates generative processes into the training of cross-modal feature embedding.

Dual attention networks (DANs) [8], which aim to capture the detailed interactions between images and text, incorporate attention mechanisms in both images and text, and experiments have shown their effectiveness. DANs involve multiple steps to collect essential information from both modalities, focusing on specific regions of the image and words in the text.

Text-image modality adversarial matching (TIMAM) [9], which uses adversarial correspondence, has also been proposed to train modality-invariant feature representations. The paper also describes the successful inclusion of BERT as a language model.

I-B Methods for Local Image-Text Matching

Methods for extracting global representations of images and text are unable to find the relationship between objects in an image and words in a sentence, which limits the accuracy of image-text matching. For this reason, many methods have been proposed to measure relevance by mapping parts of an image or a sentence and aggregating their similarities.

Karpathy et al. were the first to locally map images to text by optimizing the correspondence between the most similar image regions and word pairs with deep fragment embeddings [10].

Deep visual-semantic alignments (DVSAs) [11] estimate the latent alignment between regions in an image obtained by a region CNN (R-CNN) and words in a sentence obtained by a bidirectional recurrent neural network (BRNN). This makes it possible to handle sentences of variable length and to interpret which part of the image each part of the sentence refers to.

A method called semantic concepts and order (SCO) [12] has also been proposed. This method improves the image representation by training semantic concepts and then organizing them in the correct semantic order.

In the above methods, the similarity was calculated for all possible pairs without focusing on regions in the image or words in the sentence. In contrast, the stacked cross attention network (SCAN) [13], a method to improve interpretability by focusing on regions of an image or words in a sentence using the bottom-up attention model [14], has been proposed. SCAN has been used as a baseline for many methods and has led to technological developments since its proposal. Examples include the bidirectional focal attention network (BFAN) [15] and the position focused attention network (PFAN) [16]. In BFAN, irrelevant image regions and words cause deterioration in the correspondence between images and text; thus, they are removed. PFAN trains the correspondence between images and text using the positions of objects in the images as cues.
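As a rough illustration of this idea, the following is a simplified sketch of text-to-image stacked cross attention: each word attends over the region features, and the word-level relevances are aggregated into a single pair score. The temperature value, the normalization, and the average-pooling aggregation are simplifying assumptions; the actual SCAN formulation [13] differs in its details.

```python
import torch
import torch.nn.functional as F

def text_to_image_similarity(words, regions, smooth=9.0):
    """Simplified text-to-image stacked cross attention for one image-text pair.

    words:   (n_w, d) word features of the sentence
    regions: (n_r, d) region features (e.g., from bottom-up attention)
    Returns a scalar relevance score for the pair.
    """
    words = F.normalize(words, dim=1)
    regions = F.normalize(regions, dim=1)
    s = words @ regions.t()                       # (n_w, n_r) word-region similarities
    attn = F.softmax(smooth * s, dim=1)           # each word attends over regions
    attended = attn @ regions                     # (n_w, d) attended region context
    relevance = F.cosine_similarity(words, attended)  # per-word relevance
    return relevance.mean()                       # aggregate by average pooling

# Toy usage: 12 words vs. 36 detected regions (sizes are arbitrary).
score = text_to_image_similarity(torch.randn(12, 1024), torch.randn(36, 1024))
```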

Training coarse correspondences based on object co-occurrence statistics does not allow fine-grained phrase correspondences to be learned. To train more detailed correspondences, a graph structured matching network (GSMN) [17], which models objects, relationships, and attributes as structured phrases through node-level and structure-level correspondences, has been proposed and has demonstrated greatly improved accuracy.

Scene concept graph (SCG) [18], which extends visual concepts and enhances image representation using image scene graphs as external knowledge, is another effective method. Another method, consensus-aware visual-semantic embedding (CVSE) [19], is inspired by the SCG method and is the first attempt to map images to text using common sense information. CVSE utilizes consensus information by computing the statistical co-occurrence correlations between the semantic concepts from an image caption corpus and expanding the constructed concept correlation graph to generate consensus-aware concept (CAC) representations.
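As an illustration of the consensus idea, the following sketch computes simple co-occurrence statistics of semantic concepts over a caption corpus. The concept vocabulary, tokenization, and normalization used here are assumptions made for illustration and do not reproduce the exact procedure of [19].

```python
from collections import Counter
from itertools import combinations

def concept_cooccurrence(captions, concepts):
    """Count how often pairs of concepts co-occur in the same caption.

    captions: list of caption strings
    concepts: list of concept words (the vocabulary is an assumption here)
    Returns a dict mapping (concept_a, concept_b) -> normalized correlation.
    """
    pair_counts = Counter()
    single_counts = Counter()
    for cap in captions:
        present = sorted({c for c in concepts if c in cap.lower().split()})
        single_counts.update(present)
        pair_counts.update(combinations(present, 2))
    # Normalize each pair count by the frequency of the first concept,
    # giving a simple conditional co-occurrence statistic.
    return {(a, b): n / single_counts[a] for (a, b), n in pair_counts.items()}

# Toy usage with two captions and a three-word concept vocabulary.
corr = concept_cooccurrence(
    ["a man riding a horse", "a dog next to a man"],
    ["man", "dog", "horse"],
)
```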

I-C Methods Using Pre-trained Models Employing a Large Corpus of Images and Languages

(Upper) Methods for treating images and text as a single sequence
(Lower) Methods for treating images and text as separate sequences

Figure 3: General architecture of methods that utilize pre-trained models employing a large corpus of images and text

Currently, owing to the availability of large datasets of image-text pairs and the success of large pre-trained models such as transformers [20] and BERT [21] in the field of natural language processing, image retrieval using pre-trained models is becoming mainstream, as shown in Fig. 3. The pre-trained models are mostly based on multi-layer transformers, and typical datasets used for pre-training include Flickr30k [22], MSCOCO [23][24], Conceptual Captions [25], SBU Captions [26], and GQA [27]. In the following, we introduce several methods that have been proposed in recent years.

Vision and language BERT (ViLBERT) [28], which extends the BERT model to a multimodal model of images and text, can train common task-independent representations of images and natural language. In ViLBERT, the image and text inputs are processed as separate sequences by the transformer, after which the cross-modal relationship is trained by a co-attentional transformer, as shown in the lower part of Fig. 3. Experiments show that this achieves higher accuracy than treating images and text as a single sequence and training them with a single BERT. The authors used the Conceptual Captions dataset to pre-train ViLBERT and then fine-tuned it for tasks such as VQA, visual commonsense reasoning (VCR), and image retrieval to achieve the highest accuracy.
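The following is a minimal sketch of the co-attention idea, in which the image sequence queries the text sequence and vice versa. The layer sizes, the use of nn.MultiheadAttention, and the residual/normalization scheme are assumptions for illustration, not ViLBERT's exact co-attentional transformer layer.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """One co-attentional block: each modality attends to the other."""

    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.img_attends_txt = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_attends_img = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(d_model)
        self.norm_txt = nn.LayerNorm(d_model)

    def forward(self, img_seq, txt_seq):
        # Image tokens query the text tokens (keys/values), and vice versa.
        img_out, _ = self.img_attends_txt(img_seq, txt_seq, txt_seq)
        txt_out, _ = self.txt_attends_img(txt_seq, img_seq, img_seq)
        return self.norm_img(img_seq + img_out), self.norm_txt(txt_seq + txt_out)

# Toy usage: 36 region features and 20 token features for one example.
block = CoAttentionBlock()
img, txt = torch.randn(1, 36, 768), torch.randn(1, 20, 768)
img_out, txt_out = block(img, txt)
```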

Unicoder-VL [29] is a universal encoder that aims to pre-train common visual and semantic representations using a large image caption dataset. Language and images are input to a single transformer model with multiple sequential encoders, which train a common embedding. The following three tasks were used for pre-training:

  • Masked Language Modeling (MLM): Predicting masked input token

  • Masked Object Classification (MOC): Predicting features of masked objects

  • Visual-Linguistic Matching (VLM): Predicting whether image and text pairs are relevant

ImageBERT [30], a transformer-based model trained on multiple tasks simultaneously, achieves high retrieval accuracy by crawling a large number of image-text pairs from the Web for pre-training and fine-tuning them in the image retrieval and text retrieval tasks. The four tasks used in the training are as follows:

  • Masked Language Modeling (MLM): Predicting masked input token

  • Masked Object Classification (MOC): Predicting labels of masked objects

  • Masked Region Feature Regression (MRFR): Predicting features for object locations

  • Image Text Matching (ITM): Predicting whether image and text pairs are relevant

UNiversal Image-TExt Representation (UNITER) [31], which pre-trains a BERT-based model on four tasks using a large database of image-text pairs, has also been proposed. The following four tasks are used for pre-training:

  • Masked Language Modeling (MLM): Predicting masked input token

  • Image Text Matching (ITM): Predicting whether image and text pairs are relevant

  • Word-Region Alignment (WRA): Training to optimize the alignment between words and image regions

  • Masked Region Modeling (MRM): Training to reconstruct masked image regions

A pre-training task was designed to map visual and language input at both the global (image-text) and local (word-region) levels.
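As a concrete illustration of two of these pre-training tasks, the following sketch prepares MLM and ITM training examples. The masking rate, the mask token id, and the 50% negative-sampling scheme are common conventions adopted here as assumptions rather than the exact recipes of the models above.

```python
import random

MASK_ID = 103          # assumed id of the [MASK] token
MASK_PROB = 0.15       # commonly used masking rate (assumption)

def make_mlm_example(token_ids):
    """Mask a fraction of tokens; labels are -100 where no prediction is needed."""
    inputs, labels = [], []
    for tok in token_ids:
        if random.random() < MASK_PROB:
            inputs.append(MASK_ID)
            labels.append(tok)      # the model must predict the original token
        else:
            inputs.append(tok)
            labels.append(-100)     # position ignored by the loss
    return inputs, labels

def make_itm_example(caption, all_captions):
    """With 50% probability replace the caption with a random one; label 1 = matched."""
    if random.random() < 0.5:
        return caption, 1
    return random.choice(all_captions), 0
```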

In the above methods, the region features of the image and the text features are concatenated and fed into the model, and the model learns the semantic association between the image and the text in a brute-force fashion using self-attention. In contrast, object-semantics aligned pre-training (Oscar) [32] uses object tags detected in the image as anchor points, which greatly improves the generalizability of the pre-trained models. Pre-training performed well with only two loss functions: a masked token loss and a contrastive loss. The masked token loss treats the object tags as language data and trains the model to recover masked tokens in the input text and object tags. The contrastive loss treats the object tags as image-side data, replaces the tag sequence with a different one with 50% probability, and trains the model to predict whether the tags match the image.

Contrastive language-image pre-training (CLIP) [33], proposed by OpenAI, has also become a hot topic in the field of image and text retrieval since the beginning of 2021. CLIP achieves highly accurate zero-shot image retrieval without fine-tuning by pre-training on a large number of text-image pairs (over 400 million).

We have introduced some recent transformer-based methods for image retrieval, but many more methods have been proposed. Beyond image retrieval, a survey of research on transformers in computer vision is presented in [34]. Among the methods using transformers, there are methods that treat images and language as one sequence, as typified by UNITER and Oscar, and methods that treat images and language as separate sequences, as typified by ViLBERT, but no conclusion has yet been reached as to which is more appropriate. There is also room to consider whether the regions obtained by object detection should be used as the image sequence input to the transformer, or whether the entire image should be used by a grid-based method.

II Dataset

TABLE I: Comparison of Flickr30k and MSCOCO image retrieval results
Method             | Flickr30k                  | MSCOCO
                   | R@1   R@5   R@10  rSum     | R@1   R@5   R@10  rSum
DeViSE [1]         | 6.7   21.9  32.7  113.1    | –     –     –     –
UVS [2]            | 16.8  42.0  56.5  251.9    | –     –     –     –
m-RNN [4]          | 22.8  50.7  63.1  309.5    | 29.0  42.2  77.0  345.7
DSPE [5]           | 29.7  60.1  72.1  351.0    | 39.6  75.2  86.9  420.7
VSE++ [6]          | 39.6  70.1  79.5  409.8    | 52.0  –     92.0  –
GXN [7]            | 41.5  –     80.1  –        | 56.6  –     94.5  –
DAN [8]            | 39.4  69.2  79.1  413.5    | –     –     –     –
TIMAM [9]          | 42.6  71.6  81.9  415.6    | –     –     –     –
Deep Fragment [10] | 10.3  31.4  44.5  197.5    | –     –     –     –
DVSA [11]          | 15.2  37.7  50.5  235.2    | 27.4  60.2  74.8  351.2
SCO [12]           | 41.1  70.5  80.1  418.5    | 56.7  87.5  94.8  499.3
SCAN [13]          | 48.6  77.7  85.2  465.0    | 58.8  88.4  94.8  507.9
SCG [18]           | 49.3  76.4  85.6  468.7    | 61.4  88.9  95.1  517.6
BFAN [15]          | 50.8  78.4  –     –        | 59.4  88.4  –     –
PFAN [16]          | 50.4  78.7  86.1  472.0    | 61.6  89.6  95.2  518.2
GSMN [17]          | 57.4  82.3  89.0  496.8    | 63.3  90.1  95.7  522.5
CVSE [19]          | 56.1  83.2  90.0  487.7    | 66.3  91.8  96.3  525.5
ViLBERT [28]       | 58.2  84.9  91.5  –        | –     –     –     –
Unicoder-VL [29]   | 71.5  90.9  94.9  539.7    | 69.7  93.5  97.2  541.3
ImageBERT [30]     | 73.1  92.6  96.0  545.5    | 73.6  94.3  97.2  549.0
UNITER [31]        | 75.6  94.1  96.8  550.9    | –     –     –     –
Oscar [32]         | –     –     –     –        | 78.2  95.8  98.3  560.6

In this section, we introduce some of the databases commonly used in research related to visual-semantic embedding.

II-A Flickr8k/30k

Databases that are frequently used in training models to generate descriptions (captions) from images include Flickr8k [35] and Flickr30k [22].

Flickr8k is a crowdsourced dataset of 8,092 images with five corresponding human-generated captions per image, focusing on people or animals (e.g., dogs) performing some action. Of the total dataset, 6,000 images are often used as training data, 1,000 images as validation data, and 1,000 images as test data.

Flickr30k is a dataset that extends Flickr8k by annotating each of the 31,783 images with five sentences, in the same way as Flickr8k. Of the total dataset, 28,000 to 29,783 images are often used as training data, 1,000 as validation data, and 1,000 as test data.

II-B MSCOCO

Microsoft common objects in context (MSCOCO) is a dataset created for object recognition, object detection, and caption generation, and it is also frequently used for image-to-language matching and retrieval. In total, the dataset consists of more than 300,000 images, but the most commonly used portion is the 123,287 images of the MSCOCO 2014 dataset together with the caption data [24], which contains five descriptions per image. The original database consisted of 82,783 training images and 40,504 validation images, but it has become standard practice to use 5,000 of the validation images for testing and the remaining 82,783 + 40,504 - 5,000 = 118,287 images for training. For the image retrieval task, accuracy is often evaluated by averaging over the five captions for each of 1,000 or 5,000 test images.

II-C Visual Genome

A well-known dataset that assigns descriptions to regions in an image is Visual Genome [36], which was created in Li Fei-Fei’s lab at Stanford University. It contains 108,077 images, but the number of annotations is overwhelmingly large: approximately 5.4 million captions assigned to image regions, approximately 2.8 million labels for attributes of objects in the images, and approximately 2.3 million labels for relationships among objects.

II-D Conceptual Captions

Conceptual Captions [25] is a dataset of over three million image-caption pairs published by Google. The images and captions were collected from the Internet and thus contain a wide range of expressions. The captions were obtained from the HTML alt attribute associated with each image. Because of its very large size, it is used to train the transformer-based pre-trained models described in Subsection I-C.

II-E SBU Captions

SBU Captions [26] is a database of one million images with one caption for each image. The captions are written by humans and filtered to leave only those with at least two nouns, noun–verb pairs, or verb–adjective pairs. This dataset has been used in several lines of research, including Google’s Show and Tell [37] and Microsoft’s UNITER [31].

III Evaluation Criteria

TABLE II: Publicly available implementation for visual-semantic embedding
Method URL
m-RNN [4] https://github.com/mjhucla/mRNN-CR
VSE++ [6] https://github.com/fartashf/vsepp
Bottom-up attention [14] https://github.com/peteanderson80/bottom-up-attention
SCAN [13] https://github.com/kuanghuei/SCAN
PFAN [16] https://github.com/HaoYang0123/Position-Focused-Attention-Network
GSMN [17] https://github.com/CrossmodalGroup/GSMN
CVSE [19] https://github.com/BruceW91/CVSE
UNITER [31] https://github.com/ChenRocks/UNITER
Oscar [32] https://github.com/microsoft/Oscar
CLIP [33] https://github.com/openai/CLIP

Typical databases used to evaluate the performance of text-to-image retrieval include Flickr30k [22] and MSCOCO [23].

A commonly used evaluation metric for text-to-image retrieval is Recall@K ($K=1,5,10$), where $R@1$, $R@5$, and $R@10$ are the percentages of queries for which a correct image is contained in the top 1, 5, or 10 ranked images, respectively. In addition, the sum of the recall values ($rSum$) may be used to indicate the overall matching performance across both text retrieval from images and image retrieval from text, as shown below:

rSum=\underbrace{R@1+R@5+R@10}_{\mathrm{text\ search\ with\ image\ as\ a\ query}}+\underbrace{R@1+R@5+R@10}_{\mathrm{image\ search\ with\ text\ as\ a\ query}}.
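The following sketch computes Recall@K and rSum from a similarity matrix. For simplicity it assumes that each image is paired with exactly one caption and that the k-th row and k-th column of the matrix form a matched pair (in practice, MSCOCO provides five captions per image).

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K for image retrieval with text as the query.

    sim: (N, N) similarity matrix; sim[i, t] is the score between image i and
         text t, and the i-th image is assumed to match the i-th text.
    """
    ranks = []
    for t in range(sim.shape[1]):
        order = np.argsort(-sim[:, t])                  # images sorted by score for text t
        ranks.append(int(np.where(order == t)[0][0]))   # rank of the correct image
    ranks = np.array(ranks)
    return {k: 100.0 * np.mean(ranks < k) for k in ks}

def r_sum(sim):
    """rSum: sum of R@1, R@5, R@10 in both retrieval directions."""
    t2i = recall_at_k(sim)     # image search with text as a query
    i2t = recall_at_k(sim.T)   # text search with image as a query
    return sum(t2i.values()) + sum(i2t.values())

# Toy usage with a random similarity matrix for 1,000 image-text pairs.
scores = np.random.rand(1000, 1000)
print(recall_at_k(scores), r_sum(scores))
```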

IV Evaluation Results

For the methods described in Section I, Table I summarizes the evaluation results for zero-shot image retrieval reported in the paper for each method. The databases used for the evaluation are Flickr30k and MSCOCO, on which methods are most frequently compared. Most of the results in this table were obtained with the following data splits.

  • Flickr30k
    train : validation : test = 29,783 : 1,000 : 1,000

  • MSCOCO
    train : test = 118,287 : 5,000 (accuracy is typically reported on 1,000 or 5,000 test images)

The upper part of Table I shows the results of the methods that map global representations of images and language, as described in Subsection I-A. Starting from DeViSE, the first attempt to map images and language into a common space, it can be seen that successive improvements, such as VSE++, which utilizes hard negatives, and DAN, which uses an attention mechanism, steadily increased accuracy. Among them, VSE++ has been frequently used as a baseline for other proposed methods.

The middle part of Table I shows the results of the methods for mapping the local representations of images and languages, as shown in Subsection I-B. The overall retrieval accuracy is higher than that of the methods that map global representations of images and language. This is because the words in the search query text can now be appropriately mapped to the objects in the image. Among them, SCAN, which incorporates an attention mechanism, is highly accurate and has become the basis for various subsequent methods, such as BFAN, PFAN, and GSMN.

The bottom part of Table I shows the results of the methods that utilize models pre-trained on a large corpus of images and language, as described in Subsection I-C. Note that these methods use caption datasets other than Flickr30k and MSCOCO for pre-training. Fine-tuning transformer-based pre-trained models, which have proven powerful in natural language processing, greatly exceeds the accuracy of the conventional methods.

Note that the accuracies shown in Table I are for methods that are trained or fine-tuned on the target datasets (Flickr30k and MSCOCO). For this reason, CLIP, which performs completely zero-shot retrieval without using the target datasets, is excluded from Table I because it is not directly comparable. Nevertheless, the results reported in [33] show that its accuracy is almost as good as that of the fine-tuned transformer-based methods.

V Implementation

The publicly available implementations that can be used to build a base system and to compare a proposed method with the baselines are summarized in Table II. Many of them are improvements built on the VSE++ and SCAN implementations. The SCAN implementation is a modified version of the VSE++ implementation and uses image features obtained from the bottom-up attention implementation. Subsequent methods, such as PFAN, GSMN, and CVSE, are likewise based on the VSE++ or SCAN implementations and on bottom-up attention features. Both the VSE++ and SCAN implementations use PyTorch as the deep learning framework.

However, few implementations of methods that use transformer-based pre-trained models have been published. UNITER provides the latest pre-trained models for download, which can be fine-tuned for tasks such as VQA, VCR, and image/text retrieval. Similarly, for Oscar, pre-trained models can be downloaded and fine-tuned for VQA, GQA, natural language visual reasoning for real (NLVR2), image/text retrieval, and image caption generation. However, as of January 2021, the source code for the pre-training part has not been released. For CLIP, it is relatively easy to implement zero-shot image or text retrieval because pre-trained models and sample source code for extracting features from images and text are available.
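Following the feature-extraction sample code in the CLIP repository, the sketch below performs zero-shot text-to-image retrieval with a pre-trained CLIP model; the image file names and the query sentence are placeholders.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image files and a free-form sentence query.
image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
text = clip.tokenize(["a dog running on the beach"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(text)
    # Normalize so that dot products become cosine similarities.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = text_features @ image_features.T

best = similarity.argmax(dim=-1).item()
print(f"Best match for the query: {image_paths[best]}")
```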

VI Summary

In this paper, we introduced recently proposed methods for zero-shot image retrieval, where the query is a sentence. It is noteworthy that significant progress has been made in just over five years. In particular, the use of transformer-based learning models, which have been rapidly introduced in the past year or two, is expected to expand in the future.

A future challenge is to reduce the high computational cost. In particular, training transformer-based models is very time-consuming and requires a large amount of computing resources. Training a powerful model such as the recently proposed GPT-3 [38] has been estimated to cost at least $4.6 million and, on a single GPU, would theoretically take 355 years. Therefore, it is increasingly important to improve computational efficiency, including at the hardware level.

Acknowledgements

This work was partially supported by JSPS KAKENHI (Grant Number 18K11362).

References

  • [1] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, T. Mikolov, “DeViSE: A Deep Visual-Semantic Embedding Model,” In Proc. of Advances in Neural Information Processing Systems (NIPS), vol.26, 2013.
  • [2] R. Kiros, R. Salakhutdinov, R. S. Zemel, “Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models,” In Proc. of NIPS 2014 Deep Learning Workshop, 2014.
  • [3] A. Krizhevsky, I. Sutskever, G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” In Proc. of Advances in Neural Information Processing Systems (NIPS), vol.25, 2012.
  • [4] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, A. Yuille, “Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN),” In Proc. of International Conference on Learning Representations (ICLR), 2015.
  • [5] L. Wang, Y. Li, J. Huang, S. Lazebnik, “Learning Deep Structure-Preserving Image-Text Embeddings,” In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.5005–5013, 2016.
  • [6] F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler, “VSE++: Improving Visual-Semantic Embeddings with Hard Negatives,” In Proc. of British Machine Vision Conference (BMVC), 2018.
  • [7] J. Gu, J. Cai, S. Joty, L. Niu, G. Wang, “Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models,” In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.7181–7189, 2018.
  • [8] H. Nam, J.-W. Ha, J. Kim, “Dual Attention Networks for Multimodal Reasoning and Matching,” In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.299–307, 2017.
  • [9] N. Sarafianos, X. Xu, I. A. Kakadiaris, “Adversarial Representation Learning for Text-to-Image Matching,” In Proc. of International Conference on Computer Vision (ICCV), 2019.
  • [10] A. Karpathy, A. Joulin, L. Fei-Fei, “Deep Fragment Embeddings for Bidirectional Image Sentence Mapping,” In Proc. of Advances in Neural Information Processing Systems (NIPS), vol.27, 2014.
  • [11] A. Karpathy, L. Fei-Fei, “Deep Visual-Semantic Alignments for Generating Image Descriptions,” In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.3128–3137, 2015.
  • [12] Y. Huang, Q. Wu, C. Song, L. Wang, “Learning Semantic Concepts and Order for Image and Sentence Matching,” In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.6163–6171, 2018.
  • [13] K.-H. Lee, X. Chen, G. Hua, H. Hu, X. He, “Stacked Cross Attention for Image-Text Matching,” In Proc. of European Conference on Computer Vision (ECCV), 2018.
  • [14] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, “Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering,” In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.6077–6086, 2018.
  • [15] C. Liu, Z. Mao, A.-A. Liu, T. Zhang, B. Wang, Y. Zhang, “Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching,” In Proc. of ACM International Conference on Multimedia (ACMMM), 2019.
  • [16] Y. Wang, H. Yang, X. Qian, L. Ma, J. Lu, B. Li, X. Fan, “Position Focused Attention Network for Image-Text Matching,” In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), pp.3792–3798, 2019.
  • [17] C. Liu, Z. Mao, T. Zhang, H. Xie, B. Wang, Y. Zhang, “Graph Structured Network for Image-Text Matching,” In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [18] B. Shi, L. Ji, P. Lu, Z. Niu, N. Duan, “Knowledge Aware Semantic Concept Expansion for Image-Text Matching,” In Proc. of International Joint Conference on Artificial Intelligence (IJCAI), pp.5182–5189, 2019.
  • [19] H. Wang, Y. Zhang, Z. Ji, Y. Pang, L. Ma, “Consensus-Aware Visual-Semantic Embedding for Image-Text Matching,” In Proc. of European Conference on Computer Vision (ECCV), 2020.
  • [20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, “Attention Is All You Need,” In Proc. of Advances in Neural Information Processing Systems, (NIPS), vol.30, 2017.
  • [21] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv:1810.04805, 2018.
  • [22] P. Young, A. Lai, M. Hodosh, J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics. vol.2, pp.67–78, 2014.
  • [23] T.-Y. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, “Microsoft COCO: Common Objects in Context,” In Proc. of European Conference on Computer Vision (ECCV), pp.740–755, 2014.
  • [24] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollar, C. L. Zitnick, “Microsoft COCO Captions: Data Collection and Evaluation Server,” arXiv:1504.00325, 2015.
  • [25] P. Sharma, N. Ding, S. Goodman, R. Soricut, “Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset for Automatic Image Captioning,” In Proc. of Annual Meeting of the Association for Computational Linguistics, vol.1, pp.2556–2565, 2018.
  • [26] V. Ordonez, G. Kulkarni, T. Berg, “Im2Text: Describing Images Using 1 Million Captioned Photographs,” In Proc. of Advances in Neural Information Processing Systems (NIPS), vol.24, 2011.
  • [27] D. A. Hudson, C. D. Manning, “GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering,” In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [28] J. Lu, D. Batra, D. Parikh, S. Lee, “ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks,” In Proc. of Advances in Neural Information Processing Systems (NeurIPS), pp.13–23, 2019.
  • [29] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, M. Zhou, “Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training,” In Proc. of AAAI Conference on Artificial Intelligence (AAAI), 2020.
  • [30] D. Qi, L. Su, J. Song, E. Cui, T. Bharti, A. Sacheti, “ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data,” arXiv:2001.07966, 2020.
  • [31] Y.-C. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. Liu, “UNITER: UNiversal Image-TExt Representation Learning,” In Proc. of European Conference on Computer Vision (ECCV), 2020.
  • [32] X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y. Choi, J. Gao, “Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks.” In Proc. of European Conference on Computer Vision (ECCV), 2020.
  • [33] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, “Learning Transferable Visual Models From Natural Language Supervision,” arXiv:2103.00020, 2021.
  • [34] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, M. Shah, “Transformers in Vision: A Survey,” arXiv:2101.01169, 2021.
  • [35] M. Hodosh, P. Young, J. Hockenmaier, “Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics,” Journal of Artificial Intelligence Research (JAIR), vol.47, pp.853–899, 2013.
  • [36] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. S. Bernstein, L. Fei-Fei, “Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations,” International Journal of Computer Vision (IJCV), vol.123, pp.32–73, 2017.
  • [37] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, “Show and Tell: A Neural Image Caption Generator,” In Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [38] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, “Language Models are Few-Shot Learners,” In Proc of Conference on Neural Information Processing Systems (NeurIPS), 2020.