GRILL: Grounded Vision-language Pre-training
via Aligning Text and Image Regions
Abstract
Generalization to unseen tasks is an important ability for few-shot learners to achieve better zero-/few-shot performance on diverse tasks. However, such generalization to vision-language tasks including grounding and generation tasks has been under-explored; existing few-shot VL models struggle to handle tasks that involve object grounding and multiple images such as visual commonsense reasoning (Zellers et al., 2019) or NLVR2 (Suhr et al., 2019). In this paper, we introduce GRILL, GRounded vIsion Language aLigning, a novel VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks with no or very few training instances. Specifically, GRILL learns object grounding and localization by exploiting object-text alignments, which enables it to transfer to grounding tasks in a zero-/few-shot fashion. We evaluate our model on various zero-/few-shot VL tasks and show that it consistently surpasses the state-of-the-art few-shot methods.
1 Introduction
Generalization to unseen tasks has been explored and investigated on zero-/few-shot NLP tasks by performing multi-task learning with task-specific prompts (Sanh et al., 2021) or pre-training huge language models on a massive dataset and using a few examples as demonstrations for generalization (Brown et al., 2020). Similarly, few-shot vision-language (VL) learning methods aim to leverage the pre-trained language models and their powerful generalization abilities to adapt to VL domains and learn new tasks from zero or a few examples (Tsimpoukelli et al., 2021; Radford et al., 2021; Jin et al., 2021; Alayrac et al., 2022).

While the few-shot learners can overcome the challenges of supervised learning and avoid the need for task-specific fine-tuning, existing few-shot VL learners suffer from limited generalization to unseen tasks such as grounding tasks that require not only understanding the image and the language, but also locating and identifying relevant regions or objects in images, such as visual commonsense reasoning (VCR) (Zellers et al., 2019) or Flickr30k-entities (Plummer et al., 2015). Existing few-shot VL methods exhibit great performance on visual question answering and captioning tasks (Alayrac et al., 2022; Tsimpoukelli et al., 2021; Jin et al., 2021), but they lack the skills to generalize to grounding tasks as they do not explicitly model the spatial and visual information of the regions or objects. On the other hand, existing fine-tuning methods rely on special representations for representing regions or objects, such as special tokens that mark the regions or objects in the captions and the images (Cho et al., 2021), and object features extracted from a pre-trained object detector (Su et al., 2020; Chen et al., 2019). These methods achieve good results with fine-tuning, but they are not compatible with zero-/few-shot generalization, due to the different designs of object representation for each task and the dependence on external object detectors that may not cover all the relevant concepts.
In this paper, we introduce GRILL, GRounded vIsion Language aLigning, a new VL model that can be generalized to diverse tasks including visual question answering, captioning, and grounding tasks in a zero-/few-shot fashion. We address the challenge of few-shot generalization to unseen tasks by a) learning object grounding and localization in pre-training, b) representing visual concepts (e.g., regions and images) with versatile image patches, and c) unifying the tasks into text generation. Specifically, our model is a generative sequence-to-sequence transformer model (Vaswani et al., 2017) with a vision transformer (ViT) (Dosovitskiy et al., 2021; Liu et al., 2021) to process images with patch embeddings, where each patch represents a fixed-size region of the image. We represent a visual concept (object or region) that corresponds to a group of patches by aggregating information across the patches. This enables our model to generate better representations for any kind of region or image. We construct our pre-training dataset from MS-COCO (Lin et al., 2014; Chen et al., 2015) and Visual Genome (Krishna et al., 2017), where each caption may contain images or bounding boxes within it, which provides rich and diverse information for the model to learn object grounding and localization. Given the dataset, we pre-train our model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM) objectives, which encourage the model to generate natural language from images and fill in the missing words in captions, respectively, and a discriminative objective, which encourages the model to distinguish whether paired image-captions are correct or not.
We test our GRILL on zero-/few-shot vision-language tasks including Visual Commonsense Reasoning (VCR) (Zellers et al., 2019), RefCOCOg (Mao et al., 2016), Flickr30k-entities (Plummer et al., 2015), NLVR2 (Suhr et al., 2019), SNLI-VE (Xie et al., 2019), visual question answering (Goyal et al., 2017), and Flickr30k captioning (Young et al., 2014). We observe that our model demonstrates better zero-/few-shot generalization on diverse tasks compared to baselines. We also find that our pre-training objectives and pre-training datasets are vital for better zero-/few-shot performance.

2 Generalization to Diverse Vision-language Tasks
Various VL tasks require phrase and object grounding, and their task formats differ, which makes it challenging for few-shot models to generalize. In this work, we introduce a model that can generalize to VL tasks, including grounding tasks, with no or only a few labeled examples. We first introduce the background, formal problem definition, and challenges.
2.1 Background: Visual Grounding
Visual grounding refers to the ability to link linguistic concepts (sentences, phrases, or words) to visual concepts (images and regions) (Chandu et al., 2021). Here we consider two types of visual grounding: image grounding and object grounding.
Image grounding refers to the linking of textual concepts to image concepts (Chandu et al., 2021). In this work, we consider image grounding as linking any type of text, including sentences, phrases, and words, to an entire image (e.g., image captioning and image retrieval). Given an image and a corresponding caption, object grounding aims to localize objects in the image as mentioned by a noun phrase in the caption (or the entire caption sentence). Such object grounding occurs at the word, phrase, and sentence levels in the language modality. Many VL tasks require object grounding implicitly or explicitly; we consider tasks that explicitly require localization, such as referring expression comprehension (RefCOCOg; Mao et al., 2016), phrase grounding (Flickr30k-entities; Plummer et al., 2015), and visual commonsense reasoning (Zellers et al., 2019), to be object grounding tasks.
2.2 Problem Formulation
In this work, we re-formulate the widely used pre-training task for image-caption datasets such that each caption may contain one or more images, bounding boxes, or regions embedded directly in the text, in addition to the associated image. Note that some captions may not contain any embedded visual concepts. We refer to learning on captions with embedded images as grounded learning. For pre-training, a VL model is pre-trained on image-caption datasets whose captions include images or bounding boxes. For zero-shot tasks, the pre-trained model cannot access the training data $\mathcal{D}^{\text{train}}$ or validation data $\mathcal{D}^{\text{val}}$; we directly evaluate the model on the test data $\mathcal{D}^{\text{test}}$. For few-shot tasks, the model has access to the instances of $\mathcal{D}^{\text{train}}$ for fine-tuning. For hyper-parameter tuning and model selection, we assume a validation set $\mathcal{D}^{\text{val}}$ with the same number of instances as $\mathcal{D}^{\text{train}}$, sampled from the original training data, to simulate a real-world low-resource environment. The sizes of $\mathcal{D}^{\text{train}}$ and $\mathcal{D}^{\text{val}}$ are 32 in our study.
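As a concrete illustration of this protocol, the following minimal sketch (not the authors' code; the helper name and seed handling are ours) draws a 32-example training split and an equally sized dev split from a task's labeled pool:

```python
# Minimal sketch of the few-shot setup described above: sample a k-shot
# training split and an equally sized dev split; the test set stays untouched.
import random

def make_few_shot_splits(labeled_pool, k=32, seed=0):
    """Return (D_train, D_val), two disjoint k-sized samples."""
    rng = random.Random(seed)
    sampled = rng.sample(labeled_pool, 2 * k)
    return sampled[:k], sampled[k:]

# The experiments sample 5 such splits (e.g., seeds 0-4) and report averages.
```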
Challenges
Our goal is to pre-train a VL model that seamlessly transfers to various tasks not limited to visual question answering and captioning in a zero-shot or few-shot manner. Different tasks, especially grounding tasks, have different task (input and output) formats as in Fig. 1, and thus the main challenge of this work is to generalize the zero-/few-shot ability to diverse tasks. Existing works on grounding tasks introduce special representations to depict regions such as special tokens (Cho et al., 2021) or object representations by an object detector (Su et al., 2020; Chen et al., 2019). While these works perform well on grounding tasks via expensive fine-tuning on labeled data, they have to design different object representations for different task formats. This makes it difficult to generalize to new tasks in a zero-shot fashion. For example, the object representations from an object detector are difficult to transfer to a task that refers to multiple images such as NLVR2 (Suhr et al., 2019). In this work, we tackle these challenges by introducing patch embeddings to represent objects, regions, and images; learning object grounding and localization in pre-training, and unifying all the tasks into text generation.
3 Pre-training for Better Task Generalization
In this section, we introduce GRILL, a few-shot VL model for jointly learning contextualized representations from vision and language tasks. We present an overview of GRILL (§3.1), our model architecture (§3.2), pre-training objectives (§3.3), and pre-training data (§3.4).
3.1 Overview
We propose GRILL, a VL model that learns object grounding and localization in pre-training and generalizes to a wide range of VL tasks in a zero-/few-shot fashion. Our model is a sequence-to-sequence transformer (Vaswani et al., 2017) that takes as input a hybrid sequence consisting of text, an image, and visual concepts or regions, and produces a text sequence as output. We represent an input image with image patches from a vision transformer (Dosovitskiy et al., 2021; Liu et al., 2021) and represent a region that corresponds to a set of patches by aggregating information among those patches (§3.2). We illustrate our model in Fig. 2. Given hybrid sequences with paired text outputs, we pre-train our model with prefix language modeling, masked language modeling, and a discriminative objective (§3.3). We then discuss how we create the hybrid sequences from image-caption datasets (§3.4).
3.2 Model Architecture
For unified text generation, we adopt a transformer encoder-decoder architecture (Vaswani et al., 2017), which takes a text sequence as input and generates another text sequence as output. To encode images and regions for vision-language tasks, we adopt a vision transformer (Dosovitskiy et al., 2021; Liu et al., 2021) as our image encoder; it represents an input image as a sequence of image patches. Specifically, it first splits an image into non-overlapping patches and linearly embeds all patches, and these patches are passed to the transformer encoder layers, yielding patch embeddings $v_1, \dots, v_N$. For an image of resolution $H \times W$ and patch size $P \times P$, we have $N = HW/P^2$. We assume that each embedding $v_i$ encodes the information of the corresponding patch. The image patches are versatile in that they can represent any type of image or region; we represent a visual concept (object or region) that corresponds to a set of patches by aggregating information among those patches, and these aggregated representations are additionally passed to the transformer encoder layer. We adopt the Swin transformer (Swin-B) (Liu et al., 2021) as our vision transformer.
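To make the patch-aggregation idea concrete, here is a minimal sketch assuming mean pooling over the patches a bounding box overlaps; the pooling choice, the function names, and the 32-pixel effective patch stride are our assumptions, not the released implementation:

```python
import torch

def region_embedding(patch_embeds: torch.Tensor, box, image_size, patch_size=32):
    """patch_embeds: (N, d) patch embeddings in row-major (top-left first) order.
    box: (x1, y1, x2, y2) in pixels; image_size: (H, W)."""
    H, W = image_size
    n_rows, n_cols = H // patch_size, W // patch_size
    x1, y1, x2, y2 = box
    rows = range(int(y1) // patch_size, min(int(y2) // patch_size + 1, n_rows))
    cols = range(int(x1) // patch_size, min(int(x2) // patch_size + 1, n_cols))
    idx = [r * n_cols + c for r in rows for c in cols]  # patches overlapping the box
    return patch_embeds[idx].mean(dim=0)                # one vector for the region

# For a 384x384 image with an effective 32x32 patch stride, N = 384*384 / 32^2 = 144.
```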

3.3 Pre-training Objectives
We pre-train our model with prefix language modeling (PrefixLM), masked language modeling (MaskedLM) following Jin et al. (2021), and a discriminative objective. Many VL tasks are classification tasks that require choosing one of the options. To deal with the classification tasks, we additionally adopt the discriminative objective, which is to classify whether the given sequence is correct or not. Fig. 3 illustrates the pre-training objectives.
Prefix language modeling. We include prefix language modeling (PrefixLM) following Raffel et al. (2020) and Jin et al. (2021). The objective randomly splits the text-with-regions input into two separate sequences. The first part may contain regions and is used, together with an image, as input to the encoder; the second part does not contain regions and is used as the target text to be generated by the decoder. The target text is not allowed to contain region representations since our model generates text only.
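A minimal sketch of this split, assuming region mentions in the hybrid caption have already been collapsed into placeholder tokens (the <region_k> naming is ours); forcing the split point past the last region token is one simple way to keep the target text-only:

```python
import random

def prefix_lm_example(tokens, rng=random.Random(0)):
    """tokens: hybrid caption tokens, e.g. ["a", "<region_0>", "rides", "a", "horse"]."""
    region_positions = [i for i, t in enumerate(tokens) if t.startswith("<region_")]
    lo = (region_positions[-1] + 1) if region_positions else 1
    if lo >= len(tokens):                     # no room left for a text-only target
        return None
    split = rng.randint(lo, len(tokens) - 1)  # random split point
    return tokens[:split], tokens[split:]     # (encoder prefix, decoder target)
```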
Masked language modeling. Masked language modeling (Cho et al., 2021; Jin et al., 2021) masks out random spans with numbered sentinel tokens, e.g., <text_1>; the masked sequence is fed into the encoder, and the decoder generates the masked spans as target text. We randomly mask 15% of input text tokens and replace them with sentinel tokens. Note that the input sequence may include region representations in addition to the paired image, and the region representations are never masked.
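A sketch of the masking step; for brevity each masked token is treated as its own length-one span (the real objective masks contiguous spans), and sentinels are numbered from 1 as in <text_1>:

```python
import random

def mask_spans(tokens, mask_ratio=0.15, rng=random.Random(0)):
    """Mask ~15% of text tokens with numbered sentinels; region tokens are never masked."""
    maskable = [i for i, t in enumerate(tokens) if not t.startswith("<region_")]
    n_mask = max(1, int(len(maskable) * mask_ratio))
    picked = set(rng.sample(maskable, min(n_mask, len(maskable))))
    inp, target, sid = [], [], 1
    for i, t in enumerate(tokens):
        if i in picked:
            inp.append(f"<text_{sid}>")       # sentinel replaces the masked token
            target += [f"<text_{sid}>", t]    # decoder reproduces sentinel + span
            sid += 1
        else:
            inp.append(t)
    return inp, target
```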
Discriminative objective. The discriminative objective enables our model to handle classification tasks, where it has to determine whether a given sequence is correct or not. We therefore pre-train GRILL to generate the target text "true" for positive pairs and "false" for negative pairs. We consider an image and its captions with associated regions (if any) as positive pairs. With a probability of 50%, we create negative pairs by replacing the referring words with random region representations from the given image or by randomly choosing another training caption. The negative samples let the model learn the correct bindings between referring words and their corresponding regions.
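A sketch of how such training pairs could be built; the 50/50 split between the two negative types is our assumption (only the overall 50% negative rate is stated above), and the region-token naming follows the earlier sketches:

```python
import random

def discriminative_example(hybrid_tokens, region_ids, other_captions,
                           rng=random.Random(0)):
    """Return (input tokens, target word) for the true/false objective."""
    if rng.random() < 0.5:                                   # keep the positive pair
        return hybrid_tokens, "true"
    if region_ids and rng.random() < 0.5:                    # corrupt the grounding
        corrupted = [f"<region_{rng.choice(region_ids)}>"
                     if t.startswith("<region_") else t for t in hybrid_tokens]
        return corrupted, "false"
    return rng.choice(other_captions), "false"               # mismatched caption
```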
3.4 Pre-training Data
To pre-train GRILL, we collect image-caption data from MS COCO (Lin et al., 2014; Chen et al., 2015) and Visual Genome (VG) (Krishna et al., 2017). From the image-caption pairs, we create hybrid sequences, which may contain one or more region representations, for pre-training. We introduce object-word alignments representing the correspondence between words and objects, and use the alignments to create hybrid sequences. We create hybrid sequences on the fly during pre-training: we randomly choose object-word alignments and replace the words with the corresponding bounding boxes. In addition, we include region descriptions with their aligned regions from Visual Genome as hybrid sequences, as well as non-hybrid sequences (raw text and images), in the pre-training.
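A sketch of the on-the-fly construction, assuming the object-word alignments of §3.4.1 are given as a mapping from token positions to region ids; the replacement probability is an assumption:

```python
import random

def to_hybrid_sequence(caption_tokens, word_to_region, p_replace=0.5,
                       rng=random.Random(0)):
    """word_to_region: {token index: region id} from the alignment step (Sec. 3.4.1)."""
    hybrid = list(caption_tokens)
    for idx, region_id in word_to_region.items():
        if rng.random() < p_replace:
            hybrid[idx] = f"<region_{region_id}>"  # the word is replaced by its box
    return hybrid

# e.g. ["a", "man", "riding", "a", "horse"] with {1: 0, 4: 2} may become
#      ["a", "<region_0>", "riding", "a", "<region_2>"]
```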

3.4.1 Object-word Alignments
Given image-caption pairs, the process of getting object-word alignments consists of three steps: (1) object detection on images, (2) object tag-word matching, and (3) object-word alignments. We illustrate the process in Fig. 4. Note that we use object detection only in pre-training and do not use it on downstream tasks.
Object detection. The first step is to detect objects and object tags from images. We use a state-of-the-art object detector (Zhang et al., 2021) to obtain object bounding boxes and tags, yielding pairs $\{(b_i, t_i)\}_{i=1}^{K}$, where $b_i$ is a bounding box and $t_i$ is the tag for that box. Given the set of tags $\{t_i\}$, we find correspondences between the tags and the words in a caption in the next step.
Object tag-word matching. The second step is to find caption words that are similar to one of the detected tags. To find similar words, we introduce a rule-based approach with the following criteria:
- Exact token matching
- Plural-singular exact token matching
- Word vector similarity (Mikolov et al., 2013)
- WordNet synonyms (Miller, 1995)
If one of the rules is satisfied, we mark the tag and the word as aligned. Note that a word can be matched to multiple tags.
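A sketch of this rule cascade; the 0.5 similarity threshold and the crude plural handling are assumptions, WordNet access goes through NLTK (requires nltk.download("wordnet")), and word_vec stands in for any pre-trained word2vec-style lookup (e.g., gensim KeyedVectors):

```python
import numpy as np
from nltk.corpus import wordnet

def wn_synonyms(word):
    return {l.name().lower() for s in wordnet.synsets(word) for l in s.lemmas()}

def tag_word_match(word, tag, word_vec=None, sim_threshold=0.5):
    w, t = word.lower(), tag.lower()
    if w == t:                                        # exact token matching
        return True
    if w.rstrip("s") == t.rstrip("s"):                # plural/singular matching
        return True
    if word_vec is not None and w in word_vec and t in word_vec:
        a, b = word_vec[w], word_vec[t]               # word-vector similarity
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        if sim > sim_threshold:
            return True
    return t in wn_synonyms(w) or w in wn_synonyms(t)  # WordNet synonyms
```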
Object-word alignments. In the last step, we find alignments between object bounding boxes and words, given the tag-word alignments and the detected object list. Since each tag is mapped to a bounding box, the object-word alignments follow directly. However, some object bounding boxes share the same object tag, so the alignments can include noisy correspondences between boxes and words. To filter out the noisy alignments, we run CLIP (Radford et al., 2021) over the aligned words and objects. After this process, we obtain 1.8 object-word alignments per image-caption pair on average.
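One way the CLIP filtering could look (a sketch, not the authors' implementation; keeping only the arg-max box per aligned word and the particular CLIP checkpoint are our assumptions):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def filter_alignment(image: Image.Image, word: str, candidate_boxes):
    """Score each candidate box crop against the word and keep the best one."""
    crops = [image.crop(tuple(map(int, b))) for b in candidate_boxes]  # (x1, y1, x2, y2)
    inputs = clip_proc(text=[word], images=crops, return_tensors="pt", padding=True)
    scores = clip(**inputs).logits_per_image.squeeze(-1)               # one score per crop
    return candidate_boxes[int(scores.argmax())]
```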
4 Experiments
Table 1: Zero-shot performance on diverse VL tasks.

Method | Size | VCR Q→A | VCR QA→R | VCR Q→AR | RefCOCOg Acc | Flickr30k-ent. R@1 | Flickr30k-ent. R@5 | Flickr30k-ent. R@10 | NLVR2 Acc | SNLI-VE Acc | VQAv2 Acc | Flickr30k CIDEr
Random | - | 25.0 | 25.0 | 6.3 | 19.0 | 6.5 | 27.7 | 47.8 | 50.0 | 33.3 | 0.0 | -
UNITERlarge | 303M | 32.6 | 26.1 | 8.7 | 10.0 | - | - | - | 49.1 | 17.9 | 0.0 | -
VL-T5 | 224M | 28.2 | 27.5 | 8.2 | 0.0 | 0.0 | 0.0 | 1.1 | 48.7 | - | 13.5 | 4.4
FewVLMbase | 224M | 25.9 | 25.4 | 6.5 | 0.0 | 0.0 | 0.0 | 0.0 | 50.6 | - | 43.4 | 31.0
FewVLMlarge | 740M | 27.0 | 26.1 | 7.4 | 0.0 | 0.0 | 0.0 | 0.0 | 51.2 | - | 47.7 | 36.5
GRILL | 310M | 40.6 | 39.3 | 16.2 | 47.5 | 18.9 | 53.4 | 70.3 | 56.1 | 46.9 | 42.3 | 25.6
4.1 Experiment Details
For pre-training, we use a batch size of 1,280 for GRILL, set the learning rate to 1e-4 with 5% linear warmup, and pre-train for 30 epochs. For the few-shot setting, we randomly choose 32 examples and sample 5 different training and dev splits; we train models for 100 epochs with a learning rate of 5e-5 and choose the best checkpoint using the dev split. GRILL has 310M parameters.
4.2 Evaluation Setup
To evaluate few-shot performance, we randomly sample 5 different training and dev splits and report the average performance over the 5 splits. We fine-tune the vision-language models for 100 epochs in the few-shot setup and choose the best checkpoint on the dev set. We report model performance on the test set for RefCOCOg, NLVR2, Flickr30k-entities, SNLI-VE, and Flickr30k captioning (Karpathy split (Karpathy and Li, 2015)), and on the validation set for VCR and VQAv2. As evaluation metrics, we adopt accuracy for VCR, RefCOCOg, SNLI-VE, NLVR2, and VQA; Recall@1, Recall@5, and Recall@10 for Flickr30k-entities; and CIDEr (Vedantam et al., 2015) for captioning.
4.3 Baselines
For baselines, we include existing VL models: UNITERlarge (Chen et al., 2019), VL-T5 (Cho et al., 2021), GLIP-L (Li et al., 2022; Zhang et al., 2022), MDETR-ENB3 (Kamath et al., 2021); and few-shot VL models: FewVLM (Jin et al., 2021), Flamingo (Alayrac et al., 2022), and CPT (Yao et al., 2021). For a fair comparison, we exclude VQA datasets for VL-T5 and pre-train the model using their code. Parameter sizes of each model are 303M for UNITERlarge, 224M for VL-T5, 231M for GLIP-L, 152M for MDETR, 224M and 740M for FewVLMbase and FewVLMlarge, 3B and 80B for Flamingo, and 113M for CPT.
Table 2: Few-shot (32-shot) performance of Random, UNITERlarge, VL-T5, FewVLMbase, FewVLMlarge, and GRILL on VCR (Q→A, QA→R, Q→AR), RefCOCOg, Flickr30k-entities (R@1/5/10), NLVR2, SNLI-VE, VQAv2, and Flickr30k captioning (CIDEr).
4.4 Downstream Tasks and Datasets
In this section, we compare GRILL with the baselines on 7 downstream tasks: visual commonsense reasoning, referring expression comprehension, phrase grounding, NLVR2, SNLI-VE, VQA, and captioning.
Visual Commonsense Reasoning (VCR). Visual Commonsense Reasoning (VCR) (Zellers et al., 2019) is a multiple-choice question-answering task that requires commonsense reasoning about objects in images. The task is decomposed into two sub-tasks, question answering (Q→A) and rationale prediction (QA→R). In the holistic setting (Q→AR), models have to predict both answers and rationales. Following VL-T5 (Cho et al., 2021), we rank the choices by the probability of generating "true" and choose the one with the highest score. VCR provides bounding boxes around entities, with explicit groundings between those entities and references in questions.
Referring Expression Comprehension. Referring expression comprehension is to localize an object given a referring expression. We adopt the RefCOCOg dataset (Mao et al., 2016) for this task. We present a referring phrase and candidate regions from the image to our model; the model finds the most plausible region for the given phrase by ranking the regions by the probability of generating "true". Following VL-T5 (Cho et al., 2021), we use Mask R-CNN (Anderson et al., 2018) to obtain region detections as candidates for inference. We consider the selected region to be correct if its intersection over union (IoU) with the ground-truth region is greater than 0.5. The upper-bound performance on the test set with the Mask R-CNN proposals is 86.09%. We obtain the performance of the random predictor by randomly choosing a bounding box from the object detector's candidates.
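For completeness, the IoU criterion used here (and for phrase grounding below), written out explicitly:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted region counts as correct when iou(pred, gt) > 0.5.
```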
Phrase Grounding. Given one or more phrases, phrase grounding is to provide a set of bounding boxes for each phrase. We use the Flickr30k-entities dataset (Plummer et al., 2015) for this task. Following BAN (Kim et al., 2018) and VisualBERT (Li et al., 2019), we adopt a Faster R-CNN (Ren et al., 2015) pre-trained on Visual Genome to detect regions as candidates for inference. A predicted region is correct if its intersection over union (IoU) with the ground-truth region is greater than 0.5. The upper-bound performance on the test set with the Faster R-CNN proposals is 87.45%. Similar to RefCOCOg, we provide a referring phrase and candidate regions from the image to our model, and the model finds the most plausible region for the given phrase by ranking the regions by the probability of generating "true". We use the any-box protocol from MDETR (Kamath et al., 2021).
NLVR2. The task of NLVR2 (Suhr et al., 2019) is to determine whether a text description is true given two images; it requires understanding and comparing the two images. To apply our model to this task, we create a single image by concatenating the two images, and the model then generates the text label "true" or "false" for inference.
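A sketch of this input construction; horizontal side-by-side pasting is our assumption about the layout:

```python
from PIL import Image

def concat_pair(left: Image.Image, right: Image.Image) -> Image.Image:
    """Combine the two NLVR2 images into a single image before encoding."""
    canvas = Image.new("RGB", (left.width + right.width, max(left.height, right.height)))
    canvas.paste(left, (0, 0))
    canvas.paste(right, (left.width, 0))
    return canvas
```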
Visual Entailment. Visual entailment, SNLI-VE (Xie et al., 2019), is to determine whether the image semantically entails the text given an image-sentence pair. The task is a 3-way classification where labels are “entailment”, “neutral”, and “contradiction.” We define label words for the classification as “entailment”: “true”, “neutral”: “maybe”, “contradiction”: “false.” We choose the classification label by measuring the probability of each word and picking the highest one.
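A sketch of the label-word selection, where score_word stands in for the model's (log-)probability of generating a single word given the image-sentence pair (a hypothetical helper, not part of any released API):

```python
LABEL_WORDS = {"entailment": "true", "neutral": "maybe", "contradiction": "false"}

def classify_snli_ve(score_word, image, hypothesis):
    """Pick the label whose verbalizer word the model finds most probable."""
    scores = {label: score_word(image, hypothesis, word)
              for label, word in LABEL_WORDS.items()}
    return max(scores, key=scores.get)
```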
Visual Question Answering. The visual question answering task (Goyal et al., 2017) requires models to answer a question about a given context image. We approach visual question answering as a generation task so that the model can produce answers without any task-specific heads, following Jin et al. (2021) and Cho et al. (2021). We adopt the input prompt "question: {question} answer: <text_1>," where <text_1> is a sentinel token, from Jin et al. (2021) for generation.
Captioning. The captioning task is to generate a caption given an image. For Flickr30k (Young et al., 2014), we use "an image of" as our input prompt, following Jin et al. (2021).
Table 3: Zero-shot (0) and 32-shot (32) results on RefCOCOg and Flickr30k-entities, including baselines targeted at grounding tasks.

Method | Size | RefCOCOg (0) | RefCOCOg (32) | Flickr30k-entities (0) | Flickr30k-entities (32)
Random | - | 19.0 | 19.0 | 6.5 | 6.5 |
UNITERlarge (Chen et al., 2019) | 303M | 10.0 | 45.4 | - | - |
VL-T5 (Cho et al., 2021) | 224M | 0.0 | 56.9 | 0.0 | 28.1 |
FewVLMlarge (Jin et al., 2021) | 740M | 0.0 | 17.4 | 0.0 | 5.1 |
CPT (Yao et al., 2021) | 113M | 36.5 | - | - | -
MDETR-ENB3 (Kamath et al., 2021) | 152M | 54.0† | - | 84.8‡ | - |
GLIP-L (Li et al., 2022; Zhang et al., 2022) | 231M | - | - | 87.1‡ | - |
GRILL | 310M | 47.5 | 48.1 | 18.9 | 25.4 |
Table 4: Zero-shot and 32-shot VQAv2 accuracy in comparison with Flamingo.

Model | Size | 0-shot | 32-shot
Random | - | 0.0 | 0.0 |
UNITERlarge (Chen et al., 2019) | 303M | 0.0 | 24.2 |
VL-T5 (Cho et al., 2021) | 224M | 13.5 | 43.7 |
FewVLMlarge (Jin et al., 2021) | 740M | 47.7 | 52.3 |
Flamingo-3B (Alayrac et al., 2022) | 3B | 49.2 | 57.1 |
Flamingo-80B | 80B | 56.3 | 67.6 |
GRILL | 310M | 42.3 | 46.8 |
4.5 Results
Zero-shot performance. We evaluate the existing models in a zero-shot manner, where models do not have access to any training data. Tab. 1 shows the performance on each task. First, GRILL achieves the best performance on most tasks, while the baselines perform worse than the random predictor on many of the grounding tasks. In Tab. 3, we additionally include baselines, GLIP-L and MDETR-ENB3, that are targeted at grounding tasks. These models include the corresponding task-specific datasets in pre-training, so they demonstrate strong performance without additional fine-tuning; note that we do not include task-specific datasets in our pre-training. In addition, our model still performs well on SNLI-VE, visual question answering, and captioning, which do not require explicit grounding. Compared with Flamingo in Tab. 4, a 3B- or 80B-parameter vision-language model, our model achieves competitive accuracy considering its size. This suggests that our model generalizes to unseen tasks, while competitors have difficulty generalizing to tasks that need phrase or region grounding in a zero-shot way.
Few-shot performance. We evaluate our model and the competitors in the few-shot setting (Tab. 2). Our model, GRILL, performs well overall, while VL-T5 outperforms it on the RefCOCOg dataset. We conjecture that VL-T5 includes the phrase grounding task in its pre-training, which explains its strong performance there. However, the model still struggles with other tasks including VCR, which demonstrates its limited generalization. Our model shows consistently good results and thus exhibits strong generalization in the few-shot setup.
4.6 Ablations
Here, we study ablations of our method. Tab. 5 reports ablations of the hybrid sequences and pre-training objectives, and Fig. 5 compares different input formats during zero-shot inference.
Hybrid sequences and pre-training objectives. We study the ablation of pre-training objectives and hybrid sequences in pre-training. As shown in Tab. 5, removing hybrid sequences significantly degrades performance on many tasks. In particular, results on RefCOCOg and Flickr30k-entities drop sharply, suggesting that hybrid sequences in pre-training play a vital role in phrase grounding. Among the pre-training objectives, we find that the discriminative objective is important for many of the tasks, while removing the other objectives has a smaller effect. We conjecture that this is because the tasks in the table are classification tasks, for which the discriminative objective is the most useful.
Input formats in inference. We investigate different input formats (hybrid sequences vs. original sequences) during zero-shot inference in Fig. 5. Note that we use hybrid sequences in pre-training. On VCR, we either replace the referring words (e.g., [person1] in Fig. 1) with bounding boxes in the text input (hybrid sequences) or keep the original text input (original sequences). On NLVR2, we either replace the word "left" with the left image and the word "right" with the right image (hybrid sequences) or use the original text input (original sequences). On Flickr30k-entities, we either replace the referring words with the corresponding bounding boxes (hybrid sequences) or keep the referring words and provide them together with the bounding boxes for inference (original sequences). Counter-intuitively, we observe that our model performs better with the original input formats during inference on all the datasets. We conjecture that hybrid sequences with bounding boxes may disturb the model predictions since the model additionally needs to judge whether the grounding information is correct. We leave a more sophisticated design for future work.
Model | VCR | RefCOCOg | NLVR2 | Flickr30k-entities |
Zero-shot | ||||
GRILL | 16.2 | 47.5 | 56.1 | 18.9 |
No hybrid sequences | 12.9 | 18.9 | 55.7 | 5.7 |
No discriminative | 6.8 | 30.5 | 50.4 | 12.7 |
No PrefixLM | 14.4 | 48.5 | 55.8 | 18.5 |
No MLM | 15.6 | 47.8 | 56.0 | 19.3 |
32-shot | ||||
GRILL | 16.7 | 48.1 | 56.2 | 25.4 |
No hybrid sequences | 14.3 | 16.3 | 55.9 | 18.7 |
No discriminative | 7.2 | 42.0 | 50.5 | 15.3 |
No PrefixLM | 14.7 | 48.7 | 55.9 | 21.9 |
No MLM | 16.3 | 47.9 | 56.1 | 23.5 |

5 Related Work
Vision-language few-shot learning. There have been attempts to address the challenge of data-hungry supervised learning in vision-language domains, including FewVLM (Jin et al., 2021), Frozen (Tsimpoukelli et al., 2021), Flamingo (Alayrac et al., 2022), and GLIP (Li et al., 2022; Zhang et al., 2022). FewVLM (Jin et al., 2021) improves the few-shot performance of VQA and captioning by prompting the model, and its performance is on par with large few-shot learners. Frozen (Tsimpoukelli et al., 2021) adapts a few-shot language model (Radford et al., 2019) to vision-language tasks with soft prompting for images. Flamingo (Alayrac et al., 2022) achieves state-of-the-art results on few-shot VQA and captioning tasks by prompting the model with task-specific examples. While these models improve few-shot performance, they are not applicable to grounding tasks. Lastly, GLIP (Li et al., 2022; Zhang et al., 2022) unifies object detection and phrase grounding and achieves strong performance on zero-shot object detection and phrase grounding. Unlike our method, GLIP uses grounding datasets, including Flickr30k-entities, in pre-training, which explains its strong phrase grounding performance without fine-tuning. Our method is not applicable to object detection since it requires bounding box regression; we leave this extension for future work.
Grounded vision-language learning. Grounded vision-language learning has been explored to learn grounding between objects in images and phrases in sentences (Li et al., 2020; Zhang et al., 2021; Kamath et al., 2021; Li et al., 2022; Zhang et al., 2022). MDETR is a modulated detector that detects objects in an image conditioned on a raw text query (Kamath et al., 2021); it exhibits remarkable results on object detection, phrase grounding, and referring expression comprehension by pre-training on object detection data. GLIP follows a similar direction and unifies object detection and phrase grounding (Li et al., 2022; Zhang et al., 2022). While these methods rely on object detection datasets to improve grounding, our method utilizes hybrid sequences constructed from image-caption datasets with an off-the-shelf object detector. Moreover, our model works not only on grounding tasks but also on visual question answering and captioning tasks.
6 Conclusion
In this work, we proposed GRILL, a new VL model that generalizes to a variety of VL tasks including grounding tasks. Our model learns object grounding and localization by introducing hybrid sequences in pre-training and easily adapts to diverse tasks by using a vision transformer for versatile image processing. To pre-train our model, we constructed a pre-training dataset using object-word alignments and pre-trained the model with masked language modeling, prefix language modeling, and the discriminative objective. In our empirical analysis, we observed that our model demonstrates good zero-/few-shot generalization on diverse tasks. We also observed that the discriminative objective and hybrid sequences in pre-training are vital for better zero-/few-shot performance.
References
- Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198.
- Anderson et al. (2018) Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. 2018. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6077–6086.
- Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
- Chandu et al. (2021) Khyathi Raghavi Chandu, Yonatan Bisk, and Alan W Black. 2021. Grounding 'grounding' in NLP. arXiv preprint arXiv:2106.02192.
- Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. ArXiv preprint, abs/1504.00325.
- Chen et al. (2019) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2019. Uniter: Learning universal image-text representations.
- Cho et al. (2021) Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. 2021. Unifying vision-and-language tasks via text generation. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 1931–1942. PMLR.
- Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- Goyal et al. (2017) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325–6334. IEEE Computer Society.
- Jin et al. (2021) Woojeong Jin, Yu Cheng, Yelong Shen, Weizhu Chen, and Xiang Ren. 2021. A good prompt is worth millions of parameters? low-resource prompt-based learning for vision-language models. arXiv preprint arXiv:2110.08484.
- Kamath et al. (2021) Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. 2021. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1780–1790.
- Karpathy and Li (2015) Andrej Karpathy and Fei-Fei Li. 2015. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3128–3137. IEEE Computer Society.
- Kim et al. (2018) Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. 2018. Bilinear attention networks. Advances in neural information processing systems, 31.
- Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73.
- Li et al. (2019) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. Visualbert: A simple and performant baseline for vision and language. ArXiv preprint, abs/1908.03557.
- Li et al. (2022) Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. 2022. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975.
- Li et al. (2020) Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer.
- Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer.
- Liu et al. (2021) Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022.
- Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26.
- Miller (1995) George A Miller. 1995. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41.
- Plummer et al. (2015) Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. 2015. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
- Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. 2015. Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99.
- Sanh et al. (2021) Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.
- Su et al. (2020) Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. 2020. VL-BERT: pre-training of generic visual-linguistic representations. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
- Suhr et al. (2019) Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2019. A corpus for reasoning about natural language grounded in photographs. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6418–6428, Florence, Italy. Association for Computational Linguistics.
- Tsimpoukelli et al. (2021) Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. 2021. Multimodal few-shot learning with frozen language models. ArXiv preprint, abs/2106.13884.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008.
- Vedantam et al. (2015) Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 4566–4575. IEEE Computer Society.
- Xie et al. (2019) Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2019. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706.
- Yao et al. (2021) Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2021. Cpt: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797.
- Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
- Zellers et al. (2019) Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. From recognition to cognition: Visual commonsense reasoning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 6720–6731. Computer Vision Foundation / IEEE.
- Zhang et al. (2022) Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. 2022. Glipv2: Unifying localization and vision-language understanding. arXiv preprint arXiv:2206.05836.
- Zhang et al. (2021) Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. 2021. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588.