
Renaissance: Investigating the Pretraining of Vision-Language Encoders

Clayton Fields
Boise State University
1910 W University Dr
Boise, ID 83725
[email protected]
\AndCasey Kennington
Boise State University
1910 W University Dr
Boise, ID 83725
[email protected]
Abstract

In the past several years there has been an explosion of available models for vision-language tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. In this paper we seek to answer several questions related to the pretraining of vision-language encoders through a series of experiments. In our first set of experiments, we show that we can save significant compute at no cost to downstream performance by freezing large parts of vision-language models during pretraining. In our second set of experiments we examine the effect of basing a VL transformer on a vision model versus a text model. Additionally, we introduce a VL modeling platform called Renaissance that we use to conduct all of the experiments. This program offers a great deal of flexibility in creating, training and evaluating transformer encoders for VL modeling. The source code for Renaissance can be found at https://github.com/bsu-slim/renaissance.

1 Introduction

In the span of a few years, dozens of vision-language (VL) transformers have appeared in the literature with a bewildering array of architectures and training methods (see Fields and Kennington (2023) for a review). VL tasks, such as NLVR2 Suhr et al. (2018), where the model must answer questions about images (see Figure 4 for an example), and image captioning, require models to somehow represent and fuse both text and image information. Unfortunately, knowledge of best practices for training and implementing these models has lagged far behind the model development process. This stands in contrast to the NLP domain, where studies such as Rogers et al. (2021) and Kaplan et al. (2020) have thoroughly investigated the inner workings of and best training practices for NLP transformers. To date, there have been only a handful of studies analyzing VL transformers, such as Bugliarello et al. (2021), and the collected literature still fails to address some very basic questions concerning VL modeling with transformers.

In this paper we begin to address this gap by providing a systematic analysis geared toward shedding light on some basic aspects of training transformers for vision-language modeling. In particular, we focus on the pretraining and fine-tuning of transformer-encoder architectures. Transformer encoders are best suited to discriminative tasks such as the NLVR2 benchmark mentioned in the opening paragraph, so we do not address generative tasks like image captioning here. In our first set of experiments (Section 4), we ask whether it is possible to save compute by freezing parts of the model during pretraining, and we examine the effect on downstream performance. In our second and final set of experiments (Section 5) we compare the performance of a VL transformer based on a pretrained text encoder versus one based on a pretrained vision transformer. Both sets of experiments will help to establish best training practices for those interested in training VL transformers and hopefully also provide theoretical insight. To perform our experiments, we created a novel VL framework, called Renaissance, that streamlines the ability to evaluate different VL model types (e.g., one-tower and two-tower) against a suite of benchmarks.

The specific contributions of this paper can be summarized as follows:

  • We introduce a software platform, Renaissance, that offers a range of options for creating, training and testing vision-language transformer encoder models.

  • We demonstrate that a great deal of compute can be saved by freezing parts of two-tower encoder models during pretraining. In particular, freezing the visual module can actually lead to small increases in performance. When both modules are frozen there is some loss in downstream performance, though the benefits may outweigh the costs for those with compute-limited training setups.

  • We show that when training a one-tower encoder model, it is better to initialize the model’s weights randomly than to use pretrained weights from either a text or a vision encoder model.

2 Related Work

2.1 Pretraining Vision-Language Transformers

The domain of vision-language modeling has seen major advancements in recent years with the adaptation of the transformer (Vaswani et al. 2017a) to VL tasks. The first VL transformers to appear in the literature were adaptations of the popular BERT NLP model (Devlin et al. 2018). Some examples include VilBERT (Lu et al. 2019), LXMERT (Tan and Bansal 2019a) and VisualBERT (Li et al. 2019a). In the short space of time since these models were introduced, a bewildering array of model variations has appeared in the literature. There are huge models designed for zero-shot inference such as Flamingo (Alayrac et al. 2022) and versatile models such as OFA (Wang et al. 2022) that can generate both text and images.

While the literature is now replete with vision-language models, the analysis of their performance and the establishment of best practices have been mostly left open. The aforementioned study Bugliarello et al. (2021) examines the pretraining of vision-language models. Bugliarello et al. (2023), an effort by the same lead author, Emanuele Bugliarello, provides an analysis of several models on what they term "fine-grained" tasks. Frank et al. (2021) examined the extent to which the vision and language modalities are actually integrated in VL transformers. As valuable as these studies have been, however, they have barely scratched the surface of understanding vision-language transformers.

2.2 Vision-Language Modeling Software

Vision-language modeling has only recently come to prominence, and the available software for it is still in a fairly primitive state. When using NLP models, researchers have a range of available software options that abstract many of the most difficult elements away from users. The most prominent example of this is the Huggingface model hub, which specializes in NLP transformers. Though there are a few vision-language models available on Huggingface, there are not many, and they do not lend themselves to the modifications that research often demands. In addition to the Huggingface Hub, there have been some efforts toward creating software platforms primarily dedicated to multimodal modeling. LAVIS, introduced in Li et al. (2023) by Salesforce, is one such platform. Though well programmed and relatively straightforward to use, it supports very few VL models, and implementing new VL tasks is also fairly involved. The paucity of available software options led us to create the Renaissance platform for VL modeling that we introduce in the next section.

3 Renaissance: A Versatile Vision-Language Modeling Platform

We now describe the Renaissance program that we use to complete all of the experiments in this study. Because this is its first introduction, we will provide an extensive description of the program and its capabilities. In this section we also take the opportunity to introduce the pretraining tasks, finetuning tasks and the architectural elements required to understand the experimental procedures.

3.1 Capabilities

In this section we describe the capabilities and various options available from the Renaissance platform. The most salient feature of the platform is its ability to easily change the basic architectural features of multi-modal transformers, then train and test them. By simply editing a configuration file, a user can choose a pretrained text encoder or a pretrained vision encoder from the Huggingface hub to insert into the model. In addition to the various architectural options, there are also a number of pretraining and fine-tuning tasks and options available. We will describe these in subsections below.

3.1.1 Model Types

One-Tower Encoder Modeling

A one-tower encoder model consists of an embedding layer and a single transformer encoder module, followed by a classification layer. Previous examples of one-tower encoders include models such as UNITER Chen et al. (2020) and VisualBERT Li et al. (2019b). In principle, one-tower encoders are very much like NLP encoders such as BERT Devlin et al. (2019) and ELECTRA Clark et al. (2020), with some adaptations for the vision-language domain.

One of the key adaptations is that a vision-language model’s embedding layer must accommodate both textual and visual features. For NLP models such as BERT, the embedding layer consists of a single large matrix in which each word in the model’s vocabulary is represented by a vector. While one-tower vision-language encoders also have this component, they have additional components that process an image into a sequence of vectors. In the current version of the program, the embedding layer always consists of BERT-style word-piece embeddings for text Devlin et al. (2019) and patch embeddings for image features. Patch embeddings were first introduced as part of the ViT model in Dosovitskiy et al. (2020). Here an image is split into small square patches; each patch is then flattened into a vector and projected to the embedding dimension. A visual depiction of patch embeddings from the paper that introduced them can be found in Figure 2. In future versions of the program we hope to include support for embedding images with grid features derived from a convolutional neural network Huang et al. (2020).
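As a concrete illustration, the following is a minimal sketch of patch embedding in PyTorch. It is not Renaissance’s actual implementation; the class name and the default sizes (224-pixel images, 16-pixel patches, 768-dimensional embeddings, as in ViT-Base) are assumptions made for the example.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into square patches and project each patch to the embedding dimension."""

    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is the standard way to "flatten each non-overlapping
        # patch and apply a shared linear projection" in a single step.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values):
        # pixel_values: (batch, channels, height, width)
        x = self.proj(pixel_values)        # (batch, embed_dim, h/patch, w/patch)
        x = x.flatten(2).transpose(1, 2)   # (batch, num_patches, embed_dim)
        return x

# Example: a 224x224 RGB image becomes a sequence of 196 patch embeddings.
patches = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768])
```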

The second major component of one-tower encoder models is the transformer encoder stack. The encoder stack for vision-language models is architecturally the same as those found in NLP transformers; the only major difference is that the encoder’s weights are derived from training on vision-language tasks. Renaissance supports the use of most text models on the hub as encoder modules, along with a select variety of vision transformer models, specifically vision models based on the transformer Vaswani et al. (2017b) such as ViT Dosovitskiy et al. (2020), DeiT Touvron et al. (2021), DINO Caron et al. (2021) and BeIT Bao et al. (2021). Convolutional models such as ResNet He et al. (2015) and hybrid models such as ConvNeXT Liu et al. (2022) are not supported. Finally, the classification layer is no different from those found in any other deep learning model: it consists of one or two linear layers that output a score for each possible outcome in the target distribution.
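A classification head of the kind just described might look like the short sketch below; the two-layer form and the GELU activation are assumptions for illustration, not Renaissance’s exact head.

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Two linear layers mapping a pooled multimodal feature to per-class logits."""

    def __init__(self, hidden_size, num_labels):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.GELU()
        self.out_proj = nn.Linear(hidden_size, num_labels)

    def forward(self, pooled_output):
        # pooled_output: (batch, hidden_size) -> logits: (batch, num_labels)
        return self.out_proj(self.activation(self.dense(pooled_output)))
```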

Figure 1: A visual representation of a one-tower vision-language encoder model.
Figure 2: A depiction of patch embeddings from Dosovitskiy et al. (2020).
Two-Tower Encoder Modeling

A two-tower encoder model consists of a text-transformer model, a vision-transformer model and a set of cross-modal layers that combine the output of each model into a multimodal feature using cross-attention Lu et al. (2019). In cross-attention layers, the key and value vectors from the visual stream are passed to the multi-head attention mechanism of the textual stream, and the key and value vectors from the text stream are likewise passed to the attention heads in the visual stream, resulting in a multimodal output. Figure 3 shows a simple visual representation of a two-tower model. The vision and text modules are separate, and the vision and text streams only interact in the cross-modal layers; this is in contrast to one-tower models, where visual and textual features interact throughout the model. Another distinction is that for two-tower models the visual and textual features need not be embedded in the same vector space, because each encoder module is associated with its own embedding layer. Some previously introduced examples of two-tower transformers are METER Dou et al. (2022) and BridgeTower Xu et al. (2023).
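The sketch below illustrates this cross-attention exchange using PyTorch’s built-in multi-head attention. The class name and sizes are assumptions, and residual connections, layer normalization and feed-forward sublayers are omitted for brevity; this is a simplification, not the actual Renaissance cross-modal layer.

```python
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """One cross-attention exchange: each stream queries the other stream's keys/values."""

    def __init__(self, hidden_size=256, num_heads=4):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)

    def forward(self, text_states, image_states):
        # Text stream: queries come from text, keys and values from the image.
        text_out, _ = self.text_to_image(text_states, image_states, image_states)
        # Image stream: queries come from the image, keys and values from text.
        image_out, _ = self.image_to_text(image_states, text_states, text_states)
        return text_out, image_out

# Example: fuse a 32-token text sequence with 196 image patch features.
text = torch.randn(2, 32, 256)
image = torch.randn(2, 196, 256)
fused_text, fused_image = CrossModalLayer()(text, image)
```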

In previously released two-tower encoders, the text encoder modules are architecturally much like BERT and the vision modules much like ViT Dosovitskiy et al. (2020). Renaissance allows users to create new two-tower models with most vision transformers on the hub as a vision module (convolutional models are not supported) and most text transformers on the hub as a text module. The layers in the cross-modal module are based on implementations from the LXMERT model Tan and Bansal (2019b). Users can choose the dimension and number of cross-modal layers and the number of attention heads per layer. Finally, the classification layer is essentially the same as those found in one-tower models.

Figure 3: A visual representation of a two-tower vision-language encoder model.

3.1.2 Training and Configuration Options

Beyond providing flexibility in basic architecture design, the program also provides several options for training and configuring models. The most salient of these features are discussed in this subsection.

Random Weight Initialization

Multi-modal models are often initialized with weights from pretrained text or image models. For instance, VisualBERT is initialized with the weights of the text model BERT Devlin et al. (2019), and ViLT with weights from the image transformer ViT Dosovitskiy et al. (2020). When doing research, however, it is often useful to initialize model weights randomly and train from scratch, for instance to establish baselines in experiments; as we show in Section 5, this can also be beneficial to the performance of one-tower models. Users can randomly initialize encoder weights by simply changing settings in a configuration file.

Manually Configure Model Dimensions

By default, the dimensions of encoder modules are determined by the pretrained model chosen from the Huggingface Hub. However, when model weights are set to be randomly initialized, users can manually specify the dimensions of encoder modules. This allows users to easily create completely novel architectures. As an example, consider a one-tower encoder whose encoder is based on ELECTRA-Small. By default, ELECTRA-Small has a hidden size of 256, an embedding size of 128, an intermediate size of 1024 and 12 layers. Any of these numbers can be altered to create encoders of the desired shape and size.
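To make the idea concrete, a set of overrides might look like the sketch below. The key names are illustrative only and do not reflect Renaissance’s actual configuration schema; only the Huggingface model identifier is a real checkpoint name.

```python
# Hypothetical configuration overrides for a one-tower model based on ELECTRA-Small.
config = {
    "encoder": "google/electra-small-discriminator",  # Huggingface Hub model id
    "random_init": True,       # ignore pretrained weights and initialize randomly
    "hidden_size": 384,        # widen the default 256-dimensional hidden states
    "embedding_size": 128,
    "intermediate_size": 1536,
    "num_hidden_layers": 8,    # shrink from the default 12 layers
}
```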

Freeze Modules During Training

It is also easy to freeze the weights of any of the model’s modules during training. In addition to being useful for research purposes, this feature allows the user to significantly cut the compute costs of training. In practice, freezing the pretrained weights of a model’s encoder modules can be quite useful and is featured in our first set of experiments.
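At the PyTorch level, freezing amounts to excluding a module’s parameters from gradient computation, roughly as in the minimal sketch below; the DeiT-Tiny checkpoint name is one plausible choice for illustration and is not prescribed by Renaissance.

```python
from transformers import AutoModel

# Load a pretrained vision encoder from the Huggingface Hub.
vision_encoder = AutoModel.from_pretrained("facebook/deit-tiny-patch16-224")

# Freeze it: excluding these parameters from gradient computation removes their
# optimizer state and gradient buffers, reducing GPU memory use during pretraining.
for param in vision_encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in vision_encoder.parameters() if p.requires_grad)
print(f"Trainable parameters in frozen module: {trainable}")  # 0
```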

3.1.3 Pretraining Tasks

Currently, our program supports two pretraining tasks, masked language modeling and image-text matching. Models can be pretrained with either of these tasks individually or both in conjunction. Using both tasks in conjunction is a common approach found in the literature. Both tasks are briefly described in the list immediately below. A more thorough description can be found in Fields and Kennington (2023).

  • Masked language modeling (MLM) tasks the model with guessing a masked word based on the image features and the unmasked words. The MLM task was first introduced in Devlin et al. (2019). In the original task, the model’s prediction is based only on the unmasked words in the sequence of text; in the multimodal setting, it is based on the unmasked words as well as the associated image.

  • Image-text matching (ITM) is a binary task where the model is presented with an image-text pair and must determine whether the text actually describes the image. Positive pairs are simply the original pairings from the chosen datasets; for negative pairs, a sentence is paired with a randomly chosen image from the dataset (see the sketch after this list). This task is much like, and was inspired by, the next-sentence prediction task that was also used in training BERT.
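The sketch below shows one way such image-text matching examples can be constructed. It is not Renaissance’s actual data pipeline; for simplicity it draws negative images from the current batch rather than from the full dataset, and it assumes a batch of more than one example.

```python
import random

def build_itm_batch(captions, images):
    """Build image-text matching examples: label 1 for original caption-image pairs,
    label 0 for captions paired with a randomly drawn mismatched image."""
    examples = []
    for caption, image in zip(captions, images):
        if random.random() < 0.5:
            examples.append((caption, image, 1))  # positive (original) pair
        else:
            # Negative pair: swap in a different image (assumes len(images) > 1).
            negative = random.choice([im for im in images if im is not image])
            examples.append((caption, negative, 0))
    return examples

batch = build_itm_batch(["a dog on a beach", "two red cars"], ["img_0", "img_1"])
print(batch)
```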

At the time of this writing, Renaissance supports four multimodal datasets for pretraining: Visual Genome Krishna et al. (2017), MSCOCO Lin et al. (2014), Conceptual Captions Sharma et al. (2018) and the SBU Captions dataset Ordonez et al. (2011). These can be used individually or in any combination for pretraining models.

3.1.4 Downstream Vision-Language Tasks

In order to test and evaluate models, Renaissance currently has five downstream vision-language tasks implemented. They are listed below with a brief description of each.

  1. NLVR2: NLVR2 stands for Natural Language for Visual Reasoning for Real and was introduced in Suhr et al. (2018). Here a model is given two images and must answer a true-or-false question about them. The addition of a second image makes this quite a challenging task. NLVR2 is very commonly used to benchmark VL models. An example from the dataset can be seen in Figure 4.

  2. SNLI-VE: In the SNLI-VE task, a model is presented with an image-text pair and must determine whether the image entails the sentence, contradicts it, or is neutral with respect to it. It was introduced in Xie et al. (2019). This task tends to be less challenging than the previous one and requires less time to fine-tune and evaluate. Though it appears less commonly in the literature, its quick training time makes it very useful as a model development tool.

  3. Reference Resolution with RefCOCO: In this task a model is presented with an image segmented into several objects and a sentence describing one of them. The model must then determine which object the sentence refers to. The RefCOCO dataset was introduced in Kazemzadeh et al. (2014).

  4. Multimodal Retrieval with MSCOCO and Flickr30k: Multimodal retrieval tasks correspond to activities such as an internet image search. Here a model is given a string of text and must rank a number of images according to how relevant they are to that text. The converse process, in which an image is provided and the model must rank a series of sentences, is also implemented. Our program supports fine-tuning and evaluating both retrieval directions on the MSCOCO Lin et al. (2014) and Flickr30k datasets.

  5. Visual Question Answering: In this task a model is presented with an image and a question and must either choose the correct answer from a given set of possible choices or generate a free-form answer. The visual question answering task (also called VQA) was introduced in Antol et al. (2015). This task is a common benchmark in the vision-language field and is quite challenging. An example from the dataset can be seen in Figure 7.

Figure 4: An example from the NLVR2 dataset.
Figure 5: An example from the RefCOCO dataset.
Figure 6: An example from the RefCOCO dataset.
Figure 7: An example for Visual Question Answering.

3.1.5 Unimodal Downstream Tasks

Renaissance also supports downstream evaluation on pure NLP tasks and pure computer vision tasks. For pure NLP, Renaissance supports the GLUE tasks Wang et al. (2018); GLUE is a set of natural language understanding tasks commonly used to benchmark NLP models. The program also supports image classification on the CIFAR10 dataset Krizhevsky et al. (2009). This unimodal capability will be useful in testing whether and how multimodal training affects unimodal performance.

3.2 Design and Implementation

Renaissance is entirely written in the Python programming language. Though Python is popular and user friendly, using and maintaining large-scale Python programs can be difficult. In order to make this modeling platform useful as a research tool, we have made a number of conscious decisions designed to improve the usability and versatility of the program. These design goals are discussed below.

Modularity

This project grew out of our efforts in designing and training compact VL models. In general, this type of work requires pretraining a large number of model architectures with various hyperparameters and fine-tuning them on suitable evaluation tasks. This kind of research demands versatility, which motivated a modular design. The program incorporates text and vision models from the Huggingface Model Hub as encoder modules in custom vision-language models. Because models from Huggingface are generally written as discrete classes with common methods, this modular approach works quite well. It allows the user to create models with a wide variety of architectures by simply specifying which models from the hub they would like to use.

Beyond pretrained encoder modules, the other parts of the models are also contained in discrete classes that can be easily substituted. The embedding layer of one-tower models, the cross-modal encoders of two-tower models and all classification modules are written as self-contained classes to allow for easy modification. For instance, to add a new type of classification head, one would simply write a new class with compatible methods and add it to the appropriate file.

Ease of Use

Correctly installing dependencies to match particular Python installations and hardware setups can be a major obstacle to using deep learning repositories. In order to make Renaissance as user-friendly as possible, we have intentionally tried to reduce the number of dependencies used in the platform. Where possible we have favored lightweight, widely available packages, such as the Pillow module for image processing instead of heavier libraries like OpenCV.

Admittedly, the complexity of the underlying processes makes configuring models with this program somewhat involved. To facilitate this process, we make use of the Python module Sacred to track the various model settings and hyperparameters. Sacred helps users configure models and vary settings so that experiments are easy to reproduce. In addition, we provide extensive documentation to make using the program, and importantly extending it for novel purposes, as straightforward as possible.
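A minimal Sacred experiment looks roughly like the following; the configuration names and default values shown here are illustrative and do not correspond to Renaissance’s actual settings.

```python
from sacred import Experiment

ex = Experiment("renaissance_pretrain")

@ex.config
def config():
    # Default settings captured by Sacred; any of these can be overridden from the
    # command line, e.g. `python pretrain.py with freeze_vision=True batch_size=704`.
    text_encoder = "google/electra-small-discriminator"
    vision_encoder = "facebook/deit-tiny-patch16-224"
    freeze_vision = False
    batch_size = 512
    max_steps = 100_000

@ex.automain
def main(text_encoder, vision_encoder, freeze_vision, batch_size, max_steps):
    # In the real program this is where model construction and training would start.
    print(f"Pretraining {text_encoder} + {vision_encoder}, "
          f"freeze_vision={freeze_vision}, batch={batch_size}, steps={max_steps}")
```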

Scalability

Though much of our research is primarily focused on modeling with limited compute resources, the program is designed to accommodate larger compute setups with ease. To this end, its models are implemented using PyTorch Lightning. The PyTorch Lightning package wraps around PyTorch’s distributed data parallel (DDP) library and abstracts many of the difficult parallel programming aspects away from the user. This makes adding additional nodes or devices much more straightforward than it otherwise would be. Additionally, all of the local data processing is handled using the fast and memory-efficient PyArrow library.
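The sketch below shows the kind of Trainer configuration this enables; `RenaissanceModule` and `vl_datamodule` are hypothetical stand-ins rather than the program’s actual class names.

```python
import pytorch_lightning as pl

# PyTorch Lightning abstracts DDP: the same Trainer call scales from one GPU to
# multiple nodes simply by changing the `devices` and `num_nodes` arguments.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,        # e.g., two GPUs on one node
    num_nodes=1,      # increase to scale across machines
    strategy="ddp",   # distributed data parallel
    max_steps=100_000,
)

# trainer.fit(RenaissanceModule(config), datamodule=vl_datamodule)
```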

4 Experiment 1: Freezing Encoder Modules During Pretraining

4.1 Premise

In our first set of experiments, we ask: what is the effect of freezing the weights of various parts of the model during pretraining? Specifically, if we initialize the vision and text modules of a two-tower encoder with pretrained models from their respective domains, can we freeze one or both of these modules during pretraining? Freezing both modules means that we would only be pretraining the cross-modal and output layers of the model. Pretraining is usually the most compute-intensive aspect of model development, and freezing parts of the model reduces both GPU memory use and the overall compute required. The compute savings would allow researchers to pretrain models that might otherwise be too large for their hardware. Alternatively, they might train smaller models at higher batch sizes and possibly obtain better results.

Given that both vision and text encoder modules are pretrained in their respective domains, it makes intuitive sense that we might be able to skip at least some portion of their pretraining. These experiments should demonstrate empirically whether or not this is the case. Furthermore, the creators of the dual encoder model LiT Zhai et al. (2022) found that they obtained slightly better results from freezing the model’s vision encoder (dual encoder models are a simpler model type that is currently not available in Renaissance). This experiment also affords us the opportunity to see whether a similar effect holds for two-tower encoder models.

Text Encoder | Vision Encoder | SNLI-VE | NLVR2 | Ref. Res.
Unfrozen | Unfrozen | 0.741 | 0.672 | 0.724
Frozen | Unfrozen | 0.735 | 0.675 | 0.702
Unfrozen | Frozen | 0.741 | 0.672 | 0.740
Frozen | Frozen | 0.738 | 0.665 | 0.721
ELECTRA-Base (Frozen) | ViT-Base (Frozen) | 0.756 | 0.630 | 0.756
Table 1: Results for the module-freezing study. All models are pretrained for 100k steps at a batch size of 704. All results are calculated on the dev set of each task.

4.2 Experimental Setup and Procedure

To begin, we use Renaissance to construct a set of two-tower models with ELECTRA-Small Clark et al. (2020) as the text encoder and DeiT-Tiny Touvron et al. (2021) as the image encoder. Both of these models are quite small and efficient, and we chose them to expedite the training process. We set the cross-modal encoder module of each model to contain two sets of six transformer layers, each with a hidden size of 256 and 4 attention heads. In total we pretrain four model variations: a baseline with both modules unfrozen, one with the text encoder frozen, one with the image encoder frozen and one with both encoder modules frozen. Each model is pretrained for 100k steps at a batch size of 704 using the masked language modeling and image-text matching tasks described in Section 3.1.3. All models are trained using two NVIDIA L40S GPUs. We use two of the four pretraining datasets, MSCOCO and Visual Genome, which were described in the same section. Finally, we fine-tune and evaluate our models on three of the five VL tasks described in Section 3.1.4: SNLI-VE, NLVR2 and reference resolution with RefCOCO. In the interest of saving time and compute, we forego evaluating them on multimodal retrieval tasks and visual question answering. A crucial point to consider is that no model weights are frozen during fine-tuning.

4.3 Results

The results for this experiment are summarized in Table 1. We see that we can freeze one or both of the previously trained encoder modules during pretraining with only mild ill effect. On the SNLI-VE task, the difference between training the whole model and freezing one or both modules is very slight: freezing the vision module yields results essentially identical to the baseline, and freezing the text module or both modules produces only a slight drop in performance. We see a slightly different pattern on the NLVR2 task, where the baseline model and the two models with a single encoder frozen produce almost identical results. On the reference resolution task, the model with the visual encoder frozen produces the best score, the baseline model and the model with both modules frozen perform very nearly identically, and the model with only the text encoder frozen scores the worst. Of the four models, the overall best performance is achieved by freezing only the vision encoder.

This is a fairly remarkable result, as we can essentially cut the GPU memory required during pretraining in half by freezing so many of the model’s weights. This is especially significant because pretraining the model is by far the largest compute cost we see during training. We should also note that this effect is somewhat similar to a phenomenon noted in the training of the dual encoder LiT Zhai et al. (2022) (dual encoders have two encoder stacks but lack a cross-modal fusion module). There, Zhai et al. found that they could obtain better results by freezing the image encoder during training. Though our model architecture is different, we observe a somewhat similar effect.

To further demonstrate the utility of freezing the pretrained modules, we train a model that uses ELECTRA-Base as the text encoder and ViT-Base as the vision encoder. We use the same training hyperparameters but increase the number of cross-attention layers to 10. This model is quite large for an encoder, containing over 210M parameters (decoder models for generative tasks often contain billions of parameters). A model of this size would be well outside our compute capacity to pretrain if the text and vision modules were not frozen. However, because it contains fewer than 27M trainable parameters, training it puts the same memory load on our two L40S GPUs as the baseline model does. The results for this model are displayed in the final row of Table 1. This model obtains the best results of the study on two of the three downstream tasks and remains well within our limited compute budget.

Encoder | SNLI-VE | NLVR2 | Ref. Res.
Random | 0.699 | 0.551 | 0.554
ViT | 0.685 | 0.534 | 0.522
BERT | 0.692 | 0.545 | 0.507
Table 2: Preliminary results for the text vs. vision encoder study. All models are trained for 100k steps at a batch size of 512. All results are calculated on the dev set of each task.

5 Experiment 2: Text Encoder vs. Vision Encoder

5.1 Premise

In the previous set of experiments we focused on training two-tower models. In our final experiment we examine the behavior of one-tower models, which were described in Section 3.1.1. To date, most of these models have been derived from text encoder models such as BERT Devlin et al. (2019). A less explored approach is to base such models on transformer-based vision models such as ViT Dosovitskiy et al. (2020); this is the approach of the one-tower VL transformer ViLT Kim et al. (2021). In this experiment we ask whether one strategy is superior to the other when training and evaluating under otherwise similar conditions. More simply put, are one-tower encoders more effective when based on a vision encoder or a text encoder? In addition to providing guidance to future practitioners of VL modeling, answering this question should provide interesting results from both theoretical and practical perspectives.

5.2 Experimental Setup and Procedure

To make this experiment as fair a comparison as possible, we select a vision transformer and a text transformer as close to each other in size and architecture as possible. Toward this end we use BERT Devlin et al. (2019) as our text encoder model and ViT Dosovitskiy et al. (2020) as our vision encoder model. The encoder towers in these two models were consciously designed to have nearly identical dimensions, with each encoder module containing 110M parameters. We employ patch embeddings for visual tokens and word-piece embeddings for text tokens in all models. The resultant models are close to identical, save that the weights of one are derived from vision pretraining and the other from text pretraining. As a baseline, we also train a randomly initialized version based on the BERT architecture. Finally, we train each model for 50k steps with a batch size of 512 using masked language modeling and image-text matching. Again we use MSCOCO and Visual Genome as training datasets and evaluate on the three vision-language tasks described in the previous experiment.

5.3 Results

The results for this experiment are displayed in Table 2. According to our analysis there does not appear to be a significant advantage in basing a one-tower encoder model on either a text or a vision encoder. Surprisingly, the randomly initialized model that we trained as a baseline scored the best on all three downstream tasks. These are very much unexpected results. Though we did not have an intuition as to whether text or vision would perform better, we did not expect the downstream results to be so similar, nor to be inferior to a randomly initialized variation. These results are especially notable since one of the few in-depth analyses of vision-language models, Frank et al. (2021), indicates that the interaction between the visual and language modalities is not symmetric. That study used probing techniques to show that VL transformers learn to use vision-for-language more than language-for-vision. Our best explanation of this phenomenon is that one-tower models do not make use of the individual visual or textual pretraining, but instead converge to values not dependent on either.

Another notable conclusion of this experiment and the preceding ones is that two-tower models are in general much more parameter efficient than one-tower models. The one-tower models used in this experiment are relatively large, each containing more than 100M parameters, while the two-tower models in the previous experiments contain fewer than 40M parameters. Nonetheless, the two-tower models outperform those in this final experiment using the same datasets for training and evaluation. In previous studies, one-tower models with similar architectures, such as ViLT Kim et al. (2021), have obtained better results than those displayed here. They do so by using more data and enormous pretraining batch sizes that require significantly more compute than we used here. Though only a preliminary finding, this insight might prove valuable to those interested in efficient VL modeling.

6 Future Directions

6.1 Renaissance

As this program evolves, we hope to incorporate a number of additional features that are not available in the current version. The capabilities that we plan to add are discussed below.

6.1.1 Model Types

There are several model types, beyond one-tower and two-tower encoders, that we hope to support in future versions. These include dual encoder, encoder-decoder and decoder-only model types (see Fields and Kennington (2023) for explanations of and examples for each type). By adding these model types, we also hope to include the ability to generate text for tasks such as image captioning.

6.1.2 Additional Tasks

In addition to more model architectures, we also hope to add further tasks for both pretraining and fine-tuning. Some tasks we intend to add as pretraining tasks are contrastive learning, reference resolution and visual question answering. Further, we also hope to add downstream tasks such as image captioning to give users a wider variety of settings in which to use and evaluate various model architectures.

6.2 Analysis of VL Transformers and Pretraining

Because the field of vision-language modeling is rapidly evolving, there are many possible future directions for research; we will mention a few. Though we have touched on some of the more basic aspects of training in this study, a systematic study of how each pretraining task contributes to downstream performance would be very illuminating, as would testing other tasks, such as visual grounding or visual question answering, in pretraining. A thorough investigation of which architectures are best used in which circumstances would also be a worthwhile endeavor. As a final suggestion, we believe the field would also benefit from a scaling study to determine the optimal amount of data for training models at various scales, as Kaplan et al. (2020) performed for NLP transformers.

7 Conclusion

In this study, we have examined some basic features of pretraining vision-language transformers. In addition to the experiments we performed, we also introduced a flexible vision-language modeling framework called Renaissance, the source code for which can be found at https://github.com/bsu-slim/renaissance. In our first set of experiments we showed that pretrained vision and text modules can be frozen during vision-language pretraining with only small losses in downstream performance. This finding opens the possibility of training VL models whose size might otherwise exceed one’s compute budget. In our second and final experiment we compared the effect of basing a one-tower encoder model on a text transformer versus a vision transformer. Surprisingly, our results indicate that neither strategy is superior to the other and that randomly initializing the model’s weights yields the best results. We therefore recommend training one-tower models from scratch when possible. We conclude with the observation that multi-modal modeling is a rapidly expanding pursuit, and we hope that this paper is among the first of many that aim to shed light on this dynamic and exciting field of deep learning.

Limitations

The primary limitations of our study relate to size and scope. With greater resources, particularly compute resources, we would have been able to test more models on a wider variety of downstream tasks. These additional data would have added greater weight to our findings and given them broader applicability.

Ethics Statement

We have no pertinent ethical conflicts to report.

Acknowledgements

The code for Renaissance borrows large portions from the code bases of METER Dou et al. (2022) and ViLT Kim et al. (2021). Without these excellent templates this program could not have been completed on a reasonable time scale. We would also like to acknowledge the programmers at Huggingface: our program interfaces with classes from the Huggingface Hub, and our design borrowed heavily from their methods.

References

  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2022. Flamingo: A visual language model for few-shot learning. arXiv [cs.CV].
  • Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.
  • Bao et al. (2021) Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. 2021. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254.
  • Bugliarello et al. (2021) Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, and Desmond Elliott. 2021. Multimodal pretraining unmasked: A meta-analysis and a unified framework of vision-and-language berts. Transactions of the Association for Computational Linguistics, 9:978–994.
  • Bugliarello et al. (2023) Emanuele Bugliarello, Laurent Sartran, Aishwarya Agrawal, Lisa Anne Hendricks, and Aida Nematzadeh. 2023. Measuring progress in fine-grained vision-and-language understanding. arXiv preprint arXiv:2305.07558.
  • Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660.
  • Chen et al. (2020) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer.
  • Clark et al. (2020) Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv [cs.CL].
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • Dou et al. (2022) Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, et al. 2022. An empirical study of training end-to-end vision-and-language transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18166–18176.
  • Fields and Kennington (2023) Clayton Fields and Casey Kennington. 2023. Vision language transformers: A survey. arXiv preprint arXiv:2307.03254.
  • Frank et al. (2021) Stella Frank, Emanuele Bugliarello, and Desmond Elliott. 2021. Vision-and-language or vision-for-language? on cross-modal influence in multimodal transformers. arXiv preprint arXiv:2109.04448.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition.
  • Huang et al. (2020) Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, and Jianlong Fu. 2020. Pixel-bert: Aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  • Kazemzadeh et al. (2014) Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798.
  • Kim et al. (2021) Wonjae Kim, Bokyung Son, and Ildoo Kim. 2021. Vilt: Vision-and-language transformer without convolution or region supervision. In International Conference on Machine Learning, pages 5583–5594. PMLR.
  • Krishna et al. (2017) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73.
  • Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images.
  • Li et al. (2023) Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, and Steven C.H. Hoi. 2023. LAVIS: A one-stop library for language-vision intelligence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 31–41, Toronto, Canada. Association for Computational Linguistics.
  • Li et al. (2019a) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019a. VisualBERT: A simple and performant baseline for vision and language. arXiv [cs.CV].
  • Li et al. (2019b) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019b. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.
  • Liu et al. (2022) Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022. A convnet for the 2020s.
  • Lu et al. (2019) Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32.
  • Ordonez et al. (2011) Vicente Ordonez, Girish Kulkarni, and Tamara Berg. 2011. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24.
  • Rogers et al. (2021) Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2021. A primer in bertology: What we know about how bert works. Transactions of the Association for Computational Linguistics, 8:842–866.
  • Sharma et al. (2018) Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.
  • Suhr et al. (2018) Alane Suhr, Stephanie Zhou, Ally Zhang, Iris Zhang, Huajun Bai, and Yoav Artzi. 2018. A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491.
  • Tan and Bansal (2019a) Hao Tan and Mohit Bansal. 2019a. LXMERT: Learning cross-modality encoder representations from transformers. arXiv [cs.CL].
  • Tan and Bansal (2019b) Hao Tan and Mohit Bansal. 2019b. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490.
  • Touvron et al. (2021) Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. 2021. Training data-efficient image transformers & distillation through attention. In International conference on machine learning, pages 10347–10357. PMLR.
  • Vaswani et al. (2017a) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. arXiv [cs.CL].
  • Vaswani et al. (2017b) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. Advances in neural information processing systems, 30.
  • Wang et al. (2018) Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv [cs.CL].
  • Wang et al. (2022) Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. arXiv [cs.CV].
  • Xie et al. (2019) Ning Xie, Farley Lai, Derek Doran, and Asim Kadav. 2019. Visual entailment: A novel task for fine-grained image understanding. arXiv preprint arXiv:1901.06706.
  • Xu et al. (2023) Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, Wanxiang Che, and Nan Duan. 2023. Bridgetower: Building bridges between encoders in vision-language representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 10637–10647.
  • Zhai et al. (2022) Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. 2022. Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18123–18133.

Appendix A Hardware

In all three studies we pretrain our models using two NVIDIA L40S GPUs, each with 48GB of GPU memory. Where feasible, we also used a server with two NVIDIA TITAN RTX GPUs with 24GB of memory and a server with two NVIDIA TITAN Xp GPUs with 12GB of memory for fine-tuning.