Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval

Yeong-Joon Ju, Ho-Joong Kim, Seong-Whan Lee

Abstract

Existing multimodal retrieval systems often rely on disjointed models for image comprehension, such as object detectors and caption generators, leading to cumbersome implementations and training processes. To overcome this limitation, we propose an end-to-end retrieval system, Ret-XKnow, to endow a text retriever with the ability to understand multimodal queries via dynamic modality interaction. Ret-XKnow leverages a partial convolution mechanism to focus on visual information relevant to the given textual query, thereby enhancing multimodal query representations. To effectively learn multimodal interaction, we also introduce the Visual Dialogue-to-Retrieval (ViD2R) dataset automatically constructed from visual dialogue datasets. Our dataset construction process ensures that the dialogues are transformed into suitable information retrieval tasks using a text retriever. We demonstrate that our approach not only significantly improves retrieval performance in zero-shot settings but also achieves substantial improvements in fine-tuning scenarios. Our code is publicly available: https://github.com/yeongjoonJu/Ret˙XKnow.

Introduction

With the growing demand for information retrieval across diverse applications, such as internet search and knowledge-based question answering, precise and efficient retrieval from multimodal queries involving pairs of images and text has emerged as a critical challenge. In such multimodal queries, each modality independently provides insufficient information to retrieve the desired passages within a knowledge base, necessitating the integrated understanding of the visual and textual queries.

Existing Vision-Language (VL) retrievers (Qu et al. 2021; Luo et al. 2021; Gao et al. 2022; Lin et al. 2023) often depend on disjointed models for object detection or image captioning to provide visual information. The reliance on disjointed models complicates the training process (e.g., the models should be fine-tuned for separate tasks in domain adaptation) and increases the likelihood of propagating erroneous predictions. The utilization of the captioning model also lacks the fine-grained information embedded within the images. Previous approaches (Lin et al. 2023; Luo et al. 2023) have attempted to address these drawbacks. However, as shown in Fig. 1, they result in lower performance in a zero-shot setting than a text retriever that does not use image information despite their pre-training for the image-text alignment. Lin et al. (2023) introduce token-level embeddings and utilize two types of visual representations: textual description of the image and feature-based visual embeddings with regions of interest by an object detector. They pre-train the retriever to map token-level visual embeddings into the linguistic space of a text retriever and then fine-tune it by adding image captions to the textual queries. Such the retriever captures fine-grained features of the image by employing visual embeddings with captions. They also facilitate modality interaction between the textual query and the image by relying on textual information, but the mechanism also results in complex implementations and inefficient retrieval due to multiple steps.

Refer to caption — Figure 1: Zero-shot retrieval performance on OK-VQA (Google Search). Ret-XKnow outperforms the text-based retriever, while other multimodal retrievers fall short, relying on the textual query in the pre-training stage.

Luo et al. (2023) present an end-to-end approach that projects multimodal features encoded via self-attention into linguistic space with a pre-training task called VL-ICT, to detach the dependency on the disjointed modules. They automatically construct a pre-training dataset by applying the Inverse Cloze Task (ICT) (Lee, Chang, and Toutanova 2019) to a multimodal knowledge base. However, this approach has significant limitations. First, the dataset does not adequately reflect the variety and complexity of real-world queries, as it only removes the title or caption from a sentence extracted as the query without considering the image. Second, in the constructed pairs of a multimodal query and the corresponding passage, the passage can often be matched solely with the textual content of the query. This occurs because the target passage is selected from the content following a sentence with a title or caption, thereby hindering learning rich image representations.

To tackle these issues, we propose two approaches: (1) an end-to-end Retriever to eXpand visual Knowledge, Ret-XKnow, and (2) a Visual Dialogue-to-Retrieval (ViD2R) pre-training dataset constructed from visual dialogues containing distinct relevant passages for various queries related to the same image. Ret-XKnow endows a text retriever with the understanding of multimodal queries in the context of efficient information retrieval, inspired by the concept of partial convolutions (Liu et al. 2018), which fill undesired pixels with surrounding pixel information. We compress visual embeddings to focus on the visual information relevant to the textual query by leveraging the relevance scores between visual embeddings and textual query representations as an adaptive mask. We only attach a vision encoder to the text retriever with only a few layers, utilizing output embeddings of the penultimate layer in the vision model for fine-grained visual representations. Our model architecture does not allow the direct intervention of textual query features in the pre-training stage, achieving modality interaction without fusing text features with image features. Through this architecture, we introduce both the late-interaction mechanism (Khattab and Zaharia 2020) for pre-indexing documents and the modality interaction without requiring an additional document encoder and disjointed models.

Recent advances in multimodal language models have produced several multimodal dialogue datasets (Zhu et al. 2023; Liu et al. 2023; Wang et al. 2023; Huang et al. 2023) for training models to perform tasks based on visual content. These datasets consist of multi-turn sessions with query-response pairs centered around a single image, providing precise and comprehensive information pertinent to the query and image. The response with detailed information can improve multimodal retrieval tasks by linking image understanding with complex textual queries. Whereas, such datasets are not appropriate for directly training retrievers due to the gap between explicit responses and broader passages. To bridge this gap, we transform the visual dialogue datasets into a format suitable for retrieval tasks through three simple steps: pre-processing, neural filtering, and response-to-passage conversion. Our construction process is applicable in diverse domains and modalities since our approach only requires multimodal dialogue datasets and sets of documents related to the target domain. Our retriever, Ret-XKnow pre-trained with the ViD2R dataset, outperforms various baselines in zero-shot retrieval performance across four multimodal datasets in an end-to-end manner. Furthermore, we demonstrate that the pre-training dataset curated via our construction method effectively mitigates the issue of overlooking visual features during the pre-training stage, leading to remarkable performance in fine-tuning settings. Our contributions are summarized as follows:

•

We propose Ret-XKnow, an end-to-end multimodal retriever that overcomes the limitations of disjointed models by dynamically focusing on visual features relevant to the textual query.
•

We introduce the ViD2R dataset, which transforms visual dialogue datasets into a format suitable for training VL retrievers, leading to significant improvements in zero-shot retrieval performance.
•

We demonstrate the comprehensive adaptability of Ret-XKnow by fine-tuning three downstream tasks. Our end-to-end retriever even shows comparable performance on baseline methods utilizing image captioning.

Related Works

Neural knowledge retrieval has been a cornerstone of Question Answering (QA) systems, with Dense Passage Retrieval (DPR) (Karpukhin et al. 2020) and its variants (Luo et al. 2021; Gui et al. 2022; Lin and Byrne 2022; Wu and Mooney 2022) pioneering the use of one-dimensional embeddings and contrastive learning. The advent of fine-grained late interaction models (Khattab and Zaharia 2020; Santhanam et al. 2022b) introduced enhanced embedding strategies, enabling precise query-document comparisons. ReAtt (Jiang et al. 2022) further streamlined the QA process by merging the retrieval and reading components into a unified Transformer model, offering an end-to-end solution.

The transition from traditional text queries to multimodal queries has marked a significant evolution in knowledge retrieval (Luo et al. 2021; Ge et al. 2022; Hanu et al. 2022). Initial methods focused on converting images into textual representations, such as captions (Qu et al. 2021; Gao et al. 2022) and object tags (Gui et al. 2022; Yang et al. 2022), leveraging text-based retrievers for relevant knowledge identification. EnFoRe (Wu and Mooney 2022) and DEDR (Salemi, Altmayer Pizzorno, and Zamani 2023) improve image-query representations derived from a multimodal encoder with generated entities and captions, respectively. FLMR (Lin et al. 2023) further refined multimodal queries by incorporating RoIs and generated captions under a late-interaction mechanism. To detach the dependency on intermediate modules, ReViz (Luo et al. 2023) represents an end-to-end multimodal retrieval system, introducing the VL-ICT for pre-training. We extend the motivation of the previous work (Luo et al. 2023) to design an end-to-end VL retriever to retrieve relevant passages with multimodal queries. Unlike previous methods, our approach overcomes the limitations of relying on disjointed models and complex processing by dynamically integrating visual features directly into the retrieval process.

Our Approach

In this section, we first define the problem statement for knowledge retrieval with multimodal queries. Then, we describe the architecture of our end-to-end retrieval model, Ret-XKnow, along with the construction method of the ViD2R dataset utilizing visual dialogue datasets.

Problem Definition

We focus on encoding a multimodal query $Q=(I,T)$ , where $I$ and $T$ represent an image and textual content, respectively. Both $I$ and $T$ individually provide insufficient information to retrieve desired passages $K=\{D_{1},D_{2},...,D_{n}\}$ within a knowledge base, where $D$ denotes a single passage. The primary objective of our end-to-end retriever $\mathcal{R}$ is to accurately map $Q$ to the set of relevant textual knowledge $K$ without requiring separate models for image understanding. To achieve this goal, the multimodal retriever $\mathcal{R}$ should encode the multimodal queries $Q$ by referring information from both modalities $I$ and $T$ to retrieve passages $K$ .

Architecture of Ret-XKnow

We design a retrieval system that dynamically integrates the capabilities of both the pre-trained text retriever, $\mathcal{R}_{T}$ , and the vision model, $\mathcal{R}_{V}$ , to enhance the retrieval of multimodal queries by focusing on visual features relevant to the textual query. Starting with encoding the image and text with separate encoders, Ret-XKnow compresses the visual features relevant to the given textual query at the end for modality interaction, applying the fine-grained late-interaction mechanism. Our network architecture is illustrated in Fig. 2. We elaborate on the components of Ret-XKnow and the rationale behind our design choices.

Fine-grained Late-interaction. To preserve abundant information of knowledge, we employ token-level embeddings for both modalities, applying the MaxSim operation (Khattab and Zaharia 2020). Given the representation of a multimodal query $Q$ and a document $D$ , we estimate the relevance score $r_{Q,D}$ between $Q$ and $D$ via late interaction between token-level embeddings from the multimodal retriever $\mathcal{R}$ as follows:

r_{Q,D}=\sum_{i=1}^{l_{Q}}\max_{j=1}^{l_{D}}(\lVert E_{Q_{i}}\rVert_{2}\cdot\lVert E_{D_{j}}^{T}\rVert_{2})

(1)

where $E_{Q}\in\mathbb{R}^{l_{Q}\times d}$ and $E_{D}\in\mathbb{R}^{l_{D}\times d}$ denote the representations of $Q$ and $D$ embedded by the retriever $\mathcal{R}$ , with each embedding having a dimension of $d$ . Here, $l_{Q}$ and $l_{D}$ represent the number of embeddings for $Q$ and $D$ , respectively. This operation chooses the highest relevance scores over all document tokens for each query token. In our approach, we incorporate the late-interaction mechanism for pre-indexing documents, ensuring no interaction between the query and the documents during the encoding process.

Visual Embeddings for Multimodal Queries. Building upon previous approaches (Alayrac et al. 2022; Liu et al. 2023) that integrate visual comprehension into language models, Ret-XKnow adopts the Vision Transformer (ViT) (Dosovitskiy et al. 2021) as its vision encoder. We utilize token-level visual embeddings $F_{v}$ from the penultimate layer of the ViT, along with a global visual embedding $F_{g}$ obtained from a special token (i.e., CLS token). For the global embedding $F_{g}\in\mathbb{R}^{d_{v}}$ , we directly project the visual features into the latent space of the text retriever $\mathcal{R}_{T}$ via a two-layer perceptron to align the vision and text modalities. Subsequently, the projected $F_{g}$ is reshaped into token-level embeddings $E_{g}\in\mathbb{R}^{l_{g}\times d}$ , where $l_{g}$ is the pre-defined number of tokens. For token-level visual embeddings $F_{v}\in\mathbb{R}^{l_{v}\times d_{v}}$ , each embedding corresponds to distinct visual elements or regions within the input image and involves similar semantics among adjacent patches. Thus, we can acquire regions of interest without region proposal networks since the visual embeddings $F_{v}$ encompass granular visual information. We aim to extract visual information pertinent to the textual query $T$ while diminishing the information on unrelated aspects, with rich visual representations. To achieve modality interaction, we employ the MaxSim operation and the partial convolution mechanism (Liu et al. 2018). The concept of partial convolutions was originally designed for image inpainting tasks to fill missing or unwanted pixels by using the surrounding pixel information. In the context of Ret-XKnow, this mechanism is adeptly repurposed to refine visual embeddings by filling the irrelevant embeddings with adjacent visual features. The visual embeddings $F_{v}$ first undergo a projection via a two-layer perceptron, which maps $F_{v}$ into a space of $\mathcal{R}_{T}$ , i.e., $F_{v}\in\mathbb{R}^{l_{v}\times d_{v}}\to F_{vt}\in\mathbb{R}^{l_{v}\times d_{t}}$ , where $d_{t}$ denotes the dimension of embeddings $F_{t}$ of $T$ from the penultimate layer of $\mathcal{R}_{T}$ . Then, we assign a relevance score for $F_{t}$ to each visual embedding in the projected space as follows:

r_{I,T}=\sum_{i=1}^{l_{t}}\max_{j=1}^{l_{v}}(\lVert F_{vt_{i}}\rVert_{2}\cdot\lVert F_{t_{j}}^{T}\rVert_{2}),

(2)

where $l_{t}$ denotes the length of the textual query. To align the visual embeddings $F_{v}$ and their relevance scores in a spatial arrangement, $F_{v}$ and $r_{I,T}$ are reshaped to have spatial dimensions $p\times p$ , resulting in $F^{\prime}_{v}\in\mathbb{R}^{p\times p\times d_{v}}$ and $r^{\prime}_{I,T}\in\mathbb{R}^{p\times p\times 1}$ . The reshaped $F^{\prime}_{v}$ and $r^{\prime}_{I,T}$ are then concatenated along the channel dimension to form a combined feature map. We subject this feature map to a strided convolution operation $C$ , defined as:

C([F^{\prime}_{v};r^{\prime}_{I,T}])=\sum_{m=1}^{k}{\sum_{n=1}^{k}{([F^{\prime}_{v};r^{\prime}_{I,T}]_{i+2m,j+2n}\cdot W_{mn}+B)}},

(3)

where $W_{mn}$ and $B$ denote the weights and bias of the convolutional filter, respectively. The convolution is applied with a stride that effectively downsamples the feature map, extracting and condensing the most salient visual features based on their relevance to the textual query. The output of this convolution $C([F^{\prime}_{v};r^{\prime}_{I,T}])$ yields a reduced set of selective visual features that are highly relevant to the corresponding textual content, represented as $F^{\prime}_{v}\in\mathbb{R}^{p^{\prime}\times p^{\prime}\times 2d_{v}}$ , where $p^{\prime}$ is the reduced spatial dimension post-convolution. The $F^{\prime}_{v}$ is reshaped to embeddings with 4 tokens per embedding after being projected into $4d$ dimensions by a linear layer. Finally, we integrate the token-level text embeddings with two types of token-level visual embeddings to form the query embeddings $E_{Q}$ as follows:

E_{Q}=[E_{g};E_{v};E_{t}]

(4)

where $E_{t}$ and $E_{v}$ represent the final embeddings obtained by applying a linear layer to $F_{t}$ and the output of the convolution layer, respectively, to achieve embeddings with dimension $d$ . Finally, we compute a relevance score $r_{Q,D}$ between $E_{Q}$ and $E_{D}$ with the MaxSim operation.

Training

We deal with passages including the golden answers to a given question $Q$ as relevant passages $K$ . To train our model, we employ in-batch negative sampling, which treats all passages in a training batch except for a passage $D$ belonging to $K$ as negative passages $\bar{K}$ for $Q$ . We optimize our model by minimizing the following contrastive loss $\mathcal{L}_{CL}$ over the dataset $\mathcal{D}$ :

\mathcal{L}_{CL}=-\sum_{(Q,D)\in\mathcal{D}}\log{\frac{exp(r_{Q,D})}{exp(r_{Q,D})+\sum_{\bar{D}\in\bar{K}}{exp(r_{Q,\bar{D}})}}}.

(5)

In the pre-training stage, all parameters of $\mathcal{R}_{T}$ and $\mathcal{R}_{V}$ are frozen. After training, all passages are pre-indexed using PLAID (Santhanam et al. 2022a), identical to ColBERTv2 (Santhanam et al. 2022b). In the inference stage, we utilize only embeddings with the highest $r_{I,T}$ score for $F^{\prime}_{v}$ to prevent using unrelated information with the textual query.

ViD2R Dataset Construction

To endow the multimodal retriever $\mathcal{R}$ with the ability to comprehend images based on textual queries, we leverage existing visual dialogue datasets. Despite the rich information of responses in visual dialogues, the datasets are not appropriate to directly train the retriever $\mathcal{R}$ since there exists a clear distinction between the explicit responses and more expansive passages. To bridge this gap, we transform the visual dialogue datasets into a format suitable for multimodal retrieval tasks via the following three steps, as illustrated in Fig. 3.

Pre-processing. First, we divide the dialogues into individual turns. Subsequently, to maintain informative content within the dataset, we filter out turns that are unsuitable for retrieval tasks based on responses and tasks given by queries. We exclude tasks that do not contribute to knowledge-based retrieval, such as queries requiring or providing the location of objects, which are unrelated to knowledge content. To reduce bias in training, responses containing simple affirmations or negations, such as “yes” or “no,” are edited to remove these elements.

Neural Filtering. To guarantee that the model learns to utilize visual information in multimodal retrieval, we apply neural filtering to our construction process. Using a text retriever, we perform the top-5 retrieval with questions from a knowledge base compiled with the responses in the visual dialogue datasets. Through this process, we automatically identify responses that can be retrieved based solely on textual queries. To avoid impeding the learning of image representations for retrieval, we filter out the matched query-response pairs from the text retrieval process.

Response-to-Passage Conversion. In the context of retrieval, passages may contain both relevant and irrelevant information, unlike responses in dialogues. Thus, we transform responses into passages typically found in knowledge retrieval tasks by unifying passages related to multimodal queries. To identify the relevant passages, we utilize the responses instead of the queries since the responses contain image-related information conditioned on the given query and image while the textual queries have restricted information. We obtain top-3 passages retrieved from Wikipedia using the responses as textual queries. However, the text retriever may retrieve inappropriate passages due to potential inaccuracies in the retriever. To ensure that converted passages are relevant to both the image and the question, we combine three retrieved passages with the responses. We simply concatenate the response behind the top-1 passage, thereby maintaining the relevance of the context.

Model	Dataset	KB	Metric
Model	Dataset	KB	MRR@5	P@ $k$	R@5	R@10	R@20	R@50	R@100
ColBERTv2	OK-VQA	Wiki-11M	36.00	24.07	52.20	63.54	73.80	83.31	88.27
FLMR*+WiT			32.56	21.20	50.61	62.58	73.40	84.94	90.17
ReViz+VL-ICT			42.97	31.95	61.24	70.00	79.65	87.32	90.95
PreFLMR+ViD2R			50.08	35.96	69.12	78.36	86.01	92.15	95.12
Ret-XKnow+ViD2R			51.10	36.97	70.83	80.94	88.59	94.09	96.35
CLIP^†	OK-VQA	GS-112K	19.08	11.13	34.54	50.48	65.08	80.62	88.11
ColBERTv2			52.46	37.53	69.60	79.57	86.58	93.10	96.51
FLMR*+WiT			38.15	24.62	57.25	69.42	79.43	88.62	93.14
ReViz+VL-ICT^†			45.77	33.18	64.05	75.39	84.21	91.64	94.59
Ret-XKnow+ViD2R			59.88	44.93	78.10	86.50	92.27	96.43	98.08
CLIP^†	ReMuQ	199K	0.34	-	0.78	1.36	2.41	7.34	47.88
ReViz+VL-ICT.^†			23.61	-	39.43	46.77	53.56	63.70	71.13
PreFLMR+ViD2R			54.44	52.37	57.66	58.94	59.63	60.54	60.85
Ret-XKnow+ViD2R			80.88	78.11	85.20	87.48	89.14	90.77	91.63
ColBERTv2	A-OKVQA	Rationale	58.32	49.52	72.58	79.83	85.07	90.92	94.93
FLMR+WiT			48.43	38.95	63.93	73.45	81.75	91.79	96.77
Ret-XKnow+ViD2R			68.13	58.95	82.53	88.82	93.19	97.38	98.52

Table 1: Zero-shot performance of Ret-XKnow and baselines in an end-to-end manner. Note that FLMR* was only pre-trained on the WiT dataset specifically for aligning the vision mapping network, without using external information. Bold indicates the highest performance, while underline signifies the second highest performance. Results marked with a † symbol are taken from (Luo et al. 2023).

k

is set to 5 for OK-VQA and 1 for the other datasets.

Experiments

In this section, we present a comprehensive evaluation of Ret-XKnow, focusing on its performance in zero-shot multimodal retrieval and its adaptability to downstream tasks through fine-tuning. Our experiments are designed to showcase the effectiveness of Ret-XKnow in understanding and integrating complex multimodal queries for information retrieval.

Datasets. We employ four retrieval datasets, two kinds of OK-VQA (Marino et al. 2019), ReMuQ (Luo et al. 2023), and A-OKVQA (Schwenk et al. 2022) to evaluate the retrieval performance of models for multimodal queries. For the OK-VQA dataset, we employ two datasets with different knowledge bases: a small corpus collected from Google search API introduced in (Luo et al. 2021) and a large corpus that contains 11 million Wikipedia passages created by (Qu et al. 2020). Note that these datasets are specifically designed for multimodal retrieval. The data statistics for the retrieval datasets are shown in the Appendix.

Evaluation Metrics. We evaluate the retrieval performances of Ret-XKnow and our baselines by the following metrics: Mean Reciprocal Rank at 5 (MRR@5), which measures ranking the first relevant passage within the top-5 results; Precision at $K$ (P@ $K$ ), which measures the accuracy of the top- $K$ retrieved passages; and Recall at $N$ (R@ $N$ ), which evaluates the ability of the model to retrieve all relevant passages within the top- $N$ results. Due to the absence of ground-truth knowledge passages for each query in the OK-VQA datasets, we define passages that contain any human-annotated answers as ground-truth, following the approach by (Luo et al. 2021).

Method	MRR@5 $\downarrow$	P@1 $\downarrow$	R@5 $\downarrow$	R@10 $\downarrow$
VL-ICT	58.07	51.56	68.22	74.22
ViD2R	28.04	19.82	42.66	53.14

Table 2: Performance for the pre-training dataset when retrieving using only textual queries. Lower scores across all metrics indicate that the ViD2R dataset is better suited for multimodal retrieval tasks.

Implementation Details

ViD2R. Our pre-training dataset is synthesized from two visual dialogue datasets, as detailed in works by (Liu et al. 2023) and (Wang et al. 2023). These datasets merge tasks necessitating image comprehensions for instruction-following tuning (Ouyang et al. 2022). After the preprocessing stage, we yielded 1.35 million QA pairs, each associated with a single image. Subsequent neural filtering, employing ColBERTv2 (Santhanam et al. 2022b) trained on the MS MARCO Passage Ranking task (Nguyen et al. 2016), refined the preprocessed pairs to 0.98 million QA pairs including queries that require visual context for accurate retrieval. In the final stage, we use 6 million Wikipedia released in (Chen et al. 2023) as our data pool to convert responses to passages.

Ret-XKnow. We adopt CLIP ViT-base (Radford et al. 2021) as our vision encoder, alongside ColBERTv2 (Santhanam et al. 2022b), rooted on BERT-base (Devlin et al. 2019), serving as our text retriever. We reduce the number of visual embeddings $F_{v}$ to 16 embeddings via $C(\cdot)$ with 5 of kernel size. $F_{g}$ is converted into 16 embeddings via projection using two-layer perception. The dimension $d$ of final embeddings $E_{Q}$ and $E_{D}$ is set to 128.

Model	Dataset	KB-Size	Metric
Model	Dataset	KB-Size	MRR@5	P@ $k$	R@5	R@10	R@20	R@50	R@100
DPR-LXMERT (Qu et al. 2021)	OK-VQA	Wiki-11M	45.26	33.29	-	-	-	-	-
FLMR*+WiT			38.18	26.17	56.68	67.50	77.41	85.81	90.29
ReViz+VL-ICT			42.63	30.42	60.73	69.23	78.46	86.69	90.64
Ret-XKnow			45.54	32.64	64.96	76.81	84.23	91.40	94.09
Ret-XKnow+ViD2R			50.58	36.54	69.52	81.05	88.19	93.74	95.92
FLMR*+WiT	OK-VQA	GS	50.59	35.95	70.63	81.23	88.90	95.20	97.64
ReViz+VL-ICT^†			54.47	41.74	73.35	83.17	89.56	94.73	96.81
Ret-Xknow			54.57	38.16	75.13	84.86	91.34	95.98	98.04
Ret-XKnow+ViD2R			61.76	46.52	79.57	87.97	93.46	96.95	98.28
FLMR+WiT	A-OKVQA	Rationale	46.99	57.05	73.89	82.53	90.57	95.20	97.47
Ret-XKnow			64.22	54.41	80.17	87.60	92.14	96.07	97.38
Ret-XKnow+ViD2R			71.42	62.62	85.68	91.70	95.37	98.17	99.13

Table 3: Fine-tuning performance of Ret-XKnow and baselines in an end-to-end manner. The rows highlighted in gray indicate the performance when adopting ViT-large instead of ViT-base.

Methods	MRR@5	P@1	R@5	R@10
FLMR+WiT	55.48	43.06	75.80	85.77
Ret-XKnow	62.82	51.23	81.57	89.20
Ret-XKnow+ViD2R	63.93	52.70	81.77	89.34

Table 4: The effect of image captions. Performance of models fine-tuned with captions on the OK-VQA (GS) dataset.

Zero-shot Retrieval using End-to-End Retriever

Baselines. To demonstrate the effectiveness of our end-to-end multimodal retrieval framework, we compare our approach against the advancements including the following baselines. ColBERTv2 (Santhanam et al. 2022b), a text-based retrieval model that employs a fine-grained late-interaction mechanism. In experiments, this model retrieves relevant passages using only textual queries. FLMR+WiT (Lin et al. 2023), a multimodal retriever that incorporates external visual information. They use vision and text encoders identical to our Ret-XKnow. Their pre-training on a subset of the WiT dataset (Srinivasan et al. 2021) aims to align visual embeddings with the linguistic space of the text retriever. ReViz+VL-ICT (Luo et al. 2023), an end-to-end knowledge retrieval model for multimodal queries, which uses pre-trained ViLT and BERT-base as its query and document encoders, respectively. With ViLT optimized for the image size of $384\times 384$ , their model introduces a pre-training strategy, VL-ICT. In the absence of publicly available code and data, we reconstructed the model and dataset based on descriptions in their publication, resulting in a VL-ICT dataset of 2,997,354 samples. PreFLMR (Lin et al. 2024), an end-to-end multimodal knowledge retrieval model, which utilizes a cross-attention mechanism with the outputs of the penultimate layer for fine-grained modality interaction. Unlike Ret-XKnow, their architecture directly fuses the text features with visual features.

Results. In Tab. 1, our approach achieves superior zero-shot retrieval performance across four datasets, significantly outperforming baseline models. Despite utilizing smaller image sizes and a smaller pre-training dataset than ReViz, our method significantly outperforms baselines across all retrieval metrics under zero-shot settings in an end-to-end manner. FLMR, which utilizes the same foundation models, showed performance degradation compared to the base text retriever even after pre-training with WiT, unlike our retriever. Note that we concatenated the aligned visual features with the final embeddings of the text retriever during the inference of FLMR. Despite using the same pre-training dataset, PreFLMR underperforms compared to our RetXKnow on state-of-the-art multimodal retrieval benchmarks due to clear differences in the modality interaction approach. Furthermore, Tab. 2 showcases that our pre-training dataset requires unifying visual information to textual query in contrast to the previous approach, resulting in improving zero-shot performance.

Fine-tuning on Downstream Tasks

We further demonstrate the adaptability of Ret-XKnow and the effectiveness of the ViD2R pre-training task by fine-tuning models on downstream tasks.

Results. The results from our experiments, detailed in Tab. 3, clearly demonstrate the efficacy of our approach. First, Ret-XKnow outperforms other baseline models on key metrics across knowledge bases of different sizes, even before any specialized pre-training. This performance advantage is meaningfully extended by applying transfer learning on the Ret-XKnow model pre-trained with our ViD2R dataset. Moreover, our investigations reveal that our end-to-end retriever reaches performance levels comparable with the baseline that utilizes caption. As shown in Tab. 4, we present the performance of models fine-tuned with image captions within the OK-VQA (GS) dataset. Ret-XKnow approximates these achievements without employing captions in Tab. 3. Such outcomes underscore the effectiveness of Ret-XKnow in enhancing multimodal query representations in an end-to-end manner.

Methods	MRR@5	P@5	R@5	R@10	R@20
Ret-XKnow	59.88	44.93	78.10	86.50	92.27
w/o $r_{I,T}$	54.68	39.97	73.31	83.59	90.29
w/o $C(\cdot)$	48.54	33.88	66.13	76.02	83.19
ViD2R
w/o filtering	54.96	39.42	73.35	82.94	89.92
w/o conversion	49.52	35.47	67.40	77.76	85.81

Table 5: Ablation studies. This table shows zero-shot performance on the OK-VQA (GS) dataset.

Ablation Studies

In Tab. 5, the removal of strided convolution $C(\cdot)$ leads to significant performance degradation in metrics, despite the relatively large number of tokens. This reveals that numerous tokens are rather disturbed in the retrieval task, underlining the need for the extraction of core information. Similarly, omitting the conversion process in ViD2R notably impacts the model’s performance, highlighting its essential contribution to the pre-training process. This outcome corroborates our hypothesis that there is a clear difference between informative response and passage in the retrieval context.

Conclusion

This paper presents an end-to-end multimodal retriever for enhancing multimodal query representations. To effectively pre-train the retriever, we also introduce the ViD2R dataset by automatically transforming multimodal dialogues into information retrieval tasks with only a text retriever and a set of documents. Our method outperforms previous baselines across diverse multimodal retrieval datasets in an end-to-end manner. Through rigorous experiments, we demonstrated the effectiveness of our methods leading to an advanced end-to-end retrieval system for multimodal queries. Despite the remarkable retrieval performance, our work primarily focused on enhancing the representation of visual content, while attempts to improve the text retriever itself were not explored. We still have limited performance compared to methods that utilize intermediate modules. In future works, we aim to address this limitation by exploring ways to enhance the text retriever component alongside the visual content representation. We also plan to extend the flexibility of our approach for diverse modalities and domains, such as the medical domain and unified modalities, utilizing existing multimodal dialogue datasets (Wu et al. 2023; Li et al. 2023).

References

Alayrac et al. (2022) Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems (NeurIPS), 35: 23716–23736.
Chen et al. (2023) Chen, Y.; Hu, H.; Luan, Y.; Sun, H.; Changpinyo, S.; Ritter, A.; and Chang, M.-W. 2023. Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 14948–14968.
Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 4171–4186.
Dosovitskiy et al. (2021) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR).
Gao et al. (2022) Gao, F.; Ping, Q.; Thattai, G.; Reganti, A.; Wu, Y. N.; and Natarajan, P. 2022. A thousand words are worth more than a picture: Natural language-centric outside-knowledge visual question answering. arXiv preprint arXiv:2201.05299.
Ge et al. (2022) Ge, Y.; Ge, Y.; Liu, X.; Wang, J.; Wu, J.; Shan, Y.; Qie, X.; and Luo, P. 2022. Miles: Visual bert pre-training with injected language semantics for video-text retrieval. In European Conference on Computer Vision (ECCV), 691–708.
Gui et al. (2022) Gui, L.; Wang, B.; Huang, Q.; Hauptmann, A.; Bisk, Y.; and Gao, J. 2022. KAT: A Knowledge Augmented Transformer for Vision-and-Language. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 956–968.
Hanu et al. (2022) Hanu, L.; Thewlis, J.; Asano, Y. M.; and Rupprecht, C. 2022. VTC: Improving Video-Text Retrieval with User Comments. In European Conference on Computer Vision (ECCV), 616–633.
Huang et al. (2023) Huang, J.; Zhang, J.; Jiang, K.; Qiu, H.; and Lu, S. 2023. Visual instruction tuning towards general-purpose multimodal model: A survey. arXiv preprint arXiv:2312.16602.
Jiang et al. (2022) Jiang, Z.; Gao, L.; Wang, Z.; Araki, J.; Ding, H.; Callan, J.; and Neubig, G. 2022. Retrieval as Attention: End-to-end Learning of Retrieval and Reading within a Single Transformer. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2336–2349.
Karpukhin et al. (2020) Karpukhin, V.; Oğuz, B.; Min, S.; Lewis, P.; Wu, L.; Edunov, S.; Chen, D.; and Yih, W.-t. 2020. Dense passage retrieval for open-domain question answering. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769–6781.
Khattab and Zaharia (2020) Khattab, O.; and Zaharia, M. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over bert. In Proceedings of the International ACM SIGIR conference on research and development in Information Retrieval (SIGIR), 39–48.
Lee, Chang, and Toutanova (2019) Lee, K.; Chang, M.-W.; and Toutanova, K. 2019. Latent Retrieval for Weakly Supervised Open Domain Question Answering. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 6086–6096.
Li et al. (2023) Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; and Gao, J. 2023. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems (NeurIPS), 36.
Lin and Byrne (2022) Lin, W.; and Byrne, B. 2022. Retrieval Augmented Visual Question Answering with Outside Knowledge. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 11238–11254.
Lin et al. (2023) Lin, W.; Chen, J.; Mei, J.; Coca, A.; and Byrne, B. 2023. Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering. Advances in Neural Information Processing Systems (NeurIPS), 36.
Lin et al. (2024) Lin, W.; Mei, J.; Chen, J.; and Byrne, B. 2024. PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 5294–5316.
Liu et al. (2018) Liu, G.; Reda, F. A.; Shih, K. J.; Wang, T.-C.; Tao, A.; and Catanzaro, B. 2018. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), 85–100.
Liu et al. (2023) Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2023. Visual instruction tuning. Advances in Neural Information Processing Systems (NeurIPS), 36.
Loshchilov and Hutter (2019) Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations (ICLR).
Luo et al. (2023) Luo, M.; Fang, Z.; Gokhale, T.; Yang, Y.; and Baral, C. 2023. End-to-end Knowledge Retrieval with Multi-modal Queries. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 8573–8589.
Luo et al. (2021) Luo, M.; Zeng, Y.; Banerjee, P.; and Baral, C. 2021. Weakly-supervised visual-retriever-reader for knowledge-based question answering. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 6417––6431.
Marino et al. (2019) Marino, K.; Rastegari, M.; Farhadi, A.; and Mottaghi, R. 2019. Ok-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3195–3204.
Nguyen et al. (2016) Nguyen, T.; Rosenberg, M.; Song, X.; Gao, J.; Tiwary, S.; Majumder, R.; and Deng, L. 2016. MS MARCO: A human-generated machine reading comprehension dataset.
Ouyang et al. (2022) Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS), 35: 27730–27744.
Podell et al. (2024) Podell, D.; English, Z.; Lacey, K.; Blattmann, A.; Dockhorn, T.; Müller, J.; Penna, J.; and Rombach, R. 2024. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis. In The International Conference on Learning Representations (ICLR).
Qu et al. (2020) Qu, C.; Yang, L.; Chen, C.; Qiu, M.; Croft, W. B.; and Iyyer, M. 2020. Open-retrieval conversational question answering. In Proceedings of the International ACM SIGIR conference on research and development in Information Retrieval (SIGIR), 539–548.
Qu et al. (2021) Qu, C.; Zamani, H.; Yang, L.; Croft, W. B.; and Learned-Miller, E. 2021. Passage retrieval for outside-knowledge visual question answering. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1753–1757.
Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (ICML), 8748–8763.
Salemi, Altmayer Pizzorno, and Zamani (2023) Salemi, A.; Altmayer Pizzorno, J.; and Zamani, H. 2023. A symmetric dual encoding dense retrieval framework for knowledge-intensive visual question answering. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 110–120.
Santhanam et al. (2022a) Santhanam, K.; Khattab, O.; Potts, C.; and Zaharia, M. 2022a. PLAID: an efficient engine for late interaction retrieval. In Proceedings of the ACM International Conference on Information & Knowledge Management (CIKM), 1747–1756.
Santhanam et al. (2022b) Santhanam, K.; Khattab, O.; Saad-Falcon, J.; Potts, C.; and Zaharia, M. 2022b. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 3715–3734.
Schwenk et al. (2022) Schwenk, D.; Khandelwal, A.; Clark, C.; Marino, K.; and Mottaghi, R. 2022. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision (ECCV), 146–162.
Srinivasan et al. (2021) Srinivasan, K.; Raman, K.; Chen, J.; Bendersky, M.; and Najork, M. 2021. WIT: Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2443–2449.
Wang et al. (2023) Wang, J.; Meng, L.; Weng, Z.; He, B.; Wu, Z.; and Jiang, Y.-G. 2023. To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning. arXiv preprint arXiv:2311.07574.
Wu and Mooney (2022) Wu, J.; and Mooney, R. 2022. Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 8061–8072.
Wu et al. (2023) Wu, S.; Fei, H.; Qu, L.; Ji, W.; and Chua, T.-S. 2023. NExT-GPT: Any-to-Any Multimodal LLM.
Yang et al. (2022) Yang, Z.; Gan, Z.; Wang, J.; Hu, X.; Lu, Y.; Liu, Z.; and Wang, L. 2022. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI), volume 36, 3081–3089.
Zhu et al. (2023) Zhu, D.; Chen, J.; Shen, X.; Li, X.; and Elhoseiny, M. 2023. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.

Appendix A Visualizations

We visualize $r_{I,T}$ to exhibit that Ret-XKnow adaptively attends to visual information conditioned on the textual query. We applied different textual queries to a set of images containing both dogs and cats. Using Diffusion-XL (Podell et al. 2024), we created 216 images with the simple prompt ‘A dog and a cat in an image.’ From this generated image set, we embedded $E_{v}$ by conditioning on the following three prompts: (1) What is the dog or puppy’s species?, (2) What is the cat or kitty’s species?, (3) What is the place in the image? As shown in Fig. 4, (a) represents $E_{v}$ according to different textual queries and (b) shows that Ret-XKnow focuses core visual information based on the context of the textual queries via modality interaction.

Model	Metric
Model	MRR@5	P@5	R@5	R@10
ReViz	47.82	36.50	66.49	77.35
ReViz⁺	54.47	41.74	73.35	83.17
Ret-XKnow	61.76	46.52	79.57	87.97
Ret-XKnow⁺	62.49	47.19	80.82	88.51
Ret-XKnow⁺+Cap.	65.0	49.85	82.78	90.63

Table 6: Effect of hard negative training. The model marked with ⁺ was fine-tuned on the OK-VQA (GS) dataset with the application of hard negative sampling. “Cap.” indicates models that were fine-tuned using captions. ReViz and Ret-XKnow were pre-trained on VL-ICT and ViD2R, respectively. The results for ReViz were cited from Luo et al. (2023).

Appendix B Effect of Hard Negative Training

Existing works have shown that utilizing hard negative samples leads to discriminative representations. One of our baselines, ReViz (Luo et al. 2023), also adopts the hard negative sampling, boosting their base performance. We further investigate the effect of in-batch negative sampling in our architecture, which treats all passages in a training batch except a passage belonging to desired passages as negative passages for the given query. We retrieved the top-150 documents for questions in the training dataset, with our pre-trained model. Then, we randomly select hard negative samples from the retrieved documents during the training stage. In our experiments, we utilized both hard negative samples and in-batch negative samples, where the number of hard negative samples was set to 7. Tab. 6 shows that while our model, Ret-XKnow, already outperforms existing baselines, the application of hard negative sampling further enhances its performance.

Appendix C Training and Inference Details

In all experiments, we train models using the AdamW optimizer (Loshchilov and Hutter 2019) with warm-up steps on a machine with 8 RTX A6000 GPUs. We chose model checkpoints based on the validation loss.

Pre-training. We used $E_{Q}=[E_{g};E_{v}]$ without $E_{t}$ to align visual embeddings with the linguistic space during the pre-training stage. In this stage, we only tuned the mapping network.

Fine-tuning. For fine-tuning our model on downstream tasks, we tuned all parameters of Ret-XKnow except for the vision model in all experiments. Because each downstream dataset has different scales, we set different hyperparameters for each dataset as shown in Tab. 9. FLMR and Ret-Xknow were set with the same hyperparameters under end-to-end settings. Since the parameters of the vision model are not updated during training, we cached the outputs of the vision model before training. In our setting, training one epoch for the ViD2R dataset took 5 minutes on 8 RTX A6000 GPUs, where one epoch encompasses 1907 steps. We detail the downstream task datasets in Tab. 7.

Inference. Passages within the knowledge base were pre-indexed, following the method established by the previous work (Santhanam et al. 2022b). The indexing process consists of three critical steps: centroid selection, passage encoding, and index inversion. To enhance storage efficiency, embeddings were compressed to 2 bits per dimension. In the OK-VQA dataset using a corpus collected from Google search API, the retrieval time of Ret-Xknow and ColBERTv2 is approximately 0.086 seconds and 0.081 seconds per query on one RTX A6000 GPU, respectively. Thus, Ret-XKnow spends slightly more time retrieving relevant passages with multimodal queries, compared to the base text retriever.

Dataset	KB	#Train	#Test	KB size
Dataset	KB	Size
ViD2R	Synthesized	977,723	-	-
OK-VQA (GS)	GS	8,958	5,046	166,390
OK-VQA (Wiki)	Wikipedia	211,200	2,523	11,000,000
ReMuQ	Wikipedia	-	3,609	195,387
A-OKVQA	Rationale	17,056	1,145	1,145

Table 7: Summary of dataset statistics for evaluation. This table presents the distribution of training and testing instances alongside the size of the knowledge bases for each dataset employed in our study. GS and KB denote the corpus collected from the Google Search API and used knowledge base, respectively.

Statistic	Counts
# Total data	977,723
# Images	203,765
# Max. queries per image	17
# Avg. queries per image	4.8
# Queries requiring description	211,241 (21.61%)
# Other types of queries	766,482 (78.39%)

Table 8: Data Statistics for the ViD2R dataset.

Dataset	Hyperparameter
Dataset	Init. LR	# Epochs	Batch Size per GPU	Global Batch Size	# Warm-up Steps
OK-VQA (GS)	2e-5	20	32	256	10
OK-VQA (GS)⁺	2e-5	20	128 (16*(7+1))	1024	10
OK-VQA (Wiki)	2e-5	10	64	512	200
A-OKVQA	2e-5	10	32	256	50

Table 9: Summary of hyperparameters utilized for fine-tuning. The Init. LR denotes the initial learning rate.

Model	Dataset	KB	Metric
Model	Dataset	KB	MRR@5	P@ $k$	R@5	R@10	R@20	R@50
A	OK-VQA	Wiki-11M	49.64	34.97	68.56	78.60	86.80	93.06
B			50.77	36.29	70.15	80.86	89.10	94.17
Desc. / (A+B)			50.24	35.45	70.23	80.18	87.48	93.90
Ques. / (A+B)			49.73	34.89	69.08	79.59	87.36	93.77
A+B			51.10	36.97	70.83	80.94	88.59	94.09
A	OK-VQA	GS-112K	57.86	42.94	75.94	84.48	90.47	95.44
B			59.83	44.68	77.65	85.89	91.99	96.35
Desc. / (A+B)			57.74	43.05	76.04	84.78	91.36	96.06
Ques. / (A+B)			59.38	43.65	77.31	85.95	91.60	96.0
A+B			59.88	44.93	78.10	86.50	92.27	96.43
A	ReMuQ	Wiki-199K	80.54	77.69	84.90	87.06	88.61	90.41
B			80.78	78.03	85.09	87.12	88.94	91.05
Desc. / (A+B)			80.68	77.83	85.12	87.06	88.89	90.66
Ques. / (A+B)			81.42	78.69	85.54	87.67	89.36	90.99
A+B			80.88	78.11	85.20	87.48	89.14	90.77
A	A-OKVQA	Rationale	67.40	58.17	81.40	87.25	92.14	96.68
B			68.47	60.26	81.66	88.82	93.45	97.47
Desc. / (A+B)			65.71	56.33	79.83	87.07	92.23	97.03
Ques. / (A+B)			67.57	58.25	81.92	88.82	93.01	96.94
A+B			68.13	58.95	82.53	88.82	93.19	97.38

Table 10: Ablations studies for the ViD2R dataset. This table represents the zero-shot performance of Ret-XKnow according to pre-training datasets. A and B denote datasets presented by Liu et al. (2023) and Wang et al. (2023), respectively.

Appendix D Details for ViD2R Dataset

To construct the ViD2R dataset, we employ two visual dialogue datasets (Liu et al. 2023; Wang et al. 2023), which are designed for visual instruction-following tuning. Initially, dialogues were split into individual turns. We removed turns with responses of less than 30 characters. Subsequently, we edited responses containing simple affirmations (“yes”, “no”) and excluded samples for tasks irrelevant to retrieval tasks ( $e.g.$ , location and count), where we automatically filtered out based on specific phrases. Following the pre-processing stage, the dataset comprised 1.35 million QA pairs. To further refine this collection, we employed a text retriever, ColBERTv2, trained on the MS MARCO Passage Ranking task (Nguyen et al. 2016). Finally, we acquired 0.98 million QA pairs in the neural filtering stage. In the final dataset construction phase, we converted responses into passages using a data pool of 6 million Wikipedia documents released by (Chen et al. 2023). During this conversion, responses used as the textual queries were limited to 128 tokens. As shown in Fig. 5, our constructed dataset is featured by pairs of multimodal queries and passages including responses to different queries about the same image, advancing the capability to retrieve relevant information from multimodal queries. The data statistics of our pre-training dataset are shown in Tab. 8. Identical to our motivation, ViD2R features an average of 4.8 queries per image, requiring modality interaction.

Analysis. Tab. 2 in the main body demonstrated that our pre-training dataset requires the ability to integrate visual information, as retrieving solely textual queries is challenging. For a fair comparison, we randomly select 5000 samples from each dataset in this experiment. We also perform ablation studies on types of composed data in the ViD2R dataset. In Tab. 10, B (Wang et al. 2023) was designed to alleviate the lack of reasoning in A (Liu et al. 2023), requiring more complex visual reasoning. Despite the smaller size of the B dataset compared to the A dataset, a ViD2R dataset curated from the B dataset leads to better performance when used to create the ViD2R dataset compared to using A alone. We also divided the query types into two categories for investigation: those requiring descriptions and general questions. We hypothesized that queries requiring descriptions would not facilitate learning modality interaction since they do not contain any conditional information. As shown in this table, queries categorized as questions generally lead to higher performance, except in the case of OK-VQA (Wiki), indicating that question-type queries contribute more to learning. For a fair comparison, both types of questions were sampled in equal numbers. Additionally, this table demonstrates that our architecture is robust to variations in data quantity and type.