
Find The Gap: Bi-graph Retrieval and Reasoning for Knowledge-based Visual Question Answering

Elham J. Barezi
Michigan State University
[email protected]
&Parisa Kordjamshidi
Michigan State University
[email protected]
Abstract

We analyze the problem of knowledge-based visual question answering. In this problem, given a question, the model needs to ground it in the visual modality and retrieve the relevant information from a given large knowledge base (KB) to answer the question. Our analysis is two-fold: one part is based on task-specific neural architectures trained from scratch, and the other is based on large pre-trained language models (LLMs). Our research questions for both types of experiments are: (1) How well do task-specific and LLM-based models integrate information and reason over it? (2) How do these two types of models perform for various question types, various combinations of knowledge sources, and different numbers of reasoning hops?

Our results demonstrate the positive impact of empowering both the task-specific and the LLM-based models with retrieved external and visual knowledge.

Moreover, the LLM-based models outperform their task-specific counterparts on KB-related questions, which confirms the presence of implicit knowledge in LLMs.

For both types of models, 2-hop reasoning is consistently harder than 1-hop reasoning. Furthermore, although LLMs are stronger in 1-hop reasoning, they fall behind our fine-tuned task-specific model in 2-hop reasoning. This highlights the impact of using a strong integration and reasoning module.

1 Introduction

Visual Question Answering (VQA) is a problem setting in which, given a question in natural language, the answer must be found in a provided image. Although progress in this field has been remarkable over the past years, existing models still struggle with questions that require knowledge beyond the image. These models either lack the integration of such extra knowledge or integrate noisy knowledge into their reasoning procedure, even when the source of the required knowledge is given. In the latter case, retrieving the relevant knowledge from large knowledge bases and integrating it into the answering process is of paramount importance. This problem is formulated as knowledge-based VQA (KB-VQA), where, in addition to the question and the image, a source of external knowledge is given as input.

There have been various efforts to include external knowledge in the VQA task. These methods have two basic steps: retrieving related knowledge entries from the given knowledge bases, e.g., Wikipedia and ConceptNet liu2004conceptnet, and predicting the final answer by integrating this extracted external information with the question and the visual data. However, these retrieval-based approaches suffer from two weaknesses: (i) they rely on heuristics such as keyword matching for retrieval, so there is no guarantee of retrieving semantically relevant knowledge, and (ii) since knowledge graphs are very dense, such heuristics return plenty of irrelevant entries even when the required knowledge is among them. These irrelevant entries mislead the reasoning procedure.

With the introduction of pre-trained large-scale language models and their vast capacity for storing knowledge, recent approaches aim to use the implicit knowledge accumulated in these models, such as GPT-3 brown2020language, to guide commonsense reasoning in VQA models. Despite their remarkable results, these methods lack powerful vision processing and rely solely on the power of LLMs. With the recent advances in pre-training large vision-language models li2022grounded; peng2023kosmos; li2023blip, investigating the effect of the knowledge hidden in the parameters of these new models remains a challenging open question.

Since current information retrieval methods are based on simple heuristics and there is no guarantee of their retrieval accuracy, they hurt the final VQA results by retrieving irrelevant or noisy facts. Moreover, although there has been remarkable success in representation learning, existing unsupervised pre-training objectives are not necessarily aligned with the task of transforming KG facts into natural language, which has motivated ongoing research on using powerful models to generate natural-language descriptions of KG facts pan2023unifying.

In this work, we propose to train a supervised KB retrieval model using the meta-data provided in the KRVQA dataset. Our retrieval method uses a contrastive loss to maximize the similarity of a question embedding with the embeddings of its supporting facts while minimizing its similarity to irrelevant facts. Moreover, we assume the corresponding scene graph (SG) of the image is given and utilize a similar retrieval module to extract the question-relevant visual information. In other words, we formulate both external knowledge and visual knowledge integration as retrieval tasks. Next, we integrate the retrieved external and visual knowledge to find the final answer through multiple hops of reasoning.

The knowledge retrieved from the KG and the SG is integrated into both task-specific neural architectures and large language model backbones. We analyze the strengths and weaknesses of the models with respect to question types and shed light on the remaining challenges in the retrieval and integration steps.

In a nutshell, we summarize our contributions as follows:

  • Proposing a supervised model for retrieving relevant knowledge from the knowledge graph and the scene graph, conditioned on the question.

  • Integrating external and visual knowledge in both task-specific neural architectures and LLMs for the VQA task.

  • Analyzing the information integration and reasoning ability of the task-specific and LLM-based models with respect to the question type, the number of reasoning hops, and the source of the required knowledge (visual, external, or both).


2 Related works

We roughly divide recent VQA approaches into three groups. The first group, plain VQA methods, focuses on question-image integration without including any extra information. The second group feeds extra information extracted from external knowledge bases into VQA models. In the last group (Section 2.3), we review state-of-the-art methods that rely on pre-trained Large Language Models (LLMs) as implicit sources of knowledge for KB-VQA tasks.

2.1 Plain VQA methods

Similar to yu2018beyond, MUTAN ben2017mutan uses bilinear models to merge visual and textual information in VQA tasks through a multimodal Tucker tensor decomposition tucker1966some. MUTAN parametrizes the bilinear interactions between visual and textual representations to learn high-level associations between the question meaning and the visual concepts in the image while avoiding the curse of dimensionality.

To focus on both the visual and the textual context of questions, the authors of MCAN yu2019deep design a ‘co-attention’ model that associates keywords in questions with key objects in images. They propose a deep Modular Co-Attention Network (MCAN) composed of a cascade of Modular Co-Attention (MCA) layers. Each MCA layer jointly models the self-attention of questions and images, as well as the question-guided attention of images, using a modular composition of two basic attention units.

The authors of ding2022mukea suggest extracting and using knowledge triplets from the training data instead of relying on external knowledge bases. They propose to extract and represent multimodal knowledge as explicit triplets during the training of the VQA task. They discuss the complementarity of the extracted triplets to existing knowledge graphs and leave combining them with external knowledge bases open for future work.

2.2 KB processing methods

The MuCKO model zhu2020mucko aims to answer the question by generating a semantic graph built on the dense captions of the image, a graph representing relevant knowledge triplets, and a spatial visual graph where nodes represent region features. It performs question-guided inter and intra-graph attention to answer the question iteratively.

The authors of marino2021krisp use Multimodal-BERT (MMBERT) khare2021mmbert for their multimodal embedding and a Relational Graph Convolutional Network (RGCN) schlichtkrull2018modeling to learn the representations of the concepts extracted from questions, images, and KBs. To cover the knowledge required in OK-VQA, they rely on four different knowledge sources, DBpedia auer2007dbpedia, ConceptNet liu2004conceptnet, VisualGenome krishna2017visual, and the hasPart KB bhakthavatsalam2020dogs, to construct a knowledge graph. Their model yields only marginal improvements while suffering from huge memory and time demands. Moreover, since the supporting KB and the OK-VQA dataset were collected separately, the behavior of this model is hard to analyze, and it remains unclear how VQA models can benefit from an external KB while avoiding the accompanying noise.

The authors of shevchenko2021reasoning propose a multi-modal pre-trained model that considers an external KB. They use an auxiliary training objective that encourages the learned representations of their model to align with the graph embeddings of matching entities in an external KB. They implement the method on top of the LXMERT tan2019lxmert model, using ConceptNet and Wikidata as the matching external knowledge bases.

MAVEx wu2022multi aims to leverage and integrate knowledge from visual (Google image search), textual (Wikipedia), and commonsense (ConceptNet) knowledge sources. Its goal is to validate a set of candidate answers using answer-based knowledge retrieval.

In guo2022unified, the authors propose Unifer, a unified end-to-end retriever-reader framework for KB-VQA. To reduce noisy retrieval from external KBs, they evaluate retrieved triplets based on their effect on the final accuracy: they compare the prediction scores for each retrieved triplet and select the triplets that improve the final prediction as the external source of knowledge. However, they pay a huge cost for generating pseudo-labels to evaluate every knowledge triplet for each question.

DMMGR li2022dynamic represents the retrieved KB triplets in a dynamic key-value memory format and converts the visual content into a spatial graph representation. It performs question-and-knowledge-guided attention over the spatial visual graph to find the visual information required to answer the question.

2.3 Large Language Models as Implicit Knowledge Sources

Due to the revolutionary growth and power of large pre-trained models, the new trend in knowledge-based VQA aims to use large language-only pre-trained models like GPT-3 brown2020language as implicit sources of knowledge, instead of using external sources of knowledge. Most often, these methods replace the visual data (image) with its corresponding tags and/or captions and then solve an unimodal text-only problem.

The authors of gui2021kat (KAT) use a contrastive-learning-based module to retrieve knowledge entries from an explicit knowledge base (in entity+description format) and use GPT-3 to retrieve implicit knowledge by feeding image tags and captions as the input prompt. They integrate the implicit and explicit knowledge and generate the final answer using transformer modules.

The authors of yang2022empirical propose PICa, which relies on the implicit knowledge hidden in GPT-3 and trusts the power of LLMs to jointly retrieve relevant knowledge and reason to find the final answer. They verbalize the visual signal by replacing the image with its caption and/or tags to provide more in-context information for the language model.

The authors of lin2022revive propose REVIVE, which finds the top-k relevant external entries using the inner product of the external KB entries and the image representations, and integrates visual information, implicit knowledge, and explicit knowledge to generate the final answer. Similar to KAT, they use a subset of Wikidata vrandevcic2014wikidata as their KB, which is in (entity, description) format instead of (entity1, relation, entity2). Like PICa, they use VinVL zhang2021vinvl as their captioner, while extracting regional features, positions, and tags with a pre-trained GLIP model li2022grounded as their visual representation.

The authors of hu2022promptcap claim that a single general caption often underspecifies the visual content, resulting in missing visual details essential for answering visual questions correctly. Unlike prior works that use a general captioning model to textualize the image so that an LLM can understand it, they propose PROMPTCAP (Prompt-guided image Captioning), which takes a natural language prompt/query to control which visual entities appear in the generated caption. They train their captioner using a GPT-3-synthesized dataset.

The authors of gao2022transform propose TRiG, which replaces the visual information with natural language text via image-level captioning, object-level labeling, and OCR (Optical Character Recognition). Their dense knowledge retriever retrieves the top-k relevant knowledge passages from Wikipedia using the inner-product similarity between Wikipedia passages and the concatenation of the given question and caption.

To avoid losing visual information, the authors of salaberria2023image feed visual features alongside regional labels and captions to the transformer in their proposed Caption-Based Model (CBM).

The authors of PROPHET shao2023prompting suggest including candidate answers in the prompt to enrich it and improve the final accuracy of LLM-based VQA.

The authors of chen2023see propose an Interactive Prompting Visual Reasoner (IPVR) that aims to verbalize the image using both BLIP li2022blip generated captions and object detection.

In conclusion, implicit-KB VQA models aim to use LLMs as the source of knowledge, which requires verbalizing the images with a multimodal or image-captioning model. Therefore, the main effort of this group is to find the best verbalization of the images so as not to lose the required visual information. Moreover, storing and retrieving knowledge implicitly in LLM parameters is prone to errors during the knowledge storage and retrieval steps. These errors include, but are not limited to: (1) the knowledge hidden implicitly in LLM parameters is not necessarily complete, correct, or up to date; (2) correct memorization of knowledge during pre-training is not guaranteed; and (3) retrieval of the relevant knowledge is not guaranteed he2022rethinking.

3 Proposed Method

Our goal is to retrieve and integrate relevant knowledge triplets from an external knowledge base, given in graph format, to correctly answer a textual question about a given image. Formally, given a question $q$, an image $I$, and an external knowledge graph $KB$, the model must output the answer, a single word or a short phrase.

There are two main challenges in this task: first, retrieving relevant knowledge from the external knowledge base, and second, integrating the retrieved knowledge, the question, and the image to generate the final answer. Regarding the first challenge, we propose a supervised model based on a contrastive loss that maximizes the similarity of the question embedding with the embeddings of its supporting facts while minimizing its similarity to irrelevant facts. Next, we replace the raw image with its corresponding human-annotated scene graph and use the same contrastive-loss optimization to retrieve the relevant visual information; see the Appendix (Section 6.1) for details on the scene graphs we use. Regarding integration and reasoning, we propose two frameworks, one based on a task-specific neural architecture and one using LLMs as the backbone, to provide a comprehensive analysis of the impact of our retrieval method.

3.1 External Knowledge Retrieval

Since the commonly used supporting knowledge bases for the VQA task are in graph format cao2021knowledge; marino2021krisp, we propose a knowledge retrieval method for graphical knowledge bases (KGs), which represent each fact as a triplet $(e_{1},r_{1},e_{2})$ indicating the first entity, the relation, and the second entity, respectively. As mentioned before, we propose an objective function that learns the representations of the triplets by contrasting positive and negative triplets. We convert each triplet $(e_{1},r_{1},e_{2})$ into the sentence "$e_{1}\ r_{1}\ e_{2}$". Transformation-based graph encoders such as TransE and QualE rossi2021knowledge, or neural models such as Relational Graph Convolutional Networks (RGCN) schlichtkrull2018modeling, could be used to initialize the triplet embeddings. However, the former have been trained on a limited amount of KG data, and the RGCN family suffers from memory issues when the number of relations is huge and can only learn embeddings for entities, not for relations. Due to the availability of strong, efficient pre-trained text encoders, we use RoBERTa liu2019roberta to initialize both the plain-text and the triplet embeddings. We add a feed-forward sub-layer on top of the RoBERTa representations to transform the embeddings into the same space. The architecture of our retrieval model is shown in Fig. 1.

Refer to caption
Figure 1: Architecture of our supervised retrieval model.
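To make the retrieval encoder concrete, here is a minimal PyTorch sketch of the question/triplet encoder described above: each triplet is verbalized as a sentence, encoded with RoBERTa, and projected by a feed-forward layer into the shared retrieval space. This is an illustrative reconstruction, not the released implementation; the module names, the choice of `roberta-base`, the pooled-token pooling, and the use of a single shared encoder are assumptions.

```python
# Illustrative sketch of the retrieval encoder (RoBERTa + feed-forward projection).
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer


class TripletEncoder(nn.Module):
    def __init__(self, proj_dim: int = 512):
        super().__init__()
        self.tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
        self.backbone = RobertaModel.from_pretrained("roberta-base")
        # Feed-forward sub-layer mapping RoBERTa embeddings into the shared space.
        self.proj = nn.Sequential(
            nn.Linear(self.backbone.config.hidden_size, proj_dim),
            nn.ReLU(),
            nn.Linear(proj_dim, proj_dim),
        )

    def forward(self, texts: list[str]) -> torch.Tensor:
        batch = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        # Pooled first-token representation of each sentence.
        hidden = self.backbone(**batch).last_hidden_state[:, 0]
        return self.proj(hidden)


def verbalize(triplet: tuple[str, str, str]) -> str:
    """Convert a (e1, r1, e2) triplet into the plain sentence 'e1 r1 e2'."""
    e1, r, e2 = triplet
    return f"{e1} {r} {e2}"


# The same encoder embeds questions and verbalized triplets (an assumption here).
encoder = TripletEncoder()
question_emb = encoder(["What is the man holding?"])
fact_emb = encoder([verbalize(("umbrella", "used for", "blocking rain"))])
```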

We use the following contrastive loss formulation, where $t^{+}$ ranges over the triplets appearing in the supporting reason annotated for the given question (the facts the dataset marks as required to answer it), $t^{-}$ ranges over a subset of non-related triplets, $q$ is the question embedding, and $d(\cdot,\cdot)$ is a distance between embeddings:

$$L_{Ret}(q,KB)=\sum_{t^{+}\in pos}\;\sum_{t^{-}\in neg}\big[d(q,t^{+})-d(q,t^{-})\big] \qquad (1)$$

We use the same initialization method and architecture for both visual triplet retrieval from the scene graph and knowledge triplet retrieval from the external knowledge graph. We supervise our retrieval modules with the supporting reasons annotated for each question in the KRVQA dataset: the image-related supporting triplets supervise the visual-retrieval training, and the KB-related triplets supervise the KB-retrieval training. Since the external KB includes 2339 relations, 102343 entities, and 225434 facts, we first use keyword matching to find a smaller related subgraph before passing it to the contrastive retrieval step: we extract tags from the image and keywords from the question, find KB nodes matching these keywords, and extract a subgraph by traversing two hops from each matched node. This subgraph is still very large, on the order of 10K edges. The top-1 retrieval accuracy of this module for both the scene graph and the knowledge base is shown in Table 1: while top-1 accuracy for both is around 60%, top-100 accuracy is almost 100%, and top-1 accuracy without any training is below 30% for both.

KB Retrieval SG Retrieval
top-1 Accuracy 61.10 60.00
Table 1: Retrieval top-1 accuracy for KRVQA dataset.
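For concreteness, the following sketch shows one way to implement the objective in Eq. 1, treating the other samples in a batch as negatives as in our training setup; reading $d(\cdot,\cdot)$ as cosine distance and adding a small hinge margin are illustrative assumptions rather than reported hyper-parameters (margin zero without clamping recovers the plain sum in Eq. 1).

```python
# Illustrative sketch of the contrastive retrieval objective of Eq. 1.
import torch
import torch.nn.functional as F


def contrastive_retrieval_loss(q_emb, pos_emb, neg_emb, margin: float = 0.2):
    """
    q_emb:   (B, D)    question embeddings
    pos_emb: (B, P, D) embeddings of the supporting (positive) triplets per question
    neg_emb: (B, N, D) embeddings of sampled negative triplets per question
    """
    # Cosine distance d(q, t) = 1 - cos(q, t)  (distance choice is an assumption).
    d_pos = 1.0 - F.cosine_similarity(q_emb.unsqueeze(1), pos_emb, dim=-1)  # (B, P)
    d_neg = 1.0 - F.cosine_similarity(q_emb.unsqueeze(1), neg_emb, dim=-1)  # (B, N)
    # Pairwise differences d(q, t+) - d(q, t-) over every positive/negative pair.
    diff = d_pos.unsqueeze(2) - d_neg.unsqueeze(1)                          # (B, P, N)
    # Hinge variant: well-separated pairs stop contributing once past the margin.
    return F.relu(diff + margin).sum(dim=(1, 2)).mean()


# At inference time, triplets are ranked by their distance to the question embedding
# and the top-k are passed on to the reasoning module.
```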

3.2 Integration and Reasoning using a Task-specific Neural Architecture

In our proposed task-specific architecture, we perform multi-hop reasoning by integrating the retrieved KB and SG triplets through stacked cross-attention units and generate the final answer as formulated below. To handle several hops of reasoning, we iteratively update a query vector, initialized with the question embedding, by cross-attending to the retrieved external-knowledge triplets and the retrieved visual triplets; here, "query" denotes the query input of the cross-attention units. In the end, we combine the outputs of the two cross-attention modules to exploit both the visual and the external information and perform the final VQA classification.

$$query_{0}=question, \qquad query_{t}=Att(query_{t-1},KB_{ret})+Att(query_{t-1},SG_{ret}) \qquad (2)$$

The complete architecture of our NN model (R3: KB Retrieval, SG Retrieval, and Reasoning) is shown in Fig. 2; here $t$ indexes the reasoning hops, so $query_{t-1}$ is the query produced by the previous hop.

Refer to caption
Figure 2: Architecture of our proposed model R3. The cross-attention modules iteratively update the query to perform multi-hop reasoning. The integration layer combines the external and visual knowledge, and a classifier layer provides the final output. $Query_{0}$ is initialized with the question embedding.
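A minimal PyTorch sketch of the iterative query update in Eq. 2 is given below; it is an illustration rather than the released R3 implementation, and the embedding dimension, number of heads, and use of `nn.MultiheadAttention` are assumptions.

```python
# Illustrative sketch of the multi-hop query update of Eq. 2.
import torch
import torch.nn as nn


class MultiHopReasoner(nn.Module):
    def __init__(self, dim: int = 512, num_hops: int = 4, num_heads: int = 8):
        super().__init__()
        # Four hops gave the best accuracy in our experiments (see Implementation Details).
        self.num_hops = num_hops
        self.att_kb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.att_sg = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, question_emb, kb_triplets, sg_triplets):
        # question_emb: (B, D); kb_triplets / sg_triplets: (B, K, D) retrieved triplet embeddings.
        query = question_emb.unsqueeze(1)                       # query_0 = question
        for _ in range(self.num_hops):
            kb_out, _ = self.att_kb(query, kb_triplets, kb_triplets)
            sg_out, _ = self.att_sg(query, sg_triplets, sg_triplets)
            query = kb_out + sg_out                             # query_t per Eq. 2
        return query.squeeze(1)                                 # fed to the answer classifier
```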

3.3 Integration and Reasoning using Large Language Models

Due to the remarkable results of hu2022promptcap, we follow its architecture as the main LLM block for the VQA task: the image is verbalized by the PromptCap question-aware captioner, and the caption, together with in-context examples, is fed to a text-only LLM that generates the answer. For each test question, we select 32 in-context shots from the training set by the cosine similarity of the CLIP embeddings of the image-question pairs. We convert each retrieved triplet $(e_{1},r_{1},e_{2})$ into the sentence "$e_{1}\ r_{1}\ e_{2}$" and inject it into the prompt, as shown in the knowledge section of each box in Figure 3. For a fair comparison, we do not feed extra information such as image tags or candidate answers into our prompts; the caption generated by the PromptCap captioner hu2022promptcap is the only context information for each shot and for the test example. Following hu2022promptcap; shao2023prompting, we use "davinci-002" as our LLM engine due to its strong in-context-learning accuracy.

Refer to caption
Figure 3: Architecture of our LLM model.
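The sketch below shows one way to assemble such a prompt from the PromptCap caption, the verbalized retrieved triplets, the question, and the answer of each shot; the template wording and field names are illustrative assumptions rather than our verbatim prompts.

```python
# Illustrative sketch of in-context prompt construction for the LLM backbone.
def verbalize(triplet):
    e1, r, e2 = triplet
    return f"{e1} {r} {e2}"


def build_prompt(shots, test_example):
    """shots: list of dicts with 'caption', 'triplets', 'question', 'answer';
    test_example: the same fields without 'answer'."""
    blocks = []
    for ex in shots:
        knowledge = ". ".join(verbalize(t) for t in ex["triplets"])
        blocks.append(
            f"Context: {ex['caption']}\n"
            f"Knowledge: {knowledge}\n"
            f"Question: {ex['question']}\n"
            f"Answer: {ex['answer']}"
        )
    knowledge = ". ".join(verbalize(t) for t in test_example["triplets"])
    blocks.append(
        f"Context: {test_example['caption']}\n"
        f"Knowledge: {knowledge}\n"
        f"Question: {test_example['question']}\n"
        f"Answer:"
    )
    # The resulting string is sent to the LLM engine, which completes the last answer.
    return "\n\n".join(blocks)
```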

4 Experimental Results

In the following sections, we provide experimental results to show the impact of supervised knowledge retrieval on the final accuracy of the VQA task in both the task-specific and the LLM-based frameworks. Our experiments address the following questions: (1) Does supervised retrieval of external and visual knowledge improve the final VQA accuracy of both frameworks? (2) How do the two frameworks compare across question types, knowledge sources (visual, external, or both), and numbers of reasoning hops? (3) How much does each component of the model (KB retrieval, SG retrieval, and their combination) contribute to the final accuracy?


4.1 Datasets

cao2021knowledge propose the knowledge-routed visual question reasoning (KRVQA) dataset, which aims to cut off shortcut learning by generating non-crowdsourced question-answer pairs together with their supporting reasons, indicating which part of the reason comes from visual knowledge and which part comes from external knowledge. It evaluates the multi-step reasoning ability of models over external knowledge and provides meta-data to analyze the strengths and weaknesses of VQA models. KRVQA is, to date, the largest knowledge-based VQA dataset: it contains 32910 images and 157201 QA pairs, split into train, validation, and test sets at a ratio of 60%, 20%, and 20%, respectively. The questions use external knowledge from the FVQA dataset, with 2339 relations, 102343 entities, and 225434 facts wang2017fvqa. Another widely used VQA dataset is OK-VQA marino2019ok, whose authors argue that most VQA benchmarks focus on questions that do not require reasoning or knowledge beyond what is in the image. Despite its strengths, OK-VQA does not provide supporting facts/reasons for each question and lacks the meta-data needed to analyze the strengths and weaknesses of reasoning models. In this work, we therefore use the KRVQA dataset, as its supporting meta-data helps to train our retrieval modules and to analyze all modules of our model.

4.2 Implementation Details

We train our model with PyTorch paszke2019pytorch for all experiments using one GPU. For the triplet contrastive loss, we treat all the other samples in a batch as negative samples. Our model is trained with the AdamW loshchilov2018fixing optimizer for 200 epochs, with a batch size of 256 and a learning rate of $1\times10^{-4}$. Cascading 4 layers of query update and reasoning provides the best accuracy for multi-hop integration and reasoning. We use top-1 accuracy as in cao2021knowledge for a fair comparison.

4.3 Results and Discussion

Table 2 shows the accuracy for the VQA task on the KRVQA dataset. Besides the results for the strongest existing models, we provide the accuracy for NN+Retrieval (our neural network model supported by retrieved KB and SG triplets) and LLM+Retrieval (our LLM-based model supported by retrieved KB and SG triplets). Our enriched LLM model improves accuracy (40.5%) compared with Promptcap hu2022promptcap (19.8%), one of the most recent LLM-based VQA frameworks. Our NN model improves by more than 13% over the best previous NN model, DMMGR, which uses key-value memory networks to combine external knowledge in VQA tasks. This demonstrates that training a supervised knowledge retrieval module for both external and visual knowledge, together with a strong reasoning module, avoids noise propagation and enables better reasoning. The results also show that enriching a base LLM model (Promptcap) with retrieved knowledge yields a significant improvement in accuracy, although a fine-tuned reasoning module still leads to higher accuracy than the LLM-based model.

Q-type cao2021knowledge 8.12
LSTM cao2021knowledge 8.81
FiLM perez2018film 16.89
MFH yu2018beyond 19.55
UpDown anderson2018bottom 21.85
MCAN yu2019deep 22.23
Mucko zhu2020mucko 24.00
KM-net cao2019explainable 25.19
MUKEA ding2022mukea 27.38
DMMGR li2022dynamic 31.8
Promptcap hu2022promptcap 19.8
NN+Retrieval 44.36
LLM+Retrieval 40.50
Table 2: Top-1 accuracy percentage comparisons among different models on the KRVQA dataset.

4.4 Question-type Analysis

The KRVQA dataset is generated using 7 different question-answer templates, shown in Table 3. We report the accuracy of both our NN-based and LLM-based models for each question type in Table 4. $KB_{ret}+SG_{ret}$ indicates feeding the retrieved knowledge to the model, while $KB_{GT}+SG_{GT}$ indicates injecting the ground-truth knowledge triplets to obtain the upper-bound accuracy of our reasoning module.

From the results in Table 4, we can see that the proposed reasoning model is strong enough to find the final answer by combining the supporting facts, although even in the presence of ground-truth supporting facts/reasons, question types 2, 3, and 5 are harder to answer with our neural-network-based model. Moreover, comparing the accuracy of the retrieval-based model against the upper bound obtained with ground-truth triplets, our retrieval method hurts most for question types 2, 3, and 5. We conjecture that retrieving relevant facts is harder when the head entity is missing. On the other hand, since the answer to question types 3 and 5 is found in the second reasoning hop, it is important to update the retrieved knowledge after updating the query at each step, instead of retrieving triplets statically and independently of the reasoning procedure.

The results also show that question types 3 and 5 are harder for the LLM model even in the presence of ground-truth knowledge. Since for these question types the answer belongs to the second hop, reasoning becomes more complicated and prone to noise.

Another interesting result is that the LLM beats the NN model on almost all one-hop question types, though for 2-hop questions it performs worse than the NN model, which confirms the weakness of LLMs in multi-hop reasoning.

In Table 5, we can see that the LLM models work better than their corresponding NN models for KB-related questions, which confirms the existence of implicit knowledge in LLMs. However, comparing the performance of each model on KB-related versus non-KB-related question types demonstrates the importance of a strong multi-modal integration and reasoning module for questions that need both external and visual knowledge. Regarding the number of hops, for all models 2-hop reasoning is consistently harder than 1-hop reasoning. On the other hand, although LLMs are stronger in 1-hop reasoning, they fall behind our fine-tuned NN reasoning model in 2-hop reasoning.

For our retrieval-based models, KB-related questions are harder than non-KB-related questions, while both models have similar accuracy on 1-hop and 2-hop questions. We conjecture that combining information from one domain with another (external knowledge with visual knowledge) is still not an easy task.

Qtype Question Semantics Answer Reason
0 What is the relation of $<A>$ and $<B>$? $<R>$
1 What is $<A>$ $<R>$? $<B>$ $(A,R,B)$
2 What $<R>$ $<B>$? $<A>$
3 What is the relation of the object that $<A>$ $<R_{1}>$ and $<C>$? $<R_{2}>$
4 What is the relation of $<A>$ and the object that $<R_{2}>$ $<C>$? $<R_{1}>$ $(A,R_{1},B),(B,R_{2},C)$
5 What $<A>$ $<R_{1}>$ $<R_{2}>$? $<C>$
6 What $<R_{1}>$ $<R_{2}>$ $<C>$? $<A>$
Table 3: KRVQA templates for the question-answer generation.
Q-type 0 1 2 3 4 5 6
NN+($KB_{ret}$+$SG_{ret}$) 63.34 56.76 23.63 14.75 54.55 10.28 54.29
NN+($KB_{GT}$+$SG_{GT}$) 99.74 92.39 86.82 87.44 95.96 74.24 90.86
LLM+($KB_{ret}$+$SG_{ret}$) 60.71 57.31 33.95 12.58 51.93 14.49 51.78
LLM+($KB_{GT}$+$SG_{GT}$) 99.01 96.00 98.52 74.47 92.56 74.54 88.24
Table 4: Accuracy for each question type.
Model KB-related No-KB-related 1-hop 2-hop
NN+($KB_{ret}$+$SG_{ret}$) 34.63 43.33 37.99 38.83
NN+($KB_{GT}$+$SG_{GT}$) 85.57 94.25 90.46 88.57
LLM+($KB_{ret}$+$SG_{ret}$) 39.36 41.59 43.82 37.67
LLM+($KB_{GT}$+$SG_{GT}$) 91.19 91.15 98.09 80.68
Table 5: Accuracy analysis with respect to the required knowledge source (KB-related vs. not) and the number of reasoning hops.

4.5 Ablation Study

We investigate the impact of each part of the model and report the results in Table 6. $KB_{ret}$ and $SG_{ret}$ use only the supporting facts retrieved from the KB and the SG by our trained modules, respectively. $KB_{GT}$ and $SG_{GT}$ integrate the ground-truth supporting external or visual knowledge provided in the KRVQA dataset. The NN-None model only integrates the question representation with the CLIP representation of the image, while the LLM-None model does not include any knowledge in its prompts.

The results show that both the proposed KB and SG retrieval models improve accuracy. The notable gap between the None results and the models using the SG affirms the paramount impact of a high-level representation of the visual data for VQA tasks. Moreover, our proposed multi-hop reasoning module is strong enough to find the final answer from the ground-truth supporting facts (89.48% accuracy). The large gap between the None and $SG_{ret}$ models shows the importance of an object-level representation of the visual data, rather than a low-level abstract representation of the image or a general caption. The results for the LLM-based model support these findings, although, consistent with Table 2, the LLM models are moderately less accurate than the fine-tuned model, which highlights the impact of a strong reasoning module that better integrates the given information and performs multiple hops of reasoning. Supporting the findings for the NN model, the large improvement from adding visual knowledge retrieval demonstrates that a caption is not an adequate representation of the visual data, and an alternative way of feeding the visual data is needed.

Additional information None $KB_{ret}$ $SG_{ret}$ $KB_{ret}$+$SG_{ret}$ $KB_{GT}$ $SG_{GT}$ $KB_{GT}$+$SG_{GT}$
NN 15.36 22.28 38.31 44.36 38.98 67.06 89.48
LLM 19.8 21.6 37.4 40.5 36.60 58.60 87.90
Table 6: Ablation study.

5 Conclusion and Future Work

We presented a retrieval and reasoning framework for external knowledge-based visual question answering. We trained a contrastive-loss model to retrieve supporting facts from the external knowledge base as well as from the scene graph. We trained our retriever and reasoning model on the KRVQA dataset, the largest existing knowledge-based VQA dataset. We showed that formulating the VQA task as a joint KB and scene graph retrieval problem yields a remarkable improvement in accuracy for both the task-specific neural network and the LLM-based reasoning models. We also analyzed the difficulty of the various question types and the impact of each part of our model. Our results show that representing the visual data in a high-level, object-oriented format such as scene graphs, rather than using abstract features, improves accuracy. We find that supervising the KB retrieval model to align its initial representations with the natural language embeddings of the questions notably improves both retrieval and final VQA accuracy. Moreover, we provided extensive comparisons of the NN and LLM integration and reasoning modules, which show the effectiveness of LLM models for KB-related questions, although they fall short in integrating different modalities of information and in multi-hop reasoning.

6 Limitations

  • We trained and executed the retrieval and reasoning parts separately. We will investigate the impact of dynamic fact retrieval interleaved with reasoning in the future.

  • Due to the huge size of our external knowledge base, we retrieve the relevant knowledge for each question only once, before the reasoning step starts. We will replace this static retrieval module with a dynamic one that refines the retrieved knowledge after each step of reasoning.

  • We used the human-annotated scene graphs provided in the Visual Genome dataset, as approximated scene graphs do not yield acceptable accuracy. In the future, we plan to replace the VG scene graphs with another automatically generated high-level representation of the image.

  • Some widely used VQA datasets, such as OK-VQA, do not provide supporting facts/reasons for each question and lack the meta-data needed to analyze the strengths and weaknesses of reasoning models. Therefore, we did not report results on such datasets. In the future, we plan to adapt our models and run experiments on more datasets.

Appendix

6.1 Scene Graph Computation

We treat both the knowledge base and the visual signal in graph format and propose a supervised retrieval method to retrieve supporting triplets from the external knowledge graph and the scene graph. Since the images in the KRVQA dataset are taken from the Visual Genome dataset krishna2017visual, we use the scene graphs provided by the VG dataset in our method. These scene graphs include, on average, 36 objects and 22 relationships per image.

To have a more realistic setting, we also made some efforts to provide results with approximated scene graphs. Given an image $I$, we use Faster-RCNN ren2015faster to identify a set of objects $O=\{o_{i}\}_{i=1}^{K}$ ($K=36$), where each object $o_{i}$ is associated with a visual feature vector $v_{i}\in R^{d_{v}}$ ($d_{v}=2048$) and a corresponding label. The final scene graph is $G_{V}=(V^{V},E^{V})$ over the objects $O$, where $V^{V}=\{v^{V}_{i}\}_{i=1}^{K}$ is the node set and each node $v^{V}_{i}$ corresponds to a detected object $o_{i}$. We train our scene graph generation method using the training data introduced in xu2017scene (VG150), which includes the 150 most frequent object categories and 50 predicates of the VG dataset. We use 70% of the images for training and the remaining 30% for testing. We generate scene graphs using both VS3 zhang2023learning, which uses GLIP li2022grounded as a building block for object and feature recognition, and the scene graph benchmark proposed by tang2020unbiased. Unfortunately, the scene graph detection quality is low, and error propagation from scene graph generation reduces the retrieval accuracy from around 60% (Table 1) to around 19%. Therefore, we use the Visual Genome scene graphs in the rest of our work to avoid error propagation from scene graph weaknesses and to better analyze the strengths and weaknesses of our retrieval and reasoning model.
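For illustration, a minimal sketch of how an annotated scene graph can be flattened into (object, predicate, object) triplets for the visual retrieval module is given below; the field names assume a simplified Visual Genome-style record and are not the exact dataset schema.

```python
# Illustrative sketch: flatten a scene-graph annotation into visual triplets.
def scene_graph_to_triplets(scene_graph):
    """scene_graph: {'objects': {obj_id: name},
                     'relationships': [(subj_id, predicate, obj_id), ...]}"""
    objects = scene_graph["objects"]
    return [
        (objects[s], pred, objects[o])
        for s, pred, o in scene_graph["relationships"]
        if s in objects and o in objects
    ]


# Example: {'objects': {1: 'man', 2: 'umbrella'}, 'relationships': [(1, 'holding', 2)]}
# yields [('man', 'holding', 'umbrella')], which is then verbalized and embedded
# exactly like the external KB triplets.
```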