Multi-Clue Reasoning with Memory Augmentation for
Knowledge-based Visual Question Answering

Chengxiang Yin¹, Zhengping Che², Kun Wu¹, Zhiyuan Xu², Jian Tang²
¹Syracuse University, ²Midea Group

Abstract

Visual Question Answering (VQA) has emerged as one of the most challenging tasks in artificial intelligence due to its multi-modal nature. However, most existing VQA methods are incapable of handling Knowledge-based Visual Question Answering (KB-VQA), which requires external knowledge beyond visible contents to answer questions about a given image. To address this issue, we propose a novel framework that endows the model with capabilities of answering more general questions, and achieves a better exploitation of external knowledge through generating Multiple Clues for Reasoning with Memory Neural Networks (MCR-MemNN). Specifically, a well-defined detector is adopted to predict image-question related relation phrases, each of which delivers two complementary clues to retrieve the supporting facts from an external knowledge base (KB), and these facts are further encoded into a continuous embedding space using a content-addressable memory. Afterwards, mutual interactions between visual-semantic representation and the supporting facts stored in memory are captured to distill the most relevant information in three modalities (i.e., image, question, and KB). Finally, the optimal answer is predicted by choosing the supporting fact with the highest score. We conduct extensive experiments on two widely-used benchmarks. The experimental results well justify the effectiveness of MCR-MemNN as well as its superiority over other KB-VQA methods.

1 Introduction

Recently, significant progress has been made for Visual Question Answering (VQA) [3, 5, 8, 12, 36], which requires a joint comprehension of multi-modal content from images and natural language. A VQA agent is expected to deliver the correct answer to a text-based question according to a given image. Most of the existing VQA models [4, 19, 22, 24, 37] and datasets [15, 23, 30, 41] have focused on simple questions, which are answerable by solely analyzing the question and image, i.e., no external knowledge is required. However, a truly ‘AI-complete’ VQA agent is required to combine both visual and semantic observations with external knowledge for reasoning, which is effortless for human but challenging for machine. Therefore, to bridge the gap between human-level intelligence and algorithm design, Knowledge-based VQA (KB-VQA) [34] is introduced to automatically find answers from the knowledge base (KB), given image-question pairs.

Some efforts have been made in this direction. To begin with, Wang et al. [34] presented a Fact-based VQA (FVQA) dataset to support a much deeper reasoning. FVQA consists of questions that require external knowledge to answer. Several classical solutions [34, 35] have been proposed to solve FVQA by mapping each question to a query and retrieving supporting facts in KB through a keyword-matching manner. These supporting facts are processed to form the final answer. However, these query-mapping based approaches with solely question parsing suffer from serious performance degradation when the information hint is not captured in the external KB or the visual concept is not exactly mentioned in the question. Moreover, special information (i.e., visual concept type and answer source) should be determined in advance during querying and answering phases, which makes it hard to generalize to other datasets. To address these issues, we introduce Multiple Clues for Reasoning, a new KB retrieval method, where a relation phrase detector is proposed to predict multiple complementary clues for supporting facts retrieval.

More recently, Out of the Box (OB) [26] and Straight to Facts (STTF) [27] adopt cosine similarity technique to get the highest scoring fact for answer prediction, where the whole visual information (i.e., object, scene, and action) and the whole semantic information (i.e., all question words) are indiscriminately applied to infer the final answer by implicit reasoning. However, for an image-question pair, only part of the visual content in the image and several specific words in the question are relevant to a given supporting fact. Additionally, the direct concatenation of image-question-entity embedding makes it hard to adaptively capture information among different modalities. To exploit the inter-relationships among three modalities (i.e., image, question and KB), we propose a two-way attention mechanism with memory augmentation to model the interactions between visual-semantic representation and the supporting facts. Then the most relevant information in the three modalities can be distilled.

Refer to caption — Figure 1: An illustration of Multi-Clue Reasoning.

In this paper, to deal with the KB-VQA task and address the aforementioned issues, we present a novel framework that focuses on achieving a better exploitation of external knowledge through generating Multiple Clues for Reasoning with Memory Neural Networks. We name it as MCR-MemNN for short. Specifically, as illustrated by Figure 1, the image-question pair is encoded into a visual-semantic representation, which serves as the input to a well-defined detector to obtain a three-term relation phrase (i.e., $\langle$ Subject, Relation, Object $\rangle$ ). In this case, either the subject (i.e., cat) or the object (i.e., tiger) acts a clue to retrieve supporting facts in the external KB, and both deliver a fact set including the ground-truth fact (i.e., cat RelatedTo tiger). In this manner, the ground-truth fact can be successfully fetched as long as one of the two complementary clues is predicted. Note that the predicted relation (i.e., RelatedTo) is adopted to filter out some facts with different relations. The retained supporting facts, including the ground-truth, are further encoded into a continuous embedding space and stored in a content-addressable memory, where each memory slot corresponds to a supporting fact. Afterwards, for reasoning procedure, we assume that the visual and semantic content in the image-question pair can contribute to a better exploitation of the external knowledge (i.e., supporting facts). Analogously, the external knowledge is helpful for a better understanding of the image-question pair. Therefore, we employ a two-way attention mechanism, which is not only intended to focus on the important aspects of the memory in light of the image-question pair, but also the important parts of image and question in light of the memory.

The main contributions of this paper are summarized as follows: (1) We demonstrate a new KB retrieval method with two complementary clues (i.e., subject and object), which makes it a lot easier to fetch the ground-truth supporting fact. (2) A two-way attention mechanism with memory augmentation is employed to model the inter-relationships among image, question and KB to distill the most relevant information in the three modalities. (3) We perform extensive experiments on two widely-used benchmarks, which shows that the proposed MCR-MemNN is an effective framework customized for KB-VQA.

2 Related Work

2.1 Visual Question Answering

Visual Question Answering (VQA) has emerged as a typical and popular multimedia application, which has been studied by quite a few recent works [1, 21, 33, 38, 40, 39]. A typical CNN-RNN based approach, Neural-Image-QA [24], learns an end-to-end trainable model by feeding both the image features and questions into LSTM. Recently, Huang et al. [11] proposes a multi-grained attention mechanism, that addresses the failed cases on small objects or uncommon concepts through learning word-object correspondence. To go a step further, graph neural network is leveraged in [10] to support complex relational reasoning among objects conditioned on the textual input. Even though lots of methods and datasets have been proposed for VQA task, all of them are targeting at the simple questions, and none of them can deal with the general cases that external knowledge is required for a deep reasoning.

2.2 Knowledge-based Visual Question Answering

To deal with the knowledge-based visual question answering (KB-VQA) tasks, combination of visual observations with the external knowledge is required, which, however, is undoubtedly challenging. In [34], Wang et al.introduces a Fact-based VQA (FVQA) dataset, which requires and supports a much deeper reasoning with the help of the external knowledge bases. It also provides a query-mapping based approach, which maps each question to a query to retrieve the supporting facts. Similarly, Wang et al. [35] transforms the question to an available query template and limits the question types. However, those query-mapping based approaches suffers when the question does not focus on the most obvious visual concept, since the visual representation is not adopted for answer reasoning. In some existing learning based approaches [27, 26], both visual and semantic information are wholly provided and may introduce redundant information, which is harmful to the reasoning process. As a solution, a heterogeneous graph neural network with multiple layers is employed in [42] to collect the complementary evidence from three modalities under the guidance of question. It achieves the state-of-the-art by adopting dense captioning for semantic parsing. Analogously, this paper presents another solution by leveraging multiple clues for KB retrieval and leveraging a two-way attention mechanism with memory augmentation to model the inter-relationships among the three modalities including image, question and KB, which delivers a performance comparable with the state-of-the-art.

3 Methodology

We first describe the general definition and notations of Knowledge-based Visual Question Answering (KB-VQA), followed by the modeling of relation phrase detector. Finally, we present the details about the proposed framework.

3.1 Problem Statement

Given an image I and a related question Q, the KB-VQA task aims to predict an answer A from the external knowledge base (KB), which consists of facts in the form of triplet (i.e., $\langle$ Subject, Relation, Object $\rangle$ ), where subject represents a visual concept, object represents an attribute or a visual concept and relation denotes the relationship between the subject and the object. Note that the answer A can be either the subject or object in the triplet. The key of KB-VQA is to select the correct supporting fact and then determine the answer. For convenience, in our notations, the fact $\langle$ Subject, Relation, Object $\rangle$ corresponds to the answer subject; while its reversed form $\langle$ Object, Relation, Subject $\rangle$ corresponds to the answer object. For instance, as for the ground-truth supporting fact (i.e., cat RelatedTo tiger) in Figure 3, the corresponding answer is cat, and the answer of its reversed form (i.e., tiger RelatedTo cat) is tiger. In this manner, during inference stage, the optimal answer is the first term of the predicted fact.

3.2 Relation Phrase Detector

Same as the fact in KB, the relation phrase is depicted in a form of triplet (i.e., $\langle$ Subject, Relation, Object $\rangle$ ), which contains two complementary clues (i.e., subject and object) for KB retrieval. As mentioned, the two complementary clues make it a lot easier to fetch the ground-truth supporting fact, and the relation predicted can be leveraged to filter out the facts with wrong relations. Therefore, we develop a detector to obtain a relation phrase based on the visual and semantic content of the image-question pair.

As shown in Figure 2, given an image-question pair $\{\textbf{I},\textbf{Q}\}$ , the relation phrase prediction is formulated as a multi-task classification problem following [20]. For image embedding, we use a Faster R-CNN [31] in conjunction with the ResNet-101 [9] pretrained on Visual Genome dataset [15] to generate an output set of image features v as the visual representation of the input image.

\textbf{v}=\mbox{Faster R-CNN}(\textbf{I})

(1)

where $\textbf{v}\in{\mathcal{R}}^{2048\times K}$ is based on the bottom-up attention [1], which represents the ResNet features centered on the top- $K$ objects in the image. $K$ is set to 36 in our experiments.

For question embedding, we first transfer each word in Q into a feature vector using the pre-trained 300D GloVe [29] vectors, and use randomly initialized vectors for the words which are out of GloVe’s vocabulary. Here we denote the resulting sequence of word embeddings as ${\textbf{H}}^{q}$ . Then the Gated Recurrent Unit (GRU) [7] with hidden state dimension 512 is adopted to encode the ${\textbf{H}}^{q}$ as semantic representation $\textbf{q}\in{\mathcal{R}}^{512}$ .

\textbf{q}=\mbox{GRU}({\textbf{H}}^{q})

(2)

To encode the image and question in a shared embedding space, top-down attention [1] is employed to fuse the visual representation v and semantic representation q. Specifically, for each object from $i=1...K$ , its feature ${\textbf{v}}_{i}$ is concatenated with the semantic representation q, and then passed through a non-linear layer $f_{a}$ and a linear layer to obtain its corresponding attention weight $a_{i}$ .

$\displaystyle a_{i}$	$\displaystyle={\textbf{W}}_{a}{f_{a}}([{\textbf{v}}_{i},\textbf{q}])$	(3)
$\displaystyle\hat{\textbf{a}}$	$\displaystyle=\mbox{softmax}(\textbf{a})$	(4)
$\displaystyle\hat{\textbf{v}}$	$\displaystyle={\sum}_{i=1}^{K}\hat{a}_{i}{\textbf{v}}_{i}$	(5)

where ${\textbf{W}}_{a}$ is a learnable weight vector and the attention weights a are normalized with a softmax function to $\hat{\textbf{a}}$ . The image features are weighted by the normalized attention values to get the weighed visual representation $\hat{\textbf{v}}\in{\mathcal{R}}^{2048}$ . Following [33], the non-linear layer $f_{a}:\textbf{x}\in{\mathcal{R}}^{m}\to\textbf{y}\in{\mathcal{R}}^{n}$ with parameters $a$ is defined as follows:

$\displaystyle\widetilde{\textbf{y}}$	$\displaystyle=\mbox{tanh}(\textbf{W}\textbf{x}+\textbf{b})$	(6)
g	$\displaystyle=\sigma({\textbf{W}}^{{}^{\prime}}\textbf{x}+{\textbf{b}}^{{}^{\prime}})$	(7)
y	$\displaystyle=\widetilde{\textbf{y}}\circ\textbf{g}$	(8)

where $\sigma$ is the sigmoid activation function, $\textbf{W},{\textbf{W}}^{{}^{\prime}}\in{\mathcal{R}}^{n\times m}$ and $\textbf{b},{\textbf{b}}^{{}^{\prime}}\in{\mathcal{R}}^{n}$ are learnable parameters, and $\circ$ denotes the element-wise multiplication.

A joint embedding h of the question and the image is obtained by the fusion of the weighed visual representation $\hat{\textbf{v}}$ and the semantic representation q.

\textbf{h}=f_{v}(\hat{\textbf{v}})\circ f_{q}(\textbf{q})

(9)

where $\textbf{h}\in{\mathcal{R}}^{512}$ . Both $f_{v}$ and $f_{q}$ are the non-linear layers with the same form as $f_{a}$ . The joint embedding h is then fed into a group of linear classifiers for the prediction of subject, relation and object in a relation phrase.

$\displaystyle\hat{\textbf{s}}$	$\displaystyle=\mbox{softmax}({\textbf{W}}_{s}\textbf{h}+{\textbf{b}}_{s})$	(10)
$\displaystyle\hat{\textbf{r}}$	$\displaystyle=\mbox{softmax}({\textbf{W}}_{r}\textbf{h}+{\textbf{b}}_{r})$	(11)
$\displaystyle\hat{\textbf{o}}$	$\displaystyle=\mbox{softmax}({\textbf{W}}_{o}\textbf{h}+{\textbf{b}}_{o})$	(12)

where ${\textbf{W}}_{s},{\textbf{W}}_{r},{\textbf{W}}_{o}$ and ${\textbf{b}}_{s},{\textbf{b}}_{r},{\textbf{b}}_{o}$ are learnable parameters, and $\hat{\textbf{s}},\hat{\textbf{r}},\hat{\textbf{o}}$ denote the predicted probability for subject, relation and object over each candidate, respectively. The loss function for the relation phrase detector is defined as

{\mathcal{L}}_{d}={\lambda}_{s}{\mathcal{L}}_{c}(\textbf{s},\hat{\textbf{s}})+{\lambda}_{r}{\mathcal{L}}_{c}(\textbf{r},\hat{\textbf{r}})+{\lambda}_{o}{\mathcal{L}}_{c}(\textbf{o},\hat{\textbf{o}})

(13)

where $\textbf{s},\textbf{r},\textbf{o}$ are the ground-truth labels for subject, object and relation, respectively. ${\mathcal{L}}_{c}$ represents the cross-entropy loss, and the weights of the loss terms (i.e., ${\lambda}_{s},{\lambda}_{r},{\lambda}_{o}$ ) are set to 1.0 in our experiments.

3.3 MCR-MemNN

As shown in Figure 3, besides the Relation Phrase Detector, our proposed MCR-MemNN framework consists of six components, which are Memory Module, Visual-Semantic Embedding Module, Visual-Semantic Aware Attention Module, Memory Aware Attention Module, Generalization Module and Answer Module.

3.3.1 Memory Module

For an image-question pair $\{\textbf{I},\textbf{Q}\}$ , after the three-term phrase (i.e., Subject, Relation, Object) is predicted by the relation phrase detector, all the facts in the external KB with entities pointed by the subject or pointed to the object within $h$ hops are collected as candidate supporting facts. $h$ is set to 1 in our experiments. The relation predicted is leveraged to further filter out the facts with wrong relations. Then a set of candidate supporting facts $\{{\textbf{F}}_{i}\}_{i=1}^{N}$ is collected, where $N$ denotes the number of facts in the fact set, and each fact ${\textbf{F}}_{i}$ consists of a sequence of words.

Afterwards, similar to question embedding, each fact ${\textbf{F}}_{i}$ is transformed to a sequence of word embeddings ${\textbf{H}}^{f}_{i}$ based on GloVe’s vocabulary, and encoded using a BiLSTM to get its representation ${\textbf{f}}_{i}\in{\mathcal{R}}^{128}$ .

{\textbf{f}}_{i}=\mbox{BiLSTM}({\textbf{H}}^{f}_{i})

(14)

To store the candidate supporting facts, a key-value structured memory network [25] is leveraged. The key memory is used in the addressing stage, while the value memory is used in the reading stage. The representation ${\textbf{f}}_{i}$ of each fact is passed through two separate linear layers to obtain its key embedding ${\textbf{M}}^{k}_{i}\in{\mathcal{R}}^{128}$ and value embedding ${\textbf{M}}^{v}_{i}\in{\mathcal{R}}^{128}$ respectively.

	$\displaystyle{\textbf{M}}^{k}_{i}$	$\displaystyle={\textbf{W}}_{k}{\textbf{f}}_{i}$		(15)
	$\displaystyle{\textbf{M}}^{v}_{i}$	$\displaystyle={\textbf{W}}_{v}{\textbf{f}}_{i}$		(16)

where ${\textbf{W}}_{k}\in{\mathcal{R}}^{128\times 128}$ and ${\textbf{W}}_{v}\in{\mathcal{R}}^{128\times 128}$ are learnable parameters. For the set of candidate supporting facts $\{{\textbf{F}}_{i}\}_{i=1}^{N}$ , we have a set of key embeddings ${\textbf{M}}^{k}=\{{\textbf{M}}^{k}_{i}\}_{i=1}^{N}$ and a set of value embeddings ${\textbf{M}}^{v}=\{{\textbf{M}}^{v}_{i}\}_{i=1}^{N}$ . Note that one memory slot is defined as a pair of key embedding and value embedding (i.e., $\{{\textbf{M}}^{k}_{i},{\textbf{M}}^{v}_{i}\}$ ) of one candidate supporting fact.

3.3.2 Visual-Semantic Embedding Module

The visual-semantic embedding module has basically the same architecture as the relation phrase detector and the only difference is about the question embedding before top-down attention, where self-attention are applied over the sequence of word embeddings ${\textbf{H}}^{q}$ to get the semantic representation $\textbf{q}\in{\mathcal{R}}^{128}$ .

	$\displaystyle{\hat{\textbf{a}}^{qq}}$	$\displaystyle=\mbox{softmax}({({\textbf{H}}^{q})}^{\intercal}{\textbf{H}}^{q})$		(17)
	q	$\displaystyle=\mbox{BiLSTM}([{\textbf{H}}^{q}{({\hat{\textbf{a}}}^{qq})}^{\intercal},{\textbf{H}}^{q}])$		(18)

where the ${\textbf{H}}^{q}$ are weighted by the normalized values, concatenated with itself and fed into a BiLSTM. Afterwards, same procedures are conducted to obtain the visual-semantic representation $\textbf{h}\in{\mathcal{R}}^{128}$ .

3.3.3 Visual-Semantic Aware Attention Module

Given visual-semantic representation h, we apply an attention over all the memory slots $\{{\textbf{M}}^{k}_{i},{\textbf{M}}^{v}_{i}\}_{i=1}^{N}$ to obtain the memory summary $\textbf{m}\in{\mathcal{R}}^{128}$ in light of the visual and semantic content of the image-question pair.

$\displaystyle a_{i}^{m}$	$\displaystyle={\textbf{W}}_{3}\mbox{tanh}({\textbf{W}}_{1}\textbf{h}+{\textbf{W}}_{2}{\textbf{M}}_{i}^{k})$	(19)
$\displaystyle{\hat{\textbf{a}}}^{m}$	$\displaystyle=\mbox{softmax}({\textbf{a}}^{m})$	(20)
m	$\displaystyle={\sum}_{i=1}^{N}\hat{a}_{i}^{m}{\textbf{M}}_{i}^{v}$	(21)

where ${\textbf{W}}_{1}\in{\mathcal{R}}^{128\times 128},{\textbf{W}}_{2}\in{\mathcal{R}}^{128\times 128},{\textbf{W}}_{3}\in{\mathcal{R}}^{1\times 128}$ are learnable parameters. The attention for each memory slot is calculated and normalized based on h and the corresponding key embedding ${\textbf{M}}^{k}_{i}$ . Then the set of value embeddings ${\textbf{M}}^{v}=\{{\textbf{M}}^{v}_{i}\}_{i=1}^{N}$ are weighted to get the memory summary m.

3.3.4 Memory Aware Attention Module

As we have obtained the memory summary m, we proceed to compute the attentions over all the question words and all the image features in light of the memory.

Given memory summary m, the sequence of word embeddings ${\textbf{H}}^{q}$ and the set of image features v, the memory-aware question embedding ${\textbf{q}}^{m}\in{\mathcal{R}}^{128}$ and the memory-aware image embedding ${\textbf{v}}^{m}\in{\mathcal{R}}^{2048}$ are derived as follows:

$\displaystyle{\hat{\textbf{a}}}^{q}$	$\displaystyle=\mbox{softmax}({({\textbf{H}}^{q})}^{\intercal}\textbf{m})$	(22)
$\displaystyle{\textbf{q}}^{m}$	$\displaystyle={\textbf{H}}^{q}{\hat{\textbf{a}}}^{q}$	(23)
$\displaystyle{\hat{\textbf{a}}}^{v}$	$\displaystyle=\mbox{softmax}({({\textbf{W}}_{v}\textbf{v})}^{\intercal}\textbf{m})$	(24)
$\displaystyle{\textbf{v}}^{m}$	$\displaystyle=\textbf{v}{\hat{\textbf{a}}}^{v}$	(25)

where ${\hat{\textbf{a}}}^{q}$ represents the normalized memory aware attention over all the question words, ${\hat{\textbf{a}}}^{v}$ represents the normalized memory aware attention over all the image features and ${\textbf{W}}_{v}\in{\mathcal{R}}^{128\times 2048}$ are learnable parameters.

The visual-semantic representation ${\textbf{h}}^{m}$ in light of the memory is obtained by the fusion of the memory-aware question embedding ${\textbf{q}}^{m}$ and the memory-aware image embedding ${\textbf{v}}^{m}$ .

{\textbf{h}}^{m}=f_{v}^{m}({\textbf{v}}^{m})\circ f_{q}^{m}({\textbf{q}}^{m})

(26)

where ${\textbf{h}}^{m}\in{\mathcal{R}}^{128}$ . Both $f_{v}^{m}$ and $f_{q}^{m}$ are the non-linear layers with the same form as $f_{a}$ . Note that the aforementioned two-way mechanism corresponds to Sections 3.3.3 and 3.3.4.

3.3.5 Generalization Module

Inspired by [6], another hop of the attention process is conducted over the memory before answer prediction. Attention mechanism of Section 3.3.3 is leveraged here, given memory aware visual-semantic representation ${\textbf{h}}^{m}$ , to fetch the most relevant information ${\textbf{m}}^{h}\in{\mathcal{R}}^{128}$ from the memory to obtain the final visual-semantic representation $\hat{\textbf{h}}\in{\mathcal{R}}^{128}$ . To be more specific, the fetched information ${\textbf{m}}^{h}$ is concatenated with ${\textbf{h}}^{m}$ and fed into a GRU to update the visual-semantic representation. To the end, a batch normalization (BN) layer is used.

$\displaystyle{a}_{i}^{h}$	$\displaystyle={\textbf{W}}_{6}\mbox{tanh}({\textbf{W}}_{4}{\textbf{h}}^{m}+{\textbf{W}}_{5}{\textbf{M}}_{i}^{k})$	(27)
$\displaystyle\hat{\textbf{a}}^{h}$	$\displaystyle=\mbox{softmax}({\textbf{a}}^{h})$	(28)
$\displaystyle{\textbf{m}}^{h}$	$\displaystyle={\sum}_{i=1}^{N}\hat{a}_{i}^{h}{\textbf{M}}_{i}^{v}$	(29)
$\displaystyle\hat{\textbf{h}}$	$\displaystyle=\mbox{BN}({\textbf{h}}^{m}+\mbox{GRU}([{\textbf{h}}^{m},{\textbf{m}}^{h}]))$	(30)

where ${\textbf{W}}_{4}\in{\mathcal{R}}^{128\times 128},{\textbf{W}}_{5}\in{\mathcal{R}}^{128\times 128},{\textbf{W}}_{6}\in{\mathcal{R}}^{1\times 128}$ are learnable parameters.

3.3.6 Answer Module

Given the final visual-semantic representation $\hat{\textbf{h}}$ , and the set of key embeddings ${\textbf{M}}^{k}\in{\mathcal{R}}^{128\times N}$ , the key embedding ${\textbf{M}}_{i}^{k}$ of each candidate supporting fact is concatenated with the $\hat{\textbf{h}}$ to compute the probability of whether the fact is correct. The predicted supporting fact is

\mbox{argmax}_{i}\mbox{softmax}({\textbf{W}}_{7}[\hat{\textbf{h}},{\textbf{M}}^{k}]+{\textbf{b}}^{7})

(31)

where $i=1,...,N$ , ${\textbf{W}}_{7}\in{\mathcal{R}}^{1\times 256}$ and ${\textbf{b}}^{7}\in{\mathcal{R}}^{N}$ are learnable parameters. As we have stated in Section 3.1, the optimal answer is the first term of the predicted fact.

3.3.7 Loss Function

Once we have the candidate supporting facts retrieved from the external KB, all the learnable parameters of the proposed MCR-MemNN (besides the Relation Phrase Detector) are trained in an end-to-end manner by minimizing the following loss function over the training set.

\mathcal{L}=\frac{1}{D}\sum_{k=1}^{D}{\mathcal{L}}_{c}({\textbf{Y}}_{k},\hat{\textbf{Y}}_{k})

(32)

where ${\mathcal{L}}_{c}$ is defined as the cross-entopy loss, $D$ is the number of training samples, ${\textbf{Y}}_{k}$ and $\hat{\textbf{Y}}_{k}$ represent the ground-truth label and the predicted probability over each candidate supporting fact.

4 Performance Evaluation

We employed two compelling benchmarks, FVQA [34] and Visual7W+ConceptNet [16] to evaluate the proposed MCR-MemNN on Knowledge-based Visual Question Answering (KB-VQA) task.

4.1 Benchmark Datasets

FVQA. The Factual Visual Question Answering (FVQA) dataset consists of $2,190$ images, $5,286$ questions and $4,126$ unique facts corresponding to the questions. The external KB of FVQA, consisting of $193,449$ facts, are constructed based on three structured KBs, including DBpedia [2], ConceptNet [18] and WebChild [32]. Following [34], the top-1 and top-3 accuracies are averaged over five test splits.

Visual7W+ConceptNet. The Visual7W+ConceptNet dataset, built by [16], is a collection of knowledge-based questions with images sampled from the test split of Visual7W [41] dataset. It consists of $16,850$ open domain question-answer pairs based on $8,425$ images. Note that the supporting facts of each question-answer pair are retrieved directly from the ConceptNet, which serves as the external KB. Following [34], the top-1 and top-3 accuracies are calculated over the test set.

4.2 Experimental Setup

4.2.1 Implementation Details

For the training of the relation phrase detector, the model was trained with Adam optimizer [14] with an initial learning rate of $1\times 10^{-4}$ and weight decay of $1\times 10^{-6}$ , and the batch size is set to 32.

For the training of the proposed MCR-MemNN (besides the relation phrase detector), the model was trained with Adam optimizer with an initial learning rate of $1\times 10^{-3}$ and weight decay of $1\times 10^{-6}$ , and the batch size is set to 64. Top 40 predicted subjects and objects were adopted as clues to retrieve candidate supporting facts from the external KB. The memory size $N_{mem}$ was set to 96. Note that no matter the number of candidate supporting facts $N$ is larger than, equal to or smaller than $N_{mem}$ , the ground-truth fact is preserved and the negative facts are randomly selected to fill up the remaining memory. Our code was implemented in PyTorch [28] and run with NVIDIA Tesla P100 GPUs.

4.2.2 Baselines

For the FVQA dataset, CNN-RNN based approaches including LSTM-Question+Image+Pre-VQA [34] and Hie-Question+Image+Pre-VQA [34], semantic parsing based approaches including FVQA (top-3-QQmaping) [34] and FVQA (Ensemble) [34], learning-based approaches including Straight to the Facts (STTF) [27], Out of the Box (OB) [26], Reading Comprehension [17], and Multi-Layer Cross-Modal Knowledge Reasoning (Mucko) [42] are compared with the proposed MCR-MemNN.

For the Visual7W+ConceptNet dataset, learning based approaches including KDMN [16], Out of the Box (OB) are compared with the proposed MCR-MemNN. Note that both KDMN and MCR-MemNN adopted memory augmentation technique for storing the retrieved external knowledge.

Table 1: Results for the relation phrase detector on FVQA.

Case	Classification Acc.			Recall		QA Acc.
Case	Sub	Obj	Union	w/ Rel	w/o Rel	w/ Rel	w/o Rel
Top-20	64.31	39.43	69.80	78.56	82.08	60.36	62.76
Top-30	69.46	43.30	74.24	83.61	86.90	66.34	65.28
Top-40	72.65	45.71	76.22	85.76	88.85	70.92	68.85
Top-60	73.59	46.23	77.10	86.58	89.81	68.52	67.30
Top-100	75.60	47.61	79.37	88.21	90.33	68.02	65.57

4.3 Experimental Results

4.3.1 Relation Phrase Detector

As shown in Table 1, to evaluate the performance of the relation phrase detector, five different cases are considered, including Top-20, Top-30, Top-40, Top-60 and Top-100. For each case, the classification accuracies of subject (i.e., ‘Sub’), object (i.e, ‘Obj’) and both union (i.e., ‘Union’) are calculated. In addition, the answer recall is reported with the top-3 relation limitation (i.e., ‘w/ Relation’) or not (i.e., ‘w/o Relation’). Finally, the downstream question answering accuracy (‘QA Acc.’) is also calculated.

For more clarity, the ‘Sub’ in the case of Top-40 represents the fraction of test samples that the ground-truth subject is included in the top 40 predicted subjects, and these subjects are further adopted as the clues for KB retrieval. The Recall w/ Relation in the case of Top-40 represents the fraction of test samples that the correct answer is included in the candidate answer set corresponding to the supporting facts retrieved by the top-40 subject or object clues and filtered by the top 3 predicted relations. Note that the top-3 classification accuracy for relation prediction using the relation phrase detector is $93.20\%$ .

Results in Table 1 show that both union accuracy and answer recall of the Top-40 case are higher than those of the Top-20 case, and improvements of $10.56\%$ and $6.09\%$ on downstream QA accuracies (with and without relation filtering) are caused. Even though much more relevant supporting facts are retrieved from the external KB in the Top-100 case, which delivers higher both union accuracy and answer recall, the downstream QA accuracies are lower than those of the Top-40 case. This observation clearly shows that excessive retrieved facts can lead to more redundant information, which is harmful to the reasoning process. We choose the top 40 subjects and objects as clues for KB retrieval as this gives the best downstream QA accuracies.

Table 2: Accuracy comparison on FVQA.

Method	Overall Accuracy
Method	top-1	top-3
LSTM-Question+Image+Pre-VQA [34]	24.98	40.40
Hie-Question+Image+Pre-VQA [34]	43.14	59.44
FVQA (top-3-QQmaping) [34]	56.91	64.65
FVQA (Ensemble) [34]	58.76	-
Straight to the Facts (STTF) [27]	62.20	75.60
Reading Comprehension [17]	62.94	70.08
Out of the Box (OB) [26]	69.35	80.25
Mucko (w/o Semantic Graph) [42]	71.28	82.76
MCR-MemNN (Ours)	70.92	81.83

Table 3: Accuracy comparison on Visual7W+ConceptNet.

Method	Overall Accuracy
Method	top-1	top-3
KDMN-NoKnowledge [16]	45.1	-
KDMN-NoMemory [16]	51.9	-
KDMN [16]	57.9	-
KDMN-Ensemble [16]	60.9	-
Out of the Box (OB) [26]	57.32	71.61
MCR-MemNN (Ours)	64.23	79.18

4.3.2 MCR-MemNN

Tables 2 and 3 show the comparison of the proposed MCR-MemNN with the above-mentioned baselines on FVQA and Visual7W+ConceptNet, respectively, and the following observations can be made:

1) On FVQA dataset, MCR-MemNN outperforms almost all the baselines in terms of top-1 and top-3 accuracies. To be specific, MCR-MemNN outperforms the semantic parsing based approaches including FVQA (top-3-QQmaping) and FVQA (Ensemble), and achieves more than $12\%$ boost on top-1 accuracy and more than $15\%$ boost on top-3 accuracy. In addition, compared with the typical learning based approaches including STTF and OB, which wholly introduce the visual and semantic information without selection, MCR-MemNN gains an improvement by leveraging a two-way attention mechanism (i.e., Sections 3.3.3 and 3.3.4) to exploit the inter-relationships among image, question and KB and distill the most relevant information in each of the three modalities.

2) Even though a heterogeneous graph neural network with high complexity is employed in Mucko, the proposed MCR-MemNN can still deliver a comparable performance. Note that the full model of Mucko leveraged dense captions [13] as input for performance improvement. For fair comparison, the semantic graph for dense caption parsing is removed.

3) On Visual7W+ConceptNet dataset, MCR-MemNN consistently outperforms a series of models based on KDMN, which leverages a dynamic memory network to preserve the retrieved external knowledge. Since both MCR-MemNN and KDMN adopted the memory augmentation technique, this observation further evidences the effectiveness of modeling inter-relationships among image, question and KB for KB-VQA task. Note that Mucko does not provide the result without Semantic Graph.

Table 4: Ablation studies on FVQA. (Sub: Subject as Clue; Obj: Obj as Clue; Rel: Relation Filtering; Att: Two-way Attention)

Case	Sub	Obj	Rel	Att	Overall Accuracy
Case	Sub	Obj	Rel	Att	top-1	top-3
1	✓				61.60	68.18
2		✓			44.07	52.72
3	✓	✓			65.04	73.58
4	✓	✓	✓		67.52	76.48
5	✓	✓		✓	68.85	78.37
6	✓	✓	✓	✓	70.92	81.83

4.3.3 Ablation Studies

To validate the superiority of the proposed MCR-MemNN on KB-VQA, several ablation experiments were conducted based on FVQA dataset, and we can make the following observations based on Table 4:

1) MCR-MemNN adopts both subject and object as clues for KB retrieval, which leads to better performance. Specifically, compared with Case-1, where only subject is adopted as clues for KB retrieval, there exists a jump on top-1 (i.e., $3.44\%$ ) and top-3 (i.e., $5.40\%$ ) accuracies when both of the subject and object are leveraged as clues in Case-3. A similar gain can be observed when Case-3 is compared with Case-2.

2) To validate the effectiveness of the relation filtering, the experiment is conducted in Case-4, which achieves $2.48\%$ improvement over Case-3 on top-1 accuracy. This observation clearly implies that the predicted relation can successfully remove the redundant supporting facts retrieved from the external KB. Similarly, compared with Case-5 without relation filtering, Case-6 brings up an improvement of $2.07\%$ .

3) The two-way attention mechanism (i.e., Sections 3.3.3 and 3.3.4) can deliver an additional performance gain on KB-VQA task. For instance, compared with Case-4, where the inter-relationships are not exploited among image, question and KB, Case-6 brings up improvements of $3.40\%$ on top-1 accuracy. This indicates that the redundant information of the three modalities (i.e., image, question and KB) is removed during the reasoning process, and the most relevant information is collected by modeling inter-relationships among the three modalities.

4.3.4 Qualitative Results

Figure 4 shows several success and failure cases using MCR-MemNN. Two steps are indispensable to predict the correct answer: (1) Either the correct subject or object is included in the top-40 predicted subject set or object set. (2) The correct relation is predicted as the top-3 relations. The ground-truth fact will be retrieved as one of the candidate supporting facts, if both the two steps are successfully accomplished. For instance, all five samples in the first row have their corresponding ground-truth facts successfully predicted and the correct answers are delivered. Specifically, the first two samples have both of their subjects and objects correctly predicted. The 3rd sample have its subject correctly predicted while the last two have their objects correctly predicted. This clearly verifies the advantage of multiple clues reasoning, retrieval using two complementary clues (i.e., subject and object) makes it a lot easier to fetch the ground-truth fact and deliver the correct answer. Some other cases are presented in the second row. Generally, if a wrong fact is predicted (e.g., the 2nd, 3rd and 5th samples), the correct answer cannot be given. However, in some special cases, even if the ground-truth fact is not successfully predicted (e.g., the 1st sample), the correct answer can still be delivered. If the correct relation is not included in the top-3 (e.g., the 3rd sample), the correct answer cannot be given even if the subject or object is correctly predicted. The MCR-MemNN always fails if the question is intend to know the relationship between subject and object (e.g., 4th sample).

5 Conclusions

This paper, by introducing Multiple Clues for Reasoning with Memory Neural Network (MCR-MemNN), presents a novel framework for knowledge-based visual question answering (KB-VQA). Comprehensive experiments have been conducted on two widely-used benchmarks and the extensive experimental results have shown that 1) Retrieval using two complementary clues (i.e., subject and object) makes it a lot easier to fetch the ground-truth fact and deliver the correct answer; 2) Two-way attention mechanism with memory augmentation can successfully model the inter-relationships among the three modalities including image, question and KB, and brings up a remarkable performance gain on KB-VQA; 3) MCR-MemNN outperforms most of the KB-VQA methods and achieves a comparable performance with the state-of-the-art.

References

[1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE CVPR, pages 6077–6086, 2018.
[2] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer, 2007.
[3] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE ICCV, pages 4291–4301, 2019.
[4] Kan Chen, Jiang Wang, Liang-Chieh Chen, Haoyuan Gao, Wei Xu, and Ram Nevatia. Abc-cnn: An attention based convolutional neural network for visual question answering. arXiv preprint arXiv:1511.05960, 2015.
[5] Long Chen, Xin Yan, Jun Xiao, Hanwang Zhang, Shiliang Pu, and Yueting Zhuang. Counterfactual samples synthesizing for robust visual question answering. In Proceedings of the IEEE/CVF CVPR, pages 10800–10809, 2020.
[6] Yu Chen, Lingfei Wu, and Mohammed J Zaki. Bidirectional attentive memory networks for question answering over knowledge bases. arXiv preprint arXiv:1903.02188, 2019.
[7] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[8] Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, and Xilin Chen. Multi-modal graph neural network for joint reasoning on vision and scene text. In Proceedings of the IEEE/CVF CVPR, pages 12746–12756, 2020.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE CVPR, pages 770–778, 2016.
[10] Ronghang Hu, Anna Rohrbach, Trevor Darrell, and Kate Saenko. Language-conditioned graph networks for relational reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 10294–10303, 2019.
[11] Pingping Huang, Jianhui Huang, Yuqing Guo, Min Qiao, and Yong Zhu. Multi-grained attention with object-level grounding for visual question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3595–3600, 2019.
[12] Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, and Xinlei Chen. In defense of grid features for visual question answering. In Proceedings of the IEEE/CVF CVPR, pages 10267–10276, 2020.
[13] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE CVPR, pages 4565–4574, 2016.
[14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[15] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017.
[16] Guohao Li, Hang Su, and Wenwu Zhu. Incorporating external knowledge to answer open-domain visual questions with dynamic memory networks. arXiv preprint arXiv:1712.00733, 2017.
[17] Hui Li, Peng Wang, Chunhua Shen, and Anton van den Hengel. Visual question answering as reading comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6319–6328, 2019.
[18] Hugo Liu and Push Singh. Conceptnet—a practical commonsense reasoning tool-kit. BT technology journal, 22(4):211–226, 2004.
[19] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In NeurIPS, pages 289–297, 2016.
[20] Pan Lu, Lei Ji, Wei Zhang, Nan Duan, Ming Zhou, and Jianyong Wang. R-vqa: learning visual relation facts with semantic attention for visual question answering. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1880–1889, 2018.
[21] Chao Ma, Chunhua Shen, Anthony Dick, Qi Wu, Peng Wang, Anton van den Hengel, and Ian Reid. Visual question answering with memory-augmented networks. In Proceedings of IEEE CVPR, pages 6975–6984, 2018.
[22] Lin Ma, Zhengdong Lu, and Hang Li. Learning to answer questions from image using convolutional neural network. In Proceedings of AAAI, volume 3, page 16, 2016.
[23] Mateusz Malinowski and Mario Fritz. A multi-world approach to question answering about real-world scenes based on uncertain input. In NeurIPS, pages 1682–1690, 2014.
[24] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of IEEE ICCV, pages 1–9, 2015.
[25] Alexander Miller, Adam Fisch, Jesse Dodge, Amir-Hossein Karimi, Antoine Bordes, and Jason Weston. Key-value memory networks for directly reading documents. arXiv preprint arXiv:1606.03126, 2016.
[26] Medhini Narasimhan, Svetlana Lazebnik, and Alexander Schwing. Out of the box: Reasoning with graph convolution nets for factual visual question answering. In NeurIPS, pages 2654–2665, 2018.
[27] Medhini Narasimhan and Alexander G Schwing. Straight to the facts: Learning knowledge base retrieval for factual visual question answering. In Proceedings of ECCV, pages 451–468, 2018.
[28] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[29] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543, 2014.
[30] Mengye Ren, Ryan Kiros, and Richard Zemel. Image question answering: A visual semantic embedding model and a new dataset. NeurIPS, 1(2):5, 2015.
[31] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, pages 91–99, 2015.
[32] Niket Tandon, Gerard De Melo, and Gerhard Weikum. Acquiring comparative commonsense knowledge from the web. In AAAI, pages 166–172, 2014.
[33] Damien Teney, Peter Anderson, Xiaodong He, and Anton Van Den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In Proceedings of the IEEE CVPR, pages 4223–4232, 2018.
[34] Peng Wang, Qi Wu, Chunhua Shen, Anthony Dick, and Anton Van Den Hengel. Fvqa: Fact-based visual question answering. IEEE TPAMI, 40(10):2413–2427, 2017.
[35] Peng Wang, Qi Wu, Chunhua Shen, Anton van den Hengel, and Anthony Dick. Explicit knowledge-based reasoning for visual question answering. page 1290–1296, 2017.
[36] Xinyu Wang, Yuliang Liu, Chunhua Shen, Chun Chet Ng, Canjie Luo, Lianwen Jin, Chee Seng Chan, Anton van den Hengel, and Liangwei Wang. On the general value of evidence, and bilingual scene-text visual question answering. In Proceedings of the IEEE/CVF CVPR, pages 10126–10135, 2020.
[37] Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. In Proceedings of ICML, pages 2397–2406, 2016.
[38] Chengxiang Yin, Jian Tang, Zhiyuan Xu, and Yanzhi Wang. Memory augmented deep recurrent neural network for video question answering. IEEE transactions on neural networks and learning systems, 31(9):3159–3167, 2020.
[39] Chengxiang Yin, Kun Wu, Zhengping Che, Bo Jiang, Zhiyuan Xu, and Jian Tang. Hierarchical graph attention network for few-shot visual-semantic learning. In Proceedings of the IEEE ICCV, pages 2177–2186, 2021.
[40] Amir Zadeh, Michael Chan, Paul Pu Liang, Edmund Tong, and Louis-Philippe Morency. Social-iq: A question answering benchmark for artificial social intelligence. In Proceedings of IEEE CVPR, pages 8807–8817, 2019.
[41] Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. Visual7w: Grounded question answering in images. In Proceedings of the IEEE CVPR, pages 4995–5004, 2016.
[42] Zihao Zhu, Jing Yu, Yujing Wang, Yajing Sun, Yue Hu, and Qi Wu. Mucko: Multi-layer cross-modal knowledge reasoning for fact-based visualquestion answering. arXiv preprint arXiv:2006.09073, 2020.

Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering