Attention Guided Semantic Relationship Parsing for Visual Question Answering
Abstract
Humans explain inter-object relationships with semantic labels that demonstrate a high-level understanding required to perform complex Vision-Language tasks such as Visual Question Answering (VQA). However, existing VQA models represent relationships as a combination of object-level visual features which constrain a model to express interactions between objects in a single domain, while the model is trying to solve a multi-modal task. In this paper, we propose a general purpose semantic relationship parser which generates a semantic feature vector for each subject-predicate-object triplet in an image, and a Mutual and Self Attention (MSA) mechanism that learns to identify relationship triplets that are important to answer the given question. To motivate the significance of semantic relationships, we show an oracle setting with ground-truth relationship triplets, where our model achieves a 25% accuracy gain over the closest state-of-the-art model on the challenging GQA dataset. Further, with our semantic parser, we show that our model outperforms other comparable approaches on VQA and GQA datasets.
1 Introduction
Humans can perform high-level reasoning over an image by seamlessly identifying the objects of interest and the associated relationships between them. Although objects are central to scene interpretation, they cannot be independently used to develop a holistic understanding of the visual content without considering their mutual relationships. The multi-modal reasoning task of Visual Question Answering (VQA) requires learning precisely encoded relationships between objects. Given the complexity of the task, we advocate for relationship modeling in the semantic space so that a given question can be directly related to the objects and relationships present in an image. Our choice is motivated by two observations. First, visual representations for different instances of the same semantic relationship can be very different, making it challenging for the VQA model to relate them with the asked question. Second, different semantic relationship interpretations can exist for a single visual representation, thereby requiring an enriched mechanism to encode a diverse set of semantic relationships. If a relationship parser can automatically derive representations in semantic space and attend to relevant relations, the above challenges can be simplified for VQA.
Based on the hypothesis that better scene understanding requires a model to generate more discriminative visual and semantic feature representations, recent VQA models employ state-of-the-art visual [15, 16, 32] and semantic [31, 28, 8] feature extractors. Specifically, VQA models use information at grid-level [3, 4, 41], object-level [2, 5, 42, 13, 10] or a combination of both [29, 9, 11] to extract visual features from an image without considering the relationships between them. Some recent models address this problem by identifying the most relevant object pairs through an attention distribution learned over them with respect to the question [6, 23, 17]. Such relationship-aware models achieve better performance than those that do not consider any kind of relationship. However, as seen in the example shown in Fig. 1, the visual feature representation of teddy bear and pillow remains the same even though the relationship between them can be different (e.g., on the right, near to). For higher-level reasoning, a visual-semantic model needs to identify these subtle differences, which can only be achieved if the model considers semantic relationship features.

While attempting to incorporate semantic relationship features, a VQA model faces three major challenges. First, the lack of a visual relationship detector that not only detects arbitrary relationships between object pairs in an image, but also generates semantic features for the detected subject-predicate-object triplets for a downstream task. Second, the effectiveness of semantic relationship features over visual relationship features in a complex Vision-Language (VL) task such as VQA has not been investigated before. Third, an effective attention mechanism is required to combine the rich features encoding visual, semantic and relationship information to predict the correct answer. In this paper, we contribute towards bridging this gap by addressing these three challenges. The main contributions of this paper are:
• We propose a general purpose semantic relationship parser that can be used for complex multi-modal Vision-Language (VL) downstream tasks such as Visual Question Answering.
• We showcase the effectiveness of using semantic relationship features by reporting superior performance over models employing similar visual relationship features. Further, in an oracle setting where ground-truth relationship labels are available, we obtain a 25% accuracy gain compared to a SOTA model that only uses visual features.
• We further propose a Mutual and Self Attention (MSA) mechanism that utilizes both mono-modal self-attention and multi-modal mutual-attention using visual features (from the image) and semantic features (from both question and relationships), and report superior accuracy on the VQAv2 and GQA datasets.
2 Related Works
Vision-Language (VL) pre-training: Pre-training a model on a different VQA dataset or even a different VL task, to learn a generic visual-linguistic representation through augmented training, has been shown to improve the accuracy of VL downstream tasks [34, 27, 7, 46, 35, 24]. VL-BERT [34] and ViLBERT [26] used transformer models to jointly embed visual and linguistic features and applied the pretrained models to VL downstream tasks. UNITER [7] and Zhou et al. [46] applied a unified approach to learn the pre-trained representation by jointly training a transformer-like architecture. LXMERT [35] and OSCAR [24] trained a model to learn a cross-modality representation between more expressive visual and linguistic features on different VL tasks, and fine-tuned the learned representation for downstream VL tasks. Our approach is orthogonal to such VL pre-training: we model visual relationships in the semantic domain from subject-predicate-object triplets, and our approach can easily be included in a VL pre-training schedule.
Visual relationships in VQA: The two major obstacles to utilizing visual relationships in a VQA model are the lack of ground-truth relationship labels and of a way to represent the relationship features. Several recent VQA models [40, 23, 17, 43] resorted to a graph neural network approach where object pairs represent the nodes and relationship features are represented by some combination of the object features. This approach has two practical limitations: first, it relies heavily on the graph representation and the model's ability to reason over it; second, no real-world VQA dataset provides ground-truth graph representations of images to train and test the models. A few models [37, 33] tried to capture relationships from rendered synthetic VQA datasets (Abstract Scene VQAv1 [3], CLEVR [19]), which do not generalize well to real scenes. Even though the Visual Genome [22] dataset has scene graph annotations, the lack of scene graph representations in benchmark VQA datasets (e.g., VQAv1 [3], VQAv2 [14], VQA-CP [1]) limits a model's ability to generate graph representations. We adopt a tangential approach, where we treat the visual relationship feature not as a combination of visual features or a graph, but rather as a semantic mono-modal feature representation built from its subject, predicate and object labels.
Attention models in VQA: A large portion of the VQA literature focuses on learning a multi-modal representation of image and question features to generate an attention distribution over the input visual feature representation [12, 39, 4, 5, 9, 41, 20]. These approaches have been very successful in learning multi-modal interactions; however, they do not learn mono-modal attention distributions over the inputs themselves, e.g., identifying correlations between different image regions or relationships between different words of the question. Inspired by the success of the self-attention mechanism [38] in capturing long-range dependencies, Yu et al. [42] proposed to use self-attention to capture mono-modal interactions in a VQA setting. However, achieving high-level visual understanding requires learning both mono-modal and multi-modal interactions, which is what we propose in this work.
3 Methods
Given an image $I$ and a natural language question $Q$, the task of a VQA model is to predict the answer $\hat{a}$. Let $V$ and $R$ be the collections of all visual features and semantic relationship features extracted from the image $I$, and let $q$ be the semantic feature representation of the question $Q$. The VQA problem is typically formulated as a multi-class classification problem:

$\hat{a} = \arg\max_{a \in \mathcal{A}} P(a \mid V, R, q; \theta)$   (1)

where $\theta$ denotes the parameters of the model and $\mathcal{A}$ is a dictionary of candidate answers.
3.1 Question and Image Feature Extraction
The traditional approach [12, 4, 5, 36, 42] for extracting question features for the VQA task is to source pretrained semantic embedding vectors for each question word, concatenate them and pass them through a recurrent neural network; the hidden state of the last recurrent block is taken as the question feature. In contrast, we consider the question as a whole instead of separate word entities, thereby providing better contextual modelling. We use Bidirectional Encoder Representations from Transformers (BERT) [8], where we first tokenize each word of the question and then feed the tokenized question into a Transformer model pretrained on a language modeling task. The question feature $q \in \mathbb{R}^{T \times d_q}$ is extracted from the last hidden layer of the BERT model, where $T$ is the number of tokens identified in the question and $d_q$ denotes the feature dimension.
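As an illustration, below is a minimal sketch of this question encoding step, assuming the HuggingFace transformers library and the bert-large-cased checkpoint; variable names are ours, not the authors' code.

```python
# Minimal sketch of question feature extraction (Sec. 3.1), assuming the
# HuggingFace `transformers` library; variable names are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
bert = AutoModel.from_pretrained("bert-large-cased").eval()

question = "Are they wearing goggles?"
tokens = tokenizer(question, return_tensors="pt")
with torch.no_grad():
    out = bert(**tokens)
# last hidden layer as the question feature: (T tokens, 1024-dim for bert-large)
q = out.last_hidden_state.squeeze(0)
```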
We represent the visual features of an input image as a set of bounding box coordinates and corresponding object-specific features. First, object proposals are generated using a bottom-up [2] attention approach, where a pretrained Faster-RCNN [32] model is employed to get the region proposals and extract visual features using a ResNet [15] backbone. Following [2], we use an adaptive threshold to select a range of region proposals for each image. Further, to visually ground each region proposal, we concatenate each region proposal feature with its bounding box coordinates. Thus, the visual feature representation $V$ of image $I$ consists of the features $\{v_i \in \mathbb{R}^{d_v}\}_{i=1}^{N}$ of its $N$ object proposals and the corresponding bounding box coordinates $\{b_i\}_{i=1}^{N}$, where $d_v$ is the object feature dimension.
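A small sketch of how a region feature can be visually grounded with its box coordinates follows; normalizing the coordinates by the image size is our assumption, since the exact coordinate encoding is not specified in the paper.

```python
# Hedged sketch: appending (normalized) bounding-box coordinates to each
# Faster-RCNN object feature to visually ground the region proposals (Sec. 3.1).
import torch

def ground_regions(obj_feats: torch.Tensor, boxes: torch.Tensor,
                   img_w: float, img_h: float) -> torch.Tensor:
    """obj_feats: (N, d_v) region features; boxes: (N, 4) as (x1, y1, x2, y2)."""
    scale = torch.tensor([img_w, img_h, img_w, img_h], dtype=boxes.dtype)
    return torch.cat([obj_feats, boxes / scale], dim=-1)  # (N, d_v + 4)
```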
3.2 Semantic Relationship Parsing
The Semantic Relationship Parser (SRP) module is illustrated in Fig. 1. It has three major components. The first component is a region proposal network that operates in a similar manner as explained in Sec. 3.1 for visual feature generation. Based on these object-wise features and box coordinates, a visual relationship detector is used in the second stage. The visual relationship detector generates subject, relationship and object proposals (here, the object refers to the grammatical component of a sentence) from the region features, which in turn are used to generate a semantic relationship triplet set $\mathcal{T}$. Each relationship triplet consists of the class labels predicted for subject, relationship and object. In order to generate a triplet, a set of candidate subject, relationship and object visual features, denoted by $v_s$, $v_r$ and $v_o$ respectively, are passed through the visual relationship detector. We follow the framework proposed by [44], where we assume a relationship exists only if a subject-object pair exists, not vice versa. Thus, the relationship detector learns two mapping functions from the visual feature space to the semantic space, one for the subject/object embedding and the other for the relationship embedding.
The relationship feature embedding is generated by passing the concatenation of the three visual features $v_s$, $v_r$ and $v_o$ through a two-layer Multi-layer Perceptron (MLP) network. The subject and object feature embeddings are generated in parallel by passing $v_s$ and $v_o$ through a shared MLP network. On the other hand, the class labels of subject, object and relationship are first converted to word vectors and then to semantic feature embeddings by passing them through a small MLP network. Three triplet losses are minimized [43] to match the visual and semantic embeddings for subject, object and relationship respectively. During inference, word vectors of all subject/object and relationship class labels are passed through the network and a nearest neighbour search is performed to find the desired relationship labels. In practice, we perform visual relationship detection on the input image as a pre-processing step: we train the visual relationship detector end-to-end on a visual relationship dataset (e.g., Visual Genome [22], VRD [25]) and then run inference on the input images to generate relationship predictions, each consisting of subject, object and relationship probabilities.
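The sketch below illustrates this embedding-matching idea; the layer sizes, the 300-dimensional word vectors, and the use of PyTorch's TripletMarginLoss are our assumptions rather than the authors' exact implementation.

```python
# Sketch of the SRP embedding matching (Sec. 3.2): visual features of subject,
# relationship and object are projected into a semantic space and matched to
# word-vector embeddings of their class labels via triplet losses.
import torch
import torch.nn as nn

class RelEmbedder(nn.Module):
    def __init__(self, d_vis=2048, d_word=300, d_sem=512):
        super().__init__()
        # subject/object share one projection; the relationship branch takes the
        # concatenated (subject, relationship, object) visual features
        self.obj_mlp = nn.Sequential(nn.Linear(d_vis, d_sem), nn.ReLU(),
                                     nn.Linear(d_sem, d_sem))
        self.rel_mlp = nn.Sequential(nn.Linear(3 * d_vis, d_sem), nn.ReLU(),
                                     nn.Linear(d_sem, d_sem))
        # small MLP mapping word vectors of class labels to the semantic space
        self.word_mlp = nn.Sequential(nn.Linear(d_word, d_sem), nn.ReLU(),
                                      nn.Linear(d_sem, d_sem))
        self.triplet = nn.TripletMarginLoss(margin=0.2)

    def forward(self, v_s, v_r, v_o, w_s, w_r, w_o, w_ns, w_nr, w_no):
        # visual -> semantic embeddings
        e_s, e_o = self.obj_mlp(v_s), self.obj_mlp(v_o)
        e_r = self.rel_mlp(torch.cat([v_s, v_r, v_o], dim=-1))
        # word vectors of ground-truth (positive) and other (negative) labels
        pos = [self.word_mlp(w) for w in (w_s, w_r, w_o)]
        neg = [self.word_mlp(w) for w in (w_ns, w_nr, w_no)]
        # one triplet loss per component (subject, relationship, object)
        return (self.triplet(e_s, pos[0], neg[0]) +
                self.triplet(e_r, pos[1], neg[1]) +
                self.triplet(e_o, pos[2], neg[2]))
```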

The third component of the SRP module is the semantic feature extractor, which takes the filtered semantic relationship triplets $\mathcal{T}^{*}$ together with the subject and object bounding box coordinates, and generates visually grounded semantic relationship features as follows:

$R = \{\, r_i = f_{\text{BERT}}(t_i, b_s^{i}, b_o^{i}) \mid t_i \in \mathcal{T}^{*} \,\}$   (2)

Here, $f_{\text{BERT}}$ denotes the BERT model (similar to the one described in Sec. 3.1) for extracting semantic features from the triplets. First, this function takes all entries from the set $\mathcal{T}^{*}$ and adds a period (‘.’) after each element. Each relationship triplet is then treated as a complete sentence, which is passed through the BERT model separately alongside its subject and object proposal coordinates. This generates the corresponding semantic relationship features $r_i \in \mathbb{R}^{d_r}$ from image $I$, where $d_r$ is the dimension of the hidden feature of the BERT model.
Notably, the set $\mathcal{T}^{*}$ only contains refined relationship triplets obtained after a two-stage thresholding and filtering process. In the first stage, we retain only the relationship predictions for which the product of the subject and object proposal probabilities is higher than a threshold $\gamma_{so}$. This ensures that we only select high-confidence relationships between subject-object pairs. In the next stage, from the remaining relationship predictions, we select only those whose relation probability is higher than $\gamma_{rel}$. Intuitively, we set $\gamma_{so}$ at a higher value than $\gamma_{rel}$ to ensure that we first get the subject and object instances right, and then ask the model to predict various relationships between them. The values of $\gamma_{so}$ and $\gamma_{rel}$ are set empirically with the objective of selecting at least three relationship triplets per image. We further filter out any duplicate relationships and end up with $n_r$ relationship predictions per image, encoded in the refined semantic relationship set $\mathcal{T}^{*}$.
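A minimal sketch of this two-stage filtering and de-duplication is given below; the threshold names, their default values, and the prediction format are illustrative assumptions.

```python
# Hedged sketch of the two-stage triplet filtering and de-duplication (Sec. 3.2).
def filter_triplets(preds, gamma_so=0.7, gamma_rel=0.3):
    """preds: list of dicts holding predicted labels and their probabilities."""
    kept, seen = [], set()
    for p in preds:
        # stage 1: keep only high-confidence subject-object pairs
        if p["p_subject"] * p["p_object"] < gamma_so:
            continue
        # stage 2: among those, keep only confident relation predictions
        if p["p_relation"] < gamma_rel:
            continue
        triplet = (p["subject"], p["relation"], p["object"])
        if triplet in seen:          # drop duplicate relationship triplets
            continue
        seen.add(triplet)
        kept.append(p)
    return kept                      # the refined set used to build R
```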
3.3 Mutual and Self Attention
The Mutual and Self Attention (MSA) module consists of two major components. The first component performs mutual attention, where two separate multimodal fusion operations learn attention distributions over the input feature vectors. The second component applies self attention on the pair of input features and generates attention distributions over the input features themselves. We illustrate the MSA module in Fig. 2. For simplicity, let us assume the inputs to the MSA module are two feature embeddings $X$ and $Y$, which will undergo mutual and self attention.
Mutual Attention: To capture the complex interaction between $X$ and $Y$, we jointly embed these features by learning a multimodal embedding function. This is achieved by first concatenating the input features and then passing them through a 3-layer MLP network. The MLP learns to capture the mutual interactions between the input feature vectors and produces a joint feature embedding. For the input combinations $(X, Y)$ and $(Y, X)$, we have:

$\alpha_X = \text{softmax}\big(\text{MLP}([X \,\|\, Y])\big)$   (3)
$\alpha_Y = \text{softmax}\big(\text{MLP}([Y \,\|\, X])\big)$   (4)

where $[\cdot \,\|\, \cdot]$ denotes the concatenation operation of the two vectors, and $\alpha_X$ and $\alpha_Y$ signify the learned mutual attention distributions over the inputs $X$ and $Y$ respectively. These attention distributions are used to take a weighted sum over the corresponding input feature vectors and generate the mutually attended feature representations $\hat{x}_m$ and $\hat{y}_m$:

$\hat{x}_m = \sum_i \alpha_X^{i}\, x_i, \qquad \hat{y}_m = \sum_i \alpha_Y^{i}\, y_i$   (5)

where $x_i$ and $y_i$ denote the $i$-th row of the feature embeddings $X$ and $Y$ respectively.
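The sketch below illustrates the mutual-attention branch of Eqs. 3-5; since the paper does not spell out how the two feature sets are paired before concatenation, we pool the partner modality into a single context vector here, and the layer sizes are illustrative.

```python
# Hedged sketch of the mutual attention component (Eqs. 3-5).
import torch
import torch.nn as nn

class MutualAttention(nn.Module):
    def __init__(self, d_x, d_y, d_hid=512):
        super().__init__()
        # 3-layer MLPs producing one attention logit per row of X and Y
        self.mlp_x = nn.Sequential(nn.Linear(d_x + d_y, d_hid), nn.ReLU(),
                                   nn.Linear(d_hid, d_hid), nn.ReLU(),
                                   nn.Linear(d_hid, 1))
        self.mlp_y = nn.Sequential(nn.Linear(d_y + d_x, d_hid), nn.ReLU(),
                                   nn.Linear(d_hid, d_hid), nn.ReLU(),
                                   nn.Linear(d_hid, 1))

    def forward(self, X, Y):
        # X: (n_x, d_x), Y: (n_y, d_y); pool the partner modality as context
        y_ctx = Y.mean(dim=0, keepdim=True).expand(X.size(0), -1)
        x_ctx = X.mean(dim=0, keepdim=True).expand(Y.size(0), -1)
        alpha_x = torch.softmax(self.mlp_x(torch.cat([X, y_ctx], -1)), dim=0)
        alpha_y = torch.softmax(self.mlp_y(torch.cat([Y, x_ctx], -1)), dim=0)
        x_hat = (alpha_x * X).sum(dim=0)   # mutually attended X (Eq. 5)
        y_hat = (alpha_y * Y).sum(dim=0)   # mutually attended Y
        return x_hat, y_hat
```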
Self Attention: For the self attention component, we follow the guided self attention module used in [41]. The input feature $X$ is fed to a Transformer employing multi-head attention [38] to learn an attention distribution $A_X$. The other input $Y$ undergoes a similar multi-head attention as $X$, except that the query input is replaced with $X$, which allows the model to learn an $X$-guided self attention distribution over $Y$, denoted as $A_Y$. $A_X$ and $A_Y$ are passed through separate fully-connected layers for dimensionality reduction to obtain $\beta_X$ and $\beta_Y$. These self attention maps are used to take a weighted sum over the corresponding feature representations and generate the self-attended features $\hat{x}_s$ and $\hat{y}_s$:

$\hat{x}_s = \sum_i \beta_X^{i}\, x_i, \qquad \hat{y}_s = \sum_i \beta_Y^{i}\, y_i$   (6)
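A simplified sketch of this branch follows, using PyTorch's MultiheadAttention; the exact layer wiring (residuals, feed-forward blocks) of [38, 41] is omitted here, and taking the weighted sum over the attended outputs is our interpretation of Eq. 6.

```python
# Hedged sketch of self attention on X and X-guided attention on Y (Sec. 3.3).
import torch
import torch.nn as nn

class GuidedSelfAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.guided_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fc_x = nn.Linear(d_model, 1)   # dimensionality reduction -> beta_X
        self.fc_y = nn.Linear(d_model, 1)   # dimensionality reduction -> beta_Y

    def forward(self, X, Y):
        # X: (B, n_x, d); Y: (B, n_y, d)
        a_x, _ = self.self_attn(X, X, X)        # self attention over X
        a_y, _ = self.guided_attn(X, Y, Y)      # query replaced with X (guided)
        beta_x = torch.softmax(self.fc_x(a_x), dim=1)
        beta_y = torch.softmax(self.fc_y(a_y), dim=1)
        x_hat = (beta_x * a_x).sum(dim=1)        # self-attended X (Eq. 6)
        y_hat = (beta_y * a_y).sum(dim=1)        # X-guided attended Y
        return x_hat, y_hat
```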
In practice, we employ two MSA modules: we feed the visual features $V$ and question features $q$ to the first one, and the relationship features $R$ and question features $q$ to the other. The intuition is that the first MSA module learns to identify which regions of the image and which words of the question are important to answer the question, while the second MSA module tries to identify the salient relationship features and question parts for answering the question. For both MSA blocks, we pass the question features as $Y$, which guides the attention learning process over the input $X$. This is particularly important as the question sets the objective of the task, and the quality of the learned attention distribution depends more on the question than on the other inputs. Thus, the first MSA module outputs attended visual and question features, and the second one outputs attended relationship and question features.
3.4 Attention Fusion
We perform multimodal attention fusion on the outputs of the MSA blocks. Each attended feature is projected to an intermediate space through a fully connected layer, and the projections are summed. As the attended features already capture a rich feature description, we use only this simple linear summation to capture their interaction before making the final answer prediction. The summed feature vector is then projected to the answer prediction space through another fully connected layer, where we minimize a cross-entropy loss to predict the correct answer from the candidate answer set.
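A minimal sketch of this fusion head follows; the joint dimension and the 3129-way answer vocabulary (a common choice for VQAv2) are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of the attention-fusion head (Sec. 3.4): project each attended
# feature to a shared space, sum, and classify over the candidate answers.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, in_dims, d_joint=1024, n_answers=3129):
        super().__init__()
        # one fully connected projection per attended feature from the MSA blocks
        self.proj = nn.ModuleList([nn.Linear(d, d_joint) for d in in_dims])
        self.classifier = nn.Linear(d_joint, n_answers)

    def forward(self, feats):
        # simple linear summation of the projected attended features
        joint = torch.stack([p(f) for p, f in zip(self.proj, feats)]).sum(dim=0)
        return self.classifier(joint)   # answer logits, trained with cross-entropy
```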
4 Experiments
We perform experiments on two large-scale VQA datasets, namely VQAv2 [14] and GQA [18]. We train the visual relationship detector in the SRP module on the VRD dataset [25] with a VGG16 backbone, and use this pretrained model to infer relationship triplets.
4.1 Dataset
VQAv2 [14] has 200K images and 1.1M crowd-sourced questions, making it the biggest manually annotated VQA dataset. GQA [18] contains 113K images and 22M auto-generated questions, making it a more challenging evaluation setting.
4.2 VQA Model Architecture
Each object is represented by a 2048-dimensional Faster-RCNN feature concatenated with its bounding box coordinates (Sec. 3.1). To extract the semantic features from the question and the relationship triplets, we use a pretrained bert-large-cased model (https://huggingface.co/bert-large-cased). Since a cased version is used, we do not convert the question or relationship triplets to lowercase. The extracted semantic feature dimension for both the question and the relationship features is 1024. Following the recommendations in [38, 41], we set the intermediate dimension, number of heads and latent dimension of the multi-head attention in the Transformer module accordingly. The Adam optimizer [21] is used for training.
Table 1: Semantic vs. visual relationship features on the GQA validation set (top), and the oracle setting with ground-truth relationships (bottom).

| Methods | Acc. | Binary | Open | Validity | Plaus. | Dist. |
|---|---|---|---|---|---|---|
| MCAN‡ [42] | 65.00 | 82.08 | 48.98 | 94.91 | 91.42 | 4.21 |
| $R_{vis}$ only (blind) | 51.89 | 69.02 | 35.83 | 95.13 | 91.78 | 7.34 |
| $R_{sem}$ only (blind) | 50.37 | 63.66 | 37.91 | 95.03 | 91.83 | 13.06 |
| $V + R_{vis}$ | 58.62 | 73.25 | 44.91 | 94.95 | 91.05 | 12.63 |
| $V + R_{sem}$ | 65.93 | 82.35 | 49.27 | 94.98 | 91.57 | 4.88 |
| *Oracle setting on GQA validation set* | | | | | | |
| $R_{gt}$ only (blind) | 68.71 | 71.84 | 68.71 | 94.94 | 92.99 | 7.29 |
| $V + R_{gt}$ | 81.15 | 85.06 | 77.48 | 95.34 | 94.26 | 1.08 |
4.3 Semantic vs. Visual Relationship Feature
In Tab. 1, we first establish the benefit of our proposed semantic relationship feature modeling. For a fair comparison with our proposed MSA model, which uses bounding box coordinates of subjects and objects, we compare it with a version of MCAN [42] (the official MCAN-large (frcn+bbox) implementation at https://github.com/MILVLG/openvqa) that also uses bounding box coordinates. To this end, we compare the VQA performance of ‘semantic’ versus ‘visual’ relationship features on the GQA validation set. To develop the baseline model with visual relationship features, we train the SRP module (Sec. 3.2) on the VRD dataset [25] with 100 object and 70 predicate categories, and output the visual features of the subject and object relationship proposals along with the relationship triplet. The visual features of the subject and object proposals are concatenated and treated as the visual relationship feature $R_{vis}$, while the default semantic relationship features (denoted by $R_{sem}$) are extracted from the relationship triplet. The models in Tab. 1 employ only the guided self-attention part of the MSA module for simplicity.
Blind models trained with visual relationship features perform slightly better. In Tab. 1, we see that a VQA model trained with only visual relationship features (row 2) performs better than the model trained only with semantic relationship features (row 3). When the visual features are not available, the model is blind to the image and the answer prediction is based only on the relationship labels. The model with visual relationship features, on the other hand, can still see the image as a set of visual features of subject-object proposals, and thus performs better than the completely blind model (row 3). Nevertheless, even in this extreme setting, the blind model performs reasonably well relying only on the semantic relationship labels.
Non-blind models trained with semantic relationship features perform significantly better. When the visual features are available, the VQA model with the complementary semantic relationship feature performs significantly better (65.93 vs. 58.62) than its counterpart (rows 4 and 5 in Tab. 1). This demonstrates the complementary effectiveness of the semantic relationship features, since both settings are identical except for the nature of the relationship feature.
4.4 Oracle Setting
We simulate an oracle setting to further evaluate the effectiveness of using semantic relationships for VQA. We build this setting using the scene-graph annotations available for the GQA [18] train and validation sets. Each scene-graph entry consists of ground-truth subject, relationship and object labels. We use a scene-graph parser that converts each scene-graph entry into a list of semantic relationship triplets, similar to the output of the visual relationship detector in Sec. 3.2, and denote the extracted ground-truth semantic relationship features as $R_{gt}$.
Both blind and non-blind oracle models significantly outperform the SOTA. The blind VQA model with ground-truth relationship labels achieves an overall accuracy gain (68.71 vs. 65.00) over the state-of-the-art MCAN [42] model, which is a non-blind model (comparing rows 6 and 1 of Tab. 1). This is an interesting finding, showing that if good enough semantic relationship labels are available, a VQA model can achieve better performance than the SOTA without even looking at the image. Further, when the visual features of the image are available in the oracle setting, the model achieves an even larger accuracy gain (81.15 vs. 65.00) over [42].
‘Open-ended’ questions are answered better. Both oracle models report significant accuracy gains on the challenging ‘Open’ question category (68.71 and 77.48, compared to 48.98 for MCAN). These open-ended questions require diverse and broad reasoning ability to answer correctly. This is a significant finding, as it highlights complementary semantic relationship features as an important line of research for breaking the bottleneck of VQA models that mostly focus on learning better visual representations.
4.5 Ablation study
Table 2: Ablation of input and attention combinations on VQAv2 test-dev and GQA test-dev.

| Input | Attention | VQAv2 Acc. | Y/N | Number | Other | GQA Acc. | Binary | Open |
|---|---|---|---|---|---|---|---|---|
| $q + R_{sem}$ | Mutual | 44.00 | 66.48 | 31.49 | 27.21 | 35.90 | 54.72 | 19.93 |
| $q + R_{sem}$ | Self | 53.35 | 74.16 | 35.87 | 39.46 | 42.72 | 64.42 | 29.38 |
| $q + R_{sem}$ | MSA | 53.66 | 74.71 | 36.65 | 39.70 | 45.53 | 64.00 | 29.86 |
| $q + V$ | Mutual | 45.74 | 57.18 | 34.78 | 29.74 | 37.20 | 56.82 | 20.56 |
| $q + V$ | Self | 70.14 | 86.57 | 51.59 | 60.28 | 57.03 | 76.02 | 40.76 |
| $q + V$ | MSA | 70.38 | 86.78 | 52.05 | 60.59 | 57.45 | 77.08 | 40.79 |
| $q + V + R_{sem}$ | Mutual | 48.29 | 67.17 | 33.24 | 35.49 | 39.24 | 55.45 | 25.48 |
| $q + V + R_{sem}$ | Self | 70.46 | 87.14 | 51.26 | 60.57 | 57.72 | 76.12 | 40.48 |
| $q + V + R_{sem}$ | MSA | 70.76 | 87.10 | 53.21 | 60.77 | 58.37 | 77.70 | 40.44 |
We perform an extensive ablation on the VQAv2 test-dev and GQA test-dev sets and report the results in Tab. 2. Our goal is to identify which input and attention combinations contribute to the overall performance of our model. This is a comprehensive setup, as the VQAv2 and GQA datasets consist of natural crowd-sourced and auto-generated questions, respectively. We use semantic relationship features for all these experiments (i.e., $R = R_{sem}$).
Semantic relationship features provide an accuracy boost when used in complement with visual features. The blind model, which only uses parsed relationship features without any visual features (rows 1–3), performs worse than the other models that explicitly use visual features. However, when the parsed relationship features are used in complement with the image and question features (rows 7–9), they help the model achieve better performance on both the VQAv2 and GQA datasets.
Guided self attention provides a richer attention distribution over its inputs than mutual attention. For the three input combinations listed in Tab. 2, we ablate the MSA module by activating only the mutual or the self attention component. We can see that when only the guided self attention component is activated, a better VQA accuracy is achieved on both datasets. This is because the self attention module captures rich semantics over the input features through its multi-head attention architecture. The mutual attention component works best in a VQA setting when the attention distribution is learned on the visual feature, which undergoes a second multimodal fusion with the question feature [12, 9, 4, 5]. By design, we want the mutual attention module to capture the multimodal interaction between the inputs and feed it to the attention fusion module (Sec. 3.4) for combination with the other attention distributions. Thus, a standalone setup of our mutual attention module performs below the guided self attention module.
The MSA module with both mutual and self attention performs best. The full MSA model with both mutual and self attention components achieves better performance than when a single component is activated (rows 3, 6 and 9 in Tab. 2). The mutual attention module provides complementary information that helps in cases where self attention alone is not sufficient.
4.6 Comparison with state-of-the-art models
Table 3: Comparison with state-of-the-art models on VQAv2 Test-Standard.

| Methods | Acc. | Y/N | Num. | Other |
|---|---|---|---|---|
| Ours | 71.1 | 87.3 | 53.3 | 61.1 |
| MCAN [42]† | 70.9 | - | - | - |
| ReGAT [23]† | 70.6 | - | - | - |
| BAN+Counter [20]† | 70.4 | - | - | - |
| DFAF [13]† | 70.3 | - | - | - |
| MuRel [6] | 68.4 | - | - | - |
| Counter [45] | 68.4 | 83.6 | 51.4 | 59.1 |
| RAF [9] | 67.4 | 84.2 | 44.4 | 58.0 |
| QAA [11] | 67.0 | 83.8 | 45.9 | 57.1 |
| Graph Learner [30] | 66.2 | 82.9 | 47.1 | 56.2 |
| Bottom-Up [2] | 65.7 | 82.2 | 43.9 | 56.3 |
We report the performance of our single MSA model on the benchmark VQAv2 Test-Standard set in Tab. 3. By leveraging semantic relationship features, our model outperforms other comparable state-of-the-art models, even without additional training on the Visual Genome dataset [22]. Some recent models resort to ensembling and data augmentation techniques [42], or use models pretrained on different Vision-Language tasks and/or datasets [35, 24, 7] to achieve superior performance. However, such approaches are tangential to the motivation of this paper and are not directly comparable to our proposed model. [43, 17] do not report performance on the VQAv2 dataset and are thus not included in the comparison. As seen in Tab. 3, our model achieves state-of-the-art accuracy on the benchmark VQAv2 dataset among comparable methods, where the performance boost comes from incorporating semantic relationship features.

4.7 Qualitative results
We provide qualitative results of our MSA model on the VQAv2 dataset in Fig. 3. We visualize the attention distribution over the region proposals and list the two relationship triplets with the highest attention for better visualization. For simplicity, we do not visualize the mutual and self attention distributions over the question words. We can see that the self and mutual attention components provide complementary attention distributions over the input features. For example, in the second row, when asked ‘Are they wearing goggles?’, the visual self attention component focuses more on the sunglasses of the person on the left, while the mutual attention component looks at both the person on the left and the one on the right. Similarly, the self attention component gives more attention to a relationship triplet involving sunglasses and person, but also looks at the person wear shirt relationship triplet for additional semantic context. Such complementary behaviour between the attention components helps the VQA model reason better over its input feature representations.
5 Conclusion
The VQA problem demands an in-depth understanding of the visual and semantic domains. Existing approaches generally focus on deriving more discriminative visual features or on modeling the complex multi-modal interactions, and some models resort to expensive and customized pre-training on other Vision-Language tasks. In this paper, we show that an important missing piece in existing models is enriched semantic relationship modeling. We demonstrate that, under an oracle setting, these semantic relationships can bring the performance on par with human-level accuracy on the VQA task. Further, we propose an automatic semantic relationship parser alongside a complementary attention mechanism that delivers consistent improvements over the state of the art across two challenging VQA datasets. Our results strongly advocate for further investigation of better relationship modeling in the semantic domain, a direction less explored so far in the VQA community.
References
- [1] Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- [2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
- [3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015.
- [4] Hedi Ben-Younes, Rémi Cadène, Nicolas Thome, and Matthieu Cord. Mutan: Multimodal tucker fusion for visual question answering. ICCV, 2017.
- [5] Hedi Ben-Younes, Remi Cadene, Nicolas Thome, and Matthieu Cord. Block: Bilinear superdiagonal fusion for visual question answering and visual relationship detection. In AAAI 2019-33rd AAAI Conference on Artificial Intelligence, 2019.
- [6] Remi Cadene, Hedi Ben-Younes, Matthieu Cord, and Nicolas Thome. Murel: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1989–1998, 2019.
- [7] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Learning universal image-text representations. August 2020.
- [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [9] Moshiur R Farazi and Salman Khan. Reciprocal attention fusion for visual question answering. In The British Machine Vision Conference (BMVC), September 2018.
- [10] Moshiur R Farazi, Salman H Khan, and Nick Barnes. From known to the unknown: Transferring knowledge to answer questions about novel visual and semantic concepts. Image and Vision Computing, 103:103985, 2020.
- [11] Moshiur R Farazi, Salman H Khan, and Nick Barnes. Question-agnostic attention for visual question answering. In International Conference on Pattern Recognition, 2020.
- [12] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. arXiv preprint arXiv:1606.01847, 2016.
- [13] Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven CH Hoi, Xiaogang Wang, and Hongsheng Li. Dynamic fusion with intra-and inter-modality attention flow for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6639–6648, 2019.
- [14] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
- [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [16] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7132–7141, 2018.
- [17] Ronghang Hu, Anna Rohrbach, Trevor Darrell, and Kate Saenko. Language-conditioned graph networks for relational reasoning. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
- [18] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
- [19] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. arXiv preprint arXiv:1612.06890, 2016.
- [20] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In Advances in Neural Information Processing Systems, pages 1564–1574, 2018.
- [21] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [22] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332, 2016.
- [23] Linjie Li, Zhe Gan, Yu Cheng, and Jingjing Liu. Relation-aware graph attention network for visual question answering. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
- [24] Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. August 2020.
- [25] Cewu Lu, Ranjay Krishna, Michael Bernstein, and Li Fei-Fei. Visual relationship detection with language priors. In European Conference on Computer Vision, pages 852–869. Springer, 2016.
- [26] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pages 13–23, 2019.
- [27] Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering. In Advances In Neural Information Processing Systems, pages 289–297, 2016.
- [28] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
- [29] Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim. Dual attention networks for multimodal reasoning and matching. arXiv preprint arXiv:1611.00471, 2016.
- [30] Will Norcliffe-Brown, Stathis Vafeias, and Sarah Parisot. Learning conditioned graph structures for interpretable visual question answering. In Advances in Neural Information Processing Systems, pages 8334–8343, 2018.
- [31] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543, 2014.
- [32] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
- [33] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In Advances in neural information processing systems, pages 4967–4976, 2017.
- [34] Weijie Su, Xizhou Zhu, Yue Cao, Bin Li, Lewei Lu, Furu Wei, and Jifeng Dai. Vl-bert: Pre-training of generic visual-linguistic representations. In International Conference on Learning Representations, 2019.
- [35] Hao Tan and Mohit Bansal. Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5103–5114, 2019.
- [36] Damien Teney, Peter Anderson, Xiaodong He, and Anton van den Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- [37] Damien Teney, Lingqiao Liu, and Anton van den Hengel. Graph-structured representations for visual question answering. arXiv preprint arXiv:1609.05600, 2016.
- [38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- [39] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3156–3164, 2015.
- [40] Danfei Xu, Yuke Zhu, Christopher B Choy, and Li Fei-Fei. Scene graph generation by iterative message passing. arXiv preprint arXiv:1701.02426, 2017.
- [41] Dongfei Yu, Jianlong Fu, Tao Mei, and Yong Rui. Multi-level attention networks for visual question answering. In Conf. on Computer Vision and Pattern Recognition, 2017.
- [42] Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian. Deep modular co-attention networks for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6281–6290, 2019.
- [43] Cheng Zhang, Wei-Lun Chao, and Dong Xuan. An empirical study on leveraging scene graphs for visual question answering. In The British Machine Vision Conference (BMVC), September 2019.
- [44] Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar Paluri, Ahmed Elgammal, and Mohamed Elhoseiny. Large-scale visual relationship understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 9185–9194, 2019.
- [45] Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett. Learning to count objects in natural images for visual question answering. In International Conference on Learning Representations, 2018.
- [46] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J Corso, and Jianfeng Gao. Unified vision-language pre-training for image captioning and vqa. In AAAI, pages 13041–13049, 2020.