Learning to Agree on Vision Attention for Visual Commonsense Reasoning
Abstract
Visual Commonsense Reasoning (VCR) remains a significant yet challenging research problem in the realm of visual reasoning. A VCR model generally aims at answering a textual question regarding an image, followed by the rationale prediction for the preceding answering process. Though these two processes are sequential and intertwined, existing methods always consider them as two independent matching-based instances. They, therefore, ignore the pivotal relationship between the two processes, leading to sub-optimal model performance. This paper presents a novel visual attention alignment method to efficaciously handle these two processes in a unified framework. To achieve this, we first design a re-attention module for aggregating the vision attention map produced in each process. Thereafter, the resultant two sets of attention maps are carefully aligned to guide the two processes to make decisions based on the same image regions. We apply this method to both conventional attention and the recent Transformer models and carry out extensive experiments on the VCR benchmark dataset. The results demonstrate that with the attention alignment module, our method achieves a considerable improvement over the baseline methods, evidently revealing the feasibility of the coupling of the two processes as well as the effectiveness of the proposed method.
Index Terms:
Visual Commonsense Reasoning, Attention Mechanism, Attention Alignment.
I Introduction
Visual Question Answering (VQA) has received increasing interest over the past few years, which requires correctly answering natural language questions about a given image [1]. Despite significant progress having been made, existing VQA benchmarks mainly focus on answering simple recognition questions (e.g., how many or what color), while the explanation of the question answering is often ignored. To close this gap, Visual Commonsense Reasoning (VCR) has recently been presented as a challenge for researchers [2]. Specifically, beyond answering cognition-level questions (Q→A) as canonical VQA does, VCR further requires the model to provide a rationale for the chosen answer (QA→R) (see Figure 1 for an example).
In fact, VCR poses more challenges than VQA, which can be seen in the following two aspects: 1) on the data side – the images in the VCR dataset describe more complex situations in the real world, and the questions are rather challenging and require high-level visual reasoning capabilities (e.g., why or how). And 2) on the task side – it is hard to simultaneously figure out the right answer and its right rationale. Typically, VCR models first predict the answer, based on which the rationale can then be selected from the candidates. As the question answering has proved to be non-trivial for traditional VQA models [3, 4, 5], finding the right rationale simultaneously is even more difficult.
Current VCR methods mostly enhance the visual understanding with the given query (either the question alone or the question appended with the correct answer; see Figure 1), i.e., to overcome the first challenge, and can be roughly divided into two categories: the first one leverages the intra-modality correlations to enhance separate feature learning, and the inter-modality ones between vision and language to reason correctly [6, 7, 8]; the other employs large external datasets to pre-train a general multi-modal model, which is then transferred to VCR for learning a better joint representation of the image and text [9, 10, 11]. Although these methods have achieved promising results, they are all still limited by a common problem – Q→A and QA→R are handled independently, with the inherent relationship between these two processes being ignored. The second challenge thus remains unsettled, resulting in the sub-optimal performance of these methods.

In general, separately treating these two processes would inevitably lead VCR to degenerate into two VQA instances (note that for QA→R, the ‘question’ to a VQA model becomes the original question appended with the right answer, and the corresponding candidate ‘answer’ set is accordingly composed of several rationales). For each given question, existing methods consider Q→A and QA→R as two independent cases, rather than two sequential reasoning processes of one question. This makes Q→A and QA→R disjoint, deviating from the original intention of VCR. Moreover, answering and justification are actually consistent and coherent in human cognition: the human brain would capture and leverage the same evidence from the image for the two processes. Consequently, a better way to deal with VCR is to handle both via common cues. Thanks to the shared information provided by the same image, we align the visual attention of these two processes for better coordination.
To seamlessly bridge the two processes, in this paper, we propose an aGree on vISion aTtention model (dubbed GIST) for simple yet effective visual reasoning in VCR. Our method is composed of two consecutive modules. First, we design a novel re-attention module to model the fine-grained multi-modal interactions between queries and images. In particular, our attention module is two-stage: the first stage calculates the relevance between image regions and textual query tokens; the second stage combines the attention map of each token to obtain one single attention vector for each process. Second, we align the produced attention maps over the given image, guiding the two processes to ‘look at the same image regions’. Specifically, our method enforces an alignment loss between the attention maps from the two processes, aiming to obtain similar attention maps during training.
To test the effectiveness of our method, we conduct extensive experiments on the VCR dataset. We test our GIST method on both the vanilla visual attention models as well as the most recent Vision Language Transformers (VL-Transformers) [12, 11, 13]. Both quantitative and qualitative results demonstrate that our method can significantly outperform the baseline method, confirming the utility of aligning vision attention of the two intertwined processes.
In summary, the contribution of this paper is threefold:
• This work addresses the question answering and rationale prediction in VCR with a unified framework. In particular, we propose an attention alignment module to guide Q→A and QA→R to make predictions with the same visual evidence.
• We design a novel GIST method that skillfully aligns the visual attention from the two processes. We apply this method to both the vanilla visual attention model and the recent VL-Transformers.
• We conduct extensive experiments on the VCR dataset to demonstrate the effectiveness of our method. The code has been released at https://github.com/SDLZY/VCR_Align.
II Related Work
II-A Visual Commonsense Reasoning
Conventional VQA mainly focuses on the recognition capability of models [1, 14]. Until recently, VCR has emerged to study commonsense understanding, putting forward higher requirements in AI systems’ cognitive reasoning abilities. To address this task, many approaches have been proposed, which can be roughly divided into the following two categories.
The initial methods endeavor to devise task-specific architectures for VCR. Some of them explore a variety of sophisticated reasoning structures (e.g., holistic attention mechanism) to construct interactions between the image and text. For instance, R2C [2] performs three inference steps – grounding, contextualization, and reasoning, to effectively solve the visual cognition problem. Inspired by the neuronal connectivity of the brain, CCN [8] designs a graph method to globally and dynamically integrate the local visual neuron connectivity; HGL [6] integrates the intra-graph and inter-graph to bridge the vision and language modalities. In addition, as some questions cannot be directly answered from only the image information, external commonsense knowledge is also exploited in the cross-modal reasoning process [15].
Another prevalent stream is to apply VL-Transformers to the downstream VCR for both QA and QAR [12, 9]. For example, UNITER [11] designs four novel pre-training tasks with conditional masking to learn universal image-text representations for various downstream multi-modal tasks. ViLBERT [10] applies a dual stream fusion encoder to process visual and textual inputs in separate streams, followed by the modality interactions with co-attentional Transformer layers. MERLOT RESERVE [16] first learns task-agnostic representations through sound, language, and vision of videos [17], and then transfers these features to the downstream VCR task.
Although both categories of methods have achieved improved results, they all view the two processes in VCR as two independent VQA instances. Consequently, the critical correlations between these two are ignored, leading to weak visual reasoning. In this work, we propose an effective attention alignment framework to directly bridge these two processes.
II-B Attention in Visual Question Answering
The past few years have witnessed rapid growth in the research area of VQA. Among the existing methods, visual attention-based ones have demonstrated substantial advantages in feature learning. Traditional approaches focus more on the correlation learning between each image region and the whole question sentence [14, 18]. Thereafter, the co-attention mechanism jointly performs the question-guided attention over image regions and the image-guided attention over question words [19]. In addition to these top-down attention methods, BUTD [20, 21] combines both top-down and bottom-up attention: it first detects the salient objects inside an image and then leverages the top-down attention technique to locate the most relevant regions according to the given question. VQA-HAT [22] and Attn-MFH [23] point out that the attention maps produced by VQA models are inconsistent with human cognition. They, therefore, attempt to design more explicit visual reasoning methods and have achieved certain improvements.
II-C Vision-Language Transformers
Transformers have achieved great success in the fields of Natural Language Processing (NLP) [24] and Computer Vision (CV) [25]. Owing to their effectiveness, Transformers have also attracted much attention in multi-modal studies. Based on how the vision and language branches are fused, current VL-Transformers can be roughly categorized into single-stream (e.g., UNIMO [26] and SOHO [27]) and dual-stream cross-modal Transformers (e.g., LXMERT [28] and ALBEF [29]). In general, VL-Transformers adopt a pretrain-then-finetune learning paradigm: these models are first pre-trained on large-scale multi-modal datasets (such as Conceptual Captions [30]) to learn universal cross-modal representations, and then fine-tuned on downstream tasks by transferring the rich representations from pre-training. In particular, the pretext tasks play an important role in pre-training, where masked language modeling, masked region prediction, and image-text matching are extensively studied. The fine-tuning step mirrors that of the BERT model [24], which includes a downstream task-specific input, output, and objective. VL-Transformers mainly serve the following three groups of downstream tasks: cross-modal matching, cross-modal reasoning, and vision-language generation [31]. The first group focuses on learning cross-modal correspondences between vision and language, such as image-text retrieval and visual referring expression. Reasoning tasks require VL-Transformers to perform language reasoning based on visual scenes, such as VQA. The last group aims to generate the targets of one modality given the other as input [32], where the desired visual or textual tokens are decoded in an auto-regressive manner.
III Proposed Method
Based on task intuition and human cognition, the answering and reasoning processes in VCR should be made cohesive and consistent. Nevertheless, existing methods often treat them separately, rendering the commonsense reasoning less convincing. To tackle these two processes jointly, we resort to aligning the visual attention of the question answering and rationale inference processes for better collaboration, and accordingly design our GIST method.
In the following, we first introduce the method’s intuition and background knowledge, followed by our proposed visual attention alignment method. We then detail its implementation on the model with vanilla attention and the recent VL-Transformer with self-attention, respectively.


III-A Method Intuition
Given an image $I$ and a question $Q$ about this image, the goal of visual commonsense reasoning is to predict both the right answer $A_{gt}$ and the correct rationale $R_{gt}$ (the right and false candidates are denoted as $A_{gt}$/$R_{gt}$ and $A^{-}$/$R^{-}$, respectively). In general, VCR comprises two multiple-choice processes: question answering (Q→A) and answer justification (QA→R).
Q→A aims to predict the correct answer from a set of answer choices $\mathcal{A}=\{A_i\}_{i=1}^{4}$ to the given question $Q$ upon the image $I$, which can be achieved by,

$$\hat{A} = \operatorname*{arg\,max}_{A_i \in \mathcal{A}} f_{qa}\big(A_i \mid Q, I\big), \qquad (1)$$

where $f_{qa}$ denotes the Q→A model. Specifically, both the question and the candidate answers are expressed as textual sentences, and the image $I$ consists of objects detected by a Mask-RCNN model [33]. Previous methods train $f_{qa}$ by minimizing the cross-entropy loss (we employ the single-instance loss rather than the batch-wise one for simplicity),

$$\mathcal{L}_{qa} = -\log \frac{\exp\big(f_{qa}(A_{gt} \mid Q, I)\big)}{\sum_{i=1}^{4}\exp\big(f_{qa}(A_i \mid Q, I)\big)}, \qquad (2)$$

where $gt$ denotes the index of the ground-truth answer.
QA→R differs from Q→A in that the input is now composed of the question $Q$ and its correct answer $A_{gt}$ (a straightforward way is to concatenate these two together). The objective of QA→R is to select the correct rationale from a rationale set $\mathcal{R}=\{R_j\}_{j=1}^{4}$ (see Figure 1),

$$\hat{R} = \operatorname*{arg\,max}_{R_j \in \mathcal{R}} f_{qar}\big(R_j \mid [Q; A_{gt}], I\big), \qquad (3)$$

where $f_{qar}$ denotes the QA→R model. The optimization function is similarly defined as follows,

$$\mathcal{L}_{qar} = -\log \frac{\exp\big(f_{qar}(R_{gt} \mid [Q; A_{gt}], I)\big)}{\sum_{j=1}^{4}\exp\big(f_{qar}(R_j \mid [Q; A_{gt}], I)\big)}, \qquad (4)$$

where $R_{gt}$ represents the ground-truth rationale for the given image and question.
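For concreteness, the following sketch (in PyTorch) shows how the two multiple-choice objectives in Equations 2 and 4 can be computed from candidate scores; the tensor shapes and the random scores standing in for the outputs of $f_{qa}$ and $f_{qar}$ are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def multiple_choice_loss(scores: torch.Tensor, gt_index: torch.Tensor) -> torch.Tensor:
    """Single-instance cross-entropy over the four candidates (Eq. 2 / Eq. 4).

    scores:   [batch, 4] raw scores, one per candidate answer or rationale.
    gt_index: [batch]    index of the ground-truth candidate.
    """
    return F.cross_entropy(scores, gt_index)

# Toy usage with random scores standing in for f_qa / f_qar outputs.
qa_scores = torch.randn(8, 4)            # f_qa(A_i | Q, I) for 8 questions
qar_scores = torch.randn(8, 4)           # f_qar(R_j | [Q; A_gt], I)
gt_answer = torch.randint(0, 4, (8,))
gt_rationale = torch.randint(0, 4, (8,))

loss_qa = multiple_choice_loss(qa_scores, gt_answer)
loss_qar = multiple_choice_loss(qar_scores, gt_rationale)
```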
III-B Visual Attention Alignment
Inspired by human cognition, we argue that the visual evidence exploited in answering and justification should be the same. As these two processes share the same image (see Equation 1 and Equation 3), the underlying visual attention offers a natural bridge for connecting them. In view of this, we propose to align the visual information based on the attention maps calculated by $f_{qa}$ and $f_{qar}$.
Visual attention is an integral component of current vision-language models [20, 35]. A typical VCR model often involves a visual attention module over the object set $\mathcal{O}=\{o_k\}_{k=1}^{K}$. In particular, the visual attention learns an attention score $\alpha_k$ for each image object, where $\alpha_k \geq 0$ and $\sum_{k=1}^{K}\alpha_k = 1$. A large $\alpha_k$ usually indicates that the $k$-th object has more influence on answering the current question. As discussed earlier, there are two sets of attention maps: $\{\boldsymbol{\alpha}^{a_i}\}_{i=1}^{4}$ – the attention weights from model $f_{qa}$, and $\{\boldsymbol{\alpha}^{r_j}\}_{j=1}^{4}$ – the attention weights from model $f_{qar}$. To achieve the goal that $f_{qa}$ and $f_{qar}$ exploit similar image regions, our idea is to learn another module that drives these two sets of attention closer. Specifically, we implement this by pulling together the attention maps of the correct answer $A_{gt}$ and the correct rationale $R_{gt}$, while pushing away other negative pairs.
In this paper, we propose an attention alignment loss to make the QA and QAR learn to agree on visual regions. We formalize our attention alignment loss below,
$$\mathcal{L}_{align} = -\log \frac{\exp\big(s(\boldsymbol{\alpha}^{a_{gt}},\, \boldsymbol{\alpha}^{r_{gt}})\big)}{\sum_{i=1}^{4}\sum_{j=1}^{4}\exp\big(s(\boldsymbol{\alpha}^{a_i},\, \boldsymbol{\alpha}^{r_j})\big)}, \qquad (5)$$
where $\boldsymbol{\alpha}^{a_i}$ represents the attention weights from model $f_{qa}$ according to the $i$-th answer $A_i$, $\boldsymbol{\alpha}^{r_j}$ denotes the attention weights from model $f_{qar}$ according to the $j$-th rationale $R_j$, and $s(\cdot,\cdot)$ is a similarity function. We detail how we implement $s(\cdot,\cdot)$ in the following (a code sketch of both similarity functions is given after this list):
• One intuitive approach to measure the similarity between two sets of attention weights is the dot product. We refer to this method as Align-Dot,

$$s\big(\boldsymbol{\alpha}^{a_i}, \boldsymbol{\alpha}^{r_j}\big) = \boldsymbol{\alpha}^{a_i} \cdot \boldsymbol{\alpha}^{r_j} = \sum_{k=1}^{K}\alpha^{a_i}_{k}\,\alpha^{r_j}_{k}.$$
• Inspired by the learning-to-rank models in the field of information retrieval [36], we use a list-wise approach to align the ranking of attention weights and name this method Align-Rank. To this end, we first obtain the permutations of the two attention vectors, i.e., the prediction $\pi^{p}$ (from $\boldsymbol{\alpha}^{a_i}$) and the target $\pi^{t}$ (from $\boldsymbol{\alpha}^{r_j}$). Thereafter, we employ the NDCG metric optimization [37, 38] in our experiments:

$$s\big(\boldsymbol{\alpha}^{a_i}, \boldsymbol{\alpha}^{r_j}\big) = \frac{1}{Z}\sum_{k=1}^{K}\frac{G\big(\alpha^{r_j}_{k}\big)}{\log_{2}\big(1+\pi^{p}(k)\big)}, \qquad (6)$$

where $G(\cdot)$ denotes a gain function, e.g., $G(z)=2^{z}-1$, and $Z$ acts as the maximum of the summation, i.e., its value when the predicted attention permutation $\pi^{p}$ is the same as the target one $\pi^{t}$. Nevertheless, directly optimizing NDCG with back-propagation is impossible due to its non-differentiable nature. To approach this problem, we then smooth the permutation function with [38],

$$\pi^{p}(k) \approx 1 + \sum_{l \neq k}\frac{\exp\big(-(\alpha_{k}-\alpha_{l})/\tau\big)}{1+\exp\big(-(\alpha_{k}-\alpha_{l})/\tau\big)}, \qquad (7)$$

where $\tau$ is a hyper-parameter, and $\alpha_{k}$ denotes the attention value of the $k$-th object.
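The sketch below illustrates both similarity functions under the notation above: Align-Dot as a plain dot product, and Align-Rank with the smoothed ranks of Equation 7. The gain and normalization choices in `align_rank` follow the common ApproxNDCG recipe [37, 38] and should be read as assumptions rather than the exact released implementation.

```python
import torch

def align_dot(att_qa: torch.Tensor, att_qar: torch.Tensor) -> torch.Tensor:
    """Align-Dot: plain dot product between two attention maps over K objects."""
    return (att_qa * att_qar).sum(dim=-1)          # [*, K] -> [*]

def smooth_ranks(att: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Differentiable (approximate) ranks of the attention values (Eq. 7 style)."""
    diff = att.unsqueeze(-1) - att.unsqueeze(-2)   # att_k - att_l, shape [*, K, K]
    pairwise = torch.sigmoid(-diff / tau)          # ~1 if att_l > att_k
    ranks = 1.0 + pairwise.sum(dim=-1) - pairwise.diagonal(dim1=-2, dim2=-1)
    return ranks

def align_rank(att_qa: torch.Tensor, att_qar: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Align-Rank: NDCG-style similarity with gains taken from the target attention
    and smoothed predicted ranks (gain/normalisation choices are assumptions)."""
    gains = 2.0 ** att_qar - 1.0                   # G(z) = 2^z - 1 on the target weights
    pred_ranks = smooth_ranks(att_qa, tau)
    dcg = (gains / torch.log2(1.0 + pred_ranks)).sum(dim=-1)
    # Ideal DCG: the value obtained when the predicted order matches the target order.
    ideal_ranks = smooth_ranks(att_qar, tau)
    idcg = (gains / torch.log2(1.0 + ideal_ranks)).sum(dim=-1)
    return dcg / idcg.clamp_min(1e-8)
```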
By combining the aforementioned loss functions, the final objective becomes,
$$\mathcal{L} = \mathcal{L}_{qa} + \mathcal{L}_{qar} + \lambda\,\mathcal{L}_{align}, \qquad (8)$$

where $\lambda$ is a trade-off hyper-parameter.
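A minimal sketch of how the alignment term can be combined with the two task losses is given below; it assumes the contrastive form of Equation 5, where the pair formed by the correct answer and the correct rationale is treated as the positive among all sixteen candidate pairs, and it accepts either similarity function from the previous sketch. Function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def alignment_loss(att_qa, att_qar, gt_answer, gt_rationale, sim_fn):
    """Contrastive attention alignment over all (answer, rationale) pairs (Eq. 5 sketch).

    att_qa:  [B, 4, K] attention maps of f_qa, one per candidate answer.
    att_qar: [B, 4, K] attention maps of f_qar, one per candidate rationale.
    """
    B = att_qa.size(0)
    # Similarity for every answer/rationale pair: [B, 4, 4].
    sims = sim_fn(att_qa.unsqueeze(2), att_qar.unsqueeze(1))
    logits = sims.view(B, -1)                     # flatten the 4x4 grid of pairs
    target = gt_answer * 4 + gt_rationale         # index of the positive (gt, gt) pair
    return F.cross_entropy(logits, target)

def total_loss(loss_qa, loss_qar, loss_align, lam=1.0):
    """Final objective of Eq. 8: the two task losses plus the weighted alignment term."""
    return loss_qa + loss_qar + lam * loss_align
```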
Our method is applicable to both vanilla visual attention models as well as the most recent VL-Transformers. In the following, we will show its implementation on these two typical models.
III-C Application on the Vanilla Attention Model
Before the prevalence of VL-Transformers in VCR, conventional methods all adopt the vanilla attention mechanism to focus on the most salient image regions based on the textual information. In view of this, we intend to explore whether our proposed visual attention alignment method works under such settings. Specifically, we take TAB-VCR [7] as a typical baseline to implement our method. Since the two processes, i.e., Q→A and QA→R, share the same structure, we use Q→A as an example; QA→R can be easily extrapolated.
III-C1 Overall Framework
In this subsection, we first introduce the overall framework of TAB-VCR.
Image & Language Encoder. Images in the VCR dataset [2] come with objects detected by Mask-RCNN [33]. We leverage this fine-grained information and utilize a pre-trained Convolutional Neural Network (CNN) model [39] to extract the object features $\mathbf{V}=\{\mathbf{v}_k\}_{k=1}^{K}$.
Pertaining to the textual input, we first concatenate the question and each answer, and then employ a pre-trained word embedding to obtain the token embeddings $\{\mathbf{w}_t\}_{t=1}^{T}$. Note that each sentence in VCR includes both plain words and some object tags, as shown in Figure 3. Following previous studies [2, 7], we take the word embeddings and the object features in their original order as input to a bidirectional RNN [40]. Specifically, if an input token is a tag referring to an object $o_k$, its visual feature is the corresponding object feature $\mathbf{v}_k$; otherwise, it is the feature of the entire image. The query and response can thus be encoded into a sequence of hidden states $\{\mathbf{h}_t\}_{t=1}^{T}$ by an RNN model:

$$\{\mathbf{h}_t\}_{t=1}^{T} = \mathrm{BiRNN}\big(\{[\mathbf{w}_t;\, \tilde{\mathbf{v}}_t]\}_{t=1}^{T}\big), \qquad (9)$$

where $\tilde{\mathbf{v}}_t$ denotes the $k$-th object's feature $\mathbf{v}_k$ if token $t$ is a tag referring to object $o_k$, and otherwise the averaged feature of all objects.
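The following sketch illustrates this grounded encoding step: tag tokens are paired with the features of the objects they refer to, plain words with the averaged object feature, and the concatenated vectors are fed to a bidirectional LSTM. The dimensions and the use of an LSTM cell are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GroundedEncoder(nn.Module):
    """Encode a (question + answer) token sequence jointly with object features (Eq. 9 sketch)."""

    def __init__(self, word_dim=768, obj_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(word_dim + obj_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, word_emb, obj_feats, tag_index):
        """
        word_emb:  [B, T, word_dim] pre-trained token embeddings.
        obj_feats: [B, K, obj_dim]  detected object features.
        tag_index: [B, T] object index for tag tokens, -1 for plain words.
        """
        # Averaged object feature used for plain (non-tag) words.
        mean_obj = obj_feats.mean(dim=1, keepdim=True).expand(-1, tag_index.size(1), -1)
        # Gather the referred object's feature for each tag token.
        gathered = torch.gather(
            obj_feats, 1,
            tag_index.clamp_min(0).unsqueeze(-1).expand(-1, -1, obj_feats.size(-1)))
        visual = torch.where(tag_index.unsqueeze(-1) >= 0, gathered, mean_obj)
        hidden, _ = self.rnn(torch.cat([word_emb, visual], dim=-1))
        return hidden                                  # [B, T, 2 * hidden]
```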
Classifier. After the visual reasoning between the given question and image (which will be detailed in the next subsection), we treat VCR as a multi-class classification problem. In particular, we use the last hidden state $\mathbf{h}_T$ from the RNN and the refined image feature $\tilde{\mathbf{v}}$ as the representations of the question-answer pair and the image, respectively. Our classifier is implemented with a multi-layer perceptron (MLP) [41] to compute a score for each candidate answer:

$$f_{qa}\big(A_i \mid Q, I\big) = \mathbf{W}_2\,\phi\big(\mathbf{W}_1\,[\mathbf{h}_T;\, \tilde{\mathbf{v}}]\big), \qquad (10)$$

where $\phi$ is the LeakyReLU [39] activation function, and $\mathbf{W}_1$ and $\mathbf{W}_2$ are the learned weight matrices in the MLP. We omit the bias vectors for simplicity.
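A minimal sketch of such a scorer, with assumed hidden sizes, is:

```python
import torch.nn as nn

# Two-layer MLP scorer for Eq. 10; the input is [h_T ; refined visual feature].
classifier = nn.Sequential(
    nn.Linear(2 * 256 + 512, 512, bias=False),   # W_1 (bias omitted as in the text)
    nn.LeakyReLU(),
    nn.Linear(512, 1, bias=False),               # W_2: one score per candidate answer
)
```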
III-C2 Re-Attention
A typical model often extracts the visual features with a CNN model [42] and takes the output as the vision features for Equation 9. Different from it, we propose a re-attention module to distribute distinctive attention weights, which serves as an important part of our attention alignment goal. Specifically, our re-attention module consists of attention computation from two directions: object-wise and token-wise. The former takes each token feature as a query and learns the attention weights for all the objects. In contrast, token-wise attention collects the most informative signals of each token based on the fused token information from the former step.
Object-wise Attention. It is intuitive that for each token, different objects often contribute distinctively to the learning of the textual features. For example, in Figure 1, the [person5] object is more important than other objects for learning the token smiling in the question. In light of this, we employ each token to attend to every object as follows,

$$\beta_{tk} = \frac{\exp\big(\mathbf{h}_t^{\top}\mathbf{W}_{o}\,\mathbf{v}_k\big)}{\sum_{k'=1}^{K}\exp\big(\mathbf{h}_t^{\top}\mathbf{W}_{o}\,\mathbf{v}_{k'}\big)}, \qquad (11)$$

where $\mathbf{W}_{o}$ is a learnable projection matrix and $\beta_{tk}$ denotes the attention of the $t$-th token over the $k$-th object. In this way, we can have $T$ attention maps over all the objects, where $T$ is the number of tokens.
Token-wise Attention. After the object-wise attention operation, we then employ another attention module to estimate the importance of each token with respect to the overall textual feature. To implement this, we take the last output $\mathbf{h}_T$ from the RNN model as a query, and perform attention over all the token features,

$$\gamma_{t} = \frac{\exp\big(\mathbf{h}_T^{\top}\mathbf{W}_{s}\,\mathbf{h}_t\big)}{\sum_{t'=1}^{T}\exp\big(\mathbf{h}_T^{\top}\mathbf{W}_{s}\,\mathbf{h}_{t'}\big)}, \qquad (12)$$

where $\mathbf{W}_{s}$ is a learnable projection matrix and $\gamma_t$ measures the importance of the $t$-th token. Thereafter, we employ these attention weights, i.e., the importance of tokens $\{\gamma_t\}_{t=1}^{T}$, to aggregate the set of attention maps, obtaining the overall attention weights over objects $\boldsymbol{\alpha}$ and the refined image feature $\tilde{\mathbf{v}}$,

$$\alpha_{k} = \sum_{t=1}^{T}\gamma_{t}\,\beta_{tk}, \qquad \tilde{\mathbf{v}} = \sum_{k=1}^{K}\alpha_{k}\,\mathbf{v}_{k}. \qquad (13)$$
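The re-attention module can be sketched as follows; the bilinear score parameterization and the dimensions are assumptions consistent with Equations 11-13, not necessarily the exact released implementation.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Two-stage re-attention (Eqs. 11-13 sketch): object-wise, then token-wise."""

    def __init__(self, hid_dim=512, obj_dim=512):
        super().__init__()
        self.w_obj = nn.Linear(hid_dim, obj_dim, bias=False)   # bilinear score for Eq. 11
        self.w_tok = nn.Linear(hid_dim, hid_dim, bias=False)   # bilinear score for Eq. 12

    def forward(self, tokens, objects):
        """
        tokens:  [B, T, hid_dim] RNN hidden states.
        objects: [B, K, obj_dim] object features.
        Returns the object attention alpha [B, K] and the refined visual feature [B, obj_dim].
        """
        # Object-wise attention: each token attends to every object.
        beta = torch.softmax(self.w_obj(tokens) @ objects.transpose(1, 2), dim=-1)  # [B, T, K]
        # Token-wise attention: the last hidden state attends to every token.
        query = tokens[:, -1]                                                        # [B, hid_dim]
        gamma = torch.softmax((self.w_tok(query).unsqueeze(1) * tokens).sum(-1), dim=-1)  # [B, T]
        # Combine: alpha_k = sum_t gamma_t * beta_tk ; refined feature = sum_k alpha_k * v_k.
        alpha = torch.einsum('bt,btk->bk', gamma, beta)
        refined = torch.einsum('bk,bkd->bd', alpha, objects)
        return alpha, refined
```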
III-C3 Attention Alignment
We then feed the refined image feature $\tilde{\mathbf{v}}$ into Equation 10 and perform the final classification for predicting the right answer or rationale. Meanwhile, following Equation 13, the two attention sets can be easily collected: one is $\{\boldsymbol{\alpha}^{a_i}\}_{i=1}^{4}$ for $f_{qa}$, and the other is $\{\boldsymbol{\alpha}^{r_j}\}_{j=1}^{4}$ for $f_{qar}$. The visual attention alignment from Sec. III-B can thereby be performed.
| Model | VQA | VCR | Transformer | Q→A valid | Q→A test | QA→R valid | QA→R test | Q→AR valid | Q→AR test |
|---|---|---|---|---|---|---|---|---|---|
| Chance | | | | 25.0 | 25.0 | 25.0 | 25.0 | 6.2 | 6.2 |
| RevisitedVQA [43] | ✓ | | | 39.4 | 40.5 | 34.0 | 33.7 | 13.5 | 13.8 |
| BUTD [20] | ✓ | | | 42.8 | 44.1 | 25.1 | 25.1 | 10.7 | 11.0 |
| MLB [44] | ✓ | | | 45.5 | 46.2 | 36.1 | 36.8 | 17.0 | 17.2 |
| MUTAN [45] | ✓ | | | 44.4 | 45.5 | 32.0 | 32.2 | 14.6 | 14.6 |
| R2C [2] | | ✓ | | 63.8 | 65.1 | 67.2 | 67.3 | 43.1 | 44.0 |
| CCN [8] | | ✓ | | 67.4 | 68.5 | 70.6 | 70.5 | 47.7 | 48.4 |
| HGL [6] | | ✓ | | 69.4 | 70.1 | 70.6 | 70.8 | 49.1 | 49.8 |
| TAB-VCR [7] | | ✓ | | 69.5 | 70.5 | 71.6 | 71.6 | 50.1 | 50.8 |
| VL-BERT [12] | | ✓ | ✓ | 72.6 | 73.4 | 74.0 | 74.5 | 54.0 | 54.8 |
| UNITER [11] | | ✓ | ✓ | 74.4 | 75.5 | 76.9 | 77.3 | 57.5 | 58.6 |
| GIST (vanilla attention) | | ✓ | | 70.5 | 71.2 | 72.5 | 72.0 | 51.5 | 51.4 |
| GIST (UNITER) | | ✓ | ✓ | 74.9 | 75.6 | 77.0 | 77.5 | 58.1 | 58.8 |
| Human | | ✓ | | - | 91.0 | - | 93.0 | - | 85.0 |
III-D Application on the VL-Transformer
VL-Transformers have been widely studied over the past few years [46, 47]. In this section, we show how our attention alignment method works under such self-attention settings.
III-D1 Overall Framework
The structure of a typical single-stream VL-Transformer is illustrated in Figure 3.
Transformer Encoder. As can be observed, the input sequence starts with a special classification token ([CLS]), continues with the query and response elements and the visual tokens, and ends with a special ending token ([END]). A special separation token ([SEP]) is inserted between every two parts of the input. Following the practice in BERT [24], the input textual sentence is first split into tokens by the WordPiece tokenizer [48], which are then transformed into vectors by the embedding layer. Pertaining to the vision inputs, each token is represented by the detected object features, the same as in the vanilla attention model. After obtaining these embeddings, we add the segmentation and position embeddings to them, so that the sequential information can be encoded into the Transformer model. Thereafter, these embeddings are fed into several Transformer blocks, each of which consists of a self-attention layer, a feed-forward network, and layer normalization operations.
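As an illustration, a single-stream input embedding can be assembled as below; the vocabulary size, feature dimensions, and the two-way segment vocabulary are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class SingleStreamEmbedding(nn.Module):
    """Sum of token/visual, segment, and position embeddings for a single-stream input
    (a sketch; sizes and special-token handling are assumptions)."""

    def __init__(self, vocab_size=30522, obj_dim=2048, dim=768, max_len=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.visual_proj = nn.Linear(obj_dim, dim)       # project detected-object features
        self.segment_emb = nn.Embedding(2, dim)          # 0: text, 1: vision
        self.position_emb = nn.Embedding(max_len, dim)

    def forward(self, text_ids, visual_feats):
        """text_ids: [B, T] WordPiece ids (incl. [CLS]/[SEP]); visual_feats: [B, K, obj_dim]."""
        text = self.word_emb(text_ids)                   # [B, T, dim]
        vision = self.visual_proj(visual_feats)          # [B, K, dim]
        tokens = torch.cat([text, vision], dim=1)        # [B, T+K, dim]
        seg = torch.cat([torch.zeros_like(text_ids),
                         torch.ones(visual_feats.shape[:2], dtype=torch.long,
                                    device=text_ids.device)], dim=1)
        pos = torch.arange(tokens.size(1), device=tokens.device).unsqueeze(0)
        return tokens + self.segment_emb(seg) + self.position_emb(pos)
```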
Classifier. The vision and language features are fused and interact within the Transformer blocks, where the visual reasoning is also expected to be performed. At last, the final block's output of the [CLS] token is fed to a Softmax classifier to predict whether the given response is the correct choice.
III-D2 Re-Attention
The key to a Transformer model is multi-head self-attention. Given the query, key, and value matrices $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, the single-head attention is formally defined as follows,

$$\mathrm{Att}\big(\mathbf{Q}, \mathbf{K}, \mathbf{V}\big) = \mathrm{softmax}\Big(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\Big)\mathbf{V}, \qquad (14)$$

where $d_k$ is the dimension of the keys and values and $\sqrt{d_k}$ acts as a scaling factor. The more commonly used multi-head self-attention is formulated as,

$$\mathrm{MultiHead}\big(\mathbf{Q}, \mathbf{K}, \mathbf{V}\big) = \big[\mathrm{head}_1;\, \ldots;\, \mathrm{head}_H\big]\,\mathbf{W}^{O}, \qquad (15)$$

where $H$ is the number of attention heads, and each $\mathrm{head}_h = \mathrm{Att}\big(\mathbf{Q}\mathbf{W}^{Q}_h, \mathbf{K}\mathbf{W}^{K}_h, \mathbf{V}\mathbf{W}^{V}_h\big)$ is calculated by the function in Equation 14.
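For reference, Equation 14 amounts to the following few lines; the attention weights computed here are what the alignment step described next aggregates over heads and layers.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Single-head attention of Eq. 14; also returns the attention weights."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights
```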
One advantage of VL Transformers is that they offer holistic attention estimation with the [CLS] token. In this way, we do not have to aggregate all the information from textual tokens in the first step, as Sec. III-C2 does.
III-D3 Attention Alignment
To achieve the attention alignment goal, we first average the attention weights over all the heads. We thus obtain, for the [CLS] token, one attention map per layer over the input sequence. We then extract the attention paid to the visual tokens, yielding an attention matrix $\mathbf{A} \in \mathbb{R}^{L \times K}$, where $L$ is the number of layers and $K$ is the number of visual tokens. Finally, we perform the visual attention alignment for each layer following the guidance of Sec. III-B.
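A sketch of this extraction and per-layer alignment is given below. It assumes the backbone exposes per-layer self-attention tensors of shape [batch, heads, sequence, sequence] (as most Transformer implementations can) and that the [CLS] token sits at position 0; the renormalization over visual tokens is an assumption.

```python
import torch

def cls_visual_attention(layer_attentions, visual_slice):
    """Collect the [CLS]-to-visual-token attention for every layer.

    layer_attentions: list of L tensors, each [B, H, S, S] (per-layer self-attention weights).
    visual_slice:     Python slice selecting the visual-token positions in the sequence.
    Returns a tensor [B, L, K]: heads averaged, renormalised over the K visual tokens.
    """
    maps = []
    for att in layer_attentions:
        cls_att = att.mean(dim=1)[:, 0, visual_slice]                       # [B, K]
        maps.append(cls_att / cls_att.sum(dim=-1, keepdim=True).clamp_min(1e-8))
    return torch.stack(maps, dim=1)                                          # [B, L, K]

def layerwise_alignment(att_qa, att_qar, align_term):
    """Average a per-layer alignment term (e.g., the loss of Eq. 5) over all L layers."""
    terms = [align_term(att_qa[:, l], att_qar[:, l]) for l in range(att_qa.size(1))]
    return torch.stack(terms).mean()
```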
IV Experiments

IV-A Datasets and Evaluation Protocols
We conducted extensive experiments on the VCR benchmark, a large-scale dataset released alongside this task. The images are extracted from movie clips in LSMDC [49] and MovieClips (youtube.com/user/movieclips), and the objects inside the images are detected via the Mask-RCNN model [33]. We used the official dataset split, where the numbers of questions for training, validation, and testing are 212,923, 26,534, and 25,263, respectively. For each question, four answers are given with only one being correct, and there are also four rationale choices among which only one makes sense.
Regarding the evaluation metric, we used the popular classification accuracy for Q→A, QA→R, and Q→AR (for Q→AR, a prediction is correct only when both the answer and the rationale are selected correctly). The ground-truth labels are available only for the training and validation sets [2]. Therefore, we reported the performance of our best model on the testing set once and performed all other experiments on the validation set.
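As a small illustration, the joint Q→AR metric can be computed from per-task predictions as follows (toy indices only):

```python
import torch

def qar_accuracy(pred_answer, pred_rationale, gt_answer, gt_rationale):
    """Joint Q->AR accuracy: a hit only when both the answer and the rationale are correct."""
    both_correct = (pred_answer == gt_answer) & (pred_rationale == gt_rationale)
    return both_correct.float().mean().item()

# Example: 4 questions with predicted / ground-truth candidate indices -> 0.5
print(qar_accuracy(torch.tensor([0, 1, 2, 3]), torch.tensor([1, 1, 0, 2]),
                   torch.tensor([0, 1, 3, 3]), torch.tensor([1, 1, 0, 0])))
```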
IV-B Implementation Details
The PyTorch toolkit [50] is leveraged to implement our models, and all the experiments were conducted on a single GeForce RTX 2080 Ti GPU. The specific details of each model are shown below.
Vanilla attention model. As for the input features, we employed the pre-trained object-level features extracted by TAB-VCR [7] as the visual features, and the pre-trained token embeddings from BERT as the textual features. All the trainable parameters are initialized with the default PyTorch settings. The training batch size is set to 96 and the alignment loss weight $\lambda$ is set to 1.0. The parameters are optimized with the Adam [51] optimizer.
VL-Transformers. In order to verify the generalization capability of our attention alignment mechanism on VL-Transformers, we employ several classical VL-Transformer models [11, 12, 13] as backbones. For a fair comparison, we strictly followed their original implementations.

IV-C Overall Performance Comparison
We evaluated our method by comparing its performance with three kinds of methods: (1) advanced VQA models, (2) traditional VCR baselines, and (3) VL-Transformers adapted to VCR. The results on both the validation and testing sets are reported in Table I, and the key observations are as follows.
• First, a significant performance gap exists between strong traditional VQA models and VCR methods, especially for QA→R. This is because VCR demands higher-order reasoning capability, which differs from the simple recognition in VQA. In addition, predicting the right rationale is even more challenging.
• Second, VL-Transformer-based methods consistently surpass the task-specific VCR models, which can be largely attributed to the universal cross-modal representations learned from large-scale pre-training.
• Lastly, our method achieves the best performance among these state-of-the-art models. In particular, with our attention alignment mechanism, both the vanilla attention and the VL-Transformer VCR baselines achieve consistent gains on the validation and testing sets. For example, compared with TAB-VCR, absolute improvements of 1.1, 1.2 and 1.6 can be observed on Q→A, QA→R and Q→AR, respectively.
| Alignment Loss | Q→A | QA→R | Q→AR |
|---|---|---|---|
| Vanilla attention | 68.8 | 70.8 | 48.9 |
| w/ Align-Dot | 70.5 (+1.7) | 72.5 (+1.7) | 51.4 (+2.5) |
| w/ Align-Rank | 69.4 (+0.6) | 71.9 (+1.1) | 50.0 (+1.1) |
| VL-BERT | 72.6 | 74.0 | 54.0 |
| w/ Align-Dot | 73.2 (+0.6) | 74.6 (+0.6) | 54.9 (+0.9) |
| w/ Align-Rank | 73.2 (+0.6) | 74.7 (+0.7) | 54.9 (+0.9) |
| UNITER | 74.4 | 76.9 | 57.5 |
| w/ Align-Dot | 74.7 (+0.3) | 77.2 (+0.3) | 58.0 (+0.5) |
| w/ Align-Rank | 74.9 (+0.5) | 77.0 (+0.1) | 58.1 (+0.6) |
| VILLA | 75.4 | 78.7 | 59.5 |
| w/ Align-Dot | 75.9 (+0.5) | 79.0 (+0.3) | 60.2 (+0.7) |
| w/ Align-Rank | 76.0 (+0.6) | 78.8 (+0.1) | 60.1 (+0.6) |
| Model | Q→A | QA→R | Q→AR |
|---|---|---|---|
| Full model | 70.5 | 72.5 | 51.4 |
| w/o token-wise Att | 70.1 | 71.5 | 50.4 |
| w/o Att | 68.7 | 71.2 | 49.1 |
IV-D Ablation Study
To better illustrate the effectiveness of our model in detail, we conducted experiments on the essential parts of our method and reported the results on the validation set below.
Efficacy of the attention alignment. Figure 5 demonstrates the effect of our attention alignment mechanism for both the vanilla attention model and the VL-Transformers. From this figure, we can observe that among all the variants, our attention alignment mechanism consistently improves the base model on all metrics by a clear margin. Taking the vanilla attention model as an example, our attention alignment mechanism boosts it by 1.7% (Q→A), 1.7% (QA→R), and 2.5% (Q→AR). One main limitation of these baselines is that they neglect the visual consistency and the interactions between answering and reasoning. In contrast, our visual attention alignment offers a bridge to connect these two processes and significantly enhances the baseline performance.
Figure 4 shows the convergence of the answering and reasoning accuracy for the baselines and our GIST model. It can be seen that the accuracy of the baselines increases comparably or even faster than that of GIST in the early stage of training. Nevertheless, with more training steps, GIST outperforms all the baselines by a significant margin. One possible reason is that the baselines show certain disadvantages on difficult instances as training proceeds. In contrast, by aligning the visual attention between Q→A and QA→R, our model learns to leverage more accurate visual information to perform visual understanding, leading to larger improvements.



Align-Dot vs. Align-Rank. As discussed in Section III-B, we applied two candidate alignment loss functions to align the attention maps from the two processes. Table II reports the results obtained with the different alignment losses. One can see that both loss functions bring certain performance improvements over their respective baselines. Specifically, for the vanilla attention model, Align-Dot, which is more demanding, achieves notably better results. One possible reason is that the carefully designed re-attention module can capture the attention precisely.
Re-Attention of the vanilla attention model. In order to investigate the effectiveness of our re-attention module in the vanilla attention model, we designed two variants and reported the results in Table III. After removing the token-wise attention module from our model, we can observe performance degradation on all three metrics. It validates that different textual tokens contribute distinctively to visual feature learning. We then replaced all the attention computation with the mean operation and showed the results in the last row of this table. One can see that the model performance drops sharply compared with the previous two.
IV-E Hyper-parameter Study
In this section, we study the influence of the trade-off hyper-parameter $\lambda$ on the performance of our GIST model. The results on both the vanilla attention model and UNITER are shown in Figure 6. As we increase the loss weight $\lambda$, the model performance keeps improving, which demonstrates the effectiveness of our designed attention alignment mechanism. However, a too-large weight leads to deteriorating results. For instance, a loss weight larger than 0.4 hurts the performance of UNITER.
IV-F Qualitative Results
It is expected that our attention alignment operation should pull the attention maps of the two processes closer, so that the model predictions are made according to consistent visual cues. To verify this, we present qualitative results from the following two angles.
Similarity of attention maps. We leveraged Align-Dot to perform the attention alignment and estimated the attention similarity between Q→A and QA→R. The histogram of the similarity values is shown in Figure 7. As can be observed, the attention similarity from the baseline models is mostly less than 0.2. In contrast, our GIST model yields more consistent visual reasoning results, as the similarity is increased significantly.
Attention map visualization. To gain a deeper insight into our attention alignment mechanism, we also provide some qualitative examples in Figure 8 to compare the attention distributions of the baseline and our method. Regarding the first instance, the baseline puts more attention on the lamp regions while ignoring the critical person areas. Although the answer is wrongly predicted, the rationale is unexpectedly selected correctly. This supports our argument that answer prediction and rationale selection should depend on the same evidence; otherwise, the model may behave unpredictably. As to the second instance, without the attention alignment, the baseline focuses more on the [chair1] object, which is less relevant to the given question. Our GIST model corrects this mistake and obtains the right selection for both Q→A and QA→R. The last example shows that the attention weights are distributed incorrectly for both processes. With the consistent alignment of our method, both Q→A and QA→R can be accurately predicted.
V Conclusion and Future Work
In this paper, we propose a novel vision attention alignment method to bridge the two intertwined processes in visual commonsense reasoning. In particular, a re-attention module is first introduced to model the fine-grained inter-modality interactions, followed by the attention alignment equipped with two alternative loss functions. We apply this method to both the conventional vanilla attention model as well as the recent strong VL-Transformers. Through the qualitative and quantitative experiments on the benchmark dataset, the effectiveness of our proposed method is extensively demonstrated.
In the future, we plan to further explore this direction, i.e., coordinating the two processes in VCR within a single framework. More techniques that exploit the close connection between these two processes deserve further investigation.
References
- [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in IEEE International Conference on Computer Vision. IEEE, 2015, pp. 2425–2433.
- [2] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, “From recognition to cognition: Visual commonsense reasoning,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2019, pp. 6720–6731.
- [3] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2017, pp. 6325–6334.
- [4] Y. Guo, L. Nie, H. Cheng, Z. Cheng, M. S. Kankanhalli, and A. D. Bimbo, “On modality bias recognition and reduction,” ACM Transactions on Multimedia Computing, Communications, and Applications, 2022.
- [5] Y. Guo, L. Nie, Z. Cheng, Q. Tian, and M. Zhang, “Loss re-scaling VQA: revisiting the language prior problem from a class-imbalance view,” IEEE Transactions on Image Processing, vol. 31, pp. 227–238, 2022.
- [6] W. Yu, J. Zhou, W. Yu, X. Liang, and N. Xiao, “Heterogeneous graph learning for visual commonsense reasoning,” in Advances in Neural Information Processing Systems, 2019, pp. 2765–2775.
- [7] J. Lin, U. Jain, and A. G. Schwing, “Tab-vcr: Tags and attributes based vcr baselines,” in Advances in Neural Information Processing Systems, 2019, pp. 15 589–15 602.
- [8] A. Wu, L. Zhu, Y. Han, and Y. Yang, “Connective cognition network for directional visual commonsense reasoning,” in Advances in Neural Information Processing Systems, 2019, pp. 5670–5680.
- [9] G. Li, N. Duan, Y. Fang, M. Gong, and D. Jiang, “Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training,” in AAAI Conference on Artificial Intelligence. AAAI, 2020, pp. 11 336–11 344.
- [10] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in Advances in Neural Information Processing Systems, 2019, pp. 13–23.
- [11] Y. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, “Uniter: Universal image-text representation learning,” in European Conference on Computer Vision. Springer, 2020, pp. 104–120.
- [12] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “Vl-bert: Pre-training of generic visual-linguistic representations,” in International Conference on Learning Representations, 2020.
- [13] Z. Gan, Y. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu, “Large-scale adversarial training for vision-and-language representation learning,” in Advances in Neural Information Processing Systems, 2020.
- [14] Y. Zhu, O. Groth, M. S. Bernstein, and L. Fei-Fei, “Visual7w: Grounded question answering in images,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2016, pp. 4995–5004.
- [15] D. Song, S. Ma, Z. Sun, S. Yang, and L. Liao, “Kvl-bert: Knowledge enhanced visual-and-linguistic bert for visual commonsense reasoning,” Knowledge-Based Systems, vol. 230, p. 107408, 2021.
- [16] R. Zellers, J. Lu, X. Lu, Y. Yu, Y. Zhao, M. Salehi, A. Kusupati, J. Hessel, A. Farhadi, and Y. Choi, “Merlot reserve: Neural script knowledge through vision and language and sound,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2022, pp. 16 354–16 366.
- [17] Z. Liu, R. Feng, H. Chen, S. Wu, Y. Gao, Y. Gao, and X. Wang, “Temporal feature alignment and mutual information maximization for video-based human pose estimation,” in Computer Vision and Pattern Recognition. IEEE, 2022, pp. 10 996–11 006.
- [18] K. J. Shih, S. Singh, and D. Hoiem, “Where to look: Focus regions for visual question answering,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2016, pp. 4613–4621.
- [19] J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical question-image co-attention for visual question answering,” in Advances in Neural Information Processing Systems, 2016, pp. 289–297.
- [20] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2018, pp. 6077–6086.
- [21] D. Teney, P. Anderson, X. He, and A. van den Hengel, “Tips and tricks for visual question answering: Learnings from the 2017 challenge,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2018, pp. 4223–4232.
- [22] T. Qiao, J. Dong, and D. Xu, “Exploring human-like attention supervision in visual question answering,” in AAAI Conference on Artificial Intelligence. AAAI, 2018, pp. 7300–7307.
- [23] Y. Zhang, J. C. Niebles, and A. Soto, “Interpretable visual question answering by visual grounding from attention supervision mining,” in IEEE Winter Conference on Applications of Computer Vision. IEEE, 2019, pp. 349–357.
- [24] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in North American Chapter of the Association for Computational Linguistics. ACL, 2019, pp. 4171–4186.
- [25] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
- [26] W. Li, C. Gao, G. Niu, X. Xiao, H. Liu, J. Liu, H. Wu, and H. Wang, “Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning,” in Annual Meeting of the Association for Computational Linguistics. ACL, 2021, pp. 2592–2607.
- [27] Z. Huang, Z. Zeng, Y. Huang, B. Liu, D. Fu, and J. Fu, “Seeing out of the box: End-to-end pre-training for vision-language representation learning,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2021, pp. 12 976–12 985.
- [28] H. Tan and M. Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” in Empirical Methods in Natural Language Processing. ACL, 2019, pp. 5099–5110.
- [29] J. Li, R. R. Selvaraju, A. Gotmare, S. R. Joty, C. Xiong, and S. C. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” in Advances in Neural Information Processing Systems, 2021, pp. 9694–9705.
- [30] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Annual Meeting of the Association for Computational Linguistics. ACL, 2018, pp. 2556–2565.
- [31] Y. Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of vision-language pre-trained models,” in International Joint Conference on Artificial Intelligence. Morgan Kaufmann, 2022, pp. 5436–5443.
- [32] J. Cho, J. Lu, D. Schwenk, H. Hajishirzi, and A. Kembhavi, “X-lxmert: Paint, caption and answer questions with multi-modal transformers,” in Empirical Methods in Natural Language Processing. ACL, 2020, pp. 8785–8805.
- [33] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask r-cnn,” in IEEE International Conference on Computer Vision. IEEE, 2017, pp. 2980–2988.
- [34] X. Zhang, F. Zhang, and C. Xu, “Explicit cross-modal representation learning for visual commonsense reasoning,” IEEE Transactions on Multimedia, vol. 24, pp. 2986–2997, 2022.
- [35] C. Liu, Z. Mao, T. Zhang, A. Liu, B. Wang, and Y. Zhang, “Focus your attention: A focal attention for multimodal learning,” IEEE Transactions on Multimedia, vol. 24, pp. 103–115, 2022.
- [36] T. Liu, “Learning to rank for information retrieval,” Foundations and Trends® in Information Retrieval, vol. 3, no. 3, pp. 225–331, 2009.
- [37] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of ir techniques,” ACM Transactions on Information Systems, vol. 20, no. 4, pp. 422–446, 2002.
- [38] T. Qin, T. Liu, and H. Li, “A general approximation framework for direct optimization of information retrieval measures,” Information Retrieval, vol. 13, no. 4, pp. 375–397, 2010.
- [39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2016, pp. 770–778.
- [40] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” in Empirical Methods in Natural Language Processing. ACL, 2014, pp. 1724–1734.
- [41] K. Hornik, M. B. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
- [42] Z. Liu, Z. Wang, L. Zhang, R. R. Shah, Y. Xia, Y. Yang, and X. Li, “Fastshrinkage: Perceptually-aware retargeting toward mobile platforms,” in ACM Multimedia Conference. ACM, 2017, pp. 501–509.
- [43] A. Jabri, A. Joulin, and L. van der Maaten, “Revisiting visual question answering baselines,” in European Conference on Computer Vision. Springer, 2016, pp. 727–739.
- [44] J. Kim, K. W. On, W. Lim, J. Kim, J. Ha, and B. Zhang, “Hadamard product for low-rank bilinear pooling,” in International Conference on Learning Representations, 2017.
- [45] H. Ben-younes, R. Cadène, M. Cord, and N. Thome, “Mutan: Multimodal tucker fusion for visual question answering,” in IEEE International Conference on Computer Vision. IEEE, 2017, pp. 2631–2639.
- [46] J. Tu, X. Liu, Z. Lin, R. Hong, and M. Wang, “Differentiable cross-modal hashing via multimodal transformers,” in International Conference on Multimedia. ACM, 2022, pp. 453–461.
- [47] H. Zhong, J. Chen, C. Shen, H. Zhang, J. Huang, and X. Hua, “Self-adaptive neural module transformer for visual question answering,” IEEE Transactions on Multimedia, vol. 23, pp. 1264–1273, 2021.
- [48] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s neural machine translation system: Bridging the gap between human and machine translation,” CoRR, vol. abs/1609.08144, 2016.
- [49] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. J. Pal, H. Larochelle, A. C. Courville, and B. Schiele, “Movie description,” International Journal of Computer Vision, vol. 123, no. 1, pp. 94–120, 2017.
- [50] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
- [51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
Zhenyang Li received the B.Eng. and master's degrees from Shandong University and the University of Chinese Academy of Sciences, respectively. He is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Shandong University, supervised by Prof. Liqiang Nie. His research interest is multi-modal computing, especially visual question answering.

Yangyang Guo (Member, IEEE) is currently a research fellow with the National University of Singapore. He has authored or co-authored several papers in top journals, such as IEEE TIP, TMM, TKDE, TNNLS, and ACM TOIS. He is a regular reviewer for journals including IEEE TIP, TMM, TKDE, TCSVT, ACM TOIS, and ToMM. He was recognized as an outstanding reviewer for IEEE TMM and WSDM 2022.

Kejie Wang is currently pursuing the B.Eng. degree in computer science from Shandong University. His research interests include visual question answering and computer vision.

Fan Liu (Member, IEEE) is currently a Research Fellow with the School of Computing, National University of Singapore (NUS). He received the Ph.D. degree from Shandong University in China. His research interests lie primarily in multimedia search and recommendation. His work has been published in a set of top forums, including ACM SIGIR, MM, WWW, TKDE, TOIS and TMM. He has served as a PC member for several top conferences, like ACM MM, SIGKDD, WSDM, and as a reviewer for journals including TKDE, TMM, IPM, INS.

Liqiang Nie (Senior Member, IEEE) received the B.Eng. degree from Xi'an Jiaotong University and the Ph.D. degree from the National University of Singapore (NUS). He is currently a Professor and the dean of the School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen). His research interests lie primarily in multimedia computing and information retrieval. He has co-authored more than 200 articles and four books and received more than 14,000 Google Scholar citations. He is an AE of IEEE TKDE, IEEE TMM, IEEE TCSVT, ACM ToMM, and Information Sciences. Meanwhile, he is a regular area chair of ACM MM, NeurIPS, IJCAI, and AAAI. He is a member of the ICME steering committee. He has received many awards, such as the ACM MM and SIGIR best paper honorable mention in 2019, SIGMM rising star in 2020, TR35 China 2020, DAMO Academy Young Fellow in 2020, and SIGIR best student paper in 2021.

Mohan Kankanhalli (Fellow, IEEE) received the B.Tech. degree from IIT Kharagpur and the M.S. and Ph.D. degrees from the Rensselaer Polytechnic Institute. He is currently the Provost's Chair Professor at the Department of Computer Science, National University of Singapore. He is the Director of N-CRiPT and also the Deputy Executive Chairman of AI Singapore (Singapore's national AI program). His current research interests include multimedia computing, multimedia security and privacy, image/video processing, and social media analysis.