
Learning to Agree on Vision Attention for Visual Commonsense Reasoning

Zhenyang Li, Yangyang Guo, Kejie Wang, Fan Liu, Liqiang Nie, Mohan Kankanhalli. Zhenyang Li and Kejie Wang are with Shandong University, China. E-mail: {zhenyanglidz, kjwang.henry}@gmail.com. Yangyang Guo, Fan Liu and Mohan Kankanhalli are with the National University of Singapore, Singapore. E-mail: {guoyang.eric, liufancs}@gmail.com, [email protected]. Liqiang Nie is with Harbin Institute of Technology (Shenzhen), China. E-mail: [email protected].
Abstract

Visual Commonsense Reasoning (VCR) remains a significant yet challenging research problem in the realm of visual reasoning. A VCR model generally aims at answering a textual question regarding an image, followed by the rationale prediction for the preceding answering process. Though these two processes are sequential and intertwined, existing methods typically treat them as two independent matching-based instances. They therefore ignore the pivotal relationship between the two processes, leading to sub-optimal model performance. This paper presents a novel visual attention alignment method to effectively handle these two processes in a unified framework. To achieve this, we first design a re-attention module for aggregating the vision attention map produced in each process. Thereafter, the resultant two sets of attention maps are carefully aligned to guide the two processes to make decisions based on the same image regions. We apply this method to both conventional attention and the recent Transformer models, and carry out extensive experiments on the VCR benchmark dataset. The results demonstrate that, with the attention alignment module, our method achieves a considerable improvement over the baseline methods, evidently revealing the feasibility of coupling the two processes as well as the effectiveness of the proposed method.

Index Terms:
Visual Commonsense Reasoning, Attention Mechanism, Attention Alignment.

I Introduction

Visual Question Answering (VQA), which requires correctly answering natural language questions about a given image [1], has received increasing interest over the past few years. Despite significant progress, existing VQA benchmarks mainly focus on answering simple recognition questions (e.g., how many or what color), while the explanation of the question answering is often ignored. To close this gap, Visual Commonsense Reasoning (VCR) has recently been presented as a challenge for researchers [2]. Specifically, beyond answering the question (Q\rightarrowA) as canonical VQA does, VCR further requires the model to provide a rationale for the correct answer (QA\rightarrowR) (see Figure 1 for an example).

In fact, VCR poses more challenges than VQA, which can be seen in the following two aspects: 1) on the data side – the images in the VCR dataset describe more complex real-world situations, and the questions are rather challenging and require high-level visual reasoning capabilities (e.g., why or how); and 2) on the task side – it is hard to simultaneously figure out the right answer and its right rationale. Typically, VCR models first predict the answer, based on which the rationale is then selected from the candidates. As question answering alone has proved to be non-trivial for traditional VQA models [3, 4, 5], finding the right rationale at the same time is even more difficult.

Current VCR methods mostly enhance the visual understanding with the given query (either the question or the question with the correct answer; see Figure 1), i.e., to overcome the first challenge, and can be roughly divided into two categories: the first leverages intra-modality correlations to enhance separate feature learning and inter-modality correlations between vision and language to reason correctly [6, 7, 8]; the other employs large external datasets to pre-train a general multi-modal model, which is then transferred to VCR for learning a better joint representation of the image and text [9, 10, 11]. Although these methods have achieved promising results, they are all still limited by a common problem – Q\rightarrowA and QA\rightarrowR are handled independently, with the inherent relationship between these two processes being ignored. The second challenge thus remains unsettled, resulting in the sub-optimal performance of these methods.

Figure 1: An instance of the VCR task and the attention distribution over image objects. There are two sequential processes in VCR: Q\rightarrowA and QA\rightarrowR. We propose to steer the attention of the right rationale (R1) to be similar to that of the ground-truth answer (A4), while making other negative pairs (such as R2 and A4) dissimilar.

In general, separately treating these two processes inevitably degenerates VCR into two VQA instances (note that for QA\rightarrowR, the ‘question’ to a VQA model becomes the original question appended with the right answer, and the corresponding candidate ‘answer’ set is accordingly composed of several rationales). For each given question, existing methods consider Q\rightarrowA and QA\rightarrowR as two independent cases, rather than two sequential reasoning processes of one question. This makes Q\rightarrowA and QA\rightarrowR disjoint, deviating from the original intention of VCR. Moreover, answering and justification are actually consistent and coherent in human cognition: the human brain captures and leverages the same evidence from the image for the two processes. Consequently, a better way to deal with VCR is to handle both via common cues. Thanks to the shared information provided by the same image, we align the visual attention of the two processes for better coordination.

To seamlessly bridge the two processes, in this paper, we propose an aGree on vISion aTtention model (dubbed GIST) for simple yet effective visual reasoning in VCR. Our method is composed of two consecutive modules. First, we design a novel re-attention module to model the fine-grained multi-modal interactions between queries and images. In particular, our attention module is two-stage: the first stage calculates the relevance between image regions and textual query tokens; the second stage combines the attention maps of all tokens to obtain one single attention vector for each process. Second, we align the produced attention maps over the given image, guiding the two processes to ‘look at the same image regions’. Specifically, our method enforces an alignment loss between the attention maps from the two processes, aiming to obtain similar attention maps during training.

To test the effectiveness of our method, we conduct extensive experiments on the VCR dataset. We test our GIST method on both vanilla visual attention models and the most recent Vision-Language Transformers (VL-Transformers) [12, 11, 13]. Both quantitative and qualitative results demonstrate that our method can significantly outperform the baseline methods, confirming the utility of aligning the vision attention of the two intertwined processes.

In summary, the contribution of this paper is threefold:

  • This work addresses the question answering and rationale prediction in VCR with a unified framework. In particular, we propose an attention alignment module to guide Q\rightarrowA and QA\rightarrowR to make predictions with the same visual evidence.

  • We design a novel GIST method that skillfully aligns the visual attention from the two processes. We apply this method to both the vanilla visual attention model and the recent VL-Transformers.

  • We conduct extensive experiments on the VCR dataset to demonstrate the effectiveness of our method. The code has been released at https://github.com/SDLZY/VCR_Align.

II Related Work

II-A Visual Commonsense Reasoning

Conventional VQA mainly focuses on the recognition capability of models [1, 14]. More recently, VCR has emerged to study commonsense understanding, placing higher demands on AI systems’ cognitive reasoning abilities. To address this task, many approaches have been proposed, which can be roughly divided into the following two categories.

The initial methods endeavor to devise task-specific architectures for VCR. Some of them explore a variety of sophisticated reasoning structures (e.g., holistic attention mechanisms) to model the interactions between image and text. For instance, R2C [2] performs three inference steps – grounding, contextualization, and reasoning – to effectively solve the visual cognition problem. Inspired by the neuronal connectivity of the brain, CCN [8] designs a graph method to globally and dynamically integrate the local visual neuron connectivity; HGL [6] integrates an intra-graph and an inter-graph to bridge the vision and language modalities. In addition, as some questions cannot be directly answered from the image information alone, external commonsense knowledge is also exploited in the cross-modal reasoning process [15].

Another prevalent stream is to apply VL-Transformers to downstream VCR for both Q\rightarrowA and QA\rightarrowR [12, 9]. For example, UNITER [11] designs four novel pre-training tasks with conditional masking to learn universal image-text representations for various downstream multi-modal tasks. ViLBERT [10] applies a dual-stream fusion encoder to process visual and textual inputs in separate streams, followed by modality interactions with co-attentional Transformer layers. MERLOT RESERVE [16] first learns task-agnostic representations through the sound, language, and vision of videos [17], and then transfers these features to the downstream VCR task.

Although both categories of methods have achieved improved results, they all view the two processes in VCR as two independent VQA instances. As a result, the critical correlations between these two are ignored, resulting in weak visual reasoning. In this work, we propose an effective attention alignment framework to bridge these two processes directly.

II-B Attention in Visual Question Answering

The past few years have witnessed rapid growth in the research area of VQA. Among the existing methods, visual attention-based ones have demonstrated substantial advantages in feature learning. Traditional approaches focus more on the correlation learning between each image region and the whole question sentence [14, 18]. Thereafter, the co-attention mechanism jointly performs question-guided attention over image regions and image-guided attention over question words [19]. In addition to these top-down attention methods, BUTD [20, 21] combines both top-down and bottom-up attention: it first detects the salient objects inside an image and then leverages the top-down attention technique to locate the most relevant regions according to the given question. VQA-HAT [22] and Attn-MFH [23] point out that the attention maps produced by VQA models are inconsistent with human cognition. They therefore attempt to design more explicit visual reasoning methods and have achieved certain improvements.

II-C Vision-Language Transformers

Transformers have achieved great success in the fields of Natural Language Processing (NLP) [24] and Computer Vision (CV) [25]. Because of their effectiveness, Transformers have also attracted much attention in multi-modal studies. Based on how the vision and language branches are fused, current VL-Transformers can be roughly categorized into single-stream (e.g., UNIMO [26] and SOHO [27]) and dual-stream cross-modal Transformers (e.g., LXMERT [28] and ALBEF [29]). In general, VL-Transformers adopt a pretrain-then-finetune learning paradigm: these models are first pre-trained on large-scale multi-modal datasets (such as Conceptual Captions [30]) to learn universal cross-modal representations, and then fine-tuned on downstream tasks by transferring the rich representations from pre-training. In particular, the pretext tasks play an important role in pre-training, where masked language modeling, masked region prediction, and image-text matching are extensively studied. The fine-tuning step mirrors that of the BERT model [24], which includes a downstream task-specific input, output, and objective. VL-Transformers mainly serve the following three groups of downstream tasks: cross-modal matching, cross-modal reasoning, and vision-language generation [31]. The first group focuses on learning cross-modal correspondences between vision and language, such as image-text retrieval and visual referring expression. Reasoning tasks require VL-Transformers to perform language reasoning based on visual scenes, such as VQA. The last group aims to generate the targets of one modality given the other as input [32]; the desired visual or textual tokens are decoded in an auto-regressive manner.

III Proposed Method

Based on task intuition and human cognition, the answering and reasoning processes in VCR should be made cohesive and consistent. Nevertheless, existing methods often treat them separately, rendering the commonsense reasoning less convincing. To tackle these two processes jointly, we resort to aligning the visual attention of the question answering and rationale inference processes for better collaboration, and design our GIST method accordingly.

In the following, we first introduce the method’s intuition and background knowledge, followed by our proposed visual attention alignment method. We then detail its implementation on the model with vanilla attention and the recent VL-Transformer with self-attention, respectively.

Figure 2: Pipeline of the proposed method. Both Q\rightarrowA and QA\rightarrowR share the same image, which is encoded by the common image encoder. $t$ denotes the input tokens. The Q\rightarrowA process takes the image, question, and answer as input, and produces its attention map. Similarly, we can obtain the attention map from QA\rightarrowR. Our goal is to make these two attention maps as similar as possible, so that the two processes look at the same image regions for reasoning.
Figure 3: Architecture of our proposed method with vanilla attention network (Left) and VL-Transformers (Right). For the vanilla attention model, we sequentially perform object-wise and token-wise attention to obtain the final attention maps for all the detected objects. Pertaining to the VL-Transformers, the attention weights are extracted from the self-attention operation.

III-A Method Intuition

Given an image $I$ and a question $Q$ about this image, the goal of visual commonsense reasoning is to predict both the right answer $A_{+}$ and the correct rationale $R_{+}$ (the right and wrong candidates are denoted by subscripts $+$ and $-$, respectively). In general, VCR comprises two multiple-choice processes: question answering (Q\rightarrowA) and answer justification (QA\rightarrowR).

Q\rightarrowA aims to predict the correct answer $A_{+}$ from a set of answer choices $\mathcal{A}$ to the given question $Q$ upon the image $I$, which can be achieved by,

A_{+}=\mathop{\arg\max}\limits_{A_{i}\in\mathcal{A}} f(I,Q\mid A_{i}), \qquad (1)

where $f$ denotes the Q\rightarrowA model. Specifically, both the question $Q$ and the candidate answer $A_{i}$ are expressed as textual sentences, and the image $I$ consists of $N$ objects $\mathcal{O}=\{o_{i}\}_{i=1}^{N}$ detected by a Mask-RCNN model [33]. Previous methods train $f$ by minimizing the cross-entropy loss (we employ the single-instance loss rather than the batch-wise one for simplicity),

\mathcal{L}_{Q\mapsto A}=-y[A_{+}]\log\frac{\exp f(I,Q\mid A_{+})}{\sum_{i=1}^{|\mathcal{A}|}\exp f(I,Q\mid A_{i})}, \qquad (2)

where $y[A_{+}]$ denotes the index of the ground-truth answer.

QA\rightarrowR differs from Q\rightarrowA in that the input is now composed of the question $Q$ and its correct answer $A_{+}$ (a straightforward way is to concatenate these two together). The objective of QA\rightarrowR is to select the correct rationale $R_{+}$ from a rationale set $\mathcal{R}$ (see Figure 1),

R_{+}=\mathop{\arg\max}\limits_{R_{i}\in\mathcal{R}} g(I,Q,A_{+}\mid R_{i}), \qquad (3)

where $g$ denotes the QA\rightarrowR model. The optimization function is similarly defined as follows,

\mathcal{L}_{QA\mapsto R}=-y[R_{+}]\log\frac{\exp g(I,Q,A_{+}\mid R_{+})}{\sum_{i=1}^{|\mathcal{R}|}\exp g(I,Q,A_{+}\mid R_{i})}, \qquad (4)

where $y[R_{+}]$ represents the ground-truth rationale for the given image and question.
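Both objectives reduce to a standard cross-entropy over the candidate scores. Below is a minimal PyTorch sketch of Equations 2 and 4; the score function, tensor shapes, and batch size are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def choice_loss(scores: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over candidate choices, as in Equations 2 and 4.

    scores: [batch, num_choices] -- f(I, Q | A_i) or g(I, Q, A_+ | R_i).
    target: [batch]              -- index of the correct answer / rationale.
    """
    return F.cross_entropy(scores, target)

# Illustrative usage: 4 candidate answers (or rationales) per question.
scores = torch.randn(8, 4)
target = torch.randint(0, 4, (8,))
loss = choice_loss(scores, target)
```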

Note that both $f$ and $g$ share identical structures and are often trained separately [8, 34]. To bridge these two connected processes, we propose a visual attention alignment module next.

III-B Visual Attention Alignment

Inspired by human cognition, we argue that the visual evidence exploited in answering and justification should be the same. As these two processes share the same image (see Equation 1 and Equation 3), the fundamental visual attention thus offers a natural bridge for connecting them. In view of this, we propose to align the visual information based on the attention maps calculated by $f$ and $g$.

Visual attention is an integral component of current vision-language models [20, 35]. A typical VCR model often involves a visual attention module over the object set $\mathcal{O}$. In particular, the visual attention learns a set of attention scores $\{c_{i}\}_{i=1}^{N}$ over the image objects, where $\sum_{i=1}^{N}c_{i}=1$. A large $c_{i}$ usually indicates that the $i$-th object has more influence on answering the current question. As discussed earlier, there are two sets of attention maps: $\mathcal{C}_{Q\mapsto A}$ – the attention weights from model $f$, and $\mathcal{C}_{QA\mapsto R}$ – the attention weights from model $g$. To ensure that $f$ and $g$ employ similar image regions, our idea is to learn another module that drives $\mathcal{C}_{Q\mapsto A}$ and $\mathcal{C}_{QA\mapsto R}$ closer. Specifically, we implement this by pulling together the attention maps of the correct answer $A_{+}$ and the correct rationale $R_{+}$, while pushing away other negative pairs.

In this paper, we propose an attention alignment loss to make the Q\rightarrowA and QA\rightarrowR learn to agree on visual regions. We formalize our attention alignment loss below,

\begin{cases}
\mathcal{L}_{Q\mapsto A}^{Att}=-y[A_{+}]\log\frac{\exp s(\mathcal{C}_{Q\mapsto A}^{A_{+}},\,\mathcal{C}_{QA\mapsto R}^{R_{+}})}{\sum_{i=1}^{|\mathcal{A}|}\exp s(\mathcal{C}_{Q\mapsto A}^{A_{i}},\,\mathcal{C}_{QA\mapsto R}^{R_{+}})},\\
\mathcal{L}_{QA\mapsto R}^{Att}=-y[R_{+}]\log\frac{\exp s(\mathcal{C}_{QA\mapsto R}^{R_{+}},\,\mathcal{C}_{Q\mapsto A}^{A_{+}})}{\sum_{i=1}^{|\mathcal{R}|}\exp s(\mathcal{C}_{QA\mapsto R}^{R_{i}},\,\mathcal{C}_{Q\mapsto A}^{A_{+}})},\\
\mathcal{L}_{Align}=\mathcal{L}_{Q\mapsto A}^{Att}+\mathcal{L}_{QA\mapsto R}^{Att},
\end{cases} \qquad (5)

where $\mathcal{C}_{Q\mapsto A}^{A_{i}}$ represents the attention weights from model $f$ for the $i$-th answer $A_{i}$, $\mathcal{C}_{QA\mapsto R}^{R_{i}}$ denotes the attention weights from model $g$ for the $i$-th rationale $R_{i}$, and $s(\cdot,\cdot)$ is a similarity function. We detail the two implementations of $s(\cdot,\cdot)$ in the following (a code sketch of both choices is given after the list):

  • One intuitive approach to measure the similarity between two sets of attention weights is the dot product. We refer to this method as Align-Dot,

    s(\mathbf{c}_{p},\mathbf{c}_{t})=\mathbf{c}_{p}^{T}\mathbf{c}_{t}.
  • Inspired by learning-to-rank models in the field of information retrieval [36], we use a list-wise approach to align the ranking of attention weights and name this method Align-Rank. To this end, we first obtain the permutations of the two attention vectors, i.e., the prediction $\bm{\pi}_{p}$ and the target $\bm{\pi}_{t}$. Thereafter, we employ the NDCG metric optimization [37, 38] in our experiments:

    s(\bm{\pi}_{p},\bm{\pi}_{t})=\mathrm{NDCG}(\bm{\pi}_{p},\bm{\pi}_{t})=Z_{N}^{-1}\sum_{i=1}^{N}\frac{G(\pi_{pi})}{\log(1+\pi_{ti})}, \qquad (6)

    where $G(\pi_{pi})$ denotes a gain function, e.g., $G(\pi_{pi})=2^{\pi_{pi}}-1$, and $Z_{N}$ normalizes by the maximum of $\sum_{i}G(\pi_{pi})/\log(1+\pi_{ti})$, i.e., the value attained when the predicted attention permutation $\bm{\pi}_{p}$ is identical to the target one $\bm{\pi}_{t}$. Nevertheless, directly optimizing NDCG with back-propagation is impossible due to its non-differentiable nature. To address this problem, we smooth the permutation function with [38],

    \hat{\pi}_{pi}=1+\sum_{j\neq i}\frac{\exp\{-\alpha(c_{pi}-c_{pj})\}}{1+\exp\{-\alpha(c_{pi}-c_{pj})\}}, \qquad (7)

    where $\alpha$ is a hyper-parameter, and $c_{pi}$ denotes the attention value of the $i$-th object.
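A rough sketch of the two similarity choices follows, assuming each attention map is an $N$-dimensional vector over the detected objects; the function names, the default $\alpha$, and the exact normalisation are illustrative rather than the released implementation.

```python
import torch

def align_dot(c_p: torch.Tensor, c_t: torch.Tensor) -> torch.Tensor:
    """Align-Dot: dot product between two object-attention vectors of shape [N]."""
    return torch.dot(c_p, c_t)

def soft_rank(c: torch.Tensor, alpha: float = 10.0) -> torch.Tensor:
    """Smoothed permutation of Equation 7: differentiable approximate ranks."""
    diff = c.unsqueeze(1) - c.unsqueeze(0)       # c_i - c_j for all pairs, [N, N]
    pairwise = torch.sigmoid(-alpha * diff)      # exp{-a d} / (1 + exp{-a d})
    return 1.0 + pairwise.sum(dim=1) - pairwise.diagonal()  # drop the j = i term

def align_rank(c_p: torch.Tensor, c_t: torch.Tensor, alpha: float = 10.0) -> torch.Tensor:
    """Align-Rank: approximate NDCG (Equation 6) over the smoothed permutations.

    The gain/discount pairing follows Equation 6 as written; Z_N normalises by the
    value attained when the two permutations coincide.
    """
    pi_p, pi_t = soft_rank(c_p, alpha), soft_rank(c_t, alpha)

    def gain(pi):
        return 2.0 ** pi - 1.0                   # G(pi) = 2^pi - 1 (could be rescaled)

    z_n = (gain(pi_t) / torch.log(1.0 + pi_t)).sum()
    return (gain(pi_p) / torch.log(1.0 + pi_t)).sum() / z_n

# e.g. align_dot(torch.softmax(torch.randn(36), 0), torch.softmax(torch.randn(36), 0))
```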

By combining the aforementioned loss functions, the final objective becomes,

\mathcal{L}=\mathcal{L}_{Q\mapsto A}+\mathcal{L}_{QA\mapsto R}+\lambda\mathcal{L}_{Align}, \qquad (8)

where $\lambda$ is a trade-off hyper-parameter.
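Putting the pieces together, Equation 5 amounts to a pair of cross-entropies over similarity logits, and Equation 8 adds them to the two prediction losses with weight $\lambda$. A minimal per-instance sketch, assuming four candidates per process, 36 detected objects, and the Align-Dot similarity (the Align-Rank variant above could be substituted for `sim`):

```python
import torch
import torch.nn.functional as F

def alignment_loss(att_qa, att_qar, a_pos, r_pos, sim=torch.dot):
    """Equation 5: contrast the attention map of each correct choice against the
    candidates from the other process (here with the Align-Dot similarity).

    att_qa : [num_answers, N]    attention maps from f, one per candidate answer.
    att_qar: [num_rationales, N] attention maps from g, one per candidate rationale.
    a_pos, r_pos: indices of the ground-truth answer and rationale.
    """
    logits_a = torch.stack([sim(c, att_qar[r_pos]) for c in att_qa])
    logits_r = torch.stack([sim(c, att_qa[a_pos]) for c in att_qar])
    l_a = F.cross_entropy(logits_a.unsqueeze(0), torch.tensor([a_pos]))
    l_r = F.cross_entropy(logits_r.unsqueeze(0), torch.tensor([r_pos]))
    return l_a + l_r

# Equation 8: overall objective for a single instance (toy values).
att_qa = torch.softmax(torch.randn(4, 36), dim=-1)
att_qar = torch.softmax(torch.randn(4, 36), dim=-1)
l_qa, l_qar = torch.tensor(0.7), torch.tensor(0.9)   # stand-ins for Eqs. 2 and 4
lam = 1.0
total = l_qa + l_qar + lam * alignment_loss(att_qa, att_qar, a_pos=2, r_pos=1)
```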

Our method is applicable to both vanilla visual attention models and the most recent VL-Transformers. In the following, we show its implementation on these two typical models.

III-C Application on the Vanilla Attention Model

Before the prevalence of VL-Transformers in VCR, conventional methods all adopted the vanilla attention mechanism to focus on the most salient image regions based on the textual information. In view of this, we intend to explore whether our proposed visual attention alignment method works under such settings. Specifically, we take TAB-VCR [7] as a typical baseline to implement our method. Note that the two processes, i.e., Q\rightarrowA and QA\rightarrowR, share the same structure; we therefore use Q\rightarrowA as an example, as QA\rightarrowR can be easily extrapolated.

III-C1 Overall Framework

In this subsection, we first introduce the overall framework of TAB-VCR.

Image & Language Encoder. Images in the VCR dataset [2] come with objects detected by Mask-RCNN [33]. We leverage this fine-grained information and utilize a pre-trained Convolutional Neural Network (CNN) model [39] to extract the object features $\mathcal{O}=\{\bm{o}_{1},\dots,\bm{o}_{N}\}$.

Pertaining to the textual input, we first concatenate the question $q$ and each answer $a_{i}$, and then employ a pre-trained word embedding to obtain the token embeddings. Note that each sentence in VCR includes both text and some object tags, as shown in Figure 3. Following previous studies [2, 7], we take the word embeddings and the object features, in their original order, as input to a bidirectional RNN [40]. Specifically, if an input token is a tag referring to an object $o_{t}$, the corresponding object feature $\bm{o}_{t}$ is used; otherwise, the feature of the entire image (i.e., the averaged object feature) is used. The query and response can thereby be encoded into a sequence of hidden states $\{\bm{h}_{1},\dots,\bm{h}_{M}\}$ by the RNN model:

\bm{h}_{t}=\mathsf{RNN}([\bm{t}_{i},\bm{o}_{t_{i}}];\bm{h}_{t-1}), \qquad (9)

where $\bm{o}_{t_{i}}$ denotes the $t_{i}$-th object's feature if $\bm{t}_{i}$ is a tag, and the averaged feature of all objects otherwise.
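A sketch of this tag-aware encoding: a bidirectional LSTM stands in for the unspecified RNN, and the dimensions and the tag-index convention (-1 for non-tag tokens) are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Encode [token embedding; grounded object feature] pairs (Equation 9)."""

    def __init__(self, d_word=768, d_obj=512, d_hidden=512):
        super().__init__()
        self.rnn = nn.LSTM(d_word + d_obj, d_hidden,
                           batch_first=True, bidirectional=True)

    def forward(self, tokens, objects, tag_ids):
        """tokens:  [B, M, d_word] word embeddings of the query/response.
        objects: [B, N, d_obj]  detected object features.
        tag_ids: [B, M]         object index per token, or -1 if not a tag.
        """
        mean_obj = objects.mean(dim=1, keepdim=True)          # fallback image feature
        gathered = objects.gather(
            1, tag_ids.clamp(min=0).unsqueeze(-1).expand(-1, -1, objects.size(-1)))
        grounded = torch.where(tag_ids.unsqueeze(-1) >= 0, gathered, mean_obj)
        h, _ = self.rnn(torch.cat([tokens, grounded], dim=-1))  # [B, M, 2*d_hidden]
        return h
```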

Classifier. After the visual reasoning between the given question and image (which will be detailed in the next subsection), we treat VCR as a multi-class classification problem. In particular, we use the last hidden state $\bm{h}_{M}$ from the RNN and the refined image feature $\hat{\bm{o}}$ as the representations of the question-answer pair and the image, respectively. Our classifier is implemented with a multi-layer perceptron (MLP) [41] to compute a score for the candidate answers:

\hat{y}_{i}=\bm{W}_{1}\,\sigma(\bm{W}_{0}[\bm{h}_{M},\hat{\bm{o}}]), \qquad (10)

where $\sigma$ is the LeakyReLU [39] activation function, and $\bm{W}_{0}$ and $\bm{W}_{1}$ are the learned weight matrices in the MLP. We omit the bias vectors for simplicity.
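Equation 10 thus reduces to a two-layer MLP over the concatenated text and image representations; the sketch below uses assumed dimensions and keeps the bias terms for completeness.

```python
import torch
import torch.nn as nn

class AnswerScorer(nn.Module):
    """Score one candidate from [h_M; o_hat] as in Equation 10."""

    def __init__(self, d_text=1024, d_img=512, d_hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_text + d_img, d_hidden),   # W_0
            nn.LeakyReLU(),                        # sigma
            nn.Linear(d_hidden, 1),                # W_1
        )

    def forward(self, h_m: torch.Tensor, o_hat: torch.Tensor) -> torch.Tensor:
        # h_m: [B, d_text], o_hat: [B, d_img] -> one score per candidate, [B, 1]
        return self.mlp(torch.cat([h_m, o_hat], dim=-1))
```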

III-C2 Re-Attention

A typical model often extracts the visual features with a CNN model [42] and takes the output $\mathcal{O}$ as the vision features for Equation 9. Differently, we propose a re-attention module to assign distinctive attention weights, which serves as an important ingredient of our attention alignment goal. Specifically, our re-attention module consists of attention computation in two directions: object-wise and token-wise. The former takes each token feature as a query and learns the attention weights for all the objects. In contrast, the token-wise attention collects the most informative signals of each token based on the fused token information from the former step.

Object-wise Attention. It is intuitive that, for each token, different objects often contribute distinctively to the learning of the textual features. For example, in Figure 1, the [person5] object is more important than the other objects for learning the token ‘smiling’ in the question. In light of this, we let each token attend to the objects as follows,

\begin{cases}
\bar{\mathbf{c}}_{o_{t_{i}}}=\mathsf{MLP}(\mathbf{t}_{i})^{T}\,\mathsf{MLP}([\mathbf{o}_{1},\mathbf{o}_{2},\cdots,\mathbf{o}_{N}]),\\
\mathbf{c}_{o_{t_{i}}}=\mathsf{SoftMax}(\bar{\mathbf{c}}_{o_{t_{i}}}),
\end{cases} \qquad (11)

where $\mathbf{c}_{o_{t_{i}}}$ and $\bar{\mathbf{c}}_{o_{t_{i}}}\in\mathbb{R}^{N}$. In this way, we obtain $M$ attention maps over all the objects, where $M$ is the number of tokens.

Token-wise Attention. After the object-wise attention operation, we then employ another attention module to estimate the importance of each token with respect to the overall textual feature. To implement this, we take the output of the RNN model, $\mathbf{h}_{M}$, as a query, and perform attention over all the token features,

\begin{cases}
\bar{\mathbf{c}}_{t}=\mathsf{MLP}(\mathbf{h}_{M})^{T}\,\mathsf{MLP}([\mathbf{h}_{1},\mathbf{h}_{2},\cdots,\mathbf{h}_{M}]),\\
\mathbf{c}_{t}=\mathsf{SoftMax}(\bar{\mathbf{c}}_{t}),
\end{cases} \qquad (12)

where $\mathbf{c}_{t}$ and $\bar{\mathbf{c}}_{t}\in\mathbb{R}^{M}$. Thereafter, we employ these attention weights, i.e., the token importance $\mathbf{c}_{t}$, to weight the set of object-wise attention maps, yielding the overall attention weights over objects $\mathbf{c}_{o}$ and the refined image feature $\hat{\mathbf{o}}$,

\mathbf{c}_{o}=\sum_{i=1}^{M}c_{t_{i}}\times\mathbf{c}_{o_{t_{i}}},\qquad \hat{\mathbf{o}}=\sum_{i=1}^{N}c_{o_{i}}\times\mathbf{o}_{i}. \qquad (13)
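A compact sketch of this two-stage re-attention (Equations 11-13); single linear layers stand in for the MLPs and the projection sizes are assumptions.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Object-wise then token-wise attention, returning c_o and o_hat."""

    def __init__(self, d_tok=512, d_obj=512, d_att=256):
        super().__init__()
        self.q_obj = nn.Linear(d_tok, d_att)   # token query for object-wise attention
        self.k_obj = nn.Linear(d_obj, d_att)
        self.q_tok = nn.Linear(d_tok, d_att)   # h_M query for token-wise attention
        self.k_tok = nn.Linear(d_tok, d_att)

    def forward(self, tokens, h_last, objects):
        """tokens: [B, M, d_tok], h_last: [B, d_tok], objects: [B, N, d_obj]."""
        # Equation 11: one attention map over the N objects per token.
        c_obj = torch.softmax(
            self.q_obj(tokens) @ self.k_obj(objects).transpose(1, 2), dim=-1)  # [B, M, N]
        # Equation 12: importance of each token w.r.t. the sentence feature h_M.
        c_tok = torch.softmax(
            (self.q_tok(h_last).unsqueeze(1) * self.k_tok(tokens)).sum(-1), dim=-1)  # [B, M]
        # Equation 13: aggregate into one object attention map and the image feature.
        c_o = (c_tok.unsqueeze(-1) * c_obj).sum(dim=1)    # [B, N]
        o_hat = (c_o.unsqueeze(-1) * objects).sum(dim=1)  # [B, d_obj]
        return c_o, o_hat
```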

III-C3 Attention Alignment

We then obtain the refined image feature used in Equation 10 and apply the final classifier to predict the right answer or rationale. Thereafter, following Equation 13, the two attention sets can be easily collected from $\mathbf{c}_{o}$: one is $\mathcal{C}_{Q\mapsto A}$, and the other is $\mathcal{C}_{QA\mapsto R}$. The visual attention alignment from Sec. III-B can thereby be performed.

TABLE I: Performance comparison on the validation and testing sets. The best results are highlighted in bold.
Model                    | Q\rightarrowA (valid / test) | QA\rightarrowR (valid / test) | Q\rightarrowAR (valid / test)
Chance                   | 25.0 / 25.0 | 25.0 / 25.0 |  6.2 /  6.2
RevisitedVQA [43]        | 39.4 / 40.5 | 34.0 / 33.7 | 13.5 / 13.8
BUTD [20]                | 42.8 / 44.1 | 25.1 / 25.1 | 10.7 / 11.0
MLB [44]                 | 45.5 / 46.2 | 36.1 / 36.8 | 17.0 / 17.2
MUTAN [45]               | 44.4 / 45.5 | 32.0 / 32.2 | 14.6 / 14.6
R2C [2]                  | 63.8 / 65.1 | 67.2 / 67.3 | 43.1 / 44.0
CCN [8]                  | 67.4 / 68.5 | 70.6 / 70.5 | 47.7 / 48.4
HGL [6]                  | 69.4 / 70.1 | 70.6 / 70.8 | 49.1 / 49.8
TAB-VCR [7]              | 69.5 / 70.5 | 71.6 / 71.6 | 50.1 / 50.8
VL-BERT [12]             | 72.6 / 73.4 | 74.0 / 74.5 | 54.0 / 54.8
UNITER [11]              | 74.4 / 75.5 | 76.9 / 77.3 | 57.5 / 58.6
GIST (Vanilla)           | 70.5 / 71.2 | 72.5 / 72.0 | 51.5 / 51.4
GIST (VL-Transformer)    | 74.9 / 75.6 | 77.0 / 77.5 | 58.1 / 58.8
Human                    |   -  / 91.0 |   -  / 93.0 |   -  / 85.0

III-D Application on the VL-Transformer

VL-Transformers have been widely studied over the past few years [46, 47]. In this section, we show how our attention alignment method works under such self-attention settings.

III-D1 Overall Framework

The structure of a typical single-stream VL-Transformer is illustrated in Figure 3.

Transformer Encoder. As can be observed, the input sequence starts with a special classification token (i.e., [CLS]), continues with the query and response tokens and the visual tokens, and ends with a special ending token (i.e., [END]). A special separation token (i.e., [SEP]) is introduced between every two parts of the input. Following the practice in BERT [24], the input textual sentence is first split into tokens by the WordPiece tokenizer [48], which are then transformed into vectors by the embedding layer. Pertaining to the vision inputs, each token is represented by the detected object features, the same as in the vanilla attention model. After obtaining these embeddings, we add the segment and position embeddings to them, so that the sequential information can be encoded into the Transformer model. Thereafter, these embeddings are fed into several Transformer blocks, each of which consists of a self-attention layer, a feed-forward network, and layer normalization operations.
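For clarity, a simplified sketch of how such a single-stream input could be assembled; the vocabulary size, feature dimensions, and embedding tables are illustrative stand-ins rather than any specific pre-trained model's API.

```python
import torch
import torch.nn as nn

class SingleStreamInput(nn.Module):
    """[CLS] query [SEP] response [SEP] visual tokens [END], plus segment/position."""

    def __init__(self, vocab_size=30522, d=768, d_obj=2048, max_len=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)
        self.obj_proj = nn.Linear(d_obj, d)      # project region features to width d
        self.seg_emb = nn.Embedding(2, d)        # 0: text segment, 1: vision segment
        self.pos_emb = nn.Embedding(max_len, d)

    def forward(self, text_ids, obj_feats):
        """text_ids: [B, T] WordPiece ids (incl. special tokens); obj_feats: [B, N, d_obj]."""
        text = self.word_emb(text_ids)                           # [B, T, d]
        vision = self.obj_proj(obj_feats)                        # [B, N, d]
        x = torch.cat([text, vision], dim=1)                     # [B, T+N, d]
        seg = torch.cat([torch.zeros_like(text_ids),
                         torch.ones(obj_feats.shape[:2], dtype=torch.long,
                                    device=text_ids.device)], dim=1)
        pos = torch.arange(x.size(1), device=x.device).unsqueeze(0)
        return x + self.seg_emb(seg) + self.pos_emb(pos)
```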

Classifier. The vision and language features are fused and interact within the Transformer model, where visual reasoning is also expected to be performed. Finally, the output of the last block at the [CLS] token is fed to a Softmax classifier to predict whether the given response is the correct choice.

III-D2 Re-Attention

The key to a Transformer model is multi-head self-attention. Given the query and key matrices $\mathbf{Q}$ and $\mathbf{K}$, the single-head attention is formally defined as follows,

\mathsf{SA}(\mathbf{Q},\mathbf{K})=\mathsf{SoftMax}\Big(\frac{\mathbf{Q}^{T}\mathbf{K}}{\sqrt{d_{k}}}\Big), \qquad (14)

where $d_{k}$ is the dimension of the keys and values and acts as a scaling factor. The more commonly used multi-head self-attention is formulated as,

\mathsf{MSA}(\mathbf{Q},\mathbf{K})=\mathsf{Concat}(h_{1},\cdots,h_{k})\,\mathbf{W}, \qquad (15)

where $k$ is the number of attention heads, and each head $h_{i}$ is calculated by the $\mathsf{SA}$ function in Equation 14.
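For reference, a compact multi-head self-attention sketch that also exposes the attention weights needed later for alignment; the width and head count are assumptions, and unlike Equation 14 it includes the value projection so the block is usable end to end.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Standard multi-head self-attention that also returns the attention maps."""

    def __init__(self, d=768, n_heads=12):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d // n_heads
        self.qkv = nn.Linear(d, 3 * d)
        self.proj = nn.Linear(d, d)

    def forward(self, x):
        B, L, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to [B, heads, L, d_head]
        q, k, v = (t.view(B, L, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, L, d)   # concat heads (Eq. 15)
        return self.proj(out), attn                         # attn: [B, heads, L, L]
```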

One advantage of VL-Transformers is that they offer a holistic attention estimation through the [CLS] token. In this way, we do not have to aggregate the information from all textual tokens in a first step, as Sec. III-C2 does.

III-D3 Attention Alignment

To achieve the attention alignment goal, we first average the attention weights over the $k$ heads, thereby obtaining an attention map for the [CLS] token at each of the $L$ layers. We then extract the attention to the visual tokens, yielding an attention matrix $\mathbf{C}_{o}\in\mathbb{R}^{L\times N}$. Thereafter, we perform the visual attention alignment for each layer following Sec. III-B.
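Given the per-layer attention tensors of shape [batch, heads, seq, seq] (as returned by blocks like the sketch above), the [CLS]-to-vision attention matrix can be collected roughly as follows; the assumption is that the [CLS] token sits at position 0 and the visual tokens occupy a contiguous span.

```python
import torch

def cls_to_vision_attention(layer_attns, vis_start, vis_end):
    """Average the heads and keep the [CLS] row restricted to the visual tokens.

    layer_attns: list of L tensors, each [B, heads, seq, seq].
    Returns C_o with shape [B, L, N], re-normalised over the N visual tokens.
    """
    maps = []
    for attn in layer_attns:
        cls_row = attn.mean(dim=1)[:, 0, vis_start:vis_end]   # [B, N]
        maps.append(cls_row / cls_row.sum(dim=-1, keepdim=True).clamp_min(1e-8))
    return torch.stack(maps, dim=1)                           # [B, L, N]
```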

IV Experiments

Figure 4: Convergence analysis of baselines and our GIST model.

IV-A Datasets and Evaluation Protocols

We conducted extensive experiments on the VCR benchmark, a large-scale dataset released alongside this task. The images are extracted from movie clips in LSMDC [49] and MovieClips (youtube.com/user/movieclips), and the objects inside the images are detected via the Mask-RCNN model [33]. We used the official dataset split, where the numbers of questions for training, validation, and testing are 212,923, 26,534, and 25,263, respectively. For each question, four answers are given with only one being correct, and there are also four rationale choices among which only one makes sense.

Regarding the evaluation metric, we used the popular classification accuracy for Q\rightarrowA, QA\rightarrowR, and Q\rightarrowAR (for Q\rightarrowAR, a prediction is correct only when both the answer and the rationale are selected correctly). The ground-truth labels are available for the training and validation sets [2]. Therefore, we reported the performance of our best model on the testing set once and performed all other experiments on the validation set.
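The joint Q\rightarrowAR metric simply requires both sub-predictions to be correct; a small helper for clarity, assuming the tensors hold predicted and ground-truth choice indices.

```python
import torch

def qar_accuracy(ans_pred, ans_gt, rat_pred, rat_gt):
    """Q->AR accuracy: a sample counts only if answer AND rationale are both right."""
    both = (ans_pred == ans_gt) & (rat_pred == rat_gt)
    return both.float().mean().item()

# e.g. qar_accuracy(torch.tensor([1, 2]), torch.tensor([1, 0]),
#                   torch.tensor([3, 3]), torch.tensor([3, 3]))  -> 0.5
```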

IV-B Implementation Details

The PyTorch toolkit [50] is leveraged to implement our models, and all the experiments were conducted on a single GeForce RTX 2080 Ti GPU. The specific details of each model are shown below.

Vanilla attention model. As for the input features, we employed the pre-trained object-level features extracted by TAB-VCR [7] as the visual features, and the pre-trained token embeddings from BERT as the textual features. All the trainable parameters are initialized with the default PyTorch settings. The training batch size is set to 96 and the alignment loss weight $\lambda$ is set to 1.0. The parameters are optimized with the Adam [51] optimizer with an initial learning rate of $2\times 10^{-3}$.
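For concreteness, the reported optimisation settings translate roughly into the snippet below; the placeholder module only stands in for the full vanilla attention network.

```python
import torch
import torch.nn as nn

# Illustrative configuration matching the reported hyper-parameters.
BATCH_SIZE = 96
ALIGN_WEIGHT = 1.0                          # lambda in Equation 8

model = nn.Linear(1536, 1)                  # placeholder for the GIST (vanilla) network
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
```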

VL-Transformers. In order to verify the generalization capability of our attention alignment mechanism on VL-Transformers, we employ several classical VL-Transformer models [11, 12, 13] as the backbone. For a fair comparison, we strictly followed the original implementations.

Figure 5: Performance comparison of different models w/ and w/o our attention alignment module.

IV-C Overall Performance Comparison

We evaluated our method by comparing its performance with three kinds of methods: (1) advanced VQA models; (2) traditional VCR baselines; and (3) VL-Transformers in VCR. The results on both the validation and testing sets are reported in Table I, and the key observations are as follows.

  • First, a significant performance gap exists between traditional strong VQA models and VCR methods, especially for QA\rightarrowR. This is because VCR demands higher-order reasoning capability, which differs from the simple recognition in VQA. In addition, predicting the right rationale is even more challenging.

  • Second, compared to the task-specific VCR models such as R2C [2] and CCN [8], VL-Transformers demonstrate advantages in all metrics, indicating the effectiveness of the universal cross-modal representation learned by pre-training.

  • Lastly, our method achieves the best performance over these state-of-the-art models. In particular, with our attention alignment mechanism, both the vanilla attention and VL-Transformer VCR baselines achieve consistent gains on the validation and test sets. For example, compared with TAB-VCR, absolute improvements of 1.1%, 1.2% and 1.6% can be observed on Q\rightarrowA, QA\rightarrowR and Q\rightarrowAR, respectively.

TABLE II: Effectiveness of different alignment losses.
Alignment Loss Q\rightarrowA QA\rightarrowR Q\rightarrowAR
Vanilla attention 68.8 70.8 48.9
 w/ Align-Dot 70.5 (+1.7) 72.5 (+1.7) 51.4 (+2.5)
 w/ Align-Rank 69.4 (+0.6) 71.9 (+1.1) 50.0 (+1.1)
VL-BERT 72.6 74.0 54.0
 w/ Align-Dot 73.2 (+0.6) 74.6 (+0.6) 54.9 (+0.9)
 w/ Align-Rank 73.2 (+0.6) 74.7 (+0.7) 54.9 (+0.9)
UNITER 74.4 76.9 57.5
 w/ Align-Dot 74.7 (+0.3) 77.2 (+0.3) 58.0 (+0.5)
 w/ Align-Rank 74.9 (+0.5) 77.0 (+0.1) 58.1 (+0.6)
VILLA 75.4 78.7 59.5
 w/ Align-Dot 75.9 (+0.5) 79.0 (+0.3) 60.2 (+0.7)
 w/ Align-Rank 76.0 (+0.6) 78.8 (+0.1) 60.1 (+0.6)
TABLE III: Ablation study of our re-attention module on the validation set.
Model Q\rightarrowA QA\rightarrowR Q\rightarrowAR
Full model 70.5 72.5 51.4
w/o token-wise Att 70.1 71.5 50.4
w/o Att 68.7 71.2 49.1

IV-D Ablation Study

To better illustrate the effectiveness of our model in detail, we conducted experiments on the essential parts of our method and reported the results on the validation set below.

Efficacy of the attention alignment. Figure 5 demonstrates the effect of our attention alignment mechanism for both the vanilla attention model and VL-Transformers. From this figure, we can observe that, among all the variants, our attention alignment mechanism consistently improves the base models on all metrics by a large margin. Taking the vanilla attention model as an example, our attention alignment mechanism boosts it by 1.7% (Q\rightarrowA), 1.7% (QA\rightarrowR), and 2.5% (Q\rightarrowAR). One main limitation of these baselines is that they neglect the visual consistency and the interactions between answering and reasoning. In contrast, our visual attention alignment offers a bridge to connect these two processes and significantly enhances the baseline performance.

Figure 4 shows the convergence of the answering and reasoning accuracy for the baselines and our GIST model. It can be seen that, in the early stage, the performance of the baselines increases comparably to, or faster than, that of our GIST model. Nevertheless, with more training steps, GIST outperforms all the baselines by a significant margin. One possible reason is that the baselines show certain disadvantages on difficult instances as training proceeds. In contrast, by aligning the visual attention between Q\rightarrowA and QA\rightarrowR, the model learns to leverage more accurate visual information for visual understanding, leading to larger improvements.

Figure 6: Accuracy change with respect to the trade-off hyper-parameter $\lambda$. Q\rightarrowA and QA\rightarrowR use the left $y$-axis while Q\rightarrowAR uses the right $y$-axis.
Figure 7: Attention similarity statistics of baselines and our GIST model.
Figure 8: Attention weight distribution of the vanilla attention model and our proposed GIST model. The input question and image are shown in the first column, followed by the attention distribution of the baseline in Q\rightarrowA and QA\rightarrowR (columns 2-3), and GIST in Q\rightarrowA and QA\rightarrowR (columns 4-5).

Align-Dot vs. Align-Rank. As discussed in Section III-B, we applied two candidate alignment loss functions to align the two attention maps from the two processes. Table II reports the results obtained with the different alignment losses. One can see that both loss functions bring certain performance improvements over the respective baselines. Specifically, for the vanilla attention model, Align-Dot, which imposes a stricter constraint, achieves significantly better results. One possible reason is that the carefully designed re-attention module can capture the attention precisely.

Re-Attention of the vanilla attention model. In order to investigate the effectiveness of our re-attention module in the vanilla attention model, we designed two variants and reported the results in Table III. After removing the token-wise attention module from our model, we can observe performance degradation on all three metrics. It validates that different textual tokens contribute distinctively to visual feature learning. We then replaced all the attention computation with the mean operation and showed the results in the last row of this table. One can see that the model performance drops sharply compared with the previous two.

IV-E Hyper-parameter Study

In this section, we study the influence of the trade-off hyper-parameter $\lambda$ on the performance of our GIST model. The results on both the vanilla attention model and UNITER are shown in Figure 6. As we increase the loss weight $\lambda$, the model performance keeps improving, which demonstrates the effectiveness of our designed attention alignment mechanism. However, a too-large weight leads to deteriorating results. For instance, a loss weight larger than 0.4 hurts the performance of UNITER.

IV-F Qualitative Results

It is expected that our attention alignment operation pulls the attention maps of the two processes closer, so that the model prediction can be made according to consistent visual cues. To justify this, we present qualitative results from the following two angles.

Similarity of attention maps. We leveraged the Align-Dot to perform the attention alignment and estimated the attention similarity between Q\rightarrowA and QA\rightarrowR. The histogram of the similarity values is shown in Figure 7. As can be observed, the attention similarity from baseline models is mostly less than 0.2. In contrast, our GIST model yields more consistent visual reasoning results as the similarity is increased significantly.

Attention map visualization. To gain deeper insight into our attention alignment mechanism, we also provide some qualitative examples in Figure 8 to compare the attention distributions of the baseline and our method. Regarding the first instance, the baseline puts more attention on the lamp regions while ignoring the critical person areas. However, though the answer is wrongly predicted, the rationale is unexpectedly selected correctly. This supports our argument that the answer prediction and rationale selection should depend on the same evidence; otherwise, the results can be unpredictable. As to the second instance, without the attention alignment, the baseline focuses more on the chair1 object, which is less relevant to the given question. Our GIST model corrects this mistake and obtains the right selection for both Q\rightarrowA and QA\rightarrowR. The last one shows that the attention weights are distributed incorrectly for both processes. With the consistent alignment of our method, both Q\rightarrowA and QA\rightarrowR can be accurately predicted.

V Conclusion and Future Work

In this paper, we propose a novel vision attention alignment method to bridge the two intertwined processes in visual commonsense reasoning. In particular, a re-attention module is first introduced to model the fine-grained inter-modality interactions, followed by the attention alignment equipped with two alternative loss functions. We apply this method to both the conventional vanilla attention model as well as the recent strong VL-Transformers. Through the qualitative and quantitative experiments on the benchmark dataset, the effectiveness of our proposed method is extensively demonstrated.

In the future, we plan to further explore this direction, i.e., coordinating the two processes in VCR within a single framework. More techniques that exploit the close connection between these two processes are worth further investigation.

References

  • [1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “Vqa: Visual question answering,” in IEEE International Conference on Computer Vision.   IEEE, 2015, pp. 2425–2433.
  • [2] R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi, “From recognition to cognition: Visual commonsense reasoning,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2019, pp. 6720–6731.
  • [3] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the v in vqa matter: Elevating the role of image understanding in visual question answering,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2017, pp. 6325–6334.
  • [4] Y. Guo, L. Nie, H. Cheng, Z. Cheng, M. S. Kankanhalli, and A. D. Bimbo, “On modality bias recognition and reduction,” ACM Transactions on Multimedia Computing, Communications, and Applications, 2022.
  • [5] Y. Guo, L. Nie, Z. Cheng, Q. Tian, and M. Zhang, “Loss re-scaling VQA: revisiting the language prior problem from a class-imbalance view,” IEEE Transactions on Image Processing, vol. 31, pp. 227–238, 2022.
  • [6] W. Yu, J. Zhou, W. Yu, X. Liang, and N. Xiao, “Heterogeneous graph learning for visual commonsense reasoning,” in Advances in Neural Information Processing Systems, 2019, pp. 2765–2775.
  • [7] J. Lin, U. Jain, and A. G. Schwing, “Tab-vcr: Tags and attributes based vcr baselines,” in Advances in Neural Information Processing Systems, 2019, pp. 15 589–15 602.
  • [8] A. Wu, L. Zhu, Y. Han, and Y. Yang, “Connective cognition network for directional visual commonsense reasoning,” in Advances in Neural Information Processing Systems, 2019, pp. 5670–5680.
  • [9] G. Li, N. Duan, Y. Fang, M. Gong, and D. Jiang, “Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training,” in AAAI Conference on Artificial Intelligence.   AAAI, 2020, pp. 11 336–11 344.
  • [10] J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” in Advances in Neural Information Processing Systems, 2019, pp. 13–23.
  • [11] Y. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, and J. Liu, “Uniter: Universal image-text representation learning,” in European Conference on Computer Vision.   Springer, 2020, pp. 104–120.
  • [12] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai, “Vl-bert: Pre-training of generic visual-linguistic representations,” in International Conference on Learning Representations, 2020.
  • [13] Z. Gan, Y. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu, “Large-scale adversarial training for vision-and-language representation learning,” in Advances in Neural Information Processing Systems, 2020.
  • [14] Y. Zhu, O. Groth, M. S. Bernstein, and L. Fei-Fei, “Visual7w: Grounded question answering in images,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2016, pp. 4995–5004.
  • [15] D. Song, S. Ma, Z. Sun, S. Yang, and L. Liao, “Kvl-bert: Knowledge enhanced visual-and-linguistic bert for visual commonsense reasoning,” Knowledge-Based Systems, vol. 230, p. 107408, 2021.
  • [16] R. Zellers, J. Lu, X. Lu, Y. Yu, Y. Zhao, M. Salehi, A. Kusupati, J. Hessel, A. Farhadi, and Y. Choi, “Merlot reserve: Neural script knowledge through vision and language and sound,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2022, pp. 16 354–16 366.
  • [17] Z. Liu, R. Feng, H. Chen, S. Wu, Y. Gao, Y. Gao, and X. Wang, “Temporal feature alignment and mutual information maximization for video-based human pose estimation,” in Computer Vision and Pattern Recognition.   IEEE, 2022, pp. 10 996–11 006.
  • [18] K. J. Shih, S. Singh, and D. Hoiem, “Where to look: Focus regions for visual question answering,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2016, pp. 4613–4621.
  • [19] J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical question-image co-attention for visual question answering,” in Advances in Neural Information Processing Systems, 2016, pp. 289–297.
  • [20] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2018, pp. 6077–6086.
  • [21] D. Teney, P. Anderson, X. He, and A. van den Hengel, “Tips and tricks for visual question answering: Learnings from the 2017 challenge,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2018, pp. 4223–4232.
  • [22] T. Qiao, J. Dong, and D. Xu, “Exploring human-like attention supervision in visual question answering,” in AAAI Conference on Artificial Intelligence.   AAAI, 2018, pp. 7300–7307.
  • [23] Y. Zhang, J. C. Niebles, and A. Soto, “Interpretable visual question answering by visual grounding from attention supervision mining,” in IEEE Winter Conference on Applications of Computer Vision.   IEEE, 2019, pp. 349–357.
  • [24] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in North American Chapter of the Association for Computational Linguistics.   ACL, 2019, pp. 4171–4186.
  • [25] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
  • [26] W. Li, C. Gao, G. Niu, X. Xiao, H. Liu, J. Liu, H. Wu, and H. Wang, “Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning,” in Annual Meeting of the Association for Computational Linguistics.   ACL, 2021, pp. 2592–2607.
  • [27] Z. Huang, Z. Zeng, Y. Huang, B. Liu, D. Fu, and J. Fu, “Seeing out of the box: End-to-end pre-training for vision-language representation learning,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2021, pp. 12 976–12 985.
  • [28] H. Tan and M. Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” in Empirical Methods in Natural Language Processing.   ACL, 2019, pp. 5099–5110.
  • [29] J. Li, R. R. Selvaraju, A. Gotmare, S. R. Joty, C. Xiong, and S. C. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” in Advances in Neural Information Processing Systems, 2021, pp. 9694–9705.
  • [30] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Annual Meeting of the Association for Computational Linguistics.   ACL, 2018, pp. 2556–2565.
  • [31] Y. Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of vision-language pre-trained models,” in International Joint Conference on Artificial Intelligence.   Morgan Kaufmann, 2022, pp. 5436–5443.
  • [32] J. Cho, J. Lu, D. Schwenk, H. Hajishirzi, and A. Kembhavi, “X-lxmert: Paint, caption and answer questions with multi-modal transformers,” in Empirical Methods in Natural Language Processing.   ACL, 2020, pp. 8785–8805.
  • [33] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask r-cnn,” in IEEE International Conference on Computer Vision.   IEEE, 2017, pp. 2980–2988.
  • [34] X. Zhang, F. Zhang, and C. Xu, “Explicit cross-modal representation learning for visual commonsense reasoning,” IEEE Transactions on Multimedia, vol. 24, pp. 2986–2997, 2022.
  • [35] C. Liu, Z. Mao, T. Zhang, A. Liu, B. Wang, and Y. Zhang, “Focus your attention: A focal attention for multimodal learning,” IEEE Transactions on Multimedia, vol. 24, pp. 103–115, 2022.
  • [36] T. Liu, “Learning to rank for information retrieval,” Foundations and Trends® in Information Retrieval, vol. 3, no. 3, pp. 225–331, 2009.
  • [37] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of ir techniques,” ACM Transactions on Information Systems, vol. 20, no. 4, pp. 422–446, 2002.
  • [38] T. Qin, T. Liu, and H. Li, “A general approximation framework for direct optimization of information retrieval measures,” Information Retrieval, vol. 13, no. 4, pp. 375–397, 2010.
  • [39] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2016, pp. 770–778.
  • [40] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” in Empirical Methods in Natural Language Processing.   ACL, 2014, pp. 1724–1734.
  • [41] K. Hornik, M. B. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
  • [42] Z. Liu, Z. Wang, L. Zhang, R. R. Shah, Y. Xia, Y. Yang, and X. Li, “Fastshrinkage: Perceptually-aware retargeting toward mobile platforms,” in ACM Multimedia Conference.   ACM, 2017, pp. 501–509.
  • [43] A. Jabri, A. Joulin, and L. van der Maaten, “Revisiting visual question answering baselines,” in European Conference on Computer Vision.   Springer, 2016, pp. 727–739.
  • [44] J. Kim, K. W. On, W. Lim, J. Kim, J. Ha, and B. Zhang, “Hadamard product for low-rank bilinear pooling,” in International Conference on Learning Representations, 2017.
  • [45] H. Ben-younes, R. Cadène, M. Cord, and N. Thome, “Mutan: Multimodal tucker fusion for visual question answering,” in IEEE International Conference on Computer Vision.   IEEE, 2017, pp. 2631–2639.
  • [46] J. Tu, X. Liu, Z. Lin, R. Hong, and M. Wang, “Differentiable cross-modal hashing via multimodal transformers,” in International Conference on Multimedia.   ACM, 2022, pp. 453–461.
  • [47] H. Zhong, J. Chen, C. Shen, H. Zhang, J. Huang, and X. Hua, “Self-adaptive neural module transformer for visual question answering,” IEEE Transactions on Multimedia, vol. 23, pp. 1264–1273, 2021.
  • [48] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s neural machine translation system: Bridging the gap between human and machine translation,” CoRR, vol. abs/1609.08144, 2016.
  • [49] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. J. Pal, H. Larochelle, A. C. Courville, and B. Schiele, “Movie description,” International Journal of Computer Vision, vol. 123, no. 1, pp. 94–120, 2017.
  • [50] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
  • [51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
Zhenyang Li received the B.Eng. and master's degrees from Shandong University and the University of Chinese Academy of Sciences, respectively. He is currently pursuing the Ph.D. degree with the School of Computer Science and Technology, Shandong University, supervised by Prof. Liqiang Nie. His research interest is multi-modal computing, especially visual question answering.
Yangyang Guo (Member, IEEE) is currently a research fellow with the National University of Singapore. He has authored or co-authored several papers in top journals, such as IEEE TIP, TMM, TKDE, TNNLS, and ACM TOIS. He is a regular reviewer for journals including IEEE TIP, TMM, TKDE, TCSVT, ACM TOIS, and ToMM. He was recognized as an outstanding reviewer for IEEE TMM and WSDM 2022.
Kejie Wang is currently pursuing the B.Eng. degree in computer science at Shandong University. His research interests include visual question answering and computer vision.
Fan Liu (Member, IEEE) is currently a Research Fellow with the School of Computing, National University of Singapore (NUS). He received the Ph.D. degree from Shandong University, China. His research interests lie primarily in multimedia search and recommendation. His work has been published in a set of top forums, including ACM SIGIR, MM, WWW, TKDE, TOIS, and TMM. He has served as a PC member for several top conferences, such as ACM MM, SIGKDD, and WSDM, and as a reviewer for journals including TKDE, TMM, IPM, and INS.
Liqiang Nie (Senior Member, IEEE) received the B.Eng. degree from Xi'an Jiaotong University and the Ph.D. degree from the National University of Singapore (NUS). He is currently a Professor and the dean of the School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen). His research interests lie primarily in multimedia computing and information retrieval. He has co-authored more than 200 articles and four books and received more than 14,000 Google Scholar citations. He is an AE of IEEE TKDE, IEEE TMM, IEEE TCSVT, ACM ToMM, and Information Sciences. Meanwhile, he is a regular area chair of ACM MM, NeurIPS, IJCAI, and AAAI. He is a member of the ICME steering committee. He has received many awards, such as the ACM MM and SIGIR best paper honorable mention in 2019, SIGMM rising star in 2020, TR35 China 2020, DAMO Academy Young Fellow in 2020, and the SIGIR best student paper in 2021.
Mohan Kankanhalli (Fellow, IEEE) received the B.Tech. degree from IIT Kharagpur and the M.S. and Ph.D. degrees from the Rensselaer Polytechnic Institute. He is currently the Provost's Chair Professor at the Department of Computer Science, National University of Singapore. He is the Director of N-CRiPT and also the Deputy Executive Chairman of AI Singapore (Singapore's national AI program). His current research interests include multimedia computing, multimedia security and privacy, image/video processing, and social media analysis.