
Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering

Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran
Applied Artificial Intelligence Institute, Deakin University, Australia
{hldang,lethao,vuong.le,truyen.tran}@deakin.edu.au
Abstract

Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities. This task necessitates learning to reason about objects, relations, and events across visual and linguistic domains in space-time. High-level reasoning demands lifting from associative visual pattern recognition to symbol-like manipulation over objects, their behavior and interactions. Toward reaching this goal, we propose an object-oriented reasoning approach in which a video is abstracted as a dynamic stream of interacting objects. At each stage of the video event flow, these objects interact with each other, and their interactions are reasoned about with respect to the query and under the overall context of the video. This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture called Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR) networks. This neural model maintains the objects' consistent lifelines in the form of a hierarchically nested spatio-temporal graph. Within this graph, the dynamic interactive object-oriented representations are built up along the video sequence, hierarchically abstracted in a bottom-up manner, and converge toward the key information for the correct answer. The method is evaluated on multiple major Video QA datasets and establishes new state-of-the-art results on these tasks. Analysis of the model's behavior indicates that object-oriented reasoning is a reliable, interpretable and efficient approach to Video QA.

1 Introduction

Figure 1: The key to Video QA is the effective relational reasoning on objects with their temporal lifelines (below sequences) interleaved with spatial interactions (upper graph) under the context set in the video and the perspective provided by the query. This object-oriented spatio-temporal reasoning is the main theme of this work.
Figure 2: Object-oriented Spatio-Temporal Reasoning (OSTR) unit. Inputs include a set of object sequences $X$ (with identity indicated by colors), a context $c$ and a query $q$. Each object sequence is first summarized by a temporal attention module (matching color boxes). The inter-object spatio-temporal relations are modeled by a graph-based spatial interaction (gray box with pink graph). The OSTR unit outputs a set of object instances $\{y_{n}\}$ with object IDs corresponding to those in the input sequences.

Much of the recent impressive progress of AI can be attributed to the availability of suitable large-scale testbeds. A powerful testbed – largely under-explored – is Video Question Answering (Video QA). This task demands a wide range of cognitive capabilities including learning and reasoning about objects and dynamic relations in space-time, in both visual and linguistic domains. A major challenge of reasoning over video is extracting question-relevant high-level facts from low-level moving pixels over an extended period of time. These facts include objects, their motion profiles, actions, interactions, events, and consequences distributed in space-time. Another challenge is to learn the long-term temporal relations of visual objects conditioned on guidance clues from the question – effectively bridging the semantic gulf between the two domains. Finally, learning to reason from relational data is an open problem on its own, as it pushes the boundary of learning from simple one-step classification to dynamically constructing question-specific computational graphs that realize the iterative reasoning process.

A highly plausible path to tackle these challenges is via object-centric learning, since objects are fundamental to cognition [19]. Objects pave the way towards more human-like reasoning capability and symbolic computing [13, 19]. Unlike objects in static images, objects in video have unique evolving lives throughout space-time. As an object lives throughout the video, it changes its appearance and position, and interacts with other objects at arbitrary times. When observed in videos, all these behaviors play out on top of the rich background context of the video scene. Furthermore, in the question answering setting, this object-oriented information must be considered from the specific viewpoint set by the linguistic query. With these principles, we pinpoint the key to Video QA to be effective high-level relational reasoning over spatio-temporal objects under the video context and the perspective provided by the query (see Fig. 1). This is challenging due to the complexity of the video's spatio-temporal structure and the cross-domain compatibility gap between the linguistic query and visual objects.

Toward this challenge, we design a general-purpose neural unit called Object-oriented Spatio-Temporal Reasoning (OSTR) that operates on a set of video object sequences, a contextual video feature, and an external linguistic query. OSTR models lifelong object interactions and returns a summary representation in the form of a single set of objects. The distinguishing feature of OSTR is the separation of intra-object temporal aggregation from inter-object spatial interaction, which makes the reasoning process efficient. Being flexible and generic, OSTR units are suitable building blocks for constructing powerful reasoning models.

For the Video QA problem, we use OSTR units to build the Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR) model. The network consists of OSTR units arranged in layers corresponding to the levels of the video temporal structure. At each level, HOSTR finds local object interactions and summarizes them toward a higher-level, longer-term representation under the guidance of the linguistic query.

HOSTR stands out with its faithful and explicit modeling of video objects, leading to an effective and interpretable reasoning process. The hierarchical architecture also allows the model to efficiently scale to a wider range of video formats and lengths. These advantages are demonstrated in a comprehensive set of experiments on multiple major Video QA tasks and datasets.

In summary, this paper makes three major contributions: (1) a semantically rich object-oriented representation of videos that paves the way for spatio-temporal reasoning; (2) a general-purpose neural reasoning unit with dynamic object interactions per context and query; and (3) a hierarchical network that produces reliable and interpretable video question answering.

2 Related Work

Video QA has been developed on top of traditional video analysis schemes such as recurrent networks of frame features [31] or 3D convolutional operators [20]. Video representations are then fused with or gated by the linguistic query through co-attention [9, 27], hierarchical attention [16, 30], and memory networks [11, 21]. More recent works advance the field by exploiting hierarchical video structure [14] or separating reasoning from representation learning [15]. A shared feature of these works is treating whole video frames or segments as the unit component of reasoning. In contrast, our work makes a step further by using individual objects detected in the video as the primitive constructs for reasoning.

Object-centric Video Representation inherits the modern capability of object detection on images [3] and continuous tracking through temporal consistency [23]. Tracked objects form tubelets [10] whose representations contribute to breakthroughs in action detection [24] and event segmentation [2]. For tasks that require abstract reasoning, connections between objects beyond temporal object permanence can be established through relation networks [1]. The concurrence of objects' 2D spatial and 1D temporal relations naturally forms a 3D spatio-temporal graph [22]. This graph can be represented either as a single flattened graph where all parts connect together [29], or as separate spatial and temporal graphs [17]. It can also be approximated as a dynamic graph where objects live through the temporal axis of the video while their properties and connections evolve [8].

Object-based Video QA is still in its infancy. The works in [26] and [7] extract object features and feed them to generic relational engines without a prior structure of reasoning through space-time. At the other extreme, detected objects are used to scaffold a symbolic reasoning computation graph [28], which is explicit but limited in flexibility and cannot recover from object extraction errors. Our work is a major step toward object-centric reasoning that balances explicitness and flexibility. Here video objects serve as active agents which build up and adjust their interactions dynamically in the spatio-temporal space as instructed by the linguistic query.

3 Method

3.1 Problem Definition

Given a video $\mathcal{V}$ and a linguistic question $q$, our goal is to learn a mapping function $F_{\theta}(\cdot)$ that returns a correct answer $\bar{a}$ from an answer set $A$ as follows:

$\bar{a}=\arg\max_{a\in A}F_{\theta}\left(a\mid q,\mathcal{V}\right).$ (1)

In this paper, a video $\mathcal{V}$ is abstracted as a collection of object sequences tracked in space and time. The function $F_{\theta}$ is designed to have the form of a hierarchical object-oriented network that takes the object sequences, modulates them with the overall video context, and dynamically infers object interactions as instructed by the question $q$ so that key information regarding $a$ arises from the mix. This object-oriented representation is presented next in Sec. 3.2, followed by Sec. 3.3 describing the key computation unit and Sec. 3.4 the resulting model.

3.2 Data Representation

3.2.1 Video as a Set of Object Sequences

Different from most prominent Video QA methods [9, 5, 14] where videos are represented by frame features, we break down a video of length $L$ into a list of $N$ object sequences $O=\{o_{n,t}\}_{n=1,t=1}^{N,L}$, constructed by chaining corresponding objects of the same identity $n$ across the video. These objects are represented by a combination of (1) appearance features (RoI pooling features) $o_{n,t}^{a}\in\mathbb{R}^{2048}$ (representing "what"); and (2) positional features $o_{n,t}^{p}=[\frac{x_{\min}}{W},\frac{y_{\min}}{H},\frac{x_{\max}}{W},\frac{y_{\max}}{H},\frac{w}{W},\frac{h}{H},\frac{wh}{WH}]$ ("where"), where $w$ and $h$ are the width and height of the bounding box and $W, H$ are those of the video frame, respectively.

In practice, we use Faster R-CNN with RoI pooling to extract positional and appearance features, and DeepSORT for multi-object tracking. We assume that the objects live from the beginning to the end of the video. Occluded or missed objects have their features marked with special null values, which the model handles explicitly.
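
To make this representation concrete, the sketch below shows one way such tracked objects could be packed into per-object tensors with a validity mask for the null values; the tensor shapes and the `pack_object_sequences` helper are illustrative assumptions, not the original implementation.

```python
import torch

def pack_object_sequences(appearance, position, valid_mask):
    """Sketch of the object-sequence representation (assumed shapes):
       appearance: (N, L, 2048) RoI-pooled features per object per frame
       position:   (N, L, 7)    normalized boxes [xmin/W, ymin/H, xmax/W, ymax/H, w/W, h/H, wh/(WH)]
       valid_mask: (N, L)       0 where the tracker missed the object or it is occluded
    Frames with no detection are zeroed so that downstream attention can skip
    them via the mask."""
    appearance = appearance * valid_mask.unsqueeze(-1)
    position = position * valid_mask.unsqueeze(-1)
    return {"appearance": appearance, "position": position, "mask": valid_mask}
```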

3.2.2 Joint Encoding of “What” and “Where”

Since the appearance $o_{n,t}^{a}$ of an object may remain relatively stable over time while $o_{n,t}^{p}$ constantly changes, we must find joint positional-appearance features of objects to make them discriminative in both space and time. Specifically, we propose the following multiplicative gating mechanism to construct such features:

$o_{n,t} = f_{1}(o_{n,t}^{a}) \odot f_{2}(o_{n,t}^{p}) \in \mathbb{R}^{d},$ (2)

where $f_{2}(o_{n,t}^{p})\in(\boldsymbol{0},\boldsymbol{1})$ serves as a position gate to (softly) turn on/off the localized appearance features $f_{1}(o_{n,t}^{a})$. We choose $f_{1}(x)=\tanh\left(W_{a}x+b_{a}\right)$ and $f_{2}(x)=\text{sigmoid}\left(W_{p}x+b_{p}\right)$, where $W_{a}$ and $W_{p}$ are network weight matrices with output dimension $d$.
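
As an illustration, Eq. 2 could be implemented as a small PyTorch module like the one below; the class and layer names are ours, and the default dimensions follow the text (2048-D appearance, 7-D position, $d=512$).

```python
import torch
import torch.nn as nn

class WhatWhereGate(nn.Module):
    """Multiplicative "what-where" gating of Eq. 2 (illustrative sketch)."""
    def __init__(self, d_app=2048, d_pos=7, d=512):
        super().__init__()
        self.f1 = nn.Linear(d_app, d)  # W_a x + b_a, followed by tanh
        self.f2 = nn.Linear(d_pos, d)  # W_p x + b_p, followed by sigmoid

    def forward(self, o_app, o_pos):
        # o_app: (..., 2048), o_pos: (..., 7)  ->  o: (..., d)
        return torch.tanh(self.f1(o_app)) * torch.sigmoid(self.f2(o_pos))
```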

Along with the object sequences, we also maintain global features of video frames, which hold information about the background scene and possibly missed objects. Specifically, for each frame $t$, we form the global features $g_{t}$ as the combination of the frame's appearance features (pretrained ResNet pool5 vectors) and motion features (pretrained ResNeXt-101) extracted from that frame.

With these ready, the video is represented as a tuple of object sequences $O_{n}$ and frame-wise global features $g_{t}$: $\mathcal{V}=\left(\{O_{n}\mid O_{n}\in\mathbb{R}^{L\times d}\}_{n=1}^{N},\ \{g_{t}\}_{t=1}^{L}\right)$.

3.2.3 Linguistic Representation

We utilize a BiLSTM running on top of GloVe embeddings of the words in a query of length $S$ to generate contextual embeddings $\{e_{s}\}_{s=1}^{S}$, $e_{s}\in\mathbb{R}^{d}$, which share the dimension $d$ with the object features. We also maintain a global representation of the question $q_{g}\in\mathbb{R}^{d}$ by summarizing the two end LSTM states. We further use $q_{g}$ to drive an attention mechanism that combines the contextual words into a unified query representation $q=\sum_{s=1}^{S}\alpha_{s}e_{s}$, where $\alpha_{s}=\text{softmax}_{s}(W_{q}(e_{s}\odot q_{g}))$.
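
A minimal sketch of this query summarization step is given below, assuming the contextual embeddings $e_s$ and the global state $q_g$ have already been produced by the BiLSTM; the module name is ours.

```python
import torch
import torch.nn as nn

class QuerySummarizer(nn.Module):
    """Attention pooling of contextual word embeddings driven by q_g."""
    def __init__(self, d=512):
        super().__init__()
        self.w_q = nn.Linear(d, 1)  # W_q in the alpha_s expression

    def forward(self, e, q_g):
        # e: (S, d) contextual word embeddings, q_g: (d,) global question vector
        alpha = torch.softmax(self.w_q(e * q_g).squeeze(-1), dim=0)  # (S,)
        return (alpha.unsqueeze(-1) * e).sum(dim=0)                  # q: (d,)
```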

3.3 Object-oriented Spatio-Temporal Reasoning (OSTR)

Figure 3: The architecture of Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR) network for Video QA. HOSTR contains OSTR units operating at two levels: clip-level and video-level. Clip-level OSTR units model the interaction between object chunks within a particular clip under the modulation of the question and a clip-specific context representation. Their output objects are chained together and further sent to a video-level OSTR to capture the long-term dependencies between the objects existing in the whole video. Finally, a classifier taking as input the summarized output of video-level OSTR and the query is used for answer prediction.

With the videos represented as object sequences, we need to design a scalable reasoning framework that can work natively on this structure. Such a framework must be modular so that it is flexible with respect to different input formats and sizes. Toward this goal, we design a generic reasoning unit called Object-oriented Spatio-Temporal Reasoning (OSTR) that operates on this object-oriented structure and supports layering and parallelism.

Algorithmically, OSTR takes as input a query representation $q$, a context representation $c$, a set of $N$ object sequences $X=\{X_{n}\mid X_{n}\in\mathbb{R}^{T\times d}\}_{n=1}^{N}$ of equal length $T$, and their individual identities $\{n\}$. In practice, $X$ can be a subsegment of the whole object sequences $O$, and $c$ is gathered from the frame features $g_{t}$ constructed in Sec. 3.2. The output of the OSTR is a set of object instances with the same identities.

Across the space-time domain, real-world objects have distinctive properties (appearance, position, etc.) and behaviors (motion, deformation, etc.) throughout their lives. Meanwhile, different objects living in the same period can interact with each other. The OSTR closely reflects this nature through its two main components: (1) intra-object temporal attention and (2) inter-object interaction (see Fig. 2).

3.3.1 Intra-object Temporal Attention

The goal of the temporal attention module is to summarize each object sequence $X_{n}=\{x_{n,t}\}_{t=1}^{T}$ into a single query-specific vector $z_{n}$. The attention weights are driven by the query $q$ to reflect the fact that relevance to the query varies across the sequence. In detail, the summarized vector $z_{n}$ is calculated by

$z_{n} = \textrm{temporal\_attention}(X_{n}) \coloneqq \gamma * \sum_{t=1}^{T}\beta_{t}x_{n,t},$ (3)

$\beta_{t} = \text{softmax}_{t}\left(W_{a}\left((W_{q}q+b_{q})\odot(W_{x}x_{n,t}+b_{x})\right)\right),$ (4)

where $\odot$ is the Hadamard product, $\{W_{a},W_{q},W_{x}\}$ are learnable weights, and $\gamma$ is a binary mask vector that handles the null values caused by missed detections as mentioned in Sec. 3.2.

When the sequential structure is particularly strong, we can optionally employ a BiLSTM to model the sequence. We can then either utilize the last state of the forward LSTM and the first state of the backward LSTM, or place attention across the hidden states instead of object feature embeddings.
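
One possible realization of the attention variant in Eqs. 3-4 is sketched below; here the binary mask $\gamma$ is applied at the attention logits, which is one way to ignore null frames, and all module and argument names are our assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Query-driven summary of one object sequence (sketch of Eqs. 3-4)."""
    def __init__(self, d=512):
        super().__init__()
        self.w_q = nn.Linear(d, d)
        self.w_x = nn.Linear(d, d)
        self.w_a = nn.Linear(d, 1)

    def forward(self, x, q, mask):
        # x: (T, d) one object sequence, q: (d,) query, mask: (T,) 0 = missed detection
        logits = self.w_a(self.w_q(q) * self.w_x(x)).squeeze(-1)  # (T,)
        logits = logits.masked_fill(mask == 0, float("-inf"))     # gamma handled at the logits
        beta = torch.softmax(logits, dim=0)                       # (T,)
        return (beta.unsqueeze(-1) * x).sum(dim=0)                # z_n: (d,)
```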

3.3.2 Inter-object Interaction

Fundamentally, the lifelines of object sequences are described not only by their internal behavior through time but also by interactions with neighboring objects coexisting in the same space. In order to represent such complex relationships, we build a spatio-temporal computation graph to facilitate inter-object interactions modulated by the query $q$. This graph $\mathcal{G}(Z,E)$ has as vertices the summarized objects $Z=\{z_{n}\}_{n=1}^{N}$ generated in Eq. 3, and its edges $E$ are represented by an adjacency matrix $A\in\mathbb{R}^{N\times N}$. $A$ is calculated dynamically as the query-induced correlation matrix between the objects:

$a_{n} = \text{norm}\left(W_{a}([z_{n},\, z_{n}\odot q])\right),$ (5)

$A = a^{\top}a.$ (6)

Here $a_{n}$ is the relevance of object $n$ w.r.t. the query $q$. The norm operator is implemented as a softmax over objects in our implementation.

Given the graph $\mathcal{G}$, we use a Graph Convolutional Network (GCN) equipped with skip-connections to refine objects in relation to their neighboring nodes. Starting with the initialization $H^{0}=\left(z_{1},z_{2},...,z_{N}\right)\in\mathbb{R}^{N\times d}$, the representations of the nodes are updated through a number of refinement iterations. At iteration $i$, the new hidden states are calculated by:

$\textrm{GCN}_{i}\left(H^{i-1}\right) = W_{2}^{i-1}\,\sigma\left(AH^{i-1}W_{1}^{i-1}+b^{i-1}\right)$

$H^{i} = \sigma\left(H^{i-1}+\textrm{GCN}_{i}\left(H^{i-1}\right)\right),$ (7)

where $\sigma\left(\cdot\right)$ is a nonlinear activation (ELU in our implementation). After a fixed number of GCN iterations, the hidden states of the final layer are gathered as $H^{i_{\max}}=\left\{h_{n}\right\}_{n=1}^{N}$.
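
A rough PyTorch rendering of Eqs. 5-7 is given below; the per-object softmax normalization in Eq. 5 and the layer names are our reading of the text, so this is a sketch rather than the reference implementation.

```python
import torch
import torch.nn as nn

class InterObjectInteraction(nn.Module):
    """Query-induced adjacency (Eqs. 5-6) and skip-connected GCN (Eq. 7)."""
    def __init__(self, d=512, n_layers=6):
        super().__init__()
        self.w_adj = nn.Linear(2 * d, 1)
        self.w1 = nn.ModuleList([nn.Linear(d, d) for _ in range(n_layers)])  # W_1^i
        self.w2 = nn.ModuleList([nn.Linear(d, d) for _ in range(n_layers)])  # W_2^i
        self.act = nn.ELU()

    def forward(self, z, q):
        # z: (N, d) summarized objects, q: (d,) query
        a = torch.softmax(self.w_adj(torch.cat([z, z * q], dim=-1)).squeeze(-1), dim=0)  # (N,)
        A = a.unsqueeze(1) * a.unsqueeze(0)                                              # (N, N)
        h = z
        for w1, w2 in zip(self.w1, self.w2):
            h = self.act(h + w2(self.act(A @ w1(h))))  # Eq. 7 with skip connection
        return h                                       # refined objects: (N, d)
```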

To recover the underlying background scene information and compensate for possibly undetected objects, we augment the object representations with the global context $c$:

$y_{n}=\textrm{MLP}\left(\left[h_{n};c\right]\right).$ (8)

These vectors line up to form the final output of the OSTR unit as a set of objects $Y=\left\{y_{n}\right\}_{n=1}^{N}$.

3.4 Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR)

Even though the separation of temporal and spatial interaction in OSTR brings the benefits of efficiency and modularity, such separate treatment can cause the loss of spatio-temporal information, especially with long sequences. This limitation prevents us from using OSTR directly on the full video object sequences. To allow temporal and spatial reasoning to cooperate along the way, we break down a long video into multiple short (overlapping) clips and impose a hierarchical structure on top. With such a division, the two types of interactions can be combined and interleaved across clips, allowing full spatio-temporal reasoning.

Based on this motivation, we design a novel hierarchical structure called Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR) that follows the video's multi-level structure and utilizes OSTR units as building blocks. Our architecture shares the design philosophy of hierarchical reasoning structures with HCRN [14] as well as other general neural building blocks such as ResNet and InceptionNet. Thanks to the genericity of OSTR, we can build a hierarchy of arbitrary depth. For concreteness, we present here a two-layer HOSTR corresponding to the video structure: clip-level and video-level (see Fig. 3).

In particular, we first split all object sequences $O$ constructed in Sec. 3.2 into $K$ equal-sized chunks $C=\left\{C_{1},C_{2},...,C_{K}\right\}$ corresponding to the video clips, each of $T$ frames. As a result, each chunk contains the object subsequences $C_{k}=\left\{o_{n,t}\right\}_{n=1,t=t_{k}}^{N,\,t_{k}+T}$, where $o_{n,t}$ are the object features extracted in Eq. 2, and $t_{k}$ is the starting time of clip $k$.

Similarly, we divide the sequence of global frame features $\left\{g_{t}\right\}_{t=1}^{L}$ into $K$ parts corresponding to the video clips. The global context $c_{k}^{\textrm{clip}}$ for clip $k$ is derived from each part by the same temporal attention operation used for objects in Eqs. 3-4: $c_{k}^{\textrm{clip}}=\textrm{temporal\_attention}\left(\left\{g_{t}\right\}_{t=t_{k}}^{t_{k}+T}\right)$.

Clip-level OSTR units work on each of these subsequences $C_{k}$, the context $c_{k}^{\textrm{clip}}$ and the query $q$, and generate the clip-level representation of the chunk $y_{k}^{\textrm{clip}}\in\mathbb{R}^{N\times d}$:

$y_{k}^{\textrm{clip}}=\textrm{OSTR}(C_{k},c_{k}^{\textrm{clip}},q).$ (9)

The outputs of the $K$ clip-level OSTRs are $K$ different sets of objects $y_{k}^{\textrm{clip}}=\left\{y_{n,k}^{\textrm{clip}}\right\}_{n=1}^{N}$ whose identities $n$ are maintained. Therefore, we can easily chain objects of the same identity from different clips together to form the video-level sequence of objects $Y^{\textrm{clip}}=\left\{y_{n,k}^{\textrm{clip}}\right\}_{n=1,k=1}^{N,K}$.

At the video level, we have a single OSTR unit that takes in the object sequences $Y^{\textrm{clip}}$, the query $q$, and a video-level context $c^{\textrm{vid}}$. The context $c^{\textrm{vid}}$ is again derived from the clip-level contexts $c_{k}^{\textrm{clip}}$ by temporal attention: $c^{\textrm{vid}}=\textrm{temporal\_attention}\left(\left\{c_{k}^{\textrm{clip}}\right\}_{k=1}^{K}\right)$.

The video-level OSTR models the long-term relationships between input object sequences in the whole video:

$Y^{\textrm{vid}}=\textrm{OSTR}(Y^{\textrm{clip}},c^{\textrm{vid}},q).$

The output of this unit is a set of $N$ vectors $Y^{\textrm{vid}}=\{y_{n}^{\textrm{vid}}\mid y_{n}^{\textrm{vid}}\in\mathbb{R}^{d}\}_{n=1}^{N}$. The set is further summarized by an attention mechanism driven by the query $q$ into the final representation vector $r$:

$\delta_{n} = \text{softmax}_{n}\left(\textrm{MLP}\left[W_{y}y_{n}^{\textrm{vid}};\,W_{y}y_{n}^{\textrm{vid}}\odot W_{c}q\right]\right),$ (10)

$r = \sum_{n=1}^{N}\delta_{n}y_{n}^{\textrm{vid}}\in\mathbb{R}^{d}.$ (11)
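
Putting the pieces together, the two-level composition can be outlined as follows, assuming `ostr_clip` and `ostr_video` are OSTR modules that map a set of object (sub)sequences, a context and the query to one vector per object identity; the function below is an illustrative outline, not the exact implementation.

```python
import torch

def hostr_two_level(ostr_clip, ostr_video, clips, clip_contexts, video_context, q):
    """clips[k]: object subsequences of clip k, shape (N, T, d);
    each OSTR call is assumed to return one vector per object identity, shape (N, d)."""
    # Clip level: one OSTR per clip (Eq. 9); object identities are preserved.
    y_clip = [ostr_clip(c_k, ctx_k, q) for c_k, ctx_k in zip(clips, clip_contexts)]
    # Chain the per-clip outputs into video-level object sequences: (N, K, d).
    y_clip = torch.stack(y_clip, dim=1)
    # Video level: a single OSTR over the chained sequences, producing Y^vid: (N, d).
    return ostr_video(y_clip, video_context, q)
```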

3.5 Answer Decoders

We follow the common settings for answer decoders (e.g., see [9]), which combine the final representation $r$ with the query $q$ using an MLP followed by a softmax to rank the possible answer choices. More details about the answer decoders per question type are available in the supplemental material. We use cross-entropy as the loss function to train the model end to end for all tasks except counting, where Mean Squared Error is used.

4 Experiments

4.1 Datasets

We evaluate our proposed HOSTR on three public Video QA benchmarks, namely TGIF-QA [9], MSVD-QA [25] and MSRVTT-QA [25]. More details are as follows.

MSVD-QA consists of 50,505 QA pairs annotated from 1,970 short video clips. The dataset covers five question types: What, Who, How, When, and Where, with 61% of the QA pairs used for training, 13% for validation and 26% for testing.

MSRVTT-QA contains 10K real videos (65% for training, 5% for validation, and 30% for testing) with more than 243K question-answer pairs. Similar to MSVD-QA, questions are of five types: What, Who, How, When, and Where.

TGIF-QA is one of the largest Video QA datasets with 72K animated GIFs and 120K question-answer pairs. Questions cover four tasks - Action, Event Transition, FrameQA, Count. We refer readers to the supplemental material for the dataset description and statistics.

4.2 Comparison Against SOTAs

Implementation: We use Faster R-CNN (https://github.com/airsplay/py-bottom-up-attention) for frame-wise object detection. The number of object sequences per video is 40 for MSVD-QA and MSRVTT-QA, and 50 for TGIF-QA. We embed question words into 300-D vectors initialized with GloVe. The default setting uses 6 GCN layers for each OSTR unit. The feature dimension $d$ is set to 512 in all sub-networks.

We compare the performance of HOSTR against recent state-of-the-art (SOTA) methods on all three datasets. Prior results are taken from [14].

MSVD-QA and MSRVTT-QA: Table 1 shows detailed comparisons on the MSVD-QA and MSRVTT-QA datasets. Our proposed method consistently outperforms all SOTA models. Specifically, we significantly improve performance on MSVD-QA by 3.3 absolute points, while the improvement on MSRVTT-QA is more modest. As videos in MSRVTT-QA are much longer (3 times longer than those in MSVD-QA) and contain more complicated interactions, they might require a larger number of input object sequences than used in our experiments (40 object sequences).

Model  | MSVD-QA | MSRVTT-QA
ST-VQA | 31.3    | 30.9
Co-Mem | 31.7    | 32.0
AMU    | 32.0    | 32.5
HME    | 33.7    | 33.0
HCRN   | 36.1    | 35.4
HOSTR  | 39.4    | 35.9
Table 1: Experimental results (test accuracy, %) on MSVD-QA and MSRVTT-QA.
Model        | Action ↑ | Trans. ↑ | Frame ↑ | Count ↓
ST-TP (R+C)  | 62.9     | 69.4     | 49.5    | 4.32
Co-Mem (R+F) | 68.2     | 74.3     | 51.5    | 4.10
PSAC (R)     | 70.4     | 76.9     | 55.7    | 4.27
HME (R+C)    | 73.9     | 77.8     | 53.8    | 4.02
HCRN (R)     | 70.8     | 79.8     | 56.4    | 4.38
HCRN (R+F)   | 75.0     | 81.4     | 55.9    | 3.82
HOSTR (R)    | 75.6     | 82.1     | 58.2    | 4.13
HOSTR (R+F)  | 75.0     | 83.0     | 58.0    | 3.65
Table 2: Experimental results on the TGIF-QA dataset. R: ResNet, F: Flow, C: C3D. MSE is used as the evaluation metric for Count while accuracy is used for the others.

TGIF-QA: Table 2 presents the results on the TGIF-QA dataset. As pointed out in [14], short-term motion features are helpful for the Action task, while long-term motion features are crucial for the Event Transition and Count tasks. Hence, we provide two variants of our model: HOSTR (R) uses ResNet features as the context information for OSTR units, and HOSTR (R+F) uses the combination of ResNet features and motion features (extracted by ResNeXt, the same as in HCRN) as the context representation. HOSTR (R+F) shows exceptional performance on tasks related to motion. Note that we only use the context modulation at the video level to concentrate on the long-term motion. Even without the use of motion features, HOSTR (R) consistently shows more favorable performance than existing works.

The quantitative results prove the effectiveness of object-oriented reasoning compared to prior approaches that rely entirely on frame-level features. Incorporating motion features as context information also shows the flexibility of the design, suggesting that HOSTR can leverage a variety of input features and has the potential to be applied to other problems.

4.3 Ablation Studies

To provide more insight into our model, we examine the contributions of different design components to the model's performance on the MSVD-QA dataset. We detail the results in Table 3. As shown, intra-object temporal attention appears to be more effective than a BiLSTM at summarizing the input object sequences for relational reasoning. We hypothesize that this is due to the selective nature of the attention mechanism – it keeps only information relevant to the query.

As for the inter-object interaction, increasing the number of GCN layers up to 6 generally improves the performance. Performance gradually degrades when we stack more layers, due to vanishing gradients. The ablation study also points out the significance of the contextual representation, as the performance drops steeply from 39.4 to 37.8 without it.

Last but not least, we conduct two experiments to demonstrate the significance of hierarchical video modeling. "1-level hierarchy" refers to replacing all clip-level OSTRs with a global average pooling operation that summarizes each object sequence into a vector while keeping the OSTR unit at the video level. "1.5-level hierarchy", on the other hand, refers to using an average pooling operation at the video level while keeping the clip level the same as in HOSTR. Empirically, going deeper in the hierarchy consistently improves performance on this dataset. The hierarchy may have greater effects in handling longer videos such as those in the MSRVTT-QA and TGIF-QA datasets.

Model                               | Test Acc. (%)
Default config. (*)                 | 39.4
Temporal attention (TA)
  Attention at both levels          | 39.4
  BiLSTM at clip, TA at video level | 39.4
  BiLSTM at both levels             | 38.8
Inter-object Interaction
  SR with 1 GCN layer               | 38.3
  SR with 4 GCN layers              | 38.7
  SR with 8 GCN layers              | 39.0
Contextual representation
  w/o contextual representation     | 37.8
Hierarchy
  1-level hierarchy                 | 38.0
  1.5-level hierarchy               | 38.7
Table 3: Ablation results on the MSVD-QA dataset. Default config. (*): 6 GCN layers, attention at clip level & BiLSTM at video level.
Figure 4: A visualization of the spatio-temporal graph formed in HOSTR. The six most attended objects in each clip are drawn in blue bounding boxes. Red links indicate the importance of the edges.

4.4 Qualitative Analysis

To provide further analysis of the behavior of HOSTR in practice, we visualize the spatio-temporal graph formed during HOSTR's operation on a sample from the MSVD-QA dataset. In Fig. 4, the spatio-temporal graphs of the two most important clips (judged by the temporal attention scores) are visualized in the order of their appearance in the video's timeline. Blue boxes indicate the six objects with the highest edge weights (row summation of the adjacency matrix $A$ calculated in Eq. 6). The red lines indicate the most prominent edges of the graph, with intensity scaled to the edge strength.

In this example, HOSTR attended mostly to the objects related to the concepts relevant to answering the question (the girl, ball and dog). Furthermore, the relationships between the girl and her surrounding objects are the most important among the edges, and this intuitively agrees with how a human might visually examine the scene given the question.

5 Conclusion

We presented a new object-oriented approach to Video QA where objects living in the video are treated as the primitive constructs. This brings us closer to symbolic reasoning, which is arguably more human-like. To realize this high-level idea, we introduced a general-purpose neural unit dubbed Object-oriented Spatio-Temporal Reasoning (OSTR). The unit reasons about its contextualized input – a set of object sequences – as instructed by the linguistic query. It first selectively transforms each sequence into an object node, then dynamically induces links between the nodes to build a graph. The graph enables iterative relational reasoning through collective refinement of the object representations, gearing toward an answer to the given query. The units are then stacked in a hierarchy that reflects the temporal structure of a typical video, allowing higher-order reasoning across space and time. Our architecture establishes new state-of-the-art results on major Video QA datasets designed for complex compositional questions and relational and temporal reasoning. Our analysis shows that object-oriented reasoning is a reliable, interpretable and effective approach to Video QA.

References

  • Baradel et al. [2018] F. Baradel, N. Neverova, C. Wolf, J. Mille, and G. Mori. Object level visual reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 105–121, 2018.
  • Chao et al. [2018] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. Rethinking the Faster R-CNN architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1130–1139, 2018.
  • Desta et al. [2018] M. T. Desta, L. Chen, and T. Kornuta. Object-based reasoning in VQA. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1814–1823. IEEE, 2018.
  • Fan et al. [2019] C. Fan, X. Zhang, S. Zhang, W. Wang, C. Zhang, and H. Huang. Heterogeneous memory enhanced multimodal attention model for video question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1999–2007, 2019.
  • Gao et al. [2018] J. Gao, R. Ge, K. Chen, and R. Nevatia. Motion-appearance co-memory networks for video question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6576–6585, 2018.
  • Hasan Chowdhury et al. [2018] M. I. Hasan Chowdhury, K. Nguyen, S. Sridharan, and C. Fookes. Hierarchical relational attention for video question answering. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 599–603, 2018. doi: 10.1109/ICIP.2018.8451103.
  • Huang et al. [2020] D. Huang, P. Chen, R. Zeng, Q. Du, M. Tan, and C. Gan. Location-aware graph convolutional networks for video question answering. In AAAI, pages 11021–11028, 2020.
  • Jain et al. [2016] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308–5317, 2016.
  • Jang et al. [2017] Y. Jang, Y. Song, Y. Yu, Y. Kim, and G. Kim. TGIF-QA: Toward spatio-temporal reasoning in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2758–2766, 2017.
  • Kalogeiton et al. [2017] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. Action tubelet detector for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 4405–4413, 2017.
  • Kim et al. [2017] K.-M. Kim, M.-O. Heo, S.-H. Choi, and B.-T. Zhang. DeepStory: video story QA by deep embedded memory networks. In IJCAI, pages 2016–2022. AAAI Press, 2017.
  • Kingma and Ba [2014] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Lake et al. [2017] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
  • Le et al. [2020a] T. M. Le, V. Le, S. Venkatesh, and T. Tran. Hierarchical conditional relation networks for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9972–9981, 2020a.
  • Le et al. [2020b] T. M. Le, V. Le, S. Venkatesh, and T. Tran. Neural reasoning, fast and slow, for video question answering. In IJCNN, 2020b.
  • Liang et al. [2018] J. Liang, L. Jiang, L. Cao, L.-J. Li, and A. G. Hauptmann. Focal visual-text attention for visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6135–6143, 2018.
  • Pan et al. [2020] B. Pan, H. Cai, D.-A. Huang, K.-H. Lee, A. Gaidon, E. Adeli, and J. C. Niebles. Spatio-temporal graph for video captioning with knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10870–10879, 2020.
  • Song et al. [2018] X. Song, Y. Shi, X. Chen, and Y. Han. Explore multi-step reasoning in video question answering. In ACM MM, 2018.
  • Spelke and Kinzler [2007] E. S. Spelke and K. D. Kinzler. Core knowledge. Developmental science, 10(1):89–96, 2007.
  • Tran et al. [2018] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.
  • Wang et al. [2019] A. Wang, A. T. Luu, C.-S. Foo, H. Zhu, Y. Tay, and V. Chandrasekhar. Holistic multi-modal memory network for movie question answering. IEEE Transactions on Image Processing, 29:489–499, 2019.
  • Wang and Gupta [2018] X. Wang and A. Gupta. Videos as space-time region graphs. In Proceedings of the European conference on computer vision (ECCV), pages 399–417, 2018.
  • Wojke et al. [2017] N. Wojke, A. Bewley, and D. Paulus. Simple online and realtime tracking with a deep association metric. In ICIP. IEEE, 2017.
  • Xie et al. [2018] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321, 2018.
  • Xu et al. [2017] D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017.
  • Yang et al. [2020] Z. Yang, N. Garcia, C. Chu, M. Otani, Y. Nakashima, and H. Takemura. BERT representations for video question answering. In The IEEE Winter Conference on Applications of Computer Vision, pages 1556–1565, 2020.
  • Ye et al. [2017] Y. Ye, Z. Zhao, Y. Li, L. Chen, J. Xiao, and Y. Zhuang. Video question answering via attribute-augmented attention network learning. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 829–832. ACM, 2017.
  • Yi et al. [2020] K. Yi, C. Gan, Y. Li, P. Kohli, J. Wu, A. Torralba, and J. B. Tenenbaum. CLEVRER: collision events for video representation and reasoning. In ICLR, 2020.
  • Zeng et al. [2019] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. Graph convolutional networks for temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 7094–7103, 2019.
  • Zhao et al. [2018] Z. Zhao, X. Jiang, D. Cai, J. Xiao, X. He, and S. Pu. Multi-turn video question answering via multi-stream hierarchical attention context network. In IJCAI, pages 3690–3696, 2018.
  • Zhao et al. [2019] Z. Zhao, Z. Zhang, S. Xiao, Z. Xiao, X. Yan, J. Yu, D. Cai, and F. Wu. Long-form video question answering via dynamic hierarchical reinforced networks. IEEE Transactions on Image Processing, 28(12):5939–5952, 2019.

Appendix

6 Implementation Details

6.1 Question embedding

Each word in a question of length $S$ is first embedded into a vector of dimension 300. The vector sequence is fed into a BiLSTM to produce a sequence of forward-backward states. The hidden state pairs are gathered to form the contextual word representations $\{e_{s}\}_{s=1}^{S}$, $e_{s}\in\mathbb{R}^{d}$, where $d$ is the vector dimension. The global representation of the question is then summarized using the two end states: $q_{g}=\left[\overleftarrow{e_{1}};\overrightarrow{e_{S}}\right]$, $q_{g}\in\mathbb{R}^{d}$, where $[\,;\,]$ denotes the vector concatenation operation.

6.2 Answer decoders

In this subsection, we provide further details on how HOSTR predicts answers given the final representation $r$ and the query representation $q$. Depending on the question type, we design slightly different answer decoders. For open-ended questions (MSVD-QA, MSRVTT-QA and the FrameQA task in TGIF-QA) and multiple-choice questions (the Action and Event Transition tasks in TGIF-QA), we utilize a classifier of 2 fully connected layers with a softmax function to rank words in a predefined vocabulary set $A$ or rank the answer choices:

$z = \text{MLP}(W_{r}\left[r;W_{q}q+b_{q}\right]+b_{r}),$ (12)

$p = \text{softmax}(W_{z}z+b_{z}).$ (13)

We use cross-entropy as the loss function to train the model end to end in this case. While prior works [5, 14, 9] rely on a hinge loss for multiple-choice questions, we find that the cross-entropy loss produces more favorable performance. For the Count task in the TGIF-QA dataset, we simply take the output of Eq. 12 and feed it into a rounding operation for an integer output prediction. We use Mean Squared Error as the training loss for this task.
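
A minimal sketch of such a decoder is shown below, under the assumption that the MLP of Eq. 12 has two fully connected layers and that the returned logits are fed to a standard cross-entropy (or MSE, for Count) criterion outside the module; names and sizes are illustrative.

```python
import torch
import torch.nn as nn

class AnswerDecoder(nn.Module):
    """Sketch of Eqs. 12-13: fuse the video summary r with the query q and score answers."""
    def __init__(self, d=512, n_answers=1000):
        super().__init__()
        self.w_q = nn.Linear(d, d)                    # W_q q + b_q
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ELU(), nn.Linear(d, d))
        self.scorer = nn.Linear(d, n_answers)         # W_z z + b_z

    def forward(self, r, q):
        z = self.mlp(torch.cat([r, self.w_q(q)], dim=-1))
        return self.scorer(z)  # logits; softmax / cross-entropy applied in the loss
```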

6.3 Training

In this paper, each video is split into $K=10$ clips of $T=10$ consecutive frames. This choice is based simply on empirical results. For the HOSTR (R+F) variant, we divide each video into eight clips of $T=16$ consecutive frames to fit the predefined configuration of the pretrained ResNeXt model (https://github.com/kenshohara/video-classification-3d-cnn-pytorch). To mitigate the loss of temporal information when splitting a video up, adjacent clips overlap.
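
For illustration, one simple way to produce overlapping clip boundaries consistent with this description is sketched below; the paper does not specify the exact split scheme beyond $K$, $T$ and the presence of overlap, so this helper is an assumption.

```python
def split_into_clips(num_frames, clip_len=10, num_clips=10):
    """Return (start, end) frame indices for `num_clips` windows of `clip_len`
    frames; windows overlap whenever the video is shorter than clip_len * num_clips."""
    if num_clips == 1:
        return [(0, min(clip_len, num_frames))]
    stride = max(1, (num_frames - clip_len) // (num_clips - 1))
    starts = [min(k * stride, max(0, num_frames - clip_len)) for k in range(num_clips)]
    return [(s, min(s + clip_len, num_frames)) for s in starts]
```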

We train our model using the Adam optimizer [12] with an initial learning rate of $10^{-4}$ and weight decay of $10^{-5}$. The learning rate is reduced after every 10 epochs for all tasks except the Count task in TGIF-QA, for which we halve the learning rate after every 5 epochs. We use a batch size of 64 for MSVD-QA and MSRVTT-QA and 32 for TGIF-QA. To be comparable with related works [15, 18, 6, 14, 4], we use accuracy as the evaluation metric for multi-class classification tasks and MSE for the Count task in the TGIF-QA dataset.

All experiments are conducted on a single NVIDIA Tesla V100-SXM2-32GB GPU installed on a server with 40 physical processors and 256 GB of memory running Ubuntu 18.04. A GPU of at least 16 GB may be required to accommodate our implementation. We use early stopping when validation accuracy decreases over 10 consecutive epochs; otherwise, all experiments are terminated after at most 25 epochs. Depending on dataset size, training takes from 9 hours (MSVD-QA) to 30 hours (MSRVTT-QA). Inference takes up to 40 minutes (on MSRVTT-QA). HOSTR is implemented in Python 3.6 with PyTorch 1.2.0.

Figure 5: Extra visualizations of the spatio-temporal graph formed in HOSTR, similar to Fig. 4 in the main manuscript. The most attended objects in each clip are drawn in blue bounding boxes. Red links indicate the importance of the edges.

7 Dataset Description

Split | Videos | Q-A pairs | What   | Who    | How   | When | Where
Train | 1,200  | 30,933    | 19,485 | 10,479 | 736   | 161  | 72
Val   | 250    | 6,415     | 3,995  | 2,168  | 185   | 51   | 16
Test  | 520    | 13,157    | 8,149  | 4,552  | 370   | 58   | 28
All   | 1,970  | 50,505    | 31,625 | 17,199 | 1,291 | 270  | 116
Table 4: Statistics of the MSVD-QA dataset.
Split | Videos | Q-A pairs | What    | Who    | How   | When  | Where
Train | 6,513  | 158,581   | 108,792 | 43,592 | 4,067 | 1,626 | 504
Val   | 497    | 12,278    | 8,337   | 3,493  | 344   | 106   | 52
Test  | 2,990  | 72,821    | 49,869  | 20,385 | 1,640 | 677   | 250
All   | 10,000 | 243,680   | 166,998 | 67,416 | 6,051 | 2,409 | 806
Table 5: Statistics of the MSRVTT-QA dataset.
Split | Q-A pairs | Action | Transition | FrameQA | Count
Train | 1,254,470 | 18,427 | 47,433     | 35,452  | 24,158
Val   | 13,944    | 2,048  | 5,271      | 3,940   | 2,685
Test  | 25,751    | 2,274  | 6,232      | 13,691  | 3,554
All   | 1,294,165 | 22,749 | 58,936     | 53,083  | 30,397
Table 6: Statistics of the TGIF-QA dataset.
MSVD-QA

is a relatively small dataset of 50,505 QA pairs annotated from 1,970 short video clips. The dataset covers five question types: What, Who, How, When, and Where, with 61% of the QA pairs used for training, 13% for validation and 26% for testing. We provide details on the number of question-answer pairs in Table 4.

MSRVTT-QA

contains 10K real videos (65% for training, 5% for validation, and 30% for testing) with more than 243K question-answer pairs. Similar to MSVD-QA, questions are of five types: What, Who, How, When, and Where. Detailed statistics of the MSRVTT-QA dataset are shown in Table 5.

TGIF-QA

is one of the largest Video QA datasets, with 72K animated GIFs and 120K question-answer pairs. Questions cover four tasks - Action: a multiple-choice task identifying what action happens repeatedly within a short period of time; Transition: a multiple-choice task assessing whether machines can recognize the transition between two events in a video; FrameQA: answers can be found in a single video frame without the need for temporal reasoning; and Count: requires machines to count the number of times an action takes place. We take 10% of the Q-A pairs and their associated videos in the training set of each task as the validation set. Details on the number of Q-A pairs and videos per split are provided in Table 6.

8 Qualitative Results

In addition to the example provided in the main paper, we present a few more examples of the spatio-temporal graph formed in the HOSTR model in Figure 5. In all examples presented, HOSTR has successfully modeled the relationships between the objects targeted by the questions, or their parts, and the surrounding objects. These examples demonstrate the explainability and transparency of our model, showing how it arrives at the correct answers.

9 Complexity Analysis

We analyze the memory consumption with respect to the number of objects $n$ and the hierarchy depth $h$, given that other factors such as video length are fixed and that OSTR units in one layer share parameters.

There are two types of memory: $m_{a}$ from hidden unit activations, and $m_{p}$ for parameters and related terms. We found that $m_{a}$ is linear in $n$ and constant in $h$. On the other hand, $m_{p}$ is linear in $h$ and constant in $n$. These figures are on par with commonly used sequential models; no major extra memory consumption is introduced.