
Relation-aware Video Reading Comprehension for
Temporal Language Grounding

Jialin Gao1,2,3*  Xin Sun1,2  MengMeng Xu3  Xi Zhou1,2  Bernard Ghanem3
1Cooperative Medianet Innovation Center, Shanghai Jiao Tong University
2 CloudWalk Technology Co., Ltd, China
3 King Abdullah University of Science and Technology
{jialin_gao, huntersx}@sjtu.edu.cn, [email protected]
{mengmeng.xu,bernard.ghanem}@kaust.edu.sa
Jialin Gao and Xin Sun are co-first authors with equal contributions, supervised by Prof. Xi Zhou at SJTU.
Abstract

Temporal language grounding in videos aims to localize the temporal span relevant to the given query sentence. Previous methods treat it either as a boundary regression task or a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. This framework aims to select a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced, which leverages graph convolution to capture the dependencies among video moment choices for the best choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code is available at https://github.com/Huntersxsx/RaNet.

1 Introduction

Recently, temporal language grounding in videos has become a popular topic in the computer vision and natural language processing communities Gao et al. (2017); Krishna et al. (2017). This task requires a machine to localize a temporal moment semantically relevant to a given language query, as shown in Fig. 1. It has also drawn great attention from industry due to its various applications, such as video question answering Huang et al. (2020); Lei et al. (2018), video content retrieval Dong et al. (2019); Shao et al. (2018), and human-computer interaction Zhu et al. (2020).

Figure 1: An illustration of temporal language grounding in videos based on the relation-aware network. Given a video and a query sentence, our approach aims to semantically align the query representation with a predefined answer set of video moment candidates ($a_1$, $a_2$, $a_3$, and $a_4$) and then mine the relationships between them to select the best-matched one.

A straightforward paradigm for this task is the proposing-and-ranking pipeline Xu et al. (2019); Zhang et al. (2020b, 2019a), which first generates a number of video moment candidates and then ranks them according to moment-query similarities. Such a solution must achieve two key targets simultaneously: (1) semantic visual-language interaction and (2) reliable candidate ranking. The former ensures a satisfying cross-modal matching between video moments and the query, while the latter guarantees the distinction among candidates. For the first target, some previous works Yuan et al. (2019); Zhang et al. (2020b); Chen and Jiang (2019) resort to visual clues by modeling moment-sentence or snippet-sentence relations. However, they overlook the linguistic clues at the token level, i.e., token-moment relations, which contain fine-grained linguistic information. For the second target, previous solutions Ge et al. (2019); Liu et al. (2018); Zhang et al. (2020a) generate ranking scores by considering different moment candidates separately or constructing moment-level relations in a simple way Zhang et al. (2020b). Hence, they neglect the temporal and semantic dependencies among candidates. Without this information, it is difficult for previous approaches to distinguish visually similar moment candidates correctly.

To this end, we propose a Relation-aware Network (RaNet) to address temporal language grounding. In our solution, we formulate this task as Video Reading Comprehension by regarding the video, query, and moment candidates as the text passage, question description, and multi-choice options, respectively. Unlike previous methods, we exploit a coarse-and-fine interaction, which captures not only sentence-moment relations but also token-moment relations. This interaction allows our model to construct both sentence-aware and token-aware visual representations for each choice, which helps distinguish visually similar candidates. Moreover, we propose to leverage Graph Convolutional Networks (GCN) Kipf and Welling (2016) to mine the moment-moment relations between candidate choices based on their coarse-and-fine representations. With information exchange in GCNs, our RaNet can learn discriminative features for correctly ranking candidates despite their high relevance in visual content.

Similar to a multi-choice reading comprehension system, our RaNet consists of five components: a modality-wise encoder for visual and textual encoding, a multi-choice generator for answer set generation, a choice-query interactor for cross-modality interaction, a multi-choice relation constructor for relation mining, and an answer ranker for selecting the best-matched choice. Our contributions are summarized as three-fold: (1) We address temporal language grounding with a Relation-aware Network, which formulates this task as a video reading comprehension problem. (2) We exploit the visual and linguistic clues exhaustively, i.e., coarse-and-fine moment-query relations and moment-moment relations, to learn discriminative representations for distinguishing candidates. (3) The proposed RaNet outperforms other state-of-the-art methods on three widely-used challenging benchmarks: TACoS, Charades-STA, and ActivityNet-Captions, where we improve the grounding performance by a large margin (e.g., 33.54% vs. 25.32% of 2D-TAN on the TACoS dataset).

2 Related Work

Temporal Language Grounding. This task was introduced by Anne Hendricks et al. (2017); Gao et al. (2017) to locate relevant moments given a language query. He et al. (2019) and Wang et al. (2019) used reinforcement learning to solve this problem. Chen et al. (2018) and Ghosh et al. (2019) proposed to select the boundary frames based on visual-language interaction. Most recent works Xu et al. (2019); Yuan et al. (2019); Zhang et al. (2020b); Chen and Jiang (2019); Zhang et al. (2019a) adopt the two-step pipeline to solve this problem.
Visual-Language Interaction. Semantically matching query sentences and videos is vital for this task. This cross-modality alignment is usually achieved by attention mechanisms Vaswani et al. (2017) and sequential modeling Hochreiter and Schmidhuber (1997); Medsker and Jain (2001). Xu et al. (2019) and Liu et al. (2018) designed soft-attention modules, while Anne Hendricks et al. (2017) and Zhang et al. (2019b) chose the hard counterpart. Some works Chen et al. (2018); Ghosh et al. (2019) exploited the properties of RNN cells, and others went beyond them via dynamic filters Zhang et al. (2019a), Hadamard product Zhang et al. (2020b), QANet Lu et al. (2019), and circular matrices Wu and Han (2018). However, these alignments neglect the importance of token-aware visual features for cross-modal correlation and for distinguishing similar candidates.
Machine Reading Comprehension. Given a reference document or passage, Machine Reading Comprehension (MRC) requires the machine to answer questions about it Zhang et al. (2020c). Two types of existing MRC variants are related to temporal language grounding in videos, i.e., span extraction and multi-choice. The former Rajpurkar et al. (2016) extracts spans from the given passage and has been explored in the temporal language grounding task by some previous works Zhang et al. (2020a); Lu et al. (2019); Ghosh et al. (2019). The latter Lai et al. (2017); Sun et al. (2019) aims to find the only correct option among the given candidate choices based on the given passage. We propose to formulate this task from the perspective of multi-choice reading comprehension. Based on this formulation, we focus on visual-language alignment at the token-moment level. Compared with the query-aware context representation in previous solutions, we aim to construct token-aware visual features for each choice. Inspired by recent advanced attention modules Gao et al. (2020); Huang et al. (2019), we mine the relations among the choices in an effective and efficient way.

Figure 2: The overview of our proposed RaNet. It consists of a modality-wise encoder, a multi-choice generator, a choice-query interactor, a multi-choice relation constructor, and an answer ranker. The video and language passages are first encoded in separate branches. Then we initialize the visual representation of each choice $\langle t_i^s, t_i^e\rangle$ from the video stream. Through the choice-query interactor, each choice captures sentence-aware and token-aware representations from the query. Afterwards, the relation constructor takes advantage of GCNs to model the relationships between choices. Finally, the answer ranker evaluates the probability of each choice being selected based on the information exchanged in the former module.

3 Methodology

In this section, we first describe how to cast temporal language grounding as a multi-choice reading comprehension task, which is addressed by the proposed Relation-aware Network (RaNet). Then, we introduce the detailed architecture of RaNet, consisting of the five components shown in Fig. 2. Finally, we illustrate the training and inference of our solution.

3.1 Problem Definition

The goal of this task is to localize the video moment in an untrimmed video that semantically corresponds to a given language query. Referring to the forms of MRC, we treat the video $\mathbf{V}$ as a text passage, the query sentence $\mathbf{Q}$ as a question description, and provide a set of video moment candidates as a list of answer options $\mathbf{A}$. Based on the given triplet $(\mathbf{V},\mathbf{Q},\mathbf{A})$, temporal language grounding in videos is equivalent to cross-modal MRC, termed video reading comprehension.

For each query-video pair, we have one natural language sentence and an associated ground-truth video moment with start boundary $g^s$ and end boundary $g^e$. Each language sentence is represented by $\mathbf{Q}=\{q_i\}_{i=1}^{L}$, where $L$ is the number of tokens. The untrimmed video is represented as a sequence of snippets $\mathbf{V}=\{v_1, v_2, \cdots, v_{n_v}\}\in\mathcal{R}^{n_v\times C}$ extracted by a pretrained video understanding network, such as C3D Tran et al. (2015), I3D Carreira and Zisserman (2017), or VGG Simonyan and Zisserman (2014).

In temporal language grounding, the answer should be a consecutive subsequence (namely, a time span) of the video passage. Any video moment candidate $(i,j)$ is a valid answer if it satisfies $0<i<j<n_v$. Hence, we follow the fixed-interval sampling strategy of previous work Zhang et al. (2020b) and construct a set of video moment candidates as the answer list $A=\{a_1,\cdots,a_N\}$ with $N$ valid candidates. With these notations, we can recast the temporal language grounding task from the perspective of multi-choice reading comprehension as:

\operatorname*{arg\,max}_{i} P(a_{i}\mid(V,Q,A)). \qquad (1)

However, unlike traditional multi-choice reading comprehension, previous solutions for temporal language grounding also compare their performance in terms of the top-$K$ most matching candidates for each query sentence. For a fair comparison, our approach therefore scores $K$ candidate moments $\{(p_i, t_i^s, t_i^e)\}_{i=1}^{K}$, where $p_i$, $t_i^s$, and $t_i^e$ denote the selection probability, start time, and end time of answer $a_i$, respectively. Unless otherwise mentioned, the terms video moment and answer/choice are used interchangeably in this paper.
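For concreteness, the following minimal Python sketch illustrates how the valid moment choices could be enumerated from the snippet indices. The stride parameter is a hypothetical knob introduced only for illustration; the actual fixed-interval sampling of Zhang et al. (2020b) is sparser for long videos.

# Minimal sketch of answer-set generation; `stride` is a hypothetical knob,
# and the sampling in Zhang et al. (2020b) is sparser for long videos.
def generate_answer_set(num_snippets, stride=1):
    """Enumerate valid moment choices (i, j) with start index i < end index j."""
    answers = []
    for i in range(num_snippets):
        for j in range(i + 1, num_snippets, stride):
            answers.append((i, j))
    return answers

# e.g., a 16-snippet video yields 120 valid choices with stride 1
A = generate_answer_set(16)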

3.2 Architecture

As shown in Figure 2, we describe the details of each component in our framework as follows.
Modality-wise Encoder. This module separately encodes the content of the language query and the video. Each branch aggregates the intra-modality context for each snippet and token.

• Video Encoding. We first apply a simple temporal 1D convolution to map the input feature sequence to the desired dimension, followed by an average pooling layer that reshapes the sequence to a desired length $T$. To enrich the multi-hop interaction, we use a graph convolution block called the GC-NeXt block Xu et al. (2020), which aggregates the context from both temporal and semantic neighbors of each snippet $v_i$ and has proved effective for temporal action localization. Finally, we obtain the encoded visual feature $\hat{\mathbf{V}}\in\mathcal{R}^{C\times T}$.

• Language Encoding. Each word $q_i$ of $\mathbf{Q}$ is represented by its embedding vector from GloVe 6B 300d Pennington et al. (2014), yielding $\mathbf{Q}\in\mathcal{R}^{300\times L}$. We then feed the initialized embeddings into a three-layer Bi-LSTM network to capture semantic information and temporal context. We take the last layer's hidden states as the language representation $\hat{\mathbf{Q}}\in\mathcal{R}^{C\times L}$ for cross-modality fusion with the video representation $\hat{\mathbf{V}}$. The effect of different word embeddings is compared in Section 4.4.4.

The encoded visual and textual features can be formulated as follows:

\hat{\mathbf{V}} = \mathrm{VisualEncoder}(\mathbf{V}), \quad \hat{\mathbf{Q}} = \mathrm{LanguageEncoder}(\mathbf{Q}) \qquad (2)
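As an illustration of the two encoding branches, here is a minimal PyTorch sketch under assumed sizes (e.g., 4096-dimensional C3D snippet features); the GC-NeXt block of Xu et al. (2020) is omitted for brevity, so this is a simplified stand-in rather than the exact encoder.

import torch
import torch.nn as nn

class ModalityWiseEncoder(nn.Module):
    """Sketch: 1D conv + average pooling for video, 3-layer Bi-LSTM for language.
    The GC-NeXt block used in the paper is omitted here."""
    def __init__(self, video_dim=4096, word_dim=300, hidden=256, channels=512, T=128):
        super().__init__()
        self.video_proj = nn.Conv1d(video_dim, channels, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool1d(T)            # reshape the snippet axis to length T
        self.lstm = nn.LSTM(word_dim, hidden, num_layers=3,
                            bidirectional=True, batch_first=True)

    def forward(self, video, query):
        # video: (B, video_dim, n_v) snippet features; query: (B, L, 300) GloVe vectors
        v_hat = self.pool(torch.relu(self.video_proj(video)))   # (B, C, T)
        q_hat, _ = self.lstm(query)                             # (B, L, 2*hidden) = (B, L, C)
        return v_hat, q_hat.transpose(1, 2)                     # (B, C, T), (B, C, L)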
Figure 3: (a) Illustration of the choice generator and feature initialization. Each block $(i,j)$ with $i<j$ is a valid choice, denoted in blue. $\Psi$ combines the boundary features to initialize the feature of choice $(1,3)$, the dark blue square. (b) An example of all graph edges connected to one choice in our Multi-Choice Relation Constructor. Moments with the same start or end index (dark green) are connected with the illustrated choice (red). (c) Information propagation between two unconnected moment choices. For other moments (dark green) that are not connected with the target moment (red) but have overlaps, relations can be implicitly captured with two loops, namely two graph attention layers.

Multi-Choice Generator. As shown in Fig. 3 (a), the vertical and horizontal axes represent the start and end indices of the visual sequence. Blocks in the same row share the same start index, and those in the same column share the same end index. The white blocks in the lower-left indicate invalid choices, whose start boundaries exceed their end boundaries. Therefore, we have the multi-choice options $A=\{(t_i^s, t_i^e)\}_{i=1}^{N}$. To capture the visual-language interaction, we initialize a visual feature for the answer set $A$ so that it can be integrated with the textual features from the language encoder. To ensure boundary awareness, inspired by Wang et al. (2020), the initialization method $\Psi$ combines the boundary information, i.e., $\hat{v}_{t_i^s}$ and $\hat{v}_{t_i^e}$ in $\hat{\mathbf{V}}$, to construct the moment-level feature representation of each choice $a_i$. The initialized feature representation can be written as:

\mathbf{F}_{A} = \Psi(\hat{\mathbf{V}}, A), \qquad (3)

where $\Psi$ is the concatenation of $\hat{v}_{t_i^s}$ and $\hat{v}_{t_i^e}$, $A$ is the answer set, and $\mathbf{F}_A\in\mathcal{R}^{2C\times N}$. We also explore the effect of different choices of $\Psi$ on grounding performance in Section 4.4.3.
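A minimal sketch of the concatenation-based $\Psi$ is given below, assuming the choices are given as (start, end) snippet indices and $\hat{\mathbf{V}}$ is a batched tensor; the tensor layout is an assumption for illustration.

import torch

def init_choice_features(v_hat, answers):
    """Sketch of Psi as boundary-feature concatenation (Eq. 3).
    v_hat: (B, C, T) encoded video; answers: list of (start, end) snippet indices."""
    starts = torch.tensor([s for s, _ in answers], device=v_hat.device)
    ends = torch.tensor([e for _, e in answers], device=v_hat.device)
    f_start = v_hat[:, :, starts]                 # boundary feature at each start, (B, C, N)
    f_end = v_hat[:, :, ends]                     # boundary feature at each end,   (B, C, N)
    return torch.cat([f_start, f_end], dim=1)     # F_A: (B, 2C, N)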
Choice-Query Interactor. As shown in Figure 2, this module explores the inter-modality context for visual-language interaction. Unlike previous methods Zhang et al. (2020b, a); Zeng et al. (2020), we propose a coarse-and-fine cross-modal interaction that integrates the initialized features $\mathbf{F}_A$ with the query at both the sentence level and the token level. The former can be obtained by a simple Hadamard product and a normalization:

\mathbf{F}_{1} = \|\varphi(\hat{\mathbf{Q}})\odot \mathrm{Conv}(\mathbf{F}_{A})\|_{F}, \qquad (4)

where $\varphi$ is the aggregation function producing a global representation of $\hat{\mathbf{Q}}$ (we use max-pooling), $\odot$ is element-wise multiplication, and $\|\cdot\|_F$ denotes Frobenius normalization.

To obtain a token-aware visual feature for each choice $a_i$, we adopt an attention mechanism to learn the token-moment relations between each choice and the query. First, we apply a 1D convolution layer to project the visual and textual features into a common space and compute their semantic similarities, which depict the relationships $\mathbf{R}\in\mathcal{R}^{N\times L}$ between the $N$ candidates and the $L$ tokens. Second, we generate a query-related feature for each candidate based on the relationships $\mathbf{R}$. Finally, we integrate these two features of the candidates to form the token-aware visual representation.

\mathbf{R} = \mathrm{Conv}(\mathbf{F}_{A})^{T}\otimes \mathrm{Conv}(\hat{\mathbf{Q}}), \quad \mathbf{F}_{2} = \mathrm{Conv}(\mathbf{F}_{A})\odot(\mathrm{Conv}(\hat{\mathbf{Q}})\otimes\mathbf{R}^{T}), \qquad (5)

where $T$ denotes the matrix transpose, and $\odot$ and $\otimes$ are element-wise and matrix multiplications, respectively. We add the sentence-aware feature $\mathbf{F}_1$ and the token-aware feature $\mathbf{F}_2$ to obtain the output of this module, $\hat{\mathbf{F}}_A$.
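The following PyTorch sketch illustrates the coarse-and-fine interaction of Eqs. 4-5. The projection sizes are assumptions, and the Frobenius normalization of Eq. 4 is approximated here by a per-channel L2 normalization, so this is a simplified sketch rather than the released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChoiceQueryInteractor(nn.Module):
    """Sketch of the coarse-and-fine interaction (Eqs. 4-5); sizes are assumptions."""
    def __init__(self, channels=512):
        super().__init__()
        self.proj_a = nn.Conv1d(2 * channels, channels, 1)   # project choice features F_A
        self.proj_q = nn.Conv1d(channels, channels, 1)       # project token features Q_hat

    def forward(self, f_a, q_hat):
        # f_a: (B, 2C, N) choice features; q_hat: (B, C, L) token features
        a = self.proj_a(f_a)                                  # (B, C, N)
        q = self.proj_q(q_hat)                                # (B, C, L)
        q_global = q_hat.max(dim=2, keepdim=True).values      # sentence-level query, (B, C, 1)
        f1 = F.normalize(q_global * a, dim=1)                 # coarse, sentence-aware (cf. Eq. 4)
        r = torch.einsum('bcn,bcl->bnl', a, q)                # token-moment relations R, (B, N, L)
        f2 = a * torch.einsum('bcl,bnl->bcn', q, r)           # fine, token-aware (cf. Eq. 5)
        return f1 + f2                                        # \hat{F}_A: (B, C, N)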

Multi-choice Relation Constructor. To explore the relations among the multiple choices, this module aggregates information from overlapping moment candidates via GCNs. Previous methods such as MAN Zhang et al. (2019a) and 2D-TAN Zhang et al. (2020b) also considered moment-wise temporal relations, but both suffer from two drawbacks: expensive computation and noise from unnecessary relations. Inspired by CCNet Huang et al. (2019), which proposed a sparsely-connected graph attention module to collect contextual information in horizontal and vertical directions, we propose a Graph ATtention layer (GAT) to construct relations between moment candidates that have high temporal overlaps with each other.

Concretely, we take each answer candidate $a_i=(t_i^s, t_i^e)$ as a graph node, and connect two candidate choices $a_i$ and $a_j$ with a graph edge if they share the same start or end time spot, i.e., $t_i^s=t_j^s$ or $t_i^e=t_j^e$. An example is shown in Figure 3 (b), where the neighbors of the target moment choice (red) are denoted in dark green in a criss-cross shape. As shown in Figure 3 (c), our model is also able to propagate information between two unconnected moment choices. For other moments (dark green) that are not connected with the target moment (red) but overlap with it, relations can be implicitly captured with two loops, namely two graph attention layers. The message passing between the dark green moment and the cyan moments is guaranteed in the first loop. Then, in the second loop, we construct relations between the cyan moments and the target moment, so the information from the dark green moment is finally propagated to the red moment.
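The criss-cross connectivity can be written down directly as a binary adjacency matrix. The sketch below builds $\hat{A}$ from the answer set; whether to keep self-loops is a design choice left open here.

import torch

def build_choice_adjacency(answers):
    """Sketch of the graph adjacency used by the relation constructor:
    two choices are connected if they share a start or an end index."""
    starts = torch.tensor([s for s, _ in answers])
    ends = torch.tensor([e for _, e in answers])
    same_start = starts.unsqueeze(0) == starts.unsqueeze(1)   # (N, N)
    same_end = ends.unsqueeze(0) == ends.unsqueeze(1)         # (N, N)
    adj = (same_start | same_end).float()
    adj.fill_diagonal_(0)        # optionally drop self-loops
    return adj                   # \hat{A}: (N, N) binary adjacency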

Given the choice-query features $\hat{\mathbf{F}}_A\in\mathbb{R}^{C\times N}$, there are $N$ nodes and approximately $2TN$ edges in the graph. A GAT layer inspired by Huang et al. (2019) is applied on the graph: for each moment, we compute attention weights over its neighbours along a criss-cross path and average their features with the weights. The output of the GAT layer can be formulated as:

\hat{\mathbf{F}}_{A}^{*} = \mathrm{Conv}(\mathrm{GAT}(\mathrm{Conv}(\hat{\mathbf{F}}_{A}), \hat{A})), \qquad (6)

where $\hat{A}$ is the adjacency matrix of the graph that determines the connections between two moment choices, defined by the predefined answer set $A$.
Answer Ranker. Having captured the relationships among the multiple choices via GCNs, we adopt an answer ranker to predict the ranking score of each answer candidate $a_i$ and select the best-matched one. The ranker takes the query-aware feature $\hat{\mathbf{F}}_A$ and the relation-aware feature $\hat{\mathbf{F}}_A^{*}$ as input and concatenates them (denoted as $\|$) to aggregate more contextual information. We then employ a convolution layer to generate the probability $P_A$ of each choice $a_i$ in the predefined answer set $A$ being selected. The output can be computed as:

P_{A} = \sigma(\mathrm{Conv}(\hat{\mathbf{F}}_{A}^{*}\,\|\,\hat{\mathbf{F}}_{A})), \qquad (7)

where $\sigma$ denotes the sigmoid activation function.
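A minimal sketch of the ranker in Eq. 7 is shown below, assuming both inputs have C channels and using a 1x1 convolution as the scoring layer; these shapes are illustrative assumptions.

import torch
import torch.nn as nn

class AnswerRanker(nn.Module):
    """Sketch of Eq. 7: concatenate query-aware and relation-aware choice features,
    then score each choice with a 1x1 convolution and a sigmoid."""
    def __init__(self, channels=512):
        super().__init__()
        self.score = nn.Conv1d(2 * channels, 1, kernel_size=1)

    def forward(self, f_hat, f_star):
        # f_hat, f_star: (B, C, N) features from the interactor and relation constructor
        fused = torch.cat([f_star, f_hat], dim=1)             # (B, 2C, N)
        return torch.sigmoid(self.score(fused)).squeeze(1)    # P_A: (B, N)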

3.3 Training and Inference

Following Zhang et al. (2020b), we first calculate the Intersection-over-Union (IoU) between the answer set $A$ and the ground-truth annotation $(g^s, g^e)$, and then rescale the IoUs with two thresholds $\theta_{min}$ and $\theta_{max}$:

g_{i} = \begin{cases} 0 & \theta_{i}\leq\theta_{min}\\ \frac{\theta_{i}-\theta_{min}}{\theta_{max}-\theta_{min}} & \theta_{min}<\theta_{i}<\theta_{max}\\ 1 & \theta_{i}\geq\theta_{max}\end{cases} \qquad (8)

where $g_i$ and $\theta_i$ are the supervision label and the corresponding IoU between $a_i$ and the ground truth, respectively. Hence, the total training loss of our RaNet is:

\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left(g_{i}\log p_{i}+(1-g_{i})\log(1-p_{i})\right), \qquad (9)

where $p_i$ is the output score in $P_A$ for answer choice $a_i$ and $N$ is the number of choices. In the inference stage, we rank all the answer options in $A$ according to their probabilities in $P_A$.
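A compact sketch of the label rescaling in Eq. 8 and the binary cross-entropy loss in Eq. 9 follows, assuming the per-choice IoUs are precomputed as a tensor; the function names are illustrative.

import torch
import torch.nn.functional as F

def scaled_iou_labels(ious, theta_min=0.5, theta_max=1.0):
    """Sketch of Eq. 8: linearly rescale IoUs between the two thresholds into [0, 1]."""
    g = (ious - theta_min) / (theta_max - theta_min)
    return g.clamp(0.0, 1.0)

def ranking_loss(scores, ious, theta_min=0.5, theta_max=1.0):
    """Sketch of Eq. 9: binary cross-entropy against the rescaled IoU labels."""
    labels = scaled_iou_labels(ious, theta_min, theta_max)
    return F.binary_cross_entropy(scores, labels)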

4 Experiments

Methods  Rank1@0.3  Rank1@0.5  Rank5@0.3  Rank5@0.5
MCN - 5.58 - 10.33
CTRL 18.32 13.30 36.69 25.42
ACRN 19.52 14.62 34.97 24.88
ROLE 15.38 9.94 31.17 20.13
TGN 21.77 18.9 39.06 31.02
ABLR 19.50 9.40 - -
SM-Rl 20.25 15.95 38.47 27.84
CMIN 24.64 18.05 38.46 27.02
QSPN 20.15 15.23 36.72 25.30
ACL-K 24.17 20.01 42.15 30.66
2D-TAN 37.29 25.32 57.81 45.04
DRN - 23.17 - 33.36
DEBUG 23.45 11.72 - -
VSLNet 29.61 24.27 - -
Ours 43.34 33.54 67.33 55.09
Table 1: Performance comparison on TACoS. All results are reported in percentage (%).

To evaluate the effectiveness of the proposed approach, we conduct extensive experiments on three challenging public datasets: TACoS Regneri et al. (2013), ActivityNet Captions Krishna et al. (2017), and Charades-STA Sigurdsson et al. (2016).

4.1 Dataset

TACoS. It consists of 127 videos of different activities performed in the kitchen. We follow the convention in Gao et al. (2017), where the training, validation, and testing sets contain 10,146, 4,589, and 4,083 query-video pairs, respectively.
Charades-STA. It is extended from the Charades dataset by Gao et al. (2017) with language descriptions, leading to 12,408 and 3,720 query-video pairs for training and testing.
ActivityNet Captions. It was recently introduced into the temporal language grounding task. Following the setting in CMIN Lin et al. (2020), we use val_1 as the validation set and val_2 as the testing set, yielding 37,417, 17,505, and 17,031 query-video pairs for training, validation, and testing, respectively.

4.2 Implementation Details

Evaluation metric. Following Gao et al. (2017), we compute Rank $k$@$\mu$ for a fair comparison. It denotes the percentage of testing samples that have at least one correct answer among the top-$k$ choices. A selected choice $a_i$ is correct when its IoU $\theta_i$ with the ground truth is larger than the threshold $\mu$; otherwise, the choice is wrong. Specifically, we set $k\in\{1,5\}$, and $\mu\in\{0.3,0.5\}$ for TACoS and $\mu\in\{0.5,0.7\}$ for the other two datasets.
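For reference, a small sketch of how Rank $k$@$\mu$ could be computed from ranked predictions is given below; the data layout (lists of (start, end) spans) is an assumption for illustration.

def rank_k_at_iou(predictions, ground_truths, k=1, mu=0.5):
    """Sketch of Rank k@mu: fraction of queries whose top-k choices contain
    at least one moment with IoU > mu against the ground-truth span."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0

    hits = 0
    for ranked_spans, gt in zip(predictions, ground_truths):
        if any(iou(span, gt) > mu for span in ranked_spans[:k]):
            hits += 1
    return 100.0 * hits / len(predictions)   # reported in percentage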

Feature Extractor. We follow Zhang et al. (2019a); Lin et al. (2020) and adopt the same extractors, i.e., VGG Simonyan and Zisserman (2014) features for Charades-STA and C3D Tran et al. (2015) features for the other two. We also use I3D Carreira and Zisserman (2017) features to compare with Ghosh et al. (2019); Zhang et al. (2020a); Zeng et al. (2020) on Charades-STA. For word embedding, we use the pre-trained GloVe 6B 300d Pennington et al. (2014) as in previous solutions Ge et al. (2019); Chen et al. (2018).

Architecture settings. In all experiments, we set the number of hidden units of the Bi-LSTM to 256, and the number of reshaped snippets $T$ to 128 for TACoS, 64 for ActivityNet Captions, and 16 for Charades-STA. The channel dimension $C$ is 512. We adopt 2 GAT layers for all benchmarks, and position embedding is used on ActivityNet Captions as in Zeng et al. (2020).

Training settings. We adopt the Adam optimizer with a learning rate of $1\times10^{-3}$, a batch size of 32, and 15 training epochs. Following Zhang et al. (2020b), the thresholds $\theta_{min}$ and $\theta_{max}$ are set to 0.5 and 1.0 for Charades-STA and ActivityNet Captions, and to 0.3 and 0.7 for TACoS.

4.3 Comparison with State-of-the-arts

Our RaNet is compared with recently published state-of-the-art methods: VSLNet Zhang et al. (2020a), 2D-TAN Zhang et al. (2020b), DRN Zeng et al. (2020), CMIN Lin et al. (2020), DEBUG Lu et al. (2019), QSPN Xu et al. (2019), MAN Zhang et al. (2019a), ExCL Ghosh et al. (2019), CTRL Gao et al. (2017), etc. The top-2 performance values are highlighted in bold and underline, respectively.

TACoS. Table 1 summarizes the performance comparison of different methods on the test split. Our RaNet outperforms all the competitive methods by clear margins and reports the highest scores for all IoU thresholds. Compared with the previous best method 2D-TAN, our model achieves at least 6% absolute improvement across all evaluation settings in terms of Rank 1@$\mu$, e.g., 8.22% for $\mu=0.5$. For the Rank 5@$\mu$ metric, we reach around 10% absolute improvement. It is worth noting that we exceed VSLNet, which also formulates this task from the perspective of MRC, by 9.27% and 13.73% in terms of Rank 1@$\mu=0.5$ and $\mu=0.3$, respectively.

Charades-STA. We evaluate our method on both the VGG and I3D features used in previous works for a fair comparison. As illustrated in Table 2, our approach reaches the highest Rank 1 scores regardless of which feature is adopted. With VGG features, we improve the performance from 23.68% of DRN to 26.83% in terms of Rank 1@$\mu=0.7$. With the stronger I3D features, our method also exceeds VSLNet in terms of Rank 1@$\mu=\{0.5,0.7\}$ (i.e., 60.40% vs. 54.19% and 39.65% vs. 35.22%).

Methods  Rank1@0.5  Rank1@0.7  Rank5@0.5  Rank5@0.7
VGG
MCN 17.46 8.01 48.22 26.73
CTRL 23.63 8.89 58.92 29.52
ABLR 24.36 9.01 - -
QSPN 35.60 15.80 79.40 45.40
ACL-K 30.48 12.20 64.84 35.13
DEBUG 37.39 17.69 - -
MAN 41.24 20.54 83.21 51.85
2D-TAN 39.70 23.31 80.32 51.26
DRN 42.90 23.68 87.80 54.87
Ours 43.87 26.83 86.67 54.22
I3D
ExCL 44.10 22.40 - -
VSLNet 54.19 35.22 - -
DRN 53.09 31.50 89.06 60.05
Ours 60.40 39.65 89.57 64.54
Table 2: Performance comparison on Charades-STA. All results are reported in percentage (%).

ActivityNet-Captions. In Table 3, we compare our model with other competitive methods. Our model achieves the highest scores over all IoU thresholds except for Rank 5@$\mu=0.5$. In particular, our model outperforms the previous best method (i.e., 2D-TAN) by around 1.29% absolute improvement in terms of Rank 1@$\mu=0.7$. Since the sampling strategy for moment candidates is the same, this improvement is mostly attributed to the token-aware visual representation and the relation mining among the multiple choices.

Methods  Rank1@0.5  Rank1@0.7  Rank5@0.5  Rank5@0.7
MCN 21.36 6.43 53.23 29.70
CTRL 29.01 10.34 59.17 37.54
ACRN 31.67 11.25 60.34 38.57
TGN 27.93 - 44.20 -
QSPN 33.26 13.43 62.39 40.78
ExCL 42.7 24.1 - -
CMIN 44.62 24.48 69.66 52.96
ABLR 36.79 - - -
DEBUG 39.72 - - -
2D-TAN 44.05 27.38 76.65 62.26
DRN 45.45 24.39 77.97 50.30
VSLNet 43.22 26.16 - -
Ours 45.59 28.67 75.93 62.97
Table 3: Performance comparison on ActivityNet Captions. All results are reported in percentage (%).
Datasets  $\mathbf{F}_1$  $\mathbf{F}_2$  $\mathbf{R}$  Rank1@0.3  Rank1@0.5
TACoS 40.99 28.54
41.26 29.22
41.51 29.64
42.26 32.04
43.34 33.54
$\mathbf{F}_1$  $\mathbf{F}_2$  $\mathbf{R}$  Rank1@0.5  Rank1@0.7
Charades-STA 43.06 24.70
42.72 24.33
42.10 24.78
43.60 25.30
43.87 26.83
Table 4: Effectiveness of each component in our proposed approach on TACoS and Charades-STA, measured by Rank 1@$\mu\in\{0.3,0.5,0.7\}$. VGG features are used on Charades-STA. ✓ and ✗ denote the model with and without that component, respectively.

4.4 Ablation Study

4.4.1 Effectiveness of Network Components

We perform complete and in-depth studies of the effectiveness of our choice-query interactor and multi-choice relation constructor on the TACoS and Charades-STA datasets. On each dataset, we conduct five comparison experiments. First, we remove $\mathbf{F}_2$ and $\mathbf{R}$ to obtain the RaNet-base model, compared with the variant using only $\mathbf{F}_2$. Then, we integrate the interaction and relation modules in the third and fourth experiments, respectively. Finally, we report the best performance achieved by our full approach. Table 4 summarizes the grounding results in terms of Rank 1@$\mu\in\{0.3,0.5,0.7\}$. Without the interaction and relation modules, our framework achieves 40.99% and 28.54% for $\mu=0.3$ and $0.5$, respectively. It already outperforms the previous best method 2D-TAN, indicating the power of our modality-wise encoder. When we add the token-aware visual representation, our framework brings significant improvements on both datasets. Improvements are also observed when adding the relation module. These results demonstrate the effectiveness of our RaNet for temporal language grounding.

Figure 4: Detailed comparison across different IoUs on three benchmarks in terms of Rank 1.

4.4.2 Improvement on different IoUs

To better understand our approach, we illustrate in Figure 4 the performance gain over the previous best method, 2D-TAN, on the three datasets for different $\mu\in(0,1)$. The figure shows that our approach consistently improves the performance, especially at higher IoUs (i.e., $\mu>0.7$). We also observe that the relative improvement increases with the IoU threshold on the TACoS and ActivityNet Captions datasets.

4.4.3 Feature Initialization Functions

We conduct experiments to reveal the effect of different feature initialization functions. A moment candidate $(t_i^s, t_i^e)$ has a corresponding feature sequence $Y=\{\hat{v}_k\}_{k=t_i^s}^{t_i^e}$ from $\hat{\mathbf{V}}$. We explore four types of operators (i.e., pooling, sampling, concatenation, and addition) in the multi-choice generator. The first two consider all the information within the temporal span of the candidate: the pooling operator focuses on the statistical characteristics, while the sampling operator serves as a weighted average. In contrast, the last two only consider the boundary information ($\hat{v}_{t_i^s}$ and $\hat{v}_{t_i^e}$) of a moment candidate, which encourages the cross-modal interaction to be boundary sensitive. Table 5 reports the performance of the different operators on the TACoS dataset. The concatenation operator achieves the highest scores across all evaluation criteria, which indicates that boundary-sensitive operators perform better than the statistical ones.

$\Psi$  Rank1@0.3  Rank1@0.5  Rank5@0.3  Rank5@0.5
Pooling 38.84 29.29 63.31 49.86
Sampling 41.33 30.82 65.58 54.69
Addition 42.69 31.59 64.98 54.36
Concatenation 43.34 33.54 67.33 55.09
Table 5: Effectiveness of different operators used in the Multi-Choice Generator on TACoS, measured by Rank 1@$\mu\in\{0.3,0.5\}$ and Rank 5@$\mu\in\{0.3,0.5\}$.

4.4.4 Word Embeddings Comparison

To further explore the effect of different textual features, we also conduct experiments with four pre-trained word embeddings (i.e., GloVe 6B, GloVe 42B, GloVe 840B, and BERT-Base). GloVe Pennington et al. (2014) is an unsupervised learning algorithm for obtaining vector representations of words, with publicly available vectors trained on corpora of varying sizes. BERT Devlin et al. (2019) is a language representation model that considers bidirectional context and has achieved state-of-the-art performance on many NLP tasks. All the GloVe vectors have 300 dimensions, whereas BERT-Base produces 768-dimensional vectors. Table 6 compares the performance of these four pre-trained word embeddings on the TACoS dataset. The results show that better word embeddings (i.e., BERT) tend to yield better performance, suggesting that more attention should be paid to textual feature encoding. Unless otherwise specified, all models in this paper use concatenation as the multi-choice feature initialization function and GloVe 6B word vectors for word embedding initialization.

Methods  Rank1@0.1  Rank1@0.3  Rank1@0.5  Rank1@0.7
GloVe 6B 54.26 43.34 33.54 18.57
GloVe 42B 54.74 44.21 34.37 20.24
GloVe 840B 53.11 44.51 34.87 19.65
BERT-Base 57.34 46.26 34.72 21.54
Table 6: Comparison of different word embeddings on TACoS, measured by Rank 1@$\mu\in\{0.1,0.3,0.5,0.7\}$.

4.4.5 Efficiency of Our RaNet

Both fully-connected graph neural networks and stacked convolution layers result in high computational complexity and a huge amount of GPU memory. With the sparsely-connected graph attention module used in our Multi-choice Relation Constructor, we can capture moment-wise relations from global dependencies more efficiently and effectively. Table 7 compares the parameters and FLOPs of our model with 2D-TAN, which uses several convolution layers to capture the context of adjacent moment candidates. RaNet is much more lightweight, with only 11M parameters compared to 92M for 2D-TAN on ActivityNet. RaNet-base replaces the relation constructor with the same 2D convolutional layers as 2D-TAN; the comparison of their FLOPs further indicates the efficiency of our relation constructor against simple convolution layers.

Methods  TACoS  Charades  ActivityNet
Params
  2D-TAN  60.93M  60.93M  91.59M
  RaNet-base  61.52M  59.95M  90.64M
  RaNet  12.80M  12.80M  10.99M
FLOPs
  2D-TAN  2133.26G  104.64G  997.30G
  RaNet-base  2137.68G  104.72G  999.54G
  RaNet  175.36G  4.0G  43.92G
Table 7: Parameters and FLOPs of our RaNet compared with the previous best method 2D-TAN, which also considers moment-level relations. M and G represent $10^6$ and $10^9$, respectively.

4.4.6 Qualitative Analysis

We further show some examples from the ActivityNet Captions dataset in Figure 5. The predictions of our approach are closer to the ground truth than those of our baseline model, i.e., the variant in Table 4 with $\mathbf{F}_2$ and $\mathbf{R}$ removed. Since the moment candidates are the same for both, this also demonstrates the effect of our proposed modules. With the interaction and relation construction modules, our approach selects the choice of video moment that best matches the query sentence. In turn, this reflects that capturing token-aware visual representations for moment candidates and the relations among candidates helps the network score candidates better.

Figure 5: The qualitative results of RaNet and RaNet-base on the ActivityNet Captions dataset.

5 Conclusion

In this paper, we propose a novel Relation-aware Network to address temporal language grounding in videos. We first formulate this task from the perspective of multi-choice reading comprehension. Then we interact the visual and textual modalities in a coarse-and-fine fashion to obtain token-aware and sentence-aware representations of each choice. Further, a GAT layer is introduced to mine the relations among the multiple choices for better ranking. Our model is efficient and outperforms state-of-the-art methods on three benchmarks, i.e., ActivityNet-Captions, TACoS, and Charades-STA.

Acknowledgements

We would like to thank everyone who has helped with this paper. The support from CloudWalk Technology Co., Ltd is gratefully acknowledged. This work was also supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.

References