
Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos

Jie Wu (Sun Yat-sen University), Guanbin Li (Sun Yat-sen University), Xiaoguang Han (Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong), and Liang Lin (Sun Yat-sen University, DarkMatter AI)
Abstract.

Temporal grounding of natural language in untrimmed videos is a fundamental yet challenging multimedia task facilitating cross-media visual content retrieval. We focus on the weakly supervised setting of this task, which merely has access to coarse video-level language descriptions without temporal boundary annotations; this setting is more consistent with reality, as such weak labels are more readily available in practice. In this paper, we propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning (RL) to guide the process of progressively refining the temporal boundary. To the best of our knowledge, we offer the first attempt to extend RL to the temporal localization task with weak supervision. As it is non-trivial to obtain a straightforward reward function in the absence of pairwise granular boundary-query annotations, a cross-modal alignment evaluator is crafted to measure the alignment degree of each segment-query pair and provide tailor-designed rewards. This refinement scheme completely abandons the traditional sliding-window-based solution pattern and contributes to acquiring more efficient, boundary-flexible and content-aware grounding results. Extensive experiments on two public benchmarks, Charades-STA and ActivityNet, demonstrate that BAR outperforms the state-of-the-art weakly-supervised methods and even beats some competitive fully-supervised ones.

Temporal grounding of natural language in untrimmed videos, Reinforcement learning, Boundary adaptive refinement

1. Introduction

Temporal grounding of natural language in untrimmed video is an emerging and crucial task due to its potential applications in human-robot interaction and cross-media analysis. It aims to locate the temporal segment that is most relevant to a given sentence query in an untrimmed video. Albeit with varying degrees of progress, most of its recent successes (Gao et al., 2017; Liu et al., 2018a; Ge et al., 2019; Chen and Jiang, 2019; Xu et al., 2019; Yuan et al., 2019; Zhang et al., 2019a; Wang et al., 2019; He et al., 2019; Zhang et al., 2019b; Wu et al., 2020) rely on a fully supervised setting, i.e., mappings between video intervals and the corresponding sentence descriptions are available in the training set. Acquiring such granular annotations requires a huge amount of manual effort, which becomes a critical bottleneck as this task is pushed toward larger-scale and more complicated scenarios. To alleviate such expensive and unwieldy annotation, (Mithun et al., 2019) proposes to address this task in the weakly supervised setting, learning to infer the language-related temporal range from video-level supervision. This weakly supervised paradigm only has access to video-level language description annotations without their corresponding temporal boundary specifications. This is an exceedingly favorable scheme since coarse video-level annotations are more readily available on the internet. In our work, we focus on this weakly supervised paradigm.

Many approaches (Gao et al., 2017; Liu et al., 2018a; Ge et al., 2019; Chen and Jiang, 2019; Mithun et al., 2019) employ a two-stage “proposal-and-rank” solution pattern to address the task of temporal grounding of natural language. However, these works concentrate on learning more robust cross-modal representations in the ranking branch without explicitly modeling boundary-flexible and content-aware proposals. As shown in the left half of Figure 1, the “proposal-and-rank” pattern is inherently restrictive as it relies heavily on pre-defined and inflexible sliding windows (e.g., 128 and 256 frames (Mithun et al., 2019)), which results in poor generalization for videos with considerable variance in length. Moreover, it raises two additional challenges when extended to the weakly supervised setting. First, offset regression learning (Gao et al., 2017) for boundary adjustment becomes impractical in the absence of granular annotations. Second, with access only to video-query pairs during training, the leading model (Mithun et al., 2019) can merely learn cross-modal mappings across videos, while failing to take into account more subtle and fine-grained semantic concepts within a video. These suboptimal cross-modal mappings generally lead to less accurate boundary prediction.

Figure 1. The diagrams of the sliding window based proposal-and-rank pattern and the novel boundary adaptive refinement process. The input query in this example is “person goes back to close the door.” The traditional pattern is constrained by fixed sliding-window templates and has to process extensive candidate segments one by one to localize the query. In contrast, boundary adaptive refinement flexibly adjusts the boundary via a series of actions.

To better cope with the above issues, as shown in the right half of Figure 1, we formulate the task as a cross-modal matching guided heuristic process, a.k.a. Boundary Adaptive Refinement (BAR). BAR resorts to a tailor-designed reinforcement learning paradigm to adaptively optimize the temporal boundary towards shrinking the cross-modal semantic gap. Reinforcement learning (RL) has been validated in various tasks of fully supervised video understanding, including video recognition (Yeung et al., 2016) and video referring expression (He et al., 2019). This work can be regarded as the first attempt to extend RL to weakly supervised temporal localization tasks. Due to the lack of matching supervision between specific video intervals and the statement, it is non-trivial to design a reinforcement learning state assessment and reward function that can effectively drive the model toward efficient temporal boundary optimization. Our proposed BAR framework hence includes a context-aware feature extractor for encoding the current and contextualized environment state, an adaptive action planner for deciding the direction and amplitude of each adjustment, and, most distinctively, a cross-modal alignment evaluator for estimating the alignment score of each segment-query pair in the absence of pairwise supervisory information. This alignment evaluator is crafted to assign a corresponding reward by comparing the alignment scores of consecutive segment-query pairs under the guidance of both inter-video and intra-video ranking losses. This modularized design and heuristic adaptive temporal window adjustment strategy makes the solution pattern more flexible and closer to the human perception-and-retrieval mechanism. Furthermore, the search can be guided and pruned with goal-oriented rewards in a larger search space to extract more accurate temporal windows, while occupying as little time as possible to reach more impressive results.

The contributions of this work are summarized as follows:

\bullet We design a Boundary Adaptive Refinement framework that resorts to reinforcement learning to address the task of weakly supervised temporal grounding of language in video. To the best of our knowledge, we are the first to apply RL to the temporal localization task with weak supervision.

\bullet BAR abandons the traditional sliding-window-based proposal-and-rank pattern and employs a novel boundary adaptive refinement process, which contributes to acquiring more efficient, boundary-flexible and content-aware grounding results.

\bullet Experimental results on two benchmark datasets Charades-STA (Gao et al., 2017) and ActivityNet (Krishna et al., 2017) demonstrate that BAR outperforms the existing state-of-the-art weakly-supervised methods, and even beats some competitive fully-supervised ones.

2. Related work

Figure 2. The overall architecture of the Boundary Adaptive Refinement (BAR) framework, which consists of a context-aware feature extractor, an adaptive action planner and a cross-modal alignment evaluator.

Temporal Grounding of Natural Language in Video. Temporal grounding of natural language aims to determine the start and end time of a temporal segment in an untrimmed video that corresponds to a language query. It is a temporal extension of image referring expression comprehension (Yang et al., 2019a, b, 2020), and is also a challenging multimedia task that requires cross-modal fusion and fine-grained interactions between the verbal and visual modalities. Many approaches (Gao et al., 2017; Liu et al., 2018a; Ge et al., 2019; Chen and Jiang, 2019; Mithun et al., 2019) employ a two-stage “proposal-and-rank” manner, which first generates temporal proposals and then selects the one with the highest confidence score. However, these approaches rely on external sliding-window matching and ranking, leading to inflexible boundaries and high time consumption. To formulate a computationally efficient framework, Chen et al. (Chen et al., 2018a) designed an end-to-end deep neural network that merely performs a single pass to obtain the grounding result. Xu et al. (Xu et al., 2019) proposed a multi-level model to integrate visual-query features at an earlier stage and further introduced caption generation as an auxiliary task.

Weakly Supervised Learning. Weakly-supervised learning is a research setup that aims at optimizing a model without substantial manually labeled information. Many computer vision and multi-modal tasks, such as salient object detection (Li et al., 2018), captioning (Duan et al., 2018), language grounding (Paul et al., 2018; Mithun et al., 2019), and referring expression grounding (Liu et al., 2019), have explored the weakly-supervised setup, since granular annotations are much more resource-consuming compared to coarse annotations. Wang et al. (Wang et al., 2018) proposed a weakly supervised collaborative learning framework to resolve the task of weakly supervised object detection, which only requires image-level labels. In the video domain, Duan et al. (Duan et al., 2018) formulated a new task: weakly supervised dense event captioning. The goal of this task is to detect and describe all events of interest contained in a video without dense segment annotations for model training. The work most closely related to ours is (Mithun et al., 2019). Mithun et al. (Mithun et al., 2019) designed a Text-Guided Attention (TGA) mechanism that leverages latent alignment between video frames and sentence descriptions to address the same task as ours.

Reinforcement Learning. Reinforcement learning (RL) originates from the neuroscientific and psychological understanding of how humans learn to optimize their behaviors in an environment. It can be mathematically formulated as a Markov Decision Process (MDP) in a sequential decision-making manner. Recently, RL techniques (Williams, 1992) have been utilized to imitate humans’ thinking patterns in various tasks, which can generally be formulated as an MDP that executes a series of actions to accomplish a task-specific objective (Yu et al., 2019; Ranzato et al., 2015; He et al., 2019; Shi et al., 2019; Chen et al., 2018b). Ranzato et al. (Ranzato et al., 2015) used the REINFORCE algorithm to train the captioning model at the sequence level by directly optimizing the non-differentiable metric. Yeung et al. (Yeung et al., 2016) adopted the REINFORCE algorithm to optimize an end-to-end approach for reasoning about the temporal bounds of an action and its category. He et al. (He et al., 2019) resorted to an RL-based method to address the fully-supervised version of the studied task, which utilized the temporal IoU as the reward indicator. Our work offers the first attempt to extend RL to accomplish the proposed task with weak supervision. To estimate an accurate reward function in the absence of pairwise supervisory information, a cross-modal alignment evaluator is crafted to provide tailor-designed rewards.

3. Methodology

3.1. Problem Formulation

Following the commonly used formulation (Gao et al., 2017), we represent a video $V$ by $N$ clips $\{V_{1},V_{2},\ldots,V_{N}\}$, where each clip corresponds to a small chunk of sequential frames. Taking $V$ and a text query $T$ as inputs, the studied task aims to output a video segment $[j,k]$ ($j$ and $k$ indicate the start and end clip indices, respectively) that semantically matches the query description. Our work focuses on the weakly supervised setting of this task. Specifically, only a set of $V$-$T$ pairs is provided, and the video segment annotation for each pair is not available. Inspired by the observation that humans usually locate events of interest in a long video with a heuristic search strategy, we propose to formulate this task as a Markov Decision Process. A Boundary Adaptive Refinement (BAR) framework is thus designed: starting from an initial segment, the reinforcement learning technique is utilized to refine its temporal boundary progressively. The overall architecture of the proposed BAR framework is depicted in Figure 2. As illustrated, this modular framework employs a context-aware feature extractor to encode the environment state into cross-modal contextual concepts. The cross-modal alignment evaluator is crafted to provide a tailor-designed reward and the termination signal for the iterative refinement process. An adaptive action planner is designed to reason the direction and amplitude of each action from the contextualized observation adaptively, instead of shifting by a fixed amplitude at every step (He et al., 2019). The details of these modularized components are described in the following sections.

3.2. Context-aware Feature Extractor

The context-aware feature extractor takes a video-query pair ($V$-$T$) from the external environment and encodes it into context-aware cross-modal concepts. Each word in query $T$ is first encoded using GloVe (Pennington et al., 2014) embeddings and then fed into a GRU (Chung et al., 2014) to capture long-range dependencies. The summarized query representation $\mathbf{E}$ is obtained from the last hidden state of the GRU. A pre-trained video feature extractor (C3D (Tran et al., 2015) or TSN (Wang et al., 2016)) is used to extract the clip-level feature for each video clip. A video segment is represented as a set of clip features, i.e., $\mathbf{F}=\{\mathbf{F}_{1};\ldots;\mathbf{F}_{i};\ldots;\mathbf{F}_{M}\}\in\mathbb{R}^{d_{k}\times M}$, where $\mathbf{F}_{i}\in\mathbb{R}^{d_{k}}$ denotes the clip-level feature for video clip $V_{i}$ and $M$ is the number of clips in the corresponding video segment. At each time step, the updated boundary divides the whole video into three parts: the left segment, the current segment and the right segment. We collect all clip-level features within the corresponding boundary into a set to obtain the three corresponding segment-level features. Rather than directly taking the current segment’s feature as an independent input (He et al., 2019), this extractor also leverages contextual cues derived from the other segments in the video (i.e., the left and right segment features) for state encoding. Furthermore, the extractor explicitly involves the normalized boundary location $\mathbf{L}_{t-1}$ in the encoded features to provide some notion of relative position:

(1) $\mathbf{L}_{t-1}=\left[\frac{l^{s}_{t-1}}{N},\frac{l^{e}_{t-1}}{N}\right]$

where $l^{s}_{t-1}$ and $l^{e}_{t-1}$ denote the start and end clip indices of the boundary, respectively. $t\in\{1,\cdots,T_{max}\}$, where $T_{max}$ indicates the maximum number of iterations in the refinement process.
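To make the state construction concrete, the following is a minimal sketch of how the left/current/right segment features and the normalized boundary location of Equation 1 could be assembled from pre-extracted clip features. The tensor shapes, the helper name `split_state` and the example values are illustrative assumptions rather than the authors' implementation.

```python
import torch

def split_state(clip_feats: torch.Tensor, l_s: int, l_e: int):
    """clip_feats: (N, d_k) clip-level features of the whole video (e.g., C3D/TSN).
    l_s, l_e: start/end clip indices of the current boundary."""
    N = clip_feats.size(0)
    left    = clip_feats[:l_s]          # clips before the current segment
    current = clip_feats[l_s:l_e + 1]   # clips inside the current boundary
    right   = clip_feats[l_e + 1:]      # clips after the current segment
    # Equation (1): normalized boundary location fed to the planner
    L_prev = torch.tensor([l_s / N, l_e / N])
    return left, current, right, L_prev

# Example: a 40-clip video with an initial boundary [N/4, 3N/4]
feats = torch.randn(40, 4096)
left, cur, right, L_prev = split_state(feats, 10, 30)   # L_prev = [0.25, 0.75]
```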

3.3. Cross-modal Alignment Evaluator

The cross-modal alignment evaluator is specially designed to address two critical issues in our RL-based approach. On the one hand, this evaluator is crafted to assign a target-oriented reward, addressing the difficulty that the adaptive action planner cannot directly obtain a reliable reward function without granular boundary annotations. On the other hand, this alignment evaluator determines an accurate stop signal to terminate the refinement process. Given a video segment, the dimension of each clip feature is reduced to that of the summarized query representation $\mathbf{E}$ via a filter function $\theta$, which consists of a fully-connected layer followed by ReLU (Krizhevsky et al., 2012) and Dropout (Srivastava et al., 2014). $\mathbf{E}$ is used to create a temporal attention over all video clips, which emphasizes crucial video clips and weakens inessential parts. Concretely, the scaled dot-product attention mechanism (Luong et al., 2015) is utilized to obtain the attention weight $a_{i}$ and the segment attention feature $\mathbf{A}$:

(2) $a_{i}=\mathrm{softmax}\left(\frac{\mathbf{E}\odot\theta(\mathbf{F}_{i})}{\sqrt{k}}\right),\quad\mathbf{A}=\sum_{i=1}^{M}a_{i}\,\theta(\mathbf{F}_{i})$

where $\odot$ indicates the dot product between two vectors and $k$ is the dimension of $\mathbf{E}$. Then the segment attention feature and the query representation are mapped to a joint embedding space to compute the alignment score $S$:

(3) $S=\mathrm{L2Norm}(\mathbf{A})\odot\mathrm{L2Norm}(\mathbf{E})$

The alignment score can be regarded as an estimate for providing a reliable reward. Specifically, the evaluator measures the alignment scores of consecutive segment-query pairs and assigns the corresponding reward $r_{t}$:

(4) $r_{t}=\mathrm{sign}(S^{c}_{t}-S^{c}_{t-1})$

where $S^{c}_{t}$ denotes the alignment score between the current segment and the sentence query at time step $t$. This reward function returns +1 or -1. Basically, if the next boundary has a higher alignment score than the current one, the reward $r_{t}$ of the action $a_{t}$ moving from the current window to the next one is +1, and -1 otherwise. Such binary rewards reflect more clearly which actions drive the boundary towards the ground truth and thus facilitate the agent’s learning.
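The following is a hedged sketch of how the evaluator of Equations 2-4 could be implemented. The filter function $\theta$ is assumed to be a single linear-ReLU-dropout projection, and the layer sizes, class name and reward helper are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

class AlignmentEvaluator(torch.nn.Module):
    def __init__(self, clip_dim=4096, query_dim=1024):
        super().__init__()
        # filter function theta: FC + ReLU + Dropout (assumed sizes)
        self.theta = torch.nn.Sequential(
            torch.nn.Linear(clip_dim, query_dim),
            torch.nn.ReLU(),
            torch.nn.Dropout(0.5))

    def forward(self, seg_feats, E):
        """seg_feats: (M, clip_dim) clips of one segment; E: (query_dim,) query vector."""
        proj = self.theta(seg_feats)                        # (M, query_dim)
        # Equation (2): scaled dot-product attention over the clips
        a = F.softmax(proj @ E / E.numel() ** 0.5, dim=0)   # (M,)
        A = (a.unsqueeze(1) * proj).sum(dim=0)              # segment attention feature
        # Equation (3): alignment score in the joint L2-normalized space
        return F.normalize(A, dim=0) @ F.normalize(E, dim=0)

def reward(score_t, score_prev):
    # Equation (4): +1 if the new boundary aligns better with the query, -1 otherwise
    return 1.0 if score_t > score_prev else -1.0
```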

3.4. Adaptive Action Planner

The adaptive action planner is designed to infer action sequences that refine the temporal boundary. To get fixed-length visual representations, we apply mean pooling over the feature set $\mathbf{F}$ of the global, current, left and right segments, obtaining the pooled features $\mathbf{f}^{g},\mathbf{f}^{c}_{t-1},\mathbf{f}^{l}_{t-1},\mathbf{f}^{r}_{t-1}$, respectively. Then the cross-gated interaction method (Feng et al., 2018) is adopted to enhance the effects of relevant segment-query pairs. Concretely, the current pooled feature $\mathbf{f}^{c}_{t-1}$ is gated by the query representation $\mathbf{E}$, and meanwhile the gate of $\mathbf{E}$ depends on $\mathbf{f}^{c}_{t-1}$:

(5) $\tilde{\mathbf{f}}^{c}_{t-1}=\sigma(\mathbf{W}^{s}\mathbf{E})\odot\mathbf{f}^{c}_{t-1},\quad\tilde{\mathbf{E}}=\sigma(\mathbf{W}^{v}\mathbf{f}^{c}_{t-1})\odot\mathbf{E}$

where $\mathbf{W}^{s}$ and $\mathbf{W}^{v}$ are parameter matrices and $\sigma$ denotes the sigmoid function. These cross-modal features are then concatenated and fed into two cascaded fully-connected layers $\phi$ to get the state activation representation $s_{t}$:

(6) $s_{t}=\phi(\tilde{\mathbf{E}},\tilde{\mathbf{f}}^{c}_{t-1},\mathbf{f}^{g},\mathbf{f}^{l}_{t-1},\mathbf{f}^{r}_{t-1},\mathbf{L}_{t-1}).$

Such contextual features encourage the planner to perform a left-right trade-off over the video contents and infer a more accurate action. $s_{t}$ is further fed into a GRU cell to enable the agent to incorporate memory about the video segments that have already been explored. The output state of the GRU is then followed by two separate fully-connected layers (i.e., actor and critic) to estimate a policy function $\pi(a_{t}|s_{t})$ and a value approximator $v^{\pi}(s_{t})$, respectively. A primitive action $a_{t}\in\mathcal{A}$ is sampled from the policy function $\pi(a_{t}|s_{t})$ during training. In our work, the action space $\mathcal{A}$ is composed of four primitive actions: shifting the start/end point backward/forward by $N/\nu$ clips. $\nu$ is an amplitude factor that is empirically set as:

(7) $\nu=\lfloor 10\times(1+2\times\tanh(S^{c}_{t}-S^{g}))\rfloor_{+},$

where $\lfloor\cdot\rfloor_{+}$ denotes rounding down to the nearest positive integer. $S^{g}$ and $S^{c}_{t}$ denote the global and current alignment scores estimated by the alignment evaluator. $\tanh$ is used to constrain the action amplitude to fluctuate around $N/10$ (an empirical number used in (He et al., 2019)). $S^{g}$ serves as a baseline of the alignment degree for determining $\nu$: when $S^{c}_{t}$ is lower, $\nu$ becomes smaller and the agent markedly shifts the boundary; when $S^{c}_{t}$ becomes higher, $\nu$ is larger and the boundary is only marginally refined. This adaptive setting enables the agent to determine the action amplitude based on the current observation, which is also in line with human habits.
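A hedged sketch of the adaptive amplitude of Equation 7 and of how the four primitive actions might update the boundary is given below; the floor-with-lower-bound operator is interpreted as rounding down with a minimum of 1 so that $N/\nu$ stays well defined, and the clamping of the indices is an illustrative assumption.

```python
import math

def amplitude(S_cur: float, S_global: float) -> int:
    """Equation (7): a larger current-vs-global score gap -> larger nu -> smaller shift N/nu."""
    nu = math.floor(10 * (1 + 2 * math.tanh(S_cur - S_global)))
    return max(nu, 1)

def apply_action(action: int, l_s: int, l_e: int, N: int, nu: int):
    """Actions 0-3: shift the start/end point backward/forward by N/nu clips."""
    step = max(N // nu, 1)
    if action == 0: l_s = max(l_s - step, 0)
    if action == 1: l_s = min(l_s + step, l_e - 1)
    if action == 2: l_e = max(l_e - step, l_s + 1)
    if action == 3: l_e = min(l_e + step, N - 1)
    return l_s, l_e

# A poorly aligned segment moves in big steps; a well aligned one is only nudged.
print(amplitude(0.2, 0.6), amplitude(0.8, 0.3))   # -> 2, 19
```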

The state-value $v^{\pi}(s_{t})$ predicted by the critic is the value estimation of the current state. Under the assumption that the critic produces exact values, the actor is trained based on an unbiased estimate of the gradient.

3.5. Training

Due to its efficiency, the advantage actor-critic (A2C) (Sutton and Barto, 2018) algorithm is chosen to train our adaptive action planner. A multiple instance learning scheme with a combined ranking loss $\mathcal{L}_{rank}$ is designed to train the cross-modal alignment evaluator and the context-aware feature extractor. The total loss of BAR is summarized as:

(8) $\mathcal{L}=\mathcal{L}_{A2C}+\eta\mathcal{L}_{rank},$

where $\mathcal{L}_{A2C}$ denotes the loss function of the A2C algorithm and $\eta$ is a trade-off factor between the two losses.

A2C Loss. The adaptive action planner runs $T_{max}$ adjustment steps during training. Given a trajectory in an episode $\Gamma=\langle s_{t},\pi(\cdot|s_{t}),v^{\pi}(s_{t}),a_{t},r_{t}\rangle$, the actor loss $\mathcal{L}_{actor}$ is formulated as:

(9) $\mathcal{L}_{actor}=-\sum_{t=1}^{T_{max}}\left[A^{\pi}(s_{t},a_{t})\log\pi(a_{t}|s_{t})+\alpha H(\pi(a_{t}|s_{t}))\right],$

where $A^{\pi}(s_{t},a_{t})$ denotes the advantage function and the entropy $H(\cdot)$ of the policy is introduced into the objective to improve exploration. $A^{\pi}(s_{t},a_{t})=Q^{\pi}(s_{t},a_{t})-v^{\pi}(s_{t})$ measures whether and by how much an action is better than the policy’s default behaviour. Temporal-difference (TD) learning is adopted to estimate the Q-value function $Q^{\pi}(s_{t},a_{t})$ by $k$-step returns with function approximation:

(10) $Q^{\pi}(s_{t},a_{t})=\sum_{l=0}^{k-1}\gamma^{l}r_{t+l}+\gamma^{k}v^{\pi}(s_{t+k})$

where $\gamma$ is a constant discount factor. Note that BAR does not suffer from the sparse reward issue during training since a reward can be obtained at every step. To optimize the critic, we minimize the mean squared error (MSE) loss $\mathcal{L}_{critic}$ between the Q-value function and the estimated value (Mnih et al., 2016). The total A2C loss is the combination of the actor and critic losses: $\mathcal{L}_{A2C}=\mathcal{L}_{actor}+\mathcal{L}_{critic}$.
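A minimal sketch of this objective is given below, assuming the planner records the log-probability of the chosen action, the policy entropy, the critic value and the reward at every refinement step; the full-episode bootstrap value of 0 and the function signature are illustrative assumptions.

```python
import torch

def a2c_loss(log_probs, entropies, values, rewards, gamma=0.4, alpha=0.1):
    """All arguments are 1-D tensors of length T_max."""
    T = rewards.size(0)
    returns = torch.zeros(T)
    R = 0.0                              # assumed terminal bootstrap value
    for t in reversed(range(T)):         # discounted k-step return, Equation (10)
        R = rewards[t] + gamma * R
        returns[t] = R
    advantage = returns - values.detach()                       # A(s_t, a_t)
    actor = -(advantage * log_probs + alpha * entropies).sum()  # Equation (9)
    critic = torch.nn.functional.mse_loss(values, returns)      # critic MSE
    return actor + critic
```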

Ranking Loss. In general, the content discrepancy between videos is higher than that within a video. Hence we resort to a multiple instance learning scheme and first leverage coarse-level semantic concepts across videos to optimize the framework. Concretely, given the global video feature $F^{g}$ and its query representation $E$, the alignment score $S(F^{g},E)$ (positive pair) is expected to be higher than the scores $S({F^{g}}^{\prime},E)$ and $S(F^{g},E^{\prime})$ (negative pairs) for any video ${F^{g}}^{\prime}$ or query $E^{\prime}$ taken from other sample pairs. The inter-video ranking loss (Schroff et al., 2015) is thus defined as:

(11) $\begin{split}\mathcal{L}_{inter}=\sum\limits_{E^{\prime}}[\epsilon+S(F^{g},E^{\prime})-S(F^{g},E)]_{+}\\+\sum\limits_{{F^{g}}^{\prime}}[\epsilon+S({F^{g}}^{\prime},E)-S(F^{g},E)]_{+},\end{split}$

where $[x]_{+}$ denotes the ramp function $\max(0,x)$ and $\epsilon$ indicates a margin. $S(F^{g},E)$ and $S^{g}$ are equivalent. The positive and negative pairs are drawn from the same mini-batch.
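A hedged sketch of this loss over a mini-batch is shown below, assuming `scores` is the B x B matrix of alignment scores between every global video feature and every query in the batch, with positive pairs on the diagonal; the matrix layout is an illustrative assumption.

```python
import torch

def inter_video_loss(scores: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """scores[i, j] = S(F^g of video i, E of query j); diagonal entries are positives."""
    B = scores.size(0)
    pos = scores.diag().view(B, 1)                          # S(F^g, E)
    cost_query = (margin + scores - pos).clamp(min=0)       # negatives: other queries E'
    cost_video = (margin + scores - pos.t()).clamp(min=0)   # negatives: other videos F^g'
    mask = 1.0 - torch.eye(B)                               # drop the positive pairs
    return (cost_query * mask).sum() + (cost_video * mask).sum()
```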

Inter-video pairs generally convey broad semantic abstractions that are insufficient to distinguish similar contents within a specific video. To this end, we design the intra-video ranking loss $\mathcal{L}_{intra}$ to capture more subtle concepts within a video and further optimize the network. Specifically, if the score of any one of the left, current and right segment-query pairs surpasses the global one during the refinement process, we assume this pair should have a higher alignment score than the other two pairs:

(12) $\begin{split}\mathcal{L}_{intra}&=\psi(S^{c}_{t}>S^{g})\times([\epsilon+S^{l}_{t}-S^{c}_{t}]_{+}+[\epsilon+S^{r}_{t}-S^{c}_{t}]_{+})\\&+\psi(S^{l}_{t}>S^{g})\times([\epsilon+S^{c}_{t}-S^{l}_{t}]_{+}+[\epsilon+S^{r}_{t}-S^{l}_{t}]_{+})\\&+\psi(S^{r}_{t}>S^{g})\times([\epsilon+S^{c}_{t}-S^{r}_{t}]_{+}+[\epsilon+S^{l}_{t}-S^{r}_{t}]_{+}),\end{split}$

where $S^{l}_{t}$ and $S^{r}_{t}$ are the alignment scores of the left segment-query pair and the right segment-query pair at time step $t$, respectively. $\psi(\cdot)$ is a binary indicator function: if the inequality in parentheses holds, $\psi(\cdot)$ outputs 1, otherwise 0. Specifically, when the score of a segment-query pair, say $S^{c}_{t}$, surpasses $S^{g}$, the optimization target is to increase the gap between $S^{c}_{t}$ and the other two ($S^{l}_{t}$ and $S^{r}_{t}$) by increasing $S^{c}_{t}$ or decreasing $S^{l}_{t}$ and $S^{r}_{t}$. Note that lowering $S^{c}_{t}$ below $S^{g}$ might be another option, but this usually becomes increasingly impractical as the inter-video training progresses. In addition, when the scores of more than one segment-query pair surpass $S^{g}$, the optimization target of $\mathcal{L}_{intra}$ will usually guide the alignment evaluator to suppress the score of the sub-optimal matching pair(s) below $S^{g}$ and at the same time drive the action planner to adjust the boundary. Intuitively, $\mathcal{L}_{intra}$ encourages the text query to be closer to a semantically matched video moment than to other possible moments from the same video, which contributes to obtaining a content-aware alignment score.
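A minimal sketch of Equation 12 is given below, assuming the left/current/right/global alignment scores are scalar tensors produced by the evaluator at one refinement step; the function signature is an illustrative assumption.

```python
import torch

def intra_video_loss(S_l, S_c, S_r, S_g, margin=0.2):
    """S_l, S_c, S_r, S_g: scalar tensors of alignment scores at one time step."""
    relu = torch.nn.functional.relu       # [x]_+ = max(0, x)
    loss = 0.0
    if S_c > S_g:   # current segment beats the global baseline -> push left/right down
        loss = loss + relu(margin + S_l - S_c) + relu(margin + S_r - S_c)
    if S_l > S_g:
        loss = loss + relu(margin + S_c - S_l) + relu(margin + S_r - S_l)
    if S_r > S_g:
        loss = loss + relu(margin + S_c - S_r) + relu(margin + S_l - S_r)
    return loss
```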

$\mathcal{L}_{intra}$ manages to i) widen the score gap between matched and unmatched segment-query pairs to increase the confidence of the alignment evaluation; and ii) improve the reward calculation by shaping the alignment evaluator, which in turn drives the action planner toward better temporal boundary adjustments. To sum up, the combined ranking loss $\mathcal{L}_{rank}$ is defined as:

(13) $\mathcal{L}_{rank}=\mathcal{L}_{inter}+\lambda\sum_{t=1}^{T_{max}}\mathcal{L}_{intra},$

where $\lambda$ is a weighting parameter that balances the intra-video and inter-video ranking losses. In the early stage of this collaborative training scheme, it is very unlikely that the score of a segment-query pair exceeds $S^{g}$, so $\mathcal{L}_{intra}$ tends to 0 and $\mathcal{L}_{inter}$ plays the dominant role, learning to transfer the matching from video-query pairs to segment-query pairs. As training progresses, $\mathcal{L}_{inter}$ converges gradually, it becomes more common for segment-query scores to exceed $S^{g}$, and $\mathcal{L}_{intra}$ begins to play a critical role.

Alternating Update. BAR is trained from scratch and an alternating update strategy is applied to facilitate stable training. Specifically, within each set of $2K$ iterations, we first fix the parameters of the action planner and employ $\mathcal{L}_{rank}$ for model optimization. This setting guarantees a trustworthy initial reward for the action planner. When $K$ iterations are reached, we fix the parameters of the alignment evaluator and feature extractor, and switch from $\mathcal{L}_{rank}$ to $\mathcal{L}_{A2C}$ to optimize the action planner for $K$ more iterations. This alternating update mechanism repeats until the model converges.
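The schedule can be sketched as the training loop below; the module, loss and loader names are placeholders under the assumption of a single shared optimizer, not the authors' code.

```python
import torch

def set_requires_grad(module: torch.nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad_(flag)

def alternating_train(data_loader, planner, evaluator, extractor,
                      rank_loss_fn, a2c_loss_fn, optimizer, K=500):
    for it, batch in enumerate(data_loader):
        phase_rank = (it // K) % 2 == 0            # first K iterations: ranking loss
        set_requires_grad(planner, not phase_rank)
        set_requires_grad(evaluator, phase_rank)
        set_requires_grad(extractor, phase_rank)
        loss = rank_loss_fn(batch) if phase_rank else a2c_loss_fn(batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```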

Table 1. The performance comparison (in %) of state-of-the-art methods under the fully supervised and weakly supervised settings. “-” indicates that the corresponding value is not available.
| Supervision | Feature | Method | Charades-STA [email protected] | Charades-STA [email protected] | Charades-STA [email protected] | ActivityNet [email protected] | ActivityNet [email protected] |
|---|---|---|---|---|---|---|---|
| Full | C3D | ROLE (Liu et al., 2018b), ACM MM 2018 | - | 12.12 | 25.26 | - | - |
| Full | C3D | MCN (Anne Hendricks et al., 2017), ICCV 2017 | 4.44 | 13.66 | 28.99 | 10.17 | 22.07 |
| Full | C3D | CTRL (Gao et al., 2017), ICCV 2017 | 8.89 | 23.63 | - | 14.36 | 29.10 |
| Full | C3D | ACRN (Liu et al., 2018a), SIGIR 2018 | 9.65 | 26.74 | 47.64 | 16.53 | 31.75 |
| Full | C3D | MAC (Ge et al., 2019), WACV 2019 | 12.23 | 29.39 | 53.34 | - | - |
| Full | C3D | SAP (Chen and Jiang, 2019), AAAI 2019 | 13.36 | 27.42 | - | - | - |
| Full | C3D | QSPN (Xu et al., 2019), AAAI 2019 | 15.80 | 35.60 | 54.7 | 27.70 | 45.30 |
| Full | C3D | ABLR (Yuan et al., 2019), AAAI 2019 | - | - | - | 36.79 | 55.67 |
| Full | C3D | SM-RL (Wang et al., 2019), CVPR 2019 | 11.17 | 24.36 | - | - | - |
| Full | C3D | RWM (He et al., 2019), AAAI 2019 | 13.74 | 34.12 | 55.16 | 34.91 | 53.00 |
| Full | TSN | RWM (He et al., 2019), AAAI 2019 | 17.72 | 37.23 | 61.73 | 37.46 | 57.29 |
| Full | I3D | MAN (Zhang et al., 2019a), CVPR 2019 | 22.72 | 46.53 | - | - | - |
| Weak | C3D | TGA (Mithun et al., 2019), CVPR 2019 | 8.84 | 19.94 | 32.14 | - | - |
| Weak | C3D | WS-DEC (Duan et al., 2018), NIPS 2018 | - | - | - | 23.34 | 41.98 |
| Weak | C3D | WSLLN (Gao et al., 2019), EMNLP 2019 | - | - | - | 22.70 | 42.80 |
| Weak | C3D | SCN (Lin et al., 2020), AAAI 2020 | 9.97 | 23.58 | 42.96 | 29.22 | 47.23 |
| Weak | C3D | BAR (ours) | 12.23 | 27.04 | 44.97 | 30.73 | 49.03 |
| Weak | TSN | BAR (ours) | 15.97 | 33.98 | 51.64 | 33.12 | 53.41 |

3.6. Inference

At each time step, BAR executes an action $\hat{a_{t}}$ selected via greedy decoding to adaptively adjust the temporal boundary, and the cross-modal alignment evaluator computes a score $S^{c}_{t}$ to provide confidence for the alignment degree and termination. Empirically, the final grounding result corresponding to the query usually occupies a reasonable and appropriate portion of the video length. Hence, to penalize video segments with abnormal lengths, we propose to update the confidence score with a Gaussian penalty function as follows:

(14) $P_{t}=\frac{l^{e}_{t}-l^{s}_{t}}{N}-\delta,\quad\hat{S^{c}_{t}}=S^{c}_{t}e^{-\frac{P_{t}^{2}}{\tau}}$

where $\delta$ denotes the penalty baseline factor for segment lengths and $\tau$ is a modulating factor: as $\tau$ increases, the effect of the penalty decreases. The segment with the maximum $\hat{S^{c}_{t}}$ during testing is regarded as the final grounding result.
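A minimal sketch of this length-penalized selection is given below; the trajectory representation and helper names are illustrative assumptions.

```python
import math

def penalized_score(S_c, l_s, l_e, N, delta=0.35, tau=0.5):
    """Equation (14): down-weight segments whose normalized length deviates from delta."""
    P = (l_e - l_s) / N - delta
    return S_c * math.exp(-P * P / tau)

def select_boundary(trajectory, N, delta=0.35, tau=0.5):
    """trajectory: list of (S_c, l_s, l_e) tuples collected during refinement."""
    return max(trajectory, key=lambda x: penalized_score(*x, N, delta, tau))
```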

4. Experiments

4.1. Datasets and Evaluation Metrics

Datasets. We conduct extensive experiments on two benchmark datasets: Charades-STA (Gao et al., 2017) and ActivityNet (Krishna et al., 2017). Charades-STA is extended from the Charades dataset (Sigurdsson et al., 2016) with generated sentence-clip annotations, comprising 12,408 sentence-clip pairs for training and 3,720 for testing. The average length of each video in this dataset is 29.8 seconds and the described clips are 8 seconds long on average. The ActivityNet dataset (Krishna et al., 2017) is introduced to validate the robustness of the proposed model on longer and more diverse videos. It contains 37,421 and 17,505 video-sentence pairs for training and testing, respectively. The average duration of the videos is 2 minutes and the described temporally annotated clips are 36 seconds long on average.

Evaluation Metrics. We adopt “tIoU@$\chi$” to evaluate the grounding results. “tIoU@$\chi$” denotes the percentage of queries whose predicted segment has a temporal IoU with the ground truth larger than the threshold $\chi$.
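For reference, a minimal sketch of this metric is shown below; segments are assumed to be (start, end) pairs in seconds or clip indices, and the helper names are illustrative.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between a predicted (start, end) segment and the ground truth."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_tiou(preds, gts, chi=0.5):
    """Percentage of queries whose prediction reaches tIoU >= chi."""
    hits = sum(temporal_iou(p, g) >= chi for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)
```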

4.2. Implementation Details

We leverage the C3D and TSN models to encode video representations. The initial boundary is set to $L_{0}=[N/4;3N/4]$, where $N/4$ and $3N/4$ denote the start and end clip indices of the boundary, respectively. $T_{max}$ is set to 12 and the size of the hidden state in the GRU is 1024. The batch size is 12 and the total loss is optimized via the Adam optimizer with a learning rate of 0.001. The margin $\epsilon$ in the ranking losses is 0.2. The hyper-parameters $\alpha$ and $\gamma$ are fixed to 0.1 and 0.4, respectively. The factors $\eta$ and $\lambda$ are empirically set to 1 and 0.1. The modulating factor $\tau$ is set to 0.5 by cross-validation, and the penalty baseline factor $\delta$ is fixed to 0.35 and 1.0 on Charades-STA and ActivityNet, respectively. We use $K=500$ in the alternating update procedure.

4.3. Comparison with the State-of-the-art

We compare the proposed BAR with several state-of-the-art models under both the weakly-supervised and fully-supervised settings in Table 1. On the one hand, BAR significantly outperforms previous weakly-supervised methods and establishes new state-of-the-art performance on both datasets. Employing the C3D video feature, BAR boosts [email protected] to 27.04% and 30.73%, with improvements of 3.46% and 1.51% over SCN (Lin et al., 2020) on the two datasets, respectively. Furthermore, it achieves 33.98% (33.12%) in [email protected] with the more powerful TSN feature. This reveals that our approach helps to obtain more accurate video segments. On the other hand, BAR even achieves better or comparable results than some fully-supervised methods. For instance, BAR outperforms QSPN (Xu et al., 2019) by 3.03% w.r.t. [email protected] on the ActivityNet dataset. This is an inspiring result, as it reveals that our model can obtain impressive results by learning from massive coarse video-level annotations, which is of great benefit to practical applications.

4.4. Ablation Studies

We perform extensive ablation studies and demonstrate the effectiveness of several essential components in BAR. The experiments are conducted on the Charades-STA with the TSN feature. The results are reported in Table 2.

\bullet Effectiveness of Reinforcement Learning. A more accurate measurement of the actual RL contribution is to remove it directly and use the proposals generated by an off-the-shelf weakly-supervised localization method (Mithun et al., 2019). Hence we design a variant (abbreviated as “Ours w/o RL”) that follows this setting. We can observe that removing RL from BAR leads to a noticeable drop in performance. For example, [email protected] declines from 33.98% to 25.89%. This reveals that the introduction of RL is fundamental and brings more flexible and adaptable temporal proposals; this alone is an advantage that cannot be achieved with traditional two-stage frameworks, not to mention its higher efficiency.

\bullet Effectiveness of Tailor-designed Reward. In order to validate that a target-oriented reward is essential for this task, we design a baseline (abbreviated as “Ours w/ random reward”) that samples a random scalar from the uniform distribution over [-1,1] as the reward for optimization. Table 2 shows that this baseline obtains an exceedingly inferior result, close to a stochastic one. It indicates that a tailor-designed reward is definitely necessary in the RL setting.

\bullet Effectiveness of Boundary Initialization. The initial boundary in this paper is fixed to $L_{0}=[N/4;3N/4]$. To compare different boundary initializations, we design two baselines (denoted as “Initial boundary [N/3; 2N/3]” and “Initial boundary [N/5; 4N/5]”) that set the initial boundary to $[N/3;2N/3]$ and $[N/5;4N/5]$, respectively. As reported in Table 2, the performance of the algorithm is not sensitive to the boundary initialization, and all settings obtain competitive results, which reflects the robustness of BAR.

Table 2. Performance of ablation models.
| Method | [email protected] | [email protected] | [email protected] |
|---|---|---|---|
| Ours w/o RL | 12.37 | 25.89 | 45.36 |
| Ours w/ random reward | 5.76 | 8.97 | 28.82 |
| Initial boundary [N/3; 2N/3] | 15.72 | 33.36 | 51.33 |
| Initial boundary [N/5; 4N/5] | 15.83 | 33.47 | 51.20 |
| Ours w/ N/5 amplitude | 13.60 | 31.88 | 49.65 |
| Ours w/ N/10 amplitude | 14.27 | 32.02 | 50.25 |
| Ours w/ N/15 amplitude | 13.73 | 31.66 | 49.29 |
| Ours w/o context | 13.62 | 31.45 | 49.22 |
| Ours w/o $\mathcal{L}_{intra}$ | 14.24 | 30.73 | 46.82 |
| Ours w/ stop | 10.13 | 24.38 | 43.22 |
| Ours w/o penalty | 13.78 | 30.97 | 50.27 |
| Ours | 15.97 | 33.98 | 51.64 |

\bullet Effectiveness of Adaptive Setting. Rather than shifting a fixed distance at each action, BAR adaptively adjusts its action amplitude according to the current state. To demonstrate the superiority of this adaptive setting, we design three variants (named “Ours w/ N/5 amplitude”, “Ours w/ N/10 amplitude” and “Ours w/ N/15 amplitude”) in which the agent shifts $N/5$, $N/10$ and $N/15$ clips at each step, respectively. As summarized in Table 2, “Ours w/ N/10 amplitude” performs best among the variants with a fixed adjustment strategy. However, our approach with the adaptive setting achieves even more impressive performance, which reveals that this adaptive setting is more flexible and effective in our proposed framework.

(a) Varying $\delta \in [0.1, 0.5]$, $\tau = 0.5$
(b) Varying $\tau \in [0.1, 5.0]$, $\delta = 0.35$
Figure 3. The performance curves with varying hyper-parameters $\delta$ and $\tau$. Best viewed in color.
Figure 4. An illustration of how the proposed BAR framework accomplishes the task on Charades-STA.
Figure 5. An illustration of how the proposed BAR framework accomplishes the task on ActivityNet.

\bullet Effectiveness of Context Information. BAR additionally builds contextualized video representations for action decisions. To investigate the effectiveness of the context information, we design a baseline that removes the context concepts ($\mathbf{f}^{l}_{t-1}$, $\mathbf{f}^{r}_{t-1}$) from $s_{t}$ in Equation 6, abbreviated as “Ours w/o context”. From Table 2, we can see that although the model without context representation still achieves promising results, our model with context gains 2.35% and 2.53% improvements w.r.t. [email protected] and [email protected], respectively, which demonstrates that contextual concepts help to obtain more content-aware results.

\bullet Effectiveness of Intra-video Ranking Loss. To verify the effectiveness of $\mathcal{L}_{intra}$, we construct a comparison variant that merely uses $\mathcal{L}_{inter}$ to optimize the evaluator, named “Ours w/o $\mathcal{L}_{intra}$”. Table 2 reveals that the grounding result suffers an obvious drop without $\mathcal{L}_{intra}$. For example, [email protected] declines from 51.64% to 46.82%. Our approach with the intra-video ranking loss achieves more precise alignment scores and more accurate grounding results. To further demonstrate the effectiveness of the alignment score $S$ obtained by our model, we additionally calculate the correlation coefficient (CC) between $S$ and the ground-truth IoU. The CC reaches 0.79, which reveals that the obtained $S$ is reliable enough to correctly reflect the matching degree and infer target-oriented rewards.

\bullet Analysis of Stop Signal. We do not include an ending signal in the action space (He et al., 2019), as there is no absolutely reliable and stable internal segment-query matching signal that could effectively terminate the iteration. We further introduce an alignment threshold as a stopping signal (abbreviated as “Ours w/ stop”), which leads to inferior results. To validate the significance of the length penalty strategy, we design a baseline that directly takes the score $S^{c}_{t}$ to determine the termination time, denoted as “Ours w/o penalty”. The results indicate that this baseline suffers from performance degradation. This may be because “Ours w/o penalty” tends to produce an excessive score when the video segment is too long or too short.

Figure 3 depicts the performance curves with varying $\delta$ or $\tau$ in the cross-validation procedure. We can see that a $\delta$ that is too large or too small leads to an obvious performance decline, which reveals that an appropriate length prior is more likely to produce impressive results. A similar trend can be observed with varying $\tau$, demonstrating that an appropriate Gaussian penalty encourages the model to perform better. We empirically observe that $\delta=0.35$ and $\tau=0.5$ yield the most promising performance at different tIoU levels.

4.5. Efficiency

To further investigate the efficiency of the boundary adaptive refinement process, we compare BAR with TGA (Mithun et al., 2019) in terms of average running time and the number of candidate segments. As summarized in Table 3, BAR reduces the localization time and the number of candidate boundaries by a sizeable margin. Note that a boundary in BAR is, to some extent, equivalent to a temporal proposal, but the “boundary” here is more flexible and adaptable. BAR merely needs to refine an initial temporal boundary progressively, which avoids redundant computation and operates in a time- and space-efficient manner. Based on the above discussion, we conclude that BAR surpasses previous competitive methods in both accuracy and efficiency.

4.6. Qualitative Visualizations

We illustrate two qualitative results in Figures 4 and 5 to show the whole process of how BAR obtains the described event location. We observe that our algorithm mainly performs optimization from coarse to fine. The agent chooses larger movement adjustments at the initial stage of the iteration to quickly narrow the semantic gap between language and vision; as the iteration progresses, the adjustment amplitude shrinks to achieve local fine-tuning. This is also more consistent with how humans perform cross-modal target retrieval.

Table 3. The average running time and number of candidate proposals to localize a moment in a video on Charades-STA.
| Method | Time (s) | Candidate Proposal Number |
|---|---|---|
| TGA (Mithun et al., 2019) | 0.104 | 65.11 |
| BAR (Ours) | 0.068 | 1 |

5. Conclusions

We propose a Boundary Adaptive Refinement framework that resorts to reinforcement learning to address the task of weakly-supervised temporal grounding of natural language in videos. This refinement scheme completely abandons traditional sliding window-based solution patterns and contributes to obtaining more efficient, boundary-flexible and content-aware grounding results. Extensive experiments show that our approach establishes new state-of-the-art performance on the widely used Charades-STA and ActivityNet datasets. Furthermore, our method even achieves a better result than some competitive fully-supervised methods.

References

  • Anne Hendricks et al. (2017) Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision. 5803–5812.
  • Chen et al. (2018a) Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. 2018a. Temporally grounding natural sentence in video. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 162–171.
  • Chen and Jiang (2019) Shaoxiang Chen and Yu-Gang Jiang. 2019. Semantic Proposal for Activity Localization in Videos via Sentence Query. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Chen et al. (2018b) Tianshui Chen, Zhouxia Wang, Guanbin Li, and Liang Lin. 2018b. Recurrent attentional reinforcement learning for multi-label image recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
  • Duan et al. (2018) Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, and Junzhou Huang. 2018. Weakly supervised dense event captioning in videos. In Advances in Neural Information Processing Systems. 3059–3069.
  • Feng et al. (2018) Yang Feng, Lin Ma, Wei Liu, Tong Zhang, and Jiebo Luo. 2018. Video re-localization. In Proceedings of the European Conference on Computer Vision (ECCV). 51–66.
  • Gao et al. (2017) Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision. 5267–5275.
  • Gao et al. (2019) Mingfei Gao, Larry S Davis, Richard Socher, and Caiming Xiong. 2019. WSLLN: Weakly Supervised Natural Language Localization Networks. (2019).
  • Ge et al. (2019) Runzhou Ge, Jiyang Gao, Kan Chen, and Ram Nevatia. 2019. MAC: Mining Activity Concepts for Language-based Temporal Localization. In IEEE Winter Conference on Applications of Computer Vision. IEEE, 245–253.
  • He et al. (2019) Dongliang He, Xiang Zhao, Jizhou Huang, Fu Li, Xiao Liu, and Shilei Wen. 2019. Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Krishna et al. (2017) Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision. 706–715.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
  • Li et al. (2018) Guanbin Li, Yuan Xie, and Liang Lin. 2018. Weakly supervised salient object detection using image labels. In Thirty-second AAAI conference on artificial intelligence.
  • Lin et al. (2020) Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, and Huasheng Liu. 2020. Weakly-Supervised Video Moment Retrieval via Semantic Completion Network. (2020).
  • Liu et al. (2018a) Meng Liu, Xiang Wang, Liqiang Nie, Xiangnan He, Baoquan Chen, and Tat-Seng Chua. 2018a. Attentive moment retrieval in videos. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 15–24.
  • Liu et al. (2018b) Meng Liu, Xiang Wang, Liqiang Nie, Qi Tian, Baoquan Chen, and Tat-Seng Chua. 2018b. Cross-modal moment localization in videos. In Proceedings of the 26th ACM international conference on Multimedia. 843–851.
  • Liu et al. (2019) Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Li Su, and Qingming Huang. 2019. Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding. In Proceedings of the 27th ACM International Conference on Multimedia. 539–547.
  • Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
  • Mithun et al. (2019) Niluthpol Chowdhury Mithun, Sujoy Paul, and Amit K. Roy-Chowdhury. 2019. Weakly supervised video moment retrieval from text queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11592–11601.
  • Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning. 1928–1937.
  • Paul et al. (2018) Sujoy Paul, Sourya Roy, and Amit K Roy-Chowdhury. 2018. W-talc: Weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision. 563–579.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing. 1532–1543.
  • Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015).
  • Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 815–823.
  • Shi et al. (2019) Yukai Shi, Guanbin Li, Qingxing Cao, Keze Wang, and Liang Lin. 2019. Face hallucination by attentive sequence optimization with reinforcement learning. IEEE transactions on pattern analysis and machine intelligence (2019).
  • Sigurdsson et al. (2016) Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision. Springer, 510–526.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15, 1 (2014), 1929–1958.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
  • Tran et al. (2015) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision. 4489–4497.
  • Wang et al. (2018) Jiajie Wang, Jiangchao Yao, Ya Zhang, and Rui Zhang. 2018. Collaborative learning for weakly supervised object detection. arXiv preprint arXiv:1802.03531 (2018).
  • Wang et al. (2016) Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision. Springer, 20–36.
  • Wang et al. (2019) Weining Wang, Yan Huang, and Liang Wang. 2019. Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 334–343.
  • Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning. Springer, 5–32.
  • Wu et al. (2020) Jie Wu, Guanbin Li, Si Liu, and Liang Lin. 2020. Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Xu et al. (2019) Huijuan Xu, Kun He, L Sigal, S Sclaroff, and K Saenko. 2019. Multilevel Language and Vision Integration for Text-to-Clip Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 2. 7.
  • Yang et al. (2019a) Sibei Yang, Guanbin Li, and Yizhou Yu. 2019a. Cross-modal relationship inference for grounding referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4145–4154.
  • Yang et al. (2019b) Sibei Yang, Guanbin Li, and Yizhou Yu. 2019b. Dynamic graph attention for referring expression comprehension. In Proceedings of the IEEE International Conference on Computer Vision. 4644–4653.
  • Yang et al. (2020) Sibei Yang, Guanbin Li, and Yizhou Yu. 2020. Graph-Structured Referring Expression Reasoning in The Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9952–9961.
  • Yeung et al. (2016) Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. 2016. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2678–2687.
  • Yu et al. (2019) Tong Yu, Yilin Shen, Ruiyi Zhang, Xiangyu Zeng, and Hongxia Jin. 2019. Vision-language recommendation via attribute augmented multimodal reinforcement learning. In Proceedings of the 27th ACM International Conference on Multimedia. 39–47.
  • Yuan et al. (2019) Yitian Yuan, Tao Mei, and Wenwu Zhu. 2019. To find where you talk: Temporal sentence localization in video with attention based location regression. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Zhang et al. (2019a) Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S Davis. 2019a. Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1247–1257.
  • Zhang et al. (2019b) Songyang Zhang, Jinsong Su, and Jiebo Luo. 2019b. Exploiting Temporal Relationships in Video Moment Localization with Natural Language. In Proceedings of the 27th ACM International Conference on Multimedia. 1230–1238.