Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos
Abstract.
Temporal grounding of natural language in untrimmed videos is a fundamental yet challenging multimedia task that facilitates cross-media visual content retrieval. We focus on the weakly supervised setting of this task, which accesses only coarse video-level language descriptions without temporal boundary annotations; this setting is more consistent with reality, as such weak labels are far easier to collect in practice. In this paper, we propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning (RL) to guide the process of progressively refining the temporal boundary. To the best of our knowledge, this is the first attempt to extend RL to the temporal localization task under weak supervision. Since a straightforward reward function is unavailable in the absence of pairwise granular boundary-query annotations, a cross-modal alignment evaluator is crafted to measure the alignment degree of each segment-query pair and provide tailor-designed rewards. This refinement scheme completely abandons the traditional sliding-window-based solution pattern and contributes to acquiring more efficient, boundary-flexible and content-aware grounding results. Extensive experiments on two public benchmarks, Charades-STA and ActivityNet, demonstrate that BAR outperforms the state-of-the-art weakly-supervised method and even beats some competitive fully-supervised ones.
1. Introduction
Temporal grounding of natural language in untrimmed video is a newly emerging and crucial task due to its potential applications in human-robot interaction and cross-media analysis. It aims to locate the temporal segment in an untrimmed video that is most relevant to a given sentence query. Albeit with varying degrees of progress, most of its recent successes (Gao et al., 2017; Liu et al., 2018a; Ge et al., 2019; Chen and Jiang, 2019; Xu et al., 2019; Yuan et al., 2019; Zhang et al., 2019a; Wang et al., 2019; He et al., 2019; Zhang et al., 2019b; Wu et al., 2020) are obtained in a fully supervised setting, i.e., mappings between video intervals and the corresponding sentence descriptions are available in the training set. Acquiring such granular annotations requires a huge amount of manual effort, which becomes a critical bottleneck as this task is pushed toward larger-scale and more complicated scenarios. To alleviate such expensive and unwieldy annotation, (Mithun et al., 2019) proposes to address this task in a weakly supervised setting that learns to infer the language-related temporal range from video-level supervision. This weakly supervised paradigm only has access to video-level language description annotations without the corresponding temporal boundary specification. It is an exceedingly favorable scheme since coarse video-level annotations are readily available on the internet. In this work, we focus on this weakly supervised paradigm.
Many approaches (Gao et al., 2017; Liu et al., 2018a; Ge et al., 2019; Chen and Jiang, 2019; Mithun et al., 2019) employ a two-stage “proposal-and-rank” solution pattern to address the task of temporal grounding of natural language. However, these works concentrate on learning more robust cross-modal representations in the ranking branch without explicitly considering and modeling boundary-flexible and content-aware proposals. As shown in the left half of Figure 1, the “proposal-and-rank” pattern is inherently restrictive as it relies heavily on pre-defined and inflexible sliding windows (e.g., 128 and 256 frames (Mithun et al., 2019)), which results in poor generalization for videos with considerable variance in length. Moreover, this pattern raises two additional challenges when extended to the weakly supervised setting. First, offset regression learning (Gao et al., 2017) for boundary adjustment becomes impractical in the absence of granular annotation. Second, with access only to video-query pairs during training, the leading model (Mithun et al., 2019) can merely learn cross-modal mappings across videos, while failing to take into account the more subtle and fine-grained semantic concepts within a video. These suboptimal cross-modal mappings generally lead to less accurate boundary prediction.

To better cope with the above issues, as shown in the right half of Figure 1, we formulate the task as a cross-modal matching guided heuristic process, a.k.a. Boundary Adaptive Refinement (BAR). BAR resorts to a tailor-designed reinforcement learning paradigm to adaptively optimize the temporal boundary towards shrinking the cross-modal semantic gap. Reinforcement learning (RL) has been validated in various fully supervised video understanding tasks, including video recognition (Yeung et al., 2016) and video referring expression (He et al., 2019). This work can be regarded as the first attempt to extend RL to weakly supervised temporal localization. Due to the lack of matching supervision between specific video intervals and the corresponding statement, it is non-trivial to design a reinforcement learning state assessment and reward function that can effectively drive the model towards efficient temporal boundary optimization. Our proposed BAR framework hence includes a context-aware feature extractor for encoding the current and contextualized environment state, an adaptive action planner for deciding the adjustment direction and interval range, and, most importantly, a cross-modal alignment evaluator for estimating the alignment score of each segment-query pair in the absence of pairwise supervisory information. This alignment evaluator is crafted to assign a corresponding reward by comparing the alignment scores of consecutive segment-query pairs under the guidance of both inter-video and intra-video ranking losses. The modularized component design and the heuristic adaptive temporal window adjustment strategy make the solution pattern more flexible and more consistent with the human perceptual retrieval mechanism. Furthermore, the search can be guided and pruned with goal-oriented rewards in a larger search space to obtain more accurate temporal window positioning, while occupying as little time as possible to reach impressive results.
The contributions of this work are summarized as follows:
We design a Boundary Adaptive Refinement framework that resorts to reinforcement learning to address the task of weakly supervised temporal grounding of language in video. To the best of our knowledge, we are the first to apply RL to the temporal localization task with weak supervision.
BAR abandons the traditional sliding-window-based proposal-and-rank pattern and employs a novel boundary adaptive refinement process, which contributes to acquiring more efficient, boundary-flexible and content-aware grounding results.
2. Related work

Temporal Grounding of Natural Language in Video. Temporal grounding of natural language aims to determine the start and end time of the temporal segment in an untrimmed video that corresponds to a language query. It is a temporal extension of image referring expression comprehension (Yang et al., 2019a, b, 2020), and is also a challenging multimedia task that requires cross-modal fusion and fine-grained interactions between the verbal and visual modalities. Many approaches (Gao et al., 2017; Liu et al., 2018a; Ge et al., 2019; Chen and Jiang, 2019; Mithun et al., 2019) employ a two-stage “proposal-and-rank” manner, which first generates temporal proposals and then selects the one with the highest confidence score. However, these approaches rely on externally generated sliding-window proposals for matching and ranking, making them boundary-inflexible and time-consuming. To formulate a computationally efficient framework, Chen et al. (Chen et al., 2018a) designed an end-to-end deep neural network that performs a single pass to obtain the grounding result. Xu et al. (Xu et al., 2019) proposed a multi-level model that integrates visual and query features at an earlier stage and further introduced caption generation as an auxiliary task.
Weakly Supervised Learning. Weakly supervised learning is a research setup that aims at optimizing a model without substantial manually labeled information. Many computer vision and multi-modal tasks, such as salient object detection (Li et al., 2018), captioning (Duan et al., 2018), language grounding (Paul et al., 2018; Mithun et al., 2019) and referring expression grounding (Liu et al., 2019), have explored the weakly supervised setup, since granular annotations are much more resource-consuming than coarse annotations. Wang et al. (Wang et al., 2018) proposed a weakly supervised collaborative learning framework for weakly supervised object detection, which only requires image-level labels. In the video domain, Duan et al. (Duan et al., 2018) formulated a new task, weakly supervised dense event captioning, whose goal is to detect and describe all events of interest in a video without dense segment annotations for model training. The work most closely related to ours is (Mithun et al., 2019), in which a Text-Guided Attention (TGA) mechanism leverages the latent alignment between video frames and sentence descriptions to address the same task as ours.
Reinforcement Learning. Reinforcement learning (RL) originates from neuroscientific and psychological understandings of how humans learn to optimize their behaviors in an environment. It can be mathematically formulated as a Markov Decision Process (MDP) in a sequential decision-making manner. Recently, RL techniques (Williams, 1992) have been utilized to imitate humans’ thinking patterns in various tasks, which generally can be formulated as an MDP that executes a series of actions to accomplish a task-specific objective (Yu et al., 2019; Ranzato et al., 2015; He et al., 2019; Shi et al., 2019; Chen et al., 2018b). Ranzato et al. (Ranzato et al., 2015) used the REINFORCE algorithm to train a captioning model at the sequence level by directly optimizing a non-differentiable metric. Yeung et al. (Yeung et al., 2016) adopted the REINFORCE algorithm to optimize an end-to-end approach for reasoning about the temporal bounds of an action and its category. He et al. (He et al., 2019) resorted to an RL-based method to address the fully supervised version of the studied task, which utilized the temporal IoU as the reward indicator. Our work offers the first attempt to extend RL to accomplish the proposed task with weak supervision. To estimate an accurate reward function in the absence of pairwise supervisory information, a cross-modal alignment evaluator is crafted to provide tailor-designed rewards.
3. Methodology
3.1. Problem Formulation
Following the commonly-used formulation (Gao et al., 2017), we represent a video $V$ by $N$ clips $\{c_i\}_{i=1}^{N}$, where each clip corresponds to a small chunk of sequential frames. Taking $V$ and a text query $Q$ as inputs, the studied task aims to output a video segment $[l^s, l^e]$ ($l^s$ and $l^e$ indicate the start and end clip indices, respectively) that semantically matches the query description. Our work focuses on the weakly supervised setting of this task. Specifically, only a set of $V$-$Q$ pairs is provided, while the video segment annotation for each pair is not available. Inspired by the observation that humans usually locate events of interest in a long video with a heuristic search strategy, we propose to formulate this task as a Markov Decision Process. A Boundary Adaptive Refinement (BAR) framework is thus designed: starting from an initial segment, the reinforcement learning technique is utilized to refine its temporal boundary progressively. The overall architecture of the proposed BAR framework is depicted in Figure 2. As illustrated, this modular framework employs a context-aware feature extractor to encode the environment state into cross-modal contextual concepts. A cross-modal alignment evaluator is crafted to provide a tailor-designed reward and the termination signal for the iterative refinement process. An adaptive action planner is designed to adaptively reason the direction and amplitude of the action from the contextualized observation, instead of shifting by a fixed amplitude at every step (He et al., 2019). The details of these modularized components are described in the following sections.
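To make the heuristic refinement concrete, the following minimal Python sketch simulates one episode. It is purely illustrative: the alignment evaluator is replaced by a toy oracle, the planner by a greedy rule, and all names and numbers (e.g., toy_alignment_score, greedy_planner, the initial segment) are hypothetical rather than taken from the actual BAR implementation.

```python
N = 120                      # number of clips in the video
gt = (40, 70)                # hidden ground-truth segment (unknown to the agent)
T_MAX = 12                   # maximum number of refinement steps

def toy_alignment_score(seg):
    """Stand-in for the cross-modal alignment evaluator: here, tIoU with gt."""
    inter = max(0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = (seg[1] - seg[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def greedy_planner(seg):
    """Stand-in planner: try the four primitive boundary shifts, keep the best."""
    delta = max(1, N // 10)
    candidates = [
        (min(seg[0] + delta, seg[1] - 1), seg[1]),   # start forward
        (max(seg[0] - delta, 0), seg[1]),            # start backward
        (seg[0], min(seg[1] + delta, N)),            # end forward
        (seg[0], max(seg[1] - delta, seg[0] + 1)),   # end backward
    ]
    return max(candidates, key=toy_alignment_score)

boundary = (N // 4, 3 * N // 4)                      # an arbitrary initial segment
for t in range(T_MAX):
    nxt = greedy_planner(boundary)
    reward = +1 if toy_alignment_score(nxt) > toy_alignment_score(boundary) else -1
    boundary = nxt
    print(f"step {t}: boundary={boundary}, reward={reward:+d}")
```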
3.2. Context-aware Feature Extractor
The context-aware feature extractor takes a video-query pair ($V$-$Q$) from the external environment and encodes it into context-aware cross-modal concepts. Each word in the query $Q$ is first encoded with GloVe (Pennington et al., 2014) embeddings and then fed into a GRU (Chung et al., 2014) to capture long-range dependencies. The summarized query representation $q$ is obtained from the last hidden state of the GRU. A pre-trained video feature extractor (C3D (Tran et al., 2015) or TSN (Wang et al., 2016)) is used to extract the clip-level feature for each video clip. A video segment is represented as a set of clip features $\{v_i\}$, where $v_i$ denotes the clip-level feature of the $i$-th clip and the size of the set is the number of clips in the corresponding segment. At each time step $t$, the updated boundary divides the whole video into three parts: the left segment, the current segment and the right segment, and we collect all clip-level features within the corresponding boundaries into sets to obtain the three segment-level features. Rather than directly taking the current segment's features as independent inputs (He et al., 2019), the extractor also leverages contextual cues derived from the other segments in the video (i.e., the left and right segment features) for state encoding. Furthermore, the extractor explicitly involves the normalized boundary location in the encoded features to provide a notion of relative position:
(1)  $\mathbf{l}_t = \big[\, l^s_t / N,\ \ l^e_t / N,\ \ t / T_{\max} \,\big]$

where $l^s_t$ and $l^e_t$ denote the start and end clip indices of the boundary at step $t$, respectively, and $T_{\max}$ indicates the maximum number of iterations in the refinement process.
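As a rough illustration of this state encoding (using the notation above; the mean pooling of each segment and the feature dimensions are assumptions about implementation details rather than the released code), the clip features can be grouped as follows:

```python
import numpy as np

def split_state_features(clip_feats, ls, le, t, t_max):
    """Split clip features by the current boundary [ls, le) into left / current /
    right segments, mean-pool each, and append the normalized location of Eq. (1)."""
    n, d = clip_feats.shape
    left = clip_feats[:ls].mean(axis=0) if ls > 0 else np.zeros(d)
    curr = clip_feats[ls:le].mean(axis=0)
    right = clip_feats[le:].mean(axis=0) if le < n else np.zeros(d)
    glob = clip_feats.mean(axis=0)
    loc = np.array([ls / n, le / n, t / t_max])      # normalized boundary location
    return glob, left, curr, right, loc

# e.g. 120 clips with 4096-d C3D features, boundary [30, 90), step 3 of 12
feats = np.random.randn(120, 4096).astype(np.float32)
g, l, c, r, loc = split_state_features(feats, 30, 90, t=3, t_max=12)
```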
3.3. Cross-modal Alignment Evaluator
The cross-modal alignment evaluator is designed to address two critical issues in our RL-based approach. On the one hand, it assigns a target-oriented reward, addressing the difficulty that the adaptive action planner cannot directly obtain a reliable reward function without granular boundary annotations. On the other hand, it determines an accurate stop signal to terminate the refinement process. Given a video segment, the dimension of each clip feature $v_i$ is reduced to that of the summarized query representation $q$ via a filter function $\varphi(\cdot)$, which consists of a fully-connected layer followed by ReLU (Krizhevsky et al., 2012) and Dropout (Srivastava et al., 2014). The query representation $q$ is then used to create a temporal attention over all video clips, which emphasizes crucial video clips and weakens inessential parts. Concretely, the scaled dot-product attention mechanism (Luong et al., 2015) is utilized to obtain the attention weights $\alpha_i$ and the segment attention feature $f^{att}$:
(2)  $\alpha_i = \dfrac{\exp\!\big(\varphi(v_i) \cdot q / \sqrt{d}\big)}{\sum_{j}\exp\!\big(\varphi(v_j) \cdot q / \sqrt{d}\big)}, \qquad f^{att} = \sum_{i} \alpha_i\, \varphi(v_i)$

where $\cdot$ indicates the dot product between two vectors and $d$ is the dimension of $q$. Then the segment attention feature $f^{att}$ and the query representation $q$ are mapped to a joint embedding space to compute the alignment score $s$:
(3)  $s = \cos\!\big(\mathbf{W}^v f^{att},\ \mathbf{W}^q q\big)$

where $\mathbf{W}^v$ and $\mathbf{W}^q$ are learnable projections into the joint embedding space. The alignment score $s$ can be regarded as an estimate with which to provide a reliable reward. Specifically, the evaluator measures the alignment scores of consecutive segment-query pairs and assigns the corresponding reward $r_t$:
(4)  $r_t = \begin{cases} +1, & \text{if } s_{t+1} > s_{t} \\ -1, & \text{otherwise} \end{cases}$

where $s_t$ denotes the alignment score between the current segment and the sentence query at time step $t$. This reward function returns +1 or -1: if the next boundary has a higher alignment score than the current one, the reward of the action moving from the current window to the next one is +1, and -1 otherwise. Such binary rewards reflect more clearly which actions drive the boundary towards the ground truth and thus facilitate the agent's learning.
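A compact PyTorch sketch of how such an evaluator and reward can be wired together is shown below. It follows the reconstruction of Eqs. (2)-(4) above; the layer sizes, the dropout rate and the cosine form of the joint-embedding score are our assumptions, not the exact released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentEvaluator(nn.Module):
    """Query-guided temporal attention over clip features followed by a
    similarity score in a joint embedding space (illustrative sketch)."""
    def __init__(self, clip_dim=4096, query_dim=512, joint_dim=512):
        super().__init__()
        self.filter = nn.Sequential(nn.Linear(clip_dim, query_dim),
                                    nn.ReLU(), nn.Dropout(0.5))      # filter function
        self.vis_proj = nn.Linear(query_dim, joint_dim)
        self.txt_proj = nn.Linear(query_dim, joint_dim)

    def forward(self, clip_feats, query):
        # clip_feats: (n_clips, clip_dim); query: (query_dim,)
        v = self.filter(clip_feats)                                  # (n, d)
        attn = F.softmax(v @ query / query.shape[-1] ** 0.5, dim=0)  # Eq. (2)
        seg_feat = (attn.unsqueeze(-1) * v).sum(dim=0)               # attended segment feature
        return F.cosine_similarity(self.vis_proj(seg_feat),
                                    self.txt_proj(query), dim=0)     # Eq. (3): alignment score

def reward(score_next, score_curr):
    """Binary reward of Eq. (4): +1 if the refined boundary aligns better, else -1."""
    return 1.0 if score_next > score_curr else -1.0

evaluator = AlignmentEvaluator()
clips, q = torch.randn(60, 4096), torch.randn(512)
score = evaluator(clips, q)
```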
3.4. Adaptive Action Planner
The adaptive action planner is designed to infer action sequences that refine the temporal boundary. To get a fixed-length visual representation, we apply mean pooling over the feature sets of the global, current, left and right segments, obtaining the pooled features $f^g$, $f^c$, $f^l$ and $f^r$, respectively. The cross-gated interaction method (Feng et al., 2018) is then adopted to enhance the effects of relevant segment-query pairs. Concretely, the current pooled feature $f^c$ is gated by the query representation $q$, while the gate of $q$ in turn depends on $f^c$:
(5)  $\tilde{f}^{c} = f^{c} \odot \sigma\!\big(\mathbf{W}_{q}\, q\big), \qquad \tilde{q} = q \odot \sigma\!\big(\mathbf{W}_{c}\, f^{c}\big)$

where $\mathbf{W}_{q}$ and $\mathbf{W}_{c}$ are parameter matrices and $\sigma$ denotes the sigmoid function. These cross-modal features are then concatenated and fed into two cascaded fully-connected layers to get the state activation representation $h_t$:
(6)  $h_t = \mathrm{FC}_2\!\Big(\mathrm{FC}_1\big(\big[\, f^{g};\ f^{l};\ \tilde{f}^{c};\ f^{r};\ \tilde{q};\ \mathbf{l}_t \,\big]\big)\Big)$
Such contextual features encourage the planner to perform a left-right tradeoff over the video contents and infer a more accurate action. $h_t$ is further fed into a GRU cell so that the agent can incorporate memory of the video segments that have already been explored. The output state of the GRU is then followed by two separate fully-connected layers (i.e., actor and critic) that respectively estimate a policy function $\pi(a_t \mid s_t)$ and a value approximator $V(s_t)$. A primitive action $a_t$ is sampled from the policy function during training. In our work, the action space is composed of four primitive actions: shifting the start/end point backward/forward by $\delta$ clips, where $\delta$ is an amplitude factor that is empirically set as:
(7)  $\delta = \Big\lfloor \dfrac{N \cdot s_g}{10 \cdot s_t} \Big\rfloor$

where $\lfloor \cdot \rfloor$ denotes the lower-bound (floor) operation onto a positive integer, and $s_g$ and $s_t$ denote the global and current alignment scores estimated by the alignment evaluator. The constant 10 constrains the action amplitude to fluctuate around $N/10$ (an empirical number used in (He et al., 2019)). $s_g$ serves as a baseline of the alignment degree that determines $\delta$: when $s_t$ is low, $\delta$ becomes large and the agent markedly shifts the boundary; when $s_t$ becomes higher, $\delta$ shrinks and the boundary is only marginally refined. This adaptive setting enables the agent to determine the action amplitude based on the current observation, which is also in line with human habits.
The state-value predicted by the critic is the value estimation of the current state. Under the assumption that the critic produces the exact values, the actor is trained based on an unbiased estimation of the gradient.
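The planner can be sketched in PyTorch as follows; the hidden sizes, the exact concatenation order behind Eq. (6) and the inclusion of the location vector are our assumptions.

```python
import torch
import torch.nn as nn

class ActionPlanner(nn.Module):
    """Cross-gated fusion of current segment and query (Eq. 5), two cascaded FC
    layers over the concatenated context (Eq. 6), a GRU memory, and actor/critic
    heads (illustrative sketch)."""
    def __init__(self, dim=512, hidden=1024, n_actions=4):
        super().__init__()
        self.gate_v = nn.Linear(dim, dim)            # query -> gate for the segment feature
        self.gate_q = nn.Linear(dim, dim)            # segment -> gate for the query feature
        self.fuse = nn.Sequential(nn.Linear(5 * dim + 3, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.gru = nn.GRUCell(hidden, hidden)
        self.actor = nn.Linear(hidden, n_actions)    # policy logits over the 4 boundary shifts
        self.critic = nn.Linear(hidden, 1)           # state-value estimate

    def forward(self, f_glob, f_left, f_curr, f_right, query, loc, h):
        g_curr = f_curr * torch.sigmoid(self.gate_v(query))          # Eq. (5)
        g_query = query * torch.sigmoid(self.gate_q(f_curr))
        x = torch.cat([f_glob, f_left, g_curr, f_right, g_query, loc], dim=-1)
        h = self.gru(self.fuse(x), h)                                # Eq. (6) + memory
        policy = torch.distributions.Categorical(logits=self.actor(h))
        return policy, self.critic(h), h

planner = ActionPlanner()
segs = [torch.randn(1, 512) for _ in range(4)]       # global / left / current / right features
policy, value, h = planner(*segs, torch.randn(1, 512), torch.rand(1, 3), torch.zeros(1, 1024))
action = policy.sample()                              # one of the four boundary-shift primitives
```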
3.5. Training
Due to its efficiency, the advantage actor-critic (A2C) algorithm (Sutton and Barto, 2018) is chosen to train our adaptive action planner. A multiple instance learning scheme with a combined ranking loss is designed to train the cross-modal alignment evaluator and the context-aware feature extractor. The total loss of BAR is summarized as:
(8)  $\mathcal{L} = \mathcal{L}_{a2c} + \lambda\, \mathcal{L}_{rank}$

where $\mathcal{L}_{a2c}$ denotes the loss function of the A2C algorithm, $\mathcal{L}_{rank}$ is the combined ranking loss and $\lambda$ is a trade-off factor between the two losses.
A2C Loss. The adaptive action planner runs $T$ steps of adjustment during training. Given the trajectory $\{(s_t, a_t, r_t)\}_{t=1}^{T}$ of an episode, the loss function of the actor is formulated as:
(9)  $\mathcal{L}_{actor} = -\dfrac{1}{T}\sum_{t=1}^{T} \Big[ \log \pi(a_t \mid s_t)\, A(s_t, a_t) + \beta\, H\big(\pi(\cdot \mid s_t)\big) \Big]$

where $A(s_t, a_t)$ denotes the advantage function and the entropy $H$ of the policy is introduced into the objective to improve exploration ($\beta$ is a weighting coefficient). $A(s_t, a_t)$ measures whether and by how much the action $a_t$ is better than the policy's default behaviour. Temporal-difference (TD) learning is adopted to estimate the Q-value function by $n$-step returns with function approximation:
(10)  $Q(s_t, a_t) \approx \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k} + \gamma^{n}\, V(s_{t+n}), \qquad A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$

where $\gamma$ is a constant discount factor. It is noted that BAR does not suffer from the sparse reward issue during training since a reward can be obtained at every step. To optimize the critic, we minimize the mean squared error (MSE) loss $\mathcal{L}_{critic}$ between the Q-value function and the estimated value $V(s_t)$ (Mnih et al., 2016). The total A2C loss is a combination of the losses from the actor branch and the critic branch: $\mathcal{L}_{a2c} = \mathcal{L}_{actor} + \mathcal{L}_{critic}$.
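For reference, a self-contained paraphrase of this objective over one episode is given below (not the exact implementation; the entropy weight beta and bootstrapping from the final value estimate are assumptions). The inputs are lists of per-step scalar tensors.

```python
import torch

def a2c_loss(log_probs, values, rewards, entropies, gamma=0.99, beta=0.01):
    """Actor loss with advantage and entropy bonus (Eq. 9) plus an MSE critic
    loss on n-step discounted returns (Eq. 10)."""
    returns, R = [], values[-1].detach()             # bootstrap from the last value estimate
    for r in reversed(rewards):                      # accumulate discounted returns backwards
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.stack(returns)
    values = torch.stack(values)
    advantages = returns - values                    # A(s_t, a_t) = Q(s_t, a_t) - V(s_t)
    actor_loss = -(torch.stack(log_probs) * advantages.detach()).mean() \
                 - beta * torch.stack(entropies).mean()
    critic_loss = advantages.pow(2).mean()           # MSE between returns and value estimates
    return actor_loss + critic_loss

# toy usage with a 3-step episode of +1/-1 rewards
lp = [torch.tensor(-1.2, requires_grad=True) for _ in range(3)]
vals = [torch.tensor(0.1, requires_grad=True) for _ in range(3)]
ent = [torch.tensor(1.3) for _ in range(3)]
loss = a2c_loss(lp, vals, [1.0, -1.0, 1.0], ent)
loss.backward()
```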
Ranking Loss. In general, the content discrepancy between different videos is higher than that within a single video. Hence we resort to a multiple instance learning scheme and first leverage coarse-level semantic concepts across videos to optimize the framework. Concretely, given the global video feature of $V$ and its query representation of $Q$, it is expected that the alignment score $s(V, Q)$ (positive pair) is higher than the scores $s(V', Q)$ / $s(V, Q')$ (negative pairs) for any video $V'$ / query $Q'$ taken from other sample pairs. The inter-video ranking loss (Schroff et al., 2015) is thus defined as:
(11)  $\mathcal{L}_{inter} = \big[\Delta - s(V, Q) + s(V', Q)\big]_{+} + \big[\Delta - s(V, Q) + s(V, Q')\big]_{+}$

where $[x]_{+}$ denotes a ramp function defined by $\max(0, x)$ and $\Delta$ indicates a margin. $s(V, Q)$ and the global alignment score $s_g$ are equivalent. The positive and negative pairs are obtained from the same mini-batch.
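A batched sketch of this loss is shown below (our reading of Eq. (11); the score-matrix layout and the use of all in-batch negatives are assumptions):

```python
import torch

def inter_video_ranking_loss(scores, margin=0.2):
    """scores[i, j] = alignment score of video i with query j for a mini-batch of
    B matched pairs; diagonal entries are the positive pairs."""
    B = scores.size(0)
    pos = scores.diag()
    # negatives built within the batch: wrong video for a query (columns)
    # and wrong query for a video (rows)
    cost_v = torch.clamp(margin - pos.unsqueeze(0) + scores, min=0)   # s(V', Q) vs s(V, Q)
    cost_q = torch.clamp(margin - pos.unsqueeze(1) + scores, min=0)   # s(V, Q') vs s(V, Q)
    off_diag = ~torch.eye(B, dtype=torch.bool)
    return cost_v[off_diag].mean() + cost_q[off_diag].mean()

scores = torch.rand(8, 8, requires_grad=True)   # e.g. a batch of 8 video-query pairs
inter_video_ranking_loss(scores).backward()
```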
Inter-video supervision generally conveys substantially broad semantic abstractions that can hardly distinguish similar contents within a specific video. To this end, we design an intra-video ranking loss that captures more subtle concepts within a video to further optimize the network. Specifically, if the score of any one of the left, current and right segment-query pairs surpasses the global one during the refinement process, we assume this pair should have a higher alignment score than the other two pairs:
(12)  $\mathcal{L}_{intra} = \displaystyle\sum_{x \in \{s_t,\ s^{l}_t,\ s^{r}_t\}} \mathbb{1}\,(x > s_g) \sum_{y \neq x} \big[\Delta - x + y\big]_{+}$

where $s^{l}_t$ and $s^{r}_t$ are the alignment scores of the left segment-query pair and the right segment-query pair at time step $t$, respectively, and $y$ ranges over the other two scores in the triple. $\mathbb{1}(\cdot)$ is a binary indicator function that outputs 1 if the inequality in parentheses holds and 0 otherwise. Specifically, when the score of a segment-query pair, say $s_t$, surpasses $s_g$, the optimization target is to increase the gap between $s_t$ and the other two ($s^{l}_t$ and $s^{r}_t$) by increasing $s_t$ or decreasing $s^{l}_t$ and $s^{r}_t$. Note that lowering $s_t$ below $s_g$ might be another option, but this usually becomes increasingly impractical as inter-video training progresses. In addition, when more than one segment-query pair has a score surpassing $s_g$, the optimization target of $\mathcal{L}_{intra}$ will usually guide the alignment evaluator to suppress the score of the sub-optimal matching pair(s) below $s_g$ and at the same time drive the action planner to adjust the boundary. Intuitively, $\mathcal{L}_{intra}$ encourages the text query to be closer to the semantically matched video moment than to other possible moments from the same video, which contributes to obtaining a content-aware alignment score.
$\mathcal{L}_{intra}$ manages to i) widen the score gap between matched and unmatched segment-query pairs to increase the confidence of the alignment evaluation; and ii) improve the reward calculation by shaping the alignment evaluator, which in turn drives the action planner to achieve better temporal boundary adjustments. To sum up, the combined ranking loss is defined as:
(13)  $\mathcal{L}_{rank} = \mathcal{L}_{inter} + \eta\, \mathcal{L}_{intra}$

where $\eta$ is a weighting parameter that balances the intra-video and inter-video ranking losses. In the early stage of this collaborative training scheme, it is very unlikely that the score of a segment-query pair exceeds $s_g$, so $\mathcal{L}_{intra}$ tends to 0 and $\mathcal{L}_{inter}$ plays the dominant role of learning to transfer the matching between video-query pairs to segment-query pairs. As training progresses, $\mathcal{L}_{inter}$ converges gradually and it becomes more common for the score of a segment-query pair to exceed $s_g$, at which point $\mathcal{L}_{intra}$ begins to play the critical role.
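One plausible realization of the intra-video term and the combined ranking loss, following the reconstruction of Eqs. (12)-(13) above, is sketched below; the hard indicator gating, the shared margin and the default eta are assumptions.

```python
import torch

def intra_video_ranking_loss(s_curr, s_left, s_right, s_global, margin=0.2):
    """Whenever one segment's score exceeds the global score, push it above the
    other two segments of the same video by a margin (indicator-gated hinges)."""
    def hinge(anchor, other):
        return torch.clamp(margin - anchor + other, min=0)

    loss = torch.zeros_like(s_curr)
    for anchor, others in [(s_curr, (s_left, s_right)),
                           (s_left, (s_curr, s_right)),
                           (s_right, (s_curr, s_left))]:
        gate = (anchor > s_global).float()           # binary indicator
        loss = loss + gate * (hinge(anchor, others[0]) + hinge(anchor, others[1]))
    return loss.mean()

def combined_ranking_loss(l_inter, l_intra, eta=0.1):
    """Eq. (13): inter-video term plus a weighted intra-video term (eta is illustrative)."""
    return l_inter + eta * l_intra

# toy usage with a batch of 4 refinement states
s_c, s_l, s_r, s_g = (torch.rand(4) for _ in range(4))
print(combined_ranking_loss(torch.tensor(0.3), intra_video_ranking_loss(s_c, s_l, s_r, s_g)))
```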
Alternating Update. BAR is trained from scratch and an alternating update strategy is applied to facilitate stable training. Specifically, in each round we first fix the parameters of the action planner and employ $\mathcal{L}_{rank}$ for $U$ iterations of model optimization. This setting guarantees a trustworthy initial reward for the action planner. When these $U$ iterations are reached, we fix the parameters of the alignment evaluator and feature extractor, and switch to $\mathcal{L}_{a2c}$ to optimize the action planner for another $U$ iterations. This alternating update mechanism repeats until the model converges.
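The alternating schedule itself can be illustrated with the toy loop below; the modules and losses are stand-ins, and only the freeze/switch pattern reflects the strategy described above (U = 500 follows Section 4.2).

```python
import torch
import torch.nn as nn

planner, evaluator = nn.Linear(8, 4), nn.Linear(8, 1)   # stand-ins for the real modules
opt = torch.optim.Adam(list(planner.parameters()) + list(evaluator.parameters()), lr=1e-3)
U, num_rounds = 500, 3

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad_(flag)

for _ in range(num_rounds):
    # Phase 1: fix the planner, optimize the evaluator/extractor with the ranking loss
    set_trainable(planner, False); set_trainable(evaluator, True)
    for _ in range(U):
        loss_rank = evaluator(torch.randn(2, 8)).pow(2).mean()   # stand-in for L_rank
        opt.zero_grad(); loss_rank.backward(); opt.step()
    # Phase 2: fix the evaluator, optimize the planner with the A2C loss
    set_trainable(evaluator, False); set_trainable(planner, True)
    for _ in range(U):
        loss_a2c = planner(torch.randn(2, 8)).pow(2).mean()      # stand-in for L_A2C
        opt.zero_grad(); loss_a2c.backward(); opt.step()
```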
| | | Charades-STA (Gao et al., 2017) | | | ActivityNet (Krishna et al., 2017) | |
Supervision | Feature | Baseline | tIoU@0.7 | tIoU@0.5 | tIoU@0.3 | tIoU@0.5 | tIoU@0.3
Full Supervision | C3D | ROLE (Liu et al., 2018b), ACM MM 2018 | - | 12.12 | 25.26 | - | - |
MCN (Anne Hendricks et al., 2017), ICCV 2017 | 4.44 | 13.66 | 28.99 | 10.17 | 22.07 | ||
CTRL (Gao et al., 2017), ICCV 2017 | 8.89 | 23.63 | - | 14.36 | 29.10 | ||
ACRN (Liu et al., 2018a), SIGIR 2018 | 9.65 | 26.74 | 47.64 | 16.53 | 31.75 | ||
MAC (Ge et al., 2019), WACV 2019 | 12.23 | 29.39 | 53.34 | - | - | ||
SAP (Chen and Jiang, 2019), AAAI 2019 | 13.36 | 27.42 | - | - | - | ||
QSPN (Xu et al., 2019), AAAI 2019 | 15.80 | 35.60 | 54.7 | 27.70 | 45.30 | ||
ABLR (Yuan et al., 2019), AAAI 2019 | - | - | - | 36.79 | 55.67 | ||
SM-RL (Wang et al., 2019), CVPR 2019 | 11.17 | 24.36 | - | - | - | ||
RWM (He et al., 2019), AAAI 2019 | 13.74 | 34.12 | 55.16 | 34.91 | 53.00 | ||
TSN | RWM (He et al., 2019), AAAI 2019 | 17.72 | 37.23 | 61.73 | 37.46 | 57.29 | |
I3D | MAN (Zhang et al., 2019a), CVPR 2019 | 22.72 | 46.53 | - | - | - | |
Weak Supervision | C3D | TGA (Mithun et al., 2019), CVPR 2019 | 8.84 | 19.94 | 32.14 | - | - |
WS-DEC (Duan et al., 2018), NIPS 2018 | - | - | - | 23.34 | 41.98 | ||
WSLLN (Gao et al., 2019), EMNLP 2019 | - | - | - | 22.70 | 42.80 | ||
SCN (Lin et al., 2020), AAAI 2020 | 9.97 | 23.58 | 42.96 | 29.22 | 47.23 | ||
BAR (ours) | 12.23 | 27.04 | 44.97 | 30.73 | 49.03 | |
TSN | BAR (ours) | 15.97 | 33.98 | 51.64 | 33.12 | 53.41 |
3.6. Inference
At each time step, BAR executes an action via a greedy decoding algorithm to adaptively adjust the temporal boundary, and the cross-modal alignment evaluator computes a score that provides confidence for the alignment degree and termination. Empirically, the final grounding result corresponding to the query usually occupies a reasonable and appropriate proportion of the video length. Hence, to penalize video segments with abnormal lengths, we propose to update the confidence score with a Gaussian penalty function as follows:
(14)  $\hat{s}_t = s_t \cdot \exp\!\big(-(\rho_t - \mu)^2 / \sigma\big)$

where $\rho_t$ is the normalized length of the current segment, $\mu$ denotes the penalty baseline factor that specifies which lengths are regarded as abnormal, and $\sigma$ is a modulating factor: as $\sigma$ increases, the effect of the penalty is likewise decreased. The segment with the maximum $\hat{s}_t$ during testing is regarded as the final grounding result.
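Assuming the Gaussian form reconstructed in Eq. (14), the penalty can be implemented as below; mu and sigma follow the values reported in Section 4.2 for Charades-STA, while the use of the normalized segment length as the penalized quantity is our assumption.

```python
import math

def penalized_score(score, seg_len, video_len, mu=0.35, sigma=0.5):
    """Down-weight segments whose normalized length deviates from the expected
    ratio mu; a larger sigma weakens the penalty."""
    ratio = seg_len / video_len
    return score * math.exp(-((ratio - mu) ** 2) / sigma)

print(penalized_score(0.8, seg_len=10, video_len=30))   # near-typical length, mild penalty
print(penalized_score(0.8, seg_len=28, video_len=30))   # overlong segment, strong penalty
```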
4. Experiments
4.1. Datasets and Evaluation Metrics
Datasets. We conduct extensive experiments on two benchmark datasets: Charades-STA (Gao et al., 2017) and ActivityNet (Krishna et al., 2017). Charades-STA is extended from the Charades dataset (Sigurdsson et al., 2016) with generated sentence-clip annotations, comprising 12,408 sentence-clip pairs for training and 3,720 for testing. The average length of each video in this dataset is 29.8 seconds and the described clips are 8 seconds long on average. The ActivityNet dataset (Krishna et al., 2017) is introduced to validate the robustness of the proposed model on longer and more diverse videos. It contains 37,421 and 17,505 video-sentence pairs for training and testing. The average duration of the videos is 2 minutes and the described temporally annotated clips are 36 seconds long on average.
Evaluation Metrics. We adopt “tIoU@m” to evaluate the grounding result, which denotes the percentage of queries whose prediction has a temporal IoU with the ground truth larger than the threshold m.
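For completeness, a straightforward implementation of this metric (with illustrative segments given in seconds) is:

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_tiou(preds, gts, m=0.5):
    """Percentage of queries whose prediction has a temporal IoU larger than m."""
    hits = sum(tiou(p, g) > m for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

# e.g. two queries evaluated at threshold m = 0.5
print(recall_at_tiou([(2.0, 9.5), (11.0, 20.0)], [(1.0, 10.0), (15.0, 30.0)], m=0.5))
```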
4.2. Implementation Details
We leverage the C3D and TSN models to encode video representations. The initial boundary $[l^s; l^e]$, where $l^s$ and $l^e$ denote the start and end clip indices, is fixed for all videos. The maximum iteration number $T_{\max}$ is set to 12 and the size of the hidden state in the GRU is 1024. The batch size is 12 and the total loss is optimized via the Adam optimizer with a learning rate of 0.001. The margin $\Delta$ in the ranking loss is 0.2. Two loss hyper-parameters are fixed to 0.1 and 0.4, respectively, and two further factors are empirically set to 1 and 0.1. The modulating factor $\sigma$ is set to 0.5 by cross-validation, and the penalty baseline factor $\mu$ is fixed to 0.35 and 1.0 on Charades-STA and ActivityNet, respectively. We use $U = 500$ in the alternating update procedure.
4.3. Comparison with the State-of-the-art
We compare the proposed BAR with several state-of-the-art models under both the weakly-supervised and fully-supervised settings in Table 1. On the one hand, BAR significantly outperforms the existing weakly-supervised method and establishes new state-of-the-art performance on both datasets. With the C3D video feature, BAR boosts tIoU@0.5 to 27.04% and 30.73%, an improvement of 3.46% and 1.51% over SCN (Lin et al., 2020) on the two datasets, respectively. Furthermore, it achieves 33.98% (33.12%) at tIoU@0.5 with the more powerful TSN feature. This reveals that our approach helps to obtain more accurate video segments. On the other hand, BAR even achieves better or comparable results than some fully-supervised methods. For instance, BAR outperforms QSPN (Xu et al., 2019) by 3.03% w.r.t. tIoU@0.5 on the ActivityNet dataset. This is an inspiring result, as it reveals that our model can obtain impressive results by learning from massive coarse video-level annotations, which is of great benefit to practical applications.
4.4. Ablation Studies
We perform extensive ablation studies and demonstrate the effectiveness of several essential components in BAR. The experiments are conducted on the Charades-STA with the TSN feature. The results are reported in Table 2.
Effectiveness of Reinforcement Learning. A more accurate measurement of the factual RL contribution is to remove it entirely and use the proposals generated by an off-the-shelf weakly-supervised localization method (Mithun et al., 2019). Hence we design a variant (abbreviated as “Ours w/o RL”) that follows this setting. We observe that removing RL from BAR leads to a noticeable drop in performance; for example, tIoU@0.5 declines from 33.98% to 25.89%. This reveals that the introduction of RL is fundamental and brings more flexible and adaptable temporal proposals, an advantage that cannot be achieved with traditional two-stage frameworks, not to mention its high efficiency.
Effectiveness of Tailor-designed Reward. To validate that a target-oriented reward is essential for this task, we design a baseline (abbreviated as “Ours w/ random reward”) that samples a random scalar from the uniform distribution over [-1, 1] as the reward for optimization. Table 2 shows that this baseline obtains an exceedingly inferior result, close to that of a random policy. It indicates that a tailor-designed reward is definitely necessary in the RL setting.
Effectiveness of Boundary Initialization. To compare different boundary initializations, we design two baselines (denoted as “Initial boundary [N/3; 2N/3]” and “Initial boundary [N/5; 4N/5]”) that set the initial boundary to $[N/3;\ 2N/3]$ and $[N/5;\ 4N/5]$, respectively. As reported in Table 2, the performance is not sensitive to the boundary initialization and all settings obtain competitive experimental results, which reflects the robustness of BAR.
Metrics | tIoU@0.7 | tIoU@0.5 | tIoU@0.3
Ours w/o RL | 12.37 | 25.89 | 45.36 |
Ours w/ random reward | 5.76 | 8.97 | 28.82 |
Initial boundary [N/3; 2N/3] | 15.72 | 33.36 | 51.33 |
Initial boundary [N/5; 4N/5] | 15.83 | 33.47 | 51.20 |
Ours w/ N/5 amplitude | 13.60 | 31.88 | 49.65 |
Ours w/ N/10 amplitude | 14.27 | 32.02 | 50.25 |
Ours w/ N/15 amplitude | 13.73 | 31.66 | 49.29 |
Ours w/o context | 13.62 | 31.45 | 49.22 |
Ours w/o $\mathcal{L}_{intra}$ | 14.24 | 30.73 | 46.82
Ours w/ stop | 10.13 | 24.38 | 43.22 |
Ours w/o penalty | 13.78 | 30.97 | 50.27 |
Ours | 15.97 | 33.98 | 51.64 |
Effectiveness of Adaptive Setting. Rather than shifting a fixed distance at each action, BAR adaptively adjusts its action amplitude according to the current state. To demonstrate the superiority of this adaptive setting, we design three variants (named “Ours w/ N/5 amplitude”, “Ours w/ N/10 amplitude” and “Ours w/ N/15 amplitude”) in which the agent shifts N/5, N/10 and N/15 clips at each step, respectively. As summarized in Table 2, “Ours w/ N/10 amplitude” performs best among the variants with a fixed adjustment strategy. However, our approach with the adaptive setting achieves even more impressive performance, which reveals that the adaptive setting is more flexible and effective in our proposed framework.




Effectiveness of Context Information. BAR additionally builds contextualized video representations for action decisions. To investigate the effectiveness of the context information, we design a baseline that removes the context concepts ($f^l$, $f^r$) from Equation 6, abbreviated as “Ours w/o context”. From Table 2, we can see that although the model without context representations still achieves promising results, our model with context gains 2.35% and 2.53% improvement w.r.t. tIoU@0.7 and tIoU@0.5, respectively, which demonstrates that contextual concepts help to obtain more content-aware results.
Effectiveness of Intra-video Ranking Loss. To verify the effectiveness of $\mathcal{L}_{intra}$, we construct a comparison variant that merely uses $\mathcal{L}_{inter}$ to optimize the evaluator, named “Ours w/o $\mathcal{L}_{intra}$”. Table 2 reveals that the grounding result suffers an obvious drop without $\mathcal{L}_{intra}$; for example, tIoU@0.3 declines from 51.64% to 46.82%. Our approach with the intra-video ranking loss achieves more precise alignment scores and more accurate grounding results. To further demonstrate the reliability of the alignment score obtained by our model, we additionally calculate the correlation coefficient (CC) between $s$ and the ground-truth IoU. The CC reaches 0.79, which reveals that the obtained $s$ is reliable enough to correctly reflect the matching degree and to infer target-oriented rewards.
Analysis of Stop Signal. We did not include an ending signal in the action space (He et al., 2019), as there is no absolutely reliable and stable internal segment-query matching score that can effectively terminate the iteration. We further tried introducing an alignment threshold as a stopping signal (abbreviated as “Ours w/ stop”), which led to inferior results. To validate the significance of the length penalty strategy, we also design a baseline that directly takes the raw score $s$ to determine the termination time, denoted as “Ours w/o penalty”. The results indicate that this baseline suffers from performance degradation, presumably because “Ours w/o penalty” tends to produce an excessive score when the video segment is too long or too short.
Figure 3 depicts the performance curves with varying $\mu$ or $\sigma$ in the cross-validation procedure. We can see that a factor with a too large or too small value leads to an obvious performance decline, which reveals that a segment with a suitable length is more likely to produce impressive results. A similar trend can be observed when varying $\sigma$, demonstrating that an appropriate Gaussian penalty encourages the model to perform better. We empirically observe that $\mu = 0.35$ and $\sigma = 0.5$ yield the most promising performance across the different tIoU levels.
4.5. Efficiency
To further investigate the efficiency of the boundary adaptive refinement process, we compare BAR with TGA (Mithun et al., 2019) in terms of average running time and the number of candidate segments. As summarized in Table 3, BAR reduces both the localization time and the number of candidate boundaries by a sizeable margin. Note that a boundary in BAR is, to some extent, the counterpart of a temporal proposal, yet it is far more flexible and adaptable. BAR merely needs to refine an initial temporal boundary progressively, which avoids redundant computation in a time- and space-efficient manner. Based on the above discussion, we conclude that BAR surpasses the previous competitive methods in both accuracy and efficiency.
4.6. Qualitative Visualizations
We illustrate two qualitative results in Figures 4 and 5 to show the whole process of how BAR locates the described event. We observe that our algorithm mainly performs optimization from coarse to fine: the agent chooses larger movement adjustments at the initial stage of the iteration to quickly narrow the semantic gap between language and vision and, as the iteration progresses, the adjustment range changes rapidly to achieve local fine-tuning, which is also more consistent with how humans perform cross-modal target retrieval.
Methods | Time(s) | Candidate Proposal Number |
TGA (Mithun et al., 2019) | 0.104 | 65.11 |
BAR (Ours) | 0.068 | 1 |
5. Conclusions
We propose a Boundary Adaptive Refinement framework that resorts to reinforcement learning to address the task of weakly-supervised temporal grounding of natural language in videos. This refinement scheme completely abandons traditional sliding window-based solution patterns and contributes to obtaining more efficient, boundary-flexible and content-aware grounding results. Extensive experiments show that our approach establishes new state-of-the-art performance on the widely used Charades-STA and ActivityNet datasets. Furthermore, our method even achieves a better result than some competitive fully-supervised methods.
References
- Anne Hendricks et al. (2017) Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision. 5803–5812.
- Chen et al. (2018a) Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. 2018a. Temporally grounding natural sentence in video. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 162–171.
- Chen and Jiang (2019) Shaoxiang Chen and Yu-Gang Jiang. 2019. Semantic Proposal for Activity Localization in Videos via Sentence Query. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Chen et al. (2018b) Tianshui Chen, Zhouxia Wang, Guanbin Li, and Liang Lin. 2018b. Recurrent attentional reinforcement learning for multi-label image recognition. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
- Duan et al. (2018) Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, and Junzhou Huang. 2018. Weakly supervised dense event captioning in videos. In Advances in Neural Information Processing Systems. 3059–3069.
- Feng et al. (2018) Yang Feng, Lin Ma, Wei Liu, Tong Zhang, and Jiebo Luo. 2018. Video re-localization. In Proceedings of the European Conference on Computer Vision (ECCV). 51–66.
- Gao et al. (2017) Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. Tall: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision. 5267–5275.
- Gao et al. (2019) Mingfei Gao, Larry S Davis, Richard Socher, and Caiming Xiong. 2019. WSLLN: Weakly Supervised Natural Language Localization Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Ge et al. (2019) Runzhou Ge, Jiyang Gao, Kan Chen, and Ram Nevatia. 2019. MAC: Mining Activity Concepts for Language-based Temporal Localization. In IEEE Winter Conference on Applications of Computer Vision. IEEE, 245–253.
- He et al. (2019) Dongliang He, Xiang Zhao, Jizhou Huang, Fu Li, Xiao Liu, and Shilei Wen. 2019. Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Krishna et al. (2017) Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. 2017. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision. 706–715.
- Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
- Li et al. (2018) Guanbin Li, Yuan Xie, and Liang Lin. 2018. Weakly supervised salient object detection using image labels. In Thirty-second AAAI conference on artificial intelligence.
- Lin et al. (2020) Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, and Huasheng Liu. 2020. Weakly-Supervised Video Moment Retrieval via Semantic Completion Network. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Liu et al. (2018a) Meng Liu, Xiang Wang, Liqiang Nie, Xiangnan He, Baoquan Chen, and Tat-Seng Chua. 2018a. Attentive moment retrieval in videos. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 15–24.
- Liu et al. (2018b) Meng Liu, Xiang Wang, Liqiang Nie, Qi Tian, Baoquan Chen, and Tat-Seng Chua. 2018b. Cross-modal moment localization in videos. In Proceedings of the 26th ACM international conference on Multimedia. 843–851.
- Liu et al. (2019) Xuejing Liu, Liang Li, Shuhui Wang, Zheng-Jun Zha, Li Su, and Qingming Huang. 2019. Knowledge-guided pairwise reconstruction network for weakly supervised referring expression grounding. In Proceedings of the 27th ACM International Conference on Multimedia. 539–547.
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).
- Mithun et al. (2019) Niluthpol Chowdhury Mithun, Sujoy Paul, and Amit K. Roy-Chowdhury. 2019. Weakly supervised video moment retrieval from text queries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 11592–11601.
- Mnih et al. (2016) Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning. 1928–1937.
- Paul et al. (2018) Sujoy Paul, Sourya Roy, and Amit K Roy-Chowdhury. 2018. W-talc: Weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision. 563–579.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing. 1532–1543.
- Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015).
- Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition. 815–823.
- Shi et al. (2019) Yukai Shi, Guanbin Li, Qingxing Cao, Keze Wang, and Liang Lin. 2019. Face hallucination by attentive sequence optimization with reinforcement learning. IEEE transactions on pattern analysis and machine intelligence (2019).
- Sigurdsson et al. (2016) Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. 2016. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision. Springer, 510–526.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15, 1 (2014), 1929–1958.
- Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
- Tran et al. (2015) Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision. 4489–4497.
- Wang et al. (2018) Jiajie Wang, Jiangchao Yao, Ya Zhang, and Rui Zhang. 2018. Collaborative learning for weakly supervised object detection. arXiv preprint arXiv:1802.03531 (2018).
- Wang et al. (2016) Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, and Luc Van Gool. 2016. Temporal segment networks: Towards good practices for deep action recognition. In European conference on computer vision. Springer, 20–36.
- Wang et al. (2019) Weining Wang, Yan Huang, and Liang Wang. 2019. Language-Driven Temporal Activity Localization: A Semantic Matching Reinforcement Learning Model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 334–343.
- Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. In Reinforcement Learning. Springer, 5–32.
- Wu et al. (2020) Jie Wu, Guanbin Li, Si Liu, and Liang Lin. 2020. Tree-Structured Policy based Progressive Reinforcement Learning for Temporally Language Grounding in Video. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Xu et al. (2019) Huijuan Xu, Kun He, Leonid Sigal, Stan Sclaroff, and Kate Saenko. 2019. Multilevel Language and Vision Integration for Text-to-Clip Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 2. 7.
- Yang et al. (2019a) Sibei Yang, Guanbin Li, and Yizhou Yu. 2019a. Cross-modal relationship inference for grounding referring expressions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4145–4154.
- Yang et al. (2019b) Sibei Yang, Guanbin Li, and Yizhou Yu. 2019b. Dynamic graph attention for referring expression comprehension. In Proceedings of the IEEE International Conference on Computer Vision. 4644–4653.
- Yang et al. (2020) Sibei Yang, Guanbin Li, and Yizhou Yu. 2020. Graph-Structured Referring Expression Reasoning in The Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9952–9961.
- Yeung et al. (2016) Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei. 2016. End-to-end learning of action detection from frame glimpses in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2678–2687.
- Yu et al. (2019) Tong Yu, Yilin Shen, Ruiyi Zhang, Xiangyu Zeng, and Hongxia Jin. 2019. Vision-language recommendation via attribute augmented multimodal reinforcement learning. In Proceedings of the 27th ACM International Conference on Multimedia. 39–47.
- Yuan et al. (2019) Yitian Yuan, Tao Mei, and Wenwu Zhu. 2019. To find where you talk: Temporal sentence localization in video with attention based location regression. In Proceedings of the AAAI Conference on Artificial Intelligence.
- Zhang et al. (2019a) Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S Davis. 2019a. Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1247–1257.
- Zhang et al. (2019b) Songyang Zhang, Jinsong Su, and Jiebo Luo. 2019b. Exploiting Temporal Relationships in Video Moment Localization with Natural Language. In Proceedings of the 27th ACM International Conference on Multimedia. 1230–1238.