Weak Supervision and Referring Attention for Temporal-Textual Association Learning

Zhiyuan Fang¹¹¹footnotemark: 1 Shu Kong²¹¹footnotemark: 1 Zhe Wang³¹¹footnotemark: 1
Charless Fowlkes⁴ Yezhou Yang¹
¹Arizona State University ²Carnegie Mellon University ³Beihang University ⁴University of California, Irvine

Abstract

A system capturing the association between video frames and textual queries offer great potential for better video analysis. However, training such a system in a fully supervised way inevitably demands a meticulously curated video dataset with temporal-textual annotations. Therefore we provide a Weak-Supervised alternative with our proposed Referring Attention mechanism to learn temporal-textual association (dubbed WSRA). The weak supervision is simply a textual expression (e.g., short phrases or sentences) at video level, indicating this video contains relevant frames. The referring attention is our designed mechanism acting as a scoring function for grounding the given queries over frames temporally. It consists of multiple novel losses and sampling strategies for better training. The principle in our designed mechanism is to fully exploit 1) the weak supervision by considering informative and discriminative cues from intra-video segments anchored with the textual query, 2) multiple queries compared to the single video, and 3) cross-video visual similarities. We validate our WSRA through extensive experiments for temporally grounding by languages, demonstrating that it outperforms the state-of-the-art weakly-supervised methods notably.

Figure 1: One practical application of the weakly supervised temporal textual association learning can be video moment retrieval (e.g., in surveillance video) using natural language query, which requires the video-level caption to align with the video segment temporally without any annotation.

1 Introduction

Videos contain much richer information for humans to interpret the world. Building an intelligent system to automate video analysis could yield a wide range of applications benefiting human society at large, from assistive robots for the elderly to video surveillance for security [19, 63, 34, 7]. A significant component in such a system is to capture the association between video frames and textual reference/queries, a.k.a temporal-textual association [10, 1, 26, 14]. Therefore, learning temporal-textual association over videos becomes a promising direction in the community [1, 51, 28].

One way to obtain such a model for temporal-textual association learning is to fully supervise the training over a video dataset which has meticulous annotations in term of the associated frames with some textual queries [4, 1, 10, 66, 14]. However, we note that such a practice inevitably demands a large-scale dataset, which apparently is not only prohibitively expensive to collect, but also largely limited in terms of diversity of both videos and textual expressions. As an alternative, a few recent methods propose to learn the temporal-textual association only with weak annotations, i.e., video-level expressions in the form of natural language description [30, 23].

Though the weakly-supervised temporal-textual association learning attracted increasing attention until just recently, there exists a great number of works on related topics, such as weakly supervised action localization in videos [50, 36, 29, 32]. However, compared to action localization which only has a limited number of action categories, grounding textual reference is more challenging since the textual expressions could be more free-form with multiple words for the same meaning and flexible sentence structures. In another word, using the natural-language descriptions greatly enlarges the content and diversity of visual-expression searching; the combinatorial nature of open-form languages also makes it infeasible to enumerate all possible expressions towards the same indication. For instance, instead of just localizing the video frames with categorical labels like “kissing” or “person”, a more practical and user-friendly query might be “the moment when the new couple are kissing in the wedding” (for moment retrieval) or “a man in yellow shirt appearing in the hall in last night” (for video surveillance). This necessitates the study of temporal grounding using natural language descriptions. To foster the study, Gao et al. augment the Charades dataset [48] by generating complex language queries with temporal boundary annotations [10]; Anne Hendricks et al. collect the DiDeMo dataset with manual annotations on the associated video frames and natural-language descriptions [1]. Both datasets for the first place enable training for language moment retrieval or temporal grounding.

In this paper, basing on the datasets available in the literature, we study learning temporal-textual association with weak supervisions, e.g., video-level language expressions as shown in Fig. 1. Until very recently, few works propose to weakly supervised train for temporal-textual association with language expressions. In particular, Mithun et al. present the very first attempt for weakly supervised training to localize video segment over the given textual queries [30]. In general, weakly supervised methods of temporal-textual association learning are facing two major challenges: 1) the lack of precise supervision aligning video segments and textual queries, 2) highly undecidable features for the complex and open-form languages.

To overcome these challenges, we propose a weakly-supervised framework with a referring attention mechanism (WSRA) for learning temporal-textual associations on videos. The proposed referring attention mechanism summarizes a series of our novel components. The first one learns through a background modelling method to pool out irrelevant frames specific to the given language query. Building upon it, the second component encourages foreground features to align with the query by discriminating itself from the background features from the first component. Within this component, we present a hard negative mining method integrated to sample irrelevant textual descriptions during learning. The third component exploits inter-video (dis-)similarity based on the multiple textual queries. Specifically, this forces the visual features to be close to each other from different videos, as long as the queries convey similar meanings measured by the similarity of textual features.

To summarize our contributions: 1) We propose a unified framework (WSRA) for weakly supervised learning temporal-textual associations with the referring attention mechanism, directly applicable to moment retrieval and language grounding in videos. 2) We show with rigorous ablation study that the proposed components in the referring attention leverages better informative cues from the limited weak supervision. 3) We justify the WSRA through extensive experiments, notably outperforming other state-of-the-art weakly-supervised methods on these tasks on two public benchmarks, DiDeMo [1] and Charades-STA [10].

2 Related Work

Association Learning across Vision and Language is core and tie of a wide range of tasks across vision and language domains, e.g., textual grounding [38], referring expression comprehension [31] or object retrieval using language [16]. Recent works focus on leveraging the image-level annotations (as weak supervision) [8, 9] or unsupervised method [64] to learn the association across language descriptions and objects. Proceeding from this, there arise works on using uncurated captions to learn temporal associations across video segments and texts [27, 51]. Notably, these works all highlight on the importance of constructing contrastive pairs in exploiting the weak annotations and inspire our work.

Weakly-Supervised Action Localization can be thought of a specific example of weakly supervised video learning with the video-level labels [52, 43, 32, 36]. This problem derives from the fully-supervised counterpart methods which exploits fine annotations at frame level for localizing the actions [2, 18, 58, 42, 53, 44]. Recent weakly-supervised methods extensively adopt either the video-level classification framework [50, 43, 49, 6] or with the attentional mechanism that generates bottom-up sparse weights used for localizing action categories temporally [32]. Beyond that, few works [55, 36] propose to utilize the multi-instance learning loss to address this challenge, [36] also suggests exploiting the co-activity across videos in the metric learning, which largely improves the action localization task even when temporal annotations are not available. Most recent work [33] improves over these methods with background modelling module that explicitly extract the foreground and background appearances.

Temporal Grounding and Moments Retrieval are two instantiations of learning temporal-textual association. In these tasks, most recent methods adopt fully-supervised training over fine annotations on the frame-textual associations. For example, Gao et al. augment the Charades dataset [48] by generating complex language queries with temporal boundary annotations for language moment retrieval [10]; Anne Hendricks et al. also collect a new dataset for training to localize video moments over a given descriptive sentence [1]. Other follow-up methods [15, 24, 4, 54, 62, 13] also fully-supervised train for temporal grounding using these datasets, suffering from the limitation on their generalizability due to the combinatorial nature of complex natural language sentences (e.g., synonymous words, grammatical tense and sentence structures) [15, 10].

Weakly-Supervised Video Grounding by Language further advances the aforementioned to an even challenging task, as now the only available supervision is video-level natural language descriptions in open format, which come with huge amount of unnecessary noises. In particular, Mithun et al. [30] for the first time attempt to solve the temporal grounding problem by weakly supervised learning with only the video-level textual queries [30]. In [5], the author proposed to utilize the temporal proposal and textual description alignment learning using the sliding window fashion, and tackle the grounding problem in a coarse-to-fine manner. Similarly, a very recent work by [23] also focuses on the design of a better proposal generation module, that aggregates the contextual visual cues to generate and score the proposal candidates for grounding. We summarize that current efforts in weakly supervised language grounding are either centered on better proposal generation [5, 23], or a better cross-modal association model by constructing contrastive samples across video and languages [30, 11]. Our WSRA places emphasis on the latter, but nevertheless further distinguishes the above with a more comprehensive cross-modal association learning objectiveness, and a novel sampling and weighting strategy in our metric learning step.

3 Temporal-Textual Association Learning

The well learned temporal-textual associations on videos can allow for practical tasks like moment retrieval and temporal grounding of natural-language descriptions, where the core to it is to learn a joint embedding space for both the frames and the textual queries. In this embedding space, the associated frames and the textual queries should be close to each other represented by the visual feature and language feature, respectively. Formally, given a long and untrimmed video ${\cal V}=[\boldsymbol{v}_{1},...,\boldsymbol{v}_{T}]$ consisting of $T$ snippets of frames, and an open-form textual sentence $\boldsymbol{t}$ , we would like to train a model to localize a video segment $\boldsymbol{v}_{t}$ from ${\cal V}$ that best corresponds to the description, which is ideally the true (yet unknown) video segment for the textual query. We denote by $\boldsymbol{v}_{t}\in\mathbb{R}^{d}$ the feature vector extracted from a video model (e.g., a pretrained classification network), and by $\boldsymbol{t}\in\mathbb{R}^{d}$ the textual feature representation of the query (short phrase or a sentence) from a language model. In practice, segment features can also be the averagely pooled proposal visual features, where each proposal contains several continuous segments with various lengths. As weak supervision causes ambiguities in predicting the association between frames and the textual query, we present our weakly-supervised approach with the proposed referring attention mechanism (WSRA) in this section. Fig. 2 shows the overall architecture of our model.

Refer to caption — Figure 2: Flowchart of the proposed WSRA. The losses are only used during training for temporal-textual association learning, while inference shares the same computation flow for grounding textual queries over the given video. “CNN” and “LM” denote video model and language model to extract features for frames and the textual query. respectively. $\Phi$ denotes the cross-modal scoring function, and the above shown two functions are with independent learning parameters.

3.1 Video-level Attention Modeling

Prior weakly supervised methods for action localization propose to generate a weight vector to localize action labels among the video snippets in a bottom-up manner [36, 33]. While being successful in action localization from predefined categorical labels, these bottom-up methods cannot be directly tailored to the more natural language localization problem in question, as a video can correspond to multiple textual queries which appear in a very open-form language structure. Based on the fact that, the description of textual query usually covers multiple snippets in a video, we propose to model the video-level back/fore-ground features that are (ir)related to the textual query with the designed referring weights:

\begin{split}\boldsymbol{v}_{f}=&\sum^{T}_{t=1}\alpha_{t}\boldsymbol{v}_{t}\\ \boldsymbol{v}_{b}=\frac{1}{T-1}&\sum^{T}_{t=1}(1-\alpha_{t})\boldsymbol{v}_{t},\end{split}

(1)

where $v_{f}$ and $v_{b}$ are the synthesized fore/back-ground features, and weight $\alpha_{t}$ is calculated as:

\alpha_{t}=\frac{\text{exp}(\phi(\boldsymbol{v}_{t},\boldsymbol{t}))}{{\sum_{i=1}^{T}\text{exp}(\phi(\boldsymbol{v}_{i},\boldsymbol{t}))}}.

(2)

In Eq. 2, $\phi(\cdot,\cdot)$ is the cross-modal scoring function that measuring the distance between a frame/snippet and the textual query in the embedding space. In practice, we define $\phi(\boldsymbol{v},\boldsymbol{t})=\text{Sigmoid}(\text{FC}(\text{Cat}(\boldsymbol{v},\boldsymbol{t}))$ with learnable parameters, which has a stronger association ability than cosine similarity [65], bilinear pooling [12, 8, 21], and second-order polynomial feature fusion [37, 10]. It is worth noting that the scoring function not only conveys the idea of metric learning, but also lands for embedding learning along with further loss terms as presented in next subsections.

As we would like to learn an embedding space in which, 1) foreground visual features tightly correspond to the given textual query measured by the scoring function performing in the same space; and 2) background features are clearly far away from both the foreground and textual query, as illustrated in Fig. 3 (a). To this end, refer to the Triplet Loss [40], we set $(\phi(\boldsymbol{v}_{b},\boldsymbol{t})-\phi(\boldsymbol{v}_{f},\boldsymbol{t}))>m$ as our optimization target for video-level metric learning. Rather than simply leveraging the margin or triplet-loss based contrastive learning objectiveness, we refer to the recently proposed general pair weighting framework that origins from deep metric learning [57, 56], which endows our learning objectiveness with the ability of gradient weighting using the logistic-loss as the basic form function. Comparing to the previous works as in [30, 5], WSRA adaptively assigns proper weights to valuable learning pairs thus benefiting the training. We denote fore/back-ground similarity scores as $s_{p}^{i}=\phi(\boldsymbol{v}_{f}^{i},\boldsymbol{t}^{i})$ and $s_{n}^{i}=\phi(\boldsymbol{v}_{b}^{i},\boldsymbol{t}^{i})$ , respectively. Our video-level loss function is calculated by:

\mathcal{L}_{video}=\log\Big{[}1+\sum_{i=1}^{N}\exp(\tau(s_{n}^{i}-s_{p}^{i}+m))\Big{]},

(3)

in which $i$ indices the sample in a random mini-batch, $m$ is a predefined margin and $\tau$ is a temperature factor.

3.2 Snippet-level Attention Modeling

While the above video-level loss imposed on a whole video encourages learning a discriminative embedding space and the metric functions, we introduce snippet-level modeling to enhance the contrastive study of individual video snippet and multiple textual queries, as illustrated in Fig. 3 (b). Under this case, for the $t$ -th snippet in the $j$ -th video $\boldsymbol{v}_{t}^{j}$ , we calculate the referring weight $\beta_{t}^{i}$ as:

\beta_{t}^{j}=\frac{\text{exp}(\phi(\boldsymbol{v}_{t}^{j},\boldsymbol{t}^{j}))}{{\sum_{i=1}^{N}\text{exp}(\phi(\boldsymbol{v}_{t}^{j},\boldsymbol{t}^{i}))}}.

(4)

Since different textual queries don’t have such continuity among different snippets, instead of synthesizing textual features, we use this referring weight as penalty on the optimization target. Denoting $\boldsymbol{t}^{i}$ as the textual query for the $i$ -th video and $\boldsymbol{t}^{j}$ for the $j$ -th video, we define out penalized optimization target as $(\beta_{t}^{j}\phi(\boldsymbol{v}_{t}^{j},\boldsymbol{t}^{i})-\beta_{t}^{i}\phi(\boldsymbol{v}_{t}^{j},\boldsymbol{t}^{j}))>m$ . This target is more flexible during training because those less-optimized scores will have larger weighting factors and consequentially get larger gradient. Our snippet-level loss function for each video is derived as follows:

\mathcal{L}_{snippet}=\log\Big{[}1+\sum^{T}_{t=1}\exp(\tau(\beta_{t}^{+}s_{n}^{t}-\beta_{t}^{-}s_{p}^{t}+m))\Big{]},

(5)

where $s_{p}^{t}=\phi(\boldsymbol{v}^{t},\boldsymbol{t}^{+})$ and $s_{n}^{i}=\phi(\boldsymbol{v}^{t},\boldsymbol{t}^{-})$ are the similarity scores of snippet feature $\boldsymbol{v}^{t}$ with $\boldsymbol{t}^{+}$ (the ground-truth referring text) and $\boldsymbol{t}^{-}$ (the sampled negative text), respectively.

As noted, sampling semantically similar queries affect learning [59], but when the batch size grows with limited number of textual expressions, a single training batch may contain semantically similar queries more easily, e.g., “a man having food” and “the man eating dinner”. As a result, simply treating all other textual queries within the mini-batch as the negative samples hurt discriminative learning. We provide our solution as below on using the similarity $(\boldsymbol{t}^{i},\boldsymbol{t}^{j})$ to differentiate individual instances (queries within a single training batch), telling whether a pair of them are semantically dissimilar enough to support a sampled negative query.

3.2.1 Top/last- $K$ Sampling.

Inspired by the hard negative sampling techniques in metric learning [45, 22, 25, 67], we propose a top/last- $K$ sampling strategy akin to semi-hard negative sampling [41]. We identify queries as “pseudo-positive samples” and “easy-negative samples”, as demonstrated by Fig. 4, then we select the “hard-negative samples”, which provide more informative gradients that help learning. We note “easy-negative samples” do not contribute to better training while being included do not hurt either yet waste wall-clock time in computation. To identify the “hard-negative samples”, we first sort all the queries in the mini-batch based on the similarity scores compared to the one given for the video of interest. Then we simply remove the top and last $K$ samples as easy positive and negative ones. While the hyper-parameter $K$ depends on the batch size, we set $K=3$ in our experiments and find it quite stable and beneficial to the training.

3.3 Batch-level Attention Modeling

Proceeding from previous efforts in un/weak-supervised video representation learning, we are inspired with the necessity to construct contrastive learning pairs across video samples. [36] proposed to encourage the videos containing identical actions to be encoded as similar features in their corresponded temporal regions. [6] learns the frame-wise correspondence across videos and yields great representations without any strong supervision. These works all unanimously highlight that mining the common-information among samples could enormously benefit the discriminative learning. This practice also applies for our video and language tasks, where the similar visual content can be observed in different videos. For instance, the excluded pseudo-positive samples from our snippet-level learning might be “a gentleman is having dinner”, whose visual foreground should be similar with the scenario of “man eating food in a restaurant”.

Concretely, we assume that for each target video $\boldsymbol{v}^{i}$ , it contains implicit common activities with the video $\boldsymbol{v}^{+}$ that related to its textual pseudo-positive sample. Therefore, we exploit a batch-level attention mechanism as well as an inter-video loss to further improve the discriminative learning by utilizing the mined pseudo-positive samples as stated previously. Following the setting in Sec. 3.2.1, we define the attention weight as:

\gamma^{i}=\frac{\text{exp}(\cos(\boldsymbol{t}^{i},\boldsymbol{t}^{+}))}{{\sum_{j=1}^{N}\text{exp}(\cos(\boldsymbol{t}^{i},\boldsymbol{t}^{j}))}}.

(6)

where $\boldsymbol{t}^{+}$ is the pseudo-positive sample for the textual query $\boldsymbol{t}^{i}$ that corresponding to the $i$ -th video. Here we employ the synthesized fore/back-ground features in Eq. 1. We use the form of penalized optimization target as $\gamma^{i}(\cos(\boldsymbol{v}_{b}^{i},\boldsymbol{v}_{f}^{+})-\cos(\boldsymbol{v}_{f}^{i},\boldsymbol{v}_{f}^{+}))>m$ . Our learning objective to be making their foreground representations closer and enlarging the divergence of foreground with background as is shown in Fig. 4. The overall batch-level loss is expressed as:

\mathcal{L}_{batch}=\log\Big{[}1+\sum_{i=1}^{N}\exp(\tau\gamma^{i}(s_{n}^{i}-s_{p}^{i}+m))\Big{]}.

(7)

Since the comparing targets are from the same modality, we simply use cosine similarity in the calculation, that $s_{p}^{i}=\cos(\boldsymbol{v}_{f}^{i},\boldsymbol{v}_{f}^{+})$ , and $s_{n}^{i}=\cos(\boldsymbol{v}_{b}^{i},\boldsymbol{v}_{f}^{+})$ .

3.4 Overall Training Loss

As for the overall objective loss, we combine all the above loss terms for end-to-end training our model:

\mathcal{L}=\alpha\mathcal{L}_{video}+\beta\sum^{T}\mathcal{L}_{snippet}+\delta\mathcal{L}_{batch}\\

(8)

where $\alpha,\beta$ , and $\delta$ are hyper-parameters weighting the loss terms, which are set to 0.1, 1, 0.1, respectively. We list details on the effect of these terms in the ablation study and further study the effect of various loss weights in the supplementary materials. Due to the limited computational resources, we empirically set temperature weight as constant 1 and did not conduct exhaustive searching for the optimal temperature weight per loss. However, various values slightly affect the performances. During training, we use the Adam optimizer with constant learning rate 1 $\times 10^{-4}$ and coefficients 0.9 and 0.999 for computing running averages of gradient and its square. We implement our algorithm using PyTorch toolbox [35] on single GTX1080 Ti GPU.

4 Experiment

The goal of our experiments is to validate the effectiveness of the proposed weakly-supervised framework with the top-down referring attention (WSRA) for temporal-textual association learning, through two tasks of moment localization and language grounding on the DiDeMo [15] and Charades-STA [10], respectively. We use the mean Average Precision (mAP) under various Intersection over Union (IoU) thresholds to measure the performance. First, we elaborate on some important details about the models and language features. We then compare our WSRA with other state-of-the-art methods with systematical ablation studies on the two tasks over the two benchmarks. Finally, we visualize the qualitative results produced by our WSRA.

Language Processing and Feature Extraction. The proposed WSRA framework is agnostic to the choice of language models. Although one is free to use any language models providing features to represent the textual queries (e.g., natural-language sentence or short phrase), we turn to the Openai-GPT2 [39], which is a released language model¹¹1https://github.com/openai/gpt-2 trained over large-scale, diverse corpus (Wikipedia, news and books). GPT2 is composed of a stacking of repetitive transformer modules, and we retrieve our textual features by averaging all outputs from modules as done in literature [60].

4.1 Moment Localization on DiDeMo

For localizing moments over a given natural-language description, we use the Distinct Describable Moments (DiDeMo) dataset [15], which includes $>$ 10k 25-30 second long Flickr videos. Manual annotations contain $>$ 40k sentences with temporal boundaries for fine localization. In total, the DiDeMo dataset contains 8,395, 1,065 and 1,004 videos for training, validation and testing respectively. We report the performances of our model on the testing split and conduct ablation studies on the validation split. As suggested by the dataset, to simplify association between the sentence and a video, each long video is divided into 6 segments, which are annotated as whether corresponding to the sentence for localization. In DeDeMo, each textual description is associated with only one continuous video moment, thus yielding 21 possible candidates ( $\sum_{i=1}^{6}i$ ). We measure the performance using the Rank@1 (R@1) and Rank@5 (R@5) (accuracy of the top-1/5 retrieved candidates) and their mean Intersection of Unions (mIoU) when IoU=1. To extract visual features of the videos, we use the official provided features over both RGB and optical flow [1]. For fair comparisons with other methods, as done in [15, 1], we compute the visual features as the concatenation of the global averagely pooled features (all 21 proposal features) and the local proposal features, which are produced by after average pooling the features from each segment. We set hyper-parameters $\beta=1$ and $\alpha=\delta=0.1$ in this experiment and report results under different combinations in our supplementary materials. During inference, we select the top-5 proposals with highest attention weights as the final prediction.

$\mathcal{L}_{video}$	$\mathcal{L}_{snippet}$	$\mathcal{L}_{batch}$	DiDeMo (Val)			Charades-STA
$\mathcal{L}_{video}$	$\mathcal{L}_{snippet}$	$\mathcal{L}_{batch}$	R@1	R@5	mIoU	R@1	R@5	mIoU
✓	–	–	10.10	36.05	18.78	3.42	18.56	16.85
–	✓	–	14.68	45.72	26.04	4.84	23.65	22.16
✓	✓	–	16.20	49.56	27.50	8.82	26.65	26.42
✓	✓	✓	16.92	50.12	28.32	11.01	39.02	31.00

Table 1: Ablation study of loss terms on the validation split of DiDeMo and Charades-STA when IoU is 1 and 0.7 respectively. As the

\mathcal{L}_{batch}

does not explicitly incorporate the alignment learning, we study it as the complementing term w.r.t. the first two terms.

Supervision	Method	Feature	R@1	R@5	mIoU
	Upper Bound	–	74.75	100	69.05
	Chance	–	3.75	22.5	22.64
Fully Supervised	CCA [1]	Flow&RGB	18.11	52.11	37.82
	Lang. Obj. Retr. [17]	Flow	16.20	43.94	27.18
	LSTM-RGB-local [1]	RGB	13.10	44.82	25.13
	LSTM-Flow-local [1]	Flow	18.35	56.25	31.46
	MCN [1]	Flow&RGB	28.10	78.21	41.08
	TGN [4]	Flow&RGB	28.23	79.26	42.97
Weakly Supervised	TGA [30]	Flow&RGB	12.19	39.74	24.92
	WSRA	RGB	14.20	43.67	25.22
	WSRA^∗	Flow	17.23	48.84	27.42
	WSRA	Flow	17.88	50.04	29.90
	WSRA	Flow&RGB	17.52	52.11	28.87

Table 2: Comparison of performances with fully/weakly-supervised methods on DiDeMo test split. Our WSRA model outperforms the state-of-the-art weakly-supervised method (TGA) [30]. We report performances of models using regular word vectors (marked as WSRA^∗). “Chance” denotes the results of random guess.

	Approach	IoU=0.3		IoU=0.5		IoU=0.7
	Approach	R@1	R@5	R@1	R@5	R@1	R@5	mIoU
Fully Supervised	Random [10]	–	–	08.51	37.12	03.03	14.06	–
	VSA-STV [20]	–	–	10.50	48.43	04.32	20.21	–
	CTRL [10]	–	–	21.42	59.11	07.15	26.91	–
	Xu et al. [61]	–	–	35.60	79.40	15.80	45.40	–
	MAN [66]	–	–	46.53	86.23	22.72	53.72	–
Weakly Supervised	TGA [30]	32.14	89.56	19.94	65.52	08.84	33.51	–
	SCN [23]	42.96	95.56	23.58	71.80	09.97	38.87	–
	CTF^∗ [5]	39.80	-	27.30	-	12.90	-	–
	WSRA (Ours)	50.13	86.75	31.20	70.50	11.01	39.02	31.00

Table 3: Language grounding results on the test set of Charades-STA under different intersection over unions. ^∗ denotes the work under peer reviewing.

To understand how each loss term contributes to the performance, we conduct a systematical ablation study in Table 1. The results clearly demonstrate that all the loss terms improve the performance individually and are complementary with each other, and by combining them all achieves the best performance. Since $L_{batch}$ supervises only the visual features without any alignment, thus can not be compared independently. From this table, it is worth noting that using the fore/background loss only performs on par with TGA [30] (see Table 2), which is the weakly-supervised method that adopts similar design of fore/back-ground modelling. Our final WSRA outperforms TGA [30] by a clear gain, demonstrating that the loss terms exploits the weak supervision more effectively in learning a better discriminative model. Fig. 6 studies the effect of our top/last- $K$ sampling strategy with various $K$ . We clearly see that, when excluding top $K=3$ and last $K=2$ samples in the mini-batch, it yields the best performance, with the batch size set to 42 in this experiment.

We compare our WSRA with other advanced weakly/fully-supervised methods in Table. 2. As the previous methods, we experiment with visual features from modals of RGB and optical flow. Upper-bound of DiDeMo is brought since the human annotators cannot achieve 100% agreement in annotating the segment boundaries w.r.t the given video-level sentence [1]. Comparing with the baseline models, our WSRA model significantly outperforms weakly supervised TGA [30] by 5.5%, 11% at R@1 and R@5 respectively, even achieving comparable performance with several fully supervised methods e.g., CCA [1] and LSTM based model [1]. It is also worth noting that, our WSRA model has similar number of parameters with all above mentioned baseline models.

4.2 Language Grounding on Charades-STA

We further validate our WSRA for the language grounding task on another video dataset, Charades-STA [10], which augments the Charades dataset [47, 46] with manual annotations in the form of natural language descriptions at precise start-end timestamp of each video. Charades-STA contains 12,408/3,720 video-query pairs for training/testing. For annotation, Charades-STA expands each verb to generate a textual sentence from single caption in Charades using language templates, and associate the sentence with frames corresponding to this verb. To encode the video segment, as done by other weakly-supervised methods for action localization [32, 36], we use the I3D feature from a pre-trained model trained over the Kinetics dataset [3]. For inference, we turn to the moment selection methods [10], which generate multi-scale sampled moment candidates using sliding window method for retrieval with fixed length of frames. Moreover, since the video duration varies significantly, we sampling the moment candidates with lengths proportioned to the video duration. Specifically, sampled candidate clips are with $\small\{$ 20%, 30%, 40%, 50% $\small\}$ of the whole videos and in 80% overlap using the sliding window manner, then the moment with highest attention weight score is selected as the final prediction. More details can be found in the supplementary.

We report the performance comparison with different mean Intersection-over Union (mIoU = {0.3, 0.5, 0.7}) and Recall@{1, 5}, and the mIoU as a summary metric. We list detailed comparison in Table 3. As TGA [30] is also constructed among the contrastive learning, we study their performances gap and can see the clear advantage of our referring attention: WSRA significantly outperforms TGA, for example by 18% at [R@1, IoU=0.3], and 12% at [R@1, IoU=0.5], respectively. We can also see clearly from the above table that, comparing with other state-of-the-art weak-supervised methods, WSRA demonstrates clear leading at R@1 when IoU=0.3 and 0.5. Also, WSRA demonstrates either comparable or even better performance than most fully-supervised methods. In the meanwhile, our top/last- $K$ sampling also rejects theses pseudo-positive and less informative samples in the contrastive learning. Fig. 7 shows the qualitative examples of grounding in Charades-STA dataset where the moment is aligned with language query even facing much longer videos. We further study the effect of cross-video loss in the supplementary.

5 Conclusion and Broader Impact

In this paper, we propose the weakly supervised model with the referring attention mechanism (WSRA) for learning temporal-textual association on videos. We introduce several novel loss terms and sampling strategies, all of which help better learning by fully exploiting the cues from the weak supervision at video level. Through extensive experiments on two benchmarks, we show the WSRA model outperforms the state-of-the-art weakly-supervised methods by a notable gain, achieving on par or even better performance than some fully-supervised methods. As an outlook for the future study, we consider that the most potential aspect our model would benefit comes from the video-language representation learning at scale [28], whereas the training is often severely accompanied by uncurated annotations: e.g., temporally misaligned descriptions [27]. To construct a soundly and largely pre-trained model, it is requisite to properly leverage the weak or biased annotation at comprehensive views. Our WSRA provides us with such a perspective as the trailblazer, that investigates thoroughly how language can be fully exploited as valid supervisions even without temporal annotations.

References

[1] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE International Conference on Computer Vision, pages 5803–5812, 2017.
[2] Shyamal Buch, Victor Escorcia, Chuanqi Shen, Bernard Ghanem, and Juan Carlos Niebles. Sst: Single-stream temporal action proposals. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2911–2920, 2017.
[3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[4] Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, and Tat-Seng Chua. Temporally grounding natural sentence in video. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 162–171, 2018.
[5] Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, and Kwan-Yee K Wong. Look closer to ground better: Weakly-supervised temporal grounding of sentence in video. arXiv preprint arXiv:2001.09308, 2020.
[6] Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, and Andrew Zisserman. Temporal cycle-consistency learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1801–1810, 2019.
[7] Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. Video2commonsense: Generating commonsense descriptions to enrich video captioning. arXiv preprint arXiv:2003.05162, 2020.
[8] Zhiyuan Fang, Shu Kong, Charless Fowlkes, and Yezhou Yang. Modularized textual grounding for counterfactual resilience. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2019.
[9] Zhiyuan Fang, Shu Kong, Tianshu Yu, and Yezhou Yang. Weakly supervised attention learning for textual phrases grounding. arXiv preprint arXiv:1805.00545, 2018.
[10] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017.
[11] Mingfei Gao, Larry S Davis, Richard Socher, and Caiming Xiong. Wslln: Weakly supervised natural language localization networks. arXiv preprint arXiv:1909.00239, 2019.
[12] Yang Gao, Oscar Beijbom, Ning Zhang, and Trevor Darrell. Compact bilinear pooling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 317–326, 2016.
[13] Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees GM Snoek. Actor and action video segmentation from a sentence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5958–5966, 2018.
[14] et al Ge, Runzhou. Mac: Mining activity concepts for language-based temporal localization. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.
[15] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with temporal language. In Proceedings of Conference on Empirical Methods in Natural Language Processing, 2018.
[16] Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. Natural language object retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4555–4564, 2016.
[17] Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, and Trevor Darrell. Natural language object retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[18] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 4405–4413, 2017.
[19] Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, and Zeynep Akata. Textual explanations for self-driving vehicles. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–578, 2018.
[20] Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-thought vectors. In Advances in neural information processing systems, pages 3294–3302, 2015.
[21] Shu Kong and Charless Fowlkes. Low-rank bilinear pooling for fine-grained classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 365–374, 2017.
[22] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[23] Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, and Huasheng Liu. Weakly-supervised video moment retrieval via semantic completion network. arXiv preprint arXiv:1911.08199, 2019.
[24] Meng Liu, Xiang Wang, Liqiang Nie, Xiangnan He, Baoquan Chen, and Tat-Seng Chua. Attentive moment retrieval in videos. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 15–24. ACM, 2018.
[25] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
[26] Wang X. Nie L. Tian Q. Chen B. & Chua T. S. Liu, M. Cross-modal moment localization in videos. In In 2018 ACM Multimedia Conference on Multimedia Conference, pages 843–851, 2018.
[27] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos. arXiv preprint arXiv:1912.06430, 2019.
[28] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE International Conference on Computer Vision, pages 2630–2640, 2019.
[29] Ashish Mishra, Vinay Kumar Verma, M Shiva Krishna Reddy, S Arulkumar, Piyush Rai, and Anurag Mittal. A generative approach to zero-shot and few-shot action recognition. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 372–380. IEEE, 2018.
[30] Niluthpol Chowdhury Mithun, Sujoy Paul, and Amit K. Roy-Chowdhury. Weakly supervised video moment retrieval from text queries. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[31] Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Modeling context between objects for referring expression understanding. In European Conference on Computer Vision, pages 792–807. Springer, 2016.
[32] Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. Weakly supervised action localization by sparse temporal pooling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6752–6761, 2018.
[33] Phuc Xuan Nguyen, Deva Ramanan, and Charless C Fowlkes. Weakly-supervised action localization with background modeling. In Proceedings of the IEEE International Conference on Computer Vision, pages 5502–5511, 2019.
[34] Sangmin Oh, Anthony Hoogs, Amitha Perera, Naresh Cuntoor, Chia-Chih Chen, Jong Taek Lee, Saurajit Mukherjee, JK Aggarwal, Hyungtae Lee, Larry Davis, et al. A large-scale benchmark dataset for event recognition in surveillance video. In CVPR 2011, pages 3153–3160. IEEE, 2011.
[35] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[36] Sujoy Paul, Sourya Roy, and Amit K Roy-Chowdhury. W-talc: Weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–579, 2018.
[37] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[38] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
[39] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
[40] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823, 2015.
[41] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
[42] Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5734–5743, 2017.
[43] Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, and Shih-Fu Chang. Autoloc: Weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 154–171, 2018.
[44] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1049–1058, 2016.
[45] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 761–769, 2016.
[46] Gunnar A Sigurdsson, Santosh Divvala, Ali Farhadi, and Abhinav Gupta. Asynchronous temporal fields for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 585–594, 2017.
[47] Gunnar A. Sigurdsson, Abhinav Gupta, Cordelia Schmid, Ali Farhadi, and Karteek Alahari. Actor and observer: Joint modeling of first and third-person videos. In CVPR, 2018.
[48] Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pages 510–526. Springer, 2016.
[49] Karan Sikka, Abhinav Dhall, and Marian Stewart Bartlett. Classification and weakly supervised pain localization using multiple segment representation. Image and vision computing, 32(10):659–670, 2014.
[50] Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In 2017 IEEE international conference on computer vision (ICCV), pages 3544–3553. IEEE, 2017.
[51] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 7464–7473, 2019.
[52] Chen Sun, Sanketh Shetty, Rahul Sukthankar, and Ram Nevatia. Temporal localization of fine-grained actions in videos by domain transfer from web images. In Proceedings of the 23rd ACM international conference on Multimedia, pages 371–380. ACM, 2015.
[53] Du Tran and Junsong Yuan. Max-margin structured output regression for spatio-temporal action localization. In Advances in neural information processing systems, pages 350–358, 2012.
[54] Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, and Yong Xu. Bidirectional attentive fusion with context gating for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7190–7198, 2018.
[55] Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool. Untrimmednets for weakly supervised action recognition and detection. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 4325–4334, 2017.
[56] Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R Scott. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5022–5030, 2019.
[57] Zhe Wang, Zhiyuan Fang, Jun Wang, and Yezhou Yang. Vitaa: Visual-textual attributes alignment in person search by natural language. arXiv preprint arXiv:2005.07327, 2020.
[58] Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Learning to track for spatio-temporal action localization. In Proceedings of the IEEE international conference on computer vision, pages 3164–3172, 2015.
[59] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
[60] Han Xiao. bert-as-service. https://github.com/hanxiao/bert-as-service, 2018.
[61] Huijuan Xu, Kun He, L Sigal, S Sclaroff, and K Saenko. Multilevel language and vision integration for text-to-clip retrieval. In AAAI, volume 2, 2019.
[62] Huijuan Xu, Boyang Li, Vasili Ramanishka, Leonid Sigal, and Kate Saenko. Joint event detection and description in continuous video streams. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 396–405. IEEE, 2019.
[63] Yezhou Yang, Yi Li, Cornelia Fermuller, and Yiannis Aloimonos. Robot learning manipulation action plans by” watching” unconstrained videos from the world wide web. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[64] Raymond A Yeh, Minh N Do, and Alexander G Schwing. Unsupervised textual grounding: Linking words to image concepts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6125–6134, 2018.
[65] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. Mattnet: Modular attention network for referring expression comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1307–1315, 2018.
[66] Da Zhang, Xiyang Dai, Xin Wang, Yuan-Fang Wang, and Larry S Davis. Man: Moment alignment network for natural language moment retrieval via iterative graph adjustment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1247–1257, 2019.
[67] Xiao Zhang, Zhiyuan Fang, Yandong Wen, Zhifeng Li, and Yu Qiao. Range loss for deep face recognition with long-tailed training data. In Proceedings of the IEEE International Conference on Computer Vision, pages 5409–5418, 2017.