
ASM-Loc: Action-aware Segment Modeling for
Weakly-Supervised Temporal Action Localization

Bo He1, Xitong Yang1, Le Kang2, Zhiyu Cheng2, Xin Zhou2, Abhinav Shrivastava1
1University of Maryland, College Park   2Baidu Research, USA
{bohe,xyang35,abhinav}@cs.umd.edu, {kangle01,zhiyucheng,zhouxin16}@baidu.com
Abstract

Weakly-supervised temporal action localization (WTAL) aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training. Without the boundary information of action segments, existing methods mostly rely on multiple instance learning (MIL), where the predictions of unlabeled instances (i.e., video snippets) are supervised by classifying labeled bags (i.e., untrimmed videos). However, this formulation typically treats snippets in a video as independent instances, ignoring the underlying temporal structures within and across action segments. To address this problem, we propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods. Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction. Furthermore, a multi-step refinement strategy is proposed to progressively improve action proposals along the model training process. Extensive experiments on THUMOS-14 and ActivityNet-v1.3 demonstrate the effectiveness of our approach, establishing new state of the art on both datasets. The code and models are publicly available at https://github.com/boheumd/ASM-Loc.

1 Introduction

Figure 1: Action-aware segment modeling for WTAL. Our ASM-Loc leverages the action proposals as well as the proposed segment-centric modules to address the common failures in existing MIL-based methods.

Weakly-supervised temporal action localization (WTAL) has attracted increasing attention in recent years. Unlike its fully-supervised counterpart, WTAL only requires action category annotation at the video level, which is much easier to collect and more scalable for building large-scale datasets. To tackle this problem, recent works [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] mostly rely on the multiple instance learning (MIL) framework [13], where the entire untrimmed video is treated as a labeled bag containing multiple unlabeled instances (i.e., video frames or snippets). The action classification scores of individual snippets are first generated to form the temporal class activation sequences (CAS) and then aggregated by a top-$k$ mean mechanism to obtain the final video-level prediction [3, 14, 6, 8].

While significant improvement has been made in prior work, there is still a huge performance gap between the weakly-supervised and fully-supervised settings. One major challenge is localization completeness, where the models tend to generate incomplete or over-complete action segments due to the inaccurate predictions of action boundaries. Another challenge is the missed detection of short action segments, where the models are biased towards segments with longer duration and produce low-confidence predictions on short actions. Figure 1 demonstrates an example of these two common errors. Although these challenges are inherently difficult due to the lack of segment-level annotation, we argue that the absence of segment-based modeling in existing MIL-based methods is a key reason for the inferior results. In particular, these MIL-based methods treat snippets in a video as independent instances, where their underlying temporal structures are neglected in either the feature modeling or prediction stage.

In this paper, we propose a novel framework that enables explicit, action-aware segment modeling for weakly-supervised temporal action localization, which we term ASM-Loc. To bootstrap segment modeling, we first generate action proposals using the standard MIL-based methods. These proposals provide an initial estimation of the action locations in the untrimmed video as well as their duration. Based on the action proposals, we introduce three segment-centric modules that correspond to the three stages of a WTAL pipeline, i.e., the feature extraction stage, the feature modeling stage and the prediction stage.

First, a dynamic segment sampling module is proposed to balance the contribution of short-range and long-range action segments. As shown in Figure 1, action proposals with short duration are up-sampled along the temporal dimension, with the scale-up ratios dynamically computed according to the length of the proposals. Second, intra- and inter-segment attention modules are presented to capture the temporal structures within and across action segments at the feature modeling stage. Specifically, the intra-segment attention module utilizes self-attention within action proposals to model action dynamics and better discriminate foreground and background snippets. On the other hand, the inter-segment attention module utilizes self-attention across different action proposals to capture their relationships, facilitating the localization of action segments that involve temporal dependencies (e.g., “CricketBowling” is followed by “CricketShotting” in Figure 1). Note that both attention modules are segment-centric, which is critical to suppress the negative impact of noisy background snippets in untrimmed videos. Third, a pseudo instance-level loss is introduced to refine the localization result by providing fine-grained supervision. The pseudo instance-level labels are derived from the action proposals, coupled with uncertainty estimation scores that mitigate the label noise effects. Finally, a multi-step proposal refinement is adopted to progressively improve the quality of action proposals, which in turn boosts the localization performance of our final model.

We summarize our main contributions as follows:

  • We show that segment-based modeling can be utilized to narrow the performance gap between the weakly-supervised and supervised settings, which has been neglected in prior MIL-based WTAL methods.

  • We introduce three novel segment-centric modules that enable action-aware segment modeling in different stages of a WTAL pipeline.

  • We provide extensive experiments to demonstrate the effectiveness of each component of our design. Our ASM-Loc establishes new state of the art on both THUMOS-14 and ActivityNet-v1.3 datasets.

2 Related works

Temporal Action Localization (TAL)

Compared with action recognition [15, 16, 17, 18, 19, 20, 21], TAL is a more challenging task for video understanding. Current fully-supervised TAL methods can be categorized into two groups: the anchor-based methods [22, 23, 24, 25] perform boundary regression based on pre-defined action proposals, while the anchor-free methods [26, 27, 28] directly predict boundary probability or actionness scores for each snippet in the video, and then employ a bottom-up grouping strategy to match start and end pairs into action segments. All these methods require precise temporal annotation of each action instance, which is labor-intensive and time-consuming.

Weakly-supervised Temporal Action Localization

Recently, the weakly supervised setting, where only video-level category labels are required during training, has drawn increasing attention from the community [1, 29, 2, 30, 3, 4, 31, 32, 5, 6, 7, 33, 34, 35, 12, 8, 9, 10, 11]. Specifically, UntrimmedNet [1] is the first to introduce the multiple instance learning (MIL) framework to tackle this problem, which selects foreground snippets and groups them as action segments. STPN [2] improves UntrimmedNet by adding a sparsity loss to enforce the sparsity of selected snippets. CoLA [9] utilizes contrastive learning to distinguish the foreground and background snippets. UGCT [10] proposes an online pseudo-label generation mechanism with uncertainty-aware learning to impose pseudo-label supervision on the attention weights. All these MIL-based methods treat each snippet in the video individually, neglecting the rich temporal information at the segment level. In contrast, our ASM-Loc focuses on modeling segment-level temporal structures for WTAL, which is rarely explored in prior work.

Pseudo Label Guided Training

Using pseudo labels to guide model training has been widely adopted in vision tasks with weak or limited supervision. In weakly supervised object detection, one of the seminal directions is self-training [36, 37, 38, 39], which first trains a teacher model and then the predictions with high confidence are used as instance-level pseudo labels to train a final detector. Similarly, in semi-supervised learning [40, 41, 42, 43, 44] and domain adaptation [45, 46, 47], models are first trained on the labeled / source dataset and then used to generate pseudo labels for the unlabeled / target dataset to guide the training process.

Similar to these works, our ASM-Loc utilizes pseudo segment-level labels (i.e., action proposals) to guide our training process in the WTAL task. However, we do not limit our approach to using pseudo labels for supervision only. Instead, we leverage the action proposals in multiple segment-centric modules, such as dynamic segment sampling, intra- and inter-segment attention.

3 WTAL Base Model

Figure 2: (a) Framework Overview. The gray modules indicate the components of the base model (e.g. conv and FC), while the others are our action-aware segment modeling modules. (b) Dynamic segment sampling is based on the cumulative distribution of the sampling weight vector $W$. The red dots on the $T$-axis represent the final sampled timesteps. Shorter action segments have higher scale-up ratios. (c) Intra-segment attention applies self-attention within each action proposal. (d) Inter-segment attention applies self-attention among all proposals in a video. $\odot$, $\otimes$ and $\oplus$ denote element-wise multiplication, matrix multiplication, and element-wise addition. $T$, $N$ are the number of snippets and action proposals, respectively.

WTAL aims to recognize and localize action segments in untrimmed videos given only video-level action labels during training. Formally, let us denote an untrimmed training video as $V$ and its ground-truth label as $y\in\mathbb{R}^{C}$, where $C$ is the number of action categories. Note that $y$ could be a multi-hot vector if more than one action is present in the video and is normalized with the $l_{1}$ normalization. The goal of temporal action localization is to generate a set of action segments $\mathcal{S}=\{(s_{i},e_{i},c_{i},q_{i})\}_{i=1}^{I}$ for a testing video, where $s_{i},e_{i}$ are the start and end time of the $i$-th segment and $c_{i},q_{i}$ are the corresponding class prediction and confidence score.

Most existing WTAL methods [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12] employ the multiple instance learning (MIL) formulation. A typical pipeline of MIL-based methods consists of three main stages (depicted in Figure 2): (i) The feature extraction stage takes the untrimmed RGB videos and optical flow as input to extract snippet-level features using pre-trained backbone networks. (ii) The feature modeling stage transforms the extracted features to the task-oriented features by performing temporal modeling. (iii) The prediction stage generates class probabilities and attention weights for each time step and computes video-level loss following the MIL formulation during training. In the following subsections, we review the common practices of these three stages and present our base model in detail.

3.1 Feature Extraction and Modeling

Following the recent WTAL methods [2, 4, 32, 34, 10], we first divide each untrimmed video into non-overlapping 16-frame snippets, and then apply a Kinetics-400 pre-trained I3D model [15] to extract features for both RGB and optical flow input. After that, the RGB and optical flow features are concatenated along the channel dimension to form the snippet-level representations $F\in\mathbb{R}^{T\times D}$, where $T$ is the number of snippets in the video and $D=2048$ is the feature dimensionality. Following [4, 6, 9, 48], the features are then fed into a temporal convolution layer and the ReLU activation for feature modeling: $X=\text{ReLU}(\text{conv}(F))$.
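To make the feature modeling step concrete, the following is a minimal PyTorch sketch of the embedding described above (a temporal convolution followed by ReLU). It is not the authors' implementation; the module name, kernel size and padding are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    """Temporal convolution + ReLU over concatenated RGB/flow I3D features."""
    def __init__(self, feat_dim=2048, kernel_size=3):
        super().__init__()
        # nn.Conv1d expects (batch, channels, time), so we transpose around it.
        self.conv = nn.Conv1d(feat_dim, feat_dim, kernel_size, padding=kernel_size // 2)

    def forward(self, F):                         # F: (batch, T, D)
        X = self.conv(F.transpose(1, 2))          # (batch, D, T)
        return torch.relu(X).transpose(1, 2)      # X: (batch, T, D)
```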

3.2 Action Prediction and Training Losses

Given the embedded features $X$, a fully-connected (FC) layer is applied to predict the temporal class activation sequence (CAS) $P\in\mathbb{R}^{T\times(C+1)}$, where $C+1$ denotes the number of action categories plus the background class. To better differentiate the foreground and background snippets, a common strategy [2, 4, 7] is to introduce an additional attention module that outputs the attention weights for each time step of the untrimmed video. Following [34, 48], we generate the attention weights $A\in\mathbb{R}^{T\times 2}$ using an FC layer, where the two weight values at each time step are normalized by the softmax operation to obtain the foreground and background attention weights, respectively. Finally, the CAS and the attention weights are combined to get the attention weighted CAS: $\hat{P}^{m}(c)=P(c)\odot A^{m}$, $m\in\{\text{fg},\text{bg}\}$, where $c$ indicates the class index and $\odot$ denotes element-wise multiplication.

Following the MIL formulation, the video-level classification score is generated by the top-$k$ mean strategy [3, 6, 8]. For each class $c$, we take the $k$ largest values of the attention weighted CAS and compute their averaged value: $\hat{p}^{m}(c)=\frac{1}{k}\sum\text{Top-}k(\hat{P}^{m}(c))$. Softmax normalization is then performed across all classes to obtain the attention weighted video-level action probabilities. We adopt three video-level losses in such a weakly-supervised setting.
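As a reference, here is a minimal PyTorch sketch of this prediction stage: an FC layer producing the (C+1)-way CAS, an FC layer with softmax producing the foreground/background attention weights, and top-k mean pooling for the video-level scores. The class name, the way k is passed in, and the ordering of the two attention channels are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        self.cas_fc = nn.Linear(feat_dim, num_classes + 1)   # C action classes + background
        self.att_fc = nn.Linear(feat_dim, 2)                 # foreground / background weights

    def forward(self, X, k):                                 # X: (batch, T, D)
        P = self.cas_fc(X)                                   # CAS: (batch, T, C+1)
        A = torch.softmax(self.att_fc(X), dim=-1)            # (batch, T, 2); index 0 = fg, 1 = bg here
        P_fg = P * A[..., 0:1]                               # foreground-attention weighted CAS
        P_bg = P * A[..., 1:2]                               # background-attention weighted CAS
        # Top-k mean over time for each class, then softmax across classes.
        p_fg = torch.softmax(P_fg.topk(k, dim=1).values.mean(dim=1), dim=-1)  # (batch, C+1)
        p_bg = torch.softmax(P_bg.topk(k, dim=1).values.mean(dim=1), dim=-1)
        return P, A, p_fg, p_bg
```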

Foreground loss

To guide the training of video-level action classification, we apply the cross-entropy loss between the foreground-attention weighted action probabilities $\hat{p}^{\text{fg}}$ and the video-level action label $y^{\text{fg}}=[y;0]$, written as:

\mathcal{L}^{\text{fg}} = -\sum_{c=1}^{C+1} y^{\text{fg}}(c)\log\hat{p}^{\text{fg}}(c).   (1)

Background loss

To ensure that the negative instances in the untrimmed video are predicted as the background class, we regularize the background-attention weighted action probabilities $\hat{p}^{\text{bg}}$ with an additional background loss [32, 48]. Specifically, we compute the cross-entropy between $\hat{p}^{\text{bg}}$ and the background class label $y^{\text{bg}}$:

\mathcal{L}^{\text{bg}} = -\sum_{c=1}^{C+1} y^{\text{bg}}(c)\log\hat{p}^{\text{bg}}(c),   (2)

where $y^{\text{bg}}(C+1)=1$ and $y^{\text{bg}}(c)=0$ for all other $c$.

Action-aware background loss

Although no action is taking place in background snippets, we argue that rich context information is still available to reflect the actual action category label. Consider the example in Figure 3(c): even though the background frames are stationary, showing only a billiard table, one can still expect the action category “Billiard” to be present somewhere in the video. Therefore, the background instances are related to not only the background class label but also the action class label.

Based on this observation, we formulate the action-aware background loss as the cross-entropy loss between the background-attention weighted action probabilities $\hat{p}^{\text{bg}}$ and the video-level action label $y^{\text{fg}}$:

\mathcal{L}^{\text{abg}} = -\sum_{c=1}^{C+1} y^{\text{fg}}(c)\log\hat{p}^{\text{bg}}(c).   (3)

The total video-level loss for our base model is the weighted combination of all three losses:

\mathcal{L}^{\text{vid}} = \lambda_{\text{fg}}\mathcal{L}^{\text{fg}} + \lambda_{\text{bg}}\mathcal{L}^{\text{bg}} + \lambda_{\text{abg}}\mathcal{L}^{\text{abg}},   (4)

where $\lambda_{\text{fg}}$, $\lambda_{\text{bg}}$ and $\lambda_{\text{abg}}$ are trade-off parameters for balancing the contribution of the three losses.
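Putting Eqs. (1)-(4) together, a minimal sketch of the video-level losses could look like the following. It assumes the video-level probabilities p_fg, p_bg come from a head such as the one sketched above and that y is a multi-hot label vector; the loss weights are placeholders, not the tuned values.

```python
import torch

def video_level_loss(p_fg, p_bg, y, lambdas=(1.0, 0.5, 0.5)):
    """Sketch of Eqs. (1)-(4). p_fg, p_bg: (batch, C+1) class probabilities;
    y: (batch, C) multi-hot video-level labels."""
    eps = 1e-8
    batch = y.shape[0]
    zeros = torch.zeros(batch, 1, device=y.device)
    ones = torch.ones(batch, 1, device=y.device)
    # Foreground label [y; 0] (l1-normalized) and background label [0; 1].
    y_fg = torch.cat([y, zeros], dim=1)
    y_fg = y_fg / y_fg.sum(dim=1, keepdim=True).clamp(min=eps)
    y_bg = torch.cat([torch.zeros_like(y), ones], dim=1)

    loss_fg = -(y_fg * torch.log(p_fg + eps)).sum(dim=1).mean()    # Eq. (1)
    loss_bg = -(y_bg * torch.log(p_bg + eps)).sum(dim=1).mean()    # Eq. (2)
    loss_abg = -(y_fg * torch.log(p_bg + eps)).sum(dim=1).mean()   # Eq. (3)
    l_fg, l_bg, l_abg = lambdas
    return l_fg * loss_fg + l_bg * loss_bg + l_abg * loss_abg      # Eq. (4)
```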

3.3 Discussion

As discussed in Sec. 1, our base model follows the MIL formulation and neglects the temporal structures among video snippets. Nevertheless, the prediction results generated by the base model still provide a decent estimation of the action locations and durations in the untrimmed video, which can serve as a bootstrap for our segment modeling process. In particular, we generate the initial action proposals based on the prediction results of the base model: $\mathcal{S}\mapsto\tilde{\mathcal{S}}=\{(s_{n},e_{n},c_{n})\}_{n=1}^{N}$, where $s_{n}$, $e_{n}$ and $c_{n}$ denote the start time, the end time, and the predicted category label of the $n$-th action proposal, respectively. More details on generating action proposals are available in the supplementary material. The main focus of our work is to leverage the action proposals for segment-level temporal modeling, as described in the following section.

4 Action-aware Segment Modeling

Figure 2(a) illustrates an overview of our ASM-Loc framework. Given the action proposals generated by the base model, we introduce action-aware segment modeling into all three stages of the WTAL pipeline: dynamic segment sampling in the feature extraction stage (Sec. 4.1), intra- and inter-segment attention in the feature modeling stage (Sec. 4.2) and pseudo instance-level supervision in the prediction stage (Sec. 4.3). A multi-step proposal refinement is adopted to progressively improve the action proposals and the localization results, as discussed in Sec. 4.4.

4.1 Dynamic Segment Sampling

Action segments in an untrimmed video may have widely varying durations, ranging from less than 2 seconds to more than 1 minute. Intuitively, short actions have small temporal scales, and therefore, their information is prone to loss or distortion throughout the feature modeling stage. As shown in Table 5, we observe that models are indeed biased towards the segments with longer duration and produce lower confidence scores on short segments, resulting in missed detections or inferior localization results. Similar observations have been made in object detection, where smaller objects have worse detection performance than larger ones [49, 50].

In order to address this problem in the WTAL setting, we propose a novel segment sampling module that dynamically up-samples action proposals according to their estimated duration. Formally, we first initialize a sampling weight vector $W\in\mathbb{R}^{T}$ with values equal to 1 at all time steps. Then, we compute the updated sampling weight for short proposals with duration less than a pre-defined threshold $\gamma$:

W[s_{n}:e_{n}] = \dfrac{\gamma}{e_{n}-s_{n}}, \quad \text{if } (e_{n}-s_{n}) \leq \gamma,   (5)

where $s_{n},e_{n}$ denote the start and end time of the $n$-th action proposal. The sampling procedure is based on the Inverse Transform Sampling method as shown in Figure 2(b). The intuition is to sample snippets with frame rates proportional to their sampling weights $W$. We first compute the cumulative distribution function (CDF) of the sampling weights $f_{W}=\texttt{cdf}(W)$, then uniformly sample $T$ timesteps from the inverse of the CDF: $\{x_{i}=f_{W}^{-1}(i)\}_{i=1}^{T}$. In this way, the scale-up ratio of each proposal is dynamically computed according to its estimated duration. We apply linear interpolation when up-sampling is needed.
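A minimal NumPy sketch of this sampling procedure is given below. It follows Eq. (5) and inverse transform sampling with linear interpolation, assuming proposals are given as half-open snippet index ranges [s, e); it is an illustrative reading of the module, not the released code.

```python
import numpy as np

def dynamic_segment_sampling(X, proposals, gamma):
    """X: (T, D) snippet features; proposals: list of (s, e) half-open snippet
    index ranges; gamma: duration threshold. Returns T resampled features."""
    T = X.shape[0]
    W = np.ones(T)
    for s, e in proposals:
        if e > s and (e - s) <= gamma:
            W[s:e] = gamma / (e - s)                  # Eq. (5): up-weight short proposals
    cdf = np.cumsum(W) / W.sum()                      # normalized CDF of sampling weights
    cdf_ext = np.concatenate([[0.0], cdf])            # CDF at snippet boundaries 0..T
    u = (np.arange(T) + 0.5) / T                      # T uniformly spaced samples in (0, 1)
    pos = np.interp(u, cdf_ext, np.arange(T + 1))     # inverse transform sampling
    lo = np.clip(np.floor(pos).astype(int), 0, T - 1)
    hi = np.clip(lo + 1, 0, T - 1)
    w = (pos - lo)[:, None]
    return (1.0 - w) * X[lo] + w * X[hi]              # linear interpolation when up-sampling
```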

4.2 Intra- and Inter-Segment Attention

Intra-Segment Attention

Action modeling is of central importance for accurate action classification and temporal boundary prediction. Recent work [51, 18] applies temporal attention globally on trimmed videos for action recognition and achieves impressive performance. However, untrimmed videos are usually dominated by irrelevant background snippets which introduce extra noise to the action segment modeling process. Motivated by this observation, we propose the intra-segment attention module that performs self-attention within each action proposal.

We formulate this module using a masked attention mechanism, as shown in Figure 2(c). Specifically, an attention mask $M\in\mathbb{R}^{T\times T}$ is defined to indicate the foreground snippets corresponding to different action proposals. The attention mask is first initialized with 0 at all entries and assigned $M[s_{n}:e_{n},s_{n}:e_{n}]=1$ for all proposals. The attention mask is then applied to the attention matrix computed by the standard self-attention approach [52, 53]:

Q = XW_{Q},\quad K = XW_{K},\quad V = XW_{V},   (6)
A_{i,j} = \dfrac{M_{i,j}\exp(Q_{i}K_{j}^{T}/\sqrt{D})}{\sum_{k} M_{i,k}\exp(Q_{i}K_{k}^{T}/\sqrt{D})},   (7)
Z = X + \text{BN}(AVW_{O}),   (8)

where $W_{Q},W_{K},W_{V},W_{O}\in\mathbb{R}^{D\times D}$ are the linear projection matrices for generating the query, key, value and the output. Multi-head attention [52] is also adopted to improve the capacity of the attention module. In this way, we explicitly model the temporal structures within each action proposal, avoiding the negative impact of the irrelevant and noisy background snippets.
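For illustration, the masked self-attention of Eqs. (6)-(8) could be sketched in PyTorch as follows, with a single attention head for brevity (the paper uses multi-head attention). Proposals are assumed to be half-open snippet ranges, and snippets outside every proposal simply keep their input features.

```python
import torch
import torch.nn as nn

class IntraSegmentAttention(nn.Module):
    """Sketch of Eqs. (6)-(8): self-attention restricted by a binary mask M to
    snippets belonging to the same action proposal (single head for brevity)."""
    def __init__(self, dim=2048):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.o = nn.Linear(dim, dim, bias=False)
        self.norm = nn.BatchNorm1d(dim)
        self.scale = dim ** -0.5

    def forward(self, X, proposals):                   # X: (T, D); proposals: list of (s, e)
        T, _ = X.shape
        M = torch.zeros(T, T, device=X.device)
        for s, e in proposals:
            M[s:e, s:e] = 1.0                          # attend only within each proposal
        Q, K, V = self.q(X), self.k(X), self.v(X)      # Eq. (6)
        logits = (Q @ K.t()) * self.scale
        logits = logits.masked_fill(M == 0, float("-inf"))
        A = torch.softmax(logits, dim=-1)              # Eq. (7): masked attention weights
        A = torch.nan_to_num(A)                        # rows outside all proposals -> 0
        return X + self.norm(self.o(A @ V))            # Eq. (8): BN + residual connection
```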

Inter-Segment Attention

Action segments in an untrimmed video usually involve temporal dependencies with each other. For example, “CricketBowling” tends to be followed by “CricketShotting”, while “VolleyballSpiking” usually repeats multiple times in a video. Capturing these dependencies and interactions among action segments can therefore improve the recognition and localization performance.

Similar to the intra-segment attention module, we leverage a self-attention mechanism to model the relationships across multiple action proposals. As shown in Figure 2(d), we first aggregate the snippet-level features within each action proposal by average pooling on the temporal dimension: $\hat{X}_{n}=\frac{1}{e_{n}-s_{n}+1}\sum_{t=s_{n}}^{e_{n}}X(t)$. The multi-head self-attention is then applied on all segment-level features $\{\hat{X}_{n}\}_{n=1}^{N}$ to model the interactions between different action proposal pairs. The output features are replicated along the time axis and added to the original feature $X$ in a residual manner.
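Below is a small PyTorch sketch of one plausible realization of this module: proposal features are obtained by temporal average pooling, multi-head self-attention is applied across the N proposal features, and each output is added back over its proposal's time span as the residual. The head count, the use of nn.MultiheadAttention, and the half-open, non-empty proposal ranges are assumptions for illustration.

```python
import torch
import torch.nn as nn

class InterSegmentAttention(nn.Module):
    def __init__(self, dim=2048, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, X, proposals):                   # X: (T, D); proposals: list of (s, e)
        if len(proposals) == 0:
            return X
        # Average-pool snippet features within each non-empty half-open range [s, e).
        seg = torch.stack([X[s:e].mean(dim=0) for s, e in proposals])   # (N, D)
        out, _ = self.attn(seg.unsqueeze(0), seg.unsqueeze(0), seg.unsqueeze(0))
        out = out.squeeze(0)                            # (N, D)
        Z = X.clone()
        for (s, e), f in zip(proposals, out):
            Z[s:e] = Z[s:e] + f                         # replicate along time, residual add
        return Z
```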

4.3 Pseudo Instance-level Loss

Due to the absence of segment-level annotation, standard MIL-based methods only rely on video-level supervision provided by the video-level action category label. To further refine the localization of action boundaries, we leverage the pseudo instance-level label provided by the action proposals and propose a pseudo instance-level loss that offers more fine-grained supervision than the video-level losses.

Given the action proposals $\tilde{\mathcal{S}}=\{(s_{n},e_{n},c_{n})\}_{n=1}^{N}$, we construct the pseudo instance-level label $\tilde{Q}\in\mathbb{R}^{T\times(C+1)}$ by assigning action labels to the snippets that belong to the action proposals and assigning the background class label to all other snippets:

\tilde{Q}_{t}(c) = \begin{cases} 1, & \text{if } \exists n,\ t\in[s_{n},e_{n}] \text{ and } c=c_{n} \\ 1, & \text{if } \forall n,\ t\notin[s_{n},e_{n}] \text{ and } c=C+1 \\ 0, & \text{otherwise} \end{cases}   (9)

Note that $\tilde{Q}$ is also normalized with the $l_{1}$ normalization.

As the action proposals are generated from the model prediction, it is inevitable to produce inaccurate pseudo instance-level labels. To handle the label noise effects, we follow the recent work [54, 10, 55, 56] and introduce an uncertainty prediction module that guides the model to learn from noisy pseudo labels. Specifically, we employ an FC layer to output the uncertainty score $U\in\mathbb{R}^{T}$, which is then used to re-weight the pseudo instance-level loss at each time step. Intuitively, instances with high uncertainty scores are limited from contributing too much to the loss. Coupled with uncertainty scores, the pseudo instance-level loss can be written as the averaged cross-entropy between the temporal CAS $P$ and the pseudo instance-level label $\tilde{Q}$:

\mathcal{L}_{\text{ins}} = \dfrac{1}{T}\sum_{t=1}^{T}\left[\exp(-U_{t})\left(-\sum_{c=1}^{C+1}\tilde{Q}_{t}(c)\log P_{t}(c)\right) + \beta U_{t}\right]   (10)

where $\beta$ is a hyper-parameter for the weight decay term, which prevents the uncertainty prediction module from predicting infinite uncertainty for all time steps (and therefore zero loss).
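A minimal sketch of the pseudo-label construction (Eq. (9)) and the uncertainty-weighted loss (Eq. (10)) is given below. It assumes P already holds per-snippet class probabilities, proposals are half-open ranges with a class index, and the last channel is the background class; variable names and the eps constant are illustrative.

```python
import torch

def pseudo_instance_loss(P, U, proposals, beta=0.2):
    """P: (T, C+1) per-snippet class probabilities (CAS after softmax);
    U: (T,) predicted uncertainty scores; proposals: list of (s, e, c) with
    half-open range [s, e) and class index c; index C is the background."""
    T, C1 = P.shape
    eps = 1e-8
    Q = torch.zeros(T, C1, device=P.device)
    covered = torch.zeros(T, dtype=torch.bool, device=P.device)
    for s, e, c in proposals:
        Q[s:e, c] = 1.0                                 # Eq. (9): action label inside proposals
        covered[s:e] = True
    Q[~covered, C1 - 1] = 1.0                           # background label elsewhere
    Q = Q / Q.sum(dim=1, keepdim=True).clamp(min=eps)   # l1-normalize (overlapping labels)
    ce = -(Q * torch.log(P + eps)).sum(dim=1)           # per-snippet cross-entropy
    return (torch.exp(-U) * ce + beta * U).mean()       # Eq. (10): uncertainty re-weighting
```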

4.4 Multi-step Proposal Refinement

Action proposals play an important role in action-aware segment modeling. As discussed in Sec. 5.3, the quality of the proposals is positively correlated with the performance of multiple components in our approach. While our initial action proposals are obtained from the base model, it is intuitive to leverage the superior prediction results generated by our ASM-Loc to produce more accurate action proposals. Based on this motivation, we propose a multi-step training process that progressively refines the action proposals.

As a bootstrap of segment modeling, we first train the base model (Sec. 3) for $E$ epochs and obtain the initial action proposals $\tilde{\mathcal{S}}_{0}$. After that, we train our ASM-Loc for another $E$ epochs and obtain the refined action proposals $\tilde{\mathcal{S}}_{1}$ with a more accurate estimation of the action locations and durations. The same process can be applied for multiple steps until the quality of the action proposals converges. The complete multi-step proposal refinement process is summarized in Alg. 1. Finally, we train our ASM-Loc using the refined proposals $\tilde{\mathcal{S}}$ until the model converges.

Table 1: Comparison with state-of-the-art methods on THUMOS-14 dataset. The average mAPs are computed under the IoU thresholds [0.1:0.1:0.7]. UNT and I3D are abbreviations for UntrimmedNet features and I3D features, respectively.
Supervision Method Publication mAP@IoU (%)
0.1 0.2 0.3 0.4 0.5 0.6 0.7 AVG
Full (-) SSN [25] ICCV 2017 66.0 59.4 51.9 41.0 29.8 - - -
TAL-Net [22] CVPR 2018 59.8 57.1 53.2 48.5 42.8 33.8 20.8 45.1
GTAN [57] CVPR 2019 69.1 63.7 57.8 47.2 38.8 - - -
P-GCN [58] ICCV 2019 69.5 67.8 63.6 57.8 49.1 - - -
VSGN [59] ICCV 2021 - - 66.7 60.4 52.4 41.0 30.4 -
Weak (UNT) AutoLoc [30] ECCV 2018 - - 35.8 29.0 21.2 13.4 5.8 -
CleanNet [31] ICCV 2019 - - 37.0 30.9 23.9 13.9 7.1 -
Bas-Net [6] AAAI 2020 - - 42.8 34.7 25.1 17.1 9.3 -
Weak (I3D) STPN [2] CVPR 2018 52.0 44.7 35.5 25.8 16.9 9.9 4.3 27.0
CMCS [4] CVPR 2019 57.4 50.8 41.2 32.1 23.1 15.0 7.0 32.4
WSAL-BM [32] ICCV 2019 60.4 56.0 46.6 37.5 26.8 17.6 9.0 36.3
DGAM [33] CVPR 2020 60.0 54.2 46.8 38.2 28.8 19.8 11.4 37.0
TSCN [7] ECCV 2020 63.4 57.6 47.8 37.7 28.7 19.4 10.2 37.8
ACM-Net [48] TIP 2021 68.9 62.7 55.0 44.6 34.6 21.8 10.8 42.6
CoLA [9] CVPR 2021 66.2 59.5 51.5 41.9 32.2 22.0 13.1 40.9
UGCT [10] CVPR 2021 69.2 62.9 55.5 46.5 35.9 23.8 11.4 43.6
AUMN [35] CVPR 2021 66.2 61.9 54.9 44.4 33.3 20.5 9.0 41.5
FAC-Net [12] ICCV 2021 67.6 62.1 52.6 44.3 33.4 22.5 12.7 42.2
ASM-Loc (Ours) - 71.2 65.5 57.1 46.8 36.6 25.2 13.4 45.1

5 Experiment

5.1 Experimental Setup

Dataset

We evaluate our method on two popular action localization datasets: THUMOS-14 [60] and ActivityNet-v1.3 [61]. THUMOS-14 contains untrimmed videos from 20 categories. The video length varies from a few seconds to several minutes and multiple action instances may exist in a single video. Following previous works [1, 3, 7, 9], we use the 200 videos in the validation set for training and the 213 videos in the testing set for evaluation. ActivityNet-v1.3 is a large-scale dataset with 200 complex daily activities. It has 10,024 training videos and 4,926 validation videos. Following [35, 10], we use the training set to train our model and the validation set for evaluation.

Implementation Details

We employ the I3D [15] network pretrained on Kinetics-400 [15] for feature extraction. We apply the TVL1 [62] algorithm to extract optical flow from RGB frames. The Adam optimizer is used with a learning rate of 0.0001 and mini-batch sizes of 16 and 64 for THUMOS-14 and ActivityNet-v1.3, respectively. The number of sampled snippets $T$ is 750 for THUMOS-14 and 150 for ActivityNet-v1.3. For the multi-step proposal refinement, $E$ is set to 100 and 50 epochs for THUMOS-14 and ActivityNet-v1.3, respectively. Action proposals are generated at the last epoch of each refinement step. More dataset-specific training and testing details are available in the supplementary material.

Input: Training epochs $E$, refinement steps $L$
Output: Action proposals $\tilde{\mathcal{S}}$
1: Train the base model for $E$ epochs.
2: Get initial action proposals $\tilde{\mathcal{S}}_{0}$.
3: for $l$ in $\{1,\dots,L\}$ do
4:     Train ASM-Loc for $E$ epochs with $\tilde{\mathcal{S}}_{l-1}$.
5:     Update the action proposals: $\tilde{\mathcal{S}}_{l}$.
6: end for
Algorithm 1: Multi-step Proposal Refinement
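The refinement schedule of Algorithm 1 can be summarized by the following driver sketch; `train_base_model`, `train_asm_loc` and `generate_proposals` are hypothetical placeholders for the training code and for the proposal generation of Algorithm 2 (Appendix B), so this is an outline of the schedule rather than runnable training code.

```python
def multi_step_refinement(dataset, E, L):
    # Hypothetical helpers: train_base_model / train_asm_loc return trained models,
    # generate_proposals turns a model's predictions into action proposals (Alg. 2).
    base_model = train_base_model(dataset, epochs=E)          # bootstrap (Sec. 3)
    proposals = generate_proposals(base_model, dataset)       # initial proposals S~_0
    model = base_model
    for step in range(1, L + 1):
        model = train_asm_loc(dataset, proposals, epochs=E)   # train with S~_{l-1}
        proposals = generate_proposals(model, dataset)        # refined proposals S~_l
    # Final run with the refined proposals until the model converges.
    model = train_asm_loc(dataset, proposals, epochs=None)
    return model, proposals
```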

5.2 Comparison with the State of the Art

In Table 1, we compare our ASM-Loc with state-of-the-art WTAL methods on THUMOS-14. Selected fully-supervised methods are presented for reference. We observe that ASM-Loc outperforms all the previous WTAL methods and establishes new state of the art on THUMOS-14 with 45.1% average mAP for IoU thresholds 0.1:0.7. In particular, our approach outperforms UGCT [10], which also utilizes pseudo labels to guide the model training but without explicit segment modeling. Even compared with the fully supervised methods, ASM-Loc outperforms SSN [25] and TAL-Net [22] and achieves comparable results with GTAN [57] and P-GCN [58] when the IoU threshold is low. The results demonstrate the superior performance of our approach with action-aware segment modeling.

We also conduct experiments on ActivityNet-v1.3 and the comparison results are summarized in Table 2. Again, our ASM-Loc obtains a new state-of-the-art performance of 25.1% average mAP, surpassing the latest works (e.g. UGCT [10], FAC-Net [12]). The consistent superior results on both datasets justify the effectiveness of our ASM-Loc.

Table 2: Comparison with state-of-the-art methods on ActivityNet-v1.3 dataset. The AVG column shows the averaged mAP under the IoU thresholds [0.5:0.05:0.95].
Method Publication mAP@IoU (%)
0.5 0.75 0.95 AVG
STPN [2] CVPR 2018 29.3 16.9 2.6 16.3
ASSG  [63] MM 2019 32.3 20.1 4.0 18.8
CMCS [4] CVPR 2019 34.0 20.9 5.7 21.2
Bas-Net [6] AAAI 2020 34.5 22.5 4.9 22.2
TSCN [7] ECCV 2020 35.3 21.4 5.3 21.7
A2CL-PT [64] ECCV 2020 36.8 22.0 5.2 22.5
ACM-Net [48] TIP 2021 37.6 24.7 6.5 24.4
TS-PCA [10] CVPR 2021 37.4 23.5 5.9 23.7
UGCT [10] CVPR 2021 39.1 22.4 5.8 23.8
AUMN [35] CVPR 2021 38.3 23.5 5.2 23.5
FAC-Net [12] ICCV 2021 37.6 24.2 6.0 24.0
ASM-Loc (ours) 41.0 24.9 6.2 25.1

5.3 Ablation Studies on THUMOS-14

Table 3: Contribution of each component. $\mathcal{L}_{\text{fg}}$, $\mathcal{L}_{\text{bg}}$ and $\mathcal{L}_{\text{abg}}$ represent the foreground, background and action-aware background losses, which are based on MIL with video-level labels. DSS, Intra, Inter and $\mathcal{L}_{\text{ins}}$ denote the dynamic segment sampling, intra-segment attention, inter-segment attention and pseudo instance-level loss, respectively, which exploit segment-level information.
Components (base model + ASM-Loc modules)                                  AVG mAP (0.1:0.7)
$\mathcal{L}_{\text{fg}}$                                                   24.3
$\mathcal{L}_{\text{fg}}$ + $\mathcal{L}_{\text{bg}}$                       36.6
$\mathcal{L}_{\text{fg}}$ + $\mathcal{L}_{\text{bg}}$ + $\mathcal{L}_{\text{abg}}$ (base model)   40.3
Base + DSS                                                                  41.4
Base + Intra                                                                41.8
Base + Inter                                                                42.0
Base + $\mathcal{L}_{\text{ins}}$                                           41.3
Base + Intra + Inter                                                        42.7
Base + DSS + Intra + Inter                                                  43.7
Base + Intra + Inter + $\mathcal{L}_{\text{ins}}$                           44.3
Base + DSS + Intra + Inter + $\mathcal{L}_{\text{ins}}$ (full ASM-Loc)      45.1
Table 4: Ablation on self-attention under different settings. “Global”, “BG” indicate self-attention on all and background snippets, respectively.
Label            Setting   mAP@IoU (%): 0.1   0.3   0.5   0.7   AVG
-                Base      67.8  51.8  30.7  10.1  40.3
-                Global    67.3  50.8  30.2  10.5  40.1
Action Proposal  BG        66.0  50.1  30.6  10.4  39.6
Action Proposal  Ours      68.6  53.4  32.5  11.8  41.8
Ground Truth     BG        64.7  49.6  30.3  9.7   38.8
Ground Truth     Ours      73.3  56.2  33.6  13.2  44.3
Table 5: Impact of dynamic segment sampling (DSS). Actions are divided into five duration groups (seconds): XS (0, 1], S (1, 2], M (2, 4], L (4, 6], and XL (6, inf).
Label            Setting  Averaged mAP (%): XS   S     M     L     XL    AVG
-                Base     10.6  33.7  45.9  48.3  38.3  40.3
Action Proposal  +DSS     15.5  34.9  47.1  48.6  38.5  41.4
                 Δ        +4.9  +1.2  +1.2  +0.3  +0.2  +1.1
Ground Truth     +DSS     20.0  38.0  47.6  49.7  38.8  43.0
                 Δ        +9.4  +4.3  +1.7  +1.4  +0.5  +2.7
Table 6: Effectiveness of the uncertainty estimation module.
Uncer.   mAP@IoU (%): 0.3   0.5   0.7   AVG
✗        55.5  35.5  13.8  44.1
✓        57.1  36.6  13.4  45.1
Table 7: Ablation on the number of refinement steps. “0” indicates the base model without action-aware segment modeling.
Num. mAP@IoU (%)
0.3 0.5 0.7 AVG
0 51.8 30.7 10.1 40.3
1 54.4 34.1 12.5 43.1
2 56.2 35.4 13.8 44.7
3 57.1 36.6 13.4 45.1
4 57.3 36.7 14.1 45.1

Contribution of each component

In Table 3, we conduct an ablation study to investigate the contribution of each component in ASM-Loc. We first observe that adding the background loss $\mathcal{L}_{\text{bg}}$ and the action-aware background loss $\mathcal{L}_{\text{abg}}$ largely enhances the performance of the base model. The two losses encourage sparsity in the foreground attention weights by pushing the background attention weights to 1 at background snippets, and therefore improve the foreground-background separation.

For action-aware segment modeling, a consistent gain ($\geq$1%) is achieved by adding any one of our proposed modules. In particular, introducing segment modeling in the feature modeling stage (i.e., intra- and inter-segment attention) significantly increases the performance by 2.4%. The two attention modules are complementary to each other, modeling temporal structures within and across action segments, respectively. When incorporating all the action-aware segment modeling modules together, our approach boosts the final performance from 40.3% to 45.1%.

Are action proposals necessary for self-attention?

We propose an intra-segment attention module that performs self-attention within action proposals to suppress the noise from background snippets. To verify the effectiveness of this design, we compare different settings for self-attention in Table 4. Specifically, the “Global” setting applies the self-attention operation directly to all snippets in the untrimmed video. It can be observed that this setting does not provide any gain over the baseline, as the model fails to capture meaningful temporal structure due to the existence of irrelevant and noisy background snippets. Moreover, the “BG” setting, which stands for self-attention on background snippets only, has a negative impact and yields even worse localization results. Finally, our intra-segment attention outperforms these two settings by a large margin, indicating the importance of applying self-attention within action proposals. We also present the setting of using the ground-truth action segments as proposals for intra-segment attention. This setting can be viewed as an upper bound of our approach and provides even more significant gains over the baseline. This observation inspires us to further improve the action proposals by multi-step refinement.

Impact of dynamic segment sampling

In Table 5, we evaluate the impact of dynamic segment sampling on action segments with different durations. We divide all action segments into five groups according to their duration in seconds and evaluate the averaged mAP [65] separately for each group. As mentioned in the introduction, localization performance on short actions (XS, S) is much worse than on longer actions (M, L, XL). By up-sampling the short actions with our dynamic segment sampling module, the model achieves significant gains on short actions (+4.9% for XS and +1.2% for S) and improves the overall performance by 1.1%. Similarly, we present the results using ground-truth segment annotations for dynamic segment sampling, which achieves even larger improvement over the baseline.

Impact of uncertainty estimation

We propose an uncertainty estimation module to mitigate the noisy label problem in pseudo instance-level supervision. Table 6 shows that using uncertainty estimation consistently improves the localization performance at different IoU thresholds and increases the average mAP by 1%.

Impact of multi-step refinement

Table 7 shows the results of increasing the number of refinement steps for multi-step proposal refinement. We can see that the performance improves as the number of steps increases, indicating that better localization results can be achieved by refined proposals. We adopt 3 refinement steps as our default setting since the performance saturates after that.

5.4 Qualitative Results

Figure 3 shows qualitative comparisons between the base model and our ASM-Loc. We observe that the common errors in existing MIL-based methods can be partly addressed by our action-aware segment modeling, such as the missed detection of short actions and the incomplete localization of the action “VolleySpiking” (Figure 3(a)), and the over-complete localization of the action “BaseballPitch” (Figure 3(b)). We also provide a failure case in Figure 3(c), where our method fails to localize the first action segment due to the largely misaligned action proposal generated by the base model. This also verifies the importance of improving the quality of action proposals, which should be further studied in future work.

Figure 3: Visualization of ground-truth, predictions and action proposals. Top-2 predictions with the highest confidence scores are selected for the base model and our ASM-Loc. Transparent frames represent background frames.

6 Conclusion

In this paper, we propose a novel WTAL framework named ASM-Loc, which enables explicit action-aware segment modeling beyond previous MIL-based methods. We introduce three novel segment-centric modules corresponding to the three stages of a WTAL pipeline, which narrows the performance gap between the weakly-supervised and fully-supervised settings. We further introduce a multi-step training strategy to progressively refine the action proposals until the localization performance saturates. Our ASM-Loc achieves state-of-the-art results on two WTAL benchmarks.

Acknowledgements. This work was supported by the Air Force (STTR awards FA865019P6014, FA864920C0010) and Amazon Research Award to AS.

References

  • [1] Limin Wang, Yuanjun Xiong, Dahua Lin, and Luc Van Gool. Untrimmednets for weakly supervised action recognition and detection. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 4325–4334, 2017.
  • [2] Phuc Nguyen, Ting Liu, Gautam Prasad, and Bohyung Han. Weakly supervised action localization by sparse temporal pooling network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6752–6761, 2018.
  • [3] Sujoy Paul, Sourya Roy, and Amit K Roy-Chowdhury. W-talc: Weakly-supervised temporal activity localization and classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 563–579, 2018.
  • [4] Daochang Liu, Tingting Jiang, and Yizhou Wang. Completeness modeling and context separation for weakly supervised temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1298–1307, 2019.
  • [5] Zhekun Luo, Devin Guillory, Baifeng Shi, Wei Ke, Fang Wan, Trevor Darrell, and Huijuan Xu. Weakly-supervised action localization with expectation-maximization multi-instance learning. In European conference on computer vision, pages 729–745. Springer, 2020.
  • [6] Pilhyeon Lee, Youngjung Uh, and Hyeran Byun. Background suppression network for weakly-supervised temporal action localization. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pages 11320–11327, 2020.
  • [7] Yuanhao Zhai, Le Wang, Wei Tang, Qilin Zhang, Junsong Yuan, and Gang Hua. Two-stream consensus network for weakly-supervised temporal action localization. In European conference on computer vision, pages 37–54. Springer, 2020.
  • [8] Ashraful Islam, Chengjiang Long, and Richard Radke. A hybrid attention mechanism for weakly-supervised temporal action localization. arXiv preprint arXiv:2101.00545, 2021.
  • [9] Can Zhang, Meng Cao, Dongming Yang, Jie Chen, and Yuexian Zou. Cola: Weakly-supervised temporal action localization with snippet contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16010–16019, 2021.
  • [10] Wenfei Yang, Tianzhu Zhang, Xiaoyuan Yu, Tian Qi, Yongdong Zhang, and Feng Wu. Uncertainty guided collaborative training for weakly supervised temporal action detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 53–63, 2021.
  • [11] Yuan Liu, Jingyuan Chen, Zhenfang Chen, Bing Deng, Jianqiang Huang, and Hanwang Zhang. The blessings of unlabeled background in untrimmed videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6176–6185, 2021.
  • [12] Linjiang Huang, Liang Wang, and Hongsheng Li. Foreground-action consistency network for weakly supervised temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8002–8011, 2021.
  • [13] Oded Maron and Tomás Lozano-Pérez. A framework for multiple-instance learning. Advances in neural information processing systems, pages 570–576, 1998.
  • [14] Ashraful Islam and Richard Radke. Weakly supervised temporal action localization using deep metric learning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 547–556, 2020.
  • [15] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [16] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6202–6211, 2019.
  • [17] Xitong Yang, Haoqi Fan, Lorenzo Torresani, Larry S. Davis, and Heng Wang. Beyond short clips: End-to-end video-level learning with collaborative memories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7567–7576, June 2021.
  • [18] Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, and Abhinav Shrivastava. Gta: Global temporal attention for video action understanding. British Machine Vision Conference (BMVC), 2021.
  • [19] Xitong Yang, Xiaodong Yang, Sifei Liu, Deqing Sun, Larry Davis, and Jan Kautz. Hierarchical contrastive motion learning for video action recognition. British Machine Vision Conference (BMVC), 2021.
  • [20] Zuxuan Wu, Hengduo Li, Caiming Xiong, Yu-Gang Jiang, and Larry Steven Davis. A dynamic frame selection framework for fast video recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [21] Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Yu-Gang Jiang, Luowei Zhou, and Lu Yuan. Bevt: Bert pretraining of video transformers. arXiv preprint arXiv:2112.01529, 2021.
  • [22] Yu-Wei Chao, Sudheendra Vijayanarasimhan, Bryan Seybold, David A Ross, Jia Deng, and Rahul Sukthankar. Rethinking the faster r-cnn architecture for temporal action localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1130–1139, 2018.
  • [23] Zheng Shou, Dongang Wang, and Shih-Fu Chang. Temporal action localization in untrimmed videos via multi-stage cnns. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1049–1058, 2016.
  • [24] Huijuan Xu, Abir Das, and Kate Saenko. R-c3d: Region convolutional 3d network for temporal activity detection. In Proceedings of the IEEE international conference on computer vision, pages 5783–5792, 2017.
  • [25] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2914–2923, 2017.
  • [26] Tianwei Lin, Xu Zhao, Haisheng Su, Chongjing Wang, and Ming Yang. Bsn: Boundary sensitive network for temporal action proposal generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
  • [27] Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, and Shilei Wen. Bmn: Boundary-matching network for temporal action proposal generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3889–3898, 2019.
  • [28] Chuming Lin, Jian Li, Yabiao Wang, Ying Tai, Donghao Luo, Zhipeng Cui, Chengjie Wang, Jilin Li, Feiyue Huang, and Rongrong Ji. Fast learning of temporal action proposal via dense boundary generator. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11499–11506, 2020.
  • [29] Krishna Kumar Singh and Yong Jae Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In 2017 IEEE international conference on computer vision (ICCV), pages 3544–3553. IEEE, 2017.
  • [30] Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, and Shih-Fu Chang. Autoloc: Weakly-supervised temporal action localization in untrimmed videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 154–171, 2018.
  • [31] Ziyi Liu, Le Wang, Qilin Zhang, Zhanning Gao, Zhenxing Niu, Nanning Zheng, and Gang Hua. Weakly supervised temporal action localization through contrast based evaluation networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [32] Phuc Xuan Nguyen, Deva Ramanan, and Charless C Fowlkes. Weakly-supervised action localization with background modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5502–5511, 2019.
  • [33] Baifeng Shi, Qi Dai, Yadong Mu, and Jingdong Wang. Weakly-supervised action localization by generative attention modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1009–1019, 2020.
  • [34] Humam Alwassel, Fabian Caba Heilbron, Ali Thabet, and Bernard Ghanem. Refineloc: Iterative refinement for weakly-supervised action localization. 2019.
  • [35] Wang Luo, Tianzhu Zhang, Wenfei Yang, Jingen Liu, Tao Mei, Feng Wu, and Yongdong Zhang. Action unit memory network for weakly supervised temporal action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9969–9979, 2021.
  • [36] Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5982–5991, 2019.
  • [37] Yongqiang Zhang, Yancheng Bai, Mingli Ding, Yongqiang Li, and Bernard Ghanem. W2f: A weakly-supervised to fully-supervised framework for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 928–936, 2018.
  • [38] Dong Li, Jia-Bin Huang, Yali Li, Shengjin Wang, and Ming-Hsuan Yang. Weakly supervised object localization with progressive domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3512–3520, 2016.
  • [39] Zhongzheng Ren, Zhiding Yu, Xiaodong Yang, Ming-Yu Liu, Yong Jae Lee, Alexander G Schwing, and Jan Kautz. Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10598–10607, 2020.
  • [40] Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, page 896, 2013.
  • [41] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242, 2016.
  • [42] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249, 2019.
  • [43] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020.
  • [44] Zejia Weng, Xitong Yang, Ang Li, Zuxuan Wu, and Yu-Gang Jiang. Semi-supervised vision transformers. arXiv preprint arXiv:2111.11067, 2021.
  • [45] Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Asymmetric tri-training for unsupervised domain adaptation. In International Conference on Machine Learning, pages 2988–2997. PMLR, 2017.
  • [46] Jian Liang, Ran He, Zhenan Sun, and Tieniu Tan. Exploring uncertainty in pseudo-label guided unsupervised domain adaptation. Pattern Recognition, 96:106996, 2019.
  • [47] Debasmit Das and CS George Lee. Graph matching and pseudo-label guided deep unsupervised domain adaptation. In International conference on artificial neural networks, pages 342–352. Springer, 2018.
  • [48] Sanqing Qu, Guang Chen, Zhijun Li, Lijun Zhang, Fan Lu, and Alois Knoll. Acm-net: Action context modeling network for weakly-supervised temporal action localization. arXiv preprint arXiv:2104.02967, 2021.
  • [49] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117–2125, 2017.
  • [50] Bharat Singh, Mahyar Najibi, and Larry S Davis. Sniper: Efficient multi-scale training. arXiv preprint arXiv:1805.09300, 2018.
  • [51] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095, 2021.
  • [52] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [53] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [54] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? arXiv preprint arXiv:1703.04977, 2017.
  • [55] Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. arXiv preprint arXiv:2007.08199, 2020.
  • [56] Zhedong Zheng and Yi Yang. Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. International Journal of Computer Vision, 129(4):1106–1120, 2021.
  • [57] Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, and Tao Mei. Gaussian temporal awareness networks for action localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 344–353, 2019.
  • [58] Runhao Zeng, Wenbing Huang, Mingkui Tan, Yu Rong, Peilin Zhao, Junzhou Huang, and Chuang Gan. Graph convolutional networks for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7094–7103, 2019.
  • [59] Chen Zhao, Ali K Thabet, and Bernard Ghanem. Video self-stitching graph network for temporal action localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13658–13667, 2021.
  • [60] Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 155:1–23, 2017.
  • [61] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015.
  • [62] Vincent Duval, Jean-François Aujol, and Yann Gousseau. The tvl1 model: a geometric point of view. Multiscale Modeling & Simulation, 8(1):154–189, 2009.
  • [63] Chengwei Zhang, Yunlu Xu, Zhanzhan Cheng, Yi Niu, Shiliang Pu, Fei Wu, and Futai Zou. Adversarial seeded sequence growing for weakly-supervised temporal action localization. In Proceedings of the 27th ACM international conference on multimedia, pages 738–746, 2019.
  • [64] Kyle Min and Jason J Corso. Adversarial background-aware loss for weakly-supervised temporal activity localization. In European conference on computer vision, pages 283–299. Springer, 2020.
  • [65] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [66] Humam Alwassel, Fabian Caba Heilbron, Victor Escorcia, and Bernard Ghanem. Diagnosing error in temporal action detectors. In The European Conference on Computer Vision (ECCV), September 2018.
  • [67] Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5734–5743, 2017.
  • [68] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32:8026–8037, 2019.

Appendix

Sec. A reports additional experiments and analysis. Sec. B elaborates on the procedure of action proposal generation. Sec. C provides more dataset-specific implementation details and hyper-parameters for training and testing. We also provide more qualitative results in Sec. D. We discuss the limitation and broader impact of our work in Sec. E and Sec. F.

Appendix A Additional Experiments and Analysis

Error analysis

To analyze the effectiveness of our ASM-Loc, we conduct a DETAD [66] false positive analysis of the base model (without any action-aware segment modeling modules) and of our ASM-Loc. We present the results in Figure 4, which show a detailed categorization of false positive errors and summarize their distribution. $G$ denotes the number of ground-truth segments in the THUMOS-14 dataset. We observe that ASM-Loc generates more true positive predictions with high confidence scores and produces fewer localization and confusion errors (among the top-$1G$ scoring predictions). This verifies that ASM-Loc improves the detection results by predicting more accurate action boundaries with our action-aware segment modeling modules.

Ablation on the increased receptive field

To further demonstrate that the effectiveness of our intra- and inter-segment attention modules stems from the segment-centric design rather than the increased receptive field, we replace our intra- and inter-segment attention modules with convolutional layers and compare the results. From Table 8 we can see that replacing the attention modules with convolutional layers drops the performance by at least 3.3%, even falling below the base model. We hypothesize that increasing the kernel size of the convolutional layers may lead to confusion between foreground and background snippets, especially near the action boundaries. In contrast, our segment-centric attention design can model temporal structures within and across action segments and localize actions more precisely. The results verify that the segment-centric design is the key to our intra- and inter-segment attention modules.

Input: Predicted action segments $\mathcal{S}=\{(s_{i},e_{i},c_{i},q_{i})\}_{i=1}^{I}$, selection ratio $\alpha$, segment extension parameter $\delta$
Output: Action proposals $\tilde{\mathcal{S}}=\{(\tilde{s}_{n},\tilde{e}_{n},\tilde{c}_{n})\}_{n=1}^{N}$
1: for each ground-truth class $c$ do
2:     $\mathcal{S}(c)_{sorted} \leftarrow \text{SORT}(\mathcal{S}(c))$  // sort segments by their scores of class $c$
3:     $q_{sum} = \sum q_{i}$  // sum the confidence scores of all segments
4:     Select $K$, s.t. $\max_{K}\sum_{i=1}^{K}q_{i}\leq\alpha\cdot q_{sum}$  // select top-$K$ segments from $\mathcal{S}(c)_{sorted}$
5:     $\tilde{\mathcal{S}}(c):\{\tilde{s}_{i},\tilde{e}_{i},\tilde{c}_{i}\}_{i=1}^{K}=\{s_{i}-\delta(e_{i}-s_{i}),\ e_{i}+\delta(e_{i}-s_{i}),\ c_{i}\}_{i=1}^{K}$  // extend selected segments on both sides
6: end for
Algorithm 2: Action Proposal Generation
Table 8: Ablation on the increased receptive field.
Modeling   Kernel Size   mAP@IoU (%): 0.1   0.2   0.3   0.4   0.5   0.6   0.7   AVG
Base       -             67.8  60.7  51.8  41.3  30.7  19.9  10.1  40.3
Conv       3             66.2  59.3  50.5  39.9  29.9  19.2  9.1   39.2
Conv       5             66.5  58.9  51.0  40.0  29.7  19.3  9.8   39.3
Conv       9             67.1  59.8  50.4  40.1  29.1  19.2  10.2  39.4
Attention  -             68.9  63.1  54.9  44.5  34.0  22.0  11.9  42.7

Figure 4: Diagnosing detection results. We present DETAD [66] false positive profiles of the base model and our ASM-Loc.

Appendix B Action Proposal Generation

Table 9: Ablation on different action proposal selection methods.
Method mAP@IoU (%)
0.1 0.2 0.3 0.4 0.5 0.6 0.7 AVG
(a) 69.9 63.8 56.0 45.8 36.6 25.0 13.5 44.4
(b) 70.5 64.6 57.3 46.8 35.7 24.3 14.2 44.8
(c) 71.2 65.5 57.1 46.8 36.6 25.2 13.4 45.1

In Alg. 2, we present the details of how to generate action proposals $\tilde{\mathcal{S}}$ from the action localization results (i.e., action segments) $\mathcal{S}$. Specifically, for each ground-truth class $c$, we first sort the segments in $\mathcal{S}(c)$ by their confidence scores. Then we sum the confidence scores of all segments to obtain $q_{sum}$, and pick the top-$K$ action segments whose confidence scores sum up to at most $\alpha\cdot q_{sum}$ to form the action proposals. Note that the number of action proposals is video-adaptive and content-dependent, even though $\alpha$ is shared across all videos. Finally, following the common practice in temporal action localization [67, 26, 58, 28], we extend each proposal on both ends by $\delta$ of the proposal length, so that the extended proposal covers more context-related snippets.

To verify the effectiveness of our proposal generation design, we compare three different settings of the segment selection procedure (a sketch of our design is shown after this paragraph): (a) a fixed number of selected action segments, where $K$ is a fixed value for each class and thus not video-adaptive or content-dependent; (b) $K$ proportional to the number of predicted action segments in $\mathcal{S}(c)$, i.e., $K=\alpha\cdot|\mathcal{S}(c)|$; (c) our design. In Table 9, we can see that our design achieves the best results among the three.
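For concreteness, a plain-Python sketch of Algorithm 2 under the design in (c) is shown below; the data layout (a per-class dict of (start, end, score) tuples) is an assumption made for illustration, not the released code.

```python
def generate_action_proposals(segments, alpha, delta):
    """segments: dict mapping each ground-truth class c to a list of (s, e, q)
    predicted segments with confidence q. Returns proposals (s~, e~, c)."""
    proposals = []
    for c, segs in segments.items():
        segs = sorted(segs, key=lambda x: x[2], reverse=True)   # sort by confidence score
        q_sum = sum(q for _, _, q in segs)
        running = 0.0
        selected = []
        for s, e, q in segs:                  # take top-K segments whose scores
            if running + q > alpha * q_sum:   # sum up to at most alpha * q_sum
                break
            running += q
            selected.append((s, e))
        for s, e in selected:
            ext = delta * (e - s)             # extend both ends by delta of the length
            proposals.append((s - ext, e + ext, c))
    return proposals
```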

Appendix C Experiment Details

For the hyper-parameters, we set $\lambda_{\text{fg}}=1$, $\lambda_{\text{bg}}=0.5$, $\lambda_{\text{abg}}=0.5$, $\beta=0.2$, $\gamma=6$, $H=8$, $\delta=0.5$, $\alpha=0.7$ for THUMOS-14 and $\lambda_{\text{fg}}=5$, $\lambda_{\text{bg}}=0.5$, $\lambda_{\text{abg}}=0.5$, $\beta=0.2$, $\gamma=10$, $H=8$, $\delta=0$, $\alpha=0.3$ for ActivityNet-v1.3.

Following [9, 48], during inference, we use a set of thresholds to obtain the predicted action instances and then perform non-maximum suppression (NMS) to remove overlapping segments. Specifically, for THUMOS-14 we sweep the foreground-attention threshold from 0.1 to 0.9 with a step of 0.025 and perform NMS with a t-IoU threshold of 0.45. For ActivityNet-v1.3, we sweep the foreground-attention threshold from 0.005 to 0.02 with a step of 0.005 and apply NMS with a t-IoU threshold of 0.9.
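The thresholding-plus-NMS procedure can be sketched as follows for a single predicted class. The way candidate segments are scored (mean CAS over the segment) is an assumption for illustration, and the NMS routine is a generic greedy implementation rather than the exact one used in the paper.

```python
def temporal_nms(segments, iou_thresh):
    """Greedy temporal NMS over (start, end, score) tuples."""
    segments = sorted(segments, key=lambda x: x[2], reverse=True)
    keep = []
    for s, e, q in segments:
        ok = True
        for ks, ke, _ in keep:
            inter = max(0.0, min(e, ke) - max(s, ks))
            union = (e - s) + (ke - ks) - inter
            if union > 0 and inter / union > iou_thresh:
                ok = False
                break
        if ok:
            keep.append((s, e, q))
    return keep

def localize_one_class(att_fg, cas_c, thresholds, iou_thresh):
    """att_fg, cas_c: 1-D NumPy arrays of foreground attention and class-c CAS.
    Threshold the attention at multiple values, group consecutive snippets
    above the threshold into candidate segments, score them and apply NMS."""
    candidates = []
    T = len(att_fg)
    for th in thresholds:
        mask = att_fg > th
        t = 0
        while t < T:
            if mask[t]:
                s = t
                while t < T and mask[t]:
                    t += 1
                candidates.append((s, t, float(cas_c[s:t].mean())))
            else:
                t += 1
    return temporal_nms(candidates, iou_thresh)
```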

We implement our method in PyTorch [68] and train it on a single NVIDIA GTX 1080 Ti GPU.

Figure 5: Visualization of ground-truth, predictions and action proposals. Top-2 predictions with the highest confidence scores are selected for the base model and our ASM-Loc. Transparent frames represent background frames.

Appendix D More Qualitative Results

We provide more qualitative results in Figure 5. The first example, of the action “HammerThrow”, shows the missed detection of short actions and the over-completeness error. The second and third examples, of the actions “Shotput” and “CleanAndJerk”, show the incompleteness error. These results clearly show that our ASM-Loc helps address these errors with more accurate action boundary predictions.

Appendix E Limitation

The main limitation of our ASM-Loc is that the performance of our action-aware segment modeling modules depends on the quality of the generated action proposals. When the action proposals are largely misaligned with the ground-truth action segments, our ASM-Loc is not able to fix the error and generate correct predictions, as shown in Figure 3(c).

Appendix F Broader Impacts

Video is the most popular media format nowadays, and a large portion of information is spread in the form of videos. The temporal action localization task aims at finding the temporal boundaries and classifying the category labels of actions of interest in untrimmed videos. Unlike fully-supervised approaches that require dense segment-level annotations, our proposed weakly-supervised temporal action localization model ASM-Loc only requires video-level labels. Therefore, WTAL is much more valuable in real-world applications such as popular video-sharing social-network services, where billions of videos have only video-level user-generated tags. Besides, WTAL has broad applications in various fields, e.g., event detection, video summarization, highlight generation and video surveillance.