Turning to a Teacher for Timestamp Supervised Temporal Action Segmentation
Abstract
Temporal action segmentation in videos has drawn much attention recently. Timestamp supervision is a cost-effective way to approach this task. To obtain more information for model optimization, the existing method generates pseudo frame-wise labels iteratively based on the output of a segmentation model and the timestamp annotations. However, this practice may introduce noise and oscillation during training and degrade performance. To address this problem, we propose a new framework for timestamp supervised temporal action segmentation that introduces a teacher model, parallel to the segmentation model, to help stabilize model optimization. The teacher model can be seen as an ensemble of the segmentation model, which helps to suppress noise and improve the stability of the pseudo labels. We further introduce a segmentally smoothing loss, which is more focused and cohesive, to enforce a smooth transition of the predicted probabilities within action instances. Experiments on three datasets show that our method outperforms the state-of-the-art method and performs comparably to fully-supervised methods at a much lower annotation cost.
Index Terms— Timestamp Supervision, Temporal Action Segmentation, Mean Teacher
1 Introduction
Temporal action segmentation is of vital importance for many real-life applications such as surveillance and interactive robotics. The goal is to predict frame-wise action labels for a given input video. This task has received much attention recently, driving rapid and remarkable progress in the fully-supervised setting (providing frame-wise labels) [1, 2, 3, 4, 5, 6, 7]. Due to the high cost of obtaining frame-wise labels, some researchers have attempted to solve the problem under weakly-supervised settings, i.e., transcripts [8, 9, 10] or action sets [11, 12, 13]. Although the weakly-supervised settings successfully reduce the annotation effort, their performance falls far behind the fully-supervised counterparts. To narrow this gap, timestamp supervision has been proposed recently. Under this setting, only a single timestamp with its action class (including background) is annotated for each action instance. Intuitively, timestamp supervision requires negligible extra cost compared to the weakly-supervised settings but provides much more information.


The existing timestamp supervised method [15] applied a conventional two-step training scheme. It first adopted a segmentation model designed for full supervision to output frame-level action scores for an input video. Under the timestamp supervision setting, the model was trained on the timestamp annotations, i.e., a sparse set of annotated frames. Consequently, the model tends to focus on the annotated frames and those similar to them and to ignore the remaining frames. To alleviate this problem, in the second step, a method was adopted to iteratively generate pseudo frame-wise labels, which were then used as ground truth. However, the pseudo ground truth varied across iterations, which may introduce noise and lead to oscillation during training (see the blue broken line in Figure 1(a)). Ideally, the predicted probabilities from the segmentation model for an action are supposed to be steadily high within the action and drop sharply around its boundaries (see Figure 1(b)). In practice, however, they may alternate between high and low values within an action instance. Although the TMSE loss [4] was adopted to encourage smooth transitions of the frame-wise predictions, it was applied to the whole video and all action classes; action boundaries were not considered, and the constraints on irrelevant action categories may introduce ambiguity into the model.
In this paper, to address the above issues, we design a new framework for timestamp supervised temporal action segmentation. We introduce a teacher model, updated by the exponential moving average (EMA) of the segmentation model, in the second training step. The EMA can be seen as an ensemble of the current and earlier versions of the segmentation model, improving the quality and stability of the teacher model predictions. In turn, the teacher model predictions are used to supervise the segmentation model. As a result, this circular dependency helps to improve the stability of the pseudo frame-wise labels (see the red broken line in Figure 1(a)). Additionally, we propose a segmentally smoothing loss that encourages a smooth transition of the predicted probabilities of the labeled action class within each action instance and penalizes low confidence regions surrounded by high confidence regions. Meanwhile, the segment-wise accumulation makes the constraint more focused and cohesive. Overall, our contributions are threefold:
- We design a new framework for timestamp supervised temporal action segmentation, where the introduced teacher model helps to maintain the stability of the predictions and pseudo ground truth during training.
- We propose a new segmentally smoothing loss, which encourages smooth transition within an action instance for the predicted probabilities of the labeled action category.
- Our approach outperforms state-of-the-art methods on three widely used datasets. Moreover, it even performs comparably against fully-supervised methods.
2 Related Work
Fully-supervised temporal action segmentation. To tackle temporal action segmentation, various methods relying on precise frame-wise annotations, i.e., full supervision, have been proposed, such as grammars [1], a semi-Markovian model [2], and Temporal Convolutional Networks (TCNs) [3]. Inspired by the success of TCNs, Abu Farha et al. [4] introduced MS-TCN, a multi-stage architecture that stacks several TCNs composed of dilated convolutions with residual connections. Several variants [5, 6, 7] have been proposed recently. However, these methods are trained on frame-wise annotations, which are resource-intensive and hard to obtain.
Weakly-supervised temporal action segmentation. To relax the requirement of frame-wise annotations, many works have explored weaker levels of supervision, such as transcripts, sets and timestamps. In transcript supervision, we have access to the ordered list of action classes occurring in each training video, but their exact boundaries are unknown. Ding et al. [8] proposed Iterative Soft Boundary Assignment (ISBA) to align action sequences and update the network iteratively. The methods in [9, 10] first generate pseudo ground-truth labels for all video frames and then train a classifier for frame labeling. In set supervision, we only have the set of action classes occurring in each training video, without order or boundaries. Proposed approaches include multi-instance learning [11] and Viterbi-based decoding [12, 13]. While both of the above levels of supervision reduce the annotation requirement, their performance lags behind the fully-supervised methods. To fill this gap, Li et al. [15] introduced timestamp supervision for temporal action segmentation. They estimated action changes to generate pseudo ground truth for the subsequent action segmentation. Yet the pseudo ground truth varies across iterations, and the frame-wise accuracy tends to oscillate during training (see Figure 1(a)).
Mean Teacher. Mean Teacher [16] is a method traditionally adopted in semi-supervised learning that follows a student-teacher scheme. It obtains a better teacher model by averaging the student model weights over training iterations, without additional training. In effect, the teacher model is an ensemble of the student model, enabling it to learn more abstract invariances and yield more accurate predictions.
3 Method

3.1 Problem Definition
Given a video with $T$ frames, the goal of the temporal action segmentation task is to predict a sequence of frame-wise action labels $(y_1, \dots, y_T)$. Unlike full supervision, timestamp supervision provides a set of annotated timestamps $\{(t_i, c_i)\}_{i=1}^{N}$ with each video, where $N$ is the number of annotated timestamps. Specifically, for each action instance within the video, we have access to one frame at timestamp $t_i$ labeled with action class $c_i$, where $i$ denotes the $i$-th action instance and $1 \le i \le N$. Although a video may contain multiple action instances, the number of labeled timestamps $N$ is much smaller than the number of frames $T$, i.e., $N \ll T$. This reveals the sparse nature of timestamp supervision.
3.2 Overview
We propose a framework for timestamp supervised temporal action segmentation consisting of a warm-up phase and a mean teacher phase. As shown in Figure 2, we use the features extracted by I3D [17] as the input to our framework. We first train a segmentation model on the timestamp annotations, which we denote as the warm-up phase. Then we adopt a label generator to produce pseudo frame-wise labels, alleviating the sparsity of timestamp supervision. However, this introduces oscillation that degrades performance. To alleviate this problem, inspired by Mean Teacher [16], we introduce a teacher model parallel to the segmentation model, updated by the EMA of the segmentation model. In turn, the teacher model predictions are used to update the segmentation model. This forms a circular dependency that helps the framework remain stable and achieve better performance, which we denote as the mean teacher phase. The combination of the segmentation and teacher models resembles a real-life scenario: a student who encounters problems turns to a teacher, and solving the problems enriches the student's knowledge while the teacher gathers experience.
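To make the two-phase pipeline concrete, the sketch below outlines one possible training loop. It is illustrative rather than the authors' implementation: the segmentation model is treated as a single-output network mapping frame features to frame-wise action scores (MS-TCN actually produces multi-stage outputs), and the helper functions `partial_ce_loss`, `segmental_smooth_loss`, `frame_consistency_loss`, `ema_update`, and `generate_pseudo_labels` are hypothetical names sketched later in this section.

```python
import copy
import torch
import torch.nn.functional as F

def train(model, loader, lam=1.0, beta=0.5, warmup_epochs=30, teacher_epochs=20, lr=5e-4):
    """Illustrative two-phase training loop under the assumptions stated above."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # Warm-up phase: learn from the sparse timestamp annotations only.
    for _ in range(warmup_epochs):
        for feats, stamps, stamp_labels in loader:
            probs = model(feats).softmax(dim=-1)                 # (T, C) frame-wise probabilities
            loss = partial_ce_loss(probs, stamps, stamp_labels) \
                 + lam * segmental_smooth_loss(probs, stamps, stamp_labels)
            optimizer.zero_grad(); loss.backward(); optimizer.step()

    # Mean teacher phase: the teacher is an EMA copy of the segmentation (student) model.
    teacher = copy.deepcopy(model)
    for p in teacher.parameters():
        p.requires_grad_(False)
    step = 0
    for _ in range(teacher_epochs):
        for feats, stamps, stamp_labels in loader:
            logits = model(feats)                                # (T, C) action scores
            probs = logits.softmax(dim=-1)
            with torch.no_grad():
                teacher_probs = teacher(feats).softmax(dim=-1)
            pseudo = generate_pseudo_labels(logits.detach(), stamps, stamp_labels)
            loss = F.cross_entropy(logits, pseudo) \
                 + lam * segmental_smooth_loss(probs, stamps, stamp_labels) \
                 + beta * frame_consistency_loss(probs, teacher_probs)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
            ema_update(teacher, model, alpha=0.99, step=step)    # teacher receives no gradients
            step += 1
```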
3.3 Warm-up Phase

In the warm-up phase, we only have access to a sparse set of annotated frames. To train a segmentation model, we apply a cross entropy loss as the partial classification loss
$$\mathcal{L}_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\log \hat{y}_{t_i}(c_i) \qquad (1)$$

where $\hat{y}_{t_i}(c_i)$ denotes the predicted probability for the target action class $c_i$ at the annotated timestamp $t_i$. It ignores all frames that are not annotated. Note that the predicted probability $\hat{y}_t$ is the softmax output of the action scores $s_t \in \mathbb{R}^{C}$, where $C$ indicates the number of action classes.
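As an illustrative sketch (the function name, tensor shapes, and the use of PyTorch are our assumptions, not part of the original method), the partial classification loss can be computed as follows:

```python
import torch

def partial_ce_loss(probs, stamps, stamp_labels, eps=1e-8):
    """Cross entropy evaluated only at the N annotated timestamps (Eq. 1).

    probs:        (T, C) tensor of softmax probabilities
    stamps:       (N,)   long tensor of annotated frame indices t_i
    stamp_labels: (N,)   long tensor of annotated classes c_i
    """
    picked = probs[stamps, stamp_labels]      # probability of class c_i at frame t_i
    return -(picked + eps).log().mean()       # average over the N annotated frames
```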
Due to the sparse nature of timestamp supervision, it is observed that, from each action instance, the segmentation model only learns the annotated frame and other informative frames whose features are similar to it. Intuitively, the frames lying between the annotated frame and these informative frames should have similar predictions and should be predicted with high confidence. Based on this, we propose a segmentally smoothing loss
$$\mathcal{L}_{ss} = \frac{1}{N-1}\sum_{i=1}^{N-1}\frac{1}{T_i}\sum_{t=t_i+1}^{t_{i+1}} \tilde{\Delta}_{t,i}^{\,2}, \qquad \tilde{\Delta}_{t,i} = \min\!\left(\tau,\ \left|\log \hat{y}_t(c_i) - \log \hat{y}_{t-1}(c_i)\right|\right) \qquad (2)$$

where $\hat{y}_t(c_i)$ refers to the predicted probability for the annotated action class at time $t$, $T_i$ denotes the number of frames between the annotated frames $t_i$ and $t_{i+1}$, and $\tau$ is a hyperparameter. The segmentally smoothing loss encourages a smooth transition of the frames between high confidence frames and therefore suppresses the low confidence regions surrounded by high confidence frames (see Figure 3). It helps the segmentation model to learn not only from the annotated and the informative frames but also from the frames in low confidence regions. Unlike TMSE [4], which encourages smooth transition of the predicted probabilities over all frames of an input video and all action classes, the proposed loss accumulates the loss segmentally to encourage smooth transition for the labeled action categories within the action instances of the input video, making it more focused and cohesive.
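A possible realization of this loss is sketched below, assuming the TMSE-style truncated squared log-probability differences written in Eq. (2); the exact form is our reconstruction from the description above, not a verbatim copy of the original loss:

```python
import torch

def segmental_smooth_loss(probs, stamps, stamp_labels, tau=1.0, eps=1e-8):
    """Segmentally smoothing loss (sketch of Eq. 2, under our assumed form).

    probs:        (T, C) softmax probabilities
    stamps:       list of N sorted annotated frame indices
    stamp_labels: list of N annotated classes
    """
    log_p = (probs + eps).log()
    total, segments = 0.0, 0
    for i in range(len(stamps) - 1):
        lo, hi, c = int(stamps[i]), int(stamps[i + 1]), int(stamp_labels[i])
        seg = log_p[lo:hi + 1, c]                          # labeled-class log-probs in the segment
        delta = (seg[1:] - seg[:-1]).abs().clamp(max=tau)  # truncated frame-to-frame differences
        total = total + (delta ** 2).mean()                # averaged over the T_i frames
        segments += 1
    return total / max(segments, 1)
```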
The overall loss function for the segmentation model in the warm-up phase is a weighted sum of the above two losses
$$\mathcal{L}_{warm} = \mathcal{L}_{cls} + \lambda\,\mathcal{L}_{ss} \qquad (3)$$

where $\lambda$ is a hyperparameter that balances the contribution of each loss.
3.4 Mean Teacher Phase
Although the smoothing loss helps the model attend to the low confidence parts of an action instance, it still ignores many frames without annotations. Following [15], we apply a label generator to produce pseudo frame-wise labels by detecting action changes. However, we find that the generated pseudo frame-wise labels are not stable and cause oscillation during training. To alleviate this problem, we introduce a teacher model parallel to the segmentation model, with the same architecture, to help stabilize the system. We update the teacher model by the EMA of the segmentation model, which forms an ensemble of the current and earlier versions of the segmentation model. At the $n$-th iteration, the update is defined as
$$\theta'_n = \alpha\,\theta'_{n-1} + (1-\alpha)\,\theta_n \qquad (4)$$

where $\theta_n$ and $\theta'_n$ are the weights of the segmentation model and the teacher model respectively, and $\alpha$ is a hyperparameter. In the initial training iterations, the teacher model is a bare model without training. To help the teacher model learn quickly, we use the absolute average [18] instead of the EMA during these early iterations.
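A minimal sketch of the EMA update follows; realizing the early-iteration absolute average via the ramp-up commonly paired with Mean Teacher is our assumption, not a detail stated in the text:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.99, step=None):
    """Update the teacher weights as an EMA of the student weights (Eq. 4)."""
    if step is not None:
        # Cumulative (absolute) average in the early iterations, EMA afterwards.
        alpha = min(1.0 - 1.0 / (step + 1), alpha)
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(alpha).add_(p_s, alpha=1.0 - alpha)   # theta'_n = a*theta'_{n-1} + (1-a)*theta_n
```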
To impose a consistency constraint between the segmentation and teacher models, we adopt a frame consistency loss between their predictions, in the form of the mean squared error (MSE) [16]:
$$\mathcal{L}_{fc} = \frac{1}{T}\sum_{t=1}^{T}\left\lVert \hat{y}_t - \hat{y}'_t \right\rVert_2^2 \qquad (5)$$

where $\hat{y}_t$ and $\hat{y}'_t$ denote the predicted probabilities of the segmentation model and the teacher model, respectively. With this loss, the segmentation model learns from the teacher model, thereby forming a circular dependency as shown in Figure 2. It helps the two models and the generated pseudo labels remain stable.
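Sketched in code (again an illustrative helper, not the authors' implementation), the frame consistency loss is a plain MSE between the two probability sequences, with gradients stopped at the teacher output:

```python
import torch
import torch.nn.functional as F

def frame_consistency_loss(student_probs, teacher_probs):
    """Frame-wise MSE between student and teacher probabilities (Eq. 5)."""
    return F.mse_loss(student_probs, teacher_probs.detach())
```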
The overall loss function for the segmentation model in the mean teacher phase is a weighted sum of three losses
$$\mathcal{L}_{mt} = \mathcal{L}_{cls} + \lambda\,\mathcal{L}_{ss} + \beta\,\mathcal{L}_{fc} \qquad (6)$$

where $\beta$, akin to $\lambda$, is a hyperparameter. For simplicity, we use $\mathcal{L}_{cls}$ to denote both the pseudo classification loss and the partial classification loss. While they take the same cross-entropy form, the pseudo classification loss

$$\mathcal{L}_{cls} = -\frac{1}{T}\sum_{t=1}^{T}\log \hat{y}_{t}(\tilde{y}_t) \qquad (7)$$

is supervised by the pseudo frame-wise labels $\tilde{y}_t$. To generate the pseudo frame-wise labels, we regard the timestamp that minimizes the energy function in [15] as the action change between any two consecutive annotated frames. Each frame located between an annotated frame and the estimated action change is then assigned the label of that annotated frame. Different from [15], we use the action scores as the input feature to the energy function rather than the output of the penultimate layer of the segmentation model. It is worth mentioning that we experimentally find no benefit in performing the forward-backward estimation of [15], so we only adopt its stamp-to-stamp version. Note that the above loss is computed only for the segmentation model; no gradient is propagated to the teacher model.
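The following sketch illustrates stamp-to-stamp pseudo-label generation under a simplified energy (the within-segment squared deviation of the action scores from their segment means); it is a stand-in for, not a reproduction of, the energy function of [15]:

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(scores, stamps, stamp_labels):
    """Simplified stamp-to-stamp pseudo-label generation (illustrative sketch).

    scores:       (T, C) frame-wise action scores, used here as the feature
    stamps:       list of N sorted annotated frame indices
    stamp_labels: list of N annotated classes
    """
    T = scores.shape[0]
    pseudo = torch.full((T,), -1, dtype=torch.long)
    pseudo[: stamps[0] + 1] = stamp_labels[0]          # frames before the first stamp
    pseudo[stamps[-1]:] = stamp_labels[-1]             # frames after the last stamp
    for i in range(len(stamps) - 1):
        lo, hi = int(stamps[i]), int(stamps[i + 1])
        best_t, best_e = lo + 1, float("inf")
        for t in range(lo + 1, hi + 1):                # candidate action-change positions
            left, right = scores[lo:t], scores[t:hi + 1]
            energy = ((left - left.mean(0)) ** 2).sum() + ((right - right.mean(0)) ** 2).sum()
            if energy < best_e:
                best_e, best_t = energy, t
        pseudo[lo:best_t] = stamp_labels[i]            # label of the left annotated frame
        pseudo[best_t:hi + 1] = stamp_labels[i + 1]    # label of the right annotated frame
    return pseudo
```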
4 Experiments
4.1 Datasets and Evaluation Metrics
We evaluate our method on three public datasets, namely Georgia Tech Egocentric Activities (GTEA)[19], 50Salads[20] and Breakfast[14].
GTEA contains 28 videos, 11 action classes and 31,222 (~31K) frames. It covers 7 daily activities which were performed by 4 actors in a kitchen. For evaluation, we use fourfold cross-validation and report the average as in [4].
50Salads contains 50 videos with 17 action classes and 518,411 (~520K) frames. On average, each video contains 20 action instances and is 6.4 minutes long. These action instances were performed by 25 actors and recorded from the top view, where each actor prepared a mixed salad twice. We use fivefold cross-validation for evaluation and report the average following [4].
Breakfast is the largest among the three datasets. It contains 1712 third-person view videos related to breakfast preparation activities. These videos were recorded in 18 different kitchens. Overall, there are 48 different action classes and roughly 3.6M frames. Each video contains 6 action instances on average. To evaluate, we use the standard 4 splits proposed in [14] and report the average.
4.2 Implementation Details
We take the I3D [17] features extracted by [4] as the input to our framework and MS-TCN [4] as our segmentation model. Following [15], we use two parallel branches with kernel sizes 3 and 5 in the first stage, change the number of layers in the first stage from 10 to 12, and pass the sum of both outputs to the subsequent stages. We train our model for 50 epochs with the Adam optimizer; the warm-up phase takes the first 30 epochs, while the mean teacher phase takes the remaining 20 epochs. We use a learning rate of 0.0001 for Breakfast and 0.0005 for GTEA and 50Salads, and multiply it by 0.1 every 40 epochs. The hyperparameters are determined empirically: the EMA decay $\alpha$ is set to 0.9, 0.99 and 0.999 for GTEA, 50Salads and Breakfast respectively, and both $\lambda$ and $\tau$ are set to 1. To help the teacher model learn progressively, we increase the consistency weight $\beta$ linearly from 0 to 0.5 over the 20 mean teacher epochs.
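For reference, a simple linear ramp-up of the consistency weight over the mean teacher phase could look like the sketch below; the exact schedule beyond "linearly from 0 to 0.5 during the 20 epochs" is our assumption:

```python
def consistency_weight(epoch_in_phase, phase_epochs=20, beta_max=0.5):
    """Linearly ramp the consistency weight beta from 0 to beta_max over the phase."""
    return beta_max * min(1.0, epoch_in_phase / phase_epochs)
```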
During inference, we take the output of the teacher model as the final prediction [16] and classify each frame as the class with the maximum probability.
4.3 Comparison with the State-of-the-art
Table 1: Comparison with methods under different levels of supervision on 50Salads (F1@{10, 25, 50} (%), Edit, and frame-wise accuracy).

| Supervision | Method | F1@10 | F1@25 | F1@50 | Edit | Acc |
|---|---|---|---|---|---|---|
| Full | MS-TCN [4] | 76.3 | 74.0 | 64.5 | 67.9 | 80.7 |
| Full | MS-TCN++ [5] | 80.7 | 78.5 | 70.1 | 74.3 | 83.7 |
| Full | BCN [6] | 82.3 | 81.3 | 74.0 | 74.3 | 84.4 |
| Full | ASRF [7] | 84.9 | 83.5 | 77.3 | 79.3 | 84.5 |
| Transcript | NN-Viterbi [9] | - | - | - | - | 49.4 |
| Transcript | CDFL [10] | - | - | - | - | 54.7 |
| Timestamp | Li et al. [15] | 73.9 | 70.9 | 60.1 | 66.8 | 75.6 |
| Timestamp | Ours | 78.5 | 75.5 | 63.4 | 71.8 | 77.7 |
Table 2: Comparison with methods under different levels of supervision on Breakfast (F1@{10, 25, 50} (%), Edit, and frame-wise accuracy).

| Supervision | Method | F1@10 | F1@25 | F1@50 | Edit | Acc |
|---|---|---|---|---|---|---|
| Full | MS-TCN [4] | 52.6 | 48.1 | 37.9 | 61.7 | 66.3 |
| Full | MS-TCN++ [5] | 64.1 | 58.6 | 45.9 | 65.6 | 67.6 |
| Full | BCN [6] | 68.7 | 65.5 | 55.0 | 66.2 | 70.4 |
| Full | ASRF [7] | 74.3 | 68.9 | 56.1 | 72.4 | 67.6 |
| Transcript | NN-Viterbi [9] | - | - | - | - | 43.0 |
| Transcript | CDFL [10] | - | - | - | - | 50.2 |
| Set | ActionSet [11] | - | - | - | - | 23.3 |
| Set | SCV [12] | - | - | - | - | 30.2 |
| Set | ACV [13] | - | - | - | - | 33.4 |
| Timestamp | Li et al. [15] | 70.5 | 63.6 | 47.4 | 69.9 | 64.1 |
| Timestamp | Ours | 73.1 | 66.5 | 49.4 | 72.6 | 64.6 |
We compare our method with recent approaches under different levels of supervision. The results are presented in Tables 1, 2 and 3. As shown in these tables, our method outperforms the state-of-the-art timestamp supervised approach by a large margin on all datasets (up to 9.4% for the F1 score at the 50% IoU threshold on GTEA). At the same time, our method exceeds the approaches under weaker levels of supervision, i.e., transcripts or sets, at a comparable annotation cost. Furthermore, our method even performs comparably to the fully-supervised methods in terms of F1 scores and Edit at a much lower annotation cost. Despite lacking action boundary information, our method achieves a frame-wise accuracy close to that of the fully-supervised MS-TCN [4].
4.4 Ablation Study
Table 4: Ablation study on the teacher model and the smoothing loss on GTEA and 50Salads.

| Dataset | Teacher model | Smoothing loss | F1@10 | F1@25 | F1@50 | Edit | Acc |
|---|---|---|---|---|---|---|---|
| GTEA | | segmental (ours) | 82.0 | 77.0 | 58.1 | 76.9 | 71.7 |
| GTEA | ✓ | none | 78.3 | 75.2 | 55.4 | 70.8 | 67.0 |
| GTEA | ✓ | TMSE [4] | 79.0 | 72.6 | 49.1 | 75.3 | 59.6 |
| GTEA | ✓ | segmental (ours) | 84.3 | 81.7 | 64.8 | 79.8 | 74.4 |
| 50Salads | | segmental (ours) | 77.8 | 75.1 | 63.1 | 69.6 | 77.7 |
| 50Salads | ✓ | none | 62.5 | 58.6 | 45.8 | 54.0 | 72.1 |
| 50Salads | ✓ | TMSE [4] | 75.7 | 71.8 | 60.0 | 67.5 | 76.9 |
| 50Salads | ✓ | segmental (ours) | 78.5 | 75.5 | 63.4 | 71.8 | 77.7 |
Study on Teacher Model. To validate the efficacy of the proposed teacher model, we compare the results of our framework with and without the teacher model, adopting the conventional two-step training scheme for the latter. As shown in Table 4, the framework with the teacher model outperforms the one without it. We also plot the evolution of the frame-wise accuracy during the mean teacher phase for the two settings in Figure 1(a), which demonstrates that the teacher model maintains stability during training and achieves favorable performance. In contrast, the conventional two-step training scheme (without the teacher model) suffers from severe oscillation, which degrades performance.
Study on Smoothing Loss.

We also verify the effectiveness of the proposed segmentally smoothing loss. We list the results of three runs on GTEA and 50Salads in Table 4: one without any smoothing loss, one with TMSE [4], and one with the proposed segmentally smoothing loss. As analyzed earlier, the run without a smoothing loss suffers from severe over-segmentation errors, indicated by its low F1 scores and Edit. The proposed segmentally smoothing loss outperforms TMSE on the F1 scores of both datasets. Furthermore, to illustrate the impact of the proposed segmentally smoothing loss visually, we show two representative examples from 50Salads in Figure 4.
5 Conclusion
In this paper, we propose a new framework for timestamp supervised temporal action segmentation by introducing a teacher model that helps the segmentation model yield more accurate predictions. With the help of the teacher model, the instability of the pseudo labels is alleviated. We also propose a segmentally smoothing loss that encourages smooth transition within each action instance and penalizes low confidence regions surrounded by high confidence regions. Our method outperforms the state-of-the-art method on three datasets and further narrows the performance gap between timestamp supervision and full supervision.
Acknowledgements: This work is supported by the National Key R&D Program of China (No. 2018AAA0102002) and the National Natural Science Foundation of China (Grant No. 61672285).
References
- [1] Hamed Pirsiavash and Deva Ramanan, “Parsing videos of actions with segmental grammars,” in CVPR, 2014, pp. 612–619.
- [2] Colin Lea, Austin Reiter, René Vidal, and Gregory D Hager, “Segmental spatiotemporal cnns for fine-grained action segmentation,” in ECCV. Springer, 2016, pp. 36–52.
- [3] Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager, “Temporal convolutional networks for action segmentation and detection,” in CVPR, 2017, pp. 156–165.
- [4] Yazan Abu Farha and Jurgen Gall, “Ms-tcn: Multi-stage temporal convolutional network for action segmentation,” in CVPR, 2019, pp. 3575–3584.
- [5] Shi-Jie Li, Yazan Abu Farha, Yun Liu, Ming-Ming Cheng, and Juergen Gall, “Ms-tcn++: Multi-stage temporal convolutional network for action segmentation,” TPAMI, 2020.
- [6] Zhenzhi Wang, Ziteng Gao, Limin Wang, Zhifeng Li, and Gangshan Wu, “Boundary-aware cascade networks for temporal action segmentation,” in ECCV. Springer, 2020, pp. 34–51.
- [7] Yuchi Ishikawa, Seito Kasai, Yoshimitsu Aoki, and Hirokatsu Kataoka, “Alleviating over-segmentation errors by detecting action boundaries,” in WACV, 2021, pp. 2322–2331.
- [8] Li Ding and Chenliang Xu, “Weakly-supervised action segmentation with iterative soft boundary assignment,” in CVPR, 2018, pp. 6508–6516.
- [9] Alexander Richard, Hilde Kuehne, Ahsan Iqbal, and Juergen Gall, “Neuralnetwork-viterbi: A framework for weakly supervised video learning,” in CVPR, 2018, pp. 7386–7395.
- [10] Jun Li, Peng Lei, and Sinisa Todorovic, “Weakly supervised energy-based learning for action segmentation,” in ICCV, 2019, pp. 6243–6251.
- [11] Alexander Richard, Hilde Kuehne, and Juergen Gall, “Action sets: Weakly supervised action segmentation without ordering constraints,” in CVPR, 2018, pp. 5987–5996.
- [12] Jun Li and Sinisa Todorovic, “Set-constrained viterbi for set-supervised action segmentation,” in CVPR, 2020, pp. 10820–10829.
- [13] Jun Li and Sinisa Todorovic, “Anchor-constrained viterbi for set-supervised action segmentation,” in CVPR, 2021, pp. 9806–9815.
- [14] Hilde Kuehne, Ali Arslan, and Thomas Serre, “The language of actions: Recovering the syntax and semantics of goal-directed human activities,” in CVPR, 2014, pp. 780–787.
- [15] Zhe Li, Yazan Abu Farha, and Jurgen Gall, “Temporal action segmentation from timestamp supervision,” in CVPR, 2021, pp. 8365–8374.
- [16] Antti Tarvainen and Harri Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” arXiv:1703.01780, 2017.
- [17] Joao Carreira and Andrew Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in CVPR, 2017, pp. 6299–6308.
- [18] Hongjun Chen, Jinbao Wang, Hong Cai Chen, Xiantong Zhen, Feng Zheng, Rongrong Ji, and Ling Shao, “Seminar learning for click-level weakly supervised semantic segmentation,” in ICCV, 2021, pp. 6920–6929.
- [19] Alireza Fathi, Xiaofeng Ren, and James M Rehg, “Learning to recognize objects in egocentric activities,” in CVPR, 2011, pp. 3281–3288.
- [20] Sebastian Stein and Stephen J McKenna, “Combining embedded accelerometers with computer vision for recognizing food preparation activities,” in Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing, 2013, pp. 729–738.