Weakly Supervised Online Action Detection for Infant General Movements
Abstract
Early diagnosis of brain damage is critical for early medical intervention in infants with cerebral palsy (CP). Although general movements assessment (GMA) has shown promising results in early CP detection, it is laborious. Most existing works take videos as input and perform fidgety movements (FMs) classification to automate GMA. Those methods require a complete observation of the video and cannot localize the video frames containing normal FMs. Therefore, we propose a novel approach named WO-GMA to perform FMs localization in the weakly supervised online setting. Infant body keypoints are first extracted as the input to WO-GMA. WO-GMA then performs local spatio-temporal extraction followed by two network branches to generate pseudo clip labels and to model online actions. With the clip-level pseudo labels, the action modeling branch learns to detect FMs in an online fashion. Experimental results on a dataset with 757 videos of different infants show that WO-GMA achieves state-of-the-art video-level classification and clip-level detection results. Moreover, only the first 20% of the video duration is needed to obtain classification results as good as with full observation, implying a significantly shortened FMs diagnosis time. Code is available at: https://github.com/scofiedluo/WO-GMA.
Keywords:
Online action detection · Weakly supervised · General movements assessment · Fidgety movements (FMs)
1 Introduction
Clinical and public health problems among surviving high-risk infants are common worldwide. Taking preterm infants, the largest group of high-risk infants, as an example: they may face various complications that affect quality of life, such as delays in language, cognition, motor skills, and intelligence, and may even develop cerebral palsy (CP) [16]. Early diagnosis of CP is critical for early intervention in such high-risk preterm infants.
Studies indicate that general movements assessment (GMA), with high sensitivity and specificity, is the most cost-effective and accurate tool for early diagnosis of CP [10]. Fidgety movements (FMs) constitute an important stage of general movements (GMs). Normal FMs (F+) are smooth circular movements involving the whole body, including the neck, trunk, and limbs. These movements are small in amplitude, moderate in speed, and variable in all directions [5, 20]. The absence or sporadic occurrence of FMs (F-) is a strong indicator of CP risk in infants [4]. Qualified assessors usually watch videos of infants to identify the absence or sporadic occurrence of FMs.
Though GMA is highly accurate, there is a great shortage of qualified assessors, and the assessment is time-consuming. Many works apply machine learning or deep learning methods to automate GMA. Generally, according to sensor type, automated GMA can be categorized as vision-sensor-based or motion-sensor-based [11]. Since 2D camera data is easier to collect, we focus on vision-based methods. In [21], RGB frames are directly processed by a VGG network followed by an LSTM to capture temporal information. However, RGB frames contain much irrelevant noise in GMA scenes, such as illumination, background, and camera properties. Most vision-based works [18, 3, 22, 17] argue that extracting body keypoints from the video, followed by a keypoint motion feature analyzer, is more robust. Existing works focus on video classification after fully observing the video, leaving two critical problems unaddressed. First, for real-world application of automated GMA, deep learning methods need greater interpretability. Since unhealthy infants do not exhibit normal FMs over long periods, infants can be assessed by localizing when normal FMs occur. Second, if the assessment can be completed by partially observing the video, the recording time (diagnosis time) can be shortened.
This paper addresses the above two key challenges by proposing a framework named WO-GMA in the weakly supervised online action detection setting. Most online action detection methods [8, 23, 24, 6] rely on frame-level annotations of action boundaries for training. Annotating action boundaries involves ambiguous decisions and is laborious, especially for FMs mixed with other movements; hence weakly supervised methods are preferred for FMs detection. Many weakly supervised action detection methods [7, 19, 25] utilize multiple instance learning [26] or contrastive learning [9] to train models with video-level labels. Following previous works, we first extract infants' 2D poses from the videos as the input of WO-GMA. The pipeline of our method is as follows. WO-GMA contains a local spatio-temporal extraction module followed by two branches, one to generate pseudo labels and one to model online actions. The local module uses a 3D graph convolutional network [15] to capture complex spatio-temporal features from the extracted infant poses. Supervised by video-level labels, the clip-level pseudo-label generating branch mines temporal labels by combining local features with long-range information. The online action modeling branch utilizes the generated pseudo labels to conduct clip-level action detection without future information.
Contributions: (1) We are the first to develop an online action detection method for this task and to report frame-level recognition results. (2) We validate WO-GMA on our dataset of 757 videos of different infants. Experiments show that the video-level FMs prediction of our method outperforms existing automated GMA models. (3) Experiments demonstrate that we can obtain accurate video-level results when only the first 20% of the video frames are used, implying a shortened FMs diagnosis time.
2 Methodology
We use a sequence of 2D keypoints estimated from the RGB video frames as our network input. Figure 1 shows the architecture of our proposed skeleton-based weakly supervised online action detection model, which consists of three main components. First, a local feature extraction module (LFEM), containing a spatio-temporal graph network followed by joint fusing, extracts complex local features from the skeletons. Second, a clip-level pseudo-label generating branch (CPGB), supervised by video-level labels, captures bidirectional long-range information. Third, an online action modeling branch (OAMB), supervised by the generated pseudo labels, detects actions without future information. These components are detailed in this section.

2.1 Local Feature Extraction Module
Human pose keypoints, i.e., the skeleton, are usually denoted as a graph $G=(V,E)$, where the node set $V=\{v_1,\dots,v_N\}$ represents the $N$ joints and the edge set $E$ represents the connectivity between joints. Formally, we use an adjacency matrix $A\in\{0,1\}^{N\times N}$ to denote the edge set, where $A_{ij}=1$ if there is an edge between $v_i$ and $v_j$, and $A_{ij}=0$ otherwise. Assume that we are given a skeleton sequence of length $T$, where each joint is described by a feature vector of dimension $C$. The feature of this sequence can then be written as $X\in\mathbb{R}^{T\times N\times C}$.
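For concreteness, the sketch below (Python/NumPy) builds a binary adjacency matrix with self-loops and its symmetrically normalized form for an 18-joint skeleton. The edge list is illustrative only (an OpenPose/COCO-style layout); the exact topology used in this work may differ.

```python
import numpy as np

# Illustrative 18-joint edge list (OpenPose/COCO-style layout; assumption,
# not taken from the paper).
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
         (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13),
         (0, 14), (14, 16), (0, 15), (15, 17)]
N = 18  # number of joints


def build_adjacency(edges, n):
    """Binary adjacency matrix A with self-loops on the diagonal."""
    A = np.eye(n, dtype=np.float32)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A


def normalize(A):
    """Symmetric normalization D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(d)
    d_inv_sqrt[d > 0] = d[d > 0] ** -0.5
    return np.diag(d_inv_sqrt) @ A @ np.diag(d_inv_sqrt)


A_hat = normalize(build_adjacency(EDGES, N))
# A skeleton sequence is then a tensor X of shape (T, N, C),
# e.g. T frames, N = 18 joints, C = 2 for the (x, y) coordinates.
X = np.random.randn(6000, N, 2).astype(np.float32)
```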
As detailed in Section 1, FMs have complex movement patterns. To model the local spatio-temporal information well, we need to capture both the relations between different joints within one frame and the variation of joints over time. Multi-scale graph convolution is often used to fuse long-range relations within one graph [14]. MS-G3D [15] utilizes cross-spacetime skip connections to construct a spatio-temporal subgraph and proposes a disentangled multi-scale graph convolution to model human action dynamics. We extend the feature extractor of MS-G3D with vertex fusing to capture the local spatio-temporal information of FMs.
In detail, we first split the pose sequence into a clip-level set $\{X^{(1)},\dots,X^{(T')}\}$ with a sliding window of temporal size $\tau$ and stride $s$, where $T'=\lfloor (T-\tau)/s \rfloor + 1$. Then a spatio-temporal graph is constructed by tiling $A$ into a block adjacency matrix $A_\tau\in\{0,1\}^{\tau N\times \tau N}$. A multi-scale graph convolution is applied to each clip feature tensor $X^{(c)}\in\mathbb{R}^{\tau\times N\times C}$ as
$$X^{(c)}_{\mathrm{out}} = \sigma\Big(\sum_{k=0}^{K} \widetilde{D}_{(k)}^{-\frac{1}{2}}\, \widetilde{A}_{(k)}\, \widetilde{D}_{(k)}^{-\frac{1}{2}}\, X^{(c)}\, W_{(k)}\Big) \qquad (1)$$
where $X^{(c)}_{\mathrm{out}}$ is the output feature, $\sigma$ is the activation function, and $K$ is the number of scales to aggregate. $\widetilde{D}_{(k)}$ is the diagonal degree matrix of the disentangled $k$-hop adjacency matrix $\widetilde{A}_{(k)}$ built from $A_\tau$, and $W_{(k)}$ denotes the learnable parameter matrix of scale $k$.
For each clip, $X^{(c)}_{\mathrm{out}}$ is collapsed into one skeleton by a 3D convolution operator whose kernel spans the temporal dimension of the window, yielding a feature tensor $Z^{(c)}$. Each joint in $Z^{(c)}$ has thus gathered rich information from its spatio-temporal neighborhood. To extract a more complex, locally fused spatio-temporal feature, we aggregate the joint information as $x_c=\mathrm{Agg}(Z^{(c)})$, where the subscript $c$ denotes the clip index. We use a 2D convolution followed by ReLU as the aggregator in this paper. We thus obtain a feature sequence $\{x_1,\dots,x_{T'}\}$. All clips share the same parameters in this module.
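The following PyTorch sketch illustrates the LFEM pipeline under simplifying assumptions: a single-scale graph convolution stands in for the disentangled multi-scale MS-G3D block, and the layer sizes (`hid_dim`, `out_dim`) are placeholders rather than the values used in the paper. It reuses `A_hat` from the previous sketch.

```python
import torch
import torch.nn as nn


class LocalFeatureExtractor(nn.Module):
    """Simplified LFEM sketch: per-window graph convolution (single-scale
    stand-in for MS-G3D), temporal collapse, and joint aggregation."""

    def __init__(self, a_hat, in_dim=2, hid_dim=64, out_dim=256, window=20, stride=20):
        super().__init__()
        self.register_buffer("a_hat", a_hat)           # (N, N) normalized adjacency
        self.window, self.stride = window, stride
        self.gcn = nn.Linear(in_dim, hid_dim)           # feature transform W
        self.collapse = nn.Conv2d(hid_dim, hid_dim, kernel_size=(window, 1))       # collapse time
        self.agg = nn.Conv2d(hid_dim, out_dim, kernel_size=(1, a_hat.shape[0]))    # fuse joints
        self.relu = nn.ReLU()

    def forward(self, x):                               # x: (B, T, N, C)
        B, T, N, C = x.shape
        # split into clips with a sliding window over time
        clips = x.unfold(1, self.window, self.stride)   # (B, T', N, C, window)
        clips = clips.permute(0, 1, 4, 2, 3)            # (B, T', window, N, C)
        # graph convolution: aggregate neighbor joints, then transform features
        h = torch.einsum("vu,btwuc->btwvc", self.a_hat, clips)
        h = self.relu(self.gcn(h))                      # (B, T', window, N, hid)
        # collapse each window into one skeleton, then fuse joints into one vector
        Bp, Tp = h.shape[0], h.shape[1]
        h = h.reshape(Bp * Tp, self.window, N, -1).permute(0, 3, 1, 2)  # (B*T', hid, window, N)
        h = self.relu(self.collapse(h))                 # (B*T', hid, 1, N)
        h = self.relu(self.agg(h))                      # (B*T', out, 1, 1)
        return h.reshape(Bp, Tp, -1)                    # (B, T', out_dim) clip features


# Example (reusing A_hat from the previous sketch):
#   lfem = LocalFeatureExtractor(torch.as_tensor(A_hat))
#   feats = lfem(torch.randn(2, 6000, 18, 2))           # -> (2, 300, 256)
```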
2.2 Clip-level Pseudo Labels Generating Branch
At inference time in weakly supervised online action detection, only accumulated historical information is available. We therefore introduce a branch that generates pseudo labels using future and long-range information during training. Since infants' FMs are continual and hard to distinguish from other mixed movements [5], combining future information with a long temporal receptive field helps both classification and detection.
Taking the features $\{x_1,\dots,x_{T'}\}$ extracted by LFEM as input, we use 1D temporal convolutions followed by ReLU to aggregate long-range temporal information from neighboring clip features. The output of the last 1D convolution layer is the feature sequence $\{f_1,\dots,f_{T'}\}$. Then, for each clip, fully connected layers are used to obtain the action scores $s_t\in\mathbb{R}^{C}$, where $s_{t,j}$ is the score of action class $j$ for clip $t$.
The multiple instance learning loss (MILL) is widely used to obtain accurate action scores [19, 7]. In this paper, we consider the entire sequence of video clips as a bag of instances, where each instance is one clip. To use the video-level labels, we compute a video-level score for each action class with a Top-K strategy, i.e., $v_j=\frac{1}{k}\sum_{t\in\mathcal{K}_j}s_{t,j}$, where $\mathcal{K}_j$ is the set of Top-K clip indices for class $j$ and $k=\max(1,\lfloor T'/r\rfloor)$ with hyperparameter $r$. Then, to obtain the action class probabilities $p_j$, a sigmoid (or softmax for multi-class datasets) is applied to the video-level scores. The MILL is computed as
$$\mathcal{L}_{\mathrm{MIL}} = -\sum_{j=1}^{C}\big[\, y_j \log p_j + (1-y_j)\log(1-p_j)\,\big] \qquad (2)$$
where $y_j$ and $p_j$ are the ground truth label and predicted probability of class $j$. Supervised by $\mathcal{L}_{\mathrm{MIL}}$, the network learns clip-level scores, which are used to generate clip-level pseudo labels with future information via a two-stage threshold strategy [7]. First, an action class is discarded if its video-level score is below a threshold $\theta_{\mathrm{vid}}$. Then, for the remaining action classes, a second threshold $\theta_{\mathrm{clip}}$ is applied to the clip-level action scores to obtain pseudo labels $\hat{y}_t$. After that, the video-level ground truth labels are used to filter out wrong pseudo labels.
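A minimal PyTorch sketch of CPGB follows, under several assumptions: two temporal convolution layers (the actual depth is not specified here), Top-K pooling by averaging the top-k logits, and thresholds applied to sigmoid probabilities. `mil_loss` corresponds to Eq. (2) and `pseudo_labels` to the two-stage thresholding; the hyperparameters mirror Section 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PseudoLabelBranch(nn.Module):
    """CPGB sketch: temporal 1D convolutions over clip features + per-clip scores."""

    def __init__(self, feat_dim=256, num_classes=1):
        super().__init__()
        self.temporal = nn.Sequential(
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):                            # feats: (B, T', D)
        h = self.temporal(feats.transpose(1, 2)).transpose(1, 2)
        return self.classifier(h)                        # clip scores: (B, T', C)


def mil_loss(clip_scores, video_labels, r=8):
    """Top-K pooling per class followed by binary cross-entropy (Eq. 2)."""
    B, Tp, C = clip_scores.shape
    k = max(1, Tp // r)
    video_scores = clip_scores.topk(k, dim=1).values.mean(dim=1)   # (B, C)
    return F.binary_cross_entropy_with_logits(video_scores, video_labels.float())


@torch.no_grad()
def pseudo_labels(clip_scores, video_labels, theta_vid=0.4, theta_clip=0.3, r=8):
    """Two-stage thresholding: drop classes with a low video-level score, then
    threshold clip scores; finally keep only classes present in the video label.
    (Thresholds are applied to sigmoid probabilities here; an assumption.)"""
    B, Tp, C = clip_scores.shape
    k = max(1, Tp // r)
    probs = clip_scores.sigmoid()
    video_probs = probs.topk(k, dim=1).values.mean(dim=1)            # (B, C)
    keep = (video_probs > theta_vid).float() * video_labels.float()  # (B, C)
    return (probs > theta_clip).float() * keep.unsqueeze(1)          # (B, T', C)
```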
2.3 Online Action Modeling Branch
As shown above, CPGB can utilize future information during training, but future information is not available during inference. In this branch, an LSTM is therefore used to accumulate historical temporal information.
Given one clip-level feature $x_t$ and the previous hidden and cell states $(h_{t-1}, c_{t-1})$ as input, the updated states are obtained by $(h_t, c_t)=\mathrm{LSTM}(x_t, h_{t-1}, c_{t-1})$. For the $t$-th clip, the online action score is computed as $s^{\mathrm{on}}_t=\mathrm{FC}(h_t)\in\mathbb{R}^{C+1}$, i.e., the action scores including a background class. Then, a cross-entropy loss is applied over all clips with the pseudo labels to obtain the frame loss, i.e.
$$\mathcal{L}_{\mathrm{frame}} = -\frac{1}{T'}\sum_{t=1}^{T'}\sum_{j=0}^{C} \hat{y}_{t,j}\,\log p^{\mathrm{on}}_{t,j} \qquad (3)$$
where $p^{\mathrm{on}}_{t,j}$ is the softmax probability of class $j$ (class $0$ denoting background) computed from $s^{\mathrm{on}}_t$, and $\hat{y}_t$ is the pseudo label of clip $t$.
To further utilize the ground truth video-level information, another MILL is used in this branch. As shown in Figure 1, we use the same Top-K strategy as in CPGB to obtain video-level scores for each action class. We denote the MILL of this branch by $\mathcal{L}^{\mathrm{on}}_{\mathrm{MIL}}$.
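The sketch below illustrates OAMB: a unidirectional LSTM over clip features with a classifier covering the C action classes plus background, and a frame loss in the spirit of Eq. (3) computed against the pseudo labels produced above. The conversion of multi-label pseudo labels to a single class index is an assumption made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class OnlineActionBranch(nn.Module):
    """OAMB sketch: an LSTM accumulates historical clip features and emits
    per-clip scores over C action classes plus a background class."""

    def __init__(self, feat_dim=256, hidden=1024, num_classes=1):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes + 1)     # index 0 = background

    def forward(self, feats):                            # feats: (B, T', D)
        h, _ = self.lstm(feats)                          # causal: no future clips used
        return self.fc(h)                                # (B, T', C + 1)


def frame_loss(online_scores, clip_pseudo):
    """Cross-entropy against pseudo labels (cf. Eq. 3); a clip with no active
    action class is treated as background (class 0)."""
    B, Tp, _ = online_scores.shape
    has_action = clip_pseudo.sum(dim=-1) > 0                          # (B, T')
    target = torch.where(has_action,
                         clip_pseudo.argmax(dim=-1) + 1,              # shift past background
                         torch.zeros_like(has_action, dtype=torch.long))
    return F.cross_entropy(online_scores.reshape(B * Tp, -1), target.reshape(-1))
```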
2.4 Training and Inference
In the training stage, the two branches and the local feature extraction module are jointly optimized by the total loss combining $\mathcal{L}_{\mathrm{MIL}}$, $\mathcal{L}_{\mathrm{frame}}$, and $\mathcal{L}^{\mathrm{on}}_{\mathrm{MIL}}$. During inference, future information is not available in the online action detection setting, so we only use the local feature extraction module and the online action modeling branch.
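A hypothetical training step wiring together the sketches above is shown below. Summing the three losses with equal weights is an assumption; the original work may weight the terms differently.

```python
def training_step(lfem, cpgb, oamb, x, video_labels):
    # relies on the LFEM/CPGB/OAMB sketches and loss helpers defined above
    feats = lfem(x)                                      # (B, T', D) clip features
    clip_scores = cpgb(feats)                            # offline branch, sees all clips
    online_scores = oamb(feats)                          # online branch, causal LSTM
    targets = pseudo_labels(clip_scores, video_labels)   # clip-level pseudo labels
    loss = (mil_loss(clip_scores, video_labels)                      # L_MIL on CPGB
            + frame_loss(online_scores, targets)                     # L_frame on OAMB
            + mil_loss(online_scores[..., 1:], video_labels))        # MILL on OAMB action scores
    return loss
```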
3 Experiments
Dataset. For this study, ethical approval was provided by the Ethics Committee of Children's Hospital of Shanghai (Review number: 2021RY053-E01). Written informed consent was obtained from the parents/legal guardians in accordance with the Declaration of Helsinki. Our original dataset, provided by three hospitals, contains 792 videos. Two certified observers, who hold the GMs Trust qualification, made the video-level GMs classification; in case of disagreement, a third certified observer re-assessed the video. The eligible participants were high-risk infants (premature birth, low birth weight, suspected or confirmed brain injury, chronic disease at birth, genetic or metabolic disease). Video recording was conducted according to the standard protocol [5]. After removing duplicate videos, the dataset contains 757 videos of 757 different infants at around 46–70 weeks gestational age (average: 55 weeks), including 353 F- videos and 404 F+ videos, and 434 male and 323 female infants. The resolution of 678 videos is 1920×1080; the others are 1280×720, 1440×1080, 960×540, or 720×576. The average number of frames is 7787 (range 5901–16322), and the average duration is 307s (range 246–653s). We split the training and test sets with a ratio of 8:2, ensuring that videos with different labels and from different hospitals are divided evenly.
Evaluation Metrics. For video-level performance, we report classification accuracy, F1-score, and Area Under the Curve (AUC). For detection results, following previous works [19, 25], we use the standard evaluation protocol and report mean Average Precision (mAP) under different intersection-over-union (IoU) thresholds.
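For reference, the temporal IoU used by this protocol compares a predicted action segment against a ground-truth segment; a minimal sketch with a worked example:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in frames or seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Example: temporal_iou((10, 50), (30, 80)) = 20 / 70 ≈ 0.29, so this detection
# counts as a true positive at IoU thresholds 0.1 and 0.2, but not at 0.3.
```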
Implementation details. OpenPose [1] is used to extract skeletons with 18 joints from the videos, and the skeleton preprocessing method of [3] is adopted. For convenience, skeleton sequences longer than 6000 frames are truncated to the first 6000 frames; shorter sequences are zero-padded at the end to 6000 frames. We implemented WO-GMA in PyTorch and performed experiments on a system with Nvidia 3090 GPUs. We train the network for 100 epochs using the Adam optimizer [13]. The window size $\tau$ and stride $s$ are both set to 20, and the parameter $r$ in the Top-K strategy of both branches is 8. The video threshold $\theta_{\mathrm{vid}}$ and clip threshold $\theta_{\mathrm{clip}}$ in pseudo-label generation are 0.4 and 0.3, respectively. The dimension of the LSTM hidden state is 1024. Since we only focus on F+, the number of action classes is $C=1$.
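The truncation/padding step described above can be sketched as follows (shapes as in Section 2.1):

```python
import numpy as np


def fix_length(seq, target=6000):
    """Truncate or zero-pad a skeleton sequence of shape (T, N, C) to a fixed
    length, matching the preprocessing described above."""
    if seq.shape[0] >= target:
        return seq[:target]
    pad = np.zeros((target - seq.shape[0],) + seq.shape[1:], dtype=seq.dtype)
    return np.concatenate([seq, pad], axis=0)
```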
3.1 Main Results
Table 1: Video-level classification (%) and clip-level detection mAP@IoU (%) results. Skeleton-based results are reported as mean (deviation) over 5-fold cross-validation.

| Input | Method | Accuracy | F1 | AUC | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | mean mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| Image | WOAD [7] | 53.2 | 69.3 | 48.0 | 2.7 | 1.5 | 1.5 | 0.3 | 0.3 | 1.3 |
| Image | W-TALC [19] | 52.6 | 42.5 | 49.4 | 2.1 | 1.0 | 0.2 | 0.0 | 0.0 | 0.7 |
| Optical flow | WOAD [7] | 54.5 | 63.5 | 51.0 | 11.5 | 7.8 | 5.0 | 3.2 | 1.9 | 5.9 |
| Optical flow | W-TALC [19] | 57.8 | 67.9 | 55.9 | 5.6 | 1.1 | 0.1 | 0.0 | 0.0 | 1.4 |
| Fusion | WOAD [7] | 55.2 | 69.9 | 49.4 | 11.5 | 8.7 | 5.5 | 3.9 | 1.7 | 6.3 |
| Fusion | W-TALC [19] | 54.5 | 61.5 | 50.5 | 1.0 | 0.2 | 0.0 | 0.0 | 0.0 | 0.2 |
| Skeleton | Zhu et al. [27] | 84.5 (2.0) | 85.1 (2.0) | 84.7 (2.0) | - | - | - | - | - | - |
| Skeleton | MS-G3D [15] | 88.4 (1.5) | 89.1 (1.6) | 92.5 (0.7) | - | - | - | - | - | - |
| Skeleton | STAM [18] | 86.5 (2.6) | 86.5 (3.0) | 93.1 (1.8) | - | - | - | - | - | - |
| Skeleton | WO-GMA (ours) | 93.8 (1.0) | 94.4 (0.9) | 96.9 (0.7) | 31.7 | 22.4 | 17.9 | 11.4 | 5.2 | 17.7 |
Video-level classification performance. For our model, a video is classified as F+ if the video-level score obtained by the Top-K strategy on the observed video exceeds a threshold. For STAM [18], MS-G3D [15], and Zhu et al. [27], the classification threshold is set to 0.5. Since action detection also includes classification, we additionally report the video-level performance of WOAD [7] and W-TALC [19]. Following previous appearance-based methods [7, 25], the image and optical flow features are extracted by I3D [2] pre-trained on Kinetics [12], a large-scale dataset containing 306k videos. For a fair comparison, the non-skeleton-based methods also use the first 6000 frames, with both the clip window size and stride set to 20. Other experimental settings follow the original papers.
Results are shown in the left part of Table 1. For skeleton-based methods, we report 5-fold cross-validation results. For the other methods, since they perform much worse than skeleton-based methods and computing image and optical flow features requires substantial compute, we only report results on the fifth fold, the one with frame-level annotations. Although only accumulated historical information is used, the video-level results of our model outperform previous works by a large margin, demonstrating the superiority of WO-GMA. Moreover, for both WOAD [7] and W-TALC [19], the models with image input features perform worse than those with optical flow input features, indicating that motion information is more important than appearance information in this task.
Online action detection performance. To the best of our knowledge, we are the first to develop a weakly supervised online action detection method for GMA. 60 F+ samples in the fifth dataset fold were annotated by experts with frame-level labels, and the 56 valid samples are used as ground truth to report detection results. The right part of Table 1 shows that our model achieves the best detection performance. Compared with image features, detection with optical flow features is better, which further demonstrates the importance of motion information. Since the skeleton contains only motion information, this result also supports the use of the skeleton as input. The performance gap between the left and right parts of Table 1 shows that detection is much more difficult than classification in this task, for two main reasons. First, video-level supervision alone may not be enough. Second, compared with everyday actions such as shaking hands, the boundaries of FMs are harder for annotators to determine accurately. Figure 3 shows that our model achieves video-level performance as good as with full observation when only the first 20% of video frames are observed. This highlights a significant benefit of our model: the assessment time of automated GMA in real-world applications can be greatly shortened. The visualized detection results in the top subplot of Figure 3 show acceptable detection performance.


3.2 Ablation Study
Given the complex movement patterns of FMs, we argue that both local spatio-temporal information and long-range information are critical for this task. This part analyzes the effect of each component on the fifth dataset fold with frame-level annotations.
To study the effect of CPGB, which can exploit future information, we remove this branch (w/o pseudo). As shown in Table 2, both classification and detection performance drop compared with WO-GMA, demonstrating the necessity of generating pseudo clip labels for training. We replace LFEM with a simple concatenation of the clip skeleton vertex features to obtain results without complex local features (w/o local). As shown in the second row of Table 2, the video-level accuracy drops by 2.0%, and the detection results drop at higher IoU thresholds. Furthermore, to analyze the effect of long-range information, we remove the 1D convolutions used to capture long-range information in CPGB (w/o long-range). The results in Table 2 confirm the importance of long-range information.
To better illustrate the influence of the different modules, we plot three more curves for the same infant in Fig. 3 and report the number of detected instances in Table 2. These results show that, without long-range information, the detected action instances are fragmented, which is unsuitable for detecting continuous FMs. Without local feature extraction, the detection scores are less confident than those of WO-GMA. Without the pseudo-label branch, the model may ignore the gaps between intermittent FMs, which the long-range information in CPGB would otherwise bridge. Moreover, generating pseudo labels without long-range information introduces noise. These detection results further illustrate the difficulty of the detection task discussed in the previous subsection. Compared with appearance-based methods, our skeleton-based method achieves better performance.
Table 2: Ablation results on the fifth fold: video-level classification (%), detection mAP@IoU (%), and the number of detected action instances.

| Method | Accuracy | F1-score | AUC | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | mean mAP | Instances |
|---|---|---|---|---|---|---|---|---|---|---|
| w/o pseudo | 92.2 | 92.9 | 95.7 | 35.1 | 21.6 | 14.7 | 8.0 | 2.4 | 16.4 | 260 |
| w/o local | 92.8 | 93.6 | 95.8 | 33.7 | 24.6 | 19.4 | 10.6 | 4.7 | 18.6 | 296 |
| w/o long-range | 93.5 | 93.9 | 95.4 | 10.2 | 7.0 | 2.9 | 1.4 | 0.6 | 4.4 | 1000 |
| WO-GMA (ours) | 94.8 | 95.1 | 96.6 | 31.7 | 22.4 | 17.9 | 11.4 | 5.2 | 17.7 | 420 |
4 Conclusion
We propose WO-GMA, the first method to address online action detection for general movements assessment with weak supervision, and evaluate it on a large dataset. Unlike previous methods that only focus on video classification, WO-GMA can detect the occurrence of FMs in an online fashion without frame-level labels. Experimental results demonstrate that WO-GMA significantly outperforms the state of the art on both the classification and detection tasks.
References
- [1] Cao, Z., Simon, T., Wei, S.E., Sheikh, Y.: Realtime multi-person 2d pose estimation using part affinity fields. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7291–7299 (2017)
- [2] Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. In: proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6299–6308 (2017)
- [3] Chambers, C., Seethapathi, N., Saluja, R., Loeb, H., Pierce, S.R., Bogen, D.K., Prosser, L., Johnson, M.J., Kording, K.P.: Computer vision to automatically assess infant neuromotor risk. IEEE Transactions on Neural Systems and Rehabilitation Engineering 28(11), 2431–2442 (2020)
- [4] Einspieler, C., Peharz, R., Marschik, P.B.: Fidgety movements–tiny in appearance, but huge in impact. Jornal de Pediatria 92, 64–70 (2016)
- [5] Einspieler, C., Prechtl, H.F., Ferrari, F., Cioni, G., Bos, A.F.: The qualitative assessment of general movements in preterm, term and young infants—review of the methodology. Early human development 50(1), 47–60 (1997)
- [6] Eun, H., Moon, J., Park, J., Jung, C., Kim, C.: Learning to discriminate information for online action detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 809–818 (2020)
- [7] Gao, M., Zhou, Y., Xu, R., Socher, R., Xiong, C.: Woad: Weakly supervised online action detection in untrimmed videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1915–1923 (2021)
- [8] Geest, R.D., Gavves, E., Ghodrati, A., Li, Z., Snoek, C., Tuytelaars, T.: Online action detection. In: European Conference on Computer Vision. pp. 269–284. Springer (2016)
- [9] Gutmann, M., Hyvärinen, A.: Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. pp. 297–304. JMLR Workshop and Conference Proceedings (2010)
- [10] Herskind, A., Greisen, G., Nielsen, J.B.: Early identification and intervention in cerebral palsy. Developmental Medicine & Child Neurology 57(1), 29–36 (2015)
- [11] Irshad, M.T., Nisar, M.A., Gouverneur, P., Rapp, M., Grzegorzek, M.: Ai approaches towards prechtl’s assessment of general movements: A systematic literature review. Sensors 20(18), 5321 (2020)
- [12] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
- [13] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- [14] Li, M., Chen, S., Chen, X., Zhang, Y., Wang, Y., Tian, Q.: Actional-structural graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 3595–3603 (2019)
- [15] Liu, Z., Zhang, H., Chen, Z., Wang, Z., Ouyang, W.: Disentangling and unifying graph convolutions for skeleton-based action recognition. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 143–152 (2020)
- [16] Malcolm, W.F.: Beyond the NICU: comprehensive care of the high-risk infant. McGraw-Hill Education (2015)
- [17] McCay, K.D., Ho, E.S., Sakkos, D., Woo, W.L., Marcroft, C., Dulson, P., Embleton, N.D.: Towards explainable abnormal infant movements identification: A body-part based prediction and visualisation framework. In: 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI). pp. 1–4. IEEE (2021)
- [18] Nguyen-Thai, B., Le, V., Morgan, C., Badawi, N., Tran, T., Venkatesh, S.: A spatio-temporal attention-based model for infant movement assessment from videos. IEEE journal of biomedical and health informatics 25(10), 3911–3920 (2021)
- [19] Paul, S., Roy, S., Roy-Chowdhury, A.K.: W-talc: Weakly-supervised temporal activity localization and classification. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 563–579 (2018)
- [20] Prechtl, H.F., Hopkins, B.: Developmental transformations of spontaneous movements in early infancy. Early human development 14(3-4), 233–238 (1986)
- [21] Schmidt, W., Regan, M., Fahey, M., Paplinski, A.: General movement assessment by machine learning: Why is it so difficult. J. Med. Artif. Intell 2 (2019)
- [22] Wu, Q., Xu, G., Wei, F., Kuang, J., Zhang, X., Chen, L., Zhang, S.: Automatically measure the quality of infants’ spontaneous movement via videos to predict the risk of cerebral palsy. IEEE Transactions on Instrumentation and Measurement 70, 1–11 (2021)
- [23] Xu, M., Gao, M., Chen, Y.T., Davis, L.S., Crandall, D.J.: Temporal recurrent networks for online action detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5532–5541 (2019)
- [24] Xu, M., Xiong, Y., Chen, H., Li, X., Xia, W., Tu, Z., Soatto, S.: Long short-term transformer for online action detection. Advances in Neural Information Processing Systems 34 (2021)
- [25] Zhang, C., Cao, M., Yang, D., Chen, J., Zou, Y.: Cola: Weakly-supervised temporal action localization with snippet contrastive learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16010–16019 (2021)
- [26] Zhou, Z.H.: Multi-instance learning: A survey. Department of Computer Science & Technology, Nanjing University, Tech. Rep 1 (2004)
- [27] Zhu, M., Men, Q., Ho, E.S., Leung, H., Shum, H.P.: Interpreting deep learning based cerebral palsy prediction with channel attention. In: 2021 IEEE EMBS International Conference on Biomedical and Health Informatics (BHI). pp. 1–4. IEEE (2021)