
1 Islamic University of Technology, Gazipur, Bangladesh
  {shafkatrahman, hasibulhaque, fakhruddingazzali, bakhtiarhasan, hasanul}@iut-dhaka.edu
2 Data61-CSIRO, Canberra, Australia
  [email protected]

Improving Action Quality Assessment Using Weighted Aggregation

Shafkat Farabi(✉)^1 [0000-0003-4712-1208], Hasibul Himel^1 [0000-0002-9246-2477], Fakhruddin Gazzali^1 [0000-0001-5609-5852], Md. Bakhtiar Hasan^1 [0000-0001-8093-5006], Md. Hasanul Kabir^1 [0000-0002-6853-8785], Moshiur Farazi^2 [0000-0003-1494-5921]
Abstract

Action quality assessment (AQA) aims to automatically judge a human action from a video of the said action and assign a performance score to it. The majority of works in the existing literature on AQA divide RGB videos into short clips, transform these clips into higher-level representations using Convolutional 3D (C3D) networks, and aggregate them through averaging. These higher-level representations are then used to perform AQA. We find that the current clip-level feature aggregation technique of averaging is insufficient to capture the relative importance of clip-level features. In this work, we propose a learning-based weighted-averaging technique, which we call Weight-Decider (WD), that obtains better performance without a significant increase in computational cost. We also experiment with ResNets for learning better representations for action quality assessment, and assess the effects of the depth and input clip size of the convolutional neural network on the quality of action score predictions. We achieve a new state-of-the-art Spearman's rank correlation of 0.9315 (an increase of 0.45%) on the MTL-AQA dataset using a 34-layer (2+1)D ResNet that processes 32-frame clips, with WD aggregation.

Keywords:
Action Quality Assessment · Aggregation · MTL-AQA

1 Introduction

Action quality assessment (AQA) addresses the problem of developing a system that can automatically judge the quality of an action performed by a human. This is done by processing a video of the performance and assigning a score to it. The motivation to develop such a system stems from its potential use in applications such as health care [13], sports video analysis [16], skill discrimination for a specific task [14], assessing the skill of trainees in professions such as surgery [3], etc.

Almost all existing works have treated AQA as a regression problem [14, 18, 15]. As shown in Figure 1, most approaches boil down to dividing an RGB video of the action into multiple clips, extracting higher-level features from each clip, aggregating them, and then training a linear-regressor to predict a score based on these aggregated features. Most of these works [18, 15, 22] utilize a convolutional neural network [11] to extract complex higher-level features and simple averaging to aggregate the features. The best performing models [16, 18, 15] make use of the Convolutional 3D (C3D) network [20] with average aggregation.

Figure 1: Overview of a general AQA pipeline and our improvement over it. Generally, the input video is divided into clips. A feature extractor extracts features from these clips. These features are then aggregated into a video level feature vector. A linear-regressor predicts action quality scores based on this feature vector. We introduce a Weight-Decider module to this architecture, which proposes weights based on the clip level features for better aggregation. Additionally, we use a ResNet instead of the commonly used C3D.

The majority of works in AQA [18, 15, 22, 16] aggregate clip-level features into a video-level feature vector by simply averaging them. We think this fails to preserve the temporal information present in the data and hypothesize that a more sophisticated method would improve the overall performance of the pipeline. We propose one such technique by introducing a module called Weight-Decider (WD). This module inspects the feature vectors extracted from individual clips and proposes a corresponding weight vector. These weight vectors are then used to calculate a weighted average of the clip-level feature vectors. In this way, the final video-level feature vector contains larger contributions from the important features of each clip, similar to how real-world judges base their final scoring on key mistakes and skills of the performer rather than on an average of all moments in the action. We design WD as a shallow neural network that can be trained along with the rest of the AQA pipeline. In our experiments, we show that the performance of the AQA pipeline improves when WD is used for aggregation.

Spatio-temporal versions of ResNets capable of processing videos have been proposed in [5] and [21]. These approaches have achieved state-of-the-art results in action recognition and outperform the C3D network [5]. Hence, we use spatio-temporal ResNets as feature extractors for performing AQA. We experiment with various 3D and (2+1)D ResNet feature extractors on the MTL-AQA dataset [16]. We find that 3D and (2+1)D ResNets of depth 34 and 50, pretrained on large-scale action recognition datasets, achieve performance comparable to the state of the art, and that (2+1)D and 3D convolutions perform fairly similarly. For 34-layer (2+1)D ResNets, we experiment with 3 different versions that process 8, 16, or 32 frame clips at once and find the 32-frame version to clearly outperform the rest. Our results suggest that processing longer clips is more beneficial than going deeper with convolutions. The 34-layer (2+1)D ResNet with WD processing 32-frame clips achieves a Spearman's rank correlation of 0.9315 on the MTL-AQA dataset, setting a new state of the art.

Contributions:

  • We propose a novel learning-based light-weight aggregation technique called Weight-Decider and demonstrate that it can improve the AQA pipeline’s performance.

  • To the best of our knowledge, this is the first work to do a comparative analysis of the effect of the depth, convolution type, and input clip size of the ResNet feature extractor on the final quality of the scores predicted in AQA.

  • One of our approaches outperforms all the previous works in AQA on the MTL-AQA dataset.

2 Related Work

Pirsiavash et al. [18] proposed a novel dataset containing videos of diving and figure-skating annotated with action quality scores by expert human judges. They provided the Discrete Cosine Transform (DCT) of extracted human pose as input to a Support Vector Regressor, which predicted the score.

More recent works have utilized the Convolutional 3D (C3D) network [20] as a feature extractor. Parmar and Morris [15] proposed three architectures, C3D-SVR, C3D-LSTM, and C3D-LSTM-SVR, all of which used features extracted from short video clips using the C3D network, then aggregated them and predicted an action score using a Support Vector Regressor (SVR) and/or a Long Short-Term Memory (LSTM) network. In a later work, Parmar and Morris [16] took a multitask approach towards action quality assessment. They released a novel AQA dataset called MTL-AQA and proposed a multi-task learning-based C3D-AVG-MTL framework that extracted features using the C3D network and aggregated them through averaging; these features were trained to do well in score prediction, action classification, and caption generation. Tang et al. [19] took a probabilistic approach (MUSDL). They divided 103 frame videos into 10 overlapping 16 frame clips, used the I3D [1] architecture to extract clip level features, used averaging for aggregation, and finally predicted the parameters of a probabilistic distribution from which the final score prediction was sampled. The authors calculated 7 different scores corresponding to the 7 judges of Olympic scoring and summed up the 5 scores in the middle. With this advantage over simple regression-based methods that directly predict the score, this approach achieved a state-of-the-art Spearman's correlation of 0.9273. Diba et al. [2] used a method called "STC Block" for action recognition, which is similar to our proposed aggregation method; however, they apply it to spatial and temporal features separately after each convolution layer for action recognition, whereas our method is applied to the output of the CNN to aggregate clip level spatio-temporal features for performing AQA.

Our proposed approach differs from these works in that we use 3D and (2+1)D ResNets as the feature extractor and we aggregate these features using the WD network, which is a lightweight and learning-based feature aggregation scheme.

3 Our Approach

3.1 General Pipeline Overview

Let V = {F_t}_{t=1}^{L} be the input video containing L frames, where F_t denotes the t-th frame. It is divided into N non-overlapping clips, each of size n = ⌈L/N⌉, so that the i-th clip is C_i = {F_j}_{j=(i-1)×n+1}^{i×n}. The feature extractor takes in a clip C_i and outputs a feature vector f_i. For the feature extractor, we experiment with 3D ResNets [6] and (2+1)D ResNets [21] of varying depth and input clip size. Next, we aggregate these clip level features to obtain a global video level representation. Finally, a linear-regressor is trained to predict the score from the video level feature representation. Following the majority of previous works [16, 14, 18, 15], we model the problem as linear regression, which is natural since the action quality score is a real number. To study how the depth of the ResNet feature extractor affects the AQA pipeline's ability to learn, we experiment with 3 different depths:

  • 34 layer: We experiment with both 34 layer 3D and (2+1)D ResNets. The only difference is that 3D ResNets use 3×3×3 convolution kernels, whereas (2+1)D ResNets use a 1×3×3 spatial convolution followed by a 3×1×1 temporal convolution [21]. We take the final average-pool layer output (512 dimensional) and pass it through 2 fully-connected layers having 256 and 128 units. The 3D ResNet takes 16 frame clips as input, whereas for the (2+1)D 34-layer ResNet we experiment with 3 variations processing 8, 16, and 32 frame clips, using available pre-trained weights.

  • 50 layer: We experiment with 50 layer 3D and (2+1)D ResNets. The final average-pool layer outputs a feature vector of size 2048. We take this feature vector and input it into 3 fully-connected layers having 512, 256, and 128 units. Only 16 frame clips are processed.

  • 101 layer: We experiment with a 101 layer 3D ResNet. The remaining details about the input clip size and output feature vector processing are identical to the 50 layer ResNets.

3.2 Feature Aggregation

Due to memory and computational budget constraints, most previous works on AQA process the entire input video by first dividing it into multiple smaller clips of equal size; most CNNs are designed to process 8, 16, or 32 frames at once. The features extracted by the CNN from each clip are then aggregated to form a video level feature description, and a linear-regressor predicts the final score based on this feature description.
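To make this generic pipeline concrete, below is a minimal PyTorch sketch of the clip division, clip-level feature extraction, and average aggregation described above. It is illustrative only: we use torchvision's (2+1)D ResNet-18 as a stand-in for the deeper 34/50/101-layer extractors used in the paper, the function and variable names are our own, and the paper additionally reduces the pooled features to 128 dimensions with fully-connected layers before regression.

```python
# Minimal sketch of the generic AQA pipeline: divide the video into clips,
# extract clip-level features, aggregate (here by averaging), and regress a score.
import torch
import torch.nn as nn
from torchvision.models.video import r2plus1d_18

CLIP_LEN = 16

backbone = r2plus1d_18()        # (2+1)D ResNet-18 stand-in; the paper uses 34/50/101-layer variants
backbone.fc = nn.Identity()     # expose the 512-d average-pool features
regressor = nn.Linear(512, 1)   # linear regressor predicting the action quality score

def predict_score(video: torch.Tensor) -> torch.Tensor:
    """video: (C, L, H, W) RGB tensor, with L a multiple of CLIP_LEN."""
    # Split the video into non-overlapping clips and batch them: (N, C, CLIP_LEN, H, W)
    clips = video.unfold(1, CLIP_LEN, CLIP_LEN)       # (C, N, H, W, CLIP_LEN)
    clips = clips.permute(1, 0, 4, 2, 3).contiguous()
    feats = backbone(clips)                           # (N, 512) clip-level features
    video_feat = feats.mean(dim=0)                    # average aggregation (baseline)
    return regressor(video_feat)

score = predict_score(torch.randn(3, 96, 112, 112))   # 96 frames -> 6 clips of 16 frames
```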

Figure 2: The architecture of the WD network.

The best performing works aggregated the clip features by simply averaging them [16, 19, 15]. Some other works [15, 14] aggregated using LSTMs [7]. However, although LSTM networks make sense in theory because of their ability to handle temporal sequences, they perform worse in practice due to the lack of sufficiently large datasets dedicated to AQA.

We argue that simply averaging the clip-wise features is ineffective: it cannot preserve the temporal information available in the data, since changing the order of the clip level features produces the same average and hence the same score prediction. Furthermore, expert human judges focus more on the mistakes and deviations of the performers, and these have a bigger impact on the score. Hence, we think a weighted-averaging technique is more suitable, as it allows the linear-regressor to base its decision on the more important features from each clip.

More concretely, if the feature vector extracted from clip C_i is f_i, we propose the video level feature vector

f_{video} = \sum_{i=1}^{N} (f_{i} \odot w_{i})    (1)

where w_i is a weight vector corresponding to the feature vector f_i and ⊙ represents the Hadamard product (elementwise multiplication). w_i has the same dimensions as f_i and is learned using a small neural network of 4 layers. This network takes the 128-dimensional feature vector f_i as input and runs it through fully connected layers containing 64, 32, 64, and 128 neurons; all but the final layer employ a ReLU activation function. The architecture is shown in Figure 2. Finally, to ensure that the weights corresponding to the same element of different weight vectors sum to one, a softmax is applied across the corresponding elements of all the weight vectors. We call this shallow neural network the Weight-Decider (WD).

w_{i}' = WD(f_{i})    (2)
[w_{1} \quad w_{2} \quad \dots \quad w_{N}] = softmax([w_{1}' \quad w_{2}' \quad \dots \quad w_{N}'])    (3)

Finally, the linear-regressor can predict the score using the feature vector fvideof_{video} as proposed by equation 1.
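The following is a minimal PyTorch sketch of the WD module and the weighted aggregation of Equations 1-3. The layer sizes (128 → 64 → 32 → 64 → 128) and the clip-wise softmax follow the description above; the class and variable names are our own, and the snippet is illustrative rather than the exact implementation.

```python
# Sketch of the Weight-Decider (WD) aggregation described by Eqs. (1)-(3).
import torch
import torch.nn as nn

class WeightDecider(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, feat_dim),        # no activation on the final layer
        )

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        """clip_feats: (N, feat_dim) clip-level features f_1..f_N.
        Returns the aggregated video-level feature f_video of shape (feat_dim,)."""
        raw_weights = self.net(clip_feats)           # w'_i = WD(f_i), Eq. (2)
        weights = torch.softmax(raw_weights, dim=0)  # normalize across clips, Eq. (3)
        return (clip_feats * weights).sum(dim=0)     # weighted sum, Eq. (1)

wd = WeightDecider()
f_video = wd(torch.randn(6, 128))                    # e.g. 6 clips of 128-d features
```

The softmax is taken over the clip dimension, so for each of the 128 feature elements the weights contributed by the different clips sum to one, exactly as Equation 3 requires.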

Figure 3: The video is divided into clips and each clip is processed by a ResNet feature extractor. The ResNet can have depths 34, 50, or 101. Elements with a solid border indicate trainable modules. The solid bordered white modules are initialized with pretrained weights. The solid bordered green modules are trained from scratch. Best viewed in color.

The proposed WD module can be used with any of the feature extractors in our experiments. WD replaces averaging as the aggregation step within the typical AQA pipeline and can be trained with the rest of the AQA model in an end-to-end manner.

Adding WD to the AQA pipeline to replace averaging as aggregation does not cost much in terms of computational resources. To see this, recall that the WD module has 3 hidden layers and an output layer, consisting of 64, 32, 64, and 128 neurons in that order. Each of these layers contains a weight matrix and a bias. The first hidden layer takes 128 inputs and has 64 neurons, so its weight matrix has dimensions 64×128 and the number of trainable parameters in this layer (including the bias term) is (64×128+1) = 8193. Similarly, the second hidden layer, third hidden layer, and output layer contain 2049, 2049, and 8193 trainable parameters, respectively. Summing the trainable parameters of each layer, WD contains only 20484 trainable parameters, whereas spatio-temporal ResNet feature extractors have millions. As an example, the 3D and (2+1)D ResNet-34 feature extractors we use contain 63.6 million trainable parameters [12], so using the WD module on top of ResNet-34 increases the number of trainable parameters by approximately 0.03%; the ratio would be even smaller for a deeper ResNet. Hence, our proposed WD adds little computational cost to the pipeline. Our experiments support this: we found the training time for the 32-frame ResNet(2+1)D-34 to be 4557 seconds per epoch and the inference time to be 604 seconds per epoch, and the introduction of WD increases the training time per epoch by only 24 seconds (0.52%) and the inference time by 21 seconds (3.47%).
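As a quick check, the parameter count of the WD sketch above can be computed directly. Note that PyTorch's nn.Linear carries one bias per neuron, so this tally comes out slightly higher than the in-text count, which uses a single bias term per layer; either way the total is negligible next to the feature extractor.

```python
# Count trainable parameters of the sketched WD module (`wd` from the earlier snippet).
# With one bias per neuron: 8256 + 2080 + 2112 + 8320 = 20768, roughly 0.03% of the
# ~63.6M parameters of a ResNet-34 feature extractor.
n_params = sum(p.numel() for p in wd.parameters() if p.requires_grad)
print(n_params)  # 20768
```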

4 Experiments

4.1 MTL-AQA Dataset[16]

MTL-AQA is the biggest dataset dedicated to AQA. It contains 1412 video samples split into 1059 training and 353 test samples. The samples are collected from 16 Olympic diving events. Each sample is 103 frames long and accompanied by the final action quality score from expert Olympic judges. The dataset was published by Parmar and Morris [16], and we use the exact same train/test split made public in their work [16].

4.2 Evaluation Metric:

In line with previously published literature, we use Spearman’s rank correlation as the evaluation metric. This metric measures the correlation between two sequences containing ordinal or numerical data. It is calculated using the equation:

\rho = 1 - \frac{6\sum d_{i}^{2}}{n(n^{2}-1)}    (4)

where \rho is Spearman's rank correlation, d_i is the difference between the ranks of corresponding variables, and n is the number of observations.
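For illustration, Spearman's rank correlation between a set of predicted and ground-truth scores can be computed with SciPy; the scores below are made-up numbers, not values from our experiments.

```python
# Spearman's rank correlation between predicted and ground-truth scores (made-up data).
from scipy.stats import spearmanr

true_scores = [45.0, 66.3, 81.2, 92.5]
pred_scores = [50.1, 60.2, 78.0, 95.4]
rho, _ = spearmanr(pred_scores, true_scores)
print(rho)  # 1.0 -- these predictions rank the four samples in the same order as the ground truth
```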

4.3 Implementation Details:

We implemented our proposed methods using PyTorch [17]. All the 3D ResNets and the (2+1)D ResNets processing 16 frame clips were pre-trained on the Kinetics-700 [9] dataset (weights available at https://github.com/kenshohara/3D-ResNets-PyTorch). The (2+1)D ResNets processing 8 and 32 frame clips were pre-trained on the IG-65M dataset [4] and fine-tuned on the Kinetics-400 [9] dataset (weights available at https://github.com/moabitcoin/ig65m-pytorch).

For each ResNet, we separately experimented using both averaging and WD as feature aggregation. We applied temporal augmentation by randomly picking an ending frame from the last 6 frames and choosing the preceding 96 frames for processing. The frames were resized to 171×128 and center-cropped to 112×112. Random horizontal flipping was applied, and batch normalization was used for regularization. We defined the loss function as the sum of the L2 and L1 losses between the predicted and ground-truth scores, as suggested by Parmar and Morris [16]. We trained the network using the ADAM optimizer [10] for 50 epochs, with a learning rate of 0.0001 for modules with randomly initialized weights and 0.00001 for modules with pretrained weights. Training and test batch sizes were 2 and 5, respectively.
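A minimal sketch of this training setup is shown below, assuming the backbone, wd, and regressor modules from the earlier sketches. The L2+L1 loss and the two learning rates follow the text; the parameter grouping and names are our own illustration.

```python
# Sketch of the training objective and optimizer described above.
import torch
import torch.nn as nn

def aqa_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Sum of L2 (MSE) and L1 losses between predicted and ground-truth scores.
    return nn.functional.mse_loss(pred, target) + nn.functional.l1_loss(pred, target)

optimizer = torch.optim.Adam([
    {"params": backbone.parameters(), "lr": 1e-5},   # pretrained feature extractor: smaller lr
    {"params": wd.parameters(), "lr": 1e-4},         # randomly initialized WD
    {"params": regressor.parameters(), "lr": 1e-4},  # randomly initialized regressor
])
```

Each training step would then compute aqa_loss on a batch of predicted scores, call backward(), and step the optimizer, repeated for 50 epochs.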

4.4 Results on MTL-AQA Dataset:

Table 1: Performance comparison of various ResNet feature extractors and input clip lengths in our pipeline (Spearman's rank correlation)

(a) Effect of ResNet depth and convolution type
Depth        Convolution   Average   WD
ResNet-34    3D            0.8982    0.8951
ResNet-34    (2+1)D        0.8932    0.8990
ResNet-50    3D            0.8880    0.8935
ResNet-50    (2+1)D        0.8818    0.8814
ResNet-101   3D            0.6663    0.6033

(b) Effect of varying input clip size in ResNet(2+1)D-34
Clip length (input frames)   Average   WD
8                            0.8570    0.8853
16                           0.8932    0.8990
32                           0.9289    0.9315

In Table 1(a), we present the results of varying the depth of the ResNet feature extractor and the aggregation scheme. The 34 layer (2+1)D ResNet with WD as aggregation performs best, with a Spearman's correlation of 0.8990. Increasing the depth to 50 layers decreases Spearman's correlation somewhat, although the results remain competitive. At a depth of 101 layers, even when initialized with pretrained weights from Kinetics [9], overfitting occurs fairly quickly; this is also evident from the train/test curves in Figure 4. The likely reason is the increased number of parameters in the feature extractor. This leads us to conclude that the current biggest AQA dataset has enough data to train 34-layer and 50-layer ResNet feature extractors that generalize, but it is insufficient for the 101-layer ResNet feature extractor, which overfits.

Figure 4: Train and Test curves obtained from training the pipeline using 3D-ResNet 34, 50, and 101 as feature extractors. Notice that ResNet-101 based architectures show signs of significant overfitting compared to ResNet-34 and ResNet-50 based architectures.

Because of how (2+1)D ResNets are designed, they have a parameter count similar to their 3D counterparts [21]. Since the overfitting stems from the high parameter count, we do not repeat the experiment with a (2+1)D ResNet-101 feature extractor.

4.4.1 Effect of clip length:

We check the effect of clip length on the performance. We take the best performing model from Table 1(a) (ResNet(2+1)D-34-WD) and input various clip sizes. We experiment with clip sizes of 8 frames, 16 frames, and 32 frames. We vary the aggregation method as well.

From Table 1(b), we can see that the performance of the pipeline increases with the number of frames in each clip. We hypothesize that longer clips allow the ResNet to look for longer patterns along the temporal dimension, which makes the feature descriptors more informative and enables the linear-regressor to better discriminate between similar-looking examples with fine-grained action quality differences. Table 1(b) also shows that using WD over simple averaging as aggregation gives a boost in performance; however, this boost is most significant for 8-frame clips. We believe the reason is that increasing the clip size reduces the number of clips being aggregated (with 96 input frames, 8-frame clips yield 12 clips whereas 32-frame clips yield only 3): whatever detrimental effect averaging has is more prominent when many clip features are averaged and less so when there are few. Furthermore, CNNs with longer input clips see more frames at once, which increases their temporal horizon, so the extracted feature vectors already encode action patterns across time more effectively and perform reasonably well even with averaging as aggregation. Using WD increases performance nevertheless.

For qualitative results, refer to Table 2.

Table 2: Qualitative results. Every 16th processed frame, starting from frame 0, is shown. Italic scores correspond to WD aggregation; plain-text scores correspond to average aggregation. 8, 16, and 32 denote the input clip sizes of the (2+1)D ResNet-34.
Columns: Input Frames | ResNet-34 predictions: (2+1)D (8 / 16 / 32 frames), 3D | ResNet-50 predictions: (2+1)D, 3D | ResNet-101 prediction: 3D | True Score
[Uncaptioned image] 25.64
54.84 30.46 8.39 7.29 33.23 34.10 45.22
38.76 18.11 16.41 22.93 38.29 29.93 52.21
[Uncaptioned image] 52.79
66.94 59.69 47.92 67.92 43.57 58.30 122.20
63.85 40.88 53.21 63.62 52.80 52.31 76.64
[Uncaptioned image] 69.59
71.46 71.34 69.39 83.34 67.38 80.41 167.60
69.85 64.90 70.40 81.31 67.53 74.25 132.50
[Uncaptioned image] 46.20
67.13 46.16 27.73 34.25 44.06 46.61 54.28
64.54 32.29 42.87 39.62 49.03 47.13 51.51

The ground-truth scores provided in MTL-AQA [16] are taken from expert Olympic judges during live broadcasts of events. 7 judges independently score the athlete's performance on a scale of 0 (completely failed) to 10 (excellent). The 3 median scores are then added together and multiplied by a pre-determined degree of difficulty to obtain the final score, which is the value shown in the "True Score" column of Table 2. Our pipelines attempt to predict this final score from the input performance video.
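As a worked illustration of this scoring rule (the judge scores and difficulty degree below are made up):

```python
# Final dive score from seven raw judge scores: sum the three median scores and
# multiply by the dive's degree of difficulty. Numbers are made up for illustration.
judge_scores = [7.0, 7.5, 8.0, 8.0, 8.5, 8.5, 9.0]   # seven judges, 0-10 scale
difficulty = 3.2

middle_three = sorted(judge_scores)[2:5]              # discard the two highest and two lowest
final_score = sum(middle_three) * difficulty
print(final_score)                                    # (8.0 + 8.0 + 8.5) * 3.2 = 78.4
```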

4.4.2 Comparison with the state of the art:

In Table 3, we compare our best performing models of each depth with previous works on the MTL-AQA dataset. For comparison, we also combined the C3D architecture with WD; for this C3D-WD model, 16 frame clips were used and the C3D portion was initialized with Sports-1M [8] pretrained weights. This result is included in Table 3 as well. Our (2+1)D ResNet-34 (32 frame)-WD outperforms all previous works in the literature, which shows the effectiveness of our approach. The 3D ResNet-50 (16 frame) obtains results comparable to the state of the art, whereas the ResNet-101 based approach overfits the dataset and hence performs poorly.

Table 3: Comparison with the state of the art on the MTL-AQA dataset
Method                                   Sp. Corr.
Pose+DCT [18]                            0.2682
C3D-SVR [15]                             0.7716
C3D-LSTM [15]                            0.8489
MSCADC-STL [16]                          0.8472
MSCADC-MTL [16]                          0.8612
USDL-Regression [19]                     0.8905
C3D-AVG-STL [16]                         0.8960
C3D-AVG-MTL [16]                         0.9044
MUSDL [19]                               0.9273
Ours: C3D-WD                             0.9223
Ours: ResNet34-(2+1)D-WD (32 frame)      0.9315
Ours: ResNet50-3D-WD (16 frame)          0.8935
Ours: ResNet101-3D-AVG (16 frame)        0.6633

5 Conclusion

In this work, we proposed a ResNet-based regression-oriented pipeline for action quality assessment. We demonstrated experimentally that the MTL-AQA dataset has enough data to train 34 and 50 layer ResNet-based pipelines when they are initialized with weights pretrained on a related task (such as action recognition). Our experiments suggest that processing longer clips is more effective than using deeper ResNets. We also proposed a lightweight learning-based aggregation technique called WD to replace simple averaging. Experiments show our methods to be more effective than previous works. In the future, we want to investigate CNNs that can process longer clips (64 frames or more) to see whether this translates to better performance.

References

  • [1] Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 4724–4733. IEEE Computer Society (2017). https://doi.org/10.1109/CVPR.2017.502
  • [2] Diba, A., Fayyaz, M., Sharma, V., Arzani, M.M., Yousefzadeh, R., Gall, J., Gool, L.V.: Spatio-temporal channel correlation networks for action classification. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part IV. Lecture Notes in Computer Science, vol. 11208, pp. 299–315. Springer (2018). https://doi.org/10.1007/978-3-030-01225-0_18
  • [3] Funke, I., Mees, S.T., Weitz, J., Speidel, S.: Video-based surgical skill assessment using 3d convolutional neural networks. International Journal of Computer Assisted Radiology and Surgery 14(7), 1217–1225 (2019). https://doi.org/10.1007/s11548-019-01995-1
  • [4] Ghadiyaram, D., Tran, D., Mahajan, D.: Large-scale weakly-supervised pre-training for video action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. pp. 12046–12055. Computer Vision Foundation / IEEE (2019). https://doi.org/10.1109/CVPR.2019.01232
  • [5] Hara, K., Kataoka, H., Satoh, Y.: Learning spatio-temporal features with 3d residual networks for action recognition. In: 2017 IEEE International Conference on Computer Vision Workshops, ICCV Workshops 2017, Venice, Italy, October 22-29, 2017. pp. 3154–3160. IEEE Computer Society (2017). https://doi.org/10.1109/ICCVW.2017.373
  • [6] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 770–778. IEEE Computer Society (2016). https://doi.org/10.1109/CVPR.2016.90
  • [7] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
  • [8] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Li, F.: Large-scale video classification with convolutional neural networks. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014. pp. 1725–1732. IEEE Computer Society (2014). https://doi.org/10.1109/CVPR.2014.223
  • [9] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 abs/1705.06950 (2017), http://arxiv.org/abs/1705.06950
  • [10] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015), http://arxiv.org/abs/1412.6980
  • [11] LeCun, Y., Bengio, Y.: Convolutional Networks for Images, Speech, and Time Series, p. 255–258. MIT Press, Cambridge, MA, USA (1998). https://doi.org/10.5555/303568.303704
  • [12] Leong, M., Prasad, D., Lee, Y.T., Lin, F.: Semi-cnn architecture for effective spatio-temporal learning in action recognition. Applied Sciences 10,  557 (01 2020). https://doi.org/10.3390/app10020557
  • [13] Parmar, P., Morris, B.T.: Measuring the quality of exercises. In: 2016 38th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). pp. 2241–2244 (2016). https://doi.org/10.1109/EMBC.2016.7591175
  • [14] Parmar, P., Morris, B.: Action quality assessment across multiple actions. In: IEEE Winter Conference on Applications of Computer Vision, WACV 2019, Waikoloa Village, HI, USA, January 7-11, 2019. pp. 1468–1476. IEEE (2019). https://doi.org/10.1109/WACV.2019.00161
  • [15] Parmar, P., Morris, B.T.: Learning to score olympic events. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2017, Honolulu, HI, USA, July 21-26, 2017. pp. 76–84. IEEE Computer Society (2017). https://doi.org/10.1109/CVPRW.2017.16
  • [16] Parmar, P., Morris, B.T.: What and how well you performed? A multitask learning approach to action quality assessment. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. pp. 304–313. Computer Vision Foundation / IEEE (2019). https://doi.org/10.1109/CVPR.2019.00039
  • [17] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
  • [18] Pirsiavash, H., Vondrick, C., Torralba, A.: Assessing the quality of actions. In: Fleet, D.J., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI. Lecture Notes in Computer Science, vol. 8694, pp. 556–571. Springer (2014). https://doi.org/10.1007/978-3-319-10599-4_36
  • [19] Tang, Y., Ni, Z., Zhou, J., Zhang, D., Lu, J., Wu, Y., Zhou, J.: Uncertainty-aware score distribution learning for action quality assessment. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. pp. 9836–9845. IEEE (2020). https://doi.org/10.1109/CVPR42600.2020.00986
  • [20] Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. pp. 4489–4497. IEEE Computer Society (2015). https://doi.org/10.1109/ICCV.2015.510
  • [21] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018. pp. 6450–6459. IEEE Computer Society (2018). https://doi.org/10.1109/CVPR.2018.00675
  • [22] Xiang, X., Tian, Y., Reiter, A., Hager, G.D., Tran, T.D.: S3D: stacking segmental P3D for action quality assessment. In: 2018 IEEE International Conference on Image Processing, ICIP 2018, Athens, Greece, October 7-10, 2018. pp. 928–932. IEEE (2018). https://doi.org/10.1109/ICIP.2018.8451364