
A New Perspective for Shuttlecock Hitting Event Detection

Yu-Hsi Chen Institute of Communications Engineering, National Tsing Hua University, Taiwan [email protected]
Abstract

This article introduces a novel approach to shuttlecock hitting event detection. Instead of depending on generic methods, we capture the hitting action of players by reasoning over a sequence of images. To learn the features of hitting events in a video clip, we utilize a deep learning model known as SwingNet, which is designed to capture the characteristics and patterns associated with the act of hitting in badminton. By training SwingNet on the provided video clips, we aim to enable the model to accurately recognize and identify instances of hitting events based on their distinctive features. Furthermore, we apply a dedicated video processing technique to extract prior features from the video, which significantly reduces the learning difficulty for the model. The proposed method not only provides an intuitive and user-friendly approach but also presents a fresh perspective on the task of detecting badminton hitting events. The source code will be available at https://github.com/TW-yuhsi/A-New-Perspective-for-Shuttlecock-Hitting-Event-Detection.

1 Introduction

In this task, we need to develop computer vision technology capable of automatically extracting comprehensive technical data from broadcast video of badminton matches. This data includes crucial information about each shot, such as the precise timing, spatial positioning, player postures, and skill levels exhibited as the shuttlecock is struck during the match. It encompasses various elements, such as the shot's timing, the ball's location, the player responsible for the hit, the swing posture, the standing positions of both players, the type of ball used, and the eventual winner. These extensive datasets serve as valuable resources for analyzing the technical and strategic aspects of badminton matches. The scoring system is based on individual rallies, where each rally is evaluated separately. Participants can earn a maximum of 1 point per rally based on the accuracy of the rally's shot count and the precision of the predicted attributes. The final rankings are determined by the total scores, with the highest score securing the top position and the lowest score occupying the last spot.

2 Related Work

For badminton competitions such as this one, it is common practice to use CoachAI Hsu et al. (2019), an AI-based coaching assistant system aimed at providing professional guidance and personalized advice in the field of sports. The system utilizes advanced machine learning and deep learning techniques to extract crucial information from athletes' data and videos, enabling comprehensive analysis and evaluation. Next, we introduce the deep learning models and techniques utilized in CoachAI.

2.1 TrackNetV2

TrackNetV2 Sun et al. (2020) is an advanced computer vision model designed specifically for object tracking tasks. Building upon its predecessor, TrackNetV1, this updated version offers enhanced performance and improved accuracy in tracking objects across video sequences. It employs deep learning techniques, particularly convolutional neural networks (CNNs), to extract relevant features from consecutive frames of a video. These features are then used to predict the location and trajectory of a specified object in subsequent frames. By learning the spatiotemporal patterns and motion characteristics of the tracked object, TrackNetV2 can accurately follow its movements over time.

After obtaining the predicted badminton trajectories from TrackNetV2, further processing steps are performed, including removing extreme offset points, curve fitting, and interpolation. Once the data is prepared accordingly, rough rally segmentation and event detection are applied to identify specific events within the processed videos.

2.2 Court Detection

Court detection in CoachAI plays a crucial role in enhancing the accuracy and effectiveness of the system, enabling it to provide valuable insights and actionable information based on the specific dynamics and requirements of the sports court.

2.3 MoveNet

MoveNet is a deep learning model designed for human pose estimation and tracking. It is specifically developed to accurately detect and track human body movements in real-time from video or camera input. This will be further used to analyze the BallType played by the players.

2.4 OpenPose

OpenPose is a computer vision library and framework for real-time multi-person keypoint detection and pose estimation. It enables the accurate estimation of human body poses, including the positions of body joints such as the head, shoulders, elbows, wrists, hips, knees, and ankles. Using OpenPose nodes in this competition allows for more precise identification of the location of the hitter and the defender.

3 Methodology

The methods we employ include video processing, SwingNet, ViT, YOLOv5, and TrackNetV2 deep learning models.

3.1 Video Processing

Video processing is a critical step for improving the score in the entire task. In this process, we first compute optical flow images using the Optical Flow Calculation embedded in the Reynolds Transport Theorem ("A Prior Feature Enhanced Layer for Small Objects Detection and Tracking"; Bruhn et al. (2005); Harouna and Mémin (2017)) to capture the prior features of the video. We then remove the background information to achieve a mechanism similar to attention, as shown in Figure 1, and we use the term "optical flow video without background" to describe this type of video. Subsequently, we feed these processed videos into SwingNet for shuttlecock hitting event detection.

Figure 1: Comparison between the original and optical frames.
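To make this step concrete, the following is a minimal sketch of how such an optical flow video without background might be produced. It substitutes OpenCV's Farneback dense flow and MOG2 background subtractor for the Reynolds-transport-based formulation cited above, and the file paths, codec, and flow parameters are illustrative assumptions rather than the exact pipeline used in this work.

```python
import cv2
import numpy as np

def optical_flow_without_background(video_path, out_path):
    """Sketch: dense optical flow rendered in HSV, with the static background
    masked out to mimic an attention-like mechanism. Farneback flow and MOG2
    stand in for the Reynolds-transport-based formulation used in the paper."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError(f"cannot read {video_path}")
    h, w = prev.shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             cap.get(cv2.CAP_PROP_FPS), (w, h))
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    bg_sub = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
    hsv = np.zeros_like(prev)
    hsv[..., 1] = 255                         # full saturation for visualization

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        hsv[..., 0] = ang * 180 / np.pi / 2   # motion direction -> hue
        hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # speed -> value
        flow_bgr = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
        fg_mask = bg_sub.apply(frame)         # suppress the static background
        flow_bgr = cv2.bitwise_and(flow_bgr, flow_bgr, mask=fg_mask)
        writer.write(flow_bgr)
        prev_gray = gray

    cap.release()
    writer.release()
```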

3.2 SwingNet

SwingNet McNally et al. (2019) plays a crucial role in the field of computer vision and sports analytics, specifically in the domain of golf. Its application enables more efficient and detailed analysis of golf swings, contributing to the overall development and understanding of the sport. Here, we utilize SwingNet, a deep learning model that combines MobileNetV2 with a bidirectional LSTM, for shuttlecock hitting event detection. That is, SwingNet enables us to extract the ShotSeq and HitFrame features required for the desired csv file. Figure 2 demonstrates the inferred event probabilities for a video clip.

Figure 2: SwingNet performs event probability inference.
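For readers who want a sense of the architecture, below is a minimal PyTorch sketch of the MobileNetV2 + bidirectional LSTM design that SwingNet follows. The hidden width, number of LSTM layers, and number of event classes are illustrative assumptions, not the exact configuration trained in this work.

```python
import torch
import torch.nn as nn
import torchvision

class SwingNetSketch(nn.Module):
    """Minimal sketch of the SwingNet idea: a MobileNetV2 backbone feeding a
    bidirectional LSTM that emits per-frame event logits."""

    def __init__(self, num_events=2, hidden=256, lstm_layers=1):
        super().__init__()
        backbone = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1")
        self.cnn = backbone.features          # frame-level feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.rnn = nn.LSTM(1280, hidden, lstm_layers,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_events)

    def forward(self, clip):                  # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.pool(self.cnn(clip.flatten(0, 1))).flatten(1)   # (B*T, 1280)
        seq, _ = self.rnn(feats.view(b, t, -1))                      # (B, T, 2*hidden)
        return self.head(seq)                                        # per-frame event logits

# e.g. a 64-frame clip at 180x180, matching the sequence length and image size used here
logits = SwingNetSketch()(torch.randn(1, 64, 3, 180, 180))
print(logits.shape)   # torch.Size([1, 64, 2])
```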

3.3 Vision Transformer

ViT, short for Vision Transformer Dosovitskiy et al. (2020), is a deep learning model that applies the transformer architecture to computer vision tasks. In contrast to CNN-based approaches, ViT adopts a distinct strategy by leveraging the transformer architecture originally designed for natural language processing. Moreover, the strength of ViT lies in its attention mechanism, as depicted in Figure 3, accompanied by the corresponding attention map illustrated in Figure 4. Here, we utilize ViT-B/16 to extract the information including Hitter, RoundHead, Backhand, BallHeight, BallType, and Winner from the videos.

Figure 3: Attention map for BallType classification task.
Figure 4: Visualization of attention maps for each layer.
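As a rough illustration of how ViT-B/16 can be fine-tuned for one of these labels, the sketch below runs one training step with the timm implementation on a dummy batch. The model name, the 9-way BallType label space, and the 384-pixel input size are assumptions made for illustration (the actual training image size is reported later in Section 4.2.2); the SGD optimizer and cross-entropy loss mirror the settings used in this work.

```python
import timm
import torch

# Hypothetical fine-tuning step for the BallType label; class count and input
# size are illustrative placeholders.
model = timm.create_model("vit_base_patch16_384", pretrained=True, num_classes=9)

images = torch.randn(4, 3, 384, 384)          # dummy batch of hit-frame images
labels = torch.randint(0, 9, (4,))
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=3e-2, momentum=0.9)

optimizer.zero_grad()
loss = criterion(model(images), labels)       # cross-entropy over BallType classes
loss.backward()
optimizer.step()
```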

3.4 YOLOv5

The YOLO-series detector is a renowned family of object detection models. Specifically, YOLOv5 Zhu et al. (2021) is a popular and highly efficient object detection model, which builds upon the success of its predecessors, YOLOv1, YOLOv2, and YOLOv3, and introduces several improvements to achieve better performance in terms of accuracy and speed. We utilize YOLOv5m to extract the information for LandingX, LandingY, HitterLocationX, HitterLocationY, DefenderLocationX, and DefenderLocationY.

To begin with, we opted for YOLOv5m as our detection model. The rationale behind not selecting newer models like YOLOv7 is based on our experience: while YOLOv7 may achieve higher scores on benchmark datasets, it lacks the desired level of robustness. Once we have obtained the detection results for both players and the ball, we integrate them with the Hitter prediction generated by ViT. We designate the player indicated by ViT as the hitter, while the remaining player is assigned the role of the defender. As the landing result of the ball is the desired information, we utilize the y-value of the bottom-right corner of the hitter's bounding box to represent the ball's landing y-value. The landing coordinate of the ball is indicated by the black cross symbol depicted in Figure 5. Next, we use the vertices of the detection boxes closest to the ball as the positions of the Hitter and the Defender for the AICUP competition, as illustrated by the light blue and orange rectangles in Figure 5.

Figure 5: Detection result from YOLOv5m.
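The sketch below illustrates this post-processing with the COCO-pretrained yolov5m from torch.hub; the actual pipeline fine-tunes YOLOv5m on the competition footage, so the person-class filter, the frame path, the choice of box vertex, and the mapping from the A/B label to the upper/lower player are all assumptions made for illustration.

```python
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5m", pretrained=True)
boxes = model("hit_frame.jpg").xyxy[0].cpu().numpy()   # columns: x1, y1, x2, y2, conf, cls

# keep the two most confident person detections (COCO class 0 = person)
players = sorted([b for b in boxes if int(b[5]) == 0], key=lambda b: -b[4])[:2]
players = sorted(players, key=lambda b: b[3])          # upper-court player first (assumption)

hitter_label = "A"                                     # Hitter prediction from ViT-B/16
hitter, defender = (players[0], players[1]) if hitter_label == "A" else (players[1], players[0])

landing_y = hitter[3]                                  # bottom-right y of the hitter's box
hitter_xy = (hitter[2], hitter[3])                     # a box vertex near the feet; the paper
defender_xy = (defender[2], defender[3])               # picks the vertex closest to the ball
print(landing_y, hitter_xy, defender_xy)
```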

3.5 TrackNetV2

The intention behind incorporating TrackNetV2 in this context is to compensate for the shortcomings of YOLO in detecting the shuttlecock. Nevertheless, there may still arise scenarios where both YOLOv5m and TrackNetV2 fail to detect the shuttlecock simultaneously. In that case, we assign the coordinates of LandingX and LandingY as (0, 0).
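A minimal sketch of this fallback logic, assuming each detector returns an (x, y) tuple or None when the shuttlecock is not found:

```python
def landing_point(yolo_ball, tracknet_ball):
    """Prefer the YOLOv5m ball detection, fall back to TrackNetV2, and emit
    (0, 0) when both detectors miss the shuttlecock."""
    if yolo_ball is not None:
        return yolo_ball          # (LandingX, LandingY) from YOLOv5m
    if tracknet_ball is not None:
        return tracknet_ball      # (LandingX, LandingY) from TrackNetV2
    return (0, 0)                 # neither detector fired

print(landing_point(None, (612.0, 344.5)))   # -> (612.0, 344.5)
print(landing_point(None, None))             # -> (0, 0)
```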

3.6 YOLOv8-pose

YOLOv8 Jocher et al. (2023) is the latest version of YOLO by Ultralytics. Being at the forefront of advanced technology, YOLOv8 represents a SOTA model that builds upon the achievements of its predecessors. It incorporates innovative features and enhancements to deliver superior performance, versatility, and efficiency. Furthermore, YOLOv8 provides comprehensive support for a wide array of vision AI tasks, encompassing detection, segmentation, pose estimation, tracking, and classification. In the CodaLab competition, we adopt a different approach for determining the positions of the players: instead of relying on the vertices of the detection boxes, we utilize the pose estimated by the YOLOv8x-pose-p6 model to acquire the players' foot positions with enhanced accuracy. The image with the updated method is depicted in Figure 6.

Figure 6: Detection results from YOLOv5m and YOLOv8-pose.
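A minimal sketch of extracting foot positions with the Ultralytics API is shown below; averaging the two ankle keypoints (COCO keypoint indices 15 and 16) as the "foot position" is an assumption for illustration, and the frame path is hypothetical.

```python
from ultralytics import YOLO

model = YOLO("yolov8x-pose-p6.pt")            # official pretrained pose model
result = model("hit_frame.jpg")[0]

for kpts in result.keypoints.xy:              # one (17, 2) keypoint tensor per person
    left_ankle, right_ankle = kpts[15], kpts[16]
    foot_xy = (left_ankle + right_ankle) / 2  # midpoint of the two ankles
    print(foot_xy.tolist())
```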

4 Results and Discussion

This section encompasses the Evaluation Criteria, Experimental Results, and Discussion.

4.1 Evaluation Criteria

The scoring system is based on each individual video clip, with a maximum score of 1 point per clip. The first step is to validate the number of shots within the rally video, with each row in the corresponding csv file representing one shot. If the predicted number of shots differs from the ground truth, the clip receives a score of 0 points. However, if the number of shots is accurately predicted, a score of 0.1 points is awarded, and the evaluation proceeds to the next stage, which involves comparing the contents of the columns. In this stage, the maximum score for correctly matching column content is 0.9 points.

When the number of shots is accurate for the test video, each shot is evaluated individually, and its score is determined based on the cumulative score of its columns. The average of all the shot scores is then used as the clip's content score. The evaluation process begins by validating the HitFrame column. If the prediction error exceeds 2 frames compared to the true value, the shot is awarded 0 points in terms of content and the scoring proceeds to the next shot. However, if the prediction error is no more than 2 frames, a score of 0.1 points is given, and the remaining columns within the same shot are compared.

  • Hitter: 0.1 points if correct, 0 points otherwise.

  • BallHeight: 0.1 points if correct, 0 points otherwise.

  • Landing: If the prediction error is not greater than 6 pixels in Euclidean distance, it is considered correct and is awarded 0.1 points; otherwise, 0 points.

  • HitterLocation: If the prediction error is not greater than 10 pixels in Euclidean distance, it is considered correct with 0.05 points; otherwise, 0 points.

  • DefenderLocation: If the prediction error is not greater than 10 pixels in Euclidean distance, it is considered correct with 0.05 points; otherwise, 0 points.

  • Backhand: 0.05 points if correct, 0 points otherwise.

  • RoundHead: 0.05 points if correct, 0 points otherwise.

  • BallType: 0.2 points if correct, 0 points otherwise.

  • Winner: 0.1 points if correct, 0 points otherwise. (Note: if it is not the last shot but this field is filled in, it is considered incorrect.)

Assuming that there are R rally videos in the data set and the i-th video has S_i shots, the scoring formula is given below:

\frac{1}{R}\sum_{i=1}^{R} 1_{S_i = S_i^{pred}}\left(0.1 + ASS_i\right) \quad (1)

in which ASS_i (the Average Shot Score of the content of the i-th rally video) is given by:

ASS_i = \frac{1}{S_i}\sum_{j=1}^{S_i} 1_{|HitFrame_j - HitFrame_j^{pred}| \leq 2}\, SS_j \quad (2)

with SS_j (the j-th Shot Score) given by:

SS_j = 0.1 + 0.1\times 1_{Hitter_j = Hitter_j^{pred}}
     + 0.1\times 1_{BallHeight_j = BallHeight_j^{pred}}
     + 0.1\times 1_{\|Landing_j - Landing_j^{pred}\| < 6}
     + 0.05\times 1_{\|HitterLocation_j - HitterLocation_j^{pred}\| < 10}
     + 0.05\times 1_{\|DefenderLocation_j - DefenderLocation_j^{pred}\| < 10}
     + 0.05\times 1_{Backhand_j = Backhand_j^{pred}}
     + 0.05\times 1_{RoundHead_j = RoundHead_j^{pred}}
     + 0.2\times 1_{BallType_j = BallType_j^{pred}}
     + 0.1\times 1_{Winner_j = Winner_j^{pred}} \quad (3)
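For clarity, the rule above can be written as a short scoring routine. The sketch below evaluates one rally and assumes each shot is represented as a dict keyed by the column names; it is a restatement of Eqs. (1)-(3), not the official evaluation script.

```python
import numpy as np

def rally_score(gt_shots, pred_shots):
    """Per-rally score: 0.1 for the correct shot count plus the average
    per-shot content score, as in Eqs. (1)-(3)."""
    if len(pred_shots) != len(gt_shots):
        return 0.0                                        # wrong shot count -> 0 points
    shot_scores = []
    for gt, pred in zip(gt_shots, pred_shots):
        if abs(gt["HitFrame"] - pred["HitFrame"]) > 2:
            shot_scores.append(0.0)                       # HitFrame off by more than 2 frames
            continue
        s = 0.1                                           # HitFrame within 2 frames
        s += 0.1 * (gt["Hitter"] == pred["Hitter"])
        s += 0.1 * (gt["BallHeight"] == pred["BallHeight"])
        s += 0.1 * (np.linalg.norm(np.subtract(gt["Landing"], pred["Landing"])) < 6)
        s += 0.05 * (np.linalg.norm(np.subtract(gt["HitterLocation"], pred["HitterLocation"])) < 10)
        s += 0.05 * (np.linalg.norm(np.subtract(gt["DefenderLocation"], pred["DefenderLocation"])) < 10)
        s += 0.05 * (gt["Backhand"] == pred["Backhand"])
        s += 0.05 * (gt["RoundHead"] == pred["RoundHead"])
        s += 0.2 * (gt["BallType"] == pred["BallType"])
        s += 0.1 * (gt["Winner"] == pred["Winner"])
        shot_scores.append(s)
    return 0.1 + sum(shot_scores) / len(gt_shots)         # one term of Eq. (1)
```

Averaging rally_score over all R rally videos gives the final score of Eq. (1).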

4.2 Experimental Results

To begin with, we discuss the hyperparameters and performance of SwingNet, as this aspect determines the upper limit of the overall score: any error in the detection of ShotSeq and HitFrame renders subsequent feature accuracy irrelevant, as those features would not contribute to the score. Subsequently, we delve into the hyperparameters and performance of the ViT-B/16 and YOLOv5m models. The contribution of each model to the score is illustrated at the end of this section. Note that since the official pretrained TrackNetV2 and YOLOv8x-pose-p6 models are employed directly for inference, no further details about them are provided here.

4.2.1 Hyperparameters and Performance of SwingNet

To enhance the performance of SwingNet in capturing key frames more effectively, we conducted numerous experiments focused on optimizing the hyperparameters. During the training phase, various parameters can be fine-tuned, including the image size, sequence length, event ratio, number of frozen layers, batch size, and iteration number. During the inference phase, we can additionally adjust parameters such as the image size, sequence length, probabilities for the quantiles, and the filter threshold for successive predicted events to further refine the results (a sketch of this inference-time filtering follows Table 1). After numerous trials, we determined the hyperparameters that lead to improved predictions for the model. These optimized hyperparameters are summarized in Table 1.

Hyperparameters                   Train      Inference
image size                        180×180    180×180
sequence length                   64         64
number of frozen layers           10         10
batch size                        8          8
event ratio                       0.0305     -
iteration number                  10000      -
probabilities for the quantiles   -          0.8
filter threshold                  -          3
Table 1: Optimized hyperparameters for SwingNet.
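The inference-time hyperparameters in Table 1 suggest a post-processing step of this kind: threshold the per-frame hit probabilities at a quantile and suppress spurious short runs. The sketch below is one plausible interpretation of "probabilities for the quantiles" and "filter threshold", not necessarily the exact rule used in this work.

```python
import numpy as np

def extract_hit_frames(hit_probs, quantile=0.8, filter_threshold=3):
    """Keep frames whose hit probability exceeds the given quantile, group
    consecutive frames, discard groups shorter than the filter threshold, and
    report each group's most confident frame as a HitFrame."""
    hit_probs = np.asarray(hit_probs)
    keep = hit_probs >= np.quantile(hit_probs, quantile)
    hit_frames, run = [], []
    for idx, flag in enumerate(keep):
        if flag:
            run.append(idx)
        elif run:
            if len(run) >= filter_threshold:
                hit_frames.append(run[int(np.argmax(hit_probs[run]))])
            run = []
    if run and len(run) >= filter_threshold:
        hit_frames.append(run[int(np.argmax(hit_probs[run]))])
    return hit_frames

# toy example: one sustained burst of high probability around frame 9
probs = [0.1] * 8 + [0.90, 0.95, 0.93, 0.90] + [0.2] * 8
print(extract_hit_frames(probs))   # -> [9]
```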

Once the hyperparameters of SwingNet have been determined, it becomes essential to examine how various video representations influence the detection results. We utilized three distinct types of input videos, including original video, optical flow video (Opt video), and optical flow video without background (Opt video w/o BG). Based on the information presented in Table 2, it can be observed that utilizing the optical flow video without background yields the highest score compared to other types of videos.

Video type          Iter. \ Loss      Score
Original video      2000 \ 0.5821     0.0234
                    4000 \ 0.5074     0.0275
                    6000 \ 0.4622     0.0184
                    8000 \ 0.4269     0.0276
                    10000 \ 0.3979    0.0239
Opt video           2000 \ 0.3799     0.0229
                    4000 \ 0.3325     0.0325
                    6000 \ 0.3037     0.0333
                    8000 \ 0.2839     0.0249
                    10000 \ 0.2664    0.0334
Opt video w/o BG    2000 \ 0.3808     0.0274
                    4000 \ 0.3294     0.0331
                    6000 \ 0.2982     0.0349
                    8000 \ 0.2760     0.0225
                    10000 \ 0.2582    0.0263
Table 2: Performance of SwingNet on different types of videos.

Next, we made some architectural adjustments in an attempt to achieve higher scores. These adjustments included MobileNetV3 + bidirectional LSTM and MobileNetV2 + TCN. Before conducting the experiments, it was anticipated that both of these architectures would outperform the original one, as MobileNetV3 Howard et al. (2019) and the Temporal Convolution Network (TCN) Lea et al. (2017) have been proven effective in other literature. However, compared to Table 2, it can be clearly seen that the original architecture still performs best. The performance of the tuned architectures is shown in Table 3.

Architecture                        Iter. \ Loss      Score
MobileNetV3 + bidirectional LSTM    2000 \ 0.4364     0.0136
                                    4000 \ 0.3893     0.0293
                                    6000 \ 0.3628     0.0162
                                    8000 \ 0.3437     0.0253
                                    10000 \ 0.3288    0.0115
MobileNetV2 + TCN                   2000 \ 0.5417     0.0252
                                    4000 \ 0.5065     0.0313
                                    6000 \ 0.4816     0.0289
                                    8000 \ 0.4613     0.0207
                                    10000 \ 0.4440    0.0234
Table 3: Performance of tuned architectures.

Therefore, in this study, we primarily used optical flow without background as the main type of video for detection, and performed 5-fold inference. All the scores for the 5-fold inference are presented in Table 4.

Iter. \ fold    1        2        3        4        5
1000            0.0259   0.0262   0.0168   0.027    0.0263
2000            0.0241   0.0261   0.0354   0.034    0.0339
3000            0.0212   0.0352   0.037    0.0318   0.0426
4000            0.0275   0.0266   0.0389   0.0372   0.0399
5000            0.0288   0.027    0.0289   0.0366   0.0351
6000            0.0312   0.0167   0.0362   0.0269   0.0283
7000            0.0233   0.0372   0.0144   0.0326   0.0319
8000            0.0285   0.0347   0.0375   0.0396   0.0213
9000            0.0236   0.0237   0.0387   0.0279   0.0226
10000           0.0324   0.0182   0.0348   0.0262   0.0251
Table 4: Scores obtained from 5-fold inference.

After multiple trials, the highest score obtained for the first two columns, as shown in Table 4, is 0.0426. In addition, we attempted to ensemble the detection results of SwingNet, but it did not effectively improve the score. Furthermore, it is notable that the maximum score for the first two columns can reach 0.2 according to the evaluation criteria. This observation further emphasizes that SwingNet's predictions for the first two columns still have great potential for improvement.

Having covered the first two columns, we now present the hyperparameters employed for the classification and detection tasks using ViT-B/16 and YOLOv5m, respectively.

4.2.2 Hyperparameters of ViT-B/16 and YOLOv5m

Based on the numerous experiments conducted, we determined the optimal hyperparameters that improve the performance of ViT-B/16 and YOLOv5m, as presented in Table 5. To enhance the inference performance of the ViT-B/16 and YOLOv5m models, the training strategy involves maximizing the image size; larger image sizes tend to reveal finer details, which significantly benefits both training and inference. In addition, we divided the training data into 5 folds for the classification tasks.

Hyperparameters    ViT-B/16   YOLOv5m
image size         480        2880
optimizer          SGD        SGD
learning rate      3E-2       1E-2
loss function      CE         BCE
batch size         4          1
iteration number   10000      -
epochs             -          100
Table 5: Hyperparameters of ViT-B/16 and YOLOv5m.

Once the hyperparameters of ViT-B/16 and YOLOv5m have been determined, it is important to delve into the model inference stage. After completing the full training phase, we obtain a total of 30 ViT-B/16 models for these classification tasks, specifically 5 models for each. In the inference process, we employed ensemble techniques, namely vote ensemble and mean ensemble. These approaches merge the predictions of multiple models by either averaging the probabilities or selecting the choice that receives the most votes, with the aim of enhancing the overall performance of the prediction. Table 6 compares the scores achieved with and without these ensemble techniques, using the previous predictions made by SwingNet as a baseline with a score of 0.0426.

Features \ Ensemble   fold1     Vote      Mean
Hitter                +0.0068   +0.0068   +0.0068
RoundHead             +0.0033   +0.0033   +0.0033
Backhand              +0.0031   +0.0031   +0.0031
BallHeight            +0.0066   +0.0066   +0.0067
BallType              +0.0089   +0.0091   +0.0089
Winner                +0.0005   +0.0005   +0.0006
Table 6: Ablation study of ensemble techniques.

It can be observed that there is only a slight difference between using and not using ensemble techniques, and the variations among the different ensemble techniques in Table 6 are also small. The main factor contributing to this observation is the notable disparity in the probabilities predicted by ViT-B/16. As illustrated in Figure 3, the model assigns a remarkably high probability of 0.9745 to the second category, while the probability of the next most likely category is only 0.0255; the disparity between these probabilities is substantial. Consequently, the utilization of ensemble techniques does not significantly impact the classification results, as the dominant probability already indicates a clear classification preference. Therefore, we can train only one ViT-B/16 model for each feature during the classification tasks, rather than five models. This approach greatly reduces the time and resources required for training and inference.
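To make the two ensembling rules explicit, the sketch below contrasts mean and vote ensembling on a set of hypothetical fold-wise probability vectors; as noted above, when one class dominates the fold predictions both rules collapse to the same answer.

```python
import numpy as np

def mean_ensemble(prob_list):
    """Average the per-class probabilities of all models, then take the argmax."""
    return int(np.argmax(np.mean(prob_list, axis=0)))

def vote_ensemble(prob_list):
    """Each model votes for its argmax class; ties go to the lower class index."""
    votes = np.bincount([int(np.argmax(p)) for p in prob_list])
    return int(np.argmax(votes))

# five folds of illustrative (made-up) BallType probabilities from ViT-B/16
probs = [np.array([0.01, 0.97, 0.02]), np.array([0.05, 0.90, 0.05]),
         np.array([0.02, 0.95, 0.03]), np.array([0.10, 0.85, 0.05]),
         np.array([0.03, 0.94, 0.03])]
print(mean_ensemble(probs), vote_ensemble(probs))   # both return class 1
```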

Now, let us analyze the individual contributions of the different models to their respective features. Firstly, we use SwingNet to obtain the predicted answers for ShotSeq and HitFrame. Subsequently, we utilize ViT-B/16 to predict the features related to the classification task. Furthermore, we employ the YOLOv5m model with TrackNetV2 to detect the positions of the players and the trajectory of the shuttlecock. Moreover, we utilize the YOLOv8x-pose-p6 model to estimate the poses of the players, resulting in a more accurate detection of their positions, as shown in Figure 6. These steps allow us to generate the complete submission file. The corresponding contributions can be found in Table 7.

Features                             Models                       Score
ShotSeq & HitFrame                   SwingNet                     0.0426
Hitter                               ViT                          0.0494 (+0.0068)
RoundHead                            ViT                          0.0527 (+0.0033)
Backhand                             ViT                          0.0558 (+0.0031)
BallHeight                           ViT                          0.0625 (+0.0067)
LandingX & LandingY                  YOLOv5m + TrackNetV2         0.0625 (+0.0000)
HitterLocationX & HitterLocationY    YOLOv5m + YOLOv8x-pose-p6    0.0630 (+0.0005)
  & DefenderLocationX & DefenderLocationY
BallType                             ViT                          0.0721 (+0.0091)
Winner                               ViT                          0.0727 (+0.0006)
Table 7: The distinct contributions of various models to the features.

4.3 Discussion

In this section, we primarily address four key points of discussion. The first is the extensive GPU memory required when using the method proposed in this article for shuttlecock hitting event detection with SwingNet. The main reason is that the model reads in 64 frames of video at once for evaluation. For example, if we input 64 full-color images at 720p resolution, the model needs to accommodate a size of 64 × 24 × 1280 × 720 / 8 / 1000 ≈ 169 GB. Due to the limited 6 GB memory of our GPU, the maximum image resolution we can use for training and testing is only 180×180. We believe that if sufficient GPU memory is available, utilizing the method proposed in this article with a larger image size could be a promising approach. Compared to analyzing the trajectory of the shuttlecock, this method is more intuitive and can overcome situations where the shuttlecock is obstructed by the player.

The second point is that, according to the rules of badminton, the information in the Hitter column must alternate between A and B (A → B → A → B …). This implies that there will never be two consecutive instances of the same player. Consequently, by predicting the first hitter, we can determine the order of every hitter in a video. It is crucial to emphasize that when employing this alternation-based approach, the accuracy of the predicted first hitter becomes paramount: any inaccuracy in this prediction results in a complete reversal of the order of every hitter in the entire video.
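A minimal sketch of this alternation rule, assuming the A/B labels used in the competition:

```python
def fill_hitters(first_hitter, num_shots):
    """Once the first hitter is known, the Hitter column of a rally is fully
    determined by alternation (A -> B -> A -> B ...)."""
    other = "B" if first_hitter == "A" else "A"
    return [first_hitter if i % 2 == 0 else other for i in range(num_shots)]

print(fill_hitters("A", 5))   # ['A', 'B', 'A', 'B', 'A']
```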

The third point is that focusing exclusively on the hitter during classification could further improve performance. That is, instead of utilizing the entire image for classification as depicted in Figure 3, one could use only the region occupied by the hitter. Implementing this method has the potential to reduce the extent to which the defender's appearance confuses hitter-related attributes such as RoundHead, Backhand, and BallType.

Lastly, it is worth noting that several features were not utilized during training in this competition, such as LandingX, LandingY, HitterLocationX, HitterLocationY, DefenderLocationX, and DefenderLocationY. Without fully utilizing the available information, we believe a certain degree of misalignment remains. In future work, we will analyze this unused information with the aim of leveraging it to aid both the training and inference stages.

5 Summary

This article proposes a new perspective for the shuttlecock hitting event detection task, which involves using a deep learning model to learn the meaning conveyed by an image sequence and thus capture the desired information from the video. Furthermore, the video processing method presented in this article can effectively reduce the difficulty of model learning, resulting in better performance in the keyframe detection task. In the future, we will continue to explore adjustments to the model architecture in order to construct a more precise method for obtaining keyframes, combined with suitable image processing techniques to further enhance the overall framework.

Acknowledgments

I am delighted to have the opportunity to integrate the models used in this competition and engage in deeper discussions about their structures and mechanisms. I am also grateful to my family and friends for their unwavering support and encouragement throughout this period, which enabled me to fully immerse myself in the contests and generate innovative ideas. I truly appreciate it.

References

  • Bruhn et al. [2005] Andrés Bruhn, Joachim Weickert, and Christoph Schnörr. Lucas/kanade meets horn/schunck: Combining local and global optic flow methods. International journal of computer vision, 61:211–231, 2005.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Harouna and Mémin [2017] S Kadri Harouna and Etienne Mémin. Stochastic representation of the reynolds transport theorem: revisiting large-scale modeling. Computers & Fluids, 156:456–469, 2017.
  • Howard et al. [2019] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019.
  • Hsu et al. [2019] Tzu-Han Hsu, Ching-Hsuan Chen, Nyan Ping Jut, Tsi-Ui Ik, Wen-Chih Peng, Yu-Shuen Wang, Yu-Chee Tseng, Jiun-Long Huang, Yu-Tai Ching, Chih-Chuan Wang, et al. Coachai: A project for microscopic badminton match data collection and tactical analysis. In 2019 20th Asia-Pacific Network Operations and Management Symposium (APNOMS), pages 1–4. IEEE, 2019.
  • Jocher et al. [2023] Glenn Jocher, Ayush Chaurasia, and Jing Qiu. YOLO by Ultralytics, January 2023.
  • Lea et al. [2017] Colin Lea, Michael D Flynn, Rene Vidal, Austin Reiter, and Gregory D Hager. Temporal convolutional networks for action segmentation and detection. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 156–165, 2017.
  • McNally et al. [2019] William McNally, Kanav Vats, Tyler Pinto, Chris Dulhanty, John McPhee, and Alexander Wong. Golfdb: A video database for golf swing sequencing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019.
  • Sun et al. [2020] Nien-En Sun, Yu-Ching Lin, Shao-Ping Chuang, Tzu-Han Hsu, Dung-Ru Yu, Ho-Yi Chung, and Tsì-Uí İk. Tracknetv2: Efficient shuttlecock tracking network. In 2020 International Conference on Pervasive Artificial Intelligence (ICPAI), pages 86–91. IEEE, 2020.
  • Zhu et al. [2021] Xingkui Zhu, Shuchang Lyu, Xu Wang, and Qi Zhao. Tph-yolov5: Improved yolov5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2778–2788, 2021.