Friends Across Time: Multi-Scale Action Segmentation Transformer for Surgical Phase Recognition
Abstract
Automatic surgical phase recognition is a core technology for modern operating rooms and online surgical video assessment platforms. Current state-of-the-art methods use both spatial and temporal information to tackle the surgical phase recognition task. Building on this idea, we propose the Multi-Scale Action Segmentation Transformer (MS-AST) for offline surgical phase recognition and the Multi-Scale Action Segmentation Causal Transformer (MS-ASCT) for online surgical phase recognition. We use ResNet50 or EfficientNetV2-M for spatial feature extraction. Our MS-AST and MS-ASCT can model temporal information at different scales with multi-scale temporal self-attention and multi-scale temporal cross-attention, which enhances the capture of temporal relationships between frames and segments. We demonstrate that our method can achieve 95.26% and 96.15% accuracy on the Cholec80 dataset for online and offline surgical phase recognition, respectively, which achieves new state-of-the-art results. Our method can also achieve state-of-the-art results on non-medical datasets in the video action segmentation domain.
Index Terms:
surgical phase recognition, multi-scale, action segmentation, transformer
I Introduction
Over the last few years, Video-Based Analysis (VBA) has been increasingly used for surgical video analysis [1]. One of its key technologies, surgical phase recognition, is instrumental in facilitating efficient skill assessment training. Surgical phase recognition is the automatic detection of start and stop times for different steps or phases of a surgery. While online surgical phase recognition algorithms are mainly designed to support surgeons in the operating room (OR), offline surgical phase recognition algorithms play a crucial role in efficiently categorizing large collections of surgical videos. This efficient categorization streamlines training by allowing for the comparison of surgical phase timings between different surgeons, aiding in skill assessment. Moreover, it supports the analysis of surgical phase sequences, contributing to the standardization of surgical techniques. It also helps in quickly identifying key moments in surgery videos for expert analysis, significantly reducing time in both training and evaluation processes.
To capture spatial information from video frames, early research used image classification networks [2]. However, these methods did not capture temporal information between video frames. Researchers then combined image classification networks with Long Short-Term Memory (LSTM) networks for both spatial and temporal modeling [3, 4]. More recent research [5, 6] utilized Multi-Stage Temporal Convolutional Networks (MS-TCN) for full-video temporal modeling and demonstrated superior performance compared to LSTM [5, 7]. As vision transformers increasingly dominate many computer vision research areas [8], researchers now utilize transformers for temporal modeling [9, 10, 11, 12, 13, 14, 15]. The attention-layer design in transformers is believed to force the model to learn and focus on informative timestamps in surgical videos, rather than on uninformative ones such as frames with motion blur or idle action.
In this paper, we expand upon transformer-based techniques and propose the Multi-Scale Action Segmentation Transformer (MS-AST) for surgical phase recognition. With different temporal modeling scales, our method can achieve frame-level modeling and segment-level modeling at the same time. Our model is capable of capturing both fast and slow actions, such as short and long surgical phases, respectively, across a range of small and large temporal scales. Our contributions in this paper are listed as follows: (1) Building upon Transformer for Action Segmentation (ASFormer) [9], we design multi-scale temporal self-attention and multi-scale temporal cross-attention to capture frame and segment relations at different temporal scales in our proposed MS-AST for surgical phase recognition. (2) We modify MS-AST to Multi-Scale Action Segmentation Causal Transformer (MS-ASCT), a causal design for online surgical phase recognition. (3) Our proposed MS-AST and MS-ASCT achieve new state-of-the-art results in offline and online surgical phase recognition, respectively. (4) To demonstrate the robustness and wide applicability of our method, we evaluate our MS-AST on the 50Salads [16] and GTEA [17] datasets, which are non-medical datasets that are widely used in the video action segmentation domain. Our MS-AST also achieves state-of-the-art results on these datasets.
II Method
An overview of our method is depicted in Figure 1. First, we train an image classification network with training data from the target dataset for feature extraction. Second, we use the image classification network to extract frame-level spatial information from one frame per second of video, and we concatenate these frame-level features to obtain full-video features. Finally, we train our action segmentation network on the full-video features for temporal modeling and surgical phase recognition. These steps are described in more detail below.

II-A Feature Extraction Network
Full-video modeling with all video frames from surgical videos requires significant GPU resources. Feature extraction networks are used to tackle this problem: instead of learning from raw video frames, the action segmentation network learns from feature vectors that summarize each frame, which is far more memory efficient during training.
We utilize ResNet50 [18], which is widely used in previous studies [5, 12]. We load the ImageNet pre-trained weights and conduct transfer learning with the surgical phase dataset. Our Multi-Scale Action Segmentation Transformer can also work with features generated by different feature extraction networks; we utilize EfficientNetV2-M (EffNetV2) [19] as another feature extraction backbone to validate our design.
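As a concrete illustration of this stage, the sketch below summarizes per-second frames into feature vectors with a torchvision ResNet50. It is a minimal sketch rather than our training code: it assumes the backbone has already been fine-tuned on the surgical phase labels, that frames are sampled at one per second and preprocessed to 224x224, and the helper name extract_video_features is ours.

```python
import torch
import torchvision

# Minimal sketch: one 2048-D feature per sampled frame from a (fine-tuned) ResNet50.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
backbone.fc = torch.nn.Identity()  # drop the classification head, keep the pooled 2048-D feature
backbone.eval()

@torch.no_grad()
def extract_video_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, 224, 224) per-second frames -> (T, 2048) feature sequence."""
    chunks = [backbone(chunk) for chunk in frames.split(64)]  # mini-batches to limit memory
    return torch.cat(chunks, dim=0)

# Example: a 90-second clip becomes a (90, 2048) sequence for the action segmentation network.
video_features = extract_video_features(torch.randn(90, 3, 224, 224))
```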
II-B Action Segmentation Network
II-B1 Transformer for Action Segmentation
ASFormer [9] is an encoder-decoder transformer proposed to tackle action segmentation tasks. Given an extracted frame-wise feature sequence, the encoder first predicts an initial action probability for each frame; these initial predictions are then passed to multiple decoders for incremental refinement. The encoder is composed of encoder blocks, each containing a feed-forward layer and a single-head self-attention layer, and each decoder is composed of decoder blocks, each containing a feed-forward layer and a single-head cross-attention layer. To model local features first and then gradually capture global information with an enlarged receptive field, the feed-forward layers use dilated convolutions with a gradually increasing dilation rate, and the attention layers use sliding window attention with an increasing window size.
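The sketch below illustrates the kind of dilated-convolution feed-forward layer described above, with the dilation rate doubling at each layer so the receptive field grows exponentially with depth. It follows the public MS-TCN/ASFormer design in spirit only; the class name DilatedResidualLayer and the dropout rate are our choices for illustration.

```python
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    """Illustrative MS-TCN/ASFormer-style feed-forward layer with a dilated temporal convolution.
    In layer i the dilation is 2**i, so stacking layers enlarges the receptive field."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.conv_dilated = nn.Conv1d(channels, channels, kernel_size=3,
                                      padding=dilation, dilation=dilation)
        self.conv_1x1 = nn.Conv1d(channels, channels, kernel_size=1)
        self.dropout = nn.Dropout(0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        out = torch.relu(self.conv_dilated(x))
        out = self.dropout(self.conv_1x1(out))
        return x + out                                    # residual connection

# Ten layers with dilation 1, 2, 4, ..., 512, as in a 10-layer encoder or decoder.
layers = nn.ModuleList(DilatedResidualLayer(64, 2 ** i) for i in range(10))
```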
II-B2 Multi-Scale Action Segmentation Transformer
To capture segment-level temporal information and to model fast and slow actions at different temporal scales, we modify the self-attention layer in each encoder block and the cross-attention layer in each decoder block of ASFormer into multi-scale temporal self-attention and multi-scale temporal cross-attention, respectively.

As shown in Figure 2, we use three different temporal scales in our multi-scale temporal self-attention. The dilated convolution kernel sizes are 3, 5, and 17 in the first layers for each encoder and decoder, corresponding to 3 different temporal scales. The window sizes for the sliding window attention also change depending on the kernel sizes used in the dilated convolution. Following the design of the ASFormer, when the kernel size is 3, the window size doubles in each layer, starting from 1 in the first layer and ending at 512 in the 10th layer. In our design, when the kernel size is 5 and 17, the window size at the first layer is 1. For a kernel size of 5, the window size then doubles in each layer, starting from 4 in the second layer and ending at 1024 in the 10th layer. For a kernel size of 17, the window size doubles starting from 16 in the second layer and ending at 4096 in the 10th layer. The output features with attention in different scales can be calculated by
$$F_i = F_{i-1} + \alpha \left( w_1 A_1 + w_2 A_2 + w_3 A_3 \right) \qquad (1)$$

where $F_i$ represents the output features generated by the $i$-th encoder or decoder block, $F_{i-1}$ is the input to that block, $w_1$, $w_2$, and $w_3$ are weighted parameters learned during training, and $A_1$, $A_2$, and $A_3$ represent the attention results at the three different scales. $\alpha$ is equal to 1 in the first decoder and is exponentially decreased for the remaining decoders.
In contrast to the multi-scale self-attention in the encoder block, in the decoder block’s multi-scale cross-attention layer, the query and key are obtained from the output of the encoder and the output of the previous layer, while the value is obtained from the output of the previous layer, as shown in Figure 2.
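To make the multi-scale attention concrete, the sketch below fuses sliding-window attention results from three temporal scales with learned weights, in the spirit of Eq. (1). It is a simplified illustration rather than our exact implementation: the default window sizes, the module names (banded_attention, MultiScaleAttention), and the O(T^2) masking used here are assumptions for readability. The cross-attention variant would draw its query and key from the encoder output combined with the previous layer's output, as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def banded_attention(q, k, v, window: int, causal: bool = False):
    """Scaled dot-product attention restricted to a local temporal window.
    q, k, v: (B, T, C). Illustrative masking, not the memory-efficient version."""
    T = q.size(1)
    idx = torch.arange(T, device=q.device)
    dist = idx[None, :] - idx[:, None]                       # j - i
    mask = dist.abs() <= window if not causal else (dist <= 0) & (dist > -window)
    scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

class MultiScaleAttention(nn.Module):
    """Sketch of Eq. (1): fuse attention results from three temporal scales with
    learned weights w1..w3 and a decoder weight alpha (hypothetical names)."""
    def __init__(self, channels: int, windows=(4, 16, 64), alpha: float = 1.0):
        super().__init__()
        self.qkv = nn.ModuleList(nn.Linear(channels, 3 * channels) for _ in windows)
        self.w = nn.Parameter(torch.ones(len(windows)))       # w1, w2, w3
        self.windows, self.alpha = windows, alpha

    def forward(self, x):                                     # x: (B, T, C)
        fused = 0
        for proj, win, w in zip(self.qkv, self.windows, self.w):
            q, k, v = proj(x).chunk(3, dim=-1)
            fused = fused + w * banded_attention(q, k, v, win)
        return x + self.alpha * fused                         # residual, scaled by alpha
```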
II-B3 Multi-Scale Action Segmentation Causal Transformer
In order to achieve online surgical phase recognition in the OR, we modify our MS-AST to incorporate causality, creating the Multi-Scale Action Segmentation Causal Transformer (MS-ASCT). First, we replace the dilated convolutions with causal dilated convolutions, following previous TeCNO research [5]. Second, we remove layer normalization from our network to avoid future information leakage. Finally, we use the causal design of sliding window attention shown in Figure 3, so that the attention operation only uses past information; Figure 3 uses a window size of 5 as an example.
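A minimal sketch of the first modification, a TeCNO-style causal dilated convolution: the input is padded only on the left, so the output at time t never depends on future frames. The class name CausalDilatedConv1d is ours.

```python
import torch
import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
    """Sketch of a causal dilated convolution: left-only padding prevents future leakage."""
    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, C, T)
        x = nn.functional.pad(x, (self.left_pad, 0))          # pad the past side only
        return self.conv(x)

# Output length matches input length, and frame t depends only on frames <= t.
y = CausalDilatedConv1d(64, kernel_size=3, dilation=4)(torch.randn(1, 64, 100))
assert y.shape == (1, 64, 100)
```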

III Dataset
The Cholec80 dataset [2] consists of 80 videos of cholecystectomy surgeries performed by 13 surgeons. The dataset includes annotations for both surgical phase and tool presence. The 7 surgical phases, labeled as P1 to P7, are “Preparation”, “Calot triangle dissection”, “Clipping and cutting”, “Gallbladder dissection”, “Gallbladder packaging”, “Cleaning and coagulation”, and “Gallbladder retraction”. Following previous research [3, 4, 20, 12, 21, 22], we use the first 40 videos for training and reserve the remaining 40 for testing. We conduct both online surgical phase recognition and offline surgical phase recognition experiments with the Cholec80 dataset.
To evaluate our method in the action segmentation domain, we use the 50Salads dataset [16] and the GTEA dataset [17], following the MS-TCN [6] and ASFormer [9] studies. The 50Salads dataset contains 50 videos of salad preparation steps, with 17 action classes and two additional classes for "action start" and "action end". The GTEA dataset contains 28 videos covering 11 classes of daily kitchen activities. For fair comparisons, we utilized the features extracted in [6], which were obtained with an Inflated 3D ConvNet (I3D) [23]. Following previous research [6, 9], five-fold cross-validation is performed on the 50Salads dataset and four-fold cross-validation on the GTEA dataset, using the provided data splits.
IV Experiments
We utilized PyTorch to implement all experiments. We performed all training and testing on a single desktop with one Intel Core i9-13900K CPU and one NVIDIA GeForce RTX 4090 GPU.
IV-A Evaluation metrics
We use frame-level metrics and segmental metrics [24, 25] to evaluate model performance. Following previous research [3, 4, 12, 21], we compute precision, recall, and Jaccard scores for each surgical phase, then average over all surgical phases. We use the segmental edit distance score and the segmental F1-score as segmental metrics to evaluate over-segmentation errors; segment overlap is measured with Intersection over Union (IoU). Absolute improvements are reported in the following sections. For a simpler comparison, we also calculate F1@AVG, the average of the segmental F1-scores at the 10%, 25%, and 50% overlap thresholds, as follows:

$$\mathrm{F1@AVG} = \frac{1}{3}\left(\mathrm{F1@10} + \mathrm{F1@25} + \mathrm{F1@50}\right) \qquad (2)$$
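For illustration, the sketch below computes the segmental F1-score at a single IoU threshold and then F1@AVG as in Eq. (2). It is a simplified version of the standard evaluation procedure from [24, 25]: segments are (label, start, end) tuples with inclusive frame indices, and the function names are ours.

```python
def segmental_f1(pred_segments, gt_segments, iou_threshold):
    """A predicted segment counts as a true positive if it overlaps an unmatched
    ground-truth segment of the same label with IoU >= iou_threshold."""
    matched = [False] * len(gt_segments)
    tp = 0
    for p_label, p_start, p_end in pred_segments:
        best_iou, best_j = 0.0, -1
        for j, (g_label, g_start, g_end) in enumerate(gt_segments):
            if g_label != p_label or matched[j]:
                continue
            inter = max(0, min(p_end, g_end) - max(p_start, g_start) + 1)
            union = (p_end - p_start + 1) + (g_end - g_start + 1) - inter
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        if best_j >= 0 and best_iou >= iou_threshold:
            matched[best_j] = True
            tp += 1
    fp = len(pred_segments) - tp
    fn = len(gt_segments) - tp
    return 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

def f1_avg(pred_segments, gt_segments):
    """F1@AVG from Eq. (2): average over the 10%, 25%, and 50% overlap thresholds."""
    return sum(segmental_f1(pred_segments, gt_segments, t) for t in (0.10, 0.25, 0.50)) / 3
```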
IV-B Implementation details
We trained ResNet50 and EffNetV2, the two feature extraction networks, using cross-entropy loss with a learning rate of 1e-4 and a weight decay of 1e-5. We set the batch size to 16 and the number of training epochs to 50. To augment the data, we resized the smaller side of the frames to 256 pixels and randomly cropped 224 by 224 patches as training samples for ResNet50. Similarly, we resized the smaller side of the frames to 400 pixels and randomly cropped 384 by 384 patches as training samples for EffNetV2. We also randomly selected 15% of the training samples and rotated them within 10 degrees to simulate camera rotation.
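The ResNet50 branch of this augmentation pipeline could be expressed with torchvision transforms roughly as follows; RandomApply with p=0.15 approximates rotating a random 15% of the samples, and the ImageNet normalization statistics are an assumption rather than something stated above.

```python
from torchvision import transforms

# Sketch of the training-time augmentation for the ResNet50 branch
# (the EffNetV2 branch would use a 400-pixel resize and 384x384 crops instead).
train_transform = transforms.Compose([
    transforms.Resize(256),                                   # resize the smaller side to 256 px
    transforms.RandomCrop(224),                               # random 224x224 patch
    transforms.RandomApply([transforms.RandomRotation(10)], p=0.15),  # ~15% of samples rotated
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),          # ImageNet statistics (assumed)
])
```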
We trained our MS-AST and MS-ASCT with cross-entropy loss and a smoothing loss [6]. We utilized the Adam optimizer with a learning rate of 1e-4, a batch size of 1, a dropout rate of 0.5, and 200 training epochs. We used one encoder and three decoders in MS-AST for offline surgical phase recognition, and one encoder and one decoder in MS-ASCT for online surgical phase recognition. We set the total number of dilated convolution layers in each encoder and decoder to 10 and the number of feature maps to 64.
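A minimal sketch of this training objective: frame-wise cross-entropy plus the MS-TCN-style smoothing term, a truncated mean squared error over frame-to-frame differences of log-probabilities [6]. The weighting factor lambda_smooth=0.15 and truncation threshold tau=4 follow common MS-TCN settings and are assumptions here, as is the function name.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(logits, labels, lambda_smooth=0.15, tau=4.0):
    """logits: (B, C, T) frame-wise class scores, labels: (B, T) phase labels."""
    ce = F.cross_entropy(logits, labels)                      # frame-wise cross-entropy
    log_probs = F.log_softmax(logits, dim=1)
    # squared difference between consecutive frames, truncated at tau**2 (smoothing term)
    diff = (log_probs[:, :, 1:] - log_probs[:, :, :-1].detach()) ** 2
    smooth = torch.clamp(diff, max=tau ** 2).mean()
    return ce + lambda_smooth * smooth
```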
IV-C Results
IV-C1 Online surgical phase recognition
To compare model performance across different temporal scales, we compare EffNetV2 Causal ASFormer [22], which utilizes one temporal scale with a kernel size of 3, with EffNetV2 MS-ASCT, which uses multiple temporal scales with different kernel sizes, in Table I. Our kernel sizes are selected following $2^n + 1$, where $n$ is a positive integer. As shown in Table I, MS-ASCT with 3 temporal scales outperforms Causal ASFormer [22] with 1 temporal scale and MS-ASCT with 2 temporal scales. MS-ASCT with kernel sizes of 3, 5, and 17 slightly outperforms MS-ASCT with kernel sizes of 3, 5, and 9.
Method name | Kernel sizes | Accuracy | Edit | F1@10 | F1@25 | F1@50 | F1@AVG |
---|---|---|---|---|---|---|---|
EffNetV2 Causal ASFormer [22] | 3 | 93.44 | 46.20 | 55.43 | 54.60 | 50.47 | 53.50 |
EffNetV2 MS-ASCT | 3,5 | 93.94 | 50.85 | 58.26 | 57.83 | 53.48 | 56.52 |
EffNetV2 MS-ASCT | 3,5,9 | 94.28 | 58.25 | 66.18 | 65.20 | 60.78 | 64.05 |
EffNetV2 MS-ASCT | 3,5,17 | 94.64 | 61.82 | 68.02 | 68.02 | 62.45 | 66.16 |
To demonstrate that our methods outperform other state-of-the-art approaches, we compare our ResNet MS-ASCT and EffNetV2 MS-ASCT with PhaseLSTM [2], EndoLSTM [2], MTRCNet [4], SV-RCNet [3], TeCNO [5], SlowFast UniGRU, TimeSformer UniGRU, Swin UniGRU [21], Trans-SVNet [12], PATG [26], and the Spatio-Temporal Causal Transformer (denoted ST-CT) [13] in Table II. The means and standard deviations reported for some studies in Table II refer to results over repeated experimental runs. Our ResNet MS-ASCT outperforms all methods from previous studies, exceeding the previous state-of-the-art ST-CT [13] by approximately 2% in accuracy and recall and approximately 3% in precision. Our EffNetV2 MS-ASCT further outperforms ResNet MS-ASCT, which shows that our method works with features extracted by different feature extraction networks; it outperforms ST-CT [13] by approximately 3% in accuracy, approximately 5% in precision, and approximately 4% in recall.
Method Name | Accuracy | Precision | Recall |
---|---|---|---|
PhaseLSTM [2, 21] | 79.68±0.07 | 72.85±0.10 | 73.45±0.12 |
EndoLSTM [2, 21] | 80.85±0.17 | 76.81±2.62 | 72.07±0.64 |
MTRCNet [4, 21] | 82.76±0.01 | 76.08±0.01 | 78.02±0.13 |
SV-RCNet [3, 21] | 86.58±1.01 | 80.53±1.59 | 79.94±1.79 |
TeCNO [5, 21] | 88.56±0.27 | 81.64±0.41 | 85.24±1.06 |
SlowFast UniGRU [21] | 90.47±0.46 | 83.12±2.09 | 82.33±1.22 |
TimeSformer UniGRU [21] | 90.42±0.47 | 86.05±1.13 | 83.20±1.80 |
Swin UniGRU [21] | 90.88±0.01 | 85.07±1.74 | 85.59±0.53 |
Trans-SVNet [12, 13] | 89.6 | 81.7 | 87.5 |
PATG [26] | 91.36 | 86.88 | 84.00 |
ST-CT [13] | 91.4 | 85.4 | 86.3 |
ResNet MS-ASCT | 93.58±0.13 | 88.90±0.41 | 88.20±0.34 |
EffNetV2 MS-ASCT | 94.59±0.04 | 90.41±0.22 | 90.07±0.46 |
We also calculated video-by-video accuracy, precision, recall, and Jaccard score for our EffNetV2 MS-ASCT and compared them with other state-of-the-art approaches as shown in Table III. The mean and the standard deviation in Table III refer to results over different test videos. Our EffNetV2 MS-ASCT outperforms all other methods in all considered metrics.
Method name | Accuracy | Precision | Recall | Jaccard |
---|---|---|---|---|
EndoNet [2] | 81.7±4.2 | 73.7±16.1 | 79.6±7.9 | |
EndoNet+LSTM [27] | 88.6±9.6 | 84.4±7.9 | 84.7±7.9 | |
MTRCNet-CL [4] | 89.2±7.6 | 86.9±4.3 | 88.0±6.9 | |
SV-RCNet [3] | 85.3±7.3 | 80.7±7.0 | 83.5±7.5 | |
OHFM [28] | 87.3±5.7 | | | 67.0±13.3 |
TeCNO [5] | 88.6±7.8 | 86.5±7.0 | 87.6±6.7 | 75.1±6.9 |
TMRNet(ResNeSt) [20] | 90.1±7.6 | 90.3±3.3 | 89.5±5.0 | 79.1±5.7 |
Trans-SVNet [29] | 90.3±7.1 | 90.7±5.0 | 88.8±7.4 | 79.3±6.6 |
SAHC [30] | 91.8±8.1 | 90.3±6.4 | 90.0±6.4 | 81.2±5.5 |
EffNetV2 MS-TCN(Causal) [22] | 93.69±5.30 | 90.69±6.44 | 91.88±5.51 | 83.06±9.51 |
EffNetV2 Causal ASFormer [22] | 94.48±4.22 | 92.15±5.25 | 91.50±9.45 | 84.20±10.42 |
EffNetV2 C-ECT [22] | 94.67±4.32 | 92.66±5.38 | 91.11±7.28 | 84.02±8.86 |
ResNet MS-ASCT | 94.71±4.09 | 92.71±5.31 | 91.49±6.16 | 84.35±8.69 |
EffNetV2 MS-ASCT | 95.26±3.49 | 92.76±5.69 | 92.16±9.01 | 85.36±10.98 |
To further compare our methods with previous studies, we calculate the overall accuracy and segmental metrics including the segmental edit distance score, the segmental F1 score at overlapping thresholds of 10%, 25%, and 50%, and their averages, as shown in Table IV. Our EffNetV2 MS-ASCT outperforms previous state-of-the-art EffNetV2 C-ECT [22] by approximately 1% in accuracy, approximately 9% in segmental edit distance score, and approximately 8% in the average of the segmental F1 score at different overlapping thresholds.
We visualize the prediction results of 4 test videos for EffNetV2 Causal ASFormer [22] and EffNetV2 MS-ASCT as shown in Figure 4. These visualizations demonstrate that our EffNetV2 MS-ASCT produces fewer over-segmentation errors and out-of-order predictions.

We also plot the normalized confusion matrix from one of our experiments with EffNetV2 MS-ASCT on the Cholec80 dataset for online surgical phase recognition in Figure 5. Some of the prediction errors are due to our model predicting P3 as P4. Our model is also sometimes confused between P6 and P7. Both P3 and P7 are short surgical phases. Insufficient training data may be affecting the performance of these surgical phases.

IV-C2 Offline surgical phase recognition
To demonstrate that our ResNet MS-AST outperforms other state-of-the-art methods, we repeat our experiments multiple times and compare our ResNet MS-AST with ResNet ASFormer [31], SlowFast Transformer, SlowFast BiGRU, TimeSformer Transformer, TimeSformer BiGRU, Swin Transformer, and Swin BiGRU [21] in Table V. The means and standard deviations in Table V refer to results over repeated experimental runs. Our ResNet MS-AST outperforms ResNet ASFormer [31] and all other methods on all considered metrics, and our EffNetV2 MS-AST achieves comparable performance to ResNet MS-AST.
Method Name | Accuracy | Precision | Recall |
---|---|---|---|
SlowFast Transformer [21] | 74.12±0.03 | 69.73±0.32 | 65.99±1.15 |
SlowFast BiGRU [21] | 92.74±0.23 | 87.71±1.27 | 84.69±1.00 |
TimeSformer Transformer [21] | 73.46±0.99 | 73.93±0.32 | 68.27±2.28 |
TimeSformer BiGRU [21] | 92.82±1.91 | 89.70±1.34 | 86.18±2.67 |
Swin Transformer [21] | 80.10±0.72 | 74.35±0.98 | 74.37±0.69 |
Swin BiGRU [21] | 93.87±0.04 | 89.96±0.79 | 89.65±0.58 |
ResNet ASFormer [31] | 93.39±0.23 | 89.85±0.42 | 89.78±0.28 |
ResNet MS-AST | 95.17±0.07 | 90.41±0.20 | 91.85±0.13 |
EffNetV2 MS-AST | 95.18±0.06 | 90.64±0.41 | 91.85±0.46 |
We also calculated video-by-video accuracy, precision, recall, and Jaccard score for our EffNetV2 MS-AST and compared them with other state-of-the-art approaches as shown in Table VI. The mean and the standard deviation in Table VI refer to results over different test videos. Our EffNetV2 MS-AST outperforms all other methods in all considered metrics.
Method Name | Accuracy | Precision | Recall | Jaccard |
---|---|---|---|---|
Less is More (Timestamp) [32] | 91.9±5.6 | 89.5±4.4 | 90.5±5.9 | 79.9±8.5 |
Not End-to-End (TCN) [33] | 92.8±5.0 | | 87.5±8.3 | 78.7±9.4 |
ResNet MS-TCN [6, 5] | 92.88±6.15 | 92.22±4.16 | 89.76±7.23 | 82.34±6.25 |
ResNet ASFormer [9, 31] | 94.25±5.17 | 91.70±5.39 | 92.33±5.01 | 84.48±6.49 |
ResNet MS-AST | 96.06±3.12 | 92.83±4.67 | 94.28±3.67 | 87.41±6.87 |
EffNetV2 MS-AST | 96.15±3.94 | 93.15±5.00 | 94.57±3.52 | 88.31±6.89 |
IV-C3 Action segmentation on non-medical datasets
MS-TCN and ASFormer were originally designed and evaluated in the non-medical domain with datasets including the 50Salads and GTEA datasets. To demonstrate that our MS-AST can also be utilized for non-medical videos in the action segmentation domain, we evaluate our MS-AST on the 50Salads and GTEA datasets following the MS-TCN and ASFormer studies [6, 9]. For fair comparisons and to demonstrate that our MS-AST can be trained with features generated by feature extraction networks other than image classification networks, we utilized I3D features provided by [6] as training data for our MS-AST.
As shown in Table VII and Table VIII, our MS-AST outperforms ASFormer [9], which utilizes only one temporal scale, on all considered metrics. Our MS-AST achieves state-of-the-art results on both datasets.
V Conclusions
In this work, we propose the Multi-Scale Action Segmentation Transformer (MS-AST) for offline surgical phase recognition and the Multi-Scale Action Segmentation Causal Transformer (MS-ASCT) for online surgical phase recognition. Our proposed networks utilize multiple scales of temporal feature modeling, which is a good conceptual fit for multi-step surgical scenarios, and can work with various feature extraction networks. Our methods achieve new state-of-the-art results on the Cholec80 dataset for surgical phase recognition. We also evaluate our MS-AST on the 50Salads and GTEA datasets to demonstrate the wider applicability of our method, and it performs on par with or better than state-of-the-art results in both cases. Future research directions could explore the integration of our models in real-time surgical environments, adapt our models to more surgical procedures, or focus on enhancing the interpretability of model predictions to provide more actionable insights for surgical teams.
References
- [1] L. S. Feldman, A. D. Pryor, A. K. Gardner, B. J. Dunkin, L. Schultz, M. M. Awad, and E. M. Ritter, “Sages video-based assessment (vba) program: a vision for life-long learning for surgeons,” Surgical endoscopy, vol. 34, no. 8, pp. 3285–3288, 2020.
- [2] A. P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. De Mathelin, and N. Padoy, “Endonet: a deep architecture for recognition tasks on laparoscopic videos,” IEEE TMI, vol. 36, no. 1, pp. 86–97, 2016.
- [3] Y. Jin, Q. Dou, H. Chen, L. Yu, J. Qin, C.-W. Fu, and P.-A. Heng, “Sv-rcnet: workflow recognition from surgical videos using recurrent convolutional network,” IEEE TMI, vol. 37, no. 5, pp. 1114–1126, 2017.
- [4] Y. Jin, H. Li, Q. Dou, H. Chen, J. Qin, C.-W. Fu, and P.-A. Heng, “Multi-task recurrent convolutional network with correlation loss for surgical video analysis,” Medical image analysis, vol. 59, p. 101572, 2020.
- [5] T. Czempiel, M. Paschali, M. Keicher, W. Simson, H. Feussner, S. T. Kim, and N. Navab, “Tecno: Surgical phase recognition with multi-stage temporal convolutional networks,” in MICCAI. Springer, 2020, pp. 343–352.
- [6] Y. A. Farha and J. Gall, “Ms-tcn: Multi-stage temporal convolutional network for action segmentation,” in CVPR, 2019, pp. 3575–3584.
- [7] B. Zhang, A. Ghanem, A. Simes, H. Choi, A. Yoo, and A. Min, “Swnet: Surgical workflow recognition with deep convolutional network,” in MIDL. PMLR, 2021, pp. 855–869.
- [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in ICLR, 2021.
- [9] F. Yi, H. Wen, and T. Jiang, “Asformer: Transformer for action segmentation,” in BMVC, 2021, p. 236.
- [10] T. Czempiel, M. Paschali, D. Ostler, S. T. Kim, B. Busam, and N. Navab, “Opera: Attention-regularized transformers for surgical phase recognition,” in MICCAI. Springer, 2021, pp. 604–614.
- [11] B. Zhang, J. Abbing, A. Ghanem, D. Fer, J. Barker, R. Abukhalil, V. K. Goel, and F. Milletarì, “Towards accurate surgical workflow recognition with convolutional networks and transformers,” CMBBE: Imaging & Visualization, vol. 10, no. 4, pp. 349–356, 2022.
- [12] Y. Jin, Y. Long, X. Gao, D. Stoyanov, Q. Dou, and P.-A. Heng, “Trans-svnet: hybrid embedding aggregation transformer for surgical workflow analysis,” IJCARS, pp. 1–10, 2022.
- [13] H.-B. Chen, Z. Li, P. Fu, Z.-L. Ni, and G.-B. Bian, “Spatio-temporal causal transformer for multi-grained surgical phase recognition,” in EMBC. IEEE, 2022, pp. 1663–1666.
- [14] B. Zhang, M. H. Sarhan, B. Goel, S. Petculescu, and A. Ghanem, “Sf-tmn: Slowfast temporal modeling network for surgical phase recognition,” arXiv preprint arXiv:2306.08859, 2023.
- [15] B. Zhang, D. Sturgeon, A. R. Shankar, V. K. Goel, J. Barker, A. Ghanem, P. Lee, M. Milecky, N. Stottler, and S. Petculescu, “Surgical instrument recognition for instrument usage documentation and surgical video library indexing,” CMBBE: Imaging & Visualization, vol. 11, no. 4, pp. 1064–1072, 2023.
- [16] S. Stein and S. J. McKenna, “Combining embedded accelerometers with computer vision for recognizing food preparation activities,” in Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing, 2013, pp. 729–738.
- [17] A. Fathi, X. Ren, and J. M. Rehg, “Learning to recognize objects in egocentric activities,” in CVPR. IEEE, 2011, pp. 3281–3288.
- [18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
- [19] M. Tan and Q. Le, “Efficientnetv2: Smaller models and faster training,” in ICML. PMLR, 2021, pp. 10 096–10 106.
- [20] Y. Jin, Y. Long, C. Chen, Z. Zhao, Q. Dou, and P.-A. Heng, “Temporal memory relation network for workflow recognition from surgical video,” IEEE TMI, vol. 40, no. 7, pp. 1911–1923, 2021.
- [21] Z. He, A. Mottaghi, A. Sharghi, M. A. Jamal, and O. Mohareri, “An empirical study on activity recognition in long surgical videos,” in Machine Learning for Health. PMLR, 2022, pp. 356–372.
- [22] B. Zhang, A. Fung, M. Torabi, J. Barker, G. Foley, R. Abukhalil, M. L. Gaddis, and S. Petculescu, “C-ect: Online surgical phase recognition with cross-enhancement causal transformer,” in ISBI. IEEE, 2023, pp. 1–5.
- [23] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in CVPR, 2017, pp. 6299–6308.
- [24] C. Lea, R. Vidal, and G. D. Hager, “Learning convolutional action primitives for fine-grained action recognition,” in ICRA. IEEE, 2016, pp. 1642–1649.
- [25] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in CVPR, 2017, pp. 156–165.
- [26] A. Kadkhodamohammadi, I. Luengo, and D. Stoyanov, “Patg: position-aware temporal graph networks for surgical phase recognition on laparoscopic videos,” IJCARS, vol. 17, no. 5, pp. 849–856, 2022.
- [27] A. P. Twinanda, “Vision-based approaches for surgical activity recognition using laparoscopic and rbgd videos,” Ph.D. dissertation, Strasbourg, 2017.
- [28] F. Yi and T. Jiang, “Hard frame detection and online mapping for surgical phase recognition,” in MICCAI. Springer, 2019, pp. 449–457.
- [29] X. Gao, Y. Jin, Y. Long, Q. Dou, and P.-A. Heng, “Trans-svnet: accurate phase recognition from surgical videos via hybrid embedding aggregation transformer,” in MICCAI. Springer, 2021, pp. 593–603.
- [30] X. Ding and X. Li, “Exploring segment-level semantics for online phase recognition from surgical videos,” IEEE TMI, 2022.
- [31] B. Zhang, B. Goel, M. H. Sarhan, V. K. Goel, R. Abukhalil, B. Kalesan, N. Stottler, and S. Petculescu, “Surgical workflow recognition with temporal convolution and transformer for action segmentation,” IJCARS, pp. 1–10, 2022.
- [32] X. Ding, X. Yan, Z. Wang, W. Zhao, J. Zhuang, X. Xu, and X. Li, “Less is more: Surgical phase recognition from timestamp supervision,” IEEE TMI, vol. 42, no. 6, pp. 1897–1910, 2023.
- [33] F. Yi, Y. Yang, and T. Jiang, “Not end-to-end: Explore multi-stage architecture for online surgical phase recognition,” in ACCV, 2022, pp. 2613–2628.
- [34] N. Behrmann, S. A. Golestaneh, Z. Kolter, J. Gall, and M. Noroozi, “Unified fully and timestamp supervised temporal action segmentation via sequence to sequence translation,” in ECCV, 2022, pp. 52–68.