OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
Abstract
Surgical practice involves complex visual interpretation, procedural skills, and advanced medical knowledge, making surgical vision-language pretraining (VLP) particularly challenging due to this complexity and the limited availability of annotated data. To address this gap, we propose OphCLIP, a hierarchical retrieval-augmented vision-language pretraining framework specifically designed for ophthalmic surgical workflow understanding. OphCLIP leverages the OphVL dataset we constructed, a large-scale and comprehensive collection of over 375K hierarchically structured video-text pairs covering tens of thousands of different combinations of attributes (surgeries, phases/operations/actions, instruments, medications, and more advanced aspects such as the causes of eye diseases, surgical objectives, and postoperative recovery recommendations). These hierarchical video-text correspondences enable OphCLIP to learn both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles, capturing intricate surgical details and high-level procedural insights, respectively. OphCLIP also introduces a retrieval-augmented pretraining scheme that exploits the underexplored large-scale silent surgical procedure videos, automatically retrieving semantically relevant content to enhance the representation learning of narrative videos. Evaluation across 11 datasets for phase recognition and multi-instrument identification shows OphCLIP’s robust generalization and superior performance.
1 Introduction


The advancement of surgical AI offers opportunities to enhance surgeons’ capabilities and revolutionize surgical workflows [47, 66, 56]. A key objective is to develop a generalist multi-modal system that can comprehend diverse surgical scenes and communicate with medical professionals using natural language [57], enhancing preoperative planning, intraoperative assistance, and postoperative management. While surgical data science has made progress in uni-modal tasks like surgical phase recognition [34, 58, 67, 18], instrument segmentation [81, 62, 6, 29, 45, 46], and lesion detection [63, 3, 64, 24, 40], the exploration of multi-modal representation learning remains limited.
In general computer vision, models like CLIP [50] have shown success in understanding visual concepts through natural language supervision [32, 36], enabling a shift from task-specific to generalist models [42, 82, 83] with less downstream task-specific fine-tuning [80, 68]. However, surgical multi-modal representation learning poses unique challenges, including specialized medical terminology and limited data availability, which make large-scale datasets difficult to build. These challenges, along with the need for expert annotations and the complexity of surgical video data, which often span several hours and involve domain-specific activities within a confined field of view, highlight the gap between general computer vision datasets [38, 50, 2] and surgical datasets [18, 74, 17], as shown in Fig. 1, demanding an innovative solution for surgical multi-modal representation learning.
In this work, we develop OphVL, the first large-scale VLP dataset for ophthalmic surgery understanding, featuring 375K carefully processed video-text pairs from 7.5K hours of narrative open-source surgical videos. These videos cover a diverse set of attributes, including surgeries, phases, instruments, medications, and advanced aspects like eye disease causes, surgical objectives, and postoperative recovery recommendations. Unlike previous datasets with noisy and weakly aligned video-text pairs [37], OphVL applies extensive data processing and large language models (LLMs) to align the video-text pairs and enrich the textual information. Specifically, we define a set of essential concepts (surgery type, operation type, instrument, medication, anatomy, and surgical objectives) and use LLMs to refine narrative texts to focus on these concepts, enriching the textual information at observational and reasoning levels. We then segment video clips and pair them with narrations, while linking full videos to title summaries to create hierarchical video-text pairs at both clip and video levels. Most previous datasets focus only on narrative videos [37, 2] and overlook the potential of large-scale silent surgical videos. Therefore, our OphVL also includes 30K silent videos as an additional knowledge base for the following pretraining.
We introduce OphCLIP, a hierarchical retrieval-augmented VLP framework for ophthalmic video-language pretraining, designed to leverage hierarchical video-text pairs from the OphVL dataset. This approach is inspired by how surgeons often break down the understanding of long surgical videos into shorter activity segments, using this granular knowledge to inform their grasp of silent procedural videos. Surgeons often begin by studying narrated surgical videos to learn specific techniques, building foundational knowledge through the sequence of narrations and the title summaries. When transitioning to silent videos with similar content, they leverage these learned representations to effectively transfer the knowledge.
Similarly, OphCLIP learns short- and long-term visual representations by pairing short video clips with detailed narrative texts for fine-grained features and longer video segments with high-level title summaries for broader context. This approach captures procedural flow and high-level decision-making by integrating both fine- and coarse-grained information. We also introduce retrieval-based augmentation during the pretraining, by constructing a dynamic knowledge base from large-scale silent surgical videos, storing visual and textual embeddings. By retrieving semantically similar silent videos, we add them as auxiliary supervisory signals, facilitating knowledge transfer across narrative and silent procedure videos. OphCLIP integrates these signals into its pretraining framework, learning robust, transferable representations for diverse ophthalmic procedures and achieving state-of-the-art zero-shot performance, as shown in Fig. 1. Our contributions are as follows:
-
OphVL Dataset: We constructed OphVL, the first large-scale and most comprehensive VLP dataset for ophthalmic surgical understanding to date, comprising 375K clip-text pairs with a total duration of 7.5K hours, which is 15× larger than the current largest surgical VLP dataset. OphVL includes tens of thousands of different combinations of attributes, such as surgeries, phases/actions, instruments, medications, as well as more advanced aspects like the causes of eye diseases, surgical objectives, and postoperative recovery recommendations.
-
OphCLIP: We present OphCLIP, a hierarchical ophthalmic surgical VLP framework that aligns short video clips with narrative texts and full videos with high-level summaries, enhancing both fine-grained and long-term visual representation learning. By incorporating silent videos using retrieval-based supervision, OphCLIP enriches its video understanding and learns robust representations for diverse surgical procedures.
-
Comprehensive Zero-shot Evaluation: We conduct extensive evaluations and ablation studies of OphCLIP across 11 datasets (sub-datasets) in two tasks, phase and multi-instrument recognition, demonstrating strong generalization ability across tasks of varying granularities.
2 Related Work
Datasets | Modality | Source | Pairs Generation | Text | Videos | Images | Pairs/QA/Clips | Avg. Clip Length |
---|---|---|---|---|---|---|---|---|
HowTo100M [37] | Natural | Self-built | Automatic | Caption | 1.2M | - | 136M | 4s |
ANet-Captions [26] | Natural | Public | Manual | Caption | 20K | - | 100K | - |
Endo-FM [71] | Endoscope | Public&Private | - | - | - | 500M | 33K | 5s |
SurgVLP [74] | Endoscope | Self-built | Automatic | Caption | 1,326 | - | 25K | 6s |
GenSurg+ [17] | Endoscope | Public | Automatic | Caption | 1.8K | - | 17K | 45s |
Surg-QA [30] | Endoscope | Public&Self-built | Automatic | QA | 2,151 | - | 102K | 20s |
GP-VLS [56] | Endoscope | Public | Automatic | QA | - | - | 120K | - |
OphNet2024 [18] | Ophthalmic Scope | Self-built | Manual | FSL | 1,969 | - | 17,508/14,674 | 18s/22s |
OphVL (Ours) | Ophthalmic Scope | Self-built | Semi-automatic | Caption | 44,290 | 960M | 375K | 72s |
Surgical Vision-Language Pretraining. Recent research on deep learning for surgical applications has primarily focused on uni-modal tasks, largely overlooking advancements in next-generation vision-language models (VLMs) [61, 70, 76] and GPT frameworks [69, 79, 43, 49], which enable broader computer vision applications. Many studies have demonstrated the efficacy of utilizing natural language supervision from aligned text to acquire robust visual representations [2, 77]. These methods typically leverage contrastive learning [44] to associate videos (or images) with their corresponding narrations (or captions). However, they face sample efficiency challenges with surgical VLP datasets due to noisy transcriptions, limited variability in phase-level descriptions, and strong temporal dependencies in surgical procedures. Recent work has improved data efficiency and zero-shot performance in CLIP-like models through techniques such as text augmentation via EDA [33], masked token modeling [65], captioning loss [73], and knowledge-based, hierarchical-aware augmentations [75, 78]. However, the limited scale of surgical video-language pretraining datasets and unique challenges of surgical videos, such as extended durations, narrow fields of view, and procedural variability, continue to restrict the development and evaluation of surgical VLMs.
Video Data for Self-Supervision. Recent studies increasingly leverage video data for enhancing self-supervised learning in VLMs [55, 27, 28]. Large datasets from millions of publicly available YouTube videos support training across diverse scenarios, and localized narratives provide dense, frame-level annotations that benefit both single-image and temporal tasks. In medical applications, similar efforts have been directed toward constructing large-scale multimodal datasets, utilizing sources like hospital-based radiological reports [25, 9] and publicly accessible platforms such as YouTube and Twitter [21, 22, 74, 39, 7] for vision-language pretraining. Although video data introduces noise from varying quality and unfiltered content, advancements in automatic speech recognition (ASR) mitigate this issue by enabling large-scale extraction of cleaner text data directly from audio tracks, improving dataset relevance and model reliability [22]. Large language models (LLMs) further aid annotation by offering context-aware insights, reducing manual labeling needs, and enhancing coherence.
Retrieval-augmented Models. In the NLP field, retrieval-augmented language models leverage external knowledge to boost performance across tasks [5, 52]. Similarly, recent advancements in the vision-language domain retrieve semantically related samples to enhance tasks, such as image recognition [35, 23], captioning [31, 53, 54, 72], and knowledge-based visual question answering [60]. While vision-language pretraining [19] has adopted retrieval-augmentation, with the CLIP model for cross-modal retrieval, it mainly targets images and is suboptimal for surgical video data, particularly in linking narrative and silent procedural videos. In contrast, our OphCLIP builds a dynamically updated knowledge base with video titles as queries for efficient cross-video retrieval and pretraining.
3 Data Engine

We develop OphVL, a dataset of large-scale ophthalmic surgery video-text pairs for VLP. OphVL curation pipeline comprises the following steps: (1) collecting real-world ophthalmic surgery channel and video data, (2) filtering ophthalmic surgery videos based on a specific “narrative style”, (3) extracting and denoising video segments and narration texts from videos using various models, tools, and algorithms, (4) rewriting narration texts via LLMs to focus on essential surgery-specific concepts, and finally, (5) extracting frames from video clips to construct video-text pairs.
3.1 Collecting Clip and Text Pairs from YouTube
Collecting Representative Channels and Videos. In collaboration with three practicing ophthalmologists, we compile a comprehensive list of over 3K terms relevant to ophthalmic surgery, derived from an extensive literature review. These terms encompass, but are not limited to, surgery names, procedural steps, instrument usage, medications, and postoperative complications. Using these keywords, we manually search YouTube to identify ophthalmic surgery channels. Based on our experience, channel-based searches yield a more concentrated and higher-quality collection of ophthalmic surgical videos than individual keyword searches. Ultimately, we identify the channel IDs of 410 ophthalmic surgery channels and download their videos in bulk. During the download process, we prioritize the highest-resolution versions, filtering out videos shorter than 30 seconds or with resolution below 224p, resulting in a collection of approximately 100K videos.
Filtering for Narrative-Style Surgical Videos. In this step, we assess each video from each channel to determine (1) whether it depicts real ophthalmic surgeries or contains usable surgery segments, and (2) whether it qualifies as a narrated video with rich explanatory voiceover. For (1), we identify relevant videos by extracting keyframes, which are automatically generated using PySceneDetect to mark the start or end of scenes with significant visual changes. We then train and apply a ResNet50 image classifier to determine whether the keyframes are microscope images of ophthalmic surgeries. Videos with over 80% of keyframes classified as ophthalmic microscope images are labeled as valid videos. For (2), we apply voice activity detection to estimate the proportion of human voice in each video's audio track, setting a threshold of 0.2. Videos below this threshold are flagged as silent or lacking sufficient explanatory narration. For these videos, we collect their titles and clip metadata to construct a knowledge base, which we use to enhance representation learning.
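The filtering rule above can be summarized in a minimal sketch; the inputs (per-keyframe classifier scores and a voice-activity ratio) are assumptions standing in for the ResNet50 keyframe classifier and the audio analysis described in this section.

```python
from typing import List

SURGICAL_KEYFRAME_RATIO = 0.8   # >= 80% of keyframes must look like microscope views
MIN_VOICE_RATIO = 0.2           # below this threshold, the video is treated as silent

def route_video(keyframe_scores: List[float], voice_ratio: float) -> str:
    """Route a candidate video into 'narrated', 'silent', or 'rejected'.

    keyframe_scores: per-keyframe probability of being an ophthalmic microscope image
                     (e.g., from a ResNet50 classifier).
    voice_ratio:     estimated fraction of the audio track containing human speech.
    """
    if not keyframe_scores:
        return "rejected"
    surgical_frac = sum(s >= 0.5 for s in keyframe_scores) / len(keyframe_scores)
    if surgical_frac < SURGICAL_KEYFRAME_RATIO:
        return "rejected"   # not (mostly) real surgical footage
    if voice_ratio < MIN_VOICE_RATIO:
        return "silent"     # kept for the knowledge base (title + metadata only)
    return "narrated"       # full clip-narration pairs are extracted
```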
3.2 Text Extraction using ASR and Text Denoising
To tackle the challenges of ASR with medical terminology in YouTube captions, we employ the large-scale open-source Whisper Large-V3 model [51] for speech-to-text conversion by directly transcribing entire speech segments. We then develop a transcription denoising and quality control pipeline consisting of: (i) applying the RAKE keyword extraction algorithm to identify key phrases (up to four words) and refining them by removing stopwords [52]; (ii) using SurgicBERTa [4], a language model pre-trained on surgical texts, to validate and correct each refined entry for alignment with known surgical terminology and context; (iii) conditioning a large language model with example prompts to correct spelling errors within sentence context; and (iv) prompting the language model to provide structured summaries of the captions, focusing on key components such as surgery type, phase/action, instrument, medication, anatomical target, and procedure goal.
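A sketch of this denoising pipeline is shown below, assuming the rake_nltk package for RAKE; surgicberta_correct and llm_summarize are hypothetical callables standing in for steps (ii)-(iv).

```python
from rake_nltk import Rake  # assumes rake_nltk and NLTK stopwords are installed

rake = Rake(max_length=4)   # key phrases of up to four words, stopword-aware

def denoise_transcript(raw_text: str, surgicberta_correct, llm_summarize) -> dict:
    """Clean one Whisper transcript and produce a structured summary.

    surgicberta_correct: callable validating/correcting a phrase against surgical terminology.
    llm_summarize:       callable prompting an LLM for spelling correction and summarization.
    """
    # (i) extract candidate key phrases (stopwords are removed by RAKE itself)
    rake.extract_keywords_from_text(raw_text)
    phrases = rake.get_ranked_phrases()

    # (ii) validate and correct each phrase with the surgical language model
    corrected = [surgicberta_correct(p) for p in phrases]

    # (iii) + (iv) in-context spelling correction and structured summarization
    summary = llm_summarize(
        text=raw_text,
        key_phrases=corrected,
        fields=["surgery", "phase/action", "instrument", "medication", "anatomy", "goal"],
    )
    return {"key_phrases": corrected, "structured_summary": summary}
```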
3.3 Aligning Clip/Image and Text Pair
Due to frequent sentence segmentation discontinuity in Whisper transcriptions, where coherent sentences are often split across multiple timestamps, we develop a heuristic algorithm that merges timestamps based on punctuation and linking words, ensuring semantic continuity and improving GPT-4o summaries. For clip extraction, we align segments with subtitle timestamps. For silent videos, we sample frames at 1 FPS and apply the classifier (Sec. 3.1) to extract surgical clips; their titles and metadata are collected to build a knowledge base, enhancing representation learning. Finally, for all clip-text pairs, frames are extracted at 0.5 FPS for pre-training.
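A minimal sketch of this timestamp-merging heuristic over Whisper-style segments; the linking-word list is illustrative.

```python
LINKING_WORDS = {"and", "then", "so", "which", "while", "because"}  # illustrative set
TERMINAL_PUNCT = (".", "?", "!")

def merge_asr_segments(segments):
    """Merge Whisper segments whose sentences spill across timestamps.

    segments: list of dicts with "start", "end", "text" keys (Whisper-style output).
    Two consecutive segments are merged when the earlier one does not end with
    terminal punctuation, or the later one starts with a linking word.
    """
    merged = []
    for seg in segments:
        text = seg["text"].strip()
        if merged and text:
            prev = merged[-1]
            starts_with_link = text.split()[0].lower().strip(",") in LINKING_WORDS
            incomplete_prev = not prev["text"].rstrip().endswith(TERMINAL_PUNCT)
            if incomplete_prev or starts_with_link:
                prev["text"] = prev["text"].rstrip() + " " + text
                prev["end"] = seg["end"]
                continue
        merged.append({"start": seg["start"], "end": seg["end"], "text": text})
    return merged
```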
3.4 OphVL Statistics
As shown in Tab. 1, the final OphVL dataset comprises 375,198 clip-text pairs extracted from 13,654 narrated videos and 30,636 silent videos (totaling 9,363 hours). On average, the clips have a duration of 72 seconds and a resolution of 1500×912, with over 65% of the videos having a resolution equal to or greater than 1280×720. According to our rough estimation, our textual concepts include tens of thousands of different combinations of attributes, such as surgeries, surgical phases/operations/actions, surgical instruments, medications, as well as more advanced aspects like the causes of eye diseases, surgical objectives, and postoperative recovery recommendations. Please refer to the supplementary for more dataset statistical details.
4 OphCLIP
We introduce OphCLIP, the hierarchical retrieval-augmented video-language pretraining framework, including hierarchical video-text correspondences (Sec. 4.1), our contrastive learning approach for fine- and coarse-grained representations (Sec. 4.2), and our strategy for leveraging silent videos as a knowledge base to enhance representation learning (Sec. 4.3).
4.1 Hierarchical Video-text Correspondences
We leverage our curated OphVL dataset to train OphCLIP. The OphVL dataset, illustrated in Fig. 3, is a partially hierarchically annotated video collection, denoted as $\mathcal{D} = \mathcal{D}_{N} \cup \mathcal{D}_{S}$. Here, $\mathcal{D}_{N}$ contains narrative videos paired with clip-level narrations that describe surgical observations and reasoning, as well as high-level video summaries. In contrast, $\mathcal{D}_{S}$ comprises silent videos paired only with video-level summaries. Each video, whether narrative or silent, is segmented into clips, represented as $\{v_{1}, v_{2}, \dots, v_{K}\}$, with each clip providing a visual counterpart to the narration in narrative videos. For instance, a clip’s narration describes “The anterior chamber is under-filled…” while the high-level title summarizes the procedure as “Application of B-HEX in a small-pupil Phaco surgery…”. This hierarchical text structure allows the model to capture both fine-grained surgical details at the clip level and overall procedural goals at the video level, enabling robust vision-language representations across diverse granularity levels.
4.2 Hierarchical Vision-language Pretraining
OphCLIP has two pretraining stages: clip- and video-level pretraining to learn fine, short-term and coarse, long-term representations, respectively. OphCLIP adopts a CLIP-like architecture [50], using visual and textual encoders, $f_{v}$ and $f_{t}$, to generate embeddings for frames and texts. A single set of visual and textual encoders is used across both pretraining stages. For clip-level pretraining, each video clip $v_{i}$ is paired with its narration $t_{i}$ to learn short-term representations, denoted as $z_{i}^{v} = f_{v}(v_{i})$ and $z_{i}^{t} = f_{t}(t_{i})$. At the video level, the entire video (including both narrative and silent types) is paired with a summary text to form long-term features, represented by $Z^{v}$ and $Z^{t}$. This high-level summary captures the overall semantic context, supporting deeper long-term reasoning within visual representations.
Clip-level Pretraining. For OphCLIP’s clip-level pretraining, we use an InfoNCE loss [44] to align short-term video clips $v_{i}$ with their corresponding narration texts $t_{i}$. This objective maximizes similarity between visual features $z_{i}^{v}$ and textual features $z_{i}^{t}$, as shown below in Eq. 1:

$$\mathcal{L}_{\mathrm{nce}} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(\mathrm{sim}(z_{i}^{v}, z_{i}^{t})/\tau\right)}{\sum_{j=1}^{B}\exp\left(\mathrm{sim}(z_{i}^{v}, z_{j}^{t})/\tau\right)} \tag{1}$$

where $B$ is the batch size, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\tau$ is a temperature parameter. Positive pairs consist of temporally aligned video-text pairs, while other pairs in the batch serve as negatives, enabling OphCLIP to learn from short-term video-text correspondences.
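A minimal PyTorch sketch of this clip-level InfoNCE objective, written here in a symmetric video-to-text and text-to-video form; function and variable names are illustrative, and the temperature follows the value listed in Tab. 6.

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over a batch of B temporally aligned clip-narration pairs.

    video_emb, text_emb: (B, D) embeddings from the visual and textual encoders.
    The i-th clip and i-th narration form the positive pair; all other pairings
    in the batch act as negatives.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                       # (B, B) similarity logits
    targets = torch.arange(v.size(0), device=v.device)   # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```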
To further refine visual features, we incorporate SimSiam self-supervision [8]. By applying random augmentations to create two views of each video clip, we maximize similarity within positive pairs, formalized as $\mathcal{L}_{\mathrm{ssl}}$:

$$\mathcal{L}_{\mathrm{ssl}} = -\frac{1}{2}\left[\mathcal{C}\!\left(p_{1}, \mathrm{sg}(z_{2})\right) + \mathcal{C}\!\left(p_{2}, \mathrm{sg}(z_{1})\right)\right] \tag{2}$$

where $z_{1}, z_{2}$ are the projected features of the two augmented views, $p_{1}, p_{2}$ are the corresponding predictor outputs, $\mathcal{C}(\cdot,\cdot)$ is cosine similarity, and $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation. The combined clip-level objective, $\mathcal{L}_{\mathrm{clip}} = \mathcal{L}_{\mathrm{nce}} + \lambda_{\mathrm{ssl}}\,\mathcal{L}_{\mathrm{ssl}}$, strengthens fine-grained visual-textual alignment.
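A PyTorch sketch of the SimSiam-style term; combining it with the contrastive term using a 0.5 weight is our reading of the hyper-parameters in Tab. 6 and is noted as an assumption in the comment.

```python
import torch
import torch.nn.functional as F

def simsiam_loss(p1: torch.Tensor, z1: torch.Tensor,
                 p2: torch.Tensor, z2: torch.Tensor) -> torch.Tensor:
    """Negative-cosine SimSiam objective for two augmented views of the same clip.

    z1, z2: projector outputs of view 1 and view 2; p1, p2: predictor outputs.
    Gradients are blocked on the target branch via stop-gradient (detach), as in SimSiam.
    """
    def neg_cos(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * (neg_cos(p1, z2) + neg_cos(p2, z1))

# Combined clip-level objective; the 0.5 weight is an assumption based on Tab. 6:
# loss_clip = info_nce(v_emb, t_emb) + 0.5 * simsiam_loss(p1, z1, p2, z2)
```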
Video-level Pretraining. At the video level, we aim to capture long-term procedural context by aligning each narrative video’s high-level title summary with the sequence of its video segments. This process uses the following loss:

$$\mathcal{L}_{\mathrm{title}} = -\frac{1}{B}\sum_{i=1}^{B}\log\frac{\exp\left(\mathrm{sim}(Z_{i}^{v}, Z_{i}^{t})/\tau\right)}{\sum_{j=1}^{B}\exp\left(\mathrm{sim}(Z_{i}^{v}, Z_{j}^{t})/\tau\right)} \tag{3}$$

This objective aligns the entire video representation $Z_{i}^{v}$ with its summary embedding $Z_{i}^{t}$ while treating summaries from other videos as negatives. To efficiently obtain long-term visual representations for whole procedure videos, we aggregate short-term video clip features using average pooling. This approach ensures computational efficiency and preserves frame-wise visual features, represented as $Z^{v} = \frac{1}{K}\sum_{k=1}^{K} z_{k}^{v}$.
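A short sketch of how the long-term video representation can be formed by average pooling over clip features and scored against title-summary embeddings for Eq. 3; tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def video_level_logits(clip_feats: torch.Tensor, title_emb: torch.Tensor,
                       temperature: float = 0.1) -> torch.Tensor:
    """Aggregate clip features into a video representation and score it against titles.

    clip_feats: (B, K, D) short-term features of K clips per video (already encoded).
    title_emb:  (B, D) embeddings of the high-level title summaries.
    Returns (B, B) logits whose diagonal entries correspond to matched video-title pairs.
    """
    video_emb = clip_feats.mean(dim=1)          # average pooling over the K clips
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(title_emb, dim=-1)
    return v @ t.T / temperature                # feed into a cross-entropy as in Eq. 3
```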
4.3 Silent Videos as Knowledge Pool
In addition to learning representations from the narrative videos, our OphCLIP explores silent surgical videos to form a contextual knowledge base that facilitates knowledge transfer and enriches multi-modal representations, as shown in Fig. 3. Using the title of a narrative video as a query, the retriever matches it to relevant silent videos stored in the memory bank, enhancing OphCLIP’s long-term visual representations by integrating additional procedural context from these silent videos.
Query Encoding. To retrieve relevant information from silent videos, we first encode the titles of narrative videos as query embeddings. Specifically, we use PubMedBERT [16] as the query encoder $f_{q}$ to transform the title text $T$ into a high-dimensional embedding:

$$q = f_{q}(T) \tag{4}$$

This query embedding $q$, based on the video’s title, captures its semantic essence and is used to find related content in a silent video pool. Encoding the title text into embeddings allows efficient retrieval within the text modality, avoiding inaccuracies from CLIP’s limited visual understanding of ophthalmic videos.
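A sketch of the title-query encoding with the Hugging Face transformers library; the exact PubMedBERT checkpoint name and the CLS pooling are assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Checkpoint name is an assumption; any PubMedBERT variant can serve as the query encoder.
MODEL_ID = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
query_encoder = AutoModel.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def encode_title(title: str) -> torch.Tensor:
    """Encode a narrative video's title into a query embedding q (CLS-token pooling)."""
    inputs = tokenizer(title, return_tensors="pt", truncation=True, max_length=77)
    outputs = query_encoder(**inputs)
    return outputs.last_hidden_state[:, 0]   # (1, hidden_dim) CLS embedding
```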
Memory Bank. The memory bank module stores multi-modal representations of silent videos. Each silent video is encoded by the visual encoder $f_{v}$, and its corresponding title text is encoded by the text encoder $f_{t}$ and the above-mentioned query encoder $f_{q}$. We use title embeddings as keys in the memory bank, with values comprising visual and textual embeddings of ophthalmic videos and corresponding texts:

$$\mathcal{M} = \left\{\, k_{i} : \left(\tilde{Z}_{i}^{v},\, \tilde{Z}_{i}^{t}\right) \right\}_{i=1}^{N_{s}}, \quad k_{i} = f_{q}(\tilde{T}_{i}) \tag{5}$$
where $N_{s}$ represents the number of silent videos in the OphVL dataset, $k_{i}$ is the title-query key, and $\tilde{Z}_{i}^{v}$ and $\tilde{Z}_{i}^{t}$ are the visual and textual embeddings of the $i$-th silent video and its title $\tilde{T}_{i}$. In our memory bank, both visual and textual representations are stored and updated dynamically to improve hierarchical vision-language pretraining.
| Task | Dataset | CLIP [50] | SLIP [41] | LaCLIP [10] | CLIP [50] | CLIP* [50] | OphCLIP |
|---|---|---|---|---|---|---|---|
| | | ViT-B/16 | ViT-B/16 | ViT-B/16 | RN50 | RN50 | RN50 |
| | | CLIP400M | YFCC100M | LAION-400M | CLIP400M | OphVL | OphVL |
| Phase | Cat-21 [48] | 11.5 / 2.6 | 7.7 / 2.4 | 9.6 / 3.6 | 13.3 / 2.7 | 28.8 / 17.6 | 41.4 / 28.8 |
| | Cataract-1K [14] | 5.4 / 1.6 | 5.7 / 2.8 | 10 / 1.7 | 6.9 / 2.0 | 20.8 / 15.9 | 62.8 / 48.5 |
| | Cataract-101 [59] | 9.9 / 4.1 | 7.3 / 2.6 | 9.8 / 2.4 | 10.0 / 3.3 | 36.2 / 25.5 | 39.3 / 33.7 |
| | CatRelDet [12] | 9.5 / 7.0 | 10.2 / 6.4 | 11.8 / 4.4 | 15.3 / 11.9 | 26.7 / 23.7 | 32.6 / 34.2 |
| | LensID [13] | 10.6 / 6.5 | 10.6 / 6.3 | 25.8 / 17.8 | 22.9 / 16.3 | 45.5 / 32.1 | 59.3 / 41.0 |
| Instrument | Cataract-1K [14] | 100.0 / 13.9 | 100.0 / 13.9 | 100.0 / 13.9 | 100.0 / 13.9 | 80.8 / 15.3 | 45.1 / 21.2 |
| | CatInstSeg [11] | 100.0 / 20.2 | 100.0 / 20.2 | 100.0 / 20.2 | 100.0 / 20.2 | 87.3 / 20.3 | 51.1 / 28.3 |
| Method | OphNet-O | OphNet-P [18] | CaDIS-F | CaDIS-C [15] |
|---|---|---|---|---|
| | Acc / F1 | Acc / F1 | FPR / mAP | FPR / mAP |
| CLIP [50] | 0.7 / 0.4 | 3.2 / 0.7 | 13.0 / 8.9 | 28.1 / 22.5 |
| LaCLIP [10] | 1.0 / 0.3 | 3.2 / 0.7 | 13.0 / 8.9 | 30.1 / 22.6 |
| CLIP* [50] | 2.5 / 0.8 | 5.0 / 1.6 | 13.1 / 9.3 | 28.5 / 22.6 |
| OphCLIP | 7.1 / 2.3 | 18.2 / 4.8 | 14.7 / 10.6 | 28.7 / 23.5 |
Retriever. As shown in Fig. 3, our retriever component leverages the query embedding $q$ to perform a maximum inner product search (MIPS) on the memory keys, identifying the top-$k$ silent videos most relevant to the queried narrative video. Specifically, we compute the inner product $q^{\top}k_{i}$ for each key embedding $k_{i}$ in memory, selecting the top-$k$ indices that yield the highest similarity scores:

$$\mathcal{I} = \operatorname*{TopK}_{i \in \{1,\dots,N_{s}\}}\left(q^{\top}k_{i}\right) \tag{6}$$
Then, the retrieved video-level multi-modal representations from the $k$ most relevant silent videos are treated as positive samples for the queried narrative video. We conduct contrastive learning as Eq. 7 shows:

$$\mathcal{L}_{\mathrm{ret}} = -\frac{1}{B}\sum_{i=1}^{B}\frac{1}{|\mathcal{P}_{i}|}\sum_{\tilde{Z}\in\mathcal{P}_{i}}\log\frac{\exp\left(\mathrm{sim}(Z_{i}^{v}, \tilde{Z})/\tau\right)}{\sum_{j=1}^{B}\sum_{\tilde{Z}'\in\mathcal{P}_{j}}\exp\left(\mathrm{sim}(Z_{i}^{v}, \tilde{Z}')/\tau\right)} \tag{7}$$

where $\mathcal{P}_{i}$ denotes the set of visual and textual embeddings retrieved for the $i$-th narrative video.
Thus, the final video-level pretraining loss is:

$$\mathcal{L}_{\mathrm{video}} = \mathcal{L}_{\mathrm{title}} + \lambda_{\mathrm{ret}}\,\mathcal{L}_{\mathrm{ret}} \tag{8}$$
We leverage the retrieved similar entities to introduce diverse supervisory signals from additional video-text pairs, facilitating knowledge transfer across narrative and silent procedure videos. Our method captures richer contextual information by pairing each video-level feature of the narrative videos with their title text embedding and the retrieved multi-modal embeddings. This approach not only enhances model robustness but also enables efficient retrieval within million-scale datasets. We employ an alternating training strategy, optimizing $\mathcal{L}_{\mathrm{clip}}$ for a few batches followed by optimizing $\mathcal{L}_{\mathrm{video}}$ for a few batches, and repeating this cycle. This strategy optimizes the visual and textual encoders for both short-term and long-term features and also avoids the catastrophic forgetting issue [1, 75]. Please refer to the supplementary for more implementation details.
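A compact sketch of the memory bank and MIPS retriever, assuming the silent-video embeddings are precomputed and stored as dense tensors; retrieving a single video per query follows the default in Tab. 6.

```python
import torch

class SilentVideoMemoryBank:
    """Title-query keys mapped to (visual, textual) embeddings of silent videos."""

    def __init__(self, keys: torch.Tensor, visual_vals: torch.Tensor, text_vals: torch.Tensor):
        # keys:        (N, D) query-encoder embeddings of silent-video titles
        # visual_vals: (N, D) pooled video-level visual embeddings of the silent videos
        # text_vals:   (N, D) text-encoder embeddings of their title summaries
        self.keys, self.visual_vals, self.text_vals = keys, visual_vals, text_vals

    def retrieve(self, query: torch.Tensor, k: int = 1):
        """Maximum inner product search: return the top-k silent-video embeddings."""
        scores = query @ self.keys.T                       # (1, N) inner products
        topk = scores.topk(k, dim=-1).indices.squeeze(0)   # indices of the k best matches
        return self.visual_vals[topk], self.text_vals[topk]
```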
5 Experiments
| Method | Data (%) | Cat-21 [48] | Cataract-1K [14] |
|---|---|---|---|
| | | Acc / F1 | F1 / mAP |
| CLIP [50] | 100 | 48.2 / 32.6 | 0.0 / 14.1 |
| CLIP* [50] | 100 | 59.4 / 40.5 | 2.0 / 14.2 |
| OphCLIP | 100 | 72.1 / 57.9 | 11.6 / 18.8 |
| CLIP [50] | 10 | 40.1 / 17.3 | 0.0 / 15.5 |
| CLIP* [50] | 10 | 47.6 / 26.6 | 1.0 / 15.8 |
| OphCLIP | 10 | 59.5 / 41.6 | 15.6 / 22.2 |
5.1 Datasets
To evaluate our approach, we conduct experiments on two downstream tasks, i.e., phase recognition and multi-instrument recognition, using 11 datasets (or sub-datasets). (1) Phase Recognition: Six datasets are used for this task: Cat-21 [48] (11 classes), Cataract-1K [14] (12 classes), Cataract-101 [59] (10 classes), CatRelDet [12] (5 classes), OphNet [18] (96 phases, 232 operations), and LensID [13] (3 classes). The implantation and rest phase recognition task represents a specialized configuration of phase recognition, wherein video frames are labeled with only three phase categories: pre-implantation, implantation, and post-implantation of the lens. OphNet additionally provides both phase and operation labels, offering finer granularity for classification. (2) Multi-instrument Recognition: We select three datasets for multi-instrument recognition: Cataract-1K [14] (9 classes), CatInstSeg [11] (11 classes), and CaDIS [15] (12 classes, 35 classes). CaDIS also provides labels at different levels of granularity.
5.2 Zero-shot Recognition
In this section, we demonstrate the zero-shot transfer performance of our pretrained OphCLIP model across various downstream tasks. Following CLIP [50], we keep the pretrained visual and text encoders fixed and format class labels as sentence prompts for classification.
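A minimal sketch of this zero-shot protocol with frozen encoders; text_encoder is a stand-in callable for OphCLIP's textual branch applied to the class prompts.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_phase_prediction(image_emb: torch.Tensor, class_prompts, text_encoder):
    """Score one frame embedding against prompted phase labels and pick the best match.

    image_emb:     (D,) embedding of the frame from the frozen visual encoder.
    class_prompts: list of C sentences, one per phase label (caption- or keyword-style).
    text_encoder:  callable mapping a list of strings to a (C, D) embedding tensor.
    """
    text_emb = F.normalize(text_encoder(class_prompts), dim=-1)   # (C, D)
    img = F.normalize(image_emb, dim=-1)                          # (D,)
    sims = text_emb @ img                                         # (C,) cosine similarities
    return int(sims.argmax()), sims
```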
Phase Recognition. As shown in Tab. 2, methods pretrained on the OphVL dataset, including the CLIP* variant, consistently outperform baselines like vanilla CLIP [50] and SLIP [41] across all surgical phase recognition datasets. This demonstrates the impact of our ophthalmic-specific pretraining dataset. Fig. 4 also highlights OphCLIP’s capability to understand ophthalmic-specific concepts across both visual and linguistic modalities. The model not only recognizes relevant anatomical and procedural elements in ophthalmic images but also aligns these elements with corresponding medical terminology and context in textual descriptions. This cross-modal understanding enables OphCLIP to focus on regions within the visual data that contribute most significantly to the semantics, demonstrating that the model effectively prioritizes clinically relevant areas.

| Model | OphVL | KB | Prompt | Cat-21 [48] | Cataract-1K [14] |
|---|---|---|---|---|---|
| CLIP [50] | ✗ | ✗ | Caption | 13.3 / 2.7 | 6.1 / 2.3 |
| CLIP [50] | ✓ | ✗ | Caption | 28.8 / 17.6 | 20.8 / 15.9 |
| OphCLIP | ✓ | ✗ | Caption | 34.9 / 22.9 | 60.5 / 44.7 |
| OphCLIP | ✓ | ✗ | Mix | 41.3 / 27.7 | 55.9 / 43.0 |
| OphCLIP | ✓ | ✓ | Caption | 39.9 / 28.2 | 61.3 / 47.2 |
| OphCLIP | ✓ | ✓ | Mix | 41.4 / 28.8 | 62.8 / 48.5 |
Multi-instrument Recognition. To perform surgical instrument recognition with pretrained vision-language models, each instrument class is converted into a text prompt. For each input image, we compute cosine similarities between image features and these prompts, generating a similarity score per class. Since instrument recognition is a multi-label task, sigmoid activation is applied to these scores, allowing the model to output independent probabilities for each instrument. As shown in Tab. 2, baseline models like CLIP and SLIP [41] show high False Positive Rates (FPR) in instrument recognition, with 100% FPRs on Cataract-1K [14] and CatInstSeg [11], indicating substantial false detections. In contrast, OphCLIP significantly reduces false positives, as shown in Fig. 4. This demonstrates the effectiveness of OphCLIP’s ophthalmic-specific pretraining, enhancing its ability to detect surgical instruments accurately and reduce errors.
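A sketch of this multi-label scoring; the 0.5 decision threshold is an illustrative choice rather than the paper's setting.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_instrument_probs(image_emb, instrument_prompts, text_encoder, threshold: float = 0.5):
    """Independent per-instrument probabilities via sigmoid over similarity scores.

    Unlike single-label phase recognition (argmax over classes), each instrument is
    scored independently, so several instruments can be predicted for the same frame.
    """
    text_emb = F.normalize(text_encoder(instrument_prompts), dim=-1)   # (C, D)
    img = F.normalize(image_emb, dim=-1)                               # (D,)
    probs = torch.sigmoid(text_emb @ img)                              # (C,) per-class probabilities
    return probs > threshold, probs                                    # boolean predictions + scores
```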
Fine-grained vs. Coarse-grained. Tab. 3 shows that OphCLIP outperforms other models in phase recognition across different granularities. For OphNet-O (fine-grained) and OphNet-P (coarse-grained), OphCLIP achieves 7.1% / 2.3% and 18.2% / 4.8% (Acc / F1), respectively, significantly surpassing CLIP and LaCLIP [10]. In instrument recognition, OphCLIP achieves an FPR of 14.7% / mAP of 10.6% on CaDIS-F and an FPR of 28.7% / mAP of 23.5% on CaDIS-C, yielding the highest mAP among the compared models with comparable false positive rates. These results confirm OphCLIP’s robust performance and adaptability in tasks of different granularities.
5.3 Few/Full-shot Linear Probing
As shown in Tab. 4, we evaluate each pretrained method’s visual encoder on Cat-21 and Cataract-1K using linear probing with both 10% and 100% of the data. Our OphCLIP demonstrates substantial performance gains over both CLIP variants, particularly with 100% of the data. These results show that OphCLIP’s visual encoder provides strong, transferable representations for diverse surgical tasks, serving as an effective generalist backbone.
5.4 Ablation Studies
We conduct an ablation study on surgical phase downstream datasets to examine the effect of OphCLIP’s components (Tab. 5). Models pretrained with OphVL consistently outperform those without it, showing notable accuracy gains across all tasks. Adding a knowledge base (KB) with more silent videos further boosts performance, evidenced by higher F1 scores on Cataract-1K. Prompt choice also plays a key role in zero-shot phase recognition. The “Mix” prompt, which includes keywords like “instrument,” “medication,” and “goal”, outperforms the “Caption” prompt. This is due to the specialized OphVL corpus focusing on instruments, medications, and procedural goals, enabling better concept capture and improved predictions.
6 Limitation
First, despite our efforts with OphVL and OphCLIP to enhance perception in ophthalmic surgeries, biases from open-source videos persist, including regional practice variations and inconsistent terminology. Second, although OphVL covers a wide range of surgeries, research mainly focuses on cataract procedures due to their prevalence and accessibility, making our validation datasets primarily cataract-based; this limits our ability to validate the model across other surgeries like glaucoma and corneal procedures. Finally, while OphCLIP outperforms baseline models in phase recognition and multi-instrument classification, the limited variety of downstream benchmarks restricts evaluation on more complex tasks like anomaly detection and other advanced challenges.
7 Conclusion
In this work, we introduce OphCLIP, a specialized vision-language pretraining framework designed for ophthalmic surgery. By constructing the comprehensive OphVL dataset, which includes over 375K clip-text pairs and tens of thousands of ophthalmic surgery-related concepts (surgeries, procedures, instruments, medications, surgical goals, etc.), we enable robust hierarchical learning of both fine-grained and long-term visual representations. Our approach leverages both narrative and silent videos through innovative retrieval-based supervision, resulting in enhanced understanding and generalization across surgical phases and multi-instrument identification tasks. This research sets a new benchmark for ophthalmic surgical workflow understanding and opens avenues for more specialized and context-aware AI applications in ophthalmic surgery.
Supplementary Material
8 OphVL Dataset
Fig. 6 illustrates the clip-text pair samples we constructed. Through our data processing pipeline, OphVL achieves high-quality modality alignment between ophthalmic surgery videos and descriptive texts.
9 Experiments
9.1 Implementation Details.
| | Hyper-parameter | Value |
|---|---|---|
| | Epochs | 60 |
| Clip-level Pretraining | Batch Size | 120 |
| | Image Size | 224 |
| | # of Frames | 8 |
| | Text Length | 77 |
| Video-level Pretraining | Batch Size | 140 |
| | Image Size | 224 |
| | # of Frames | 8 |
| | # of Retrieved Videos | 1 |
| | Text Length | 77 |
| Optimization | Learning Rate | 8e-5 |
| | Scheduler | Cosine |
| | Optimizer | Adam |
| | Momentum | 0.9 |
| Loss Function | Temperature | 0.1 |
| | Weight of $\mathcal{L}_{\mathrm{ssl}}$ ($\lambda_{\mathrm{ssl}}$) | 0.5 |
| | Weight of $\mathcal{L}_{\mathrm{ret}}$ ($\lambda_{\mathrm{ret}}$) | 0.5 |
Architecture. We use a CLIP-like architecture [50] with two branches, i.e., visual and textual encoders. We use ResNet-50, initialized with ImageNet weights, as the visual encoder, and BioClinicalBERT [20], pretrained on clinical notes, as the textual encoder. We then apply average pooling over the visual features to generate the visual embeddings, and a linear projection layer on top of the BERT model's [CLS] token to generate the textual embeddings, projecting both modalities into a shared embedding space.
Pretraining Setups. We train on NVIDIA RTX GPUs for several days in total. We first perform clip-level pretraining and then apply the hierarchical pretraining strategy, which alternately trains for several epochs on clip-level video-text pairs, followed by several epochs on video-level video-text pairs. We use batch sizes of 120 and 140 for the clip- and video-level pretraining, respectively. More hyper-parameter details can be found in Tab. 6.
9.2 Evaluation Setup.
We evaluate the representation ability of our OphCLIP using two types of downstream tasks: surgical phase recognition and surgical tool recognition. Additionally, we conduct zero-shot evaluation and linear probing to assess the model’s multi-modal alignment and visual representation capabilities. Tables 11-19 list the specific label names we used for the downstream validation datasets. The labels for the OphNet [18] dataset can be found in the online table: https://docs.google.com/spreadsheets/d/1p5lURkth587-lxYwd6eOSmSxPpvIqvyuOKW-4B49PT0/edit?gid=0#gid=0
Surgical Phase Recognition. This task evaluates the model’s understanding of surgical scenes by classifying video frames into predefined surgical phases. It requires the model to identify instruments, anatomical structures, and their interactions by extracting meaningful visual patterns. To focus on multi-modal representation learning, we exclude temporal modeling and analyze frame-level understanding instead.
Surgical Tool Recognition. This task tests the model’s ability to detect and classify surgical instruments within video frames. By analyzing visual features like shape, texture, and contextual cues, the model demonstrates object-level understanding without reliance on broader workflow context. We assess its robustness in identifying tools despite variations in orientation, scale, or occlusion, emphasizing the quality of learned visual representations.
Instrument Label | Textual Prompt |
---|---|
Capsulorhexis Forceps | This video shows capsulorhexis forceps. |
Capsulorhexis Cystotome | This video shows capsulorhexis cystotome. |
Katena Forceps | This video shows katena forceps. |
Irrigation-Aspiration | This video shows irrigation aspiration. |
Slit Knife | This video shows slit knife. |
Phacoemulsification Tip | This video shows phacoemulsification tip. |
Spatula | This video shows spatula. |
Gauge | This video shows gauge. |
Lens Injector | This video shows lens injector. |
Incision Knife | This video shows incision knife. |
Zero-shot Evaluation. To perform frame-wise classification tasks for surgical phase and tool recognition, we construct textual prompts tailored to the class labels. For phase recognition, we address their high-level definitions by breaking them down into essential components such as phase, instrument, medication, and goal. These are referred to as keyword-only prompts as shown in Tab. 10. Additionally, we leverage Large Language Models (LLMs) to generate caption-only prompts, which are detailed descriptive sentences that incorporate relevant surgical instruments, anatomical structures, and events for each phase. These prompts help align the textual domain of pretraining with the downstream task corpus. For tool recognition, we create human-like descriptive sentences to minimize the textual domain gap, ensuring better alignment between pretraining and downstream corpus, as shown in Tab. 7. This approach facilitates robust zero-shot performance by bridging differences in textual contexts.
Linear-Probing Evaluation. For linear probing, we freeze the visual encoder and train a linear classifier on the extracted features. No image augmentations are applied during training. The linear classifier is implemented as a linear Support Vector Machine (SVM). We fit the model on the training and validation sets, then evaluate its performance on a separate test set. For few-shot linear probing, we use a percentage-based shot approach tailored to surgical video data. Specifically, we sample 10% of the videos from the training set, ensuring no data leakage while maintaining a balanced number of samples across classes. This setup allows for a fair evaluation of the model’s generalization with limited supervision.
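A sketch of the linear-probing protocol with video-level few-shot sampling, assuming frozen features have already been extracted into NumPy arrays; the class-balancing step is omitted for brevity.

```python
import numpy as np
from sklearn.svm import LinearSVC

def linear_probe(train_feats, train_labels, test_feats, test_labels,
                 video_ids=None, frac=1.0, seed=0) -> float:
    """Train a linear SVM on frozen features and return test accuracy.

    For the few-shot setting, a fraction of *videos* (not frames) is sampled so that
    frames from the same video never leak between sampled and unsampled subsets.
    """
    if frac < 1.0 and video_ids is not None:
        rng = np.random.default_rng(seed)
        vids = np.unique(video_ids)
        keep = rng.choice(vids, size=max(1, int(len(vids) * frac)), replace=False)
        mask = np.isin(video_ids, keep)
        train_feats, train_labels = train_feats[mask], train_labels[mask]

    clf = LinearSVC()                    # linear kernel, no image augmentation
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```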
More Ablation Experiments. Tab. 9 presents additional ablation results on the Cataract-101 [59] and CatRelDet [12] datasets.
Instrument Label | Precision | Recall | F1 | Support |
---|---|---|---|---|
Capsulorhexis Forceps | 6.1 | 100.0 | 11.5 | 100 |
Capsulorhexis Cystotome | 4.8 | 100.0 | 9.1 | 85 |
Katena Forceps | 1.6 | 100.0 | 3.1 | 28 |
Irrigation-Aspiration | 25.4 | 100.0 | 40.5 | 451 |
Slit Knife | 1.6 | 100.0 | 3.1 | 28 |
Phacoemulsification Tip | 30.7 | 100.0 | 46.9 | 545 |
Spatula | 40.3 | 100.0 | 57.4 | 716 |
Gauge | 24.0 | 100.0 | 38.7 | 426 |
Lens Injector | 3.7 | 100.0 | 7.2 | 66 |
Incision Knife | 1.2 | 100.0 | 2.4 | 22 |
Macro Avg. | 13.9 | 100.0 | 22.0 | 2475 |
10 Limitation
Data Bias. The OphVL dataset is sourced from YouTube, showcasing diverse styles, resolutions, and screen elements. This diversity enhances the evaluation of a model’s generalization ability but may also impact its effectiveness and performance. Some videos in the dataset contain subtitles, watermarks, or additional video windows. Furthermore, regional variability introduces discrepancies in surgical descriptions, such as differences in surgical standards, nomenclature, and definitions influenced by cultural or demographic factors. These characteristics in OphVL reflect the complexity of real-world surgical environments, where ophthalmic microscopes may inherently display various windows or parameters during recording. While these factors pose challenges, they also present opportunities to develop models that are better equipped to handle such diversity.
Downstream Task Limitation. The zero-shot downstream evaluation datasets for OphCLIP are sourced from publicly available datasets, leveraging their high-quality characteristics and ensuring fair comparisons. However, due to the limited diversity of these datasets—most of which primarily focus on phase recognition and instrument classification in ophthalmology—it is challenging to validate the model on a broader range of vision-language understanding tasks, such as lesion identification or anomaly detection. While the Cataract-1K dataset includes annotations for two types of anomalies, lens rotation and pupil reaction, it does not provide frame-level annotations for these cases.
| Model | OphVL | KB | Prompt | Cataract-101 [59] | CatRelDet [12] |
|---|---|---|---|---|---|
| CLIP [50] | ✗ | ✗ | Caption | 10.0 / 3.3 | 15.3 / 11.9 |
| CLIP [50] | ✓ | ✗ | Caption | 36.2 / 25.5 | 26.7 / 23.7 |
| OphCLIP | ✓ | ✗ | Caption | 37.1 / 31.9 | 33.6 / 35.4 |
| OphCLIP | ✓ | ✗ | Mix | 31.9 / 28.4 | 34.5 / 36.1 |
| OphCLIP | ✓ | ✓ | Caption | 41.1 / 34.7 | 33.6 / 35.3 |
| OphCLIP | ✓ | ✓ | Mix | 39.3 / 33.7 | 32.6 / 34.2 |
Tab. 10: Caption-only and keyword-only textual prompts for each phase label: Incision, Viscoelastic, Capsulorhexis, Hydrodissection, Phacoemulsification, Irrigation/Aspiration, Capsule Pulishing, Lens Implantation, Lens positioning, Viscoelastic_Suction, Anterior_Chamber Flushing, and Tonifying/Antibiotics.
ID | Phase |
---|---|
0 | Antibiotikum |
1 | Hydrodissektion |
2 | Incision |
3 | Irrigation-Aspiration |
4 | Kapselpolishing |
5 | Linsenimplantation |
6 | Phako |
7 | Rhexis |
8 | Tonisieren |
9 | Visco-Absaugung |
10 | Viscoelasticum |
11 | not_initialized |
ID | Phase |
---|---|
0 | Incision |
1 | Viscoelastic |
2 | Capsulorhexis |
3 | Hydrodissection |
4 | Phacoemulsification |
5 | Irrigation/Aspiration |
6 | Capsule Pulishing |
7 | Lens Implantation |
8 | Lens positioning |
9 | Viscoelastic_Suction |
10 | Anterior_Chamber Flushing |
11 | Tonifying/Antibiotics |
ID | Phase |
---|---|
0 | Incision |
1 | Viscous agent injection |
2 | Rhexis |
3 | Hydrodissection |
4 | Phacoemulsificiation |
5 | Irrigation and aspiration |
6 | Capsule polishing |
7 | Lens implant setting-up |
8 | Viscous agent removal |
9 | Tonifying and antibiotics |
ID | Phase |
---|---|
0 | Implantation |
1 | Irrigation_Aspiration and Visc_Suction |
2 | Phacoemulsification |
3 | Rhexis |
4 | Rest |
ID | Phase |
---|---|
0 | Linsenimplantation |
1 | Linsenimplantation_before |
2 | Linsenimplantation_after |
ID | Instrument |
---|---|
0 | spatula |
1 | 27 gauge cannula |
2 | slit knife |
3 | phaco tip |
4 | capsulorhexis forceps |
5 | cartridge |
6 | I/A handpiece |
7 | cannula |
8 | katena forceps |
9 | eye retractors |
10 | angled incision knife |
ID | Instrument |
---|---|
0 | Capsulorhexis Forceps |
1 | Capsulorhexis Cystotome |
2 | Katena Forceps |
3 | Irrigation-Aspiration |
4 | Slit Knife |
5 | Phacoemulsification Tip |
6 | Spatula |
7 | Gauge |
8 | Lens Injector |
9 | Incision Knife |
ID | Instrument |
---|---|
0 | I/A Handpiece |
1 | Marker |
2 | Rycroft Cannula Handle |
3 | Eye Retractors |
4 | Cotton |
5 | Secondary Knife Handle |
6 | Surgical Tape |
7 | Troutman Forceps |
8 | Hydrodissection Cannula Handle |
9 | Vitrectomy Handpiece |
10 | Iris Hooks |
11 | Rycroft Cannula |
12 | Lens Injector |
13 | Secondary Knife |
14 | Mendez Ring |
15 | Primary Knife |
16 | Capsulorhexis Cystotome |
17 | I/A Handpiece Handle |
18 | Micromanipulator |
19 | Charleux Cannula |
20 | Phacoemulsifier Handpiece |
21 | Viscoelastic Cannula |
22 | Capsulorhexis Forceps |
23 | Phacoemulsifier Handpiece Handle |
24 | Lens Injector Handle |
25 | background |
26 | Hydrodissection Cannula |
27 | Capsulorhexis Cystotome Handle |
28 | Needle Holder |
29 | Suture Needle |
30 | Bonn Forceps |
31 | Primary Knife Handle |
ID | Instrument |
---|---|
0 | I/A Handpiece |
1 | Cap. Forceps |
2 | Eye Retractors |
3 | Lens Injector |
4 | Tissue Forceps |
5 | Surgical Tape |
6 | Ph. Handpiece |
7 | Cannula |
8 | Secondary Knife |
9 | Cap. Cystotome |
10 | Primary Knife |
11 | Micromanipulator |
References
- Ashutosh et al. [2023] Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, and Kristen Grauman. Hiervl: Learning hierarchical video-language embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23066–23078, 2023.
- Bain et al. [2021] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1728–1738, 2021.
- Bernal et al. [2015] Jorge Bernal, F Javier Sánchez, Gloria Fernández-Esparrach, Debora Gil, Cristina Rodríguez, and Fernando Vilariño. Wm-dova maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians. Computerized medical imaging and graphics, 43:99–111, 2015.
- Bombieri et al. [2023] Marco Bombieri, Marco Rospocher, Simone Paolo Ponzetto, and Paolo Fiorini. Surgicberta: a pre-trained language model for procedural surgical language. International Journal of Data Science and Analytics, 2023.
- Borgeaud et al. [2022] Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022.
- Cerón et al. [2022] Juan Carlos Ángeles Cerón, Gilberto Ochoa Ruiz, Leonardo Chang, and Sharib Ali. Real-time instance segmentation of surgical instruments using attention and multi-scale feature fusion. Medical Image Analysis, 81:102569, 2022.
- Chen et al. [2023] Jun Chen, Ming Hu, Darren J. Coker, Michael L. Berumen, Blair Costelloe, Sara Beery, Anna Rohrbach, and Mohamed Elhoseiny. Mammalnet: A large-scale video benchmark for mammal recognition and behavior understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13052–13061, 2023.
- Chen and He [2021] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15750–15758, 2021.
- Chen et al. [2024] Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. Chexagent: Towards a foundation model for chest x-ray interpretation. arXiv preprint arXiv:2401.12208, 2024.
- Fan et al. [2024] Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites. Advances in Neural Information Processing Systems, 36, 2024.
- Fox et al. [2020] Markus Fox, Mario Taschwer, and Klaus Schoeffmann. Pixel-based tool segmentation in cataract surgery videos with mask R-CNN. In 33rd IEEE International Symposium on Computer-Based Medical Systems, CBMS 2020, Rochester, MN, USA, July 28-30, 2020, pages 565–568. IEEE, 2020.
- Ghamsarian et al. [2020] Negin Ghamsarian, Mario Taschwer, Doris Putzgruber-Adamitsch, Stephanie Sarny, and Klaus Schoeffmann. Relevance detection in cataract surgery videos by spatio- temporal action localization. In 25th International Conference on Pattern Recognition, ICPR 2020, Virtual Event / Milan, Italy, January 10-15, 2021, pages 10720–10727. IEEE, 2020.
- Ghamsarian et al. [2021] Negin Ghamsarian, Mario Taschwer, Doris Putzgruber-Adamitsch, Stephanie Sarny, Yosuf El-Shabrawi, and Klaus Schoeffmann. Lensid: A cnn-rnn-based framework towards lens irregularity detection in cataract surgery videos. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2021 - 24th International Conference, Strasbourg, France, September 27 - October 1, 2021, Proceedings, Part VIII, pages 76–86. Springer, 2021.
- Ghamsarian et al. [2024] Negin Ghamsarian, Yosuf El-Shabrawi, Sahar Nasirihaghighi, Doris Putzgruber-Adamitsch, Martin Zinkernagel, Sebastian Wolf, Klaus Schoeffmann, and Raphael Sznitman. Cataract-1k dataset for deep-learning-assisted analysis of cataract surgery videos. Scientific data, 11(1):373, 2024.
- Grammatikopoulou et al. [2021] Maria Grammatikopoulou, Evangello Flouty, Abdolrahim Kadkhodamohammadi, Gwenolé Quellec, Andre Chow, Jean Nehme, Imanol Luengo, and Danail Stoyanov. Cadis: Cataract dataset for surgical rgb-image segmentation. Medical Image Analysis, 71:102053, 2021.
- Gu et al. [2020] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing, 2020.
- Honarmand et al. [2024] Mohammadmahdi Honarmand, Muhammad Abdullah Jamal, and Omid Mohareri. Vidlpro: A video-language pre-training framework for robotic and laparoscopic surgery. arXiv preprint arXiv:2409.04732, 2024.
- Hu et al. [2024] Ming Hu, Peng Xia, Lin Wang, Siyuan Yan, Feilong Tang, Zhongxing Xu, Yimin Luo, Kaimin Song, Jurgen Leitner, Xuelian Cheng, et al. Ophnet: A large-scale video benchmark for ophthalmic surgical workflow understanding. arXiv preprint arXiv:2406.07471, 2024.
- Hu et al. [2023] Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A Ross, and Alireza Fathi. Reveal: Retrieval-augmented visual-language pre-training with multi-source multimodal knowledge memory. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23369–23379, 2023.
- Huang et al. [2019] Kexin Huang, Jaan Altosaar, and Rajesh Ranganath. Clinicalbert: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342, 2019.
- Huang et al. [2023] Zhi Huang, Federico Bianchi, Mert Yuksekgonul, Thomas J Montine, and James Zou. A visual–language foundation model for pathology image analysis using medical twitter. Nature medicine, 29(9):2307–2316, 2023.
- Ikezogwo et al. [2024] Wisdom Ikezogwo, Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, and Linda Shapiro. Quilt-1m: One million image-text pairs for histopathology. Advances in neural information processing systems, 36, 2024.
- Iscen et al. [2023] Ahmet Iscen, Mathilde Caron, Alireza Fathi, and Cordelia Schmid. Retrieval-enhanced contrastive vision-text models. arXiv preprint arXiv:2306.07196, 2023.
- Jha et al. [2021] Debesh Jha, Pia H Smedsrud, Dag Johansen, Thomas De Lange, Håvard D Johansen, Pål Halvorsen, and Michael A Riegler. A comprehensive study on colorectal polyp segmentation with resunet++, conditional random field and test-time augmentation. IEEE journal of biomedical and health informatics, 25(6):2029–2040, 2021.
- Johnson et al. [2019] Alistair EW Johnson, Tom J Pollard, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Yifan Peng, Zhiyong Lu, Roger G Mark, Seth J Berkowitz, and Steven Horng. Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042, 2019.
- Krishna et al. [2017] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In International Conference on Computer Vision (ICCV), 2017.
- Kumar et al. [2023] Akash Kumar, Ashlesha Kumar, Vibhav Vineet, and Yogesh Singh Rawat. A large-scale analysis on self-supervised video representation learning. arXiv e-prints, pages arXiv–2306, 2023.
- Kwon et al. [2024] Donghyeon Kwon, Minsu Cho, and Suha Kwak. Self-supervised learning of semantic correspondence using web videos. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2142–2152, 2024.
- Laina et al. [2017] Iro Laina, Nicola Rieke, Christian Rupprecht, Josué Page Vizcaíno, Abouzar Eslami, Federico Tombari, and Nassir Navab. Concurrent segmentation and localization for tracking of surgical instruments. In Medical Image Computing and Computer-Assisted Intervention- MICCAI 2017: 20th International Conference, Quebec City, QC, Canada, September 11-13, 2017, Proceedings, Part II 20, pages 664–672. Springer, 2017.
- Li et al. [2024a] Jiajie Li, Garrett Skinner, Gene Yang, Brian R Quaranto, Steven D Schwaitzberg, Peter CW Kim, and Jinjun Xiong. Llava-surg: Towards multimodal surgical assistant via structured surgical video learning. arXiv preprint arXiv:2408.07981, 2024a.
- Li et al. [2024b] Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto, and Hideki Nakayama. Evcap: Retrieval-augmented image captioning with external visual-name memory for open-world comprehension. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13733–13742, 2024b.
- Li et al. [2023] Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19948–19960, 2023.
- Li et al. [2021] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021.
- Liu et al. [2023] Yang Liu, Jiayu Huo, Jingjing Peng, Rachel Sparks, Prokar Dasgupta, Alejandro Granados, and Sebastien Ourselin. Skit: a fast key information video transformer for online surgical phase recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21074–21084, 2023.
- Long et al. [2022] Alexander Long, Wei Yin, Thalaiyasingam Ajanthan, Vu Nguyen, Pulak Purkait, Ravi Garg, Alan Blair, Chunhua Shen, and Anton van den Hengel. Retrieval augmented classification for long-tail visual recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6959–6969, 2022.
- Lüddecke and Ecker [2022] Timo Lüddecke and Alexander Ecker. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7086–7096, 2022.
- Miech et al. [2019a] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV, 2019a.
- Miech et al. [2019b] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019b.
- Ming et al. [2023] Hu Ming, Wang Lin, Yan Siyuan, Ma Don, Ren Qingli, Xia Peng, Feng Wei, Duan Peibo, Ju Lie, and Ge Zongyuan. Nurvid: A large expert-level video database for nursing procedure activity understanding. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- Mohapatra et al. [2022] Subhashree Mohapatra, Girish Kumar Pati, Manohar Mishra, and Tripti Swarnkar. Upolyseg: A u-net-based polyp segmentation network using colonoscopy images. Gastroenterology Insights, 13(3):264–274, 2022.
- Mu et al. [2022] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. Slip: Self-supervision meets language-image pre-training. In European conference on computer vision, pages 529–544. Springer, 2022.
- Ni et al. [2022] Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, and Haibin Ling. Expanding language-image pretrained models for general video recognition. In European Conference on Computer Vision, pages 1–18. Springer, 2022.
- Nori et al. [2023] Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.
- Oord et al. [2018] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
- Pakhomov et al. [2019] Daniil Pakhomov, Vittal Premachandran, Max Allan, Mahdi Azizian, and Nassir Navab. Deep residual learning for instrument segmentation in robotic surgery. In Machine Learning in Medical Imaging: 10th International Workshop, MLMI 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13, 2019, Proceedings 10, pages 566–573. Springer, 2019.
- Pakhomov et al. [2020] Daniil Pakhomov, Wei Shen, and Nassir Navab. Towards unsupervised learning for instrument segmentation in robotic surgery with cycle-consistent adversarial networks. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 8499–8504. IEEE, 2020.
- Pakkasjärvi et al. [2023] Niklas Pakkasjärvi, Tanvi Luthra, and Sachit Anand. Artificial intelligence in surgical learning. Surgeries, 4(1):86–97, 2023.
- Primus et al. [2018] Manfred Jürgen Primus, Doris Putzgruber-Adamitsch, Mario Taschwer, Bernd Münzer, Yosuf El-Shabrawi, László Böszörményi, and Klaus Schoeffmann. Frame-based classification of operation phases in cataract surgery videos. In MultiMedia Modeling - 24th International Conference, MMM 2018, Bangkok, Thailand, February 5-7, 2018, Proceedings, Part I, pages 241–253. Springer, 2018.
- Qin et al. [2022] Ziyuan Qin, Huahui Yi, Qicheng Lao, and Kang Li. Medical image understanding with pretrained vision language models: A comprehensive study. arXiv preprint arXiv:2209.15517, 2022.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Radford et al. [2023] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023.
- Ram et al. [2023] Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023.
- Ramos et al. [2023] Rita Ramos, Bruno Martins, Desmond Elliott, and Yova Kementchedjhieva. Smallcap: lightweight image captioning prompted with retrieval augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2840–2849, 2023.
- Sarto et al. [2022] Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Retrieval-augmented transformer for image captioning. In Proceedings of the 19th international conference on content-based multimedia indexing, pages 1–7, 2022.
- Schiappa et al. [2023] Madeline C Schiappa, Yogesh S Rawat, and Mubarak Shah. Self-supervised learning for videos: A survey. ACM Computing Surveys, 55(13s):1–37, 2023.
- Schmidgall et al. [2024a] Samuel Schmidgall, Joseph Cho, Cyril Zakka, and William Hiesinger. Gp-vls: A general-purpose vision language model for surgery. arXiv preprint arXiv:2407.19305, 2024a.
- Schmidgall et al. [2024b] Samuel Schmidgall, Ji Woong Kim, Jeffrey Jopling, and Axel Krieger. General surgery vision transformer: A video pre-trained foundation model for general surgery. arXiv preprint arXiv:2403.05949, 2024b.
- Schoeffmann et al. [2018a] Klaus Schoeffmann, Heinrich Husslein, Sabrina Kletz, Stefan Petscharnig, Bernd Münzer, and Christian Beecks. Video retrieval in laparoscopic video recordings with dynamic content descriptors. Multimedia Tools and Applications, 77(13):16813–16832, 2018a.
- Schoeffmann et al. [2018b] Klaus Schoeffmann, Mario Taschwer, Stephanie Sarny, Bernd Münzer, Manfred Jürgen Primus, and Doris Putzgruber. Cataract-101: video dataset of 101 cataract surgeries. In Proceedings of the 9th ACM multimedia systems conference, pages 421–425, 2018b.
- Schwenk et al. [2022] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European conference on computer vision, pages 146–162. Springer, 2022.
- Seenivasan et al. [2023] Lalithkumar Seenivasan, Mobarakol Islam, Gokul Kannan, and Hongliang Ren. Surgicalgpt: end-to-end language-vision gpt for visual question answering in surgery. In International conference on medical image computing and computer-assisted intervention, pages 281–290. Springer, 2023.
- Shvets et al. [2018] Alexey A Shvets, Alexander Rakhlin, Alexandr A Kalinin, and Vladimir I Iglovikov. Automatic instrument segmentation in robot-assisted surgery using deep learning. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 624–628, 2018.
- Silva et al. [2014] Juan Silva, Aymeric Histace, Olivier Romain, Xavier Dray, and Bertrand Granado. Toward embedded detection of polyps in wce images for early diagnosis of colorectal cancer. International journal of computer assisted radiology and surgery, 9:283–293, 2014.
- Smedsrud et al. [2021] Pia H Smedsrud, Vajira Thambawita, Steven A Hicks, Henrik Gjestang, Oda Olsen Nedrejord, Espen Næss, Hanna Borgli, Debesh Jha, Tor Jan Derek Berstad, Sigrun L Eskeland, Mathias Lux, Håvard Espeland, Andreas Petlund, Duc Tien Dang Nguyen, Enrique Garcia-Ceja, Dag Johansen, Peter T Schmidt, Ervin Toth, Hugo L Hammer, Thomas de Lange, Michael A Riegler, and Pål Halvorsen. Kvasir-Capsule, a video capsule endoscopy dataset. Scientific Data, 8(1):142, 2021.
- Sun et al. [2019] Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, and Cordelia Schmid. Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7464–7473, 2019.
- Varghese et al. [2024] Chris Varghese, Ewen M Harrison, Greg O’Grady, and Eric J Topol. Artificial intelligence in surgery. Nature Medicine, pages 1–12, 2024.
- Wagner et al. [2023] Martin Wagner, Beat-Peter Müller-Stich, Anna Kisilenko, Duc Tran, Patrick Heger, Lars Mündermann, David M Lubotsky, Benjamin Müller, Tornike Davitashvili, Manuela Capek, et al. Comparative validation of machine learning algorithms for surgical workflow and skill analysis with the heichole benchmark. Medical Image Analysis, 86:102770, 2023.
- Wang et al. [2018] Bairui Wang, Lin Ma, Wei Zhang, and Wei Liu. Reconstruction network for video captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7622–7631, 2018.
- Wang et al. [2023a] Guangyu Wang, Guoxing Yang, Zongxin Du, Longjun Fan, and Xiaohu Li. Clinicalgpt: large language models finetuned with diverse medical data and comprehensive evaluation. arXiv preprint arXiv:2306.09968, 2023a.
- Wang et al. [2024] Guankun Wang, Long Bai, Wan Jun Nah, Jie Wang, Zhaoxi Zhang, Zhen Chen, Jinlin Wu, Mobarakol Islam, Hongbin Liu, and Hongliang Ren. Surgical-lvlm: Learning to adapt large vision-language model for grounded visual question answering in robotic surgery. arXiv preprint arXiv:2405.10948, 2024.
- Wang et al. [2023b] Zhao Wang, Chang Liu, Shaoting Zhang, and Qi Dou. Foundation model for endoscopy video analysis via large-scale self-supervised pre-train. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 101–111. Springer, 2023b.
- Xu et al. [2024] Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Retrieval-augmented egocentric video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13525–13536, 2024.
- Yu et al. [2022] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
- Yuan et al. [2023] Kun Yuan, Vinkle Srivastav, Tong Yu, Joel L Lavanchy, Pietro Mascagni, Nassir Navab, and Nicolas Padoy. Learning multi-modal representations by watching hundreds of surgical video lectures. arXiv preprint arXiv:2307.15220, 2023.
- Yuan et al. [2024a] Kun Yuan, Vinkle Srivastav, Nassir Navab, and Nicolas Padoy. Hecvl: Hierarchical video-language pretraining for zero-shot surgical phase recognition. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 306–316. Springer, 2024a.
- Yuan et al. [2024b] Kun Yuan, Vinkle Srivastav, Nassir Navab, and Nicolas Padoy. Procedure-aware surgical video-language pretraining with hierarchical knowledge augmentation. arXiv preprint arXiv:2410.00263, 2024b.
- Yuan et al. [2021] Xin Yuan, Zhe Lin, Jason Kuen, Jianming Zhang, Yilin Wang, Michael Maire, Ajinkya Kale, and Baldo Faieta. Multimodal contrastive training for visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6995–7004, 2021.
- Zhang et al. [2018] Bowen Zhang, Hexiang Hu, and Fei Sha. Cross-modal and hierarchical modeling of video and text. In Proceedings of the european conference on computer vision (ECCV), pages 374–390, 2018.
- Zhang et al. [2023] Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, et al. Huatuogpt, towards taming language model to be a doctor. arXiv preprint arXiv:2305.15075, 2023.
- Zhou et al. [2018] Luowei Zhou, Yingbo Zhou, Jason J Corso, Richard Socher, and Caiming Xiong. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8739–8748, 2018.
- Zhou et al. [2023] Zijian Zhou, Oluwatosin Alabi, Meng Wei, Tom Vercauteren, and Miaojing Shi. Text promptable surgical instrument segmentation with vision-language models. Advances in Neural Information Processing Systems, 2023.
- Zou et al. [2023] Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, et al. Generalized decoding for pixel, image, and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15116–15127, 2023.
- Zou et al. [2024] Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment everything everywhere all at once. Advances in Neural Information Processing Systems, 36, 2024.