
Vanderbilt University, Nashville TN, 37212, USA
[email protected]

EndoPBR: Material and Lighting Estimation for Photorealistic Surgical Simulations via Physically-based Rendering

John J. Han, Jie Ying Wu
Abstract

The lack of labeled datasets in 3D vision for surgical scenes inhibits the development of robust 3D reconstruction algorithms in the medical domain. Despite the popularity of Neural Radiance Fields and 3D Gaussian Splatting in the general computer vision community, these systems have yet to find consistent success in surgical scenes due to challenges such as non-stationary lighting and non-Lambertian surfaces. As a result, the need for labeled surgical datasets continues to grow. In this work, we introduce a differentiable rendering framework for material and lighting estimation from endoscopic images and known geometry. Compared to previous approaches that model lighting and material jointly as radiance, we explicitly disentangle these scene properties for robust and photorealistic novel view synthesis. To disambiguate the training process, we formulate domain-specific properties inherent in surgical scenes. Specifically, we model the scene lighting as a simple spotlight and the material properties as a bidirectional reflectance distribution function, parameterized by a neural network. By grounding color predictions in the rendering equation, we can generate photorealistic images at arbitrary camera poses. We evaluate our method on several sequences from the Colonoscopy 3D Video Dataset and show that it produces novel view synthesis results competitive with other approaches. Furthermore, we demonstrate that synthetic data can be used to develop 3D vision algorithms by fine-tuning a depth estimation model with our rendered outputs. The resulting depth estimation performance is on par with fine-tuning on the original real images. The codebase will be released at https://github.com/juseonghan/EndoPBR.

Keywords: Endoscopy · Novel View Synthesis · Differentiable Rendering · Physically-based Rendering · Inverse Rendering

1 Introduction

Figure 1: EndoPBR generates photorealistic renderings from posed images and known geometry. The top row contains generated renderings from views outside the training set, and the bottom row displays the ground truth RGB images. Note that we undistort images prior to model training.

Estimating geometry from RGB images, or vision-based 3D reconstruction, is a long-standing research problem with many applications in minimally invasive surgery (MIS). For instance, a Simultaneous Localization and Mapping (SLAM) system can provide real-time surgical navigation guidance for endoscopy [13], enable robots to perceive the surgical scene [12], and actualize digital twins for various healthcare applications [6]. However, surgical scenes are challenging to reconstruct with conventional algorithms due to factors like non-stationary lighting, textureless surfaces, and non-Lambertian surfaces. Most importantly, the lack of large labeled datasets for medical data limits both 1) the utility of supervised learning of neural networks and 2) the availability of a reliable platform for evaluating 3D reconstruction algorithms. Thus, existing research for vision-based 3D reconstruction in medical images falls into two main categories: 1) synthesizing a labeled dataset with simulated images or phantoms [15, 1, 22], or 2) training a neural network with self-supervision, e.g. joint ego-motion and depth estimation via photometric warping [14, 26, 18]. Despite their impressive results, these methods have limited generalizability to patients and anatomies outside of the training set due to the inherent domain gap: synthetic data has limited photorealism, and self-supervised models overfit to the training data. Recently, foundation models have emerged as potential solutions for medical vision tasks [21], but their zero-shot performance in surgical scenes has room for improvement [7].

One promising avenue toward robust solutions for surgical 3D vision is synthetic labeled datasets of photorealistic images. If such a dataset could be generated at large scale, researchers could train models in a supervised manner or fine-tune foundation models for their specific anatomical domain. For example, synthetic data has been used for various medical computer vision tasks such as segmentation [24, 25, 27], depth estimation [15, 9], and camera pose estimation [1]. However, approaches that generate image frames independently are not suitable for systems like SLAM due to the lack of temporal and 3D consistency. Although some methods enforce such consistency with photometric warping [23] or by learning mesh textures [8], they cannot generalize to novel views or lack photorealism. Phantom-based datasets like EndoSLAM [19] and C3VD [3] have been useful for the community, but they are not large enough to reliably train neural networks and are limited to a specific anatomical domain. Recent works generate realistic data with state-of-the-art neural reconstruction methods like Neural Radiance Fields (NeRFs) [16] and 3D Gaussian Splatting [11] to perform novel view synthesis. However, these algorithms typically require comprehensive coverage of the scene from many camera vantage points, which is not typical in endoscopy [20, 4]. Furthermore, vanilla versions of these algorithms assume stationary lighting and can be unstable with sparse views. To address these issues, we propose EndoPBR, a photorealistic simulation platform for endoscopy that robustly generalizes to novel views.

EndoPBR generates photorealistic images by optimizing material properties and lighting conditions from images with known camera poses and geometry. To disambiguate the training process, we formulate domain-specific constraints on the scene. For material properties, we use the simplified Disney bidirectional reflectance distribution function (BRDF) model [5], consisting of base albedo, metallic, and roughness values for a given surface point. We parameterize the BRDF with a Multi-Layer Perceptron (MLP), which takes a query 3D point as input and predicts a 5D vector of material properties. For lighting conditions, previous works use a neural network to learn the surface light field from a given viewing direction and 3D point [29]. However, this formulation assumes stationary lighting and cannot account for the varying illumination present in endoscopic scenes. To address this issue, EndoPBR uses a moving spotlight model with learnable parameters to calculate the incident light intensity at the surface. This formulation is simple enough for stable training, while its learnable parameters provide the flexibility to model the complexities of surgical scenes. We generate photorealistic outputs with physically-based rendering techniques and leverage differentiable rendering to minimize the difference between predicted and ground truth images.

Figure 2: An overview of the components of our pipeline. (a) describes the mathematical notation for the geometry of our setup. Given the camera center $x_c$, light source center $x_L$, and a query pixel $p_i$, we calculate its corresponding 3D point $\bm{x}$, which has an associated surface normal $\vec{\bm{n}}$, outgoing vector $\bm{\omega_o}$, and incoming light vector $\bm{\omega_i}$. (b) displays the essential components of our network. The learnable spotlight model is used to calculate the incident light intensity at $\bm{x}$ (Sec. 2.1), the BRDF model predicts material properties for $\bm{x}$ (Sec. 2.2), and these estimates are combined to predict the final pixel value via the rendering equation (Sec. 2.3).

We evaluate our algorithm on the Colonoscopy 3D Video Dataset (C3VD) [3] for novel view synthesis by comparing generated renderings with ground truth images. We show that by estimating material and incident light, we synthesize novel views with performance comparable to previous methods based on NeRF and 3DGS. Furthermore, we demonstrate the utility of photorealistic synthetic data by fine-tuning a depth estimation model with EndoPBR's generated images. We show that realistic synthetic data can be used to fine-tune a foundation depth estimation model and that it produces results similar to fine-tuning with real images.

2 Methods

Our aim is to learn the material properties and lighting conditions of the scene given posed images and surface geometry, i.e. depth maps. An overview of our pipeline is shown in Fig. 2. EndoPBR determines color pixel values with the physics of light transport via physically-based rendering. We demonstrate that our method generalizes to camera views outside the training set.

Preliminaries. The Rendering Equation [10] computes the outgoing radiance from a surface point $\bm{x}$ along a camera viewing direction $\bm{\omega_o}$:

L_o(\bm{\omega_o}, \bm{x}) = \int_{\Omega} f(\bm{\omega_o}, \bm{\omega_i}, \bm{x}) \, L_i(\bm{\omega_i}, \bm{x}) \, (\bm{\omega_i} \cdot \bm{n}) \, d\bm{\omega_i}    (1)

where $\bm{n}$ is the surface normal calculated from depth, $L_i$ is the incident light intensity coming from direction $\bm{\omega_i}$, and $f$ is the BRDF. The integration is performed over all incident directions $\bm{\omega_i}$ in the upper hemisphere $\Omega$ where $\bm{\omega_i} \cdot \bm{n} > 0$.

Given a pixel, we calculate its corresponding 3D point by unprojecting its depth value. We query its incident light intensity (Sec. 2.1) and BRDF parameters (Sec. 2.2), and evaluate Eqn. 1 to estimate the pixel value (Sec. 2.3). We use differentiable rendering to learn the appropriate BRDF and scene lighting $L_i$ from endoscopic images.

2.1 Light Conditions

Decomposing input images into material and lighting properties is often difficult because the decomposition is ambiguous during training. For instance, the network may model specularities (bright white spots on the surface) as white albedo instead of a reflective surface. Properly constraining the model to learn valid lighting conditions is therefore paramount for robust material estimation. To this end, we formulate key observations about surgical scenes as learning constraints, specifically that

  1. the light source moves in tandem with the camera, and
  2. there is no ambient light in surgical scenes; all visible light radiates from the endoscopic light source.

Consequently, we model the environment lighting with a spotlight located at $\bm{x}_L \in \mathbb{R}^3$ with direction $\vec{\bm{v}}_L \in \mathbb{R}^3$. Given a point $\bm{x} \in \mathbb{R}^3$, the incident light intensity $L_i$ experienced by $\bm{x}$ is

L_i(\bm{x}) = L_0 \frac{\cos^n(\theta)}{d^q}    (2)

where $L_0$ is the spotlight base intensity, $\theta$ is the angle between $\bm{x} - \bm{x}_L$ and $\vec{\bm{v}}_L$, and $d$ is the distance from the light source to the point. We let $L_0$, $n$, and $q$ be learnable parameters so that the model can flexibly fit the environment lighting. Following previous work, we assume that the light source and camera center are co-located, i.e. $\bm{x}_c = \bm{x}_L$ [2, 20]. We also set the light direction $\vec{\bm{v}}_L$ equal to the camera's forward viewing direction.
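For concreteness, the following is a minimal PyTorch sketch of Eq. 2, assuming surface points are batched as an (N, 3) tensor and that $L_0$, $n$, and $q$ are stored as learnable scalars; the function name and tensor layout are ours, not part of the released codebase.

```python
import torch

def spotlight_intensity(x, x_L, v_L, L0, n, q):
    """Incident light intensity L_i at surface points x (Eq. 2).

    x:   (N, 3) surface points in world coordinates
    x_L: (3,)   light source position (co-located with the camera center)
    v_L: (3,)   light direction (camera forward viewing direction, unit norm)
    L0, n, q:   learnable scalars (base intensity, angular and distance falloff)
    """
    to_x = x - x_L                                   # vectors from the light to each point
    d = to_x.norm(dim=-1, keepdim=True)              # distance to the light source
    cos_theta = ((to_x / d) * v_L).sum(-1, keepdim=True).clamp(min=0.0)
    return L0 * cos_theta**n / d**q                  # (N, 1) incident intensity
```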

2.2 BRDF Estimation

The BRDF is a fundamental concept in rendering and captures the interaction of light and matter. Following the Disney BRDF model [5], $f$ can be parameterized by a base color (or albedo) $\bm{b} \in [0,1]^3$, roughness $r \in [0,1]$, and metallic $m \in [0,1]$. We predict this 5D vector of material properties for each spatial point with a neural network, i.e. an MLP. We encode the spatial point with a multiresolution hash encoding [17] prior to the MLP to improve computational efficiency.

We omit $\bm{x}$ in the following notation for simplicity. The BRDF can be decomposed into its diffuse and specular components, $f = f_d + f_s$. The diffuse term is $f_d = \frac{1-m}{\pi}\bm{b}$, and the specular term is defined as

f_s(\bm{\omega_i}, \bm{\omega_o}) = \frac{D(\bm{h}; r) \, F(\bm{\omega_i}, \bm{h}; \bm{b}, m) \, G(\bm{\omega_i}, \bm{\omega_o}, \bm{h}; r)}{(\bm{n} \cdot \bm{\omega_i})(\bm{n} \cdot \bm{\omega_o})}    (3)

where $\bm{h}$ is the half vector between $\bm{\omega_i}$ and $\bm{\omega_o}$, $D$ is the microfacet (GGX) distribution term, $F$ is the Fresnel term, and $G$ is the geometric shadowing term. We use the approximations of $D$, $F$, and $G$ from previous work [29], which we omit for the sake of brevity.
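As an illustration, the sketch below evaluates the diffuse and specular lobes in PyTorch. The $D$, $F$, and $G$ terms here use the common GGX / Schlick / Smith approximations, which may differ in detail from the approximations of [29] that we actually adopt; the specular denominator follows Eq. 3 as written.

```python
import math
import torch

def disney_brdf(b, r, m, n, w_i, w_o, eps=1e-6):
    """Simplified Disney BRDF f = f_d + f_s (Sec. 2.2).

    b: (N, 3) base color, r/m: (N, 1) roughness and metallic,
    n, w_i, w_o: (N, 3) unit normals, incoming and outgoing directions.
    """
    h = torch.nn.functional.normalize(w_i + w_o, dim=-1)           # half vector
    n_h = (n * h).sum(-1, keepdim=True).clamp(min=eps)
    n_i = (n * w_i).sum(-1, keepdim=True).clamp(min=eps)
    n_o = (n * w_o).sum(-1, keepdim=True).clamp(min=eps)
    h_i = (h * w_i).sum(-1, keepdim=True).clamp(min=eps)

    a2 = r**4                                                      # alpha^2 with alpha = r^2
    D = a2 / (math.pi * ((n_h**2) * (a2 - 1.0) + 1.0) ** 2)        # GGX microfacet distribution
    F0 = 0.04 * (1.0 - m) + b * m                                  # reflectance at normal incidence
    F = F0 + (1.0 - F0) * (1.0 - h_i) ** 5                         # Schlick Fresnel
    k = (r + 1.0) ** 2 / 8.0
    G = (n_i / (n_i * (1 - k) + k)) * (n_o / (n_o * (1 - k) + k))  # Smith geometry term

    f_d = (1.0 - m) / math.pi * b                                  # diffuse lobe
    f_s = D * F * G / (n_i * n_o)                                  # specular lobe, Eq. 3
    return f_d, f_s
```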

2.3 Domain-specific Simplifications

The rendering equation is typically computed by integrating over all incident light directions to account for global illumination. Although this costly integration is necessary in natural scenes, we assume that the majority of incident light comes from the spotlight source. Consequently, we omit the integral and compute pixel values only where $\bm{\omega_i} = \bm{\omega_o}$. As a result, we simplify Eqn. 1 and calculate the predicted color intensity as

\hat{C}(\bm{\omega_o}, \bm{x}) = 2\pi \, f(\bm{\omega_o}, \bm{x}) \, L_i(\bm{\omega_o}, \bm{x}) \, (\bm{\omega_o} \cdot \bm{n})    (4)

where the factor of $2\pi$ accounts for the uniform Fibonacci sphere sampling used in previous work [29]. We perform this calculation for each color channel to estimate RGB values. Finally, we constrain $f_s$ to produce white light, since view-dependent colors in endoscopy are typically specularities.
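Putting the two previous sketches together, per-pixel shading under Eq. 4 might look as follows; `brdf_net` and the light-parameter container `light` are hypothetical names, and the white-specular constraint is implemented here by averaging the specular lobe over the color channels.

```python
import math

def shade(brdf_net, light, x, n, w_o):
    """Evaluate the simplified rendering equation (Eq. 4) with w_i = w_o."""
    b, r, m = brdf_net(x)                                  # material properties at x (Sec. 2.2)
    f_d, f_s = disney_brdf(b, r, m, n, w_o, w_o)           # co-located light and camera
    f_s = f_s.mean(dim=-1, keepdim=True)                   # constrain the specular lobe to white
    L_i = spotlight_intensity(x, light.x_L, light.v_L, light.L0, light.n, light.q)
    cos = (w_o * n).sum(-1, keepdim=True).clamp(min=0.0)
    return 2.0 * math.pi * (f_d + f_s) * L_i * cos         # per-channel HDR color, (N, 3)
```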

2.4 Data Preparation and Loss Function

In each iteration, we calculate a pixel's 3D point by unprojecting its depth value with the camera intrinsics $\{f_x, f_y, c_x, c_y\}$ and extrinsics $\{R, t\}$. Specifically, given a pixel $(i, j)$ with depth value $z$, we unproject it as $x_c = (\frac{i - c_x}{f_x} z, \frac{j - c_y}{f_y} z, z)$ and transform it to world coordinates as $x_w = R x_c + t$. We calculate surface normals by taking the image gradient of the depth maps and rotating them into world space with the camera extrinsics. Each point's material properties are estimated with the MLP and its incident light intensity with Eqn. 2, and these are combined to evaluate the rendering equation in Eqn. 4. Finally, we apply a learned gamma correction to our rendered outputs to account for the low dynamic range images common in endoscopic scenes, i.e. $\hat{C}^{LDR} = (\hat{C}^{HDR})^{\gamma}$.
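A sketch of the per-pixel unprojection and the LDR mapping is given below, assuming $(R, t)$ maps camera to world coordinates; the helper names are illustrative.

```python
import torch

def unproject_depth(depth, fx, fy, cx, cy, R, t):
    """Lift a depth map to world-space 3D points (Sec. 2.4).

    depth: (H, W) tensor; (R, t) are assumed to be camera-to-world extrinsics.
    """
    H, W = depth.shape
    j, i = torch.meshgrid(torch.arange(H, dtype=depth.dtype),
                          torch.arange(W, dtype=depth.dtype), indexing="ij")
    x_c = torch.stack([(i - cx) / fx * depth,
                       (j - cy) / fy * depth,
                       depth], dim=-1)                       # (H, W, 3), camera frame
    x_w = x_c.reshape(-1, 3) @ R.T + t                       # x_w = R x_c + t
    return x_w.reshape(H, W, 3)

def to_ldr(C_hdr, gamma):
    """Learned gamma correction mapping rendered HDR colors to LDR."""
    return C_hdr.clamp(min=0.0) ** gamma
```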

Loss. We rely on the photometric L1 loss as the main objective to train EndoPBR. In addition, we impose material constraints, specifically that the metallic value $m(\bm{x})$ should be minimized at all points in endoscopic scenes and that the base color should not have abrupt spatial changes. Our loss is

\mathcal{L} = \mathcal{L}_1(\hat{C}, C) + \lambda_m |m(\bm{x})| + \lambda_b |\nabla_x \bm{b}(\bm{x})|
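A minimal sketch of this objective is shown below, assuming the albedo gradient is supplied as an input (whether it is computed by autograd or finite differences is not specified here).

```python
def endopbr_loss(C_pred, C_gt, metallic, albedo_grad, lam_m=1e-4, lam_b=1e-3):
    """Photometric L1 loss plus the metallic and albedo-smoothness regularizers.

    metallic:    (N, 1) predicted metallic values, pushed toward zero
    albedo_grad: (N, 3) spatial gradient of the predicted base color
    """
    photo = (C_pred - C_gt).abs().mean()
    return photo + lam_m * metallic.abs().mean() + lam_b * albedo_grad.abs().mean()
```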

3 Experiments

We develop and test our system on C3VD, a colonoscopy dataset with ground truth poses and depth maps. We use the provided camera intrinsics to undistort the images and resize them to $480 \times 640$. To ensure valid multiresolution hash encodings, we normalize all 3D points to $[0, 1]$.

Implementation Details. We sample 30k pixels with a batch size of 5 in each iteration. We train our model for 1500 epochs using PyTorch on an NVIDIA RTX 4090 GPU. The BRDF network has 2 hidden layers of width 64 following the multiresolution hash encoding layer [17]. We use the Adam optimizer with learning rate $10^{-4}$ and $\beta = (0.9, 0.999)$. We set $\lambda_m = 10^{-4}$ and $\lambda_b = 10^{-3}$ in our experiments. Finally, we apply an 8:1 train-test split to evaluate our method, following previous work [4]. Qualitative results of our method are shown in Fig. 1.
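For reference, a sketch of the BRDF network and optimizer setup with the hyperparameters above is given here; the `encoder` module stands in for the multiresolution hash encoding of [17] (e.g. a tinycudann encoding), and `enc_dim` is an assumed name for its output width.

```python
import torch
import torch.nn as nn

class BRDFNet(nn.Module):
    """Hash-encoded MLP predicting (albedo, roughness, metallic) per 3D point."""

    def __init__(self, encoder, enc_dim, hidden=64):
        super().__init__()
        self.encoder = encoder                                   # multiresolution hash encoding [17]
        self.mlp = nn.Sequential(nn.Linear(enc_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 5))           # 2 layers of width 64

    def forward(self, x):                                        # x normalized to [0, 1]^3
        out = torch.sigmoid(self.mlp(self.encoder(x)))           # keep all outputs in [0, 1]
        return out[..., :3], out[..., 3:4], out[..., 4:5]        # albedo, roughness, metallic

# model = BRDFNet(encoder, enc_dim)
# opt = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
```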

We design two core experiments to demonstrate the photorealism and robustness of our model. First, we evaluate EndoPBR's ability to generalize to novel views with image quality metrics: Peak Signal-to-Noise Ratio (PSNR), Learned Perceptual Image Patch Similarity (LPIPS), and Structural Similarity (SSIM). We compare our results with vanilla NeRF [16] as well as two recent works, REIM-NeRF [20] and Gaussian Pancakes [4], on the cecum_t4_b, desc_t4_a, and trans_t1_a sequences from C3VD. These works leverage state-of-the-art neural reconstruction algorithms to synthesize novel views.

Second, we analyze how realistic synthetic data can be used to develop 3D vision solutions for minimally invasive surgery with a simple experiment: fine-tuning a foundation depth estimation model with rendered images. We first train EndoPBR on the cecum_t1_b sequence and then apply transformations such as camera translation and rotation, varied material properties, and adjusted lighting conditions. Because these components are explicitly disentangled in our model, we can adjust them separately for meaningful data augmentation, unlike NeRF- or 3DGS-based approaches. With these transformations, we acquire a synthetic dataset of around 18k image-depth-pose samples (from 765 original images), which we visualize in Fig. 3. We do not include any training camera views to ensure valid testing.
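As one example of how such augmentations could be produced, a sketch of a random camera pose perturbation follows; the rotation and translation ranges (and their units) are illustrative assumptions, not the values used to build the dataset.

```python
import torch

def perturb_pose(R, t, max_rot_deg=5.0, max_trans=1.0):
    """Randomly perturb a camera-to-world pose (R, t) for novel-view augmentation."""
    axis = torch.nn.functional.normalize(torch.randn(3), dim=0)   # random rotation axis
    angle = torch.deg2rad(torch.rand(()) * max_rot_deg)           # random rotation magnitude
    zero = torch.zeros(())
    K = torch.stack([torch.stack([zero, -axis[2], axis[1]]),
                     torch.stack([axis[2], zero, -axis[0]]),
                     torch.stack([-axis[1], axis[0], zero])])     # skew-symmetric cross matrix
    dR = torch.eye(3) + torch.sin(angle) * K + (1 - torch.cos(angle)) * (K @ K)  # Rodrigues formula
    dt = (torch.rand(3) * 2 - 1) * max_trans                      # uniform translation jitter (scene units)
    return dR @ R, t + dt                                         # perturbed orientation and center
```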

We fine-tune a depth foundation model, the ViT-S variant of Depth Anything V2 (DAv2) [28], on the generated synthetic dataset for 100 epochs with the Adam optimizer (lr = 5e-4). The training set is composed entirely of rendered images, but we use original C3VD images to evaluate the depth predictions. We use conventional metrics: Absolute Relative Error, Root Mean Squared Error, and $\delta < 1.25$, the percentage of pixels whose predicted depth is within 25% of the ground truth value. We compare our results with baseline (zero-shot DAv2), baseline-scaled (median-scaled baseline), and fine-tuning with the original C3VD images. We note that original is the upper bound of depth estimation results.
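These evaluation metrics are standard; a sketch of their computation over valid pixels is shown below.

```python
import torch

def depth_metrics(pred, gt, thresh=1.25):
    """Abs. Rel., RMSE, and the delta < 1.25 inlier ratio over valid pixels (gt > 0)."""
    mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    abs_rel = ((pred - gt).abs() / gt).mean()
    rmse = torch.sqrt(((pred - gt) ** 2).mean())
    delta = (torch.maximum(pred / gt, gt / pred) < thresh).float().mean()
    return abs_rel, rmse, delta
```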

Figure 3: Examples of synthetic data produced by EndoPBR to fine-tune Depth Anything V2. These images are generated by altering the camera view, material properties, or incident light intensity.

4 Results and Discussion

Experiment 1. Novel View Synthesis. Image quality metrics on the test set are shown in Table 1. Our generated images achieve PSNR and SSIM on par with or better than REIM-NeRF and Gaussian Pancakes. However, our LPIPS score is notably worse than that of the other methods. We attribute this to the dark delineations present in some rendered images, which can be seen in Fig. 1. These black lines reflect the difficulty of resolving the complex interaction between material properties and lighting in the color calculation. However, our method is stable to train and, unlike [4], does not rely on the convergence of a separate method as input to our pipeline.

Table 1: We compute PSNR, SSIM, and LPIPS for generated images in the test sets of 3 sequences and compare them with vanilla NeRF [16], REIM-NeRF [20], and Gaussian Pancakes [4]. Numerical values are taken directly from [4].
Model PSNR (↑) SSIM (↑) LPIPS (↓)
NeRF [16] 18.93 0.67 0.43
REIM-NeRF [20] 31.66 0.78 0.22
Gaussian Pancakes [4] 32.31 0.90 0.20
EndoPBR (Ours) 30.39 0.86 0.30

Experiment 2. Depth Estimation with Synthetic Data. Depth estimation results are reported in Table 2. First, we observe that the zero-shot depth estimation capabilities of DAv2 are imperfect even when median-scaled to ground truth, which is consistent with previous work [7]. Fine-tuning to a specific domain will therefore be paramount for incorporating these foundation models into surgical 3D vision tasks. Furthermore, fine-tuning the model on synthetic data leads to results similar to fine-tuning with real images: all metrics are close between Original and our method. Because our synthetic data did not include any of the original training views or images, we argue that leveraging simulated images is a promising avenue for developing 3D vision algorithms in endoscopy. To the best of our knowledge, this work is the first to evaluate the utility of synthetic data in this manner.

Table 2: We compare Abs. Rel., RMSE, and $\delta < 1.25$ between baseline (zero-shot DAv2), baseline-scaled (zero-shot predictions median-scaled to ground truth, as used in previous work [7]), Original (fine-tuning with C3VD images), and our method (fine-tuning with EndoPBR's rendered images) on cecum_t1_b.
Method Abs. Rel. (↓) RMSE (↓) δ < 1.25 (↑)
Baseline [28] 1.04 ± 0.05 0.22 ± 0.00 0.07 ± 0.07
Baseline Scaled [28] 0.31 ± 0.02 0.08 ± 0.01 0.45 ± 0.06
Original Fine-Tune 0.07 ± 0.01 0.02 ± 0.00 1.00 ± 0.00
Ours Fine-Tune 0.08 ± 0.03 0.02 ± 0.01 0.95 ± 0.09

Conclusion. To summarize, we propose EndoPBR, an endoscopic simulation platform that leverages differentiable rendering to estimate material properties and lighting conditions from posed images and known geometry. We demonstrate that our model's novel view synthesis capabilities are competitive with existing NeRF- and 3DGS-based methods while flexibly generalizing to novel views, enabling the synthesis of a photorealistic dataset entirely outside the original training set. Furthermore, we show that realistic synthetic data generated from physics-informed learning is a promising avenue toward robust 3D vision solutions in minimally invasive surgery. We hope that this work inspires further exploration of realistic surgical simulation.


Acknowledgements

This study was funded by NIH T32 Training Grant (grant number T32EB021937).

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.

References

  • [1] Armin, M.A., Barnes, N., Alvarez, J., Li, H., Grimpen, F., Salvado, O.: Learning camera pose from optical colonoscopy frames through deep convolutional neural network (cnn). In: Computer Assisted and Robotic Endoscopy and Clinical Image-Based Procedures: 4th International Workshop, CARE 2017, and 6th International Workshop, CLIP 2017, Held in Conjunction with MICCAI 2017, Québec City, QC, Canada, September 14, 2017, Proceedings 4. pp. 50–59. Springer (2017)
  • [2] Batlle, V.M., Montiel, J.M., Fua, P., Tardós, J.D.: Lightneus: Neural surface reconstruction in endoscopy using illumination decline. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 502–512. Springer (2023)
  • [3] Bobrow, T.L., Golhar, M., Vijayan, R., Akshintala, V.S., Garcia, J.R., Durr, N.J.: Colonoscopy 3d video dataset with paired depth from 2d-3d registration. Medical image analysis 90, 102956 (2023)
  • [4] Bonilla, S., Zhang, S., Psychogyios, D., Stoyanov, D., Vasconcelos, F., Bano, S.: Gaussian pancakes: geometrically-regularized 3d gaussian splatting for realistic endoscopic reconstruction. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 274–283. Springer (2024)
  • [5] Burley, B., Studios, W.D.A.: Physically-based shading at disney. In: ACM SIGGRAPH. vol. 2012, pp. 1–7 (2012)
  • [6] Ding, H., Seenivasan, L., Killeen, B.D., Cho, S.M., Unberath, M.: Digital twins as a unifying framework for surgical data science: the enabling role of geometric scene understanding. Artificial Intelligence Surgery 4(3), 109–138 (2024)
  • [7] Han, J.J., Acar, A., Henry, C., Wu, J.Y.: Depth anything in medical images: A comparative study. arXiv preprint arXiv:2401.16600 (2024)
  • [8] Han, J.J., Acar, A., Kavoussi, N., Wu, J.Y.: Meshbrush: Painting the anatomical mesh with neural stylization for endoscopy. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 380–390. Springer (2024)
  • [9] Jeong, B.H., Kim, H.K., Son, Y.D.: Depth estimation from monocular endoscopy using simulation and image transfer approach. Computers in Biology and Medicine 181, 109038 (2024)
  • [10] Kajiya, J.T.: The rendering equation. In: Proceedings of the 13th annual conference on Computer graphics and interactive techniques. pp. 143–150 (1986)
  • [11] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 139–1 (2023)
  • [12] Khan, M., Siepel, F., Ruers, T.: Integration of visual slam in robot-assisted minimally invasive surgery: Advances, challenges. In: European Robotics Forum 2024: 15th ERF, Volume 2. p. 399. Springer Nature
  • [13] Liu, X., Li, Z., Ishii, M., Hager, G.D., Taylor, R.H., Unberath, M.: Sage: slam with appearance and geometry prior for endoscopy. In: 2022 International conference on robotics and automation (ICRA). pp. 5587–5593. IEEE (2022)
  • [14] Liu, X., Sinha, A., Ishii, M., Hager, G.D., Reiter, A., Taylor, R.H., Unberath, M.: Dense depth estimation in monocular endoscopy with self-supervised learning methods. IEEE transactions on medical imaging 39(5), 1438–1447 (2019)
  • [15] Mahmood, F., Chen, R., Sudarsky, S., Yu, D., Durr, N.J.: Deep learning with cinematic rendering: fine-tuning deep neural networks using photorealistic medical images. Physics in Medicine & Biology 63(18), 185012 (2018)
  • [16] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99–106 (2021)
  • [17] Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM transactions on graphics (TOG) 41(4), 1–15 (2022)
  • [18] Nazifi, N., Araujo, H., Erabati, G.K., Tahri, O.: Self-supervised monocular pose and depth estimation for wireless capsule endoscopy using transformers. In: Medical Imaging 2024: Image-Guided Procedures, Robotic Interventions, and Modeling. vol. 12928, pp. 252–262. SPIE (2024)
  • [19] Ozyoruk, K.B., Gokceler, G.I., Bobrow, T.L., Coskun, G., Incetan, K., Almalioglu, Y., Mahmood, F., Curto, E., Perdigoto, L., Oliveira, M., et al.: Endoslam dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos. Medical image analysis 71, 102058 (2021)
  • [20] Psychogyios, D., Vasconcelos, F., Stoyanov, D.: Realistic endoscopic illumination modeling for nerf-based data generation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 535–544. Springer (2023)
  • [21] Rabbani, N., Bartoli, A.: Can surgical computer vision benefit from large-scale visual foundation models? International Journal of Computer Assisted Radiology and Surgery 19(6), 1157–1163 (2024)
  • [22] Rau, A., Bhattarai, B., Agapito, L., Stoyanov, D.: Bimodal camera pose prediction for endoscopy. IEEE Transactions on Medical Robotics and Bionics (2023)
  • [23] Rivoir, D., Pfeiffer, M., Docea, R., Kolbinger, F., Riediger, C., Weitz, J., Speidel, S.: Long-term temporally consistent unpaired video translation from simulated surgical 3d data. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3343–3353 (2021)
  • [24] Rivoir, D., Wagner, M., Bodenstedt, S., März, K., Kolbinger, F., Maier-Hein, L., Seidlitz, S., Brandenburg, J., Müller-Stich, B.P., Distler, M., et al.: Importance of the data in the surgical environment. Artificial Intelligence and the Perspective of Autonomous Surgery pp. 29–43 (2024)
  • [25] Sahu, M., Mukhopadhyay, A., Zachow, S.: Simulation-to-real domain adaptation with teacher–student learning for endoscopic instrument segmentation. International journal of computer assisted radiology and surgery 16(5), 849–859 (2021)
  • [26] Shao, S., Pei, Z., Chen, W., Zhu, W., Wu, X., Sun, D., Zhang, B.: Self-supervised monocular depth and ego-motion estimation in endoscopy: Appearance flow to the rescue. Medical image analysis 77, 102338 (2022)
  • [27] Venkatesh, D.K., Rivoir, D., Pfeiffer, M., Kolbinger, F., Speidel, S.: Synthesizing multi-class surgical datasets with anatomy-aware diffusion models. arXiv preprint arXiv:2410.07753 (2024)
  • [28] Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. Advances in Neural Information Processing Systems 37, 21875–21911 (2025)
  • [29] Yao, Y., Zhang, J., Liu, J., Qu, Y., Fang, T., McKinnon, D., Tsin, Y., Quan, L.: Neilf: Neural incident light field for physically-based material estimation. In: European conference on computer vision. pp. 700–716. Springer (2022)