Email: [email protected]
Department of Computer Science, Johns Hopkins University
SH Ho Urology Centre, Department of Surgery, The Chinese University of Hong Kong
T Stone Robotics Institute, The Chinese University of Hong Kong
* Authors contributed equally to this work.
E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception
Abstract
Reconstructing the scene of robotic surgery from stereo endoscopic video is an important and promising topic in surgical data science, with the potential to support applications such as surgical visual perception, robotic surgery education, and intra-operative context awareness. However, current methods are mostly restricted to reconstructing static anatomy, assuming no tissue deformation, no tool occlusion or de-occlusion, and no camera movement; these assumptions are rarely satisfied in minimally invasive robotic surgeries. In this work, we present an efficient reconstruction pipeline for highly dynamic surgical scenes that runs at 28 fps. Specifically, we design a transformer-based stereoscopic depth perception module for efficient depth estimation and a light-weight tool segmentor to handle tool occlusion. We then propose a dynamic reconstruction algorithm that estimates tissue deformation and camera movement and aggregates information over time to reconstruct the surgical scene. We evaluate the proposed pipeline on two datasets, the public Hamlyn Centre Endoscopic Video Dataset and our in-house DaVinci robotic surgery dataset. The results demonstrate that our method can recover the scene obstructed by the surgical tool and handle camera movement in realistic surgical scenarios effectively at real-time speed.
Keywords:
Dynamic Surgical Scene Reconstruction · Transformer-based Depth Estimation · Stereo Image Perception
1 Introduction
Reconstructing the surgical scene from stereo endoscopic video in robotic-assisted minimally invasive surgery (MIS) is an important topic because it is central to many downstream tasks. For example, during surgical training it is desirable to expose trainees to the complete soft-tissue surface even when surgical tools partially block the view [22, 23], in order to provide enriched context for understanding the surgical manipulation. As illustrated in Fig. 1(a), given the reconstruction from recorded surgical videos and the current video frame with an instrument blocking the view, a transparent overlay can be generated for AI-augmented demonstration, which opens new possibilities for robotic surgery education. A similar method may potentially be used to provide additional useful context intra-operatively.
However, reconstruction of the surgical scene in laparoscopy is challenging for three reasons. First, the soft-tissue is constantly deforming, which complicates soft-tissue localization. Second, the surgical instruments cause heavy occlusion and move dynamically; identifying occlusion and properly handling de-occlusion require spatial and temporal coherence and consistency. Lastly, changes of the camera viewpoint compound the aforementioned difficulties into an ego-motion estimation task with dynamic objects. Although prior works on 3D reconstruction of surgical scenes exist, they are generally limited by assuming a static scene [12] or the absence of surgical tools [20].
Closest to our work are [10, 13], which jointly handle tissue deformation and surgical tool occlusion but rely on kinematics information. We improve upon this prior work with an image-only reconstruction pipeline, shown in Fig. 1(b), that improves both reconstruction quality and run-time speed. We implement an efficient transformer-based depth perception module and a light-weight tool segmentor to reconstruct the surgical scene with only stereo endoscopic image frames as input. The two modules run in parallel to output a masked depth estimation without surgical instruments. The masked depth map is then used to produce a temporally and spatially coherent reconstruction of the soft-tissue. We also demonstrate the effectiveness of our pipeline under camera motion and smoke, which is missing from [10, 13].

Our main contributions are summarized as follows: We propose a novel online reconstruction pipeline called E-DSSR which reconstructs the surgical scene with only stereo endoscopic videos as input and handles tissue deformation, tool occlusion, and camera movement simultaneously. We qualitatively and quantitatively evaluate our proposed pipeline on both the public Hamlyn Centre Endoscopic Video Dataset (Hamlyn Dataset) [27] and our in-house DaVinci robotic surgery data to validate the effectiveness of our method.
2 Related Work
Recent works [7, 21] have explored SLAM systems that handle the non-rigidity of tissues. However, they assume a simplified environment in which surgical tools are not present. Later, Li et al. [10] proposed a framework that simultaneously reconstructs the soft-tissue and tracks surgical instruments using kinematics. However, instrument localization from the long kinematics chain is prone to noise and error, as demonstrated in [2]. Lu et al. [13] further improved this framework with deep learning methods for depth estimation and a hybrid key-point-plus-kinematics approach to localize the surgical instrument. However, the result is only demonstrated in an ex vivo environment with a fixed camera pose, without camera motion or realistic artifacts such as blood. Furthermore, the surgical instrument is removed by computing the pose of the surgical tool and then rendering a binary mask from a 3D model, which is slower, as we will show in Section 4.3. In this work, we present an efficient reconstruction pipeline that uses only a stream of stereo images as input and is capable of handling camera motion and surgical effects such as smoke in addition to the previous challenges.
3 Method
Fig. 1(b) shows the overview of our proposed dynamic surgical scene reconstruction pipeline E-DSSR. We denote the left (L) and right (R) RGB images at time stamp $t$ as $I_t^L$ and $I_t^R$. A transformer-based stereo depth estimation module estimates the depth image $D_t$ from $I_t^L$ and $I_t^R$, while a tool segmentation network concurrently predicts the tool mask $M_t$. Note that both networks are designed to be light-weight to enable real-time performance. The masked depth images and left frames from time $0$ to $t$ are then input to a dynamic reconstruction algorithm which dynamically recovers the 3D information of the surgical scene.
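To make the data flow concrete, the following is a minimal per-frame sketch of this pipeline. The module interfaces (`depth_net`, `seg_net`, `recon`) are hypothetical placeholders for illustration only, not the actual implementation.

```python
# Minimal per-frame sketch of the E-DSSR pipeline described above.
# All module names (depth_net, seg_net, recon) are illustrative placeholders.
import torch

def process_frame(left, right, depth_net, seg_net, recon):
    """left, right: (1, 3, H, W) rectified stereo tensors at time t."""
    with torch.no_grad():
        # The two light-weight networks can run concurrently in practice;
        # here they are called sequentially for clarity.
        depth = depth_net(left, right)      # (1, H, W) depth D_t
        tool_mask = seg_net(left) > 0.5     # (1, H, W) binary tool mask M_t
    # Remove surgical-tool pixels so only soft-tissue depth is fused.
    masked_depth = depth.masked_fill(tool_mask, 0.0)
    # Dynamic reconstruction: estimate camera pose and tissue deformation,
    # then fuse the observation into the canonical surfel model.
    recon.integrate(left, masked_depth)
    return recon.canonical_surfels()
```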
3.1 Light-weight Stereo Depth Estimation and Tool Segmentation
3.1.1 Transformer-based depth estimation.
To reconstruct the soft-tissue with high quality, it is important to only include depth estimates with high confidence. In occluded regions where pixels are not observed by both $I_t^L$ and $I_t^R$, depth cannot be accurately estimated. Therefore, we opt for the recently proposed Stereo Transformer (STTR) [11] as the depth estimation module since it explicitly identifies these regions. STTR densely compares pixels in $I_t^L$ and $I_t^R$ along epipolar lines to find the best matches for depth estimation. STTR uses the attention mechanism [24], which computes the attention (feature similarities) between source features $x_s$ and target features $x_t$ and outputs the updated features as:
$$\mathrm{Attention}(x_s, x_t) = \mathrm{softmax}\!\left(\frac{(W_Q x_s)(W_K x_t)^{\top}}{\sqrt{d}}\right) W_V x_t, \qquad (1)$$
where $W_Q$, $W_K$, $W_V$ are the learnt projection weights of embedding dimension $d$ that transform the inputs into an embedding space. After the final attention, the similarities between the pixels from $I_t^L$ and $I_t^R$ are used as the likelihood for pixel matching. We propose a new variant of STTR that enables faster computation without harming performance. We note that a large fraction of STTR's FLOPs lies in the attention module. Since STTR compares the $W$ pixels along each epipolar line (of potential matches) for $H$ lines, its numbers of parameters and FLOPs are on the order of
$$\#\mathrm{Params} \propto N d^{2}, \qquad \mathrm{FLOPs} \propto N H W^{2} d, \qquad (2)$$
where $H$ and $W$ are the height and width of the input image, and $N$ is the number of attentions computed. To avoid significant deterioration of performance, we keep the embedding dimension $d$ constant, preserving the per-attention parameter count, while quartering the number of attentions $N$. Following prior work, we train STTR on the large-scale synthetic Scene Flow dataset [14], which is a commonly used pre-training dataset for stereo networks. We show in Section 4.3 that the transformer-based depth module quantitatively improves the reconstruction quality in surgical scenes and that our light-weight STTR is faster with little performance sacrifice. To our knowledge, this is also the first time a transformer-based module has been applied to surgical scene depth estimation.
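A minimal PyTorch sketch of the scaled dot-product attention in Eq. (1), applied along epipolar lines of a rectified stereo pair, is given below. The class name, tensor shapes, and single-attention structure are illustrative simplifications, not STTR's actual implementation, which alternates self- and cross-attention with additional machinery (see [11]).

```python
# Sketch of the attention in Eq. (1) along epipolar lines (illustrative only).
import torch
import torch.nn as nn

class EpipolarAttention(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.d = d
        self.w_q = nn.Linear(d, d, bias=False)  # learnt projection W_Q
        self.w_k = nn.Linear(d, d, bias=False)  # learnt projection W_K
        self.w_v = nn.Linear(d, d, bias=False)  # learnt projection W_V

    def forward(self, src, tgt):
        # src, tgt: (H, W, d) features of the two images, compared row by row,
        # i.e. along epipolar lines of the rectified stereo pair.
        q, k, v = self.w_q(src), self.w_k(tgt), self.w_v(tgt)
        sim = torch.matmul(q, k.transpose(-1, -2)) / self.d ** 0.5  # (H, W, W)
        attn = sim.softmax(dim=-1)    # pixel-matching likelihoods
        return torch.matmul(attn, v)  # updated features
```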
3.1.2 Efficient surgical tool segmentation.
Given the depth estimate of the scene, it is desirable to isolate the depth of soft-tissue from that of surgical tools, since surgical tools are considered outliers in soft-tissue reconstruction. For this purpose, we predict a binary mask indicating which pixels belong to the tools. U-Net [17] has proven to be a light-weight yet accurate model for segmentation tasks [1, 6]. In our pipeline, we design a light-weight U-Net with a VGG11 [19] backbone and 5 down-sampling scales to maintain a run-time of around 12 ms per frame (input resolution of 640×512). We train the model on the public Robotic Instrument Segmentation dataset from the 2017 MICCAI EndoVis Challenge [1] and directly use it to predict binary instrument masks on our robotic surgery datasets. To mitigate the performance variation of the trained U-Net caused by the slight domain gap between the training data and the data we evaluate on, we perform morphological operations on the segmentation mask to refine the tool boundaries.
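The sketch below illustrates the morphological mask refinement and the subsequent depth masking; the kernel size and the particular choice of opening followed by dilation are assumptions for illustration, not the exact settings used in the pipeline.

```python
# Sketch of morphological mask refinement and depth masking (illustrative).
import cv2
import numpy as np

def refine_tool_mask(mask: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """mask: (H, W) uint8 binary mask, 1 = surgical tool."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size,) * 2)
    # Opening removes small spurious detections; dilation slightly grows the
    # mask so tool boundary pixels do not leak into the tissue reconstruction.
    opened = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    return cv2.dilate(opened, kernel, iterations=1)

def mask_depth(depth: np.ndarray, tool_mask: np.ndarray) -> np.ndarray:
    """Zero out depth at tool pixels so only soft-tissue is reconstructed."""
    return np.where(tool_mask > 0, 0.0, depth)
```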
3.2 Dynamic Reconstruction
3.2.1 Surfel representation.
Different from most existing methods that adopt a volumetric model [15], we rely on the memory-efficient surfel representation [3], which suits the varying environment of surgical scenes. Surfels are tuples of variables including 3D position $\mathbf{p}$, surface normal $\mathbf{n}$, confidence $c$, and timestamp of last observation $\tau$, stored as an unordered list $\mathcal{S}$. In our pipeline, the position is computed from the estimated masked depth $D_t$ using the inverse camera intrinsic matrix $K^{-1}$. The confidence is computed from the radial distance to the camera center, with the intuition that more oblique points exhibit larger uncertainty. The canonical surfel model $\mathcal{S}^{c}$ is stored in the coordinate frame defined by the first observed frame and is continuously updated by incorporating the surfels of newly observed frames.
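A minimal sketch of the surfel tuple and of back-projecting the masked depth map with $K^{-1}$ follows; the field names and the omission of the confidence heuristic are illustrative choices.

```python
# Sketch of the surfel representation and depth back-projection (illustrative).
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    position: np.ndarray   # 3D position p
    normal: np.ndarray     # surface normal n
    confidence: float      # confidence c
    last_seen: int         # timestamp of last observation

def backproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project a (H, W) depth map to camera-frame 3D points using K^-1."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    rays = np.linalg.inv(K) @ pix                                      # unit-depth rays
    return (rays * depth.reshape(1, -1)).T                             # (H*W, 3)
```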
3.2.2 Camera pose.
The stereoscope moves during surgery, but based on consultation with our clinical collaborators the movement is neither frequent nor fast, so as to avoid harming tissue. As a result, we can assume that the motion between adjacent frames is relatively small. The canonical model $\mathcal{S}^{c}$ is transformed to the current view using the most recent camera pose $T_{t-1}$ and projected onto the camera plane. A correspondence is established between a projected model surfel and an observed point if their normals and depth values are closer than a threshold. Given all correspondences $\mathcal{C}$, the camera pose $T_t$ is solved by minimizing the energy function defined as follows:
$$E_{\mathrm{icp}}(T_t) = \sum_{(i,j)\in\mathcal{C}} \left( \mathbf{n}_i^{\top}\!\left(T_t\,\mathbf{p}_j - \mathbf{p}_i\right) \right)^{2}, \qquad (3)$$
where $\mathbf{p}_i$ and $\mathbf{n}_i$ are the position and normal of the model surfel and $\mathbf{p}_j$ is the corresponding observed point.
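For illustration, the sketch below evaluates a point-to-plane energy of the form in Eq. (3) for a candidate pose; correspondence search and the actual optimizer are omitted, and this generic formulation is not necessarily the exact solver used in the pipeline.

```python
# Sketch of evaluating a point-to-plane energy of the form in Eq. (3) for a
# candidate camera pose T (4x4). Correspondences are assumed precomputed.
import numpy as np

def point_to_plane_energy(T, model_pts, model_normals, obs_pts):
    """model_pts, model_normals, obs_pts: (N, 3) arrays of corresponding
    canonical-model surfels and observed points."""
    R, t = T[:3, :3], T[:3, 3]
    transformed = obs_pts @ R.T + t                # move observation into the model frame
    residuals = np.sum(model_normals * (transformed - model_pts), axis=1)
    return np.sum(residuals ** 2)
```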
3.2.3 Tissue deformation field.
As the surgical scene and tissue are non-rigid and dynamic, we model the deformation efficiently with a sparse node graph [8], in which each node $j$ is represented by its position $\mathbf{g}_j$, similar to a surfel, plus a local 6-DoF deformation $T_j$. Given this set of sparse nodes, any point can interpolate its deformation as a weighted sum of node deformations based on distance and node radius. For a given query position $\mathbf{q}$ and node $j$, the weight can be computed as $w_j(\mathbf{q}) = \exp\!\left(-\lVert \mathbf{q} - \mathbf{g}_j \rVert^{2} / (2 r_j^{2})\right)$, where $r_j$ is the node radius. After applying the updated rigid camera motion from the previous section, the deformations in the node graph are solved from the following energy with as-rigid-as-possible regularization:
$$E(\{T_j\}) = \sum_{(i,j)\in\mathcal{C}} \left( \mathbf{n}_i^{\top}\!\left(\tilde{\mathbf{p}}_j - \mathbf{p}_i\right) \right)^{2} + \lambda \sum_{j} \sum_{k \in \mathcal{N}(j)} \left\lVert T_j\,\mathbf{g}_k - T_k\,\mathbf{g}_k \right\rVert^{2}, \qquad (4)$$
where $\mathcal{N}(j)$ denotes the neighboring nodes of node $j$ and $\tilde{\mathbf{p}}_j$ is the observed point warped by the interpolated node deformations; the regularization encodes the intuition that the relative motion within a neighborhood should be as small as possible.
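The sketch below shows how a single point can be warped with the node-graph weights described above. The Gaussian-style weight and its normalization are assumptions consistent with common embedded-deformation formulations, not necessarily the exact choices used here.

```python
# Sketch of interpolating a dense deformation from the sparse node graph.
import numpy as np

def warp_point(q, node_pos, node_T, node_radius):
    """q: (3,) query point; node_pos: (K, 3); node_T: (K, 4, 4); node_radius: (K,)."""
    d2 = np.sum((node_pos - q) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * node_radius ** 2))  # per-node weights
    w = w / (w.sum() + 1e-8)                    # normalize
    q_h = np.append(q, 1.0)                     # homogeneous coordinates
    # Weighted blend of each node's rigid transform applied to q.
    warped = sum(wi * (Ti @ q_h)[:3] for wi, Ti in zip(w, node_T))
    return warped
```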
3.2.4 Model Fusion.
Lastly, to recover and reconstruct the whole surgical scene, each observation is integrated into the canonical model $\mathcal{S}^{c}$. Surfels with correspondences are fused into one: their confidence values are accumulated and the timestamp is updated, while the normal and position are updated as the confidence-weighted sum of the observed and model points. A fused surfel is added to the canonical model only if the sum of confidence in its local neighborhood exceeds a pre-defined threshold and the local motion is consistent. Meanwhile, a surfel in the canonical model is removed if it has not been observed for a long time.
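A minimal sketch of the confidence-weighted fusion of a corresponding pair of surfels is shown below; it reuses the illustrative Surfel fields from the earlier sketch and omits the neighborhood-confidence and motion-consistency checks.

```python
# Sketch of confidence-weighted surfel fusion (illustrative).
import numpy as np

def fuse(model, obs, timestamp):
    """model, obs: corresponding Surfel instances; returns the fused surfel."""
    c = model.confidence + obs.confidence
    model.position = (model.confidence * model.position +
                      obs.confidence * obs.position) / c
    n = model.confidence * model.normal + obs.confidence * obs.normal
    model.normal = n / (np.linalg.norm(n) + 1e-8)
    model.confidence = c            # confidences accumulate
    model.last_seen = timestamp     # refresh observation time
    return model
```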
4 Experiments

4.1 Experimental Setting
We evaluate the effectiveness of our proposed dynamic reconstruction framework on two datasets: (1) the public Hamlyn Dataset [27], and (2) our in-house DaVinci robotic surgery dataset of prostatectomy procedures. The Hamlyn Dataset consists of rectified stereo images collected during partial nephrectomy and comes without camera calibration information. Our in-house dataset contains 6 cases of high-resolution stereo videos, each recording a whole robotic prostatectomy procedure. We employ the method of Zhang et al. [28] to calibrate our surgical stereoscope.
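For illustration, Zhang's planar-target method [28] is available for a single camera through OpenCV's `calibrateCamera`; the checkerboard pattern size and square size below are hypothetical, as the calibration target used for the stereoscope is not specified here.

```python
# Illustrative single-camera calibration with Zhang's method [28] via OpenCV.
import cv2
import numpy as np

def calibrate(images, board=(9, 6), square=5.0):
    """images: list of grayscale checkerboard views; square size in millimetres."""
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square
    obj_pts, img_pts = [], []
    for img in images:
        found, corners = cv2.findChessboardCorners(img, board)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
    # Returns the RMS reprojection error, intrinsics K, and distortion coefficients.
    rms, K, dist, _, _ = cv2.calibrateCamera(
        obj_pts, img_pts, images[0].shape[::-1], None, None)
    return K, dist, rms
```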
We collected 5 video clips (1200 pairs of rectified stereo frames) from our in-house dataset and 2 video clips (600 pairs) from the Hamlyn Dataset. Each clip lasts around 10 seconds and includes scenarios of surgical tool occlusion, camera movement, and tissue deformation. All clips are used for testing, as no component of our method requires them for training. In the experiments, we downsample the video frames of our in-house dataset from the original resolution to 640×512. We implement the tool segmentation network and the transformer-based depth estimation with PyTorch [16]. The dynamic reconstruction is implemented in C++ with CUDA [18] to accelerate the running speed. Our experiments are conducted on a PC with one Nvidia TITAN RTX GPU and an Intel Xeon(R) W-2123 CPU (3.60 GHz × 8).
4.2 Qualitative Result
To demonstrate the effectiveness of the reconstruction pipeline, we show two examples from our in-house dataset in Fig. 2. In these sequences the camera view is fixed while the surgical tools move and the tissue deforms. Fig. 2 shows that our pipeline can recover the occluded surgical scene and continuously refine and complete the canonical model as subsequent frames expose the previously obstructed tissue.
We show another example from the Hamlyn Dataset in Fig. 3, where the recorded video is more challenging, featuring camera movement, surgical tool movement, tissue deformation, and smoke. As shown in Fig. 3, even with camera movement, the dynamic reconstruction algorithm can still track the soft-tissue and complete the surgical scene in the canonical model $\mathcal{S}^{c}$. Furthermore, the reconstruction pipeline is also robust against smoke.

4.3 Quantitative Evaluation and Analysis
Since acquiring a 3D model of the tissue or ground-truth depth with additional sensors in an in vivo environment is currently impractical due to clinical regulations, we use image-based metrics to evaluate the dynamic reconstruction results. The Structural Similarity Index Measure (SSIM) [25] is widely used to compute the consistency between two images, with a higher value indicating higher similarity. The Peak Signal-to-Noise Ratio (PSNR) [5] estimates the distortion between a target image and a synthesized (or noisy) image, with a higher value indicating smaller distortion. Both metrics have been used to assess the similarity between a predicted (re-projected) image and the original image in the absence of ground-truth depth [4, 9]. Following these prior works, we adopt them to evaluate our dynamic reconstruction results.
Table 1. Quantitative comparison of reconstruction quality (SSIM, PSNR) and run-time speed on the in-house dataset and the Hamlyn Dataset.

| Method | In-house SSIM (%) | In-house PSNR (dB) | Hamlyn SSIM (%) | Hamlyn PSNR (dB) | Speed |
| --- | --- | --- | --- | --- | --- |
| HSM [26] + DR w/o mask | 58.48 ± 6.82 | 11.14 ± 1.83 | 37.98 ± 7.32 | 10.27 ± 1.70 | 18 Hz |
| HSM [26] + DR w/ mask | 59.10 ± 6.71 | 11.76 ± 1.67 | 38.39 ± 7.10 | 10.55 ± 1.86 | 15 Hz |
| E-DSSR w/o mask | 64.17 ± 4.67 | 13.59 ± 1.72 | 40.83 ± 8.45 | 12.01 ± 2.05 | 36 Hz |
| DSSR w/o mask | 65.09 ± 5.64 | 13.00 ± 1.61 | 41.83 ± 7.20 | 13.04 ± 2.07 | 18 Hz |
| E-DSSR* (efficient) | 66.65 ± 4.59 | 13.68 | 41.97 ± 7.32 | 12.85 ± 2.03 | 28 Hz |
| DSSR* (high-quality) | – | 13.64 ± 1.81 | – | – | 15 Hz |
Specifically, we compare the similarity between the observed left frame $I_t^L$ and the frame $\hat{I}_t$ re-projected from the reconstructed canonical surfel model. Since our goal is to reconstruct soft-tissue, the surgical tool mask is applied to both frames to obtain the masked frames $\hat{I}_t^{m}$ and $I_t^{m}$ containing only tissue information. We compute the average SSIM and PSNR across all frames of all video clips as the final evaluation metrics:
$$\overline{\mathrm{SSIM}} = \frac{1}{N} \sum_{t=1}^{N} \mathrm{SSIM}\!\left(\hat{I}_t^{m}, I_t^{m}\right), \qquad \overline{\mathrm{PSNR}} = \frac{1}{N} \sum_{t=1}^{N} \mathrm{PSNR}\!\left(\hat{I}_t^{m}, I_t^{m}\right), \qquad (5)$$
where $N$ is the total number of evaluated frames.
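A minimal sketch of computing the masked SSIM and PSNR of Eq. (5) with scikit-image is shown below; the variable names and the zero-masking convention are illustrative assumptions.

```python
# Sketch of the masked SSIM / PSNR evaluation in Eq. (5) (illustrative).
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate(observed_frames, reprojected_frames, tool_masks):
    """All inputs: lists of (H, W, 3) uint8 images / (H, W) binary tool masks."""
    ssims, psnrs = [], []
    for obs, pred, mask in zip(observed_frames, reprojected_frames, tool_masks):
        keep = (mask == 0)[..., None]          # keep only tissue pixels
        obs_m, pred_m = obs * keep, pred * keep
        ssims.append(structural_similarity(obs_m, pred_m, channel_axis=-1))
        psnrs.append(peak_signal_noise_ratio(obs_m, pred_m))
    return float(np.mean(ssims)), float(np.mean(psnrs))
```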
To demonstrate the effectiveness of our method and assess the contribution of each component, we conduct a set of experiments with the proposed E-DSSR pipeline. We compare E-DSSR with a similar pipeline, DSSR, in which our light-weight STTR is replaced by the original STTR. We further compare against a reconstruction pipeline using a different depth estimation network, HSM [26], a state-of-the-art fully convolutional stereo depth estimation method. The full list of experiments is shown in Table 1, including: (1) HSM + Dynamic Reconstruction (DR) without tool mask, (2) HSM + DR with tool mask, (3) DSSR without tool mask, (4) E-DSSR without tool mask, and our proposed methods (5) DSSR and (6) E-DSSR.
We evaluate the advantage of the transformer-based depth module by comparing E-DSSR with HSM + DR w/ mask: our method outperforms it by 7.55% SSIM and 1.92 PSNR on the in-house dataset, and by 3.58% SSIM and 2.30 PSNR on the Hamlyn Dataset.
We also evaluate the contribution of the tool segmentation module. Comparing E-DSSR with E-DSSR w/o mask, we find that without explicit tool identification the SSIM drops from 66.65% to 64.17% and the PSNR drops from 13.68 to 13.59 on the in-house dataset; on the Hamlyn Dataset, the SSIM drops from 41.97% to 40.83% and the PSNR drops from 12.85 to 12.01. The same observation holds when comparing DSSR with DSSR w/o mask, and HSM + DR w/ mask with HSM + DR w/o mask. All results demonstrate the benefit of the proposed tool segmentation module.
Finally, while DSSR achieves the best SSIM on both datasets and the best PSNR on the Hamlyn Dataset, E-DSSR is nearly two times faster than DSSR (28 Hz vs. 15 Hz) and achieves the best PSNR on the in-house dataset, with little performance compromise (0.16% SSIM on the in-house dataset; 0.44% SSIM and 0.24 PSNR on the Hamlyn Dataset).
We also compare our approach with the previously proposed work of [10]. Based on their reported results, our light-weight pipeline is 14× faster (28 Hz vs. 2 Hz) at the same image resolution. Comparing our image-based tool segmentation module with their kinematics-driven 3D model rendering approach, our module is more than 5 times faster (83 Hz vs. 15 Hz).
5 Conclusion
In conclusion, we propose an efficient, image-only reconstruction pipeline for surgical scenes. Our light-weight modules enable real-time reconstruction in the presence of camera motion, surgical tools, and tissue deformation. We evaluate our pipeline qualitatively and quantitatively to demonstrate its effectiveness. We did not evaluate on longer durations because the camera view may change completely within a long video clip, which can instead be treated as several short clips to be reconstructed separately. Future work includes acquiring larger amounts of data with greater variation to expand the evaluation of our approach, and investigating its effectiveness in downstream applications such as AI-augmented robotic surgery education.
5.0.1 Acknowledgement.
This project was supported by CUHK Shun Hing Institute of Advanced Engineering (project MMT-p5-20), CUHK T Stone Robotics Institute, Hong Kong RGC TRS Project No.T42-409/18-R, and Multi-Scale Medical Robotics Center InnoHK under grant 8312051.
References
- [1] Allan, M., Shvets, A., Kurmann, T., Zhang, Z., Duggal, R., Su, Y.H., Rieke, N., Laina, I., Kalavakonda, N., Bodenstedt, S., et al.: 2017 robotic instrument segmentation challenge. arXiv preprint arXiv:1902.06426 (2019)
- [2] Ferguson, J.M., Pitt, B., Kuntz, A., Granna, J., Kavoussi, N.L., Nimmagadda, N., Barth, E.J., Herrell III, S.D., Webster III, R.J.: Comparing the accuracy of the da vinci xi and da vinci si for image guidance and automation. The International Journal of Medical Robotics and Computer Assisted Surgery 16(6), 1–10 (2020)
- [3] Gao, W., Tedrake, R.: Surfelwarp: Efficient non-volumetric single view dynamic reconstruction. arXiv preprint arXiv:1904.13073 (2019)
- [4] Godard, C., Mac Aodha, O., Brostow, G.J.: Unsupervised monocular depth estimation with left-right consistency. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 270–279 (2017)
- [5] Hore, A., Ziou, D.: Image quality metrics: Psnr vs. ssim. In: 2010 20th international conference on pattern recognition. pp. 2366–2369. IEEE (2010)
- [6] Jin, Y., Cheng, K., Dou, Q., Heng, P.A.: Incorporating temporal prior from motion flow for instrument segmentation in minimally invasive surgery video. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 440–448. Springer (2019)
- [7] Lamarca, J., Parashar, S., Bartoli, A., Montiel, J.: Defslam: Tracking and mapping of deforming scenes from monocular sequences. IEEE Transactions on Robotics (2020)
- [8] Li, H., Adams, B., Guibas, L.J., Pauly, M.: Robust single-view geometry and motion reconstruction. ACM Transactions on Graphics (ToG) 28(5), 1–10 (2009)
- [9] Li, L., Li, X., Yang, S., Ding, S., Jolfaei, A., Zheng, X.: Unsupervised learning-based continuous depth and motion estimation with monocular endoscopy for virtual reality minimally invasive surgery. IEEE Transactions on Industrial Informatics (2020)
- [10] Li, Y., Richter, F., Lu, J., Funk, E.K., Orosco, R.K., Zhu, J., Yip, M.C.: Super: A surgical perception framework for endoscopic tissue manipulation with surgical robotics. IEEE Robotics and Automation Letters 5(2), 2294–2301 (2020)
- [11] Li, Z., Liu, X., Drenkow, N., Ding, A., Creighton, F.X., Taylor, R.H., Unberath, M.: Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. arXiv preprint arXiv:2011.02910 (2020)
- [12] Liu, X., Stiber, M., Huang, J., Ishii, M., Hager, G.D., Taylor, R.H., Unberath, M.: Reconstructing sinus anatomy from endoscopic video–towards a radiation-free approach for quantitative longitudinal assessment. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 3–13. Springer (2020)
- [13] Lu, J., Jayakumari, A., Richter, F., Li, Y., Yip, M.C.: Super deep: A surgical perception framework for robotic tissue manipulation using deep learning for feature extraction. arXiv preprint arXiv:2003.03472 (2020)
- [14] Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4040–4048 (2016)
- [15] Newcombe, R.A., Fox, D., Seitz, S.M.: Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 343–352 (2015)
- [16] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019)
- [17] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
- [18] Sanders, J., Kandrot, E.: CUDA by example: an introduction to general-purpose GPU programming. Addison-Wesley Professional (2010)
- [19] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
- [20] Song, J.: 3D non-rigid SLAM in minimally invasive surgery. Ph.D. thesis (2020)
- [21] Song, J., Wang, J., Zhao, L., Huang, S., Dissanayake, G.: Mis-slam: Real-time large-scale dense deformable slam system in minimal invasive surgery based on heterogeneous computing. IEEE Robotics and Automation Letters 3(4), 4068–4075 (2018)
- [22] Stoyanov, D., Mylonas, G.P., Lerotic, M., Chung, A.J., Yang, G.Z.: Intra-operative visualizations: Perceptual fidelity and human factors. Journal of Display Technology 4(4), 491–501 (2008)
- [23] Taylor, R.H., Menciassi, A., Fichtinger, G., Fiorini, P., Dario, P.: Medical robotics and computer-integrated surgery. Springer handbook of robotics pp. 1657–1684 (2016)
- [24] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
- [25] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)
- [26] Yang, G., Manela, J., Happold, M., Ramanan, D.: Hierarchical deep stereo matching on high-resolution images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5515–5524 (2019)
- [27] Ye, M., Johns, E., Handa, A., Zhang, L., Pratt, P., Yang, G.Z.: Self-supervised siamese learning on stereo image pairs for depth estimation in robotic surgery. arXiv preprint arXiv:1705.08260 (2017)
- [28] Zhang, Z.: A flexible new technique for camera calibration. IEEE Transactions on pattern analysis and machine intelligence 22(11), 1330–1334 (2000)