
DD-VNB: A Depth-based Dual-Loop Framework for Real-time Visually Navigated Bronchoscopy

Qingyao Tian, Huai Liao, Xinyan Huang, Jian Chen, Zihui Zhang, Bingyu Yang,
Sebastien Ourselin and Hongbin Liu
Qingyao Tian, Jian Chen and Bingyu Yang are with the Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China, and also with the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China. Huai Liao, M.D. and Xinyan Huang, M.D. are with the Department of Pulmonary and Critical Care Medicine, The First Affiliated Hospital of Sun Yat-sen University, Guangzhou, Guangdong Province, P.R. China. Zihui Zhang is with the Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. Sebastien Ourselin is with the School of Biomedical Engineering and Imaging Sciences, King’s College London, London, UK. Corresponding author: Hongbin Liu is with the Institute of Automation, Chinese Academy of Sciences, and with the Centre of AI and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences. He is also affiliated with the School of Biomedical Engineering and Imaging Sciences, King’s College London, UK. (e-mail: [email protected]).
Abstract

Real-time 6 DOF localization of bronchoscopes is crucial for enhancing intervention quality. However, current vision-based technologies struggle to balance generalization to unseen data with computational speed. In this study, we propose a Depth-based Dual-Loop framework for real-time Visually Navigated Bronchoscopy (DD-VNB) that generalizes across patient cases without the need for re-training. The DD-VNB framework integrates two key modules: depth estimation and dual-loop localization. To address the domain gap among patients, we propose a knowledge-embedded depth estimation network that maps endoscope frames to depth, ensuring generalization by eliminating patient-specific textures. The network embeds view synthesis knowledge into a cycle adversarial architecture for scale-constrained monocular depth estimation. For real-time performance, our localization module embeds a fast ego-motion estimation network into the loop of depth registration. The ego-motion inference network estimates the pose change of the bronchoscope at high frequency, while depth registration against the pre-operative 3D model provides the absolute pose periodically. Specifically, the relative pose changes are fed into the registration process as the initial guess to boost its accuracy and speed. Experiments on phantom and in-vivo data from patients demonstrate the effectiveness of our framework: 1) monocular depth estimation outperforms SOTA, 2) localization achieves an Absolute Tracking Error (ATE) of 4.7 ± 3.17 mm in phantom and 6.49 ± 3.88 mm in patient data, 3) with a frame rate approaching the video capture speed, 4) without the necessity of case-wise network retraining. The framework’s superior speed and accuracy demonstrate its promising clinical potential for real-time bronchoscopic navigation.

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.


Figure 1: Experimental setup: (a) using a phantom, (b) acquiring patient data.

I Introduction

Lung cancer stands as a major health concern on a global scale [1]. Nevertheless, it is a largely preventable disease and early diagnosis significantly improves patient outcomes [2]. Bronchoscopy has played a pivotal role in the examination and diagnosis of airway lesions [3]. During bronchoscopy, surgeons navigate a flexible bronchoscope with a distal camera towards nodules identified in pre-operative CT scans based on endoscopic images. The limited field of view of the bronchoscope necessitates extensive clinical experience to localize its position within the airway. Therefore, there is an urgent demand for a bronchoscopic localization method to aid in interventions.

Recent advancements in robotic bronchoscopy [4, 5] have highlighted technologies like electromagnetic navigation [6] and 3-D shape sensing [7] as significant aids. Visually Navigated Bronchoscopy (VNB) emerges as a promising area of study due to its potential for high accuracy and simple setup. However, the limitations of monocular bronchoscope settings and the textureless airway surface hinder the application of traditional SLAM methods [8]. Furthermore, variations in patient airways pose a challenge in developing a streamlined method for generalization [9, 10]. Consequently, current VNB methods struggle to meet practical application demands.

The primary limitation of existing VNB methods lies in balancing computational speed with a streamlined pipeline that generalizes across patients. Existing VNB methods mainly fall into two categories: registration-based [11, 12, 13, 14] and retrieval-based [9, 10] localization. Registration-based approaches rely on iterative optimization, resulting in limited computational speed, while retrieval-based methods adapt to different patients only through case-wise re-training.

In order to address the challenge of balancing speed and generalization, this paper proposes a Depth-based Dual-Loop Framework for Visually Navigated Bronchoscopy (DD-VNB). The methodology involves depth estimation and a dual-loop localization module. The depth estimation of endoscope frames serves as input for the localization module.

To achieve generalization across cases, we propose a knowledge-embedded depth estimation network. Initially, the depth estimation maps endoscope frames to the depth space, serving as input for the pose inference module. This process eliminates patient-specific textures, ensuring the generalization ability of the localization method. Moreover, we innovatively incorporate view synthesis into an unpaired image translation framework, leveraging the geometric principles of view synthesis to inform and constrain the learning process. This approach is motivated by earlier research that utilized view synthesis for joint training of unsupervised depth and motion estimation [15], as well as recent efforts to apply geometric view synthesis in creating realistic surgical scenes [16].

To achieve real-time speed, DD-VNB embeds a fast ego-motion estimation network within the loop of depth map registration during the localization stage. The ego-motion inference network estimates the bronchoscope’s pose change at high frequency, while depth registration against the pre-operative 3D model provides absolute pose values periodically. Specifically, the relative pose changes are fed into the registration process as the initial guess to boost accuracy and speed.

In summary, our contributions are as follows:

  1. A real-time VNB framework is presented that generalizes across different patient cases, eliminating the necessity for re-training.

  2. We propose a knowledge-embedded depth estimation network that leverages geometric view synthesis for accurate depth estimation from monocular endoscopic images at a specific scale.

  3. A fast localization module is proposed that embeds an efficient ego-motion estimation network within the loop of single-view depth map registration, enabling fast and accurate pose inference.

  4. The proposed framework outperforms SOTA in localization accuracy, as demonstrated by extensive experiments on both phantom and patient data, showcasing its practical effectiveness (Fig. 1).

II Related Works

Recent advances in deep learning (DL) prompt exploration into learning-based VNB [17, 18, 19, 20, 10, 9, 21, 13, 14, 22]. Ozyoruk et al. [17] and Deng et al. [19] introduce visual odometry methods for endoscopic videos. Nonetheless, existing deep monocular visual odometry algorithms face challenges such as scale ambiguity and drift, making them less suitable for bronchoscopic applications. OffsetNet [20] employs DL to register real and rendered images but suffers from low accuracy in areas unseen during training. Their later work, AirwayNet [10], localizes the camera by estimating visible airways and their relative poses, with successful navigation critically dependent on correctly identifying the airways. Zhao et al. [9] adopt auxiliary learning to train a global pose regression network [23]. However, global pose learning is essentially image retrieval and cannot generalize beyond the training data [24].

Several studies [21, 13, 14, 25] explore depth estimation of bronchoscopic images with cycle adversarial networks [26] and perform registration [13, 14]. Theoretically, with adequate training on real bronchoscopic images and unpaired virtual depth maps, cycle adversarial networks are capable of learning the mapping from the distribution of bronchoscopic images to the distribution of their corresponding depth maps [26]. By training an unsupervised conditional cycle adversarial network, Shen et al. [13] obtain a mapping network from bronchoscopic frames to corresponding depth maps, addressing the problem of visual artifacts. Localization is accomplished by iteratively registering the estimated depth maps to preoperative images. Banach et al. [14] use a Three Cycle-Consistent Generative Adversarial Network architecture to estimate image depth and register the generated point cloud to a pre-operative airway model. In these studies, depth-based methods prove robust because individual differences in illumination and texture are removed. However, the work mentioned above faces limitations in real-world applications for two primary reasons. Firstly, due to its loose constraints, the cycle adversarial network often learns a biased distribution mapping, resulting in unstable scale between frames and possible changes in object structure [21]. Secondly, these methods concentrate exclusively on depth estimation and rely on traditional algorithms for registering camera poses, which suffer from low update frame rates.

In this paper, we aim to achieve real-time bronchoscope localization with generalization to cases unseen during training. To address the issue of learning scale-informed depth, we introduce a view consistency loss and a geometry consistency loss, which significantly advance the depth estimation process. Additionally, to ensure real-time localization, we design a fast localization module that incorporates an efficient ego-motion estimation network into the depth registration loop.


Figure 2: Overview of the proposed framework, where $ref$ denotes the reference time point. During intervention, we first estimate the incoming bronchoscopic frame’s depth map. Then, a dual-loop scheme is introduced to locate the camera. The ego-motion loop tracks the camera position by inferring the camera movement between a pair of input depth maps in real time. The registration loop infers the global pose by referring to the pre-operative airway map, eliminating the accumulative error of ego-motion estimation. For the next iteration, $ref+m$ serves as the next reference time point for the dual-loop iteration, and $P_{ref+m}$ is considered as the initial value for registration.

III Method

Fig. 2 shows an overview of the proposed bronchoscopic localization framework. The methodology involves depth estimation and localization. Learning-based depth estimation is accomplished with an adversarial architecture using knowledge embedding as supervision. The localization part consists of two algorithms: an ego-motion estimation network that takes two nearby bronchoscopic depth maps as input to infer the 6 DOF relative pose between them; and a depth map registration method that maps a single estimated depth map to the pre-operative airway model, correcting the accumulative error of incremental ego-motion inference.

III-A Depth Estimation with View Synthesis as Supervision

The training strategy of the proposed depth estimation module is shown in Fig. 3. Training of our depth estimation network employs an unpaired image-to-image translation framework based on cycle consistency [26]. The depth estimation aims to map a bronchoscopic frame $x\in X$ to its depth space by $G_{depth}:X\rightarrow Z$, generating its corresponding depth map $\hat{z}_{t}=G_{depth}(x_{t})$. The cycle is completed by reconstructing $\hat{z}_{t}$ back to the domain $X$ using $G_{image}:Z\rightarrow X$. The translation from $Z$ to $X$ is similarly performed. The model enforces cycle consistency for $G_{depth}$ and $G_{image}$. Finally, discriminators $D_{depth}$ and $D_{image}$ distinguish between generated/real depth maps and video frames, respectively.

We use the LS-GAN loss [27] as the adversarial loss $L_{adv}$ and L1 losses for $L_{cyc}$ and $L_{iden}$, guiding the networks to learn domain transfer while preserving important structure.
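
To make the translation objective concrete, the following PyTorch sketch assembles these generator-side terms for one batch. It is a minimal illustration rather than our exact implementation: `G_depth`, `G_image`, `D_depth` and `D_image` are placeholder network modules, and optimizer steps and discriminator updates are omitted.

```python
import torch
import torch.nn.functional as F

def generator_losses(G_depth, G_image, D_depth, D_image, x, z):
    """Generator-side losses of the X<->Z cycle translation step.
    G_*/D_* are placeholder network modules; x is a batch of real bronchoscopic
    frames and z a batch of unpaired virtual depth maps."""
    z_hat = G_depth(x)                      # frame -> depth
    x_hat = G_image(z)                      # depth -> frame
    x_rec = G_image(z_hat)                  # cycle X -> Z -> X
    z_rec = G_depth(x_hat)                  # cycle Z -> X -> Z

    # LS-GAN adversarial terms: generators push discriminator outputs toward 1
    d_z, d_x = D_depth(z_hat), D_image(x_hat)
    l_adv = F.mse_loss(d_z, torch.ones_like(d_z)) + F.mse_loss(d_x, torch.ones_like(d_x))

    # L1 cycle-consistency and identity terms
    l_cyc = F.l1_loss(x_rec, x) + F.l1_loss(z_rec, z)
    l_iden = F.l1_loss(G_depth(z), z) + F.l1_loss(G_image(x), x)
    return l_adv, l_cyc, l_iden
```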

View Consistency Loss: To enforce the network to learn depth estimation with absolute scale, virtual camera poses of input depth maps are also collected from our simulator to impose view consistency between generated bronchoscopic video frames.

Taking depth map $z_{t-n}$ and $z_{t}$ as input, network $G_{image}$ generates bronchoscopic images $\hat{x}_{t-n}$ and $\hat{x}_{t}$. With ground truth 6 DOF relative camera pose $P_{t-n,t}$ between depth map $z_{t-n}$ and $z_{t}$, along with camera intrinsic $K$, a homogeneous pixel $p_{t-n}\in\hat{x}_{t-n}$ can be warped to $\hat{p}_{t}\in\hat{x}_{t}$ according to view synthesis:

$\hat{z}_{t}\hat{p}_{t}=\boldsymbol{K}\boldsymbol{R}_{t-n,t}\boldsymbol{K}^{-1}z_{t-n}p_{t-n}+\boldsymbol{K}T_{t-n,t},$ (1)

where $T_{t-n,t}$ and $\boldsymbol{R}_{t-n,t}$ are the translation vector and rotation matrix from $t-n$ to $t$.

Differentiable bilinear sampling is applied to $\hat{x}_{t-n}$ according to the continuous coordinates $\hat{p}_{t}$, mapping $\hat{x}_{t-n}$ to the warped bronchoscopic image $w(\hat{x}_{t-n})$, which should be consistent with $\hat{x}_{t}$. Therefore, we propose to minimize the pixelwise inconsistency between the warped frame $w(\hat{x}_{t-n})$ and the generated frame $\hat{x}_{t}$. For a pair of generated frames $\hat{x}_{t-n}$ and $\hat{x}_{t}$, the view consistency loss is defined as

$L_{rec\_\hat{x}}=\frac{1}{|V|}\sum_{p\in V}\left|w\left(\hat{x}_{t-n}\right)(p)-\hat{x}_{t}(p)\right|,$ (2)

where $w(\cdot)$ is the warping operator into the pixel space of $\hat{x}_{t}$ by (1); $V$ stands for the valid pixels successfully projected from $\hat{x}_{t-n}$ to the image plane of $\hat{x}_{t}$, and $|V|$ represents the number of pixels in $V$. By this means, $G_{image}$ is enforced to learn the unbiased mapping from a depth map to its corresponding bronchoscopic frame. As cycle consistency is enforced, $G_{depth}$ would in turn learn the unbiased mapping from input video frames to scale-constrained depth maps.

To further constrain the learning of the mapping $G_{depth}:X\rightarrow Z$, view synthesis is applied to the input bronchoscopic frames $x_{t-n}$ and $x_{t}$ with their generated depth maps. Although the ground truth camera motion between $x_{t-n}$ and $x_{t}$ is not accessible, once the estimated depth maps $\hat{z}_{t-n}$ and $\hat{z}_{t}$ are obtained, the inferred relative pose $\hat{P}_{t-n,t}$ can be calculated by ego-motion estimation, which is discussed in detail in Section III-B. Thus, a satisfying depth map estimation should be informative enough to recover the camera motion and preserve object structure, which is formulated as a view consistency loss of:

$L_{rec\_x}=\frac{1}{|V|}\sum_{p\in V}\left|w\left(x_{t-n}\right)(p)-x_{t}(p)\right|.$ (3)

Combining (2) and (3), the complete view consistency loss is defined as:

$L_{rec}=\tau_{1}L_{rec\_\hat{x}}+\tau_{2}L_{rec\_x},$ (4)

where $\tau_{1}$ and $\tau_{2}$ are weight terms used to balance losses.
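
For reference, the sketch below implements the warping of (1) and the masked L1 penalty of (2)/(3) in PyTorch. It follows the common inverse-warping formulation (the earlier frame is sampled at coordinates projected via the later frame's depth), so conventions may differ slightly from our exact implementation; tensor names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def view_consistency_loss(src_img, tgt_img, tgt_depth, K, R, T):
    """Masked L1 view consistency between a warped source frame and the target
    frame (Eqs. (1)-(2)): each target pixel is back-projected with the target
    depth, moved into the source camera by (R, T), and the source image is
    sampled there. src_img, tgt_img: (B,C,H,W); tgt_depth: (B,1,H,W);
    K, R: (B,3,3); T: (B,3,1)."""
    B, _, H, W = tgt_img.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float()      # (3,H,W) homogeneous pixels
    pix = pix.view(1, 3, -1).expand(B, -1, -1).to(tgt_img.device)     # (B,3,HW)

    cam = torch.linalg.inv(K) @ pix * tgt_depth.view(B, 1, -1)        # back-project target pixels
    cam_src = R @ cam + T                                             # rigid transform into source camera
    proj = K @ cam_src                                                # project onto source image plane
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                   # continuous source coordinates

    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,                   # normalize to [-1, 1] for grid_sample
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(B, H, W, 2)
    warped = F.grid_sample(src_img, grid, align_corners=True)         # w(x_{t-n}) in the target view

    valid = ((grid.abs() <= 1).all(dim=-1) &                          # V: pixels landing inside the image
             (proj[:, 2] > 0).view(B, H, W)).float()
    return ((warped - tgt_img).abs().mean(dim=1) * valid).sum() / valid.sum().clamp(min=1)
```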

Geometry Consistency Loss: For generated consecutive depth maps $\hat{z}_{t-n}$ and $\hat{z}_{t}$, assuming they conform to the same 3D scene structure, the difference of their 3D depth attributes should be minimized. Following [28], we enforce a geometry consistency loss on the predicted depth maps so that the scales of a depth estimation sequence agree with each other, which further constrains the learning of the mapping $G_{depth}:X\rightarrow Z$.

For generated consecutive depth maps $\hat{z}_{t-n}$ and $\hat{z}_{t}$, the depth inconsistency map $z_{diff}$ is defined in [28] as:

$z_{diff}=\frac{\left|\hat{z}_{t}^{t-n}-\hat{z}_{t}^{\prime}\right|}{\hat{z}_{t}^{t-n}+\hat{z}_{t}^{\prime}},$ (5)

where $\hat{z}_{t}^{t-n}$ is the computed depth map of $\hat{z}_{t}$ obtained by warping $\hat{z}_{t-n}$ using the inferred camera motion $\hat{P}_{t-n,t}$, and $\hat{z}_{t}^{\prime}$ is the interpolated depth map from the generated depth map $\hat{z}_{t}$. We use $\hat{z}_{t}^{\prime}$ instead of $\hat{z}_{t}$ because the warping flow from $\hat{z}_{t-n}$ to $\hat{z}_{t}^{t-n}$ does not lie on the pixel grid. The depth inconsistency map is normalized by the sum of the two depth maps. The geometry consistency loss can then be defined as:

$L_{gc}=\frac{1}{|V|}\sum_{p\in V}z_{diff}(p).$ (6)
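
A corresponding sketch of (5)-(6) is given below, reusing the projection conventions of the previous snippet; the exact warping and interpolation details are illustrative assumptions rather than our precise implementation.

```python
import torch
import torch.nn.functional as F

def geometry_consistency_loss(depth_a, depth_b, K, R, T):
    """Eqs. (5)-(6), roughly following [28]: points of depth_a are moved into
    camera b by (R, T); their z-values give the computed depth, which is
    compared with depth_b interpolated at the projected coordinates.
    depth_a, depth_b: (B,1,H,W); K, R: (B,3,3); T: (B,3,1)."""
    B, _, H, W = depth_a.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).float().view(1, 3, -1)
    pix = pix.expand(B, -1, -1).to(depth_a.device)                # (B,3,HW)

    cam_b = R @ (torch.linalg.inv(K) @ pix * depth_a.view(B, 1, -1)) + T
    computed = cam_b[:, 2].view(B, 1, H, W)                       # computed depth \hat{z}_t^{t-n}
    proj = K @ cam_b
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(B, H, W, 2)
    interp = F.grid_sample(depth_b, grid, align_corners=True)     # interpolated depth \hat{z}_t'
    valid = (grid.abs() <= 1).all(dim=-1).unsqueeze(1).float()

    diff = (computed - interp).abs() / (computed + interp).clamp(min=1e-6)
    return (diff * valid).sum() / valid.sum().clamp(min=1)
```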

Combining the above loss terms, the overall loss is as follows:

$L=\beta L_{cyc}+\gamma L_{iden}+\delta L_{adv}+L_{rec}+\eta L_{gc},$ (7)

where $\beta$, $\gamma$, $\delta$ and $\eta$ are weight terms used to balance losses.


Figure 3: Our depth estimation network training incorporates scale-awareness by combining unpaired image-to-image translation with view synthesis, enforcing view consistency during training. In the $X\rightarrow Z$ direction (lower half), depth maps $\hat{z}_{t}$ and $\hat{z}_{t-n}$ are generated for frames $x_{t}$ and $x_{t-n}$, and camera motion is inferred by the pretrained ego-motion estimation network. With depth and motion, the view-synthesized image $w(x_{t-n})$ and reprojected depth $\hat{z}_{t}^{t-n}$ are obtained, enforcing view consistency between $x_{t}$ and $w(x_{t-n})$, and geometry consistency between $\hat{z}_{t}$ and $\hat{z}_{t}^{t-n}$. In the $Z\rightarrow X$ direction (upper half), ground truth pose and depth in virtual bronchoscopy yield $w(\hat{x}_{t-n})$, enforcing view consistency with $\hat{x}_{t}$. The adversarial loss in the diagram combines discriminators $D_{depth}$ and $D_{image}$.

III-B Dual-loop Localization

Our localization module integrates an ego-motion estimation network into the loop of depth registration for real-time bronchoscope tracking, enhancing accuracy by using relative pose changes as initial guesses in depth registration against the pre-operative 3D model.

Depth Based Ego-motion Inference. As depth maps and corresponding camera poses are accessible in virtual bronchoscopy, the learning objective is to minimize the difference between the predicted transformation $\hat{P}_{t-n,t}$ and the ground truth transformation $P_{t-n,t}$. Therefore, the relative camera pose inference loss is defined as:

$L\left(z_{t-n},z_{t}\right)=\left\|T_{t-n,t}-\hat{T}_{t-n,t}\right\|_{2}+\omega\left\|r_{t-n,t}-\hat{r}_{t-n,t}\right\|_{2},$ (8)

where $T_{t-n,t}$ and $\hat{T}_{t-n,t}$ are the ground truth and predicted translation vectors, respectively, and $r_{t-n,t}$ and $\hat{r}_{t-n,t}$ are the ground truth and predicted Euler angles. For data augmentation, depth map pairs $z_{t-n}, z_{t}$ with $n\in[-5,5]$ are randomly sampled from the virtual bronchoscopy sequences to guarantee sufficient co-visibility while enhancing motion diversity.

We use a FlowNetC encoder [29] to extract image features. Then, five convolutional blocks are utilized to regress a pose vector from the extracted features. By using virtual depth maps and poses as training data, our ego-motion network can be deployed directly to the estimated depth maps of incoming bronchoscopic frames at test time.
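
A simplified PyTorch sketch of the regressor and of the loss in (8) is shown below; the plain strided-convolution encoder and its layer widths stand in for the FlowNetC encoder and are illustrative assumptions, not our exact architecture.

```python
import torch
import torch.nn as nn

class EgoMotionNet(nn.Module):
    """Sketch of the ego-motion regressor: a plain strided-convolution encoder
    stands in for the FlowNetC encoder, and the head regresses a 6-DOF vector
    (3 translation components + 3 Euler angles) from a pair of depth maps."""
    def __init__(self):
        super().__init__()
        chans = [2, 32, 64, 128, 256, 256]
        layers = []
        for cin, cout in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True)]
        self.encoder = nn.Sequential(*layers)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, 6))

    def forward(self, depth_a, depth_b):
        # channel-wise concatenation of the two depth maps as a simple pairing scheme
        return self.head(self.encoder(torch.cat([depth_a, depth_b], dim=1)))

def pose_loss(pred, gt_T, gt_r, omega=100.0):
    """Eq. (8): L2 translation error plus omega-weighted L2 Euler-angle error."""
    pred_T, pred_r = pred[:, :3], pred[:, 3:]
    return (torch.norm(gt_T - pred_T, dim=1) + omega * torch.norm(gt_r - pred_r, dim=1)).mean()
```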

Depth Map Registration. Relying on the ego-motion estimation network alone for tracking yields accumulative error, as errors from previous inferences propagate over time to the current estimate in incremental localization. Thus, refining previous results becomes necessary. Registration between the estimated depth maps and the pre-operative bronchial model provides the absolute position of the bronchoscope, eliminating the accumulative error from previous relative pose inference.

With depth map registration, after generating depth map $\hat{z}_{t}$ from input bronchoscopic frame $x_{t}$, camera pose $P_{t}$ is estimated by minimizing the difference between $\hat{z}_{t}$ and the rendered depth $z(P_{t}^{\prime})$ in the pre-operative airway model at pose $P_{t}^{\prime}$. The optimization process is described as:

$P_{t}=\operatorname{argmax}_{P_{t}^{\prime}}\operatorname{NCC}\left(\hat{z}_{t},z\left(P_{t}^{\prime}\right)\right),$ (9)

where $\operatorname{NCC}(\cdot)$ is normalized cross-correlation. Because of the implicit objective function, the Powell algorithm [30] serves as the optimization strategy. Note that during optimization, constant rendering of depth maps in each iteration is the most time-consuming part.
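
The registration step can be sketched with SciPy's Powell optimizer as below; `render_depth` is a placeholder for a depth renderer of the pre-operative airway model (not shown here), and the 6-vector pose parameterization is an illustrative choice.

```python
import numpy as np
from scipy.optimize import minimize

def ncc(a, b):
    """Normalized cross-correlation between two depth maps."""
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def register_depth(est_depth, render_depth, init_pose):
    """Eq. (9) as a Powell search: maximize NCC between the estimated depth map
    and the depth rendered from the pre-operative airway model. `render_depth(pose)`
    is a placeholder for the model renderer; `init_pose` is the 6-DOF initial guess
    supplied by the ego-motion loop."""
    objective = lambda pose: -ncc(est_depth, render_depth(pose))
    result = minimize(objective, x0=np.asarray(init_pose, dtype=float),
                      method="Powell", options={"xtol": 0.01, "ftol": 0.01})
    return result.x
```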

Embedding ego-motion inference into registration, we introduce a dual-loop localization scheme, where DD-VNB estimates the relative pose change of the bronchoscope at high frequency while depth map registration against the pre-operative 3D model periodically provides the absolute location for error correction at a lower rate.

Taking one registration loop as an example, denote $ref$ as the reference time point. At time $ref$, iterative registration of the estimated depth map $\hat{z}_{ref}$ against the airway model begins. While the registration runs, $\hat{z}_{ref}$ together with the estimated depth map $\hat{z}_{ref+i}$ are taken as a pair of inputs to the ego-motion loop, inferring the relative camera pose $P_{ref,ref+i}$ at time $ref+i$ in real time, where $i\in[1,m]$. By the time $P_{ref,ref+m}$ has been estimated, the registration for $P_{ref}$ has completed. A more accurate pose inference $\dot{P}_{ref+m}$ for time $ref+m$ is obtained by concatenating $P_{ref}$ and $P_{ref,ref+m}$. For the next iteration, $ref+m$ serves as the reference time point, and $\dot{P}_{ref+m}$ is considered as the initial value for registration. The hyperparameter $m$ is selected so that the computational frame rate matches the video capture speed.
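
The scheduling of one dual-loop pass is summarized by the sequential sketch below. In practice the two loops overlap in time (registration of the reference frame runs while ego-motion estimates stream out); here they are interleaved for clarity, and `estimate_motion`, `register` and the 4x4 pose representation are placeholders.

```python
import numpy as np

def dual_loop_track(depth_frames, estimate_motion, register, init_pose, m):
    """Sequential sketch of the dual-loop scheduling. Poses are 4x4 homogeneous
    matrices; `estimate_motion(z_a, z_b)` returns the relative pose between two
    depth maps, `register(z, initial_guess)` wraps the depth-map registration of
    Eq. (9), and `m` is the number of ego-motion updates per registration."""
    poses = {}
    ref_idx, ref_depth, ref_pose = 0, depth_frames[0], init_pose
    while ref_idx < len(depth_frames):
        # Registration loop: refine the reference pose against the airway model,
        # seeded with the pose composed from the previous ego-motion estimate.
        ref_pose = register(ref_depth, initial_guess=ref_pose)
        poses[ref_idx] = ref_pose
        # Ego-motion loop: real-time relative poses P_{ref, ref+i}, i = 1..m.
        rel = np.eye(4)
        for i in range(1, m + 1):
            if ref_idx + i >= len(depth_frames):
                return poses
            rel = estimate_motion(ref_depth, depth_frames[ref_idx + i])
            poses[ref_idx + i] = ref_pose @ rel
        # Frame ref+m becomes the new reference; its composed pose is the
        # initial guess for the next registration.
        ref_idx, ref_depth, ref_pose = ref_idx + m, depth_frames[ref_idx + m], ref_pose @ rel
    return poses
```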

IV Experiment

IV-A Dataset

Our experiments span phantom and patient datasets, training and testing our framework and benchmarks on each.

Phantom Data: The phantom dataset comprises 13 video clips (640×480 resolution, 30 fps) collected with a robotized bronchoscope, totaling 1000-3000 frames per clip. Training uses 8 right-lung clips; testing uses 5 left-lung clips. A high-precision CT scan of the lung phantom facilitates 3D reconstruction and segmentation for depth estimation and localization training.

Patient Data: Includes nine cases captured with an Olympus BF-6C260 bronchoscope at 10-15 fps, supplemented by checkerboard videos for camera calibration. CT scans precede the operations, with airway segmentation performed for model reconstruction. Training involves six cases with comprehensive airway videos (approx. 1500 frames each); testing involves three cases with trachea-to-lobar-bronchus videos (150-200 frames each). Real and virtual bronchoscopy frames are manually aligned to ensure ground truth accuracy.

Virtual Bronchoscopy Data: Generated using the SOFA framework simulator, comprising 54 video clips (640×480 resolution, 400-1000 frames per case) with corresponding camera poses and depth maps for motion inference network training.

IV-B Implementation

Our training leverages the PyTorch framework on an NVIDIA RTX 3090 GPU. We train the depth estimation network on the phantom and patient datasets separately. Initially, we keep the learning rate at 0.0001 without enforcing consistency ($\tau_{1}=\tau_{2}=\eta=0$) to accommodate early-stage training variability. After 10 epochs, the parameters are adjusted to $\tau_{1}=0.3$, $\tau_{2}=5$ and $\eta=5$ for refined training over 100 epochs using the Adam optimizer. $\beta$, $\gamma$ and $\delta$ are set to 10, 5 and 1, respectively, as in [26] throughout training. The ego-motion network, trained with virtual data reflecting the different camera intrinsics, undergoes 300 epochs at a learning rate of 1e-5 with $\omega$ set to 100. Depth map registration employs Powell's algorithm with an error tolerance and convergence criterion of 0.01.
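
The weight schedule can be written as the small helper below; the cycle, identity and adversarial weights are taken as $\beta=10$, $\gamma=5$ and $\delta=1$ (the CycleGAN defaults assumed here), and the commented optimizer settings are indicative rather than a complete training script.

```python
def loss_weights(epoch):
    """Two-phase weighting for Eq. (7)/(4): consistency terms are disabled for
    the first 10 epochs, then enabled with tau1 = 0.3, tau2 = 5 and eta = 5;
    (beta, gamma, delta) = (10, 5, 1) are kept fixed throughout."""
    weights = dict(beta=10.0, gamma=5.0, delta=1.0, tau1=0.0, tau2=0.0, eta=0.0)
    if epoch >= 10:
        weights.update(tau1=0.3, tau2=5.0, eta=5.0)
    return weights

# Depth networks: Adam, lr = 1e-4, trained for 100 epochs after the warm-up phase.
# Ego-motion network: Adam, lr = 1e-5, 300 epochs, omega = 100 in Eq. (8).
```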

TABLE I: Monocular Depth Estimation Results

Methods | Phantom SSIM | NCC | MAE [mm] | RMSE [mm] | Scale drift | Patient SSIM | NCC | MAE [mm] | RMSE [mm] | Scale drift
EndoSLAM* | 0.918 | 0.822 | 4.786 ± 0.926 | 7.832 ± 0.804 | - | 0.796 | 0.738 | 8.415 ± 2.126 | 9.919 ± 1.947 | -
CycleGAN | 0.914 | 0.911 | 5.347 ± 1.809 | 7.038 ± 1.974 | -0.256 ± 0.250 | 0.786 | 0.846 | 10.535 ± 4.547 | 13.260 ± 4.793 | -0.789 ± 0.781
Ours w/o $L_{rec}$ | 0.916 | 0.919 | 4.781 ± 1.367 | 6.171 ± 1.321 | -0.278 ± 0.171 | 0.832 | 0.847 | 7.711 ± 3.278 | 9.936 ± 3.392 | -0.475 ± 0.684
Ours w/o $L_{rec\_\hat{x}}$ | 0.923 | 0.899 | 4.491 ± 1.263 | 6.014 ± 1.380 | -0.152 ± 0.235 | 0.851 | 0.828 | 6.455 ± 2.542 | 8.549 ± 2.663 | -0.276 ± 0.569
Ours | 0.931 | 0.901 | 3.993 ± 0.860 | 5.727 ± 0.885 | -0.160 ± 0.162 | 0.862 | 0.830 | 5.772 ± 2.706 | 7.631 ± 3.217 | -0.034 ± 0.445

  • * denotes recovering scale before evaluation. MAE, RMSE and scale drift values are given as the mean ± standard deviation. The best performance in each block is indicated in bold.


Figure 4: Qualitative depth evaluations. The original input image, depth ground truth, and the predicted depth maps and error heatmaps from our depth estimation, ours w/o view consistency loss for the generated bronchoscopic frame, ours w/o view consistency loss, CycleGAN, and EndoSLAM are shown from left to right.

V Results

V-A Depth Estimation Evaluation

As no ground truth camera motion or depth maps are available during training, the monocular depth estimation methods EndoSLAM [17] and CycleGAN [26] are employed as baselines for comparison. CycleGAN is the most commonly used depth estimation method in learning-based VNB. EndoSLAM jointly trains unsupervised monocular depth estimation and camera pose networks by adopting the view consistency and geometry consistency from [28].

Results are shown in Table I. To further validate the scale-awareness of our depth estimation, we additionally report scale drift, defined by:

$\text{scale drift}=1-\frac{\operatorname{mean}(\hat{z})}{\operatorname{mean}(z)}.$ (10)
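
In code, the metric and the first-frame alignment used for scale-ambiguous baselines (described below) amount to the following; array names are illustrative.

```python
import numpy as np

def scale_drift(pred_depth, gt_depth):
    """Eq. (10): relative drift of the mean predicted depth against ground truth."""
    return 1.0 - np.mean(pred_depth) / np.mean(gt_depth)

def align_first_frame(pred_seq, gt_seq):
    """Rescale a scale-ambiguous prediction sequence (e.g. EndoSLAM) so that the
    first frame's mean depth matches ground truth before computing errors."""
    s = np.mean(gt_seq[0]) / np.mean(pred_seq[0])
    return [s * d for d in pred_seq]
```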

To address the scale ambiguity of EndoSLAM depth maps, we align the first frame’s average depth with ground truth and exclude its scale drift from the evaluation. Our method sets a new standard in depth estimation, with a notably low RMSE of $5.727\pm 0.885$ mm for phantom data and $7.631\pm 3.217$ mm for patient data, outperforming CycleGAN by 18.6% and 42.5% in RMSE reduction for phantom and patient data, respectively. This improvement, especially against the CycleGAN baseline, underscores our method’s effective use of knowledge embedding for accurate depth estimation at absolute scale. We selected several typical images from the patient data for the qualitative evaluation of depth estimation methods in Fig. 4.

V-B Bronchoscopy Localization Evaluation

All methods are trained and evaluated on phantom data and in-vivo patient data, respectively. Our method and CycleGAN additionally require virtual data captured with the corresponding intrinsic parameters.

Due to variations in the video capture frequencies, $m$ is set to 10 for phantom data and 3 for patient data to ensure sufficient co-visibility between the input frames for ego-motion estimation. We also observed large errors when testing EndoSLAM's localization performance on phantom data, owing to the lack of parallax between consecutive frames and the accumulative error of incremental pose inference. Instead, we test EndoSLAM by sampling one out of every ten frames of the phantom data as input frames.

Results on phantom and patient data are shown in Table II and Table III, respectively. The evaluation metrics follow existing standards [13, 31]. In terms of ATE on phantom and patient data, our method outperforms SOTA. The average ATE of our proposed framework is $4.7\pm 3.17$ mm in phantom and $6.49\pm 3.88$ mm in patient data, far exceeding the performance of EndoSLAM and of CycleGAN with depth registration, the most commonly reported approach in the depth-based bronchoscopic localization literature. Our method also achieves the highest SR-5 and SR-10, adding further proof of its superiority. We observe minor performance deterioration of our method from phantom data to patient data.

We provide an example case (Case-1) from the patient data, showing the located view in the virtual bronchoscope in Fig. 5. Fig. 6 shows the positions located by different frameworks in Case-3 against the ground truth.

Our overall localization framework reaches an average update frequency of 33.9 Hz for phantom data and 12.2 Hz for patient data, owing to the different $m$ values. Reported learning-based VNB algorithms that exceed our computational speed, including [20], [10] and [9], track the camera position with a single localization network, which accounts for their high computational speed. However, their methods are difficult, if not impossible, to generalize to unseen airways or branches.

TABLE II: Localization Results on Phantom

Methods | ATE (mm) | SR-5 (%) | SR-10 (%) | Runtime
EndoSLAM* (O) | 16.68 ± 9.12 | 6.10 | 30.60 | 141 Hz
CycleGAN+R | 12.28 ± 11.99 | 52.80 | 73.50 | 3.8 Hz
CycleGAN+E (O) | 9.08 ± 5.04 | 11.40 | 47.00 | 43 Hz
CycleGAN+E+R (O) | 7.45 ± 4.71 | 42.00 | 79.60 | 33.9 Hz
Ours w/o R (O) | 9.65 ± 3.96 | 9.80 | 61.60 | 43 Hz
Ours w/o E | 7.35 ± 6.1 | 50.20 | 70.20 | 3.8 Hz
Ours (O) | 4.7 ± 3.17 | 59.20 | 88.70 | 33.9 Hz

  • * denotes aligning global scale before evaluation. (O) denotes computational speed approaching video capture speed. E represents ego-motion estimation and R represents registration. ATE is given as the mean ± standard deviation. The best performance is indicated in bold.

V-C Ablation Studies for Consistency Losses

To equip our depth estimation network with scale perception, we integrate view consistency losses and a geometry consistency loss into the CycleGAN-style network training. With these consistency constraints, we expect lower scale drift in our depth estimation and thereby a reduced estimation error. Defining $L_{rec\_\hat{x}}$ as the view consistency loss for the generated bronchoscopic frame, $L_{rec\_x}$ as the view consistency loss for the real bronchoscopic frame, and $L_{gc}$ as the geometry consistency loss, we specifically investigate the following cases:

  1. our depth estimation with $L_{rec\_\hat{x}}$, $L_{rec\_x}$ and $L_{gc}$,

  2. our depth estimation with $L_{rec\_x}$ and $L_{gc}$, without $L_{rec\_\hat{x}}$,

  3. our depth estimation with $L_{gc}$, without $L_{rec\_\hat{x}}$ and $L_{rec\_x}$,

  4. our depth estimation without $L_{rec}$ and $L_{gc}$, which degenerates to CycleGAN.

The results for depth estimation are given in Table I. As seen from the quantitative ablation analysis, the consistency losses make our depth estimation more accurate on both phantom and patient data. The view consistency loss, which utilizes camera movement for bronchoscopic frame view synthesis, contributes the most to scale perception and in turn reduces the estimation error.

V-D Ablation Studies for Localization Framework

We conduct ablation studies to assess various combinations of depth estimation and pose inference within our localization framework. Table II and Table III show the results on the phantom and patient datasets, respectively. Initially, we evaluate our depth estimation technique against CycleGAN, highlighting its impact on bronchoscopic localization. We further examine the role of ego-motion estimation by comparing localization accuracy with and without it, ensuring consistent search spaces and hyperparameters across tests. Our findings underscore the significance of integrating accurate depth estimation and ego-motion inference for enhanced localization performance, particularly noting ego-motion's sensitivity to depth estimation scale drift. Results on both phantom and patient datasets confirm our framework's effectiveness, with ego-motion inference proving crucial for avoiding local minima in depth map registration.

TABLE III: Localization Results on Patient Data

Methods | ATE (mm) | SR-5 (%) | SR-10 (%) | Runtime
EndoSLAM* (O) | 12.48 ± 5.19 | 5.70 | 29.00 | 141 Hz
CycleGAN+R | 15.13 ± 10.07 | 21.70 | 54.50 | 3.8 Hz
CycleGAN+E (O) | 15.58 ± 9.29 | 12.20 | 58.00 | 43 Hz
CycleGAN+E+R (O) | 30.8 ± 16.61 | 8.52 | 20.95 | 12.2 Hz
Ours w/o R (O) | 11.13 ± 6.76 | 2.80 | 38.00 | 43 Hz
Ours w/o E | 15.97 ± 12.99 | 26.30 | 53.80 | 3.8 Hz
Ours (O) | 6.49 ± 3.88 | 47.10 | 85.00 | 12.2 Hz

  • * denotes aligning global scale before evaluation. (O) denotes computational speed approaching video capture speed. E represents ego-motion estimation and R represents registration. ATE is given as the mean ± standard deviation. The best performance is indicated in bold.

V-E Runtime

All methods are tested on a workstation with a 12th Gen Intel® Core™ i7-12700 CPU and an NVIDIA RTX 3090 GPU. Input images and depth maps are all cropped and resized to 256 × 256. Results are presented in Table II and Table III. The runtime difference of our localization framework between phantom and patient data stems from the different $m$ values. Depth map rendering takes up most of the time in the registration methods; rendering one frame takes about 1.7 ms in our implementation. Rendering acceleration should further speed up the overall framework.


Figure 5: Example of the located virtual view using different localization frameworks. E represents ego-motion estimation and R represents registration. Incremental tracking methods (including EndoSLAM and DD-VNB w/o R) are not included because most of their located views fall outside the airway model. Frames where tracking was lost are boxed in red.

Refer to caption

Figure 6: Tracked positions using different localization frameworks are plotted against ground truth. E represents ego-motion estimation and R represents registration. Ours shows better performance, with tracking positions closely following the ground truth trajectory.

VI Conclusion

Our study proposes a bronchoscopic localization framework featuring a knowledge-embedded depth estimation network within a dual-loop scheme for fast and accurate localization. Utilizing monocular frames for depth map inference and integrating both ego-motion estimation and airway CT registration, our work marks a stride toward real-time, learning-based bronchoscopic localization capable of adapting to unseen airways. Future enhancements will aim at further accuracy improvements through feature matching and relocalization strategies.


References

  • [1] Rebecca L Siegel, Kimberly D Miller, Nikita Sandeep Wagle and Ahmedin Jemal “Cancer statistics, 2023” In CA: a cancer journal for clinicians 73.1 Wiley Online Library, 2023, pp. 17–48
  • [2] Selma Metintaş “Epidemiology of Lung Cancer” In Airway diseases Springer, 2023, pp. 1–45
  • [3] Gerard J Criner et al. “Interventional bronchoscopy” In American journal of respiratory and critical care medicine 202.1 American Thoracic Society, 2020, pp. 29–50
  • [4] Janani Reisenauer et al. “Ion: technology and techniques for shape-sensing robotic-assisted bronchoscopy” In The Annals of thoracic surgery 113.1 Elsevier, 2022, pp. 308–315
  • [5] Elliot Ho, Roy Joseph Cho, Joseph C Keenan and Septimiu Murgu “The feasibility of using the “artery sign” for pre-procedural planning in navigational bronchoscopy for parenchymal pulmonary lesion sampling” In Diagnostics 12.12 MDPI, 2022, pp. 3059
  • [6] Erik E Folch et al. “Sensitivity and safety of electromagnetic navigation bronchoscopy for lung cancer diagnosis: systematic review and meta-analysis” In Chest 158.4 Elsevier, 2020, pp. 1753–1769
  • [7] Chaoyang Shi et al. “Shape sensing techniques for continuum robots in minimally invasive surgery: A survey” In IEEE Transactions on Biomedical Engineering 64.8 IEEE, 2016, pp. 1665–1678
  • [8] Marco Visentini-Scarzanella, Takamasa Sugiura, Toshimitsu Kaneko and Shinichiro Koto “Deep monocular 3D reconstruction for assisted navigation in bronchoscopy” In International journal of computer assisted radiology and surgery 12 Springer, 2017, pp. 1089–1099
  • [9] Cheng Zhao, Mali Shen, Li Sun and Guang-Zhong Yang “Generative localization with uncertainty estimation through video-CT data for bronchoscopic biopsy” In IEEE Robotics and Automation Letters 5.1 IEEE, 2019, pp. 258–265
  • [10] Jake Sganga, David Eng, Chauncey Graetzel and David B Camarillo “Autonomous driving in the lung using deep learning for localization” In arXiv preprint arXiv:1907.08136, 2019
  • [11] Kensaku Mori et al. “Tracking of a bronchoscope using epipolar geometry analysis and intensity-based image registration of real and virtual endoscopic images” In Medical Image Analysis 6.3 Elsevier, 2002, pp. 321–336
  • [12] Daisuke Deguchi et al. “Selective image similarity measure for bronchoscope tracking based on image registration” In Medical Image Analysis 13.4 Elsevier, 2009, pp. 621–633
  • [13] Mali Shen, Yun Gu, Ning Liu and Guang-Zhong Yang “Context-aware depth and pose estimation for bronchoscopic navigation” In IEEE Robotics and Automation Letters 4.2 IEEE, 2019, pp. 732–739
  • [14] Artur Banach et al. “Visually navigated bronchoscopy using three cycle-consistent generative adversarial network for depth estimation” In Medical image analysis 73 Elsevier, 2021, pp. 102164
  • [15] Tinghui Zhou, Matthew Brown, Noah Snavely and David G Lowe “Unsupervised learning of depth and ego-motion from video” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1851–1858
  • [16] Dominik Rivoir et al. “Long-term temporally consistent unpaired video translation from simulated surgical 3D data” In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3343–3353
  • [17] Kutsev Bengisu Ozyoruk et al. “EndoSLAM dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos” In Medical image analysis 71 Elsevier, 2021, pp. 102058
  • [18] Inbar Fried et al. “Landmark Based Bronchoscope Localization for Needle Insertion Under Respiratory Deformation” In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 6593–6600 IEEE
  • [19] Jianning Deng et al. “Feature-based Visual Odometry for Bronchoscopy: A Dataset and Benchmark” In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2023, pp. 6557–6564 IEEE
  • [20] Jake Sganga, David Eng, Chauncey Graetzel and David Camarillo “Offsetnet: Deep learning for localization in the lung using rendered images” In 2019 international conference on robotics and automation (ICRA), 2019, pp. 5046–5052 IEEE
  • [21] Mert Asim Karaoglu et al. “Adversarial domain feature adaptation for bronchoscopic depth estimation” In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part IV 24, 2021, pp. 300–310 Springer
  • [22] Juan Borrego-Carazo et al. “BronchoPose: an analysis of data and model configuration for vision-based bronchoscopy pose estimation” In Computer Methods and Programs in Biomedicine 228 Elsevier, 2023, pp. 107241
  • [23] Abhinav Valada, Noha Radwan and Wolfram Burgard “Deep auxiliary learning for visual localization and odometry” In 2018 IEEE international conference on robotics and automation (ICRA), 2018, pp. 6939–6946 IEEE
  • [24] Torsten Sattler, Qunjie Zhou, Marc Pollefeys and Laura Leal-Taixe “Understanding the limitations of cnn-based absolute camera pose regression” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3302–3312
  • [25] Shawn Mathew, Saad Nadeem, Sruti Kumari and Arie Kaufman “Augmenting colonoscopy using extended and directional cyclegan for lossy image translation” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4696–4705
  • [26] Jun-Yan Zhu, Taesung Park, Phillip Isola and Alexei A Efros “Unpaired image-to-image translation using cycle-consistent adversarial networks” In Proceedings of the IEEE international conference on computer vision, 2017, pp. 2223–2232
  • [27] Xudong Mao et al. “Least squares generative adversarial networks” In Proceedings of the IEEE international conference on computer vision, 2017, pp. 2794–2802
  • [28] Jiawang Bian et al. “Unsupervised scale-consistent depth and ego-motion learning from monocular video” In Advances in neural information processing systems 32, 2019
  • [29] Eddy Ilg et al. “Flownet 2.0: Evolution of optical flow estimation with deep networks” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2462–2470
  • [30] Roger Fletcher and Michael JD Powell “A rapidly convergent descent method for minimization” In The computer journal 6.2 Oxford University Press, 1963, pp. 163–168
  • [31] Yun Gu et al. “Vision–kinematics interaction for robotic-assisted bronchoscopy navigation” In IEEE Transactions on Medical Imaging 41.12 IEEE, 2022, pp. 3600–3610