
2023

1. Johns Hopkins University, Baltimore, MD, U.S.
2. Johns Hopkins Medicine, Baltimore, MD, U.S.

TAToo: Vision-based Joint Tracking of Anatomy and Tool for Skull-base Surgery

Zhaoshuo Li, Hongchao Shu, Ruixing Liang, Anna Goodridge, Manish Sahu, Francis X. Creighton, Russell H. Taylor, Mathias Unberath
Abstract

Purpose: Tracking the 3D motion of the surgical tool and the patient anatomy is a fundamental requirement for computer-assisted skull-base surgery. The estimated motion can be used both for intra-operative guidance and for downstream skill analysis. Recovering such motion solely from surgical videos is desirable, as it is compliant with current clinical workflows and instrumentation.

Methods: We present Tracker of Anatomy and Tool (TAToo). TAToo jointly tracks the rigid 3D motion of the patient skull and surgical drill from stereo microscopic videos. TAToo estimates motion via an iterative optimization process in an end-to-end differentiable form. For robust tracking performance, TAToo adopts a probabilistic formulation and enforces geometric constraints on the object level.

Results: We validate TAToo on both simulation data, where ground truth motion is available, and anthropomorphic phantom data, where optical tracking provides a strong baseline. We report sub-millimeter and millimeter inter-frame tracking accuracy for skull and drill, respectively, with rotation errors below 1°.

Conclusion: We present TAToo, which simultaneously tracks the surgical tool and the patient anatomy in skull-base surgery. TAToo directly predicts the motion from surgical videos, without the need for any markers. Our results show that the performance of TAToo compares favorably to competing approaches. Future work will include fine-tuning our depth network to reach the 1 mm clinical accuracy goal desired for surgical applications in the skull base.

keywords:
Image-based navigation, 3D motion tracking, Computer vision, Deep learning, Computer-assisted interventions

Code is available at: https://github.com/mli0603/TAToo

1 Introduction

Many otologic and neurosurgical procedures require that surgeons use a surgical drill to remove bone in the lateral skull base to gain access to the delicate structures therein. This process requires both high precision and a strong visual understanding of the anatomy in order to avoid damage to critical structures. If the rigid 3D motion of the surgical drill and the patient skull can be tracked relative to each other and provided to the surgeon, such an assistive system can potentially improve operative safety mezger2013navigation . The recovered 3D motion can also be used to assess surgical skill azari2019modeling in post-operative analysis.

Despite the widespread use of external tracking systems such as optical trackers, video-based solutions are desirable since they integrate seamlessly into existing surgical workflows. Furthermore, for the purposes of skill analysis, they would enable retrospective analysis of videos that were acquired without specialized external tracking instrumentation. However, tracking multiple objects in surgical video streams is challenging due to the need to identify individual objects and maintain cross-frame correspondences for motion estimation braspenning2004true . Prior video-based systems liu2022sage ; speers2018fast ; long2021dssr ; wang2022neural focus on tracking the patient anatomy with respect to the camera but disregard other objects of interest, such as surgical tools. This restricts their applicability to skull-base surgery, where the rigid motion of both the patient skull and the surgical drill is of interest. Other work lee2017multi ; gsaxner2021inside assumes no modification of the patient anatomy and requires precise 3D shapes, which is inapplicable to skull-base surgery.

Figure 1: Overview of TAToo estimating the 3D motion for patient skull and surgical drill. TAToo takes two stereo video frames across time as input (Sect. 3.1), pre-processes the frames (Sect. 3.2) and uses an iterative motion estimation process to regress the motion (Sect. 3.3).

We introduce TAToo (Fig. 1), which simultaneously tracks the rigid 3D motion of the patient skull and surgical drill relative to the microscope, without prior 3D information of the surgical scene. Given a stereo video stream as input, TAToo first uses off-the-shelf networks to estimate the stereo depth li2021revisiting ; li2021temporally ; tankovich2021hitnet and segmentation map shvets2018automatic as a pre-processing step. TAToo then iteratively updates the motion estimate of both patient skull and surgical drill. At each iteration, TAToo matches the correspondences, refines the correspondences in a probabilistic formulation, and lastly regresses consistent object motion based on geometric optimization. The whole method is end-to-end differentiable.

For evaluation, we specifically consider a skull-base surgical procedure named mastoidectomy, where the temporal bone is drilled. As no public dataset is available for our intended application, we collect a set of data for developing TAToo using both simulation and anthropomorphic phantom data emulating the surgical setup (see Fig. 3). We benchmark TAToo against other motion estimation techniques, including keypoint- and ICP-based algorithms. TAToo's performance compares favorably to competing approaches, and we further find it to be more robust. We show that TAToo achieves sub-millimeter and millimeter tracking accuracy for patient skull and surgical drill, respectively, with rotation errors below 1°. We lastly illustrate how TAToo may be used in a surgical navigation setting. Our contributions can be summarized as follows:

  • We present a novel framework, TAToo, that tracks the 3D motion of patient skull and surgical drill jointly from stereo microscopic videos.

  • We demonstrate that the recovered 3D motion from TAToo can be used in downstream applications, such as a surgical navigation setting.

2 Related Work on 3D Motion Estimation

Motion estimation in general requires matching correspondences temporally and regressing 3D motion from the matches. Keypoint-based approaches detect keypoints in images rublee2011orb , match keypoints across frames, and then estimate the motion using Procrustes-type registration kabsch1976solution . In surgical scenes, detecting keypoints can be challenging due to large homogeneous regions. Even if keypoints can be found, the sparsity of the detected points can lead to poor spatial configurations that are not adequate for motion estimation. ICP-based approaches besl1992method ; park2017colored instead iteratively find correspondences based on the most recent motion estimate and a set of distance criteria. While the correspondences are often dense and object-level rigidity is enforced, ICP is sensitive to outliers zhang2021fast , which frequently occur in stereo depth and segmentation estimates. Our approach, TAToo, builds upon besl1992method ; teed2021raft ; teed2021tangent to use all-pixel correspondences with a probabilistic formulation for improved performance. TAToo further enforces object-level constraints based on semantic image segmentation for effective motion tracking of multiple objects.
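As a point of reference for the keypoint-based baseline, the sketch below shows the Kabsch/Procrustes step such pipelines use to regress rigid motion from matched 3D keypoints; the function name and NumPy implementation are ours, not part of TAToo.

```python
import numpy as np

def kabsch_rigid_transform(src, dst):
    """Best-fit rotation R and translation t such that R @ src_i + t ≈ dst_i.

    src, dst: (N, 3) arrays of matched 3D keypoints (N >= 3, non-degenerate).
    """
    src_c = src - src.mean(axis=0)          # center both point sets
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                     # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t
```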

3 Approach

3.1 Input

We denote the left and right stereo images at a given time as one stereo frame. Given stereo frames at time $t$ and $t+1$ from a microscopic video stream, TAToo recovers the 3D motion of the patient skull and the surgical drill from $t$ to $t+1$ with respect to the left stereo camera. We use $H$ and $W$ to denote image height and width, subscripts $p$ for patient skull, and $d$ for surgical drill. We use $T\in\mathbb{SE}(3)$ for rigid 3D motion. The output of TAToo is $T_p$ and $T_d$, where for convenience of notation we have omitted the temporal dependence.

3.2 Pre-processing

We first use off-the-shelf networks tankovich2021hitnet ; shvets2018automatic to estimate the depth and segmentation information and the associated prediction probabilities. We denote the depth map as $D$ and the probability of the estimate as $\sigma(D)$. The segmentation map, denoted as $S$, groups pixels into different objects. Each pixel $i\in HW$ belongs to either the patient skull ($S^i=p$) or the surgical drill ($S^i=d$). The probabilities of the segmentation assignment are denoted as $\sigma(S)$. The depth and segmentation information is estimated for both frames at $t$ and $t+1$.
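A minimal sketch of this pre-processing step, assuming the segmentation network outputs per-class logits and the stereo network exposes a per-pixel confidence map; the names and tensor layout are illustrative, not the networks' actual interfaces.

```python
import torch
import torch.nn.functional as F

def preprocess(seg_logits, depth, depth_conf):
    """Per-pixel labels and confidences used downstream.

    seg_logits: (C, H, W) raw segmentation scores (e.g. C=2: skull, drill).
    depth:      (H, W) stereo depth estimate D.
    depth_conf: (H, W) confidence sigma(D) of the depth estimate, assumed in [0, 1].
    """
    probs = F.softmax(seg_logits, dim=0)   # class probabilities per pixel
    sigma_S, S = probs.max(dim=0)          # confidence sigma(S) and label S per pixel
    return {"S": S, "sigma_S": sigma_S, "D": depth, "sigma_D": depth_conf}
```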

3.3 Iterative Motion Estimation

Given an initial guess of the motion of the skull and drill, TAToo iterates between correspondence matching (Sect. 3.3.1), probabilistic refinement (Sect. 3.3.2), and geometric optimization (Sect. 3.3.3). In our work, the initial guess of the motion is set to be zero (i.e., the identity transformation).

3.3.1 Correspondence Matching

Given the most recent 3D motion estimate, we first compute the resulting correspondences across frames.

For each source pixel $i_t\in H_tW_t$ from frame $t$, we compute its target location $j_{t+1}$ in frame $t+1$ given the most recent motion estimate. The correspondence pair is thus formed as $(i_t, j_{t+1})$. To compute $j_{t+1}$, we use either the patient skull motion $T_p$ or the drill motion $T_d$ according to the segmentation prediction $S_t^i$:

$j_{t+1}=\pi\big(T\,\pi^{-1}(i_t)\big),\quad T=\begin{cases}T_p,&\text{if }S^i_t=p,\\ T_d,&\text{if }S^i_t=d,\end{cases}$   (1)

where $\pi$ and $\pi^{-1}$ are the perspective and inverse perspective projections for conversion between pixel and Cartesian world coordinates.
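A minimal sketch of Eqn. 1, assuming a pinhole camera model with known intrinsics and the per-pixel depth from the pre-processing step; the function and variable names are ours.

```python
import torch

def warp_correspondences(pix, depth, K, T_p, T_d, seg, drill_id=1):
    """Eqn. (1): map source pixels i_t to target locations j_{t+1}.

    pix:      (N, 2) pixel coordinates (x, y) in frame t.
    depth:    (N,) depth at those pixels.
    K:        (3, 3) intrinsics of the left camera.
    T_p, T_d: (4, 4) current motion estimates for skull / drill.
    seg:      (N,) per-pixel segmentation labels in frame t.
    """
    # inverse perspective projection: pixel + depth -> 3D point in camera frame
    ones = torch.ones_like(depth)
    rays = torch.linalg.inv(K) @ torch.stack([pix[:, 0], pix[:, 1], ones])  # (3, N)
    X = rays * depth
    X_h = torch.cat([X, ones[None]], dim=0)                                 # homogeneous (4, N)

    # pick the per-pixel motion according to the segmentation label
    T = torch.where((seg == drill_id)[None, None], T_d[..., None], T_p[..., None])  # (4, 4, N)
    X_w = torch.einsum('ijn,jn->in', T, X_h)[:3]                            # transformed points

    # perspective projection back to pixel coordinates in frame t+1
    proj = K @ X_w
    return (proj[:2] / proj[2:3]).T                                         # (N, 2) targets j_{t+1}
```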

Figure 2: Qualitative visualization of the motion features. (a) We visualize a $1\times 1\times H_{t+1}\times W_{t+1}$ slice of the 4D correlation volume. This slice corresponds to the correlation between a source pixel in frame $t$ (black) and all pixels in frame $t+1$. Brighter color indicates larger correlation, i.e., the yellow target pixel has a larger correlation with the source pixel than the red target pixel. (b) Scene flow between correspondence pairs.

3.3.2 Probabilistic Refinement

As the correspondence pairs computed from the most recent motion estimate contain outliers, we perform a refinement on these correspondence pairs and estimate the confidence of the refined results. Following lietorch teed2021tangent , we use a deep learning network to learn how to perform such refinement. An illustration of the network design is in Appendix A.

Given a correspondence pair $(i_t, j_{t+1})$, we first evaluate the fitness of the match. Our network extracts features from the left images of both stereo frames and builds a 4D correlation volume teed2021raft of size $H_t\times W_t\times H_{t+1}\times W_{t+1}$, where the dot-product correlation between all features in frame $t$ and all features in frame $t+1$ is evaluated. A larger value in the correlation volume indicates a more probable match. We retrieve the correlation between $i_t$ and $j_{t+1}$ from the 4D correlation volume. We also compute the resulting scene flow as $\hat{f}^i=j_{t+1}-i_t$. Both the correlation values and the scene flow are used for the refinement prediction. A qualitative visualization is shown in Fig. 2.
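A simplified sketch of building the correlation volume and retrieving the correlation at the current target locations, assuming single-scale features (the method follows the correlation-volume design of teed2021raft ; names are illustrative).

```python
import torch
import torch.nn.functional as F

def correlation_lookup(feat_t, feat_t1, j_t1):
    """Build the all-pairs correlation volume and sample it at j_{t+1}.

    feat_t, feat_t1: (C, H, W) left-image features of frames t and t+1.
    j_t1:            (H*W, 2) current target pixel locations (x, y), float.
    Returns the correlation between every source pixel and its current match.
    """
    C, H, W = feat_t.shape
    # dot-product correlation between all source and all target features
    corr = torch.einsum('chw,cij->hwij', feat_t, feat_t1) / C ** 0.5

    # bilinearly sample the H_{t+1} x W_{t+1} slice at each target location
    corr = corr.reshape(H * W, 1, H, W)          # one slice per source pixel
    grid = j_t1.clone()
    grid[:, 0] = 2 * grid[:, 0] / (W - 1) - 1    # normalize x to [-1, 1]
    grid[:, 1] = 2 * grid[:, 1] / (H - 1) - 1    # normalize y to [-1, 1]
    grid = grid.reshape(H * W, 1, 1, 2)
    return F.grid_sample(corr, grid, align_corners=True).reshape(H * W)
```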

Using the correlation and scene flow as input, the network then updates the correspondence pair by predicting a residual update $\Delta j_{t+1}$ to the target location $j_{t+1}$ while fixing the source location $i_t$:

$\hat{j}_{t+1}=j_{t+1}+\Delta j_{t+1}\,.$   (2)

In order to evaluate how confident the network is about the update, we also output a probability of this residual update via a sigmoid layer. The refinement probability is denoted as $\sigma(\Delta j_{t+1})$.

Given the updated correspondence pair $(i_t, \hat{j}_{t+1})$, we compute the joint probability $\sigma(i_t,\hat{j}_{t+1})$ of the matching. Intuitively, this joint probability indicates the reliability of the current correspondence pair. We decompose the joint probability $\sigma(i_t,\hat{j}_{t+1})$ into two terms: 1) the confidence of the information we know about the source point, $\sigma(i_t)$, and 2) the confidence of the information we know about the target point, $\sigma(\hat{j}_{t+1}|i_t)$:

$\sigma(i_t,\hat{j}_{t+1})=\underbrace{\sigma(S^i_t)\,\sigma(D^i_t)}_{\sigma(i_t)\text{: estimate probability}}\;\underbrace{\sigma(\Delta j_{t+1})\,\sigma(S^j_{t+1})\,\sigma(D^j_{t+1})}_{\sigma(\hat{j}_{t+1}|i_t)\text{: correspondence probability}}\,,$   (3)

where $\sigma(S^i_t),\sigma(D^i_t)$ are the confidences of segmentation and depth at $i_t$ in frame $t$, and $\sigma(S^j_{t+1}),\sigma(D^j_{t+1})$ are the confidences of segmentation and depth at the target $\hat{j}_{t+1}$ in frame $t+1$. With a slight abuse of notation, the correspondence probability $\sigma(\hat{j}_{t+1}|i_t)$ is written as a conditional probability because the target locations $\hat{j}_{t+1}$ are computed from $i_t$. Our network uses a GRU design with convolution layers, following prior work on recurrent optimization teed2021raft .
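For clarity, Eqn. 3 amounts to a per-pixel product of the available confidence maps; a minimal sketch under that reading (names are ours).

```python
def joint_probability(sigma_S_t, sigma_D_t, sigma_dj, sigma_S_t1, sigma_D_t1):
    """Eqn. (3): per-correspondence weight sigma(i_t, j_hat_{t+1}).

    All inputs are per-pixel confidences in [0, 1], gathered at the source
    location i_t (frame t) and at the refined target j_hat_{t+1} (frame t+1);
    sigma_dj is the refinement probability from the network's sigmoid head.
    """
    source_term = sigma_S_t * sigma_D_t              # sigma(i_t): estimate probability
    corr_term = sigma_dj * sigma_S_t1 * sigma_D_t1   # sigma(j_hat | i_t): correspondence probability
    return source_term * corr_term
```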

3.3.3 Geometric Optimization

Given the matched points and the associated confidence, we regress the motion of both the skull and tool with geometric constraints. We employ Gauss-Newton optimization steps over $\mathbb{SE}(3)$, following its recent success in single-object non-rigid tracking bozic2020neural , structure from motion lindenberger2021pixel , and SLAM teed2021tangent . We estimate motion by minimizing the perspective projection error between the target location $\hat{j}_{t+1}$ and the transformed pixel locations from $i_t$, weighted by the joint probabilities in Eqn. 3:

E(T)=itHtWtσ(it,j^t+1)j^t+1π(Tπ1(it))2,T={Tp,ifSti=p,Td,ifSti=d,E(T)=\sum_{i_{t}\in H_{t}W_{t}}\sigma(i_{t},\hat{j}_{t+1})\cdot\|\hat{j}_{t+1}-\pi\big{(}T\pi^{-1}(i_{t})\big{)}\|_{2},\,\,T=\begin{cases}T_{p},&\text{if}\,\,S^{i}_{t}=p,\\ T_{d},&\text{if}\,\,S^{i}_{t}=d,\end{cases} (4)

where $T_p, T_d$ are the variables being optimized. The intuition behind Eqn. 4 is that, given the perspective projection relationship and the network-predicted correspondences and probabilities, it finds the motions $T_p$ and $T_d$ that best explain the predicted correspondences $(i_t,\hat{j}_{t+1})$.
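A minimal sketch of the objective in Eqn. 4 for a single object, using SciPy's generic least-squares solver with an axis-angle parameterization instead of the paper's differentiable Gauss-Newton steps on $\mathbb{SE}(3)$, and minimizing the squared form of the weighted residual; names are ours.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def solve_object_motion(X_t, j_t1, w, K):
    """Minimize a weighted reprojection error in the spirit of Eqn. (4) for one object.

    X_t:  (N, 3) back-projected source points pi^{-1}(i_t).
    j_t1: (N, 2) refined target pixels j_hat_{t+1}.
    w:    (N,)   joint probabilities sigma(i_t, j_hat_{t+1}).
    K:    (3, 3) camera intrinsics.
    Returns a 6-vector (rotation vector, translation) for the object motion.
    """
    def residuals(xi):
        R = Rotation.from_rotvec(xi[:3]).as_matrix()
        X = X_t @ R.T + xi[3:6]                  # apply candidate motion
        proj = X @ K.T
        pix = proj[:, :2] / proj[:, 2:3]         # perspective projection
        return (np.sqrt(w)[:, None] * (pix - j_t1)).ravel()

    xi0 = np.zeros(6)                            # identity initialization
    return least_squares(residuals, xi0, method='lm').x
```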

3.4 Supervision

We adopt a trained off-the-shelf depth network tankovich2021hitnet for estimating depth information. We fine-tune a segmentation network shvets2018automatic on our dataset.

We then train the network in the motion estimation process (Sect. 3.3.2) and impose a loss on the geodesic distance teed2021tangent between the ground truth and predicted motion for each object on the Lie manifold of $\mathbb{SE}(3)$:

$\ell_{\text{geo}}=\|\tau_p\|_2+\|\phi_p\|_2+\|\tau_d\|_2+\|\phi_d\|_2\,,$   (5)
$(\tau,\phi)=\log\big(T^{\text{GT}}T^{-1}\big)\,,$   (6)

where $T^{\text{GT}}$ is the ground truth motion, $\tau$ is the translation vector, and $\phi$ is the Rodrigues rotation vector.
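A simplified sketch of the per-object error terms in Eqns. 5-6, splitting the relative transform into its translation part and the rotation vector of its rotation part rather than taking the exact $\mathbb{SE}(3)$ logarithm used with lietorch; names are ours.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def geodesic_error(T_gt, T_est):
    """Translation / rotation error norms of a relative transform (cf. Eqns. 5-6).

    T_gt, T_est: (4, 4) ground-truth and predicted rigid motions.
    """
    T_rel = T_gt @ np.linalg.inv(T_est)                      # residual motion
    tau = T_rel[:3, 3]                                       # translation component
    phi = Rotation.from_matrix(T_rel[:3, :3]).as_rotvec()    # Rodrigues rotation vector
    return np.linalg.norm(tau), np.linalg.norm(phi)
```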

Given the ground truth motion, we also derive the match locations $j^{\text{GT}}_{t+1}$ and the scene flow $f^{i,\text{GT}}$ for additional supervision, and thus impose losses as:

$\ell_{\text{match}}=\frac{1}{H_tW_t}\sum_{i_t\in H_tW_t}\big|j^{\text{GT}}_{t+1}-j_{t+1}\big|\,,$   (7)
$\ell_{\text{flow}}=\frac{1}{H_tW_t}\sum_{i_t\in H_tW_t}\big|f^{i,\text{GT}}-f^i\big|\,.$   (8)

We note that $\ell_{\text{match}}$ is used to supervise the recurrent network's incremental estimates $\Delta j_{t+1}$, and $\ell_{\text{flow}}$ is used to supervise the motion estimate from the geometric optimization. The refinement probability $\sigma(\Delta j_{t+1})$ is learned implicitly, without any supervision.

The above loss is computed for estimates at each iteration. The summed loss is weighted differently at each optimization iteration:

=k[1,𝒦]0.3𝒦k(wgeogeo,k+wmatchmatch,k+wflowflow,k),\ell=\sum_{k\in[1,\mathcal{K}]}0.3^{\mathcal{K}-k}(w_{\text{geo}}\ell_{\text{geo,k}}+w_{\text{match}}\ell_{\text{match,k}}+w_{\text{flow}}\ell_{\text{flow,k}})\,, (9)

where the final iteration is weighted the most. We set $w_{\text{geo}}=10.0$, $w_{\text{flow}}=0.1$, and $w_{\text{match}}=0.1$ to balance the loss magnitudes.
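A minimal sketch of the weighting scheme in Eqn. 9; the list layout of per-iteration losses is an assumption of ours.

```python
def total_loss(per_iter_losses, w_geo=10.0, w_match=0.1, w_flow=0.1):
    """Eqn. (9): exponentially weighted sum over the K refinement iterations.

    per_iter_losses: list of (l_geo, l_match, l_flow) tuples, one per iteration,
    ordered from the first to the last (K-th) iteration.
    """
    K = len(per_iter_losses)
    total = 0.0
    for k, (l_geo, l_match, l_flow) in enumerate(per_iter_losses, start=1):
        # 0.3^(K-k): the final iteration (k = K) receives the largest weight
        total += 0.3 ** (K - k) * (w_geo * l_geo + w_match * l_match + w_flow * l_flow)
    return total
```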

Figure 3: Data collection setup: (a) simulation environment, (b) surgical phantom with optical tracking as the baseline.

4 Experimental Setup

4.1 Data

Simulation We use a drilling simulator munawar2021virtual ; munawar2023fully to generate synthetic data from three different CT scans ding2021automated ; ding2022automated ; ding2023self drilled by surgical residents, as shown in Fig. 3(a). For each CT scan, 1500 instances of data are recorded (4500 in total). The simulation data contains ground truth depth, segmentation, and motion. The image resolution is $640\times 480$. We use two sequences for training/validation and one sequence for testing.

Phantom We additionally collect four video sequences of surgical phantom data as shown in Fig. 3(b). We use the Atracsys fusionTrack optical tracker (https://www.atracsys-measurement.com/products/fusiontrack-500/) and mount retro-reflective tracking markers on the surgical microscope, phantom, and drill to acquire individual poses, which are used to compute inter-frame motion. We collected a total of 13,915 instances of data. The image resolution is $960\times 540$. We use three sequences for training/validation and one sequence for testing.
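A sketch of how such inter-frame object motion can be derived from tracker poses, assuming hand-eye and marker-to-object calibrations have already been folded into the reported poses; names are ours.

```python
import numpy as np

def interframe_motion(T_trk_cam_t, T_trk_obj_t, T_trk_cam_t1, T_trk_obj_t1):
    """Object motion in the camera frame between frames t and t+1.

    T_trk_*: (4, 4) poses of the camera (microscope) and the object (skull or
    drill) reported by the optical tracker at frames t and t+1.
    """
    T_cam_obj_t = np.linalg.inv(T_trk_cam_t) @ T_trk_obj_t      # object in camera, frame t
    T_cam_obj_t1 = np.linalg.inv(T_trk_cam_t1) @ T_trk_obj_t1   # object in camera, frame t+1
    # motion T such that X_cam(t+1) = T X_cam(t) for points rigidly attached to the object
    return T_cam_obj_t1 @ np.linalg.inv(T_cam_obj_t)
```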

4.2 Training and Evaluation Setup

We train TAToo on synthetic data and then fine-tune on surgical phantom data. The initial motion estimate is set to the identity transformation. We use a pre-trained depth network tankovich2021hitnet since we do not have ground truth depth for the phantom data. We manually annotate 100 frames to fine-tune the segmentation network shvets2018automatic for the surgical phantom data. We set the number of iterations to $\mathcal{K}=3$. We use random cropping and color augmentations during training. We further sub-sample and reverse the video frames to augment the motion data, as sketched below. We use an 80-20 train-validation split.
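A sketch of the motion augmentation by frame sub-sampling and reversal, under the convention (ours) that the ground-truth motion of index $k$ maps points from frame $k$ to frame $k+1$.

```python
import numpy as np

def augment_motion_pair(frames, motions, stride=2, reverse=False):
    """Sub-sample and/or reverse a frame pair to augment motion magnitudes.

    frames:  list of stereo frames.
    motions: list of (4, 4) ground-truth inter-frame motions; motions[k]
             maps frame k to frame k+1.
    """
    # sub-sampling by `stride` composes the intermediate motions
    T = np.eye(4)
    for k in range(stride):
        T = motions[k] @ T
    pair = (frames[0], frames[stride])
    if reverse:                          # reversed playback inverts the motion
        pair = (pair[1], pair[0])
        T = np.linalg.inv(T)
    return pair, T
```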

For evaluation, we benchmark different motion regression algorithms taking the depth and segmentation estimates as given input. We report the mean L2 norm of the translation and rotation error vectors (Eqn. 6) and threshold metrics of 1 mm and 1° over the entire video sequence. We compare our motion estimation technique against a keypoint-based approach using ORB features and brute-force matching rublee2011orb , and the colored ICP algorithm park2017colored implemented in Open3D (http://www.open3d.org/).
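A minimal sketch of how such summary metrics could be computed, assuming the per-frame error norms have already been obtained as in Eqn. 6; names are ours.

```python
import numpy as np

def tracking_metrics(tau_errs, phi_errs, tau_thresh=1.0, phi_thresh=1.0):
    """Summary metrics over a video sequence.

    tau_errs: per-frame translation error norms in mm.
    phi_errs: per-frame rotation error norms in degrees.
    """
    tau_errs, phi_errs = np.asarray(tau_errs), np.asarray(phi_errs)
    return {
        "mean_trans_mm": tau_errs.mean(),
        "mean_rot_deg": phi_errs.mean(),
        "pct_trans_under_1mm": (tau_errs < tau_thresh).mean() * 100,
        "pct_rot_under_1deg": (phi_errs < phi_thresh).mean() * 100,
    }
```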

5 Results and Discussion

5.1 Tracking Accuracy

Table 1 summarizes the results from both synthetic and surgical phantom data. In both cases, our method outperforms competing approaches by a large margin.

The keypoint-based approach performs worst due to the sparsity and poor spatial configuration of matches, especially for the surgical drill. When there are fewer than 3 keypoints detected, the keypoint-based approach fails to recover the motion, resulting in high failure rates on both datasets. Even for the surgical phantom, many keypoints are clustered in local patches, which is undesirable for motion estimation. In contrast, ICP and TAToo both use dense correspondences to avoid such failure cases. We visualize the matches found by our method in Fig. 5(a), where the correspondences are evenly distributed across the objects.

ICP is also inferior to TAToo due to insufficient outlier rejection during correspondence search, since rejection is based solely on distance and color zhang2021fast . In contrast, TAToo uses a fully probabilistic formulation for the matched correspondences, considering the confidences of depth, segmentation, and matches when regressing the motion. Thus, TAToo can mitigate the effect of outliers and estimate motion more robustly. The violin plots of motion errors in Fig. 4 show that TAToo indeed produces fewer extreme outliers in its predictions due to this improved robustness.

On both datasets, our approach has better tracking performance for the patient skull than for the surgical drill. This is attributed to the larger inter-frame motion of the surgical drill. Indeed, the average motion is 0.1 mm and 0.03° for the skull, but 1.1 mm and 0.38° for the drill.

While our approach compares favorably to other techniques, we note that our performance deteriorates on surgical phantom data compared to synthetic data, especially for surgical drill tracking. We attribute this deterioration, at least partially, to the sim-to-real transfer challenge of the stereo depth estimation network, as we do not have ground truth data to fine-tune the model. We ablate the impact of depth and segmentation quality on motion tracking accuracy and present the results in Appendix B. Qualitative results of the depth and segmentation estimates are shown in Appendix C. Investigating techniques to alleviate this sim-to-real transfer issue is left for future work.

Table 1: Benchmark results on synthetic and surgical phantom data. For all metrics, lower is better. $\|\tau\|_2$: translation error. $\|\phi\|_2$: rotation error. Failure rate: percentage of the video where motion cannot be recovered.

Synthetic Data
Method | $\|\tau_p\|_2$ (mm) | $\|\phi_p\|_2$ (°) | $\|\tau_d\|_2$ (mm) | $\|\phi_d\|_2$ (°) | Failure Rate
Keypoint | 29.9 ± 34.2 | 2.4 ± 2.8 | 10.8 ± 7.8 | 34.0 ± 38.5 | 19%
ICP | 1.5 ± 2.5 | 0.1 ± 0.1 | 2.9 ± 3.9 | 4.2 ± 25.0 | 0%
TAToo (ours) | 0.5 ± 0.9 | 0.1 ± 0.1 | 1.1 ± 1.8 | 0.2 ± 0.4 | 0%

Surgical Phantom Data
Method | $\|\tau_p\|_2$ (mm) | $\|\phi_p\|_2$ (°) | $\|\tau_d\|_2$ (mm) | $\|\phi_d\|_2$ (°) | Failure Rate
Keypoint | 7.5 ± 4.8 | 0.8 ± 0.5 | 10.3 ± 2.03 | 16.5 ± 36.5 | 30%
ICP | 0.7 ± 0.7 | 0.1 ± 0.1 | 9.7 ± 4.5 | 2.0 ± 13.9 | 0%
TAToo (ours) | 0.2 ± 0.2 | 0.1 ± 0.1 | 4.8 ± 3.5 | 0.7 ± 0.8 | 0%
Figure 4: Violin plots of motion errors comparing ICP and TAToo. TAToo produces far fewer outliers in motion prediction than ICP due to its probabilistic formulation for rejecting outliers.
Figure 5: (a) Spatial distribution of keypoints (yellow crosses, left) and correspondence probabilities $\sigma(i_t,\hat{j}_{t+1})$ of our method (colormap, right), both overlaid on frame $t$. Our probable correspondences are dense and distributed, which is better conditioned for motion estimation. (b) Plot of the surgical drill trajectory in patient coordinates, as used in a surgical navigation system.

5.2 Downstream Application - Surgical Navigation

We apply our approach to surgical navigation, where inter-frame motion predictions are chained to obtain absolute poses. We report the average drill-to-skull transformation error over the entire video sequence on the validation set of the synthetic data. The average error is 3.6 mm and 0.3°, demonstrating the applicability of the TAToo video-based tracking paradigm in such a setting. A qualitative visualization is shown in Fig. 5(b), and additional visualizations can be found in the supplementary video. While promising, accumulating relative poses for navigation inevitably introduces drift. Additional mechanisms, such as pose graph bundle adjustment, are required to meet the often-referenced clinical requirement of <1 mm accuracy schneider2021evolution .
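A minimal sketch of this chaining, expressing the drill pose in a patient coordinate frame that coincides with the camera frame at the first time step; the names and conventions are ours.

```python
import numpy as np

def chain_relative_poses(T_p_list, T_d_list):
    """Accumulate per-frame motions into drill-to-skull transforms.

    T_p_list, T_d_list: per-frame (4, 4) inter-frame motions of skull and drill
    in the camera frame, as predicted frame by frame.
    """
    A_p, A_d = np.eye(4), np.eye(4)     # accumulated motions relative to the first frame
    drill_in_patient = []
    for T_p, T_d in zip(T_p_list, T_d_list):
        A_p = T_p @ A_p                 # chain skull motion
        A_d = T_d @ A_d                 # chain drill motion
        # drill pose expressed in the (moving) patient/skull coordinate frame
        drill_in_patient.append(np.linalg.inv(A_p) @ A_d)
    return drill_in_patient
```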

6 Conclusion

We present a stereo video-based 3D motion estimation approach that simultaneously tracks the patient skull and the surgical drill. Our proposed iterative optimization approach combines learning-based correspondence matching with geometric optimization and a probabilistic formulation. Experiments on simulation and phantom data demonstrate that our approach outperforms competing motion estimation methods.

While we show promising results, our evaluation is limited to simulation and phantom data. We plan to further improve our method, collect an in-vivo dataset, and expand our analysis. Moreover, while TAToo outperforms other image-based tracking algorithms, its accuracy does not currently meet our 1 mm clinical accuracy goal, which we attribute to the lack of fine-tuning of the depth and segmentation networks. Future work includes collecting frame-wise ground truth data for supervising our depth and segmentation networks using a scalable data collection setup such as digital twins shu2023twin .

Acknowledgments

This work was supported in part by Johns Hopkins University internal funds, and in part by NIDCD K08 Grant DC019708.

Declarations

Conflict of interest

Dr. Russell H. Taylor and Johns Hopkins University may be entitled to royalty payments related to technology discussed in this paper, and Dr. Taylor has received or may receive some portion of these royalties. Also, Dr. Taylor is a paid consultant to and owns equity in Galen Robotics, Inc. These arrangements have been reviewed and approved by Johns Hopkins University in accordance with its conflict of interest policy.

References

  • (1) Mezger, U., Jendrewski, C., Bartels, M.: Navigation in surgery. Langenbeck’s archives of surgery 398(4), 501–514 (2013)
  • (2) Azari, D.P., Frasier, L.L., Quamme, S.R.P., Greenberg, C.C., Pugh, C., Greenberg, J.A., Radwin, R.G.: Modeling surgical technical skill using expert assessment for automated computer rating. Annals of surgery 269(3), 574 (2019)
  • (3) Braspenning, R.A., de Haan, G.: True-motion estimation using feature correspondences. In: Visual Communications and Image Processing 2004, vol. 5308, pp. 396–407 (2004). SPIE
  • (4) Liu, X., Li, Z., Ishii, M., Hager, G.D., Taylor, R.H., Unberath, M.: Sage: Slam with appearance and geometry prior for endoscopy. In: 2022 International Conference on Robotics and Automation (ICRA), pp. 5587–5593 (2022). https://doi.org/10.1109/ICRA46639.2022.9812257
  • (5) Speers, A.D., Ma, B., Jarnagin, W.R., Himidan, S., Simpson, A.L., Wildes, R.P.: Fast and accurate vision-based stereo reconstruction and motion estimation for image-guided liver surgery. Healthcare technology letters 5(5), 208–214 (2018)
  • (6) Long, Y., Li, Z., Yee, C.H., Ng, C.F., Taylor, R.H., Unberath, M., Dou, Q.: E-dssr: efficient dynamic surgical scene reconstruction with transformer-based stereoscopic depth perception. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 415–425 (2021). Springer
  • (7) Wang, Y., Long, Y., Fan, S.H., Dou, Q.: Neural rendering for stereo 3d reconstruction of deformable tissues in robotic surgery. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part VII, pp. 431–441 (2022). Springer
  • (8) Lee, S.C., Fuerst, B., Tateno, K., Johnson, A., Fotouhi, J., Osgood, G., Tombari, F., Navab, N.: Multi-modal imaging, model-based tracking, and mixed reality visualisation for orthopaedic surgery. Healthcare technology letters 4(5), 168–173 (2017)
  • (9) Gsaxner, C., Li, J., Pepe, A., Schmalstieg, D., Egger, J.: Inside-out instrument tracking for surgical navigation in augmented reality. In: Proceedings of the 27th ACM Symposium on Virtual Reality Software and Technology, pp. 1–11 (2021)
  • (10) Li, Z., Liu, X., Drenkow, N., Ding, A., Creighton, F.X., Taylor, R.H., Unberath, M.: Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6197–6206 (2021)
  • (11) Li, Z., Ye, W., Wang, D., Creighton, F.X., Taylor, R.H., Venkatesh, G., Unberath, M.: Temporally consistent online depth estimation in dynamic scenes. arXiv preprint arXiv:2111.09337 (2021)
  • (12) Tankovich, V., Hane, C., Zhang, Y., Kowdle, A., Fanello, S., Bouaziz, S.: Hitnet: Hierarchical iterative tile refinement network for real-time stereo matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14362–14372 (2021)
  • (13) Shvets, A.A., Rakhlin, A., Kalinin, A.A., Iglovikov, V.I.: Automatic instrument segmentation in robot-assisted surgery using deep learning. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 624–628 (2018). IEEE
  • (14) Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: Orb: An efficient alternative to sift or surf. In: 2011 International Conference on Computer Vision, pp. 2564–2571 (2011). IEEE
  • (15) Kabsch, W.: A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography 32(5), 922–923 (1976)
  • (16) Besl, P.J., McKay, N.D.: Method for registration of 3-d shapes. In: Sensor Fusion IV: Control Paradigms and Data Structures, vol. 1611, pp. 586–606 (1992). SPIE
  • (17) Park, J., Zhou, Q.-Y., Koltun, V.: Colored point cloud registration revisited. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 143–152 (2017)
  • (18) Zhang, J., Yao, Y., Deng, B.: Fast and robust iterative closest point. IEEE Transactions on Pattern Analysis and Machine Intelligence (2021)
  • (19) Teed, Z., Deng, J.: Raft-3d: Scene flow using rigid-motion embeddings. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8375–8384 (2021)
  • (20) Teed, Z., Deng, J.: Tangent space backpropagation for 3d transformation groups. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10338–10347 (2021)
  • (21) Bozic, A., Palafox, P., Zollhöfer, M., Dai, A., Thies, J., Nießner, M.: Neural non-rigid tracking. Advances in Neural Information Processing Systems 33, 18727–18737 (2020)
  • (22) Lindenberger, P., Sarlin, P.-E., Larsson, V., Pollefeys, M.: Pixel-perfect structure-from-motion with featuremetric refinement. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5987–5997 (2021)
  • (23) Munawar, A., Li, Z., Kunjam, P., Nagururu, N., Ding, A.S., Kazanzides, P., Looi, T., Creighton, F.X., Taylor, R.H., Unberath, M.: Virtual reality for synergistic surgical training and data generation. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization, 1–9 (2021)
  • (24) Munawar, A., Li, Z., Nagururu, N., Trakimas, D., Kazanzides, P., Taylor, R.H., Creighton, F.X.: Fully immersive virtual reality for skull-base surgery: Surgical training and beyond. arXiv preprint arXiv:2302.13878 (2023)
  • (25) Ding, A.S., Lu, A., Li, Z., Galaiya, D., Siewerdsen, J.H., Taylor, R.H., Creighton, F.X.: Automated registration-based temporal bone computed tomography segmentation for applications in neurotologic surgery. Otolaryngology–Head and Neck Surgery 167(1), 133–140 (2022)
  • (26) Ding, A.S., Lu, A., Li, Z., Galaiya, D., Ishii, M., Siewerdsen, J.H., Taylor, R.H., Creighton, F.X.: Automated extraction of anatomical measurements from temporal bone ct imaging. Otolaryngology–Head and Neck Surgery 167(4), 731–738 (2022)
  • (27) Ding, A.S., Lu, A., Li, Z., Sahu, M., Galaiya, D., Siewerdsen, J.H., Unberath, M., Taylor, R.H., Creighton, F.X.: A self-configuring deep learning network for segmentation of temporal bone anatomy in cone-beam ct imaging. Otolaryngology–Head and Neck Surgery (2023)
  • (28) Schneider, D., Hermann, J., Mueller, F., Braga, G.O.B., Anschuetz, L., Caversaccio, M., Nolte, L., Weber, S., Klenzner, T.: Evolution and stagnation of image guidance for surgery in the lateral skull: a systematic review 1989–2020. Frontiers in surgery 7, 604362 (2021)
  • (29) Shu, H., Liang, R., Li, Z., Goodridge, A., Zhang, X., Ding, H., Nagururu, N., Sahu, M., Creighton, F.X., Taylor, R.H., et al.: Twin-s: a digital twin for skull base surgery. International Journal of Computer Assisted Radiology and Surgery, 1–8 (2023)