
A Disparity Refinement Framework for Learning-based Stereo Matching Methods in Cross-domain Setting for Laparoscopic Images

Zixin Yang [1] ([email protected]), Richard Simon ([email protected]), Cristian A. Linte ([email protected])

[1] Center for Imaging Science, Rochester Institute of Technology, 1 Lomb Memorial Dr, Rochester, NY 14623, USA

[2] Biomedical Engineering, Rochester Institute of Technology, 1 Lomb Memorial Dr, Rochester, NY 14623, USA
Abstract

Purpose: Stereo matching methods that enable depth estimation are crucial for visualization enhancement applications in computer-assisted surgery (CAS). Learning-based stereo matching methods show promise for predicting accurate results on laparoscopic images. However, they require large amounts of training data, and their performance may degrade due to domain shifts.

Methods: Maintaining robustness and improving the accuracy of learning-based methods remain open problems. To overcome their limitations, we propose a disparity refinement framework, consisting of a local disparity refinement method and a global disparity refinement method, that improves the results of learning-based stereo matching methods in a cross-domain setting. The learning-based stereo matching methods are pre-trained on a large public dataset of natural images and tested on two datasets of laparoscopic images.

Results: Qualitative and quantitative results suggest that our proposed disparity refinement framework can effectively refine disparity maps when they are noise-corrupted on an unseen dataset, without compromising prediction accuracy when the network generalizes well to an unseen dataset.

Conclusion: Our proposed disparity refinement framework can work with learning-based methods to achieve robust and accurate disparity prediction. Since a large laparoscopic dataset for training learning-based methods does not exist and the generalization ability of networks remains to be improved, incorporating the proposed disparity refinement framework into existing networks will improve their overall accuracy and robustness in depth estimation.

keywords:
Stereo Matching, Disparity Refinement, Endoscopy, Variational Model, Cross-domain Generalization, Optical Flow

1 Introduction

Stereo endoscopy is commonly used to enable depth estimation in laparoscopic surgery mountney2010three ; lin2016video ; allan2021stereo ; edwards2020serv . Stereo correspondences, represented by disparity maps, can be estimated via stereo matching techniques scharstein2002taxonomy to provide depth measurements given known intrinsic and extrinsic camera calibration.

Depth information offered by stereo matching can play a key role in surgical navigation lin2016video and visualization enhancement applications mountney2010three ; modrzejewski2019vivo . Depth information can improve surgical performance by enabling the tracking of tissue surface deformations for rendering a motion-stabilized view and avoiding surgical instrument collisions with critical anatomical structures mountney2010three ; lin2016video . Depth estimation also benefits downstream tasks, such as 3D surface reconstruction/SLAM zhou2019real ; recasens2021endo ; wei2022stereo ; yang2022endoscope and registration modrzejewski2019vivo algorithms, which are the foundations of advanced applications in computer-integrated interventions mountney2010three ; lin2016video ; modrzejewski2019vivo .

Over the last decades, an extensive range of stereo matching approaches has been presented, broadly divided into traditional and deep learning-based methods. However, these methods still have shortcomings when handling stereo endoscopic images with smooth surfaces, specular highlights, repetitive patterns, and uneven illumination.

Traditional methods consist of local scharstein2002taxonomy ; bleyer2011patchmatch and global scharstein2002taxonomy ; terzopoulos1986regularization methods, according to the classification in scharstein2002taxonomy . Local methods use a pre-defined search range to determine the optimal disparity map. In contrast, global methods use the entire image to formulate an optimization problem with a data term and a regularization term. Local methods are computationally efficient, but they have difficulty handling featureless surfaces and specular highlights, which are common in endoscopic images. Global methods can provide relatively accurate results; however, they can be time-consuming and may require good initialization, especially for large displacements zach2007duality ; horn1981determining .

Learning-based stereo matching methods can be roughly divided into fully-supervised methods chang2018pyramid ; xu2020aanet and self-supervised methods li2021sins ; yang2021dense . The former use accurate ground truth disparity maps for training, while the latter rely on formulating an image synthesis loss. In the era of deep learning, learning-based stereo matching methods are reported to achieve high performance on several public benchmark datasets and to outperform traditional methods flow ; kitti . However, their achievements rest on two prerequisites: 1) a large amount of data available for training; 2) identically distributed training and testing datasets. These prerequisites are generally not satisfied in computer-assisted surgery. Obtaining a sufficiently large dataset with accurate ground truth to train learning-based methods, which usually have millions of parameters, is impractical in the surgical field. Given the variety of textures and surgical settings, there is also no guarantee that the distribution difference between the training and testing datasets is negligible. As a large laparoscopic dataset with accurate ground truth does not exist, several methods li2021revisiting ; long2021dssr ; edwards2020serv use cross-domain datasets kitti ; flow for training, which may not be robust: the distribution change, or domain shift, can jeopardize performance wang2018deep .

Considering the advantages and disadvantages of the mentioned methods, a motivating question arises: Can we combine learning-based methods with traditional methods to achieve more robust and accurate results?

To answer this question, in this paper we study the use of traditional methods to refine disparity maps produced by learning-based methods in a cross-domain setting. We propose a refinement framework for disparity maps of endoscopic images predicted by learning-based methods trained on a large stereo dataset of natural images. The proposed framework consists of a local disparity refinement (LDR) method and a global disparity refinement (GDR) method.

We first use the local disparity refinement (LDR) method to detect low-confidence regions in the predicted disparity map, assuming that outliers strongly violate the smoothness and photometric consistency assumptions. Next, the disparity values of low-confidence regions are interpolated from the surrounding high-confidence regions. Subsequently, in the global refinement stage, we introduce a multi-resolution variational model with a proposed illumination-invariant data term to further refine the disparity from the previous step. The model uses the refined disparity map from the first stage as initialization and is implemented on the GPU, improving disparity refinement accuracy, robustness, and computational efficiency. The proposed method is evaluated on two laparoscopic datasets. Experimental results demonstrate that: 1) the proposed method effectively improves the performance of several learning-based methods when they have difficulty generalizing to an unseen dataset; 2) it does not impair prediction performance when the learning-based methods generalize well to an unseen dataset; 3) our framework achieves better accuracy and speed than another recent, closely related method.

Our contributions are summarized as follows:

  • We present a disparity refinement framework for learning-based stereo matching methods in a cross-domain setting, based on traditional methods and consisting of LDR and GDR stages.

  • We present an LDR method to measure the confidence of disparity maps and refine disparity values in low confidence regions. The method is designed to refine errors concentrated in small regions and provide a more robust initialization for the subsequent global disparity refinement method.

  • We present a GDR method using our illumination invariant multi-resolution variational model to refine various artifacts, especially for errors concentrated in large regions.

2 Related Works

Stereo matching is a classical problem in computer vision for which a large number of methods have been proposed. Readers may refer to the benchmarks kitti ; flow for lists of stereo matching methods. We review the works most relevant to our application.

Local matching methods are favored by applications that require real-time processing. Stoyanov et al. stoyanov2010real presented a local stereo matching method that propagates disparity information around feature matches in salient regions to avoid mismatches in specular highlights and occlusions. ELAS geiger2010efficient , one of the most widely used stereo matching algorithms in surgical scenes, uses feature matches to generate a prior pixel-wise disparity and then optimizes the disparity via a maximum a-posteriori algorithm.

Global methods using variational approaches are attracting attention, as they are robust to occlusions and specular highlights. Global methods can be preceded by a sparse-to-dense step that provides their initialization. The dense inverse search (DIS) algorithm kroeger2016fast first searches for patch correspondences and then creates a dense displacement field through patch aggregation, which is then fed to a variational model. Song et al. song2021bayesian proposed a Bayesian framework to improve the patch aggregation step in DIS, improving its robustness to textureless surfaces and photometric inconsistency. Xia et al. xia2022robust proposed a method that includes sparse-to-dense feature matching, image illumination equalization, and global variational refinement. Using deep learning to provide acceptable initialization has also been explored in revaud2015epicflow ; roxas2018real .

Different neural network architectures have also been exploited to achieve accurate disparity prediction. PSMnet chang2018pyramid uses 3D convolutions to aggregate global context information and is the winning technique in the grand challenge on endoscopic data allan2021stereo . AAnet xu2020aanet replaces 3D convolutions with cross-scale correlation, achieving faster and better performance. LEAStereo cheng2020hierarchical utilizes a framework to automatically optimize the architecture of the stereo matching network and achieved top accuracy in several benchmarks kitti ; flow .

Work on improving the generalization ability of learning-based stereo matching methods has received comparatively little attention. Instead of using learned features, Cai et al. cai2020matching use hand-crafted features in the matching space to keep the network from learning domain-specific features. Li et al. li2021revisiting introduced a stereo matching neural network based on the Transformer architecture vaswani2017attention and demonstrated that the network can generalize across different domains. Pipelines that adapt networks to unseen target domains have also been proposed song2021adastereo ; bousmalis2017unsupervised ; mahmood2018unsupervised . However, these pipelines require samples from the target domain, limiting their applicability in fields with small datasets, such as medical imaging.

Disparity refinement is usually the final step in traditional stereo matching methods scharstein2002taxonomy . Disparity maps can be directly refined by removing peaks ma2013constant , smoothing filters ma2013constant , and hole filling hirschmuller2003stereo ; zhou2019real . Most refinement methods detect outliers and then correct them to achieve more accurate results. In hirschmuller2005accurate , outliers are detected via a left-right consistency (LRC) check and refined with steps including peak removal, interpolation, and median filtering. Banno and Ikeuchi banno2011disparity refined outliers that failed the LRC check with a proposed directed anisotropic diffusion technique. LRC is one of the most common methods to detect outliers. However, it doubles the computational time of traditional methods and does not apply to most learning-based methods, as they are trained to predict only the disparity map of the left image. Zhan et al. zhan2015accurate proposed a multistep refinement method that classifies outliers and recovers them according to their kind. Assuming the tissue surface is relatively smooth, Zhou and Jagadeesan zhou2019real introduced a radius-based outlier removal method, followed by hole filling and smoothing. However, these methods cannot refine large erroneous areas.

A few works have proposed refinement methods for learning-based methods. To the best of our knowledge, the closest work is yan2019segment , which refines the disparity map from a single method.

Our work focuses on a disparity refinement method robust to different learning-based methods in a cross-domain setting.

3 Methods

We focus on leveraging reasonable assumptions and traditional methods to correct errors in disparity maps predicted by several state-of-the-art learning-based stereo matching methods in a cross-domain setting. We use their publicly released models, trained on a large public dataset, to make predictions on laparoscopic image datasets. Disparity maps from the learning-based methods may contain numerous errors, resulting from both the difference between the training and testing datasets and the properties of endoscopic images. We propose a disparity refinement framework consisting of a local disparity refinement (LDR) method and a global disparity refinement (GDR) method.

3.1 Local Disparity Refinement

We use a detection-and-correction strategy in the local disparity refinement stage. We first estimate a confidence map for the disparity map, and a threshold is then applied to select outliers from the confidence map. Finally, the disparity values of the outliers are interpolated from neighboring pixels.

Several assumptions are made to estimate the confidence of the disparity. Firstly, we assume that the tissue surface is relatively smooth. A pixel $\mathbf{x}$ should have high confidence in the smoothness confidence map $C_s(\mathbf{x})$ if its disparity value $u(\mathbf{x})$ is consistent with that of its surrounding pixels, $\overline{u}_w(\mathbf{x})$:

C_{s}(\mathbf{x}) = 1 - \alpha_{s} \cdot \left\lvert \frac{u(\mathbf{x}) - \overline{u}_{w}(\mathbf{x})}{\overline{u}_{w}(\mathbf{x})} \right\rvert,   (1)

where $\overline{u}_w(\mathbf{x})$ is the mean disparity value within a local window of size $w$, and $\alpha_s$ is a hyper-parameter.
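As an illustrative sketch of Eq. 1 (not the paper's C++/CUDA implementation), the following NumPy code computes the smoothness confidence; the edge-padded box filter and the clipping of the confidence to [0, 1] are assumptions of this sketch:

```python
import numpy as np

def local_mean(u, w):
    """Mean disparity over a w x w window (edge-padded box filter)."""
    pad = w // 2
    up = np.pad(u, pad, mode="edge")
    h, wd = u.shape
    out = np.empty((h, wd), dtype=np.float64)
    for y in range(h):
        for x in range(wd):
            out[y, x] = up[y:y + w, x:x + w].mean()
    return out

def smoothness_confidence(u, w=11, alpha_s=20.0):
    """Smoothness confidence C_s (Eq. 1). The window size w and the
    clipping of C_s to [0, 1] are assumptions of this sketch."""
    u_bar = local_mean(u, w)
    c_s = 1.0 - alpha_s * np.abs((u - u_bar) / (u_bar + 1e-8))
    return np.clip(c_s, 0.0, 1.0)
```

A constant disparity map receives confidence 1 everywhere, while a single spike drops to confidence 0 after clipping, reflecting the smoothness assumption.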

Secondly, we assume that outliers in the disparity map tend to strongly violate the photo-consistency assumption. The intensity of an outlier $I_s(\mathbf{x})$ in the source (left) image and that of its matched point $I_t(\mathbf{x}+u(\mathbf{x}))$ in the target (right) image will differ substantially. Due to illumination differences, the intensity values of corresponding points may not be identical; however, it is reasonable to expect that outliers will strongly violate this assumption. Therefore, the photo-consistency confidence $C_p(\mathbf{x})$ is defined as:

C_{p}(\mathbf{x}) = 1 - \alpha_{p} \cdot \left\lvert \frac{I_{s}(\mathbf{x}) - I_{t}(\mathbf{x}+u(\mathbf{x}))}{I_{s}(\mathbf{x})} \right\rvert,   (2)

where $\alpha_p$ is a hyper-parameter. Pixels with incorrect disparity values have low confidence values in $C_p(\mathbf{x})$.
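A minimal NumPy sketch of Eq. 2; the nearest-neighbour sampling of the matched point and the clipping of the confidence to [0, 1] are assumptions of this sketch, not part of the paper's implementation:

```python
import numpy as np

def photometric_confidence(I_s, I_t, u, alpha_p=2.0):
    """Photo-consistency confidence C_p (Eq. 2). Following the text, a
    left-image pixel at column x is matched to column x + u(x) in the
    right image (horizontal shift only, after rectification)."""
    h, w = I_s.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # nearest-neighbour lookup of the matched point, clamped to the image
    xt = np.clip(np.round(xs + u).astype(int), 0, w - 1)
    I_t_warp = I_t[ys, xt]
    c_p = 1.0 - alpha_p * np.abs((I_s - I_t_warp) / (I_s + 1e-8))
    return np.clip(c_p, 0.0, 1.0)
```

With a zero disparity field and identical images, the confidence is 1 everywhere, as expected from Eq. 2.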

Thirdly, we assume that predicted disparities tend to be unreliable in specular highlights and border occlusions huq2013occlusion . In specular highlights, pixel intensities are saturated and uniform. Border occlusions result from the fact that the right camera misses some of the leftmost portion of the left camera's field of view.

We set confidence values in these regions to zero by introducing the specular highlight mask $M_s(\mathbf{x})$ and the border occlusion mask $M_b(\mathbf{x})$:

M_{s}(\mathbf{x}) = \begin{cases} 1 & \text{if } S(\mathbf{x}) > th_{s} \\ 0 & \text{otherwise,} \end{cases}   (3)

M_{b}(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{x} + u(\mathbf{x}) \text{ exists in the right image} \\ 0 & \text{otherwise,} \end{cases}   (4)

where $S(\mathbf{x}) \in [0,1]$ is the saturation value in HSV color space.
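The two masks of Eqs. 3 and 4 can be sketched as follows; the threshold value th_s = 0.1 and the convention that the disparity shifts the column index are assumptions of this sketch:

```python
import numpy as np

def specular_mask(S, th_s=0.1):
    """M_s (Eq. 3): 1 where HSV saturation exceeds th_s. Specular
    highlights have low saturation, so they receive 0 and their
    confidence is zeroed out. th_s = 0.1 is an assumed value."""
    return (S > th_s).astype(np.float64)

def border_mask(u):
    """M_b (Eq. 4): 1 where the matched point x + u(x) falls inside
    the right image, 0 in border-occluded columns."""
    h, w = u.shape
    xt = np.arange(w)[None, :] + u  # matched column for every pixel
    return ((xt >= 0) & (xt <= w - 1)).astype(np.float64)
```

The final confidence map of Eq. 5 is then the elementwise product of these masks with the two confidence maps.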

The final confidence map $C_f(\mathbf{x})$ is defined as the product of the above confidence maps and masks:

C_{f}(\mathbf{x}) = M_{b}(\mathbf{x}) \cdot M_{s}(\mathbf{x}) \cdot C_{p}(\mathbf{x}) \cdot C_{s}(\mathbf{x}).   (5)

Pixels are selected as outliers if their confidence values fall below the threshold $th_f$. Next, for each outlier, we search for inliers along eight directions, similar to SGM hirschmuller2003stereo . Then, the disparity of each outlier is replaced with the median value of the eight inliers.
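The outlier interpolation step can be sketched as follows. This is a simplified, unoptimized version of the eight-direction search described above, not the paper's implementation:

```python
import numpy as np

# the eight search directions (N, NE, E, SE, S, SW, W, NW)
DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def fill_outliers(u, conf, th_f=0.5):
    """Replace each low-confidence disparity with the median of the
    first inlier found along each of the eight directions (sketch of
    the LDR correction step)."""
    h, w = u.shape
    out = u.copy()
    inlier = conf >= th_f
    for y, x in zip(*np.where(~inlier)):
        vals = []
        for dy, dx in DIRS:
            cy, cx = y + dy, x + dx
            while 0 <= cy < h and 0 <= cx < w:
                if inlier[cy, cx]:
                    vals.append(u[cy, cx])
                    break
                cy += dy
                cx += dx
        if vals:
            out[y, x] = np.median(vals)
    return out
```

For an isolated spike surrounded by inliers, the eight collected values coincide with the local surface and the spike is removed.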

3.2 Global Disparity Refinement

We formulate stereo matching as a variational problem. Disparity values $u(x)$ between the source image $I_s$ and the target image $I_t$ are predicted by minimizing an energy function composed of a data term $E_{data}$ and a regularization term $E_S$:

\min_{u}\left[\lambda\,E_{data}(u, I_{t}, I_{s}) + E_{S}(u)\right],   (6)

where $\lambda$ denotes the weight between $E_{data}$ and $E_S$. $E_{data}$ measures the similarity of pixels in $I_s$ and $I_t$ using:

E_{data}(u) = \int_{\Omega} \left\lvert {\bf D}(P(x+u, I_{t})) - {\bf D}(P(x, I_{s})) \right\rvert^{2} dx.   (7)

Here, $\Omega$ denotes the image domain, $x$ denotes the pixel location, $P(x, I)$ is the patch that contains the local intensities, and $\mathbf{D}$ is a novel illumination-invariant descriptor:

P(x) = \begin{bmatrix} I(x_{4}) & I(x_{3}) & I(x_{2}) \\ I(x_{5}) & I(x) & I(x_{1}) \\ I(x_{6}) & I(x_{7}) & I(x_{8}) \end{bmatrix},   (8)

{\bf D}(P(x_{0}, I)) = \frac{\mathbf{A}(P(x_{0}, I))}{\|\mathbf{A}(P(x_{0}, I))\|},   (9)

\mathbf{A}(P(x_{0}, I)) = \begin{bmatrix} \lvert I(x_{0}) - I(x_{1}) \rvert \\ \lvert I(x_{0}) - I(x_{2}) \rvert \\ \vdots \\ \lvert I(x_{0}) - I(x_{8}) \rvert \end{bmatrix}.   (10)

Here, $x_i$, $i = 1, 2, \dots, 8$, denote the locations of the eight neighbours relative to the central pixel $x_0$. $\mathbf{D}$ is an 8-component vector calculated using Eqs. 9 and 10. Our descriptor $\mathbf{D}$ is a simpler form of a descriptor proposed in trinh2019illumination ; it represents the normalized image gradient about the central pixel of patch $P(x_0)$, which was shown to be invariant to linear illumination changes:

{\bf D}(P(x+u, I)) = {\bf D}(aP(x, I) + b).   (11)
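A minimal sketch of the descriptor in Eqs. 8-10, which can also be used to check the linear illumination invariance of Eq. 11 numerically. The neighbour ordering follows Eq. 8; returning the unnormalized vector for a zero-norm (flat) patch is an assumption of this sketch:

```python
import numpy as np

def descriptor(I, y, x):
    """Illumination-invariant descriptor D (Eqs. 8-10): normalized
    absolute differences between an interior pixel and its 8
    neighbours, ordered x1..x8 as in Eq. 8 (right, then
    counter-clockwise)."""
    c = I[y, x]
    neigh = np.array([I[y, x + 1], I[y - 1, x + 1], I[y - 1, x],
                      I[y - 1, x - 1], I[y, x - 1], I[y + 1, x - 1],
                      I[y + 1, x], I[y + 1, x + 1]], dtype=np.float64)
    a = np.abs(c - neigh)          # Eq. 10
    n = np.linalg.norm(a)
    return a / n if n > 0 else a   # Eq. 9 (flat patch left unnormalized)
```

Applying an affine intensity change a*I + b (a > 0) scales every absolute difference by a, which cancels in the normalization, so the descriptor is unchanged (Eq. 11).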

To preserve discontinuities at sharp object transitions and avoid staircasing artifacts in the calculated disparity map, we use a Huber function as a regularization term:

E_{S} = \int_{\Omega} \lvert \nabla u \rvert_{\epsilon}\, dx,   (12)

where $\lvert \cdot \rvert_\epsilon$ denotes the Huber norm,

\lvert x \rvert_{\epsilon} = \begin{cases} \frac{x^{2}}{2\epsilon} & \text{if } \lvert x \rvert \leq \epsilon \\ \lvert x \rvert - \frac{\epsilon}{2} & \text{otherwise,} \end{cases}   (13)

and $\epsilon$ is a small positive constant.

Finally, Eq. 6 takes the following form:

\min_{u}\left[\lambda \int_{\Omega} \left\lvert {\bf D}(P(x+u, I_{t})) - {\bf D}(P(x, I_{s})) \right\rvert^{2} dx + \int_{\Omega} \lvert \nabla u(x) \rvert_{\epsilon}\, dx\right],   (14)

which is solved by applying the primal-dual minimization scheme proposed in chambolle2011first . As is common in most variational optical flow algorithms kroeger2016fast ; werlberger2009anisotropic , we use a coarse-to-fine warping framework to deal with large displacements. A scale factor of 0.5 is used to construct an image pyramid of $n$ levels. At each level, we perform $m$ warping iterations optimizing the energy functional in Eq. 14. The warping iterations at each level are initialized with the current disparity field $u$, and the target image is warped towards the source image using the current disparity map.
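The coarse-to-fine warping scheme described above can be sketched as follows. The 2 × 2 average-pooling downsampler, the nearest-neighbour upsampling, and the placeholder `refine` callback (standing in for the primal-dual solver, which is not shown) are all assumptions of this sketch:

```python
import numpy as np

def downsample(img):
    """0.5 scale factor via 2x2 average pooling (assumed resampler)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample_disparity(u, shape):
    """Nearest-neighbour upsampling; disparity values double because
    the image width doubles."""
    up = np.repeat(np.repeat(u, 2, axis=0), 2, axis=1) * 2.0
    return up[:shape[0], :shape[1]]

def coarse_to_fine(I_s, I_t, u_init, n=4, m=50,
                   refine=lambda u, I_s, I_t, m: u):
    """Build an n-level pyramid (scale 0.5), initialize the coarsest
    level from the downsampled prior u_init, and at each level run m
    warping iterations of a level solver before moving up."""
    pyr_s, pyr_t, pyr_u = [I_s], [I_t], [u_init]
    for _ in range(n - 1):
        pyr_s.append(downsample(pyr_s[-1]))
        pyr_t.append(downsample(pyr_t[-1]))
        pyr_u.append(downsample(pyr_u[-1]) * 0.5)  # disparities shrink too
    u = pyr_u[-1]
    for lvl in range(n - 1, -1, -1):
        if lvl < n - 1:
            u = upsample_disparity(u, pyr_s[lvl].shape)
        u = refine(u, pyr_s[lvl], pyr_t[lvl], m)  # placeholder solver
    return u
```

With an identity solver, a constant prior disparity survives the pyramid round trip unchanged, which is a useful sanity check on the scale handling.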

3.3 Experiment Setup

Table 1: Summary of parameters in our methods.
Parameter Function Value
$\alpha_s$ Hyper-parameter in Eq. 1 20
$\alpha_p$ Hyper-parameter in Eq. 2 2
$th_f$ Threshold to select outliers from the final confidence map 0.5
$\lambda$ Weight between the data and regularization terms in Eq. 6 0.5
$\epsilon$ Hyper-parameter in the Huber norm regularizer (Eq. 12) 0.1
$m$ Warping iterations at each image pyramid level (Sec. 3.2) 50
$n$ Number of image pyramid levels (Sec. 3.2) 4

We use several state-of-the-art learning-based methods, including PSMnet chang2018pyramid , AAnet xu2020aanet , LEAStereo cheng2020hierarchical , and STTR li2021revisiting , to generate raw disparity maps of images from the SERV-CT edwards2020serv dataset and the SCARED dataset allan2021stereo . All of these methods are executed with their public models trained on the SceneFlow dataset flow . We use the released light version of STTR so that it runs on our local PC, whose specifications are listed further below.

We use all images in the SERV-CT dataset edwards2020serv , which includes 16 pairs of ex vivo stereo endoscopic images at a resolution of 720 × 576. The dataset was collected from porcine full-torso cadavers. Dense ground truth disparity maps and occlusion maps are computed from aligned CT scans.

We use five sub-datasets (datasets 1, 2, 3, 7, and 8) of the SCARED dataset allan2021stereo , collected ex vivo from the abdominal anatomy of a porcine cadaver. We exclude the remaining images, as they may contain intrinsic camera calibration errors. Each sub-dataset contains five keyframes with associated camera calibration parameters, and the point cloud is reconstructed using structured light. In total, there are 25 pairs of stereo endoscopic images at a resolution of 1080 × 1024.

The proposed local and global disparity refinement methods are implemented in C++ and CUDA C++ and run on an Intel i9-9900K CPU and an NVIDIA Titan X GPU, respectively. Parameters are shown in Table 1. Input images are resized to half the size of the originals. We examine the refinement performance of our proposed LDR and GDR and compare them to SDR yan2019segment , the closest work to ours. We used the released code of SDR with its default settings, which were tuned to several datasets.

Root mean square disparity error (RMSE Disparity) and root mean square depth error (RMSE Depth) are used to evaluate errors between the estimated and ground truth results.
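These error metrics can be sketched as follows; the optional mask argument (e.g. for excluding occluded pixels, as in the occlusion-excluded columns of the tables) is an assumption of this sketch:

```python
import numpy as np

def rmse(pred, gt, mask=None):
    """Root mean square error between predicted and ground-truth maps.
    If a boolean mask is given, only the True pixels are evaluated
    (e.g. to exclude occluded regions)."""
    d = pred - gt
    if mask is not None:
        d = d[mask]
    return float(np.sqrt(np.mean(d ** 2)))
```

The same function applies to disparity maps (in pixels) and to depth maps (in mm) once disparities are converted to depth with the camera calibration.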

4 Results

4.1 SERV-CT Dataset

4.1.1 Qualitative Evaluation

Figure 1: Disparity maps of learning-based methods (PSMnetchang2018pyramid , AAnetxu2020aanet , LEAStereocheng2020hierarchical , STTRli2021revisiting ) and their refined results by SDR yan2019segment , our LDR, and LDR + GDR, with disparity error maps compared with ground truth. Stereo endoscopic images and ground truth disparity map are shown on the top.
Figure 2: Disparity maps of learning-based methods (PSMnetchang2018pyramid , AAnetxu2020aanet , LEAStereocheng2020hierarchical , STTRli2021revisiting ) and their refined results by SDR yan2019segment , our LDR, and LDR + GDR, with disparity error maps compared with ground truth. Stereo endoscopic images and ground truth disparity map are shown on the top.

Figs. 1 and 2 show error images that indicate the differences between the reference and the predictions before and after refinement. We observe that raw disparity maps generated by the learning-based methods show significant disparity errors, especially in regions containing imaging artifacts such as specular highlights, occlusions, low texture, and illumination differences. These error regions (i.e., the noise-corrupted regions) are characterized by spikes, streaks, and holes and are not continuous with their surrounding areas. Such imaging artifacts are common in endoscopic images, but not in the natural images the networks are trained on. Small error regions can be refined effectively by the proposed LDR and by SDR yan2019segment . However, neither is able to refine large error-corrupted areas, such as errors around border occlusions. The proposed GDR further improves the results refined by the LDR; moreover, various imaging artifacts can also be effectively refined by the GDR.

4.1.2 Quantitative Evaluation

Table 2: Evaluation results on the SERV-CT dataset. The statistical significance between the errors before and after refinement is identified by * (p < 0.05).
Occlusions included | Occlusions not included
Method RMSE Disparity (pixel) RMSE Depth (mm) RMSE Disparity (pixel) RMSE Depth (mm)
PSMnet chang2018pyramid 38.91 ± 17.11 25.16 ± 8.77 33.07 ± 16.30 23.01 ± 8.88
PSMnet chang2018pyramid + LDR 31.70 ± 17.57 22.55 ± 8.76 29.04 ± 16.63 21.33 ± 9.14
PSMnet chang2018pyramid + LDR + GDR * 8.17 ± 10.38 * 7.89 ± 7.85 * 6.59 ± 9.99 * 6.48 ± 7.69
PSMnet chang2018pyramid + SDR yan2019segment 32.04 ± 21.03 22.30 ± 8.07 28.00 ± 19.78 20.79 ± 8.16
AAnet xu2020aanet 11.71 ± 7.21 13.33 ± 5.90 9.35 ± 6.20 11.85 ± 5.74
AAnet xu2020aanet + LDR 9.39 ± 7.13 10.04 ± 6.72 7.53 ± 6.29 8.39 ± 6.32
AAnet xu2020aanet + LDR + GDR * 4.15 ± 2.08 * 4.86 ± 2.70 * 2.76 ± 1.43 * 3.47 ± 2.40
AAnet xu2020aanet + SDR yan2019segment 9.22 ± 4.92 11.36 ± 4.87 6.98 ± 3.80 9.51 ± 4.60
LEAStereo cheng2020hierarchical 7.79 ± 5.58 12.21 ± 7.39 6.27 ± 5.10 10.27 ± 6.69
LEAStereo cheng2020hierarchical + LDR 6.06 ± 4.66 9.66 ± 6.54 4.49 ± 3.89 7.05 ± 5.57
LEAStereo cheng2020hierarchical + LDR + GDR * 3.96 ± 1.79 * 4.45 ± 2.03 * 2.58 ± 1.31 * 3.06 ± 1.73
LEAStereo cheng2020hierarchical + SDR yan2019segment 5.05 ± 2.83 7.24 ± 3.68 4.17 ± 2.38 6.42 ± 3.56
STTR li2021revisiting 17.22 ± 6.38 27.06 ± 5.16 4.27 ± 3.47 5.34 ± 4.00
STTR li2021revisiting + LDR 13.23 ± 6.25 * 22.44 ± 5.77 3.34 ± 3.13 4.20 ± 3.81
STTR li2021revisiting + LDR + GDR * 4.86 ± 3.04 * 5.97 ± 3.57 2.95 ± 1.68 3.36 ± 1.70
STTR li2021revisiting + SDR yan2019segment 10.71 ± 4.84 16.38 ± 6.15 3.64 ± 3.49 4.83 ± 3.63
Table 3: Ablation study of using different priors for GDR.
Occlusions included | Occlusions not included
Method RMSE Disparity (pixel) RMSE Disparity (pixel)
PSMnet chang2018pyramid + LDR + GDR 8.17 ± 10.38 6.59 ± 9.99
PSMnet chang2018pyramid + GDR 8.70 ± 9.98 6.94 ± 9.58
GDR (Level 4) 56.88 ± 24.85 56.39 ± 25.89
GDR (Level 6) 8.38 ± 7.08 5.70 ± 4.75

Table 2 further confirms our observations from the qualitative results and demonstrates the effectiveness and robustness of our proposed method. Decreases in errors are observed after each refinement stage, especially after the GDR stage. When the occluded regions are included in the evaluations, raw disparity maps estimated by LEAStereo cheng2020hierarchical have the lowest 2D and 3D errors, with an RMSE of 7.79 ± 5.58 pixel (12.21 ± 7.39 mm). The errors are reduced to 6.06 ± 4.66 pixel (9.66 ± 6.54 mm) after the LDR stage, and to 3.96 ± 1.79 pixel (4.45 ± 2.03 mm) after the GDR stage.

Results from all networks have higher accuracy after excluding occluded regions, and STTR li2021revisiting has the lowest error, with 4.27 ± 3.47 pixel (5.34 ± 4.00 mm). The results of STTR li2021revisiting are further improved to 3.34 ± 3.13 pixel (4.20 ± 3.81 mm) at the LDR stage, and to 2.95 ± 1.68 pixel (3.36 ± 1.70 mm) at the GDR stage. Our method can also refine disparity maps predicted by PSMnet, which contain significant errors: excluding occluded regions, the errors of PSMnet are reduced from 33.07 ± 16.30 pixel to 6.59 ± 9.99 pixel.

We provide quantitative results in Table 3 for the effects of using different priors for GDR, along with their running times. The priors considered are raw disparity maps from PSMnet, refined disparity maps after LDR, and no prior information. We observe that better priors lead to fewer errors in the final results: the errors associated with the raw PSMnet prior are higher than those associated with the LDR-refined prior, so LDR makes GDR more robust to error (noise)-corrupted priors. Without priors from the networks, the variational model of GDR requires two additional image pyramid levels to obtain reasonable results; priors from the networks thus help the variational model achieve more accurate results. The refinement performance of LDR is comparable to SDR, while GDR outperforms SDR (Table 2).

The LDR and GDR ($n=4$) run at 0.118 s and 0.375 s per sample, respectively, which is more efficient than SDR at 1.794 s per sample. GDR with $n=6$ runs at 0.5 s per sample.

4.2 SCARED Dataset

4.2.1 Qualitative Evaluation

Figure 3: Disparity maps of learning-based methods (PSMnetchang2018pyramid , AAnetxu2020aanet , LEAStereocheng2020hierarchical , STTRli2021revisiting ) and their refined results by SDR yan2019segment , our LDR, and LDR + GDR, with disparity error maps compared with ground truth. Stereo endoscopic images and ground truth disparity map are shown on the top.
Figure 4: Disparity maps of learning-based methods (PSMnetchang2018pyramid , AAnetxu2020aanet , LEAStereocheng2020hierarchical , STTRli2021revisiting ) and their refined results by SDR yan2019segment , our LDR, and LDR + GDR, with disparity error maps compared with ground truth. Stereo endoscopic images and ground truth disparity map are shown on the top.

Images in the SCARED dataset contain rich texture with low noise, as shown in Figs. 3 and 4. The SCARED dataset also includes surgical instruments, which are not present in the SERV-CT dataset. In this cross-domain setting, the networks predict accurate disparity maps with clear boundaries that contain fewer artifacts than on the SERV-CT dataset. Disparity maps from PSMnet appear to lose scale, but can be corrected by the proposed GDR. As shown in Fig. 3, the proposed disparity refinement methods correct minor artifacts without degrading regions free of artifacts. The proposed GDR has difficulty dealing with images that contain surgical instruments (Fig. 4), which may result from the weight of the smoothness term.

4.2.2 Quantitative Evaluation

Table 4: Evaluation results on the SCARED dataset. Statistical significance between the errors before and after refinement is indicated by * (p < 0.05).

Method | RMSE Disparity (pixel) | RMSE Depth (mm)
PSMnet chang2018pyramid | 13.71 ± 6.33 | 13.77 ± 3.48
PSMnet chang2018pyramid + LDR | 13.64 ± 6.31 | 13.60 ± 3.45
PSMnet chang2018pyramid + LDR + GDR | * 5.06 ± 6.80 | * 3.40 ± 2.67
PSMnet chang2018pyramid + SDR yan2019segment | 14.52 ± 6.14 | 15.47 ± 4.19
AAnet xu2020aanet | 4.92 ± 6.51 | 3.85 ± 2.51
AAnet xu2020aanet + LDR | 4.63 ± 6.44 | 3.35 ± 2.45
AAnet xu2020aanet + LDR + GDR | 4.98 ± 6.58 | 3.62 ± 3.07
AAnet xu2020aanet + SDR yan2019segment | 5.13 ± 6.29 | 4.11 ± 2.52
LEAStereo cheng2020hierarchical | 5.53 ± 9.58 | 3.12 ± 1.71
LEAStereo cheng2020hierarchical + LDR | 5.47 ± 9.59 | 2.85 ± 1.44
LEAStereo cheng2020hierarchical + LDR + GDR | 5.68 ± 8.80 | 3.24 ± 2.63
LEAStereo cheng2020hierarchical + SDR yan2019segment | 6.07 ± 9.54 | 3.76 ± 1.74
STTR li2021revisiting | 5.08 ± 6.93 | 5.17 ± 4.97
STTR li2021revisiting + LDR | 4.76 ± 6.92 | 4.53 ± 4.31
STTR li2021revisiting + LDR + GDR | 5.28 ± 6.46 | 3.70 ± 2.83
STTR li2021revisiting + SDR yan2019segment | 6.05 ± 6.87 | 6.22 ± 4.71

Table 4 suggests that, except for PSMnet, the networks generalize well on the SCARED dataset: the RMSE disparity errors of AAnet, LEAStereo, and STTR are around 5 pixels. After LDR and GDR, the disparity error of PSMnet decreases from 13.71 ± 6.33 pixels to 5.06 ± 6.80 pixels, which is comparable to the performance achieved by the other networks. For the remaining networks, LDR slightly improves the results, and GDR neither further improves nor significantly impairs them.
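The two RMSE metrics in Table 4 can be computed per sample as sketched below; the conversion from disparity to depth assumes a rectified pinhole stereo rig (depth = focal length × baseline / disparity), and the focal length and baseline values in the toy check are purely illustrative:

```python
import numpy as np

def rmse(pred, gt):
    """Root-mean-square error over pixels with valid (positive, finite) ground truth."""
    mask = np.isfinite(gt) & (gt > 0)
    diff = pred[mask] - gt[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def disparity_to_depth(disp, focal_px, baseline_mm):
    """Rectified pinhole stereo: depth_mm = focal_px * baseline_mm / disparity_px."""
    depth = np.zeros_like(disp, dtype=float)
    valid = disp > 0
    depth[valid] = focal_px * baseline_mm / disp[valid]
    return depth

# Toy check: a constant 1-pixel disparity bias on a 10-pixel disparity map.
gt_disp = np.full((4, 4), 10.0)
pred_disp = gt_disp + 1.0
disp_err = rmse(pred_disp, gt_disp)                      # exactly 1.0
depth_err = rmse(disparity_to_depth(pred_disp, 1000.0, 4.0),
                 disparity_to_depth(gt_disp, 1000.0, 4.0))
```

Note that the same disparity error maps to a larger depth error at small disparities (far surfaces), which is why the disparity and depth columns of Table 4 do not rank the methods identically.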

5 Discussion

We proposed a disparity refinement framework consisting of a local (LDR) and a global (GDR) disparity refinement method to maintain the robust performance of learning-based methods when applying them to medical images, which feature limited training data and varied scenes. We conducted qualitative and quantitative assessments of several state-of-the-art learning-based methods, and of their results refined with our proposed LDR and GDR, on the SERV-CT and SCARED datasets.

Images in the SERV-CT dataset feature texture-less surfaces, specular highlights, and illumination differences, which challenge the generalization ability of learning-based methods. The proposed LDR and GDR can effectively correct artifacts such as spikes, streaks, and holes. By leveraging assumptions to detect outliers and updating them via interpolation, LDR is best suited to small noise-corrupted regions. GDR, initialized with the refined results from LDR, can correct various types of errors; this initialization strategy improves the speed, accuracy, and robustness of the GDR model.
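The outlier-then-interpolate idea behind LDR can be sketched as follows; the 3×3 median neighborhood and the deviation threshold are illustrative choices standing in for the paper's actual detection assumptions and interpolation scheme:

```python
import numpy as np

def local_median_3x3(disp):
    """Median over each pixel's 3x3 neighborhood, built from shifted copies
    (edges wrap via np.roll, which is acceptable for an illustration)."""
    shifts = [np.roll(np.roll(disp, dy, axis=0), dx, axis=1)
              for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return np.median(np.stack(shifts), axis=0)

def local_refine(disp, thresh=3.0):
    """Flag pixels deviating from the local median by more than `thresh`
    (spike-like outliers) and fill them with that median, a simple stand-in
    for interpolation from reliable neighbors."""
    med = local_median_3x3(disp)
    outliers = np.abs(disp - med) > thresh
    refined = disp.copy()
    refined[outliers] = med[outliers]
    return refined, outliers

# A flat disparity map corrupted by a single spike.
disp = np.full((7, 7), 5.0)
disp[3, 3] = 100.0
refined, outliers = local_refine(disp)
```

Only the spike is flagged and replaced; the surrounding correct pixels are untouched, which is the behavior that makes LDR safe to apply even when a network already generalizes well.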

Compared with the SERV-CT dataset, the SCARED dataset contains images with uniform (even) illumination and rich texture. The networks predict disparity maps with only a few errors. Although the improvements are not statistically significant, the proposed LDR and GDR can correct these errors without degrading pixels with accurate disparity values.

The major limitation of GDR is its difficulty in handling images containing surgical instruments, due to smoothing of the instrument boundaries, which could be a side effect of not having fully optimized the parameters used in these experiments. To minimize boundary smoothing and improve the model's performance on images featuring surgical instruments, as part of our future work we will investigate variational models zach2007duality ; werlberger2009anisotropic that feature image gradient-based anisotropic weighting of the regularizer.
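Such anisotropic weighting typically takes the form w(x) = exp(-α |∇I(x)|^β), which lowers the smoothness penalty across strong image gradients (e.g. instrument boundaries) so the regularizer stops smoothing over them. A minimal sketch, with α and β values chosen purely for illustration:

```python
import numpy as np

def edge_weight(img, alpha=5.0, beta=0.5):
    """Edge-aware regularizer weight w = exp(-alpha * |grad I|^beta).
    Flat regions get w close to 1 (full smoothing); strong edges get w
    close to 0, suppressing the smoothness term across object boundaries."""
    gy, gx = np.gradient(img.astype(float))  # per-axis gradients (rows, cols)
    mag = np.hypot(gx, gy)
    return np.exp(-alpha * mag ** beta)

# Step-edge image: the weight collapses at the edge, stays ~1 in flat regions.
img = np.zeros((8, 8))
img[:, 4:] = 1.0  # vertical intensity step between columns 3 and 4
w = edge_weight(img)
```

Multiplying the smoothness term of the variational model by such a weight would let GDR smooth freely inside tissue regions while leaving instrument boundaries sharp.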

6 Conclusion

We have presented a robust and accurate disparity refinement framework, comprising a local disparity refinement method and a global disparity refinement method, for learning-based stereo matching in a cross-domain setting. In cases where networks perform poorly, our disparity refinement methods improve the accuracy of the disparity maps. Moreover, they can refine disparity maps that are corrupted by noise without compromising network performance when the networks generalize well on new domains.

7 Acknowledgement

Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award No. R35GM128877 and by the Office of Advanced Cyber-infrastructure of the National Science Foundation under Award No.1808530.

8 Disclosure

Nothing to disclose.

References

  • (1) Mountney, P., Stoyanov, D., Yang, G.-Z.: Three-dimensional tissue deformation recovery and tracking. IEEE Signal Processing Magazine 27(4), 14–24 (2010)
  • (2) Lin, B., Sun, Y., Qian, X., Goldgof, D., Gitlin, R., You, Y.: Video-based 3d reconstruction, laparoscope localization and deformation recovery for abdominal minimally invasive surgery: a survey. The International Journal of Medical Robotics and Computer Assisted Surgery 12(2), 158–178 (2016)
  • (3) Allan, M., Mcleod, J., Wang, C., Rosenthal, J.C., Hu, Z., Gard, N., Eisert, P., Fu, K.X., Zeffiro, T., Xia, W., et al.: Stereo correspondence and reconstruction of endoscopic data challenge. arXiv preprint arXiv:2101.01133 (2021)
  • (4) Edwards, P.E., Psychogyios, D., Speidel, S., Maier-Hein, L., Stoyanov, D.: Serv-ct: A disparity dataset from cone-beam ct for validation of endoscopic 3d reconstruction. Medical image analysis 76, 102302 (2022)
  • (5) Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision 47(1), 7–42 (2002)
  • (6) Modrzejewski, R., Collins, T., Seeliger, B., Bartoli, A., Hostettler, A., Marescaux, J.: An in vivo porcine dataset and evaluation methodology to measure soft-body laparoscopic liver registration accuracy with an extended algorithm that handles collisions. International journal of computer assisted radiology and surgery 14(7), 1237–1245 (2019)
  • (7) Zhou, H., Jagadeesan, J.: Real-time dense reconstruction of tissue surface from stereo optical video. IEEE transactions on medical imaging 39(2), 400–412 (2019)
  • (8) Recasens, D., Lamarca, J., Fácil, J.M., Montiel, J., Civera, J.: Endo-depth-and-motion: Reconstruction and tracking in endoscopic videos using depth networks and photometric constraints. IEEE Robotics and Automation Letters 6(4), 7225–7232 (2021)
  • (9) Wei, R., Li, B., Mo, H., Lu, B., Long, Y., Yang, B., Dou, Q., Liu, Y., Sun, D.: Stereo dense scene reconstruction and accurate localization for learning-based navigation of laparoscope in minimally invasive surgery. IEEE Transactions on Biomedical Engineering (2022)
  • (10) Yang, Z., Lin, S., Simon, R., Linte, C.A.: Endoscope localization and dense surgical scene reconstruction for stereo endoscopy by unsupervised optical flow and kanade-lucas-tomasi tracking. In: 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 4839–4842 (2022). IEEE
  • (11) Bleyer, M., Rhemann, C., Rother, C.: Patchmatch stereo-stereo matching with slanted support windows. In: Bmvc, vol. 11, pp. 1–11 (2011)
  • (12) Terzopoulos, D.: Regularization of inverse visual problems involving discontinuities. IEEE Transactions on pattern analysis and Machine Intelligence (4), 413–424 (1986)
  • (13) Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime tv-l 1 optical flow. In: Joint Pattern Recognition Symposium, pp. 214–223 (2007). Springer
  • (14) Horn, B.K., Schunck, B.G.: Determining optical flow. Artificial intelligence 17(1-3), 185–203 (1981)
  • (15) Chang, J.-R., Chen, Y.-S.: Pyramid stereo matching network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418 (2018)
  • (16) Xu, H., Zhang, J.: Aanet: Adaptive aggregation network for efficient stereo matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1959–1968 (2020)
  • (17) Li, Z., Drenkow, N., Ding, H., Ding, A.S., Lu, A., Creighton, F.X., Taylor, R.H., Unberath, M.: On the sins of image synthesis loss for self-supervised depth estimation. arXiv preprint arXiv:2109.06163 (2021)
  • (18) Yang, Z., Simon, R., Li, Y., Linte, C.A.: Dense depth estimation from stereo endoscopy videos using unsupervised optical flow methods. In: Annual Conference on Medical Image Understanding and Analysis, pp. 337–349 (2021). Springer
  • (19) Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048 (2016)
  • (20) Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3061–3070 (2015)
  • (21) Li, Z., Liu, X., Drenkow, N., Ding, A., Creighton, F.X., Taylor, R.H., Unberath, M.: Revisiting stereo depth estimation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6197–6206 (2021)
  • (22) Long, Y., Li, Z., Yee, C.H., Ng, C.F., Taylor, R.H., Unberath, M., Dou, Q.: E-dssr: efficient dynamic surgical scene reconstruction with transformer-based stereoscopic depth perception. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 415–425 (2021). Springer
  • (23) Wang, M., Deng, W.: Deep visual domain adaptation: A survey. Neurocomputing 312, 135–153 (2018)
  • (24) Stoyanov, D., Scarzanella, M.V., Pratt, P., Yang, G.-Z.: Real-time stereo reconstruction in robotically assisted minimally invasive surgery. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 275–282 (2010). Springer
  • (25) Geiger, A., Roser, M., Urtasun, R.: Efficient large-scale stereo matching. In: Asian Conference on Computer Vision, pp. 25–38 (2010). Springer
  • (26) Kroeger, T., Timofte, R., Dai, D., Van Gool, L.: Fast optical flow using dense inverse search. In: European Conference on Computer Vision, pp. 471–488 (2016). Springer
  • (27) Song, J., Zhu, Q., Lin, J., Ghaffari, M.: Bayesian dense inverse searching algorithm for real-time stereo matching in minimally invasive surgery. arXiv preprint arXiv:2106.07136 (2021)
  • (28) Xia, W., Chen, E.C., Pautler, S., Peters, T.M.: A robust edge-preserving stereo matching method for laparoscopic images. IEEE Transactions on Medical Imaging (2022)
  • (29) Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: Epicflow: Edge-preserving interpolation of correspondences for optical flow. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1164–1172 (2015)
  • (30) Roxas, M., Oishi, T.: Real-time simultaneous 3d reconstruction and optical flow estimation. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 885–893 (2018). IEEE
  • (31) Cheng, X., Zhong, Y., Harandi, M., Dai, Y., Chang, X., Li, H., Drummond, T., Ge, Z.: Hierarchical neural architecture search for deep stereo matching. Advances in Neural Information Processing Systems 33, 22158–22169 (2020)
  • (32) Cai, C., Poggi, M., Mattoccia, S., Mordohai, P.: Matching-space stereo networks for cross-domain generalization. In: 2020 International Conference on 3D Vision (3DV), pp. 364–373 (2020). IEEE
  • (33) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  • (34) Song, X., Yang, G., Zhu, X., Zhou, H., Wang, Z., Shi, J.: Adastereo: A simple and efficient approach for adaptive stereo matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10328–10337 (2021)
  • (35) Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., Krishnan, D.: Unsupervised pixel-level domain adaptation with generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3722–3731 (2017)
  • (36) Mahmood, F., Chen, R., Durr, N.J.: Unsupervised reverse domain adaptation for synthetic medical images via adversarial training. IEEE transactions on medical imaging 37(12), 2572–2581 (2018)
  • (37) Ma, Z., He, K., Wei, Y., Sun, J., Wu, E.: Constant time weighted median filtering for stereo matching and beyond. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 49–56 (2013)
  • (38) Hirschmuller, H.: Stereo vision based mapping and immediate virtual walkthroughs (2003)
  • (39) Hirschmuller, H.: Accurate and efficient stereo processing by semi-global matching and mutual information. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 2, pp. 807–814 (2005). IEEE
  • (40) Banno, A., Ikeuchi, K.: Disparity map refinement and 3d surface smoothing via directed anisotropic diffusion. Computer Vision and Image Understanding 115(5), 611–619 (2011)
  • (41) Zhan, Y., Gu, Y., Huang, K., Zhang, C., Hu, K.: Accurate image-guided stereo matching with efficient matching cost and disparity refinement. IEEE Transactions on Circuits and Systems for Video Technology 26(9), 1632–1645 (2015)
  • (42) Yan, T., Gan, Y., Xia, Z., Zhao, Q.: Segment-based disparity refinement with occlusion handling for stereo matching. IEEE Transactions on Image Processing 28(8), 3885–3897 (2019)
  • (43) Huq, S., Koschan, A., Abidi, M.: Occlusion filling in stereo: Theory and experiments. Computer Vision and Image Understanding 117(6), 688–704 (2013)
  • (44) Trinh, D.-H., Daul, C.: On illumination-invariant variational optical flow for weakly textured scenes. Computer Vision and Image Understanding 179, 1–18 (2019)
  • (45) Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision 40(1), 120–145 (2011)
  • (46) Werlberger, M., Trobin, W., Pock, T., Wedel, A., Cremers, D., Bischof, H.: Anisotropic huber-l1 optical flow. In: BMVC, vol. 1, p. 3 (2009)