2 Department of Art and Technology, Sogang University, Seoul, Korea (Email: [email protected]).
Corresponding author: Yuchao Dai.
A revisit of the normalized eight-point algorithm and a self-supervised deep solution
Abstract
The normalized eight-point algorithm has been widely viewed as the cornerstone in two-view geometry computation, where the seminal Hartley’s normalization has greatly improved the performance of the direct linear transformation algorithm. A natural question is, whether there exists and how to find other normalization methods that may further improve the performance as per each input sample. In this paper, we provide a novel perspective and propose two contributions to this fundamental problem: 1) we revisit the normalized eight-point algorithm and make a theoretical contribution by presenting the existence of different and better normalization algorithms; 2) we introduce a deep convolutional neural network with a self-supervised learning strategy for normalization. Given eight pairs of correspondences, our network directly predicts the normalization matrices, thus learning to normalize each input sample. Our learning-based normalization module can be integrated with both traditional (e.g., RANSAC) and deep learning frameworks (affording good interpretability) with minimal effort. Extensive experiments on both synthetic and real images demonstrate the effectiveness of our proposed approach.
Keywords:
Two-view geometry · Eight-point algorithm · Data normalization · Permutation invariance · Self-supervised
1 Introduction
Geometric computation has long been one of the major issues in computer vision. In particular, two-view geometry computation is a central building block for three-dimensional (3D) modeling and camera motion estimation. For example, self-driving is implemented through the technology of simultaneous localization and mapping (SLAM) and structure from motion (SfM). Among many important core algorithms, the eight-point algorithm Longuet_Reconstructing_Nature_1981 computes the fundamental matrix from a set of eight or more point correspondences between two views, which has the advantage of the simplicity of implementation. However, it was extremely susceptible to image noise and hence was of very limited practical use until Hartley devised a normalized eight-point algorithm in his seminal work Hartley_Normalization_TPAMI_1997 , which shows that by preceding the algorithm with a data normalization (translation and scaling) of the coordinates of the correspondences, the results obtained are comparable to those of the best iterative algorithms. As a consequence, with its simple strategy of translation and scaling, the isotropic normalization, now termed as Hartley’s normalization, has gradually become an indispensable component of many geometric computations not only for fundamental matrix estimation dai2016rolling but also for homography zhao2021homography , ellipse fitting szpak2015guaranteed , bundle adjustment zhang2014structure , etc.
One particular aspect of Hartley’s normalization in regard to the direct linear transformation (DLT) formulation of the fundamental matrix computation is that it allows the DLT solution to possess a better condition number. Therefore, when the solution matrix is enforced to have rank 2, a much more stable estimate of the fundamental matrix is obtained; this is important because it is the starting point of all structure and motion computations, such as guided correspondence search, camera and structure optimization, and 3D reconstruction from more than two views. Consequently, satisfying the rank-2 constraint as closely as possible at the DLT stage becomes an interesting topic of study. For example, Mühlich and Mester Muhlich_Subspace_SCIA_2001 performed a statistical analysis to obtain an optimal data normalization for DLT fundamental matrix computation and showed that Hartley’s normalization can be expected to work well even though it is not identical to the optimal transform. Mair et al. Elamr_ErrorPropagation_IROS_2013 performed further error analysis to obtain better performance than Hartley’s eight-point algorithm. Da Silveira and Jung daSilveira_Perturbation_CVPR_2019 presented a perturbation analysis of the eight-point algorithm for wide field-of-view cameras. In contrast to these works based on statistical analysis, this paper tries to determine the mechanism of data normalization through deep learning without specific statistical modeling. Considering that the fundamental matrix estimation is strongly affected by the error distribution of the feature matching algorithm, we argue that a data normalization scheme can be exploited to achieve DLT solutions of improved rank-2 condition by learning the error distribution from the data themselves; this approach coincides with the views of Refs. Muhlich_TLS_ECCV_1998 ; Muhlich_Subspace_SCIA_2001 ; daSilveira_Perturbation_CVPR_2019 . In particular, as displayed in Fig. 1, we propose to learn a data-driven normalization scheme under the standard configuration of eight correspondences.
Currently, the success of deep learning in high-level vision tasks has been gradually extended to multi-view geometry problems such as homography Detone_DeepHomography_arxiv_2016 , fundamental matrix Ranftl_DeepFundamental_ECCV_2018 , bundle adjustment Tang_BANet_ICLR_2018 , plane sweeping Sunghoon_DPSNet_ICLR_2019 ; fan2021rs , and rolling-shutter modeling fan2023rolling ; fan2022rolling . However, this success has not been extended to the normalized eight-point algorithm and a different or better normalization scheme has so far not been presented nor replaced by the deep learning pipeline. This is mainly due to the following obstacles: 1) gradient descent cannot be trivially applied as mentioned in Ref. Ranftl_DeepFundamental_ECCV_2018 ; 2) the network must be invariant to the permutation of the correspondences, i.e., different orderings of the input data should produce the same normalization; and 3) a large amount of labeled input and output data should be used for supervised learning (in this case, the input is eight-point correspondences and the output is the optimal data normalization). In this paper, we overcome these problems by back-propagating through a singular value decomposition (SVD) layer and using a self-supervised learning mechanism in the permutation-invariant network architecture; this also solves the issue of large training data requirements. Our approach not only produces an interpretable pipeline of fundamental matrix estimation but can also be easily embedded in other robust frameworks such as the differentiable random sample consensus (RANSAC) Brachmann_DSAC_CVPR_2017 . Through experiments, our learning-based normalization demonstrated superior performance to Hartley’s normalization and a good generalization ability across different datasets. Our main contributions can be summarized as follows.
1) We propose a self-supervised learning-based deep solution for normalizing DLT fundamental matrix estimation under the standard configuration of eight point correspondences.
2) We make a theoretical contribution by demonstrating the existence of different and better normalization algorithms beyond Hartley’s normalization.
3) Extensive experiments on both synthetic and real images demonstrate the effectiveness and good generalizability of our proposed approach.

2 Related work
In this section, we briefly review related work in traditional two-view geometry computation and deep learning-based multi-view geometry learning.
2.1 Two-view geometry estimation
The normalized eight-point algorithm Hartley_Normalization_TPAMI_1997 significantly improves the numerical accuracy of the fundamental matrix and extends the scope of applications due to the improved condition number of the hand-designed normalization scheme. Since this seminal work, there have been various follow-up studies on the uncertainty in fundamental matrix estimation and the relationships between the epipolar constraint and corresponding errors. Csurka et al. Csurka_Uncertainty_CVIU_1997 proposed a method to simultaneously estimate the fundamental matrix and its uncertainty. Mühlich and Mester Muhlich_TLS_ECCV_1998 concluded that the normalization strategy can ensure that the two-view non-iterative motion estimation algorithm maintains unbiasedness and consistency. They further introduced a normalization transformation scheme based on the bound of epipolar constraint errors obtained by assuming known feature matching covariance, which was also used to extend the existing first-order error propagation analysis of the eight-point algorithm in Ref. Elamr_ErrorPropagation_IROS_2013 . However, this approach was still not optimal because the error distribution of the input data was not considered Muhlich_Subspace_SCIA_2001 . The closed-form computation of the uncertainty of the fundamental matrix was presented in Ref. Frederic_Uncertainty_BMVC_2008 to recover correspondences via the uncertain equilibrium of motion estimation. Chojnacki and Brooks Chojnacki_Revisiting8pt_TPAMI_2003 revisited the normalized eight-point algorithm and presented a statistical model of data distribution by merging the statistical approach of Ref. Muhlich_TLS_ECCV_1998 , which was further extended in Ref. Chojnacki_Consistency_JMIV_2007 by introducing a structured model for the data distribution. 
In addition, da Silveira and Jung daSilveira_Perturbation_CVPR_2019 performed perturbation analysis for the fundamental matrix estimation without considering any kind of matching error distribution.
2.2 Deep learning-based geometry estimation
Recently, the success of deep learning in high-level vision tasks has been gradually extended to various multi-view geometry estimation problems. DeTone et al. Detone_DeepHomography_arxiv_2016 employed a deep convolutional neural network (CNN) to regress a homography from a pair of input images in an end-to-end manner. A follow-up study Nguyen_DeepHomographyUnsupervised_Robot&Automation_2018 developed the unsupervised variant by replacing direct supervision with image-based loss. This pipeline has been extended to fundamental matrix estimation, where a fundamental matrix is directly regressed from a pair of stereo images without correspondences Omid_DeepFundamental_wo_corresponences_ECCV_2018 . Ranftl and Koltun Ranftl_DeepFundamental_ECCV_2018 treated the fundamental matrix estimation problem as a weighted homogeneous least-squares problem, where the matching weights and fundamental matrix are simultaneously estimated by using supervised deep networks. With the availability of camera intrinsics, Yi et al. Yi_LearningCorrespondences_CVPR_2018 recovered the essential matrix from putative correspondences with little training data and limited supervision, thus finding good correspondences for wide-baseline stereo. Furthermore, Probst et al. Probst_2019_CVPR proposed an unsupervised learning framework for consensus maximization, in the context of solving 3D vision problems such as 3D-3D matching zhang2022learning ; zhang2022searching , and image-to-image matching (homography and fundamental matrix). DSAC Brachmann_DSAC_CVPR_2017 is a differentiable counterpart of RANSAC and can also be leveraged as a robust optimization component for other deep learning pipelines.
Different from existing work in deep learning-based multi-view geometry computation, our self-supervised learning strategy removes the need for supervisory signals and thus generalizes well across different datasets. Furthermore, our learning-based normalization module can be integrated with both traditional and deep learning frameworks.
3 A revisit of the normalized eight-point algorithm
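We use capital letters, A, B, etc., to denote matrices. The operation of reshaping a matrix into a vector is denoted by $\mathrm{vec}(\cdot)$, defined as $\mathrm{vec}(\mathbf{A}) = [\mathbf{a}_1^\top, \ldots, \mathbf{a}_n^\top]^\top$, where $\mathbf{a}_i$ is the $i$-th column vector of A and $n$ is the number of columns. Its inverse operation is denoted as $\mathrm{vec}^{-1}(\cdot)$.
Given a pair of correspondences $\mathbf{u}'$ and $\mathbf{u}$ (in homogeneous coordinates) between two views, the epipolar constraint is expressed as
$$\mathbf{u}'^\top \mathbf{F}\,\mathbf{u} = 0, \tag{1}$$
where $\mathbf{F}$ is a $3 \times 3$ matrix of rank 2, termed the fundamental matrix. Collecting $N = 8$ point correspondences $\{\mathbf{u}'_i \leftrightarrow \mathbf{u}_i\}_{i=1}^{N}$, i.e., the standard configuration, we may rewrite Eq. (1) as a linear equation of f:
$$\mathbf{A}\,\mathbf{f} = \mathbf{0}, \tag{2}$$
where $\mathbf{f} = \mathrm{vec}(\mathbf{F})$ is a nine-dimensional vector composed of stacked columns of $\mathbf{F}$, and $\mathbf{A} \in \mathbb{R}^{N \times 9}$ is the coefficient matrix whose $i$-th row is $\mathbf{u}_i^\top \otimes \mathbf{u}_i'^\top$ for $i = 1, \ldots, N$. This approach provides the DLT formulation for computing F, and a solution may be obtained through SVD of A.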
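As a minimal sketch of the DLT step in Eqs. (1) and (2), the following NumPy snippet builds the coefficient matrix with a Kronecker product and reads the solution off the SVD. The function name `dlt_fundamental` and the column-stacking convention for vec are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def dlt_fundamental(u_prime, u):
    """Unnormalized DLT estimate of F from N >= 8 correspondences.

    u_prime, u: (N, 3) arrays of homogeneous points with u'^T F u = 0.
    Returns the 3x3 estimate (before rank-2 enforcement) and the matrix A.
    """
    # Row i of A is u_i^T (kron) u'_i^T, which pairs with f = vec(F)
    # when vec stacks the columns of F.
    A = np.stack([np.kron(ui, upi) for upi, ui in zip(u_prime, u)])
    # f is the right singular vector of the smallest singular value of A.
    _, _, vt = np.linalg.svd(A)
    f = vt[-1]
    # Inverse of vec: column-major reshape back to 3x3.
    return f.reshape(3, 3, order="F"), A
```

The estimate is defined only up to scale and sign, and it is not guaranteed to have rank 2 when the input is noisy.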
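Despite its simplicity, the DLT computation of the eight-point algorithm Longuet_Reconstructing_Nature_1981 is extremely susceptible to noise in the image coordinate measurements. In the seminal work Hartley_Normalization_TPAMI_1997 , Hartley showed that the precision of the eight-point algorithm can be greatly improved by proper normalization of the image coordinates; this approach is the classic normalized eight-point algorithm. Hartley’s normalization is designed to compute image translation and scaling such that the average distance of the transformed coordinates from the origin is $\sqrt{2}$:
$$\mathbf{T} = \begin{bmatrix} s & 0 & -s\,\bar{u}^{(1)} \\ 0 & s & -s\,\bar{u}^{(2)} \\ 0 & 0 & 1 \end{bmatrix}, \tag{3}$$
with $s$ and $\bar{\mathbf{u}}$ given by
$$\bar{\mathbf{u}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{u}_i, \qquad s = \frac{\sqrt{2}\,N}{\sum_{i=1}^{N}\|\mathbf{u}_i - \bar{\mathbf{u}}\|}, \tag{4}$$
where the superscript $(i)$ denotes the $i$-th entry of vector $\bar{\mathbf{u}}$, and $\mathbf{T}'$ is obtained in the same way from $\{\mathbf{u}'_i\}$. Given two normalization matrices $\mathbf{T}'$ and T, Eq. (2) is transformed to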
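$$\hat{\mathbf{A}}\,\hat{\mathbf{f}} = \mathbf{0}, \tag{5}$$
where $\hat{\mathbf{A}}$ is the transformed coefficient matrix built from the normalized coordinates $\hat{\mathbf{u}}'_i = \mathbf{T}'\mathbf{u}'_i$ and $\hat{\mathbf{u}}_i = \mathbf{T}\mathbf{u}_i$, with $i$-th row $\hat{\mathbf{u}}_i^\top \otimes \hat{\mathbf{u}}_i'^\top$, and $\hat{\mathbf{f}} = \mathrm{vec}(\hat{\mathbf{F}})$ with $\hat{\mathbf{F}} = \mathbf{T}'^{-\top}\mathbf{F}\,\mathbf{T}^{-1}$. In summary, the normalized eight-point algorithm mainly includes the following three steps.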
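1) Normalization: Transform the input image coordinates according to $\hat{\mathbf{u}}'_i = \mathbf{T}'\mathbf{u}'_i$ and $\hat{\mathbf{u}}_i = \mathbf{T}\mathbf{u}_i$.
2) Compute the fundamental matrix $\hat{\mathbf{F}}'$ corresponding to the normalized data by
   a) Direct linear transform: Determine $\hat{\mathbf{F}}$ from the right singular vector $\hat{\mathbf{f}}$ corresponding to the smallest singular value of $\hat{\mathbf{A}}$ defined in Eq. (5).
   b) Singularity constraint enforcement: Replace $\hat{\mathbf{F}}$ by $\hat{\mathbf{F}}' = \mathbf{U}\,\mathrm{diag}(\sigma_1, \sigma_2, 0)\,\mathbf{V}^\top$, where $\hat{\mathbf{F}} = \mathbf{U}\mathbf{D}\mathbf{V}^\top$ is the SVD of $\hat{\mathbf{F}}$ and $\mathbf{D} = \mathrm{diag}(\sigma_1, \sigma_2, \sigma_3)$ is a diagonal matrix satisfying $\sigma_1 \geq \sigma_2 \geq \sigma_3$.
3) Denormalization: Set $\mathbf{F} = \mathbf{T}'^\top \hat{\mathbf{F}}'\,\mathbf{T}$.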
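The three steps above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than a reference implementation: the helper names `hartley_matrix` and `normalized_eight_point` are hypothetical, points are assumed to be homogeneous with last entry 1, and the normalization is the standard translate-and-scale form with a mean distance of $\sqrt{2}$.

```python
import numpy as np

def hartley_matrix(pts):
    """Similarity transform that centers pts (N, 3; last entry 1) and scales
    them so that their mean distance from the origin is sqrt(2)."""
    centroid = pts[:, :2].mean(axis=0)
    mean_dist = np.linalg.norm(pts[:, :2] - centroid, axis=1).mean()
    s = np.sqrt(2.0) / mean_dist
    return np.array([[s, 0.0, -s * centroid[0]],
                     [0.0, s, -s * centroid[1]],
                     [0.0, 0.0, 1.0]])

def normalized_eight_point(u_prime, u):
    """Estimate F with u'^T F u = 0 from N >= 8 correspondences."""
    # 1) Normalization of both point sets.
    T_p, T = hartley_matrix(u_prime), hartley_matrix(u)
    up_hat, u_hat = u_prime @ T_p.T, u @ T.T
    # 2a) DLT: null vector of the transformed coefficient matrix.
    A_hat = np.stack([np.kron(b, a) for a, b in zip(up_hat, u_hat)])
    f_hat = np.linalg.svd(A_hat)[2][-1]
    F_hat = f_hat.reshape(3, 3, order="F")
    # 2b) Singularity constraint: zero the smallest singular value.
    U, d, Vt = np.linalg.svd(F_hat)
    F_hat_rank2 = U @ np.diag([d[0], d[1], 0.0]) @ Vt
    # 3) Denormalization back to the original image coordinates.
    return T_p.T @ F_hat_rank2 @ T
```

Swapping `hartley_matrix` for any other function that returns a 3x3 normalization matrix leaves the remaining steps unchanged, which is the slot a learned normalization would occupy.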
The condition number of A is defined as $\kappa(\mathbf{A}) = \|\mathbf{A}\|\,\|\mathbf{A}^{+}\|$, where $\mathbf{A}^{+}$ is the pseudo-inverse of A. Its equivalent condition number may be defined as the ratio of the greatest to the second smallest singular values, $\kappa(\mathbf{A}) = \sigma_1/\sigma_8$, for $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_9$. It has been reported in the literature Hartley_Normalization_TPAMI_1997 ; Chojnacki_Revisiting8pt_TPAMI_2003 ; Chojnacki_Consistency_JMIV_2007 ; daSilveira_Perturbation_CVPR_2019 that the unsatisfactory performance of the eight-point algorithm is mainly due to the poor numerical conditioning of the coefficient matrix A. In fact, the condition number is extremely large, so the two smallest eigenvalues of $\mathbf{A}^\top\mathbf{A}$ are relatively close to one another and their corresponding eigenvectors become mixed up and indistinguishable. As a result, a negligible perturbation of the matrix entries tends to cause a significant change in the smallest eigenvector, since it may fall anywhere in the proximity of the eigensubspace spanned by the similar eigenvectors associated with those virtually degenerate eigenvalues Chojnacki_Revisiting8pt_TPAMI_2003 . It has been found that proper normalization of the input image coordinates results in better numerical conditioning when carrying out the linear DLT computation, and that the improved numerical conditioning makes the smallest eigenvector of $\hat{\mathbf{A}}^\top\hat{\mathbf{A}}$ far less susceptible to interference Hartley_Normalization_TPAMI_1997 ; Chojnacki_Consistency_JMIV_2007 . From this point, a natural question arises: Can we achieve the ultimate optimal condition number $\kappa(\hat{\mathbf{A}}) = 1$? Below we show that the condition number of the transformed coefficient matrix cannot reach the optimum of 1. A follow-up question must be: Can we have a better normalization transformation? This paper provides a positive answer in the next section. We develop a self-supervised CNN-based technique that learns the convolutional neural network weights based on a geometric loss function. It requires no ground truth labeling but has shown highly improved performance in various experiments.
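To make the conditioning argument concrete, the sketch below compares the ratio of the largest to the eighth singular value of the coefficient matrix before and after a translate-and-scale normalization. The helper names and the random pixel-range test data are assumptions for illustration only; the exact numbers depend on the sample.

```python
import numpy as np

def coefficient_matrix(u_prime, u):
    # Row i is u_i^T (kron) u'_i^T, as in the DLT formulation.
    return np.stack([np.kron(b, a) for a, b in zip(u_prime, u)])

def hartley_matrix(pts):
    # Center the points and scale to a mean distance of sqrt(2).
    c = pts[:, :2].mean(axis=0)
    s = np.sqrt(2.0) / np.linalg.norm(pts[:, :2] - c, axis=1).mean()
    return np.array([[s, 0.0, -s * c[0]], [0.0, s, -s * c[1]], [0.0, 0.0, 1.0]])

def equivalent_condition_number(A):
    # Ratio of the greatest to the eighth singular value of the N x 9 matrix.
    sv = np.linalg.svd(A, compute_uv=False)
    return sv[0] / sv[7]

# Hypothetical sample: eight points per image, drawn in a pixel-sized range.
rng = np.random.default_rng(0)
u = np.column_stack([rng.uniform(0, 1200, 8), rng.uniform(0, 370, 8), np.ones(8)])
u_prime = np.column_stack([rng.uniform(0, 1200, 8), rng.uniform(0, 370, 8), np.ones(8)])

kappa_raw = equivalent_condition_number(coefficient_matrix(u_prime, u))
T_p, T = hartley_matrix(u_prime), hartley_matrix(u)
kappa_norm = equivalent_condition_number(coefficient_matrix(u_prime @ T_p.T, u @ T.T))
print(f"raw: {kappa_raw:.3e}, normalized: {kappa_norm:.3e}")
```

In raw pixel coordinates the columns of the coefficient matrix differ by several orders of magnitude, which is what inflates the ratio; normalization brings all columns to a comparable scale.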
Proposition 1
There is no pair of normalization matrices $\mathbf{T}'$ and T that results in $\kappa(\hat{\mathbf{A}}) = 1$.
Proof
(Proof by contradiction) For the full row rank matrix A, there must be an invertible matrix such that holds, where the matrix also has full row rank Horn_MatrixAnalysis_2012 . Moreover, one can assume that each row of Q represents a standard orthonormal basis of the -dimensional subspace, which is easily achieved by matrix decomposition Horn_MatrixAnalysis_2012 , such as Gram-Schmidt orthogonalization, QR decomposition, and SVD decomposition.
The condition number if and only if , where is a non-zero positive constant Horn_MatrixAnalysis_2012 ; Chen_Cond_ECNU_1986 ; this implies that the rows of make up eight orthogonal bases of the -dimensional subspace up to a fixed-length scale . Therefore, in order to achieve , the two invertible transformations and T should make hold, i.e.,
(6) |
Note that = for any Except for the trivial configuration in which , the rank of the sum on the right-hand-side must exist to be equal to 3 for any and T (e.g. given in Eq. (3)); so Eq. (6) cannot be established. That is, there are no normalization matrices and T to make tenable.
4 Learning-based normalization with self-supervised CNNs
This section develops a machine learning model that produces T and $\mathbf{T}'$, the two data normalization matrices, which result in a better estimation of F than Hartley’s normalization for eight input correspondences. As discussed in Section 3, the estimation of the fundamental matrix has two main steps. First, the input image coordinates are normalized by T and $\mathbf{T}'$ to construct the data matrix $\hat{\mathbf{A}}$, and the solution $\hat{\mathbf{F}}$ is obtained. Second, $\hat{\mathbf{F}}'$ is reconstructed by enforcing the singularity constraint. The following are two observations regarding this estimation process:
1) The goal of Hartley’s normalization is to achieve a better computation of $\hat{\mathbf{F}}$. However, this does not guarantee the singularity condition $\det(\hat{\mathbf{F}}) = 0$, which is why the singularity constraint enforcement (SCE) is necessary.
2) There are cases where enforcing the singularity ($\sigma_3 = 0$) brings about large nonlinear projection errors and leads to an unsatisfactory estimation of the fundamental matrix. This happens especially when is not large enough.
It is evident that the singularity constraint should be considered at the same time as well as the numerical conditioning when the normalization matrices T and are prepared, which implies the existence of better normalization schemes.
Our approach adopts a CNN-based model and a self-supervised learning algorithm to train it. The model outputs the parameters of the normalization matrices when eight input correspondences are provided as input. Following the conjecture of the affine structure of the normalization matrix proposed in Ref. Muhlich_TLS_ECCV_1998 , the normalization matrix is designed here to have two more parameters than Hartley’s normalization:
(7) |
which can characterize the data distribution better and enable more general normalization schemes to be implemented by CNNs. Nevertheless, how to robustly determine three normalization parameters (especially , , and ) has always been a difficult problem. Note that, after Hartley’s seminal solution Hartley_Normalization_TPAMI_1997 , there has been no substantial progress in designing hand-crafted normalization strategies. In contrast, we try to extend Hartley’s normalization and develop a deep solution for normalization. The performance of the CNN model for this parametrization is evaluated and visualized through various experiments in Section 5. The overall computation pipeline of our framework is illustrated in Fig. 2.

4.1 Self-supervised learning for normalization

Network architecture. The overall network architecture is illustrated in Fig. 3. We adopt a structure of 12 consecutive ResNet blocks as the first stage of the CNN, which is consistent with the classic two-view geometry estimation networks Yi_LearningCorrespondences_CVPR_2018 ; Ranftl_DeepFundamental_ECCV_2018 . The eight input points $\mathbf{u}'$ or u are first processed by multi-layer perceptrons of 128 neurons sharing weights Yi_LearningCorrespondences_CVPR_2018 between correspondences. Then, the 128-dimensional features for each correspondence are passed through 12 ResNet blocks He_ResNet_CVPR_2016 ; Yi_LearningCorrespondences_CVPR_2018 . The integration of global information is performed by weight-sharing operations between different correspondences, followed by instance normalization Ulyanov_TextureNetwork_CVPR_2017 after each layer. Max-pooling and instance normalization are applied to the input of the first ResNet block and to the output of each of the 12 ResNet blocks, to extract 13 global features. This process enables the CNN to maintain permutation invariance and fixes the size of the global feature maps. Then, the 13 feature maps are concatenated and delivered to a two-dimensional (2D) convolutional layer, which consists of eight channels, square kernels, and unequal strides of four along the columns and one along the rows. The output of the 2D convolution is then passed through two fully-connected layers, each with a dimension of 256 and followed by ReLU. Finally, the three normalization parameters corresponding to $\mathbf{u}'$ or u are regressed. Note that our network supports the input of more than eight correspondences; this flexibility, which is valuable in practice, is mainly due to the max-pooling and instance normalization design.
Our network is inspired by 3DRegNet Pais_3DRegNet_arxiv_2019 but has significant differences in architecture design: we utilize weight sharing for point correspondences, instance normalization module for better performance, and fewer parameters in 2D convolution. Specifically, compared to the representative two-view geometry estimation methods Yi_LearningCorrespondences_CVPR_2018 ; Ranftl_DeepFundamental_ECCV_2018 , our network is invariant to the permutation of the correspondences.
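The permutation-invariance mechanism described above (shared per-correspondence weights followed by max-pooling) can be illustrated with a toy NumPy model. This is a deliberately simplified sketch with random, untrained weights: the layer sizes, the single hidden layer, and the name `toy_normalization_net` are assumptions, and none of the ResNet blocks, instance normalization, or 2D convolution of the actual network are reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 128))   # shared per-point weights (toy, untrained)
W2 = rng.normal(size=(128, 3))   # head regressing three parameters

def toy_normalization_net(points):
    """points: (N, 2) image coordinates; returns three output values."""
    # The same weights are applied to every point (weight sharing).
    feats = np.maximum(points @ W1, 0.0)
    # Max-pooling over the point axis discards the ordering and fixes the
    # feature size regardless of N.
    global_feat = feats.max(axis=0)
    return global_feat @ W2
```

Because the only operation that mixes points is a maximum over the point axis, reordering the rows of `points` cannot change the output, and the same function accepts more than eight points.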
Self-supervised learning. In order to train our model through self-supervised learning, the outputs obtained from the CNN model are leveraged to construct the normalization matrices T and $\mathbf{T}'$, which are fed into the next module performing 1) the data scaling, 2) DLT to compute $\hat{\mathbf{F}}$, and 3) SVD to compute the singularity-constrained $\hat{\mathbf{F}}'$. Finally, the output F is evaluated using the loss function, chosen to be the symmetric epipolar distance Hartley_MVG_2003 :
(8) |
We tested several variants of distance functions, including the Sampson distance and the algebraic distance, and decided to use the symmetric epipolar distance because it showed superior results in the experiments. Interestingly, these findings contrast with the findings of Ref. Hartley_MVG_2003 .
By minimizing this loss function, we can train the network without any ground truth data at all, contrary to Ref. Pais_3DRegNet_arxiv_2019 or Ref. Ranftl_DeepFundamental_ECCV_2018 ; the network is self-supervised in the geometric sense. This also enables us to exploit a very large number of frames from video sequence datasets under various kinds of camera motion.
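For reference, a per-correspondence symmetric epipolar distance can be sketched as below. The squared form used here follows the common textbook definition (sum of squared point-to-epipolar-line distances in both images) and is an assumption; the exact scaling or averaging used in Eq. (8) may differ, and the function name is hypothetical.

```python
import numpy as np

def symmetric_epipolar_distance(F, u_prime, u):
    """Squared symmetric epipolar distance for each correspondence.

    F: (3, 3) fundamental matrix with u'^T F u = 0.
    u_prime, u: (N, 3) homogeneous points with last entry 1.
    """
    Fu = u @ F.T            # row i is F u_i, the epipolar line in image 2
    Ftu = u_prime @ F       # row i is F^T u'_i, the epipolar line in image 1
    r = np.sum(u_prime * Fu, axis=1)          # algebraic residual u'^T F u
    inv_norm_2 = 1.0 / (Fu[:, 0] ** 2 + Fu[:, 1] ** 2)
    inv_norm_1 = 1.0 / (Ftu[:, 0] ** 2 + Ftu[:, 1] ** 2)
    return r ** 2 * (inv_norm_1 + inv_norm_2)
```

The sketch does not guard against degenerate epipolar lines, for which the denominators vanish. Since every operation is differentiable almost everywhere, the same expression can serve as a training loss when written in an automatic differentiation framework.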
Addressing the ordering invariance. Our network model is designed to be invariant to the order of the input image points similar to Ref. Qi_PointNet_CVPR_2017 or Ref. Pais_3DRegNet_arxiv_2019 , thereby obtaining invariance in the subsequent fundamental matrix computation.
Proposition 2
As long as the computation of the normalization matrices $\mathbf{T}'$ and T has permutation invariance, then so has the computation of the fundamental matrix.
Proof
Because $\mathbf{T}'$ and T remain invariant for any order of the input data $\mathbf{u}'$ and u, the resulting $\hat{\mathbf{u}}'$ and $\hat{\mathbf{u}}$ hold the same order as $\mathbf{u}'$ and u after normalization; reordering the input is therefore equivalent to permuting the rows of the transformed coefficient matrix $\hat{\mathbf{A}}$ in Eq. (5). However, a row permutation of $\hat{\mathbf{A}}$ does not change the right singular vector corresponding to its smallest singular value Horn_MatrixAnalysis_2012 , since $(\mathbf{P}\hat{\mathbf{A}})^\top(\mathbf{P}\hat{\mathbf{A}}) = \hat{\mathbf{A}}^\top\hat{\mathbf{A}}$ for any permutation matrix $\mathbf{P}$; i.e., the estimation of $\hat{\mathbf{F}}$ is not affected. Consequently, the final fundamental matrix F also has permutation invariance.
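The key step of the proof, that permuting the rows of the coefficient matrix leaves the smallest right singular vector unchanged, can be checked numerically. The snippet below is a small sanity check on a random matrix, not part of the original derivation; the function name is hypothetical, and the comparison is made up to the sign ambiguity inherent in singular vectors.

```python
import numpy as np

def smallest_right_singular_vector(A):
    # Last row of V^T: the right singular vector of the smallest singular value.
    return np.linalg.svd(A)[2][-1]

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 9))          # stand-in for the 8 x 9 coefficient matrix
perm = rng.permutation(8)            # an arbitrary reordering of the rows

v_original = smallest_right_singular_vector(A)
v_permuted = smallest_right_singular_vector(A[perm])

# Unit vectors that agree up to sign have |dot product| equal to 1.
agreement = abs(v_original @ v_permuted)
```

Here `agreement` equals 1 up to floating-point error, because the matrix $\mathbf{A}^\top\mathbf{A}$ is a sum over rows and therefore does not depend on their order.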
Training procedure. The network is implemented in PyTorch. We adopt the Adamax Optimizer Kingma_Adam_ICLR_2015 with an initial learning rate of and a decreasing learning rate of 0.8 times per 10 epochs. The chosen batch size is 16 and the network is trained for 150 epochs. Each input set is pre-filtered by the residual based on the original eight-point algorithm with a threshold (60 pixels) sufficiently large to enhance the stability of the training process.
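The step decay described above (multiplying the learning rate by 0.8 every 10 epochs) can be written as a one-line schedule. In this sketch `base_lr` is a placeholder argument, since the initial learning rate is not reproduced here, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr):
    """Step decay: multiply base_lr by 0.8 once every 10 completed epochs."""
    return base_lr * 0.8 ** (epoch // 10)

# Learning rates over the 150 training epochs, relative to the initial value.
relative_schedule = [learning_rate(epoch, 1.0) for epoch in range(150)]
```

In PyTorch, the same behaviour is commonly obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.8)`; whether the authors used this scheduler is an assumption.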
5 Experimental results
To prove that our approach can learn normalization matrices adapted to the input data and obtain more accurate fundamental matrix estimations, we benchmark the performance of our approach on three typical datasets with varying regularity. Furthermore, we perform cross-dataset validation to prove the generalizability of our approach.
5.1 Datasets
KITTI dataset. The KITTI odometry dataset Geiger_KITTI_CVPR_2012 consists of 22 distinct sequences from a car driving around a residential area. This dataset exhibits dominant forward motion with high regularity but shows difficult data associations. We choose the first 11 sequences with ground truth from GPS and a Velodyne LiDAR. Specifically, we employ sequences “00” to “05” for training and sequences “06” to “10” for testing in our experiment, which enables a fair comparison with recent state-of-the-art methods Ranftl_DeepFundamental_ECCV_2018 .
TUM dataset. We use the indoor sequences from the TUM RGB-D dataset Sturm_RGBD_IROS_2012 , which contains several hand-held sequences with ground truth obtained by an additional motion capture system. This dataset reflects rich camera motion and scene geometry, and shows the most general cases for fundamental matrix estimation. We exploit the cross-validation for the sequence “fr3_long_office” during training. To better test the generalizability of the proposed method, we resize the image size of the TUM RGB-D dataset to be consistent with that of the KITTI dataset.
Cambridge dataset. The Cambridge dataset Kendall_PoseNet_ICCV_2015 is a large-scale outdoor urban localization setting, containing six challenging scenes with changes in perspective and illumination; this setting is quite different from TUM and KITTI datasets. Here we adopt the “St Mary’s Church” scene to evaluate the generalization ability of our proposed approach, and report only the qualitative results in the following section.
We generate two different correspondence datasets for each of the KITTI and TUM datasets, which are stored in a manner similar to that used in Ref. Menze_SceneFlow_CVPR_2015 . In the first one, 1000 correspondences based on SIFT Lowe_SIFT_IJCV_2004 are pre-filtered by employing a ratio test with a threshold of 0.8. The second one does not use the ratio test to pre-filter the correspondences, which generates a challenging dataset with high noise. The ratio test is a frequently used strategy for improving the robustness and accuracy of feature matching. Therefore, unless otherwise stated, we utilize the pre-filtered datasets in our experiments. Moreover, each input sample is generated by shuffling all the correspondences between two views in the dataset.
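The ratio test used for pre-filtering can be sketched as follows: a match is kept only when its nearest-neighbour descriptor distance is below 0.8 times the second-nearest distance. The brute-force Euclidean matching and the function name `ratio_test` are assumptions for illustration; the descriptor extraction itself is outside the scope of this sketch.

```python
import numpy as np

def ratio_test(desc1, desc2, ratio=0.8):
    """Return (index in desc1, index in desc2) pairs that pass the ratio test.

    desc1: (N1, D) descriptors from image 1; desc2: (N2, D), N2 >= 2.
    """
    # Pairwise Euclidean distances between all descriptors.
    d = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    rows = np.arange(len(desc1))
    best, second = order[:, 0], order[:, 1]
    # Keep a match only if it is clearly better than the runner-up.
    keep = d[rows, best] < ratio * d[rows, second]
    return [(int(i), int(best[i])) for i in rows[keep]]
```

For example, a descriptor that is equally close to two candidates is rejected, while one with a single clear nearest neighbour is kept.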
5.2 Evaluation protocols
To evaluate the performance of our approach, we report the average better rate per input sample, i.e., the average percentage of input samples on which our learning-based normalization outperforms Hartley’s normalization in terms of the symmetric epipolar distance (see Eq. (8)). In addition, in the experiments within the RANSAC framework, we evaluate the average percentage of inliers (correspondences with errors less than 1 pixel or 0.1 pixels), as well as the F1 score (the average percentage of correspondences below 1 pixel error with respect to the ground truth epipolar line).
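Under the definitions above, the two headline metrics reduce to simple counts. The following sketch assumes per-sample (or per-correspondence) error arrays have already been computed; the function names are hypothetical.

```python
import numpy as np

def better_rate(err_learned, err_hartley):
    """Percentage of samples where the learned normalization has lower error."""
    err_learned, err_hartley = np.asarray(err_learned), np.asarray(err_hartley)
    return 100.0 * np.mean(err_learned < err_hartley)

def inlier_percentage(errors, threshold):
    """Percentage of correspondences whose error is below threshold (pixels)."""
    return 100.0 * np.mean(np.asarray(errors) < threshold)
```

For instance, `better_rate([1, 2, 3, 4], [2, 1, 4, 5])` gives 75.0, since the first method has the lower error on three of the four samples.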



5.3 Experimental evaluations
In the first experiment, we evaluate the performance of our approach on a per-sample basis. We first optimize F based on Eq. (8) under singularity constraints for Hartley’s normalization and our learning-based normalization on the KITTI testing set, and the results are summarized in Fig. 4 (a). The equivalence between our approach and the Hartley-based optimization result is reported, which indicates that our approach can provide better initial values for more sophisticated nonlinear optimization methods. Unlike the constant distance from the origin in Hartley’s normalization, Fig. 5 shows that our learning-based normalization predicts a distance tailored to each input sample, which exploits the inherent regularity of the input data.
Training set | Better rate (%) @KITTI | Better rate (%) @TUM |
---|---|---|
Train on KITTI | 86.68 | 90.04 |
Train on TUM | 84.22 | 89.73 |
Train on KITTI & TUM | 87.05 | 91.43 |
Then, we quantitatively evaluate the average better rate per input sample, which is our primary concern. Since Hartley’s normalization is the most widely used normalization method Hartley_MVG_2003 , we only compare with it here. As presented in Table 1, our learning-based normalization outperforms Hartley’s normalization on the large majority of input samples, with better rates between 84.22% and 91.43%. Interestingly, the model trained on the KITTI dataset generalizes well to the TUM dataset, and vice versa, which shows the generalizability of our approach. To further analyze the impact of the training set, we also report the better rate obtained when the KITTI and TUM datasets are used jointly for training. The better rate improves further on both test sets, which shows that our approach can learn a better and more general normalization scheme from more training data that contains diverse regularities. Finally, in Fig. 4 (b), we report the distribution of the symmetric epipolar distance for the original eight-point algorithm, with Hartley’s normalization, and with our learning-based normalization. While both normalizations achieve great improvements with respect to the unnormalized version, our learning-based normalization consistently outperforms Hartley’s normalization in achieving lower errors for eight input correspondences.
Method | Inliers (%) @0.1px | F1 @0.1px | Inliers (%) @1px | F1 @1px |
---|---|---|---|---|
Ranftl’s Ranftl_DeepFundamental_ECCV_2018 | 24.61 | 14.65 | 85.87 | 75.77 |
MLESAC Torr_MLESAC_CVIU_2000 | 18.60 | 12.54 | 84.48 | 75.15 |
LMEDS Rousseeuw_LMSR_JASA_1984 | 20.01 | 13.34 | 84.23 | 75.44 |
USAC Raguram_USAC_PAMI_2013 | 21.43 | 13.90 | 85.13 | 75.70 |
RANSAC Fischler_RANSAC_ACM_1981 | 21.85 | 13.84 | 84.96 | 75.65 |
Ours | 21.89 | 13.86 | 84.98 | 75.66 |
Given the superior performance of our learning-based normalization on each input sample, we further verify empirically that our approach can be effectively integrated into the traditional RANSAC framework Fischler_RANSAC_ACM_1981 . In the experimental comparison, we follow the most closely related classic work Ranftl_DeepFundamental_ECCV_2018 . We compare our approach with the least median of squares (LMEDS) Rousseeuw_LMSR_JASA_1984 , MLESAC Torr_MLESAC_CVIU_2000 , USAC Raguram_USAC_PAMI_2013 , Ranftl’s method Ranftl_DeepFundamental_ECCV_2018 and RANSAC Fischler_RANSAC_ACM_1981 , where RANSAC is based on Hartley’s normalization while our approach uses the learning-based normalization. Note that USAC is a state-of-the-art robust estimation framework, and “RANSAC + normalized eight-point algorithm” represents the gold standard Hartley_MVG_2003 for geometric tasks such as visual odometry and SLAM. In Ranftl’s method Ranftl_DeepFundamental_ECCV_2018 , the matching scores are used as additional information to guide the estimation, which can noticeably improve the average accuracy. By contrast, we use only the original RANSAC in our performance evaluation. It is also worth noting that, as a supervised learning-based framework, Ranftl’s method requires ground truth correspondences for training, while our approach is fully self-supervised. Additionally, designing an ensemble network to improve overall performance, such as DSAC Brachmann_DSAC_CVPR_2017 , is outside the scope of this paper, as our focus is better normalization for each sample.
Table 2 summarizes the results on the KITTI dataset. Within the RANSAC framework, our learning-based normalization performs on par with Hartley’s normalization on the KITTI benchmark. Furthermore, we evaluate the performance on the challenging testing set without the ratio test, and the results are presented in Table 3. Note that our approach achieves a slightly higher inlier ratio on the TUM dataset. We remark that recent analyses in Refs. chin2018robust ; chin2020quantum , as well as related experiments in Ref. ding2020minimal , indicate that RANSAC paradigms with supporting heuristics can only increase the chance of finding a good final solution and are not completely governed by the internal solver, which is one possible reason why our method yields only a slight improvement when it is embedded into RANSAC. Overall, these results show that our learning-based normalization can be combined effectively with RANSAC.
Dataset | Normalization | Inliers (%) @0.1px | Inliers (%) @1px |
---|---|---|---|
KITTI | Hartley | 4.40 | 30.50 |
KITTI | Learning | 4.38 | 30.55 |
TUM | Hartley | 7.60 | 44.13 |
TUM | Learning | 7.64 | 44.18 |
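To make the reported metric concrete, the following is a minimal sketch of how an inlier percentage at a pixel threshold could be computed for a given fundamental matrix. It assumes that a correspondence counts as an inlier when its symmetric epipolar distance (taken here as the mean of the two point-to-line distances) falls below the threshold; this definition and the function names `epipolar_distances` and `inlier_percentage` are illustrative assumptions, not necessarily the exact protocol behind the table.

```python
import numpy as np

def epipolar_distances(F, x1, x2):
    """Mean of the two point-to-epipolar-line distances, in pixels.

    F is 3x3 with convention x2^T F x1 = 0; x1 and x2 are (N, 2) arrays.
    Assumed metric for illustration only.
    """
    h1 = np.hstack([x1, np.ones((len(x1), 1))])
    h2 = np.hstack([x2, np.ones((len(x2), 1))])
    l2 = h1 @ F.T  # row i is F @ x1_i, the epipolar line in image 2
    l1 = h2 @ F    # row i is F^T @ x2_i, the epipolar line in image 1
    residual = np.abs(np.sum(h2 * l2, axis=1))  # |x2^T F x1|
    d2 = residual / np.linalg.norm(l2[:, :2], axis=1)
    d1 = residual / np.linalg.norm(l1[:, :2], axis=1)
    return 0.5 * (d1 + d2)

def inlier_percentage(F, x1, x2, threshold_px):
    """Percentage of correspondences whose distance is below the threshold."""
    d = epipolar_distances(F, x1, x2)
    return 100.0 * float(np.mean(d < threshold_px))

# Toy example: pure sideways translation, so epipolar lines are horizontal
# and the distance reduces to the vertical offset |v1 - v2|.
F_toy = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])
x1_toy = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0], [70.0, 80.0]])
x2_toy = np.array([[15.0, 20.05], [33.0, 40.5], [51.0, 62.0], [90.0, 80.0]])
print(inlier_percentage(F_toy, x1_toy, x2_toy, 0.1))
print(inlier_percentage(F_toy, x1_toy, x2_toy, 1.0))
```

Degenerate epipolar lines (zero normal vector) are not handled in this sketch.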
Finally, we directly apply the network trained on the KITTI dataset to the Cambridge dataset, which is very different from KITTI. The qualitative generalization results on the Cambridge dataset are reported in Fig. 6. One can see that our approach achieves accurate two-view fundamental matrix estimation, which reflects its good generalization ability. Moreover, since we always center the correspondences first, varying image sizes and feature distributions do not have a significant impact on the final results. Currently, our forward propagation time is approximately 5 times that of Hartley’s normalization due to the use of 12-layer ResNet architectures. In return for this extra cost, the learned normalization yields more accurate epipolar geometry for each sample.

Influence of the number of correspondences. We perform additional experiments to analyze the influence of the number of input correspondences. We report the median over 1000 trials on a randomly chosen testing image. The results are shown in Fig. 7, which indicates that better fundamental matrices can be obtained as the number of correspondences increases.
Condition numbers. We conduct another experiment to compare the condition numbers that arise when solving for the fundamental matrix, and the results are reported in Fig. 8. We observe that our learning-based normalization yields better numerical conditioning of the transformed coefficient matrix, which is one of the keys to our improved performance.
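As a rough illustration of what such a comparison involves, the sketch below builds the eight-point coefficient matrix from raw pixel coordinates and from Hartley-normalized coordinates and prints both condition numbers. It covers only the classical Hartley baseline, not the learned normalization. The condition number is assumed here to be the ratio of the largest to the eighth singular value, following Hartley’s analysis, and all function names are hypothetical.

```python
import numpy as np

def hartley_transform(pts):
    """3x3 similarity that centers pts (N, 2) and scales the mean
    distance from the origin to sqrt(2)."""
    centroid = pts.mean(axis=0)
    mean_dist = np.linalg.norm(pts - centroid, axis=1).mean()
    scale = np.sqrt(2.0) / mean_dist
    return np.array([[scale, 0.0, -scale * centroid[0]],
                     [0.0, scale, -scale * centroid[1]],
                     [0.0, 0.0, 1.0]])

def transform_points(T, pts):
    """Apply a 3x3 transformation to (N, 2) points."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ T.T
    return mapped[:, :2] / mapped[:, 2:3]

def coefficient_matrix(x1, x2):
    """Rows a_i such that a_i @ vec(F) = 0 encodes x2_i^T F x1_i = 0."""
    u1, v1 = x1[:, 0], x1[:, 1]
    u2, v2 = x2[:, 0], x2[:, 1]
    return np.stack([u2 * u1, u2 * v1, u2,
                     v2 * u1, v2 * v1, v2,
                     u1, v1, np.ones_like(u1)], axis=1)

def condition_number(A):
    """Largest over eighth singular value (assumed definition)."""
    s = np.linalg.svd(A, compute_uv=False)
    return s[0] / s[7]

# Synthetic correspondences in pixel units (illustrative data only).
rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 1000.0, size=(8, 2))
x2 = x1 + rng.normal(0.0, 5.0, size=(8, 2))

T1, T2 = hartley_transform(x1), hartley_transform(x2)
raw = condition_number(coefficient_matrix(x1, x2))
normalized = condition_number(
    coefficient_matrix(transform_points(T1, x1), transform_points(T2, x2)))
print("raw:", raw, "Hartley-normalized:", normalized)
```

A learned normalization would replace `hartley_transform` with matrices predicted per sample, while the rest of the comparison stays the same.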
Nonlinear projection. The singularity of the estimated fundamental matrix is evaluated by calculating the nonlinear projection error for every 100 consecutive frames of the KITTI testing set. The results are displayed in Fig. 9, which shows that our learning-based approach achieves smaller nonlinear projection errors. These findings also support our argument that a better-conditioned transformed coefficient matrix, obtained via better normalization, is more conducive to imposing the singularity constraint on the resulting fundamental matrix. Together, these experimental results highlight the advantage of our learning-based normalization approach.
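For readers unfamiliar with the singularity constraint enforcement step, the sketch below shows the standard SVD-based projection of a 3x3 estimate onto the set of rank-2 matrices, together with one plausible way to quantify how far the estimate was from being singular. Measuring this as the Frobenius distance between the estimate and its rank-2 projection is an assumption made for illustration, and the function names are hypothetical.

```python
import numpy as np

def enforce_singularity(F):
    """Closest rank-2 matrix to F in the Frobenius norm, obtained by
    zeroing the smallest singular value."""
    U, s, Vt = np.linalg.svd(F)
    return U @ np.diag([s[0], s[1], 0.0]) @ Vt

def projection_error(F):
    """Frobenius distance between F and its rank-2 projection
    (assumed error measure; it equals the smallest singular value of F)."""
    return np.linalg.norm(F - enforce_singularity(F), ord="fro")

# Illustrative unconstrained estimate (random values, not real data).
rng = np.random.default_rng(0)
F_est = rng.normal(size=(3, 3))
F_rank2 = enforce_singularity(F_est)
print("det after projection:", np.linalg.det(F_rank2))
print("projection error:", projection_error(F_est))
```

Under this measure, a smaller error means the linear solution was already closer to a valid fundamental matrix before the constraint was imposed.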



6 Conclusion
In this paper, we revisit the classic two-view geometry computation with eight point correspondences and employ CNNs to provide a novel perspective on better normalization. First, we show that our approach can attain a more favorable condition number, which is more consistent with the subsequent singularity constraint enforcement step. Second, we propose a self-supervised deep neural network that learns a robust normalization scheme for more accurate fundamental matrix estimation. Our approach enables a data-driven estimation pipeline to perform interpretable and generalizable fundamental matrix estimation. Our learning-based normalization is superior to Hartley’s normalization on each input sample and is comparable to Hartley’s normalization when integrated with RANSAC. Its potential advantages are that it provides better initial values for non-linear optimization and affords better interpretability for an ensemble network. In the future, we plan to design a lightweight network that balances runtime and accuracy, utilize ground truth correspondences or ground truth matching scores to explore supervised two-view geometry estimation, and further extend our deep solution to other multi-view geometry problems such as triangulation and trifocal tensor estimation.
Abbreviations
2D, two-dimensional; 3D, three-dimensional; CNN, convolutional neural network; DLT, direct linear transformation; RANSAC, random sample consensus; SCE, singularity constraint enforcement; SfM, structure from motion; SLAM, simultaneous localization and mapping; SVD, singular value decomposition.
Acknowledgements
We want to thank Jihuang Dai and Xiang Guo for investigating relevant literature. The authors express their gratitude to the anonymous reviewers and the editor.
Availability of data and materials
The datasets generated during and/or analyzed during the current study are available in the KITTI repository: https://www.cvlibs.net/datasets/kitti/eval_odometry.php.
Funding
This work was supported in part by the National Natural Science Foundation of China (No. 62271410) and the National Postdoctoral Innovative Talent Program, China (No. BX20230013).
Declarations
Author contributions
All authors contributed to the study conception and design. Material preparation, theoretical derivation and analysis were performed by FB and DYC. The first draft of the manuscript was written by FB and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Author details
1School of Electronics and Information, Northwestern Polytechnical University and Shaanxi Key Laboratory of Information Acquisition and Processing, Xi’an 710129, China. 2Department of Art and Technology, Sogang University, Seoul 04107, Korea.
Competing interests
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.
References
- [1] Longuet-Higgins, H. C. (1981). A computer algorithm for reconstructing a scene from two projections. Nature, 293(5828), 133–135.
- [2] Hartley, R. (1997). In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(6), 580–593.
- [3] Dai, Y., Li, H., & Kneip, L. (2016). Rolling shutter camera relative pose: Generalized epipolar geometry. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4132–4140). Piscataway: IEEE.
- [4] Zhao, C., Fan, B., Hu, J., Pan, Q., & Xu, Z. (2021). Homography-based camera pose estimation with known gravity direction for UAV navigation. Science China Information Sciences, 64(1), 1–13.
- [5] Szpak, Z. L., Chojnacki, W., & van den Hengel, A. (2015). Guaranteed ellipse fitting with a confidence region and an uncertainty measure for centre, axes, and orientation. Journal of Mathematical Imaging and Vision, 52(2), 173–199.
- [6] Zhang, L., & Koch, R. (2014). Structure and motion from line correspondences: Representation, projection, initialization and sparse bundle adjustment. Journal of Visual Communication and Image Representation, 25(5), 904–915.
- [7] Mühlich, M., & Mester, R. (2001). Subspace methods and equilibration in computer vision. In Proceedings of the 12th Scandinavian conference on image analysis (pp. 415–422). Cham: Springer.
- [8] Mair, E., Suppa, M., & Burschka, D. (2013). Error propagation in monocular navigation for Z∞ compared to eightpoint algorithm. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (pp. 4220–4227). Piscataway: IEEE.
- [9] da Silveira, T. L., & Jung, C. R. (2019). Perturbation analysis of the 8-Point algorithm: A case study for wide FoV cameras. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 11757–11766). Piscataway: IEEE.
- [10] Mühlich, M., & Mester, R. (1998). The role of total least squares in motion analysis. In Proceedings of the 5th European conference on computer vision (pp. 305–321). Cham: Springer.
- [11] DeTone, D., Malisiewicz, T., & Rabinovich, A. (2016). Deep image homography estimation. arXiv preprint. arXiv:1606.03798.
- [12] Ranftl, R., & Koltun, V. (2018). Deep fundamental matrix estimation. In Proceedings of the 15th European conference on computer vision (pp. 284–299). Cham: Springer.
- [13] Tang, C., & Tan, P. (2018). BA-Net: Dense bundle adjustment network. In Proceedings of the 6th international conference on learning representations (pp. 284–299). Retrieved October 7, 2023, from https://openreview.net/forum?id=B1gabhRcYX.
- [14] Im, S., Jeon, H.-G., Lin, S., & Kweon, I. S. (2019). DPSNet: End-to-end deep plane sweep stereo. [Poster presentation]. Proceedings of the 7th international conference on learning representations. New Orleans, USA.
- [15] Fan, B., Wang, K., Dai, Y., & He, M. (2021). RS-DPSNet: Deep plane sweep network for rolling shutter stereo images. IEEE Signal Processing Letters, 28, 1550–1554.
- [16] Fan, B., Dai, Y., & He, M. (2023). Rolling shutter camera: Modeling, optimization and learning. Machine Intelligence Research, 20(6), 783–798.
- [17] Fan, B., Dai, Y., & Li, H. (2022). Rolling shutter inversion: Bring rolling shutter images to high framerate global shutter video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5), 6214–6230.
- [18] Brachmann, E., Krull, A., Nowozin, S., Shotton, J., Michel, F., & Gumhold, S. (2017). DSAC-differentiable RANSAC for camera localization. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6684–6692). Piscataway: IEEE.
- [19] Csurka, G., Zeller, C., Zhang, Z., & Faugeras, O. D. (1997). Characterizing the uncertainty of the fundamental matrix. Computer Vision and Image Understanding, 68(1), 18–36.
- [20] Sur, F., Noury, N., & Berger, M.-O. (2008). Computing the uncertainty of the 8 point algorithm for fundamental matrix estimation. In Proceedings of the British machine vision conference (pp. 965–974). Swansea: BMVA Press.
- [21] Chojnacki, W., & Brooks, M. J. (2003). Revisiting Hartley’s normalized eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9), 1172–1177.
- [22] Chojnacki, W., & Brooks, M. J. (2007). On the consistency of the normalized eight-point algorithm. Journal of Mathematical Imaging and Vision, 28(1), 19–27.
- [23] Nguyen, T., Chen, S. W., Shivakumar, S. S., Taylor, C. J., & Kumar, V. (2018). Unsupervised deep homography: A fast and robust homography estimation model. IEEE Robotics and Automation Letters, 3(3), 2346–2353.
- [24] Poursaeed, O., Yang, G., Prakash, A., Fang, Q., Jiang, H., & Hariharan, B. (2018). Deep fundamental matrix estimation without correspondences. In Proceedings of the 15th European conference on computer vision (pp. 485–497). Cham: Springer.
- [25] Yi, K. M., Trulls, E., Ono, Y., Lepetit, V., Salzmann, M., & Fua, P. (2018). Learning to find good correspondences. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2666–2674). Piscataway: IEEE.
- [26] Probst, T., Paudel, D. P., Chhatkuli, A., & Gool, L. V. (2019). Unsupervised learning of consensus maximization for 3D vision problems. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 929–938). Piscataway: IEEE.
- [27] Zhang, Z., Dai, Y., Fan, B., Sun, J., & He, M. (2022). Learning a task-specific descriptor for robust matching of 3D point clouds. IEEE Transactions on Circuits and Systems for Video Technology, 32(12), 8462–8475.
- [28] Zhang, Z., Sun, J., Dai, Y., Fan, B., & Liu, Q. (2022). Searching dense point correspondences via permutation matrix learning. IEEE Signal Processing Letters, 29, 1192–1196.
- [29] Horn, R. A., & Johnson, C. R. (2012). Matrix analysis. Cambridge: Cambridge University Press.
- [30] Chen, D. (1986). Some conclusions on condition numbers of matrix. Journal of East China Normal University (Natural Science), 3(2), 11–18.
- [31] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778). Piscataway: IEEE.
- [32] Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2017). Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6924–6932). Piscataway: IEEE.
- [33] Pais, G. D., Ramalingam, S., Govindu, V. M., Nascimento, J. C., Chellappa, R., & Miraldo, P. (2020). 3DRegNet: A deep neural network for 3D point registration. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7193–7203). Piscataway: IEEE.
- [34] Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge: Cambridge University Press.
- [35] Qi, C. R., Su, H., Mo, K., & Guibas, L. J. (2017). PointNet: deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 652–660). Piscataway: IEEE.
- [36] Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. [Poster presentation]. Proceedings of the 3rd international conference on learning representations. San Diego, USA.
- [37] Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3354–3361). Piscataway: IEEE.
- [38] Sturm, J., Engelhard, N., Endres, F., Burgard, W., & Cremers, D. (2012). A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the IEEE/RSJ international conference on intelligent robots and systems (pp. 573–580). Piscataway: IEEE.
- [39] Kendall, A., Grimes, M., & Cipolla, R. (2015). PoseNet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE international conference on computer vision (pp. 2938–2946). Piscataway: IEEE.
- [40] Menze, M., & Geiger, A. (2015). Object scene flow for autonomous vehicles. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3061–3070). Piscataway: IEEE.
- [41] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
- [42] Fischler, M. A., & Bolles, R. C. (1981). Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6), 381–395.
- [43] Rousseeuw, P. J. (1984). Least median of squares regression. Journal of the American Statistical Association, 79(388), 871–880.
- [44] Torr, P. H., & Zisserman, A. (2000). MLESAC: A new robust estimator with application to estimating image geometry. Computer Vision and Image Understanding, 78(1), 138–156.
- [45] Raguram, R., Chum, O., Pollefeys, M., Matas, J., & Frahm, J.-M. (2012). USAC: A universal framework for random sample consensus. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 2022–2038.
- [46] Chin, T.-J., Cai, Z., & Neumann, F. (2018). Robust fitting in computer vision: easy or hard? In Proceedings of the 15th European conference on computer vision (pp. 701–716). Cham: Springer.
- [47] Chin, T.-J., Suter, D., Ch’ng, S.-F., et al. (2020). Quantum robust fitting. In Proceedings of the 15th Asian conference on computer vision (pp. 485–499). Cham: Springer.
- [48] Ding, Y., Yang, J., Ponce, J., et al. (2020). Minimal solutions to relative pose estimation from two views sharing a common direction with unknown focal length. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 7045–7053). Piscataway: IEEE.