
EAR-Net: Pursuing End-to-End Absolute Rotations from Multi-View Images

Yuzhen Liu¹,² · Qiulei Dong¹,²

¹State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing 100190, China
²School of Artificial Intelligence, UCAS, Beijing 100049, China
Abstract

Absolute rotation estimation is an important topic in 3D computer vision. Existing works in the literature generally employ a multi-stage (at least two-stage) estimation strategy where multiple independent operations (feature matching, two-view rotation estimation, and rotation averaging) are implemented sequentially. However, such a multi-stage strategy inevitably leads to the accumulation of the errors caused by each involved operation, and accordingly degrades the final estimation of the global rotations. To address this problem, we propose an End-to-end method for estimating Absolute Rotations from multi-view images based on deep neural Networks, called EAR-Net. The proposed EAR-Net consists of an epipolar confidence graph construction module and a confidence-aware rotation averaging module. The epipolar confidence graph construction module is explored to simultaneously predict pairwise relative rotations among the input images and their corresponding confidences, resulting in a weighted graph (called the epipolar confidence graph). Based on this graph, the confidence-aware rotation averaging module, which is differentiable, is explored to predict the absolute rotations. Thanks to the introduced confidences of the relative rotations, the proposed EAR-Net could effectively handle outlier cases. Experimental results on three public datasets demonstrate that EAR-Net outperforms the state-of-the-art methods by a large margin in terms of both accuracy and inference speed.

keywords:
Deep Learning, Computer Vision, Rotation Estimation

1 Introduction

Absolute rotation estimation is a challenging and important topic in computer vision, which is to calculate the absolute or global camera rotations from a set of multi-view images. It has a broad range of applications, including structure from motion (SfM) [37, 17], multiple view stereo (MVS) [46, 2], novel view synthesis [32], etc.

As shown in Figure 1, a widely-used strategy for absolute rotation estimation in the literature includes the following multiple sequential but independent stages (at least two independent stages, i.e., the following stages (a) and (b) are combined into one 'end-to-end relative rotation estimation' stage in [31, 18, 10, 5]): (a) Feature matching: A traditional manner in the literature is to first apply a feature detector [29, 3] to extract keypoints from each input image, then build descriptors [35, 43] for these keypoints, and finally match these keypoints among the input images. Recently, a popular manner is to design a feature matching network that directly outputs the point correspondences without an explicit detector or descriptor [42, 8]. (b) Two-view rotation estimation: Once the feature correspondences among the input images are obtained, the classic 5-point algorithm [33] (jointly with RANSAC [20] in most cases for alleviating the influence of outliers) is implemented to calculate the relative rotation between each pair of images. (c) Rotation averaging: Once all the relative rotations (i.e., two-view rotations) are obtained, the absolute rotations corresponding to all the cameras are calculated by implementing a rotation averaging algorithm [26, 13, 7, 12].

Figure 1: Pipeline of the traditional strategy in literature for recovering absolute rotations from a given set of images. It generally consists of multiple stages, including (a) feature matching, (b) two-view rotation estimation and (c) rotation averaging.

However, the aforementioned multi-stage strategy inevitably leads to the accumulation of error generated at each stage, which is a common problem confronted by many multi-stage methods for handling various visual tasks [11, 7, 6]. Moreover, even if the performance at one stage is improved, it could not guarantee an improved performance at its subsequent stage. For example, as indicated by Fan et al., [19], a larger number of correct matches does not necessarily lead to a better estimation of relative poses.

Inspired by the success of end-to-end learning-based strategies in many other visual tasks [4, 9] for avoiding the above problems that multi-stage methods have to be confronted with, the following question naturally arises: Is it possible to combine the aforementioned stages into only one stage and estimate the absolute rotations from multi-view images in an end-to-end manner for pursuing a better performance? It is noted that this problem is not trivial, since the process of identifying and filtering the outliers generated at the aforementioned Stage (b) (feature match outliers) and Stage (c) (relative rotation outliers) is non-differentiable and hinders end-to-end learning. Here, two additional points have to be explained: (1) It is theoretically feasible to simply combine an end-to-end relative rotation estimation sub-network and an end-to-end rotation averaging sub-network into an end-to-end network for computing absolute rotations from multi-view images; however, such a simple combination is less competitive, as demonstrated by the experimental results in Section 4.7. (2) The incremental SfM technique [41] could recover the rotations from multi-view images by jointly optimizing poses and scene depths; however, this does not mean that incremental SfM is opposed to the absolute rotation estimation technique. In fact, compared with incremental SfM, the absolute rotation estimation technique is more suitable for many visual tasks that do not require depth or point clouds (e.g., camera calibration), as indicated by Hartley et al., [24]. Moreover, the estimated rotations could be applied to incremental SfM approaches for reducing drift [12, 40]. Hence, we do not give a further analysis and comparison between the two techniques in the following parts.

To address the above issues, we propose an end-to-end method for absolute rotation estimation from multi-view images in this paper, called EAR-Net. The proposed EAR-Net consists of an epipolar confidence graph construction (ECGC) module and a confidence-aware rotation averaging (CARA) module. The ECGC module is explored to use the input multi-view images to learn pairwise relative camera rotations and their confidences. Accordingly, a weighted graph (called the epipolar confidence graph) is built, whose vertices are the absolute rotations and whose edges are the pairwise relative rotations weighted by the learned confidences. A relative rotation with a low confidence is considered to be less accurate than one with a high confidence. Then, the CARA module is explored to use the built epipolar confidence graph for learning the absolute rotations. In the explored CARA module, a confidence-aware loss is designed for assigning different weights to different relative rotations, and simultaneously alleviating the negative influence of outliers, by utilizing the confidence scores from the epipolar confidence graph built in the ECGC module. And an iterative optimization algorithm, which is differentiable, is presented to minimize the confidence-aware loss, so that the proposed EAR-Net could be trained in an end-to-end manner.

To summarize, our main contributions include:

  • We explore the ECGC module for building an epipolar confidence graph from an input set of multi-view images. The ECGC module could not only learn pairwise relative camera rotations, but also their degrees of confidence for reflecting whether each estimated relative rotation is accurate enough.

  • We explore the CARA module with a designed confidence-aware loss for predicting the absolute camera rotations from the built epipolar confidence graph. And an iterative optimization algorithm is introduced for optimizing the confidence-aware loss accordingly.

  • By integrating the aforementioned ECGC and CARA modules together, we propose the EAR-Net for absolute rotation estimation. Thanks to the optimization algorithm introduced in the second contribution, the proposed EAR-Net could be trained in an end-to-end manner. To the authors’ best knowledge, this work is the first attempt to unify relative rotation estimation and rotation averaging in an end-to-end manner, and its superiority over several state-of-the-art multi-stage methods has been demonstrated by the experimental results in Section 4.

This paper is organized as follows: Section 2 gives a review of existing relative rotation estimation and rotation averaging methods. Section 3 introduces the details of the proposed EAR-Net. In Section 4, we conduct experiments on public datasets to demonstrate the effectiveness of EAR-Net. Section 5 concludes the paper.

2 Related Work

In this section, we review the existing works on relative rotation estimation and rotation averaging respectively.

2.1 Relative Rotation Estimation

Relative rotation estimation aims to calculate the camera rotation between an input pair of images. A traditional strategy for relative rotation estimation contains the following two key stages: feature matching [29, 35, 42, 44] and two-view rotation estimation via the 5-point algorithm [33]. However, it is hard for such a strategy to obtain a reliable estimation of the relative rotation in cases of weak textures or degenerate configurations [19].

Unlike the above strategies, several recent works have adopted deep neural networks (DNNs) to estimate relative rotations in an end-to-end manner [31, 18, 10, 5]. For example, Siamese architectures were employed for relative pose estimation in [31, 18]. Zhou et al., [50] gave an analysis on the continuity of rotation representations, and suggested representing 3D rotations in a 6D space, which benefits the training of neural networks. Different from the above regression-based methods [31, 18], Cai et al., [5] cast the relative rotation estimation problem as a classification problem, where each class represents an angle in the range [−180°, 180°]. However, it is worth noting that the above DNN-based methods are only able to estimate the relative rotation from an input pair of images, but could not estimate the absolute rotations from multi-view images, which makes them significantly different from our method.

Figure 2: Architecture of the proposed EAR-Net, which consists of the epipolar confidence graph construction module for learning relative rotations and their confidences, and the confidence-aware rotation averaging module for predicting the absolute rotations.

2.2 Rotation Averaging

Rotation averaging is a widely studied task in computer vision, which aims to recover the absolute rotations of cameras from a given set of relative rotations. It plays an important role in global SfM methods [24, 7, 14, 12], which are efficient for speeding up camera pose estimation and could also reduce camera drift [12]. The early method by Govindu, [22] used a Lie-algebra representation to average relative rotations in a least-squares manner. To improve robustness, several works have either focused on using robust loss functions [24, 7] or on removing outliers before the optimization [23, 48, 26]. For example, Hartley et al., [24] proposed an $\ell_1$ averaging method based on the Weiszfeld algorithm, considering that the $\ell_1$ norm is more robust to outliers than the $\ell_2$ norm. Chatterjee and Govindu, [7] proposed a generalized framework where different robust loss functions could be seamlessly embedded, and the optimization could be implemented in an efficient iteratively re-weighted least-squares (IRLS) manner in the Lie-algebra representation. Govindu, [23] used a RANSAC-based approach to detect and remove erroneous relative rotations. Zach et al., [48] proposed to filter outlier edges by using a loop constraint, i.e., chaining all noise-free transformations along a loop should result in the identity transformation. Lee and Civera, [26] proposed an initialization scheme based on a hierarchy of triplet support, and removed edges that do not conform to an initial solution. Moreover, some works found that a good initialization approach is also helpful for improving the rotation averaging accuracy [7, 14, 12, 21, 26]. Sidhartha and Govindu, [40] extensively analyzed the performance of IRLS-based rotation averaging [7, 12, 26], and further pointed out that the performance of these methods is intimately related to both the robust loss function and the initialization approach of the absolute rotations.

Additionally, some learning-based rotation averaging methods have been proposed by utilizing deep networks recently [34, 45, 27]. Here, it has to be explained that these learning-based rotation averaging methods take relative rotations as inputs. However, different from the above methods, the proposed EAR-Net recovers the absolute rotations directly from the original images.

3 Methodology

In this section, we introduce the proposed EAR-Net for absolute rotation estimation in detail. Firstly, we introduce the architecture of the proposed EAR-Net. Then, we introduce the key modules and the training strategy. Finally, we introduce the strategy for extending our method to large-scale scenes.

3.1 Architecture

As shown in Figure 2, the proposed EAR-Net takes a set of multi-view images as input, and it aims to output the corresponding absolute rotations. EAR-Net consists of the epipolar confidence graph construction module and the confidence-aware rotation averaging module. In the proposed EAR-Net, the referred operations (including relative rotation estimation, rotation averaging, etc.) for estimating absolute camera rotations are combined into one stage, and relative camera rotations are only considered as intermediate features, but not as final prediction results. It is noted that regardless of whether an end-to-end manner or a multi-stage manner is employed to learn relative rotations, the learned relative rotations from different pairs of images generally have different noise levels. Hence, given an input set of images, the epipolar confidence graph construction module is designed to learn the relative rotations and their confidences, which indicate whether the corresponding relative rotations are accurate enough, and then a weighted epipolar confidence graph is constructed according to the obtained relative rotations and confidences. Based on the constructed epipolar confidence graph, the confidence-aware rotation averaging module is designed to estimate the absolute rotations. We will introduce the two modules, the loss function, and the training strategy in detail in the following subsections.

Figure 3: Architecture of the pairwise feature aggregation (PFA) unit. This unit takes two image feature maps as input and outputs the pairwise feature vector.

3.2 Epipolar Confidence Graph Construction Module

Given an arbitrary set of $N$-view images $\{\mathcal{I}_{k}\}_{k=1}^{N}$ with height $H$ and width $W$, i.e., $\mathcal{I}_{k}\in\mathbb{R}^{3\times H\times W}$, the epipolar confidence graph construction module is designed to simultaneously learn the relative camera rotations and their corresponding confidences. It consists of a feature encoder, a pairwise feature aggregation unit, and a dual-branch decoder:

Feature Encoder. The feature encoder is to learn feature maps from the input $N$ images. In this work, we use the first three residual blocks from ResNet18 [25] as the feature encoder, and it outputs feature maps $\{\mathcal{F}_{k}\}_{k=1}^{N}$ of 256 channels with the size of $H/16\times W/16$, i.e., $\mathcal{F}_{k}\in\mathbb{R}^{256\times H/16\times W/16}$.

Pairwise Feature Aggregation. Once the feature maps of all the involved $N$ images are obtained by the feature encoder, the Pairwise Feature Aggregation (PFA) unit is explored to aggregate them pairwise. The architecture of the PFA unit is illustrated in Figure 3. As seen from this figure, the PFA unit takes a pair of feature maps as input, and it outputs an aggregated feature vector. Specifically, we first compute the 4D correlation volume $\bm{\mathrm{Q}}_{ij}\in\mathbb{R}^{H/16\times W/16\times H/16\times W/16}$ to encode the mutual information between the input pair of feature maps $\mathcal{F}_{i}$ and $\mathcal{F}_{j}$ as:

$\bm{\mathrm{Q}}_{ij}(h_{1},w_{1},h_{2},w_{2})=\sum_{k=1}^{256}\mathcal{F}_{i}(k,h_{1},w_{1})\,\mathcal{F}_{j}(k,h_{2},w_{2})$  (1)

where $k,h_{1},w_{1},h_{2},w_{2}$ are the dimension indices.

Then, the obtained 4D correlation volumes are reshaped as 3D feature maps in $\mathbb{R}^{HW/256\times H/16\times W/16}$, and are passed into a residual block and an average pooling layer to obtain a feature vector (namely the pairwise feature) with the size of 512.
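To make the PFA unit concrete, the following PyTorch-style sketch shows how the correlation volume in Eqn. 1 could be computed and pooled into the 512-dimensional pairwise feature; the helper `res_block` (a residual block mapping $HW/256$ channels to 512) is a hypothetical placeholder based on the description above.

```python
import torch


def pairwise_feature(feat_i, feat_j, res_block):
    """Sketch of the PFA unit for one image pair.

    feat_i, feat_j: (256, H16, W16) feature maps from the encoder (H16 = H/16, W16 = W/16).
    res_block: a residual block mapping H16*W16 channels to 512 channels (hypothetical).
    """
    C, H16, W16 = feat_i.shape
    # 4D correlation volume Q_ij (Eqn. 1): inner product over the channel dimension.
    q = torch.einsum('chw,cxy->hwxy', feat_i, feat_j)        # (H16, W16, H16, W16)
    # Reshape into a 3D feature map with H16*W16 (= HW/256) channels.
    q = q.reshape(H16 * W16, H16, W16)
    # Residual block followed by global average pooling gives the 512-d pairwise feature.
    q = res_block(q.unsqueeze(0))                            # (1, 512, H16, W16)
    return q.mean(dim=(2, 3)).squeeze(0)                     # (512,)
```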

Dual-Branch Decoder. The dual-branch decoder takes the pairwise features as inputs, and it outputs the relative rotations and the corresponding confidences. As shown in Figure 4, the explored dual-branch decoder consists of two branches: a rotation branch and a confidence branch. The pairwise feature vectors are inputted to the two branches respectively.

Figure 4: Architecture of the dual-branch decoder, which consists of the rotation branch to predict the relative rotation, and the confidence branch to predict the corresponding confidence. ‘FC’ denotes the fully connected layer.

Rotation branch: This branch maps each pairwise feature into a relative rotation, which is parameterized with Euler angles $\alpha$, $\beta$, $\eta$ (roll, pitch, yaw). Similar to [5], the distributions of $\alpha$, $\beta$, $\eta$ are represented by $B$-dimensional vectors $p_{\alpha}$, $p_{\beta}$, $p_{\eta}$ respectively (we set $B=360$), where each element in the vectors indicates the probability of an angle in [0, $2\pi$]. However, it is non-differentiable to directly choose the angle with the maximum probability as the prediction result as done in [5]. Instead, we compute the following expectation as the final prediction:

$\mathrm{E}(\theta)=\sum_{k=1}^{B}p(\theta_{k})\,\theta_{k}$  (2)

where $\theta_{k}$ denotes the angle represented by the $k$-th element and $p(\theta_{k})$ denotes the corresponding probability.

Specifically, as shown in Figure 4, the rotation branch firstly adopts two fully connected layers and a softmax function to obtain the discrete distributions $p_{\alpha}$, $p_{\beta}$, $p_{\eta}$ for roll, pitch, and yaw respectively. Then, the values of roll, pitch, and yaw are computed according to the expectation operation in Eqn. 2. Finally, given the three Euler angles, the corresponding rotation matrix could be straightforwardly obtained.
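As an illustration, the differentiable decoding of Eqn. 2 and the subsequent Euler-angle-to-matrix conversion could look as follows; the placement of the bin centers and the Z-Y-X composition order are assumptions, since the paper does not spell them out.

```python
import math
import torch


def decode_rotation_branch(logits_roll, logits_pitch, logits_yaw, B=360):
    """Differentiable decoding of the rotation branch (a sketch of Eqn. 2)."""
    bins = torch.arange(B, dtype=logits_roll.dtype) * (2 * math.pi / B)  # assumed bin centers

    def expectation(logits):
        p = torch.softmax(logits, dim=-1)
        return (p * bins).sum()               # E(theta) = sum_k p(theta_k) * theta_k

    roll, pitch, yaw = map(expectation, (logits_roll, logits_pitch, logits_yaw))

    # Elementary rotations composed in Z-Y-X order (the composition order is an assumption).
    zero, one = roll.new_zeros(()), roll.new_ones(())
    Rx = torch.stack([one, zero, zero,
                      zero, torch.cos(roll), -torch.sin(roll),
                      zero, torch.sin(roll), torch.cos(roll)]).reshape(3, 3)
    Ry = torch.stack([torch.cos(pitch), zero, torch.sin(pitch),
                      zero, one, zero,
                      -torch.sin(pitch), zero, torch.cos(pitch)]).reshape(3, 3)
    Rz = torch.stack([torch.cos(yaw), -torch.sin(yaw), zero,
                      torch.sin(yaw), torch.cos(yaw), zero,
                      zero, zero, one]).reshape(3, 3)
    return Rz @ Ry @ Rx
```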

Confidence branch: This branch consists of three fully connected layers and a sigmoid activation function. It maps each pairwise feature vector into a scalar (confidence) in the range of [0, 1], which is expected to reflect the degree of confidence on whether the estimated relative rotation in the above rotation branch is accurate enough. That is to say, when the confidence branch is trained, it tends to predict large confidence scores for reliable relative rotations, and small confidence scores for unreliable relative rotations.

Once both the relative rotations $\bm{\mathrm{R}}_{ij}$ and their corresponding confidences $c_{ij}$ are obtained through the above two branches, the epipolar confidence graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$, which is a weighted graph, is constructed as follows: vertex $i\in\mathcal{V}$ represents the $i$-th camera with an unknown absolute rotation $\bm{\mathrm{R}}_{i}$; edge $(i,j)\in\mathcal{E}$ represents the relative rotation $\bm{\mathrm{R}}_{ij}$ between the $i$-th and $j$-th images, which is weighted by the confidence $c_{ij}$.
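In code, the epipolar confidence graph can be kept as a simple edge dictionary; the container below is only an illustrative choice and is reused by the sketches in Section 3.3.

```python
def build_confidence_graph(num_views, rel_rotations, confidences):
    """Assemble the epipolar confidence graph G = (V, E) (illustrative data structure).

    rel_rotations: dict mapping an edge (i, j) with i < j to the 3x3 relative rotation R_ij.
    confidences:   dict mapping the same edges to the scalar confidence c_ij in [0, 1].
    """
    vertices = list(range(num_views))                                  # one vertex per camera
    edges = {e: (rel_rotations[e], confidences[e]) for e in rel_rotations}
    return vertices, edges
```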

3.3 Confidence-Aware Rotation Averaging Module

Given the constructed epipolar confidence graph $\mathcal{G}=\{\mathcal{V},\mathcal{E}\}$ from the above module, the confidence-aware rotation averaging module aims to recover the absolute rotations in an iterative and differentiable manner, which allows the gradient to back-propagate.

Confidence-Aware Loss. Considering that the loss function is one of the main influencing factors for rotation averaging [40], we propose the confidence-aware loss (CAL) that is based on the learned confidence:

$\mathcal{L}_{\mathrm{CAL}}=\sum_{(i,j)\in\mathcal{E}}c_{ij}\,\mathfrak{R}^{2}(\bm{\mathrm{R}}_{ij},\bm{\mathrm{R}}_{j}\bm{\mathrm{R}}_{i}^{\mathrm{T}})$  (3)

where $\mathfrak{R}(\cdot,\cdot)$ is the Riemannian distance: $\mathfrak{R}(\bm{\mathrm{X}},\bm{\mathrm{Y}})=\|\log(\bm{\mathrm{X}}\bm{\mathrm{Y}}^{\mathrm{T}})\|_{2}$, where $\|\cdot\|_{2}$ denotes the $\ell_{2}$ norm and $\log(\cdot)$ denotes the mapping from the Lie group to its Lie algebra.
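A direct translation of Eqn. 3 could look like the sketch below, where `so3_log` is an assumed helper implementing the SO(3) logarithm map as a 3-vector; the clamping is only for numerical stability and the small-angle case is omitted for brevity.

```python
import torch


def so3_log(R, eps=1e-7):
    """Logarithm map from SO(3) to so(3), returned as a 3-vector (axis times angle)."""
    cos_theta = (R[0, 0] + R[1, 1] + R[2, 2] - 1.0) / 2.0
    theta = torch.acos(cos_theta.clamp(-1.0 + eps, 1.0 - eps))
    w = torch.stack([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return w * theta / (2.0 * torch.sin(theta))


def confidence_aware_loss(edges, abs_R):
    """CAL of Eqn. 3. edges: {(i, j): (R_ij, c_ij)}; abs_R: (N, 3, 3) absolute rotations."""
    loss = abs_R.new_zeros(())
    for (i, j), (R_ij, c_ij) in edges.items():
        # Riemannian residual: log(R_ij (R_j R_i^T)^T); its squared l2 norm is R^2(., .).
        residual = so3_log(R_ij @ (abs_R[j] @ abs_R[i].T).T)
        loss = loss + c_ij * residual.pow(2).sum()
    return loss
```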

Many hand-crafted loss functions (such as the Cauchy loss, Geman-McClure loss, etc.) are adopted in other methods [7, 26], which are manually designed according to noise distribution in data [34]. In contrast, our approach explores direct learning of the loss function from data. It should be pointed out that these hand-crafted robust loss functions could also be used seamlessly in our framework. However, we will show in Section 4.6 that our learning-based CAL gives a significantly better result.

To minimize the confidence-aware loss, we propose the Confidence-Aware Initialization (CAI) approach and the Confidence-Aware Optimization (CAO) algorithm, which will be introduced in the following.

Confidence-Aware Initialization. Considering that an effective set of initial absolute rotations is generally important for pursuing a more accurate final prediction, and unlike the existing initialization methods [12, 21, 38, 26, 40] that calculate the initial absolute rotations according to manually defined criteria (e.g., the number of inlier matches [12, 21, 26] or similarity scores [38]), the proposed CAI approach is based on the automatically learned confidences. Specifically, we first construct a maximum spanning tree via the classic Prim's algorithm according to the learned confidences. Then, the initial rotations $\{\bm{\mathrm{R}}_{i}\}_{i=1}^{N}$ are computed by progressively multiplying the predicted relative rotations from the root vertex along the maximum spanning tree.
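A minimal sketch of the CAI approach is given below; it follows the Prim-style maximum spanning tree described above, while the root choice and the edge direction convention ($\bm{\mathrm{R}}_{ij}\approx\bm{\mathrm{R}}_{j}\bm{\mathrm{R}}_{i}^{\mathrm{T}}$, consistent with Eqn. 3) are assumptions.

```python
import torch


def confidence_aware_init(num_views, edges, root=0):
    """Chain relative rotations along a maximum spanning tree of the confidence graph.

    edges: {(i, j): (R_ij, c_ij)} with the convention R_ij ~ R_j R_i^T, so R_j = R_ij R_i.
    Assumes the graph is connected (in EAR-Net every image pair yields an edge).
    """
    abs_R = {root: torch.eye(3)}
    visited = {root}
    while len(visited) < num_views:
        # Prim's step: take the most confident edge crossing the visited / unvisited cut.
        (i, j), (R_ij, _) = max(
            ((e, v) for e, v in edges.items() if (e[0] in visited) != (e[1] in visited)),
            key=lambda item: item[1][1])
        if i in visited:
            abs_R[j] = R_ij @ abs_R[i]        # propagate i -> j
            visited.add(j)
        else:
            abs_R[i] = R_ij.T @ abs_R[j]      # propagate j -> i
            visited.add(i)
    return torch.stack([abs_R[i] for i in range(num_views)])
```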

Input: Relative rotations $\{\bm{\mathrm{R}}_{ij}\}_{(i,j)\in\mathcal{E}}$; initial absolute rotations $\{\bm{\tilde{\mathrm{R}}}_{i}\}_{i\in\mathcal{V}}$; confidences $\{c_{ij}\}_{(i,j)\in\mathcal{E}}$; maximum number of iterations $T$
Output: Estimated absolute rotations $\{\bm{\mathrm{R}}_{i}\}_{i\in\mathcal{V}}$
1. Construct the matrix $\bm{\mathrm{B}}$ according to Eqn. 4
2. Construct the block-diagonal matrix $\bm{\mathrm{W}}$ according to Eqn. 6
3. $\bm{\mathrm{M}}\leftarrow(\bm{\mathrm{B}}^{\mathrm{T}}\bm{\mathrm{W}}\bm{\mathrm{B}})^{-1}\bm{\mathrm{B}}^{\mathrm{T}}\bm{\mathrm{W}}$, $k\leftarrow 0$
4. while $k<T$ do
5.   $\Delta\bm{\mathrm{R}}_{ij}\leftarrow\bm{\tilde{\mathrm{R}}}_{j}^{\mathrm{T}}\bm{\mathrm{R}}_{ij}\bm{\tilde{\mathrm{R}}}_{i}$, $\forall(i,j)\in\mathcal{E}$
6.   $\Delta\bm{\mathrm{b}}_{ij}\leftarrow\log(\Delta\bm{\mathrm{R}}_{ij})$, $\forall(i,j)\in\mathcal{E}$
7.   Gather all $\Delta\bm{\mathrm{b}}_{ij}$ into the residual vector $\Delta\bm{\mathrm{b}}$
8.   $\Delta\mathfrak{r}\leftarrow\bm{\mathrm{M}}\Delta\bm{\mathrm{b}}$
9.   $\bm{\tilde{\mathrm{R}}}_{i}\leftarrow\bm{\tilde{\mathrm{R}}}_{i}\exp(\Delta\mathfrak{r}_{i})$, $\forall i\in\mathcal{V}$
10.  $k\leftarrow k+1$
11. end while
12. $\bm{\mathrm{R}}_{i}\leftarrow\bm{\tilde{\mathrm{R}}}_{i}$, $\forall i\in\mathcal{V}$
Algorithm 1: Confidence-Aware Optimization (CAO)

Confidence-Aware Optimization. Once the absolute rotations are initialized, we optimize them in an iterative manner. As done in [22], in each iteration, we use the Euclidean distance in the Lie algebra to approximate the Riemannian distance in its Lie group, i.e., $\mathfrak{R}(\bm{\mathrm{X}},\bm{\mathrm{Y}})\approx\|\log(\bm{\mathrm{X}})-\log(\bm{\mathrm{Y}})\|_{2}$. Different from [22] that solves the least squares problem corresponding to the $\ell_{2}$ loss in each iteration, our method solves a weighted least squares problem corresponding to the introduced CAL. In the following, we will introduce the algorithm (outlined in Algorithm 1) in detail.

Let $\bm{\mathrm{I}}\in\mathbb{R}^{3\times 3}$ be the identity matrix, $\{\bm{\tilde{\mathrm{R}}}_{i}\}_{i\in\mathcal{V}}$ be the set of current estimates of the absolute rotations, $\{\bm{\mathrm{R}}_{ij}\}_{(i,j)\in\mathcal{E}}$ be the set of predicted relative rotations from the ECGC module, $\{\Delta\mathfrak{r}_{i}\}_{i\in\mathcal{V}}$ be the set of update amounts we need to estimate, and $\{\Delta\bm{\mathrm{b}}_{ij}\}_{(i,j)\in\mathcal{E}}$ be the set of residuals computed as $\Delta\bm{\mathrm{b}}_{ij}=\log(\bm{\tilde{\mathrm{R}}}_{j}^{\mathrm{T}}\bm{\mathrm{R}}_{ij}\bm{\tilde{\mathrm{R}}}_{i})$. Then, for edge $(i,j)$, we have:

$\underbrace{\left[\cdots,\bm{\mathrm{I}},\cdots,-\bm{\mathrm{I}},\cdots\right]}_{\bm{\mathrm{B}}_{ij}}\Delta\mathfrak{r}=\Delta\bm{\mathrm{b}}_{ij}$  (4)

where $\bm{\mathrm{B}}_{ij}\in\mathbb{R}^{3\times 3N}$ is a block matrix in which the $i$-th block is $-\bm{\mathrm{I}}$, the $j$-th block is $\bm{\mathrm{I}}$, and all other elements are zero.

Stacking all $\bm{\mathrm{B}}_{ij}$, $\Delta\bm{\mathrm{b}}_{ij}$, $\Delta\mathfrak{r}_{i}$ along the column, we obtain $\bm{\mathrm{B}}\in\mathbb{R}^{\frac{3N(N-1)}{2}\times 3N}$, $\Delta\bm{\mathrm{b}}\in\mathbb{R}^{\frac{3N(N-1)}{2}\times 1}$, and $\Delta\mathfrak{r}\in\mathbb{R}^{3N\times 1}$ respectively. Then minimizing the CAL in Eqn. 3 is equivalent to solving the following problem:

$\min_{\Delta\mathfrak{r}}\;(\bm{\mathrm{B}}\Delta\mathfrak{r}-\Delta\bm{\mathrm{b}})^{\mathrm{T}}\bm{\mathrm{W}}(\bm{\mathrm{B}}\Delta\mathfrak{r}-\Delta\bm{\mathrm{b}})$  (5)

where $\bm{\mathrm{W}}$ is a diagonal matrix constructed as follows:

$\bm{\mathrm{W}}=\begin{bmatrix}c_{12}\bm{\mathrm{I}}&&&\\&c_{13}\bm{\mathrm{I}}&&\\&&\ddots&\\&&&c_{N-1,N}\bm{\mathrm{I}}\end{bmatrix}$  (6)

The solution to Eqn. 5 is $\Delta\mathfrak{r}=(\bm{\mathrm{B}}^{\mathrm{T}}\bm{\mathrm{W}}\bm{\mathrm{B}})^{-1}\bm{\mathrm{B}}^{\mathrm{T}}\bm{\mathrm{W}}\Delta\bm{\mathrm{b}}$. Then the absolute rotations are updated as $\bm{\tilde{\mathrm{R}}}_{i}\leftarrow\bm{\tilde{\mathrm{R}}}_{i}\exp(\Delta\mathfrak{r}_{i}),\forall i\in\mathcal{V}$, where $\exp(\cdot)$ is the mapping from the Lie algebra to its Lie group.
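For concreteness, one CAO iteration could be implemented as below. Building $\bm{\mathrm{B}}$ and $\bm{\mathrm{W}}$ densely is only for clarity, `so3_log`/`so3_exp` are assumed helpers (the log map was sketched earlier), and the small Tikhonov term is an assumption added for numerical stability, since $\bm{\mathrm{B}}^{\mathrm{T}}\bm{\mathrm{W}}\bm{\mathrm{B}}$ has a null space corresponding to the global gauge.

```python
import torch


def cao_iteration(abs_R, edges, so3_log, so3_exp, damping=1e-6):
    """One confidence-aware optimization step (the loop body of Algorithm 1).

    abs_R: (N, 3, 3) current absolute rotation estimates; edges: {(i, j): (R_ij, c_ij)}.
    """
    N = abs_R.shape[0]
    edge_list = list(edges.keys())
    B = abs_R.new_zeros(3 * len(edge_list), 3 * N)
    W = abs_R.new_zeros(3 * len(edge_list), 3 * len(edge_list))
    b = abs_R.new_zeros(3 * len(edge_list))
    for k, (i, j) in enumerate(edge_list):
        R_ij, c_ij = edges[(i, j)]
        b[3*k:3*k+3] = so3_log(abs_R[j].T @ R_ij @ abs_R[i])     # residual Delta_b_ij
        B[3*k:3*k+3, 3*i:3*i+3] = -torch.eye(3)                  # B_ij: -I at block i
        B[3*k:3*k+3, 3*j:3*j+3] = torch.eye(3)                   #        I at block j
        W[3*k:3*k+3, 3*k:3*k+3] = c_ij * torch.eye(3)            # confidence weight
    # Weighted least squares solution of Eqn. 5 (damped for the gauge null space).
    delta = torch.linalg.solve(B.T @ W @ B + damping * torch.eye(3 * N), B.T @ W @ b)
    # Manifold update: R_i <- R_i exp(delta_i).
    return torch.stack([abs_R[i] @ so3_exp(delta[3*i:3*i+3]) for i in range(N)])
```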

3.4 Training Strategy and Loss Function

Here, we introduce the training strategy and loss function used in the proposed EAR-Net as follows:

Pretraining. We first pretrain the feature encoder, the PFA unit, and the rotation branch of the dual-branch decoder together, using the ground truth relative rotations as supervision signals, while the confidence branch is omitted. As indicated in Section 3.2, the Euler angle parameterization is used in the rotation branch. Hence, the loss function $\mathcal{L}_{\mathrm{P}}$ at this pretraining stage consists of three cross-entropy loss terms for roll ($\alpha$), pitch ($\beta$), and yaw ($\eta$) respectively:

$\mathcal{L}_{\mathrm{P}}=\mathcal{L}_{\mathrm{CE}}(p_{\alpha},p_{\hat{\alpha}})+\mathcal{L}_{\mathrm{CE}}(p_{\beta},p_{\hat{\beta}})+\mathcal{L}_{\mathrm{CE}}(p_{\eta},p_{\hat{\eta}})$  (7)

where $\mathcal{L}_{\mathrm{CE}}(\cdot,\cdot)$ is the cross-entropy loss, and $p_{\hat{\alpha}}$, $p_{\hat{\beta}}$, $p_{\hat{\eta}}$ denote the ground truth distributions (one-hot vectors) for roll, pitch, and yaw respectively.
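Assuming the ground-truth Euler angles are quantized to bin indices (the quantization scheme is not specified in the paper and is an assumption here), Eqn. 7 reduces to three standard cross-entropy terms:

```python
import math
import torch
import torch.nn.functional as F


def pretrain_loss(logits_roll, logits_pitch, logits_yaw, gt_roll, gt_pitch, gt_yaw, B=360):
    """Pretraining loss of Eqn. 7 for a single image pair (a sketch)."""
    def to_bin(angle):
        # Map an angle in [0, 2*pi) to its nearest bin index (assumed quantization).
        return (torch.round(angle / (2 * math.pi / B)).long() % B).unsqueeze(0)

    return (F.cross_entropy(logits_roll.unsqueeze(0), to_bin(gt_roll)) +
            F.cross_entropy(logits_pitch.unsqueeze(0), to_bin(gt_pitch)) +
            F.cross_entropy(logits_yaw.unsqueeze(0), to_bin(gt_yaw)))
```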

End-to-End Training. After the feature encoder and the rotation branch of the dual-branch decoder are initialized with the pretrained weights, the whole EAR-Net model is trained in an end-to-end manner. To remove the gauge ambiguity, we first obtain the transformation from the estimated rotations $\{\bm{\mathrm{R}}_{i}\}_{i=1}^{N}$ to the ground truth rotations $\{\hat{\bm{\mathrm{R}}}_{i}\}_{i=1}^{N}$ by minimizing the squared Frobenius norms:

$\bm{\mathrm{S}}^{\star}=\mathop{\arg\min}_{\bm{\mathrm{S}}\in\mathrm{SO}(3)}\sum_{i=1}^{N}\|\bm{\mathrm{S}}-\bm{\mathrm{R}}_{i}^{\mathrm{T}}\hat{\bm{\mathrm{R}}}_{i}\|_{\mathrm{F}}^{2}$  (8)

The solution could be obtained by the existing method [30]. Then the final loss function is computed as:

$\mathcal{L}_{\mathrm{A}}=\frac{1}{N}\sum_{i=1}^{N}\|\bm{\mathrm{R}}_{i}\bm{\mathrm{S}}^{\star}-\hat{\bm{\mathrm{R}}}_{i}\|_{2}$  (9)

The above procedures are fully differentiable, thus the gradient could be computed automatically using existing deep-learning libraries. It should be pointed out that there is no direct constraint for the predicted confidences themselves. Instead, the only supervision signal for the confidences comes from the final predicted absolute rotations, i.e., the confidences are adjusted automatically in order to give a better absolute rotation estimation. Intuitively, a large confidence score should be assigned to the reliable relative rotation, and a small confidence score should be assigned to the unreliable relative rotation. Please see the analysis in Section 4.6.
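Eqn. 8 is a rotation-only Procrustes problem; the sketch below solves it by projecting the average of $\bm{\mathrm{R}}_{i}^{\mathrm{T}}\hat{\bm{\mathrm{R}}}_{i}$ onto SO(3) via an SVD with a determinant correction, which is the standard closed-form construction and is assumed here to match the method cited as [30].

```python
import torch


def align_gauge(est_R, gt_R):
    """Solve Eqn. 8: the rotation S minimizing sum_i ||S - R_i^T R_hat_i||_F^2.

    est_R, gt_R: (N, 3, 3) estimated and ground truth absolute rotations.
    """
    A = torch.einsum('nij,nik->njk', est_R, gt_R).mean(dim=0)   # average of R_i^T R_hat_i
    U, _, Vt = torch.linalg.svd(A)
    d = torch.sign(torch.det(U @ Vt))                           # determinant correction
    D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    return U @ D @ Vt
```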

3.5 Extension to Large-Scale Scenes

As noted from Figure 2, the input of the network is a set of $N$ images at each training step, and both the memory and computational costs of EAR-Net depend on the number $N$ accordingly in principle. With the increase of $N$, the memory and computational costs would increase significantly.

To address the above issue, we set the image number $N$ to a small integer during the training stage (on our low-cost server, $N$ is always set to 7), and design the following manner to deal with large-scale scenes where hundreds or thousands of images are involved at the testing stage.

Specifically, for a large-scale scene at test time, the feature maps are first extracted batch-by-batch from all the testing images via the feature encoder in Figure 2, and then they are temporarily stored. Next, an edge-by-edge processing strategy is used to obtain the relative rotations and confidences:

For each edge of an arbitrary image pair in the scene graph, we load the corresponding two feature maps into the GPU memory at each time. The feature maps are then processed by the PFA unit to obtain the pairwise feature vector, which is subsequently decoded into the relative rotation and its confidence. Since only two feature maps are processed at each time, the memory cost would not increase with the total number of images. Once all the relative rotations and confidences within the scene graph are obtained, we employ the CARA module to obtain the final absolute rotations.
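Putting the above steps together, the memory-bounded inference for a large-scale scene could be organized roughly as follows; `encoder`, `pfa`, `decoder`, and `cara` are hypothetical handles to the modules described earlier, and caching the feature maps on the CPU is an implementation assumption.

```python
import torch


@torch.no_grad()
def infer_large_scale(images, scene_edges, encoder, pfa, decoder, cara, batch_size=16):
    """Edge-by-edge inference sketch. images: (M, 3, H, W) tensor; scene_edges: list of (i, j)."""
    # 1) Extract feature maps batch-by-batch and cache them outside GPU memory.
    feats = []
    for s in range(0, images.shape[0], batch_size):
        feats.append(encoder(images[s:s + batch_size].cuda()).cpu())
    feats = torch.cat(feats, dim=0)

    # 2) Process the scene graph edge-by-edge: only two feature maps are on the GPU at a time.
    edges = {}
    for (i, j) in scene_edges:
        pair_feat = pfa(feats[i].cuda(), feats[j].cuda())   # pairwise feature vector
        R_ij, c_ij = decoder(pair_feat)                     # relative rotation and confidence
        edges[(i, j)] = (R_ij.cpu(), c_ij.cpu())

    # 3) Confidence-aware rotation averaging over the full graph.
    return cara(images.shape[0], edges)
```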

4 Experiments

4.1 Datasets and Evaluation Metrics

Datasets. To verify the effectiveness of the proposed EAR-Net, we conduct experiments on three public datasets, including the ScanNet [15], the DTU [1] and the 7-Scene [39] datasets. The ScanNet dataset contains 1613 scans collected from 807 indoor scenes, including 1513 training scans and 100 testing scans. The DTU dataset contains 22 testing scans collected in a laboratory setting. The images are captured from different angles using a camera mounted on a robot arm. The 7-Scene dataset is a relatively small dataset collected from 7 scenes with 46 scans in total. Our evaluation is split into three setups, namely the basic setup, the large-scale setup, and the cross-dataset setup, which will be introduced in the following.

Basic setup: We train and test EAR-Net on ScanNet [15]. We follow the official data split for training and testing. For the training/testing split, we sample 150/15 sets of images of size 7 in each scene. This results in 226950/1485 sets of images for training/testing. (Since there are not enough image sets in the scan ‘scene00718_00’ of the test split, only 99 scans are actually used for testing in total. The detailed sampling strategy is provided in Appendix A.)

Large-scale setup: To further evaluate the performance of EAR-Net on large-scale scenes, we conduct experiments on the full scans of ScanNet. Specifically, we evaluate the model on all the scans which contain more than 3000 images. This results in 22 large-scale scenes in total.

Cross-dataset setup: We further evaluate the generalization abilities on the DTU [1] and 7-Scene datasets [39]. We train the model on ScanNet and evaluate on the other two. For the DTU dataset, we sample 100 image sets of size 7 from each scan, resulting in 2200 image sets for testing. For the relatively small 7-Scene dataset, we evaluate on the full set. We sample 20 image sets with size 7 from each scan, resulting in 920 image sets in total.

Evaluation Metrics. For each image set, we first obtain the transformation $\bm{\mathrm{S}}^{\star}$ from the estimated absolute rotations to the ground truth by minimizing the objective function in Eqn. 8. Then, for each estimated global rotation matrix $\bm{\mathrm{R}}$ and the corresponding ground truth $\hat{\bm{\mathrm{R}}}$, we compute the rotation error $\arccos((\mathrm{tr}(\hat{\bm{\mathrm{R}}}^{\mathrm{T}}\bm{\mathrm{R}}\bm{\mathrm{S}}^{\star})-1)/2)$, where $\mathrm{tr}(\cdot)$ denotes the trace. Finally, the following metrics are reported according to the rotation error: the mean error and the median error, as done in [24, 7, 26]. Moreover, we also report the percentage of rotation errors that are under $10^{\circ}$ (Acc@10).
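Concretely, the per-camera error could be computed as below, with `align_gauge` the gauge-alignment step sketched in Section 3.4; the clamp before the arccos is a numerical-stability assumption.

```python
import torch


def rotation_errors_deg(est_R, gt_R, align_gauge):
    """Rotation errors arccos((tr(R_hat^T R S*) - 1) / 2) in degrees. est_R, gt_R: (N, 3, 3)."""
    S = align_gauge(est_R, gt_R)                             # remove the gauge ambiguity (Eqn. 8)
    M = torch.einsum('nij,nik,kl->njl', gt_R, est_R, S)      # R_hat^T R S* for each camera
    trace = M[:, 0, 0] + M[:, 1, 1] + M[:, 2, 2]
    return torch.rad2deg(torch.acos(((trace - 1.0) / 2.0).clamp(-1.0, 1.0)))
```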

Table 1: Comparison on ScanNet [15]. ‘RR’ denotes the relative rotation estimator. ‘RA’ denotes the rotation averaging solver. ‘Mn’ denotes the mean error. ‘Med’ denotes the median error. ‘MNN’ denotes mutual nearest neighbor matching. ‘NS/NT’ denotes the Number of Successful/Total image sets (same for Table 3). The best results are marked in bold face.
RR | RA | Mn↓ | Med↓ | Acc@10↑ | NS/NT↑
SIFT [29]+MNN | IRLS-$\ell_{1/2}$ | 15.98 | 8.90 | 54.01 | 1475/1485
SuperPoint [16]+MNN | IRLS-$\ell_{1/2}$ | 12.14 | 5.84 | 66.01 | 1483/1485
LoFTR [42] | IRLS-$\ell_{1/2}$ | 7.57 | 4.57 | 82.08 | 1485/1485
ASpanFormer [8] | IRLS-$\ell_{1/2}$ | 7.06 | 4.30 | 84.41 | 1484/1485
Reg6D [50] | IRLS-$\ell_{1/2}$ | 11.38 | 6.35 | 68.34 | 1485/1485
ExtremeRotation [5] | IRLS-$\ell_{1/2}$ | 9.22 | 4.60 | 76.98 | 1485/1485
SIFT [29]+MNN | RAGO | 14.65 | 8.32 | 56.22 | 1485/1485
SuperPoint [16]+MNN | RAGO | 12.45 | 6.07 | 65.70 | 1485/1485
LoFTR [42] | RAGO | 7.68 | 4.07 | 81.85 | 1485/1485
ASpanFormer [8] | RAGO | 7.11 | 3.74 | 84.02 | 1485/1485
Reg6D [50] | RAGO | 11.87 | 6.55 | 67.37 | 1485/1485
ExtremeRotation [5] | RAGO | 9.83 | 5.04 | 75.26 | 1485/1485
SIFT [29]+MNN | HARA | 24.01 | 8.45 | 56.01 | 1447/1485
SuperPoint [16]+MNN | HARA | 14.22 | 5.41 | 68.41 | 1479/1485
LoFTR [42] | HARA | 7.42 | 4.53 | 83.22 | 1485/1485
ASpanFormer [8] | HARA | 6.91 | 4.26 | 85.17 | 1484/1485
Reg6D [50] | HARA | 11.06 | 6.09 | 70.57 | 1485/1485
ExtremeRotation [5] | HARA | 9.05 | 4.53 | 77.88 | 1485/1485
EAR-Net | – | 4.03 | 2.06 | 94.18 | 1485/1485

4.2 Implementation Details

EAR-Net is implemented in PyTorch. Firstly, the feature encoder, the PFA unit, and the rotation branch of the dual-branch decoder are pretrained for 30 epochs with a learning rate of $5\times 10^{-4}$ and a batch size of 20 using the Adam optimizer. Then, we initialize the confidence branch of the decoder so that it outputs a constant value of 0.5 for all pairwise rotations. This could be achieved by dividing the logits (before the sigmoid activation in the last layer of the confidence branch) by a large constant. Next, the whole model is trained with a learning rate of $1\times 10^{-4}$ and a batch size of 8 using the Adam optimizer. We perform $T=3$ iterations of confidence-aware optimization. In total, the model is trained for 50 epochs.

Figure 5: Comparison of inference speed on the ScanNet dataset [15]. All the methods are tested on an RTX 2080Ti GPU with batch size 1.

4.3 Comparative Evaluation Under Basic Setup

In this setup, we train and test EAR-Net and the comparative methods on ScanNet [15]. We compare EAR-Net with the typical multi-stage methods listed in Table 1. Specifically, for the relative rotation, we adopt the following methods: 1) two detector-matcher-based methods with the 5-point algorithm [33]: SIFT [29]-MNN (mutual nearest neighbor matcher) and SuperPoint [16]-MNN; 2) two end-to-end matcher-based methods, LoFTR [42] and ASpanFormer [8], with the 5-point algorithm; 3) two end-to-end relative rotation estimation methods, Reg6D [50] and ExtremeRotation [5]. Then, once the relative rotations are obtained, three state-of-the-art rotation averaging solvers (IRLS-$\ell_{1/2}$ [7], HARA [26], and the learning-based RAGO [27]) are implemented respectively to calculate the absolute rotations.

Table 1 lists, in its right-most column, the number of successfully recovered image sets out of the total 1485 testing image sets from ScanNet [15] for each method, and reports the corresponding results on these successful image sets. As seen from this table, the comparative methods [29, 16, 36, 8, 42, 49] sometimes fail to register all image sets, possibly due to (1) an insufficient number of inlier matches and (2) the outlier relative rotation filtering step in HARA [26]. EAR-Net not only recovers all the testing image sets successfully but also outperforms all the comparative methods significantly under all the evaluated metrics. For example, compared with ‘ASpanFormer+HARA’, which performs better than the other comparative methods, EAR-Net achieves a relative reduction of 41.68% (=1-4.03/6.91) / 50.95% (=1-2.06/4.26) in mean/median error. These results demonstrate the superiority of the proposed end-to-end EAR-Net over these multi-stage methods.

Inference Speed. Here we further evaluate EAR-Net in terms of inference speed. The inference speeds of the proposed method and the comparative methods [29, 16, 8, 42, 50] with the HARA solver [26] (which demonstrates better performance than the other solvers [7, 27] in most cases) on ScanNet [15] are shown in Figure 5. All the referred methods are evaluated on a server with an RTX 2080Ti GPU with a batch size of 1. As seen from this figure, EAR-Net runs at a speed of 13.2 image sets per second, which is significantly faster than all these comparative methods. Note that unlike the existing 5-point algorithm based methods [29, 16, 42, 8], our end-to-end EAR-Net is GPU-friendly, and its inference speed could be further improved by parallel computing with a larger batch size. For example, with a batch size of 30, EAR-Net runs at around 192.8 image sets per second. In this case, compared with the 5-point algorithm based methods (SIFT, SuperPoint, LoFTR, ASpanFormer), EAR-Net is more than 640× faster.

Scene | #Cam. | ASpan+HARA: Mn / Med / t | Reg6D+HARA: Mn / Med / t | ER+HARA: Mn / Med / t | EAR-Net: Mn / Med / t
0711_00 3361 54.1 43.6 8.1h 27.5 25.5 1386s 37.1 35.9 1893s 14.8 14.7 717s
0712_00 4787 68.9 59.4 11.2h 58.3 58.8 2222s 58.4 63.9 2762s 8.4 6.9 1023s
0721_00 3773 59.6 61.5 10.4h 31.5 32.7 1543s 40.6 33.1 1989s 18.3 15.0 795s
0736_00 8009 72.3 61.5 21.9h 89.8 90.9 3215s 84.4 80.9 1.3h 58.2 54.5 1743s
0737_00 3076 73.9 67.4 8.0h 51.7 54.6 1157s 79.6 81.3 1839s 20.2 14.4 657s
0739_00 4449 41.7 36.1 11.1h 42.9 39.5 1708s 44.3 44.5 2832s 11.4 11.8 1001s
0744_00 3127 63.0 56.7 7.2h 32.6 21.2 1223s 17.0 14.4 1694s 6.0 5.8 671s
0747_00 5024 101.2 103.0 14.1h 80.5 56.5 1895s 80.4 60.4 2911s 28.9 32.3 1094s
0752_00 3050 69.4 56.5 7.6h 35.3 32.2 1121s 21.9 20.9 1678s 12.6 13.0 696s
0753_00 3389 60.1 57.4 8.1h 23.2 16.6 1248s 33.1 31.2 1846s 11.7 10.4 770s
0754_00 3218 29.5 25.7 7.4h 15.9 16.7 1193s 14.0 14.0 1713s 12.4 12.4 743s
0755_00 3546 49.3 49.1 9.6h 26.0 26.3 1309s 31.8 34.1 1898s 11.5 6.6 872s
0756_00 3503 36.7 34.9 8.1h 20.6 14.5 1324s 22.3 14.3 1828s 10.8 10.2 811s
0757_00 8336 40.6 37.2 21.6h 58.2 61.6 3185s 84.2 78.0 1.2h 10.4 9.0 1972s
0761_00 5190 58.7 51.9 13.7h 33.9 26.3 2010s 48.8 43.4 2657s 9.3 9.0 1200s
0766_00 3504 58.6 57.7 9.7h 40.1 38.2 1326s 66.0 27.9 1784s 28.5 26.9 793s
0768_00 4026 74.7 71.5 9.0h 53.8 52.3 1510s 25.2 26.8 2054s 10.6 10.7 884s
0770_00 3414 62.3 48.1 8.6h 40.8 43.3 1345s 33.0 28.9 1746s 9.8 7.0 791s
0776_00 3478 60.4 54.3 8.6h 45.2 46.1 1338s 27.2 30.1 1782s 4.8 4.3 836s
0784_00 4926 60.7 55.8 12.5h 39.4 40.3 1876s 49.2 44.4 2732s 16.1 17.9 1108s
0785_00 3980 47.6 46.1 10.9h 20.3 18.0 1558s 36.4 36.7 2168s 10.4 8.6 876s
0793_00 3457 72.0 53.5 8.0h 58.1 58.9 1399s 17.8 14.0 1835s 10.0 10.2 756s
Table 2: Comparison on large-scale scenes from the ScanNet dataset [15]. ‘Mn’ denotes the mean error. ‘Med’ denotes the median error. ‘t’ denotes the time cost. ‘ER’ denotes the ExtremeRotation [5]. ‘ASpan’ denotes the ASpanFormer matcher [8]. The best results are marked in bold face. The time costs that are longer than one hour are marked in red.
Figure 6: Qualitative comparison of the estimated absolute rotations on ScanNet. For visualization of the rotations, we use the ground truth camera translations, and the sequences are downsampled by 5 for clearer comparison. The estimated rotations by EAR-Net visually align better with the ground truth.

4.4 Comparative Evaluation Under Large-Scale Setup

In this subsection, we conduct experiments on the 22 large-scale scenes from ScanNet [15]. It is noted that the four 5-point algorithm based methods (SIFT, SuperPoint, LoFTR, ASpanFormer) are very slow to evaluate on such large-scale scenes, and among them, ASpanFormer performs much better than the other three. Thus, among these methods, we only evaluate ASpanFormer [8]. In addition, we also evaluate the regression-based Reg6D [50] and ExtremeRotation [5]. Then, we obtain the absolute rotations with the HARA rotation averaging solver, which demonstrates better performance than the other solvers [7, 27] in most cases. The mean errors, the median errors, and the time costs of these comparative methods in each scene are reported in Table 2. As seen from this table, the 5-point algorithm based ASpanFormer [8] performs worse than the regression-based methods Reg6D and ExtremeRotation. This is mainly because the baselines between pairs of cameras are much wider in the large-scale scenes than in the basic setup, so that it is difficult for ASpanFormer to obtain high-accuracy matching points, whereas Reg6D and ExtremeRotation are end-to-end learning-based methods that do not explicitly extract feature points, as indicated in [10, 5]. Moreover, the proposed EAR-Net outperforms the three comparative methods [5, 7, 26, 27] significantly on all of the evaluated scenes. It is also noted that EAR-Net has a significantly lower inference time cost in all the evaluated scenes, especially compared to the 5-point algorithm based ASpanFormer [8], which generally takes 7-20 hours. The above experimental results demonstrate the effectiveness of our method on large-scale scenes in terms of both accuracy and inference speed.

Table 3: Cross-dataset comparison on the DTU [1] and 7-Scene [39] datasets. The models are trained on the ScanNet dataset. The best results are marked in bold face.
RR | RA | DTU: Mn↓ / Med↓ / Acc@10↑ / NS/NT↑ | 7-Scene: Mn↓ / Med↓ / Acc@10↑ / NS/NT↑
SIFT+MNN IRLS-$\ell_{1/2}$ 24.93 21.92 17.52 2200/2200 9.61 5.26 71.69 920/920
SuperPoint+MNN 19.76 16.11 32.79 2200/2200 8.06 3.86 79.67 920/920
LoFTR 18.05 14.89 27.80 2195/2200 6.79 5.06 84.47 920/920
ASpanFormer 16.47 13.44 32.99 2195/2200 6.06 4.79 87.28 920/920
Reg6D 22.45 19.55 18.49 2200/2200 11.12 7.60 63.65 920/920
ExtremeRotation 21.82 18.01 21.33 2200/2200 9.24 6.58 70.82 920/920
SIFT+MNN 22.23 18.71 21.68 2200/2200 11.07 7.08 63.87 920/920
SuperPoint+MNN 18.91 14.63 33.05 2200/2200 9.38 5.61 72.41 920/920
LoFTR RAGO 17.33 13.81 33.21 2200/2200 7.15 4.75 82.66 920/920
ASpanFormer 13.99 10.70 46.23 2200/2200 6.24 4.75 86.29 920/920
Reg6D 22.72 19.42 19.52 2200/2200 11.69 7.86 62.10 920/920
ExtremeRotation 21.87 17.86 21.76 2200/2200 10.37 7.20 90.11 920/920
SIFT+MNN 36.44 21.39 21.55 2164/2200 11.95 4.93 72.66 920/920
SuperPoint+MNN 20.03 13.76 38.90 2193/2200 8.19 3.65 82.70 920/920
LoFTR HARA 18.18 14.53 29.17 2194/2200 6.57 5.00 85.73 920/920
ASpanFormer 17.12 13.62 32.72 2191/2200 6.02 4.79 87.73 920/920
Reg6D 22.62 19.50 18.74 2200/2200 11.00 7.49 65.00 920/920
ExtremeRotation 21.88 17.82 21.88 2200/2200 9.33 6.48 71.30 920/920
EAR-Net 13.81 10.68 46.64 2200/2200 4.43 3.26 94.47 920/920

Besides the above quantitative results, we also visualize the estimated rotations on several scenes with 1000-3000 images from ScanNet, including ‘scene720_00’, ‘scene729_00’, ‘scene764_00’. We compare Reg6D+HARA, ExtremeRotation+HARA as well as the proposed EAR-Net. The results are shown in Figure 6. As seen from this figure, our method aligns significantly better with the ground truth, and at the same time has a much faster speed.

4.5 Comparative Evaluation Under Cross-Dataset Setup

In this subsection, we evaluate EAR-Net and the comparative methods under the cross-dataset setup. Specifically, all the referred methods trained on ScanNet (except for SuperPoint, for which we use the released model trained on COCO [28] by the authors) are further evaluated on DTU [1] and 7-Scene [39], and the corresponding results are reported in Table 3. As seen from this table, EAR-Net also performs best among all the referred methods, consistent with the results in Table 1. These results demonstrate the effectiveness of the proposed EAR-Net under the cross-dataset setup.

Method Mn\downarrow Med\downarrow Acc@10\uparrow
w/o end2end 9.08 3.92 75.75
w/o pretraining 6.05 3.09 88.25
w/o confidence 9.31 3.85 78.71
w/o CAI 4.83 2.09 92.69
Full 4.03 2.06 94.18
Table 4: Ablation study on ScanNet [15]. ‘Mn’ and ‘Med’ denote the mean and median error respectively.

4.6 Ablation Study

This subsection provides ablation studies on ScanNet [15] to evaluate the effect of the following key components:

Effect of End-to-End Training. Here, the model is only trained to output relative rotations, and then the absolute rotations are obtained via the confidence-aware optimization algorithm by weighting all edges equally (w/o end2end). As seen from Table 4, ‘w/o end2end’ causes the mean and median errors to increase by 125.3%(=9.08/4.03-1) and 90.3%(=3.92/2.06-1) respectively, which indicates our model benefits a lot from end-to-end training.

Effect of Pretraining. The performance of EAR-Net without pretraining the feature encoder and rotation branch is reported in Table 4 (w/o pretraining). As seen from this table, ‘w/o pretraining’ has a large negative impact on the final performance. This is because the feature encoder and rotation branch are not initialized well, making the end-to-end training converge to a worse local minimum. This observation is consistent with other end-to-end learning methods in other visual tasks [4, 47].

Figure 7: Ablation study on corrupted data. The image sets are corrupted with different numbers of outlier images that have no overlap with the remaining ones. (a) Ablation of the confidence. (b) Ablation of the CAI approach.
Figure 8: Relationship of the relative rotation error and confidence on ScanNet [15]. The confidence decoder tends to predict small/large confidences on relative rotations with large/small errors.
Figure 9: Visualization of the pairwise relative rotation errors and their corresponding confidences predicted by the dual-branch decoder on the ScanNet dataset [15]. The numbers denote the confidence (top) and the error (bottom, in degree).

Effect of Confidence. To evaluate the effect of the learned confidence, the confidence branch is removed, and all edges are weighted equally (w/o confidence).

Firstly, as seen from Table 4, ‘w/o confidence’ causes the mean and median errors to increase by 131.0% and 86.9% respectively, which demonstrates that the learned confidence is important for improving the performance of absolute rotation estimation.

Secondly, we conduct the following experiments to demonstrate the robustness of EAR-Net owing to the learned confidence: for each sampled image set of size 7 on ScanNet, we corrupt it by appending $\{0, 2, 4, 6, 8, 10\}$ randomly selected images from other scenes that have no overlap with it. As the number of randomly selected images increases, more and more estimated relative rotations become outliers, which poses a greater challenge for absolute rotation estimation. The results are reported in Figure 7(a). As seen from this figure, with increasing amounts of outliers, the performance of the ‘w/o confidence’ variant becomes much poorer, while the full EAR-Net is not sensitive to outliers, mainly because low confidences are automatically assigned to the outliers.

Thirdly, we analyze the relationship between the errors of the predicted relative rotations and their confidences. We first sample around 230k image pairs from the ScanNet dataset, and compute the relative rotations and their confidences accordingly. Then, we divide the confidence range [0, 1] into 20 groups with equal intervals, and for each group, the mean and median errors of the predicted relative rotations are computed using the ground truth relative rotations. The mean and median errors of the relative rotations in these groups are plotted in Figure 8. As seen from this figure, the relative rotation error tends to drop when the confidence score increases. The error is significantly larger when the confidence score is close to zero, possibly because many outliers occur near this area.

Figure 10: Visualization of the pairwise relative rotation errors and their corresponding confidence scores predicted by the dual-branch decoder on corrupted data from ScanNet. The numbers denote the confidence (top) and the error (bottom, in degree). The images marked with red border denote the noise images that have no overlap with others. Our model predicts close-to-zero confidences for those relative rotations to the outliers, which could alleviate their negative influence.

Moreover, in Figure 9, we visualize a set of images and the errors of the predicted pairwise relative rotations as well as the corresponding confidences. As seen from this figure, EAR-Net tends to predict close-to-zero confidences for image pairs with large rotation errors, e.g., the 1st-6th image pair (the confidence is around 0 and the error is 20.3°), and large confidences for image pairs with small errors, e.g., the 3rd-5th image pair (the confidence is 0.81 and the error is 1.14°). Moreover, image pairs with small/large overlap areas generally have small/large confidences, possibly because the model could give a more reliable estimation when image pairs have a larger overlap area. For example, the 3rd-5th image pair has a large overlap area, and the corresponding confidence is around 0.81, while the 5th-6th image pair has little overlap, and the corresponding confidence is close to zero. The above results confirm that reliable/unreliable relative rotations tend to be assigned large/small confidences, as indicated in Section 3.2. In addition, we also visualize the case when image sets are corrupted. Specifically, five images are normally sampled and another five images (noise images) are randomly selected from other scenes with no overlap. Figure 10 shows the corresponding results. As seen from this figure, the confidence scores between the noise images and the other images are close to zero, which could alleviate the negative influence of outlier edges via the proposed confidence-aware initialization approach and confidence-aware optimization algorithm.

Effect of CAI. In order to investigate the effect of the CAI (confidence-aware initialization) approach, we evaluate the proposed EAR-Net by replacing the CAI approach with random initialization (w/o CAI). We test the ‘w/o CAI’ variant on the ScanNet dataset five times independently, and the corresponding average results are reported in Table 4. As seen from this table, the ‘w/o CAI’ variant performs worse than the full EAR-Net, demonstrating the effectiveness of the designed CAI approach.

In addition, we also investigate the effect of the CAI approach for resisting outliers by corrupting the image sets, appending $\{0, 2, 4, 6, 8, 10\}$ randomly selected images from other scenes with no overlap to each image set. This setup is the same as that used to evaluate the effect of the confidence for resisting outliers. The results are shown in Figure 7(b). As seen from this figure, as the outlier level increases, EAR-Net maintains a stable performance, while the performance of the ‘w/o CAI’ variant is degraded severely. This indicates that the proposed CAI approach is essential for absolute rotation estimation.

Comparison of Different Loss Functions. As indicated in [7], robust loss functions are helpful for dealing with outliers. Hence, we further evaluate the proposed EAR-Net by replacing the proposed CAL in Eqn. 3 with the following loss functions respectively: (1) the naive $\ell_{2}$ loss, which serves as a baseline (EAR-Net-$\ell_{2}$); (2) the Cauchy loss function (EAR-Net-Cauchy); (3) the Geman-McClure loss function (EAR-Net-GM). The formulations of the above functions are summarized in the second column of Table 5. We set $\alpha=5^{\circ}$ in the Cauchy and Geman-McClure loss functions as suggested by Chatterjee and Govindu, [7].

Method | Formula of Loss | Mn↓ | Med↓ | Acc@10↑
EAR-Net-$\ell_{2}$ | $\frac{x^{2}}{2}$ | 9.31 | 3.85 | 78.71
EAR-Net-Cauchy | $\frac{\alpha^{2}}{2}\log(1+\frac{x^{2}}{\alpha^{2}})$ | 7.18 | 3.14 | 85.39
EAR-Net-GM | $\frac{x^{2}}{2(\alpha^{2}+x^{2})}$ | 7.36 | 3.19 | 84.82
EAR-Net-CAL | $cx^{2}$ | 3.98 | 2.07 | 94.21
Table 5: Comparison of different loss functions on the ScanNet dataset [15]. ‘GM’ denotes the Geman-McClure loss. ‘Mn’ and ‘Med’ denote the mean and median error respectively.

The corresponding results are reported in Table 5. As seen from this table, incorporating the Cauchy and Geman-McClure loss functions leads to improved model performance compared with using the naive $\ell_{2}$ loss, consistent with the observations in [7]. In addition, it is noted that EAR-Net with the proposed CAL achieves a significantly higher accuracy than all the evaluated model variants with different robust loss functions, demonstrating that directly learning the weights from data is more beneficial.

Method Mn\downarrow Med\downarrow Acc@10\uparrow
Reg6D+RAGO 11.87 6.55 67.37
Reg6D\rightarrowRAGO 11.57 5.72 71.53
ExtremeRotation+RAGO 9.83 5.04 75.26
ExtremeRotation\rightarrowRAGO 9.04 4.82 76.77
EAR-Net 3.98 2.07 94.21
Table 6: Comparison of different estimation strategies for predicting absolute rotations on ScanNet. ‘Mn’ and ‘Med’ denote the mean and median error respectively. ‘A+B’ denotes A and B are trained respectively in a two-stage manner. ‘A\rightarrowB’ denotes A and B are combined and jointly trained together in an end-to-end manner.

4.7 EAR-Net vs End-to-End Learning by Combining Existing Techniques

It is noted that when a learning-based technique for predicting absolute rotations from relative rotations (e.g., RAGO [27]) is combined and jointly trained with a learning-based relative rotation estimation technique (e.g., Reg6D [50], ExtremeRotation [5]) from input images, we could straightforwardly obtain an end-to-end method for predicting absolute rotations from input images. Accordingly in this subsection, we train Reg6D\rightarrowRAGO (combining Reg6D and RAGO) and ExtremeRotation\rightarrowRAGO (combining ExtremeRotation and RAGO) respectively in an end-to-end training manner where only the ground-truth absolute rotations in the training set are used as supervision signals as done in the proposed method, and the corresponding results on the ScanNet dataset [15] are reported in Table 6. For a clear comparison, Table 6 also reports the results of the proposed method as well as the results (that are cited from Table 1) by training Reg6D and RAGO (also ExtremeRotation and RAGO) in a two-stage manner, i.e., the relative rotation estimation method Reg6D (or ExtremeRotation) is trained firstly by utilizing ground truth relative rotations as supervision signals, and then RAGO is trained by utilizing ground truth absolute rotations as supervision signals. Two points could be observed from this table: (i) Both Reg6D\rightarrowRAGO and ExtremeRotation\rightarrowRAGO perform better than their two-stage counterparts, demonstrating that end-to-end training is also effective for boosting the performance of existing models. (ii) The two end-to-end methods Reg6D\rightarrowRAGO and ExtremeRotation\rightarrowRAGO perform significantly worse than the proposed method EAR-Net, demonstrating that such a simple combination of existing techniques could not guarantee a competitive performance.

5 Conclusion

Unlike existing methods that adopt a multi-stage strategy, which inevitably leads to the accumulation of the errors caused by each involved operation, this paper proposes the end-to-end EAR-Net for recovering absolute rotations from multi-view images directly. EAR-Net consists of two key modules: the epipolar confidence graph construction module and the confidence-aware rotation averaging module. The epipolar confidence graph construction module is explored to predict the relative rotations among the input images together with their confidences, resulting in the epipolar confidence graph. The confidence-aware rotation averaging module then takes this graph as input and outputs the estimated absolute rotations by minimizing the proposed confidence-aware loss via the proposed confidence-aware initialization approach and confidence-aware optimization algorithm. Extensive experimental results on three public datasets demonstrate the effectiveness of the proposed EAR-Net in terms of both accuracy and inference speed.

Data Availability Statement: The public datasets used in this paper are: (a) the ScanNet dataset [15], (b) the 7-Scene dataset [39], and (c) the DTU dataset [1]. (a) is available at http://www.scan-net.org/, (b) is available at https://www.microsoft.com/en-us/research/project/rgb-d-dataset-7-scenes/, and (c) is available at https://roboimagedata.compute.dtu.dk/?page_id=36.

Appendix A Sampling Strategy

This section describes the sampling strategy for constructing image sets. There are three datasets used in this paper: the ScanNet dataset [15], the DTU dataset [1], and the 7-Scene dataset [39].

To construct an image set on the ScanNet [15] and 7-Scene [39] datasets, we first sample an image pair whose overlap ratio lies in [0.4, 0.8]. Then a randomly selected image is appended to the set if its overlap ratio with at least one of the already-sampled images lies in [0.4, 0.8]. This procedure is repeated until enough images have been sampled. Such a sampling strategy ensures that every image in a set has an overlap ratio in [0.4, 0.8] with at least one other image in the same set. For the training set, we use the overlap ratios computed by [36]. For the testing set, the video sequences are first downsampled by a factor of 10 to avoid sampling nearly identical images, and the overlap ratios are then computed using the ground-truth depth maps and camera poses.
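As an illustration only, the following Python sketch captures this greedy sampling loop; the `images` list, the `overlap(i, j)` lookup (standing for the precomputed overlap ratios described above), and the trial cap are hypothetical placeholders rather than code from the paper.

```python
import random

def sample_image_set(images, overlap, num_images, lo=0.4, hi=0.8, max_trials=100000):
    """Greedily sample an image set in which every image has an overlap
    ratio in [lo, hi] with at least one other image in the set."""
    sampled = None
    # Seed the set with a pair whose overlap ratio lies in [lo, hi].
    for _ in range(max_trials):
        i, j = random.sample(images, 2)
        if lo <= overlap(i, j) <= hi:
            sampled = [i, j]
            break
    if sampled is None:
        raise RuntimeError("no valid seed pair found")
    # Grow the set: accept a candidate image if its overlap ratio with
    # at least one already-sampled image falls in [lo, hi].
    for _ in range(max_trials):
        if len(sampled) >= num_images:
            break
        cand = random.choice(images)
        if cand not in sampled and any(lo <= overlap(cand, s) <= hi for s in sampled):
            sampled.append(cand)
    return sampled
```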

For the DTU dataset [1], we firstly sample an image pair whose relative rotation angle, computed from the ground-truth camera poses, lies in [0°, 60°]. Then a randomly selected image is appended to the image set if its relative rotation angle with at least one of the already-sampled images lies in [0°, 60°]. This procedure is repeated until enough images have been sampled.
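For completeness, a minimal sketch of the rotation-angle criterion used above is given below; `rotation_angle_deg` and `is_compatible` are our own helper names, and the ground-truth rotation matrices are assumed to be given as 3x3 NumPy arrays.

```python
import numpy as np

def rotation_angle_deg(R1, R2):
    # Geodesic angle between two rotation matrices:
    # theta = arccos((trace(R1^T R2) - 1) / 2), returned in degrees.
    cos_theta = (np.trace(R1.T @ R2) - 1.0) / 2.0
    cos_theta = np.clip(cos_theta, -1.0, 1.0)  # guard against numerical drift
    return np.degrees(np.arccos(cos_theta))

def is_compatible(R_cand, sampled_rotations, max_angle=60.0):
    # A candidate image is accepted if its rotation angle with at least
    # one already-sampled image is at most max_angle degrees.
    return any(rotation_angle_deg(R_cand, R) <= max_angle for R in sampled_rotations)
```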

References

  • Aanæs et al., [2016] Aanæs, H., Jensen, R. R., Vogiatzis, G., Tola, E., and Dahl, A. B. (2016). Large-scale data for multiple-view stereopsis. International Journal of Computer Vision, 120(2):153–168.
  • Bae et al., [2022] Bae, G., Budvytis, I., and Cipolla, R. (2022). Multi-view depth estimation by fusing single-view depth probability with multi-view geometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2842–2851.
  • Barroso-Laguna et al., [2019] Barroso-Laguna, A., Riba, E., Ponsa, D., and Mikolajczyk, K. (2019). Key.net: Keypoint detection by handcrafted and learned cnn filters. In Proceedings of the IEEE International Conference on Computer Vision, pages 5836–5844.
  • Brachmann and Rother, [2021] Brachmann, E. and Rother, C. (2021). Visual camera re-localization from rgb and rgb-d images using dsac. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5847–5865.
  • Cai et al., [2021] Cai, R., Hariharan, B., Snavely, N., and Averbuch-Elor, H. (2021). Extreme rotation estimation using dense correlation volumes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 14566–14575.
  • Carion et al., [2020] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., and Zagoruyko, S. (2020). End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer.
  • Chatterjee and Govindu, [2017] Chatterjee, A. and Govindu, V. M. (2017). Robust relative rotation averaging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):958–972.
  • [8] Chen, H., Luo, Z., Zhou, L., Tian, Y., Zhen, M., Fang, T., Mckinnon, D., Tsin, Y., and Quan, L. (2022a). Aspanformer: Detector-free image matching with adaptive span transformer. In European Conference on Computer Vision, pages 20–36. Springer.
  • [9] Chen, H., Wang, P., Wang, F., Tian, W., Xiong, L., and Li, H. (2022b). Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2781–2790.
  • [10] Chen, K., Snavely, N., and Makadia, A. (2021a). Wide-baseline relative camera pose estimation with directional learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3258–3268.
  • Chen et al., [2020] Chen, Y., Shen, S., Chen, Y., and Wang, G. (2020). Graph-based parallel large scale structure from motion. Pattern Recognition, 107:107537.
  • [12] Chen, Y., Zhao, J., and Kneip, L. (2021b). Hybrid rotation averaging: A fast and robust rotation averaging approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10358–10367.
  • Crandall et al., [2011] Crandall, D., Owens, A., Snavely, N., and Huttenlocher, D. (2011). Discrete-continuous optimization for large-scale structure from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3001–3008. IEEE.
  • Cui et al., [2018] Cui, H., Shen, S., and Gao, W. (2018). Voting-based incremental structure-from-motion. In International Conference on Pattern Recognition, pages 1929–1934. IEEE.
  • Dai et al., [2017] Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser, T., and Nießner, M. (2017). Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  • DeTone et al., [2018] DeTone, D., Malisiewicz, T., and Rabinovich, A. (2018). Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition workshops, pages 224–236.
  • Dong et al., [2020] Dong, Q., Gao, X., Cui, H., and Hu, Z. (2020). Robust camera translation estimation via rank enforcement. IEEE Transactions on Cybernetics, 52(2):862–872.
  • En et al., [2018] En, S., Lechervy, A., and Jurie, F. (2018). Rpnet: An end-to-end network for relative camera pose estimation. In Proceedings of the European Conference on Computer Vision Workshops.
  • Fan et al., [2022] Fan, H., Kileel, J., and Kimia, B. (2022). On the instability of relative pose estimation and ransac’s role. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8935–8943.
  • Fischler and Bolles, [1981] Fischler, M. A. and Bolles, R. C. (1981). Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395.
  • Gao et al., [2021] Gao, X., Zhu, L., Xie, Z., Liu, H., and Shen, S. (2021). Incremental rotation averaging. International Journal of Computer Vision, 129(4):1202–1216.
  • Govindu, [2004] Govindu, V. M. (2004). Lie-algebraic averaging for globally consistent motion estimation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1. IEEE.
  • Govindu, [2006] Govindu, V. M. (2006). Robustness in motion averaging. In Asian Conference on Computer Vision, pages 457–466. Springer.
  • Hartley et al., [2011] Hartley, R., Aftab, K., and Trumpf, J. (2011). L1 rotation averaging using the Weiszfeld algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3041–3048.
  • He et al., [2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer.
  • Lee and Civera, [2022] Lee, S. H. and Civera, J. (2022). Hara: A hierarchical approach for robust rotation averaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 15777–15786.
  • Li et al., [2022] Li, H., Cui, Z., Liu, S., and Tan, P. (2022). Rago: Recurrent graph optimizer for multiple rotation averaging. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 15787–15796.
  • Lin et al., [2014] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.
  • Lowe, [2004] Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110.
  • Markley et al., [2007] Markley, F. L., Cheng, Y., Crassidis, J. L., and Oshman, Y. (2007). Averaging quaternions. Journal of Guidance, Control, and Dynamics, 30(4):1193–1197.
  • Melekhov et al., [2017] Melekhov, I., Ylioinas, J., Kannala, J., and Rahtu, E. (2017). Relative camera pose estimation using convolutional neural networks. In International Conference on Advanced Concepts for Intelligent Vision Systems, pages 675–687. Springer.
  • Mildenhall et al., [2021] Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. (2021). Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106.
  • Nistér, [2004] Nistér, D. (2004). An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770.
  • Purkait et al., [2020] Purkait, P., Chin, T.-J., and Reid, I. (2020). Neurora: Neural robust rotation averaging. In European Conference on Computer Vision, pages 137–154.
  • Revaud et al., [2019] Revaud, J., De Souza, C., Humenberger, M., and Weinzaepfel, P. (2019). R2d2: Reliable and repeatable detector and descriptor. Advances in Neural Information Processing Systems, 32.
  • Sarlin et al., [2020] Sarlin, P.-E., DeTone, D., Malisiewicz, T., and Rabinovich, A. (2020). Superglue: Learning feature matching with graph neural networks. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Schonberger and Frahm, [2016] Schonberger, J. L. and Frahm, J.-M. (2016). Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113.
  • Shen et al., [2016] Shen, T., Zhu, S., Fang, T., Zhang, R., and Quan, L. (2016). Graph-based consistent matching for structure-from-motion. In European Conference on Computer Vision, pages 139–155. Springer.
  • Shotton et al., [2013] Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., and Fitzgibbon, A. (2013). Scene coordinate regression forests for camera relocalization in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937.
  • Sidhartha and Govindu, [2021] Sidhartha, C. and Govindu, V. M. (2021). It is all in the weights: robust rotation averaging revisited. In 2021 International Conference on 3D Vision, pages 1134–1143. IEEE.
  • Snavely et al., [2006] Snavely, N., Seitz, S. M., and Szeliski, R. (2006). Photo tourism: exploring photo collections in 3d. In ACM SIGGRAPH 2006 Papers, pages 835–846.
  • Sun et al., [2021] Sun, J., Shen, Z., Wang, Y., Bao, H., and Zhou, X. (2021). Loftr: Detector-free local feature matching with transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8922–8931.
  • Tian et al., [2020] Tian, Y., Barroso-Laguna, A., Ng, T., Balntas, V., and Mikolajczyk, K. (2020). Hynet: Learning local descriptor with hybrid similarity measure and triplet loss. In Advances in Neural Information Processing Systems, volume 33, pages 7401–7412.
  • Wang et al., [2022] Wang, Q., Zhang, J., Yang, K., Peng, K., and Stiefelhagen, R. (2022). Matchformer: Interleaving attention in transformers for feature matching. In Asian Conference on Computer Vision.
  • Yang et al., [2021] Yang, L., Li, H., Rahim, J. A., Cui, Z., and Tan, P. (2021). End-to-end rotation averaging with multi-source propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11774–11783.
  • Yao et al., [2018] Yao, Y., Luo, Z., Li, S., Fang, T., and Quan, L. (2018). Mvsnet: Depth inference for unstructured multi-view stereo. In Proceedings of the European Conference on Computer Vision, pages 767–783.
  • Yi et al., [2016] Yi, K. M., Trulls, E., Lepetit, V., and Fua, P. (2016). Lift: Learned invariant feature transform. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VI 14, pages 467–483. Springer.
  • Zach et al., [2010] Zach, C., Klopschitz, M., and Pollefeys, M. (2010). Disambiguating visual relations using loop constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1426–1433. IEEE.
  • Zhou et al., [2021] Zhou, Q., Sattler, T., and Leal-Taixe, L. (2021). Patch2pix: Epipolar-guided pixel-level correspondences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4669–4678.
  • Zhou et al., [2019] Zhou, Y., Barnes, C., Lu, J., Yang, J., and Li, H. (2019). On the continuity of rotation representations in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5745–5753.