
Geometric Structure Aided Visual Inertial Localization

Huaiyang Huang, Haoyang Ye, Jianhao Jiao, Yuxiang Sun and Ming Liu

Huaiyang Huang, Haoyang Ye, Jianhao Jiao, Yuxiang Sun and Ming Liu are with the Robotics Institute at the Hong Kong University of Science and Technology, Hong Kong. {hhuangat, hyeab, jjiao, eeyxsun, eelium}@ust.hk.
Abstract

Visual localization is an essential component of autonomous navigation. Existing approaches are based on either the visual structure from SLAM/SfM or the geometric structure from dense mapping. To take advantage of both, in this work we present a complete visual inertial localization system based on a hybrid map representation, which reduces the computational cost and increases the positioning accuracy. Specifically, we propose two modules, one for data association and one for batch optimization. The data association module efficiently associates map components with local features and takes only 2 ms to generate temporal landmarks. For batch optimization, instead of using visual factors, we develop a module that estimates a pose prior from the instant localization results to constrain the poses. The experimental results on the EuRoC MAV dataset demonstrate a competitive performance compared to the state of the art. In particular, our system achieves an average position error of 1.7 cm with 100% recall. The timings show that the proposed modules reduce the computational cost by 20-30%. We will make our implementation open source at http://github.com/hyhuang1995/gmmloc/.

I Introduction and Related Works

Visual localization is a crucial building block for autonomous navigation, as it provides essential information for other functionalities [1]. In recent years, significant progress has been made in metric visual localization, which aims to recover 6-DoF camera poses. Based on the map representation, the prevalent pipelines can be categorized into visual structure-based and geometric structure-based methods.

Among them, localization based on the visual structure is the most prevalent paradigm [2, 3, 4, 5]. Generally, the visual structure is represented by a bipartite graph between landmarks and keyframes, which can be built from Structure-from-Motion (SfM) [6] or Simultaneous Localization and Mapping (SLAM) [7]. Global and local features, along with geometric information (e.g., camera poses, landmark positions), are also preserved. During the localization phase, the system retrieves candidate visual frames in the map and searches for correspondences via feature matching. The camera pose can then be recovered in a Perspective-n-Point (PnP) scheme. Localization based on the visual structure can resolve the kidnapped localization problem, and the map storage is typically lightweight. However, it suffers from variations in camera viewpoint and visual appearance [5].

Recently, geometric structure-based methods have attracted increasing attention [8, 9, 10, 11, 12, 13, 14]. This line of methods represents the prior map with dense structural components (e.g., point clouds, voxels, surfels). Generally, they do not address the kidnapped problem, and instead regard visual localization as a registration problem between the local visual structure and the global dense map. The intuition behind this is that, compared to visual information, structural information can be more invariant to viewpoint or illumination changes. However, in these systems, how to bootstrap the localization without a relatively accurate initial guess has not been well considered [8]. In addition, how to efficiently introduce geometric information from the dense structure is also an open problem [14].

In this work, we propose a visual-inertial localization system with a hybrid map representation, which takes advantage of the visual structure, the geometric structure and temporal visual information. Based on this representation, we can easily resolve the kidnapped localization problem without having to initialize camera poses across different modalities. In addition, with the aid of the prior geometric structure, we can efficiently generate temporal landmarks without the exhaustive and time-consuming inter-frame matching of traditional vision pipelines [6, 7]. The landmark positions can then be recovered and further optimized with geometric constraints. Another issue is the trade-off between accuracy and time cost. Unlike in a typical SLAM system, optimizing the states of historical frames is less meaningful for a localization problem, while the instant localization already provides a relatively accurate state estimate. How to minimize the computational overhead of batch optimization is therefore worth considering. In this work, we propose a windowed optimization scheme with a linearized term as the prior, which improves efficiency while achieving a localization accuracy comparable to other methods. The experimental results on a public dataset show that our method provides an average accuracy of around 1.7 cm in a challenging indoor environment. We summarize our contributions as follows:

  • We present a visual-inertial localization system based on a hybrid map representation, which takes advantage of both the visual structure and the geometric structure.

  • We propose a map-feature association module for temporal landmark generation, whose computational cost is reduced and bounded.

  • We propose an efficient batch optimization scheme with camera poses constrained by the prior terms from the instant localization.

  • Experimental results show that our system achieves competitive localization performance while balancing the computational cost.

II System Overview and Preliminaries

Figure 1: System overview. KF is the abbreviation for keyframe.

The proposed visual-inertial localization framework is presented in Fig. 1. In the following sections, we briefly introduce different modules along with some preliminaries.

II-A Map representation

We propose to localize cameras with a hybrid map representation, which consists of both visual structure and geometric structure. In our system, we leverage visual SLAM/SfM methods to construct the visual structure, where the landmarks are filtered by the number of frames that observe them. The geometric structure is then modeled by a Gaussian Mixture Model (GMM), which provides a prior distribution of the landmarks: $p({}^{W}\mathbf{p}_{i}\,|\,\mathcal{M})=\sum_{j}\pi_{j}\,p({}^{W}\mathbf{p}_{i}\,|\,\mathcal{G}_{j})=\sum_{j}\pi_{j}\,\mathcal{N}(\boldsymbol{\mu}_{j},\boldsymbol{\Sigma}_{j})$, where $\pi_{j},\boldsymbol{\mu}_{j},\boldsymbol{\Sigma}_{j}$ are the weight, mean vector and covariance matrix of the $j$-th component $\mathcal{G}_{j}$. In addition, we further voxelize the geometric structure to support the proposed association module between map components and local features, with details in Sec. III. With the aid of the geometric structure, our system efficiently generates the temporal visual structure to enhance robustness.
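To make this representation concrete, the following minimal sketch (Python/NumPy; the component values and names are illustrative, not taken from our implementation) evaluates the GMM prior density of a landmark position via a numerically stable log-sum-exp over the components:

```python
import numpy as np

def gmm_log_prior(p_w, weights, means, covs):
    """Log of p(p_W | M) = sum_j pi_j * N(p_W; mu_j, Sigma_j)."""
    log_terms = []
    for pi_j, mu_j, cov_j in zip(weights, means, covs):
        diff = p_w - mu_j
        _, logdet = np.linalg.slogdet(cov_j)
        maha = diff @ np.linalg.solve(cov_j, diff)   # squared Mahalanobis distance
        log_terms.append(np.log(pi_j)
                         - 0.5 * (3 * np.log(2 * np.pi) + logdet + maha))
    m = max(log_terms)                               # log-sum-exp for stability
    return m + np.log(sum(np.exp(t - m) for t in log_terms))

# Toy example with two components and one query landmark.
weights = [0.6, 0.4]
means = [np.zeros(3), np.array([1.0, 0.0, 0.0])]
covs = [np.eye(3) * 0.01, np.eye(3) * 0.04]
print(gmm_log_prior(np.array([0.1, 0.0, 0.0]), weights, means, covs))
```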

II-B Re-localization

As mentioned above, previous works based on prior geometric information generally require a global initial guess to bootstrap the localization system, which is difficult to obtain in a kidnapped scenario. Here we generally follow the pipeline of HLoc [4] to re-localize the camera during initialization or when tracking is lost. Briefly, the system first retrieves the top-10 keyframe candidates using NetVLAD [15] as the global descriptor. We then cluster the candidates by their covisibility and match local features against each cluster. With the 2D-3D correspondences, the camera pose is recovered via PnP with RANSAC and then refined with the inlier matches. However, for the robustness, accuracy and smoothness of the estimated trajectory, we consider it better practice to leverage temporal visual information in localization. In other words, we explicitly track the matched local features for instant localization rather than performing re-localization at every single frame.
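The re-localization flow can be summarized by the following sketch (Python; `retrieve_topk`, `db` and `match_to_cluster` are hypothetical placeholders for the NetVLAD retrieval and feature matching components, and OpenCV is used here only as one possible PnP+RANSAC solver, not necessarily the one in our implementation):

```python
import numpy as np
import cv2

def relocalize(query_desc, query_kpts, db, K, retrieve_topk, match_to_cluster):
    """Sketch of the HLoc-style re-localization step (Sec. II-B)."""
    # 1) Retrieve the top-10 keyframe candidates and group them by covisibility.
    candidates = retrieve_topk(query_desc, db, k=10)
    clusters = db.cluster_by_covisibility(candidates)

    # 2) For each cluster, collect 2D-3D matches and attempt PnP with RANSAC.
    for cluster in clusters:
        pts2d, pts3d = match_to_cluster(query_kpts, cluster)
        if len(pts3d) < 6:
            continue
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            np.asarray(pts3d, np.float64), np.asarray(pts2d, np.float64),
            K, None, reprojectionError=4.0)
        if ok and inliers is not None and len(inliers) > 30:
            return rvec, tvec, inliers   # pose is later refined on the inliers
    return None
```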

II-C State Prediction

Considering a visual localization problem, the states to estimate for each frame are $\boldsymbol{\xi}_{k}=\left[\boldsymbol{\phi}_{k},\mathbf{t}_{k}\right]$, where $\boldsymbol{\phi}_{k}$ and $\mathbf{t}_{k}$ represent the rotational and translational states, respectively. The rotation matrix from the map (world) frame to the body frame is given by $\mathbf{R}_{B_{k}W}=\exp(\boldsymbol{\phi}_{k}^{\wedge})\in SO(3)$, where the wedge operator $(\cdot)^{\wedge}$ turns $\boldsymbol{\phi}_{k}\in\mathfrak{so}(3)$ into a skew-symmetric matrix. The translation from the world frame to the body frame is ${}^{B_{k}}\mathbf{t}_{B_{k}W}=\mathbf{t}_{k}$, expressed in the $k$-th body frame. Using inertial measurements introduces the additional states $\left[\mathbf{v}_{k},\mathbf{b}^{a}_{k},\mathbf{b}^{g}_{k}\right]$, which are the velocity of the body frame expressed in the world frame, $\mathbf{v}_{k}={}^{W}\mathbf{v}_{B_{k}}$, and the bias terms of the accelerometer and gyroscope, respectively. The total state vector is then denoted by $\mathbf{x}_{k}=\left[\boldsymbol{\xi}_{k},\mathbf{v}_{k},\mathbf{b}^{a}_{k},\mathbf{b}^{g}_{k}\right]$. Predicting the current states is a common procedure in state estimation. In our system, if IMU data is available, the states of the next frame are predicted via the IMU measurement model; otherwise, a constant velocity model is applied for the vision-only implementation. The relevant descriptions can be found in [16].
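As an illustration of the vision-only prediction step, the sketch below (Python/NumPy; written in the body-to-world convention for readability, with all values illustrative) propagates the pose with the last estimated linear velocity and angular rate under a constant velocity model:

```python
import numpy as np

def skew(phi):
    """Wedge operator: maps phi in R^3 to a skew-symmetric matrix."""
    return np.array([[0.0, -phi[2], phi[1]],
                     [phi[2], 0.0, -phi[0]],
                     [-phi[1], phi[0], 0.0]])

def so3_exp(phi):
    """Exponential map from so(3) to SO(3) (Rodrigues' formula)."""
    theta = np.linalg.norm(phi)
    if theta < 1e-9:
        return np.eye(3) + skew(phi)
    A = skew(phi / theta)
    return np.eye(3) + np.sin(theta) * A + (1.0 - np.cos(theta)) * (A @ A)

def predict_constant_velocity(R_wb, t_wb, v_w, omega_b, dt):
    """Propagate the body pose with constant linear/angular velocity."""
    return R_wb @ so3_exp(omega_b * dt), t_wb + v_w * dt
```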

II-D Instant Localization

As stated before, we explicitly utilize sequential information to localize each frame. The instant localization module deals with 2D-3D association, state estimation and covariance recovery.

II-D1 Tracking

Following the tracking scheme in ORB-SLAM2 [7], we use the predicted camera pose as prior information to search for 2D-3D associations and estimate the relevant states. In our implementation, as the global prior information is more reliable, we first try to associate the observed landmarks from the nearest database keyframe. The same principle applies when matching against the previous frame and the (temporal and prior) visual structure, where we give priority to the landmarks in the prior visual structure.

II-D2 Pose Estimation

For the pose recovery, we follow the optimization scheme in [17], which we briefly describe here as preliminaries. With the 2D-3D associations and optional inertial measurements, the current states can be solved as a least-squares problem that minimizes the following objective function:

E=E_{\text{inertial}}(\mathbf{x}_{k},\mathbf{x}_{k-1})+E_{\text{prior}}(\mathbf{x}_{k-1})+\sum_{i\in\mathcal{V}_{k}}E_{\text{proj}}(\boldsymbol{\xi}_{k},{}^{W}\mathbf{p}_{i}),

where $\mathcal{V}_{k}$ is the set of landmarks visible in the $k$-th frame. The visual factor is defined by the reprojection error $E_{\text{proj}}(\boldsymbol{\xi}_{k},{}^{W}\mathbf{p}_{i})=\rho(\mathbf{e}^{T}_{\text{proj}}(\cdot)\,\boldsymbol{\Sigma}_{pi}^{-1}\,\mathbf{e}_{\text{proj}}(\cdot))$, with

\mathbf{e}_{\text{proj}}(\cdot)\doteq\mathbf{e}_{\text{proj}}(\boldsymbol{\xi}_{k},{}^{W}\mathbf{p}_{i})=\pi({}^{C_{k}}\mathbf{p}_{i})-\bar{\mathbf{u}}_{ik},
{}^{C_{k}}\mathbf{p}_{i}=\mathbf{R}_{CB}(\mathbf{R}_{B_{k}W}\cdot{}^{W}\mathbf{p}_{i}+{}^{B_{k}}\mathbf{t}_{B_{k}W})+{}^{C}\mathbf{t}_{CB},

where $\bar{\mathbf{u}}_{ik}$ is the local measurement, $\pi$ denotes the camera projection function, $\boldsymbol{\Sigma}_{pi}$ is the covariance matrix of $\bar{\mathbf{u}}_{ik}$, and $\mathbf{R}_{CB},{}^{C}\mathbf{t}_{CB}$ are the extrinsic parameters between the camera and the IMU. $E_{\text{inertial}}$ and $E_{\text{prior}}$ are the inertial and prior terms, respectively. We use similar formulations to those in [18, 17] with some trivial modifications; the relevant formulations are described in [16]. As in [17], after the optimization, the relevant factors are marginalized out to form a prior term for $\mathbf{x}_{k}$.
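A minimal sketch of the reprojection residual and its robustified cost is given below (Python with NumPy/SciPy; a pinhole model without distortion and a Huber kernel are assumed purely for illustration):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def reprojection_error(phi_k, t_k, p_w, u_bar, R_cb, t_cb, K):
    """e_proj = pi(p_C) - u_bar, with (phi_k, t_k) the world-to-body pose
    and (R_cb, t_cb) the body-to-camera extrinsics."""
    R_bw = Rotation.from_rotvec(phi_k).as_matrix()
    p_c = R_cb @ (R_bw @ p_w + t_k) + t_cb          # world -> body -> camera
    u_proj = K[:2, :2] @ (p_c[:2] / p_c[2]) + K[:2, 2]
    return u_proj - u_bar

def robust_proj_cost(e, sigma_inv, delta=1.345):
    """rho(e^T Sigma^{-1} e) with a Huber kernel applied to the squared error."""
    r2 = float(e @ sigma_inv @ e)
    return r2 if np.sqrt(r2) <= delta else 2.0 * delta * np.sqrt(r2) - delta**2
```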

II-D3 Covariance Recovery

After the optimization, we can approximate the localization covariance. Denote the state of interest as $\mathbf{x}$ and its update as $\mathbf{x}\boxplus\delta\mathbf{x}$. After the iterative least-squares estimation, we have the normal equation $\mathbf{H}_{\mathbf{x}}\,\delta\hat{\mathbf{x}}=\mathbf{b}_{\mathbf{x}}$, where $\mathbf{H}_{\mathbf{x}},\mathbf{b}_{\mathbf{x}}$ may be computed via the Schur complement, depending on whether $\mathbf{x}$ is the full state vector. This provides a posterior distribution of the state update $\delta\hat{\mathbf{x}}$: $\delta\mathbf{x}\sim\mathcal{N}(\delta\hat{\mathbf{x}},\boldsymbol{\Sigma}_{\delta\mathbf{x}})$, with $\boldsymbol{\Sigma}_{\delta\mathbf{x}}=\mathbf{H}_{\mathbf{x}}^{-1}$. The covariance of the state $\mathbf{x}$ can then be propagated via:

\boldsymbol{\Sigma}_{\mathbf{x}}=\mathbf{J}(\mathbf{x},\delta\mathbf{x})\,\boldsymbol{\Sigma}_{\delta\mathbf{x}}\,\mathbf{J}(\mathbf{x},\delta\mathbf{x})^{T}\quad\text{with} (1)
\mathbf{J}(\mathbf{x},\delta\mathbf{x})=\left.\frac{\partial(\mathbf{x}\boxplus\delta\mathbf{x})}{\partial\delta\mathbf{x}}\right|_{\delta\mathbf{x}=\mathbf{0}} (2)

For $\mathbf{t},\mathbf{v},\mathbf{b}^{a},\mathbf{b}^{g}$, the update is a simple addition in $\mathbb{R}^{3}$, so the Jacobian is the identity matrix and the covariance remains the same. On the contrary, for the rotation parameters $\boldsymbol{\phi}$, the update is performed on the tangent space of $SO(3)$, yielding

\mathbf{J}(\boldsymbol{\phi},\delta\boldsymbol{\phi})=\left.\frac{\partial(\boldsymbol{\phi}\boxplus\delta\boldsymbol{\phi})}{\partial\delta\boldsymbol{\phi}}\right|_{\delta\boldsymbol{\phi}=\mathbf{0}}=\mathbf{J}_{r}^{-1}(\boldsymbol{\phi}) (3)

with $\mathbf{J}_{r}^{-1}(\boldsymbol{\phi})$ the inverse of the right Jacobian of $\boldsymbol{\phi}$.

The covariance of the states, especially rotation and translation, indicates how certain the localization result is. Detailed explanations and experimental results can be found in [16]. In addition, this formulation inspires us to replace the visual factors with a pose prior term in the batch optimization, as described in Sec. IV.
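The covariance recovery of Eqs. (1)-(3) can be sketched as follows (Python/NumPy; the 6x6 Hessian is assumed to be ordered with the rotation block first, matching $\boldsymbol{\xi}_{k}=[\boldsymbol{\phi}_{k},\mathbf{t}_{k}]$):

```python
import numpy as np

def right_jacobian_inv_so3(phi):
    """Inverse of the right Jacobian of SO(3), J_r^{-1}(phi)."""
    A = np.array([[0.0, -phi[2], phi[1]],
                  [phi[2], 0.0, -phi[0]],
                  [-phi[1], phi[0], 0.0]])
    theta = np.linalg.norm(phi)
    if theta < 1e-9:
        return np.eye(3) + 0.5 * A
    c = 1.0 / theta**2 - (1.0 + np.cos(theta)) / (2.0 * theta * np.sin(theta))
    return np.eye(3) + 0.5 * A + c * (A @ A)

def recover_pose_covariance(H_xi, phi):
    """Propagate Sigma_{delta xi} = H^{-1} to the pose parameters via Eq. (1)."""
    sigma_delta = np.linalg.inv(H_xi)              # covariance of the update
    J = np.eye(6)                                  # identity block for translation
    J[:3, :3] = right_jacobian_inv_so3(phi)        # Eq. (3) for rotation
    return J @ sigma_delta @ J.T
```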

II-E Landmark Creation and Optimization

If only the prior visual structure is used, the localization performance is highly dependent on the view overlap and appearance similarity between the database images and the queries. In such cases, the number of correspondences can be insufficient for recovering accurate states. Therefore, we propose to generate temporal landmarks with the aid of the geometric structure. For the state optimization, we leverage a fixed-lag smoother to correct the current states and update the positions of the temporal landmarks. To efficiently solve this batched estimation problem, we propose to use pose prior factors instead of visual factors. These two modules are described in detail in Sec. III and Sec. IV.

Figure 2: Comparative illustration of the ray-casting (left) and projection (right) method for associating map components and local features. The dashed arrow represents the information flow for generating the association.

III Temporal Landmark Generation

A solution for associating local features with map components was proposed in GMMLoc [14], referred to as the projection method here. We briefly review this association scheme. GMMLoc first projects all the map components to the image coordinates as 2-D distributions by applying non-linear transforms to the 3-D Gaussian distributions. The local features are then associated with map components in the 2-D space. To filter out occluded components, it sorts the projected components by depth and compares the depth of each component with that of its neighbors. The drawbacks of the projection method are three-fold: 1) when applied to generic camera models, the projection function needs to be linearized, yielding a poor approximation when the camera distortion is large; 2) the computational time is highly dependent on the number of visible components; 3) the sorting operation is sequential and thus difficult to parallelize.

To resolve these issues, in this work we propose an improved solution that is less dependent on the camera model and has a reduced and bounded computational cost. The method is inspired by techniques widely used in volumetric dense mapping, such as ray-casting, hence we name it the ray-casting method. A comparative illustration of the two methods is shown in Fig. 2.

As a prerequisite, we extend the GMM into a voxelized representation in the mapping stage. To this end, we first adopt the voxel-hashing method in Voxblox [19] to divide the Euclidean space into a set of voxels $\left\{V_{t}\right\}$ with a given resolution $\delta_{v}=0.1\,\text{m}$. Each voxel $V_{t}$ represents a region $R_{V_{t}}=\left\{\mathbf{x}\,|\,\mathbf{x}\in\mathbb{R}^{3},\,l<\mathbf{c}^{T}\mathbf{x}<u\right\}$, defined by a box function with parameters $l,u,\mathbf{c}$. Then, for each Gaussian component, we locate its intersected voxels via the cumulative likelihood:

p(V_{t}|\mathcal{G}_{j})=\iiint_{R_{V_{t}}}\exp\left(-\frac{1}{2}\|\mathbf{p}-\boldsymbol{\mu}_{j}\|_{\boldsymbol{\Sigma}_{j}}\right)d\mathbf{p}. (4)

Since this integral has no analytical solution, we compute it with a numerical method [20]. If $p(V_{t}|\mathcal{G}_{j})>\delta_{l}=0.1$, the component $\mathcal{G}_{j}$ is considered to intersect the voxel $V_{t}$; therefore, in our system a component can be associated with multiple intersected voxels.
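The following sketch (Python/NumPy) approximates the overlap between a component and an axis-aligned voxel with simple Monte Carlo sampling; it is only a stand-in for the quadrature scheme of [20] and estimates the probability mass, i.e., Eq. (4) up to the Gaussian normalization constant:

```python
import numpy as np

def voxel_component_overlap(v_min, v_size, mu, cov, n=4096, seed=0):
    """Approximate the probability mass of N(mu, cov) inside a cubic voxel."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mu, cov, size=n)
    inside = np.all((samples >= v_min) & (samples <= v_min + v_size), axis=1)
    return inside.mean()

# A component is linked to every voxel whose overlap exceeds delta_l = 0.1.
mu = np.array([0.55, 0.50, 0.50])
cov = np.diag([0.02, 0.005, 0.001])
print(voxel_component_overlap(np.array([0.50, 0.45, 0.45]), 0.1, mu, cov) > 0.1)
```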

With this voxelized map representation, we then associate the unmatched local features with components in the map. Given a localized frame with camera pose $[\mathbf{R}_{WC_{k}},{}^{W}\mathbf{t}_{WC_{k}}]$ and local features $\left\{\mathbf{u}_{ki}\right\}_{i=1\dots N}$, each candidate feature $\mathbf{u}_{ki}$ defines a ray parameterized by $\mathbf{r}_{ki}=\left[{}^{W}\mathbf{v}_{ki},{}^{W}\mathbf{t}_{WC_{k}}\right]$, where ${}^{W}\mathbf{v}_{ki}=\mathbf{R}_{WC_{k}}\pi^{-1}(\mathbf{u}_{ki})$ is the bearing vector in the world frame and ${}^{W}\mathbf{t}_{WC_{k}}$ is the origin of the ray. The position of the landmark observed by $\mathbf{u}_{ki}$ can then be parameterized as ${}^{W}\mathbf{p}_{i}={}^{W}\mathbf{t}_{WC_{k}}+\lambda\,{}^{W}\mathbf{v}_{ki}$, where $\lambda$ is the depth along the ray. To find a proper map association for $\mathbf{u}_{ki}$, we cast the ray into the voxelized map. For each voxel along the ray, we examine the components intersecting it, using the Mahalanobis distance between the ray and a component, ${dist}\left(\mathbf{r}_{ki},\mathcal{G}_{j}\right)$, as the metric. It is computed as follows: decomposing the covariance of $\mathcal{G}_{j}$ via SVD, we have

\boldsymbol{\Sigma}_{j}=\mathbf{R}_{j}\mathbf{S}_{j}\mathbf{R}_{j}^{T},\quad\mathbf{S}_{j}=\text{diag}(\lambda_{1},\lambda_{2},\lambda_{3}),\quad\mathbf{R}_{j}=[\mathbf{e}_{j1},\mathbf{e}_{j2},\mathbf{e}_{j3}].

Applying the whitening transform, the ray parameters become [21]:

\mathbf{t}^{\prime}_{WC_{k}}=\mathbf{S}_{j}^{-\frac{1}{2}}\mathbf{R}_{j}^{T}\left({}^{W}\mathbf{t}_{WC_{k}}-\boldsymbol{\mu}_{j}\right), (5)
\mathbf{v}^{\prime}_{ki}=\mathbf{S}_{j}^{-\frac{1}{2}}\mathbf{R}_{j}^{T}\,{}^{W}\mathbf{v}_{ki},

with which the distance metric is given by:

{dist}\left(\mathbf{r}_{ki},\mathcal{G}_{j}\right)=\frac{1}{\|\mathbf{v}^{\prime}_{ki}\|}\left\|\mathbf{t}^{\prime}_{WC_{k}}\times\mathbf{v}^{\prime}_{ki}\right\|. (6)

To enhance efficiency, $\mathbf{S}_{j}^{-\frac{1}{2}}\mathbf{R}_{j}^{T}$ in Eq. 5 can be pre-computed for each component $\mathcal{G}_{j}$ via the Cholesky decomposition: $\boldsymbol{\Sigma}_{j}^{-1}=\mathbf{L}\mathbf{L}^{T}=(\mathbf{S}_{j}^{-1/2}\mathbf{R}_{j}^{T})^{T}(\mathbf{S}_{j}^{-1/2}\mathbf{R}_{j}^{T})$.

With this metric, among the components associated with the same voxel, we select the one with the smallest distance to the optical ray and apply a $\chi^{2}$-test, which gives a threshold $\delta_{r}=2.8$. If the distance is within $\delta_{r}$, we associate the component with the feature $\mathbf{u}_{ki}$. Finally, with this association, the depth $\lambda$ in the landmark position parameterization can be trivially recovered by $\lambda=\mathbf{e}_{1}^{T}(\boldsymbol{\mu}-\mathbf{t}_{WC})/\mathbf{e}_{1}^{T}\mathbf{v}$. We only perform this operation for local features that are not associated with landmarks in the map. We also sort them by detection score and only keep the top-100 keypoints to bound the computational cost. These strategies are kept the same when comparing the projection method and the ray-casting method in Sec. V. Unlike the projection method, the time complexity of the ray-casting method mainly depends on the number of local features queried, and is thus more bounded. We name these initialized points seeds, which are not immediately used for localization. Instead, they are further verified in the landmark update step with subsequent visual information and localized camera poses. Similar to the matching scheme in Sec. II-D, we match correspondences with the localized camera pose as the prior. Briefly, if a seed is observed with sufficient parallax over multiple frames, it is activated as a landmark for future localization. In the implementation, the minimum parallax is set to 4 px and the minimum number of observations to 4. Seeds not visible in the current window are cleared after each update.
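The core of the ray-casting association, i.e., Eqs. (5)-(6) and the depth recovery, can be sketched as follows (Python/NumPy; the whitening matrix would be pre-computed per component in practice, and which component axis serves as $\mathbf{e}_{1}$ follows the convention above):

```python
import numpy as np

def ray_component_distance(t_wc, v_w, mu, cov):
    """Mahalanobis distance between a ray (origin t_wc, direction v_w)
    and a Gaussian component (mu, cov), following Eqs. (5)-(6)."""
    S, R = np.linalg.eigh(cov)                 # cov = R diag(S) R^T
    W = np.diag(S ** -0.5) @ R.T               # whitening transform, precomputable
    t_p = W @ (t_wc - mu)
    v_p = W @ v_w
    return np.linalg.norm(np.cross(t_p, v_p)) / np.linalg.norm(v_p)

def depth_along_ray(t_wc, v_w, mu, e1):
    """Recover the seed depth lambda = e1^T (mu - t_WC) / e1^T v."""
    return float(e1 @ (mu - t_wc)) / float(e1 @ v_w)
```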

Figure 3: Factor graph representation for the windowed optimization. After instant localization, we marginalize out the visual factors for a prior distribution of camera pose. In the motion-only BA, instead of using non-linear visual factors, we use this pose prior as constraints to correct current states.

IV Optimization with Global Pose Prior

Although we can already recover the global camera pose with the instant localization, some states cannot be properly estimated, especially when inertial measurements are incorporated. For example, we find in practice that $\mathbf{b}^{a}$ is difficult to converge without batch optimization, yielding a biased state prediction. Here we propose a batched yet efficient optimization structure, which can be solved within 10 ms on average.

Firstly, inspired by SVO [22], we decouple the full Bundle Adjustment (BA) into motion-only and structure-only BA; Fig. 3 shows the factor graph representation of our motion-only BA pipeline. For the motion-only BA, we can optimize the camera poses by minimizing

E=\sum_{k\in\mathcal{F}}E_{\text{inertial}}(\mathbf{x}_{k},\mathbf{x}_{k-1})+\sum_{k\in\mathcal{F}}\sum_{i\in\mathcal{V}_{k}}E_{\text{proj}}(\boldsymbol{\xi}_{k},{}^{W}\mathbf{p}_{i}), (7)

where $\mathcal{F}$ is the set of keyframes in the current window, and we set $\left|\mathcal{F}\right|=15$. Although the decoupled scheme accelerates the optimization, we found that iteratively re-linearizing the visual factors is computationally less efficient, while the instant localization can already provide a relatively accurate estimate. As a consequence, we propose to marginalize the visual factors into a pose prior instead of using the original visual factors. After the pose optimization described in Sec. II-D, we have the normal equation $\mathbf{H}\,\delta\hat{\mathbf{x}}_{k}=\mathbf{b}$. By reordering the states, this normal equation can be re-written as:

\left[\begin{matrix}\mathbf{H}_{tt}&\mathbf{H}_{tr}\\ \mathbf{H}_{rt}&\mathbf{H}_{rr}\end{matrix}\right]\left[\begin{matrix}\delta\hat{\boldsymbol{\xi}}_{k}\\ \delta\hat{\mathbf{x}}_{r}\end{matrix}\right]=\left[\begin{matrix}\mathbf{b}_{t}\\ \mathbf{b}_{r}\end{matrix}\right], (8)

where the subscript $t$ denotes the block related to the camera pose update $\delta\boldsymbol{\xi}_{k}$ and $r$ the block of the other variables. The Hessian block $\mathbf{H}_{tt}$ can be decomposed as $\mathbf{H}_{tt}=\mathbf{H}_{i}+\mathbf{H}_{p}$, where $\mathbf{H}_{i}$ and $\mathbf{H}_{p}$ are the Hessians related to the inertial and visual factors, respectively. We have $\mathbf{H}_{p}=\sum_{i\in\mathcal{V}_{k}}\mathbf{H}_{pi}$, where $\mathbf{H}_{pi}$ is given by:

E_{\text{proj}}(\boldsymbol{\xi}_{k}\boxplus\delta\boldsymbol{\xi}_{k},{}^{W}\mathbf{p}_{i})\approx\text{const}+2\,\mathbf{e}_{\text{proj}}^{T}(\cdot)\,\boldsymbol{\Sigma}_{pi}^{-1}\mathbf{J}_{pi}\,\delta\boldsymbol{\xi}_{k}+\delta\boldsymbol{\xi}_{k}^{T}\underbrace{\mathbf{J}_{pi}^{T}\boldsymbol{\Sigma}_{pi}^{-1}\mathbf{J}_{pi}}_{\mathbf{H}_{pi}}\,\delta\boldsymbol{\xi}_{k},

where $\mathbf{J}_{pi}$ is the Jacobian of $\mathbf{e}_{\text{proj}}(\boldsymbol{\xi}_{k},{}^{W}\mathbf{p}_{i})$ with respect to $\boldsymbol{\xi}_{k}$. Marginalizing out the visual factors simply yields a distribution over the camera pose update, $\delta\boldsymbol{\xi}_{k}\sim\mathcal{N}(\delta\hat{\boldsymbol{\xi}}_{k},\boldsymbol{\Sigma}_{\delta\boldsymbol{\xi}_{k}})$ with $\boldsymbol{\Sigma}_{\delta\boldsymbol{\xi}_{k}}=\mathbf{H}_{p}^{-1}$.
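A minimal sketch of this marginalization is given below (Python/NumPy); it accumulates the visual Hessian of the current frame and inverts it to obtain the pose-prior covariance:

```python
import numpy as np

def pose_prior_covariance(jacobians, pixel_covs):
    """Sigma_{delta xi_k} = H_p^{-1}, with H_p = sum_i J_pi^T Sigma_pi^{-1} J_pi."""
    H_p = np.zeros((6, 6))
    for J_pi, sigma_pi in zip(jacobians, pixel_covs):
        info = np.linalg.inv(sigma_pi)          # 2x2 measurement information
        H_p += J_pi.T @ info @ J_pi             # J_pi: 2x6 Jacobian w.r.t. the pose
    return np.linalg.inv(H_p)
```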

Figure 4: Cumulative ratio of APEs on EuRoC MAV dataset. We provide evaluation results with and without alignment against the groundtruth, as described in the legend.

With this prior information, we can formulate the prior pose constraints as:

\mathbf{e}_{\text{rot}}=\log\left(\hat{\mathbf{R}}_{WB_{k}}\,\mathbf{R}_{B_{k}W}\exp(\delta\hat{\boldsymbol{\phi}}^{\wedge})\right)^{\vee}, (9)
\mathbf{e}_{\text{trans}}=({}^{B_{k}}\mathbf{t}_{B_{k}W}-{}^{B_{k}}\hat{\mathbf{t}}_{B_{k}W})-\delta\hat{\mathbf{t}}.

Assuming the pose estimation has reached a local optimum, the state update satisfies $\delta\hat{\boldsymbol{\xi}}_{k}\rightarrow\mathbf{0}$, and the prior error state can thus be formulated as

\mathbf{e}_{\text{pose}}(\boldsymbol{\xi}_{k})=\left[\log(\hat{\mathbf{R}}_{WB_{k}}\mathbf{R}_{B_{k}W})^{\vee},\;{}^{B_{k}}\mathbf{t}_{B_{k}W}-{}^{B_{k}}\hat{\mathbf{t}}_{B_{k}W}\right].

Then the total objective function Eq. 7 is changed into:

E=\sum_{k\in\mathcal{F}}E_{\text{inertial}}(\mathbf{x}_{k},\mathbf{x}_{k-1})+\sum_{k\in\mathcal{F}}\rho\left(\|\mathbf{e}_{\text{pose}}(\boldsymbol{\xi}_{k})\|_{\boldsymbol{\Sigma}_{\delta\boldsymbol{\xi}_{k}}}\right).

This problem can be solved by the Levenberg-Marquardt method. For the structure-only optimization, similar to GMMLoc [14], we re-triangulate the temporal landmarks with geometric constraints.
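The pose prior residual used in the windowed optimization can be sketched as follows (Python with NumPy/SciPy; the rotation states are represented as rotation vectors, consistent with Sec. II-C):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_prior_residual(phi_k, t_k, phi_hat, t_hat):
    """e_pose(xi_k): log(R_hat_{WB_k} R_{B_k W})^vee stacked with t_k - t_hat."""
    R_bw = Rotation.from_rotvec(phi_k).as_matrix()           # current R_{B_k W}
    R_wb_hat = Rotation.from_rotvec(phi_hat).as_matrix().T   # prior R_hat_{W B_k}
    e_rot = Rotation.from_matrix(R_wb_hat @ R_bw).as_rotvec()
    return np.concatenate([e_rot, t_k - t_hat])
```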

The proposed optimization structure is quite similar to the Sliding Window Filter (SWF) commonly used in batched state estimation. The difference is that we fix the states of the oldest frame instead of introducing a prior term from marginalization as in [17]. Since global observations are available most of the time in the localization system, we consider such information loss acceptable and therefore choose this strategy, which is much easier to implement.

Table I: Evaluation of the general localization performance on the EuRoC MAV dataset. We report the mAPE (cm) and the recall (%) for different methods. The three groups of rows correspond to variants of our method, visual structure-based methods and geometric structure-based methods.
Method V101 V103 V201 V203
Ours (V) 1.1 (100.0) 1.8 (100.0) 1.5 (99.8) 2.7 (97.3)
Ours (V+I) 1.1 (100.0) 1.7 (100.0) 1.5 (100.0) 2.6 (99.2)
Ours (V+I+P) 1.2 (100.0) 1.9 (100.0) 2.8 (100.0) 2.4 (100.0)
Ours (V+I+L) 1.1 (100.0) 1.8 (100.0) 1.6 (100.0) 2.3 (100.0)
ORB-Loc 1.5 (100.0) 1.9 (73.8) 1.9 (95.7) 1.4 (25.0)
HLoc 1.9 (99.2) 2.6 (95.9) 2.1 (99.2) 3.4 (94.2)
HLoc + T 1.3 (99.2) 2.3 (96.9) 1.6 (93.6) 2.4 (84.6)
DSL 3.5 (-) 4.5 (-) 2.6 (-) 10.3 (-)
GMMLoc 3.0 (-) 4.7 (-) 1.8 (-) 5.6 (-)
MSCKF (w/ map) 5.6 (-) 8.7 (-) 6.9 (-) 14.9 (-)
Figure 5: Left: reconstructed trajectories compared against the ground truth, with the lost part shown as a dashed line. Our method is more robust and estimates a locally smoother trajectory. Right: excerpts from region A for ORB-Loc, HLoc + T and Ours, respectively. With the aid of the dense structure, our system generates temporal landmarks (red) to keep sufficient observations.

V Experimental Results

V-A Experiment Setup

In the experiments, we first evaluate different variants of our system, denoted Ours (*). For the variants, V stands for the basic vision-only system, I for the use of inertial measurements, L for the use of the pose prior term in the optimization (Sec. IV), and P for replacing the ray-casting method with the projection method (Sec. III). We then evaluate several state-of-the-art methods as baselines, including Hierarchical Localization (HLoc) [4] and ORB-SLAM2 in localization mode (ORB-Loc) [7]. Considering that HLoc addresses the kidnapped localization problem with a single-shot image as input, we extend it with the same instant localization module as ours; this baseline is named HLoc + T. In addition, we also report the results of several methods based mainly on geometric structure for comparison, i.e., DSL [13], GMMLoc [14] and MSCKF (w/ map) [12].

For the evaluation, we use the EuRoC MAV dataset [23], which contains two indoor scenes with three sequences each. It provides stereo images and IMU data streams, along with a dense point cloud map. The GMM is built with the scikit-learn toolkit (https://scikit-learn.org/). In our evaluations, for each scene, we build a visual structure using COLMAP [6] on the medium sequences (V102, V202) and test the performance on the easy (V101, V201) and difficult sequences (V103, V203). The visual map is filtered by the minimum number of observations (20 in the experiments). The same model is used for Ours (*), HLoc and HLoc + T. We found that the sequence V203 is extremely difficult due to the challenging motion profile, the illumination conditions and the missing frames in the left images; as a consequence, we use the right image as input and keep the original SfM model for localization. The experiments were run on a desktop with an Intel Core i7-8700K CPU, 32GB RAM and an NVIDIA 1080Ti GPU.
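For reference, the offline GMM construction follows the standard scikit-learn interface; the sketch below uses random points and an arbitrary number of components purely for illustration (the actual map point cloud and component count are not shown here):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
points = rng.random((5000, 3)) * 10.0          # stand-in for the dense map cloud

gmm = GaussianMixture(n_components=50, covariance_type="full",
                      max_iter=100, random_state=0).fit(points)
weights, means, covs = gmm.weights_, gmm.means_, gmm.covariances_
```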

V-B Results and Discussion

We report the mean Absolute Pose Error (mAPE) and pose recall in Tab. I and Fig. 4. Generally, the proposed system achieves competitive accuracy and robustness on all four sequences, as shown in Tab. I. While Ours (V+I+L) achieves an average position error of 1.7 cm, it also never fails on any of the sequences. The pose recall curves in Fig. 4 also show that in most cases our method outperforms the baselines in both the fine- and coarse-precision regimes.

Comparing the results of Ours (V) and Ours (V+I), we observe that the performance is very close, with some improvements on the difficult sequences (V103 and V203). As the motion profile on V203 is very aggressive, without proper state prediction from the IMU we observe a recall drop of 1.9%. On the contrary, on the easy sequences we observe a slight accuracy drop (e.g., from the distribution curve of V201), as using the IMU introduces more parameters to estimate. These results show that inertial measurements are more helpful in more challenging situations. Comparing the results of Ours (V+I) and Ours (V+I+P), we observe that with the association module replaced by the projection-based method, the localization accuracy generally remains close, while a noticeable drop of about 1 cm occurs on V201. For Ours (V+I+L), we observe that the overall performance is also close to that of the variants without the pose prior. To our surprise, Ours (V+I+L) performs better than all the other variants on V203.

For the methods based on visual structure, the results show that ORB-Loc has a similar performance to the other methods on the two easy sequences, but degrades on the difficult ones. This shows that the selection of local and global features can be a significant factor in visual localization. As HLoc + T is similar to our method at the system level, it should be a competitive baseline. While its localization accuracy is close, the recall drop shows that it is less robust than our method in most cases. In addition, its recall drop compared to HLoc on V201 and V203 is mainly because we do not attempt to re-localize a frame if the instant localization fails.

For the methods based on geometric structure (DSL, GMMLoc, MSCKF (w/ map)), we observe that they generally have a less accurate localization performance. This is mainly because their global constraints are formulated similarly to point-to-plane distances, which suffer from degenerate scene structure.

Fig. 5 shows the reconstructed trajectories of several methods. Although the localization accuracies of the different methods in Tab. I are close, we observe that the trajectory of our method is generally better aligned with the ground truth. As our method benefits from the temporal landmark generation, more features can be associated with landmarks, which improves the localization accuracy and robustness. This also validates the benefit of the geometric structure.

Figure 6: Timing results of the two modules, optimization and landmark generation, respectively.

As the proposed localization solution is developed with computational constraints in mind, we further evaluate the runtime performance of different variants of our method. To validate the effectiveness of the proposed modules, we focus on the timings of temporal landmark creation and optimization, which are reported in Fig. 6. Comparing the two association methods, we observe that the proposed one significantly improves the efficiency. In addition, the timing of the projection method varies considerably (a standard deviation of 9.6 ms on V201), which shows that its computational cost is highly dependent on the number of visible components. On the contrary, the timing of the proposed method is more consistent. Comparing the optimization times, there is first a significant improvement when the full BA is decoupled into motion-only and structure-only BA, saving more than 1/2 of the time when comparing Ours (V) with Ours (V+I). Furthermore, with the camera pose prior from the instant localization, there is no need to reconstruct the Jacobians related to the visual factors, and the optimization time is further reduced by 2/3. More detailed timing results on V201 and V103 are reported in [16]. In general, the timings of Ours (V+I+L) show that we achieve real-time performance in a single thread (28.8 ms on V201 and 38.0 ms on V103).

VI Conclusions

In this work, we have presented a visual inertial localization system based on a hybrid map representation. To increase the localization robustness, we have proposed an association method to efficiently create temporal landmarks. In the windowed optimization, we have proposed a strategy that reduces the computational time of the full BA from about 50 ms to under 10 ms. The experimental results on the EuRoC MAV dataset show that our method can provide a 1-3 cm localization accuracy with less computational resources in indoor scenarios. In the future, we would like to further improve the robustness and accuracy at the system level. Although our optimization module is already very efficient, we believe that solving this problem incrementally would be more suitable in practice, as the measurements typically arrive in order. In addition, as the current feature extraction module is relatively time-consuming, distillation from the current model can be another interesting problem to address.

References

  • [1] M. Liu, C. Pradalier, and R. Siegwart, “Visual homing from scale with an uncalibrated omnidirectional camera,” IEEE Transactions on Robotics, vol. 29, no. 6, pp. 1353–1365, 2013.
  • [2] T. Sattler, B. Leibe, and L. Kobbelt, “Efficient & effective prioritized matching for large-scale image-based localization,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 9, pp. 1744–1756, 2016.
  • [3] L. Svärm, O. Enqvist, F. Kahl, and M. Oskarsson, “City-scale localization for cameras with known vertical direction,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 7, pp. 1455–1461, 2016.
  • [4] P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, “From coarse to fine: Robust hierarchical localization at large scale,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12 716–12 725.
  • [5] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic et al., “Benchmarking 6dof outdoor visual localization in changing conditions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8601–8610.
  • [6] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [7] R. Mur-Artal and J. D. Tardós, “ORB-SLAM2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
  • [8] T. Caselitz, B. Steder, M. Ruhnke, and W. Burgard, “Monocular camera localization in 3-d lidar maps,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2016, pp. 1926–1931.
  • [9] Y. Kim, J. Jeong, and A. Kim, “Stereo camera localization in 3-d lidar maps,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 1–9.
  • [10] X. Ding, Y. Wang, D. Li, L. Tang, H. Yin, and R. Xiong, “Laser map aided visual inertial localization in changing environment,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 4794–4801.
  • [11] H. Huang, Y. Sun, H. Ye, and M. Liu, “Metric monocular localization using signed distance fields,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 1195–1201.
  • [12] X. Zuo, P. Geneva, Y. Yang, W. Ye, Y. Liu, and G. Huang, “Visual-inertial localization with prior lidar map constraints,” IEEE Robotics and Automation Letters, vol. 4, no. 4, pp. 3394–3401, 2019.
  • [13] H. Ye, H. Huang, and M. Liu, “Monocular direct sparse localization in a prior 3-d surfel map,” in 2020 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2020.
  • [14] H. Huang, H. Ye, Y. Sun, and M. Liu, “Gmmloc: Structure consistent visual localization with gaussian mixture models,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5043–5050, 2020.
  • [15] R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5297–5307.
  • [16] H. Huang, H. Ye, J. Jiao, Y. Sun, and M. Liu, “Geometric structure aided visual inertial localization: Supplemental materials,” [Online]. Available: http://www.ram-lab.com/papers/2020/huang2020geoloc_sup.pdf.
  • [17] R. Mur-Artal and J. D. Tardós, “Visual-inertial monocular slam with map reuse,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 796–803, 2017.
  • [18] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold preintegration for real-time visual–inertial odometry,” IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2016.
  • [19] H. Oleynikova, Z. Taylor, M. Fehr, R. Siegwart, and J. Nieto, “Voxblox: Incremental 3d euclidean signed distance fields for on-board mav planning,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
  • [20] A. Genz, “Numerical computation of rectangular bivariate and trivariate normal and t probabilities,” Statistics and Computing, vol. 14, no. 3, pp. 251–260, 2004.
  • [21] Z. Lu, S. Baek, and S. Lee, “Robust 3d line extraction from stereo point clouds,” in 2008 IEEE Conference on Robotics, Automation and Mechatronics.   IEEE, 2008, pp. 1–5.
  • [22] C. Forster, M. Pizzoli, and D. Scaramuzza, “SVO: Fast semi-direct monocular visual odometry,” in 2014 IEEE international conference on robotics and automation (ICRA).   IEEE, 2014, pp. 15–22.
  • [23] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The euroc micro aerial vehicle datasets,” The International Journal of Robotics Research, 2016. [Online]. Available: http://ijr.sagepub.com/content/early/2016/01/21/0278364915620033.abstract
