
Toward Consistent Drift-free Visual Inertial Localization on Keyframe Based Map

Zhuqing Zhang1, Yanmei Jiao1, Shoudong Huang2, Yue Wang1, Rong Xiong1
1State Key Laboratory of Industrial Control Technology and Institute of Cyber-Systems and Control, Zhejiang University, Zhejiang, China. 2Centre for Autonomous Systems, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, NSW 2007, Australia. Yue Wang is the corresponding author ([email protected]).
Abstract

Global localization is essential for robots to perform further tasks like navigation. In this paper, we propose a new framework for global localization built on MSCKF, a filter-based visual-inertial odometry framework. To reduce computation and memory consumption, we maintain only the keyframe poses of the map and employ the Schmidt-EKF to update the state. This global localization framework is shown to maintain the consistency of the state estimator. Furthermore, we introduce a re-linearization mechanism in the update phase, which reduces the linearization error of the observation function and makes the state estimation more precise. Experiments show that this mechanism is crucial for large and challenging scenes. Simulations and experiments demonstrate the effectiveness and consistency of our global localization framework.

I Introduction and Related Works

Autonomous mobile robots are a hot topic in contemporary research and development. To realize them, it is crucial to localize the robot in a global coordinate system. A good way to do that is to employ simultaneous localization and mapping (SLAM). However, most SLAM-related works [1, 2, 3, 4, 5] estimate the robot state in a local coordinate system, and the estimator drifts over time. Global information can be introduced to solve this problem, so that the robot is localized in a global frame. A commonly used source of global information for localization is the Global Positioning System (GPS) [6, 7, 8]. However, GPS measurements are usually noisy and not always available. The pre-built visual map is another source of global information, and is the one we consider in this paper. For a visual localization system, the map usually contains keyframe poses and related 2D and/or 3D features. When there are matches between the map and the online input images, the global information can be fused into the state estimator, thereby removing drift and increasing positional accuracy.

Many works perform global localization with pre-built visual maps [6, 9, 10, 11, 12]. However, how to use the information in the map is an open question. For a large scene, the map can contain hundreds of keyframes and tens of thousands of points, which is quite large in volume. Besides, the map information is not perfect and should carry uncertainty information (a covariance or information matrix), which makes the data even larger. To make the global localization system run in real time, some trade-offs need to be made with the map information.

A direct way of using a visual map is to employ all the map information ($m$ keyframe poses, $n$ landmarks, and their covariance or information matrices; $m \ll n$ in general visual SLAM scenarios). However, since the computational complexity of updating the covariance is $\mathcal{O}((m+n)^{2})$, maintaining all the map information in the state vector is computationally expensive. To solve this problem, [10] proposes a system based on the Schmidt-EKF [13], which makes the covariance update less expensive ($\mathcal{O}(n)$). However, [10] still stores the whole map in the state vector, which makes the state dimension quite high and requires large memory ($\mathcal{O}(n^{2})$) to record the information matrices. Some works [6, 11, 12, 16] instead model the map as a set of keyframes. Maplab [11] employs ROVIO [14, 15] for local positioning and combines the technique in [9] to obtain the global state. Its map is a pose graph consisting of keyframes and landmarks (memory consumption $\mathcal{O}(m+n)$). To improve computational efficiency, this framework does not maintain the map information in the state (the computational complexity is constant), so the correlation between the current state and the map information cannot be maintained, and the system suffers from inconsistency. Besides, this system ignores the uncertainty of the map landmarks and uses them in the observation function as fixed values, which again makes the system inconsistent. VINS-Fusion [5, 6, 17, 18] builds the map in a similar way to Maplab [11], but omits the landmark information (memory consumption $\mathcal{O}(m)$). Its global observation function is constructed from current local 3D landmarks and matched 2D features in the map. Nevertheless, this framework cannot take the covariance of the map information into consideration, which results in inconsistency. Besides, in VINS-Fusion, local 3D landmarks are estimated online from the 2D features extracted in query images (not from the pre-built map), which couples them with the local odometry (or state estimator). To guarantee that the state estimator runs in real time, the algorithm for extracting 2D keypoints and computing descriptors needs to be lightweight (e.g., FAST [19] + BRIEF [20]). These kinds of features, however, are not robust for matching against map features. Furthermore, the nonlinear optimization strategy employed in this framework is also time-consuming.

In this paper, we propose a filter-based global localization framework that balances accuracy, consistency, computation speed, and memory space. Specifically, the pre-built map is made up of $m$ keyframes (pose and covariance) and $n$ landmarks (position). As $m$ is much smaller than $n$, the storage of this scheme is acceptable ($\mathcal{O}(m^{2}+n)$). We then maintain the poses of the map keyframes in the state vector, so that the stored covariance can be taken into consideration to keep the system consistent. To avoid the huge computation of updating the whole map information, the Schmidt-EKF is utilized to keep the computational complexity at $\mathcal{O}(m)$. Furthermore, we also deal with a common problem in large-scale long-term navigation, namely that the same scene can look very different in the current images and the map images: our global observation function is based on matching 3D landmarks in the pre-built map with 2D features in the current query image, which is different from VINS-Fusion. In this way, the matching procedure is decoupled from online feature tracking, so the real-time requirement is relaxed and more robust features such as R2D2 [21] can be used to match the map and the query images. Therefore, the long-term localization ability is improved dramatically. Moreover, to avoid regarding landmarks as constants, the landmark positions from the map are used as linearization points of the observation function, and the left null space projection is employed to handle the related Jacobians. Finally, in the case of a long-time lack of global matching information, the map landmarks are also used in our proposed re-linearization mechanism to improve the quality of the linearization points and therefore the accuracy of the localization. We call our method consistent Schmidt-MSCKF localization (CS-MSCKF).

II System State and Observation Function

As shown in Fig. 1, the whole system consists of three parts: feature matching against a given map, local positioning, and global positioning.

Figure 1: The overview of the proposed global localization framework

II-A Local Positioning Based on MSCKF

Local positioning in our work is a standard MSCKF [3], and this part follows the implementation of Open-VINS [4]. As shown in Fig. 2, local positioning estimates the current state of the robot in the local reference system $L$. Without considering the map information, we maintain the following state vector at time step $k$:

\mathbf{x}_{k}=\left[\mathbf{x}_{I}^{\top}\quad\mathbf{x}_{C}^{\top}\right]^{\top}, \qquad (1)

where $\mathbf{x}_{I}$ is the current state of the robot and $I$ refers to the IMU reference frame. Here, we assume the body reference system is identical to the IMU reference system. We define $\mathbf{x}_{I}$ as

\mathbf{x}_{I}=\left[{}^{I_{k}}\mathbf{q}_{L}^{\top}\quad{}^{L}\mathbf{p}_{I_{k}}^{\top}\quad{}^{L}\mathbf{v}_{I_{k}}^{\top}\quad\mathbf{b}_{g}^{\top}\quad\mathbf{b}_{a}^{\top}\right]^{\top}, \qquad (2)

where $^{I_{k}}\mathbf{q}_{L}$ is a unit quaternion in the JPL convention [24]. Its corresponding rotation matrix $^{I_{k}}\mathbf{R}_{L}$ rotates a 3D vector from the local coordinate system $L$ into the IMU coordinate system $I$. $^{L}\mathbf{p}_{I_{k}}$ is the position of the body in the local coordinate system at time step $k$. $^{L}\mathbf{v}_{I_{k}}$ is the velocity of the body in the local coordinate system. $\mathbf{b}_{g}$ and $\mathbf{b}_{a}$ are the gyroscope and accelerometer biases.

\mathbf{x}_{C}=\left[{}^{I_{k-1}}\mathbf{q}_{L}^{\top}\quad{}^{L}\mathbf{p}_{I_{k-1}}^{\top}\quad\dots\quad{}^{I_{k-N}}\mathbf{q}_{L}^{\top}\quad{}^{L}\mathbf{p}_{I_{k-N}}^{\top}\right]^{\top}, \qquad (3)

which consists of the clones of the $N$ newest historical poses. $\mathbf{x}_{C}$ is the so-called sliding window.

II-A1 State Propagation

When we receive IMU data at time step $k$, we can employ the IMU kinematics equations to predict the body state at time step $k+1$:

\mathbf{x}_{k+1|k}=\mathbf{f}(\mathbf{x}_{k},\,\mathbf{a}_{m_{k}}-\mathbf{n}_{a_{k}},\,\boldsymbol{\omega}_{m_{k}}-\mathbf{n}_{\omega_{k}}), \qquad (4)

where $\mathbf{f}$ represents the IMU kinematics equations, and $\mathbf{a}_{m_{k}}$ and $\boldsymbol{\omega}_{m_{k}}$ are the measurements (linear acceleration and angular velocity) from the IMU. $\mathbf{n}_{a_{k}}$ and $\mathbf{n}_{\omega_{k}}$ are the corresponding zero-mean Gaussian white noises of the IMU measurements. With this function, we can propagate the state covariance by

\mathbf{P}_{k+1|k}=\mathbf{\Phi}_{k}\mathbf{P}_{k|k}\mathbf{\Phi}_{k}^{\top}+\mathbf{G}_{k}\mathbf{Q}_{k|k}\mathbf{G}_{k}^{\top}, \qquad (5)

where $\mathbf{P}_{k|k}$ and $\mathbf{Q}_{k|k}$ are the system state covariance and the noise covariance respectively. $\mathbf{\Phi}_{k}$ and $\mathbf{G}_{k}$ are the Jacobians of (4) with respect to the system state and the noise respectively. For the detailed derivation, please refer to [3, 4].
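As a concrete illustration of (5), the following minimal sketch (not the authors' implementation; the function name and the 15-dimensional IMU error-state size are our assumptions) propagates the covariance of an MSCKF-style state whose cloned poses are static:

```python
import numpy as np

def propagate_covariance(P, Phi_I, G_I, Q, d_imu=15):
    """Sketch of MSCKF covariance propagation, eq. (5).

    Only the IMU block of the state evolves; the cloned poses (and, later,
    the map keyframes) are static, so the full-state transition matrix is
    block-diagonal: Phi = diag(Phi_I, I), and noise enters only via the IMU.
    """
    d = P.shape[0]
    Phi = np.eye(d)
    Phi[:d_imu, :d_imu] = Phi_I            # IMU error-state transition block
    G = np.zeros((d, G_I.shape[1]))
    G[:d_imu, :] = G_I                     # noise Jacobian of the IMU block
    return Phi @ P @ Phi.T + G @ Q @ G.T   # eq. (5)
```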

II-A2 Observation Function

In the MSCKF framework, before fusing feature observations through the observation function, we obtain a list of features $\{f\}$ and the related cloned poses $\{\mathbf{T}_{C}\}$. For each feature $f_{i}$ observed by $\mathbf{T}_{C_{j}}$, we have the following observation function:

\mathbf{z}_{f_{i}}=\mathbf{h}\left(\mathbf{T}_{C_{j}},\mathbf{f}_{i}\right)+\mathbf{n}_{f_{i}}, \qquad (6)

where $\mathbf{z}_{f_{i}}$ is the 2D observation of $f_{i}$ in the image and $\mathbf{f}_{i}$ is the 3D coordinate of $f_{i}$. Applying a first-order Taylor expansion to the above observation function at the current estimated state, we get the following functions:

𝐳fi𝐡(𝐱^k,𝐟^i)+𝐇𝐱^k𝐱~k+𝐇𝐟^i𝐟~i+𝐧fi,\mathbf{z}_{f_{i}}\approx\mathbf{h}\left(\hat{\mathbf{x}}_{k},\hat{\mathbf{f}}_{i}\right)+\mathbf{H}_{\hat{\mathbf{x}}_{k}}\tilde{\mathbf{x}}_{k}+\mathbf{H}_{\hat{\mathbf{f}}_{i}}\tilde{\mathbf{f}}_{i}+\mathbf{n}_{f_{i}}, (7)
𝐫fi:=𝐳~fi=𝐇𝐱^k𝐱~k+𝐇𝐟^i𝐟~i+𝐧fi,\mathbf{r}_{f_{i}}:=\tilde{\mathbf{z}}_{f_{i}}=\mathbf{H}_{\hat{\mathbf{x}}_{k}}\tilde{\mathbf{x}}_{k}+\mathbf{H}_{\hat{\mathbf{f}}_{i}}\tilde{\mathbf{f}}_{i}+\mathbf{n}_{f_{i}}, (8)

where $\hat{\cdot}$ represents estimated values and $\tilde{\cdot}$ represents the errors between true values and estimated values. $\mathbf{H}_{\hat{\mathbf{x}}_{k}}$ and $\mathbf{H}_{\hat{\mathbf{f}}_{i}}$ are the Jacobians of $\mathbf{h}$ with respect to the estimated state and the observed feature respectively. As we do not maintain features in the state vector, $\tilde{\mathbf{f}}_{i}$ is marginalized by projecting $\mathbf{r}_{f_{i}}$ onto the left null space $\mathbf{N}$ of $\mathbf{H}_{\hat{\mathbf{f}}_{i}}$:

𝐍𝐫fi=𝐍𝐇𝐱^k𝐱~k+𝐍𝐇𝐟^i𝐟~i+𝐍𝐧fi,\mathbf{N}^{\top}\mathbf{r}_{f_{i}}=\mathbf{N}^{\top}\mathbf{H}_{\hat{\mathbf{x}}_{k}}\tilde{\mathbf{x}}_{k}+\mathbf{N}^{\top}\mathbf{H}_{\hat{\mathbf{f}}_{i}}\tilde{\mathbf{f}}_{i}+\mathbf{N}^{\top}\mathbf{n}_{f_{i}}, (9)
𝐫fi=𝐇𝐱^k𝐱~k+𝐧fi,\mathbf{r}^{\prime}_{f_{i}}=\mathbf{H}^{\prime}_{\hat{\mathbf{x}}_{k}}\tilde{\mathbf{x}}_{k}+\mathbf{n}^{\prime}_{f_{i}}, (10)

where $\mathbf{r}^{\prime}_{f_{i}}=\mathbf{N}^{\top}\mathbf{r}_{f_{i}}$, $\mathbf{H}^{\prime}_{\hat{\mathbf{x}}_{k}}=\mathbf{N}^{\top}\mathbf{H}_{\hat{\mathbf{x}}_{k}}$, and $\mathbf{n}^{\prime}_{f_{i}}=\mathbf{N}^{\top}\mathbf{n}_{f_{i}}$ (note that $\mathbf{N}^{\top}\mathbf{H}_{\hat{\mathbf{f}}_{i}}=\mathbf{0}$ by construction). We then use $\mathbf{H}^{\prime}_{\hat{\mathbf{x}}_{k}}$ and $\mathbf{r}^{\prime}_{f_{i}}$ to perform the state update.
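A minimal sketch of this left null-space projection, assuming the stacked feature Jacobian has full column rank (the function name is ours, not from the paper):

```python
import numpy as np

def nullspace_project(r, H_x, H_f):
    """Marginalize the feature error per eqs. (9)-(10).

    H_f is the stacked Jacobian w.r.t. the 3D feature (more rows than
    columns). The last columns of the full Q factor span the left null
    space of H_f, so N^T H_f = 0; since N has orthonormal columns, an
    isotropic measurement noise stays isotropic after the projection.
    """
    Q, _ = np.linalg.qr(H_f, mode="complete")
    N = Q[:, H_f.shape[1]:]          # assumes H_f has full column rank
    return N.T @ r, N.T @ H_x        # r' = N^T r,  H'_x = N^T H_x
```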

II-B Consistent Global Positioning

Figure 2: The global observation function

II-B1 Initialization

When there is a match between the current image ($\mathbf{C}_{k}$ in Fig. 2) and a keyframe in the pre-built map ($\mathbf{kf}_{1}$ in Fig. 2), we obtain many corresponding 3D-2D feature matches (the red dot and the black dot in Fig. 2). We represent the 3D features in the coordinate system of the matched keyframe (the anchored keyframe), and through EPnP [23] we obtain the relative pose $^{\mathbf{kf}_{1}}\mathbf{T}_{\mathbf{C}_{k}}$ between the keyframe and the current frame. This value, combined with the matched keyframe pose $^{G}\mathbf{T}_{\mathbf{kf}_{1}}$ and the current frame pose $^{L}\mathbf{T}_{\mathbf{C}_{k}}$, is used to initialize $^{G}\mathbf{T}_{L}$. After initialization, $^{G}\mathbf{T}_{L}\triangleq\mathbf{x}_{t}$ is added into the state vector:

\mathbf{x}_{k}=\left[\mathbf{x}_{I}^{\top}\quad\mathbf{x}_{C}^{\top}\quad\mathbf{x}_{t}^{\top}\right]^{\top}. \qquad (11)
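The initialization amounts to chaining homogeneous transforms; a minimal sketch under the assumption of $4\times4$ matrices (the EPnP solver itself is not shown):

```python
import numpy as np

def initialize_G_T_L(G_T_kf, kf_T_Ck, L_T_Ck):
    """Initialize x_t = G_T_L from a map match.

    G_T_kf:  pose of the matched (anchored) keyframe in the map frame G
    kf_T_Ck: relative pose of the current camera, from EPnP
    L_T_Ck:  current camera pose from the local odometry
    """
    G_T_Ck = G_T_kf @ kf_T_Ck              # current camera expressed in G
    return G_T_Ck @ np.linalg.inv(L_T_Ck)  # G_T_L = G_T_Ck * (L_T_Ck)^{-1}
```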

Besides, the covariance of the state is also augmented. As the initial estimate of $\mathbf{x}_{t}$ may be inaccurate, its covariance can be assigned a large value. Note that every time we obtain matches from the map, the related keyframe poses are added into the state vector:

\mathbf{x}_{k}=\left[\mathbf{x}_{I}^{\top}\quad\mathbf{x}_{C}^{\top}\quad\mathbf{x}_{t}^{\top}\quad\mathbf{x}_{KF}^{\top}\right]^{\top}, \qquad (12)
\mathbf{x}_{KF}=\left[{}^{G}\mathbf{q}_{kf_{1}}^{\top}\quad{}^{G}\mathbf{p}_{kf_{1}}^{\top}\quad\dots\quad{}^{G}\mathbf{q}_{kf_{M}}^{\top}\quad{}^{G}\mathbf{p}_{kf_{M}}^{\top}\right]^{\top}. \qquad (13)

Also, the covariance of the keyframes needs to be considered. For simplicity, we divide the state into two parts: the active part $\mathbf{x}_{A}=\left[\mathbf{x}_{I}^{\top}\quad\mathbf{x}_{C}^{\top}\quad\mathbf{x}_{t}^{\top}\right]^{\top}$, and the nuisance part $\mathbf{x}_{N}=\mathbf{x}_{KF}$, so that

\mathbf{x}_{k}=\left[\mathbf{x}_{A_{k}}^{\top}\quad\mathbf{x}_{N_{k}}^{\top}\right]^{\top}. \qquad (14)

This partition is the basis of the Schmidt-EKF update introduced in Sec. III-A.

II-B2 Observation Function

As shown in Fig. 2, suppose the current image matches multiple keyframes (for example, $\mathbf{kf}_{1}$ and $\mathbf{kf}_{2}$). For each 3D landmark (for example, $^{\mathbf{kf}_{1}}\mathbf{p}_{f_{1}}$), we can project it from the anchored keyframe ($\mathbf{kf}_{1}$) coordinate system into the 2D current image coordinate system by a nonlinear observation function $^{1}\mathbf{g}$:

{}^{1}\mathbf{z}_{f}={}^{1}\mathbf{g}({}^{G}\mathbf{T}_{\mathbf{kf}_{1}},{}^{G}\mathbf{T}_{L},{}^{L}\mathbf{T}_{\mathbf{C}_{k}},{}^{\mathbf{kf}_{1}}\mathbf{p}_{f_{1}})+{}^{1}\mathbf{n}_{f}, \qquad (15)

where $^{1}\mathbf{z}_{f}$ is the 2D pixel coordinate in the current image. Linearizing the above function at the estimated values, just like (7), we get

{}^{1}\mathbf{r}_{f}={}^{1}\mathbf{H}_{\hat{\mathbf{x}}_{A_{k}}}\tilde{\mathbf{x}}_{A_{k}}+{}^{1}\mathbf{H}_{\hat{\mathbf{x}}_{N_{k}}}\tilde{\mathbf{x}}_{N_{k}}+{}^{1}\mathbf{H}_{{}^{\mathbf{kf}_{1}}\hat{\mathbf{p}}_{f_{1}}}{}^{\mathbf{kf}_{1}}\tilde{\mathbf{p}}_{f_{1}}+{}^{1}\mathbf{n}_{f}. \qquad (16)

The subscripts $A$ and $N$ denote the active part and the nuisance part of the state vector respectively (c.f. (14)).

Besides, the 3D landmark $^{\mathbf{kf}_{1}}\mathbf{p}_{f_{1}}$ can also be directly reprojected into the anchored keyframe $\mathbf{kf}_{1}$ by a projection function $^{2}\mathbf{g}$:

{}^{2}\mathbf{z}_{f}={}^{2}\mathbf{g}({}^{\mathbf{kf}_{1}}\mathbf{p}_{f_{1}})+{}^{2}\mathbf{n}_{f}, \qquad (17)
{}^{2}\mathbf{r}_{f}={}^{2}\mathbf{H}_{{}^{\mathbf{kf}_{1}}\hat{\mathbf{p}}_{f_{1}}}{}^{\mathbf{kf}_{1}}\tilde{\mathbf{p}}_{f_{1}}+{}^{2}\mathbf{n}_{f}. \qquad (18)

Furthermore, if the 3D landmark $^{\mathbf{kf}_{1}}\mathbf{p}_{f_{1}}$ is observed by another matched keyframe, such as $\mathbf{kf}_{2}$ in Fig. 2, there is another reprojection function $^{3}\mathbf{g}$ that reprojects $^{\mathbf{kf}_{1}}\mathbf{p}_{f_{1}}$ onto the $\mathbf{kf}_{2}$ image plane:

{}^{3}\mathbf{z}_{f}={}^{3}\mathbf{g}({}^{G}\mathbf{T}_{\mathbf{kf}_{1}},{}^{G}\mathbf{T}_{\mathbf{kf}_{2}},{}^{\mathbf{kf}_{1}}\mathbf{p}_{f_{1}})+{}^{3}\mathbf{n}_{f}, \qquad (19)
{}^{3}\mathbf{r}_{f}={}^{3}\mathbf{H}_{\hat{\mathbf{x}}_{N_{k}}}\tilde{\mathbf{x}}_{N_{k}}+{}^{3}\mathbf{H}_{{}^{\mathbf{kf}_{1}}\hat{\mathbf{p}}_{f_{1}}}{}^{\mathbf{kf}_{1}}\tilde{\mathbf{p}}_{f_{1}}+{}^{3}\mathbf{n}_{f}. \qquad (20)

By stacking (16), (18) and (20), we get the 6-dimensional observation for the 3D landmark $\mathbf{p}_{f_{1}}$.

In summary, whenever the current image matches a single keyframe or multiple keyframes, for each 3D landmark $^{\mathbf{kf}_{j}}\mathbf{p}_{f_{i}}$ anchored in keyframe $\mathbf{kf}_{j}$, we obtain an observation function of the following form:

\mathbf{r}_{f_{i}}=\mathbf{H}_{\hat{\mathbf{x}}_{A_{k}}}\tilde{\mathbf{x}}_{A_{k}}+\mathbf{H}_{\hat{\mathbf{x}}_{N_{k}}}\tilde{\mathbf{x}}_{N_{k}}+\mathbf{H}_{{}^{\mathbf{kf}_{j}}\hat{\mathbf{p}}_{f_{i}}}{}^{\mathbf{kf}_{j}}\tilde{\mathbf{p}}_{f_{i}}+\mathbf{n}_{f_{i}}. \qquad (21)

It is important to note that since we regard the map landmarks as variables with uncertainties, the Jacobian matrix $\mathbf{H}_{{}^{\mathbf{kf}_{j}}\hat{\mathbf{p}}_{f_{i}}}$ needs to be computed. Otherwise, the landmarks would be treated as constants, as in [11], which would make the system inconsistent. However, different from [10], we do not maintain the map landmarks in the state vector, so we need to marginalize $^{\mathbf{kf}_{j}}\mathbf{p}_{f_{i}}$. Each landmark has at least two measurements (in the current image, and in the matched keyframe(s)), so the number of rows of $\mathbf{H}_{{}^{\mathbf{kf}_{j}}\hat{\mathbf{p}}_{f_{i}}}$ exceeds the dimension of the landmark. This guarantees that we can project $\mathbf{r}_{f_{i}}$ onto the left null space of $\mathbf{H}_{{}^{\mathbf{kf}_{j}}\hat{\mathbf{p}}_{f_{i}}}$, as in (9). In this way, the terms related to the landmarks are eliminated while their uncertainties are still taken into consideration:

\mathbf{r}_{f_{i}}^{*}=\mathbf{H}^{*}_{\hat{\mathbf{x}}_{A_{k}}}\tilde{\mathbf{x}}_{A_{k}}+\mathbf{H}^{*}_{\hat{\mathbf{x}}_{N_{k}}}\tilde{\mathbf{x}}_{N_{k}}+\mathbf{n}_{f}^{*}\triangleq\mathbf{H}^{*}_{\hat{\mathbf{x}}_{k}}\tilde{\mathbf{x}}_{k}+\mathbf{n}_{f}^{*}. \qquad (22)

Note that (22) only considers one landmark. We need to stack (22) for all matched landmarks to get the final observation function:

\mathbf{r}^{*}_{k}=\mathbf{H}_{k}^{*}\tilde{\mathbf{x}}_{k}+\mathbf{n}^{*}, \qquad (23)

where $\mathbf{r}^{*}_{k}$, $\mathbf{H}_{k}^{*}$ and $\mathbf{n}^{*}$ stack the individual $\mathbf{r}_{f_{i}}^{*}$, $\mathbf{H}^{*}_{\hat{\mathbf{x}}_{k}}$ and $\mathbf{n}_{f}^{*}$. (23) is used for the state update following the Schmidt-EKF, as stated in the next section.
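To make the construction of (23) concrete, here is a minimal sketch (assuming the hypothetical nullspace_project helper sketched in Sec. II-A2) that marginalizes each landmark per (22) and stacks the results:

```python
import numpy as np

def build_global_measurement(landmark_obs):
    """Assemble r*_k and H*_k of eq. (23).

    landmark_obs: list of (r, H_x, H_p) per matched landmark, where H_p is
    the Jacobian w.r.t. the anchored landmark (eq. (21)); H_p is assumed to
    have more rows than columns (the landmark is seen in >= 2 views).
    """
    rs, Hs = [], []
    for r, H_x, H_p in landmark_obs:
        r_star, H_star = nullspace_project(r, H_x, H_p)  # eq. (22)
        rs.append(r_star)
        Hs.append(H_star)
    return np.concatenate(rs), np.vstack(Hs)             # eq. (23)
```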

III State Update with Map Information

In this section, we demonstrate how to update the state with the map information efficiently while maintaining consistency. We also point out the necessity of re-linearizing the observation function.

III-A Global Update by Schmidt-EKF

When there are matches between the current query frame and the map keyframes, the global observations are used to perform the global update via the Schmidt-EKF. As indicated by (14), the state vector can be divided into the active part and the nuisance part. Therefore, (23) can be written as:

\mathbf{r}^{*}_{k}=\mathbf{H}_{k}^{*}\tilde{\mathbf{x}}_{k}+\mathbf{n}^{*}=\left[\mathbf{H}^{*}_{A_{k}}\quad\mathbf{H}^{*}_{N_{k}}\right]\left[\begin{array}{c}\tilde{\mathbf{x}}_{A_{k}}\\ \tilde{\mathbf{x}}_{N_{k}}\end{array}\right]+\mathbf{n}^{*}, \qquad (24)

where $\mathbf{H}^{*}_{N_{k}}$ contains the Jacobians with respect to the poses of the matched map keyframes, and $\mathbf{H}^{*}_{A_{k}}$ contains the Jacobians with respect to the current state $\mathbf{x}_{I}$ and the relative transformation $\mathbf{x}_{t}$. Based on the expression (24), we analyze the limitation of the standard EKF and then introduce the Schmidt-EKF update.

III-A1 Limitation of Standard EKF

The state and the covariance update equations for EKF are as follows:

\hat{\mathbf{x}}_{k}=\hat{\mathbf{x}}_{k|k-1}+\mathbf{K}\mathbf{r}^{*}_{k}, \qquad (25)
\mathbf{P}_{k}=\mathbf{P}_{k|k-1}-\mathbf{K}\mathbf{H}^{*}_{k}\mathbf{P}_{k|k-1}, \qquad (26)

where $\hat{\mathbf{x}}_{k|k-1}$ and $\mathbf{P}_{k|k-1}$ are the output of the propagation step (c.f. (4), (5)), and

\mathbf{S}=\mathbf{H}^{*}_{k}\mathbf{P}_{k|k-1}\mathbf{H}^{*\top}_{k}+\mathbf{R}, \qquad (27)
\mathbf{K}=\mathbf{P}_{k|k-1}\mathbf{H}^{*\top}_{k}\mathbf{S}^{-1}. \qquad (28)

$\mathbf{R}$ is the covariance of $\mathbf{n}^{*}$. Expanding (28),

\mathbf{K}=\left[\begin{array}{c}\mathbf{K}_{A}\\ \mathbf{K}_{N}\end{array}\right]=\left[\begin{array}{c}\mathbf{P}_{AA_{k|k-1}}\mathbf{H}_{A_{k}}^{*\top}+\mathbf{P}_{AN_{k|k-1}}\mathbf{H}_{N_{k}}^{*\top}\\ \mathbf{P}_{NA_{k|k-1}}\mathbf{H}_{A_{k}}^{*\top}+\mathbf{P}_{NN_{k|k-1}}\mathbf{H}_{N_{k}}^{*\top}\end{array}\right]\mathbf{S}_{k}^{-1}=\left[\begin{array}{c}\bar{\mathbf{K}}_{A}\\ \bar{\mathbf{K}}_{N}\end{array}\right]\mathbf{S}_{k}^{-1}, \qquad (29)

and substituting (29) into (26), we could get:

\mathbf{P}_{k}=\mathbf{P}_{k|k-1}-\left[\begin{array}{cc}\mathbf{K}_{A}\mathbf{S}\mathbf{K}_{A}^{\top}&\mathbf{K}_{A}\mathbf{H}_{k}^{*}\left[\begin{array}{c}\mathbf{P}_{AN_{k|k-1}}\\ \mathbf{P}_{NN_{k|k-1}}\end{array}\right]\\ \left[\begin{array}{c}\mathbf{P}_{AN_{k|k-1}}\\ \mathbf{P}_{NN_{k|k-1}}\end{array}\right]^{\top}\mathbf{H}_{k}^{*\top}\mathbf{K}_{A}^{\top}&\mathbf{K}_{N}\mathbf{S}\mathbf{K}_{N}^{\top}\end{array}\right]. \qquad (30)

From (30), we can see that in the standard EKF update, the computation of the covariance update is dominated by $\mathbf{K}_{N}\mathbf{S}\mathbf{K}_{N}^{\top}$, which is $\mathcal{O}(n^{2})$ (supposing the dimension of $\mathbf{x}_{N}$ is $n$). As map information is added continuously while the robot travels, the size of the nuisance part $\mathbf{x}_{N}$ grows over time. Especially for a large scene, the dimension of the nuisance part can easily reach thousands. In this situation, a filter-based system would suffer from a huge amount of computation and might not run in real time. Thanks to the Schmidt-EKF, this problem can be solved elegantly.

III-A2 Update by Schmidt-EKF

As analyzed before, computing $\mathbf{K}_{N}\mathbf{S}\mathbf{K}_{N}^{\top}$ is time-consuming. Therefore, in the Schmidt-EKF, we simply drop this part when updating the covariance. More specifically, we set $\mathbf{K}_{N}=\mathbf{0}$, so that (30) can be rewritten as:

\mathbf{P}_{k}=\mathbf{P}_{k|k-1}-\left[\begin{array}{cc}\mathbf{K}_{A}\mathbf{S}\mathbf{K}_{A}^{\top}&\mathbf{K}_{A}\mathbf{H}_{k}^{*}\left[\begin{array}{c}\mathbf{P}_{AN_{k|k-1}}\\ \mathbf{P}_{NN_{k|k-1}}\end{array}\right]\\ \left[\begin{array}{c}\mathbf{P}_{AN_{k|k-1}}\\ \mathbf{P}_{NN_{k|k-1}}\end{array}\right]^{\top}\mathbf{H}_{k}^{*\top}\mathbf{K}_{A}^{\top}&\mathbf{0}\end{array}\right], \qquad (31)

and (25) would be divided into two parts:

\hat{\mathbf{x}}_{A_{k}}=\hat{\mathbf{x}}_{A_{k|k-1}}+\mathbf{K}_{A}\mathbf{r}_{k}^{*},\qquad\hat{\mathbf{x}}_{N_{k}}=\hat{\mathbf{x}}_{N_{k|k-1}}. \qquad (32)

It can be seen that the active part update is identical to the standard EKF, while the nuisance part is not updated. Comparing (30) and (31), it is easy to see that:

\mathbf{P}_{SKF}=\mathbf{P}_{EKF}+\left[\begin{array}{c}\mathbf{0}\\ \bar{\mathbf{K}}_{N}\end{array}\right]\mathbf{S}^{-1}\left[\begin{array}{cc}\mathbf{0}&\bar{\mathbf{K}}_{N}^{\top}\end{array}\right], \qquad (33)

where $\mathbf{P}_{SKF}$ and $\mathbf{P}_{EKF}$ are the updated covariances of the Schmidt-EKF and the standard EKF respectively. As $\mathbf{S}^{-1}$ is positive definite, $\left[\begin{array}{c}\mathbf{0}\\ \bar{\mathbf{K}}_{N}\end{array}\right]\mathbf{S}^{-1}\left[\begin{array}{cc}\mathbf{0}&\bar{\mathbf{K}}_{N}^{\top}\end{array}\right]$ must be positive semi-definite. Therefore, $\mathbf{P}_{SKF}-\mathbf{P}_{EKF}\geq\mathbf{0}$, i.e., the Schmidt-EKF is never overconfident about the estimated state and thus maintains the system consistency. Besides, different from [9], the cross-covariance between the active state and the map information is considered (c.f. (31)), whose computational complexity is $\mathcal{O}(n)$. Therefore, for the state update with global information, we just need to linearize the global observation function into the form of (24), and substitute (24) into (31)-(32).
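A minimal sketch of the update (31)-(32), assuming the covariance is stored with the active block first (the variable names are ours):

```python
import numpy as np

def schmidt_update(x_A, P, r, H_A, H_N, R, dA):
    """Schmidt-EKF update per eqs. (31)-(32): the nuisance gain K_N is
    dropped, so only the active block and the active/nuisance cross block
    of P change, keeping the cost linear in the nuisance dimension."""
    P_AA, P_AN, P_NN = P[:dA, :dA], P[:dA, dA:], P[dA:, dA:]
    H = np.hstack([H_A, H_N])
    S = H @ P @ H.T + R                                      # eq. (27)
    K_A = (P_AA @ H_A.T + P_AN @ H_N.T) @ np.linalg.inv(S)   # eq. (29), active rows
    x_A = x_A + K_A @ r                                      # eq. (32); x_N frozen
    P_new = P.copy()
    P_new[:dA, :dA] = P_AA - K_A @ S @ K_A.T                 # eq. (31)
    P_new[:dA, dA:] = P_AN - K_A @ (H_A @ P_AN + H_N @ P_NN)
    P_new[dA:, :dA] = P_new[:dA, dA:].T
    return x_A, P_new
```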

III-B Re-linearization

In the update step, as our observation function is nonlinear, we need to linearize it at the newest estimated point:

\mathbf{z}=\mathbf{h}(\mathbf{x})+\mathbf{n}=\mathbf{h}(\hat{\mathbf{x}})+\mathbf{H}_{\hat{\mathbf{x}}}(\mathbf{x}-\hat{\mathbf{x}})+o(\|\tilde{\mathbf{x}}\|^{2})+\mathbf{n}\approx\mathbf{h}(\hat{\mathbf{x}})+\mathbf{H}_{\hat{\mathbf{x}}}(\mathbf{x}-\hat{\mathbf{x}})+\mathbf{n}, \qquad (34)

where $\mathbf{H}_{\hat{\mathbf{x}}}$ is the Jacobian of the observation function with respect to $\mathbf{x}$ at the estimated point $\hat{\mathbf{x}}$. This approximation is only effective when the estimated value (the linearization point) is close to the true value. When the error between the estimated value and the true value is large, the precision of this approximation is not satisfactory.

Revisiting our global observation function (15), the landmark in the matched keyframe reference system is reprojected into the current image plane through a big "loop" (c.f. Fig. 2): the observation of the landmark $^{\mathbf{kf}_{1}}\mathbf{p}_{f_{1}}$ in the current image plane (the black dot in Fig. 2) is derived from $^{G}\mathbf{T}_{\mathbf{kf}_{1}}$, $^{G}\mathbf{T}_{L}$ and $^{L}\mathbf{T}_{\mathbf{C}_{k}}$. The errors of these variables therefore propagate to the final predicted measurement, and can be very large. In particular, if matching information is lacking for a long time, i.e., the system has no global information to constrain the odometry, then when the next global observation occurs, the drift of the odometry and the estimation error of $^{G}\mathbf{T}_{L}$ can be large. If we still linearize the observation function directly, the precision of the first-order approximation will be poor.

A simple idea to reduce the first-order approximation error is to perform the Taylor expansion at a more accurate point. Specifically, when we get a match between a map keyframe and the current image, we can compute a relatively accurate pose $^{\mathbf{kf}}\mathbf{T}_{\mathbf{C}_{k}}$ by solving EPnP, as mentioned in Sec. II-B. Based on this more accurate $^{\mathbf{kf}}\mathbf{T}_{\mathbf{C}_{k}}$, we can recompute $^{G}\mathbf{T}_{L}$ or $^{L}\mathbf{T}_{\mathbf{C}_{k}}$. The recomputed value is then more consistent with the observation function (15). We use this new value as the linearization point to compute the Jacobians of the observation function.

In our implementation, when the average reprojection error of the matched features exceeds a threshold, the re-linearization mechanism is triggered. We recompute the linearization point of $^{G}\mathbf{T}_{L}$ or $^{L}\mathbf{T}_{\mathbf{C}_{k}}$ from the EPnP result, and then $\mathbf{H}_{\hat{\mathbf{x}}}$ in (34) is computed at the recomputed linearization point.
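A minimal sketch of this trigger, reusing the transform composition from Sec. II-B1 (the 20-pixel default matches the threshold used in the experiments; all names are our assumptions):

```python
import numpy as np

def relinearization_point(avg_reproj_err, G_T_kf, kf_T_Ck_epnp,
                          L_T_Ck, G_T_L_est, threshold_px=20.0):
    """Return the linearization point for G_T_L (Sec. III-B).

    If the mean reprojection error of the matched features exceeds the
    threshold, replace the (possibly drifted) filter estimate of G_T_L
    by the value recomputed from the EPnP pose; the Jacobians H_x in
    (34) are then evaluated at this point."""
    if avg_reproj_err <= threshold_px:
        return G_T_L_est                       # filter estimate is good enough
    G_T_Ck = G_T_kf @ kf_T_Ck_epnp             # current camera in G via EPnP
    return G_T_Ck @ np.linalg.inv(L_T_Ck)      # recomputed G_T_L
```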

IV Experimental Results

In this section, we first present simulation experiments to demonstrate the effectiveness of our proposed framework, and then validate our approach on EuRoC [25], Kaist [26], and our own data set YQ. These three data sets cover UAV, car and UGV platforms. Images sampled from the three data sets are shown in Fig. 3.

Figure 3: Images sampled from the three data sets. First row: EuRoC [25] MH02-05; second row: Kaist [26] urban38-39; third row: YQ YQ1-4.

IV-A Simulation

IV-A1 Map Data Generation

We adapt the simulator of Open-VINS [4] to make it suitable for localization. The simulator needs to be fed a ground truth trajectory, from which it generates image features and IMU information. Here, we feed the ground truth trajectory of our own data set (denoted YQ3, see Fig. 4), which has a length of 1.282 km. The map keyframe poses come from another trajectory of our own data set, YQ1, which follows a similar path to YQ3. For the details of the YQ data set, please refer to Sec. IV-B3.

To test the consistency of our proposed system, we artificially add noise to the ground truth of YQ1 to simulate an imperfect map. Specifically, the position of the ground truth is perturbed with Gaussian white noise $\mathbf{n}_{p}\sim\mathcal{N}(\mathbf{0},0.01\mathbf{I}_{3\times3})$, and the orientation is perturbed with $\mathbf{n}_{o}\sim\mathcal{N}(\mathbf{0},0.00025\mathbf{I}_{3\times3})$. After the perturbation, the root-mean-squared error (RMSE) of the YQ1 trajectory is 0.179 m.

For the map matching information, we first randomly generate 3D landmarks. We then utilize the global observation function (15) to reproject the 3D landmarks into the map keyframes and the current frame, which yields the 2D-2D matching features, and add Gaussian white noise to the 2D features. Finally, for each 3D landmark, we utilize the noisy 2D features and the perturbed map keyframe poses to triangulate its 3D position (the estimated 3D landmark).
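A minimal linear (DLT) triangulation sketch of this last step, assuming each view is summarized by a $3\times4$ projection matrix built from its perturbed pose (the paper does not specify the triangulation method, so this is illustrative only):

```python
import numpy as np

def triangulate_dlt(uv_list, P_list):
    """Triangulate one 3D landmark from noisy pixel observations uv_list
    and the corresponding 3x4 projection matrices P_list (one per view)."""
    A = []
    for (u, v), P in zip(uv_list, P_list):
        A.append(u * P[2] - P[0])   # each view contributes two linear rows
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]                      # homogeneous solution: smallest singular vector
    return X[:3] / X[3]
```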

IV-A2 Results and Analysis

In our simulation, there are four settings: the original Open-VINS [4]; our method (CS-MSCKF) with a single matching frame (SM); our method (CS-MSCKF) with multiple matching frames (MM); and our method treating map poses and landmarks as constants (mapconst).

Fig. 4 shows the trajectories obtained in the different settings. The blue one is the trajectory from the Open-VINS odometry. We can see an apparent drift in the latter part of the trajectory, whereas our localization result (red, multiple matching frames) fits nicely with the ground truth (black). The green one is the result of the algorithm that treats the map as perfect. The reason why the green one fits badly with the ground truth is that our map information is not absolutely accurate, and if we do not consider the uncertainty of the map, the estimator relies heavily on the inaccurate map information, leading to a bad result.

Figure 4: The trajectory comparison of different methods. (a) trajectories on the x-y plane; (b) trajectories along the z axis vs. timestamps.

Table I lists the RMSE of the four settings. Note that the initial pose of the odometry is set to the initial pose of the ground truth, so the odometry trajectory is already in the global coordinate system; i.e., the initial poses of the odometry and the ground truth are naturally aligned, and the RMSEs between the estimated trajectories and the ground truth can be computed directly without alignment.
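For reference, under this no-alignment assumption the position RMSE used throughout the experiments reduces to the following computation (a sketch, not the authors' evaluation code):

```python
import numpy as np

def rmse(est_xyz, gt_xyz):
    """Position RMSE between an estimated trajectory and the ground truth,
    assuming both are Nx3 arrays already expressed in the same frame."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```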

TABLE I: The RMSE (m) of different methods
Open-VINS [4]: 3.250 | CS-MSCKF+SM: 0.399 | CS-MSCKF+MM: 0.375 | CS-MSCKF+mapconst: 4.062

From Table I, we find that CS-MSCKF+MM is better than CS-MSCKF+SM. This accords with our intuition, as multiple frames provide more information. Besides, as shown in Table I, when the map is treated as perfect (CS-MSCKF+mapconst), the performance of the algorithm is very bad, which demonstrates the necessity of taking the map covariance into consideration.

To demonstrate the consistency of our proposed method, we plot the estimation error with 3-$\sigma$ bounds (c.f. Fig. 5) for CS-MSCKF+MM ((a), (b)) and CS-MSCKF+mapconst ((c), (d)). From the figure, we conclude that our proposed algorithm has good consistency, while the algorithm that treats the map information as perfect has poor consistency.

Figure 5: The error and 3-$\sigma$ bounds for CS-MSCKF+MM/mapconst. (a) IMU position ($^{L}\mathbf{p}_{I_{k}}$) error with 3-$\sigma$ bounds for CS-MSCKF+MM; (b) translation of the relative transformation ($\mathbf{x}_{t}$) error with 3-$\sigma$ bounds for CS-MSCKF+MM; (c) IMU position ($^{L}\mathbf{p}_{I_{k}}$) error with 3-$\sigma$ bounds for CS-MSCKF+mapconst; (d) translation of the relative transformation ($\mathbf{x}_{t}$) error with 3-$\sigma$ bounds for CS-MSCKF+mapconst.

IV-B Experiments on Real World Data Sets

For the real-world data sets, the matching procedure is conducted as follows: we first utilize R2D2 [21] to extract features on the current query frame and match them with features in the map keyframes. When there are enough matches, the 3D-2D pairs (3D from map landmarks, 2D from current-frame features) are fed into the robust pose solver of [22], after which accurate and robust 3D-2D pairs are obtained.

IV-B1 EuRoC

In this part, the effectiveness of our proposed algorithm is validated on the popular EuRoC data set [25]. Specifically, we use the machine hall (MH) sequences. The sequence MH01 is used to build the map, and the other MH sequences (MH02-MH05) are used for localization. The map contains keyframes and 3D landmarks, where the 3D landmarks are triangulated and refined across multiple adjacent map images and the keyframe poses are given by the ground truth (we assume the uncertainties of the ground truth are 1 cm and $1^{\circ}$). We compare the two settings of our proposed method (CS-MSCKF+SM/MM) with the benchmarks Open-VINS [4] and VINS-Fusion [5, 6, 17, 18].

Open-VINS is a pure odometry, and the comparison with it demonstrates that global localization can constrain the odometry drift effectively. For VINS-Fusion, its global localization mode is used. Specifically, we first use its SLAM mode to generate a pose graph (map information) based on MH01, and we modified the poses of the pose-graph nodes to the ground truth to make the comparison fair. As localization needs to be performed in real time, i.e., the estimated pose at time step $k$ has to be computed at time step $k$, not as a result of later loop closure or global optimization, we record the global localization results from the combination of the odometry $^{L}\mathbf{T}_{k}$ (without loop closure or global optimization) and the estimated $^{G}\mathbf{T}_{L}$. The comparison of experimental results is plotted as a box-plot in Fig. 6, where the RMSE of Open-VINS is computed based on the alignment of the initial pose and the rest are computed without alignment. In Fig. 6, values along the $Y$-axis are projected into log space, and their specific values are also given. For the localization methods, the processing times are given in Fig. 7. All results are the average of three runs.

Figure 6: The RMSE (m) of different methods on EuRoC [25]
Figure 7: Time consumption (ms) of different methods on EuRoC [25]

From Fig. 6 and Fig. 7, we find that our proposed algorithms perform better than VINS-Fusion in terms of both accuracy and efficiency. This is because our framework takes the uncertainty of the map information into consideration. Besides, the effective and robust matching algorithm provides us with better matching information.

IV-B2 Kaist

To show the advantage of our proposed re-linearization mechanism, we conduct experiments on the Kaist data set [26]. This data set was recorded in urban environments where the majority of the scene can be taken up by other moving vehicles, so it is challenging. The data set contains two highly similar sequences, Urban38 and Urban39. We employ Urban38 to build the map, and the localization algorithms are tested on Urban39 (trajectory length 10.67 km). The covariance of each map keyframe pose is set to $0.1\mathbf{I}_{6\times6}$ and the re-linearization threshold is set to 20 pixels. The comparison of the different algorithms is given in Table II, and the trajectories derived from CS-MSCKF+SM and CS-MSCKF+SM+R are plotted in Fig. 8.

TABLE II: The RMSE (m) of different methods on Kaist [26]
Open-VINS [4]: 34.303 | VINS-Fusion [5, 6, 17, 18]: - | CS-MSCKF+SM: 25.541 | CS-MSCKF+SM+R: 6.927 | CS-MSCKF+MM: 6.726 | CS-MSCKF+MM+R: 5.812
Figure 8: The trajectories derived from CS-MSCKF+SM and CS-MSCKF+SM+R. Top: trajectories on the $x$-$y$ plane; bottom: trajectories along the $z$ axis vs. time stamp. Details of the green dotted circle area are also given. The red and blue markers (stars and diamonds) represent the times at which two adjacent matches occur (the matches at the red markers occur before those at the blue markers).

In this challenging scene, VINS-Fusion failed to perform global localization due to the significant drift of its odometry. From the results, we find that for the case of a single matching frame, there is a tremendous improvement in accuracy when we employ the re-linearization mechanism. Fig. 8 illustrates this phenomenon to some extent.

In Fig. 8, the pair of stars represents two contiguous matches (for example, the $i^{th}$ match in red and the $(i+1)^{th}$ match in blue), and so does the pair of diamonds. The motion direction of the car from the red mark to the blue mark is shown by the black arrows. We find that the distance between the two stars, or the two diamonds, is large, which means the accuracy of the localization relies heavily on the performance of the odometry. Unfortunately, the odometry inevitably drifts and the estimated values of the state variables are not accurate enough. These two sources of error make the first-order Taylor approximation of the observation function inaccurate, as analyzed in Sec. III-B. Therefore, we introduce a re-linearization mechanism to make the first-order Taylor approximation more accurate. Its effectiveness is shown in Fig. 8: for CS-MSCKF+SM, the trajectory around the blue star (c.f. the green dotted circle area) hardly changes when there is a match from the map, i.e., since the approximated observation function is not accurate, the single matching frame can hardly correct the trajectory. On the contrary, the trajectory from CS-MSCKF+SM+R shows an obvious sign of being corrected by the single-matching-frame information. Besides, as shown in the bottom part of Fig. 8, the estimated $z$ value is corrected distinctly with the help of the re-linearization mechanism, while the estimate from CS-MSCKF+SM changes little (refer to the area indicated by the red and blue dotted arrows).

From Table II, the algorithms with multiple matching frames also perform well, which accords with our intuition: more map information provides more constraints on the current frame, so the large drift can be corrected and bounded.

IV-B3 YQ

In this part we conduct experiments on our own data set (YQ). It contains four sequences, YQ1-YQ4, where YQ1-YQ3 were recorded on three separate days with different weather in summer and YQ4 was collected in winter after snowfall (c.f. Fig. 3). We use YQ1 to build the map, and YQ2-YQ4 to test the performance of the algorithms. The covariance of each map keyframe pose is set to $0.1\mathbf{I}_{6\times6}$ and the re-linearization threshold is set to 20 pixels. The comparison of the different algorithms is given in Table III. VINS-Fusion failed to perform continuous and consistent global localization because its odometry drifts heavily and its re-localization mechanism is triggered multiple times, so its results are not listed.

TABLE III: The RMSE (m) of different methods on YQ
Sequence         YQ2 (1.299 km)   YQ3 (1.282 km)   YQ4 (0.933 km)
Open-VINS [4]    17.066           26.233           16.613
CS-MSCKF+SM      8.489            9.151            5.007
CS-MSCKF+SM+R    3.653            3.371            4.914
CS-MSCKF+MM      4.865            4.779            5.252
CS-MSCKF+MM+R    2.869            2.306            4.591

From the results, we find that for the case of a single matching frame, introducing the re-linearization mechanism significantly improves the positioning accuracy. Combining multiple-frame matching information with the re-linearization mechanism gives the best result.

V Conclusion

In this paper, we propose a consistent filter-based global localization framework, in which a keyframe-based map is used to reduce storage. The Schmidt-EKF is employed to avoid the excessive computation of the traditional EKF when the dimension of the state vector is high; in this way, the computation grows linearly with the number of map keyframes. Moreover, we introduce a re-linearization mechanism to improve the accuracy of the first-order approximation of the observation function, and therefore the algorithm performance, especially in large scenes. Extensive experiments validate the effectiveness of our proposed framework.

References

  • [1] R. Mur-Artal and J. D. Tardós, ”ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras,” in IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255-1262, Oct. 2017, doi: 10.1109/TRO.2017.2705103.
  • [2] J. Engel, V. Koltun and D. Cremers, ”Direct Sparse Odometry,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611-625, 1 March 2018, doi: 10.1109/TPAMI.2017.2658577.
  • [3] A. I. Mourikis and S. I. Roumeliotis, ”A Multi-State Constraint Kalman Filter for Vision-aided Inertial Navigation,” Proceedings 2007 IEEE International Conference on Robotics and Automation, Roma, 2007, pp. 3565-3572, doi: 10.1109/ROBOT.2007.364024.
  • [4] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang and G. Huang, ”OpenVINS: A Research Platform for Visual-Inertial Estimation,” 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 2020, pp. 4666-4672, doi: 10.1109/ICRA40945.2020.9196524.
  • [5] T. Qin, P. Li and S. Shen, ”VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator,” in IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004-1020, Aug. 2018, doi: 10.1109/TRO.2018.2853729.
  • [6] T. Qin, S. Cao, J. Pan, and S. Shen, “A general optimization-based framework for global pose estimation with multiple sensors,” arXiv e-prints, 2019, retrieved on March 1st, 2020. [Online]. Available: https://arxiv.org/abs/1901.03642v1
  • [7] R. Mascaro, L. Teixeira, T. Hinzmann, R. Siegwart and M. Chli, ”GOMSF: Graph-Optimization Based Multi-Sensor Fusion for robust UAV Pose estimation,” 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, QLD, 2018, pp. 1421-1428, doi: 10.1109/ICRA.2018.8460193.
  • [8] W. Lee, K. Eckenhoff, P. Geneva and G. Huang, ”Intermittent GPS-aided VIO: Online Initialization and Calibration,” 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 2020, pp. 5724-5731, doi: 10.1109/ICRA40945.2020.9197029.
  • [9] S. Lynen, T. Sattler, M. Bosse, J. Hesch, and R. Siegwart, "Get Out of My Lab: Large-scale, Real-Time Visual-Inertial Localization," in Proc. of Robotics: Science and Systems Conference, Rome, Italy, July 13-17, 2015.
  • [10] R. C. DuToit, J. A. Hesch, E. D. Nerurkar and S. I. Roumeliotis, ”Consistent map-based 3D localization on mobile devices,” 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp. 6253-6260, doi: 10.1109/ICRA.2017.7989741.
  • [11] T. Schneider et al., ”Maplab: An Open Framework for Research in Visual-Inertial Mapping and Localization,” in IEEE Robotics and Automation Letters, vol. 3, no. 3, pp. 1418-1425, July 2018, doi: 10.1109/LRA.2018.2800113.
  • [12] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel and J. D. Tardós, ”ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial and Multi-Map SLAM,” arXiv e-prints, 2020, retrieved on July 23rd 2020. [Online]. Available: https://arxiv.org/abs/2007.11898v1
  • [13] P. Geneva, K. Eckenhoff and G. Huang, ”A Linear-Complexity EKF for Visual-Inertial Navigation with Loop Closures,” 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 2019, pp. 3535-3541, doi: 10.1109/ICRA.2019.8793836.
  • [14] M. Bloesch, M. Burri, S. Omari, M. Hutter, and R. Siegwart,”Iterated extended kalman filter based visual-inertial odometry using direct photometric feedback,” The International Journal of Robotics Research. 2017, 36(10): 1053-1072.
  • [15] M. Bloesch, S. Omari, M. Hutter and R. Siegwart, ”Robust visual inertial odometry using a direct EKF-based approach,” 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, 2015, pp. 298-304, doi: 10.1109/IROS.2015.7353389.
  • [16] G. Cioffi, and D. Scaramuzza, ”Tightly-coupled Fusion of Global Positional Measurements in Optimization-based Visual-Inertial Odometry,” arXiv e-prints, 2020, retrieved on March 9th 2020. [Online]. Available: https://arxiv.org/abs/2003.04159
  • [17] T. Qin, S. Cao, J. Pan, and S. Shen, "A general optimization-based framework for local odometry estimation with multiple sensors," arXiv e-prints, 2019, retrieved on January 11th, 2019. [Online]. Available: https://arxiv.org/abs/1901.03638
  • [18] T. Qin and S. Shen, ”Online Temporal Calibration for Monocular Visual-Inertial Systems,” 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, 2018, pp. 3662-3669, doi: 10.1109/IROS.2018.8593603.
  • [19] E. Rosten and T. Drummond,”Machine learning for high-speed corner detection,” European Conference on Computer Vision (ECCV),Springer, Berlin, Heidelberg, 2006, pp. 430-443.
  • [20] M. Calonder, V. Lepetit, C. Strecha and P. Fua, ”Brief: Binary robust independent elementary features,” European Conference on Computer Vision (ECCV), Springer, Berlin, Heidelberg, 2010, pp. 778-792.
  • [21] J. Revaud et al., "R2D2: repeatable and reliable detector and descriptor," arXiv e-prints, 2019, retrieved on June 17th, 2019. [Online]. Available: https://arxiv.org/abs/1906.06195
  • [22] Y. Jiao et al., ”2-Entity RANSAC for robust visual localization in changing environment,” 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 2019, pp. 2478-2485, doi: 10.1109/IROS40897.2019.8967671.
  • [23] V. Lepetit, F. Moreno-Noguer, and P. Fua, "EPnP: an accurate O(n) solution to the PnP problem," International Journal of Computer Vision, 81(2), 155-166, 2009.
  • [24] N. Trawny, and S. I. Roumeliotis., ”Indirect Kalman filter for 3D attitude estimation.” University of Minnesota, Dept. of Comp. Sci. & Eng., Tech. Rep 2 (2005): 2005.
  • [25] B. Siciliano et al., ”EuRoC - The Challenge Initiative for European Robotics,” ISR/Robotik 2014; 41st International Symposium on Robotics, Munich, Germany, 2014, pp. 1-7.
  • [26] J. Jeong, Y. Cho, Y. S. Shin, et al., "Complex urban dataset with multi-level sensors from highly diverse urban environments," The International Journal of Robotics Research, vol. 38, no. 6, 2019.