
Stereo Plane SLAM Based on Intersecting Lines

Xiaoyu Zhang, Wei Wang, Xianyu Qi, and Ziwei Liao

This work was supported by the National Key Research and Development Program of China under grant number 2020YFB1313600. The authors are with the Robotics Institute, Beihang University, Beijing, China. (Email: {zhang_xy, wangweilab, qixianyu, liaoziwei}@buaa.edu.cn)
Abstract

Plane features can be used to reduce drift error in SLAM systems, especially in indoor environments. It is easy and efficient to extract planes from a dense point cloud, which is commonly generated by an RGB-D camera or a 3D lidar. With a stereo camera, however, it is hard to compute dense point clouds accurately and efficiently. In this paper, we propose a novel method to compute plane parameters from intersecting lines extracted from stereo images. Plane features are commonly extracted from the surfaces of man-made objects and structures, which have regular shapes and straight edge lines. In three dimensions, two intersecting lines determine a unique plane. We therefore extract line segments from both the left and right images of a stereo camera. By stereo matching, we compute the lines' endpoints and direction vectors, and then calculate a plane from each pair of intersecting lines. Inaccurate plane features are discarded during frame tracking. Adding such plane features to a stereo SLAM system reduces drift error and improves performance. Finally, we build a global map consisting of both points and planes, which reflects the real scene structure. We test our proposed system on public datasets and demonstrate its accurate estimation results compared with state-of-the-art SLAM systems. To benefit research on plane-based SLAM, we release our code at https://github.com/fishmarch/Stereo-Plane-SLAM.

I INTRODUCTION

Simultaneous localization and mapping (SLAM) is a fundamental problem for various applications, including robots, driverless cars, and augmented reality (AR). Thanks to increasingly powerful graphics processing hardware, cameras have been widely adopted and visual SLAM has developed rapidly over the past decade [1].

For visual SLAM, points are the most commonly used features for tracking camera poses and building environmental models. Existing point-based SLAM systems [2, 3] have achieved accurate and robust estimation results. Beyond points, lines [4] and planes [5] have also been explored to improve SLAM performance in recent years. It has been shown that lines and planes help build more robust and accurate SLAM systems, especially in indoor environments. Line features are commonly extracted from images using the Line Segment Detector [6]. Line segments can be matched by the LBD descriptor [7] during frame tracking, but they suffer from occlusion and endpoint variance. Moreover, the parameterization of line segments [8] is complicated.

Figure 1: The plane features computed from intersecting lines. The planes are drawn in different colors. The black points are point features. The line segments are drawn in red for illustration, but they are not included in our SLAM system.

Compared with line features, planes are generally more robust features in SLAM because of their simple and accurate data association. Planes can even be matched across frames separated by large distances, which helps reduce drift error. Planes also help reflect real scene structures. Plane features are commonly extracted from a dense point cloud [9], generated by an RGB-D camera or a 3D lidar. From stereo images, however, it is not easy to obtain a dense point cloud accurately or efficiently. The depth of a point can be estimated by matching corresponding pixels in stereo images, but matching all pixels is a demanding task. Some traditional stereo matching methods [10] utilize low-level features of image patches to match pixels; they run fast but suffer from low quality. Recently, stereo matching algorithms based on deep learning have achieved remarkable performance [11], but these methods are slow and require expensive GPUs.

This paper proposes a novel method to compute plane features from stereo images. Plane features are commonly extracted from the surfaces of man-made objects and structures, which usually have regular shapes and straight edge lines. In three dimensions, two intersecting lines determine a unique plane, so it is both reasonable and possible to compute plane features from lines. 3D lines can be computed from stereo images by stereo matching [4]. An example of computed plane features is shown in Fig. 1; the original scene is shown at the top-left corner. We also provide a video at https://youtu.be/3VWF-JJU9T8.

In summary, our contributions are as follows:

  • A novel method to compute plane features from stereo images based on intersecting lines.

  • A stereo SLAM system using extracted points and computed planes.

  • An evaluation on public datasets showing that our system achieves accurate, state-of-the-art estimation results.

In the following, we first review related work in Sec. II, then explain the method for computing plane features in Sec. III, followed by an overview of the whole system in Sec. IV. Finally, we present our experimental results in Sec. V.

II RELATED WORK

SLAM is well studied and many different methods have been proposed in recent years. Most SLAM systems are based on point features and build a global map consisting of 3D points. ORB-SLAM [2] tracks ORB features and uses the re-projection error to estimate camera poses. In contrast, direct methods [12] use the photometric error to track camera poses. Point features can be used to build sparse [13], semi-dense [14], and even dense maps [15]. In these methods, point features are easy to extract and compute, and they work well in most scenes.

In recent years, researchers have also utilized line and plane features to improve the performance of SLAM systems, especially in indoor environments. In a SLAM system, line and plane features help reduce measurement errors and perform well even in some low-texture scenes. Pumarola et al. [16] extract both point and line features from monocular images and improve SLAM performance in low-texture scenes. Zhang et al. [17] propose a 3D line-based SLAM system using a stereo camera and demonstrate its improved reconstruction performance. PL-SLAM [4] is built on ORB-SLAM and also leverages both points and line segments. Qian et al. [18] extend the work of PL-SLAM using bags of point-and-line-word (BoPLW) pairs and release their code.

Compared with line features, planes are more accurate and robust landmarks. Taguchi et al. [19] present a framework for registration combining points and planes. CPA-SLAM [20] proposes a novel formulation to track camera poses against global planes in an expectation-maximization (EM) framework. Kaess [5] introduces a minimal representation for infinite planes that is suitable for least-squares estimation without encountering singularities. Our previous work [21] explores both planes and plane edges to increase the number of plane observations and constraints. These plane-based SLAM systems achieve more robust and accurate estimation results than point-based methods, especially in low-texture scenes. All of these systems are based on RGB-D cameras, from which planes are easily extracted. Our work borrows ideas from line-based SLAM and extends plane-based SLAM to stereo cameras.

(a) Matching of line segments
(b) Matching of endpoints
Figure 2: Matching of line segments and endpoints.

III PLANE FEATURES FROM INTERSECTING LINES

This section introduces our method for computing plane features. We first extract line segments from both the left and right images of a stereo camera. By matching line segments and their endpoints, we compute the 3D positions of the endpoints and the lines' direction vectors. We then check their relative positions to find intersecting lines, and finally compute the plane parameters.

III-A Notations

We denote a plane feature as $\pi = \left(\bm{n}^{\top}, d\right)^{\top}$, where $\bm{n} = \left(n_x, n_y, n_z\right)^{\top}$ is the unit normal vector representing the plane's orientation and $d$ is the distance of the plane from the origin. We use the commonly used form $\bm{T}_{cw} \in SE(3)$ to represent a camera pose and $\bm{p} = \left(x, y, z, 1\right)^{\top}$ to represent a 3D point. Therefore, $\bm{T}_{cw}\bm{p}_w$ transforms a point from the world to the camera coordinate system, and $\bm{T}_{cw}^{-\top}\pi_w$ transforms a plane from the world to the camera coordinate system.

For lines, we only record their endpoints $(\bm{p}_s, \bm{p}_e)$ and unit direction vectors $\bm{n}_l$, which are enough to compute plane features.
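To make the notation concrete, here is a minimal Python sketch (our illustration, not the released code) of the two coordinate transforms; `T_cw` is a 4×4 pose matrix, `p_w` a homogeneous point, and `pi_w` a plane as the 4-vector $(\bm{n}^{\top}, d)^{\top}$:

```python
import numpy as np

def transform_point(T_cw, p_w):
    """Map a homogeneous 3D point from the world to the camera frame: p_c = T_cw p_w."""
    return T_cw @ p_w

def transform_plane(T_cw, pi_w):
    """Map a plane from the world to the camera frame: pi_c = T_cw^{-T} pi_w."""
    return np.linalg.inv(T_cw).T @ pi_w
```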

III-B Line Detection and Computation

A frame from a stereo camera consists of a left image $I_l$ and a right image $I_r$. We use the Line Segment Detector [6] to extract line segments from both $I_l$ and $I_r$, and match them with the LBD descriptor [7]. This line matching is sufficiently accurate and robust within one stereo frame. As shown in Fig. 2(a), the line segments are drawn in different colors, and matched line segments share the same color in $I_l$ and $I_r$.

For every matched line segment in the left image $I_l$, we find the points corresponding to its endpoints in the right image $I_r$, based on the fact that row positions are identical in $I_l$ and $I_r$ for a rectified parallel stereo camera. As shown in Fig. 2(b), matched endpoints are connected by transverse lines.

From the stereo matching of endpoints, we compute their 3D positions $\bm{p}$ from the disparities $\Delta u$. A line's direction vector $\bm{n}_l$ is then defined by its two endpoints.
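As a sketch of this step (the calibration parameter names are ours, not from the paper; for a rectified pair the depth follows from $z = f b / \Delta u$):

```python
import numpy as np

def endpoint_3d(u, v, disparity, fx, fy, cx, cy, baseline):
    """Triangulate one matched endpoint from its disparity in a rectified stereo pair."""
    z = fx * baseline / disparity      # depth from disparity: z = f * b / du
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def line_direction(p_s, p_e):
    """Unit direction vector n_l defined by a line's two 3D endpoints."""
    d = p_e - p_s
    return d / np.linalg.norm(d)
```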

III-C Plane Computation

Before computing plane features, we need to check the relationships between lines. In three dimensions, two intersecting or parallel lines lie on the same plane. For parallel lines, however, it is difficult to judge whether they are extracted from a real plane, and planes computed from them are prone to large errors. Therefore, we only compute planes from intersecting lines.

To find intersecting lines quickly, we keep the line pairs that meet the following conditions (a code sketch of this test is given after Eq. (3)):

  • The angle between the two lines is larger than a threshold ($10^{\circ}$ in our experiments).

  • The distance between their central points is smaller than the length of the line.

  • The four endpoints of these two lines lie on the same plane.

The central points $\bm{p}_c$ in the second condition are computed from the lines' endpoints $\bm{p}_s$ and $\bm{p}_e$. The first two conditions select non-parallel lines that are close to each other. We compute the plane's normal vector as the cross product of the two lines' direction vectors,

$\bm{n}_{\pi} = \bm{n}_{li} \times \bm{n}_{lj}$   (1)

Using the plane's normal $\bm{n}_{\pi}$ and the four endpoints $\bm{p}_k$ $(k=1,2,3,4)$, we then compute four plane coefficients $d_k$,

$d_k = -\bm{n}_{\pi} \cdot \left(p_{kx}, p_{ky}, p_{kz}\right)^{\top}$   (2)

The spread among them is:

$D = \max\left(d_k\right) - \min\left(d_k\right)$   (3)

If $D$ is smaller than a threshold (5 cm in our experiments), the two lines meet the third condition and the plane coefficients $\pi = \left(\bm{n}_{\pi}^{\top}, \bar{d}_k\right)^{\top}$ are computed, where $\bar{d}_k$ is the arithmetic average of the $d_k$. Sometimes a computed plane may not correspond to a real surface, such as the plane computed from the lines of a doorframe. But such planes are stable enough and provide accurate constraints, so we treat them as real planes.
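The following sketch puts the three conditions and Eqs. (1)-(3) together, using the thresholds reported above; the helper names are ours, and for the second condition we compare against the length of the first segment, since the paper does not specify which line's length is used:

```python
import numpy as np

ANGLE_MIN_DEG = 10.0   # first condition
COPLANAR_MAX = 0.05    # third condition, meters

def plane_from_lines(ps_i, pe_i, ps_j, pe_j):
    """Return pi = (n, d) if two 3D segments meet the three conditions, else None."""
    len_i = np.linalg.norm(pe_i - ps_i)
    n_i = (pe_i - ps_i) / len_i
    n_j = (pe_j - ps_j) / np.linalg.norm(pe_j - ps_j)

    # Condition 1: the angle between the two lines is large enough.
    angle = np.degrees(np.arccos(np.clip(abs(n_i @ n_j), 0.0, 1.0)))
    if angle < ANGLE_MIN_DEG:
        return None

    # Condition 2: the central points are close to each other.
    c_i, c_j = 0.5 * (ps_i + pe_i), 0.5 * (ps_j + pe_j)
    if np.linalg.norm(c_i - c_j) > len_i:   # assumption: use the first line's length
        return None

    # Eq. (1): plane normal from the cross product of the direction vectors.
    n_pi = np.cross(n_i, n_j)
    n_pi /= np.linalg.norm(n_pi)

    # Eq. (2): one coefficient d_k per endpoint.
    d = np.array([-n_pi @ p for p in (ps_i, pe_i, ps_j, pe_j)])

    # Eq. (3) / condition 3: the four endpoints must be nearly coplanar.
    if d.max() - d.min() > COPLANAR_MAX:
        return None
    return np.hstack([n_pi, d.mean()])   # pi = (n_pi, mean of d_k)
```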

Under these conditions, we first compute as many planes as possible. We then validate the computed planes during frame tracking and label inaccurate planes as invalid, as described in Sec. IV-E.

Figure 3: The pipeline of our proposed SLAM system.

The features extracted and computed from the images of Fig. 2 are shown in Fig. 1. The black points are feature points extracted from the images. The red lines are the extracted line segments, whose 3D positions are computed from the matched endpoints. Note that we do not use these line segments in our SLAM system; we draw them only to illustrate the process of computing plane features. The plane features are drawn in different colors by expanding the corresponding intersecting lines. The red plane in the middle of the figure is incorrect because of errors in the line segments, and we label it as invalid during validation. The other planes are correct and are used as valid landmarks in our SLAM system.

IV SLAM SYSTEM BASED ON COMPUTED PLANES

Points and planes are both used as landmarks and optimized in our SLAM system. We build our system on the publicly available stereo version of ORB-SLAM2 [2], which includes feature tracking and bundle adjustment optimization [22].

IV-A System Overview

The pipeline of our proposed SLAM system is illustrated in Fig. 3. It can be divided into three parts: frame processing, pose tracking, and mapping. We have not added loop closure yet; this will be part of our future work.

In stereo frame processing, we extract feature points and line segments from both left and right images, and match these features based on descriptors. Then we can compute plane features using the method described in Sec. III.

In pose tracking, every camera pose is estimated from matched valid features. The camera pose is first estimated from the last keyframe and then optimized in the local map.

In mapping, points and planes are constructed and saved in the map. To achieve more accurate estimation, a local map optimization is performed.

IV-B Optimization Formulation

SLAM is commonly formulated as a nonlinear least squares optimization problem [1], and bundle adjustment (BA) is commonly used for point features [2]. As with points, we design an optimization formulation for plane features. In our SLAM system, we denote the sets of camera poses, point features, and plane features as $C=\{c_i\}$, $P=\{p_j\}$, and $\Pi=\{\pi_k\}$ respectively; the optimization problem can then be formulated as:

$C^*, P^*, \Pi^* = \underset{\{C,P,\Pi\}}{\arg\min} \sum_{c_i,p_j} \left\|e\left(c_i,p_j\right)\right\|_{\Sigma_{ij}}^{2} + \sum_{c_i,\pi_k} \left\|e\left(c_i,\pi_k\right)\right\|_{\Sigma_{ik}}^{2}$   (4)

$e(c,p)$ and $e(c,\pi)$ represent the camera-point and camera-plane measurement errors respectively. $\|\bm{x}\|_{\Sigma}^{2}$ is the squared Mahalanobis distance, which equals $\bm{x}^{\top}\Sigma^{-1}\bm{x}$. $\Sigma$ is the corresponding covariance matrix, set according to the measurement uncertainty (e.g., $2^{\circ}$ for plane angles in our experiments).

The optimization problem can be solved with the Levenberg-Marquardt or Gauss-Newton methods implemented in g2o [23].
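For illustration only, here is a minimal sketch of evaluating the cost in Eq. (4); the residual/covariance containers are our simplification, and in the actual system the minimization is carried out by g2o rather than by code like this:

```python
import numpy as np

def mahalanobis_sq(e, Sigma):
    """Squared Mahalanobis distance ||e||_Sigma^2 = e^T Sigma^{-1} e."""
    return e @ np.linalg.solve(Sigma, e)

def total_cost(point_edges, plane_edges):
    """Each edge is a (residual, covariance) pair for one camera-landmark measurement."""
    cost = sum(mahalanobis_sq(e, S) for e, S in point_edges)
    cost += sum(mahalanobis_sq(e, S) for e, S in plane_edges)
    return cost
```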

IV-C Measurement Error

IV-C1 Camera-Point Error

We use the standard re-projection error for the camera-point measurement in our system:

$\bm{e}_{cp}\left(\bm{T}_{cw}, \bm{p}_w\right) = \bm{u}_c - \rho\left(\bm{T}_{cw}\bm{p}_w\right)$   (5)

Here $\bm{T}_{cw}$ is the camera pose, $\bm{p}_w$ is the point in the world coordinate system, $\bm{u}_c$ is the corresponding image pixel, and $\rho$ is the camera model projecting 3D points onto the image. In optimization, the camera pose $\bm{T}_{cw}$ is mapped to the Lie algebra $\bm{\xi} \in \mathfrak{se}(3)$ to avoid extra constraints [24]. The computation of the corresponding Jacobian matrix can also be found in [24].
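A minimal sketch of Eq. (5) under a standard pinhole model $\rho$ (the intrinsic parameter names are ours, not from the paper):

```python
import numpy as np

def reprojection_error(u_c, T_cw, p_w, fx, fy, cx, cy):
    """Eq. (5): difference between the observed pixel and the projected point."""
    p_c = T_cw @ p_w                        # point in camera coordinates (homogeneous)
    u_hat = np.array([fx * p_c[0] / p_c[2] + cx,
                      fy * p_c[1] / p_c[2] + cy])
    return u_c - u_hat
```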

IV-C2 Camera-Plane Error

As a 3D plane has only three degrees of freedom, $\pi = \left(\bm{n}^{\top}, d\right)^{\top}$ is an over-parameterization. It therefore requires extra constraints to keep the plane's normal vector at unit length, adding computation during optimization. To overcome this problem, we follow [20] and use the minimal parameterization $\tau = (\phi, \psi, d)^{\top}$ in optimization, where $\phi$ and $\psi$ are the azimuth and elevation angles of the normal vector respectively:

$\tau = q(\pi) = \left(\phi = \arctan\frac{n_y}{n_x},\ \psi = \arcsin n_z,\ d\right)^{\top}$   (6)

Then we define the measurement error using the minimal parameterization:

$\bm{e}_{cl}\left(\bm{T}_{cw}, \pi_w\right) = q\left(\pi_c\right) - q\left(\bm{T}_{cw}^{-\top}\pi_w\right)$   (7)

Here $\pi_w$ is the plane parameter in the world coordinate system and $\pi_c$ is the plane observation in the camera coordinate system. The camera-plane error measures the difference between a plane landmark and its corresponding observation in the camera coordinate system. The computation of the Jacobian matrix can be found in our previous work [21].
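The following sketch implements Eqs. (6) and (7); we use `arctan2` in place of $\arctan(n_y/n_x)$ for numerical robustness, which is our choice rather than the paper's:

```python
import numpy as np

def q(pi):
    """Eq. (6): minimal parameterization tau = (phi, psi, d) of a plane pi = (n, d)."""
    n, d = pi[:3], pi[3]
    phi = np.arctan2(n[1], n[0])                # azimuth angle
    psi = np.arcsin(np.clip(n[2], -1.0, 1.0))   # elevation angle
    return np.array([phi, psi, d])

def camera_plane_error(T_cw, pi_w, pi_c_obs):
    """Eq. (7): error between the observation and the transformed plane landmark."""
    pi_c_pred = np.linalg.inv(T_cw).T @ pi_w    # landmark mapped into the camera frame
    return q(pi_c_obs) - q(pi_c_pred)
```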

IV-D Data Association

We try to match every observation of points and planes with the landmarks in the map; this process is called data association. Robust data association is necessary for accurate estimation results. Point features are commonly matched by their descriptors.

To match plane features, previous works [5, 20] rely directly on the plane parameters $\bm{n}$ and $d_{\pi}$. This method is simple and works well in small scenes, but it requires accurate camera pose estimates, and $d_{\pi}$ may fluctuate considerably because of errors. To improve the data association of planes, we instead use the distances of endpoints.

After computing planes from intersecting lines, we save the endpoints of the lines. To match plane features, we compute the average distance $\bar{d}_{ep}$ from these endpoints to plane landmarks. Unlike $d_{\pi}$, the fluctuation of $\bar{d}_{ep}$ is relatively small. If $\bar{d}_{ep}$ is smaller than a threshold (6 cm in our experiments) and the angle between the plane normal vectors is also smaller than a threshold ($12^{\circ}$ in our experiments), the plane landmark is matched with the corresponding plane observation.
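A sketch of this matching test, assuming the observed line endpoints have already been transformed into the world frame and the landmark normal has unit length (helper names are ours):

```python
import numpy as np

DIST_MAX = 0.06       # average endpoint-to-plane distance, meters
ANGLE_MAX_DEG = 12.0  # angle between normal vectors

def match_plane(endpoints, n_obs, pi_lm):
    """endpoints: (N, 3) endpoints of the observed intersecting lines (world frame)."""
    n_lm, d_lm = pi_lm[:3], pi_lm[3]
    d_ep = np.mean(np.abs(endpoints @ n_lm + d_lm))  # average point-to-plane distance
    angle = np.degrees(np.arccos(np.clip(abs(n_obs @ n_lm), 0.0, 1.0)))
    return d_ep < DIST_MAX and angle < ANGLE_MAX_DEG
```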

IV-E Plane Validation

We use the same method to select keyframes as ORB-SLAM [2]. Every unmatched plane feature in a keyframe is used to create a new plane landmark. These new landmarks may be inaccurate, so they are initially labelled invalid. A plane becomes valid only after it has been observed in sufficiently many keyframes (3 in our experiments). Only valid plane features are used to estimate camera poses and added into the optimization framework (Eq. (4)).
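A minimal sketch of this validation rule (the bookkeeping structure is our assumption):

```python
MIN_KEYFRAME_OBSERVATIONS = 3

class PlaneLandmark:
    """A plane landmark that stays invalid until observed in enough keyframes."""
    def __init__(self, pi):
        self.pi = pi
        self.n_observations = 1
        self.valid = False

    def add_observation(self):
        self.n_observations += 1
        if self.n_observations >= MIN_KEYFRAME_OBSERVATIONS:
            self.valid = True   # now usable for tracking and optimization (Eq. (4))
```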

V EVALUATION

In this section, we evaluate our proposed SLAM system on two popular public datasets: the EuRoC dataset [25] and the KITTI vision benchmark [26]. Both datasets provide stereo images. All experiments run on a laptop with an i7-7700HQ 2.80 GHz CPU and 16 GB RAM, without a GPU.

We compare our proposed system with other state-of-the-art stereo SLAM systems. ORB-SLAM2 [2] is a popular point-based visual SLAM system with a stereo implementation. We also compare our system with a line-based SLAM system that utilizes line segments directly: the system of [18] is based on point and line features, is also built on ORB-SLAM2, and achieves state-of-the-art performance. Both systems provide open-source code. Our work mainly focuses on the front-end, so we switch off loop closure to compare drift in our experiments.

For implementation, our system augments the stereo variant of ORB-SLAM2. We rely on the underlying ORB-SLAM2 for point extraction and matching and for maintaining the local map. Our implementation focuses on line segment extraction, plane computation, plane matching, and camera pose estimation using both point and plane features. In the end, we construct a global map consisting of both points and planes. To help readers understand the details of our system, we release our code at https://github.com/fishmarch/Stereo-Plane-SLAM.

V-A EuRoC Dataset

The EuRoC dataset contains stereo sequences recorded from a micro aerial vehicle flying in three different indoor environments: an industrial machine hall and two Vicon rooms. These scenes contain man-made objects and structures, from which it is easy to extract line segments and compute plane features. The sequences are classified as easy, medium, and difficult according to speed, illumination, and scene texture. The dataset also provides ground truth from a laser tracking system and a motion capture system.

TABLE I: Comparison of translation RMSE (m) on the EuRoC dataset
Sequence        | ORB-SLAM2 | Line-SLAM    | Our System
MH_01_easy      | 0.038785  | 0.038370     | **0.034193**
MH_02_easy      | 0.052046  | 0.049897     | **0.048448**
MH_03_medium    | 0.037984  | 0.048172     | **0.034936**
MH_04_difficult | 0.105441  | **0.049670** | 0.067824
MH_05_difficult | 0.092429  | 0.055767     | **0.043746**
V1_01_easy      | 0.087496  | 0.087920     | **0.086734**
V1_02_medium    | 0.091192  | **0.064006** | 0.067565
V1_03_difficult | 0.173660  | **0.135513** | 0.165173
V2_01_easy      | 0.070456  | **0.061183** | 0.063001
V2_02_medium    | 0.099885  | **0.060296** | 0.081215
V2_03_difficult | lost      | lost         | lost
Figure 4: Map built by the line-based SLAM system in Sequence V1_01_easy.

Table I compares the estimation results of the different SLAM systems. We use the absolute translation root mean square error (RMSE) to evaluate the results; the smallest error for each sequence is shown in bold. Our system clearly outperforms stereo ORB-SLAM2 on these sequences. The computed plane features bring more constraints for estimating camera poses, and these constraints are reasonable in such indoor environments.

The line-based SLAM system also performs better than ORB-SLAM2 on these sequences and gets results comparable with ours. Although using line segments directly adds more information to the SLAM system, some line segments are inaccurate and their data association is difficult when the camera view changes, as shown in Fig. 4: a single line in the scene may be reconstructed as several segments when data association fails. By computing planes from line segments, our system may lose some information but filters out inaccurate line segments, and the data association of planes across frames is easier. Our estimation results are better on the machine hall sequences (MH*): the objects and structures in the industrial machine hall have clear, sharp edges, which benefits line extraction and plane computation. In the Vicon rooms (V1* and V2*), the lines extracted from soft cushions and curtains carry larger errors.

Figure 5: Map built by our system in Sequence V1_01_easy.

An example of the map built by our system is shown in Fig. 5. The map consists of points, planes, and camera poses. The plane features enrich the map and help reflect the real scene structure clearly. It shows that plane features are computed from the main objects and structures, such as walls and cushions. The planes from the wall and the ground are easily matched even across frames separated by large distances. Although we try to compute accurate planes and filter out those with large errors, some inaccurate planes still remain in the map.

TABLE II: Comparison of translation RMSE (m) on the KITTI dataset
Sequence | ORB-SLAM2 | Line-SLAM     | Our Method
00       | 9.243199  | 8.341743      | **7.929075**
01       | 23.313985 | 66.529011     | **18.228348**
02       | 18.581184 | 20.791099     | **18.036698**
03       | 9.191448  | 9.148313      | **9.126736**
04       | 2.542802  | 2.401305      | **2.207116**
05       | 4.756073  | **4.403798**  | 4.531347
06       | 4.959506  | **3.710958**  | 4.014013
07       | 2.072465  | 2.336099      | **1.914680**
08       | 15.780692 | **13.583784** | 13.705721
09       | 7.594386  | **7.434085**  | 7.642378
10       | 6.484388  | **6.090371**  | 6.19148

V-B KITTI Dataset

The KITTI dataset contains stereo sequences recorded from an autonomous driving platform in urban and highway environments. The ground truth is given by a GPS/IMU localization unit. Although our SLAM system is better suited to indoor environments, planes can still be computed from man-made structures (such as houses and roads) in these sequences.

Figure 6: The comparison of trajectories in Sequences 00 and 01.

Table II compares the estimation results on the KITTI dataset. We again report the RMSE for each sequence and mark the smallest errors in bold. Our system obtains better estimation results on most sequences. ORB-SLAM2 using only point features is already strong here, and our system only matches it on some sequences. This is because the camera moves much faster in these sequences, so planes are matched in only a few frames and fewer valid planes are created. Fig. 6 shows example trajectories compared with the ground truth; our results are closer to the ground truth.

The line-based SLAM system also performs better on some sequences. On others, however, it gets even worse results, such as Sequences 01, 02, and 07. This is because the extracted lines have large errors and are difficult to match when the camera moves quickly.

A part of the map built from Sequence 00 is shown in Fig. 7. The plane features are computed from roads and walls. It is clear that the drift error is small.

V-C Time Performance

TABLE III: Average processing time (ms) of the main parts.
Main Part          | EuRoC (752×480, 20 fps) | KITTI (1241×376, 10 fps)
Feature Processing | 40.3224                 | 68.9912
Feature Tracking   | 12.5305                 | 16.3811
Local Optimization | 88.7514                 | 110.826
Frame Tracking     | 54.3525                 | 86.5212

We record the average processing time of the main parts of the proposed system, as shown in Table III. Feature processing includes point extraction, line extraction, stereo matching, and plane computation; feature tracking includes landmark matching and camera pose estimation. Frame tracking is the whole process of tracking each incoming stereo frame. The most time-consuming part is feature processing, specifically point and line extraction. Because the images in the KITTI sequences are larger, feature processing needs more time there.

We also record the average processing time of the line-based SLAM system [18]. Its average frame tracking time is 81.8247 ms on the EuRoC sequences and 105.347 ms on the KITTI sequences, so our system is clearly faster than the SLAM system using line segments directly. In the line-based system, many line segments are built with large errors and are hard to match correctly, as shown in Fig. 4; all of these features are nevertheless added into the optimization function, so more time is needed to estimate the optimal results.

Figure 7: A part of the built map in Sequence 00.

VI CONCLUSIONS

We have presented a novel method to compute plane features from stereo images for visual SLAM. Many previous works [5, 20, 21] have demonstrated the benefits of adding plane features to SLAM systems, but most of them target RGB-D cameras. In this paper, we compute planes from stereo images instead, based on the fact that two intersecting lines determine a plane. After further validation, we add the computed planes into our stereo SLAM system. We have presented experimental results on two well-known public datasets and demonstrated the accuracy of our system.

The experimental results show that our system outperforms a state-of-the-art point-based SLAM system and achieves results comparable with a line-based SLAM system. The plane computation filters out inaccurate line segments and adds stable constraints for estimating camera poses. The constructed map consists of both points and planes, which reflect the real scene structures. Structured environments with man-made objects are ideal working scenes for our system.

In the constructed maps, we notice that some inaccurate plane features still exist, and they pose a significant challenge to data association. In the future, we would like to refine the plane computation and validation methods to obtain more accurate and robust plane features. We also need a more robust data association algorithm that removes the influence of estimation errors. Finally, we will add loop closure to our system.

References

  • [1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age, IEEE Trans. Robot., vol. 32, no. 6, pp. 1309-1332, 2016.
  • [2] R. Mur-Artal and J. D. Tardos, ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras, IEEE Trans. Robot., vol. 33, no. 5, pp. 1255-1262, 2017.
  • [3] J. Engel, V. Koltun, and D. Cremers, Direct Sparse Odometry, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611-625, 2018.
  • [4] R. Gomez-Ojeda, F. A. Moreno, D. Zuñiga-Noël, D. Scaramuzza, and J. Gonzalez-Jimenez, PL-SLAM: A Stereo SLAM System Through the Combination of Points and Line Segments, IEEE Trans. Robot., vol. 35, no. 3, pp. 734-746, 2019.
  • [5] M. Kaess, Simultaneous localization and mapping with infinite planes, in IEEE International Conference on Robotics and Automation, 2015, pp. 4605-4611.
  • [6] R. G. Von Gioi, J. Jakubowicz, J. M. Morel, and G. Randall, LSD: A Fast Line Segment Detector with a False Detection Control, IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 4, pp. 722-732, 2010.
  • [7] L. Zhang and R. Koch, An efficient and robust line segment matching approach based on LBD descriptor and pairwise geometric consistency, J. Vis. Commun. Image Represent., vol. 24, no. 7, pp. 794-805, 2013.
  • [8] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.
  • [9] A. J. B. Trevor, S. Gedikli, R. B. Rusu, and H. I. Christensen, Efficient Organized Point Cloud Segmentation with Connected Components, Semant. Percept. Mapp. Explor., 2013.
  • [10] H. Hirschmuller, Stereo Processing by Semiglobal Matching and Mutual Information, IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 2, pp. 328-341, 2008.
  • [11] K. Zhou, X. Meng, and B. Cheng, Review of Stereo Matching Algorithms Based on Deep Learning, Comput. Intell. Neurosci., vol. 2020, pp. 1-12, 2020.
  • [12] J. Engel, T. Schöps, and D. Cremers, LSD-SLAM: Large-scale Direct Monocular SLAM, in European Conference on Computer Vision, 2014, pp. 834-849.
  • [13] J. Engel, V. Koltun, and D. Cremers, Direct Sparse Odometry, IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611-625, 2018.
  • [14] J. Engel, J. Sturm, and D. Cremers, Semi-dense visual odometry for a monocular camera, in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1449-1456.
  • [15] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, DTAM: Dense tracking and mapping in real-time, in Proceedings of the IEEE International Conference on Computer Vision, 2011, pp. 2320-2327.
  • [16] A. Pumarola, A. Vakhitov, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer, PL-SLAM: Real-time monocular visual SLAM with points and lines, in 2017 IEEE international conference on robotics and automation, 2017, pp. 4503-4508.
  • [17] G. Zhang, J. H. Lee, J. Lim, and I. H. Suh, Building a 3-D line-based map using stereo SLAM, IEEE Trans. Robot., vol. 31, no. 6, pp. 1364-1377, 2015.
  • [18] K. Qian, W. Zhao, K. Li, and H. Yu, Real-Time SLAM with BoPLW Pairs for Stereo Cameras, with Loop Detection and Relocalization Capabilities, IEEE Sens. J., vol. 20, no. 3, pp. 1630-1641, 2020.
  • [19] Y. Taguchi, Y.-D. Jian, S. Ramalingam, and C. Feng, Point-Plane SLAM for Hand-held 3D Sensors., in IEEE International Conference on Robotics and Automation, 2013, pp. 5182-5189.
  • [20] L. Ma, C. Kerl, J. Stückler, and D. Cremers, CPA-SLAM: Consistent Plane-model Alignment for Direct RGB-D SLAM, in IEEE International Conference on Robotics and Automation, 2016, pp. 1285-1291.
  • [21] X. Zhang, W. Wang, X. Qi, Z. Liao, and R. Wei, Point-Plane SLAM Using Supposed Planes for Indoor Environments, Sensors, vol. 19, no. 17, p. 3795, 2019.
  • [22] B. Triggs, P. F. Mclauchlan, R. I. Hartley, and A. W. Fitzgibbon, Bundle Adjustment - A Modern Synthesis, in International Workshop on Vision Algorithms: Theory and Practice, 1999, pp. 298-372.
  • [23] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, g2o: A general framework for graph optimization, in IEEE International Conference on Robotics and Automation, 2011, pp. 3607-3613.
  • [24] T. D. Barfoot, State Estimation for Robotics. Cambridge University Press, 2017.
  • [25] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, The EuRoC Micro Aerial Vehicle Datasets, Int. J. Rob. Res., vol. 35, no. 10, pp. 1157-1163, 2016.
  • [26] A. Geiger, P. Lenz, and R. Urtasun, Are we ready for Autonomous Driving? the KITTI Vision Benchmark Suite, in Proceedings of the Conference on Computer Vision and Pattern Recognition, 2012.