PLD-SLAM: A Real-Time Visual SLAM Using Points and Line Segments in Dynamic Scenes

Baosheng Zhang

Abstract

In this paper, we consider the problems in the practical application of visual simultaneous localization and mapping (SLAM). With the popularization and application of the technology in wide scope, the practicability of SLAM system has become a new hot topic after the accuracy and robustness, e.g., how to keep the stability of the system and achieve accurate pose estimation in the low-texture and dynamic environment, and how to improve the universality and real-time performance of the system in the real scenes, etc. This paper proposes a real-time stereo indirect visual SLAM system, PLD-SLAM, which combines point and line features, and avoid the impact of dynamic objects in highly dynamic environments. We also present a novel global gray similarity (GGS) algorithm to achieve reasonable keyframe selection and efficient loop closure detection (LCD). Benefiting from the GGS, PLD-SLAM can realize real-time accurate pose estimation in most real scenes without pre-training and loading a huge feature dictionary model. To verify the performance of the proposed system, we compare it with existing state-of-the-art (SOTA) methods on the public datasets KITTI, EuRoC MAV, and the indoor stereo datasets provided by us, etc. The experiments show that the PLD-SLAM has better real-time performance while ensuring stability and accuracy in most scenarios. In addition, through the analysis of the experimental results of the GGS, we can find it has excellent performance in the keyframe selection and LCD.

Index Terms:

Loop closure, dynamic scenes, line feature, visual simultaneous localization and mappin (vSLAM).

I Introduction

Simultaneous Localization and Mapping (SLAM) is a technology to estimate the state of the robots and build surrounding maps in unknown environments [1]. In the past few decades, outstanding achievement has been made in SLAM and competent for most of the demanding tasks, such as autonomous driving [2], underwater exploration [3], and micro aerial vehicle (MAV) navigation [4]. The mass applications of visual SLAM have further proved the excellent performance of SLAM in location and building maps and promoted the further development of related technologies.

Currently, the most of SLAM systems based on Lidar [5][6] or camera [7, 8, 9]. Since Lidar can obtain highly accurate point clouds in a wide range around and is robust to the wide range of lighting changes, the SLAM based on 2D or 3D Lidar once became the mainstream of SLAM systems [6]. However, with improvements in the device performance and the massive emergence of vision algorithms in recent years, a large number of visual SLAM approaches based on various visual sensors have been proposed. Visual SLAM has played a more important role in research and application in virtue of its information-rich measurements, low cost, and low energy consumption. With the popularity of visual odometry (VO) and visual SLAM technology, the more practical problems in the real scenes emerge gradually and become the new popular research topics. For the benefit of research, we summarize the new primary requirements for the visual SLAM systems as follows:

1.

Improving the capability of the system to cope with the dynamic environment;
2.

Obtaining the robustness and accuracy results while ensuring the real-time performance of the system;
3.

Improving the universality of the system, e.g., implementing the system immediately without additional preprocessing in the new different environments.

In this paper, we consider the above-mentioned challenges and proposed effective solutions.

Finding feature correspondences between two different images is a necessary task in feature-based indirect visual SLAM [10][11]. To deal with the poor textured environment, we introduce line segments as basic features into our system. For line features, we use the LSD detector [12] to extract salient lines in the image like most line feature-based approaches [9][13][14]. In the subsequent feature matching process, keeping the number and correct of matches are the most concerning issues and researchers have contributed meaningful work in this field [15, 16, 17]. However, these feature mismatch removal methods suffer from the drawbacks time-consuming and the limitation of the feature type. In this article, we respectively propose two novel local feature outlier filters for point and line features that address these problems and aim to remove greatly mismatched features one time. Filtering false line feature matched is not a trivial task, the key challenge is that the line features involve a large range in view that cannot be divided locally by regular grids like the point features. In this work, we tackle the issues using a purely visual-based approach. We focus on the line feature itself and construct the circular domain is parameterized by the mid-point and length of each line. Then creating a local line feature domain refer to the circular domain relationship between the line features, removing the outlier matching lines based on the position and direction of all the line features in the line domain. Further discussion about the feature mismatching will be formally addressed in section III-B1 and III-B2.

Among the existing SOTA visual SLAM and VO approaches, such as ORB_SLAM [18] and SVO [19] have estimated the accurate state of robots by optimizing the feature re-projection errors. However, these indirect or semi-direct methods suffer from drawbacks arising from the assumption of static environments. The dynamic features attached to dynamic objects as the critical challenge for conventional visual SLAM, which greatly reduces the location and map reconstruction accuracy. The idea to solve the highly dynamic environment problem is to accurately detect dynamic regions in the view and remove outliers, then the system achieves the motion estimation between different frames based on the remained static features. To retain enough stable features, we refer to our previous work [20], using the $dynamic$ $grid$ algorithm which utilizes the motion model between adjacent frames to sign and remove dynamic point features to solve the problem of dynamic objects in real-world scenes. In addition, we extended the dynamic point detection algorithm to the line features, a brief conclusion and discussion are given in section III-B3

Besides, loop closure detection (LCD) is an essential element for visual SLAM working in long-term and large-scale scenes, and it is increasingly a concern by researchers. An excellent LCD requires the capacity to recognize a previously visited place from current camera measurements and reduces the drift accumulated of all keyframes in the loop [21][22]. In exiting popular SOTA visual SLAM, the LCD proposed is almost implemented by the feature-based bag-of-words (BoW) [23] model or its variants. According to the different times of building BoW models, the BoW-based LCD is divided into static and dynamic models. The off-line feature dictionary model needs to build a BoW model for all field features in the dataset before the system operation, and loading the huge model at the initial stage of system implementation, which is not conducive to the universality of the system to the new environment nor the real-time performance in the automatic implementation process. On the contrary, the dynamic BoW model creates a feature dictionary for each frame during the system execution to realize LCD in subsequent frames. However, although the BoW model can achieve high accuracy performance with a huge visual dictionary, its unable to meet both the real-time and universality (requiring pre-training vocabulary model for new scenes) of the system.

In this work, we present a novel image similarity moble based on global gray similarity (GGS), which extracts the field of view information through the fast global gray value distribution, and realizes the similarity evaluation between different images (see section III-B5). The function of the GGS algorithm introduced into the system is reflected in two parts. First, the LCD in PLD-SLAM based on the GGS achieves SOTA accuracy without additional pre-training models and greatly improves the adaptability of the system to the new environment and the real-time performance of the system. In particular, the GGS-based LCD also has an extremely important reference value for the LCD works of conventional directly or LiDAR SLAM, etc., which without feature detection. Second, a keyframe selection based on the GGS proposed in this work, optimally selects reasonable keyframes for the balance of accuracy and efficiency of the system.

The main contributions of this article can be summarized as follows:

•

We present a complete stereo visual SLAM system based on point and line features in dynamic scenes, coined PLD-SLAM, which achieve SOTA accurate result. This system has real-time performance and better adaptability compared with traditional visual SLAM;
•

Two effective purely visual feature mismatching filter algorithms to remove outliers of point and line features, which balances the feature tracking needs between the accuracy and computational complexity;
•

A novel keyframe selection refers to GGS, which meets the requirements of LCD and improves the real-time of the system.
•

A light-weight and fast LCD based on the GGS of keyframes;

The rest of our paper is organized as follows. After a review of the relevant background literature in Section II, the main details proposed in the paper are described in section III. Subsequently, Section IV provides qualitative and quantitative results of the performance of PLD-SLAM both on the KITTI dataset [9] and in a real-world environment to demonstrate the effectiveness and accuracy of the system. Finally, a brief conclusion and discussion are given in section V.

II Related Work

II-A Visual SLAM

Based on the type of sensors, most visual SLAM and VO systems can be divided into Monocular [8] [9], Stereo [13], and RGB-D [24] method. PTAM [25] is one of the keyframe-based monocular framework visual SLAM, which first introduces the idea of feature tracking and mapping into parallel threads. As a SOTA SLAM, ORB_SLAM [18] begins with PTAM, employing local Bundle Adjustment (BA), Pose Graph Optimization (PGO), and LCD to improve the accuracy and robustness of the system. Furthermore, the method in the [9], simultaneously leverages points and lines information to deal with the poorly textured environment. However, all of these monocular methods have a critical drawback in the initialization and estimation accuracy. The RGB-D scheme can address the above issues and directly build dense maps via the depth information of each pixel. Since the application range is a key limitation, such as outdoor and underwater, RGB-D SLAM is also not able to applicate in wide scope.

To overcome these challenges, various stereo methods solutions have been proposed. In the past decades, the stereo framework in ORB_SLAM2[26] was one of the SOTA stereo frameworks, it used ORB as the point feature and employ BA to eliminate the drift of trajectory estimation, balancing the real-time and accuracy of the system. In addition, Y Zhou $et$ $al.$ present a parallel tracking-and-mapping approach system based on the stereo event cameras [27], it tackles challenging scenarios in robotics, such as low-light, high-speed, and high dynamic range scenes while running in real-time.

II-B Feature-based SLAM

Among the above approaches almost the indirect method, currently exiting visual SLAM can be divided into direct [28][29] and indirect (feature-based)[9][18][13] according to visual measurement processing. Compared to the feature-based methods, the direct method directly uses pixel intensity to compute the camera motion by minimizing photometric errors, instead of detecting and tracking features. However, they are greatly sensitive to brightness changes and constrained to narrow baseline motions [30], most of the researchers used feature-based ones instead due to their robustness and accuracy of estimation. Furthermore, most the feature-based SLAM, no matter monocular or stereo, use point features as their common visual feature and compute corresponding 3D points in the camera frame by correct matching between the detected point features. Among these algorithms, we focus on PMHT SLAM [31], SVO [19], and the ORB_SLAM2 presented by R Mur-Artal $et$ $al.$ [26], is a typical point feature-based visual SLAM, which merge multiple frameworks for various vision sensors. It is based on the fast and convenient handle ORB features to recover trajectory and build sparse point cloud maps.

However, all these approaches with only point features can obtain robust and accurate motion estimation in rich-texture scenes. When facing scenes with poor information, the motion estimation accuracy could degrade significantly. To achieve more robust and better accuracy performance, researchers introduced line features to improve feature detection capability in the low-texture environment [14][32][33]. The method [32] as one of the line-based excellent methods, presents a graph-based stereo visual SLAM system using straight lines as only visual features. In this work, using two different representations to parameterize 3-D lines to meet the needs of projection and optimization, the same strategy can be found in [14]. Other systems utilize other features and line features at the same time, as described in [9] and [13], the point as the most mature and convenient visual feature, complementing the advantages of line features and providing more structural information from the environment over only one kind of feature map. However, no matter direct or feature-based approaches, all existing traditional visual SLAM approaches suffer from drawbacks arising from the dynamic objects in the real scenes, such as moving cars and pedestrians shown in figure 1. In the following, we review the state of the art of visual SLAM or VO systems for highly dynamic environments.

II-C Dynamic SLAM

To eliminate the impact of the dynamic environments, there are several types of moving object detection algorithms that extract feature points and clustered features to distinguish the background and the foreground, in this way to estimate the motion of the camera [34][35]. In the [36], S Minaeian $et$ $al.$ proposed an effective, efficient, and robust method to accurately detect and segment multiple independently moving foreground targets. Then tracking background key points using pyramidal and using a sliding window for detecting multiple moving targets. Other methods with depth information can be applied, the method [37] provides a more accurate dynamic 3D landmark detection method, and first introduce an efficient initial camera pose estimation method based on distinguishing dynamic from static points using graph-cut RANSAC. Recently, learning-based computer visual methods are massive emerging, researchers utilize various deep learning and convolutional neural networks (CNN)-based methods to solve the dynamic object [38][39]. It is noted that, in general, all these deep learning methods are likely to be computationally expensive and time-consuming, which is not feasible for system applicate in the real scene. Currently, the maturity of SLAM as a research field has motivated the practicability of SLAM systems. In this paper, we proposed a real-time visual SLAM in dynamic scenes, the several novel strategies have reference value for SLAM system real application.

III Technical Approach

III-A System Overview

The system overview of our approach is given in figure 2, following most feature-based SLAM frameworks, we divide PLD-SLAM into the three components: Tracking, Local Mapping, and Loop Closure detection. We next briefly review the pipeline of the work.

First, in the tracking components, we aim to estimate the pose of the new frame and preparation for the operation of the system. In the initial stage of the component, we compute the GGS of the image, which is the key parameter of the keyframe selection and loop closure detection. Notice that, normally, selecting the left view to participate in the later work on stereo visual SLAM. To reduce system time consumption, we only obtain the GGS of the left view in the new frame. Meanwhile, we detect ORB and LSD and match the corresponding features to each other to recover 3D points and lines as our preview work DynPL-SVO. Subsequently, we utilized two effective mismatching filters for points and lines, respectively, to detect and remove the false feature matching. The details are described in section III-B1 and III-B2. Finally, the matched point and line features on the current frame are tracked with corresponding features on the previous frame and computing the camera motion by minimizing feature re-projection errors. However, the dynamic features in the field of the view reduce the accuracy and robustness of the system. To overcome the dynamic environment, we based on the $dynamic$ $grid$ algorithm present by our preview work [20], and extended the dynamic point detecting algorithm to the line features. So far, we initial the camera pose estimation of the new frame. Further discussion about the feature tracking procedure will be formally addressed in section III-B. The next components are keyframe insertion and updating the local map.

Before updating the local map, we need to determine when the new frame meets the conditions and is selected as the keyframe. In this study, we employ the GGS above-mentioned as the core to detect the keyframe, then insert the keyframe selected to the local map. The quality of the map depends on the accuracy of the keyframe pose and landmark state. Thus, we match the feature on the new keyframe with the feature on the preview keyframe and the local map. Next, use a co-visibility graph between local keyframes and the feature positions observed by them to build a local BA model to correct elements in the local map. A detailed description of this procedure will be presented in section III-C.

During the local mapping, loop closure detection utilized GGS to calculate the similarity between each inserted new keyframe and all existing keyframes, as described in section III-D1. Then, select the correct looped keyframes which meet the two stages conditions, and correct all the keyframes poses involved in the loop through a pose-graph optimization (PGO).

Finally, we achieved global optimization via a global BA optimization identical to the ones presented in ORB-SLAM.

III-B Tracking

For each input stereo frame, we implement the necessary initialization to obtain its rough relative pose transformation and the observed feature positions in the world. Similar to most traditional feature-based SLAM methods, in this paper, we transform the frame pose estimation into solving a nonlinear least square problem. However, these approaches do not work well in dynamic environments. This section reviews the most important aspects of our previous work [20], which deals with the visual odometry estimation in the dynamic scenes, and also with the GGS strategy.

III-B1 Point feature detection and tracking

For point feature, we use the ORB algorithm, since its balance between performance and efficiency meet the need of the system. To ensure the low time consumption, we cited the local matching scheme used in [20]. In this way, the matching point features between the left and right view in the corresponding local origin greatly reduce the number of candidates. Furthermore, to remove the point feature mis-tracking between two different frames, we advance an effective point feature outliers filter, which constraint relationship between the target point feature and other point features in the same grid to filter out outliers (see figure 3). First, the meshing of all detected point features, i.e., dividing the image evenly into 64×48 grids and storing all point features according to their grid positions in the image. We obtain the rough point features matching ( $p_{p1,2,3,4}$ - $p_{c1,2,3,4}$ ) between the previous frame and the current frame. Considering the motion model of previous frame, the correctly matched point features $p_{p}$ and $p_{c}$ have similar relative relationships with other surrounding feature points. Based on this premise, computing meaning cross between the target point location and the coordinate of all other points in the same grid, named $grid_{cval}$ , as shown following:

	$\displaystyle g_{lk}=\frac{1}{n}\sum_{i=1}^{n}\begin{bmatrix}x_{lk}\\ y_{lk}\end{bmatrix}\times\begin{bmatrix}x_{li}\\ y_{li}\end{bmatrix}$		(1)
	$\displaystyle g_{rk}=\frac{1}{n}\sum_{i=1}^{n}\begin{bmatrix}x_{rk}\\ y_{rk}\end{bmatrix}\times\begin{bmatrix}x_{ri}\\ y_{ri}\end{bmatrix}$		(1)

where $g_{lk}$ and $g_{rk}$ refer to the kth matched point feature pair $grid_{cval}$ in the left and right image, respectively, and $n$ is the number of matched points in the grid. In this way, we can find that the $grid_{cval}$ between the mismatching $p_{r2}$ and $p_{l2}$ is significantly different from other correct matching, i.e.,

|g_{l2}-g_{r2}|>\alpha(g_{lm}-g_{rm})

(2)

where $g_{lm}$ and $g_{rm}$ represent the meaning $grid_{cval}$ in the $grid_{l}$ and $grid_{r}$ , respectively, and $\alpha$ has been set to 1.5 in our experiments. Finally, we repeat the above operation for all matched point features and remove the outlier in each grid.

III-B2 Line feature detection and tracking

It is naturally operation point features via the grid scheme, however, handling line segments is not a trivial task. There is no reasonable structure for enveloping line features since its large span and the different direction between lines. To overcome this problem, we focus on the straight line itself and construct the circular domain as parameterized by the midpoint and length of the line. Figure 4 showed the distribution of the color circular domains based on the line features in the real scene, and the color brightness respect the line features distribution in the local area. The local line features with similar characteristics have brighter colors, i.e., the corresponding circular domains of line features matched on the same object have a wide range of overlap each other. Hence, creating a local line feature domain refer to the circular domain relationship between the line features, removing the outlier matched lines based on the position and direction of all the line features in the local line domain, see figure 5. Like point matching above mentioned, we obtain rough matched line ( $l_{p1,2,3,4}$ – $l_{c1,2,3,4}$ ) between the previous frame and the current frame. Specially, we determine the local line group (LLG) to line $l_{p1,2,3}$ , whose circular domains overlap each other. It is noticed that the correctly tracked line features in the sane LLG have similar positions and directions, hence, we computed the average angle and distance of mid-point between the target tracking line pair, and filter out those matches whose angle and distance are more than twice the meaning of all matched lines in the same LLG as an outlier to ensure that the correspondences are meaningful enough. It is worth mentioning that for LLG with only one line feature such as $l_{p4}$ , we use the meaning angle and distance of all matched features in the image as the threshold to judge the correctness of the match.

III-B3 Dynamic feature detection

After detecting and filtering the mismatching features for the current frame, we now track the matched features between the adjacent frames. However, all the tracked features contain the dynamic point and line features matched to the dynamic objects which reduced the accuracy of pose estimation.

Dynamic environment is an open challenge of visual SLAM, and addressing the issues is one of the important aspects of our work. In this step, we briefly review the dynamic point feature algorithm, $dynamic$ $grid$ , which identifies dynamic point features based on the motion model. Furthermore, we introduce the dynamic line filter based on the LLG above mentioned. Normally, dynamic objects have more abnormal motion relative to the entire stationary scene, an intuitive response is that the dynamic points often have larger photometric re-projection errors between the re-projected point and the detected feature point. Finally, we also note that the feature with the same dynamic characteristics generally clusters together. Therefore, we used the motion model, i.e., the pose transformation transform between the previous frame $prev_{f}$ and the previous frame of the $prev_{f}$ , to estimate the location of the matched point features in the current frame. Then, We calculated and averaged the sum of the squared Euclidean distance between the matched point features and the estimated point features of the detected point features in all the above grids. Once the value exceeds the threshold, the grid and its surrounding 8 grids are defined as the dynamic grid, and the features in the dynamic grid are dynamic. For more details about the dynamic grid, the algorithm can refer to DynPL-SVO.

However, as we said above, the line segment with a large span and the different directions does not have a suitable structure to envelop it as the grid for the point. The above-mismatched line feature work has proved the effectiveness of the LLG method, hence, we extend the $dynamic$ $grid$ algorithm from point features to the line features in this work. As shown in figure 6, there is dynamic object A with severy line features ( $l_{a1},l_{a2},l_{a3}$ ) whose projection constitute $LLG_{pa}$ in the previous frame, during a sample time, the object A move to the B, and the corresponding dynamic line features ( $l_{b1},l_{b2},l_{b3}$ ) projected to current frame and constitute $LLG_{cb}$ . In this step, we still used the motion model $T_{pp}$ of the previous frame to estimate the location of the tracked line features in the current frame, such as lines in the $LLG_{ca}$ . And we defined the distance between the estimated lines in the $LLG_{ca}$ and the detected lines in the $LLG_{cb}$ as the dynamic distance of the line feature. Like the point features, we calculated and averaged the sum of the squared dynamic distance of line features in all the $LLG_{cb}$ . Once the value exceeds the threshold, the $LLG_{cb}$ is defined as $dynamic$ $LLG$ , and the line features in the $dynamic$ $LLG$ are the dynamic feature. After dynamic feature detection, we discard dynamic features and use the remaining static ones to estimate a more accurate camera pose for the current frame. These steps are summarized in Algorithm 1.

Algorithm 1 : Dynamic line detecting using

dynamic

LLG

1:Motion model

\mathbf{T}_{pp}

of the previous frame

prev\_f

, the matched line feature set between the current frame

curr\_f

and

prev\_f

;

2:The

dynamic

LLGs

3:for

each

LLG_{c}\in curr\_f

4: for

each

l_{j}\in LLG_{c}

e_{dyna}

LineErr(l_{j},\mathbf{T}_{pp})

6: end for

7: if

e_{dyna}>\rho

then

LLG_{c}

insert to

DynaLLGs

9: end if

10:end for

11:return

DynaLLGs

III-B4 Pose estimation

We next briefly review the line features performed in the pose estimation. The point measurements are treated in a standard way as method [26], no more details in the following. Next, we introduce a more stable line back-project model and derive its corresponding jacobian. We had obtained the line features in a sequence of images and their plural positions in camera frames, we transform the camera ego-motion estimation into a non-linear least-square equation formed by the projection constraints of the corresponding line features in the previous frame and in the current frame. Continuing the pose estimation scheme proposed in the method [20], we introduce the horizontal re-projection error of line features into the system as additional reference information to assist in pose estimation. So the non-linear least-square equation of our approach is shown in (3). For a description of the detail, the reader is referred to [20].

\begin{split}\xi^{*}&=\underset{\xi}{arg\ min}\Bigg{[}\sum_{i=1}^{m}{e_{p}^{i}(\xi)}^{T}\Sigma_{e_{p}^{i}}^{-1}e_{p}^{i}(\xi)+\\ &\sum_{j=1}^{n}{e_{l_{v}}^{j}(\xi)}^{T}\Sigma_{e_{l_{v}}^{j}}^{-1}e_{l_{v}}^{j}(\xi)+\sum_{k=1}^{q}{e_{l_{h}}^{k}(\xi)}^{T}\Sigma_{e_{l_{h}}^{k}}^{-1}e_{l_{h}}^{k}(\xi)\Bigg{]}\end{split}

(3)

where $m,\,n$ and $q$ denoted the number of points, all lines, and the lines whose endpoints are not close to the edge of the image, respectively. It consisted of the point re-projection error $e_{i}^{p}$ , vertical and horizontal re-projection error of line feature $e_{l_{v}}^{j},e_{l_{h}}^{k}$ . The matrices $\Sigma^{-1}$ in (3) represent the inverse covariance matrices related to the uncertainty of each re-projection error.

In this step, we force on the new representation of line features and its re-projection jacobian utilized in the motion estimation. There are various forms of 3D line representation, we assumed the line $L$ endpoints are $x_{s}(x1,y1,z1)$ and $x_{e}(x2,y2,z2)$ , respectively. Hence, the line in the word can be respected as follows:

L\Rightarrow\frac{x-x_{1}}{l}=\frac{y-y_{1}}{m}=\frac{z-z_{1}}{n}

(4)

where the $l$ , $m$ , and $n$ refer to the $x_{2}-x_{1}$ , $y_{2}-y_{1}$ , and $z_{2}-z_{1}$ , respectively. Furthermore, the line on the normalized plane, i.e., $z_{1}=1$ and $z_{2}=1$ , can de described as:

L\Rightarrow\frac{x-x_{1}}{l}=\frac{y-y_{1}}{m}

(5)

In addition, we define the vertical re-projection errors of the line features were the distance from the endpoints of the re-projected line features $L_{pro}$ on the normalized plane to the detected line features $L_{det}$ in the current frame, shown as following:

	$\displaystyle d_{s}=\frac{[x_{s}-x_{1},\ y_{s}-y_{1}]\times[l,\ m]}{\sqrt{l^{2}+m^{2}}}$		(6)
	$\displaystyle d_{e}=\frac{[x_{e}-x_{2},\ y_{e}-y_{2}]\times[l,\ m]}{\sqrt{l^{2}+m^{2}}}$		(6)

where the $d_{s}$ and $d_{e}$ were the distance of the start point and the end point between the line features, the $x_{s},y_{s}$ and $x_{e},y_{e}$ denote the coordinate value of the detected line feature start point and end point on the normalized plane respectively, and the $x_{1},y_{1}$ and $x_{2},y_{3}$ represent the coordinate value of the projected line feature start point and end point on the normalized plane respectively. Thus, we can obtain the Jacobian of line feature vertical re-projection error using the point jacobian can be expressed as follows:

$\displaystyle\frac{\partial d}{\partial\delta\xi}$	$\displaystyle=\frac{\frac{[x_{s}-x_{1},\ y_{s}-y_{1}]\times[l,\ m]}{\sqrt{l^{2}+m^{2}}}}{\partial\delta\xi}$	(7)
	$\displaystyle=\frac{\partial(m\,e_{s}-l\,e_{e})}{\partial\delta\xi}\frac{1}{\sqrt{l^{2}+m^{2}}}$
	$\displaystyle=\begin{bmatrix}m&-l\end{bmatrix}\frac{\partial\begin{bmatrix}e_{s}\\ e_{e}\end{bmatrix}}{\partial\delta\xi}\frac{1}{\sqrt{l^{2}+m^{2}}}$

where the $e_{s}$ and $e_{e}$ represent the re-projection error between the line features start point and end point, respectively.

Finally, we obtain the incremental motion estimation between the two consecutive frames, which can be modeled by the non-linear least-square equation (3) with jacobian of all features.

III-B5 Gloable gray similar (GGS) algorithm

Compared with the classical feature-based algorithm, such as ORB_SLAM2, our method utilize the pixel-level intensity as image similarity evaluation parameters instead of features and designed the keyframe selection and the loop closure detection based on global gray similar (GGS). For every input frame, in order to improve the real-time of the system, the GGS computing, and the feature extraction for left and right are in parallel. The GGS computing steps are given as shown in figure 7. First, We grayed the left image of the new frame and performed scaling with the scaling factor (set to 0.8 in our experiments). Then, we evenly divide the original gray-scale image and the scaled image into 3 $\times$ 4 grids (see figure 7 (a)) and save the pixel intensity distribution in each grid separately, i.e., each frame GGS can be respected as 24 integer arrays with a size of 256 shown as figure 7 (b), (c). It is worth noting that there is work in the system based on the GGS of the frame, such as keyframe selection and loop closure detection.

III-C Local mapping

The local mapping thread is activated when the frame is selected as the keyframe. In this section, we describes the behavior of the local mapping which can be divide two subsection: $local$ $map$ $updata$ and $local$ $map$ $BA$ .

III-C1 Local map updata

The local map consists of keyframes, 3D landmarks both points and lines, and the co-visibility graph which involve the nodes, i.e., the keyframe poses, and the edges between keyframes when sharing a minimum number of landmarks (set to 20 in our experiments). Thus, the local map update only when a new keyframe is selected and inserted. In this work, we present a time-saving keyframe selection based on the GGS of the frames and the effectiveness of the strategy is verified by the effect of the system. Since the similarity between local frames decreases with sample time and the adjacent frames with high similarity. Thus, we calculation the GGS threshold $GGS_{th}$ through the following expression:

GGS_{th}=GGS_{pKF}+0.4*(GGS_{pKF+1}-GGS_{ppKF+1})

(8)

where $GGS_{pKF}$ and $GGS_{pKF+1}$ represent the GGS of the previous frame $prev\_f$ and the GGS of its first consecutive frame, and the $GGS_{ppKF+1}$ refer to the GGS of the previous frame of the $prev\_f$ . If the GGS of current frame exceed the $GGS_{th}$ , then the current frame is inserted into the system as a new KF, updating the new keyframe GGS and waiting for the next keyframe selected.

Ones the new keyframe was detected, we insert the keyframe into the back-end of the system and update the local map to optimate the element state in the local map. First, we build data association between the current KF with local KFs and obtain a consistent set of common features observed in them. Furthermore, for the new common features observed in the current KF and previous KF, creating the corresponding 3D landmark and insert them into the local map.

III-C2 Local BA

To eliminate the drift of element states in the local map, such as KFs transformation, 3D point pose, and the 3D line landmark length and direction in the world coordinate. After the KF insertion, the next step is to build the co-visibility graph model and perform a local BA to optimize the element in the local map. First, we need to build data association between the KFs which connect with newly inserted keyframes that share at least 20 landmarks, and common features observed between the KEs in the co-visibility graph. These will be used as elements to build optimization models and perform BA which minimizes the projection errors between the observations and the landmarks projected to the frames where they were observed. The local BA pipelines are similar to the ones presented in the PL-SLAM. It should also be underlined that we follow a similar approach to the one described in section III-B4 as it introduced line horizontal re-projection error into the BA framework.

III-D Loop closure detection

The loop closure detection aims to reduce the trajectory estimation drift and improve the accuracy of building maps. The essence of loop detection is to identify the similarity between the field of view and optimize the pose graph in the loop. It is important to remark that, in this paper, we propose an accurate and fast loop closure detection approach based on the GGS of inserted keyframes, which are not needed to build huge offline or online feature dictionaries from different image datasets. In the following, we first introduced detecting loop closures from the GGS approach and then described the correction of the pose estimations of the KFs involved in the loop.

III-D1 Loop detection

For each input keyframe, we computed its GGS in section III-B5 and record the similarity with historical KFs. Compared to building the huge BoW model for point and line features in the frame and computing image similar to all historical keyframes by dictionary model, the loop closure detection based on the GGS has an SOTA performance with BoW and better real-time. No matter the GGS-based or Bow-based LCD approachs, there are two requirements to be met in the candidate detection step, first, toleranting the position of the image to be identified, i.e., the images near the correct loop position can be detected as a candidate. The second, robustness to the dynamic scenes, i.e., when the system is located in the correct loop position and there is a new object exited in the view compared with the last time such as cars, which seriously reduce the LCD performance (see figure 8). Considering the above challenges, we define the similarity, $sim_{v}$ between two images by GGS as follows:

	$\displaystyle sim_{v}=\frac{1}{M*N}\sum_{i=1}^{12}\min\left\\|hist_{a}(p,\,i)-hist_{b}(q,\,i)\right\\|,$		(9)
	$\displaystyle p,\,q=\{0,\,1\}$		(9)

Where $M,\,N$ denote the size of the image and the $hist(0,\,i)$ and $hist(1,\,i)$ represent the ith grid gray distribution histogram value of the original and scaled image, respectively.

It is noticed that the similar evaluation based on the GGS through the scaled scheme to meet the LCD requirement tolerant to the identity image, and the grid strategy in the GGS computign process improves the LCD performance in the dynamic scenes. To avoid the greatest negative impact of incorrect loop detection on the system, we further filter the candidates with the following tests. First, the ratio of matched feature and the transformation between the current keyframe and looped keyframe are important informations for checking the consistency of the loop closure candidate. Hence, we merged the feature matching results and the transformation information into a scalar, through the following expression:

lc_{rat}=\alpha\left\|\log(\mathbf{T}_{i}\mathbf{T}_{j}^{-1})\right\|*ratio_{inl}

(10)

where the $\mathbf{T}_{i},\,\mathbf{T}_{j}$ represent the current keyframe and candidate looped keyframe transformation, respectively, and the $ratio_{inl}$ was the initial inlier features ratio (was set 50% in our experimentation, and the $\alpha$ in the expression was set 0.001). If the value of $lc_{rat}$ lies below some pre-established threshold, which in our experiments has been set to 20%, then the candidate was the outlier. Second, generally, adjacent view images have similar characteristics, including loop detection characteristics. Thus, we assume that the preview keyframe has significant similarity with the looped keyframe, and there is a significant similarity between the current keyframe and the adjacent frames of the looped keyframe. The above conditions can ensure the correctness and efficiency of the LCD.

III-D2 Loop correction

After obtained the correct loop closure, we computed the transformation between the current keyframe and looped keyframe and optimized the pose of all the KFs in the loop. This is a typically graph optimization problem, which can be solved by formulating the problem as a PGO. Establishing the corresponding graph model whose node is the pose of the keyframe and the edges are the data association between keyframes which has at least 20% commonly features. In this paper, we solved the problem using the g2o library [40] as a method PL-SLAM. Updating the pose of keyframes in the loop and optimizing the state of the landmarks observed by these keyframes to correct the map information. Finally, when the system is finished, we implemented the global BA to optimize all the keyframe pose and global map landmark state.

IV Experimental Results and Analysis

In this section, we evaluated the proposed method on the public KITTI dataset [33] and EuRoC MAV dataset [34]. To verify the performance of our method, we compared the accuracy of our algorithm with the state-of-the-art VO systems that can run with stereo RGB data. The selected methods were stvo pl [27], ORB SLAM2 [6] and BoPLW [35]. In order to fairly compare all methods, we only kept the front end of ORB SLAM2 and BoPLW to form VO systems. In addition, to illustrate the benefits of the line feature horizontal re-projection error on the accuracy of motion estimation, we showed the results of our proposal with only horizontal or vertical re-projection errors. Finally, we evaluated the ability of dynamic grids to cope with dynamic scenes on highly dynamic sequences, where the trajectory of a stereo camera in several video sequences with moving objects was estimated.

All the experiments were tested on an Intel Core i5-4210U CPU @ 1.70GHz × 4 and 16 GB RAM without GPU. Since there are no dynamic objects in the EuRoC MAV dataset, we only estimated the trajectory of our method with or without dynamic grids in the KITTI dataset.

IV-A Dynamic Objects Detection Experiments

TABLE I: Mean Absolute RMSE errors on the KITTI Dataset.

	wo/dynamic grid & LLG		w/dynamic grid		w/dynamic LLG		w/dynamic grid & LLG
Seq.	t	R	t	R	t	R	t	R
00	4.414051	1.141609	4.1586	1.036655	4.467157	1.189108	4.098689	1.027319
01	69.670172	4.016659	64.208694	5.589265	69.674846	4.020314	64.200222	5.589637
02	9.858701	2.0002	9.097316	1.776921	9.797926	1.976967	18.140115	3.742419
03	5.776864	4.315284	5.807239	4.157304	5.77685	4.315294	5.807236	4.157297
04	2.622336	49.376319	2.253605	43.416013	2.622333	49.376311	2.25378	43.427358
05	2.069462	0.676671	2.021802	0.717747	2.060681	0.663488	2.014489	0.705174
06	4.577771	3.468494	4.537722	3.543528	4.558604	3.40538	4.538055	3.49156
07	1.207638	0.615184	1.109332	0.57397	1.207379	0.614653	1.104154	0.573476
08	5.647390	1.940569	4.723587	1.512542	5.625841	1.938448	4.707446	1.508939
09	4.447918	1.731804	4.126671	1.412788	4.447922	1.731822	4.127799	1.413146
10	2.067542	1.2582	1.975722	1.127534	2.067256	1.258117	1.976234	1.127434

IV-A1 KITTI Dataset

IV-A2 Argoverse 1 Dataset

Dealing with dynamic scenes is the one of major connect of this paper. The dynamic scenes approach in the PLD-SLAM include two parts : the dynamic grid and dynamic LLG. The dynamic grid follows [20], and the latter is used to detect the dynamic line features. To evaluate the ability of the system to cope with dynamic scenes, we test the proposed method on well-known dataset KITTI and the Argoverse 1 dataset [41], which have 7 high-resolution ring cameras (1920 × 1200) recording at 30 Hz with overlapped field of view providing 360◦ coverage. In addition, there are 2 front-facing stereo cameras (2056×2464) sampled at 5 Hz. The dynamic scenes containing moving objects like cars and pedestrians in some image sequences, showed as in Fiurexxx have a great impact on the performance of the VO/vSLAM systems. The dynamic object detection verified that the dynamic grids and dynamic LLG can accurately identify the dynamic regions to avoid the influence of dynamic features on the accuracy.

To further represent the error in the quantitative analysis, we compared the absolute and relative root-mean-square error (RMSE) with and without dynamic grid approaches in the KITTI dataset, as shown in table xxx.

IV-B Loop Closure Detection Experiments

IV-B1 KITTI Dataset

IV-B2 NEU Dataset

To evaluate the proposed loop closure detection, we test its performance on the KITTI dataset and the high similarity scenes indoor stereo sequences namely $NEU$ $Dataset$ , we have record using MYNT@EYE 120° stereo camera. In addition, to further explane the performance the proposed LCD, we compare the LCD with GGS to STOA LCD approach based on BoW model and openMAP, which … and the result explane in the table xxx and Figure xxx showed

IV-C Localization Accuracy and Efficenty Experiments

IV-C1 KITTI Dataset

IV-C2 EuRoC MAV Dataset

IV-C3 Argoverse 1 Dataset

IV-C4 NEU Dataset

We evaluate the performance of PLD-SLAM in terms of the localization accuracy and computation time. To comprehensively evaluate the implementation effect of the method, we used the public dataset KITTI, EuRoC dataset, in addition, the short term daatset Argoverse 1 and the high similiarity dataset collected by us to be used in the experimental analysis.

IV-C5 KITTI Dataset

The KITTI dataset

References

[1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,” IEEE Transactions on robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[2] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE conference on computer vision and pattern recognition, pp. 3354–3361, IEEE, 2012.
[3] J. C. Kinsey, R. M. Eustice, and L. L. Whitcomb, “A survey of underwater vehicle navigation: Recent advances and new challenges,” in IFAC conference of manoeuvering and control of marine craft, vol. 88, pp. 1–12, Lisbon, 2006.
[4] M. Faessler, F. Fontana, C. Forster, E. Mueggler, M. Pizzoli, and D. Scaramuzza, “Autonomous, vision-based flight and live dense 3d mapping with a quadrotor micro aerial vehicle,” Journal of Field Robotics, vol. 33, no. 4, pp. 431–450, 2016.
[5] J. Zhang and S. Singh, “Low-drift and real-time lidar odometry and mapping,” Autonomous Robots, vol. 41, no. 2, pp. 401–416, 2017.
[6] Y. Huang, T. Shan, F. Chen, and B. Englot, “Disco-slam: Distributed scan context-enabled multi-robot lidar slam with two-stage global-local graph optimization,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 1150–1157, 2021.
[7] H. Durrant-Whyte and T. Bailey, “Simultaneous localization and mapping: part i,” IEEE robotics & automation magazine, vol. 13, no. 2, pp. 99–110, 2006.
[8] Q. Tong, P. Li, and S. Shen, “Vins-mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Transactions on Robotics, vol. PP, no. 99, pp. 1–17, 2017.
[9] A. Pumarola, A. Vakhitov, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer, “Pl-slam: Real-time monocular visual slam with points and lines,” in 2017 IEEE international conference on robotics and automation (ICRA), pp. 4503–4508, IEEE, 2017.
[10] K. Sun, W. Tao, and Y. Qian, “Guide to match: multi-layer feature matching with a hybrid gaussian mixture model,” IEEE Transactions on Multimedia, vol. 22, no. 9, pp. 2246–2261, 2019.
[11] Q. Fu, H. Yu, X. Wang, Z. Yang, Y. He, H. Zhang, and A. Mian, “Fast orb-slam without keypoint descriptors,” IEEE Transactions on Image Processing, 2021.
[12] R. G. Von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “Lsd: A fast line segment detector with a false detection control,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 4, pp. 722–732, 2008.
[13] R. Gomez-Ojeda, F.-A. Moreno, D. Zuniga-Noël, D. Scaramuzza, and J. Gonzalez-Jimenez, “Pl-slam: A stereo slam system through the combination of points and line segments,” IEEE Transactions on Robotics, vol. 35, no. 3, pp. 734–746, 2019.
[14] X. Zuo, X. Xie, Y. Liu, and G. Huang, “Robust visual slam with point and line features,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1775–1782, IEEE, 2017.
[15] J. Ma, J. Zhao, and A. L. Yuille, “Non-rigid point set registration by preserving global and local structures,” IEEE Transactions on image Processing, vol. 25, no. 1, pp. 53–64, 2015.
[16] J. Ma, J. Zhao, J. Tian, A. L. Yuille, and Z. Tu, “Robust point matching via vector field consensus,” IEEE Transactions on Image Processing, vol. 23, no. 4, pp. 1706–1721, 2014.
[17] K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua, “Learning to find good correspondences,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2666–2674, 2018.
[18] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a versatile and accurate monocular slam system,” IEEE transactions on robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[19] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “Svo: Semidirect visual odometry for monocular and multicamera systems,” IEEE Transactions on Robotics, vol. 33, no. 2, pp. 249–265, 2016.
[20] X. Ma, Y. Wang, B. Zhang, H. Ma, and C. Luo, “Dynpl-svo: A new method using point and line features for stereo visual odometry in dynamic scenes,” ArXiv, vol. abs/2205.08207, 2022.
[21] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer, “Fast and incremental method for loop-closure detection using bags of visual words,” IEEE transactions on robotics, vol. 24, no. 5, pp. 1027–1037, 2008.
[22] D. Cattaneo, M. Vaghi, and A. Valada, “Lcdnet: Deep loop closure detection and point cloud registration for lidar slam,” IEEE Transactions on Robotics, 2022.
[23] D. Gálvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Transactions on Robotics, vol. 28, no. 5, pp. 1188–1197, 2012.
[24] C. Kerl, J. Sturm, and D. Cremers, “Dense visual slam for rgb-d cameras,” in 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2100–2106, IEEE, 2013.
[25] G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” in 2007 6th IEEE and ACM international symposium on mixed and augmented reality, pp. 225–234, IEEE, 2007.
[26] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE transactions on robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
[27] Y. Zhou, G. Gallego, H. Rebecq, L. Kneip, H. Li, and D. Scaramuzza, “Semi-dense 3d reconstruction with a stereo event camera,” in Proceedings of the European conference on computer vision (ECCV), pp. 235–251, 2018.
[28] J. Engel, T. Schöps, and D. Cremers, “Lsd-slam: Large-scale direct monocular slam,” in European conference on computer vision, pp. 834–849, Springer, 2014.
[29] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 3, pp. 611–625, 2017.
[30] H. Javidnia and P. Corcoran, “Accurate depth map estimation from small motions,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 2453–2461, 2017.
[31] S. J. Davey, “Simultaneous localization and map building using the probabilistic multi-hypothesis tracker,” IEEE transactions on Robotics, vol. 23, no. 2, pp. 271–280, 2007.
[32] G. Zhang, J. H. Lee, J. Lim, and I. H. Suh, “Building a 3-d line-based map using stereo slam,” IEEE Transactions on Robotics, vol. 31, no. 6, pp. 1364–1377, 2015.
[33] R. Gomez-Ojeda, J. Briales, and J. Gonzalez-Jimenez, “Pl-svo: Semi-direct monocular visual odometry by combining points and line segments,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4211–4216, IEEE, 2016.
[34] A. Viswanath, R. K. Behera, V. Senthamilarasu, and K. Kutty, “Background modelling from a moving camera,” Procedia Computer Science, vol. 58, pp. 289–296, 2015.
[35] H. Kim, P. Kim, and H. J. Kim, “Moving object detection for visual odometry in a dynamic environment based on occlusion accumulation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 8658–8664, IEEE, 2020.
[36] S. Minaeian, J. Liu, and Y.-J. Son, “Effective and efficient detection of moving targets from a uav’s camera,” IEEE transactions on intelligent transportation systems, vol. 19, no. 2, pp. 497–506, 2018.
[37] Z.-J. Du, S.-S. Huang, T.-J. Mu, Q. Zhao, R. Martin, and K. Xu, “Accurate dynamic slam using crf-based long-term consistency,” IEEE Transactions on Visualization and Computer Graphics, 2020.
[38] H. Bilen and A. Vedaldi, “Weakly supervised deep detection networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2846–2854, 2016.
[39] Q. Wang, J. Gao, and Y. Yuan, “A joint convolutional neural networks and context transfer for street scenes labeling,” IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 5, pp. 1457–1470, 2017.
[40] G. Grisetti, R. Kümmerle, H. Strasdat, and K. Konolige, “g2o: A general framework for (hyper) graph optimization,” in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pp. 9–13, 2011.
[41] M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, et al., “Argoverse: 3d tracking and forecasting with rich maps,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8748–8757, 2019.