LiDAR-Inertial Odometry in Dynamic Driving Scenarios using Label Consistency Detection
Abstract
In this paper, a LiDAR-inertial odometry (LIO) method that eliminates the influence of moving objects in dynamic driving scenarios is proposed. This method constructs binarized labels for the 3D points of the current sweep, and utilizes the label difference between each point and its surrounding points in the map to identify moving objects. Firstly, the binarized labels, i.e., ground and non-ground, are assigned to each 3D point in the current sweep using ground segmentation. In actual driving scenarios, dynamic objects are always located on the ground. Most points scanned from moving objects cannot coincide with any existing structures in space, while for the minority of moving objects' points that are close to the ground, their labels exhibit differences from the surrounding ground points. Thus, the points on moving objects are identified either by the lack of nearest neighbors in the map or by the inconsistency with the labels of surrounding ground points. The nearest neighbors are localized in the global map by a voxel-location-based nearest neighbor search, and the consistency is evaluated by comparing labels with those of the nearest neighbors, without involving any massive computations. Finally, the points on moving objects are removed. The proposed method is embedded into a self-developed LIO system (i.e., Dynamic-LIO), evaluated on six public datasets, and tested in both dynamic and static environments. Experimental results demonstrate that our method can identify moving objects with extremely low computational overhead (i.e., 1~9 ms/sweep), and that our Dynamic-LIO can achieve state-of-the-art pose estimation accuracy in both static and dynamic scenarios. We have released the source code of this work for the development of the community.
Index Terms:
SLAM, localization, sensor fusion.

I Introduction
In recent years, 3D light detection and ranging (LiDAR) based state estimation methods, including LiDAR-inertial odometry (LIO), have played an important role in autonomous driving thanks to the strong capability of LiDAR to perceive 3D space. These methods can provide 15-degree-of-freedom (DOF) state estimation of the vehicle platform and recover the 3D structure of the surrounding environment in real time. However, the scenarios where they apply are strictly limited by the assumption of a static environment. In actual driving scenarios, moving vehicles or pedestrians can leave ghost tracks in the map (as shown in Fig. 1 (a)), leading to cumulative errors in state estimation and providing erroneous observational information for obstacle avoidance. Given the necessity for real-time state estimation and mapping in a LIO system, the computational cost of removing dynamic objects must stay within the processing time budget for a single sweep. Hence, efficiently identifying moving objects from 3D point clouds is crucial.

To address the issues of poor map reconstruction and inaccurate state estimation caused by ghost tracks, researchers employ a range of approaches, such as point correlation, visibility, occupancy probability and semantic information, to identify and remove dynamic 3D points from input LiDAR sweeps. For point correlation [8], correlation exists between static points, but there is no correlation between dynamic and static points. The connectivity of map points can thus be exploited to separate moving objects from the static environment. However, the construction and maintenance of a correlation graph involve computing pairwise Euclidean distances over batches of 3D points, and the entire process requires substantial computational resources. For visibility [18, 39, 10], the static pixels on re-projected image planes remain invariant across multiple sweeps, whereas dynamic pixels undergo displacements. However, the generation of the re-projected image plane entails mapping a large number of LiDAR points from their original 3D space onto the 2D image plane. For occupancy probability [28, 38], the occupancy status of voxels corresponding to static environments remains constant over time, whereas the occupancy status of voxels occupied by dynamic objects changes with time. However, the estimation of an occupancy grid map needs to combine multiple nearest static submaps and count the occupancy status of each voxel in each submap. The three aforementioned methods leverage extensive quantitative geometric computations and global statistics to identify moving objects in 3D point clouds, and they incur high computational costs, which limits their integration into LIO systems. Deep learning methods [7, 21] use learned semantic information for rapid qualitative separation of dynamic objects. However, they depend on extensive labeled data, risk failure on unlabeled classes, and necessitate powerful Graphics Processing Units (GPUs) for real-time operation, which can detract from resources available for subsequent tasks such as trajectory planning.
In contrast to existing approaches involving extensive quantitative geometric computations, global statistics or prior semantic information to distinguish moving objects from static environments, in this study, a qualitative identification criterion based on the label difference with nearest neighbors is used to identify 3D points on moving objects. Since moving vehicles and pedestrians are both situated on the ground, the potential dynamic points only exist in non-ground space, while the bases of moving objects are adjacent to the ground. Based on this characteristic, a label consistency detection method, which can rapidly identify moving objects without any prior semantic information, is proposed to distinguish moving objects from static environments in driving scenarios. The core idea of the proposed label consistency detection method consists of two parts. First, a fast 2D connected component method [14] is utilized to divide the 3D points of the current sweep into binarized labels, i.e., ground points and non-ground points. Then, dynamic points are determined by comparing label consistency with nearest neighbors, which are directly localized by a voxel-location-based nearest neighbor search method. The whole dynamic point identification process, including voxel-location-based nearest neighbor search and label consistency comparison, is devoid of any operations involving batch geometric computations or global statistics. This characteristic ensures that the proposed method has the advantage of an extremely low computational overhead (i.e., 1~9 ms/sweep). Finally, a self-developed LIO system (i.e., Dynamic-LIO) that uses this label consistency detection method is proposed to eliminate the influence of moving objects in driving scenarios. In order to guarantee that the final reconstructed map comprises solely static points, any detected dynamic points are excluded from the global map (as shown in Fig. 1 (b)). Subsequently, the clean static global map can be used to perform label consistency detection for the next sweep, ensuring the sustainability of the proposed method.
Experimental results on three public datasets of dynamic environments, i.e., SemanticKITTI [1], UrbanLoco-CA [32] and UrbanNav [16], demonstrate that our method achieves comparable preservation rate (PR), rejection rate (RR) and absolute trajectory error (ATE), and significantly outperforms online dynamic point detection methods [10, 38, 25, 33] in terms of extremely low computational overhead (1~9 ms/sweep). In addition, experimental results on three public datasets of static environments, i.e., NCLT [3], UTBM [37] and UrbanLoco-HK [32], demonstrate that our Dynamic-LIO also outperforms state-of-the-art LIO systems for static scenes in terms of smaller ATE.
To summarize, the main contributions of this work are threefold: 1) We propose a label consistency detection method for fast identification of 3D dynamic points; it circumvents extensive geometric computations and global statistics and is thus lightweight. 2) We develop a LIO system and integrate the proposed label consistency detection method into it in a unified manner, improving the accuracy of pose estimation in dynamic scenes by eliminating the ghost tracks in the reconstructed map. 3) We have released the source code of our approach at https://github.com/ZikangYuan/dynamic_lio to facilitate the development of the community.
The rest of this paper is structured as follows. Sec. II reviews the relevant literature. Sec. III provides preliminaries. Sec. IV details our label consistency detection method and Sec. V introduces our system Dynamic-LIO, followed by experimental evaluation in Sec. VI. Sec. VII concludes the paper.
II Related Work
In this section, we review the related work on existing 3D dynamic point detection approaches without deep learning [24, 8, 39, 18, 15, 27, 20, 5, 28, 38] and current mainstream LIO systems for static and dynamic scenes [30, 26, 36, 12, 4, 6, 42, 41, 23, 25, 33, 34, 44]. Although there are some deep learning based 3D dynamic point detection methods [7, 21], they are only weakly relevant to this work, so we omit a detailed discussion of them.
Pomerleau et al. [24] utilized the motion pattern of points to represent correlation. They calculated the motion pattern of each 3D point, inferred the dominant motion patterns within the map, and then determined the points that do not fit the dominant motion patterns as dynamic points. Dai et al. [8] utilized the relative position between two points to represent correlation, and then utilized the amplitude of relative position change over time as the criterion of consistency to identify dynamic points. However, the computational overhead of calculating motion patterns and maintaining map point correlations in large-scale outdoor scenarios is prohibitive. Yoon et al. [39] proposed to simply query one sweep against another, and identify points with evident visibility differences as dynamic points. Removert [18] proposed a multi-resolution range image-based false prediction reverting algorithm. This method first conservatively retains definite static points and iteratively recovers more uncertain static points by enlarging the query-to-map association window size. However, visibility-based approaches usually suffer from incidence angle ambiguity and occlusion issues. In addition, the generation of multiple projections on the spherical image plane and the assignment of a static value to each point in visibility-based approaches require high computational overhead. OctoMap [15] first proposed a framework to generate volumetric 3D environment models, which is based on octrees and uses probabilistic occupancy estimation. Given a registered set of 3D points, Schauer et al. [27] build a regular voxel occupancy grid and then traverse it along the lines of sight between the sensor and the measured points to identify the differences in volumetric occupancy between multiple sweeps. Erasor [20] proposed the concept of pseudo occupancy to express the occupancy of unit space and then discriminate spaces of varying occupancy. Then, the region-wise ground plane fitting (R-GPF) method is adopted to distinguish static points from dynamic points within the candidate bins that potentially contain dynamic points. DORF [5] proposed a novel coarse-to-fine offline framework that exploits global 4D spatial-temporal LiDAR information to achieve clean static point cloud map generation. DORF first conservatively preserves the definite static points leveraging the receding horizon sampling (RHS) mechanism, then gradually recovers more ambiguous static points, guided by the inherent characteristics of dynamic objects in urban environments. Dynablox [28] proposed to incrementally estimate high-confidence free-space areas by modeling and accounting for sensing, state estimation, and mapping limitations during online robot operation, and can achieve robust moving object detection in complex unstructured environments. RH-Map [38] proposed a novel map construction framework based on a 3D region-wise hash map structure, which adopts a two-layer 3D region-wise hash map structure and region-wise ground plane estimation for dynamic object removal. Occupancy map-based approaches are usually accompanied by nearest neighbor search, confidence calculation, occupancy probability statistics, relative spatial position calculation and other operations requiring batch geometric computation, which cause a significant computational burden.
In recent years, various LIO systems have been proposed in the robotics community. LIO-SAM [30] first formulated LIO as a factor graph, which allows the incorporation of a multitude of relative and absolute measurements, including loop closures, as factors from different sources into the system. In LINS [26], a pioneering integration of a 6-axis IMU and a 3D LiDAR was accomplished within an error state iterated Kalman filter (ESIKF) framework. This design ensures that the computational demands of the system remain tractable. Based on mathematical foundations, Fast-LIO [36] adopted a technique for solving the Kalman gain [31], circumventing the need for high-order matrix inversion and thereby significantly alleviating the computational load. Building upon the advancements of Fast-LIO, Fast-LIO2 [35] introduced an innovative ikd-tree algorithm [2]. Compared to the conventional kd-tree, this algorithm offers reduced temporal expenditure in processes such as tree construction, traversal, and element removal. Point-LIO [12] proposed a point-by-point LIO framework that updates the state at each LiDAR point measurement, which allows an extremely high-frequency output. DLIO [4] proposed to preserve third-order motion terms in state prediction and point distortion calibration, thereby facilitating more precise pose estimation. iG-LIO [6] integrated generalized-ICP (GICP) constraints and inertial constraints into a unified estimation framework. In addition, iG-LIO employed a voxel-based surface covariance estimator to estimate the surface covariances of scans, and utilized an incremental voxel map to represent the probabilistic models of surrounding environments. Semi-Elastic-LIO [42] proposed a semi-elastic optimization-based LiDAR-inertial state estimation method, which imparts sufficient elasticity to the state to allow it to be optimized to the correct value. SR-LIO [41] adopted the sweep reconstruction method [43, 40], which segments and reconstructs raw input sweeps from a spinning LiDAR to obtain reconstructed sweeps with higher frequency; consequently, the frequency of the estimated pose is also increased. Pfreundschuh et al. [23] proposed an end-to-end occupancy grid based pipeline that can automatically label a wide variety of arbitrary dynamic objects, and embedded this network into a LiDAR odometry system. RF-LIO [25] utilizes adaptive multi-resolution range images to first remove dynamic objects, and then matches LiDAR sweeps to the map for state estimation. ID-LIO [33] proposed a LiDAR-inertial odometry based on indexed points and a delayed removal strategy for dynamic scenes, which builds on LIO-SAM. Although RF-LIO and ID-LIO have the ability to perform state estimation in dynamic scenarios, their huge computational overhead makes them unable to run stably in real time.
III Preliminary
III-A Coordinate Systems
We denote ${}^{w}\mathbf{p}$, ${}^{l}\mathbf{p}$ and ${}^{i}\mathbf{p}$ as a 3D point expressed in the world coordinate, the LiDAR coordinate and the IMU coordinate, respectively. The world coordinate coincides with the IMU coordinate at the starting position.
We denote the LiDAR coordinate for taking the sweep at time $t_k$ as $l_k$ and the corresponding IMU coordinate at $t_k$ as $i_k$. The transformation matrix (i.e., the extrinsic parameters) from $l_k$ to $i_k$ is denoted as $\mathbf{T}_{l_k}^{i_k}$, which consists of a rotation matrix $\mathbf{R}_{l_k}^{i_k}$ and a translation vector $\mathbf{t}_{l_k}^{i_k}$. The extrinsic parameters are usually calibrated once offline and remain constant during online pose estimation; therefore, we can represent $\mathbf{T}_{l_k}^{i_k}$ as $\mathbf{T}_{l}^{i}$ for simplicity. The pose from the IMU coordinate $i_k$ to the world coordinate $w$ is strictly defined as $\mathbf{T}_{i_k}^{w} = \left[\mathbf{R}_{i_k}^{w} \mid \mathbf{t}_{i_k}^{w}\right]$.
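As a worked example of how this notation composes, a raw LiDAR point ${}^{l_k}\mathbf{p}$ can be transformed into the world coordinate by first applying the extrinsic parameters and then the estimated IMU pose:

$${}^{w}\mathbf{p} = \mathbf{R}_{i_k}^{w}\left(\mathbf{R}_{l}^{i}\,{}^{l_k}\mathbf{p} + \mathbf{t}_{l}^{i}\right) + \mathbf{t}_{i_k}^{w}.$$

This is the standard composition used whenever points of the current sweep are registered into the global map.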

III-B Voxel map Management
The entire system maintains two global maps: the tracking-map and the output map. The former is utilized for state estimation, while the latter is utilized for label consistency detection and serves as the final reconstruction outcome. In Dynamic-LIO, the tracking-map has already had the vast majority of dynamic points removed. However, to prevent over-filtering that could lead to insufficient geometric information for LIO, we refrain from further processing the tracking-map and instead focus on the output map (as illustrated in Sec. IV-E). Thus, compared to the tracking-map, the dynamic points in the output map are removed more thoroughly. Both the tracking-map and the output map are managed by voxels with a fixed resolution (unit: m), and each voxel contains a maximum of 20 points.
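To make the voxel-map bookkeeping concrete, below is a minimal C++ sketch of one possible hash-based voxel map under the assumptions stated in this section. The 20-point capacity comes from the text, while the type and function names (and the spatial hash) are hypothetical rather than taken from the released code:

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Point3f { float x, y, z; bool ground; };  // binarized label attached to each point

struct VoxelKey {  // integer voxel index obtained by flooring world coordinates
    int32_t x, y, z;
    bool operator==(const VoxelKey& o) const { return x == o.x && y == o.y && z == o.z; }
};

struct VoxelKeyHash {  // a common large-prime spatial hash for voxel grids
    size_t operator()(const VoxelKey& k) const {
        return (static_cast<size_t>(k.x) * 73856093u) ^
               (static_cast<size_t>(k.y) * 19349669u) ^
               (static_cast<size_t>(k.z) * 83492791u);
    }
};

class VoxelMap {
public:
    explicit VoxelMap(float resolution) : res_(resolution) {}

    VoxelKey KeyOf(const Point3f& p) const {
        return { static_cast<int32_t>(std::floor(p.x / res_)),
                 static_cast<int32_t>(std::floor(p.y / res_)),
                 static_cast<int32_t>(std::floor(p.z / res_)) };
    }

    void Insert(const Point3f& p) {  // each voxel stores at most 20 points
        auto& cell = grid_[KeyOf(p)];
        if (cell.size() < 20) cell.push_back(p);
    }

    // Average O(1) lookup of the points stored in a given voxel.
    const std::vector<Point3f>* QueryKey(const VoxelKey& k) const {
        auto it = grid_.find(k);
        return it == grid_.end() ? nullptr : &it->second;
    }

private:
    float res_;
    std::unordered_map<VoxelKey, std::vector<Point3f>, VoxelKeyHash> grid_;
};
```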
IV Label Consistency Detection
Label consistency detection aims to circumvent the batch geometric computations and global statistics that are prevalent in existing dynamic point detection approaches, thereby facilitating rapid identification of 3D dynamic points. To our knowledge, most existing methods primarily involve batch geometric computations and global statistics in two aspects: nearest neighbor search and consistency evaluation. Consequently, we are committed to making both of these aspects lightweight in our method.
The core premise of label consistency detection is that moving objects in driving scenarios are in contact with the ground. Under this premise, we first construct binarized labels (i.e., ground label and non-ground label) for each 3D point by segmenting the ground points from the current input sweep (as illustrated in Fig. 2 (a)). All ground points are inherently static, and the potential dynamic points are exclusively found among non-ground points. If we have already prepared the static global map at the previous moment, aside from the new points to be added at a greater distance, each static point at the current moment can find its corresponding nearest neighbor within the global map during map update. For LiDAR points scanned from moving objects, the lack of structural information within the global map prevents their current positions from coinciding with any existing static geometric structures in space (as illustrated in the green area of Fig. 2 (b-1)). Thus, most LiDAR points scanned from moving objects are unable to find nearest neighbors during registration, and we identify those points as dynamic points (shown as the green points in Fig. 2 (b-2)). As for the remaining small subset of LiDAR points (shown as the pink points in Fig. 2 (b-2)), they may find ground points as their nearest neighbors. We then determine whether to classify them as dynamic points according to the proportion of ground points within the nearest neighbors. It is evident that throughout the process of evaluating label consistency, we only need to calculate the proportion of ground points among the nearest neighbors, without engaging in any batch geometric computations or global statistics. In addition, we utilize the voxel-location-based nearest neighbor search to obtain nearest neighbors, which can be directly located without any quantitative geometric distance calculation. The lightweight design of these two core aspects ensures the low computational overhead of our method. Once the dynamic points at the current moment are identified, we utilize the estimated pose from LIO to register the static points into the global map to finish the map update, which can then be used to identify the dynamic points of the next sweep, thereby ensuring the sustainability of our method.
Specifically, the label consistency detection method is divided into five steps: binarized label construction, background separation, voxel-location-based nearest neighbor search, dynamic point determination and undetermined-point re-determination. In the following, we provide a detailed description of each step.

IV-A Binarized Label Construction
Before dynamic point identification, we first construct binarized descriptors, i.e., ground label and non-ground label, for each 3D point of the current sweep. We utilize a fast 2D connected component method [14], the same as that of LeGO-LOAM [29], to separate ground points from the current input sweep with very low computational cost. By default, the LiDAR is mounted horizontally on the vehicle, and the LiDAR's $z$-axis is perpendicular to the ground. If the LiDAR is mounted at an inclined angle, extrinsic parameters can be introduced to ensure that the $z$-axis remains perpendicular. After aligning the $z$-axis to the perpendicular direction, we configure the range image [22] with the number of LiDAR lines as the vertical axis and the horizontal resolution as the horizontal axis (Fig. 3 (a)), and position the 3D points of the current sweep at their corresponding locations within the range image according to their horizontal and vertical line indices. In the range image, each point at position $(i, j)$ is equipped with its $x$, $y$, $z$ coordinates, as well as a Boolean variable $g$, which denotes whether the point is a ground point. Initially, all $g$ are set to false, and the $g$ of the seed points at the bottom of the range image is set to true. Then, following the red arrows indicated in Fig. 3 (a), we calculate the pitch angles of the two 3D points at the adjacent positions $(i-1, j)$ and $(i+1, j)$ relative to that at $(i, j)$, respectively. The pitch angle of $(i+1, j)$ relative to $(i, j)$ can be calculated as:

$$\theta = \arctan\frac{z_{(i+1,j)} - z_{(i,j)}}{\sqrt{\left(x_{(i+1,j)} - x_{(i,j)}\right)^{2} + \left(y_{(i+1,j)} - y_{(i,j)}\right)^{2}}} \quad (1)$$
If the calculated pitch angle is less than a certain threshold (e.g., 5 degrees in our system), the $g$ of the corresponding point is set to true, which means the corresponding 3D point is determined to be a ground point. Similarly, we can calculate the pitch angle of $(i-1, j)$ relative to $(i, j)$, and set the value of $g$ for the point at $(i-1, j)$. If the 3D point located at $(i+1, j)$ or $(i-1, j)$ is determined to be a ground point, we follow the blue arrows indicated in Fig. 3 (a) to examine the other points adjacent to it. The entire process is executed recursively, continuing until all pixels in the range image have been visited or the recursive exit condition is met. The visualization of the segmented ground points is illustrated in Fig. 3 (b), where the orange points are labeled as "ground points" and the white points are labeled as "non-ground points".
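As a minimal sketch of Eq. (1) and the 5-degree test, the functions below compute the pitch angle between two vertically adjacent range-image points. The struct and function names are hypothetical, and the recursive region growing over the range image is omitted for brevity:

```cpp
#include <cmath>

struct RangePoint { float x, y, z; bool g = false; };  // g: ground label

// Pitch angle (in degrees) of point b relative to point a, as in Eq. (1).
inline float PitchAngleDeg(const RangePoint& a, const RangePoint& b) {
    const float dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
    return std::atan2(dz, std::sqrt(dx * dx + dy * dy)) * 180.0f / 3.14159265f;
}

// Ground test for two vertically adjacent range-image cells (5-degree threshold).
inline bool PassesGroundTest(const RangePoint& a, const RangePoint& b) {
    return std::fabs(PitchAngleDeg(a, b)) < 5.0f;
}
```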
IV-B Background Separation
In the process of executing label consistency detection, it is necessary to find the nearest neighbors for each point of the current sweep. Points that are close to the vehicle platform can reliably find their nearest neighbors, whereas points that are farther away may fail to do so due to the incomplete reconstruction of their locations. In outdoor driving scenarios (excluding extreme occlusion), the map structures within a 30-meter radius around the vehicle platform are usually already reconstructed. Therefore, we set an empirical threshold of 30 meters, and define points within 30 meters of the vehicle platform as fore-points and those beyond 30 meters as back-points. For fore-points and back-points, we employ determinations that are specifically tailored to their characteristics.
IV-C Voxel-Location-Based Nearest Neighbor Search

For a specific point $\mathbf{p}$ with the "non-ground point" label, to determine whether its label is consistent with the surrounding points in the global map, we first need to search for the nearest neighbors of $\mathbf{p}$. An intuitive alternative is to utilize the 8-nearest neighbor search, which is the same as the nearest neighbor search method for point-to-plane distance computation in LIO (as illustrated in Fig. 4 (a)). Specifically, we locate the voxel $V$ to which $\mathbf{p}$ belongs and the 8 voxels adjacent to $V$, and set all points in these voxels as candidate points. Subsequently, the 20 nearest points of $\mathbf{p}$ are identified from the 9 candidate voxels by comparing the magnitudes of their Euclidean distances to $\mathbf{p}$.
However, calculating the Euclidean distance of each candidate point to $\mathbf{p}$ is an extremely time-consuming process. When LIO builds the point-to-plane distance residuals, in order to ensure that the fitted plane reflects as much of the geometric information around $\mathbf{p}$ as possible, we have to utilize the conventional 8-nearest neighbor search. Fortunately, only 600 point-to-plane distance residuals are required for estimating the pose of each sweep in our system, so the total computational overhead is acceptable. However, in order to ensure that the final output map does not contain dynamic points, it is necessary to make a determination for each point of the current sweep, which requires searching the nearest neighbors of each point in the global map. A single sweep of a 32-line LiDAR can yield more than 50,000 points, making the conventional 8-nearest neighbor search inapplicable here.
In the proposed label consistency detection method, we qualitatively assess whether the non-ground point $\mathbf{p}$ is a dynamic point by comparing its label with those of its surrounding points. Since the surrounding points are not involved in quantitative calculations, it is not necessary to strictly satisfy the concept of nearest neighbors. Instead, an approximate approach, i.e., voxel-location-based nearest neighbor search, can be adopted to significantly reduce the computational cost. As illustrated in Fig. 4 (b), we locate the voxel $V$ to which $\mathbf{p}$ belongs and consider the other points within $V$ as the approximate nearest neighbors. Since the voxel map supports a query operation with a computational complexity of $O(1)$, the entire nearest neighbor search process is extremely fast. In addition, the computational overhead associated with Euclidean distances is also saved. The voxel-location-based nearest neighbor search plays a crucial role in ensuring the low computational cost of label consistency detection, as evidenced by the results of the ablation study documented in Sec. VI-G.
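To illustrate where the saving comes from, the sketch below contrasts the conventional search (gathering candidates from the 9 voxels around $\mathbf{p}$ and ranking them by explicit Euclidean distance) with the voxel-location-based lookup, reusing the hypothetical VoxelMap from Sec. III-B. Taking the 9 candidate voxels as the 3x3 horizontal neighborhood is our assumption:

```cpp
#include <algorithm>
#include <vector>

// Conventional style: visit p's voxel and its 8 adjacent voxels, then keep the
// k nearest candidates by Euclidean distance (the expensive per-point step).
std::vector<Point3f> EightNeighborSearch(const VoxelMap& map, const Point3f& p,
                                         size_t k = 20) {
    std::vector<Point3f> cand;
    const VoxelKey c = map.KeyOf(p);
    for (int dx = -1; dx <= 1; ++dx)
        for (int dy = -1; dy <= 1; ++dy)
            if (const auto* pts = map.QueryKey({c.x + dx, c.y + dy, c.z}))
                cand.insert(cand.end(), pts->begin(), pts->end());
    auto d2 = [&p](const Point3f& q) {
        const float ex = q.x - p.x, ey = q.y - p.y, ez = q.z - p.z;
        return ex * ex + ey * ey + ez * ez;
    };
    if (cand.size() > k) {
        std::nth_element(cand.begin(), cand.begin() + k, cand.end(),
                         [&](const Point3f& a, const Point3f& b) { return d2(a) < d2(b); });
        cand.resize(k);
    }
    return cand;
}

// Voxel-location-based style: a single average-O(1) hash lookup, no distances.
const std::vector<Point3f>* ApproxNeighbors(const VoxelMap& map, const Point3f& p) {
    return map.QueryKey(map.KeyOf(p));  // points sharing p's voxel are the neighbors
}
```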

IV-D Dynamic Point Determination
In Sec. IV-B, we categorize the points of the current sweep into fore-points and back-points based on their distance from the vehicle platform. For the dynamic point determination of fore-points and back-points, we employ the following two distinct modes.
Mode for fore-points. If the number of nearest neighbors is below a certain threshold (5 in our system), it indicates that the location of $\mathbf{p}$ was originally unoccupied, and thus $\mathbf{p}$ is classified as a dynamic point. If the number of nearest neighbors is sufficiently large (greater than 5), we calculate the proportion of ground points among all nearest neighbors. If this proportion is sufficiently low (less than 30%), $\mathbf{p}$ is classified as a static point and added to both the tracking-map and the output map. Conversely, if the proportion is larger than 30%, $\mathbf{p}$ is classified as a dynamic point and excluded from the map. Inevitably, this determination criterion potentially results in the erroneous removal of some static points in close proximity to the ground. Nonetheless, the points most commonly affected by this misfiltration are situated at the transition between walls and the ground. Despite the possibility of such misfiltration, the overall geometric integrity of the scene remains intact, and it does not affect the performance of the LIO system. The visualization of dynamic point determination results for fore-points is shown in Fig. 5.
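The following is a minimal sketch of this fore-point decision rule with the thresholds as stated above (at least 5 neighbors, 30% ground-point proportion), reusing the Point3f type from the sketch in Sec. III-B; the enum and function names are hypothetical:

```cpp
#include <vector>

enum class PointState { Static, Dynamic };

// Decision rule for a non-ground fore-point given its approximate neighbors.
PointState ClassifyForePoint(const std::vector<Point3f>& neighbors) {
    if (neighbors.size() < 5) return PointState::Dynamic;  // location was unoccupied
    int ground = 0;
    for (const auto& n : neighbors)
        if (n.ground) ++ground;
    const float groundRatio = static_cast<float>(ground) / neighbors.size();
    // A non-ground point surrounded mostly by ground points is label-inconsistent.
    return groundRatio < 0.30f ? PointState::Static : PointState::Dynamic;
}
```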

Mode for back-points. If the number of nearest neighbors is below a certain threshold (5 in our system), we cannot identify the back-point as a dynamic point, because it is possible that the location has not yet been reconstructed, preventing the nearest neighbors from being obtained. Such points are labeled as undetermined-points, and a determination will be made once the vehicle platform continues to move and the geometric structures at the locations of these points are recovered. To ensure that newly acquired point clouds can be properly registered during state estimation, it is necessary to incorporate the undetermined-points into the tracking-map. This does not significantly affect the accuracy of state estimation, since even if there are dynamic objects among the back-points, the number of LiDAR points scanned onto them is very sparse. As for the final output map, it is imperative to ensure that it contains as few dynamic points as possible, hence the determination for undetermined-points will be conducted subsequently. When the number of nearest neighbors is sufficiently large (greater than 5), the processing approach is the same as that for fore-points, and the static points are added to both the tracking-map and the output map. The visualization of dynamic point determination results for back-points is shown in Fig. 6.
IV-E Undetermined-Point Re-Determination

As mentioned in Sec. IV-D, some back-points may fail to find nearest neighbors due to the incomplete reconstruction of their locations. For such points, we label them as undetermined-points and place them in a container. As the vehicle platform continues to move forward, the geometric structure information of previously unreconstructed positions is recovered (as shown in Fig. 7). Then we can make a re-determination for those undetermined-points. When a point $\mathbf{p}$ in the undetermined-point container is close to the current position of the vehicle platform (less than 30 meters), it is highly likely that the geometric structure information around $\mathbf{p}$ has been reconstructed. We can then determine whether $\mathbf{p}$ is a dynamic point. If the number of nearest neighbors is below a certain threshold (5 in our system), it suggests that the location of $\mathbf{p}$ was originally unoccupied, leading to the classification of $\mathbf{p}$ as a dynamic point. If the number of nearest neighbors is larger than the threshold of 5, we calculate the proportion of ground points among all nearest neighbors. If this proportion is sufficiently low (less than 30%), $\mathbf{p}$ is classified as a static point and added to the output map. On the contrary, if the proportion is not smaller than the threshold of 30%, $\mathbf{p}$ is classified as a dynamic point and is not included in the output map. If an undetermined-point stays more than 30 meters away from the vehicle platform's position for 10 consecutive sweeps, it is likely to be a sparse background point at a far distance. Thus, we directly classify it as a static point and add it to the output map.
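A minimal sketch of this re-determination pass, reusing the hypothetical helpers from the previous sketches (30-meter radius and 10-sweep limit as stated; all names are illustrative):

```cpp
#include <cmath>
#include <vector>

struct Undetermined { Point3f p; int sweepsFar = 0; };

// Run once per sweep over the undetermined-point container.
void ReDetermineUndetermined(std::vector<Undetermined>& pending, const VoxelMap& map,
                             const Point3f& vehiclePos, VoxelMap& outputMap) {
    std::vector<Undetermined> keep;
    for (auto& u : pending) {
        const float dx = u.p.x - vehiclePos.x, dy = u.p.y - vehiclePos.y,
                    dz = u.p.z - vehiclePos.z;
        if (std::sqrt(dx * dx + dy * dy + dz * dz) < 30.0f) {
            // The surroundings should now be reconstructed: reuse the fore-point rule.
            const auto* nb = map.QueryKey(map.KeyOf(u.p));
            const std::vector<Point3f> empty;
            if (ClassifyForePoint(nb ? *nb : empty) == PointState::Static)
                outputMap.Insert(u.p);   // dynamic points are simply dropped
        } else if (++u.sweepsFar >= 10) {
            outputMap.Insert(u.p);       // sparse far background: treated as static
        } else {
            keep.push_back(u);           // still undetermined; check again next sweep
        }
    }
    pending.swap(keep);
}
```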
V Our System Dynamic-LIO
V-A System Overview

Fig. 8 illustrates the framework of our self-developed system Dynamic-LIO, which consists of four main modules: cloud processing, static initialization, ESIKF based state estimation and dynamic point identification. The cloud processing module constructs a binarized label (i.e., ground label or non-ground label) for each 3D point of the current sweep. Subsequently, it performs spatial down-sampling to ensure a uniform density of the current point cloud. The static initialization module [11] utilizes the IMU measurements to estimate some state parameters such as the gravitational acceleration, accelerometer bias, gyroscope bias and initial velocity. The ESIKF based state estimation module estimates the state of the current sweep, and utilizes the estimated pose to transform all points of the current sweep from the LiDAR coordinate to the world coordinate. The dynamic point identification module identifies dynamic points from the current input point cloud, to ensure that the global map contains only static points. The yellow rectangles indicate the specific locations of the five steps of label consistency detection within the overall system framework.
V-B Down-Sampling
To alleviate the substantial computational load caused by the overwhelming number of 3D points collected by a LiDAR in a single sweep, we implement a decimation strategy on the point cloud. Initially, a uniform subsampling approach is applied to retain a single point out of every four. Subsequently, we integrate the uniformly subsampled points into a voxel grid with a fixed resolution (unit: m), ensuring that each voxel is occupied by a solitary point.
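A minimal sketch of this two-stage decimation, reusing the hypothetical VoxelKey types from the sketch in Sec. III-B (the voxel resolution is left as a parameter, since its value is not restated here):

```cpp
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Stage 1: keep one point out of every four; stage 2: one point per voxel.
std::vector<Point3f> DownSampleSweep(const std::vector<Point3f>& sweep, float res) {
    std::unordered_map<VoxelKey, Point3f, VoxelKeyHash> grid;
    for (size_t i = 0; i < sweep.size(); i += 4) {  // uniform 1-in-4 subsampling
        const Point3f& p = sweep[i];
        const VoxelKey k{ static_cast<int32_t>(std::floor(p.x / res)),
                          static_cast<int32_t>(std::floor(p.y / res)),
                          static_cast<int32_t>(std::floor(p.z / res)) };
        grid.emplace(k, p);  // first point wins: each voxel keeps a solitary point
    }
    std::vector<Point3f> out;
    out.reserve(grid.size());
    for (const auto& kv : grid) out.push_back(kv.second);
    return out;
}
```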
It is worth noting that the down-sampling of the current input sweep must be performed after binarized label construction. The reason is that down-sampling disrupts the adjacency relationships of 3D points in the range image, which would affect the execution of binarized label construction.
V-C Static Initialization
In our Dynamic-LIO, a static initialization procedure is employed to estimate essential parameters, including the initial pose, the initial velocity, the gravitational acceleration, as well as the biases of the accelerometer and gyroscope measurements. For a detailed elucidation of the methodology, please refer to [11].
V-D ESIKF based State Estimation
We adopt the error state iterated Kalman filter (ESIKF) to perform state estimation, which is the same as Fast-LIO2 [35]. Fast-LIO2 utilized the self-developed toolbox IKFoM [13] for the implementation of the on-manifold Kalman filter, while we utilize the more widely recognized Eigen3 library [9] to implement this functionality. We have documented the entire execution process of our ESIKF in the Appendices for readers to refer to the details of the implementation.
It is worth mentioning that during state estimation, we need to find the nearest neighbors for 600 randomly selected points of the current sweep from the global map, and fit a plane with these nearest neighbors to construct point-to-plane distance constraints. To ensure that the final fitted plane reflects the surrounding geometric information as accurately as possible, we still use the conventional 8-nearest neighbor search method here. Additionally, we search for nearest neighbors in the tracking-map. Compared to the output map, the tracking-map is more conducive to the accurate and robust running of LIO, because it retains enough geometric details while removing the vast majority of dynamic points.
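For reference, the point-to-plane residual takes the standard form below (a sketch using the notation of Sec. III-A; $\mathbf{n}$ denotes the unit normal of the plane fitted from the nearest neighbors and $\mathbf{q}$ a point on that plane):

$$r = \mathbf{n}^{\top}\left(\mathbf{R}_{i_k}^{w}\left(\mathbf{R}_{l}^{i}\,{}^{l_k}\mathbf{p} + \mathbf{t}_{l}^{i}\right) + \mathbf{t}_{i_k}^{w} - \mathbf{q}\right).$$

The ESIKF iteratively minimizes these residuals over the error state until convergence.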
VI Experiments
We evaluate the overall performance of our method on six autonomous driving datasets: SemanticKITTI [1], UrbanLoco-CA [32], UrbanNav [16], NCLT [3], UTBM [37] and UrbanLoco-HK [32]. Among them, SemanticKITTI, UrbanLoco-CA and UrbanNav are three public datasets collected in dynamic scenes, while NCLT, UTBM and UrbanLoco-HK are three public datasets collected in static scenes. SemanticKITTI is collected by a 64-line Velodyne LiDAR and each LiDAR point has its unique semantic label. Thus, SemanticKITTI is used to evaluate the preservation rate (PR) and rejection rate (RR) of the proposed label consistency based dynamic point detection and removal method. UrbanLoco-CA is collected by a 32-line Robosense LiDAR and an IMU, and UrbanNav is collected by a 32-line Velodyne LiDAR and an IMU. These two datasets are used to evaluate the improvement brought by the dynamic point detection and removal method to pose estimation in terms of absolute trajectory error (ATE). NCLT, UTBM and UrbanLoco-HK are all collected by a 32-line Velodyne LiDAR and an IMU. These three datasets are used to show the outstanding performance of our self-developed LIO system, and to demonstrate that the proposed dynamic point detection and removal method has no negative effect on the accuracy of LIO in static scenes. Details of all 24 sequences used in this section, including name, duration, and whether they contain dynamic objects, are listed in Table I. A consumer-level computer equipped with an Intel Core i7-11700 and 32 GB RAM is used for all experiments.
TABLE I: Details of the 24 sequences used for evaluation.

| Name | Duration | Dynamic Objects |
|---|---|---|
| semantic-kitti-00 | 00:14 | Yes |
| semantic-kitti-01 | 00:10 | Yes |
| semantic-kitti-02 | 00:09 | Yes |
| semantic-kitti-05 | 00:33 | Yes |
| semantic-kitti-07 | 00:20 | Yes |
| CA-MarktStreet | 24:35 | Yes |
| CA-RussianHill | 26:57 | Yes |
| TST | 13:05 | Yes |
| Whampoa | 25:37 | Yes |
| 2012-01-08 | 92:16 | No |
| 2012-02-02 | 98:37 | No |
| 2012-02-04 | 77:39 | No |
| 2012-05-11 | 83:36 | No |
| 2012-05-26 | 97:23 | No |
| 2012-06-15 | 55:10 | No |
| 2012-08-04 | 79:27 | No |
| 2012-09-28 | 76:40 | No |
| 2018-07-19 | 15:26 | No |
| 2019-01-31 | 16:00 | No |
| 2019-04-18 | 11:59 | No |
| 2018-07-20 | 16:45 | No |
| 2018-07-13 | 16:59 | No |
| HK-2019-01-17 | 05:18 | No |
| HK-2019-04-26-1 | 02:30 | No |
VI-A PR and RR Comparison with the State-of-the-Arts
We compare our label consistency based dynamic point detection method with three state-of-the-art 3D point-based dynamic point detection methods, i.e., Removert [18], Erasor [20] and Dynamic Filter [10], on the SemanticKITTI dataset [1]. Among them, Removert and Erasor are offline methods which need a pre-built map as input, while Dynamic Filter and our method are online methods which do not rely on any prior information.
TABLE II: PR (%) comparison with the state-of-the-arts on the SemanticKITTI dataset.

| Sequence | Removert (offline) | Erasor (offline) | Dynamic Filter (online) | Ours (online) |
|---|---|---|---|---|
| semantic-kitti-00 | 86.83 | 93.98 | 90.07 | 90.36 |
| semantic-kitti-01 | 95.82 | 91.49 | 87.95 | 88.43 |
| semantic-kitti-02 | 83.29 | 87.73 | 88.02 | 88.25 |
| semantic-kitti-05 | 88.17 | 88.73 | 90.17 | 90.31 |
| semantic-kitti-07 | 82.04 | 90.62 | 87.94 | 89.28 |
TABLE III: RR (%) comparison with the state-of-the-arts on the SemanticKITTI dataset.

| Sequence | Removert (offline) | Erasor (offline) | Dynamic Filter (online) | Ours (online) |
|---|---|---|---|---|
| semantic-kitti-00 | 90.62 | 97.08 | 91.09 | 90.73 |
| semantic-kitti-01 | 57.08 | 95.38 | 87.69 | 88.41 |
| semantic-kitti-02 | 88.37 | 97.01 | 86.10 | 86.22 |
| semantic-kitti-05 | 79.98 | 98.26 | 84.65 | 85.84 |
| semantic-kitti-07 | 95.50 | 99.27 | 86.80 | 87.34 |
VI-B ATE Comparison with the State-of-the-Arts
We compare our Dynamic-LIO with two state-of-the-art LIO systems for dynamic scenes, i.e., RF-LIO [25] and ID-LIO [33], on the UrbanLoco-CA [32] and UrbanNav [16] datasets. Both RF-LIO and ID-LIO have a loop detection module and use GTSAM [17] to optimize the factor graph. Thus, we also add the same loop detection module and global optimization module to Dynamic-LIO when comparing with them. The results of RF-LIO and ID-LIO are taken from their papers because they have not released their code. The four selected sequences all encompass highly dynamic scenarios and can effectively evaluate the performance of LIO systems in dynamic scenes.
TABLE IV: ATE comparison with state-of-the-art LIO systems for dynamic scenes.

| Sequence | RF-LIO | ID-LIO | Ours |
|---|---|---|---|
| CA-MarktStreet | 15.89 | 28.02 | 12.96 |
| CA-RussianHill | 12.17 | 15.34 | 4.84 |
| TST | - | 1.06 | 3.90 |
| Whampoa | - | 3.45 | 6.66 |

Denotations: "-" means the corresponding value is not available.
Results in Table IV demonstrate that the accuracy of our Dynamic-LIO is superior to that of RF-LIO and ID-LIO on CA-MarktStreet and CA-RussianHill. Since RF-LIO is neither open-sourced nor tested on the UrbanNav dataset, we are unable to obtain its results on the sequences TST and Whampoa. Although ID-LIO achieves smaller ATE than our system on the UrbanNav dataset, our act of open-sourcing the code better substantiates the reproducibility of our results.
We also compare our Dynamic-LIO with six state-of-the-art LIO systems for static scenes, i.e., LiLi-OM [19], LIO-SAM [30], Fast-LIO2 [35], DLIO [4], iG-LIO [6] and Point-LIO [12], on the NCLT [3], UTBM [37] and UrbanLoco-HK [32] datasets, to demonstrate that our Dynamic-LIO still has excellent pose estimation performance in static scenes. The six compared methods have released their code, thus we obtain their results by running the source code provided by the authors.
TABLE V: ATE comparison with state-of-the-art LIO systems for static scenes.

| Sequence | LiLi-OM | LIO-SAM | Fast-LIO2 | DLIO | iG-LIO | Point-LIO | Ours |
|---|---|---|---|---|---|---|---|
| 2012-01-08 | 50.71 | 1.85 | 3.57 | 3.27 | 1.85 | 2.55 | 1.54 |
| 2012-02-02 | 91.86 | 7.18 | 2.00 | 1.80 | 1.72 | 2.45 | 1.74 |
| 2012-02-04 | 92.93 | 2.16 | 2.77 | 5.35 | 2.92 | 5.31 | 2.23 |
| 2012-05-11 | 185.24 | 2.46 | 3.14 | ✗ | 1.84 | 11.24 | 1.67 |
| 2012-05-26 | 141.83 | 2.60 | 12.44 | ✗ | 2.12 | 14.89 | 2.24 |
| 2012-06-15 | 50.42 | 2.97 | 2.37 | 2.98 | 1.82 | 4.39 | 2.05 |
| 2012-08-04 | 137.05 | 2.26 | 2.59 | 7.84 | 2.40 | 16.28 | 2.13 |
| 2012-09-28 | ✗ | ✗ | 2.65 | 7.72 | 1.72 | 16.22 | 1.66 |
| 2018-07-19 | 67.16 | - | 15.13 | 14.25 | 17.37 | 22.71 | 13.92 |
| 2019-01-31 | 38.17 | - | 21.21 | 13.85 | 21.27 | 23.02 | 16.09 |
| 2019-04-18 | 10.70 | - | 10.81 | 55.28 | 13.75 | 13.81 | 9.10 |
| 2018-07-20 | 70.98 | - | 15.20 | 18.05 | 16.44 | 21.76 | 9.63 |
| 2018-07-13 | 62.57 | - | 13.24 | 14.95 | ✗ | 19.88 | 9.63 |
| HK-2019-01-17 | ✗ | 1.68 | 1.20 | 2.44 | 1.15 | 1.07 | 1.03 |
| HK-2019-04-26-1 | ✗ | 3.11 | 3.13 | 3.24 | 3.31 | 2.82 | 3.08 |

Denotations: "✗" means the system fails to run entirely on the corresponding sequence, and "-" means the corresponding value is not available.
TABLE VI: PR (%) of our method with and without considering undetermined-points.

| Sequence | Ours w/o Considering Undetermined-Points | Ours |
|---|---|---|
| semantic-kitti-00 | 90.09 | 90.36 |
| semantic-kitti-01 | 88.17 | 88.43 |
| semantic-kitti-02 | 87.59 | 88.25 |
| semantic-kitti-05 | 89.31 | 90.31 |
| semantic-kitti-07 | 88.43 | 89.28 |
TABLE VII: RR (%) of our method with and without considering undetermined-points.

| Sequence | Ours w/o Considering Undetermined-Points | Ours |
|---|---|---|
| semantic-kitti-00 | 90.05 | 90.73 |
| semantic-kitti-01 | 87.86 | 88.41 |
| semantic-kitti-02 | 86.22 | 86.22 |
| semantic-kitti-05 | 84.08 | 85.84 |
| semantic-kitti-07 | 87.30 | 87.34 |
TABLE VIII: ATE of our Dynamic-LIO with and without removing dynamic points.

| | Sequence | Ours w/o Removing Dynamic Points | Ours |
|---|---|---|---|
| Dynamic Scenes | CA-MarktStreet | 13.98 | 12.96 |
| | CA-RussianHill | 4.88 | 4.84 |
| | TST | 5.21 | 3.90 |
| | Whampoa | 9.19 | 6.66 |
| Static Scenes | 2012-01-08 | 1.60 | 1.54 |
| | 2012-02-02 | 1.77 | 1.74 |
| | 2012-02-04 | 2.23 | 2.23 |
| | 2012-05-11 | 2.55 | 1.67 |
| | 2012-05-26 | 2.44 | 2.24 |
| | 2012-06-15 | 1.93 | 2.05 |
| | 2012-08-04 | 2.06 | 2.13 |
| | 2012-09-28 | 1.92 | 1.66 |
| | 2018-07-19 | 13.76 | 13.92 |
| | 2019-01-31 | 17.05 | 16.09 |
| | 2019-04-18 | 9.66 | 9.10 |
| | 2018-07-20 | 12.97 | 9.63 |
| | 2018-07-13 | 9.74 | 9.63 |
| | HK-2019-01-17 | 1.04 | 1.03 |
| | HK-2019-04-26-1 | 3.31 | 3.08 |
TABLE IX: Time consumption (unit: ms) comparison with online dynamic point detection and removal methods on the SemanticKITTI dataset.

| Sequence | Dynamic Filter (Front-End) | RH-Map | Ours |
|---|---|---|---|
| semantic-kitti-00 | 55.71 | 63.96 | 41.60 |
| semantic-kitti-01 | 93.23 | - | 34.93 |
| semantic-kitti-02 | 66.98 | - | 46.00 |
| semantic-kitti-05 | 64.61 | - | 34.87 |
| semantic-kitti-07 | 51.02 | - | 27.61 |
| CPU model | i7-8559U | i7-12700H | i7-11700 |
| clock speed | 2.7 GHz | 2.7 GHz | 2.5 GHz |
TABLE X: Time consumption (unit: ms) comparison with LIO systems for dynamic scenes.

| Sequence | RF-LIO | ID-LIO | Ours |
|---|---|---|---|
| CA-MarktStreet | 96 | 121 | 23.33 |
| CA-RussianHill | 121 | 100 | 21.50 |
| TST | - | 96 | 16.46 |
| Whampoa | - | 99 | 16.41 |
| CPU model | i5 | i7-10700K | i7-11700 |
| clock speed | 1.3~3.7 GHz | 3.8 GHz | 2.5 GHz |

Denotations: "-" means the corresponding value is not available.
TABLE XI: Runtime breakdown (unit: ms) of our system. "LCD" denotes label consistency detection; cloud processing excludes binarized label construction.

| Sequence | Cloud Processing | State Estimation | LCD: Binarized Label Construction | LCD: Dynamic Point Identification | LCD: Total | Sum |
|---|---|---|---|---|---|---|
| semantic-kitti-00 | 7.49 | 27.98 | 1.85 | 4.28 | 6.13 | 41.60 |
| semantic-kitti-01 | 2.01 | 24.57 | 1.84 | 6.51 | 8.35 | 34.93 |
| semantic-kitti-02 | 7.11 | 36.07 | 1.88 | 0.94 | 2.82 | 46.00 |
| semantic-kitti-05 | 7.66 | 18.93 | 1.85 | 6.43 | 8.28 | 34.87 |
| semantic-kitti-07 | 7.94 | 19.67 | 1.80 | 1.21 | 3.01 | 27.61 |
| CA-MarktStreet | 5.45 | 13.48 | 0.53 | 3.87 | 4.40 | 23.33 |
| CA-RussianHill | 8.30 | 13.20 | 0.68 | 0.45 | 1.13 | 21.50 |
| TST | 6.07 | 7.31 | 1.27 | 1.81 | 3.08 | 16.46 |
| Whampoa | 7.14 | 7.03 | 1.33 | 0.91 | 2.24 | 16.41 |
| 2012-01-08 | 5.76 | 21.78 | 0.70 | 0.65 | 1.35 | 28.89 |
| 2012-02-02 | 6.38 | 23.12 | 0.75 | 0.50 | 1.25 | 30.75 |
| 2012-02-04 | 5.91 | 22.20 | 0.72 | 0.75 | 1.47 | 29.58 |
| 2012-05-11 | 5.09 | 20.31 | 0.67 | 0.59 | 1.26 | 26.66 |
| 2012-05-26 | 5.12 | 19.81 | 0.68 | 0.60 | 1.28 | 26.21 |
| 2012-06-15 | 6.52 | 20.48 | 0.78 | 0.76 | 1.54 | 28.54 |
| 2012-08-04 | 5.97 | 24.43 | 0.74 | 0.58 | 1.32 | 31.72 |
| 2012-09-28 | 5.07 | 19.64 | 0.65 | 0.74 | 1.39 | 26.10 |
| 2018-07-19 | 5.03 | 14.23 | 1.01 | 1.51 | 2.52 | 21.78 |
| 2019-01-31 | 4.79 | 16.91 | 0.97 | 1.83 | 2.80 | 24.50 |
| 2019-04-18 | 4.45 | 14.27 | 0.90 | 2.28 | 3.18 | 21.90 |
| 2018-07-20 | 4.92 | 14.51 | 0.99 | 1.20 | 2.19 | 21.62 |
| 2018-07-13 | 5.01 | 15.14 | 0.97 | 1.42 | 2.39 | 22.54 |
| HK-2019-01-17 | 6.54 | 11.05 | 0.98 | 0.39 | 1.37 | 18.96 |
| HK-2019-04-26-1 | 6.91 | 9.22 | 1.28 | 2.03 | 3.31 | 19.44 |
TABLE XII: Time consumption (unit: ms) of LCD with different nearest neighbor search methods.

| Sequence | LCD w/ 8-Nearest Neighbor Search | LCD w/ Voxel-Location-Based Search (Ours) |
|---|---|---|
| semantic-kitti-00 | 78.53 | 6.13 |
| semantic-kitti-01 | 115.99 | 8.35 |
| semantic-kitti-02 | 81.04 | 2.82 |
| semantic-kitti-05 | 43.22 | 8.28 |
| semantic-kitti-07 | 46.63 | 3.01 |
| CA-MarktStreet | 21.69 | 4.40 |
| CA-RussianHill | 21.58 | 1.13 |
| TST | 15.20 | 3.08 |
| Whampoa | 15.50 | 2.24 |
| 2012-01-08 | 8.76 | 0.65 |
| 2012-02-02 | 9.04 | 0.50 |
| 2012-02-04 | 8.78 | 0.75 |
| 2012-05-11 | 8.88 | 0.59 |
| 2012-05-26 | 9.35 | 0.60 |
| 2012-06-15 | 9.42 | 0.76 |
| 2012-08-04 | 9.05 | 0.58 |
| 2012-09-28 | 8.63 | 0.74 |
| 2018-07-19 | 11.50 | 1.51 |
| 2019-01-31 | 10.56 | 1.83 |
| 2019-04-18 | 11.44 | 2.28 |
| 2018-07-20 | 12.47 | 1.20 |
| 2018-07-13 | 10.92 | 1.42 |
| HK-2019-01-17 | 9.54 | 0.39 |
| HK-2019-04-26-1 | 18.16 | 2.03 |

Denotations: "LCD" is the abbreviation of "Label Consistency Detection".
Results in Table V demonstrate that our Dynamic-LIO outperforms the state-of-the-arts on more than half of the sequences in terms of smaller ATE. Although iG-LIO achieves comparable results to our system on the NCLT dataset, it shows poor accuracy on the UTBM dataset. In addition, although our accuracy is not the best on a few sequences, it is very close to the best. "-" means the corresponding value is not available: LIO-SAM needs 9-axis IMU data as input, while the UTBM dataset only provides 6-axis IMU data; therefore, we cannot provide the results of LIO-SAM on the UTBM dataset. "✗" means the system fails to run entirely on the corresponding sequence. Except for our system, Fast-LIO2 and Point-LIO, the other systems break down on several sequences, which also demonstrates the robustness of our system.
VI-C Ablation Study of Undetermined-Points
In our system, the purpose of incorporating undetermined-points is to remove dynamic points as much as possible, thereby increasing the proportion of static points in the output map. In this section, we validate the necessity of incorporating undetermined-points by comparing the PR and RR values of our Dynamic-LIO with and without considering undetermined-points. Results in Tables VI and VII show that considering undetermined-points yields equal or better PR and RR on all five sequences.
VI-D Ablation Study of Dynamic Point Removal for Pose Estimation
In this section, we evaluate the effectiveness of removing dynamic points for pose estimation by comparing the ATE results of our Dynamic-LIO with and without removing dynamic points.
Results in Table VIII demonstrate that removing dynamic points can enhance the pose estimation accuracy of our Dynamic-LIO in dynamic scenes, especially on the UrbanNav dataset. In static scenes, our label consistency based dynamic point detection and removal has no negative effect on the pose estimation accuracy, which also reflects the robustness and practicability of the proposed method.
VI-E Time Consumption Comparison with the State-of-the-Arts
We compare the time consumption of our label consistency based dynamic point detection and removal method with two state-of-the-art online 3D point-based dynamic point detection and removal methods, i.e., Dynamic Filter [10] and RH-Map [38], on the SemanticKITTI dataset [1]. Then, we compare the time consumption of our Dynamic-LIO with RF-LIO and ID-LIO. The results of the other approaches are taken from their papers because they have not released their code.
Results in Table IX demonstrate that the time consumption of our dynamic point detection and removal method is much smaller than that of Dynamic Filter and RH-Map. The front-end of Dynamic Filter requires 55.71 ms to process the data of a single sweep. When accounting for the back-end overhead, the total duration for processing a single sweep is even longer. Given that the acquisition frequency of current LiDARs is typically between 10~20 Hz, the processing time for a single sweep must stay within 50 ms to ensure real-time performance. It can be seen that neither Dynamic Filter nor RH-Map can guarantee real-time capability, whereas our method can run in real time stably. Since the SemanticKITTI dataset does not include IMU data, we complete pose estimation and mapping in a LiDAR-only odometry mode while simultaneously detecting and removing dynamic objects online. Therefore, the time consumption recorded in Table IX represents the total time cost for both LiDAR-only odometry and dynamic point detection and removal, and the figures for Dynamic Filter and RH-Map are measured in the same way. Although the testing platforms are not entirely the same, the CPUs used for testing Dynamic Filter and RH-Map have a higher clock speed than the one we used, which to a certain extent supports the reference value of our experimental results. Results in Table X demonstrate that the time consumption of our Dynamic-LIO is much smaller than that of RF-LIO [25] and ID-LIO [33]; our system operates approximately 5X faster than RF-LIO and ID-LIO. Since RF-LIO is neither open-sourced nor tested on the UrbanNav dataset, we are unable to obtain its results on the sequences TST and Whampoa. As stated in ID-LIO [33], the authors tested their system using an i5 CPU; however, no specific model was provided in [33], thus we record the range of clock speeds for the entire i5 series of CPUs in Table X. Furthermore, although the i7-11700 has an advantage over the i5 series and the i7-8559U in terms of thread count, our entire Dynamic-LIO system is implemented on a single thread and does not take advantage of multi-threading capability.


VI-F Time Consumption of Each Module
We evaluate the runtime breakdown (unit: ms) of our system on all testing sequences. For each sequence, we measure the time consumption of cloud processing (excluding binarized label construction), state estimation and label consistency detection. The label consistency detection module can be further decomposed into two sub-steps: binarized label construction and dynamic point identification. Results in Table XI show that our system takes only 1~9 ms to identify the dynamic points of a sweep, while the total duration for completing all tasks of LIO is 16~46 ms. This implies that our method can accomplish dynamic point detection and removal with extremely low computational overhead in LIO systems.
VI-G Ablation Study of Nearest Neighbor Search
As mentioned in Sec. IV-C, compared to the conventional 8-nearest neighbor search, the voxel-location-based nearest neighbor search can significantly reduce the computational cost required for nearest neighbor search in label consistency detection. This subsection provides quantitative comparative results to substantiate this conclusion.
Table XII demonstrates that the voxel-location-based nearest neighbor search we employ achieves an order-of-magnitude reduction in time consumption compared to the conventional 8-nearest neighbor search, primarily for the following two reasons: (1) the voxel-location-based nearest neighbor search only needs to process the voxel to which the current point belongs, whereas the 8-nearest neighbor search must process the voxel to which the current point belongs as well as its eight adjacent voxels; (2) the voxel-location-based nearest neighbor search does not necessitate the computation of the Euclidean distance between candidate points and the current point.
VI-H Visualization for Trajectory and Map
We visualize the trajectories and the local point cloud maps estimated by our system. The comparisons between our estimated trajectories and the ground truth on two exemplar sequences are shown in Fig. 9 (a) and (b), where our estimated trajectories and the ground truth almost exactly coincide. Fig. 10 shows the ability of our Dynamic-LIO to reconstruct a static point cloud map on an exemplar sequence. As illustrated in Fig. 10 (a), before removing dynamic points, the ghost tracks of moving objects (green points) are clearly visible in the map. As illustrated in Fig. 10 (b), after removing dynamic points, the output map almost no longer contains ghost tracks.
VII Conclusion
This paper proposes a LIO system with label consistency detection, which can rapidly eliminate the influence of moving objects in driving scenarios. Different from existing approaches that involve batch geometric computation or global statistics to identify moving objects, the proposed method is more lightweight. Specifically, the proposed method constructs binarized labels for each point of the current sweep, and utilizes the label difference between each point and its surrounding points in the map to identify moving objects. Firstly, a fast 2D connected component method is employed to construct binarized labels, i.e., ground and non-ground, for each point of the current sweep. Given that moving objects in driving scenes are located on the ground, an inconsistency arises when comparing the label of a non-ground point in the current sweep with those of its nearest neighbors in the map. Meanwhile, we propose to utilize the voxel-location-based nearest neighbor search to achieve fast nearest neighbor search. Furthermore, we embed the proposed label consistency detection method into a self-developed LIO, which can accurately estimate the state and exclude the interference of dynamic objects with extremely low time consumption.
Experimental results show that the proposed label consistency detection method achieves comparable PR and RR to state-of-the-art dynamic point detection and removal methods, while ensuring a lower computational cost. In addition, our Dynamic-LIO operates approximately 5X faster than state-of-the-art LIO systems for dynamic scenes, and achieves state-of-the-art pose estimation accuracy in both dynamic and static scenes.
References
- [1] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understanding of lidar sequences,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9297–9307.
- [2] Y. Cai, W. Xu, and F. Zhang, “ikd-tree: An incremental kd tree for robotic applications,” arXiv preprint arXiv:2102.10808, 2021.
- [3] N. Carlevaris-Bianco, A. K. Ushani, and R. M. Eustice, “University of michigan north campus long-term vision and lidar dataset,” The International Journal of Robotics Research, vol. 35, no. 9, pp. 1023–1035, 2016.
- [4] K. Chen, R. Nemiroff, and B. T. Lopez, “Direct lidar-inertial odometry: Lightweight lio with continuous-time motion correction,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 3983–3989.
- [5] Z. Chen, K. Zhang, H. Chen, M. Y. Wang, W. Zhang, and H. Yu, “Dorf: A dynamic object removal framework for robust static lidar mapping in urban environments,” IEEE Robotics and Automation Letters, 2023.
- [6] Z. Chen, Y. Xu, S. Yuan, and L. Xie, “ig-lio: An incremental gicp-based tightly-coupled lidar-inertial odometry,” IEEE Robotics and Automation Letters, vol. 9, no. 2, pp. 1883–1890, 2024.
- [7] T. Cortinhal, G. Tzelepis, and E. Erdal Aksoy, “Salsanext: Fast, uncertainty-aware semantic segmentation of lidar point clouds,” in Advances in Visual Computing: 15th International Symposium, ISVC 2020, San Diego, CA, USA, October 5–7, 2020, Proceedings, Part II 15. Springer, 2020, pp. 207–222.
- [8] W. Dai, Y. Zhang, P. Li, Z. Fang, and S. Scherer, “Rgb-d slam in dynamic environments using point correlations,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 1, pp. 373–389, 2020.
- [9] Z. Eugene, “Eigen3: a c++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms,” https://gitlab.com/libeigen/eigen, 2011.
- [10] T. Fan, B. Shen, H. Chen, W. Zhang, and J. Pan, “Dynamicfilter: an online dynamic objects removal framework for highly dynamic environments,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 7988–7994.
- [11] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang, “Openvins: A research platform for visual-inertial estimation,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 4666–4672.
- [12] D. He, W. Xu, N. Chen, F. Kong, C. Yuan, and F. Zhang, “Point-lio: Robust high-bandwidth light detection and ranging inertial odometry,” Advanced Intelligent Systems, p. 2200459, 2023.
- [13] D. He, W. Xu, and F. Zhang, “Kalman filters on differentiable manifolds,” arXiv preprint arXiv:2102.03804, 2021.
- [14] M. Himmelsbach, F. V. Hundelshausen, and H.-J. Wuensche, “Fast segmentation of 3d point clouds for ground vehicles,” in 2010 IEEE Intelligent Vehicles Symposium. IEEE, 2010, pp. 560–565.
- [15] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, “Octomap: An efficient probabilistic 3d mapping framework based on octrees,” Autonomous robots, vol. 34, pp. 189–206, 2013.
- [16] L.-T. Hsu, N. Kubo, W. Wen, W. Chen, Z. Liu, T. Suzuki, and J. Meguro, “Urbannav: An open-sourced multisensory dataset for benchmarking positioning algorithms designed for urban areas,” in Proceedings of the 34th International Technical Meeting of the Satellite Division of The Institute of Navigation (ION GNSS+ 2021), 2021, pp. 226–256.
- [17] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, “isam2: Incremental smoothing and mapping using the bayes tree,” The International Journal of Robotics Research, vol. 31, no. 2, pp. 216–235, 2012.
- [18] G. Kim and A. Kim, “Remove, then revert: Static point cloud map construction using multiresolution range images,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10 758–10 765.
- [19] K. Li, M. Li, and U. D. Hanebeck, “Towards high-performance solid-state-lidar-inertial odometry and mapping,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 5167–5174, 2021.
- [20] H. Lim, S. Hwang, and H. Myung, “Erasor: Egocentric ratio of pseudo occupancy-based dynamic object removal for static 3d point cloud map building,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 2272–2279, 2021.
- [21] A. Milioto, I. Vizzo, J. Behley, and C. Stachniss, “Rangenet++: Fast and accurate lidar semantic segmentation,” in 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2019, pp. 4213–4220.
- [22] V. Nguyen, S. Gächter, A. Martinelli, N. Tomatis, and R. Siegwart, “A comparison of line extraction algorithms using 2d range data for indoor mobile robotics,” Autonomous Robots, vol. 23, pp. 97–111, 2007.
- [23] P. Pfreundschuh, H. F. Hendrikx, V. Reijgwart, R. Dubé, R. Siegwart, and A. Cramariuc, “Dynamic object aware lidar slam based on automatic generation of training data,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021, pp. 11 641–11 647.
- [24] F. Pomerleau, P. Krüsi, F. Colas, P. Furgale, and R. Siegwart, “Long-term 3d map maintenance in dynamic environments,” in 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2014, pp. 3712–3719.
- [25] C. Qian, Z. Xiang, Z. Wu, and H. Sun, “Rf-lio: Removal-first tightly-coupled lidar inertial odometry in high dynamic environments,” arXiv preprint arXiv:2206.09463, 2022.
- [26] C. Qin, H. Ye, C. E. Pranata, J. Han, S. Zhang, and M. Liu, “Lins: A lidar-inertial state estimator for robust and efficient navigation,” in 2020 IEEE international conference on robotics and automation (ICRA). IEEE, 2020, pp. 8899–8906.
- [27] J. Schauer and A. Nüchter, “The peopleremover—removing dynamic objects from 3-d point cloud data by traversing a voxel occupancy grid,” IEEE robotics and automation letters, vol. 3, no. 3, pp. 1679–1686, 2018.
- [28] L. Schmid, O. Andersson, A. Sulser, P. Pfreundschuh, and R. Siegwart, “Dynablox: Real-time detection of diverse dynamic objects in complex environments,” IEEE Robotics and Automation Letters, 2023.
- [29] T. Shan and B. Englot, “Lego-loam: Lightweight and ground-optimized lidar odometry and mapping on variable terrain,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 4758–4765.
- [30] T. Shan, B. Englot, D. Meyers, W. Wang, C. Ratti, and D. Rus, “Lio-sam: Tightly-coupled lidar inertial odometry via smoothing and mapping,” in 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS). IEEE, 2020, pp. 5135–5142.
- [31] H. W. Sorenson, “Kalman filtering techniques,” in Advances in control systems. Elsevier, 1966, vol. 3, pp. 219–292.
- [32] W. Wen, Y. Zhou, G. Zhang, S. Fahandezh-Saadi, X. Bai, W. Zhan, M. Tomizuka, and L.-T. Hsu, “Urbanloco: A full sensor suite dataset for mapping and localization in urban scenes,” in 2020 IEEE international conference on robotics and automation (ICRA). IEEE, 2020, pp. 2310–2316.
- [33] W. Wu and W. Wang, “Lidar inertial odometry based on indexed point and delayed removal strategy in highly dynamic environments,” Sensors, vol. 23, no. 11, p. 5188, 2023.
- [34] J. R. Xu, S. Huang, S. Qiu, L. Zhao, W. Yu, M. Fang, M. Wang, and R. Li, “Lidar-link: Observability-aware probabilistic plane-based extrinsic calibration for non-overlapping solid-state lidars,” IEEE Robotics and Automation Letters, 2024.
- [35] W. Xu, Y. Cai, D. He, J. Lin, and F. Zhang, “Fast-lio2: Fast direct lidar-inertial odometry,” IEEE Transactions on Robotics, vol. 38, no. 4, pp. 2053–2073, 2022.
- [36] W. Xu and F. Zhang, “Fast-lio: A fast, robust lidar-inertial odometry package by tightly-coupled iterated kalman filter,” IEEE Robotics and Automation Letters, vol. 6, no. 2, pp. 3317–3324, 2021.
- [37] Z. Yan, L. Sun, T. Krajník, and Y. Ruichek, “Eu long-term dataset with multiple sensors for autonomous driving,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 10 697–10 704.
- [38] Z. Yan, X. Wu, Z. Jian, B. Lan, and X. Wang, “Rh-map: Online map construction framework of dynamic object removal based on 3d region-wise hash map structure,” IEEE Robotics and Automation Letters, 2024.
- [39] D. Yoon, T. Tang, and T. Barfoot, “Mapless online detection of dynamic objects in 3d lidar,” in 2019 16th Conference on Computer and Robot Vision (CRV). IEEE, 2019, pp. 113–120.
- [40] Z. Yuan, J. Deng, R. Ming, F. Lang, and X. Yang, “Sr-livo: Lidar-inertial-visual odometry and mapping with sweep reconstruction,” IEEE Robotics and Automation Letters, vol. 9, no. 6, pp. 5110–5117, 2024.
- [41] Z. Yuan, F. Lang, T. Xu, and X. Yang, “Sr-lio: Lidar-inertial odometry with sweep reconstruction,” arXiv preprint arXiv:2210.10424, 2022.
- [42] Z. Yuan, F. Lang, T. Xu, C. Zhao, and X. Yang, “Semi-elastic lidar-inertial odometry,” arXiv preprint arXiv:2307.07792, 2023.
- [43] Z. Yuan, Q. Wang, K. Cheng, T. Hao, and X. Yang, “Sdv-loam: Semi-direct visual–lidar odometry and mapping,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 11 203–11 220, 2023.
- [44] T. Zhang, X. Zhang, Z. Liao, X. Xia, and Y. Li, “As-lio: Spatial overlap guided adaptive sliding window lidar-inertial odometry for aggressive fov variation,” arXiv preprint arXiv:2408.11426, 2024.
Zikang Yuan received his PhD degree from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2024. He has published one paper in TPAMI, one paper in RA-L, two papers in IROS, two papers in ACM MM and three papers in TMM. His research interests include monocular dense mapping, RGB-D simultaneous localization and mapping, visual-inertial state estimation, visual-LiDAR pose estimation and LiDAR-inertial state estimation.
Xiaoxiang Wang is currently a 2nd year M.S. student at the School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST). His research interests include 3D point based dynamic point identification and object tracking.
Jingying Wu is currently a 3rd year M.S. student at the School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST). Her research interests include 3D point based dynamic point identification and loop closure identification.
Junda Cheng is currently a 2nd year PhD student at the Department of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), supervised by Prof. Xin Yang. He received the B.Eng. degree from Huazhong University of Science and Technology in 2020. His research interests include stereo matching, multi-view stereo, and deep visual odometry.
Xin Yang received her PhD degree from the University of California, Santa Barbara in 2013. She worked as a Post-doc in the Learning-based Multimedia Lab at UCSB (2013-2014). She is currently a Professor at the School of Electronic Information and Communications, Huazhong University of Science and Technology. Her research interests include simultaneous localization and mapping, augmented reality, and medical image analysis. She has published over 90 technical papers, including in TPAMI, IJCV, TMI, MedIA, CVPR, ECCV, MM, etc., co-authored two books and holds 3 U.S. patents. Prof. Yang is a member of IEEE and a member of ACM.