
SR-LIVO: LiDAR-Inertial-Visual Odometry and Mapping with Sweep Reconstruction

Zikang Yuan1, Jie Deng2, Ruiye Ming2, Fengtian Lang2 and Xin Yang2*
1Zikang Yuan is with the Institute of Artificial Intelligence, Huazhong University of Science and Technology, Wuhan, 430074, China (E-mail: [email protected]).
2Jie Deng, Ruiye Ming, Fengtian Lang and Xin Yang are with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan, 430074, China (* denotes the corresponding author. E-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Abstract

Existing LiDAR-inertial-visual odometry and mapping (LIV-SLAM) systems mainly utilize the LiDAR-inertial odometry (LIO) module for structure reconstruction and the visual-inertial odometry (VIO) module for color rendering. However, the accuracy of VIO is often compromised by photometric changes, weak textures and motion blur, unlike the more robust LIO. This paper introduces SR-LIVO, an advanced and novel LIV-SLAM system that employs sweep reconstruction to align reconstructed sweeps with image timestamps. This allows the LIO module to accurately determine states at all imaging moments, enhancing pose accuracy and processing efficiency. Experimental results on two public datasets demonstrate that: 1) our SR-LIVO outperforms existing state-of-the-art LIV-SLAM systems in both pose accuracy and time efficiency; 2) our LIO-based pose estimation proves more accurate than the VIO-based ones of several mainstream LIV-SLAM systems (including ours). We have released our source code to contribute to the development of the community in this field.

Index Terms:
SLAM, localization, sensor fusion.

I Introduction

In robotic applications like autonomous vehicles [4] and drones [2, 3], cameras, 3D light detection and ranging (LiDAR) sensors and inertial measurement units (IMUs) are key sensors. The integration of IMU measurements provides a motion prior that ensures accurate and fast state estimation. While LiDAR excels at capturing 3D structure, it lacks the color information that cameras provide. This synergy has made LiDAR-inertial-visual odometry and mapping (LIV-SLAM) a leading approach for accurate state estimation and dense colored map reconstruction.

Figure 1: Illustration of (a) the raw sensor measurements from the IMU, LiDAR and camera, and (b) the raw measurements from the IMU and camera together with the reconstructed LiDAR sweep data, where the end timestamp of each reconstructed sweep is aligned with the timestamp of a captured image.

Existing state-of-the-art LIV-SLAM systems [6, 5, 18] mainly consist of a LIO module and a VIO module. The LIO module reconstructs 3D structures while the VIO module colors them. Both modules perform state estimation: the LIO module estimates the state at the end timestamp of a LiDAR sweep (e.g., $t_{l_{j-1}}$ and $t_{l_{j}}$ in Fig. 1 (a)), and the VIO module solves the state at the timestamp of each captured image (e.g., $t_{c_{i-1}}$ and $t_{c_{i}}$ in Fig. 1 (a)). Compared to LIO, VIO is less robust to photometric changes, weak textures and motion blur; its pose estimates are therefore less accurate, which in turn degrades the colored reconstruction.

In this paper, we propose SR-LIVO, a novel and advanced LIV-SLAM framework that enhances both accuracy and reliability. We shift state estimation entirely to the LIO module, which is known for its better precision. A key challenge of this design is that the timestamps of captured images and the end timestamps of LiDAR sweeps are often not aligned. Without the state at the image acquisition moment, we can hardly use image data to render color for the restored 3D structures. To address this issue, we adopt the sweep reconstruction method proposed in our previous work [16] to align the end timestamps of reconstructed sweeps (e.g., $t_{l_{i-1}}$ and $t_{l_{i}}$ in Fig. 1 (b)) with the timestamps of captured images (e.g., $t_{c_{i-1}}$ and $t_{c_{i}}$ in Fig. 1 (b)). In this way, the LIO module can directly estimate states at image capture moments. Consequently, the vision module's role is simplified to optimizing camera parameters (e.g., camera intrinsics, extrinsics and time-offset) and performing the color rendering task.

Experimental results on the public datasets NTU_VIRAL [7] and R3Live [5] demonstrate the following key findings: 1) our system outperforms existing state-of-the-art LIV-SLAM systems (i.e., [6, 5]) with smaller absolute trajectory error (ATE), and is much more efficient than [5]; 2) the estimated pose from the LIO module is more accurate than that from the VIO modules of several mainstream LIV-SLAM systems. Meanwhile, the visualization results demonstrate that our SR-LIVO achieves reconstruction results comparable to [5] on their self-collected dataset (i.e., R3Live), and clearly superior results to [5] on the NTU_VIRAL dataset.

To summarize, the main contributions of this work are threefold: 1) We identify an important function of sweep reconstruction, i.e., aligning LiDAR sweep and image timestamps; 2) We design a new LIV-SLAM system based on sweep reconstruction, in which the vision module no longer performs state estimation but only colors the reconstructed map; 3) We have released the source code of our system to benefit the development of the community (https://github.com/ZikangYuan/sr_livo).

The rest of this paper is structured as follows. Sec. II reviews the relevant literature. Sec. III provides preliminaries. Secs. IV and V present the system overview and details, followed by the experimental evaluation in Sec. VI. Sec. VII concludes the paper.

II Related Work

In recent years, various LiDAR-visual and LiDAR-inertial-visual fusion frameworks have been proposed. V-LOAM [17] is the first cascaded LIV-SLAM framework, providing a motion prior for LiDAR odometry via a loosely coupled VIO. [11] cascades a tightly-coupled stereo VIO, a LiDAR odometry and a LiDAR-based loop-closing module. Compared with [17] and [11], the vision module of DV-LOAM [12] uses the direct method to perform pose estimation and multi-frame joint optimization in turn, so that it provides a more accurate motion prior for the subsequent LiDAR module. LIC-Fusion [19] combines IMU measurements, sparse visual features and LiDAR features with online spatial and temporal calibration within the multi-state constrained Kalman filter (MSCKF) framework. To further enhance the accuracy and robustness of LiDAR point registration, LIC-Fusion 2.0 [20] proposes a plane-feature tracking algorithm across multiple LiDAR sweeps in a sliding window and refines the poses of the sweeps within the window. LVI-SAM [10] integrates camera, LiDAR and IMU data into a tightly-coupled graph-based optimization framework; its vision and LiDAR modules can run independently when either fails, or jointly when both visual and LiDAR features are sufficient. R2Live [6] is the first to run a LIO module and a VIO module in parallel, where the LIO module provides geometric structure information for the VIO module and the back end uses visual landmarks for graph-based optimization. Based on R2Live, R3Live [5] omits the graph-based optimization module and adds a color rendering module for dense color map reconstruction. Compared to R3Live, Fast-LIVO [18] combines LiDAR, camera and IMU measurements into a single error-state iterated Kalman filter (ESIKF), which can be updated by both LiDAR and visual observations. mVIL-Fusion [13] proposes a three-stage cascaded LIV-SLAM framework consisting of a LiDAR-assisted VIO module, a multi sweep-to-sweep joint optimization module, a sweep-to-map optimization module and a loop-closing module. VILO-SLAM [8] fuses a 2D LiDAR residual factor, an IMU residual factor and a visual reprojection residual factor into an optimization-based framework. In [14], a weight function based on geometric structure and reflectivity is designed to improve the performance of solid-state LIO under severe linear acceleration and angular velocity changes.

III Preliminary

III-A Coordinate Systems

We denote $(\cdot)^{w}$, $(\cdot)^{l}$, $(\cdot)^{o}$ and $(\cdot)^{c}$ as a 3D point in the world, LiDAR, IMU and camera coordinates respectively. The world coordinate frame coincides with $(\cdot)^{o}$ at the starting position.

We denote the IMU coordinate frame at the time $t_{i}$ of the $i$-th IMU measurement as $o_{i}$, and the corresponding camera coordinate frame at $t_{i}$ as $c_{i}$. The transformation matrix (i.e., extrinsics) from $c_{i}$ to $o_{i}$ is denoted as $\mathbf{T}_{c_{i}}^{o_{i}}\in SE(3)$, which consists of a rotation matrix $\mathbf{R}_{c_{i}}^{o_{i}}\in SO(3)$ and a translation vector $\mathbf{t}_{c_{i}}^{o_{i}}\in\mathbb{R}^{3}$. Similarly, we can obtain the extrinsics from LiDAR to IMU, i.e., $\mathbf{T}_{l_{i}}^{o_{i}}$. For the datasets used in this work, we use the LiDAR's internal IMU; therefore, we treat $\mathbf{T}_{l_{i}}^{o_{i}}$ as accurate and do not correct it online. We optimize $\mathbf{T}_{c_{i}}^{o_{i}}$ online because the camera is a separate external sensor.
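To make the notation concrete, the short sketch below (illustrative only, with hypothetical values) composes the camera-to-IMU extrinsics $\mathbf{T}_{c_i}^{o_i}$ with an IMU pose $\mathbf{T}_{o_i}^{w}$ to map a camera-frame point into the world frame.

```python
# Illustrative sketch only: composing extrinsics with an IMU pose.
import numpy as np

def se3(R, t):
    """Build a 4x4 homogeneous transform from a rotation and a translation."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# Hypothetical values: identity extrinsic rotation, a 5 cm lever arm, and an
# IMU pose translated 1 m / 2 m from the world origin.
T_c_o = se3(np.eye(3), np.array([0.05, 0.0, 0.0]))   # camera -> IMU
T_o_w = se3(np.eye(3), np.array([1.0, 2.0, 0.0]))    # IMU -> world
T_c_w = T_o_w @ T_c_o                                # camera -> world by composition

p_c = np.array([0.0, 0.0, 2.0, 1.0])                 # a point 2 m in front of the camera
print((T_c_w @ p_c)[:3])                             # the same point in world coordinates
```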

III-B Distortion Correction

For each camera image, we utilize the offline-calibrated distortion parameters to correct the image distortion. For LiDAR points, we utilize the IMU-integrated pose to compensate for the motion distortion.
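The sketch below illustrates this kind of motion compensation under simplifying assumptions: only the poses at the begin and end of a sweep are used, and the pose at each point's capture time is obtained by interpolation; names, rates and values are hypothetical.

```python
# Simplified de-skewing sketch (begin/end poses only; in practice the
# IMU-integrated pose can be queried more densely across the sweep).
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def deskew_sweep(pts, stamps, T_begin, T_end, t_begin, t_end):
    """Express every LiDAR point in the sweep-end frame.
    pts: (N, 3) points, each valid at its own timestamp in `stamps`.
    T_begin, T_end: 4x4 sensor poses (in world) at t_begin and t_end."""
    key_rots = Rotation.from_matrix(np.stack([T_begin[:3, :3], T_end[:3, :3]]))
    slerp = Slerp([t_begin, t_end], key_rots)
    alpha = (stamps - t_begin) / (t_end - t_begin)

    R_t = slerp(stamps).as_matrix()                                   # per-point rotation
    t_t = (1 - alpha)[:, None] * T_begin[:3, 3] + alpha[:, None] * T_end[:3, 3]
    p_world = np.einsum('nij,nj->ni', R_t, pts) + t_t                 # point in world at its own time
    return (p_world - T_end[:3, 3]) @ T_end[:3, :3]                   # re-express in the end frame

# Hypothetical usage: 100 points spread over a 0.1 s sweep with 0.2 m of motion.
pts = np.random.rand(100, 3) * 10.0
stamps = np.linspace(0.0, 0.1, 100)
T0, T1 = np.eye(4), np.eye(4)
T1[:3, 3] = [0.2, 0.0, 0.0]
deskewed = deskew_sweep(pts, stamps, T0, T1, 0.0, 0.1)
```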

IV System Overview

Figure 2: Overview of our SR-LIVO, which consists of three main modules: a sweep reconstruction module for timestamp alignment, a LIO module for state estimation and structure restoration, and a vision module for camera parameter optimization and color rendering. The yellow rectangles indicate the operations of our system that differ from those of existing state-of-the-art frameworks.

Fig. 2 illustrates the framework of our system, which consists of three main modules: a sweep reconstruction module, a LIO state estimation module and a vision module. The sweep reconstruction module aligns the end timestamp of each reconstructed sweep with the timestamp of a captured image. The LIO module estimates the state of the hardware platform and restores the structure in real time. The vision module optimizes the camera parameters (i.e., camera intrinsics, extrinsics and time-offset) and renders color onto the restored structure map in real time. For map management, we utilize the hash voxel map, the same as in CT-ICP [1]. The implementation details of the LIO module are exactly the same as in our previous work SR-LIO [15]; therefore, we omit its introduction and only describe the sweep reconstruction and vision modules in Sec. V.

V System Details

V-A Sweep Reconstruction

In our previous work SDV-LOAM [16], we first proposed the sweep reconstruction idea, which increases the frequency of LiDAR sweeps to that of camera images. In this work, we point out another important function of this idea: aligning the end timestamp of each reconstructed sweep with the timestamp of a captured image. Depending on the relative frequency of LiDAR sweeps and camera images, the processing differs in practice. There are three specific cases, as follows:

Figure 3: Illustration of the sweep reconstruction method under three situations: (a) The frequency of captured images is more than twice that of raw LiDAR sweeps. (b) The frequency of captured images is less than twice but more than that of raw LiDAR sweeps. (c) The frequency of captured images is less than that of raw LiDAR sweeps.

The frequency of captured images is more than twice that of raw LiDAR sweeps (as shown in Fig. 3 (a)). In this case, we first down-sample the camera images to twice the frequency of raw LiDAR sweeps. Then we disassemble the raw LiDAR sweeps (i.e., $S_{j-1}$ and $S_{j}$ in Fig. 3 (a)) into a continuous point cloud data stream, and recombine the stream so that the end timestamp (e.g., $t_{l_{i}}$ in Fig. 3 (a)) of each reconstructed sweep (i.e., $P_{i}$ in Fig. 3 (a)) is aligned with the timestamp of a down-sampled image (e.g., $t_{c_{i}}$ in Fig. 3 (a)). In this way, the number of LiDAR points in a reconstructed sweep is only half that of a raw input sweep. For a spinning LiDAR, not only the number of points but also the horizontal distribution range is reduced, from $360^{\circ}$ to $180^{\circ}$. If the frequency of reconstructed sweeps were more than twice that of raw sweeps, the number and horizontal distribution range of points would decrease further, and eventually the LIO module might fail to perform properly. Therefore, we down-sample the camera images to at most twice the frequency of raw LiDAR sweeps.

The frequency of captured images is less than twice but more than that of raw LiDAR sweeps (as shown in Fig. 3 (b)). In this case, we directly disassemble the raw LiDAR sweeps (i.e., $S_{j-1}$ and $S_{j}$ in Fig. 3 (b)) into a continuous point cloud data stream, and recombine the stream so that the end timestamp (e.g., $t_{l_{i}}$ in Fig. 3 (b)) of each reconstructed sweep (e.g., $P_{i}$ in Fig. 3 (b)) is aligned with the timestamp of a captured image (e.g., $t_{c_{i}}$ in Fig. 3 (b)). The point count and horizontal distribution range of reconstructed sweeps are then smaller than those of raw LiDAR sweeps, but still sufficient for the LIO module to run properly.

The frequency of captured images is less than that of raw LiDAR sweeps (as shown in Fig. 3 (c)). In this case, we directly disassemble the raw LiDAR sweeps (i.e., $S_{j-1}$, $S_{j}$, $S_{j+1}$ and $S_{j+2}$ in Fig. 3 (c)) into a continuous point cloud data stream, and recombine the stream according to the following two situations: 1) Suppose the current reconstructed sweep begins at $t_{l_{i}}$ and its end timestamp is not yet determined. When the time elapsed since $t_{l_{i}}$ reaches the period of a raw LiDAR sweep and there is no image around the current moment, we set the current moment as the end timestamp (e.g., $t_{l_{i+1}}$ in Fig. 3 (c)) of this reconstructed sweep (e.g., $P_{i+1}$ in Fig. 3 (c)). 2) Suppose the current reconstructed sweep begins at $t_{l_{i+1}}$ and its end timestamp is not yet determined. When the current moment reaches the timestamp of a captured image (e.g., $t_{c_{i+2}}$ in Fig. 3 (c)) and the interval from $t_{l_{i+1}}$ to $t_{c_{i+2}}$ is sufficient, we set the current moment as the end timestamp (e.g., $t_{c_{i+2}}$ ($t_{l_{i+2}}$) in Fig. 3 (c)) of the reconstructed sweep (e.g., $P_{i+2}$ in Fig. 3 (c)). Under situation 2), not all synchronized data include image data (i.e., $P_{i-1}$ and $P_{i+1}$ in Fig. 3 (c)). For synchronized data without a camera image, we only run the LIO module to estimate the state, without the vision module. For synchronized data with both point cloud and image data, the LIO module and the vision module execute in turn.
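As a concrete illustration of the repackaging step (for the case of Fig. 3 (b)), the sketch below groups a continuous point stream into reconstructed sweeps that end exactly at image timestamps; the data layout, rates and names are hypothetical, and the down-sampling and interval checks described above are omitted.

```python
# Minimal sketch, not the released implementation: repackage a continuous
# LiDAR point stream into reconstructed sweeps whose end timestamps coincide
# with image timestamps.
def reconstruct_sweeps(point_stream, image_stamps):
    """point_stream: iterable of (timestamp, point), sorted by time.
    image_stamps: sorted list of image timestamps.
    Returns a list of (sweep_points, end_stamp) pairs, one per image."""
    sweeps, buffer, img_idx = [], [], 0
    for stamp, point in point_stream:
        # Close the current sweep once the stream passes the next image time.
        while img_idx < len(image_stamps) and stamp > image_stamps[img_idx]:
            sweeps.append((buffer, image_stamps[img_idx]))
            buffer, img_idx = [], img_idx + 1
        buffer.append((stamp, point))
    return sweeps

# Hypothetical 10 Hz LiDAR stream and 15 Hz camera: each reconstructed sweep
# ends exactly at an image timestamp and is then handed to the LIO module.
stream = [(0.01 * i, (i, 0.0, 0.0)) for i in range(100)]   # 1 s of fake points
images = [i / 15.0 for i in range(1, 15)]
for pts, t_end in reconstruct_sweeps(stream, images)[:3]:
    print(len(pts), round(t_end, 3))
```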

V-B Vision Module

V-B1 Recent Point Extraction

Similar to R3Live [5], we first record all recently visited voxels $\left\{V_{1},V_{2},\cdots,V_{m}\right\}$ when performing the structure map update. Then we select the newest added point from each visited voxel $V_{i}$ to obtain the recent point set $P_{r}=\left\{\mathbf{p}_{1},\mathbf{p}_{2},\cdots,\mathbf{p}_{m}\right\}$.
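A simplified sketch of this selection is given below; it assumes a plain Python dictionary as the hash voxel map and a hypothetical voxel size, and only mimics the data structure described above.

```python
# Illustrative sketch of the recent point set P_r (data layout assumed).
import numpy as np

VOXEL_SIZE = 0.4   # assumed voxel edge length in metres

def voxel_key(p):
    """Integer voxel index of a 3D point."""
    return tuple(np.floor(np.asarray(p) / VOXEL_SIZE).astype(int))

class RecentPointMap:
    def __init__(self):
        self.voxels = {}            # hash voxel map: key -> list of points
        self.recent_keys = set()    # voxels touched by the latest map update

    def insert(self, p):
        k = voxel_key(p)
        self.voxels.setdefault(k, []).append(np.asarray(p, float))
        self.recent_keys.add(k)

    def recent_points(self):
        # Newest point of each recently visited voxel -> P_r
        return [self.voxels[k][-1] for k in self.recent_keys]
```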

V-B2 Camera Parameters Optimization

Different from existing state-of-the-art frameworks (e.g., R3Live and Fast-LIVO), we have already estimated the state at the timestamp (e.g., $t_{k}$) of each captured image (e.g., $c_{k}$) in the LIO module. Therefore, we no longer need to solve the state (i.e., pose, velocity and IMU bias) in the vision module, but only need to utilize the image data to optimize the camera parameters:

$\boldsymbol{x}_{k}=\left[t_{c_{k}}^{o_{k}},\mathbf{R}_{c_{k}}^{o_{k}},\mathbf{t}_{c_{k}}^{o_{k}},\phi_{k}\right]^{T}$   (1)

where $t_{c_{k}}^{o_{k}}$ is the time-offset between the IMU and the camera, while the LiDAR is assumed to be already synchronized with the IMU. $\mathbf{R}_{c_{k}}^{o_{k}}$ and $\mathbf{t}_{c_{k}}^{o_{k}}$ are the extrinsics between the camera and the IMU. $\phi_{k}=\left[f_{x_{k}},f_{y_{k}},c_{x_{k}},c_{y_{k}}\right]^{T}$ are the camera intrinsics, where $(f_{x_{k}},f_{y_{k}})$ denote the camera focal lengths and $(c_{x_{k}},c_{y_{k}})$ denote the offsets of the principal point from the top-left corner of the image plane.

To balance the effects of previous estimates and the current image observation on the camera parameters, we utilize an ESIKF to optimize the camera parameters $\boldsymbol{x}_{k}$, where the error state $\delta\boldsymbol{x}_{k}$ is defined as:

$\delta\boldsymbol{x}_{k}=\left[\delta t_{c_{k}}^{o_{k}},\delta\mathbf{R}_{c_{k}}^{o_{k}},\delta\mathbf{t}_{c_{k}}^{o_{k}},\delta\phi_{k}\right]^{T}$   (2)

For the state prediction, the error state $\delta\boldsymbol{x}_{k}$ and the covariance $\mathbf{P}_{k}$ are propagated as:

$\delta\boldsymbol{x}_{k}=\mathbf{0},\quad\mathbf{P}_{k}=\mathbf{P}_{k-1}$   (3)

For the state update, the PnP projection error and the photometric error are minimized in turn to update $\delta\boldsymbol{x}_{k}$.

Minimizing the PnP projection error. Assume that we have tracked $n$ map points $P_{t}=\left\{\mathbf{p}_{1},\mathbf{p}_{2},\cdots,\mathbf{p}_{n}\right\}$ whose projections on image $c_{k-1}$ are $\left\{\boldsymbol{\rho}_{1_{k-1}},\boldsymbol{\rho}_{2_{k-1}},\cdots,\boldsymbol{\rho}_{n_{k-1}}\right\}$. We leverage Lucas-Kanade optical flow to find their locations in the current image $c_{k}$: $\left\{\boldsymbol{\rho}_{1_{k}},\boldsymbol{\rho}_{2_{k}},\cdots,\boldsymbol{\rho}_{n_{k}}\right\}$. For an exemplar point $\mathbf{p}_{i}\in P_{t}$, we calculate the re-projection error by:

$\begin{gathered}\mathbf{p}_{i}^{c_{k}}=\left(\mathbf{R}_{o_{k}}^{w}\mathbf{R}_{c_{k}}^{o_{k}}\right)^{T}\mathbf{p}_{i}^{w}-\mathbf{R}_{c_{k}}^{o_{k}T}\mathbf{t}_{c_{k}}^{o_{k}}-\left(\mathbf{R}_{o_{k}}^{w}\mathbf{R}_{c_{k}}^{o_{k}}\right)^{T}\mathbf{t}_{o_{k}}^{w}\\ \mathbf{r}^{\mathbf{p}_{i}}=\boldsymbol{\rho}_{i_{k}}-\pi\left(\mathbf{p}_{i}^{c_{k}},\boldsymbol{x}_{k}\right)\end{gathered}$   (4)

where $\pi\left(\mathbf{p}_{i}^{c_{k}},\boldsymbol{x}_{k}\right)$ is computed as below:

$\pi\left(\mathbf{p}_{i}^{c_{k}},\boldsymbol{x}_{k}\right)=\left[f_{x_{k}}\frac{\left[\mathbf{p}_{i}^{c_{k}}\right]_{x}}{\left[\mathbf{p}_{i}^{c_{k}}\right]_{z}}+c_{x_{k}},\;f_{y_{k}}\frac{\left[\mathbf{p}_{i}^{c_{k}}\right]_{y}}{\left[\mathbf{p}_{i}^{c_{k}}\right]_{z}}+c_{y_{k}}\right]^{T}+\frac{t_{c_{k}}^{o_{k}}}{\Delta t_{k-1,k}}\left(\boldsymbol{\rho}_{i_{k}}-\boldsymbol{\rho}_{i_{k-1}}\right)$   (5)

where $\Delta t_{k-1,k}$ is the time interval between the last image $c_{k-1}$ and the current image $c_{k}$. In Eq. 5, the first term is the pin-hole projection function and the second is the online temporal-correction factor [9]. We can express the observation matrix $\mathbf{h}$ as:

$\mathbf{h}=\left[\mathbf{r}^{\mathbf{p}_{1}T},\mathbf{r}^{\mathbf{p}_{2}T},\cdots,\mathbf{r}^{\mathbf{p}_{n}T}\right]^{T}$   (6)

The corresponding Jacobian matrix $\mathbf{H}$ of the observation constraint is calculated as:

$\mathbf{H}=\left[\mathbf{H}_{1}^{T},\mathbf{H}_{2}^{T},\cdots,\mathbf{H}_{n}^{T}\right]^{T},\quad\mathbf{H}_{i}=\left[\frac{\partial\mathbf{r}^{\mathbf{p}_{i}}}{\partial t_{c_{k}}^{o_{k}}}^{T}\;\;\frac{\partial\mathbf{r}^{\mathbf{p}_{i}}}{\partial\mathbf{R}_{c_{k}}^{o_{k}}}^{T}\;\;\frac{\partial\mathbf{r}^{\mathbf{p}_{i}}}{\partial\mathbf{t}_{c_{k}}^{o_{k}}}^{T}\;\;\frac{\partial\mathbf{r}^{\mathbf{p}_{i}}}{\partial\phi_{k}}^{T}\right]^{T}$   (7)
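For readability, the sketch below spells out Eqs. 4 and 5 numerically; the function name, arguments and values are illustrative, not taken from our released implementation.

```python
# Illustrative sketch of Eqs. (4)-(5): predict the pixel of a world point in
# the current image, including the online time-offset correction term.
import numpy as np

def project(p_w, R_o_w, t_o_w, R_c_o, t_c_o, phi, t_off, dt, rho_k, rho_km1):
    fx, fy, cx, cy = phi
    R_c_w = R_o_w @ R_c_o                                # camera -> world rotation
    p_c = R_c_w.T @ (p_w - t_o_w) - R_c_o.T @ t_c_o      # Eq. (4), first line
    uv = np.array([fx * p_c[0] / p_c[2] + cx,
                   fy * p_c[1] / p_c[2] + cy])           # pin-hole term of Eq. (5)
    return uv + (t_off / dt) * (rho_k - rho_km1)         # temporal-correction term

# Hypothetical usage: residual of Eq. (4) = tracked pixel minus its prediction.
phi = (400.0, 400.0, 320.0, 240.0)
r = np.array([330.0, 244.0]) - project(
        np.array([1.0, 0.2, 5.0]), np.eye(3), np.zeros(3), np.eye(3),
        np.array([0.05, 0.0, 0.0]), phi, 0.002, 1 / 15,
        np.array([330.0, 244.0]), np.array([328.0, 243.0]))
print(r)
```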

Minimizing the photometric error. For an exemplar point $\mathbf{p}_{i}\in P_{r}$, if $\mathbf{p}_{i}$ has already been rendered with a color intensity $\boldsymbol{\gamma}_{i}$, we first project $\mathbf{p}_{i}^{w}$ into $(\cdot)^{c_{k}}$ by Eqs. 4 and 5, and then calculate the photometric error by:

$\mathbf{r}^{\mathbf{p}_{i}}=\boldsymbol{\gamma}_{i}-\boldsymbol{I}\left(\pi\left(\mathbf{p}_{i}^{c_{k}},\boldsymbol{x}_{k}\right)\right)$   (8)

where $\boldsymbol{I}(\cdot)$ is the color intensity of the image pixel. The observation matrix and the corresponding Jacobian matrix take a form similar to Eqs. 6-7.
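A small sketch of Eq. 8 is given below; the bilinear interpolation of the pixel intensity is an assumption made for illustration.

```python
# Illustrative sketch of Eq. (8): stored map-point colour vs. observed colour.
import numpy as np

def bilinear(img, u, v):
    """Bilinearly interpolated intensity of image `img` (H x W x 3) at (u, v)."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    w = np.array([(1 - du) * (1 - dv), du * (1 - dv), (1 - du) * dv, du * dv])
    patch = np.stack([img[v0, u0], img[v0, u0 + 1],
                      img[v0 + 1, u0], img[v0 + 1, u0 + 1]])
    return w @ patch

def photometric_residual(gamma, img, uv):
    """Eq. (8): stored RGB of the map point minus the observed pixel colour."""
    return gamma - bilinear(img, uv[0], uv[1])

# Hypothetical usage with a synthetic grey image and an already-rendered point.
img = np.full((480, 640, 3), 0.5)
print(photometric_residual(np.array([0.6, 0.5, 0.4]), img, (320.3, 240.7)))
```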

As in R3Live [5], for each image $c_{k}$ we first update the ESIKF by minimizing the PnP projection error and then by minimizing the photometric error. The difference is that the ESIKF in R3Live's vision module estimates both the state and the camera parameters of $c_{k}$, whereas we only optimize the camera parameters, since the state at $c_{k}$ has already been solved by our LIO module.
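For reference, the sketch below performs a single, non-iterated error-state Kalman update with an assumed isotropic measurement noise; the actual ESIKF iterates this update and uses proper noise models, so this is only a simplified illustration of how the stacked residual $\mathbf{h}$ and Jacobian $\mathbf{H}$ correct $\delta\boldsymbol{x}_{k}$.

```python
# Simplified, non-iterated error-state Kalman update (noise model assumed).
import numpy as np

def esikf_update(P, h, H, meas_var=1.0):
    """P: prior covariance of the error state (d x d).
    h: stacked residual vector (m,).  H: stacked Jacobian (m x d)."""
    R = meas_var * np.eye(len(h))                 # assumed isotropic measurement noise
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)                # Kalman gain
    delta_x = K @ h                               # correction to t_off, R, t, phi
    P_new = (np.eye(P.shape[0]) - K @ H) @ P
    return delta_x, P_new

# Hypothetical dimensions: 11-D error state (1 time-offset + 3 rotation +
# 3 translation + 4 intrinsics) and residuals from a few tracked points.
P0 = np.eye(11) * 1e-2
h = np.zeros(6)
H = np.random.randn(6, 11) * 0.1
dx, P1 = esikf_update(P0, h, H)
```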

V-B3 Rendering

After the camera parameters have been optimized, we perform the rendering step to update the colors of map points. To ensure the density of the colored point cloud map, we render not only the points in the recent point set $P_{r}$ but also all points in the recently visited voxels $\left\{V_{1},V_{2},\cdots,V_{m}\right\}$. The rendering function we utilize is the same as in R3Live [5].
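The sketch below is an illustration only: it replaces the Bayesian RGB update of R3Live with a simple variance-weighted running average, which conveys the idea of fusing repeated color observations but is not the rendering function we actually use.

```python
# Illustrative stand-in for the rendering step (not R3Live's actual update):
# fuse each new colour observation into a map point's stored (colour, variance).
import numpy as np

def render_point(color, var, obs, obs_var=0.02):
    if color is None:                       # point not yet rendered
        return obs.copy(), obs_var
    k = var / (var + obs_var)               # larger gain when the stored colour is uncertain
    return color + k * (obs - color), (1.0 - k) * var

c, v = None, None
for obs in [np.array([0.80, 0.20, 0.10]), np.array([0.70, 0.25, 0.12])]:
    c, v = render_point(c, v, obs)
print(c, v)
```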

VI Experiments

We evaluate our system on the drone-collected NTU_VIRAL dataset [7] and the handheld R3Live dataset [5]. The sensors used in the NTU_VIRAL dataset are the left camera, the horizontal 16-channel OS1 gen1 LiDAR and its internal IMU, and a high-accuracy laser tracker provides the position ground truth. The sensors used in the R3Live dataset are the camera, the Livox AVIA LiDAR and its internal IMU, and no position ground truth is provided. We use all sequences of NTU_VIRAL [7] and 6 sequences of R3Live (i.e., r3live_01: hku_campus_seq_00, r3live_02: hku_campus_seq_01, r3live_03: hku_campus_seq_02, r3live_04: hku_park_00, r3live_05: hku_park_01 and r3live_06: hkust_campus_seq_02) for evaluation. A consumer-level computer equipped with an Intel Core i7-11700 and 32 GB RAM is used for all experiments.

VI-A Comparison of the State-of-the-Arts

We compare our system with two state-of-the-art LIV-SLAM systems, i.e., R3Live [5] and Fast-LIVO [18], on the NTU_VIRAL dataset [7] and the R3Live dataset [5]. For the NTU_VIRAL dataset, we use the absolute translational error (ATE) as the evaluation metric. The R3Live dataset does not provide position ground truth; however, most sequences return to their starting point at the end, so we use the end-to-end error instead. For a fair comparison, we obtain the results of the above systems using the source code provided by their authors.
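For clarity, a minimal sketch of the ATE metric (rigid alignment followed by RMSE over translational errors) is shown below; it is illustrative and not the evaluation script used to produce the tables.

```python
# Illustrative ATE computation: Kabsch/Umeyama rigid alignment, then RMSE.
import numpy as np

def ate_rmse(est, gt):
    """est, gt: (N, 3) time-associated trajectory positions."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T                            # rotation aligning est to gt
    t = mu_g - R @ mu_e
    err = (est @ R.T + t) - gt                    # residual translational errors
    return float(np.sqrt((err ** 2).sum(1).mean()))

# Hypothetical usage; the end-to-end error of Table II is simply the distance
# between the first and last estimated positions when the platform returns home.
gt = np.random.rand(50, 3)
est = gt + 0.01 * np.random.randn(50, 3)
print(ate_rmse(est, gt))
```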

TABLE I: RMSE of ATE on the NTU_VIRAL dataset (unit: m)
Sequence   R3Live [5]   Fast-LIVO [18]   Ours
eee_01     1.69         0.28             0.21
eee_02     ×            0.18             0.23
eee_03     0.64         0.26             0.22
nya_01     0.63         0.34             0.18
nya_02     0.35         0.29             0.19
nya_03     0.23         0.29             0.20
sbs_01     0.40         0.73             0.12
sbs_02     0.27         0.25             0.22
sbs_03     0.21         0.24             0.21
TABLE II: End-to-End Errors (unit: m)
Sequence    R3Live [5]   Fast-LIVO [18]   Ours
r3live_01   0.097        0.070            0.021
r3live_03   0.115        0.090            0.053
r3live_04   0.071        0.050            0.120
r3live_05   0.603        0.546            0.546
r3live_06   0.036        13.463           0.024

Results in Table I demonstrate that our system outperforms R3Live and Fast-LIVO on almost all sequences in terms of smaller ATE. "×" means the system drifted halfway through the run, which shows that our SR-LIVO is more robust than R3Live on the NTU_VIRAL dataset. The end-to-end errors are reported in Table II, and their overall trend is similar to the RMSE of ATE results: our SR-LIVO achieves the lowest drift on four of the five sequences. Fast-LIVO yields a large end-to-end error on sequence hkust_campus_seq_02 because it updates a single ESIKF with both LiDAR and visual observations, and low-quality visual observations can make the filter diverge.

VI-B Comparison of LIO module and VIO module

Our system is designed on the premise that the state estimation of an LIO module is more accurate than that of a VIO module. In this section, we provide quantitative results to verify this premise. We compare the ATE of the VIO module with that of the LIO module for R3Live, Fast-LIVO and our SR-LIVO. Since our SR-LIVO does not include a VIO module, we implement one modeled after R3Live's to complete this ablation study.

TABLE III: RMSE of ATE Comparison of the LIO and VIO Modules (unit: m)
Sequence   R3Live   R3Live (V)   Fast-LIVO   Fast-LIVO (V)   SR-LIVO   SR-LIVO (V)
eee_01     1.69     1.71         0.28        0.30            0.21      0.24
eee_02     ×        ×            0.18        0.26            0.23      0.23
eee_03     0.64     0.81         0.26        0.27            0.22      0.27
nya_01     0.63     0.63         0.33        0.33            0.18      0.19
nya_02     0.35     0.41         0.29        0.30            0.19      0.20
nya_03     0.23     0.32         0.22        0.29            0.20      0.24
sbs_01     0.40     0.70         0.73        0.42            0.12      0.38
sbs_02     0.27     0.33         0.25        0.31            0.22      0.22
sbs_03     0.21     0.23         0.24        0.27            0.21      0.22
("(V)" denotes the pose estimated by the VIO module of each framework.)

The results in Table III demonstrate that the accuracy of the LIO module is superior to that of the VIO module, whether the framework is R3Live, Fast-LIVO or our SR-LIVO. This supports our premise that LIO is more accurate at state estimation than VIO.

VI-C Time Consumption

TABLE IV: Time Consumption Per Sweep (unit: ms)
            Vision              LiDAR               Total
Sequence    R3Live    Ours      R3Live    Ours      R3Live    Ours
r3live_01   38.90     20.86     17.51     9.67      50.57     30.53
r3live_02   33.80     19.95     24.23     10.83     49.95     30.78
r3live_03   37.16     21.51     18.16     9.55      49.27     31.06
r3live_04   43.25     20.33     19.33     10.54     56.14     30.87
r3live_05   40.23     19.64     20.95     11.78     54.20     31.42
r3live_06   38.82     20.80     27.92     13.08     57.43     33.88
eee_01      5.23      5.43      38.66     11.15     31.02     16.58
eee_02      ×         5.41      30.04     9.69      24.60     15.10
eee_03      5.00      6.93      27.43     10.00     23.29     16.93
nya_01      6.58      6.02      19.62     7.94      19.67     13.96
nya_02      6.49      8.68      20.07     7.81      19.87     16.49
nya_03      6.98      7.22      20.04     8.02      20.34     15.24
sbs_01      5.10      5.69      24.30     8.97      21.31     14.66
sbs_02      5.52      5.48      26.74     9.31      23.36     14.79
sbs_03      5.26      5.11      26.47     10.24     22.92     15.35

We evaluate the runtime breakdown (unit: ms) of our system and R3Live on all testing sequences. In general, both frameworks consist of a vision module and a LiDAR module; for each sequence, we measure the time consumption of these two modules and the total time for handling a sweep. Results in Table IV show that our system takes 30~34 ms per sweep on the R3Live dataset and 14~17 ms per sweep on the NTU_VIRAL dataset, while R3Live takes 49~58 ms per sweep on the R3Live dataset and 19~31 ms per sweep on the NTU_VIRAL dataset. This means our system runs around 1.6× faster than R3Live.

VI-D Real-Time Performance Evaluation

Figure 4: Time consumption versus frame ID on sequence r3live_01. R3Live cannot ensure real-time performance for a considerable portion of the sequence, while our SR-LIVO runs in real time stably.

We take r3live_01 as an exemplar sequence and plot the time consumption as a function of frame ID. Fig. 4 shows that R3Live cannot ensure real-time performance for a considerable portion of the sequence, while our SR-LIVO runs in real time stably.

VI-E Visualization for Map

Figure 5: Our SR-LIVO is able to reconstruct a dense, 3D, RGB-colored point cloud map, which is comparable to the reconstruction result of R3Live (Fig. 1 in [5]).
Figure 6: (a) and (b) are the grayscale maps reconstructed on eee_01 by R3Live and SR-LIVO respectively; our system achieves a significantly better reconstruction result.

Fig. 5 shows the ability of our SR-LIVO to reconstruct a dense, 3D, RGB-colored point cloud map on the exemplar sequence r3live_01, which is comparable to the reconstruction result of R3Live (Fig. 1 in [5]). Given comparable reconstruction quality, our method runs stably in real time while R3Live cannot, demonstrating the strength of our approach.

The camera images of the NTU_VIRAL dataset are grayscale, so the reconstructed map is also grayscale. Fig. 6 compares our reconstructed grayscale map with that of R3Live on the exemplar sequence eee_01 of the NTU_VIRAL dataset, where our system achieves a significantly better reconstruction result.

VII Conclusion

This paper proposes a novel LIV-SLAM system, named SR-LIVO, which adopts the sweep reconstruction method to align the end timestamps of reconstructed sweeps with the timestamps of captured images. Thus, the states at all image-capture moments can be solved by the more reliable LIO module instead of the less robust VIO module. In SR-LIVO, we utilize an ESIKF to solve the state in the LIO module and another ESIKF to optimize the camera parameters in the vision module, for optimal state estimation and colored point cloud map reconstruction.

Our system achieves state-of-the-art pose accuracy on two public datasets, and requires much less computation time than R3Live while producing comparable or better reconstruction results. Future work includes adding a loop-closing module to this framework.

References

  • [1] P. Dellenbach, J.-E. Deschaud, B. Jacquet, and F. Goulette, “Ct-icp: Real-time elastic lidar odometry with loop closure,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 5580–5586.
  • [2] F. Gao, W. Wu, W. Gao, and S. Shen, “Flying on point clouds: Online trajectory generation and autonomous navigation for quadrotors in cluttered environments,” Journal of Field Robotics, vol. 36, no. 4, pp. 710–733, 2019.
  • [3] F. Kong, W. Xu, Y. Cai, and F. Zhang, “Avoiding dynamic small obstacles with onboard sensing and computation on aerial robots,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7869–7876, 2021.
  • [4] J. Levinson, J. Askeland, J. Becker, J. Dolson, D. Held, S. Kammel, J. Z. Kolter, D. Langer, O. Pink, V. Pratt et al., “Towards fully autonomous driving: Systems and algorithms,” in 2011 IEEE intelligent vehicles symposium (IV).   IEEE, 2011, pp. 163–168.
  • [5] J. Lin and F. Zhang, “R3Live: A robust, real-time, RGB-colored, LiDAR-inertial-visual tightly-coupled state estimation and mapping package,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 10672–10678.
  • [6] J. Lin, C. Zheng, W. Xu, and F. Zhang, “R2Live: A robust, real-time, LiDAR-inertial-visual tightly-coupled state estimator and mapping,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7469–7476, 2021.
  • [7] T.-M. Nguyen, S. Yuan, M. Cao, Y. Lyu, T. H. Nguyen, and L. Xie, “Ntu viral: A visual-inertial-ranging-lidar dataset, from an aerial vehicle viewpoint,” The International Journal of Robotics Research, vol. 41, no. 3, pp. 270–280, 2022.
  • [8] G. Peng, Y. Zhou, L. Hu, L. Xiao, Z. Sun, Z. Wu, and X. Zhu, “Vilo slam: Tightly coupled binocular vision–inertia slam combined with lidar,” Sensors, vol. 23, no. 10, p. 4588, 2023.
  • [9] T. Qin and S. Shen, “Online temporal calibration for monocular visual-inertial systems,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2018, pp. 3662–3669.
  • [10] T. Shan, B. Englot, C. Ratti, and D. Rus, “Lvi-sam: Tightly-coupled lidar-visual-inertial odometry via smoothing and mapping,” in 2021 IEEE international conference on robotics and automation (ICRA).   IEEE, 2021, pp. 5692–5698.
  • [11] W. Shao, S. Vijayarangan, C. Li, and G. Kantor, “Stereo visual inertial lidar simultaneous localization and mapping,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 370–377.
  • [12] W. Wang, J. Liu, C. Wang, B. Luo, and C. Zhang, “Dv-loam: Direct visual lidar odometry and mapping,” Remote Sensing, vol. 13, no. 16, p. 3340, 2021.
  • [13] Y. Wang and H. Ma, “mvil-fusion: Monocular visual-inertial-lidar simultaneous localization and mapping in challenging environments,” IEEE Robotics and Automation Letters, vol. 8, no. 2, pp. 504–511, 2022.
  • [14] T. Yin, J. Yao, Y. Lu, and C. Na, “Solid-state-lidar-inertial-visual odometry and mapping via quadratic motion model and reflectivity information,” Electronics, vol. 12, no. 17, p. 3633, 2023.
  • [15] Z. Yuan, F. Lang, T. Xu, and X. Yang, “Sr-lio: Lidar-inertial odometry with sweep reconstruction,” arXiv preprint arXiv:2210.10424, 2022.
  • [16] Z. Yuan, Q. Wang, K. Cheng, T. Hao, and X. Yang, “Sdv-loam: Semi-direct visual-lidar odometry and mapping,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [17] J. Zhang and S. Singh, “Laser–visual–inertial odometry and mapping with high robustness and low drift,” Journal of field robotics, vol. 35, no. 8, pp. 1242–1264, 2018.
  • [18] C. Zheng, Q. Zhu, W. Xu, X. Liu, Q. Guo, and F. Zhang, “Fast-livo: Fast and tightly-coupled sparse-direct lidar-inertial-visual odometry,” in 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2022, pp. 4003–4009.
  • [19] X. Zuo, P. Geneva, W. Lee, Y. Liu, and G. Huang, “Lic-fusion: Lidar-inertial-camera odometry,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2019, pp. 5848–5854.
  • [20] X. Zuo, Y. Yang, P. Geneva, J. Lv, Y. Liu, G. Huang, and M. Pollefeys, “Lic-fusion 2.0: Lidar-inertial-camera odometry with sliding-window plane-feature tracking,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 5112–5119.
Zikang Yuan received the B.E. degree from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2018. He is currently a 4th-year Ph.D. student at the Institute of Artificial Intelligence, HUST. He has published two papers in ACM MM, three papers in TMM, one paper in TPAMI and one paper in IROS. His research interests include monocular dense mapping, RGB-D simultaneous localization and mapping, visual-inertial state estimation, LiDAR-inertial state estimation, LiDAR-inertial-wheel state estimation, and visual-LiDAR pose estimation and mapping.
Jie Deng is currently a 4th-year B.E. student at the School of Electronic Information and Communications, HUST. Her research interests include LiDAR-inertial state estimation, LiDAR-inertial-wheel state estimation, and visual-LiDAR pose estimation and mapping.
Ruiye Ming received the B.E. degree from Zhengzhou University (ZZU), Zhengzhou, China, in 2022. He is currently a 2nd-year graduate student at the School of Electronic Information and Communications, HUST. His research interests include visual-LiDAR odometry, LiDAR point-based loop closing, object reconstruction, and deep-learning-based depth estimation.
Fengtian Lang received the B.E. degree from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2023. He is currently a 1st-year graduate student at the School of Electronic Information and Communications, HUST. He has published one paper in IROS. His research interests include LiDAR-inertial state estimation, LiDAR-inertial-wheel state estimation, and LiDAR point-based loop closing.
Xin Yang received her Ph.D. degree from the University of California, Santa Barbara, in 2013. She worked as a postdoctoral researcher in the Learning-based Multimedia Lab at UCSB (2013-2014). She is currently a Professor at the School of Electronic Information and Communications, Huazhong University of Science and Technology. Her research interests include simultaneous localization and mapping, augmented reality, and medical image analysis. She has published over 90 technical papers, including in TPAMI, IJCV, TMI, MedIA, CVPR, ECCV, and MM, co-authored two books, and holds 3 U.S. patents. Prof. Yang is a member of IEEE and a member of ACM.