
An Efficient Probabilistic Solution to Mapping Errors in LiDAR-Camera Fusion for Autonomous Vehicles

Dan Shen, Zhengming Zhang, Renran Tian, Yaobin Chen, Rini Sherony

D. Shen and Y. Chen are with the Department of Electrical and Computer Engineering, Indiana University Purdue University Indianapolis, Indianapolis, IN 46202 USA (e-mails: danshen, [email protected]). Z. Zhang is with the School of Industrial Engineering, Purdue University, West Lafayette, IN 47906 USA (e-mail: [email protected]). R. Tian is with the Department of Computer Information & Graphics Technology, Indiana University Purdue University Indianapolis, Indianapolis, IN 46202 USA (e-mail: [email protected]). R. Sherony is with the Collaborative Safety Research Center, Toyota Motor Engineering and Manufacturing North America Inc., Ann Arbor, Michigan, USA (e-mail: [email protected]).
Abstract

LiDAR-camera fusion is one of the core processes in the perception system of current automated driving systems. The typical sensor fusion process includes a list of coordinate transformation operations following system calibration. Although a significant amount of research has been done to improve fusion accuracy, inherent data mapping errors remain in practice due to system synchronization offsets, vehicle vibrations, small target sizes, and fast relative moving speeds. Moreover, increasingly complicated algorithms for improving fusion accuracy can overwhelm onboard computational resources, limiting actual implementation. This study proposes a novel and low-cost probabilistic LiDAR-camera fusion method to alleviate these inherent mapping errors in scene reconstruction. By calculating shape similarity using KL-divergence and applying a RANSAC-regression-based trajectory smoother, the effects of LiDAR-camera mapping errors are minimized in object localization and distance estimation. Designed experiments are conducted to demonstrate the robustness and effectiveness of the proposed strategy.

Index Terms:
Probabilistic Sensor Fusion, Perception System, LiDAR-camera Fusion, Micro-mobility transportation tool.

I Introduction

AUTONOMOUS vehicles (AVs) have become more popular and attractive with the tremendous technological progress of recent years. Noticeable achievements in both software and hardware have brought more automation functionality into reality. AVs are increasingly believed to be promising for avoiding potential crashes and protecting human lives. In addition to driving safety, AVs are also expected to improve ride comfort and smooth traffic flow with an increasing level of service on the roads [1]. According to a report of the U.S. Department of Transportation, around 94% of accidents take place due to human errors, and over 90% of those accidents were caused by visual information acquisition problems [2][3]. Because a large percentage of severe and fatal crashes involve vulnerable road users (VRUs), research on VRU safety is an indispensable part of the advent of AVs driving in cities. It is vital to investigate and examine whether AVs will detect and be beneficial to VRUs [4][5][6].

However, we are still far away from fully autonomous driving due to technical difficulties and legal and ethical issues. Automated driving in mixed-traffic environments is one of the main barriers to the development of fully autonomous vehicles, especially regarding the safety of interactions among AVs, human-driven vehicles, and VRUs such as cyclists and pedestrians [7]. Recently, the safety issues of micro-mobility platforms have also drawn increasing attention from the public. Alongside the convenience of such trips, numerous injuries and fatalities have been reported in association with micro-mobility options like e-scooters, according to estimates from the Centers for Disease Control and Prevention (CDC) [8].

Although different companies or research groups may have different AV systems, most solutions generally have three main parts: perception, decision making/path planning, and automatic control. To accomplish full autonomy, these three key modules need to cooperate, starting from the perception module. The AV’s perception system uses a set of onboard sensors to detect and understand the surrounding driving environment, containing both static and dynamic obstacles such as roadside objects, road boundaries, moving pedestrians and vehicles, and traffic signs. Light Detection and Ranging (LiDAR) and cameras are two of the most essential and commonly used sensors for scene understanding in vehicle perception systems. The camera is a low-cost, compact, high-resolution sensor providing 2D appearance information about the environment in the form of RGB or gray-scale images. Cameras generally work well in clear weather conditions with good illumination. LiDAR is a remote sensing technology that utilizes pulsed laser beams to obtain 3D information about surrounding objects, including accurate distance measurements. Current LiDAR usually has lower resolution compared to cameras but can work under a wider variety of weather conditions. Given these complementary strengths and weaknesses, fusing both sensors can leverage their advantages while compensating for their individual limitations. Moreover, having multiple multi-modal sensors allows for redundancy when facing failures of individual sensors. Therefore, the LiDAR-camera combination is an effective way to achieve both object detection and range calculation for safe autonomous driving [9][10].

The key part of a LiDAR-camera-based perception system is sensor fusion, which synthesizes inputs from the two types of sensors via coordinate transformation, data point mapping, and 3D scene reconstruction of the surroundings. Many researchers have contributed to the development of LiDAR-camera fusion algorithms. An interactive LiDAR-to-camera calibration toolbox to estimate the intrinsic and extrinsic transforms has been introduced in [11]. The authors in [12] address the common, yet challenging, LiDAR-camera semantic fusion problems, which are seldom approached in a wholly probabilistic manner. A novel open-source tool for extrinsic calibration of radar, camera, and LiDAR has been presented in [13], facilitating joint extrinsic calibration of all three sensing modalities from multiple measurements. [14] presents a pipeline for extrinsic calibration of a ZED stereo camera with a Velodyne Puck LiDAR, using a novel 3D marker whose edges can be robustly detected both in the image and in the 3D point cloud. In another recent study on extrinsic calibration for LiDAR-camera fusion, an algorithm has been introduced in [15] to estimate the similarity transformation between laser points and pixels using a checkerboard.

Besides the conventional efforts to improve the accuracy of transformation matrices, learning-based approaches have also been exploited recently. A study on pedestrian classification applied a deep-learning-based method to data from a monocular camera and a 3D LiDAR sensor and compared early and late multi-modal sensor fusion approaches [16]. In [17], another deep learning approach has been developed to carry out road detection by fusing LiDAR point clouds and camera images. An unstructured and sparse point cloud was first projected onto the camera image plane and then upsampled to obtain a set of dense 2D images encoding spatial information.

Although many efforts have been put into LiDAR-camera fusion research, LiDAR-camera mapping errors are still pervasive in practice. In [18], fusion errors were measured by the distances between the actual chessboard 3D locations and the mapped LiDAR points, and then analyzed quantitatively using simulation and real test sequences to further refine the transformation matrices. Another study also reported LiDAR point projection errors on the images and used a target-based 3D-LiDAR-to-camera calibration method to achieve a 50% reduction in error values and a 70% reduction in variances [19]. True positives, false positives, and false negatives have been proposed as the main metrics for fusion accuracy evaluation [20]. Compared with the studies proposing fusion algorithms to improve accuracy in simulation or static lab environments, there are far fewer studies analyzing the mapping errors in practice and focusing on minimizing their effects. In a real road environment, issues like time synchronization errors, hardware/software limitations, vibrations, high moving speeds of the sensors and target objects, unreliable object detection, and others can all contribute to LiDAR-camera mapping errors. Very few papers focus on these errors in practice and how to deal with them.

The paper is organized in the following way. The problem statement and research objectives are discussed in Section II. Section III introduces the system calibration process and the proposed probabilistic fusion process. Section IV describes the performance evaluation method. Section V presents the evaluation results. Finally, the conclusions are given in Section VI.

II Problem Statement and Research Objective

In this research, we focus on the observed errors of LiDAR-Camera fusion from a practical view and propose a low-cost probabilistic algorithm to reduce the effects of the mapping errors on the fusion results. The research is motivated by two assumptions:

  • Although static sensor calibration can achieve very high fusion accuracy, the fusion of fast-moving small targets at long distances will still generate large mapping errors in a dynamic driving environment. This situation becomes more critical with the increasing popularity of micro-mobility vehicles like e-scooters, which tend to be small, fast, and agile while sharing the road with cars.

  • More advanced techniques like online learning and automatic calibration can potentially reduce the errors, at the cost of higher computational resources. This limits the applications of such algorithms in a wider range of embedded computing systems. A low-cost solution to the errors may satisfy the needs in many situations without intensive calibration work.

The goal of LiDAR-camera fusion is to map LiDAR cloud points onto the corresponding image pixels. In other words, this process provides the values in the 3D LiDAR Cartesian coordinate system for each pixel in the 2D image coordinate system, which then gives any detected object in the image a relative position with respect to the vehicle. Accurate fusion results can be seen in Figure 1, where the green dots on the image represent the mapped LiDAR points on the pixels. The edges of the same objects from both sensors should align well, and the dots should be well distributed across the object surface.

Figure 1: Example of an accurate LiDAR-Camera mapping result
Figure 2: Example of a LiDAR-Camera mapping with errors

In practice, two kinds of mapping errors commonly happen. The first is that background points get mapped into the areas of interest (AOIs) as noise. The second is the shifting of the projected LiDAR points away from the correct image pixels. Figure 2 shows an example of wrongly mapped results from LiDAR-camera fusion. Section III-A introduces the typical LiDAR-camera calibration process, which yields very accurate LiDAR-to-camera mapping results within the calibrated zone in a stable environment. In practice, imperfect time synchronization (due to sampling rate and saving time), fast motion, large angular accelerations, and locations beyond the calibrated zone may contribute to the errors.

To address these inherent errors, the objective of this paper is to propose a multi-step probabilistic approach to handle the errors when they are not avoidable. Data from larger AOIs are mapped first to deal with the shifting issues, and then a probabilistic filtering process is applied to remove noise and select only the correct mapping results. The fusion process is implemented in both a car-based and a wearable LiDAR-camera sensing system, targeting the interactions between a car and e-scooter riders. We choose e-scooter riders for the experiments, considering this new type of micro-mobility tool one of the most challenging road objects for AV perception systems due to its small size and fast movement. Comparisons are made among fusion outputs, baseline data, and ground-truth measurements to evaluate the performance of the proposed fusion method. The main contributions of the paper are summarized as follows:

  • An approach is proposed to improve the fusion results when LiDAR-camera mapping errors are not avoidable in practice.

  • The proposed method requires less computational resources and can supplement the efforts to improve mapping accuracy.

  • The KL-divergence-based shape descriptor can capture geometric information from LiDAR point distributions efficiently with simplified similarity calculation.

  • The proposed calibration and fusion method can support the detection and localization of micromobility vehicles like e-scooters efficiently.

III Probabilistic Fusion Process

The proposed probabilistic fusion process aims to integrate the data from a monochrome camera and a 3D LiDAR to localize the surrounding objects of interest and track their trajectories relative to the ego vehicle. The multi-step process is illustrated in Figure 3.

Figure 3: The proposed multi-step probabilistic fusion process that is robust to mapping errors

After the calibration of the LiDAR and camera pair, the collected LiDAR point cloud is mapped onto the image pixels. The AOIs are identified frame by frame via computer-vision-based object detection and tracking algorithms. Because these computer vision algorithms are not the main focus of this study, they are not discussed further. Combining the detection and tracking results with the mapped data generates fusion results on the AOIs. If everything is correct, the LiDAR points in the AOIs can be used to localize the corresponding objects via their 3D coordinates. As discussed earlier, two types of errors might occur in the outputs at this step, preventing direct localization of the objects. Thus, we propose a six-step pipeline to localize the objects efficiently in the presence of potential errors. The following subsections describe these steps in detail.

Moreover, we formulate the problem as follows. Without loss of generality, we ignore the continuity of an object; in other words, every pixel belongs to exactly one object. We can define the image as $\mathcal{X}$:

$\mathcal{X} := \{(l,w)\},$
$\forall\,(l,w)\in\mathcal{X},\ \exists\,o_{k}\ \text{and}\ c_{k}\subset\mathcal{X}\ \text{s.t.}\ (l,w)\in o_{k}\ \text{and}\ o_{k}\subset c_{k},$

where $o_{k}$ is the set of pixels belonging to object $k$ and $c_{k}$ is the set of pixels identified by the bounding box from a CV algorithm corresponding to object $k$.

Similarly, we can define $\mathcal{Y}$ as the LiDAR point cloud:

$\mathcal{Y} := \{(x,y,z)\},$
$\forall\,(x,y,z)\in\mathcal{Y},\ \exists\,p_{k}\subset\mathcal{Y}\ \text{s.t.}\ (x,y,z)\in p_{k},$

where $p_{k}$ is the set of LiDAR points belonging to object $k$.

A mapping $\mathbf{M}:\mathcal{Y}\rightarrow\mathcal{X}$ maps a LiDAR triplet $(x,y,z)$ to a pixel $(l,w)$. In ideal conditions,

$\mathbf{M}(x,y,z)=(l,w)$ (1)
$\mathbf{M}^{-1}(l,w)=(x,y,z)$ (2)

Assuming $\epsilon_{l}$ and $\epsilon_{w}$ are the mapping errors, we have

$\mathbf{M}(x,y,z)=(l,w)+(\epsilon_{l},\epsilon_{w})$ (3)

Given the object of interest $k^{*}$, we propose the algorithm $G(\cdot)$ to select the LiDAR points reflected by object $k^{*}$ from those mapped into the bounding box $c_{k^{*}}$, s.t.

$m_{k^{*}}=G(\mathbf{M}^{-1}(c_{k^{*}}))$ (4)
$\Pr\left(\left[|p_{k^{*}}|-\sum_{(x,y,z)\in m_{k^{*}}}f(x,y,z,k^{*})\right]<t_{1}\right)\geq t_{2}$ (5)

where

$f(x,y,z,k)=\mathbf{1}_{\{(x,y,z)\in p_{k}\}},$

and $t_{1}$ and $t_{2}$ are predefined thresholds.

III-A System Calibration

In order to map LiDAR data to the images, the LiDAR-camera system first needs to be calibrated. In this study, we follow the standard calibration process for LiDAR-camera fusion, primarily calculating three matrices, namely the “intrinsic camera matrix,” the “distortion parameters,” and the “extrinsic transformation matrix” [21][22]. The camera matrix contains the camera parameters, such as focal lengths, shear, and image width and height. The distortion coefficients represent the corrections due to lens distortion, such as for a fish-eye lens, and are assumed to be zero for lenses with a normal field-of-view (FOV). In this study, we assume the distortion of the images is minimal and consider only two components, the intrinsic camera matrix and the extrinsic matrix, for system calibration. The process is depicted in Figure 4. In the figure, from left to right, the 3D LiDAR point cloud data are converted to 3D camera coordinates using the extrinsic matrix. The intrinsic matrix is then applied to map the 3D camera coordinates to the 2D pixel coordinates in the image plane.

Figure 4: Process of system calibration.

A three-step process is applied to calculate these matrices. The intrinsic camera matrix is calculated using the MATLAB Calibration Toolbox. The extrinsic matrix is obtained by first calculating an initial estimate using the Levenberg-Marquardt algorithm, also known as the damped least-squares method for solving $AX=B$ matrix equations; this step is implemented with the OpenCV SolvePnP function. After obtaining this initial matrix, a Python script is used to fine-tune the extrinsic matrix manually.
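The following sketch illustrates this initialization step under stated assumptions: `lidar_pts` and `pixel_pts` are hypothetical names for manually picked LiDAR-image correspondences, and `K` and `dist` come from the intrinsic calibration. It is not the exact script used in this work.

```python
import numpy as np
import cv2

def initial_extrinsic(lidar_pts, pixel_pts, K, dist):
    """Estimate an initial 4x4 extrinsic matrix from point correspondences."""
    # solvePnP with the iterative flag performs Levenberg-Marquardt refinement
    ok, rvec, tvec = cv2.solvePnP(
        lidar_pts.astype(np.float64),   # N x 3 LiDAR points
        pixel_pts.astype(np.float64),   # N x 2 image pixels
        K, dist, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("solvePnP failed to converge")
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> 3x3 rotation matrix
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, tvec.ravel()
    return T                            # to be fine-tuned manually afterwards
```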

The sensor fusion process is governed by the following intrinsic-extrinsic matrix equation 6, which describes the LiDAR-to-camera coordinate transformation in the homogeneous representation.

$Z_{c}\begin{bmatrix}u\\ v\\ 1\end{bmatrix}=\begin{bmatrix}f_{x}&0&o_{x}&0\\ 0&f_{y}&o_{y}&0\\ 0&0&1&0\end{bmatrix}\begin{bmatrix}r_{11}&r_{12}&r_{13}&t_{x}\\ r_{21}&r_{22}&r_{23}&t_{y}\\ r_{31}&r_{32}&r_{33}&t_{z}\\ 0&0&0&1\end{bmatrix}\begin{bmatrix}X\\ Y\\ Z\\ 1\end{bmatrix}$ (6)

where $u$ and $v$ are the pixel coordinates in the image plane, $X$, $Y$, and $Z$ are the coordinates from the LiDAR measurements, and $Z_{c}$ is the point depth in the camera coordinate system. The first matrix is the 3×4 camera intrinsic matrix: $f_{x}$ and $f_{y}$ are the focal lengths expressed in pixel units, and the principal point $(o_{x},o_{y})$ usually coincides with the image center. The second matrix is the 4×4 extrinsic matrix, composed of a 3×3 rotation matrix R and a 3×1 translation vector T.

In ideal conditions, LiDAR point clouds can be accurately mapped onto the image by directly applying equation 6 after the calibration process; this is the mapping $\mathbf{M}$ defined earlier. However, various factors result in mapping errors in practice. The following six steps reduce the effects of such errors on object detection and localization.
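As a minimal sketch of the ideal mapping $\mathbf{M}$ in equation 6, the function below projects LiDAR points into pixel coordinates, assuming `K` holds the 3×3 intrinsic parameters (the 3×4 form in equation 6 is the same matrix with a zero column appended) and `T` is the 4×4 extrinsic matrix; the names are illustrative.

```python
import numpy as np

def project_lidar_to_image(points_xyz, K, T):
    """Map LiDAR points (N x 3) to pixel coordinates (u, v)."""
    n = points_xyz.shape[0]
    homog = np.hstack([points_xyz, np.ones((n, 1))])  # homogeneous LiDAR points
    cam = (T @ homog.T)[:3, :]                        # 3D camera coordinates
    in_front = cam[2, :] > 0                          # keep points ahead of the camera
    uvw = K @ cam[:, in_front]
    uv = uvw[:2, :] / uvw[2, :]                       # perspective divide by Z_c
    return uv.T, in_front
```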

III-B RANSAC-based Preprocessing

Given mapping results with potential errors, the first step is to remove structured noise, primarily ground reflections. Random Sample Consensus (RANSAC) randomly samples data to construct a best-fit estimate and is used here to fit the ground plane. To run RANSAC more efficiently, we also set up a range limit, and LiDAR points that exceed the limit are removed directly. The detailed process of applying the RANSAC algorithm in this paper is as follows:

  • Step 1 - Set the map boundary as 70 m long and 30 m wide for the ego object and extract all point cloud data in the boundary

  • Step 2 - The normal direction of the ground plane should generally point upward along the Z-axis as (0,0,1)

  • Step 3 - Set the maximum point-to-plane distance ($\delta$ = 0.2 m) for plane fitting

  • Step 4 - The number of trials N is defined as

    $N=\frac{\ln(1-p)}{\ln[1-(1-\epsilon)^{n}]}$ (7)

    where p = 0.99 is the probability that at least one of the N trials will be free of outliers in the estimated ground plane, $\epsilon$ = 0.2 is the probability that a randomly chosen point is an outlier, and n = 6 is the number of 3D points used in each trial to compute the ground plane.

  • Step 5 - The minimum value for the size of the inlier set for it to be acceptable is determined as

    $P=(1-\epsilon)\Sigma$ (8)

    where $\Sigma$ is the total number of 3D LiDAR points

  • Step 6 - Only the point clouds that are not in the ground plane are selected.

After the preprocessing step, a considerable portion of the noise is eliminated.

$\mathcal{Y}^{\prime}=RANSAC(\mathcal{Y})=\mathcal{Y}\setminus m_{g}$ (9)

where $m_{g}$ is the set of all ground LiDAR points, and $\mathcal{Y}^{\prime}$ is the set of all LiDAR points excluding ground points. As shown in Figure 5, the ground point cloud data (white points in the left image) are removed from the collected LiDAR data. In the filtered 3D plot (right image), objects such as e-scooter riders, pedestrians, and trees can be observed very clearly.

Figure 5: Results of removing ground points using RANSAC.
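A minimal sketch of this ground-removal step is given below, using the parameters listed above (p = 0.99, $\epsilon$ = 0.2, n = 6, $\delta$ = 0.2 m) and an SVD-based plane fit per trial; the boundary handling and function names are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def remove_ground(points, p=0.99, eps=0.2, n=6, delta=0.2):
    """Drop ground-plane points from an (N, 3) LiDAR array."""
    # Step 1: keep points inside a 70 m x 30 m boundary around the ego object
    pts = points[(np.abs(points[:, 0]) < 70) & (np.abs(points[:, 1]) < 30)]
    # Step 4: number of RANSAC trials, equation 7
    N = int(np.ceil(np.log(1 - p) / np.log(1 - (1 - eps) ** n)))
    # Step 5: minimum acceptable inlier count, equation 8
    min_inliers = (1 - eps) * len(pts)
    rng = np.random.default_rng(0)
    best = None
    for _ in range(N):
        sample = pts[rng.choice(len(pts), n, replace=False)]
        centroid = sample.mean(axis=0)
        normal = np.linalg.svd(sample - centroid)[2][-1]    # plane normal estimate
        if normal[2] < 0:
            normal = -normal
        if normal[2] < 0.9:          # Step 2: ground normal should point along +Z
            continue
        dist = np.abs((pts - centroid) @ normal)            # point-to-plane distance
        inliers = dist < delta                              # Step 3
        if best is None or inliers.sum() > best.sum():
            best = inliers
    if best is None or best.sum() < min_inliers:
        return pts                   # no acceptable ground plane found
    return pts[~best]                # Step 6: keep only non-ground points
```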

III-C Enlarged AOIs

After mapping LiDAR points to the detected objects and removing ground reflections using RANSAC-based preprocessing, two major issues explain the remaining data noise:

  • Due to various system and algorithm limitations, the LiDAR points reflected from the target object may be mapped away from the center of the AOI or even outside the detected AOI. If only the mapped data within the AOIs are considered as inputs, the localization algorithm may use only some of the correct LiDAR points, or none at all.

  • The mapped LiDAR data on AOIs may contain reflections from other objects because of mapping offsets and inaccurate outputs from the object detection algorithms. This will bring the background points (mainly from other objects in the view) into the localization calculation of the target object, if not well addressed.

Figure 6: Sample of an enlarged AOI

We address the first issue by enlarging the AOIs (Fig. 6). The logic is straightforward: larger AOIs cover more adjacent points and thus ensure the inclusion of more LiDAR points from the target object. Considering the random directions of possible offsets, the enlargement goes in all four directions, in proportion to the size of the bounding box. The exact ratio is a hyperparameter to be selected based on the system and application; in our case, 100% enlargement was implemented in the left and right directions.

$c_{k^{*}}^{\prime}=Large(c_{k^{*}})$ (10)

where $c_{k^{*}}^{\prime}$ is the set of all pixels in the enlarged AOI $k^{*}$, and $Large(c_{k^{*}})$ is the AOI enlargement process.
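An illustrative version of the $Large(\cdot)$ operation is sketched below; the box format, image-size clipping, and the `ratio_x`/`ratio_y` parameters are assumptions, with 100% widening to the left and right as used in this work.

```python
def enlarge_aoi(box, img_w, img_h, ratio_x=1.0, ratio_y=0.0):
    """Expand a bounding box (x1, y1, x2, y2) in proportion to its size."""
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * ratio_x          # 100% enlargement to the left and right
    dy = (y2 - y1) * ratio_y
    return (max(0, x1 - dx), max(0, y1 - dy),
            min(img_w, x2 + dx), min(img_h, y2 + dy))
```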

Since AOI enlargement exacerbates the second issue by adding more noise from the background or irrelevant objects, there is a trade-off between the influence of the first and the second issue. The next steps focus on the remaining noise.

III-D Mode-Based Clustering

Figure 7: Distance distribution

In the previous steps, the enlarged AOIs ensure the inclusion of correct LiDAR points from the target object, at the risk of including more noise from other objects. This noise tends to form isolated clusters because the continuous ground reflections have been removed in the preprocessing stage. In this step, we devise a method based on three assumptions to fetch the correct LiDAR points.

  1. The first assumption is that the expanded AOI includes most of the correct LiDAR points from the target object, which assures that enough LiDAR points in the AOI are from the target object.

  2. Secondly, if the AOI includes multiple objects, the target object is one of the largest. This assumption ensures successful generation of data characteristics.

  3. Lastly, the LiDAR points from one object should have similar distance readings from the LiDAR, assuming the object is a continuum.

Then, we can create the distance distribution of all LiDAR points in the augmented AOI under a certain granularity level. The granularity level is a hyperparameter based on the class of the target object (for example, a vehicle should have a larger granularity than a pedestrian). To further decide the bin positions, we cluster the points using K-Means to find high-density distance centers (which are assumed to be the distance centers of objects) and use them as the centers of bins. Since bin-based clusters are used to represent each object in later steps, a misaligned bin position can significantly affect correct point selection. Once the key bins are positioned, the rest of the bins are filled sequentially following the tuned granularity level.

After determining the bin positions and sizes, the distribution of all measured distances from LiDAR points in the enlarged AOIs can be created. A sample distance distribution with three distance centers (peaks) is shown in Figure 7. Such a distance distribution clusters the distance-similar points together. Furthermore, based on the second assumption above, we consider the histogram columns with higher densities as potential candidate clusters for the target object. If there is only one concentrated bar in the distance distribution, we use LiDAR points in this cluster for the target object. There are two hyper-parameters to further filter the candidate clusters for situations with more than one peak:

  • The minimum number of points required for a peak to be selected.

  • The proportion between the number of points in the selected peak and the highest peak.

These two hyper-parameters should be tuned based on the performance in the individual application. In Figure 7, depending on the values of the filter hyper-parameters, two or three of the highest identified peaks may qualify.

$Mode(\mathbf{M}^{-1}(c_{k^{*}}^{\prime}))=(m_{1},m_{2},\cdots,m_{n})$ (11)

where $m_{1},m_{2},\cdots,m_{n}$ are the qualified candidate clusters of LiDAR points, and $Mode(\cdot)$ is the mode-based clustering operation.
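The sketch below illustrates one way to implement this mode-based clustering under simplifying assumptions: K-Means locates the high-density distance centers, each center anchors a bin of the given granularity, and the two filter hyper-parameters (here called `min_points` and `peak_ratio`, hypothetical names) keep only the qualified candidate clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_clusters(points, granularity=0.5, n_centers=3,
                       min_points=20, peak_ratio=0.3):
    """Return candidate LiDAR clusters m_1, ..., m_n inside an enlarged AOI."""
    dist = np.linalg.norm(points[:, :2], axis=1)            # range of each point
    centers = np.sort(KMeans(n_clusters=n_centers, n_init=10)
                      .fit(dist.reshape(-1, 1)).cluster_centers_.ravel())
    clusters = [points[np.abs(dist - c) < granularity / 2] for c in centers]
    sizes = np.array([len(m) for m in clusters])
    # filter 1: minimum number of points; filter 2: proportion to the highest peak
    keep = (sizes >= min_points) & (sizes >= peak_ratio * sizes.max())
    return [m for m, k in zip(clusters, keep) if k]
```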

III-E Shape Similarity Filter

If only one cluster is selected by the mode-based clustering process, the data in that cluster can be used for the further steps; otherwise, one cluster must be selected from the candidates for the target object. This is achieved by calculating the shape similarity between each cluster and the target object type.

The logic is that the target object should have a similar shape to the type it belongs to. The type of the object is known from the object detection algorithm. The rest of the LiDAR points in the enlarged AOI come either from noise, from other types of objects, or from a partial shape of the same object type (assuming that the object detection algorithm outputs an accurate bounding box surrounding the target object). Thus, the cluster of LiDAR points with the highest similarity to the targeted object type should be selected.

III-E1 Shape Descriptor

We use a shape descriptor to capture geometric information and simplify the shape representation via the following three steps:

  1. The 2D mapping of LiDAR points from each candidate cluster is divided evenly into nine sections.

  2. The percentage of points in each section is calculated.

  3. The nine percentages form a vector representing the shape information of the candidate LiDAR point cluster.

The bottom-left table in Figure 8 shows an example of the generated nine percentages, which can be flattened to form the shape descriptor vector.
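A minimal sketch of this descriptor is shown below, assuming the candidate cluster is given as its 2D pixel mapping `uv`; the 3×3 split uses the cluster's own bounding extent.

```python
import numpy as np

def shape_descriptor(uv):
    """Nine-element shape descriptor: fraction of points in each of 3x3 cells."""
    u, v = uv[:, 0], uv[:, 1]
    u_edges = np.linspace(u.min(), u.max(), 4)    # three columns
    v_edges = np.linspace(v.min(), v.max(), 4)    # three rows
    hist, _, _ = np.histogram2d(u, v, bins=[u_edges, v_edges])
    return (hist / hist.sum()).ravel()            # flatten to a 9-vector
```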

III-E2 Benchmark Shapes for Different Objects

To enable comparison with the target object type, benchmark shape descriptors for each type of target object need to be created. These distributions (shape descriptor vectors) can be learned by processing and averaging a list of manually selected objects for each type of interest (vehicles, pedestrians, e-scooter riders, etc.). The number of objects processed to create the benchmark can be tuned as another hyperparameter.

Figure 8: Shape similarity and rotation demo

III-E3 Shape Rotation

The proposed shape descriptor is invariant to the number of LiDAR points and the size of the bounding box, but it is sensitive to rotation. Even for the same object, rotated LiDAR points produce quite different shape descriptor vectors, which may happen when the vehicle bumps or vibrates, causing large angular movements of the LiDAR and camera in a real road environment.

To deal with the rotation issue, we use a constrained PCA (Principal Component Analysis) to find the angle between the object’s main direction and the ground and rotate the selected LiDAR points accordingly. The application constrains the maximum rotation angle to be less than 40 degrees. Since PCA identifies the basis with maximum variation, the object’s shape is rotated along with it. The constrained PCA is based on the assumption that the target object stands upright on the ground, so the first component’s direction is aligned with the object’s body.
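The sketch below shows one way such a constrained-PCA rotation could look, assuming the cluster is represented by its 2D pixel footprint `uv`; measuring the angle against the vertical image axis and skipping rotations beyond 40 degrees follow the description above, while the exact axis convention is an assumption.

```python
import numpy as np

def rotate_upright(uv, max_angle_deg=40.0):
    """Rotate a 2D point cluster so its first principal component is vertical."""
    centered = uv - uv.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))
    main_dir = eigvecs[:, np.argmax(eigvals)]       # direction of maximum variation
    angle = np.arctan2(main_dir[0], main_dir[1])    # angle to the vertical axis
    if np.degrees(abs(angle)) > max_angle_deg:      # constraint: skip large rotations
        return uv
    c, s = np.cos(-angle), np.sin(-angle)
    R = np.array([[c, -s], [s, c]])
    return centered @ R.T + uv.mean(axis=0)
```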

III-E4 Similarity Calculation

Given the rotated shape descriptor vectors, the differences among them are calculated using the Kullback–Leibler (KL) divergence shown in Equation 12 below and then projected onto a probability measure through a modified sigmoid function. In the equation, P and Q are two discrete probability distributions over the same probability space $\mathcal{X}$; in this case, they are two rotated shape descriptor vectors.

$D_{\mathrm{KL}}(P\|Q)=\sum_{x\in\mathcal{X}}P(x)\log\left(\frac{P(x)}{Q(x)}\right)$ (12)

After calculating the KL divergence between each candidate LiDAR point cluster and the benchmark shape for the target object’s type, the cluster with the smallest divergence, i.e., the highest similarity score, is identified as the object, as in equation 13.

$m^{*}=\operatorname*{argmin}_{m_{1},m_{2},\cdots,m_{n}}\big(D_{\mathrm{KL}}(Shape(m_{1})\|S^{*}),D_{\mathrm{KL}}(Shape(m_{2})\|S^{*}),\cdots,D_{\mathrm{KL}}(Shape(m_{n})\|S^{*})\big)$ (13)

where $Shape(\cdot)$ is the process of calculating the shape distribution, $m^{*}$ is the set of LiDAR points selected for object $k^{*}$, and $S^{*}$ is the benchmark shape distribution.
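A small sketch of this selection step is given below; the epsilon smoothing that keeps empty descriptor cells from producing infinite divergence is an added assumption, and the sigmoid-based confidence mapping mentioned above is omitted for brevity.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-6):
    """KL divergence between two shape descriptor vectors (equation 12)."""
    p = (p + eps) / (p + eps).sum()   # smooth empty cells to avoid division by zero
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

def select_cluster(descriptors, benchmark):
    """Pick the candidate whose descriptor is closest to the benchmark shape."""
    scores = [kl_divergence(d, benchmark) for d in descriptors]
    return int(np.argmin(scores))     # smallest divergence = most similar cluster
```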

For the three peaks color-coded in Figure 7, Figure 9 shows that they come from the target pedestrian, shown in red (with the LiDAR data mapped at an offset), and two background vehicles, shown in green and blue, respectively. The shape similarity scores of the three clusters against the pedestrian object type are presented as the numbers in the upper-left corner of Figure 9. For each row, the first number is the similarity score (confidence level) between the cluster and the pedestrian shape before rotation, the middle number is the distance of the LiDAR point cluster, and the last number is the similarity score after rotation. Based on the outputs, the cluster shown in red is identified as the target, with all points belonging to the pedestrian correctly selected.

Figure 9: Distance calculation demo. (The first number in each row is the confidence level of the human shape before rotation. The second number is the distance. The third number is the confidence level after rotation.)

III-F Distance & Angle Calculation

The main goal of the fusion process is to localize the target object by calculating its angle and distance. After the previous processes, the LiDAR points of the selected cluster have been identified. Since the distance distribution uses a specific granularity before the bin-based cluster is selected, the distance and angle errors are bounded by the granularity level. There are two strategies to calculate the distance and angle from these data:

  1. Averaging across all the points in the clustered LiDAR cloud.

  2. Selecting one representative LiDAR point for the object.

We believe that both methods may generate similar results in most cases. However, to increase robustness against possible distribution skewness and outlier issues when the bin width is relatively large, we select the point with the median distance among all LiDAR points in the cluster to represent the object. This method ensures an accurate estimation as long as more than 50% of the points in the cluster belong to the target object.

Figure 10: Multiple objects distance calculation. (Y-axis is distance, and X-axis is frame ID. Different colors are used to label the camera the objects are captured from. The black line is manually annotated distance for each object.)

Both distance and angle calculations are based on this representative point. The distance of an object is simply the Euclidean distance calculated from the representative point’s X and Y values. Since the LiDAR is usually mounted on top of the car, including the z-dimension in the calculation might overestimate the distance. The angle between the X-axis of the LiDAR’s coordinate system and the representative point’s vector in the X-Y plane is used to identify the object’s orientation.
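A minimal sketch of this localization step follows, assuming the selected cluster is an (N, 3) array in the LiDAR coordinate system; the median-range point, the X-Y-only distance, and the angle from the X-axis match the description above.

```python
import numpy as np

def localize(cluster_xyz):
    """Distance and angle of the object represented by the median-range point."""
    ranges = np.linalg.norm(cluster_xyz[:, :2], axis=1)         # ignore Z
    rep = cluster_xyz[np.argsort(ranges)[len(ranges) // 2]]     # median-range point
    distance = float(np.hypot(rep[0], rep[1]))                  # Euclidean distance in X-Y
    angle = float(np.degrees(np.arctan2(rep[1], rep[0])))       # angle from the X-axis
    return distance, angle
```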

Figure 10 shows an example of distance calculation for five objects tracked continuously over 100 frames using multiple cameras. These cameras face different directions with overlap and are fused with the same LiDAR using the proposed probabilistic fusion process. The cameras and the LiDAR are synchronized. When multiple objects move in the scene, the LiDAR can always capture them, while different cameras may capture different objects at different times, together or individually. Then, frame by frame, the distances of the objects calculated from different cameras are drawn in the plot. Different colors correspond to the cameras that captured the objects for that calculation. The curves align continuously and smoothly, and the same objects’ distances captured by different cameras are consistent. Figure 10 gives a sense of our method’s performance; the details are discussed in a later section.

III-G Trajectory Smoother

The proposed fusion process is based on the assumptions mentioned above in section III-D. The algorithm can localize the target object accurately when a small number of errors exist in the LiDAR and camera mapping results. However, when a large proportion of the LiDAR points in the enlarged AOIs are not from the target object, the distance and angle calculations are problematic. These errors need to be addressed in this step.

Supported by the object detection and tracking, consecutive localization can be completed frame by frame to form the trajectory of the target object, as illustrated in Figure 10. A trajectory smoother is implemented to detect the outliers as the localization errors, under two assumptions:

  • Assumption 1: The target object’s movement is continuous in three dimensions, given the sampling rate of at least 10 frames per second.

  • Assumption 2: The mapping errors are from the distance of other objects, which will result in a distance difference larger than the size of the object.

Then, for any given duration from t0 to t1, we use two phases of regression with different orders to smooth the trajectory. The first phase uses three order-two polynomial regression models with RANSAC to detect outliers (wrong distance/angle calculations) in the trajectory, corresponding to the three dimensions of the Cartesian 3D space. After excluding all the outliers, we conduct three order-three polynomial regressions to smooth the inliers and interpolate the missing values. We use two phases of regression because higher degree-of-freedom regression models are easily affected by outliers; therefore, we lower the polynomial order in the outlier detection phase. A higher-order polynomial regression can capture the curvature of the trajectory more accurately given changing accelerations and is thus used after the outliers are removed.

During the outlier detection phase, RANSAC is applied iteratively to sub-sample random points and fit a regression model. After a certain number of iterations, the best-fit regression model is considered the correct model. Points whose deviation from the best-fit model exceeds the threshold are classified as outliers; in this process, we set the threshold to twice the standard deviation.

Each regression model fits one of the X, Y, and Z coordinates, respectively. A point is identified as an outlier if any of its X, Y, or Z values exceeds the corresponding model’s threshold. For example, if a noise LiDAR point is reflected from an object behind the target object, and both objects are in front of the LiDAR, the smoother cannot detect this noise point through the X and Z values; however, our scheme also checks the Y value and identifies the point as an outlier.
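A sketch of this two-phase smoother under the above description is shown below, using scikit-learn's RANSACRegressor for the order-two outlier detection and a plain order-three polynomial fit for smoothing; treating twice the per-coordinate standard deviation as the residual threshold is a simplifying assumption.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor, LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def smooth_trajectory(t, traj):
    """traj: (N, 3) array of X, Y, Z locations sampled at times t."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    inlier = np.ones(len(t), dtype=bool)
    for d in range(3):                          # phase 1: order-2 RANSAC outlier detection
        y = traj[:, d]
        model = make_pipeline(
            PolynomialFeatures(2),
            RANSACRegressor(LinearRegression(),
                            residual_threshold=2 * np.std(y)))
        model.fit(t, y)
        inlier &= model.named_steps["ransacregressor"].inlier_mask_
    smoothed = np.empty((len(t), 3))
    for d in range(3):                          # phase 2: order-3 smoothing on inliers
        coeffs = np.polyfit(t[inlier, 0], traj[inlier, d], 3)
        smoothed[:, d] = np.polyval(coeffs, t[:, 0])
    return smoothed, inlier
```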

The final outputs of this probabilistic fusion process will be a smoothed trajectory of target objects with minimum effect from LiDAR-camera mapping errors.

IV Performance Evaluation Method

The performance of the proposed probabilistic fusion process is evaluated with experiments. To thoroughly test the process, we developed two hardware systems with different sensors and ran them on different platforms to perform LiDAR-camera fusion. Experiments were then conducted to collect sensing data and ground-truth data of interactions between a car and an e-scooter rider on a test track. The performance of the proposed fusion process is evaluated by comparing the processed outputs with the ground-truth results. The detailed design of the systems can be found in [23].

IV-A Data Collection Apparatus

We developed two hardware systems to support the evaluation of the proposed LiDAR-camera fusion process: a car-based LiDAR-camera perception system and a wearable LiDAR-camera perception system. These two systems are described in the following subsections.

IV-A1 System I: Car-based LiDAR-camera Sensing System

The car-based sensing system includes six FLIR Grasshopper 3 cameras, one 64-beam 360-degree Ouster OS1 LiDAR, a Reach Emlid RTK (Real-time Kinematics) GPS, one IMU (Inertial Measurement Unit), and a desktop computer. The desktop computer, GPS module (excluding antenna), LiDAR, amplifier, and inverter are all placed in the car’s trunk. The main goal of the car-based system is to cover 360 degrees around the vehicle for scene reconstruction and an accurate self-localization via GPS and IMU.

Four cameras are placed at the corners on the top of the vehicle, and two cameras are placed at the center, one front-facing, and one back-facing. The corner cameras are equipped with wide-angle (96 degrees) lenses, and the cameras at the center are equipped with 40-degree lenses. The LiDAR is placed at the center of the car’s roof to provide maximum coverage around the vehicle. Figure 11 illustrates the overview of the car-based perception system with sensors.

All sensors are connected to the PC and operated by the PC using ROS (Robot Operating System), which provides the flexibility to control and process multiple sensors simultaneously. Furthermore, hardware-based synchronizations at 10 FPS (frames per second) are achieved for LiDAR and cameras by sending a pulse sequence to the camera’s sync port, and the GPS/IMU were continuously synchronized using the recorded UNIX timestamps from ROS bag files. Every rotation of the LiDAR (master) is synchronized with every camera capture frame (slave).

Figure 11: Overview of the car-based perception system with sensors.

IV-A2 System II: Wearable LiDAR-camera Sensing System

Unlike the car-based system, the wearable system cannot achieve a 360-degree view due to the occlusion of the human body. The cameras have to be carefully placed to cover the maximum available field of view in one direction (180-200 degrees). The wearable LiDAR-camera perception system consists of three USB cameras, a 3D LiDAR with an integrated IMU, an RTK GPS unit, and an NVIDIA Jetson Tx2 development computer board for data collection.

In this study, an e-scooter rider wears the wearable system to capture the rider’s own location while also sensing the car during interactions. As shown in Figure 12, the system can be worn either on the front or on the back of the rider. The LiDAR stays in the middle, with two cameras placed on the corners and one in the center. The GPS antenna is placed at the front of the e-scooter when the system is back-facing and on the rider’s helmet when it is front-facing, because the GPS antenna needs to be at least 50 cm away from other electronics. The three cameras cover 200 degrees of view in both front-facing and back-facing configurations. The software architecture of the wearable system is similar to that of the car-based system; the main difference is that all the sensors are synchronized with the UNIX timestamps on the bag files.

Figure 12: Overview of the Wearable LiDAR-camera perception system with sensors.

IV-B Experiment Design for Evaluation Data Collection

To better evaluate the fusion process, one experiment is designed using these two different hardware systems. The wearable system is worn by a researcher who rides an e-scooter in an empty parking lot, while the experiment vehicle is installed with the other set of the perception system. Both the experiment vehicle and the e-scooter are equipped with RTK (real-time kinematic) GPS sensors supported by multiple base stations, providing the ground-truth location to an accuracy of 10 centimeters at 5 frames per second (sampling time of 0.2 s). The e-scooter and the car run in parallel during the experiment, as shown in Figure 13. This scenario represents the situation in which the vehicle overtakes the e-scooter from the left side on a parallel path along a straight road in the same driving direction. The lateral offset is around 3 m, and the moving speed is about 25 mph for the vehicle and 10-20 mph for the e-scooter. The target object of the vehicle’s perception system is the e-scooter rider, and the target object of the wearable perception system is the experiment vehicle. There are pedestrians, parked vehicles, trees, light posts, and buildings in the background at different distances.

Figure 13: Designed scenario 1 (left) and scenario 2 (right).

Two trials of each scenario were completed during the experiments. In the world coordinate system, both the car and the e-scooter rider move from south to north along the y-direction. The distance between them is along the x-direction. For the car-based perception system, the collected time duration after post-processing is about 5.2 s, with driving distances of approximately 65 m and 38 m for the car and e-scooter, respectively. For the wearable system, the collected time duration after post-processing is about 5.8 s, with driving distances of approximately 60 m and 30 m for the car and e-scooter, respectively.

IV-C Measurements of Mapping Errors

Figure 14: TPR of checking the fusion accuracy in scenario 1 for vehicle-based system (left) and e-Scooter-based system (right) using one camera.
Figure 15: TPR of checking the fusion accuracy in scenario 2 for vehicle-based system (left) and e-Scooter-based system (right) using one camera.

The True Positive Rate (TPR) was calculated for both the car-based and the wearable LiDAR-camera perception systems to estimate the mapping errors. When judging the mapping correctness, the calculation allows a valid distance variation of up to 15% of the size of the vehicle or e-scooter, respectively. A mapped LiDAR point for the object is correct if the corresponding distance falls within that range. For example, if the ground-truth distance is 10 m for the car and the car is 4.5 m long, the mapped LiDAR points with distances from 9.325 m (10 - 0.15×4.5) to 10.675 m (10 + 0.15×4.5) are considered correct mappings; otherwise, they are wrong mappings. Similarly, if the ground truth is 10 m and the e-scooter is 1.5 m long, the mapped LiDAR points with distances from 9.775 m (10 - 0.15×1.5) to 10.225 m (10 + 0.15×1.5) are considered correct mappings; otherwise, they are wrong mappings. Then, the TPR can be calculated using the following formula:

$\mathrm{TPR}=\frac{TP}{TP+TN}$ (14)

where TP is the number of correct mappings, and TN is the number of wrong mappings. The higher the TPR, the better the fusion accuracy.
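A small sketch of this TPR computation with the 15%-of-object-size tolerance is given below; the argument names are illustrative.

```python
import numpy as np

def mapping_tpr(mapped_distances, gt_distance, object_length, tol=0.15):
    """TPR of mapped LiDAR points given a ground-truth distance (equation 14)."""
    d = np.asarray(mapped_distances)
    correct = (d >= gt_distance - tol * object_length) & \
              (d <= gt_distance + tol * object_length)
    tp = correct.sum()                 # correct mappings
    tn = len(d) - tp                   # wrong mappings
    return tp / (tp + tn)

# e.g., mapping_tpr(d, gt_distance=10.0, object_length=4.5) for the car example
```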

Examples of mapping errors in the two experiment scenarios are illustrated in Figure 14 and Figure 15 for the vehicle-based and e-scooter-based wearable systems, respectively. In the figures, the green points are correct mappings (TP), the red points are wrong mappings (TN), and the blue points indicate mapped points that fall outside the bounding box.

IV-D Measurements of Localization Errors

Another measurement used for fusion performance evaluation is the localization error. The localization errors for both the car-based and wearable perception systems are calculated by comparing the target localization estimated via the proposed fusion process with the ground-truth localization from the installed RTK GPS module. The target localization is defined as the coordinates (X, Y) in the perception system’s coordinate system, with the origin on the corresponding car or e-scooter rider. Then, the localization errors in terms of X and Y can be calculated using the Mean Absolute Error (MAE) as follows:

$MAE_{\mathrm{X}}=\frac{\sum_{i}^{N_{x}}\left|X_{i}-\hat{X_{i}}\right|}{N_{x}}$ (15)
$MAE_{\mathrm{Y}}=\frac{\sum_{i}^{N_{y}}\left|Y_{i}-\hat{Y_{i}}\right|}{N_{y}}$ (16)

where MAE is the mean absolute error and the subscript X indicates the X coordinate of the target’s localization. $X_{i}$ is the ground-truth X coordinate of the target at the i-th timestamp in the scenario, $\hat{X_{i}}$ is the estimated X coordinate after applying the proposed probabilistic fusion process at the i-th timestamp, and $N_{x}$ is the total number of X points. The same definitions are used for the MAE computation of coordinate Y.
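A minimal sketch of equations 15 and 16 is shown below, assuming the estimated and ground-truth trajectories have already been aligned frame by frame.

```python
import numpy as np

def localization_mae(est_xy, gt_xy):
    """Mean absolute errors in X and Y between estimated and ground-truth tracks."""
    err = np.abs(np.asarray(est_xy) - np.asarray(gt_xy))   # per-frame absolute error
    mae_x, mae_y = err.mean(axis=0)                        # equations 15 and 16
    return mae_x, mae_y
```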

IV-E Data Analysis Method

After conducting the experiments, collected data are processed following the proposed probabilistic fusion process. The two evaluation metrics described in sections IV-C and IV-D are calculated and compared at different levels to fully evaluate the performance.

IV-E1 Mapping Accuracy

The LiDAR-camera mapping accuracy is calculated at two levels:

  • Baseline mapping accuracy - The fusion result is first obtained when the static calibration matrices are directly applied following equation 6. For this typical fusion process, the mapping accuracy is measured with the true positive rate in equation 14 based on the original AOI.

  • Probabilistic fusion mapping accuracy - The mapping accuracy with the enlarged AOI, after applying the shape similarity filter, is computed following the same true-positive-rate principle.

IV-E2 Localization Accuracy

The object location is calculated at two levels:

  • Ground-truth location - since both perception platforms are equipped with GPS units, the GPS coordinates of the ego object ($X_{e}$, $Y_{e}$) and the target ($X_{t}$, $Y_{t}$) can be obtained. Thus, the ground-truth location in terms of X and Y relative to the ego object can be computed as ($X_{t}-X_{e}$, $Y_{t}-Y_{e}$).

  • Probabilistic fusion localization accuracy - after applying the proposed probabilistic fusion process, object localization can be achieved through distance & angle calculation. Some outliers can be further eliminated by applying the trajectory smoother. Eventually, a new set of more accurate target X and Y localization relative to the ego object will be obtained.

IV-E3 Performance Evaluation

The performance of the proposed probabilistic fusion process is evaluated using two hypotheses:

  • Hypothesis 1: The proposed probabilistic fusion process can significantly improve the mapping accuracy from the baseline.

  • Hypothesis 2: The proposed probabilistic fusion process can achieve an average TPR that is larger than 0.5. When TPR is larger than 0.5, the process can guarantee that the fusion result is accurate.

V Evaluation Results

V-A Calibration Accuracy for Two Sensing Systems

TABLE I: Fusion accuracy under static calibration condition (TPR)
Pair Index Car-based(%) Wearable(%)
1 94.9 100
2 100 100
3 100 100
4 100 N/A
5 70 N/A
6 100 N/A
Average 94.15 100

The two developed LiDAR-camera sensing systems are both calibrated following the process in section III-A. The fusion accuracies under the static calibration condition for both systems, measured by the TPR in equation 14 after mapping with equation 6, are shown in Table I below.

There are six pairs of LiDAR-camera fusion for the car-based system. Out of these pairs, four have a calculated TPR of 100%, and the average TPR is 94.15%. Similarly, there are three LiDAR-camera pairs for the wearable system, all of which achieve a TPR of 100%. This result shows that both systems are calibrated to achieve very high fusion accuracy in the static lab environment through the typical fusion process.

V-B Mapping Accuracy in Practice

Although the sensing systems achieve high fusion accuracy in a static environment, they encounter much larger errors in dynamic environments due to small synchronization offsets, high moving speeds, and vibrations.

The average TPRs for both the car-based and wearable systems in the two experiment scenarios (described in section IV-B) are calculated across randomly sampled frames combining all camera-LiDAR pairs, with the results shown in the Baseline column of Table II. The average TPRs for the car-based system are 31.72% in scenario 1 and 68.15% in scenario 2, both much lower than the TPRs in the lab environment. Similarly, the average TPRs for the wearable system are 48.76% and 55.77% in scenarios 1 and 2, respectively, which are also much lower than the 100% fusion accuracy in the static calibration.

Overall, these baseline fusion accuracies contain significant errors and are not good enough to localize the targets if used directly, especially for discontinuous or jumping frames with very low TPRs. A better sensing system may reduce these errors through advanced synchronization, more expensive hardware, and complicated and intensive calibration; however, it is still challenging to avoid mapping errors entirely. Since this study aims to address this issue using a low-cost probabilistic fusion process, these errors serve as the baseline and are not further reduced through other methods.

TABLE II: Paired two-sample t-test results for True Positive Rate (TPR). (Baseline and P-Fusion refer to the average TPR before and after the implementation of the algorithm, respectively.)
System Sce Baseline(%) P-Fusion (%) t p-val N
Car-based 1 31.72(25.30) 65.02(19.01) 8.01 <.001 10
2 68.15(26.02) 81.65(17.93) 6.54 <.001 19
Wearable 1 48.76(16.01) 65.46(18.14) 3.82 .006 6
2 55.77(22.17) 75.29(12.93) 6.34 <.001 10

For the same frames randomly sampled from the car-based and wearable systems during the two experiment scenarios, the proposed probabilistic fusion process was applied to improve the mapping accuracy. The average TPRs for these frames from the two systems are provided in the P-Fusion column in Table II. As illustrated in Figure 16, the average TPRs are much improved for all scenarios for both systems. For the car-based system, the average TPRs are improved to 65.02% and 81.65% in the two scenarios, respectively. Similarly, the average TPRs are improved to 65.46% and 75.29% for the wearable system.

We perform paired two-sample t-tests on the TPRs before (baseline) and after the probabilistic fusion process for all the randomly sampled frames from each scenario and both sensing systems (Table II). All tests reject the null hypothesis with small p-values ($p<0.001$ for the car-based system in both scenarios and for the wearable system in scenario 2, and $p<0.01$ for the wearable system in scenario 1). In other words, there is strong evidence that the differences between the TPRs calculated before and after applying the proposed algorithm are greater than zero. Therefore, the probabilistic fusion process improves the mapping accuracy significantly, and hypothesis 1 is supported.

Figure 16: Average (std) TPR before or after the proposed algorithm for each system type and scenario.

Having shown that the proposed algorithm improves the TPRs significantly, it is important to confirm whether the improved performance is sufficient for localization purposes. Thus, we conduct one-sample right-tailed t-tests to infer whether the average TPRs after the implementation of the algorithm are greater than 50%, with results shown in Table III. The results show that for both systems, the average TPRs are significantly larger than 50% ($p<0.05$ for scenario 1 and $p<0.001$ for scenario 2, respectively). This result supports hypothesis 2. Since the fusion process selects the median of the mapped LiDAR points for localization, a TPR larger than 50% guarantees that a correct LiDAR point, reflected by the target object, is selected.

TABLE III: One-sample (right-tailed) t-test results for the true positive rate, where $H_{1}$: $\mu>0.5$. (Mean_1 refers to the average TPR after the implementation of the algorithm.)
System Sce Mean_1 (%) t p-val N
Car 1 65.02(19.01) 2.50 .017 10
2 81.65(17.93) 7.69 <.001 19
eScooter 1 65.46(18.14) 2.09 .046 6
2 75.29(12.93) 6.19 <.001 10

V-C Localization Estimation Accuracy

In order to further evaluate the fusion performance, we calculate the localization estimation accuracy by comparing the calculated and ground-truth locations of the target objects, using data collected in the two experiment scenarios from both developed sensing systems. The car’s ground-truth locations are obtained from the GPS/IMU data, and its estimated locations are calculated from the wearable sensing system’s outputs after applying the proposed fusion process. A similar process is conducted for the e-scooter rider’s location.

A total of 102 frames were randomly sampled from all the collected data (57 for the car-based system’s output and 45 for the wearable system’s outputs). Then, we calculated the mean absolute error (MAE) (using equations 15 and 16) between the estimated trajectories and the ground-truth trajectories at all frames in meters. The average MAE value for the car-based system is 1.15 meters with a standard deviation of 0.62 meters. The average MAE value for the wearable system is 1.07 meters with a standard deviation of 0.68 meters. Since the localization errors are comparable to the sizes of the target objects, the output can be accurately used to support trajectory generation and crash estimation.

There are several contributors to the localization errors after applying the proposed fusion process. One significant part of the errors comes from the size of the target objects. The current process does not differentiate the exact reflection point on the object, which can be away from the GPS receiver. Since the ground-truth distance is calculated using the relative positions of the GPS receivers with an accuracy of 2 cm, the distance of the actual LiDAR reflection point from the GPS receiver can contribute some inherent localization error. This holds for both the car and the e-scooter rider targets. Similarly, the distance between the LiDAR origin and the GPS origin in both sensing systems also contributes to the errors. These two types of errors can be further reduced; however, since the paper’s primary purpose is to introduce the probabilistic fusion process, this additional work is out of the current scope and is not included.

Another contributing factor is the synchronization offset between the LiDAR (10 Hz) and GPS (5 Hz) units. Because of the different sampling rates, the two inputs can be misaligned by up to 0.1 seconds, which corresponds to an error of about 0.9 meters at a speed of 20 mph. Reducing this error is likewise beyond the scope of the probabilistic fusion process discussed here.
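The magnitude of this effect can be bounded with a short back-of-the-envelope calculation; the sketch below simply multiplies the worst-case timing offset by the relative speed, reproducing the roughly 0.9-meter figure quoted above.

```python
MPH_TO_MPS = 0.44704  # meters per second per mph

gps_rate_hz = 5.0  # GPS updates at 5 Hz; LiDAR runs at 10 Hz

# Worst-case misalignment: a LiDAR frame can be up to half a GPS
# sampling interval (0.2 s / 2 = 0.1 s) away from the nearest GPS fix.
max_offset_s = 0.5 / gps_rate_hz
speed_mps = 20.0 * MPH_TO_MPS  # ~8.94 m/s

worst_case_error_m = speed_mps * max_offset_s
print(f"Worst-case synchronization error: {worst_case_error_m:.2f} m")  # ~0.89 m
```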

VI Conclusion

As a core component of the AV sensing system, LiDAR-camera fusion has attracted much research attention in recent years. However, although typical calibration can achieve very high fusion accuracy in a static lab environment, inherent mapping errors are inevitable and expensive to address in practice. This limitation becomes critical when the sensing targets are micro-mobility vehicles such as e-scooters, which are small and move fast. In this paper, a low-cost probabilistic solution to the mapping errors in LiDAR-camera fusion is proposed for localization purposes. The multi-step data process comprises RANSAC-based data preprocessing, bounding box enlargement, mode-based clustering, shape similarity comparison using KL-divergence, distance and angle calculations, and a trajectory smoother built on three order-two polynomial regression models with RANSAC.
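To illustrate the last step, the following is a minimal sketch (not the authors' implementation) of fitting a single order-two polynomial regression with RANSAC over one trajectory coordinate versus time, using scikit-learn; in the described pipeline, one such model would be fitted per smoothed quantity. The data values are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy trajectory coordinate (e.g., longitudinal distance in
# meters) sampled over time, with a few outliers from mapping errors.
t = np.linspace(0.0, 3.0, 31).reshape(-1, 1)
x = 2.0 + 4.5 * t.ravel() - 0.8 * t.ravel() ** 2 + np.random.normal(0, 0.1, t.size)
x[[5, 17, 25]] += np.array([3.0, -2.5, 4.0])  # inject outliers

# Order-two polynomial features wrapped in RANSAC to reject outliers.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    RANSACRegressor(),
)
model.fit(t, x)
x_smooth = model.predict(t)  # smoothed coordinate with outliers suppressed
```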

Two LiDAR-camera sensing systems were developed, one car-based and one wearable, and both were carefully calibrated following typical procedures to achieve very high fusion accuracy in the lab environment. Experiments between a car and e-scooter riders were conducted with these sensing systems to collect interaction data in two scenarios. The results show that although large mapping errors are observed in the raw data, the proposed fusion process significantly reduces the mapping errors as measured by the TPRs, and the resulting outputs are accurate enough for localization, trajectory generation, and crash estimation purposes.

After applying the proposed method, some localization errors remain due to the different coordinate origins of the LiDAR and GPS units and their synchronization offsets; these will be further reduced in future work. Moreover, the proposed method will be implemented in real time in future applications to improve the accuracy of AV perception systems.

Acknowledgment

The authors would like to thank the project sponsor: Collaborative Safety Research Center (CSRC), Toyota Motor North America, Research and Development, Ann Arbor, Michigan.
