
Semantic OcTree Mapping and Shannon Mutual Information Computation for Robot Exploration

Arash Asgharivaskasi, Nikolay Atanasov. We gratefully acknowledge support from ARL DCIST CRA W911NF-17-2-0181 and NSF FRR CAREER 2045945. The authors are with the Department of Electrical and Computer Engineering, University of California San Diego, CA 92093, USA {aasghari,natanasov}@eng.ucsd.edu.
Abstract

Autonomous robot operation in unstructured and unknown environments requires efficient techniques for mapping and exploration using streaming range and visual observations. Information-based exploration techniques, such as Cauchy-Schwarz quadratic mutual information (CSQMI) and fast Shannon mutual information (FSMI), have successfully achieved active binary occupancy mapping with range measurements. However, as we envision robots performing complex tasks specified with semantically meaningful concepts, it is necessary to capture semantics in the measurements, map representation, and exploration objective. This work presents Semantic octree mapping and Shannon Mutual Information (SSMI) computation for robot exploration. We develop a Bayesian multi-class mapping algorithm based on an octree data structure, where each voxel maintains a categorical distribution over semantic classes. We derive a closed-form efficiently-computable lower bound of the Shannon mutual information between a multi-class octomap and a set of range-category measurements using semantic run-length encoding of the sensor rays. The bound allows rapid evaluation of many potential robot trajectories for autonomous exploration and mapping. We compare our method against state-of-the-art exploration techniques and apply it in a variety of simulated and real-world experiments.

Index Terms:
View Planning for SLAM, Reactive and Sensor-Based Planning, Vision-Based Navigation.

I Introduction

Accurate modeling, real-time understanding, and efficient storage of a robot's environment are key capabilities for autonomous operation. Occupancy grid mapping [1, 2] is a widely used, simple, yet effective technique for distinguishing between traversable and occupied space surrounding a mobile robot. OctoMap [3] is an extension of occupancy grid mapping that introduces adaptive resolution to improve the memory usage and computational cost of generating 3-D maps of large environments. Delegating increasingly sophisticated tasks to autonomous robots requires augmenting traditional geometric models with semantic information about the context and object-level structure of the environment [4, 5, 6]. Robots are also increasingly expected to operate in unknown environments, with little to no prior information, in applications such as disaster response, environmental monitoring, and reconnaissance. This calls for algorithms that allow robots to autonomously explore unknown environments and construct low-uncertainty metric-semantic maps in real time, while taking collision and visibility constraints into account.

Figure 1: A robot autonomously explores an unknown environment using an RGBD sensor and a semantic segmentation algorithm.

This paper considers the active metric-semantic mapping problem, requiring a robot to explore and map an unknown environment, relying on streaming distance and object category observations, e.g., generated by semantic segmentation over RGBD images [7]. See Fig. 1 for an illustration of the sensor data available for mapping and exploration. We extend information-theoretic active mapping techniques [8, 9, 10] from binary to multi-class environment representations. Our approach introduces a Bayesian multi-class mapping procedure which maintains a probability distribution over semantic categories and updates it via a probabilistic range-category perception model. Our main contributions are:

1. a Bayesian multi-class octree mapping approach,

2. a closed-form, efficiently-computable lower bound for the Shannon mutual information between a multi-class octree map and a set of range-category measurements,

3. an efficient C++ implementation achieving real-time high-accuracy performance onboard physical robot systems in 3-D real-world experiments (open-source software and videos supplementing this paper are available at https://arashasgharivaskasi-bc.github.io/SSMI_webpage).

Unlike a uniform-resolution grid map, our semantic OctoMap enables efficient mutual information evaluation via run-length encoding of the range-category observations. As a result, the informativeness of many potential robot trajectories may be evaluated to (re-)select one that leads to the best trade-off between map uncertainty reduction and motion cost. Unlike traditional class-agnostic exploration methods, our model and information measure capture the uncertainty of different semantic classes, leading to faster and more accurate exploration. The proposed approach relies on general range and class measurements and general pose kinematics, making it suitable for either ground or aerial robots, equipped with either camera or LiDAR sensors, exploring either indoor or outdoor environments.

We name our method Semantic octree mapping and Shannon Mutual Information (SSMI) computation for robot exploration. This paper is an extended version of our previous conference paper [11] enabling its application in large-scale 3-D environments. Direct application of the regular-grid method in the conference version to 3-D environments faces several challenges, including memory inefficiency, because 3-D environments are predominantly made up of free or unknown space, and information computation inefficiency due to naïve summation over all voxels visited by sensor rays. We introduce (a) an octree representation for Bayesian multi-class mapping, performing octree ray-tracing and leaf node merging using semantic category distributions, (b) semantic run-length encoding (SRLE) for sensor rays which enables efficient mutual information computation between a multi-class octree and range-category measurements, and (c) a comprehensive set of experiments and comparisons with state-of-the-art techniques in 2-D and 3-D environments. We demonstrate the effectiveness of SSMI in real-world mapping and autonomous exploration experiments performed in real-time onboard different mobile robot platforms operating in indoor and outdoor environments.

II Related Work

Frontier-based exploration [12] is a seminal work that highlighted the utility of autonomous exploration and active mapping. It inspired methods [13, 14] that rely on geometric features, such as the boundaries between free and unknown space (frontiers) and the volume that would be revealed by new sensor observations. Due to their intuitive formulation and low computational requirements, geometry-based exploration methods continue to be widely employed in active perception. Recent works include semantics-assisted indoor exploration [15], hex-decomposition-based coverage planning [16], and Laplace potential fields for safe outdoor exploration [17]. More related to our work, receding-horizon "next-best-view" planning [18] presents an active octree occupancy mapping method which executes trajectories built from a random tree whose quality is determined by the amount of unmapped space that can be explored. Similarly, the graph-based exploration methods of [19] and [20] use local random trees in free space to sample candidate viewpoints for exploration, while a global graph maintains the connections among the frontiers in the map. Cao et al. [21, 16] introduced a hierarchical approach to active 3-D coverage and reconstruction, which computes a coarse coverage path at the global scale, followed by a local planner that ensures collision avoidance via a high-resolution path. As shown by Corah and Michael [22], coverage-based exploration strategies can be formulated as mutual-information maximization policies in the absence of sensor noise. However, in many real-world circumstances sensor measurements are corrupted by non-trivial noise, reducing the effectiveness of geometric exploration methods that do not capture probabilistic uncertainty. For example, due to the domain shift between the training data and the test environment, utilizing a pre-trained semantic segmentation model in the mapping process requires accounting for measurement uncertainty in the exploration policy.

The work by Elfes [23] is among the first to propose an information-based utility function for measuring and minimizing map uncertainty. Information-based exploration strategies have been devised for uncertainty minimization in robot localization or environment mapping [24, 25, 26]. Information-theoretic objectives, however, require integration over the potential sensor measurements, limiting the use of direct numerical approximations to short planning horizons. Kollar and Roy [27] formulated active mapping using an extended Kalman filter and proposed a local-global optimization, leading to significant gains in efficiency for uncertainty computation and long-horizon planning. Unlike geometry-based approaches, information-theoretic exploration can be directly formulated for active simultaneous localization and mapping (SLAM) [28, 29, 30, 31], aiming to determine a sensing trajectory that balances robot state uncertainty and visitation of unexplored map regions. Stachniss et al. [32] approximate information gain for a Rao-Blackwellized particle filter over the joint state of robot pose and map occupancy. Julian et al. [8] prove that, for range measurements and known robot position, the Shannon mutual information is maximized over trajectories that visit unexplored areas. However, without imposing further structure over the observation model, computing the mutual information objective requires numerical integration. The need for efficient mutual information computation becomes evident in 3-D environments. Cauchy-Schwarz quadratic mutual information (CSQMI) [9] and fast Shannon mutual information (FSMI) [10] offer efficiently computable closed-form objectives for active occupancy mapping with range measurements. Henderson et al. [33] propose an even faster computation based on a recursive expression for Shannon mutual information in continuous maps.

The recent success of machine learning methods for perception has motivated learning-based autonomous exploration policies. Chen et al. [34] attempt to bridge the sim2sim and sim2real gaps via graph neural networks and deep reinforcement learning. This enables decision-making over graphs containing relevant exploration information, provided by human experts, in order to predict a robot's optimal sensing action in belief space. Lodel et al. [35] introduce a deep reinforcement learning policy that recommends the next best view maximizing information gain, using mutual information as the training reward. Zwecher et al. [36] employ deep reinforcement learning to find an exploration policy that plans collision-free coverage paths, while another neural network predicts the full map given the partially observed environment. Zhang et al. [37] propose a multi-agent reinforcement learning exploration method where regions of interest, free space, and robots are represented as graph nodes, and hierarchical-hops graph neural networks (H2GNN) are used to identify key information in the environment. Related to multi-robot exploration, the authors in [38] utilize an actor-critic strategy to map an unknown environment, where Voronoi partitioning divides the exploration regions among the robots. As this paper demonstrates, incorporating semantic uncertainty in addition to geometric information in the exploration process can be beneficial. Additionally, using Shannon mutual information as an objective function may help train more generalizable exploration policies because it mitigates the need for training sensor-specific models. Hence, the techniques proposed in this paper are complementary to learning approaches and can provide robustness to measurement uncertainty and domain shift caused by sensor and operational condition variations.

Active semantic mapping has recently attracted much attention due to the proliferation of fast object detection and semantic segmentation algorithms implemented on mobile robot platforms. The authors in [39] use a two-layer architecture, where the knowledge representation layer provides a belief over the environment state to the action layer, which subsequently chooses an action to gather information or execute a task. The work in [40] presents a semantic exploration policy which takes an occluded semantic point cloud of an object, finds a match in a database to estimate the full object dimensions, and then generates candidate next observation poses to reconstruct the object. The next best view is computed via a volumetric information gain metric that measures the visible entropy from a candidate pose. The semantic map used in that work is a collection of bounding boxes around objects. Active semantic mapping has also been employed to develop sample-efficient deep learning methods. Blum et al. [41] propose an active learning method for training semantic segmentation networks, where the novelty (epistemic uncertainty) of the input images is estimated as the distance from the training data in the embedding space, while a path planning method maximizes the novelty of future input images along the planned trajectory, assuming novel images are spatially correlated. Georgakis et al. [42] actively train a hierarchical semantic map generation model that predicts occupancy and semantics given occluded input. The authors use an ensemble of map generation models in order to predict the epistemic uncertainty of the predicted map. The uncertainty is then used to choose trajectories for actively training the model with new images that differ the most from the training data of the current model. SSMI distinguishes itself from the aforementioned works by introducing a dense Bayesian multi-class mapping with a closed-form uncertainty measure, as opposed to sampling-based uncertainty estimation. Moreover, our information-theoretic objective function directly models sensor noise specifications, unlike volumetric information gain.

This paper is most related to CSQMI [9] and FSMI [10] in that it develops a closed-form expression for mutual information. However, instead of a binary map and range-only measurements, our formulation considers a multi-class map with Bayesian updates using range-category measurements. Since the same occupancy map can be derived from many different multi-class maps, the information associated with the various object classes fails to be captured if we rely solely on occupancy information, as is the case in CSQMI and FSMI. Therefore, we expect to perform exploration more efficiently by using the multi-class perception model and, consequently, expanding the notion of uncertainty to multiple classes.

III Problem Statement

Consider a robot with pose $\mathbf{X}_t \in SE(3)$ at time $t$ and deterministic discrete-time kinematics:

$$\mathbf{X}_t := \begin{bmatrix} \mathbf{R}_t & \mathbf{p}_t \\ \mathbf{0}^\top & 1 \end{bmatrix}, \qquad \mathbf{X}_{t+1} = \mathbf{X}_t \exp(\tau \hat{\mathbf{u}}_t), \tag{1}$$

where $\mathbf{R}_t \in SO(3)$ is the robot orientation, $\mathbf{p}_t \in \mathbb{R}^3$ is the robot position, $\tau$ is the time step, and $\mathbf{u}_t := [\mathbf{v}_t^\top, \boldsymbol{\omega}_t^\top]^\top \in \mathcal{U} \subset \mathbb{R}^6$ is the control input, consisting of linear velocity $\mathbf{v}_t \in \mathbb{R}^3$ and angular velocity $\boldsymbol{\omega}_t \in \mathbb{R}^3$. The function $\hat{(\cdot)}: \mathbb{R}^6 \to \mathfrak{se}(3)$ maps vectors in $\mathbb{R}^6$ to the Lie algebra $\mathfrak{se}(3)$. See [43, Chapter 7] for a definition of the Lie groups $SO(3)$ and $SE(3)$ and the corresponding Lie algebras $\mathfrak{so}(3)$ and $\mathfrak{se}(3)$. The robot is navigating in an environment consisting of a collection of disjoint sets $\mathcal{E}_k \subset \mathbb{R}^3$, each associated with a semantic category $k \in \mathcal{K} := \{0, 1, \ldots, K\}$. Let $\mathcal{E}_0$ denote free space, while each $\mathcal{E}_k$ for $k > 0$ represents a different category, such as building, vegetation, terrain (see Fig. 1).
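For concreteness, a minimal C++/Eigen sketch of the kinematics in (1) is shown below; the hat operator and the one-step update are written out explicitly, and the function names are illustrative rather than taken from the SSMI codebase.

```cpp
// A minimal sketch of the pose kinematics in (1), assuming the Eigen library
// and its unsupported MatrixFunctions module for the matrix exponential.
#include <Eigen/Dense>
#include <unsupported/Eigen/MatrixFunctions>

// Hat operator: maps u = [v; w] in R^6 to a 4x4 twist matrix in se(3).
Eigen::Matrix4d hat(const Eigen::Matrix<double, 6, 1>& u) {
  Eigen::Matrix4d U = Eigen::Matrix4d::Zero();
  U(0, 1) = -u(5); U(0, 2) =  u(4);   // skew-symmetric part from w
  U(1, 0) =  u(5); U(1, 2) = -u(3);
  U(2, 0) = -u(4); U(2, 1) =  u(3);
  U.block<3, 1>(0, 3) = u.head<3>();  // linear velocity v
  return U;
}

// One step of (1): X_{t+1} = X_t * exp(tau * hat(u_t)).
Eigen::Matrix4d stepPose(const Eigen::Matrix4d& X,
                         const Eigen::Matrix<double, 6, 1>& u, double tau) {
  Eigen::Matrix4d A = tau * hat(u);
  return X * A.exp();  // matrix exponential maps se(3) to SE(3)
}
```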

We assume that the robot is equipped with a sensor that provides information about the distance to and semantic categories of surrounding objects along a set of rays $\{\boldsymbol{\eta}_b\}_b$, where $b$ is the ray index, $\boldsymbol{\eta}_b \in \mathbb{R}^3$ with $\|\boldsymbol{\eta}_b\|_2 = r_{max}$, and $r_{max} > 0$ is the maximum sensing range.

Definition 1.

A sensor observation at time $t$ from robot pose $\mathbf{X}_t$ is a collection $\mathcal{Z}_t := \{\mathbf{z}_{t,b}\}_b$ of range and category measurements $\mathbf{z}_{t,b} := (r_{t,b}, y_{t,b}) \in \mathbb{R}_{\geq 0} \times \mathcal{K}$, acquired along the sensor rays $\mathbf{R}_t \boldsymbol{\eta}_b$ with $\boldsymbol{\eta}_b \in \{\boldsymbol{\eta}_b\}_b$ at robot position $\mathbf{p}_t$.

Such information may be obtained by processing the observations of an RGBD camera or a LiDAR with a semantic segmentation algorithm [7]. Fig. 1 shows an example where each pixel in the RGB image corresponds to one sensor ray $\boldsymbol{\eta}_b$, while its corresponding values in the semantic segmentation and depth images encode category $y_{t,b}$ and range $r_{t,b}$, respectively. The goal is to construct a multi-class map $\mathbf{m}$ of the environment based on the labeled range measurements. We model $\mathbf{m}$ as a grid of cells $i \in \mathcal{I} := \{1, \ldots, N\}$, each labeled with a category $m_i \in \mathcal{K}$. In order to model noisy sensor observations, we consider a probability density function (PDF) $p(\mathcal{Z}_t \mid \mathbf{m}, \mathbf{X}_t)$. This observation model allows integrating the measurements into a probabilistic map representation using Bayesian updates. Let $p_t(\mathbf{m}) := p(\mathbf{m} \mid \mathcal{Z}_{1:t}, \mathbf{X}_{1:t})$ be the probability mass function (PMF) of the map $\mathbf{m}$ given the robot trajectory $\mathbf{X}_{1:t}$ and observations $\mathcal{Z}_{1:t}$ up to time $t$. Given a new observation $\mathcal{Z}_{t+1}$ obtained from robot pose $\mathbf{X}_{t+1}$, the Bayesian update to the map PMF is:

$$p_{t+1}(\mathbf{m}) \propto p(\mathcal{Z}_{t+1} \mid \mathbf{m}, \mathbf{X}_{t+1})\, p_t(\mathbf{m}). \tag{2}$$

We assume that the robot pose is known and omit the dependence of the map distribution and the observation model on it for brevity. We consider the following problem.

Problem.

Given a prior map PMF $p_t(\mathbf{m})$ at time $t$ and a finite planning horizon $T$, maximize the ratio:

$$\max_{\mathbf{u}_{t:t+T-1}} \frac{I(\mathbf{m}; \mathcal{Z}_{t+1:t+T} \mid \mathcal{Z}_{1:t})}{J(\mathbf{X}_{t:t+T-1}, \mathbf{u}_{t:t+T-1})} \quad \text{subject to} \;\; (1), (2), \tag{3}$$

of the mutual information $I(\mathbf{m}; \mathcal{Z}_{t+1:t+T} \mid \mathcal{Z}_{1:t})$ between the map $\mathbf{m}$ and the future sensor observations $\mathcal{Z}_{t+1:t+T}$ to the motion cost $J(\mathbf{X}_{t:t+T-1}, \mathbf{u}_{t:t+T-1})$ of the control sequence $\mathbf{u}_{t:t+T-1}$.

The definitions of the mutual information and motion cost terms in (3) are:

$$I(\mathbf{m}; \mathcal{Z}_{t+1:t+T} \mid \mathcal{Z}_{1:t}) := \sum_{\mathbf{m} \in \mathcal{K}^N} \int \cdots \int p(\mathbf{m}, \mathcal{Z}_{t+1:t+T} \mid \mathcal{Z}_{1:t}) \log \frac{p(\mathbf{m}, \mathcal{Z}_{t+1:t+T} \mid \mathcal{Z}_{1:t})}{p(\mathbf{m} \mid \mathcal{Z}_{1:t})\, p(\mathcal{Z}_{t+1:t+T} \mid \mathcal{Z}_{1:t})} \prod_{\tau=1}^{T} \prod_b d\mathbf{z}_{t+\tau,b}, \tag{4}$$

$$J(\mathbf{X}_{t:t+T-1}, \mathbf{u}_{t:t+T-1}) := q(\mathbf{X}_{t+T}) + \sum_{\tau=0}^{T-1} c(\mathbf{X}_{t+\tau}, \mathbf{u}_{t+\tau}),$$

where the integration in (4) is over all possible values $\mathbf{z}_{t+\tau,b}$ of all sensor beams over all times, and the strictly positive terms $q(\mathbf{X})$ and $c(\mathbf{X}, \mathbf{u})$ model the terminal and stage motion costs (e.g., distance traveled, elapsed time), respectively.

We develop a multi-class extension to the log-odds occupancy mapping algorithm [44, Ch. 9] in Sec. IV and derive an efficient approximation to the mutual information term in Sec. V. In Sec. VI, we present a multi-class extension of the OctoMap [3] algorithm, alongside a fast computation of mutual information over a semantic OctoMap using run-length encoding. This allows autonomous exploration of large 3-D environments by rapidly evaluating potential robot trajectories online and (re-)selecting the one that maximizes the objective in (3). Finally, in Sec. VII, we demonstrate the performance of our approach in simulated and real-world experiments.

IV Bayesian Multi-class Mapping

Figure 2: Illustration of the Bayesian multi-class mapping given a range-category observation $\mathbf{z}_{t+1} = (r_{t+1}, y_{t+1})$ for an environment with object classes white (free class), red, and blue, encoded as $y = 0$, $y = 1$, and $y = 2$, respectively: (a) Portion of the map $\mathbf{m}$ along the observation $\mathbf{z}_{t+1}$. Each cell has a multi-class log-odds vector $\mathbf{h}_{t,i} \in \mathbb{R}^3$ for $i \in \{1, \ldots, 5\}$ at time $t$. The cell brightness encodes the occupancy probability, while the cell color represents the most likely category; (b) Map estimate at time $t+1$ after the update with $\mathbf{z}_{t+1}$. Note how each multi-class log-odds vector changes based on the inverse observation model $\mathbf{l}_i(\mathbf{z}_{t+1})$; (c) Second and third elements of the inverse observation log-odds vector $\mathbf{l}_i(\mathbf{z}_{t+1})$ for $i = 3$, as a function of the range $r$ and category $y$ of the observation $\mathbf{z}_{t+1}$. Note that the first element of $\mathbf{l}_i(\mathbf{z}_{t+1})$ is always zero.

This section derives the Bayesian update in (2), using a multinomial logit model to represent the map PMF $p_t(\mathbf{m})$, where each cell $m_i$ of the map stores a categorical distribution over the object classes in $\mathcal{K}$. To ensure that the number of parameters in the model scales linearly with the map size $N$, we maintain a factorized PMF over the cells:

$$p_t(\mathbf{m}) = \prod_{i=1}^{N} p_t(m_i). \tag{5}$$

We represent the individual cell PMFs $p_t(m_i)$ over the semantic categories $\mathcal{K}$ using a vector of log odds:

$$\mathbf{h}_{t,i} := \begin{bmatrix} \log\frac{p_t(m_i = 0)}{p_t(m_i = 0)} & \cdots & \log\frac{p_t(m_i = K)}{p_t(m_i = 0)} \end{bmatrix}^\top \in \mathbb{R}^{K+1}, \tag{6}$$

where the free-class likelihood $p_t(m_i = 0)$ is used as a pivot. Given the log-odds vector $\mathbf{h}_{t,i}$, the PMF of cell $m_i$ may be recovered using the softmax function $\sigma: \mathbb{R}^{K+1} \to \mathbb{R}^{K+1}$:

$$p_t(m_i = k) = \sigma_{k+1}(\mathbf{h}_{t,i}) := \frac{\mathbf{e}_{k+1}^\top \exp(\mathbf{h}_{t,i})}{\mathbf{1}^\top \exp(\mathbf{h}_{t,i})}, \tag{7}$$

where $\mathbf{e}_k$ is the standard basis vector with $k$-th element equal to $1$ and $0$ elsewhere, $\mathbf{1}$ is the vector with all elements equal to $1$, and $\exp(\cdot)$ is applied elementwise to the vector $\mathbf{h}_{t,i}$. To derive the Bayes rule for the log-odds $\mathbf{h}_{t,i}$, we need to specify an observation model for the range and category measurements.
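For instance, a minimal C++/Eigen sketch of the softmax recovery in (7) looks as follows; the max-subtraction is a standard numerical-stability trick and leaves the result unchanged.

```cpp
// A minimal sketch of recovering a cell PMF from its log-odds vector via the
// softmax in (7), assuming the Eigen library.
#include <Eigen/Dense>

Eigen::VectorXd softmax(const Eigen::VectorXd& h) {
  // p_t(m_i = k) = softmax(h_{t,i})(k) for k = 0, ..., K (0-based indexing).
  Eigen::ArrayXd e = (h.array() - h.maxCoeff()).exp();
  return (e / e.sum()).matrix();
}
```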

Definition 2.

The inverse observation model of a range-category measurement $\mathbf{z}$ obtained from robot pose $\mathbf{X}$ along sensor ray $\boldsymbol{\eta}$ with respect to map cell $m_i$ is a probability mass function $p(m_i \mid \mathbf{z}; \mathbf{X}, \boldsymbol{\eta})$.

The Bayesian update in (2) for $\mathbf{h}_{t,i}$ can be obtained in terms of the range-category inverse observation model, evaluated at a new measurement set $\mathcal{Z}_{t+1}$.

Proposition 1.

Let $\mathbf{h}_{t,i}$ be the log odds of cell $m_i$ at time $t$. Given a sensor observation $\mathcal{Z}_{t+1}$, the posterior log odds are:

$$\mathbf{h}_{t+1,i} = \mathbf{h}_{t,i} + \sum_{\mathbf{z} \in \mathcal{Z}_{t+1}} \left( \mathbf{l}_i(\mathbf{z}) - \mathbf{h}_{0,i} \right), \tag{8}$$

where $\mathbf{l}_i(\mathbf{z})$ is the log-odds vector of the inverse observation model:

$$\mathbf{l}_i(\mathbf{z}) := \begin{bmatrix} \log\frac{p(m_i = 0 \mid \mathbf{z})}{p(m_i = 0 \mid \mathbf{z})} & \cdots & \log\frac{p(m_i = K \mid \mathbf{z})}{p(m_i = 0 \mid \mathbf{z})} \end{bmatrix}^\top. \tag{9}$$
Proof.

See Appendix A. ∎

To complete the Bayesian multi-class mapping algorithm suggested by (8), we need a particular inverse observation model. When a sensor measurement is generated, the sensor ray travels until it hits an obstacle of category $\mathcal{K} \setminus \{0\}$ or reaches the maximum sensing range $r_{max}$. A labeled range measurement $\mathbf{z} = (r, y)$ obtained from position $\mathbf{p}$ with orientation $\mathbf{R}$ indicates that map cell $m_i$ is occupied if the measurement end point $\mathbf{p} + \frac{r}{r_{max}} \mathbf{R} \boldsymbol{\eta}$ lies in the cell. If $m_i$ lies along the sensor ray but does not contain the end point, it is observed as free. Finally, if $m_i$ is not intersected by the sensor ray, no information is provided about its occupancy. The map cells along the sensor ray can be determined by a rasterization algorithm, such as Bresenham's line algorithm [45]. We parameterize the log-odds vector of the inverse observation model in (9) as:

$$\mathbf{l}_i((r, y)) := \begin{cases} \boldsymbol{\phi}^+ + \mathbf{E}_{y+1} \boldsymbol{\psi}^+, & r \text{ indicates } m_i \text{ is occupied}, \\ \boldsymbol{\phi}^-, & r \text{ indicates } m_i \text{ is free}, \\ \mathbf{h}_{0,i}, & \text{otherwise}, \end{cases} \tag{10}$$

where $\mathbf{E}_k := \mathbf{e}_k \mathbf{e}_k^\top$ and $\boldsymbol{\psi}^+, \boldsymbol{\phi}^-, \boldsymbol{\phi}^+ \in \mathbb{R}^{K+1}$ are parameter vectors whose first element is $0$ to ensure that $\mathbf{l}_i(\mathbf{z})$ is a valid log-odds vector. This parameterization leads to an inverse observation model $p(m_i = k \mid \mathbf{z}) = \sigma_{k+1}(\mathbf{l}_i(\mathbf{z}))$, which is piecewise constant along the sensor ray. Fig. 2 illustrates our Bayesian multi-class mapping method.
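A minimal sketch of (10) and the per-cell update (8) might look as follows in C++/Eigen; the classification of a cell against the ray and the parameter vectors are taken as given, and all names are illustrative.

```cpp
// A minimal sketch of the inverse observation model (10) and the Bayesian
// log-odds update (8) for a single cell, assuming Eigen.
#include <Eigen/Dense>

enum class CellView { Occupied, Free, Unobserved };

// l_i(z) from (10); y is the observed category, h0 the prior log-odds.
Eigen::VectorXd inverseModelLogOdds(CellView view, int y,
                                    const Eigen::VectorXd& phiP,  // phi^+
                                    const Eigen::VectorXd& psiP,  // psi^+
                                    const Eigen::VectorXd& phiM,  // phi^-
                                    const Eigen::VectorXd& h0) {
  if (view == CellView::Occupied) {
    Eigen::VectorXd l = phiP;  // phi^+ + E_{y+1} psi^+ ...
    l(y) += psiP(y);           // ... entry y in 0-based indexing
    return l;
  }
  if (view == CellView::Free) return phiM;
  return h0;                   // unobserved: (8) leaves h_{t,i} unchanged
}

// One measurement's contribution to (8): h <- h + (l_i(z) - h_0).
void updateCell(Eigen::VectorXd& h, const Eigen::VectorXd& l,
                const Eigen::VectorXd& h0) {
  h += l - h0;
}
```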

To compute the mutual information between an observation sequence $\mathcal{Z}_{t+1:t+T}$ and the map $\mathbf{m}$ in the next section, we will also need the PDF of a range-category measurement $\mathbf{z}_{\tau,b} \in \mathcal{Z}_{t+1:t+T}$ conditioned on $\mathcal{Z}_{1:t}$. Let $\mathcal{R}_{\tau,b}(r) \subset \mathcal{I}$ denote the set of map cell indices along the ray $\mathbf{R}_\tau \boldsymbol{\eta}_b$ from robot position $\mathbf{p}_\tau$ with length $r$. Let $\gamma_{\tau,b}(i)$ denote the distance traveled by the ray $\mathbf{R}_\tau \boldsymbol{\eta}_b$ within cell $m_i$, and let $i_{\tau,b}^* \in \mathcal{R}_{\tau,b}(r)$ denote the index of the cell hit by $\mathbf{z}_{\tau,b}$. We define the PDF of $\mathbf{z}_{\tau,b} = (r, y)$ conditioned on $\mathcal{Z}_{1:t}$ as:

$$p(\mathbf{z}_{\tau,b} \mid \mathcal{Z}_{1:t}) = \frac{p_t(m_{i_{\tau,b}^*} = y)}{\gamma_{\tau,b}(i_{\tau,b}^*)} \prod_{i \in \mathcal{R}_{\tau,b}(r) \setminus \{i_{\tau,b}^*\}} p_t(m_i = 0). \tag{11}$$

This definition states that the likelihood of $\mathbf{z}_{\tau,b} = (r, y)$ at time $t$ depends on the likelihood that the cells $m_i$ along the ray $\mathbf{R}_\tau \boldsymbol{\eta}_b$ of length $r$ are empty and the likelihood that the hit cell $m_{i_{\tau,b}^*}$ has class $y$. A similar model for binary observations has been used in [8, 9, 10].
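Under the factorized map PMF, (11) is a product over the cells crossed by the beam; a minimal sketch, assuming 0-indexed per-cell class PMFs with the hit cell stored last, is:

```cpp
// A minimal sketch of the ray measurement likelihood (11). `ray` lists the
// indices of the cells crossed by the beam up to range r, ordered from the
// sensor with the hit cell last; gammaHit is the distance traveled by the
// ray inside the hit cell. Names are illustrative.
#include <vector>

double rayLikelihood(const std::vector<int>& ray, int y, double gammaHit,
                     const std::vector<std::vector<double>>& cellPmf) {
  double lik = cellPmf[ray.back()][y] / gammaHit;  // hit cell has class y
  for (std::size_t n = 0; n + 1 < ray.size(); ++n)
    lik *= cellPmf[ray[n]][0];                     // preceding cells are free
  return lik;
}
```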

This section described how an observation affects the map PMF $p_t(\mathbf{m})$. Now, we switch our focus to computing the mutual information between a sequence of observations $\mathcal{Z}_{t+1:t+T}$ and the multi-class occupancy map $\mathbf{m}$.

V Informative Planning

Proposition 1 allows a multi-class formulation of occupancy grid mapping, where the uncertainty of a map cell depends on the probability of each class $p_t(m_i = k)$, instead of only the binary occupancy probability $1 - p_t(m_i = 0)$. Moreover, the inverse observation model in (10) may contain different likelihoods for the different classes, which can be used to prioritize information gathering for specific classes. Fig. 3(a) shows an example where the estimated map of an environment with $3$ classes, free, $\textit{class}_1$, $\textit{class}_2$, contains two regions with similar occupancy probability but different semantic uncertainty. In particular, the red and green walls have the same occupancy probability of $0.9$, as shown in Fig. 3(b), but the red region more certainly belongs to $\textit{class}_1$, while the green region has high uncertainty between the two classes. As can be seen in Fig. 3(c), the mutual information associated with a binary occupancy map cannot distinguish between the red and green regions since they both have the same occupancy probability. In contrast, the multi-class map takes into account the semantic uncertainty among different categories, as can be seen in Fig. 3(d), where the uncertain green region has larger mutual information than the certain red region.

Figure 3: Comparison between the information surfaces of binary and multi-class map representations. (a) Environment with three classes free, $\textit{class}_1$, and $\textit{class}_2$, where the white and gray regions represent free and unknown space, respectively, with $\mathbf{p}^{white}(m_i) = [1, 0, 0]$ and $\mathbf{p}^{gray}(m_i) = [0.3, 0.3, 0.3]$. The red and green regions have the same occupancy probability of $p(m_i = \text{occupied}) = 0.9$ but different class uncertainty, i.e., $\mathbf{p}^{red}(m_i) = [0.1, 0.8, 0.1]$ and $\mathbf{p}^{green}(m_i) = [0.1, 0.45, 0.45]$. (b) Binary occupancy map, where the intensity of each pixel is proportional to its occupancy probability, regardless of object class. (c) Occupancy mutual information surface. (d) Semantic mutual information surface. Each pixel in the information surfaces shows the value of the mutual information between the map and a set of range-category observations, uniformly sampled from a $360^\circ$ field of view at each pixel location.

These observations suggest that more accurate uncertainty quantification may be achieved using a multi-class instead of a binary perception model, potentially enabling a more efficient exploration strategy. However, computing the mutual information term in (4) is challenging because it involves integration over all possible values of the observation sequence $\mathcal{Z}_{t+1:t+T}$. Our main result is an efficiently-computable lower bound on $I(\mathbf{m}; \mathcal{Z}_{t+1:t+T} \mid \mathcal{Z}_{1:t})$ for range-category observations $\mathcal{Z}_{t+1:t+T}$ and a multi-class occupancy map $\mathbf{m}$. The result is obtained by selecting a subset $\underline{\mathcal{Z}}_{t+1:t+T} = \{\mathbf{z}_{\tau,b}\}_{\tau=t+1,b=1}^{t+T,B}$ of the observations $\mathcal{Z}_{t+1:t+T}$ in which the sensor rays are non-overlapping. Precisely, any pair of measurements $\mathbf{z}_{\tau,b}, \mathbf{z}_{\tau',b'} \in \underline{\mathcal{Z}}_{t+1:t+T}$ satisfies:

$$\mathcal{R}_{\tau,b}(r_{max}) \cap \mathcal{R}_{\tau',b'}(r_{max}) = \emptyset. \tag{12}$$

In practice, constructing $\underline{\mathcal{Z}}_{t+1:t+T}$ requires removing intersecting rays from $\mathcal{Z}_{t+1:t+T}$ to ensure that the remaining observations are mutually independent. The mutual information between $\mathbf{m}$ and $\underline{\mathcal{Z}}_{t+1:t+T}$ can be obtained as a sum of mutual information terms between single rays $\mathbf{z}_{\tau,b} \in \underline{\mathcal{Z}}_{t+1:t+T}$ and the map cells $m_i$ observed by $\mathbf{z}_{\tau,b}$. This idea is inspired by CSQMI [9], but we generalize it to a multi-class map and observations.

Proposition 2.

Given a sequence of labeled range observations $\mathcal{Z}_{t+1:t+T}$, let $\underline{\mathcal{Z}}_{t+1:t+T} = \{\mathbf{z}_{\tau,b}\}_{\tau=t+1,b=1}^{t+T,B}$ be a subset of non-overlapping measurements that satisfy (12). Then, the Shannon mutual information between $\mathcal{Z}_{t+1:t+T}$ and a multi-class occupancy map $\mathbf{m}$ can be lower bounded as:

$$I(\mathbf{m}; \mathcal{Z}_{t+1:t+T} \mid \mathcal{Z}_{1:t}) \geq I(\mathbf{m}; \underline{\mathcal{Z}}_{t+1:t+T} \mid \mathcal{Z}_{1:t}) = \sum_{\tau=t+1}^{t+T} \sum_{b=1}^{B} \sum_{k=1}^{K} \sum_{n=1}^{N_{\tau,b}} p_{\tau,b}(n,k)\, C_{\tau,b}(n,k), \tag{13}$$

where $N_{\tau,b} := |\mathcal{R}_{\tau,b}(r_{max})|$,

$$p_{\tau,b}(n,k) := p_t(m_{i_{\tau,b}^*} = k) \prod_{i \in \tilde{\mathcal{R}}_{\tau,b}(n) \setminus \{i_{\tau,b}^*\}} p_t(m_i = 0),$$

$$C_{\tau,b}(n,k) := f(\boldsymbol{\phi}^+ + \mathbf{E}_{k+1}\boldsymbol{\psi}^+ - \mathbf{h}_{0,i_{\tau,b}^*}, \mathbf{h}_{t,i_{\tau,b}^*}) + \sum_{i \in \tilde{\mathcal{R}}_{\tau,b}(n) \setminus \{i_{\tau,b}^*\}} f(\boldsymbol{\phi}^- - \mathbf{h}_{0,i}, \mathbf{h}_{t,i}),$$

$$f(\boldsymbol{\phi}, \mathbf{h}) := \log\!\left(\frac{\mathbf{1}^\top \exp(\mathbf{h})}{\mathbf{1}^\top \exp(\boldsymbol{\phi} + \mathbf{h})}\right) + \boldsymbol{\phi}^\top \sigma(\boldsymbol{\phi} + \mathbf{h}),$$

and $\tilde{\mathcal{R}}_{\tau,b}(n) \subseteq \mathcal{R}_{\tau,b}(r_{max})$ is the set of the first $n$ map cell indices along the ray $\mathbf{R}_\tau \boldsymbol{\eta}_b$, i.e., $\tilde{\mathcal{R}}_{\tau,b}(n) := \{i \mid i \in \mathcal{R}_{\tau,b}(r), |\mathcal{R}_{\tau,b}(r)| = n, r \leq r_{max}\}$.

Proof.

See Appendix B. ∎

In (13), $p_{\tau,b}(n,k)$ represents the probability that the $n$-th map cell along the ray $\mathbf{R}_\tau \boldsymbol{\eta}_b$ belongs to object category $k$ while all of the previous cells are free. The function $f(\boldsymbol{\phi}, \mathbf{h})$ denotes the log-ratio of the map PMF $\sigma(\mathbf{h})$ and its posterior $\sigma(\boldsymbol{\phi} + \mathbf{h})$, averaged over the object categories in $\mathcal{K}$ (see (26) in Appendix B for more details). As a result, $C_{\tau,b}(n,k)$ is the sum of log-ratios for the first $n$ cells along the ray $\mathbf{R}_\tau \boldsymbol{\eta}_b$ under the same event as the one associated with $p_{\tau,b}(n,k)$. Therefore, the lower bound $I(\mathbf{m}; \underline{\mathcal{Z}}_{t+1:t+T} \mid \mathcal{Z}_{1:t})$ is equivalent to the expectation of the summed log-ratios $C_{\tau,b}(n,k)$ over all possible instantiations of the observations in $\underline{\mathcal{Z}}_{t+1:t+T}$.
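As a concrete reference, the sketch below evaluates $f(\boldsymbol{\phi}, \mathbf{h})$ numerically; it is a minimal C++/Eigen illustration of the formula in Proposition 2, not the released SSMI implementation, and the helper name logSumExp is ours.

```cpp
// A minimal sketch of the log-ratio term f(phi, h) in Proposition 2:
// f(phi, h) = log(1^T exp(h) / 1^T exp(phi + h)) + phi^T sigma(phi + h).
#include <Eigen/Dense>
#include <cmath>

double logSumExp(const Eigen::VectorXd& x) {
  double m = x.maxCoeff();  // stabilized log(1^T exp(x))
  return m + std::log((x.array() - m).exp().sum());
}

double f(const Eigen::VectorXd& phi, const Eigen::VectorXd& h) {
  Eigen::VectorXd s = phi + h;
  Eigen::ArrayXd post = (s.array() - s.maxCoeff()).exp();
  post /= post.sum();       // sigma(phi + h), the posterior PMF
  return logSumExp(h) - logSumExp(s) + phi.dot(post.matrix());
}
```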

Proposition 2 allows evaluating, according to (3), the informativeness of any potential robot trajectory $\mathbf{X}_{t:t+T}$, $\mathbf{u}_{t:t+T-1}$. In order to perform informative planning, we first identify the boundary between the explored and unexplored regions of the map, similar to [12]. This can be done efficiently using edge detection, for example. Then, we cluster the corresponding map cells by detecting the connected components of the boundary. Each cluster is called a frontier. A motion planning algorithm is used to obtain a set of pose trajectories to the map frontiers, determined from the current map PMF $p_t(\mathbf{m})$. Alg. 1 summarizes the procedure for determining a trajectory $\mathbf{X}^*_{t:t+T}$, $\mathbf{u}^*_{t:t+T-1}$ that maximizes the objective in (3), where $J(\mathbf{X}_{t:t+T-1}, \mathbf{u}_{t:t+T-1})$ is the length of the corresponding path. This kinematically feasible trajectory can be tracked by a low-level controller that takes the robot dynamics into account.

Evaluation of the mutual information lower bound in Proposition 2 can be accelerated, without loss in accuracy, for map cells along the observation rays that have equal PMFs. In the next section, we investigate this property of the proposed lower bound within the context of OcTree-based representations. We begin by proposing a multi-class version of the OctoMap technique, in which map cells with equal multi-class probabilities are compressed into a larger voxel. Next, a fast semantic mutual information formula is presented, based on compression of range-category ray casts over OcTree representations.

Algorithm 1 Information-theoretic Path Planning
1: Input: robot pose $\mathbf{X}_t$, map estimate $p_t(\mathbf{m})$
2: $\mathcal{F} = \textsc{findFrontiers}(p_t(\mathbf{m}))$
3: for $f \in \mathcal{F}$ do
4:     $\mathbf{X}_{t+1:t+T}, \mathbf{u}_{t:t+T-1} = \textsc{planPath}(\mathbf{X}_t, p_t(\mathbf{m}), f)$
5:     Compute (3) over $\mathbf{X}_{t:t+T}, \mathbf{u}_{t:t+T-1}$ via (13)
6: return $\mathbf{X}^*_{t:t+T}$, $\mathbf{u}^*_{t:t+T-1}$ with the highest value

VI Information Computation for Semantic OctoMap Representations

Figure 4: Example of a semantic OctoMap: (a) A white circle represents an inner node whose children collectively cover the same physical space as the inner node itself. A colored square represents a partition of the 3-D space where all downstream nodes contain identical semantic and occupancy values; therefore, they can be pruned into a leaf node. Lastly, black dots represent unexplored spaces of the environment. (b) Geometric representation of the same OcTree with an overview of the SemanticOcTreeNode class.

Utilizing a regular-grid discretization to represent a 3-D environment has prohibitive storage and computation requirements. Large continuous portions of many real environments are unoccupied, suggesting that adaptive discretization is significantly more efficient. OctoMap [3] is a probabilistic 3-D mapping technique that utilizes an OcTree data structure to obtain adaptive resolution, e.g., combining many small cells associated with free space into few large cells. In this section, we develop a multi-class version of OctoMap and propose an efficient multi-class mutual information computation which benefits from the OcTree structure.

VI-A Semantic OctoMap

An OcTree is a hierarchical data structure containing nodes that represent a section of the physical environment. Each node has either 0 or 8 children, where the latter corresponds to the 8 octants of the Euclidean 3-D coordinate system. Thus, the children of a parent node form an eight-way octant partition of the space associated with the parent node. Fig. 4 shows an example of a multi-class OcTree data structure.

We implement a SemanticOcTreeNode class as a building block of the multi-class OcTree structure. A SemanticOcTreeNode instance stores occupancy, color, and semantic information of its corresponding physical space, as shown in Fig. 4(b). The most important data members of the SemanticOcTreeNode class are:

  • children: an array of pointers to SemanticOcTreeNode storing the memory addresses of the 8 child nodes,

  • value: a float variable storing the log-odds occupancy probability of the node,

  • color: a ColorRGB object storing the RGB color of the node,

  • semantics: a SemanticLogOdds object maintaining a categorical probability distribution over the semantic labels in the form of a log-odds ratio.

For performance reasons, the SemanticLogOdds class only stores the multi-class log-odds for the 3 most likely class labels, with each label represented by a unique RGB color. The log-odds associated with the rest of the labels are lumped into a single others variable. This relieves the multi-class OcTree implementation from dependence on the number of labels that the object classifier can detect. Moreover, it significantly improves the speed of the mapping algorithm in cases with many semantic categories. See Sec. VII-D for an analysis of mapping time versus the number of stored classes.
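For concreteness, a minimal sketch of such a node payload is given below; the field names follow the description above, but the exact layout of the open-source release may differ.

```cpp
// A minimal sketch of the SemanticLogOdds bookkeeping: log-odds are kept only
// for the 3 most likely labels (identified by RGB color), and all remaining
// labels are lumped into a single `others` value.
#include <array>
#include <cstdint>

struct ColorRGB { std::uint8_t r, g, b; };

struct SemanticLogOdds {
  struct Entry {
    ColorRGB label;  // class identity, encoded as a unique RGB color
    float logOdds;   // log-odds of this class relative to the free class
  };
  std::array<Entry, 3> data;  // the 3 most likely classes, sorted descending
  float others;               // lumped log-odds of all remaining classes
};
```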

The implementation of the multi-class OcTree is completed by defining a SemanticOcTree class, which is derived from the OccupancyOcTreeBase class of the OctoMap library [3] and uses a SemanticOcTreeNode as its node type. Fig. 5 illustrates the derivation of the SemanticOcTree and SemanticOcTreeNode classes as a UML diagram.

Figure 5: UML diagram showing the class inheritance used for the implementation of a multi-class OcTree.

In order to register a new observation to a multi-class OcTree, we follow the standard ray-casting procedure over an OcTree, as in [3], to find the observed leaf nodes. Then, for each observed leaf node, if the observation demands an update, the leaf node is recursively expanded to the smallest resolution and the multi-class log-odds of the downstream nodes are updated using (8). At the ray’s end point, which indicates an occupied cell, we also update the color variable by averaging the observed color with the current stored color of the corresponding node. Alg. 2 details the Bayesian update procedure for the multi-class OcTree.

Algorithm 2 Multi-class OcTree Update of Node $\mathbf{n}_i$
1: Input: OcTree node $\mathbf{n}_i$, observation $\mathbf{z} = (r, y)$, mixing coefficient $\alpha$
2: $s = \mathbf{n}_i.\textit{semantics}$
3: $s.d = \mathbf{n}_i.\textit{semantics.data}$
4: $s.o = \mathbf{n}_i.\textit{semantics.others}$
5: if $\mathbf{z}$ indicates free then
6:     Update $s$ with $\boldsymbol{\phi}^-$
7: else if $\mathbf{z}$ indicates class $y$ then
8:     if class $y$ is among the 3 most likely classes in $s$ then
9:         Update $s$ with $\boldsymbol{\phi}^+ + \mathbf{E}_{y+1}\boldsymbol{\psi}^+$
10:    else
11:        ▷ Derive $h_{aux}$ as a portion $\alpha$ of the others class
12:        $h_{aux} = s.o + \log\alpha$
13:        $s.o \mathrel{+}= \phi^+_{others} + \log(1-\alpha)$
14:        $s_c = \textsc{concat}(s.d, (y, h_{aux}))$
15:        Update $s_c$ with $\boldsymbol{\phi}^+ + \mathbf{E}_{y+1}\boldsymbol{\psi}^+$
16:        Perform a descending sort on $s_c$ with respect to the log-odds values
17:        ▷ Pick the 3 most likely classes
18:        $s.d = s_c[0:2]$
19:        ▷ Combine the least likely class with the others class
20:        $s.o = \log\big(\exp(s_c[3]) + \exp(s.o)\big)$
21: ▷ Apply thresholds $\underline{s}$ and $\overline{s}$ to the log-odds values
22: $s \leftarrow \min\{\max\{s, \underline{s}\}, \overline{s}\}$
23: $\mathbf{n}_i.\textit{semantics} = s$
24: return $\mathbf{n}_i$
Algorithm 3 Semantic Fusion of Two Child Nodes
1: Input: OcTree nodes $\mathbf{n}_i$ and $\mathbf{n}_j$
2: $s_i = \mathbf{n}_i.\textit{semantics}$
3: $s_j = \mathbf{n}_j.\textit{semantics}$
4: ▷ Non-repeating list of classes in $s_i$ and $s_j$
5: $\mathcal{K}_f = \textsc{uniqueClass}(s_i, s_j)$
6: ▷ Object instantiation for the fused semantics
7: $s_f = SemanticLogOdds()$
8: ▷ Slice $s_i.o$ and $s_j.o$ into smaller probabilities
9: $o_i = s_i.o - \log(1 + \mathcal{K}_f.size - s_i.d.size)$
10: $o_j = s_j.o - \log(1 + \mathcal{K}_f.size - s_j.d.size)$
11: for $y \in \mathcal{K}_f$ do
12:    if $y \notin s_i.d.label \wedge y \in s_j.d.label$ then
13:        $s_f.d.\textsc{append}(y, \frac{o_i + s_j.d[y].logOdds}{2})$
14:    else if $y \in s_i.d.label \wedge y \notin s_j.d.label$ then
15:        $s_f.d.\textsc{append}(y, \frac{s_i.d[y].logOdds + o_j}{2})$
16:    else
17:        $s_f.d.\textsc{append}(y, \frac{s_i.d[y].logOdds + s_j.d[y].logOdds}{2})$
18: Perform a descending sort on $s_f.d$ with respect to the log-odds values
19: $expOthers = \exp(\frac{o_i + o_j}{2})$
20: for $i > 3$ do
21:    $expOthers \mathrel{+}= \exp(s_f.d[i].logOdds)$
22: $s_f.d[3:\text{end}].\textsc{remove}()$
23: $s_f.o = \log(expOthers)$
24: $s_f \leftarrow \min\{\max\{s_f, \underline{s}\}, \overline{s}\}$
25: return $s_f$

To obtain a compressed OctoMap, it is necessary to define a rule for fusing information from child nodes into their parent node. Depending on the application, different fusion strategies may be implemented. For example, a conservative strategy would assign to the parent node the multi-class log-odds of the child node with the highest occupancy probability. In this work, we simply assign the average log-odds vector of the child nodes to their parent node, as shown in Alg. 3. The benefit of an OctoMap representation is the ability to combine similar cells (leaf nodes) into a large cell (inner node). This is called pruning the OcTree. Each time an observation is integrated into the map, starting from the deepest inner node, we check for each inner node whether 1) the node has 8 children, 2) its children do not have any children of their own, and 3) its children all have equal multi-class log-odds. If an inner node satisfies all three criteria, its children are pruned and the inner node is converted into a leaf node with the same multi-class log-odds as its children. This compresses the majority of the free cells into a few large cells, while the occupied cells usually do not undergo pruning since only their surfaces are observed by the sensor and their interior remains unexplored. Due to sensor noise, it is unlikely that cells belonging to the same class (e.g., free or occupied by the same obstacle) attain identical multi-class log-odds. Maximum and minimum limits for the elements of the multi-class log-odds are therefore used so that each cell arrives at a stable state as its log-odds entries reach the limits. Stable cells are more likely to share the same multi-class probability distribution, which increases the chance of OcTree pruning. However, thresholding causes loss of information near $p_t(m_i = k) = 1$, $k \in \mathcal{K}$, which can be controlled by the maximum and minimum limits.
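The pruning test described above reduces to a simple check over the eight children; a minimal sketch is given below, assuming a node type with an array of 8 child pointers, a hasChildren() query, and an equality-comparable semantics record (interface illustrative, not the released API).

```cpp
// A minimal sketch of the pruning rule: an inner node collapses into a leaf
// when all 8 children exist, are themselves leaves, and carry identical
// multi-class log-odds.
bool isPrunable(const SemanticOcTreeNode& node) {
  for (int c = 0; c < 8; ++c)
    if (node.children[c] == nullptr)
      return false;                              // 1) must have all 8 children
  for (int c = 0; c < 8; ++c) {
    if (node.children[c]->hasChildren())
      return false;                              // 2) children must be leaves
    if (!(node.children[c]->semantics == node.children[0]->semantics))
      return false;                              // 3) equal multi-class log-odds
  }
  return true;  // prune: delete the children and keep their shared log-odds
}
```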

VI-B Information Computation

A ray cast through an OcTree representation may visit several large cells within which the class probabilities are homogeneous. We exploit this property to obtain the mutual information between a multi-class OctoMap and a single ray as a summation over a subset of OcTree leaf nodes instead of individual map cells. This simplification provides a significant performance gain with no loss of accuracy. The following formulation can be considered a multi-class generalization of the run-length encoding technique introduced by [10], using the mutual information lower bound in (13) and the multi-class OcTree defined earlier in this section.

Suppose that the map cells along a single beam $\mathbf{R}_\tau \boldsymbol{\eta}_b$ have piecewise-constant multi-class probabilities, such that the set $\{m_i \mid i \in \mathcal{R}_{\tau,b}(r_{max})\}$ can be partitioned into $Q_{\tau,b}$ groups of consecutive cells indexed by $\mathcal{R}_{\tau,b}^q(r_{max})$, $q = 1, \ldots, Q_{\tau,b}$, where:

$$p_t(m_i = k) = p_t(m_j = k), \quad \forall i, j \in \mathcal{R}_{\tau,b}^q(r_{max}), \;\; \forall k \in \mathcal{K}. \tag{14}$$

In this case, the log-odds probabilities encountered by a ray cast can be compressed using semantic run-length encoding, defined below.

Definition 3.

A semantic run-length encoding (SRLE) of a ray $\mathbf{R}_\tau \boldsymbol{\eta}_b$ cast through a multi-class OcTree is an ordered list of tuples of the form $[(\omega_{\tau,b,q}, \boldsymbol{\chi}_{t,q})]_{q=1}^{Q_{\tau,b}}$, where $\omega_{\tau,b,q}$ and $\boldsymbol{\chi}_{t,q}$ respectively represent the width and the log-odds vector of the intersection between the ray and the cells in $\mathcal{R}_{\tau,b}^q(r_{max})$. The width $\omega_{\tau,b,q}$ is the number of OcTree elements along the ray intersection, where an OcTree element is a cell with the smallest physical dimensions.
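A minimal sketch of SRLE construction, assuming the per-element log-odds vectors visited by the ray cast are already available, is given below; exact equality is used as the merging test, which matches the stable thresholded log-odds discussed in Sec. VI-A, and all names are illustrative.

```cpp
// A minimal sketch of building an SRLE from the per-element log-odds vectors
// visited by a ray cast through the OcTree, assuming Eigen. Consecutive
// elements with equal log-odds are merged into (width, log-odds) runs as in
// Definition 3.
#include <Eigen/Dense>
#include <utility>
#include <vector>

using Run = std::pair<int, Eigen::VectorXd>;  // (omega_{tau,b,q}, chi_{t,q})

std::vector<Run> encodeSRLE(const std::vector<Eigen::VectorXd>& rayLogOdds) {
  std::vector<Run> runs;
  for (const Eigen::VectorXd& chi : rayLogOdds) {
    if (!runs.empty() &&
        (runs.back().second.array() == chi.array()).all())
      ++runs.back().first;         // extend the current run
    else
      runs.emplace_back(1, chi);   // start a new run
  }
  return runs;
}
```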

Figure 6: Ray cast representation as semantic run-length encoding (SRLE). The multi-class log-odds $\boldsymbol{\chi}_{t,q}$ are uniform within each cube. The voxel corresponding to $q = 2$ is unexplored, hence its multi-class log-odds are denoted as $\boldsymbol{\chi}_{0,2}$.

Fig. 6 shows an example of SRLE over a semantic OctoMap. While SRLE can be used with a uniform-resolution grid map, it is particularly effective for a multi-class OcTree, which inherently contains large regions with homogeneous multi-class log-odds. Additionally, the OcTree data structure allows faster ray casting since it can be done over OcTree leaf nodes [46, 47], instead of over a uniform-resolution grid as in [45].

SRLE ray casting delivers substantial gains in efficiency for mutual information computation since the contribution of each group $\{m_i \mid i \in \mathcal{R}_{\tau,b}^q(r_{max})\}$ to the innermost summation of (13) can be obtained in closed form.

Proposition 3.

The Shannon mutual information between a single range-category measurement $\mathbf{z}_{\tau,b}$ and a semantic OctoMap $\mathbf{m}$ can be computed as:

$$I(\mathbf{m}; \mathbf{z}_{\tau,b} \mid \mathcal{Z}_{1:t}) = \sum_{k=1}^{K} \sum_{q=1}^{Q_{\tau,b}} \rho_{\tau,b}(q,k)\, \Theta_{\tau,b}(q,k), \tag{15}$$

where $Q_{\tau,b}$ is the number of partitions along the ray $\mathbf{R}_\tau \boldsymbol{\eta}_b$ that have identical multi-class log-odds, and the multi-class probabilities for each partition are denoted as:

$$\begin{cases} \pi_t(q,k) = p_t(m_i = k) \\ \boldsymbol{\chi}_{t,q} = \mathbf{h}_{t,i} \end{cases} \quad \forall i \in \mathcal{R}_{\tau,b}^q(r_{max}).$$

Furthermore, defining $\omega_{\tau,b,q} = |\mathcal{R}_{\tau,b}^q(r_{max})|$ as the width of the $q$-th partition, we have:

$$\rho_{\tau,b}(q,k) := \pi_t(q,k) \prod_{j=1}^{q-1} \pi_t^{\omega_{\tau,b,j}}(j,0),$$

$$\Theta_{\tau,b}(q,k) := \beta_{\tau,b}(q,k)\, \frac{1 - \pi_t^{\omega_{\tau,b,q}}(q,0)}{1 - \pi_t(q,0)} + \frac{f(\boldsymbol{\phi}^- - \boldsymbol{\chi}_{0,q}, \boldsymbol{\chi}_{t,q})}{(1 - \pi_t(q,0))^2} \Big[ (\omega_{\tau,b,q} - 1)\, \pi_t^{\omega_{\tau,b,q}+1}(q,0) - \omega_{\tau,b,q}\, \pi_t^{\omega_{\tau,b,q}}(q,0) + \pi_t(q,0) \Big],$$

$$\beta_{\tau,b}(q,k) := f(\boldsymbol{\phi}^+ + \mathbf{E}_{k+1}\boldsymbol{\psi}^+ - \boldsymbol{\chi}_{0,q}, \boldsymbol{\chi}_{t,q}) + \sum_{j=1}^{q-1} \omega_{\tau,b,j}\, f(\boldsymbol{\phi}^- - \boldsymbol{\chi}_{0,j}, \boldsymbol{\chi}_{t,j}).$$
Proof.

See Appendix C. ∎

In (15), $\rho_{\tau,b}(q,k)$ relates to the event that the partition $\mathcal{R}_{\tau,b}^q(r_{max})$ belongs to category $k$ while all of the previous partitions along the ray $\mathbf{R}_\tau \boldsymbol{\eta}_b$ are free. Analogous to the definition of $C_{\tau,b}(n,k)$ in Proposition 2, $\beta_{\tau,b}(q,k)$ is the weighted sum of the log-ratios $f(\boldsymbol{\phi}, \boldsymbol{\chi})$ for the first $q$ partitions along the ray $\mathbf{R}_\tau \boldsymbol{\eta}_b$ under the same event as the one associated with $\rho_{\tau,b}(q,k)$. Accumulating the multi-class probabilities within the partition $\mathcal{R}_{\tau,b}^q(r_{max})$ yields $\Theta_{\tau,b}(q,k)$ (see Appendix C for more details). Therefore, the mutual information in (15) is equivalent to the expectation of the accumulated log-ratios $\Theta_{\tau,b}(q,k)$ over all possible instantiations of $\mathbf{z}_{\tau,b}$.

Proposition 3 allows an extension of the mutual-information lower bound in Proposition 2 to semantic OctoMap representations, summarized in the corollary below. The proof follows directly from the additive property of mutual information between a semantic OctoMap and a sequence of independent observations.

Corollary 1.

Given a sequence of range-category observations $\mathcal{Z}_{t+1:t+T}$, the Shannon mutual information between $\mathcal{Z}_{t+1:t+T}$ and a semantic OctoMap $\mathbf{m}$ can be lower bounded as:

$$I(\mathbf{m}; \mathcal{Z}_{t+1:t+T} \mid \mathcal{Z}_{1:t}) \geq I(\mathbf{m}; \underline{\mathcal{Z}}_{t+1:t+T} \mid \mathcal{Z}_{1:t}) = \sum_{\tau=t+1}^{t+T} \sum_{b=1}^{B} \sum_{k=1}^{K} \sum_{q=1}^{Q_{\tau,b}} \rho_{\tau,b}(q,k)\, \Theta_{\tau,b}(q,k), \tag{16}$$

where $\underline{\mathcal{Z}}_{t+1:t+T}$ is a subset of non-overlapping measurements that satisfy (12), and $\rho_{\tau,b}(q,k)$ and $\Theta_{\tau,b}(q,k)$ are defined in Proposition 3.

The same approach as in Alg. 1 is used for autonomous exploration over a semantic OctoMap, except that we employ the information computation formula of (16) to quantify the informativeness of candidate robot trajectories. The active mapping method in Alg. 1 is a greedy exploration strategy, which does not change subsequent control inputs based on the updated map distribution. Greedy exploration may be sub-optimal and manifests itself as back-and-forth travel between map frontiers. We alleviate this behavior by (a) computing the information along the whole trajectory, as opposed to only at the frontiers or the next best view, and (b) re-planning frequently to account for the updated map distribution. Discounted by the distance traveled as the cost of a trajectory, this leads to a more accurate calculation of the information gain along a candidate path, which rules out most of the back-and-forth visiting behavior. It is also important to mention that the main scope of this work is the introduction of a novel multi-class semantic OcTree representation and the mutual information between such a model and range-category observations. Our method enables fast and accurate evaluation of information for any set of candidate trajectories, the likes of which can be generated by random tree methods [10, 48], hierarchical planning strategies [21] or, in the simplest form, a greedy approach that computes paths to each frontier. We believe utilizing our proposed information measure to score candidate viewpoints is complementary, rather than an alternative, to state-of-the-art exploration methods that use sophisticated optimization strategies [19, 21, 49].

Figure 7: Synthetic environments used for comparisons among frontier-based exploration [12], FSMI [10], and SSMI. Different semantic categories are represented by distinct colors. Left: An instance of a procedurally generated random environment with 10 object classes. Right: A hand-designed environment with corridor and block structures with 12 object classes.
(a) Random map / Binary exploration
(b) Random map / Semantic exploration
(c) Structured map / Binary exploration
(d) Structured map / Semantic exploration
Figure 8: Simulation results for active mapping on the environments in Fig. 7, over 20 exploration iterations. Solid and dotted lines represent the mean and one standard deviation from the mean, respectively. (a), (b): Exploration performance averaged over 10 random environments with 3 random starting positions per instance. Exploration in the random environments sometimes did not terminate before the maximum number of iterations, so the corresponding curves do not flatten. Random maps of the same size as the structured map contain more frontiers to explore in each iteration, since each scan has a higher probability of being occluded at multiple angles due to the lack of regular patterns such as corridors. (c), (d): Exploration performance on the structured environment averaged over 3 random starting positions. For the structured map, exploration terminates before reaching the maximum number of iterations, which explains the flat curves at the end of the corresponding plots.

VI-C Computational Complexity

Note that the mutual information computations in both (13) and (16) can be performed recursively. For (13), we have:

\begin{aligned}
p_{\tau,b}(n+1,k) &= p_{\tau,b}(n,k)\,\frac{p_{t}(m_{j^{*}_{\tau,b}}=k)\,p_{t}(m_{i^{*}_{\tau,b}}=0)}{p_{t}(m_{i^{*}_{\tau,b}}=k)}, \\
C_{\tau,b}(n+1,k) &= C_{\tau,b}(n,k) - f(\boldsymbol{\phi}^{+}+\mathbf{E}_{k+1}\boldsymbol{\psi}^{+}-\mathbf{h}_{0,i_{\tau,b}^{*}},\,\mathbf{h}_{t,i_{\tau,b}^{*}}) \\
&\quad + f(\boldsymbol{\phi}^{+}+\mathbf{E}_{k+1}\boldsymbol{\psi}^{+}-\mathbf{h}_{0,j_{\tau,b}^{*}},\,\mathbf{h}_{t,j_{\tau,b}^{*}}) + f(\boldsymbol{\phi}^{-}-\mathbf{h}_{0,i_{\tau,b}^{*}},\,\mathbf{h}_{t,i_{\tau,b}^{*}}),
\end{aligned} \qquad (17)

where $j_{\tau,b}^{*}$ and $i^{*}_{\tau,b}$ are the indices of the farthest map cell in $\tilde{{\cal R}}_{\tau,b}(n+1)$ and $\tilde{{\cal R}}_{\tau,b}(n)$, respectively. A similar recursive pattern can be found in (16):

\begin{aligned}
\rho_{\tau,b}(q+1,k) &= \rho_{\tau,b}(q,k)\,\frac{\pi_{t}(q+1,k)\,\pi^{\omega_{\tau,b,q}}_{t}(q,0)}{\pi_{t}(q,k)}, \\
\beta_{\tau,b}(q+1,k) &= \beta_{\tau,b}(q,k) - f(\boldsymbol{\phi}^{+}+\mathbf{E}_{k+1}\boldsymbol{\psi}^{+}-\boldsymbol{\chi}_{0,q},\,\boldsymbol{\chi}_{t,q}) \\
&\quad + f(\boldsymbol{\phi}^{+}+\mathbf{E}_{k+1}\boldsymbol{\psi}^{+}-\boldsymbol{\chi}_{0,q+1},\,\boldsymbol{\chi}_{t,q+1}) + \omega_{\tau,b,q}\,f(\boldsymbol{\phi}^{-}-\boldsymbol{\chi}_{0,q},\,\boldsymbol{\chi}_{t,q}).
\end{aligned} \qquad (18)

This implies that the innermost summations of (13) and (16) can be computed in $O(N_{\tau,b})$ and $O(Q_{\tau,b})$ time, respectively, where $N_{\tau,b}$ is the number of map cells along a single ray $\mathbf{R}_{\tau}\boldsymbol{\eta}_{b}$ up to its maximum range, and $Q_{\tau,b}$ is the number of groups of consecutive cells that share the same multi-class probabilities. In an environment containing $K$ object classes, evaluating the informativeness of a trajectory composed of $T$ observations, each containing $B$ beams, has complexity $O(TBKN_{\tau,b})$ for a regular-grid multi-class representation and $O(TBKQ_{\tau,b})$ for a multi-class OcTree representation.
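As an illustration of the recursion in (17), the following Python sketch evaluates the inner double sum of (13) for a single ray. The callbacks `f_plus` and `f_minus` stand in for the log-ratio terms $f(\cdot,\cdot)$ computed from the inverse sensor model, and the interval-length weight $\gamma(n)$ from Proposition 2 is assumed to be folded into them; all names are illustrative.

```python
import numpy as np

def ray_info_lower_bound(pi, f_plus, f_minus):
    """Evaluate sum_k sum_n p(n,k) C(n,k) for one ray via recursion (17).

    pi      : (N, K+1) array; pi[n, k] = p_t(m = k) for the n-th cell along
              the ray, with k = 0 the free class (assumed strictly positive).
    f_plus  : callable (n, k) -> log-ratio term for cell n hit with class k.
    f_minus : callable n -> log-ratio term for cell n traversed as free.
    """
    N, K1 = pi.shape
    info = 0.0
    for k in range(1, K1):
        # Base case n = 1: the beam ends in the first cell with category k.
        p = pi[0, k]
        C = f_plus(0, k)
        info += p * C
        for n in range(1, N):
            # Recursion (17): the previous cell becomes free, cell n is hit.
            p = p * pi[n, k] * pi[n - 1, 0] / pi[n - 1, k]
            C = C - f_plus(n - 1, k) + f_plus(n, k) + f_minus(n - 1)
            info += p * C
    return info
```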

As we demonstrate in Sec. VII-C, for a ray $\mathbf{R}_{\tau}\boldsymbol{\eta}_{b}$ we often observe that $Q_{\tau,b}$ is significantly smaller than $N_{\tau,b}$, thanks to the OcTree pruning mechanism. Since $N_{\tau,b}$ scales linearly with the map resolution, the complexity of information computation over a semantic OctoMap grows sub-linearly with the inverse of the OcTree element dimensions, a parameter analogous to the map resolution.

VII Experiments

(a) Partially explored semantic map
(b) Entropy surface
(c) Occupancy mutual information surface
(d) Multi-class mutual information surface
Figure 9: Comparison between different mutual information formulations used for exploration. (a), (b): A snapshot of 2-D exploration showing the map estimate and the corresponding uncertainty, where the entropy of each pixel $i$ is computed as in (19). (c), (d): The mutual information used to find the informative trajectory by FSMI and SSMI, respectively. Brighter pixels indicate larger values.

In this section, we evaluate the performance of SSMI in simulated and real-world experiments. We compare SSMI with two baseline exploration strategies, frontier-based exploration [12] and FSMI [10], in a 2-D active binary mapping scenario in Sec. VII-A and a 2-D active multi-class mapping scenario in Sec. VII-B. All three methods use our range-category sensor model in (10) and our Bayesian multi-class mapping in (8), but select informative robot trajectories $\mathbf{X}_{t+1:t+T}(\mathbf{u}_{t:t+T-1})$ based on their own criteria. In Sec. VII-C, we evaluate the improvement in ray tracing resulting from SRLE through an experiment in a 3-D simulated Unity environment. In Sec. VII-D, we investigate the influence of the number of stored semantic classes on mapping performance. Additionally, in Sec. VII-E, we apply SSMI in a similar 3-D simulation environment alongside Frontier, FSMI, and the hierarchical coverage maximization method TARE [21]. There we use the OcTree-based multi-class information computation introduced in Sec. VI in order to demonstrate large-scale realistic active multi-class mapping. Finally, in Sec. VII-F and Sec. VII-G, we test SSMI mapping and exploration in real environments using wheeled ground robots. An open-source implementation of SSMI is available at https://github.com/ExistentialRobotics/SSMI.

In each planning step of 2-D exploration, we identify frontiers by applying edge detection on the most likely map at time $t$ (the mode of $p_t(\mathbf{m})$). Then, we cluster the edge cells by detecting the connected components of the boundaries between explored and unexplored space (see the sketch below). We plan a path from the robot pose $\mathbf{X}_t$ to the center of each frontier using $A^{*}$ graph search and provide the path to a low-level controller to generate $\mathbf{u}_{t:t+T-1}$. For 3-D exploration, we first derive a 2-D occupancy map by projecting the most likely semantic OctoMap at time $t$ onto the $z=0$ plane and proceed with similar steps as in 2-D path planning.
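A minimal sketch of this frontier extraction step, assuming the most likely map is stored as a 2-D integer array with an `unknown` label for unexplored cells, is given below; the labels and helper names are illustrative, and $A^{*}$ planning to each returned center would follow as described above.

```python
import numpy as np
from scipy import ndimage

def find_frontier_centers(ml_map, unknown=-1, free=0):
    """Extract one candidate goal per frontier cluster from the most
    likely map (mode of p_t(m))."""
    free_mask = (ml_map == free)
    unknown_mask = (ml_map == unknown)
    # A frontier cell is a free cell adjacent to at least one unknown cell.
    frontier = free_mask & ndimage.binary_dilation(unknown_mask)
    # Cluster frontier cells into connected components (edge clusters).
    labels, n = ndimage.label(frontier)
    centers = ndimage.center_of_mass(frontier, labels, range(1, n + 1))
    return [tuple(map(int, c)) for c in centers]
```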

VII-A 2-D Binary Exploration

We consider active binary occupancy mapping first. We compare SSMI against Frontier and FSMI in 1 structured and 10 procedurally generated 2-D environments, shown in Fig. 7. A 2-D LiDAR sensor is simulated with additive Gaussian noise ${\cal N}(0,0.1)$. Fig. 8(a) and Fig. 8(c) compare the exploration performance of the three methods in terms of map entropy reduction and percentage of the map explored per distance traveled. SSMI performs similarly to FSMI in that both achieve low map entropy while traversing significantly less distance than Frontier.

VII-B 2-D Multi-class Exploration

Next, we use the same 2-D environments in Fig. 7 but introduce range-category measurements. Range measurements are subject to additive Gaussian noise ${\cal N}(0,0.1)$, while category measurements have a uniform misclassification probability of 0.35. Fig. 8(b) and Fig. 8(d) compare the semantic exploration performance of all three strategies. SSMI reaches the same level of map entropy as FSMI and Frontier but traverses a noticeably shorter distance. This can be attributed to the fact that only SSMI distinguishes map cells whose occupancy probabilities are the same but whose per-class probabilities differ. To illustrate this further, we visualize the entropy and information surfaces used by FSMI and SSMI. Fig. 9(a) shows a snapshot of semantic exploration, while Fig. 9(b) visualizes the entropy of each pixel $i$, computed as:

H(m_{i}\mid{\cal Z}_{1:t})=-\sum_{k=0}^{K}p_{t}(m_{i}=k)\log p_{t}(m_{i}=k), \qquad (19)

where ${\cal Z}_{1:t}$ denotes the realized observations up to time $t$. The task of exploration can be regarded as minimizing the conditional entropy summed over all pixels, i.e., the map entropy. However, since the observations are not known in advance, we estimate the reduction in uncertainty by taking an expectation over the observations. Accounting for the prior uncertainty in the map, we arrive at maximizing mutual information as our objective, which is related to entropy as follows:

H(m_{i})-\mathbb{E}_{{\cal Z}_{1:t}}\{H(m_{i}\mid{\cal Z}_{1:t})\}=I(m_{i};{\cal Z}_{1:t}). \qquad (20)

Therefore, the exploration performance depends heavily on the mutual information formulation, since it directly dictates how the uncertainty is quantified. As shown in Fig. 9(d), by capturing per-class uncertainties, the semantic mutual information of SSMI, computed in (13), provides a smoother and more accurate estimate of information-rich regions than the binary mutual information formula used by FSMI (equation (18) in [10]), shown in Fig. 9(c).
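As a concrete example, the entropy surface of Fig. 9(b) follows directly from (19); a minimal NumPy sketch, assuming the map posterior is stored as a dense array of per-cell categorical distributions, is:

```python
import numpy as np

def entropy_surface(p_map, eps=1e-12):
    """Per-cell Shannon entropy (19).

    p_map : (H, W, K+1) array of categorical probabilities p_t(m_i = k),
            normalized along the last axis.
    Returns an (H, W) entropy image like the one in Fig. 9(b).
    """
    p = np.clip(p_map, eps, 1.0)  # avoid log(0) for deterministic cells
    return -np.sum(p_map * np.log(p), axis=-1)
```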

Figure 10: Variation of the number of visited OctoMap cells and OcTree elements, denoted $Q$ and $N$ respectively, with respect to the map resolution. Solid blue and red lines represent the average values of $Q$ and $N$ over all ray castings, while the dashed lines show one standard deviation from the average. The green curve shows the total exploration time for each map resolution. All measurements are accumulated over the course of 5 exploration iterations.
(a) The robot begins exploration.
(b) After 16 iterations, the robot starts to refine previously explored areas to fill in partially observed objects.
(c) The robot explores unknown regions located on the boundaries of the explored area at iteration 40.
(d) Multi-class occupancy map after 60 exploration iterations.
(e) Photo-realistic Unity simulation environment.
Figure 11: Time lapse of autonomous exploration and multi-class mapping in a simulated Unity environment. The robot is equipped with an RGBD sensor and runs semantic segmentation. Different colors represent different semantic categories (grass, dirt road, building, etc.).

VII-C SRLE Compression for 3-D Ray Tracing

In this subsection, we evaluate the ray-tracing compression resulting from SRLE through an experiment in a photo-realistic 3-D Unity simulation, shown in Fig. 11(e). We use a Husky robot equipped with an RGBD camera and run a semantic segmentation algorithm over the RGB images. In order to remove irrelevant randomness, the sensors and the semantic segmentation are defined as error-free. We define map resolution as the inverse of the dimensions of an OcTree element. For resolutions ranging from $1.3\,\text{m}^{-1}$ to $6.6\,\text{m}^{-1}$, we run 5 exploration iterations using the semantic OctoMap and information computation of Sec. VI and store all ray traces in SRLE format (a minimal sketch of the encoding follows below). Fig. 10 shows the distribution of the number of OctoMap cells $Q$ and OcTree elements $N$ visited during each ray trace, as well as the time required to execute each exploration episode, as a function of map resolution. In other words, $N$ represents the number of cells that would be processed during mapping and information computation if the environment were represented as a regular 3-D grid, while $Q$ represents the actual number of processed semantic OctoMap cells. The pruning mechanism of the OcTree representation yields a substantial gain in the number of cells visited per ray trace. As opposed to the almost linear growth of $N$, the distribution of $Q$ is effectively independent of the map resolution, except at very fine resolutions where void areas between observation rays prevent efficient pruning. However, for map resolutions above $2\,\text{m}^{-1}$, the exploration time grows with the map resolution. This is attributed to the recursive ray insertion method of OctoMap, which must re-compute the log-odds of every OcTree element along an observation ray whenever the ray does not carry the same (free or object-class) state as the visited cell. In the subsequent 3-D experiments, we choose a map resolution of $2\,\text{m}^{-1}$ to balance performance and map accuracy.
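A minimal sketch of the semantic run-length encoding of a single ray trace is shown below. It assumes each traversed cell is keyed by the OcTree leaf it falls in, so that consecutive cells pruned into the same leaf share a key; representing the ray as (key, run-length) pairs is what reduces the per-ray work from $N_{\tau,b}$ to $Q_{\tau,b}$ terms.

```python
def srle_encode(ray_cells):
    """Semantic run-length encoding of one ray trace.

    ray_cells : sequence of per-cell keys along the ray; consecutive cells
                belonging to the same pruned OcTree leaf share a key.
    Returns a list of (key, run_length) pairs, so Q = len(runs) <= N.
    """
    runs = []
    for key in ray_cells:
        if runs and runs[-1][0] == key:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([key, 1])     # start a new run
    return [(k, w) for k, w in runs]

# A ray crossing three pruned leaves spanning 4, 2, and 3 cells:
# srle_encode(['a']*4 + ['b']*2 + ['c']*3) -> [('a', 4), ('b', 2), ('c', 3)]
```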

Figure 12: Mapping time vs. number of stored classes. Top: Robot trajectory (in green) used for all mapping frequency evaluations. Bottom: Average mapping frequency as a function of the number of stored semantic classes $K_s$.

VII-D Mapping Time vs. Number of Stored Classes

We analyze the influence of the number of stored classes in the semantic OctoMap on the mapping time. Let $K_s$ denote the number of stored semantic classes. Alg. 2 has $O(K_s)$ memory and $O(K_s\log K_s)$ computational complexity (due to the sorting in line 16). Furthermore, let $p_{\text{miss}}$ be the misclassification probability, assumed to be uniformly distributed among all incorrect classes. Regarding accuracy, for a classifier with $p_{\text{miss}}<\frac{K-1}{K}$, where $K$ is the number of all object classes, the true class is always asymptotically recoverable as long as $K_s\geq 2$, thanks to the auxiliary others class that stores the accumulated probability of the $K-K_s$ least likely classes (see line 20 of Alg. 2 and the sketch below). In general, $K_s$ controls how fast the true class is detected, at the cost of additional memory use and computation. In order to quantitatively evaluate the effect of $K_s$ on mapping time, we consider the same Husky robot as in the previous subsection with a fixed trajectory, shown in Fig. 12 (top), and measure the mapping frequency as a function of $K_s$. Fig. 12 (bottom) shows the decrease in average mapping frequency as $K_s$ increases. Note that the trajectory along which the data is collected only visits 6 object classes, which explains the change in slope for $K_s>6$.
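The following sketch illustrates the truncation step discussed above: keeping the $K_s$ most likely semantic classes of a voxel and pooling the remaining $K-K_s$ probabilities into the auxiliary others class. The function name and data layout are illustrative and do not reproduce Alg. 2 verbatim.

```python
import numpy as np

def truncate_to_stored_classes(p, Ks):
    """Keep the Ks most likely semantic classes; pool the rest into `others`.

    p  : (K+1,) categorical distribution over classes; index 0 is free.
    Ks : number of stored semantic classes, 2 <= Ks <= K.
    Returns the stored class ids, their probabilities, and the pooled
    `others` mass of the K - Ks least likely classes.
    """
    sem = p[1:]                                  # semantic classes only
    order = np.argsort(sem)[::-1]                # O(K log K) sort by probability
    kept = order[:Ks]
    others_mass = float(sem[order[Ks:]].sum())   # accumulated tail mass
    class_ids = np.concatenate(([0], kept + 1))  # the free class is always kept
    probs = np.concatenate(([p[0]], sem[kept]))
    return class_ids, probs, others_mass
```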

VII-E 3-D Exploration in a Unity Simulation

We evaluate SSMI in the same 3-D simulation environment as in the two previous subsections; however, this time the range measurements have additive Gaussian noise ${\cal N}(0,0.1)$, and the semantic segmentation algorithm detects the true class with probability 0.95, with misclassification occurring uniformly in the pixel space. Fig. 11 shows several iterations of the exploration process. For comparison, we implemented a 3-D version of FSMI [10] that utilizes run-length encoding to accelerate the information computation for a binary OctoMap. Moreover, we deploy the state-of-the-art hierarchical exploration method TARE [21] in our 3-D Unity simulation environment. Fig. 13 shows the change in map entropy versus distance traveled and total elapsed time for all exploration strategies. We observe that SSMI is the most efficient in solving the trade-off between path length and information gathered along the path, and it achieves the lowest entropy in the multi-class OctoMap. Similar to the discussion in Sec. VII-B, this can be ascribed to the fact that, among the compared methods, the objective function of SSMI is the only one that captures the uncertainty in both the semantic classes and the occupancy of the environment. On the other hand, SSMI and FSMI require evaluation of mutual information along each candidate trajectory, the number of which equals the number of frontiers in the current map estimate $p_t(\mathbf{m})$, whereas the hierarchical planning method employed by TARE only requires local trajectory computation, with a global coverage path obtained at a coarse level. As a result, TARE exploration can be performed over a relatively shorter time period than SSMI and FSMI in scenarios where the number of frontiers is large, e.g., outdoor areas. Parallel computation of mutual information for each candidate trajectory, or using heuristics such as frontier size to sort candidate solutions, would improve the computation time of SSMI; however, we consider these outside the scope of this paper. Fig. 14 compares the mapping precision of various object classes for the tested methods. SSMI exhibits higher precision for object categories that appear rarely, such as the Animal or Tree classes, while Frontier slightly outperforms SSMI when mapping the Grass and Dirt Road categories. This can be explained by the tendency of SSMI towards achieving high overall classification precision, even at the cost of slightly reduced precision for certain object categories. Furthermore, TARE achieves the highest precision for the Building class, which can be justified by the observation that the computed global coverage path tends to traverse near building walls.

Figure 13: Simulation results for exploration in the Unity 3-D environment.
Figure 14: Mapping precision for observed semantic classes.

VII-F 3-D Mapping in a Real-world Outdoor Environment

Figure 15: Environment for the outdoor mapping experiment. Left: Satellite image of the experiment locale with the robot trajectory shown in yellow. Right: Corresponding locations from the ground-level point of view.

We deployed our semantic OcTree mapping approach on a Husky robot equipped with an Ouster OS1-32 LiDAR and an Intel RealSense D455 RGBD camera. Our software stack is implemented using the Robot Operating System (ROS) [50]. The LiDAR is used for localization via iterative closest point (ICP) scan matching [51]. A neural network based on the FCHarDNet architecture [52] and trained on the RUGD dataset [53] was used for semantic segmentation. The RGBD camera produces color and depth images of size $640\times 480$ at 30 frames per second. The semantic segmentation algorithm takes a 2-D color image and outputs a semantic label for each pixel, at an average rate of 28.7 frames per second. By aligning the semantic image and the depth map, we derive a semantic 3-D point cloud which is used for Bayesian multi-class mapping (a minimal sketch of this back-projection step follows below). Our implementation was able to update the semantic OctoMap every 0.12 s, on average, with all computations performed on the mobile robot. The experiment was carried out in an approximately 6-acre forested area shown in Fig. 15. The environment contained various terrain features, including asphalt road, gravel, grass, densely forested areas, and hills. Additionally, a number of buildings and other structures, such as bleachers, tents, and cars, add to the diversity of object categories within the locale. The robot was manually controlled via joystick and traveled the path shown in Fig. 15 (left) while incrementally building the semantic OctoMap. Fig. 16 shows the semantic mapping result, obtained via 2-D projection of the semantic OctoMap, overlaid on the satellite image. We computed the memory size of the semantic OctoMap and compared it with the corresponding regular voxel grid representation, where each voxel contains the same amount of data as an OcTree leaf node at the lowest depth. Fig. 17 shows an almost five-fold saving in memory when using the OcTree data structure. The memory savings of the OctoMap representation become even more important when communication is considered: our semantic OctoMap implementation required a network bandwidth of 238 KB/s, whereas a regular grid required 1173 KB/s for map communication.
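A minimal sketch of the back-projection step that fuses an aligned depth image and per-pixel semantic labels into a semantic 3-D point cloud is given below, assuming a standard pinhole camera model with known intrinsics $(f_x, f_y, c_x, c_y)$; the function name and interface are illustrative.

```python
import numpy as np

def semantic_point_cloud(depth, labels, fx, fy, cx, cy):
    """Back-project aligned depth and semantic label images.

    depth  : (H, W) depth image in meters, 0 where invalid.
    labels : (H, W) integer semantic label per pixel.
    Returns an (N, 4) array of [x, y, z, label] rows in the camera frame.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel coordinates
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx   # back-project along image columns
    y = (v[valid] - cy) * z / fy   # back-project along image rows
    return np.column_stack([x, y, z, labels[valid].astype(float)])
```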

Figure 16: Semantic mapping output overlaid on the satellite image. The map is obtained via 2-D projection of the 3-D semantic OctoMap.
Figure 17: Memory use of a regular grid vs. the semantic OctoMap.

VII-G 3-D Exploration in a Real-world Office Environment

Figure 18: Robot car used in the indoor real-world experiments.
Figure 19: Real-world experiment results for active mapping over 20 exploration iterations.

We implemented SSMI on a wheeled ground robot to autonomously map an indoor office environment. Fig. 18 shows the robot, equipped with an NVIDIA Xavier NX computer, a Hokuyo UST-10LX LiDAR, and an Intel RealSense D435i RGBD camera. As in the outdoor experiments, ROS was used for software deployment on the robot, and ICP laser scan matching provided localization. This time, we utilized a ResNet18 [54] neural network architecture pre-trained on the SUN RGB-D dataset [55] for semantic segmentation. In particular, we employed the deep learning inference ROS nodes provided by NVIDIA [56], which are optimized for Xavier NX computers via TensorRT acceleration. Due to the limited computational power available on the mobile platform, we operated the RGBD camera at a lower frame rate of 15 Hz, with color and depth image size set to $640\times 480$. The semantic segmentation algorithm produced pixel classification images (resized to $512\times 400$) at an average rate of 9.8 frames per second. Our implementation published semantic OctoMap ROS topics every 0.34 s, on average, with all processing performed on the mobile platform. Fig. 20 depicts the exploration process, while Fig. 19 shows the performance of SSMI compared to frontier-based exploration over 20 exploration iterations. We observe that, similar to the simulations, SSMI outperforms frontier-based exploration in terms of distance traveled. SSMI also performs on par with Frontier in terms of entropy reduction per unit time. This can be explained by the fact that large depth measurement noise and classification error in the real-world experiments result in (a) the need to re-visit explored areas in order to estimate an accurate map, leading to poor entropy reduction for the frontier-based method, and (b) a small number of safe candidate trajectories, leading to fewer computations for SSMI. Overall, our experiments show that SSMI outperforms Frontier in indoor exploration scenarios where the number and length of candidate trajectories are constrained by the size of the environment.

(a) The robot begins exploration.
(b) The robot visits neighboring unexplored regions while refining the map of visited areas.
(c) Semantic OctoMap after 20 exploration iterations.
(d) Office environment featuring corridors, furniture, signs, and doors.
Figure 20: Time lapse of autonomous exploration and multi-class mapping in the environment shown in (d). The exploration is run for 20 iterations. Different colors represent different semantic categories (floor, wall, furniture, etc.).

VIII Conclusion

This paper developed techniques for active multi-class mapping of large 3-D environments using range and semantic segmentation observations. Our results enable efficient mutual information computation over multi-class maps and make it possible to optimize for per-class uncertainty. Our experiments show that SSMI performs on par with the state-of-the-art FSMI method in binary active mapping scenarios. However, when semantic information is considered, SSMI outperforms existing algorithms and leads to efficient exploration and accurate multi-class mapping, even in the presence of domain shift between the classification training data and the test environment. Experiments in both simulated and real-world environments demonstrated the scalability of SSMI to large-scale 3-D exploration scenarios.

Appendix A Proof of Proposition 1

Applying Bayes' rule in (2) and the factorization in (5) to $p_t(\mathbf{m})$ for some $\mathbf{z}\in{\cal Z}_{t+1}$ leads to:

\prod_{i=1}^{N}p(m_{i}\mid{\cal Z}_{1:t},\mathbf{z})=\frac{p(\mathbf{z})}{p(\mathbf{z}\mid{\cal Z}_{1:t})}\prod_{i=1}^{N}\frac{p(m_{i}\mid\mathbf{z})}{p(m_{i})}\,p(m_{i}\mid{\cal Z}_{1:t}). \qquad (21)

The term $\frac{p(\mathbf{z})}{p(\mathbf{z}\mid{\cal Z}_{1:t})}$ may be eliminated by considering the odds ratio of an arbitrary category $m_{i}=k_{i}\in{\cal K}$ versus the free category $m_{i}=0$ for each cell $i$:

\prod_{i=1}^{N}\frac{p(m_{i}=k_{i}\mid{\cal Z}_{1:t},\mathbf{z})}{p(m_{i}=0\mid{\cal Z}_{1:t},\mathbf{z})}=\prod_{i=1}^{N}\frac{p(m_{i}=k_{i}\mid\mathbf{z})}{p(m_{i}=0\mid\mathbf{z})}\,\frac{p(m_{i}=0)}{p(m_{i}=k_{i})}\,\frac{p(m_{i}=k_{i}\mid{\cal Z}_{1:t})}{p(m_{i}=0\mid{\cal Z}_{1:t})}. \qquad (22)

Since each term in both the left- and right-hand side products depends only on one map cell $m_{i}$, the expression holds for each individual cell. Re-writing the expression for cell $m_{i}$ in vector form, with elements corresponding to each possible value of $k_{i}\in{\cal K}$, and taking an element-wise logarithm leads to:

\begin{bmatrix}\log\frac{p(m_{i}=0\mid{\cal Z}_{1:t},\mathbf{z})}{p(m_{i}=0\mid{\cal Z}_{1:t},\mathbf{z})}&\cdots&\log\frac{p(m_{i}=K\mid{\cal Z}_{1:t},\mathbf{z})}{p(m_{i}=0\mid{\cal Z}_{1:t},\mathbf{z})}\end{bmatrix}^{\top}=(\mathbf{l}_{i}(\mathbf{z})-\mathbf{h}_{0,i})+\mathbf{h}_{t,i}. \qquad (23)

Applying (23) recursively for each element $\mathbf{z}\in{\cal Z}_{t+1}$ leads to the desired result in (8). ∎

Appendix B Proof of Proposition 2

Let ${\cal R}_{t+1:t+T}(r_{max}):=\cup_{\tau,b}\,{\cal R}_{\tau,b}(r_{max})$ be the set of map indices that can potentially be observed by $\underline{{\cal Z}}_{t+1:t+T}$. Using the factorization in (5) and the fact that Shannon entropy is additive for mutually independent random variables, the mutual information only depends on the cells whose indices belong to ${\cal R}_{t+1:t+T}(r_{max})$, i.e.:

I(\mathbf{m};\underline{{\cal Z}}_{t+1:t+T}\mid{\cal Z}_{1:t})=\sum_{\tau=t+1}^{t+T}\sum_{b=1}^{B}\sum_{i\in{\cal R}_{\tau,b}(r_{max})}I(m_{i};\mathbf{z}_{\tau,b}\mid{\cal Z}_{1:t}). \qquad (24)

This is true because the measurements $\mathbf{z}_{\tau,b}\in\underline{{\cal Z}}_{t+1:t+T}$ are independent by construction, so the terms $I(m_{i};\underline{{\cal Z}}_{t+1:t+T}\mid{\cal Z}_{1:t})$ decompose into sums of mutual information terms between single-beam measurements $\mathbf{z}_{\tau,b}$ and the respective observed map cells $m_{i}$. The mutual information between a single map cell $m_{i}$ and a sensor ray $\mathbf{z}$ is:

I(m_{i};\mathbf{z}\mid{\cal Z}_{1:t})=\sum_{y=1}^{K}\int_{0}^{r_{max}}p(\mathbf{z}=(r,y)\mid{\cal Z}_{1:t})\sum_{k=0}^{K}p(m_{i}=k\mid\mathbf{z}=(r,y),{\cal Z}_{1:t})\log\frac{p(m_{i}=k\mid\mathbf{z}=(r,y),{\cal Z}_{1:t})}{p_{t}(m_{i}=k)}\,dr. \qquad (25)

Using the inverse observation model in (10) and the Bayesian multi-class update in (8), we have:

\sum_{k=0}^{K}p(m_{i}=k\mid\mathbf{z},{\cal Z}_{1:t})\log\frac{p(m_{i}=k\mid\mathbf{z},{\cal Z}_{1:t})}{p_{t}(m_{i}=k)}=f(\mathbf{l}_{i}(\mathbf{z})-\mathbf{h}_{0,i},\mathbf{h}_{t,i}), \qquad (26)

where (10) and (8) were applied a second time to the log term above. Plugging (26) back into the mutual information expression in (25) and returning to (24), we have:

I(\mathbf{m};\underline{{\cal Z}}_{t+1:t+T}\mid{\cal Z}_{1:t})=\sum_{\tau=t+1}^{t+T}\sum_{b=1}^{B}\sum_{y=1}^{K}\int_{0}^{r_{max}}\Bigl(p(\mathbf{z}_{\tau,b}=(r,y)\mid{\cal Z}_{1:t})\sum_{i\in{\cal R}_{\tau,b}(r_{max})}f(\mathbf{l}_{i}((r,y))-\mathbf{h}_{0,i},\mathbf{h}_{t,i})\Bigr)\,dr. \qquad (27)

For $\mathbf{z}_{\tau,b}=(r,y)$, the second term inside the integral above can be simplified to:

\tilde{C}_{\tau,b}(r,y):=\sum_{i\in{\cal R}_{\tau,b}(r_{max})}f(\mathbf{l}_{i}((r,y))-\mathbf{h}_{0,i},\mathbf{h}_{t,i})=f(\boldsymbol{\phi}^{+}+\mathbf{E}_{y+1}\boldsymbol{\psi}^{+}-\mathbf{h}_{0,i_{\tau,b}^{*}},\mathbf{h}_{t,i_{\tau,b}^{*}})+\sum_{i\in{\cal R}_{\tau,b}(r)\setminus\{i_{\tau,b}^{*}\}}f(\boldsymbol{\phi}^{-}-\mathbf{h}_{0,i},\mathbf{h}_{t,i}) \qquad (28)

because for map indices $i\in{\cal R}_{\tau,b}(r_{max})\setminus{\cal R}_{\tau,b}(r)$ that are not observed by $\mathbf{z}_{\tau,b}$, we have $\mathbf{l}_{i}((r,y))=\mathbf{h}_{0,i}$ according to (10), and $f(\mathbf{h}_{0,i}-\mathbf{h}_{0,i},\mathbf{h}_{t,i})=0$.

Next, we apply the definition in (11) to the first term in the integral in (27), which turns it into an integration over $\tilde{p}_{\tau,b}(r,y)\tilde{C}_{\tau,b}(r,y)$. Note that $\tilde{p}_{\tau,b}(r,y)$ and $\tilde{C}_{\tau,b}(r,y)$ are piecewise-constant functions, since ${\cal R}_{\tau,b}(r)$ is constant with respect to $r$ as long as the beam $\mathbf{z}$ lands in cell $m_{i^{*}}$. Hence, we can partition the integration domain over $r$ into a union of intervals on which the beam $\mathbf{z}$ hits the same cell, i.e., ${\cal R}_{\tau,b}(r)$ remains constant:

\int_{0}^{r_{max}}\tilde{p}_{\tau,b}(r,y)\tilde{C}_{\tau,b}(r,y)\,dr=\sum_{n=1}^{N_{\tau,b}}\int_{r_{n-1}}^{r_{n}}\tilde{p}_{\tau,b}(r,y)\tilde{C}_{\tau,b}(r,y)\,dr,

where $N_{\tau,b}=|{\cal R}_{\tau,b}(r_{max})|$, $r_{0}=0$, and $r_{N}=r_{max}$. From the piecewise-constant property of $\tilde{p}_{\tau,b}(r,y)$ and $\tilde{C}_{\tau,b}(r,y)$ over the interval $(r_{n-1},r_{n}]$, one can obtain:

\int_{r_{n-1}}^{r_{n}}\tilde{p}_{\tau,b}(r,y)\tilde{C}_{\tau,b}(r,y)\,dr=\tilde{p}_{\tau,b}(r_{n},y)\tilde{C}_{\tau,b}(r_{n},y)\,\gamma(n)=p_{\tau,b}(n,y)C_{\tau,b}(n,y), \qquad (29)

where $p_{\tau,b}(n,y)$ and $C_{\tau,b}(n,y)$ are defined in the statement of Proposition 2. Substituting $y$ with $k$ and plugging the integration result into (27) yields the lower bound in (13) for the mutual information between $\mathbf{m}$ and ${\cal Z}_{t+1:t+T}$. ∎

Appendix C Proof of Proposition 3

Consider a single beam $\mathbf{z}_{\tau,b}$ passing through cells $\{m_{i}\}$, $i\in{\cal R}_{\tau,b}(r_{max})$. As shown in Appendix B, the mutual information between the map $\mathbf{m}$ and a beam $\mathbf{z}_{\tau,b}$ can be computed as:

I(\mathbf{m};\mathbf{z}_{\tau,b}\mid{\cal Z}_{1:t})=\sum_{k=1}^{K}\sum_{n=1}^{N_{\tau,b}}p_{\tau,b}(n,k)\,C_{\tau,b}(n,k). \qquad (30)

Assuming piecewise-constant class probabilities, we have:

\sum_{n=1}^{N_{\tau,b}}p_{\tau,b}(n,k)\,C_{\tau,b}(n,k)=\sum_{q=1}^{Q_{\tau,b}}\;\sum_{n=\omega_{\tau,b,1:q-1}+1}^{\omega_{\tau,b,1:q}}p_{\tau,b}(n,k)\,C_{\tau,b}(n,k), \qquad (31)

where $\omega_{\tau,b,1:q}=\sum_{j=1}^{q}\omega_{\tau,b,j}$. For each $\omega_{\tau,b,1:q-1}<n\leq\omega_{\tau,b,1:q}$, the terms $p_{\tau,b}(n,k)$ and $C_{\tau,b}(n,k)$ can be expressed as:

\begin{aligned}
p_{\tau,b}(n,k)&=\pi_{t}(q,k)\,\pi_{t}^{(n-1-\omega_{\tau,b,1:q-1})}(q,0)\prod_{j=1}^{q-1}\pi_{t}^{\omega_{\tau,b,j}}(j,0),\\
C_{\tau,b}(n,k)&=f(\boldsymbol{\phi}^{+}+\mathbf{E}_{k+1}\boldsymbol{\psi}^{+}-\boldsymbol{\chi}_{0,q},\boldsymbol{\chi}_{t,q})+(n-1-\omega_{\tau,b,1:q-1})\,f(\boldsymbol{\phi}^{-}-\boldsymbol{\chi}_{0,q},\boldsymbol{\chi}_{t,q})+\sum_{j=1}^{q-1}\omega_{\tau,b,j}\,f(\boldsymbol{\phi}^{-}-\boldsymbol{\chi}_{0,j},\boldsymbol{\chi}_{t,j}).
\end{aligned}

Plugging this into the inner summation of (31) leads to:

\sum_{n=\omega_{\tau,b,1:q-1}+1}^{\omega_{\tau,b,1:q}}p_{\tau,b}(n,k)\,C_{\tau,b}(n,k)=\rho_{\tau,b}(q,k)\Bigl[\beta_{\tau,b}(q,k)\sum_{j=0}^{\omega_{\tau,b,q}-1}\pi_{t}^{j}(q,0)+f(\boldsymbol{\phi}^{-}-\boldsymbol{\chi}_{0,q},\boldsymbol{\chi}_{t,q})\sum_{j=0}^{\omega_{\tau,b,q}-1}j\,\pi_{t}^{j}(q,0)\Bigr]. \qquad (32)

The summations in (32) can be computed explicitly, leading to the following closed-form expression:

\begin{aligned}
&\beta_{\tau,b}(q,k)\sum_{j=0}^{\omega_{\tau,b,q}-1}\pi_{t}^{j}(q,0)+f(\boldsymbol{\phi}^{-}-\boldsymbol{\chi}_{0,q},\boldsymbol{\chi}_{t,q})\sum_{j=0}^{\omega_{\tau,b,q}-1}j\,\pi_{t}^{j}(q,0)\\
&\quad=\beta_{\tau,b}(q,k)\,\frac{1-\pi_{t}^{\omega_{\tau,b,q}}(q,0)}{1-\pi_{t}(q,0)}+\frac{f(\boldsymbol{\phi}^{-}-\boldsymbol{\chi}_{0,q},\boldsymbol{\chi}_{t,q})}{(1-\pi_{t}(q,0))^{2}}\Bigl[(\omega_{\tau,b,q}-1)\,\pi_{t}^{\omega_{\tau,b,q}+1}(q,0)-\omega_{\tau,b,q}\,\pi_{t}^{\omega_{\tau,b,q}}(q,0)+\pi_{t}(q,0)\Bigr]=\Theta_{\tau,b}(q,k).
\end{aligned} \qquad (33)
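For reference, the closed form above follows from the standard geometric-series identities, evaluated at $x=\pi_t(q,0)$ and $w=\omega_{\tau,b,q}$:

\sum_{j=0}^{w-1} x^{j} = \frac{1-x^{w}}{1-x}, \qquad \sum_{j=0}^{w-1} j\,x^{j} = \frac{(w-1)\,x^{w+1} - w\,x^{w} + x}{(1-x)^{2}}, \qquad x\neq 1.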

Therefore, the Shannon mutual information between a semantic OctoMap $\mathbf{m}$ and a range-category measurement $\mathbf{z}_{\tau,b}$ can be computed as in (15). ∎

References

  • [1] A. Elfes, “Using occupancy grids for mobile robot perception and navigation,” Computer, vol. 22, no. 6, pp. 46–57, 1989.
  • [2] S. Thrun, “Learning occupancy grid maps with forward sensor models,” Autonomous Robots, vol. 15, no. 2, pp. 111–127, 2003.
  • [3] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, “OctoMap: An efficient probabilistic 3D mapping framework based on octrees,” Autonomous Robots, 2013, software available at https://octomap.github.io.
  • [4] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1746–1754.
  • [5] L. Gan, R. Zhang, J. W. Grizzle, R. M. Eustice, and M. Ghaffari, “Bayesian spatial kernel smoothing for scalable dense semantic mapping,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 790–797, 2020.
  • [6] T. Wang, V. Dhiman, and N. Atanasov, “Learning navigation costs from demonstration with semantic observations,” Proceedings of Machine Learning Research vol, vol. 120, pp. 1–11, 2020.
  • [7] A. Milioto and C. Stachniss, “Bonnet: An Open-Source Training and Deployment Framework for Semantic Segmentation in Robotics using CNNs,” in IEEE International Conference on Robotics and Automation (ICRA), 2019.
  • [8] B. J. Julian, S. Karaman, and D. Rus, “On mutual information-based control of range sensing robots for mapping applications,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2013, pp. 5156–5163.
  • [9] B. Charrow, S. Liu, V. Kumar, and N. Michael, “Information-theoretic mapping using cauchy-schwarz quadratic mutual information,” in IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 4791–4798.
  • [10] Z. Zhang, T. Henderson, V. Sze, and S. Karaman, “FSMI: Fast Computation of Shannon Mutual Information for Information-Theoretic Mapping,” in IEEE International Conference on Robotics and Automation (ICRA), 2019, pp. 6912–6918.
  • [11] A. Asgharivaskasi and N. Atanasov, “Active bayesian multi-class mapping from range and semantic segmentation observations,” in IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 1–7.
  • [12] B. Yamauchi, “A frontier-based approach for autonomous exploration,” in IEEE International Symposium on Computational Intelligence in Robotics and Automation, 1997, pp. 146–151.
  • [13] W. Burgard, M. Moors, C. Stachniss, and F. E. Schneider, “Coordinated multi-robot exploration,” IEEE Transactions on Robotics (TRO), vol. 21, no. 3, pp. 376–386, 2005.
  • [14] H. H. González-Baños and J.-C. Latombe, “Navigation strategies for exploring indoor environments,” The International Journal of Robotics Research (IJRR), vol. 21, no. 10-11, pp. 829–848, 2002.
  • [15] C. Gómez, M. Fehr, A. C. Hernández, J. Nieto, R. Barber, and R. Siegwart, “Hybrid Topological and 3D Dense Mapping through Autonomous Exploration for Large Indoor Environments,” in IEEE International Conference on Robotics and Automation (ICRA), 2020.
  • [16] X. Kan, H. Teng, and K. Karydis, “Online exploration and coverage planning in unknown obstacle-cluttered environments,” IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 5969–5976, 2020.
  • [17] R. Maffei, M. P. Souza, M. Mantelli, D. Pittol, M. Kolberg, and V. A. M. Jorge, “Exploration of 3D terrains using potential fields with elevation-based local distortions,” in IEEE International Conference on Robotics and Automation (ICRA), 2020.
  • [18] A. Bircher, M. Kamel, K. Alexis, H. Oleynikova, and R. Siegwart, “Receding horizon “next-best-view” planner for 3d exploration,” in IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 1462–1468.
  • [19] T. Dang, F. Mascarich, S. Khattak, C. Papachristos, and K. Alexis, “Graph-based path planning for autonomous robotic exploration in subterranean environments,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 3105–3112.
  • [20] H. Zhu, C. Cao, Y. Xia, S. Scherer, J. Zhang, and W. Wang, “Dsvp: Dual-stage viewpoint planner for rapid exploration by dynamic expansion,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021, pp. 7623–7630.
  • [21] C. Cao, H. Zhu, H. Choset, and J. Zhang, “Tare: A hierarchical framework for efficiently exploring complex 3d environments.” in Robotics: Science and Systems (RSS), 2021.
  • [22] M. Corah and N. Michael, “Volumetric objectives for multi-robot exploration of three-dimensional environments,” in IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 9043–9050.
  • [23] A. Elfes, “Robot navigation: Integrating perception, environmental constraints and task execution within a probabilistic framework,” in Reasoning With Uncertainty in Robotics, 1995, pp. 93–130.
  • [24] S. J. Moorehead, R. Simmons, and W. L. Whittaker, “Autonomous exploration using multiple sources of information,” in IEEE International Conference on Robotics and Automation (ICRA), 2001, pp. 3098–3103.
  • [25] F. Bourgault, A. A. Makarenko, S. B. Williams, B. Grocholsky, and H. F. Durrant-Whyte, “Information based adaptive robotic exploration,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2002, pp. 540–545.
  • [26] A. Visser and B. Slamet, “Balancing the information gain against the movement cost for multi-robot frontier exploration,” in European Robotics Symposium, 2008, pp. 43–52.
  • [27] T. Kollar and N. Roy, “Efficient optimization of information-theoretic exploration in SLAM,” in AAAI Conference on Artificial Intelligence, 2008, p. 1369–1375.
  • [28] H. Carrillo, I. Reid, and J. A. Castellanos, “On the comparison of uncertainty criteria for active SLAM,” in IEEE International Conference on Robotics and Automation (ICRA), 2012, pp. 2080–2087.
  • [29] L. Carlone, J. Du, M. K. Ng, B. Bona, and M. Indri, “Active SLAM and exploration with particle filters using Kullback-Leibler divergence,” Journal of Intelligent & Robotic Systems, vol. 75, no. 2, pp. 291–311, 2014.
  • [30] N. Atanasov, J. Le Ny, K. Daniilidis, and G. J. Pappas, “Decentralized active information acquisition: Theory and application to multi-robot SLAM,” in IEEE International Conference on Robotics and Automation (ICRA), 2015, pp. 4775–4782.
  • [31] J. Wang, T. Shan, and B. Englot, “Virtual maps for autonomous exploration with pose SLAM,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2019.
  • [32] C. Stachniss, G. Grisetti, and W. Burgard, “Information gain-based exploration using Rao-Blackwellized particle filters,” in Robotics: Science and Systems, 2005, pp. 65–72.
  • [33] T. Henderson, V. Sze, and S. Karaman, “An efficient and continuous approach to information-theoretic exploration,” in IEEE International Conference on Robotics and Automation (ICRA), 2020.
  • [34] F. Chen, P. Szenher, Y. Huang, J. Wang, T. Shan, S. Bai, and B. Englot, “Zero-shot reinforcement learning on graphs for autonomous exploration under uncertainty,” in IEEE International Conference on Robotics and Automation (ICRA), 2021, pp. 5193–5199.
  • [35] M. Lodel, B. Brito, A. Serra-Gómez, L. Ferranti, R. Babuška, and J. Alonso-Mora, “Where to look next: Learning viewpoint recommendations for informative trajectory planning,” arXiv preprint arXiv:2203.02381, 2022.
  • [36] E. Zwecher, E. Iceland, S. R. Levy, S. Y. Hayoun, O. Gal, and A. Barel, “Integrating deep reinforcement and supervised learning to expedite indoor mapping,” in IEEE International Conference on Robotics and Automation (ICRA), 2022, pp. 10 542–10 548.
  • [37] H. Zhang, J. Cheng, L. Zhang, Y. Li, and W. Zhang, “H2GNN: Hierarchical-hops graph neural networks for multi-robot exploration in unknown environments,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 3435–3442, 2022.
  • [38] J. Hu, H. Niu, J. Carrasco, B. Lennox, and F. Arvin, “Voronoi-based multi-robot autonomous exploration in unknown environments via deep reinforcement learning,” IEEE Transactions on Vehicular Technology, vol. 69, no. 12, pp. 14 413–14 423, 2020.
  • [39] T. S. Veiga, M. Silva, R. Ventura, and P. U. Lima, “A hierarchical approach to active semantic mapping using probabilistic logic and information reward pomdps,” Proceedings of the International Conference on Automated Planning and Scheduling, vol. 29, no. 1, pp. 773–781, May 2021.
  • [40] V. Suriani, S. Kaszuba, S. R. Sabbella, F. Riccio, and D. Nardi, “S-ave: Semantic active vision exploration and mapping of indoor environments for mobile robots,” in 2021 European Conference on Mobile Robots (ECMR), 2021, pp. 1–8.
  • [41] H. Blum, S. Rohrbach, M. Popovic, L. Bartolomei, and R. Y. Siegwart, “Active learning for UAV-based semantic mapping,” arXiv preprint arXiv:1908.11157, 2019.
  • [42] G. Georgakis, B. Bucher, K. Schmeckpeper, S. Singh, and K. Daniilidis, “Learning to map for active semantic goal navigation,” arXiv preprint arXiv:2106.15648, 2021.
  • [43] T. D. Barfoot, State estimation for robotics.   Cambridge University Press, 2017.
  • [44] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics.   MIT Press Cambridge, 2005.
  • [45] J. E. Bresenham, “Algorithm for computer control of a digital plotter,” IBM Systems Journal, vol. 4, no. 1, pp. 25–30, 1965.
  • [46] H. Samet, “Implementing ray tracing with octrees and neighbor finding,” Computers & Graphics, vol. 13, no. 4, pp. 445–460, 1989.
  • [47] M. Agate, R. L. Grimsdale, and P. F. Lister, “The hero algorithm for ray-tracing octrees,” in Advances in Computer Graphics Hardware IV, 1991, pp. 61–73.
  • [48] K. Saulnier, N. Atanasov, G. J. Pappas, and V. Kumar, “Information theoretic active exploration in signed distance fields,” in IEEE International Conference on Robotics and Automation (ICRA), 2020.
  • [49] A. Asgharivaskasi, S. Koga, and N. Atanasov, “Active mapping via gradient ascent optimization of shannon mutual information over continuous se(3) trajectories,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2022.
  • [50] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, A. Y. Ng et al., “ROS: an open-source robot operating system,” in ICRA workshop on open source software, vol. 3, no. 3.2, 2009, p. 5.
  • [51] A. Censi, “An ICP variant using a point-to-line metric,” in IEEE International Conference on Robotics and Automation (ICRA), 2008, pp. 19–25.
  • [52] P. Chao, “Fully convolutional hardnet for segmentation in pytorch,” https://github.com/PingoLH/FCHarDNet, accessed: 2022-07-01.
  • [53] M. Wigness, S. Eum, J. G. Rogers, D. Han, and H. Kwon, “A RUGD dataset for autonomous navigation and visual perception in unstructured outdoor environments,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, pp. 5000–5007.
  • [54] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [55] S. Song, S. P. Lichtenberg, and J. Xiao, “Sun RGB-D: A RGB-D scene understanding benchmark suite,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 567–576.
  • [56] NVIDIA, “Deep learning nodes for ROS/ROS2,” https://github.com/dusty-nv/ros_deep_learning, accessed: 2021-07-10.
Arash Asgharivaskasi (S’22) is a Ph.D. student in Electrical and Computer Engineering at the University of California San Diego, La Jolla, CA, USA. He obtained a B.S. degree in Electrical Engineering from Sharif University of Technology, Tehran, Iran, and an M.S. degree in Electrical and Computer Engineering from the University of California San Diego, La Jolla, CA, USA. His research focuses on active information acquisition using mobile robots with applications to mapping, security, and environmental monitoring.
Nikolay Atanasov (S’07-M’16) is an Assistant Professor of Electrical and Computer Engineering at the University of California San Diego, La Jolla, CA, USA. He obtained a B.S. degree in Electrical Engineering from Trinity College, Hartford, CT, USA, in 2008 and M.S. and Ph.D. degrees in Electrical and Systems Engineering from the University of Pennsylvania, Philadelphia, PA, USA, in 2012 and 2015, respectively. His research focuses on robotics, control theory, and machine learning, applied to active sensing using ground and aerial robots. He works on probabilistic perception models that unify geometry and semantics and on optimal control and reinforcement learning approaches for minimizing uncertainty in these models. Dr. Atanasov’s work has been recognized by the Joseph and Rosaline Wolf award for the best Ph.D. dissertation in Electrical and Systems Engineering at the University of Pennsylvania in 2015, the best conference paper award at the International Conference on Robotics and Automation in 2017, and the NSF CAREER award in 2021.