
Closing the Loop: Graph Networks to Unify Semantic Objects and Visual Features for Multi-object Scenes

Jonathan J.Y. Kim1,2∗, Martin Urschler1, Patricia J. Riddle1, and Jörg S. Wicker1
1School of Computer Science, University of Auckland, New Zealand   2Callaghan Innovation, New Zealand
[email protected], {martin.urschler,p.riddle,j.wicker}@auckland.ac.nz
*This research was supported by Callaghan Innovation, New Zealand’s Innovation Agency
Abstract

In Simultaneous Localization and Mapping (SLAM), Loop Closure Detection (LCD) is essential to minimize drift when recognizing previously visited places. Visual Bag-of-Words (vBoW) has been the LCD algorithm of choice for many state-of-the-art SLAM systems. It uses a set of visual features to provide robust place recognition but fails to perceive the semantics or spatial relationships between feature points. Previous work has mainly focused on addressing these issues by combining vBoW with semantic and spatial information from objects in the scene. However, it is unable to exploit the spatial information of local visual features and lacks a structure that unifies semantic objects and visual features, limiting the symbiosis between the two components. This paper proposes SymbioLCD2, which creates a unified graph structure to integrate semantic objects and visual features symbiotically. Our novel graph-based LCD system utilizes the unified graph structure by applying a Weisfeiler-Lehman graph kernel with temporal constraints to robustly predict loop closure candidates. Evaluation of the proposed system shows that having a unified graph structure incorporating semantic objects and visual features improves LCD prediction accuracy, illustrating that the proposed graph structure provides a strong symbiosis between these two complementary components. It also outperforms other Machine Learning algorithms such as SVM, Decision Tree, Random Forest, Neural Network and the GNN-based Graph Matching Network. Furthermore, it detects loop closure candidates earlier than state-of-the-art SLAM systems, demonstrating that the extended semantic and spatial awareness from the unified graph structure significantly impacts LCD performance.

Accepted at IROS 2022. © 2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

I Introduction

Over the past three decades, Simultaneous Localization and Mapping (SLAM) has been actively adopted in wide-ranging fields, such as robotics, medical imaging and mobile devices, for simultaneously building a 3D map and finding the pose of a device. With the advent of low-cost cameras capable of obtaining clear images at relatively high framerates, there has been considerable research in visual SLAM, where SLAM is performed using only visual sensors. The two major categories of visual SLAM are indirect SLAM [1] and direct SLAM [2, 3]. Indirect, or feature-based, SLAM converts an image into a sparse set of visual features, such as SIFT [4], SURF [5], or ORB [6], whereas direct SLAM utilizes information from all pixels, such as color and intensities [7, 8], or edges [9].

Figure 1: Basic overview of SymbioLCD2. (a) image taken at time $i$ (b) image taken at time $j$ (c) & (d) semantic objects and visual features are unified into a graph structure (e) GNN-based graph kernel with the subgraph matching algorithm.

In indirect SLAM, localization and mapping rely on visual feature matching between frames. Minor errors during feature matching and additional sensor drift from the device can accumulate over time, creating a substantial amount of drift near the end of the trajectory. Monocular SLAM, where SLAM is performed using a single camera, suffers from relatively larger drift than stereo SLAM, as it lacks a second camera for image rectification. This paper focuses on monocular SLAM, as addressing drift is most crucial there. Loop Closure Detection (LCD), also known as visual place recognition, is utilized by visual SLAM systems to minimize drift in the trajectory. Loop closure requires a device to return to a previously visited location and recognize features obtained earlier to accurately determine and minimize drift. However, LCD is a challenging task due to a phenomenon known as perceptual aliasing, where features from two different locations may appear similar, making it difficult to ascertain whether a location has been visited previously.

Many SLAM systems [1, 10] utilize a visual Bag-of-Words (vBoW) algorithm [11] to perform loop closure. A common vBoW algorithm uses an offline dictionary of visual words created by encoding and clustering the features derived from a large number of images. It finds similarity in a scene by comparing the current set of visual features against feature sets found in its dictionary. The vBoW algorithm is fast and robust to partial occlusions, but most spatial information between feature points is lost during encoding and clustering.
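To make the idea concrete, the sketch below builds a toy visual-word dictionary by clustering descriptors and scores two images by histogram intersection. This is a minimal illustration of the vBoW principle, not the DBoW2 implementation; the descriptor dimensionality, cluster count and random data are placeholder assumptions.

```python
# Toy vBoW sketch (illustrative only, not DBoW2): cluster training descriptors
# into a dictionary of visual words, then compare images via word histograms.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_descriptors = rng.random((5000, 32))        # stand-in for ORB descriptors
dictionary = KMeans(n_clusters=64, n_init=10).fit(train_descriptors)

def bow_histogram(descriptors):
    """Quantize descriptors to visual words; return an L1-normalized histogram."""
    words = dictionary.predict(descriptors)
    hist = np.bincount(words, minlength=64).astype(float)
    return hist / hist.sum()

def bow_score(desc_a, desc_b):
    """Histogram intersection: higher means the two scenes share more words."""
    ha, hb = bow_histogram(desc_a), bow_histogram(desc_b)
    return float(np.minimum(ha, hb).sum())
```

Note that the spatial arrangement of the features never enters the score, which is exactly the limitation discussed above.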

Previous work by Kim et al. [12] and Zhang et al. [13] focused on mitigating the drawbacks of vBoW by combining or replacing it with semantic and spatial information from objects extracted by a Convolutional Neural Network (CNN). However, these approaches utilized only the spatial relationships between semantic objects, failing to exploit the additional, potentially complementary spatial information of local visual features. They also lacked a structure that could unify semantic objects and visual features, restricting the symbiosis between the two components.

We propose to extend the semantic and spatial awareness of a scene by connecting semantic objects and visual features into a unified graph structure. Graph structures have become a popular medium for learning algorithms, as their non-linear data structure can represent both data and their relationships. By connecting semantic objects and their surrounding visual features, it is possible to utilize spatial information of visual features and semantic objects simultaneously, thus extending the number of spatial relationships in a scene. The resulting graph structure can then be used to learn and predict loop closure candidates using a Graph Neural Network (GNN) [14]. We propose to use a GNN with a dedicated graph kernel, which can predict the similarity of a pair of graphs by performing attributed subgraph matching [15]. The advantages of our proposed unified graph structure are as follows:

  • Semantic objects and visual features are combined into a single uniform framework, improving the symbiosis between the two components during both the learning and prediction phases of loop closure candidate detection.

  • Spatial relationships in a scene can be extended by incorporating the spatial information of visual features. By connecting semantic objects and their surrounding visual features, we vastly increase the number of spatial relationships that can be extracted from the scene.

  • Semantic information of the object can be extended by sharing the semantic label with its surrounding visual features. A semantic object, such as a cup, is no longer just a label, as it also incorporates distinct relational connections from its surrounding visual features, thereby extending the semantic information of the scene.

This paper presents SymbioLCD2, a novel graph-based loop closure detection system that uses a unified graph structure and a GNN with graph kernel for a robust loop closure candidate detection. It combines the strengths of both semantic objects and visual features symbiotically by transforming them into a unified multi-tier graph structure. Figure 1 shows the basic framework, and Figure 2 shows a detailed overview of our proposed SymbioLCD2.

The main contributions of this paper are as follows:

  1. A novel unified graph structure combining semantic objects and visual features symbiotically, providing improved semantic and spatial awareness of a scene.

  2. A multi-tiered graph formation algorithm with object anchors.

  3. A novel loop closure detection system using a graph kernel with temporal constraints for robust and accurate loop closure candidate prediction.

  4. Better precision and recall than other ML algorithms.

  5. Earlier detection of loop closure candidates than state-of-the-art SLAM systems.

This paper is organized as follows: Section II reviews related work, Section III describes our proposed method, Section IV presents experiments and their results, and Section V presents the conclusion and future work.

Figure 2: Detailed overview of SymbioLCD2. (a) visual feature extraction and generating a vBoW score (b) semantic object extraction from CNN and object filtering based on their location and size (c) normalized semantic label matching (d) multi-tier graph formation using semantic objects as main anchors - object and feature information gets transferred as node features and distances between objects and features become edge features (e) graph kernel with the subgraph matching algorithm - temporal constraints are added to penalize frames that are close to each other.

II Related Work

This section reviews feature descriptors, visual Bag-of-Words, CNN-derived descriptors, and Graph Neural Network algorithms.

Feature descriptors Feature descriptors [16], such as SURF [5] and SIFT [4], are algorithms that extract visual features from input images. They have been widely used in feature-based SLAM systems, but more recently binary descriptors, such as ORB [6] and BRIEF [17], have been adopted in various state-of-the-art SLAM systems due to their computational advantage. In general, traditional feature descriptors offer better accuracy, but binary descriptors are faster and more efficient [18].

Visual Bag-of-Words Visual Bag-of-Words are commonly found in loop closure detection modules in several state-of-the-art SLAM systems, such as ORB-SLAM [1], LSD-SLAM [3] and DynaSLAM [10]. There are two distinct vBoW approaches. FAB-MAP [19] learns a generative appearance model using Chow-Liu tree, and it can perform fast online loop closure detection due to the algorithm’s linear complexity. DBoW2 [20], on the other hand, creates a tree-structured offline dictionary by training on a large set of image datasets. DBoW2 compares a set of features extracted by feature descriptors against its dictionary to calculate the co-occurrence score.

CNN-derived descriptors for SLAM With recent advances in CNNs, there has been significant interest in using CNNs to extract semantic objects from images and videos. Object instance segmentation algorithms such as Faster R-CNN [21], Mask R-CNN [22], and YOLO [23] can extract semantic and spatial information of objects from a scene and generate semantic labels, bounding boxes and object masks. CNN-extracted objects have been successfully used to replace visual features [24] in SLAM systems such as NodeSLAM [25] and QuadricSLAM [26].

Graph Neural Networks with Graph Kernel Graph-structured data can be an effective format for encoding relational spatial structures in a scene. In recent years, GNNs have been widely adopted for efficiently learning relational representations of graph-structured data [27]. GNN models are commonly used for graph classification [28], where they can predict the similarity of a graph pair by producing graph embeddings in vector spaces for efficient similarity reasoning [14, 29]. Graph kernels have recently emerged as a promising approach for graph classification, as they allow kernelized learning algorithms, such as SVM and Weisfeiler-Lehman, to perform attributed subgraph matching, achieving state-of-the-art performance on graph classification tasks [30]. The Weisfeiler-Lehman (WL) graph kernel uses a distinctive labelling scheme to extract subgraph patterns through multiple iterations. It replaces the label of each node with a label consisting of its original label and the subset of labels of its direct neighbours, then compresses it to form a new label. After the relabeling procedure, it calculates the similarity of two graphs as the inner product of the histogram vectors of both graphs [15].
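To illustrate the relabeling scheme, the following sketch runs WL iterations on a NetworkX graph and compares two graphs via histogram inner products. It is a simplified rendition under stated assumptions: nodes carry a "label" attribute, and Python's hash stands in for the label-compression function.

```python
# Simplified Weisfeiler-Lehman relabeling sketch (assumes node attr "label").
import networkx as nx
from collections import Counter

def wl_iteration(G, labels):
    """One WL step: compress each node's label with its sorted neighbour labels."""
    return {v: hash((labels[v], tuple(sorted(labels[u] for u in G.neighbors(v)))))
            for v in G.nodes()}

def wl_histograms(G, h):
    """Label histograms after 0..h iterations."""
    labels = {v: G.nodes[v]["label"] for v in G.nodes()}
    hists = [Counter(labels.values())]
    for _ in range(h):
        labels = wl_iteration(G, labels)
        hists.append(Counter(labels.values()))
    return hists

def wl_kernel(G1, G2, h=3):
    """Similarity as the sum of histogram inner products over all iterations."""
    h1, h2 = wl_histograms(G1, h), wl_histograms(G2, h)
    return sum(sum(c1[lab] * c2[lab] for lab in c1) for c1, c2 in zip(h1, h2))
```

Because hash maps equal label tuples to equal values within a run, two subgraphs receive the same compressed label only if their multi-set labels agree, mirroring the relabeling described above.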

We combine semantic objects extracted by Mask R-CNN and visual features extracted by the ORB descriptor into a unified graph structure to enhance the symbiosis between the two components, and perform graph classification using the WL graph kernel.

III Proposed Method

The first three parts of our method (III-A, III-B and III-C) are identical to the work presented in [12]. Our proposed method proceeds as follows. First, visual features are extracted by the ORB feature descriptor and the vBoW score is generated using DBoW2. Second, semantic objects are extracted using Mask R-CNN and then filtered based on their semantic labels and size. Third, the filtered objects are transferred into a matrix, and the semantic and spatial information in the matrix is projected onto a normalized plane for scale-invariant label distance matching. Fourth, objects in the matrix are connected to form the main anchors of the graph. Any visual features within the bounding box of an anchor object are assigned the semantic label of the object and connected to their anchor object to form a two-tier graph. We then extend the graph to include surrounding visual features within $x$ pixels of the edges of the bounding box to complete a three-tier graph. Lastly, the combined graph is transferred to a WL graph kernel with temporal constraints for loop closure candidate prediction. The overview of our proposed method is shown in Figure 2; the reader may refer to [12] for further details on (a), (b) and (c).

III-A Framework, visual feature extraction and generating a vBoW score

SymbioLCD2 is built upon SymbioSLAM2 as its operating framework, which is based on DynaSLAM [10] and incorporates the ORB feature descriptor, DBoW2 and Mask R-CNN. We use the ORB feature descriptor to extract visual features and DBoW2 to compute the vBoW score.

III-B Object extraction and filtering process

Semantic objects are extracted using Mask R-CNN and then filtered based on their semantic labels. Any objects with labels that correspond to movable entities, e.g. person, car or bicycle, are filtered out. This filtering process reduces unintended loop closures on objects that are not stationary, improving the robustness of loop closure. Additionally, any objects that are too large and overlap with other objects in the scene are also filtered out.
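A minimal sketch of this filtering step is shown below; the movable-label set follows the examples above, while the bounding-box representation and the size threshold are illustrative assumptions, since the exact threshold is not specified here.

```python
# Filter CNN detections: drop movable classes and oversized detections.
MOVABLE_LABELS = {"person", "car", "bicycle"}     # example movable entities

def filter_objects(detections, frame_area, max_area_ratio=0.5):
    """detections: list of dicts with 'label' and 'bbox' = (x1, y1, x2, y2)."""
    kept = []
    for det in detections:
        if det["label"] in MOVABLE_LABELS:
            continue                              # not stationary: skip
        x1, y1, x2, y2 = det["bbox"]
        if (x2 - x1) * (y2 - y1) > max_area_ratio * frame_area:
            continue                              # too large, likely overlapping
        kept.append(det)
    return kept
```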

III-C Semantic label matching in a scale invariant normalized plane

Aligning and matching object labels based on their raw distances from each other can cause errors due to changes in scale and viewpoint between frames. By projecting object representations onto a normalized plane, the distances between a pair of objects become scale-invariant, helping the label matching algorithm identify matches accurately. It also enables the label matching algorithm to tolerate up to 40% missing or misclassified labels, making it cope better with detection errors originating from the preceding object extraction process.
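One way to realize such a projection is sketched below, under assumed inputs (bounding-box detections and frame dimensions): dividing centroid coordinates by the frame size maps objects onto a unit plane, so pairwise distances no longer depend on image resolution.

```python
# Sketch: project object centroids onto a normalized (unit) plane and compute
# pairwise distances for scale-invariant label matching. Inputs are assumed.
import numpy as np

def normalized_centroids(detections, width, height):
    """Map bounding-box centres from pixel coordinates to the unit square."""
    pts = []
    for det in detections:
        x1, y1, x2, y2 = det["bbox"]
        pts.append(((x1 + x2) / 2.0 / width, (y1 + y2) / 2.0 / height))
    return np.array(pts)

def distance_matrix(pts):
    """Pairwise Euclidean distances between normalized object centres."""
    diff = pts[:, None, :] - pts[None, :, :]
    return np.linalg.norm(diff, axis=-1)
```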

III-D Graph structure formation - three-tier graph formation with central object anchor nodes

We have designed our graph structure with three tiers. This three-tier formation provides a stable structure, which helps to improve the accuracy of loop closure candidate prediction. Figure 2(d) shows an overview of our graph structure formation, and Figure 3 shows an example of a three-tier graph represented in SLAM and in NetworkX [31].

The first-tier connects objects to form central nodes that act as anchors for visual features to connect onto in the upper tiers. The anchor allows the graph to extend the semantic information of the object by propagating it to the overall graph structure. The object information and the vBoW score get assigned as node features. The normalized distance between each pair of objects is assigned as the edge feature.

The second-tier connects visual features within bounding boxes to each anchor node, i.e. the bounding box's object node. This multi-tier connection provides extra symbiosis in the graph structure. First, the semantic object gets encoded with positional information from its surrounding visual features, extending its spatial information. Second, visual features are assigned the label of the semantic object, extending the semantic awareness of the scene: the visual features can now be associated with semantic objects and obtain semantic information through their connections, allowing them to be more than just feature points in the background. The vBoW score and the semantic label of its anchor node are assigned as node features. The normalized distance between each visual feature and its anchor node is assigned as the edge feature.

The third-tier extends the graph to include surrounding visual features within $x$ pixels of the edges of the bounding box. For the experiments, $x$ was set to 25 pixels to minimize interference with neighbouring objects. The third tier allows the graph to include visual features from the object's background, further enhancing the spatial awareness of the scene. Features in the third tier are not assigned semantic labels, since they are not part of the semantic object, but they help the graph kernel differentiate semantic objects when multiple similar objects are present in the scene.
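The sketch below illustrates this three-tier formation in NetworkX; the attribute names and helper inputs are illustrative assumptions rather than the exact implementation.

```python
# Sketch of three-tier graph formation with object anchors (illustrative).
import math
import networkx as nx

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def inside(p, bbox):
    x1, y1, x2, y2 = bbox
    return x1 <= p[0] <= x2 and y1 <= p[1] <= y2

def expand(bbox, margin):
    x1, y1, x2, y2 = bbox
    return (x1 - margin, y1 - margin, x2 + margin, y2 + margin)

def build_three_tier_graph(objects, features, x=25):
    """objects: [{'label','bbox','centre'}]; features: [{'pos','vbow'}]."""
    G = nx.Graph()
    # Tier 1: semantic objects become fully connected anchor nodes.
    for i, obj in enumerate(objects):
        G.add_node(("obj", i), label=obj["label"])
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            G.add_edge(("obj", i), ("obj", j),
                       distance=dist(objects[i]["centre"], objects[j]["centre"]))
    # Tiers 2 and 3: attach each visual feature to a matching anchor object.
    for k, feat in enumerate(features):
        for i, obj in enumerate(objects):
            if inside(feat["pos"], obj["bbox"]):
                # Tier 2: feature inside the bbox inherits the object's label.
                G.add_node(("feat", k), label=obj["label"], vbow=feat["vbow"])
            elif inside(feat["pos"], expand(obj["bbox"], x)):
                # Tier 3: background feature within x pixels, no label shared.
                G.add_node(("feat", k), label=None, vbow=feat["vbow"])
            else:
                continue
            G.add_edge(("feat", k), ("obj", i),
                       distance=dist(feat["pos"], obj["centre"]))
            break                                 # one anchor per feature here
    return G
```

The `break` keeps the sketch simple by assigning each feature to a single anchor; in practice the distances would be normalized as in III-C before being stored as edge features.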

Figure 3: Three-tier graph connection (a) graph representation in SLAM (b) first-tier connections are shown in red, second-tier connections are in yellow and third-tier connections are in magenta (c) graph representation in NetworkX with normalized distances.
Figure 4: Evaluation datasets. (a) fr2-desk (b) fr3-longoffice (c) uoa-lounge (d) uoa-kitchen (e) uoa-garden

III-E LCD prediction using GNN - Weisfeiler-Lehman subgraph kernel with temporal constraints

The three-tier graph structure, with combined information from semantic objects and visual features, is transferred to the graph kernel LCD predictor. We use the Weisfeiler-Lehman (WL) subgraph algorithm [15] as our GNN algorithm, with a vertex histogram as its base kernel and a multi-layer perceptron at the end for loop closure candidate detection.

The WL subgraph kernel takes a pair of graphs, i.e. a graph from the current frame and a graph from the reference frame, and computes their similarity. Unlike typical GNN embedding models [27, 29], WL subgraph matching computes the similarity score jointly on the pair, rather than first mapping each graph to a set of graph embeddings. The key idea of the WL subgraph matching algorithm is to substitute the label of each node with a label that includes its original label and the subset of labels from its neighbours, then compress it to form a new multi-set label. This procedure is repeated for $h$ iterations and is performed on the pair of graphs simultaneously; therefore, any two groups of vertices, or subgraphs, get identical new labels only if they have identical multi-set labels.

Given two graphs $G^{i}$ and $G^{j}$, with $i$ and $j$ the indices of the two images being compared,

k_{\text{WL}}(G^{i},G^{j}) = k(G^{i}_{0},G^{j}_{0}) + k(G^{i}_{1},G^{j}_{1}) + \dots + k(G^{i}_{h},G^{j}_{h}),   (1)

where $h$ is the number of iterations and $k$ is the base kernel for WL, in our case the vertex histogram kernel.

For finding similarity using subgraph information, let $\Sigma_{0}$ be the set of original node labels of $G^{i}$ and $G^{j}$, and $c_{h}(G^{ij},\sigma_{ij})$ be the number of occurrences of the label $\sigma_{ij}$ in the graphs $G^{i}$ and $G^{j}$ after $h$ iterations.

Then the Weisfeiler-Lehman subgraph kernel can be defined as:

k_{\text{WL}sg}(G^{i},G^{j}) = \left\langle \theta^{h}_{\text{WL}sg}(G^{i}),\, \theta^{h}_{\text{WL}sg}(G^{j}) \right\rangle,   (2)

where

\theta^{h}_{\text{WL}sg}(G^{i}) = \left( c_{0}(G^{i},\sigma_{0|\Sigma_{0}|}),\, \dots,\, c_{h}(G^{i},\sigma_{h|\Sigma_{h}|}) \right),   (3)

and

\theta^{h}_{\text{WL}sg}(G^{j}) = \left( c_{0}(G^{j},\sigma_{0|\Sigma_{0}|}),\, \dots,\, c_{h}(G^{j},\sigma_{h|\Sigma_{h}|}) \right).   (4)

Since SLAM uses a continuous stream of images as its input, adjacent images in the sequence can look very similar to each other. Therefore, we use a modified temporal similarity constraint first proposed by Zhang et al. [13],

T_{c}(i,j) = \ln\!\left(\frac{\beta_{s} v_{c}^{2}}{f_{c}}\,(i-j)^{2}\right),   (5)

where $v_{c}$ is the velocity of the camera, $\beta_{s}$ is a constant parameter, and $f_{c}$ is the frame rate. To simplify, we set $v_{c}$ and $f_{c}$ to be constant, giving us

T_{c}(i,j) = \ln\!\left(\beta_{s}\,(i-j)^{2}\right),   (6)

where

\beta_{s} \in (0,1).   (7)

We combine the temporal constraints with the WL subgraph kernel to create a similarity equation $S(i,j)$, written as

S(i,j) = k_{\text{WL}sg}(G^{i},G^{j}) - \alpha\, T_{c}(i,j),   (8)

where $\alpha \in (0,100)$ is a parameter controlling the weight of $T_{c}(i,j)$. When frames $i$ and $j$ are close, $T_{c}(i,j)$ is significant, decreasing the WL similarity score. As $i$ and $j$ move further apart, the value of the constraint continues to decrease until it becomes negligible, no longer affecting the similarity score.
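A hedged sketch of this combined score, using GraKeL's WeisfeilerLehman kernel with a VertexHistogram base kernel, is given below. The default values of h, alpha and beta_s here are illustrative assumptions (see Table I for the values used in our experiments), and the inputs are assumed to be GraKeL graph objects built from the three-tier structure, e.g. via grakel.utils.graph_from_networkx.

```python
# Sketch of S(i, j) = k_WLsg(G^i, G^j) - alpha * Tc(i, j)  (Eqs. 6 and 8).
import math
from grakel.kernels import WeisfeilerLehman, VertexHistogram

def temporal_constraint(i, j, beta_s=0.3):
    """Eq. (6): Tc(i, j) = ln(beta_s * (i - j)^2); assumes i != j."""
    return math.log(beta_s * (i - j) ** 2)

def similarity(graph_i, graph_j, i, j, h=2, alpha=2.0):
    """Eq. (8): WL subgraph similarity penalized by the temporal term."""
    wl = WeisfeilerLehman(n_iter=h, base_graph_kernel=VertexHistogram,
                          normalize=True)
    k = wl.fit_transform([graph_i, graph_j])[0, 1]   # cross-similarity entry
    return k - alpha * temporal_constraint(i, j)
```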

IV Experiments

We have evaluated SymbioLCD2 with the following experiments. Section IV-A shows evaluation parameters and datasets used in the experiments. Section IV-B performs ablation studies, evaluating symbiosis between semantic objects and visual features in our proposed three-tier graph. Section IV-C evaluates our graph-based LCD predictor against other ML algorithms. Section IV-D compares the performance of LCD candidate detection against state-of-the-art SLAM systems.

IV-A Datasets

To evaluate SymbioLCD2, we selected five publicly available datasets with multiple objects and varying camera trajectories: fr2-desk and fr3-longoffice from the TUM dataset [32], and uoa-lounge, uoa-kitchen and uoa-garden from the University of Auckland multi-objects dataset [12]. Table I shows the evaluation parameters and Figure 4 shows examples from each dataset. All experiments were performed on a PC with an Intel i9 7900X and an Nvidia GTX1080Ti.

TABLE I: The evaluation parameters
x  β_s  α  h  edge dim.  node dim.  base kernel
25 0.3 2 50 4 8 Vertex Histogram
TABLE II: Evaluation against other ML algorithms
SVM-RBF DecisionTree NeuralNetwork SymbioLCD GraphMatchingNet SymbioLCD2
Dataset Precision Recall Precision Recall Precision Recall Precision Recall Precision Recall Precision Recall
fr2-desk 90.91 83.33 100.00 75.38 76.92 83.33 92.31 100.00 91.91 92.36 96.77 97.27
fr3-longoffice 67.74 77.78 95.24 76.92 66.67 88.89 96.30 94.43 92.86 90.66 97.22 97.81
uoa-lounge 85.00 83.95 93.55 77.33 86.67 96.30 98.77 97.57 94.88 93.76 99.35 98.34
uoa-kitchen 89.73 83.44 93.16 77.53 91.07 97.45 99.32 93.63 94.05 96.42 98.92 97.73
uoa-garden 89.04 86.28 94.90 79.01 90.20 97.79 99.54 95.13 95.08 96.77 99.16 98.99
Average 84.48 82.96 95.37 77.24 82.31 92.75 97.25 96.16 93.76 93.99 98.28 98.03

IV-B Ablation studies - evaluating symbiosis between semantic objects and visual features

We have performed ablation studies to analyze two aspects of our proposed method: first, measuring the symbiosis between semantic objects and visual features by comparing our graph against a graph composed of only visual features; second, determining whether the extended spatial information from the third tier of the graph, i.e. the surrounding visual features from the background, improves prediction accuracy. For this experiment, we created a flat structured graph with just visual features and a two-tier graph with no visual features from the background to compare against our three-tier graph structure. The evaluation was measured using precision and recall metrics, defined as

\text{Precision} = \frac{TP}{TP+FP} \quad \& \quad \text{Recall} = \frac{TP}{TP+FN},   (9)

where TP refers to true positives, FP to false positives and FN to false negatives. Figure 5 and Table III show that the two-tier graph with semantic objects outperforms the flat graph, with improvements of 15.44% in precision and 9.5% in recall. This demonstrates that having semantic objects as part of the structure improves prediction performance. The results also show that the three-tier structure, which includes surrounding visual features from the background, improves prediction performance by 2.45% in precision and 3.77% in recall over the two-tier graph, demonstrating that the extended spatial awareness from third-tier visual features contributes to the improvement. Overall, the results show that the proposed framework effectively exploits the strong symbiotic relationship, with both semantic objects and visual features contributing to higher LCD prediction accuracy.

IV-C Evaluating LCD keyframe prediction against other ML algorithms

We benchmarked SymbioLCD2 against five other Machine Learning algorithms: the GNN-based Graph Matching Network (GMN), SVM-RBF, Decision Tree, a Neural Network with 4x16 layers, and SymbioLCD with a Random Forest classifier. We used the multi-tier graph structure to evaluate SymbioLCD2 and GMN, but used tabular data (object labels, normalized label matching, Hausdorff distance and vBoW scores) for the other algorithms, as they cannot take a graph structure as input. In this evaluation, we measured the performance of each algorithm in predicting keyframes of loop closure candidates, using precision and recall metrics.

TABLE III: Ablation study - Flat vs. Two-tier vs. Three-tier
Flat Two-tier Three-tier
Dataset Preci. Recall Preci. Recall Preci. Recall
fr2-desk 87.92 85.33 93.74 94.20 96.77 97.27
fr3-longoffice 79.12 80.28 94.71 92.47 97.22 97.81
uoa-lounge 77.23 86.32 96.77 95.63 99.34 98.34
uoa-kitchen 86.23 87.08 95.93 93.34 98.92 97.73
uoa-garden 71.45 84.83 96.98 95.70 99.16 98.99
Average 80.39 84.76 95.83 94.26 98.28 98.03
Figure 5: Ablation study - Precision and Recall graphs for Flat (visual features only) vs. Two-tier vs. Three-tier
TABLE IV: Autorank analysis (Precision)
MR MED MAD CI γ Mag.
SymLCD2 1.50 98.60 0.97 [98.28, 98.92] 0.0 neg.
SymLCD 1.83 98.01 2.11 [97.25, 98.77] 0.3 small
DecTree 3.16 95.07 1.34 [94.90, 95.24] 3.0 large
GMN 3.50 93.90 1.49 [93.76, 94.05] 3.7 large
SVM-RBF 5.50 87.02 3.89 [85.00, 89.04] 4.0 large
NeuralNet 5.50 84.49 9.11 [82.31, 86.67] 2.1 large
TABLE V: Autorank analysis (Recall)
MR MED MAD CI γ Mag.
SymLCD2 1.16 97.92 0.45 [97.81, 98.03] 0.0 neg.
SymLCD 2.50 95.64 2.32 [95.13, 96.16] 1.3 large
GMN 3.16 93.87 3.01 [93.76, 93.99] 1.8 large
NeuralNet 3.25 94.52 4.58 [92.75, 96.30] 1.0 large
SVM-RBF 4.91 83.38 0.73 [83.33, 83.44] 2.8 large
DecTree 6.00 77.28 0.45 [77.24, 77.33] 4.6 large
Figure 6: Critical Difference diagram (Precision)
Figure 7: Critical Difference diagram (Recall)

The evaluation was performed 50 times to account for the non-deterministic nature of the algorithms. Table II presents the precision and recall values of SymbioLCD2 against other algorithms in each dataset. The result shows that, on average, SymbioLCD2 achieved the highest precision and recall compared to other algorithms.

To examine the statistical robustness of the results, we used Autorank [33] to analyze each algorithm's performance further. Autorank is an automated classifier ranking algorithm based on the guidelines proposed by Demšar [34]. It uses paired samples that are independent of each other and determines differences in central tendency, such as the median (MED), mean rank (MR) and median absolute deviation (MAD), to rank each algorithm. Table IV and Figure 6 show that SymbioLCD2 ranked highest against the other ML algorithms in precision, and Table V and Figure 7 show that SymbioLCD2 ranked highest in recall.
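A minimal sketch of this analysis step is shown below; the synthetic values stand in for the 50 recorded precision scores per algorithm and are placeholders, not our measured data.

```python
# Sketch of the Autorank analysis: columns are algorithms, rows are repeated runs.
import numpy as np
import pandas as pd
from autorank import autorank, create_report

rng = np.random.default_rng(0)
data = pd.DataFrame({               # placeholder values, not measured results
    "SymbioLCD2": rng.normal(98.3, 1.0, 50),
    "SymbioLCD":  rng.normal(97.3, 2.0, 50),
    "GMN":        rng.normal(93.8, 1.5, 50),
})
result = autorank(data, alpha=0.05, verbose=False)  # ranks by central tendency
create_report(result)     # reports MR, MED, MAD, CI and effect size per method
```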

IV-D Evaluating LCD keyframe prediction against other SLAM systems

To evaluate the performance of SymbioLCD2, we benchmarked it against the state-of-the-art ORB-SLAM2, DynaSLAM and SymbioLCD. We chose these three systems for benchmarking as they all share the same vBoW and keyframe insertion algorithms. In this evaluation, we recorded the earliest keyframe number at which a loop closure candidate was detected. We performed the evaluation five times and recorded the median value to account for the non-deterministic nature of the systems. The results in Table VI show that SymbioLCD2 outperformed ORB-SLAM2, DynaSLAM and SymbioLCD in all datasets. On average, SymbioLCD2 outperformed ORB-SLAM2 by 18.2 frames (3.98%), DynaSLAM by 22.4 frames (6.18%) and SymbioLCD by 3.2 frames (1.10%). This demonstrates that the extended semantic and spatial awareness contributes to earlier detection of loop closure candidates.

TABLE VI: Comparisons of loop closure detected keyframe.
Dataset ORB SLAM2 (kf) Dyna SLAM (kf) Symbio LCD (kf) Symbio LCD2 (kf)
fr2-desk 393 397 388 386
fr3-longoffice 345 349 314 312
uoa-lounge 301 303 284 279
uoa-kitchen 410 416 392 385
uoa-garden 454 459 450 446

V Conclusion and Future Work

We presented SymbioLCD2, a graph-based loop closure detection system that uses a graph kernel with temporal constraints for robust and accurate loop closure candidate prediction. We presented a three-tier graph structure with object anchors that connects semantic objects and visual features to extend the semantic and spatial awareness of the scene. We showed that our unified graph structure makes effective use of the strong symbiosis between semantic objects and visual features, with both components contributing to improved loop closure candidate prediction. SymbioLCD2 outperformed other ML algorithms in both precision and recall, and detected loop closure candidates earlier than state-of-the-art SLAM systems. SymbioLCD2 requires multiple static objects in a scene to be most effective. For future research, we aim to extend SymbioLCD2 to work with both static and dynamic objects by utilizing 3D components from a stereo or depth camera.

References

  • [1] R. Mur-Artal and J. Tardos, “Orb-slam2: an open-source slam system for monocular, stereo and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 10 2016.
  • [2] T. Whelan, S. Leutenegger, R. Moreno, B. Glocker, and A. Davison, “Elasticfusion: Dense slam without a pose graph,” The International Journal of Robotics Research, vol. 25, 2015.
  • [3] J. Engel, T. Schoeps, and D. Cremers, “Lsd-slam: large-scale direct monocular slam,” European Conference on Computer Vision, vol. 8690, pp. 1–16, 09 2014.
  • [4] D. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, pp. 91–110, 11 2004.
  • [5] H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features.” in Computer Vision and Image Understanding, vol. 110, 01 2006, pp. 404–417.
  • [6] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: an efficient alternative to sift or surf,” in Proceedings of the IEEE International Conference on Computer Vision, 11 2011, pp. 2564–2571.
  • [7] P. Torr and A. Zisserman, “Feature based methods for structure and motion estimation,” in International Workshop on Vision Algorithms, 2000, pp. 278–294.
  • [8] T. Tykkälä and A. Comport, “A dense structure model for image based stereo slam,” in IEEE International Conference on Robotics and Automation, 06 2011, pp. 1758 – 1763.
  • [9] F. Schenk and F. Fraundorfer, “Reslam: A real-time robust edge-based slam system,” in International Conference on Robotics and Automation, 2019, pp. 154–160.
  • [10] B. Bescos, J. Facil, J. Civera, and J. Neira, “Dynaslam: Tracking, mapping and inpainting in dynamic scenes,” IEEE Robotics and Automation Letters, vol. 3, pp. 4076 – 4083, 2018.
  • [11] A. Angeli, D. Filliat, S. Doncieux, and J. Meyer, “A fast and incremental method for loop-closure detection using bags of visual words,” in Robotics, IEEE Transactions on, vol. 24, 11 2008, pp. 1027 – 1037.
  • [12] J. Kim, M. Urschler, P. Riddle, and J. Wicker, “Symbiolcd: Ensemble-based loop closure detection using cnn-extracted objects and visual bag-of-words,” in International Conference on Intelligent Robots and Systems, 2021, p. 5425.
  • [13] X. Zhang, L. Wang, Y. Zhao, and Y. Su, “Graph-based place recognition in image sequences with cnn features,” Journal of Intelligent & Robotic Systems, vol. 95, 08 2018.
  • [14] Y. Li, C. Gu, T. Dullien, O. Vinyals, and P. Kohli, “Graph matching networks for learning the similarity of graph structured objects,” in International Conference on Machine Learning, 2019, pp. 3835–3845.
  • [15] N. Shervashidze, P. Schweitzer, E. J. Van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-lehman graph kernels,” Journal of Machine Learning Research, vol. 12, no. 77, pp. 2539–2561, 2011.
  • [16] L.-W. Kang, C.-Y. Hsu, H.-W. Chen, C.-S. Lu, C.-Y. Lin, and S.-C. Pei, “Feature-based sparse representation for image similarity assessment,” IEEE Transactions on Multimedia, vol. 13, pp. 1019–1030, 10 2011.
  • [17] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, “Brief: Binary robust independent elementary features,” in European Conference on Computer Vision, 2010, pp. 778–792.
  • [18] S. A. K. Tareen and Z. Saleem, “A comparative analysis of sift, surf, kaze, akaze, orb, and brisk,” in International Conference on Computing, Mathematics and Engineering Technologies, 03 2018.
  • [19] M. Cummins and P. Newman, “Fab-map: Probabilistic localization and mapping in the space of appearance,” The International Journal of Robotics Research, vol. 27, pp. 647–665, 06 2008.
  • [20] D. Galvez-Lopez and J. Tardos, “Bags of binary words for fast place recognition in image sequences,” in Robotics, IEEE Transactions on, vol. 28, 10 2012, pp. 1188–1197.
  • [21] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 11 2014.
  • [22] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask r-cnn,” in IEEE International Conference on Computer Vision, Oct 2017.
  • [23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in IEEE International Conference on Computer Vision, 2016.
  • [24] Z. Zeng, Y. Zhou, O. Jenkins, and K. Desingh, “Semantic mapping with simultaneous object detection and localization,” in International Conference on Intelligent Robot Systems, 10 2018, pp. 911–918.
  • [25] E. Sucar, K. Wada, and A. Davison, “Nodeslam: Neural object descriptors for multi-view shape reconstruction,” in International Conference on 3D Vision, 11 2020, pp. 949–958.
  • [26] L. Nicholson, M. Milford, and N. Sünderhauf, “Quadricslam: Dual quadrics from object detections as landmarks in object-oriented slam,” IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2019.
  • [27] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive Representation Learning on Large Graphs,” in Advances in Neural Information Processing Systems, 2017, p. 11.
  • [28] M. Zhang, Z. Cui, M. Neumann, and Y. Chen, “An end-to-end deep learning architecture for graph classification.” in Association for the Advancement of Artificial Intelligence, 2018, pp. 4438–4445.
  • [29] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio, “Graph attention networks,” 6th International Conference on Learning Representations, 2017.
  • [30] G. Siglidis, G. Nikolentzos, S. Limnios, C. Giatsidis, K. Skianis, and M. Vazirgiannis, “Grakel: A graph kernel library in python,” Journal of Machine Learning Research, vol. 21, no. 54, pp. 1–5, 2020.
  • [31] A. A. Hagberg, D. A. Schult, and P. J. Swart, “Exploring network structure, dynamics, and function using networkx,” Python in Science Conference, p. 11–15, 2008.
  • [32] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in International Conference on Intelligent Robot Systems, Oct. 2012.
  • [33] S. Herbold, “Autorank: A python package for automated ranking of classifiers,” Journal of Open Source Software, vol. 5, no. 48, p. 2173, 2020.
  • [34] J. Demsar, “Statistical comparisons of classifiers over multiple data sets,” Journal of Machine Learning Research, vol. 7, pp. 1–30, 01 2006.