
LDL: Line Distance Functions for Panoramic Localization

Junho Kim1, Changwoon Choi1, Hojun Jang1, and Young Min Kim1, 2   
1 Dept. of Electrical and Computer Engineering, Seoul National University
2 Interdisciplinary Program in Artificial Intelligence and INMC, Seoul National University
[email protected], [email protected], {j12040208, youngmin.kim}@snu.ac.kr
Abstract

We introduce LDL, a fast and robust algorithm that localizes a panorama to a 3D map using line segments. LDL focuses on the sparse structural information of lines in the scene, which is robust to illumination changes and can potentially enable efficient computation. While previous line-based localization approaches tend to sacrifice accuracy or computation time, our method effectively observes the holistic distribution of lines within panoramic images and 3D maps. Specifically, LDL matches the distribution of lines with 2D and 3D line distance functions, which are further decomposed along principal directions of lines to increase the expressiveness. The distance functions provide coarse pose estimates by comparing the distributional information, where the poses are further optimized using conventional local feature matching. As our pipeline solely leverages line geometry and local features, it does not require costly additional training of line-specific features or correspondence matching. Nevertheless, our method demonstrates robust performance on challenging scenarios including object layout changes, illumination shifts, and large-scale scenes, while exhibiting fast pose search terminating within a matter of milliseconds. We thus expect our method to serve as a practical solution for line-based localization, and complement the well-established point-based paradigm. The code for LDL is available through the following link: https://github.com/82magnolia/panoramic-localization.

1 Introduction

Figure 1: Overview of our approach. LDL assumes a 3D map equipped with lines and local features, and similarly preprocesses the 2D panorama prior to localization. LDL then selects candidate poses by matching 2D, 3D line distance functions through decomposition along principal directions that effectively represent the sparse geometry of lines. Finally, the selected poses are refined via local feature matching [44] and PnP-RANSAC [15, 29].

Estimating the location of a mobile device or agent with respect to a 3D map, widely referred to as visual localization, has vast applications in robotics and AR/VR. Compared to perspective images, which are more widely used for localization, panorama images provide a $360^{\circ}$ field of view that contains ample visual evidence from the holistic scene context. In this light, there have been recent advances in visual localization using panoramic images [8, 7, 26, 27] that demonstrate reasonably stable localization, with state-of-the-art methods leveraging a two-step process of candidate pose selection and refinement [27, 43]. Nevertheless, many existing methods for this task have limitations in computational efficiency and robustness, mainly stemming from the costly or unstable pose selection process. As global feature descriptors [23, 3] or a large number of colored points [26, 27] are the main components for this step, the pipelines can be memory and compute intensive or fragile to large illumination changes [27, 26].

To overcome such limitations, we explore the alternative direction of using lines as the major cue for panoramic localization. Lines have a number of desirable properties compared to commonly used raw color, semantic labels or learned global features [43, 8, 26]. First, due to the long-standing work in line segment extraction [18, 19, 55, 59], it is cheap and stable to extract line segments even amidst dramatic changes in illumination or moderate motion blur. Second, lines are sparse representations of a scene and can potentially lead to small memory consumption and computation. Nevertheless, line segments alone are visually ambiguous compared to other localization cues (color, global features, etc.), which makes them harder to tailor for successful localization. While there exist prior works in line-based visual localization [57, 33, 16], many focus on using lines for pose refinement after finding coarse poses from conventional global feature comparisons [57, 16] or exhibit unstable performance compared to conventional point-based methods [33]. Further, prior works often involve expensive line-specific feature extraction to distinguish contexts and establish one-to-one line correspondences [57].

LDL is a fast and robust localization method that leverages the holistic context from lines in panoramas and 3D maps to effectively find the camera pose. In contrast to previous works [16, 57], we retain our focus on using line segments for pose search, based on the observation that conventional point-based matching [12, 44] performs stably once given a good initial pose. As shown in Figure 1, given a panoramic image of an unknown location, we utilize the distribution of extracted line segments and compare it against those in the pre-captured 3D map. First, the candidate pose selection step rapidly evaluates an immense set of poses within a matter of milliseconds and selects the coarse poses to further optimize. Here LDL compares the distribution of lines in 2D and 3D evaluated on their spherical projections using distance functions, as shown in Figure 1. The distance function imbues relative spatial context even in featureless regions and quickly matches poses without establishing explicit correspondences between detected lines. We further enhance the discriminative power of distance functions by decomposition, and separately evaluate lines aligned with each principal direction. Once a small set of initial poses is found, LDL refines them with PnP-RANSAC [15, 29], where we leverage powerful local features from recent works [44, 12] to establish good 2D-3D correspondences.

We evaluate LDL in various indoor scenes where it performs competitively against all tested baselines while demonstrating robust performance in scenes with object changes or large illumination shifts. Further, LDL exhibits an order-of-magnitude faster runtime compared to global feature comparison [3, 23, 17] due to the efficient formulation. By only using the geometric information of lines and pre-trained visual features, we expect LDL to serve as a practical localization algorithm that could enhance and complement existing visual localization techniques.

2 Related Work

Line-Based Localization

Inspired by the abundant straight lines and rectangular structures in man-made environments, many works attempt visual localization with line segments [33, 57, 58, 52, 2, 16]. Micusik et al. [33] directly match line segments extracted from the 3D model against line segments in images by comparing the Chamfer distance in 2D and 3D. However, lines, even when perfectly matched, are inherently subject to ambiguity along the line direction. Yoon et al. [57] suggest removing such ambiguities by treating points on a line segment as verbal tokens in natural language processing, where line features are learned using Transformers [53]. Such learning-based approaches are trained with a database of pose-annotated images or require additional computation [57, 58, 16]. Further, these approaches only use lines for pose refinement, assuming a coarse pose estimate to be given via global feature comparisons [3, 17]. LDL takes a different approach and focuses on robust pose selection based on lines. We compare LDL against existing approaches for line-based localization, where LDL performs competitively against these methods while balancing robustness and efficiency.

Point-Based Localization

Most visual localization algorithms follow a point-based paradigm, focusing on sparse feature point correspondences [30, 45, 46, 36, 24, 41, 42, 47, 10, 43, 48], dense matching via coordinate regression of scene points [30, 5], or minimizing color discrepancies of dense 3D points via gradient descent [26, 27]. Conventional approaches using a perspective camera input take a two-step approach, where coarse poses are first estimated using global feature descriptors [17, 3] and refined with PnP-RANSAC from local feature matches [31, 44, 12] or dense matches from scene coordinate regression [30, 48]. Recent panoramic localization methods [26, 27, 8, 7] also follow a similar two-step approach, where exemplary methods find candidate poses via color distribution matching and refine them using gradient descent optimization [26, 27]. While these algorithms can robustly handle a modest range of scene changes due to the holistic view from panoramas, they can still fail with significant changes in illumination. We compare LDL against exemplary point-based methods and demonstrate that line segments can be effectively utilized for accurate and robust localization even without the costly calculation of global features or color matching.

3 Method

Figure 2: Motivation for (a) utilizing and (b) decomposing line distance functions. (a) Line distance functions disambiguate regions with dense lines. Given two candidate poses close (A) and far (B) from ground truth, Chamfer distance falsely favors B near dense lines, whereas distance functions correctly rank the poses. (b) Decomposition further reduces ambiguities from rotation by separately considering line segments with varying directions. Given an original view close to the ground truth (green) and a rotated view (red), the decomposition better distinguishes the two views by correctly selecting the original view over the rotated view.

LDL aims at finding the pose at which the query image $I$ is taken with respect to a 3D scene, where Figure 1 depicts the localization steps taken by LDL. We first represent the 3D scene using a line map equipped with local feature descriptors for keypoint locations, and similarly acquire line segments and local descriptors for the query image prior to localization (Section 3.1). We then estimate the three principal directions for 2D and 3D by voting, from which we can deduce a set of rotations considering the sign and permutation ambiguity (Section 3.2). Given the fixed set of candidate rotations, we construct an initial set of possible poses incorporating translations. We generate the decomposed line distance functions at each pose and choose the promising poses by comparing the distance functions with a robust loss function (Section 3.3). As the final step, the selected poses are refined by performing PnP-RANSAC [15] using feature matches [44] with the query image (Section 3.4).

3.1 Localization Input Preparation

Map Building

LDL operates using a 3D map consisting of line segments and local features. We build such a map starting from a colored point cloud $P=\{X,C\}$. To obtain the 3D line segments we use the line extraction method from Xiaohu et al. [54], which can quickly process point clouds containing millions of points within a few seconds. We further remove short, noisy line segments from the raw detection with a simple filtering step: given the point cloud bounding box of size $b_{x}\times b_{y}\times b_{z}$, we filter out 3D line segments shorter than $\lambda(b_{x}+b_{y}+b_{z})/3$, with $\lambda=0.1$ in all our experiments.
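To make the filtering step concrete, below is a minimal NumPy sketch (not the released implementation); the (N, 2, 3) array layout for segment endpoints is our own assumption.

```python
import numpy as np

def filter_3d_lines(lines_3d, points, lam=0.1):
    """Remove short 3D line segments, as described in Section 3.1.

    lines_3d: (N, 2, 3) array of segment endpoints (assumed layout).
    points:   (M, 3) array of point cloud coordinates.
    lam:      threshold parameter lambda (0.1 in the paper).
    """
    # Bounding box extents b_x, b_y, b_z of the point cloud.
    extents = points.max(axis=0) - points.min(axis=0)
    threshold = lam * extents.sum() / 3.0

    lengths = np.linalg.norm(lines_3d[:, 0] - lines_3d[:, 1], axis=1)
    return lines_3d[lengths >= threshold]
```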

To obtain local features embedded in the 3D map, we first render synthetic views at various locations using the point cloud color values. Specifically, we project the input point cloud $P=\{X,C\}$ at a virtual camera and assign the measured color $Y(u,v)=C_{n}$ at the projected location of the corresponding 3D coordinate, $(u,v)=\Pi(RX_{n}+t)$, to create the synthetic view $Y$. We then extract local features for each synthetic view $Y$ using SuperPoint [12], and back-project the local features to their 3D locations, which in turn results in keypoint descriptors embedded in 3D space. Note that while we illustrate map building using a colored point cloud, our setup can also work with line-based SfM maps [40, 39, 32] since the input to our pipeline is lines and associated local features.

Panorama Pre-processing

Similar to map building, we extract line segments and local features from the query panorama image. We acquire line segments with LSD [18], a robust line detection algorithm that can stably extract lines even under motion blur or lighting changes. To remove noisy line detections as in the 3D case, we filter 2D line segments with an adaptive length threshold to match the filtering rate of 3D line segments. Specifically, for each scene we choose the threshold value such that the ratio of lines filtered in 2D equals that in 3D. Then, we extract local feature descriptors using SuperPoint [12], which will later be used for pose refinement in Section 3.4.

3.2 Candidate Rotation Estimation

Given the detected line segments, LDL first estimates a set of feasible rotations by extracting principal directions, which we define as the most common line directions in 2D and 3D. Let $L_{2D}=\{l\}$ denote the line segments in 2D, where $l=(s,e)$ is a tuple of start point $s\in\mathbb{S}^{2}$ and end point $e\in\mathbb{S}^{2}$. Note that we operate on the spherical projection space and treat lines and points on panoramas as arcs and points on the unit sphere $\mathbb{S}^{2}$, respectively. Similarly, let $L_{3D}=\{\tilde{l}\}$ denote the line segments in 3D, with $\tilde{l}=(\tilde{s},\tilde{e})$ being a tuple containing start and end points $\tilde{s},\tilde{e}\in\mathbb{R}^{3}$.

LDL estimates vanishing points and votes for the principal directions in 2D and 3D. In 2D we first extract vanishing points by finding the points of intersection of extended 2D line segments. The 2D principal directions $P_{2D}=\{p\}$ are defined as the top $k_{2D}$ vanishing points containing the most incident lines, where $p\in\mathbb{R}^{3}$ is a unit-norm vector denoting the vanishing point location on the sphere. Similarly, the 3D principal directions $P_{3D}=\{\tilde{p}\}$ are defined as the top $k_{3D}$ most common line directions from 3D line segments obtained via voting. Note that the 3D direction $\tilde{p}\in\mathbb{R}^{3}$ is also normalized.

LDL estimates the feasible candidate rotations up to the combinatorial ambiguities that arise when matching the principal directions in 2D and 3D. Specifically, we select triplets of directions from $P_{2D}$ and $P_{3D}$, yielding a total of $\binom{k_{2D}}{3}\times\binom{k_{3D}}{3}\times 3!\times 2^{3}$ possible combinations, additionally considering the sign and permutation ambiguity. For each pair of triplets, we apply the Kabsch algorithm [25] to find the optimal rotation that aligns the 2D directions to the 3D directions. Discarding infeasible rotations that have large mean squared error, we obtain $N_{r}$ rotations. The possible rotations are further filtered using the line distance functions presented in the next section.
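The enumeration described above can be sketched as follows, assuming the principal directions are given as unit row vectors; the Kabsch step is implemented directly with an SVD, and the mean-squared-error cutoff is a hypothetical value rather than the paper's exact threshold.

```python
import itertools
import numpy as np

def kabsch(a, b):
    """Rotation R minimizing sum ||R a_i - b_i||^2 for paired unit vectors (rows of a, b)."""
    h = a.T @ b
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

def candidate_rotations(p2d, p3d, mse_thresh=0.05):
    """Enumerate rotations aligning 2D principal directions to 3D ones.

    p2d: (k_2D, 3) unit vectors (vanishing points); p3d: (k_3D, 3) unit vectors.
    The sign and permutation ambiguities are enumerated explicitly.
    """
    rotations = []
    for tri2d in itertools.combinations(range(len(p2d)), 3):
        for tri3d in itertools.combinations(range(len(p3d)), 3):
            a = p2d[list(tri2d)]
            for perm in itertools.permutations(tri3d):
                for signs in itertools.product([-1.0, 1.0], repeat=3):
                    b = p3d[list(perm)] * np.array(signs)[:, None]
                    r = kabsch(a, b)
                    if np.mean((a @ r.T - b) ** 2) < mse_thresh:
                        rotations.append(r)
    return rotations
```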

3.3 Line Distance Functions for Pose Selection

We propose line distance functions to efficiently evaluate a large pool of poses and select promising candidate poses. The initial pool of poses is the combination of possible translations with the rotations found in the previous section. To this end, $N_{t}$ translations are chosen within grid partitions of the 3D point cloud, where details are explained in the supplementary material. The resulting $N_{t}\times N_{r}$ poses are ranked using line distance functions.

Distance Function Definition

Distance functions are designed to compare the holistic spatial context captured from the large field of view in panorama images. They are defined for every point including void regions without any lines and can quickly rank poses. Compared to Chamfer distance or learned line embeddings used in prior work [33, 57], LDL does not attempt pairwise matching between lines, which is often costly and can incur failure modes. For example, it is ambiguous to correctly match between densely packed lines as shown in Figure 2a.

A line distance function is a dense field of distance values to the detected lines, either in the 2D query image or in the spherical projection of the 3D map at an arbitrary pose. For a point $x$ on the unit sphere $\mathbb{S}^{2}$, the 2D line distance function is given as

$$f_{2D}(x;L_{2D})=\min_{l\in L_{2D}}D(x,l). \qquad (1)$$

Here $D(x,l)$ is the spherical distance from $x$ to the line segment $l=(s,e)$, namely

$$D(x,l)=\begin{cases}\sin^{-1}\left|\left\langle x,\dfrac{s\times e}{\|s\times e\|}\right\rangle\right| & \text{if } x\in\mathcal{Q}(s,e)\\ \min\left(\cos^{-1}\langle x,e\rangle,\ \cos^{-1}\langle x,s\rangle\right) & \text{otherwise,}\end{cases} \qquad (2)$$

where $\mathcal{Q}(s,e)$ is the spherical quadrilateral formed from $\{s, e, \pm(s\times e)/\|s\times e\|\}$.

Similarly, the 3D line distance function is defined for each candidate rotation $R\in SO(3)$ and translation $t\in\mathbb{R}^{3}$. Using the spherical projection function $\Pi(\cdot):\mathbb{R}^{3}\rightarrow\mathbb{S}^{2}$ that maps a point in 3D to a point on the unit sphere, the 3D line segment $\tilde{l}=(\tilde{s},\tilde{e})$ is projected to 2D under the candidate transformation as $l=(\Pi(R\tilde{s}+t),\Pi(R\tilde{e}+t))$. For simplicity, let $\Pi_{L}(\tilde{l};R,t)$ denote the projection of a line segment in 3D to the spherical surface. Then the 3D line distance function is defined as follows,

$$f_{3D}(x;L_{3D},R,t)=\min_{\tilde{l}\in L_{3D}}D(x,\Pi_{L}(\tilde{l};R,t)). \qquad (3)$$
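To make Equations 1-3 concrete, here is a minimal NumPy sketch under our own array conventions (not the released code). The membership test for the spherical quadrilateral $\mathcal{Q}(s,e)$ is implemented by checking whether the projection of $x$ onto the great circle of the segment lies between $s$ and $e$.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def segment_distance(x, s, e):
    """Spherical distance D(x, l) from unit vector x to the arc l = (s, e), Eq. (2)."""
    n = normalize(np.cross(s, e))             # normal of the great circle through s, e
    # Project x onto the great circle and check whether it lies between s and e,
    # which corresponds to x being inside the spherical quadrilateral Q(s, e).
    p = x - np.dot(x, n) * n
    if np.linalg.norm(p) > 1e-8:
        p = normalize(p)
        inside = np.dot(np.cross(s, p), n) >= 0 and np.dot(np.cross(p, e), n) >= 0
    else:
        inside = False
    if inside:
        return np.arcsin(np.clip(abs(np.dot(x, n)), 0.0, 1.0))   # perpendicular arc distance
    return min(np.arccos(np.clip(np.dot(x, s), -1.0, 1.0)),
               np.arccos(np.clip(np.dot(x, e), -1.0, 1.0)))

def f_2d(x, lines_2d):
    """Eq. (1): distance from x to the nearest 2D line segment (arcs on the sphere)."""
    return min(segment_distance(x, s, e) for s, e in lines_2d)

def f_3d(x, lines_3d, R, t):
    """Eq. (3): project 3D segments with Pi(p) = (R p + t)/||R p + t||, then reuse D."""
    return min(segment_distance(x, normalize(R @ s + t), normalize(R @ e + t))
               for s, e in lines_3d)
```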

As shown in Figure 3, one can expect poses closer to the ground truth to have similar 2D and 3D line distance functions. Therefore, we evaluate the $N_{t}\times N_{r}$ poses according to the similarity of line distance functions.

We apply a robust loss function that measures inlier counts to quantify the affinity of the line distance functions. For each candidate pose $\{R,t\}$ we count the number of points whose distance function values differ by less than a threshold $\tau$,

$$L(R,t)=-\sum_{q\in Q}\mathbbm{1}\{|f_{2D}(q;L_{2D})-f_{3D}(q;L_{3D},R,t)|<\tau\}, \qquad (4)$$

where $\mathbbm{1}\{\cdot\}$ is the indicator function and $Q\subset\mathbb{S}^{2}$ is a set of query points uniformly sampled from a sphere. The loss function only considers inlier counts, and thus is robust to outliers from scene changes or line misdetections. We validate the efficacy of the robust loss function in Section 4.2.
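Given the distance function values sampled at the query points, the robust loss of Equation 4 reduces to a few vectorized operations. Below is a minimal sketch operating on precomputed arrays of $f_{2D}$ and $f_{3D}$ values (which can be obtained with the sketch after Equation 3); the pose with the smallest loss, i.e., the largest inlier count, is ranked highest.

```python
import numpy as np

def robust_loss(f2d_vals, f3d_vals, tau=0.1):
    """Eq. (4): negative count of query points whose 2D/3D distance values agree within tau.

    f2d_vals: (|Q|,) 2D line distance function evaluated at the query points.
    f3d_vals: (|Q|,) 3D line distance function at a candidate pose (R, t), same points.
    """
    inliers = np.abs(f2d_vals - f3d_vals) < tau
    return -int(inliers.sum())
```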

Figure 3: Line distance function visualization and decomposition at the ground truth pose $R^{*},t^{*}$. LDL decomposes distance functions using principal directions and enhances their expressiveness.

Distance Function Decomposition

To further enhance pose search using line distance functions, we propose to decompose the distance functions along three principal directions. While line distance functions provide useful evidence for line-based localization, they lack a sense of direction as in Figure 2b, where the distance functions alone cannot effectively distinguish rotated views at a fixed translation.

We split line segments along the principal directions used for rotation estimation and define separate line distance functions for each group of lines, as shown in Figure 3. Recall from Section 3.2 that each candidate rotation $R$ is obtained from a pair of triplets of 2D and 3D principal directions, denoted as $\hat{P}_{2D}^{R}=\{p_{1},p_{2},p_{3}\}$ and $\hat{P}_{3D}^{R}=\{\tilde{p}_{1},\tilde{p}_{2},\tilde{p}_{3}\}$. We associate line segments that are parallel to directions in $\hat{P}_{2D}^{R},\hat{P}_{3D}^{R}$, leading to three groups of line segments $L_{2D}^{R}=\{L_{2D}^{1},L_{2D}^{2},L_{2D}^{3}\}$ and $L_{3D}^{R}=\{L_{3D}^{1},L_{3D}^{2},L_{3D}^{3}\}$ in 2D and 3D, respectively. We separately define line distance functions for the three groups using Equation 2, namely $f_{2D}(x;L_{2D}^{i})$ and $f_{3D}(x;L_{3D}^{i},R,t)$ for $i=1,2,3$. Then the robust loss function in Equation 4 can be modified to accommodate the decomposed distance functions,

$$L(R,t)=-\sum_{i=1}^{3}\sum_{q\in Q}\mathbbm{1}\{|f_{2D}(q;L_{2D}^{i})-f_{3D}(q;L_{3D}^{i},R,t)|<\tau\}. \qquad (5)$$

We validate the importance of distance function decomposition in Section 4.2.
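A minimal sketch of the grouping step behind the decomposition is shown below: 3D segments are assigned to the principal direction they are most parallel to, with a hypothetical angular tolerance; 2D segments can be grouped analogously by testing whether their great-circle normals are perpendicular to a vanishing point direction.

```python
import numpy as np

def group_3d_lines(lines_3d, principal_dirs, angle_tol_deg=10.0):
    """Split 3D segments into one group per principal direction (for Eq. (5)).

    lines_3d:        (N, 2, 3) segment endpoints.
    principal_dirs:  (3, 3) unit vectors of the chosen triplet P_hat_3D.
    Segments not parallel to any direction (within the tolerance) are discarded.
    """
    dirs = lines_3d[:, 1] - lines_3d[:, 0]
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    cos_tol = np.cos(np.deg2rad(angle_tol_deg))

    groups = [[] for _ in principal_dirs]
    for seg, d in zip(lines_3d, dirs):
        cosines = np.abs(principal_dirs @ d)       # |cos| handles the sign ambiguity
        best = int(np.argmax(cosines))
        if cosines[best] >= cos_tol:
            groups[best].append(seg)
    return [np.array(g) for g in groups]
```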

3.4 Candidate Pose Refinement

After we select the top $K$ poses from the pool of $N_{t}\times N_{r}$ poses using the loss function values from Equation 5, we refine them using local feature matching as shown in Figure 1. Here we utilize the cached local features from Section 3.1. Specifically, for each selected pose we first retrieve the set of visible 3D keypoints at that pose and perform local feature matching against the 2D keypoints in the query image. In this process we use SuperGlue [44] for feature matching and select the candidate pose with the most matches. Finally, we apply PnP-RANSAC [15, 29, 21] on the matched 2D and 3D keypoint coordinates to obtain a refined pose estimate. Backed by local feature matching that operates stably given decent coarse estimates from line distance functions, LDL can robustly function as an effective localization method, which we further verify in Section 4.
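As a rough sketch of this refinement stage, the snippet below runs PnP-RANSAC with OpenCV on 2D-3D matches for one candidate pose. Here `match_features` is a hypothetical stand-in for SuperGlue-style matching, and the 2D keypoints are assumed to have been remapped to a pinhole (e.g., cubemap-face) view with intrinsics `K_mat`, since cv2.solvePnPRansac expects perspective observations.

```python
import cv2
import numpy as np

def refine_pose(kpts_2d, desc_2d, kpts_3d, desc_3d, K_mat, match_features):
    """Refine one candidate pose via feature matching + PnP-RANSAC (Section 3.4 sketch).

    kpts_2d:  (N, 2) pixel keypoints in a pinhole view of the query panorama.
    desc_2d:  (N, D) local descriptors (e.g. SuperPoint).
    kpts_3d:  (M, 3) 3D keypoint coordinates visible from the candidate pose.
    desc_3d:  (M, D) descriptors back-projected onto the 3D map.
    match_features: hypothetical matcher returning index pairs (i_2d, i_3d).
    """
    matches = match_features(desc_2d, desc_3d)           # e.g. SuperGlue in the paper
    img_pts = np.float32([kpts_2d[i] for i, _ in matches])
    obj_pts = np.float32([kpts_3d[j] for _, j in matches])

    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts, img_pts, K_mat, None, reprojectionError=3.0)
    if not ok:
        return None
    R_mat, _ = cv2.Rodrigues(rvec)                        # camera rotation matrix
    return R_mat, tvec.ravel(), inliers
```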

t-error (m) R-error (°) Accuracy
Dataset PC SB CD LT CPO LDL PC SB CD LT CPO LDL PC SB CD LT CPO LDL
Area 1 0.02 0.02 0.12 0.02 0.01 0.02 0.46 0.62 1.14 0.62 0.25 0.54 0.66 0.89 0.50 0.90 0.90 0.86
Area 2 0.76 0.04 1.16 0.04 0.01 0.02 2.25 0.72 11.54 0.72 0.27 0.66 0.45 0.76 0.35 0.74 0.81 0.77
Area 3 0.02 0.03 0.79 0.02 0.01 0.02 0.49 0.57 4.54 0.55 0.24 0.54 0.57 0.92 0.36 0.88 0.78 0.89
Area 4 0.18 0.02 0.33 0.02 0.01 0.02 4.17 0.57 1.97 0.56 0.28 0.48 0.49 0.91 0.46 0.91 0.83 0.88
Area 5 0.50 0.03 0.95 0.03 0.01 0.02 14.64 0.69 41.84 0.65 0.27 0.54 0.44 0.80 0.36 0.79 0.74 0.81
Area 6 0.01 0.02 0.50 0.02 0.01 0.02 0.31 0.63 1.20 0.60 0.18 0.50 0.69 0.88 0.47 0.87 0.90 0.83
Total 0.03 0.03 0.73 0.02 0.01 0.02 0.63 0.63 2.30 0.63 0.24 0.53 0.54 0.85 0.39 0.84 0.83 0.83
Table 1: Localization performance evaluation in Stanford 2D-3D-S [4], compared against PICCOLO (PC) [26], structure-based approach (SB), Chamfer distance-based approach (CD), Line Transformer (LT) [57], and CPO [27].
Accuracy
Dataset PC SB CD LT CPO LDL
Original 0.45 0.69 0.21 0.68 0.72 0.89
Gamma 0.00 0.63 0.47 0.59 0.00 0.82
Intensity 0.00 0.56 0.40 0.58 0.80 0.76
White Balance 0.00 0.62 0.32 0.67 0.74 0.91
Table 2: Localization accuracy on synthetic color variations applied to Room 3 in the Extreme split from OmniScenes [26].

4 Experiments

We evaluate LDL in various localization scenarios and analyze its performance. Our method is mainly implemented using PyTorch [35], and is accelerated with a single RTX 2080 GPU. In all our experiments we set the number of principal directions as $k_{2D}=20, k_{3D}=3$, the inlier threshold $\tau=0.1$, and the number of query points as $|Q|=42$. We report the full hyperparameter setup in the supplementary material. Following prior works [26, 8, 7], we report the median translation and rotation errors along with the localization accuracy, where a prediction is considered correct if the translation error is below 0.1 m and the rotation error is below 5°.
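For reference, a small sketch of the evaluation metrics used throughout this section: translation error as the Euclidean distance between camera centers and rotation error as the geodesic angle between the estimated and ground-truth rotations, with a pose counted as correct below 0.1 m and 5°.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Translation error (m) and rotation error (degrees) between two poses."""
    t_err = np.linalg.norm(t_est - t_gt)
    cos_angle = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    return t_err, r_err

def is_correct(t_err, r_err, t_thresh=0.1, r_thresh=5.0):
    """Accuracy criterion used in the experiments."""
    return (t_err < t_thresh) and (r_err < r_thresh)
```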

Datasets

We evaluate LDL on two indoor localization datasets: Stanford 2D-3D-S [4] and OmniScenes [26]. Stanford 2D-3D-S contains 1413 panorama images from 272 rooms subdivided into six areas. Each area has diverse indoor scenes such as offices, hallways, and auditoriums where repetitive structure and featureless regions are present. OmniScenes contains 4121 panorama images from seven 3D scans, where the panorama images are captured with handheld or robot-mounted cameras, at different times of day, and include large changes in furniture configurations. The dataset has three splits (Robot, Handheld, Extreme) that are recorded in scenes with/without changes, where images in the Extreme split are captured under large camera motion.

t-error (m) R-error (°) Accuracy
Split Change PC SB CD LT CPO LDL PC SB CD LT CPO LDL PC SB CD LT CPO LDL
Robot ✗ 0.02 0.03 1.74 0.03 0.01 0.02 0.27 0.58 89.23 0.59 0.12 0.49 0.69 0.99 0.31 0.99 0.89 0.98
Hand ✗ 0.01 0.03 2.10 0.03 0.01 0.03 0.23 0.63 89.02 0.64 0.13 0.54 0.81 0.95 0.29 0.95 0.80 0.97
Robot ✓ 1.07 0.04 1.78 0.04 0.02 0.03 21.03 0.64 89.27 0.65 1.46 0.58 0.41 0.93 0.30 0.94 0.59 0.95
Hand ✓ 0.53 0.04 1.70 0.04 0.02 0.03 7.54 0.71 88.50 0.70 0.37 0.64 0.47 0.92 0.30 0.90 0.60 0.92
Extreme ✓ 1.24 0.04 1.55 0.04 0.03 0.03 23.71 0.83 88.54 0.84 0.37 0.72 0.41 0.89 0.29 0.88 0.59 0.92
Table 3: Localization performance evaluation in OmniScenes [26], considering both scenes with and without object layout changes.

Baselines

We compare LDL against three point-based baselines (PICCOLO, CPO, structure-based) and two line-based baselines (Chamfer distance-based, Line Transformer [57]). PICCOLO (PC) [26] and the follow-up work CPO [27] are optimization-based algorithms that find the pose by minimizing the color discrepancy between the point cloud and the query image. The structure-based approach (SB) [43, 51] is one of the most prominent paradigms for visual localization using perspective cameras. We implement a variant for panorama images, where candidate poses are retrieved from an image database using a global feature extractor [17] and further refined using SuperGlue [44] matches. For fair comparison, we undistort the panorama image into cubemaps and perform feature matching, where the results are then fed to PnP-RANSAC for refinement. In addition, we construct the database of pose-annotated images by rendering synthetic views at various locations in the colored point cloud.

The Chamfer distance-based approach (CD), inspired by Micusik et al. [33], ranks candidate poses by comparing the spherical Chamfer distance of line segments in the synthetic views against those in the query image. The Line Transformer approach by Yoon et al. [57] (LT) ranks candidate poses using Transformer-based [53] matching learned for each line segment. As this baseline also requires a pose-annotated database, we construct a synthetic database similar to the structure-based approach, and apply the undistortion process for fair comparison. We provide additional details about the baselines in the supplementary material.

4.1 Localization Evaluation

Stanford 2D-3D-S

We first assess the localization performance of LDL against the baselines in the Stanford 2D-3D-S dataset, as shown in Table 1. LDL performs competitively against the strong baselines (structure-based and Line Transformer) that apply powerful neural networks for candidate pose search. While the dataset contains hallways and auditoriums with large featureless regions or repetitive structure, LDL leverages the holistic distribution of lines using distance functions and shows stable performance without resorting to costly neural network computations. Further, LDL shows superior performance when compared against the Chamfer distance-based method, which indicates that solely focusing on line matches for ranking candidate poses can lead to suboptimal performance.

OmniScenes

We additionally compare LDL against the baselines in the OmniScenes dataset, as shown in Table 3. Unlike the Stanford 2D-3D-S dataset, all images exhibit blur from camera motion and approximately half of the images contain changes in object layout. In splits not containing changes, LDL performs competitively against the baselines, which supports our claim that line distance functions enable effective pose search without using neural networks. Further, LDL attains the highest accuracy in splits containing scene changes, and notably in the Extreme split that contains the largest amount of motion blur. This is due to the stable line extraction [18, 19, 55, 59] that enables resilience against motion blur, and the robust distance function comparison (Equation 4) that rejects outliers for handling scene changes. We further verify the importance of each component of LDL in Section 4.2.

Figure 4: Color variations for evaluating illumination robustness.

Illumination Robustness Evaluation

To validate the illumination robustness of LDL, we measure localization performance after applying synthetic color variations. We select Room 3 from the Extreme split in OmniScenes for evaluation. As shown in Figure 4, the image gamma, white balance, and average intensity are modified to arbitrary values, where further details are deferred to the supplementary material. We report the results of LDL along with the baselines in Table 2. CPO, PICCOLO, and the structure-based baseline all suffer from performance degradation, as the color values are directly utilized for finding initial poses. Notably, Yoon et al. [57] also shows a performance drop, as the Transformer line features are affected by the illumination changes of the image. As LDL relies on the spatial structure of line segments for candidate pose search, it is robust to illumination variations, leading to stable performance across all color variations. Further, note that while all the methods excluding PICCOLO [26] and CPO [27] use local feature matching for pose refinement, there is a large performance gap between LDL and the other methods. This validates our focus on designing a stable candidate pose selection method, as modern feature descriptors and matching algorithms [43, 44, 12, 13] are fairly robust against adversaries such as illumination changes.

Figure 5: Pose error recall and runtime comparison between candidate pose search using LDL and NetVLAD [3].
Method t-error (m) R-error (°) Acc.
SB (K=10) 0.06 1.18 0.63
SB (K=20) 0.05 1.07 0.71
LDL (K=10) 0.07 1.36 0.63
LDL (K=20) 0.07 1.27 0.69
(a) Multi-Room Localization
Component CPU GPU
Line Segment Extraction 0.141 0.141
Rotation Estimation 1.124 0.009
Distance Function Computation 0.052 0.001
Candidate Pose Refinement 5.573 0.587
Total Runtime (sec) 6.890 0.738
(b) Runtime on CPU and GPU
Table 4: Multi-room localization compared against the structure-based method (SB) with various numbers of candidate poses (K), and runtime analysis of LDL.

4.2 Performance Analysis

Candidate Pose Search Evaluation

To evaluate the efficacy of line distance functions for candidate pose search, we compare the retrieval accuracy of LDL against NetVLAD [3], which is a widely used global feature extractor [43, 57, 23]. Note that NetVLAD is used as the candidate pose selection module in the structure-based baseline. We use the Extreme split from OmniScenes for evaluation, where the translation and rotation error recall curves along with the runtime for processing a single candidate pose are reported in Figure 5. For fair comparison we use an identical pool of $N_{t}=50$ translations for both methods and assign a large number of candidate rotations ($N_{r}=216$) to NetVLAD. Additional setup details are reported in the supplementary material. While neural network-based pose search methods can perform city-scale search [3, 20, 17], the line distance functions in LDL exhibit competitive performance to NetVLAD in indoor environments. The distance functions provide highly discriminative spatial context, which enables effective pose search. Furthermore, the runtime for pose search in LDL is much shorter than NetVLAD, due to the highly efficient computation of distance functions only conducted on sparse sphere points. This is in contrast to NetVLAD where visual features are computed with a neural network for each view. The line distance functions enable quick and effective pose initialization, which in turn allows LDL to be usable in various practical localization scenarios.

Runtime Analysis

We analyze the runtime of LDL in Table 4b, where we decompose the runtime for localizing a single query image from OmniScenes [26]. We assume that 3D scanning along with map building is done offline and only consider the computation time for online operations, namely 2D line segment extraction, candidate pose selection, and refinement. Overall, the pose selection process including rotation estimation and distance function computation exhibits a small runtime on both CPU and GPU, which validates the efficiency of our proposed line-based pose search. Nevertheless, pose refinement exhibits a relatively larger runtime, which is mainly due to the large number of features in panoramas compared to perspective images with a smaller field of view. While we focused on pose search and used off-the-shelf local feature matching algorithms for pose refinement [44, 12], devising highly efficient feature matching algorithms tailored specifically for panoramas is left as future work.

Scalability Analysis

We assess the scalability of LDL to large-scale indoor scenes using the OmniScenes [26] dataset. While the previous set of experiments assumes room-scale localization scenarios, here we test LDL using the entire OmniScenes dataset as the 3D map. Table 4a shows the localization results, where LDL is compared against the structure-based method at various numbers of candidate poses (K). LDL exhibits performance on a par with the structure-based method, which indicates that line distance functions can scalably handle large scenes consisting of multiple rooms. Nevertheless, scaling LDL to even larger scenes (e.g., building-scale scenes as in InLoc [51]) is left as future work.

Privacy Preservation Analysis

While the main goal of LDL is to offer fast and robust localization based on lines, we find that with a small modification our method can offer light-weight privacy protection in client-server localization scenarios [14, 11, 6, 50, 49]. Following prior works [34, 50], we consider the case where a client using an edge device wants to localize oneself against a 3D line map stored in the cloud. Privacy breaches occur if the service provider maliciously tries to view the visual data captured by the client. This is possible even when only the local feature descriptors are shared between the client and server, by using feature inversion methods [37] that reconstruct the original image from a sparse set of local features as shown in Figure 6.

By changing LDL to only exploit local features near lines during refinement, we can prevent privacy breaches including feature inversion attacks without largely sacrificing localization performance. First, as LDL uses line segments for candidate pose selection the clients only need to share the extracted line segments with the service providers for initial pose search, instead of the entire view that would be needed for global feature-based methods. Second, as local features near line segments are shared with the service provider for pose refinement, feature inversion methods cannot faithfully recover the original visual content. We validate this claim with a small set of experiments performed in the Stanford 2D-3D-S dataset [4], where we filter descriptors whose spherical distances to the nearest line segment are over 0.05 rad. As shown in Figure 6, this line-based filtering degrades the quality of feature inversion attacks by hiding objects that potentially contain sensitive information while only incurring small drops in localization accuracy. We report additional details and results regarding the potential of LDL for privacy preservation in the supplementary material.
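A minimal sketch of the line-based filtering used in this experiment is given below: keypoints whose spherical distance to the nearest 2D line segment exceeds 0.05 rad are discarded before their descriptors are shared. The point-to-arc distance is passed in as a callable (e.g., the $D(x,l)$ sketch from Section 3.3); the array layouts are our own assumptions.

```python
import numpy as np

def filter_private_keypoints(kpts_sphere, descriptors, lines_2d, arc_distance,
                             max_dist_rad=0.05):
    """Keep only keypoints within max_dist_rad of some 2D line segment.

    kpts_sphere:  (N, 3) keypoint bearings as unit vectors on the sphere.
    descriptors:  (N, D) local feature descriptors to be shared with the server.
    lines_2d:     list of (s, e) unit-vector endpoints of detected 2D segments.
    arc_distance: callable implementing D(x, l), e.g. the sketch in Section 3.3.
    """
    keep = []
    for i, x in enumerate(kpts_sphere):
        d = min(arc_distance(x, s, e) for s, e in lines_2d)
        if d <= max_dist_rad:
            keep.append(i)
    keep = np.array(keep, dtype=int)
    return kpts_sphere[keep], descriptors[keep]
```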

4.3 Ablation Study

We ablate the distance function decomposition, the number of query points, and the robust loss function, which are key components of LDL, in the OmniScenes Extreme split. In Table 5a, LDL is first compared against a baseline that does not apply decomposition and uses the loss function in Equation 4. Decomposition leads to a large performance gain, as the distance functions are further disambiguated and split along each principal direction. We further test the effect of the number of query points $|Q|$ on evaluating the robust loss function. While increasing the number of query points enhances performance, the improvement is not significant and incurs additional computation. Conversely, using a smaller number of query points leads to ambiguities in distance function matching, exhibiting poor performance. The number of query points $|Q|=42$ balances the computational efficiency and localization accuracy of LDL. We finally validate the robust loss function in Equation 5 by comparing LDL against variants using other loss functions: L1, L2, Huber, and Median loss. Here we report results from the Wedding Hall scene, as this scene contains drastic scene changes with large amounts of outliers. As shown in Table 5b, the inlier counting proposed in Equation 5 attenuates outliers and exhibits optimal performance, demonstrating the effectiveness of the robust loss function.

Method t-error (m) R-error (°) Acc.
w/o Decomposition 1.00 3.97 0.37
w/ |Q|=10 0.04 0.85 0.77
w/ |Q|=21 0.04 0.71 0.88
w/ |Q|=84 0.03 0.66 0.95
Ours (|Q|=42) 0.03 0.72 0.92
(a) Decomposition & Query Points
Method t-error (m) R-error (°) Acc.
w/ L1 Loss 0.08 1.38 0.55
w/ L2 Loss 0.17 1.48 0.34
w/ Huber Loss 0.11 1.39 0.50
w/ Median Loss 0.08 1.22 0.55
Ours 0.07 1.22 0.68
(b) Choice of Loss Function
Table 5: Ablation study of various components of LDL.
Figure 6: Visualization of feature inversion attacks on panoramic inputs, along with the localization accuracy before and after line-based feature filtering.

5 Conclusion

We presented LDL, a fast and robust algorithm for panorama to point cloud localization using line segments. LDL benefits from the illumination-robustness of line segments and the holistic context of panoramas by using a novel formulation based on line distance functions. The distance functions effectively handle visual ambiguities of line segments, as they provide spatial meaning to void regions often neglected by existing line-based localization methods. In addition, by evaluating distance functions only on sparsely sampled query points, LDL performs rapid candidate pose search with accuracy on a par with learning-based global feature extractors. As a result, LDL performs robust localization in various challenging scenarios with a short runtime. We expect LDL to complement and enhance the currently prevalent point-based localization algorithms for highly robust and practical localization.

Acknowledgements

This work was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00218601), the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-02068, Artificial Intelligence Innovation Hub), and Samsung Electronics Co., Ltd. Young Min Kim is the corresponding author.

Appendix A Details on LDL

Principal Direction Computation

We explain the details of principal direction computation. Recall that the principal directions in 2D and 3D are defined as the top $k_{2D}$ and $k_{3D}$ most common line directions. The 2D principal directions are extracted from vanishing points. When parallel lines are projected on an image, they appear to converge at a point, which is referred to as a vanishing point. To locate vanishing points, we extrapolate detected line segments and find their intersections. Since we are using panoramic images, we use spherical projections of lines and vanishing points. Specifically, we create a uniform spherical grid and count the number of intersection points in each grid cell, which we referred to as 'voting' in the main paper. We select the top $k_{2D}$ grid locations with the most votes as the 2D principal directions. For 3D principal directions, we similarly aggregate votes for 3D line directions and extract the top $k_{3D}$ votes. Note that we fix the filtering parameters for all our experiments and LDL achieves competitive results.
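A rough sketch of this voting scheme is given below, under our own choice of a regular azimuth/elevation grid (the paper's exact grid construction may differ): the great circles of pairs of 2D segments are intersected, and the intersection points are accumulated into spherical bins.

```python
import itertools
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v)

def vote_vanishing_points(lines_2d, n_az=64, n_el=32, k_2d=20):
    """Vote for 2D principal directions from spherical line segments.

    lines_2d: list of (s, e) unit-vector endpoints; the intersection of the great
    circles of two segments is +-(n1 x n2), accumulated on an azimuth/elevation grid.
    """
    votes = np.zeros((n_az, n_el), dtype=int)

    def bin_index(v):
        az = np.arctan2(v[1], v[0])                   # [-pi, pi]
        el = np.arcsin(np.clip(v[2], -1, 1))          # [-pi/2, pi/2]
        i = int((az + np.pi) / (2 * np.pi) * n_az) % n_az
        j = min(int((el + np.pi / 2) / np.pi * n_el), n_el - 1)
        return i, j

    normals = [_unit(np.cross(s, e)) for s, e in lines_2d]
    for n1, n2 in itertools.combinations(normals, 2):
        v = np.cross(n1, n2)
        if np.linalg.norm(v) < 1e-6:                  # near-parallel great circles
            continue
        v = _unit(v)
        for cand in (v, -v):                          # both antipodal intersections
            votes[bin_index(cand)] += 1

    # Return bin centers of the k_2d most voted cells as candidate vanishing points.
    top = np.argsort(votes.ravel())[::-1][:k_2d]
    dirs = []
    for idx in top:
        i, j = np.unravel_index(idx, votes.shape)
        az = -np.pi + (i + 0.5) * 2 * np.pi / n_az
        el = -np.pi / 2 + (j + 0.5) * np.pi / n_el
        dirs.append([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.sin(el)])
    return np.array(dirs)
```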

Line Filtering

Prior to localization, recall from Section 3.1 that LDL filters short lines. Specifically, given the point cloud with a bounding box of size $b_{x}\times b_{y}\times b_{z}$, we filter out 3D line segments shorter than $\lambda(b_{x}+b_{y}+b_{z})/3$, where $\lambda=0.1$ in all our experiments. The 2D line segments are then filtered to match the filtering rate of 3D line segments. Note that the threshold parameter $\lambda$ does not play a critical role in performance. Figure A.1 shows the median localization error measured in Room 1 from OmniScenes [26]. The errors are nearly constant with respect to varying $\lambda$.

Spherical Quadrilateral for Computing Line Distance Functions

We illustrate the spherical quadrilateral used for computing distance functions from Section 3. As shown in Figure A.2, given a line segment $l$ on a sphere with a start point $s$ and an end point $e$, the spherical quadrilateral $\mathcal{Q}(s,e)$ is formed by connecting $\{s, e, \pm(s\times e)/\|s\times e\|\}$. The spherical quadrilateral is used in Equation 2 to compute the distance $D(x,l)$ from a point $x$ to a line segment $l$ on a sphere. Here, $D(x,l)$ is computed differently depending on whether $x$ lies on $\mathcal{Q}(s,e)$. The 2D and 3D line distance functions (Equations 1, 3) are further built upon this definition of $D(x,l)$.

Hyperparameter Setup

Here we report the hyperparameter setup of LDL. As explained in Section 3, from the $N_{t}\times N_{r}$ poses we select $K$ candidate poses by comparing the distance functions with the robust loss function in Equation 5. Recall that we use the candidate rotation estimation step in Section 3.2 to choose the $N_{r}$ rotations. For the $N_{t}$ translations, we follow the design choice of prior works [26, 27, 8, 7] and employ uniform grid partitions for Stanford 2D-3D-S [4] and centroids of octrees as in Rodenberg et al. [38] for OmniScenes [26]. We set $K=20, N_{t}=800$ for OmniScenes [26] and $K=20, N_{t}=1700$ for Stanford 2D-3D-S [4]. We use an increased number of translations for Stanford 2D-3D-S to cope with large scenes such as auditoriums and hallways. Nevertheless, note that LDL can quickly search promising candidate poses: even in Stanford 2D-3D-S, candidate pose search finishes within 0.02 seconds.
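As an illustration of the uniform grid variant, the sketch below places translation candidates at the centers of a regular grid over the point cloud bounding box; the per-axis cell counts are hypothetical and should be chosen so that their product roughly matches $N_{t}$ (the octree-based variant used for OmniScenes is not shown).

```python
import numpy as np

def uniform_translation_candidates(points, n_per_axis=(12, 12, 3)):
    """Place translation candidates at the centers of a regular grid over the bounding box.

    points:      (M, 3) point cloud coordinates.
    n_per_axis:  number of grid cells along x, y, z (a hypothetical split; the product
                 should roughly match the desired N_t).
    """
    lo, hi = points.min(axis=0), points.max(axis=0)
    axes = [lo[d] + (np.arange(n) + 0.5) * (hi[d] - lo[d]) / n
            for d, n in enumerate(n_per_axis)]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    return np.stack([gx.ravel(), gy.ravel(), gz.ravel()], axis=1)
```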

Figure A.1: Localization error against the line threshold parameter $\lambda$.
Figure A.2: Given a line segment $l$ (red), the distance (green) from a point $x$ to $l$ is defined depending on whether $x$ lies on the spherical quadrilateral $\mathcal{Q}(s,e)$ (blue).
Figure A.3: Client-server localization setup using LDL. (a) The edge device user captures the raw 2D data and shares the lines and local features near lines with the service provider. The service provider provides the 6DoF pose using the shared information along with the 3D map. (b) While the service provider can attempt feature inversion attacks by training neural networks that learn image reconstructions from local feature inputs, this cannot fully recover the sensitive visual details for LDL as only a fraction of information is shared.

Potential for Privacy Preservation

As explained in Section 4.2, while the primary goal of LDL is to offer fast and robust localization, our approach can also be extended to offer low cost protection against various privacy breaches in client-server localization. To cope with edge devices having limited computing power, modern location-based services employ a client-server localization setup [11, 6] where the visual data of the edge device is shared with the service provider [11, 37]. Based on the shared information, the service provider performs the actual localization pipeline and returns the estimated 6DoF pose to the edge device user.

We adapt LDL to the client-server localization scenario while offering privacy protection by having the edge device user share only lines and local features near lines during localization. Specifically, as shown in Figure A.3, we modify the pose refinement phase of LDL to operate using local features near lines, instead of all the visible local features used for the original refinement explained in Section 3.4. Here we only consider line segments whose lengths are over a designated threshold as explained in Section 3.1 and whose directions are parallel to one of the 2D principal directions. Such a modification results in privacy protection against feature inversion attacks [11, 37, 34], which take local feature vectors as input and output an image reconstruction. Note that LDL naturally offers privacy protection during pose selection as it only uses line segments for this phase and thus does not necessitate the clients to share their entire view with the service provider. We further demonstrate the potential of LDL for privacy protection through the experiments shown in Section B.5.

Appendix B Additional Experimental Results

t-error (m) R-error (°) Accuracy
Area LDL LDL-LS LDL LDL-LS LDL LDL-LS
Area 1 0.02 0.02 0.54 0.60 0.86 0.75
Area 2 0.02 0.05 0.66 0.79 0.77 0.57
Area 3 0.02 0.03 0.54 0.73 0.89 0.69
Area 4 0.02 0.02 0.48 0.57 0.88 0.72
Area 5 0.02 0.03 0.54 0.61 0.81 0.59
Area 6 0.02 0.02 0.50 0.58 0.83 0.66
Total 0.02 0.03 0.53 0.64 0.83 0.66
Table B.1: Ablation study of uniformly sampling query points on the unit sphere. LDL is compared against a variant using query points sampled along 2D line segment locations (LDL-LS) in the Stanford 2D-3D-S dataset [4].

B.1 Additional Ablation Study

Choice of Query Point Locations

We report the impact of choosing uniformly sampled query points for evaluating distance functions. Recall that we rank the $N_{t}\times N_{r}$ poses with the robust loss function in Equation 5, where the query points $Q$ are uniformly sampled from a unit sphere. We compare LDL against a variant that uses query points sampled along the 2D line segment locations. Namely, this variant only considers regions with line segments, in contrast to LDL that equally considers regions lacking lines.

We make quantitative comparisons between LDL and the variant using the Stanford 2D-3D-S [4] dataset. For fair comparison, we use identical hyperparameters as the original implementation of LDL. As shown in Table B.1, the uniform sampling employed in LDL leads to a large performance improvement. By equally considering all regions on the sphere, LDL effectively utilizes the spatial context from the line distance functions and performs effective localization.
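One simple way to obtain near-uniform query points on the sphere is a Fibonacci lattice, sketched below; the paper's exact sampling scheme may differ (the reported $|Q|=42$ is also consistent with the vertices of a subdivided icosahedron), so this is an illustrative assumption.

```python
import numpy as np

def fibonacci_sphere(n=42):
    """Near-uniform query points on the unit sphere via a Fibonacci lattice."""
    i = np.arange(n)
    golden = (1 + 5 ** 0.5) / 2
    z = 1 - 2 * (i + 0.5) / n                 # uniform spacing in z
    r = np.sqrt(1 - z ** 2)
    phi = 2 * np.pi * i / golden              # golden-angle spacing in azimuth
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)
```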

Method t-error (m) R-error (°) Acc.
w/ L1 Loss 0.08 1.38 0.55
w/ L2 Loss 0.17 1.48 0.34
w/ Huber Loss 0.11 1.39 0.50
w/ Median Loss 0.08 1.22 0.55
Ours 0.07 1.22 0.68
Table B.2: Ablation study on the choice of loss functions evaluated in OmniScenes [26].

Choice of Loss Function

We validate the robust loss function in Equation 5 by comparing LDL against variants using other loss functions: L1, L2, Huber, and Median loss. Here we report results from the Wedding Hall scene in OmniScenes, as this scene contains drastic scene changes with large amounts of outliers. As shown in Table B.2, the inlier counting proposed in Equation 5 attenuates outliers in the Extreme split and exhibits optimal performance, demonstrating the effectiveness of the robust loss function.

Figure B.4: 3D Lines from 3D Scanning (Left) and SfM (Right).
Method t-error (m) R-error (°) Acc.
SfM 0.03 0.80 0.85
3D Scan 0.03 0.71 0.98
Table B.3: Evaluation results of LDL on noisier line maps obtained using structure from motion and Line3D++ [22].

B.2 Evaluation in Noisier Maps

In the main paper, we extract 3D lines from point clouds obtained using Matterport 3D scanners [1]. Here we run LDL on noisier line maps created using structure-from-motion (SfM) and Line3D++ [22]. As shown in Figure B.4, these maps are noisier than those from 3D scans. Table B.3 shows the localization results from Rooms 3 and 5 in OmniScenes under the different types of line maps (note that the new pipeline did not produce reliable maps in the other scenes). Even though LDL was run with the exact same hyperparameters as in the main paper, it shows only a small performance drop, which indicates that it can robustly handle noisier SfM-based line maps generated without 3D scanners.

Figure B.5: Top-down view of offices in Stanford 2D-3D-S [4].
Method t-error (m) R-error (°) Acc.
LDL 0.02 0.54 0.90
Structure-Based 0.03 0.58 0.89
Table B.4: Evaluation on Large Scale Scenes

B.3 Additional Evaluation in Large-Scale Maps

In the main paper we demonstrated that LDL can perform competitively against the structure-based method in large scenes by testing multi-room localization in OmniScenes [26]. To further show the scalability of LDL, we evaluate on 20 office rooms from Stanford 2D-3D-S [4], and localize each image against the jointly composed 3D map. The 20 office rooms contain similar structures, as shown in Figure B.5. Even in such conditions, LDL shows similar performance to the structure-based method, as shown in Table B.4. While scalability has not been the main goal of this paper, LDL shows the potential to be deployed in large-scale localization settings containing visual ambiguities.

Accuracy (0.05 m, 5°) PC CPO SB LT CD LDL
Robot 0.66 0.88 0.86 0.85 0.27 0.92
Hand 0.77 0.77 0.73 0.72 0.22 0.82
Change Robot 0.39 0.58 0.72 0.72 0.21 0.78
Change Hand 0.45 0.58 0.68 0.70 0.22 0.72
Extreme 0.38 0.57 0.63 0.62 0.20 0.71
(a) Accuracy at translation and rotation threshold 0.05 m, 5°
Accuracy (0.05 m, 10°) PC CPO SB LT CD LDL
Robot 0.66 0.88 0.86 0.85 0.27 0.92
Hand 0.77 0.77 0.73 0.72 0.22 0.82
Change Robot 0.39 0.58 0.72 0.72 0.21 0.78
Change Hand 0.45 0.58 0.68 0.70 0.22 0.72
Extreme 0.38 0.57 0.63 0.62 0.20 0.71
(b) Accuracy at translation and rotation threshold 0.05 m, 10°
Accuracy (0.1 m, 5°) PC CPO SB LT CD LDL
Robot 0.69 0.89 0.99 0.99 0.31 0.98
Hand 0.81 0.80 0.95 0.95 0.29 0.97
Change Robot 0.41 0.59 0.93 0.94 0.30 0.95
Change Hand 0.47 0.60 0.92 0.90 0.30 0.92
Extreme 0.41 0.59 0.89 0.88 0.29 0.92
(c) Accuracy at translation and rotation threshold 0.1 m, 5°
Accuracy (0.1 m, 10°) PC CPO SB LT CD LDL
Robot 0.69 0.89 0.99 0.99 0.32 0.98
Hand 0.81 0.80 0.95 0.95 0.29 0.97
Change Robot 0.41 0.59 0.93 0.94 0.30 0.95
Change Hand 0.47 0.60 0.92 0.90 0.30 0.92
Extreme 0.41 0.59 0.89 0.88 0.29 0.92
(d) Accuracy at translation and rotation threshold 0.1 m, 10°
Accuracy (0.2 m, 5°) PC CPO SB LT CD LDL
Robot 0.70 0.89 1.00 1.00 0.34 0.99
Hand 0.81 0.81 0.98 0.98 0.32 0.99
Change Robot 0.41 0.59 0.98 0.99 0.33 0.98
Change Hand 0.48 0.60 0.97 0.97 0.34 0.97
Extreme 0.42 0.60 0.96 0.96 0.34 0.98
(e) Accuracy at translation and rotation threshold 0.2 m, 5°
Accuracy (0.2 m, 10°) PC CPO SB LT CD LDL
Robot 0.70 0.89 1.00 1.00 0.34 0.99
Hand 0.81 0.81 0.98 0.98 0.33 0.99
Change Robot 0.41 0.59 0.98 0.99 0.33 0.98
Change Hand 0.49 0.60 0.97 0.97 0.34 0.97
Extreme 0.42 0.60 0.96 0.96 0.34 0.98
(f) Accuracy at translation and rotation threshold 0.2 m, 10°
Table B.5: Localization accuracy at various thresholds in the OmniScenes [26] dataset.
Accuracy (0.05 m, 5°) PC CPO SB LT CD LDL
Area 1 0.66 0.89 0.83 0.83 0.46 0.83
Area 2 0.42 0.81 0.63 0.63 0.30 0.69
Area 3 0.53 0.76 0.81 0.82 0.34 0.86
Area 4 0.48 0.83 0.87 0.88 0.43 0.85
Area 5 0.44 0.73 0.68 0.69 0.34 0.74
Area 6 0.68 0.90 0.80 0.82 0.45 0.81
(a) Accuracy at translation and rotation threshold 0.05 m, 5°
Accuracy (0.05 m, 10°) PC CPO SB LT CD LDL
Area 1 0.66 0.90 0.83 0.83 0.46 0.83
Area 2 0.42 0.81 0.63 0.63 0.30 0.69
Area 3 0.53 0.76 0.81 0.82 0.34 0.86
Area 4 0.48 0.83 0.87 0.88 0.43 0.85
Area 5 0.44 0.73 0.68 0.69 0.34 0.74
Area 6 0.68 0.90 0.80 0.82 0.45 0.81
(b) Accuracy at translation and rotation threshold 0.05 m, 10°
Accuracy (0.1 m, 5°) PC CPO SB LT CD LDL
Area 1 0.66 0.90 0.89 0.90 0.50 0.86
Area 2 0.45 0.81 0.76 0.74 0.35 0.77
Area 3 0.57 0.78 0.92 0.88 0.36 0.89
Area 4 0.49 0.83 0.91 0.91 0.46 0.88
Area 5 0.44 0.74 0.80 0.79 0.36 0.81
Area 6 0.69 0.90 0.88 0.87 0.47 0.83
(c) Accuracy at translation and rotation threshold 0.1 m, 5°
Accuracy (0.1 m, 10°) PC CPO SB LT CD LDL
Area 1 0.66 0.90 0.89 0.90 0.50 0.86
Area 2 0.45 0.81 0.76 0.74 0.35 0.77
Area 3 0.57 0.78 0.92 0.88 0.36 0.89
Area 4 0.49 0.83 0.91 0.91 0.46 0.88
Area 5 0.44 0.74 0.80 0.79 0.36 0.81
Area 6 0.69 0.90 0.88 0.87 0.47 0.83
(d) Accuracy at translation and rotation threshold 0.1 m, 10°
Accuracy (0.2 m, 5°) PC CPO SB LT CD LDL
Area 1 0.67 0.90 0.89 0.90 0.50 0.86
Area 2 0.47 0.81 0.80 0.81 0.37 0.78
Area 3 0.59 0.81 0.96 0.93 0.41 0.95
Area 4 0.50 0.83 0.94 0.93 0.47 0.89
Area 5 0.47 0.78 0.84 0.84 0.39 0.84
Area 6 0.69 0.90 0.88 0.88 0.48 0.84
(e) Accuracy at translation and rotation threshold 0.2 m, 5°
Accuracy (0.2 m, 10°) PC CPO SB LT CD LDL
Area 1 0.67 0.90 0.89 0.90 0.50 0.86
Area 2 0.47 0.81 0.80 0.81 0.37 0.78
Area 3 0.59 0.81 0.96 0.93 0.41 0.95
Area 4 0.50 0.83 0.94 0.93 0.47 0.89
Area 5 0.47 0.78 0.84 0.84 0.39 0.84
Area 6 0.69 0.90 0.88 0.88 0.48 0.84
(f) Accuracy at translation and rotation threshold 0.2 m, 10°
Table B.6: Localization accuracy at various thresholds in the Stanford 2D-3D-S [4] dataset.

B.4 Full Localization Evaluation Results at Various Accuracy Thresholds

We share the full localization results for the OmniScenes [26] and Stanford 2D-3D-S [4] datasets in Tables B.5 and B.6, where we report the localization accuracy at various error thresholds. Our method performs competitively against all the tested baselines across the thresholds, while performing light-weight pose search with line distance functions.

B.5 Privacy Preservation Results

Method t-error (m) R-error (°) Acc.
No Filtering 0.02 0.53 0.83
Filtering 0.03 0.64 0.77
(a) Localization Performance Evaluation in Stanford 2D-3D-S [4]
Method 20-PSNR 1-SSIM MAE
No Filtering 1.1717 0.5505 0.1598
Filtering 1.7577 0.6027 0.1773
(b) Reconstruction Quality of Feature Inversion Attacks
Table B.7: Privacy-preservation evaluation of modified LDL using line-based feature filtering evaluated in Stanford 2D-3D-S dataset [4]. The simple filtering incurs only a small drop in localization accuracy while largely increasing the image error metrics.
Figure B.6: Privacy-utility curve drawn from various values of the line-based filtering threshold in the Stanford 2D-3D-S dataset. While the reconstruction quality of feature inversion attacks degrades substantially as we filter out more feature points, the localization accuracy remains relatively constant.

We share the detailed privacy evaluation results on the Stanford 2D-3D-S [4] dataset. Table B.7 shows the localization accuracy along with the feature inversion attack results. The image error metrics (20 - PSNR, 1 - SSIM, MAE) of the feature inversion attacks measured against the original panorama consistently increase for all tested scenarios, indicating that our line-based feature filtering can successfully hide visual details. This is further verified through the additional qualitative samples in Figure B.7, where sensitive visual data such as tabletop clutter is removed after filtering. Nevertheless, note that the filtering process only incurs a small drop in localization performance. We finally evaluate how LDL balances privacy (feature inversion protection) and utility (localization accuracy) while using line-based feature filtering. In Figure B.6 we plot the image error metrics of feature inversion attacks against the original image along with the localization accuracy for various line-based filtering threshold values. While the image error metrics increase substantially, the localization accuracy remains relatively constant. Thus the modified version of LDL can balance between privacy protection and accurate localization, suggesting its future potential as a robust privacy-preserving localization algorithm.

Nevertheless, the current modification cannot fully hide keypoints from large structures such as walls and ceilings. While these regions typically do not contain sensitive visual information, some users may want their entire views to be hidden from service providers. Developing a more secure line-based localization algorithm that could alleviate a wider range of concerns is left as future work.

Refer to caption
Figure B.7: Deletion of objects in feature inversion attacks after line-based filtering.

Appendix C Baseline Details

In this section, we describe the details for implementing the baselines compared against LDL. We implement PICCOLO [26] and CPO [27] from their publicly available codebases. Below we describe the structure-based, Chamfer distance-based, and line transformer-based approaches.

Structure-Based Approach

As explained in Section 4, the structure-based approach first finds promising candidate poses using robust image retrieval and then refines the poses using PnP-RANSAC from feature matches. For image retrieval we use NetVLAD [3], a widely used method that outputs a global feature vector for each image. To deploy NetVLAD in our setup, we first render N_t \times N_r synthetic views from the point cloud, using N_t = 100 candidate translations and N_r = 216 candidate rotations uniformly sampled from SO(3). Then, we extract the global features for each synthetic view and the query image, and choose the top K = 20 synthetic views whose feature vectors are closest to that of the query image. As the final step, we perform feature matching [44] between each selected synthetic view and the query image, and choose the final view with the most matches. To ensure fair comparison, we undistort the selected views and the query panorama into cubemaps and separately perform feature matching for each pair of faces. The matches are then aggregated to perform refinement via PnP-RANSAC [15].
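
To make the retrieval step concrete, the minimal sketch below ranks candidate poses by global-feature distance to the query panorama. Here, render_view and extract_global_feature are hypothetical callables standing in for the point cloud renderer and the NetVLAD forward pass.

    import numpy as np

    def retrieve_topk_views(query_pano, candidate_poses, point_cloud,
                            render_view, extract_global_feature, k=20):
        """Rank synthetic views by global-feature distance to the query panorama."""
        query_feat = extract_global_feature(query_pano)                 # (C,)
        view_feats = np.stack([
            extract_global_feature(render_view(point_cloud, pose))     # (C,) per candidate pose
            for pose in candidate_poses
        ])                                                              # (N_t * N_r, C)
        dists = np.linalg.norm(view_feats - query_feat[None], axis=1)   # L2 distance per view
        topk_idx = np.argsort(dists)[:k]                                # K closest synthetic views
        return [candidate_poses[i] for i in topk_idx]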

Chamfer Distance-Based Approach

Inspired by Micusik et al. [33], the Chamfer distance-based approach first selects poses that best align 3D lines against lines in the query image, where the Chamfer distance is used to evaluate the potential matchings. The selected poses are then refined with PnP-RANSAC, similar to the structure-based approach. To elaborate, we find the top K = 20 poses from an initial pool of N_t \times N_r poses, where the poses are ranked by the Chamfer distance between the projected 3D line segments and those in the query image. We set N_t and N_r identical to LDL and use the principal directions to deduce a set of candidate rotations. As the final step, we render views at the selected K poses and perform feature matching against the query image for refinement via PnP-RANSAC.
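
The Chamfer scoring can be sketched as below, a minimal version that samples points along the 2D segments and measures a one-sided Chamfer distance with a KD-tree; the sampling density and the one-sided formulation are illustrative choices.

    import numpy as np
    from scipy.spatial import cKDTree

    def sample_segment_points(segments, pts_per_seg=16):
        """Uniformly sample points along 2D segments given as an (N, 4) array [x1, y1, x2, y2]."""
        segments = np.asarray(segments, dtype=np.float64)
        t = np.linspace(0.0, 1.0, pts_per_seg)[None, :, None]      # (1, P, 1) interpolation weights
        p0, p1 = segments[:, None, :2], segments[:, None, 2:]      # (N, 1, 2) endpoints
        return (p0 + t * (p1 - p0)).reshape(-1, 2)                 # (N * P, 2) sampled points

    def chamfer_score(projected_segments, query_segments):
        """One-sided Chamfer distance from projected 3D lines to query-image lines (lower is better)."""
        proj_pts = sample_segment_points(projected_segments)
        query_pts = sample_segment_points(query_segments)
        tree = cKDTree(query_pts)
        nearest_dists, _ = tree.query(proj_pts)                    # nearest query-line point per sample
        return nearest_dists.mean()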

Line Transformer-Based Approach

Based on Yoon et al. [57], the line transformer-based approach finds candidate poses attaining the most line matches with the query image, and refines the poses using PnP-RANSAC. To establish line matches, we first render N_t \times N_r synthetic views from the point cloud, where we set N_t = 100 and N_r = 216. Then, the top K_1 = 100 poses whose NetVLAD [3] features are closest to that of the query image are selected. This intermediate step is necessary since the line transformer features are computationally expensive and thus cannot be naively evaluated for all N_t \times N_r views. For each synthetic view from the selected poses, we extract line transformer embeddings and establish matchings with the query image. Similar to the structure-based baseline, we convert panoramas to cubemaps during the line matching process. Finally, we select the top K_2 = 20 poses that have the most line matches, and refine them via PnP-RANSAC.
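
The final ranking step of this baseline reduces to counting line matches per candidate view, as in the minimal sketch below; match_lines is a hypothetical stand-in for the line transformer matcher.

    def rank_views_by_line_matches(candidate_views, query_lines, match_lines, k2=20):
        """Select the candidate poses with the most line matches against the query image.

        candidate_views: list of (pose, view_lines) pairs from the K_1 retrieved poses.
        match_lines: hypothetical callable returning the list of line correspondences.
        """
        match_counts = [len(match_lines(view_lines, query_lines))
                        for _, view_lines in candidate_views]
        ranked = sorted(range(len(candidate_views)),
                        key=lambda i: match_counts[i], reverse=True)
        return [candidate_views[i][0] for i in ranked[:k2]]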

Appendix D Details on Experimental Setup

In this section, we provide additional details for experiments presented in Section 4 and Section B.

Refer to caption
Figure D.8: Visualization of the 3D line segments used for LDL. While the line segment extraction algorithm from Xiaohu et al. [54] can reliably extract the wireframe-like structure from the original 3D scan, the line segments are still quite noisy. Note that we have cropped the ceilings of the original point cloud for better visualization.

Illumination Robustness Evaluation

To evaluate the robustness of LDL against illumination shifts, we apply synthetic color variations to images in Room 3 from OmniScenes [26]. We consider three synthetic color variations: average intensity, gamma, and white balance change, with qualitative examples shown in Figure 4. For the average intensity change, we lower each pixel intensity by 25%. For the gamma change, we set the image gamma to 0.2. For the white balance change, we apply the following transformation matrix to the raw RGB color values: \begin{pmatrix}0.7&0&0\\ 0&0.9&0\\ 0&0&0.8\end{pmatrix}.
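
For reference, the three variations could be applied as in the minimal sketch below, assuming RGB values normalized to [0, 1] and the out = img ** gamma convention for the gamma change.

    import numpy as np

    def apply_color_variation(img, mode):
        """Apply one of the synthetic color variations to an RGB image with values in [0, 1]."""
        img = img.astype(np.float32)
        if mode == "intensity":
            out = img * 0.75                                   # lower every pixel intensity by 25%
        elif mode == "gamma":
            out = np.power(img, 0.2)                           # set the image gamma to 0.2 (assumed convention)
        elif mode == "white_balance":
            wb = np.array([0.7, 0.9, 0.8], dtype=np.float32)   # diagonal of the white balance matrix
            out = img * wb[None, None, :]
        else:
            raise ValueError(f"unknown mode: {mode}")
        return np.clip(out, 0.0, 1.0)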

Candidate Pose Search Evaluation

We compare LDL against NetVLAD [3] for candidate pose search using the Extreme split from OmniScenes. The recall curves in Figure 5 are obtained by measuring the localization performance of both methods prior to pose refinement. As mentioned in Section 4.2, we use the identical set of translations with N_t = 50 for both methods and associate a large number of candidate rotations N_r = 216 with NetVLAD to ensure fair comparison. Such a measure is taken for rotations since LDL estimates rotations using combinatorial matchings of principal directions, which makes the number of candidate rotations vary across query images. We empirically find that fewer than 30 candidate rotations remain after discarding infeasible ones, so setting N_r = 216 for NetVLAD provides enough candidates to achieve competitive performance against LDL.
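
As a reference for how such recall curves may be computed, the minimal sketch below assumes recall@k is measured as the fraction of queries for which at least one of the top-k candidate poses falls within fixed translation and rotation error thresholds; the threshold values here are illustrative.

    import numpy as np

    def recall_at_k(candidate_errors, t_thresh=0.1, r_thresh=10.0):
        """Compute recall@k for k = 1..K from ranked candidate pose errors.

        candidate_errors: list over queries, where each entry is a (K, 2) array of
        (translation error in meters, rotation error in degrees) for the ranked candidates.
        Returns an array of length K with the fraction of queries that have at least
        one sufficiently accurate pose among the top-k candidates.
        """
        hits = []
        for errs in candidate_errors:
            errs = np.asarray(errs)
            ok = np.logical_and(errs[:, 0] <= t_thresh, errs[:, 1] <= r_thresh)
            hits.append(np.cumsum(ok) > 0)        # True once any hit appears within the top-k
        return np.mean(np.stack(hits), axis=0)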

Feature Inversion Network for Privacy Evaluation

To evaluate the privacy protection of LDL against feature inversion attacks, we train a fully convolutional neural network F_{\Theta}(\cdot) that takes a sparse feature map D \in \mathbb{R}^{H \times W \times C} as input and produces image reconstructions. The feature map stores local feature descriptors \mathbf{f} \in \mathbb{R}^{C} at keypoint locations (i_{\text{kpt}}, j_{\text{kpt}}), namely D(i_{\text{kpt}}, j_{\text{kpt}}) = \mathbf{f}, and zeros elsewhere. For the inversion network, we use a U-Net architecture similar to Ng et al. [34], where the only difference is the input channel dimension, which we set to 256 instead of 128 to match the SuperPoint [12] descriptor dimension. For training, we use the Matterport3D [9] dataset, with the first 90% of the 9581 panorama images used for training and the rest for validation. We follow the training procedure of Ng et al. [34] and use the perceptual loss and mean absolute error (MAE) loss, employing Adam [28] with a learning rate of 1e-4 for optimization. In our experiments, we use the trained network to reconstruct panoramas from the local feature descriptors, and share the reconstruction results along with the image error metrics in Section 4 and Section B. To elaborate, during evaluation we first extract local features for each query image in the Stanford 2D-3D-S dataset [4] and run feature inversion, where the results are then compared against the original panorama image.
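
A minimal sketch of constructing the sparse input feature map D is shown below, assuming a channel-first tensor layout; keypoints that land on the same pixel simply overwrite one another in this sketch.

    import torch

    def build_sparse_feature_map(keypoints, descriptors, height, width):
        """Scatter local feature descriptors into a dense, zero-initialized map.

        keypoints:   (N, 2) integer pixel coordinates (row, col).
        descriptors: (N, C) local descriptors (C = 256 for SuperPoint).
        Returns a (C, H, W) tensor usable as input to the U-Net style inversion network.
        """
        keypoints = torch.as_tensor(keypoints)
        descriptors = torch.as_tensor(descriptors, dtype=torch.float32)
        channels = descriptors.shape[1]
        feat_map = torch.zeros(channels, height, width)
        rows, cols = keypoints[:, 0].long(), keypoints[:, 1].long()
        feat_map[:, rows, cols] = descriptors.t()      # place each descriptor at its keypoint location
        return feat_map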

3D Line Maps for Localization

In Figure D.8, we show visualizations of the 3D lines used as input to LDL. Despite the reliability of the 3D line extraction algorithm of Xiaohu et al. [54], the lines are still quite noisy. To cope with the noisy detections, LDL employs a length-based filtering scheme that only keeps long, salient lines, and matches the distribution of lines using line distance functions instead of establishing direct one-to-one matches as in previous works [33, 57].
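
A minimal sketch of the length-based filtering is given below; the threshold value here is illustrative, as the actual value is a tunable hyperparameter.

    import numpy as np

    def filter_lines_by_length(segments, min_length=0.5):
        """Keep only salient 3D line segments that are at least min_length meters long.

        segments: (N, 6) array of segment endpoints [x1, y1, z1, x2, y2, z2].
        min_length = 0.5 is an illustrative default, not the value used by LDL.
        """
        segments = np.asarray(segments, dtype=np.float64)
        lengths = np.linalg.norm(segments[:, 3:] - segments[:, :3], axis=1)
        return segments[lengths >= min_length]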

References

  • [1] Matterport 3d: How long does it take to scan a property? https://support.matterport.com/hc/en-us/articles/229136307-How-long-does-it-take-to-scan-a-property-. Accessed: 2020-02-18.
  • [2] Hichem Abdellali, Robert Frohlich, Viktor Vilagos, and Zoltan Kato. L2d2: Learnable line detector and descriptor. In 2021 International Conference on 3D Vision (3DV), pages 442–452, 2021.
  • [3] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [4] Iro Armeni, Sasha Sax, Amir R Zamir, and Silvio Savarese. Joint 2d-3d-semantic data for indoor scene understanding. arXiv preprint arXiv:1702.01105, 2017.
  • [5] Eric Brachmann, Alexander Krull, Sebastian Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother. Dsac — differentiable ransac for camera localization. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2492–2500, 2017.
  • [6] Christian Cachin, Idit Keidar, and Alexander Shraer. Trusting the cloud. SIGACT News, 40(2):81–86, jun 2009.
  • [7] Dylan Campbell, Lars Petersson, Laurent Kneip, and Hongdong Li. Globally-optimal inlier set maximisation for camera pose and correspondence estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, page preprint, June 2018.
  • [8] Dylan Campbell, Lars Petersson, Laurent Kneip, Hongdong Li, and Stephen Gould. The alignment of the spheres: Globally-optimal spherical mixture alignment for camera pose estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page to appear, Long Beach, USA, June 2019. IEEE.
  • [9] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. International Conference on 3D Vision (3DV), 2017.
  • [10] Gabriela Csurka and Martin Humenberger. From handcrafted to deep local invariant features. CoRR, abs/1807.10254, 2018.
  • [11] Deeksha Dangwal, Vincent T. Lee, Hyo Jin Kim, Tianwei Shen, Meghan Cowan, Rajvi Shah, Caroline Trippel, Brandon Reagen, Timothy Sherwood, Vasileios Balntas, Armin Alaghi, and Eddy Ilg. Analysis and mitigations of reverse engineering attacks on local feature descriptors, 2021.
  • [12] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR Deep Learning for Visual SLAM Workshop, 2018.
  • [13] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-Net: A Trainable CNN for Joint Detection and Description of Local Features. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
  • [14] Mihai Dusmanu, Johannes Schönberger, Sudipta Sinha, and Marc Pollefeys. Privacy-preserving image features via adversarial affine subspace embeddings. In Computer Vision and Pattern Recognition (CVPR 2021). CVF/IEEE, June 2021.
  • [15] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, 1981.
  • [16] Shuang Gao, Jixiang Wan, Yishan Ping, Xudong Zhang, Shuzhou Dong, Jijunnan Li, and Yandong Guo. Pose refinement with joint optimization of visual points and lines. 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2888–2894, 2021.
  • [17] Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-supervising fine-grained region similarities for large-scale image localization. In European Conference on Computer Vision, 2020.
  • [18] Rafael Grompone von Gioi, Jeremie Jakubowicz, Jean-Michel Morel, and Gregory Randall. Lsd: A fast line segment detector with a false detection control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4):722–732, 2010.
  • [19] Geonmo Gu, Byungsoo Ko, SeoungHyun Go, Sung-Hyun Lee, Jingeun Lee, and Minchul Shin. Towards real-time and light-weight line segment detection. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  • [20] Stephen Hausler, Sourav Garg, Ming Xu, Michael Milford, and Tobias Fischer. Patch-netvlad: Multi-scale fusion of locally-global descriptors for place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14141–14152, 2021.
  • [21] J. A. Hesch and S. I. Roumeliotis. A direct least-squares (dls) method for pnp. In 2011 International Conference on Computer Vision, pages 383–390, 2011.
  • [22] Manuel Hofer, Michael Maurer, and Horst Bischof. Efficient 3d scene abstraction using line segments. Computer Vision and Image Understanding, 157:167–178, 2017. Large-Scale 3D Modeling of Urban Indoor or Outdoor Scenes from Images and Range Scans.
  • [23] Martin Humenberger, Yohann Cabon, Nicolas Guerin, Julien Morat, Jérôme Revaud, Philippe Rerole, Noé Pion, Cesar de Souza, Vincent Leroy, and Gabriela Csurka. Robust image retrieval-based visual localization using kapture, 2020.
  • [24] Martin Humenberger, Yohann Cabon, Nicolas Guérin, Julien Morat, Jérôme Revaud, Philippe Rerole, Noé Pion, César Roberto de Souza, Vincent Leroy, and Gabriela Csurka. Robust image retrieval-based visual localization using kapture. CoRR, abs/2007.13867, 2020.
  • [25] W. Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A, 32(5):922–923, Sep 1976.
  • [26] Junho Kim, Changwoon Choi, Hojun Jang, and Young Min Kim. Piccolo: Point cloud-centric omnidirectional localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3313–3323, October 2021.
  • [27] Junho Kim, Hojun Jang, Changwoon Choi, and Young Min Kim. Cpo: Change robust panorama to point cloud localization. ECCV, 2022.
  • [28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • [29] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o(n) solution to the pnp problem. International Journal Of Computer Vision, 81:155–166, 2009.
  • [30] Xiaotian Li, Shuzhe Wang, Yi Zhao, Jakob Verbeek, and Juho Kannala. Hierarchical scene coordinate classification and regression for visual localization. In CVPR, 2020.
  • [31] David G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
  • [32] Branislav Micusik and Horst Wildenauer. Structure from motion with line segments under relaxed endpoint constraints. In 2014 2nd International Conference on 3D Vision, volume 1, pages 13–19, 2014.
  • [33] Branislav Micusik and Horst Wildenauer. Descriptor free visual indoor localization with line segments. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3165–3173, 2015.
  • [34] Tony Ng, Hyo Jin Kim, Vincent T. Lee, Daniel DeTone, Tsun-Yi Yang, Tianwei Shen, Eddy Ilg, Vassileios Balntas, Krystian Mikolajczyk, and Chris Sweeney. Ninjadesc: Content-concealing visual descriptors via adversarial learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12797–12807, June 2022.
  • [35] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • [36] N. Pion, M. Humenberger, G. Csurka, Y. Cabon, and T. Sattler. Benchmarking image retrieval for visual localization. In 2020 International Conference on 3D Vision (3DV), pages 483–494, Los Alamitos, CA, USA, nov 2020. IEEE Computer Society.
  • [37] Francesco Pittaluga, Sanjeev J Koppal, Sing Bing Kang, and Sudipta N Sinha. Revealing scenes by inverting structure from motion reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 145–154, 2019.
  • [38] O. B. P. M. Rodenberg, E. Verbree, and S. Zlatanova. Indoor A* Pathfinding Through an Octree Representation of a Point Cloud. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, IV21:249–255, Oct. 2016.
  • [39] Yohann Salaün, Renaud Marlet, and Pascal Monasse. Robust and accurate line- and/or point-based pose estimation without manhattan assumptions. In European Conference on Computer Vision, 2016.
  • [40] Y. Salaun, R. Marlet, and P. Monasse. Line-based robust sfm with little image overlap. In 2017 International Conference on 3D Vision (3DV), pages 195–204, Los Alamitos, CA, USA, oct 2017. IEEE Computer Society.
  • [41] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. CoRR, abs/1812.03506, 2018.
  • [42] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. CoRR, abs/1911.11763, 2019.
  • [43] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In CVPR, 2019.
  • [44] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In CVPR, 2020.
  • [45] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Improving image-based localization by active correspondence search. In Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid, editors, Computer Vision – ECCV 2012, pages 752–765, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
  • [46] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. IEEE Trans. Pattern Anal. Mach. Intell., 39(9):1744–1756, 2017.
  • [47] Johannes L. Schonberger, Hans Hardmeier, Torsten Sattler, and Marc Pollefeys. Comparative evaluation of hand-crafted and learned local features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [48] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.
  • [49] Pablo Speciale, Johannes L. Schönberger, Sing Bing Kang, Sudipta N. Sinha, and Marc Pollefeys. Privacy preserving image-based localization. 2019.
  • [50] Pablo Speciale, Johannes Schönberger, Sudipta Sinha, and Marc Pollefeys. Privacy preserving image queries for camera localization. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1486–1496, 2019.
  • [51] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. InLoc: Indoor Visual Localization with Dense Matching and View Synthesis. In CVPR 2018 - IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, United States, June 2018.
  • [52] Felix Taubner, Florian Tschopp, Tonci Novkovic, Roland Siegwart, and Fadri Furrer. Lcd – line clustering and description for place recognition, 2020.
  • [53] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  • [54] Lu Xiaohu, Liu Yahui, and Li Kai. Fast 3d line segment detection from unorganized point cloud. arXiv preprint arXiv:1901.02532, 2019.
  • [55] Nan Xue, Song Bai, Fudong Wang, Gui-Song Xia, Tianfu Wu, and Liangpei Zhang. Learning attraction field representation for robust line segment detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [56] W. Yang, Y. Qian, J. Kämäräinen, F. Cricri, and L. Fan. Object detection in equirectangular panorama. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 2190–2195, Aug 2018.
  • [57] Sungho Yoon and Ayoung Kim. Line as a visual sentence: Context-aware line descriptor for visual localization. IEEE Robotics and Automation Letters, 6(4):8726–8733, 2021.
  • [58] Xin Yu, Sagar Chaturvedi, Chen Feng, Yuichi Taguchi, Teng-Yok Lee, Clinton Fernandes, and Srikumar Ramalingam. Vlase: Vehicle localization by aggregating semantic edges. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3196–3203, 2018.
  • [59] Ziheng Zhang, Zhengxin Li, Ning Bi, Jia Zheng, Jinlei Wang, Kun Huang, Weixin Luo, Yanyu Xu, and Shenghua Gao. Ppgnet: Learning point-pair graph for line segment detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.