
Segment Anything Model is a Good Teacher for Local Feature Learning

Jingqian Wu    Rongtao Xu    Zach Wood-Doughty    Changwei Wang   
Shibiao Xu
Edmund Y. Lam
Jingqian Wu and Edmund Y. Lam are with The University of Hong Kong, Pokfulam, Hong Kong. Rongtao Xu is with the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China, and the School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100190, China. Zach Wood-Doughty is with Northwestern University, Evanston, IL 60201, USA. Changwei Wang (Corresponding author) is with the Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250013, China; the Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan, China; and the State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. Email: [email protected]. Shibiao Xu is with the School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China.
Abstract

Local feature detection and description play an important role in many computer vision tasks, and are designed to detect and describe keypoints in any scene and any downstream task. Data-driven local feature learning methods need to rely on pixel-level correspondence for training. However, most existing approaches ignore the semantic information on which humans rely to describe image pixels. In addition, it is not feasible to enhance generic-scene keypoint detection and description simply by using traditional semantic segmentation models, because they can only recognize a limited number of coarse-grained object classes. In this paper, we propose SAMFeat, which introduces SAM (Segment Anything Model), a foundation model trained on 11 million images, as a teacher to guide local feature learning. SAMFeat learns the additional semantic information brought by SAM and thus achieves higher performance even with limited training samples. To do so, first, we construct an auxiliary task of Attention-weighted Semantic Relation Distillation (ASRD), which adaptively distills feature relations with category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature description using semantic discrimination. Second, we develop a technique called Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic groupings derived from SAM as weakly supervised signals to optimize the metric space of local descriptors. Third, we design an Edge Attention Guidance (EAG) to further improve the accuracy of local feature detection and description by prompting the network to pay more attention to the edge regions guided by SAM. SAMFeat’s performance on various tasks, such as image matching on HPatches and long-term visual localization on Aachen Day-Night, showcases its superiority over previous local features. The code is released at https://github.com/vignywang/SAMFeat.

Index Terms:
Local Feature and Descriptor Learning, Segment Anything Model, Computer Vision

I Introduction

Local feature detection and description is a basic task of computer vision, which is widely used in image matching [1], structure from motion (SfM) [2], simultaneous localization and mapping (SLAM) [3], visual localization [4], and image retrieval [5]. Traditional schemes such as SIFT [6] and ORB [7], which are based on hand-crafted heuristics, are unable to cope with drastic illumination and viewpoint changes [1]. With the rise of deep learning, many data-driven local feature learning methods [8, 9] have recently achieved excellent performance. While many works train local descriptors on accurate and dense ground-truth correspondences [10] between image pairs, most of them ignore the semantic information on which humans rely to describe image pixels. Although a few previous works adopted the straightforward idea of using common semantic segmentation models to facilitate keypoint detection and description, this is not feasible in practice because such models can only recognize a limited number of coarse-grained object categories and are not competent for keypoint detection and description in generalized scenarios [11].

Recently, foundation models [12] have revolutionized the field of artificial intelligence. These models, trained on billions of examples, present strong zero-shot generalization capabilities across a variety of downstream tasks. In this study, we advocate the integration of SAM [11], a foundation model that is able to segment “anything” in “any scene”, into the realm of local feature learning. This synergy enhances the robustness and enriches the supervised signals available for local feature learning, encompassing high-level category-agnostic semantics and detailed low-level edge structure information.

Figure 1: (a): Difference between segment anything model and common semantic segmentation model. (b): Schematic diagram of proposed SAMFeat.

In recent years, several works have attempted to introduce pixel-level semantics of images (i.e., semantic segmentation) into local feature learning-based visual localization. Some methods utilized semantic information to filter keypoints [13] and optimize matching [14], while other works utilized semantic information [15] to guide the learning of keypoint detection and improve the performance of local descriptors in a specific visual localization setting by using feature-level distillation. However, these visual localization pipelines and methods are based on common semantic segmentation models and are difficult to generalize to feature matching tasks. As shown in Fig. 1 (a), this is because, on one hand, common semantic segmentation can only assign semantics to limited categories (e.g., cars, streets, people), which is difficult to generalize to generic scenarios and open-world situations [11]. On the other hand, the semantic information from semantic segmentation is coarse-grained, e.g., the pixels of a car's wheels and windows are given the same label. This is detrimental to mining the unique discriminative properties of local features.

The recent SAM [11] is a visual foundation model trained on 11 million images that can segment any object based on prompt input. Compared to common semantic segmentation models, SAM has three unique properties that can be used to fuel local feature learning. i) SAM is trained on a large amount of data and therefore can segment any object and adapt to any scene, rather than being limited to certain categories and scenes like common semantic segmentation models. SAM's robust zero-shot performance can be a helpful enhancement for learning and describing features in complex scenes. ii) SAM can obtain fine-grained, component-level semantic segmentation results, thus allowing for more accurate modeling of semantic relationships between pixels. In addition, SAM can derive fine-grained category-agnostic semantic masks that can be used as semantic groupings of pixels to guide local feature learning. iii) SAM can detect more detailed edges; edge regions are more likely to contain keypoints and carry more distinguishing information, which helps feature learning by providing accurate guidance for keypoint localization.

In our SAMFeat, we propose three special strategies to boost the performance of local feature learning based on these three properties of SAM. First, we construct an auxiliary task of Attention-weighted Semantic Relation Distillation (ASRD) for distilling category-agnostic pixel semantic relations learned by the SAM encoder into a local feature learning network with attention guidance, thus using semantic discriminativeness to improve local feature description. Second, we develop a technique called Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC) to optimize the metric space of local descriptors using SAM-derived semantic groupings as weakly supervised signals. Third, we design an Edge Attention Guidance (EAG) to further improve the localization accuracy and description ability of local features by prompting the network to pay more attention to edge regions. Since the SAM model is only used as a teacher during training, our SAMFeat can efficiently extract local features during inference without incurring the computational cost of the SAM network.

II Related Work

Local Features and Beyond. Early hand-crafted local features have been investigated for decades and are comprehensively evaluated in [16]. In the wave of deep learning, many data-driven learnable local features have been proposed, focusing on detectors [17, 18], descriptors [19, 20, 21, 22], or end-to-end detection and description [23, 24, 25, 26, 27, 28, 29]. Beyond local features, some learnable advanced matchers have recently been developed to replace the traditional nearest neighbor (NN) matcher and obtain more accurate matching results. Sparse matchers such as SuperGlue [30] and LightGlue [31] take off-the-shelf local features as input to predict matches using a GNN or Transformer; however, their time complexity scales quadratically with the number of keypoints. Dense matchers [32, 33] compute correspondences between pixels end-to-end based on a correlation volume, but they consume more memory and computation than sparse matchers [15]. Our work centers on enhancing the efficiency and performance of an end-to-end generalized local feature learning approach. We aim to achieve performance comparable to advanced matchers while only using nearest-neighbor matching across various downstream tasks. This is particularly crucial in resource-constrained scenarios demanding high operational efficiency.

Segment Anything Model. The Segment Anything Model (SAM) [11] has achieved remarkable advancements in expanding the scope of segmentation tasks, thereby significantly fostering the evolution of foundation models in computer vision. SAM incorporates prompt learning techniques from NLP to flexibly build models and constructs a data engine through interactive annotation, performing well on tasks such as instance segmentation, edge detection, object proposals, and text-to-mask prediction. SAM is specifically designed to address the challenge of segmenting a wide range of objects in complex visual scenes. Unlike traditional approaches that focus on segmenting specific object classes, SAM's primary objective is to segment anything, providing a versatile solution for diverse and challenging scenarios. Many works [34, 35] now build upon SAM for downstream vision tasks such as medical imaging, video, and data annotation [36]. Unlike them, we advocate for the application of SAM to local feature learning. To the best of our knowledge, our work is the first to apply SAM to feature learning and matching tasks. Other newly proposed state-of-the-art work does incorporate visual foundation models to tackle feature learning tasks. For example, ROMA [37] incorporates the encoder of DINO-V2 [38] directly into its feature learning framework and fine-tunes it in the training stage. However, the computational cost for training and inference of such a method is extremely high. Since real-time local feature matching applications require high operational efficiency, we choose not to incorporate SAM directly into the pipeline. To leverage the knowledge from SAM while maintaining high efficiency and inference speed, we treat SAM as a teacher to bootstrap local feature learning, thus using SAM only in the training phase.

Semantics in Local Feature Learning. Prior to our work, semantic information had solely been incorporated into the visual localization task as a means to mitigate the challenges posed by low-level local features when dealing with severe image variations. Some early works incorporated semantic segmentation into the visual localization pipeline for filtering matching points [39, 40], improving 2D-3D matching [41, 42], and estimating camera position [43]. Some recent works [44, 15] have attempted to introduce semantics into local feature learning to improve the performance of visual localization. Based on the assumption that high-level semantics are insensitive to photometric and geometric changes, they enhance the robustness of local descriptors on semantic categories by distilling features or outputs from semantic segmentation networks. However, semantic segmentation can only segment certain specific categories (e.g., visual localization-related street scenes), preventing such approaches from generalizing to open-world scenarios and making them effective only on visual localization tasks. In contrast, we introduce SAM, which segments any scene, as a distillation target and propose the category-agnostic Attention-weighted Semantic Relation Distillation (ASRD) scheme to enable local feature learning to enjoy semantic information in scenes beyond visual localization. In addition, we also propose Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC) and Edge Attention Guidance (EAG) to further boost the performance of local features based on the special properties of SAM. With these improvements, our SAMFeat allows local feature learning to more fully utilize semantic information and benefit a wider range of scenarios.

III Methodology

III-A Overview

Figure 2: The overview of our SAMFeat, which performs feature detection, description, edge depiction, and feature distillation end-to-end.

Our SAMFeat uses a backbone for feature extraction, along with several heads serving different purposes, trained end-to-end. A high-level overview of SAMFeat is shown in Figure 2, and the detailed network structure is shown in Figure 3.

Backbone. We use an eight-layer VGG-style backbone encoder following [8] to extract feature maps. The encoder consists of $3\times 3$ convolutional layers, ReLU layers, and max-pooling layers. For an $H\times W$ image $I$, we concatenate the multiscale feature map outputs $C_{1}\in\mathbb{R}^{H\times W\times 64}$, $C_{2}\in\mathbb{R}^{\frac{1}{2}H\times\frac{1}{2}W\times 64}$, $C_{3}\in\mathbb{R}^{\frac{1}{4}H\times\frac{1}{4}W\times 128}$, and $C_{4}\in\mathbb{R}^{\frac{1}{8}H\times\frac{1}{8}W\times 128}$, which are delivered to the keypoint detection head (Det Head), edge head (Edge Head), attention head (Att Head), distillation head (Distill Head), and descriptor head (Des Head).

Keypoint Detection Head. We employ four detection layers to predict keypoint heatmaps at various scales. To integrate these predictions, we upsample the heatmaps to match the image dimensions $h\times w$. We then use four learnable weights to merge the heatmaps from different scales, thereby predicting the final keypoints and calculating the associated loss. Each detection head receives direct supervision through the detector loss, providing a form of deep supervision as described in [29].

We utilize a weighted binary cross-entropy loss for the detector due to the significant imbalance between keypoints and non-keypoints. With the predicted keypoint heatmap $H\in\mathbb{R}^{h\times w}$ and the pseudo-ground-truth label $G\in\mathbb{R}^{h\times w}$, the detector loss $\mathcal{L}_{\text{det}}$ is defined as follows:

\mathcal{L}_{\text{bce}}(h,g)=-\lambda g\log(h)-(1-g)\log(1-h), \qquad (1)
\mathcal{L}_{\text{det}}=\frac{1}{hw}\sum_{u,v}\mathcal{L}_{\text{bce}}(H_{u,v},G_{u,v}), \qquad (2)

where the weight $\lambda$ is empirically set to 200.
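For concreteness, a minimal PyTorch sketch of the weighted binary cross-entropy of Eqs. (1)–(2) is given below; the tensor names and the clamping for numerical stability are our own illustrative choices rather than details taken from the paper.

```python
import torch

def detector_loss(H: torch.Tensor, G: torch.Tensor, lam: float = 200.0,
                  eps: float = 1e-7) -> torch.Tensor:
    """Weighted binary cross-entropy of Eqs. (1)-(2).

    H: predicted keypoint heatmap in (0, 1), shape (h, w) or (B, h, w).
    G: pseudo-ground-truth keypoint labels in {0, 1}, same shape as H.
    lam: weight on the (rare) keypoint class.
    """
    H = H.clamp(eps, 1.0 - eps)  # numerical stability for the logarithms
    bce = -lam * G * torch.log(H) - (1.0 - G) * torch.log(1.0 - H)
    return bce.mean()            # average over all h*w pixels (Eq. 2)
```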

Attention Head. The Attention Head is designed to generate the attention map $A\in\mathbb{R}^{\frac{1}{4}h\times\frac{1}{4}w}$. Specifically, we obtain $C_{\text{cat}}$ by concatenating the feature maps $C_{1}$, $C_{2}$, $C_{3}$, and $C_{4}$ from the backbone network. Next, $C_{\text{cat}}$ is averaged across the channel dimension, and the attention map $A$ is generated from this averaged feature map using a $3\times 3$ convolutional layer followed by a softplus activation function.
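A PyTorch sketch of this head is shown below; how the multi-scale maps are brought to a common 1/4-resolution grid before concatenation is not specified here, so the bilinear resizing is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    """Attention map A at 1/4 resolution from concatenated backbone features."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # 3x3 conv on the averaged map

    def forward(self, feats):                      # feats: [C1, C2, C3, C4], each (B, Ci, Hi, Wi)
        size = feats[2].shape[-2:]                 # 1/4-resolution grid (that of C3), an assumption
        feats = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                 for f in feats]
        c_cat = torch.cat(feats, dim=1)            # C_cat: (B, sum Ci, h/4, w/4)
        avg = c_cat.mean(dim=1, keepdim=True)      # average across the channel dimension
        return F.softplus(self.conv(avg))          # non-negative attention map A
```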

The attention map proves to be highly useful in descriptor optimization, distillation optimization, and matching [45]. In terms of descriptor and distillation optimization, we provide a detailed analysis of the motivation in the descriptor and distillation parts below. For matching, attention-weighted local descriptors are more effective: regions with high attention scores in one image are likely to match with similar high-score regions in another image, reducing the matching space and enhancing accuracy. These consistent attention scores serve as prior information, making local descriptor matching more efficient and reliable. The Attention Head is jointly optimized during the optimization of both descriptor generation and feature distillation.

Feature Description Head. Initially, we densely extract descriptors for all pixels to form the set $D$. We then extract descriptors $d$ for individual pixels at their respective locations. In accordance with previous studies [29], we apply L2 normalization to obtain the local descriptors, defined as:

d=\frac{D_{(i,j)}}{\|D_{(i,j)}\|}, \qquad (3)

where $\|\cdot\|$ represents the L2 norm and $(i,j)$ denotes the index position of the local descriptor $d$.

Inspired by [45, 29], the descriptor loss splits the optimization into two parts: the descriptor’s angle and the consistent attention weight. This differs from the standard triplet loss, which only focuses on the angle component. Positive samples have converging attention scores, while negative samples diverge, leading to a consistent distribution of attention scores across image pairs. Higher attention scores significantly influence the gradient, allowing the network to prioritize more relevant samples and avoid optimizing descriptors for less informative pixels, like the sky or grass.

The descriptor loss is formulated to jointly optimize local descriptors and consistent attention. Building on previous research [29, 45], we use the corresponding point sets $(P,P^{\prime})$ obtained from ground-truth camera parameters and depths to supervise the training of local descriptors. For an image pair $(I,I^{\prime})$, dense descriptors $D,D^{\prime}$ and attention maps $A,A^{\prime}$ are extracted using our SAMFeat method.

Given a point set $P$ of size $M$ in image $I$ and the corresponding points $P^{\prime}$ in image $I^{\prime}$, the local descriptors of $P$ and $P^{\prime}$ obtained by Equation 3 are denoted as $d_{i}$ and $d^{\prime}_{i}$ respectively, where $i\in\{1,\ldots,M\}$. The score of the descriptor $d_{i}$ on the attention map $A$ is denoted as $\nu_{i}$. Thus, the attention-weighted descriptor is defined as $y_{i}=\nu_{i}\cdot d_{i}$. For $y_{i}$, its positive distance $\|y_{i}\|^{+}$ is defined as:

\|y_{i}\|^{+}=\|\nu_{i}\cdot d_{i}-\nu^{\prime}_{i}\cdot d^{\prime}_{i}\|_{2}, \qquad (4)

and its hardest negative distance $\|y_{i}\|^{-}$ is defined as:

\|y_{i}\|^{-}=\min_{j\in\{1,\ldots,M\},\,j\neq i}\|\nu_{i}\cdot d_{i}-\nu^{\prime}_{j}\cdot d^{\prime}_{j}\|_{2}. \qquad (5)

The overall descriptor loss is the sum of the individual losses:

\mathcal{L}_{\text{des}}(y)=\sum_{i=1}^{M}\frac{e^{\nu_{i}/T}}{\sum_{j=1}^{M}e^{\nu_{j}/T}}\max(0,\|y_{i}\|^{+}-\|y_{i}\|^{-}+1), \qquad (6)

where $\nu_{i}$ is the attention score corresponding to $y_{i}$, and $T$ is a smoothing factor that adjusts the effect of attention weighting on the loss. $T$ is set to 15 following [45].
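A compact PyTorch sketch of the attention-weighted descriptor loss of Eqs. (4)–(6) might look as follows; point sampling, batching, and the variable names are illustrative assumptions.

```python
import torch

def descriptor_loss(d, d_prime, nu, nu_prime, T: float = 15.0, margin: float = 1.0):
    """Attention-weighted hardest-negative triplet loss, roughly Eqs. (4)-(6).

    d, d_prime:   (M, C) L2-normalized descriptors of corresponding points in I and I'.
    nu, nu_prime: (M,)   attention scores of those points read from A and A'.
    """
    y = nu.unsqueeze(1) * d                          # attention-weighted descriptors y_i
    y_prime = nu_prime.unsqueeze(1) * d_prime
    dist = torch.cdist(y, y_prime)                   # (M, M) pairwise L2 distances
    pos = dist.diag()                                # ||y_i||^+ : matching pairs (Eq. 4)
    masked = dist + torch.eye(len(d), device=d.device) * 1e6   # exclude the positive pair
    neg = masked.min(dim=1).values                   # ||y_i||^- : hardest negatives (Eq. 5)
    weight = torch.softmax(nu / T, dim=0)            # attention-based sample weighting
    return (weight * torch.clamp(pos - neg + margin, min=0.0)).sum()   # Eq. (6)
```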

Edge Head. The Edge Head is designed to learn edge, corner, and keypoint information from SAM. The edge map prediction is generated simply via a convolutional layer and a sigmoid layer, taking the concatenated features output by the shared backbone. With a simple edge map supervision loss, SAMFeat is able to mimic SAM's edge knowledge and generate accurate edge maps, as shown in Fig. 8. To further utilize the learned edge information, we design the Edge Attention Guidance Module to enhance keypoint detection. The corresponding edge map loss and the mechanism behind the guidance module are introduced in detail in Sections III-C and III-D.

Distill Head. The Distill Head is designed to distill robust prior feature knowledge from the powerful SAM backbone for better image and scene understanding and recognition. The head consists of a single convolutional operation, and the learned prior features are fused back into the network by another convolutional operation. The supervision loss and feature fusion process are presented in detail in Section III-C.

III-B Gifts from SAM

SAM [11] is a newly released visual foundation model for segmenting any object, and it has strong zero-shot generalization because it is trained on 11 million images and 1.1 billion masks. Due to its scale, model distillation [46] is deployed in this work. We freeze the weights of SAM and use its outputs as pseudo-ground truth to guide more accurate and robust local feature learning. In this subsection, we introduce how we utilize three gifts from SAM to enhance our SAMFeat. As shown in Figure 3, we input the image $I$ into SAM [11] with frozen parameters, and its outputs are then simply processed to produce the following three items for guiding local feature learning.

Pixel-wise Representation Relationship: SAM's image encoder, trained on 11 million images, is used to extract image representations for assigning semantic labels. The representations output by the encoder imply valuable semantic correspondences, i.e., pixels of the same semantic object are closer together. To eliminate the effect of specific semantic categories on generalizability, we adopt relations between representations as distillation targets. SAM's encoder outputs $\mathcal{F}\in\mathbb{R}^{\frac{1}{8}H\frac{1}{8}W\times C}$, where $C$ is the number of feature channels. The pixel-wise representation relationship can be defined as $\mathcal{R}\in\mathbb{R}^{\frac{1}{8}H\frac{1}{8}W\times\frac{1}{8}H\frac{1}{8}W}$, where $\mathcal{R}(i,j)=\frac{\mathcal{F}(i)\cdot\mathcal{F}(j)}{|\mathcal{F}(i)||\mathcal{F}(j)|}$.
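This relation matrix is simply the cosine similarity between all pairs of pixel representations; a minimal sketch (applicable to both the SAM teacher features and the student features) is given below, with illustrative shapes.

```python
import torch
import torch.nn.functional as F

def relation_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Pixel-wise relation matrix R(i, j) = cosine similarity of representations.

    feat: (N, C) flattened feature map, e.g. N = (H/8) * (W/8) spatial positions.
    Returns an (N, N) matrix whose entries follow the definition of R above.
    """
    feat = F.normalize(feat, dim=1)  # unit-norm rows, so dot product = cosine
    return feat @ feat.t()
```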

Semantic Grouping: We use SAM's automatic mask generation function (https://github.com/facebookresearch/segment-anything/) to obtain fine-grained semantic groupings. Specifically, it samples single-point input prompts in a grid over the image, and SAM predicts multiple masks for each of them. The masks are then filtered for quality and deduplicated using non-maximum suppression [11]. The resulting semantic grouping can be defined as $G\in\mathbb{R}^{H\times W\times N}$, where $N$ is the number of semantic groupings. Notice that semantic grouping differs from semantic segmentation in that each grouping does not correspond to a specific semantic category (e.g., building, car, or person).
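As a hypothetical illustration, such groupings can be obtained with the official segment-anything package roughly as follows; collapsing the per-mask booleans into a single integer id map is our simplification of $G$, and the checkpoint path is a placeholder.

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def semantic_grouping(image_rgb: np.ndarray, checkpoint: str) -> np.ndarray:
    """Turn SAM's automatically generated masks into a per-pixel grouping id map.

    image_rgb: (H, W, 3) uint8 RGB image.  Returns an (H, W) int32 map in which
    pixels of the same SAM mask share an id (0 = not covered by any mask).
    Collapsing overlapping masks by area is a simplification of the grouping G.
    """
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    masks = SamAutomaticMaskGenerator(sam).generate(image_rgb)
    grouping = np.zeros(image_rgb.shape[:2], dtype=np.int32)
    # paint larger masks first so finer, part-level masks keep their own id
    for idx, m in enumerate(sorted(masks, key=lambda m: -m["area"]), start=1):
        grouping[m["segmentation"]] = idx
    return grouping
```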

Edge Map: The binary edge map $E\in\mathbb{R}^{H\times W\times 1}$ is derived directly (https://github.com/ymgw55/segment-anything-edge-detection) from the segmentation results of SAM, and it highlights fine-grained object boundaries.

III-C SAMFeat

Figure 3: The detailed overview of SAMFeat. Notice that SAM is only applied in the training phase, while there is no computational cost in the inference phase.

Thanks to the gifts of the foundation model SAM, we are able to consider SAM as a knowledgeable teacher whose intermediate products and outputs guide the learning of local features. First, we employ Attention-weighted Semantic Relation Distillation (ASRD) to distill the category-agnostic semantic relations in the SAM encoder into SAMFeat, thereby enhancing the expressive power of local features by introducing semantic distinctiveness. Second, we utilize the high-level semantic groupings output by SAM to construct Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), which provides cheap and valuable supervision for local descriptor learning. Third, we design an Edge Attention Guidance (EAG) to utilize the low-level edge structure discovered by SAM to guide the network to pay more attention to edge regions, which are more likely to contain keypoints and are rich in discriminative information for local feature detection and description.

Attention-weighted Semantic Relation Distillation. SAM aims to obtain the corresponding semantic masks based on the prompt, so the encoder output representation of SAM is rich in semantic discriminative information. Unlike semantic segmentation, SAM does not project pixels to a specified semantic category, so we resort to distilling the semantics contained in the encoder by exploiting the relative relationship between pixels (i.e., pixel representations of the same object are closer together).

However, not every pixel is equally important for local feature learning, and forcing the network to learn a large proportion of background pixels (e.g., the sky) can hinder the learning of discriminative foreground regions. We therefore advocate a greater focus on discriminative foreground regions in the distillation process. In contrast to classical relational distillation [47], we propose Attention-weighted Semantic Relation Distillation (ASRD) to guide the distillation process to focus on valuable relational pairs to further motivate the transfer of knowledge that facilitates local feature matching.

Figure 4: Schematic diagrams of Relationship matrix and Attention map.

Specifically, $C^{o}_{4}$ is exported from the Conv7 layer and then fed into the distillation head to obtain $C^{d}_{4}\in\mathbb{R}^{\frac{1}{8}H\times\frac{1}{8}W\times 128}$. Following the operations described in Sec. III-B, the semantic relation matrix of $C^{d}_{4}$ can be defined as $\mathcal{R}^{\prime}$. As shown in Figure 4, we distill the semantic relation matrix by imposing an L1 loss in order to obtain semantic discriminativeness for $C^{d}_{4}$. $\mathcal{R}^{\prime}$ and $\mathcal{R}$ are the corresponding student (SAMFeat) and teacher (SAM) relation matrices. As shown in Fig. 4 (b), we use the attention map produced by the Attention Head to construct the weight matrix used to weight the relational distillation. We flatten the attention map into a 1-dimensional vector $V_{A}$ and then multiply $V_{A}$ by its transpose $V_{A}^{T}$ to obtain the weight matrix $\mathcal{W}$, which can be formally defined as follows:

\mathcal{W}=V_{A}\times V_{A}^{T}, \qquad (7)

where the elements of $\mathcal{W}\in\mathbb{R}^{\frac{1}{8}H\frac{1}{8}W\times\frac{1}{8}H\frac{1}{8}W}$ correspond one-to-one with those of $\mathcal{R}$. The Attention-weighted Semantic Relation Distillation loss $\mathcal{L}_{dis}$ can be defined as:

\mathcal{L}_{dis}=\frac{\sum_{i,j}|\mathcal{R}_{i,j}-\mathcal{R}^{\prime}_{i,j}|\cdot e^{\mathcal{W}(i,j)}}{\sum_{i,j}e^{\mathcal{W}(i,j)}}, \qquad (8)

where the sums run over all $(\frac{1}{8}H\frac{1}{8}W)\times(\frac{1}{8}H\frac{1}{8}W)$ matrix elements. Since ASRD is category-agnostic, it allows the distilled semantic information to generalize to generic scenarios. Detailed pseudo-code is given in Algorithm 1.

Algorithm 1 Attention-weighted Semantic Relation Distillation
Require: Image pair $I_{1},I_{2}$; attention map $A$; $H=W=400$; $C=256$; SAMFeat's encoder $E$; SAM's encoder $E^{\prime}$.
Ensure: SAMFeat's relationship matrix $\mathcal{R}$; SAM's relationship matrix $\mathcal{R}^{\prime}$.
1: Given $I_{1},I_{2}$, obtain the encoded image feature $\mathcal{F}\in\mathbb{R}^{\frac{1}{8}H\times\frac{1}{8}W\times C}$ via $E$.
2: Given $I_{1},I_{2}$, obtain the encoded image feature $\mathcal{F}^{\prime}\in\mathbb{R}^{64\times 64\times C}$ via $E^{\prime}$.
3: Downsample $\mathcal{F}^{\prime}$ to $\mathcal{F}^{\prime}_{down}\in\mathbb{R}^{\frac{1}{8}H\frac{1}{8}W\times C}$.
4: Downsample $A$ to $A_{down}\in\mathbb{R}^{\frac{1}{8}H\times\frac{1}{8}W}$, multiply it with its own transpose, and apply a softmax activation to obtain the attention weight $A_{w}\in\mathbb{R}^{\frac{1}{8}H\frac{1}{8}W\times\frac{1}{8}H\frac{1}{8}W}$.
5: Flatten $\mathcal{F}$ and $\mathcal{F}^{\prime}_{down}$, then take the mean over the channel dimension $C$ to obtain $\mathcal{F}_{flatten}\in\mathbb{R}^{\frac{1}{8}H\frac{1}{8}W}$ and $\mathcal{F}^{\prime}_{flatten}\in\mathbb{R}^{\frac{1}{8}H\frac{1}{8}W}$.
6: Construct the attention-weighted relationship matrices $\mathcal{R},\mathcal{R}^{\prime}\in\mathbb{R}^{\frac{1}{8}H\frac{1}{8}W\times\frac{1}{8}H\frac{1}{8}W}$ from $\mathcal{F}_{flatten}$ and $\mathcal{F}^{\prime}_{flatten}$ respectively, where $\mathcal{R}(i,j)=\frac{\mathcal{F}(i)\cdot\mathcal{F}(j)}{|\mathcal{F}(i)||\mathcal{F}(j)|}\cdot A_{w}(i,j)$, and similarly for $\mathcal{R}^{\prime}$.
7: return $\mathcal{L}_{dis}=|\mathcal{R}-\mathcal{R}^{\prime}|$.
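For illustration, a minimal PyTorch sketch of the ASRD loss of Eq. (8) is given below; it assumes the SAM features have already been resized to the student's 1/8-resolution grid and flattened, and the teacher detachment and variable names are our assumptions.

```python
import torch
import torch.nn.functional as F

def asrd_loss(feat_s: torch.Tensor, feat_t: torch.Tensor, attn: torch.Tensor) -> torch.Tensor:
    """Attention-weighted Semantic Relation Distillation loss (Eq. 8), a sketch.

    feat_s: (N, C) student features from the distill head, N = (H/8)*(W/8).
    feat_t: (N, C) SAM encoder features resized to the same N positions.
    attn:   (N,)   flattened attention map at 1/8 resolution.
    """
    feat_t = feat_t.detach()                              # the SAM teacher is frozen

    def relation(f):
        f = F.normalize(f, dim=1)
        return f @ f.t()                                  # (N, N) cosine relation matrix

    w = torch.exp(attn.unsqueeze(1) * attn.unsqueeze(0))  # e^{W(i,j)}, with W = V_A V_A^T
    l1 = (relation(feat_s) - relation(feat_t)).abs()      # |R - R'| element-wise
    return (l1 * w).sum() / w.sum()                       # attention-weighted mean, Eq. (8)
```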

Weakly Supervised Contrastive Learning Based on Semantic Grouping. As shown in Figure 5, we use semantic groupings derived from SAM to construct weakly supervised contrastive learning to optimize the description space of local features. Our motivation is intuitive: pixels belonging to the same semantic grouping should be closer in the description space, while pixels of different groupings should be kept apart. However, since two pixels belonging to the same grouping does not imply that their descriptors form the closest pair, forcing them to align would impair the discriminative properties of pixels within the same grouping. Therefore, semantic grouping can only provide weakly supervised constraints, and we maintain the discriminativeness within each semantic grouping by setting a margin in the optimization. Given the set of sampled points $P\in\mathbb{R}^{N}$, the positive sample average distance $D_{pos}$ can be defined as:

D_{pos}=\frac{1}{J}\sum^{J}_{i,j}{\rm dis}(P_{i},P_{j}),\quad\text{where }G(i)=G(j)\text{ and }i\neq j. \qquad (9)
Figure 5: Example of Semantic Grouping. Different colored stars represent sampling points in different semantic groupings.

Here ${\rm dis}(P_{i},P_{j})$ denotes the Euclidean distance between the local descriptors corresponding to the two sampled points $P_{i}$ and $P_{j}$, and $G(\cdot)$ denotes the indexed semantic grouping category. $J$ denotes the number of positive pairs; since $J$ is not consistent across images, we take the average to represent the positive sample distance. Similarly, the negative sample average distance $D_{neg}$ can be defined as:

D_{neg}=\frac{1}{K}\sum^{K}_{i,j}{\rm dis}(P_{i},P_{j}),\quad\text{where }G(i)\neq G(j), \qquad (10)

where $K$ denotes the number of negative pairs. Thus, the final loss $\mathcal{L}_{wsc}$ can be defined as:

\mathcal{L}_{wsc}=-\log\frac{\exp(\max(D_{pos},{\rm M})/{\rm T})}{\exp((\max(D_{pos},{\rm M})+D_{neg})/{\rm T})}, \qquad (11)

where ${\rm M}$ is a margin parameter used to protect distinctiveness within semantic groupings, and ${\rm T}$ is the temperature coefficient.
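A PyTorch sketch of the two average distances and the final loss is given below; the combination follows Eq. (11) as written, and edge cases such as groupings with no positive pairs are ignored in this sketch.

```python
import torch

def wsc_loss(desc: torch.Tensor, group: torch.Tensor,
             M: float = 0.07, T: float = 5.0) -> torch.Tensor:
    """Weakly supervised contrastive loss over SAM semantic groupings (Eqs. 9-11).

    desc:  (N, C) L2-normalized descriptors of N sampled points in one image.
    group: (N,)   SAM grouping index of each sampled point.
    """
    dist = torch.cdist(desc, desc)                    # (N, N) Euclidean distances
    same = group.unsqueeze(0) == group.unsqueeze(1)   # same-grouping indicator
    off_diag = ~torch.eye(len(desc), dtype=torch.bool, device=desc.device)
    d_pos = dist[same & off_diag].mean()              # Eq. (9): average positive distance
    d_neg = dist[~same].mean()                        # Eq. (10): average negative distance
    a = torch.clamp(d_pos, min=M)                     # max(D_pos, M): the margin keeps
                                                      # intra-grouping distinctiveness
    return -torch.log(torch.exp(a / T) / torch.exp((a + d_neg) / T))   # Eq. (11) as written
```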

Edge Attention Guidance. Edge regions are more worthy of the network's attention than mundane regions. On one hand, corner and edge points in edge regions are more likely to be detected as keypoints. On the other hand, edge regions contain rich information about the geometric structure and thus contribute more to the discriminativeness of the local descriptor. To enable the network to better capture the details of edge areas and improve the robustness of descriptors, we propose the Edge Attention Guidance Module, which guides the network to focus on edge regions. As shown in Figure 3, we first set up an edge head to predict the edge map $E^{\prime}$ and use the edge map output by SAM for supervision. The edge loss $\mathcal{L}_{edge}$ is defined as:

\mathcal{L}_{edge}=\sum_{i}^{H\times W}|E_{i}-E^{\prime}_{i}|. \qquad (12)

We then fuse the predicted edge map $E^{\prime}$ into the local feature detection and description pipeline to bootstrap the network.

Figure 8 visualizes the learned object boundaries under the guidance of SAM. With our EAG, SAMFeat learns accurate boundaries and edges effectively and efficiently. This prior knowledge of boundaries and edges then aids feature detection and description.

Figure 6: Details of Edge Attention Guidance Module.

III-D Guidance from SAM

1) Learning Gifts from SAM: As described in Section III-B, the carefully designed architecture enables SAMFeat to absorb the knowledge output by SAM under the supervision of three designed loss modules. We can therefore summarize the guidance loss from SAM, $\mathcal{L}_{g}$, as:

\mathcal{L}_{g}=\mathcal{L}_{dis}+\mathcal{L}_{edge}+\mathcal{L}_{wsc}. \qquad (13)

With this loss supervision, SAMFeat is able to utilize the learned knowledge in the guided detection and description modules discussed below to further aid feature learning and matching.

2) Guided Local Feature Detection: To aid feature detection in SAMFeat, the predicted edge map $E^{\prime}$ from the edge head is combined with the mid-level encoded feature representation $C_{3}$ via a pixel-wise dot product, as shown in Figure 3. The product is added back to $C_{3}$ as a residual to obtain an edge-enhanced feature. This feature is then used to generate a heatmap via the detection head, providing better local feature detection.

3) Guided Local Feature Description: We filter the edge features using the predicted edge map and model the features of the edge region with a self-attention mechanism to encourage the network to capture edge-region information. Specifically, the predicted edge map $E^{\prime}$ from the edge head and the multi-scale feature maps $F_{in}$ extracted from the backbone are fed into the Edge Attention Guidance Module. As shown in Figure 6, we first fuse $E^{\prime}$ and $F_{in}$ by applying a pixel-wise dot product to obtain an edge-oriented feature map $F_{edge}$. Then we apply different convolutional transformations to $F_{edge}$ to obtain the query $q$, key $k$, and value $v$. We calculate the attention score using the dot product between the query and the key, and apply the softmax function to obtain the attention weights, which are combined with the value vectors to compute the edge-enhanced feature maps. Finally, the edge-enhanced feature maps and $F_{in}$ are added to obtain the output feature maps $F_{out}$.
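A possible PyTorch realization of this module is sketched below; the 1×1 convolutions for $q$, $k$, $v$, the $\sqrt{C}$ scaling, and the channel sizes are assumptions, since only "different convolutional transformations" are specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAttentionGuidance(nn.Module):
    """Edge Attention Guidance module (Fig. 6), a sketch with assumed layer choices."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, kernel_size=1)
        self.k = nn.Conv2d(channels, channels, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat: torch.Tensor, edge: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) backbone features F_in; edge: (B, 1, H, W) predicted edge map E'
        f_edge = feat * edge                                   # pixel-wise dot product -> F_edge
        b, c, h, w = f_edge.shape
        q = self.q(f_edge).flatten(2).transpose(1, 2)          # (B, HW, C)
        k = self.k(f_edge).flatten(2)                          # (B, C, HW)
        v = self.v(f_edge).flatten(2).transpose(1, 2)          # (B, HW, C)
        attn = F.softmax(q @ k / c ** 0.5, dim=-1)             # (B, HW, HW) attention weights
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)   # edge-enhanced features
        return out + feat                                      # residual with F_in -> F_out
```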

Total Loss. The total loss $\mathcal{L}$ can be defined as:

\mathcal{L}=\mathcal{L}_{g}+\mathcal{L}_{det}+\mathcal{L}_{des}, \qquad (14)

where $\mathcal{L}_{g}$ is the loss function guided by SAM's knowledge defined in Section III-D, $\mathcal{L}_{det}$ is the cross-entropy loss for supervised keypoint detection, and $\mathcal{L}_{des}$ is the attention-weighted triplet loss from MTLDesc [29] for optimizing the local descriptors. Individual weights are not assigned to the losses: all losses share equal weight. This independence from hyper-parameters further shows the robustness of SAMFeat.

IV Experiments

IV-A Implementation.

To generate our training data with dense pixel-wise correspondences, we rely on the MegaDepth dataset [10], a rich resource containing image pairs with known pose and depth information from 196 diverse scenes. Specifically, we use the MegaDepth images and correspondence ground truth released by MTLDesc [29] (https://github.com/vignywang/MTLDesc) for training. In our experiments, we carefully configured the parameters to establish a consistent and efficient training process. Hyper-parameters are set as follows. The learning rate of 0.001 enables gradual parameter updates, and the weight decay of 0.0001 helps control model complexity and mitigate overfitting. With a batch size of 14, our model processes 14 samples per iteration, striking a balance between computational efficiency and convergence. $\rm M$ and $\rm T$ are set to 0.07 and 5, respectively. Training spans 30 epochs to ensure comprehensive exposure to the data, with a total training time of 3.5 hours. By carefully defining these parameters and configurations, we establish a robust experimental setup that ensures replicability and accurate evaluation of our model's performance. More detailed information about parameter tuning and ablation experiments can be found in the supplementary material.

Figure 7: Comparison of mean matching accuracy at different thresholds on the HPatches dataset. Our SAMFeat achieves higher average local feature matching accuracy than other state-of-the-art methods at all thresholds.

IV-B Image Matching.

We evaluate the performance of our method on the image matching task using the most popular feature learning and matching benchmark: HPatches [1]. The HPatches dataset consists of 116 sequences of image patches extracted from a diverse range of scenes and objects. Each image patch is associated with ground-truth annotations, including keypoint locations, descriptors, and corresponding homographies. We follow the same evaluation protocol as D2-Net [48], where eight unreliable scenes are excluded. To ensure an equitable comparison, we match the features extracted by each method using nearest-neighbor matching. A match is deemed accurate if its estimated reprojection error is lower than a predetermined matching threshold. The threshold is systematically varied from 1 to 10 pixels, and the mean matching accuracy (MMA) across all pairs is recorded, indicating the proportion of correct matches relative to potential matches. Subsequently, the area under the curve (AUC) at 5px is computed based on the MMA. The comparison between SAMFeat and other state-of-the-art methods on HPatches image matching is visualized in Figure 7. The MMA at the 3-pixel threshold against other state-of-the-art methods is listed in Table I. SAMFeat achieves the highest MMA @3 even compared to the most recent feature learning models published at top-tier conferences in 2023.

TABLE I: Image Matching Performance Comparison on HPatches dataset.
Methods  MMA @3  AUC @5
SIFT (IJCV'2012) [49]  50.1  49.6
HardNet (NeurIPS'2017) [50]  62.1  56.9
DELF (ICCV'2017) [51]  50.7  49.7
SuperPoint (CVPRW'2018) [8]  65.7  59.0
LF-Net (NeurIPS'2018) [52]  53.2  48.7
ContextDesc (CVPR'2019) [22]  63.2  58.3
D2-Net (CVPR'2019) [48]  40.3  37.8
R2D2 (NeurIPS'2019) [53]  72.1  64.1
DISK (NeurIPS'2020) [9]  72.2  69.8
ASLFeat (CVPR'2020) [54]  72.2  66.9
LLF (WACV'2021) [55]  74.0  66.8
Key.Net (TPAMI'2022) [56]  72.1  56.0
ALIKE (TMM'2022) [57]  70.5  69.0
MTLDesc (AAAI'2022) [29]  78.7  71.4
PoSFeat (CVPR'2022) [58]  75.3  69.2
SFD2 (CVPR'2023) [15]  70.6  64.8
TPR (CVPR'2023) [59]  79.8  73.0
SAMFeat (Ours)  82.2  74.4

IV-C Visual Localization.

To further validate the efficacy of our approach on intricate tasks, we assess its performance on visual localization. This task involves estimating the camera's position within a scene using an image sequence and serves as an evaluation benchmark for local feature performance in long-term localization scenarios, without requiring a dedicated localization pipeline. We utilize the Aachen Day-Night v1.1 dataset [60] to showcase the impact on visual localization tasks. To ensure fairness in the assessment, we employ a predefined visual localization pipeline (https://github.com/GrumpyZhou/image-matching-toolbox) based on COLMAP, provided by the benchmark (https://www.visuallocalization.net). This pipeline operates as follows: initially, custom features extracted from the database images are employed to construct a structure-from-motion model; subsequently, the query images are registered within this model using the same custom features. For keypoint matching, we utilize the mutual nearest neighbor approach to effectively filter out outliers, and we set the number of features in this experiment to 10,000. We tally the number of accurately localized images under three distinct error thresholds, namely (0.25m, 2°), (0.5m, 5°), and (5m, 10°), signifying the maximum allowable position errors in meters and degrees. Referring to Table II, we categorize current state-of-the-art methods into two categories: G contains methods designed for general feature learning tasks; L contains methods that are designed, tuned, and tested specifically for localization tasks, which typically perform poorly outside of specific localization scenarios, as shown in Table I. SAMFeat achieves the top performance among all general methods, while also showing competitive performance against methods designed specifically for visual localization.

TABLE II: Visual Localization Performance Comparison on Aachen V1.1. Category “L” means local feature methods specifically designed for visual localization tasks, and “G” means generalized local feature methods.
Category Method Accuracy @ Thresholds (%) ↑
Day Night
0.25m,2°/0.5m,5°/5m,10°
L  SeLF (TIP'22) [44]  -  75.0 / 86.8 / 97.6
   SFD2 (CVPR'2023) [15]  88.2 / 96.0 / 98.7  78.0 / 92.1 / 99.5
G  SIFT (IJCV'12) [6]  72.2 / 78.4 / 81.7  19.4 / 23.0 / 27.2
   SuperPoint (CVPRW'18) [25]  87.9 / 93.6 / 96.8  70.2 / 84.8 / 93.7
   D2-Net (CVPR'19) [27]  84.1 / 91.0 / 95.5  63.4 / 83.8 / 92.1
   R2D2 (NeurIPS'19) [26]  88.8 / 95.3 / 97.8  72.3 / 88.5 / 94.2
   ASLFeat (CVPR'20) [28]  88.0 / 95.4 / 98.2  70.7 / 84.3 / 94.2
   CAPS (ECCV'20) [61]  82.4 / 91.3 / 95.9  61.3 / 83.8 / 95.3
   LISRD (ECCV'20) [62]  -  73.3 / 86.9 / 97.9
   DISK (NeurIPS'22) [9]  -  73.8 / 86.2 / 97.4
   PoSFeat (CVPR'22) [58]  -  73.8 / 87.4 / 98.4
   MTLDesc (AAAI'22) [29]  -  74.3 / 86.9 / 96.9
   TR (CVPR'23) [59]  -  74.3 / 89.0 / 98.4
   SAMFeat (Ours)  90.2 / 96.0 / 98.5  75.9 / 89.5 / 95.8

IV-D 3D Reconstruction.

TABLE III: Evaluation on ETH 3D reconstruction benchmark [63]. The top two results are marked with bold (1st) and underline (2nd).
Datasets  Methods  #Reg. Images  #Sparse Points  Track Length  Reproj. Error  #Dense Points
Madrid Metropolis (1344 images):
SuperPoint [25]  438  29K  9.03  1.02px  1.55M
D2-Net [27]  495  144K  6.39  1.35px  1.46M
ASLFeat [28]  613  96K  8.76  0.90px  2.00M
DISK [9]  677  213K  7.89  1.14px  1.87M
AWDesc-CA [45]  864  278K  9.52  0.96px  1.65M
SAMFeat  892  282K  9.84  0.93px  1.90M
Gendarmenmarkt (1463 images):
SuperPoint [25]  967  93K  7.22  1.03px  3.81M
D2-Net [27]  965  310K  5.55  1.28px  3.15M
ASLFeat [28]  1040  221K  8.72  1.00px  4.01M
DISK [9]  1218  588K  6.02  0.98px  3.62M
AWDesc-CA [45]  1354  548K  6.94  0.95px  3.86M
SAMFeat  1370  596K  7.02  0.93px  3.91M
Tower of London (1576 images):
SuperPoint [25]  681  52K  8.76  0.96px  2.77M
D2-Net [27]  708  287K  5.20  1.34px  2.86M
ASLFeat [28]  821  222K  12.52  0.92px  3.06M
DISK [9]  985  517K  5.90  1.02px  3.00M
AWDesc-CA [45]  1414  563K  12.88  0.88px  2.89M
SAMFeat  1443  587K  13.01  0.84px  2.91M

We utilize the ETH benchmark [63] to evaluate the performance of our method on the 3D reconstruction task. Three medium-scale datasets from the benchmark are used for this purpose. We perform exhaustive image matching on all three collections and apply ratio test filtering with a default threshold of 0.8 to eliminate incorrect matches. The reconstruction protocol follows the COLMAP [2] pipeline, which involves running Structure-from-Motion (SfM) first and then Multi-View Stereo (MVS) to generate the dense point cloud models. As shown in Table III, SAMFeat consistently produces the highest number of registered images and sparse points, along with competitive track length and reprojection error, demonstrating its robustness and accuracy in 3D reconstruction tasks.

IV-E Ablation Study.

We conduct ablation studies on different aspects to support our claim and illustrate the necessity of each of our contributed modules.

Ablation on Designed Modules. Table IV demonstrates the efficacy of the components within our network as we progressively incorporate Attention-weighted Semantic Relation Distillation (ASRD), Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), and Edge Attention Guidance (EAG). The effectiveness of each component is reflected by the Mean Matching Accuracy at the 3-pixel threshold on the HPatches image matching task. Our baseline is trained using the SuperPoint [8] structure along with its detector supervision and the attention-weighted triplet loss [29] for descriptor learning. After adding ASRD, the model's performance notably improves due to better image feature learning. Introducing WSC further enhances accuracy by augmenting the discriminative power of descriptors with semantics; it better preserves the inner diversity of objects by optimizing sample ranks. Lastly, including EAG bolsters the network's capability to embed object edge and boundary information, resulting in further improvements in accuracy.

TABLE IV: Detailed ablation study on SAMFeat. ✓ denotes applied components. The results of MMA @3 on HPatches when removing each component individually, in addition to applying the components sequentially, are reported.
ASRD EAG WSC MMA @3
75.7
78.6
80.9
82.2
81.2
79.4

Attention-weighted Semantic Relation Distillation Versus Direct Semantic Feature Distillation. Table V highlights the performance differences between two approaches for distilling image features from the SAM encoder: our proposed Attention-weighted Semantic Relation Distillation (ASRD) and Direct Semantic Feature Distillation (DSFD). The effectiveness of these approaches is evaluated using the Mean Matching Accuracy at the 3-pixel threshold on the HPatches image matching task. The results show that while DSFD achieves a Mean Matching Accuracy (MMA) of 76.9, the ASRD method significantly enhances performance, achieving an MMA of 78.6. This indicates that ASRD provides superior feature learning by effectively capturing and distilling the semantic relationships within the image features.

TABLE V: Ablation test on the Attention-weighted Semantic Relation Distillation (ASRD).
       Module Selected        MMA @3
       DSFD        76.9
       ASRD        78.6

Hyper-Parameters in the Weakly Supervised Contrastive Module. Table VI presents the impact of different hyper-parameter values on the performance of Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC). Specifically, the margin parameter $M$ is used to maintain distinctiveness within semantic groupings, while $T$ is the temperature coefficient. The effectiveness of these hyper-parameters is evaluated using the Mean Matching Accuracy at the 3-pixel threshold on the HPatches image matching task. The results demonstrate that varying $M$ and $T$ affects performance. Notably, the combination of $M=0.07$ and $T=5$ achieves the highest accuracy, with an MMA of 82.2. This configuration provides an optimal balance, enhancing the discriminative power of the descriptors while effectively preserving the inner diversity of objects.

TABLE VI: Ablation test on the hyper-parameters of WSC.
         (M, T)          MMA @3
         0, 1          80.9
         0.03, 1          81.2
         0.05, 1          80.3
         0.07, 1          81.3
         0.09, 1          80.8
         0.11, 1          80.5
         0.07, 3          81.6
         0.07, 5          82.2
         0.07, 7          82.0
         0.07, 9          81.8

Training Samples. Table VII provides a comparative analysis of various methods based on the number of training samples used and their Mean Matching Accuracy (MMA) at the 3-pixel threshold on the HPatches dataset. Despite being trained on a relatively small dataset of 23,600 images, SAMFeat achieves an outstanding MMA @3 of 82.2, surpassing all other methods. This result underscores the efficiency and robustness of SAMFeat in achieving superior performance with limited training data. For instance, SuperPoint, trained on 80,000 images, achieves an MMA @3 of 64.5, while ASLFeat, utilizing a significantly larger dataset of 1,600,000 images, achieves an MMA @3 of 72.3. Similarly, methods like D2-Net and R2D2, despite having access to larger or comparable training datasets, attain lower MMA @3 scores of 42.9 and 68.6, respectively. These comparisons highlight the effectiveness of SAMFeat's design in leveraging a modest amount of training data to achieve top-tier performance.

TABLE VII: Comparisons on the number of Training Samples.
Method Source Images MMA@3
SuperPoint COCO 80,000 64.5
D2Net MegaDepth 617,774 42.9
R2D2 Aachen and Web images 12,083 68.6
ASLFeat GL3D 1,600,000 72.3
MTLDesc MegaDepth 23,600 78.7
SFD2 Aachen and Web images 12,083 70.6
TRR COCO + Image Matching Challenge 106,000 79.8
SAMFeat MegaDepth 23,600 82.2

Training Time. Table VIII presents a quantitative analysis of the overhead training time costs associated with incorporating additional loss functions into our method. The table details the incremental training time required for each combination of the Attention-weighted Semantic Relation Distillation (ASRD), Edge Attention Guidance (EAG), and Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC) loss components. Note that our method requires only 6 hours of training using two NVIDIA RTX 3090 GPUs. Compared to other works such as ASLFeat (42 hours on a single NVIDIA RTX 2080 Ti) and TRR (30 hours of training with two NVIDIA A100 GPUs), this is an entirely reproducible cost for individual researchers.

Even though each additional loss function introduces some extra training time, the tradeoff is justified by the minimal increase in time and the corresponding improvement in accuracy. The final training time remains acceptable, demonstrating the lightweight nature of our approach. This efficiency makes our method easily implementable and resource-efficient, highlighting its reproducibility and practicality for individual researchers in the field of feature learning and description.

TABLE VIII: A quantitative analysis of the overhead training time costs for adding the extra loss functions. ✓ denotes applied loss components.
ASRD EAG WSC Training Time (Hours)
3.6
4.7
5.1
6.0

Ablation on Loss Weights. Table IX presents the results of an ablation test that explores the impact of adjusting the loss weight of Edge Attention Guidance (EAG) without incorporating Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC). The effectiveness of these adjustments is measured using the Mean Matching Accuracy at the 3-pixel threshold on the HPatches dataset. The results indicate that merely adjusting the loss weights of EAG does not achieve the same effectiveness as incorporating WSC. Specifically, with a fixed ASRD weight of 1.0, varying the EAG weight from 0.5 to 1.5 yields incremental improvements in MMA @3, peaking at 81.0. However, this peak still falls short of the performance observed when WSC is included, highlighting the unique contribution of WSC to the overall accuracy.

TABLE IX: Ablation test on adjusting the loss weight of EAG without WSC. The MMA @3 on HPatches is recorded, showing that it is difficult to match the effect of adding WSC by only adjusting the loss weights.
      Weights of ASRD       Weights of EAG       MMA @3
      1.0       0.5       80.7
      1.0       1.0       80.9
      1.0       1.5       81.0

IV-F Inference Speed.

To further demonstrate SAMFeat's high efficiency and fast inference speed, we compare its inference time with other state-of-the-art methods in Table X. We assessed the running speed of the various methods using their open-source code. As shown in Table X, our approach delivers highly competitive performance while maintaining a fast inference speed among many lightweight methods.

TABLE X: Comparison of inference speed. The speed is calculated as the average feature extraction inference speed on HPatches (480 × 640) under the same setting.
Methods SuperPoint D2-Net SFD2 MTLDesc R2D2 SAMFeat
Inference Speed 31FPS 6FPS 11FPS 24FPS 8FPS 21FPS

IV-G Matching Visualization.

As mentioned in Section III-B, SAMFeat learns edge maps from SAM and utilizes Edge Attention Guidance (EAG) to further enhance the precision of local feature detection and description by encouraging the network to prioritize attention to edge regions. Figure 8 demonstrates the learning outcome of SAMFeat. With the fine-grained object boundaries from SAM, SAMFeat is able to learn clear object edges. This illustrates two things: first, the encoded features used to generate the edge map contain rich edge information; second, with a clear and accurate generated edge map and EAG, SAMFeat is able to better capture the details of edge areas and improve the robustness of local descriptors.

Figure 8: Left: Random images selected from HPatches. Right: Learned edge boundaries from SAMFeat under the guidance of SAM.

V Limitations

Although other visual foundation models like DINO [38] or SEEM [64] could potentially serve as alternative teachers, the focus of our study was specifically on SAM. Our methodology was designed around the unique capabilities of SAM, and therefore, further investigation into alternative teachers was not pursued in this study. Future research could explore the applicability and potential advantages of employing other visual foundation models as teachers for local feature learning tasks.

VI Conclusion

In this study, we introduce SAMFeat, a local feature learning method that harnesses the power of the Segment Anything Model (SAM). SAMFeat encompasses three innovations. First, we introduce Attention-weighted Semantic Relation Distillation (ASRD), an auxiliary task aimed at distilling the category-agnostic semantic information acquired by the SAM encoder into the local feature learning network. Second, we present Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), a technique that leverages the semantic groupings derived from SAM as weakly supervised signals to optimize the metric space of local descriptors. Furthermore, we engineer the Edge Attention Guidance (EAG) mechanism to elevate the accuracy of local feature detection and description. Our comprehensive evaluation on tasks such as image matching on HPatches and long-term visual localization on Aachen Day-Night consistently underscores the remarkable performance of SAMFeat, surpassing previous methods.

References

  • [1] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk, “Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2017, pp. 5173–5182.
  • [2] J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 4104–4113.
  • [3] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
  • [4] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic et al., “Benchmarking 6dof outdoor visual localization in changing conditions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8601–8610.
  • [5] X. Wang, Y. Hua, E. Kodirov, G. Hu, R. Garnier, and N. M. Robertson, “Ranked list loss for deep metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 5207–5216.
  • [6] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [7] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 2564–2571.
  • [8] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 224–236.
  • [9] M. Tyszkiewicz, P. Fua, and E. Trulls, “Disk: Learning local features with policy gradient,” Advances in Neural Information Processing Systems, vol. 33, pp. 14 254–14 265, 2020.
  • [10] Z. Li and N. Snavely, “Megadepth: Learning single-view depth prediction from internet photos,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2041–2050.
  • [11] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
  • [12] R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill et al., “On the opportunities and risks of foundation models,” arXiv preprint arXiv:2108.07258, 2021.
  • [13] F. Xue, I. Budvytis, D. O. Reino, and R. Cipolla, “Efficient large-scale localization by global instance recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 348–17 357.
  • [14] J. L. Schönberger, M. Pollefeys, A. Geiger, and T. Sattler, “Semantic visual localization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6896–6906.
  • [15] F. Xue, I. Budvytis, and R. Cipolla, “Sfd2: Semantic-guided feature detection and description,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5206–5216.
  • [16] K. Mikolajczyk and C. Schmid, “A performance evaluation of local descriptors,” IEEE transactions on pattern analysis and machine intelligence, vol. 27, no. 10, pp. 1615–1630, 2005.
  • [17] D. Mishkin, F. Radenovic, and J. Matas, “Repeatability is not enough: Learning affine regions via discriminability,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 284–300.
  • [18] A. Barroso-Laguna, E. Riba, D. Ponsa, and K. Mikolajczyk, “Key.Net: Keypoint detection by handcrafted and learned CNN filters,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5836–5844.
  • [19] Y. Tian, B. Fan, and F. Wu, “L2-net: Deep learning of discriminative patch descriptor in euclidean space,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 661–669.
  • [20] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas, “Working hard to know your neighbor’s margins: Local descriptor learning loss,” in Advances in Neural Information Processing Systems, 2017, pp. 4826–4837.
  • [21] Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas, “Sosnet: Second order similarity regularization for local descriptor learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 016–11 025.
  • [22] Z. Luo, T. Shen, L. Zhou, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan, “Contextdesc: Local descriptor augmentation with cross-modality context,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 2527–2536.
  • [23] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, “Lift: Learned invariant feature transform,” in European Conference on Computer Vision.   Springer, 2016, pp. 467–483.
  • [24] Y. Ono, E. Trulls, P. Fua, and K. M. Yi, “Lf-net: learning local features from images,” in Advances in neural information processing systems, 2018, pp. 6234–6244.
  • [25] D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 224–236.
  • [26] J. Revaud, C. De Souza, M. Humenberger, and P. Weinzaepfel, “R2d2: Reliable and repeatable detector and descriptor,” in Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., vol. 32.   Curran Associates, Inc., 2019. [Online]. Available: https://proceedings.neurips.cc/paper/2019/file/3198dfd0aef271d22f7bcddd6f12f5cb-Paper.pdf
  • [27] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, “D2-net: A trainable cnn for joint detection and description of local features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8092–8101.
  • [28] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan, “Aslfeat: Learning local features of accurate shape and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6589–6598.
  • [29] C. Wang, R. Xu, Y. Zhang, S. Xu, W. Meng, B. Fan, and X. Zhang, “Mtldesc: Looking wider to describe better,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 2, 2022, pp. 2388–2396.
  • [30] P.-E. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superglue: Learning feature matching with graph neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4938–4947.
  • [31] P. Lindenberger, P.-E. Sarlin, and M. Pollefeys, “Lightglue: Local feature matching at light speed,” arXiv preprint arXiv:2306.13643, 2023.
  • [32] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “Loftr: Detector-free local feature matching with transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8922–8931.
  • [33] J. Yu, J. Chang, J. He, T. Zhang, J. Yu, and F. Wu, “Adaptive spot-guided transformer for consistent local feature matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21 898–21 908.
  • [34] H. He, J. Zhang, M. Xu, J. Liu, B. Du, and D. Tao, “Scalable mask annotation for video text spotting,” arXiv preprint arXiv:2305.01443, 2023.
  • [35] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, R. Pflugfelder, J.-K. Kämäräinen, H. J. Chang, M. Danelljan, L. Cehovin, A. Lukežič et al., “The ninth visual object tracking vot2021 challenge results,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 2711–2738.
  • [36] C. Zhang, L. Liu, Y. Cui, G. Huang, W. Lin, Y. Yang, and Y. Hu, “A comprehensive survey on segment anything model for vision and beyond,” arXiv preprint arXiv:2305.08196, 2023.
  • [37] J. Edstedt, Q. Sun, G. Bökman, M. Wadenbäck, and M. Felsberg, “Roma: Revisiting robust losses for dense feature matching,” arXiv preprint arXiv:2305.15404, 2023.
  • [38] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
  • [39] Z. Huang, H. Zhou, Y. Li, B. Yang, Y. Xu, X. Zhou, H. Bao, G. Zhang, and H. Li, “Vs-net: Voting with segmentation for visual localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6101–6111.
  • [40] H. Hu, Z. Qiao, M. Cheng, Z. Liu, and H. Wang, “Dasgil: Domain adaptation for semantic and geometric-aware image-based localization,” IEEE Transactions on Image Processing, vol. 30, pp. 1342–1353, 2020.
  • [41] C. Toft, E. Stenborg, L. Hammarstrand, L. Brynte, M. Pollefeys, T. Sattler, and F. Kahl, “Semantic match consistency for long-term visual localization,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 383–399.
  • [42] T. Shi, H. Cui, Z. Song, and S. Shen, “Dense semantic 3d map based long-term visual localization with hybrid features,” arXiv preprint arXiv:2005.10766, 2020.
  • [43] C. Toft, C. Olsson, and F. Kahl, “Long-term 3d localization and pose from semantic labellings,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 650–659.
  • [44] B. Fan, J. Zhou, W. Feng, H. Pu, Y. Yang, Q. Kong, F. Wu, and H. Liu, “Learning semantic-aware local features for long term visual localization,” IEEE Transactions on Image Processing, vol. 31, pp. 4842–4855, 2022.
  • [45] C. Wang, R. Xu, K. Lu, S. Xu, W. Meng, Y. Zhang, B. Fan, and X. Zhang, “Attention weighted local descriptors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 9, pp. 10 632–10 649, 2023.
  • [46] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
  • [47] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational knowledge distillation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3967–3976.
  • [48] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, “D2-net: A trainable cnn for joint description and detection of local features,” in Proceedings of the ieee/cvf conference on computer vision and pattern recognition, 2019, pp. 8092–8101.
  • [49] T. Lindeberg, “Scale invariant feature transform,” 2012.
  • [50] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas, “Working hard to know your neighbor’s margins: Local descriptor learning loss,” Advances in neural information processing systems, vol. 30, 2017.
  • [51] H. Noh, A. Araujo, J. Sim, T. Weyand, and B. Han, “Large-scale image retrieval with attentive deep local features,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 3456–3465.
  • [52] Y. Ono, E. Trulls, P. Fua, and K. M. Yi, “Lf-net: Learning local features from images,” Advances in neural information processing systems, vol. 31, 2018.
  • [53] J. Revaud, P. Weinzaepfel, C. De Souza, N. Pion, G. Csurka, Y. Cabon, and M. Humenberger, “R2d2: repeatable and reliable detector and descriptor,” arXiv preprint arXiv:1906.06195, 2019.
  • [54] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan, “Aslfeat: Learning local features of accurate shape and localization,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6589–6598.
  • [55] S. Suwanwimolkul, S. Komorita, and K. Tasaka, “Learning of low-level feature keypoints for accurate and robust detection,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 2262–2271.
  • [56] A. Barroso-Laguna and K. Mikolajczyk, “Key.Net: Keypoint detection by handcrafted and learned CNN filters revisited,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 698–711, 2022.
  • [57] X. Zhao, X. Wu, J. Miao, W. Chen, P. C. Chen, and Z. Li, “Alike: Accurate and lightweight keypoint detection and descriptor extraction,” IEEE Transactions on Multimedia, 2022.
  • [58] K. Li, L. Wang, L. Liu, Q. Ran, K. Xu, and Y. Guo, “Decoupling makes weakly supervised local feature better,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 15 838–15 848.
  • [59] Z. Wang, C. Wu, Y. Yang, and Z. Li, “Learning transformation-predictive representations for detection and description of local features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 11 464–11 473.
  • [60] T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic et al., “Benchmarking 6dof outdoor visual localization in changing conditions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8601–8610.
  • [61] Q. Wang, X. Zhou, B. Hariharan, and N. Snavely, “Learning feature descriptors using camera pose supervision,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020.
  • [62] R. Pautrat, V. Larsson, M. R. Oswald, and M. Pollefeys, “Online invariance selection for local feature descriptors,” in Proceedings of the European Conference on Computer Vision (ECCV).   Springer, 2020, pp. 707–724.
  • [63] J. L. Schonberger, H. Hardmeier, T. Sattler, and M. Pollefeys, “Comparative evaluation of hand-crafted and learned local features,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1482–1491.
  • [64] X. Zou, J. Yang, H. Zhang, F. Li, L. Li, J. Wang, L. Wang, J. Gao, and Y. J. Lee, “Segment everything everywhere all at once,” Advances in Neural Information Processing Systems, vol. 36, 2024.