
11institutetext: T. Wang, C. Yan are with the Intelligent Information Processing Lab, Hangzhou Dianzi University, China 310018. E-mail: [email protected], [email protected]. C. Yan is the Corresponding Author.
Z. Zheng is with the Sea-NExT Joint Lab, School of Computing, National University of Singapore, Singapore 118404. E-mail: [email protected].
Z. Zhu is with the Intelligent Information Processing Lab, Hangzhou Dianzi University, China 310018, also with the Lishui Institute of Hangzhou Dianzi University, China 323000. E-mail: [email protected].
Y. Gao is with the Lishui Institute of Hangzhou Dianzi University, China 323000, also with the Intelligent Information Processing Lab, Hangzhou Dianzi University, China 310018. E-mail: [email protected].
Y. Yang is with the School of Computer Science, Zhejiang University, China, 310027. E-mail: [email protected].

Learning Cross-view Geo-localization Embeddings via Dynamic Weighted Decorrelation Regularization

Tingyu Wang    Zhedong Zheng    Zunjie Zhu    Yuhan Gao    Yi Yang and Chenggang Yan
Abstract

Cross-view geo-localization aims to spot images of the same location shot from two platforms, e.g., the drone platform and the satellite platform. Existing methods usually focus on optimizing the distance between one embedding and others in the feature space, while neglecting the redundancy of the embedding itself. In this paper, we argue that low redundancy is also of importance, since it motivates the model to mine more diverse patterns. To verify this point, we introduce a simple yet effective regularization, i.e., Dynamic Weighted Decorrelation Regularization (DWDR), to explicitly encourage networks to learn independent embedding channels. As the name implies, DWDR regresses the embedding correlation coefficient matrix to a sparse matrix, i.e., the identity matrix, with dynamic weights. The dynamic weights are applied to focus on still-correlated channels during training. Besides, we propose a cross-view symmetric sampling strategy, which keeps the example balance between different platforms. Albeit simple, the proposed method achieves competitive results on three large-scale benchmarks, i.e., University-1652, CVUSA and CVACT. Moreover, under harsh circumstances, e.g., an extremely short feature of 64 dimensions, the proposed method surpasses the baseline model by a clear margin.

Keywords:
Geo-localization, Image Retrieval, Deep Learning, Cross-correlation Matrix, Decorrelation.
Figure 1: (a) A strong metric learning based baseline (Zheng et al., 2020b, ). (b) The baseline combined with our proposed dynamic weighted decorrelation regularization (DWDR). ERF refers to the effective receptive field (the yellow area). A large ERF reflects that the learned model is able to extract more discriminative visual features from a wider range without over-fitting to the local area (Luo et al., , 2016; Ding et al., , 2022).

1 Introduction

Cross-view geo-localization is an image retrieval task and has been broadly applied to event detection, drone navigation, and accurate delivery (Zheng et al., 2020b; Wang et al., 2021; Humenberger et al., 2022). Given a probe from the query platform (e.g., the drone), the system aims to spot a candidate image in the gallery platform (e.g., the satellite) containing the same geographic target as the probe. Since satellite images possess GPS metadata as annotations, we can easily acquire the location information of the probe of interest. In addition, when the GPS signal of the positioning device encounters interference, image-based cross-view geo-localization can also be employed as an auxiliary tool to refine the geo-localization and provide a more robust result.

Cross-view geo-localization remains challenging since images from different platforms inherently contain viewpoint variations. The extreme viewpoint change leads to differences in the visual appearance of a geographic target, which confuses the system when locating a position accurately. A crucial key to the geo-localization challenge is to learn a discriminative visual embedding (Teng et al., 2021; Zheng et al., 2020a; Huang et al., 2022). Recently, deep learning technologies have received much research attention in the cross-view geo-localization problem due to their great potential in feature extraction. A popular scheme for learning cross-view geo-localization models is to first utilize a pre-trained convolutional neural network to extract feature maps of images. Then, various metric learning functions are proposed to pull image pairs with the same geo-tag closer while pushing features from non-matchable pairs far apart (Zheng et al., 2020b; Liu and Li, 2019; Hu et al., 2018). Based on this basic scheme, the attention mechanism (Lin et al., 2022; Shi et al., 2019) and aligning the spatial layout of features (Wang et al., 2021; Liu and Li, 2019; Shi et al., 2019; Shi et al., 2020c; Hu and Lee, 2020) are also widely considered in the network design. Most existing methods focus more on the similarity between cross-view embeddings while ignoring the redundant channels of the embedding itself.

In the study of human perception, the neuroscientist H. Barlow claims that concise, non-redundant descriptions are of higher value to the perception system and help to clarify inputs from the external world (Barlow et al., 1961). Based on this bio-perceptual hypothesis, we argue that stripping the redundancy of visual embeddings contributes to the discrimination of different targets. In this paper, we propose a dynamic weighted decorrelation regularization (DWDR). The measuring objective of DWDR is the Pearson cross-correlation coefficient matrix, which is computed from features of positive pairs composed of cross-view images. Specifically, DWDR employs the Square Loss to regress the diagonal elements of the objective matrix to 1, while the off-diagonal elements are pushed towards 0. However, the objective matrix is typically large, e.g., $2048\times 2048$. The optimizer is easily overwhelmed by the mass of elements already close to the optimization goal, thus ignoring other elements that still need to be regressed. To address this optimization problem, we assign a dynamic weight to each element loss according to the distance of the target element from its regression goal. The dynamic weight adaptively highlights the importance of poorly-regressed elements and suppresses the side effect of well-regressed elements in optimization. We observe that DWDR encourages the learned model to focus on a larger effective receptive field, which prevents the network from overfitting to a local pattern (Luo et al., 2016) (see Figure 1). Besides, as mentioned above, the computation of the Pearson cross-correlation coefficient matrix requires positive sample pairs. As a by-product of selecting positive pairs, we also provide a cross-view symmetric sampling strategy. In a training batch, our symmetric sampling strategy aligns the number of images with the same geo-tag between different platforms. Therefore, the proposed strategy mitigates the sample imbalance, especially in drone-to-satellite geo-localization, which contains limited satellite data (Ding et al., 2021; Dai et al., 2021). To summarize, our contributions are as follows.

  • We propose a dynamic weighted decorrelation regularization (DWDR), which motivates networks to learn discriminative embeddings by stripping the redundancy of features. During training, DWDR assigns a dynamic weight to the loss of each element in the objective matrix, yielding efficient optimization of networks. As a by-product of DWDR, we further introduce a cross-view symmetric sampling strategy, which maintains the example balance in a training batch.

  • Albeit simple, we demonstrate the effectiveness of the proposed method on three cross-view geo-localization datasets. Extensive experiments show that multiple existing works (Zheng et al., 2020b, ; Wang et al., , 2021; Liu et al., , 2021) fused with our method are able to further boost the performance. In addition, we observe that our method still obtains superior results even for a short visual embedding with 64 dimensions.

The rest of this paper is organized as follows. In Section 2, we discuss related works. The details of our method are illustrated in Section 3. Experimental results are provided in Section 4. Finally, Section 5 presents a summary.

2 Related Work

In this section, we briefly review related previous works, including image-based cross-view geo-localization and low-redundancy representation learning.

2.1 Image-based Cross-view Geo-localization

Image-based cross-view geo-localization has been tackled as an image retrieval task. Early works (Bansal et al., 2011; Lin et al., 2013; Castaldo et al., 2015; Saurer et al., 2016) employ hand-crafted operators to extract discriminative features for cross-view image matching. With the development of deep learning, the convolutional neural network (CNN) has received much research attention for extracting image representations. The pioneering CNN-based approach (Workman and Jacobs, 2015) directly deploys a pre-trained AlexNet (Krizhevsky et al., 2012) to extract features for cross-view geo-localization. Further, Workman et al. (2015) introduce the information of image pairs as a constraint to fine-tune the pre-trained network and acquire better performance. Following this line of constraint design, Lin et al. (2015) borrow knowledge from face verification and harness the contrastive loss (Hadsell et al., 2006) to guide the optimization of a modified Siamese Network (Chopra et al., 2005). Vo and Hays (2016) discuss the limitation of the Siamese architecture in large-scale cross-view matching and provide a soft-margin triplet loss to improve the geo-localization accuracy. Similarly, Hu et al. (2018) propose a weighted soft-margin ranking loss, which not only improves the matching accuracy but also speeds up the training convergence. Cai et al. (2019) mine hard examples in the training batch to strengthen the penalization of the soft-margin triplet loss. Zheng et al. (2020b) suggest that images with the same identity can be classified into one cluster and apply the instance loss (Zheng et al., 2020c; Zheng et al., 2017) as a proxy target to learn discriminative embeddings. Another line of works concentrates on addressing the spatial misalignment problem of cross-view retrieval. CVM-Net (Hu et al., 2018) employs a shared NetVLAD to aggregate local features and narrow the visual gap between different viewpoints. Liu and Li (2019) explicitly encode the orientation information into the image descriptors and boost the discriminative power of the learned features. Shi et al. (2020c) first attempt to utilize the optimal transport (OT) theory to align the spatial layout information in the high-level features. Then Shi et al. (2019) directly resort to the polar transform to align the pixel-level semantic information of cross-view images. DSM (Shi et al., 2020a) designs a dynamic similarity matching module to address cross-view matching with a limited Field of View (FoV). LPN (Wang et al., 2021) stresses the importance of contextual information and proposes a square-ring partition strategy to improve the performance of cross-view geo-localization. Without any extra annotations, RK-Net (Lin et al., 2022) automatically detects salient keypoints to improve the model capability against appearance changes.

2.2 Low-Redundancy Representation Learning

In the early study of human perception, the neuroscientist H. Barlow (Barlow et al., 1961) suggests that the perception system tends to encode raw sensory inputs as low-redundancy representations in which each component possesses statistical independence. This learning principle guides a number of algorithms in machine learning. Bengio and Bergstra (2009) argue that the decorrelation criterion is useful and derive a fast online pre-training algorithm to learn decorrelated features for neural networks. Cogswell et al. (2016) design the DeCov loss, which motivates the network to learn non-redundant representations, and demonstrate that decorrelating representations helps to reduce overfitting of the trained deep networks. Sun et al. (2017) utilize Singular Value Decomposition (SVD) and reduce the correlation between output units by integrating an orthogonality constraint into CNN training, so that the final descriptor contains less redundant information about the sample. In self-supervised learning (SSL), Barlow Twins (Zbontar et al., 2021) proposes a simple yet effective objective function to acquire representations with low redundancy and avoid model collapse. The optimization goal of Barlow Twins is to transform a cross-correlation matrix into an identity matrix. SSL does not require human annotation of the input data, and the cross-correlation matrix is computed from two distorted representations of a sample.

Figure 2: A schematic overview of our method. We first apply the symmetric sampling strategy to generate one batch of drone-and-satellite positive pairs. The symmetric sampling strategy is composed of the drone-view based sampling and the satellite-view based sampling. Then the satellite-platform and drone-platform images of the batch are fed into a two-branch network. The two branches share the weights of the backbones, since images from the drone platform and the satellite platform have similar patterns. Next, average pooling is deployed to aggregate the output feature maps of each branch into column vectors. Finally, the column vectors of the two branches are fed into a classifier module to acquire the predicted logit scores, respectively, and the cross-entropy function is utilized to compute the instance loss. The proposed DWDR aims to transform the Pearson cross-correlation coefficient matrix $\rho$ into an identity matrix $I$ as much as possible. The Pearson cross-correlation coefficient matrix $\rho$ is calculated from the column vectors of the two branches. Note that here we show the framework employing ResNet-50 as the backbone and images of University-1652 as inputs.

3 Proposed Method

In Section 3.1, we first give a revisit of preliminaries followed by the description of our baseline network structure for geo-localization in Section 3.2. Next, we introduce the dynamic weighted decorrelation regularization (DWDR). The dynamic weight mechanism relieves the plateau problem in Barlow Twins (Zbontar et al., , 2021). We also provide a mechanism discussion (see Section 3.3).

3.1 Preliminaries

Cross-correlation matrix measures the correlation between two matrices. In particular, for two random matrices $\mathbf{X}=(X_{1},X_{2},\cdots,X_{M})^{T}$ and $\mathbf{Y}=(Y_{1},Y_{2},\cdots,Y_{N})^{T}$, where $X_{m},Y_{n}\in\mathbb{R}^{a}$ with $a$ dimensions, the cross-correlation matrix of $\mathbf{X}$ and $\mathbf{Y}$ can be defined as:

$$\phi\triangleq\mathbb{E}\left[\mathbf{X}\mathbf{Y}^{T}\right]. \qquad (1)$$

A component-wise description is:

$$\phi=\begin{bmatrix}\mathbb{E}\left[X_{1}Y_{1}^{T}\right]&\mathbb{E}\left[X_{1}Y_{2}^{T}\right]&\cdots&\mathbb{E}\left[X_{1}Y_{N}^{T}\right]\\ \mathbb{E}\left[X_{2}Y_{1}^{T}\right]&\mathbb{E}\left[X_{2}Y_{2}^{T}\right]&\cdots&\mathbb{E}\left[X_{2}Y_{N}^{T}\right]\\ \vdots&\vdots&\ddots&\vdots\\ \mathbb{E}\left[X_{M}Y_{1}^{T}\right]&\mathbb{E}\left[X_{M}Y_{2}^{T}\right]&\cdots&\mathbb{E}\left[X_{M}Y_{N}^{T}\right]\end{bmatrix}, \qquad (3)$$

where 𝔼[]\mathbb{E\left[\cdot\right]} refers to the expectation.

Pearson cross-correlation coefficient matrix is similar to the cross-correlation matrix, but it provides a normalized measurement between two matrices. Differently, the value of each element in the Pearson cross-correlation coefficient matrix lies between $-1$ and $1$. The scores $1$ and $-1$ denote that two vectors are perfectly correlated and anti-correlated, respectively, while $0$ means that two vectors are completely uncorrelated. Based on the definition of the cross-correlation matrix, the Pearson cross-correlation coefficient matrix can be formulated as:

$$\rho=\frac{\mathbb{E}\left[\left(\mathbf{X}-\mu_{\mathbf{X}}\right)\left(\mathbf{Y}-\mu_{\mathbf{Y}}\right)^{T}\right]}{\sigma_{\mathbf{X}}\sigma_{\mathbf{Y}}}, \qquad (4)$$

where $\mu_{\mathbf{X}}$ and $\sigma_{\mathbf{X}}$ are the mean and the standard deviation of $\mathbf{X}$, respectively, and $\mu_{\mathbf{Y}}$ and $\sigma_{\mathbf{Y}}$ are the mean and the standard deviation of $\mathbf{Y}$. $\mathbb{E}\left[\cdot\right]$ is the expectation.
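As a concrete reference, the following is a minimal PyTorch sketch of the Pearson cross-correlation coefficient matrix in Eq. 4, computed channel-wise over a batch; the function name and the batch-first tensor layout are our assumptions rather than a released implementation.

```python
import torch

def pearson_matrix(f1: torch.Tensor, f2: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Pearson cross-correlation coefficient matrix of two [batch, dim] feature tensors."""
    # Standardize every channel over the batch (zero mean, unit standard deviation).
    z1 = (f1 - f1.mean(dim=0)) / (f1.std(dim=0, unbiased=False) + eps)
    z2 = (f2 - f2.mean(dim=0)) / (f2.std(dim=0, unbiased=False) + eps)
    # Entry (i, j) is the correlation between channel i of f1 and channel j of f2.
    return z1.t() @ z2 / f1.size(0)   # [dim, dim], values in [-1, 1]
```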

Barlow Twins (Zbontar et al., , 2021) is a widely-used method for self-supervised learning (SSL). To address the issue of representation collapse, the main contribution of Barlow Twins is introducing a new regularization objective, which can be defined as:

$$L_{BT}\triangleq\sum_{i=1}^{d}(1-\phi_{ii})^{2}+\lambda\sum_{i=1}^{d}\sum_{j=1,j\neq i}^{d}\phi_{ij}^{2}, \qquad (5)$$

where $\lambda$ is a positive hyper-parameter, and $\phi$ is the cross-correlation matrix between two mini-batches of features. Given two batches of features with $d$ dimensions ($d$ is generally set to 2048), we multiply the feature matrices along the batch dimension. Thus, the dimension of $\phi$ is $d\times d$. The regularization function $L_{BT}$ encourages every feature channel to be independent of the others. Specifically, it impels the diagonal elements $\phi_{ii}$ from the same channel towards $1$, while pushing the off-diagonal elements $\phi_{ij}$ between different channels towards $0$. However, in practice, Barlow Twins meets an optimization problem, especially when facing a typically large matrix (e.g., $2048\times 2048$). The model arrives at a plateau after the majority of channels have converged, and it neglects the remaining still-correlated “hard” channels, compromising the training process.
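For reference, a minimal sketch of Eq. 5 is given below; it assumes the $d\times d$ cross-correlation matrix has already been computed, and the default trade-off value is only illustrative.

```python
import torch

def barlow_twins_loss(phi: torch.Tensor, lam: float = 5e-3) -> torch.Tensor:
    diag = torch.diagonal(phi)
    on_diag = (1.0 - diag).pow(2).sum()               # pull phi_ii towards 1
    off_diag = phi.pow(2).sum() - diag.pow(2).sum()   # push phi_ij (i != j) towards 0
    return on_diag + lam * off_diag
```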

3.2 Network Structure

We adopt ResNet-50 (He et al., 2016) as the backbone and add a new classifier. The classifier consists of a fully-connected layer (FC), a batch normalization layer (BN), a dropout layer (Dropout), and another fully-connected layer (FC). Notably, the backbone can also be other networks, such as VGG16 (Simonyan and Zisserman, 2015) and Swin Transformer (Liu et al., 2021). The two-branch baseline consists of three forward phases, i.e., feature extraction, feature aggregation, and feature classification. Specifically, we denote the input images from the two platforms as $x_{k}$, where $k\in\{1,2\}$: $1$ denotes the satellite platform, and $2$ refers to the drone or ground platform. We first employ two backbones with shared weights to extract feature maps. Then global average pooling is deployed to aggregate the information of the feature maps into the column vectors $f_{k}$. Finally, we harness a classifier to map the vectors $f_{k}$ of different platforms into one shared classification space and acquire the predicted logit vectors $z_{k}$. Meanwhile, the cross-entropy function is employed to calculate the instance loss $L_{id}$ (Zheng et al., 2020b). The instance loss is a classification loss with a shared classifier $\mathcal{F}_{classifier}$. The above process can be formulated as:

$$f_{k}=\mathcal{A}vgpool(\mathcal{F}^{k}_{backbone}(x_{k})), \qquad (6)$$
$$z_{k}=\mathcal{F}_{classifier}(f_{k}), \qquad (7)$$
$$L_{id}=\sum_{k=1}^{2}-\log\frac{\exp(z_{k}(y))}{\sum_{c=1}^{C}\exp(z_{k}(c))}. \qquad (8)$$

The label $y\in[1,C]$, where $C$ indicates the number of geographic-target categories in the training set. $z_{k}(y)$ is the logit score of the ground-truth geo-tag $y$. During inference, we remove the final linear classification layer and extract the intermediate feature $f_{k}$ as the visual representation.
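A minimal PyTorch sketch of the two-branch baseline (Eqs. 6-8) is provided below as an illustration; the class names, the 512-dimensional bottleneck, and the use of a torchvision ResNet-50 (with the stride modification of Section 4.2 omitted for brevity) are our assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class Classifier(nn.Module):
    """FC -> BN -> Dropout -> FC, shared by both platforms."""
    def __init__(self, in_dim=2048, num_classes=701, mid_dim=512, droprate=0.75):
        super().__init__()
        self.bottleneck = nn.Sequential(
            nn.Linear(in_dim, mid_dim),
            nn.BatchNorm1d(mid_dim),
            nn.Dropout(p=droprate),
        )
        self.fc = nn.Linear(mid_dim, num_classes)

    def forward(self, f):
        return self.fc(self.bottleneck(f))

class TwoBranchBaseline(nn.Module):
    def __init__(self, num_classes=701):                    # 701 training buildings in University-1652
        super().__init__()
        resnet = models.resnet50(weights="IMAGENET1K_V1")    # assumes torchvision >= 0.13
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # keep conv feature maps
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = Classifier(num_classes=num_classes)

    def forward(self, x_sat, x_drone):
        feats, logits = [], []
        for x in (x_sat, x_drone):                           # the two branches share all weights
            f = self.pool(self.backbone(x)).flatten(1)       # Eq. 6: column vector f_k
            feats.append(f)
            logits.append(self.classifier(f))                # Eq. 7: shared classifier
        return feats, logits

def instance_loss(logits, y):
    """Eq. 8: cross entropy on both platforms with the shared geo-tag label y."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(z, y) for z in logits)
```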

3.3 Dynamic Weighted Decorrelation Regularization

In this work, we introduce a dynamic weighted decorrelation regularization (DWDR) to encourage the network to learn low-redundancy visual embeddings. As shown in Figure 2, DWDR is implemented based on a classic two-branch baseline (Zheng et al., 2020b, ). The two-branch baseline harnesses location classification as the pretext (Zheng et al., 2020b, ; Zheng et al., , 2019) to conduct the cross-view geo-localization task. During training, we employ the symmetric sampling strategy to balance examples between different platforms in a training batch. It is worth noting that the symmetric sampling strategy is a by-product of DWDR.

The optimization objective of DWDR is the Pearson cross-correlation coefficient matrix $\rho$ between the features $f_{k}$ extracted from images of different platforms. Given two batches of extracted vectors $f_{1}$ and $f_{2}$ of size $b\times 2048$, where $b$ denotes the batch size, we can obtain the objective matrix $\rho$ of shape $2048\times 2048$ according to Eq. 4. DWDR aims to spur the network by regressing the objective matrix $\rho$ to a sparse matrix, i.e., an identity matrix. We employ the Square Loss to constrain the regression of each element. DWDR can be written as:

$$L_{DWDR}\triangleq\sum_{i=1}^{d}\omega_{1}\cdot(1-\rho_{ii})^{2}+\lambda\sum_{i=1}^{d}\sum_{j=1,j\neq i}^{d}\omega_{2}\cdot\rho_{ij}^{2}, \qquad (9)$$

where $\lambda$ is a hyper-parameter to balance the weight of the diagonal and off-diagonal elements, $\rho_{ii}$ refers to the diagonal elements of the objective Pearson matrix $\rho$, and $\rho_{ij}$ denotes the off-diagonal elements. $\rho_{ii}$ is regressed to $1$, which makes visual embeddings of the same geo-tag invariant across platforms. $\rho_{ij}$ is regressed to $0$ to make the visual embedding channels independent of each other. $\omega_{1}$ and $\omega_{2}$ are two dynamic weights that prevent the optimization plateau and depend on the regression score. In this way, the dynamic weight adjusts the influence of the loss and adaptively pays attention to the poorly-regressed elements during training. Considering that each element of the Pearson matrix lies in $[-1,1]$, we set $\omega_{1}=\left(\frac{1-\rho_{ii}}{2}\right)^{\gamma_{1}}$ and $\omega_{2}=\lvert\rho_{ij}\rvert^{\gamma_{2}}$, where $\lvert\cdot\rvert$ denotes the absolute value. Given non-negative focusing parameters $\gamma_{1}$ and $\gamma_{2}$, we thus ensure $\omega_{1}\in[0,1]$ and $\omega_{2}\in[0,1]$. For elements close to the regression target (well-regressed elements), the assigned dynamic weight is near $0$. Conversely, for elements far from the regression target (poorly-regressed elements), the assigned dynamic weight increases toward $1$.
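To make Eq. 9 concrete, the following is a minimal PyTorch sketch of DWDR, reusing the `pearson_matrix` helper above; the default $\lambda$ follows Section 4.2, and setting `gamma1 = gamma2 = 0` recovers the unweighted Barlow Twins-style objective of Eq. 5 on the Pearson matrix.

```python
import torch

def dwdr_loss(rho: torch.Tensor, lam: float = 1.3e-3,
              gamma1: float = 1.0, gamma2: float = 1.0) -> torch.Tensor:
    d = rho.size(0)
    diag = torch.diagonal(rho)
    # Dynamic weight w1 = ((1 - rho_ii) / 2) ** gamma1: small when rho_ii is already near 1.
    w1 = ((1.0 - diag) / 2.0).clamp(min=0.0) ** gamma1
    on_diag = (w1 * (1.0 - diag) ** 2).sum()
    off_mask = ~torch.eye(d, dtype=torch.bool, device=rho.device)
    off = rho[off_mask]
    # Dynamic weight w2 = |rho_ij| ** gamma2: small when rho_ij is already near 0.
    w2 = off.abs() ** gamma2
    off_diag = (w2 * off ** 2).sum()
    return on_diag + lam * off_diag
```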

Symmetric sampling strategy. In order to compute the Pearson cross-correlation coefficient matrix, a necessary step is to construct a training batch by acquiring different-platform images with the same geo-tags (“positive pairs”). Here we take the University-1652 dataset (Zheng et al., 2020b) as an example. We can first randomly sample a satellite-view image as an anchor and then form the positive pair by finding a drone-view image of the same geo-tag. We call this sampling, which takes the satellite-view image as the anchor, the satellite-view based sampling. If we use the drone-view image as the anchor, we call the strategy the drone-view based sampling. We note that the numbers of images on different platforms are usually different due to the capturing difficulty. For example, we can easily acquire multiple drone images while only having one satellite image. If we apply the satellite-view based sampling during training, it will miss some drone-view images, because every time we randomly sample one image from the 54 drone-view images with replacement. On the other hand, if we apply the drone-view based sampling, a mini-batch will contain duplicate geo-tags, which repeatedly samples the same satellite image. Therefore, we propose a symmetric sampling strategy, which combines the satellite-view based sampling and the drone-view based sampling. In particular, we sample two mini-batches by the two strategies respectively and combine them as the new mini-batch to train the model. This combined strategy ensures the model can “see” all the training data while keeping the category sampling relatively balanced.
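A minimal sketch of this sampling logic is given below; the data layout (one satellite image and a list of drone images per geo-tag), the helper names, and the choice of letting each strategy contribute half of the combined mini-batch are our own illustrative assumptions.

```python
import random

def satellite_based_half(tags, sat_by_tag, drone_by_tag, half_batch):
    # Anchor on satellite images: sample geo-tags, then pick one of their drone views
    # with replacement, which may leave some drone-view images unvisited.
    batch = []
    for tag in random.sample(tags, half_batch):
        batch.append((sat_by_tag[tag], random.choice(drone_by_tag[tag]), tag))
    return batch

def drone_based_half(drone_queue, sat_by_tag, half_batch):
    # Anchor on drone images: consume a shuffled queue of (drone image, tag) pairs so
    # every drone view is eventually visited, at the cost of repeated satellite images.
    batch = []
    for _ in range(half_batch):
        drone_img, tag = drone_queue.pop()     # reshuffle and refill the queue each epoch
        batch.append((sat_by_tag[tag], drone_img, tag))
    return batch

def symmetric_batch(tags, sat_by_tag, drone_by_tag, drone_queue, batch_size=16):
    # Combine the two mini-batches so that geo-tags stay roughly balanced while all
    # drone-view images can still be sampled over an epoch.
    half = batch_size // 2
    return (satellite_based_half(tags, sat_by_tag, drone_by_tag, half)
            + drone_based_half(drone_queue, sat_by_tag, half))
```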

Discussion. The proposed method is similar to Barlow Twins in that it disentangles the correlation matrix, but it differs in the following aspects. First, Barlow Twins (Zbontar et al., 2021) optimizes the cross-correlation matrix, while our method harnesses the Pearson cross-correlation coefficient matrix. The Pearson matrix is preferable, since it normalizes every element to the limited range $[-1,1]$, which unifies the element scale within the matrix and prevents overflow. Second, as shown in Eq. 5, Barlow Twins accumulates the error over the whole matrix $\phi$. However, the dimension of $\phi$ is large, e.g., $2048\times 2048$. As training proceeds, the majority of elements converge and the network arrives at a plateau, since the loss is accumulated over a vast number of elements, and the optimization of the remaining minority of elements is usually ignored. The proposed DWDR also accumulates the error, but with dynamic weights for different elements in Eq. 9. We leverage the Pearson matrix, which is normalized to the range $[-1,1]$, to set the corresponding dynamic weights. The design also limits the dynamic weights to $[0,1]$, preventing weight overflow. Therefore, DWDR can focus on the minority of elements even when the majority of channels have converged. Compared with Barlow Twins, DWDR encourages the network to make still-correlated channels independent throughout the training period.

Optimization. We optimize our network by jointly employing the instance loss and DWDR:

$$L_{total}=\alpha L_{id}+(1-\alpha)L_{DWDR}. \qquad (10)$$

The instance loss $L_{id}$ forces different-platform images with the same geo-tag to be close in the high-level feature space and pushes mismatched images far apart. At the same time, DWDR motivates the network to learn visual embeddings with independent channels. Thus the network is able to extract more discriminative features. $\alpha$ is a weight that controls the balance between the loss function and the regularization term.
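Putting the pieces together, a minimal sketch of one training step under Eq. 10 is shown below; it reuses the `TwoBranchBaseline`, `instance_loss`, `pearson_matrix`, and `dwdr_loss` sketches above, with $\alpha=0.9$ and $\lambda=1.3\times 10^{-3}$ taken from the settings reported in Section 4.2.

```python
def train_step(model, optimizer, x_sat, x_drone, y, alpha=0.9, lam=1.3e-3):
    feats, logits = model(x_sat, x_drone)            # f_k (Eq. 6) and z_k (Eq. 7)
    rho = pearson_matrix(feats[0], feats[1])         # Pearson matrix of the positive pairs
    loss = alpha * instance_loss(logits, y) + (1.0 - alpha) * dwdr_loss(rho, lam=lam)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```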

4 Experiment

We introduce three cross-view geo-localization datasets and the evaluation protocol in Section 4.1. The implementation detail is provided in Section 4.2. We carry out a series of comparisons with state-of-the-art approaches in Section 4.3, followed by ablation studies in Section 4.4. Finally, Section 4.5 visualizes the cross-view geo-localization results.

4.1 Datasets and Evaluation Protocol

We conduct experiments on three geo-localization datasets, i.e., University-1652 (Zheng et al., 2020b, ), CVUSA (Zhai et al., , 2017), and CVACT (Liu and Li, , 2019).

University-1652 (Zheng et al., 2020b) is a multi-view multi-source dataset, including data from three different platforms, i.e., drones, satellites and dash cams. As the name implies, this dataset collects 1652 ordinary buildings of 72 universities around the world. 701 of the 1652 buildings are separated into the training set, and the other 951 buildings constitute the testing set. Therefore, building images in the training and testing sets do not overlap. For each building, the dataset contains one satellite-view image, 54 drone-view images and 3.38 ground-view images on average. Since it is hard to acquire enough street-view images from dash cams for some buildings, the dataset also collects 21,099 common-view images from Google Image as an extra training set. The dataset supports two new aerial-view geo-localization tasks, i.e., drone-view target localization (Drone \rightarrow Satellite) and drone navigation (Satellite \rightarrow Drone).

CVUSA (Zhai et al., 2017) is a large-scale cross-view dataset, which consists of images from two viewpoints, i.e., the ground view and the satellite view. In the dataset, 35,532 ground-and-satellite image pairs are employed for training, and 8,884 image pairs are provided for testing. Noteworthily, the ground-view images are panoramas, and the orientation of all ground and satellite images is aligned.

CVACT (Liu and Li, 2019) is a dataset similar to CVUSA. For the ground-to-satellite task, CVACT also contains 35,532 image pairs for training. Different from CVUSA, CVACT provides a validation set with 8,884 image pairs denoted as CVACT_val. Meanwhile, compared with CVUSA, CVACT possesses a larger test set with 92,802 image pairs named CVACT_test. When evaluating on CVACT_val, a query ground-view panorama matches only one satellite image in the gallery. However, for CVACT_test, a panoramic query image may correspond to several satellite images within 5 meters of the ground-truth location.

We follow existing works (Wang et al., , 2021; Shi et al., 2020c, ; Shi et al., , 2019) and mainly employ CVACT_val to evaluate our method when training on CVACT. The image number of training and evaluation in these three datasets is shown in Table 1.

Table 1: Statistics of three cross-view datasets, including the image number of training and testing sets. The left and right of the arrow (→) refer to the query and gallery platforms, respectively.

Dataset | Training: Drone | Training: Satellite | Drone → Satellite (query / gallery) | Satellite → Drone (query / gallery)
University-1652 (Zheng et al., 2020b) | 37,854 | 701 | 37,855 / 951 | 701 / 51,355

Dataset | Training: Ground | Training: Satellite | Ground → Satellite (query / gallery) | Satellite → Ground (query / gallery)
CVUSA (Zhai et al., 2017) | 35,532 | 35,532 | 8,884 / 8,884 | 8,884 / 8,884
CVACT (Liu and Li, 2019) | 35,532 | 35,532 | 8,884 / 8,884 | 8,884 / 8,884

Evaluation protocol. In our experiments, the performance of our method is evaluated by two commonly used metrics, i.e., Recall@K (R@K) and the average precision (AP). R@K refers to the proportion of true-matched candidates in the top-K of the ranking list. The value of AP is measured by the area under the Precision-Recall curve. Higher scores of these two metrics denote better performance of the network.
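As an illustration of this protocol, a minimal evaluation sketch is given below; it assumes query and gallery features are stored as [num, dim] tensors with integer geo-tag labels, and it is not the authors' official evaluation script.

```python
import torch

def evaluate(q_feat, q_label, g_feat, g_label, topk=(1, 5, 10)):
    dist = torch.cdist(q_feat, g_feat)               # Euclidean distance, as used at test time
    order = dist.argsort(dim=1)                       # ranking list for every query
    recalls = {k: 0.0 for k in topk}                  # R@Top1% corresponds to k = ceil(0.01 * gallery size)
    ap_sum = 0.0
    for i in range(q_feat.size(0)):
        good = g_label[order[i]] == q_label[i]        # relevance of the ranked gallery items
        for k in topk:                                # Recall@K: at least one true match in the top-K
            recalls[k] += float(good[:k].any())
        hits = torch.nonzero(good).flatten() + 1      # 1-based ranks of the true matches
        if hits.numel() > 0:                          # AP: area under the precision-recall curve
            precisions = torch.arange(1, hits.numel() + 1, dtype=torch.float) / hits.float()
            ap_sum += precisions.mean().item()
    n = q_feat.size(0)
    return {k: recalls[k] / n for k in topk}, ap_sum / n
```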

Table 2: Comparison with the state-of-the-art results reported on University-1652. The compared methods are categorized into three groups. The first group consists of baseline-related methods which employ average pooling to aggregate feature maps. The second group contains methods that apply contextual information. The third group includes Transformer-based methods. “$M$” indicates the margin of the triplet loss. † denotes an input image size of $384\times 384$. The input image sizes of the two Transformer-based methods and the other CNN-based methods are $224\times 224$ and $256\times 256$, respectively. The best results are in bold.
Method | Drone → Satellite (R@1 / AP) | Satellite → Drone (R@1 / AP)
Instance Loss (Baseline) (Zheng et al., 2020b) | 57.09 / 61.88 | 73.89 / 58.73
Contrastive Loss (Lin et al., 2015) | 52.39 / 57.44 | 63.91 / 52.24
Triplet Loss ($M=0.3$) (Chechik et al., 2009) | 52.16 / 57.47 | 65.05 / 52.37
Triplet Loss ($M=0.5$) (Chechik et al., 2009) | 51.23 / 56.40 | 62.77 / 51.29
Soft Margin Triplet Loss (Hu et al., 2018) | 53.67 / 58.69 | 67.90 / 54.76
LCM (Ding et al., 2021) | 66.65 / 70.82 | 79.89 / 65.38
RK-Net (Lin et al., 2022) | 66.13 / 70.23 | 80.17 / 65.76
Baseline (Zheng et al., 2020b) + Ours | 69.77 / 73.73 | 81.46 / 70.45
LPN (Wang et al., 2021) | 75.93 / 79.14 | 86.45 / 74.49
LPN + USAM (Lin et al., 2022) | 77.60 / 80.55 | 86.59 / 75.96
PCL (Tian et al., 2021) | 79.47 / 83.63 | 87.69 / 78.51
LPN (Wang et al., 2021) + Ours | 81.51 / 84.11 | 88.30 / 79.38
Swin-B (Liu et al., 2021) | 84.15 / 86.62 | 90.30 / 83.55
FSRA (Dai et al., 2021) | 84.51 / 86.71 | 88.45 / 83.47
Swin-B (Liu et al., 2021) + Ours | 86.41 / 88.41 | 91.30 / 86.02

4.2 Implementation Details

Our method is implemented based on a classic two-branch baseline (Zheng et al., 2020b). The baseline adopts a modified ResNet-50 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) to extract visual features. Specifically, we remove the final classification layer of ResNet-50. Besides, the stride of the second convolution layer and of the down-sampling layer in the first bottleneck of the ResNet-50 stage4 is changed from 2 to 1. The input image is resized to $256\times 256$, and the image augmentation consists of random cropping, random horizontal flipping, and random rotation. We employ stochastic gradient descent (SGD) with momentum 0.9 and weight decay 0.0005 to update the model parameters. The image number of each platform in a mini-batch is 16. The initial learning rate is 0.001 for the modified ResNet-50 backbone and 0.01 for the classifier module. The dropout rate in the classifier module is 0.75. We train the two-branch baseline for 120 epochs, and the learning rate is decayed by 0.1 after 80 epochs. There are two trade-off parameters $\lambda$ and $\alpha$ in the loss function. We run a simple search and observe the best results for $\lambda=1.3\times 10^{-3}$ and $\alpha=0.9$. Note that when using Swin-B (Liu et al., 2021) and VGG16 (Simonyan and Zisserman, 2015) as backbones, $\lambda=2.0\times 10^{-3}$ and $\lambda=3.9\times 10^{-3}$ are the best choices, respectively. During testing, we deploy the Euclidean distance to compute the similarities between the query and candidates. Our model is implemented in PyTorch (Paszke et al., 2019), and all experiments are conducted on a single NVIDIA RTX 2080Ti GPU.
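For completeness, a minimal sketch of the optimizer and schedule described above is shown below; the parameter-group split assumes the `TwoBranchBaseline` sketch from Section 3.2 and is not the authors' released training script.

```python
import torch

def build_optimizer(model):
    # Smaller learning rate for the pre-trained backbone, 10x larger for the new classifier.
    param_groups = [
        {"params": model.backbone.parameters(), "lr": 0.001},
        {"params": model.classifier.parameters(), "lr": 0.01},
    ]
    optimizer = torch.optim.SGD(param_groups, lr=0.001, momentum=0.9, weight_decay=5e-4)
    # Decay both learning rates by 0.1 after 80 of the 120 training epochs.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80], gamma=0.1)
    return optimizer, scheduler
```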

Table 3: Comparison with prior art on CVUSA and CVACT. The compared methods are divided into two groups. Group 1: methods without the polar transform. Group 2: methods utilizing the polar transform. The row “Polar Transform” marks the boundary between the two groups. ‡: the method is implemented using images processed by the polar transform. ⋆: the method harnesses extra orientation information as input. The best results are in bold.
Method | Publication | Backbone | CVUSA (R@1 / R@5 / R@10 / R@Top1%) | CVACT_val (R@1 / R@5 / R@10 / R@Top1%)
MCVPlaces (Workman et al., 2015) | ICCV'15 | AlexNet | - / - / - / 34.40 | - / - / - / -
Zhai (Zhai et al., 2017) | CVPR'17 | VGG16 | - / - / - / 43.20 | - / - / - / -
Vo (Vo and Hays, 2016) | ECCV'16 | AlexNet | - / - / - / 63.70 | - / - / - / -
CVM-Net (Hu et al., 2018) | CVPR'18 | VGG16 | 18.80 / 44.42 / 57.47 / 91.54 | 20.15 / 45.00 / 56.87 / 87.57
Orientation (Liu and Li, 2019) | CVPR'19 | VGG16 | 27.15 / 54.66 / 67.54 / 93.91 | 46.96 / 68.28 / 75.48 / 92.04
Zheng (Baseline) (Zheng et al., 2020b) | MM'20 | VGG16 | 43.91 / 66.38 / 74.58 / 91.78 | 31.20 / 53.64 / 63.00 / 85.27
Regmi (Regmi and Shah, 2019) | ICCV'19 | X-Fork | 48.75 / - / 81.27 / 95.98 | - / - / - / -
RK-Net (Lin et al., 2022) | TIP'22 | USAM | 52.50 / - / - / 96.52 | 40.53 / - / - / 89.12
Siam-FCANet (Cai et al., 2019) | ICCV'19 | ResNet-34 | - / - / - / 98.30 | - / - / - / -
CVFT (Shi et al., 2020c) | AAAI'20 | VGG16 | 61.43 / 84.69 / 90.94 / 99.02 | 61.05 / 81.33 / 86.52 / 95.93
LPN (Wang et al., 2021) | TCSVT'21 | ResNet-50 | 85.79 / 95.38 / 96.98 / 99.41 | 79.99 / 90.63 / 92.56 / 97.03
LPN + USAM (Lin et al., 2022) | TIP'22 | ResNet-50 | 91.22 / - / - / 99.67 | 82.02 / - / - / 98.18
Polar Transform
SAFA (Shi et al., 2019) | NIPS'19 | VGG16 | 89.84 / 96.93 / 98.14 / 99.64 | 81.03 / 92.80 / 94.84 / 98.17
DSM (Shi et al., 2020b) | CVPR'20 | VGG16 | 91.96 / 97.50 / 98.54 / 99.67 | 82.49 / 92.44 / 93.99 / 97.32
LPN (Wang et al., 2021) | TCSVT'21 | ResNet-50 | 93.78 / 98.50 / 99.03 / 99.72 | 82.87 / 92.26 / 94.09 / 97.77
Baseline + Ours | - | VGG16 | 75.62 / 90.45 / 93.60 / 98.60 | 66.76 / 83.34 / 87.11 / 95.10
LPN (Wang et al., 2021) + Ours | - | ResNet-50 | 94.33 / 98.54 / 99.09 / 99.80 | 83.73 / 92.78 / 94.53 / 97.78

4.3 Comparison with State-of-the-art Methods

Results on University-1652. As shown in Table 2, we compare our method with a number of competitive methods on University-1652. The compared methods are divided into three groups, i.e., baseline-related methods, methods harnessing contextual information, and Transformer-based methods. In the first group, the first line reports the results of our two-branch baseline, i.e., “Instance Loss (Zheng et al., 2020b)”. We can observe that “Baseline + Ours” substantially improves the baseline performance. In the drone-view target localization task (Drone \rightarrow Satellite), the R@1 accuracy increases from 57.09% to 69.77% (+12.68%), and the AP rises from 61.88% to 73.73% (+11.85%). In the drone navigation task (Satellite \rightarrow Drone), the R@1 accuracy goes up from 73.89% to 81.46% (+7.57%), and the AP increases from 58.73% to 70.45% (+11.72%). Meanwhile, the performance of our method also surpasses the other baseline-related methods. In the second group, LPN (Wang et al., 2021) explicitly takes advantage of contextual information during training. Some methods, e.g., “LPN + USAM (Lin et al., 2022)” and PCL (Tian et al., 2021), combined with LPN have yielded better results, and we can also implement our method based on LPN. Specifically, we re-implement LPN by utilizing the symmetric sampling strategy to replace the original random sampling and incorporating the dynamic weighted decorrelation regularization during training. Compared with the results of LPN, “LPN + Ours” achieves 81.51% R@1 accuracy (+5.58%) and 84.11% AP (+4.97%) on Drone \rightarrow Satellite, and 88.30% R@1 accuracy (+1.85%) and 79.38% AP (+4.89%) on Satellite \rightarrow Drone. The feature expression ability of the Transformer is stronger than that of the CNN, and both Transformer-based methods (Liu et al., 2021; Dai et al., 2021) obtain better performance than the CNN-based methods. We further combine our method with “Swin-B (Liu et al., 2021)”, i.e., the two-branch baseline applying Swin-B as the backbone. “Swin-B + Ours” achieves the state-of-the-art results on University-1652, i.e., 86.41% R@1 accuracy and 88.41% AP for Drone \rightarrow Satellite, and 91.30% R@1 accuracy and 86.02% AP for Satellite \rightarrow Drone.

Results on CVUSA and CVACT. Comparisons with other competitive approaches on CVUSA and CVACT are summarized in Table 3. CVUSA and CVACT have a similar data pattern, i.e., satellite-platform images of the aerial viewpoint and ground panoramas. The polar transform considers the geometric correspondence of the two-platform images and transforms the aerial-view image to approximately align with the ground panorama at the pixel level. The aligned images help to improve the performance of models. Depending on whether or not the polar transform is harnessed, the compared methods can be divided into two groups. The first group reports methods without the polar transform, and methods in the second group employ the polar transform during training and testing. Our method does not employ the polar transform. Experiments on CVUSA and CVACT show phenomena similar to those on University-1652. Our method first outperforms the dual-stream baseline (i.e., the method of Zheng (Zheng et al., 2020b)) by a large margin, i.e., a 31.71% R@1 improvement on CVUSA and a 35.56% R@1 improvement on CVACT. At the same time, our method exceeds most of the existing methods in the first group. In particular, our method obtains 75.62% R@1, 90.45% R@5, 93.60% R@10, and 98.60% R@Top1% on CVUSA, and 66.76% R@1, 83.34% R@5, 87.11% R@10, and 95.10% R@Top1% on CVACT. In the experiments on University-1652, we observe that our method can be combined with LPN (Wang et al., 2021) to achieve better results. The same experiments are also carried out on CVUSA and CVACT. There are two versions of LPN in Table 3, i.e., LPN without and with the polar transform; the latter has achieved higher performance. We notice that our approach still yields competitive results when complemented with the polar-transform version of LPN. For instance, “LPN + Ours” boosts the R@1 accuracy from 93.78% to 94.33% on CVUSA and from 82.87% to 83.73% on CVACT.

The above experimental results on three cross-view geo-localization datasets suggest two points. One is that our method can be flexibly applied in different cross-view settings. The other is that our method is able to encourage existing approaches to mine more diverse patterns, yielding discriminative features.

4.4 Ablation Studies

To further analyze our method, we design several ablation studies. The ablation studies are mainly based on the drone-view target localization (Drone \rightarrow Satellite) and drone navigation (Satellite \rightarrow Drone) of University-1652 (Zheng et al., 2020b, ).

Analysis of parameters $\gamma_{1}$ and $\gamma_{2}$. The main contribution of our paper is the proposed dynamic weighted decorrelation regularization (DWDR). In DWDR, $\gamma_{1}$ and $\gamma_{2}$ are two key parameters that flexibly adjust the rate at which well-regressed elements of the Pearson cross-correlation coefficient matrix are down-weighted. When $\gamma_{1}=0$ and $\gamma_{2}=0$, DWDR does not apply the two dynamic weights and can be viewed as Barlow Twins (Zbontar et al., 2021). We empirically tune $\gamma_{1}$ and $\gamma_{2}$, and the related results are detailed in Table 4. We first observe that applying one dynamic weight, i.e., only $\gamma_{1}=1$ or only $\gamma_{2}=1$, achieves results similar to Barlow Twins. The limited performance improvement reflects that ignoring poorly-regressed diagonal or off-diagonal elements both induce the optimization plateau. When both $\gamma_{1}$ and $\gamma_{2}$ are set to 1, i.e., using the two dynamic weights, we obtain the best results. Specifically, compared with deploying Barlow Twins as the regularization (“Baseline + BT”), our method (“Baseline + Ours”) boosts R@1 from 67.91% to 69.77% (+1.86%) and AP from 71.99% to 73.73% (+1.74%) on Drone \rightarrow Satellite, and raises R@1 from 80.17% to 81.46% (+1.29%) and AP from 68.03% to 70.45% (+2.42%) on Satellite \rightarrow Drone. When $\gamma_{1}=2$ and $\gamma_{2}=2$, the performance gain slightly degrades. A reasonable speculation is that large focusing parameters $\gamma_{1}$ and $\gamma_{2}$ cause the importance of poorly-regressed elements in the optimization process to be excessively reduced as well. To further verify the robustness of the selected parameters, we conduct the same experiments on “LPN (Wang et al., 2021) + Ours” and “Swin-B (Liu et al., 2021) + Ours” and find the same conclusion. That is, when both $\gamma_{1}$ and $\gamma_{2}$ are set to 1, the models achieve competitive results. Therefore, we choose $\gamma_{1}=1$ and $\gamma_{2}=1$ as the default focusing parameters of DWDR. All three groups of experiments also support that DWDR is more effective than Barlow Twins at motivating networks to learn low-redundancy visual embeddings.

Table 4: Ablation study with different $\gamma_{1}$ and $\gamma_{2}$ in the dynamic weighted decorrelation regularization. BT refers to Barlow Twins (Zbontar et al., 2021).

Method | $\gamma_{1}$ | $\gamma_{2}$ | Drone → Satellite (R@1 / AP) | Satellite → Drone (R@1 / AP)
Baseline (Zheng et al., 2020b) + BT | 0 | 0 | 67.91 / 71.99 | 80.17 / 68.03
Baseline (Zheng et al., 2020b) + Ours | 0 | 1 | 66.83 / 71.01 | 77.89 / 68.01
Baseline (Zheng et al., 2020b) + Ours | 1 | 0 | 67.57 / 71.62 | 78.03 / 67.93
Baseline (Zheng et al., 2020b) + Ours | 1 | 1 | 69.77 / 73.73 | 81.46 / 70.45
Baseline (Zheng et al., 2020b) + Ours | 2 | 2 | 69.40 / 73.33 | 80.88 / 70.05
LPN (Wang et al., 2021) + BT | 0 | 0 | 80.93 / 83.60 | 86.02 / 78.33
LPN (Wang et al., 2021) + Ours | 0 | 1 | 80.84 / 83.50 | 87.30 / 79.26
LPN (Wang et al., 2021) + Ours | 1 | 0 | 80.83 / 83.49 | 88.30 / 79.93
LPN (Wang et al., 2021) + Ours | 1 | 1 | 81.51 / 84.11 | 88.30 / 79.38
LPN (Wang et al., 2021) + Ours | 2 | 2 | 80.49 / 83.17 | 88.45 / 79.91
Swin-B (Liu et al., 2021) + BT | 0 | 0 | 86.03 / 88.05 | 91.01 / 85.07
Swin-B (Liu et al., 2021) + Ours | 0 | 1 | 85.94 / 88.00 | 91.01 / 85.33
Swin-B (Liu et al., 2021) + Ours | 1 | 0 | 86.07 / 88.09 | 90.30 / 85.68
Swin-B (Liu et al., 2021) + Ours | 1 | 1 | 86.41 / 88.41 | 91.30 / 86.02
Swin-B (Liu et al., 2021) + Ours | 2 | 2 | 85.54 / 87.73 | 90.58 / 85.65

Effect of our sampling strategy and DWDR. Our symmetric sampling strategy is a combination of the drone-view based sampling and the satellite-view based sampling. To discuss the effectiveness of our sampling strategy, we conduct three groups of experiments in which only the sampling strategy is changed. Meanwhile, in each group of experiments, we study the effectiveness of DWDR. The experimental results are shown in Table 5. We first observe that utilizing DWDR alone does not give results comparable to the baseline (“Instance Loss (Zheng et al., 2020b)”) shown in Table 2. However, when applied in conjunction with the instance loss, DWDR significantly improves the performance of the network, regardless of the sampling strategy. The experiments within each group also indicate that DWDR concentrates more on the redundant channels of the embedding itself rather than on the distance between cross-view embeddings. Second, when only the instance loss is harnessed, the drone-view based and the satellite-view based sampling acquire results similar to the baseline (“Instance Loss”) with random sampling. In contrast, the symmetric sampling strategy obtains the best geo-localization accuracy. Furthermore, the symmetric sampling strategy is also the most competitive in the other two experimental settings, i.e., DWDR alone and the instance loss plus DWDR. The significant performance increment demonstrates that the symmetric sampling strategy, as a by-product, is effective.

Table 5: Effect of the symmetric sampling strategy and the dynamic weighted decorrelation regularization (DWDR).
Method | Instance Loss | DWDR | Drone → Satellite (R@1 / AP) | Satellite → Drone (R@1 / AP)
Drone-view based sampling | ✓ | | 60.86 / 65.56 | 73.89 / 59.05
Drone-view based sampling | | ✓ | 22.61 / 27.84 | 35.38 / 20.98
Drone-view based sampling | ✓ | ✓ | 66.08 / 70.22 | 77.60 / 65.48
Satellite-view based sampling | ✓ | | 58.54 / 63.10 | 73.61 / 58.49
Satellite-view based sampling | | ✓ | 24.25 / 29.37 | 39.09 / 22.83
Satellite-view based sampling | ✓ | ✓ | 65.06 / 69.16 | 76.46 / 65.27
The symmetric sampling strategy (Ours) | ✓ | | 64.74 / 68.96 | 77.75 / 64.32
The symmetric sampling strategy (Ours) | | ✓ | 34.38 / 40.02 | 50.07 / 34.28
The symmetric sampling strategy (Ours) | ✓ | ✓ | 69.77 / 73.73 | 81.46 / 70.45

Effect of the dimension of visual embeddings. We deploy final visual embeddings with different dimensions in geo-localization to investigate the effect of the embedding dimension on the retrieval accuracy. The experimental results of the baseline and “Baseline + Ours” are shown in Table 6. We observe that, with the increase of the dimension, both the baseline (Zheng et al., 2020b) and “Baseline + Ours” exhibit a persistent improvement, since the visual embedding possesses more information capacity. However, the performance of the two methods encounters a bottleneck when the feature dimension is 512. As the dimension of the feature increases to 1024, the performance of the baseline decreases significantly, while the performance of “Baseline + Ours” tends to stabilize. The experimental results reflect two aspects. One is that features with too high dimensions are prone to redundant channels, which compromise the geo-localization accuracy of the models. The other is that our method can encourage networks to learn low-redundancy embeddings and improve the robustness of the model. In addition, as shown in Figure 3, we notice that when the dimension rises from 64 to 128, the baseline achieves a higher growth rate than “Baseline + Ours”. Short-dimensional features with small information capacity limit the performance of models. We speculate that our method allows the model to include more primary discriminative patterns in the limited feature dimension to mitigate the negative effects of insufficient information capacity. Therefore, when the feature dimension changes, our method produces fewer performance fluctuations.

Table 6: Ablation study of cross-view geo-localization applying visual features with different dimensions. “Dim” denotes the dimension of features.
Method | Dim | Drone → Satellite (R@1 / AP) | Satellite → Drone (R@1 / AP)
Baseline (Zheng et al., 2020b) (Instance Loss) | 64 | 49.20 / 54.36 | 62.91 / 49.68
Baseline (Zheng et al., 2020b) (Instance Loss) | 128 | 56.76 / 61.74 | 72.04 / 58.14
Baseline (Zheng et al., 2020b) (Instance Loss) | 256 | 57.26 / 62.17 | 73.18 / 58.70
Baseline (Zheng et al., 2020b) (Instance Loss) | 512 | 57.09 / 61.88 | 73.89 / 58.73
Baseline (Zheng et al., 2020b) (Instance Loss) | 1024 | 54.20 / 59.20 | 68.33 / 55.37
Baseline (Zheng et al., 2020b) + Ours | 64 | 60.37 / 65.03 | 72.90 / 60.31
Baseline (Zheng et al., 2020b) + Ours | 128 | 63.51 / 68.05 | 77.03 / 64.57
Baseline (Zheng et al., 2020b) + Ours | 256 | 68.71 / 72.72 | 78.89 / 68.44
Baseline (Zheng et al., 2020b) + Ours | 512 | 69.77 / 73.73 | 81.46 / 70.45
Baseline (Zheng et al., 2020b) + Ours | 1024 | 70.55 / 74.56 | 80.60 / 70.51
Figure 3: Impact of the dimension of features. The R@1 accuracy between the baseline and our method is compared. The red line refers to the baseline (Zheng et al., 2020b, ), and our method is shown using the blue line. (a) The drone-view target localization task (Drone \rightarrow Satellite). (b) The drone navigation task (Satellite \rightarrow Drone). When the feature dimension changes from 128 to 64, the performance of our method drops less than the baseline.

Effect of DWDR under different loss functions. Our baseline applies the instance loss (Zheng et al., 2020c, ; Zheng et al., , 2017) to optimize the network while other loss functions are available. The triplet loss and the soft margin triplet loss are broadly utilized in previous works (Hu et al., , 2018; Liu and Li, , 2019; Shi et al., 2020c, ; Vo and Hays, , 2016). We also evaluate our DWDR by deploying baselines adopting these two loss functions. The margin value of the triplet loss is 0.3, and experimental results are shown in Table 7. We notice that both baselines combined with DWDR gain improved retrieval accuracy on the “Drone \rightarrow Satellite” task and the “Satellite \rightarrow Drone” task of University-1652.

Effect of the intra-view DWDR. Our method applies the cross-view DWDR, in which the Pearson cross-correlation coefficient matrix is computed employing cross-view images. The intra-view DWDR means that the Pearson cross-correlation coefficient matrix of DWDR is calculated employing two distorted images from the same platform generated by different data augmentations. In experiments, the method only utilizing the symmetric sampling strategy is treated as the baseline, and the comparison results are shown in Table 8. We observe that the baseline combined with the intra-view DWDR gains a slight increment. Although the intra-view DWDR also encourages the network to learn independent embedding channels, our cross-view DWDR significantly outperforms applying the intra-view DWDR. It is because the cross-view DWDR is aligned with the cross-view retrieval test setting, which considers embeddings from different platforms for the geo-localization task. It also explains the limited performance increase of applying both intra-view and cross-view DWDR, which relies on the cross-view DWDR.

Table 7: Ablation study of DWDR under different loss functions on University-1652. “$M$” denotes the margin of the triplet loss.

Method | Drone → Satellite (R@1 / AP) | Satellite → Drone (R@1 / AP)
Triplet Loss ($M=0.3$) (Chechik et al., 2009) | 52.16 / 57.47 | 63.91 / 52.24
Soft Margin Triplet Loss (Hu et al., 2018) | 53.67 / 58.69 | 67.90 / 54.76
Triplet Loss ($M=0.3$) + DWDR | 54.14 / 59.28 | 67.90 / 54.76
Soft Margin Triplet Loss + DWDR | 57.75 / 62.58 | 69.33 / 57.46
Table 8: Ablation study of the symmetric sampling strategy combined with different DWDR variants on University-1652.

Method | Drone → Satellite (R@1 / AP) | Satellite → Drone (R@1 / AP)
Symmetric sampling (Baseline) | 64.74 / 68.96 | 77.75 / 64.32
+ Intra-view DWDR | 65.31 / 69.57 | 79.17 / 65.74
+ Cross-view DWDR | 69.77 / 73.73 | 81.46 / 70.45
+ Intra & Cross-view DWDR | 69.81 / 73.68 | 82.45 / 70.86

4.5 Qualitative Results

We visualize some heatmaps generated by the baseline and our method as an extra qualitative evaluation. Figure 4 shows the acquired heatmaps in the drone and satellite platforms of University-1652. Images in University-1652 possess an obvious geographic target. Compared with the baseline (Zheng et al., 2020b, ), our method activates a wider range of geographic target regions. In addition, we show some retrieval results on different datasets (see Figure 5). University-1652 supports two tasks. In the drone-view target localization task, the drone-platform image is the query, and in the drone navigation task, the satellite-platform image is the query. The retrieval results of two tasks are shown in Figure 5 (I) and (II). Figure 5 (III) and (IV) show the retrieval results of the ground-to-satellite localization task on CVUSA and CVACT. Given a randomly selected test query, we notice that the proposed method has successfully retrieved the most relevant results from the candidate gallery.

Figure 4: Visualization of heatmaps. Heatmaps are produced by the baseline (Zheng et al., 2020b, ) and ours on different platforms of University-1652, i.e., the drone platform and the satellite platform.
Figure 5: Qualitative image retrieval results in different datasets. (I) and (II) show Top-5 retrieval results on University-1652. Different query images indicate the different tasks. (I) is the drone-view target localization task, and (II) is the drone navigation task. (III) and (IV) exhibit Top-3 retrieval results of geographic localization on CVUSA and CVACT, respectively. The true matches are in yellow boxes, and the false matches are highlighted by blue boxes.

5 Conclusion

In this paper, we propose a dynamic weighted decorrelation regularization (DWDR) for cross-view geo-localization. DWDR reduces the redundancy of visual embeddings by motivating the network to learn independent embedding channels. Specifically, DWDR sets dynamic weights to focus on the poorly-regressed elements when constraining the objective matrix to be as close as possible to the identity matrix. As a by-product of DWDR, the cross-view symmetric sampling strategy is introduced to balance the number of examples from different platforms in a training batch. Extensive experiments on three datasets, i.e., University-1652, CVUSA and CVACT, demonstrate that our method can learn discriminative embeddings, which significantly improves the retrieval accuracy. Moreover, our method also acquires competitive results with extremely short features.

Data Availability Statement

Three datasets supporting the findings of this study are available with the permission of the dataset authors. The links to request these datasets are as follows.
(1) University-1652: https://github.com/layumi/University1652-Baseline;
(2) CVUSA : https://github.com/viibridges/crossnet;
(3) CVACT : https://github.com/Liumouliu/OriCNN.

References

  • Bansal et al., (2011) Bansal, M., Sawhney, H. S., Cheng, H., and Daniilidis, K. (2011). Geo-localization of street views with aerial image databases. In ACM International Conference on Multimedia.
  • Barlow et al., (1961) Barlow, H. B. et al. (1961). Possible principles underlying the transformation of sensory messages. Sensory communication, 1(01).
  • Bengio and Bergstra, (2009) Bengio, Y. and Bergstra, J. (2009). Slow, decorrelated features for pretraining complex cell-like networks. In Neural Information Processing Systems.
  • Cai et al., (2019) Cai, S., Guo, Y., Khan, S., Hu, J., and Wen, G. (2019). Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss. In IEEE International Conference on Computer Vision.
  • Castaldo et al., (2015) Castaldo, F., Zamir, A. R., Angst, R., Palmieri, F. A. N., and Savarese, S. (2015). Semantic cross-view matching. In IEEE International Conference on Computer Vision Workshops.
  • Chechik et al., (2009) Chechik, G., Sharma, V., Shalit, U., and Bengio, S. (2009). Large scale online learning of image similarity through ranking. In Iberian Conference on Pattern Recognition and Image Analysis.
  • Chopra et al., (2005) Chopra, S., Hadsell, R., and LeCun, Y. (2005). Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Cogswell et al., (2016) Cogswell, M., Ahmed, F., Girshick, R., Zitnick, L., and Batra, D. (2016). Reducing overfitting in deep networks by decorrelating representations. In International Conference on Learning Representations.
  • Dai et al., (2021) Dai, M., Hu, J., Zhuang, J., and Zheng, E. (2021). A transformer-based feature segmentation and region alignment method for UAV-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology.
  • Deng et al., (2009) Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Ding et al., (2021) Ding, L., Zhou, J., Meng, L., and Long, Z. (2021). A practical cross-view image matching method between UAV and satellite for UAV-based geo-localization. Remote Sensing, 13(1):47.
  • Ding et al., (2022) Ding, X., Zhang, X., Han, J., and Ding, G. (2022). Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Hadsell et al., (2006) Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition.
  • He et al., (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Hu et al., (2018) Hu, S., Feng, M., Nguyen, R. M., and Hee Lee, G. (2018). CVM-Net: Cross-view matching network for image-based ground-to-aerial geo-localization. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Hu and Lee, (2020) Hu, S. and Lee, G. H. (2020). Image-based geo-localization using satellite imagery. International Journal of Computer Vision, 128(5):1205–1219.
  • Huang et al., (2022) Huang, Y., Fu, X., Li, L., and Zha, Z.-J. (2022). Learning degradation-invariant representation for robust real-world person re-identification. International Journal of Computer Vision, 130(11):2770–2796.
  • Humenberger et al., (2022) Humenberger, M., Cabon, Y., Pion, N., Weinzaepfel, P., Lee, D., Guérin, N., Sattler, T., and Csurka, G. (2022). Investigating the role of image retrieval for visual localization. International Journal of Computer Vision, pages 1–26.
  • Krizhevsky et al., (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105.
  • Lin et al., (2022) Lin, J., Zheng, Z., Zhong, Z., Luo, Z., Li, S., Yang, Y., and Sebe, N. (2022). Joint representation learning and keypoint detection for cross-view geo-localization. IEEE Transactions on Image Processing.
  • Lin et al., (2013) Lin, T.-Y., Belongie, S., and Hays, J. (2013). Cross-view image geolocalization. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Lin et al., (2015) Lin, T.-Y., Cui, Y., Belongie, S., and Hays, J. (2015). Learning deep representations for ground-to-aerial geolocalization. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Liu and Li, (2019) Liu, L. and Li, H. (2019). Lending orientation to neural networks for cross-view geo-localization. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Liu et al., (2021) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE International Conference on Computer Vision.
  • Luo et al., (2016) Luo, W., Li, Y., Urtasun, R., and Zemel, R. (2016). Understanding the effective receptive field in deep convolutional neural networks. In Neural Information Processing Systems.
  • Paszke et al., (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). PyTorch: An imperative style, high-performance deep learning library. In Neural Information Processing Systems.
  • Regmi and Shah, (2019) Regmi, K. and Shah, M. (2019). Bridging the domain gap for ground-to-aerial image matching. In IEEE International Conference on Computer Vision.
  • Saurer et al., (2016) Saurer, O., Baatz, G., Köser, K., Pollefeys, M., et al. (2016). Image based geo-localization in the alps. International Journal of Computer Vision, 116(3):213–225.
  • Shi et al., (2019) Shi, Y., Liu, L., Yu, X., and Li, H. (2019). Spatial-aware feature aggregation for image based cross-view geo-localization. In Neural Information Processing Systems.
  • (30) Shi, Y., Yu, X., Campbell, D., and Li, H. (2020a). Where am i looking at? joint location and orientation estimation by cross-view matching. In IEEE Conference on Computer Vision and Pattern Recognition.
  • (31) Shi, Y., Yu, X., Campbell, D., and Li, H. (2020b). Where am i looking at? joint location and orientation estimation by cross-view matching. In IEEE Conference on Computer Vision and Pattern Recognition.
  • (32) Shi, Y., Yu, X., Liu, L., Zhang, T., and Li, H. (2020c). Optimal feature transport for cross-view image geo-localization. In AAAI Conference on Artificial Intelligence.
  • Simonyan and Zisserman, (2015) Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations.
  • Sun et al., (2017) Sun, Y., Zheng, L., Deng, W., and Wang, S. (2017). SVDNet for pedestrian retrieval. In IEEE International Conference on Computer Vision.
  • Teng et al., (2021) Teng, S., Zhang, S., Huang, Q., and Sebe, N. (2021). Viewpoint and scale consistency reinforcement for UAV vehicle re-identification. International Journal of Computer Vision, 129(3):719–735.
  • Tian et al., (2021) Tian, X., Shao, J., Ouyang, D., and Shen, H. T. (2021). UAV-satellite view synthesis for cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology.
  • Vo and Hays, (2016) Vo, N. N. and Hays, J. (2016). Localizing and orienting street views using overhead imagery. In European Conference on Computer Vision.
  • Wang et al., (2021) Wang, T., Zheng, Z., Yan, C., Zhang, J., Sun, Y., Zheng, B., and Yang, Y. (2021). Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology. doi:10.1109/TCSVT.2021.3061265.
  • Workman and Jacobs, (2015) Workman, S. and Jacobs, N. (2015). On the location dependence of convolutional neural network features. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Workman et al., (2015) Workman, S., Souvenir, R., and Jacobs, N. (2015). Wide-area image geolocalization with aerial reference imagery. In IEEE International Conference on Computer Vision.
  • Zbontar et al., (2021) Zbontar, J., Jing, L., Misra, I., LeCun, Y., and Deny, S. (2021). Barlow twins: Self-supervised learning via redundancy reduction. In International Conference on Machine Learning.
  • Zhai et al., (2017) Zhai, M., Bessinger, Z., Workman, S., and Jacobs, N. (2017). Predicting ground-level scene layout from aerial imagery. In IEEE Conference on Computer Vision and Pattern Recognition.
  • (43) Zheng, Z., Ruan, T., Wei, Y., Yang, Y., and Mei, T. (2020a). VehicleNet: Learning robust visual representation for vehicle re-identification. IEEE Transactions on Multimedia (TMM). doi: 10.1109/TMM.2020.3014488.
  • (44) Zheng, Z., Wei, Y., and Yang, Y. (2020b). University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In ACM International Conference on Multimedia. doi: 10.1145/3394171.3413896.
  • Zheng et al., (2019) Zheng, Z., Yang, X., Yu, Z., Zheng, L., Yang, Y., and Kautz, J. (2019). Joint discriminative and generative learning for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition.
  • (46) Zheng, Z., Zheng, L., Garrett, M., Yang, Y., Xu, M., and Shen, Y.-D. (2020c). Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2):1–23. doi: 10.1145/3383184.
  • Zheng et al., (2017) Zheng, Z., Zheng, L., and Yang, Y. (2017). Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In IEEE International Conference on Computer Vision. doi: 10.1109/ICCV.2017.405.