Lahore University of Management Sciences, Pakistan
Email: {15060051,16060073,murtaza.taj}@lums.edu.pk, [email protected]
Cross-View Image Retrieval - Ground to Aerial Image Retrieval through Deep Learning
Abstract
Cross-modal retrieval aims to measure the content similarity between different types of data; the idea has previously been applied to visual, text, and speech data. In this paper, we present a novel cross-modal retrieval method specifically for multi-view images, called Cross-View Image Retrieval (CVIR). Our approach aims to find a feature space as well as an embedding space in which samples from street-view images can be compared directly to satellite-view images (and vice versa). For this comparison, a novel deep metric learning based solution, "DeepCVIR", is proposed. Previous cross-view image datasets are deficient in that they (1) lack class information; (2) were originally collected for the cross-view image geolocalization task with coupled images; and (3) do not include any images from off-street locations. To train, compare, and evaluate the performance of cross-view image retrieval, we present a new six-class cross-view image dataset, termed CrossViewRet, which comprises the classes freeway, mountain, palace, river, ship, and stadium with 700 high-resolution dual-view images per class. Results show that the proposed DeepCVIR outperforms conventional matching approaches on the CVIR task for the given dataset and will also serve as a baseline for future research.
Keywords:
Cross-modal Retrieval · Cross-View Image Retrieval · Cross-View Image Matching · Deep Metric Learning

1 Introduction
Cross-view image matching (CVIM) has attracted considerable attention from researchers due to its growing applications in image geolocalization, GIS mapping, autonomous driving, augmented reality navigation, and robot rescue [5, 1]. Another key factor is the rapid increase in high-resolution satellite and street-view imagery provided by platforms such as Google and Flickr. One of the most challenging aspects of CVIM is to devise an effective method to bridge the heterogeneity gap between the two types of images [14, 18].
We introduce cross-view image retrieval (CVIR), a special type of cross-modal retrieval that aims to enable flexible search and collection across dual-view images. For a query image taken from one viewpoint (say, ground view), it searches for all similar images taken from the other viewpoint (say, aerial view) in the database. The idea evolved from the notion of cross-view image matching with one key difference: in standard cross-view image matching, a ground-view image is matched to its corresponding aerial-view image while relying only on the content of the images. In contrast, in CVIR the system searches the database for all images similar to the given query while considering the contextual class information embedded in the visual descriptors of the images.
Common practice in conventional retrieval systems is representation learning, which transforms images into a feature space where the distance between them can be measured directly [3]. In our case, however, these representative features must additionally be transformed into a common embedding space to bridge the heterogeneity gap and compute similarity between views.
Figure 1: Sample dual-view image pairs from the CrossViewRet dataset for the six classes: Freeway, Mountain, Palace, River, Ship, and Stadium.
In this paper, we present a novel cross-view image retrieval method, termed Deep Metric Learning based Cross-View Image Retrieval (DeepCVIR). This method aims to retain the discrimination among visual features from different semantic groups while also reducing the dual-view image disparities. To achieve this objective, class information is retained in the learned feature space and pairwise label information is retained in the embedding space for all images. This is done by minimizing the discrimination loss of the images in both the feature space and the embedding space, ensuring that the learned embeddings are discriminative in class information and view-invariant in nature. Figure 2 illustrates our proposed framework in detail.
The remainder of this paper is organized as follows: Section 2 reviews related work in cross-view image matching and cross-modal learning. Section 3 presents the proposed model, including the problem formulation, DeepCVIR, and implementation details. Section 4 explains the experimental setup, including the dataset, while Section 5 provides the results and analysis. Section 6 concludes the paper.
2 Related Work
Recent applications of cross-modal retrieval, especially for text, speech, and images in big data, have opened new avenues that require improved solutions. Existing techniques apply cross-modal retrieval to multi-modal data but do not address the variety of data within a single modality, such as multi-view image retrieval [15].
Cross-view image matching can be considered one such problem, for which Vo et al. cross-matched and geo-localized street-view images of 11 cities of the United States to their respective satellite-view images [13], experimenting with various versions of Siamese and Triplet networks for feature extraction together with a distance-based logistic loss. Validating their approach on another similar dataset, CV-USA, Hu et al. combined local features and global descriptors [5]. One major shortcoming of both these datasets is that the street-view images are obtained from Google imagery, which entirely ignores off-street locations. Another way to cross-match images is to detect and match the content of the images, e.g., matching buildings in a street-view query image to buildings in aerial images [12]. This approach, unsurprisingly, fails in areas lacking tall structures or buildings with prominent features. Researchers have even tried to predict the ground-level scene layout from the corresponding aerial images; however, the same approach cannot be extended for accurate image matching and retrieval [17].
Figure 2: Overall architecture of the proposed DeepCVIR framework.
Image retrieval, on the other hand, has already been used extensively for multi-modal matching in the field of information retrieval [14]. The approach has been validated for applications such as matching sentences to images, ranking images, and free-hand sketch based image retrieval [15, 6, 8, 7, 18]. Moreover, metric learning networks have previously been introduced for template matching tasks [16]. We introduce cross-view image retrieval, which combines metric learning and image retrieval techniques for class-based cross-view image matching.
3 Proposed Method
One of the core ideas of this paper is to identify an efficient framework for CVIR using the contextual information of the scene in the image. The detailed approach is presented in four subsections: a) Problem Formulation, b) Deep Feature Extraction, c) Feature Matching, and d) DeepCVIR. Figure 2 visually explains the overall architecture of the proposed approach.
3.1 Problem Formulation
We focus on the formulation of the CVIR problem for the CrossViewRet dataset without loss of generality. The dataset contains two subsets: ground-view images $X^g$ and aerial-view images $X^a$. In ground-to-aerial retrieval, for a given query image $x^g$ we aim to retrieve the set of all relevant images $R \subseteq X^a$, where relevance is defined by sharing the semantic class of the query. Similarly, the problem can be formulated for aerial-to-ground search and retrieval by taking the query image from $X^a$ and searching within $X^g$. For this purpose, we assume a collection of $n$ instances of ground-view and aerial-view image pairs, denoted as $\{(x_i^g, x_i^a)\}_{i=1}^{n}$, where $x_i^g$ is the input ground-view image sample and $x_i^a$ is the input aerial-view image sample of the $i$-th instance. Each instance pair is assigned a semantic label vector $y_i = [y_{i1}, \ldots, y_{iC}]$. If the $i$-th instance belongs to the $j$-th class, $y_{ij} = 1$; otherwise $y_{ij} = 0$.
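As an illustration of this formulation, the following minimal NumPy sketch ranks opposite-view features against a query feature; the function names and the generic `similarity_fn` argument are ours and not part of the paper.

```python
import numpy as np

def retrieve(query_feat, database_feats, similarity_fn, top_k=5):
    """Rank database features by similarity to the query feature.

    query_feat     : (d,) feature of the query image (e.g. a ground-view image)
    database_feats : (N, d) features of the opposite-view image database
    similarity_fn  : callable returning a similarity score for a feature pair
    Returns the indices of the top_k most similar database images.
    """
    scores = np.array([similarity_fn(query_feat, f) for f in database_feats])
    return np.argsort(-scores)[:top_k]          # higher score = more similar

# Ground-to-aerial retrieval queries the aerial database with a ground-view feature;
# aerial-to-ground retrieval simply swaps the roles of the two views.
```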
3.2 Deep Supervised Feature Learning
Representation learning, also termed "indexing" in CVIR, refers to learning two functions for the dual-view images containing the same class information: $f_g(x^g; \theta_g) \in \mathbb{R}^d$ for the ground-view image and $f_a(x^a; \theta_a) \in \mathbb{R}^d$ for the aerial-view image, where $d$ is the dimensionality of the features in their respective feature spaces. Here $\theta_g$ and $\theta_a$ are the trainable weights of the street-view and satellite-view feature learning networks. The feature extraction step for the cross-view image pair builds on benchmark deep supervised convolutional neural networks, including the VGG, ResNet-50, and Tiny-Inception-ResNet-v2 pretrained networks [10]. These networks are selected because of their exceptional performance on object recognition and classification tasks. Unlike a traditional Siamese network, two separate feature learning networks (without weight sharing) are employed to extract features of the street-view and satellite-view images. Features acquired in this way implicitly retain the class information of the images irrespective of their visual viewpoint. Although these representations are not projected into a combined feature space for both views, they share the same dimensional footprint and can be compared in an embedding space through matching. Figure 2 (left side) shows the overall indexing procedure in detail.
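A hedged Keras sketch of such a two-branch indexing stage is given below; the input size, the 1024-d feature layer, and the helper names are our assumptions, and any of the backbones mentioned above could replace VGG16.

```python
from tensorflow.keras import layers, models, applications

def build_feature_network(name, num_classes=6, feat_dim=1024):
    # One independent backbone per view: weights are NOT shared across views.
    backbone = applications.VGG16(weights="imagenet", include_top=False,
                                  pooling="avg", input_shape=(224, 224, 3))
    feat = layers.Dense(feat_dim, activation="relu", name=f"{name}_feat")(backbone.output)
    out = layers.Dense(num_classes, activation="softmax", name=f"{name}_cls")(feat)
    return models.Model(backbone.input, out, name=name)

ground_net = build_feature_network("ground_view")   # street-view branch
aerial_net = build_feature_network("aerial_view")   # satellite-view branch

# After fine-tuning on class labels, the penultimate layer is used for indexing:
ground_indexer = models.Model(ground_net.input,
                              ground_net.get_layer("ground_view_feat").output)
```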
3.3 Feature Matching and Retrieval
Features of the cross-view image pair are matched either through distance computation, metric learning, or deep networks with specialized loss functions. Traditional matching techniques employ distance computation on the paired features $(\mathbf{f}_g, \mathbf{f}_a)$. For instance, the Euclidean distance between the feature embeddings of this paired data can be computed as

$$ d(\mathbf{f}_g, \mathbf{f}_a) = \left\lVert \mathbf{f}_g - \mathbf{f}_a \right\rVert_2 \tag{1} $$

where $\lVert \cdot \rVert_2$ denotes the L2-norm operation. In distance metric learning, especially contrastive embedding, a loss function implemented on top of the point-wise distance operation is minimized to learn the association of similar and dissimilar data pairs. It is computed as

$$ L = y\, d^2(\mathbf{f}_g, \mathbf{f}_a) + (1 - y)\, \max\!\bigl(0,\, m - d(\mathbf{f}_g, \mathbf{f}_a)\bigr)^2 \tag{2} $$

where $y$ indicates the label of the paired data, with $y = 1$ representing a similar pair and $y = 0$ otherwise, $\max(0, \cdot)$ is the hinge loss function, and $d(\mathbf{f}_g, \mathbf{f}_a)$ is taken from (1). The margin $m$ is used to penalize dissimilar pairs whose distance is smaller than this predefined margin through the hinge loss in the second part of (2). Similarly, the Mahalanobis distance between the cross-view image pair features is computed as

$$ d_M(\mathbf{f}_g, \mathbf{f}_a) = \sqrt{(\mathbf{f}_g - \mathbf{f}_a)^{\top} S^{-1} (\mathbf{f}_g - \mathbf{f}_a)} \tag{3} $$

where $\mathbf{f}_g$ and $\mathbf{f}_a$ are treated as two points from the same distribution with covariance matrix $S$. The Mahalanobis distance reduces to the Euclidean distance when the covariance matrix is the identity matrix.
For each of these matching measures, if the retrieval score is less than the given threshold (say 0.5), the feature pair is categorized as similar, and dissimilar otherwise. For image retrieval, the top-ranked images are visualized as relevant to the query image, as shown in Figure 2.
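The three matching measures above translate directly into code; the following NumPy sketch is a minimal illustration, where the margin value and the distance threshold convention are our assumptions.

```python
import numpy as np

def euclidean_dist(f_g, f_a):                        # Eq. (1)
    return float(np.linalg.norm(f_g - f_a))

def contrastive_loss(f_g, f_a, y, margin=1.0):       # Eq. (2); y = 1 for similar pairs
    d = euclidean_dist(f_g, f_a)
    return y * d**2 + (1 - y) * max(0.0, margin - d)**2

def mahalanobis_dist(f_g, f_a, cov):                 # Eq. (3)
    diff = f_g - f_a
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def is_similar(distance, threshold=0.5):
    # Sec. 3.3: scores below the threshold mark the feature pair as similar.
    return distance < threshold
```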
3.4 DeepCVIR: A DML based framework for Feature Matching
The transformation of images from the feature space to the embedding space can be realized by incorporating a deep learning model, technically called a deep metric learning (DML) network [16, 11]. In this research, we propose a residual deep metric learning architecture optimized with the well-known binary cross-entropy loss.
3.4.1 Reshaping 1D features in DeepCVIR
To exploit the contextual information of the objects encoded in the image features, we reshape the 1D features from the indexing step into 2D features in the retrieval step. 2D convolution layers are then employed to extract significant information from the concatenated 2D features of the matching images.
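A minimal sketch of this reshaping step is shown below; the channel-wise concatenation of the two reshaped maps is our assumption, since the paper does not spell out the exact concatenation scheme.

```python
import numpy as np

def to_2d(feat_1d):
    # Reshape a flat feature vector into a square map, e.g. 1024 -> 32 x 32.
    side = int(np.sqrt(feat_1d.size))
    assert side * side == feat_1d.size, "assumes a perfect-square feature length"
    return feat_1d.reshape(side, side, 1)

def make_pair_input(f_ground, f_aerial):
    # Stack the two reshaped maps along the channel axis; the 2-D convolutions
    # of the DML network then operate on this two-channel input.
    return np.concatenate([to_2d(f_ground), to_2d(f_aerial)], axis=-1)
```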
3.4.2 Residual Blocks in DeepCVIR
This DML network, inspired by residual learning, comprises a combination of the two standard residual units presented in [4]. The first residual unit consists of two convolution layers with an identity path, while the second comprises a convolutional shortcut with two convolution layers. We tested three variations of the DML network for DeepCVIR: S-DeepCVIR consists of only one residual block (two residual units), while D-DeepCVIR and T-DeepCVIR stack two and three residual blocks, respectively. The rest of the network structure remains the same for all three variations. Each DeepCVIR network is terminated by three pairs of fully connected and activation layers to introduce non-linearity.
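The sketch below shows one possible Keras realization of this residual DML network; the filter counts, the stem convolution, and the fully connected layer widths are our assumptions, while the unit ordering (identity path first, convolutional shortcut second) and the binary cross-entropy objective follow the description above.

```python
from tensorflow.keras import layers, models

def residual_unit(x, filters, conv_shortcut=False):
    # Two convolution layers with either an identity path or a 1x1 convolutional shortcut [4].
    shortcut = layers.Conv2D(filters, 1)(x) if conv_shortcut else x
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Activation("relu")(layers.Add()([shortcut, y]))

def build_deepcvir_dml(input_shape=(32, 32, 2), n_blocks=1, width=32):
    # n_blocks = 1, 2, 3 yields S-, D-, and T-DeepCVIR respectively.
    inp = layers.Input(shape=input_shape)                 # concatenated 2-D feature pair
    x = layers.Conv2D(width, 3, padding="same", activation="relu")(inp)  # stem (assumption)
    for _ in range(n_blocks):
        x = residual_unit(x, width)                       # identity-path unit
        x = residual_unit(x, width, conv_shortcut=True)   # convolutional-shortcut unit
    x = layers.Flatten()(x)
    for units in (256, 64):                               # FC + activation pairs (widths assumed)
        x = layers.Dense(units)(x)
        x = layers.LeakyReLU()(x)                         # Leaky ReLU performed best (Table 2)
    out = layers.Dense(1, activation="sigmoid")(x)        # similar / dissimilar score
    return models.Model(inp, out, name=f"DeepCVIR_{n_blocks}_blocks")

s_deepcvir = build_deepcvir_dml(n_blocks=1)
s_deepcvir.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```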
4 Experimental Setup
Cross-view image retrieval can be inherently divided into two sub-tasks, namely street-to-satellite retrieval and satellite-to-street retrieval. When relevant satellite-view images are retrieved for a given street-view query image, the task is referred to as Str2Sat; the reverse case is referred to as Sat2Str in the rest of the paper. We also investigate the effect of employing different activation functions in the DML networks.
4.1 Dataset
In this research a new dataset, CrossViewRet, has been developed to evaluate and compare the performance of the DeepCVIR framework. Previous cross-view image datasets are deficient in that they (1) lack class information about the content of the images; (2) were originally collected for the cross-view image geolocalization task with coupled images; and (3) were specifically acquired for autonomous driving purposes and therefore do not include any images of off-street locations. CrossViewRet comprises images of 6 classes, namely freeway, mountain, palace, river, ship, and stadium, with 700 high-resolution dual-view images per class. The satellite-view images are collected from the benchmark NWPU-RESISC45 dataset, while the corresponding street-view images of each class are downloaded from Flickr (www.flickr.com) using the Flickr API [2]. The downloaded street-view images are then cross-checked by human annotators, and images with obvious visual descriptions of the classes are selected for each class. The satellite-view images share the fixed resolution of the NWPU-RESISC45 source imagery, while the street-view images are of variable sizes; however, all images have been resized before being used for experimentation. The dataset has been made public for future use (https://cvlab.lums.edu.pk/category/projects/imageretrieval).
CrossViewRet is a very challenging dataset. Unlike existing cross-view datasets [13, 5], which contain ground and aerial images of the same location, we do not constrain the images to be of the same geo-location. Rather, we focus on the visual content of the scene regardless of transformations of the images, weather conditions, and variations between day and night time. As shown in Fig. 1, the ground view in the sample image of the mountain class contains snow whereas its target aerial-view image does not. Similarly, the ground-view images of the palace and stadium classes are taken at night while the aerial views contain daytime drone images. In the river example, however, the aerial view is a satellite image, which is quite different from top-view drone images.

4.2 Implementation Details
We use two independent networks for feature learning and embedding learning. For feature learning, VGGNet, ResNet, and Inception-ResNet-v2 with pre-trained ImageNet weights are fine-tuned. Two independent sub-networks are employed to learn the discriminating class-wise features of both views. The architecture of the proposed DML network is explained in Section 3.4. The standard 80/20 train-validation splitting criterion was used on the CrossViewRet dataset to fine-tune the feature networks and train the variants of the DML network. Query images used for evaluation were randomly taken from the validation split of the data.
The deep learning networks were trained on an Nvidia RTX 2080Ti GPU in Keras. For training the feature networks, we employ Stochastic Gradient Descent (SGD) with an initial learning rate of 0.00001 and learning-rate decay with a patience of 5. For the DML network, Adam with an initial learning rate of 0.001 is used. An early stopping criterion with a patience of 15 epochs is used to halt training for all networks.
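A hedged Keras training setup matching these choices might look as follows; the decay factor, the monitored quantity, and the loss functions are our assumptions.

```python
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.keras.callbacks import ReduceLROnPlateau, EarlyStopping

# Feature networks: SGD with initial learning rate 1e-5 and plateau-based decay (patience 5).
feature_optimizer = SGD(learning_rate=1e-5)
feature_callbacks = [
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5),
    EarlyStopping(monitor="val_loss", patience=15),
]

# DML network: Adam with initial learning rate 1e-3 and the same early-stopping patience.
dml_optimizer = Adam(learning_rate=1e-3)
dml_callbacks = [EarlyStopping(monitor="val_loss", patience=15)]

# e.g. ground_net.compile(optimizer=feature_optimizer, loss="categorical_crossentropy")
#      s_deepcvir.compile(optimizer=dml_optimizer, loss="binary_crossentropy")
```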
4.3 Evaluation Metric
We evaluate the performance of cross-view image retrieval not only with the standard measures of Precision, Recall, and F1-Score, but also with the Average Normalized Modified Retrieval Rank (ANMRR), Mean Average Precision (mAP), and P@K (read as Precision at K) [9]. P@K is the percentage of queries for which the ground-truth image class appears among the first K retrieved results. Here we only report the P@5 measure in our analysis.
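For concreteness, the sketch below computes P@K as defined above; the helper name and the toy example are ours.

```python
import numpy as np

def p_at_k(all_retrievals, query_labels, k=5):
    """P@K as defined in Sec. 4.3: percentage of queries for which the ground-truth
    class appears among the first K retrieved results.

    all_retrievals : list of ranked class-label lists, one per query
    query_labels   : list of ground-truth class labels, one per query
    """
    hits = [int(q in r[:k]) for r, q in zip(all_retrievals, query_labels)]
    return 100.0 * float(np.mean(hits))

# Example: two queries of classes 0 and 3 with their top-5 retrieved classes.
print(p_at_k([[0, 2, 0, 1, 4], [5, 5, 1, 2, 2]], [0, 3], k=5))   # -> 50.0
```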
Table 1: Performance of deep features combined with different similarity measures on the CVIR task.

| Feature Network | Similarity Measure | ANMRR | mAP | p@5 | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|---|
| ResNet-50 | Euclidean | 0.42 | 0.17 | 0.16 | 0.50 | 0.50 | 0.50 |
| IncepRes-v2 | Euclidean | 0.42 | 0.17 | 0.16 | 0.50 | 0.50 | 0.50 |
| VGG-16 | Euclidean | 0.41 | 0.18 | 0.15 | 0.48 | 0.48 | 0.48 |
| ResNet-50 | Contrastive | 0.05 | 0.90 | 0.88 | 0.50 | 0.50 | 0.50 |
| IncepRes-v2 | Contrastive | 0.40 | 0.20 | 0.16 | 0.50 | 0.50 | 0.50 |
| VGG-16 | Contrastive | 0.29 | 0.41 | 0.40 | 0.50 | 0.50 | 0.50 |
| ResNet-50 | Mahalanobis | 0.42 | 0.17 | 0.16 | 0.50 | 0.50 | 0.50 |
| IncepRes-v2 | Mahalanobis | 0.42 | 0.16 | 0.15 | 0.50 | 0.50 | 0.50 |
| VGG-16 | Mahalanobis | 0.42 | 0.17 | 0.19 | 0.50 | 0.50 | 0.50 |
| ResNet-50 | DeepCVIR-DML | 0.03 | 0.93 | 0.94 | 0.94 | 0.94 | 0.94 |
| IncepRes-v2 | DeepCVIR-DML | 0.29 | 0.22 | 0.23 | 0.52 | 0.52 | 0.52 |
| VGG-16 | DeepCVIR-DML | 0.02 | 0.96 | 0.97 | 0.96 | 0.96 | 0.96 |
5 Results
Validation of the proposed DeepCVIR approach for this type of challenging dataset demands extensive assessment. We therefore provide a comparative analysis of the approach using various state-of-the-art techniques as well as variants of the proposed method.
5.1 Deep Features and their Matching Techniques
Various deep features have previously been used for same-view image retrieval; however, view-invariant features of multi-modal images play a pivotal role in CVIR. Table 1 shows that although Inception-ResNet-v2 may outperform VGGNet and ResNet on the ImageNet challenge, it fails to extract the most suitable features for cross-view image matching. In addition, the performance of the various distance computation methods illustrates that the problem is more complex and cannot be solved by linear distances, i.e., Euclidean distance or contrastive loss embedding. Figure 4(a) also confirms the improvement in learning behavior in terms of percentage validation accuracy.
Figure 4: a) Features vs. similarity metric; b) Convergence rate of DeepCVIR variants.
Table 2: Performance of DeepCVIR variants with different activation functions and numbers of residual blocks.

| DML Type | Feature Network | Activation | ANMRR | mAP | p@5 | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|---|---|
| S-DeepCVIR | ResNet-50 | eLU | 0.04 | 0.93 | 0.94 | 0.50 | 0.50 | 0.50 |
| S-DeepCVIR | ResNet-50 | Leaky ReLU | 0.03 | 0.93 | 0.94 | 0.94 | 0.94 | 0.94 |
| S-DeepCVIR | ResNet-50 | ReLU | 0.03 | 0.93 | 0.95 | 0.93 | 0.93 | 0.93 |
| D-DeepCVIR | ResNet-50 | ReLU | 0.04 | 0.92 | 0.94 | 0.92 | 0.92 | 0.92 |
| T-DeepCVIR | ResNet-50 | ReLU | 0.05 | 0.89 | 0.92 | 0.90 | 0.90 | 0.90 |
| S-DeepCVIR | VGG-16 | Leaky ReLU | 0.02 | 0.96 | 0.97 | 0.96 | 0.96 | 0.96 |
| D-DeepCVIR | VGG-16 | Leaky ReLU | 0.02 | 0.95 | 0.97 | 0.95 | 0.95 | 0.95 |
| T-DeepCVIR | VGG-16 | Leaky ReLU | 0.02 | 0.96 | 0.98 | 0.95 | 0.95 | 0.95 |
5.2 Feature Matching through DeepCVIR
The proposed DeepCVIR architecture incorporates a DML network that assists the learning process by efficiently learning an embedding space that discriminates similar and dissimilar pairs. To evaluate the learning behavior of the DML network, we tried variants with single, double, and triple combinations of the proposed residual blocks, termed S-DeepCVIR, D-DeepCVIR, and T-DeepCVIR, respectively.
5.2.1 Impact of the Number of Residual Blocks in DeepCVIR
In general, increasing the number of residual blocks in a network improves overall performance; in our case, however, S-DeepCVIR with the fewest residual blocks outperforms the other DeepCVIR networks. This was beyond our expectation, but one cannot neglect the relative simplicity of this task compared to other recognition tasks. It can be concluded that the number of learnable parameters of S-DeepCVIR is sufficient to separate similar and dissimilar features. The ANMRR and mAP values in Table 2 illustrate that although all variants of DeepCVIR perform better than the other matching techniques, S-DeepCVIR performs exceptionally well on both the Str2Sat and Sat2Str tasks. Its convergence curve, illustrated in Figure 4(b), reaches a significantly lower loss in fewer epochs than the other combinations, owing to its smaller number of learnable parameters.
5.3 Str2Sat vs. Sat2Str Evaluation
Our proposed S-DeepCVIR framework performs equally well on both the Str2Sat and Sat2Str tasks. The results in Table 3 show that although the average ANMRR values remain comparable for all variants of the DeepCVIR architecture, S-DeepCVIR with VGG features achieves the minimum average ANMRR of 0.025 together with the maximum mAP score.
Figure 5: t-SNE visualization of the learned feature and embedding spaces: a) ResNet-50, b) ReLU, c) Leaky ReLU, d) eLU, e) VGG-16, f) S-DeepCVIR, g) D-DeepCVIR, h) T-DeepCVIR.
Table 3: Str2Sat vs. Sat2Str evaluation of DeepCVIR variants using VGG-16 features.

| Features | DeepCVIR | Str2Sat ANMRR | Str2Sat mAP | Str2Sat p@5 | Sat2Str ANMRR | Sat2Str mAP | Sat2Str p@5 | Average ANMRR |
|---|---|---|---|---|---|---|---|---|
| VGG-16 | S-DeepCVIR | 0.02 | 0.96 | 0.97 | 0.03 | 0.92 | 0.92 | 0.025 |
| VGG-16 | D-DeepCVIR | 0.02 | 0.95 | 0.97 | 0.03 | 0.91 | 0.91 | 0.025 |
| VGG-16 | T-DeepCVIR | 0.02 | 0.96 | 0.98 | 0.03 | 0.91 | 0.91 | 0.025 |
5.4 t-SNE Visualization of the Learned Embeddings
A t-SNE plot is an effective tool for visualizing data in a two-dimensional plane. We adopted this approach to validate the contribution of the DML network in transforming features to the embedding space. Figure 5(a,f) shows that the image features are spread across the whole plot, so it is very difficult to measure the correspondence between same-class and different-class features using a linear distance alone, whereas the DML network separates them into distinguishable clusters.
Although no class information was explicitly provided to the network during training, it successfully clustered the similar pairs into six distinct classes. It is also observed from the figures that the use of different activation functions and multiple residual blocks does not further improve the overall result.
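The kind of visualization used here can be reproduced with scikit-learn's t-SNE, as in the minimal sketch below; the perplexity value, the plotting details, and the variable names are our choices.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features, class_ids, title):
    # Project features (or learned embeddings) onto a 2-D plane for inspection.
    pts = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
    plt.scatter(pts[:, 0], pts[:, 1], c=class_ids, cmap="tab10", s=8)
    plt.title(title)
    plt.show()

# e.g. plot_tsne(vgg_features, labels, "VGG-16 feature space")
#      plot_tsne(dml_embeddings, labels, "S-DeepCVIR embedding space")
```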
6 Conclusion
We propose a cross-view image retrieval system for which we developed a cross-view dataset named CrossViewRet. The dataset consists of street-view and satellite-view images for 6 distinct classes with 700 images per class. The proposed DeepCVIR system consists of two parts: a) a fine-tuned deep feature network, and b) a deep metric learning network trained on image pairs from the CrossViewRet dataset. Given the features of two images, the proposed residual DML network decides whether the two images belong to the same class. In addition, an ablation study and a detailed empirical analysis of different activation functions and of the number of residual blocks in the DML network have been performed. The results show that the proposed DeepCVIR network performs significantly well on the problem of cross-view retrieval.
References
- [1] Arth, C., Schmalstieg, D.: Challenges of large-scale augmented reality on smartphones. Graz University of Technology, Graz pp. 1–4 (2011)
- [2] Cheng, G., Han, J., Lu, X.: Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE 105(10), 1865–1883 (2017)
- [3] Göksu, ., Aptoula, E.: Content based image retrieval of remote sensing images based on deep features. In: 26th Signal Processing and Communications Applications Conference (SIU). pp. 1–4 (2018)
- [4] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: European Conference on Computer Vision. pp. 630–645. Springer (2016)
- [5] Hu, S., Feng, M., Nguyen, R.M., Hee Lee, G.: Cvm-net: Cross-view matching network for image-based ground-to-aerial geo-localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7258–7267 (2018)
- [6] Lin, Z., Ding, G., Hu, M., Wang, J.: Semantics-preserving hashing for cross-view retrieval. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3864–3872 (2015)
- [7] Liu, L., Shen, F., Shen, Y., Liu, X., Shao, L.: Deep sketch hashing: Fast free-hand sketch-based image retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2862–2871 (2017)
- [8] Ma, L., Lu, Z., Shang, L., Li, H.: Multimodal convolutional neural networks for matching image and sentence. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2623–2631 (2015)
- [9] Napoletano, P.: Visual descriptors for content-based retrieval of remote-sensing images. International Journal of Remote Sensing 39(5), 1–34 (2018)
- [10] Nazir, U., Khurshid, N., Bhimra, M.A., Taj, M.: Tiny-inception-resnet-v2: Using deep learning for eliminating bonded labors of brick kilns in south asia. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 39–43 (2019)
- [11] Tharani, M., Khurshid, N., Taj, M.: Unsupervised deep features for remote sensing image matching via discriminator network. arXiv preprint arXiv:1810.06470 (2018)
- [12] Tian, Y., Chen, C., Shah, M.: Cross-view image matching for geo-localization in urban environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3608–3616 (2017)
- [13] Vo, N.N., Hays, J.: Localizing and orienting street views using overhead imagery. In: European Conference on Computer Vision. pp. 494–509. Springer (2016)
- [14] Wang, K., Yin, Q., Wang, W., Wu, S., Wang, L.: A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215 (2016)
- [15] Wei, Y., Zhao, Y., Lu, C., Wei, S., Liu, L., Zhu, Z., Yan, S.: Cross-modal retrieval with cnn visual features: A new baseline. IEEE Transactions on Cybernetics 47(2), 449–460 (2017)
- [16] Han, X., Leung, T., Jia, Y., Sukthankar, R., Berg, A.C.: Matchnet: Unifying feature and metric learning for patch-based matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3279–3286 (2015)
- [17] Zhai, M., Bessinger, Z., Workman, S., Jacobs, N.: Predicting ground-level scene layout from aerial imagery. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 867–875 (2017)
- [18] Zhen, L., Hu, P., Wang, X., Peng, D.: Deep supervised cross-modal retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10394–10403 (2019)