
DOLG: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features

Min Yang, Dongliang He∗,†, Miao Fan, Baorong Shi,
Xuetong Xue, Fu Li, Errui Ding, Jizhou Huang
Baidu Inc., China
{yangmin09, hedongliang01, fanmiao, shibaorong}@baidu.com,
{xuexuetong, lifu, dingerrui, huangjizhou01}@baidu.com

Abstract

Image retrieval is a fundamental task of obtaining images similar to a query from a database. A common practice is to first retrieve candidate images via similarity search using global image features and then re-rank the candidates by leveraging their local features. Previous learning-based studies mainly focus on either global or local image representation learning to tackle the retrieval task. In this paper, we abandon the two-stage paradigm and seek to design an effective single-stage solution by integrating local and global information inside images into compact image representations. Specifically, we propose a Deep Orthogonal Local and Global (DOLG) information fusion framework for end-to-end image retrieval. It first attentively extracts representative local information with multi-atrous convolutions and self-attention. Components orthogonal to the global image representation are then extracted from the local information. Finally, the orthogonal components are concatenated with the global representation as a complement, and aggregation is performed to generate the final representation. The whole framework is end-to-end differentiable and can be trained with image-level labels. Extensive experimental results validate the effectiveness of our solution and show that our model achieves state-of-the-art image retrieval performance on the Revisited Oxford and Paris datasets. (Code: PaddlePaddle implementation.)

∗ Equal contribution. † Corresponding authors.

1 Introduction

Figure 1: Illustration of the current two-stage paradigm and our single-stage image retrieval. Previous methods (a) first obtain candidates similar to the query from the database via a global deep representation, and then extract local descriptors and leverage them for re-ranking. Our method (b) aggregates global and local features via an orthogonal fusion to generate the final compact descriptor, and then a single-shot similarity search is performed.

Image retrieval is an important task in computer vision; its main purpose is to find, from a large-scale database, the images that are similar to a query. It has been extensively studied with various handcrafted features [25, 6, 50]. Owing to the development of deep learning technologies, great progress has been achieved recently [1, 29, 37, 9]. Representations (also called descriptors) of images, which are used to encode image contents and measure their similarities, play a central role in this task. In the literature of learning-based solutions, two types of image representations are widely explored: global features [4, 3, 44, 1], which serve as high-level semantic image signatures, and local features [5, 36, 29, 18], which comprise discriminative geometric information about specific image regions. Generally, the global feature can be learned to be invariant to viewpoint and illumination, while local features are more sensitive to local geometry and textures. Therefore, previous state-of-the-art solutions [38, 29, 9] usually work in a two-stage paradigm. As shown in Figure 1(a), candidates are retrieved via the global feature with high recall, and then re-ranking is performed with local features to further improve precision.

In this paper, we also concentrate on image retrieval with deep networks. Although state-of-the-art performance has been achieved by previous two-stage solutions, they need to rank images twice, and the second re-ranking stage relies on the expensive RANSAC [13] or ASMK [42] for spatial verification with local features. More importantly, errors inevitably exist in both stages, so two-stage solutions suffer from error accumulation, which can be a bottleneck for further performance improvement. To alleviate these problems, we abandon the two-stage framework and attempt to find an effective, unified single-stage image retrieval solution, as shown in Figure 1(b). Previous wisdom has implied that global features and local features are two complementary and essential elements for image retrieval. Intuitively, integrating local and global features into a compact descriptor can achieve our goal: a satisfying local and global fusion scheme can take advantage of both types of features so that they mutually boost each other for single-stage retrieval, and error accumulation can be avoided. Therefore, we technically answer how to design an effective global and local fusion mechanism for end-to-end single-stage image retrieval.

Specifically, we propose a Deep Orthogonal Local and Global feature fusion model (DOLG). It consists of a local branch and a global branch for jointly learning the two types of features, and an orthogonal fusion module to combine them. In detail, the local components orthogonal to the global feature are decomposed from the local features. Subsequently, the orthogonal components are concatenated with the global feature as a complementary part. Finally, the result is aggregated into a compact descriptor. With our orthogonal fusion, the most critical local information is extracted and components redundant with respect to the global information are eliminated, such that local and global components can mutually reinforce each other to produce the final representative descriptor via objective-oriented training. To enhance local feature learning, inspired by lessons from prior research, the local branch is equipped with multi-atrous convolutions [10] and a self-attention [29] mechanism to attentively extract representative local features. Our method is similar in spirit to FP-Net [31] in terms of orthogonal feature-space learning, but DOLG aims at complementary fusion of features in orthogonal spaces. Extensive experiments on Revisited Oxford and Paris [32] show the effectiveness of our framework, and DOLG achieves state-of-the-art performance on both datasets. To summarize, our main contributions are as follows:

  • We propose to retrieve images in a single-stage paradigm with a novel orthogonal global and local feature fusion framework, which can generate a compact representative image descriptor and is end-to-end learnable.

  • In order to attentively extract discriminative local features, a module with multi-atrous convolution layers followed by a self-attention module is designed to improve our local branch.

  • Extensive experiments are conducted and comprehensive analysis is provided to validate the effectiveness of our solution. Our single-stage method significantly outperforms previous two-stage state-of-the-art ones.

2 Related Work

2.1 Local feature

Prior to deep learning, SIFT [25] and SURF [6] were two well-known hand-engineered local features. Usually such local features are combined with KD-trees [7] or vocabulary trees [28], or encoded by aggregation methods such as [49, 22], for (approximate) nearest neighbor search. Spatial verification via matching local features with RANSAC [13] to re-rank candidate retrieval results [2, 30] has also been shown to significantly improve precision. Recently, driven by the development of deep learning, remarkable progress has been made in learning local features from images, e.g., [48, 16, 15, 5, 36, 29, 18]. Comprehensive reviews of deep local feature learning can be found in [51, 12]. Among these methods, the state-of-the-art local feature learning framework DELF [29], which proposes an attentive local feature descriptor for large-scale image retrieval, is most closely related to our work. One design choice of our local branch, namely attentive feature extraction, is inspired by its merit. However, DELF uses only a single-scale feature map and ignores the various object scales inside natural images. Our local branch is designed to simulate the image pyramid trick used in SIFT [25] with multi-atrous convolution layers [10].

Figure 2: Block diagram of our deep orthogonal local and global (DOLG) information fusion framework. Taking ResNet [19] for illustration, we build a local branch and a global branch after Res3. The local branch uses multi-atrous layers to simulate a spatial pyramid and thereby account for scale variations among images. Self-attention is leveraged for importance modeling, following lessons from existing works [29, 9]. The global branch generates a descriptor, which is fed into an orthogonal fusion module together with the local features to integrate both types of features into a final compact descriptor. “P”, “C” and “X” denote pooling, concatenation and element-wise multiplication, respectively.

2.2 Global feature

Conventional solutions obtain a global feature by aggregating local features with BoW [39, 33], Fisher vectors [24] or VLAD [23]. Later, aggregated selective match kernels (ASMK) [42] attempted to unify aggregation-based techniques with matching-based approaches such as Hamming Embedding [21]. In the deep learning era, the global feature is obtained by differentiable aggregation operations such as sum-pooling [43] and GeM pooling [34]. To train deep CNN models, ranking-based triplet [8], quadruplet [11], angular [46] and listwise [35] losses, as well as classification-based losses [45, 14], have been proposed. With these innovations, most high-performing global features nowadays are obtained with deep CNNs for image retrieval [4, 3, 44, 1, 17, 34, 35, 29, 27, 9]. In our work, we leverage lessons from previous studies by using the ArcFace loss [14] in the training phase and exploring different pooling schemes for performance improvement. Our model also generates a compact descriptor; meanwhile, it explicitly considers fusing local and global features in an orthogonal way.

2.3 Joint local and global CNN features

It is natural to consider local and global features jointly, because feature maps from an image representation model can be interpreted as local visual words [38, 40]. Jointly learning local matching and global representation may be beneficial for both sides. Therefore, distilling a pre-trained local feature [15] and global feature [1] into a compact descriptor was proposed in [37]. DELG [9] takes a step further and proposes to jointly train local and global features in an end-to-end manner. However, DELG still works in a two-stage fashion. Our work is essentially different from [29, 9]: we propose orthogonal global and local fusion in order to perform accurate single-stage image retrieval.

3 Methodology

3.1 Overview

Our DOLG framework is depicted in Figure 2. Following [29, 9], it is built upon the state-of-the-art image recognition model ResNet [19]. The global branch is kept the same as the original ResNet except that 1) the global average pooling is replaced by GeM pooling [34]; 2) an FC layer is used to reduce the feature dimension when generating the global representation $f_{g}\in R^{C\times 1}$. Specifically, let us denote the output feature map of Res4 as $f_{4}\in R^{C_{4}\times h\times w}$; then the GeM pooling can be formalized as

$$f_{g,c}=\left(\frac{1}{hw}\sum_{(i,j)}f_{4,(c,i,j)}^{p}\right)^{1/p},\quad c=1,2,\ldots,C_{4}, \qquad (1)$$

where $p>0$ is a hyper-parameter and $p>1$ pushes the output to focus more on salient feature points. In this paper, we follow the setting of DELG [9] and empirically set it to 3.0. To jointly extract local descriptors, a local branch is appended after the Res3 block of ResNet. Our local branch consists of multiple atrous convolution layers [10] and a self-attention module. Then, a novel orthogonal fusion module is designed to aggregate $f_{g}$ and the local feature tensor $f_{l}\in R^{C\times H\times W}$ produced by the local branch. After orthogonal fusion, a final compact descriptor, in which local and global information is well integrated, is generated.
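For concreteness, the following is a minimal PyTorch sketch of GeM pooling (Eq. 1) and of the global-branch head described above; the official code is in PaddlePaddle, and the class names and the 2048-to-1024 channel reduction are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    """Generalized-mean pooling over a (B, C, H, W) feature map, as in Eq. (1)."""
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p, self.eps = p, eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.clamp(min=self.eps).pow(self.p)   # element-wise p-th power
        x = F.adaptive_avg_pool2d(x, 1)         # spatial mean, i.e. (1/hw) * sum
        return x.pow(1.0 / self.p).flatten(1)   # p-th root -> (B, C)

# Global-branch head (illustrative): GeM on the Res4 map, then an FC layer that
# reduces the channel dimension (2048 -> 1024 assumed here for ResNet50).
global_head = nn.Sequential(GeM(p=3.0), nn.Linear(2048, 1024))
```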

3.2 Local Branch

Figure 3: Configurations of our local branch. “D$s$,$c$,$k$” denotes a dilated convolution with dilation rate $s$, output channel number $c$ and kernel size $k$; “C,$c$,$k$” means a vanilla convolution. “R”, “B” and “S” denote ReLU, BN and Softplus, respectively.

The two major building blocks of our local branch are the multi-atrous convolution layers and the self-attention module. The former simulates a feature pyramid to handle scale variations among different image instances, while the latter is leveraged to perform importance modeling. The detailed network configuration of this branch is shown in Figure 3. The multi-atrous module contains three dilated convolution layers, which produce feature maps with different spatial receptive fields, and a global average pooling branch. These features are concatenated and then processed by a $1\times 1$ convolution layer. The output feature map is then delivered to the self-attention module to further model the importance of each local feature point. Specifically, its input is first processed by a $1\times 1$ conv-bn module; the resulting feature is then normalized and modulated by an attention map generated via a $1\times 1$ convolution layer followed by the SoftPlus operation.
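A minimal PyTorch sketch of these two building blocks is given below. The overall structure follows the description of Figure 3, but the dilation rates, channel widths and module names are assumptions for illustration, not the exact official configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAtrous(nn.Module):
    """Three dilated 3x3 convolutions plus a GAP branch, fused by a 1x1 conv.
    Dilation rates and channel widths are illustrative assumptions."""
    def __init__(self, in_ch=1024, mid_ch=512, out_ch=1024, rates=(3, 6, 9)):
        super().__init__()
        self.dilated = nn.ModuleList([
            nn.Sequential(nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r),
                          nn.ReLU(inplace=True)) for r in rates])
        self.gap = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(in_ch, mid_ch, 1), nn.ReLU(inplace=True))
        self.fuse = nn.Sequential(nn.Conv2d(mid_ch * (len(rates) + 1), out_ch, 1),
                                  nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [conv(x) for conv in self.dilated]
        feats.append(F.interpolate(self.gap(x), size=(h, w), mode='bilinear',
                                   align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))   # (B, out_ch, H, W)

class LocalSelfAttention(nn.Module):
    """1x1 conv-BN, then L2-normalize and modulate with a SoftPlus attention map."""
    def __init__(self, ch=1024):
        super().__init__()
        self.conv_bn = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch))
        self.att = nn.Conv2d(ch, 1, 1)

    def forward(self, x):
        x = self.conv_bn(x)
        score = F.softplus(self.att(x))             # (B, 1, H, W) importance map
        return F.normalize(x, p=2, dim=1) * score   # attention-weighted local features
```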

Figure 4: (a) Framework of our proposed orthogonal fusion module. “A” denotes aggregation. (b) Demonstration of a local feature projected onto the global feature and of the component orthogonal to the global feature.

3.3 Orthogonal Fusion Module

The working flow of our orthogonal fusion module is shown in Figure 4(a). It takes $f_{l}$ and $f_{g}$ as inputs and calculates the projection $f_{l,proj}^{(i,j)}$ of each local feature point $f_{l}^{(i,j)}$ onto the global feature $f_{g}$. Mathematically, the projection can be formulated as:

$$f_{l,proj}^{(i,j)}=\frac{f_{l}^{(i,j)}\cdot f_{g}}{|f_{g}|^{2}}f_{g}, \qquad (2)$$

where $f_{l}^{(i,j)}\cdot f_{g}$ is the dot product and $|f_{g}|^{2}$ is the squared $L_{2}$ norm of $f_{g}$:

$$f_{l}^{(i,j)}\cdot f_{g}=\sum_{c=1}^{C}f_{l,c}^{(i,j)}f_{g,c}, \qquad (3)$$
$$|f_{g}|^{2}=\sum_{c=1}^{C}(f_{g,c})^{2}. \qquad (4)$$

As demonstrated in Figure 4(b), the orthogonal component is the difference between the local feature and its projection vector; therefore, we can obtain the component orthogonal to $f_{g}$ by:

$$f_{l,orth}^{(i,j)}=f_{l}^{(i,j)}-f_{l,proj}^{(i,j)}. \qquad (5)$$

In this way, a $C\times H\times W$ tensor in which each point is orthogonal to $f_{g}$ is extracted. Afterwards, we append the $C\times 1$ vector $f_{g}$ to each point of this tensor, and the new tensor is aggregated into a $C_{o}\times 1$ vector. Finally, a fully connected layer is used to produce a $512\times 1$ descriptor. Typically, $C$ equals 1024 in ResNet [19]. Here, we simply leverage pooling to aggregate the concatenated tensor; that is to say, “A” in Figure 4(a) is a pooling operation in our current implementation. It can instead be designed as another learnable aggregation module. We provide further analysis on this in Sections 4 and 5.
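The following is a minimal PyTorch sketch of the orthogonal fusion module (Eqs. 2-5). Class and argument names are illustrative; note that with average pooling as “A”, pooling the orthogonal tensor first and concatenating $f_{g}$ once is equivalent to appending $f_{g}$ at every spatial position before pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrthogonalFusion(nn.Module):
    """Subtract from each local point its projection onto f_g (Eqs. 2-5), pool,
    concatenate f_g and reduce to 512-d. C = 1024 is assumed for both branches."""
    def __init__(self, c=1024, out_dim=512):
        super().__init__()
        self.fc = nn.Linear(2 * c, out_dim)

    def forward(self, f_l, f_g):                    # f_l: (B, C, H, W); f_g: (B, C)
        dot = torch.einsum('bchw,bc->bhw', f_l, f_g)                 # f_l^{(i,j)} . f_g
        coef = dot / (f_g.pow(2).sum(dim=1).view(-1, 1, 1) + 1e-6)   # divide by |f_g|^2
        f_proj = coef.unsqueeze(1) * f_g[:, :, None, None]           # projection, Eq. (2)
        f_orth = f_l - f_proj                                        # orthogonal part, Eq. (5)
        # With average pooling as "A", pooling f_orth first and concatenating f_g
        # once is equivalent to appending f_g at every point before pooling.
        pooled = F.adaptive_avg_pool2d(f_orth, 1).flatten(1)         # (B, C)
        return self.fc(torch.cat([pooled, f_g], dim=1))              # (B, 512)
```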

3.4 Training Objective

Following DELG [9], the training of our method involves only one $L_{2}$-normalized $N$-class prediction head $\hat{\mathcal{W}}\in R^{512\times N}$ and needs just image-level labels. The ArcFace margin loss [14] is used to train the whole network:

$$L=-\log\left(\frac{\exp\left(\gamma\times AF\left(\hat{\omega}_{t}^{T}\hat{f_{g}},1\right)\right)}{\sum_{n}\exp\left(\gamma\times AF\left(\hat{\omega}_{n}^{T}\hat{f_{g}},y_{n}\right)\right)}\right), \qquad (6)$$

where $\hat{\omega_{i}}$ refers to the $i$-th row of $\hat{\mathcal{W}}$ and $\hat{f_{g}}$ is the $L_{2}$-normalized version of $f_{g}$. $y$ is the one-hot label vector and $t$ is the ground-truth class index ($y_{t}=1$). $\gamma$ is a scale factor. $AF$ denotes the ArcFace-adjusted cosine similarity, calculated as $AF(s,c)$:

$$AF(s,c)=\begin{cases}\cos(\arccos(s)+m), & \text{if } c=1\\ s, & \text{if } c=0\end{cases} \qquad (7)$$

where $s$ is the cosine similarity, $m$ is the ArcFace margin and $c=1$ indicates the ground-truth class.
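A minimal PyTorch sketch of the ArcFace-margin training head (Eqs. 6-7) is given below; here the loss is applied to the $L_{2}$-normalized 512-d descriptor, matching the stated head shape, and the class name and the hyper-parameter defaults (margin, scale, number of classes) follow the values reported in Section 4.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """ArcFace-margin classification head over the L2-normalized descriptor (Eqs. 6-7)."""
    def __init__(self, dim=512, num_classes=81313, margin=0.15, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, dim))
        nn.init.xavier_uniform_(self.weight)
        self.m, self.s = margin, scale

    def forward(self, feat, labels):
        # Cosine similarity between the normalized descriptor and each class row of W.
        cos = F.linear(F.normalize(feat), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        cos_m = torch.cos(theta + self.m)                            # AF(s, 1): add margin
        one_hot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * (one_hot * cos_m + (1.0 - one_hot) * cos)  # AF(s, 0) = s
        return F.cross_entropy(logits, labels)                       # Eq. (6)
```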

4 Experiments

4.1 Implementation Details

Datasets and Evaluation metric. The Google Landmarks dataset V2 (GLDv2) [47] was developed for large-scale, fine-grained landmark instance recognition and image retrieval. It contains a total of 5M images with 200K different instance labels, and was collected by Google to reflect, as much as possible, the challenges faced by landmark recognition systems in real industrial scenarios. Researchers from the Google Landmark Retrieval Competition 2019 further cleaned and revised GLDv2 into GLDv2-clean, which contains a total of 1,580,470 images and 81,313 classes. This dataset is used to train our models. To evaluate our models, we mainly use the Oxford and Paris datasets with revisited annotations [32], referred to as Roxf and Rpar in the following, respectively. There are 4,993 (6,322) images in the Roxf (Rpar) dataset and a different query set for each, both with 70 images. For a fair comparison with state-of-the-art methods [29, 9, 27], mean average precision (mAP) is used as our evaluation metric on the Medium and Hard splits of both datasets. mAP provides a robust measurement of retrieval quality across recall levels and has been shown to have good discrimination and stability.

Implementation details. All experiments in this paper are trained on the GLDv2-clean dataset. We randomly divide 80% of the dataset for training and use the remaining 20% for validation. ResNet50 and ResNet101 are mainly used for experiments, and models are initialized from ImageNet pre-trained weights. The images are first augmented by random cropping and aspect-ratio distortion, and then resized to $512\times 512$ resolution. We use a batch size of 128 to train our models on 8 V100 GPUs with 16GB memory per card asynchronously for 100 epochs. One complete training run takes about 3.8 days for ResNet50 and 6.3 days for ResNet101. The SGD optimizer with momentum 0.9 is used, the weight decay factor is set to 0.0001, and a cosine learning rate decay strategy is adopted. Note that we train our models with 5 warm-up epochs and an initial learning rate of 0.05. For the ArcFace margin loss, we empirically set the margin $m$ to 0.15 and the ArcFace scale $\gamma$ to 30. For GeM pooling, we fix the parameter $p$ at 3.0.
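As a reference, below is a minimal PyTorch sketch of the optimizer and learning-rate schedule described above (SGD with momentum 0.9, weight decay 0.0001, 5 linear warm-up epochs, cosine decay from 0.05); the helper name and the per-epoch stepping are assumptions, not the official training script.

```python
import math
import torch

def build_optimizer(model, base_lr=0.05, epochs=100, warmup_epochs=5):
    """SGD with momentum 0.9 and weight decay 1e-4; 5 linear warm-up epochs
    followed by cosine decay. The scheduler is stepped once per epoch."""
    opt = torch.optim.SGD(model.parameters(), lr=base_lr,
                          momentum=0.9, weight_decay=1e-4)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:                                     # linear warm-up
            return (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))             # cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched
```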

As for feature extraction, following previous works [29, 9], we use an image pyramid at inference time to produce multi-scale representations. Specifically, we use 5 scales, i.e., 0.3535, 0.5, 0.7071, 1.0 and 1.4142, to extract the final compact feature vectors. To fuse these multi-scale features, we first normalize each of them to unit $L_{2}$ norm, then average the normalized features, and finally apply another $L_{2}$ normalization to produce the final descriptor.
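A minimal sketch of this multi-scale descriptor extraction is shown below; `model` is assumed to map an image tensor to a compact descriptor, and the bilinear interpolation mode is an assumption.

```python
import torch
import torch.nn.functional as F

SCALES = (0.3535, 0.5, 0.7071, 1.0, 1.4142)

@torch.no_grad()
def multiscale_descriptor(model, image):
    """Average the L2-normalized descriptors from 5 image scales, then re-normalize.
    `model` is assumed to map a (B, 3, H, W) tensor to a (B, 512) descriptor."""
    descs = []
    for s in SCALES:
        img_s = F.interpolate(image, scale_factor=s, mode='bilinear',
                              align_corners=False)
        descs.append(F.normalize(model(img_s), p=2, dim=1))   # unit L2 norm per scale
    return F.normalize(torch.stack(descs).mean(dim=0), p=2, dim=1)
```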

4.2 Results

Method | Medium: Roxf, +1M, Rpar, +1M | Hard: Roxf, +1M, Rpar, +1M

(A) Local feature aggregation + re-ranking
HesAff-rSIFT-ASMK⋆+SP [42] | 60.60, 46.80, 61.40, 42.30 | 36.70, 26.90, 35.00, 16.80
HesAff-HardNet-ASMK⋆+SP [26] | 65.60, -, 65.20, - | 41.10, -, 38.50, -
HesAff-rSIFT-ASMK⋆+SP → R-GeM [34]+DFS [20] | 79.10, 74.30, 91.00, 85.90 | 52.70, 48.70, 81.00, 73.20
DELF-ASMK⋆+SP [29, 32] | 67.80, 53.80, 76.90, 57.30 | 43.10, 31.20, 55.40, 26.40
DELF-R-ASMK⋆+SP [41] | 76.00, 64.00, 80.20, 59.70 | 52.40, 38.10, 58.60, 58.60
R50-How-ASMK (n=2000) [43] | 79.40, 65.80, 81.60, 61.80 | 56.90, 38.90, 62.40, 33.70

(B) Global features
R101-R-MAC [17] | 60.90, 39.30, 78.90, 54.80 | 32.40, 12.50, 59.40, 28.00
R101-GeM↑ [38] | 65.30, 46.10, 77.30, 52.60 | 39.60, 22.20, 56.60, 24.80
R101-GeM-AP [35] | 67.50, 47.50, 80.10, 52.50 | 42.80, 23.20, 60.50, 25.10
R101-GeM-AP (GLDv1) [35] | 66.30, -, 80.20, - | 42.50, -, 60.80, -
R152-GeM [34] | 68.70, -, 79.70, - | 44.20, -, 60.30, -
ResNet101-GeM+SOLAR† [27] | 69.90, 53.50, 81.60, 59.20 | 47.90, 29.90, 64.50, 33.40
R50-DELG [9] | 69.70, 55.00, 81.60, 59.70 | 45.10, 27.80, 63.40, 34.10
R50-DELG (GLDv2-clean) [9] | 73.60, 60.60, 85.70, 68.60 | 51.00, 32.70, 71.50, 44.40
R50-DELG (GLDv2-clean)^r [9] | 77.51, 74.80, 87.90, 77.30 | 54.76, 50.40, 73.82, 61.01
R101-DELG [9] | 73.20, 54.80, 82.40, 61.80 | 51.20, 30.30, 64.70, 35.50
R101-DELG (GLDv2-clean) [9] | 76.30, 63.70, 86.60, 70.60 | 55.60, 37.50, 72.40, 46.90

(C) Global features + Local feature re-ranking
R101-GeM↑+DSM [38] | 65.30, 47.60, 77.40, 52.80 | 39.20, 23.20, 56.20, 25.00
R50-DELG [9] | 75.10, 61.10, 82.30, 60.50 | 54.20, 36.80, 64.90, 34.80
R50-DELG (GLDv2-clean) [9] | 78.30, 67.20, 85.70, 69.60 | 57.90, 43.60, 71.00, 45.70
R50-DELG (GLDv2-clean)^r [9] | 79.08, 75.90, 88.78, 77.69 | 58.40, 52.40, 76.20, 61.60
R101-DELG [9] | 78.50, 62.70, 82.90, 62.60 | 59.30, 39.30, 65.50, 37.00
R101-DELG (GLDv2-clean) [9] | 81.20, 69.10, 87.20, 71.50 | 64.00, 47.50, 72.80, 48.70
R50-DOLG (GLDv2-clean) | 80.50, 76.58, 89.81, 80.79 | 58.82, 52.21, 77.70, 62.83
R101-DOLG (GLDv2-clean) | 81.50, 77.43, 91.02, 83.29 | 61.10, 54.81, 80.30, 66.69

Table 1: Results (% mAP) of different solutions under the Medium and Hard evaluation protocols of Roxf and Rpar. “⋆” means feature quantization is used and “†” means a second-order loss is added in SOLAR. “GLDv1”, “GLDv2” and “GLDv2-clean” mark differences in the training dataset. The superscript r denotes our re-implementation. State-of-the-art performances are marked in bold in the original table and ours are summarized at the bottom; the underlined numbers are the best performances.

4.2.1 Comparison with State-of-the-art Methods

We divide the previous state-of-the-art methods into three groups: (1) local feature aggregation and re-ranking; (2) global feature similarity search; (3) global feature search followed by re-ranking with local feature matching and spatial verification (SP). Strictly speaking, our method belongs to the global feature similarity search group. The results are summarized in Table 1, and we can see that our solution consistently outperforms existing solutions.

Comparison with local feature based solutions. In the local feature aggregation group, besides DELF [29], it is worth mentioning the recent work R50-How [43], which provides a way to learn local descriptors combined with ASMK [42] and outperforms DELF, achieving a boost of up to 3.4% on Roxf-Medium and 1.4% on Rpar-Medium. However, its complexity is considerable: n=2000 indicates that it finally uses the 2000 strongest local keypoints. Our method outperforms it by up to 1.1% on Roxf-Medium and 8.21% on Rpar-Medium with the same ResNet50 backbone. For the hard samples, our R50-DOLG achieves 58.82% and 77.70% mAP on Roxf and Rpar respectively, which is significantly better than the 56.9% and 62.4% achieved by R50-How. The results show that our single-stage model is better than existing local feature aggregation methods, even though they are enhanced by a second re-ranking stage.

Comparison with global feature based solutions. Our method completes image retrieval in a single stage, as the global feature based solutions do. Among them, the global feature learned by DELG [9] performs the best, especially when the models are trained on the GLDv2-clean dataset. Our models are also trained on this dataset and are validated to be better than DELG. The performance is significantly improved by our solution: for example, with the Res50 backbone, the mAP is 80.5% vs. 77.51% on Roxf-Medium and 58.82% vs. 54.76% on Roxf-Hard. Please note that our R50-DOLG even performs better than R101-DELG. These results well demonstrate the superiority of our framework.

Comparison with global+local feature based solutions. Among the solutions where global feature search is followed by local feature re-ranking, R50/101-DELG is still the existing state-of-the-art. Compared with the best results of DELG, our R50-DOLG outperforms R50-DELG with a boost of up to 1.42% on Roxf-Medium, 1.03% on Rpar-Medium, 0.42% on Roxf-Hard and 1.5% on Rpar-Hard. Our R101-DOLG outperforms R101-DELG with a boost of up to 0.3% on Roxf-Medium, 3.82% on Rpar-Medium and 7.5% on Rpar-Hard. From these results we can see that, although two-stage solutions can indeed improve over their single-stage counterparts, our solution combining both local and global information is a better choice.

Comparison in mP@10. We compare mP@10 in Table 2. The mP@10 performance of DOLG is better than that of two-stage DELG^r on both Rpar and Roxf. Such results validate that our single-stage solution is more precise than the state-of-the-art two-stage DELG, owing to the advantages of end-to-end training and freedom from error accumulation.

Model | Roxf-M | Roxf-H | Rpar-M | Rpar-H
R50-DELG^r | 90.79 | 69.00 | 95.57 | 92.00
R50-DOLG | 92.52 | 71.14 | 98.43 | 93.71
Table 2: mP@10 results of different methods.

“+1M” distractors. From Table 1, DOLG and the two-stage DELG^r outperform the official two-stage DELG by a large margin. This is reasonable. First, DELG^r and our DOLG are both trained for 100 epochs while the official DELG is trained for only 25 epochs, so the original DELG features are less robust (without the 1M distractors, DELG-Global^r outperforms DELG-Global by 3.9 points in mAP on Roxf-M, and its re-ranking on Rpar is even slightly worse than DELG-Global). When a huge number of distractors exists, less robust global and local features result in more severe error accumulation (DELG-Global^r > two-stage DELG with “+1M”). As a consequence, a significant performance gap appears between our re-implemented DELG and its official version. From the last two rows, we see that DOLG still outperforms two-stage DELG^r when the 1M distractors are added.

Qualitative Analysis. We showcase top-10 retrieval results of a query image in Figure 5. We can see that state-of-the-art methods using only the global feature return many false positives that are semantically similar to the query. With re-ranking, some false positives can be eliminated, but those with similar local patterns still remain. Our solution combines global and local information and is end-to-end optimized, so it is better at identifying true positives.

4.2.2 Ablation Studies

To empirically verify some of our design choices, ablation experiments are conducted using the Res50 backbone.

Where to Fuse. To check which block is better suited for the global and local orthogonal integration, we provide empirical results to verify our choice. Shallow layers are known to be inappropriate for local feature representations [29, 9], so we mainly examine the Res3 and Res4 blocks. We implemented DOLG variants in which the local branch originates from $f_{4}$ only, or from both $f_{3}$ and $f_{4}$. Fusing $f_{3}$, $f_{4}$ and $f_{g}$ means there are two orthogonal fusion branches, based on Res3 and Res4, and the two orthogonal tensors generated from them are concatenated with $f_{g}$ and pooled. The results are summarized in Table 3. We can see that 1) without the local branch, the global-only setting performs worse than any fused variant; 2) fusing $f_{3}$, $f_{4}$, or both $f_{3}$ and $f_{4}$ improves the performance of “Global only”. Fusing $f_{3}$ clearly outperforms fusing $f_{4}$ on Roxf, although it is slightly worse on Rpar. Fusing both $f_{3}$ and $f_{4}$ does not improve over $f_{3}$-only, but it is better than $f_{4}$-only. These phenomena are reasonable: $f_{3}$ has sufficient spatial resolution as well as sufficient network depth, so it serves better than $f_{4}$ as local features, whereas fusing both $f_{3}$ and $f_{4}$ makes the model more complicated. Besides, since $f_{g}$ is derived from $f_{4}$ as well, the both-$f_{3}$-and-$f_{4}$ setting may put extra emphasis on $f_{4}$ and therefore degrade the overall performance. Overall, $f_{3}$-only is the best.

Impact of Poolings. In this experiment, we study how GeM pooling [34] and average pooling affect our overall framework. We report results of DOLG when the pooling functions of the global branch and of the orthogonal fusion module are varied. With all other settings kept the same, the performances of R50-DOLG are presented in Table 4. Interestingly, using GeM pooling for the global branch while using average pooling for the orthogonal fusion module is the best combination.

Location | Roxf: E, M, H | Rpar: E, M, H
Global only | 90.65, 78.21, 56.31 | 95.65, 89.00, 76.17
Fuse f4-only | 92.08, 79.39, 58.13 | 95.93, 89.92, 77.92
Fuse f3-only | 93.17, 80.50, 58.82 | 95.95, 89.81, 77.70
Fuse both f3 & f4 | 92.34, 79.41, 57.08 | 96.01, 89.78, 77.69
Table 3: Experimental results of DOLG variants where the orthogonal fusion is performed at different locations.
Pooling (Global / Ortho) | Roxf: E, M, H | Rpar: E, M, H
GeM / GeM | 92.62, 78.28, 55.30 | 96.20, 89.50, 76.99
AVG / AVG | 92.20, 78.14, 56.14 | 95.86, 89.25, 76.32
GeM / AVG | 93.17, 80.50, 58.82 | 95.95, 89.81, 77.70
AVG / GeM | 89.63, 73.48, 44.88 | 94.67, 86.76, 72.98
Table 4: Differences when different pooling functions are used. “AVG” means ordinary global average pooling.
Figure 5: Demonstration of top-10 retrieved results. The top-5 retrieved images are all correct and are excluded from this figure. Results of DELG global, DELG global+local and our DOLG are shown from top to bottom. Green and red boxes denote positive and negative images, respectively.

Impact of Each Component in the Local Branch. A multi-atrous block and a self-attention block are designed in our local branch to simulate the spatial feature pyramid with dilated convolution layers [10] and to model the importance of local features with an attention mechanism [29], respectively. We provide experimental results to validate the contribution of each of these components by removing them individually from the whole framework; the performance is shown in Table 5. It is clear that fusing the local features helps to improve the overall performance significantly: the mAP is improved from 78.2% to 80.5% and from 89.0% to 89.8% on Roxf-Medium and Rpar-Medium, respectively. When the Multi-Atrous module is removed, the performance drops slightly on the Medium and Hard splits, especially on the Hard split: for example, mAP decreases from 58.82% to 58.36% and from 77.70% to 76.52% on Roxf-Hard and Rpar-Hard, respectively. For easy cases, Multi-Atrous makes the performance slightly worse, but this makes little difference because the mAP is already very high and the drop is very limited. Such results validate the effectiveness of the Multi-Atrous module. When the self-attention module is removed, the performance also drops notably, which is consistent with the results obtained by [9].

Config | Roxf: E, M, H | Rpar: E, M, H
w/o Local | 90.65, 78.21, 56.31 | 95.65, 89.00, 76.17
w/o MultiAtrous | 93.48, 80.48, 58.36 | 96.66, 89.27, 76.52
w/o Self-ATT | 90.64, 78.15, 55.34 | 95.73, 89.48, 77.16
Full Model | 93.17, 80.50, 58.82 | 95.95, 89.81, 77.70
Table 5: Ablation experiments on components of the local branch in our framework.

Verification of the Orthogonal Fusion. In the orthogonal fusion module, we propose to decompose the local features into two components, one parallel to the global feature $f_{g}$ and the other orthogonal to $f_{g}$, and then fuse the complementary orthogonal component with $f_{g}$. To show that such orthogonal fusion is a better choice, we conduct experiments in which the orthogonal decomposition procedure shown in Figure 4(a) is removed and $f_{l}$ and $f_{g}$ are concatenated directly. We also try fusing $f_{l}$ and $f_{g}$ by the Hadamard product (element-wise product), which is commonly used to fuse two vectors. The empirical results (see Table 6) show that, among the three fusion schemes, our proposed orthogonal fusion performs the best. Such results are also within our expectation. With orthogonal fusion, the information relevant to the global feature $f_{g}$ is excluded from each local feature point $f_{l}^{(i,j)}$. In this way, the output local feature points are the most informative ones and are orthogonal to $f_{g}$. Not only do they provide complementary information to better describe an image, but they also do not put extra emphasis on the global feature $f_{g}$ because of their irrelevance to it.

Method | Roxf: E, M, H | Rpar: E, M, H
Concatenation | 91.29, 78.40, 56.55 | 95.88, 89.37, 76.80
Hadamard | 92.21, 79.20, 56.76 | 95.94, 89.91, 77.40
Orthogonal | 93.17, 80.50, 58.82 | 95.95, 89.81, 77.70
Table 6: Comparison of orthogonal fusion with other fusion strategies: concatenation and Hadamard product are explored, with $m=2.0$, $\gamma=30$ for the ArcFace margin loss (otherwise the training does not converge).

5 Discussions

Here, we would like to discuss our current implementation and model complexity. First of all, we have not extensively studied or tuned many of the hyper-parameters, such as $p$ for GeM, $\gamma$ and $m$ for the ArcFace margin loss, and the dilation rates $s$ of the dilated convolution layers. Instead, we directly follow the practices in DELG [9] and ASPP [10]. We do so in order to show the effectiveness of our proposed building blocks rather than to tune for better models, although we believe tuning these parameters may yield better performance. Another point worth mentioning is the orthogonal fusion module. Our focus is on developing a single-stage solution by aggregating orthogonal local and global information. The aggregation operation denoted as “A” in Figure 4(a) is simply chosen from GeM and average pooling for proof-of-concept purposes. Note that average pooling is a linear operation; in this case, the orthogonal fusion module is equivalent to pooling the local features first and then performing the projection and subtraction, so its computation can be further simplified. In short, our current orthogonal fusion module is simple yet effective. We believe exploring more sophisticated learning-based aggregation for “A” in Figure 4(a) is promising, and we leave it as future work. As for complexity, compared to DELG [9] and DELF [29], the extra computational cost comes from the Multi-Atrous module and the orthogonal fusion module. The former is composed of a few dilated convolution layers, while the latter can currently be reduced to $Pool(f_{l})-(Pool(f_{l})\cdot f_{g})f_{g}/|f_{g}|^{2}$. Therefore, the overhead of our solution is quite limited. Besides, our retrieval process finishes in a single stage.
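To make the simplification concrete, a short sketch of the reduced computation (assuming average pooling for “A”) is given below; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def pooled_orthogonal_component(f_l, f_g, eps=1e-6):
    """Pool(f_l) - (Pool(f_l) . f_g) f_g / |f_g|^2: with average pooling for "A",
    this equals per-point orthogonal decomposition followed by pooling."""
    p = F.adaptive_avg_pool2d(f_l, 1).flatten(1)                      # Pool(f_l): (B, C)
    coef = (p * f_g).sum(dim=1, keepdim=True) / (f_g.pow(2).sum(dim=1, keepdim=True) + eps)
    return p - coef * f_g                                             # (B, C)
```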

6 Conclusion

In this paper, we make the first attempt to fuse local and global features in an orthogonal manner for effective single-stage image retrieval. We have designed a novel local feature learning branch, where a multi-atrous module is leveraged to simulate a spatial feature pyramid and handle scale variation among images, and a self-attention module is adopted to model the importance of each local descriptor. We also design a novel orthogonal fusion module to combine complementary local and global information, so that the two mutually reinforce each other and produce a representative final descriptor via objective-oriented training. Extensive experimental results have been presented for proof-of-concept purposes, and we also significantly improve the state-of-the-art performance on Roxf and Rpar.

References

  • [1] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5297–5307, 2016.
  • [2] Yannis Avrithis and Giorgos Tolias. Hough pyramid matching: Speeded-up geometry re-ranking for large scale image retrieval. International journal of computer vision, 107(1):1–19, 2014.
  • [3] Artem Babenko and Victor Lempitsky. Aggregating local deep features for image retrieval. In Proceedings of the IEEE international conference on computer vision, pages 1269–1277, 2015.
  • [4] Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. Neural codes for image retrieval. In European conference on computer vision, pages 584–599. Springer, 2014.
  • [5] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. Learning local feature descriptors with triplets and shallow convolutional neural networks. In Bmvc, volume 1, page 3, 2016.
  • [6] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (surf). Computer vision and image understanding, 110(3):346–359, 2008.
  • [7] Jeffrey S Beis and David G Lowe. Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In Proceedings of IEEE computer society conference on computer vision and pattern recognition, pages 1000–1006. IEEE, 1997.
  • [8] Michael M Bronstein, Alexander M Bronstein, Fabrice Michel, and Nikos Paragios. Data fusion through cross-modality metric learning using similarity-sensitive hashing. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3594–3601. IEEE, 2010.
  • [9] Bingyi Cao, André Araujo, and Jack Sim. Unifying deep local and global features for image search. In European Conference on Computer Vision, pages 726–743. Springer, 2020.
  • [10] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [11] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 403–412, 2017.
  • [12] Wei Chen, Yu Liu, Weiping Wang, Erwin Bakker, Theodoros Georgiou, Paul Fieguth, Li Liu, and Michael S Lew. Deep image retrieval: A survey. arXiv preprint arXiv:2101.11282, 2021.
  • [13] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  • [14] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
  • [15] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 224–236, 2018.
  • [16] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint detection and description of local features. arXiv preprint arXiv:1905.03561, 2019.
  • [17] Albert Gordo, Jon Almazan, Jerome Revaud, and Diane Larlus. End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision, 124(2):237–254, 2017.
  • [18] Kun He, Yan Lu, and Stan Sclaroff. Local descriptors optimized for average precision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 596–605, 2018.
  • [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
  • [20] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Teddy Furon, and Ondrej Chum. Efficient diffusion on region manifolds: Recovering small objects with compact cnn representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2077–2086, 2017.
  • [21] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Hamming embedding and weak geometric consistency for large scale image search. In European conference on computer vision, pages 304–317. Springer, 2008.
  • [22] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact representation. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010.
  • [23] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3304–3311. IEEE, 2010.
  • [24] Hervé Jégou, Florent Perronnin, Matthijs Douze, Jorge Sánchez, Patrick Pérez, and Cordelia Schmid. Aggregating local image descriptors into compact codes. IEEE transactions on pattern analysis and machine intelligence, 34(9):1704–1716, 2011.
  • [25] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  • [26] Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Repeatability is not enough: Learning affine regions via discriminability. In European Conference on Computer Vision. Springer, 2018.
  • [27] Tony Ng, Vassileios Balntas, Yurun Tian, and Krystian Mikolajczyk. Solar: Second-order loss and attention for image retrieval. Arxiv, 2020.
  • [28] David Nister and Henrik Stewenius. Scalable recognition with a vocabulary tree. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 2161–2168. Ieee, 2006.
  • [29] Hyeonwoo Noh, Andre Araujo, Jack Sim, Tobias Weyand, and Bohyung Han. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE international conference on computer vision, pages 3456–3465, 2017.
  • [30] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In 2007 IEEE conference on computer vision and pattern recognition, pages 1–8. IEEE, 2007.
  • [31] Qi Qin, Wenpeng Hu, and Bing Liu. Feature projection for improved text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8161–8171, 2020.
  • [32] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting oxford and paris: Large-scale image retrieval benchmarking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5706–5715, 2018.
  • [33] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In European conference on computer vision, pages 3–20. Springer, 2016.
  • [34] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning cnn image retrieval with no human annotation. IEEE transactions on pattern analysis and machine intelligence, 41(7):1655–1668, 2018.
  • [35] Jerome Revaud, Jon Almazán, Rafael S Rezende, and Cesar Roberto de Souza. Learning with average precision: Training image retrieval with a listwise loss. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5107–5116, 2019.
  • [36] Jerome Revaud, Philippe Weinzaepfel, César De Souza, Noe Pion, Gabriela Csurka, Yohann Cabon, and Martin Humenberger. R2d2: repeatable and reliable detector and descriptor. arXiv preprint arXiv:1906.06195, 2019.
  • [37] Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12716–12725, 2019.
  • [38] Oriane Siméoni, Yannis Avrithis, and Ondrej Chum. Local features and visual words emerge in activations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11651–11660, 2019.
  • [39] Josef Sivic and Andrew Zisserman. Video google: A text retrieval approach to object matching in videos. In Computer Vision, IEEE International Conference on, volume 3, pages 1470–1470. IEEE Computer Society, 2003.
  • [40] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. Inloc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7199–7209, 2018.
  • [41] Marvin Teichmann, Andre Araujo, Menglong Zhu, and Jack Sim. Detect-to-retrieve: Efficient regional aggregation for image search. arXiv, 2019.
  • [42] Giorgos Tolias, Yannis Avrithis, and Hervé Jégou. Image search with selective match kernels: aggregation across single and multiple images. International Journal of Computer Vision, 116(3):247–261, 2016.
  • [43] Giorgos Tolias, Tomas Jenicek, and Ondřej Chum. Learning and aggregating deep local descriptors for instance-level recognition. In European Conference on Computer Vision, pages 460–477. Springer, 2020.
  • [44] Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of cnn activations. arXiv preprint arXiv:1511.05879, 2015.
  • [45] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5265–5274, 2018.
  • [46] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pages 2593–2601, 2017.
  • [47] Tobias Weyand, Andre Araujo, Bingyi Cao, and Jack Sim. Google landmarks dataset v2-a large-scale benchmark for instance-level recognition and retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2575–2584, 2020.
  • [48] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In European conference on computer vision, pages 467–483. Springer, 2016.
  • [49] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics, 1(1-4):43–52, 2010.
  • [50] Yan-Tao Zheng, Ming Zhao, Yang Song, Hartwig Adam, Ulrich Buddemeier, Alessandro Bissacco, Fernando Brucher, Tat-Seng Chua, and Hartmut Neven. Tour the world: building a web-scale landmark recognition engine. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1085–1092. IEEE, 2009.
  • [51] Wengang Zhou, Houqiang Li, and Qi Tian. Recent advance in content-based image retrieval: A literature survey. arXiv preprint arXiv:1706.06064, 2017.