
Semi-Global Shape-aware Network

Pengju Zhang1,2        Yihong Wu1,2        Jiagang Zhu3
1 School of Artificial Intelligence, University of Chinese Academy of Sciences
2 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
3 XForwardAI Technology Co., Ltd.
Yihong Wu ([email protected]) is the corresponding author.
Abstract

Non-local operations are widely used to capture long-range dependencies by aggregating global context to each position. However, most of these methods cannot preserve object shapes, since they consider only feature similarity and ignore the proximity between the central position and the others when capturing long-range dependencies, while shape-awareness is beneficial to many computer vision tasks. In this paper, we propose a Semi-Global Shape-aware Network (SGSNet) that considers both feature similarity and proximity so as to preserve object shapes when modeling long-range dependencies. Global context is aggregated in a hierarchical way. In the first level, each position in the whole feature map aggregates contextual information only in the vertical and horizontal directions according to both similarity and proximity; the result is then fed into the second level to perform the same operations. Through this hierarchy, each central position gains support from all other positions, and the combination of similarity and proximity makes each position gain support mostly from the same semantic object. Moreover, we propose a linear time algorithm for the aggregation of contextual information, in which each row and column of the feature map is treated as a binary tree to reduce the cost of similarity computation. Experiments on semantic segmentation and image retrieval show that adding SGSNet to existing networks yields solid improvements in both accuracy and efficiency.

1 Introduction

Deep learning, based on local convolutions, has achieved great success in many fields. Recently, many studies have shown that modeling long-range dependencies attains consistent improvements in many computer vision tasks, such as semantic segmentation [13, 29], object detection [35], and image retrieval [20]. Long-range dependencies, which contain global contextual information, are usually captured by stacking local convolutional operations [11, 5] or by non-local blocks [13, 29, 35].

Conventional convolutions operate on data within a local window, and many methods enlarge receptive fields by repeating local convolutional operations to capture long-range dependencies [28, 30]. Such operations are inefficient for modeling long-range dependencies and may encounter optimization difficulties [11]. Considering these problems, several non-local operations have been proposed to model long-range dependencies, such as Non-local Networks [35], CCNet [13], and GCNet [3]. In these methods, each spatial position in a feature map is seen as a node, and each node gets support from all other nodes according to a weight function, usually a dot-product similarity between the central node and the others. This considers only similarity and ignores the proximity between two nodes when capturing long-range dependencies. Due to the absence of spatial distances, object shapes cannot be modeled. Yet shape-awareness has been confirmed to be useful in many computer vision tasks [9, 44, 19]. Tree filters [37, 29] are adopted to preserve object structures: minimum spanning trees are established in low-level feature maps and global context is then aggregated on high-level feature maps. However, the efficiency of spanning-tree based methods still leaves room for improvement.


Figure 1: Differences between our method and CCNet [13]. "$\curvearrowright$" denotes the weight between two nodes. The weights between the central node and its nearest neighbors depend on the red "$\curvearrowright$"; the weights between the central node and its second nearest neighbors depend on the purple "$\curvearrowright$". (Best viewed in color)

In this paper, we propose a Semi-Global Shape-aware Network (SGSNet) to address the above problems. In order to consider both feature similarity and proximity for preserving object shapes when modeling long-range dependencies, we aggregate global contextual information for each position in a hierarchical way. In the first level, each position in the whole feature map aggregates contextual information only in the vertical and horizontal directions. The result of the previous level is then fed into the next level to perform the same operations. Through this hierarchy, each central position gains support from all other positions, and the combination of similarity and proximity makes each position gain support mostly from the same semantic object. We also propose an efficient algorithm that reduces the computational complexity from $\mathcal{O}(N\sqrt{N})$ of the brute-force implementation to $\mathcal{O}(N)$. As a contrast, the computational complexities of Non-local Neural Networks [35] and CCNet [13] are $\mathcal{O}(N^{2})$ and $\mathcal{O}(N\sqrt{N})$ respectively. The computational complexity of [29] is also $\mathcal{O}(N)$, but it needs extra time and computation to establish minimum spanning trees.

Differences between our method and CCNet are illustrated in Fig. 1. In CCNet, the weight between two nodes depends only on the similarity between them. In our method, the weight between two nodes considers both similarity and proximity, preserving object shapes when capturing long-range dependencies. The details are described in Sec. 3.

In summary, our main contributions are:

  • We propose SGSNet, which considers both feature similarity and geometric proximity for preserving object shapes when modeling long-range dependencies.

  • For practical usage, we propose an efficient algorithm for the aggregation of contextual information that reduces the computational complexity from $\mathcal{O}(N\sqrt{N})$ to $\mathcal{O}(N)$. Thus SGSNet can be conveniently plugged into existing convolutional neural networks.

  • Experiments on different computer vision tasks (semantic segmentation and image retrieval) show that adding SGSNet to existing deep neural networks achieves higher accuracy while requiring less computation.

2 Related Work

2.1 Self-attention

Self-attention [33] was initially applied in the machine translation area. Non-local Networks [35] bridged self-attention modules in machine translation to non-local filtering operations in computer vision. Many methods improved the weight functions in self-attention modules to learn discriminative feature representations. Non-local Networks [35] discussed four choices of weight functions, i.e., Gaussian, Embedded Gaussian, Dot product, and Concatenation. Considering the $\mathcal{O}(N^{2})$ computational complexity of Non-local Networks, CCNet [13] adopted recurrent sparse criss-cross attention modules to substitute the dense attention in Non-local Networks. With two consecutive criss-cross attention modules, every node can collect contextual information from all nodes in the feature map. This reduces the computational complexity from $\mathcal{O}(N^{2})$ to $\mathcal{O}(N\sqrt{N})$ and is therefore more efficient, but object shapes remain unconsidered. Song et al. [29] first built minimum spanning trees (MST) in low-level feature maps and then computed feature similarity in high-level semantics to preserve object structures; however, extra time and computation are needed to build the minimum spanning trees. In this work, we consider both spatial distances and feature similarity when capturing long-range dependencies, achieving higher accuracy while maintaining a low computational complexity.

2.2 Semantic Segmentation

Semantic segmentation is an essential and challenging task in the computer vision community. Methods based on Convolutional Neural Networks (CNNs) have made significant achievements in the past few years. Long et al. [17] applied fully convolutional networks (FCNs) to semantic segmentation. Later, researchers found that FCNs were limited by receptive fields due to their fixed geometric structures. U-Net [27] and DeepLabv3+ [7] used encoder-decoder structures to combine high-level and low-level features for semantic segmentation. Chen et al. [5] adopted atrous convolutions, which effectively enlarge receptive fields when aggregating contextual information, and further proposed atrous spatial pyramid pooling (ASPP) to perform segmentation at multiple scales. Zhao et al. [42] exploited global context information by pyramid pooling modules to achieve better performance. PSANet [43] relaxed the local neighborhood constraint through a self-adaptively learned attention mask. Peng et al. [22] found that large convolutional kernels are also important for dense per-pixel prediction and proposed the Global Convolutional Network (GCN). Unlike the previous methods, we aggregate global context information in a self-attention manner.

2.3 Image Retrieval

Traditionally, bag-of-visual-words [24], VLAD [2], and Fisher vectors [23] are used to aggregate a set of handcrafted local features [18] into a single global vector representing an image. Recently, many methods attempt to replace the handcrafted features with learned counterparts and then aggregate the learned features with techniques similar to the traditional ones [1, 16, 10]. Some studies show that directly substituting a pooling operation [26] for the aggregation process achieves comparable performance. Following [20], we name these pooling-based methods global single-pass methods, since they do not separate the extraction and aggregation steps explicitly. Radenović et al. [26] proposed GeM pooling, which generalizes average and max pooling and obtains excellent results. Based on GeM, SOLAR [20] employed second-order similarity and attention for image retrieval and obtained significant performance improvements. In this paper, we also explore the proposed SGSNet for image retrieval.

3 Semi-Global Shape-aware Network

In this section, we first introduce preliminary formulations that consider feature similarity and geometric proximity for preserving object shapes when modeling long-range dependencies. Then we present the proposed Semi-Global Shape-aware Network. Finally, we propose a linear time algorithm for implementing the network.

3.1 Preliminary Formulations

First, a given feature map $I$ in CNNs can be seen as a connected, undirected graph $G=(N,E)$. The nodes $N$ are all spatial positions in $I$ and the edges $E$ with weights $\omega$ are all connections between two neighbor nodes. Next, we define a weight function between different nodes. Let us consider a simple case: given a pair of neighbor nodes $u$ and $v$, the weight between $u$ and $v$ is defined as

$\omega(u,v)=\omega(v,u)=\exp(-d(u,v)),$   (1)

where $d(u,v)$ is the Euclidean distance between the feature vectors of $u$ and $v$. When nodes $u$ and $v$ are not neighbors, the weight function is given by

$\Omega(u,v)=\prod_{(v_{i},v_{j})\in P_{u,v}}\omega(v_{i},v_{j})=\exp\Big(-\sum_{(v_{i},v_{j})\in P_{u,v}}d(v_{i},v_{j})\Big),$   (2)

where $v_{i}$ and $v_{j}$ are neighbor nodes in the shortest path $P$ between $u$ and $v$. We denote the feature map after aggregating contextual information as $I^{\prime}$. The aggregation function, which takes feature similarity and geometric proximity into account simultaneously, is:

$I^{\prime}_{u}=\dfrac{1}{S_{u}}\sum_{\forall v\in\mathcal{R}}\Omega(u,v)f(I_{v})+I_{u},$   (3)

where $I^{\prime}_{u}$ and $I_{u}$ denote the feature vectors at $u$ in $I^{\prime}$ and $I$ respectively, $\mathcal{R}$ is the set of all nodes in the graph $G$, the function $f(\cdot)$ is a feature transformation, and the weight $\Omega(u,v)$ is normalized by $S_{u}=\sum_{\forall v\in\mathcal{R}}\Omega(u,v)$.
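To make Eqs. (1)–(3) concrete, here is a minimal sketch of the brute-force aggregation, restricted to a single row so that the path between any two nodes is unique; the function name aggregate_row and the identity feature transformation f are our own illustrative choices, not part of the paper.

```python
import numpy as np

def aggregate_row(feats, f=lambda x: x):
    """Brute-force Eq. (3) restricted to one row, where the path between any
    two nodes is unique. feats: (W, C) array of feature vectors."""
    W, _ = feats.shape
    # Eq. (1): distance between each pair of neighboring nodes.
    d = np.array([np.linalg.norm(feats[i + 1] - feats[i]) for i in range(W - 1)])
    cum = np.concatenate([[0.0], np.cumsum(d)])  # prefix sums of neighbor distances
    out = np.empty_like(feats, dtype=np.float64)
    for u in range(W):
        # Eq. (2): Omega(u, v) = exp(-accumulated distance along the path u..v).
        omega = np.exp(-np.abs(cum - cum[u]))    # shape (W,)
        # Eq. (3): normalized weighted sum plus the residual term.
        out[u] = omega @ f(feats) / omega.sum() + feats[u]
    return out

# Toy usage: a row of 5 nodes with 4-dimensional features.
print(aggregate_row(np.random.rand(5, 4)).shape)  # (5, 4)
```

The prefix sums of neighbor distances already hint at why this aggregation can be made linear, which Sec. 3.4 exploits.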

In the above aggregation process, a problem is that there may be many paths between arbitrary nodes $u$ and $v$, and it is time-consuming to find the shortest path for all pairs of nodes. In [37, 29], the authors adopt a spanning tree to remove "unwanted" edges in the four-connected planar graph $G$, so that all nodes are connected by a minimum spanning tree. The minimum spanning tree enlarges the geometric proximity between two neighbor nodes if they are dissimilar in appearance; as a result, low support weights are assigned between them, which causes less robustness to textures [37]. Besides, establishing and traversing a minimum spanning tree costs extra time and computation. Inspired by [12, 13], we adopt a semi-global manner to overcome this difficulty: we consider the supports from nodes in the horizontal and vertical directions in a single block, and then add multiple blocks hierarchically to capture full-image dependencies. The details are described in Sec. 3.2.


Figure 2: (a) Architecture of semi-global blocks; (b) illustration of the detailed aggregation, where "$\otimes$" means multiplication and "$\oplus$" means element-wise sum. (Best viewed in color)

3.2 A Semi-Global Block

Due to the difficulty of minimizing matching costs in a 2D image, SGM [12] utilizes a semi-global approach to aggregate the matching costs along multiple 1D directions. CCNet [13] collects contextual information only in a criss-cross path. In this paper, we establish a semi-global block, as shown in Fig. 2 (a). Given an input feature map $I\in\mathcal{R}^{C\times W\times H}$ with channels $C$, width $W$ and height $H$, the block first applies a 1 $\times$ 1 convolution $\lambda$ on $I$ to reduce $C$. The resulting feature map is denoted as $\Lambda\in\mathcal{R}^{C^{\prime}\times W\times H}$ with $C^{\prime}<C$. We compute an attention map $A\in\mathcal{R}^{(H+W-1)\times W\times H}$ by a weight function on $\Lambda$. The channels at a position of $A$ correspond to the weights between this position and the other positions in the same column or row. The block also applies another 1 $\times$ 1 convolution $\psi$ on $I$, and the resulting feature map $\Psi$ has the same size as $I$. For each position $u$ in the spatial dimension of $\Psi$, we obtain a feature vector $\Psi_{u}\in\mathcal{R}^{C}$. We collect the feature vectors of all positions in the horizontal and vertical directions of $u$ on $\Psi$, resulting in a matrix $\Theta_{u}\in\mathcal{R}^{(H+W-1)\times C}$. We denote the output feature map of the block as $I^{\prime}\in\mathcal{R}^{C\times W\times H}$; the feature vector at position $u$ on $I^{\prime}$, by (3), is:

$I^{\prime}_{u}=\sum_{d\in(0,H+W-1)}A_{u_{d}}\Theta_{u_{d}}+I_{u},$   (4)

where $d$ is the index of channels of $A$, $A_{u_{d}}$ denotes the value at the $d^{th}$ channel at position $u$ on $A$, $\Theta_{u_{d}}$ denotes the $d^{th}$ feature vector in $\Theta_{u}$, and $I_{u}$ denotes the feature vector at position $u$ on $I$.
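As a rough, shape-level sketch of the block (with module and attribute names of our own choosing; the paper does not publish code), the 1 $\times$ 1 convolutions and the tensor sizes above can be laid out in PyTorch as follows. The attention map $A$ is only allocated here; filling it with the path-based weights of Eq. (2)/(5) is what the linear-time algorithm of Sec. 3.4 addresses.

```python
import torch
import torch.nn as nn

class SemiGlobalBlockSketch(nn.Module):
    """Shape-level sketch of one semi-global block (Fig. 2a)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.lam = nn.Conv2d(channels, channels // reduction, 1)  # lambda: reduced map for the weight function
        self.psi = nn.Conv2d(channels, channels, 1)               # psi: features to be aggregated

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        lam = self.lam(x)                       # (B, C/8, H, W), used to build the attention map A
        psi = self.psi(x)                       # (B, C,   H, W)
        # A holds one weight per criss-cross position for every pixel;
        # it is only allocated here, not computed.
        A = x.new_zeros(b, h + w - 1, h, w)
        return lam, psi, A

# Shape check with a 97 x 97 feature map as used in the segmentation experiments.
out = SemiGlobalBlockSketch(64)(torch.randn(1, 64, 97, 97))
print([t.shape for t in out])
```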

It is worth noting that the weight function is quite different from that of [13]. The aggregation process is shown in Fig. 2 (b), where the weight between nodes $u$ and $v$ depends on all nodes on the path connecting $u$ and $v$; the path is limited to the horizontal and vertical directions. Our weight function (2) in the aggregation process considers feature similarity and geometric proximity simultaneously, as described in Sec. 3.1. Therefore, our method can preserve object shapes more clearly compared with [13], which will be shown in Sec. 4.1.4. In order to adjust feature similarity between nodes $u$ and $v$, we further add learnable parameters $\alpha$ and $\beta$ in (2):

$\Omega(u,v)=\exp\!\Big(-\dfrac{D(u,v)}{\alpha}\Big)$ for horizontal, $\quad\Omega(u,v)=\exp\!\Big(-\dfrac{D(u,v)}{\beta}\Big)$ for vertical,   (5)

where $D(u,v)=\sum_{(v_{i},v_{j})\in P_{u,v}}d(v_{i},v_{j})$, and $v_{i}$ and $v_{j}$ are neighbor nodes on the horizontal or vertical path $P$ connecting $u$ and $v$. Considering that the width and height of feature maps may be different, we use $\alpha$ and $\beta$ for the horizontal and vertical directions respectively. If $\alpha$ ($\beta$) is smaller, feature similarity plays a more important role in (5), and vice versa. Therefore, the learnable parameters $\alpha$ ($\beta$) adjust the relation between feature similarity and geometric proximity adaptively.
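A minimal sketch of the learnable scales in Eq. (5), with hypothetical module and argument names:

```python
import torch
import torch.nn as nn

class DirectionalScales(nn.Module):
    """Learnable alpha/beta of Eq. (5): a larger scale emphasizes proximity,
    a smaller one emphasizes feature similarity."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1))  # horizontal, initialized to 1 as in the paper
        self.beta = nn.Parameter(torch.ones(1))   # vertical,   initialized to 1 as in the paper

    def forward(self, D_h, D_v):
        # D_h, D_v: accumulated neighbor distances along rows / columns.
        return torch.exp(-D_h / self.alpha), torch.exp(-D_v / self.beta)
```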

3.3 Hierarchical Semi-Global Blocks

A single semi-global block can only aggregate contextual information in the same row and column as a central node and ignores other directions. We address this problem with multiple hierarchical semi-global blocks. In the first hierarchical level, a feature map $I$ is input into a semi-global block, producing a feature map $I^{\prime}$. In the second hierarchical level, another semi-global block takes $I^{\prime}$ as input and outputs a feature map $I^{\prime\prime}$. Thus, every node in $I^{\prime\prime}$ can aggregate contextual information from all nodes in $I$. Specifically, we denote the attention maps in the first and second hierarchical levels as $A$ and $A^{\prime}$. The process of spreading contextual information from node $v$ (in purple) to node $u$, which is not in the same row or column as $v$, is illustrated in Fig. 3. In the first semi-global block, only nodes in the same row and column receive the contextual information from $v$, as shown at $A$ in Fig. 3. In the second semi-global block, $u$ receives contextual information from $v_{1}$ or $v_{4}$, which have aggregated contextual information from $v$ in the first semi-global block, as shown at $A^{\prime}$ in Fig. 3. Thus, $u$ obtains the contextual information of $v$ through the above process.

More generally, every node can capture full-image contextual information by the hierarchical semi-global blocks, which can enhance the capability of a single semi-global block for modeling long-range dependencies [13].
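A sketch of the hierarchy, assuming a block callable that maps a feature map to an output of the same shape (e.g., a complete semi-global block) and the parameter-sharing setting reported in Sec. 4.1:

```python
def hierarchical_semi_global(block, x, levels=2):
    """Apply the same semi-global block repeatedly (shared parameters);
    two levels already give every node full-image context."""
    for _ in range(levels):
        x = block(x)
    return x
```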


Figure 3: Illustration of the process of spreading contextual information from node $v$ (in purple) to node $u$, which is not in the same row or column as $v$. A node in purple contains contextual information of node $v$. The left is an input feature map $I$; the middle is the attention map $A$ in the first semi-global block; the right is the attention map $A^{\prime}$ in the second semi-global block. (Best viewed in color)

3.4 A Linear Time Algorithm for Efficient Implementation

For a clearer explanation, we ignore the 1 $\times$ 1 convolutions $\lambda$ and $\psi$ in Fig. 2, which does not influence our conclusion. In a semi-global block, one node is supposed to aggregate the contextual information of those in the vertical and horizontal directions. According to (4), the total computational complexity of a single semi-global block is $\mathcal{O}(N\sqrt{N})$ with the brute-force implementation. Specifically, given an input feature map $I$ with $N=W\times H$ nodes, every node needs to aggregate contextual information from the other nodes in the same row and column; thus, $(H+W-1)$ similarity computations are needed for every node. The total computation is $\mathcal{O}((H\times W)\times(H+W-1))$, which is unfavorable for practical applications.

Noticing that there are extensive repeated computations in the brute-force implementation, we propose an optimized alternative that reduces the computational complexity of a single semi-global block to $\mathcal{O}(N)$. We innovatively regard each row and column of the given feature map as a binary tree and complete the interactions between any two nodes on the binary tree in linear time, $\mathcal{O}(2\times(H-1))$ or $\mathcal{O}(2\times(W-1))$. The total computational complexity is thus $\mathcal{O}(H\times 2\times(W-1)+W\times 2\times(H-1))=\mathcal{O}(4\times W\times H-2\times(W+H))$ for $W$ columns and $H$ rows. Specifically, let us consider the operations on one row or column, as shown in Fig. 4. We select one node arbitrarily in the given row or column as the root node to establish a binary tree. Except for the root node and leaf nodes, every node in the binary tree has a parent node and a child node. Every node is supposed to capture the contextual information of the others in this binary tree. The process of capturing contextual information is split into an aggregation step and an updating step:

  • We aggregate contextual information from leaf nodes towards the root node recursively according to

    $A(I_{u})=\begin{cases}I_{u} & u\ \text{is a leaf node}\\ I_{u}+\sum_{P(v)=u}\Omega(u,v)A(I_{v}) & \text{otherwise},\end{cases}$   (6)

    where $I_{u}$ is the input feature vector at node $u$, $P(v)=u$ means that $u$ is the parent node of $v$, and $\Omega(u,v)$ is the weight function between $u$ and $v$ defined in (5).

  • We update features from the root node towards leaf nodes recursively by

    $U(I_{u})=\begin{cases}A(I_{u}) & u\ \text{is the root node}\\ \Omega(P(u),u)U(I_{P(u)})+(1-\Omega^{2}(u,P(u)))A(I_{u}) & \text{otherwise},\end{cases}$   (7)

    where $A(I_{u})$ is computed by (6).

The computational complexities of the aggregation and updating steps are both $\mathcal{O}(H-1)$ for a column or $\mathcal{O}(W-1)$ for a row. Thus the total computational complexity is $\mathcal{O}(2\times(H-1))$ or $\mathcal{O}(2\times(W-1))$ for a single binary tree.

The operations on the other columns and rows are similar. We do not perform the updating steps until the aggregation steps are completed for all binary trees. Thus, the aggregation and updating steps of each binary tree are independent of the others, which allows parallel acceleration on GPUs. The output feature map is denoted as $I^{\prime}$. A node $u$ is included in two binary trees ($T_{c}$, $T_{r}$), as shown in Fig. 4. We add the updating results of $u$ in $T_{c}$ and $T_{r}$ to $I_{u}$:

$I^{\prime}_{u}=T_{c}(U(I_{u}))+T_{r}(U(I_{u}))+I_{u},$   (8)

where $I^{\prime}_{u}$ is the feature vector at $u$ in $I^{\prime}$; $I^{\prime}_{u}$ now contains the contextual information in the horizontal and vertical directions of $u$ in $I$.
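To illustrate Eqs. (6)–(8) on a single row, the following is a minimal sketch that chooses the leftmost node as the root, so the binary tree degenerates to a chain (a valid special case of the construction); the function names are ours, and the normalization of Eq. (3) is omitted, as in Eq. (4).

```python
import numpy as np

def neighbor_weight(fu, fv, alpha=1.0):
    # Eq. (1)/(5): weight between two neighboring nodes.
    return np.exp(-np.linalg.norm(fu - fv) / alpha)

def row_linear_pass(feats, alpha=1.0):
    """Linear-time aggregation (Eq. 6) and updating (Eq. 7) along one row.
    feats: (W, C); the leftmost node is the root, and the chain of right
    neighbors ends at the single leaf on the far right."""
    W, _ = feats.shape
    # w[i] = weight of the edge between node i and node i + 1.
    w = np.array([neighbor_weight(feats[i], feats[i + 1], alpha) for i in range(W - 1)])

    # Aggregation: from the leaf towards the root (Eq. 6).
    A = feats.astype(np.float64).copy()
    for i in range(W - 2, -1, -1):
        A[i] = feats[i] + w[i] * A[i + 1]

    # Updating: from the root towards the leaf (Eq. 7).
    U = A.copy()
    for i in range(1, W):
        U[i] = w[i - 1] * U[i - 1] + (1.0 - w[i - 1] ** 2) * A[i]
    return U  # U[i] aggregates every node of the row, weighted as in Eq. (2)

# Both passes touch each edge once, hence O(2*(W-1)) per row.
print(row_linear_pass(np.random.rand(7, 4)).shape)  # (7, 4)
```

In the full block, the row-pass and column-pass results at each node are summed with the input feature as in Eq. (8).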


Figure 4: Illustration of the aggregation and updating steps. The left is a central node $u$ along with the nodes in the same row and column. The right two are the binary trees ($T_{r}$, $T_{c}$) corresponding to the row and column on the left. "$\rightarrow$" in red and purple denote the aggregation and updating steps respectively. (Best viewed in color)


Figure 5: Segmentation results on the Cityscapes validation set. Columns from left to right: original images, segmentation results with a single semi-global block, segmentation results with hierarchical semi-global blocks, and ground truths. (Best viewed in color)

4 Experiments

In this section, we evaluate SGSNet on semantic segmentation and image retrieval. We first conduct extensive experiments on semantic segmentation to understand its behavior, and then extend it to image retrieval to demonstrate its generality.

4.1 Experiments on Semantic Segmentation

We adopt the Cityscapes [8] benchmark for semantic segmentation and report mean IoU (mean of class-wise intersection over union). Cityscapes is a semantic segmentation benchmark focusing on urban street scenes. It contains 5,000 finely annotated images, divided into 2,975, 500, and 1,525 images for the training, validation, and testing sets. A larger set of 20,000 coarsely annotated images is also provided to support methods that exploit large volumes of weakly-labeled data. The dataset defines 30 visual classes, of which 19 are used in our experiments.

Based on the proposed linear time algorithm, SGSNet can be conveniently plugged into any deep neural network to capture long-range dependencies. We adopt ResNet-101 [11] (pre-trained on ImageNet) as our backbone with minor changes: the last two down-sampling layers are dropped and dilated convolutions are embedded into the subsequent convolutional layers, similar to [13].

We use mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001. A poly learning rate schedule with an initial learning rate of 0.01 and a power of 0.9 is employed. We also augment the training set by random scaling (0.75 to 2.0 $\times$) and then crop patches of 769 $\times$ 769 pixels as the network input resolution. The learnable parameters $\alpha$ and $\beta$ are both initialized to 1. The number of channels in $\Lambda$ is one eighth of the number of channels in $I$, and we share the parameters of the hierarchical semi-global blocks.
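For reference, the poly schedule mentioned above is commonly implemented as the following decay rule (a sketch with hypothetical argument names; the paper itself gives only the initial learning rate and power):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """Poly learning-rate schedule: decays base_lr towards 0 over max_iter iterations."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power

# e.g. with base_lr = 0.01 as in our training setting:
print(poly_lr(0.01, cur_iter=30000, max_iter=60000))  # 0.01 * 0.5**0.9 ≈ 0.00536
```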

4.1.1 Comparisons with Other Methods

We first train SGSNet on the Cityscapes training set and present results on the Cityscapes validation set in Tab. 1. Our method achieves better results than the other methods, even though DeepLabv3+ [7] and DPC [4] use a more powerful backbone.

We further compare our method with existing methods on the Cityscapes testing set, as shown in Tab. 2. SGSNet is trained with only the finely annotated data and the test results are submitted to the official evaluation server. It can be seen from Tab. 2 that SGSNet outperforms the other methods, whether they employ the same backbone as ours or a stronger one [40, 36]. Even though CCNet freezes batch normalization (BN) layers and fine-tunes the model with a low learning rate after training for a certain number of iterations, while we use none of these tricks, our method is still better than CCNet. This proves the importance of considering both feature similarity and geometric proximity for preserving object shapes when capturing long-range dependencies. Furthermore, SGSNet is more efficient, which is elaborated in Sec. 4.1.2.

Table 1: Comparisons with other methods on Cityscapes validation set.
Method Backbone multi-scale mIoU(%)
DeepLabv3 [6] ResNet-101 Yes 79.3
DeepLabv3+ [7] Xception-65 No 79.1
DPC [4] Xception-71 No 80.8
CCNet [13] ResNet-101 Yes 81.3
SGSNet ResNet-101 No 80.9
SGSNet ResNet-101 Yes 81.9
Table 2: Comparisons with other methods on Cityscapes testing set, where \divideontimes means this method trains with finely annotated training and validation sets.
Method Backbone mIoU(%)
DeepLab-v2 [5] ResNet-101 70.4
RefineNet [15] \divideontimes ResNet-101 73.6
SAC [41] \divideontimes ResNet-101 78.1
GCN [22] \divideontimes ResNet-101 76.9
DUC [34] \divideontimes ResNet-101 77.6
ResNet-38 [40] WiderResnet-38 78.4
PSPNet [42] ResNet-101 78.4
BiSeNet [38] \divideontimes ResNet-101 78.9
AAF [14] ResNet-101 79.1
PSANet [43] \divideontimes ResNet-101 80.1
DFN [39] \divideontimes ResNet-101 79.3
DenseASPP [36] \divideontimes DenseNet-161 80.6
TF [29] \divideontimes ResNet-101 80.8
CCNet [13] \divideontimes ResNet-101 81.4
SGSNet \divideontimes ResNet-101 82.1

4.1.2 Analysis of Computational Complexity

Based on the linear time algorithm described in Sec. 3.4, the computational complexity of SGSNet is $\mathcal{O}(N)$, linearly proportional to the number of nodes in a feature map. We examine the computation cost and the number of parameters of SGSNet on the Cityscapes validation set. We also use ResNet-101 as the backbone with an input size of 769 $\times$ 769 pixels, so the input feature maps of the hierarchical semi-global blocks are 97 $\times$ 97 pixels. The baseline network is ResNet-101 with minor changes, where dilated convolutional layers are adopted at stages 4 and 5. As shown in Tab. 3, SGSNet improves the performance by 5.8% mIoU over the baseline with an additional 0.295M parameters and 11.4 GFLOPs. We also list the GFLOPs and parameters of Non-local and CCNet in Tab. 3. SGSNet achieves higher performance while using fewer parameters and FLOPs than CCNet. Non-local uses slightly fewer parameters than SGSNet, but the additional FLOPs of Non-local are far greater (108G vs. 11.4G), and SGSNet increases mIoU by 1.8% compared with Non-local. Therefore, SGSNet is more efficient and effective than the other two.

Table 3: Comparisons of SGSNet with CCNet and Non-local. The increments of FLOPs and parameters are estimated for an input of 1 ×\times3×\times769×\times769 pixels.
Method GFLOPs (\blacktriangle) Params (M \blacktriangle) mIoU(%)
baseline 0 0 75.1
Non-local 108 0.131 79.1
CCNet 16.5 0.328 79.8
SGSNet 11.4 0.295 80.9


Figure 6: Visualization of attention maps at a specific position (marked by a red cross) for SGSNet and CCNet. The left are the source images; the middle are the attention maps of CCNet; the right are the attention maps of SGSNet. (Best viewed in color)

4.1.3 Ablation Studies

In order to further understand the behavior of SGSNet, we conduct substantial ablation studies on the Cityscapes validation set. The increments of FLOPs and parameters with different numbers of hierarchical semi-global blocks are listed in Tab. 4. Adding one semi-global block increases mIoU by 4.4% compared with the baseline network, which indicates the importance of the semi-global block. Adding two semi-global blocks hierarchically increases the performance by another 1.4% mIoU. These results demonstrate that the proposed hierarchical semi-global blocks capture full-image contextual information and significantly improve performance. We believe the performance can be further improved by adding more semi-global blocks hierarchically. Tab. 4 also shows that a single semi-global block adds only 5.70 GFLOPs and 0.295M parameters; the parameters do not increase when adding the second block because the blocks share parameters. Qualitative results are given in Fig. 5. The areas (truck, ground and fence) in the red circles of the first-column images are easily misclassified. These areas cannot be classified correctly by adding just one semi-global block, as shown in the second column of Fig. 5, but when we add two semi-global blocks hierarchically, they are rectified, as shown in the third column, which explicitly demonstrates the advantage of hierarchical semi-global blocks.

We also explore the influence of the learnable parameters $\alpha$ and $\beta$. We use two semi-global blocks and remove $\alpha$ and $\beta$ from (5). The results are listed in the fourth row of Tab. 4. The performance drops by 0.7% mIoU (compared with the last row of Tab. 4) due to the absence of $\alpha$ and $\beta$.

Table 4: Performance of adding different numbers of hierarchical semi-global blocks to the baseline. The increments of FLOPs and parameters are estimated for an input of 1 ×\times3×\times769×\times769 pixels.
Hierarchical GFLOPs (\blacktriangle) Params (M \blacktriangle) mIoU(%)
baseline 0 0 75.1
H=1 5.70 0.295 79.5
H=2 (no $\alpha$, $\beta$) 11.4 0.295 80.2
H=2 11.4 0.295 80.9

4.1.4 Visualization of Attention Module

To further examine what SGSNet has learned, we visualize its learned attention maps in the last column of Fig. 6. For each image in the left column, we choose a specific position (marked by a red cross) and show its corresponding SGSNet attention map in the right column. Images in the middle column are the corresponding attention maps of CCNet. Both SGSNet and CCNet use two blocks. We observe that, compared with CCNet, SGSNet captures clear semantic similarity and learns object shapes simultaneously. For example, in the first image, the marked position on the truck obtains almost all of its high responses from positions on the truck, and the outline of the truck is clearly visible in the corresponding SGSNet attention map. Similar phenomena can be seen at the marked positions (on the building and the traffic sign) of the second and third images. Areas within the same objects are activated and have high responses, which means that the proposed SGSNet is aware of object shapes when capturing long-range dependencies.

To classify a given ambiguous pixel, humans usually look around that pixel, rather than at farther pixels, for contextual cues that help classify it correctly. From this perspective, SGSNet is more similar to human behavior than CCNet.

4.2 Experiments on Image Retrieval

4.2.1 Datasets

In this section, we investigate the performance of SGSNet on the large-scale image retrieval task. We train networks on Google Landmarks 18 (GL18) [31] and test on the Revisited Oxford and Paris datasets [25]. The GL18 dataset is based on the original Kaggle challenge dataset [21]. It contains more than 1.2 million photos collected from 15,000 landmarks covering a wide range of scenes (e.g., historic cities, metropolitan areas, and natural scenery). The Revisited Oxford and Paris datasets are frequently used to evaluate large-scale image retrieval methods. They improve the Oxford and Paris datasets by dropping wrong annotations and adding new images, resulting in 4,993 and 6,322 images respectively. According to difficulty, the evaluation tasks are divided into three groups: easy, medium, and hard. In each task, the metrics are mean average precision (mAP) and mean precision at rank 10.

Table 5: Comparisons on image retrieval, where Revisited Oxford and Paris datasets are abbreviated as ROxf and RPar respectively.
Method ROxf (Medium) RPar (Medium) ROxf (Hard) RPar (Hard)
baseline 67.6 80.9 44.9 61.9
+Non-local 68.8 82.0 46.6 64.5
+CCNet 68.8 81.8 46.8 64.7
+SGSNet 68.8 82.0 47.1 64.8

4.2.2 Comparisons with Other Methods

In this section, we compare SGSNet with Non-local and CCNet on image retrieval. We adopt ResNet101-GeM [26], trained with the triplet loss and the second-order similarity (SOS) loss [32], as our baseline. ResNet101 [11] contains five fully-convolutional blocks, conv1 to conv5_x. For a fair comparison, we add Non-local, CCNet and SGSNet after conv5_x respectively. Inside Non-local, CCNet and SGSNet, the number of channels $C$ of an input feature map $I$ is reduced to $C/8$ for efficient computation. We also do not use batch normalization and ReLU layers. Following [20], we report the mAP of these methods on the medium and hard tasks of the Revisited Oxford and Paris datasets, as shown in Tab. 5.
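For context on the baseline, GeM pooling [26] computes a generalized mean over the spatial dimensions of a feature map; a minimal sketch (our own function, not the authors' released code) is:

```python
import torch
import torch.nn.functional as F

def gem(x, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling of a (B, C, H, W) feature map:
    p = 1 recovers average pooling, p -> infinity approaches max pooling."""
    return F.avg_pool2d(x.clamp(min=eps).pow(p), (x.size(-2), x.size(-1))).pow(1.0 / p)

# Pool a conv5_x-like feature map into a (B, C, 1, 1) global descriptor.
desc = gem(torch.rand(2, 2048, 24, 24))
print(desc.shape)  # torch.Size([2, 2048, 1, 1])
```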

From Tab. 5, we can see that adding SGSNet to the baseline significantly improves the accuracy of image retrieval. Compared with Non-local and CCNet, SGSNet achieves comparable results on the medium task but superior results on the hard task. The hard task contains many large viewpoint changes and significant occlusions, which disturb the aggregation of contextual information. For SGSNet, most contextual information is captured around the central nodes, so large viewpoint changes and occlusions have relatively less impact on it. As a whole, SGSNet achieves better results while taking less computation, as shown in Sec. 4.1.2.

5 Conclusion

In this work, we propose a Semi-Global Shape-aware Network (SGSNet) that considers both feature similarity and geometric proximity for preserving object shapes when capturing long-range dependencies. Each position in the feature map captures contextual information in the horizontal and vertical directions according to both similarity and proximity within a single block, and then harvests full-image contextual information by adding more semi-global blocks hierarchically. In addition, each row and column of the given feature map is regarded as a binary tree, and based on the structure of these binary trees, we present a linear time algorithm that further improves the efficiency of SGSNet. Extensive ablation studies have been conducted to understand the proposed method in depth. We also show the superiority of SGSNet on semantic segmentation and image retrieval. In the future, we will explore SGSNet in more vision tasks.

References

  • [1] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5297–5307, 2016.
  • [2] Relja Arandjelovic and Andrew Zisserman. All about vlad. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1578–1585, 2013.
  • [3] Yue Cao, Jiarui Xu, Stephen Lin, Fangyun Wei, and Han Hu. Gcnet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019.
  • [4] Liang-Chieh Chen, Maxwell Collins, Yukun Zhu, George Papandreou, Barret Zoph, Florian Schroff, Hartwig Adam, and Jon Shlens. Searching for efficient multi-scale architectures for dense image prediction. In Advances in neural information processing systems, pages 8699–8710, 2018.
  • [5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  • [6] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [7] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
  • [8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
  • [9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
  • [10] Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, and Hongsheng Li. Self-supervising fine-grained region similarities for large-scale image localization. In European Conference on Computer Vision, 2020.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [12] Heiko Hirschmuller. Stereo processing by semiglobal matching and mutual information. IEEE Transactions on pattern analysis and machine intelligence, 30(2):328–341, 2007.
  • [13] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 603–612, 2019.
  • [14] Tsung-Wei Ke, Jyh-Jing Hwang, Ziwei Liu, and Stella X Yu. Adaptive affinity fields for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 587–602, 2018.
  • [15] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1925–1934, 2017.
  • [16] Liu Liu, Hongdong Li, and Yuchao Dai. Stochastic attraction-repulsion embedding for large scale image localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 2570–2579, 2019.
  • [17] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [18] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
  • [19] Zixin Luo, Lei Zhou, Xuyang Bai, Hongkai Chen, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. Aslfeat: Learning local features of accurate shape and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6589–6598, 2020.
  • [20] Tony Ng, Vassileios Balntas, Yurun Tian, and Krystian Mikolajczyk. SOLAR: Second-order loss and attention for image retrieval. In ECCV, 2020.
  • [21] Hyeonwoo Noh, Andre Araujo, Jack Sim, and Bohyung Han. Image retrieval with deep local features and attention-based keypoints. arXiv preprint arXiv:1612.06321, 2016.
  • [22] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters–improve semantic segmentation by global convolutional network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4353–4361, 2017.
  • [23] Florent Perronnin, Yan Liu, Jorge Sánchez, and Hervé Poirier. Large-scale image retrieval with compressed fisher vectors. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3384–3391. IEEE, 2010.
  • [24] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In 2007 IEEE conference on computer vision and pattern recognition, pages 1–8. IEEE, 2007.
  • [25] Filip Radenović, Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondřej Chum. Revisiting oxford and paris: Large-scale image retrieval benchmarking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5706–5715, 2018.
  • [26] Filip Radenović, Giorgos Tolias, and Ondřej Chum. Fine-tuning cnn image retrieval with no human annotation. IEEE transactions on pattern analysis and machine intelligence, 41(7):1655–1668, 2018.
  • [27] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [28] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [29] Lin Song, Yanwei Li, Zeming Li, Gang Yu, Hongbin Sun, Jian Sun, and Nanning Zheng. Learnable tree filter for structure-preserving feature transform. In Advances in Neural Information Processing Systems, pages 1711–1721, 2019.
  • [30] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
  • [31] Marvin Teichmann, Andre Araujo, Menglong Zhu, and Jack Sim. Detect-to-retrieve: Efficient regional aggregation for image search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5109–5118, 2019.
  • [32] Yurun Tian, Xin Yu, Bin Fan, Fuchao Wu, Huub Heijnen, and Vassileios Balntas. Sosnet: Second order similarity regularization for local descriptor learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11016–11025, 2019.
  • [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [34] Panqu Wang, Pengfei Chen, Ye Yuan, Ding Liu, Zehua Huang, Xiaodi Hou, and Garrison Cottrell. Understanding convolution for semantic segmentation. In 2018 IEEE winter conference on applications of computer vision (WACV), pages 1451–1460. IEEE, 2018.
  • [35] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [36] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3684–3692, 2018.
  • [37] Qingxiong Yang. Stereo matching using tree filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(4):834–846, 2014.
  • [38] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 325–341, 2018.
  • [39] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. Learning a discriminative feature network for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1857–1866, 2018.
  • [40] Yuhui Yuan and Jingdong Wang. Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916, 2018.
  • [41] Rui Zhang, Sheng Tang, Yongdong Zhang, Jintao Li, and Shuicheng Yan. Scale-adaptive convolutions for scene parsing. In Proceedings of the IEEE International Conference on Computer Vision, pages 2031–2039, 2017.
  • [42] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
  • [43] Hengshuang Zhao, Yi Zhang, Shu Liu, Jianping Shi, Chen Change Loy, Dahua Lin, and Jiaya Jia. Psanet: Point-wise spatial attention network for scene parsing. In Proceedings of the European Conference on Computer Vision (ECCV), pages 267–283, 2018.
  • [44] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019.