
11institutetext: School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China 22institutetext: School of Cyber Science & Engineering, Huazhong University of Science and Technology, Wuhan, China 33institutetext: Shenzhen Huazhong University of Science and Technology Research Institute, Shenzhen, China
33email: {yuanqing,lusongfeng,m201372777,d201980975}@hust.edu.cn

SIN: Superpixel Interpolation Network

Qing Yuan 11    Songfeng Lu(🖂) 2233    Yan Huang 11    Wuxin Sha 11
Abstract

Superpixels have been widely used in computer vision tasks due to their representational and computational efficiency. Meanwhile, deep learning and end-to-end frameworks have made great progress in various fields, including computer vision. However, existing superpixel algorithms cannot be integrated into subsequent tasks in an end-to-end way. Traditional algorithms and deep learning-based algorithms are the two main streams in superpixel segmentation. The former are non-differentiable, and the latter need a non-differentiable post-processing step to enforce connectivity, which constrains the integration of superpixels into downstream tasks. In this paper, we propose a deep learning-based superpixel segmentation algorithm, SIN, which can be integrated with downstream tasks in an end-to-end way. Because some downstream tasks, such as visual tracking, require real-time speed, the speed of generating superpixels is also important. To remove the post-processing step, our algorithm enforces spatial connectivity from the start. Superpixels are initialized by sampled pixels, and the remaining pixels are assigned to superpixels through multiple updating steps. Each step consists of a horizontal and a vertical interpolation, which is the key to enforcing spatial connectivity. Multi-layer outputs of a fully convolutional network are utilized to predict association scores for the interpolations. Experimental results show that our approach runs at about 80 fps and performs favorably against state-of-the-art methods. Furthermore, we design a simple but effective loss function which greatly reduces training time. The improvements on superpixel-based tasks demonstrate the effectiveness of our algorithm. We hope SIN will be integrated into downstream tasks in an end-to-end way and benefit the superpixel-based community. Code is available at: https://github.com/yuanqqq/SIN.

Keywords:
Superpixel · Spatial connectivity · Deep learning.

1 Introduction

Superpixels are small clusters of pixels that have similar intrinsic properties. Superpixels provide a perceptually meaningful representation of image data and reduce the number of image primitives for subsequent tasks. Owing to their representational and computational efficiency, superpixels are widely applied to computer vision tasks such as object detection [21, 26], saliency detection [10, 27, 30], semantic segmentation [9, 19, 7] and visual tracking [25, 28].

Typically, superpixel-based tasks first generate superpixels from input images. Features of the superpixels are then extracted and fed into subsequent steps. Since most superpixel algorithms cannot ensure spatial connectivity directly, a post-processing step is needed to enforce spatial connectivity before extracting superpixel features. Recently, deep neural networks and end-to-end frameworks have been widely adopted in computer vision owing to their effectiveness. However, existing superpixel segmentation algorithms cannot be combined with downstream tasks in an end-to-end way, which constrains the application of superpixels and the performance of superpixel-based tasks. We demonstrate the limitations of existing superpixel segmentation algorithms below.

Existing superpixel segmentation algorithms can be divided into traditional and deep learning-based branches. Traditional superpixel segmentation algorithms [17, 6, 14, 4, 1, 2] mainly rely on hand-crafted features. They are not trainable and obviously cannot be integrated into subsequent deep learning methods in an end-to-end way. Moreover, most traditional algorithms run slowly, which heavily affects the speed of downstream tasks. While a few attempts have been made [24, 11, 29], utilizing deep networks to extract superpixels remains challenging. [24, 11] use a deep network to extract pixel features, followed by a superpixel segmentation module. FCN [29] proposes a network to directly generate superpixels and enforces connectivity as a post-processing step. All these methods need a post-processing step to handle orphan pixels, and this step is non-differentiable. The post-processing step hinders existing deep learning-based algorithms from being combined with superpixel-based tasks in an end-to-end way. In fact, most traditional algorithms also need post-processing to enforce spatial connectivity.

In this paper, we aim to propose a superpixel segmentation algorithm which can be integrated into downstream tasks in an end-to-end way. The speed of generating superpixels is also very important, because some downstream tasks such as visual tracking require real-time speed. Since the post-processing step is the main obstacle for existing deep learning-based methods, we enforce spatial connectivity from the start to remove this step. Without the post-processing step, the algorithm not only becomes a fully trainable network but also runs faster. Our superpixels are initialized with sampled pixels, and the remaining pixels are assigned to superpixels through multiple similar steps. Each step consists of a horizontal and a vertical interpolation. According to the current pixel-superpixel map and association scores, the interpolations assign part of the pixels to superpixels. The pixel-superpixel map represents the mapping between pixels and superpixels, and the association scores are predicted from the multi-layer outputs of a fully convolutional network. The rule of interpolation is the key to enforcing spatial connectivity, and we prove this in Section 3.3. Furthermore, we design a simple but effective loss function that reduces training time and fully utilizes segmentation labels.

Extensive experiments have been conducted to evaluate SIN. Our method is the fastest among existing deep learning-based algorithms (running at about 80 fps), which means it satisfies the real-time requirements of downstream tasks. For superpixel segmentation, experimental results on public benchmarks such as BSDS500 [3] and NYUv2 [22] demonstrate that our method performs favorably against the state-of-the-art on a variety of metrics. For semantic segmentation and salient object detection, we replace the superpixels in the original BI [8] and SO [30] with ours. The results on the PascalVOC 2012 test set [5] and the ECSSD dataset [20] show that SIN superpixels benefit these downstream vision tasks.

In summary, the main contributions of this paper are:

  • We propose a superpixel segmentation network which can be integrated into downstream tasks in an end-to-end way and does not need post-processing to handle orphan pixels. Our algorithm enforces spatial connectivity from the start instead of using a non-differentiable post-processing step. To the best of our knowledge, we are the first to develop a deep learning-based method that can be integrated into superpixel-based tasks in an end-to-end way.

  • We analyze the runtime of deep learning-based superpixel algorithms, and our model is the fastest. When our SIN superpixels are utilized in subsequent tasks, real-time performance is not compromised. Extensive experiments show that our method performs well in superpixel segmentation, especially in generating more compact superpixels.

  • We design a simple but effective loss function that fully utilizes the segmentation labels. The loss function is computationally efficient and greatly shortens training time.

2 Related Work

2.1 Traditional Superpixel Segmentation

Traditional superpixel segmentation algorithms can be roughly categorized as graph-based and clustering-based algorithms. Graph-based algorithms treat image pixels as graph nodes and pixel affinities as graph edges. Usually, superpixel segmentation problems are solved by graph partitioning. [17] applies the Normalized Cuts algorithm to produce the superpixel map. FH [6] defines an adaptive segmentation criterion to capture global image properties. ERS [14] proposes an objective function for superpixel segmentation, which consists of an entropy rate and a balancing term.

Clustering-based algorithms utilize clustering methods such as k-means for superpixel segmentation. SEEDS [4] starts from an initial superpixel partitioning and continuously exchanges pixels on the boundaries between neighboring superpixels. SLIC [1] adopts a k-means clustering approach to generate superpixels based on 5-dimensional positional and Lab color features. Owing to its simplicity and high performance, there are many variants [12, 15, 2] of SLIC. LSC [12] projects the 5-dimensional features to a 10-dimensional space and performs weighted k-means in the projected space. Manifold-SLIC [15] maps the image to a 2-dimensional manifold feature space for superpixel clustering. SNIC [2] proposes a non-iterative scheme for superpixel segmentation. Traditional superpixel algorithms are mainly based on hand-crafted features, which often fail to preserve weak object boundaries. Most traditional algorithms run on the CPU, so it is hard for them to achieve real-time speed. Moreover, we cannot integrate traditional methods into subsequent tasks in an end-to-end way because they are non-differentiable.

2.2 Superpixel Segmentation using DNN

Recently, some researchers have focused on integrating deep networks into superpixel segmentation algorithms [24, 11, 29]. [24, 11] use a deep network to extract pixel features, which are then fed to a superpixel segmentation module. SEAL [24] develops the Pixel Affinity Net for affinity prediction and defines a new loss function which takes the segmentation error into account. These affinities are then passed to a graph-based algorithm to generate superpixels. To form an end-to-end trainable network, SSN [11] turns SLIC into a differentiable algorithm by relaxing the nearest-neighbor constraints. FCN [29] combines feature extraction and superpixel segmentation into a single step, employing a fully convolutional network to predict association scores between image pixels and regular grid cells. When utilizing superpixels generated by existing deep learning-based methods, a post-processing step is needed to handle orphan pixels. This step is not trainable and can only be computed on the CPU, so existing deep learning-based methods cannot be integrated into downstream tasks in an end-to-end way.

2.3 Spatial Connectivity

Most superpixel algorithms [6, 1, 12, 15, 24, 11, 29] do not explicitly enforce connectivity, so there may exist some "orphaned" pixels that are not connected to the rest of their assigned cluster. To correct this, SLIC [1] assigns these pixels the label of the nearest cluster. [11, 29] also apply a component-connection algorithm to merge superpixels that are smaller than a certain threshold with the surrounding ones. These algorithms enforce connectivity using a post-processing step, whereas SNIC [2] enforces connectivity explicitly from the start. SNIC uses a priority queue to choose the next pixel to be assigned, and the queue is populated with pixels that are 4- or 8-connected to a currently growing superpixel. As far as we know, there is no method which utilizes learned features and enforces connectivity explicitly.

3 Superpixel Segmentation Method

In this section, we introduce our superpixel segmentation method SIN. The framework of our proposed method is illustrated in Figure 1. We first present our idea of superpixel initialization and updating scheme. After that, we introduce our network architecture and loss function design. Finally, we will explain why our method can enforce spatial connectivity from the start.

Figure 1: Illustration of our proposed method. The SIN model takes an image as input and predicts association scores for each updating step. In the training stage, the association scores are used to compute the loss. In the testing stage, new pixel-superpixel maps are obtained from the current pixel-superpixel maps and the association scores.

3.1 Learn superpixels by Interpolation

Our superpixels are obtained by initializing a pixel-superpixel map and updating the map multiple times. Similar to the strategy commonly adopted in [4, 1, 2], we generate the initial superpixels by sampling the image $I\in\mathbb{R}^{H\times W\times 3}$ with a regular step $S$. By assigning each sampled pixel to a unique superpixel, we get the initial pixel-superpixel map $M_{0}\in\mathbb{Z}^{h_{0}\times w_{0}}$. The values of $M_{0}$ denote the IDs of the superpixels to which the sampled pixels are assigned.
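For concreteness, a minimal PyTorch sketch of this initialization might look as follows (the function name and the choice of row-major superpixel IDs are our own assumptions for illustration, not taken from the released code):

```python
import torch

def init_pixel_superpixel_map(H, W, S=16):
    # Grid size of the sampled pixels, following Eq. (1) below.
    h0 = (H + S - 1) // S
    w0 = (W + S - 1) // S
    # Each sampled pixel seeds one superpixel; IDs are assigned in
    # row-major order, so M0[i, j] = i * w0 + j.
    M0 = torch.arange(h0 * w0, dtype=torch.long).reshape(h0, w0)
    return M0
```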

Figure 2: Illustration of expanding pixel-superpixel map. Each expanding step consists of a horizontal interpolation followed by a vertical interpolation. The horizontal interpolation inserts values in each row and the vertical interpolation inserts values in each column. The inserted values are determined by association scores and neighboring superpixels.

Superpixel segmentation amounts to finding the final pixel-superpixel map $M\in\mathbb{Z}^{H\times W}$, which assigns all pixels to superpixels. The problem of finding $M$ can be seen as expanding $M_{0}$ to $M$. Inspired by image resizing, we use interpolation to expand the matrix. The rule of interpolation is carefully designed to enforce spatial connectivity from the start and to be computable on the GPU in parallel. As depicted in Figure 1, the process of expanding $M_{0}$ to $M$ can be divided into multiple similar steps, and each step consists of a horizontal interpolation and a vertical interpolation. As shown in Figure 2, when we expand the pixel-superpixel map in the horizontal/vertical dimension, we interpolate values between all neighboring elements in each row/column. Each inserted value equals one of its neighboring elements with a certain probability. These probabilities (association scores) are computed by the neural network introduced in Section 3.2.

In detail, we use $P\in\mathbb{R}^{H\times W}$ to denote the image pixels. $P(i,j)$ represents the image pixel at the intersection of the $i$-th row and $j$-th column. $M(i,j)$ is the superpixel to which $P(i,j)$ is assigned. In the initial step, we find partial connections between image pixels $P$ and superpixels: $M_{0}(i,j)$ represents the superpixel to which $P(i\ast S,j\ast S)$ is assigned, where

$h_{0}=(H+S-1)/S,\quad w_{0}=(W+S-1)/S.$ (1)

To obtain $M$, we need to expand $M_{0}$ multiple times. At the $l$-th expansion, we use $M_{l}^{h}\in\mathbb{Z}^{h_{l-1}\times w_{l}}$ and $M_{l}\in\mathbb{Z}^{h_{l}\times w_{l}}$ to denote the pixel-superpixel maps after the horizontal and vertical interpolations, where

$h_{l}=2\ast h_{l-1}-1,\quad w_{l}=2\ast w_{l-1}-1.$ (2)

Figure 2 shows part of the interpolation at the $l$-th expansion. At the $l$-th horizontal/vertical interpolation step, the inserted values are determined by association scores $A_{l}^{h}\in\mathbb{R}^{h_{l-1}\times(w_{l-1}-1)\times 2}$ / $A_{l}\in\mathbb{R}^{(h_{l-1}-1)\times w_{l}\times 2}$ and neighboring superpixels $Q_{l}^{h}\in\mathbb{Z}^{h_{l-1}\times(w_{l-1}-1)\times 2}$ / $Q_{l}\in\mathbb{Z}^{(h_{l-1}-1)\times w_{l}\times 2}$. $A_{l}^{h}(i,j,k)$ and $A_{l}(i,j,k)$ denote the probability that the value inserted at the $i$-th row and $j$-th column is the same as its $k$-th neighbor. All association scores are obtained from the multi-layer outputs of the neural network described in Section 3.2. $Q_{l}^{h}(i,j,k)$ and $Q_{l}(i,j,k)$ denote the value of the $k$-th neighbor of the element inserted at the $i$-th row and $j$-th column. Neighboring superpixels are obtained from the current pixel-superpixel map: since we interpolate new elements between existing neighboring elements in each row/column, each pair of existing neighboring elements provides the neighboring superpixel IDs of the corresponding inserted element. According to the association scores and neighboring superpixels, each inserted value can only equal one of its neighboring elements, with a certain probability.
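To make the update concrete, below is a minimal PyTorch sketch of one horizontal interpolation at test time (the vertical case is symmetric along columns); the function name and the hard argmax selection are our assumptions for illustration:

```python
import torch

def horizontal_interpolation(M, A):
    # M: (h, w) current pixel-superpixel map (long tensor of superpixel IDs).
    # A: (h, w-1, 2) association scores; A[i, j, k] scores the element
    #    inserted between columns j and j+1 matching its k-th neighbor.
    h, w = M.shape
    # Q[i, j, k]: the k-th horizontal neighbor of the inserted element,
    # i.e. the pair of existing IDs it may copy.
    Q = torch.stack([M[:, :-1], M[:, 1:]], dim=-1)   # (h, w-1, 2)
    k = A.argmax(dim=-1, keepdim=True)               # likelier neighbor
    inserted = Q.gather(-1, k).squeeze(-1)           # (h, w-1)
    # Interleave existing and inserted columns: new width is 2w - 1 (Eq. 2).
    out = M.new_empty(h, 2 * w - 1)
    out[:, 0::2] = M
    out[:, 1::2] = inserted
    return out
```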

3.2 Network Architecture and Loss Function

We use a convolutional neural network similar to [29] to extract the image feature $F_{0}\in\mathbb{R}^{h_{0}\times w_{0}\times c_{0}}$. We stack the modules $deconv\_h$ and $deconv\_v$ multiple times to extract multi-layer features $F_{l}^{h}\in\mathbb{R}^{h_{l-1}\times w_{l}\times c_{l}}$ and $F_{l}\in\mathbb{R}^{h_{l}\times w_{l}\times c_{l}}$, where $c_{l}$ denotes the number of feature channels. $deconv\_h$ and $deconv\_v$ are transposed convolutional layers with strides $(1,2)$ and $(2,1)$, respectively. In particular, $deconv\_h$ reduces the number of feature channels by half. $conv$ is a convolutional layer that transforms the multi-layer features into 2-dimensional association scores.
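A minimal sketch of one such expanding stage is given below; the strides and channel halving follow the description above, while the kernel sizes and paddings are our assumptions, chosen so that the strided dimension maps a size of $n$ to $2n-1$, matching Eq. (2):

```python
import torch.nn as nn

class ExpandStep(nn.Module):
    # One expanding stage: deconv_h upsamples along the width with
    # stride (1, 2) and halves the channels; deconv_v upsamples along
    # the height with stride (2, 1). With kernel 3 and padding 1, a
    # transposed convolution maps size n to 2n - 1 along the strided
    # dimension. A 1x1 conv head produces 2-channel association scores.
    def __init__(self, c_in):
        super().__init__()
        c_out = c_in // 2
        self.deconv_h = nn.ConvTranspose2d(c_in, c_out, kernel_size=3,
                                           stride=(1, 2), padding=1)
        self.deconv_v = nn.ConvTranspose2d(c_out, c_out, kernel_size=3,
                                           stride=(2, 1), padding=1)
        self.conv_h = nn.Conv2d(c_out, 2, kernel_size=1)
        self.conv_v = nn.Conv2d(c_out, 2, kernel_size=1)

    def forward(self, f):
        f_h = self.deconv_h(f)    # width:  w -> 2w - 1
        f_v = self.deconv_v(f_h)  # height: h -> 2h - 1
        # Only the scores at inserted (odd-indexed) columns/rows are
        # used by the horizontal/vertical interpolations.
        s_h = self.conv_h(f_h)[:, :, :, 1::2]
        s_v = self.conv_v(f_v)[:, :, 1::2, :]
        return f_v, s_h, s_v
```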

Our model is trained with ground-truth segmentation labels $T\in\mathbb{Z}^{H\times W}$ from BSDS500. Every interpolation finds partial connections between pixels and superpixels, so to obtain the loss over all connections, we compute a partial loss at every interpolation. We define $s_{l}=S/2^{l}$ to simplify the descriptions. The values inserted at the $l$-th step in the horizontal/vertical dimension are the IDs of the superpixels to which the pixels $U_{l}^{h}$ and $U_{l}$ are assigned. $U_{l}^{h}$ denotes the set difference between the pixels sampled with stride $(s_{l-1},s_{l})$ and those sampled with stride $(s_{l-1},s_{l-1})$. $U_{l}$ denotes the set difference between the pixels sampled with stride $(s_{l},s_{l})$ and those sampled with stride $(s_{l-1},s_{l})$. The partial ground-truth connections $T_{l}^{h}$ and $T_{l}$ are the segmentation labels of the pixels $U_{l}^{h}$ and $U_{l}$. To speed up the training process, we do not generate pixel-superpixel maps to compute the loss. Instead, we utilize the association scores to compute the loss directly. The association scores $A_{l}^{h}$ and $A_{l}$ denote the probabilities of pixels being assigned to neighboring superpixels. Inspired by classification tasks, the ground-truth labels $G_{l}^{h}$ and $G_{l}$ are defined as the indexes of the neighboring superpixels to which the pixels should be assigned; they can be inferred from $T_{l}^{h}$ and $T_{l}$. Since each inserted element has two neighbors, the ground-truth labels are 0 or 1. If the two neighboring superpixel IDs of an inserted element are the same, we ignore it when computing the loss. We define $\mathbb{I}_{l}^{h}$ and $\mathbb{I}_{l}$ to indicate whether an element is considered when computing the loss. The loss of each interpolation at the $l$-th step can be computed by:

$L_{l}^{h}=\mathcal{C}_{\mathbb{I}_{l}^{h}}(G_{l}^{h},A_{l}^{h}),\quad L_{l}^{v}=\mathcal{C}_{\mathbb{I}_{l}}(G_{l},A_{l})$ (3)

where $L_{l}^{h}$ and $L_{l}^{v}$ denote the horizontal and vertical losses at the $l$-th step. $\mathcal{C}_{\mathbb{I}_{l}^{h}}$ and $\mathcal{C}_{\mathbb{I}_{l}}$ denote cross-entropy loss functions which only consider the elements selected by $\mathbb{I}_{l}^{h}$ and $\mathbb{I}_{l}$.

The total loss $\mathcal{L}$ can be computed by:

$\mathcal{L}=-\sum_{l}\left(w_{l}^{h}L_{l}^{h}+w_{l}^{v}L_{l}^{v}\right)$ (4)

where $w_{l}^{h}$ and $w_{l}^{v}$ denote the weights of the horizontal and vertical interpolation losses at the $l$-th step.
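As an illustration, the masked cross-entropy of Eq. (3) and the weighted sum of Eq. (4) could be realized along the following lines (a sketch under our own naming; we fold the sign convention of Eq. (4) into the standard cross-entropy, which is already a negative log-likelihood):

```python
import torch
import torch.nn.functional as F

def interpolation_loss(A, G, valid):
    # A: (N, 2) association scores for the inserted elements of one
    # interpolation; G: (N,) ground-truth neighbor index (0 or 1);
    # valid: (N,) bool mask, False where both neighboring superpixel
    # IDs carry the same segmentation label (element ignored).
    if valid.sum() == 0:
        return A.new_zeros(())
    return F.cross_entropy(A[valid], G[valid])

def total_loss(scores, labels, masks, w_h, w_v):
    # scores/labels/masks: per-step lists of (horizontal, vertical)
    # pairs; w_h/w_v: per-step weights, e.g. [20, 10, 5, 2.5] and
    # [8, 4, 2, 1] as reported in Section 4.1.
    loss = 0.0
    for l, (s, g, m) in enumerate(zip(scores, labels, masks)):
        loss = loss + w_h[l] * interpolation_loss(s[0], g[0], m[0])
        loss = loss + w_v[l] * interpolation_loss(s[1], g[1], m[1])
    return loss
```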

3.3 Illustration of Spatial Connectivity

Thanks to the removal of the post-processing step, our method can be integrated into subsequent tasks in an end-to-end way. The key to enforcing spatial connectivity from the start is the rule of interpolation. An expanding step consists of a horizontal interpolation and a vertical interpolation, and this design ensures that the spatial connectivity of the pixel-superpixel maps is not destroyed by the interpolations. Since the initial pixel-superpixel map has spatial connectivity and the interpolations preserve this property, the final pixel-superpixel map $M$ retains spatial connectivity. $M$ assigns all pixels to superpixels, so $M$ having spatial connectivity is equivalent to our SIN superpixels having spatial connectivity. In the following, we first explain why the spatial connectivity of $M$ and that of the superpixels are equivalent. Afterwards, we illustrate how the interpolations preserve the spatial connectivity of the pixel-superpixel maps.

That a superpixel has spatial connectivity means the set of all pixels in the superpixel is a connected set. We use $X_{i}$ to denote the set of elements with value $i$ in $M$, and $X=\{X_{1},X_{2},\dots,X_{n}\}$ to denote all such sets. If all elements of $X$ are connected sets, $M$ has spatial connectivity. The spatial layout of the elements in $X_{i}$ equals the spatial layout of the pixels assigned to superpixel $i$, so $X_{i}$ being a connected set means that superpixel $i$ has spatial connectivity. Evidently, $M$ having spatial connectivity is equivalent to all superpixels having spatial connectivity. Every set in $M_{0}$ has only one element, so $M_{0}$ certainly has spatial connectivity. If the interpolations preserve spatial connectivity, we can infer that $M$ has spatial connectivity.

Our interpolation scheme inserts elements between existing neighboring elements in each row/column. When we insert an element between a pair of neighbors, only the sets including these three elements need to be considered. If the existing neighboring elements are in the same set, the inserted element is added to that set, and the set remains connected. If the existing neighboring elements belong to different sets, the inserted element is added to one of them, and the other does not change; the enlarged set is still connected, and the spatial connectivity of the other is unaffected. We emphasize that it is this design of the interpolation that preserves spatial connectivity: if we instead interpolated once per expanding step and allowed the inserted value to match any element of its 8-neighborhood, the spatial connectivity of the pixel-superpixel map could be destroyed. Above all, our method enforces spatial connectivity explicitly through the careful design of the interpolation.
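This invariant is easy to check empirically. The sketch below (our own toy test, reusing the horizontal_interpolation sketch from Section 3.1 and scipy's 4-connected component labeling) verifies that an expanded map stays connected for arbitrary association scores:

```python
import torch
from scipy import ndimage

def is_spatially_connected(M):
    # M: (h, w) long tensor of superpixel IDs. True iff every superpixel
    # forms a single 4-connected component (no orphan pixels).
    for sid in torch.unique(M):
        _, n_components = ndimage.label((M == sid).numpy())
        if n_components != 1:
            return False
    return True

# Toy check: expand a 2x2 initial map once with random association
# scores; connectivity holds no matter what the scores are.
M0 = torch.arange(4).reshape(2, 2)
M_h = horizontal_interpolation(M0, torch.rand(2, 1, 2))          # (2, 3)
M1 = horizontal_interpolation(M_h.t(), torch.rand(3, 1, 2)).t()  # (3, 3)
assert is_spatially_connected(M1)
```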

4 Experiments

Since SIN is meant to be integrated into subsequent tasks in an end-to-end way without impeding their real-time performance, we analyze the runtime of deep learning-based models. To demonstrate the effectiveness of SIN in superpixel segmentation, we train and test our model on the standard benchmark BSDS500 [3]. We also report its performance without fine-tuning on the benchmark NYUv2 [22] to evaluate the generalizability of our model. We use the protocols and code provided by [23] to evaluate all methods on the two benchmarks. SNIC [2], SEAL [24], SSN [11] and FCN [29] are tested with the original implementations from the authors. SLIC [1] and ERS [14] are tested with the code provided in [23]. For SLIC and ERS, we use the best parameters reported in [23]; for the rest, we use the default parameters recommended in the original papers. Figure 3 shows the visual results of some state-of-the-art methods and ours.

Input GT SNIC SEAL SSN FCN Ours
Figure 3: Visual results. Compared to SEAL, SSN and FCN, our method is competitive or better in terms of object boundary adherence while generating more compact superpixels. Top rows: BSDS500. Bottom rows: NYUv2.

4.1 Comparison with the state-of-the-arts

Implementation details.

We implement our model in PyTorch and use Adam with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ to optimize it. For training, we randomly crop the images to size $225\times 225$ as input and perform horizontal/vertical flipping for data augmentation. The initial learning rate is set to $5\times 10^{-5}$ and is halved after 200k iterations. It takes about 3 hours to train the model for 300k iterations on a single NVIDIA RTX 2080Ti GPU.
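In code, this optimizer setup corresponds to something like the following (a sketch with a placeholder backbone; only the hyperparameters come from the paper):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # placeholder for the SIN backbone
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.999))
# Halve the learning rate after 200k iterations (scheduler stepped per iteration).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)
```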

We set the regular step $S$ to 16, so we obtain $15\times 15$ (225) superpixels through 4 expanding steps during training. We set $w^{h}$ and $w^{v}$ to $[20,10,5,2.5]$ and $[8,4,2,1]$, respectively. To generate a varying number of superpixels at test time, we simply resize the input image to the appropriate size. For example, to generate $30\times 20$ superpixels, we resize the image to $(30\ast 16-15)\times(20\ast 16-15)$, i.e., $465\times 305$.
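The size formula follows from Eq. (2): each of the 4 expanding steps maps a grid of size $n$ to $2n-1$, so four steps map $n$ to $16n-15$. A tiny helper (our own, for illustration) makes this explicit:

```python
def input_size_for(n_rows, n_cols, steps=4):
    # Invert the expansion: a grid of size n becomes 2n - 1 per step,
    # so after `steps` steps it becomes 2^steps * n - (2^steps - 1),
    # i.e. n * S - (S - 1) with S = 2^steps = 16.
    side = lambda n: (2 ** steps) * n - (2 ** steps - 1)
    return side(n_rows), side(n_cols)

assert input_size_for(30, 20) == (465, 305)
```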

Figure 4: Runtime analysis. Average runtime of different deep learning-based methods w.r.t. the number of superpixels. Note that the $y$-axis is plotted in logarithmic scale.

Runtime Analysis.

We compare the runtime of the deep learning-based methods. Figure 4 reports the average runtime w.r.t. the number of generated superpixels on an NVIDIA RTX 2080Ti GPU. Our method runs about 1.5 to 2 times faster than FCN, 12 to 33 times faster than SSN, and more than 70 times faster than SEAL. Note that existing deep learning-based methods need a post-processing step which takes 2.5 ms to 8 ms [18], and the runtimes in Figure 4 do not include this time. Our method is the fastest because we use a novel interpolation scheme to generate superpixels. Moreover, our method saves plenty of training time compared to FCN thanks to the simple and effective loss function: we spend about 3 hours training on a single GPU, while FCN spends about 20 hours.

Evaluation metrics.

To demonstrate the effectiveness of SIN, we use achievable segmentation accuracy (ASA), boundary recall and precision (BR-BP), and compactness (CO) to evaluate the superpixels. ASA evaluates superpixels by measuring the total effective segmentation area of a superpixel representation with respect to the ground-truth segmentation map. BR and BP measure the boundary adherence of superpixels given the ground-truth boundaries, whereas CO assesses the compactness of superpixels. The higher these scores are, the better the superpixel segmentation result is. As in [23], for BR and BP evaluation, the boundary tolerance is 0.0025 times the image diagonal rounded to the closest integer.

Results on BSDS500.

BSDS500 contains 200 training, 100 validation, and 200 test images. Each image in this dataset is provided with multiple ground truth annotations. For training, we follow [11, 24, 29] and treat each annotation as an individual sample. With this dataset, we have 1633 training/validation samples and 1063 testing samples. We train our model using both the training and validation samples.

Figure 5 reports the performance of all methods on the BSDS500 test set. Our method outperforms all traditional methods on all evaluation metrics, except SNIC in terms of BR-BP. Compared to the other deep learning-based methods, our method achieves competitive results in terms of ASA and BR-BP, and significantly higher scores in terms of CO. With high CO, our method better captures spatially coherent information and avoids paying too much attention to image details and noise. As shown in Figure 3, our method generates smoother superpixels when handling fuzzy boundaries.

Figure 5: Results on BSDS500. From left to right: ASA, BR-BP, and CO.

Results on NYUv2.

NYUv2 is an RGB-D dataset containing 1449 images with object instance labels, originally proposed for indoor scene understanding tasks. [23] removes the unlabelled regions near the image boundary and develops a benchmark on a subset of 400 test images of size $608\times 448$ for superpixel evaluation. We directly apply the models of SEAL, SSN, FCN, and our method trained on BSDS500 to this dataset without any fine-tuning.

Figure 6: Results on NYUv2. From left to right: ASA, BR-BP, and CO.

Figure 6 shows the performance of all methods on NYUv2. In general, the deep learning-based algorithms achieve competitive or better performance against the traditional algorithms, which demonstrates that they can extract high-quality superpixels on other datasets. Also, our method outperforms all other methods in terms of CO. As the visual results in Figure 3 show, our method handles fuzzy boundaries better than the other deep learning-based methods.

Illustration of high CO score.

The experimental results on BSDS500 and NYUv2 show that our method has slightly lower ASA and BR-BP scores but a higher CO score. We explain the reason in the following.

Figure 7: Illustration of the high CO score. According to the rule of interpolation, the upper map is a possible new pixel-superpixel map, while the lower one is impossible.

To enforce spatial connectivity from the start, we expand the pixel-superpixel map in the horizontal and vertical dimensions. The horizontal/vertical interpolation constrains each inserted value to equal one of its horizontal/vertical neighbors. As shown in Figure 7, the black-circled value can be 1 or 2 but cannot be 16 or 17. However, if the ground truth of this value is 16, our method cannot interpolate the correct value. That is why our ASA and BR-BP scores are lower than those of other deep learning-based methods. Meanwhile, the constraint means that the pixels in a superpixel are 4-connected, which is more compact than 8-connectivity. Owing to the high CO score, our method generates smoother superpixels on fuzzy boundaries, as shown in Figure 3. The importance of compactness has been demonstrated in [29]. To extract more useful features in downstream tasks, it is important that our superpixel method captures spatial coherence in local regions. In our view, it is worthwhile to enforce spatial connectivity from the start and obtain a higher CO score while sacrificing slightly on ASA and BR-BP.

4.2 Ablation study

We present an ablation study where we evaluate different design choices for the image feature extraction and the loss summation. Unlike [29], we do not take image features from previous layers into account when predicting association scores. Our total loss is the sum of the horizontal and vertical losses at each step, so we can compute either an average or a weighted sum. In our final model, we choose the weighted sum to compute the total loss. For comparison, we include a baseline model which uses both the previous and current features (concatenated) to predict scores and simply averages the loss values. Figure 8 shows that each of the two alternatives in our model performs better.

Figure 8: Ablation study. We show the effectiveness of each design choice in the SIN model in improving accuracy.

5 Application

In this section, we evaluate whether our SIN superpixels can improve the performance of downstream vision tasks which utilize superpixels. For this study, we choose existing semantic segmentation and salient object detection algorithms and substitute their original superpixels with ours. For the following two tasks, our superpixels are generated by the network fine-tuned on the PascalVOC 2012 training and validation sets.

Semantic segmentation.

For semantic segmentation, CNN models [13, 16] achieve state-of-the-art performance. However, most CNN architectures generate lower-resolution outputs and then upsample them using post-processing techniques. To alleviate the need for post-processing CRF techniques, [8] proposes the Bilateral Inception (BI) networks, which utilize SLIC superpixels for long-range and edge-aware propagation across CNN units. We use SNIC and our superpixels to substitute the SLIC superpixels and set the number of superpixels to 600. We evaluate the generated semantic segmentation on the PascalVOC 2012 test set [5]. Table 1 shows the standard Intersection over Union (IoU) scores. The results indicate that we obtain significant IoU improvements when using SIN superpixels.

Salient object detection.

Superpixels are widely used in salient object detection algorithms. We experiment with Saliency Optimization (SO) [30] and report standard Mean Absolute Error (MAE) scores on the ECSSD dataset [20]. To demonstrate the potential of our SIN superpixels, we replace the SLIC superpixels used in SO with ours, SNIC, and ERS superpixels, and set the number of superpixels to 200 and 400. The experimental results in Table 2 show that using our superpixels consistently improves the performance of SO at both 200 and 400 superpixels.

The above results on semantic segmentation and salient object detection demonstrate the effectiveness of integrating our superpixels into downstream vision tasks.

Table 1: Superpixels for semantic segmentation. We compute semantic segmentation using the BI network with different types of superpixels and compare the IoU scores on the PascalVOC 2012 test set.
Method | DeepLab [13] | + CRF [13] | + BI (SLIC) [8] | + BI (ERS) | + BI (Ours)
IoU    | 68.9         | 72.7       | 73.5            | 74.0       | 74.4
Table 2: Superpixels for salient object detection. We run the SO algorithm with different types of superpixels and report MAE scores on the ECSSD dataset (lower is better).
# of superpixels | SLIC   | SNIC   | ERS    | Ours
200              | 0.1719 | 0.1714 | 0.1686 | 0.1657
400              | 0.1675 | 0.1654 | 0.1630 | 0.1616

6 Conclusion

In this paper, we present a superpixel segmentation network, SIN, which can be integrated into downstream tasks in an end-to-end way. To extract superpixels, we initialize a pixel-superpixel map and expand it multiple times. By dividing an expanding step into a horizontal and a vertical interpolation, we enforce spatial connectivity explicitly. We utilize the multi-layer outputs of a fully convolutional network to predict association scores for the interpolations. To speed up the training process, association scores rather than pixel-superpixel maps are used to compute the loss. Because our interpolation constrains the number of neighbors of inserted elements, SIN is the fastest among existing deep learning-based methods. The high speed of our method ensures it can be integrated into downstream tasks requiring real-time performance. Our model performs favorably against several existing state-of-the-art superpixel algorithms. SIN generates more compact superpixels thanks to the design of the interpolation, which is important for downstream tasks. Moreover, visual results illustrate that our method performs better when handling fuzzy boundaries. Furthermore, we apply our superpixels to downstream tasks and improve their performance. We will integrate SIN into downstream tasks in an end-to-end way in the future, and we hope SIN can benefit superpixel-based computer vision tasks.

Acknowledgements. This work is supported by the Hubei Provincial Science and Technology Major Project of China under Grant No. 2020AEA011, the Key Research & Development Plan of Hubei Province of China under Grant No. 2020BAB100, the project of Science, Technology and Innovation Commission of Shenzhen Municipality of China under Grant No. JCYJ20210324120002006, and the Fundamental Research Funds for the Central Universities, HUST: 2020JYCXJJ067.

References

  • [1] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(11), 2274–2282 (2012)
  • [2] Achanta, R., Susstrunk, S.: Superpixels and polygons using simple non-iterative clustering. In: CVPR. pp. 4651–4660 (July 2017)
  • [3] Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(5), 898–916 (2010)
  • [4] Van den Bergh, M., Boix, X., Roig, G., de Capitani, B., Van Gool, L.: SEEDS: Superpixels extracted via energy-driven sampling. In: ECCV. pp. 13–26. Springer (2012)
  • [5] Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision 111(1), 98–136 (2015)
  • [6] Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. International Journal of Computer Vision 59(2), 167–181 (2004)
  • [7] Gadde, R., Jampani, V., Kiefel, M., Kappler, D., Gehler, P.V.: Superpixel convolutional networks using bilateral inceptions. In: ECCV. pp. 597–613. Springer (2016)
  • [8] Gadde, R., Jampani, V., Kiefel, M., Kappler, D., Gehler, P.V.: Superpixel convolutional networks using bilateral inceptions. In: ECCV. pp. 597–613. Springer (2016)
  • [9] Gould, S., Rodgers, J., Cohen, D., Elidan, G., Koller, D.: Multi-class segmentation with relative location prior. International Journal of Computer Vision 80(3), 300–316 (2008)
  • [10] He, S., Lau, R.W., Liu, W., Huang, Z., Yang, Q.: SuperCNN: A superpixelwise convolutional neural network for salient object detection. International Journal of Computer Vision 115(3), 330–344 (2015)
  • [11] Jampani, V., Sun, D., Liu, M.Y., Yang, M.H., Kautz, J.: Superpixel sampling networks. In: ECCV. pp. 352–368 (September 2018)
  • [12] Li, Z., Chen, J.: Superpixel segmentation using linear spectral clustering. In: CVPR. pp. 1356–1363 (2015)
  • [13] Liang-Chieh, C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.: Semantic image segmentation with deep convolutional nets and fully connected crfs. In: ICLR (2015)
  • [14] Liu, M.Y., Tuzel, O., Ramalingam, S., Chellappa, R.: Entropy rate superpixel segmentation. In: CVPR. pp. 2097–2104. IEEE (2011)
  • [15] Liu, Y.J., Yu, C.C., Yu, M.J., He, Y.: Manifold SLIC: A fast method to compute content-sensitive superpixels. In: CVPR. pp. 651–659 (2016)
  • [16] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. pp. 3431–3440 (2015)
  • [17] Ren, X., Malik, J.: Learning a classification model for segmentation. In: ICCV. pp. 10–17, vol. 1 (2003). https://doi.org/10.1109/ICCV.2003.1238308
  • [18] Ren, C.Y., Reid, I.: gSLIC: a real-time implementation of SLIC superpixel segmentation. University of Oxford, Department of Engineering, Technical Report, pp. 1–6 (2011)
  • [19] Sharma, A., Tuzel, O., Liu, M.Y.: Recursive context propagation network for semantic scene labeling. In: NeurIPS. pp. 2447–2455 (2014)
  • [20] Shi, J., Yan, Q., Xu, L., Jia, J.: Hierarchical image saliency detection on extended CSSD. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(4), 717–729 (2015)
  • [21] Shu, G., Dehghan, A., Shah, M.: Improving an object detector and extracting regions using superpixels. In: CVPR. pp. 3721–3727 (2013)
  • [22] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: ECCV. pp. 746–760. Springer (2012)
  • [23] Stutz, D., Hermans, A., Leibe, B.: Superpixels: An evaluation of the state-of-the-art. Computer Vision and Image Understanding 166, 1–27 (2018)
  • [24] Tu, W.C., Liu, M.Y., Jampani, V., Sun, D., Chien, S.Y., Yang, M.H., Kautz, J.: Learning superpixels with segmentation-aware affinity loss. In: CVPR. pp. 568–576 (2018)
  • [25] Wang, S., Lu, H., Yang, F., Yang, M.H.: Superpixel tracking. In: ICCV. pp. 1323–1330. IEEE (2011)
  • [26] Yan, J., Yu, Y., Zhu, X., Lei, Z., Li, S.Z.: Object detection by labeling superpixels. In: CVPR. pp. 5107–5116 (2015)
  • [27] Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: CVPR. pp. 3166–3173 (2013)
  • [28] Yang, F., Lu, H., Yang, M.H.: Robust superpixel tracking. IEEE Transactions on Image Processing 23(4), 1639–1651 (2014)
  • [29] Yang, F., Sun, Q., Jin, H., Zhou, Z.: Superpixel segmentation with fully convolutional networks. In: CVPR. pp. 13964–13973 (2020)
  • [30] Zhu, W., Liang, S., Wei, Y., Sun, J.: Saliency optimization from robust background detection. In: CVPR. pp. 2814–2821 (2014)