
11institutetext: School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China 22institutetext: School of Cyber Science & Engineering, Huazhong University of Science and Technology, Wuhan, China 33institutetext: Shenzhen Huazhong University of Science and Technology Research Institute, Shenzhen, China
33email: {yuanqing,lusongfeng,m201372777,d201980975}@hust.edu.cn

SIN: Superpixel Interpolation Network

Qing Yuan 11    Songfeng Lu(🖂) 2233    Yan Huang 11    Wuxin Sha 11
Abstract

Superpixels have been widely used in computer vision tasks due to their representational and computational efficiency. Meanwhile, deep learning and end-to-end frameworks have made great progress in various fields, including computer vision. However, existing superpixel algorithms cannot be integrated into subsequent tasks in an end-to-end way. Traditional algorithms and deep learning-based algorithms are the two main streams in superpixel segmentation. The former are non-differentiable, and the latter need a non-differentiable post-processing step to enforce connectivity, which constrains the integration of superpixels into downstream tasks. In this paper, we propose a deep learning-based superpixel segmentation algorithm, SIN, which can be integrated with downstream tasks in an end-to-end way. Because some downstream tasks, such as visual tracking, require real-time speed, the speed of generating superpixels is also important. To remove the post-processing step, our algorithm enforces spatial connectivity from the start. Superpixels are initialized by sampled pixels, and the remaining pixels are assigned to superpixels through multiple updating steps. Each step consists of a horizontal and a vertical interpolation, which is the key to enforcing spatial connectivity. Multi-layer outputs of a fully convolutional network are utilized to predict association scores for the interpolations. Experimental results show that our approach runs at about 80 fps and performs favorably against state-of-the-art methods. Furthermore, we design a simple but effective loss function which greatly reduces training time. The improvements on superpixel-based tasks demonstrate the effectiveness of our algorithm. We hope SIN will be integrated into downstream tasks in an end-to-end way and benefit the superpixel-based community. Code is available at: https://github.com/yuanqqq/SIN.

Keywords:
Superpixel · Spatial connectivity · Deep learning.

1 Introduction

Superpixels are small clusters of pixels that have similar intrinsic properties. Superpixels provide a perceptually meaningful representation of image data and reduce the number of image primitives for subsequent tasks. Owing to their representational and computational efficiency, superpixels are widely applied to computer vision tasks such as object detection [21, 26], saliency detection [10, 27, 30], semantic segmentation [9, 19, 7] and visual tracking [25, 28].

Typically, superpixel-based tasks first generate superpixels from input images. Features of the superpixels are then extracted and fed into subsequent steps. Since most superpixel algorithms cannot ensure spatial connectivity directly, a post-processing step is needed to enforce spatial connectivity before extracting superpixel features. Recently, deep neural networks and end-to-end frameworks have been widely adopted in computer vision owing to their effectiveness. However, existing superpixel segmentation algorithms cannot be combined with downstream tasks in an end-to-end way, which constrains the application of superpixels and the performance of superpixel-based tasks. We demonstrate the limitations of existing superpixel segmentation algorithms below.

Existing superpixel segmentation algorithms can be divided into traditional and deep learning-based branches. Traditional superpixel segmentation algorithms [17, 6, 14, 4, 1, 2] mainly rely on hand-crafted features. They are not trainable and obviously cannot be integrated into subsequent deep learning methods in an end-to-end way. Moreover, most traditional algorithms run slowly, which heavily affects the speed of downstream tasks. While a few attempts have been made [24, 11, 29], utilizing deep networks to extract superpixels remains challenging. [24, 11] use a deep network to extract pixel features, followed by a superpixel segmentation module. FCN [29] proposes a network to directly generate superpixels and enforces connectivity as a post-processing step. All these methods need a post-processing step to handle orphan pixels, and this step is non-differentiable. The post-processing step hinders existing deep learning-based algorithms from being combined with superpixel-based tasks in an end-to-end way. In fact, most traditional algorithms also need post-processing to enforce spatial connectivity.

In this paper, we aim to propose a superpixel segmentation algorithm which can be integrated into downstream tasks in an end-to-end way. The speed of generating superpixels is also very important, because some downstream tasks such as visual tracking require real-time speed. Since the post-processing step is the main obstacle for existing deep learning-based methods, we enforce spatial connectivity from the start to remove this step. Without the post-processing step, the algorithm not only becomes a fully trainable network but also runs faster. Our superpixels are initialized with sampled pixels, and the remaining pixels are assigned to superpixels through multiple similar steps. Each step consists of a horizontal and a vertical interpolation. According to the current pixel-superpixel map and association scores, the interpolations assign part of the pixels to superpixels. The pixel-superpixel map represents the mapping between pixels and superpixels, and the association scores are predicted from the multi-layer outputs of a fully convolutional network. The rule of interpolation is the key to enforcing spatial connectivity, and we prove this in Section 3.3. Furthermore, we design a simple but effective loss function that reduces training time and fully utilizes segmentation labels.

Extensive experiments have been conducted to evaluate SIN. Our method is the fastest among existing deep learning-based algorithms (running at about 80 fps), which means it satisfies the real-time requirements of downstream tasks. For superpixel segmentation, experimental results on public benchmarks such as BSDS500 [3] and NYUv2 [22] demonstrate that our method performs favorably against the state-of-the-art on a variety of metrics. For semantic segmentation and salient object detection, we replace the superpixels in the original BI [8] and SO [30] with ours. The results on the PascalVOC 2012 test set [5] and the ECSSD dataset [20] show that SIN superpixels benefit these downstream vision tasks.

In summary, the main contributions of this paper are:

  • We propose a superpixel segmentation network which can be integrated into downstream tasks in an end-to-end way and does not need post-processing to handle orphan pixels. Our algorithm enforces spatial connectivity from the start instead of using a non-differentiable post-processing step. To the best of our knowledge, we are the first to develop a deep learning-based method that can be integrated into superpixel-based tasks in an end-to-end way.

  • We analyze the runtime of deep learning-based superpixel algorithms, and our model is the fastest. When our SIN superpixels are utilized in subsequent tasks, real-time performance is not compromised. Extensive experiments show that our method performs well in superpixel segmentation, especially in generating more compact superpixels.

  • We design a simple but effective loss function that fully utilizes the segmentation labels. The loss function is computationally efficient and greatly shortens training time.

2 Related Work

2.1 Traditional Superpixel Segmentation

Traditional superpixel segmentation algorithms can be roughly categorized as graph-based and clustering-based algorithms. Graph-based algorithms treat image pixels as graph nodes and pixel affinities as graph edges. Usually, superpixel segmentation problems are solved by graph partitioning. [17] applies the Normalized Cuts algorithm to produce the superpixel map. FH [6] defines an adaptive segmentation criterion to capture global image properties. ERS [14] proposes an objective function for superpixel segmentation, which consists of an entropy rate and a balancing term.

Clustering-based algorithms utilize clustering methods such as k-means for superpixel segmentation. SEEDS [4] starts from an initial superpixel partitioning and continuously exchanges pixels on the boundaries between neighboring superpixels. SLIC [1] adopts a k-means clustering approach to generate superpixels based on 5-dimensional positional and Lab color features. Owing to its simplicity and high performance, there are many variants [12, 15, 2] of SLIC. LSC [12] projects the 5-dimensional features to a 10-dimensional space and performs weighted k-means in the projected space. Manifold-SLIC [15] maps the image to a 2-dimensional manifold feature space for superpixel clustering. SNIC [2] proposes a non-iterative scheme for superpixel segmentation. Traditional superpixel algorithms are mainly based on hand-crafted features, which often fail to preserve weak object boundaries. Most traditional algorithms run on the CPU, so it is hard for them to achieve real-time speed. Moreover, we cannot integrate traditional methods into subsequent tasks in an end-to-end way because they are non-differentiable.

2.2 Superpixel Segmentation using DNN

Recently, some researchers have focused on integrating deep networks into superpixel segmentation algorithms [24, 11, 29]. [24, 11] use a deep network to extract pixel features, which are then fed to a superpixel segmentation module. SEAL [24] develops the Pixel Affinity Net for affinity prediction and defines a new loss function which takes the segmentation error into account. These affinities are then passed to a graph-based algorithm to generate superpixels. To form an end-to-end trainable network, SSN [11] turns SLIC into a differentiable algorithm by relaxing the nearest-neighbor constraints. FCN [29] combines feature extraction and superpixel segmentation into a single step, employing a fully convolutional network to predict association scores between image pixels and regular grid cells. When utilizing superpixels generated by existing deep learning-based methods, a post-processing step is needed to handle orphan pixels. This step is not trainable and can only be computed on the CPU, so existing deep learning-based methods cannot be integrated into downstream tasks in an end-to-end way.

2.3 Spatial Connectivity

Most superpixel algorithms [6, 1, 12, 15, 24, 11, 29] do not explicitly enforce connectivity, so there may exist some "orphaned" pixels that are not connected to the rest of their assigned cluster. To correct this, SLIC [1] assigns these pixels the label of the nearest cluster. [11, 29] also apply a component-connection algorithm to merge superpixels that are smaller than a certain threshold with the surrounding ones. These algorithms enforce connectivity using a post-processing step, whereas SNIC [2] enforces connectivity explicitly from the start. SNIC uses a priority queue to choose the next pixel to be assigned, and the queue is populated with pixels that are 4- or 8-connected to a currently growing superpixel. As far as we know, there is no method which utilizes learned features and enforces connectivity explicitly.

3 Superpixel Segmentation Method

In this section, we introduce our superpixel segmentation method SIN. The framework of our proposed method is illustrated in Figure 1. We first present our idea of superpixel initialization and updating scheme. After that, we introduce our network architecture and loss function design. Finally, we will explain why our method can enforce spatial connectivity from the start.

Figure 1: Illustration of our proposed method. The SIN model takes an image as input and predicts association scores for each updating step. In the training stage, the association scores are used to compute the loss. In the testing stage, new pixel-superpixel maps are obtained from the current pixel-superpixel maps and the association scores.

3.1 Learn superpixels by Interpolation

Our superpixels are obtained by initializing a pixel-superpixel map and updating the map multiple times. Similar to the strategy commonly adopted in [4, 1, 2], we generate the initial superpixels by sampling the image $I\in\mathbb{R}^{H\times W\times 3}$ with a regular step $S$. By assigning each sampled pixel to a unique superpixel, we get the initial pixel-superpixel map $M_{0}\in\mathbb{Z}^{h_{0}\times w_{0}}$. The values of $M_{0}$ denote the IDs of the superpixels to which the sampled pixels are assigned.
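For concreteness, a minimal PyTorch sketch of this initialization might look as follows (the function name and the choice of row-major superpixel IDs are our own assumptions for illustration, not taken from the released code):

```python
import torch

def init_pixel_superpixel_map(H, W, S=16):
    # Grid size of the sampled pixels, following Eq. (1) below.
    h0 = (H + S - 1) // S
    w0 = (W + S - 1) // S
    # Each sampled pixel seeds one superpixel; IDs are assigned in
    # row-major order, so M0[i, j] = i * w0 + j.
    M0 = torch.arange(h0 * w0, dtype=torch.long).reshape(h0, w0)
    return M0
```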

Figure 2: Illustration of expanding pixel-superpixel map. Each expanding step consists of a horizontal interpolation followed by a vertical interpolation. The horizontal interpolation inserts values in each row and the vertical interpolation inserts values in each column. The inserted values are determined by association scores and neighboring superpixels.

Superpixel segmentation amounts to finding the final pixel-superpixel map $M\in\mathbb{Z}^{H\times W}$, which assigns all pixels to superpixels. The problem of finding $M$ can be seen as expanding $M_{0}$ to $M$. Inspired by image resizing, we use interpolation to expand the matrix. The rule of interpolation is carefully designed to enforce spatial connectivity from the start and to be computable on the GPU in parallel. As depicted in Figure 1, the process of expanding $M_{0}$ to $M$ can be divided into multiple similar steps, and each step consists of a horizontal interpolation and a vertical interpolation. As shown in Figure 2, when we expand the pixel-superpixel map in the horizontal/vertical dimension, we interpolate values between all neighboring elements in each row/column. Each inserted value equals one of its neighboring elements with a certain probability. These probabilities (association scores) are computed by the neural network introduced in Section 3.2.

In detail, we use $P\in\mathbb{R}^{H\times W}$ to denote the image pixels. $P(i,j)$ represents the image pixel at the intersection of the $i$-th row and $j$-th column. $M(i,j)$ is the superpixel to which $P(i,j)$ is assigned. In the initial step, we find partial connections between image pixels $P$ and superpixels: $M_{0}(i,j)$ represents the superpixel to which $P(i\ast S,j\ast S)$ is assigned, where

$h_{0}=(H+S-1)/S,\quad w_{0}=(W+S-1)/S.$ (1)

To obtain $M$, we need to expand $M_{0}$ multiple times. At the $l$-th expansion, we use $M_{l}^{h}\in\mathbb{Z}^{h_{l-1}\times w_{l}}$ and $M_{l}\in\mathbb{Z}^{h_{l}\times w_{l}}$ to denote the pixel-superpixel maps after the horizontal and vertical interpolations, where

$h_{l}=2\ast h_{l-1}-1,\quad w_{l}=2\ast w_{l-1}-1.$ (2)

Figure 2 shows part of the interpolation at the $l$-th expansion. At the $l$-th horizontal/vertical interpolation step, the inserted values are determined by association scores $A_{l}^{h}\in\mathbb{R}^{h_{l-1}\times(w_{l-1}-1)\times 2}$ / $A_{l}\in\mathbb{R}^{(h_{l-1}-1)\times w_{l}\times 2}$ and neighboring superpixels $Q_{l}^{h}\in\mathbb{Z}^{h_{l-1}\times(w_{l-1}-1)\times 2}$ / $Q_{l}\in\mathbb{Z}^{(h_{l-1}-1)\times w_{l}\times 2}$. $A_{l}^{h}(i,j,k)$ and $A_{l}(i,j,k)$ denote the probability that the value inserted at the $i$-th row and $j$-th column is the same as its $k$-th neighbor. All association scores are obtained from the multi-layer outputs of the neural network described in Section 3.2. $Q_{l}^{h}(i,j,k)$ and $Q_{l}(i,j,k)$ denote the value of the $k$-th neighbor of the element inserted at the $i$-th row and $j$-th column. Neighboring superpixels are obtained from the current pixel-superpixel map: since we interpolate new elements between existing neighboring elements in each row/column, each pair of existing neighboring elements provides the neighboring superpixel IDs of the corresponding inserted element. According to the association scores and neighboring superpixels, each inserted value can only equal one of its neighboring elements, with a certain probability.
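To make the update concrete, below is a minimal PyTorch sketch of one horizontal interpolation at test time (the vertical case is symmetric along columns); the function name and the hard argmax selection are our assumptions for illustration:

```python
import torch

def horizontal_interpolation(M, A):
    # M: (h, w) current pixel-superpixel map (long tensor of superpixel IDs).
    # A: (h, w-1, 2) association scores; A[i, j, k] scores the element
    #    inserted between columns j and j+1 matching its k-th neighbor.
    h, w = M.shape
    # Q[i, j, k]: the k-th horizontal neighbor of the inserted element,
    # i.e. the pair of existing IDs it may copy.
    Q = torch.stack([M[:, :-1], M[:, 1:]], dim=-1)   # (h, w-1, 2)
    k = A.argmax(dim=-1, keepdim=True)               # likelier neighbor
    inserted = Q.gather(-1, k).squeeze(-1)           # (h, w-1)
    # Interleave existing and inserted columns: new width is 2w - 1 (Eq. 2).
    out = M.new_empty(h, 2 * w - 1)
    out[:, 0::2] = M
    out[:, 1::2] = inserted
    return out
```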

3.2 Network Architecture and Loss Function

We use a convolutional neural network similar to [29] to extract the image feature $F_{0}\in\mathbb{R}^{h_{0}\times w_{0}\times c_{0}}$. We stack the modules $deconv\_h$ and $deconv\_v$ multiple times to extract multi-layer features $F_{l}^{h}\in\mathbb{R}^{h_{l-1}\times w_{l}\times c_{l}}$ and $F_{l}\in\mathbb{R}^{h_{l}\times w_{l}\times c_{l}}$, where $c_{l}$ denotes the number of feature channels. $deconv\_h$ and $deconv\_v$ are transposed convolutional layers with strides $(1,2)$ and $(2,1)$, respectively. In particular, $deconv\_h$ reduces the number of feature channels by half. $conv$ is a convolutional layer that transforms the multi-layer features into 2-dimensional association scores.
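A minimal sketch of one such expanding stage is given below; the strides and channel halving follow the description above, while the kernel sizes and paddings are our assumptions, chosen so that the strided dimension maps a size of $n$ to $2n-1$, matching Eq. (2):

```python
import torch.nn as nn

class ExpandStep(nn.Module):
    # One expanding stage: deconv_h upsamples along the width with
    # stride (1, 2) and halves the channels; deconv_v upsamples along
    # the height with stride (2, 1). With kernel 3 and padding 1, a
    # transposed convolution maps size n to 2n - 1 along the strided
    # dimension. A 1x1 conv head produces 2-channel association scores.
    def __init__(self, c_in):
        super().__init__()
        c_out = c_in // 2
        self.deconv_h = nn.ConvTranspose2d(c_in, c_out, kernel_size=3,
                                           stride=(1, 2), padding=1)
        self.deconv_v = nn.ConvTranspose2d(c_out, c_out, kernel_size=3,
                                           stride=(2, 1), padding=1)
        self.conv_h = nn.Conv2d(c_out, 2, kernel_size=1)
        self.conv_v = nn.Conv2d(c_out, 2, kernel_size=1)

    def forward(self, f):
        f_h = self.deconv_h(f)    # width:  w -> 2w - 1
        f_v = self.deconv_v(f_h)  # height: h -> 2h - 1
        # Only the scores at inserted (odd-indexed) columns/rows are
        # used by the horizontal/vertical interpolations.
        s_h = self.conv_h(f_h)[:, :, :, 1::2]
        s_v = self.conv_v(f_v)[:, :, 1::2, :]
        return f_v, s_h, s_v
```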

Our model is trained with ground-truth segmentation labels $T\in\mathbb{Z}^{H\times W}$ from BSDS500. Every interpolation finds partial connections between pixels and superpixels, so to obtain the loss over all connections, we compute a partial loss at every interpolation. We define $s_{l}=S/2^{l}$ to simplify the descriptions. The values inserted at the $l$-th step in the horizontal/vertical dimension are the IDs of the superpixels to which the pixels $U_{l}^{h}$ and $U_{l}$ are assigned. $U_{l}^{h}$ denotes the set difference between the pixels sampled with stride $(s_{l-1},s_{l})$ and those sampled with stride $(s_{l-1},s_{l-1})$. $U_{l}$ denotes the set difference between the pixels sampled with stride $(s_{l},s_{l})$ and those sampled with stride $(s_{l-1},s_{l})$. The partial ground-truth connections $T_{l}^{h}$ and $T_{l}$ are the segmentation labels of the pixels $U_{l}^{h}$ and $U_{l}$. To speed up the training process, we do not generate pixel-superpixel maps to compute the loss. Instead, we utilize the association scores to compute the loss directly. The association scores $A_{l}^{h}$ and $A_{l}$ denote the probabilities of pixels being assigned to neighboring superpixels. Inspired by classification tasks, the ground-truth labels $G_{l}^{h}$ and $G_{l}$ are defined as the indexes of the neighboring superpixels to which the pixels should be assigned; they can be inferred from $T_{l}^{h}$ and $T_{l}$. Since each inserted element has two neighbors, the ground-truth labels are 0 or 1. If the two neighboring superpixel IDs of an inserted element are the same, we ignore it when computing the loss. We define $\mathbb{I}_{l}^{h}$ and $\mathbb{I}_{l}$ to indicate whether an element is considered when computing the loss. The loss of each interpolation at the $l$-th step can be computed by:

$L_{l}^{h}=\mathcal{C}_{\mathbb{I}_{l}^{h}}(G_{l}^{h},A_{l}^{h}),\quad L_{l}^{v}=\mathcal{C}_{\mathbb{I}_{l}}(G_{l},A_{l})$ (3)

where $L_{l}^{h}$ and $L_{l}^{v}$ denote the horizontal and vertical losses at the $l$-th step. $\mathcal{C}_{\mathbb{I}_{l}^{h}}$ and $\mathcal{C}_{\mathbb{I}_{l}}$ denote cross-entropy loss functions which only consider the elements selected by $\mathbb{I}_{l}^{h}$ and $\mathbb{I}_{l}$.

The total loss $\mathcal{L}$ can be computed by:

$\mathcal{L}=-\sum_{l}\left(w_{l}^{h}L_{l}^{h}+w_{l}^{v}L_{l}^{v}\right)$ (4)

where $w_{l}^{h}$ and $w_{l}^{v}$ denote the weights of the horizontal and vertical interpolation losses at the $l$-th step.
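As an illustration, the masked cross-entropy of Eq. (3) and the weighted sum of Eq. (4) could be realized along the following lines (a sketch under our own naming; we fold the sign convention of Eq. (4) into the standard cross-entropy, which is already a negative log-likelihood):

```python
import torch
import torch.nn.functional as F

def interpolation_loss(A, G, valid):
    # A: (N, 2) association scores for the inserted elements of one
    # interpolation; G: (N,) ground-truth neighbor index (0 or 1);
    # valid: (N,) bool mask, False where both neighboring superpixel
    # IDs carry the same segmentation label (element ignored).
    if valid.sum() == 0:
        return A.new_zeros(())
    return F.cross_entropy(A[valid], G[valid])

def total_loss(scores, labels, masks, w_h, w_v):
    # scores/labels/masks: per-step lists of (horizontal, vertical)
    # pairs; w_h/w_v: per-step weights, e.g. [20, 10, 5, 2.5] and
    # [8, 4, 2, 1] as reported in Section 4.1.
    loss = 0.0
    for l, (s, g, m) in enumerate(zip(scores, labels, masks)):
        loss = loss + w_h[l] * interpolation_loss(s[0], g[0], m[0])
        loss = loss + w_v[l] * interpolation_loss(s[1], g[1], m[1])
    return loss
```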

3.3 Illustration of Spatial Connectivity

Thanks to the removal of the post-processing step, our method can be integrated into subsequent tasks in an end-to-end way. The key to enforcing spatial connectivity from the start is the rule of interpolation. An expanding step consists of a horizontal interpolation and a vertical interpolation, and this design ensures that the spatial connectivity of the pixel-superpixel maps is not destroyed by the interpolations. Since the initial pixel-superpixel map has spatial connectivity and the interpolations preserve this property, the final pixel-superpixel map $M$ retains spatial connectivity. $M$ assigns all pixels to superpixels, so $M$ having spatial connectivity is equivalent to our SIN superpixels having spatial connectivity. In the following, we first explain why the spatial connectivity of $M$ and that of the superpixels are equivalent. Afterwards, we illustrate how the interpolations preserve the spatial connectivity of the pixel-superpixel maps.

That a superpixel has spatial connectivity means the set of all pixels in the superpixel is a connected set. We use $X_{i}$ to denote the set of elements with value $i$ in $M$, and $X=\{X_{1},X_{2},\dots,X_{n}\}$ to denote all such sets. If all elements of $X$ are connected sets, $M$ has spatial connectivity. The spatial layout of the elements in $X_{i}$ equals the spatial layout of the pixels assigned to superpixel $i$, so $X_{i}$ being a connected set means that superpixel $i$ has spatial connectivity. Evidently, $M$ having spatial connectivity is equivalent to all superpixels having spatial connectivity. Every set in $M_{0}$ has only one element, so $M_{0}$ certainly has spatial connectivity. If the interpolations preserve spatial connectivity, we can infer that $M$ has spatial connectivity.

Our interpolation scheme inserts elements between existing neighboring elements in each row/column. When we insert an element between a pair of neighbors, only the sets including these three elements need to be considered. If the existing neighboring elements are in the same set, the inserted element is added to that set, and the set remains connected. If the existing neighboring elements belong to different sets, the inserted element is added to one of them, and the other does not change; the enlarged set is still connected, and the spatial connectivity of the other is unaffected. We emphasize that it is this design of the interpolation that preserves spatial connectivity: if we instead interpolated once per expanding step and allowed the inserted value to match any element of its 8-neighborhood, the spatial connectivity of the pixel-superpixel map could be destroyed. Above all, our method enforces spatial connectivity explicitly through the careful design of the interpolation.
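This invariant is easy to check empirically. The sketch below (our own toy test, reusing the horizontal_interpolation sketch from Section 3.1 and scipy's 4-connected component labeling) verifies that an expanded map stays connected for arbitrary association scores:

```python
import torch
from scipy import ndimage

def is_spatially_connected(M):
    # M: (h, w) long tensor of superpixel IDs. True iff every superpixel
    # forms a single 4-connected component (no orphan pixels).
    for sid in torch.unique(M):
        _, n_components = ndimage.label((M == sid).numpy())
        if n_components != 1:
            return False
    return True

# Toy check: expand a 2x2 initial map once with random association
# scores; connectivity holds no matter what the scores are.
M0 = torch.arange(4).reshape(2, 2)
M_h = horizontal_interpolation(M0, torch.rand(2, 1, 2))          # (2, 3)
M1 = horizontal_interpolation(M_h.t(), torch.rand(3, 1, 2)).t()  # (3, 3)
assert is_spatially_connected(M1)
```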

4 Experiments

Since SIN is meant to be integrated into subsequent tasks in an end-to-end way without impeding their real-time performance, we analyze the runtime of deep learning-based models. To demonstrate the effectiveness of SIN in superpixel segmentation, we train and test our model on the standard benchmark BSDS500 [3]. We also report its performance without fine-tuning on the benchmark NYUv2 [22] to evaluate the generalizability of our model. We use the protocols and code provided by [23] to evaluate all methods on the two benchmarks. SNIC [2], SEAL [24], SSN [11] and FCN [29] are tested with the original implementations from the authors. SLIC [1] and ERS [14] are tested with the code provided in [23]. For SLIC and ERS, we use the best parameters reported in [23]; for the rest, we use the default parameters recommended in the original papers. Figure 3 shows the visual results of some state-of-the-art methods and ours.

Input GT SNIC SEAL SSN FCN Ours
Figure 3: Visual results. Compared to SEAL, SSN and FCN, our method is competitive or better in terms of object boundary adherence while generating more compact superpixels. Top rows: BSDS500. Bottom rows: NYUv2.

4.1 Comparison with the state-of-the-arts

Implementation details.

We implement our model in PyTorch and use Adam with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ to optimize it. For training, we randomly crop the images to size $225\times 225$ as input and perform horizontal/vertical flipping for data augmentation. The initial learning rate is set to $5\times 10^{-5}$ and is halved after 200k iterations. It takes about 3 hours to train the model for 300k iterations on a single NVIDIA RTX 2080Ti GPU.
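In code, this optimizer setup corresponds to something like the following (a sketch with a placeholder backbone; only the hyperparameters come from the paper):

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # placeholder for the SIN backbone
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, betas=(0.9, 0.999))
# Halve the learning rate after 200k iterations (scheduler stepped per iteration).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200_000, gamma=0.5)
```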

We set the regular step $S$ to 16, so we obtain $15\times 15$ (225) superpixels through 4 expanding steps during training. We set $w^{h}$ and $w^{v}$ to $[20,10,5,2.5]$ and $[8,4,2,1]$, respectively. To generate a varying number of superpixels at test time, we simply resize the input image to the appropriate size. For example, to generate $30\times 20$ superpixels, we resize the image to $(30\ast 16-15)\times(20\ast 16-15)$, i.e., $465\times 305$.
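The size formula follows from Eq. (2): each of the 4 expanding steps maps a grid of size $n$ to $2n-1$, so four steps map $n$ to $16n-15$. A tiny helper (our own, for illustration) makes this explicit:

```python
def input_size_for(n_rows, n_cols, steps=4):
    # Invert the expansion: a grid of size n becomes 2n - 1 per step,
    # so after `steps` steps it becomes 2^steps * n - (2^steps - 1),
    # i.e. n * S - (S - 1) with S = 2^steps = 16.
    side = lambda n: (2 ** steps) * n - (2 ** steps - 1)
    return side(n_rows), side(n_cols)

assert input_size_for(30, 20) == (465, 305)
```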

Figure 4: Runtime analysis. Average runtime of different deep learning-based methods w.r.t. the number of superpixels. Note that the $y$-axis is plotted in logarithmic scale.

Runtime Analysis.

We compare the runtime of the deep learning-based methods. Figure 4 reports the average runtime w.r.t. the number of generated superpixels on an NVIDIA RTX 2080Ti GPU. Our method runs about 1.5 to 2 times faster than FCN, 12 to 33 times faster than SSN, and more than 70 times faster than SEAL. Note that existing deep learning-based methods need a post-processing step which takes 2.5 ms to 8 ms [18], and the runtimes in Figure 4 do not include this time. Our method is the fastest because we use a novel interpolation scheme to generate superpixels. Moreover, our method saves plenty of training time compared to FCN thanks to the simple and effective loss function: we spend about 3 hours training on a single GPU, while FCN spends about 20 hours.

Evaluation metrics.

To demonstrate the effectiveness of SIN, we use achievable segmentation accuracy (ASA), boundary recall and precision (BR-BP), and compactness (CO) to evaluate the superpixels. ASA evaluates superpixels by measuring the total effective segmentation area of a superpixel representation with respect to the ground-truth segmentation map. BR and BP measure the boundary adherence of superpixels given the ground-truth boundaries, whereas CO assesses the compactness of superpixels. The higher these scores are, the better the superpixel segmentation result is. As in [23], for BR and BP evaluation, the boundary tolerance is 0.0025 times the image diagonal rounded to the closest integer.

Results on BSDS500.

BSDS500 contains 200 training, 100 validation, and 200 test images. Each image in this dataset is provided with multiple ground truth annotations. For training, we follow [11, 24, 29] and treat each annotation as an individual sample. With this dataset, we have 1633 training/validation samples and 1063 testing samples. We train our model using both the training and validation samples.

Figure 5 reports the performance of all methods on the BSDS500 test set. Our method outperforms all traditional methods on all evaluation metrics, except SNIC in terms of BR-BP. Compared to the other deep learning-based methods, our method achieves competitive results in terms of ASA and BR-BP, and significantly higher scores in terms of CO. With high CO, our method better captures spatially coherent information and avoids paying too much attention to image details and noise. As shown in Figure 3, our method generates smoother superpixels when handling fuzzy boundaries.

Figure 5: Results on BSDS500. From left to right: ASA, BR-BP, and CO.

Results on NYUv2.

NYUv2 is an RGB-D dataset containing 1449 images with object instance labels, originally proposed for indoor scene understanding tasks. [23] removes the unlabelled regions near the image boundary and develops a benchmark on a subset of 400 test images of size $608\times 448$ for superpixel evaluation. We directly apply the models of SEAL, SSN, FCN, and our method trained on BSDS500 to this dataset without any fine-tuning.

Figure 6: Results on NYUv2. From left to right: ASA, BR-BP, and CO.

Figure 6 shows the performance of all methods on NYUv2. In general, the deep learning-based algorithms achieve competitive or better performance against the traditional algorithms, which demonstrates that they can extract high-quality superpixels on other datasets. Also, our method outperforms all other methods in terms of CO. As the visual results in Figure 3 show, our method handles fuzzy boundaries better than the other deep learning-based methods.

Illustration of high CO score.

The experimental results on BSDS500 and NYUv2 show that our method has slightly lower ASA and BR-BP scores but a higher CO score. We explain the reason in the following.

Figure 7: Illustration of the high CO score. According to the rule of interpolation, the upper map is a possible new pixel-superpixel map, while the lower one is impossible.

To enforce spatial connectivity from the start, we expand the pixel-superpixel map in the horizontal and vertical dimensions. The horizontal/vertical interpolation constrains each inserted value to equal one of its horizontal/vertical neighbors. As shown in Figure 7, the black-circled value can be 1 or 2 but cannot be 16 or 17. However, if the ground truth of this value is 16, our method cannot interpolate the correct value. That is why our ASA and BR-BP scores are lower than those of other deep learning-based methods. Meanwhile, the constraint means that the pixels in a superpixel are 4-connected, which is more compact than 8-connectivity. Owing to the high CO score, our method generates smoother superpixels on fuzzy boundaries, as shown in Figure 3. The importance of compactness has been demonstrated in [29]. To extract more useful features in downstream tasks, it is important that our superpixel method captures spatial coherence in local regions. In our view, it is worthwhile to enforce spatial connectivity from the start and obtain a higher CO score while sacrificing slightly on ASA and BR-BP.

4.2 Ablation study

We present an ablation study where we evaluate different design choices for the image feature extraction and the loss summation. Unlike [29], we do not take image features from previous layers into account when predicting association scores. Our total loss is the sum of the horizontal and vertical losses at each step, so we can compute either an average or a weighted sum. In our final model, we choose the weighted sum to compute the total loss. For comparison, we include a baseline model which uses both the previous and current features (concatenated) to predict scores and simply averages the loss values. Figure 8 shows that each of the two alternatives in our model performs better.

Figure 8: Ablation study. We show the effectiveness of each design choice in the SIN model in improving accuracy.

5 Application

In this section, we evaluate whether our SIN superpixels can improve the performance of downstream vision tasks which utilize superpixels. For this study, we choose existing semantic segmentation and salient object detection algorithms and substitute their original superpixels with ours. For the following two tasks, our superpixels are generated by the network fine-tuned on the PascalVOC 2012 training and validation sets.

Semantic segmentation.

For semantic segmentation, CNN models [13, 16] achieve state-of-the-art performance. However, most CNN architectures generate lower-resolution outputs and then upsample them using post-processing techniques. To alleviate the need for post-processing CRF techniques, [8] proposes the Bilateral Inception (BI) networks, which utilize SLIC superpixels for long-range and edge-aware propagation across CNN units. We use SNIC and our superpixels to substitute the SLIC superpixels and set the number of superpixels to 600. We evaluate the generated semantic segmentation on the PascalVOC 2012 test set [5]. Table 1 shows the standard Intersection over Union (IoU) scores. The results indicate that we obtain significant IoU improvements when using SIN superpixels.

Salient object detection.

Superpixels are widely used in salient object detection algorithms. We experiment with Saliency Optimization (SO) [30] and report standard Mean Absolute Error (MAE) scores on the ECSSD dataset [20]. To demonstrate the potential of our SIN superpixels, we replace the SLIC superpixels used in SO with ours, SNIC, and ERS superpixels, and set the number of superpixels to 200 and 400. The experimental results in Table 2 show that using our superpixels consistently improves the performance of SO at both 200 and 400 superpixels.

The above results on semantic segmentation and salient object detection demonstrate the effectiveness of integrating our superpixels into downstream vision tasks.

Table 1: Superpixels for semantic segmentation. We compute semantic segmentation using the BI network with different types of superpixels and compare the IoU scores on the PascalVOC 2012 test set.
Method | DeepLab [13] | + CRF [13] | + BI (SLIC) [8] | + BI (ERS) | + BI (Ours)
IoU    | 68.9         | 72.7       | 73.5            | 74.0       | 74.4
Table 2: Superpixels for salient object detection. We run the SO algorithm with different types of superpixels and report MAE scores on the ECSSD dataset (lower is better).
# of superpixels | SLIC   | SNIC   | ERS    | Ours
200              | 0.1719 | 0.1714 | 0.1686 | 0.1657
400              | 0.1675 | 0.1654 | 0.1630 | 0.1616

6 Conclusion

In this paper, we present a superpixel segmentation network, SIN, which can be integrated into downstream tasks in an end-to-end way. To extract superpixels, we initialize a pixel-superpixel map and expand it multiple times. By dividing an expanding step into a horizontal and a vertical interpolation, we enforce spatial connectivity explicitly. We utilize the multi-layer outputs of a fully convolutional network to predict association scores for the interpolations. To speed up the training process, association scores rather than pixel-superpixel maps are used to compute the loss. Because our interpolation constrains the number of neighbors of inserted elements, SIN is the fastest among existing deep learning-based methods. The high speed of our method ensures it can be integrated into downstream tasks requiring real-time performance. Our model performs favorably against several existing state-of-the-art superpixel algorithms. SIN generates more compact superpixels thanks to the design of the interpolation, which is important for downstream tasks. Moreover, visual results illustrate that our method performs better when handling fuzzy boundaries. Furthermore, we apply our superpixels to downstream tasks and improve their performance. We will integrate SIN into downstream tasks in an end-to-end way in the future, and we hope SIN can benefit superpixel-based computer vision tasks.

Acknowledgements. This work is supported by the Hubei Provincial Science and Technology Major Project of China under Grant No. 2020AEA011, the Key Research & Development Plan of Hubei Province of China under Grant No. 2020BAB100, the project of Science, Technology and Innovation Commission of Shenzhen Municipality of China under Grant No. JCYJ20210324120002006, and the Fundamental Research Funds for the Central Universities, HUST: 2020JYCXJJ067.

References

  • [1] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence 34(11), 2274–2282 (2012)
  • [2] Achanta, R., Susstrunk, S.: Superpixels and polygons using simple non-iterative clustering. In: CVPR. pp. 4651–4660 (July 2017)
  • [3] Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(5), 898–916 (2010)
  • [4] Van den Bergh, M., Boix, X., Roig, G., de Capitani, B., Van Gool, L.: SEEDS: Superpixels extracted via energy-driven sampling. In: ECCV. pp. 13–26. Springer (2012)
  • [5] Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision 111(1), 98–136 (2015)
  • [6] Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph-based image segmentation. International Journal of Computer Vision 59(2), 167–181 (2004)
  • [7] Gadde, R., Jampani, V., Kiefel, M., Kappler, D., Gehler, P.V.: Superpixel convolutional networks using bilateral inceptions. In: ECCV. pp. 597–613. Springer (2016)
  • [8] Gadde, R., Jampani, V., Kiefel, M., Kappler, D., Gehler, P.V.: Superpixel convolutional networks using bilateral inceptions. In: ECCV. pp. 597–613. Springer (2016)
  • [9] Gould, S., Rodgers, J., Cohen, D., Elidan, G., Koller, D.: Multi-class segmentation with relative location prior. International Journal of Computer Vision 80(3), 300–316 (2008)
  • [10] He, S., Lau, R.W., Liu, W., Huang, Z., Yang, Q.: SuperCNN: A superpixelwise convolutional neural network for salient object detection. International Journal of Computer Vision 115(3), 330–344 (2015)
  • [11] Jampani, V., Sun, D., Liu, M.Y., Yang, M.H., Kautz, J.: Superpixel sampling networks. In: ECCV. pp. 352–368 (September 2018)
  • [12] Li, Z., Chen, J.: Superpixel segmentation using linear spectral clustering. In: CVPR. pp. 1356–1363 (2015)
  • [13] Liang-Chieh, C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.: Semantic image segmentation with deep convolutional nets and fully connected crfs. In: ICLR (2015)
  • [14] Liu, M.Y., Tuzel, O., Ramalingam, S., Chellappa, R.: Entropy rate superpixel segmentation. In: CVPR. pp. 2097–2104. IEEE (2011)
  • [15] Liu, Y.J., Yu, C.C., Yu, M.J., He, Y.: Manifold SLIC: A fast method to compute content-sensitive superpixels. In: CVPR. pp. 651–659 (2016)
  • [16] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. pp. 3431–3440 (2015)
  • [17] Ren, X., Malik, J.: Learning a classification model for segmentation. In: ICCV. pp. 10–17, vol. 1 (2003). https://doi.org/10.1109/ICCV.2003.1238308
  • [18] Ren, C.Y., Reid, I.: gSLIC: a real-time implementation of SLIC superpixel segmentation. University of Oxford, Department of Engineering, Technical Report, pp. 1–6 (2011)
  • [19] Sharma, A., Tuzel, O., Liu, M.Y.: Recursive context propagation network for semantic scene labeling. In: NeurIPS. pp. 2447–2455 (2014)
  • [20] Shi, J., Yan, Q., Xu, L., Jia, J.: Hierarchical image saliency detection on extended CSSD. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(4), 717–729 (2015)
  • [21] Shu, G., Dehghan, A., Shah, M.: Improving an object detector and extracting regions using superpixels. In: CVPR. pp. 3721–3727 (2013)
  • [22] Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: ECCV. pp. 746–760. Springer (2012)
  • [23] Stutz, D., Hermans, A., Leibe, B.: Superpixels: An evaluation of the state-of-the-art. Computer Vision and Image Understanding 166, 1–27 (2018)
  • [24] Tu, W.C., Liu, M.Y., Jampani, V., Sun, D., Chien, S.Y., Yang, M.H., Kautz, J.: Learning superpixels with segmentation-aware affinity loss. In: CVPR. pp. 568–576 (2018)
  • [25] Wang, S., Lu, H., Yang, F., Yang, M.H.: Superpixel tracking. In: ICCV. pp. 1323–1330. IEEE (2011)
  • [26] Yan, J., Yu, Y., Zhu, X., Lei, Z., Li, S.Z.: Object detection by labeling superpixels. In: CVPR. pp. 5107–5116 (2015)
  • [27] Yang, C., Zhang, L., Lu, H., Ruan, X., Yang, M.H.: Saliency detection via graph-based manifold ranking. In: CVPR. pp. 3166–3173 (2013)
  • [28] Yang, F., Lu, H., Yang, M.H.: Robust superpixel tracking. IEEE Transactions on Image Processing 23(4), 1639–1651 (2014)
  • [29] Yang, F., Sun, Q., Jin, H., Zhou, Z.: Superpixel segmentation with fully convolutional networks. In: CVPR. pp. 13964–13973 (2020)
  • [30] Zhu, W., Liang, S., Wei, Y., Sun, J.: Saliency optimization from robust background detection. In: CVPR. pp. 2814–2821 (2014)