
Deep Level Set for Box-supervised Instance Segmentation in Aerial Images

Wentong Li1, Yijie Chen1, Wenyu Liu1, Jianke Zhu1,2
1Zhejiang University
2Alibaba-Zhejiang University Joint Research Institute of Frontier Technologies
{liwentong, chen_yj, liuwenyu.lwy, jkzhu}@zju.edu.cn
Corresponding author is Jianke Zhu.
Abstract

Box-supervised instance segmentation has recently attracted much research effort, while little attention has been paid to the aerial image domain. In contrast to general object collections, aerial objects exhibit large intra-class variance and inter-class similarity against complex backgrounds, and high-resolution satellite images contain many tiny objects. This makes the recent pairwise affinity modeling methods inevitably involve noisy supervision and yield inferior results. To tackle these problems, we propose a novel aerial instance segmentation approach, which drives the network to learn a series of level set functions for aerial objects with only box annotations in an end-to-end fashion. Instead of learning pairwise affinity, the level set method with carefully designed energy functions treats object segmentation as curve evolution, which is able to accurately recover object boundaries and prevent interference from the indistinguishable background and similar objects. Experimental results demonstrate that the proposed approach outperforms state-of-the-art box-supervised instance segmentation methods. The source code is available at https://github.com/LiWentomng/boxlevelset.

1 Introduction

An insightful understanding of objects (e.g., buildings, vehicles, planes, ships) in remote sensing images with high spatial resolution (HSR) has great value in many real-world applications, including environmental monitoring and urban management [40].

Aerial instance segmentation is an important task that localizes objects at the pixel level rather than with bounding boxes, providing detailed shape and pose cues for the objects of interest [47]. Different from general object collections [25, 8], aerial images pose their own difficulties, including large intra-class variance and inter-class similarity with complex backgrounds [23], and many tiny objects in high-resolution satellite images [47]. Figure 1 shows two visual examples of aerial instance segmentation results.

Figure 1: Illustration of intra-class variance, inter-class similarity and many tiny objects with complex backgrounds in aerial images.

Although previous instance segmentation methods have achieved promising results, most of them heavily depend on pixel-wise mask annotations, which are notoriously time-consuming to obtain compared with box-level annotations [2, 19]. Instead of relying on pixel-level labels, box-supervised instance segmentation has recently attracted much research effort [14, 36]; these methods are designed around pairwise affinity modeling, e.g., neighboring pixel pairs [14] and color pairs [36]. However, pairwise affinity-based methods are defined on a set containing all or part of the neighboring pixel pairs, resting on the oversimplified assumption that spatially neighboring pixel or color pairs should share the same label. This inevitably introduces heavy noise into the supervision for aerial instance segmentation. BoxInst [36] models pairwise color affinity on an undirected graph built from the input image. BBTP [14] makes use of pairwise pixel affinity and formulates box-supervised instance segmentation as a multiple instance learning problem. Such pairwise affinity learning may absorb noisy context from the nearby background or similar objects, which leads to inferior instance segmentation performance on aerial images.

In this paper, we propose a novel approach to aerial instance segmentation using box annotations, based on the level set [29, 4]. Unlike simple pairwise affinity modeling, our method implicitly evolves a closed curve to directly recover object boundaries, which effectively reduces noisy interference. In contrast to previous methods that evolve the level set to the ground-truth boundary in a fully supervised manner [15, 39, 13], our approach is an end-to-end network weakly supervised by bounding box annotations originally intended for detection. It consists of a detection branch and a segmentation branch, which select potential positive samples in a unified way, i.e., box samples for the detection branch and mask map samples for the segmentation branch. We train classification, box regression and mask-level prediction jointly. In addition, the whole evolution process of the level set function is differentiable. Within the enlarged ground-truth box region, the level set is iteratively updated to optimize our presented energy function, which forces the boundary to evolve so as to accurately segment the object from the background during training. To enable the efficient convergence of the evolution process, we introduce effective box and background constraints.

In summary, the main contributions of this work are: 1) a novel deep level set-based approach to aerial instance segmentation; to the best of our knowledge, this is the first deep level set-based method that tackles the problem of box-supervised instance segmentation; 2) a framework that learns the potential positive samples, i.e., box and mask map samples, in a unified manner and can be jointly trained for classification, box regression and mask-level prediction; 3) promising aerial instance segmentation results on the benchmark using box annotations, which are comparable with fully mask-supervised methods on some categories.

2 Related Work

2.1 Level Set-based Segmentation

The level set method [1, 29] is widely used in image segmentation due to its ability to automatically find object boundaries. The key idea is to define an energy function in a higher dimension to represent the implicit curve, which is iteratively evolved by gradient descent. Later, Chan et al. [4] introduced a region-based level set method whose energy function considers the regions inside and outside the curve, making it more robust to complex backgrounds and initial contours.

Deep networks are capable of effectively encoding high-level features. Therefore, some works [13, 17, 15, 39, 18] embed the level set into deep networks and achieve promising segmentation results. For semantic segmentation, Le et al. [20] propose the Contextual Recurrent Level Set (CRLS) method, which casts curve evolution as a time series and reformulates it as a recurrent neural network. Kim et al. [18] convert level set functions into class probability maps and calculate the energy for each class, obtaining multi-class segmentation of an image. For instance annotation, Wang et al. [39] propose a method that predicts the evolution parameters and evolves the predicted initial contour by incorporating user clicks on the extreme boundary points. Levelset R-CNN [13] embeds the Chan-Vese level set segmentation optimization on top of Mask R-CNN [11] to facilitate accurate instance segmentation. In spite of their promising performance, these methods adopt the level set to help the deep network evolve accurately to the ground-truth boundary in a fully supervised manner. This paper exploits the deep level set for the task of box-supervised instance segmentation. [17] is the most related work to ours; it combines the Mumford–Shah functional [28] with multiple level sets and an N-class softmax characteristic function to measure global pixel similarity, achieving image segmentation in a semi-supervised or unsupervised manner. In contrast, our method performs curve evolution within the local box region of each object: the energy function is different, and our level set function evolves the foreground object on a single potential mask map during optimization.

Figure 2: The overall architecture of the proposed method. The backbone with FPN serves as the feature encoder of the input image. With the detection branch and segmentation branch, accurate detection and instance segmentation are obtained from bounding box annotations only.

2.2 Box-Supervised Instance Segmentation

Modern instance segmentation methods [11, 45, 3, 35, 38] with full mask-level supervision are able to segment objects with accurate boundaries, but they rely on pixel-wise mask annotations, which may limit their deployment in real-world applications. Box-supervised instance segmentation methods have recently attracted more attention in the general scene. SDI [16] is the first method to predict masks from box annotations under the deep learning framework; it heavily depends on region proposals generated by unsupervised segmentation methods like GrabCut [34] and MCG [32]. More recently, BBTP [14] models the neighboring pixel-pairwise affinity and formulates box-supervised instance segmentation as multiple instance learning (MIL) based on Mask R-CNN [11]. BoxInst [36] uses color-pairwise affinity modeling with box projection, built on the efficient RoI-free CondInst framework [35]. Although these methods achieve promising performance in the general scene, their pairwise affinity modeling is defined on a set containing part or all of the neighboring pixel pairs, with the oversimplified assumption that spatially neighboring pixel or color pairs should share the same label. This inevitably introduces heavy noise into the supervision in aerial scenes. Besides, the recently advanced methods BBAM [21] and BoxCaseg [37] introduce multiple training and inference stages or additional supervision, such as mask guidance from salient images, to achieve promising segmentation performance. Different from the above methods, our proposed level set-based method evolves implicitly in an end-to-end manner, directly learning to align with instance boundaries and effectively preventing noisy interference within each box region.

2.3 Instance Segmentation in Aerial Image

In contrast to the general scene, high-resolution remote sensing images have unique characteristics, such as large intra-class variance and inter-class similarity with complex backgrounds [23], many tiny objects in high-resolution images [47], and large object-scale variations [44, 22]. Therefore, it is difficult for conventional instance segmentation methods to directly handle remote sensing imagery. Mou et al. [27] propose a unified multi-task learning network that simultaneously segments vehicles and detects their boundaries in a fully supervised manner. Pan et al. [30] propose to infer the mask map on the oriented bounding box (OBB) to obtain more accurate mask predictions, especially for densely distributed objects. To the best of our knowledge, there is little previous work on aerial instance segmentation with box annotations.

3 Proposed Method

3.1 Overview

To facilitate deep level set-based instance segmentation, we present a unified framework that learns the potential positive samples, i.e., box and mask map samples, and can be jointly trained for classification, box regression and mask-level prediction. Figure 2 gives an overview of the whole pipeline. Based on the backbone and feature pyramid network (FPN) [24], our framework encodes deep features from the input image. We employ two head branches: a detection branch and a segmentation branch. Note that accurate detection and instance segmentation are obtained from bounding box annotations only.

Detection Branch. To better fit the spatial extent of aerial objects with arbitrary orientations, the detection branch adopts the oriented bounding box (OBB) representation. OBBs provide orientation information for detecting objects with arbitrary orientations and dense distributions in aerial images [42, 10, 43]. As in previous oriented detection methods [7], the RPN regresses rotated RoIs with five parameters $(x, y, w, h, \theta)$. Inspired by DeepSnake [31], we explicitly regard each rotated RoI as an initial contour and uniformly sample $K$ vertices $\{(x_k, y_k)\}_{k=1,\ldots,K}$ along it. Separate $snake$ modules [31] process the $K$ vertices to iteratively refine the contour vertices towards the ground-truth rotated bounding box. The iterative regression of contour vertices can be formulated as follows:

$\{(\Delta x_k, \Delta y_k)\}_{k=1}^{K} = snake(\{F(x_k, y_k)\}_{k=1}^{K})$ (1)

where $F(x_k, y_k)$ is the concatenation of all contour vertices' features, obtained by bilinear interpolation, and $(\Delta x_k, \Delta y_k)$ are the regressed relative coordinates. With the learned offsets, we obtain the refined vertex locations as the boundary of the predicted rotated bounding box. The regression loss $\mathcal{L}_{reg}$ is the Chamfer distance [9] between the regressed contour vertices and the oriented ground-truth box. The classification loss $\mathcal{L}_{cls}$ is the standard cross-entropy loss based on the contour vertices' features.
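To make Eq. (1) concrete, the following PyTorch sketch samples vertex features by bilinear interpolation and regresses per-vertex offsets with a DeepSnake-style circular convolution. The module name CircConvSnake, the hidden width and the two-layer depth are illustrative assumptions on our part, not the exact architecture of [31].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CircConvSnake(nn.Module):
    """Hypothetical snake head: circular 1-D convolutions over the K contour
    vertices of a closed curve, regressing per-vertex offsets (dx_k, dy_k)."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.conv1 = nn.Conv1d(feat_dim + 2, hidden, kernel_size=3)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3)
        self.head = nn.Conv1d(hidden, 2, kernel_size=1)

    def _circ(self, x, conv):
        # circular padding treats the vertex sequence as a closed contour
        return F.relu(conv(F.pad(x, (1, 1), mode="circular")))

    def forward(self, feats, verts):
        # feats: (B, C, K) vertex features; verts: (B, K, 2) normalized coords
        x = torch.cat([feats, verts.transpose(1, 2)], dim=1)
        x = self._circ(x, self.conv1)
        x = self._circ(x, self.conv2)
        return self.head(x).transpose(1, 2)  # (B, K, 2) offsets

def sample_vertex_features(fmap, verts):
    """Bilinear interpolation F(x_k, y_k) of Eq. (1).
    fmap: (B, C, H, W); verts: (B, K, 2) with (x, y) in [-1, 1]."""
    grid = verts.unsqueeze(2)                           # (B, K, 1, 2)
    out = F.grid_sample(fmap, grid, align_corners=False)
    return out.squeeze(3)                               # (B, C, K)

# one refinement iteration: verts <- verts + snake(F(verts))
# snake = CircConvSnake(feat_dim=256)
# verts = verts + snake(sample_vertex_features(feature_map, verts), verts)
```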

Segmentation Branch. The segmentation branch takes the highest-resolution feature map (i.e., $P_2$) of the feature pyramid as input and predicts the mask probability. To integrate features of different resolutions, we upsample the high-level semantic features (i.e., $P_3$, $P_4$, $P_5$) to the size of $P_2$ and sum them pixel-wise. Moreover, we adopt the encoder of DeepLabV3+ [6] to enhance the semantic features. The output feature map possesses both rich semantics and multi-scale spatial information for aerial images. Based on the output feature map, $N \times 1 \times \frac{H}{s} \times \frac{W}{s}$ instance feature maps $M$ are generated, where $N$ is the total number of potential instances (the same as the number of RoIs from the RPN in the detection branch), $H \times W$ is the input size, and $s$ denotes the output stride of the feature map.
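A minimal sketch of the fusion step, assuming bilinear upsampling and standard FPN map shapes; the function name is hypothetical.

```python
import torch.nn.functional as F

def fuse_pyramid(p2, p3, p4, p5):
    """Upsample the higher-level FPN maps to P2's resolution and sum them
    pixel-wise; the fused map then feeds the DeepLabV3+ encoder."""
    h, w = p2.shape[-2:]
    up = lambda x: F.interpolate(x, size=(h, w), mode="bilinear",
                                 align_corners=False)
    return p2 + up(p3) + up(p4) + up(p5)
```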

Unlike Mask R-CNN, which uses only the masks within RoIs, our instance segmentation branch generates full-image instance mask maps. On each mask map, only one object is regarded as foreground, and everything else is set to background. To select the potential mask maps $P_{K \times 1 \times \frac{H}{s} \times \frac{W}{s}}$ and assign the corresponding ground-truth bounding boxes for instance segmentation at different locations, we perform positive target assignment following the detection branch, which matches high-quality bounding boxes to ground-truth boxes by computing their overlaps. This unifies the positive sample assignment measure and encodes high-quality instance-level information (e.g., the coarse shape and pose of objects) at different locations. For the segmentation branch, the loss $\mathcal{L}_{seg}$ consists of two terms, i.e., the level set term $\mathcal{L}_{levelset}$ for boundary evolution and the constraint term $\mathcal{L}_{cons.}$ for efficient convergence. The instance segmentation loss is therefore

$\mathcal{L}_{seg} = \mathcal{L}_{levelset} + \mathcal{L}_{cons.}$ (2)

3.2 Deep Level Set for Instance Segmentation

Level Set Formulation. The level set method [29, 4] implicitly represents a parametric curve $C$, defined as the boundary of an open region $z$ in the space $\Omega$. The curve is given by the zero crossing of a level set function $\phi(x,y): \Omega \to \mathbb{R}$ as

$C = \{(x,y) \in \Omega \mid \phi(x,y) = 0\}$ (3)

$inside(C) = \{(x,y) \in \Omega \mid \phi(x,y) > 0\}$ and $outside(C) = \{(x,y) \in \Omega \mid \phi(x,y) < 0\}$ represent the region $z$ and the region outside $z$, respectively. The fitting energy terms $E_1(C)$ and $E_2(C)$ for these two regions are defined as follows,

$E_1(C) = \int_{\phi>0} |u_0(x,y) - a_1|^2 \, dx \, dy$, $\quad E_2(C) = \int_{\phi<0} |u_0(x,y) - a_2|^2 \, dx \, dy$ (4)

where $u_0$ is a given image, and $a_1$ and $a_2$ are the averages of $u_0$ inside and outside $C$, respectively, given by

$a_1(\phi) = \mathrm{average}(u_0) \ \text{in} \ \{\phi \geq 0\}$, $\quad a_2(\phi) = \mathrm{average}(u_0) \ \text{in} \ \{\phi < 0\}$ (5)

The minimization of the energy terms can be viewed as a curve evolution along the descent of the energy function. The object boundary $C_0$ is the minimizer of the fitting term

$\inf_C \{E_1(C) + E_2(C)\} \approx 0 \approx E_1(C_0) + E_2(C_0)$ (6)

The length of $C$ and the area inside $C$ during curve evolution are represented as,

$\mathrm{Length}(\phi=0) = \int_{\Omega} |\nabla H(\phi(x,y))| \, dx \, dy = \int_{\Omega} \delta(\phi(x,y)) \, |\nabla \phi(x,y)| \, dx \, dy$, $\quad \mathrm{Area}(\phi \geq 0) = \int_{\Omega} H(\phi(x,y)) \, dx \, dy$ (7)

where $H$ is the Heaviside function and $\delta$ is the 1-D Dirac measure.
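For intuition, here is a small numerical sketch of Eqs. (4)-(7) on a single-channel image, using a smooth arctan-based Heaviside so that every term is differentiable; the function name and the epsilon smoothing are our own illustrative choices.

```python
import math
import torch

def chan_vese_energy(u0, phi, eps=1.0, lam=1.0, mu=1.0):
    """Illustrative Chan-Vese fitting energy of Eqs. (4)-(7).
    u0, phi: (H, W) tensors; H_eps is a smooth surrogate for the
    Heaviside function H."""
    H_eps = 0.5 * (1 + (2 / math.pi) * torch.atan(phi / eps))
    a1 = (u0 * H_eps).sum() / H_eps.sum().clamp(min=1e-6)              # mean inside C
    a2 = (u0 * (1 - H_eps)).sum() / (1 - H_eps).sum().clamp(min=1e-6)  # mean outside C
    e1 = ((u0 - a1) ** 2 * H_eps).sum()            # E1(C)
    e2 = ((u0 - a2) ** 2 * (1 - H_eps)).sum()      # E2(C)
    gy, gx = torch.gradient(H_eps)                 # |grad H(phi)| for the length term
    length = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8).sum()
    area = H_eps.sum()                             # Area(phi >= 0)
    return e1 + e2 + lam * length + mu * area
```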

Enlarged Box Region for Deep Level Set Evolution. Classical level set methods are defined and optimized only on low-level features, such as the shape, color and texture of the object and background, which limits their segmentation performance. Deep networks possess a superior ability to encode high-level features. To this end, our method exploits curve evolution with the level set to find the similarity between deep features and object characteristics in aerial scenes. The level set method with a well-designed energy function makes it possible to achieve high-quality instance segmentation using box annotations only.

Given an input image $u_0(x,y)$ and a horizontal bounding box $\mathcal{B}$, our goal is to employ a CNN branch to predict the object boundary curve via level set evolution within the enlarged box region $\mathcal{B}^{*}$. $\mathcal{B}^{*}$ covers $n$ times the bounding box $\mathcal{B}$; this relaxation of $\mathcal{B}$ introduces more context for level set evolution on each object. As shown in Figure 2, we build a segmentation branch on top of the CNN feature encoder. The final output of this branch is the potential mask maps with shape $N \times 1 \times \frac{H}{s} \times \frac{W}{s}$, where $N$ is the number of predicted instances. For each instance, we only focus on the assigned enlarged bounding box region $\mathcal{B}^{*}$ and regard its outside as background. To better couple the CNN with the level set method, we treat the values of the potential mask map within the region $\mathcal{B}^{*}$ as the level set $\phi$, and the pixel space of the input image $u_0(x,y)$ is referred to as $\Omega$. $C$ is the segmentation boundary with $\phi = 0$. To generate a precise object boundary with the level set method, we design an energy function; by minimizing it, the network learns a level set $\phi$ that achieves the boundary evolution during training. The energy function is defined as follows:

$L(C_1, C_2, \phi, \rho_{cls}, \mathcal{B}^{*}) = \alpha_1 \sum_n \rho_{cls} \int_{\Omega \in \mathcal{B}^{*}} |u_0^{*}(x,y) - a_1^{*}|^2 \, \sigma(\phi(x,y)) \, dx \, dy + \alpha_2 \sum_n \rho_{cls} \int_{\Omega \in \mathcal{B}^{*}} |u_0^{*}(x,y) - a_2^{*}|^2 \, (1 - \sigma(\phi(x,y))) \, dx \, dy + \lambda \, \mathrm{Length}(\phi) + \mu \, \mathrm{Area}(\phi)$ (8)

where $u_0^{*}(x,y)$ denotes the input image after normalization, and $\sigma$ denotes the $sigmoid$ function, which is treated as the characteristic function for the level set $\phi$. Different from the traditional Heaviside function $H$, the $sigmoid$ function is much smoother and better expresses the characteristics of the predicted instance, which further benefits the convergence of the level set during training. The first two terms in Eq. (8) force the predicted mask map to be uniform both inside and outside the object region. $a_1^{*}$ and $a_2^{*}$ are the average values of $inside(C)$ and $outside(C)$, respectively, and can be expressed as

$a_1^{*}(\phi) = \dfrac{\int_{\Omega \in \mathcal{B}^{*}} u_0^{*}(x,y) \, \sigma(\phi(x,y)) \, dx \, dy}{\int_{\Omega \in \mathcal{B}^{*}} \sigma(\phi(x,y)) \, dx \, dy}$, $\quad a_2^{*}(\phi) = \dfrac{\int_{\Omega \in \mathcal{B}^{*}} u_0^{*}(x,y) \, (1 - \sigma(\phi(x,y))) \, dx \, dy}{\int_{\Omega \in \mathcal{B}^{*}} (1 - \sigma(\phi(x,y))) \, dx \, dy}$ (9)

The last two terms of Eq. (8) denote the length of the evolving boundary and the area of the region inside it, respectively. They can be viewed as regularization terms that keep the boundary of the segmentation map smooth, and are defined as follows

$\mathrm{Length}(\phi) = \int_{\Omega \in \mathcal{B}^{*}} |\nabla \sigma(\phi(x,y))| \, dx \, dy$, $\quad \mathrm{Area}(\phi) = \int_{\Omega \in \mathcal{B}^{*}} \sigma(\phi(x,y)) \, dx \, dy$ (10)

Note that $\rho_{cls}$ is a class-wise weighting term: different categories, with different object characteristics, require various degrees of level set evolution. This parameter can be adaptively tuned online or predefined. The energy function $L$ can be easily optimized with gradient back-propagation during training. The derivative of the energy function $L$ with respect to $\phi$ can be written as

$\dfrac{\partial L}{\partial \phi} = \nabla \sigma(\phi) \left[ \alpha_1 \rho_{cls} (u_0^{*} - a_1^{*})^2 - \alpha_2 \rho_{cls} (u_0^{*} - a_2^{*})^2 + \lambda \, \mathrm{div}\!\left(\dfrac{\nabla \phi}{|\nabla \phi|}\right) + \mu \right]$ (11)

The whole evolution process is differentiable and can be trained in an end-to-end manner. During training, we use the energy function $L$ as the loss $\mathcal{L}_{levelset}$ for level set evolution. Practically, we set $\alpha_1 = \alpha_2 = 0.001$, $\lambda = 0.00001$ and $\mu = 0.000001$ to keep the loss at a reasonable value while maintaining effective gradient propagation and letting each loss term converge stably. The class-wise parameter $\rho_{cls}$ is discussed in the experiment section.
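The following PyTorch sketch implements the per-instance loss of Eqs. (8)-(10) under the hyperparameters stated above; in practice autograd supplies the gradient of Eq. (11). The function name and the binary box_mask argument (the enlarged region $\mathcal{B}^{*}$ rasterized to the feature grid) are illustrative assumptions.

```python
import torch

def level_set_loss(u0, phi, box_mask, rho_cls,
                   alpha1=1e-3, alpha2=1e-3, lam=1e-5, mu=1e-6):
    """Sketch of Eq. (8) for one instance.
    u0:       (C, H, W) normalized image u0*
    phi:      (H, W) predicted mask map, treated as the level set
    box_mask: (H, W) binary mask of the enlarged region B* (assumed input)
    rho_cls:  scalar class-wise weight"""
    s_in = torch.sigmoid(phi) * box_mask          # smooth characteristic inside C
    s_out = (1 - torch.sigmoid(phi)) * box_mask   # ... and outside C, within B*
    # region averages a1*, a2* of Eq. (9), computed per image channel
    a1 = (u0 * s_in).sum(dim=(1, 2)) / s_in.sum().clamp(min=1e-6)
    a2 = (u0 * s_out).sum(dim=(1, 2)) / s_out.sum().clamp(min=1e-6)
    e1 = (((u0 - a1[:, None, None]) ** 2).sum(0) * s_in).sum()
    e2 = (((u0 - a2[:, None, None]) ** 2).sum(0) * s_out).sum()
    # regularizers of Eq. (10), restricted to B* for simplicity
    gy, gx = torch.gradient(torch.sigmoid(phi) * box_mask)
    length = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8).sum()
    area = s_in.sum()
    return rho_cls * (alpha1 * e1 + alpha2 * e2) + lam * length + mu * area
```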

3.3 Box and Background Constraints

From the box annotations, we obtain two key rules: first, the predicted mask map should be limited to its ground-truth box region; second, the whole region outside the ground-truth box should be background for the current instance. To this end, we introduce two constraints that effectively improve the convergence and prevent the interference of noisy information during training.

As in [36], we first project the ground-truth box onto the $x$-axis and $y$-axis and calculate the difference between the predicted mask map and the ground-truth box. Let $b^{f} \in \{0,1\}^{H \times W}$ denote the binary region generated by assigning one to the locations in the ground-truth box and zero otherwise. The mask score predictions $m^{p} \in (0,1)^{H \times W}$ for each instance can be regarded as the foreground probabilities. We define the box constraint term $\mathcal{L}_{cons.}^{box}$ as follows:

$\mathcal{L}_{cons.}^{box} = \mathcal{L}_{proj}(m_x^{p}, b_x^{f}) + \mathcal{L}_{proj}(m_y^{p}, b_y^{f})$ (12)

where $m_x^{p}$, $b_x^{f}$ and $m_y^{p}$, $b_y^{f}$ denote the $x$-axis and $y$-axis projections of the mask prediction $m^{p}$ and the binary ground-truth region $b^{f}$, respectively. The projection can be implemented by a $\max$ operation along each axis.

The region outside the box is regarded as background. Let $b^{b} \in \{0,1\}^{H \times W}$ denote the binary background region, where one is assigned to all locations outside the box region and zero to the locations inside it. The background prediction score is $m^{b} = \mathbb{1} - m^{p}$. Therefore, we define the background constraint term $\mathcal{L}_{cons.}^{back.}$ as follows:

$\mathcal{L}_{cons.}^{back.} = \mathcal{L}(m^{b}, b^{b})$ (13)

where $\mathcal{L}(\cdot,\cdot)$ in Eq. (12) and Eq. (13) denotes the pixel-wise dice loss [26]. The overall constraint loss $\mathcal{L}_{cons.}$ is therefore defined as:

$\mathcal{L}_{cons.} = \mathcal{L}_{cons.}^{box} + \mathcal{L}_{cons.}^{back.}$ (14)
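A minimal sketch of Eqs. (12)-(14), assuming single-instance (H, W) tensors; the max-based projection and the dice form follow the text above, while the function names are our own.

```python
import torch

def dice_loss(pred, target, eps=1e-5):
    """Pixel-wise dice loss between predicted probabilities and a binary target."""
    inter = (pred * target).sum()
    union = (pred ** 2).sum() + (target ** 2).sum()
    return 1 - 2 * inter / (union + eps)

def constraint_loss(m_pred, b_f):
    """Sketch of Eqs. (12)-(14).
    m_pred: (H, W) mask score predictions m^p in (0, 1)
    b_f:    (H, W) binary ground-truth box region"""
    # Eq. (12): project both maps onto each axis with max, then compare
    loss_box = (dice_loss(m_pred.max(dim=0).values, b_f.max(dim=0).values)
                + dice_loss(m_pred.max(dim=1).values, b_f.max(dim=1).values))
    # Eq. (13): the background scores m^b = 1 - m^p should match b^b = 1 - b^f
    loss_back = dice_loss(1 - m_pred, 1 - b_f)
    return loss_box + loss_back  # Eq. (14)
```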

4 Experiments

4.1 Datasets

iSAID. The iSAID dataset [44] consists of 2,806 HSR remote sensing images collected from multiple aerial platforms. The original image sizes range from 800 × 800 to 13000 × 4000 pixels. iSAID provides 655,451 instance annotations, the largest collection for instance segmentation in HSR remote sensing imagery. It covers 15 common categories: Ship, Storage Tank (ST), Baseball Diamond (BD), Tennis Court (TC), Basketball Court (BC), Ground Track Field (GTF), Bridge, Large Vehicle (LV), Small Vehicle (SV), Helicopter (HC), Swimming Pool (SP), Roundabout (RA), Soccer Ball Field (SBF), Plane, and Harbor. The dataset is made of 1,411 training images, 458 validation images, and 937 testing images. We train our models on the predefined training set and evaluate them on the validation set, since the labels of the testing set are unavailable.

Potsdam. The Potsdam dataset (https://www2.isprs.org/commissions/comm2/wg4/benchmark/2d-semlabel-potsdam/) is a high-resolution remote sensing dataset with semantic labels, consisting of 38 high-resolution aerial images of 5 cm GSD with 6000 × 6000 pixels. We randomly partition the images into 28 training images and 10 testing images, an approximate ratio of 3:1. The dataset covers six semantic categories; we only use the Car category to evaluate instance segmentation performance, rather than the other stuff classes related to scene understanding.

4.2 Implementation Details

We adopt ResNet-50 [12] as the backbone by default. For data preprocessing, we crop the original images into 800 × 800 patches using a sliding window with a stride of 600 pixels. For fair comparison, we employ the mask Average Precision (AP) [8] as the main instance segmentation metric unless otherwise specified. For the RPN settings, we follow RoI-Transformer [7]. For training, we adopt the same schedules as mmdetection [5]. The optimizer is SGD with an initial learning rate of 0.005, divided by 10 at each decay step; the momentum and weight decay are 0.9 and 0.0001, respectively. We train the models with the 1× schedule (12 epochs) or the 2× schedule (24 epochs). For data augmentation, random flip and rotation are adopted during training. We conducted experiments on 4 RTX 2080Ti GPUs with a total batch size of 8 for training and used a single RTX 2080Ti GPU for inference.
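For reference, an mmdetection-style configuration consistent with these settings might look as follows; the exact decay epochs of the 1× schedule are an assumption on our part.

```python
# Hypothetical mmdetection-style training config matching the settings above.
optimizer = dict(type='SGD', lr=0.005, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', step=[8, 11])          # lr / 10 at each decay step (assumed epochs)
runner = dict(type='EpochBasedRunner', max_epochs=12)  # 1x schedule; 24 for 2x
data = dict(samples_per_gpu=2)                         # 4 GPUs x 2 = total batch size 8
```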

For level set evolution in the instance segmentation branch, we employ the horizontal bounding box (HBB) annotations, while we use oriented bounding box (OBB) annotations for the detection branch. More importantly, no mask-level annotations are used to train our models.

4.3 Ablation Study

We first conduct a series of ablation experiments on the Potsdam dataset to evaluate the effectiveness of our proposed method. All ablation experiments are trained with the 1× schedule, using R-50-FPN as the backbone.

Figure 3: The curve evolution process of the level set at different iterations.

Level set and constraint loss for instance segmentation. We evaluate the effectiveness of the level set and constraint losses under different settings, as shown in Table 1. With the constraint loss alone, the model achieves 17.8% AP, indicating that the box and background constraints are effective. With the level set loss term alone, it attains 25.3% AP, a 7.5% improvement over the constraint loss, which shows that the proposed region-based level set can be learned with weak supervision from box annotations. Combining the constraint loss with the level set loss achieves the best performance of 34.8% AP, 57.1% AP50 and 40.9% AP75. These results demonstrate that the combination of the two loss terms is effective for box-supervised aerial instance segmentation.

$\mathcal{L}_{cons.}$   $\mathcal{L}_{levelset}$   AP     AP50   AP75
✓                                                  17.8   51.8    5.9
                        ✓                          25.3   56.0   21.3
✓                       ✓                          34.8   57.1   40.9
Table 1: Performance analysis of the level set and constraint losses on the Potsdam dataset.
Enlarged box region $\mathcal{B}^{*}$   AP     AP50   AP75
1.0×                                    33.2   59.8   36.0
1.5×                                    34.4   56.7   40.3
2.0×                                    36.1   60.4   42.1
2.5×                                    34.8   57.1   40.9
Table 2: Comparisons on the Potsdam dataset with different enlarged regions.
Methods           backbone   sched.   Ship   ST     BD     TC     BC     GTF    Bridge   LV     SV     HC    SP     RA     SBF    Plane   Harbor   AP     AP50   AP75
fully supervised methods:
Mask R-CNN [11]   R-50-C4    1×       32.5   30.0   45.3   73.2   30.0   23.1   14.8     30.2   10.2   5.4   25.2   26.6   37.4   30.3    17.0     28.8   51.8   27.7
PolarMask [41]    R-50-FPN   1×       35.7   32.5   44.4   74.4   37.5   13.4   10.8     30.0    8.5   3.4   24.5   29.6   32.3   21.6     9.6     27.2   48.5   27.3
CondInst [35]     R-50-FPN   1×       35.0   32.0   44.5   72.4   28.6   17.6   14.5     33.7   10.5   4.6   26.4   27.9   34.4   35.7    24.1     29.5   54.2   28.3
box-supervised methods:
BBTP [14]         R-50-FPN   1×       17.6   33.9   42.5   49.6   26.5   18.3    6.0      8.7   11.0   1.3   19.1   29.8   28.7    0.8     2.3     19.7   40.8   17.5
BoxInst [36]      R-50-FPN   1×       21.6   21.5   35.9   42.5   19.4   15.2    3.5     19.3    5.6   1.4   19.3   13.0   19.1   20.5     9.3     17.8   41.4   12.9
Ours              R-50-FPN   1×       24.6   24.4   37.2   62.8   35.5   18.9   14.0     28.9    9.7   1.3   17.8   21.8   28.9    6.0    15.2     23.1   48.4   19.4
Table 3: Class-wise instance segmentation results on the iSAID val set.
Figure 4: Some qualitative results of the proposed method on the iSAID test set.

Enlarged box region for level set evolution. As shown in Table 2, we measure the performance of different enlarged regions for level set evolution. The model achieves 33.2% AP with the 1.0× region, i.e., the original box area. The best performance of 36.1% AP, 60.4% AP50 and 42.1% AP75 is achieved with the 2.0× region, indicating that a larger enlarged region benefits level set evolution by introducing more characteristic information to be learned. However, the performance begins to drop slightly when the enlarged region is set to 2.5×, as too large an area may introduce noisy information.

Effectiveness of the class-wise parameter for level set evolution. We further study the effect of the class-wise $\rho_{cls}$ in Eq. (8) under different settings. In Table 4, we set different $\rho_{cls}$ values for curve evolution on the Potsdam dataset; the results indicate that this class-wise parameter influences the performance of the level set. Among the hand-tuned values, $\rho_{cls} = 0.65$ performs best, with 34.8% AP. To avoid manually adjusting the parameter, we also place a single fully connected layer on the potential mask map to adaptively learn $\rho_{cls}$ during training (see the sketch after Table 4). The results show that the self-adaptive $\rho_{cls}$ scheme achieves the best performance of 36.5% AP.

$\rho_{cls}$   1.50   1.0    0.65   0.35   self-adaptive
AP             30.8   31.1   34.8   33.2   36.5
AP50           57.4   56.2   57.1   59.8   63.2
AP75           31.7   32.5   40.9   36.0   40.6
Table 4: Analysis of the class-wise parameter for level set evolution on the Potsdam dataset.
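One plausible form of the self-adaptive scheme is sketched below: a single fully connected layer maps a pooled statistic of the potential mask map to a positive weight. The pooling step and the softplus activation are our own assumptions; the text above only specifies a single fully connected layer on the potential mask map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRho(nn.Module):
    """Hypothetical self-adaptive class-wise weight rho_cls."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(1, 1)  # single fully connected layer

    def forward(self, mask_map):
        # mask_map: (N, 1, H, W) potential mask maps
        pooled = mask_map.mean(dim=(2, 3))   # (N, 1) global statistic (assumed)
        return F.softplus(self.fc(pooled))   # positive rho_cls per instance
```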
Sample assignment method   AP     AP50   AP75   APbox
Max-IoU [33]               34.8   57.1   40.9   35.4
ATSS [46]                  36.1   60.4   42.1   37.3
Table 5: Effects of different sample assignment methods.
Figure 5: Visualization results of the proposed method on the Potsdam dataset.

Comparisons with different sample assignment methods. In our framework, the segmentation branch follows the detection branch for positive sample assignment, so the assignment method affects performance. As shown in Table 5, we compare Max-IoU [33] and ATSS [46]; ATSS achieves the better performance, with 36.1% mask AP and 37.3% box AP. Except for Table 5, we employ the same Max-IoU method as the other approaches, rather than ATSS, to facilitate fair comparisons.

4.4 Comparisons with the State-of-the-art Methods

Results on iSAID. iSAID is the de-facto benchmark dataset for aerial instance segmentation. We compare our approach against state-of-the-art methods under different forms of supervision; Table 3 lists the results. Our approach outperforms the state-of-the-art box-supervised method BoxInst [36], designed for the general scene, by 5.3% AP, 7.0% AP50 and 6.5% AP75, and surpasses BBTP [14] by 3.4% AP, 7.6% AP50 and 1.9% AP75. Furthermore, our method demonstrates promising performance compared with the top-performing fully supervised instance segmentation methods that use mask-level annotations: for the categories Basketball Court (BC), Ground Track Field (GTF), Bridge and Large Vehicle (LV), the proposed approach achieves competitive results. These encouraging results show that the proposed deep level set-based method is effective for aerial instance segmentation by curve evolution, greatly narrowing the performance gap between fully mask-supervised and weakly box-supervised instance segmentation. Qualitative results on iSAID are shown in Figure 4.

Results on Potsdam. As shown in Table 6, we evaluate the recent segmentation methods with the 2× training schedule on the Potsdam dataset, an urban scene with challenging car segmentation. Our method achieves the best box-supervised performance of 47.3% AP, outperforming BBTP [14] and BoxInst [36] by 17.6% AP and 8.9% AP, respectively. Compared with the mask-supervised methods, the proposed method achieves comparable results with a narrow gap. Figure 5 shows some visualization results on the Potsdam dataset.

Methods           backbone   sched.   AP     AP50   AP75
fully supervised methods:
Mask R-CNN [11]   R-50-C4    2×       51.7   72.6   62.1
PolarMask [41]    R-50-FPN   2×       50.0   71.0   59.9
CondInst [35]     R-50-FPN   2×       53.6   74.2   64.1
box-supervised methods:
BBTP [14]         R-50-FPN   2×       29.7   69.2   19.5
BoxInst [36]      R-50-FPN   2×       38.4   71.4   39.2
Ours              R-50-FPN   2×       47.3   71.9   44.7
Table 6: Comparisons with state-of-the-art methods on the Potsdam dataset.

5 Conclusion

This paper has proposed an effective deep level set-based network with box supervision for aerial instance segmentation. A detection branch and a segmentation branch are introduced and jointly trained with a unified positive sample assignment scheme. The well-designed energy function evolves accurate object boundaries against complex backgrounds and efficiently prevents inter- and intra-class interference in high-resolution aerial images. Experimental results on two high-resolution aerial datasets demonstrate that our approach achieves promising results with box supervision and attains performance comparable to mask-supervised methods on some categories of the large-scale aerial benchmarks. This work narrows the performance gap between fully supervised and box-supervised instance segmentation in aerial scenes.

In future work, we will extend this method to natural scenes such as the COCO dataset to verify its generalization and analyze its advantages and limitations.

References

  • [1] David Adalsteinsson and James A. Sethian. A fast level set method for propagating interfaces. Journal of Computational Physics, 118(2):269–277, 1995.
  • [2] Amy Bearman, Olga Russakovsky, Vittorio Ferrari, and Li Fei-Fei. What’s the point: Semantic segmentation with point supervision. In Proceedings of European Conference on Computer Vision (ECCV), pages 549–565, 2016.
  • [3] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. Yolact: Real-time instance segmentation. In Proceedings of IEEE/CVF International Conference on Computer Vision (ICCV), pages 9157–9166, 2019.
  • [4] T.F. Chan and L.A. Vese. Active contours without edges. IEEE Transactions on Image Processing, 10(2):266–277, 2001.
  • [5] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
  • [6] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of European Conference on Computer Vision (ECCV), pages 833–851, 2018.
  • [7] Jian Ding, Nan Xue, Yang Long, Gui-Song Xia, and Qikai Lu. Learning roi transformer for oriented object detection in aerial images. In Proceedings of IEEE/CVF Conference Computer Vision Pattern Recognition (CVPR), pages 2849–2858, 2019.
  • [8] Mark Everingham, Luc Gool, Christopher K. Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • [9] Haoqiang Fan, Hao Su, and Leonidas Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2463–2471, Jul. 2017.
  • [10] Jiaming Han, Jian Ding, Nan Xue, and Gui-Song Xia. Redet: A rotation-equivariant detector for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2786–2795, 2021.
  • [11] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of IEEE/CVF Conference Computer Vision Pattern Recognition (CVPR), pages 770–778, Jun. 2016.
  • [13] Namdar Homayounfar, Yuwen Xiong, Justin Liang, Wei-Chiu Ma, and Raquel Urtasun. Levelset r-cnn: A deep variational method for instance segmentation. In European Conference on Computer Vision, pages 555–571. Springer, 2020.
  • [14] Cheng Chun Hsu, Kuang Jui Hsu, Chung Chi Tsai, Yen Yu Lin, and Yung Yu Chuang. Weakly supervised instance segmentation using the bounding box tightness prior. In Proceedings of Annual Conference on Neural Information Processing Systems (NeurIPS), volume 32, pages 6582–6593, 2019.
  • [15] Ping Hu, Bing Shuai, Jun Liu, and Gang Wang. Deep level sets for salient object detection. In Proceedings of IEEE/CVF Conference Computer Vision Pattern Recognition (CVPR), pages 540–549, 2017.
  • [16] Anna Khoreva, Rodrigo Benenson, Jan Hosang, Matthias Hein, and Bernt Schiele. Simple does it: Weakly supervised instance and semantic segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1665–1674, 2017.
  • [17] Boah Kim and Jong Chul Ye. Mumford–shah loss functional for image segmentation with deep learning. IEEE Transactions on Image Processing, 29:1856–1866, 2019.
  • [18] Youngeun Kim, Seunghyeon Kim, Taekyung Kim, and Changick Kim. Cnn-based semantic segmentation using level set loss. In Proceedings of IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1752–1760, 2019.
  • [19] Viveka Kulharia, Siddhartha Chandra, Amit Agrawal, Philip H. S. Torr, and Ambrish Tyagi. Box2seg: Attention weighted loss and discriminative feature learning for weakly supervised segmentation. In Proceedings of European Conference on Computer Vision (ECCV), pages 290–308, 2020.
  • [20] T. Hoang Ngan Le, Kha Gia Quach, Khoa Luu, Chi Nhan Duong, and Marios Savvides. Reformulating level sets as deep recurrent neural network approach to semantic segmentation. IEEE Transactions on Image Processing, 27(5):2393–2407, 2018.
  • [21] Jungbeom Lee, Jihun Yi, Chaehun Shin, and Sungroh Yoon. Bbam: Bounding box attribution map for weakly supervised semantic and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2643–2652, 2021.
  • [22] Wentong Li, Wanyi Li, Feng Yang, and Peng Wang. Multi-scale object detection in satellite imagery based on yolt. In Proceedings of IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 162–165, 2019.
  • [23] Xiangtai Li, Hao He, Xia Li, Duo Li, Guangliang Cheng, Jianping Shi, Lubin Weng, Yunhai Tong, and Zhouchen Lin. Pointflow: Flowing semantics through points for aerial image segmentation. In Proceedings of IEEE/CVF Conference Computer Vision Pattern Recognition (CVPR), pages 4217–4226, 2021.
  • [24] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of IEEE/CVF Conference Computer Vision Pattern Recognition (CVPR), pages 936–944, 2017.
  • [25] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In Proceedings of European Conference on Computer Vision (ECCV), pages 740–755, 2014.
  • [26] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In Proceedings International Conference on 3D Vision (3DV), pages 565–571, 2016.
  • [27] Lichao Mou and Xiao Xiang Zhu. Vehicle instance segmentation from aerial image and video using a multitask learning residual fully convolutional network. IEEE Transactions on Geoscience and Remote Sensing, 56(11):6699–6711, 2018.
  • [28] David Bryant Mumford and Jayant Shah. Optimal approximations by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics, 42(5):577–685, 1989.
  • [29] Stanley Osher and James A. Sethian. Fronts propagating with curvature-dependent speed: algorithms based on hamilton-jacobi formulations. Journal of Computational Physics, 79(1):12–49, 1988.
  • [30] Ting Pan, Jian Ding, Jinwang Wang, Wen Yang, and Gui-Song Xia. Instance segmentation with oriented proposals for aerial images. In Proceedings of IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pages 988–991, 2020.
  • [31] Sida Peng, Wen Jiang, Huaijin Pi, Xiuli Li, Hujun Bao, and Xiaowei Zhou. Deep snake for real-time instance segmentation. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8533–8542, 2020.
  • [32] Jordi Pont-Tuset, Pablo Arbelaez, Jonathan T.Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):128–140, 2017.
  • [33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: towards real-time object detection with region proposal networks. In Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS), volume 28, pages 91–99, 2015.
  • [34] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. Grabcut: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309–314, 2004.
  • [35] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In Proceedings of European Conference on Computer Vision (ECCV), pages 282–298, 2020.
  • [36] Zhi Tian, Chunhua Shen, Xinlong Wang, and Hao Chen. Boxinst: High-performance instance segmentation with box annotations. In Proceedings of IEEE/CVF Conference Computer Vision Pattern Recognition (CVPR), pages 5443–5452, 2021.
  • [37] Xinggang Wang, Jiapei Feng, Bin Hu, Qi Ding, Longjin Ran, Xiaoxin Chen, and Wenyu Liu. Weakly-supervised instance segmentation via class-agnostic learning with salient images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10225–10235, 2021.
  • [38] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. Solov2: Dynamic and fast instance segmentation. arXiv preprint arXiv:2003.10152, 2020.
  • [39] Zian Wang, David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Object instance annotation with deep extreme level set evolution. In Proceedings of IEEE/CVF Conference Computer Vision Pattern Recognition (CVPR), pages 7500–7508, 2019.
  • [40] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3974–3983, 2018.
  • [41] Enze Xie, Peize Sun, Xiaoge Song, Wenhai Wang, Xuebo Liu, Ding Liang, Chunhua Shen, and Ping Luo. Polarmask: Single shot instance segmentation with polar representation. In Proceedings of IEEE/CVF Conference Computer Vision Pattern Recognition (CVPR), pages 12193–12202, 2020.
  • [42] Xingxing Xie, Gong Cheng, Jiabao Wang, Xiwen Yao, and Junwei Han. Oriented r-cnn for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3520–3529, 2021.
  • [43] Xue Yang, Jirui Yang, Junchi Yan, Yue Zhang, Tengfei Zhang, Zhi Guo, Xian Sun, and Kun Fu. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8232–8241, 2019.
  • [44] Syed Waqas Zamir, Aditya Arora, Akshita Gupta, Salman H. Khan, Guolei Sun, Fahad Shahbaz Khan, Fan Zhu, Ling Shao, Gui-Song Xia, and Xiang Bai. isaid: A large-scale dataset for instance segmentation in aerial images. In Proceedings of IEEE/CVF Conference Computer Vision Pattern Recognition Workshops (CVPRW), pages 28–37, 2019.
  • [45] Gang Zhang, Xin Lu, Jingru Tan, Jianmin Li, Zhaoxiang Zhang, Quanquan Li, and Xiaolin Hu. Refinemask: Towards high-quality instance segmentation with fine-grained features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6861–6869, 2021.
  • [46] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z. Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9759–9768, 2020.
  • [47] Zhuo Zheng, Yanfei Zhong, Junjue Wang, and Ailong Ma. Foreground-aware relation network for geospatial object segmentation in high spatial resolution remote sensing imagery. In Proceedings of IEEE/CVF Conference Computer Vision Pattern Recognition (CVPR), pages 4096–4105, 2020.