
Edge-aware Plug-and-play Scheme for Semantic Segmentation

Jianye Yi, Xiaopin Zhong, Weixiang Liu🖂,
Wenxuan Zhu, Zongze Wu, Yuanlong Deng
Lab. of Machine Vision and Inspection, College of Mechatronics and Control Engineering,
Shenzhen University, #3688 Nanhai Ave, Shenzhen, PR China
[email protected], [email protected], 🖂[email protected],
[email protected],[email protected], [email protected]
These authors contributed to the work equally and should be regarded as co-first authors.
Abstract

Semantic segmentation is a classic and fundamental computer vision problem dedicated to assigning each pixel its corresponding class. Some recent methods introduce edge-based information to improve segmentation performance. However, these methods are specific to certain network architectures and cannot be transferred to other models or tasks. Therefore, we propose an abstract and universal edge supervision method called the Edge-aware Plug-and-play Scheme (EPS), which can be easily and quickly applied to any semantic segmentation model. Its core is edge-thickness-preserving guidance for semantic segmentation. EPS first extracts the Edge Ground Truth (Edge GT) with a predefined edge thickness from the training data; then, for any network architecture, it directly copies the decoder head for the auxiliary task with Edge GT supervision. To ensure that the edge thickness is preserved consistently, we design a new boundary-based loss, called the Polar Hausdorff (PH) Loss, for the auxiliary supervision. We verify the effectiveness of EPS on the Cityscapes dataset using 22 models. The experimental results indicate that the proposed method can be seamlessly integrated into state-of-the-art (SOTA) models with zero modification, yielding promising improvements in segmentation performance.

1 Introduction

Semantic segmentation aims to achieve pixel-level classification by providing dense predictions for each pixel. With the rapid development of convolutional neural networks and the application of Transformer[26] in the field of computer vision, a series of deep learning-based semantic segmentation models have emerged, such as CNN-based FCN[18], DeepLab[2], PSPNet[36], CGNet[29], and ViT-based Segmenter[24] and SegFormer[30], etc. Researchers are always striving to propose new network structures to improve the performance of semantic segmentation. They usually approach the problem from the perspective of model design, resulting in specific and unique network structures. However, this approach may lead to overfitting during training and may not be easily applicable to various applications. Therefore, we believe that it is more effective to design a general and abstract scheme that is applicable to any model, rather than solely relying on model design.

Currently, in supervised semantic segmentation tasks, most researchers directly use the original annotated data for supervision. However, a minority of researchers have explored other features of the original data for more effective supervision, such as adding edge supervision. Edge detection is the task of extracting image edges[21]. In recent years, related works[38, 34, 16, 6, 4, 1] have embedded the results of edge detection into semantic segmentation and confirmed that edge supervision can effectively improve the accuracy of segmentation models. Due to the locality of the CNN's inductive bias, pooling of the feature maps is necessary to increase the receptive field (RF), which leads to blurring and uncertainty of segmentation boundaries[39, 19] and ultimately limits segmentation accuracy. Although ViT-based segmentation models have a global RF, there is currently no related work on adding edge supervision to them; this does not mean, however, that edge supervision is unimportant for ViT-based networks. Whether objects are divided into edges and bodies from a spatial-geometric perspective, or into high-frequency and low-frequency information from a frequency-domain perspective, these decompositions encode human experience. This partly explains why adding edge supervision as prior knowledge can improve model accuracy.

To leverage edge information as prior knowledge for improving segmentation performance, researchers have incorporated edge supervision tasks into their networks. Li et al.[16] decoupled the edges and bodies of targets, supervised them separately, and then fused the body features with the residual edge features. However, this approach requires specially designed decouplers and fusers, which are not easily transferable to other segmentation models. Zhang et al.[34] designed an auxiliary decoder that uses the edge features extracted from the first two layers of the CNN's multi-scale features for edge supervision. However, this method is only suitable for CNNs and is not applicable to ViT-based segmentation networks, since ViT does not provide multi-scale features. Chen et al.[4] proposed the SEMEDA framework, in which a semantic edge detection network extracts edges from the segmentation results for supervision. However, this structure is specific and may not be effective in other segmentation networks. To the best of our knowledge, all existing techniques that integrate edge supervision rely on network architectures tailored for this purpose and lack rapid transferability and universal applicability. In addition, they use the distribution-based cross-entropy loss, which has no spatial-geometric characteristics and is therefore unsuitable for edge images with spatial-geometric features.

We propose an Edge-aware Plug-and-play Scheme (EPS) to address the lack of easily pluggable and generally applicable edge supervision modules in semantic segmentation. This scheme exhibits universal applicability across various networks and is used solely during the training phase, ensuring that the model size and inference speed remain unaltered during testing. Additionally, we propose a Polar Hausdorff (PH) Loss, a simplified version of the Hausdorff distance (HD) Loss expressed in polar coordinates, to better utilize edge information. Given the edge thickness $d_e$, EPS calculates the kernel size to generate the Edge GT, and the optimization target of PH Loss is to constrain the edge thickness in the edge segmentation result to approach $d_e$. We conduct experiments on the Cityscapes dataset with 22 models in the MMSegmentation framework to demonstrate the effectiveness of EPS. Our contributions are summarized as follows:

  • We introduce a novel scheme called EPS that offers effective edge supervision to any semantic segmentation model, regardless of whether it is based on CNN or ViT.

  • In order to leverage edge information, we propose a novel boundary-based loss function, PH Loss, which restricts the thickness of edges and improves the accuracy of edge supervision signals.

  • Our experiments show that EPS can be seamlessly integrated into existing SOTA models without any modification, leading to improved model accuracy.

2 Related Work

2.1 Edge-supervised Segmentation

Fig. 1: The left-hand side shows a parallel framework with edge supervision, while the right-hand side shows a serial framework with edge supervision. $x$ denotes the input image, $\hat{y}$ represents the GT, and $\hat{y}_e$ is the Edge GT. $E$ and $D$ denote the Encoder and Decoder, respectively, while $D_e$ is the Edge Decoder.

In semantic segmentation tasks, the higher the segmentation accuracy of the target, the more accurate the segmentation of its edge parts tends to be, and vice versa. Therefore, effectively incorporating edge supervision in the segmentation model can improve the segmentation accuracy of the network. Currently, there are two main approaches to incorporating edge supervision in segmentation models: parallel supervision and series supervision (as shown in Figure 1).

Parallel supervision usually adds an auxiliary head outside the backbone, which takes the original input image $x$ as its input and uses the edge label $\hat{y}_e$ for supervision to improve the segmentation accuracy of the decoder head on the backbone. For example, EG-CNN [6] uses edge-gated layers to reconstruct the edges of the target, while ET-Net [34] uses the feature extraction ability of the first two layers of the CNN to extract low-level feature maps for edge supervision. However, these methods are not suitable for ViT-based models, which do not have multi-scale features. Chen et al. [1] combine an edge detection task with the segmentation task and, through a fusion network, fuse the results of edge detection and segmentation to improve the segmentation accuracy. Moreover, Li et al. [16] first extract the edges from the segmentation results, then use edge and body supervision to segment the edges and bodies separately, and finally design a fusion model to merge the two parts. However, the latter two methods require an additional fusion network to exploit edge supervision, so extracting edges and segmentation results separately and then merging them is cumbersome.

Series supervision attaches an auxiliary edge detector to the segmentation network, taking as input the predicted output $\hat{y}$ of the segmentation network. Through supervision with the edge labels $\hat{y}_e$, the edge information extraction capabilities of the backbone and decoder are influenced, resulting in improved segmentation accuracy. Zheng et al. proposed ELKPPNet [38], which extracts edges directly from the segmentation results with a traditional algorithm and computes their loss against the edge labels $\hat{y}_e$. SEMEDA [4], on the other hand, introduced a semantic edge detection network connected in series to the decoder head to exploit edge supervision information. However, such a series approach increases the architecture's depth, which may adversely affect back-propagation during edge supervision and potentially lower the primary network's edge information extraction capacity.

Most of the previous works above are designed with specific edge supervision modules for certain networks, whether in parallel or series supervision. However, such specific modules are difficult to transfer into other segmentation networks and their effectiveness cannot be guaranteed.

2.2 Boundary-based Loss

Fig. 2: For a semantic segmentation model with only one decoder head, the EPS involves creating a new auxiliary head by completely copying the original decoder head, without changing any of its structure. Then, the Edge GT obtained from GT by edge extraction is used for edge supervision. For a semantic segmentation model that already has an auxiliary head, we directly replace the GT with Edge GT to perform edge supervision on its auxiliary head without any other operations.

In semantic segmentation, the loss function plays a crucial role, as it can significantly affect network learning efficiency. Existing segmentation losses have been classified into four categories by Ma et al. [20] and Jadon et al. [11]: distribution-based losses, region-based losses, compound losses, and boundary-based losses. Different types of loss functions have different optimization objectives and focuses: distribution-based losses aim to improve overall classification accuracy, region-based losses aim to increase the overlap between predicted results and true labels, and compound losses combine the strengths of both. Boundary-based losses, on the other hand, approach the problem from a spatial-geometry perspective, constructing the loss from the distance between the boundaries of the predicted and ground truth labels. This study centers on boundary-based loss functions, among which the most prominent are the Boundary (BD) Loss [14] and the Hausdorff distance (HD) Loss [13].

BD Loss measures the quality of boundary prediction by calculating the distance between each pixel and the nearest boundary pixel in GT. The specific calculation formula can be defined as [14]:

\phi_G(q) = \begin{cases} -D_G(q), & q \in G \\ \ \ D_G(q), & q \notin G \end{cases}, \qquad (1)

\mathcal{L}_{BD} = \int_{\Omega} \phi_G(q)\, S_\theta(q)\, dq, \qquad (2)

where $\Omega$ refers to the entire image domain, $q \in \Omega$ is any pixel in the image, $G \subseteq \Omega$ is the region where the GT exists, with binary pixel values in $\{0, 1\}$, $S_\theta \subseteq \Omega$ is the predicted labeling area, with pixel values in $(0, 1)$, and $D_G(q)$ is the distance between the pixel $q$ and the nearest pixel on the boundary of the region $G$.
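To make Eqs. (1)–(2) concrete, the signed map $\phi_G$ can be obtained with a Euclidean distance transform of the GT mask. The following NumPy sketch is our own illustration of this computation (not the reference implementation of [14]); it assumes a binary GT mask and a soft prediction map.

```python
# A minimal NumPy sketch of BD Loss (Eqs. 1-2) for a binary GT mask and a soft
# prediction in [0, 1]; our own illustration, not the reference implementation of [14].
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred: soft foreground probabilities S_theta; gt: binary mask of the region G."""
    gt = gt.astype(bool)
    # phi_G(q): negative distance to the boundary inside G, positive outside G.
    phi = distance_transform_edt(~gt) - distance_transform_edt(gt)
    # Discrete counterpart of the integral over the image domain Omega.
    return float((phi * pred).mean())
```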

HD Loss measures the accuracy of boundary prediction by computing the Hausdorff distance between predicted and GT boundaries. Its formula can be expressed as:

d_{AB} = \max_i \big( \min_j\, d(a_i, b_j) \big), \qquad (3)
d_{BA} = \max_j \big( \min_i\, d(a_i, b_j) \big), \qquad (4)
\mathcal{L}_{HD} = \max(d_{AB}, d_{BA}) \qquad (5)

where $A$ is the set of pixels on the predicted boundary and $a_i$ is a pixel in it, $B$ is the set of pixels on the GT boundary and $b_j$ is a pixel in it, and $d(a_i, b_j)$ is the Euclidean distance between pixels $a_i$ and $b_j$.
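For reference, the symmetric Hausdorff distance of Eqs. (3)–(5) can be evaluated directly on the two boundary point sets; the sketch below uses SciPy's directed Hausdorff routine and is our own illustration, not the differentiable HD Loss of [13], which approximates this quantity with distance transforms.

```python
# Symmetric Hausdorff distance between two boundary point sets (Eqs. 3-5); a sketch,
# not the differentiable HD Loss of [13].
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(pred_boundary: np.ndarray, gt_boundary: np.ndarray) -> float:
    """Both inputs are (N, 2) arrays of boundary pixel coordinates."""
    d_ab = directed_hausdorff(pred_boundary, gt_boundary)[0]  # max over A of min over B
    d_ba = directed_hausdorff(gt_boundary, pred_boundary)[0]  # max over B of min over A
    return max(d_ab, d_ba)
```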

As both edge supervision and boundary-based losses are approached from a spatial-geometric perspective, their ideological origins are consistent. However, to the best of our knowledge, no research has yet proposed an edge supervision scheme combined with boundary-based losses for semantic segmentation tasks.

3 Methods

3.1 Edge-aware Plug-and-play Scheme

To address the issue that the aforementioned edge supervision methods are not easily applicable to other types of semantic segmentation networks, we propose the Edge-aware Plug-and-play Scheme (EPS). It is applicable to any semantic segmentation network and is easy to use. EPS consists of two main steps: extracting the Edge GT with a thickness of $d_e$, and copying the decoder head.

Firstly, the edge thickness $d_e$ reflects the degree of edge coarseness, and Edge GTs with different edge thicknesses have different effects on edge supervision[38]. To generate an Edge GT with a thickness of $d_e$, EPS uses the simplest edge extraction method: processing the GT image with a kernel of size $n \times n$. The relationship between $d_e$ and $n$, as well as the calculation of the kernel, is as follows:

n = 2 d_e + 1 \qquad (6)

kernel = \begin{bmatrix}
0 & \cdots & 0 & -1 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & -1 & 0 & \cdots & 0 \\
-1 & \cdots & -1 & 4d_e & -1 & \cdots & -1 \\
0 & \cdots & 0 & -1 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & -1 & 0 & \cdots & 0
\end{bmatrix} \qquad (7)
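As an illustration of Eqs. (6)–(7), a possible implementation of the Edge GT extraction is sketched below: it builds the cross-shaped kernel and marks every pixel with a non-zero filter response as an edge pixel. The function names and the use of SciPy are our own assumptions, not the authors' released code, and a binary GT mask is assumed.

```python
# A minimal sketch of the Edge GT extraction in EPS (Eqs. 6-7), assuming a binary GT
# mask; illustrative only.
import numpy as np
from scipy.ndimage import convolve

def make_edge_kernel(d_e: int) -> np.ndarray:
    """Cross-shaped (2*d_e + 1) x (2*d_e + 1) kernel of Eq. (7)."""
    n = 2 * d_e + 1
    kernel = np.zeros((n, n), dtype=np.float32)
    kernel[d_e, :] = -1.0           # central row arms
    kernel[:, d_e] = -1.0           # central column arms
    kernel[d_e, d_e] = 4.0 * d_e    # centre weight balances the 4*d_e arm entries
    return kernel

def extract_edge_gt(gt: np.ndarray, d_e: int = 2) -> np.ndarray:
    """Binary Edge GT: pixels within d_e of a class boundary give a non-zero response."""
    response = convolve(gt.astype(np.float32), make_edge_kernel(d_e), mode="nearest")
    return (np.abs(response) > 1e-6).astype(np.uint8)
```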

Secondly, EPS copies the decoder head to generate an auxiliary head whose skip connections are identical to those of the decoder head, but whose weights are not shared with it; its supervision signal is the Edge GT (see Figure 2). EPS is therefore an abstract strategy whose concrete implementation depends on the semantic segmentation network, so it is generally applicable, easy to implement, and plug-and-play. EPS only participates in updating the model parameters during training and can be completely discarded during inference, leaving the model size at inference time unchanged. Although this scheme is simple, our experiments prove its effectiveness.
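In PyTorch terms, the whole scheme amounts to a few lines: deep-copy the decoder head, feed it the same features, and supervise it with the Edge GT. The snippet below is a hedged sketch of this idea; `backbone` and `decode_head` stand for any segmentation model's components and are not a specific framework API, and the Edge GT is assumed to be an index map of the same form as the GT so that the copied head's output channels remain valid.

```python
# A hedged PyTorch sketch of an EPS training step; `backbone` and `decode_head` are
# placeholders for any segmentation model's components, not a specific framework API.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_aux_head(decode_head: nn.Module) -> nn.Module:
    """EPS step 2: duplicate the decoder head; the copy is trained with its own weights."""
    return copy.deepcopy(decode_head)

def eps_loss(backbone: nn.Module, decode_head: nn.Module, aux_head: nn.Module,
             image: torch.Tensor, gt: torch.Tensor, edge_gt: torch.Tensor) -> torch.Tensor:
    feats = backbone(image)
    seg_logits = decode_head(feats)   # main head, supervised by the original GT
    edge_logits = aux_head(feats)     # auxiliary head, supervised by the Edge GT
    # edge_gt is assumed to be a class-index map compatible with the copied head's channels.
    return F.cross_entropy(seg_logits, gt) + F.cross_entropy(edge_logits, edge_gt)
```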

3.2 Polar Hausdorff Loss

In EPS, we define the edge thickness $d_e$ of the Edge GT as prior knowledge, and its value is reflected by the kernel size. For example, to set the edge thickness of the Edge GT to $d_e = 3$, a $7 \times 7$ kernel is used to process the GT. As the distribution-based loss optimizes globally, it has no concept of edge thickness, so the thickness of the segmented edges is random and uncertain. On the other hand, the extracted Edge GT in EPS has an explicit, equally thick boundary with a thickness of $d_e$. To fully utilize this prior knowledge, we propose a boundary-based loss called the Polar Hausdorff (PH) Loss. PH Loss calculates the Hausdorff distance between the internal and external edges in the predicted image $\hat{p}$ (as shown in Fig. 3) and drives it toward $d_e$, the edge thickness of the Edge GT in EPS. This differs from HD Loss, which calculates the Hausdorff distance between the predicted image and the GT.

Fig. 3: On the left is the edge prediction image $\hat{p}$. On the right is the edge image $\hat{p}_e$ of $\hat{p}$, obtained by processing $\hat{p}$ with a thickness of $d_e = 1$.

Assuming the edge prediction image $\hat{p}$ is as shown in Fig. 3 (left), the edge image $\hat{p}_e$ with thickness 1 is extracted from it (Fig. 3, right). According to EPS, the edge thickness of the Edge GT is defined as a preset value $d_e$. Therefore, during training, the Hausdorff distance between the inner edge pixel set $P$ and the outer edge pixel set $Q$ in Fig. 3 (right) should tend toward $d_e$, which is the optimization goal of PH Loss. PH Loss is thus defined as:

\mathcal{L}_{PH} = \left| PHD_{P,Q}(\hat{p}, n) - d_e \right| \qquad (8)

where $PHD_{P,Q}(\hat{p}, n)$ refers to the Polar Hausdorff distance between the internal and external edges, and $d_e$ follows from the kernel size via equation (6).

Fig. 4: When computing $PHD_{P,Q}$, a ray is drawn from the polar coordinate center at an angle of $\theta = i \times 360^{\circ}/n$, and the intersection points of this ray with the inner and outer edges are selected.

The main difficulty in computing PH Loss lies in distinguishing the inner and outer edges of the edge image $\hat{p}_e$ in order to calculate the Polar Hausdorff distance between them, since classifying them with a per-pixel classifier is not feasible and would significantly increase the computational burden of the loss. To address this problem, we propose a more straightforward approach that uses the polar coordinate system to distinguish the inner and outer edges and compute their Hausdorff distance, as shown in Algorithm 1. Referring to Fig. 4, we first determine the geometric center of the set $P \cup Q$ on $\hat{p}_e$, and then convert the coordinates of all pixels in $P$ and $Q$ into polar coordinates about this center. We introduce a hyperparameter $n$ and draw $n$ rays outwards from the geometric center, with angles starting from $0^{\circ}$ and increasing by $360^{\circ}/n$ each time. For each angle $\theta$, we compute the set of intersection points $P_\theta \cup Q_\theta$ of the corresponding ray with the inner and outer edges, and the distance $\rho_\theta$ from each intersection point to the center. We select the minimum value $\rho_{min}$; its corresponding intersection point lies on the inner edge, in $P_\theta$. We consider that the distances among the intersection points in $P_\theta$ differ by less than the threshold $\delta = 2$, so all inner edge intersections $P_\theta$ are selected by thresholding, while the remaining intersections form the outer edge intersection set $Q_\theta$. We have thus separated the inner intersection set $P_\theta$ and the outer intersection set $Q_\theta$ for the ray with angle $\theta$. After traversing every ray angle $\theta$, we calculate the Euclidean distance $d_{ph}(\theta)$ between the farthest point in $P_\theta$ and the nearest point in $Q_\theta$. The final step is to select the maximum value of all $\{d_{ph}(\theta)\}$ as the Polar Hausdorff distance $PHD_{P,Q}$.
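A non-differentiable NumPy rendering of Algorithm 1 and of the PH Loss in Eq. (8) is given below as a sketch under our own assumptions (a binary edge prediction containing a single closed ring, and angular tolerance $\sigma$ in radians); the published implementation may differ in details.

```python
# A non-differentiable NumPy sketch of Algorithm 1 and the PH Loss of Eq. (8),
# assuming a binary edge prediction containing a single closed ring; illustrative only.
import numpy as np
from scipy.ndimage import convolve

def polar_hausdorff(p_hat: np.ndarray, n: int = 8, sigma: float = 0.1, delta: float = 2.0) -> float:
    p = (p_hat > 0.5).astype(np.float32)
    # One-pixel-thick edge image of the prediction (the d_e = 1 kernel of Eq. 7).
    k = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=np.float32)
    p_e = np.abs(convolve(p, k, mode="nearest")) > 1e-6
    ys, xs = np.nonzero(p_e)
    xc, yc = xs - xs.mean(), ys - ys.mean()           # centre the edge pixels
    rho = np.sqrt(xc ** 2 + yc ** 2)
    alpha = np.arctan2(yc, xc)                        # polar angle of each edge pixel
    d_ph = []
    for j in range(n):
        theta = 2 * np.pi * j / n
        # Angular difference wrapped to [-pi, pi]; keep pixels close to the ray at theta.
        diff = np.abs((alpha - theta + np.pi) % (2 * np.pi) - np.pi)
        hits = rho[diff < sigma]
        if hits.size < 2:
            continue                                  # the ray missed one of the edges
        inner = hits[hits - hits.min() < delta]       # inner-edge intersections P_theta
        outer = hits[hits - hits.min() >= delta]      # outer-edge intersections Q_theta
        if outer.size:
            d_ph.append(outer.min() - inner.max())
    return max(d_ph) if d_ph else 0.0

def ph_loss(p_hat: np.ndarray, d_e: int, n: int = 8) -> float:
    return abs(polar_hausdorff(p_hat, n) - d_e)       # Eq. (8)
```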

The PH Loss is an edge-supervision loss designed to be used together with EPS, and it involves two hyperparameters. One is the edge thickness $d_e$ used to extract $\hat{p}_e$, which is also a hyperparameter of PH Loss. The other is the number of rays drawn from the polar coordinate center, denoted $n$. In subsequent experiments, however, we found that PH Loss is robust to the choice of $n$, meaning that it has little effect on the results. We recommend setting $n = 8$.

4 Experiments

Fig. 5: For the purpose of visualizing the results of semantic segmentation using CGNet, the images are presented in a top-to-bottom order, including the GT images, the baseline result, and the result obtained after applying EPS.

4.1 Experimental Settings

Our proposed method was evaluated using the Cityscapes dataset, a well-known benchmark for semantic segmentation, which comprises 5000 high-resolution images with 19 categories. For hardware, we trained our models on a Linux server equipped with an Intel i7-12700K CPU and two 24 GB RTX 3090 Ti GPUs. For software, we utilized the PyTorch-based semantic segmentation framework MMSegmentation for all training and testing.

To ensure fair comparisons of experimental results, all models were trained with the official default parameters of MMSegmentation, such as the learning rate and momentum. For the loss function, we utilized the Cross-Entropy (CE) Loss as the base and combined it with the Polar Hausdorff (PH) Loss to achieve better segmentation accuracy. We reported segmentation accuracy using the standard mean Intersection over Union (mIoU) and mean Accuracy (mAcc) metrics.

input: $\hat{p}$ of size $w \times h$; $\sigma = 0.1$; $\delta = 2$; $n$
output: $PHD_{P,Q}$

1:  $d_{ph} = [\ ]$
2:  if $\hat{p} > 0.5$ then $\hat{p} = 1$ else $\hat{p} = 0$
3:  Get the edge image $\hat{p}_e$ of $\hat{p}$ with $d_e = 1$
4:  Get the pixel indices $(x, y)$ where $\hat{p}_e = 1$
5:  $(x_p, y_p) = (x, y) - \mathrm{mean}((x, y))$
6:  $\rho = \sqrt{x_p^2 + y_p^2}$
7:  $(\cos\alpha, \sin\alpha) = (x_p/\rho,\; y_p/\rho)$
8:  for $j$ in $\mathrm{range}(n)$ do
9:      $d_P = [\ ]$, $d_Q = [\ ]$, $d_{P\cup Q} = [\ ]$
10:     for $(x_i, y_i)$ in $(x_p, y_p)$ do
11:         if $|\alpha_i - 2\pi j/n| < \sigma$ then append $\rho_i$ to $d_{P\cup Q}$
12:     $d_{min} = \mathrm{Min}(d_{P\cup Q})$
13:     for $d_i$ in $d_{P\cup Q}$ do
14:         if $d_i - d_{min} < \delta$ then append $d_i$ to $d_P$ else append $d_i$ to $d_Q$
15:     Append $\mathrm{Min}(d_Q) - \mathrm{Max}(d_P)$ to $d_{ph}$
16: $PHD_{P,Q} = \mathrm{Max}(d_{ph})$
Algorithm 1: Calculate $PHD_{P,Q}(\hat{p}, n)$

4.2 Performance Comparison

Table 1: Comparing the results of baseline with EPS across different segmentation models.
Model Size Baseline EPS
Params mIoU mAcc mIoU mAcc
CGNet[29] 496.32k 66.84 80.12 68.68\uparrow 81.94\uparrow
ERFNet[22] 2.08M 66.08 74.65 70.08\uparrow 78.9\uparrow
MobileNetV3[9] 3.28M 58.12 67.76 65.82\uparrow 76.71\uparrow
SegFormer[30] 3.72M 76.28 83.89 76.88\uparrow 84.77\uparrow
HRNet[25] 9.64M 68.98 77.8 70.09\uparrow 78.99\uparrow
OCRNet[32] 12.08M 58.64 76.84 64.73\uparrow 82.54\uparrow
ICNet[35] 14.8M 68.44 77.45 68.46\uparrow 78.27\uparrow
STDC[5] 25.17M 53.30 62.55 58.62\uparrow 68.98\uparrow
BiSeNetV2[31] 28.5M 64.22 71.19 65.35\uparrow 74.03\uparrow
UNet[23] 29.06M 56.27 64.04 56.43 \uparrow 62.95
PointRend[15] 30.34M 61.04 70.28 63.26\uparrow 71.94\uparrow
EncNet[33] 35.89M 68.95 77.74 72.59 \uparrow 80.28 \uparrow
EMANet[17] 42.09M 64.10 71.53 65.20 \uparrow 75.37 \uparrow
ANN[40] 46.23M 51.26 63.21 57.17 \uparrow 66.32 \uparrow
PSPNet[37] 48.98M 70.32 78.13 68.51 76.09
CCNet[10] 49.83M 60.01 66.88 60.89 \uparrow 72.78 \uparrow
DANet[12] 49.85M 74.12 83.53 75.23 \uparrow 84.62 \uparrow
NonLocal Net[27] 50.02M 66.54 72.93 69.20 \uparrow 76.21 \uparrow
APCNet[8] 56.36M 45.38 58.45 51.2 \uparrow 64.72 \uparrow
DMNet[7] 53.18M 67.65 75.94 66.96 74.4
DeepLabV3[3] 68.11M 62.95 75.78 70.09 \uparrow 81.75 \uparrow
FastFCN[28] 68.71M 72.88 81.26 71.58 81.41 \uparrow
Table 2: Comparing the results of baseline with EPS + PH Loss across different segmentation models.
Model Baseline EPS EPS+PH
mIoU mAcc mIoU mAcc mIoU mAcc
CGNet[29] 66.84 80.12 68.68\uparrow 81.94\uparrow 70.16\upuparrows 82.26\upuparrows
ERFNet[22] 66.08 74.65 70.08\uparrow 78.9\uparrow 71.87\upuparrows 80.84\upuparrows
SegFormer[30] 76.28 83.89 76.88\uparrow 84.77\uparrow 77.03\upuparrows 85.03\upuparrows
STDC[5] 53.30 62.55 58.62\uparrow 68.98\uparrow 77.62\upuparrows 85.11\upuparrows

To verify the versatility of EPS across different semantic segmentation models, we conducted experiments where our models were trained using an Edge GT generated by a $5 \times 5$ kernel. We assessed nearly all semantic segmentation models within MMSegmentation, including both those with and without auxiliary heads, such as CGNet, MobileNetV3, ANN, BiSeNetV2, and others. The specific models utilized and corresponding experimental outcomes are presented in Table 1. We observed a noticeable improvement in both mIoU and mAcc for almost all state-of-the-art models after implementing the EPS strategy. This signifies the compatibility of EPS with various semantic segmentation models and its plug-and-play nature, which enables its direct use without any consideration of the model's unique attributes.

We selected four models, namely CGNet, ERFNet, SegFormer, and STDC, to further explore the effect of using PH Loss within EPS. The results are presented in Table 2. Based on our analysis, the utilization of EPS yields substantial performance gains over the SOTA baselines, and incorporating PH Loss further improves the models' accuracy.

4.3 Ablation Studies

Our proposed PH Loss is designed to be used in conjunction with the EPS framework, with two hyperparameters: the edge thickness $d_e$ of the Edge GT and the number $n$ of candidate distances $d_{ph}$ in $PHD_{P,Q}(\hat{p}, n)$. In order to investigate the effects of these two hyperparameters, we conducted a series of ablation experiments on CGNet, ERFNet, SegFormer, and STDC.

After conducting an analysis of Table 3, we observe that there is no evident regularity for selecting the kernel hyperparameter, and the optimal kernel varies among different models. However, performance is relatively favorable when the kernel is $5 \times 5$ or $7 \times 7$. Additionally, employing EPS led to an improvement in mIoU, irrespective of the kernel size. Analysis of Table 4 revealed that even with PH Loss, the optimal kernel varied across different models. Comparing the results for the same kernel and model in Table 3, almost all experiments using PH Loss performed better than those without it. In instances where the results of EPS with PH Loss were inferior to those of EPS with CE Loss, the cause may be insufficient training iterations. Finally, by analyzing Table 5, we found that the choice of hyperparameter $n$ has little impact on the results, but as $n$ increases, the computational complexity also increases. Consequently, it is recommended to select a smaller value for $n$, such as $n = 8$.

Table 3: Comparing the impact of hyperparameter $d_e$ on EPS.
Method $d_e$ kernel mIoU
CGNet ERFNet SegFormer STDC
Baseline - - 66.84 66.08 76.28 76.37
EPS 1 3×3 69.18 66.45 75.38 76.75
2 5×5 68.68 70.08 76.88 76.92
3 7×7 68.98 65.51 75.93 77.48
5 11×11 67.55 67.56 76.12 76.77
Table 4: Comparing the impact of hyperparameter $d_e$ on EPS with PH Loss ($n$=100).
Method $d_e$ kernel mIoU
CGNet ERFNet SegFormer STDC
Baseline - - 66.84 66.08 76.28 76.37
1 3×3 67.74 70.10 76.58 77.29
EPS 2 5×5 68.27 70.71 77.03 76.81
+PH Loss 3 7×7 70.16 70.43 76.98 76.77
5 11×11 69.05 71.87 76.90 76.85
Table 5: Comparing the impact of hyperparameter $n$ on EPS with PH Loss ($d_e=5$).
Method $n$ mIoU
CGNet ERFNet SegFormer STDC
EPS 100 69.05 71.87 76.90 76.85
+PH Loss 32 67.18 71.35 76.90 77.25
8 69.07 71.69 76.87 77.62

5 Conclusion

Our study investigated the limitations of edge supervision methods for semantic segmentation, specifically the difficulty of adapting these methods to different models. To address this challenge, we proposed a novel edge supervision scheme, EPS, which duplicates the architecture's decoder head for an auxiliary task with edge supervision. By integrating the prior knowledge of edge thickness, we developed a boundary-based loss function for the thickness-preserving task, which shows promising results in addressing the aforementioned challenge. The key advantage of EPS is its ability to integrate seamlessly and easily into any semantic segmentation model, reflecting an innovative approach to developing universally applicable strategies. Our experiments, conducted on 22 models using the Cityscapes dataset, demonstrate that EPS can improve upon state-of-the-art models. However, we acknowledge that there are still limitations to our study of EPS, including the lack of a stability and robustness analysis of PH Loss and the need for more comprehensive experiments across multiple datasets. In future research, we plan to explore additional plug-and-play schemes in depth, such as incorporating texture supervision, optimizing the calculation of PH Loss, and investigating the mechanism of PH Loss; to extend EPS to other tasks, e.g., instance segmentation and object detection; and to apply EPS in various fields, including medical imaging, autonomous driving, and industrial defect detection.

References

  • [1] Liang-Chieh Chen, Jonathan T Barron, George Papandreou, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4545–4554, 2016.
  • [2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  • [3] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [4] Yifu Chen, Arnaud Dapogny, and Matthieu Cord. Semeda: Enhancing segmentation precision with semantic edge aware loss. Pattern Recognition, 108:107557, 2020.
  • [5] Mingyuan Fan, Shenqi Lai, Junshi Huang, Xiaoming Wei, Zhenhua Chai, Junfeng Luo, and Xiaolin Wei. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9716–9725, 2021.
  • [6] Ali Hatamizadeh, Demetri Terzopoulos, and Andriy Myronenko. Edge-gated cnns for volumetric semantic segmentation of medical images. arXiv preprint arXiv:2002.04207, 2020.
  • [7] Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multi-scale filters for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [8] Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [9] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3. In The IEEE International Conference on Computer Vision (ICCV), pages 1314–1324, October 2019.
  • [10] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
  • [11] Shruti Jadon. A survey of loss functions for semantic segmentation. In 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pages 1–7. IEEE, 2020.
  • [12] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
  • [13] Davood Karimi and Septimiu E Salcudean. Reducing the hausdorff distance in medical image segmentation with convolutional neural networks. IEEE Transactions on medical imaging, 39(2):499–513, 2019.
  • [14] Hoel Kervadec, Jihene Bouchtiba, Christian Desrosiers, Eric Granger, Jose Dolz, and Ismail Ben Ayed. Boundary loss for highly unbalanced segmentation. In International conference on medical imaging with deep learning, pages 285–296. PMLR, 2019.
  • [15] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9799–9808, 2020.
  • [16] Xiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, and Yunhai Tong. Improving semantic segmentation via decoupled body and edge supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 435–452. Springer, 2020.
  • [17] Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. Expectation-maximization attention networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 9167–9176, 2019.
  • [18] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [19] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. Advances in neural information processing systems, 29, 2016.
  • [20] Jun Ma, Jianan Chen, Matthew Ng, Rui Huang, Yu Li, Chen Li, Xiaoping Yang, and Anne L Martel. Loss odyssey in medical image segmentation. Medical Image Analysis, 71:102035, 2021.
  • [21] David Marr and Ellen Hildreth. Theory of edge detection. Proceedings of the Royal Society of London. Series B. Biological Sciences, 207(1167):187–217, 1980.
  • [22] Eduardo Romera, José M Alvarez, Luis M Bergasa, and Roberto Arroyo. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1):263–272, 2017.
  • [23] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • [24] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7262–7272, 2021.
  • [25] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
  • [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [27] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [28] Huikai Wu, Junge Zhang, Kaiqi Huang, Kongming Liang, and Yizhou Yu. Fastfcn: Rethinking dilated convolution in the backbone for semantic segmentation. arXiv preprint arXiv:1903.11816, 2019.
  • [29] Tianyi Wu, Sheng Tang, Rui Zhang, Juan Cao, and Yongdong Zhang. Cgnet: A light-weight context guided network for semantic segmentation. IEEE Transactions on Image Processing, 30:1169–1179, 2020.
  • [30] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
  • [31] Changqian Yu, Changxin Gao, Jingbo Wang, Gang Yu, Chunhua Shen, and Nong Sang. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision, pages 1–18, 2021.
  • [32] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020.
  • [33] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [34] Zhijie Zhang, Huazhu Fu, Hang Dai, Jianbing Shen, Yanwei Pang, and Ling Shao. Et-net: A generic edge-attention guidance network for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part I 22, pages 442–450. Springer, 2019.
  • [35] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European conference on computer vision (ECCV), pages 405–420, 2018.
  • [36] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
  • [37] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
  • [38] Xianwei Zheng, Linxi Huan, Hanjiang Xiong, and Jianya Gong. Elkppnet: An edge-aware neural network with large kernel pyramid pooling for learning discriminative features in semantic segmentation. arXiv preprint arXiv:1906.11428, 2019.
  • [39] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856, 2014.
  • [40] Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 593–602, 2019.