
Edge-aware Plug-and-play Scheme for Semantic Segmentation

Jianye Yi, Xiaopin Zhong, Weixiang Liu🖂,
Wenxuan Zhu, Zongze Wu, Yuanlong Deng
Lab. of Machine Vision and Inspection, College of Mechatronics and Control Engineering,
Shenzhen University, #3688 Nanhai Ave, Shenzhen, PR China
[email protected], [email protected], 🖂[email protected],
[email protected],[email protected], [email protected]
These authors contributed to the work equally and should be regarded as co-first authors.
Abstract

Semantic segmentation is a classic and fundamental computer vision problem dedicated to assigning each pixel its corresponding class. Some recent methods introduce edge-based information to improve segmentation performance. However, these methods are specific to certain network architectures and cannot be transferred to other models or tasks. Therefore, we propose an abstract and universal edge supervision method called the Edge-aware Plug-and-play Scheme (EPS), which can be easily and quickly applied to any semantic segmentation model. Its core is edge-thickness-preserving guidance for semantic segmentation. EPS first extracts the Edge Ground Truth (Edge GT) with a predefined edge thickness from the training data; then, for any network architecture, it directly copies the decoder head for the auxiliary task with Edge GT supervision. To ensure that the edge thickness is preserved consistently, we design a new boundary-based loss, called the Polar Hausdorff (PH) Loss, for the auxiliary supervision. We verify the effectiveness of EPS on the Cityscapes dataset using 22 models. The experimental results indicate that the proposed method can be seamlessly integrated into state-of-the-art (SOTA) models with zero modification, yielding promising improvements in segmentation performance.

1 Introduction

Semantic segmentation aims to achieve pixel-level classification by providing dense predictions for each pixel. With the rapid development of convolutional neural networks and the application of Transformer[26] in the field of computer vision, a series of deep learning-based semantic segmentation models have emerged, such as CNN-based FCN[18], DeepLab[2], PSPNet[36], CGNet[29], and ViT-based Segmenter[24] and SegFormer[30], etc. Researchers are always striving to propose new network structures to improve the performance of semantic segmentation. They usually approach the problem from the perspective of model design, resulting in specific and unique network structures. However, this approach may lead to overfitting during training and may not be easily applicable to various applications. Therefore, we believe that it is more effective to design a general and abstract scheme that is applicable to any model, rather than solely relying on model design.

Currently, in supervised semantic segmentation tasks, most researchers directly use the original annotated data for supervision. However, a minority of researchers have explored other features of the original data for more effective supervision, such as adding edge supervision. Edge detection is the task of extracting image edges[21]. In recent years, related works[38, 34, 16, 6, 4, 1] have embedded the results of edge detection into semantic segmentation and confirmed that edge supervision can effectively improve the accuracy of segmentation models. Due to the locality of the CNN's inductive bias, pooling of the feature maps is necessary to increase the receptive field (RF), which leads to blurring and uncertainty of segmentation boundaries[39, 19] and ultimately limits segmentation accuracy. Although ViT-based segmentation models have a global RF, there is currently no related work on adding edge supervision to them; this does not mean, however, that edge supervision is unimportant for ViT-based networks. Whether objects are divided into edges and bodies from a spatial-geometric perspective, or into high-frequency and low-frequency information from a frequency-domain perspective, these decompositions encode human experience. This partly explains why adding edge supervision as prior knowledge can improve model accuracy.

To leverage edge information as prior knowledge for improving segmentation performance, researchers have incorporated edge supervision tasks into their networks. Li et al.[16] decoupled the edges and bodies of targets, supervised them separately, and then fused the body features with the residual edge features. However, this approach requires specially designed decouplers and fusers, which are not easily transferable to other segmentation models. Zhang et al.[34] designed an auxiliary decoder that uses the edge features extracted from the first two layers of the CNN's multi-scale features for edge supervision. However, this method is only suitable for CNNs and is not applicable to ViT-based segmentation networks, since ViT does not provide multi-scale features. Chen et al.[4] proposed the SEMEDA framework, in which a semantic edge detection network extracts edges from the segmentation results for supervision. However, this structure is specific and may not be effective in other segmentation networks. To the best of our knowledge, all existing techniques that integrate edge supervision rely on network architectures tailored for this purpose and lack rapid transferability and universal applicability. In addition, they use the distribution-based cross-entropy loss, which has no spatial-geometric characteristics and is therefore unsuitable for edge images with spatial-geometric features.

We propose an Edge-aware Plug-and-play Scheme (EPS) to address the lack of easily pluggable and generally applicable edge supervision modules in semantic segmentation. This scheme exhibits universal applicability across various networks and is used solely during the training phase, ensuring that the model size and inference speed remain unaltered during testing. Additionally, we propose a Polar Hausdorff (PH) Loss, a simplified version of the Hausdorff distance (HD) Loss expressed in polar coordinates, to better utilize edge information. Given the edge thickness $d_e$, EPS calculates the kernel size to generate the Edge GT, and the optimization target of PH Loss is to constrain the edge thickness in the edge segmentation result to approach $d_e$. We conduct experiments on the Cityscapes dataset with 22 models in the MMSegmentation framework to demonstrate the effectiveness of EPS. Our contributions are summarized as follows:

  • We introduce a novel scheme called EPS that offers effective edge supervision to any semantic segmentation model, regardless of whether it is based on CNN or ViT.

  • In order to leverage edge information, we propose a novel boundary-based loss function, PH Loss, which restricts the thickness of edges and improves the accuracy of edge supervision signals.

  • Our experiments show that EPS can be seamlessly integrated into existing SOTA models without any modification, leading to improved model accuracy.

2 Related Work

2.1 Edge-supervised Segmentation

Fig. 1: The left-hand side shows a parallel framework with edge supervision, while the right-hand side shows a serial framework with edge supervision. $x$ denotes the input image, $\hat{y}$ represents the GT, and $\hat{y}_e$ is the Edge GT. $E$ and $D$ denote the Encoder and Decoder, respectively, while $D_e$ is the Edge Decoder.

In semantic segmentation tasks, the higher the segmentation accuracy of the target, the more accurate the segmentation of its edge parts tends to be, and vice versa. Therefore, effectively incorporating edge supervision in the segmentation model can improve the segmentation accuracy of the network. Currently, there are two main approaches to incorporating edge supervision in segmentation models: parallel supervision and series supervision (as shown in Figure 1).

Parallel supervision usually adds an auxiliary head outside the backbone, which takes the original input image $x$ as its input and uses the edge label $\hat{y}_e$ for supervision to improve the segmentation accuracy of the decoder head on the backbone. For example, EG-CNN [6] uses edge-gated layers to reconstruct the edges of the target, while ET-Net [34] uses the feature extraction ability of the first two layers of the CNN to extract low-level feature maps for edge supervision. However, these methods are not suitable for ViT-based models, which do not have multi-scale features. Chen et al. [1] combine an edge detection task with the segmentation task and, through a fusion network, fuse the results of edge detection and segmentation to improve the segmentation accuracy. Moreover, Li et al. [16] first extract the edges from the segmentation results, then use edge and body supervision to segment the edges and bodies separately, and finally design a fusion model to merge the two parts. However, the latter two methods require an additional fusion network to exploit edge supervision, so extracting edges and segmentation results separately and then merging them is cumbersome.

Series supervision attaches an auxiliary edge detector to the segmentation network, taking as input the predicted output $\hat{y}$ of the segmentation network. Through supervision with the edge labels $\hat{y}_e$, the edge information extraction capabilities of the backbone and decoder are influenced, resulting in improved segmentation accuracy. Zheng et al. proposed ELKPPNet [38], which extracts edges directly from the segmentation results with a traditional algorithm and computes their loss against the edge labels $\hat{y}_e$. SEMEDA [4], on the other hand, introduced a semantic edge detection network connected in series to the decoder head to exploit edge supervision information. However, such a series approach increases the architecture's depth, which may adversely affect back-propagation during edge supervision and potentially lower the primary network's edge information extraction capacity.

Most of the previous works above are designed with specific edge supervision modules for certain networks, whether in parallel or series supervision. However, such specific modules are difficult to transfer into other segmentation networks and their effectiveness cannot be guaranteed.

2.2 Boundary-based Loss

Fig. 2: For a semantic segmentation model with only one decoder head, the EPS involves creating a new auxiliary head by completely copying the original decoder head, without changing any of its structure. Then, the Edge GT obtained from GT by edge extraction is used for edge supervision. For a semantic segmentation model that already has an auxiliary head, we directly replace the GT with Edge GT to perform edge supervision on its auxiliary head without any other operations.

In semantic segmentation, the loss function plays a crucial role, as it can significantly affect network learning efficiency. Existing segmentation losses have been classified into four categories by Ma et al. [20] and Jadon et al. [11]: distribution-based losses, region-based losses, compound losses, and boundary-based losses. Different types of loss functions have different optimization objectives and focuses: distribution-based losses aim to improve overall classification accuracy, region-based losses aim to increase the overlap between predicted results and true labels, and compound losses combine the strengths of both. Boundary-based losses, on the other hand, approach the problem from a spatial-geometry perspective, constructing the loss from the distance between the boundaries of the predicted and ground truth labels. This study centers on boundary-based loss functions, among which the most prominent are the Boundary (BD) Loss [14] and the Hausdorff distance (HD) Loss [13].

BD Loss measures the quality of boundary prediction by calculating the distance between each pixel and the nearest boundary pixel in GT. The specific calculation formula can be defined as [14]:

\phi_G(q) = \begin{cases} -D_G(q), & q \in G \\ \ \ D_G(q), & q \notin G \end{cases}, \qquad (1)

\mathcal{L}_{BD} = \int_{\Omega} \phi_G(q)\, S_\theta(q)\, dq, \qquad (2)

where $\Omega$ refers to the entire image domain, $q \in \Omega$ is any pixel in the image, $G \subseteq \Omega$ is the region where the GT exists, with binary pixel values in $\{0, 1\}$, $S_\theta \subseteq \Omega$ is the predicted labeling area, with pixel values in $(0, 1)$, and $D_G(q)$ is the distance between the pixel $q$ and the nearest pixel on the boundary of the region $G$.
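To make Eqs. (1)–(2) concrete, the signed map $\phi_G$ can be obtained with a Euclidean distance transform of the GT mask. The following NumPy sketch is our own illustration of this computation (not the reference implementation of [14]); it assumes a binary GT mask and a soft prediction map.

```python
# A minimal NumPy sketch of BD Loss (Eqs. 1-2) for a binary GT mask and a soft
# prediction in [0, 1]; our own illustration, not the reference implementation of [14].
import numpy as np
from scipy.ndimage import distance_transform_edt

def boundary_loss(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred: soft foreground probabilities S_theta; gt: binary mask of the region G."""
    gt = gt.astype(bool)
    # phi_G(q): negative distance to the boundary inside G, positive outside G.
    phi = distance_transform_edt(~gt) - distance_transform_edt(gt)
    # Discrete counterpart of the integral over the image domain Omega.
    return float((phi * pred).mean())
```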

HD Loss measures the accuracy of boundary prediction by computing the Hausdorff distance between predicted and GT boundaries. Its formula can be expressed as:

d_{AB} = \max_i \big( \min_j\, d(a_i, b_j) \big), \qquad (3)
d_{BA} = \max_j \big( \min_i\, d(a_i, b_j) \big), \qquad (4)
\mathcal{L}_{HD} = \max(d_{AB}, d_{BA}) \qquad (5)

where $A$ is the set of pixels on the predicted boundary and $a_i$ is a pixel in it, $B$ is the set of pixels on the GT boundary and $b_j$ is a pixel in it, and $d(a_i, b_j)$ is the Euclidean distance between pixels $a_i$ and $b_j$.
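For reference, the symmetric Hausdorff distance of Eqs. (3)–(5) can be evaluated directly on the two boundary point sets; the sketch below uses SciPy's directed Hausdorff routine and is our own illustration, not the differentiable HD Loss of [13], which approximates this quantity with distance transforms.

```python
# Symmetric Hausdorff distance between two boundary point sets (Eqs. 3-5); a sketch,
# not the differentiable HD Loss of [13].
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance(pred_boundary: np.ndarray, gt_boundary: np.ndarray) -> float:
    """Both inputs are (N, 2) arrays of boundary pixel coordinates."""
    d_ab = directed_hausdorff(pred_boundary, gt_boundary)[0]  # max over A of min over B
    d_ba = directed_hausdorff(gt_boundary, pred_boundary)[0]  # max over B of min over A
    return max(d_ab, d_ba)
```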

As both edge supervision and boundary-based losses are approached from a spatial-geometric perspective, their ideological origins are consistent. However, to the best of our knowledge, no research has yet proposed an edge supervision scheme combined with boundary-based losses for semantic segmentation tasks.

3 Methods

3.1 Edge-aware Plug-and-play Scheme

To address the issue that the aforementioned edge supervision methods are not easily applicable to other types of semantic segmentation networks, we propose the Edge-aware Plug-and-play Scheme (EPS). It is applicable to any semantic segmentation network and is easy to use. EPS consists of two main steps: extracting the Edge GT with a thickness of $d_e$, and copying the decoder head.

Firstly, the edge thickness $d_e$ reflects the degree of edge coarseness, and Edge GTs with different edge thicknesses have different effects on edge supervision[38]. To generate an Edge GT with a thickness of $d_e$, EPS uses the simplest edge extraction method: processing the GT image with a kernel of size $n \times n$. The relationship between $d_e$ and $n$, as well as the calculation of the kernel, is as follows:

n = 2 d_e + 1 \qquad (6)

kernel = \begin{bmatrix}
0 & \cdots & 0 & -1 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & -1 & 0 & \cdots & 0 \\
-1 & \cdots & -1 & 4d_e & -1 & \cdots & -1 \\
0 & \cdots & 0 & -1 & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \vdots & & \vdots \\
0 & \cdots & 0 & -1 & 0 & \cdots & 0
\end{bmatrix} \qquad (7)
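As an illustration of Eqs. (6)–(7), a possible implementation of the Edge GT extraction is sketched below: it builds the cross-shaped kernel and marks every pixel with a non-zero filter response as an edge pixel. The function names and the use of SciPy are our own assumptions, not the authors' released code, and a binary GT mask is assumed.

```python
# A minimal sketch of the Edge GT extraction in EPS (Eqs. 6-7), assuming a binary GT
# mask; illustrative only.
import numpy as np
from scipy.ndimage import convolve

def make_edge_kernel(d_e: int) -> np.ndarray:
    """Cross-shaped (2*d_e + 1) x (2*d_e + 1) kernel of Eq. (7)."""
    n = 2 * d_e + 1
    kernel = np.zeros((n, n), dtype=np.float32)
    kernel[d_e, :] = -1.0           # central row arms
    kernel[:, d_e] = -1.0           # central column arms
    kernel[d_e, d_e] = 4.0 * d_e    # centre weight balances the 4*d_e arm entries
    return kernel

def extract_edge_gt(gt: np.ndarray, d_e: int = 2) -> np.ndarray:
    """Binary Edge GT: pixels within d_e of a class boundary give a non-zero response."""
    response = convolve(gt.astype(np.float32), make_edge_kernel(d_e), mode="nearest")
    return (np.abs(response) > 1e-6).astype(np.uint8)
```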

Secondly, EPS copies the decoder head to generate an auxiliary head whose skip connections are identical to those of the decoder head, but whose weights are not shared with it; its supervision signal is the Edge GT (see Figure 2). EPS is therefore an abstract strategy whose concrete implementation depends on the semantic segmentation network, so it is generally applicable, easy to implement, and plug-and-play. EPS only participates in updating the model parameters during training and can be completely discarded during inference, leaving the model size at inference time unchanged. Although this scheme is simple, our experiments prove its effectiveness.
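In PyTorch terms, the whole scheme amounts to a few lines: deep-copy the decoder head, feed it the same features, and supervise it with the Edge GT. The snippet below is a hedged sketch of this idea; `backbone` and `decode_head` stand for any segmentation model's components and are not a specific framework API, and the Edge GT is assumed to be an index map of the same form as the GT so that the copied head's output channels remain valid.

```python
# A hedged PyTorch sketch of an EPS training step; `backbone` and `decode_head` are
# placeholders for any segmentation model's components, not a specific framework API.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def build_aux_head(decode_head: nn.Module) -> nn.Module:
    """EPS step 2: duplicate the decoder head; the copy is trained with its own weights."""
    return copy.deepcopy(decode_head)

def eps_loss(backbone: nn.Module, decode_head: nn.Module, aux_head: nn.Module,
             image: torch.Tensor, gt: torch.Tensor, edge_gt: torch.Tensor) -> torch.Tensor:
    feats = backbone(image)
    seg_logits = decode_head(feats)   # main head, supervised by the original GT
    edge_logits = aux_head(feats)     # auxiliary head, supervised by the Edge GT
    # edge_gt is assumed to be a class-index map compatible with the copied head's channels.
    return F.cross_entropy(seg_logits, gt) + F.cross_entropy(edge_logits, edge_gt)
```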

3.2 Polar Hausdorff Loss

In EPS, we define the edge thickness $d_e$ of the Edge GT as prior knowledge, and its value is reflected by the kernel size. For example, to set the edge thickness of the Edge GT to $d_e = 3$, a $7 \times 7$ kernel is used to process the GT. As the distribution-based loss optimizes globally, it has no concept of edge thickness, so the thickness of the segmented edges is random and uncertain. On the other hand, the extracted Edge GT in EPS has an explicit, equally thick boundary with a thickness of $d_e$. To fully utilize this prior knowledge, we propose a boundary-based loss called the Polar Hausdorff (PH) Loss. PH Loss calculates the Hausdorff distance between the internal and external edges in the predicted image $\hat{p}$ (as shown in Fig. 3) and drives it toward $d_e$, the edge thickness of the Edge GT in EPS. This differs from HD Loss, which calculates the Hausdorff distance between the predicted image and the GT.

Fig. 3: On the left is the edge prediction image $\hat{p}$. On the right is the edge image $\hat{p}_e$ of $\hat{p}$, obtained by processing $\hat{p}$ with a thickness of $d_e = 1$.

Assuming the edge prediction image $\hat{p}$ is as shown in Fig. 3 (left), the edge image $\hat{p}_e$ with thickness 1 is extracted from it (Fig. 3, right). According to EPS, the edge thickness of the Edge GT is defined as a preset value $d_e$. Therefore, during training, the Hausdorff distance between the inner edge pixel set $P$ and the outer edge pixel set $Q$ in Fig. 3 (right) should tend toward $d_e$, which is the optimization goal of PH Loss. PH Loss is thus defined as:

\mathcal{L}_{PH} = \left| PHD_{P,Q}(\hat{p}, n) - d_e \right| \qquad (8)

where $PHD_{P,Q}(\hat{p}, n)$ refers to the Polar Hausdorff distance between the internal and external edges, and $d_e$ follows from the kernel size via equation (6).

Fig. 4: When computing $PHD_{P,Q}$, a ray is drawn from the polar coordinate center at an angle of $\theta = i \times 360^{\circ}/n$, and the intersection points of this ray with the inner and outer edges are selected.

The main difficulty in computing PH Loss lies in distinguishing the inner and outer edges of the edge image $\hat{p}_e$ in order to calculate the Polar Hausdorff distance between them, since classifying them with a per-pixel classifier is not feasible and would significantly increase the computational burden of the loss. To address this problem, we propose a more straightforward approach that uses the polar coordinate system to distinguish the inner and outer edges and compute their Hausdorff distance, as shown in Algorithm 1. Referring to Fig. 4, we first determine the geometric center of the set $P \cup Q$ on $\hat{p}_e$, and then convert the coordinates of all pixels in $P$ and $Q$ into polar coordinates about this center. We introduce a hyperparameter $n$ and draw $n$ rays outwards from the geometric center, with angles starting from $0^{\circ}$ and increasing by $360^{\circ}/n$ each time. For each angle $\theta$, we compute the set of intersection points $P_\theta \cup Q_\theta$ of the corresponding ray with the inner and outer edges, and the distance $\rho_\theta$ from each intersection point to the center. We select the minimum value $\rho_{min}$; its corresponding intersection point lies on the inner edge, in $P_\theta$. We consider that the distances among the intersection points in $P_\theta$ differ by less than the threshold $\delta = 2$, so all inner edge intersections $P_\theta$ are selected by thresholding, while the remaining intersections form the outer edge intersection set $Q_\theta$. We have thus separated the inner intersection set $P_\theta$ and the outer intersection set $Q_\theta$ for the ray with angle $\theta$. After traversing every ray angle $\theta$, we calculate the Euclidean distance $d_{ph}(\theta)$ between the farthest point in $P_\theta$ and the nearest point in $Q_\theta$. The final step is to select the maximum value of all $\{d_{ph}(\theta)\}$ as the Polar Hausdorff distance $PHD_{P,Q}$.
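A non-differentiable NumPy rendering of Algorithm 1 and of the PH Loss in Eq. (8) is given below as a sketch under our own assumptions (a binary edge prediction containing a single closed ring, and angular tolerance $\sigma$ in radians); the published implementation may differ in details.

```python
# A non-differentiable NumPy sketch of Algorithm 1 and the PH Loss of Eq. (8),
# assuming a binary edge prediction containing a single closed ring; illustrative only.
import numpy as np
from scipy.ndimage import convolve

def polar_hausdorff(p_hat: np.ndarray, n: int = 8, sigma: float = 0.1, delta: float = 2.0) -> float:
    p = (p_hat > 0.5).astype(np.float32)
    # One-pixel-thick edge image of the prediction (the d_e = 1 kernel of Eq. 7).
    k = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=np.float32)
    p_e = np.abs(convolve(p, k, mode="nearest")) > 1e-6
    ys, xs = np.nonzero(p_e)
    xc, yc = xs - xs.mean(), ys - ys.mean()           # centre the edge pixels
    rho = np.sqrt(xc ** 2 + yc ** 2)
    alpha = np.arctan2(yc, xc)                        # polar angle of each edge pixel
    d_ph = []
    for j in range(n):
        theta = 2 * np.pi * j / n
        # Angular difference wrapped to [-pi, pi]; keep pixels close to the ray at theta.
        diff = np.abs((alpha - theta + np.pi) % (2 * np.pi) - np.pi)
        hits = rho[diff < sigma]
        if hits.size < 2:
            continue                                  # the ray missed one of the edges
        inner = hits[hits - hits.min() < delta]       # inner-edge intersections P_theta
        outer = hits[hits - hits.min() >= delta]      # outer-edge intersections Q_theta
        if outer.size:
            d_ph.append(outer.min() - inner.max())
    return max(d_ph) if d_ph else 0.0

def ph_loss(p_hat: np.ndarray, d_e: int, n: int = 8) -> float:
    return abs(polar_hausdorff(p_hat, n) - d_e)       # Eq. (8)
```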

The PH Loss is an edge-supervision loss designed to be used together with EPS, and it involves two hyperparameters. One is the edge thickness $d_e$ used to extract $\hat{p}_e$, which is also a hyperparameter of PH Loss. The other is the number of rays drawn from the polar coordinate center, denoted $n$. In subsequent experiments, however, we found that PH Loss is robust to the choice of $n$, meaning that it has little effect on the results. We recommend setting $n = 8$.

4 Experiments

Fig. 5: For the purpose of visualizing the results of semantic segmentation using CGNet, the images are presented in a top-to-bottom order, including the GT images, the baseline result, and the result obtained after applying EPS.

4.1 Experimental Settings

Our proposed method was evaluated using the Cityscapes dataset, a well-known benchmark for semantic segmentation, which comprises 5000 high-resolution images with 19 categories. For hardware, we trained our models on a Linux server equipped with an Intel i7-12700K CPU and two 24 GB RTX 3090 Ti GPUs. For software, we utilized the PyTorch-based semantic segmentation framework MMSegmentation for all training and testing.

To ensure fair comparisons of experimental results, all models were trained with the official default parameters of MMSegmentation, such as the learning rate and momentum. For the loss function, we utilized the Cross-Entropy (CE) Loss as the base and combined it with the Polar Hausdorff (PH) Loss to achieve better segmentation accuracy. We reported segmentation accuracy using the standard mean Intersection over Union (mIoU) and mean Accuracy (mAcc) metrics.

input: $\hat{p}$ of size $w \times h$; $\sigma = 0.1$; $\delta = 2$; $n$
output: $PHD_{P,Q}$

1:  $d_{ph} = [\ ]$
2:  if $\hat{p} > 0.5$ then $\hat{p} = 1$ else $\hat{p} = 0$
3:  Get the edge image $\hat{p}_e$ of $\hat{p}$ with $d_e = 1$
4:  Get the pixel indices $(x, y)$ where $\hat{p}_e = 1$
5:  $(x_p, y_p) = (x, y) - \mathrm{mean}((x, y))$
6:  $\rho = \sqrt{x_p^2 + y_p^2}$
7:  $(\cos\alpha, \sin\alpha) = (x_p/\rho,\; y_p/\rho)$
8:  for $j$ in $\mathrm{range}(n)$ do
9:      $d_P = [\ ]$, $d_Q = [\ ]$, $d_{P\cup Q} = [\ ]$
10:     for $(x_i, y_i)$ in $(x_p, y_p)$ do
11:         if $|\alpha_i - 2\pi j/n| < \sigma$ then append $\rho_i$ to $d_{P\cup Q}$
12:     $d_{min} = \mathrm{Min}(d_{P\cup Q})$
13:     for $d_i$ in $d_{P\cup Q}$ do
14:         if $d_i - d_{min} < \delta$ then append $d_i$ to $d_P$ else append $d_i$ to $d_Q$
15:     Append $\mathrm{Min}(d_Q) - \mathrm{Max}(d_P)$ to $d_{ph}$
16: $PHD_{P,Q} = \mathrm{Max}(d_{ph})$
Algorithm 1: Calculate $PHD_{P,Q}(\hat{p}, n)$

4.2 Performance Comparison

Table 1: Comparing the results of baseline with EPS across different segmentation models.
Model Size Baseline EPS
Params mIoU mAcc mIoU mAcc
CGNet[29] 496.32k 66.84 80.12 68.68\uparrow 81.94\uparrow
ERFNet[22] 2.08M 66.08 74.65 70.08\uparrow 78.9\uparrow
MobileNetV3[9] 3.28M 58.12 67.76 65.82\uparrow 76.71\uparrow
SegFormer[30] 3.72M 76.28 83.89 76.88\uparrow 84.77\uparrow
HRNet[25] 9.64M 68.98 77.8 70.09\uparrow 78.99\uparrow
OCRNet[32] 12.08M 58.64 76.84 64.73\uparrow 82.54\uparrow
ICNet[35] 14.8M 68.44 77.45 68.46\uparrow 78.27\uparrow
STDC[5] 25.17M 53.30 62.55 58.62\uparrow 68.98\uparrow
BiSeNetV2[31] 28.5M 64.22 71.19 65.35\uparrow 74.03\uparrow
UNet[23] 29.06M 56.27 64.04 56.43 \uparrow 62.95
PointRend[15] 30.34M 61.04 70.28 63.26\uparrow 71.94\uparrow
EncNet[33] 35.89M 68.95 77.74 72.59 \uparrow 80.28 \uparrow
EMANet[17] 42.09M 64.10 71.53 65.20 \uparrow 75.37 \uparrow
ANN[40] 46.23M 51.26 63.21 57.17 \uparrow 66.32 \uparrow
PSPNet[37] 48.98M 70.32 78.13 68.51 76.09
CCNet[10] 49.83M 60.01 66.88 60.89 \uparrow 72.78 \uparrow
DANet[12] 49.85M 74.12 83.53 75.23 \uparrow 84.62 \uparrow
NonLocal Net[27] 50.02M 66.54 72.93 69.20 \uparrow 76.21 \uparrow
APCNet[8] 56.36M 45.38 58.45 51.2 \uparrow 64.72 \uparrow
DMNet[7] 53.18M 67.65 75.94 66.96 74.4
DeepLabV3[3] 68.11M 62.95 75.78 70.09 \uparrow 81.75 \uparrow
FastFCN[28] 68.71M 72.88 81.26 71.58 81.41 \uparrow
Table 2: Comparing the results of baseline with EPS + PH Loss across different segmentation models.
Model Baseline EPS EPS+PH
mIoU mAcc mIoU mAcc mIoU mAcc
CGNet[29] 66.84 80.12 68.68\uparrow 81.94\uparrow 70.16\upuparrows 82.26\upuparrows
ERFNet[22] 66.08 74.65 70.08\uparrow 78.9\uparrow 71.87\upuparrows 80.84\upuparrows
SegFormer[30] 76.28 83.89 76.88\uparrow 84.77\uparrow 77.03\upuparrows 85.03\upuparrows
STDC[5] 53.30 62.55 58.62\uparrow 68.98\uparrow 77.62\upuparrows 85.11\upuparrows

To verify the versatility of EPS across different semantic segmentation models, we conducted experiments where our models were trained using an Edge GT generated by a $5 \times 5$ kernel. We assessed nearly all semantic segmentation models within MMSegmentation, including both those with and without auxiliary heads, such as CGNet, MobileNetV3, ANN, BiSeNetV2, and others. The specific models utilized and corresponding experimental outcomes are presented in Table 1. We observed a noticeable improvement in both mIoU and mAcc for almost all state-of-the-art models after implementing the EPS strategy. This signifies the compatibility of EPS with various semantic segmentation models and its plug-and-play nature, which enables its direct use without any consideration of the model's unique attributes.

We selected four models, namely CGNet, ERFNet, SegFormer, and STDC, to further explore the effect of using PH Loss within EPS. The results are presented in Table 2. Based on our analysis, the utilization of EPS yields substantial performance gains over the SOTA baselines, and incorporating PH Loss further improves the models' accuracy.

4.3 Ablation Studies

Our proposed PH Loss is designed to be used in conjunction with the EPS framework, with two hyperparameters: the edge thickness $d_e$ of the Edge GT and the number $n$ of candidate distances $d_{ph}$ in $PHD_{P,Q}(\hat{p}, n)$. In order to investigate the effects of these two hyperparameters, we conducted a series of ablation experiments on CGNet, ERFNet, SegFormer, and STDC.

After conducting an analysis of Table 3, we observe that there is no evident regularity for selecting the kernel hyperparameter, and the optimal kernel varies among different models. However, performance is relatively favorable when the kernel is $5 \times 5$ or $7 \times 7$. Additionally, employing EPS led to an improvement in mIoU, irrespective of the kernel size. Analysis of Table 4 revealed that even with PH Loss, the optimal kernel varied across different models. Comparing the results for the same kernel and model in Table 3, almost all experiments using PH Loss performed better than those without it. In instances where the results of EPS with PH Loss were inferior to those of EPS with CE Loss, the cause may be insufficient training iterations. Finally, by analyzing Table 5, we found that the choice of hyperparameter $n$ has little impact on the results, but as $n$ increases, the computational complexity also increases. Consequently, it is recommended to select a smaller value for $n$, such as $n = 8$.

Table 3: Comparing the impact of hyperparameter $d_e$ on EPS.
Method $d_e$ kernel mIoU
CGNet ERFNet SegFormer STDC
Baseline - - 66.84 66.08 76.28 76.37
EPS 1 3×3 69.18 66.45 75.38 76.75
2 5×5 68.68 70.08 76.88 76.92
3 7×7 68.98 65.51 75.93 77.48
5 11×11 67.55 67.56 76.12 76.77
Table 4: Comparing the impact of hyperparameter $d_e$ on EPS with PH Loss ($n$=100).
Method $d_e$ kernel mIoU
CGNet ERFNet SegFormer STDC
Baseline - - 66.84 66.08 76.28 76.37
1 3×3 67.74 70.10 76.58 77.29
EPS 2 5×5 68.27 70.71 77.03 76.81
+PH Loss 3 7×7 70.16 70.43 76.98 76.77
5 11×11 69.05 71.87 76.90 76.85
Table 5: Comparing the impact of hyperparameter $n$ on EPS with PH Loss ($d_e=5$).
Method $n$ mIoU
CGNet ERFNet SegFormer STDC
EPS 100 69.05 71.87 76.90 76.85
+PH Loss 32 67.18 71.35 76.90 77.25
8 69.07 71.69 76.87 77.62

5 Conclusion

Our study investigated the limitations of edge supervision methods for semantic segmentation, specifically the difficulty of adapting these methods to different models. To address this challenge, we proposed a novel edge supervision scheme, EPS, which duplicates the architecture's decoder head for an auxiliary task with edge supervision. By integrating the prior knowledge of edge thickness, we developed a boundary-based loss function for the thickness-preserving task, which shows promising results in addressing the aforementioned challenge. The key advantage of EPS is its ability to integrate seamlessly and easily into any semantic segmentation model, reflecting an innovative approach to developing universally applicable strategies. Our experiments, conducted on 22 models using the Cityscapes dataset, demonstrate that EPS can improve upon state-of-the-art models. However, we acknowledge that there are still limitations to our study of EPS, including the lack of a stability and robustness analysis of PH Loss and the need for more comprehensive experiments across multiple datasets. In future research, we plan to explore additional plug-and-play schemes in depth, such as incorporating texture supervision, optimizing the calculation of PH Loss, and investigating the mechanism of PH Loss; to extend EPS to other tasks, e.g., instance segmentation and object detection; and to apply EPS in various fields, including medical imaging, autonomous driving, and industrial defect detection.

References

  • [1] Liang-Chieh Chen, Jonathan T Barron, George Papandreou, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with task-specific edge detection using cnns and a discriminatively trained domain transform. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4545–4554, 2016.
  • [2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
  • [3] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [4] Yifu Chen, Arnaud Dapogny, and Matthieu Cord. Semeda: Enhancing segmentation precision with semantic edge aware loss. Pattern Recognition, 108:107557, 2020.
  • [5] Mingyuan Fan, Shenqi Lai, Junshi Huang, Xiaoming Wei, Zhenhua Chai, Junfeng Luo, and Xiaolin Wei. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9716–9725, 2021.
  • [6] Ali Hatamizadeh, Demetri Terzopoulos, and Andriy Myronenko. Edge-gated cnns for volumetric semantic segmentation of medical images. arXiv preprint arXiv:2002.04207, 2020.
  • [7] Junjun He, Zhongying Deng, and Yu Qiao. Dynamic multi-scale filters for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • [8] Junjun He, Zhongying Deng, Lei Zhou, Yali Wang, and Yu Qiao. Adaptive pyramid context network for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [9] Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, and Hartwig Adam. Searching for mobilenetv3. In The IEEE International Conference on Computer Vision (ICCV), pages 1314–1324, October 2019.
  • [10] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. Ccnet: Criss-cross attention for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
  • [11] Shruti Jadon. A survey of loss functions for semantic segmentation. In 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pages 1–7. IEEE, 2020.
  • [12] Jun Fu, Jing Liu, Haijie Tian, Yong Li, Yongjun Bao, Zhiwei Fang, and Hanqing Lu. Dual attention network for scene segmentation. In CVPR, 2019.
  • [13] Davood Karimi and Septimiu E Salcudean. Reducing the hausdorff distance in medical image segmentation with convolutional neural networks. IEEE Transactions on medical imaging, 39(2):499–513, 2019.
  • [14] Hoel Kervadec, Jihene Bouchtiba, Christian Desrosiers, Eric Granger, Jose Dolz, and Ismail Ben Ayed. Boundary loss for highly unbalanced segmentation. In International conference on medical imaging with deep learning, pages 285–296. PMLR, 2019.
  • [15] Alexander Kirillov, Yuxin Wu, Kaiming He, and Ross Girshick. Pointrend: Image segmentation as rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9799–9808, 2020.
  • [16] Xiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, and Yunhai Tong. Improving semantic segmentation via decoupled body and edge supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 435–452. Springer, 2020.
  • [17] Xia Li, Zhisheng Zhong, Jianlong Wu, Yibo Yang, Zhouchen Lin, and Hong Liu. Expectation-maximization attention networks for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 9167–9176, 2019.
  • [18] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [19] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. Advances in neural information processing systems, 29, 2016.
  • [20] Jun Ma, Jianan Chen, Matthew Ng, Rui Huang, Yu Li, Chen Li, Xiaoping Yang, and Anne L Martel. Loss odyssey in medical image segmentation. Medical Image Analysis, 71:102035, 2021.
  • [21] David Marr and Ellen Hildreth. Theory of edge detection. Proceedings of the Royal Society of London. Series B. Biological Sciences, 207(1167):187–217, 1980.
  • [22] Eduardo Romera, José M Alvarez, Luis M Bergasa, and Roberto Arroyo. Erfnet: Efficient residual factorized convnet for real-time semantic segmentation. IEEE Transactions on Intelligent Transportation Systems, 19(1):263–272, 2017.
  • [23] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • [24] Robin Strudel, Ricardo Garcia, Ivan Laptev, and Cordelia Schmid. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 7262–7272, 2021.
  • [25] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.
  • [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [27] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [28] Huikai Wu, Junge Zhang, Kaiqi Huang, Kongming Liang, and Yizhou Yu. Fastfcn: Rethinking dilated convolution in the backbone for semantic segmentation. arXiv preprint arXiv:1903.11816, 2019.
  • [29] Tianyi Wu, Sheng Tang, Rui Zhang, Juan Cao, and Yongdong Zhang. Cgnet: A light-weight context guided network for semantic segmentation. IEEE Transactions on Image Processing, 30:1169–1179, 2020.
  • [30] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
  • [31] Changqian Yu, Changxin Gao, Jingbo Wang, Gang Yu, Chunhua Shen, and Nong Sang. Bisenet v2: Bilateral network with guided aggregation for real-time semantic segmentation. International Journal of Computer Vision, pages 1–18, 2021.
  • [32] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020.
  • [33] Hang Zhang, Kristin Dana, Jianping Shi, Zhongyue Zhang, Xiaogang Wang, Ambrish Tyagi, and Amit Agrawal. Context encoding for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [34] Zhijie Zhang, Huazhu Fu, Hang Dai, Jianbing Shen, Yanwei Pang, and Ling Shao. Et-net: A generic edge-attention guidance network for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part I 22, pages 442–450. Springer, 2019.
  • [35] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European conference on computer vision (ECCV), pages 405–420, 2018.
  • [36] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017.
  • [37] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
  • [38] Xianwei Zheng, Linxi Huan, Hanjiang Xiong, and Jianya Gong. Elkppnet: An edge-aware neural network with large kernel pyramid pooling for learning discriminative features in semantic segmentation. arXiv preprint arXiv:1906.11428, 2019.
  • [39] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene cnns. arXiv preprint arXiv:1412.6856, 2014.
  • [40] Zhen Zhu, Mengde Xu, Song Bai, Tengteng Huang, and Xiang Bai. Asymmetric non-local neural networks for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 593–602, 2019.