
Corresponding author: Linchao Bao

Affiliations: 1. Shenzhen Institute for Advanced Study, UESTC, China; 2. Tencent AI Lab, China; 3. Fujian Normal University, China; 4. University of Technology Sydney, Australia; 5. University of Nottingham Ningbo China, China

CARD: Semantic Segmentation with Efficient Class-Aware Regularized Decoder

Ye Huang, Di Kang, Liang Chen, Wenjing Jia, Xiangjian He, Lixin Duan, Xuefei Zhe, and Linchao Bao
Abstract

Semantic segmentation has recently achieved notable advances by exploiting “class-level” contextual information during learning, e.g., the Object Contextual Representation (OCR) and Context Prior (CPNet) approaches. However, these approaches simply concatenate class-level information to pixel features to boost the pixel representation learning, which cannot fully utilize intra-class and inter-class contextual information. Moreover, these approaches learn soft class centers based on coarse mask predictions, which is prone to error accumulation. To better exploit class-level information, we propose a universal Class-Aware Regularization (CAR) approach to optimize the intra-class variance and inter-class distance during feature learning, motivated by the fact that humans can recognize an object by itself no matter which other objects it appears with. Specifically, CAR consists of three novel loss functions: the first encourages more compact class representations within each class, the second directly maximizes the distance between different class centers, and the third further pushes the distance between inter-class centers and pixels. Furthermore, the class centers in our approach are directly generated from the ground truth instead of from the error-prone coarse prediction. CAR can be directly applied to most existing segmentation models during training, including OCR and CPNet, and can largely improve their accuracy with no additional inference overhead. Moreover, we design a dedicated decoder for CAR (named CARD), which consists of a novel spatial token mixer and an upsampling module, to maximize its gain for existing baselines while being highly efficient in terms of computational cost. Extensive experiments and ablation studies conducted on multiple benchmark datasets demonstrate that the proposed CAR can boost the accuracy of all baseline models by up to 2.23% mIOU with superior generalization ability, and that CARD outperforms state-of-the-art approaches on multiple benchmarks with a highly efficient architecture. The code will be available at https://github.com/edwardyehuang/CAR.

keywords:
Class-aware regularizations, semantic segmentation
Refer to caption
Figure 1: The concept of the proposed CAR. Our CAR optimizes existing models with three regularization targets: 1) reducing pixels’ intra-class distance, 2) reducing inter-class center-to-center dependency, and 3) reducing pixels’ inter-class dependency. As highlighted in this example (indicated with a red dot in the image), with our CAR, the grass class does not affect the classification of dog/sheep as much as before, and hence successfully avoids previous (w/o CAR) mis-classification.

1 Introduction

The task of semantic segmentation is to predict a class label for each pixel in an image. It is a fundamental computer vision task that serves as a critical building block for various downstream tasks, such as scene understanding, image editing, and self-driving vehicles. After the seminal work FCN cFCN, which used fully convolutional networks to make the dense per-pixel segmentation task more efficient, many FCN-based approaches cPSPNet ; cDeepLab have been proposed and have greatly advanced segmentation accuracy on various benchmarks. Many of these methods have focused on better fusing spatial-domain context information to obtain more powerful feature representations (termed pixel features in this work) for the final per-pixel classification. For example, DeepLab cDeepLab and PSPNet cPSPNet utilized multi-scale features by constructing feature pyramids.

Recently, methods based on dot-product self-attention (SA) have become very popular since they can easily capture long-range relationships between pixels cNonLocal ; cDualAttention ; cOCNet ; cCFNet ; cEMANet ; cANNN ; cViT ; cDPT ; cSETR . SA aggregates information dynamically (different attention maps for different inputs) and selectively (by weighted averaging of spatial features according to their similarity scores). Significant progress has been made by using multi-scale and self-attention techniques during spatial information aggregation.

As complements to the above methods, many recent works have proposed various modules to utilize class-level contextual information. The class-level information is often represented by the class center/context prior, i.e., the mean feature of each class in an image. OCR cOCR and ACFNet cACFNet extract “soft” class centers as a weighted sum according to the predicted coarse segmentation mask. CPNet cCPN proposed a context prior map/affinity map, which indicates whether two spatial locations belong to the same class, and used this predicted context prior map for feature aggregation. However, these methods cOCR ; cACFNet ; cCPN simply concatenate the class-level features with the original pixel features for the final classification.

In this paper, we also focus on utilizing class-level information. Instead of focusing on how to better extract class-level features as in existing methods cOCR ; cACFNet ; cCPN , we use the simple but accurate average feature computed from the GT mask, and focus on maximizing the inter-class distance during feature learning. This mirrors how humans can robustly recognize an object by itself no matter which other objects it appears with.

Learning more separable features makes the features of a class less dependent upon other classes, resulting in improved generalization ability, especially when the training set contains only limited and biased class combinations (e.g., cows and grass, boats and beach). Fig. 1 illustrates an example of such a problem, where the classification of the dog and sheep depends on the grass class, causing them to be mis-classified as cow. In comparison, networks trained with our proposed CAR successfully generalize to these unseen class combinations.

To better achieve this goal, we propose CAR, a class-aware regularization module that optimizes the intra-class (class center) and inter-class dependencies during feature learning. Three loss functions are devised: the first encourages more compact class representations within each class, and the other two directly maximize the distance between different classes. Specifically, an intra-class center-to-pixel loss (termed “intra-c2p”, Eq. (3)) is first devised to produce a more compact representation within a class by minimizing the distance between all pixels and their class center. In our work, a class center is calculated as the averaged feature of all pixels belonging to the same class according to the GT mask. More compact intra-class representations leave a relatively large margin between classes, thus contributing to more separable representations. Then, an inter-class center-to-center loss (“inter-c2c”, Eq. (6)) is devised to maximize the distance between any two different class centers. This inter-class center-to-center loss alone does not necessarily produce separable representations for every individual pixel. Therefore, a third inter-class center-to-pixel loss (“inter-c2p”, Eq. (13)) is proposed to enlarge the distance between every class center and all pixels that do not belong to that class.

A preliminary version of this work was presented in cCAR , which proposed three class-aware regularization (CAR) terms and evaluated their effectiveness and universality by using them as a direct add-on to various state-of-the-art methods. Although effective, we notice two issues when using CAR as an add-on for some baselines: inefficiency introduced by the baselines (e.g., dilation and self-attention cNonLocal ) and decreased gain due to limited compatibility with the baselines (e.g., CCNet cCCNet ). In this extension, we design a dedicated class-aware regularized decoder (CARD) to overcome these two issues, greatly reducing the computational cost and achieving better performance. Specifically, a leading synced axial attention (SAA) is proposed right before CAR so that sparse attention obtains as much accuracy gain from CAR as full self-attention, and a lightweight pyramid upsampling module is proposed to replace the computation-heavy dilated convolution with minimal accuracy loss (see Fig. 4).

In summary, the contributions of this work are:

  1. We propose a universal class-aware regularization module that can be integrated into various segmentation models to largely improve the accuracy.

  2. We devise three novel regularization terms to achieve more separable and less class-dependent feature representations by minimizing the intra-class variance and maximizing the inter-class distance.

  3. We calculate the class centers directly from ground truth during training, thus avoiding the error accumulation issue of the existing methods and introducing no computational overhead during inference.

  4. We visualize pixel-level feature-similarity heatmaps for the inter-class features learned with and without our CAR to demonstrate they are indeed less related to each other.

  5. We propose a class-aware regularized decoder aiming for better efficiency and effectiveness for various backbones, achieving new state-of-the-art accuracies on multiple benchmarks while being highly efficient.

2 Related Work

2.1 Class Center.

In 2019 cACFNet ; cOCR , the concept of class center was introduced to describe the overall representation of each class from the categorical context perspective. In these approaches, the center representation of each class was determined by calculating the dot product of the feature map and the coarse prediction (i.e., a weighted average) from an auxiliary task branch supervised by the ground truth cPSPNet . After that, those intra-class centers are assigned to the corresponding pixels on the feature map. Furthermore, in 2020 cCPN , a learnable kernel and the one-hot ground truth were used to separate the intra-class centers from the inter-class centers, which were then concatenated with the original feature representation.

All of these works cOCR ; cACFNet ; cCPN have focused on extracting the intra-class (inter-class) centers, but they then simply concatenate the resultant class centers with the original pixel representations to produce the final logits. We argue that the categorical context information can be utilized in a more effective way so as to reduce the inter-class dependency.

To this end, we propose a CAR approach, where the extracted class center is used to directly regularize the feature extraction process so as to boost the differentiability of the learned feature representations (see Fig. 1) and reduce their dependency on other classes. Fig. 2 contrasts the two different designs. More details of the proposed CAR are provided in Sect. 3.

Refer to caption
(a) Design of OCR, ACFNet and CPNet
Refer to caption
(b) Our CAR
Figure 2: The difference between the proposed CAR and previous methods that use class-level information. Previous models focus on extracting class centers while simply concatenating the original pixel feature and the class/context feature for the later classification. In contrast, our CAR uses direct supervision related to the class centers as regularization during training, resulting in small intra-class variance and low inter-class dependency. See Fig. 1 and Sec. 3 for details.

2.2 Inter-Class Reasoning.

Recently, cHANet ; cDependencyNet studied the class dependency as a dataset prior and demonstrated that inter-class reasoning could improve the classification performance. For example, a car usually does not appear in the sky, and therefore the classification of sky can help reduce the chance of mis-classifying an object in the sky as a car. However, due to the limited training data, such class-dependency prior may also contain bias, especially when the desired class relation rarely appears in the training set.

Fig. 1 shows such an example. In the training set, cow and grass are dependent on each other. However, as shown in this example, when there is a dog or sheep standing on the grass, the class dependency learned from the limited training data may result in errors and lead the network to predict the target as a class that appears more often in the training data, i.e., cow in this case. In our CAR, we design inter-class and intra-class loss functions to reduce such inter-class dependency and achieve more robust segmentation results.

2.3 Spatial Context Aggregation

A spatial token mixer cMetaFormer performs context aggregation across pixel encodings. One of the most widely used token mixers is self-attention.

Self-attention. Self-attention, proposed in cNonLocal ; cAttentionIsAllYourNeed , has been widely used in semantic segmentation cDualAttention ; cOCNet ; cCFNet ; cANNN . Specifically, self-attention determines the similarity between a pixel and every other pixel in the feature map by calculating their dot product, followed by softmax normalization. With this attention map, the feature representation of a given pixel is enhanced by aggregating features from the whole feature map weighted by the aforementioned attention values, thus easily taking long-range relationships into consideration and yielding boosted performance. In self-attention, in order to achieve correct pixel classification, the representations of pixels belonging to the same class should be similar so as to gain greater weights in the final representation augmentation.
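As a minimal illustration of this aggregation, here is a PyTorch-style sketch (single head only; the query/key/value projections and other details of the cited works are omitted):

```python
import torch

def self_attention_aggregate(x):
    """Minimal single-head self-attention over a feature map.

    x: (B, C, H, W) feature tensor. Returns context-aggregated features
    of the same shape. Projection layers are omitted for brevity.
    """
    b, c, h, w = x.shape
    tokens = x.flatten(2).transpose(1, 2)                 # (B, HW, C)
    scores = tokens @ tokens.transpose(1, 2) / c ** 0.5   # (B, HW, HW) dot-product similarity
    attn = torch.softmax(scores, dim=-1)                  # normalize over all pixels
    out = attn @ tokens                                    # weighted average of all pixel features
    return out.transpose(1, 2).reshape(b, c, h, w)
```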

Sparse self-attention. Although regular self-attention cNonLocal performs very well for semantic segmentation, its computational cost is too high (i.e., $\mathcal{O}(H^{2}W^{2})$), especially for high-resolution input. Thus, many sparse alternatives to full self-attention have been proposed, including axial attention cAxialAttention , CCNet cCCNet , and CAA cCAA , achieving similar accuracy to self-attention but with greatly reduced computational cost.

2.4 Maintaining the Feature Map Resolution

In semantic segmentation, most backbones, including CNN-based cVGG ; cResnet ; cXception ; cResnest ; cEfficientNet ; cConvNeXT and Transformer-based cViT ; cSwin ones, are initially designed for image-level classification, where the resolution of the intermediate feature maps does not matter. Thus, they usually progressively downsample the feature map to 1/32 of the original resolution (i.e., output stride = 32), resulting in a large enough receptive field and greatly reduced computation.

Unlike image classification, semantic segmentation is essentially a per-pixel classification task, where the final output size is identical to that of the input image. Thus, upsampling is required at the final stage if the resolution of the intermediate results is smaller. However, output stride = 32 feature maps usually miss necessary segmentation details (e.g., boundaries, thin objects) that cannot be recovered via bilinear upsampling. Thus, maintaining higher-resolution feature maps is crucial. To this end, dilated convolution, which does not reduce the feature map resolution too much, and multi-scale pyramid-style feature upsampling (e.g., UNet/FPN) are widely adopted.

Dilation. Early approaches apply dilation (instead of stride) in the later stages of a CNN to stop further downsampling of the feature maps, resulting in output stride = 8 feature maps. However, the dilation modification introduces too much computation and is not applicable to Transformer-based backbones.

Pyramid-based upsampling. Many other approaches cUNet ; cFPN ; cPanopticFPN ; cDeepLabV3Plus ; cMask2Former utilize pyramid-based feature upsampling by fusing multi-scale features from different levels, achieving similar accuracy to dilation methods but with much less computation. Representative methods include UNet cUNet , FPN cFPN , and JPU cFastFCN . UNet- and FPN-based methods usually add low-level fine-grained feature maps (with optional convolution layers) and high-level coarse feature maps together. This direct addition of low-/high-level features sometimes makes training harder cDeepLabV3Plus . Instead, JPU concatenates low-/high-level feature maps and follows them with multiple parallel dilated convolutions, achieving better accuracy. We also use JPU-like pyramid upsampling for efficiency, but with some modifications to improve convergence and make this upsampling module compatible with backbones producing various numbers of feature maps.

3 Methodology

Refer to caption
Figure 3: The proposed CAR approach. CAR can be inserted into various segmentation models, right before the logit prediction module (A1-A4). CAR contains three regularization terms, including (C) the intra-class center-to-pixel loss $\mathcal{L}_{\text{intra-c2p}}$ (Sec. 3.2.2), (D) the inter-class center-to-center loss $\mathcal{L}_{\text{inter-c2c}}$ (Sec. 3.3.2), and (E) the inter-class center-to-pixel loss $\mathcal{L}_{\text{inter-c2p}}$ (Sec. 3.3.3).

3.1 Extracting Class Centers from Ground Truth

Denote a feature map and its corresponding resized one-hot encoded ground-truth mask as $\mathbf{X}\in\mathbb{R}^{H\times W\times C}$ and $\mathbf{Y}\in\mathbb{R}^{H\times W\times N_{\text{class}}}$, respectively, where $H$, $W$ and $C$ denote the height, width and number of channels. We first obtain the spatially flattened class mask $\mathbf{Y}_{\text{flat}}\in\mathbb{R}^{HW\times N_{\text{class}}}$ and the flattened feature map $\mathbf{X}_{\text{flat}}\in\mathbb{R}^{HW\times C}$. Then, the class center (termed class center in cACFNet and object region representation in cOCR ), which is the average of all pixel features of a class, can be calculated as:

$$\bm{\mu}_{\text{image}}=\frac{\mathbf{Y}_{\text{flat}}^{T}\cdot\mathbf{X}_{\text{flat}}}{\mathbf{N}_{\text{non-zero}}}\in\mathbb{R}^{N_{\text{class}}\times C}, \qquad (1)$$

where $\mathbf{N}_{\text{non-zero}}$ denotes the number of non-zero values in the corresponding map of the ground-truth mask $\mathbf{Y}$. In our experiments, to alleviate the negative impact of noisy images, we calculate the class centers using all the training images in a batch and denote them as $\bm{\mu}_{\text{batch}}$ (written as $\bm{\mu}$ below for clarity; see the appendix for details).
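For concreteness, a PyTorch-style sketch of Eq. (1) follows (the function name and the batch-level reduction are our own hedged reading of the text):

```python
import torch

def class_centers(x_flat, y_flat, eps=1e-5):
    """Per-class mean features (Eq. (1)).

    x_flat: (HW, C) flattened feature map.
    y_flat: (HW, N_class) one-hot ground-truth mask resized to HW.
    Returns (N_class, C) class centers; classes absent from the image
    simply yield an all-zero row.
    """
    n_nonzero = y_flat.sum(dim=0, keepdim=True).T      # (N_class, 1) pixel counts per class
    return (y_flat.T @ x_flat) / (n_nonzero + eps)     # average feature per class

# Batch-level centers (mu_batch in the paper) can be obtained by first
# concatenating x_flat / y_flat over all images in the batch.
```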

3.2 Reducing Intra-class Feature Variance

3.2.1 Motivation.

More compact intra-class representations lead to a relatively larger margin between classes and therefore result in more separable features. To reduce the intra-class feature variance, existing works cNonLocal ; cDualAttention ; cANNN ; cCPN ; cEMANet ; cOCNet usually use self-attention to calculate dot-product similarity in the spatial domain, implicitly encouraging similar pixels to have more compact distances. For example, the self-attention in cNonLocal implicitly pushes the feature representations of pixels belonging to the same class to be more similar to each other than to those of pixels belonging to other classes. In our work, we devise a simple intra-class center-to-pixel loss to guide the training, which achieves this goal very effectively and produces improved accuracy.

3.2.2 Intra-class Center-to-pixel Loss.

We define a simple but effective intra-class center-to-pixel loss to suppress the intra-class feature variance by penalizing large distances between a pixel feature and its class center. The intra-class center-to-pixel loss $\mathcal{L}_{\text{intra-c2p}}$ is defined as:

$$\mathcal{L}_{\text{intra-c2p}}=f_{\text{mse}}(\mathcal{D}_{\text{intra-c2p}}), \qquad (2)$$

where

$$\mathcal{D}_{\text{intra-c2p}}=(1-\mathbf{\sigma})\,\lvert\mathbf{Y}_{\text{flat}}\cdot\bm{\mu}-\mathbf{X}_{\text{flat}}\rvert. \qquad (3)$$

In Eq. (3), $\mathbf{\sigma}$ is a spatial mask indicating pixels to be ignored (i.e., the ignore label), and $\mathbf{Y}_{\text{flat}}\cdot\bm{\mu}$ distributes the class centers $\bm{\mu}$ to the corresponding positions in each image. Thus, our intra-class loss $\mathcal{L}_{\text{intra-c2p}}$ pushes the pixel representations toward their corresponding class centers by applying the mean squared error (MSE) in Eq. (2) to the distances in Eq. (3).
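A hedged PyTorch-style sketch of Eqs. (2)-(3) (variable names are ours; `valid_mask` plays the role of $1-\sigma$):

```python
import torch

def intra_c2p_loss(x_flat, y_flat, centers, valid_mask):
    """Intra-class center-to-pixel loss, Eqs. (2)-(3).

    x_flat:     (HW, C) pixel features.
    y_flat:     (HW, N_class) one-hot ground truth.
    centers:    (N_class, C) class centers (e.g., from class_centers()).
    valid_mask: (HW, 1) with 1 for valid pixels and 0 for ignored ones.
    """
    target = y_flat @ centers                     # distribute each pixel's class center to it
    dist = valid_mask * (target - x_flat).abs()   # masked |Y_flat . mu - X_flat|, Eq. (3)
    return (dist ** 2).mean()                     # MSE penalty, Eq. (2)
```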

3.3 Maximizing Inter-class Separation

3.3.1 Motivation.

Humans can robustly recognize an object by itself regardless of which other objects it appears with. Conversely, if a classifier relies heavily on information from other classes to determine the classification result, it will easily produce wrong results when a rare class combination appears during inference. Maximizing the inter-class separation, or in other words, reducing the inter-class dependency, can therefore help the network generalize better, especially when the training set is small or biased. As shown in Fig. 1, the dog and sheep are mis-classified as cow because cow and grass appear together more often in the training set. To improve the robustness of the model, we propose to reduce this inter-class dependency. To this end, the following two loss functions are defined.

3.3.2 Inter-class Center-to-center Loss.

The first loss function maximizes the distance between any two different class centers. Inspired by the center loss used in face recognition cCenterLoss , we propose to reduce the similarity between the class centers $\bm{\mu}$, which are the averaged features of each class calculated according to the GT mask. The inter-class relation is defined by the dot-product similarity cAttentionIsAllYourNeed between any two classes as:

$$\mathbf{A}_{\text{c2c}}=\text{softmax}\Big(\frac{\bm{\mu}\cdot\bm{\mu}^{T}}{\sqrt{C}}\Big),\quad\mathbf{A}_{\text{c2c}}\in\mathbb{R}^{N_{\text{class}}\times N_{\text{class}}}. \qquad (4)$$

Moreover, since we only need to constrain the inter-class distance, only the non-diagonal elements are retained for the later loss calculation:

$$\mathbf{D}_{\text{inter-c2c}}=\Big(1-eye(N_{\text{class}})\Big)\,\mathbf{A}_{\text{c2c}}. \qquad (5)$$

We only penalize similarity values between two different classes that are larger than a pre-defined threshold $\frac{\epsilon_{0}}{N_{\text{class}}-1}$, i.e.,

$$\mathcal{D}_{\text{inter-c2c}}=f_{\text{sum}}\Big(\text{max}\big(\mathbf{D}_{\text{inter-c2c}}-\frac{\epsilon_{0}}{N_{\text{class}}-1},\,0\big)\Big). \qquad (6)$$

Thus, the inter-class center-to-center loss $\mathcal{L}_{\text{inter-c2c}}$ is defined as:

$$\mathcal{L}_{\text{inter-c2c}}=f_{\text{mse}}(\mathcal{D}_{\text{inter-c2c}}). \qquad (7)$$

Here, a small margin is used in consideration of the size of the feature space and possibly mislabeled ground truth.
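A PyTorch-style sketch of Eqs. (4)-(7) under our reading (the axis over which $f_{\text{sum}}$ sums is not stated explicitly in the text, so summing over the class axis per row is an assumption):

```python
import torch

def inter_c2c_loss(centers, eps0=0.5):
    """Inter-class center-to-center loss, Eqs. (4)-(7).

    centers: (N_class, C) class centers.
    eps0:    margin threshold (0.5 in the paper).
    """
    n_class, c = centers.shape
    sim = torch.softmax(centers @ centers.T / c ** 0.5, dim=-1)        # Eq. (4)
    off_diag = (1 - torch.eye(n_class, device=centers.device)) * sim   # Eq. (5)
    excess = (off_diag - eps0 / (n_class - 1)).clamp(min=0).sum(-1)    # Eq. (6), summed per class
    return (excess ** 2).mean()                                        # Eq. (7), MSE penalty
```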

3.3.3 Inter-class Center-to-pixel Loss.

Maximizing only the distances between class centers does not necessarily result in separable representations for every individual pixel. We further maximize the distance between a class center and any pixel that does not belong to this class. More concretely, we first compute the center-to-pixel dot product as:

$$\mathbf{\Lambda}_{\text{c2p}}=\mathbf{X}_{\text{flat}}\cdot\bm{\mu}^{T},\quad\mathbf{\Lambda}_{\text{c2p}}\in\mathbb{R}^{HW\times N_{\text{class}}}. \qquad (8)$$

Ideally, with the previous loss $\mathcal{L}_{\text{inter-c2c}}$, the features of all pixels belonging to the same class should be equal to those of the class center. Therefore, we replace the intra-class dot product with its ideal value, namely using the class centers $\bm{\mu}$ to calculate the intra-class dot product as:

$$\mathbf{\Lambda}_{c}=diag(\bm{\mu}\cdot\bm{\mu}^{T}), \qquad (9)$$

and the replacement is achieved by masking as:

$$\mathbf{\Lambda^{\prime}}=\mathbf{\Lambda}_{\text{c2p}}(1-\mathbf{Y}_{\text{flat}})+\mathbf{\Lambda}_{c}. \qquad (10)$$

This updated dot product $\mathbf{\Lambda^{\prime}}$ is then used to calculate the similarity along the class axis with a softmax as:

$$\mathbf{A}_{\text{c2p}}=\text{softmax}(\mathbf{\Lambda^{\prime}}),\quad\mathbf{A}_{\text{c2p}}\in\mathbb{R}^{HW\times N_{\text{class}}}. \qquad (11)$$

Similar to the calculation of $\mathcal{L}_{\text{inter-c2c}}$ in the previous subsection, we have

$$\mathbf{D}_{\text{inter-c2p}}=(1-\mathbf{Y}_{\text{flat}})\,\mathbf{A}_{\text{c2p}}, \qquad (12)$$

$$\mathcal{D}_{\text{inter-c2p}}=f_{\text{sum}}\Big(\text{max}\big(\mathbf{D}_{\text{inter-c2p}}-\frac{\epsilon_{1}}{N_{\text{class}}-1},\,0\big)\Big). \qquad (13)$$

Thus, the inter-class center-to-pixel loss $\mathcal{L}_{\text{inter-c2p}}$ is defined as:

$$\mathcal{L}_{\text{inter-c2p}}=f_{\text{mse}}(\mathcal{D}_{\text{inter-c2p}}). \qquad (14)$$
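A PyTorch-style sketch of Eqs. (8)-(14) under our reading (the ideal intra-class value is placed only at each pixel's ground-truth class, which is how we interpret the “replacement” in Eq. (10); as before, $f_{\text{sum}}$ is assumed to sum over the class axis):

```python
import torch

def inter_c2p_loss(x_flat, y_flat, centers, eps1=0.25):
    """Inter-class center-to-pixel loss, Eqs. (8)-(14).

    x_flat:  (HW, C) pixel features.
    y_flat:  (HW, N_class) one-hot ground truth.
    centers: (N_class, C) class centers.
    eps1:    margin threshold (0.25 in the paper).
    """
    n_class = centers.shape[0]
    c2p = x_flat @ centers.T                                      # Eq. (8): (HW, N_class)
    ideal = (centers * centers).sum(-1)                           # Eq. (9): diagonal of center-to-center products
    lam = c2p * (1 - y_flat) + y_flat * ideal                     # Eq. (10): replace the intra-class term
    attn = torch.softmax(lam, dim=-1)                             # Eq. (11): softmax over classes
    inter = (1 - y_flat) * attn                                   # Eq. (12): keep non-GT classes only
    excess = (inter - eps1 / (n_class - 1)).clamp(min=0).sum(-1)  # Eq. (13)
    return (excess ** 2).mean()                                   # Eq. (14)
```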

3.4 Differences with OCR, ACFNet, CPNet and CIPC

Methods that are closely related to ours are OCR cOCR , ACFNet cACFNet and CPNet cCPN , which all focus on better utilizing class-level features and differ in how they extract the class centers and context features. However, they all use a simple concatenation to fuse the original pixel feature and the complementary context feature. For example, OCR and ACFNet first produce a coarse segmentation, which is supervised by the GT mask with a categorical cross-entropy loss, and then use this predicted coarse mask to generate (soft) class centers by weighted summing all the pixel features. OCR then aggregates these class centers according to their similarity to the original pixel feature (termed the “pixel-region relation”), resulting in a “contextual feature”. Slightly differently from OCR, ACFNet directly uses the probability (from the predicted coarse mask) to aggregate class centers, obtaining a similar context feature termed the “attentional class feature”. CPNet defines an affinity map, a binary map indicating whether two spatial locations belong to the same class. It uses a sub-network to predict this ideal affinity map and uses the soft-version affinity map, termed the “Context Prior Map”, for feature aggregation, obtaining a class feature (center) and a context feature. Note that CPNet concatenates the class feature, which is the updated pixel feature, with the context feature.

We also propose to utilize class-level contextual features. Instead of extracting and fusing pixel features with sub-networks, we propose three loss functions to directly regularize training and encourage the learned features to maintain certain desired properties. The approach is simple but more effective thanks to the direct supervision (validated in Tab. 2). Moreover, our class center estimate is more accurate because we use the GT mask. This strategy largely reduces the complexity of the network and introduces no computational overhead during inference. Furthermore, it is compatible with all existing methods, including OCR, ACFNet and CPNet, demonstrating great generalization capability.

We also notice that Cross-Image Pixel Contrast (CIPC) cCIPC shares a similar high-level goal with our CAR, i.e., learning more similar representations for pixels belonging to the same class than for pixels of different classes. However, the approaches to achieving this goal are very different. CIPC is motivated by contrastive learning, while our CAR is motivated by the compositionality of the scene, for better generalization in cases of rare class combinations. Consequently, CIPC adopts one-vs-rest style InfoNCE losses, including the typical pixel-to-pixel loss and a special pixel-to-center loss. In contrast, (1) we propose an additional center-to-center loss to regularize the inter-class dependency explicitly and effectively (see Table 1); (2) we use one-vs-one style inter-class losses while CIPC uses one-vs-rest style NCE losses; compared to our one-vs-one loss, a one-vs-rest loss does not necessarily result in small inter-class similarity between the current class and every individual “other” class and may even increase the inter-class similarity among those “other” classes; and (3) we leave margins to prevent the CAR regularizations, which are auxiliary to the primary pixel classification task, from dominating the learning process.

Refer to caption
Figure 4: Overview of the proposed CARD. The class-aware regularized decoder (CARD) is tailored for the proposed class-aware regularizations, with greatly reduced computational cost and minor accuracy loss. CARD contains an EJPU, which fuses features from different layers (at the same spatial location) to obtain high-resolution multi-scale and multi-level feature maps, a synced axial attention (SAA) token mixer, which fuses features from different locations for context aggregation, and CAR, which produces less class-dependent and thus more generalizable pixel features. The output stride (OS) = 8 logit maps are bilinearly upsampled to the original resolution to make the final prediction.

3.5 The Devil Is in the Architecture’s Detail

The proposed CAR is compatible with many models, as shown in Tab. 2. However, some layers or operations in existing models may harm the effectiveness of CAR; the last $3\times 3$ conv is one commonly found case in many models cNonLocal ; cCCNet ; cUper (see A1 and A3 in Fig. 3). A possible reason is that the network is trained to maximize the separation between different classes, but if two neighboring pixels lie on different sides of a segmentation boundary, a $3\times 3$ conv merges pixel representations from different classes, making the proposed CAR harder to optimize. In this work, we provide a simple and optional general modification for those models to enhance CAR's effectiveness: we replace the original $3\times 3$ conv with a $1\times 1$ conv. Existing models like DeepLab cDeepLab require no modification because they already use a $1\times 1$ conv in their original settings. Note that this is the only modification we make to some existing models, because it is simple and generalizes well.

We also found that some architecture-specific modifications, though not general, can further improve performance considerably when applying CAR to existing models, e.g., decreasing the filter number of the last conv layer to 256 for ResNet-50 + Self-Attention + CAR, or replacing the conv layer after the PPM (inside the Uper block, Fig. 3A3) from $3\times 3$ to $1\times 1$ in Swin-Tiny + UperNet. We did not exhaustively search for such variants since they do not generalize.

3.6 Class-aware Regularized Decoder

3.6.1 Motivation

As mentioned in Sec. 3.5, simply applying CAR to existing methods without architecture-specific modifications may result in sub-optimal results. To better utilize CAR for semantic segmentation, we design a novel decoder module tailored for CAR, taking both efficiency and effectiveness into consideration.

Concretely, the decoder design focuses on three aspects: 1) compatibility with the proposed CAR, 2) efficient spatial context aggregation, and 3) low computational overhead (e.g., avoiding dilated convolution). The resulting class-aware regularized decoder (CARD) is a lightweight, simple yet effective decoder for semantic segmentation, achieving good performance with small computational overhead and reasonable GPU memory usage.

3.6.2 Overview of CARD

Fig. 4 presents the overall architecture of the proposed CARD. (In this work, we refer to a complete segmentation network as a “model/method/baseline”, which usually consists of a “backbone” feature extractor (e.g., ResNet-50, usually pretrained on a large-scale classification dataset) and a “decoder” that typically increases the resolution of the feature maps (e.g., EJPU) and/or conducts multi-scale/global context aggregation as further enhancement (e.g., SAA).) CARD first uses our enhanced joint pyramid upsampling (EJPU) to obtain higher-resolution multi-scale feature maps with output stride (OS) = 8 (Sec. 3.6.4). It then uses our proposed synced axial attention (SAA), which is lightweight and more compatible with the following CAR regularizations, to perform global spatial context aggregation (Sec. 3.6.3). Finally, the output of the token mixer is optimized by our proposed CAR to obtain less class-dependent and more generalizable pixel representations.

This design, optimized for efficiency and effectiveness, outperforms other state-of-the-art methods that use up to 3 times the computation of ours, striking a good balance between accuracy and computational cost.

Refer to caption
Figure 5: Comparison between vanilla axial attention and our proposed synced axial attention (SAA). The difference is highlighted in orange. In SAA, both the column and row attention maps are obtained from the same set of queries and keys, i.e., the column and row attention share the same query/key.

3.6.3 Synced Axial Attention

For efficiency and effectiveness, we design a new synced axial attention (SAA) for CAR, since we notice that existing sparse attention methods obtain limited accuracy gain from CAR (e.g., CCNet cCCNet , only +0.56 in Tab. 2).

Although CAA + CAR achieves a considerable gain and the best results, we do not adopt CAA for spatial context aggregation because it is an uncommon operation: despite its small FLOP count, it runs slowly on some hardware due to the lack of hardware and software (e.g., GPU driver/library) support.

Token mixer. In CARD, we propose an improved version of multi-head axial attention as the token mixer, named synced axial attention (SAA), shown in Fig. 5. In vanilla axial attention, column attention (vertical) and row attention (horizontal) are performed separately, i.e., using different input features ($X$ and $X_{\text{col}}$) and different transformations. In contrast, SAA computes the query $Q$, key $K$, and value $V$ only once, and uses the same queries and keys to generate both the column attention map and the row attention map. After the column-wise context aggregation, the updated features are directly used for row-wise context aggregation according to the row attention maps. Thus, SAA operates in a consistent feature space when computing the column and row attention maps, since both are generated from the same queries and keys. Empirically, we find this consistent/synced attention calculation not only reduces computation but also improves performance (see Tab. 3). A possible reason is that using a consistent input and shared transformations avoids potential error accumulation during attention-based feature aggregation and directly optimizes in the global context (rather than via two stages as in AA or CCNet).
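A single-head PyTorch-style sketch of the synced attention described above (multi-head splitting, positional encoding and the output projection are omitted; this is our hedged reading of Fig. 5, not the exact implementation):

```python
import torch
import torch.nn as nn

class SyncedAxialAttention(nn.Module):
    """Single-head sketch of synced axial attention (SAA).

    Q, K, V are computed once; the same Q/K produce both the column
    (height-axis) and row (width-axis) attention maps, and the
    column-aggregated features are reused for the row aggregation.
    """

    def __init__(self, channels):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 1)
        self.to_k = nn.Conv2d(channels, channels, 1)
        self.to_v = nn.Conv2d(channels, channels, 1)
        self.scale = channels ** -0.5

    def forward(self, x):                       # x: (B, C, H, W)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)

        # Column (height-axis) attention: attend within each column.
        qc = q.permute(0, 3, 2, 1)              # (B, W, H, C)
        kc = k.permute(0, 3, 2, 1)
        vc = v.permute(0, 3, 2, 1)
        attn_col = torch.softmax(qc @ kc.transpose(-1, -2) * self.scale, dim=-1)
        v1 = attn_col @ vc                      # (B, W, H, C) column-aggregated features

        # Row (width-axis) attention from the SAME q/k, applied to v1.
        qr = q.permute(0, 2, 3, 1)              # (B, H, W, C)
        kr = k.permute(0, 2, 3, 1)
        v1r = v1.permute(0, 2, 1, 3)            # (B, H, W, C)
        attn_row = torch.softmax(qr @ kr.transpose(-1, -2) * self.scale, dim=-1)
        out = attn_row @ v1r                    # (B, H, W, C)
        return out.permute(0, 3, 1, 2)          # back to (B, C, H, W)
```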

Positional encoding. In CARD, we apply conditional positional encoding (CPE) cCPvT ; cMaxViT , a resolution-insensitive positional encoding, before the attention operations. Note that we did not apply the normalization used in MaxViT cMaxViT , since we found it harmful to accuracy.
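A minimal sketch of a common CPE formulation (a depthwise convolution whose output is added back to the features); the exact configuration used in CARD may differ:

```python
import torch.nn as nn

class ConditionalPosEnc(nn.Module):
    """Minimal sketch of conditional positional encoding (CPE).

    A depthwise 3x3 convolution generates position information from the
    features themselves, so the encoding adapts to any input resolution.
    """

    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)

    def forward(self, x):          # x: (B, C, H, W)
        return x + self.proj(x)    # residual keeps the original features intact
```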

Refer to caption
Figure 6: Detailed design of the proposed EJPU. Similar to ResNet (see Sec. 3.6.4), we add the upsampled original backbone features to the “residual” information extracted from the modified JPU module, and the result is fed into the following decoder. An optional CPM is applied when the channel numbers of the original backbone features and the JPU features differ. Feature alignment is necessary because the channels of the original backbone feature are not grouped and arranged in multi-scale order.

3.6.4 Enhanced Joint Pyramid Upsampling (EJPU)

We choose JPU since it integrates better with multi-scale/global context aggregation modules (e.g., ASPP cDeepLabV3 , self-attention) than UNet-like encoder-decoders or FPN cFPN (more discussion in Sec. 2.4). Based on JPU, we make some crucial modifications to improve its convergence and make it more compatible with the proposed CAR, resulting in largely improved accuracy (50.76 vs 49.76 in Tab. 4).

Concretely, we notice during our experiments that the initial convergence speed on the test set (evaluated every 1k training steps) is slower than that of the dilation model. A possible reason is that JPU does not fully utilize the original backbone feature maps (i.e., those at the highest abstraction level), since they are treated equally with the low-level feature maps from previous stages. In contrast, FCN cFCN initializes the weights of the convolution following the low-level features to zero before adding them to the original backbone features, and the dilation models cDeepLab ; cPSPNet in essence build directly on the original backbone features. Both FCN and the dilation models converge faster than JPU. Motivated by this observation, we equip JPU with a ResBlock-style residual branch that directly sends the original backbone features (via a minimal learned transformation if required) to the later network layers. We detail the modifications as follows.

Residual branch. To better utilize the well-trained original features from the backbone, we include a residual branch that directly feeds the bilinearly upsampled backbone feature maps to the following network module (bottom of Fig. 6), similar to FCN and ResBlock. For some backbones, the number of output feature channels differs from JPU's output (i.e., 2048), so a channel padding module (CPM) with as few learnable transformations as possible is introduced, and only when necessary.

Multi-scale multi-level feature branch. We adopt JPU-style cFastFCN multi-scale multi-level feature fusion for upsampling to provide complementary information lost in the original backbone features. Specifically, the feature maps extracted by JPU are processed by a $1\times 1$ conv (followed by BN and ReLU) and then added element-wise to the backbone feature maps. This extra convolution after JPU calibrates the JPU features to the backbone features, since JPU has reordered the channels and the meaning of the JPU and backbone features in the same dimension/channel no longer corresponds. Note that we do not back-propagate gradients to the highest-level backbone features through JPU and only keep the gradients from the residual branch.
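A hedged PyTorch-style sketch of this fusion step (the JPU internals and all names are placeholders; we assume the backbone channel count already matches, so no CPM is needed here):

```python
import torch.nn as nn
import torch.nn.functional as F

class EJPUFusion(nn.Module):
    """Sketch of the EJPU fusion (residual branch + calibrated JPU branch).

    `jpu` is assumed to take a list of backbone feature maps and return an
    OS = 8 feature map with `jpu_channels` channels.
    """

    def __init__(self, jpu, jpu_channels, out_channels=2048):
        super().__init__()
        self.jpu = jpu
        self.calib = nn.Sequential(                                # 1x1 conv + BN + ReLU that
            nn.Conv2d(jpu_channels, out_channels, 1, bias=False),  # re-aligns JPU channels to the
            nn.BatchNorm2d(out_channels),                          # backbone feature space
            nn.ReLU(inplace=True))

    def forward(self, backbone_feats):
        high = backbone_feats[-1]                                  # highest-level backbone feature
        # Stop gradient to the highest-level feature through the JPU branch.
        jpu_in = list(backbone_feats[:-1]) + [high.detach()]
        jpu_feat = self.calib(self.jpu(jpu_in))                    # OS = 8 multi-scale features
        # Residual branch: bilinearly upsampled original backbone feature.
        residual = F.interpolate(high, size=jpu_feat.shape[-2:],
                                 mode='bilinear', align_corners=False)
        return residual + jpu_feat                                 # element-wise fusion
```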

Channel Padding Module. We include an optional channel padding module (CPM) since different backbones output feature maps with different dimensions (i.e., channel numbers). In order not to interfere with the original feature maps too much, we use as few learnable transformations as possible to project the features to the required dimension (i.e., 2048). Specifically, the original backbone feature maps go through only a padding operation and a convolution layer, where the padded feature maps are generated with global average pooling, dimension projection and unpooling, as shown at the bottom of Fig. 6.
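A minimal sketch of this padding step, based on our reading of Fig. 6 (the layer choices are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelPaddingModule(nn.Module):
    """Sketch of the channel padding module (CPM).

    Extra channels are produced by global average pooling, a dimension
    projection and unpooling (broadcast back to H x W), concatenated to the
    original feature, and followed by a single convolution layer.
    """

    def __init__(self, in_channels, out_channels=2048):
        super().__init__()
        self.project = nn.Conv2d(in_channels, out_channels - in_channels, 1)
        self.fuse = nn.Conv2d(out_channels, out_channels, 1)

    def forward(self, x):                                  # x: (B, C_in, H, W)
        pooled = F.adaptive_avg_pool2d(x, 1)               # global average pooling
        pad = self.project(pooled)                         # dimension projection
        pad = pad.expand(-1, -1, x.size(2), x.size(3))     # "unpool" to the spatial size
        return self.fuse(torch.cat([x, pad], dim=1))       # pad channels, then one conv
```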

In Tab. 5, other simple and intuitive alternatives are also tested, including direct projection (optionally with BN + ReLU), channel-axis interpolation, and aligning the JPU dimensions to the backbone dimensions, but they are not as effective as this configuration. A possible reason is that redundant channel information (i.e., direct projection or channel-axis interpolation) does not fully utilize the original well-trained features, while insufficient channel information (i.e., matching the dimensions) reduces the network capacity.

4 Experiments

In the following, we first disclose the implementation details and the detailed experiment settings in Sec. 4.1. Then we present experimental results on Pascal Context (Sec. 4.2), COCOStuff-10K (Sec. 4.3), COCOStuff-164K (Sec. 4.4), and Cityscapes (Sec. 4.5). On Pascal Context (Sec. 4.2), we conduct thorough ablation studies (including the effectiveness of the individual regularization terms in CAR (Tab. 1), the applicability of CAR to various baselines (Tab. 2), and the effectiveness of the individual components of CARD (Tab. 3-5)) and present various visualizations for in-depth analysis (Fig. 7-8).

4.1 Implementation Details

Training Settings. For both the baseline and CAR experiments, we apply the settings common to most works cENCNet ; cCFNet ; cEMANet ; cCCNet ; cANNN , including SyncBatchNorm, batch size = 16, weight decay of 0.001, an initial learning rate of 0.01, and poly learning rate decay with SGD during training. In addition, for the CNN backbones (e.g., ResNet), we set output stride = 8 (see cDeepLabV3 ). Training is run for 30k iterations unless otherwise specified. For the thresholds in Eq. (6) and Eq. (13), we set $\epsilon_{0}=0.5$ and $\epsilon_{1}=0.25$.
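For reference, the poly schedule used here is the standard one (the power value is not stated in the paper; 0.9 below is the common default and is an assumption):

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Poly learning rate schedule commonly used in segmentation."""
    return base_lr * (1 - step / total_steps) ** power

# Example: LR at step 15k of a 30k-iteration run with base LR 0.01.
print(poly_lr(0.01, 15_000, 30_000))   # ~0.0054
```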

CARD experiments use the same settings as “Baselines + CAR” unless stated otherwise. For example, CARD experiments compared with the state-of-the-art methods use AdamW (instead of SGD) for fair comparison since it is widely adopted by recent state-of-the-art methods. Details are described in the corresponding subsections with the dataset.

Determinism & Reproducibility. Our implementation is based on the latest NVIDIA deterministic framework (2022), which means exactly the same results can always be reproduced with the same hardware and the same training settings (including the random seed). To fairly demonstrate the effectiveness of our CAR, we reimplement and reproduce all the baselines in our ablative experiments.

Table 1: Ablation studies of adding CAR to different methods on the Pascal Context dataset. All results are obtained with single-scale testing without flipping. “A” means replacing the $3\times 3$ conv with a $1\times 1$ conv (detailed in Sec. 4.2.1). CAR improves the performance of different types of backbones (CNN & Transformer) and head blocks (SA & Uper), showing the generalizability of the proposed CAR.
Methods $\mathcal{L}_{\text{intra-c2p}}$ $\mathcal{L}_{\text{inter-c2c}}$ $\mathcal{L}_{\text{inter-c2p}}$ A mIOU (%)
R1 ResNet-50 + Self-Attention - - 48.32
R2 48.56
R3 + CAR 49.17
R4 49.79
R5 50.01
R6 49.62
R7 50.00
R8 50.50
S1 Swin-Tiny + UperNet - - 49.62
S2 49.82
S3 + CAR 49.01
S4 50.63
S5 50.26
S6 49.62
S7 50.58
S8 50.78

4.2 Experiments on Pascal Context

The Pascal Context cPascalContext dataset (https://www.cs.stanford.edu/~roozbeh/pascal-context/) is split into 4,998/5,105 images for the training/test set. We use its 59 semantic classes following the common practice cOCR ; cCFNet . Unless otherwise specified, all experiments are trained on the training set for 30k iterations.

Ablation studies related to “baselines + CAR” are presented in Sec. 4.2.1, and ablation studies related to CARD are presented in Sec. 4.2.2.

4.2.1 Ablation Studies of CAR

In the following experiments, we first test the effectiveness of the individual regularization terms in CAR when plugged into different basic baselines, using CNN-based and Transformer-based baselines as representatives. Then, we test the effectiveness of CAR as a whole on many other well-known baselines to demonstrate its universality.

CAR on “ResNet-50 + Self-Attention”.

We first test CAR with ResNet-50 + Self-Attention (w/o the image-level block in cCFNet ) to verify the effectiveness of the proposed loss functions, i.e., $\mathcal{L}_{\text{intra-c2p}}$, $\mathcal{L}_{\text{inter-c2c}}$, and $\mathcal{L}_{\text{inter-c2p}}$. As shown in Tab. 1, using $\mathcal{L}_{\text{intra-c2p}}$ directly improves the result by 1.30 mIOU (48.32 vs 49.62); introducing $\mathcal{L}_{\text{inter-c2c}}$ and $\mathcal{L}_{\text{inter-c2p}}$ further improves it by 0.38 mIOU and 0.50 mIOU, respectively; finally, with all three loss functions, the proposed CAR improves the regular ResNet-50 + Self-Attention by 2.18 mIOU (48.32 vs 50.50).

CAR on “Swin-Tiny + Uper”.

“Swin-Tiny + Uper” is a totally different architecture from “ResNet-50 + Self-Attention cNonLocal ”. Swin cSwin is a recent Transformer-based backbone network. Uper cUper is based on the pyramid pooling module (PPM) cPSPNet and FPN cFPN , focusing on extracting multi-scale context information. Similarly, as shown in Tab. 1, after adding CAR, the performance of Swin-Tiny + Uper also increases, by 1.16, showing that our CAR generalizes well to different architectures.

The devil is in the architecture’s detail.

As mentioned in Sec. 3.5, we find it important to replace the last $3\times 3$ convolution (in the original baseline) with a $1\times 1$ convolution (Fig. 3B). For example, $\mathcal{L}_{\text{inter-c2p}}$ did not improve the performance of Swin-Tiny + Uper (S5 vs S4 in Tab. 1) until the last $3\times 3$ convolution was replaced by a $1\times 1$ one (S8 vs S7 in Tab. 1).

To keep things simple and demonstrate its generalizability, we use the same network configuration for all the baseline methods. No architecture-specific modification is made when conducting the ablation studies on existing models in Tab. 1 and Tab. 2.

CAR on various baselines.

Having verified the effectiveness of each part of the proposed CAR, we then test CAR on multiple well-known baselines. All of the baselines are reproduced under similar conditions (see Sec. 4.1). The experimental results in Tab. 2 demonstrate the generalizability of our CAR across different backbones and methods.

Table 2: Ablation studies of adding CAR to different baselines on Pascal Context cPascalContext and COCOStuff-10K cCocoStuff . We deterministically reproduced all the baselines with the same settings. All results are single-scale without flipping. CAR works very well with most existing methods. \boxtimes means reducing the class-level threshold from 0.5 to 0.25; we found some model variants to be sensitive to this threshold when handling a large number of classes. The affinity loss cCPN and auxiliary loss cPSPNet are applied to CPNet and OCR, respectively, since these methods rely heavily on those losses.
Methods Backbone Pascal Context mIOU(%) COCO-Stuff10K mIOU(%)
FCN cFCN ResNet-50 cResnet 47.72 34.10
FCN + CAR ResNet-50 cResnet 48.40(+0.68) 34.91(+0.81)\boxtimes
FCN cFCN ResNet-101 cResnet 50.93 35.93
FCN + CAR ResNet-101 cResnet 51.39(+0.49) 36.88(+0.95)\boxtimes
DeepLabV3 cDeepLabV3 ResNet-50 cResnet 48.59 34.96
DeepLabV3 + CAR ResNet-50 cResnet 49.53(+0.94) 35.13(+0.17)
DeepLabV3 cDeepLabV3 ResNet-101 cResnet 51.69 36.92
DeepLabV3 + CAR ResNet-101 cResnet 52.58(+0.89) 37.39(+0.47)
Self-Attention cNonLocal ResNet-50 cResnet 48.32 34.35
Self-Attention + CAR ResNet-50 cResnet 50.50(+2.18) 36.58(+2.23)\boxtimes
Self-Attention cNonLocal ResNet-101 cResnet 51.59 36.53
Self-Attention + CAR ResNet-101 cResnet 52.49(+0.9) 38.15(+1.62)
CCNet cCCNet ResNet-50 cResnet 49.15 35.10
CCNet + CAR ResNet-50 cResnet 49.56(+0.41) 36.39(+1.29)
CCNet cCCNet ResNet-101 cResnet 51.41 36.88
CCNet + CAR ResNet-101 cResnet 51.97(+0.56) 37.56(+0.68)
DANet cDualAttention ResNet-101 cResnet 51.45 35.80
DANet + CAR ResNet-101 cResnet 52.57(+1.12) 37.47(+1.67)
CPNet cCPN ResNet-101 cResnet 51.29 36.92
CPNet + CAR ResNet-101 cResnet 51.98(+0.69) 37.12(+0.20)\boxtimes
OCR cOCR HRNet-W48 cHRNet 54.37 38.22
OCR + CAR HRNet-W48 cHRNet 54.99(+0.62) 39.53(+1.31)
UperNet cUper Swin-Tiny cSwin 49.62 36.07
UperNet + CAR Swin-Tiny cSwin 50.78(+1.16) 36.63(+0.56) \boxtimes
UperNet cUper Swin-Large cSwin 57.48 44.25
UperNet + CAR Swin-Large cSwin 58.97(+1.49) 44.88(+0.63)
CAA cCAA EfficientNet-B5 cEfficientNet 57.79 43.40
CAA + CAR EfficientNet-B5 cEfficientNet 58.96(+1.17) 43.93(+0.53)
CAA cCAA ConvNext-Large cConvNeXT 60.48 46.49
CAA + CAR ConvNext-Large cConvNeXT 61.40(+0.92) 46.70(+0.21)

4.2.2 Ablation Studies of CARD

In the following experiments, we test the effectiveness of the proposed CARD. Ablation studies include the effectiveness of individual components in CARD (i.e. the spatial token mixer in Tab. 3, EJPU in Tab. 4 & 5), and a computational cost analysis in Tab. 6.

Effectiveness of the token mixer

In Tab. 3, we conduct ablation studies of different token mixer designs in CARD. They are evaluated using a dilated ResNet-50 with output stride = 8 (“ResNet-50 (D8)”) on the Pascal Context dataset. All settings are the same as the ones used in Tab. 1 and Tab. 2.

We empirically find that using a head count of 4 for the multi-head axial attention achieves the best accuracy (50.91% mIOU), which is slightly better than self-attention (50.50% mIOU) while costing much less computation. This computational cost particularly matters for high-resolution inputs, which is evaluated in Tab. 6. Compared to another similar sparse-attention-based method, CCNet, our SAA brings a much larger accuracy gain (+1.45 vs +0.41), demonstrating that SAA is indeed more compatible with CAR.

As a result (Tab. 3), SAA brings a larger accuracy gain (+1.45) than vanilla AA (+1.24) and CCNet (+0.41), and achieves accuracy comparable to self-attention (50.91 vs 50.50) with a much smaller computational cost.

Table 3: Ablation studies of different token mixers' compatibility with CAR on the Pascal Context dataset using ResNet-50 (D8), where “D8” means modifying the last convolution layers of the backbone to their dilated version to obtain output stride (OS) = 8 feature maps cDeepLabV3 . “CPE” represents the conditional positional encoding of cCPvT ; cMaxViT . “HC” represents the number of attention heads. See Sec. 3.6.3 for details.
Methods CAR CPE HC mIOU(%)
T1 Self-Attention cNonLocal 1 48.32
T2 1 50.50(+2.18)
T3 CCNet cCCNet 1 49.15
T4 1 49.56(+0.41)
T5 Vanilla AA 4 49.39
T6 4 50.63(+1.24)
T7 SAA 4 49.46
T8 2 50.82(+1.36)
T9 1 50.55(+1.09)
T10 4 50.91(+1.45)
Effectiveness of EJPU

In Tab. 4, we evaluate the proposed EJPU of CARD on the Pascal Context dataset with ResNet-50. All settings are the same as the ones in Tab. 1 and Tab. 2.

Compared to the other approaches, such as the original JPU and Semantic FPN, EJPU achieves the closest performance to the dilation model and even beats Semantic FPN with twice the filters. Also note that EJPU is more compatible with CAR, since it brings a larger accuracy gain (+1.13 vs +0.71).

Table 4: Ablation studies of EJPU in CARD on the Pascal Context dataset using ResNet-50. Compared to Semantic FPN and JPU, the proposed EJPU achieves the closest performance to the dilation model.
Mode CAR mIOU(%)
Semantic FPN cPanopticFPN 48.96
49.67 (+0.71)
Semantic FPN cPanopticFPN $2\times$ filters 48.83
50.04 (+1.21)
JPU cFastFCN 49.05
49.76 (+0.71)
EJPU (Ours) 49.63
50.76 (+1.13)
OS = 8 (Dilation)  cPSPNet ; cDeepLabV3 49.46
50.91 (+1.45)
Effectiveness of CPM in EJPU

In Tab. 5, we compare different options for channel padding when the dimensions of the backbone and JPU differ. We use ConvNeXt-L cConvNeXT (1536 channels) to conduct the experiments. The remaining settings are the same as in the previous sections. “Project JPU output” and “Project backbone output” use a $1\times 1$ convolution layer (followed by BN and ReLU) to adjust the channel number to match the other branch. “Reduce JPU's conv filters” means reducing the filter numbers of all the convolution layers in JPU by the same factor to match the backbone feature dimension. Among all these configurations, CPM achieves the best accuracy.

Table 5: Ablation studies of CPM inside EJPU on Pascal Context dataset using ConvNeXt-L, which outputs 1536 channel feature maps and thus requires the Channel Padding Module (CPM).
Padding Strategies mIOU(%)
Project JPU output 61.51
Project Backbone output 61.68
Reduce JPU’s conv filters 60.94
Interpolation 59.43
CPM 61.99
Computational cost of CARD

Tab. 6 presents the computational cost of CARD for two commonly seen image resolutions. Compared to Self-Attention with a dilated ResNet-50, our CARD significantly reduces the computational cost from 158.96 to 112.69 GFLOPs. EJPU saves even more computation for larger backbones or higher-resolution inputs.

Table 6: Computation analysis of CARD. We provide the computational cost in GFLOPs for two commonly used resolutions, $513\times 513$ and $1025\times 2049$. “SA” is short for “Self-Attention”. “D8” means modifying the last convolution layers of the backbone to their dilated version to obtain output stride (OS) = 8 feature maps cDeepLabV3 . CARD rows not marked with “w/ EJPU” use a dilated backbone with output stride = 8 (D8).
Method Backbone GFLOPs
$513\times 513$ $1025\times 2049$
SA (CAR) ResNet-50 (D8) 158.96 1723.03
CARD ResNet-50 (D8) 151.70 1157.59
       w/ EJPU ResNet-50 112.69 (-25%) 887.180 (-23%)
CARD ConvNeXt-L (D8) 818.14 6418.79
       w/ EJPU ConvNeXt-L 262.82 (-67%) 2043.24 (-68%)
CARD EfficientNet-L2 (D8) 1635.22 12834.12
       w/ EJPU EfficientNet-L2 283.62 (-82%) 2184.76 (-82%)

4.2.3 CARD Compared to the State-of-the-art

In Tab. 7, we equip CARD with stronger backbones to compare with state-of-the-art methods on the Pascal Context dataset. The reported mIOU of the methods marked with “*” comes from their respective papers rather than our reproduction.

We train CARD with ConvNeXt-L cConvNeXT , ConvNeXtV2-L cConvNeXtV2 and EfficientNet-L2 cEfficientNet as backbones, using the AdamW optimizer with an initial learning rate of 4e-5, while the other settings remain the same as in our ablation studies. The AdamW optimizer improves the performance of CARD (ConvNeXt-L) from 61.99% (shown in Tab. 5, trained with SGD) to 63.20%. As shown in Tab. 7, CARD outperforms the other state-of-the-art approaches when using ConvNeXt-L and ConvNeXtV2-L.

Refer to caption
Figure 7: Class dependency maps generated on the Pascal Context test set. One may zoom in to see the class names. A hotter color means that the class has a higher dependency on the corresponding class, and vice versa. It is clear that our CAR reduces the inter-class dependency, thus providing better generalizability.

With an even stronger backbone such as EfficientNet-L2, CARD achieves 66.0% mIOU under the single-scale setting and 67.5% mIOU under the multi-scale with flipping setting.

Table 7: Comparison of CARD to state-of-the-art methods on the Pascal Context dataset. Methods marked with “*” report mIOU from their papers, while the others are obtained with our implementation. SS means single-scale performance w/o flipping. MF means multi-scale performance w/ flipping.
Methods Venue mIOU(%)
SS MF
SETR (ViT-L)* cSETR CVPR’21 - 55.8
DPT (ViT-Hybrid)* cDPT ICCV’21 - 60.5
Segmenter (ViT-L)* cSegmenter ICCV’21 - 59.0
OCNet (HRNet-W48)* cOCNet IJCV’21 - 56.2
CAA (EfficientNet-B7)* cCAA AAAI’22 - 60.5
SegNeXt (MSCAN-L)* cSegNeXt NIPS’22 59.2 60.9
CAA + CAR (ConvNeXt-L) ECCV’22 62.7 63.9
CARD (ConvNeXt-L) Ours 63.2 64.4
CARD (ConvNeXtV2-L) Ours 64.0 64.6
CARD (EfficientNet-L2) Ours 66.0 67.5

4.2.4 Visualization of Class Dependency Maps

In Fig. 7, we present the class dependency maps calculated on the complete Pascal Context test set, where each entry stores the dot-product similarity between two class centers. The maps indicate the inter-class dependency obtained with the standard ResNet-50 + Self-Attention and Swin-Tiny + UperNet, and the effect of applying our CAR. A hotter color means that the class has a higher dependency on the corresponding class, and vice versa. From Fig. 7 a1-a2, we can easily observe that the inter-class dependency is significantly reduced with CAR on ResNet-50 + Self-Attention. Fig. 7 b1-b2 show a similar trend when tested with a different backbone and head block. This partially explains why baselines with CAR generalize better on rarely seen class combinations (Figs. 1 and 8). Interestingly, we find that the class-dependency issue is more serious in Swin-Tiny + Uper, but our CAR can still reduce its dependency level significantly.

4.2.5 Visualization of Pixel-relation Maps

In Fig. 8, we visualize the pixel-to-pixel relation energy maps, based on the dot-product similarity between a red-dot-marked pixel and all other pixels, as well as the predicted results of different methods, for comparison. The examples are from the Pascal Context test set. As we can see, with CAR supervision, the existing models focus better on the objects themselves rather than on other objects, which reduces the possibility of classification errors caused by class-dependency bias.

4.3 Experiments on COCOStuff-10K

The COCOStuff-10K dataset cCocoStuff (https://github.com/nightrome/cocostuff10k) is widely used for evaluating the robustness of semantic segmentation models cEMANet ; cOCR . It is a very challenging dataset containing 171 labeled classes and 9,000/1,000 images for training/test.

Refer to caption
(a) ResNet50 + Self-Attention
Refer to caption
(b) Swin-Tiny + UperNet
Figure 8: Visualization of the feature similarity between a given pixel (marked with a red dot in the image) and all pixels, together with the segmentation results on the Pascal Context test set. A hotter color denotes a larger similarity value. Our CAR clearly reduces the inter-class dependency and exhibits better generalization ability, with the energies better confined to intra-class pixels.
Table 8: Comparison of CARD with state-of-the-art methods on the COCOStuff-10K dataset. Methods marked with ‘*’ report mIOU from their papers, while the others are obtained with our implementation. SS denotes single-scale performance w/o flipping; MF denotes multi-scale performance w/ flipping.
Methods | Venue | mIOU(%) SS | mIOU(%) MF
OCR (HRNet-W48)* cOCR | ECCV'20 | - | 45.2
OCNet (HRNet-W48)* cOCNet | IJCV'21 | - | 40.0
CAA (EfficientNet-B7)* cCAA | AAAI'22 | - | 45.4
RankSeg (ViT-L)* cRankSeg | ECCV'22 | - | 47.9
CAA + CAR (ConvNeXt-L) | ECCV'22 | 48.2 | 48.8
CARD (ConvNeXt-L) | Ours | 48.9 | 50.0

4.3.1 CAR on Different Baselines

In Tab. 2, all of the tested baselines gain a performance boost ranging from 0.17% to 2.23% mIOU with our proposed CAR on the COCOStuff-10K dataset. This demonstrates the generalization ability of our CAR when handling a large number of classes.

4.3.2 CARD Compared to the State-of-the-art

In Tab. 8, we equip CARD with a ConvNeXt-L backbone and compare it with state-of-the-art methods on the COCOStuff-10K dataset. The reported mIOU of the methods marked with “*” comes from their respective papers rather than our reproduction. We trained CARD with ConvNeXt-L using the AdamW optimizer and an initial learning rate of 4e-5, while the other settings remain the same as in our ablation studies. As shown in Tab. 8, CARD (ConvNeXt-L) surpasses the other methods by a large margin.

4.4 Experiments on COCOStuff-164K

Table 9: Comparison of CARD with state-of-the-art methods on the COCOStuff-164K dataset. Methods marked with ‘*’ report mIOU from their papers, while the others are obtained with our implementation. SS denotes single-scale performance w/o flipping; MF denotes multi-scale performance w/ flipping.
Methods | Venue | mIOU(%) SS | mIOU(%) MF
SegFormer (MiT-B5)* cSegFormer | NIPS'21 | - | 46.7
CAA (EfficientNet-B5)* cCAA | AAAI'22 | - | 47.3
SegNeXt (MSCAN-L)* cSegNeXt | NIPS'22 | 46.5 | 47.2
CARD (ConvNeXt-L) | Ours | 48.9 | 49.6
CARD (EfficientNet-L2) | Ours | 50.2 | 50.9

4.4.1 CARD Compared to the State-of-the-art

COCOStuff-164K (https://github.com/nightrome/cocostuff) is the full set of COCOStuff-10K and has become a popular benchmark since 2021. The training settings are the same as for COCOStuff-10K (Sec. 4.3.2), except that the total number of training iterations is set to 80k. As shown in Tab. 9, the proposed CARD outperforms previous approaches by a large margin.

4.4.2 Visualization of CARD

Fig. 9 compares the COCOStuff-164K results of SegNeXt cSegNeXt and our CARD. As can be seen, our CARD segments uncommon objects and complex scenes very well.

Refer to caption
Figure 9: Examples of the results obtained on the COCOStuff-164K dataset with our proposed CARD (ConvNeXt-L), in comparison with SegNeXt-L cSegNeXt and the ground truth.
Table 10: Comparison of CARD with state-of-the-art methods on the Cityscapes validation set. Methods marked with ‘*’ report mIOU from their papers, while the others are obtained with our implementation. SS denotes single-scale performance w/o flipping; MF denotes multi-scale performance w/ flipping.
Methods | Venue | mIOU(%) SS | mIOU(%) MF
Axial-DeepLab-L* cAxialDeepLab | ECCV'20 | - | 81.5
SETR (ViT-L)* cSegFormer | CVPR'21 | - | 82.2
Segmenter (ViT-L)* cSegmenter | ICCV'21 | - | 81.3
HRFormer-B* cSegNeXt | NIPS'21 | - | 82.6
CARD (ResNet-50) | Ours | 79.8 | 81.6
CARD (ConvNeXt-L) | Ours | 82.8 | 83.6

4.5 Experiments on Cityscapes

4.5.1 CARD Compared to the State-of-the-art

Cityscapes cCityScapes (https://www.cityscapes-dataset.com/) contains 2975/500/1525 images for training/validation/testing. When training CARD with ConvNeXt-L, we adopt the AdamW optimizer, a batch size of 8, 80K training iterations in total, and a 1000-step linear warmup.
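The warmup can be implemented, for example, as a learning-rate multiplier that ramps up linearly over the first 1000 steps. The PyTorch-style sketch below is only an illustration of this schedule: the polynomial decay after warmup is an assumption for completeness, since the paper specifies only the warmup length and the total number of iterations.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 19, 1)  # toy stand-in for CARD (ConvNeXt-L); Cityscapes has 19 classes
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5)

total_iters, warmup_iters = 80_000, 1_000

def lr_scale(step):
    # Linear warmup over the first 1000 steps, as described above.
    if step < warmup_iters:
        return (step + 1) / warmup_iters
    # Polynomial (power 0.9) decay afterwards -- an illustrative assumption.
    progress = (step - warmup_iters) / max(1, total_iters - warmup_iters)
    return (1.0 - progress) ** 0.9

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
```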

4.5.2 Visualization of CARD

Fig. 10 compares the Cityscapes results of Segmenter (ViT-L) cSegmenter and our CARD. As can be seen, our CARD segments hard classes (e.g., rider vs. person) very well, which is particularly valuable for autonomous driving.

5 Conclusion

In this paper, we have aimed to make better use of class-level context information. We first proposed a universal class-aware regularization (CAR) approach, which simultaneously minimizes the intra-class feature variance and maximizes the inter-class separation, to regularize the training process and boost the discriminability of the learned pixel representations without extra computation during inference. We then proposed a class-aware regularized decoder (CARD), which is tailored to the proposed CAR for better effectiveness and efficiency. Extensive experiments conducted on various benchmarks and thorough ablation studies have validated the effectiveness of the proposed CAR, which boosts the existing models' performance by up to 2.18% mIOU on Pascal Context and 2.23% on COCOStuff-10K with no extra inference overhead. The proposed CARD achieves state-of-the-art performance on multiple benchmarks while using much less computation.

Refer to caption
Figure 10: Visual examples on the Cityscapes dataset, comparing our proposed CARD with Segmenter (ViT-L) cSegmenter. The mIOU of each predicted mask is shown in the bottom-left corner.

Acknowledgement This research depends on the NVIDIA determinism framework. We appreciate the support from @duncanriach and @reedwm at NVIDIA and the TensorFlow team.

We also thank OpenI (https://openi.org.cn) for providing GPUs to conduct experiments.

References

  • (1) Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
  • (2) Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: CVPR (2017)
  • (3) Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI (2017)
  • (4) Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: CVPR (2018)
  • (5) Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H.: Dual attention network for scene segmentation. In: CVPR (2019)
  • (6) Yuan, Y., Huang, L., Guo, J., Zhang, C., Chen, X., Wang, J.: Ocnet: Object context network for scene parsing. IJCV (2021)
  • (7) Zhang, H., Zhan, H., Wang, C., Xie, J.: Semantic correlation promoted shape-variant context for segmentation. In: CVPR (2019)
  • (8) Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H.: Expectation-maximization attention networks for semantic segmentation. In: ICCV (2019)
  • (9) Zhu, Z., Xu, M., Bai, S., Huang, T., Bai, X.: Asymmetric non-local neural networks for semantic segmentation. In: ICCV (2019)
  • (10) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
  • (11) Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: ICCV (2021)
  • (12) Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T., Torr, P.H.S., Zhang, L.: Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: CVPR (2021)
  • (13) Yuan, Y., Chen, X., Wang, J.: Object-contextual representations for semantic segmentation. In: ECCV (2020)
  • (14) Zhang, F., Chen, Y., Li, Z., Hong, Z., Liu, J., Ma, F., Han, J., Ding, E.: Acfnet: Attentional class feature network for semantic segmentation. In: ICCV (2019)
  • (15) Yu, C., Wang, J., Gao, C., Yu, G., Shen, C., Sang, N.: Context prior for scene segmentation. In: CVPR (2020)
  • (16) Huang, Y., Kang, D., Chen, L., Zhe, X., Jia, W., Bao, L., He, X.: Car: Class-aware regularizations for semantic segmentation. In: ECCV (2022)
  • (17) Huang, Z., Wang, X., Wei, Y., Huang, L., Shi, H., Liu, W., Huang, T.S.: Ccnet: Criss-cross attention for semantic segmentation. IEEE TPAMI (2020)
  • (18) Choi, S., Kim, J.T., Choo, J.: Cars can’t fly up in the sky: Improving urban-scene segmentation via height-driven attention networks. In: CVPR (2020)
  • (19) Liu, M., Schonfeld, D., Tang, W.: Exploit visual dependency relations for semantic segmentation. In: CVPR (2021)
  • (20) Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., Yan, S.: Metaformer is actually what you need for vision. In: CVPR (2022)
  • (21) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS (2017)
  • (22) Ho, J., Kalchbrenner, N., Weissenborn, D., Salimans, T.: Axial attention in multidimensional transformers. arXiv preprint (2019)
  • (23) Huang, Y., Kang, D., Jia, W., He, X., Liu, L.: Channelized axial attention - considering channel relation within spatial attention for semantic segmentation. In: AAAI (2022)
  • (24) Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
  • (25) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
  • (26) Chollet, F.: Xception: Deep learning with depthwise separable convolutions. In: CVPR (2017)
  • (27) Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Zhang, Z., Lin, H., Sun, Y., He, T., Muller, J., Manmatha, R., Li, M., Smola, A.: Resnest: Split-attention networks. arXiv preprint arXiv:2004.08955 (2020)
  • (28) Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: ICML (2019)
  • (29) Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR (2022)
  • (30) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: ICCV (2021)
  • (31) Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. MICCAI (2015)
  • (32) Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)
  • (33) Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: CVPR (2019)
  • (34) Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder with atrous separable convolution for semantic image segmentation. In: ECCV (2018)
  • (35) Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)
  • (36) Wu, H., Zhang, J., Huang, K., Liang, K., Yu, Y.: Fastfcn: Rethinking dilated convolution in the backbone for semantic segmentation. arXiv preprint (2019)
  • (37) Wen, Y., Zhang, K., Li, Z., Qiao, Y.: A discriminative feature learning approach for deep face recognition. In: ECCV (2016)
  • (38) Wang, W., Zhou, T., Yu, F., Dai, J., Konukoglu, E., Van Gool, L.: Exploring cross-image pixel contrast for semantic segmentation. In: ICCV, pp. 7303–7313 (2021)
  • (39) Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: ECCV (2018)
  • (40) Chu, X., Tian, Z., Zhang, B., Wang, X., Wei, X., Xia, H., Shen, C.: Conditional positional encodings for vision transformers. arXiv preprint arXiv:2102.10882 (2021)
  • (41) Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y.: Maxvit: Multi-axis vision transformer. In: ECCV (2022)
  • (42) Chen, L.-C., Papandreou, G., Schroff, F., Adam, H.: Rethinking atrous convolution for semantic image segmentation. arXiv preprint (2017)
  • (43) Zhang, H., Dana, K., Shi, J., Zhang, Z., Wang, X., Tyagi, A., Agrawal, A.: Context encoding for semantic segmentation. In: CVPR (2018)
  • (44) Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W., Fidler, S., Urtasun, R., Yuille, A.: The role of context for object detection and semantic segmentation in the wild. In: CVPR (2014)
  • (45) Caesar, H., Uijlings, J., Ferrari, V.: COCO-Stuff: Thing and Stuff Classes in Context. In: CVPR (2018)
  • (46) Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y., Tan, M., Wang, X., Liu, W., Xiao, B.: Deep high-resolution representation learning for visual recognition. IEEE TPAMI (2020)
  • (47) Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.S., Xie, S.: Convnext v2: Co-designing and scaling convnets with masked autoencoders. arXiv preprint arXiv:2301.00808 (2023)
  • (48) Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for semantic segmentation. In: ICCV (2021)
  • (49) Guo, M.-H., Lu, C.-Z., Hou, Q., Liu, Z.-N., Cheng, M.-M., Hu, S.-M.: Segnext: Rethinking convolutional attention design for semantic segmentation. In: NeurIPS (2022)
  • (50) He, H., Yuan, Y., Yue, X., Hu, H.: Rankseg: Adaptive pixel classification with image category ranking for segmentation. In: ECCV (2022)
  • (51) Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. In: NeurIPS (2021)
  • (52) Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.-C.: Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In: ECCV (2020)
  • (53) Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: CVPR (2016)