
Region Rebalance for Long-Tailed Semantic Segmentation

Jiequan Cui 1  Yuhui Yuan 2  Zhisheng Zhong 1  Zhuotao Tian 1  Han Hu 2  Stephen Lin 2  Jiaya Jia 1
1The Chinese University of Hong Kong           2Microsoft Research Asia          
{jqcui, zszhong21, zttian, leojia}@cse.cuhk.edu.hk, {yuhui.yuan, hanhu, stevelin}@microsoft.com
Abstract

In this paper, we study the problem of class imbalance in semantic segmentation. We first investigate and identify the main challenges of addressing this issue through pixel rebalance. Then a simple yet effective region rebalance scheme is derived based on our analysis. In our solution, pixel features belonging to the same class are grouped into region features, and a rebalanced region classifier is applied via an auxiliary region rebalance branch during training. To verify the flexibility and effectiveness of our method, we apply the region rebalance module to various semantic segmentation methods, such as DeepLabv3+, OCRNet, and Swin. Our strategy achieves consistent improvement on the challenging ADE20K and COCO-Stuff benchmarks. In particular, with the proposed region rebalance scheme, the state-of-the-art BEiT receives a +0.7% gain in mIoU on the ADE20K val set.

1 Introduction

Deep neural networks (DNNs) like convolutional neural networks [33, 25, 49, 50] and vision transformers [21, 54, 38] have achieved great success in various computer vision tasks, including image classification [33, 25, 49, 50], object detection [45, 36, 37] and semantic segmentation [62, 15, 59, 61]. Most previous studies focus on curated data with annotations, such as ImageNet, that are balanced over different classes. In contrast, real-world data often exhibit a long-tailed distribution where a small number of head classes contain many instances while most other classes have relatively few instances. As shown in [22], addressing the long-tailed distribution problem is the key to large vocabulary vision recognition tasks.

Figure 1: Illustrating the improvements with our region rebalance: incorporating our region rebalance scheme leads to +1.21%, +1.61%, +1.22%, and +0.7% gains for OCRNet, DeepLabv3+, Swin, and BEiT on the ADE20K val set, respectively.

With the increasing attention on the long-tailed distribution problem, various advanced methods have been developed for long-tailed image classification [5, 26, 19, 23, 11, 43, 32, 57, 17, 52, 64, 18] and instance segmentation [51, 34, 56]. For example, Cao et al. [9] and Ren et al. [43] provide a theoretical foundation for the optimal margins in long-tailed image classification. For long-tailed instance segmentation, most efforts tackle the problem within the Mask R-CNN framework [24] and rebalance classes in the proposal classification stage. Since long-tailed semantic segmentation has received little prior study, we focus our work in this direction.

To address the long-tailed semantic segmentation problem, we first apply a well-known long-tailed image classification method to the semantic segmentation task. Using the balanced softmax strategy [43] to rebalance the pixel classification, we observe that it does not work well for the major evaluation metric, IoU.

From an investigation of this, we identify two major challenges in conducting pixel rebalance: (i) Pixel rebalance mainly improves pixel accuracy but not IoU. We empirically observe that applying the pixel rebalance strategy even harms IoU performance and argue that inconsistency between the training objective, i.e., pixel cross-entropy, and our inference target, i.e., IoU, leads to this phenomenon. (ii) Neighboring pixels are highly correlated. In long-tailed image classification, rebalance can be effectively guided by class frequency because image samples are independent and identically distributed (i.i.d.). In contrast, pixel samples are not i.i.d. due to correlation among neighboring pixels in an image. As a result, class frequency in the pixel domain is an unsuitable guide for rebalancing.

Figure 2: Illustrating the region rebalance scheme for semantic segmentation: The region rebalance branch is indicated by the tan-colored box. We introduce the region rebalance branch (w/o any other changes to the semantic segmentation network and pixel classifier) to handle the class imbalance problem during training and remove it during evaluation.

In this work, we propose to overcome the non-i.i.d. issue of pixel rebalance by gathering correlated pixels into regions and performing rebalance based on regions. We refer to this method for dealing with the class imbalance in semantic segmentation as Region Rebalance. In this approach, we introduce an auxiliary region classification branch where pixel features of the same class/region are averaged and fed into a rebalanced region classifier, as shown in Figure 2. With this additional training target, the pixel features are encouraged to lie in a more balanced classification space with regard to uncorrelated pixel samples. We note that the region rebalance branch is used only for training and is removed for inference.

An incidental benefit of this approach is that popular benchmark datasets, including ADE20K and COCO-Stuff164K, exhibit less class imbalance in the region domain than in the pixel domain, as later analyzed in Sec. 3.3.2. Because of this property, rebalancing in the region domain can be accomplished relatively more effectively.

We apply our region rebalance strategy to various semantic segmentation methods, e.g., PSPNet [62], OCRNet [61], DeepLabv3+ [15], Swin [38], and BEiT [3], and conduct experiments on two challenging semantic segmentation benchmarks, namely ADE20K [66] and COCO-Stuff164K [8], to verify the effectiveness of our method. The results are reported in Figure 1. Our code will be made publicly available. The key contributions are summarized as follows:

  • We investigate the class imbalance problem that exists in semantic segmentation and identify the main challenges of pixel rebalance.

  • We propose a simple yet effective region rebalance scheme and validate the effectiveness of our method across various semantic segmentation methods. Notably, our method yields a +0.7% improvement for BEiT [3], setting a new performance record on ADE20K val at the time of submission.

2 Related Work

2.1 Long-tailed Image Classification

In the area of long-tailed image classification, the most popular methods for dealing with imbalanced data can be categorized into re-weighting/re-sampling methods [47, 63, 5, 7, 23, 30, 6, 28, 27, 58, 44, 48, 29] or one-stage/two-stage methods [10, 31, 64, 65, 18, 17].

Re-weighting/Re-sampling. Re-sampling approaches are based on either over-sampling low-frequency classes or under-sampling high-frequency classes. Over-sampling [47, 63, 5, 7] usually suffers from heavy over-fitting to low-frequency classes. Under-sampling [23, 30, 6] inevitably degrades the generalization ability of CNNs because a large portion of the high-frequency class data is discarded. Re-weighting [28, 27, 58, 44, 48, 29] the loss function is an alternative way of rebalancing, accomplished by enlarging the weights on more challenging or sparse classes. However, re-weighting makes CNNs difficult to optimize when training on large-scale data.

One-stage/Two-stage. Following the deferred re-weighting and re-sampling strategies proposed by Cao et al. [10], Kang et al. [31] and Zhou et al. [65] further observed that re-weighting and re-sampling can benefit classifier learning but hurt representation learning. Based on this, many two-stage methods [10, 31, 64] were developed.

The two-stage design contradicts the push towards end-to-end learning that has been prevalent in the deep learning era. Through a causal inference framework, Tang et al. [53] revealed the harmful causal effects of SGD momentum on long-tailed classification. Ren et al. [43] extended LDAM [10] and theoretically obtained the optimal margins for multi-class image classification. Cui et al. [17] proposed a residual learning mechanism to address this issue. Recently, contrastive learning has also been introduced for long-tailed image classification [18].

2.2 Long-tailed Instance Segmentation

With the great progress in long-tailed image classification [5, 26, 19, 23, 11, 43, 32, 57, 17, 52, 64, 18], many researchers have started to explore the long-tailed phenomenon in instance segmentation. Many algorithms [51, 34, 56] have been developed to address this issue.

Tan et al. [51] observed that each positive sample of one category could be seen as a negative sample for other categories, causing the tail categories to receive more discouraging gradients. Based on this, they proposed to simply ignore those gradients for rare categories. Li et al. [34] proposed balanced group softmax (BAGS) for balancing the classifier within a detection framework through group-wise training. BAGS implicitly modulates the training process for the head and tail classes and ensures that all classes are trained sufficiently. Wang et al. [56] proposed the Seesaw loss to dynamically rebalance gradients of positive and negative samples for each category, further improving performance.

2.3 Semantic Segmentation

Semantic segmentation has long been studied as a fundamental task in computer vision. Substantial progress was made with the introduction of Fully Convolutional Networks (FCN) [40], which formulate the semantic segmentation task as per-pixel classification. Subsequently, many advanced methods have been introduced.

Many approaches [1, 2, 35, 42, 46] combine upsampled high-level feature maps and low-level feature maps to capture global information and recover sharp object boundaries. A large receptive field also plays an important role in semantic segmentation. Several methods [13, 12, 60] adopt dilated or atrous convolutions to enlarge the receptive field of filters and incorporate larger context. For better global information, recent work [14, 15, 62] adopted spatial pyramid pooling to capture multi-scale contextual information. Along with atrous spatial pyramid pooling and an effective decoder module, DeepLabv3+ [15] features a simple and effective encoder-decoder architecture.

3 Our Method

First, we analyze the challenges of conducting pixel rebalance for semantic segmentation. Second, we present the details of our proposed region rebalance scheme and explain how region rebalance helps.

3.1 Challenges of Pixel Rebalance

3.1.1 Higher Pixel Accuracy → Higher IoU?

In the image classification task, minimizing the cross-entropy loss essentially maximizes the probability of the ground-truth label during the training phase, which is consistent with the top-1 accuracy evaluation metric used at inference. Therefore, we can expect more balanced class accuracy by incorporating rebalance strategies.

For semantic segmentation, we typically use pixel cross-entropy as a proxy to indirectly optimize Intersection-over-Union (IoU). However, pixel cross-entropy inherently maximizes pixel accuracy (Acc). This misalignment between the training objective and the target evaluation metric IoU inspires us to explore what happens when rebalancing with pixel cross-entropy in semantic segmentation.

Figure 3: Comparison between the baseline and pixel rebalance with balanced softmax [43]. Classes are sorted in descending order of pixel numbers. (a) ΔIoU after rebalancing with our Region Rebalance method. (b) ΔIoU after pixel rebalance with balanced softmax [43]. (c) ΔAcc after pixel rebalance with balanced softmax [43].

Figure 4: Class accuracy is positively correlated with class image frequency in image classification, while class accuracy has weak correlations with class pixel frequency due to correlations among neighboring pixels. Class indexes are sorted in descending order according to the number of images, pixels, or regions in the classes. (a) Accuracy on CIFAR-100-LT with imbalance ratio 100. (b) Pixel accuracy on ADE20K. (c) Region accuracy on ADE20K.

Theoretical analysis. To study how pixel rebalance affects IoU, we examine the connection between IoU and Acc. We use TP, FN and FP to denote true positives, false negatives, and false positives, respectively. Eq. (1) presents the formula for IoU, while Eq. (2) is for Acc. Combining these two equations into Eq. (3), we can see that 1/IoU is a linear function of 1/Acc, with FP/TP as a constant. In this sense, it is reasonable to optimize IoU with pixel cross-entropy.

\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}},  (1)
\mathrm{Acc} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},  (2)
\frac{1}{\mathrm{IoU}} = \frac{1}{\mathrm{Acc}} + \frac{\mathrm{FP}}{\mathrm{TP}}.  (3)
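To make the relation concrete, a minimal numerical check of Eq. (3) can be written as follows; the per-class counts are made up purely for illustration and are not taken from any experiment in this paper.

```python
# Minimal sanity check of 1/IoU = 1/Acc + FP/TP for per-class counts of
# true positives (TP), false negatives (FN), and false positives (FP).
def iou(tp, fn, fp):
    return tp / (tp + fn + fp)

def acc(tp, fn):
    return tp / (tp + fn)

tp, fn, fp = 800, 200, 400            # hypothetical counts for one class
lhs = 1.0 / iou(tp, fn, fp)           # (TP + FN + FP) / TP = 1.75
rhs = 1.0 / acc(tp, fn) + fp / tp     # 1.25 + 0.5 = 1.75
assert abs(lhs - rhs) < 1e-9
```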

When rebalancing with pixel cross-entropy, the Acc of low-frequency classes is expected to improve. However, according to Remark 1, a higher Acc does not always mean a higher IoU. Even worse, we empirically observe that the condition of Remark 1 is usually not satisfied for pixel rebalancing, as shown in the following case study.

Remark 1

(Condition for IoU improvement)  With pixel rebalance, suppose the Acc of class $y$ is improved from $A$ to $(1+K)A$ with $K \geq 0$. Then it must satisfy $\mathrm{FP}^{r} - (1+K)\cdot\mathrm{FP} \leq K\cdot n_{y}$ to guarantee an IoU improvement, where $\mathrm{FP}^{r}$ denotes the false positives after rebalance, $\mathrm{FP}$ denotes the false positives beforehand, and $n_{y}$ is the pixel frequency of class $y$.

Proof.  See the Supplementary Material.
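The condition in Remark 1 can be illustrated with a small sketch; the counts and the post-rebalance FP values below are hypothetical and only meant to show when the inequality, and therefore the IoU improvement, holds.

```python
# Toy check of Remark 1: with Acc of class y raised from A to (1+K)A,
# IoU improves iff FP^r - (1+K)*FP <= K * n_y, where n_y = TP + FN.
def iou(tp, fn, fp):
    return tp / (tp + fn + fp)

tp, fn, fp = 600, 400, 300            # hypothetical baseline counts; n_y = 1000
K = 0.25                              # Acc gain of 25%: TP becomes 750, FN becomes 250
n_y = tp + fn

for fp_r in (400, 700):               # two hypothetical post-rebalance FP counts
    improves = iou(tp * (1 + K), fn - K * tp, fp_r) >= iou(tp, fn, fp)
    condition = fp_r - (1 + K) * fp <= K * n_y
    print(fp_r, condition, improves)  # (400, True, True) and (700, False, False)
```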

Table 1: Pixel rebalance with balanced softmax [43] and re-weighting on ADE20K. ResNet-50 and DeepLabv3+ are adopted. * denotes that the region frequency prior is used for rebalance.
Method               mIoU (s.s.)
Baseline             43.95
Re-weighting         37.86
Balanced softmax     38.57
Balanced softmax*    40.66
RR (Ours)            45.02

A Case Study. To rebalance with pixel cross-entropy, we conduct experiments on ADE20K with balanced softmax [43], a state-of-the-art method for long-tailed image classification. As shown in Table 1, the overall mIoU becomes worse after rebalance, dropping from 43.95 to 38.57, despite significant improvements in overall mAcc.
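For reference, one common formulation of balanced softmax adds the logarithm of the class prior to the logits before the standard cross-entropy. The sketch below applies that idea per pixel; it is our reading of [43] adapted to dense prediction, not the released implementation, and the argument names are ours.

```python
import torch
import torch.nn.functional as F

def balanced_softmax_pixel_loss(logits, target, class_pixel_freq, ignore_index=255):
    """Per-pixel balanced softmax sketch: shift each logit by log class-frequency.

    logits: (N, C, H, W) raw pixel logits; target: (N, H, W) class indices;
    class_pixel_freq: (C,) pixel counts per class, precomputed over the training set.
    """
    prior = torch.log(class_pixel_freq.float().clamp(min=1)).view(1, -1, 1, 1)
    return F.cross_entropy(logits + prior, target, ignore_index=ignore_index)
```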

To understand whether the overall mIoU degradation is caused by over-rebalance, i.e., whether the IoU of high-frequency classes drops a lot even though the IoU of low-frequency classes improves, we plot ΔIoU and ΔAcc for each class y. ΔIoU and ΔAcc are calculated by

\Delta\mathrm{IoU}_{y} = \mathrm{IoU}_{y}^{\mathrm{rebalance}} - \mathrm{IoU}_{y}^{\mathrm{baseline}},  (4)
\Delta\mathrm{Acc}_{y} = \mathrm{Acc}_{y}^{\mathrm{rebalance}} - \mathrm{Acc}_{y}^{\mathrm{baseline}}.  (5)

As shown in Figure 3 (b) and (c), after pixel rebalance with balanced softmax, IoU decreases while Acc increases for low-frequency classes.

This case study shows that pixel rebalance can improve Acc while failing to meet the IoU improvement condition of Remark 1. It implies that pixel rebalance improves low-frequency Acc largely through an increase of FP, which is harmful to IoU.

3.1.2 Accuracy vs. Frequency

An important prior for rebalancing in long-tailed image classification is that images are nearly i.i.d., which makes class accuracy positively related to the number of images in the corresponding classes as shown in Figure 4 (a).

In this case, to make the optimization process friendlier to low-frequency classes during training, it is sensible to re-weight the cross-entropy loss with a factor that is inversely proportional to class frequency, as in [20]. Moreover, Cao et al. [10] and Ren et al. [43] theoretically reveal the close relationship between the optimal margin and class frequency, further indicating the importance of class frequency statistics for conducting rebalance.

However, the situation is different for the semantic segmentation task. Pixels within the same region of an image usually belong to the same object or stuff and are thus highly correlated. Since pixels are not i.i.d., the class pixel frequency is an unsuitable factor for rebalancing. This conclusion is supported empirically in Figure 4 (b), where class accuracy is shown to have much weaker correlation with class pixel frequency compared to class image frequency in image classification, shown in Figure 4 (a).

Table 2: Pearson coefficients between class frequency and accuracy.
Dataset           ρ_{X,Y} (%)
CIFAR-100         75.9
ADE20K-pixel      35.0
ADE20K-region     41.3

To quantify the correlation between class accuracy and frequency, we calculate the Pearson correlation coefficient [4]:

\rho_{X,Y}=\frac{\mathrm{Cov}(X,Y)}{\sigma(X)\sigma(Y)}=\frac{\mathbb{E}[(X-\mu(X))(Y-\mu(Y))]}{\sigma(X)\sigma(Y)},  (6)

where μ(·), σ(·), Cov(·) and 𝔼(·) are the mean, standard deviation, covariance and expectation functions. Table 2 lists the Pearson coefficients between class accuracy and image frequency on CIFAR-100, pixel frequency on ADE20K, and region frequency on ADE20K, respectively. The correlation of class accuracy with image frequency, region frequency, and pixel frequency decreases in that order. Region frequency has a stronger correlation with class accuracy than pixel frequency, as it eliminates the effects of correlations among neighboring pixels. The weak relationship between class accuracy and pixel frequency makes pixel frequency less informative for rebalancing, which adds to the challenge of rebalancing in semantic segmentation.
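For completeness, Eq. (6) can be computed with a few lines of NumPy; the frequency and accuracy arrays below are placeholders, not the statistics used in Table 2.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of Eq. (6) (population statistics)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return ((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std())

# Placeholder per-class frequencies (images, pixels, or regions) and accuracies.
class_freq = np.array([9.2e6, 3.1e6, 8.0e5, 1.2e5, 4.0e4])
class_acc = np.array([0.81, 0.74, 0.62, 0.48, 0.35])
print(f"rho = {100 * pearson(class_freq, class_acc):.1f}%")
```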

3.2 Region Rebalance Framework

Due to the issues discussed in Sec. 3.1, we propose to rebalance at the region level instead of the pixel level, through a Region Rebalance (RR) training strategy. The region rebalance method relieves the class imbalance issue with an auxiliary region classification branch by encouraging features to lie in a more balanced region classification space. Moreover, rebalance methods from long-tailed image classification can be readily employed to solve the region imbalance problem. This involves no extra cost for testing because the region rebalance branch is removed during inference.

The region rebalance framework shown in Figure 2 contains two components: (i) an auxiliary region classification branch to ease the class imbalance problem, and (ii) enhanced rebalance with techniques from long-tailed image classification to solve the region imbalance. The following loss functions are deployed:

\mathcal{F}(i) = \sum_{x\in\mathcal{D}} \mathbb{1}\left[\,i \in x_{gt}\,\right],  (7)
\mathcal{L}_{reg} = \frac{1}{|\mathcal{R}|}\sum_{r\in\mathcal{R}} -\log\!\left(\frac{\mathcal{F}(r^{y})\,e^{r_{y}}}{\sum_{k\in Y}\mathcal{F}(k)\,e^{r_{k}}}\right),  (8)
\mathcal{L}_{all} = \mathcal{L}_{pixel} + \lambda\,\mathcal{L}_{reg},  (9)

where \mathcal{D} is the dataset, \mathcal{R} contains all the extracted regions, r^{y} is the ground-truth label of region r, r_{i} represents the region logit corresponding to class i, \mathcal{F}(i) denotes the region frequency of class i, Y is the label space, \mathcal{L}_{reg} is the region loss, and \mathcal{L}_{pixel} is the pixel cross-entropy loss.

In the region classification branch, we group pixels belonging to the same class into region features by averaging the pixel features according to the ground-truth region map. Then the extracted region features are fed into a region classifier. Further, we enhance the rebalance using a technique [43] borrowed from the long-tailed image classification community to tackle the region imbalance problem. The deployed region loss is shown in Eq. (8), and we use a hyper-parameter λ to control the strength of rebalance, as shown in Eq. (9).
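A minimal PyTorch sketch of this training loss is given below. It is only an illustration of Eqs. (7)-(9) under our own naming: the region classifier is an auxiliary linear head that is dropped at inference, the region prior is a precomputed count vector, and the official implementation may organize these steps differently.

```python
import torch
import torch.nn.functional as F

def region_rebalance_loss(pixel_feats, pixel_logits, target, region_classifier,
                          region_freq, lam=0.3, ignore_index=255):
    """Sketch of Eq. (9): pixel cross-entropy plus a balanced region loss.

    pixel_feats: (N, D, H, W) features before the pixel classifier;
    pixel_logits: (N, C, H, W) pixel predictions; target: (N, H, W) labels;
    region_classifier: auxiliary linear layer D -> C, used only in training;
    region_freq: (C,) region counts per class, i.e., the prior F(i) of Eq. (7).
    """
    loss_pixel = F.cross_entropy(pixel_logits, target, ignore_index=ignore_index)

    n, d, _, _ = pixel_feats.shape
    region_feats, region_labels = [], []
    for i in range(n):                                          # per image
        feats = pixel_feats[i].permute(1, 2, 0).reshape(-1, d)  # (H*W, D)
        labels = target[i].reshape(-1)
        for c in labels.unique():
            if c.item() == ignore_index:
                continue
            # Average-pool pixel features of class c into one region feature.
            region_feats.append(feats[labels == c].mean(dim=0))
            region_labels.append(c)
    region_feats = torch.stack(region_feats)                    # (R, D)
    region_labels = torch.stack(region_labels)                  # (R,)

    # Balanced region classification of Eq. (8): shift logits by log region frequency.
    prior = torch.log(region_freq.float().clamp(min=1))
    loss_region = F.cross_entropy(region_classifier(region_feats) + prior, region_labels)

    return loss_pixel + lam * loss_region
```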

3.3 Analysis of Our Method

The misalignment between the training objective, i.e., pixel cross-entropy, and the evaluation metric IoU at inference is problematic for direct pixel rebalance, as analyzed in Sec. 3.1. The proposed method instead relieves class imbalance via region rebalance. To clarify the mechanism of our region rebalance method, we analyze it in comparison to pixel rebalance. In addition, we discuss differences in imbalance reflected by dataset statistics.

(a) Long-tailed distribution of pixels vs. regions. (b) Degree of pixel imbalance vs. region imbalance:

Dataset            PIF      RIF
ADE20K             827      282
COCO-Stuff164K     2612     528

Figure 5: Comparison between pixel imbalance and region imbalance: (a) Histogram statistics of pixel/region frequency over sorted class indexes on ADE20K (left) and COCO-Stuff164K (right). (b) Numerical comparison of the pixel imbalance factor (PIF) and region imbalance factor (RIF). It can be seen that there is less region imbalance than pixel imbalance.

3.3.1 Region Rebalance vs. Pixel Rebalance

On one hand, region rebalance gathers pixel features of the same class into region features, and then region classification follows. The “gather” operation essentially removes the effect of correlation among pixels in a region. As demonstrated in Table 2, region frequency has a stronger correlation with class accuracy than pixel frequency, further indicating its greater suitability for rebalancing. Besides, the “gather” operation reduces data class imbalance and thus is good for rebalancing as analyzed in Sec. 3.3.2.

On the other hand, we empirically observe that region rebalance is more aligned with the IoU metric. As studied in Sec. 3.1.1, pixel rebalance promotes higher Acc while failing to improve IoU because of an increase in FP.

Such behavior can be understood by considering that two very different objects/stuff can still have similar local pixel features. With pixel rebalance, we attach more importance to pixels of low-frequency classes, and as a consequence, similar pixels in other classes are more likely to become FP. In contrast, objects/stuff are more distinguishable at the region level because the features are more global. Though more importance is also attached to regions of low-frequency classes with region rebalance, it is much harder for objects/stuff of other classes to be predicted as FP. More visual evidence is discussed in our supplementary file. We leave further theoretical analysis to future work.

3.3.2 Dataset Statistics

Semantic segmentation datasets suffer from both pixel-level and region-level data imbalance. However, we observe that the degree of pixel imbalance is more serious than that of region imbalance, as shown in Figure 5. We collect statistics from the most popular semantic segmentation benchmarks in Figure 5. Following convention [20, 39], we calculate the imbalance factor (IF) as $\beta=\frac{N_{max}}{N_{min}}$, where $N_{max}$ and $N_{min}$ are the numbers of training samples for the most frequent class and the least frequent class.
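The imbalance factor is straightforward to compute from per-class sample counts; the sketch below uses made-up counts chosen only so that the two factors roughly match the PIF and RIF reported for ADE20K in Figure 5.

```python
import numpy as np

def imbalance_factor(counts):
    """beta = N_max / N_min over per-class training-sample counts."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]        # ignore classes absent from the training set
    return counts.max() / counts.min()

pixel_counts = np.array([8.27e8, 5.0e7, 1.0e6])     # illustrative pixel counts
region_counts = np.array([28200, 9000, 100])        # illustrative region counts
print(imbalance_factor(pixel_counts), imbalance_factor(region_counts))   # 827.0 282.0
```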

As shown in Figure 5, for ADE20K, the pixel imbalance factor (PIF) is nearly 3 times the region imbalance factor (RIF). For COCO-Stuff164K, the PIF is nearly 5 times the RIF. As our method relieves the class imbalance issue via region rebalancing, it conveniently benefits from less imbalance than pixel rebalance would, and this in turn promotes more effective rebalancing.

4 Experiments

To evaluate the effectiveness of our method, we conduct experiments on the most popular benchmarks in semantic segmentation, i.e., ADE20K [66] and COCO-Stuff164K [8]. By inserting the proposed region rebalance method into UperNet, PSPNet, DeepLabv3+, and OCRNet, clear improvements are obtained. We also test the region rebalance method on popular transformer neural networks, specifically the Swin transformer [38] and BEiT [3].

4.1 Datasets

ADE20K. ADE20K [66] is a challenging dataset often used to validate transformer-based neural networks on downstream tasks such as semantic segmentation. It contains 22K densely annotated images with 150 fine-grained semantic concepts. The training and validation sets consist of 20K and 2K images, respectively.

COCO-Stuff164K. COCO-Stuff164K [8] is a large-scale scene understanding benchmark that can be used for evaluating semantic segmentation, object detection, and image captioning. It includes all 164K images from COCO 2017. The training and validation sets contain 118K and 5K images, respectively. It covers 171 classes: 80 thing classes and 91 stuff classes.

4.2 Implementation Details

We implement the proposed region rebalance method in the mmsegmentation codebase [16] and follow the commonly used training settings for each dataset. More details are described in the following.

For backbones, we use the CNN-based ResNet-50c and ResNet-101c, which replace the first 7×7 convolution layer in the original ResNet-50 and ResNet-101 with three consecutive 3×3 convolutions. Both are popular in the semantic segmentation community [62]. For OCRNet, we adopt HRNet-W18 and HRNet-W48 [55]. For transformer-based neural networks, we adopt the popular Swin transformer [38] and BEiT [3]. BEiT achieves the most recent state-of-the-art performance on the ADE20K validation set. "†" in Tables 3 and 4 indicates models that are pretrained on ImageNet-22K.

For CNN-based models, we use SGD and the poly learning rate schedule [62] with an initial learning rate of 1e-2 and a weight decay of 1e-4. If not stated otherwise, we use a crop size of 512×512, a batch size of 16, and train all models for 160K iterations on ADE20K and 320K iterations on COCO-Stuff164K. For the Swin transformer and BEiT, we use their default optimizer, learning rate setup and training schedule.
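A plain PyTorch sketch of this CNN schedule is shown below; the momentum value and the poly power of 0.9 are common defaults and are assumptions here rather than details taken from the paper.

```python
import torch

# SGD with initial lr 1e-2 and weight decay 1e-4, plus the "poly" schedule
# lr * (1 - iter / max_iter) ** power, stepped once per training iteration.
model = torch.nn.Conv2d(3, 150, 1)        # stand-in for a real segmentation network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=1e-4)   # momentum assumed

max_iters, power = 160_000, 0.9           # 160K iterations on ADE20K; power assumed
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda it: (1 - it / max_iters) ** power)

# Per-iteration loop skeleton:
# loss.backward(); optimizer.step(); optimizer.zero_grad(); scheduler.step()
```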

In the training phase, standard random scale jittering between 0.5 and 2.0, random horizontal flipping, random cropping, and random color jittering are used as data augmentations [16]. For inference, we report the performance of both single-scale (s.s.) inference and multi-scale (m.s.) inference with horizontal flips and scales of 0.5, 0.75, 1.0, 1.25, 1.5, and 1.75.

Table 3: Performance on ADE20K.
Method           Backbone     mIoU (s.s.)   mIoU (m.s.)
OCRNet           HRNet-W18    39.32         40.80
OCRNet-RR        HRNet-W18    41.80         43.55 (+2.75)
UperNet          ResNet-50    42.05         42.78
UperNet-RR       ResNet-50    43.27         44.03 (+1.25)
PSPNet           ResNet-50    42.48         43.44
PSPNet-RR        ResNet-50    42.83         44.26 (+0.82)
DeepLabv3+       ResNet-50    43.95         44.93
DeepLabv3+-RR    ResNet-50    45.02         46.03 (+1.10)
OCRNet           HRNet-W48    43.25         44.88
OCRNet-RR        HRNet-W48    44.50         46.09 (+1.21)
UperNet          ResNet-101   43.82         44.85
UperNet-RR       ResNet-101   44.78         46.11 (+1.26)
PSPNet           ResNet-101   44.39         45.35
PSPNet-RR        ResNet-101   44.65         46.15 (+0.80)
DeepLabv3+       ResNet-101   45.47         46.35
DeepLabv3+-RR    ResNet-101   46.67         47.96 (+1.61)
UperNet          Swin-T       44.51         45.81
UperNet-RR       Swin-T       45.02         46.45 (+0.64)
UperNet          Swin-B†      50.04         51.66
UperNet-RR       Swin-B†      51.20         52.88 (+1.22)
UperNet          BEiT-L       56.7          57.0
UperNet-RR       BEiT-L       57.2          57.7 (+0.70)

4.3 Main Results

Comparison on ADE20K. To show the flexibility of our region rebalance module, we experiment on ADE20K with various semantic segmentation methods, e.g., PSPNet, UperNet, OCRNet, and DeepLabv3+. As shown in Table 3, plugging the region rebalance module into those methods leads to significant improvements. Specifically, for ResNet-101 or HRNet-W48, training with Region Rebalance yields 1.26%, 0.80%, 1.21% and 1.61% gains for UperNet, PSPNet, OCRNet, and DeepLabv3+, respectively.

To further show the effectiveness of the proposed region rebalance method, we experiment with the Swin transformer and BEiT. For CNN-based models, we achieve 47.96 mIoU with ResNet-101, surpassing the baseline by 1.61. With BEiT, the state of the art is improved from 57.0 mIoU to 57.7 mIoU.

Comparison on COCO-Stuff164K. On the large-scale COCO-Stuff164K dataset, we again demonstrate the flexibility of our region rebalance module. The experimental results are summarized in Table 4. Equipped with the region rebalance module in training, CNN-based models, i.e., ResNet-50, HRNet-W18, ResNet-101 and HRNet-W48, surpass their baselines by a large margin. Specifically, with HRNet-W18 and OCRNet, our trained model outperforms the baseline by 6.44 mIoU. With the larger CNN-based ResNet-101 and DeepLabv3+, our model achieves 44.62 mIoU, surpassing the baseline by 1.66 mIoU.

We also verify the effectiveness of the region rebalance module with the Swin transformer on COCO-Stuff164K. Experimental results with Swin-T and Swin-B show clear improvements after rebalancing with our region rebalance method in the training phase.

Table 4: Performance on COCO-Stuff164K.
Method           Backbone     mIoU (s.s.)   mIoU (m.s.)
OCRNet           HRNet-W18    31.58         32.34
OCRNet-RR        HRNet-W18    37.98         38.78 (+6.44)
UperNet          ResNet-50    39.86         40.26
UperNet-RR       ResNet-50    41.20         41.73 (+1.47)
PSPNet           ResNet-50    40.53         40.75
PSPNet-RR        ResNet-50    41.92         42.34 (+1.95)
DeepLabv3+       ResNet-50    40.85         41.49
DeepLabv3+-RR    ResNet-50    42.66         43.47 (+1.89)
OCRNet           HRNet-W48    40.40         41.66
OCRNet-RR        HRNet-W48    41.83         43.29 (+1.36)
UperNet          ResNet-101   41.15         41.51
UperNet-RR       ResNet-101   42.30         42.78 (+1.27)
PSPNet           ResNet-101   41.95         42.42
PSPNet-RR        ResNet-101   43.40         43.85 (+1.43)
DeepLabv3+       ResNet-101   42.39         42.96
DeepLabv3+-RR    ResNet-101   43.93         44.62 (+1.66)
UperNet          Swin-T       43.83         44.58
UperNet-RR       Swin-T       44.45         45.18 (+0.60)
UperNet          Swin-B†      47.67         48.57
UperNet-RR       Swin-B†      48.21         49.20 (+0.63)
(c) DeepLabv3+ w/ ResNet-101. (d) OCRNet w/ HRNet-W48. (e) UperNet w/ ResNet-101.
Figure 6: Illustrating the category-wise improvements with our region rebalance method: The IoUs of most low-frequency classes are improved when the proposed region rebalance method is adopted in training. Class indexes are sorted in descending order by the number of pixels belonging to the same class.

4.4 Ablation Experiments

Ablation of region rebalance components. To verify the usefulness of both components in our region rebalance method, we conduct the following ablations on top of the baseline: I. only with an auxiliary region classification branch; II. further inserting a rebalance mechanism, i.e., balanced softmax [43]. We plot the experimental results in Figure 7.

With the two components applied in training, we observe consistent improvements on ADE20K and the large-scale COCO-Stuff164K. Additionally, an interesting phenomenon is observed: with only component I, deep models already enjoy significant gains on COCO-Stuff164K. The reason may be that the ratio between the pixel imbalance factor (PIF) and the region imbalance factor (RIF) is much larger on COCO-Stuff164K than on ADE20K, as shown in Figure 5. From this point of view, when adopting I, models are given stronger regularization and the output features are encouraged to be more balanced.

Improvements on low-frequency classes. To further examine the rebalancing effects of our proposed region rebalance method, we plot the class-wise ΔIoU between the baseline model and the model trained with the region rebalance method. Figure 3 (a) shows the results on ADE20K with ResNet-50 and DeepLabv3+. We also validate the rebalance effects on COCO-Stuff164K with different semantic segmentation methods, i.e., DeepLabv3+, OCRNet, and UperNet, as shown in Figure 6. We observe that the IoU of most low-frequency classes is enhanced. For example, the IoU of the class "shower" is improved from 0 to 6.14 on ADE20K.

Figure 7: Ablations for our region rebalance components.
Table 5: Ablation on the hyper-parameter λ.
Hyper-parameter λ        mIoU (s.s.)
λ = 0.0 (baseline)       43.95
λ = 0.1                  44.61
λ = 0.2                  44.85
λ = 0.3                  45.02
λ = 0.4                  44.64

Ablation of hyper-parameter λ. In Eq. (9), we introduce the hyper-parameter λ to weight between the pixel loss and the region loss. All experimental results in Tables 3 and 4 are reported with the same λ = 0.3.

To show the sensitivity of our region rebalance method to different λ values, we conduct an ablation with ResNet-50 and DeepLabv3+ on ADE20K. Table 5 lists the experimental results, showing that the performance is not affected substantially by the value of λ within the range [0.1, 0.3].

Ablation on Dice loss. Though some IoU-based loss functions, e.g., Dice loss [41], have been developed, we empirically observe that pixel cross-entropy is still necessary in training for high performance. An ablation on the Dice loss is conducted on ADE20K with ResNet-50 and DeepLabv3+, and the results are reported in Table 6. With the Dice loss alone, the model achieves only 1.14 mIoU. This huge performance degradation implies that pixel cross-entropy remains a necessary part of training, and thus the challenges in Sec. 3.1 still exist when conducting pixel rebalance.

Table 6: Ablation on Dice loss.
Method                       mIoU (s.s.)
Baseline (cross-entropy)     43.95
Dice Loss                    1.14
Dice Loss + cross-entropy    44.12
RR (Ours)                    45.02

5 Conclusion

In this paper, we investigated the class imbalance phenomenon in semantic segmentation and identified two major challenges in conducting pixel rebalance: (i) the misalignment between the training objective and the IoU evaluation metric at inference, and (ii) correlations among pixels within the same object/stuff, which make class pixel frequency less effective for rebalancing. Based on our analysis, we propose to gather correlated pixels into regions and rebalance within the context of region classification. The proposed region rebalance method leads to strong improvements across different semantic segmentation methods, e.g., OCRNet, DeepLabv3+, UperNet, and PSPNet. Experimental results on ADE20K and the large-scale COCO-Stuff164K verify the effectiveness of our method.

References

  • [1] Md Amirul Islam, Mrigank Rochan, Neil DB Bruce, and Yang Wang. Gated feedback refinement network for dense image labeling. In CVPR, 2017.
  • [2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI, 2017.
  • [3] Hangbo Bao, Li Dong, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  • [4] Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. In Noise reduction in speech processing. 2009.
  • [5] Mateusz Buda, Atsuto Maki, and Maciej A Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 2018.
  • [6] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks, 2018.
  • [7] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In ICML, 2019.
  • [8] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, 2018.
  • [9] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In NeurIPS, 2019.
  • [10] Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In NeurIPS, 2019.
  • [11] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 2002.
  • [12] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
  • [13] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE TPAMI, 2017.
  • [14] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [15] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
  • [16] MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
  • [17] Jiequan Cui, Shu Liu, Zhuotao Tian, Zhisheng Zhong, and Jiaya Jia. Reslt: Residual learning for long-tailed recognition. arXiv preprint arXiv:2101.10633, 2021.
  • [18] Jiequan Cui, Zhisheng Zhong, Shu Liu, Bei Yu, and Jiaya Jia. Parametric contrastive learning. In ICCV, 2021.
  • [19] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge J. Belongie. Class-balanced loss based on effective number of samples. In CVPR, 2019.
  • [20] Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. Class-balanced loss based on effective number of samples. In CVPR, 2019.
  • [21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • [22] Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A dataset for large vocabulary instance segmentation. In CVPR, 2019.
  • [23] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE TKDE, 2009.
  • [24] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In ICCV, 2017.
  • [25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [26] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In CVPR, 2016.
  • [27] Chen Huang, Yining Li, Change Loy Chen, and Xiaoou Tang. Deep imbalanced learning for face recognition and attribute prediction. IEEE TPAMI, 2019.
  • [28] Chen Huang, Yining Li, Chen Change Loy, and Xiaoou Tang. Learning deep representation for imbalanced classification. In CVPR, 2016.
  • [29] Muhammad Abdullah Jamal, Matthew Brown, Ming-Hsuan Yang, Liqiang Wang, and Boqing Gong. Rethinking class-balanced methods for long-tailed visual recognition from a domain adaptation perspective. In CVPR, 2020.
  • [30] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 2002.
  • [31] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. arXiv preprint arXiv:1910.09217, 2019.
  • [32] Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, and Yannis Kalantidis. Decoupling representation and classifier for long-tailed recognition. In ICLR, 2020.
  • [33] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
  • [34] Yu Li, Tao Wang, Bingyi Kang, Sheng Tang, Chunfeng Wang, Jintao Li, and Jiashi Feng. Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In CVPR, 2020.
  • [35] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, 2017.
  • [36] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  • [37] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
  • [38] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  • [39] Ziwei Liu, Zhongqi Miao, Xiaohang Zhan, Jiayun Wang, Boqing Gong, and Stella X Yu. Large-scale long-tailed recognition in an open world. In CVPR, 2019.
  • [40] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • [41] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, 2016.
  • [42] Tobias Pohlen, Alexander Hermans, Markus Mathias, and Bastian Leibe. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR, 2017.
  • [43] Jiawei Ren, Cunjun Yu, Shunan Sheng, Xiao Ma, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Balanced meta-softmax for long-tailed visual recognition. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, NeurIPS, 2020.
  • [44] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018.
  • [45] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
  • [46] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
  • [47] Li Shen, Zhouchen Lin, and Qingming Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, 2016.
  • [48] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-weight-net: Learning an explicit mapping for sample weighting. In NeurIPS, 2019.
  • [49] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [50] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
  • [51] Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, and Junjie Yan. Equalization loss for long-tailed object recognition. In CVPR, 2020.
  • [52] Kaihua Tang, Jianqiang Huang, and Hanwang Zhang. Long-tailed classification by keeping the good and removing the bad momentum causal effect. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, NeurIPS, 2020.
  • [53] Kaihua Tang, Jianqiang Huang, and Hanwang Zhang. Long-tailed classification by keeping the good and removing the bad momentum causal effect. In NeurIPS, 2020.
  • [54] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
  • [55] Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, et al. Deep high-resolution representation learning for visual recognition. IEEE TPAMI, 2020.
  • [56] Jiaqi Wang, Wenwei Zhang, Yuhang Zang, Yuhang Cao, Jiangmiao Pang, Tao Gong, Kai Chen, Ziwei Liu, Chen Change Loy, and Dahua Lin. Seesaw loss for long-tailed instance segmentation. In CVPR, 2021.
  • [57] Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella X. Yu. Long-tailed recognition by routing diverse distribution-aware experts. In ICLR, 2021.
  • [58] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In NeurIPS, 2017.
  • [59] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018.
  • [60] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
  • [61] Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In ECCV, 2020.
  • [62] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, 2017.
  • [63] Q Zhong, C Li, Y Zhang, H Sun, S Yang, D Xie, and S Pu. Towards good practices for recognition & detection. In CVPR workshops, 2016.
  • [64] Zhisheng Zhong, Jiequan Cui, Shu Liu, and Jiaya Jia. Improving calibration for long-tailed recognition. In CVPR, 2021.
  • [65] Boyan Zhou, Quan Cui, Xiu-Shen Wei, and Zhao-Min Chen. Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. arXiv preprint arXiv:1912.02413, 2019.
  • [66] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR, 2017.

Region Rebalance for Long-Tailed Semantic Segmentation
Supplementary Material

Appendix A Experiments on COCO-Stuff10K for Semantic Segmentation

COCO-Stuff10K, which includes 10K images from the COCO training set, is a subset of COCO-Stuff164K. The training and validation sets consist of 9K and 1K images, respectively. It also covers 171 classes. The experimental results are reported in Table 7. Employing our region rebalance module in training yields 1.18 mIoU and 1.19 mIoU improvements, respectively, compared with the baselines.

Table 7: Performance on COCO-Stuff10K.
Method           Backbone     mIoU (s.s.)   mIoU (m.s.)
DeepLabv3+       ResNet-50    35.84         36.85
DeepLabv3+-RR    ResNet-50    36.53         38.03 (+1.18)
DeepLabv3+       ResNet-101   38.14         39.00
DeepLabv3+-RR    ResNet-101   38.72         40.19 (+1.19)

Appendix B Proof of Remark 1

Before the pixel rebalance, for one low-frequency class y, we suppose its Acc and IoU are

\mathrm{Acc} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}},  (10)
\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}}.  (11)

With the pixel rebalance, suppose Acc is improved to Acc·(1+K). Then the class Acc and IoU become

\mathrm{Acc}^{r} = \frac{\mathrm{TP}\cdot(1+K)}{\mathrm{TP}\cdot(1+K)+\mathrm{FN}-K\cdot\mathrm{TP}} = \frac{\mathrm{TP}\cdot(1+K)}{\mathrm{TP}+\mathrm{FN}},  (12)
\mathrm{IoU}^{r} = \frac{\mathrm{TP}\cdot(1+K)}{\mathrm{TP}\cdot(1+K)+\mathrm{FN}-K\cdot\mathrm{TP}+\mathrm{FP}^{r}} = \frac{\mathrm{TP}\cdot(1+K)}{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}^{r}}.  (13)

Taking the reciprocals of \mathrm{IoU}^{r} and \mathrm{IoU}, we obtain

\frac{1}{\mathrm{IoU}^{r}} = \frac{\mathrm{TP}\cdot(1+K)+\mathrm{FN}-K\cdot\mathrm{TP}+\mathrm{FP}^{r}}{\mathrm{TP}\cdot(1+K)} = 1+\frac{\mathrm{FN}-K\cdot\mathrm{TP}+\mathrm{FP}^{r}}{\mathrm{TP}\cdot(1+K)},  (14)
\frac{1}{\mathrm{IoU}} = \frac{\mathrm{TP}+\mathrm{FN}+\mathrm{FP}}{\mathrm{TP}} = 1+\frac{\mathrm{FN}+\mathrm{FP}}{\mathrm{TP}} = 1+\frac{(\mathrm{FN}+\mathrm{FP})\cdot(1+K)}{\mathrm{TP}\cdot(1+K)}.  (15)

To guarantee \mathrm{IoU}\leq\mathrm{IoU}^{r}, it should satisfy:

\mathrm{FN}-K\cdot\mathrm{TP}+\mathrm{FP}^{r} \leq (\mathrm{FN}+\mathrm{FP})\cdot(1+K) \iff \mathrm{FP}^{r}-(1+K)\cdot\mathrm{FP} \leq K\cdot(\mathrm{TP}+\mathrm{FN}) = K\cdot n_{y},  (16)

where n_{y} is the pixel frequency of class y.

Figure 8: The effects of FP/TP on IoU and Acc. (a) With b = FP/TP ∈ [0, 1]. (b) With b = FP/TP ∈ [0, 5].

Appendix C How Does FP/TP Affect IoU and Acc?

As analyzed in Sec. 3.1, we examine the connection between IoU and Acc, i.e.,

\frac{1}{\mathrm{IoU}}=\frac{1}{\mathrm{Acc}}+\frac{\mathrm{FP}}{\mathrm{TP}}.  (17)

To understand the role of FP/TP, we plot IoU against Acc for various constant values b = FP/TP in Figure 8. We observe that, when b = FP/TP is small, e.g., b ∈ [0, 1], optimizing Acc is nearly equivalent to optimizing IoU (Figure 8 (a)). However, as demonstrated by Figure 8 (b), when b = FP/TP is large, e.g., b ≥ 3, optimizing Acc has little effect on IoU, especially when Acc has already reached a high value, e.g., Acc = 0.6. This is exactly why, compared to the baseline, Acc improves while IoU even decreases for low-frequency classes in Sec. 3.1.1.
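The curves of Figure 8 can be reproduced qualitatively with a few lines of plotting code; the chosen values of b below are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# IoU = 1 / (1/Acc + b) for fixed b = FP/TP: for small b the curve tracks Acc,
# while for large b it flattens, so improving Acc barely moves IoU.
acc = np.linspace(0.01, 1.0, 200)
for b in (0.0, 0.5, 1.0, 3.0, 5.0):
    plt.plot(acc, 1.0 / (1.0 / acc + b), label=f"b = {b}")
plt.xlabel("Acc"); plt.ylabel("IoU"); plt.legend(); plt.show()
```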

Appendix D Visual Evidence for Region Rebalance vs. Pixel Rebalance

As demonstrated in Sec. 3.1.1, pixel rebalance improves Acc while significantly increasing FP for low-frequency classes. As a result, IoU cannot benefit from pixel rebalance. In contrast, the proposed region rebalance is more aligned with the IoU metric. This phenomenon can be understood by considering that two very different objects or stuff can still have similar local pixel features. Conducting pixel rebalance makes it easy for such similar pixels of other classes to become FP. Considering that objects/stuff are more distinguishable at the region level because of their global features, it is much harder to predict objects/stuff of other classes as FP. We give visual evidence for this intuitive reasoning in Figure 9.

As shown in the regions highlighted with rectangles, pixels that have similar texture to other objects/stuff, or that lie near object/stuff boundaries, are more likely to become FP with pixel rebalance. Region rebalance relieves this issue by using global region features.

Figure 9: Visualization comparisons between pixel rebalance and region rebalance. Rows: (a) Input; (b) GT; (c) Pixel Rebalance; (d) Region Rebalance.

Appendix E Qualitative Visualization Comparisons on ADE20K and COCO-Stuff164K

In this section, we demonstrate the advantages of region rebalance with qualitative visualizations on ADE20K and COCO-Stuff164K, shown in Figures 10, 11 and 12.

Figure 10: Visualization comparisons between baseline and region rebalance on ADE20K. Rows: (a) Input; (b) GT; (c) Baseline; (d) Region Rebalance.

Figure 11: Visualization comparisons between baseline and region rebalance on ADE20K. Rows: (a) Input; (b) GT; (c) Baseline; (d) Region Rebalance.

Figure 12: Visualization comparisons between baseline and region rebalance on COCO-Stuff164K. Rows: (a) Input; (b) GT; (c) Baseline; (d) Region Rebalance.