Self-attention on Multi-Shifted Windows for Scene Segmentation

Litao Yu¹ Zhibin Li² Jian Zhang¹ Qiang Wu¹ University of Technology Sydney¹ CSIRO² {litao.yu, jian.zhang, Qiang Wu}@uts.edu.au¹ [email protected]

Abstract

Scene segmentation in images is a fundamental yet challenging problem in visual content understanding, which is to learn a model to assign every image pixel to a categorical label. One of the challenges for this learning task is to consider the spatial and semantic relationships to obtain descriptive feature representations, so learning the feature maps from multiple scales is a common practice in scene segmentation. In this paper, we explore the effective use of self-attention within multi-scale image windows to learn descriptive visual features, then propose three different strategies to aggregate these feature maps to decode the feature representation for dense prediction. Our design is based on the recently proposed Swin Transformer models, which totally discards convolution operations. With the simple yet effective multi-scale feature learning and aggregation, our models achieve very promising performance on four public scene segmentation datasets, PASCAL VOC2012, COCO-Stuff 10K, ADE20K and Cityscapes.

1 Introduction

Scene segmentation is a dense classification task for visual content analysis in computer vision. The goal is to parse the objects or scenes into different 2D regions associated with semantic categories. Scene segmentation in images has drawn a broad interest for many applications such as robotic sensing Cadena and Košecká (2014) and auto-navigation Xiao and Quan (2009).

Recently, the development of deep convolution neural networks has led to remarkable progress in semantic segmentation due to their powerful feature representation ability to describe the local visual information. A deep segmentation network usually has the encoder-decoder learning architecture. The encoder consists of stacked convolution layers and down-sampling layers, which learns high-level semantic concepts with the progressively increased receptive field. For general scene segmentation tasks, the encoder is basically a pre-trained classification network, e.g., deep residual networks He et al. (2016). The decoder, on the other hand, aims to predict the pixel-level classes by considering the contextual visual-semantic information. In some encoder-decoder segmentation architectures such as DeepLab Chen et al. (2018), multi-scale contextual information plays a critical role in determining the pixel-level classes, but such kinds of designs also raise an issue that they cannot learn the long-range spatial dependencies, which becomes a fundamental challenge because of the limited receptive fields applied in convolution kernels.

Refer to caption — Figure 1: An illustration of multi-sifted windows. An image with $6\times 6$ patches can be partitioned into $2\times 2$ or $3\times 3$ windows, and each partition can be separately shifted by different sizes.

Inspired by the success of Transformer on language modelling Vaswani et al. (2017); ACL19:BERT, the self-attention paradigm of Query-Key-Value (QKV) with multi-head attention and positional encoding has been transferred to 2D image processing, which we call as Vision Transformer (ViT). Directly partitioning an image into equal-sized patches and considering them as a sequence to learn their dependencies is a brute-force way for image classification Dosovitskiy et al. (2021). It first splits an image into equal-sized patches, then produces the low-dimensional linear embeddings. These embeddings with positional encodings are fed into the vanilla Transformer model as a token sequence. The disadvantage of this model is the lack of inductive biases of convolution neural networks, such as translation invariance and the locally restricted receptive fields, so directly training with relatively small-sized datasets obtains much inferior results. Pretrained on the very large-scale dataset, such as JFT-300M Sun et al. (2017), then transfered to relatively small datasets, ViT achieves the state-of-the-art accuracy for classification tasks. Compared with ConvNet based scene segmentation models, the scaled dot-product attention (self-attention) in the Transformer structure has a compelling advantage because QKV learns the global dependencies of features. However, ViT does not have hierarchical feature representations. Also, the input images are supposed to be in the same size, and each image is represented by a fixed number of tokens. This is problematic for scene segmentation because the spatial-semantic consistency is difficult to be preserved at the pixel level. Also, the complexity of self-attention is quadratic to image size, leading to the comparably low efficiency. Based on the above observations, we build the scene segmentation model backend on the recently proposed Swin Transformer Liu et al. (2021b). With the patch-merging and the self-attention within the non-overlapping windows, Swin Transformer overcomes two problems of ViT: the fixed scale tokens and the quadratic computational complexity. The pre-trained Swin Transformer itself can be used as a powerful backbone for many downstream learning tasks such as object detection and semantic segmentation. A similar work called Focal Transformer, conducts the self-attention with fine-attention locally and coarse attention globally, which enables sighificantly larger receptive fields and long-range self-attention with less memory cost and more time efficiency Yang et al. (2021).

Although visual Transformers are able to learn the spatial dependencies in image processing, a simple Transformer that computes the pixel-level labels from tokens still suffers from the limitation of single-scale features, because experiences show that multi-scale information can help resolve ambiguous cases and results in more robust scene segmentation models Zhao et al. (2017); Yang et al. (2018); Chen et al. (2018). In Swin Transformer, the self-attention is computed within a fixed local window. Such a single-scale self-attention carries the limited local information, without the consideration of the contexts in a larger receptive field. The multi-scale feature representation is an effective way for pixel-level prediction. This motivates us to propose a Transformer with self-attention on multi-shifted windows for scale variations to decode the visual feature representations for pixel-level classifications. Specifically, we propose to use self-attentions on a series of shifted windows then aggregate them to generate visual feature maps in the decoder of the scene segmentation model. The multi-shifted window is illustrated in Figure 1. By doing this, each neuron layer can encode the semantic information from multi-scale spatial dependencies. Therefore, the aggregated feature representations not only contain semantic information in a large scale range but also cover that range in a compact yet discriminative manner. The proposed learning framework is a pure Transformer-based scene segmentation network, which does not contain any convolution operator. We evaluate our method on four public benchmark datasets, which achieves very promising performance in terms of the mean Intersection-over-Union (mIoU) score.

We make the following contributions in this paper:

•

We apply self-attention on multiple shifted windows then propose three feature aggregation strategies (parallel, sequential and cross-attention) in the decoder, backend on a Transformer-based pyramid feature encoder, to generate multi-scale features for scene segmentation.
•

The whole learning framework is a pure Transformer-based model, which totally discards convolution operators. Thus the computational complexity is lower than the ConvNet based segmentation models.
•

Extensive experiments show our models can learn superior feature representations as compared to fully-convolution networks (FCNs), achieving very promising performance on four public scene segmentation benchmarks.

The rest of the paper is organized as follows. Section 2 introduces the related work. Section 3 elaborates the proposed learning framework, including a Transformer-based pyramid encoder and three decoder structures with self-attention on multi-shifted windows. Experimental results and analysis are presented in Section 4. Finally, Section 5 concludes the paper. We have made our code and pre-trained models available at https://github.com/yutao1008/MSwin.

2 Related work

Deep learning-based image segmentation models have achieved significant progress on large-scale benchmark datasets Zhou et al. (2017); Cordts et al. (2016) in recent years. The deep segmentation methods can be generally divided into two streams: the fully-convolutional networks (FCNs) and the encoder-decoder structures. The FCNs Long et al. (2015) are mainly designed for general segmentation tasks, such as scene parsing and instance segmentation. Most FCNs are based on a stem-network (e.g., deep residual networks He et al. (2016)) pre-trained on a large-scale dataset. These classification networks usually stack convolution and down-sampling layers to obtain visual feature maps with rich semantics. The deeper layer features with rich semantics are crucial for accurate classification, but lead to the reduced resolution and in turn spatial information loss. To address this issue, the encoder-decoder structures such as U-Net Ronneberger et al. (2015) have been proposed. The encoder maps the original images into low-resolution feature representations, while the decoder mainly restores the spatial information with skip-connections. Another popular method that has been widely used in semantic segmentation is the dilated (atrous) convolution Yu and Koltun (2015), which can enlarge the receptive field in the feature maps without adding more computation overhead, thus more visual details are preserved. Some methods, such as DeepLab v3+ Chen et al. (2018), just combine the encoder-decoder structure and dilated convolution, to effectively boost the pixel-wise prediction accuracy.

With the advent of Transformer models in image processing applications, various visual Transformer models have been proposed Dosovitskiy et al. (2021); Liu et al. (2021b); Touvron et al. (2021). As a central piece of Transformer, the self-attention has the complexity and structural prior challenges. The computational complexity is mainly determined by the length of tokens, so in the long-sequence modelling, the global self-attention becomes a bottleneck for model optimization and inference. In Swin Transformer Liu et al. (2021b), the self-attention is conducted within the equal-sized window partition locally to reduce the computational complexity. Another issue of the application in 2D images is the structural prior. Unlike the invariant word embedding, the high uncertainty of image patches lead to the inductive bias, making Transformer models less effective than the convolution counterparts in computer vision tasks Liu et al. (2021a). To obtain a comparable accuracy with ConvNet, using ViT as a backbone for image classification requires the pre-training on very large-scale datasets Dosovitskiy et al. (2021). This issue can be alleviated by applying a so-called distillation token, making the vision transformer effectively learn from a teacher (Data-Efficient Image Transformer, DeiT Touvron et al. (2021)). The distillation token is learned through back-propagation by interacting with the class and patch tokens via self-attention. However, DeiT requires a pre-trained ConvNet as a teacher model. Some other techniques such as multi-stage structures Liu et al. (2021b); Yuan et al. (2021a); Wang et al. (2021) and hybrid models Wu et al. (2021) can also make visual Transformer applicable in general image processing tasks, without the need of training on very large-scale image datasets. Among the very recently proposed visual Transformer models, Swin Transformer Liu et al. (2021b) presents a brand new perspective in designing a convolution-free deep neural network in image processing. Instead of learning the feature dependencies in the whole image patches, Swin Transformer computes the self-attention within each window partition, and uses the relative position bias $\mathbf{B}$ to each head in computing the spatial dependency:

\text{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\text{SoftMax}(\mathbf{QK}^{\top}/\sqrt{d}+\mathbf{B})\mathbf{V},

(1)

where $\mathbf{Q},\mathbf{K},\mathbf{V}$ are query, key and value matrices, respectively. $d$ is the dimension of query and key. In each head of self-attention, the query acts as a guide to search what it needs in a dictionary to reach the final prediction. $\mathbf{B}$ is a relative position bias array rather than a scalar. In the computational pipeline, Swin Transformer first splits an image into non-overlaped patches, then applies linear mapping to embed the raw pixel values. This enables the arbitrary input sizes for Swin Transformer because the input dimension for embedding is no longer restricted by the number of pixels within a patch. The key components are Swin Transformer blocks with modified self-attention modules, which can well model the spatial dependencies within the equal-sized windows, so the computational complexity is significantly reduced. Just like convolution neural networks, the resolution reduces at the end of each stage by token merging. Also, the modified self-attention further uses a shifted window partitioning strategy, resulting in more windows to generate rich visual features thus improving the model performance.

In the vision task of scene segmentation in images, the Transformer models also show their advantages over ConvNets counter parts. In Zheng et al. (2021), the authors deployed a pure transformer to encode an image as a sequence of patches, combined with a simple decoder to provide a powerful segmentation model SEgmentation TRansformer (SETR). Based on a hierarchical structured Transformer encoder, which outputs multi-scale features, SegFormer combined both local and global attentions for semantic segmentation Xie et al. (2021). Yuan et. al proposed a High-Resolution Transformer Yuan et al. (2021b), which replaces the convolutions with local-window self-attentions in HRNet Wang et al. (2020). Although it achieves outstanding performance in many vision tasks, the computational complexity is very high in dense prediction. Learning feature representation at multi-scale image feature maps have been proved to be an effective way to capture the contextual information in dense prediction Zhao et al. (2017); Chen et al. (2018); Yang et al. (2018). So in this paper, we also follow this paradigm to design the decoder head that uses self-attention on shifted windows with different sizes. With this regard, our method achieves a satisfactory balance between computational complexity and segmentation accuracy.

3 Method

In this section, we start with the design of a Transformer-based feature pyramid encoder, then introduce the decoder with self-attention on multi-shifted windows and three feature aggregation strategies. The overall learning framework is illustrated in Figure 2.

3.1 A Transformer-based feature pyramid encoder

The scene segmentation model is built on the recently proposed Swin Transformer Liu et al. (2021b), which is a powerful visual Transformer in large-scale image classification and can serve as a general-purpose backbone for down-stream learning tasks in computer vision.

Swin Transformer constructs hierarchical feature maps just like the commonly used Convolution Network (ConvNet) structures such as ResNet He et al. (2016) and DenseNet Huang et al. (2017). However, ConvNets and Swin Transformer have fundamental differences in computing feature representations. ConvNets apply convolution operators to overlapped receptive fields to progressively learn local semantic information, while Swin Transformer learns the spatial feature dependencies within the shifted image windows. The ability of learning feature dependency in the shifted image windows endows Transformer model a compelling advantage over FCNs for scene segmentation, since it explicitly describes the spatial dependencies. Another important difference is the resolution reduction between the learning stages. ConvNets use either doubled strides or down-samplings, while Swin Transformer adopts the patch-merging scheme. The above two differences make it inappropriate to directly build an encoder with Swin Transformer just like FCNs, i.e., keeping the feature resolution in the deeper stages and increasing the dilation rate with atrous convolution to enlarge the receptive field Zhao et al. (2017); Fu et al. (2019a). The reasons are illustrated as follows:

•

The learned embedding of merged patches no longer provides the accurate representation for the down-streamed computations, if keeping the high-resolution features in the deeper stages;
•

If we calculate the dependency of non-adjacent patches like the way of atrous convolution Zhao et al. (2017); Chen et al. (2018), the self-attentions still learn the same dependency within a window.

Based on the above observations, in the design of the encoder backend on Swin Transformer, the structure of Feature Pyramid Network (FPN) with the Pyramid Pooling Module (PPM) is used as a whole learning framework (UperNet) for scene segmentation Liu et al. (2021b). However, the PPM is applied on the low-resolution but high-dimensional feature maps in UperNet, which is computationally expensive and ineffective in restoring the spatial details. In our work, we only use the structure of Feature Pyramid Network (FPN), which can represent hierarchical features and carry rich spatial-semantic information, but remove the PPM and add a few Swin blocks before upsampling. In deep neural networks, FPN has been successfully used for object detection Lin et al. (2017) to extract multi-scale feature maps. The Transformer-based FPN is illustrated in Figure 2(a), which composes of both bottom-up and top-down pathways. The bottom-up pathway is the four-stage computation streamline in Swin Transformer. As we go up, the spatial resolution decreases due to the patch merging. The semantic value for each stage increases with more high-level feature representation is learned. The feature outputs from the bottom layers are in high resolution but the semantic value is limited. By contrast, the features from the top layers are in low-resolution but with rich semantics. In scene segmentation, the features with high-resolution and rich semantics are both necessary for pixel-level classification, so FPN provides a top-down pathway to construct higher resolution layers and keep the rich semantic information.

Assume an input image is $\mathbf{X}_{0}\in\mathbb{R}^{H\times W\times 3}$ , where $H$ and $W$ are the width and height, respectively. In bottom-up pathway in Swin Transformer, the output features of the four stages are $\mathbf{X}_{1}\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times C},\mathbf{X}_{2}\in\mathbb{R}^{\frac{H}{8}\times\frac{W}{8}\times 2C},\mathbf{X}_{3}\in\mathbb{R}^{\frac{H}{16}\times\frac{W}{16}\times 4C}$ and $\mathbf{X}_{4}\in\mathbb{R}^{\frac{H}{32}\times\frac{W}{32}\times 8C}$ , respectively, where $C$ is a constant depending on the model size of the backbone. In the top-down pathway, it progressively hallucinates higher resolution features by upsampling spatially coarser but semantically richer feature maps from higher pyramid levels. These features are then fused with the feature maps computed from the bottom-up pathway via lateral connections. The lateral outputs $\mathbf{L}_{1},\mathbf{L}_{2},\mathbf{L}_{3}$ and $\mathbf{L}_{4}$ in the top-down pathway are computed as follows:

$\displaystyle\mathbf{L}_{1}$	$\displaystyle=\mathbf{X}_{4},$
$\displaystyle\mathbf{L}_{2}$	$\displaystyle=\text{UP}_{\times 2}(\mathbf{L}_{1})+\text{LP}(\mathbf{X}_{3}),$
$\displaystyle\mathbf{L}_{3}$	$\displaystyle=\text{UP}_{\times 2}(\mathbf{L}_{2})+\text{LP}(\mathbf{X}_{2}),$
$\displaystyle\mathbf{L}_{4}$	$\displaystyle=\text{UP}_{\times 2}(\mathbf{L}_{3})+\text{LP}(\mathbf{X}_{1}),$	(2)

where $\text{UP}_{\times 2}(\cdot)$ is the upsampling operator with bilinear interpolation by 2, and $\text{LP}(\cdot)$ is the linear projection with a batch normalization followed by a ReLU activation. After that, we apply four window-based self-attention blocks on the lateral outputs, then sum them up to form the final feature output $\mathbf{Y}_{0}$ of the encoder, as follows:

$\displaystyle\mathbf{X}^{\prime}_{1}$	$\displaystyle=\text{UP}_{\times 8}(\text{W-MSA}(\mathbf{L}_{1})),$
$\displaystyle\mathbf{X}^{\prime}_{2}$	$\displaystyle=\text{UP}_{\times 4}(\text{W-MSA}(\mathbf{L}_{2})),$
$\displaystyle\mathbf{X}^{\prime}_{3}$	$\displaystyle=\text{UP}_{\times 2}(\text{W-MSA}(\mathbf{L}_{3})),$
$\displaystyle\mathbf{X}^{\prime}_{4}$	$\displaystyle=\text{W-MSA}(\mathbf{L}_{4}),$
$\displaystyle\mathbf{Y}_{0}$	$\displaystyle=\mathbf{X}^{\prime}_{1}+\mathbf{X}^{\prime}_{2}+\mathbf{X}^{\prime}_{3}+\mathbf{X}^{\prime}_{4},$	(3)

where W-MSA is the Window based Multi-head Self-attention module. Different from the use of FPN for object detection in Lin et al. (2017) that predicts multi-scale object sizes with pyramid feature maps, we aggregate the multi-scale lateral features in the same feature resolution. The resolution of the final feature output $\mathbf{Y}_{0}$ is $\frac{H}{4}\times\frac{W}{4}$ .

The encoder with FPN is similar to the Unified Perceptual Parsing Network (UPerNet) proposed in Xiao et al. (2018). However, in our design we remove the pyramid pooling module (PPM)Zhao et al. (2017) and apply the Swin blocks to aggregate the feature maps, thus the proposed encoder does not contain any convolution operator. Note that the PPM needs more memory and extra $3\times 3$ convolutions, which is computationally expensive. In the experiment, we prove that even with the self-attention on multi-shifted windows, the whole learning framework of MSwin is more computationally efficient than UPerNet. Using FPN as the encoder of the scene segmentation model can well suit the hierarchical backbone of Swin Transformer without hurting the internal computation structure. Furthermore, the feature output carries very rich semantic information and keeps the high resolution of features. So if we directly add a classification layer as the decoder on the top of the FPN encoder output, it can form a pure Vision Transformer model for scene segmentation. Here we name it Transformer based FPN (T-FPN), which serves as a baseline in the experiment. Note that the T-FPN only contains the self-attention on single-shifted windows.

We next introduce how to design the decoder by exploring the spatial dependencies with self-attention on multi-shifted windows.

3.2 The decoders with self-attention on multi-shifted windows

The Shifted Window based Multi-head Self-attention (SW-MSA) is a simple yet effective module in Swin Transformer to enrich the feature representation. The intuition behind SW-MSA is to introduce cross-window connections of non-overlapping windows thus improving the modelling power of W-MSA. In scene segmentation tasks, it is commonly recognized that learning on multi-scale feature representations is beneficial to take different contextual information into consideration, thus giving more accurate predictions. In our model, we also follow this strategy in the design of segmentation head by applying self-attention on multi-shifted windows.

Here we denote $m$ and $n$ be the window size and the shifted size in a SW-MSA module, where $n<m$ . W-MSA is a special case of SW-MSA when $n=0$ . Note that the self-attention is learned to describe the spatial dependency among non-overlapping windows, so different settings of $m$ and $n$ essentially implement multiple self-attentions in the same image. In Figure 1 we illustrate two shifted windows, where an image feature map with the resolution $6\times 6$ can be partitioned to either $2\times 2$ or $3\times 3$ windows. Applying two shifted sizes 2 and 1 on these partitions, we can obtain four different outputs by setting $m=3,n=0$ , $m=3,n=2$ , $m=2,n=0$ , and $m=2,n=1$ in the SW-MSA modules, respectively. This can implement a multi-scale attention within different sized windows thus can diversify the feature map representation to improve the discriminative power of self-attention for pixel-wise classification. For simplicity, we set $n=\lfloor\frac{m}{2}\rfloor$ in the self-attention on multi-shifted windows of the decoder. Assume we have $L$ SW-MSA modules, and the feature output from the T-FPN encoder is $\mathbf{Y}_{0}$ , the three structures for the decoder are described as follows.

3.2.1 MSwin-P: The parallel decoding structure

The parallel structure is a wide decoder to aggregate the features from the self-attention on multi-shifted windows. It first normalizes $\mathbf{Y}_{0}$ then feeds its output into $L$ SW-MSA modules in parallel, then concatenates the feature outputs and applies the dimensionality reduction, followed by normalization and MLP. Here we also use the residual connections to enhance the feature representation. The structure is illustrated in Figure 2(b), and the details of the computation streamline are described as follows:

$\displaystyle\mathbf{Y}_{1,l}$	$\displaystyle=\text{SW-MSA}_{l}(\text{LN}(\mathbf{Y}_{0}))+\mathbf{Y}_{0},\>l=1,\ldots,L,$
$\displaystyle\mathbf{Y}_{2}$	$\displaystyle=\text{Linear}([\mathbf{Y}_{1,1},\ldots,\mathbf{Y}_{1,L}]),$
$\displaystyle\mathbf{Z}$	$\displaystyle=\text{MLP}(\text{LN}(\mathbf{Y}_{2}))+\mathbf{Y}_{2},$	(4)

where $[\cdot]$ is the channel-wise concatenation, and $\mathbf{Z}$ is the feature output.

3.2.2 MSwin-S: The sequential decoding structure

The sequential structure is a deep decoder that consists of $L$ Swin blocks with different window and shifted sizes, so $\mathbf{Y}_{0}$ is passed through these blocks sequentially, as is illustrated in Figure 2(c). In each Swin block, it contains only one self-attention module with/without shifted windows. The computation in each block is formulated as follows:

$\displaystyle\mathbf{Y}_{l,1}$	$\displaystyle=\text{SW-MSA}_{l}(\text{LN}(\mathbf{Y}_{l-1}))+\mathbf{Y}_{l-1}),$
$\displaystyle\mathbf{Y}_{l,2}$	$\displaystyle=\text{MLP}(\text{LN}(\mathbf{Y}_{1,1}))+\mathbf{Y}_{l,1},$
$\displaystyle l$	$\displaystyle=1,\ldots,L.$	(5)

So the final feature output is $\mathbf{Z}=\mathbf{Y}_{L,2}$ .

3.2.3 MSwin-C: The cross-attention decoding structure

In this decoding structure, we slightly change the SW-MSA module by using the input as the query while only keeping key and value in each Swin block. Similar to MSwin-S, there are $L$ Swin blocks with different window and shifted sizes. The query of each Swin block is the aggregation of all previous feature outputs (see Figure 2(d)), i.e., the Swin blocks are densely connected:

\displaystyle\mathbf{Y}_{l}

\displaystyle=\text{SW-MSA}_{l}(\sum\limits_{i=0}^{l-1}\mathbf{Y}_{i}),\>l=1,\ldots,L.

(6)

This design is inspired by DenseNet Huang et al. (2017) and DenseASPP Yang et al. (2018). The advantage of cross-attention is its improved information flow across different self-attention modules and gradients throughout the decoder, leading to implicit deep supervision.

4 Experiments

We carry out comprehensive experiments on four public benchmarks to demonstrate the efficacy of the proposed learning framework. Experimental results show that equipped with the Transformer based FPN encoder, the three decoder designs achieve similar yet promising performance in scene segmentation.

4.1 Datasets

PASCAL VOC 2012Everingham et al. (2015) contains 20 foreground object classes and one background class. The original dataset has 1,464 and 1,449 images for training and validation, respectively. To augment the training dataset, we also use extra annotations provided by Hariharan et al. (2011), so the total number of training images is 10,582. Here we do not use MS COCO dataset to pretrain the segmentation model.

COCO-Stuff 10KCaesar et al. (2018) is a subset of the complete COCO-Stuff dataset, which provides pixel-wise semantic labels for the whole scene, including both “thing” and “stuff” classes. It contains 9,000 and 1,000 images for training and validation (testing), respectively. Following Ding et al. (2018); Fu et al. (2019a); Yuan et al. (2020), we evaluate the segmentation performance on 171 categories (80 objects and 91 stuff) to each pixel.

ADE20KZhou et al. (2017) is a challenging scene parsing benchmark, in which the images are from both indoor and outdoor environments and are annotated by 150 fine-grained semantic concepts. It contains 20,210, 2,000 and 3,352 images for training, validation and testing, respectively.

CityscapesCordts et al. (2016) is an urban traffic dataset, in which the images are densely annotated by 19 classes. We train the scene segmentation model with the finely annotated 2,975 images for training and 500 images for validation, respectively. It also provides 20,000 coarsely labeled images to pre-train the segmentation models.

4.2 Implementation details

Our experimentation is based on the semantic segmentation package mmsegmentation GIT (2020). The MSwin models implemented in the experiment are backend on a small-sized Swin-S and a medium-sized Swin-B Liu et al. (2021b), which are pretrained on ImageNet-1K and ImageNet-22K, respectively. The window sizes in Swin-S and Swin-B are $7\times 7$ and $12\times 12$ . In all decoder heads that contain SW-MSA modules, we set the dimension of embedding to 512 and 8-heads attention without any change. For the attention on multi-shifted windows, we used three different window sizes $5\times 5$ , $7\times 7$ and $12\times 12$ , and the shifted sizes were set to $2\times 2$ , $3\times 3$ and $6\times 6$ , respectively. Considering the combinations of both W-MSA and SW-MSA, there are totally $L=6$ attention blocks. Following Zhao et al. (2017); Fu et al. (2019a), we added an auxiliary loss head composed of a two-layer sub-network. The auxiliary loss and main loss were computed concurrently with weights 0.4 and 1, respectively. We used AdamW optimizer L. and H. (2019) with the initial learning rate $6\times 10^{-5}$ and weight decay 0.01 after each iteration. For image augmentation, we applied random cropping, random flipping and photometric distortion. On the Cityscapes dataset, we applied random cropping with the size $512\times 1024$ . On the COCO-Stuff 10K, we set a smaller cropping size $480\times 480$ . On the rest two datasets, the cropping sizes were set to $512\times 512$ . We used categorical cross-entropy as loss function and report the mean Intersection-over-Union (mIoU) score in the evaluation of scene segmentation performance. These are consistent with the optimization settings of other baselines for fair comparisons, although applying some recently proposed learning objectives such as Lovász-softmax Berman et al. (2018) and Margin calibrated log-loss Yu et al. (2021) can further improve the mIoU scores. We used the mixed-precision and gradient checkpoint in the model training, which allows us to set a larger mini-batch size and can effectively save the GPU memory usage without hurting the normalization layers. Our experiments were conducted on a server equipped with two NVIDIA Tesla V100 GPU cards.

4.3 Result on PASCAL VOC 2012 dataset

To demonstrate the effectiveness of attention on multi-shifted windows, we conducted an ablation study of different window-size combinations on top of the T-FPN model backend on Swin-S. We separately set the window sizes $5\times 5$ , $7\times 7$ and $12\times 12$ , as well as their combinations. We tested the proposed MSwin-P and MSwin-S, and the single-scaled mIoU predictions are illustrated in Table 1. We can observe that it is unclear which single window size obtains the best results, and applying the self-attention with two window sizes do not necessarily improve the model performance. However, when we use three different window sizes by setting $L=6$ , the two MSwin models perform slightly better and have relatively stable results. It is expected that using more self-attention in MSwin model with more window sizes can further boost the segmentation performance, but the computational complexity also increases accordingly.

MSwin-P			MSwin-S
$L$	Size(s)	mIoU	$L$	Size(s)	mIoU
$L=2$	5	81.29	$L=2$	5	81.71
$L=2$	7	81.08	$L=2$	7	81.82
$L=2$	12	81.55	$L=2$	12	81.56
$L=4$	5,7	80.97	$L=4$	5,7	81.79
$L=4$	7,12	81.42	$L=4$	7,12	81.67
$L=6$	5,7,12	81.58	$L=6$	5,7,12	81.97

Table 1: Ablation study with different window sizes on PASCAL VOC2012 validation set.

We then experimented the segmentation models with Swin-S and Swin-B as backbones to verify the effectiveness of the baseline T-FPN, as well as three MSwin models on this dataset. The mIoU scores on the validation set are summarized in Table 2, where SS and MS are the abbreviates of single-scale and multi-scale prediction for mIoU scores, respectively. From the table, we can see that even with a softmax classifier as a decoder, the T-FPN backed on Swin Transformer is a powerful semantic segmentation model. Applying the small-sized backbone Swin-S, the proposed MSwin-P, MSwin-S and MSwin-C further improve the baseline T-FPN by 0.89%, 1.28% and 0.71%, and by 0.88%, 0.67% and 0.63% in terms of mIoU score when using single-scale and multi-scale predictions, respectively. When applying the more powerful backbone Swin-B, the three MSwin models further improve the performance of the baseline similarly. Figure 3 gives some visualization results of the four segmentation methods on the validation dataset.

We mixed the train+val datasets, fine-tuned the models, and used the multi-scale prediction on the test set, then uploaded the results to the evaluation server. In Table 3, we compared our models with the recently proposed methods. The three MSwin models achieve the new state-of-the-arts without the pre-training on MS COCO dataset.

Method	Backbone	SS	MS
T-FPN	Swin-S	80.69	82.07
MSwin-P	Swin-S	81.58	82.95
MSwin-S	Swin-S	81.97	82.74
MSwin-C	Swin-S	81.40	82.70
T-FPN	Swin-B	81.81	83.58
MSwin-P	Swin-B	82.85	83.82
MSwin-S	Swin-B	82.86	84.27
MSwin-C	Swin-B	83.50	84.47

Table 2: Results on PASCAL VOC2012 validation set.

Method	Backbone	COCO	Score
DANetFu et al. (2019a)	ResNet-101	✗	82.6
DeepLab v3+Chen et al. (2018)	Xception-71	✓	87.8
Auto-DeepLab-LLiu et al. (2019)	-	✓	85.6
EncNetZhang et al. (2018)	ResNet-101	✗	82.9
APCNetHe et al. (2019)	ResNet-101	✗	84.2
PSANetZhao et al. (2018)	ResNet-101	✓	85.7
EMANetLi et al. (2019)	ResNet-101	✓	87.7
MSwin-P	Swin-B	✗	85.9
MSwin-S	Swin-B	✗	86.1
MSwin-C	Swin-B	✗	85.7

Table 3: Online evaluation of PASCAL VOC 2012 test set.

4.4 Results on COCO-Stuff 10K dataset

We conducted experiments on the COCO-Stuff 10K dataset to prove the generalization ability of the proposed methods. All models were optimized by 40,000 iterations. The comparisons with previously state-of-the-arts are reported in Table 4. Our pure Transformer-based models, including the baseline T-FPN, consistently outperform all the ConvNet-based counterparts. When applying three different aggregation strategies in the decoder and multi-scale prediction, MSwin-P, MSwin-S and MSwin-C further boost the baseline T-FPN by 0.3%, 1.0% and 0.5% in terms of mIoU score, respectively. Our models achieve a slightly lower mIoU compared to the OCRNet backend on HRFormer-B on this dataset.

Method	Backbone	SS	MS
DANetFu et al. (2019a)	ResNet-101	-	39.7
OCRNetYuan et al. (2020)	HRNetV2-48	-	40.5
ACNetFu et al. (2019b)	ResNet-101	-	40.1
EMANetLi et al. (2019)	ResNet-101	-	39.9
RegionContrastHu et al. (2021)	ResNet-101	-	40.7
OCRNetYuan et al. (2021b)	HRFormer-B	-	43.3
T-FPN	Swin-S	39.6	41.1
MSwin-P	Swin-S	39.8	41.4
MSwin-S	Swin-S	40.2	42.1
MSwin-C	Swin-S	39.6	41.6
T-FPN	Swin-B	40.9	41.7
MSwin-P	Swin-B	41.3	42.7
MSwin-S	Swin-B	41.1	42.4
MSwin-C	Swin-B	41.0	42.8

Table 4: mIoU results on the COCO-Stuff 10K test set.

4.5 Results on ADE20K dataset

On this dataset, we used both Swin-S and Swin-B as backbones for model training to evaluate the performance. Table 5 shows the comparisons of FLOPs, as well as the results on the validation set. We can observe that the Transformer-based segmentation models generally needs more computational resources, except the recently proposed SegFormer. However, backend with Swin Transformer, our models are more computationally efficient. Applying the self-attention on multi-shifted windows in the decoder doubles the FLOPs in MSwin compared to T-FPN, but leads to more accurate segmentation results. On the validation dataset, all Transformer-based segmentation models substantially outperform the ConvNet counterparts by a large margin. The reason for this is mainly because the self-attention mechanism in Transformers has a very strong capability in modelling the spatial dependency, which can effectively parse very complex scenes in different visual environments by capturing the long-range dependencies. On the validation set, our MSwin models achieve comparable results to UPerNet backend on Swin Transformer, SETR backend on ViT-L and SegFormer. Some segmentation examples are illustrated in Figure 4.

We then fine-tuned the MSwin-P model on the train+validation set to improve the overall classification accuracy. On the evaluation server, MSwin-P obtains the pixel-wise accuracy of 0.767, mIoU 0.457 and the final test score of 0.612.

Method	Backbone	FLOPs	SS	MS
EncNet Zhang et al. (2018)	ResNet-101	219G	-	44.65
ACNet Fu et al. (2019b)	ResNet-101	-	-	45.90
APCNet He et al. (2019)	ResNet-101	282G	-	45.38
OCRNet Yuan et al. (2020)	HRNetV2-48	165G	-	45.66
UPerNet Xiao et al. (2018)	ResNet-50	-	41.22	-
RegionContrastHu et al. (2021)	ResNet-101	-	-	46.9
UPerNet Liu et al. (2021b)	Swin-B	300G	48.35	49.65
UPerNet Yang et al. (2021)	Focal-B	342G	49.00	50.50
FPN Wang et al. (2021)	PVT-Large	80G	42.10	44.80
SETR Zheng et al. (2021)	ViT-L	270G	48.64	50.28
SegFormer Xie et al. (2021)	MiT-B5	183G	49.13	50.22
DPT R. et al. (2021)	ViT-Hybrid	-	-	49.02
OCRNetYuan et al. (2021b)	HRFormer-B	283G	-	50.00
T-FPN	Swin-S	87G	46.38	48.15
MSwin-P	Swin-S	230G	47.11	48.55
MSwin-S	Swin-S	247G	47.52	48.56
MSwin-C	Swin-S	220G	46.26	48.12
T-FPN	Swin-B	127G	47.70	49.41
MSwin-P	Swin-B	270G	48.38	50.29
MSwin-S	Swin-B	288G	48.54	50.26
MSwin-C	Swin-B	260G	48.70	50.13

Table 5: Results on ADE20K validation set.

4.6 Results on Cityscapes dataset

On this dataset, we directly used Swin-B as the backbone to evaluate the effectiveness of the proposed MSwin models in street scene segmentation. Only using the fine-labeled dataset, we show the improvements brought by the decoder with self-attention on multi-shifted windows based on the T-FPN encoder in Table 6. On the validation set, the parallel structure of MSwin achieves the best mIoU score when using single-scale prediction, which improves mIoU by 0.67% compared to T-FPN. Applying the sequential structure, the model obtains the best performance with multi-scale prediction, which outperforms the baseline by 0.62% of mIoU. We show some segmentation examples and the confusion matrices of dense prediction in Figure 5 and 6, respectively. Compared with a thin decoder based on T-FPN, self-attention on multi-shifted windows can further reduce false positives and lead to finer details of street scenes.

Method	Backbone	SS	MS
T-FPN	Swin-B	80.39	81.77
MSwin-P	Swin-B	81.06	82.10
MSwin-S	Swin-B	80.87	82.39
MSwin-C	Swin-B	80.78	82.04

Table 6: Results on Cityscapes validation set.

We pre-trained MSwin-S and MSwin-C on the coarsely labeled training data then fine-tuned them on the fine labeled training dataset. After that, we applied the multi-scale prediction on the test set and submitted the results to the evaluation server. The overall comparisons of our models with some recently proposed methods are summarized in Table 7. We can see that without the pre-training with the coarsely labeled images, the proposed MSwin-P obtains the mIoU score 81.8%, which is slightly worse than the best scene segmentation models based on ConvNets. When pre-trained with the coarsely labeled data, MSwin-S and MSwin-C obtain the best results. Specifically, compared to the Transformer-based scene segmentation models SETR and SegFormer, MSwin-S and MSwin-C outperform the second-best by 0.5% and 0.3% in terms of mIoU score, respectively.

Method	Backbone	Coarse	mIoU
DenseASPP Yang et al. (2018)	DenseNet-161	✗	80.6
DeepLab V3+ Chen et al. (2018)	Xception-71	✓	82.1
Auto-DeepLab-L Liu et al. (2019)	-	✗	80.4
DANet Fu et al. (2019a)	ResNet-101	✗	81.5
HANet Choi et al. (2020)	ResNet-101	✗	80.9
HRNetV2 Wang et al. (2020)	HRNetV2-48	✗	81.6
OCRNet Yuan et al. (2020)	HRNetV2-48	✗	82.4
RegionContrastHu et al. (2021)	ResNet-101	✗	82.3
ACNetFu et al. (2019b)	ResNet-101	✗	82.3
SETR Zheng et al. (2021)	ViT-L	✗	81.1
SETR Zheng et al. (2021)	ViT-L	✓	81.6
SegFormer Xie et al. (2021)	MiT-B5	✓	82.2
OCRNet Yuan et al. (2021b)	HRFormer	✓	82.6
MSwin-P	Swin-B	✗	81.8
MSwin-S	Swin-B	✓	82.9
MSwin-C	Swin-B	✓	82.7

Table 7: Online evaluation of Cityscapes test set.

5 Conclusion

In this work, we have presented a convolution-free deep scene parsing model based on Swin Transformers. Different from the existing FCN based semantic segmentation models that parse complex semantic contextual information by enlarging receptive fields with dilated convolutions, the Transformer-based models consider the spatial and semantic dependencies. We have analyzed the computational properties of Swin Transformer, and designed a feature pyramid encoder that aggregates multiple-layer outputs without modifying the backbone. Furthermore, we have proposed the self-attention on multiple shifted windows to diversify the feature representations in the decoder. Extensive experiments on four public benchmarks demonstrate our models set new state-of-the-art performance.

References

Berman et al. [2018] M. Berman, A. Rannen Triki, and M.B Blaschko. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In CVPR, pages 4413–4421, 2018.
Cadena and Košecká [2014] C. Cadena and J. Košecká. Semantic segmentation with heterogeneous sensor coverages. In ICRA, pages 2639–2645, 2014.
Caesar et al. [2018] H. Caesar, J. Uijlings, and V. Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, pages 1209–1218, 2018.
Chen et al. [2018] L.C Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pages 801–818, 2018.
Choi et al. [2020] S. Choi, J.T Kim, and J. Choo. Cars can’t fly up in the sky: Improving urban-scene segmentation via height-driven attention networks. In CVPR, pages 9373–9383, 2020.
Cordts et al. [2016] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
Ding et al. [2018] H. Ding, X. Jiang, B. Shuai, A.Q Liu, and G. Wang. Context contrasted feature and gated multi-scale aggregation for scene segmentation. In CVPR, pages 2393–2402, 2018.
Dosovitskiy et al. [2021] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, and S. Gelly. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
Everingham et al. [2015] M. Everingham, SM.A Eslami, L. Van Gool, C.K Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
Fu et al. [2019a] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu. Dual attention network for scene segmentation. In CVPR, pages 3146–3154, 2019.
Fu et al. [2019b] J. Fu, J. Liu, Y. Wang, Y. Li, Y. Bao, J. Tang, and H. Lu. Adaptive context network for scene parsing. In ICCV, pages 6748–6757, 2019.
GIT [2020] MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
Hariharan et al. [2011] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, pages 991–998, 2011.
He et al. [2016] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
He et al. [2019] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao. Adaptive pyramid context network for semantic segmentation. In CVPR, pages 7519–7528, 2019.
Hu et al. [2021] H. Hu, J. Cui, and L. Wang. Region-aware contrastive learning for semantic segmentation. In ICCV, pages 16291–16301, 2021.
Huang et al. [2017] G. Huang, Z. Liu, L. Van Der Maaten, and K.Q Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
L. and H. [2019] Ilya L. and Frank H. Decoupled weight decay regularization. In ICLR, 2019.
Li et al. [2019] X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu. Expectation-maximization attention networks for semantic segmentation. In ICCV, pages 9167–9176, 2019.
Lin et al. [2017] T.Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
Liu et al. [2019] C. Liu, L.C. Chen, F. Schroff, H. Adam, W. Hua, A. L Yuille, and F.F. Li. Auto-deeplab: Hierarchical neural architecture search for semantic image segmentation. In CVPR, pages 82–92, 2019.
Liu et al. [2021a] Y. Liu, E. Sangineto, W. Bi, N. Sebe, B. Lepri, and M. Nadai. Efficient training of visual transformers with small datasets. In NeurIPS, 2021.
Liu et al. [2021b] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021.
Long et al. [2015] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
R. et al. [2021] René R., Alexey B., and Vladlen K. Vision transformers for dense prediction. In ICCV, pages 12179–12188, 2021.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241, 2015.
Sun et al. [2017] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In CVPR, pages 843–852, 2017.
Touvron et al. [2021] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357, 2021.
Vaswani et al. [2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
Wang et al. [2020] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, and X. Wang. Deep high-resolution representation learning for visual recognition. IEEE trans. on pattern analysis and machine intelligence, 2020.
Wang et al. [2021] W. Wang, E. Xie, X. Li, D.P Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, pages 568–578, 2021.
Wu et al. [2021] H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang. Cvt: Introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808, 2021.
Xiao and Quan [2009] J. Xiao and L. Quan. Multiple view semantic segmentation for street view images. In ICCV, pages 686–693, 2009.
Xiao et al. [2018] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun. Unified perceptual parsing for scene understanding. In ECCV, pages 418–434, 2018.
Xie et al. [2021] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J.M Alvarez, and P. Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. CoRR, 2021.
Yang et al. [2018] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang. Denseaspp for semantic segmentation in street scenes. In CVPR, pages 3684–3692, 2018.
Yang et al. [2021] Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, and Jianfeng Gao. Focal self-attention for local-global interactions in vision transformers. In NeurIPS, 2021.
Yu and Koltun [2015] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, 2015.
Yu et al. [2021] L. Yu, Z. Li, M. Xu, Y. Gao, J. Luo, and J. Zhang. Distribution-aware margin calibration for semantic segmentation in images. International Journal of Computer Vision, pages 95–110, 2021.
Yuan et al. [2020] Y. Yuan, X. Chen, and J. Wang. Object-contextual representations for semantic segmentation. In ECCV, pages 173–190, 2020.
Yuan et al. [2021a] L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z.H. Jiang, F.E.H. Tay, J. Feng, and S. Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. In ICCV, pages 558–567, 2021.
Yuan et al. [2021b] Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. Hrformer: High-resolution vision transformer for dense predict. In NeurIPS, 2021.
Zhang et al. [2018] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In CVPR, pages 7151–7160, 2018.
Zhao et al. [2017] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, pages 2881–2890, 2017.
Zhao et al. [2018] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia. Psanet: Point-wise spatial attention network for scene parsing. In ECCV, pages 267–283, 2018.
Zheng et al. [2021] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, Y. Fu, J. Feng, T. Xiang, and P.H Torr. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, pages 6881–6890, 2021.
Zhou et al. [2017] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Scene parsing through ade20k dataset. In CVPR, pages 633–641, 2017.