
Heuristical Comparison of Vision Transformers Against Convolutional Neural Networks for Semantic Segmentation on Remote Sensing Imagery

Ashim Dahal, Saydul Akbar Murad, and Nick Rahimi
The authors are with the School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, USA ([email protected], [email protected], [email protected]).
Abstract

Vision Transformers (ViT) have recently brought a new wave of research in the field of computer vision. These models have done particularly well in image classification and segmentation. Research on semantic and instance segmentation has accelerated since the inception of the architecture, with over 80% of the top 20 benchmarks for the iSAID dataset being based either on the ViT architecture or on the attention mechanism behind its success. This paper focuses on a heuristic comparison of three key factors of using (or not using) ViT for semantic segmentation of remote sensing aerial images on iSAID. The experimental results observed during the research were scrutinized under the following objectives: 1. use of a weighted fused loss function to maximize the mean Intersection over Union (mIoU) score and Dice score while minimizing or conserving entropy or class representation; 2. comparison of transfer learning on Meta’s MaskFormer, a ViT-based semantic segmentation model, against a generic UNet Convolutional Neural Network (CNN), judged on mIoU, Dice score, training efficiency, and inference time; and 3. what do we lose for what we gain? i.e., a comparison of the two models against current state-of-the-art segmentation models. We show that the novel combined weighted loss function significantly boosts the CNN model’s performance compared to transfer learning with the ViT. The code for this implementation can be found at https://github.com/ashimdahal/ViT-vs-CNN-Image-Segmentation.

Index Terms:
Vision Transformers, Semantic Segmentation, Convolutional Neural Networks (CNNs), MaskFormer, Remote Sensing

I Introduction

The introduction of Transformers [23] changed the research landscape for attention mechanisms on Natural Language Processing (NLP) tasks. However, this potential was not fully capitalized on in computer vision until Dosovitskiy et al. implemented the transformer attention mechanism in their seminal paper [5]; ever since, ViT has been one of the fundamental areas of research in computer vision. Although initially proposed for image classification tasks, like CNNs in their inception days, ViTs soon turned out to be among the best-performing architectures for image segmentation as well.

Image segmentation refers to the task of classifying each pixel of an image into a category. It can be viewed as a type of classification as well, but the role, approach, and method of the two tasks differ. Within the scope of this paper, semantic segmentation means grouping objects in an image into one of the given n classes in the dataset. Instance segmentation, similarly, segments each object present in an image as a distinct item in itself.

As with other deep learning computer vision tasks, image segmentation has been best performed by models capable of capturing, encoding, and decoding essential patterns of the input image, mainly UNet-style CNN architectures [31, 37, 1, 36, 19, 33]. Specific to the scope of this paper, in the past few years researchers have applied UNet CNNs to the iSAID dataset[32], which is built on top of the DOTA[29] dataset, for semantic segmentation [18, 38, 25, 4]. One of the main pitfalls of such datasets is the background class[11]. If it is not handled properly during training, a model can easily overfit on the most common class, here the unlabelled class, and report deceptively high mIoU.

Although the traditional deep learning technique of employing a UNet-based Convolutional Neural Network (CNN) remains the cornerstone of many segmentation tasks, recent trends in computer vision research indicate a significant shift towards Transformer-based architectures, particularly the Vision Transformer (ViT) and its variants. This shift is driven by the inherent ability of Transformer models to capture long-range dependencies through self-attention mechanisms, a feature that CNNs typically struggle with due to their localized receptive fields. The increasing preference for ViT models is exemplified by the fact that, among the top 20 benchmark models for the iSAID dataset listed on Papers with Code, the top five employ either ViT, attention-based CNNs, or a hybrid combination of both [7, 8, 6, 9]. These models leverage the powerful self-attention mechanism to refine segmentation masks by focusing on relevant image regions.

In this paper, we introduce a novel loss function that integrates maximizing mean Intersection over Union (mIoU) and Dice score while preserving entropy to ensure robust mask predictions. This loss function is integrated into the UNet framework to improve its ability to model complex spatial relationships in images. This formulation allows for better generalization to unseen data by maintaining a balance between maximizing overlap with the ground-truth masks and preventing over-segmentation. In addition, we investigate various data augmentation techniques, since the number of samples in the iSAID[32] dataset is relatively low. In parallel, we provide a direct comparison between training a UNet CNN model from scratch and fine-tuning Meta AI’s widely adopted MaskFormer[2], a ViT-based model that has achieved state-of-the-art results in semantic segmentation tasks and is open-sourced on Hugging Face[26].

This paper focuses on three key objectives throughout its experimentation and analysis:

  • Propose a combined weighted loss function to maximize mIoU and Dice while preserving entropy

  • Analyze the impact on efficiency during training and inference in both architectures

  • Benchmark and test inference capabilities of both models against current state-of-the-art iSAID benchmarks on unseen data

The rest of the paper is laid out as follows: Section II covers the current state of the art and previous work in the field. Section III discusses our approach to the given objectives, and Section IV connects the results with the research questions. Section V concludes the paper with closing thoughts and future directions for the research field.

II Literature Review

Figure 1: Brief Overview of the Training and Validation Lifecycle

Noh et al. [37] first proposed the UNet-style model by training a deep deconvolutional decoder on top of the VGG-16[20] CNN. Their work provided a path for later iterations of UNet-based systems on the iSAID dataset. Regmi [18] presented the unsupervised model FreeSOLO as a state-of-the-art approach for the iSAID dataset, applying unsupervised learning to segmentation on iSAID and other remote imagery data. However, the author fails to recognize the disparity between the claims and the results: the model fails to detect small objects in the dataset. This mainly stems from the author’s choice of image preprocessing, which bilinearly resizes the images into (3, 256, 256) pixel inputs. This is not viable for the iSAID dataset because the pictures in the dataset range from (3, 800, 800) to (3, 4000, 13000) in size, and downsampling in such cases results in the loss of much of the information and patterns the model would need to make robust predictions on all images. The results presented by the author also raise some doubts, since the $AP_{50}$ percentage is not scaled to the standard value, and a score of 0.9% to 3.5% would rather indicate a poor model by accepted standards. Also, the Dice and IoU scores are only presented for the backbone models; these too are below 65%, which is within range of the current state-of-the-art ViT models and below what our UNet CNN and MaskFormer models achieve.

In [25], Wang et al. show the best results obtained by CNN and ViT models after fine-tuning the seminal work of Xiao et al., the UperNet model[30]. Both of their top results on the iSAID dataset, for CNN and ViT, come from fine-tuning UperNet. Their best score for the CNN approach comes from fine-tuning UperNet pre-trained on the ImageNet[3] dataset with a ResNet-50[10] backbone, whereas their best IoU score for ViT, which is also the overall best, comes from fine-tuning UperNet on the same dataset with the ViTAEv2-S[34] backbone; the scores were 62.54 and 66.26, respectively. While achieving such a high average IoU score on the iSAID dataset is an impressive feat, we believe and show that the results can be improved to a much higher mIoU score. The authors also do not report Dice scores for their experiments. While the Dice score is in most cases very similar to the mIoU, typically within a 5% range, reporting the exact value would have allowed direct comparison with other approaches.

After the introduction of ViT[5], papers such as [9][24] have emerged on the iSAID dataset that focus on using ViT in their approach. Papers such as [7] and [8], however, argue that attention-based CNNs are more effective for the segmentation task. The authors of [24] used ViT in their approach and subsequently obtained a higher IoU score than [25] on the iSAID dataset: 67.20. They too used UperNet as their training method, but with RingMo[22] for training and the Swin-B ViT [15] as their backbone instead. This is impressive, but their model has 100 million parameters. We further show that half of this is enough, with our combined loss function, to obtain higher results on the dataset.

Hanyu et al. [9] proposed three ViT-based models: AerialFormer-T, AerialFormer-S, and AerialFormer-B, with 42.7M, 64.0M, and 113.8M trainable parameters, respectively. The three models achieve mIoU scores of 67.5, 68.4, and 69.3, higher than those reported by [24][25]. Even though the mIoU scores are highly desirable, the models are computationally heavy, with GFLOPs of 49.0, 72.2, and 126.8 for the tiny, small, and base models, respectively. With respect to [24], though, this is an incredible feat, as AerialFormer-T with 42.7M parameters outperformed the model in [24] with 100M parameters. The authors do not report Dice scores for direct comparison, and we show that our model with 42.9M parameters can outperform the 113.8M-parameter model with the help of our combined loss function. We also show that, by using the MaskFormer ViT as well, we can surpass the larger model.

Liu et al. [27] proposed a CNN method for image segmentation of remote sensing images using a dual-path semantics approach. The authors devised a new dual-path network structure, W-Net, for the iSAID dataset and conducted experiments to verify its generalization capacity. Through multiple ablation studies they obtained a highest mIoU value of 63.68. Although respectable, our models yield better results than those reported by the authors. The authors conclude that the size of the model impacts its inference time but do not mention the number of trainable parameters or the FLOPS of their model, which would have provided common ground for model comparison.

III Methodology

As shown in Fig. 1, the training lifecycle for our experiment includes six key steps: 1. Dataset Information, 2. Data Augmentation, 3. Models, 4. Loss Function, 5. Validation Metrics, and 6. Hyperparameters and Training Settings.

III-A Dataset Information

The iSAID dataset [32] is a benchmark in the remote sensing community due to its complex nature. The dataset is made of 2806 images from the DOTA dataset [29]. Out of these, 1411 are training images, 458 are validation images, and 937 are unlabelled testing images. The resolution of the images ranges from $800\times800$ pixels to $4000\times13000$ pixels, with 15 foreground categories and one background category. We only consider the foreground categories when calculating our validation metric scores.

III-B Data Augmentation

The data augmentation process followed the procedure shown in Algorithm 1. The algorithm first takes anywhere between 6-28% of the image and resizes it to (512, 512) pixels. It then randomly flips the image along the horizontal and vertical axes, each with 50% probability, and rotates it anywhere between 0 and 360 degrees. Random brightness or contrast is added, and the image is normalized with the mean and standard deviation from the ImageNet dataset. The first step in the process ensures, with high probability, that each image we generate from any given image is either scaled appropriately, has unique features, or both.

Algorithm 1 Data Augmentation Using Albumentations
1: procedure DataAugmentation(InputImage)
2:     I ← InputImage
3:     I ← RandomResizeCrop(args, I)
4:     I ← RandomVerticalFlip(args, I)
5:     I ← RandomHorizontalFlip(args, I)
6:     I ← RandomRotation(args, I)
7:     I ← RandomBrightness(args, I)
8:     I ← Normalization(args, I)
9:     return I
10: end procedure
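For concreteness, a minimal sketch of such a pipeline using the Albumentations library is shown below; the crop scale, probabilities, and parameter names are assumptions taken from the description above (and may differ across Albumentations versions), not the authors' exact configuration.

```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Sketch of the Algorithm 1 pipeline; `image` and `mask` are assumed to be NumPy arrays.
transform = A.Compose([
    # take roughly 6-28% of the image area and resize it to 512x512 pixels
    A.RandomResizedCrop(height=512, width=512, scale=(0.06, 0.28)),
    A.VerticalFlip(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.Rotate(limit=(0, 360), p=1.0),
    A.RandomBrightnessContrast(p=0.5),
    A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),  # ImageNet stats
    ToTensorV2(),
])

# The same geometric transforms are applied to the image and its segmentation mask.
augmented = transform(image=image, mask=mask)
aug_image, aug_mask = augmented["image"], augmented["mask"]
```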

III-C Models

We train two models in total. The first one is the UNet based on CNN, and the second one is the fine-tuned MaskFormer ViT.

We use the gradient clipping algorithm by Pascanu et al.[17] to account for exploding gradients in both of our models. Through a heuristic approach, we concluded that the maximum value of $threshold$ in Algorithm 2 could be set to 3.0.

Algorithm 2 Gradient Clipping[17]
1: $\hat{g} \leftarrow \frac{\partial L}{\partial \theta}$ ▷ $\theta$ are the learnable parameters
2: if $\|\hat{g}\| > threshold$ then
3:     $\hat{g} \leftarrow \frac{threshold}{\|\hat{g}\|} \cdot \hat{g}$
4: end if
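In PyTorch this corresponds to a single call to the built-in norm-based clipping utility; the sketch below assumes `model`, `optimizer`, and a computed `loss` already exist.

```python
import torch

loss.backward()                                                    # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=3.0)   # Algorithm 2, threshold = 3.0
optimizer.step()
```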

The two models differ in their fundamental properties and are described below.

Figure 2: Custom UNet CNN architecture

III-C1 UNet CNN

Our version of the UNet architecture is a simple adaptation of a generic CNN-based UNet model with four skip connections. The overall architecture of the UNet CNN model is described in Fig. 2. We define an encoding block and reuse it throughout the model for all feature extraction in both the convolution and transposed convolution layers. The total number of trainable parameters amounts to 42.9M, with most of the parameters situated in the bottleneck layers. We use gradient accumulation to mimic a batch size of 128 images and use mixed precision with 16-bit floats for the forward pass. These two techniques make training more efficient by deferring the weight update for n accumulation steps and computing the predicted mask in 16-bit precision.
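A minimal sketch of how gradient accumulation and 16-bit mixed precision combine in a single PyTorch training step is shown below; the micro-batch size and the names `model`, `optimizer`, `criterion`, and `train_loader` are assumptions rather than the authors' exact code.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

accum_steps = 16           # e.g. micro-batches of 8 images mimic an effective batch size of 128
scaler = GradScaler()      # loss scaling for stable float16 training

for step, (images, masks) in enumerate(train_loader):
    with autocast():                                   # forward pass in 16-bit precision
        loss = criterion(model(images), masks) / accum_steps
    scaler.scale(loss).backward()                      # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:                  # single weight update per effective batch
        scaler.unscale_(optimizer)                     # unscale before clipping (Algorithm 2)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=3.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()
```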

Figure 3: MaskFormer ViT architecture from [2]

III-C2 MaskFormer ViT

MaskFormer[2] is an architecture that combines a pixel decoder and a transformer decoder with a pre-trained backbone, usually a ResNet[10] block. The image is fed into the backbone, and the resulting features are passed to the pixel decoder and the transformer decoder. The pixel decoder produces per-pixel embeddings, and the output of the transformer decoder is sent to an MLP. The MLP produces two results: N class predictions and N mask embeddings. The N mask embeddings are combined with the per-pixel embeddings, and the result is combined with the N class predictions to obtain the final segmentation. This is represented in Fig. 3, extracted from the original paper [2]. MaskFormer employs its own loss function, a combination of the Dice loss and the focal loss [14]. The number of parameters depends on the backbone used and ranges from 41M to 212M. For our experiment, we used Facebook’s Swin-Large architecture, pre-trained on ImageNet-22k and fine-tuned for the 15 classes in the iSAID dataset, with ~200M parameters; since most of the parameters are frozen and only the top layers are trained, training efficiency was not affected in the way it was for the UNet CNN counterpart. The inference time and floating-point operations (FLOPS) were, however, adversely affected by the larger number of parameters in the model.
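For reference, loading and partially freezing a MaskFormer checkpoint through the Hugging Face transformers library could look roughly like the sketch below; the checkpoint name, the module-name filter used for freezing, and the class list are illustrative assumptions, not the authors' exact configuration.

```python
from transformers import MaskFormerForInstanceSegmentation, MaskFormerImageProcessor

# The 15 iSAID foreground classes (abbreviations from Table II spelled out).
ISAID_CLASSES = ["plane", "baseball_diamond", "bridge", "ground_track_field",
                 "small_vehicle", "large_vehicle", "ship", "tennis_court",
                 "basketball_court", "storage_tank", "soccer_ball_field",
                 "roundabout", "harbor", "swimming_pool", "helicopter"]
id2label = {i: c for i, c in enumerate(ISAID_CLASSES)}

processor = MaskFormerImageProcessor.from_pretrained("facebook/maskformer-swin-large-ade")
model = MaskFormerForInstanceSegmentation.from_pretrained(
    "facebook/maskformer-swin-large-ade",    # Swin-Large checkpoint (name assumed)
    id2label=id2label,
    label2id={c: i for i, c in id2label.items()},
    ignore_mismatched_sizes=True,            # re-initialise the heads for the 15 iSAID classes
)

# Freeze the backbone so that only the top layers are fine-tuned
# (the module name below is assumed from the transformers implementation).
for name, param in model.named_parameters():
    if "pixel_level_module.encoder" in name:
        param.requires_grad = False
```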

III-D Loss Function

Take an image mask $A$ and the probability distribution, or predicted probability, of the mask $B$ such that $A, B \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ are the channel, height, and width of the mask, respectively. Then, in order to maximize the mIoU, we minimize $1 - IoU$ as our loss function [35].

$L_{iou} = 1 - \frac{A \cap B}{A \cup B}$ (1)

To make the loss function differentiable, we replace the non-differentiable bitwise intersection (AND) and union (OR) operations with multiplication and addition.

$L_{iou} = 1 - \frac{A * B}{A + B - (A * B)}$ (2)

Similarly, in order to maximize the Dice score, we minimize $1 - Dice$. Since the bitwise intersection is not differentiable, we replace it with the equivalent multiplication, so the loss function $L_{dice}$ becomes the following [21]:

$L_{dice} = 1 - \frac{2 * A * B}{|A| + |B|}$ (3)

We also use a weighted cross-entropy loss function to maintain the entropy of our predictions, so the third part of our loss function becomes $L_{ce}$ [16].

$L_{ce} = -\left[\beta * A\log(B) + (1 - \beta)(1 - A)\log(1 - B)\right]$ (4)

where $\beta$ is the weight hyperparameter (0.15 for the unlabelled class and 1 for the rest).

Combining the loss functions (2), (3), and (4) with weights $\lambda_{iou}$, $\lambda_{dice}$, and $\lambda_{ce}$, respectively, we get the total combined loss $L$.

$L = \lambda_{iou} * L_{iou} + \lambda_{dice} * L_{dice} + \lambda_{ce} * L_{ce}$ (5)

where $\lambda_{iou} = 0.8$, $\lambda_{dice} = 1$, and $\lambda_{ce} = 10$ were selected through trial-and-error experimentation.

Unlike with $L_{ce}$, we did not compute $L_{iou}$ and $L_{dice}$ for the unlabelled class. During inference we do not need to label the unlabelled class correctly, yet the model still needs some guidance about the patterns it should avoid during training; hence the small value of $\beta = 0.15$ in $L_{ce}$. This loss function was applied only to the UNet CNN, since the MaskFormer model has its own Dice and pixel-classification losses described in the original paper.
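A minimal PyTorch sketch of the combined loss (5), written as a soft (differentiable) IoU and Dice plus weighted cross-entropy, is given below; the tensor layout, the assumption that channel 0 is the unlabelled class, and the function name are illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits, target, lam_iou=0.8, lam_dice=1.0, lam_ce=10.0, eps=1e-6):
    """logits: (N, C, H, W); target: (N, H, W) integer masks, class 0 = unlabelled."""
    num_classes = logits.shape[1]
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()

    # Soft IoU and Dice over the foreground channels only (unlabelled class excluded).
    p, t = probs[:, 1:], onehot[:, 1:]
    inter = (p * t).sum(dim=(0, 2, 3))
    union = (p + t - p * t).sum(dim=(0, 2, 3))
    l_iou = 1 - ((inter + eps) / (union + eps)).mean()                         # Eq. (2)
    l_dice = 1 - ((2 * inter + eps) /
                  (p.sum(dim=(0, 2, 3)) + t.sum(dim=(0, 2, 3)) + eps)).mean()  # Eq. (3)

    # Weighted cross-entropy keeps the unlabelled class with a small weight of 0.15.
    weights = torch.ones(num_classes, device=logits.device)
    weights[0] = 0.15
    l_ce = F.cross_entropy(logits, target, weight=weights)                     # Eq. (4)

    return lam_iou * l_iou + lam_dice * l_dice + lam_ce * l_ce                 # Eq. (5)
```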

III-E Validation Metrics

We validate the results produced by our models over $C$ classes using the mIoU and Dice scores, defined as follows [35][21]:

$mIoU = \frac{1}{C}\sum_{i=0}^{C}\frac{A_i \cap B_i}{A_i \cup B_i}$ (6)
$Dice = \frac{1}{C}\sum_{i=0}^{C}\frac{2 * (A_i \cap B_i)}{|A_i| + |B_i|}$ (7)
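A short sketch of how these validation metrics can be computed per class, skipping the background/unlabelled class as stated in Section III-A, is shown below; the tensor shapes and the background index are assumptions.

```python
import torch

def miou_and_dice(pred, target, num_classes, ignore_index=0, eps=1e-6):
    """pred, target: (H, W) integer class maps; class `ignore_index` is skipped."""
    ious, dices = [], []
    for c in range(num_classes):
        if c == ignore_index:
            continue
        p, t = (pred == c), (target == c)
        inter = (p & t).sum().float()
        union = (p | t).sum().float()
        if union == 0:                        # class absent from both masks
            continue
        ious.append((inter + eps) / (union + eps))                    # Eq. (6) term
        dices.append((2 * inter + eps) / (p.sum() + t.sum() + eps))   # Eq. (7) term
    return torch.stack(ious).mean().item(), torch.stack(dices).mean().item()
```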

III-F Hyperparameters and Training Settings

Adam[13] was chosen as the optimizer for both models for its capability of adapting the learning rate over time. We trained each model for 40 epochs with an initial learning rate of $10^{-3}$ on a virtualized Nvidia A100 GPU with 48 GB of VRAM. The models were validated at the end of each epoch using the Dice and mIoU scores described in (6) and (7).
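Putting the pieces together, a compact training and validation skeleton under these settings might look like the following; `train_one_epoch` and `evaluate` are assumed helper functions wrapping the accumulation step of Section III-C1 and the metrics of (6) and (7).

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # initial learning rate 10^-3
for epoch in range(40):                                      # 40 epochs for both models
    train_one_epoch(model, train_loader, optimizer)          # mixed precision + accumulation
    miou, dice = evaluate(model, val_loader)                 # validated at the end of each epoch
    print(f"epoch {epoch + 1}: mIoU={miou:.4f}, Dice={dice:.4f}")
```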

Table I: Model details
Model Name | # of Parameters | FLOPS | Inference Time*
UNet CNN | 42.9M | 460.10 G | 0.19 s
MaskFormer | 200M | 232.40 G | 0.29 s
* Inference time calculated on 6 images

Figure 4: Comparison of Metrics over Epochs
Table II: IoU Scores Comparison by Category
Method† #params Year mIoU IoU per Category in %
PL BD BR GTF SV LV SH TC BC ST SBF RA HA SP HC
Plain ViT [24] 100M 2022 67.20 Not Reported
AANet[28] 29.2M 2022 66.6 84.6 80.5 40.2 60.5 48.7 63.2 71.2 88.8 65.4 65.7 73.5 72.4 57.2 52.3 41.8
AF-T[9] 42.7M 2023 67.5 86.1 77.5 45.3 57.5 52.6 67.0 68.6 88.8 63.4 74.9 75.1 73.0 58.2 50.5 42.0
AF-S[9] 64M 2023 68.4 86.5 78.8 44.8 59.5 53.6 66.5 72.1 89.2 66.5 74.1 77.0 74.0 60.9 52.1 40.0
AF-B [9] 113.8M 2023 69.3 86.5 81.5 46.8 65.0 53.7 67.8 75.1 89.8 62.4 76.3 78.3 66.1 60.8 52.4 46.7
Ringmo[22] 87.6M 2023 67.2 85.7 77.0 43.2 63.0 51.2 63.9 73.5 89.1 62.5 73.0 78.5 67.3 58.9 48.9 40.1
SAMRS-V[25] - 2023 66.0 86.0 79.3 42.5 65.3 53.2 68.4 75.9 89.5 63.7 75.4 79.2 69.8 59.5 49.4 37.5
SAMRS-C [25] - 2023 62.54 82.9 76 40 61.9 48.5 63.8 69.8 87.7 58.1 71.1 76.1 70.2 56 48.6 27
W-Net-C [27] - 2023 56.7 50.6 59.7 61.5 43.2 64.7 55.9 88.9 72.1 44.4 42.1 79.9 56.6 67.7 29.5 18.6
Ours (CNN) 42.9M 2024 73.4 64.6 89.7 91.7 84.8 43.9 52.5 63.8 71.8 71.2 82.2 88.0 94.1 64.5 89.0 49.2
Ours (ViT) 200M 2024 82.48 92.8 77.9 82.3 79.9 53.8 62.7 82.4 90.2 83.8 81.6 79.7 93.3 88.6 93.1 95.3
Abbreviations: PL = Plane, BD = Baseball Diamond, BR = Bridge, GTF = Ground Track Field, SV = Small Vehicle, LV = Large Vehicle, SH = Ship, TC = Tennis Court, BC = Basketball Court, ST = Storage Tank, SBF = Soccer Field, RA = Roundabout, HA = Harbor, SP = Swimming Pool, HC = Helicopter. †: V = ViT, C = CNN, T = Tiny, S = Small, B = Base

IV Results

Figure 5: Sample Visualization of model’s output
Figure 6: Class-wise comparison of IoU and Dice Scores

Our findings, based on the methodology shown in Fig. 1, are presented in this section. First, we list the two models' efficiency information in Table I. During training, most of the time was consumed by the data augmentation technique described in III-B, which took 65 seconds per batch for a batch size of 128. During inference on six images, however, we can see that having a larger number of parameters hurts the efficiency of the MaskFormer model. The UNet CNN had a FLOPS of 460 G, whereas the MaskFormer had a FLOPS of 232.4 G. The higher FLOPS together with the lower inference time for the UNet CNN shows that it performs more operations in less time, making it the more efficient model overall.

The MaskFormer ViT model produced better results in terms of mIoU and Dice scores (Fig. 4). The Dice score and mIoU for the MaskFormer rose from the mid-70% range to the 80% range by the end of the last epoch, whereas the UNet CNN started low at around 65%, reached a peak of 81%, and stabilized at around 78% during training over all classes (Fig. 4). The Dice and mIoU scores are strongly correlated, so we further compare the mIoU scores of the models against our references and recent research works. In Table II, recently published and reference works on the topic were selected whose models have a comparable number of parameters, comparable architectures, or both, and are compared with respect to mIoU and per-category IoU scores.

It can be noted that the mIoU and Dice scores for the MaskFormer ViT are within 10% above those of the UNet CNN model, even though it has roughly 5 times more parameters. From further analysis of Fig. 5, however, we can see that even though the MaskFormer has higher metrics than the UNet CNN, it fails to handle the background class properly because it assigns little importance to background (none-class) objects in the image. This implies that, without a pixel mask telling the model which pixels to predict and which to ignore, its results are not usable unless it is trained without any background class, as in datasets where every pixel is assigned to a certain group. This also stems from the fact that we calculate the scores only on the valid mask of pixels, i.e., pixels belonging to the background class are ignored during the computation of mIoU and Dice, as discussed earlier in Section III.

Fig 6 represents the per class Dice and IoU scores for the two proposed models. The correlation between the two scores shown in Fig 4 is further solidified by the class-wise comparison. The worst performance, according to Fig 6, is on the small vehicle class, and both models generally perform well in classes that include general landmarks like baseball diamonds, tennis courts, basketball courts, and ground track fields. Other easily recognizable items, which rarely change their shape, form, and size, like a bridge and roundabout, are also among the best-predicted classes by both models.

Our UNet CNN model’s metrics on the validation set surpassed the performance of comparable references on the test set, and the MaskFormer, with its greater number of parameters, surpassed the UNet CNN model as well. It can be noted, however, that the per-category mIoU of our models, and of the reference works, is not uniformly distributed; there is a strong affinity towards objects that tend to be larger and more frequent within the dataset. Fig. 5 shows sample predictions on randomly selected data for the UNet CNN model and the MaskFormer ViT. The augmentation process shows a high yield of variability, capable of generating multiple almost unrecognizable images from a single one. The ground truth was taken from the validation dataset itself, and the UNet and MaskFormer predictions follow to the right. The segmentation maps in Fig. 5, by category, are consistent with the mIoU presented in Table II and the average overall metrics shown in Fig. 4.

V Conclusion and Discussions

We successfully show that the introduction of the combined weighted loss function helps the model make stronger predictions and yield better results. One caveat of the findings is that we need a stronger way to split the testing images into multiple smaller patches that can be fed into the model and later rearranged to reconstruct the original images of the testing set, as we found that simple rescaling affects the model negatively; in other words, the image loses key patterns and features with simple rescaling during data augmentation. We show that the number of parameters required for a robust remote sensing segmentation model need not exceed 50M, given that parameter count directly influences FLOPS and inference time. Although the long training time can be attributed to the data augmentation process, future research could look into streamlining this pipeline as a whole. The next steps in the field entail decoupling an image into multiple smaller fragments and recoupling the predictions into the original image to obtain robust predictions on images of any shape and size. Nevertheless, we introduced a strong novel combined weighted loss function to compare a UNet CNN with a transfer-learning-based ViT and showed their performance against similar state-of-the-art segmentation models. We show that, with the new loss function, the CNN model yields comparable results on the metrics with better generalization capacity during inference on unseen data.

References

  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • [2] B. Cheng, A. Schwing, and A. Kirillov, “Per-Pixel Classification is Not All You Need for Semantic Segmentation,” in Advances in Neural Information Processing Systems, vol. 34, pp. 17864–17875, 2021.
  • [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.
  • [4] F. I. Diakogiannis, F. Waldner, P. Caccetta, and C. Wu, “ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 162, pp. 94–114, 2020.
  • [5] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in International Conference on Learning Representations, 2021.
  • [6] Z. Geng, M.-H. Guo, H. Chen, X. Li, K. Wei, and Z. Lin, “Is Attention Better Than Matrix Decomposition?,” in International Conference on Learning Representations, 2021.
  • [7] M.-H. Guo, C.-Z. Lu, Q. Hou, Z. Liu, M.-M. Cheng, and S.-M. Hu, “SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation,” arXiv preprint arXiv:2209.08575, 2022.
  • [8] M.-H. Guo, C.-Z. Lu, Z.-N. Liu, M.-M. Cheng, and S.-M. Hu, “Visual Attention Network,” arXiv preprint arXiv:2202.09741, 2022.
  • [9] T. Hanyu et al., “AerialFormer: Multi-Resolution Transformer for Aerial Image Segmentation,” Remote Sensing, vol. 16, no. 16, 2024.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • [11] P. S. Hinds, R. J. Vogel, and L. Clarke-Steffen, “The possibilities and pitfalls of doing a secondary analysis of a qualitative data set,” Qualitative health research, vol. 7, no. 3, pp. 408–424, 1997.
  • [12] D. Jiang, Y. Cao, and Q. Yang, “Weakly-supervised learning based automatic augmentation of aerial insulator images,” Expert Systems with Applications, vol. 242, p. 122739, 2024.
  • [13] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [14] T. Lin, “Focal Loss for Dense Object Detection,” arXiv preprint arXiv:1708.02002, 2017.
  • [15] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 10012–10022, 2021.
  • [16] A. Mao, M. Mohri, and Y. Zhong, “Cross-entropy loss functions: Theoretical analysis and applications,” in International conference on Machine learning, pp. 23803–23828, 2023.
  • [17] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in Proceedings of the 30th International Conference on Machine Learning, vol. 28, no. 3, pp. 1310–1318, 2013.
  • [18] S. Regmi, Unsupervised Image Segmentation in Satellite Imagery Using Deep Learning.   The University of Alabama in Huntsville, 2023.
  • [19] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, pp. 234–241, 2015.
  • [20] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [21] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 240–248, 2017.
  • [22] X. Sun et al., “RingMo: A Remote Sensing Foundation Model With Masked Image Modeling,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–22, 2023.
  • [23] A. Vaswani et al., “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6000–6010, 2017.
  • [24] D. Wang et al., “Advancing plain vision transformer toward remote sensing foundation model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2022.
  • [25] D. Wang, J. Zhang, B. Du, M. Xu, L. Liu, D. Tao, and L. Zhang, “Samrs: Scaling-up remote sensing segmentation dataset with segment anything model,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [26] T. Wolf et al., “Transformers: State-of-the-Art Natural Language Processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, 2020.
  • [27] G. Liu, Q. Wang, J. Zhu, and H. Hong, “W-Net: Convolutional neural network for segmenting remote sensing images by dual path semantics,” PLOS ONE, vol. 18, no. 7, pp. 1–16, 2023.
  • [28] G. Xue, Y. Liu, Y. Huang, M. Li, and G. Yang, “AANet: an attention-based alignment semantic segmentation network for high spatial resolution remote sensing images,” International Journal of Remote Sensing, vol. 43, no. 13, pp. 4836–4852, 2022.
  • [29] G.-S. Xia et al., “DOTA: A Large-Scale Dataset for Object Detection in Aerial Images,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [30] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in Proceedings of the European conference on computer vision (ECCV), pp. 418–434, 2018.
  • [31] K. Yue, L. Yang, R. Li, W. Hu, F. Zhang, and W. Li, “TreeUNet: Adaptive Tree convolutional neural networks for subdecimeter aerial image segmentation,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 156, pp. 1–13, 2019.
  • [32] S. Waqas Zamir et al., “iSAID: A Large-scale Dataset for Instance Segmentation in Aerial Images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 28–37, 2019.
  • [33] H. Zhang, K. J. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context Encoding for Semantic Segmentation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7151–7160, 2018.
  • [34] Q. Zhang, Y. Xu, J. Zhang, and D. Tao, “Vitaev2: Vision transformer advanced by exploring inductive bias for image recognition and beyond,” International Journal of Computer Vision, vol. 131, no. 5, pp. 1141–1162, 2023.
  • [35] D. Zhou, J. Fang, X. Song, C. Guan, J. Yin, Y. Dai, and R. Yang, “Iou loss for 2d/3d object detection,” in 2019 international conference on 3D vision (3DV), pp. 85–94, 2019.
  • [36] N. Ibtehaz and M. S. Rahman, “MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation,” Neural Networks, vol. 121, pp. 74–87, 2020.
  • [37] H. Noh, S. Hong, and B. Han, “Learning Deconvolution Network for Semantic Segmentation,” in 2015 IEEE International Conference on Computer Vision (ICCV), pp. 1520–1528, 2015.
  • [38] S. Altungüven and B. Toptaş, “Using GAN Methods for Aerial Images Segmentation,” Dicle University Journal of Engineering, vol. 15, no. 1, 2024.