
ChangeViT: Unleashing Plain Vision Transformers for Change Detection

Duowang Zhu, Xiaohu Huang, Haiyan Huang, Zhenfeng Shao, and Qimin Cheng

Duowang Zhu, Haiyan Huang, and Zhenfeng Shao are with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China (e-mail: [email protected]; [email protected]; [email protected]). Xiaohu Huang is with the University of Hong Kong, Pokfulam, Hong Kong (e-mail: [email protected]). Qimin Cheng is with the School of Electronic Information and Communications, Huazhong University of Science and Technology, Wuhan 430074, China (e-mail: [email protected]). Equal contribution. Corresponding author.
Abstract

Change detection in remote sensing images is essential for tracking environmental changes on the Earth’s surface. Despite the success of vision transformers (ViTs) as backbones in numerous computer vision applications, they remain underutilized in change detection, where convolutional neural networks (CNNs) continue to dominate due to their powerful feature extraction capabilities. In this paper, our study uncovers ViTs’ unique advantage in discerning large-scale changes, a capability where CNNs fall short. Capitalizing on this insight, we introduce ChangeViT, a framework that adopts a plain ViT backbone to enhance the detection of large-scale changes. This framework is supplemented by a detail-capture module that generates detailed spatial features and a feature injector that efficiently integrates fine-grained spatial information into high-level semantic learning. The feature integration ensures that ChangeViT excels in both detecting large-scale changes and capturing fine-grained details, providing comprehensive change detection across diverse scales. Without bells and whistles, ChangeViT achieves state-of-the-art performance on three popular high-resolution datasets (i.e., LEVIR-CD, WHU-CD, and CLCD) and one low-resolution dataset (i.e., OSCD), which underscores the unleashed potential of plain ViTs for change detection. Furthermore, thorough quantitative and qualitative analyses validate the efficacy of the introduced modules, solidifying the effectiveness of our approach. The source code is available at https://github.com/zhuduowang/ChangeViT.

Index Terms:
Change Detection, Vision Transformer.

I Introduction

Change detection plays a crucial role in the field of remote sensing, employing pairs of bi-temporal images taken of the same geographic area at different times to track changes on the Earth’s surface over time [1]. It has been widely applied in various applications such as disaster assessment [2], urban planning [3], arable land protection [4], and environmental management [5]. In recent years, convolutional neural networks (CNNs) have emerged as the primary backbone choice for state-of-the-art change detectors [2, 6, 7, 8, 9, 10], as they can extract rich hierarchical features for detecting changes with different sizes.

Over the past few years, Vision Transformers (ViTs) [11] have de facto substituted CNNs as the dominant backbones in various computer vision tasks, e.g., object detection [12], image segmentation [13], image matting [14], and pose estimation [15], exhibiting superior performance to CNN-based methods thanks to their long-range modeling capability. While transformers have been explored in the context of change detection in some preliminary studies [6, 16, 17, 18, 19], their performance has not yet matched that of the leading CNN models. Therefore, this paper aims to study the potential benefits of ViTs for change detection, striving to unleash their effectiveness in this area.

Figure 1: (a) Performance comparison of different change detectors across three datasets, categorized as CNN-based and ViT-based models. (b) Performance comparison ($\Delta$IoU, %) between a CNN (ResNet18) and a ViT (ViT-S, DINOv2) model for detecting changes of various sizes. The horizontal axis reflects the change sizes in ascending order, from the smallest to the largest changes. The $\Delta$IoU values are calculated by subtracting the CNN’s performance from that of the ViT for each size category.

To assess the efficacy of ViTs in the change detection task, we first conduct a comprehensive performance comparison between change detectors utilizing ViTs and three established CNN architectures as backbones, i.e., ResNet18 [20], VGG16 [21], and UNet [22]. This evaluation spans three well-known datasets, i.e., LEVIR-CD [23], WHU-CD [24], and CLCD [25], as depicted in Fig. 1(a). Additionally, we explore the influence of various model initializations by incorporating pre-trained weights from DeiT [26], DINO [27], and DINOv2 [28] into our analysis. Specifically, ResNet18, VGG16, UNet, and ViT-S (DeiT) are pre-trained on ImageNet-1k with supervised training, while ViT-S (DINO) and ViT-S (DINOv2) are pre-trained with self-supervised training on ImageNet-1k, ImageNet-22k, Google Landmarks, etc. The results indicate that: (1) CNN models significantly outperform all ViTs across all datasets, regardless of whether supervised or self-supervised learning is used, highlighting the dominance of CNNs in change detection tasks. (2) Even with identical pre-training data (i.e., ImageNet-1k), the performance of ViTs remains inferior to that of CNN-based models.

To delve deeper into the models’ capabilities, we perform an in-depth analysis of a ViT (ViT-S with DeiT pre-training) and a CNN model (ResNet18 with ImageNet-1k pre-training) in detecting changes across various object sizes, which is illustrated in Fig. 1(b). We organize the test samples from each dataset by the proportion of pixels occupied by different objects within the images. Specifically, we first sort the images in ascending order based on the ratio of pixels occupied by changing objects to the total number of pixels in the image. Then, we evenly divide this ordered sequence into five categories, ranging from the smallest to the largest proportions. We calculate the average performance difference between the ViT and CNN models within each category. The results show that though ViTs lag behind CNNs in detecting smaller changes, they demonstrate enhanced reliability for larger objects across all datasets. These insights suggest that while ViTs cannot capture fine-grained details as effectively as CNNs, they excel in detecting large-scale changes. Therefore, this previously untapped benefit has the potential to effectively mitigate the limitations inherent in CNN architectures.
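The size-based analysis above amounts to sorting test images by their changed-pixel ratio and splitting the sorted list into five equal-sized bins. The sketch below reproduces that protocol under the assumption that per-image binary ground-truth masks and per-image IoU scores for both models are already available; the variable names are illustrative.

```python
import numpy as np

def iou_gap_by_change_size(gt_masks, iou_cnn, iou_vit, num_bins=5):
    """Group test images into `num_bins` equal-sized categories by the ratio of
    changed pixels, then report the mean IoU gap (ViT minus CNN) per category."""
    # Ratio of changed pixels to total pixels for every ground-truth mask.
    ratios = np.array([mask.sum() / mask.size for mask in gt_masks])
    order = np.argsort(ratios)                 # ascending change size
    bins = np.array_split(order, num_bins)     # five roughly equal-sized groups
    iou_cnn, iou_vit = np.asarray(iou_cnn), np.asarray(iou_vit)
    # Positive values mean the ViT is more reliable in that size category.
    return [float(np.mean(iou_vit[idx] - iou_cnn[idx])) for idx in bins]
```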

Building upon the insights gathered from our preceding analysis, we propose ChangeViT, a simple yet effective framework that leverages a plain ViT as its core to capture large-scale object information. This is coupled with a detail-capture module specifically designed to focus on fine-grained features. The detail-capture module functions as an auxiliary network, incorporating selected layers (C2-C4) from ResNet18 [20], which offers a more compact footprint (2.7M parameters) compared to a complete CNN model (11.2M parameters). To seamlessly inject these fine-grained details into the feature representation of the ViT, we establish connections between the ViT’s representations and the fine-grained features. This integration is accomplished by treating the ViT features as queries and merging in the fine-grained features via the cross-attention mechanism.

Through extensive experiments on four widely recognized datasets, i.e., LEVIR-CD [23], WHU-CD [24], CLCD [25], and OSCD [29], ChangeViT achieves the state-of-the-art performance across the board. In addition, we combine the proposed modules with various hierarchical transformers, i.e., Swin Transformer [30], PVT [31], and PiT [32]. Consistently across these architectures, the proposed modules enhance performance, thereby further confirming their efficacy. Notably, despite the plain ViT’s perceived limitations compared to these advanced hierarchical networks, ChangeViT outperforms methods that utilize these complex models, showcasing that we effectively unleash the capacity of plain ViTs in the field of change detection.

The main contributions of this paper can be summarized as follows:

  • We thoroughly investigate the performance of plain ViTs and identify their aptitude for detecting large-scale changes. Motivated by this finding, we introduce ChangeViT, a simple yet effective framework which utilizes plain ViT as the primary feature extractor for the change detection task.

  • To enhance the detection of changes across various sizes, we integrate a detail-capture module, specifically introduced to address the limitations of ViTs when identifying small objects. Furthermore, we introduce a feature injector to merge the extracted detailed features into high-level ones from the ViT, ensuring comprehensive feature representation within the model.

  • ChangeViT achieves state-of-the-art performance on four popular datasets, i.e., LEVIR-CD, WHU-CD, CLCD, and OSCD, demonstrating the superiority of the proposed method. Moreover, thorough quantitative and qualitative analyses validate the efficacy of the modules we have introduced, further solidifying the effectiveness of our approach.

II Related Work

II-A Change Detection

Regarding the network architecture, existing change detection methods employing deep learning can be generally categorized into two groups: CNN-based and transformer-based.

CNN-based Methods. CNN-based change detection approaches have long been the mainstream framework in the literature [7, 9, 33, 34, 35, 36, 8, 37, 38, 39], known for their hierarchical feature modeling capabilities. These works primarily focus on multi-scale feature extraction, difference modeling, lightweight architecture design, and foreground-background class imbalance. For instance, methods in [35, 39] utilize fully convolutional networks to capture hierarchical features for learning multi-scale feature representations. For adequate differential feature modeling, approaches in [33, 36] incorporate the attention mechanism to establish relational dependencies among bi-temporal features. In contrast, Changer [34] introduces a parameter-free method that simply exchanges features between the two temporal branches so that each perceives the other’s information. Methods in [9, 8] focus on designing efficient and effective network architectures, utilizing lightweight feature extractors [40, 41] as backbones. Several studies [7, 37] address the significant challenge posed by foreground-background class imbalance by developing innovative loss functions that prioritize foreground alterations while minimizing interference from background noise (e.g., seasonal variations, climate changes).

Transformer-based Methods. Recently, the Vision Transformer [11] and its variants [30, 42, 31] have surpassed CNNs in various visual tasks and become the dominant backbones [12, 14, 15, 43]. Motivated by these achievements, several works [16, 18, 6, 44, 45, 17, 19] have explored the application of transformers in change detection tasks. Some of these methods [45, 18] utilize pure transformers, while others [16, 6, 44, 17] adopt CNN-transformer hybrid architectures. Methods in [45, 18] introduce hierarchical transformer networks based on the Swin Transformer [30]. The others typically follow a paradigm in which features extracted by a CNN serve as semantic tokens, followed by contextual relation modeling between bi-temporal tokens using transformer blocks. The method introduced in [19] presents an efficient tuning strategy that freezes the parameters of the transformer encoder while introducing additional trainable parameters. However, this method fails to deliver optimal results due to an inadequate exploration of the strengths and limitations of transformers, which precludes a more effective application of the model’s capabilities and caps the potential gains in performance.

Different from the previous approaches mainly using hierarchical networks, the proposed ChangeViT applies the plain ViT as the cornerstone feature extractor, which we find has previously unidentified potential in detecting large-scale changes.

Figure 2: Overview of the proposed ChangeViT. The bi-temporal images $I_1$ and $I_2$ are first fed into a shared ViT to extract high-level semantic features and into the detail-capture module to extract low-level detailed information. Subsequently, a feature injector is introduced to inject the low-level details into the high-level features. Finally, a decoder is utilized to predict the change probability map.

II-B Plain ViT for Downstream Tasks

ViT [11] is a plain, non-hierarchical architecture and a powerful alternative to standard CNNs for image classification. Due to the significant computational overhead of self-attention in ViT, subsequent works focus on designing more efficient architectures, such as Swin [30], PVT [31], and PiT [32]. These works inherit some designs from CNNs, including hierarchical structures, sliding windows, and convolutions. Recently, motivated by the emergence of large pre-trained models, e.g., DeiT [26], DINO [46], DINOv2 [28], MAE [47], and CLIP [48], researchers have begun to study the potential of ViT for various downstream tasks. The plain ViT has already made remarkable progress in dense prediction [43, 12, 13], pose estimation [15], image matting [14], etc. ViTDet [12] is the first to employ a plain, non-hierarchical ViT as the backbone for object detection with minimal adaptation, i.e., building a simple feature pyramid from single-scale features and adding a few cross-window blocks for information propagation. ViT-Adapter [43] introduces a pre-training-free adapter that injects prior knowledge into ViT without redesigning its architecture for various dense prediction tasks. Similarly, SimpleClick [13] and ViTPose [15] apply the vanilla ViT as the feature extractor to acquire single-scale features. For image matting, ViTMatte [14] is the first work to unleash the potential of ViT with concise adaptation.

Inspired by the above works, we aim to unleash the potential of the plain ViT model, enabling it to adapt well to change detection tasks.

III Proposed Method

The overall architecture is illustrated in Fig. 2. The bi-temporal images $I_1 \in \mathbb{R}^{H\times W\times 3}$ and $I_2 \in \mathbb{R}^{H\times W\times 3}$ are fed in parallel into a ViT and a detail-capture module. The ViT extracts high-level features $F_V^t \in \mathbb{R}^{\frac{H}{16}\times\frac{W}{16}\times C_4}$, where $t\in\{1,2\}$ indexes the two phases, while the detail-capture module acquires fine-grained multi-scale features $F_{C_i}^t \in \mathbb{R}^{\frac{H}{2^i}\times\frac{W}{2^i}\times C_i}$ ($i\in\{1,2,3\}$, $t\in\{1,2\}$). To enhance the detection of intricate details within the high-level features, we introduce a feature injector that integrates low-level fine-grained information into $F_V$. Finally, a multi-scale feature fusion decoder is applied to predict the change probability map $P \in \mathbb{R}^{H\times W\times 1}$.

III-A Feature Extraction

The feature extractor is composed of a plain ViT, and a detail-capture module which is described as follows:

Plain ViT. The bi-temporal images $I_1$ and $I_2$ are fed into a patch embedding layer, which divides them into non-overlapping $16\times 16$ patches. These patches are then flattened and projected to $D$-dimensional tokens, reducing the feature resolution to $1/16$ of the original images. Afterwards, position embeddings are added to these tokens, which are then passed through $L$ transformer layers. Each layer consists of layer normalization (LN), multi-head self-attention (MHSA), and a feed-forward network (FFN), formulated in Eq. (1) and Eq. (2):

$F_V^{\prime\,t,i+1} = F_V^{t,i} + \mathrm{MHSA}(\mathrm{LN}(F_V^{t,i})),$ (1)

$F_V^{t,i+1} = F_V^{\prime\,t,i+1} + \mathrm{FFN}(\mathrm{LN}(F_V^{\prime\,t,i+1})),$ (2)

where $i$ indexes the transformer layers. The final output of the ViT backbone is represented as $F_V^t \in \mathbb{R}^{\frac{H}{16}\times\frac{W}{16}\times C_4}$, where $C_4$ equals $D$.

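For concreteness, Eq. (1) and Eq. (2) describe a standard pre-norm transformer layer. The minimal PyTorch sketch below shows one such layer; the embedding dimension and head count are illustrative defaults rather than the exact ChangeViT configuration.

```python
import torch.nn as nn

class ViTBlock(nn.Module):
    """One pre-norm transformer layer matching Eq. (1)-(2): multi-head
    self-attention and a feed-forward network, each preceded by LayerNorm
    and wrapped in a residual connection."""
    def __init__(self, dim=384, num_heads=6, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x):                                  # x: (B, N, D) patch tokens
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # Eq. (1)
        x = x + self.ffn(self.norm2(x))                    # Eq. (2)
        return x
```
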
Detail-capture. As discussed in Sec. I, ViT demonstrates proficiency in detecting large changes but exhibits reduced effectiveness with smaller ones. To address this challenge, we introduce a detail-capture module designed to compensate for the absence of fine-grained local cues crucial for change detection. This module comprises three residual convolutional blocks (C2-C4) adapted from ResNet18 [20]. Upon processing the input images through the detail-capture module, detailed features are generated at three scales, i.e., $1/2$, $1/4$, and $1/8$, denoted as $F_{C_i}^t \in \mathbb{R}^{\frac{H}{2^i}\times\frac{W}{2^i}\times C_i}$ ($i\in\{1,2,3\}$).
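A branch of this kind can be sketched as follows; the stride and channel placement (chosen so the three outputs land at 1/2, 1/4, and 1/8 resolution with 64, 128, and 256 channels) is our assumption about how the C2-C4 blocks are adapted, so the official implementation may differ in detail.

```python
import torch.nn as nn
from torchvision.models.resnet import BasicBlock

class DetailCapture(nn.Module):
    """Lightweight CNN branch built from ResNet18-style residual blocks.
    Returns three feature maps at 1/2, 1/4 and 1/8 resolution with
    64, 128 and 256 channels (an assumed adaptation of C2-C4)."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(                       # 3 -> 64 channels, stride 2 (1/2)
            nn.Conv2d(3, 64, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True))
        self.c2 = self._stage(64, 64, stride=1)          # F_{C1}: 1/2, 64 ch
        self.c3 = self._stage(64, 128, stride=2)         # F_{C2}: 1/4, 128 ch
        self.c4 = self._stage(128, 256, stride=2)        # F_{C3}: 1/8, 256 ch

    @staticmethod
    def _stage(cin, cout, stride):
        down = None
        if stride != 1 or cin != cout:                   # projection shortcut when shapes change
            down = nn.Sequential(nn.Conv2d(cin, cout, 1, stride, bias=False),
                                 nn.BatchNorm2d(cout))
        return nn.Sequential(BasicBlock(cin, cout, stride, downsample=down),
                             BasicBlock(cout, cout))

    def forward(self, x):
        f1 = self.c2(self.stem(x))       # (B, 64, H/2, W/2)
        f2 = self.c3(f1)                 # (B, 128, H/4, W/4)
        f3 = self.c4(f2)                 # (B, 256, H/8, W/8)
        return f1, f2, f3
```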

III-B Feature Injector

In the change detection task, preserving detailed spatial features is crucial as they can help detect small objects. Ensuring the effective transmission of low-level details to high-level semantic features is paramount.

Therefore, we introduce a feature injector, composed of three cross-attention blocks [49], as illustrated in Fig. 3(a). It considers the low-level features as the key and value vectors and the high-level feature as the query vector. Intuitively, this is reasonable as it allows the feature injector to gather the most relevant information based on the provided key information and integrate it into the query. By enabling cross-layer feature propagation, detailed information can be incorporated into the high-level representations of the ViT, denoted as $F_{V_E}^t$. The $F_{V_E}^t$ is computed as follows:

$F_{V_E}^{t,i} = \mathrm{CrossAttn}(F_V^t, F_{C_i}^t),$ (3)

$F_{V_E}^{t} = \mathrm{FC}(F_{V_E}^{t,1} \;ⓒ\; F_{V_E}^{t,2} \;ⓒ\; F_{V_E}^{t,3}).$ (4)

where $i\in\{1,2,3\}$ denotes the index of the low-level layers, and $F_V^t$ and $F_{C_i}^t$ serve as the query and the key/value, respectively. $\mathrm{FC}$ is a 2D depth-wise convolution with a kernel size of $1\times 1$, and ⓒ denotes the concatenation operation along the channel dimension.
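A minimal sketch of this injector is given below. It assumes each detail feature is first linearly projected to the ViT dimension so it can serve as key/value, and it fuses the three attended outputs with a plain 1x1 convolution rather than the depth-wise FC described above; channel sizes and head counts are illustrative.

```python
import torch
import torch.nn as nn

class FeatureInjector(nn.Module):
    """Cross-attention injector of Fig. 3(a): the ViT feature acts as the query and
    each detail-capture feature as key/value (Eq. (3)); the attended outputs are
    concatenated and fused by a 1x1 convolution (Eq. (4))."""
    def __init__(self, vit_dim=384, detail_dims=(64, 128, 256), num_heads=8):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(c, vit_dim) for c in detail_dims)
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(vit_dim, num_heads, batch_first=True)
            for _ in detail_dims)
        self.fc = nn.Conv2d(3 * vit_dim, vit_dim, kernel_size=1)

    def forward(self, f_v, details):
        # f_v: (B, D, H/16, W/16); details: list of (B, Ci, H/2^i, W/2^i)
        b, d, h, w = f_v.shape
        q = f_v.flatten(2).transpose(1, 2)                 # (B, N, D) query tokens
        outs = []
        for proj, attn, f_c in zip(self.proj, self.attn, details):
            kv = proj(f_c.flatten(2).transpose(1, 2))      # detail tokens as key/value
            out, _ = attn(q, kv, kv)                       # Eq. (3)
            outs.append(out.transpose(1, 2).reshape(b, d, h, w))
        return self.fc(torch.cat(outs, dim=1))             # Eq. (4)
```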

Additionally, we explore an alternative design for the feature injector, as depicted in Fig. 3(b), which considers the low-level features as the query and the ViT’s semantic information as the key and value, refining the ViT’s representation according to the characteristics of the hierarchical detailed features.

(a) Using the ViT’s features as the query and the detailed features as the key and value for the feature injector.
(b) Using the ViT’s features as the key and value and the detailed features as the query for the feature injector.
Figure 3: Illustration of the feature injectors. $F_{C_i}$ ($i\in\{1,2,3\}$) denote the multi-scale detailed features acquired from the detail-capture module, while $F_V$ denotes the ViT feature lacking detailed information. (a) $F_V$ serves as the query vector and $F_{C_i}$ as the key and value vectors to capture detailed features for the ViT. (b) $F_V$ serves as the key and value vectors and $F_{C_i}$ as the query vector to refine features for the ViT.

III-C Decoder and Optimization

Compared to existing methods [7, 6, 36, 44], which employ complex techniques to model difference information and predict the change probability map, we choose a simpler decoder to better demonstrate the learning capabilities of ChangeViT. Specifically, we use a straightforward feature fusion layer to capture differences between bi-temporal features. A cascade convolutional layer, followed by an upsampling operation, is employed to progressively aggregate differential features from deep to shallow layers, ultimately restoring them to the original resolution of $H\times W$. The difference modeling is formulated as Eq. (5):

$F_{D_i} = \mathrm{MLP}(F_i^1 \;ⓒ\; F_i^2 \;ⓒ\; \lvert F_i^1 - F_i^2 \rvert),$ (5)

where $F_i^t \in \{F_{C_1}^t, F_{C_2}^t, F_{C_3}^t, F_{V_E}^t\}$ ($t\in\{1,2\}$), $\mathrm{MLP}$ is a three-layer 2D convolutional network with $3\times 3$ kernels and ReLU activations, ⓒ denotes concatenation along the channel dimension, and $\lvert\cdot\rvert$ denotes the absolute value operation.
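A sketch of this fusion step under the stated definition is shown below; the layer widths inside the MLP block are assumptions.

```python
import torch
import torch.nn as nn

class DiffModule(nn.Module):
    """Bi-temporal difference modeling of Eq. (5): concatenate the two phase
    features with their absolute difference and fuse them with a three-layer
    3x3 convolutional block."""
    def __init__(self, channels):
        super().__init__()
        layers, cin = [], 3 * channels            # input: [F^1, F^2, |F^1 - F^2|]
        for _ in range(3):                        # three 3x3 conv + ReLU layers
            layers += [nn.Conv2d(cin, channels, 3, padding=1), nn.ReLU(inplace=True)]
            cin = channels
        self.mlp = nn.Sequential(*layers)

    def forward(self, f1, f2):
        x = torch.cat([f1, f2, torch.abs(f1 - f2)], dim=1)
        return self.mlp(x)                        # differential feature F_{D_i}
```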

To restore the original resolution of the change map, we use a simple cascade upsampling operation, represented as follows:

$F_{D_{i+1}} \leftarrow \mathrm{Deconv}_{4\times 4}(\mathrm{Conv}_{1\times 1}(F_{D_i})) + F_{D_{i+1}},$ (6)

where $\mathrm{Conv}_{1\times 1}$ is a 2D convolution with a kernel size of $1\times 1$ that reduces the channel dimension, and $\mathrm{Deconv}_{4\times 4}$ denotes a 2D deconvolution with a kernel size of $4\times 4$ and a stride of $2\times 2$ that upsamples the feature map.

Finally, a classification layer is applied to transform the shallowest features $F_{D_4}$ into the change map $P$, as formulated in Eq. (7):

$P = \mathrm{Sigmoid}(\mathrm{Conv}_{3\times 3}(F_{D_4})),$ (7)

where $\mathrm{Conv}_{3\times 3}$ is a 2D convolution with a kernel size of $3\times 3$. The $\mathrm{Sigmoid}$ function maps the feature map to $(0,1)$, which is then transformed into a binary map given a predefined threshold (i.e., 0.5), i.e., $P \in \{0,1\}^{H\times W}$.
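The decoding path of Eq. (6)-(7) can be sketched as follows; the channel widths and the final bilinear resize to the input resolution are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    """Cascade decoder of Eq. (6)-(7): each step reduces channels with a 1x1 conv,
    upsamples x2 with a 4x4 stride-2 transposed conv, and adds the next shallower
    difference feature; a 3x3 conv plus sigmoid then yields the change map."""
    def __init__(self, channels=(384, 256, 128, 64)):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(channels[i], channels[i + 1], 1) for i in range(len(channels) - 1))
        self.up = nn.ModuleList(
            nn.ConvTranspose2d(channels[i + 1], channels[i + 1], 4, stride=2, padding=1)
            for i in range(len(channels) - 1))
        self.cls = nn.Conv2d(channels[-1], 1, 3, padding=1)

    def forward(self, diff_feats, out_size, threshold=0.5):
        # diff_feats ordered deep -> shallow, e.g. [F_{D1} (1/16), ..., F_{D4} (1/2)]
        x = diff_feats[0]
        for reduce, up, skip in zip(self.reduce, self.up, diff_feats[1:]):
            x = up(reduce(x)) + skip                      # Eq. (6)
        prob = torch.sigmoid(self.cls(x))                 # Eq. (7)
        prob = F.interpolate(prob, size=out_size, mode='bilinear', align_corners=False)
        return (prob > threshold).float()                 # binary change map P
```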

As mentioned in prior works [7, 9], the proportion of changed targets is significantly lower than that of unchanged ones. Following the above works, we adopt binary cross-entropy (BCE) and Dice loss [50] to alleviate the class imbalance problem. The change detection loss $\mathcal{L}_{total}$ is defined as Eq. (8):

$\mathcal{L}_{total} = \mathcal{L}_{bce}(P, Y) + \mathcal{L}_{dice}(P, Y),$ (8)

The BCE and Dice losses are formulated as follows:

$\mathcal{L}_{bce}(P, Y) = -\frac{1}{N}\sum_{i=1}^{N}\left[Y_i \log_{2} P_i + (1-Y_i)\log_{2}(1-P_i)\right],$

$\mathcal{L}_{dice}(P, Y) = 1 - \frac{2\sum_{i=1}^{N} P_i Y_i + \epsilon}{\sum_{i=1}^{N} (P_i)^2 + \sum_{i=1}^{N} (Y_i)^2 + \epsilon}.$ (9)

where $i$ denotes the $i$-th pixel, $N$ is the total number of pixels, $Y$ denotes the ground truth, and $\epsilon$ (i.e., 1e-5) is a smoothing term used to avoid division by zero.
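A compact implementation of Eq. (8)-(9) might look as follows; note that the sketch uses the standard natural-logarithm BCE from PyTorch, whereas Eq. (9) is written with $\log_2$, which differs only by a constant scale.

```python
import torch
import torch.nn.functional as F

def change_detection_loss(prob, target, eps=1e-5):
    """Combined objective of Eq. (8)-(9): binary cross-entropy plus Dice loss.
    `prob` holds predicted change probabilities and `target` the binary ground
    truth, both float tensors of shape (B, 1, H, W)."""
    bce = F.binary_cross_entropy(prob, target)                  # L_bce (natural log)
    p, y = prob.flatten(1), target.flatten(1)
    inter = (p * y).sum(dim=1)
    dice = 1.0 - (2.0 * inter + eps) / ((p ** 2).sum(dim=1) + (y ** 2).sum(dim=1) + eps)
    return bce + dice.mean()                                    # L_total
```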

IV Experiments

We conducted extensive experiments on three widely used high-resolution datasets, namely LEVIR-CD [23], WHU-CD [24], and CLCD [25], as well as one challenging low-resolution dataset, OSCD [29], to demonstrate the effectiveness of the proposed method. To better understand each component of ChangeViT, we conduct extensive diagnostic experiments in Sec. IV-E. Unless otherwise stated, we use ChangeViT-S for experiments on the three high-resolution datasets.

IV-A Implementation Details

We adopt the vanilla ViT [11] as our primary backbone, specifically incorporating its tiny and small variants, thereby constructing two models named ChangeViT-T and ChangeViT-S. We use DeiT [26] and DINOv2 [28] pre-trained weights for initialization, respectively. Our models are implemented using the PyTorch framework [51] and executed on a computing platform consisting of a single NVIDIA GeForce RTX 3090 GPU paired with an Intel(R) Xeon(R) Gold 6138 CPU. For optimization, we opt for the Adam optimizer [52], with beta values set to (0.9, 0.99) and a weight decay of 1e-4. The initial learning rate is 2e-4 and gradually decays according to the schedule $lr \times (1 - \mathrm{curr\_iter}/\mathrm{max\_iter})^{\alpha}$, where $\alpha$ is set to 0.9 and max_iter is set to 80K iterations for LEVIR-CD and WHU-CD, 40K for CLCD, and 10K for OSCD, respectively. The batch size remains 16 across all experiments. To augment the training data and bolster the model’s robustness, we apply random flipping and cropping as data augmentation. The channel dimensions of $F_{C_i}$ are set to 64, 128, and 256, respectively. Furthermore, we ensure consistency and fairness in comparison by aligning the experimental settings of the compared methods with those specified in their original papers.
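For reference, this schedule is the common polynomial ("poly") decay; a small helper reproducing it, using the stated base rate and iteration budgets as example inputs, is shown below.

```python
def poly_lr(base_lr, curr_iter, max_iter, alpha=0.9):
    """Polynomial decay used during training: lr = base_lr * (1 - curr_iter / max_iter) ** alpha."""
    return base_lr * (1.0 - curr_iter / max_iter) ** alpha

# Example: with base_lr = 2e-4 and max_iter = 80000 (LEVIR-CD / WHU-CD), the rate
# starts at 2e-4 and decays smoothly towards zero at the final iteration.
print(poly_lr(2e-4, 0, 80000), poly_lr(2e-4, 40000, 80000))  # 2.0e-04, ~1.07e-04
```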

IV-B Datasets

IV-B1 LEVIR-CD

This dataset [23] comprises 637 high-resolution (1024×1024, 0.5 m/pixel) bi-temporal image pairs sourced from Google Earth. The images represent 20 diverse regions across various Texan cities, including Austin, Lakeway, Bee Cave, Buda, Kyle, Manor, Pflugerville, Dripping Springs, and others. The dataset, with annotations for 31333 individual building changes, spans images captured from 2002 to 2018 in various locations. Following the cropping methodology established in [6], each image is segmented into 16 distinct 256×256 patches. Consequently, the dataset is divided into 7120 pairs for training, 1024 pairs for validation, and 2048 pairs for testing.

IV-B2 WHU-CD

This publicly available dataset [24] focuses on building change detection and includes high-resolution (0.2 m) bi-temporal aerial images, totaling 32507×\times15354 pixels. It primarily encompasses areas affected by earthquakes and subsequent reconstruction, mainly involving building renovations. Adhering to the standard procedure detailed in [33], the dataset images are divided into 256×\times256 non-overlapping patches. The dataset is partitioned into 5947 training pairs, 744 validation pairs, and 744 test pairs.

IV-B3 CLCD

The CLCD [25] dataset consists of cropland change samples, including buildings, roads, lakes, etc. The bi-temporal images in CLCD were collected by Gaofen-2 in Guangdong Province, China, in 2017 and 2019, respectively, with spatial resolutions ranging from 0.5 to 2 m. Following the standard procedure detailed in [6], each image in the dataset is segmented into 256×\times256 patches. Consequently, the CLCD dataset is divided into 1440, 480, and 480 pairs for training, validation, and testing, respectively.

IV-B4 OSCD

The OSCD dataset [29] is a relatively low-resolution dataset with resolutions ranging from 10 m to 60 m. It was captured by the Sentinel-2 satellites over areas in various countries with different levels of urbanization that have experienced urban growth or change. This resolution enables the detection of large buildings in the image pairs. However, smaller changes such as the appearance of small buildings, extensions of existing buildings, or additions of lanes to roads may not be obvious, making diverse change detection challenging. The dataset consists of 24 regions of approximately 600×600 pixels. In accordance with common practice, each image in the dataset is cropped into 256×256 patches. As a result, the OSCD dataset is divided into 75 training pairs and 28 test pairs.

IV-C Evaluation Metrics and Compared Methods

IV-C1 Evaluation Metrics

Following the widely used evaluation protocols in the change detection task, we use three accuracy metrics, i.e., F1 score (F1), intersection over union (IoU), and overall accuracy (OA), to evaluate our proposed method. They are formulated as follows:

$P = \frac{TP}{TP+FP},\quad R = \frac{TP}{TP+FN},\quad F1 = \frac{2PR}{P+R},$

$IoU = \frac{TP}{TP+FN+FP},\quad OA = \frac{TP+TN}{TP+TN+FN+FP}.$ (10)

where TP, FP, TN, and FN indicate true positive, false positive, true negative, and false negative, respectively. For all the metrics, a higher value means better detection performance.
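These metrics follow directly from the confusion counts; one straightforward way to compute them is sketched below (the small epsilon guards against empty classes are our addition).

```python
import numpy as np

def change_metrics(pred, gt, eps=1e-10):
    """Compute F1, IoU, and OA of Eq. (10) from binary prediction and
    ground-truth maps (0/1 arrays of identical shape)."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    tn = np.logical_and(~pred, ~gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    iou = tp / (tp + fn + fp + eps)
    oa = (tp + tn) / (tp + tn + fn + fp)
    return f1, iou, oa
```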

IV-C2 Compared Methods

To verify the effectiveness of the proposed method, nine representative open-source methods are selected for comparison, described as follows:

a) DTCDSCN [38]: A dual task-constrained deep siamese convolutional network is introduced which can accomplish change detection and semantic segmentation. It applies channel and spatial attention to improve the interactive feature representation.

b) SNUNet [53]: The bi-temporal differential features are extracted by the densely connected siamese network which focuses not only on high-level semantic features but also on the low-level fine-grained features.

c) ChangeFormer [18]: Multi-scale long-range features are extracted by a hierarchical transformer encoder and decoded with a multi-layer perceptron.

d) BIT [16]: The bi-temporal images are represented as semantic tokens; a transformer encoder then models contexts, and a transformer decoder refines the context-rich tokens.

e) ICIFNet [33]: An intra-scale cross-interaction and inter-scale feature fusion network that jointly captures spatio-temporal contextual information and obtains short-long range representations of bi-temporal features.

f) DMINet [36]: An inter-temporal joint-attention module, consisting of self-attention and cross-attention blocks, aims to model the global relations of the input images.

g) GASNet [7]: A global-aware siamese network that models relations between the scene and the foreground, proposed to alleviate the class imbalance problem of the change detection task.

h) AMTNet [6]: A CNN-transformer model that uses a CNN backbone to extract multi-scale features and employs a transformer encoder-decoder to model contextual information.

i) EATDer [44]: An edge-assisted detector that incorporates an edge-aware decoder to integrate the edge information obtained by the encoder, thereby enhancing the feature representation of changed regions.

TABLE I: Performance comparison of different change detection methods on the LEVIR-CD, WHU-CD, and CLCD datasets. All results of the three evaluation metrics are described as percentages (%).

Method | #Params (M) | FLOPs (G) | LEVIR-CD F1 / IoU / OA | WHU-CD F1 / IoU / OA | CLCD F1 / IoU / OA
DTCDSCN [38] | 41.07 | 20.44 | 87.43 / 77.67 / 98.75 | 79.92 / 66.56 / 98.05 | 57.47 / 40.81 / 94.59
SNUNet [53] | 12.04 | 54.82 | 88.16 / 78.83 / 98.82 | 83.22 / 71.26 / 98.44 | 60.82 / 43.63 / 94.90
ChangeFormer [18] | 41.03 | 202.79 | 90.40 / 82.48 / 99.04 | 87.39 / 77.61 / 99.11 | 61.31 / 44.29 / 94.98
BIT [16] | 3.55 | 10.63 | 89.31 / 80.68 / 98.92 | 83.98 / 72.39 / 98.52 | 59.93 / 42.12 / 94.77
ICIFNet [33] | 23.82 | 25.36 | 89.96 / 81.75 / 98.99 | 88.32 / 79.24 / 98.96 | 68.66 / 52.27 / 95.77
DMINet [36] | 6.24 | 14.42 | 90.71 / 82.99 / 99.07 | 88.69 / 79.68 / 98.97 | 67.24 / 50.65 / 95.21
GASNet [7] | 23.59 | 23.52 | 90.52 / 83.48 / 99.07 | 91.75 / 84.76 / 99.34 | 63.84 / 46.89 / 94.01
AMTNet [6] | 24.67 | 21.56 | 90.76 / 83.08 / 98.96 | 92.27 / 85.64 / 99.32 | 75.10 / 60.13 / 96.45
EATDer [44] | 6.61 | 23.43 | 91.20 / 83.80 / 98.75 | 90.01 / 81.97 / 98.58 | 72.01 / 56.19 / 96.11
ChangeViT-T | 11.68 | 27.15 | 91.81 / 84.86 / 99.17 | 94.53 / 89.63 / 99.57 | 77.31 / 63.01 / 96.67
ChangeViT-S | 32.13 | 38.80 | 91.98 / 85.16 / 99.19 | 94.84 / 90.18 / 99.59 | 77.57 / 63.36 / 96.79
TABLE II: Performance comparison of different change detection methods on the OSCD dataset. All results of the three evaluation metrics are described as percentages (%).

Method | OSCD F1 / IoU / OA
DTCDSCN [38] | 36.13 / 22.05 / 94.50
SNUNet [53] | 27.02 / 15.62 / 93.81
ChangeFormer [18] | 38.22 / 23.62 / 94.53
BIT [16] | 29.58 / 17.36 / 90.15
ICIFNet [33] | 23.03 / 13.02 / 94.61
DMINet [36] | 42.23 / 26.76 / 95.00
GASNet [7] | 10.71 / 5.66 / 91.52
AMTNet [6] | 10.25 / 5.40 / 94.29
EATDer [44] | 54.23 / 36.98 / 93.85
ChangeViT-T | 55.13 / 38.06 / 95.01
ChangeViT-S | 55.51 / 38.42 / 95.05

IV-D Comparison with State-of-the-Art Approaches

As illustrated in Tab. I, we compare ChangeViT with previous methods on three high-resolution datasets, i.e., LEVIR-CD, WHU-CD, and CLCD. Notably, all the compared methods employ hierarchical backbones as their primary feature extractors. Specifically, DTCDSCN, BIT, ICIFNet, DMINet, GASNet, and AMTNet apply ResNet [20] or its variants [54] as backbones, while SNUNet and EATDer apply a nested UNet [22] and stacked non-local blocks [55], respectively. In contrast, our approach employs a non-hierarchical, plain ViT (either ViT-T or ViT-S) as the core backbone, coupled with a lightweight detail-capture module that serves as an auxiliary network. From Tab. I, we can summarize the following findings: (1) ChangeViT consistently outperforms the existing works across all datasets and evaluation metrics, even with the tiny ViT backbone, which demonstrates its effectiveness. (2) The primary feature extractor, despite being non-hierarchical, demonstrates competitive performance when compared to hierarchical-based methods. This underscores the robust feature extraction and representation capabilities that a large-scale pre-trained ViT can offer, fully realizing its potential. (3) Notably, ChangeViT-T and ChangeViT-S exhibit significant performance gains over the SOTA method (i.e., AMTNet) of 3.99% and 4.54% IoU on the WHU-CD dataset. This finding is sensible given that the changes in WHU-CD vary widely, with fewer medium-sized objects compared to smaller and larger ones. This observation aligns with the results illustrated in the middle of Fig. 1(b), underscoring the efficacy of our proposed method in capturing global features and extracting fine-grained spatial information. (4) With an increase in the size of the primary feature extractor, ChangeViT demonstrates enhanced performance. Notably, the detail-capture module, comprising just 2.7M parameters, stands out for its lightweight nature when compared to the total parameter count of each model (i.e., 11.68M and 32.13M). Our proposed ChangeViT achieves a superior balance between efficiency and effectiveness compared to previous methods, underscoring its superiority.

As shown in Tab. II, we also compare ChangeViT with several existing methods on the low-resolution dataset, i.e., OSCD. The targets in OSCD are relatively smaller than those in the high-resolution datasets, exacerbating the foreground-background imbalance issue and making smaller targets challenging to detect. From Tab. II, the following key points can be noted: (1) The proposed ChangeViT outperforms all compared methods across the three evaluation metrics, despite utilizing the tiny or small ViT variants, demonstrating its effectiveness on the low-resolution dataset. (2) GASNet and AMTNet perform poorly on this dataset, likely due to their inefficiency in detecting small targets. Although GASNet introduces a foreground-awareness module to address the category imbalance between the foreground and background, it still underperforms in detecting changes in low-resolution remote sensing images.

IV-E Diagnostic Study

TABLE III: Study of the effectiveness of the proposed modules with different transformer architectures on the three datasets. The check mark (✓) denotes combination with our proposed modules. All results are described as percentages (%).

Backbone | Ours | LEVIR-CD F1 / IoU / OA | WHU-CD F1 / IoU / OA | CLCD F1 / IoU / OA
Swin-S | | 89.40 / 80.83 / 98.94 | 93.03 / 86.98 / 99.22 | 73.80 / 58.47 / 96.33
Swin-S | ✓ | 90.18 / 82.11 / 99.01 | 94.04 / 88.75 / 99.30 | 75.41 / 60.52 / 96.40
PVT-S | | 84.60 / 73.31 / 98.38 | 87.36 / 77.55 / 98.89 | 70.25 / 54.15 / 95.76
PVT-S | ✓ | 87.26 / 77.40 / 98.68 | 89.09 / 80.32 / 98.92 | 71.95 / 56.19 / 95.90
PiT-S | | 84.94 / 73.83 / 98.38 | 87.34 / 77.53 / 98.89 | 70.01 / 53.86 / 95.88
PiT-S | ✓ | 87.20 / 77.31 / 98.63 | 89.50 / 81.00 / 98.94 | 72.80 / 57.23 / 95.93
ViT-S | | 82.39 / 70.05 / 98.25 | 84.70 / 73.46 / 98.82 | 69.05 / 52.74 / 95.75
ViT-S | ✓ | 91.98 / 85.16 / 99.19 | 94.84 / 90.18 / 99.59 | 77.53 / 63.30 / 96.76
TABLE IV: Study of the effectiveness of the proposed modules in ChangeViT on the three datasets. DC and FI denote the detail-capture module and the feature injector, respectively. All results are described as percentages (%).

ViT | DC | FI | LEVIR-CD F1 / IoU / OA | WHU-CD F1 / IoU / OA | CLCD F1 / IoU / OA
✓ | | | 82.39 / 70.05 / 98.25 | 84.70 / 73.46 / 98.82 | 69.18 / 52.88 / 95.68
 | ✓ | | 88.12 / 78.76 / 98.80 | 90.20 / 82.15 / 99.24 | 69.72 / 53.51 / 95.98
✓ | ✓ | | 91.20 / 83.82 / 99.11 | 93.30 / 87.43 / 99.46 | 75.36 / 60.46 / 96.62
✓ | ✓ | ✓ | 91.98 / 85.16 / 99.19 | 94.84 / 90.18 / 99.59 | 77.53 / 63.30 / 96.76
TABLE V: Investigation of the impact of multiple scales in the detail-capture module on the three datasets. All results are described as percentages (%).

Scales | LEVIR-CD F1 / IoU / OA | WHU-CD F1 / IoU / OA | CLCD F1 / IoU / OA
1/8 | 91.32 / 84.03 / 99.12 | 94.20 / 89.04 / 99.55 | 76.43 / 61.85 / 96.49
1/4 | 91.08 / 83.62 / 99.10 | 92.90 / 86.74 / 99.45 | 73.25 / 57.79 / 96.31
1/2 | 89.43 / 80.89 / 98.94 | 90.87 / 83.28 / 99.29 | 70.82 / 54.82 / 95.79
1/8 + 1/4 | 91.56 / 84.43 / 99.15 | 94.25 / 89.07 / 99.58 | 77.30 / 63.07 / 96.62
1/8 + 1/2 | 91.45 / 84.24 / 99.14 | 94.02 / 88.67 / 99.44 | 77.10 / 62.73 / 96.56
1/4 + 1/2 | 90.94 / 83.39 / 99.09 | 92.49 / 86.04 / 99.41 | 75.22 / 60.28 / 96.46
1/8 + 1/4 + 1/2 | 91.98 / 85.16 / 99.19 | 94.84 / 90.18 / 99.59 | 77.53 / 63.30 / 96.76

Effectiveness with different architectures. In Tab. III, we investigate the effectiveness of the proposed modules with different architectures, including hierarchical (i.e., Swin-S [30], PVT-S [31], PiT-S [32]) and non-hierarchical (i.e., ViT-S [11]) transformers. Key observations from the table include: (1) Without our proposed modules, the non-hierarchical ViT-S underperforms the hierarchical backbones across all metrics on the three datasets. (2) When integrated with our proposed modules, all transformers exhibit performance improvements, indicating the efficacy of our approach regardless of the transformer architecture. (3) ViT-S achieves significant performance gains over the hierarchical transformers when equipped with our proposed modules, suggesting that our modules effectively mitigate ViT-S’s limitations in capturing the detailed information needed to detect smaller objects.

Effectiveness of proposed modules. To investigate the effectiveness of the proposed modules, we conduct comprehensive diagnostic experiments on the three datasets. As shown in Tab. IV, we take various combinations of components into account and explore the contribution of each module. We use a plain ViT together with a decoder as the baseline. Coupled with the detail-capture module, the ViT unleashes its potential and improves F1 by 8.81%, 8.60%, and 6.18% on the three datasets compared to the baseline, which indicates that the detail-capture module supplements the detailed spatial information essential for the change detection task. Furthermore, when combined with the feature injector, there are additional F1 gains of 0.78%, 1.54%, and 2.17%, indicating the effectiveness of incorporating detailed information at higher levels. In summary, all of our proposed modules are essential and effective in the ChangeViT framework.

Impact of multiple scales. To investigate the necessity of capturing multiple scales in the detail-capture module, we conduct experiments using multi-scale features, i.e., 1/2, 1/4, and 1/8. As shown in Tab. V, we can draw the following key observations: (1) Single-scale features often yield subpar results, while the amalgamation of multi-scale features leads to enhanced performance. (2) An interesting finding is that high-level features or their combinations achieve better performance than low-level features. (3) Furthermore, the inclusion of all three scales results in mutual improvements, indicating that multi-scale features leverage spatial cues across complementary levels.

Impact of pre-trained weights. To investigate the impact of pre-trained weights on ChangeViT, we apply various model initialization approaches, including random initialization and several publicly available large-scale pre-trained weights derived from both supervised and self-supervised training strategies on various datasets. As illustrated in Tab. VI, we observe the following key points: (1) Both ChangeViT-T and ChangeViT-S exhibit improved detection accuracy when utilizing pre-trained weights compared to random initialization. (2) DINOv2-S provides the most effective pre-trained weights for the ChangeViT-S model, benefiting from large-scale data pre-training. (3) When DMINet, GASNet, AMTNet, and ChangeViT are pre-trained on the same data, i.e., ImageNet-1k, the proposed ChangeViT outperforms all the CNN-based methods, demonstrating the effectiveness of transferring the priors of large pre-trained ViT models to the change detection task.

TABLE VI: Study of the impact of different pre-trained weights of ViT on the three datasets. All results are described as percentages (%).

Model | Backbone | Pretrain | Pre-trained Data | Training Strategy | LEVIR-CD F1 / IoU / OA | WHU-CD F1 / IoU / OA | CLCD F1 / IoU / OA
DMINet [36] | ResNet18 | - | ImageNet(1k) | Supervised | 90.71 / 82.99 / 99.07 | 88.69 / 79.68 / 98.97 | 67.24 / 50.65 / 95.21
GASNet [7] | ResNet34 | - | ImageNet(1k) | Supervised | 90.52 / 83.48 / 99.07 | 91.75 / 84.76 / 99.34 | 63.84 / 46.89 / 94.01
AMTNet [6] | ResNet50 | - | ImageNet(1k) | Supervised | 90.76 / 83.08 / 98.96 | 92.27 / 85.64 / 99.32 | 75.10 / 60.13 / 96.45
ChangeViT-T | ViT (Tiny) | Random Init | - | - | 91.58 / 84.47 / 99.15 | 93.78 / 88.29 / 99.51 | 76.91 / 62.49 / 96.66
ChangeViT-T | ViT (Tiny) | DeiT-T [26] | ImageNet(1k) | Supervised | 91.81 / 84.86 / 99.17 | 94.53 / 89.63 / 99.57 | 77.31 / 63.01 / 96.67
ChangeViT-S | ViT (Small) | Random Init | - | - | 90.82 / 83.19 / 99.09 | 93.65 / 88.06 / 99.50 | 75.05 / 60.06 / 96.59
ChangeViT-S | ViT (Small) | DeiT-S [26] | ImageNet(1k) | Supervised | 91.78 / 84.81 / 99.17 | 94.73 / 89.99 / 99.58 | 77.24 / 62.69 / 96.68
ChangeViT-S | ViT (Small) | DINO-S [27] | ImageNet (w/o labels) | Self-supervised | 91.68 / 84.64 / 99.16 | 94.70 / 89.94 / 99.58 | 77.05 / 62.67 / 96.65
ChangeViT-S | ViT (Small) | DINOv2-S [28] | ImageNet(1k, 22k) & Google Landmarks | Self-supervised | 91.98 / 85.16 / 99.19 | 94.84 / 90.18 / 99.59 | 77.53 / 63.30 / 96.76
Figure 4: (a) Each dataset is evenly split into five intervals based on the change sizes. The horizontal axis incrementally reflects the change sizes, progressing from smaller to larger changes. (b) The predicted map within the red box indicates a poor detection outcome.

Choice of query, key and value. Two experiments are conducted to investigate different modeling approaches in the feature injector, as shown in Tab. VII. In the first experiment, $F_V$ serves as the query and $F_C$ serves as the key and value, yielding the best performance. This result is consistent with the conjecture in Sec. III-B, suggesting that the feature injector effectively captures the low-level value information most relevant to the high-level query and reintegrates it into the query. Therefore, through cross-attention, low-level fine-grained features can be seamlessly merged into the high-level features.

TABLE VII: Study of the impact of different modeling approaches in the feature injector on the three datasets. All results are described as percentages (%).

Query | Key & Value | LEVIR-CD F1 / IoU / OA | WHU-CD F1 / IoU / OA | CLCD F1 / IoU / OA
$F_V$ | $F_C$ | 91.98 / 85.16 / 99.19 | 94.84 / 90.18 / 99.59 | 77.53 / 63.30 / 96.76
$F_C$ | $F_V$ | 91.78 / 84.80 / 99.17 | 94.60 / 89.75 / 99.58 | 75.84 / 61.08 / 96.58

Size of changes vs. performance. As depicted in Fig. 4(a), we conduct experiments on the three datasets using the detail-capture module, ViT-S, and our proposed method to quantitatively analyze the performance of each method under different change sizes. The detail-capture module and ViT-S are each paired with the same decoder as ChangeViT. The results indicate that the detail-capture module excels at detecting smaller changed targets, while ViT-S demonstrates superiority in detecting larger ones. Our method capitalizes on ViT’s powerful feature expression while leveraging the detail-capture module for fine-detail information mining. This comprehensive approach enables superior performance across targets of all sizes.

Qualitative results. We present representative visualization results on the three datasets, comparing the performance of the detail-capture module, ViT-S, and our proposed method to demonstrate the effectiveness of ChangeViT. As shown in Fig. 4(b), the first row for each dataset presents the test results for smaller targets, while the second row corresponds to larger targets. From the results, we can see that the detail-capture module excels at detecting smaller targets, whereas ViT-S demonstrates superiority in detecting larger ones. The fundamental distinction lies in the local receptive field of CNNs, which enables them to extract intricate local features, whereas ViT possesses a global receptive field, facilitating the extraction of comprehensive global information. The proposed method efficiently integrates global and local information, resulting in superior performance.

To qualitatively compare with previous methods, we provide comprehensive samples encompassing small, large, sparse, and dense targets, as illustrated in Fig. 5. From these samples, several key observations emerge intuitively: (1) Our proposed method consistently outperforms all compared methods across various change sizes. This is attributed to the robust global modeling capabilities of ViT and the detail-capture module’s capacity to extract intricate spatial information. Additionally, a feature injector integrates low-level fine-grained spatial features into ViT’s high-level semantic representations, enhancing ChangeViT’s capability to detect changes of diverse sizes. (2) In detecting dense objects, regardless of their size, ChangeViT consistently delineates clear boundaries compared to prior methods. This underscores ChangeViT’s effectiveness in capturing both global semantic information and local spatial details of neighboring objects.

V Conclusion

In this paper, we present a simple yet effective framework, namely ChangeViT, that leverages a plain ViT as its primary feature extractor to capture large-scale changes. Coupled with a detail-capture module dedicated to fine-grained spatial features, ChangeViT seamlessly integrates these details into the ViT’s feature representation through the cross-attention mechanism. Experimental results demonstrate ChangeViT’s supremacy over meticulously designed hierarchical models across all evaluation metrics on four widely adopted datasets, highlighting the untapped potential of vanilla ViTs for change detection. Furthermore, comprehensive diagnostic analyses and visualization results provide insights into the contribution of each module. We aim for this study to offer valuable insights to the research community and ignite further exploration into leveraging vanilla ViTs for other related computer vision tasks, such as change captioning.

Figure 5: Qualitative comparison of different methods on the three datasets. White represents a true positive, black a true negative, green a false positive, and red a false negative. Fewer green and red pixels represent better performance. For better clarity, please zoom in on the figure.

References

  • [1] R. J. Radke, S. Andra, O. Al-Kofahi, and B. Roysam, “Image change detection algorithms: a systematic survey,” IEEE transactions on image processing, vol. 14, no. 3, pp. 294–307, 2005.
  • [2] Z. Zheng, Y. Zhong, J. Wang, A. Ma, and L. Zhang, “Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters,” Remote Sensing of Environment, vol. 265, p. 112636, 2021.
  • [3] S. W. Wang, L. Munkhnasan, and W.-K. Lee, “Land use and land cover change detection and prediction in bhutan’s high altitude city of thimphu, using cellular automata and markov chain,” Environmental Challenges, vol. 2, p. 100017, 2021.
  • [4] R. S. Lunetta, J. F. Knight, J. Ediriwickrema, J. G. Lyon, and L. D. Worthy, “Land-cover change detection using multi-temporal modis ndvi data,” in Geospatial Information Handbook for Water Resources and Watershed Management, Volume II.   CRC Press, 2022, pp. 65–88.
  • [5] R. E. Kennedy, P. A. Townsend, J. E. Gross, W. B. Cohen, P. Bolstad, Y. Wang, and P. Adams, “Remote sensing change detection tools for natural resource managers: Understanding concepts and tradeoffs in the design of landscape monitoring projects,” Remote sensing of environment, vol. 113, no. 7, pp. 1382–1396, 2009.
  • [6] W. Liu, Y. Lin, W. Liu, Y. Yu, and J. Li, “An attention-based multiscale transformer network for remote sensing image change detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 202, pp. 599–609, 2023.
  • [7] R. Zhang, H. Zhang, X. Ning, X. Huang, J. Wang, and W. Cui, “Global-aware siamese network for change detection on remote sensing images,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 199, pp. 61–72, 2023.
  • [8] Y. Feng, Y. Shao, H. Xu, J. Xu, and J. Zheng, “A lightweight collective-attention network for change detection,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 8195–8203.
  • [9] Z. Li, C. Tang, X. Liu, W. Zhang, J. Dou, L. Wang, and A. Y. Zomaya, “Lightweight remote sensing change detection with progressive feature aggregation and supervised attention,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–12, 2023.
  • [10] Z. Zheng, A. Ma, L. Zhang, and Y. Zhong, “Change is everywhere: Single-temporal supervised object change detection in remote sensing imagery,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 15 193–15 202.
  • [11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [12] Y. Li, H. Mao, R. Girshick, and K. He, “Exploring plain vision transformer backbones for object detection,” in European Conference on Computer Vision.   Springer, 2022, pp. 280–296.
  • [13] Q. Liu, Z. Xu, G. Bertasius, and M. Niethammer, “Simpleclick: Interactive image segmentation with simple vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 290–22 300.
  • [14] J. Yao, X. Wang, S. Yang, and B. Wang, “Vitmatte: Boosting image matting with pre-trained plain vision transformers,” Information Fusion, vol. 103, p. 102091, 2024.
  • [15] Y. Xu, J. Zhang, Q. Zhang, and D. Tao, “Vitpose: Simple vision transformer baselines for human pose estimation,” Advances in Neural Information Processing Systems, vol. 35, pp. 38 571–38 584, 2022.
  • [16] H. Chen, Z. Qi, and Z. Shi, “Remote sensing image change detection with transformers,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2021.
  • [17] B. Jiang, Z. Wang, X. Wang, Z. Zhang, L. Chen, X. Wang, and B. Luo, “Vct: Visual change transformer for remote sensing image change detection,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [18] W. G. C. Bandara and V. M. Patel, “A transformer-based siamese network for change detection,” in IGARSS 2022-2022 IEEE International Geoscience and Remote Sensing Symposium.   IEEE, 2022, pp. 207–210.
  • [19] Y. Zhao, Y. Zhang, Y. Dong, and B. Du, “Adapting vision transformer for efficient change detection,” arXiv preprint arXiv:2312.04869, 2023.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [21] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [22] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18.   Springer, 2015, pp. 234–241.
  • [23] H. Chen and Z. Shi, “A spatial-temporal attention-based method and a new dataset for remote sensing image change detection,” Remote Sensing, vol. 12, no. 10, p. 1662, 2020.
  • [24] S. Ji, S. Wei, and M. Lu, “Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set,” IEEE Transactions on geoscience and remote sensing, vol. 57, no. 1, pp. 574–586, 2018.
  • [25] M. Liu, Z. Chai, H. Deng, and R. Liu, “A cnn-transformer network with multiscale context aggregation for fine-grained cropland change detection,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 15, pp. 4297–4306, 2022.
  • [26] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou, “Training data-efficient image transformers & distillation through attention,” in International conference on machine learning.   PMLR, 2021, pp. 10 347–10 357.
  • [27] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
  • [28] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
  • [29] R. C. Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, “Urban change detection for multispectral earth observation using convolutional neural networks,” in IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium.   IEEE, 2018, pp. 2115–2118.
  • [30] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.
  • [31] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 568–578.
  • [32] B. Heo, S. Yun, D. Han, S. Chun, J. Choe, and S. J. Oh, “Rethinking spatial dimensions of vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11 936–11 945.
  • [33] Y. Feng, H. Xu, J. Jiang, H. Liu, and J. Zheng, “Icif-net: Intra-scale cross-interaction and inter-scale feature fusion network for bitemporal remote sensing images change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  • [34] S. Fang, K. Li, and Z. Li, “Changer: Feature interaction is what you need for change detection,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [35] R. C. Daudt, B. Le Saux, and A. Boulch, “Fully convolutional siamese networks for change detection,” in 2018 25th IEEE International Conference on Image Processing (ICIP).   IEEE, 2018, pp. 4063–4067.
  • [36] Y. Feng, J. Jiang, H. Xu, and J. Zheng, “Change detection on remote sensing images using dual-branch multilevel intertemporal network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023.
  • [37] J. Zhang, Z. Shao, Q. Ding, X. Huang, Y. Wang, X. Zhou, and D. Li, “Aernet: An attention-guided edge refinement network and a dataset for remote sensing building change detection,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [38] Y. Liu, C. Pang, Z. Zhan, X. Zhang, and X. Yang, “Building change detection for remote sensing images using a dual-task constrained deep siamese convolutional network model,” IEEE Geoscience and Remote Sensing Letters, vol. 18, no. 5, pp. 811–815, 2020.
  • [39] R. C. Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, “Urban change detection for multispectral earth observation using convolutional neural networks,” in IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium.   IEEE, 2018, pp. 2115–2118.
  • [40] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4510–4520.
  • [41] Y. Tang, K. Han, J. Guo, C. Xu, C. Xu, and Y. Wang, “Ghostnetv2: enhance cheap operation with long-range attention,” Advances in Neural Information Processing Systems, vol. 35, pp. 9969–9982, 2022.
  • [42] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong et al., “Swin transformer v2: Scaling up capacity and resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12 009–12 019.
  • [43] Z. Chen, Y. Duan, W. Wang, J. He, T. Lu, J. Dai, and Y. Qiao, “Vision transformer adapter for dense predictions,” International Conference on Learning Representations, 2023.
  • [44] J. Ma, J. Duan, X. Tang, X. Zhang, and L. Jiao, “Eatder: Edge-assisted adaptive transformer detector for remote sensing change detection,” IEEE Transactions on Geoscience and Remote Sensing, 2024.
  • [45] C. Zhang, L. Wang, S. Cheng, and Y. Li, “Swinsunet: Pure transformer network for remote sensing image change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  • [46] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
  • [47] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009.
  • [48] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [49] C.-F. R. Chen, Q. Fan, and R. Panda, “Crossvit: Cross-attention multi-scale vision transformer for image classification,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 357–366.
  • [50] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 2016 fourth international conference on 3D vision (3DV).   IEEE, 2016, pp. 565–571.
  • [51] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” NIPS Workshops, 2017.
  • [52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [53] S. Fang, K. Li, J. Shao, and Z. Li, “Snunet-cd: A densely connected siamese network for change detection of vhr images,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2021.
  • [54] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [55] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7794–7803.
Duowang Zhu received the B.S. and M.S. degrees from the School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan, China, in 2020 and 2022, respectively. He is currently pursuing the Ph.D. degree with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan, China. His current research areas include computer vision and remote sensing image processing.
Xiaohu Huang received the B.S. and M.S. degrees from the School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan, China, in 2020 and 2023, respectively. He is currently pursuing the Ph.D. degree at the University of Hong Kong (HKU), Hong Kong, China. His current research areas include computer vision and machine learning.
Haiyan Huang received the M.S. degree from Huazhong University of Science and Technology, Wuhan, China, in 2022. She is currently pursuing the Ph.D. degree with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University. Her research interests include high spatial resolution remote sensing image understanding and analysis.
Zhenfeng Shao received the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2004. Since 2009, he has been a Full Professor with the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University. He has authored or coauthored more than 50 peer-reviewed articles in international journals. His research interests include high-resolution image processing, pattern recognition, and urban remote sensing applications. Dr. Shao was a recipient of the Talbert Abrams Award for the Best Paper in Image Matching from the American Society for Photogrammetry and Remote Sensing in 2014 and the New Century Excellent Talents in University award from the Ministry of Education of China in 2012. Since 2019, he has been serving as an Associate Editor for Photogrammetric Engineering & Remote Sensing (PE&RS), specializing in smart cities, photogrammetry, and change detection.
Qimin Cheng received the Ph.D. degree in cartography and geographic information system from the Institute of Remote Sensing Applications, Chinese Academy of Sciences, Beijing, China, in 2004. She is currently a Professor with the Huazhong University of Science and Technology, Wuhan, China. Her research interests include image retrieval and annotation, and remote sensing image understanding and analysis.