
SwinV2DNet: Pyramid and Self-Supervision Compounded Feature Learning for Remote Sensing Images Change Detection

Dalong Zheng, Zebin Wu,  Jia Liu,  Zhihui Wei,  Manuscript received DD MM, YY; revised DD MM, YY; accepted DD MM, YY. Date of publication MM DD, YY. This work was supported in part by the National Natural Science Foundations of China under Grant 62071233, Grant 61971223, Grant 62276133, and Grant 61976117; in part by the Jiangsu Provincial Natural Science Foundations of China under Grant BK20211570, Grant BK20180018, and Grant BK20191409; in part by the Fundamental Research Funds for the Central Universities under Grant 30917015104, Grant 30919011103, Grant 30919011402, and Grant 30921011209; in part by the Key Projects of University Natural Science Fund of Jiangsu Province under Grant 19KJA360001; and in part by the Qinglan Project of Jiangsu Universities under Grant D202062032. (Corresponding author: Zebin Wu.) The authors are with the School of Computer Science and Engineering, Nanjing University of Science and Technology (NJUST), Nanjing 210094, China (e-mail: [email protected], [email protected], [email protected], [email protected]).
Abstract

Among current mainstream change detection networks, the transformer lacks the ability to capture accurate low-level details, while the convolutional neural network (CNN) struggles to understand global information and to establish long-range spatial relationships. Meanwhile, neither of the two widely used frameworks, early fusion and late fusion, learns complete change features well. Therefore, based on swin transformer V2 (Swin V2) and VGG16, we propose an end-to-end compounded dense network, SwinV2DNet, that inherits the advantages of both transformer and CNN and overcomes the shortcomings of existing networks in feature learning. First, it captures the change relationship features through the densely connected Swin V2 backbone and provides the low-level pre-changed and post-changed features through a CNN branch. Based on these three kinds of change features, we obtain accurate change detection results. Second, combining transformer and CNN, we propose the mixed feature pyramid (MFP), which provides inter-layer interaction information and intra-layer multi-scale information for complete feature learning. MFP is a plug-and-play module that is experimentally proven to be effective in other change detection networks as well. Furthermore, we impose a self-supervision strategy to guide a new CNN branch, which solves the untrainable problem of the CNN branch and provides semantic change information for the encoder features. We obtain state-of-the-art (SOTA) change detection scores and fine-grained change maps compared with other advanced methods on four commonly used public remote sensing datasets. The code is available at https://github.com/DalongZ/SwinV2DNet.

Index Terms:
Change detection, swin transformer V2, mixed feature pyramid, self-supervision learning.

I Introduction

Remote sensing image change detection is one of the earliest and most important remote sensing tasks and has long been studied by many researchers [1, 2, 3, 4]. Change detection is defined as observing the differences between images of the same surface area acquired at different times. It is used in many scenarios, including disaster assessment [5], urban planning, land surface change [6, 7], and so on. With the development of satellites and sensors, very high resolution remote sensing (VHRRS) images have gradually become one of the mainstream data sources in research, providing rich spatial information and fine surface details. However, one of the main challenges of VHRRS image change detection is the high intraclass variation and low interclass variance of detection objects [8]. How to design a stable network that provides comprehensive and diverse feature information to distinguish pseudo changes (as shown in Fig. 1) has therefore been a focus of research.

Refer to caption
Figure 1: A variety of pseudo changes become the challenges in change detection. (a) The roof color changes. (b) Spectral changes on the surface. (c) The house shadow changes.

Traditional change detection algorithms, according to different detection units, can be divided into pixel-based algorithms and object-based algorithms. The detection results of pixel-based algorithms are obtained through feature extraction followed by threshold segmentation; these include methods based on arithmetic operations (band difference [9], spectral angle mapper [10]), methods based on transformation (change vector analysis (CVA) [11, 12], principal component analysis (PCA) [13], independent component analysis (ICA) [14]), post-classification change detection [15], slow feature analysis (SFA) [16], and so on. Based on the shape, texture, and spectrum of images, object-based algorithms segment the images and then compare the classification results to obtain the change detection results [17]. Pixel-based algorithms are hampered by small-scale noise and the choice of the segmentation threshold, while object-based algorithms often suffer from the accumulation of multiple classification errors that degrade detection accuracy [1]. Both families of traditional algorithms require prior knowledge and manual design, and are easily affected by sensor noise.

With the support of massive remote sensing data, deep learning has also shown outstanding detection ability in the field of remote sensing. CNN converts the input images into high-dimensional depth features and combines targets and background to extract effective semantic information, achieving detection performance beyond many traditional methods. [18] provides the three most common baseline networks for change detection. The architecture combining CNN and conditional random field (CRF) refines the edges of detection areas, but at the cost of slow training [19]. CNNs, however, are hindered by the narrow receptive field of local information, while transformers have risen rapidly owing to their ability to model global information. A pure transformer change detection model, in turn, does not work well because it lacks low-level details [20]. Therefore, how to combine transformer and CNN to build a reasonable change detection architecture is the crux of the matter at this stage.

From another perspective, change detection network architectures can be divided into early fusion (EF) [18, 21, 22, 23] and late fusion (LF) [18, 24, 25, 26, 27, 28, 29, 30] networks. The EF network stitches the two images together and feeds them into a single-input network. By concatenating two three-channel images into a six-channel image, [21] and [22] input it into a fully convolutional network (FCN) and UNet++ respectively, and output the change map after training the network. The disadvantage of this approach is that the network lacks the depth features of the single images, resulting in fractured edges and broken structures in the change map. In the LF network, two features are extracted from the pre-changed and post-changed images respectively by a dual-input structure and are fused in the second half of the network. The siamese network, the most prominent LF network, consists of two subnets with shared weights; it was first used for remote sensing image change detection in [24]. Applying the convolutional block attention module (CBAM) and deep supervision to the siamese network alleviates the problems of heterogeneous feature fusion and depth feature migration during training, respectively [25]. However, for the LF network, the contradiction between the dual-stream input of the encoder and the single output of the decoder often leads to vanishing gradients and hampers the learning of low-level features from the two original images. Furthermore, the heterogeneous feature fusion of the LF network requires elaborately designed modules. Consequently, how to overcome the respective disadvantages of these two architectures and provide complete and diverse features for change detection is another problem worth pondering.

In addition to the design of overall network architectures for change detection, researchers are also pushing forward the elaboration of network functional modules. The attention modules introduced into change detection are relatively representative, including squeeze-and-excitation attention (SE) [31], efficient channel attention (ECA) [32], CBAM [33], and cross-attention [29]. At the same time, since ground objects have different scales in VHRRS change detection datasets, how should these scales be handled to maintain the robustness and generalization ability of the network? Multi-scale features in deep learning can generally be divided into three categories: multi-scale features between different layers, multi-scale interaction features between different layers, and multi-scale features from different convolution units. The first type is embedded in the common U-Net network. The second type typically interacts and fuses features using transformer or CNN. The third type is provided by a variety of convolution units, such as inception [34], dilated convolution [35], res2net convolution (Res2Net-Conv) [36], selective kernel convolution (SK-Conv) [37], and so on. We should consider the integration and utilization of these three kinds of multi-scale features. Other scholars also explore combinations with generative adversarial networks (GAN) [38, 39] or self-supervised learning [40, 41, 42] to obtain more discriminative features. These deep learning techniques aim to solve the problem of high intraclass variation and low interclass variance by mining the diverse features of change detection data.

Motivated by the above concerns, this study combines Swin V2 and VGG16 to propose a new end-to-end compounded dense network, SwinV2DNet. Swin V2 blocks are used to build the UNet++-type main network, and a VGG16 encoder is used to build the CNN auxiliary network. SwinV2DNet overcomes both the limitation of CNN to modeling only local information and the insufficient interpretation of low-level details in transformer. On the other hand, the Swin V2 main network is an EF network, while the CNN branch is an LF network. This structure consistently provides the pre-changed features, post-changed features, and change relationship features (namely, those derived from the six-channel concatenation of the pre-changed and post-changed images) for the accurate acquisition of change detection results. CBAM and deep supervision also promote the fusion of heterogeneous features and the rapid, stable convergence of the network, respectively. To better combine transformer and CNN, we propose a new multi-scale module, the mixed feature pyramid, which provides inter-layer multi-scale interaction information and intra-layer multi-scale information to supplement the UNet++ main network, which only has inter-layer multi-scale information. We finally design a new decoder for the CNN branch, which previously had only the VGG16 encoder, and use a self-supervision strategy to train the extracted features, so that the CNN branch can provide learnable and more discriminative semantic information. To sum up, the main contributions of this study are fourfold:

  • 1)

    We propose an end-to-end compounded dense network, SwinV2DNet, that possesses the advantages of both transformer and CNN, and overcomes the respective disadvantages of the EF and LF networks. This is the first parallel combination of Swin V2 and VGG16 in change detection.

  • 2)

    The mixed feature pyramid is proposed, for the first time, to provide inter-layer interaction information and intra-layer multi-scale information. It is a plug-and-play module that has been experimentally proven to be effective in other change detection networks as well.

  • 3)

    We design a new decoder for the CNN branch with only VGG16 encoder, and impose the self-supervised strategy to train the extracted features to provide more discriminative semantic information for the main network.

  • 4)

    Compared with other advanced methods, our method obtains the state-of-the-art (SOTA) change detection scores and the elaborate change maps on four common public remote sensing datasets.

The remainder of this paper is organized as follows. Section II reviews the related work. Section III elaborates the proposed SwinV2DNet method. The experimental evaluations and ablation studies are carried out in Section IV. Finally, Section V presents the conclusion of this article.

II Related Work

II-A Visual Transformer

Transformers first stood out in natural language processing (NLP) owing to their ability to model global information [43]. The vision transformer (ViT) was proposed by [44] to bring the spirit of self-attention to image classification, but this huge model cannot be directly applied to other computer vision detection tasks. The Swin transformer adopts a hierarchical architecture and local information interaction, which reduces the heavy computation involved in modeling global information [45]. Furthermore, [46] proposes Swin V2, which stabilizes the training process as model parameters increase and mitigates the resolution difference between upstream and downstream tasks. At the same time, various transformer models are developing rapidly in the field of remote sensing image processing [47, 48, 49, 50].

Due to the lack of low-level details in the pure transformer model [20], the transformer is generally combined with CNN for change detection. On the one hand, among serial combination methods, Chen et al. propose the bitemporal image transformer (BiT), which embeds a transformer into a CNN to enhance the ability to model contexts within the spatial-temporal domain [26]. TransUNetCD makes up for the former's omission of high-resolution shallow information by cascading an up-sampling decoder to achieve a more detailed change map [51]. On the other hand, among parallel associative approaches, [52] uses transformer and CNN respectively to extract the paired change features for feature fusion, but it also ignores the supplement of low-level details at the decoder stage. Our study adopts a new parallel approach that associates high-level and low-level features to avoid the defects of the above models and obtain the best detection results.

II-B Multi-Scale Information

Multi-scale information is always an important tool in image processing. In deep learning, multi-scale features can be divided into three categories: multi-scale features between different layers, multi-scale interaction features between different layers and multi-scale features from different convolution units. The first category is often already implicit in a variety of classic networks, such as ResNet, FCN and U-Net [53, 54, 55]. Feature pyramid transformer (FPT) uses two different transformers to interfuse the features between different layers to generate the second multi-scale features [56]. The third type of multi-scale features is provided by various convolution units, such as inception, dilated convolution, Res2Net-Conv, SK-Conv and so on.

In VHRRS images change detection, [57] extracts the multilevel intertemporal features through the double branches of shared weights, and then performs the information fusion and features difference to obtain the robust change features. The multiscale decoupled convolution is constructed by using atrous convolutions with different dilation rates. The researchers embed several of these convolutions in different layers to acquire the two multiscale features between and within layers [58]. Based on FPT, we propose mixed feature pyramid that provides inter-layer multi-scale interaction information and intra-layer multi-scale information to supplement the main network only with inter-layer multi-scale information.

II-C Self-Supervised Learning

Since labeled data require costly manual effort, self-supervised learning (SSL), in which a large amount of unlabeled data is used to train networks and extract knowledge that is then transferred to downstream tasks, has recently become a flourishing deep learning technology. SSL was first used in NLP. [59] designs an autoregressive task over words, which provides more contextual information than the task of matching a single sentence to a single label. SSL was then introduced into computer vision. Contrastive learning is one representative paradigm [60]: image sample pairs are generated through the network, and the cost function pulls the two positive samples as close together as possible while pushing positive and negative samples apart. The masked autoencoder (MAE) is another important branch of SSL [61].

In change detection, Chen and Bruzzone [62] use bootstrap your own latent (BYOL) framework to pretrain the heterogeneous remote sensing images, and then transfer it to the downstream change detection task. The use of ViT encoder and random masking as SSL data enhancement techniques further advances this architecture [63]. SSL has been shown to provide the more discriminative semantic information when combined with the supervised techniques to train the network [40]. Different from [40], we design a new encoder-decoder architecture with VGG16 as the encoder, and apply SSL to this branch architecture to provide the self-supervised semantic features for the main network.

III Proposed Method

In this section, we first introduce SwinV2DNet architecture. Moreover, Swin V2 block, MFP module, the CNN branch network and SSL strategy are described respectively. Finally, we specify the loss function of the model.

Refer to caption
Figure 2: The overall architecture of SwinV2DNet.

III-A Proposed SwinV2DNet

The overall architecture of SwinV2DNet is shown in Fig. 2. In conjunction with Algorithm 1, we elaborate on the details of the network. First of all, we initialize the parameters $i$ and $j$ to satisfy the following conditions:

$\begin{array}{l} 1 \le i \le 4,\ i \in N \\ 0 \le j \le 3,\ j \in N \\ \text{when } j = 0,\ i \in \{1,2,3,4\} \\ \text{when } j = 1,\ i \in \{1,2,3\} \\ \text{when } j = 2,\ i \in \{1,2\} \\ \text{when } j = 3,\ i \in \{1\} \end{array}$  (1)

where $N$ is the set of natural numbers, $i$ indexes the layers of SwinV2DNet or VGG16, and $j$ indexes the columns of SwinV2DNet. Then we obtain the encoder features of the CNN branch network to compensate for the main network's lack of low-level details and single-image convolution features:

$V_1^{i}, V_2^{i} = CNN_{SSL}(I_1, I_2)$  (2)

$CNN_{SSL}$ refers to the CNN branch network trained using the SSL strategy. $V_1^{i}, V_2^{i}$ and $I_1, I_2$ represent the encoder feature pair and the original image pair, respectively, throughout this paper. On another parallel path, the features in the first column of the main network are generated:

$C_{0,0},\ S'_{i,j=0} = SwinV2_{backbone}(DConv(Cat(I_1, I_2)))$  (3)

$SwinV2_{backbone}$, $DConv$ and $Cat$ refer to the Swin V2 backbone [46], the double convolution module (DConv), and concatenation along the channel dimension, respectively. $C$ denotes a feature from DConv, and $S'$ denotes a column-1 feature from the Swin V2 backbone. These features are then processed by MFP, which provides inter-layer multi-scale interaction information and intra-layer multi-scale information:

$S_{1,0}, S_{2,0}, S_{3,0} = MFP(S'_{1,0}, S'_{2,0}, S'_{3,0}); \quad S_{4,0} = S'_{4,0}$  (4)
Input: image 1 $I_1$, image 2 $I_2$.
Output: change map $CM$.
1: Initialization: set parameters $i$, $j$ satisfying Eq. (1).
2: Acquisition of encoder features for the CNN branch network: update $V_1^{i}$, $V_2^{i}$ using $I_1$, $I_2$ via Eq. (2).
3: Acquisition of features in column 1 for the main network: update $C_{0,0}$, $S'_{i,j=0}$ using $I_1$, $I_2$ via Eq. (3).
4: Estimation of MFP features: update $S_{1,0}$, $S_{2,0}$, $S_{3,0}$ and $S_{4,0}$ by solving Eq. (4).
5: Acquisition of Swin V2 and FF features in rows 1 to 4 for the main network:
6: for $j = 0$ to $3$ do
7:     if $j == 1$ then compute $S_{i,1}$ from $S_{i,0}$, $FF_{i+1,0}$ via Eq. (5);
8:     else if $j == 2$ then compute $S_{i,2}$ from $S_{i,0}$, $S_{i,1}$, $FF_{i+1,1}$ via Eq. (6);
9:     else if $j == 3$ then compute $S_{i,3}$ from $S_{i,0}$, $S_{i,1}$, $S_{i,2}$ and $FF_{i+1,2}$ via Eq. (7);
10:    end if
11:    update $FF_{i,j}$ using $S_{i,j}$, $V_1^{i}$, $V_2^{i}$ via Eq. (8).
12: end for
13: Estimation of other DConv features: compute $C_{0,1}$, $C_{0,2}$, $C_{0,3}$ and $C_{0,4}$ by solving Eq. (9).
14: Generation of change map: update $CM$ via Eq. (10).
Algorithm 1: Inference of the SwinV2DNet Model for Change Detection

$S$ refers to a feature produced by MFP or by a Swin V2 block.

At this point, the rest of Swin V2 and FF features in rows 1 to 4 of main network can be generated:

$S_{i,1} = SwinV2(Conv(Cat(S_{i,0}, Up(FF_{i+1,0}))))$  (5)
$S_{i,2} = SwinV2(Conv(Cat(S_{i,0}, S_{i,1}, Up(FF_{i+1,1}))))$  (6)
$S_{i,3} = SwinV2(Conv(Cat(S_{i,0}, S_{i,1}, S_{i,2}, Up(FF_{i+1,2}))))$  (7)
$FF_{i,j} = FF(S_{i,j}, V_1^{i}, V_2^{i})$  (8)

$FF(\cdot)$ denotes the feature fusion module (FF), and $FF_{i,j}$ denotes its output features. $SwinV2$ denotes a Swin V2 block. $Conv$ and $Up$ adjust the spatial and channel resolutions of the features. The remaining DConv features are then acquired:

$\begin{array}{l} C_{0,1} = DConv(Cat(C_{0,0}, FF_{1,0})) \\ C_{0,2} = DConv(Cat(C_{0,0}, C_{0,1}, FF_{1,1})) \\ C_{0,3} = DConv(Cat(C_{0,0}, C_{0,1}, C_{0,2}, FF_{1,2})) \\ C_{0,4} = DConv(Cat(C_{0,0}, C_{0,1}, C_{0,2}, C_{0,3}, FF_{1,3})) \end{array}$  (9)

Finally, we generate the change map $CM$ via the CBAM module [33] and a sigmoid function:

$CM = Sig(Conv_{1\times 1}(CBAM(Cat(C_{0,1}, C_{0,2}, C_{0,3}, C_{0,4}))))$  (10)

It should also be noted that dense connectivity allows the network to provide more diverse features, and deep supervision solves the problem of feature shift in a large network during training. Together, these two techniques allow the network to train stably and converge quickly. FF concatenates the heterogeneous features, nonlinearizes them, and finally uses CBAM to fuse them effectively. DConv is the double union of convolution, batch normalization (BN), and ReLU.
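To make these building blocks concrete, the snippet below gives a minimal PyTorch sketch of DConv, a compact CBAM, and the FF module; the channel widths, kernel sizes, and CBAM internals are assumptions based on common reference formulations rather than the authors' released code. In SwinV2DNet, FF would fuse each Swin V2 feature $S_{i,j}$ with the VGG16 pair $V_1^{i}$, $V_2^{i}$ as in Eq. (8).

```python
import torch
import torch.nn as nn

class DConv(nn.Module):
    """Double union of Conv-BN-ReLU, as described for the C_{0,j} nodes."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class CBAM(nn.Module):
    """Compact channel + spatial attention following the standard CBAM formulation."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(ch // r, ch, 1))
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)
    def forward(self, x):
        # channel attention from average- and max-pooled descriptors
        ca = torch.sigmoid(self.mlp(x.mean((2, 3), keepdim=True)) +
                           self.mlp(x.amax((2, 3), keepdim=True)))
        x = x * ca
        # spatial attention from channel-wise average and max maps
        sa = torch.sigmoid(self.spatial(
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sa

class FF(nn.Module):
    """Feature fusion: concatenate S_{i,j} with V1^i and V2^i, nonlinearize, then apply CBAM."""
    def __init__(self, s_ch, v_ch, out_ch):
        super().__init__()
        self.proj = nn.Sequential(nn.Conv2d(s_ch + 2 * v_ch, out_ch, 3, padding=1),
                                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.cbam = CBAM(out_ch)
    def forward(self, s, v1, v2):
        return self.cbam(self.proj(torch.cat([s, v1, v2], dim=1)))
```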

To summarize, the superior detection performance of SwinV2DNet, a parallel compounded architecture, can be attributed to the following reasons. Firstly, the main network consisting of Swin V2 blocks is responsible for extracting the change relationship features, which are crucial for the change detection task. Secondly, the CNN branch network complements the main network by providing the low-level pre-changed and post-changed features necessary for accurate detection. Moreover, the incorporation of the multi-scale features and SSL semantic features further enhances the overall feature learning of the network. These combined factors contribute to SwinV2DNet achieving the SOTA detection performance.

III-B Swin Transformer V2 Block

Swin-type transformers greatly reduce the computational cost of modeling global information through the shifted window and hierarchical mechanisms [45]. Swin V2 [46] further employs post-normalization and scaled cosine attention to improve the stability of the large vision model. At the same time, the log-spaced continuous position bias method is used to alleviate the problem of transferring a model trained on low-resolution images to high-resolution images. So we use Swin V2 as the base block to build the main network. It is shown in Fig. 3.

Refer to caption
Figure 3: Swin V2 block. The improvements of Swin V2 compared to Swin V1 are marked in red.
Refer to caption
Figure 4: (a) Mixed feature pyramid. (b) Feature pyramid transformer. Mixed feature pyramid is an improvement based on feature pyramid transformer. (c) SK-Conv. (d) Res2Net-Conv. Both SK-Conv and Res2Net-Conv are the components of mixed feature pyramid.

Swin V2 splits the image into patches and then models the patches to generate features with global information. The blocks generally come in pairs, whose input and output are $S^{l-1}$ and $S^{l+1}$, respectively. Layer normalization (LN) is placed after multi-head self-attention (MSA) or the multilayer perceptron (MLP) to improve the stability of the network. The MLP enhances the nonlinear capacity of the block. MSA includes window MSA (WMSA) and shifted window MSA (SWMSA): the former extracts image features within windows to reduce the computational cost, and the latter maintains the interaction of global information by sliding the windows. The residual connections ensure the effective propagation of information in the large network. The specific cooperation of these components is as follows:

$\begin{array}{l} \hat{S}^{l} = LN(WMSA(S^{l-1})) + S^{l-1} \\ S^{l} = LN(MLP(\hat{S}^{l})) + \hat{S}^{l} \\ \hat{S}^{l+1} = LN(SWMSA(S^{l})) + S^{l} \\ S^{l+1} = LN(MLP(\hat{S}^{l+1})) + \hat{S}^{l+1} \end{array}$  (11)

Here, $S$ is the image feature and $l$ indexes the Swin V2 block layers.
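A minimal sketch of a post-normalized block consistent with Eq. (11) is given below; the attention module is injected as a constructor argument (for instance, the window scaled cosine attention sketched later in this subsection), and the cyclic window shifting and masking of SWMSA are omitted for brevity, so this is an illustrative reading of the equations rather than the full Swin V2 implementation.

```python
import torch
import torch.nn as nn

class SwinV2Block(nn.Module):
    """Post-normalized Swin V2 block: each residual branch is normalized *after*
    attention/MLP, matching Eq. (11)."""
    def __init__(self, dim, attn, mlp_ratio=4):
        super().__init__()
        self.attn = attn                       # (shifted-)window attention module
        self.norm1 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                      # x: B x N x C tokens of one window
        x = x + self.norm1(self.attn(x))       # S_hat^l = LN(WMSA(S^{l-1})) + S^{l-1}
        x = x + self.norm2(self.mlp(x))        # S^l     = LN(MLP(S_hat^l)) + S_hat^l
        return x
```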

Attention is the core of the whole block. The image features are first mapped into three vectors: query (Q), key (K), and value (V). Then the Sim function (namely Eq. (14)) computes the correlation weights between Q and K, and these weights are normalized by softmax. Finally, the dot product between the weights and V forms the self-attention features:

$Attention(Q,K,V) = Softmax(Sim(Q,K))V$  (12)

where $Q$, $K$ and $V$ are $N \times d$ matrices, and $N$ and $d$ respectively denote the number of patches and the dimension of single-head self-attention. Instead of applying single-head self-attention, the MSA of Swin V2 computes each head separately and concatenates the heads, which represent different subspaces:

$MSA(Q,K,V) = Cat(head_1, \ldots, head_H)W^{O}, \quad \text{where } head_p = Attention(QW_p^{Q}, KW_p^{K}, VW_p^{V})$  (13)

where $W_p^{Q} \in \mathbb{R}^{D \times d_k}$, $W_p^{K} \in \mathbb{R}^{D \times d_k}$, $W_p^{V} \in \mathbb{R}^{D \times d_v}$, and $W^{O} \in \mathbb{R}^{Hd_v \times D}$ are parameter matrices. $H$ and $D$ respectively denote the number of self-attention heads and the dimension of the embedding layer. We set $d_k = d_v = D/H$.

Swin V2 computes the attention logit of a pixel pair $m$ and $n$ by a scaled cosine function:

$Sim(q_m, k_n) = \cos(q_m, k_n)/r + B_{mn}$  (14)

Here, $B_{mn}$ is the relative position bias between pixels $m$ and $n$, and $r$ is a learnable scalar, not shared across heads and layers. Furthermore, Swin V2 uses log-spaced coordinates instead of the original linear-spaced ones to facilitate transferring the model between images of different resolutions. The log-spaced coordinates are then used as the input of $\Phi$ to generate the bias values:

$\widehat{\Delta x} = sign(\Delta x) \cdot \log(1 + |\Delta x|), \quad \widehat{\Delta y} = sign(\Delta y) \cdot \log(1 + |\Delta y|)$  (15)
$B(\Delta x, \Delta y) = \Phi(\widehat{\Delta x}, \widehat{\Delta y})$  (16)

$\Phi$ is a 2-layer MLP with a ReLU activation. $\Delta x$, $\Delta y$ and $\widehat{\Delta x}$, $\widehat{\Delta y}$ are the linear-spaced and log-spaced coordinates, respectively.
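The sketch below puts Eqs. (14)-(16) together for a single window; the window size, the 256-dimensional hidden layer of $\Phi$, and the clamping of the learnable scale are assumptions that follow the public Swin V2 description rather than the exact settings used in this paper.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineAttention(nn.Module):
    """Multi-head scaled cosine attention with a log-spaced continuous position
    bias for one non-shifted window (a sketch of Eqs. (14)-(16))."""
    def __init__(self, dim, num_heads, window_size=8):
        super().__init__()
        self.num_heads, self.ws = num_heads, window_size
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # learnable per-head scale playing the role of 1/r in Eq. (14)
        self.logit_scale = nn.Parameter(torch.log(10 * torch.ones(num_heads, 1, 1)))
        # Phi: 2-layer MLP mapping log-spaced relative coordinates to per-head biases
        self.cpb_mlp = nn.Sequential(nn.Linear(2, 256), nn.ReLU(inplace=True),
                                     nn.Linear(256, num_heads))
        # relative coordinates of every pixel pair inside one window, Eq. (15)
        coords = torch.stack(torch.meshgrid(
            torch.arange(window_size), torch.arange(window_size), indexing="ij"))
        coords = coords.flatten(1)                                  # 2 x N
        rel = (coords[:, :, None] - coords[:, None, :]).permute(1, 2, 0).float()
        rel = torch.sign(rel) * torch.log1p(rel.abs())              # log-spaced coords
        self.register_buffer("rel_coords", rel)                     # N x N x 2

    def forward(self, x):                       # x: B x N x C, with N = window_size**2
        B, N, C = x.shape
        q, k, v = self.qkv(x).reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
        # cosine similarity of queries and keys, scaled by the learnable factor
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        attn = attn * self.logit_scale.clamp(max=math.log(100.0)).exp()
        bias = self.cpb_mlp(self.rel_coords).permute(2, 0, 1)       # heads x N x N, Eq. (16)
        attn = (attn + bias.unsqueeze(0)).softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```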

III-C Mixed Feature Pyramid

Refer to caption
Figure 5: CNN branch network and SSL strategy.

FPT utilizes different transformers to provide the self-attention features of different layers and the interaction features between layers [56]. The self-transformer provides self-attention features for three layers, the rendering transformer provides a bottom-up interaction feature between layers, and the grounding transformer provides a top-down interaction feature. Inspired by FPT and the requirements of the task at hand, we construct a novel multi-scale module, MFP, by improving FPT.

Since the basic blocks of current networks have evolved from convolutions to transformers, we remove the self-transformer. Moreover, we delete the rendering transformer because it experimentally plays no role in VHRRS image change detection. Thus we only keep the grounding transformer to provide the interaction information between layers. However, in addition to multi-scale features of different layers and multi-scale interaction features between layers, multi-scale information also includes multi-scale feature modeling within layers. So we use SK-Conv [37] and Res2Net-Conv [36] to provide the multi-scale features inside different layers. Of course, other multi-scale convolutions, for example, inception or dilated convolution, could be used here; we only use SK-Conv and Res2Net-Conv as examples to experimentally validate our idea. MFP, FPT, SK-Conv and Res2Net-Conv are shown in Fig. 4.

Firstly, SK-Conv and Res2Net-Conv are separately used to mine the intra-layer multi-scale information of the input features $a_f$, $b_f$, and $c_f$. At the same time, the grounding transformer provides the multi-scale interaction features between different layers. These features are then rearranged according to their spatial resolutions. After residual connections are added to the different layers, feature fusion is performed. Finally, using convolution, CBAM, and dropout in turn, we obtain the output features with intra-layer multi-scale information and inter-layer multi-scale interaction information. CBAM enhances the effective fusion of different features, and dropout prevents overfitting. Since the spatial and spectral resolutions of the input features are consistent with those of the output, MFP is a plug-and-play module. MFP has also been shown to be effective for a variety of VHRRS image change detection models.

Finally, we introduce the three multi-scale feature extractors used. The core operation of grounding transformer (GT) is:

$GT(Q_g, K_g, V_g) = Softmax(Q_g K_g^{T})V_g$  (17)

Here, $Q_g$, $K_g$ and $V_g$ are nonlinear transformations of $X_g$, and $K_g^{T}$ is the transpose of $K_g$. We define $X_g$, $\bar{X}_g$ and $\tilde{X}_g$ as the input feature, intermediate variable, and output feature of GT, respectively. Taking the layers $a_m$ and $b_m$ in Fig. 4 as an example, the complete operation of GT is:

$\begin{array}{l} \bar{X}_g^{a} = Up(BN(Conv(X_g^{a}))) \\ \bar{X}_g^{b} = BN(Conv(X_g^{b})) \\ \tilde{X}_g^{ab} = GT(Cat(\bar{X}_g^{a}, \bar{X}_g^{b})) \end{array}$  (18)
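A possible reading of Eqs. (17)-(18) in PyTorch is sketched below; the channel widths, the 1×1 projections used to form $Q_g$, $K_g$, $V_g$, and the bilinear upsampling are assumptions, and the quadratic attention over all spatial positions is kept only for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundingTransformer(nn.Module):
    """Top-down grounding transformer: a coarser feature map is projected, upsampled
    to the finer grid, concatenated with the finer map, and mixed by dot-product attention."""
    def __init__(self, high_ch, low_ch, dim):
        super().__init__()
        self.high = nn.Sequential(nn.Conv2d(high_ch, dim, 1), nn.BatchNorm2d(dim))  # Conv+BN of Eq. (18)
        self.low = nn.Sequential(nn.Conv2d(low_ch, dim, 1), nn.BatchNorm2d(dim))
        self.q = nn.Conv2d(2 * dim, dim, 1)     # projections producing Q_g, K_g, V_g (assumed 1x1 convs)
        self.k = nn.Conv2d(2 * dim, dim, 1)
        self.v = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x_high, x_low):
        # Eq. (18): bring the coarser layer to the spatial size of the finer one
        xh = F.interpolate(self.high(x_high), size=x_low.shape[-2:],
                           mode="bilinear", align_corners=False)
        xg = torch.cat([xh, self.low(x_low)], dim=1)
        b, _, h, w = xg.shape
        q = self.q(xg).flatten(2).transpose(1, 2)      # B x HW x dim
        k = self.k(xg).flatten(2)                      # B x dim x HW
        v = self.v(xg).flatten(2).transpose(1, 2)      # B x HW x dim
        attn = torch.softmax(q @ k, dim=-1)            # Eq. (17): Softmax(Q_g K_g^T)
        out = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return out                                     # tilde{X}_g^{ab}
```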

Both SK-Conv and Res2Net-Conv are multi-scale convolutions. SK-Conv obtains the features $U_s$ and $V_s$ through a 3×3 convolution and a 5×5 convolution, respectively. A joint SE attention is then applied to $U_s$ and $V_s$ to obtain $\bar{U}_s$ and $\bar{V}_s$. In the end, adding $\bar{U}_s$ and $\bar{V}_s$ gives the SK-Conv feature $\tilde{X}_s$. The multi-scale component of Res2Net-Conv can be expressed as:

$y_r^{z} = \begin{cases} x_r^{z} & z = 1 \\ Conv(x_r^{z}) & z = 2 \\ Conv(x_r^{z} + y_r^{z-1}) & 2 < z \le 4 \end{cases}$  (19)

We pass the input $X_r$ through a 1×1 convolution, Eq. (19), and another 1×1 convolution in turn to get the Res2Net-Conv feature $\tilde{X}_r$.
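The Res2Net-Conv component of Eq. (19) can be sketched as follows; the four-way channel split and the surrounding 1×1 projections follow the description above, while the specific channel counts are assumptions.

```python
import torch
import torch.nn as nn

class Res2NetConv(nn.Module):
    """Multi-scale Res2Net-style unit: the input is reduced by a 1x1 convolution,
    split into four channel groups processed as in Eq. (19), then expanded back."""
    def __init__(self, in_ch, mid_ch=64, scales=4):
        super().__init__()
        assert mid_ch % scales == 0
        self.scales, w = scales, mid_ch // scales
        self.reduce = nn.Conv2d(in_ch, mid_ch, 1)
        self.convs = nn.ModuleList([nn.Conv2d(w, w, 3, padding=1) for _ in range(scales - 1)])
        self.expand = nn.Conv2d(mid_ch, in_ch, 1)

    def forward(self, x):
        xs = torch.chunk(self.reduce(x), self.scales, dim=1)   # x_r^1 .. x_r^4
        ys = [xs[0]]                                            # y^1 = x^1
        for z in range(1, self.scales):
            inp = xs[z] if z == 1 else xs[z] + ys[-1]           # y^z = Conv(x^z [+ y^{z-1}])
            ys.append(self.convs[z - 1](inp))
        return self.expand(torch.cat(ys, dim=1))                # tilde{X}_r
```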

III-D CNN Branch Network and SSL Strategy

To solve the problem that the VGG16 encoder [64] alone is untrainable, we design a new decoder and impose the SSL strategy [40] to optimize the low-level features of the VGG16 encoder. The detailed implementation is shown in Fig. 5. We compose the decoder base block of convolution, BN, and ReLU. The five base blocks are arranged in order of increasing spatial resolution. Furthermore, the decoder features and the corresponding encoder features are fused using upsampling and addition. Finally, a 1×1 convolution outputs the probability map of change detection.
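A minimal sketch of this branch, assuming a torchvision VGG16 split into its five resolution stages and a simple channel plan for the decoder blocks (not the authors' exact design), is given below; the returned encoder features correspond to $V^{i}$ and the sigmoid map to $PM$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class CNNBranch(nn.Module):
    """VGG16 encoder plus a light decoder of five Conv-BN-ReLU blocks, fused with the
    encoder features by upsampling and addition; a 1x1 head outputs the probability map."""
    def __init__(self):
        super().__init__()
        chs = (64, 128, 256, 512, 512)                      # VGG16 stage widths
        feats = vgg16().features                            # pretrained weights could be loaded here
        self.stages = nn.ModuleList([feats[:4], feats[4:9], feats[9:16],
                                     feats[16:23], feats[23:30]])
        # decoder base blocks, coarse to fine (channel plan is an assumption)
        self.dec = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                          nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))
            for c_in, c_out in zip(chs[::-1], chs[::-1][1:] + (chs[0],))])
        self.head = nn.Conv2d(chs[0], 1, 1)

    def forward(self, x):
        enc = []
        for stage in self.stages:                           # encoder features V^1 .. V^5
            x = stage(x)
            enc.append(x)
        d = enc[-1]
        for i, block in enumerate(self.dec):                # fuse by upsampling + addition
            d = block(d)
            if i < len(enc) - 1:
                skip = enc[len(enc) - 2 - i]
                d = F.interpolate(d, size=skip.shape[-2:], mode="bilinear",
                                  align_corners=False) + skip
        return enc, torch.sigmoid(self.head(d))             # (V^1..V^5, probability map)
```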

For the SSL strategy, we first generate the pseudo-labels from the probability maps produced by the CNN branch network:

$PL_{1,u} = \begin{cases} 0 & PM_{1,u} < 0.5 \\ 1 & PM_{1,u} \ge 0.5 \end{cases}, \quad PL_{2,u} = \begin{cases} 0 & PM_{2,u} < 0.5 \\ 1 & PM_{2,u} \ge 0.5 \end{cases}$  (20)

Here, $u$ refers to each pixel of the probability maps or pseudo-labels. $PM_{1,u}$ and $PM_{2,u}$ are the two probability maps output by the branch network. Since $PM_{1,u}$ and $PM_{2,u}$ are normalized to [0,1] by sigmoid, we can obtain the pseudo-labels $PL_{1,u}$ and $PL_{2,u}$ via Eq. (20). After that, we determine the changed and unchanged regions based on the labels from the change detection datasets. In the unchanged regions, we use the pseudo-labels generated by one branch to supervise the other branch. In the changed regions, we use the inverted pseudo-labels from one branch to supervise the other branch. The goal is to keep the unchanged features as close as possible and push the changed features as far apart as possible. Therefore, the encoder features provided by the CNN branch network are rich in semantic information and more discriminative for change detection. The SSL-CD loss is designed as follows:

$\begin{array}{l} L_{SSL1} = F(PM_{1,u}, PL_{2,u} \mid u \in U_{\alpha}) + F(PM_{1,u}, 1 - PL_{2,u} \mid u \in C_{\alpha}) \\ L_{SSL2} = F(PM_{2,u}, PL_{1,u} \mid u \in U_{\alpha}) + F(PM_{2,u}, 1 - PL_{1,u} \mid u \in C_{\alpha}) \end{array}$  (21)

where $U_{\alpha}$ and $C_{\alpha}$ represent the unchanged and changed regions, respectively. $L_{SSL1}$ and $L_{SSL2}$ are the two SSL-CD losses. $F(\cdot)$ is a metric function, for which we choose $L_{BCE}$.
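The following sketch implements Eqs. (20)-(21) with BCE as the metric function; detaching the pseudo-labels and averaging the masked losses are assumptions about details the paper does not spell out.

```python
import torch
import torch.nn.functional as F

def ssl_cd_loss(pm1, pm2, change_label):
    """SSL-CD losses: pm1/pm2 are the two sigmoid probability maps of the CNN branch,
    change_label marks changed pixels with 1 and unchanged pixels with 0."""
    pl1 = (pm1.detach() >= 0.5).float()          # pseudo-labels, Eq. (20)
    pl2 = (pm2.detach() >= 0.5).float()
    unchanged = (change_label < 0.5).float()     # U_alpha mask
    changed = 1.0 - unchanged                    # C_alpha mask
    # unchanged pixels: pull the two branches together; changed pixels: push them apart
    l1 = (F.binary_cross_entropy(pm1, pl2, reduction="none") * unchanged +
          F.binary_cross_entropy(pm1, 1.0 - pl2, reduction="none") * changed).mean()
    l2 = (F.binary_cross_entropy(pm2, pl1, reduction="none") * unchanged +
          F.binary_cross_entropy(pm2, 1.0 - pl1, reduction="none") * changed).mean()
    return l1, l2
```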

III-E Loss Function

First, bi-temporal change detection is fundamentally a binary classification task, so binary cross entropy (BCE) loss is usually used:

$L_{BCE} = -(t\log(\hat{t}) + (1-t)\log(1-\hat{t}))$  (22)

In this paper, $t$ and $\hat{t}$ denote the label and the predicted change confidence at the corresponding position, respectively. For change detection tasks, however, the changed regions are far fewer than the unchanged regions, so there is a serious class imbalance problem. For example, the ratio of changed pixels to unchanged pixels in CDD is 0.147 [65]. To mitigate this problem, the Dice loss is often used:

$L_{Dice} = 1 - \dfrac{2\hat{t}t + \sigma}{\hat{t} + t + \sigma}$  (23)

Here, adding $\sigma$ avoids the case where the denominator is zero.

The loss function used by our model is a combination of the BCE and Dice losses. Moreover, to address the feature shift during the training of the large network, we use the deep supervision strategy. Specifically, deep supervision uses the same labels as those used for the network output: the labels are replicated in four copies for the four deep supervision interfaces, and the DConv nodes $C_{0,1}$, $C_{0,2}$, $C_{0,3}$, $C_{0,4}$ output probability maps through convolution and a sigmoid function. Finally, $\sum_{m=1}^{4}(L_{BCE}^{m} + \lambda_{1}L_{Dice}^{m})$ is used to measure and optimize these probability maps against the deep supervision labels. The total loss function is expressed as follows:

$L_{Total} = \sum\limits_{m=1}^{5}\left(L_{BCE}^{m} + \lambda_{1}L_{Dice}^{m}\right) + \lambda_{2}L_{SSL1} + \lambda_{2}L_{SSL2}$  (24)

where $\sum_{m=1}^{5}(L_{BCE}^{m} + \lambda_{1}L_{Dice}^{m})$ represents the output loss and the four deep supervision losses of the main network. $L_{SSL1}$ and $L_{SSL2}$ are the self-supervision losses of the CNN branch network. $\lambda_{1}$ and $\lambda_{2}$ are weight coefficients; $\lambda_{1}$ is set to 0.5, and $\lambda_{2}$ to 0.25.
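Under the stated weights ($\lambda_1 = 0.5$, $\lambda_2 = 0.25$), the total objective of Eqs. (22)-(24) can be sketched as below; the per-sample Dice reduction and the handling of the deep supervision outputs as a plain Python list are assumptions.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, sigma=1.0):
    """Soft Dice loss of Eq. (23); sigma keeps the denominator away from zero."""
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + sigma) / (union + sigma)).mean()

def total_loss(outputs, target, l_ssl1, l_ssl2, lam1=0.5, lam2=0.25):
    """Eq. (24): the final change map plus the four deep-supervision outputs all share
    the same label; `outputs` is a list of five sigmoid probability maps."""
    loss = 0.0
    for pred in outputs:                                           # m = 1..5
        loss = loss + F.binary_cross_entropy(pred, target) + lam1 * dice_loss(pred, target)
    return loss + lam2 * l_ssl1 + lam2 * l_ssl2
```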

IV Experimental results and analysis

IV-A Experimental Configurations

IV-A1 Datasets

Our model is tested on four publicly available change detection datasets, achieving SOTA results. Due to GPU memory limitations, we crop the images into non-overlapping patches of size 256×256 for all four datasets.

  • (1)

    Learning, vision, and remote sensing change detection dataset (LEVIR-CD) [66] is a publicly available change detection resource for large buildings. It comprises 637 pairs of high-resolution (0.5 m) remote sensing images, each measuring 1024×1024 pixels. We routinely crop these images and then obtain 7120/1024/2048 pairs as the training/validation/test data.

  • (2)

    Change detection dataset (CDD) [67] contains 11 pairs of multi-spectral images obtained from Google Earth with spatial resolutions ranging from 0.03 to 1 m. Following the dataset partitioning, we apply the data augmentation methods, image rotation and image flipping, to the training set. As a result, we obtain a total of 60000/3000/3000 training/validation/test pairs.

  • (3)

    Wuhan university change detection dataset (WHU-CD) [68] focuses on the buildings change detection. It features a pair of high-resolution (0.075 m) aerial images, each measuring 32507×15354 pixels. Since there is not a general data partitioning scheme for WHU-CD, we cut these images into the non-overlapping segments of size 256×256 and randomly divide them into 6096/764/764 pairs for the training/validation/testing sessions, respectively.

  • (4)

    Sun Yat-sen university change detection dataset (SYSU-CD) [27] contains 20000 pairs of orthographic aerial images with the spatial resolution of 0.5 m taken in Hong Kong. Each image is 256×256 pixels. We use the 12000/4000/4000 training/validation/testing pairs based on the dataset provider splitting. It is worth noting that SYSU-CD presents multiple types of changed objects in the more complex scenario, making it a particularly challenging dataset.

IV-A2 Baseline and State-of-the-art Methods

We compare SwinV2DNet with the baseline and state-of-the-art methods listed below. The first three serve as baselines, while the last seven represent advanced networks developed over the past three years. We implement these change detection networks using their publicly available code and default hyperparameters.

TABLE I: The comparison results on the three change detection datasets. The best values are highlighted in bold font. All results are expressed as percentages (%).
Method LEVIR-CD CDD WHU-CD
Pre. / Rec. / F1 / IoU / OA Pre. / Rec. / F1 / IoU / OA Pre. / Rec. / F1 / IoU / OA
FC-EF 86.16 / 86.20 / 86.18 / 76.16 / 98.59 85.35 / 77.56 / 81.27 / 42.14 / 95.59 86.13 / 86.01 / 86.07 / 75.67 / 98.82
FC-Siam-Diff 90.36 / 84.81 / 87.50 / 81.06 / 98.77 92.28 / 78.70 / 84.95 / 48.87 / 96.56 81.40 / 89.11 / 85.08 / 71.79 / 98.68
FC-Siam-Conc 87.30 / 87.81 / 87.55 / 75.09 / 98.73 92.04 / 81.94 / 86.70 / 52.95 / 96.90 79.98 / 90.94 / 85.11 / 66.25 / 98.65
IFNet 93.73 / 87.31 / 90.40 / 84.04 / 99.06 97.71 / 93.64 / 95.63 / 81.31 / 98.94 98.51 / 82.46 / 89.77 / 90.28 / 99.21
SNUNet-CD 91.00 / 88.30 / 89.63 / 79.18 / 98.96 98.13 / 97.62 / 97.87 / 88.24 / 99.48 88.50 / 90.31 / 89.40 / 73.50 / 99.09
BIT 91.81 / 88.00 / 89.86 / 79.35 / 98.99 97.07 / 96.43 / 96.75 / 83.52 / 99.20 92.10 / 92.41 / 92.26 / 78.56 / 99.34
DCFF-Net 92.96 / 89.83 / 91.37 / 82.80 / 99.14 98.75 / 98.94 / 98.84 / 92.79 / 99.71 96.51 / 93.11 / 94.78 / 88.55 / 99.57
TransUNetCD 90.62 / 88.44 / 89.52 / 79.63 / 98.94 97.44 / 96.52 / 96.98 / 84.81 / 99.26 94.72 / 91.21 / 92.93 / 83.82 / 99.41
ICIF-Net 92.23 / 88.53 / 90.34 / 81.54 / 99.04 97.61 / 97.04 / 97.32 / 85.98 / 99.34 93.61 / 89.69 / 91.61 / 82.61 / 99.31
FCCDN 92.10 / 84.86 / 88.33 / 80.48 / 98.86 95.96 / 95.56 / 95.76 / 79.79 / 98.96 92.65 / 90.44 / 91.53 / 79.94 / 99.29
Ours 92.98 / 91.33 / 92.15 / 85.44 / 99.21 99.18 / 99.12 / 99.15 / 94.02 / 99.79 96.75 / 94.65 / 95.69 / 90.34 / 99.64
  • (1)

    FC-EF [18]: bi-temporal change detection images are concatenated as a single input to FCN.

  • (2)

    FC-Siam-Diff [18]: a siamese FCN is employed to extract the multi-level features, utilizing the differences of these features to detect the changed information.

  • (3)

    FC-Siam-Conc [18]: the multi-level features are extracted and fused using a siamese FCN with the cascaded architecture.

  • (4)

    IFNet [25]: CBAM is applied to the heterogeneous features at each level of the cascaded decoder, and deep supervision is used for the improved training of intermediate layers.

  • (5)

    SNUNet-CD [28]: a combination of siamese structure and UNet++ is utilized to extract the high-level features.

  • (6)

    BIT [26]: a serial hybrid network that embeds transformer into ResNet.

  • (7)

    DCFF-Net [69]: a parallel pure CNN that combines VGG16 with UNet++, and integrates CBAM and deep supervision.

  • (8)

    TransUNetCD [51]: a serial cascaded hybrid network that embeds transformer into UNet.

  • (9)

    ICIF-Net [52]: a parallel hybrid network focusing on the interaction and fusion of transformer and CNN.

  • (10)

    FCCDN [40]: the supervised model with self-supervised strategy, producing the features rich in semantic information.

IV-A3 Implementation Details

We implement our model using PyTorch and train it on a single NVIDIA GeForce RTX 3090 GPU. During training, we optimize the model with the Adam optimizer. The batch size is set to 4. The learning rate is initially set to $5\times10^{-5}$ and linearly decays to 0 over the course of 200 epochs.
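A minimal training-setup sketch matching these hyperparameters is shown below; `model`, `train_loader`, and the assumed output signature of the forward pass are placeholders, and `ssl_cd_loss`/`total_loss` refer to the loss sketches given in Section III.

```python
import torch

# Assumptions: `model` returns (change_map, list_of_4_aux_maps, pm1, pm2) and
# `train_loader` yields (img1, img2, label) batches of size 4.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
epochs = 200
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda e: 1.0 - e / epochs)    # linear decay of lr to 0

for epoch in range(epochs):
    for img1, img2, label in train_loader:
        optimizer.zero_grad()
        change_map, aux_outputs, pm1, pm2 = model(img1, img2)
        l1, l2 = ssl_cd_loss(pm1, pm2, label)           # Eq. (21)
        loss = total_loss([change_map, *aux_outputs], label, l1, l2)  # Eq. (24)
        loss.backward()
        optimizer.step()
    scheduler.step()
```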

IV-A4 Evaluation Metrics

The F1-score is an index that measures the performance of binary classification models, taking into account both precision and recall, so we primarily employ the F1-score of the change category as the main evaluation metric. The F1-score is defined as:

$F1 = \dfrac{2 \times TP}{2 \times TP + FP + FN}$  (25)

In addition, we also report precision, recall, intersection over union (IoU) for the change category, and overall accuracy (OA). These metrics are defined as follows:

$\mathrm{precision} = \dfrac{TP}{TP + FP}$  (26)
$\mathrm{recall} = \dfrac{TP}{TP + FN}$  (27)
$\mathrm{IoU} = \dfrac{TP}{TP + FP + FN}$  (28)
$\mathrm{OA} = \dfrac{TP + TN}{TP + TN + FP + FN}$  (29)

Here, TP, TN, FP, and FN represent the number of true positive, true negative, false positive, and false negative, respectively.
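For reference, Eqs. (25)-(29) can be computed from binary prediction and label maps as in the sketch below; the small epsilon guarding against empty classes is an assumption.

```python
import numpy as np

def change_metrics(pred, label, eps=1e-10):
    """Precision, recall, F1, IoU and OA for the change class,
    given binary numpy maps of 0/1."""
    pred, label = pred.astype(bool), label.astype(bool)
    tp = np.logical_and(pred, label).sum()
    tn = np.logical_and(~pred, ~label).sum()
    fp = np.logical_and(pred, ~label).sum()
    fn = np.logical_and(~pred, label).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    oa = (tp + tn) / (tp + tn + fp + fn + eps)
    return precision, recall, f1, iou, oa
```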

IV-B Experimental Results

IV-B1 Results Analysis and Comparison

TABLE II: The comparison results on the SYSU-CD dataset. The best values are highlighted in bold font. All results are expressed as percentages (%).
Method SYSU-CD
Pre. / Rec. / F1 / IoU / OA
FC-EF 79.30 / 68.84 / 73.70 / 44.64 / 88.41
FC-Siam-Diff 89.80 / 58.49 / 70.84 / 42.37 / 88.64
FC-Siam-Conc 82.31 / 73.52 / 77.67 / 50.33 / 90.03
IFNet 85.16 / 75.36 / 79.96 / 57.22 / 91.09
SNUNet-CD 80.03 / 76.62 / 78.29 / 52.99 / 89.98
BIT 81.67 / 76.52 / 79.01 / 52.41 / 90.41
DCFF-Net 78.71 / 86.24 / 82.30 / 62.05 / 91.25
TransUNetCD 77.25 / 80.17 / 78.68 / 54.72 / 89.75
ICIF-Net 78.53 / 78.89 / 78.71 / 53.80 / 89.94
FCCDN 78.57 / 78.14 / 78.36 / 53.33 / 89.82
Ours 83.46 / 82.81 / 83.13 / 62.90 / 92.08

Tables I and II present the overall comparison results for the test sets: LEVIR-CD, CDD, WHU-CD, and SYSU-CD. Through quantitative analysis, our model has demonstrated the significant improvements over other methods in the three key indicators, F1 score, IoU, and OA, across these datasets. Notably, our method outperforms the recent DCFF-Net by 0.78/0.31/0.91/0.83 in F1 score, underscoring the significance of both global information provided by transformer and local information represented by CNN for change detection. Furthermore, our approach also showcases a performance advantage over the two serial networks, BIT and TransUNetCD, reinforcing the superiority of the parallel combination of transformer and CNN. In summary, our proposed model achieves the SOTA results by leveraging a parallel architecture of transformer and CNN with the multi-scale and self-supervision features that enhance its discrimination capabilities (refer to the analysis about Table III).

Refer to caption
(a)
Refer to caption
(b)
Figure 6: The visualization results of various methods on the LEVIR-CD and CDD test sets. We use different colors to represent TP (white), TN (black), FP (green), and FN (red) in the change maps. (a)-(e) show the prediction results of these methods for different samples, respectively.
Refer to caption
(a)
Refer to caption
(b)
Figure 7: The visualization results of various methods on the WHU-CD and SYSU-CD test sets. We use different colors to represent TP (white), TN (black), FP (green), and FN (red) in the change maps. (a)-(e) show the prediction results of these methods for different samples, respectively.

The comparison of visualization results on the four datasets is shown in Figs. 6 and 7. We use different colors to represent TP (white), TN (black), FP (green), and FN (red) in the change maps. It is observed from (a), (c) of LEVIR-CD and (a), (b), (e) of WHU-CD that our model maintains the best structured state compared to other models. Additionally, our model demonstrates superior performance in detecting dense small objects and edge objects, as evidenced by LEVIR-CD (d) and WHU-CD (d). The CDD dataset poses unique challenges due to its varying illumination conditions and high occurrence of pseudo changes; for example, snow cover is added in image 2 of (a), and illumination differences are visible in the (b)-(e) image pairs. However, our model outperforms other methods in these difficult scenes, particularly for the three detection requirements of structured changed regions, small targets, and edge targets. The challenge of the SYSU-CD dataset also lies in the high intraclass variation and low interclass variance of background and targets; our model is still able to maintain the structure of targets relatively well in scenes where the targets are cluttered and close to the background. These detection results are readily explained: most of the structured regions are obtained from the global information of the transformer, while the description of small targets and edge targets comes from the low-level detail information of the CNN. This once again validates the importance of the parallel dense architecture of transformer and CNN.

IV-B2 Training Processes Analysis

We evaluate the performance of SwinV2DNet on the four datasets by tracking two metrics, F1 score and loss, as depicted in Fig. 8. The F1 score and loss curves provide an intuitive understanding that our model is stably convergent and efficient. The peak values on the F1 curves are observed at 0.9212 (LEVIR-CD), 0.9765 (CDD), 0.9544 (WHU-CD), and 0.8100 (SYSU-CD), indicating that our model requires a training process of just 35 epochs. Given the CDD dataset's abundance of small targets and intricate labelings, SwinV2DNet shows slight further improvement even after 35 epochs. Comparing the datasets horizontally, SYSU-CD appears to be the most challenging and the most prone to rapid overfitting. This could potentially be attributed to the significantly different distributions of the training and validation data, wherein overfitting on the training data impairs the model's generalization capacity. Currently, no relevant studies have been conducted to explain this phenomenon.

TABLE III: The ablation studies for the overall network on the three datasets. We report F1 and OA scores. The base and best results are annotated in blue and red, respectively. All results are expressed as percentages (%).
Swin-V2 / MFP / SSL   LEVIR-CD F1 / OA   CDD F1 / OA   WHU-CD F1 / OA
–  /  –  /  –    91.37 / 99.14    94.81 / 98.73    94.78 / 99.57
✓  /  –  /  –    91.85 / 99.18    97.58 / 99.41    95.34 / 99.61
–  /  ✓  /  –    91.47 / 99.15    96.15 / 99.06    94.80 / 99.57
–  /  –  /  ✓    91.38 / 99.13    96.66 / 99.18    95.25 / 99.60
✓  /  ✓  /  –    91.97 / 99.19    97.66 / 99.43    95.10 / 99.59
✓  /  –  /  ✓    92.00 / 99.19    97.60 / 99.41    95.31 / 99.61
–  /  ✓  /  ✓    91.40 / 99.13    96.69 / 99.19    94.82 / 99.57
✓  /  ✓  /  ✓    92.15 / 99.21    97.71 / 99.44    95.69 / 99.64

The further conjoint analysis with the testing F1 values of 0.9215(LEVIR-CD), 0.9771(CDD), 0.9569(WHU-CD), and 0.8313(SYSU-CD) in Tables I and II reaffirms the strong generalization ability of our model. 0.9771(CDD) reveals the F1 performance for the CDD test set without applying any data augmentations. Based on the compounded features, dense connection and deep supervision in our model greatly contribute to the stability and efficiency of training processes.

Refer to caption
Figure 8: The training processes analysis of SwinV2DNet.

IV-C Ablation Studies and Parameter Analysis

IV-C1 Ablation Study of Overall Network

TABLE IV: The longitudinal dissection (left) and horizontal promotion (right) for MFP on the LEVIR-CD dataset. The base and best results are annotated in blue and red, respectively. All results are expressed as percentages (%).
GT / Res2Net-Conv / SK-Conv   FC-Siam-Diff F1 / OA   Ours F1 / OA
–  /  –  /  –    87.50 / 98.77    91.845 / 99.177
✓  /  –  /  –    87.49 / 98.74    91.955 / 99.192
–  /  ✓  /  –    87.89 / 98.79    92.057 / 99.191
–  /  –  /  ✓    87.66 / 98.74    91.915 / 99.186
✓  /  ✓  /  –    87.78 / 98.78    91.975 / 99.190
✓  /  –  /  ✓    87.50 / 98.75    91.866 / 99.178
–  /  ✓  /  ✓    87.92 / 98.78    91.843 / 99.179
✓  /  ✓  /  ✓    88.20 / 98.81    92.060 / 99.200
Method   MFP   LEVIR-CD Pre. / Rec. / F1 / IoU / OA
FC-Siam-Conc   –   87.30 / 87.81 / 87.55 / 75.09 / 98.73
FC-Siam-Conc   ✓   87.91 / 88.73 / 88.32 / 77.04 / 98.80
IFNet          –   93.73 / 87.31 / 90.40 / 84.04 / 99.06
IFNet          ✓   93.68 / 88.02 / 90.76 / 84.44 / 99.09
SNUNet-CD      –   91.00 / 88.30 / 89.63 / 79.18 / 98.96
SNUNet-CD      ✓   90.35 / 89.07 / 89.70 / 76.64 / 98.96
FCCDN          –   92.10 / 84.86 / 88.33 / 80.48 / 98.86
FCCDN          ✓   89.95 / 87.66 / 88.79 / 79.91 / 98.87

In the overall network architecture, our contributions consist of three parts: the Swin V2 main network, MFP, and the CNN branch network trained using the SSL strategy. As shown in Table III, we perform ablation studies for these three contributions on the LEVIR-CD, CDD, and WHU-CD datasets. According to the data in Table III, the Swin V2 main network has the largest effect on overall performance, and the latter two have similar effects. Especially on the CDD dataset, the Swin V2 main network shows an improvement of 2.77% in F1 score compared to the CNN main network. Different combinations of two contributions occasionally exhibit mutually exclusive effects. However, when the three contributions are used together, our model achieves significantly better results than the baseline on all three datasets. Our ablation experiments use F1 score as the main evaluation metric, which is roughly positively correlated with OA. This also verifies the importance of our parallel compounded architecture consisting of the Swin V2 main network and the CNN branch network. The three types of multi-scale features and the encoder semantic features guided by the SSL strategy also clearly benefit model performance.

IV-C2 Parameter Analysis of Swin V2

For Swin V2 blocks, we mainly perform ablation studies on the pretrained weights and the number of blocks, as described in Table V. We only use two Swin V2 blocks for $S_{1,1}$, $S_{1,2}$, and $S_{2,1}$. For the U-shaped structure formed by [$S_{1,0}$, $S_{2,0}$, $S_{3,0}$, $S_{4,0}$, $S_{3,1}$, $S_{2,2}$, $S_{1,3}$], we try three configurations of Swin V2 blocks: [2, 2, 2, 2, 2, 2, 2], [2, 2, 6, 2, 6, 2, 2], and [2, 2, 18, 2, 18, 2, 2]. The ablation results on the three datasets show that the combination of the pretrained weights and the configuration [2, 2, 6, 2, 6, 2, 2] yields advanced detection performance and robust application scenarios. This is due to the fact that the pretrained weights usually contain prior information from upstream tasks, while too many Swin V2 blocks may lead to underfitting of some intermediate layer parameters of the model.

TABLE V: The ablation studies for Swin V2 on the three datasets. We report F1 and OA scores. The best values are highlighted in bold font. All results are expressed as percentages (%).
Pre-trained   Conf.    LEVIR-CD F1 / OA   CDD F1 / OA   WHU-CD F1 / OA
–             Conf.1   91.66 / 99.17   97.41 / 99.36   95.32 / 99.61
✓             Conf.1   91.77 / 99.17   97.46 / 99.38   95.50 / 99.62
✓             Conf.2   91.85 / 99.18   97.58 / 99.41   95.34 / 99.61
✓             Conf.3   91.79 / 99.18   97.42 / 99.37   95.26 / 99.60

IV-C3 Ablation Study of Mixed Feature Pyramid

We propose MFP as a combination of GT, Res2Net-Conv and SK-Conv. In the left subtable of Table IV, we present a detailed ablation analysis of these three modules with FC-Siam-Diff and our model as the baselines on the LEVIR-CD dataset. In the right subtable of Table IV, we further test the plug-and-play performance of MFP on four other models. The data of the left subtable supports the contribution of each module and the improvement of overall performance when using MFP. Specifically, MFP improves F1 scores by 0.70% and 0.215% on FC-Siam-Diff and our model (our model has a high complexity), respectively. From the right subtable, FC-Siam-Conc, IFNet, SNUNet-CD, and FCCDN achieve improvements of 0.77%, 0.36%, 0.07%, and 0.46% in F1 score, respectively. The experiments on longitudinal dissection and horizontal promotion support that MFP effectively improves the performance of change detection models by providing inter-layer interaction information and intra-layer multi-scale information.

Refer to caption
Figure 9: The example of network visualization.

IV-C4 Ablation Study of SSL Strategy

We compare the performance gains brought by guiding the CNN branch network under three strategies: unsupervised, supervised, and self-supervised, as shown in Table VI. On the LEVIR-CD, CDD, and WHU-CD datasets, the self-supervised strategy achieves the best results. The reason is that self-supervised learning provides deep features with semantic information for change detection tasks [40].

TABLE VI: The ablation studies for the supervision strategy of the CNN branch network on the three datasets. We report F1 and OA scores. The best values are highlighted in bold font. All results are expressed as percentages (%).
Strategy LEVIR-CD CDD WHU-CD
F1 / OA F1 / OA F1 / OA
Unsupervised 91.97 / 99.19 97.66 / 99.43 95.10 / 99.59
Supervised 91.97 / 99.19 97.68 / 99.43 95.39 / 99.62
Self-supervised 92.15 / 99.21 97.71 / 99.44 95.69 / 99.64

IV-D Network Visualization

To further elucidate the practical effect of each network module, we analyze the network visualizations. As depicted in Fig. 9, we broadly divide the network into three components: the CNN branch network, the main network, and the feature fusion modules. Because the CNN branch network is a dual-branch network with shared weights, we only present the activation maps of one branch. The encoder layers clearly supply attention to low-level details for the main network, whereas the main network provides structured abstract features. The feature fusion modules concentrate on the fine differences between remote sensing objects while preserving structural information. Combining and retaining these two types of information is pivotal for change detection in VHRRS images. Contrasting the label with the change activation map obtained through CBAM makes it evident that our model is robust in intricate scenes and under varying lighting conditions.
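The activation maps in Fig. 9 can be reproduced with standard forward hooks. The sketch below shows one way to capture an intermediate feature map of a bi-temporal model and reduce it to a normalized single-channel heat map; the layer name and the two-input calling convention are placeholders rather than the exact interface of our code.

import torch

def capture_activation(model, layer_name, x1, x2):
    # Returns the channel-averaged, min-max normalized activation of the
    # sub-module identified by layer_name after one forward pass.
    feats = {}

    def hook(_module, _inputs, output):
        feats["act"] = output.detach()

    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    with torch.no_grad():
        model(x1, x2)                                 # bi-temporal inputs
    handle.remove()
    act = feats["act"].mean(dim=1, keepdim=True)      # B x 1 x H x W
    return (act - act.min()) / (act.max() - act.min() + 1e-8)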

V Conclusion

We propose an end-to-end parallel compounded network, SwinV2DNet, in which the densely connected main network of Swin V2 blocks provides the change relationship features and the CNN branch network provides the pre-changed and post-changed features for change detection. The network gleans global information via the transformer while capturing precise low-level details through the CNN encoder. Together, the change relationship, pre-changed, and post-changed features overcome the limitations of previous change detection networks that rely on either early fusion or late fusion alone. Furthermore, we propose a plug-and-play MFP that supplies inter-layer interaction information and intra-layer multi-scale information; the experiments also validate its efficacy in other change detection networks. We employ an SSL strategy to guide the CNN branch so that it provides encoder semantic features for the main network. Our method converges stably and performs efficiently on the four commonly used public datasets. Importantly, the proposed model demonstrates strong abilities in capturing structured, small, and edge objects, and it remains robust in terms of anti-interference and change discrimination, especially when the illumination difference is large or the background and targets are cluttered and similar. In future work, we will focus on lightweight transformer designs for better practicality.

References

  • [1] M. Hussain, D. Chen, A. Cheng, H. Wei, and D. Stanley, “Change detection from remotely sensed images: From pixel-based to object-based approaches,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 80, pp. 91–106, 2013.
  • [2] J. Zhang, X. Jia, J. Hu, and K. Tan, “Moving vehicle detection for remote sensing video surveillance with nonstationary satellite platform,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5185–5198, 2021.
  • [3] L. Bruzzone and M. Marconcini, “Domain adaptation problems: A dasvm classification technique and a circular validation strategy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 770–787, 2009.
  • [4] C. Benedek, X. Descombes, and J. Zerubia, “Building development monitoring in multitemporal remotely sensed image pairs with stochastic birth-death dynamics,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 1, pp. 33–50, 2011.
  • [5] P. Lu, Y. Qin, Z. Li, A. C. Mondini, and N. Casagli, “Landslide mapping from multi-sensor data through improved change detection-based markov random field,” Remote Sensing of Environment, vol. 231, p. 111235, 2019.
  • [6] S. Jin, L. Yang, Z. Zhu, and C. Homer, “A land cover change detection and classification protocol for updating alaska nlcd 2001 to 2011,” Remote Sensing of Environment, vol. 195, pp. 44–55, 2017.
  • [7] Z. Zhu and C. E. Woodcock, “Continuous change detection and classification of land cover using all available landsat data,” Remote Sensing of Environment, vol. 144, pp. 152–171, 2014.
  • [8] Z. Lv, T. Liu, J. A. Benediktsson, and N. Falco, “Land cover change detection techniques: Very-high-resolution optical images: A review,” IEEE Geoscience and Remote Sensing Magazine, vol. 10, no. 1, pp. 44–63, 2021.
  • [9] Z. Liangpei and W. Chen, “Advance and future development of change detection for multi-temporal remote sensing imagery,” Acta Geodaetica et Cartographica Sinica, vol. 46, no. 10, p. 1447, 2017.
  • [10] H. Zhuang, K. Deng, H. Fan, and M. Yu, “Strategies combining spectral angle mapper and change vector analysis to unsupervised change detection in multispectral images,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 5, pp. 681–685, 2016.
  • [11] W. A. Malila, “Change vector analysis: An approach for detecting forest changes with landsat,” in LARS symposia, 1980, p. 385.
  • [12] F. Bovolo and L. Bruzzone, “A theoretical framework for unsupervised change detection based on change vector analysis in the polar domain,” IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 1, pp. 218–236, 2006.
  • [13] J. Zhang and Y. Zhang, “Remote sensing research issues of the national land use change program of china,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 62, no. 6, pp. 461–472, 2007.
  • [14] J. Zhong and R. Wang, “Multi-temporal remote sensing change detection based on independent component analysis,” International Journal of Remote Sensing, vol. 27, no. 10, pp. 2055–2061, 2006.
  • [15] G. Xian and C. Homer, “Updating the 2001 national land cover database impervious surface products to 2006 using landsat imagery change detection methods,” Remote Sensing of Environment, vol. 114, no. 8, pp. 1676–1686, 2010.
  • [16] C. Wu, B. Du, and L. Zhang, “Slow feature analysis for change detection in multispectral imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 5, pp. 2858–2874, 2013.
  • [17] J. L. Gil-Yepes, L. A. Ruiz, J. A. Recio, Á. Balaguer-Beser, and T. Hermosilla, “Description and validation of a new set of object-based temporal geostatistical features for land-use/land-cover change detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 121, pp. 77–91, 2016.
  • [18] R. C. Daudt, B. Le Saux, and A. Boulch, “Fully convolutional siamese networks for change detection,” in 2018 25th IEEE International Conference on Image Processing (ICIP).   IEEE, 2018, pp. 4063–4067.
  • [19] D. Zheng, Z. Wei, Z. Wu, and J. Liu, “Learning pairwise potential crfs in deep siamese network for change detection,” Remote Sensing, vol. 14, no. 4, p. 841, 2022.
  • [20] C. Zhang, L. Wang, S. Cheng, and Y. Li, “Swinsunet: Pure transformer network for remote sensing image change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  • [21] P. F. Alcantarilla, S. Stent, G. Ros, R. Arroyo, and R. Gherardi, “Street-view change detection with deconvolutional networks,” Autonomous Robots, vol. 42, pp. 1301–1322, 2018.
  • [22] D. Peng, Y. Zhang, and H. Guan, “End-to-end change detection for high resolution satellite images using improved unet++,” Remote Sensing, vol. 11, no. 11, p. 1382, 2019.
  • [23] X. Peng, R. Zhong, Z. Li, and Q. Li, “Optical remote sensing image change detection based on attention mechanism and image difference,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 9, pp. 7296–7307, 2020.
  • [24] Y. Zhan, K. Fu, M. Yan, X. Sun, H. Wang, and X. Qiu, “Change detection based on deep siamese convolutional network for optical aerial images,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 10, pp. 1845–1849, 2017.
  • [25] C. Zhang, P. Yue, D. Tapete, L. Jiang, B. Shangguan, L. Huang, and G. Liu, “A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 166, pp. 183–200, 2020.
  • [26] H. Chen, Z. Qi, and Z. Shi, “Remote sensing image change detection with transformers,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2021.
  • [27] Q. Shi, M. Liu, S. Li, X. Liu, F. Wang, and L. Zhang, “A deeply supervised attention metric-based network and an open aerial image dataset for remote sensing change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–16, 2021.
  • [28] S. Fang, K. Li, J. Shao, and Z. Li, “Snunet-cd: A densely connected siamese network for change detection of vhr images,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2021.
  • [29] X. Zhang, S. Cheng, L. Wang, and H. Li, “Asymmetric cross-attention hierarchical network based on cnn and transformer for bitemporal remote sensing images change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023.
  • [30] Y. Feng, J. Jiang, H. Xu, and J. Zheng, “Change detection on remote sensing images using dual-branch multilevel intertemporal network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023.
  • [31] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [32] Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “Eca-net: Efficient channel attention for deep convolutional neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 534–11 542.
  • [33] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19.
  • [34] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • [35] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” arXiv preprint arXiv:1511.07122, 2015.
  • [36] S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, “Res2net: A new multi-scale backbone architecture,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 2, pp. 652–662, 2019.
  • [37] X. Li, W. Wang, X. Hu, and J. Yang, “Selective kernel networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 510–519.
  • [38] B. Hou, Q. Liu, H. Wang, and Y. Wang, “From w-net to cdgan: Bitemporal change detection via deep learning techniques,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 3, pp. 1790–1802, 2019.
  • [39] W. Zhao, L. Mou, J. Chen, Y. Bo, and W. J. Emery, “Incorporating metric learning and adversarial network for seasonal invariant change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 4, pp. 2720–2731, 2019.
  • [40] P. Chen, B. Zhang, D. Hong, Z. Chen, X. Yang, and B. Li, “Fccdn: Feature constraint network for vhr image change detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 187, pp. 101–119, 2022.
  • [41] H. Chen, W. Li, S. Chen, and Z. Shi, “Semantic-aware dense representation learning for remote sensing image change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–18, 2022.
  • [42] Y. Zhang, Y. Zhao, Y. Dong, and B. Du, “Self-supervised pretraining via multimodality images with transformer for change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–11, 2023.
  • [43] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [44] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [45] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.
  • [46] Z. Liu, H. Hu, Y. Lin, Z. Yao, Z. Xie, Y. Wei, J. Ning, Y. Cao, Z. Zhang, L. Dong et al., “Swin transformer v2: Scaling up capacity and resolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12 009–12 019.
  • [47] D. Hong, Z. Han, J. Yao, L. Gao, B. Zhang, A. Plaza, and J. Chanussot, “Spectralformer: Rethinking hyperspectral image classification with transformers,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2021.
  • [48] L. Sun, G. Zhao, Y. Zheng, and Z. Wu, “Spectral–spatial feature tokenization transformer for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022.
  • [49] L. Gao, B. Liu, P. Fu, and M. Xu, “Adaptive spatial tokenization transformer for salient object detection in optical remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023.
  • [50] X. He, Y. Zhou, J. Zhao, D. Zhang, R. Yao, and Y. Xue, “Swin transformer embedding unet for remote sensing image semantic segmentation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–15, 2022.
  • [51] Q. Li, R. Zhong, X. Du, and Y. Du, “Transunetcd: A hybrid transformer network for change detection in optical remote-sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–19, 2022.
  • [52] Y. Feng, H. Xu, J. Jiang, H. Liu, and J. Zheng, “Icif-net: Intra-scale cross-interaction and inter-scale feature fusion network for bitemporal remote sensing images change detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  • [53] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [54] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.
  • [55] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” CoRR, vol. abs/1505.04597, 2015. [Online]. Available: http://arxiv.org/abs/1505.04597
  • [56] D. Zhang, H. Zhang, J. Tang, M. Wang, X. Hua, and Q. Sun, “Feature pyramid transformer,” ArXiv, vol. abs/2007.09451, 2020.
  • [57] Y. Feng, J. Jiang, H. Xu, and J. Zheng, “Change detection on remote sensing images using dual-branch multilevel intertemporal network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023.
  • [58] T. Lei, X. Geng, H. Ning, Z. Lv, M. Gong, Y. Jin, and A. Nandi, “Ultralightweight spatial–spectral feature cooperation network for change detection in remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023.
  • [59] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [60] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning.   PMLR, 2020, pp. 1597–1607.
  • [61] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 000–16 009.
  • [62] Y. Chen and L. Bruzzone, “Self-supervised change detection in multi-view remote sensing images,” arXiv preprint arXiv:2103.05969, 2021.
  • [63] Y. Zhang, Y. Zhao, Y. Dong, and B. Du, “Self-supervised pre-training via multi-modality images with transformer for change detection,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
  • [64] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [65] Y. Qin, Z. Niu, F. Chen, B. Li, and Y. Ban, “Object-based land cover change detection for cross-sensor images,” International Journal of Remote Sensing, vol. 34, no. 19, pp. 6723–6737, 2013.
  • [66] H. Chen and Z. Shi, “A spatial-temporal attention-based method and a new dataset for remote sensing image change detection,” Remote Sensing, vol. 12, no. 10, p. 1662, 2020.
  • [67] M. Lebedev, Y. V. Vizilter, O. Vygolov, V. Knyaz, and A. Y. Rubis, “Change detection in remote sensing images using conditional adversarial networks.” International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, vol. 42, no. 2, 2018.
  • [68] S. Ji, S. Wei, and M. Lu, “Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 1, pp. 574–586, 2018.
  • [69] F. Pan, Z. Wu, Q. Liu, Y. Xu, and Z. Wei, “Dcff-net: A densely connected feature fusion network for change detection in high-resolution remote sensing images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 14, pp. 11 974–11 985, 2021.
[Uncaptioned image] Dalong Zheng received the B.S. degree in software engineering and M.S. degree in agricultural informatization from the Inner Mongolia Agricultural University. He is currently pursuing the Ph.D. degree with the Nanjing University of Science and Technology (NJUST), Nanjing, Jiangsu, China. His research interests include remote sensing image change detection, deep learning, and conditional random fields.
[Uncaptioned image] Zebin Wu received the B.Sc. and Ph.D. degrees in computer science and technology from the Nanjing University of Science and Technology (NJUST), Nanjing, China, in 2003 and 2007, respectively. He is currently a Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology. He was a Visiting Scholar with the GIPSA-Lab, Grenoble INP, the Université Grenoble Alpes, Grenoble, France, from August 2018 to September 2018. He was a Visiting Scholar with the Department of Mathematics, University of California at Los Angeles, Los Angeles, CA, USA, from August 2016 to September 2016 and from July 2017 to August 2017. He was a Visiting Scholar with the Hyperspectral Computing Laboratory, Department of Technology of Computers and Communications, Escuela Politécnica, University of Extremadura, Cáceres, Spain, from June 2014 to June 2015. His research interests include hyperspectral image processing, parallel computing, and big data processing.
[Uncaptioned image] Jia Liu received the B.S. and Ph.D. degrees in electronic engineering from the Xidian University, in 2013 and 2018, respectively. Now, he is an Associate Professor with the School of Computer Science and Engineering, Nanjing University of Science and Technology. His current research interests include computational intelligence and image understanding.
[Uncaptioned image] Zhihui Wei was born in Jiangsu, China, in 1963. He received the B.Sc. and M.Sc. degrees in applied mathematics and the Ph.D. degree in communication and information system from the Southeast University, Nanjing, China, in 1983, 1986, and 2003, respectively. He is currently a Professor and a Doctoral Supervisor with the Nanjing University of Science and Technology (NJUST), Nanjing. His research interests include partial differential equations, mathematical image processing, multiscale analysis, sparse representation, and compressive sensing.