Weakly Supervised Change Detection via Knowledge Distillation and Multiscale Sigmoid Inference

Binghao Lu, Caiwen Ding, Jinbo Bi, Dongjin Song (corresponding author)
Department of Computer Science and Engineering, University of Connecticut
binghao.lu, caiwen.ding, jinbo.bi, [email protected]

Abstract

Change detection, which aims to detect spatial changes from a pair of multi-temporal images due to natural or man-made causes, has been widely applied in remote sensing, disaster management, urban management, etc. Most existing change detection approaches, however, are fully supervised and require labor-intensive pixel-level labels. To address this, we develop a novel weakly supervised change detection technique via Knowledge Distillation and Multiscale Sigmoid Inference (KD-MSI) that leverages image-level labels. In our approach, Class Activation Maps (CAM) are utilized not only to derive a change probability map but also to serve as a foundation for the knowledge distillation process. This is done through a joint training strategy of the teacher and student networks, enabling the student network to highlight potential change areas more accurately than the teacher network based on image-level labels. Moreover, we design a Multiscale Sigmoid Inference (MSI) module as a post-processing step to further refine the change probability map from the trained student network. Empirical results on three public datasets, i.e., WHU-CD, DSIFN-CD, and LEVIR-CD, demonstrate that our proposed technique, with its integrated training strategy, significantly outperforms the state-of-the-art. Code is available at https://github.com/BinghaoLu/KD-MSI.

1 Introduction

Change detection aims to identify changes in objects at the same geographical location across different periods. It has been applied in various real-world applications, e.g., disaster management, urban planning, visual surveillance, and resource management Bouziani et al. (2010); Goyette et al. (2012); Jiang et al. (2022); Sublime and Kalinicheva (2019); Lee et al. (2021a); Ye et al. (2021). Recently, tremendous progress has been made in deep learning-based change detection, as deep neural networks produce more effective representations Jiang et al. (2022). Most existing approaches, however, focus on fully supervised change detection and require a massive amount of pixel-level labels, which are time-consuming and labor-intensive to collect Jiang et al. (2020). An alternative is to adopt image-level labels and leverage weakly supervised learning to train the change detection model. A comparison of change detection tasks with pixel-level labels versus image-level labels is depicted in Figure 1.

Figure 1: Comparison of change detection methods with pixel-level supervision (third column) vs. image-level supervision (fourth column).

Although it is challenging to perform weakly supervised change detection, several attempts have been made to tackle this problem in the past few years Khan et al. (2016); Andermatt and Timofte (2020); Kalita et al. (2021); Wu et al. (2023); Huang et al. (2023). For instance, Khan et al. (2016) and Andermatt and Timofte (2020) developed conditional random fields with image-level labels to perform change detection. Kalita et al. (2021) employed principal component analysis together with K-means clustering to produce the change map based on a Siamese neural network architecture. Wu et al. (2023) developed a fully convolutional change detection framework based on a generative adversarial network to facilitate weakly supervised change detection. Huang et al. (2023) developed a new augmentation method that mixes the background regions of the image pair to enhance weakly supervised change detection performance. Despite this progress, there is still a large gap between the coarse image-level labels and the fine-grained pixel-level change detection results that real-world applications require.

To bridge the aforementioned research gap, we develop a novel weakly supervised change detection technique by leveraging a knowledge distillation framework and a multiscale sigmoid inference module. Specifically, we adapt Class Activation Maps (CAM) Zhou et al. (2016), a technique originally used to visualize which regions contribute the most to the prediction made by a convolutional neural network, to a pair of input images via Siamese neural networks (the teacher network) so as to highlight the potential change area based on image-level labels, i.e., change or no change. Although CAM can capture the discriminative region of potential changes, the results are still relatively coarse and may contain incorrectly labeled pixels. To refine this, we implement a joint training strategy in which the student network's change probability map is trained under the guidance of the teacher network through knowledge distillation. This approach allows the student network to learn the patterns and knowledge in the teacher network's CAM, enhancing its ability to generate a more precise change probability map. Furthermore, we design a multiscale sigmoid inference module as a post-processing step to further enhance the change probability map of the student network. Extensive experiments demonstrate that our proposed model, with its structured joint training approach and the integration of the multiscale sigmoid inference module, significantly outperforms state-of-the-art methods on three publicly available datasets, i.e., LEVIR-CD, WHU-CD, and DSIFN-CD.

2 Related Work

Change detection and semantic segmentation are two fundamental tasks in the field of remote sensing and computer vision. Our proposed method is closely related to fully supervised change detection, weakly supervised semantic segmentation and weakly supervised change detection.

2.1 Fully-supervised change detection

Fully supervised change detection relies on pixel-level labeled data to train the model Shi et al. (2020); Khelifi and Mignotte (2020). Deep learning based fully supervised change detection methods have been popular over the last decade. Initially, convolutional neural network based methods were dominant; more recently, transformer based methods have emerged. Daudt et al. (2018) designed the first end-to-end UNet structure for fully supervised change detection with their FC-EF, FC-Siam-conc, and FC-Siam-diff architectures. Many convolutional neural network based change detection methods use VGG Simonyan and Zisserman (2014) or ResNet He et al. (2016); Zheng et al. (2021) backbones. For example, IFNet Zhang et al. (2020a) and DTCDSCN Liu et al. (2020) use VGG16 and ResNet34 as backbones, respectively. Among transformer based methods, SwinSUNet Zhang et al. (2022) adopts the Swin Transformer as its backbone, BIT Chen et al. (2021) adopts ResNet18 and a Vision Transformer as its backbone, and Liu et al. (2022) combine a multi-scale CNN-transformer structure to enhance change detection performance in remote sensing images. Many of these fully supervised change detection methods adopt multi-level representation learning to achieve superior performance. However, training these supervised models requires pixel-level labeling of each image pair, which is time-consuming and labor-intensive.

2.2 Weakly-supervised semantic segmentation

Weakly supervised semantic segmentation usually extracts an attention map from a classification network to serve as pseudo segmentation labels, which are then used to train a segmentation model. The focus of weakly supervised semantic segmentation is to generate high-quality attention maps Wang et al. (2020); Lee et al. (2021b); Wei et al. (2017); Zhang et al. (2020b). For example, Wei et al. (2017) proposed an adversarial erasing strategy, which iteratively erases the most discriminative region in the attention map in order to find more of the missing object regions. Kolesnikov and Lampert (2016) introduced the seed, expand, and constrain method, which expands the initial seed from the attention map to align with the object boundaries. Wang et al. (2020) built a Siamese network that utilizes the scale-invariance property of the attention map to make it cover more object regions. Lee et al. (2021b) modified the last layer of a deep neural network based on information bottleneck theory to mine more non-discriminative regions of the object. Chang et al. (2020) introduced an approach that enhances weakly supervised semantic segmentation by exploiting sub-category information. Specifically, the method clusters image features to generate pseudo sub-category labels within each annotated parent class and constructs a sub-category objective, which improves the quality of response maps and segmentation results by encouraging the network to focus beyond the most discriminative object parts.

2.3 Weakly supervised change detection

Due to the heavy labeling cost of fully supervised change detection models, some researchers have studied how to train change detection models with weak labels such as image-level labels. Khan et al. (2016) incorporated conditional random fields in their approach. Andermatt and Timofte (2020) introduced W-CDNet, which can be trained with image-level semantic labels for change detection, employing a W-shaped Siamese U-Net and a Change Segmentation and Classification (CSC) module to create and refine change masks. Kalita et al. (2021) employed both principal component analysis and K-means clustering. Wu et al. (2023) leveraged adversarial learning during training. Additionally, Huang et al. (2023) introduced an augmentation technique that blends the background areas of image pairs, specifically for weakly supervised change detection.

2.4 Knowledge distillation

Knowledge distillation, a concept introduced and developed in seminal works Hinton et al. (2015); Furlanello et al. (2018); Gou et al. (2021), involves training a compact student model to replicate the behavior of a larger, pre-trained teacher model Wang and Yoon (2021). This technique has gained prominence for its effectiveness in model compression and in transferring knowledge from complex models to simpler ones Alkhulaifi et al. (2021). In classification tasks, the student model assimilates knowledge by approximating the output distribution of the teacher model, effectively capturing the nuanced relationships learned by the teacher.

Recent advancements have extended the application of knowledge distillation to more complex tasks such as semantic segmentation He et al. (2019) Liu et al. (2019) Qin et al. (2021) Dou et al. (2020) Ji et al. (2022), demonstrating its versatility and potential in various domains. In our study, we adopt the framework of knowledge distillation with a novel focus: applying it to the domain of weakly supervised change detection. Specifically, we aim to distill knowledge from Class Activation Maps (CAMs), leveraging them to guide the student model in an online learning framework. This approach is particularly innovative as it navigates the challenges of weakly supervised settings, where the scarcity of labeled data can impede the learning process.

3 Proposed Method

Our proposed method has two key components: 1) a knowledge distillation framework for generating a more accurate change probability map, and 2) a multiscale sigmoid inference module that post-processes the change probability map to improve its accuracy. The refined map then serves as the pseudo pixel-level label for training a change detection model.

3.1 The Knowledge Distillation Framework

The knowledge distillation framework consists of two subnetworks, i.e., a Siamese teacher network that can generate Class Activation Maps (CAM) Zhou et al. (2016) based on the image-level labels, i.e., change or no change, and a Siamese student network that can generate a fine-grained change probability map via knowledge distillation. Both networks are jointly optimized to enable the student network to highlight potential change areas more accurately than the teacher network. The details of the proposed knowledge distillation framework are shown in Figure 2.

3.1.1 Siamese Teacher Network

The Siamese teacher network aims to encode the image-level label information and produce a CAM to guide the Siamese student network. Given a pre-event and post-event image pair $(I_{1},I_{2})$, where $I_{1}\in\mathbb{R}^{m\times n\times c}$ and $I_{2}\in\mathbb{R}^{m\times n\times c}$, and an image-level label $y\in\{0,1\}$, each image is passed through the ResNet50 He et al. (2016) backbone network to obtain the last-layer high-dimensional feature maps $F_{1}$ and $F_{2}$, respectively. $F_{1}\bigoplus F_{2}$ is then passed through a $1\times 1$ convolutional layer to obtain the feature map $G_{\textrm{teacher}}$ with a channel size of 1, where $\bigoplus$ stands for a feature combination method, e.g., concatenation, subtraction, or absolute subtraction. In our experiments, we choose the combination that provides the best validation IoU. After that, the CAM can be inferred via min-max normalization:

$\textrm{CAM}(G_{\textrm{teacher}})=\frac{\textrm{ReLU}(G_{\textrm{teacher}})}{\max(\textrm{ReLU}(G_{\textrm{teacher}}))}.$ (1)
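As a minimal sketch of Equation 1, assuming the 1-channel map $G_{\textrm{teacher}}$ is available as a NumPy array, the normalization can be written as follows. The function name `cam_from_logits` and the small `eps` guard are our own illustrative additions, not part of the original formulation.

```python
import numpy as np

def cam_from_logits(g_teacher, eps=1e-8):
    """Eq. (1): ReLU, then divide by the spatial maximum.

    `eps` is an illustrative safeguard against division by zero when
    every activation is non-positive; it is not part of Eq. (1).
    """
    relu = np.maximum(g_teacher, 0.0)
    return relu / (relu.max() + eps)

# Toy 2x2 feature map: negative responses are clipped to 0 and the
# strongest response is mapped to (approximately) 1.
g = np.array([[-1.0, 0.5],
              [2.0, 4.0]])
cam = cam_from_logits(g)
```

Because the ReLU clips negative responses to zero, the minimum of the map is typically zero and dividing by the maximum rescales the CAM to the range [0, 1].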

Finally, global average pooling (GAP) is applied to the feature map $G_{\textrm{teacher}}$ and a binary cross-entropy loss is employed to train the network, i.e.,

$L_{\text{cls}}=-y\log(\sigma(\textrm{GAP}(G_{\textrm{teacher}})))-(1-y)\log(1-\sigma(\textrm{GAP}(G_{\textrm{teacher}}))),$ (2)

where $\sigma(\cdot)$ stands for the sigmoid activation function.
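The classification loss can be sketched in NumPy as below, assuming a single image pair with a 1-channel teacher map. The helper names and the clipping constant `eps` are hypothetical and serve only to keep the logarithm finite.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def classification_loss(g_teacher, y, eps=1e-7):
    """Binary cross-entropy on the globally average-pooled teacher map.

    g_teacher: (H, W) array of pre-sigmoid activations.
    y: image-level label, 1 for change and 0 for no change.
    eps: illustrative clipping constant (our assumption).
    """
    p = sigmoid(g_teacher.mean())       # GAP followed by sigmoid
    p = np.clip(p, eps, 1.0 - eps)      # keep log() finite
    return float(-(y * np.log(p) + (1 - y) * np.log(1.0 - p)))

neutral = classification_loss(np.zeros((4, 4)), y=1)
confident_right = classification_loss(np.full((4, 4), 3.0), y=1)
confident_wrong = classification_loss(np.full((4, 4), 3.0), y=0)
```

A map that averages to zero gives a probability of 0.5 and therefore a loss of log 2 for either label, while a confidently wrong prediction is penalized much more than a confidently right one.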

Although CAM can provide localization capability for image-level labels, it only highlights the region of the actual change at the coarse level since the goal of the classification network is to classify rather than localize.

3.1.2 Siamese Student Network

The Siamese student network aims to learn a fine-grained change probability map based on the knowledge distilled from the CAM. The Siamese student network shares the same architecture as the teacher network; however, its weights are trained separately. As with the teacher network, we obtain the feature map $G_{\textrm{student}}$ from the same pre-event and post-event image pair $(I_{1},I_{2})$ that is fed to the teacher network. We then apply the sigmoid activation to the one-channel feature map $G_{\textrm{student}}$ to obtain the student network's change probability map $\sigma(G_{\text{student}})$. Finally, knowledge distillation is conducted by minimizing the Mean Square Error (MSE) between the change probability map of the student network and the CAM of the teacher network. Specifically, the knowledge distillation loss is given by:

$L_{\text{kd}}=\|\textrm{CAM}(G_{\text{teacher}})-\sigma(G_{\text{student}})\|_{2}^{2}.$ (3)
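A minimal NumPy sketch of the distillation loss is shown below. It averages the squared differences over pixels, which matches the MSE wording and differs from the squared norm in Equation 3 only by a constant factor; the function name `kd_loss` is illustrative.

```python
import numpy as np

def kd_loss(cam_teacher, g_student):
    """Mean squared error between the teacher CAM and sigmoid(G_student).

    cam_teacher: (H, W) array in [0, 1], e.g. the output of Eq. (1).
    g_student:   (H, W) array of student pre-sigmoid activations.
    """
    prob = 1.0 / (1.0 + np.exp(-g_student))
    return float(np.mean((cam_teacher - prob) ** 2))

cam = np.array([[1.0, 0.0]])
undecided = kd_loss(cam, np.zeros((1, 2)))              # student outputs 0.5 everywhere
aligned = kd_loss(cam, np.array([[6.0, -6.0]]))          # student agrees with the CAM
```

The loss shrinks as the student's probability map approaches the teacher's CAM, which is the behavior the distillation step relies on.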

3.1.3 Learning Objective

The overall learning objective of the knowledge distillation framework can be written as:

$L=L_{\text{cls}}+\lambda L_{\text{kd}},$ (4)

where $\lambda>0$ is a hyperparameter to control the trade-off between those two terms. Once the model is trained, only the student model is used to infer the change probability map.
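To make the role of $\lambda$ concrete, the following short derivation shows the gradient that the student map receives from the overall objective. It assumes the squared-norm form of Equation 3 and uses $\odot$ for element-wise multiplication; if the mean is used instead of the sum, the result changes only by a constant factor.

```latex
% L_cls depends only on G_teacher, so only the distillation term
% contributes to the gradient with respect to the student map.
\frac{\partial L}{\partial G_{\text{student}}}
  = \lambda\,\frac{\partial L_{\text{kd}}}{\partial G_{\text{student}}}
  = -2\lambda\,\bigl(\textrm{CAM}(G_{\text{teacher}})-\sigma(G_{\text{student}})\bigr)
    \odot\sigma(G_{\text{student}})\odot\bigl(1-\sigma(G_{\text{student}})\bigr)
```

Under these assumptions, $\lambda$ simply scales the step that pulls the student's probability map toward the teacher's CAM, and the pull weakens wherever the student's sigmoid output saturates near 0 or 1.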

3.2 Multiscale Sigmoid Inference

In weakly supervised semantic segmentation, multiscale inference has been employed to further boost the performance of CAM. Specifically, it will first resize the image to a set of predefined scales $S=\{0.5,1.0,1.5,2.0\}$ . Then, for each scale, the image is further flipped along the height dimension, doubling the set of image variations. Under the knowledge distillation framework, we adapt the multiscale inference to the Siamese student network and further develop multiscale sigmoid inference to enhance the change probability map of the Siamese student network.

3.2.1 Multiscale Inference

We first adapt the multiscale inference to the Siamese student network. Let $I_{1}^{s}$ and $I_{2}^{s}$ denote the original images at scale $s$, where $s\in S$, and let $I_{1}^{sf}$ and $I_{2}^{sf}$ represent their flipped counterparts. Each image pair, at every scale and in both non-flipped and flipped form, is fed into the trained Siamese student network to get the corresponding 1-channel feature maps $G_{\textrm{student}}^{s}$ and $G_{\textrm{student}}^{sf}$, respectively. The feature maps of the flipped image pair are then flipped back to the original orientation as $G_{\textrm{student}}^{sff}$. After that, all of the feature maps are resized to the input image size as $G_{\textrm{student}}^{sr}$ and $G_{\textrm{student}}^{sffr}$. These resized feature maps are summed to get $G_{\textrm{student}}^{\text{sum}}$, which is given by:

$G_{\textrm{student}}^{\text{sum}}=\sum_{s\in S}\left(G_{\textrm{student}}^{sr}+G_{\textrm{student}}^{sffr}\right),$

and $G_{\textrm{student}}^{\text{sum}}$ is further passed through Equation 1 to obtain the student network's change probability map. One potential issue with applying multiscale inference to our student network is that the min-max normalization in Equation 1 causes a potential distribution mismatch with the sigmoid activation that is applied during the training of the student network, and thus may not yield the desired refinement of the change probability map.
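The mismatch can be illustrated with a small NumPy sketch, assuming the per-scale maps have already been flipped back and resized. The function name and the toy values are hypothetical.

```python
import numpy as np

def multiscale_inference_aggregate(resized_logits):
    """Sum resized maps over scales and flips, then apply Eq. (1)."""
    g_sum = np.sum(resized_logits, axis=0)
    relu = np.maximum(g_sum, 0.0)
    peak = relu.max()
    return relu / peak if peak > 0 else relu

# Two hypothetical resized maps for a 1x3 image.
maps = np.array([[[0.0, 1.0, -3.0]],
                 [[0.0, 1.0, -1.0]]])
mi = multiscale_inference_aggregate(maps)

# What the student produced for the first map during training:
# sigmoid maps an activation of 0 to 0.5, whereas Eq. (1) maps it to 0.
train_time = 1.0 / (1.0 + np.exp(-maps[0]))
```

In this toy case the first pixel has probability 0.5 under the sigmoid used in training but is assigned 0 after summation and Equation 1, which is one concrete way the two output distributions can disagree.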

3.2.2 Multiscale Sigmoid Inference

To resolve the aforementioned issue, we design a new Multiscale Sigmoid Inference (MSI) module to further enhance the change probability map of the Siamese student network. Consider inputs $I_{1}^{s}$ and $I_{2}^{s}$ at scale $s\in S$, where $S=\{0.5,1.0,1.5,2.0\}$. Following the same procedure as in multiscale inference, we obtain $G_{\textrm{student}}^{sr}$ and $G_{\textrm{student}}^{sffr}$. Before summing them up, we first apply the sigmoid activation function so that the feature map at each scale carries an equal amount of weight, obtaining $\sigma(G_{\textrm{student}}^{sr})$ and $\sigma(G_{\textrm{student}}^{sffr})$, respectively. Then, we take their average to obtain the final change probability map, i.e.,

$M_{\textrm{student}}=\frac{1}{2|S|}\sum_{s\in S}\left(\sigma(G_{\textrm{student}}^{sr})+\sigma(G_{\textrm{student}}^{sffr})\right).$

The pseudo code of multiscale sigmoid inference is given in Algorithm 1. Based on the change probability map $M_{\textrm{student}}$, a background channel with a certain threshold can be applied and the argmax operation can be used to obtain the pseudo pixel-level label from the student network. This pseudo label is employed to train a separate change detection network for evaluation and testing (details are provided in the experiments).
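The thresholding step can be sketched as follows. The threshold value 0.5 is a placeholder chosen for illustration only, since the text does not state the value used, and `pseudo_label` is a hypothetical name.

```python
import numpy as np

def pseudo_label(prob_map, bg_threshold=0.5):
    """Turn a change probability map into a binary pseudo label.

    A constant background channel filled with `bg_threshold` is stacked
    with the change channel and the argmax is taken over channels, so a
    pixel is labeled 1 (change) only if its probability exceeds the
    threshold. The default threshold is an assumption.
    """
    background = np.full_like(prob_map, bg_threshold)
    stacked = np.stack([background, prob_map], axis=0)  # (2, H, W)
    return stacked.argmax(axis=0)                        # 0 = background, 1 = change

m_student = np.array([[0.20, 0.90],
                      [0.50, 0.51]])
label = pseudo_label(m_student)
```

Ties are resolved in favor of the background class because `argmax` returns the first maximal index.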

Algorithm 1 Multiscale Sigmoid Inference (MSI)

Input: Image pairs ($I_{1}$, $I_{2}$), scales $S$, Siamese student network $N_{\text{student}}$
Output: Change Probability Map
function MultiscaleSigmoidInference($I_{1},I_{2},S,N_{\text{student}}$)
  for each $s\in S$
    Resize images to scale $s$: $I_{1}^{s},I_{2}^{s}$
    Flip resized images: $I_{1}^{sf},I_{2}^{sf}$
    for each pair $(I_{1}^{s},I_{2}^{s}),(I_{1}^{sf},I_{2}^{sf})$
      Pass through $N_{\text{student}}$ for pre-logit maps $G_{\text{student}}^{s},G_{\text{student}}^{sf}$
      Apply sigmoid activation $\sigma$: $\sigma(G_{\text{student}}^{s}),\sigma(G_{\text{student}}^{sf})$
      if pair is flipped then
        Flip $\sigma(G_{\text{student}}^{sf})$ back to original orientation as $\sigma(G_{\text{student}}^{sff})$
      end if
      Resize $\sigma(G_{\text{student}}^{s}),\sigma(G_{\text{student}}^{sff})$ to input size as $\sigma(G_{\text{student}}^{sr}),\sigma(G_{\text{student}}^{sffr})$
    end for
  end for
  return $\frac{1}{2|S|}\sum_{s\in S}\left(\sigma(G_{\text{student}}^{sr})+\sigma(G_{\text{student}}^{sffr})\right)$
end function
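Algorithm 1 can be sketched in NumPy as below. This is an illustration under several assumptions rather than the released implementation: `toy_student` is a hypothetical stand-in for the trained Siamese student network, resizing uses nearest-neighbour sampling, and the flip is taken along axis 0 following the description of flipping along the height dimension.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def resize_nearest(a, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    rows = np.arange(out_h) * a.shape[0] // out_h
    cols = np.arange(out_w) * a.shape[1] // out_w
    return a[rows][:, cols]

def toy_student(img1, img2):
    """Hypothetical stand-in for N_student: returns a 1-channel map."""
    return 4.0 * np.abs(img1 - img2).mean(axis=-1) - 2.0

def multiscale_sigmoid_inference(img1, img2, scales, student):
    h, w = img1.shape[:2]
    acc = np.zeros((h, w))
    for s in scales:
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        a, b = resize_nearest(img1, sh, sw), resize_nearest(img2, sh, sw)
        for flipped in (False, True):
            x1 = np.flip(a, axis=0) if flipped else a
            x2 = np.flip(b, axis=0) if flipped else b
            prob = sigmoid(student(x1, x2))      # sigmoid before aggregation
            if flipped:
                prob = np.flip(prob, axis=0)     # back to original orientation
            acc += resize_nearest(prob, h, w)    # resize to input size
    return acc / (2 * len(scales))               # average over scales and flips

# Toy pair: a 4x4 block changes between the two images.
img1 = np.zeros((8, 8, 3))
img2 = img1.copy()
img2[2:6, 2:6] = 1.0
m = multiscale_sigmoid_inference(img1, img2, (0.5, 1.0, 1.5, 2.0), toy_student)
```

Because every term in the average is already a probability, the output stays in [0, 1] without any further normalization, and each scale contributes with equal weight.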

4 Experiments

4.1 Datasets

Our research involved conducting experiments on three change detection (CD) datasets. The first dataset is the LEVIR-CD Chen and Shi (2020), which stands for Learning, Vision, and Remote sensing. This is a publicly available large-scale building CD dataset containing 637 pairs of high-resolution (HR) remote sensing (RS) images with a resolution of 0.5 meters and dimensions of 1024 × 1024 pixels. We adhered to the standard dataset split provided for training, validation, and testing. Due to limitations in GPU memory capacity, we divided these images into smaller, non-overlapping patches of 256 × 256 pixels, resulting in 7120, 1024, and 2048 pairs of patches for the training, validation, and test sets, respectively.

The second dataset utilized was the WHU-CD Ji et al. (2018), provided by Wuhan University. This dataset comprises a pair of HR (0.075 meters resolution) aerial images with dimensions of 32,507 × 15,354 pixels, focusing on building change detection. Since the original dataset did not provide a data split solution, we processed the images into non-overlapping patches of 256 × 256 pixels and then randomly divided them into training, validation, and test sets with 5947, 743, and 744 patches, respectively.

In our study, we also utilized the DSIFN-CD dataset Zhang et al. (2020a). This dataset is publicly available and forms part of the Deeply Supervised Image Fusion Network project. It includes a collection of six extensive pairs of high-resolution (HR) satellite images, each with a resolution of 2 meters, sourced from six major cities in China. The dataset is diverse, encompassing a variety of land cover changes like alterations in roads, buildings, croplands, and water bodies. Originally, the DSIFN-CD dataset comprised 3600 training samples, 340 validation samples, and 48 test samples. Each sample measured 512 × 512 pixels in size. To better suit our experimental requirements, we further segmented these images into smaller, non-overlapping patches of 256 × 256 pixels. This process resulted in an increased count of samples, with 14400 training, 1360 validation, and 192 test patches derived from the original dataset allocations. Since the train test ratio of this dataset is highly imbalanced, we reallocated 1638 images from the training set to the test set. This adjustment led to a new distribution of samples for our experiments: 12762 patches for training, 1360 for validation, and 1830 for testing. This restructured dataset allowed for a more balanced and effective evaluation of our change detection methodologies.
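The patch counts reported above can be reproduced with a simple non-overlapping tiling, sketched below. Discarding incomplete border tiles is our assumption; it is consistent with the WHU-CD count, but the original preprocessing script may differ.

```python
import numpy as np

def tile(image, patch=256):
    """Split an (H, W, ...) array into non-overlapping patch x patch tiles.

    Incomplete tiles at the right and bottom borders are dropped
    (an assumption, see the lead-in).
    """
    h, w = image.shape[:2]
    return [image[r:r + patch, c:c + patch]
            for r in range(0, h - patch + 1, patch)
            for c in range(0, w - patch + 1, patch)]

def count_patches(h, w, patch=256):
    """Number of tiles that `tile` returns for an h x w image."""
    return (h // patch) * (w // patch)

levir_total = 637 * count_patches(1024, 1024)      # 637 pairs of 1024 x 1024
whu_total = count_patches(32507, 15354)            # one large aerial pair
dsifn_train = 3600 * count_patches(512, 512)       # 3600 samples of 512 x 512
dsifn_after_move = (14400 - 1638, 192 + 1638)      # train, test after reallocation
```

These totals agree with the numbers in the text: 10192 = 7120 + 1024 + 2048 for LEVIR-CD, 7434 = 5947 + 743 + 744 for WHU-CD, and 14400 training patches for DSIFN-CD before 1638 of them are moved to the test set.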

4.2 Setup and Implementation Details

In our work, we utilize ResNet50 He et al. (2016) as the backbone for both the Siamese teacher network and the Siamese student network. Within each Siamese network (teacher or student), the two ResNet50 branches share weights, ensuring a mirrored structure. However, the teacher and student networks do not share weights with each other and are trained jointly with the loss in Eq. 4.

We train the proposed network with training data from the WHU-CD, DSIFN-CD, and LEVIR-CD datasets using image-level labels. The pixel-level labels for the training data are used for evaluating the IoU of the change probability map. The IoU of the change class is used as the metric for early stopping. The weight $\lambda$ for $L_{\text{kd}}$ is set to 10 based on the validation set. Note that only the teacher network is trained with the binary cross-entropy loss, whereas the student network has no classification loss.

The network is trained on an NVIDIA GeForce RTX 3090 GPU with 24GB of VRAM with a batch size of 8 for 20 epochs. The initial learning rate is 0.001 with polynomial learning rate decay. After that, only the student network is kept to obtain the change probability map, which is further refined by multi-scale sigmoid inference to obtain pseudo ground truth. Next, the pseudo pixel-level labels are treated as ground truth to train a separate change detection network with a batch size of 16 for 50 epochs. The initial learning rate is 0.007 with polynomial learning rate decay.
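A polynomial learning rate schedule is commonly written as in the sketch below. The exponent `power=0.9` and the per-step update are assumptions made for illustration; the text specifies only the initial learning rates (0.001 and 0.007) and the polynomial form.

```python
def poly_lr(base_lr, step, max_steps, power=0.9):
    """Polynomial decay from base_lr at step 0 to 0 at max_steps.

    `power` is a commonly used default, not a value stated in the paper.
    """
    return base_lr * (1.0 - step / max_steps) ** power

# Hypothetical schedule with 100 update steps.
start = poly_lr(0.001, 0, 100)
middle = poly_lr(0.001, 50, 100)
end = poly_lr(0.001, 100, 100)
```

With this form the learning rate decreases monotonically and reaches zero at the final step.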

For the change detection network, we employ DeepLabV3+ Chen et al. (2018) with ResNet50 as the encoder and modify it into a Siamese network. The image pair is first fed to the encoder to obtain the corresponding high-level and low-level feature maps $F_{1}$, $F_{2}$, $F_{1low}$, and $F_{2low}$. $F_{1}-F_{2}$ is passed through the ASPP module of DeepLabV3+, and its output is concatenated with $F_{1low}-F_{2low}$ to serve as the input to the DeepLabV3+ decoder, which produces the final change mask.

4.3 Metrics

To comprehensively evaluate the performance of our model, we employ a suite of metrics: Overall Accuracy (OA), F1-score (F1), Change Class Intersection over Union (IoU), False Positives (FP), False Negatives (FN), and Mean Intersection over Union (Mean IoU).

• Overall Accuracy (OA) quantifies the proportion of correctly predicted pixels among all pixels. It is calculated as:

$OA=\frac{\text{number of correctly predicted pixels}}{\text{total number of pixels}}.$ (5)

• F1-score (F1) is the harmonic mean of precision and recall, offering a balance between these two metrics. It is particularly useful for datasets with imbalanced class distributions. The F1-score is formulated as:

$F1=2\times\frac{\text{precision}\times\text{recall}}{\text{precision}+\text{recall}}.$ (6)

• Change Class Intersection over Union (cIoU) measures the overlap between the predicted change pixels and the actual change pixels. It is defined as:

$cIoU=\frac{\text{TP}}{\text{TP}+\text{FP}+\text{FN}},$ (7)

where TP (True Positives) are the correctly predicted change pixels.

• False Positives (FP) are those instances where non-change pixels are incorrectly identified as changes by the model.

• False Negatives (FN) occur when actual change pixels are missed by the model. Both FP and FN are critical for understanding the types of errors made by the model.

• Mean Intersection over Union (mIoU) is the average of the IoU values across all classes, providing a comprehensive view of the model's performance across classes. It is especially relevant in scenarios involving multiple classes:

$\text{mIoU}=\frac{1}{N}\sum_{i=1}^{N}IoU_{i},$ (8)

where $N$ is the number of classes and $IoU_{i}$ is the IoU for the $i$-th class. In our change detection setting, $N=2$.
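The metrics above can be computed from binary masks as in the following sketch. It assumes that precision and recall refer to the change class and returns FP and FN as raw pixel counts; the function name `change_metrics` is illustrative and the exact conventions behind the reported tables may differ.

```python
import numpy as np

def change_metrics(pred, gt):
    """OA, F1, cIoU and mIoU for binary change masks (1 = change)."""
    pred = np.asarray(pred).astype(bool)
    gt = np.asarray(gt).astype(bool)
    tp = int(np.sum(pred & gt))
    fp = int(np.sum(pred & ~gt))
    fn = int(np.sum(~pred & gt))
    tn = int(np.sum(~pred & ~gt))

    def safe(num, den):
        return num / den if den else 0.0

    precision = safe(tp, tp + fp)
    recall = safe(tp, tp + fn)
    c_iou = safe(tp, tp + fp + fn)          # Eq. (7), change class
    bg_iou = safe(tn, tn + fp + fn)         # same formula for the no-change class
    return {
        "OA": safe(tp + tn, tp + tn + fp + fn),
        "F1": safe(2 * precision * recall, precision + recall),
        "cIoU": c_iou,
        "mIoU": (c_iou + bg_iou) / 2,       # Eq. (8) with N = 2
        "FP": fp,
        "FN": fn,
    }

# Four pixels with exactly one TP, one FP, one FN and one TN.
scores = change_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```

In the toy example every confusion-matrix cell equals 1, so OA and F1 are both 0.5 while cIoU and mIoU are both 1/3.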

Table 1: Comparison with state-of-the-art methods on WHU-CD test dataset

Method	F1 $\uparrow$	OA $\uparrow$	cIoU $\uparrow$	mIoU $\uparrow$	FP $\downarrow$	FN $\downarrow$
FCDNet	0.645	0.937	0.193	0.564	0.491	0.317
WCDNet	0.732	0.962	0.319	0.640	0.284	0.398
CAM	0.797	0.966	0.441	0.703	0.356	0.203
Ours	0.854	0.977	0.562	0.769	0.245	0.193
Supervised	0.944	0.992	0.807	0.899	0.069	0.124

Table 2: Comparison with state-of-the-art methods on LEVIR-CD test dataset

Method	F1 $\uparrow$	OA $\uparrow$	cIoU $\uparrow$	mIoU $\uparrow$	FP $\downarrow$	FN $\downarrow$
FCDNet	0.551	0.888	0.088	0.487	0.585	0.328
WCDNet	0.728	0.938	0.324	0.630	0.450	0.225
CAM	0.729	0.934	0.327	0.630	0.480	0.193
Ours	0.749	0.939	0.361	0.649	0.464	0.174
Supervised	0.922	0.985	0.742	0.863	0.115	0.142

Table 3: Comparison with state-of-the-art methods on DSIFN-CD test dataset

Method	F1 $\uparrow$	OA $\uparrow$	cIoU $\uparrow$	mIoU $\uparrow$	FP $\downarrow$	FN $\downarrow$
FCDNet	0.278	0.355	0.345	0.184	0.654	0.001
WCDNet	0.704	0.754	0.412	0.557	0.186	0.402
CAM	0.664	0.666	0.471	0.498	0.461	0.068
Ours	0.757	0.775	0.529	0.614	0.287	0.183
Supervised	0.929	0.935	0.831	0.868	0.112	0.057

4.4 Comparison with State-of-the-Art

The literature on weakly supervised change detection is limited. We adopt FCDNet Wu et al. (2023) and WCDNet Andermatt and Timofte (2020) as two of our baselines since their code is available online. FCDNet Wu et al. (2023) applies adversarial learning during the training of its change detection network with image-level labels. WCDNet Andermatt and Timofte (2020) uses a change segmentation and classification module trained with image-level labels only, where the change segmentation mask is learned inside the module; a CRF-RNN module is then applied to further refine the change segmentation mask. We also consider a CAM baseline to compare with our proposed network. The CAM baseline uses ResNet50 as the backbone of the Siamese teacher network and employs multiscale inference to refine the CAM and obtain the pseudo pixel-level label. The same Siamese DeepLabV3+ Chen et al. (2018) model is then trained for change detection. We also compare with a fully supervised change detection method, which is the same Siamese DeepLabV3+ Chen et al. (2018) trained with the original dataset's pixel-level labels. The quantitative comparison results on the test data of the WHU-CD, LEVIR-CD, and DSIFN-CD datasets are shown in Table 1, Table 2, and Table 3, respectively. We can observe that our proposed method consistently outperforms state-of-the-art weakly supervised models on all three datasets. In particular, the proposed method outperforms CAM, which only leverages the Siamese teacher network. This demonstrates the effectiveness of knowledge distillation for producing the change probability map in the student network.

Some visual comparisons on the test set of LEVIR-CD and WHU-CD datasets are shown in Figure 3 and Figure 4 respectively. Visual comparison on DSIFN-CD test dataset is shown in supplemental material. We can also observe that our proposed method can produce more accurate pixel-level change labels compared to all three baselines.

4.5 Ablation Study

Based on the WHU-CD training data, we perform an ablation study to justify the effectiveness of each component of our proposed work. We first compare the change probability map from the student network with the CAM from the teacher network. The student network's change probability map improves the IoU from the 35.1% of the teacher network's CAM to 47.7%, a gain of 12.6 percentage points. We also compare the student network's change probability map under the multi-scale inference method (MI) and the multi-scale sigmoid inference method (MSI). We observe that MI slightly reduces the IoU from 47.7% to 47.3%. With MSI, in contrast, we achieve 52.7%, a gain of 5.0 percentage points over the change probability map of the student network with knowledge distillation alone.

We also visualize the learned CAM from the teacher network (column (d)), the change probability map from the student network (column (e)), and the change probability map from the student network with multi-scale sigmoid inference (MSI) (column (f)) in Fig. 5. Our results show that the change probability map from the student network with MSI better highlights the change regions.

Table 4: Ablation study on WHU-CD dataset, where MI denotes multi-scale inference, MSI stands for multi-scale sigmoid inference.

Teacher	Student	MI	MSI	IoU
$\checkmark$				0.351
	$\checkmark$			0.477
	$\checkmark$	$\checkmark$		0.473
	$\checkmark$		$\checkmark$	0.527

5 Conclusion

In this paper, we developed a novel weakly supervised change detection method based on a knowledge distillation framework and multi-scale sigmoid inference module. Extensive experiments on three public datasets, i.e., WHU-CD, LEVIR-CD, and DSIFN-CD, justified the effectiveness of our proposed model.

References

Alkhulaifi et al. [2021] Abdolmaged Alkhulaifi, Fahad Alsahli, and Irfan Ahmad. Knowledge distillation in deep learning and its applications. PeerJ Computer Science, 7:e474, 2021.
Andermatt and Timofte [2020] Philipp Andermatt and Radu Timofte. A weakly supervised convolutional network for change segmentation and classification. In Proceedings of the Asian Conference on Computer Vision, 2020.
Bouziani et al. [2010] Mourad Bouziani, Kalifa Goïta, and Dong-Chen He. Automatic change detection of buildings in urban environment from very high spatial resolution images using existing geodatabase and prior knowledge. ISPRS Journal of Photogrammetry and Remote Sensing, 65(1):143–153, 2010.
Chang et al. [2020] Yu-Ting Chang, Qiaosong Wang, Wei-Chih Hung, Robinson Piramuthu, Yi-Hsuan Tsai, and Ming-Hsuan Yang. Weakly-supervised semantic segmentation via sub-category exploration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8991–9000, 2020.
Chen and Shi [2020] Hao Chen and Zhenwei Shi. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sensing, 12(10):1662, 2020.
Chen et al. [2018] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
Chen et al. [2021] Hao Chen, Zipeng Qi, and Zhenwei Shi. Remote sensing image change detection with transformers. IEEE Transactions on Geoscience and Remote Sensing, 60:1–14, 2021.
Daudt et al. [2018] Rodrigo Caye Daudt, Bertrand Le Saux, and Alexandre Boulch. Fully convolutional siamese networks for change detection. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 4063–4067. IEEE, 2018.
Dou et al. [2020] Qi Dou, Quande Liu, Pheng Ann Heng, and Ben Glocker. Unpaired multi-modal segmentation via knowledge distillation. IEEE transactions on medical imaging, 39(7):2415–2425, 2020.
Furlanello et al. [2018] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607–1616. PMLR, 2018.
Gou et al. [2021] Jianping Gou, Baosheng Yu, Stephen J Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129:1789–1819, 2021.
Goyette et al. [2012] Nil Goyette, Pierre-Marc Jodoin, Fatih Porikli, Janusz Konrad, and Prakash Ishwar. changedetection.net: A new change detection benchmark dataset. In 2012 IEEE computer society conference on computer vision and pattern recognition workshops, pages 1–8. IEEE, 2012.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
He et al. [2019] Tong He, Chunhua Shen, Zhi Tian, Dong Gong, Changming Sun, and Youliang Yan. Knowledge adaptation for efficient semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 578–587, 2019.
Hinton et al. [2015] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
Huang et al. [2023] Rui Huang, Ruofei Wang, Qing Guo, Jieda Wei, Yuxiang Zhang, Wei Fan, and Yang Liu. Background-mixed augmentation for weakly supervised change detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 7919–7927, 2023.
Ji et al. [2018] Shunping Ji, Shiqing Wei, and Meng Lu. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Transactions on geoscience and remote sensing, 57(1):574–586, 2018.
Ji et al. [2022] Deyi Ji, Haoran Wang, Mingyuan Tao, Jianqiang Huang, Xian-Sheng Hua, and Hongtao Lu. Structural and statistical texture knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16876–16885, 2022.
Jiang et al. [2020] Huiwei Jiang, Xiangyun Hu, Kun Li, Jinming Zhang, Jinqi Gong, and Mi Zhang. PGA-SiamNet: Pyramid feature-based attention-guided siamese network for remote sensing orthoimagery building change detection. Remote Sensing, 12(3):484, 2020.
Jiang et al. [2022] Huiwei Jiang, Min Peng, Yuanjun Zhong, Haofeng Xie, Zemin Hao, Jingming Lin, Xiaoli Ma, and Xiangyun Hu. A survey on deep learning-based change detection from high-resolution remote sensing images. Remote Sensing, 14(7):1552, 2022.
Kalita et al. [2021] Indrajit Kalita, Savvas Karatsiolis, and Andreas Kamilaris. Land use change detection using deep siamese neural networks and weakly supervised learning. In Computer Analysis of Images and Patterns: 19th International Conference, CAIP 2021, Virtual Event, September 28–30, 2021, Proceedings, Part II 19, pages 24–35. Springer, 2021.
Khan et al. [2016] Salman H Khan, Xuming He, Fatih Porikli, Mohammed Bennamoun, Ferdous Sohel, and Roberto Togneri. Learning deep structured network for weakly supervised change detection. arXiv preprint arXiv:1606.02009, 2016.
Khelifi and Mignotte [2020] Lazhar Khelifi and Max Mignotte. Deep learning for change detection in remote sensing images: Comprehensive review and meta-analysis. IEEE Access, 8:126385–126400, 2020.
Kolesnikov and Lampert [2016] Alexander Kolesnikov and Christoph H Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 695–711. Springer, 2016.
Lee et al. [2021a] Haeyun Lee, Kyungsu Lee, Jun Hee Kim, Younghwan Na, Juhum Park, Jihwan P Choi, and Jae Youn Hwang. Local similarity siamese network for urban land change detection on remote sensing images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:4139–4149, 2021.
Lee et al. [2021b] Jungbeom Lee, Jooyoung Choi, Jisoo Mok, and Sungroh Yoon. Reducing information bottleneck for weakly supervised semantic segmentation. Advances in Neural Information Processing Systems, 34:27408–27421, 2021.
Liu et al. [2019] Yifan Liu, Ke Chen, Chris Liu, Zengchang Qin, Zhenbo Luo, and Jingdong Wang. Structured knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2604–2613, 2019.
Liu et al. [2020] Yi Liu, Chao Pang, Zongqian Zhan, Xiaomeng Zhang, and Xue Yang. Building change detection for remote sensing images using a dual-task constrained deep siamese convolutional network model. IEEE Geoscience and Remote Sensing Letters, 18(5):811–815, 2020.
Liu et al. [2022] Mengxi Liu, Zhuoqun Chai, Haojun Deng, and Rong Liu. A cnn-transformer network with multiscale context aggregation for fine-grained cropland change detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15:4297–4306, 2022.
Qin et al. [2021] Dian Qin, Jia-Jun Bu, Zhe Liu, Xin Shen, Sheng Zhou, Jing-Jun Gu, Zhi-Hua Wang, Lei Wu, and Hui-Fen Dai. Efficient medical image segmentation based on knowledge distillation. IEEE Transactions on Medical Imaging, 40(12):3820–3831, 2021.
Shi et al. [2020] Wenzhong Shi, Min Zhang, Rui Zhang, Shanxiong Chen, and Zhao Zhan. Change detection based on artificial intelligence: State-of-the-art and challenges. Remote Sensing, 12(10):1688, 2020.
Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Sublime and Kalinicheva [2019] Jérémie Sublime and Ekaterina Kalinicheva. Automatic post-disaster damage mapping using deep-learning techniques for change detection: Case study of the Tohoku tsunami. Remote Sensing, 11(9):1123, 2019.
Wang and Yoon [2021] Lin Wang and Kuk-Jin Yoon. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE transactions on pattern analysis and machine intelligence, 44(6):3048–3068, 2021.
Wang et al. [2020] Yude Wang, Jie Zhang, Meina Kan, Shiguang Shan, and Xilin Chen. Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12275–12284, 2020.
Wei et al. [2017] Yunchao Wei, Jiashi Feng, Xiaodan Liang, Ming-Ming Cheng, Yao Zhao, and Shuicheng Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1568–1576, 2017.
Wu et al. [2023] Chen Wu, Bo Du, and Liangpei Zhang. Fully convolutional change detection framework with generative adversarial network for unsupervised, weakly supervised and regional supervised change detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
Ye et al. [2021] Su Ye, John Rogan, Zhe Zhu, and J Ronald Eastman. A near-real-time approach for monitoring forest disturbance using landsat time series: Stochastic continuous change detection. Remote Sensing of Environment, 252:112167, 2021.
Zhang et al. [2020a] Chenxiao Zhang, Peng Yue, Deodato Tapete, Liangcun Jiang, Boyi Shangguan, Li Huang, and Guangchao Liu. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing, 166:183–200, 2020.
Zhang et al. [2020b] Man Zhang, Yong Zhou, Jiaqi Zhao, Yiyun Man, Bing Liu, and Rui Yao. A survey of semi-and weakly supervised semantic segmentation of images. Artificial Intelligence Review, 53:4259–4288, 2020.
Zhang et al. [2022] Cui Zhang, Liejun Wang, Shuli Cheng, and Yongming Li. SwinSUNet: Pure transformer network for remote sensing image change detection. IEEE Transactions on Geoscience and Remote Sensing, 60:1–13, 2022.
Zheng et al. [2021] Zhuo Zheng, Yanfei Zhong, Junjue Wang, Ailong Ma, and Liangpei Zhang. Building damage assessment for rapid disaster response with a deep object-based semantic change detection framework: From natural disasters to man-made disasters. Remote Sensing of Environment, 265:112636, 2021.
Zhou et al. [2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2921–2929, 2016.