
TFNet: Tuning Fork Network with Neighborhood Pixel Aggregation for Improved Building Footprint Extraction

Muhammad Ahmad Waseem, Muhammad Tahir, Zubair Khalid, and Momin Uppal. The authors are with the Department of Electrical Engineering, Lahore University of Management Sciences, DHA Lahore Cantt., 54792, Lahore, Pakistan. This work was supported by the Higher Education Commission (HEC) of Pakistan through Grand Challenge Fund Grant No. GCF-521. Corresponding author: Momin Uppal, e-mail: [email protected]
Abstract

This paper considers the problem of extracting building footprints from satellite imagery – a task that is critical for many urban planning and decision-making applications. While recent advancements in deep learning have made great strides in the automated detection of building footprints, state-of-the-art methods in the existing literature often generate erroneous results for areas with densely connected buildings. Moreover, these methods do not incorporate the context of neighboring images during training, generally resulting in poor performance at image boundaries. In light of these gaps, we propose a novel Tuning Fork Network (TFNet) design for deep semantic segmentation that performs well not only for widely-spaced buildings but also for buildings that are closely packed together. The novelty of the TFNet architecture lies in a single encoder followed by two parallel decoders that separately reconstruct the building footprint and the building edge. In addition, the TFNet design is coupled with a novel methodology for incorporating neighborhood information at the tile boundaries during the training process. This methodology further improves performance, especially at the tile boundaries. For performance comparisons, we utilize the SpaceNet2 and WHU datasets, as well as a dataset from an area in Lahore, Pakistan that captures closely connected buildings. For all three datasets, the proposed methodology is found to significantly outperform benchmark methods.

Index Terms:
Building Footprint Extraction, Deep Learning, Semantic Segmentation, Satellite Imagery, Neighborhood Pixel Aggregation, Remote Sensing, Urban data.

I Introduction

The availability of building footprints is crucial for informed urban planning and decision-making. They are used by urban planners for applications that include, but are not limited to, temporal change detection, population density estimation, public services planning, infrastructure management, and disaster damage assessment. The traditional method of generating these footprints involve tedious onsite measurements and surveys, which are extremely costly as well as time consuming. But recent advancements in artificial intelligence coupled with ready access to high resolution satellite imagery has enabled development of automated toolsets that promise significant time and cost savings over traditional onsite surveys.

The availability of open-source building footprint datasets such as Toronto City [1], INRIA [2], WHU [3], ISPRS [4], SpaceNet [5], xBD [6], and AIcrowd [7] has allowed researchers to train deep learning models for building footprint extraction (BFE) that perform reasonably well. Indeed, the availability of these benchmark datasets has boosted research on automatic BFE from satellite imagery, with numerous studies published in the last few years. The backbone of many of these studies, as will be discussed in Section II, is deep segmentation models such as DeepLabV3+ [8]. Despite these advances, automated BFE from satellite imagery remains a challenging problem for various reasons. For instance, buildings in different regions are characterized by high variability in shapes, sizes, and textures, as well as occlusions due to surroundings or shadows. Another problem that arises with the use of semantic segmentation models is that they usually struggle to separate closely packed instances of building footprints in which the number of pixels separating two adjacent buildings is minimal. This situation arises especially in developing countries. An instance of this scenario is depicted in the first row of Fig. 1, indicating how state-of-the-art methods such as [9] fail to distinguish closely packed buildings. Another challenge arises from the tiling of large tiff files – a step that becomes necessary for training deep learning models. In essence, splitting a large file into tiles breaks the spatial relationships at the boundaries, thus impacting the building structures residing there. An instance of this scenario is depicted in the second row of Fig. 1, in which the benchmark state-of-the-art method fails to identify some buildings at the tile boundaries.

[Figure 1: four columns – Satellite Image, GT Mask, SOTA Prediction [9], Our Prediction – for two cases: Connected Buildings (top row) and Image Tiling (bottom row).]
Figure 1: Comparison of our proposed methodology with state-of-the-art (SOTA) methods. In the first row, we show how SOTA model(s) fail to distinguish closely connected buildings from SpaceNet’s Khartoum region, while our proposed design can effectively handle such situations. In the second row, we show the problems introduced at the image borders (blue grid lines show the division policy between consecutive tiles from the WHU Aerial dataset). The standard approach to train the model usually fails to perform well at such partially-cut buildings. Using our pre-processing pipeline, deep models can easily capture such dynamics.

In this paper, we address both problems mentioned above. For improved performance on densely packed buildings, we propose a new semantic segmentation network based on the standard DeepLabV3+ architecture. The model comprises a single encoder and two parallel decoders that separately handle reconstruction of (a) the building footprint and (b) the building edge. This is motivated by our understanding that, along with building mask generation, edge mask generation is also essential in extracting building footprints, and therefore both must be accorded due importance in the training process. While some studies (e.g., [9]) have tried to incorporate building edge information into the training framework, they do so while sticking with a generic single encoder-decoder design. In contrast, considering the distinct behaviors of the building segmentation and edge segmentation tasks, we propose that they be handled by two separate decoders working on the high-dimensional features output by a single encoder. Given this tuning fork-like arrangement, we coin the term Tuning-Fork Network (TFNet) for this novel design, which is found to perform well not only for scenarios with closely-packed buildings but also for widely spaced building construction patterns. In addition to the improved TFNet design, we also propose a novel methodology that incorporates neighborhood pixels at the tile boundaries during training. The aim of this pre-processing pipeline, termed Neighborhood Pixel AGGregation (NePAGG), is to avoid the loss of spatial connectivity at the image borders. Armed with these two proposed novelties, we test our design on the WHU and SpaceNet2 benchmark datasets as well as a dataset we create from a region in the city of Lahore, Pakistan. The proposed model is shown to significantly outperform existing benchmark methods for all datasets.

In summary, the overall contributions of this work are as follows:

  1. We propose a novel pre-processing pipeline, named NePAGG, which incorporates spatial neighbourhood information during training.

  2. We propose a novel architecture for BFE, namely TFNet, that uses building edge information in addition to the standard footprint. We find that this results in improved detection performance.

  3. We create a new challenging dataset for footprint extraction for the city of Lahore in Pakistan, which is publicly available at https://github.com/Muhammad-Ahmad-Waseem/TF-Net/tree/main.

  4. The proposed methodology is shown to achieve excellent performance on the standard SpaceNet [10] and WHU [11] datasets, as well as the dataset we create for Lahore, Pakistan.

The remainder of this paper is organized as follows. Section II provides a review of related work appearing in recent literature. Section III provides details of the proposed TFNet model and the NePAGG pre-processing methodology. This is followed by details of the experiments utilized for performance evaluation in Section IV, while Section V presents the evaluation results. Section VI concludes the paper.

Figure 2: Proposed Methodology. We first pre-process the images to include neighborhood pixels for model training. The pre-processed image is used during training, while the loss is computed on the cropped part (the original pixel locations before pre-processing). The final loss is the sum of the focal losses computed on the edge and mask predictions.

II Related Work

As mentioned earlier, the BFE problem has been well studied in the literature. Initially, studies attempting BFE from satellite imagery relied on classical computer vision methods [12], [13], [14], [15]. Although these classical approaches needed less training data, they struggled to deliver effective results since they were unable to handle complicated scenarios, e.g., varying architectural designs and the similarity of building features with other features such as roads. These classical approaches were followed by studies that use the generalization ability of deep learning models such as Convolutional Neural Networks (CNNs) to develop effective deep computer vision approaches for BFE [16], [17], [18], [9], [19], [20], [21], [22], [23]. In recent years, semantic segmentation models have shown immense potential in providing effective solutions to automated BFE. Indeed, the winning solutions for most of the open challenges (such as SpaceNet2 and SpaceNet7) have been based on semantic segmentation models. Many authors have used baseline segmentation models, such as UNet [24], SegNet [25], FC-DenseNet [26], DeepLabV3+ [8], and HRNet [27], to extract building footprints from high-resolution satellite imagery. In addition to these, modified architectures for the encoder, decoder, or bottleneck modules have also been proposed that address the problem of semantic segmentation specifically for BFE [28], [29], [30]. An important example of the utilization of segmentation models is Google's Open Buildings dataset [31]. The dataset, now publicly available for Africa, South Asia, South-East Asia, Latin America, and the Caribbean, is very rich. The first version of Google's dataset was generated using a simple U-net architecture [32]. However, details of the model used to generate the latest version of the dataset, released very recently, are not publicly available. Similarly, the satellite imagery corresponding to this dataset is also unavailable, which makes it difficult to train new models on these footprints because of satellite imagery offsets [33]. This limits the applicability of the publicly available information for applications such as urban change detection and disaster damage assessment.

Although simple semantic segmentation models have shown excellent performance in extracting building footprints, many experiments show that these predicted masks are in need of improvement when the buildings are packed close together such that the building boundary is not clearly visible. To solve this problem and improve the performance in such locations, many authors have proposed to use building edge information during training. For example, [34] proposed a signed distance transform (SDT) that computes for each pixel the distance from the building boundary. The SDT is then encoded into a finite number of classes followed by treating it as a multi-class segmentation problem. On the other hand, the work in [9] proposed a weighted edge mask based on building boundary that assigns a higher weight to the loss generated from pixels residing at the building boundary. A related work is [18] that achieves improved performance using attraction field maps that are generated based on building boundaries during the training process. These methods show the importance of incorporating building edge information during the training of deep learning models for BFE. To the best of our knowledge though, the TFNet design consisting of a single encoder followed by two separate decoders for building and edge segmentation has not been explored in the literature before.

III Proposed Methodology

Deep-learning-based methods for BFE from satellite imagery involve splitting large raster and vector files into small tiles, followed by training of the deep learning model on those small tiles for predicting footprints. Most proposals in the literature follow a standardized method of splitting the larger raster into non-overlapping tiles, with each tile fed into the deep learning model independently of the others. A problem with this approach is that it fails to account for the spatial relationships between the edges of adjacent tiles. It is to address this issue that we propose a novel pipeline, called NePAGG, to incorporate neighborhood information of each tiled image. Secondly, to improve BFE especially for densely packed buildings, we propose TFNet, a novel single-encoder, dual-decoder network that incorporates building edge information during training. An overview of the proposed methodology is shown in Fig. 2, with the two novel components described in the subsections below.

III-A NePAGG Pipeline

NePAGG, as the name suggests, is the proposed technique that incorporates neighborhood information of each tile while training the deep learning model. To illustrate the loss of spatial information in traditional tiling, we refer the reader to the example shown in the bottom row of Fig. 1, in which a larger raster is split into four non-overlapping tiles. As can be seen, several buildings are present at the tile boundaries. Since training and prediction on each tile is carried out independently of the others, the conventional method of using non-overlapping tiles results in the deep network seeing only partial building footprints at the boundaries, thus impairing its learning and prediction capability for these buildings. It is to overcome this limitation that we propose the simple yet effective NePAGG methodology. As seen in Fig. 2, NePAGG utilizes overlapping tiles obtained by including neighborhood pixels at the boundaries. In particular, if the size of each tile with conventional non-overlapping tiling is W×H pixels, NePAGG picks up an extra k pixels in the neighborhood of each boundary to obtain an augmented tile of size (W+2k)×(H+2k) pixels. This augmented tile is then used as input to the deep learning model, which outputs segmentation masks of the same size. The segmentation masks are then cropped to the original size of W×H pixels before computing the training loss. This pipeline ensures that some neighborhood information is available at the boundaries of each image tile, resulting in better prediction.
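As a concrete illustration, the following minimal Python sketch shows how an augmented NePAGG tile can be cut from a large raster held in memory as a NumPy array. The function and variable names (nepagg_tile, raster, tile_size) are illustrative assumptions rather than the paper's actual code, and border tiles are handled here with reflection padding, which is one of several reasonable choices.

    import numpy as np

    def nepagg_tile(raster: np.ndarray, row: int, col: int, tile_size: int, k: int) -> np.ndarray:
        # raster: full scene of shape (H_total, W_total, C); (row, col) is the top-left
        # corner of the conventional non-overlapping tile of size tile_size x tile_size.
        padded = np.pad(raster, ((k, k), (k, k), (0, 0)), mode="reflect")
        # After padding, the original pixel (row, col) sits at (row + k, col + k), so the
        # augmented (tile_size + 2k) x (tile_size + 2k) tile simply starts k pixels earlier.
        return padded[row:row + tile_size + 2 * k, col:col + tile_size + 2 * k, :]

    # Example: 512 x 512 base tiles with k = 64 neighborhood pixels give 640 x 640 inputs.
    # tile = nepagg_tile(raster, row=1024, col=2048, tile_size=512, k=64)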

The choice of an appropriate k in NePAGG depends on the resolution of the satellite imagery. A very large value of k will produce large images as input to the deep learning model, resulting in inefficient utilization of memory resources. On the other hand, a very small value of k may not be enough to reap the benefits of neighborhood pixel aggregation. An appropriate tradeoff is struck when k is just large enough for complete building footprints to be included in the augmented tile. For instance, at a 30 cm ground resolution, k = 83 pixels corresponds to roughly 25 m of additional ground context on each side of the tile.

Figure 3: TFNet Detailed Architecture. The encoder is a dilated ResNet encoder, while each decoder has a standard architecture with ASPP modules, as proposed in the DeepLabV3+ model [8].

III-B TFNet Model Design

As discussed earlier, the standard single encoder-decoder designs of semantic segmentation models usually fail to perform well when the distance between consecutive building instances is small. In such cases, there is a need to focus the model's predictions on the building edge. As mentioned in Section II, existing literature attempts to incorporate building edge information in the training pipeline through a variety of methods. In contrast to existing work, the proposed TFNet model involves a novel architecture composed of a single encoder and two decoders: one generates the building masks while the other separately generates the building edge masks.

The basic function of a decoder in a segmentation network is to learn the projection of the dense features learned by the encoder for the generation of segmentation masks. Introducing two separate decoders allows the model to learn this projection separately for the building masks and the building edges. Moreover, since we train the model in an end-to-end fashion, both decoders also aid each other in the learning process through the back-propagation feedback loop, which also results in the encoder learning a more refined feature space. It is for these reasons that we expect our model to efficiently handle footprint extraction in high-density areas. Indeed, this is verified by the evaluation results presented in subsequent sections.

Considering the ability of atrous (or dilated) convolutions to better capture local and global relationships by providing an increased receptive field, TFNet uses a dilated ResNet network (as proposed in the DeepLabV3+ segmentation model [8]) as the backbone architecture for extracting semantic features from images. Similarly, each decoder uses an atrous spatial pyramid pooling (ASPP) module, which allows several parallel atrous convolutions with different rates. This helps the model preserve coarse, high-level semantic features with minimal computation [8]. Both decoder networks share exactly the same architecture, with the only difference being the corresponding ground truth used for training. The detailed architecture is shown in Fig. 3.
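A simplified PyTorch sketch of this single-encoder, dual-decoder arrangement is shown below. It uses torchvision's dilated ResNet-50 and its DeepLabV3 ASPP head as stand-ins for the encoder and decoders; it omits the low-level feature fusion of the full DeepLabV3+ decoder and is therefore only an approximation of the architecture in Fig. 3, not the exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import resnet50
    from torchvision.models.segmentation.deeplabv3 import DeepLabHead

    class TFNetSketch(nn.Module):
        def __init__(self):
            super().__init__()
            # Dilated ResNet-50 backbone: the last two stages use atrous convolution
            # instead of striding, as in DeepLabV3+.
            backbone = resnet50(weights=None, replace_stride_with_dilation=[False, True, True])
            self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool / fc
            # Two identical ASPP-based decoder heads: one for footprints, one for edges.
            self.mask_head = DeepLabHead(2048, 1)
            self.edge_head = DeepLabHead(2048, 1)

        def forward(self, x):
            size = x.shape[-2:]
            feats = self.encoder(x)                     # shared dense features
            mask = self.mask_head(feats)                # building-footprint logits
            edge = self.edge_head(feats)                # building-edge logits
            # Upsample both predictions back to the input resolution.
            mask = F.interpolate(mask, size=size, mode="bilinear", align_corners=False)
            edge = F.interpolate(edge, size=size, mode="bilinear", align_corners=False)
            return mask, edge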

III-C Training Strategy

As shown in Fig. 2, We propose an end-to-end pipeline for training of the deep learning model. The augmented tiles of size (W+2k)×(H+2k)(W+2k)\times(H+2k) pixels are passed to the deep learning model. Since the semantic segmentation models are based only on convolutional layers, each decoder produces an output of exactly the same dimension as the input image. The images at the output of the two decoders are cropped to obtain the predicted segmentation masks of buildings and edges, each of size W×HW\times H pixels. Each of the predicted segmentation masks is compared to its corresponding ground truth to compute a focal loss. The sum of these two losses is then used to train the network in an end-to-end fashion.
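The following sketch illustrates one such training update, assuming the model returns (mask_logits, edge_logits) as in the architecture sketch above and that torchvision's sigmoid focal loss is an acceptable stand-in for the focal loss used in the paper; the function and variable names are illustrative.

    from torchvision.ops import sigmoid_focal_loss

    def training_step(model, batch, optimizer, k: int):
        # batch: augmented image of shape (B, 3, H + 2k, W + 2k) together with float
        # ground-truth footprint and edge masks in {0, 1} of shape (B, 1, H, W).
        image, mask_gt, edge_gt = batch
        mask_logits, edge_logits = model(image)
        # Crop the predictions back to the original W x H footprint before the loss.
        mask_logits = mask_logits[..., k:-k, k:-k]
        edge_logits = edge_logits[..., k:-k, k:-k]
        loss = (sigmoid_focal_loss(mask_logits, mask_gt, reduction="mean")
                + sigmoid_focal_loss(edge_logits, edge_gt, reduction="mean"))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()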

IV Experimentation

IV-A Datasets

For performance comparisons, we utilize two benchmark datasets. In addition, we also provide a new dataset for Pakistan which is primarily meant to gauge performance in high-density urban areas. Some details about these datasets follow.

IV-A1 SpaceNet2

SpaceNet [5] currently has three challenges that cover building footprints: SN1, SN2, and SN7. For this study, we use the most commonly used SN2 dataset, which provides building footprints for four Areas of Interest (AOI) around the globe: Las Vegas, Paris, Shanghai, and Khartoum. Each training tile in the dataset covers 200 m × 200 m on the ground, and for each training tile, a 30 cm resolution geo-referenced tiff file of size 650×650 pixels is provided. In addition, a geojson file containing ground truth markings for that region is also provided. The datasets are available online at [10].

IV-A2 WHU Aerial Images

This dataset comes from Land Information New Zealand (LINZ), who generated it by manually editing Christchurch's building vector data comprising about 22,000 independent buildings. The original ground resolution of the images is 0.075 m, but most of the aerial images are downsampled to a ground resolution of 0.3 m and subsequently cropped into 8,180 tiles of size 512×512 pixels each. The shapefile is also rasterized. The ready-to-use samples are divided into three parts:

  • A training set (130,500 buildings).

  • A validation set (14,500 buildings).

  • A test set (42,000 buildings), which is further divided into Test1 and Test2 subsets.

Accompanying the images is a manually edited shapefile corresponding to the whole area. Since the dataset does not provide geo-referenced rasters, we used the order of cropping and the area-split shapefiles (available on the website) to recover the Geo-Tiff images and geojson masks for each image. The data can be downloaded from the LINZ official website [11].

IV-A3 Lahore DHA Dataset

We create a new dataset of building footprints for Pakistan that captures the complex and unique dynamics of the area. The dataset, illustrated in Fig. 4, consists of 30 cm per pixel resolution images for a portion of the Defense Housing Authority (DHA) area in Lahore, Pakistan. The dataset comes with corresponding annotations of 25,631 building footprints covering an approximate area of 31 square km. The said area is divided into tiles of size 512×512 pixels, with the tiles then randomly split into training and testing portions with the following proportion:

  • Train Tiles: 1675, roughly 84% of the total.

  • Test Tiles: 315, roughly 16% of the total.

Figure 4: Lahore DHA dataset details. The green tiles represent the ones used for training while the red ones indicate those used for testing. A sample image from one of the tiles and the corresponding ground truth mask (for both the building footprint as well as the edges) are also shown.

IV-B Training Parameters

Our experiments are conducted in the PyTorch framework on an NVIDIA RTX 3090 GPU with 24 GB of memory. For model training, the remote sensing images are tiled and processed through the proposed NePAGG pipeline explained in Section III before being fed into the training pipeline. The tile sizes used for training the models on each of the above datasets are described in the text below.

IV-B1 SpaceNet2

As described earlier, the SpaceNet dataset provides images of size 650×650 pixels. To make these divisible by 16, we perform zero padding at the edges to convert them to images of size 656×656 pixels. For benchmarking the performance gain due to NePAGG, we also train the TFNet model without it; for this purpose, the zero-padded image tiles of size 656×656 are used. When utilizing NePAGG, we fix k=83 so that the final image tile used for training is of size 816×816 pixels.

IV-B2 WHU

The WHU dataset provides images of size 512×512 pixels. For training with NePAGG, we set k=64 for this dataset to obtain a final image of size 640×640 pixels.

IV-B3 Lahore DHA

The dataset for Lahore DHA has exactly the same specifications as the WHU dataset, so we chose the same image sizes, i.e., 640×640 with NePAGG and 512×512 without.

In addition to the segmentation masks for buildings in each dataset, we automatically generate building edge masks by selecting a few pixels from the boundaries of the building polygons. A sample depiction of this for the Lahore DHA dataset is shown in Fig. 4. All models are trained for 150 epochs on each dataset, and the optimizer is stochastic gradient descent (SGD) with a learning rate of 0.0001. The training batch size of all models is set to 8. As mentioned earlier, the Focal Loss is used as a loss function for our model.
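Since the paper describes generating the edge masks from the building polygon boundaries but does not spell out the exact procedure, the sketch below shows one plausible way to rasterize such edge masks with rasterio and shapely: each polygon boundary is buffered to a width of a few pixels and burned onto the tile grid. The parameter names (width_px, pixel_size) are illustrative assumptions.

    from rasterio import features
    from shapely.geometry import shape

    def edge_mask_from_polygons(polygons, transform, out_shape, width_px=3, pixel_size=0.3):
        # polygons: GeoJSON-like building geometries in the tile's CRS;
        # transform / out_shape: the tile's affine transform and (height, width).
        buffered = [shape(geom).boundary.buffer(width_px * pixel_size / 2.0) for geom in polygons]
        edge_mask = features.rasterize(
            ((geom, 1) for geom in buffered if not geom.is_empty),
            out_shape=out_shape, transform=transform, fill=0, dtype="uint8")
        return edge_mask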

TABLE I: Quantitative Comparison on the SpaceNet2 Dataset. Accuracy of different methods on different cities of SN2. We train a single model for all cities. The results correspond to the whole train split provided by SpaceNet2, as ground-truth markings for the test set are not available.
Metric Method AOI
Vegas Paris Shanghai Khartoum
Precision Deeplabv3+[8] 89.37 64.40 60.00 55.22
rgb footprint [9] 90.89 75.93 66.99 66.12
TFNet w/o NePAGG 95.59 83.46 79.28 77.04
TFNet with NePAGG 97.82 91.56 88.32 87.14
Recall Deeplabv3+[8] 77.23 48.22 35.35 35.81
rgb footprint [9] 86.83 66.67 49.57 48.52
TFNet w/o NePAGG 88.19 72.08 61.18 61.77
TFNet with NePAGG 92.97 83.08 74.40 75.43
F1 Score Deeplabv3+[8] 82.85 55.15 44.49 43.44
rgb footprint [9] 88.81 71.00 56.97 55.97
TFNet w/o NePAGG 91.74 77.35 69.06 68.57
TFNet with NePAGG 95.33 87.11 80.76 80.86
TABLE II: Quantitative Comparison on the WHU Dataset. Accuracy of different methods on different splits of WHU. We train the model using the Train split and validate the results using the Val split of the dataset. No image from the Test or Test2 split was used during the training of any of the provided models.
Metric Method Split
Train Val Test Test2
Precision Deeplabv3+[8] 92.65 93.39 92.84 93.31
rgb footprint [9] 94.22 91.89 89.12 92.27
TFNet w/o NePAGG 95.94 95.05 93.54 95.23
TFNet with NePAGG 98.57 96.00 94.43 96.09
Recall Deeplabv3+[8] 72.01 75.63 70.00 72.75
rgb footprint [9] 83.91 85.34 80.79 84.18
TFNet w/o NePAGG 84.34 85.25 80.36 84.26
TFNet with NePAGG 91.27 89.74 85.40 89.31
F1 Score Deeplabv3+[8] 81.04 83.57 79.82 81.76
rgb footprint [9] 88.77 88.49 84.75 88.04
TFNet w/o NePAGG 89.77 89.88 86.45 89.40
TFNet with NePAGG 94.78 92.77 89.69 92.58
TABLE III: Quantitative Comparison on the Lahore DHA Dataset. Accuracy of different methods on different splits of the Lahore DHA dataset. We train the model using the Train split. No image from the Test split was used during training of any of the provided models.
Metric Method Split
Train Test
Precision Deeplabv3+[8] 34.81 35.05
rgb footprint [9] 63.39 46.51
TFNet w/o NePAGG 99.62 71.98
TFNet with NePAGG 99.73 78.70
Recall Deeplabv3+[8] 21.45 20.56
rgb footprint [9] 31.90 25.25
TFNet w/o NePAGG 97.68 68.92
TFNet with NePAGG 98.19 76.45
F1 Score Deeplabv3+[8] 26.55 25.92
rgb footprint [9] 42.44 32.73
TFNet w/o NePAGG 98.64 70.41
TFNet with NePAGG 98.95 77.56

IV-C Evaluation Metrics

Most existing methods for BFE using semantic segmentation models provide pixel-based accuracy for performance evaluation. Although it is a good metric for evaluation of pixel-based prediction tasks, it does not provide adequate classification accuracy information at the building level. For this purpose, we resort to the standard metrics used in SpaceNet challenges that are based on the number of polygons identified correctly/incorrectly.

To evaluate the performance of the models, we use the standard F1 Score computed on the basis of the number of polygons rather than image pixels. For each predicted image, we create polygons by building a geometry over the connected regions of pixels. This can be easily done using the scikit-image processing tools and the rasterio geospatial library available in Python. Let $\mathcal{P}=\{P_1, P_2, \ldots, P_M\}$ be the set of predicted polygons obtained from these images and $\mathcal{L}=\{L_1, L_2, \ldots, L_N\}$ be the set of corresponding ground truth polygons. Each predicted polygon is labelled as a True Positive or a False Positive based on the Intersection over Union (IoU) score between the predicted and ground truth polygons. Details of this evaluation method, as used in SpaceNet's challenges, are provided in Algorithm 1.
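Before the matching in Algorithm 1 is applied, the predicted masks must first be polygonized; a minimal sketch of this step using rasterio and shapely is given below. The probability threshold and minimum-area filter are illustrative choices, not values specified in the paper.

    import numpy as np
    from rasterio import features
    from shapely.geometry import shape

    def mask_to_polygons(pred_prob: np.ndarray, transform, threshold=0.5, min_area=10.0):
        binary = (pred_prob >= threshold).astype(np.uint8)
        polygons = []
        # rasterio.features.shapes yields (GeoJSON geometry, value) pairs over connected
        # regions of equal value; we keep only the building (value == 1) regions.
        for geom, value in features.shapes(binary, mask=binary.astype(bool), transform=transform):
            if value == 1:
                poly = shape(geom)
                if poly.area >= min_area:
                    polygons.append(poly)
        return polygons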

Algorithm 1 SpaceNet Evaluation Method
1: $i \leftarrow 1$
2: $\mathcal{P} \leftarrow \{P_1, P_2, \ldots, P_M\}$
3: $\mathcal{L} \leftarrow \{L_1, L_2, \ldots, L_N\}$
4: while $i \leq M$ do
5:     $S_i \leftarrow \max_{j \in \{1,\ldots,N\}} \mathrm{IoU}(P_i, L_j)$
6:     $k \leftarrow \arg\max_{j \in \{1,\ldots,N\}} \mathrm{IoU}(P_i, L_j)$
7:     if $S_i \geq 0.5$ then
8:         $P_i$ is a True Positive
9:         Remove $L_k$ from $\mathcal{L}$
10:    else
11:        $P_i$ is a False Positive
12:    end if
13:    $i \leftarrow i + 1$
14: end while

Once the predicted polygons are classified into true positives and false positives, all the remaining polygons in the list \mathcal{L} represent missed detections / false negatives. The calculations for IoU, precision, recall and F1 Score are done using standard formulas as shown below:

\mathrm{IoU}(P_i, L_j) = \frac{A(P_i \cap L_j)}{A(P_i \cup L_j)}    (1)
\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}    (2)
\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}    (3)
\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}    (4)

Here, A(P) represents the area of a polygon P, while TP, FP, and FN represent the total number of true positives, false positives, and false negatives, respectively.
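For clarity, a direct Python rendering of Algorithm 1 together with Eqs. (1)–(4), assuming the predicted and ground-truth polygons are available as shapely geometries, is sketched below; the official SpaceNet tooling should be preferred when reporting scores.

    def evaluate(pred_polygons, gt_polygons, iou_threshold=0.5):
        remaining = list(gt_polygons)            # ground-truth polygons not yet matched
        tp = fp = 0
        for pred in pred_polygons:
            best_iou, best_idx = 0.0, -1
            for j, gt in enumerate(remaining):
                union = pred.union(gt).area
                iou = pred.intersection(gt).area / union if union > 0 else 0.0   # Eq. (1)
                if iou > best_iou:
                    best_iou, best_idx = iou, j
            if best_iou >= iou_threshold:
                tp += 1
                remaining.pop(best_idx)          # each ground-truth building matches at most once
            else:
                fp += 1
        fn = len(remaining)                      # unmatched ground truth = missed detections
        precision = tp / (tp + fp) if (tp + fp) else 0.0                          # Eq. (2)
        recall = tp / (tp + fn) if (tp + fn) else 0.0                             # Eq. (3)
        f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0  # Eq. (4)
        return precision, recall, f1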

V Results and Comparisons

As discussed earlier, we have chosen three different datasets for comparison. For a fair analysis, we compare the performance of our method with methods based on the DeepLabV3+ architecture, which is considered a state-of-the-art semantic segmentation network. The first method that we compare with is a standard DeepLabV3+ model with focal loss [8]. We also show a comparison with RGB Footprints by Jiwani et al. [9], which incorporates an edge-based loss with a standard DeepLabV3+ architecture. In addition to these, we also show the performance of our proposed TFNet design without the NePAGG pre-processing pipeline. To make the comparisons consistent, all of the above benchmark methods are also trained without NePAGG, while our final proposed model is trained using both NePAGG and the TFNet design. In this way, we can quantitatively demonstrate the importance of each of the proposed components. The performance of each of the described methods is measured using the evaluation metrics and training parameters described in Section IV.

V-A Comparison on the SpaceNet2 Dataset

For the SpaceNet2 dataset, we combine all the training images from the four AOIs into a single dataset and train the model on it. An 80% split of this combined dataset is used for training, while the remaining 20% is used for validation and testing. Since SN2 does not provide ground-truth markings for the test sets, we only show the performance on the training portion of each AOI. The results for each AOI are shown in Table I. It can be seen that while the proposed TFNet architecture alone already improves performance on all of the AOIs, the overall TFNet pipeline with NePAGG raises performance to a new high. For each AOI, the proposed methodology obtains an F1 score of more than 80%, with an average F1 score of 86%, which is significantly superior to the benchmark methods. We also note that the performance gain of the proposed methodology over that of [9] – the closest performing benchmark in the literature – is considerably higher for the cities of Shanghai and Khartoum than for Las Vegas and Paris. We posit that this is because the buildings in the Shanghai and Khartoum AOIs are packed much more closely together than in the other two cities.

V-B Comparison on the WHU Dataset

For the WHU dataset, we use the original Train and Val splits for training and validation of the model, respectively. The other two splits, i.e., Test and Test2, were not used for training and were only used to compute accuracy scores on unseen data. The results on each of the splits are shown in Table II. Similar to the SpaceNet2 dataset, one can see from the table that while TFNet alone adds a performance boost on all of the splits, the proposed NePAGG plus TFNet design outperforms all of the provided methods on all of the data splits. The results also indicate that our method not only performs well on the training data but also shows improved performance over unseen areas.

V-C Comparison on the Lahore DHA Dataset

For the Lahore DHA dataset we created ourselves, we only have two data splits, i.e., Train and Test. We use the Train split for training and do not use a validation set for this dataset. The Test split is not used for training and is only used to check the score on unseen data. The results are shown in Table III. The low score of the standard DeepLabV3+ [8] model shows how challenging the dataset is for SOTA segmentation models. Compared to the benchmark methodologies in the literature, the proposed TFNet plus NePAGG methodology performs extremely well even in such complicated scenarios. We also note that the performance benefit due to the TFNet architecture is much higher than the additional gain obtained by NePAGG.

VI Conclusion

Considering the problem of BFE from satellite imagery, this paper proposes a novel methodology consisting of TFNet, a tuning-fork-like encoder-decoder architecture, coupled with NePAGG, a neighborhood pixel aggregation pre-processing methodology. The TFNet architecture consists of a single encoder followed by two decoders (all based on the DeepLabV3+ architecture) to separately detect building masks and building edges. This architecture specifically allows it to perform well in areas with densely connected buildings. On the other hand, NePAGG allows the model to incorporate spatial relationships of neighboring images (at the tile boundaries) during the training phase. For performance evaluation, we utilize polygon-based F1 scores and illustrate the effectiveness of the proposed methodology on the standard SpaceNet2 and WHU datasets, as well as a dataset we create ourselves for the city of Lahore in Pakistan. A potential direction for future research is to explore incorporating deep features of neighboring images directly during training without altering the actual image size. Moreover, BFE for regions with unstructured / informal settlements remains a challenging task and a promising direction for future research.

References

  • [1] S. Wang, M. Bai, G. Mattyus, H. Chu, W. Luo, B. Yang, J. Liang, J. Cheverie, S. Fidler, and R. Urtasun, “Torontocity: Seeing the world with a million eyes,” arXiv preprint arXiv:1612.00423, 2016.
  • [2] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, “Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark,” in 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS).   IEEE, 2017, pp. 3226–3229.
  • [3] S. Ji, S. Wei, and M. Lu, “Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 1, pp. 574–586, 2018.
  • [4] “ISPRS 2D Semantic Labeling Contest,” accessed: May 27, 2022 [online]. Available: http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html.
  • [5] A. Van Etten, D. Lindenbaum, and T. M. Bacastow, “Spacenet: A remote sensing dataset and challenge series,” arXiv preprint arXiv:1807.01232, 2018.
  • [6] R. Gupta, R. Hosfelt, S. Sajeev, N. Patel, B. Goodman, J. Doshi, E. Heim, H. Choset, and M. Gaston, “xbd: A dataset for assessing building damage from satellite imagery,” 2019. [Online]. Available: https://arxiv.org/abs/1911.09296
  • [7] S. P. Mohanty, J. Czakon, K. A. Kaczmarek, A. Pyskir, P. Tarasiewicz, S. Kunwar, J. Rohrbach, D. Luo, M. Prasad, S. Fleer et al., “Deep learning for understanding satellite imagery: An experimental survey,” Frontiers in Artificial Intelligence, vol. 3, 2020.
  • [8] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
  • [9] A. Jiwani, S. Ganguly, C. Ding, N. Zhou, and D. M. Chan, “A semantic segmentation network for urban-scale building footprint extraction using rgb satellite imagery,” arXiv preprint arXiv:2104.01263, 2021.
  • [10] “Spacenet.ai,” Available at https://spacenet.ai/datasets (accessed May 04, 2023).
  • [11] “Whu aerial data,” Available at https://data.linz.govt.nz/layer/51932-christchurch-post-earthquake-01m-urban-aerial-photos-24-february-2011/s (accessed May 04, 2023).
  • [12] A. Ok, C. Senaras, and B. Yuksel, “Automated detection of arbitrarily shaped buildings in complex environments from monocular vhr optical satellite imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, pp. 1701–1717, 03 2013.
  • [13] X. Huang, L. Zhang, and T. Zhu, “Building change detection from multitemporal high-resolution remotely sensed images based on a morphological building index,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 1, pp. 105–115, 2014.
  • [14] Q. Zhang, X. Huang, and G. Zhang, “A morphological building detection framework for high-resolution optical imagery over urban areas,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 9, pp. 1388–1392, 2016.
  • [15] X. Gao, M. Wang, Y. Yang, and G. Li, “Building extraction from rgb vhr images using shifted shadow algorithm,” IEEE Access, vol. 6, pp. 22 034–22 045, 2018.
  • [16] Q. Wen, K. Jiang, W. Wang, Q. Liu, Q. Guo, L. Li, and P. Wang, “Automatic building extraction from google earth images under complex backgrounds based on deep instance segmentation network,” Sensors, vol. 19, no. 2, p. 333, 2019.
  • [17] K. Zhao, J. Kang, J. Jung, and G. Sohn, “Building extraction from satellite images using mask r-cnn with building boundary regularization,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 247–251.
  • [18] Q. Li, L. Mou, Y. Hua, Y. Shi, and X. X. Zhu, “Building footprint generation through convolutional neural networks with attraction field representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–17, 2021.
  • [19] X. Li, X. Yao, and Y. Fang, “Building-a-nets: Robust building extraction from high-resolution remote sensing images with adversarial networks,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 10, pp. 3680–3687, 2018.
  • [20] Y. Shi, Q. Li, and X. X. Zhu, “Building footprint generation using improved generative adversarial networks,” IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 4, pp. 603–607, 2018.
  • [21] J. Hui, M. Du, X. Ye, Q. Qin, and J. Sui, “Effective building extraction from high-resolution remote sensing images with multitask driven deep neural network,” IEEE Geoscience and Remote Sensing Letters, vol. 16, no. 5, pp. 786–790, 2019.
  • [22] P. Liu, X. Liu, M. Liu, Q. Shi, J. Yang, X. Xu, and Y. Zhang, “Building footprint extraction from high-resolution images via spatial residual inception convolutional neural network,” Remote Sensing, vol. 11, no. 7, 2019. [Online]. Available: https://www.mdpi.com/2072-4292/11/7/830
  • [23] X. Qin, S. He, X. Yang, M. Dehghan, Q. Qin, and J. Martin, “Accurate outline extraction of individual building from very high-resolution optical images,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 11, pp. 1775–1779, 2018.
  • [24] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18.   Springer, 2015, pp. 234–241.
  • [25] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • [26] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio, “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 11–19.
  • [27] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang, “High-resolution representations for labeling pixels and regions,” arXiv preprint arXiv:1904.04514, 2019.
  • [28] Q. Zhu, C. Liao, H. Hu, X. Mei, and H. Li, “Map-net: Multiple attending path neural network for building footprint extraction from remote sensed imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 7, pp. 6169–6181, 2020.
  • [29] J. Chen, Y. Jiang, L. Luo, and W. Gong, “Asf-net: Adaptive screening feature network for building footprint extraction from remote-sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–13, 2022.
  • [30] H. Liu, J. Luo, B. Huang, X. Hu, Y. Sun, Y. Yang, N. Xu, and N. Zhou, “De-net: Deep encoding network for building extraction from high-resolution remote sensing imagery,” Remote Sensing, vol. 11, no. 20, p. 2380, 2019.
  • [31] “Open buildings by google,” Available at https://sites.research.google/open-buildings/ (accessed July 20, 2023).
  • [32] W. Sirko, S. Kashubin, M. Ritter, A. Annkah, Y. S. E. Bouchareb, Y. Dauphin, D. Keysers, M. Neumann, M. Cisse, and J. Quinn, “Continental-scale building detection from high resolution satellite imagery,” arXiv preprint arXiv:2107.12283, 2021.
  • [33] “Why-does-my-gps-data-and-imagery-not-line-up?” Available at https://www.agsgis.com/Advanced-Mobile-Mapping-Series-Why-Does-My-GPS-Data-and-Imagery-Not-Line-Up_b_1062.html (accessed July 20, 2023).
  • [34] J. Yuan, “Learning building extraction in aerial scenes with convolutional networks,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 11, pp. 2793–2798, 2017.