
Single-View Height Estimation with
Conditional Diffusion Probabilistic Models

Abstract

Digital Surface Models (DSM) offer a wealth of height information for understanding the Earth’s surface and monitoring the existence of or change in natural and man-made structures. Classical height estimation requires multi-view geospatial imagery or LiDAR point clouds, which can be expensive to acquire. Single-view height estimation using neural network based models shows promise; however, it can struggle to reconstruct high resolution features. The latest advancements in diffusion models for high resolution image synthesis and editing have yet to be utilized for remote sensing imagery, particularly height estimation. Our approach trains a generative diffusion model to learn the joint distribution of optical and DSM images as a Markov chain. This is accomplished by minimizing a denoising score matching objective while conditioning on the source image to generate realistic high resolution 3D surfaces. In this paper we experiment with conditional denoising diffusion probabilistic models (DDPM) for height estimation from a single remotely sensed image and show promising results on the Vaihingen benchmark dataset.

Index Terms—  diffusion models, digital surface models, remote sensing, geospatial, height estimation

1 Introduction


Fig. 1: An overview of the Image Conditioned Denoising Diffusion Probabilistic Model (DDPM) architecture for single-view height estimation from remotely sensed imagery. Our architecture is a modified form of the DDPM process which conditions each step on the source image, $c$, to generate high resolution predictions.

Height estimation and digital surface model generation from aerial and/or satellite imagery have numerous applications including monitoring change of the Earth [1], humanitarian and disaster response (HADR) [2], and urban planning [3]. This data, in combination with the remotely sensed imagery, can be used to generate 3D approximations of scenes on the Earth. However, in comparison to natural images, remotely sensed imagery is inherently complex due to variations in spatial resolution, off-nadir acquisitions, seasonal changes, clouds, etc. These challenges require powerful models capable of learning through the noise. Recently, diffusion models have risen to the forefront of academic research due to their dramatic improvements in high resolution image synthesis and generation. The experiments conducted in this paper explore the intersection of diffusion models and remotely sensed imagery, particularly for the task of height estimation from a single image.

To our knowledge, no other method has trained a Denoising Diffusion Probabilistic Model (DDPM) to learn the joint distribution of optical and Digital Surface Model (DSM) remotely sensed imagery. Our contributions can be described as follows:


Fig. 2: An overview of the Image Conditioned DDPM pipeline with the U-Net architecture. During inference we sample a random white noise image $\tilde{{\bm{h}}}_{T}$ and iteratively pass the noise image, conditioned with the single source image, $c$, through the U-Net model to generate $\tilde{{\bm{h}}}_{t}$ over some number of time steps, $t$, until we reach the final denoised height output, $\tilde{{\bm{h}}}_{0}$, which we train using the $L_{2}$ loss to approximate the ground truth DSM, ${\bm{h}}$.
  • Image Conditioned Diffusion Models - We use a variation of diffusion models which conditions each step on the source image to guide the generative process.

  • State-of-the-art Single-View Height Estimation Performance - Our novel use of conditional diffusion models for single-view height estimation quantitatively and qualitatively outperforms previous works.

1.1 Diffusion Models

DDPMs [4] are a class of latent variable generative models inspired by Markovian dynamics. Recently, this method has overtaken other generative methods such as Generative Adversarial Networks (GAN) [5], Normalizing Flows [6], and Variational Autoencoders (VAE) [7], achieving state-of-the-art performance on several research topics including image super resolution [8], high-resolution image synthesis [9], and text-to-image generation [10]. Diffusion models are tractable and stable to train and sample from in comparison to these generative methods and appear to have much unexplored potential for adoption in other fields. With interest in diffusion models increasing, remote sensing research has begun to investigate the efficacy of these methods for geospatial and remotely sensed imagery and data [11, 12, 13]. However, diffusion models have yet to be explored for the task of single-view height estimation.

1.2 Single-View Height Estimation

Classical remote sensing height estimation relies on either multiple imagery sources, for use with stereo or multi-view reconstruction and photogrammetry techniques, or LiDAR data. However, the data acquisition and processing for these methods can become costly. As a result, deep learning based methods using only a single view for height estimation have become popular for obtaining inexpensive estimates of pixelwise height in remotely sensed imagery. With that said, estimating height from a single viewpoint is incredibly difficult due to the lack of information. This has resulted in an explosion of research utilizing deep learning techniques to learn from optical imagery and digital surface models, including convolutional neural networks (CNNs) [14, 15], GANs [16], and self-supervised learning [17]. While many methods perform multitask learning to learn both pixelwise height and land cover jointly, in this work we focus purely on methods for height estimation.

2 Methods

In the case of single-view height estimation we seek to optimize $\min_{\theta}\sum_{i=1}^{N}\lVert h_{i}-\hat{h}_{i}\rVert_{2}^{2}$ over the $N$ training pairs by training a black box model, $f_{\theta}$, in this case a U-Net [18] architecture, which takes in an optical image, $c_{i}$, and produces a height estimate, $\hat{h}_{i}=f_{\theta}(c_{i})$.
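As a point of reference, this non-diffusion baseline reduces to plain supervised regression. Below is a minimal sketch under the assumption of a generic encoder-decoder standing in for the full U-Net of [18]; the module and tensor shapes are illustrative, not the exact architecture:

```python
import torch
import torch.nn as nn

# Placeholder for f_theta: any encoder-decoder mapping 3-band optical
# imagery to a 1-band height map would slot in here.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
criterion = nn.MSELoss()  # the L2 objective over height residuals

c = torch.randn(8, 3, 512, 512)  # batch of optical chips
h = torch.randn(8, 1, 512, 512)  # corresponding ground truth DSMs
loss = criterion(model(c), h)
loss.backward()
```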

2.1 Denoising Diffusion Probabilistic Models

To apply DDPMs to single-view height estimation we use variants of the typical diffusion forward process, $q$, and reverse process, $p$. Given a ground truth DSM ${\bm{h}}$, we perform the forward process by iteratively adding Gaussian noise of increasing scale over $T$ steps to ${\bm{h}}$ until we reach a final white noise state $\tilde{{\bm{h}}}_{T}$. Given $\tilde{{\bm{h}}}_{T}$, we seek to generate a height estimate through the reverse process by training a model, $f_{\theta}$, to iteratively denoise the input given a noise level indicator, $\gamma$, until we reach the final denoised approximation of the ground truth, $\tilde{{\bm{h}}}_{0}$. Additionally, we condition the reverse process on a given source optical image, $c$, by concatenating it with the height estimate, $\tilde{{\bm{h}}}_{t}$, at each time step, $t$. The resulting loss function, for which we optimize the network using the $L_{p}$ norm with $p=2$, is given below.
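Concretely, a minimal sketch of this image-conditioned denoising objective, assuming the $\epsilon$-prediction parameterization used in [4, 8]; the expectations over training pairs $(c,{\bm{h}})$ and noise levels $\gamma$ are our notation for illustration, not the original algorithm verbatim:

$\min_{\theta}\;\mathbb{E}_{(c,{\bm{h}})}\,\mathbb{E}_{\epsilon\sim{\mathcal{N}}({\bm{0}},{\bm{I}}),\,\gamma}\left\lVert f_{\theta}\!\left(c,\;\sqrt{\gamma}\,{\bm{h}}+\sqrt{1-\gamma}\,\epsilon,\;\gamma\right)-\epsilon\right\rVert_{2}^{2}$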

In this work we specifically utilize denoising diffusion implicit models (DDIM) [19], a variation of DDPMs that improves the inference sampling speed. To summarize, the DDPM generative process is converted from a Markovian to an implicit probabilistic model by setting $\sigma_{t}=0$ in Equation 1 below, which makes the sampling process deterministic.

Defining $\alpha_{0}=1$ and letting $\epsilon_{t}\sim{\mathcal{N}}({\bm{0}},{\bm{I}})$ be standard Gaussian noise, one can generate a sample ${\bm{x}}_{t-1}$ from a sample ${\bm{x}}_{t}$ under the generative process $p_{\theta}({\bm{x}}_{1:T})$ via:

${\bm{x}}_{t-1}=\sqrt{\alpha_{t-1}}\underbrace{\left(\frac{{\bm{x}}_{t}-\sqrt{1-\alpha_{t}}\,\epsilon_{\theta}^{(t)}({\bm{x}}_{t})}{\sqrt{\alpha_{t}}}\right)}_{\text{predicted }{\bm{x}}_{0}}+\sqrt{1-\alpha_{t-1}-\sigma_{t}^{2}}\cdot\epsilon_{\theta}^{(t)}({\bm{x}}_{t})+\sigma_{t}\epsilon_{t}$ (1)
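To make the reverse process concrete, below is a minimal PyTorch sketch of deterministic DDIM sampling ($\sigma_{t}=0$) conditioned on the source image by channel-wise concatenation, mirroring the pipeline in Figure 2. The `model` signature, the cumulative-product schedule `alphas`, and the tensor shapes are illustrative assumptions, not the authors' released implementation:

```python
import torch

@torch.no_grad()
def ddim_sample(model, c, alphas, shape):
    """Deterministic DDIM sampling (sigma_t = 0) following Equation 1.

    model:  predicts the noise eps_theta from [condition, noisy DSM] and t
    c:      source optical image, shape (B, 3, H, W)
    alphas: cumulative-product noise schedule alpha_t, shape (T,)
    shape:  shape of the DSM to generate, e.g. (B, 1, H, W)
    """
    x = torch.randn(shape, device=c.device)  # h_tilde_T: white noise start
    T = len(alphas)
    for t in reversed(range(T)):
        a_t = alphas[t]
        a_prev = alphas[t - 1] if t > 0 else torch.tensor(1.0, device=c.device)
        # Condition each step on the source image via concatenation.
        eps = model(torch.cat([c, x], dim=1),
                    torch.full((shape[0],), t, device=c.device))
        # "Predicted x_0" term from Equation 1.
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Deterministic update: sigma_t = 0, so no fresh noise is added.
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x  # h_tilde_0: the denoised height estimate
```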

3 Experiments

3.1 Dataset

The Vaihingen dataset [20] is part of the ISPRS benchmark on urban object classification and 3D building reconstruction. The dataset contains 33 patches of various sizes, each a pair of optical imagery and a digital surface model extracted from a larger orthomosaic. The imagery contains 3 bands: near-infrared (NIR), red, and green. The ground sampling distance, or spatial resolution, of the imagery and DSMs is 9 cm per pixel. The DSMs were generated with the Trimble INPHO 5.3 software using classical dense image matching methods. We use the provided benchmark train and test set splits throughout our experiments.

3.2 Experimental Details and Hyperparameters

We utilize an NVIDIA DGX server and a single NVIDIA A100 node with 80GB GPU RAM for training and inference sampling. We normalize the inputs and outputs to the range $[-1,1]$.
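For instance, a simple min-max scaling to $[-1,1]$ might look as follows; the per-dataset minimum and maximum height values `lo` and `hi` are assumptions for illustration:

```python
import torch

def normalize(x: torch.Tensor, lo: float, hi: float) -> torch.Tensor:
    # Scale values from [lo, hi] into the [-1, 1] range the diffusion model expects.
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def denormalize(x: torch.Tensor, lo: float, hi: float) -> torch.Tensor:
    # Invert normalize() to recover heights in the original units (e.g. meters).
    return (x + 1.0) / 2.0 * (hi - lo) + lo
```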

Training

During training we randomly crop 512×512 chips from the train set orthomosaics and DSMs. We perform random augmentations on all data, including horizontal/vertical flipping and random rotation, and apply random color jitter and Gaussian blurring to the optical imagery only. We train with $T=1000$ timesteps and a sigmoid variance schedule. We train for 100k iterations with a batch size of 8, mixed precision training, gradient accumulation every 2 steps, and the Adam [21] optimizer with a learning rate of $\alpha=1\mathrm{e}{-3}$.
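A minimal sketch of such a paired augmentation pipeline, assuming PyTorch/torchvision tensors in $[0,1]$ before the $[-1,1]$ scaling; the jitter and blur parameters are illustrative, and color jitter is approximated here by a simple brightness adjustment:

```python
import random

import torch
import torchvision.transforms.functional as TF

def augment(optical: torch.Tensor, dsm: torch.Tensor):
    # Shared geometric augmentations: identical ops keep the pair aligned.
    if random.random() < 0.5:
        optical, dsm = TF.hflip(optical), TF.hflip(dsm)
    if random.random() < 0.5:
        optical, dsm = TF.vflip(optical), TF.vflip(dsm)
    angle = random.choice([0.0, 90.0, 180.0, 270.0])
    optical, dsm = TF.rotate(optical, angle), TF.rotate(dsm, angle)
    # Photometric augmentations: optical imagery only, heights untouched.
    if random.random() < 0.5:
        optical = TF.adjust_brightness(optical, 0.8 + 0.4 * random.random())
    if random.random() < 0.5:
        optical = TF.gaussian_blur(optical, kernel_size=5)
    return optical, dsm
```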

Inference

During inference we use a chip size of 1024×1024. As mentioned in Section 2.1, we utilize DDIM with $T=500$ and $\eta=0$ (i.e., $\sigma_{t}=0$) for faster sampling. Additionally, we utilize FP16 mixed precision during sampling to reduce memory usage.
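Reusing the `ddim_sample` sketch from Section 2.1 (our illustrative helper, not the authors' code), mixed precision sampling might look like:

```python
import torch

# Sample under FP16 autocast to reduce memory at the 1024x1024 chip size;
# model, c, and alphas are as defined for the ddim_sample sketch above.
with torch.autocast(device_type="cuda", dtype=torch.float16):
    dsm_pred = ddim_sample(model, c, alphas, shape=(1, 1, 1024, 1024))
```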

Method                   MAE (m) ↓   RMSE (m) ↓
IM2HEIGHT [14]           1.485       2.253
D3Net [22]               1.314       2.123
Amirkolaee et al. [23]   1.487       2.197
3DBR [24]                1.379       2.074
IM2ELEVATION [15]        1.226       1.882
PLNet [25]               1.178       1.775
DDPM (Ours)              1.093       1.760
Table 1: Experimental results on the Vaihingen dataset with comparisons to other single-task height estimation methods. Benchmark results for other methods are referenced from [26]. Best results are marked in bold.

4 Discussion and Future Work


Fig. 3: Random Vaihingen Test Set Samples. From left to right: Optical image, Ground Truth, DDPM Predictions

The experimental results in Table 1 provide evidence that diffusion based models trained for single-view height estimation outperform previous benchmark methods. It is notable to mention that the ground truth DSMs used in many benchmarks are noisy, most notably around the edges of buildings or tree crowns, because they were generated using traditional methods. As seen in Figure 3, the DDPM height estimates tend to be less noisy and instead smooth these edges, which could lead to better downstream performance when generating 3D models of cities and roof wireframes.

While our experiments only explore comparisons against aerial and satellite imagery based models, we note that investigating Vision Transformer (ViT) [27] based models to replace the U-Net backbone in the DDPM pipeline would be interesting future work. ViTs have shown improved performance in segmentation and change detection applications in comparison to fully-convolutional architectures [28].

5 Conclusion

In this paper we explore cutting-edge DDPMs for estimating pixelwise height to generate high resolution digital surface models from remotely sensed imagery. We find that DDPMs outperform non-diffusion based models and show promise for producing accurate surface estimates of the Earth, which has many downstream applications with the potential to benefit the monitoring of man-made and natural change and the 3D generation of cities. We hope that this work will inspire further exploration of diffusion models in the remote sensing field.

References

  • [1] Rongjun Qin, Jiaojiao Tian, and Peter Reinartz, “3d change detection–approaches and applications,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 122, pp. 41–56, 2016.
  • [2] Jihui Tu, Haigang Sui, Wenqing Feng, and Zhina Song, “Automatic building damage detection method using high-resolution remote sensing images and 3d gis model,” ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, vol. 3, no. 8, 2016.
  • [3] Charles Beumier and Mahamadou Idrissa, “Digital terrain models derived from digital surface model uniform regions in urban areas,” International Journal of Remote Sensing, vol. 37, no. 15, pp. 3477–3493, 2016.
  • [4] Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
  • [5] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020.
  • [6] Danilo Rezende and Shakir Mohamed, “Variational inference with normalizing flows,” in International conference on machine learning. PMLR, 2015, pp. 1530–1538.
  • [7] Diederik P Kingma and Max Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [8] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi, “Palette: Image-to-image diffusion models,” in ACM SIGGRAPH 2022 Conference Proceedings, 2022, pp. 1–10.
  • [9] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al., “Photorealistic text-to-image diffusion models with deep language understanding,” arXiv preprint arXiv:2205.11487, 2022.
  • [10] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer, “High-resolution image synthesis with latent diffusion models,” arXiv preprint arXiv:2112.10752, 2021.
  • [11] Wele Gedara Chaminda Bandara, Nithin Gopalakrishnan Nair, and Vishal M Patel, “Ddpm-cd: Remote sensing change detection using denoising diffusion probabilistic models,” arXiv preprint arXiv:2206.11892, 2022.
  • [12] Jinzhe Liu, Zhiqiang Yuan, Zhaoying Pan, Yiqun Fu, Li Liu, and Bin Lu, “Diffusion model with detail complement for super-resolution of remote sensing,” Remote Sensing, vol. 14, no. 19, pp. 4834, 2022.
  • [13] Benedikt Kolbeinsson and Krystian Mikolajczyk, “Multi-class segmentation from aerial views using recursive noise diffusion,” arXiv preprint arXiv:2212.00787, 2022.
  • [14] Lichao Mou and Xiao Xiang Zhu, “Im2height: Height estimation from single monocular imagery via fully residual convolutional-deconvolutional network,” arXiv preprint arXiv:1802.10249, 2018.
  • [15] Chao-Jung Liu, Vladimir A Krylov, Paul Kane, Geraldine Kavanagh, and Rozenn Dahyot, “Im2elevation: Building height estimation from single-view aerial imagery,” Remote Sensing, vol. 12, no. 17, pp. 2719, 2020.
  • [16] Pedram Ghamisi and Naoto Yokoya, “Img2dsm: Height simulation from single imagery using conditional generative adversarial net,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 5, pp. 794–798, 2018.
  • [17] Isaac Corley and Peyman Najafirad, “Supervising remote sensing change detection models with 3d surface semantics,” in 2022 IEEE International Conference on Image Processing (ICIP). IEEE, 2022, pp. 3753–3757.
  • [18] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
  • [19] Jiaming Song, Chenlin Meng, and Stefano Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
  • [20] Franz Rottensteiner, Gunho Sohn, Jaewook Jung, Markus Gerke, Caroline Baillard, Sebastien Benitez, and Uwe Breitkopf, “The ISPRS benchmark on urban object classification and 3d building reconstruction,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. I-3, pp. 293–298, 2012.
  • [21] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [22] Marcela Carvalho, Bertrand Le Saux, Pauline Trouvé-Peloux, Andrés Almansa, and Frédéric Champagnat, “On regression losses for deep depth estimation,” in 2018 25th IEEE International Conference on Image Processing (ICIP). IEEE, 2018, pp. 2915–2919.
  • [23] Hamed Amini Amirkolaee and Hossein Arefi, “Height estimation from single aerial images using a deep convolutional encoder-decoder network,” ISPRS journal of photogrammetry and remote sensing, vol. 149, pp. 50–66, 2019.
  • [24] Fatemeh Alidoost, Hossein Arefi, and Federico Tombari, “2d image-to-3d model: Knowledge-based 3d building reconstruction (3dbr) using single aerial images and convolutional neural networks (cnns),” Remote Sensing, vol. 11, no. 19, pp. 2219, 2019.
  • [25] Siyuan Xing, Qiulei Dong, and Zhanyi Hu, “Gated feature aggregation for height estimation from single aerial images,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2021.
  • [26] Siyuan Xing, Qiulei Dong, and Zhanyi Hu, “Sce-net: Self-and cross-enhancement network for single-view height estimation and semantic segmentation,” Remote Sensing, vol. 14, no. 9, pp. 2252, 2022.
  • [27] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [28] Hao Chen, Zipeng Qi, and Zhenwei Shi, “Remote sensing image change detection with transformers,” IEEE Transactions on Geoscience and Remote Sensing, 2021.