An Immersive Multi-Elevation Multi-Seasonal Dataset for 3D Reconstruction and Visualization
Abstract
Significant progress has been made in photo-realistic scene reconstruction over recent years. Various disparate efforts have enabled capabilities such as multi-appearance or large-scale modeling; however, there lacks a well-designed dataset that can evaluate the holistic progress of scene reconstruction. We introduce a collection of imagery of the Johns Hopkins Homewood Campus, acquired across different seasons, times of day, and elevations, and at a large scale. We perform a multi-stage calibration process that efficiently recovers camera parameters from phone and drone cameras. This dataset enables researchers to rigorously explore challenges in unconstrained settings, including the effects of inconsistent illumination, large-scale reconstruction, and reconstruction from significantly different perspectives.
![Uncaptioned image](https://cdn.awesomepapers.org/papers/8b33c5ba-7840-4218-9c88-47884f3e526f/Intro_Figure.jpg)
1 Introduction
Three-dimensional (3D) scene reconstruction is a long-standing research area with extensive applications in robotics, autonomous driving, AR/VR, site modeling, disaster relief planning, etc. In particular, photorealistic reconstruction can enable immersive experiences in these applications. Recent advances in neural rendering, e.g., Neural Radiance Fields (NeRF) [18] and 3D Gaussian Splatting [7], have substantially improved our capability in photorealistic novel-view synthesis. Various follow-up works further extend this capability to large-scale scenes [26, 24], to modeling sites across time [16, 3, 28, 9, 27, 30], and to various other challenging scenarios [20, 13, 19]. With these impressive advances, the ability to reconstruct a fully digital world in high fidelity, across time, and at scale may soon become a reality. However, these novel methods have been evaluated on scenes that are collected (1) at the same time, (2) with high overlap, or (3) at a small scale. As such, there is currently no dataset that realistically evaluates these potential challenges.
Popular datasets in reconstruction, such as Mip-NeRF 360 [1] and Tanks and Temples [8], are small in scale and limited in appearance variation. Such controlled environments simplify camera calibration but do not reflect the complexity of large-scale reconstruction, which may contain many visual ambiguities. Phototourism [22] comprises more complex objects and environments with diverse illuminations and appearances; however, each image is captured at a unique time. As a result, methods [16] tested on these datasets require access to test-view images during evaluation to optimize appearance information, potentially compromising the integrity of the test results. Various datasets [26, 17, 4, 15, 24, 11, 2] have been collected to cover a large area, but lack variations in appearance. Furthermore, they are typically collected at a single elevation. Ground-level acquisitions often suffer from occlusions, e.g., those induced by plants or walls, which leaves parts of a building, such as the rooftop, unconstrained. Aerial views capture vast areas, but at a reduced resolution for any specific region. Synthetic datasets [10] can offer multi-view imagery of large-scale areas under various controlled conditions, but they cannot fully reflect the complexities of real-world environments.
We introduce a university-scale, real-world dataset. This dataset offers a comprehensive and expansive view of the Johns Hopkins University Homewood Campus, covering an area of over 80,000 m². As shown in Fig. 1, the images are meticulously captured and selected around buildings on the campus, photographed across different seasons, distinct times of day, and different altitudes over the course of a year. Our dataset overcomes previous shortcomings through the following characteristics: (1) Large-scale and potentially repetitive structures: the architecture of the JHU campus buildings has a uniform style, which can introduce visual ambiguities that challenge image registration, where images with similar textures may be erroneously matched together. By offering large-scale scenes with intricate architectural patterns, our dataset stresses the robustness of reconstruction algorithms. (2) Multi-appearance and multi-view: photographs taken during different seasons and times of day create extensive variations in building appearance and illumination. For each building, we provide images captured around the structure at multiple times of day throughout the year, ensuring uniform appearance within a single acquisition. This approach yields multi-view collections that allow a fair evaluation of methods with access only to metadata, such as the timestamp, and not the ground-truth image. (3) Multi-elevation: to provide a comprehensive view of each building’s structure, our dataset includes imagery captured from ground level and from aerial views at various altitudes. By combining ground-level and multi-altitude acquisitions, we enable algorithms to reconstruct buildings with rich details from every angle.
While commercial products are available for accurate ground GPS positioning, they are much less accessible than phones. Since weather conditions can be fast-changing, we elect to use phones to capture most of our data. To achieve accurate camera pose estimation from large-scale collections, we employ a multi-stage image registration process. First, we simultaneously register all images of each individual building, incorporating additional constraints into the registration algorithm to avoid visual doppelgängers. Next, we establish a stable anchor coordinate system using a subset of images captured in summer from 60m above ground. By aligning each building’s registered cameras to this anchor coordinate system, we integrate all building reconstructions into a unified spatial reference. As a result, we achieve a coherent, large-scale sparse reconstruction of the Johns Hopkins University Homewood Campus within a reasonable processing time.
2 Related Work
2.1 Scene Level Reconstruction Datasets
In scene reconstruction research, the widely used benchmark datasets often focus on single objects [18, 8, 1] or indoor scenes [12]. These datasets are collected in controlled environments, allowing highly accurate camera estimation. However, they do not reflect various practical issues, including changes in appearance. There have been various attempts [22, 25, 29] to construct outdoor, unbounded architecture datasets with diverse lighting and appearance conditions. While these datasets include appearance diversity, they lack multi-view imagery with a consistent appearance. Consequently, algorithms [16, 3, 28, 9, 27, 30] tested on these datasets require access to test-view images during evaluation to account for unique appearance variations. Such an approach risks compromising evaluation fairness and does not reflect real-world inference conditions. Ideally, a single appearance should be captured across multiple views, such that held-out test views can be rendered based on metadata such as capture time, rather than on pixel-wise information from the test image.
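As a concrete illustration of this evaluation protocol, the minimal PyTorch sketch below (our illustration, not code from any cited method) conditions rendering on a per-capture-session embedding; the class name `SessionAppearance` and all dimensions are hypothetical.

```python
# Minimal PyTorch sketch (illustration only): condition appearance on a
# per-capture-session embedding so that a held-out test view needs only its
# session/timestamp label, never its ground-truth pixels.
import torch
import torch.nn as nn

class SessionAppearance(nn.Module):
    """Maps a capture-session index (e.g., 'winter, night') to an appearance code."""
    def __init__(self, num_sessions: int, dim: int = 32):
        super().__init__()
        self.codes = nn.Embedding(num_sessions, dim)

    def forward(self, session_id: torch.Tensor) -> torch.Tensor:
        return self.codes(session_id)

# All images of a session share one code, so rendering a held-out view only
# requires (camera_pose, session_id) -- no per-image appearance optimization.
appearance = SessionAppearance(num_sessions=16)
code = appearance(torch.tensor([3]))  # hypothetical session index
```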
2.2 Large-scale Reconstruction Datasets
Large-scale datasets, such as Quad 6K [4], UrbanScene3D [4], Mill 19 [26], and OMMO [15], have been collected across large areas from the air. However, these datasets are usually restricted to a single altitude, which limits the level of detail in the reconstructed models. Datasets like Block-NeRF [24], KITTI-360 [11], and NuScenes [2] focus exclusively on street-level imagery, leading to many unobserved regions such as building rooftops. Although large-scale synthetic datasets, like MatrixCity [10], provide both ground and aerial views with a controlled environment and precise ground truth, they cannot fully reflect real-world complexities, including environmental factors and real-world physical interactions.
3 Data Collection
To address the limitations of previously collected datasets, we propose a novel collection designed to bridge the gaps in existing benchmarks. As shown in Table 1, by capturing real-world, large-scale imagery with multi-appearance and multi-elevation coverage, our dataset provides a comprehensive testing ground for evaluating modern reconstruction algorithms.
Table 1: Comparison with existing reconstruction datasets. mA: multi-appearance; mV: multi-view captures with consistent appearance; R: real; S: synthetic.

Dataset | # Images | Scale | mA | mV | Elevation |
---|---|---|---|---|---|
Phototourism [22] | 150K | Scene (R) | ✓ | ✗ | Ground |
MegaScenes [25] | 2M | Scene (R) | ✓ | ✗ | Ground |
BlendedMVS [29] | 5K | Scene (R+S) | ✗ | ✗ | Ground |
UrbanScene3D [4] | 128K | Scene (R+S) | ✗ | ✗ | Aerial |
Quad 6K [4] | 5.1K | Scene (R) | ✗ | ✗ | Aerial |
Mill 19 [26] | 3.6K | Scene (R) | ✗ | ✗ | Aerial |
OMMO [15] | 14.7K | Scene (R) | ✓ | ✓ | Aerial |
Block-NeRF [24] | 2.8M | City (R) | ✓ | ✗ | Ground |
KITTI-360 [11] | 300K | City (R) | ✗ | ✗ | Ground |
NuScenes [2] | 1.4M | City (R) | ✓ | ✗ | Ground |
MatrixCity [10] | 519K | City (S) | ✓ | ✓ | Ground+Aerial |
Ours | 12.3K | City (R) | ✓ | ✓ | Ground+Aerial |
The dataset contains over 12,300 images of ten adjacent buildings on the Johns Hopkins University Homewood campus, covering an area of approximately 80,000 m². As shown in Figure 2, for each building we collect both aerial and ground-level imagery and systematically vary the appearance configurations. In particular, we consider the four seasons, different weather conditions (e.g., sunny and cloudy), and times of day (daytime and nighttime) over the course of one calendar year. The seasonal variations are visually distinct: in winter, foliage is sparse and scenes are often accompanied by snowfall; in spring and summer, vegetation growth leads to a significantly greener scene; during fall, the leaves gradually turn vibrant shades of yellow, orange, and red. In addition to seasonal differences, weather conditions also influence the imagery. On sunny days, strong directional lighting produces pronounced shadows. In contrast, cloudy conditions diffuse the available light, yielding more uniform, ambient illumination. We collect a set of multi-view images for each of the aforementioned conditions, minimizing the appearance differences within a single collection.
For ground-level imagery, the data collection process involves walking around each building’s perimeter with a handheld smartphone camera, capturing continuous video sequences. We perform a manual inspection of all extracted frames to remove any low-quality images and ensure consecutive frames maintain sufficient overlap to support robust image registration. In addition, images containing Personally Identifiable Information (PII), such as faces or vehicle license plates, are blurred.
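The sketch below (our illustration with OpenCV, not the authors' tooling) shows one common way to implement this kind of frame extraction and quality screening, using variance of the Laplacian as a sharpness proxy; the file name, sampling stride, and threshold are hypothetical.

```python
# Hedged sketch: extract frames from a walk-around video and drop blurry ones
# via variance-of-Laplacian, a common proxy for focus quality.
import cv2

def extract_sharp_frames(video_path: str, every_n: int = 10, blur_thresh: float = 100.0):
    cap = cv2.VideoCapture(video_path)
    kept, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
            if sharpness >= blur_thresh:   # keep only reasonably sharp frames
                kept.append(frame)
        idx += 1
    cap.release()
    return kept

frames = extract_sharp_frames("clark_hall_ground_winter.mp4")  # hypothetical path
```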
For aerial imagery, we deploy drones equipped with stabilized high-resolution cameras. At each targeted altitude (60m, 100m, and 120m), the drone follows a circular flight trajectory around the building. Drone flights are planned to ensure uniform and complete coverage. In addition to circular flights, we also keep the ascending video sequences as the drone moves from ground level to approximately 60m above ground on two sides of each building. These vertically ascending videos can help improve registration between ground-level and aerial imagery. From these videos, we sample individual frames, applying the same quality control measures as for ground-level data.
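For readers who wish to replicate a comparable acquisition, the sketch below generates evenly spaced waypoints for a circular orbit at a fixed altitude, each yawed toward the building center. This is only an illustrative geometry computation under assumed local coordinates; actual flights would be planned with the drone vendor's mission-planning software.

```python
# Illustrative orbit planning: evenly spaced waypoints on a circle around a
# building center at a fixed altitude; coordinates and counts are hypothetical.
import math

def circular_waypoints(center_xy, radius_m, altitude_m, num_waypoints=36):
    """Waypoints on a circle, each with a yaw angle facing the center."""
    waypoints = []
    for k in range(num_waypoints):
        theta = 2.0 * math.pi * k / num_waypoints
        x = center_xy[0] + radius_m * math.cos(theta)
        y = center_xy[1] + radius_m * math.sin(theta)
        yaw_deg = math.degrees(math.atan2(center_xy[1] - y, center_xy[0] - x))
        waypoints.append({"x": x, "y": y, "z": altitude_m, "yaw_deg": yaw_deg})
    return waypoints

orbit_60m = circular_waypoints(center_xy=(0.0, 0.0), radius_m=80.0, altitude_m=60.0)
```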

4 Data Processing
After acquiring and inspecting the imagery, we apply a multi-stage data processing pipeline to register all images into a single coordinate space.
4.1 Large-scale Doppelgänger Mitigation
Visual ambiguity is a common challenge when calibrating images of buildings, due to their symmetries. This problem, often known as the doppelgänger problem, also appears in our collected imagery. Without accurate GPS, images that are far apart can be matched erroneously due to similar textures and a limited field of view. As illustrated in Figure 3(a), the front-door and back-door regions of Clark Hall are very similar. Standard feature matching methods, e.g., SIFT [14] and SuperGlue [21], either fail to converge or produce a heavily overlapped calibration.
Rather than allowing arbitrary matches across all ground-level images, we impose a temporal adjacency constraint: each image is only matched with its 10 nearest frames in the video. By confining matches to successive frames, we effectively reduce the search space for feature correspondences and prevent spurious matches. This constraint leads to a stable registration of ground-level imagery, even in the presence of repeated architectural motifs. However, registering these images without prior information about image order remains a challenging task and an open research direction.
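A minimal sketch of this constraint, assuming frames are named in capture order: the generated pair list can be passed to any matcher that accepts explicit image pairs (e.g., COLMAP's matches_importer). The file names and window size shown are illustrative.

```python
# Temporal adjacency constraint: restrict matching to the 10 neighboring frames
# of each image in the capture order, then write an explicit match-pair list.
def temporal_pairs(image_names, window=10):
    """image_names must be sorted by capture time (frame order in the video)."""
    pairs = []
    for i, name_i in enumerate(image_names):
        for j in range(i + 1, min(i + 1 + window, len(image_names))):
            pairs.append((name_i, image_names[j]))
    return pairs

frames = [f"frame_{k:05d}.jpg" for k in range(2000)]  # hypothetical naming
with open("match_list.txt", "w") as f:
    for a, b in temporal_pairs(frames, window=10):
        f.write(f"{a} {b}\n")
```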
4.2 Multi-elevation imagery integration
Our dataset spans multiple altitudes, ranging from ground level up to 120m, providing comprehensive and detailed coverage of building exteriors and rooftops. However, drastic altitude and perspective changes pose a distinct challenge for registration. Benchmark registration methods struggle to match the fine-grained features in ground images to drone images. To address this issue, we incorporate the ascending image sequences acquired by the drones. These sequences start close to ground level and incrementally ascend to higher altitudes, gradually shifting the perspective from ground to air. This incremental transition allows features detected in ground-level images to be traced upward. As such, we can register the ground and aerial imagery of a single building together (a sketch of this cross-elevation pairing follows Table 2). In Table 2, we compare current registration benchmarks on selected sets of imagery, withholding the ascending image sequences. All methods fail to register the cross-view images correctly, prompting further research in this direction.
Table 2: Number of images registered per building by each matching method when the ascending sequences are withheld. G: ground-level images; D: drone images.

Building | # Images (G) | # Images (D) | SIFT [14] | SP+SG [21] | LoFTR [23] | RoMA [5] |
---|---|---|---|---|---|---|
Ames | 175 | 35 | 175 | 175 | 175 | 188 |
Clark | 171 | 33 | 161 | 171 | 168 | 180 |
Garland | 153 | 32 | 106 | 153 | 102 | 167 |
Gilman | 68 | 48 | 48 | 116 | 48 | 116 |
Hackerman | 125 | 38 | 122 | 125 | 125 | 119 |
Hodson | 77 | 25 | 51 | 77 | 63 | 91 |
Latrobe | 80 | 29 | 80 | 80 | 71 | 100 |
Levering | 96 | 43 | 56 | 61 | 78 | 113 |
Mason | 143 | 27 | 143 | 143 | 143 | 160 |
Shriver | 97 | 46 | 75 | 97 | 97 | 139 |
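Building on Sec. 4.2, the sketch below shows one way the ascending sequences could be wired into the match graph: the lowest ascending frames are paired with ground images and the highest with aerial images, while within-sequence pairs would come from the temporal_pairs() sketch shown earlier. This is our illustrative pairing heuristic, not the authors' exact procedure; the group boundaries and overlap count are assumptions.

```python
# Illustrative cross-elevation pairing: connect ground images to the bottom of
# each ascending sequence and aerial images to its top.
def bridge_pairs(ground_imgs, ascending_imgs, aerial_imgs, overlap=20):
    """ascending_imgs must be ordered from lowest to highest altitude."""
    pairs = []
    for g in ground_imgs:                      # ground <-> start of ascent
        for a in ascending_imgs[:overlap]:
            pairs.append((g, a))
    for a in ascending_imgs[-overlap:]:        # top of ascent <-> aerial orbit
        for d in aerial_imgs:
            pairs.append((a, d))
    return pairs
```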
4.3 Global Alignment
After correctly registering the multi-appearance imagery of each building, we then place all campus buildings into a unified coordinate system, which is more efficient than directly registering all available images together. To accomplish this, we first use a subset of aerial images from every building, captured during summer at an altitude of 60m. Since these aerial images capture a wide perspective, they can be reliably calibrated. This produces a set of camera positions $\{\mathbf{x}_i^{\text{campus}}\}$ in a campus-wise anchor coordinate system. The same aerial cameras also have positions $\{\mathbf{x}_i^{b}\}$ expressed in a building-wise coordinate system, where $b$ indicates the building index. We then perform Procrustes alignment [6] to align each building's registration and point cloud with the anchor coordinate system.

A similarity transform comprising a scaling factor $s$, a rotation matrix $R$, and a translation vector $t$ is estimated. We determine $(s, R, t)$ by solving the following optimization problem:

$$\min_{s,\,R,\,t}\ \sum_{i} \left\lVert \mathbf{x}_i^{\text{campus}} - \left(s\,R\,\mathbf{x}_i^{b} + t\right) \right\rVert_2^2 \qquad (1)$$

Once the optimal $(s, R, t)$ is found, it is used to transform all cameras and 3D points of building $b$ into the campus-wise coordinate system.
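Equation (1) admits a closed-form solution. The sketch below uses a standard Umeyama-style estimator on matched camera centers as an illustration, assuming the building-frame and campus-frame positions are given as N x 3 NumPy arrays; it is not the authors' implementation.

```python
# Compact sketch of the similarity transform in Eq. (1) (Umeyama-style closed
# form) from matched camera centers X_b (building frame) and X_c (campus frame).
import numpy as np

def similarity_procrustes(X_b, X_c):
    """Return scale s, rotation R, translation t minimizing sum ||X_c - (s R X_b + t)||^2."""
    mu_b, mu_c = X_b.mean(axis=0), X_c.mean(axis=0)
    A, B = X_b - mu_b, X_c - mu_c              # centered source / target points
    U, S, Vt = np.linalg.svd(B.T @ A)          # cross-covariance SVD
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:              # guard against reflections
        D[2, 2] = -1.0
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / np.sum(A ** 2)
    t = mu_c - s * R @ mu_b
    return s, R, t

# Every camera center and 3D point of building b is then mapped by:
# x_campus = s * R @ x_building + t
```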
5 Conclusion
We introduce a comprehensive multi-view dataset of the Johns Hopkins University Homewood Campus designed to advance research in high-fidelity camera calibration and reconstruction. By systematically varying seasonal conditions, times of day, weather patterns, and elevations, our dataset encompasses a rich diversity of visual appearances. This breadth enables robust testing and benchmarking of algorithms under challenging real-world scenarios, including visual ambiguities and significant perspective changes.
After collecting the dataset, we outline a multi-stage data processing pipeline that efficiently calibrates all the images into a single coordinate system. Specifically, this is done by (1) imposing temporal constraints to mitigate “doppelgänger” issues in ground-level imagery, (2) leveraging ascending image sequences to bridge the gap between ground and aerial perspectives, and (3) aligning individual building reconstructions into a unified anchor coordinate system.
Our multi-view calibrated dataset offers a valuable new resource for the computer vision and graphics community. By enabling more rigorous evaluation of 3D reconstruction methods and facilitating comparisons across a wide range of environmental conditions, our work lays the groundwork for advancing state-of-the-art techniques and enhancing their applicability to complex, real-world environments.
References
- [1] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 5460–5469. IEEE, 2022.
- [2] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 11618–11628. Computer Vision Foundation / IEEE, 2020.
- [3] Xingyu Chen, Qi Zhang, Xiaoyu Li, Yue Chen, Ying Feng, Xuan Wang, and Jue Wang. Hallucinated neural radiance fields in the wild. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 12933–12942. IEEE, 2022.
- [4] David J. Crandall, Andrew Owens, Noah Snavely, and Dan Huttenlocher. Discrete-continuous optimization for large-scale structure from motion. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pages 3001–3008. IEEE Computer Society, 2011.
- [5] Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. Roma: Robust dense feature matching. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pages 19790–19800. IEEE, 2024.
- [6] John C Gower. Generalized procrustes analysis. Psychometrika, 40:33–51, 1975.
- [7] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. CoRR, abs/2308.04079, 2023.
- [8] Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: benchmarking large-scale scene reconstruction. ACM Trans. Graph., 36(4):78:1–78:13, 2017.
- [9] Jonas Kulhanek, Songyou Peng, Zuzana Kukelova, Marc Pollefeys, and Torsten Sattler. Wildgaussians: 3d gaussian splatting in the wild. CoRR, abs/2407.08447, 2024.
- [10] Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 3182–3192. IEEE, 2023.
- [11] Yiyi Liao, Jun Xie, and Andreas Geiger. KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. IEEE Trans. Pattern Anal. Mach. Intell., 45(3):3292–3310, 2023.
- [12] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas A. Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 2017.
- [13] Yang Liu, He Guan, Chuanchen Luo, Lue Fan, Junran Peng, and Zhaoxiang Zhang. Citygaussian: Real-time high-quality large-scale scene rendering with gaussians. CoRR, abs/2404.01133, 2024.
- [14] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 60(2):91–110, 2004.
- [15] Chongshan Lu, Fukun Yin, Xin Chen, Wen Liu, Tao Chen, Gang Yu, and Jiayuan Fan. A large-scale outdoor multi-modal dataset and benchmark for novel view synthesis and implicit scene reconstruction. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 7523–7533. IEEE, 2023.
- [16] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 7210–7219. Computer Vision Foundation / IEEE, 2021.
- [17] Andreas Meuleman, Yu-Lun Liu, Chen Gao, Jia-Bin Huang, Changil Kim, Min H. Kim, and Johannes Kopf. Progressively optimized local radiance fields for robust view synthesis. In CVPR, 2023.
- [18] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision - ECCV 2020 - 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part I, volume 12346 of Lecture Notes in Computer Science, pages 405–421. Springer, 2020.
- [19] Cheng Peng and Rama Chellappa. PDRF: progressively deblurring radiance field for fast scene reconstruction from blurry images. In Brian Williams, Yiling Chen, and Jennifer Neville, editors, Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 2029–2037. AAAI Press, 2023.
- [20] Cheng Peng, Yutao Tang, Yifan Zhou, Nengyu Wang, Xijun Liu, Deming Li, and Rama Chellappa. BAGS: blur agnostic gaussian splatting through multi-scale kernel modeling. In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXXX, volume 15138 of Lecture Notes in Computer Science, pages 293–310. Springer, 2024.
- [21] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 4937–4946. Computer Vision Foundation / IEEE, 2020.
- [22] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. ACM Trans. Graph., 25(3):835–846, 2006.
- [23] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021, pages 8922–8931. Computer Vision Foundation / IEEE, 2021.
- [24] Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben P. Mildenhall, Pratul P. Srinivasan, Jonathan T. Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 8238–8248. IEEE, 2022.
- [25] Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part XXIX, volume 15087 of Lecture Notes in Computer Science, pages 197–214. Springer, 2024.
- [26] Haithem Turki, Deva Ramanan, and Mahadev Satyanarayanan. Mega-nerf: Scalable construction of large-scale nerfs for virtual fly- throughs. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 12912–12921. IEEE, 2022.
- [27] Jiacong Xu, Yiqun Mei, and Vishal M. Patel. Wild-gs: Real-time novel view synthesis from unconstrained photo collections. CoRR, abs/2406.10373, 2024.
- [28] Jiawei Yang, Marco Pavone, and Yue Wang. Freenerf: Improving few-shot neural rendering with free frequency regularization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 8254–8263. IEEE, 2023.
- [29] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 1787–1796. Computer Vision Foundation / IEEE, 2020.
- [30] Dongbin Zhang, Chuming Wang, Weitao Wang, Peihao Li, Minghan Qin, and Haoqian Wang. Gaussian in the wild: 3d gaussian splatting for unconstrained image collections. In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-October 4, 2024, Proceedings, Part LXXVI, volume 15134 of Lecture Notes in Computer Science, pages 341–359. Springer, 2024.