Digital Twin Buildings: 3D Modeling, GIS Integration, and Visual Descriptions Using Gaussian Splatting, ChatGPT/Deepseek, and Google Maps Platform
Abstract
Urban digital twins are virtual replicas of cities that use multi-source data and data analytics to optimize urban planning, infrastructure management, and decision-making. Towards this, we propose a framework focused on the single-building scale. By connecting to cloud mapping platforms such as the Google Maps Platform APIs, by leveraging state-of-the-art multi-agent Large Language Model (LLM) data analysis with ChatGPT (4o) and Deepseek-V3/R1, and by using our Gaussian Splatting-based mesh extraction pipeline, our Digital Twin Buildings framework can retrieve a building’s 3D model and visual descriptions, and achieve cloud-based mapping integration with LLM-based data analytics, starting from only the building’s address, postal code, or geographic coordinates.
Index Terms:
Gaussian Splatting, ChatGPT, Deepseek, Large Language Models, Multi-Agent, AI, 3D Reconstruction, Google Maps, Remote Sensing, Urban Buildings, Urban Digital Twin
I Introduction
In this manuscript, we present Digital Twin Building (DTB), a framework that allows for the extraction of a building’s 3D mesh model, along with Cloud Mapping Service Integration and a Multi-Agent Large Language Model (LLM) module for data analysis. Within the scope of this paper, we use the framework to retrieve Gaussian Splatting models and 3D mesh models. We also retrieve fundamental geocoding information, mapping information, and 2D images, and perform visual analysis on the 2D images using the Multi-Agent LLM module. This is shown in Fig. 1.
Depending on need, the Google Maps Platform Integration can also retrieve local elevation maps, real-time traffic data, and air quality data, as well as access other data sources and services, which can then be analyzed.
Our contributions are as follows.
• We introduce Digital Twin Building (DTB), a framework for extracting 3D mesh models of buildings. We integrate cloud mapping services for retrieving geocoding information, mapping information, and 2D images.
• We design a Multi-Agent Large Language Model (LLM) module for data analysis.
• We perform extensive visual analysis experiments on multi-view/multi-scale images of the buildings of interest using the LLM module, and assess the performance of the popular ChatGPT (4o/mini) and Deepseek-V3/R1 models.
II Background and Related Works
II-A ChatGPT/Deepseek and API
Large Language Models (LLMs) are neural networks, typically Transformer-based [1], pre-trained on extensive, diverse text/image corpora, typically sourced from web crawls. These models, designed for Natural Language Processing (NLP), generally interpret text-based prompts and generate text-based outputs. Certain models, such as Deepseek-V3/R1 and their variants [2, 3], support optical character recognition (OCR, i.e., reading text from images). Models like ChatGPT-4o [4] and its variants additionally support full interpretation and analysis of image content.
LLMs have achieved widespread adoption since 2023. Beyond basic image and text interpretation, these models have recently exhibited expert-level problem-solving in various scientific and engineering domains [5, 6].
Due to their large size, LLMs often face hardware constraints for local deployment. While popular LLM providers such as OpenAI and Deepseek provide web browser interfaces for their models, they also offer Application Programming Interfaces (APIs). These APIs enable client-side software or code to query LLMs hosted on OpenAI or Deepseek servers, facilitating large-scale data processing without requiring human-in-the-loop manipulations via browser interfaces. Unlike traditional local deep learning, which necessitates GPUs for both training and inference, API-based LLM querying requires minimal local hardware and can be deployed on devices such as mobile phones.
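As a minimal illustration of API-based querying, the sketch below sends a prompt to an OpenAI-hosted model and to a Deepseek-hosted model through the same OpenAI-compatible chat-completions interface; the prompts and API-key placeholders are ours, not part of the framework.

```python
# Minimal sketch of API-based LLM querying through the OpenAI-compatible
# chat-completions interface. Prompts and API-key placeholders are illustrative.
from openai import OpenAI

# OpenAI-hosted model
openai_client = OpenAI(api_key="YOUR_OPENAI_KEY")
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "List typical features of Gothic Revival architecture."}],
)
print(response.choices[0].message.content)

# Deepseek-hosted model, queried with the same client library
deepseek_client = OpenAI(api_key="YOUR_DEEPSEEK_KEY",
                         base_url="https://api.deepseek.com")
response = deepseek_client.chat.completions.create(
    model="deepseek-chat",
    messages=[{"role": "user",
               "content": "Summarize these keywords: glass facade, atrium, low-rise."}],
)
print(response.choices[0].message.content)
```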
Table I: LLM specifications and API pricing.

| Model Name | Model Class | Model Type | Image Processing | Parameters | API Input Price (USD / 1M tokens) | API Output Price (USD / 1M tokens) |
|---|---|---|---|---|---|---|
| chatgpt-4o-latest | GPT-4o | Autoregressive | Analysis | 1000+ B | 2.50 | 10.00 |
| gpt-4o-mini | GPT-4o Mini | Autoregressive | Analysis | tens of B | 0.15 | 0.60 |
| deepseek-chat | Deepseek V3 | Autoregressive | OCR | 671B | 0.14 / 0.014* | 1.10 |
| deepseek-reasoner | Deepseek R1 (V3-base) | Reasoning | OCR | 671B | 0.14 | 2.19 |
| gpt-o1¹ | GPT-o1 (GPT4-base) | Reasoning | None | 175B | 15.00 | 60.00 |

Table compiled on 2025-01-31. OpenAI models are not open-sourced; their model sizes (parameters) are estimates (B = billions). ¹We did not include gpt-o1 in our experiments due to cost, but include its specifications for comparison. *The Deepseek V3 API input token price is discounted by 90% when input caching is used for repeated identical prompting.
II-B Google Maps Platform API
Google Maps Platform is a cloud-based mapping service and part of Google Cloud. Its APIs allow a client device to connect to various cloud-based GIS, mapping, and remote sensing services hosted on Google Cloud servers.
The services utilized in this research include remote sensing image retrieval, map retrieval, elevation data retrieval, geocoding/reverse geocoding, and building polygon retrieval. However, Google Maps Platform also offers other APIs for urban and environmental research, including real-time traffic data, solar potential data, air quality data, and plant pollen data, in addition to the full suite of commonly used Google Maps navigation and mapping tools.
Although less well known in the remote sensing and GIS community than its sister application Google Earth Engine, Google Maps Platform has been used in a variety of GIS research, including navigation, object tracking, city modeling, image and map retrieval, and geospatial data analysis for commercial and industrial applications [7, 8, 9, 10]. It is also used in many commercial software packages for cloud-based mapping integration.
II-C Google Earth Studio
Google Earth Studio [11] is a web-based animation tool that leverages Google Earth’s satellite imagery and 3D terrain data. The tool is especially useful for creating geospatial visualizations, as it is integrated with Google Earth’s geographic data. It allows for the retrieval of images from user-specified camera poses at user-specified locations. In this research, we use Google Earth Studio to retrieve 360° multi-view remote sensing images of a building from its address, postal code, place name, or geographic coordinates, following [12, 13].
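As an illustration of how such an orbit of camera poses can be specified, the sketch below computes evenly spaced keyframe positions on a 360° circle around a target coordinate; the orbit radius, altitude, and example coordinates are our own assumptions, not the settings used in [12, 13].

```python
# Hedged sketch: compute evenly spaced camera keyframes on a circular orbit
# around a target coordinate. Radius, altitude, and the example coordinates
# are illustrative; they are not the settings used in the referenced works.
import math

def orbit_keyframes(lat, lon, n_views=31, radius_m=250.0, altitude_m=150.0):
    earth_r = 6_378_137.0  # WGS84 equatorial radius (meters)
    frames = []
    for i in range(n_views):
        theta = 2.0 * math.pi * i / n_views  # bearing of camera from target
        d_lat = (radius_m * math.cos(theta)) / earth_r
        d_lon = (radius_m * math.sin(theta)) / (earth_r * math.cos(math.radians(lat)))
        # Camera looks back toward the target, i.e., heading = bearing + 180 degrees.
        frames.append({
            "lat": lat + math.degrees(d_lat),
            "lon": lon + math.degrees(d_lon),
            "alt_m": altitude_m,
            "heading_deg": (math.degrees(theta) + 180.0) % 360.0,
        })
    return frames

keyframes = orbit_keyframes(43.4643, -80.5204)  # illustrative target coordinates
```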
III Method
III-A Gaussian Building Mesh Extraction
We use the mesh extraction procedure we introduced in December 2024 [13]. For conciseness, the process is briefly described here, and is not benchmarked. We refer the readers to [13] for the original implementation details and benchmark comparisons. We also refer the readers to [14] for background and theory on Gaussian Splatting.
Gaussian Building Mesh (GBM) [13] is an end-to-end 3D building mesh extraction pipeline we recently proposed, leveraging Google Earth Studio [11], Segment Anything Model-2 (SAM2) [15], GroundingDINO [16], and a modified [17] Gaussian Splatting [14] model. The pipeline enables the extraction of a 3D building mesh from inputs such as the building’s name, address, postal code, or geographical coordinates. Since GBM uses Gaussian Splatting (GS) as its 3D representation, it also allows for the synthesis of new photorealistic 2D images of the building under different viewpoints.
III-B Google Maps Platform Integration
We use the Python client binding for the Google Maps Platform Services APIs to create an integration tool that automatically retrieves the GIS and mapping information of a building. For these image analysis experiments, the data is retrieved with four API calls. The first is a Google Maps Platform Geocoding/Reverse Geocoding API call, which retrieves the complete address information including geographic coordinates, entrance coordinates, and building polygon mask vertex coordinates. Then, a Google Maps Platform Elevation API call is used to retrieve the ground elevation at the building’s coordinates. Additional API calls to other cloud services can also be performed at this step. Finally, two API calls are made to the Google Maps Platform Static Maps API to retrieve map(s) and satellite/aerial image(s) at the desired zoom levels. This process is illustrated in Fig. 2. The aerial/satellite image(s) are then used as one of the inputs to our Multi-Agent LLM module.
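As a hedged sketch of this retrieval step (not our exact implementation), the snippet below uses the official googlemaps Python client for geocoding and elevation, and the documented Static Maps HTTPS endpoint for the map and satellite images; the address, zoom levels, and image size are illustrative.

```python
# Hedged sketch of the Google Maps Platform retrieval step, using the official
# `googlemaps` Python client plus the documented Static Maps HTTPS endpoint.
# The address, zoom levels, and image size are illustrative parameters.
import googlemaps
import requests

API_KEY = "YOUR_GOOGLE_MAPS_PLATFORM_KEY"
gmaps = googlemaps.Client(key=API_KEY)

# 1) Geocoding: address -> geographic coordinates (plus address components).
geocode = gmaps.geocode("31 Caroline St N, Waterloo, ON, Canada")
location = geocode[0]["geometry"]["location"]
lat, lng = location["lat"], location["lng"]

# 2) Elevation: ground elevation at the building's coordinates.
elevation_m = gmaps.elevation((lat, lng))[0]["elevation"]

# 3) Static Maps: one map tile and one satellite/aerial image at chosen zooms.
def static_map(maptype, zoom, size="640x640"):
    url = "https://maps.googleapis.com/maps/api/staticmap"
    params = {"center": f"{lat},{lng}", "zoom": zoom, "size": size,
              "maptype": maptype, "key": API_KEY}
    return requests.get(url, params=params).content  # PNG bytes

roadmap_png = static_map("roadmap", zoom=18)
satellite_png = static_map("satellite", zoom=19)
```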
Our Google Maps Platform Integration can easily be modified to retrieve additional data from the cloud-based mapping service by adding parallel API calls after the Geocoding/Reverse Geocoding API call. For example, if we wish to analyze real-time traffic data, we can simply perform API calls to the Traffic API.
III-C Multi-Agent LLM Analysis of Multi-View/Scale Images
The motivation of this module is to create a multi-agent LLM system to analyze the data retrieved through the Google Cloud Platform services integration. In this paper, we restrict the scope to multi-agent content analysis of multi-view/scale images.
For this analysis, the primary goal is to retrieve and store keywords describing the architectural style, function, landscape, and architectural elements of the building. On the premise that an accurate set of keywords allows an agent to reconstruct a text-based description of the image without actually seeing it, we also generate a caption from the keywords as a secondary objective. This process is illustrated in Fig. 3.
From a building’s address, place name, postal code, or geographic coordinates, we retrieve multi-view off-nadir images of the building of interest using Google Earth Studio (or reuse the ones previously retrieved by the GBM module). We also retrieve top-down view aerial/satellite image(s) of the building at different scales using the Google Maps Platform Integration. For each image, we initiate a GPT-4o/GPT-4o-mini agent and prompt it to analyze the image and return a set of keywords. We then initiate two further agents: one to aggregate the keywords from all images of the building, and one to turn the aggregated keywords into a human-readable caption.
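A compressed sketch of this multi-agent flow is shown below, assuming base64-encoded image input to the OpenAI chat-completions API; the prompts, helper names, file names, and the `detail` setting are illustrative rather than the exact ones used in our experiments.

```python
# Hedged sketch of the multi-agent image-to-keywords-to-caption flow.
# Prompts, helper names, file names, and the `detail` setting are illustrative.
import base64
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY")

KEYWORD_PROMPT = ("You are analyzing one image of a building. Return concise "
                  "keywords describing its architectural style, function, "
                  "landscape, and architectural elements, as a comma-separated list.")

def image_agent(image_path, detail="high", model="gpt-4o-mini"):
    """One LLM agent per image: extract descriptive keywords."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": [
            {"type": "text", "text": KEYWORD_PROMPT},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}", "detail": detail}},
        ]}],
    )
    return response.choices[0].message.content

def text_agent(instruction, model="gpt-4o-mini"):
    """Text-only agent used for keyword aggregation and captioning."""
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": instruction}])
    return response.choices[0].message.content

# Per-image agents: six off-nadir views plus two top-down images.
image_paths = [f"view_{i}.jpg" for i in range(6)] + ["zoom18.png", "zoom19.png"]
per_image_keywords = [image_agent(p) for p in image_paths]

# Aggregation agent, then captioning agent.
aggregated = text_agent("Merge these keyword lists for one building, removing "
                        "duplicates:\n" + "\n".join(per_image_keywords))
caption = text_agent("Write a short, human-readable description of the building "
                     "from these keywords:\n" + aggregated)
```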
III-D Metrics
Although metrics such as BLEU and CIDEr are commonly used to evaluate captioning performance, they require supervised datasets with ground-truth captions, which our images lack. Therefore, we use the CLIP (Contrastive Language–Image Pretraining) score [18]. A CLIP-trained Transformer embeds both the caption and the image into a shared image-language latent space. Given the text embedding $\mathbf{t}$ of the caption and the image embedding $\mathbf{v}$ of the corresponding image, the CLIP score is given by

$$\mathrm{CLIPScore}(\mathbf{t}, \mathbf{v}) = \max\!\left( 100 \cdot \frac{\mathbf{t} \cdot \mathbf{v}}{\lVert \mathbf{t} \rVert \, \lVert \mathbf{v} \rVert},\; 0 \right). \tag{1}$$
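As one possible way to compute this score, the sketch below uses the CLIP implementation in the Hugging Face transformers library; the ViT-B/32 checkpoint is an illustrative choice, not necessarily the model behind our reported scores.

```python
# Hedged sketch of CLIP-score computation (Eq. 1) with the Hugging Face
# `transformers` CLIP implementation. The checkpoint is an illustrative choice.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(caption: str, image_path: str) -> float:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        t = model.get_text_features(input_ids=inputs["input_ids"],
                                    attention_mask=inputs["attention_mask"])
        v = model.get_image_features(pixel_values=inputs["pixel_values"])
    cos = torch.nn.functional.cosine_similarity(t, v).item()
    return max(100.0 * cos, 0.0)
```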
We also use perplexity to roughly assess the LLM’s confidence in its image-to-keyword extraction. Perplexity is a measure of the model’s confidence in its own response, and is given by

$$\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p\left(x_i \mid x_{<i}\right) \right), \tag{2}$$

where $\log p(x_i \mid x_{<i})$ is the log-probability of the $i$-th generated token $x_i$ as assessed by the model during text-token autoregression, and $N$ is the number of generated tokens.
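Per-token log-probabilities can be requested directly from the chat-completions API, from which Eq. (2) follows in a few lines; the model and prompt below are illustrative.

```python
# Hedged sketch: compute Eq. (2) from the per-token log-probabilities returned
# by the chat-completions API when `logprobs=True` is requested.
import math
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "List keywords describing a glass-clad office tower."}],
    logprobs=True,
)
token_logprobs = [tok.logprob for tok in response.choices[0].logprobs.content]
perplexity = math.exp(-sum(token_logprobs) / len(token_logprobs))
print(f"Response perplexity: {perplexity:.3f}")
```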
IV Experiments and Discussions
IV-A Experiments
We chose seven different buildings to test our framework, including well-known landmarks and commercial, residential, and institutional buildings. We extract 31 multi-view images along a 360° orbit around the building of interest, which we use with our GBM module to create the building’s colored 3D mesh. We then subsample six of these images, one every 70°, as inputs to the Multi-Agent LLM module. We also use the Google Maps Platform Integration to retrieve two aerial/satellite images, one at Google Maps zoom level 18 and one at zoom level 19, as additional inputs to the Multi-Agent LLM module.
In preliminary experiments, we noticed a relatively large variation in the final CLIP score across attempts, even when using the same model and prompt, since LLM outputs are not deterministic (even at temperature 0). We therefore perform two experiments: one to compare the performance of the module when using different models as LLM agents, and one to characterize the distribution of scores across attempts for both the keyword-extraction step and the captioning step.

IV-A1 Keyword Extraction
For multi-agent image-to-keyword extraction, both chatgpt-4o-latest and gpt-4o-mini are suitable, and both high-resolution and low-resolution image analysis settings are available. We test all four combinations on all seven scenes for 10 iterations, with eight images (and one LLM agent per image), resulting in a total of 2,240 API calls. We record the LLM responses and the perplexity scores. The results are shown in Fig. 4.
IV-A2 Captioning
For the keyword-to-caption step, we fix the per-image keywords for each scene using the results from one of the gpt-4o high-image-resolution API calls. For each of the seven buildings, we run five iterations of keyword aggregation and captioning for each of four models: gpt-4o-mini, chatgpt-4o-latest, deepseek-chat, and deepseek-reasoner. Each test requires two API calls, totaling 448 API calls. We additionally calculate the CLIP score of each generated caption against every input image. The per-model CLIP score distributions are visualized in Fig. 5.
IV-A3 Visualization
We present a visualization of the extracted 3D model, caption, keywords, and Google Maps Platform-based information for the Perimeter Institute (PI) building scene in Fig. 6. The Perimeter Institute for Theoretical Physics is an independent research centre located at 31 Caroline St. N, Waterloo, Ontario, Canada. We show the 3D mesh and depth maps extracted from the scene, the 2D map, and the aerial image with the building’s polygon at Google Maps zoom level 18, retrieved via the Google Maps Platform Static Maps API. We also plot the keywords extracted from a single view, as well as the caption generated by the Multi-Agent LLM module.
IV-B Discussion
Fig. 4 shows that the image level of detail does not significantly affect the LLM agents’ confidence in their own predictions. Perhaps surprisingly, the much smaller gpt-4o-mini is, on average, more confident in its own responses. We note that this does not necessarily reflect keyword accuracy, since the larger model may consider many equivalent alternative visual descriptions, lowering its confidence in any individual description. Visual inspection of the captions and images shows that caption-to-image agreement is high across all four configurations. Although the smaller model may be sufficient for the visual captioning task, we nonetheless used the larger model with the high-resolution setting for multi-view/multi-scale keyword extraction when testing the keyword-to-caption step.
Deepseek-reasoner, the Deepseek-R1 model, has the poorest captioning score. Additionally, this model sometimes failed at the task, shown as outliers in Fig. 5. This is perhaps because image captioning is an autoregressive text generation task rather than a reasoning task. Consequently, we decided not to test OpenAI’s reasoning model, GPT-o1: its performance on reasoning tasks is comparable to Deepseek-R1 according to [3], yet it is more expensive by a factor of 30-100. Deepseek-chat, the Deepseek-V3 model, has on average the best captioning performance, offering the best price-performance ratio, with performance comparable to GPT-4o at prices similar to GPT-4o Mini (see Table I).
Our future research aims to leverage the multi-agent LLM tool for geospatial data analysis, integrating various data sources from Google Cloud Platform services, including Google Maps Platform APIs and Google Earth Engine. Additionally, benchmarking the reasoning capabilities of large language models, such as GPT-o1/o3 and Deepseek-R1, for remote sensing and GIS tasks could yield valuable insights.
V Conclusion
We have presented Digital Twin Buildings, a framework for extracting the 3D mesh of a building, connecting the building to Google Maps Platform APIs, and performing Multi-Agent Large Language Model (LLM) data analytics. We demonstrate this by extracting visual description keywords and captions of the building from multi-view, multi-scale images. The framework can also be used to process different data modalities sourced from Google Cloud services. This approach enables richer semantic understanding, seamless integration with geospatial data, and enhanced interaction with real-world structures, paving the way for advanced applications in urban analytics, navigation, and virtual environments.
References
- [1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017.
- [2] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-V3 technical report,” arXiv preprint arXiv:2412.19437, 2024.
- [3] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi et al., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501.12948, 2025.
- [4] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
- [5] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman, “Gpqa: A graduate-level google-proof q&a benchmark,” arXiv preprint arXiv:2311.12022, 2023.
- [6] Z. Liu, Y. Chen, M. Shoeybi, B. Catanzaro, and W. Ping, “AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling,” arXiv preprint arXiv:2412.15084, 2024.
- [7] A. M. Luthfi, N. Karna, and R. Mayasari, “Google maps api implementation on iot platform for tracking an object using gps,” in 2019 IEEE Asia Pacific Conference on Wireless and Mobile (APWiMob). IEEE, 2019, pp. 126–131.
- [8] A. Bhandari and R. Noone, “Support local: Google maps’ local guides platform, spatial power and constructions of “the local”,” Communication, Culture & Critique, vol. 16, no. 3, pp. 198–207, 2023.
- [9] H. Li and B. Hecht, “3 stars on yelp, 4 stars on google maps: a cross-platform examination of restaurant ratings,” Proceedings of the ACM on Human-Computer Interaction, vol. 4, no. CSCW3, pp. 1–25, 2021.
- [10] P. Fuquan, S. Jian, W. Wenyong, and W. Zebing, “A city modeling and simulation platform based on google map api,” in Proceedings of the 2011, International Conference on Informatics, Cybernetics, and Computer Engineering (ICCE2011) November 19–20, 2011, Melbourne, Australia: Volume 2: Information Systems and Computer Engineering. Springer, 2012, pp. 513–520.
- [11] Alphabet Inc., “Google earth studio,” 2015-2024. [Online]. Available: https://www.google.com/earth/studio/
- [12] K. Gao, D. Lu, H. He, L. Xu, and J. Li, “Photorealistic 3d urban scene reconstruction and point cloud extraction using google earth imagery and gaussian splatting,” arXiv preprint arXiv:2405.11021, 2024.
- [13] K. Gao, L. Li, H. He, D. Lu, L. Xu, and J. Li, “Gaussian Building Mesh (GBM): Extract a Building’s 3D Mesh with Google Earth and Gaussian Splatting,” arXiv preprint arXiv:2501.00625, 2024.
- [14] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, pp. 1–14, 2023.
- [15] N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. Ryali, T. Ma, H. Khedr, R. Rädle, C. Rolland, L. Gustafson et al., “Sam 2: Segment anything in images and videos,” arXiv preprint arXiv:2408.00714, 2024.
- [16] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” in European Conference on Computer Vision. Springer, 2025, pp. 38–55.
- [17] C. Ye and Contributors, “2d-gaussian-splatting-great-again,” GitHub repository, 2024. [Online]. Available: https://github.com/hugoycj/2d-gaussian-splatting-great-again/tree/main
- [18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
Additional Results
The per-scene-per-model average CLIP scores are shown in Fig. A1. Deepseek-V3 achieves higher caption CLIP scores than other LLMs, except in two cases. The first is the ICON scene, which features twin high-rise buildings with retail spaces at the ground level. The second is the Parliament Hill of Canada scene overlooking the Ottawa River, characterized by its gothic-style architecture. Notably, the Parliament Hill scene also received the lowest overall CLIP scores.

Input images
Input images for the Parliament Hill scene (Fig. A2) and the ICON scene (Fig. A3) are provided. These are the two scenes with the lowest multi-agent captioning CLIP scores.