ThermalNeRF: Thermal Radiance Fields

Yvette Y. Lin*, Xin-Yi Pan*, Sara Fridovich-Keil, and Gordon Wetzstein Y. Y. Lin is with the Department of Computer Science, Stanford University, Stanford, CA. X. Pan, S. Fridovich-Keil, G. Wetzstein are with Department of Electrical Engineering, Stanford University, Stanford, CA. E-mail: {yvelin, xinyipan, sarafk, gordonwz}@stanford.edu

Abstract

Thermal imaging has a variety of applications, from agricultural monitoring to building inspection to imaging under poor visibility, such as in low light, fog, and rain. However, reconstructing thermal scenes in 3D presents several challenges due to the comparatively lower resolution and limited features present in long-wave infrared (LWIR) images. To overcome these challenges, we propose a unified framework for scene reconstruction from a set of LWIR and RGB images, using a multispectral radiance field to represent a scene viewed by both visible and infrared cameras, thus leveraging information across both spectra. We calibrate the RGB and infrared cameras with respect to each other, as a preprocessing step using a simple calibration target. We demonstrate our method on real-world sets of RGB and LWIR photographs captured from a handheld thermal camera, showing the effectiveness of our method at scene representation across the visible and infrared spectra. We show that our method is capable of thermal super-resolution, as well as visually removing obstacles to reveal objects that are occluded in either the RGB or thermal channels. Please see https://yvette256.github.io/thermalnerf/ for video results as well as our code and dataset release.

Index Terms:

Thermal Imaging, Long-Wave Infrared, 3D, Radiance Fields, Sensor Fusion

1 Introduction

Thermal imaging exposes features of our world that are invisible to the naked eye, and to RGB cameras recording visible light. By capturing long-wave infrared (LWIR) light, in the 8–14 $\mu$ m wavelength range, thermal cameras expose heat sources and material properties, and can see in the dark as well as through many occlusive media such as smoke. These properties make thermal imaging a valuable tool in a wide range of applications including water and air pollution monitoring [1, 2, 3, 4], search and rescue [5, 6], burn severity triage [7, 8], surveillance and defense [9, 10, 11], agriculture and vegetation monitoring [12, 13, 14, 15, 16, 17, 18], and infrastructure inspection [19, 20].

In many of these applications, multiple thermal images are collected from different viewpoints during the course of inspection or exploration. These applications stand to benefit from access to 3D thermal field reconstructions that combine these multi-view thermal images into a unified and consistent 3D thermal volume; this task of 3D thermal field recovery is the focus of our work. For example, in fig. 1 we show example renderings from our reconstruction of a large crane structure based on RGB and thermal images collected autonomously by a Skydio drone. Combining drone-based thermal and visible imaging with our multi-spectral 3D reconstruction may help automate and accelerate otherwise difficult, tedious, or dangerous tasks including this example of infrastructure inspection.

Refer to caption — Figure 1: RGB (top) and thermal (bottom) renderings from our reconstruction of a large crane structure, based on images collected by a Skydio drone. Thermal imaging with 3D reconstruction can be used for building and infrastructure inspection, among many other applications ranging from agriculture to search and rescue.

RGB
Thermal
	Ground Truth	Ours	Nerfacto {RGBT}	Nerfacto {RGB}{T}

Figure 2: Results on our pyrex scene, compared against two Nerfacto-based baselines [21] in which the visible and thermal models share a density field {RGBT} or are completely separate {RGB}{T}. The {RGBT} model fails to recover the glass bowl as visually transparent yet thermally opaque, while the camera pose refinement fails for the thermal component of the {RGB}{T} model, leading to ghosting.

3D radiance field reconstruction has made great strides based on RGB images taken with visible light cameras. However, 3D reconstruction from thermal (long-wave IR) images remains challenging due to the low resolution of thermal cameras. In addition to limiting image quality, this low thermal resolution limits the number of robust 2D image features that can be used to recover thermal camera poses via structure from motion algorithms like COLMAP [22, 23] (see fig. 3). Further, even with known thermal camera poses, directly extending existing radiance field models to include a thermal channel produces limited quality 3D reconstructions because many materials interact differently with thermal versus visible light. For example, a glass of water is transparent to visible light but thermally opaque, while smoke is visibly opaque but thermally transparent.

We propose strategies to address both of these challenges, recovering accurate thermal camera poses by calibrating the relative poses of a thermal camera and an RGB camera, and gracefully combining information from the two spectra while recovering material-specific properties. Our method also improves 3D thermal reconstruction quality by leveraging information from the visible spectrum for super-resolution, as RGB cameras are often of far higher spatial resolution compared to thermal cameras. Concretely, we make the following contributions:

•

We introduce the first method to demonstrate 3D thermal scene reconstruction from long-wave IR images, including cross-calibrating handheld RGB and thermal cameras to estimate thermal camera poses, as well as thermal super-resolution based on fusion of multiview thermal and RGB measurements.
•

We extend radiance field models to separately represent absorption of thermal and visible light, with appropriate cross-spectral regularization, to enable recovery of material properties and improve reconstruction quality.
•

We showcase our method on a novel dataset of diverse materials imaged with multiview thermal and RGB cameras, which is available at https://yvette256.github.io/thermalnerf/. Our dataset includes nine real-world scenes as well as one synthetic scene. We also exhibit our method on a Skydio drone dataset of a large crane structure (see fig. 1).

2 Related Work

2.1 Thermal Imaging

Thermal cameras detect and measure the heat signature of objects, and convert this emitted infrared (IR) energy into a thermal image reflecting varying levels of IR radiation [24]. Thermal images can provide insights into scenes and objects that are invisible to visible-light cameras [25]. The contactless nature of thermal imaging adds to its attractiveness in a diverse range of applications, including security and surveillance [11], preventive maintenance [20, 19], building inspection [26], monitoring rock masses [27], and archaeology [28, 29].

Thermal cameras that measure long-wave infrared (LWIR) wavelengths (8–14 $\mu$ m) use very expensive germanium lenses that transmit in the IR spectrum but block visible light [30, 31]. The longer wavelength range also requires each element in the detector array to be larger than those required for the visible spectrum [32]. These factors contribute to the significantly lower spatial resolution [30, 33] and higher production cost [34] of thermal cameras compared to visible-light cameras. Considering this limited resolution, thermal cameras are often paired with other sensors to improve the effective resolution beyond what the thermal camera alone can provide [35, 36].

In particular, 3D scene reconstruction from IR images is a compelling method to achieve such sensor fusion and enhanced resolution, alongside a volumetric scene representation that is of independent value in many applications. Prior work has proposed methods for 3D IR reconstruction by combining short-wave or near IR and visible light images [37], but these methods yield limited-quality reconstructions and are presently limited to the short-wave IR wavelengths below 1 $\mu$ m (see section 2.3), for which the resolution challenge is far less severe compared to the 8–14 $\mu$ m LWIR range we consider. Other approaches have also been proposed for 3D thermal imaging, but these either rely on contact maps from functional grasping [38], rather than non-contact camera measurements, or study non-line-of-sight object localization rather than direct thermal imaging [39]. In contrast, our focus is to produce high-fidelity 3D thermal reconstructions of diverse scenes by combining measurements from a low-resolution LWIR camera and a higher-resolution RGB visible light camera, without any direct scene contact.

RGB
Thermal
	Ground Truth	Ours	Nerfacto {RGBT}	Nerfacto {RGB}{T}

Figure 4: Results on our engine scene. The engine is warm from recent driving; thermal imaging is used by mechanics to quickly diagnose problems.

2.2 3D Reconstruction and Novel-View Synthesis

3D reconstruction is a valuable aid in the visualization, survey, and analysis of large or inaccessible objects or landscapes, in ways that 2D images are unable to provide [40, 12, 41]. Broadly, 3D reconstruction can be done based on Structure from Motion (SfM) algorithms using point clouds, or based on fully volumetric models such as radiance fields or signed distance fields.

Structure from Motion (SfM) algorithms, such as COLMAP [22], recover both camera poses and 3D points by matching corresponding feature points in 2D images from different viewpoints [42]. After camera poses are known, multi-view stereo (MVS) can then recover a denser 3D reconstruction by matching points across calibrated images [23]. Although MVS often works well, its quality is limited by the accuracy of the camera parameters computed by SfM, as well as reconstruction assumptions such as scene rigidity [43, 44]. In particular, COLMAP is used as a preprocessing step in nearly all radiance field models, as a way to recover camera parameters [23, 22]. It is usually successful on natural scenes with sufficient high-quality input images, but tends to fail on scenes with limited texture (yielding limited keypoints) or images of low-resolution [45].

Our work focuses on reconstructing a radiance field as a dense representation of a scene that leverages differentiable rendering instead of feature mapping. Many methods have been proposed for parameterizing a radiance field, including multilayer perceptrons [46, 47], voxel grids [48, 49, 50], factorized tensors [51, 52], multi-resolution hash tables [53], and anisotropic 3D points [54]; we build our method on top of one such representation, the nerfacto model provided by NeRFStudio [21]. While these methods represent great progress in producing faithful and high-resolution radiance fields, they are all focused on the visible spectrum for which high-resolution visible light camera images can be easily acquired. Extending radiance field reconstruction to represent other parts of the spectrum which are invisible to the human eye would expose useful and otherwise inaccessible information [55].

2.3 Multispectral Radiance Fields

Capturing data beyond the visible spectrum can be helpful in identifying features that are invisible to the usual RGB color channels [56, 57, 58, 59, 60]. Accordingly, integrated sensors have been proposed to extend RGB cameras to incorporate information beyond the visible spectrum.

For instance, X-NeRF [37] reconstructs 3D from multispectral images by optimizing a transform between RGB and other sensors including near IR (NIR) cameras [61], while assuming known camera intrinsics. NIR images tend to have higher resolution as compared to long- and mid-wave IR (LWIR and MWIR) images due to its shorter wavelength [62, 63]. When applied to LWIR images, the novel views rendered from X-NeRF’s 3D multispectral reconstruction have relatively low-resolution, suggesting there is room for improved processing of longer wavelength IR images. Additionally, X-NeRF adopts the same assumption as vanilla NeRF: that each material absorbs different wavelengths of light equally, having a shared density parameter regardless of wavelength. While this assumption is a reasonable one for imaging a small range of wavelengths, it becomes problematic for sensor fusion across a wider spectrum like the one we consider.

In addition to these challenges inherent to multispectral imaging, our LWIR spectrum of interest poses additional challenges to 3D reconstruction. LWIR imaging faces an inherent physical resolution limitation due to its longer wavelengths, and consumer-grade thermal cameras like the handheld FLIR One Pro [36] that we use, often have even poorer image resolution [27, 64]. This poses multiple challenges to obtaining the camera poses and completing the subsequent 3D reconstruction [65]. We propose a method for thermal 3D reconstruction of scenes using both RGB and thermal images, leveraging cross-camera information to address these challenges [12]. While existing approaches have made use of similar insights in complementary tasks such as dehazing [66], hyperspectral imaging [67], and 3D reconstruction of a person via reflections [68], we demonstrate the first 3D RGB-thermal field reconstruction method that separately models material interactions with thermal and visible spectra to improve reconstruction quality.

3 Proposed Method

Our method extends nerfacto [21] to the combined RGBT (red, green, blue, and thermal/LWIR) domain. We begin by describing the main idea of our method as compared to standard visible-light RGB radiance fields, and then describing our implementation in more detail.

RGB
Thermal
	Ground Truth	Ours	Nerfacto {RGBT}	Nerfacto {RGB}{T}

Figure 5: Results on our sheet scene. The plastic sheet is opaque to visible light but transparent to LWIR, revealing the hot kettle behind it.

3.1 Main Idea: Broad-Spectrum Radiance Fields

Existing radiance field models, including NeRF [46] and its many variations, typically focus on modeling radiance in the visible spectrum as three color channels (red, green, and blue). In using the standard volume rendering formula based on the Beer–Lambert law (see section 3.2), these models implicitly assume that each point in space is equally absorptive of all three of these colors of light. While this is a good approximation for RGB visible light, to which most materials are either opaque or transparent, there are certain materials, such as stained glass, for which the approximation is no longer valid. For example, a red stained glass window transmits red light but occludes green and blue light, violating the equal-absorption assumption.

When we begin to consider radiance fields across a wider spectrum, including our setting of RGB and LWIR thermal radiance field modeling, we find that more materials exhibit differing absorption behavior across this wider spectrum. We model this behavior by explicitly endowing each spatial location with separate densities (absorption coefficients) for each wavelength, while introducing regularization to encourage these wavelength-specific densities to remain similar for most materials.

3.2 Image Formation Model

In the standard RGB setting, NeRF represents a scene as a volumetric radiance field

F_{\Theta}^{\text{rgb}}:(\mathbf{x},\mathbf{d})\mapsto(\mathbf{c}_{\text{rgb}},\sigma_{\text{rgb}})

mapping a 3D point $\mathbf{x}\in\mathbb{R}^{3}$ and viewing direction $\mathbf{d}\in\mathbb{R}^{3}$ to a volume density $\sigma_{\text{rgb}}$ and view-dependent emitted color $\mathbf{c}_{\text{rgb}}=(r,g,b)\in\mathbb{R}^{3}$ . The scene is rendered along a camera ray $\mathbf{r}=\mathbf{o}+t\mathbf{d}$ with origin $\mathbf{o}\in\mathbb{R}^{3}$ and direction $\mathbf{d}\in\mathbb{R}^{3}$ via standard volumetric rendering [69]

	$\displaystyle\mathbf{c}_{\text{rgb}}(\mathbf{r})$	$\displaystyle=\int_{0}^{\infty}T(t)\sigma_{\text{rgb}}(\mathbf{r}(t))\mathbf{c}_{\text{rgb}}(\mathbf{r}(t),\mathbf{d})dt$		(1)
	$\displaystyle\text{where}\quad T(t)$	$\displaystyle=\exp\left[-\int_{0}^{t}\sigma_{\text{rgb}}(\mathbf{r}(t^{\prime}))dt^{\prime}\right]$		(2)

which is approximated numerically via $N$ samples along the ray via

$\displaystyle\mathbf{c}_{\text{rgb}}(\mathbf{r})$	$\displaystyle\approx\sum_{i=1}^{N}w_{i}\mathbf{c}_{\text{rgb}}(\mathbf{r}(t_{i}),\mathbf{d})$	(3)
$\displaystyle\text{where}\quad w_{i}$	$\displaystyle=T_{i}(1-\exp(-\sigma_{\text{rgb}}(\mathbf{r}(t_{i}))(t_{i+1}-t_{i})))$	(4)
$\displaystyle\text{and}\quad T_{i}$	$\displaystyle=\exp\left[-\sum_{j=1}^{i-1}\sigma_{\text{rgb}}(\mathbf{r}(t_{j}))(t_{j+1}-t_{j})\right].$	(5)

Now, to extend NeRF from the RGB to the RGBT domain, given a ray $\mathbf{r}$ , we wish to render the RGB color $(r,g,b)$ plus the color from a thermal image $\tau$ . We treat this as 4-D color $\mathbf{c}=(r,g,b,\tau)$ . Hence we introduce $c_{\text{therm}}:=\tau\in\mathbb{R}$ :

\displaystyle\mathbf{c}(\mathbf{r})=(\mathbf{c}_{\text{rgb}}(\mathbf{r}),c_{\text{therm}}(\mathbf{r}))

(6)

We observe that in the visible-light spectrum, objects of interest tend to absorb wavelengths of light similarly, but the same is not true in the combined infrared-and-visible-light spectrum. For example, many objects are opaque to visible light but transparent to infrared light, or vice versa, as illustrated by the pyrex glass bowl in fig. 2, which is thermally opaque but visibly transparent. We therefore propose to render $c_{\text{therm}}(\mathbf{r})$ with a separate density $\sigma_{\text{therm}}$ , distinct from $\sigma_{\text{rgb}}$ :

	$\displaystyle c_{\text{therm}}(\mathbf{r})$	$\displaystyle=\int_{0}^{\infty}T(t)\sigma_{\text{therm}}(\mathbf{r}(t))c_{\text{therm}}(\mathbf{r}(t),\mathbf{d})dt$		(7)
	$\displaystyle\text{where}\quad T(t)$	$\displaystyle=\exp\left[-\int_{0}^{t}\sigma_{\text{therm}}(\mathbf{r}(t^{\prime}))dt^{\prime}\right].$		(8)

We thus represent the RGBT scene as a radiance field

\displaystyle F_{\Theta}:(\mathbf{x},\mathbf{d})\mapsto(\mathbf{c}_{\text{rgb}},c_{\text{therm}},\sigma_{\text{rgb}},\sigma_{\text{therm}}).

(9)

RGB
Thermal
	Ground Truth	Ours	Nerfacto {RGBT}	Nerfacto {RGB}{T}

Figure 6: Results on our charger scene. Thermal imaging reveals heat dissipated by both the laptop and its power adapter.

3.3 Optimization and Regularization

To optimize $F_{\Theta}$ , we minimize the following objective:

\displaystyle\mathcal{L}=\mathcal{L}_{\text{rgb}}+\lambda_{\text{therm}}\mathcal{L}_{\text{therm}}+\lambda_{\text{$\sigma$}}\mathcal{L}_{\text{$\sigma$}}+\lambda_{\text{cc}}\mathcal{L}_{\text{cc}}+\lambda_{\text{tv}}\mathcal{L}_{\text{tv}}.

(10)

Here $\mathcal{L}_{\text{rgb}}$ and $\mathcal{L}_{\text{therm}}$ are the standard pixel-wise photometric $\ell_{2}$ losses against calibrated ground-truth (gt) images:

	$\displaystyle\mathcal{L}_{\text{rgb}}$	$\displaystyle=\frac{1}{\|\mathcal{R}_{\text{rgb}}\|}\sum_{\mathbf{r}\in\mathcal{R}_{\text{rgb}}}\left\\|\mathbf{c}_{\text{rgb}}(\mathbf{r})-\mathbf{c}_{\text{rgb}}^{\text{gt}}(\mathbf{r})\right\\|_{2}^{2}$		(11)
	$\displaystyle\mathcal{L}_{\text{therm}}$	$\displaystyle=\frac{1}{\|\mathcal{R}_{\text{therm}}\|}\sum_{\mathbf{r}\in\mathcal{R}_{\text{therm}}}(c_{\text{therm}}(\mathbf{r})-c_{\text{therm}}^{\text{gt}}(\mathbf{r}))^{2}.$		(12)

where $\mathcal{R}_{\text{rgb}}$ and $\mathcal{R}_{\text{therm}}$ are rays from RGB and thermal cameras respectively.

$\mathcal{L}_{\text{$\sigma$}}$ is an $\ell_{1}$ regularizer encouraging the RGB and thermal densities to deviate from each other only at sparse 3D positions $\mathbf{x}$ :

\displaystyle\mathcal{L}_{\text{$\sigma$}}

\displaystyle=\frac{1}{|\mathcal{X}|}\sum_{\mathbf{x}\in\mathcal{X}}\left|\text{$\sigma$}_{\text{rgb}}(\mathbf{x})-\text{$\sigma$}_{\text{therm}}(\mathbf{x})\right|.

(13)

In practice, we implement this regularizer in two parts, where one part applies the regularizer with weight $\lambda_{\text{$\sigma$},\text{rgb}}$ to $\text{$\sigma$}_{\text{rgb}}$ (while blocking any gradients to $\text{$\sigma$}_{\text{therm}}$ ) and the other applies the same regularizer but with a larger weight $\lambda_{\text{$\sigma$},\text{therm}}$ to $\text{$\sigma$}_{\text{therm}}$ (while blocking any gradients to $\sigma_{\text{rgb}}$ ). This two-part implementation of $\mathcal{L}_{\text{$\sigma$}}$ allows us to prioritize transfer of information from the RGB to the thermal components of the reconstruction, while still allowing each spectrum to regularize the other. This two-part $\mathcal{L}_{\sigma}$ regularizer is motivated by the observation illustrated in fig. 2 that, while some materials (like the glass bowl) do absorb visible and infrared light differently, most materials are similarly absorptive of both visible and infrared wavelengths. For those objects, we can leverage the information in the higher-resolution RGB measurements to guide the thermal reconstruction and achieve thermal super-resolution.

$\mathcal{L}_{\text{cc}}$ is a variation on the cross-channel prior [70, 71] adapted to our radiance field reconstruction task:

\displaystyle\mathcal{L}_{\text{cc}}

\displaystyle=\frac{1}{|\mathcal{R}_{\text{rgb}}|}\sum_{\mathbf{r}\in\mathcal{R}_{\text{rgb}}}\left\|\frac{1}{3}\nabla_{\mathbf{r}}(r^{\text{gt}}+g^{\text{gt}}+b^{\text{gt}})(\mathbf{r})-\nabla_{\mathbf{r}}c_{\text{therm}}(\mathbf{r})\right\|_{1}.

(14)

where $\mathbf{c}^{\text{gt}}_{\text{rgb}}=(r^{\text{gt}},g^{\text{gt}},b^{\text{gt}})$ . Concretely, we estimate $\mathcal{L}_{\text{cc}}$ stochastically by selecting a batch of patches of pixels during each minibatch gradient update. For each pixel patch, we convolve with a 2D finite difference kernel to compute the local spatial gradient in each channel, and then penalize the thermal channel for $\ell_{1}$ deviation in gradient relative to the visible channels. Note that $\mathcal{L}_{\text{cc}}$ is only applied to rendered thermal views for which we have ground truth RGB (rather than rendered RGB views for which we have ground truth thermal), so that this loss only affects the thermal reconstruction. Intuitively, this loss encourages object edges (high spatial gradient magnitudes) to align between the thermal and visible reconstructions. Since the RGB resolution of the training images is higher than the thermal resolution, this unidirectional cross-channel loss promotes thermal super-resolution, which demonstrate in fig. 13.

$\mathcal{L}_{\text{tv}}$ is a pixelwise total variation regularizer on thermally-unsupervised rendered thermal views:

\displaystyle\mathcal{L}_{\text{tv}}=\frac{1}{|\mathcal{R}_{\text{rgb}}|}\sum_{\mathbf{r}\in\mathcal{R}_{\text{rgb}}}\|\nabla_{\mathbf{r}}c_{\text{therm}}(\mathbf{r})\|_{1}.

(15)

Concretely, we estimate $\mathcal{L}_{\text{tv}}$ stochastically by selecting a batch of pixel patches during each minibatch gradient update and computing the 2D finite differences to estimate the local spatial gradient. We do this only for rendered thermal views for which we do not have ground truth thermal observations (i.e. views for which we only have RGB supervision). We motivate the inclusion of this term with two observations. First, thermal photographs tend to exhibit sparse features. Second, thermal cameras tend to have lower field of view (FOV) than RGB cameras [72]. Hence, especially when rendered from the perspective of an RGB camera, thermal views of the scene can exhibit noisy artifacts at the edges of the image or in the background. The inclusion of $\mathcal{L}_{\text{tv}}$ discourages the appearance of such artifacts and expands the boundaries of the thermal scene, compensating for the typically lower FOV of the thermal camera.

3.4 Camera Calibration

We calibrate both RGB and thermal cameras using OpenCV’s camera calibration [73] on a 2mm thick aluminum sheet with a 4 $\times$ 11 asymmetric grid of circular cutouts. Each circle has a diameter of $15$ mm and a center-center distance of $38$ mm. We chill the aluminum sheet in a freezer, then collect a series of simultaneous RGB and thermal calibration photographs. Example such photographs are shown in fig. 7.

RGB
Thermal
	Ground Truth	Ours	Nerfacto {RGBT}	Nerfacto {RGB}{T}

Figure 8: Results on our generator scene. Thermal imaging reveals heat emanating from one side of the generator box. Note that the placards on the right side of the box are visible in the {RGBT} baseline but not in ours or in the ground truth, since these placards are thermally transparent.

RGB
Thermal
	Ground Truth	Ours	Nerfacto {RGBT}	Nerfacto {RGB}{T}

Figure 9: Results on our heater scene. The chair in the background is not visible in the ground truth thermal image, or in our reconstruction, but does appear in the {RGBT} baseline due to leakage from the RGB channels.

For each of the RGB and thermal cameras, we estimate the extrinsic parameters, the rotation $R$ and translation $t$ from the camera coordinates to the world coordinates. The camera space coordinates $(X_{c},Y_{c},Z_{c})$ corresponding to a point $(X,Y,Z)$ in world space are

\displaystyle\begin{bmatrix}X_{c}\\ Y_{c}\\ Z_{c}\end{bmatrix}=\begin{bmatrix}R|t\\ \end{bmatrix}\begin{bmatrix}X\\ Y\\ Z\\ 1\end{bmatrix}.

(16)

We additionally estimate the intrinsic parameters, the focal length $(f_{x},f_{y})$ and principal point $(c_{x},c_{y})$ , as well as the radial distortion coefficients $k_{1},k_{2}$ and the tangential distortion coefficients $p_{1},p_{2}$ of each camera. Then a pixel $(u,v)$ is determined by

\displaystyle\begin{bmatrix}u\\ v\\ \end{bmatrix}=\begin{bmatrix}f_{x}x^{\prime\prime}+c_{x}\\ f_{y}y^{\prime\prime}+c_{y}\\ \end{bmatrix}

(17)

where

\displaystyle\begin{bmatrix}x^{\prime\prime}\\ y^{\prime\prime}\\ \end{bmatrix}=\begin{bmatrix}x^{\prime}(1+k_{1}r^{2}+k_{2}r^{4})+2p_{1}x^{\prime}y^{\prime}+p_{2}(r^{2}+2x^{\prime 2})\\ y^{\prime}(1+k_{1}r^{2}+k_{2}r^{4})+p_{1}(r^{2}+2y^{\prime 2})+2p_{2}x^{\prime}y^{\prime}\end{bmatrix}

(18)

and

	$\displaystyle\begin{bmatrix}x^{\prime}\\ y^{\prime}\\ \end{bmatrix}$	$\displaystyle=\begin{bmatrix}X_{c}/Z_{c}\\ Y_{c}/Z_{c}\\ \end{bmatrix}$		(19)
	$\displaystyle r^{2}$	$\displaystyle=x^{\prime 2}+y^{\prime 2}.$		(20)

After using OpenCV’s camera calibration [73] to solve for the intrinsic and extrinsic parameters of both cameras, we compute the relative transform between the RGB camera and the thermal camera. We use COLMAP [22, 23] to estimate camera poses for our RGB camera for each scene, and then recover thermal camera poses using this calibrated relative pose. We resolve the global scale ambiguity in applying this camera offset by measuring the physical distance between two of our camera measurements for each scene.

RGB
Thermal
	Ground Truth	Ours	Nerfacto {RGBT}	Nerfacto {RGB}{T}

Figure 10: Results on our trace scene. Thermal imaging can locate gaps in insulation, illustrated by the window behind the couch.

3.5 Training

We build and train our model by building upon the NeRFStudio library [21], using a modification of the default nerfacto model. Specifically, we create a second copy of the entire nerfacto model (including the auxiliary sampling model) to represent the thermal scene, and train the two models in parallel on their respective training images. The only connection between the RGB and LWIR (thermal) models is defined by our regularizers $\mathcal{L}_{\text{$\sigma$}}$ and $\mathcal{L}_{\text{cc}}$ .

3.6 Dataset Collection

We test our method on a novel dataset of 9 real-world scenes, each with 50–170 images from distinct viewpoints. We collected these images using a handheld FLIR One Pro [35], which attaches to a smartphone and records simultaneous RGB and LWIR channel images at a resolution of 1080 $\times$ 1440 and 480 $\times$ 640, respectively. For each scene, we reserve 10% of the RGB-LWIR image pairs as a test set, and train on the remaining 90%. We selected these scenes to demonstrate a range of indoor and outdoor settings for which thermal imaging shows interesting phenomena, such as revealing heat sources and sinks, checking thermal insulation, and imaging through visibly occlusive media. Example images from each scene are shown in fig. 2, fig. 4, fig. 5, fig. 6, fig. 8, fig. 9, fig. 10, fig. 14, and fig. 16 in the supplement.

In addition to this real-scene dataset, we also introduce a synthetic RGBT scene based on the hotdog scene from the NeRF Blender dataset [46], in which we make the hotdog thermally hot. This synthetic scene allows us to separately test the regularization and modeling aspects of our approach, without any concerns over potential miscalibration. It also allows us to test our thermal super-resolution capability in a setting for which ground-truth high-resolution thermal images are available, which is not the case for our real-scene dataset due to the limited thermal resolution of the FLIR One Pro. For this synthetic dataset, we render 45 training views and 5 testing views, each with red, green, blue, and thermal channels.

4 Experimental Results

4.1 Qualitative Results

We show qualitative results on all of our real-world scenes in fig. 2, fig. 4, fig. 5, fig. 6, fig. 8, fig. 9, fig. 10, fig. 14, and fig. 16 in the supplement. We note that, while all methods tend to do decently well at reconstructing the RGB scene, only our method reliably also recovers the thermal scene. Treating the thermal and visible spectra separately ({RGB}{T}) typically results in catastrophic failure to reconstruct the thermal scene, because the low resolution of the thermal images causes the camera pose refinement of NeRFStudio [21] to diverge—a similar behavior as in fig. 3. Treating the thermal and visible spectra jointly ({RGBT}), with a shared density field, is successful on many objects but fails to reconstruct certain materials, like glass, and often exhibits unintended leakage from the visible to the thermal spectra.

4.2 Quantitative Results

In Table I we compare our method to two natural baseline approaches based on nerfacto [21]. The {RGB}{T} baseline consists of completely separate nerfacto models for the visible and thermal wavelengths, while the {RGBT} baseline consists of a single nerfacto model with an extra color channel to represent the thermal (LWIR) spectrum. Our method is similar to the {RGB}{T} baseline, but with regularizers to tie together the visible and thermal reconstructions for most materials.

While both of these baselines successfully recover the RGB scene–unsurprisingly since the nerfacto model was developed for RGB imaging–only our method is also successful at simultaneous thermal reconstruction. The {RGB}{T} baseline tends to fail dramatically due to diverging thermal camera pose refinement (built into NeRFStudio [21]), while the {RGBT} baseline suffers thermal artifacts for not allowing any density (absorption coefficient) variation between the visible and thermal reconstructions.

TABLE I: Quantitative results and ablations, comparing our method against nerfacto baselines with fully joint or fully separate models for RGB and thermal spectra, as well as versions of our method without each of our proposed regularizers. The best result in each column is bolded and the second-best is underlined.

	PSNR ( $\uparrow$ )		SSIM ( $\uparrow$ )		LPIPS ( $\downarrow$ )
Method	RGB	Thermal	RGB	Thermal	RGB	Thermal
Ours	23.26	31.29	0.760	0.945	0.387	0.055
Nerfacto {RGBT}	21.89	21.51	0.721	0.845	0.417	0.182
Nerfacto {RGB}{T}	22.35	19.23	0.726	0.871	0.397	0.226
Ours w/o $\mathcal{L}_{\text{cc}}$	23.25	30.53	0.759	0.928	0.386	0.064
Ours w/o $\mathcal{L}_{\text{$\sigma$}}$	23.28	29.76	0.758	0.928	0.382	0.062
Ours w/o $\mathcal{L}_{\text{tv}}$	23.33	30.52	0.759	0.931	0.386	0.059

4.3 Ablation Studies

Table I also includes ablation studies comparing our method to versions of it without each of our three regularizers, $\mathcal{L}_{\text{cc}}$ , $\mathcal{L}_{\text{$\sigma$}}$ , and $\mathcal{L}_{\text{tv}}$ . Note that our method without any regularization is identical to the {RGB}{T} baseline, except that we also reduce the degree of camera pose refinement.

We find that including any regularizer alone produces a substantial improvement in thermal reconstruction quality relative to the two nerfacto baselines, with modest improvement in RGB reconstruction quality. We note that, in addition to the modest quantitative impact of $\mathcal{L}_{\text{$\sigma$}}$ shown in table I, this regularizer produces meaningful qualitative improvement. This qualitative improvement can be seen via the RGB and thermal depth maps in fig. 11, and by enabling the de-occlusion application shown in fig. 12.

RGB
Thermal
	Ground Truth	Ours	Nerfacto {RGBT}	Nerfacto {RGB}{T}

Figure 14: Results on our sink scene, with hot water flowing from the faucet.

4.4 Revealing Hidden Objects

In fig. 12 we demonstrate an application made possible by our method, specifically by $\mathcal{L}_{\text{$\sigma$}}$ and the use of separate densities for visible and LWIR light. We can remove occluding objects from RGB or thermal views, thus revealing objects hidden behind other objects by rendering only the parts of the scene with RGB and thermal densities sufficiently similar to each other. Precisely, we render RGB and thermal scenes respectively with densities

	$\displaystyle\sigma_{\text{rgb}}^{\prime}$	$\displaystyle=\mathbbm{1}_{\|\sigma_{\text{rgb}}-\sigma_{\text{therm}}\|<\epsilon}\cdot\sigma_{\text{rgb}}$		(21)
	$\displaystyle\sigma_{\text{therm}}^{\prime}$	$\displaystyle=\mathbbm{1}_{\|\sigma_{\text{rgb}}-\sigma_{\text{therm}}\|<\epsilon}\cdot\sigma_{\text{therm}}$		(22)

where $\epsilon$ is the minimum allowed magnitude of difference between the RGB and thermal densities in the rendered image. This computation is only possible due to the $\lambda_{\text{$\sigma$}}\mathcal{L}_{\text{$\sigma$}}$ term in our loss function (eq. 10), which encourages the RGB and thermal densities to be similar.

We demonstrate this application on the pyrex scene, which depicts a cold pack in a glass container, and the sheet scene, which depicts a hot water kettle behind a plastic sheet. The glass container in pyrex transmits visible but not infrared light; the plastic sheet in sheet transmits infrared but not visible light. We show that we are able to remove the occluding material in the RGB and thermal channels respectively, revealing the shape of the object that normally would be occluded from the ground-truth view.

4.5 Thermal Super-resolution

In fig. 13, we show qualitatively that our method achieves a level of thermal super-resolution. We demonstrate super-resolution on our synthetic hotdog scene, for which ground truth thermal images are available at high resolution for a valid comparison. We show that with the help of the higher-resolution RGB images (column 1), despite the low resolution of the thermal training data (column 2), our reconstruction (column 4), is able to reconstruct the higher-frequency ground-truth thermal features (column 3) in the underlying volume.

5 Conclusion

We present a novel method that addresses critical challenges of 3D thermal reconstruction. By recovering accurate thermal camera poses through inter-camera calibration, and by integrating information from both spectra while taking into account wavelength-dependent material properties, we have achieved significant improvement in 3D thermal reconstruction both quantitatively and qualitatively.

We note that, although our implementation is built on Nerfacto, our modifications to handle thermal radiance are applicable to any radiance field model (implicit or explicit). We also note that, although we focus on the case of LWIR thermal imaging, our modifications are likely applicable to other multispectral imaging settings in which the same material interacts differently with different wavelengths. For example, our method may be relevant to multi-energy X-ray computed tomography, in which different tissues absorb different X-ray wavelengths to varying degrees, or even to RGB radiance field modeling of certain materials that absorb red, green, and blue visible wavelengths differently, such as stained glass.

Acknowledgments

Many thanks to Skydio for sharing their RGB and thermal imaging of the crane structure. This material is based upon work supported by the National Science Foundation under award number 2303178 to SFK. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

[1] S. Fuentes, E. Tongson, and C. Gonzalez Viejo, “Urban green infrastructure monitoring using remote sensing from integrated visible and thermal infrared cameras mounted on a moving vehicle,” Sensors, vol. 21, no. 1, p. 295, 2021.
[2] M. Lega and R. M. Napoli, “Aerial infrared thermography in the surface waters contamination monitoring,” Desalination and water treatment, vol. 23, no. 1-3, pp. 141–151, 2010.
[3] K. Iwasaki, K. Fukushima, Y. Nagasaka, N. Ishiyama, M. Sakai, and A. Nagasaka, “Real-time monitoring and postprocessing of thermal infrared video images for sampling and mapping groundwater discharge,” Water Resources Research, vol. 59, no. 4, p. e2022WR033630, 2023.
[4] P. Pyykönen, P. Peussa, M. Kutila, and K.-W. Fong, “Multi-camera-based smoke detection and traffic pollution analysis system,” in 2016 IEEE 12th International Conference on Intelligent Computer Communication and Processing (ICCP). IEEE, 2016, pp. 233–238.
[5] P. Rudol and P. Doherty, “Human body detection and geolocalization for uav search and rescue missions using color and thermal imagery,” in 2008 IEEE aerospace conference. Ieee, 2008, pp. 1–8.
[6] C. D. Rodin, L. N. de Lima, F. A. de Alcantara Andrade, D. B. Haddad, T. A. Johansen, and R. Storvold, “Object classification in thermal images using convolutional neural networks for search and rescue missions with unmanned aerial systems,” in 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1–8.
[7] J. Goel, M. Nizamoglu, A. Tan, H. Gerrish, K. Cranmer, N. El-Muttardi, D. Barnes, and P. Dziewulski, “A prospective study comparing the flir one with laser doppler imaging in the assessment of burn depth by a tertiary burns unit in the united kingdom,” Scars, Burns & Healing, vol. 6, p. 2059513120974261, 2020.
[8] M. E. Jaspers, M. Carrière, A. Meij-de Vries, J. Klaessens, and P. Van Zuijlen, “The flir one thermal imager for the assessment of burn wounds: Reliability and validity study,” Burns, vol. 43, no. 7, pp. 1516–1523, 2017.
[9] H. Torresan, B. Turgeon, C. Ibarra-Castanedo, P. Hebert, and X. P. Maldague, “Advanced surveillance systems: combining video and thermal imagery for pedestrian detection,” in Thermosense XXVI, vol. 5405. SPIE, 2004, pp. 506–515.
[10] A. Akula, R. Ghosh, and H. Sardana, “Thermal imaging and its application in defence systems,” in AIP conference proceedings, vol. 1391, no. 1. American Institute of Physics, 2011, pp. 333–335.
[11] W. K. Wong, P. N. Tan, C. K. Loo, and W. S. Lim, “An effective surveillance system using thermal camera,” in 2009 international conference on signal acquisition and processing. IEEE, 2009, pp. 13–17.
[12] J. M. Jurado, A. López, L. Pádua, and J. J. Sousa, “Remote sensing image fusion on 3d scenarios: A review of applications for agriculture and forestry,” International Journal of Applied Earth Observation and Geoinformation, vol. 112, p. 102856, 2022.
[13] R. Näsi, E. Honkavaara, M. Blomqvist, P. Lyytikäinen-Saarenmaa, T. Hakala, N. Viljanen, T. Kantola, and M. Holopainen, “Remote sensing of bark beetle damage in urban forests at individual tree level using a novel hyperspectral camera from uav and aircraft,” Urban Forestry & Urban Greening, vol. 30, pp. 72–83, 2018.
[14] G. T. Miyoshi, M. d. S. Arruda, L. P. Osco, J. Marcato Junior, D. N. Gonçalves, N. N. Imai, A. M. G. Tommaselli, E. Honkavaara, and W. N. Gonçalves, “A novel deep learning method to identify single tree species in uav-based hyperspectral images,” Remote Sensing, vol. 12, no. 8, p. 1294, 2020.
[15] S. Lee, H. Moon, Y. Choi, and D. K. Yoon, “Analyzing thermal characteristics of urban streets using a thermal imaging camera: A case study on commercial streets in seoul, korea,” Sustainability, vol. 10, no. 2, p. 519, 2018.
[16] S. Fuentes, E. J. Tongson, R. De Bei, C. Gonzalez Viejo, R. Ristic, S. Tyerman, and K. Wilkinson, “Non-invasive tools to detect smoke contamination in grapevine canopies, berries and wine: A remote sensing and machine learning modeling approach,” Sensors, vol. 19, no. 15, p. 3335, 2019.
[17] M. Carrasco-Benavides, J. Antunez-Quilobrán, A. Baffico-Hernández, C. Ávila-Sánchez, S. Ortega-Farías, S. Espinoza, J. Gajardo, M. Mora, and S. Fuentes, “Performance assessment of thermal infrared cameras of different resolutions to estimate tree water status from two cherry cultivars: An alternative to midday stem water potential and stomatal conductance,” Sensors, vol. 20, no. 12, p. 3596, 2020.
[18] R. Hernández-Clemente, A. Hornero, M. Mottus, J. Peñuelas, V. González-Dugo, J. C. Jiménez, L. Suárez, L. Alonso, and P. J. Zarco-Tejada, “Early diagnosis of vegetation health from high-resolution hyperspectral and thermal imagery: Lessons learned from empirical relationships and radiative transfer modelling,” Current forestry reports, vol. 5, pp. 169–183, 2019.
[19] V. Malhotra and N. Carino, “Crc handbook on nondestructive testing of concrete,” CRC Press Inc., 01 2004.
[20] C. Jackson, C. Sherlock, P. Moore, and A. S. for Nondestructive Testing, Leak Testing, ser. Nondestructive testing handbook. American Society for Nondestructive Testing, 1998. [Online]. Available: https://books.google.com/books?id=bAdsQgAACAAJ
[21] M. Tancik, E. Weber, E. Ng, R. Li, B. Yi, J. Kerr, T. Wang, A. Kristoffersen, J. Austin, K. Salahi, A. Ahuja, D. McAllister, and A. Kanazawa, “Nerfstudio: A modular framework for neural radiance field development,” in ACM SIGGRAPH 2023 Conference Proceedings, ser. SIGGRAPH ’23, 2023.
[22] J. L. Schönberger and J.-M. Frahm, “Structure-from-motion revisited,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[23] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, “Pixelwise view selection for unstructured multi-view stereo,” in European Conference on Computer Vision (ECCV), 2016.
[24] Fluke, “What is thermal imaging? how a thermal image is captured.” [Online]. Available: https://www.fluke.com/en-us/learn/blog/thermal-imaging/how-infrared-cameras-work
[25] ——, “What is thermal imaging? thermal cameras and how they work,” Jan 2024. [Online]. Available: https://www.fluke.com/en-us/learn/blog/thermal-imaging/how-infrared-cameras-work
[26] A. Adán, B. Quintana, J. Garcia Aguilar, V. Pérez, and F. J. Castilla, “Towards the use of 3d thermal models in constructions,” Sustainability, vol. 12, no. 20, p. 8521, 2020.
[27] G. Grechi, M. Fiorucci, G. M. Marmoni, and S. Martino, “3d thermal monitoring of jointed rock masses through infrared thermography and photogrammetry,” Remote Sensing, vol. 13, no. 5, p. 957, 2021.
[28] J. Casana, A. Wiewel, A. Cool, A. C. Hill, K. D. Fisher, and E. J. Laugier, “Archaeological aerial thermography in theory and practice,” Advances in Archaeological Practice, vol. 5, no. 4, pp. 310–327, 2017.
[29] C. Brooke, “Thermal imaging for the archaeological investigation of historic buildings,” Remote Sensing, vol. 10, no. 9, p. 1401, 2018.
[30] F. Nilsson, Intelligent network video: Understanding modern video surveillance systems. crc Press, 2008.
[31] B. Mesnik, “Thermal versus optical ip cameras.” [Online]. Available: https://kintronics.com/thermal-versus-optical-ip-cameras/
[32] “Comparing sensitivity of thermal imaging camera modules.” [Online]. Available: https://www.flir.com/discover/cores-components/Comparing-Sensitivity-of-Thermal-Imaging-Cameras-Modules/
[33] F. L. Liu, Single-Shot 3D Microscopy: Optics and Algorithms Co-Design. University of California, Berkeley, 2022.
[34] R. Schmidt, “How patent-pending technology blends thermal and visible light.” [Online]. Available: https://www.fluke.com/en-us/learn/blog/thermal-imaging/how-patent-pending-technology-blends-thermal-and-visible-light
[35] “Flir one pro thermal imaging camera for smartphones — teledyne flir.” [Online]. Available: https://www.flir.com/products/flir-one-pro/?vertical=condition%2Bmonitoring&segment=solutions
[36] FLIR, “Flir one® series thermal imaging cameras for ios® or android™ smartphones.” [Online]. Available: https://www.flir.com/flir-one/
[37] M. Poggi, P. Zama Ramirez, F. Tosi, S. Salti, L. Di Stefano, and S. Mattoccia, “Cross-spectral neural radiance fields,” in International Conference on 3D Vision, 2022, 3DV.
[38] S. Brahmbhatt, C. Ham, C. C. Kemp, and J. Hays, “Contactdb: Analyzing and predicting grasp contact via thermal imaging,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8709–8719.
[39] T. Maeda, Y. Wang, R. Raskar, and A. Kadambi, “Thermal non-line-of-sight imaging,” in 2019 IEEE International Conference on Computational Photography (ICCP). IEEE, 2019, pp. 1–11.
[40] C. Collaro and M. Herkommer, “Research, application, and innovation of lidar technology in spatial archeology,” in Encyclopedia of Information Science and Technology, Sixth Edition. IGI Global, 2025, pp. 1–33.
[41] C. Collaro, C. Enríquez-Muñoz, A. López, C. Enríquez, and J. M. Jurado, “Detection of landscape features with visible and thermal imaging at the castle of puerta arenas,” Archaeological and Anthropological Sciences, vol. 15, no. 10, p. 152, 2023.
[42] D. Robertson and R. Cipolla, “Practical image processing and computer vision,” in chapter Structure from Motion. John Wiley & Sons Australia, 2009.
[43] Y. Furukawa, C. Hernández et al., “Multi-view stereo: A tutorial,” Foundations and Trends® in Computer Graphics and Vision, vol. 9, no. 1-2, pp. 1–148, 2015.
[44] S. Wang, H. Jiang, and L. Xiang, “Ct-mvsnet: Efficient multi-view stereo with cross-scale transformer,” in International Conference on Multimedia Modeling. Springer, 2024, pp. 394–408.
[45] Z. Cheng, C. Esteves, V. Jampani, A. Kar, S. Maji, and A. Makadia, “Lu-nerf: Scene and pose estimation by synchronizing local unposed nerfs,” arXiv preprint arXiv:2306.05410, 2023.
[46] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
[47] V. Sitzmann, J. Martel, A. Bergman, D. Lindell, and G. Wetzstein, “Implicit neural representations with periodic activation functions,” Advances in neural information processing systems, vol. 33, pp. 7462–7473, 2020.
[48] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhofer, “Deepvoxels: Learning persistent 3d feature embeddings,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2437–2446.
[49] S. Lombardi, T. Simon, J. Saragih, G. Schwartz, A. Lehrmann, and Y. Sheikh, “Neural volumes: Learning dynamic renderable volumes from images,” arXiv preprint arXiv:1906.07751, 2019.
[50] Sara Fridovich-Keil and Alex Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa, “Plenoxels: Radiance fields without neural networks,” in CVPR, 2022.
[51] E. R. Chan, C. Z. Lin, M. A. Chan, K. Nagano, B. Pan, S. De Mello, O. Gallo, L. J. Guibas, J. Tremblay, S. Khamis et al., “Efficient geometry-aware 3d generative adversarial networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 123–16 133.
[52] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, “Tensorf: Tensorial radiance fields,” 2022.
[53] T. Müller, A. Evans, C. Schied, and A. Keller, “Instant neural graphics primitives with a multiresolution hash encoding,” ACM Trans. Graph., vol. 41, no. 4, pp. 102:1–102:15, Jul. 2022. [Online]. Available: https://doi.org/10.1145/3528223.3530127
[54] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Transactions on Graphics, vol. 42, no. 4, July 2023. [Online]. Available: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/
[55] H. Zhu, Y. Sun, C. Liu, L. Xia, J. Luo, N. Qiao, R. Nevatia, and C.-H. Kuo, “Multimodal neural radiance field,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 9393–9399.
[56] Y. Zhang, S. Müller, B. Stephan, H.-M. Gross, and G. Notni, “Point cloud hand–object segmentation using multimodal imaging with thermal and color data for safe robotic object handover,” Sensors, vol. 21, no. 16, p. 5676, 2021.
[57] J. Á. S. Carmona, E. Quirós, V. Mayoral, and C. Charro, “Assessing the potential of multispectral and thermal uav imagery from archaeological sites. a case study from the iron age hillfort of villasviejas del tamuja (cáceres, spain),” Journal of Archaeological Science: Reports, vol. 31, p. 102312, 2020.
[58] M. McLeester, J. Casana, M. R. Schurr, A. C. Hill, and J. H. Wheeler III, “Detecting prehistoric landscape features using thermal, multispectral, and historical imagery analysis at midewin national tallgrass prairie, illinois,” Journal of Archaeological Science: Reports, vol. 21, pp. 450–459, 2018.
[59] G. Patrucco, A. Gómez, A. Adineh, M. Rahrig, and J. L. Lerma, “3d data fusion for historical analyses of heritage buildings using thermal images: The palacio de colomina as a case study,” Remote Sensing, vol. 14, no. 22, p. 5699, 2022.
[60] N. Sutherland, S. Marsh, G. Priestnall, P. Bryan, and J. Mills, “Infrared thermography and 3d-data fusion for architectural heritage: A scoping review,” Remote Sensing, vol. 15, no. 9, p. 2422, 2023.
[61] Microsoft, “Azure kinect dk depth camera.” [Online]. Available: https://learn.microsoft.com/en-us/azure/kinect-dk/depth-camera
[62] J. Oncea, “Swir, mwir, and lwir: One use case for each.” [Online]. Available: https://www.photonicsonline.com/doc/swir-mwir-and-lwir-one-use-case-for-each-0001
[63] I. Electro-Optics, “Nir (near-infrared imaging (fog/haze filter).” [Online]. Available: https://www.infinitioptics.com/technology/nir-near-infrared
[64] O. González, M. I. Lizarraga, S. Karaman, and J. Salas, “Thermal radiation dynamics of soil surfaces with unmanned aerial systems,” in Pattern Recognition: 11th Mexican Conference, MCPR 2019, Querétaro, Mexico, June 26–29, 2019, Proceedings 11. Springer, 2019, pp. 183–192.
[65] A. López, J. M. Jurado, C. J. Ogayar, and F. R. Feito, “An optimized approach for generating dense thermal point clouds from uav-imagery,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 182, pp. 78–95, 2021.
[66] F. Dümbgen, M. El Helou, N. Gucevska, and S. Süsstrunk, “Near-infrared fusion for photorealistic image dehazing,” Tech. Rep., 2018.
[67] S. Hu, R. Hou, L. Ming, S. Meifang, and P. Chen, “A hyperspectral image reconstruction algorithm based on rgb image using multi-scale atrous residual convolution network,” Frontiers in Marine Science, vol. 9, p. 1006452, 2023.
[68] R. Liu and C. Vondrick, “Humans as light bulbs: 3d human reconstruction from thermal reflection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 531–12 542.
[69] N. Max, “Optical models for direct volume rendering,” vol. 1, no. 2, pp. 99–108, 1995.
[70] F. Heide, M. Rouf, M. B. Hullin, B. Labitzke, W. Heidrich, and A. Kolb, “High-quality computational imaging through simple lenses,” ACM Trans. Graph., vol. 32, no. 5, oct 2013. [Online]. Available: https://doi.org/10.1145/2516971.2516974
[71] T. Sun, Y. Peng, and W. Heidrich, “Revisiting cross-channel information transfer for chromatic aberration correction,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3268–3276.
[72] A. Dlesk, K. Vach, and K. Pavelka, “Transformations in the photogrammetric co-processing of thermal infrared images and rgb images,” Sensors, vol. 21, no. 15, p. 5061, 2021.
[73] OpenCV. Camera calibration and 3d reconstruction. [Online]. Available: https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html

Supplement

1 Detailed results

In table II we present the quantitative results for each individual scene, as well as quantitative ablations excluding each of our proposed regularizers. Figure 15 visualizes the value of these regularizers via qualitative ablations.

TABLE II: Per-scene results.

Method			pyrex	heater	sink	charger	trace	generator	generators	sheet	engine
Ours	PSNR( $\uparrow$ )	RGB	22.12	24.88	28.31	25.09	22.59	20.44	18.84	24.84	22.25
		Thermal	35.78	35.24	23.83	32.60	30.92	32.80	31.13	27.36	31.98
	SSIM( $\uparrow$ )	RGB	0.715	0.827	0.911	0.890	0.851	0.610	0.436	0.800	0.804
		Thermal	0.987	0.980	0.890	0.956	0.960	0.977	0.914	0.914	0.925
	LPIPS( $\downarrow$ )	RGB	0.439	0.370	0.274	0.321	0.385	0.504	0.571	0.360	0.259
		Thermal	0.022	0.018	0.120	0.044	0.044	0.047	0.066	0.060	0.076
Nerfacto {RGB}{T}	PSNR( $\uparrow$ )	RGB	21.30	24.31	27.26	23.65	22.14	19.24	18.15	24.33	20.81
		Thermal	22.04	19.60	13.40	16.80	23.89	16.43	21.70	19.46	19.81
	SSIM( $\uparrow$ )	RGB	0.689	0.805	0.894	0.853	0.838	0.533	0.389	0.780	0.751
		Thermal	0.970	0.781	0.796	0.757	0.948	0.951	0.958	0.883	0.793
	LPIPS( $\downarrow$ )	RGB	0.449	0.370	0.278	0.338	0.378	0.533	0.597	0.365	0.268
		Thermal	0.096	0.179	0.476	0.271	0.119	0.133	0.153	0.185	0.425
Nerfacto {RGBT}	PSNR( $\uparrow$ )	RGB	21.76	23.68	23.89	24.56	22.90	19.02	18.18	22.89	20.19
		Thermal	32.31	20.94	15.65	13.95	22.04	29.87	17.33	19.71	21.77
	SSIM( $\uparrow$ )	RGB	0.713	0.806	0.845	0.879	0.843	0.529	0.394	0.748	0.728
		Thermal	0.985	0.668	0.754	0.655	0.949	0.982	0.936	0.812	0.864
	LPIPS( $\downarrow$ )	RGB	0.457	0.380	0.341	0.327	0.388	0.570	0.604	0.390	0.293
		Thermal	0.050	0.182	0.276	0.349	0.053	0.068	0.231	0.238	0.188
Ours /wo $\mathcal{L}_{\text{cc}}$	PSNR( $\uparrow$ )	RGB	22.28	24.91	28.20	25.08	22.64	20.28	18.69	24.92	22.21
		Thermal	34.97	35.20	21.99	27.03	35.45	30.88	32.34	24.31	32.61
	SSIM( $\uparrow$ )	RGB	0.721	0.829	0.910	0.900	0.851	0.605	0.425	0.798	0.801
		Thermal	0.987	0.984	0.893	0.921	0.960	0.974	0.920	0.791	0.925
	LPIPS( $\downarrow$ )	RGB	0.437	0.368	0.271	0.323	0.380	0.507	0.570	0.360	0.259
		Thermal	0.023	0.015	0.143	0.083	0.042	0.057	0.059	0.080	0.076
Ours w/o $\mathcal{L}_{\text{$\sigma$}}$	PSNR( $\uparrow$ )	RGB	22.16	24.97	28.51	25.10	22.76	20.45	18.73	24.89	21.93
		Thermal	34.81	33.64	20.08	31.07	30.37	28.61	30.24	28.83	30.24
	SSIM( $\uparrow$ )	RGB	0.715	0.828	0.912	0.887	0.852	0.607	0.429	0.795	0.794
		Thermal	0.987	0.974	0.722	0.947	0.958	0.958	0.915	0.961	0.934
	LPIPS( $\downarrow$ )	RGB	0.439	0.360	0.269	0.313	0.381	0.502	0.562	0.354	0.260
		Thermal	0.024	0.020	0.153	0.051	0.044	0.087	0.063	0.042	0.078
Ours w/o $\mathcal{L}_{\text{tv}}$	PSNR( $\uparrow$ )	RGB	22.35	24.97	28.58	25.25	22.61	20.36	18.72	24.82	22.29
		Thermal	36.01	35.53	25.66	31.18	30.05	32.90	29.52	23.97	29.89
	SSIM( $\uparrow$ )	RGB	0.717	0.831	0.912	0.892	0.852	0.602	0.428	0.796	0.803
		Thermal	0.987	0.984	0.894	0.950	0.957	0.976	0.912	0.786	0.932
	LPIPS( $\downarrow$ )	RGB	0.443	0.370	0.271	0.319	0.383	0.501	0.567	0.359	0.259
		Thermal	0.024	0.015	0.115	0.057	0.046	0.047	0.069	0.089	0.073

RGB
Thermal
	Ground Truth	Ours	Nerfacto {RGBT}	Nerfacto {RGB}{T}

Figure 16: Results on our generators scene.

2 Additional qualitative results

In fig. 16 we present qualitative results on our generators scene.

In fig. 17 we present RGB and thermal renderings from multiple viewpoints of our pyrex scene, to demonstrate multiview consistency. Video results are also available on the project webpage.



RGB ground truth	Thermal ground truth	Thermal ground truth	Our thermal
	(Input resolution)	(High resolution)

GT RGB
Our RGB
GT Thermal
Our Thermal

Ours
Ours w/o $\mathcal{L}_{\text{$\sigma$}}$
	RGB Depth	Thermal Depth

Ground Truth	Ours	w/o $\mathcal{L}_{\text{cc}}$

Ground Truth	Ours	w/o $\mathcal{L}_{\text{$\sigma$}}$

Ground Truth	Ours	w/o $\mathcal{L}_{\text{tv}}$