
Institute: NTT Corporation, 1-1, Hikarino-oka, Yokosuka-shi, Kanagawa, Japan

Unsupervised Intrinsic Image Decomposition with LiDAR Intensity Enhanced Training

Shogo Sato (ORCID: 0000-0003-3070-5880), Takuhiro Kaneko, Kazuhiko Murasaki, Taiga Yoshida, Ryuichi Tanida, Akisato Kimura
Abstract

Unsupervised intrinsic image decomposition (IID) is the process of separating a natural image into albedo and shade without ground truths for either. A recent model employing light detection and ranging (LiDAR) intensity demonstrated impressive performance, though the necessity of LiDAR intensity during inference restricts its practicality. Thus, IID models that employ only a single image during inference while keeping IID quality as high as models that use an image plus LiDAR intensity are highly desired. To address this challenge, we propose LIET, a novel approach that utilizes only an image during inference while utilizing both an image and LiDAR intensity during training. Specifically, we introduce a partially-shared model that accepts an image and LiDAR intensity individually through separate domain-specific encoders but processes them together in shared components to learn shared representations. In addition, to enhance IID quality, we propose an albedo-alignment loss and image-LiDAR conversion (ILC) paths. The albedo-alignment loss aligns the gray-scale albedo inferred from an image to that inferred from LiDAR intensity, thereby reducing cast shadows in the image-inferred albedo, since LiDAR intensity contains no cast shadows. Furthermore, to translate the input image into albedo and shade styles while keeping the image content, the input image is separated into a style code and a content code by encoders. The ILC paths mutually translate the image and LiDAR intensity, which share content but differ in style, contributing to a cleaner separation of style from content. Consequently, LIET achieves IID quality comparable to the existing model that uses LiDAR intensity, while utilizing only an image during inference.

Keywords:
Intrinsic image decomposition · LiDAR intensity · inference from a single image

1 Introduction

Intrinsic image decomposition (IID) is the process of separating a natural image into albedo and shade within Lambertian scenes. IID provides benefits for various high-level computer vision tasks, such as texture editing [3, 44] and semantic segmentation [58, 59, 2]. The origins of IID can be traced back to early work in computer vision during the 1970s [30, 1], where researchers grappled with the challenge of recovering albedos from shaded images. Since IID is an ill-posed problem that separates a natural image into its albedo and shade, researchers

Refer to caption
Figure 1: Train/infer schemes and examples of inferred albedos by (a) USI3D [40], (b) IID-LI [52], and (c) LIET (our proposed model). USI3D, utilizing only a single image for both training and inference, leaves cast shadows on the inferred albedo. On the other hand, IID-LI utilizes LiDAR intensity during training and inference, making shadows much less noticeable. However, IID-LI has restricted applicability due to the requirement of LiDAR intensity even during inference. LIET utilizes only an image for inference to expand its usage scenarios, and utilizes both an image and its corresponding LiDAR intensity during training to make shadows less noticeable.

have traditionally addressed this challenge by incorporating a variety of priors, including albedo flatness, shade smoothness [30, 61, 18, 67, 4, 5], and dependence between shade and geometry [11, 33, 25]. Recently, a notable development has been the emergence of supervised learning models [47, 69, 48, 29, 15, 68, 41, 72], trained on the ground-truth albedo and shade corresponding to an input image with sparsely-annotated datasets [4] or synthetic datasets [9, 10, 37]. However, supervised learning models are not ideal since it is difficult to prepare ground truths by eliminating illumination from images within general scenes. On the other hand, unsupervised learning models [42, 38, 40, 53], which do not utilize ground-truth albedo and shade corresponding to the input image, encounter IID-quality limitations as shown in Fig. 1 (a), particularly in their capacity to reduce cast shadows. More recently, an unsupervised learning model that utilizes light detection and ranging (LiDAR) intensity, called intrinsic image decomposition with LiDAR intensity (IID-LI) [52], has notably enhanced IID quality as shown in Fig. 1 (b). LiDAR intensity refers to the strength of light reflected from object surfaces and is equivalent to albedo at infrared wavelengths. Thus, LiDAR intensity is effective for IID tasks. However, the applicability of IID-LI is limited since it requires LiDAR intensity even during inference.

This paper aims to employ only a single image without LiDAR intensity during inference to expand usage scenarios while keeping the high IID quality demonstrated by IID-LI, as shown in Fig. 1 (c). To accomplish this objective, we propose unsupervised single-image intrinsic image decomposition with LiDAR intensity enhanced training (LIET). In the previous IID-LI framework, a completely-shared model accepts both an image and LiDAR intensity as input and processes them simultaneously during both training and inference. In contrast, our proposed LIET is implemented with a partially-shared model that accepts an image and LiDAR intensity individually using separate domain-specific encoders but processes them together in shared components to learn shared representations. Inference from a single image is achieved by utilizing only the image-encoder path during inference, while both the image-encoder path and the LiDAR-encoder path are utilized during training. Furthermore, to enhance the IID quality, we propose an albedo-alignment loss and image-LiDAR conversion (ILC) paths. The albedo-alignment loss aligns the gray-scaled albedo from an image to that inferred from its corresponding LiDAR intensity. LiDAR intensity reflects object-surface properties independent of sunlight conditions and cast shadows, hence the albedo from LiDAR intensity has the potential to serve as a criterion for reducing cast shadows. In addition, to translate the input image into albedo and shade styles while keeping the image contents (style and contents denote domain-variant components, such as illumination, and domain-invariant components, such as object edges, respectively), the input image is separated into style and content codes by encoders. Since an image and its corresponding LiDAR intensity share content but differ in style, the ILC paths that mutually translate them facilitate the separation into content and style codes, enhancing the IID quality.

The performance of LIET is investigated by comparing it with existing IID models, including energy optimization models [18, 4, 5], a weakly supervised model [15], and unsupervised models [38, 34, 40, 52], in terms of IID quality metrics and image quality assessment (IQA) [62, 16, 28, 56, 65] on inferred albedos. The main contributions of this study are summarized as follows.

  • We propose unsupervised single-image intrinsic image decomposition with LiDAR intensity enhanced training (LIET) with a partially-shared model that accepts an image and LiDAR intensity individually using separate domain-specific encoders but processes them together in shared components to learn shared representations. This architecture enables inference from a single image, while both an image and LiDAR intensity are utilized during training.

  • To enhance the effective utilization of LiDAR intensity, we introduce albedo-alignment loss to align the albedo inferred from an image to that from its corresponding LiDAR intensity, and image-LiDAR conversion (ILC) paths to translate the input image into albedo and shade style while keeping the image contents. Additionally, the ablation study demonstrates the effectiveness of each proposed architecture and loss.

  • In terms of IID quality, our proposed model, which employs only an image during inference, demonstrates performance comparable to the existing model, which employs both an image and LiDAR intensity during inference.

2 Related work

This section first introduces general image-to-image translation (I2I) models that translate an input image from its source domain to a target domain. Since IID is a specific form of I2I that translates an input image from the image domain into the albedo and shade domains, several IID models, including LIET, are based on the I2I framework. Furthermore, we describe existing unsupervised IID models and examples of LiDAR intensity utilization.
Image-to-image translation. I2I models are designed to translate an input image from its source domain to a target domain. Most I2I models rely on deep generative models, and typical models include generative adversarial networks (GAN) [17] such as pix2pix [24]. Due to the challenge of acquiring paired images for each domain, CycleGAN [71] was proposed as an I2I model that does not require paired images. Additionally, unsupervised image-to-image translation networks (UNIT) [39] achieved unsupervised I2I by implementing weight sharing within the latent space. Conversely, UNIT and CycleGAN require training as many models as the number of domains to be translated, leading to high computational costs. Thus, StarGAN [14] and multimodal unsupervised image-to-image translation (MUNIT) [23] were introduced to translate images into multiple domains using a single model. Additionally, diverse image-to-image translation via disentangled representations (DRIT) [32], which amalgamates the advantages of UNIT and MUNIT, was presented. More recently, diffusion models [21, 50] have begun to be applied to I2I [35, 13]. USI3D [40], IID-LI [52], and LIET are implemented based on MUNIT as specialized models for the IID task.
Unsupervised learning models for IID. Acquiring ground-truth albedo and shade corresponding to an input image in general scenes presents a considerable challenge, thus necessitating the use of unsupervised learning models that do not depend on such ground truths. To facilitate IID in the absence of ground truths, two primary strategies are employed: one uses multiple images captured under different conditions, and the other uses a single image together with synthetic data for the albedo and shade domains that do not correspond to the image. As the first strategy, models have been trained using pairs of images under varying illumination conditions [42], as well as sequences of related images [63]. More recent models [55, 6, 64, 66, 7, 46, 20, 70] leveraging the neural radiance field (NeRF) [45] framework with multi-view images have been proposed. Conversely, as the second strategy, USI3D [40] employs an image for IID and synthetic albedo and shade domain data for ensuring albedo- or shade-domain likelihood, enhancing the decomposition quality without direct ground truth. In addition, IID-LI [52] incorporates LiDAR intensity based on USI3D to reduce cast shadows and demonstrates impressive IID performance. However, IID-LI has restricted applicability due to the requirement for LiDAR intensity even during inference. Thus, IID models employing only a single image during inference while keeping IID quality as high as IID-LI are highly desired.
LiDAR intensity utilization. LiDAR is a device for measuring the distance to the object surfaces based on the time of flight from infrared laser irradiation to the reception of reflected light. In addition to the distance measurement, LiDAR also captures the intensity of the reflected light from the object surfaces, commonly referred to as LiDAR intensity. This LiDAR intensity is unaffected by variations in sunlight conditions or shading while preserving the texture of

Refer to caption
Figure 2: Examples of (a) input image, (b) gray-scale image, and its corresponding (c) LiDAR intensity and (d) LiDAR depth. The red circle indicates the region with cast shadows and white arrows. Both the shadow and the white arrow are visible in the gray-scale image. LiDAR intensity has no cast shadows while maintaining the white arrows, since LiDAR intensity is calculated from the intensity ratio of irradiated and reflected light, equivalent to an albedo at infrared wavelengths. Conversely, LiDAR depth represents the distance to objects, resulting in the absence of both cast shadows and white arrows.

object surfaces, as illustrated in Fig. 2. Thus, LiDAR intensity has the potential for effective utilization in IID tasks. LiDAR intensity is also widely applied due to its ability to depict surface properties, for instance, in shadow detection [19, 51], hyper-spectral data correction [49, 8], and object recognition [31, 27, 43, 36]. Note that LiDAR and a camera typically operate in distinct wavelength bands: the near-infrared band and the visible-light band, respectively. Hence, LiDAR intensity cannot be used directly as ground-truth albedo. LIET addresses this wavelength difference by computing a loss against the albedo inferred from LiDAR intensity, rather than using LiDAR intensity directly.

3 Proposed model (LIET)

3.1 Problem formulation

This section describes the problem formulation addressed by the LIET framework. In the inference process, only a real-world image $x_{\rm{I}}$ is employed to infer albedo $x_{\rm{RI}}$ and shade $x_{\rm{SI}}$. Meanwhile, in the training process, a set of a real-world image $x_{\rm{I}}$ and its corresponding LiDAR intensity $x_{\rm{L}}$ is prepared. The ground-truth albedo and shade corresponding to the image are not prepared due to the difficulty of obtaining ground truths in the real world. Instead of these ground truths, albedo $x_{\rm{R}}$ and shade $x_{\rm{S}}$ samples derived from a synthetic dataset that do not correspond to the image $x_{\rm{I}}$ are prepared, thereby enabling estimation of the distributions of albedo and shade.

3.2 USI3D architecture

Overview. USI3D consists of within-domain reconstruction and cross-domain translation as illustrated in the light-blue regions of Fig. 3. The within-domain reconstruction aims to extract features for each domain of image $x_{\rm{I}}$, albedo $x_{\rm{R}}$, and shade $x_{\rm{S}}$ by encoders and decoders. The cross-domain translation infers albedo $x_{\rm{RI}}$ and shade $x_{\rm{SI}}$ from an input image $x_{\rm{I}}$.
Within-domain reconstruction. As illustrated in Fig. 3 (a), for each domain

Refer to caption
Figure 3: LIET architecture including (a) within-domain reconstruction and (b) cross-domain translation. (a) For each domain (image ${\rm{I}}$, LiDAR intensity ${\rm{L}}$, albedo ${\rm{R}}$, shade ${\rm{S}}$), an input $x_{\rm{X}}$ is fed into style $E^{p}_{\rm{X}}$ and content $E^{c}_{\rm{X}}$ encoders to calculate style $p_{\rm{X}}$ and content $c_{\rm{X}}$ codes for domain $X\in\{\rm{I,L,R,S}\}$. These codes are used at generators $G_{\rm{X}}$ to reconstruct the inputs within their domains. (b) The image-encoder path accepts an image $x_{\rm{I}}$ and infers image style $p_{\rm{I}}$ and content $c_{\rm{I}}$ codes as in within-domain reconstruction. Subsequently, $p_{\rm{I}}$ is fed into a style mapping function $f_{\rm{I}}$ to yield domain-specific style codes ($p_{\rm{RI}}$, $p_{\rm{SI}}$, $p_{\rm{LI}}$) for generating respective domains via generators ($G_{\rm{R}},G_{\rm{S}},G_{\rm{L}}$). Similarly, the LiDAR-encoder path uses $x_{\rm{L}}$ to infer albedo $x_{\rm{RL}}$, shade $x_{\rm{SL}}$, and image $x_{\rm{IL}}$ through LiDAR style and content encoders ($E^{p}_{\rm{L}},E^{c}_{\rm{L}}$). The albedo-alignment loss $\mathcal{L}^{\rm{AA}}$ aligns the gray-scaled albedo from an image to that inferred from the LiDAR intensity for reducing cast shadows.

including image ${\rm{I}}$, albedo ${\rm{R}}$, and shade ${\rm{S}}$, an input $x_{\rm{X}}$ is fed into style encoder $E^{p}_{\rm{X}}$ and content encoder $E^{c}_{\rm{X}}$ to derive style code $p_{\rm{X}}$ and content code $c_{\rm{X}}$, respectively, for $X\in\{\rm{I},\rm{R},\rm{S}\}$. These codes are input to domain-specific generators $G_{\rm{X}}$, reconstructing the original inputs within their respective domains as $x_{\rm{XX}}$. For example, for an image $x_{\rm{I}}$, the image-style encoder $E^{p}_{\rm{I}}$ and image-content encoder $E^{c}_{\rm{I}}$ are utilized to extract the image-style code $p_{\rm{I}}$ and image-content code $c_{\rm{I}}$, respectively, with the reconstruction achieved by the image generator $G_{\rm{I}}(c_{\rm{I}},p_{\rm{I}})$. Analogous processes apply for the albedo and shade reconstructions.
Cross-domain translation. For cross-domain translation, an input image $x_{\rm{I}}$ is processed to infer albedo $x_{\rm{RI}}$ and shade $x_{\rm{SI}}$. First, an input $x_{\rm{I}}$ is fed into the image-style encoder $E^{p}_{\rm{I}}$ and image-content encoder $E^{c}_{\rm{I}}$ to derive the image-style code $p_{\rm{I}}$ and image-content code $c_{\rm{I}}$, respectively. A style mapping function $f_{\rm{I}}$ then adjusts $p_{\rm{I}}$ to generate domain-specific style codes for albedo $p_{\rm{RI}}$ and shade $p_{\rm{SI}}$. Subsequently, these style codes, alongside $c_{\rm{I}}$, are input into their respective generators ($G_{\rm{R}}(c_{\rm{I}},p_{\rm{RI}}),G_{\rm{S}}(c_{\rm{I}},p_{\rm{SI}})$) to infer the albedo $x_{\rm{RI}}$ and shade $x_{\rm{SI}}$, respectively. Note that all encoders and decoders are shared between within-domain reconstruction and cross-domain translation.
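To make the data flow concrete, the following is a minimal PyTorch sketch of this cross-domain translation path. It is an illustration under simplifying assumptions (toy channel sizes and a plain style-modulated decoder instead of the full AdaIN residual architecture), not the released USI3D/LIET implementation; all class and variable names are chosen here to mirror the paper's notation.

```python
# Minimal sketch (not the released code): image -> (content, style) -> style
# mapping -> albedo/shade generators. Channel sizes are illustrative only.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """E^c_X: keeps spatial structure (the domain-invariant content code)."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, dim, 7, 1, 3), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim * 2, 4, 2, 1), nn.ReLU(inplace=True))  # one downsample
    def forward(self, x):
        return self.net(x)                        # content code c_X

class StyleEncoder(nn.Module):
    """E^p_X: pools to a global vector (the domain-variant style code)."""
    def __init__(self, in_ch=3, style_dim=8):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 64, 7, 1, 3), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, style_dim)
    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))   # style code p_X

class Generator(nn.Module):
    """G_X: decodes a content code modulated by a style code."""
    def __init__(self, dim=128, style_dim=8, out_ch=3):
        super().__init__()
        self.to_scale = nn.Linear(style_dim, dim)  # crude stand-in for AdaIN modulation
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(dim, 64, 5, 1, 2), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_ch, 7, 1, 3), nn.Sigmoid())
    def forward(self, c, p):
        scale = self.to_scale(p).unsqueeze(-1).unsqueeze(-1)
        return self.decode(c * (1 + scale))

class StyleMapping(nn.Module):
    """f_I: maps the image style code to albedo/shade style codes."""
    def __init__(self, style_dim=8):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(style_dim, 64), nn.ReLU(inplace=True),
                                 nn.Linear(64, 2 * style_dim))
    def forward(self, p_I):
        return self.mlp(p_I).chunk(2, dim=1)       # (p_RI, p_SI)

E_I_c, E_I_p, f_I = ContentEncoder(), StyleEncoder(), StyleMapping()
G_R, G_S = Generator(), Generator()

x_I = torch.rand(1, 3, 128, 128)                   # input image
c_I, p_I = E_I_c(x_I), E_I_p(x_I)                  # content and style codes
p_RI, p_SI = f_I(p_I)                              # domain-specific styles
x_RI, x_SI = G_R(c_I, p_RI), G_S(c_I, p_SI)        # inferred albedo and shade
```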
Adversarial training. To improve the translation quality from the source domain into the target domain, the translated images are input into discriminators. Discriminators for albedo $D_{\rm{R}}$ and shade $D_{\rm{S}}$ are implemented for adversarial training. For instance, the albedo inferred from the image $x_{\rm{RI}}$ and the albedo domain data $x_{\rm{R}}$ are provided to the albedo discriminator $D_{\rm{R}}$ to evaluate the domain likelihood.

3.3 LIET architectures

Overview. LIET is based on USI3D and consists of within-domain reconstruction and cross-domain translation, as illustrated in Fig. 3. For within-domain reconstruction, LiDAR intensity $x_{\rm{L}}$ reconstruction is added to the image $x_{\rm{I}}$, albedo $x_{\rm{R}}$, and shade $x_{\rm{S}}$ reconstructions. For cross-domain translation, LIET introduces a LiDAR-encoder path employing specific encoders ($E^{p}_{\rm{L}},E^{c}_{\rm{L}}$) to infer albedo $x_{\rm{RL}}$ and shade $x_{\rm{SL}}$ from LiDAR intensity $x_{\rm{L}}$, whereas USI3D had only an image-encoder path. Inference from an image is achieved by utilizing only the image-encoder path during inference, while both paths are utilized during training. To enhance the IID quality, we introduce an albedo-alignment loss to align the albedo inferred from an image $x_{\rm{RI}}$ to that from its corresponding LiDAR intensity $x_{\rm{RL}}$. Furthermore, image-LiDAR conversion (ILC) paths that mutually translate the image and LiDAR intensity are also implemented for effective utilization of LiDAR intensity.
Within-domain reconstruction. In the same manner as USI3D, within-domain reconstructions for image, albedo, and shade are implemented. In addition, reconstruction of the LiDAR intensity $x_{\rm{L}}$ is also implemented in LIET, as shown in Fig. 3 (a), to extract features of the LiDAR intensity. The LiDAR intensity $x_{\rm{L}}$ is input into the LiDAR-style encoder $E^{p}_{\rm{L}}$ and LiDAR-content encoder $E^{c}_{\rm{L}}$ to calculate the LiDAR-style code $p_{\rm{L}}$ and LiDAR-content code $c_{\rm{L}}$, respectively. Subsequently, the LiDAR generator $G_{\rm{L}}(c_{\rm{L}},p_{\rm{L}})$ is utilized to reconstruct the LiDAR intensity $x_{\rm{LL}}$.
Cross-domain translation. As illustrated in the light-blue region of Fig. 3 (b), USI3D infers albedo $x_{\rm{RI}}$ and shade $x_{\rm{SI}}$ from an image $x_{\rm{I}}$. Within the LIET framework, the objective is to use solely an image $x_{\rm{I}}$ as input during inference, while simultaneously leveraging both the image $x_{\rm{I}}$ and LiDAR intensity $x_{\rm{L}}$ during training. To this end, alongside the conventional image-encoder path, a LiDAR-encoder path employs the LiDAR intensity $x_{\rm{L}}$ to infer albedo $x_{\rm{RL}}$ and shade $x_{\rm{SL}}$. The LiDAR intensity $x_{\rm{L}}$ is fed into the LiDAR-style encoder $E^{p}_{\rm{L}}$ and LiDAR-content encoder $E^{c}_{\rm{L}}$ to calculate the LiDAR-style code $p_{\rm{L}}$ and LiDAR-content code $c_{\rm{L}}$, respectively. The LiDAR-style code $p_{\rm{L}}$ is input into the style mapping function $f_{\rm{L}}$ to infer style codes for albedo $p_{\rm{RL}}$ and shade $p_{\rm{SL}}$. Finally, these style and content codes are fed into the generators ($G_{\rm{R}}(c_{\rm{L}},p_{\rm{RL}}),G_{\rm{S}}(c_{\rm{L}},p_{\rm{SL}})$) to infer albedo $x_{\rm{RL}}$ and shade $x_{\rm{SL}}$. This partially-shared model supports the concurrent objectives of effectively leveraging LiDAR intensity during training and maintaining exclusive reliance on an image input during inference.
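As a sketch of how this partially-shared design can be wired, the snippet below (reusing the illustrative stubs from the previous sketch; the per-domain encoder instances, the 1-channel LiDAR assumption, and the training flag are assumptions, not the released code) runs both encoder paths during training but only the image-encoder path at inference.

```python
# Minimal sketch of the two-path forward pass; LiDAR intensity is assumed 1-channel.
import torch

E_I_c, E_I_p = ContentEncoder(in_ch=3), StyleEncoder(in_ch=3)   # image-specific encoders
E_L_c, E_L_p = ContentEncoder(in_ch=1), StyleEncoder(in_ch=1)   # LiDAR-specific encoders
f_I, f_L = StyleMapping(), StyleMapping()                       # style mapping functions
G_R, G_S = Generator(), Generator()                             # generators shared by both paths

def forward_liet(x_I, x_L=None, training=True):
    # Image-encoder path (used at both training and inference time).
    c_I, p_I = E_I_c(x_I), E_I_p(x_I)
    p_RI, p_SI = f_I(p_I)
    out = {"x_RI": G_R(c_I, p_RI), "x_SI": G_S(c_I, p_SI)}
    # LiDAR-encoder path (used only during training, when x_L is available).
    if training and x_L is not None:
        c_L, p_L = E_L_c(x_L), E_L_p(x_L)
        p_RL, p_SL = f_L(p_L)
        out.update({"x_RL": G_R(c_L, p_RL), "x_SL": G_S(c_L, p_SL)})
    return out

outs = forward_liet(torch.rand(1, 3, 128, 128), torch.rand(1, 1, 128, 128))
```

Because the generators $G_{\rm{R}}$ and $G_{\rm{S}}$ are shared between the two paths, representations learned from LiDAR intensity during training can also benefit the image-only path used at inference.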
Image-LiDAR conversion (ILC) paths. As illustrated in the light-blue region of Fig. 3 (b), USI3D separates an image $x_{\rm{I}}$ into a content code $c_{\rm{I}}$ and a style code $p_{\rm{I}}$ to infer albedo $x_{\rm{RI}}$ and shade $x_{\rm{SI}}$, and this separation process is critical since it directly impacts the IID quality. Since USI3D utilizes independent datasets for image $x_{\rm{I}}$, albedo $x_{\rm{R}}$, and shade $x_{\rm{S}}$ without shared content, this separation relies solely on the style information unique to each domain. In contrast, in our problem formulation, the LiDAR intensity $x_{\rm{L}}$ corresponding to the image $x_{\rm{I}}$ is also available. This helps us separate an image into content and style codes ($c_{\rm{I}},p_{\rm{I}}$) accurately, since the image and the corresponding LiDAR intensity should share the same content. LIET incorporates the decoder $G_{\rm{L}}(c_{\rm{I}},p_{\rm{LI}})$ for inferring the LiDAR intensity in the image-encoder path; thereby, the content code $c_{\rm{I}}$ and style code $p_{\rm{I}}$ of the input image are readily separated. Furthermore, to enhance the inference quality of albedo $x_{\rm{RL}}$ and shade $x_{\rm{SL}}$ from the LiDAR intensity $x_{\rm{L}}$, an image inference path is also added to the LiDAR-encoder path in the same manner as the image-to-LiDAR-intensity path.
Adversarial training. Similar to USI3D, the albedo $x_{\rm{RI}}$ and shade $x_{\rm{SI}}$ inferred from an image $x_{\rm{I}}$ are input into the albedo discriminator $D_{\rm{R}}$ and shade discriminator $D_{\rm{S}}$, respectively, to improve the translation quality from the source domain into the target domain. In LIET, due to the incorporation of the LiDAR-encoder path, the albedo $x_{\rm{RL}}$ and shade $x_{\rm{SL}}$ inferred from LiDAR intensity $x_{\rm{L}}$ are also input into their respective discriminators ($D_{\rm{R}},D_{\rm{S}}$). Furthermore, along with the ILC paths, the inferred LiDAR intensity $x_{\rm{LI}}$ and image $x_{\rm{IL}}$ are also evaluated by the LiDAR-intensity discriminator $D_{\rm{L}}$ and image discriminator $D_{\rm{I}}$, respectively. Note that all encoders and decoders are shared among within-domain reconstruction, the image-encoder path, and the LiDAR-encoder path.

3.4 Losses

In this section, we describe the loss functions computed during within-domain reconstruction and cross-domain translation.
Image reconstruction loss $\mathcal{L}^{\rm{img}}$. Initially, the input images should be reconstructed after passing through the within-domain reconstruction process, hence the image reconstruction loss $\mathcal{L}^{\rm{img}}$ is defined in Eq. 1.

$\mathcal{L}^{\rm{img}}=\|x_{\rm{II}}-x_{\rm{I}}\|+\|x_{\rm{LL}}-x_{\rm{L}}\|+\|x_{\rm{RR}}-x_{\rm{R}}\|+\|x_{\rm{SS}}-x_{\rm{S}}\|,$   (1)

where $x_{\rm{II}}$, $x_{\rm{LL}}$, $x_{\rm{RR}}$, and $x_{\rm{SS}}$ are reconstructed images by within-domain reconstruction for image, LiDAR intensity, albedo, and shade, respectively.
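A minimal sketch of Eq. 1 is shown below; the L1 norm is an assumption, since the paper writes the loss with a generic norm.

```python
import torch.nn.functional as F

# L^img: within-domain reconstructions should match their inputs (L1 assumed).
def image_reconstruction_loss(x_II, x_I, x_LL, x_L, x_RR, x_R, x_SS, x_S):
    return (F.l1_loss(x_II, x_I) + F.l1_loss(x_LL, x_L)
            + F.l1_loss(x_RR, x_R) + F.l1_loss(x_SS, x_S))
```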
Style reconstruction loss $\mathcal{L}^{\rm{sty}}$ and content code reconstruction loss $\mathcal{L}^{\rm{cnt}}$. Since the reconstructed images should maintain their styles and contents, the style reconstruction loss $\mathcal{L}^{\rm{sty}}$ and content code reconstruction loss $\mathcal{L}^{\rm{cnt}}$ are defined in Eq. 2 and Eq. 3, respectively.

$\mathcal{L}^{\rm{sty}}=\|E^{p}_{\rm{L}}(x_{\rm{LI}})-p_{\rm{LI}}\|+\|E^{p}_{\rm{R}}(x_{\rm{RI}})-p_{\rm{RI}}\|+\|E^{p}_{\rm{S}}(x_{\rm{SI}})-p_{\rm{SI}}\|+\|E^{p}_{\rm{I}}(x_{\rm{IL}})-p_{\rm{IL}}\|+\|E^{p}_{\rm{R}}(x_{\rm{RL}})-p_{\rm{RL}}\|+\|E^{p}_{\rm{S}}(x_{\rm{SL}})-p_{\rm{SL}}\|,$   (2)

$\mathcal{L}^{\rm{cnt}}=\|E^{c}_{\rm{L}}(x_{\rm{LI}})-c_{\rm{I}}\|+\|E^{c}_{\rm{R}}(x_{\rm{RI}})-c_{\rm{I}}\|+\|E^{c}_{\rm{S}}(x_{\rm{SI}})-c_{\rm{I}}\|+\|E^{c}_{\rm{I}}(x_{\rm{IL}})-c_{\rm{L}}\|+\|E^{c}_{\rm{R}}(x_{\rm{RL}})-c_{\rm{L}}\|+\|E^{c}_{\rm{S}}(x_{\rm{SL}})-c_{\rm{L}}\|.$   (3)
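The following sketch mirrors Eq. 2 and Eq. 3: each cross-domain output is re-encoded and compared with the code it was generated from. The dictionary keys and the L1 norm are assumptions for illustration.

```python
import torch.nn.functional as F

# L^sty and L^cnt over the six translated outputs (x_LI, x_RI, x_SI, x_IL, x_RL, x_SL).
# enc_p[d] / enc_c[d] stand for E^p_d / E^c_d; codes holds p_* and c_* from the forward pass.
PAIRS = [("L", "I"), ("R", "I"), ("S", "I"), ("I", "L"), ("R", "L"), ("S", "L")]

def style_content_losses(outs, codes, enc_p, enc_c):
    sty = sum(F.l1_loss(enc_p[d](outs[f"x_{d}{s}"]), codes[f"p_{d}{s}"]) for d, s in PAIRS)
    cnt = sum(F.l1_loss(enc_c[d](outs[f"x_{d}{s}"]), codes[f"c_{s}"]) for d, s in PAIRS)
    return sty, cnt
```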

Adversarial loss $\mathcal{L}^{\rm{adv}}$. Moreover, the adversarial loss $\mathcal{L}^{\rm{adv}}$ [17] is defined as Eq. 4 to ensure that the image inferred through cross-domain translation aligns with the distribution of the target domain.

$\mathcal{L}^{\rm{adv}}=\log(1-D_{\rm{R}}(x_{\rm{RI}}))+\log(D_{\rm{R}}(x_{\rm{R}}))+\log(1-D_{\rm{R}}(x_{\rm{RL}}))+\log(D_{\rm{R}}(x_{\rm{R}}))+\log(1-D_{\rm{S}}(x_{\rm{SI}}))+\log(D_{\rm{S}}(x_{\rm{S}}))+\log(1-D_{\rm{S}}(x_{\rm{SL}}))+\log(D_{\rm{S}}(x_{\rm{S}}))+\log(1-D_{\rm{L}}(x_{\rm{LI}}))+\log(D_{\rm{L}}(x_{\rm{L}}))+\log(1-D_{\rm{I}}(x_{\rm{IL}}))+\log(D_{\rm{I}}(x_{\rm{I}})).$   (4)
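Eq. 4 is written in the original GAN log form [17]; in practice such a term is usually split into discriminator and generator updates. The sketch below implements one discriminator's contribution with the numerically stable BCE-with-logits form; whether the released code uses exactly this formulation (rather than, e.g., LSGAN) is an assumption. The full loss sums such terms over $D_{\rm{R}}$, $D_{\rm{S}}$, $D_{\rm{L}}$, and $D_{\rm{I}}$.

```python
import torch
import torch.nn.functional as F

# One discriminator's contribution, assuming D outputs raw logits.
def adversarial_d_loss(D, real, fake):
    real_logit, fake_logit = D(real), D(fake.detach())   # detach: do not update the generator here
    return (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
            + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))

def adversarial_g_loss(D, fake):
    fake_logit = D(fake)                                  # generator tries to fool D
    return F.binary_cross_entropy_with_logits(fake_logit, torch.ones_like(fake_logit))
```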

VGG loss $\mathcal{L}^{\rm{VGG}}$. To preserve the object edges and colors of the input image, the distance between the input image and the inferred albedo within the VGG feature space is computed [54, 12, 60] for the VGG loss $\mathcal{L}^{\rm{VGG}}$ [26] in Eq. 5.

$\mathcal{L}^{\rm{VGG}}=\|V(x_{\rm{I}})-V(x_{\rm{RI}})\|,$   (5)

where $V$ is a pre-trained visual-perception network such as VGG-19 [54].
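A minimal sketch of Eq. 5 using torchvision's pretrained VGG-19 is given below; the choice of intermediate layer and the L1 distance are assumptions, as the text only states that a pretrained visual-perception network is used.

```python
import torch.nn.functional as F
import torchvision

# Frozen VGG-19 feature extractor (cut at an intermediate block, chosen arbitrarily here).
vgg = torchvision.models.vgg19(weights="IMAGENET1K_V1").features[:21].eval()
for prm in vgg.parameters():
    prm.requires_grad_(False)

def vgg_loss(x_I, x_RI):
    # Compare the input image and the inferred albedo in VGG feature space.
    return F.l1_loss(vgg(x_I), vgg(x_RI))
```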
KLD loss $\mathcal{L}^{\rm{KLD}}$. Additionally, the Kullback-Leibler divergence (KLD) loss $\mathcal{L}^{\rm{KLD}}$ is formulated as Eq. 6 to align the probability distributions of inferred albedo style $q(p_{\rm{RI}})$ and shade style $q(p_{\rm{SI}})$ from an image with those calculated from synthetic data ($q(p_{\rm{R}}),q(p_{\rm{S}})$), facilitated by a style mapping function.

$\mathcal{L}^{\rm{KLD}}=\|\log q(p_{\rm{RI}})-\log q(p_{\rm{R}})\|+\|\log q(p_{\rm{SI}})-\log q(p_{\rm{S}})\|.$   (6)
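Eq. 6 compares log-densities of the style-code distributions. The sketch below is only illustrative: it swaps in a KL divergence between diagonal Gaussians fitted to each batch of style codes, since the density model $q(\cdot)$ is not specified in this section.

```python
from torch.distributions import Normal, kl_divergence

def kld_loss(p_RI, p_R, p_SI, p_S, eps=1e-6):
    # Fit a diagonal Gaussian to each batch of style codes (an assumption, not the paper's q);
    # this requires a batch of more than one code so that the standard deviation is defined.
    fit = lambda p: Normal(p.mean(dim=0), p.std(dim=0) + eps)
    return (kl_divergence(fit(p_RI), fit(p_R)).sum()
            + kl_divergence(fit(p_SI), fit(p_S)).sum())
```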

Physical loss $\mathcal{L}^{\rm{phy}}$. Given the assumption of a Lambertian surface in the IID task, the product of albedo and shade is expected to match the input image. Thus, the physical loss $\mathcal{L}^{\rm{phy}}$ is defined in Eq. 7.

$\mathcal{L}^{\rm{phy}}=\|x_{\rm{I}}-x_{\rm{RI}}\cdot x_{\rm{SI}}\|.$   (7)
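A minimal sketch of Eq. 7 (L1 assumed): the product of the inferred albedo and shade should reproduce the input image.

```python
import torch.nn.functional as F

# L^phy: Lambertian consistency between the input image and the albedo-shade product.
def physical_loss(x_I, x_RI, x_SI):
    return F.l1_loss(x_RI * x_SI, x_I)
```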

Albedo-alignment loss $\mathcal{L}^{\rm{AA}}$. To improve the IID quality, we propose the albedo-alignment loss $\mathcal{L}^{\rm{AA}}$ as depicted in Fig. 4, aligning the gray-scale albedo from an image to that inferred from its corresponding LiDAR intensity. Since the albedo inferred from LiDAR intensity $x_{\rm{RL}}$ is independent of daylight conditions and cast shadows, the IID quality is expected to improve by aligning the albedo inferred from an image $x_{\rm{RI}}$ to that inferred from its corresponding LiDAR intensity $x_{\rm{RL}}$. Additionally, these albedos need to be compared in gray scale due to the lack

Refer to caption
Figure 4: Calculation process of the albedo-alignment loss $\mathcal{L}^{\rm{AA}}$. First, the albedo from an image $x_{\rm{RI}}$ and that from its corresponding LiDAR intensity $x_{\rm{RL}}$ are computed. Subsequently, these albedos are masked to the points with LiDAR values and then converted to gray scale. Next, instance normalization is performed to align the scales of these albedos, and the distance between the scaled albedos is calculated. A stop gradient is applied on the LiDAR-encoder path side to align $x_{\rm{RI}}$ to $x_{\rm{RL}}$, since the LiDAR intensity is independent of sunlight conditions. The LiDAR intensity, the scaled albedo from the image, and that from the LiDAR intensity are shown in a cividis color map because they are gray scale.

of hue in LiDAR intensity. Thus, the albedo-alignment loss $\mathcal{L}^{\rm{AA}}$ is defined in Eq. 8 to compute the distance between $x_{\rm{RI}}$ and $x_{\rm{RL}}$ in gray scale.

$\mathcal{L}^{\rm{AA}}=\|{\rm{In}}(x_{\rm{RI}}\cdot m_{\rm{L}})-{\rm{In}}(x_{\rm{RL}}\cdot m_{\rm{L}})\|,$   (8)

where $m_{\rm{L}}$ represents the mask denoting the presence of LiDAR intensity values. ${\rm{In}}(\cdot)$ is an instance normalization [57] and gray-scale function. The instance normalization is used to align the scales of $x_{\rm{RI}}$ and $x_{\rm{RL}}$. In addition, a stop gradient is performed on the LiDAR-encoder path side to align $x_{\rm{RI}}$ to $x_{\rm{RL}}$.
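A minimal sketch of Eq. 8 is shown below. The gray-conversion weights and the L1 distance are assumptions; the stop gradient is realized with detach() on the LiDAR-side albedo, matching the description above.

```python
import torch.nn.functional as F

def instance_norm_gray(x, eps=1e-5):
    # (B, 3, H, W) albedo -> gray scale -> per-sample instance normalization.
    gray = 0.299 * x[:, 0:1] + 0.587 * x[:, 1:2] + 0.114 * x[:, 2:3]   # assumed gray weights
    mu = gray.mean(dim=(2, 3), keepdim=True)
    sigma = gray.std(dim=(2, 3), keepdim=True)
    return (gray - mu) / (sigma + eps)

def albedo_alignment_loss(x_RI, x_RL, m_L):
    # m_L: (B, 1, H, W) mask of pixels that have LiDAR intensity values.
    a_img = instance_norm_gray(x_RI * m_L)
    a_lidar = instance_norm_gray(x_RL * m_L).detach()   # stop gradient on the LiDAR-encoder path
    return F.l1_loss(a_img, a_lidar)
```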

In summary, LIET optimizes the loss function in Eq. 9.

$\mathcal{L}^{\rm{LIET}}=\mathcal{L}^{\rm{adv}}+\lambda_{\rm{img}}\mathcal{L}^{\rm{img}}+\lambda_{\rm{sty}}\mathcal{L}^{\rm{sty}}+\lambda_{\rm{cnt}}\mathcal{L}^{\rm{cnt}}+\lambda_{\rm{KLD}}\mathcal{L}^{\rm{KLD}}+\lambda_{\rm{VGG}}\mathcal{L}^{\rm{VGG}}+\lambda_{\rm{phy}}\mathcal{L}^{\rm{phy}}+\lambda_{\rm{AA}}\mathcal{L}^{\rm{AA}}.$   (9)

$\lambda_{\rm{img}}$, $\lambda_{\rm{sty}}$, $\lambda_{\rm{cnt}}$, $\lambda_{\rm{KLD}}$, $\lambda_{\rm{VGG}}$, $\lambda_{\rm{phy}}$, and $\lambda_{\rm{AA}}$ are hyperparameters for balancing the losses. The effects of each hyperparameter are detailed in the supplementary materials.
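A minimal sketch of the combined objective in Eq. 9, using the balancing weights reported later in the implementation details (Sec. 4.1):

```python
# Weighted sum of the individual terms; keys mirror the loss names in Eq. 9.
LAMBDAS = {"img": 100.0, "sty": 10.0, "cnt": 1.0, "KLD": 1.0, "VGG": 1.0, "phy": 10.0, "AA": 100.0}

def liet_total_loss(losses):
    """losses: dict with keys 'adv', 'img', 'sty', 'cnt', 'KLD', 'VGG', 'phy', 'AA'."""
    return losses["adv"] + sum(w * losses[k] for k, w in LAMBDAS.items())
```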

4 Experiments

4.1 Experimental setting

Dataset. To facilitate IID using LiDAR intensity during training, we employ the NTT-IID dataset [52], which consists of images, LiDAR intensities, and annotations designed for evaluating IID quality. The NTT-IID dataset provides 10,000 pairs of images of outdoor scenes and LiDAR intensities mapped to these images. Among these pairs, 110 samples have been annotated, yielding a total of 12,626 human judgments. Additionally, we utilize the FSVG dataset [29] as the target domain for albedo and shade. In this paper, we employ the same albedo and shade samples as those used in IID-LI [52].

| Model | Random: F-score(\uparrow) | Random: WHDR(\downarrow) | Random: Precision(\uparrow) | Random: Recall(\uparrow) | All: F-score(\uparrow) | All: WHDR(\downarrow) | All: Precision(\uparrow) | All: Recall(\uparrow) |
|---|---|---|---|---|---|---|---|---|
| Baseline-R [4] | 0.350 | 0.527 | 0.375 | 0.440 | 0.306 | 0.531 | 0.393 | 0.445 |
| Baseline-S [4] | 0.227 | 0.529 | 0.361 | 0.340 | 0.314 | 0.185 | 0.431 | 0.340 |
| Retinex [18] | 0.420 | 0.452 | 0.523 | 0.445 | 0.469 | 0.187 | 0.496 | 0.455 |
| Color Retinex [18] | 0.420 | 0.452 | 0.531 | 0.445 | 0.470 | 0.187 | 0.496 | 0.455 |
| Bell et al. [4] | 0.414 | 0.446 | 0.504 | 0.453 | 0.457 | 0.213 | 0.467 | 0.463 |
| Bi et al. [5] | 0.490 | 0.406 | 0.561 | 0.522 | 0.466 | 0.283 | 0.462 | 0.522 |
| Revisiting [15] | 0.442 | 0.428 | 0.635 | 0.470 | 0.499 | 0.181 | 0.575 | 0.485 |
| IIDWW [38] | 0.417 | 0.464 | 0.489 | 0.475 | 0.397 | 0.375 | 0.418 | 0.483 |
| UidSequence [34] | 0.419 | 0.483 | 0.453 | 0.450 | 0.395 | 0.372 | 0.405 | 0.453 |
| USI3D [40] | 0.454 | 0.422 | 0.539 | 0.500 | 0.446 | 0.287 | 0.444 | 0.504 |
| IID-LI [52] | 0.602 | 0.353 | 0.625 | 0.596 | 0.521 | 0.227 | 0.517 | 0.591 |
| LIET (Ours) | 0.607 | 0.340 | 0.649 | 0.601 | 0.525 | 0.245 | 0.500 | 0.598 |
Table 1: Numerical comparison of IID quality on the NTT-IID dataset [52]. Due to the bias in the number of annotations, we evaluate (i) randomly sampled annotations and (ii) all annotations, following IID-LI [52]. Red and blue fonts indicate the best and second-best results, respectively. For both randomly sampled annotations and all annotations, LIET (Ours) achieves IID quality comparable to IID-LI. In addition, Revisiting [15] demonstrates better IID quality on two indices under all annotations. Due to biases in the distribution of all annotations, models that infer flatter albedos are at an advantage in these metrics. Revisiting [15], which assumes local flatness of albedo in its training, demonstrates superior results by aligning with this bias.

Evaluation metric. For quantitative evaluation, we employ the following metrics: the weighted human disagreement rate (WHDR), precision, recall, and F-score, for both all annotations and randomly sampled annotations, following the same methodology as IID-LI [52]. In addition, we evaluate the image quality using five IQA models: MANIQA [62], TReS [16], MUSIQ [28], HyperIQA [56], and DBCNN [65].
Implementation details. In LIET, the style code encoder $E^{p}_{\rm{X}}$, content code encoder $E^{c}_{\rm{X}}$, generator $G_{\rm{X}}$, and discriminator $D_{\rm{X}}$ ($X\in\{\rm{I,L,R,S}\}$) are implemented with the same model structures and parameters as USI3D [40] (LIET is implemented based on the USI3D code: https://github.com/DreamtaleCore/USI3D.git) and IID-LI. The style encoders $E^{p}_{\rm{X}}$, the content code encoders $E^{c}_{\rm{X}}$, the decoders $G_{\rm{X}}$, and the discriminators $D_{\rm{X}}$ consist of convolutional layers, down-sampling layers, global average pooling layers, dense layers, and residual blocks. Notably, the residual blocks incorporate adaptive instance normalization (AdaIN) [22], and the AdaIN parameters are dynamically determined using a multi-layer perceptron (MLP). To assess images from both global and local perspectives, a multi-scale discriminator [60] is employed for the discriminators $D_{\rm{X}}$. Given a style code, the style mapping modules $M_{\rm{I}}$ and $M_{\rm{L}}$ are constructed with MLPs. We empirically set the following values for the hyperparameters: $\lambda_{\rm{img}}=100.0$, $\lambda_{\rm{sty}}=10.0$, $\lambda_{\rm{cnt}}=1.0$, $\lambda_{\rm{KLD}}=1.0$, $\lambda_{\rm{VGG}}=1.0$, $\lambda_{\rm{phy}}=10.0$, and $\lambda_{\rm{AA}}=100.0$.
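As an illustration of the AdaIN-based residual blocks mentioned above, the sketch below shows one plausible form in which an MLP maps the style code to per-channel scale and shift parameters; the layer sizes are assumptions, not the exact USI3D/LIET configuration.

```python
import torch
import torch.nn as nn

class AdaINResBlock(nn.Module):
    """Residual block whose normalization parameters are predicted from a style code."""
    def __init__(self, dim=128, style_dim=8):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 3, 1, 1)
        self.conv2 = nn.Conv2d(dim, dim, 3, 1, 1)
        self.norm = nn.InstanceNorm2d(dim, affine=False)
        self.mlp = nn.Linear(style_dim, 4 * dim)        # (gamma1, beta1, gamma2, beta2)

    def adain(self, h, gamma, beta):
        g = gamma.unsqueeze(-1).unsqueeze(-1)
        b = beta.unsqueeze(-1).unsqueeze(-1)
        return self.norm(h) * (1 + g) + b

    def forward(self, h, style):
        g1, b1, g2, b2 = self.mlp(style).chunk(4, dim=1)
        out = torch.relu(self.adain(self.conv1(h), g1, b1))
        out = self.adain(self.conv2(out), g2, b2)
        return h + out

block = AdaINResBlock()
y = block(torch.rand(1, 128, 32, 32), torch.rand(1, 8))   # toy usage
```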

4.2 Intrinsic image decomposition quality

LIET is compared with energy optimization models including baseline-R [4], baseline-S [4], Retinex [18], Color Retinex [18], Bell et al. [4], and Bi et al. [5].

Refer to caption
Figure 5: Examples of inference results obtained from various existing models and LIET (Ours) on the NTT-IID dataset [52]. The compared models include Revisiting [15], IIDWW [38], UidSequence [34], USI3D [40], and IID-LI [52]. Shadows are less noticeable with IID-LI and LIET, while cast shadows are visibly retained by the existing models that do not utilize LiDAR intensity.

Additionally, Revisiting [15] is also evaluated as a supervised model. (Revisiting is marked with an asterisk (∗) since it is a supervised learning model trained with the IIW dataset [4]. The IIW dataset features real-world images with human-judged annotations; unlike the NTT-IID dataset [52], it lacks LiDAR intensity but stands out for its large quantity of data and annotations.) Furthermore, IIDWW [38], UidSequence [34], USI3D [40], and IID-LI [52] are implemented as unsupervised learning models. Tab. 1 shows the numerical results for the NTT-IID dataset. Consequently, LIET demonstrates performance comparable to that of IID-LI, which utilizes a single image and its corresponding LiDAR intensity during inference, despite LIET taking only a single image as input. With all annotations, Revisiting [15] performs better on two metrics. This model incorporates an albedo flattening module, and these metrics favor flat inferred albedos due to the annotation bias. Subsequently, the qualitative results of the existing models and LIET are illustrated in Fig. 5. In comparison to other IID models, both IID-LI and LIET yield inferred albedos with less noticeable shadows due to the LiDAR intensity utilization. Though Revisiting [15] exhibits a flattened appearance, leading to favorable quantitative outcomes, cast shadows within the images still remain.

4.3 Image quality

Subsequently, we conduct an IQA comparison between LIET and the top three models in IID quality: Revisiting [15], USI3D [40], and IID-LI [52]. As shown in Tab. 2, LIET achieves the highest quality across all five evaluation metrics. Revisiting [15] is trained with relative gray-scaled albedo between nearby

| Model | MANIQA(\uparrow) | TReS(\uparrow) | MUSIQ(\uparrow) | HYPERIQA(\uparrow) | DBCNN(\uparrow) |
|---|---|---|---|---|---|
| Input | (0.664) | (81.1) | (59.2) | (0.595) | (58.1) |
| Revisiting [15] | 0.395 | 55.4 | 44.9 | 0.389 | 38.1 |
| USI3D [40] | 0.488 | 53.1 | 40.1 | 0.298 | 35.3 |
| IID-LI [52] | 0.460 | 55.5 | 43.5 | 0.285 | 37.5 |
| LIET (Ours) | 0.570 | 75.1 | 56.3 | 0.414 | 46.3 |
Table 2: Numerical comparison of IQA between LIET and the top three models in IID quality, including Revisiting [15], USI3D [40], and IID-LI [52]. The images inferred by LIET consistently exhibit the highest image quality across all metrics. This high image quality can be attributed to the absence of image blurring and collapse.
| Model | $\mathcal{L}^{\rm{smooth}}$ | F-score(\uparrow) | WHDR(\downarrow) | Precision(\uparrow) | Recall(\uparrow) | MANIQA(\uparrow) | TReS(\uparrow) | MUSIQ(\uparrow) |
|---|---|---|---|---|---|---|---|---|
| USI3D [40] |  | 0.444 | 0.427 | 0.540 | 0.502 | 0.629 | 67.2 | 52.6 |
| USI3D∗∗ [40] | \checkmark | 0.454 | 0.422 | 0.539 | 0.500 | 0.492 | 53.2 | 40.1 |
| IID-LI [52] |  | 0.570 | 0.389 | 0.581 | 0.565 | 0.522 | 80.9 | 56.4 |
| IID-LI∗∗ [52] | \checkmark | 0.602 | 0.353 | 0.625 | 0.596 | 0.461 | 55.5 | 43.5 |
| LIET∗∗ (Ours) |  | 0.607 | 0.340 | 0.649 | 0.601 | 0.570 | 75.1 | 56.3 |
| LIET (Ours) | \checkmark | 0.593 | 0.368 | 0.634 | 0.583 | 0.556 | 69.5 | 50.1 |
Table 3: Effect of the smoothing loss $\mathcal{L}^{\rm{smooth}}$ on IID quality and IQA with the NTT-IID dataset [52]. F-score, WHDR, precision, and recall are IID quality metrics; MANIQA, TReS, and MUSIQ are IQA metrics. The smoothing loss improves the IID quality of USI3D and IID-LI, while degrading image quality due to its blurring effect. LIET already achieves albedo local flatness, hence adding the smoothing loss does not improve its IID quality.

points. Thus, the model output tends to have reduced saturation, leading to lower IQA ratings. Additionally, both USI3D and IID-LI enhance IID quality through a smoothing process that assumes local albedo flatness. As a result, the images inferred by these models tend to exhibit blurriness, leading to lower IQA ratings. Conversely, in LIET, since fine shadows derived from cast shadows are absent in LiDAR intensity, the albedo inferred from LiDAR intensity has less variation in luminance. LIET achieves albedo local flatness without using a smoothing loss because the albedo-alignment loss aligns the albedo inferred from the image with the albedo inferred from LiDAR intensity. The effect of the smoothing loss is described in the next section.

4.4 Effect of smoothing loss

Smoothing loss is not implemented in LIET, despite USI3D [40] and IID-LI [52] utilizing it to improve IID quality. Thus, this section describes the effect of the smoothing loss on IID quality and IQA, by removing or adding the smoothing loss for USI3D [40], IID-LI [52], and LIET. (USI3D with smoothing loss, IID-LI with smoothing loss, and LIET without smoothing loss are marked with a double asterisk (∗∗) since these are the original models.) As a result, for both USI3D and IID-LI, eliminating the smoothing loss improves the image quality of the inferred albedos as shown in Tab. 3, since the blurring caused by the smoothing loss is reduced. On the other hand, the variation caused by cast shadows in the albedo was reduced by

| Model | F-score(\uparrow) | WHDR(\downarrow) | Precision(\uparrow) | Recall(\uparrow) |
|---|---|---|---|---|
| Ours | 0.607 | 0.340 | 0.649 | 0.601 |
| w/o $\mathcal{L}^{\rm{AA}}$ | 0.437 | 0.473 | 0.497 | 0.476 |
| w/o inst. | 0.489 | 0.447 | 0.520 | 0.505 |
| w/o gray | 0.601 | 0.359 | 0.623 | 0.596 |
| w/o ILC path | 0.589 | 0.361 | 0.641 | 0.581 |
Table 4: Effect of the albedo-alignment loss $\mathcal{L}^{\rm{AA}}$. "w/o inst." denotes the loss $\mathcal{L}^{\rm{AA}}$ calculated without instance normalization. "w/o gray" refers to removing the gray scaling from the albedo-alignment loss.

smoothing; hence, IID quality degrades without the smoothing loss. Conversely, the albedo inferred from LiDAR intensity has less variability derived from cast shadows, hence the variation in the albedo inferred from the image has already been reduced by the albedo-alignment loss in LIET. Thus, LIET achieves albedo local flatness without using a smoothing loss, which keeps the IQA closely aligned with the input image. Adding further smoothing only degrades IQA and does not improve IID quality.

4.5 Ablation study

This section describes an ablation study on the contributions of the albedo-alignment loss and the ILC paths. As shown in Tab. 4, the albedo-alignment loss is effective for LIET due to the direct connection between the image-encoder path and the LiDAR-encoder path during training. Since the distribution of LiDAR intensity varies across samples, features are well-trained by applying instance normalization rather than by scaling uniformly across all samples. The inference quality is slightly improved by aligning these albedos in gray scale, due to the lack of hue in LiDAR intensity. Additionally, the ILC paths contribute to separating an image into content and style codes by mutually translating the image and its corresponding LiDAR intensity, which share content but differ in style. Thus, the IID quality is improved by the ILC paths.

5 Conclusion

In this paper, we proposed unsupervised single-image intrinsic image decomposition with LiDAR intensity enhanced training (LIET). We proposed a novel approach in which an image and its corresponding LiDAR intensity are individually fed into the model during training, while the inference process employs only a single image. To relate the image-encoder path and the LiDAR-encoder path, we introduced the albedo-alignment loss, which aligns the albedo inferred from a single image to that from its corresponding LiDAR intensity, and the ILC paths, which enhance the separation of contents and styles. As a result, LIET achieved performance comparable to the state of the art in IID quality metrics while employing only a single image as input during inference. Furthermore, LIET demonstrated improvements in image quality, supported by five recent IQA metrics.

References

  • [1] Barrow, H., Tenenbaum, J., Hanson, A., Riseman, E.: Recovering Intrinsic Scene Characteristics from Images. Computer Vision Systems 2(3-26),  2 (1978)
  • [2] Baslamisli, A.S., Groenestege, T.T., Das, P., Le, H.A., Karaoglu, S., Gevers, T.: Joint Learning of Intrinsic Images and Semantic Segmentation. In: ECCV. pp. 286–302 (2018)
  • [3] Beigpour, S., Van De Weijer, J.: Object recoloring based on intrinsic image estimation. In: ICCV. pp. 327–334 (2011)
  • [4] Bell, S., Bala, K., Snavely, N.: Intrinsic Images in the Wild. ACM TOG 33(4), 1–12 (2014)
  • [5] Bi, S., Han, X., Yu, Y.: An L1 image transform for edge-preserving smoothing and scene-level intrinsic decomposition. ACM TOG 34(4), 1–12 (2015)
  • [6] Boss, M., Braun, R., Jampani, V., Barron, J.T., Liu, C., Lensch, H.: NeRD: Neural Reflectance Decomposition from Image Collections. In: ICCV. pp. 12684–12694 (2021)
  • [7] Boss, M., Jampani, V., Braun, R., Liu, C., Barron, J., Lensch, H.: Neural-PIL: Neural Pre-Integrated Lighting for Reflectance Decomposition. NeurIPS 34, 10691–10704 (2021)
  • [8] Brell, M., Segl, K., Guanter, L., Bookhagen, B.: Hyperspectral and Lidar Intensity Data Fusion: A Framework for the Rigorous Correction of Illumination, Anisotropic Effects, and Cross Calibration. IEEE Transactions on Geoscience and Remote Sensing 55(5), 2799–2810 (2017)
  • [9] Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A Naturalistic Open Source Movie for Optical Flow Evaluation. In: ECCV. pp. 611–625 (2012)
  • [10] Chang, A.X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al.: ShapeNet: An Information-Rich 3D Model Repository. arXiv preprint arXiv:1512.03012 (2015)
  • [11] Chen, Q., Koltun, V.: A Simple Model for Intrinsic Image Decomposition with Depth Cues. In: ICCV. pp. 241–248 (2013)
  • [12] Chen, Q., Koltun, V.: Photographic Image Synthesis with Cascaded Refinement Networks. In: ICCV. pp. 1511–1520 (2017)
  • [13] Cheng, B., Liu, Z., Peng, Y., Lin, Y.: General Image-to-Image Translation with One-Shot Image Guidance. In: ICCV. pp. 22736–22746 (2023)
  • [14] Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation. In: CVPR. pp. 8789–8797 (2018)
  • [15] Fan, Q., Yang, J., Hua, G., Chen, B., Wipf, D.: Revisiting Deep Intrinsic Image Decompositions. In: CVPR. pp. 8944–8952 (2018)
  • [16] Golestaneh, S.A., Dadsetan, S., Kitani, K.M.: No-Reference Image Quality Assessment via Transformers, Relative Ranking, and Self-Consistency. In: WACV. pp. 1220–1230 (2022)
  • [17] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative Adversarial Nets. NeurIPS 27, 139–144 (2014)
  • [18] Grosse, R., Johnson, M.K., Adelson, E.H., Freeman, W.T.: Ground truth dataset and baseline evaluations for intrinsic image algorithms. In: ICCV. pp. 2335–2342 (2009)
  • [19] Guislain, M., Digne, J., Chaine, R., Kudelski, D., Lefebvre-Albaret, P.: Detecting and Correcting Shadows in Urban Point Clouds and Image Collections. In: 3DV. pp. 537–545 (2016)
  • [20] Hasselgren, J., Hofmann, N., Munkberg, J.: Shape, Light, and Material Decomposition from Images using Monte Carlo Rendering and Denoising. NeurIPS 35, 22856–22869 (2022)
  • [21] Ho, J., Jain, A., Abbeel, P.: Denoising Diffusion Probabilistic Models. NeurIPS 33, 6840–6851 (2020)
  • [22] Huang, X., Belongie, S.: Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. In: ICCV. pp. 1501–1510 (2017)
  • [23] Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal Unsupervised Image-to-Image Translation. In: ECCV. pp. 172–189 (2018)
  • [24] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-Image Translation with Conditional Adversarial Networks. In: CVPR. pp. 1125–1134 (2017)
  • [25] Jeon, J., Cho, S., Tong, X., Lee, S.: Intrinsic Image Decomposition Using Structure-Texture Separation and Surface Normals. In: ECCV. pp. 218–233 (2014)
  • [26] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In: ECCV. pp. 694–711 (2016)
  • [27] Kashani, A.G., Graettinger, A.J.: Cluster-Based Roof Covering Damage Detection in Ground-Based Lidar Data. Automation in Construction 58, 19–27 (2015)
  • [28] Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: Multi-scale Image Quality Transformer. In: ICCV. pp. 5148–5157 (2021)
  • [29] Krähenbühl, P.: Free Supervision From Video Games. In: CVPR. pp. 2955–2964 (2018)
  • [30] Land, E.H., McCann, J.J.: Lightness and retinex theory. Journal of the Optical Society of America 61(1), 1–11 (1971)
  • [31] Lang, M.W., McCarty, G.W.: Lidar Intensity for Improved Detection of Inundation below the Forest Canopy. Wetlands 29(4), 1166–1178 (2009)
  • [32] Lee, H.Y., Tseng, H.Y., Huang, J.B., Singh, M., Yang, M.H.: Diverse Image-to-Image Translation via Disentangled Representations. In: ECCV. pp. 35–51 (2018)
  • [33] Lee, K.J., Zhao, Q., Tong, X., Gong, M., Izadi, S., Lee, S.U., Tan, P., Lin, S.: Estimation of Intrinsic Image Sequences from Image+Depth Video. In: ECCV. pp. 327–340 (2012)
  • [34] Lettry, L., Vanhoey, K., Gool, L.V.: Unsupervised Deep Single Image Intrinsic Decomposition using Illumination Varying Image Sequences. Computer Graphics Forum 37 (2018)
  • [35] Li, B., Xue, K., Liu, B., Lai, Y.K.: BBDM: Image-to-image Translation with Brownian Bridge Diffusion Models. In: CVPR. pp. 1952–1961 (2023)
  • [36] Li, E., Casas, S., Urtasun, R.: MemorySeg: Online LiDAR Semantic Segmentation with a Latent Memory. In: ICCV. pp. 745–754 (2023)
  • [37] Li, Z., Snavely, N.: CGIntrinsics: Better Intrinsic Image Decomposition through Physically-Based Rendering. In: ECCV. pp. 371–387 (2018)
  • [38] Li, Z., Snavely, N.: Learning Intrinsic Image Decomposition from Watching the World. In: CVPR. pp. 9039–9048 (2018)
  • [39] Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised Image-to-Image Translation Networks. NeurIPS 30 (2017)
  • [40] Liu, Y., Li, Y., You, S., Lu, F.: Unsupervised Learning for Intrinsic Image Decomposition from a Single Image. In: CVPR. pp. 3248–3257 (2020)
  • [41] Luo, J., Huang, Z., Li, Y., Zhou, X., Zhang, G., Bao, H.: NIID-Net: Adapting Surface Normal Knowledge for Intrinsic Image Decomposition in Indoor Scenes. IEEE TVCG 26(12), 3434–3445 (2020)
  • [42] Ma, W.C., Chu, H., Zhou, B., Urtasun, R., Torralba, A.: Single Image Intrinsic Decomposition without a Single Intrinsic Image. In: ECCV. pp. 201–217 (2018)
  • [43] Man, Q., Dong, P., Guo, H.: Pixel-and feature-level fusion of hyperspectral and lidar data for urban land-use classification. Remote Sensing 36(6), 1618–1644 (2015)
  • [44] Meka, A., Zollhöfer, M., Richardt, C., Theobalt, C.: Live Intrinsic Video. ACM TOG 35(4), 1–14 (2016)
  • [45] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Communications of the ACM 65(1), 99–106 (2021)
  • [46] Munkberg, J., Hasselgren, J., Shen, T., Gao, J., Chen, W., Evans, A., Müller, T., Fidler, S.: Extracting Triangular 3D Models, Materials, and Lighting From Images. In: CVPR. pp. 8280–8290 (2022)
  • [47] Narihira, T., Maire, M., Yu, S.X.: Direct Intrinsics: Learning Albedo-Shading Decomposition by Convolutional Regression. In: ICCV. pp. 2992–2992 (2015)
  • [48] Narihira, T., Maire, M., Yu, S.X.: Learning Lightness From Human Judgement on Relative Reflectance. In: CVPR. pp. 2965–2973 (2015)
  • [49] Priem, F., Canters, F.: Synergistic Use of LiDAR and APEX Hyperspectral Data for High-Resolution Urban Land Cover Mapping. Remote sensing 8(10),  787 (2016)
  • [50] Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D.J., Norouzi, M.: Image Super-Resolution via Iterative Refinement. IEEE TPAMI 45(4), 4713–4726 (2022)
  • [51] Sato, S., Yao, Y., Yoshida, T., Ando, S., Shimamura, J.: Shadow Detection Based on Luminance-LiDAR Intensity Uncorrelation. IEICE Transactions 106(9), 1556–1563 (2023)
  • [52] Sato, S., Yao, Y., Yoshida, T., Kaneko, T., Ando, S., Shimamura, J.: Unsupervised Intrinsic Image Decomposition With LiDAR Intensity. In: CVPR. pp. 13466–13475 (2023)
  • [53] Seo, K., Kinoshita, Y., Kiya, H.: Deep Retinex Network for Estimating Illumination Colors with Self-Supervised Learning. In: LifeTech. pp. 1–5 (2021)
  • [54] Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: ICLR. pp. 1–14 (2015)
  • [55] Srinivasan, P.P., Deng, B., Zhang, X., Tancik, M., Mildenhall, B., Barron, J.T.: NeRV: Neural Reflectance and Visibility Fields for Relighting and View Synthesis. In: CVPR. pp. 7495–7504 (2021)
  • [56] Su, S., Yan, Q., Zhu, Y., Zhang, C., Ge, X., Sun, J., Zhang, Y.: Blindly Assess Image Quality in the Wild Guided by a Self-Adaptive Hyper Network. In: CVPR. pp. 3667–3676 (2020)
  • [57] Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint arXiv:1607.08022 (2016)
  • [58] Upcroft, B., McManus, C., Churchill, W., Maddern, W., Newman, P.: Lighting Invariant Urban Street Classification. In: ICRA. pp. 1712–1718 (2014)
  • [59] Wang, C., Tang, Y., Zou, X., SiTu, W., Feng, W.: A robust fruit image segmentation algorithm against varying illumination for vision system of fruit harvesting robot. Optik 131, 626–631 (2017)
  • [60] Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In: CVPR. pp. 8798–8807 (2018)
  • [61] Weiss, Y.: Deriving intrinsic images from image sequences. In: ICCV. vol. 2, pp. 68–75 (2001)
  • [62] Yang, S., Wu, T., Shi, S., Lao, S., Gong, Y., Cao, M., Wang, J., Yang, Y.: MANIQA: Multi-dimension Attention Network for No-Reference Image Quality Assessment. In: CVPRW. pp. 1191–1200 (2022)
  • [63] Yu, Y., Smith, W.A.: InverseRenderNet: Learning single image inverse rendering. In: CVPR. pp. 3155–3164 (2019)
  • [64] Zhang, K., Luan, F., Wang, Q., Bala, K., Snavely, N.: PhySG: Inverse Rendering with Spherical Gaussians for Physics-based Material Editing and Relighting. In: CVPR. pp. 5453–5462 (2021)
  • [65] Zhang, W., Ma, K., Yan, J., Deng, D., Wang, Z.: Blind Image Quality Assessment Using A Deep Bilinear Convolutional Neural Network. IEEE TCSVT 30(1), 36–47 (2018)
  • [66] Zhang, X., Srinivasan, P.P., Deng, B., Debevec, P., Freeman, W.T., Barron, J.T.: NeRFactor: Neural Factorization of Shape and Reflectance Under an Unknown Illumination. ACM TOG 40(6), 1–18 (2021)
  • [67] Zhao, Q., Tan, P., Dai, Q., Shen, L., Wu, E., Lin, S.: A Closed-Form Solution to Retinex with Nonlocal Texture Constraints. IEEE TPAMI 34(7), 1437–1444 (2012)
  • [68] Zhou, H., Yu, X., Jacobs, D.W.: GLoSH: Global-Local Spherical Harmonics for Intrinsic Image Decomposition. In: ICCV. pp. 7820–7829 (2019)
  • [69] Zhou, T., Krahenbuhl, P., Efros, A.A.: Learning Data-driven Reflectance Priors for Intrinsic Image Decomposition. In: ICCV. pp. 3469–3477 (2015)
  • [70] Zhu, J., Huo, Y., Ye, Q., Luan, F., Li, J., Xi, D., Wang, L., Tang, R., Hua, W., Bao, H., et al.: I2-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs. In: CVPR. pp. 12489–12498 (2023)
  • [71] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In: ICCV. pp. 2223–2232 (2017)
  • [72] Zhu, Y., Tang, J., Li, S., Shi, B.: DeRenderNet: Intrinsic Image Decomposition of Urban Scenes with Shape-(In) dependent Shading Rendering. In: ICCP. pp. 1–11 (2021)