
Cross-Camera Convolutional Color Constancy

Mahmoud Afifi1,2       Jonathan T. Barron1       Chloe LeGendre1       Yun-Ta Tsai1       Francois Bleibel1
1Google Research           2York University
This work was done while Mahmoud was an intern at Google.
Abstract

We present “Cross-Camera Convolutional Color Constancy” (C5), a learning-based method, trained on images from multiple cameras, that accurately estimates a scene’s illuminant color from raw images captured by a new camera previously unseen during training. C5 is a hypernetwork-like extension of the convolutional color constancy (CCC) approach: C5 learns to generate the weights of a CCC model that is then evaluated on the input image, with the CCC weights dynamically adapted to different input content. Unlike prior cross-camera color constancy models, which are usually designed to be agnostic to the spectral properties of test-set images from unobserved cameras, C5 approaches this problem through the lens of transductive inference: additional unlabeled images are provided as input to the model at test time, which allows the model to calibrate itself to the spectral properties of the test-set camera during inference. C5 achieves state-of-the-art accuracy for cross-camera color constancy on several datasets, is fast to evaluate ($\sim$7 ms per image on a GPU and $\sim$90 ms on a CPU), requires little memory ($\sim$2 MB), and is thus a practical solution to the problem of calibration-free automatic white balance for mobile photography.

1 Introduction

Figure 1: Our C5 model exploits the colors of unlabeled additional images captured by the new camera model to generate a specific color constancy model for the input image. These additional images can be randomly loaded from the photographer’s “camera roll”, or they could be a fixed set taken once by the camera manufacturer. The images shown were captured by DSLR and smartphone camera models [48] that were not seen during training.

The goal of computational color constancy is to emulate the human visual system’s ability to perceive object colors as constant even when they are observed under different illumination conditions. In many contexts, this problem is equivalent to the practical problem of automatic white balance—removing an undesirable global color cast caused by the illumination in the scene, thereby making it appear to have been imaged under a white light (see Figure 1). White balance affects not only the quality of photographs but also the accuracy of many computer vision tasks [4]. On modern digital cameras, automatic white balance is performed for all captured images as an essential part of the camera’s imaging pipeline.

Color constancy is a challenging problem, because it is fundamentally under-constrained: an infinite family of white-balanced images and global color casts can explain the same observed image. Color constancy is, therefore, often framed in terms of inferring the most likely illuminant color given some observed image and some prior knowledge of the spectral properties of the camera’s sensor.

One simple heuristic applied to the color constancy problem is the “gray-world” assumption: colors in the world tend to average out to a neutral gray, and so the color of the illuminant can be estimated as the average color of the input image [18]. This gray-world method and its related techniques have the convenient property that they are invariant to much of the spectral sensitivity differences among camera sensors and are therefore well-suited to the cross-camera task. If camera A’s red channel is twice as sensitive as camera B’s red channel, then a scene captured by camera A will have an average red intensity that is twice that of the same scene captured by camera B, and so gray-world will produce identical output images (though this assumes that the spectral responses of A and B are identical up to a scale factor, which is rarely the case in practice). However, current state-of-the-art learning-based methods for color constancy rarely exhibit this property, because they often learn things like the precise distribution of likely illuminant colors (a consequence of black-body illumination and other scene lighting regularities) and are therefore sensitive to any mismatch between the spectral sensitivity of the camera used during training and that of the camera used at test time [3].
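As a concrete point of reference, the gray-world estimate and the white-balance step it implies can be sketched in a few lines of Python; this is a minimal illustration (the array shape, channel ordering, and display rescaling are our assumptions), not part of any method evaluated in this paper.

import numpy as np

def gray_world_illuminant(raw):
    """Estimate the illuminant of a linear raw image (H x W x 3, RGB) under
    the gray-world assumption that the average scene color is neutral gray."""
    mean_rgb = raw.reshape(-1, 3).mean(axis=0)
    # Only the direction of the illuminant matters, so return a unit vector.
    return mean_rgb / np.linalg.norm(mean_rgb)

def white_balance(raw, illuminant):
    """Divide the estimated illuminant out of every pixel (element-wise)."""
    balanced = raw / illuminant
    return balanced / balanced.max()  # rescale into [0, 1] for display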

Because there is often significant spectral variation across camera models (as shown in Figure 2), this sensitivity of existing methods is problematic when designing practical white-balance solutions. Training a learning-based algorithm for a new camera requires collecting hundreds, or thousands, of images with ground-truth illuminant color labels (in practice: images containing a color chart), a burdensome task for a camera manufacturer or platform that may need to support hundreds of different camera models. However, the gray-world assumption still holds surprisingly well across sensors—if given several images from a particular camera, one can do a reasonable job of estimating the range of likely illuminant colors (as can also be seen in Figure 2).

Figure 2: A visualization of uv log-chroma histograms ($u=\log(g/r)$, $v=\log(g/b)$) of images from two different cameras, averaged over many images of the same scene set in the NUS dataset [20] (shown in green), as well as the uv coordinate of the mean of the ground-truth illuminants over the entire scene set (shown in yellow). The “positions” of these histograms change significantly across the two camera sensors because of their different spectral sensitivities, which is why many color constancy models generalize poorly across cameras.

In this paper, we propose a camera-independent color constancy method. Our method achieves high-accuracy cross-camera color constancy through the use of two concepts: First, our system is constructed to take as input not just a single test-set image, but also a small set of additional images from the test set, which are (i) arbitrarily selected, (ii) unlabeled, and (iii) not white balanced. This allows the model to calibrate itself to the spectral properties of the test-time camera during inference. We make no assumptions about these additional images except that they come from the same camera as the “target” test-set image and that they contain some content (i.e., they are not all-black or all-white images). In practice, these images could simply be randomly chosen images from the photographer’s “camera roll”, or they could be a fixed set of ad hoc images of natural scenes taken once by the camera manufacturer—because these images do not need to be annotated, they are abundantly available. Second, our system is constructed as a hypernetwork [34] around an existing color constancy model. The target image and the additional images are used as input to a deep neural network whose output is the weights of a smaller color constancy model, and those generated weights are then used to estimate the illuminant color of the target image.

Our system is trained using labeled (and unlabeled) images from multiple cameras, but at test time our model is able to look at a set of (unlabeled) test set images from a new camera. Our hypernetwork is able to infer the likely spectral properties of the new camera that produced the test set images (much as the reader can infer the likely illuminant colors of a camera from only looking at aggregate statistics, as in Figure 2) and produce a small model that has been dynamically adapted to produce accurate illuminant estimates when applied to the target image. Our method is computationally fast and requires a low memory footprint while achieving state-of-the-art results compared to other camera-independent color constancy methods.

2 Prior Work

A large body of literature addresses illuminant color estimation; it can be categorized into statistical methods (e.g., [18, 17, 65, 25, 40, 68, 32, 20, 60]) and learning-based methods (e.g., [26, 31, 24, 16, 62, 30, 53, 15, 12, 57, 66, 37, 13, 55, 76]). The former rely on statistical hypotheses to estimate scene illuminant colors from the color distribution and/or spatial layout of the input raw image. Such methods are usually simple and efficient, but they are less accurate than the learning-based alternatives.

Learning-based methods, on the other hand, are typically trained for a single target camera model in order to learn the distribution of illuminant colors produced by the target camera’s particular sensor [3, 45, 29]. The learning-based methods are typically constrained to the specific, single camera use-case, as the spectral sensitivity of each camera sensor significantly alters the recorded illuminant and scene colors, and different sensor spectral sensitivities change the illuminant color distribution for the same set of scenes [38, 73]. Such camera-specific methods cannot accurately extrapolate beyond the learned distribution of the training camera model’s illuminant colors [3, 60] without tuning/re-training or pre-calibration [50].

Recently, few-shot and multi-domain learning techniques [75, 55] have been proposed to reduce the effort of re-training camera-specific learned color constancy models. These methods require only a small set of labeled images for a new camera unseen during training. In contrast, our technique requires no ground-truth labels for the unseen camera, and is essentially calibration-free for this new sensor.

Another strategy is to white balance the input image with several illuminant color candidates and learn the likelihood that each result is properly white balanced [35]. Such a Bayesian framework requires prior knowledge of the target camera model’s illuminant colors to build the illuminant candidate set. Despite promising results, these methods all require labeled training examples from the target camera model: raw images paired with ground-truth illuminant colors. Collecting such training examples is a tedious process, as certain conditions must be satisfied—i.e., each image must be lit by a single uniform illuminant and must contain a calibration object [20].

An additional class of work has sought to learn sensor-independent color constancy models, circumventing the need to re-train or calibrate to a specific camera model. A recent quasi-unsupervised approach to color constancy has been proposed, which learns the semantic features of achromatic objects to help build a model robust to differing camera sensor spectral sensitivities [14]. Another technique proposes to learn an intermediate “device independent” space before the illuminant estimation process [3]. The goal of our method is similar, in that we also propose to learn a color constancy model that works for all cameras, but neither of these previous sensor-independent approaches leverages multiple test images to reason about the spectral properties of the unseen camera model. This enables our method to outperform these state-of-the-art sensor-independent methods across diverse test sets.

Though not commonly applied in color constancy techniques, our proposal to use multiple test-set images at inference-time to improve performance is a well-explored approach across machine learning. The task of classifying an entire test set as accurately as possible was first described by Vapnik as “transductive inference” [69, 39]. Our approach is also closely related to the work on domain adaptation [22, 63] and transfer learning [58], both of which attempt to enable learning-based models to cope with differences between training and test data. Multiple sRGB camera-rendered images of the same scene have been used to estimate the response function of a given camera in the radiometric calibration literature [33, 42]. In our method, however, we employ additional images to learn to extract informative cues about the spectral sensitivity of the camera capturing the input test image, without needing to capture the same scene multiple times.

3 Method

We call our system “cross-camera convolutional color constancy” (C5), because it builds upon the existing “convolutional color constancy” (CCC) model [12] and its successor “fast Fourier color constancy” (FFCC) [13], but embeds them in a multi-input hypernetwork to enable accurate cross-camera performance. These CCC/FFCC models work by learning to perform localization within a log-chroma histogram space, such as those shown in Figure 2.

Here, we present a convolutional color constancy model that is a simplification of those presented in the original work [12] and its FFCC follow-up [13]. This simple convolutional model will be a fundamental building block that we will use in our larger neural network. The image formation model behind CCC/FFCC (and most color constancy models) is that each pixel of the observed image is assumed to be the element-wise product of some “true” white-balanced image (or equivalently, the observed image if it were imaged under a white illuminant) and some illuminant color:

\forall_{k}\;\bm{\mathrm{c}}^{(k)}=\bm{\mathrm{w}}^{(k)}\circ\bm{\mathrm{\ell}}\,, (1)

where $\bm{\mathrm{c}}^{(k)}$ is the observed color of pixel $k$, $\bm{\mathrm{w}}^{(k)}$ is the true color of the pixel, and $\bm{\mathrm{\ell}}$ is the color of the illuminant, all of which are 3-vectors of RGB values. Color constancy algorithms traditionally use the input image $\{\bm{\mathrm{c}}^{(k)}\}$ to produce an estimate of the illuminant $\hat{\bm{\mathrm{\ell}}}$ that is then divided (element-wise) into each observed color to produce an estimate of the true color of each pixel, $\{\hat{\bm{\mathrm{w}}}^{(k)}\}$.

CCC defines two log-chroma measures for each pixel, which are simply the log of the ratio of two color channels:

u^{(k)}=\log\big(c^{(k)}_{g}/c^{(k)}_{r}\big),\quad v^{(k)}=\log\big(c^{(k)}_{g}/c^{(k)}_{b}\big)\,. (2)

As noted by Finlayson, this log-chrominance representation of color means that illuminant changes (i.e., element-wise scaling by $\bm{\mathrm{\ell}}$) can be modeled simply as additive offsets in this uv representation [23]. We then construct a 2D histogram of the log-chroma values of all pixels:

N_{0}(u,v)=\sum_{k}\big\lVert\bm{\mathrm{c}}^{(k)}\big\rVert_{2}\left[\left|u^{(k)}-u\right|\leq\epsilon\,\wedge\,\left|v^{(k)}-v\right|\leq\epsilon\right]\,. (3)

This is simply a histogram over all uv coordinates (of size $64\times64$) written using Iverson brackets, where $\epsilon$ is the width of a histogram bin, and where each pixel is weighted by its overall brightness under the assumption that bright pixels provide more actionable signal than dark pixels. As was done in FFCC, we construct two histograms: one of pixel intensities, $N_0$, and one of gradient intensities, $N_1$; the latter is constructed analogously to Equation 3.
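A minimal NumPy sketch of this histogram construction (Equations 2 and 3) is shown below; the bin range follows Appendix A, the epsilon guard and the final normalization are our assumptions, and the gradient histogram N_1 would be built the same way from local differences.

import numpy as np

def log_chroma_histogram(raw, n_bins=64, lo=-2.85, hi=2.85, eps=1e-8):
    """Brightness-weighted uv log-chroma histogram N_0 of a linear raw image."""
    r, g, b = [raw[..., i].ravel() + eps for i in range(3)]
    u = np.log(g / r)                      # Equation 2
    v = np.log(g / b)
    weight = np.sqrt(r**2 + g**2 + b**2)   # ||c||_2: brighter pixels count more
    hist, _, _ = np.histogram2d(u, v, bins=n_bins,
                                range=[[lo, hi], [lo, hi]], weights=weight)
    return hist / (hist.sum() + eps)       # normalize away overall image brightness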

These histograms of log-chroma values exhibit a useful property: element-wise multiplication of the RGB values of an image by a constant results in a translation of the resulting log-chrominance histogram. The core insight of CCC is that this property allows color constancy to be framed as the problem of “localizing” a log-chroma histogram in this uv histogram-space [12]—because every uv location in $N$ corresponds to a (normalized) illuminant color $\bm{\mathrm{\ell}}$, the problem of estimating $\bm{\mathrm{\ell}}$ is reducible (in a computability sense) to the problem of estimating a uv coordinate. This can be done by discriminatively training a “sliding window” classifier, much as one might train, say, a face-detection system: the histogram is convolved with a (learned) filter, the location of the argmax is extracted from the filter response, and that argmax corresponds to the uv coordinate of (the inverse of) the estimated illuminant.

We adopt a simplification of the convolutional structure used by FFCC [13]:

P=\operatorname{softmax}\bigg(B+\sum_{i}\big(N_{i}*F_{i}\big)\bigg), (4)

where $\{F_i\}$ and $B$ are filters and a bias, respectively, all of which have the same shape as the histograms $N_i$. Each histogram $N_i$ is convolved with its corresponding filter $F_i$ and the results are summed across channels (a “conv” layer). Then the bias $B$ is added to that summation, which collectively biases inference towards uv coordinates that correspond to common illuminants, such as black-body radiation.

As was done in FFCC, this convolution is accelerated through the use of FFTs, though, unlike FFCC, we use a non-wrapped histogram and thus non-wrapped filters and bias. This avoids the need for the complicated “de-aliasing” scheme used by FFCC which is not compatible with the convolutional neural network structure that we will later introduce.

The output of the softmax, $P$, is effectively a “heat map” of which illuminants are likely, given the distribution of pixel and gradient intensities reflected in $N$ and the prior encoded in $B$. From it we extract a “soft argmax” by taking the expectation of $u$ and $v$ with respect to $P$:

\hat{\ell}_{u}=\sum_{u,v}u\,P(u,v)\,,\quad\hat{\ell}_{v}=\sum_{u,v}v\,P(u,v). (5)

Equation 5 is equivalent to estimating the mean of a Gaussian fitted in uv space and weighted by $P$. Because the absolute scale of $\bm{\mathrm{\ell}}$ is assumed to be irrelevant or unrecoverable in the context of color constancy, after estimating $(\hat{\ell}_{u},\hat{\ell}_{v})$ we produce an RGB illuminant estimate $\hat{\bm{\mathrm{\ell}}}$ that is simply the unit vector whose log-chroma values match our estimate:

\hat{\bm{\mathrm{\ell}}}=\left(\exp\left(-\hat{\ell}_{u}\right)/z,\;1/z,\;\exp\left(-\hat{\ell}_{v}\right)/z\right), (6)
z=\sqrt{\exp\left(-\hat{\ell}_{u}\right)^{2}+\exp\left(-\hat{\ell}_{v}\right)^{2}+1}. (7)

A convolutional color constancy model is then trained by treating $\{F_i\}$ and $B$ as free parameters, which are optimized to minimize the difference between the predicted illuminant $\hat{\bm{\mathrm{\ell}}}$ and the ground-truth illuminant $\bm{\mathrm{\ell}}^{*}$.
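Putting Equations 4 through 7 together, the forward pass of this base CCC model can be sketched as follows; scipy's FFT-based convolution stands in for the accelerated convolution described above, and the bin-center computation assumes the histogram range given in Appendix A.

import numpy as np
from scipy.signal import fftconvolve

def ccc_estimate(histograms, filters, bias, lo=-2.85, hi=2.85):
    """histograms, filters: lists of n x n arrays (pixel and gradient channels);
    bias: n x n array. Returns a unit-norm RGB illuminant estimate."""
    n = bias.shape[0]
    score = bias + sum(fftconvolve(N, F, mode='same')          # Equation 4
                       for N, F in zip(histograms, filters))
    P = np.exp(score - score.max())
    P /= P.sum()                                               # softmax over uv bins
    centers = lo + (np.arange(n) + 0.5) * (hi - lo) / n        # uv bin centers
    u_hat = (P.sum(axis=1) * centers).sum()                    # Equation 5
    v_hat = (P.sum(axis=0) * centers).sum()
    ell = np.array([np.exp(-u_hat), 1.0, np.exp(-v_hat)])      # Equations 6-7
    return ell / np.linalg.norm(ell)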

3.1 Architecture

Figure 3: An overview of our C5 model. The uv histograms for the input query image and a variable number of additional input images taken from the same sensor as the query are used as input to our neural network, which generates a filter bank $\{F_i\}$ (here shown as one filter) and a bias $B$, which are the parameters of a conventional CCC model [12]. The query uv histogram is then convolved with the generated filter and shifted by the generated bias to produce a heat map, whose argmax is the estimated illuminant [12].

With our baseline CCC/FFCC-like model in place, we can now construct our cross-camera convolutional color constancy model (C5), which is a deep architecture in which CCC is a component. Both CCC and FFCC operate by learning a single fixed set of parameters consisting of a single filter bank $\{F_i\}$ and bias $B$. In contrast, in C5 the filters and bias are parameterized as the output of a deep neural network (parameterized by weights $\theta$) that takes as input not just the log-chrominance histograms of the image being color-corrected (which we will refer to as the “query” image), but also log-chrominance histograms from several other randomly selected input images (with no ground-truth illuminant labels) from the test set.

By using a generated filter and bias from additional images taken from the query image’s camera (instead of using a fixed filter and bias as was done in previous work) our model is able to automatically “calibrate” its CCC model to the specific sensor properties of the query image. This can be thought of as a hypernetwork [34], wherein a deep neural network emits the “weights” of a CCC model, which is itself a shallow neural network. This approach also bears some similarity to a Transformer approach, as a CCC model can be thought of as “attending” to certain parts of a log-chroma histogram, and so our neural network can be viewed as a sort of self-attention mechanism [70]. See Figure 3 for a visualization of this data flow.

Figure 4: An overview of the neural network architecture that emits CCC model weights. The uv histogram of the query image, along with additional input histograms taken from the same camera, is provided as input to a set of encoders. The activations of each encoder are shared with the other encoders by performing max-pooling across encoders after each block. The cross-pooled features at the last encoder layer are then fed into two decoder blocks to generate the bias and filter bank of a CCC model for the query histogram. Each scale of the decoder is connected to the corresponding scale of the query-histogram encoder with skip connections. The structure of the encoder and decoder blocks is shown in the upper-right corner.

At the core of our model is the deep neural network that takes as input a set of log-chroma histograms and must produce as output a CCC filter bank and bias map. For this we use a multi-encoder-multi-decoder U-Net-like architecture [61]. The first encoder is dedicated to the “query” input image’s histogram, while the remaining encoders take as input the histograms corresponding to the additional input images. To allow the network to reason about the set of additional input images in a way that is insensitive to their ordering, we adopt the permutation-invariant pooling approach of Aittala et al. [7]: we use max pooling across the set of activations of each branch of the encoder. This “cross-pooling” gives us a single set of activations that reflects the set of additional input images but is agnostic to their particular ordering. At inference time, these additional images are needed to allow the network to reason about how to use them in challenging cases. The cross-pooled features of the last layer of all encoders are then fed into two decoder blocks. Each decoder produces one component of our CCC model: a bias map $B$, and two filters $\{F_0, F_1\}$ (which correspond to the pixel and edge histograms $\{N_0, N_1\}$, respectively).
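The cross-pooling operation itself is a one-liner; the PyTorch sketch below is ours (not the released implementation) and only illustrates why the result is invariant to the ordering of the additional histograms.

import torch

def cross_pool(branch_activations):
    """branch_activations: list of m tensors, each (B, C, H, W), one per encoder
    branch. The element-wise max across branches is the same regardless of how
    the additional input histograms are ordered."""
    return torch.stack(branch_activations, dim=0).max(dim=0).values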

As per the traditional U-Net structure, we use skip connections between each level of the decoder and the corresponding level of the encoder with the same spatial resolution, but only for the encoder branch corresponding to the query input image’s histogram. Each block of our encoder consists of a set of interleaved $3\times3$ conv layers, leaky ReLU activations, batch normalization, and $2\times2$ max pooling, and each block of our decoder consists of $2\times$ bilinear upsampling followed by interleaved $3\times3$ conv layers, leaky ReLU activations, and instance normalization.

When passing our 2-channel (pixel and gradient) log-chroma histograms to our network, we augment each histogram with two extra “channels” comprising only the u and v coordinates of each histogram bin, as in CoordConv [51]. This augmentation allows a convolutional architecture on top of log-chroma histograms to reason about the absolute “spatial” information associated with each uv coordinate, thereby allowing the convolutional model to be aware of the absolute color of each histogram bin (see Appendix B for an ablation study). Figure 4 shows a detailed visualization of our architecture.
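A sketch of this CoordConv-style augmentation, assuming the histograms are batched as (B, 2, n, n) tensors and using the histogram range from Appendix A:

import torch

def add_uv_channels(hist, lo=-2.85, hi=2.85):
    """hist: (B, 2, n, n) pixel and gradient log-chroma histograms. Appends two
    channels holding the u and v coordinate of each bin, so convolutions can
    reason about absolute chroma rather than only local histogram shape."""
    b, _, n, _ = hist.shape
    coords = torch.linspace(lo, hi, n, device=hist.device)
    u = coords.view(1, 1, n, 1).expand(b, 1, n, n)   # u varies along one axis
    v = coords.view(1, 1, 1, n).expand(b, 1, n, n)   # v varies along the other
    return torch.cat([hist, u, v], dim=1)            # (B, 4, n, n)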

3.2 Training

Our model is trained by minimizing the angular error [36] between the predicted unit-norm illuminant color $\hat{\bm{\mathrm{\ell}}}$ and the ground-truth illuminant color $\bm{\mathrm{\ell}}^{*}$, as well as an additional loss that regularizes the CCC models emitted by our network. Our loss function $\mathcal{L}(\cdot)$ is:

\mathcal{L}\left(\bm{\mathrm{\ell}}^{*},\hat{\bm{\mathrm{\ell}}}\right)=\cos^{-1}\left(\frac{\bm{\mathrm{\ell}}^{*}\cdot\hat{\bm{\mathrm{\ell}}}}{\lVert\bm{\mathrm{\ell}}^{*}\rVert}\right)+S\left(\{F_{i}(\theta)\},B(\theta)\right)\,, (8)

where $S(\cdot)$ is a regularizer that encourages the network to generate smooth filters and biases, which reduces over-fitting and improves generalization:

S\left(\{F_{i}\},B\right)=\lambda_{B}\left(\lVert B\ast\nabla_{u}\rVert^{2}+\lVert B\ast\nabla_{v}\rVert^{2}\right)+\lambda_{F}\sum_{i}\left(\lVert F_{i}\ast\nabla_{u}\rVert^{2}+\lVert F_{i}\ast\nabla_{v}\rVert^{2}\right)\,, (9)

where $\nabla_u$ and $\nabla_v$ are $3\times3$ horizontal and vertical Sobel filters, respectively, and $\lambda_F$ and $\lambda_B$ are multipliers that control the strength of the smoothness for the filters and the bias, respectively. This regularization is similar to the total-variation smoothness prior used by FFCC [13], though here we impose it on the filters and bias generated by a neural network, rather than on a single filter bank and bias map. We set the multiplier hyperparameters $\lambda_F$ and $\lambda_B$ to 0.15 and 0.02, respectively (see Appendix B for an ablation study).
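A hedged PyTorch sketch of this loss (Equations 8 and 9); the Sobel kernels and the clamping used to keep the arccosine numerically stable are implementation details we assume.

import torch
import torch.nn.functional as F

SOBEL_U = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_V = SOBEL_U.transpose(2, 3)

def smoothness(x):
    """Sum of squared horizontal and vertical Sobel responses of a 2D map x."""
    x = x.view(1, 1, *x.shape[-2:])
    return F.conv2d(x, SOBEL_U).pow(2).sum() + F.conv2d(x, SOBEL_V).pow(2).sum()

def c5_loss(ell_true, ell_pred, filters, bias, lambda_f=0.15, lambda_b=0.02):
    """Angular error (Equation 8) plus smoothness of the generated CCC model
    (Equation 9), applied to the filters and bias emitted by the network."""
    cos = torch.dot(ell_true, ell_pred) / (ell_true.norm() * ell_pred.norm() + 1e-9)
    angular = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    reg = lambda_b * smoothness(bias) + lambda_f * sum(smoothness(f) for f in filters)
    return angular + reg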

In addition to regularizing the CCC model emitted by our network, we additionally regularize the weights of the network themselves, $\theta$, using L2 regularization (i.e., “weight decay”) with a multiplier of $5\times10^{-4}$. This regularization of our network serves a different purpose than the regularization of the CCC models emitted by our network—regularizing $\{F_i(\theta)\}$ and $B(\theta)$ prevents over-fitting by the CCC model emitted by our network, while regularizing $\theta$ prevents over-fitting by the model generating those CCC models.

Training is performed using the Adam optimizer [43] with hyperparameters $\beta_1=0.9$, $\beta_2=0.999$, for 60 epochs. We use a learning rate of $5\times10^{-4}$ with a cosine annealing schedule [52] and an increasing batch size (from 16 to 64) [67, 54], which improves the stability of training (see Appendix B for an ablation study).
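A minimal sketch of this optimizer configuration; the exact epochs at which the batch size grows from 16 to 64 are not specified in the text, so the milestones below are an assumption.

import torch

def make_optimizer(model, epochs=60):
    opt = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.999),
                           weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    return opt, sched

def batch_size_for_epoch(epoch):
    """Assumed schedule: double the batch size twice over the 60 epochs."""
    return 16 if epoch < 20 else (32 if epoch < 40 else 64)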

When training our model for a particular camera model, at each iteration we randomly select a batch of training images (and their corresponding ground-truth illuminants) for use as query input images, and then randomly select eight additional input images for each query image from the training set for use as additional input images. See Appendix C for results of multiple versions of our model in which we vary the number of additional images used.

Figure 5: An example of the image mapping used to augment training data. From left to right: a raw image captured by a Fujifilm X-M1 camera; the same image after white-balancing in CIE XYZ; the same image mapped into the Nikon D40 sensor space; and a real image captured by a Nikon D40 of the same scene for comparison [20].

4 Experiments and Discussion

In all experiments we used $384\times256$ raw images after applying black-level normalization and masking out the calibration object to avoid any “leakage” during the evaluation. Excluding histogram computation time (which is difficult to profile accurately due to the expensive nature of scatter-type operations in deep learning frameworks), our method runs in $\sim$7 milliseconds per image on an NVIDIA GeForce GTX 1080, and $\sim$90 milliseconds on an Intel Xeon CPU Processor E5-1607 v4 (10M Cache, 3.10 GHz). Because our model operates in log-chroma histogram space, the uncompressed size of our entire model is $\sim$2 MB, small enough to easily fit within the narrow constraints of limited compute environments such as mobile phones.

Figure 6: Here we visualize the performance of our C5 model alongside other camera-independent models: “quasi-unsupervised CC” [14] and SIIE [3]. Despite not having seen any images from the test-set camera during training, C5 is able to produce accurate illuminant estimates. The intermediate CCC filters and biases produced by C5 are also visualized.

4.1 Data Augmentation

Many of the datasets we use contain only a few images per distinct camera model (e.g. the NUS dataset [20]) and this poses a problem for our approach as neural networks generally require significant amounts of training data. To address this, we use a data augmentation procedure in which images taken from a “source” camera model are mapped into the color space of a “target” camera.

To perform this mapping, we first white balance each raw source image using its ground-truth illuminant color, and then transform that white-balanced raw image into the device-independent CIE XYZ color space [21] using the color space transformation matrix (CST) provided in each DNG file [1]. Then, we transform the CIE XYZ image into the target sensor space by inverting the CST of an image taken from the target camera dataset.

Instead of randomly selecting an image from the target dataset, we use the correlated color temperature of each image and the capture exposure setting to match source and target images that were captured under roughly the same conditions. This means that “daytime” source images get warped into the color space of “daytime” target images, etc., and this significantly increases the realism of our synthesized data. After mapping the source image to the target white-balanced sensor space, we randomly sample from a cubic curve that has been fit to the rg chromaticity of the illuminant colors in the target sensor.

Lastly, we apply a chromatic adaptation to generate the augmented image in the target sensor space. This chromatic adaptation is performed by multiplying each color channel of the white-balanced raw image, mapped to the target sensor space, with the corresponding sampled illuminant color channel value; see Figure 5 for an example. Additional details can be found in Appendix D. This augmentation allows us to generate additional training examples to improve the generalization of our model. More details are provided in Sec. 4.2.
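The core of this raw-to-raw mapping and chromatic adaptation can be sketched as below, with the per-image CST matrices (interpolated as described in Appendix D) and the sampled target illuminant treated as given; the function and variable names are illustrative, not taken from any released code.

import numpy as np

def map_to_target_sensor(src_raw, src_illum, cst_src, cst_tgt, tgt_illum):
    """src_raw: H x W x 3 linear raw image; src_illum: its ground-truth illuminant;
    cst_src / cst_tgt: 3 x 3 matrices mapping each sensor's raw space to CIE XYZ;
    tgt_illum: illuminant sampled from the target camera's illuminant distribution."""
    wb = src_raw / src_illum                       # white balance with ground truth
    xyz = wb @ cst_src.T                           # source raw -> CIE XYZ
    tgt_wb = xyz @ np.linalg.inv(cst_tgt).T        # CIE XYZ -> target sensor (white balanced)
    return np.clip(tgt_wb * tgt_illum, 0.0, None)  # re-apply a target-sensor illuminant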

4.2 Results and Comparisons

We validate our model using four public datasets consisting of images taken from one or more camera models: the Gehler-Shi dataset (568 images, two cameras) [30], the NUS dataset (1,736 images, eight cameras) [20], the INTEL-TAU dataset (7,022 images, three cameras) [48], and the Cube+ dataset (2,070 images, one camera) [11], which has a separate 2019 “Challenge” test set [9]. We measure performance by reporting the error statistics commonly used by the community: the mean, median, trimean, and the arithmetic means of the smallest and largest 25% of errors (“best 25%” and “worst 25%”) of the angular error between the estimated illuminant and the true illuminant. As our method randomly selects the additional images, each experiment is repeated ten times and we report the arithmetic mean of each error metric (Appendix C contains standard deviations).
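For reference, these statistics can be computed from a list of per-image angular errors as in the sketch below (the trimean uses the standard quartile-weighted definition).

import numpy as np

def angular_error_deg(est, gt):
    """Angular error in degrees between estimated and ground-truth illuminants."""
    cos = np.sum(est * gt, axis=-1) / (np.linalg.norm(est, axis=-1) *
                                       np.linalg.norm(gt, axis=-1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def summarize(errors):
    """Mean, median, trimean, and means of the best/worst 25% of the errors."""
    e = np.sort(np.asarray(errors))
    q1, q2, q3 = np.percentile(e, [25, 50, 75])
    n4 = max(1, len(e) // 4)
    return {'mean': e.mean(), 'median': q2, 'trimean': (q1 + 2 * q2 + q3) / 4,
            'best25': e[:n4].mean(), 'worst25': e[-n4:].mean()}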

To evaluate our model’s performance at generalizing to new camera models not seen during training, we adopt a leave-one-out cross-validation evaluation approach: for each dataset, we exclude all scenes and cameras used by the test set from our training images. For a fair comparison with FFCC [13], we trained FFCC using the same leave-one-out cross-validation evaluation approach. Results can be seen in Table 1 and qualitative comparisons are shown in Figures 6 and 7. Even when compared with prior sensor-independent techniques [14, 3], we achieve state-of-the-art performance, as demonstrated in Table 1.

When evaluating on the two Cube+  [11, 9] test sets and the INTEL-TAU [48] dataset in Table 1, we train our model on the NUS [20] and Gehler-Shi [30] datasets. When evaluating on the Gehler-Shi [30] and the NUS [20] datasets in Table 1, we train C5 using the INTEL-TAU dataset [48], the Cube+ dataset [11], and one of the Gehler-Shi [30] and the NUS [20] datasets after excluding the testing dataset. The one deviation from this procedure is for the NUS result labeled “CS”, where for a fair comparison with the recent SIIE method [3], we report our results with their cross-sensor (CS) evaluation in Table 1, in which we only excluded images of the test camera, and repeated this process over all cameras in the dataset.

We augmented the data used to train the model, adding 5,000 augmented examples generated as described in Sec. 4.1. In this process, we used only cameras of the training sets of each experiment as “target” cameras for augmentation, which has the effect of mixing the sensors and scene content from the training sets only. For instance, when evaluating on the INTEL-TAU [48] dataset, our augmented images simulate the scene content of the NUS [20] dataset as observed by sensors of the Gehler-Shi [30] dataset, and vice-versa.

Characteristics of Additional Images

Unless otherwise stated, the additional input images are randomly selected, but from the same camera model as the test image. This setting is meant to be equivalent to the real-world use case in which the additional images provided as input are, say, a photographer’s previously-captured images that are already present on the camera during inference. However, for the Cube+ Challenge portion of Table 1, we provide an additional set of experiments in which the set of additional images is chosen according to some heuristic, rather than randomly. We identified the 20 test-set images with the lowest variation of uv chroma values (“dull images”) and the 20 test-set images with the highest variation of uv chroma values (“vivid images”), and we show that using vivid images produces lower error rates than randomly-chosen or dull images. This makes intuitive sense, as one might expect colorful images to be a more informative signal as to the spectral properties of a previously-unobserved camera. We also show results in Table 1 where the additional images are taken from a different camera than the test-set camera, and show that this results in error rates that are higher than using additional images from the same test-set camera, as one might expect.

Figure 7: Here we compare our C5 model against FFCC [13] on cross-sensor generalization using test-set Sony SLT-A57 images from the NUS dataset [20]. If FFCC is trained and tested on images from the same camera it performs well, as does C5 (top row). But if FFCC is instead tested on a different camera, such as the Olympus EPL6, it generalizes poorly, while C5 retains its performance (bottom row).
Table 1: Angular errors on the Cube+ dataset [11], the Cube+ challenge [9], the INTEL-TAU dataset [48], the Gehler-Shi dataset [30], and the NUS dataset [20]. The term “CS” refers to cross-sensor as used in [3]. See the text for additional details. The lowest error in each column is marked with an asterisk (*).
Cube+ Dataset Mean Med. B. 25% W. 25% Tri. Size (MB)
Gray-world [18] 3.52 2.55 0.60 7.98 2.82 -
Shades-of-Gray [25] 3.22 2.12 0.43* 7.77 2.44 -
Cross-dataset CC [45] 2.47 1.94 - - - -
Quasi-Unsupervised CC [14] 2.69 1.76 0.49 6.45 2.00 622
SIIE [3] 2.14 1.44 0.44 5.06 - 10.3
FFCC [13] 2.69 1.89 0.46 6.31 2.08 0.22
C5 1.92* 1.32* 0.44 4.44* 1.46* 2.09
Cube+ Challenge Mean Med. B. 25% W. 25% Tri.
Gray-world [18] 4.44 3.50 0.77 9.64 -
1st-order Gray-Edge [68] 3.51 2.30 0.56 8.53 -
Quasi-Unsupervised CC [14] 3.12 2.19 0.60 7.28 2.40
SIIE [3] 2.89 1.72 0.71 7.06 -
FFCC [13] 3.25 2.04 0.64 8.22 2.09
C5 2.24 1.48 0.47 5.39* 1.62
C5 (another camera model) 2.97 2.47 0.78 6.11 2.52
C5 (dull images) 2.35 1.58 0.46 5.57 1.70
C5 (vivid images) 2.19* 1.39* 0.43* 5.44 1.54*
INTEL-TAU Mean Med. B. 25% W. 25% Tri.
Gray-world [18] 4.7 3.7 0.9 10.0 4.0
Shades-of-Gray [25] 4.0 2.9 0.7 9.0 3.2
PCA-based B/W Colors [20] 4.6 3.4 0.7 10.3 3.7
Weighted Gray-Edge [32] 6.0 4.2 0.9 14.2 4.8
Quasi-Unsupervised CC [14] 3.71 2.67 0.66 8.55 2.90
SIIE [3] 3.42 2.42 0.73 7.80 2.64
FFCC [13] 3.42 2.38 0.70 7.96 2.61
C5 2.52* 1.70* 0.52* 5.96* 1.86*
Gehler-Shi Dataset Mean Med. B. 25% W. 25% Tri.
Shades-of-Gray [25] 4.93 4.01 1.14 10.20 4.23
PCA-based B/W Colors [20] 3.52 2.14 0.50 8.74 2.47
ASM [8] 3.80 2.40 - - 2.70
Woo et al. [72] 4.30 2.86 0.71 10.14 3.31
Grayness Index [60] 3.07 1.87 0.43* 7.62 2.16
Cross-dataset CC [45] 2.87 2.21 - - -
Quasi-Unsupervised CC [14] 3.46 2.23 - - -
SIIE [3] 2.77 1.93* 0.55 6.53 -
FFCC [13] 2.95 2.19 0.57 6.75 2.35
C5 2.50* 1.99 0.53 5.46* 2.03*
NUS Dataset Mean Med. B. 25% W. 25% Tri.
Gray-world [18] 4.59 3.46 1.16 9.85 3.81
Shades-of-Gray [25] 3.67 2.94 0.98 7.75 3.03
Local Surface Reflectance [28] 3.45 2.51 0.98 7.32 2.70
PCA-based B/W Colors [20] 2.93 2.33 0.78 6.13 2.42
Grayness Index [60] 2.91 1.97 0.56 6.67 2.13
Cross-dataset CC [45] 3.08 2.24 - - -
Quasi-Unsupervised CC [14] 3.00 2.25 - - -
SIIE (CS) [3] 2.05 1.50 0.52 4.48 -
FFCC [13] 2.87 2.14 0.71 6.23 2.30
C5 2.54 1.90 0.61 5.61 2.02
C5 (CS) 1.77* 1.37* 0.48* 3.75* 1.46*

5 Conclusion

We have presented C5, a cross-camera convolutional color constancy method. By embedding the existing state-of-the-art convolutional color constancy model (CCC) [12, 13] into a multi-input hypernetwork approach, C5 can be trained on images from multiple cameras, but at test time synthesize weights for a CCC-like model that is dynamically calibrated to the spectral properties of the previously-unseen camera of the test-set image. Extensive experimentation demonstrates that C5 achieves state-of-the-art performance on cross-camera color constancy for several datasets. By enabling accurate illuminant estimation without requiring the tedious collection of labeled training data for every particular camera, we hope that C5 will accelerate the widespread adoption of learning-based white balance by the camera industry.

Appendix A CCC Histogram Features

As mentioned earlier, we used a histogram bin size of 64 (i.e., $n=64$) with a histogram bin width $\epsilon=(b_{\texttt{max}}-b_{\texttt{min}})/n$, where $b_{\texttt{max}}$ and $b_{\texttt{min}}$ are the histogram boundary values. In our experiments, we set $b_{\texttt{min}}$ and $b_{\texttt{max}}$ to -2.85 and 2.85, respectively. Our input is a concatenation of two histograms: (i) a histogram of pixel intensities and (ii) a histogram of gradient intensities. We augmented our histograms with extra uv coordinate channels to allow our network to consider the “spatial” (or more accurately, chromatic) information associated with each bin in the histogram.

Appendix B Ablation Studies

In the following ablation experiments, we used the Cube+ dataset [11] as our test set and trained our network with seven encoders using the same training set mentioned in Sec. 4.2 (the NUS dataset [20], the Gehler-Shi dataset [30], and the augmented images, after excluding any scene/sensors of the test set). Table 2 shows the results obtained by models trained using different histogram sizes, different values of the smoothness factors $\lambda_B$ and $\lambda_F$, with and without increasing the batch size during training, and with and without the gradient-intensity histogram and the extra uv augmentation channels. Each experiment was repeated ten times, and the arithmetic mean and standard deviation of each error metric are reported.

Figure 8 shows the effect of the smoothness regularization and of increasing the batch size during training on a small training set. We use the first fold of the Gehler-Shi dataset [30] as our validation set and the remaining two folds for training. In the figure, we plot the angular error on the training and validation sets. Each model was trained for 60 epochs as a camera-specific color constancy model (i.e., without using additional images or camera models). As can be seen in Figure 8, the smoothness regularization improves generalization to the validation set, and increasing the batch size helps the network reach a lower optimum.

Figure 8: The impact of smoothness regularization and of increasing the batch size during training on training/validation accuracy. We show the training/validation angular error of training our network on the Gehler-Shi dataset [30] for camera-specific color constancy. We set $\lambda_F=0.15$, $\lambda_B=0.02$ for the experiment labeled ‘w/ smoothness’, $\lambda_F=1.85$, $\lambda_B=0.25$ for the experiment labeled ‘over smoothness’, and $\lambda_F=0$, $\lambda_B=0$ for the ‘w/o smoothness’ experiments.
Table 2: Results of ablation studies. The shown results were obtained by training our network on the NUS [20] and Gehler-Shi [30] datasets with augmentation, and testing on the Cube+ dataset [11]. In this set of experiments, we used seven encoders (i.e., six additional histograms). Note that none of the training data includes any scene/sensor from the Cube+ dataset [11]. For each set of experiments, the lowest errors are marked with an asterisk (*).
Mean Med. B. 25% W. 25% Tri.
Histogram bin size, $n$
$n=16$ 2.28±0.01 1.81±0.03 0.65±0.01 4.72±0.02 1.91±0.02
$n=32$ 2.02±0.01 1.44±0.01 0.44±0.01 4.66±0.01 1.86±0.03
$n=64$ 1.87±0.00* 1.27±0.01* 0.41±0.01 4.36±0.01* 1.40±0.01*
$n=128$ 2.03±0.00 1.42±0.01 0.40±0.00* 4.70±0.01 1.54±0.01
Smoothness factors, $\lambda_B$ and $\lambda_F$ ($n=64$)
$\lambda_B=0$, $\lambda_F=0$ 2.07±0.01 1.42±0.01 0.47±0.01 4.67±0.01 1.57±0.01
$\lambda_B=0.005$, $\lambda_F=0.035$ 1.95±0.00 1.31±0.01 0.40±0.00* 4.57±0.01 1.47±0.01
$\lambda_B=0.02$, $\lambda_F=0.15$ 1.87±0.00* 1.27±0.01* 0.41±0.01 4.36±0.01* 1.40±0.01*
$\lambda_B=0.10$, $\lambda_F=0.75$ 2.11±0.00 1.55±0.01 0.48±0.00 4.70±0.01 1.66±0.01
$\lambda_B=0.25$, $\lambda_F=1.85$ 2.23±0.00 1.61±0.01 0.53±0.00 5.04±0.01 1.77±0.01
Increasing batch size ($n=64$)
w/o increasing 1.93±0.00 1.29±0.01 0.42±0.00 4.52±0.02 1.43±0.01
w/ increasing 1.87±0.00* 1.27±0.01* 0.41±0.01* 4.36±0.01* 1.40±0.01*
Gradient histogram and uv channels ($n=64$)
w/o gradient histogram 2.30±0.01 1.53±0.01 0.45±0.01 5.51±0.02 1.71±0.02
w/o uv 2.03±0.01 1.45±0.01 0.44±0.01 4.63±0.02 1.56±0.01
w/ uv and gradient histogram 1.87±0.00* 1.27±0.01* 0.41±0.01* 4.36±0.01* 1.40±0.01*

Table 3 shows the results with and without using our data augmentation approach. The experiments labeled “w/aug” in Table 3 refer to using our data augmentation approach, as described in Sec. 4.1. Additional details on the data augmentation process are given in Sec. D.

Table 3: Angular errors on the Cube+ dataset [11] and the INTEL-TAU dataset [48]. In this experiment, we used six additional images (i.e., $m=7$) for our C5. The lowest error in each column is marked with an asterisk (*).
Cube+ Dataset Mean Med. B. 25% W. 25% Tri.
Cross-dataset CC [45] 2.47 1.94 - - -
Quasi-Unsupervised CC [14] 2.69 1.76 0.49 6.45 2.00
SIIE [3] 2.14 1.44 0.44 5.06 -
FFCC [13] 2.69 1.89 0.46 6.31 2.08
C5 2.10 1.38 0.49 4.97 1.56
C5 (w/aug.) 1.87* 1.27* 0.41* 4.36* 1.40*
INTEL-TAU Mean Med. B. 25% W. 25% Tri.
Quasi-Unsupervised CC [14] 3.71 2.67 0.66 8.55 2.90
SIIE [3] 3.42 2.42 0.73 7.80 2.64
FFCC [13] 3.42 2.38 0.70 7.96 2.61
C5 2.62 1.85 0.54 6.05 2.00
C5 (w/aug.) 2.49* 1.66* 0.51* 5.93* 1.83*

Appendix C Additional Results

In Table 1, we reported our results using eight additional images. In Table 4, we report multiple versions of our model in which we vary $m$, the number of input images (and encoders) used ($m=1$ means that only the query image is used as an input, with no additional images). Note that the single-image results ($m=1$) are not intended to be the central contribution of this work—they are provided only as a point of comparison.

Table 4: Results using different numbers of additional images (i.e., different values of $m$). Note that $m=7$, for example, means that we use six additional images along with the input image. For each experiment, we used the same training data explained in the main paper, with augmentation. The lowest error in each column is marked with an asterisk (*).
Cube+ Dataset Mean Med. B. 25% W. 25%
C5 ($m=1$) 2.60 1.86 0.55 5.89
C5 ($m=3$) 2.28 1.50 0.59 5.19
C5 ($m=5$) 2.23 1.52 0.56 5.11
C5 ($m=7$) 1.87* 1.27* 0.41 4.36*
C5 ($m=9$) 1.92 1.32 0.44 4.44
C5 ($m=11$) 1.93 1.41 0.42 4.35
C5 ($m=13$) 1.95 1.35 0.40* 4.52
Cube+ Challenge Mean Med. B. 25% W. 25%
C5 ($m=1$) 2.70 2.00 0.61 6.15
C5 ($m=7$) 2.55 1.63 0.54 6.21
C5 ($m=9$) 2.24* 1.48* 0.47* 5.39*
C5 ($m=11$) 2.41 1.72 0.54 5.58
C5 ($m=13$) 2.39 1.61 0.53 5.64
INTEL-TAU Mean Med. B. 25% W. 25%
C5 ($m=1$) 2.99 2.18 0.66 6.71
C5 ($m=7$) 2.49* 1.66* 0.51* 5.93*
C5 ($m=9$) 2.52 1.70 0.52 5.96
C5 ($m=11$) 2.60 1.79 0.54 6.07
C5 ($m=13$) 2.57 1.74 0.52 6.08
Gehler-Shi Dataset Mean Med. B. 25% W. 25%
C5 ($m=1$) 2.98 2.05 0.54 7.13
C5 ($m=7$) 2.36* 1.61* 0.44* 5.60
C5 ($m=9$) 2.50 1.99 0.53 5.46*
C5 ($m=11$) 2.55 1.88 0.50 5.77
C5 ($m=13$) 2.46 1.74 0.50 5.73
NUS Dataset Mean Med. B. 25% W. 25%
C5 ($m=1$) 2.84 2.20 0.69 6.14
C5 ($m=7$) 2.68 2.00 0.66 5.90
C5 ($m=9$) 2.54 1.90 0.61* 5.61
C5 ($m=11$) 2.64 1.99 0.65 5.75
C5 ($m=13$) 2.49* 1.88* 0.61* 5.43*

We did not include the “gain” multiplier, originally proposed in FFCC [13], in the method described in Sec. 3, as it did not result in consistently improved performance over all error metrics and datasets. Here, we report results with and without using the gain multiplier map. This gain multiplier map can be generated by our network by adding an additional decoder network with skip connections from the query encoder. With this modification, our convolutional structure can be described as:

P=\operatorname{softmax}\bigg(B+G\circ\sum_{i}\big(N_{i}*F_{i}\big)\bigg)\,, (10)

where $\{F_i\}$, $B$, and $G$ are the filters, the bias map $B(i,j)$, and the gain multiplier map $G(i,j)$, respectively. We also change the smoothness regularizer to include the generated gain multiplier as follows:

S\left(\{F_{i}\},B,G\right)=\lambda_{B}\left(\lVert B\ast\nabla_{u}\rVert^{2}+\lVert B\ast\nabla_{v}\rVert^{2}\right)+\lambda_{G}\left(\lVert G\ast\nabla_{u}\rVert^{2}+\lVert G\ast\nabla_{v}\rVert^{2}\right)+\lambda_{F}\sum_{i}\left(\lVert F_{i}\ast\nabla_{u}\rVert^{2}+\lVert F_{i}\ast\nabla_{v}\rVert^{2}\right)\,, (11)

where $\nabla_u$ and $\nabla_v$ are $3\times3$ horizontal and vertical Sobel filters, respectively, and $\lambda_F$, $\lambda_B$, and $\lambda_G$ are scalar multipliers that control the strength of the smoothness of the filters, the bias, and the gain, respectively. The results of using the additional gain multiplier map are reported in Table 5.

Table 5: Results of using the gain multiplier, $G$. For each experiment, we used $m=7$ and $n=64$, and trained our network using the same training data explained in the main paper, with augmentation. The lowest error in each column is marked with an asterisk (*).
Cube+ [11] | Cube+ Challenge [9] | INTEL-TAU [48] | Gehler-Shi [30] | NUS [20]
Mean Med. B. 25% W. 25% | Mean Med. B. 25% W. 25% | Mean Med. B. 25% W. 25% | Mean Med. B. 25% W. 25% | Mean Med. B. 25% W. 25%
w/o $G$ 1.87 1.27 0.41* 4.36 | 2.40 1.58 0.52 5.76* | 2.49* 1.66* 0.51* 5.93* | 2.36* 1.61* 0.44* 5.60 | 2.68 2.00 0.66 5.90
w/ $G$ 1.83* 1.24* 0.42 4.25* | 2.34* 1.45* 0.46* 5.86 | 2.63 1.81 0.55 6.18 | 2.36* 1.72 0.48 5.40* | 2.44* 1.89* 0.64* 5.21*

We further trained and tested our C5 model using the INTEL-TAU dataset evaluation protocols [48]. Specifically, the INTEL-TAU dataset introduced two different evaluation protocols: (i) the cross-validation protocol, where the model is trained using a 10-fold cross-validation scheme of images taken from three different camera models, and (ii) the camera invariance evaluation protocol, where the model is trained on a single camera model and then tested on another camera model. This camera invariance protocol is equivalent to the CS evaluation method [3], as the models are trained and tested on the same scene set, but with different camera models in the training and testing phases. See Table 6 for comparison with other methods using the INTEL-TAU evaluation protocols. In Table 6, we also show the results of our C5 model trained on the NUS and Gehler-Shi datasets with augmentation (i.e., our camera-independent model) as reported in Table 1 for completeness.

Table 6: Results using the INTEL-TAU dataset evaluation protocols [48]. We also show the results of camera-independent methods, including our camera-independent C5 model. The lowest errors within each evaluation protocol are marked with an asterisk (*).
INTEL-TAU [48]
Mean Med. B. 25% W. 25% Tri.
Camera-specific (10-fold cross-validation protocol [48])
Bianco et al.’s CNN [15] 3.5 2.6 0.9 7.4 2.8
C3AE [47] 3.4 2.7 0.9 7.0 2.8
BoCF [46] 2.4 1.9 0.7 5.1 2.0
FFCC [13] 2.4 1.6 0.4 5.6 1.8
VGG-FC4 [37] 2.2* 1.7 0.6 4.7* 1.8
C5 ($m=7$, $n=128$), w/ augmentation 2.33 1.55* 0.45* 5.57 1.71*
Camera-specific (camera invariance protocol [48])
Bianco et al.’s CNN [15] 3.4 2.5 0.8 7.2 2.7
C3AE [47] 3.4 2.7 0.9 7.0 2.8
BoCF [46] 2.9 2.4 0.9 6.1 2.5
VGG-FC4 [37] 2.6 2.0 0.7 5.5 2.2
C5 ($m=9$), w/aug. 2.45* 1.82* 0.53* 5.46* 1.95*
Camera-independent
Gray-world [18] 4.7 3.7 0.9 10.0 4.0
White-Patch [17] 7.0 5.4 1.1 14.6 6.2
1st-order Gray-Edge [17] 5.3 4.1 1.0 11.7 4.5
2nd-order Gray-Edge [17] 5.1 3.8 1.0 11.3 4.2
Shades-of-Gray [25] 4.0 2.9 0.7 9.0 3.2
PCA-based B/W Colors [20] 4.6 3.4 0.7 10.3 3.7
Weighted Gray-Edge [32] 6.0 4.2 0.9 14.2 4.8
Quasi-Unsupervised CC [14] 3.71 2.67 0.66 8.55 2.90
SIIE [3] 3.42 2.42 0.73 7.80 2.64
C5 ($m=7$), w/aug. 2.49* 1.66* 0.51* 5.93* 1.83*

Our C5 model achieves reasonable accuracy when used as a camera-specific model. In this scenario, we trained our model on training images captured by the same test camera model with a single encoder (i.e., $m=1$). We found that $n=128$, using the gain multiplier map $G(i,j)$, achieves the best camera-specific results. We report the results of our camera-specific models in Table 7.

Table 7: Results of our C5 trained as a camera-specific model with a single encoder (i.e., $m=1$). In this experiment, we performed a three-fold cross-validation on the Cube+ dataset [11]. For the Cube+ challenge [9], we report our results after training our model on the Cube+ dataset [11] without including any training example from the Cube+ challenge test set [9]. We also show the results of other camera-specific color constancy methods reported in past papers. The lowest error in each column is marked with an asterisk (*).
Cube+ Dataset [11]
Mean Med. B. 25% W. 25% Tri.
Color Dog [10] 3.32 1.19 0.22 10.22 -
APAP [6] 2.01 1.36 0.38 4.71 -
Meta-AWB w/ 20 tuning images [55] 1.59 1.02 0.30 3.85 1.15
Color Beaver [44] 1.49 0.77 0.21 3.94 -
SqueezeNet-FC4 [37] 1.35 0.93 0.30 3.24 1.01
FFCC [13] 1.38 0.74* 0.19 3.67 0.89*
WB-sRGB (modified for raw-RGB) [5] 1.32 0.74* 0.18* 3.43 -
MDLCC [75] 1.24* 0.83 0.26 2.91* 0.92
C5 ($n=128$), w/ $G$ 1.39 0.79 0.24 3.55 0.93
Cube+ Challenge [9]
Mean Med. B. 25% W. 25% Tri.
V. Vuk et al. [9] 6.00 1.96 0.99 18.81 2.25
A. Savchik et al. [64] 2.05 1.20 0.40 5.24 1.30
Y. Qian et al. (1) [59] 2.48 1.56 0.44 6.11 -
Y. Qian et al. (2) [59] 2.27 1.26 0.39 6.02 1.35
FFCC [13] 2.1 1.23 0.47 5.38 -
MHCC [35] 1.95 1.16 0.39 4.99 1.25
K. Chen et al. [9] 1.84 1.27 0.39 4.41 1.32
WB-sRGB (modified for raw-RGB) [5] 1.83 1.15 0.35* 4.60 -
C5 ($n=128$), w/ $G$ 1.72* 1.07* 0.36 4.27* 1.15*

Lastly, we show additional qualitative results from the INTEL-TAU dataset [48] in Figure 9. In this figure, we show qualitative examples from our “worst 25%” and “best 25%” results alongside the corresponding results of prior sensor-independent techniques [3, 14].

Figure 9: Random examples from our “worst 25%” and “best 25%” results alongside quasi-unsupervised CC [14] and SIIE [3]. Input images are from the INTEL-TAU dataset [48].

Appendix D Data Augmentation

In this section, we give the details of the data augmentation procedure outlined in Sec. 4.1. We begin with the steps used to map a color temperature to the corresponding CIE XYZ value. Then, we elaborate on the process of mapping from camera sensor raw to the CIE XYZ color space. Afterwards, we describe the details of the scene retrieval process mentioned in Sec. 4.1. Finally, we discuss experiments performed to evaluate our data augmentation and compare it with other color constancy augmentation techniques used in the literature.

D.1 From Color Temperature to CIE XYZ

According to Planck’s radiation law [74], the spectral power distribution (SPD) of a blackbody radiator over a given wavelength interval $[\lambda,\lambda+\partial\lambda]$ can be computed from the color temperature $q$ as follows:

S_{\lambda}\,d_{\lambda}=\frac{f_{1}\lambda^{-5}}{\exp\left(f_{2}/\lambda q\right)-1}\,\partial\lambda, (12)

where $f_{1}=3.741832\times10^{-16}$ W m$^{2}$ is the first radiation constant, $f_{2}=1.4388\times10^{-2}$ m K is the second radiation constant, and $q$ is the blackbody temperature in Kelvin [71, 49]. Once the SPD is computed, the corresponding CIE tristimulus values can be approximated in the following discretized form:

X=\Delta\lambda\sum_{\lambda=380}^{780}x_{\lambda}S_{\lambda}, (13)

where $x_{\lambda}$ is the standard CIE color-matching value [21]. The values of $Y$ and $Z$ are computed similarly. The corresponding chromaticity coordinates of the computed XYZ tristimulus values are finally computed as follows:

x=X/(X+Y+Z),\quad y=Y/(X+Y+Z),\quad z=Z/(X+Y+Z). (14)

D.2 From Raw to CIE XYZ

Most DSLR cameras provide two pre-calibrated matrices, $C_{1}$ and $C_{2}$, that map from the camera sensor space to the CIE 1931 XYZ 2-degree standard observer color space. These pre-calibrated color space transformation (CST) matrices are usually provided for a low correlated color temperature illuminant (e.g., Standard A) and a higher correlated color temperature illuminant (e.g., D65) [1].

Given an illuminant vector $\ell$, estimated by an illuminant estimation algorithm, the CIE XYZ mapping matrix associated with $\ell$ is computed as follows [19]:

C_{q_{\ell}}=\alpha C_{1}+\left(1-\alpha\right)C_{2}, \quad (15)
\alpha=(1/q_{\ell}-1/q_{2})/(1/q_{1}-1/q_{2}), \quad (16)

where $q_{1}$ and $q_{2}$ are the correlated color temperatures associated with the pre-calibrated matrices $C_{1}$ and $C_{2}$, and $q_{\ell}$ is the correlated color temperature of the illuminant vector $\ell$. Here, $q_{\ell}$ is unknown, and unlike the standard mapping from color temperature to the CIE XYZ space (Sec. D.1), there is no standard conversion from a camera sensor raw space to the corresponding color temperature. Thus, the conversion from the sensor raw space to the CIE XYZ space is a chicken-and-egg problem: computing the correlated color temperature $q_{\ell}$ is necessary to obtain the CST matrix $C_{q_{\ell}}$, while mapping from the camera sensor raw space to the CIE XYZ space inherently requires knowledge of the correlated color temperature of the given raw illuminant.

This problem can be solved by a trial-and-error strategy as follows. We iterate over the color temperature range of 2500K to 7500K. For each candidate color temperature $q_{i}$, we first compute the corresponding CST matrix $C_{q_{i}}$ using Eqs. 15 and 16. Then, we convert $q_{i}$ to the corresponding xyz chromaticity triplet using Eqs. 12–14.

Afterwards, we map the xyz chromaticity triplet to the sensor raw space using the following equation:

\ell_{\texttt{raw}(q_{i})}=C_{q_{i}}^{-1}\,\ell_{\texttt{xyz}(q_{i})}. \quad (17)

We repeat this process for all candidate color temperatures and select the color temperature and CST matrix that achieve the minimum angular error between $\ell$ and the reconstructed illuminant color in the sensor raw space.

The accuracy of our conversion depends on the pre-calibrated matrices provided by the manufacturers of the DSLR cameras. Other factors that may affect the accuracy of the mapping include the precision of the standard mapping from color temperature to the CIE XYZ space defined by [21] and the discretization process in Eq. 13.

D.3 Raw-to-Raw Mapping

Here, we describe the details of the raw-to-raw mapping mentioned in Sec. 4.1. Let $A=\{\mathbf{a}_{1},\mathbf{a}_{2},\ldots\}$ represent the “source” set of demosaiced raw images taken by different camera models, along with the associated capture metadata, and let $T=\{\mathbf{t}_{1},\mathbf{t}_{2},\ldots\}$ represent our “target” set of metadata of scenes captured by the target camera model. Here, the capture metadata includes the exposure time, aperture size, ISO gain value, and the global scene illuminant color in the camera sensor space. We also assume that we have access to the pre-calibrated color space transformation (CST) matrices for each camera model in the sets $A$ and $T$ (available in most DNG files of DSLR images [1]).

Our goal is to map all raw images in $A$, taken by different camera models, to the target camera sensor space of $T$. To that end, we first map each image in $A$ to the device-independent CIE XYZ color space [21]. This mapping is performed as follows. We first compute the correlated color temperature, $q^{(i)}$, of the scene illuminant color vector, $\ell^{(i)}_{\texttt{raw}(A)}$, of each raw image, $I^{(i)}_{\texttt{raw}(A)}$, in the set $A$ (see Sec. D.2). Then, we linearly interpolate between the pre-calibrated CST matrices provided with each raw image to compute the final CST mapping matrix, $C_{q^{(i)}}$ [19]. Afterwards, we map each image, $I^{(i)}_{\texttt{raw}(A)}$, in the set $A$ to the CIE XYZ space. Note that here we represent each image $I$ as a matrix of color triplets (i.e., $I=\{\mathbf{c}^{(k)}\}_{k=1}^{K}$, where $K$ is the total number of pixels in the image $I$). We map each raw image to the CIE XYZ space as follows:

I^{(i)}_{\texttt{xyz}(A)}=C_{q^{(i)}}\,D_{\ell^{(i)}}\,I^{(i)}_{\texttt{raw}(A)}, \quad (18)

where $D_{\ell^{(i)}}$ is the diagonal white-balance correction matrix constructed from the illuminant vector $\ell^{(i)}_{\texttt{raw}(A)}$.

Similarly, we compute the inverse mapping from the CIE XYZ space back to the target camera sensor space based on the illuminant vectors and pre-calibrated matrices provided in the target set $T$. The mapping from the source sensor space to the target sensor space of $T$ can be performed as follows:

I^{(i)}_{\texttt{raw}(T)}=D^{-1}_{\jmath^{(i)}}\,M^{-1}_{q^{(i)}}\,I^{(i)}_{\texttt{xyz}(A)}, \quad (19)

where $\jmath^{(i)}_{\texttt{raw}(T)}$ is the illuminant color corresponding to the correlated color temperature $q^{(i)}$ in the target sensor space (i.e., the ground-truth illuminant for image $I^{(i)}_{\texttt{raw}(T)}$ in the illuminant estimation task), and $M_{q^{(i)}}$ is the CST matrix that maps from the target sensor space to the CIE XYZ space.

The steps described so far assume that the spectral sensitivities of all sensors in $A$ and $T$ satisfy the Luther condition [56]. Prior studies, however, have shown that this assumption is not always satisfied, which can affect the accuracy of the pre-calibrated matrices [38, 41]. Accordingly, we rely on Eqs. 18 and 19 only to map the original colors of the captured objects in the scene (i.e., the white-balanced colors) to the target camera model. For the values of the global color cast, $\jmath^{(i)}_{\texttt{raw}(T)}$, we do not rely on $M^{-1}_{q^{(i)}}$ to map $\ell^{(i)}_{\texttt{raw}(A)}$ to the target sensor space of $T$. Instead, we follow a $K$-nearest-neighbor strategy to sample from the target sensor’s illuminant color space.

D.4 Scene Sampling

As described in Sec. 4.1, we retrieve metadata of similar scenes in the target set $T$ for illuminant color sampling. This sampling process should consider the capture conditions of the source scene in order to sample suitable illuminant colors from the target camera model’s space; for instance, using indoor illuminant colors as ground truth for outdoor scenes may harm the training process. To this end, we introduce a retrieval feature, $v_{A}^{(i)}$, that represents the capture settings of the image $I^{(i)}_{\texttt{raw}(A)}$. This feature includes the correlated color temperature and auxiliary capture settings, which are used to retrieve scenes captured with settings similar to those of $I^{(i)}_{\texttt{raw}(A)}$.

Our feature vector is defined as follows:

v_{A}^{(i)}=[q_{\texttt{norm}}^{(i)},\,h_{\texttt{norm}}^{(i)},\,p_{\texttt{norm}}^{(i)},\,e_{\texttt{norm}}^{(i)}], \quad (20)

where $q_{\texttt{norm}}^{(i)}$, $h_{\texttt{norm}}^{(i)}$, $p_{\texttt{norm}}^{(i)}$, and $e_{\texttt{norm}}^{(i)}$ are the normalized color temperature, gain value, aperture size, and scaled exposure time, respectively. The gain value and the scaled exposure time are computed as follows:

h^{(i)}=\texttt{BLN}^{(i)}\,\texttt{ISO}^{(i)}, \quad (21)
e^{(i)}=\sqrt{2^{\texttt{BLE}^{(i)}}}\,l^{(i)}, \quad (22)

where BLE, BLN, ISO, and $l$ are the baseline exposure, baseline noise, digital gain value, and exposure time (in seconds), respectively.

Illuminant Color Sampling

Naively sampling from the associated illuminant colors in $T$ does not introduce new illuminant colors along the Planckian locus of the target sensor. For this reason, we first fit a cubic polynomial to the $rg$ chromaticities of the illuminant colors of the target sensor $T$. Then, we compute a new $r$ chromaticity value for each query vector as follows:

r_{v}=\sum_{j\in K}{w_{j}r_{j}}+x, \quad (23)

where $w_{j}=\exp(1-d_{j})/\sum_{k\in K}\exp(1-d_{k})$ is a weighting factor, $x=\lambda_{r}\mathcal{N}(0,\sigma_{r})$ is a small random shift, $\lambda_{r}$ is a scalar factor that controls the amount of divergence from the ideal Planckian curve, $\sigma_{r}$ is the standard deviation of the $r$ chromaticity values in the retrieved $K$ metadata samples of the target camera model, $T_{K}$, and $d_{j}$ is the normalized L2 distance between $v_{A}^{(i)}$ and the corresponding $j^{\text{th}}$ feature vector in $T_{K}$. The CST matrix $M$ (Eq. 19) is constructed by linearly interpolating between the CST matrices associated with each sample in $T_{K}$ using the weights $w_{j}$. After computing $r_{v}$, the corresponding $g$ chromaticity value is computed as:

g_{v}=[r_{v},r^{2}_{v},r^{3}_{v}]\,[\xi_{1},\xi_{2},\xi_{3}]^{\top}+y, \quad (24)

where $[\xi_{1},\xi_{2},\xi_{3}]$ are the cubic polynomial coefficients, $y=\lambda_{g}\mathcal{N}(0,\sigma_{g})$ is a random shift defined analogously to $x$, and $\sigma_{g}$ is the standard deviation of the $g$ chromaticity values in $T_{K}$. In our experiments, we set $\lambda_{r}=0.7$ and $\lambda_{g}=1$. The final illuminant color $\jmath^{(i)}_{\texttt{raw}(T)}$ can be represented as follows:

\jmath^{(i)}_{\texttt{raw}(T)}=[r_{v},g_{v},1-r_{v}-g_{v}]^{\top}. \quad (25)
Refer to caption
Figure 10: Synthetic illuminant samples for the Canon EOS 5D camera model in the Gehler-Shi dataset [30]. The generated illuminant colors are then applied to sensor-mapped raw images, originally taken by different camera models, for augmentation purposes (Sec. D).

To avoid any bias towards the dominant color temperature in the source set $A$, we first divide the color temperature range of the source set $A$ into groups with a step of 250K. Then, we uniformly sample examples from each group to avoid any bias towards a specific type of illuminant. Figure 10 shows examples of the sampling process; as shown, the sampled illuminant chromaticity values follow the original distribution along the Planckian curve while introducing new illuminant colors of the target sensor that were not included in the original set. Finally, we apply random cropping to introduce more diversity into the generated images. Figure 11 shows examples of synthetic raw-like images for different target camera models.

Refer to caption
Figure 11: Examples of the camera augmentation used to train our network. The leftmost raw image was captured by a Nikon D5200 camera [20]. The next three images are the results of our mapping to different camera models.

D.5 Evaluation

In prior work, several approaches for training data augmentation for illuminant estimation have been attempted [53, 27, 2]. These approaches first white-balance the training raw images using the ground-truth illuminant colors associated with each image. Afterwards, illuminant colors are sampled from the ground-truth illuminant colors over the entire training set and applied to the white-balanced raw images. These sampled illuminant colors can be taken randomly from the ground-truth illuminant colors [27] or after clustering them [53]. These methods, however, are limited to using the same set of scenes as is present in the training dataset. Another approach was proposed in [2]: sRGB white-balanced images are mapped to a normalization space that is learned based on the CIE XYZ space, and a pre-computed global transformation matrix is then used to map the images from this normalization space to the target white-balanced raw space. In contrast, the augmentation method described in our paper uses an accurate mapping from the camera sensor raw space to the CIE XYZ space based on the pre-calibrated matrices provided by camera manufacturers.

In the following set of experiments, we use the baseline model FFCC [13] to study the improvement offered by our data augmentation strategy relative to alternative augmentation techniques proposed in [53, 27, 2]. We use the Canon EOS 5D images from the Gehler-Shi dataset [30] for these comparisons. For our test set, we randomly select 30% of the images in the Canon EOS 5D set; the remaining 70% of the images are used for training. We refer to this set as the “real training set”, which includes 336 raw images.

Note that, except for the augmentation used in [2], none of these methods apply a sensor-to-sensor mapping, as they use the raw images of the “real training set” as both the source and target sets for augmentation. For this reason, and for a fair comparison, we provide the results of two different sets of experiments. In the first experiment, we use the CIE XYZ images taken by the Canon EOS 5D sensor as our source set $A$, while in the second experiment, we use a set of four sensors other than the Canon EOS 5D sensor. The former is comparable to the augmentation methods used in [53, 27] (see Table 8), while the latter is comparable to the augmentation approach used in [2], which performs “raw mapping” in order to introduce new scene content into the training data (see Table 9). The shown results were obtained by generating 500 synthetic images with each augmentation method, including our augmentation approach. As shown in Tables 8 and 9, our augmentation approach achieves the largest improvement in the FFCC results.

In order to study the effect of the CIE XYZ mapping used by our augmentation approach, we trained FFCC [13] on a set of 500 synthetic raw images of the target camera model (namely, the Canon EOS 5D camera in the Gehler-Shi dataset [30]). These synthetic raw images were originally captured by the Canon EOS-1Ds Mark III camera sensor (from the NUS dataset [20]) and then mapped to the target sensor using our augmentation approach. Table 10 shows the results of FFCC trained on synthetic raw images with and without the intermediate CIE XYZ mapping step (Eqs. 18 and 19). As shown, using the CIE XYZ mapping achieves better results, which are further improved by increasing the scene diversity of the source set with additional scenes from other datasets, as shown in Table 9.

For a further evaluation, we used our approach to map images from the Canon EOS 5D camera set (the same set used to train the FFCC model) to different target camera models, and then tested the FFCC model on these mapped images. This experiment was performed to gauge whether our synthetic images degrade a camera-specific method, trained on a different camera model, in a manner similar to real images from the target cameras. To that end, we randomly selected 150 images from the Canon EOS 5D sensor set, which was used to train the FFCC model, as our source image set $A$. Then, we mapped these images to different target camera models using our approach; thus, the training set and our synthetic testing set share the same scene content. We report the results in Table 11, along with the testing results on real image sets captured by the same target camera models. As shown in Table 11, both the real and synthetic sets negatively affect the accuracy of the FFCC model (see Table 8 for the results of FFCC on a testing set taken by the same sensor used for training).

Table 8: A comparison of different augmentation methods for illuminant estimation. All results were obtained by using training images captured by the Canon EOS 5D camera model [30] as the source and target sets for augmentation. Lowest errors are highlighted in yellow.
Training set Mean Med. B. 25% W. 25%
Original set 1.81 1.12 0.35 4.43
Augmented (clustering & sampling) [53] 1.68 0.97 0.25 4.31
Augmented (sampling) [27] 1.79 1.09 0.33 4.34
Augmented (ours) 1.55 0.98 0.28 3.68
Table 9: A comparison of techniques for generating new sensor-mapped raw-like images that were originally captured by different sensors than the training camera model. The term ‘synthetic’ refers to training FFCC [13] without including any of the original training examples, while the term ‘augmented’ refers to training on synthetic and real images. The best results are bold-faced. Lowest errors of synthesized and augmented sets are highlighted in red and yellow, respectively.
Training set Mean Med. B. 25% W. 25%
Synthetic [2] 4.17 3.06 0.78 9.39
Augmentation [2] 2.64 1.95 0.45 5.97
Synthetic (ours) 2.44 1.89 0.42 5.40
Augmented (ours) 1.75 1.28 0.35 4.15
Table 10: Results of FFCC [13] trained on synthetic raw-like images after they are mapped to the target camera model. In this experiment, the raw images are mapped from the Canon EOS-1Ds Mark III camera sensor (taken from the NUS dataset [20]) to the target Canon EOS 5D camera in the Gehler-Shi dataset [30]. The shown results were obtained with and without the intermediate CIE XYZ mapping step to generate the synthetic training set. Lowest errors are highlighted in yellow.
Synthetic training set Mean Med. B. 25% W. 25%
w/o CIE XYZ 3.30 2.55 0.60 7.21
w/ CIE XYZ 3.04 2.36 0.56 6.58
Table 11: Results of FFCC [13] trained on the Canon EOS 5D camera [30] and tested on images taken by different camera models from the NUS dataset [20] and the Cube+ challenge set [11]. The synthetic sets refer to testing images generated by our data augmentation approach, where these images were mapped from the Canon EOS 5D set (used for training) to the target camera models.
Testing sensor Real camera images Synthetic camera images
Mean Med. Max Mean Med. Max
Canon EOS 1D [30] 3.88 2.66 16.32 4.68 3.80 22.83
Fujifilm XM1 [20] 4.22 3.05 47.87 2.91 2.06 38.93
Nikon D5200 [20] 4.45 3.45 36.762 3.36 2.10 41.23
Olympus EPL6 [20] 4.35 3.56 19.89 3.28 2.27 38.81
Panasonic GX1 [20] 2.83 2.03 16.58 3.24 2.29 17.07
Samsung NX2000 [20] 4.41 3.73 17.69 3.44 2.64 18.79
Sony A57 [20] 3.84 3.02 19.38 3.04 1.34 39.67
Canon EOS 550D [11] 3.83 2.49 46.55 3.14 1.98 36.30

References

  • [1] Digital negative (DNG) specification. Technical report, Adobe Systems Incorporated, 2012. Version 1.4.0.0.
  • [2] Mahmoud Afifi, Abdelrahman Abdelhamed, Abdullah Abuolaim, Abhijith Punnappurath, and Michael S Brown. CIE XYZ Net: Unprocessing images for low-level computer vision tasks. arXiv preprint arXiv:2006.12709, 2020.
  • [3] Mahmoud Afifi and Michael S Brown. Sensor-independent illumination estimation for dnn models. BMVC, 2019.
  • [4] Mahmoud Afifi and Michael S Brown. What else can fool deep learning? addressing color constancy errors on deep neural network performance. In ICCV, 2019.
  • [5] Mahmoud Afifi, Brian Price, Scott Cohen, and Michael S Brown. When color constancy goes wrong: Correcting improperly white-balanced images. CVPR, 2019.
  • [6] Mahmoud Afifi, Abhijith Punnappurath, Graham Finlayson, and Michael S. Brown. As-projective-as-possible bias correction for illumination estimation algorithms. JOSA A, 2019.
  • [7] Miika Aittala and Frédo Durand. Burst image deblurring using permutation invariant convolutional neural networks. ECCV, 2018.
  • [8] Arash Akbarinia and C Alejandro Parraga. Colour constancy beyond the classical receptive field. TPAMI, 2017.
  • [9] Nikola Banić and Karlo Koščević. Illumination estimation challenge. https://www.isispa.org/illumination-estimation-challenge. Accessed: 2021-03-07.
  • [10] Nikola Banic and Sven Loncaric. Color dog-guiding the global illumination estimation to better accuracy. VISAPP, 2015.
  • [11] Nikola Banić and Sven Lončarić. Unsupervised learning for color constancy. arXiv preprint arXiv:1712.00436, 2017.
  • [12] Jonathan T Barron. Convolutional color constancy. ICCV, 2015.
  • [13] Jonathan T Barron and Yun-Ta Tsai. Fast Fourier color constancy. CVPR, 2017.
  • [14] Simone Bianco and Claudio Cusano. Quasi-Unsupervised color constancy. CVPR, 2019.
  • [15] Simone Bianco, Claudio Cusano, and Raimondo Schettini. Color constancy using cnns. CVPR Workshops, 2015.
  • [16] David H Brainard and William T Freeman. Bayesian color constancy. JOSA A, 1997.
  • [17] David H Brainard and Brian A Wandell. Analysis of the retinex theory of color vision. JOSA A, 1986.
  • [18] Gershon Buchsbaum. A spatial processor model for object colour perception. Journal of the Franklin Institute, 1980.
  • [19] Hakki Can Karaimer and Michael S Brown. Improving color reproduction accuracy on cameras. CVPR, 2018.
  • [20] Dongliang Cheng, Dilip K Prasad, and Michael S Brown. Illuminant estimation for color constancy: Why spatial-domain methods work and the role of the color distribution. JOSA A, 2014.
  • [21] C CIE. Commission internationale de l’eclairage proceedings, 1931. Cambridge University, Cambridge, 1932.
  • [22] Hal Daume III and Daniel Marcu. Domain adaptation for statistical classifiers. JAIR, 2006.
  • [23] Graham D Finlayson and Steven D Hordley. Color constancy at a pixel. JOSA A, 2001.
  • [24] Graham D Finlayson, Steven D Hordley, and Ingeborg Tastl. Gamut constrained illuminant estimation. IJCV, 2006.
  • [25] Graham D Finlayson and Elisabetta Trezzi. Shades of gray and colour constancy. Color and Imaging Conference, 2004.
  • [26] David A Forsyth. A novel algorithm for color constancy. IJCV, 1990.
  • [27] Damien Fourure, Rémi Emonet, Elisa Fromont, Damien Muselet, Alain Trémeau, and Christian Wolf. Mixed pooling neural networks for color constancy. ICIP, 2016.
  • [28] Shaobing Gao, Wangwang Han, Kaifu Yang, Chaoyi Li, and Yongjie Li. Efficient color constancy with local surface reflectance statistics. ECCV, 2014.
  • [29] Shao-Bing Gao, Ming Zhang, Chao-Yi Li, and Yong-Jie Li. Improving color constancy by discounting the variation of camera spectral sensitivity. JOSA A, 2017.
  • [30] Peter V Gehler, Carsten Rother, Andrew Blake, Tom Minka, and Toby Sharp. Bayesian color constancy revisited. CVPR, 2008.
  • [31] Arjan Gijsenij, Theo Gevers, and Joost Van De Weijer. Generalized gamut mapping using image derivative structures for color constancy. IJCV, 2010.
  • [32] Arjan Gijsenij, Theo Gevers, and Joost Van De Weijer. Improving color constancy by photometric edge weighting. TPAMI, 2012.
  • [33] Michael D Grossberg and Shree K Nayar. Modeling the space of camera response functions. TPAMI, 2004.
  • [34] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
  • [35] Daniel Hernandez-Juarez, Sarah Parisot, Benjamin Busam, Ales Leonardis, Gregory Slabaugh, and Steven McDonagh. A multi-hypothesis approach to color constancy. CVPR, 2020.
  • [36] Steven D Hordley and Graham D Finlayson. Re-evaluating colour constancy algorithms. In ICPR, 2004.
  • [37] Yuanming Hu, Baoyuan Wang, and Stephen Lin. FC4: Fully convolutional color constancy with confidence-weighted pooling. CVPR, 2017.
  • [38] Jun Jiang, Dengyu Liu, Jinwei Gu, and Sabine Süsstrunk. What is the space of spectral sensitivity functions for digital color cameras? WACV, 2013.
  • [39] Thorsten Joachims. Learning to classify text using support vector machines. ICML, 1999.
  • [40] Hamid Reza Vaezi Joze, Mark S Drew, Graham D Finlayson, and Perla Aurora Troncoso Rey. The role of bright pixels in illumination estimation. Color and Imaging Conference, 2012.
  • [41] Hakki Can Karaimer and Michael S Brown. Beyond raw-RGB and sRGB: Advocating access to a colorimetric image state. Color and Imaging Conference, 2019.
  • [42] Seon Joo Kim, Jan-Michael Frahm, and Marc Pollefeys. Radiometric calibration with illumination change for outdoor scene analysis. CVPR, 2008.
  • [43] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [44] Karlo Koščević, Nikola Banić, and Sven Lončarić. Color beaver: Bounding illumination estimations for higher accuracy. VISIGRAPP, 2019.
  • [45] Samu Koskinen, Dan Yang, and Joni-Kristian Kämäräinen. Cross-dataset color constancy revisited using sensor-to-sensor transfer. BMVC, 2020.
  • [46] Firas Laakom, Nikolaos Passalis, Jenni Raitoharju, Jarno Nikkanen, Anastasios Tefas, Alexandros Iosifidis, and Moncef Gabbouj. Bag of color features for color constancy. IEEE TIP, 2020.
  • [47] Firas Laakom, Jenni Raitoharju, Alexandros Iosifidis, Jarno Nikkanen, and Moncef Gabbouj. Color constancy convolutional autoencoder. Symposium Series on Computational Intelligence, 2019.
  • [48] Firas Laakom, Jenni Raitoharju, Alexandros Iosifidis, Jarno Nikkanen, and Moncef Gabbouj. Intel-TAU: A color constancy dataset. arXiv preprint arXiv:1910.10404, 2019.
  • [49] Changjun Li, Guihua Cui, Manuel Melgosa, Xiukai Ruan, Yaoju Zhang, Long Ma, Kaida Xiao, and M Ronnier Luo. Accurate method for computing correlated color temperature. Optics express, 2016.
  • [50] Orly Liba, Kiran Murthy, Yun-Ta Tsai, Tim Brooks, Tianfan Xue, Nikhil Karnad, Qiurui He, Jonathan T Barron, Dillon Sharlet, Ryan Geiss, et al. Handheld mobile photography in very low light. ACM TOG, 2019.
  • [51] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. NeurIPS, 2018.
  • [52] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [53] Zhongyu Lou, Theo Gevers, Ninghang Hu, Marcel P Lucassen, et al. Color constancy by deep learning. BMVC, 2015.
  • [54] Dominic Masters and Carlo Luschi. Revisiting small batch training for deep neural networks. arXiv preprint arXiv:1804.07612, 2018.
  • [55] Steven McDonagh, Sarah Parisot, Fengwei Zhou, Xing Zhang, Ales Leonardis, Zhenguo Li, and Gregory Slabaugh. Formulating camera-adaptive color constancy as a few-shot meta-learning problem. arXiv preprint arXiv:1811.11788, 2018.
  • [56] Junichi Nakamura. Image sensors and signal processing for digital still cameras. CRC press, 2017.
  • [57] Seoung Wug Oh and Seon Joo Kim. Approaching the computational color constancy as a classification problem through deep learning. Pattern Recognition, 2017.
  • [58] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE TKDE, 2009.
  • [59] Yanlin Qian, Ke Chen, and Huanglin Yu. Fast fourier color constancy and grayness index for ISPA illumination estimation challenge. International Symposium on Image and Signal Processing and Analysis (ISPA), 2019.
  • [60] Yanlin Qian, Joni-Kristian Kamarainen, Jarno Nikkanen, and Jiri Matas. On finding gray pixels. CVPR, 2019.
  • [61] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. MICCAI, 2015.
  • [62] Charles Rosenberg, Martial Hebert, and Sebastian Thrun. Color constancy using KL-divergence. ICCV, 2001.
  • [63] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. ECCV, 2010.
  • [64] A Savchik, E Ershov, and S Karpenko. Color cerberus. International Symposium on Image and Signal Processing and Analysis (ISPA), 2019.
  • [65] Lilong Shi and Brian Funt. MaxRGB reconsidered. Journal of Imaging Science and Technology, 2012.
  • [66] Wu Shi, Chen Change Loy, and Xiaoou Tang. Deep specialized network for illuminant estimation. ECCV, 2016.
  • [67] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
  • [68] Joost Van De Weijer, Theo Gevers, and Arjan Gijsenij. Edge-based color constancy. IEEE TIP, 2007.
  • [69] Vladimir Vapnik. Statistical learning theory. Wiley, 1998.
  • [70] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
  • [71] John Walker. Colour rendering of spectra. https://www.fourmilab.ch/documents/specrend/. Accessed: 2021-03-07.
  • [72] S. Woo, S. Lee, J. Yoo, and J. Kim. Improving color constancy in an ambient light environment using the phong reflection model. IEEE TIP, 2018.
  • [73] Seoung Wug Oh, Michael S Brown, Marc Pollefeys, and Seon Joo Kim. Do it yourself hyperspectral imaging with everyday digital cameras. CVPR, 2016.
  • [74] Gunter Wyszecki and Walter Stanley Stiles. Color science, volume 8. Wiley New York, 1982.
  • [75] Jin Xiao, Shuhang Gu, and Lei Zhang. Multi-domain learning for accurate and few-shot color constancy. CVPR, 2020.
  • [76] Bolei Xu, Jingxin Liu, Xianxu Hou, Bozhi Liu, and Guoping Qiu. End-to-end illuminant estimation based on deep metric learning. CVPR, 2020.