
¹National Taiwan University  ²Amazon Web Services  ³aetherAI
Email: [email protected], [email protected], [email protected], [email protected]

Every Pixel Has its Moments: Ultra-High-Resolution Unpaired Image-to-Image Translation via Dense Normalization

Ming-Yang Ho¹, Che-Ming Wu², Min-Sheng Wu³, Yufeng Jane Tseng¹
Abstract

Recent advancements in ultra-high-resolution unpaired image-to-image translation have aimed to mitigate the constraints imposed by limited GPU memory through patch-wise inference. Nonetheless, existing methods often compromise between the reduction of noticeable tiling artifacts and the preservation of color and hue contrast, attributed to the reliance on global image- or patch-level statistics in the instance normalization layers. In this study, we introduce a Dense Normalization (DN) layer designed to estimate pixel-level statistical moments. This approach effectively diminishes tiling artifacts while concurrently preserving local color and hue contrasts. To address the computational demands of pixel-level estimation, we further propose an efficient interpolation algorithm. Moreover, we invent a parallelism strategy that enables the DN layer to operate in a single pass. Through extensive experiments, we demonstrate that our method surpasses all existing approaches in performance. Notably, our DN layer is hyperparameter-free and can be seamlessly integrated into most unpaired image-to-image translation frameworks without necessitating retraining. Overall, our work paves the way for future exploration in handling images of arbitrary resolutions within the realm of unpaired image-to-image translation. Code is available at: https://github.com/Kaminyou/Dense-Normalization.

Keywords:
Unpaired image-to-image translation · Ultra-high-resolution image · Parallelism

1 Introduction

Unpaired image-to-image (I2I) translation is a conventional computer vision task that aims to translate an image from one domain to another without using paired images [1, 2, 3]. However, most frameworks are incapable of handling ultra-high-resolution (UHR) images due to GPU memory limitations. For example, the popular CUT [4] framework requires 14 GB of GPU VRAM for inference and 160 GB for training when processing an image with a resolution of 2,048×2,048, exceeding the capacity of a single 32 GB NVIDIA V100 GPU. This presents a significant challenge for researchers and practitioners working on unpaired I2I translation tasks involving UHR images.

Nevertheless, the ubiquity of UHR images in our daily life is undeniable, with mobile phones capturing 4K resolution photos and movies exceeding 8K resolution [5, 6]. Without an effective methodology, performing common image translation tasks like style transfer [7] and colorization [8] on these images would be significantly hindered.

Another critical application of UHR unpaired I2I translation is stain transformation in digital pathology [9, 10, 11]. Standard staining methods, such as hematoxylin and eosin (H&E), are commonly used due to their cost-effectiveness. However, for more detailed cancer diagnostics, the use of expensive immunohistochemical (IHC) stains becomes essential [12, 13]. Given that pathological images frequently have resolutions exceeding 10,000×10,000 pixels, an effective algorithm for stain transformation that reduces the cost of pathological staining is urgently needed.

A few strategies have been leveraged to perform unpaired I2I translation on UHR images. Simplifying the model architecture or increasing the output image size enables translation of 2K images [14, 15], but such methods still incur high GPU memory usage, with a space complexity of $\mathcal{O}(N^{2})$ for an image with a resolution of $N\times N$. Alternatively, patch-wise training and inference can decrease the space complexity to $\mathcal{O}(1)$, but they struggle to produce seamless results due to the tiling artifacts that appear when the patches are stitched back into an UHR image. Although the convolutional operators in most I2I frameworks should guarantee that the final output can be seamlessly assembled from the patches, normalization operators applied per patch disrupt this property. Since the statistical moments calculated in Instance Normalization (IN) layers affect color fidelity [16], their discrepancies between neighboring patches lead to gap-type tiling artifacts, as evidenced by patch-wise IN [17] (refer to Fig. 1 (b) and Fig. 2 (a)).

Figure 1: Comparison of translations. (a) Showcases a real2paint translated ultra-high-resolution image (3,024×4,032 pixels) produced by our Dense Normalization (DN) from the image displayed in the top right corner, with comparisons highlighted within the blue-boxed region. (b) Illustrates the occurrence of gap-type tiling artifacts in patch-wise IN [17] or KIN [18]; (c) Demonstrates jitter-type tiling artifacts resulting from TIN [19]; (d) Presents DN’s effectiveness in diminishing tiling artifacts.

To mitigate this issue, Thumbnail Instance Normalization (TIN) [19] applies global image-level statistics at the expense of local hue and contrast, resulting in over/under-colorizing. Furthermore, significant perturbations in these statistical moments unfortunately create jitter-type tiling artifacts, which manifest as color jitters at the edges of patches (see Fig. 1 (c) and Fig. 2 (b)). On the other hand, Kernelized Instance Normalization (KIN) [18] alleviates gap-type tiling artifacts by more closely aligning adjacent patch-level statistics, but it requires selecting a kernel size to trade off blurring artifacts against the preservation of local color contrast (see Fig. S1 in the supplementary material). Additionally, KIN requires a two-stage pipeline (caching and inference stages), because statistics must first be calculated and cached before convolution operations are performed on them (see Fig. 2 (c)). This raises a question: can pixel-level statistical moment estimation address all these issues, and can it be accomplished in a single pass?

Figure 2: Comparison of various normalization strategies. This figure illustrates the framework and the impact of different normalization methods on an UHR image (3,024×4,032 pixels) for the summer2autumn task: (a) Patch-wise IN [17] uses patch-level statistics and leads to statistical differences between patches, resulting in noticeable gap-type tiling artifacts. (b) TIN [19] eliminates statistical differences with global image-level statistics (from the thumbnail) but compromises color and hue details, also inducing jitter-type tiling artifacts. (c) KIN [18] utilizes a two-stage pipeline to mitigate statistical differences by applying convolutional operations on patch-level statistics, albeit at the expense of local detail. (d) DN estimates pixel-level statistical moments in a single pass, effectively preserving local color and hue while diminishing tiling artifacts. (e) DN outperforms all methods in every aspect of human evaluation. In the feature rows, “✓” indicates “achieved”, “△” indicates “partially achieved”, and “✗” indicates “not achieved”. Red close-up boxes highlight the outcomes influenced by the different statistical moments used for normalization.

To answer the above question, we propose the Dense Normalization (DN) layer, which is capable of estimating statistical moments for every pixel. It possesses four expected properties: diminishing tiling artifacts ($\mathcal{P}_{1}$), preserving local hue and color contrast ($\mathcal{P}_{2}$), executing in a single pass ($\mathcal{P}_{3}$), and being hyperparameter-free ($\mathcal{P}_{4}$), as illustrated in Fig. 1, Fig. 2 (d), and Fig. S2 in the supplementary material.

While pixel-level statistics estimation can be achieved by performing bilinear interpolation on patch-level statistics, its naïve implementation is time-consuming due to the high computational demands. Hence, we developed a fast interpolation algorithm to enhance calculation efficiency and practicality (see the comparison in Table 4). Furthermore, to perform pixel-level statistics estimation, patch-wise statistics must first be calculated and cached. Fast interpolation is then performed on these statistics, a process that would typically necessitate a two-stage pipeline similar to KIN. Nevertheless, we have devised a prefetching strategy that cleverly hides the caching process within the inference process, leveraging GPU parallelism to enable DN to operate in a single pass.

We evaluated our DN on four publicly available datasets containing natural and pathological images; quantitative evaluations confirm that DN outperforms previous methods and demonstrate its applicability in the healthcare field. In summary, our research has achieved the following:

  • To the best of our knowledge, this is the first study to estimate pixel-level statistical moments for normalization in UHR unpaired I2I translation, effectively diminishing tiling artifacts ($\mathcal{P}_{1}$) while simultaneously preserving local hue and color ($\mathcal{P}_{2}$), thereby achieving state-of-the-art performance.

  • We introduce a fast interpolation algorithm for efficient pixel-wise statistics estimation, along with a prefetching parallelism algorithm that enables DN to operate in a single pass ($\mathcal{P}_{3}$), significantly decreasing runtime in comparison to naïve implementations.

  • Our hyperparameter-free DN layer ($\mathcal{P}_{4}$) can be seamlessly integrated into any existing framework utilizing IN layers during inference, without necessitating model retraining.

2 Related works

Unpaired image-to-image translation. Several frameworks have been developed for unpaired image-to-image translation, aiming to discover the mapping between diverse image domains. CycleGAN [20], DiscoGAN [21], and DualGAN [22] utilize cycle-consistency loss to enforce the mapping. However, the pixel-level cycle-consistency constraint can lead to deformation and hinder the generation of large objects and fine textures when there are significant domain differences. Recently, strategies have been proposed to enhance performance beyond cycle-consistency. DistanceGAN [23] maintains pairwise distances between different parts of the same sample in each domain, while ACL-GAN [24] utilizes adversarial loss to address cyclic loss. CUT [4] maximizes patch-wise similarity between domains using contrastive learning, and LSeSim [25] learns spatial correlation to preserve structural similarity. Additionally, patch-wise semantic relationship regularization [26] is used to enhance correspondence between input and output images, while an energy function [27] is employed to retain domain-independent features and discard domain-specific ones. Despite these advancements, these frameworks are limited to processing small images. Our DN is a plugin designed to enable the processing of UHR images by simply replacing the IN layer in these frameworks.

Figure 3: Framework of the Proposed Method. (a) Provides an overall view of our framework’s pipeline. (b) Shows the details of the dispatcher and the Dense Normalization (DN) layer. A UHR image $\boldsymbol{X}$ is initially divided into patches $\boldsymbol{x}^{\text{patch}}_{c,r}$, with $c$ and $r$ representing the row and column coordinates, respectively. The dispatcher sequences two patches for the prefetching and inference branches. Within the DN layer, the prefetching branch calculates and caches statistical moments. For the inference branch, statistics for the patch and its eight surrounding patches are queried. Subsequently, fast interpolations are employed to estimate the mean ($\hat{\boldsymbol{\mu}}^{\text{pixel}}_{c,r}$) and inverse standard deviation ($\hat{\boldsymbol{\sigma^{*}}}^{\text{pixel}}_{c,r}$) for each pixel, facilitating dense normalization.

Ultra-high-resolution unpaired image-to-image translation. Performing unpaired image-to-image translation on ultra-high-resolution images is computationally expensive. Patch-wise methods, which divide the input image into smaller patches and reassemble the translated ones, offer a solution but often lead to tiling artifacts. To address this problem, overlapping windows [28, 29] can be used, or a perceptual embedding consistency loss can be employed to learn color, contrast, and brightness invariant features [30]. Meanwhile, downsampling-based methods [14, 15] avoid tiling artifacts but may result in detail loss and increased spatial complexity for upsampled images. Thumbnail Instance Normalization (TIN) [19] eliminates gap-type tiling artifacts by assuming that all patches share the same global image-level statistics, but may result in over/under-colorizing and jitter-type tiling artifacts. Kernelized Instance Normalization (KIN) [18] involves a two-stage pipeline and computes patch-level statistics using convolution operations to preserve local information, but requires selecting an optimal kernel. Our DN differentiates itself by estimating pixel-level statistics to reduce tiling artifacts ($\mathcal{P}_{1}$) and preserve local hues and colors ($\mathcal{P}_{2}$) in a single pass ($\mathcal{P}_{3}$), without the need for hyperparameter tuning ($\mathcal{P}_{4}$).

Figure 4: Details of the fast interpolation operation utilized in DN. Panel (a) illustrates the process of deriving N×N pixel-level statistical moment estimations from a 3×3 matrix. Panel (b) visualizes the matrix multiplication operation involved in fast interpolation.

3 Proposed methods

3.1 Overall framework

Unpaired I2I translation aims to train a generator $\mathcal{G}$ to translate an image $\boldsymbol{X}$ in domain $\mathcal{X}$ to another domain $\mathcal{Y}$, even when no corresponding paired images exist. The output of $\mathcal{G}$ is denoted as $\hat{\boldsymbol{Y}}$ and is expected to lie in domain $\mathcal{Y}$. In the context of UHR unpaired I2I translation, all images in domain $\mathcal{X}$ have a high resolution of $H\times W$. To enable $\mathcal{G}$ to handle images of arbitrarily high resolution, the model must be trained and executed in a patch-wise manner so that the GPU space complexity is reduced to a constant.

After training an I2I generator $\mathcal{G}$ with patches, the IN layers are replaced with DN layers. During the inference process, an UHR image $\boldsymbol{X}\in\mathbb{R}^{H\times W}$ is divided into patches $\boldsymbol{x}^{\text{patch}}_{c,r}\in\mathbb{R}^{N\times N}$, each with a size of $N\times N$, and their coordinates $(c, r)$ relative to the original image are recorded. Here, $c\in\{0,1,\dots,\lceil\frac{H}{N}\rceil-1\}$ and $r\in\{0,1,\dots,\lceil\frac{W}{N}\rceil-1\}$. Then, a dispatcher sequentially feeds these patches and coordinates into the generator.
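As a concrete illustration, the following is a minimal PyTorch-style sketch of this patch-cropping step (function and variable names are illustrative assumptions, not the released implementation); border patches are zero-padded here so every patch has the same size, one simple way to handle images whose sides are not multiples of N.

```python
import torch

def crop_into_patches(image: torch.Tensor, patch_size: int = 512):
    """Split a (C, H, W) tensor into patch_size x patch_size patches.

    Returns a list of (patch, (c, r)) pairs, where c indexes the patch's
    vertical (row) position and r its horizontal (column) position in the
    original image, matching the notation above.
    """
    _, height, width = image.shape
    num_rows = -(-height // patch_size)   # ceil(H / N)
    num_cols = -(-width // patch_size)    # ceil(W / N)

    patches = []
    for c in range(num_rows):
        for r in range(num_cols):
            top, left = c * patch_size, r * patch_size
            patch = image[:, top:top + patch_size, left:left + patch_size]
            # Zero-pad border patches on the bottom/right to a full patch.
            pad_h = patch_size - patch.shape[1]
            pad_w = patch_size - patch.shape[2]
            if pad_h or pad_w:
                patch = torch.nn.functional.pad(patch, (0, pad_w, 0, pad_h))
            patches.append((patch, (c, r)))
    return patches
```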

Fig. 3 illustrates the overall pipeline of our framework. Specifically, during each dispatch, two patches along with their coordinates are sent: one to the prefetching branch and the other to the inference branch. Both patches simultaneously go through all layers except the DN layer. In the DN layer, the patch in the prefetching branch undergoes a standard instance normalization [17], storing the resultant statistics in the cache tables ($T_{\mu}$, $T_{\sigma}$) with the coordinates as keys. The patch in the inference branch uses its coordinates to query the cache tables for its own and its eight neighbors’ statistics, forming two $3\times 3$ matrices of coarse-level (patch-level) statistical moments ($\tilde{\boldsymbol{\mu}}^{\text{patch}}_{c,r}$, $\tilde{\boldsymbol{\sigma}}^{\text{patch}}_{c,r}\in\mathbb{R}^{3\times 3}$). Fast interpolation is then applied to these matrices to estimate fine-level (pixel-level) statistical measures ($\hat{\boldsymbol{\mu}}^{\text{pixel}}_{c,r}$, $\hat{\boldsymbol{\sigma^{*}}}^{\text{pixel}}_{c,r}\in\mathbb{R}^{N\times N}$), which are subsequently used for dense normalization. The translated patches are then reassembled into an UHR image, with the DN operation effectively reducing tiling artifacts.

3.2 Details

Dispatcher. Given a collection of patches along with their coordinates, the dispatcher first arranges them vertically (column by column) to create a list of images, denoted as $P$. Specifically, $P$ is formed as $(\boldsymbol{x}^{\text{patch}}_{0,0},\boldsymbol{x}^{\text{patch}}_{1,0},\dots,\boldsymbol{x}^{\text{patch}}_{h-1,0},\boldsymbol{x}^{\text{patch}}_{0,1},\dots,\boldsymbol{x}^{\text{patch}}_{h-1,w-1})$, where $h=\lceil\frac{H}{N}\rceil$ and $w=\lceil\frac{W}{N}\rceil$. Subsequently, the dispatcher dispatches the images sequentially. At step $t$, the dispatcher sends out two images along with their coordinates: $P[t]$ for the inference branch and $P[t+h+2]$ for the prefetching branch. This arrangement ensures that the eight neighboring patches of $P[t]$, as originally cropped from the UHR image, have already been processed by the prefetching branch, guaranteeing that the corresponding statistical moments are cached in $T_{\mu}$ and $T_{\sigma}$ and can be queried. The iteration starts from $t=-(h+2)$ and goes up to $h\cdot w-1$. For $t$ values outside the range of sequence $P$, an empty image $\phi$ is provided, and the branch assigned to process it performs no action.
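A minimal sketch of this dispatch schedule (illustrative names; returning None stands in for the empty image $\phi$):

```python
def dispatch(patch_list, h, w):
    """Yield (inference_item, prefetch_item) pairs in the order described
    above.

    patch_list is the column-major sequence P of (patch, (c, r)) pairs;
    h and w are the number of patch rows and columns. Out-of-range indices
    yield None, signalling the corresponding branch to idle for that step.
    """
    total = h * w
    offset = h + 2  # the prefetching branch runs h + 2 steps ahead

    def fetch(t):
        return patch_list[t] if 0 <= t < total else None

    for t in range(-offset, total):
        yield fetch(t), fetch(t + offset)
```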

Figure 5: Comparison of two-stage and single-pass DN. A naïve implementation of DN might resemble KIN, operating in two stages. However, our dispatcher design and prefetching strategy enable the prefetching branch to run in parallel with the inference branch across most neural network (NN) layers, and to execute asynchronously in the DN layer, effectively hiding the runtime of the prefetching branch.

Prefetching branch and caching. When a patch $\boldsymbol{x}^{\text{patch}}_{c,r}$ enters the prefetching branch, it first undergoes a standard instance normalization process [17]:

IN(\boldsymbol{x}^{\text{patch}}_{c,r})=\gamma\left(\frac{\boldsymbol{x}^{\text{patch}}_{c,r}-\mathbb{E}[\boldsymbol{x}^{\text{patch}}_{c,r}]}{\sqrt{Var[\boldsymbol{x}^{\text{patch}}_{c,r}]}}\right)+\beta \quad (1)

Here, $\mathbb{E}[\boldsymbol{x}^{\text{patch}}_{c,r}]$ and $\sqrt{Var[\boldsymbol{x}^{\text{patch}}_{c,r}]}$ are denoted as ${\mu}^{\text{patch}}_{c,r}$ and ${\sigma}^{\text{patch}}_{c,r}$, respectively. These represent the mean and standard deviation of $\boldsymbol{x}^{\text{patch}}_{c,r}$. Subsequently, these two statistics are stored in the cache tables using their coordinates as keys; specifically, $T_{\mu}[c][r]:={\mu}^{\text{patch}}_{c,r}$ and $T_{\sigma}[c][r]:={\sigma}^{\text{patch}}_{c,r}$.
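A minimal PyTorch-style sketch of this prefetching step, assuming per-channel statistics and dictionary cache tables keyed by coordinates (names and the eps value are illustrative):

```python
import torch

def prefetch_step(x_patch, coord, mean_table, std_table, eps=1e-5):
    """Instance-normalize a (B, C, H, W) patch and cache its statistics.

    mean_table and std_table are dicts keyed by the patch coordinates
    (c, r); the per-channel mean and standard deviation are stored so the
    inference branch can query them later.
    """
    mu = x_patch.mean(dim=(2, 3), keepdim=True)                          # (B, C, 1, 1)
    sigma = x_patch.var(dim=(2, 3), keepdim=True, unbiased=False).add(eps).sqrt()
    mean_table[coord] = mu
    std_table[coord] = sigma
    return (x_patch - mu) / sigma  # affine gamma/beta omitted for brevity
```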

Inference branch and dense normalization. When a patch $\boldsymbol{x}^{\text{patch}}_{c,r}$ enters the inference branch, it first uses its coordinates to query the cache tables for its own and its eight neighbors’ statistical moments. Specifically, we query $T_{\mu}$ and $T_{\sigma}$ with keys $\{c-1,c,c+1\}\times\{r-1,r,r+1\}$, yielding two $3\times 3$ matrices: $\tilde{\boldsymbol{\mu}}^{\text{patch}}_{c,r}$ and $\tilde{\boldsymbol{\sigma}}^{\text{patch}}_{c,r}$. Our goal is to derive two $N\times N$ pixel-level statistical moment estimations, $\hat{\boldsymbol{\mu}}^{\text{pixel}}_{c,r}$ and $\hat{\boldsymbol{\sigma^{*}}}^{\text{pixel}}_{c,r}$, from $\tilde{\boldsymbol{\mu}}^{\text{patch}}_{c,r}$ and $\tilde{\boldsymbol{\sigma}}^{\text{patch}}_{c,r}$, respectively, for $\boldsymbol{x}^{\text{patch}}_{c,r}$.

We assume that the patch-level statistics represent the statistics of the central pixel of the patch; for example, $\hat{\boldsymbol{\mu}}^{\text{pixel}}_{c,r}\left[\frac{N}{2}\right]\left[\frac{N}{2}\right]={\mu}^{\text{patch}}_{c,r}$. Hence, we can use the process below to derive $\hat{\boldsymbol{\mu}}^{\text{pixel}}_{c,r}$ from $\tilde{\boldsymbol{\mu}}^{\text{patch}}_{c,r}$, with Fig. 4(a) providing a visual representation of the entire interpolation process.

  1. Perform fast interpolation on each corner of the $3\times 3$ matrix $\tilde{\boldsymbol{\mu}}^{\text{patch}}_{c,r}$; for instance, the top-left corner is the $2\times 2$ submatrix $\tilde{\boldsymbol{\mu}}^{\text{patch}}_{c,r}[0:1,0:1]$.

  2. Each fast interpolation expands a $2\times 2$ submatrix into an $N\times N$ matrix.

  3. By interpolating all four $2\times 2$ submatrices into $N\times N$ matrices, a larger $2N\times 2N$ matrix is constructed.

  4. The central $N\times N$ submatrix is then extracted from this $2N\times 2N$ matrix, serving as the pixel-level statistical estimation $\hat{\boldsymbol{\mu}}^{\text{pixel}}_{c,r}$ for the patch $\boldsymbol{x}^{\text{patch}}_{c,r}$.

For $\tilde{\boldsymbol{\sigma}}^{\text{patch}}_{c,r}$, we first calculate the inverse of each element to form $\tilde{\boldsymbol{\sigma}^{*}}^{\text{patch}}_{c,r}$. Then, the same interpolation and cropping processes are conducted to obtain the pixel-level statistical estimation $\hat{\boldsymbol{\sigma^{*}}}^{\text{pixel}}_{c,r}\in\mathbb{R}^{N\times N}$ for the patch $\boldsymbol{x}^{\text{patch}}_{c,r}$. A code sketch of this corner-wise procedure is given below.
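A minimal PyTorch-style sketch of this corner-wise estimation (names are illustrative; the precomputed weights are the fast-interpolation matrices derived in the “Fast interpolation” paragraph later in this section):

```python
import torch

def pixel_level_estimate(stats_3x3: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Estimate N x N pixel-level statistics from a 3 x 3 grid of
    patch-level statistics, following the corner-wise procedure above.

    stats_3x3: (C, 3, 3) patch-level moments of the patch and its eight
               neighbours.
    weights:   (2, 2, N, N) precomputed fast-interpolation matrices M_{k,l}
               (see the fast-interpolation sketch below).
    Returns a (C, N, N) map cropped from the centre of the assembled
    2N x 2N grid.
    """
    n = weights.shape[-1]
    full = stats_3x3.new_zeros(stats_3x3.shape[0], 2 * n, 2 * n)
    # Interpolate each of the four 2 x 2 corner submatrices to N x N.
    for i in range(2):          # top / bottom corner rows
        for j in range(2):      # left / right corner columns
            corner = stats_3x3[:, i:i + 2, j:j + 2]                # (C, 2, 2)
            block = torch.einsum('ckl,klij->cij', corner, weights)  # Eq. (8)
            full[:, i * n:(i + 1) * n, j * n:(j + 1) * n] = block
    # The central N x N window corresponds to the current patch.
    half = n // 2
    return full[:, half:half + n, half:half + n]
```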

Now, we can utilize the pixel-level statistical moments to perform dense normalization, denoted as $DN(\cdot)$:

DN(\boldsymbol{x}^{\text{patch}},\hat{\boldsymbol{\mu}}^{\text{pixel}}_{c,r},\hat{\boldsymbol{\sigma^{*}}}^{\text{pixel}}_{c,r})=\gamma\left((\boldsymbol{x}^{\text{patch}}-\hat{\boldsymbol{\mu}}^{\text{pixel}}_{c,r})\cdot\hat{\boldsymbol{\sigma^{*}}}^{\text{pixel}}_{c,r}\right)+\beta \quad (2)
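Eq. (2) is then a purely element-wise operation; a minimal sketch, assuming gamma and beta are the affine parameters of the original IN layer broadcast over the spatial dimensions:

```python
import torch

def dense_normalize(x_patch: torch.Tensor,
                    mu_pixel: torch.Tensor,
                    inv_sigma_pixel: torch.Tensor,
                    gamma: torch.Tensor,
                    beta: torch.Tensor) -> torch.Tensor:
    """Apply Eq. (2): every pixel of the (B, C, N, N) patch is shifted by
    its own estimated mean and scaled by its own estimated inverse standard
    deviation, then the IN layer's affine parameters gamma/beta of shape
    (1, C, 1, 1) are applied."""
    return gamma * (x_patch - mu_pixel) * inv_sigma_pixel + beta
```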

Fast interpolation. Unlike general interpolation, where input and output sizes vary from case to case, our DN repeatedly computes the interpolation from $\mathbb{R}^{2\times 2}$ to $\mathbb{R}^{N\times N}$, with $N$ being a constant. Hence, we reformulate bilinear interpolation into fast interpolation, which reduces computational demands and expedites DN computation.

Given a matrix $\boldsymbol{Q}\in\mathbb{R}^{2\times 2}$ that we wish to interpolate into $\boldsymbol{Q}^{\prime}\in\mathbb{R}^{N\times N}$, standard bilinear interpolation can be formulated as follows:

\boldsymbol{Q}=\begin{bmatrix}q_{0,0}&q_{0,1}\\ q_{1,0}&q_{1,1}\end{bmatrix} \quad (3)
\boldsymbol{Q}^{\prime}[i][j]=\frac{1}{N^{2}}\begin{bmatrix}N-v_{i}&v_{i}\end{bmatrix}\begin{bmatrix}q_{0,0}&q_{0,1}\\ q_{1,0}&q_{1,1}\end{bmatrix}\begin{bmatrix}N-v_{j}\\ v_{j}\end{bmatrix},\quad\forall i,j\in\{0,1,\dots,N-1\} \quad (4)
\text{where}\ v_{k}=\frac{kN}{N-1},\quad k\in\{0,1,\dots,N-1\} \quad (5)

This can be further reformulated as:

\boldsymbol{Q}^{\prime}[i][j]=\frac{1}{N^{2}}\left\langle\begin{bmatrix}(N-v_{i})(N-v_{j})&(N-v_{i})\cdot v_{j}\\ v_{i}\cdot(N-v_{j})&v_{i}\cdot v_{j}\end{bmatrix},\begin{bmatrix}q_{0,0}&q_{0,1}\\ q_{1,0}&q_{1,1}\end{bmatrix}\right\rangle \quad (6)
\phantom{\boldsymbol{Q}^{\prime}[i][j]}=\left\langle\begin{bmatrix}\frac{(N-v_{i})(N-v_{j})}{N^{2}}&\frac{(N-v_{i})\cdot v_{j}}{N^{2}}\\ \frac{v_{i}\cdot(N-v_{j})}{N^{2}}&\frac{v_{i}\cdot v_{j}}{N^{2}}\end{bmatrix},\begin{bmatrix}q_{0,0}&q_{0,1}\\ q_{1,0}&q_{1,1}\end{bmatrix}\right\rangle,\quad\forall i,j\in\{0,1,\dots,N-1\} \quad (7)

where $\left\langle\cdot,\cdot\right\rangle$ denotes the Frobenius inner product. Note that for a given coordinate $(i,j)$, the interpolated value $\boldsymbol{Q}^{\prime}[i][j]$ is a weighted sum of the elements of $\boldsymbol{Q}$ with a fixed set of weights, since $N$ is a constant.

Thus, the final interpolated result $\boldsymbol{Q}^{\prime}$ can be written as:

\boldsymbol{Q}^{\prime}=\sum_{k=0}^{1}\sum_{l=0}^{1}q_{k,l}\cdot\boldsymbol{M}_{k,l} \quad (8)

This equation represents the fast interpolation process (see Fig. 4 (b)), where the elements of each matrix $\boldsymbol{M}_{k,l}\in\mathbb{R}^{N\times N}$ are defined as follows:

\boldsymbol{M}_{0,0}[i][j]=\frac{(N-v_{i})(N-v_{j})}{N^{2}},\quad\forall i,j\in\{0,1,\dots,N-1\} \quad (9)
\boldsymbol{M}_{0,1}[i][j]=\frac{(N-v_{i})\cdot v_{j}}{N^{2}},\quad\forall i,j\in\{0,1,\dots,N-1\} \quad (10)
\boldsymbol{M}_{1,0}[i][j]=\frac{v_{i}\cdot(N-v_{j})}{N^{2}},\quad\forall i,j\in\{0,1,\dots,N-1\} \quad (11)
\boldsymbol{M}_{1,1}[i][j]=\frac{v_{i}\cdot v_{j}}{N^{2}},\quad\forall i,j\in\{0,1,\dots,N-1\} \quad (12)

This reformulation highlights fast interpolation’s desirable features (see Table 4). First, it consists solely of matrix multiplication, which can be accelerated by a GPU. Second, the matrices $\boldsymbol{M}_{k,l}$ are identical across all interpolation operations in our dense normalization, allowing them to be precomputed and cached.
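A minimal PyTorch-style sketch of this precomputation and of the resulting fast interpolation (names are illustrative assumptions):

```python
import torch

def precompute_fast_interp_weights(n: int) -> torch.Tensor:
    """Precompute the four N x N weight matrices M_{k,l} of Eqs. (9)-(12).

    Because N is fixed, the weights can be built once, cached, and reused
    for every fast-interpolation call; each interpolation then reduces to
    the weighted sum of Eq. (8).
    """
    v = torch.arange(n, dtype=torch.float32) * n / (n - 1)   # Eq. (5)
    a = (n - v) / n        # factor (N - v_k) / N
    b = v / n              # factor v_k / N
    weights = torch.stack([
        torch.outer(a, a),  # M_{0,0}
        torch.outer(a, b),  # M_{0,1}
        torch.outer(b, a),  # M_{1,0}
        torch.outer(b, b),  # M_{1,1}
    ]).reshape(2, 2, n, n)
    return weights


def fast_interpolate(q: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Interpolate a (C, 2, 2) matrix to (C, N, N) via Eq. (8)."""
    return torch.einsum('ckl,klij->cij', q, weights)
```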

Parallelism and single pass. Our dispatcher design obviates the need for separate caching and inference stages, enabling our framework to execute them concurrently in a single pass ($\mathcal{P}_{3}$). This efficiency is attributed to the inherent characteristics of GPUs. Specifically, processing a batch of images through a neural network layer (e.g., a convolutional layer) incurs a similar time cost regardless of the batch size. Consequently, two dispatched patches can be processed in parallel across most layers of the generator. Even though they perform different tasks upon reaching the DN layer, the operations are asynchronously enqueued and executed in parallel by the GPU. While data synchronization does require some time, it incurs only a minimal time cost. This parallel execution strategy allows the prefetching branch’s operations to be effectively “hidden” beneath those of the inference branch, markedly reducing the overall runtime (see Fig. 5).
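As a self-contained illustration of the batching claim above, the snippet below (requires a CUDA GPU; the layer size is an arbitrary assumption) times a convolution over one patch versus two stacked patches; on an underutilized GPU the two costs are close, which is what lets the prefetching patch ride along with the inference patch.

```python
import time
import torch

# Compare the per-call cost of a convolution for batch sizes 1 and 2.
conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1).cuda().eval()

def bench(batch_size: int, iters: int = 50) -> float:
    x = torch.randn(batch_size, 64, 512, 512, device="cuda")
    with torch.no_grad():
        conv(x)                      # warm-up
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            conv(x)
        torch.cuda.synchronize()
    return (time.time() - start) / iters

print(f"batch of 1: {bench(1) * 1e3:.2f} ms")
print(f"batch of 2: {bench(2) * 1e3:.2f} ms")
```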

4 Experiments

4.1 Datasets

Natural images. To assess the effectiveness of DN, we utilized two publicly accessible datasets: Kyoto summer2autumn [18] and real2paint [31]. Kyoto summer2autumn contains UHR unpaired images of summer and autumn landscapes (5,184×3,456 pixels), useful for seasonal image conversion. The real2paint dataset contains UHR paintings by Vincent Van Gogh (4,000×3,000 to 10,000×8,000 pixels) and real images (4,032×3,024 pixels) from the UHDM dataset [31]. Although low-resolution versions of Van Gogh painting datasets are available [32], we collected 21 high-resolution images of publicly available Van Gogh paintings online to facilitate research on UHR unpaired I2I translation. This curated list will be made publicly available.

Pathological whole slide images (WSIs). To demonstrate the versatility of our DN module, we performed experiments on two additional pathological datasets for stain transformation. The ACROBAT dataset [33] consists of UHR H&E WSIs and corresponding estrogen receptor (ER), anti-progesterone receptor (PGR), human epidermal growth factor receptor 2 (HER2), and Ki67 WSIs. We randomly selected unpaired H&E and PGR WSIs from this dataset as transformation targets. The ANHIR dataset [34] contains WSIs from various organs with different stainings and sizes ranging from 5,000×5,000 to 50,000×50,000 pixels. For this dataset, we selected unpaired breast (H&E to ER) and lung lesion (H&E to Ki67) WSIs as our stain transformation targets.

4.2 Experimental settings

In our experiments with the aforementioned datasets, we cropped the UHR images into 512×512 patches and trained the CycleGAN, CUT, and L-LSeSim frameworks for 100 epochs with default hyperparameters. We then replaced the IN layers with our DN layers for the inference process and compared the results with the patch-wise IN, TIN, and KIN methods. The experiments were conducted on an NVIDIA RTX 3090 GPU; due to GPU memory limitations, we were unable to directly use the IN layer with uncropped UHR images as input. We present results obtained using the CUT model, with further findings available in the supplementary material.
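As an illustration of this plug-and-play usage, a minimal PyTorch sketch of swapping IN layers for DN-style layers in a trained generator is given below; DenseNorm2d and make_dn are hypothetical names, not the released API.

```python
import torch

def replace_instance_norm(module: torch.nn.Module, make_dn):
    """Recursively swap every InstanceNorm2d in a trained generator for a
    DN-style layer produced by the user-supplied factory make_dn(old_layer).

    Because the swap happens at inference time and the factory can reuse
    any learned affine parameters of the old layer, no retraining is needed.
    """
    for name, child in module.named_children():
        if isinstance(child, torch.nn.InstanceNorm2d):
            setattr(module, name, make_dn(child))
        else:
            replace_instance_norm(child, make_dn)

# Usage sketch:
#   generator = ...  # load a trained CUT/CycleGAN generator
#   replace_instance_norm(generator, make_dn=lambda old: DenseNorm2d(old))
```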

4.3 Metrics

We compared the results from our DN, patch-wise IN, TIN, and KIN methods using qualitative and quantitative evaluation techniques. The translated UHR images were assessed using the standard Fréchet inception distance (FID) [35] metric. Additionally, we conducted a downstream task to differentiate between patches from the source and target domains. To explicitly expose the adverse effects of tiling artifacts and over/under-colorization, we intentionally cropped new patches spanning the boundaries of the raw translated patches.

However, since these metrics are not designed to evaluate UHR images and may not accurately identify tiling artifacts or over/under-colorizing, we conducted a detailed human evaluation consisting of three quality challenges. Participants were shown the source and translated images generated by DN, patch-wise IN, TIN, and KIN and asked to identify the image with the best quality, the fewest tiling artifacts, and the best color and hue. We recruited computer vision and pathology specialists in addition to the general population for the evaluation. For the translated WSIs, we conducted a fidelity challenge, following the AMT perceptual studies protocol [36].

5 Results

Figure 6: Results of translated images. The figure compares the translation results on UHR images using four normalization methods: patch-wise IN [17], TIN [19], KIN [18], and DN with a CUT [4] framework. Red close-up boxes highlight the comparisons of tiling artifacts, while yellow close-up boxes focus on evaluating over/under-colorizing and local hue preservation. DN shows the best performance overall. For a better view, please zoom in.

5.1 Qualitative evaluation

Fig. 6, Fig. S4, and Fig. S5 in the supplementary material show UHR images translated from natural and pathological WSI datasets. These images reveal that patch-wise IN generates a significant amount of gap-type tiling artifacts, while KIN mitigates some of these artifacts but at the expense of details, hue, and color. TIN reduces gap-type tiling artifacts but results in over/under-colorizing, loss of local hue details, and creates jitter-type tiling artifacts. Conversely, our DN approach is the only method that successfully diminishes tiling artifacts while maintaining local hue and color details, producing the best results.

5.2 Quantitative evaluation

Table 1: Quantitative Results. The best-performing method in each experiment is highlighted in bold. DN surpasses both TIN [19] and KIN [18], as indicated by the underlined results. With respect to the FID metric, DN generally introduces the least disturbance to diminish tiling artifacts, in some cases even outperforming the intuitive lower bound. Across all experiments for the domain differentiation downstream task, measured by accuracy, DN consistently exceeds the performance of other methods, showcasing its superior efficacy in UHR image translation.
                      FID ↓                                        Domain differentiation (%) ↑
                      Lower bound*  TIN      KIN      DN           Patch-wise IN  TIN    KIN    DN
summer2autumn         98.281        117.268  98.003   97.732       0.967          0.828  0.950  0.975
real2paint            234.732       249.612  238.561  237.202      0.971          0.556  0.757  0.986
ACROBAT               21.046        43.988   27.224   21.346       0.983          0.858  0.977  0.985
ANHIR (breast)        64.202        161.128  91.443   68.616       0.969          0.932  0.932  0.975
ANHIR (lung lesion)   130.672       174.450  133.263  130.062      0.880          0.863  0.880  0.900
  • Lower bound*: This is empirically achieved by patch-wise IN, as it is the optimized target of the backbone model, thereby setting an intuitive lower boundary for the FID values.

Table 1 (left part) presents the standard FID evaluation results on various datasets. Since the backbone model is optimized using patch-wise IN, this method can be considered as having the optimal FID values, thereby setting an intuitive lower bound, while the other methods aim to minimally disturb the translation process to remove tiling artifacts. Overall, our hyperparameter-free DN ($\mathcal{P}_{4}$) outperforms TIN and KIN, indicating that it introduces the smallest adjustment necessary to achieve the goal while preserving local color and hue ($\mathcal{P}_{2}$). Remarkably, in some cases, it even surpasses the intuitive lower bound. On the other hand, KIN secures second place due to its balance between patch-wise IN and TIN. TIN inevitably yields the worst results because the use of global statistics introduces the largest disturbance.

Table 1 (right part) displays the results of the domain differentiation downstream task. DN consistently outperforms all other methods, likely indicating that it introduces the fewest tiling artifacts ($\mathcal{P}_{1}$) while preserving hue and color ($\mathcal{P}_{2}$). On the other hand, TIN yields the worst results, probably due to the large disturbance introduced by the use of global statistics.

To address the limitations of the available metrics, we employed human evaluation to assess image quality (see Table 2). Three image quality challenges were completed by forty participants, and our DN method achieved the best performance across all three challenges ($\mathcal{P}_{1}$ and $\mathcal{P}_{2}$), particularly for the Kyoto summer2autumn dataset. Additionally, we recruited eight computer vision specialists to evaluate the translation of natural images and eight pathology specialists to evaluate the results of stain transformation. Interestingly, the effectiveness of DN was even more pronounced among these specialists. Furthermore, the fidelity challenge (see Fig. S8 in the supplementary material) revealed that the images generated by DN were nearly indistinguishable from real pathological images.

Table 2: Human evaluation results conducted by the general public and experts. The best-performing method in each experiment is highlighted in bold. DN outperforms patch-wise IN [17], TIN [19], and KIN [18], as indicated by the underlined results. Overall, DN is the most preferred method across all aspects.
Has the best quality (%) ↑
                      By the general public            By experts
                      IN*     TIN     KIN     DN       IN*     TIN     KIN     DN
summer2autumn         10.00   15.79   12.11   62.11    0.00    5.71    11.43   82.86
real2paint            3.68    34.21   22.11   40.00    0.00    28.57   22.86   48.57
stain transformation  -       -       -       -        14.29   10.71   3.57    71.43

Has the fewest tiling artifacts (%) ↑
                      By the general public            By experts
                      IN*     TIN     KIN     DN       IN*     TIN     KIN     DN
summer2autumn         4.74    18.95   12.63   63.68    0.00    14.29   8.57    77.14
real2paint            3.68    34.21   22.11   40.00    0.00    22.86   22.86   54.29
stain transformation  -       -       -       -        7.14    17.86   10.71   64.29

Exhibits the best color and hue (%) ↑
                      By the general public            By experts
                      IN*     TIN     KIN     DN       IN*     TIN     KIN     DN
summer2autumn         6.32    15.26   12.63   65.79    0.00    2.86    11.43   85.71
real2paint            7.37    32.11   18.9    41.58    0.00    25.71   20.00   54.29
stain transformation  -       -       -       -        14.29   7.14    3.57    75.00
  • IN*: patch-wise IN; -: not applicable

5.3 Runtime and resource utilization

Table 3 presents the runtime and GPU VRAM usage of the various methods. Employing operations on statistical moments generally leads to longer runtime but yields superior results. Distinctively, DN achieves faster execution in a single pass ($\mathcal{P}_{3}$) compared to KIN, with only a modest increase in GPU VRAM usage. This efficiency is due to the parallel execution of the prefetching and inference branches, highlighting DN’s innovative approach to parallelism design.

Table 3: Comparison of runtime and GPU memory usage. Using an NVIDIA RTX 3090 GPU, we benchmarked the runtime and GPU VRAM usage for a 4,302×3,024 image. One-stage DN, despite involving substantial operations on statistical moments, runs faster than KIN.
                          IN*          TIN          KIN          DN (two-stage)  DN (single-pass)
Statistics type           patch-level  image-level  patch-level  pixel-level     pixel-level
# of pipeline stages      1            1            2            2               1
Operations on statistics  ✗            ✗            ✓            ✓               ✓
Runtime (s)               2.46         2.62         4.42         5.51            4.35
GPU VRAM usage (MB)       2951         3335         3145         3161            4157
  • IN*: patch-wise IN

5.4 Ablation study

Figure 7: Ablation study for interpolating granularity. The gradual introduction of tiling artifacts is observed with increasing interpolating granularity. DN takes the most comprehensive approach to normalization by interpolating every pixel.

Interpolating granularity. DN interpolates the statistical measures for every pixel, i.e., an interpolating granularity of 1 pixel. By incrementally increasing the interpolating granularity to 2, 4, and up to 512 pixels, gap-type tiling artifacts begin to emerge gradually, as illustrated in Fig. 7. These results affirm that DN effectively diminishes tiling artifacts through its pixel-level statistical moment estimation.

Runtime Optimization. DN employs a fast interpolation algorithm and prefetching parallelism strategy to enable exhaustive estimation of pixel-level statistical moments efficiently. Table 4 presents evidence of significant acceleration. Performance benchmarks conducted on a single NVIDIA RTX 3090 GPU revealed that these strategies could achieve a speedup of 44 times for the entire image and 53 times per patch inference, respectively.

Table 4: Speedup achieved using fast interpolation and prefetching parallelism. Benchmarking was conducted on an NVIDIA RTX 3090 GPU to evaluate runtime optimization strategies for a 4,302×3,024 image. Although prefetching parallelism requires processing a slightly higher number of patches, it significantly enhances performance, achieving final speedups of 44 and 53 times for the entire image and per patch, respectively.
Fast interpolation             Prefetching   # of     Runtime (s)     Runtime (s)  Speedup (times)  Speedup (times)
Reformulation  Precomputation  parallelism   patches  (entire image)  (per patch)  (entire image)   (per patch)
✗              ✗               ✗             35       192.50          5.50         1x               1x
✓              ✗               ✗             35       8.28            0.24         23x              23x
✓              ✓               ✗             35       5.51            0.16         35x              35x
✓              ✓               ✓             42       4.35            0.10         44x              53x

6 Conclusion

In this study, we have introduced DN for UHR unpaired I2I translation. DN estimates pixel-level statistical moments for normalization, thereby diminishing tiling artifacts and preserving local hue and color simultaneously. It can be seamlessly integrated into any unpaired I2I translation model equipped with IN layers, without necessitating model retraining or hyperparameter tuning. The proposed fast interpolation algorithm allows DN to efficiently estimate statistical moments for every pixel. Additionally, a prefetching parallelism strategy enables DN to operate in a single pass. Experimental results have demonstrated that DN outperforms all prior methods on datasets containing natural images and pathological WSIs. Furthermore, DN’s ability to successfully perform stain transformation highlights its practicality in the medical domain.

Limitations and discussion. Although our research has demonstrated the superiority of DN over previous methods for UHR unpaired I2I translation, DN still requires patch-wise processing. Consequently, it can struggle to maintain the continuity of translated objects across patches. In addition, although less pronounced than with TIN, jitter-type tiling artifacts occasionally emerge in the results, slightly compromising seamlessness. Addressing these limitations remains a goal for future work.

Furthermore, there is a lack of appropriate metrics and datasets for evaluating existing methods in UHR image translation. While we conducted human evaluations to mitigate this limitation, we recognize the importance of creating new metrics and releasing large datasets. To this end, we have released a curated list of the real2paint dataset to encourage further research into UHR unpaired I2I translation.

References

  • [1] Henri Hoyez, Cédric Schockaert, Jason Rambach, Bruno Mirbach, and Didier Stricker. Unsupervised image-to-image translation: A review. Sensors, 22(21):8540, 2022.
  • [2] Shizuo Kaji and Satoshi Kida. Overview of image-to-image translation by use of deep neural networks: denoising, super-resolution, modality conversion, and reconstruction in medical imaging. Radiological physics and technology, 12:235–248, 2019.
  • [3] Yingxue Pang, Jianxin Lin, Tao Qin, and Zhibo Chen. Image-to-image translation: Methods and applications. IEEE Transactions on Multimedia, 24:3859–3881, 2021.
  • [4] Taesung Park, Alexei A Efros, Richard Zhang, and Jun-Yan Zhu. Contrastive learning for unpaired image-to-image translation. In European Conference on Computer Vision, pages 319–345. Springer, 2020.
  • [5] Ryohei Funatsu, Steven Huang, Takayuki Yamashita, Kevin Stevulak, Jeff Rysinski, David Estrada, Shi Yan, Takuji Soeno, Tomohiro Nakamura, Tetsuya Hayashida, et al. 6.2 133Mpixel 60fps CMOS image sensor with 32-column shared high-speed column-parallel SAR ADCs. In 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, pages 1–3. IEEE, 2015.
  • [6] Masahiko Kitamura, Daisuke Shirai, Kunitake Kaneko, Takahiro Murooka, Tomoko Sawabe, Tatsuya Fujii, and Atsushi Takahara. Beyond 4K: 8K 60p live video streaming to multiple sites. Future Generation Computer Systems, 27(7):952–959, 2011.
  • [7] Yongcheng Jing, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, and Mingli Song. Neural style transfer: A review. IEEE transactions on visualization and computer graphics, 26(11):3365–3385, 2019.
  • [8] Ivana Žeger, Sonja Grgic, Josip Vuković, and Gordan Šišul. Grayscale image colorization methods: Overview and evaluation. IEEE Access, 9:113326–113346, 2021.
  • [9] Kevin de Haan, Yijie Zhang, Jonathan E Zuckerman, Tairan Liu, Anthony E Sisk, Miguel FP Diaz, Kuang-Yu Jen, Alexander Nobori, Sofia Liou, Sarah Zhang, et al. Deep learning-based transformation of H&E stained tissues into special stains. Nature Communications, 12(1):4884, 2021.
  • [10] Xilin Yang, Bijie Bai, Yijie Zhang, Yuzhu Li, Kevin de Haan, Tairan Liu, and Aydogan Ozcan. Virtual stain transfer in histology via cascaded deep neural networks. ACS Photonics, 9(9):3134–3143, 2022.
  • [11] Ranran Zhang, Yankun Cao, Yujun Li, Zhi Liu, Jianye Wang, Jiahuan He, Chenyang Zhang, Xiaoyu Sui, Pengfei Zhang, Lizhen Cui, et al. MVFStain: Multiple virtual functional stain histopathology images generation based on specific domain mapping. Medical Image Analysis, 80:102520, 2022.
  • [12] Eva-Maria Birkman, Naziha Mansuri, Samu Kurki, Annika Ålgars, Minnamaija Lintunen, Raija Ristamäki, Jari Sundström, and Olli Carpén. Gastric cancer: immunohistochemical classification of molecular subtypes and their association with clinicopathological characteristics. Virchows Archiv, 472:369–382, 2018.
  • [13] Kentaro Inamura. Update on immunohistochemistry for the diagnosis of lung cancer. Cancers, 10(3):72, 2018.
  • [14] Yuda Song, Hui Qian, and Xin Du. Multi-curve translator for high-resolution photorealistic image translation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XV, pages 126–143. Springer, 2022.
  • [15] Jie Liang, Hui Zeng, and Lei Zhang. High-resolution photorealistic image translation in real-time: A laplacian pyramid translation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9392–9400, 2021.
  • [16] Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision, pages 1501–1510, 2017.
  • [17] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
  • [18] Ming-Yang Ho, Min-Sheng Wu, and Che-Ming Wu. Ultra-high-resolution unpaired stain transformation via kernelized instance normalization. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXI, pages 490–505. Springer, 2022.
  • [19] Zhe Chen, Wenhai Wang, Enze Xie, Tong Lu, and Ping Luo. Towards ultra-resolution neural style transfer via thumbnail instance normalization. In Proceedings of the AAAI Conference on Artificial Intelligence, 2022.
  • [20] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [21] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, Jung Kwon Lee, and Jiwon Kim. Learning to discover cross-domain relations with generative adversarial networks. In International conference on machine learning, pages 1857–1865. PMLR, 2017.
  • [22] Zili Yi, Hao Zhang, Ping Tan, and Minglun Gong. DualGAN: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2849–2857, 2017.
  • [23] Sagie Benaim and Lior Wolf. One-sided unsupervised domain mapping. In NIPS, 2017.
  • [24] Yihao Zhao, Ruihai Wu, and Hao Dong. Unpaired image-to-image translation using adversarial consistency loss. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16, pages 800–815. Springer, 2020.
  • [25] Chuanxia Zheng, Tat-Jen Cham, and Jianfei Cai. The spatially-correlative loss for various image translation tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [26] Chanyong Jung, Gihyun Kwon, and Jong Chul Ye. Exploring patch-wise semantic relation for contrastive learning in image-to-image translation tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18260–18269, 2022.
  • [27] Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. EGSDE: Unpaired image-to-image translation via energy-guided stochastic differential equations. arXiv preprint arXiv:2207.06635, 2022.
  • [28] Amal Lahiani, Jacob Gildenblat, Irina Klaman, Shadi Albarqouni, Nassir Navab, and Eldad Klaiman. Virtualization of tissue staining in digital pathology using an unsupervised deep learning approach. In European Congress on Digital Pathology, pages 47–55. Springer, 2019.
  • [29] Thomas de Bel, Meyke Hermsen, Jesper Kers, Jeroen van der Laak, and Geert Litjens. Stain-transforming cycle-consistent generative adversarial networks for improved segmentation of renal histopathology. In International Conference on Medical Imaging with Deep Learning–Full Paper Track, 2018.
  • [30] Amal Lahiani, Irina Klaman, Nassir Navab, Shadi Albarqouni, and Eldad Klaiman. Seamless virtual whole slide image synthesis and validation using perceptual embedding consistency. IEEE Journal of Biomedical and Health Informatics, 25(2):403–411, 2020.
  • [31] Xin Yu, Peng Dai, Wenbo Li, Lan Ma, Jiajun Shen, Jia Li, and Xiaojuan Qi. Towards efficient and scale-robust ultra-high-definition image demoiréing. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVIII, pages 646–662. Springer, 2022.
  • [32] Guilherme Folego, Otavio Gomes, and Anderson Rocha. From impressionism to expressionism: Automatically identifying van gogh’s paintings. In 2016 IEEE international conference on image processing (ICIP), pages 141–145. IEEE, 2016.
  • [33] Philippe Weitz, Masi Valkonen, Leslie Solorzano, Johan Hartman, Pekka Ruusuvuori, and Mattias Rantalainen. ACROBAT-automatic registration of breast cancer tissue. In 10th International Workshop on Biomedical Image Registration, 2022.
  • [34] Jiří Borovec, Jan Kybic, Ignacio Arganda-Carreras, Dmitry V Sorokin, Gloria Bueno, Alexander V Khvostikov, Spyridon Bakas, Eric I-Chao Chang, Stefan Heldmann, et al. ANHIR: Automatic non-rigid histological image registration challenge. IEEE Transactions on Medical Imaging, 39(10):3042–3052, 2020.
  • [35] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
  • [36] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.