
No-Clean-Reference Image Super-Resolution: Application to Electron Microscopy

Mohammad Khateri, Morteza Ghahremani, Alejandra Sierra, and Jussi Tohka
This work was supported in part by the Research Council of Finland (#323385), the Jane and Aatos Erkko Foundation, and the Doctoral Programme in Molecular Medicine at the University of Eastern Finland. Mohammad Khateri, Morteza Ghahremani, Alejandra Sierra, and Jussi Tohka are with the A. I. Virtanen Institute for Molecular Sciences, Faculty of Health Sciences, University of Eastern Finland, Finland (e-mail: [email protected]; [email protected]; [email protected]; [email protected]). Morteza Ghahremani is also with Artificial Intelligence in Medical Imaging, Department of Radiology, Technical University of Munich, Germany (e-mail: [email protected]).
Abstract

The inability to acquire clean high-resolution (HR) electron microscopy (EM) images over a large brain tissue volume hampers many neuroscience studies. To address this challenge, we propose a deep-learning-based image super-resolution (SR) approach to computationally reconstruct clean HR 3D-EM with a large field of view (FoV) from noisy low-resolution (LR) acquisitions. Our contributions are: I) investigating training with no-clean references; II) introducing a novel network architecture, named EMSR, for enhancing the resolution of LR EM images while reducing inherent noise; the EMSR leverages distinctive features of brain EM images—repetitive textural and geometrical patterns amidst less informative backgrounds—via multi-scale edge-attention and self-attention mechanisms to emphasize edge features over the background; and III) comparing different training strategies, including acquired LR and HR image pairs (i.e., real pairs with no-clean references contaminated with real corruptions), pairs of synthetic LR and acquired HR images, and pairs of acquired LR and denoised HR images. Experiments with nine brain datasets showed that training with real pairs can produce high-quality super-resolved results, demonstrating the feasibility of training with no-clean references. Additionally, comparable results were observed, both visually and numerically, when employing denoised and noisy references for training. Moreover, utilizing a network trained with LR images synthetically generated from HR counterparts proved effective in yielding satisfactory SR results, in certain cases even outperforming training with real pairs. The proposed SR network was compared quantitatively and qualitatively with several established SR techniques, showcasing either the superiority or competitiveness of the proposed method in recovering fine details while mitigating noise.

Index Terms:
Electron microscopy, neuroscience, no-clean-reference, super-resolution, deep learning.

I Introduction

Three-dimensional electron microscopy (3D-EM) is an essential technique to investigate brain tissue ultrastructures as it allows for their 3D visualization at nanometer resolution [1, 2]. Studying brain tissue ultrastructures requires high-resolution (HR) images over a large field of view (FoV) of the brain tissue. However, since imaging at higher resolutions demands denser sampling, it takes more time, proportionally increasing imaging cost and potential sample damage. Moreover, HR imaging over a large FoV is not feasible under realistic imaging constraints, demanding a trade-off between imaging resolution and FoV: the higher the resolution, the smaller the FoV [3]. Furthermore, imperfect components of imaging systems introduce noise into the images [4]. These limitations collectively prevent acquiring clean HR EM images over a large FoV of brain tissue, impeding subsequent brain ultrastructure analysis and visualization.

A practical approach to mitigate such limitations and provide clean HR EM images over a large tissue volume includes the following steps: I) low-resolution (LR) imaging of brain samples over a large FoV of interest, II) HR imaging over a small but representative portion of the same samples covered by the LR FoV, and III) utilizing an image super-resolution (SR) technique to computationally reconstruct high-quality HR 3D-EM images from the LR 3D-EM images of brain tissue, which are typically contaminated with noise, artifacts, and distortions.

SR is a low-level vision task that can serve as an integral preprocessing step for many image analyses in neuroscience [5, 6, 7]. It aims to recover the latent clean HR image $x$ from a degraded LR observation $y$:

$y=\mathcal{D}_{\delta}(x),$ (1)

where $\mathcal{D}_{\delta}(\cdot)$ is the degradation function parameterized by $\delta$, which is non-invertible, making SR an ill-posed inverse problem. $\mathcal{D}_{\delta}(\cdot)$ includes the convolution operator $\circledast$ with the blur kernel $\kappa$, the $s$-fold under-sampling operator $\downarrow_{s}$, and noise $n$ ($\delta=\{\kappa,\downarrow_{s},n\}$) [8]. In practice, $\delta$ is unknown and we only have the LR observation.

SR methods can be categorized into two groups: model-based and learning-based methods. Model-based SR methods approximate the degradation function in (1) as a combination of several operations. Assuming that the blurring kernel and under-sampling operator are known and noise is additive:

$\mathcal{D}_{\delta}(x)=(x\circledast\kappa)\downarrow_{s}+n$ (2)

An estimate $x^{*}$ of an HR image can then be obtained by the Maximum A Posteriori (MAP) formulation as:

$x^{*}=\arg\min_{x}\{\|y-(x\circledast\kappa)\downarrow_{s}\|^{q}_{p}+\lambda\mathcal{R}(x)\}$ (3)

The first term is the likelihood, computed as the $\ell_{p}$-norm distance between the observation $y$ and the degraded latent image $x$, where $0<p,q\leq 2$ are determined by the noise distribution [9, 10, 11, 12]. $\mathcal{R}(\cdot)$ is the regularization term, also known as the prior term, penalizing the unknown latent image $x$ based on our prior knowledge of the data. The parameter $\lambda$ defines the trade-off between the likelihood and prior terms. To reduce the ill-posedness of SR problems, many regularization terms have been developed [13, 8], each with specific pros and cons. Notably, contributions from total variation [14], self-similarity [15], low-rankness [12], and sparse representation [16] have played a significant role in improving SR performance, among others. Hand-crafted priors enhance SR but have limited performance compared to data-driven methods [8]. Effective SR models involve optimizing multiple priors, which is time- and memory-consuming and requires tuning trade-off parameters. Additionally, SR models are specific to certain degradation settings, necessitating separate models for each degradation. Mismatched LR images with different degradations may result in severe artifacts due to domain gaps [17].
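To make the degradation model in (2) concrete, the following is a minimal sketch assuming a Gaussian blur kernel, naive $s$-fold sub-sampling, and additive Gaussian noise; the kernel size, noise level, and SciPy-based implementation are our illustrative choices, not the paper's.

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_kernel(size=5, std=1.0):
    """Isotropic Gaussian blur kernel kappa (illustrative choice)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * std**2))
    return k / k.sum()

def degrade(x, kappa, s=3, sigma_n=0.01, rng=np.random.default_rng(0)):
    """y = (x circledast kappa) downarrow_s + n, as in Eq. (2)."""
    blurred = convolve(x, kappa, mode="reflect")   # x circledast kappa
    downsampled = blurred[::s, ::s]                # s-fold under-sampling
    return downsampled + rng.normal(0.0, sigma_n, downsampled.shape)

# Example usage on a random stand-in for an HR image
y = degrade(np.random.rand(129, 129), gaussian_kernel(), s=3)
```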

Learning-based SR methods learn a mapping between LR and HR image spaces, which is then used to restore the HR image from the given LR input image. Early work, pioneered by [18], restored HR images by capturing the co-occurrence prior between LR and HR image patches. Numerous patch-based methods have been introduced relying on manifold learning [19], filter learning [20], regression [21], and sparse representation [22]. Deep neural network (DNN)-based SR methods have demonstrated remarkable performance [13]. DNNs with end-to-end training avoid the need for explicit design of priors or degradations. Instead, priors and degradations are encapsulated in the training datasets. The commonly used DNN architectures include convolutional neural networks (CNNs) [23, 24], generative adversarial networks (GANs) [25, 26, 27], vision transformers (ViTs) [28, 29], and denoising diffusion probabilistic models (DDPMs) [30, 31]. In this realm, many works in computer vision and biomedical imaging define a specific degradation function to synthesize LR images from HR counterparts to generate the training data [8]. Several studies were also conducted to incorporate the interpretability of model-based methods into end-to-end learning, e.g., deep unfolding [32, 33, 34], Plug-and-Play (PnP) [35, 36, 37, 38], and deep equilibrium learning [39, 40]. Although most of these degradation-oriented SR approaches lead to satisfactory results on benchmark datasets, they fail to restore high-quality images in real-world applications [17], such as brain EM imaging, which is the focus of this study.

Computational approaches to the super-resolution of EM have been studied in the health and material sciences [41, 42, 43, 44]. As a pioneer, [42] proposed a material-specific PnP approach to super-resolve LR EM. Their method was based on the MAP formulation, where the likelihood term was based on a linear degradation model and the prior term was a library-based non-local means (LB-NLM) denoiser designed on HR EM images acquired within a small FoV. The presence of HR edges and textures corresponding to the LR input image in the designed library yielded super-resolved results with fine details. To reduce the computational expense and improve generalization, the authors of [45] replaced the LB-NLM denoiser with an off-the-shelf Gaussian denoiser, leading to the version of PnP typically used in biomedical applications. However, both methods [42, 45] are essentially model-based, computationally cumbersome, and limited to specific degradation models. Experiments in both studies were conducted on EM datasets acquired from nano-materials with simple textural information, which sparsely recurred throughout the image. By leveraging the unique characteristics of such images, the authors of [46] devised a patch-based strategy on acquired pairs of LR and HR EM images for training the LB-NLM, resulting in better performance than the original LB-NLM method but inferior to DNN-based methods. The authors of [44] introduced a DNN-based SR method, named point scanning super-resolution (PSSR), for EM brain images. They proposed a degradation operator, i.e., a crappifier, to synthesize LR images from acquired HR counterparts, where the crappifier included additive Gaussian noise followed by a down-sampling operator. Using synthetic pairs of LR and HR EM images, they trained a UNet-based residual neural network. The performance of the method was then compared only with bilinear interpolation. Although synthesizing pairs of LR and HR EM images can reduce imaging costs, it can increase the domain gap between the input LR EM and the trained SR model.

DNN-based SR methods can implicitly learn EM degradations if trained with acquired and matched pairs of LR and HR images. However, many challenges impede designing DNN-based frameworks with such data. First, EM images inherently contain noise and artifacts derived from the microscope, the sample, and the experimental settings. Hence, there is no clean EM image to be used as the reference for training the network. Further, networks pre-trained on natural images cannot restore high-quality brain EM images due to the considerable difference in the physics behind photography and EM as well as the content dissimilarity between natural and brain EM images. Hence, deploying and designing SR methods for EM images demands specific considerations. In this work, we illustrate and address the mentioned challenges of SR for EM images. Our key contributions are as follows:

  • Investigating training using no-clean references for the $\ell_{2}$ and $\ell_{1}$ loss functions.

  • Introducing a DL-based image SR framework for EM, named EMSR, equipped with edge-attention and self-attention mechanisms for enhanced edge recovery. Sharing the network’s modules between the original noisy LR EM image and its noisier version makes it noise-robust.

  • Comparing various training strategies focusing on EM images, including training from pairs of physically acquired LR and HR, synthetically generated LR and HR, as well as LR and denoised HR EM images.

The remainder of this article is organized as follows: Section II describes the proposed image super-resolution method, Section III delves into experimental results, and finally, Section IV concludes the article.

II Proposed Method

Refer to caption
Figure 1: Schematic diagram of serial block-face scanning electron microscopy and imaging. a) The electron gun generates streams of electrons that are focused and raster scanned across the sample’s surface (solid yellow lines). The interaction of these focused electrons with the sample results in the ejection of electron streams (dashed yellow lines), which are collected by detectors to form a 2D image of the $k$-th slice (labeled as $\#k$). Note that the region of interest from the sample is imaged at LR with a large FoV (marked in orange), while at HR, the FoV is smaller (marked in red). After imaging a slice, the diamond knife is used to cut the sample to a specific thickness, determining the resolution in the $z$ direction and exposing the subsequent block-face for imaging. Imperfections in the imaging device components can introduce blurring and noise in the resultant images (solid green arrows). b) Stack of 2D imaged slices constitutes the 3D-EM dataset. c) LR 3D-EM corresponding to d) HR 3D-EM from the small FoV. The zoomed-in area from (c) and (d) demonstrates the superior quality of the HR image in terms of contrast and resolution, see asterisks.

Supervised training of a network requires numerous pairs of corrupted LR and corresponding clean reference images. However, brain EM images inevitably include different types of noise, artifacts, and distortions caused by the imaging system and experimental settings. Therefore, clean EM images that could serve as references are unavailable. Here, we investigate training a neural network for EM SR using physically acquired pairs of LR and HR EM images contaminated with real noise-like corruptions.

II-A Electron Microscopy Super-Resolution

In serial block-face scanning electron microscopy (SBEM), a focused high-energy electron beam scans the sample surface, resulting in the acquisition of a 2D image in the $xy$-plane. The diamond knife subsequently removes the top layer of the sample to a specific thickness in the $z$ direction, revealing the next block-face for imaging. The repetition of this process generates a series of 2D images that are stacked to form a 3D volume image, as illustrated in Fig. 1.

The observed block-face $y\in\mathbb{R}^{m\times m}$ is affected by the underlying microscope degradation $\mathcal{D}_{\delta^{\prime}}(\cdot):\mathbb{R}^{M\times M}\rightarrow\mathbb{R}^{m\times m}$ parameterized by $\delta^{\prime}$, $y=\mathcal{D}_{\delta^{\prime}}(x)$, where $x\in\mathbb{R}^{M\times M}$ denotes the latent image that we aim to restore and $M=\tau m$, wherein $\tau$ is the resolution ratio between HR and LR images, i.e., the super-sampling ratio. Theoretically, the SR process is to recover the unknown $x$ via $\mathcal{D}_{\delta^{\prime}}^{-1}(y)$, demanding the degradation inversion $\mathcal{D}_{\delta^{\prime}}^{-1}(\cdot):\mathbb{R}^{m\times m}\rightarrow\mathbb{R}^{M\times M}$. If such a mapping exists, we can obtain HR observations through LR imaging, practically accelerating imaging by a factor of $\tau^{2}$. The microscope degradation parameters $\delta^{\prime}$ can arise from various sources [4, 47, 48]. These sources include electronic device components such as wires and coils, which produce thermal and electromagnetic interference that is modeled as Gaussian noise. The detector’s electron-counting error introduces signal-dependent noise in EM images, which is modeled as Poisson noise. Line-by-line pixel scanning in SBEM can lead to correlated noise. Imperfect electromagnetic lenses and anodes cause blurred observations due to suboptimal focusing of the electron beam. The high-energy electron beam introduces electron charging and causes absorption-based heating. Cutting the sample with a diamond knife can introduce specific artifacts and distortions. Additionally, mechanical disturbances from the environment and the microscope can introduce mechanical noise, further exacerbating image degradation.

Hence, $\mathcal{D}_{\delta^{\prime}}(\cdot)$ cannot be well parameterized by simplified assumptions such as block-averaging neighboring pixels for the under-sampling operator [45]. Implicit modeling of the degradation function can instead be realized by training a neural network on acquired pairs of LR and HR EM images.

Refer to caption
Figure 2: Overview of the proposed image super-resolution network for training with pairs of corrupted images. The network includes the feature extractor, edge attention, and reconstruction modules, which are shared between the original noisy LR EM image $y$ and its noisier version $y^{\prime}$. The network is encouraged to generate two outputs, $f_{\theta}(y)$ and $f_{\theta}(y^{\prime})$, that are consistent with the noisy reference image $x$. The output from the original image, $f_{\theta}(y)$, serves as the reference for the noisier-noisy input, establishing a noise-robust framework in a self-supervised approach.

II-B Training without Clean Reference

Training with no-clean references has been studied in several image restoration tasks, including denoising, magnetic resonance image reconstruction, and text removal [49, 50, 51]. Here, our focus is on investigating such a training approach for commonly used restoration loss functions, i.e., $\ell_{2}$ and $\ell_{1}$, and discussing the corruption levels at which this training remains feasible for EM SR.

Supervised training of a network $f_{\theta}(\cdot)$ for SR requires numerous pairs of degraded LR images, $y$, and clean references, $x$. The network’s parameters $\theta$ are obtained by optimizing the following empirical loss function:

$\hat{\theta}=\arg\min_{\theta}\mathbb{E}_{(x,y)}[\mathcal{L}(f_{\theta}(y),x)]$ (4)

By applying the conditional expectation rule for dependent random variables $y$ and $x$, we can reformulate (4) as follows:

$\hat{\theta}=\arg\min_{\theta}\mathbb{E}_{y}\big[\underbrace{\mathbb{E}_{x|y}[\mathcal{L}(f_{\theta}(y),x)]}_{\text{reference-dependent}}\big]$ (5)

The equation above implies that the network parameters can be optimized separately with respect to $y$ and $x$ over the loss function $\mathcal{L}(\cdot,\cdot)$. Let $\hat{x}=x+n$, where $n$ is i.i.d. additive noise with mean $\mu$ and covariance $\sigma_{n}^{2}I$, where $I\in\mathbb{R}^{d\times d}$ is the identity matrix with $d=M^{2}$.

In the case where the loss function is $\ell_{2}$, we can derive an equality that links the solutions of the reference-dependent component in (5) for $x$ and $\hat{x}$ as follows (see Appendix I.A):

$\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{2}^{2}]=\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]-2\mu^{T}\mathbb{E}_{x|y}[f_{\theta}(y)-x]+d\sigma_{n}^{2}+\|\mu\|^{2}$ (6)

The equation above states that when $\mu$ is close to zero ($\mathbb{E}[n]\approx 0$), the second term on the right-hand side becomes negligible, i.e., $2\mu^{T}\mathbb{E}_{x|y}[f_{\theta}(y)-x]\to 0$. Additionally, the third term $d\sigma_{n}^{2}$, which depends only on the noise variance, and the fourth term $\|\mu\|^{2}$, the squared norm of the noise mean, are independent of $y$ and have no effect on the overall optimization problem. Therefore, if we substitute the clean image $x$ with a random variable $\hat{x}$ that satisfies $\mathbb{E}[x]\approx\mathbb{E}[\hat{x}]$, the network’s parameters will remain close to optimal. This enables us to replace the clean reference $x$ with its corrupted version $\hat{x}$, provided their expectations are sufficiently close, together with the practical assumption that noise should not significantly alter the overall variability and structure of the original image, i.e., $\sigma_{\hat{x}}^{2}\approx\sigma_{x}^{2}$.
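A toy Monte-Carlo check of (6), using our own synthetic numbers rather than EM data: for a fixed prediction, the expected $\ell_{2}$ loss against a reference corrupted by zero-mean noise equals the loss against the clean reference plus the constant $d\sigma_{n}^{2}$, leaving the minimizer unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma_n, trials = 64, 0.05, 100_000
x = rng.uniform(0.3, 0.7, size=d)          # "clean" reference (fixed)
f = rng.uniform(0.3, 0.7, size=d)          # a fixed network prediction f_theta(y)

loss_clean = np.sum((f - x) ** 2)
noisy_losses = [np.sum((f - (x + rng.normal(0.0, sigma_n, size=d))) ** 2)
                for _ in range(trials)]

print(np.mean(noisy_losses))               # approximately ...
print(loss_clean + d * sigma_n ** 2)       # ... loss_clean + d * sigma_n^2
```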

In the case of the $\ell_{1}$ loss, we can establish the relationship between the solutions of the reference-dependent part in (5) for both $x$ and $\hat{x}$ as below (see Appendix I.B):

$\Big|\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{1}]-\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{1}]\Big|\leq\frac{|-2\mu^{T}\mathbb{E}_{x|y}[f_{\theta}(y)-x]+d\sigma_{n}^{2}+\|\mu\|^{2}|}{g(y,x,\hat{x})},$ (7)

where $g(y,x,\hat{x})=\frac{\sqrt{\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{2}^{2}]}+\sqrt{\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]}}{\sqrt{d}}$. The inequality above suggests that the difference between the reference-dependent solutions for $\hat{x}$ and $x$ is bounded by a function of $\mu$ and $\sigma_{n}^{2}$. When $\mu$ is small, the dependence on $y$ is significantly reduced and the upper bound tightens, becoming primarily governed by $\sigma_{n}^{2}$. This implies that weak noise reduces the reliance on $y$ and will not significantly alter the overall optimization problem (5). In other words, the network’s parameters will remain near optimal even if we replace the clean image $x$ with its noisy version $\hat{x}$, as long as $\mathbb{E}[x]\approx\mathbb{E}[\hat{x}]$ and the overall structure of the clean image is not significantly altered by noise, i.e., $\sigma_{\hat{x}}^{2}\approx\sigma_{x}^{2}$.

These observations hold the promise that the network can be trained under real-world scenarios where the reference is contaminated with weak noise-like corruptions. Here, we aim to determine the roughly acceptable level of these corruptions in brain EM imaging (discussed in Section II), based on the noise statistics $\mu$ and $\sigma_{n}^{2}$ that govern the feasibility of training with no-clean references. Suppose we can decompose $\hat{x}$ into a clean component $x_{clean}$ and a noise-like corruption component $n$, $\hat{x}=x_{clean}+n$. We can then establish the following relationships:

$\mathbb{E}[\hat{x}]=\mathbb{E}[x_{clean}]+\mathbb{E}[n],$ (8a)
$\sigma_{\hat{x}}^{2}=\sigma_{x_{clean}}^{2}+\sigma_{n}^{2}$ (8b)

If the inequalities (9a) and (9b) hold, then $\mathbb{E}[x_{clean}]\approx\mathbb{E}[\hat{x}]$ and $\sigma_{x_{clean}}^{2}\approx\sigma_{\hat{x}}^{2}$, which are the requirements for training with pairs of corrupted images under the $\ell_{1}$ and $\ell_{2}$ loss functions; these inequalities guarantee that the content of the underlying image is much stronger than the corruptions:

$\mathbb{E}[x_{clean}]\gg\mathbb{E}[n],$ (9a)
$\sigma_{x_{clean}}^{2}\gg\sigma_{n}^{2}$ (9b)

The level of corruption in EM is mostly much lower than the image content, satisfying (9) and allowing the network $f_{\theta}(\cdot)$ to be trained from pairs of corrupted images. It is worth mentioning that rare image slices may exhibit corruption levels inconsistent with the constraints stated in (9); these corruptions act as anomalies that the network is unable to learn.
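As an illustration of how conditions (9a)-(9b) can be checked in practice (mirroring the check reported in Fig. 6), the sketch below uses a denoised slice as a stand-in for the clean image; the ratio threshold of 10 is our arbitrary choice, not a value from the paper.

```python
import numpy as np

def check_noclean_conditions(noisy_slice, denoised_slice, ratio=10.0):
    """Return whether image content dominates the corruption:
    E[x_clean] >> E[n] (9a) and var(x_clean) >> var(n) (9b)."""
    n = noisy_slice.astype(np.float64) - denoised_slice.astype(np.float64)
    mean_ratio = abs(denoised_slice.mean()) / (abs(n.mean()) + 1e-12)
    var_ratio = denoised_slice.var() / (n.var() + 1e-12)
    return (mean_ratio > ratio) and (var_ratio > ratio), mean_ratio, var_ratio
```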

Refer to caption
Figure 3: Modules embedded in the proposed network: Basic Block, Residual Block, À-Trous Wavelet, Attention Block, and Vision Transformer Block.

II-C Network Architecture

The proposed SR network, which is designed for training using pairs of corrupted LR and HR EM images, is depicted in figure 2. It consists of three key modules: feature extractor, edge attention, and reconstruction. These modules are shared between the given LR image and its noisier version.

II-C1 Feature Extractor

The feature extractor ($\mathcal{H}_{FE}$) is employed to extract shallow features ($X_{SF}$) from the given LR image $y\in\mathbb{R}^{W\times H\times C}$. It consists of a projection implemented as a $3\times 3$ convolutional filter. The extraction process is formulated as:

$X_{SF}=\mathcal{H}_{FE}(y)$ (10)

II-C2 Edge Attention

The edge attention module ($\mathcal{H}_{EA}$) takes $X_{SF}$ and $y$ as input, extracts deep features, and combines them with edge information using multi-scale edge-attention and self-attention mechanisms, yielding edge-attended features ($X_{EA}$). The computation of the edge-attention module is summarized as:

$X_{EA}=\mathcal{H}_{EA}(y,X_{SF})$ (11)

The module consists of $k$ basic blocks, as shown in figure 3. In each basic block, the input features pass through $m$ residual blocks, with their well-studied benefits [52], and then take two parallel paths. In the upper path, the features are fed into residual blocks to produce deep features that are then enhanced using edge information. In the lower path, the features go through convolutional operations and ReLU activation to reconstruct the image in the LR space. The reconstructed image, along with the original LR image, is then fed to the à-trous wavelet (ATW) [53], a noise-robust feature extractor, to extract multi-scale edges. The resulting multi-scale edge features are then subjected to concatenation and filtering before being input into the attention block. The attention block generates multi-scale attention maps specifically focused on the deep feature edges. Finally, the attention maps and deep features are combined through element-wise multiplication. The resulting attended features are then added to the features from the upper path, leading to the generation of multi-scale edge-attention features. Subsequently, these features are passed into a ViT block, which employs a window-based multi-head self-attention mechanism to capture both local and global image dependencies within the deep multi-scale edge-attention features, and finally pass through convolution layers.
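For reference, a minimal NumPy sketch of à-trous wavelet (ATW) multi-scale edge extraction; the B3-spline filter and three scales follow common ATW practice and are our assumptions rather than the paper's exact filter $h$.

```python
import numpy as np
from scipy.ndimage import convolve

def atw_edges(img, num_scales=3):
    """Detail (edge) images w_j = c_{j-1} - c_j from successive dilated smoothings."""
    base = np.array([1, 4, 6, 4, 1], dtype=np.float64) / 16.0   # B3-spline taps
    c_prev, edges = img.astype(np.float64), []
    for j in range(num_scales):
        k1d = np.zeros((len(base) - 1) * (2 ** j) + 1)
        k1d[:: 2 ** j] = base                                    # insert "holes" (a trous)
        c_next = convolve(convolve(c_prev, k1d[None, :], mode="reflect"),
                          k1d[:, None], mode="reflect")          # separable smoothing
        edges.append(c_prev - c_next)                            # band-pass edge component
        c_prev = c_next
    return edges
```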

Vision Transformer (ViT): ViTs divide a feature map into a sequence of small patches, forming local windows, and utilize self-attention mechanisms to understand the relationships among them. This capacity to comprehend diverse image dependencies is crucial for representation learning performance in low-level vision tasks such as SR. To capture both global and local image dependencies while maintaining computational efficiency, we adopt the window-based multi-head self-attention (W-MSA) method [29]. The attention maps generated by W-MSA are then processed through the feed-forward network (FFN). These W-MSA and FFN components are integrated into a ViT block, as illustrated in Figure 3, and their computations are outlined below:

$X^{\prime}=\text{W-MSA}(\text{LN}(X))+X,\quad X^{\prime\prime}=\text{FFN}(\text{LN}(X^{\prime}))+X^{\prime},$ (12)

where LN is layer normalization and $X$ is the input feature map.

In W-MSA, the input feature map of size $C\times H\times W$ is first divided into $N=HW/M^{2}$ non-overlapping local windows of size $M\times M$, resulting in local feature maps $X\in\mathbb{R}^{M^{2}\times C}$. Each of these local feature maps then undergoes the standard self-attention mechanism, with the following calculation:

$Q=XP_{Q},\quad K=XP_{K},\quad V=XP_{V},$ (13)

where $P_{Q},P_{K},P_{V}\in\mathbb{R}^{C\times d_{k}}$ represent the query ($Q$), key ($K$), and value ($V$) projection matrices, respectively; $d_{k}$ is determined as $C/k$, where $k$ denotes the number of attention heads. The attention matrix is computed using the self-attention mechanism within the $k$-th head of each local window:

$\text{Attention}(Q,K,V)=\text{SoftMax}(QK^{T}/\sqrt{d_{k}})V,$ (14)

The concatenation of all attention heads results in the multi-head self-attention (W-MSA) output.

FFN is a multi-layer perceptron (MLP) used to introduce additional non-linearity to the model through two fully connected layers and ReLU activation.
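The following PyTorch sketch shows how W-MSA of Eqs. (13)-(14) can be implemented: the feature map is partitioned into non-overlapping $M\times M$ windows and standard multi-head self-attention is applied inside each window. This is an assumption-level illustration (window and head settings mirror the values reported in Section III-B1), not the authors' code.

```python
import torch
import torch.nn as nn

class WMSA(nn.Module):
    def __init__(self, channels=64, window=4, heads=16):
        super().__init__()
        assert channels % heads == 0
        self.M, self.heads, self.dk = window, heads, channels // heads
        self.qkv = nn.Linear(channels, 3 * channels)   # P_Q, P_K, P_V stacked
        self.proj = nn.Linear(channels, channels)

    def forward(self, x):                               # x: (B, C, H, W), H and W divisible by M
        B, C, H, W = x.shape
        M = self.M
        # partition into non-overlapping M x M windows -> (B * nWindows, M*M, C)
        xw = x.reshape(B, C, H // M, M, W // M, M).permute(0, 2, 4, 3, 5, 1)
        xw = xw.reshape(-1, M * M, C)
        q, k, v = self.qkv(xw).chunk(3, dim=-1)
        split = lambda t: t.reshape(-1, M * M, self.heads, self.dk).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)          # (B*nWin, heads, M*M, dk)
        attn = (q @ k.transpose(-2, -1)) / self.dk ** 0.5
        out = attn.softmax(dim=-1) @ v                  # Eq. (14) per head and window
        out = self.proj(out.transpose(1, 2).reshape(-1, M * M, C))
        # undo the window partition -> (B, C, H, W)
        out = out.reshape(B, H // M, W // M, M, M, C).permute(0, 5, 1, 3, 2, 4)
        return out.reshape(B, C, H, W)
```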

II-C3 Reconstruction

Shallow features predominantly consist of low frequencies, capturing the overall structure, while the deep features encompass high frequencies corresponding to lost fine details. The long skip connection provides the reconstruction module with low frequencies and makes the training more stable. Further, it helps the edge-attention module focus on learning fine details. The element-wise summation of shallow and deep features in the LR space is fed to the reconstruction module ($\mathcal{H}_{R}$) to generate the super-resolved image $x$ with enhanced resolution:

$x=\mathcal{H}_{R}(X_{EA}+X_{SF})$ (15)

The reconstruction module includes an up-sampling process that enlarges the features by pixel shuffling [54]. This up-sampling step is followed by a mapper module including convolution layers, which yields the super-resolved image.
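A minimal sketch of such a reconstruction head, assuming PyTorch's PixelShuffle and a two-convolution mapper; the layer counts and kernel sizes are our illustrative guesses.

```python
import torch.nn as nn

class Reconstruction(nn.Module):
    def __init__(self, channels=64, tau=3, out_channels=1):
        super().__init__()
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, channels * tau * tau, 3, padding=1),
            nn.PixelShuffle(tau),          # (C*tau^2, h, w) -> (C, tau*h, tau*w)
        )
        self.mapper = nn.Sequential(       # mapper convolutions producing the SR image
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, out_channels, 3, padding=1),
        )

    def forward(self, feats):
        return self.mapper(self.upsample(feats))
```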

II-C4 Weight Sharing

The aforementioned modules are shared between the given LR EM image and its noisier version, as illustrated in figure 2. The weight sharing encourages the network to produce consistent outputs for both the given LR image and its noisier version, establishing a noise-robust framework for training. This strategy mitigates the absence of a clean reference: The prediction generated from the given LR EM image serves as a reference for the noisier LR EM branch in a self-supervised approach.

II-D Loss Functions

We employ the $\ell_{p}$-norm loss, $p\in\{1,2\}$, as a pixel-wise distance measure between the network’s prediction $\hat{z}$ and the ground truth $z$: $\mathcal{L}_{\ell_{p}}(z,\hat{z})=\|z-\hat{z}\|_{p}^{p}$ [13, 55]. Our loss function measures the mismatch between the two network outputs and the reference, namely $\mathcal{L}_{\ell_{p}}(f_{\theta}(y),x)$ and $\mathcal{L}_{\ell_{p}}(f_{\theta}(y^{\prime}),x)$, as well as the mismatch between the two outputs, $\mathcal{L}_{\ell_{p}}(f_{\theta}(y),f_{\theta}(y^{\prime}))$; see Figure 2. The total loss is then defined as:

$\mathcal{L}_{T}=\lambda_{1}\mathcal{L}_{\ell_{p}}(f_{\theta}(y),x)+\lambda_{2}\mathcal{L}_{\ell_{p}}(f_{\theta}(y^{\prime}),x)+\lambda_{3}\mathcal{L}_{\ell_{p}}(f_{\theta}(y),f_{\theta}(y^{\prime}))$ (16)

where $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are hyperparameters that govern the trade-off between the loss components.
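A sketch of how the total loss (16) could be assembled in PyTorch; detaching the first output when it serves as the reference for the noisier branch is our assumption about the self-supervised consistency term.

```python
import torch.nn.functional as F

def total_loss(model, y, y_noisier, x_ref, lambdas=(1.0, 1.0, 1.0), p=1):
    dist = F.l1_loss if p == 1 else F.mse_loss
    out = model(y)                                    # f_theta(y)
    out_noisier = model(y_noisier)                    # f_theta(y')
    term1 = dist(out, x_ref)                          # L_p(f_theta(y), x)
    term2 = dist(out_noisier, x_ref)                  # L_p(f_theta(y'), x)
    term3 = dist(out_noisier, out.detach())           # consistency between the two outputs
    return lambdas[0] * term1 + lambdas[1] * term2 + lambdas[2] * term3
```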

Refer to caption
Figure 4: Multi-scale edges extracted from the EM dataset using ATW, where $h$ represents the filter’s kernel. The figure illustrates three edge components obtained at different scales, demonstrating the sparsity of edges and underscoring the importance of paying attention to edge details.

III Experimental Settings and Results

III-A Datasets

We conducted experiments using nine LR and HR 3D-EM datasets acquired from the corpus callosum and cingulum regions of the white matter of five rat brains [56]. These datasets were acquired both ipsi- and contra-laterally: in four animals, both ipsi- and contra-lateral datasets were available, while in one animal, only ipsi-lateral data was available. Both LR and HR datasets were acquired simultaneously using the SBEM technique. The LR datasets were obtained from large tissue volumes of $200\times 100\times 65\,\mu m^{3}$, with a voxel size of $50\times 50\times 50\,nm^{3}$, while the HR datasets were acquired from smaller tissue volumes of $15\times 15\times 15\,\mu m^{3}$, covered by the LR FoV, with a voxel size of $15\pm 2.5\times 15\pm 2.5\times 50\,nm^{3}$. The LR and HR 3D-EM datasets totaled approximately two hundred gigabytes in size. The pairs of LR and HR images from the small FoV were utilized in the experiments. In terms of dimensions, the LR and HR 3D-EM pairs had sizes within $330\pm 40\times 330\pm 40\times 550\pm 150$ and $1024\times 1024\times 550\pm 150$ voxels, respectively. Animal procedures were approved by the Committee of the Provincial Government of Southern Finland, following European Community Council Directive 86/609/EEC.

III-B Settings

III-B1 Training

Datasets were augmented by adding random zero-mean white Gaussian noise with a standard deviation of $\sigma\in[0,5]$, applying random rotations of $\theta\in\{90^{\circ},180^{\circ},270^{\circ}\}$, and horizontal/vertical flipping of the input data. The noisier version of the input image was generated by adding random zero-mean white Gaussian noise with a standard deviation of $\sigma\in[0,5]$. The network was optimized using Adam [57] for 200,000 steps. The initial learning rate was set to $10^{-4}$ and halved every 50,000 steps. The network was implemented using the PyTorch framework. Hyperparameters were set as follows: $\lambda_{1}=1$, $\lambda_{2}=1$, and $\lambda_{3}=1$. In the attention block, three scales of edges extracted by ATW were used. The edge attention module was configured with three basic blocks ($k=3$). Each basic block had four residual blocks ($m=4$), followed by two parallel sets of residual blocks ($n=1$). The ViT block was equipped with sixteen attention heads ($k=16$), a patch size of four ($M=4$), and a multi-layer-perceptron ratio of two. The network maintained a constant channel number of $C=64$ and utilized a batch size of two during training.
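A sketch of the augmentation and noisier-input generation described above (our NumPy illustration; the 0-255 intensity scale for the $\sigma\in[0,5]$ noise is an assumption).

```python
import numpy as np

def augment(lr, hr, rng=np.random.default_rng()):
    k = rng.integers(0, 4)                                 # rotation by 0/90/180/270 degrees
    lr, hr = np.rot90(lr, k), np.rot90(hr, k)
    if rng.random() < 0.5:
        lr, hr = np.fliplr(lr), np.fliplr(hr)              # horizontal flip
    if rng.random() < 0.5:
        lr, hr = np.flipud(lr), np.flipud(hr)              # vertical flip
    lr = lr + rng.normal(0, rng.uniform(0, 5), lr.shape)   # additive Gaussian noise, sigma in [0, 5]
    return lr, hr

def make_noisier(lr, rng=np.random.default_rng()):
    """Generate the noisier input y' from y by adding further Gaussian noise."""
    return lr + rng.normal(0, rng.uniform(0, 5), lr.shape)
```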

III-B2 Comparison

In our comparative analysis, we assessed the performance of our method with the $\mathcal{L}_{\ell_{1}}/\mathcal{L}_{\ell_{2}}$ loss functions alongside several SR techniques, including standard bicubic interpolation, DPIR [36], PSSR [44], and SwinIR [29], setting hyper-parameters as in the respective papers. As a preprocessing step, we first utilized bicubic interpolation to resize both the LR and HR images to obtain the closest integer resolution ratio between them. Specifically, we resized the LR and HR images to dimensions of $341\times 341\times K$ and $1023\times 1023\times K$, leading to a resolution ratio $\tau=3$, where $K$ is the number of slices. We conducted comparative experiments using pairs of LR and HR EM images. Additionally, we investigated three training strategies for the proposed method: training using I) real LR and HR image pairs, II) synthetic LR and HR image pairs, and III) LR and denoised HR image pairs. For synthetic training, LR images are generated using two scenarios. First, by bicubically down-sampling HR images (a common practice in computer vision); we refer to this as synthetic (I). Second, by applying a random isotropic Gaussian blur kernel ($\kappa\in[0,3]$) and random zero-mean Gaussian noise ($\sigma\in[20,40]$) during training, followed by bicubic down-sampling and the addition of random zero-mean Gaussian noise ($\sigma\in[5,15]$); we refer to this as synthetic (II).
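A sketch of the synthetic (II) pipeline as we read it: random isotropic Gaussian blur and noise applied to the HR image, bicubic down-sampling by $\tau$, then additional noise on the LR image. The 0-255 intensity scale and the SciPy/Pillow calls are our assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from PIL import Image

def synthesize_lr(hr, tau=3, rng=np.random.default_rng(0)):
    """hr: 2D float array on a 0-255 scale; returns a synthetic (II) LR image."""
    blur_std = rng.uniform(0.0, 3.0)                                    # kappa in [0, 3]
    degraded = gaussian_filter(hr, blur_std)
    degraded = degraded + rng.normal(0, rng.uniform(20, 40), hr.shape)  # HR-space noise, sigma in [20, 40]
    h, w = hr.shape
    lr = Image.fromarray(degraded.astype(np.float32)).resize((w // tau, h // tau), Image.BICUBIC)
    lr = np.array(lr)
    return lr + rng.normal(0, rng.uniform(5, 15), lr.shape)             # LR-space noise, sigma in [5, 15]
```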

III-C Quality Evaluation Metrics

To quantitatively assess the effectiveness of the proposed method and compare it with others, we considered three image quality metrics: the structural similarity index (SSIM) [58] and peak signal-to-noise ratio (PSNR) as standard metrics, as well as the Fourier ring correlation (FRC) [59], which has been utilized for evaluating EM SR [45].

III-C1 SSIM

The SSIM quantifies the similarity between the restored image $\hat{x}$ and the reference image $x$ in terms of luminance, contrast, and structure. It is calculated by:

$\mathrm{SSIM}(x,\hat{x})=\frac{(2\mu_{x}\mu_{\hat{x}}+c_{1})(2\sigma_{x\hat{x}}+c_{2})}{(\mu_{x}^{2}+\mu_{\hat{x}}^{2}+c_{1})(\sigma_{x}^{2}+\sigma_{\hat{x}}^{2}+c_{2})},$ (17)

where $\mu_{x}$ and $\mu_{\hat{x}}$ are the average pixel intensities of $x$ and $\hat{x}$ (luminance), $\sigma_{x}$ and $\sigma_{\hat{x}}$ are the standard deviations of the pixel intensities of $x$ and $\hat{x}$ (contrast), and $\sigma_{x\hat{x}}$ represents the covariance between $x$ and $\hat{x}$ (structural similarity). $c_{1}$ and $c_{2}$ are small positive constants for division stability, typically set as $0.01$ and $0.03$ relative to the maximum pixel value $L$.

III-C2 PSNR

The PSNR measures the ratio of the maximum pixel value to the mean square error (MSE) between the reconstructed image $\hat{x}$ and the ground truth $x$ as below:

$\mathrm{PSNR}(x,\hat{x})=10\log_{10}\left(\frac{L^{2}}{\mathrm{MSE}(x,\hat{x})}\right)$ (18)

III-C3 FRC

The FRC measures the correlation between the reconstructed image $\hat{x}$ and the reference $x$ in the frequency domain when the spectrum $\mathcal{R}$ is subdivided into $N$ concentric rings $r_{i}$, i.e., $\mathcal{R}=\{r_{i}\}_{i=1}^{N}$. FRC is calculated using the following formula:

$\mathrm{FRC}(\mathcal{R})=\frac{\sum_{r_{i}\in\mathcal{R}}\mathcal{F}_{x}(r_{i})\overline{\mathcal{F}_{\hat{x}}}(r_{i})}{\sqrt{\left(\sum_{r_{i}\in\mathcal{R}}|\mathcal{F}_{x}(r_{i})|^{2}\right)\left(\sum_{r_{i}\in\mathcal{R}}|\mathcal{F}_{\hat{x}}(r_{i})|^{2}\right)}},$ (19)

where $\mathcal{F}_{x}(r_{i})$ and $\mathcal{F}_{\hat{x}}(r_{i})$ are the Fourier transforms of $x$ and $\hat{x}$ over ring $r_{i}$, and $\mathrm{FRC}(\mathcal{R})$ provides the spectral correlation as a function of spatial frequency. The average correlation across the spectrum is denoted by $\overline{\mathrm{FRC}}$.
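An illustrative NumPy sketch of the FRC in (19), computing the per-ring correlation whose average gives $\overline{\mathrm{FRC}}$; the one-frequency-bin ring width is our choice.

```python
import numpy as np

def frc_curve(x, x_hat):
    Fx = np.fft.fftshift(np.fft.fft2(x))
    Fh = np.fft.fftshift(np.fft.fft2(x_hat))
    h, w = x.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2).astype(int)   # ring index of each frequency sample
    curve = np.zeros(r.max() + 1)
    for i in range(r.max() + 1):
        ring = r == i
        num = np.sum(Fx[ring] * np.conj(Fh[ring]))
        den = np.sqrt(np.sum(np.abs(Fx[ring]) ** 2) * np.sum(np.abs(Fh[ring]) ** 2))
        curve[i] = np.real(num) / (den + 1e-12)
    return curve                                          # np.mean(curve) gives FRC-bar
```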

In the numerical evaluations, the denoised HR 3D-EM images, obtained through the method proposed in [60], were utilized as the ground truth references.

III-D Results

Refer to caption
Figure 5: Visual comparisons of super-resolution methods for BRAIN5[IPSI]; magnified regions are shown to facilitate comparison.
Refer to caption
Figure 6: Conditions for network training using pairs of corrupted images. The histograms of one slice from dataset BRAIN2[CONTRA], its denoised version, and the noise are illustrated. The mean and variance presented in each plot are examined to investigate the conditions in (9). I) $\mathbb{E}[x_{n}]=5.06\times 10^{-1}$, $\mathbb{E}[x_{clean}]=5.05\times 10^{-1}$, and $\mathbb{E}[n]=9.51\times 10^{-4}$, satisfying (9a), which requires $\mathbb{E}[x_{clean}]\gg\mathbb{E}[n]$. II) $\sigma_{x_{n}}^{2}=4.14\times 10^{-2}$, $\sigma_{x_{clean}}^{2}=3.74\times 10^{-2}$, and $\sigma_{n}^{2}=2.69\times 10^{-3}$, meeting the condition in (9b) that $\sigma_{x_{clean}}^{2}\gg\sigma_{n}^{2}$. It should be noted that here we considered the denoised reference as the clean reference.
Refer to caption
Figure 7: Visual comparisons of super-resolution methods for BRAIN2[CONTRA]; magnified regions are shown to facilitate comparison.
Refer to caption
Figure 8: Illustrative visual comparisons of EMSR with different data training strategies. Panel (a) displays acquired HR, while (b) and (c) respectively showcase Synthetic (I) and Synthetic (II). Training with real pairs is depicted in both (d) for real noise-like corrupted reference and (e) for denoised reference.

III-D1 Method Comparison

The comparative results were obtained through a five-fold cross-validation process, where data from one animal served as the test set and data from the other animals were used as training sets. The quantitative results are summarized in Table I. The reported average values, based on SSIM, emphasize the inferior performance of the bicubic method compared to the deep-learning-based methods DPIR, PSSR, SwinIR, and EMSR. Among these methods, SwinIR and EMSR [ours] obtained superior quality metrics. Our approach, employing the $\ell_{1}$ and $\ell_{2}$ loss functions, showcased the highest and second-highest scores, respectively. Similarly, the reported FRC values demonstrate the superior performance of EMSR in terms of spectral correlation between restored and ground truth images compared to the competitors, achieving the best and second-best scores when trained with the $\ell_{1}$ and $\ell_{2}$ loss functions, respectively. However, in terms of PSNR, DPIR achieved the highest score, and the PSSR method achieved the second-best PSNR. It is essential to emphasize that the effectiveness of PSNR as an evaluation metric for SR model performance is limited. This limitation arises from its sole reliance on pixel values and its inability to capture a direct structural correlation between super-resolved and ground truth images.

To conserve space, we present a curated selection of representative results in Figures 5 and 7. These figures provide visual insights into scenarios where our proposed method excelled as the best and also where it did not attain the highest quantitative performance.

Figure 5 showcases results from BRAIN5[IPSI]. In this sub-dataset, our proposed method demonstrated outstanding performance, achieving the best and second-best quantitative results, based on SSIM and FRC, when utilizing the $\ell_{1}$ and $\ell_{2}$ loss functions, respectively. In panel (a), we observe the bicubically interpolated LR image, which exhibits a lack of visual clarity and retains noise. Conversely, DL-based SR methods effectively reduce noise, as evident in Fig. 5 (d)-(h). Among these methods, DPIR, i.e., a PnP method, produces overly smooth results, particularly when restoring fine details, as shown within the regions enclosed by the ellipsoid and dashed circle. This outcome can be attributed to mismatches between the priors in the trained model and the test images. In contrast, PSSR, SwinIR, and EMSR, which were trained using EM images, exhibit the capability to restore intricate details and nuances characteristic of EM brain images. Among them, PSSR sometimes failed to restore particular intricate edges, as represented by the area confined by the ellipsoid. It also led to smeared-out edges, as indicated within the dashed circle. Similarly, SwinIR faced challenges in recovering certain edges, akin to PSSR, as shown within the region confined by the ellipsoid. It also introduced blurred output and fuzzy edges within the area marked by the dashed circle. On the other hand, EMSR with both the $\ell_{1}$ and $\ell_{2}$ loss functions successfully super-resolved LR images by restoring intricate edges with higher contrast while avoiding blurriness. When comparing the results between $\ell_{1}$ and $\ell_{2}$, $\ell_{1}$ exhibited slightly better noise suppression (see the zoomed-in rectangle marked in green). These results align with the theory that the $\ell_{1}$ loss, in contrast to $\ell_{2}$, does not over-penalize large errors, resulting in fewer noise artifacts. The condition check for training with a no-clean reference is depicted in Figure 6.

Figure 7 presents results from BRAIN2[CONTRA]. In this subset, the SwinIR method exhibited superior performance in SSIM and FRC, collectively indicating enhanced structural capabilities. SwinIR did not achieve the highest PSNR, yet it maintained a satisfactory level of pixel fidelity. Panels (c) and (d) show that bicubic and DPIR generally produced overly smooth details, as denoted by the yellow arrow. Panel (e) reveals that PSSR excelled in enhancing details and contrast but faced challenges in recovering fine edges, as indicated by the yellow arrow. SwinIR and EMSR, panels (f)-(h), showcased superior resolution enhancement and noise reduction. In particular, SwinIR delivered slightly sharper SR results, highlighted by the yellow arrow. However, the proposed method, likely owing to its edge-attention mechanism, demonstrated a superior ability to super-resolve two closely situated compartments compared to SwinIR, which struggled to effectively separate them, as pointed out by the green arrows.

III-D2 Data Training Strategies

The outcomes of training with different strategies—real pairs featuring corrupted references, real pairs with a denoised reference, and synthetic LR and HR pairs (both synthetic (I) and (II))—are detailed in Table II. The reported average quantitative results across all datasets revealed that training with an acquired HR image as the reference and training with its denoised version resulted in nearly identical SSIM and FRC values, with the denoised reference exhibiting an inferior PSNR. Additionally, it was noted that training with synthetic (I) did not attain favorable super-resolution results. In contrast, synthetic (II) exhibited varying performance with promising outcomes; its average performance was slightly lower than that achieved with real pairs.

Representative results are depicted in Figure 8, spanning from inferior to superior performance. The results for BRAIN1[IPSI] indicate that the networks trained with both synthetic (I) and (II) failed to produce satisfactory super-resolution results, as evident from the different artifacts. The underlying reason for these shortcomings lies in the inability of both bicubic down-sampling and a pool of random Gaussian noise and blurring kernels to effectively match the degradation in the input LR image. Furthermore, the figure demonstrates that training using real pairs with either corrupted or denoised references yielded nearly identical outputs, with only subtle differences, such as slightly more homogeneous areas in the case of training with the denoised reference, see white arrows.

Refer to caption
Figure 9: A comparison is shown between a physically acquired LR EM image and a synthetically generated LR image obtained by down-sampling an HR EM image. Notable differences are observed in terms of fine details and intensity. The blue rectangle highlights a drift distortion at the border of the HR image, which is not present in the LR counterpart. An arrow points to a charging effect that is only seen in the HR image. The zoomed-in area accentuates the distinct difference in fine details (yellow asterisks), and intensity level (green asterisk).

The results for BRAIN2[CONTRA] indicate encouraging findings. The Synthetic (I) training strategy yields unsatisfactory results as it struggles to match the degradations present in the input LR image, see Figure 9. However, results for synthetic (II), training with a diverse range of degradations, outperform training with real pairs, generating sharper edges and enhanced contrast, as indicated by the dashed green circle. The key factor behind these results is synthetic training’s ability, under well-matched degradations, to learn deblurring and denoising while super-resolving the input LR image. The low-level feature fidelity in the synthetic pairs is well-preserved compared to training with acquired LR and HR images, even in the case of synthetic (I) with bicubic downsampling, evident in black areas marked with asterisks. From a denoising perspective, training with real pairs may offer better performance, benefiting from the independence of noise-like corruptions in independently acquired LR and HR images, preventing the learning of noise-like patterns with random characteristics. Notably, both noisy and denoised reference training produce similar outputs.

BRAIN3[IPSI] showcases additional promising outcomes with the synthetic (II) strategy, demonstrating superior super-resolution performance in recovering fine details, and achieving sharp edges while mitigating noise—highlighted in areas marked by circles and arrows.

When comparing real and synthetic datasets, it is recommended to use real image pairs as they have the potential to enhance the overall quality. The foremost advantage is learning real degradations, which are difficult to simulate, see Fig. 9. Importantly, the separate acquisition of LR and HR images leads to nearly independent noise-like corruptions. This independence is beneficial for the network as it prevents the learning of noise-like patterns with random characteristics, so the network learns to denoise while super-resolving the LR image. Furthermore, the results indicate that while pairs of synthetic LR images, derived from down-sampled HR, and HR images are not suitable as training pairs, there is potential for computationally generated pairs to advance EM super-resolution. Notably, this approach can address mismatches between acquired LR and HR pairs, i.e., co-registration and contrast, reduce imaging time, and lower costs.

III-D3 Super-Resolver as Enhancer

Applying the trained SR model to HR images with the same resolution as the training HR images enhances image quality. In comparison to a denoiser, it enhances the resolution as well as mitigates noise, see the first row in Figure 10. However, in situations where there are mismatches between the trained model and the input image, changes in image contrast may occur, as depicted in the second row of Figure 10. This observation highlights the potential of SR methods to function as denoisers and enhancers, particularly emphasizing the practical capabilities of a self-supervised SR approach that can address the mismatches.

Refer to caption
Figure 10: Application of the trained super-resolution model to HR images with the same resolution as the HR images used for training. (a) Input HR, (b) Enhanced using super-resolver, and (c) Denoised HR.

III-D4 Super-Resolution Can Help Avoid Distortion

EM imaging at HR may result in distortions at the image border in the $xy$-plane, a phenomenon not observed in LR imaging, as depicted in Figure 11. However, employing SR techniques enables the generation of an HR image from an LR image, effectively overcoming these distortions.

Refer to caption
Figure 11: Distortion at the border of 3D-EM data. The $xz$-perspective view of (a) the bicubically interpolated acquired LR, (b) the acquired HR, and (c) the super-resolved LR images.

III-D5 Natural Image Pre-trained Networks on Brain EM

Figure 12 depicts the application of state-of-the-art pre-trained networks designed for natural images to brain EM. BSRGAN [25] and Real-ESRGAN [27] are two networks designed for the super-resolution of natural images, trained on natural and purely synthetic datasets, respectively. When applied to brain EM images, while these methods can restore the overall structure of large tissue compartments, they fail to recover the intricate details and nuances unique to brain EM. In particular, they tend to introduce unrealistic details and cartoonish textures, as visible in the zoomed-in areas.

Refer to caption
Figure 12: Super-resolution of EM images using state-of-the-art pre-trained networks designed for natural images. (a) LR, (b) HR, (c) BSRGAN, and (d) Real-ESRGAN.
TABLE I: Quantitative evaluation of super-resolution methods. The best and second-best scores are in bold (\textbf) and underlined (\underline), respectively. Reported values are the mean and standard deviation ($\mu\pm\sigma$) for each BRAIN, calculated across all slices. The overall evaluation (ALL DATASETS), highlighted in gray, represents the mean and standard deviation across the reported mean values for all datasets.
Metric | Datasets | Bicubic | DPIR | PSSR | SwinIR | EMSR [OURS / $\ell_{1}$] | EMSR [OURS / $\ell_{2}$]
SSIM ↑ | BRAIN1 [IPSI] | 0.551±0.012 | \underline{0.712±0.015} | 0.702±0.014 | 0.697±0.015 | \textbf{0.720±0.015} | 0.705±0.015
SSIM ↑ | BRAIN1 [CONTRA] | 0.719±0.013 | \underline{0.840±0.014} | \textbf{0.849±0.013} | 0.832±0.015 | 0.832±0.015 | 0.815±0.016
SSIM ↑ | BRAIN2 [IPSI] | 0.669±0.014 | 0.715±0.014 | 0.713±0.014 | 0.694±0.014 | \textbf{0.739±0.016} | \underline{0.737±0.014}
SSIM ↑ | BRAIN2 [CONTRA] | 0.640±0.020 | 0.669±0.024 | 0.672±0.026 | \textbf{0.695±0.025} | \underline{0.688±0.026} | \textbf{0.695±0.025}
SSIM ↑ | BRAIN3 [IPSI] | 0.646±0.008 | 0.673±0.008 | 0.666±0.010 | \underline{0.715±0.009} | \underline{0.715±0.009} | \textbf{0.720±0.009}
SSIM ↑ | BRAIN3 [CONTRA] | 0.737±0.026 | 0.753±0.030 | 0.740±0.031 | \underline{0.787±0.026} | 0.780±0.028 | \textbf{0.792±0.026}
SSIM ↑ | BRAIN4 [IPSI] | 0.721±0.007 | 0.794±0.014 | \underline{0.808±0.012} | 0.790±0.007 | \textbf{0.809±0.009} | 0.796±0.008
SSIM ↑ | BRAIN4 [CONTRA] | 0.684±0.014 | 0.717±0.016 | 0.728±0.018 | \textbf{0.745±0.018} | 0.735±0.020 | \underline{0.738±0.018}
SSIM ↑ | BRAIN5 [IPSI] | 0.615±0.019 | 0.639±0.016 | 0.662±0.016 | 0.669±0.018 | \textbf{0.687±0.018} | \underline{0.681±0.017}
SSIM ↑ | ALL DATASETS | 0.665±0.059 | 0.724±0.064 | 0.727±0.065 | 0.736±0.056 | \textbf{0.745±0.051} | \underline{0.742±0.048}
PSNR ↑ | BRAIN1 [IPSI] | 23.1±0.3 | 24.8±0.4 | \textbf{25.0±0.4} | 24.4±0.4 | \underline{24.9±0.4} | \underline{24.9±0.4}
PSNR ↑ | BRAIN1 [CONTRA] | 24.8±2.6 | \underline{25.9±2.9} | \textbf{26.1±3.2} | 24.4±2.9 | 24.2±2.7 | 23.8±2.6
PSNR ↑ | BRAIN2 [IPSI] | 23.1±1.5 | \textbf{23.6±1.7} | \underline{23.6±2.0} | 22.2±1.2 | 22.9±1.6 | 23.0±1.6
PSNR ↑ | BRAIN2 [CONTRA] | 22.8±0.9 | \underline{23.3±1.0} | \textbf{23.3±0.9} | 22.5±1.6 | 22.3±1.5 | 22.3±1.5
PSNR ↑ | BRAIN3 [IPSI] | 21.9±0.7 | 22.2±0.8 | 21.8±0.8 | \textbf{23.0±0.8} | 22.7±0.7 | \underline{22.8±0.7}
PSNR ↑ | BRAIN3 [CONTRA] | 21.8±1.3 | 22.1±1.3 | 21.5±1.3 | \underline{23.0±1.4} | 22.7±1.5 | \textbf{23.1±1.4}
PSNR ↑ | BRAIN4 [IPSI] | 24.3±0.4 | \underline{25.2±0.5} | \underline{25.2±0.5} | \textbf{25.3±0.4} | 24.2±0.6 | 23.3±0.5
PSNR ↑ | BRAIN4 [CONTRA] | 24.0±0.7 | 24.4±0.8 | \textbf{24.6±0.8} | 22.1±0.9 | 24.0±0.7 | \underline{24.4±0.6}
PSNR ↑ | BRAIN5 [IPSI] | 22.6±0.6 | \underline{22.7±0.7} | \textbf{23.2±0.7} | 22.5±0.7 | 21.6±1.2 | 21.5±1.2
PSNR ↑ | ALL DATASETS | 23.2±1.0 | \textbf{23.8±1.4} | \underline{23.8±1.6} | 23.3±1.1 | 23.3±1.1 | 23.2±1.0
$\overline{\mathrm{FRC}}$ ↑ | BRAIN1 [IPSI] | 0.191±0.004 | 0.216±0.008 | 0.227±0.007 | 0.243±0.010 | \textbf{0.253±0.011} | \underline{0.246±0.010}
$\overline{\mathrm{FRC}}$ ↑ | BRAIN1 [CONTRA] | 0.198±0.008 | 0.210±0.008 | 0.244±0.007 | \underline{0.281±0.014} | \textbf{0.285±0.013} | 0.280±0.013
$\overline{\mathrm{FRC}}$ ↑ | BRAIN2 [IPSI] | 0.226±0.007 | 0.287±0.012 | 0.287±0.012 | 0.310±0.013 | \textbf{0.319±0.015} | \underline{0.317±0.013}
$\overline{\mathrm{FRC}}$ ↑ | BRAIN2 [CONTRA] | 0.213±0.009 | 0.220±0.010 | 0.282±0.013 | \textbf{0.314±0.016} | \underline{0.307±0.014} | 0.304±0.014
$\overline{\mathrm{FRC}}$ ↑ | BRAIN3 [IPSI] | 0.245±0.004 | 0.246±0.006 | 0.309±0.006 | 0.345±0.008 | \textbf{0.352±0.008} | \underline{0.349±0.008}
$\overline{\mathrm{FRC}}$ ↑ | BRAIN3 [CONTRA] | 0.259±0.007 | 0.263±0.009 | 0.319±0.010 | \underline{0.350±0.016} | 0.349±0.015 | \textbf{0.352±0.015}
$\overline{\mathrm{FRC}}$ ↑ | BRAIN4 [IPSI] | 0.216±0.004 | 0.220±0.008 | 0.273±0.009 | 0.292±0.010 | \textbf{0.297±0.009} | \underline{0.297±0.010}
$\overline{\mathrm{FRC}}$ ↑ | BRAIN4 [CONTRA] | 0.261±0.005 | 0.256±0.008 | \underline{0.342±0.010} | \textbf{0.349±0.010} | 0.340±0.012 | 0.337±0.011
$\overline{\mathrm{FRC}}$ ↑ | BRAIN5 [IPSI] | 0.227±0.007 | 0.222±0.008 | 0.290±0.010 | 0.311±0.010 | \textbf{0.323±0.011} | \underline{0.320±0.011}
$\overline{\mathrm{FRC}}$ ↑ | ALL DATASETS | 0.226±0.025 | 0.238±0.026 | 0.286±0.036 | 0.311±0.035 | \textbf{0.314±0.032} | \underline{0.311±0.034}
TABLE II: Quantitative evaluation of EMSR using different training strategies: pairs of real LR and HR, pairs of synthetic LR and HR, and pairs of real LR and denoised HR, for the \ell_{1} loss function. The best and second-best scores are in bold and underlined, respectively. Reported values (mean ± standard deviation, \mu\pm\sigma) for each BRAIN are calculated across all slices.
Metric | Method | BRAIN1 IPSI | BRAIN1 CONTRA | BRAIN2 IPSI | BRAIN2 CONTRA | BRAIN3 IPSI | BRAIN3 CONTRA | BRAIN4 IPSI | BRAIN4 CONTRA | BRAIN5 IPSI | ALL DATASETS
SSIM ↑ | EMSR[Real] | \underline{0.720±0.015} | \underline{0.832±0.015} | \underline{0.739±0.016} | \underline{0.688±0.026} | \underline{0.715±0.009} | \underline{0.780±0.028} | \textbf{0.809±0.009} | \underline{0.735±0.020} | \textbf{0.687±0.018} | \textbf{0.745±0.051}
EMSR[Denoised] | \textbf{0.729±0.016} | \textbf{0.839±0.015} | \textbf{0.740±0.017} | 0.685±0.026 | 0.712±0.009 | 0.772±0.028 | \underline{0.808±0.009} | 0.734±0.020 | \underline{0.686±0.018} | \underline{0.745±0.053}
EMSR[Synthetic (II)] | 0.598±0.013 | 0.780±0.014 | 0.738±0.014 | \textbf{0.704±0.021} | \textbf{0.724±0.009} | \textbf{0.794±0.025} | 0.776±0.006 | \textbf{0.755±0.018} | 0.667±0.021 | 0.726±0.063
EMSR[Synthetic (I)] | 0.519±0.012 | 0.705±0.015 | 0.675±0.013 | 0.645±0.020 | 0.663±0.008 | 0.749±0.025 | 0.718±0.008 | 0.708±0.015 | 0.625±0.020 | 0.667±0.068
PSNR ↑ | EMSR[Real] | \underline{24.9±0.4} | 24.2±2.7 | 22.9±1.6 | 22.3±1.5 | \underline{22.7±0.7} | \textbf{22.7±1.5} | 24.2±0.6 | 24.0±0.7 | 21.6±1.2 | \underline{23.3±1.1}
EMSR[Denoised] | \textbf{25.0±0.4} | 24.5±2.8 | \underline{23.4±1.6} | 22.6±1.5 | \underline{22.7±0.7} | 22.1±1.5 | 22.9±0.7 | 23.6±0.7 | 21.5±1.2 | 23.1±1.1
EMSR[Synthetic (II)] | 23.3±0.3 | \textbf{25.0±2.6} | \textbf{24.0±1.9} | \textbf{23.8±1.0} | \textbf{22.9±0.9} | \underline{22.4±1.4} | \textbf{24.7±0.6} | \textbf{24.9±0.9} | \textbf{22.8±0.7} | \textbf{23.8±1.0}
EMSR[Synthetic (I)] | 22.6±0.3 | \underline{24.7±2.5} | 23.1±1.6 | \underline{22.9±0.9} | 22.0±0.8 | 21.8±1.3 | \underline{24.2±0.4} | \underline{24.3±0.7} | \underline{22.7±0.6} | 23.1±1.0
\overline{FRC} ↑ | EMSR[Real] | \underline{0.253±0.011} | \underline{0.285±0.013} | \underline{0.319±0.015} | \underline{0.307±0.014} | \underline{0.352±0.008} | \underline{0.349±0.015} | \textbf{0.297±0.010} | \underline{0.340±0.012} | \textbf{0.323±0.011} | \underline{0.314±0.032}
EMSR[Denoised] | \textbf{0.256±0.011} | \textbf{0.288±0.013} | \textbf{0.321±0.015} | \underline{0.307±0.014} | 0.349±0.009 | 0.347±0.016 | \underline{0.294±0.010} | \underline{0.340±0.012} | \underline{0.320±0.012} | \textbf{0.314±0.031}
EMSR[Synthetic (II)] | 0.214±0.008 | 0.255±0.015 | 0.311±0.015 | \textbf{0.311±0.015} | \textbf{0.367±0.010} | \textbf{0.362±0.018} | 0.289±0.012 | \textbf{0.369±0.013} | 0.308±0.015 | 0.310±0.053
EMSR[Synthetic (I)] | 0.193±0.005 | 0.205±0.009 | 0.245±0.010 | 0.235±0.013 | 0.285±0.006 | 0.294±0.011 | 0.230±0.006 | 0.313±0.009 | 0.249±0.010 | 0.250±0.040
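For completeness, the resolution metric \overline{FRC} reported in the tables is the Fourier ring correlation averaged over spatial-frequency rings [59]. The following NumPy snippet is a minimal sketch of such a computation; the ring count, frequency range, and normalization are illustrative assumptions and may differ from the implementation used to produce the numbers above.

```python
import numpy as np

def mean_frc(img1: np.ndarray, img2: np.ndarray, n_rings: int = 64) -> float:
    """Mean Fourier ring correlation between two same-sized 2D images (sketch)."""
    f1 = np.fft.fftshift(np.fft.fft2(img1))
    f2 = np.fft.fftshift(np.fft.fft2(img2))
    h, w = img1.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2)       # radial frequency of each pixel
    edges = np.linspace(0.0, min(h, w) / 2, n_rings + 1)
    frc = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (r >= lo) & (r < hi)
        if not mask.any():
            continue
        num = np.real(np.sum(f1[mask] * np.conj(f2[mask])))
        den = np.sqrt(np.sum(np.abs(f1[mask]) ** 2) * np.sum(np.abs(f2[mask]) ** 2))
        frc.append(num / (den + 1e-12))         # correlation on this ring
    return float(np.mean(frc))                  # average over rings -> FRC-bar

# toy usage: identical images give a mean FRC close to 1
im = np.random.default_rng(0).random((128, 128))
print(mean_frc(im, im))
```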

IV Conclusion

We introduced EMSR, a deep-learning-based SR framework that addresses the challenge of acquiring clean HR 3D-EM images across large tissue volumes. Because corruptions are inherent in EM, we first examined training neural networks with no-clean references under the \ell_{2} and \ell_{1} loss functions. We then designed a noise-robust network that integrates edge-attention and self-attention mechanisms to emphasize edge features over the less informative backgrounds of brain EM images. Using real LR and HR brain EM image pairs, the network was trained with acquired LR and HR pairs as well as with LR and denoised HR pairs. Experimental results, in line with the presented theory, confirmed the feasibility of training with no-clean references for both loss functions; the two losses yielded similar SR performance, with \ell_{1} slightly outperforming \ell_{2}, consistent with the literature. EMSR also produced superior or competitive results, both quantitatively and qualitatively, compared with established SR methods. In addition to training with real LR and HR pairs, we synthesized LR images from their HR counterparts using isotropic Gaussian kernels and Gaussian noise spanning a wide range of parameters. Models trained on these synthetic pairs performed comparably to models trained on real pairs and, in some cases, produced super-resolved images with sharper edges and improved contrast, since synthesis avoids the inherent mismatches between acquired LR and HR pairs, e.g., in co-registration and contrast. Such synthesis could also aid in deblurring while denoising and super-resolving LR EM.
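As an illustration of the synthetic-pair strategy discussed above, the following sketch degrades an HR patch into a synthetic LR patch via isotropic Gaussian blurring, downsampling, and additive Gaussian noise. The kernel-width and noise-level ranges, the decimation-based downsampling, and the function name synthesize_lr are illustrative assumptions rather than the exact recipe used in our experiments.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthesize_lr(hr: np.ndarray, scale: int = 2, rng=None) -> np.ndarray:
    """Create a synthetic LR image from an HR image (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    sigma_blur = rng.uniform(0.5, 3.0)      # assumed range of isotropic Gaussian kernel widths
    sigma_noise = rng.uniform(0.0, 0.1)     # assumed Gaussian noise level range (images in [0, 1])
    blurred = gaussian_filter(hr, sigma=sigma_blur)
    lr = blurred[::scale, ::scale]          # plain decimation as the downsampling step
    lr = lr + rng.normal(0.0, sigma_noise, lr.shape)
    return np.clip(lr, 0.0, 1.0)

# usage: build one (LR, HR) training pair from an HR patch
hr_patch = np.random.default_rng(0).random((256, 256))
lr_patch = synthesize_lr(hr_patch, scale=2)
```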

EMSR simultaneously improves resolution and reduces noise, enabling the computational generation of clean HR EM images over large samples from cost-effective LR acquisitions; it can therefore serve as a neuroimaging preprocessing tool for visualization and analysis.

Acknowledgments

The authors would like to thank the Electron Microscopy Unit at the Institute of Biotechnology, University of Helsinki, Finland, for the 3D-EM datasets. They would also like to thank the Bioinformatics Center at the University of Eastern Finland, Finland, and the CSC–IT Center for Science, Finland, for providing computational resources.

Appendix A Training using corrupted references

Let \hat{x} and x be random variables such that \hat{x}=x+n, where n represents i.i.d. noise with mean \mu and covariance \sigma_{n}^{2}I. The relation between the reference-dependent solutions in (5), i.e., \mathbb{E}_{x|y}[\mathcal{L}(f_{\theta}(y),x)] and \mathbb{E}_{\hat{x}|y}[\mathcal{L}(f_{\theta}(y),\hat{x})], for the \ell_{2} and \ell_{1} norms is discussed in the following subsections.

A-A Solution for the \ell_{2}-norm loss function

The proof of (6) is provided below:

\begin{split}
&\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{2}^{2}]\\
&=\mathbb{E}_{x,\hat{x}|y}[\|f_{\theta}(y)-x-n\|_{2}^{2}]\\
&=\mathbb{E}_{x,\hat{x}|y}[(f_{\theta}(y)-x-n)^{T}(f_{\theta}(y)-x-n)]\\
&=\mathbb{E}_{x,\hat{x}|y}[\|f_{\theta}(y)-x\|_{2}^{2}-2n^{T}(f_{\theta}(y)-x)+\|n\|_{2}^{2}]\\
&=\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]-2\mathbb{E}_{x,\hat{x}|y}[n^{T}(f_{\theta}(y)-x)]+\mathbb{E}_{x,\hat{x}|y}[\|n\|_{2}^{2}]\\
&=\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]-2\mathbb{E}_{x,\hat{x}|y}[n^{T}(f_{\theta}(y)-x)]+d\sigma_{n}^{2}+\|\mu\|^{2}\\
&\overset{*}{=}\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]-2(\mathbb{E}_{\hat{x}|y}[\hat{x}]-\mathbb{E}_{x|y}[x])^{T}\mathbb{E}_{x|y}[f_{\theta}(y)-x]+d\sigma_{n}^{2}+\|\mu\|^{2}\\
&=\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]-2\mu^{T}\mathbb{E}_{x|y}[f_{\theta}(y)-x]+d\sigma_{n}^{2}+\|\mu\|^{2}
\end{split} \quad (20)

* Under the assumption of i.i.d. noise, independent of x given y, we can establish \mathbb{E}_{\hat{x}|y}[\hat{x}]-\mathbb{E}_{x|y}[x]=\mathbb{E}_{x,\hat{x}|y}[n]=\mu.
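As a quick numerical sanity check of (20), the following NumPy sketch compares both sides by Monte Carlo simulation for a fixed prediction f_{\theta}(y); the chosen dimensions and distributions are arbitrary illustrations, not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples = 16, 200_000
mu, sigma_n = 0.2, 0.5                        # noise mean and standard deviation (illustrative)

f_y = rng.normal(0.0, 1.0, d)                 # a fixed network output f_theta(y)
x = rng.normal(0.0, 1.0, (n_samples, d))      # samples of the clean reference x given y
n = rng.normal(mu, sigma_n, (n_samples, d))   # i.i.d. noise with mean mu, variance sigma_n^2
x_hat = x + n                                 # corrupted references

lhs = np.mean(np.sum((f_y - x_hat) ** 2, axis=1))          # E[||f(y) - x_hat||_2^2]
rhs = (np.mean(np.sum((f_y - x) ** 2, axis=1))              # E[||f(y) - x||_2^2]
       - 2 * mu * np.sum(np.mean(f_y - x, axis=0))          # -2 mu^T E[f(y) - x]
       + d * sigma_n ** 2 + d * mu ** 2)                    # d*sigma^2 + ||mu||^2 (constant-mean noise)
print(lhs, rhs)   # the two values should agree up to Monte Carlo error
```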

A-B Bounds for the \ell_{1}-norm loss function

We derive two upper bounds for the \ell_{1} loss, including (7), using the following inequality (a consequence of the triangle inequality and the linearity of expectation), which holds for any vectors u and v in \mathbb{C}^{d} under the p-norm:

\mathbb{E}[\|u\|_{p}]-\mathbb{E}[\|v\|_{p}]\leq\mathbb{E}[\|u-v\|_{p}] \quad (21)

By setting u=f_{\theta}(y)-\hat{x} and v=f_{\theta}(y)-x, so that u-v=x-\hat{x}=-n, we can rewrite the inequality as:

\underbrace{\mathbb{E}_{(\hat{x},y)}[\|f_{\theta}(y)-\hat{x}\|_{p}]-\mathbb{E}_{(x,y)}[\|f_{\theta}(y)-x\|_{p}]}_{\geq 0}\leq\mathbb{E}_{(x,\hat{x})}[\|n\|_{p}] \quad (22)

Here we assume that the expected training error with the corrupted reference \hat{x} is greater than or equal to the training error with the clean reference x, which makes the left-hand side of (22) non-negative.

Let u be a vector in \mathbb{C}^{d} and let 1\leq r<p. By a well-known corollary of Hölder's inequality,

\|u\|_{p}\leq\|u\|_{r}\leq d^{(1/r-1/p)}\|u\|_{p}, \quad (23)

where d is the dimension of u. Setting p=2 and r=1 in (23) relates the \ell_{1} and \ell_{2} norms as \|u\|_{1}\leq\sqrt{d}\|u\|_{2}. Squaring both sides and taking expectations yields

\mathbb{E}[\|u\|_{1}^{2}]\leq d\,\mathbb{E}[\|u\|_{2}^{2}] \quad (24)

Applying Jensen’s inequality, which states that f(\mathbb{E}[x])\leq\mathbb{E}[f(x)] for a convex function f:\mathbb{R}\to\mathbb{R} (here f(t)=t^{2}), the left-hand side of (24) can be bounded from below as follows:

(\mathbb{E}[\|u\|_{1}])^{2}\leq\mathbb{E}[\|u\|_{1}^{2}]\leq d\,\mathbb{E}[\|u\|_{2}^{2}] \quad (25)

Taking the square root of both sides of (25) yields:

\mathbb{E}[\|u\|_{1}]\leq\sqrt{d}\sqrt{\mathbb{E}[\|u\|_{2}^{2}]} \quad (26)
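Inequality (26) can also be verified numerically; the short sketch below draws random vectors from an arbitrary, illustrative distribution and checks that the empirical \ell_{1} mean stays below \sqrt{d}\sqrt{\mathbb{E}[\|u\|_{2}^{2}]}.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples = 32, 100_000
u = rng.standard_t(df=3, size=(n_samples, d))   # arbitrary heavy-tailed samples (illustrative)

lhs = np.mean(np.sum(np.abs(u), axis=1))                        # E[||u||_1]
rhs = np.sqrt(d) * np.sqrt(np.mean(np.sum(u ** 2, axis=1)))     # sqrt(d) * sqrt(E[||u||_2^2])
assert lhs <= rhs
print(lhs, rhs)
```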

Using inequality (26), we establish two upper bounds:

A-B1 Upper-bound (I)

Considering (22) with p=1 and (26),

\begin{split}
0&\leq\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{1}]-\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{1}]\\
&\leq\sqrt{d}\sqrt{\mathbb{E}[\|n\|_{2}^{2}]}=\sqrt{d}\sqrt{d\sigma_{n}^{2}+\|\mu\|^{2}}
\end{split} \quad (27)
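A Monte Carlo illustration of upper-bound (I): with the same kind of synthetic setup used for (20), the \ell_{1}-loss gap between corrupted and clean references remains below \sqrt{d}\sqrt{d\sigma_{n}^{2}+\|\mu\|^{2}}. All distributions and constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_samples = 16, 200_000
mu, sigma_n = 0.2, 0.5

f_y = rng.normal(0.0, 1.0, d)                 # fixed prediction f_theta(y)
x = rng.normal(0.0, 1.0, (n_samples, d))      # clean references
n = rng.normal(mu, sigma_n, (n_samples, d))   # noise

gap = (np.mean(np.sum(np.abs(f_y - (x + n)), axis=1))
       - np.mean(np.sum(np.abs(f_y - x), axis=1)))               # left side of (27)
bound_I = np.sqrt(d) * np.sqrt(d * sigma_n ** 2 + d * mu ** 2)   # right side of (27), ||mu||^2 = d*mu^2
print(gap, bound_I)   # the gap should not exceed bound_I
```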

A-B2 Upper-bound (II)

We begin by applying inequality (26) to f_{\theta}(y)-\hat{x} and f_{\theta}(y)-x,

\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{1}]\leq\sqrt{d}\sqrt{\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{2}^{2}]}, \quad (28a)
\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{1}]\leq\sqrt{d}\sqrt{\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]} \quad (28b)

Using (28a) and (28b),

\begin{split}
0&\leq\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{1}]-\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{1}]\\
&\leq\sqrt{d}\,\bigg|\sqrt{\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{2}^{2}]}-\sqrt{\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]}\bigg|
\end{split} \quad (29)

The inequality above can equivalently be formulated as follows:

\begin{split}
0&\leq\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{1}]-\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{1}]\\
&\leq\sqrt{d}\,\Bigg|\frac{\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{2}^{2}]-\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]}{\sqrt{\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{2}^{2}]}+\sqrt{\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]}}\Bigg|\\
&\overset{*}{=}\sqrt{d}\,\Bigg|\frac{-2\mu^{T}\mathbb{E}_{x|y}[f_{\theta}(y)-x]+d\sigma_{n}^{2}+\|\mu\|^{2}}{\sqrt{\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{2}^{2}]}+\sqrt{\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]}}\Bigg|\\
&=\frac{\big|-2\mu^{T}\mathbb{E}_{x|y}[f_{\theta}(y)-x]+d\sigma_{n}^{2}+\|\mu\|^{2}\big|}{g(y,x,\hat{x})},
\end{split} \quad (30)

where g(y,x,\hat{x})=\frac{\sqrt{\mathbb{E}_{\hat{x}|y}[\|f_{\theta}(y)-\hat{x}\|_{2}^{2}]}+\sqrt{\mathbb{E}_{x|y}[\|f_{\theta}(y)-x\|_{2}^{2}]}}{\sqrt{d}}. * This step uses the difference between the solutions for \hat{x} and x under the \ell_{2}-norm loss; see (20). Unlike upper-bound (I), upper-bound (II) depends on both y and the noise statistics.
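Continuing the illustration above, upper-bound (II) can be evaluated from the same Monte Carlo samples and compared with upper-bound (I); because (II) is normalized by g(y,x,\hat{x}), it is typically tighter when the prediction is far from the references. The setup below is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_samples = 16, 200_000
mu, sigma_n = 0.2, 0.5

f_y = rng.normal(0.0, 1.0, d)
x = rng.normal(0.0, 1.0, (n_samples, d))
x_hat = x + rng.normal(mu, sigma_n, (n_samples, d))

e2_hat = np.mean(np.sum((f_y - x_hat) ** 2, axis=1))   # E[||f(y) - x_hat||_2^2]
e2 = np.mean(np.sum((f_y - x) ** 2, axis=1))           # E[||f(y) - x||_2^2]
g = (np.sqrt(e2_hat) + np.sqrt(e2)) / np.sqrt(d)       # g(y, x, x_hat) from (30)

numer = abs(-2 * mu * np.sum(np.mean(f_y - x, axis=0)) + d * sigma_n ** 2 + d * mu ** 2)
bound_II = numer / g                                    # right side of (30)
bound_I = np.sqrt(d) * np.sqrt(d * sigma_n ** 2 + d * mu ** 2)
print(bound_II, bound_I)   # bound (II) is data-dependent via y; compare with bound (I)
```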

References

  • [1] D. G. C. Hildebrand, M. Cicconet, R. M. Torres, B. J. Choi, O. Randlett, et al., “Whole-brain serial-section electron microscopy in larval zebrafish,” Nature, vol. 545, no. 7654, pp. 345–349, 2017.
  • [2] Z. Zheng, J. S. Lauritzen, E. Perlman, Robinson, et al., “A complete electron microscopy volume of the brain of adult Drosophila melanogaster,” Cell, vol. 174, no. 3, pp. 730–743, 2018.
  • [3] N. Varsano and S. G. Wolf, “Electron microscopy of cellular ultrastructure in three dimensions,” Current opinion in structural biology, vol. 76, p. 102444, 2022.
  • [4] J. Roels, J. Aelterman, Luong, et al., “An overview of state-of-the-art image restoration in electron microscopy,” Journal of microscopy, vol. 271, no. 3, pp. 239–254, 2018.
  • [5] S. Mikula and W. Denk, “High-resolution whole-brain staining for electron microscopic circuit reconstruction,” Nature methods, vol. 12, no. 6, pp. 541–546, 2015.
  • [6] B. Imbrosci, D. Schmitz, and M. Orlando, “Automated detection and localization of synaptic vesicles in electron microscopy images,” Eneuro, vol. 9, no. 1, 2022.
  • [7] J. Funke, F. Tschopp, W. Grisaitis, Sheridan, et al., “Large scale image segmentation with structured loss based deep learning for connectome reconstruction,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 7, pp. 1669–1680, 2018.
  • [8] A. Liu, Y. Liu, J. Gu, Y. Qiao, and C. Dong, “Blind image super-resolution: A survey and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [9] D. Ren, W. Zuo, D. Zhang, L. Zhang, and M.-H. Yang, “Simultaneous fidelity and regularization learning for image restoration,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 1, pp. 284–299, 2019.
  • [10] C. A. Bouman, Foundations of Computational Imaging: A Model-Based Approach. SIAM, 2022.
  • [11] D. Meng and F. De La Torre, “Robust matrix factorization with unknown noise,” in ICCV, pp. 1337–1344, 2013.
  • [12] X. Cao, Q. Zhao, D. Meng, Y. Chen, and Z. Xu, “Robust low-rank matrix factorization under general mixture noise distributions,” IEEE Transactions on Image Processing, vol. 25, no. 10, pp. 4677–4690, 2016.
  • [13] Z. Wang, J. Chen, and S. C. Hoi, “Deep learning for image super-resolution: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3365–3387, 2020.
  • [14] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: nonlinear phenomena, vol. 60, no. 1-4, pp. 259–268, 1992.
  • [15] D. Glasner, S. Bagon, and M. Irani, “Super-resolution from a single image,” in ICCV, pp. 349–356, IEEE, 2009.
  • [16] Z. Zha, B. Wen, X. Yuan, J. Zhou, C. Zhu, and A. C. Kot, “Low-rankness guided group sparse representation for image restoration,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [17] H. Chen, X. He, L. Qing, Y. Wu, C. Ren, R. E. Sheriff, and C. Zhu, “Real-world single image super-resolution: A brief review,” Information Fusion, vol. 79, pp. 124–145, 2022.
  • [18] W. T. Freeman and E. C. Pasztor, “Markov networks for super-resolution,” in Proc. 34th Annual Conf. on Information Sciences and Systems (CISS 2000), 2000.
  • [19] X. Lu, Y. Yuan, and P. Yan, “Image super-resolution via double sparsity regularized manifold learning,” IEEE transactions on circuits and systems for video technology, vol. 23, no. 12, pp. 2022–2033, 2013.
  • [20] Y. Romano, J. Isidoro, and P. Milanfar, “RAISR: Rapid and accurate image super resolution,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 110–125, 2016.
  • [21] J. Yang, Z. Lin, and S. Cohen, “Fast image super-resolution based on in-place example regression,” in CVPR, pp. 1059–1066, 2013.
  • [22] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE transactions on image processing, vol. 19, no. 11, pp. 2861–2873, 2010.
  • [23] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, pp. 184–199, Springer, 2014.
  • [24] T.-A. Song, S. R. Chowdhury, F. Yang, and J. Dutta, “Super-resolution PET imaging using convolutional neural networks,” IEEE transactions on computational imaging, vol. 6, pp. 518–528, 2020.
  • [25] K. Zhang, J. Liang, L. Van Gool, and R. Timofte, “Designing a practical degradation model for deep blind image super-resolution,” in ICCV, pp. 4791–4800, 2021.
  • [26] Y. Sui, O. Afacan, C. Jaimes, A. Gholipour, and S. K. Warfield, “Scan-specific generative neural network for MRI super-resolution reconstruction,” IEEE Transactions on Medical Imaging, vol. 41, no. 6, pp. 1383–1399, 2022.
  • [27] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data,” in ICCV, pp. 1905–1914, 2021.
  • [28] Z. Lu, J. Li, H. Liu, C. Huang, Zhang, et al., “Transformer for single image super-resolution,” in CVPR, pp. 457–466, 2022.
  • [29] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “SwinIR: Image restoration using Swin Transformer,” in ICCV, pp. 1833–1844, 2021.
  • [30] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [31] K. Mei, M. Delbracio, H. Talebi, Z. Tu, V. M. Patel, and P. Milanfar, “Conditional diffusion distillation,” arXiv preprint arXiv:2310.01407, 2023.
  • [32] K. Zhang, L. V. Gool, and R. Timofte, “Deep unfolding network for image super-resolution,” in CVPR, pp. 3217–3226, 2020.
  • [33] Q. Ma, J. Jiang, X. Liu, and J. Ma, “Deep unfolding network for spatiospectral image super-resolution,” IEEE Transactions on Computational Imaging, vol. 8, pp. 28–40, 2021.
  • [34] W. C. Karl, J. E. Fowler, C. A. Bouman, M. Çetin, B. Wohlberg, and J. C. Ye, “The foundations of computational imaging: A signal processing perspective,” IEEE Signal Processing Magazine, vol. 40, no. 5, pp. 40–53, 2023.
  • [35] S. H. Chan, X. Wang, and O. A. Elgendy, “Plug-and-play ADMM for image restoration: Fixed-point convergence and applications,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 84–98, 2016.
  • [36] K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, and R. Timofte, “Plug-and-play image restoration with deep denoiser prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6360–6376, 2021.
  • [37] S. Shoushtari, J. Liu, E. P. Chandler, M. S. Asif, and U. S. Kamilov, “Prior mismatch and adaptation in PnP-ADMM with a nonconvex convergence analysis,” arXiv preprint arXiv:2310.00133, 2023.
  • [38] S. Abu-Hussein, T. Tirer, S. Y. Chun, Y. C. Eldar, and R. Giryes, “Image restoration by deep projected GSURE,” in WACV, pp. 3602–3611, 2022.
  • [39] Z. Zou, J. Liu, B. Wohlberg, and U. S. Kamilov, “Deep equilibrium learning of explicit regularization functionals for imaging inverse problems,” IEEE Open Journal of Signal Processing, 2023.
  • [40] D. Gilton, G. Ongie, and R. Willett, “Deep equilibrium architectures for inverse problems in imaging,” IEEE Transactions on Computational Imaging, vol. 7, pp. 1123–1133, 2021.
  • [41] S. Tsiper, O. Dicker, I. Kaizerman, Z. Zohar, M. Segev, and Y. C. Eldar, “Sparsity-based super resolution for SEM images,” Nano Letters, vol. 17, no. 9, pp. 5437–5445, 2017.
  • [42] S. Sreehari, S. Venkatakrishnan, K. L. Bouman, J. P. Simmons, L. F. Drummy, and C. A. Bouman, “Multi-resolution data fusion for super-resolution electron microscopy,” in CVPR workshops, pp. 88–96, 2017.
  • [43] Z. Gao, W. Ma, S. Huang, P. Hua, and C. Lan, “Deep learning for super-resolution in a field emission scanning electron microscope,” AI, vol. 1, no. 1, pp. 1–10, 2020.
  • [44] L. Fang, F. Monroe, S. W. Novak, L. Kirk, C. R. Schiavon, S. B. Yu, Zhang, et al., “Deep learning-based point-scanning super-resolution imaging,” Nature methods, vol. 18, no. 4, pp. 406–416, 2021.
  • [45] E. J. Reid, L. F. Drummy, C. A. Bouman, and G. T. Buzzard, “Multi-resolution data fusion for super resolution imaging,” IEEE Transactions on Computational Imaging, vol. 8, pp. 81–95, 2022.
  • [46] Y. Qian, J. Xu, L. F. Drummy, and Y. Ding, “Effective super-resolution methods for paired electron microscopic images,” IEEE Transactions on Image Processing, vol. 29, pp. 7317–7330, 2020.
  • [47] B. Titze, Techniques to prevent sample surface charging and reduce beam damage effects for SBEM imaging. PhD thesis, 2013.
  • [48] M. G. de Faria, Y. Haddab, Y. Le Gorrec, and P. Lutz, “Influence of mechanical noise inside a scanning electron microscope,” Review of Scientific Instruments, vol. 86, no. 4, p. 045105, 2015.
  • [49] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2Noise: Learning image restoration without clean data,” in ICML, vol. 80, pp. 2965–2974, PMLR, 2018.
  • [50] N. Moran, D. Schmidt, Y. Zhong, and P. Coady, “Noisier2Noise: Learning to denoise from unpaired noisy data,” in CVPR, pp. 12064–12072, 2020.
  • [51] A. F. Calvarons, “Improved noise2noise denoising with limited data,” in CVPR, pp. 796–805, 2021.
  • [52] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, pp. 770–778, 2016.
  • [53] M. Ghahremani, Y. Liu, and B. Tiddeman, “FFD: Fast feature detector,” IEEE Transactions on Image Processing, vol. 30, pp. 1153–1168, 2020.
  • [54] W. Shi, J. Caballero, F. Huszár, J. Totz, Aitken, et al., “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in CVPR, pp. 1874–1883, 2016.
  • [55] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on computational imaging, vol. 3, no. 1, pp. 47–57, 2016.
  • [56] A. Abdollahzadeh, I. Belevich, E. Jokitalo, A. Sierra, and J. Tohka, “DeepACSON automated segmentation of white matter in 3D electron microscopy,” Communications biology, vol. 4, no. 1, p. 179, 2021.
  • [57] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [58] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [59] N. Banterle, K. H. Bui, E. A. Lemke, and M. Beck, “Fourier ring correlation as a resolution criterion for super-resolution microscopy,” Journal of structural biology, vol. 183, no. 3, pp. 363–367, 2013.
  • [60] M. Ghahremani, M. Khateri, A. Sierra, and J. Tohka, “Adversarial distortion learning for medical image denoising,” arXiv preprint arXiv:2204.14100, 2022.