No-Clean-Reference Image Super-Resolution: Application to Electron Microscopy
Abstract
The inability to acquire clean high-resolution (HR) electron microscopy (EM) images over a large brain tissue volume hampers many neuroscience studies. To address this challenge, we propose a deep-learning-based image super-resolution (SR) approach to computationally reconstruct clean HR 3D-EM with a large field of view (FoV) from noisy low-resolution (LR) acquisitions. Our contributions are I) investigating training with no-clean references; II) introducing a novel network architecture, named EMSR, for enhancing the resolution of LR EM images while reducing inherent noise. EMSR leverages distinctive features of brain EM images—repetitive textural and geometrical patterns amidst less informative backgrounds—via multi-scale edge-attention and self-attention mechanisms to emphasize edge features over the background; and III) comparing different training strategies, including using acquired LR and HR image pairs, i.e., real pairs with no-clean references contaminated with real corruptions, pairs of synthetic LR and acquired HR, as well as acquired LR and denoised HR pairs. Experiments with nine brain datasets showed that training with real pairs can produce high-quality super-resolved results, demonstrating the feasibility of training with no-clean references. Additionally, comparable results were observed, both visually and numerically, when employing denoised and noisy references for training. Moreover, utilizing a network trained with synthetically generated LR images from HR counterparts proved effective in yielding satisfactory SR results, in certain cases even outperforming training with real pairs. The proposed SR network was compared quantitatively and qualitatively with several established SR techniques, showcasing either the superiority or competitiveness of the proposed method in recovering fine details while mitigating noise.
Index Terms:
Electron microscopy, neuroscience, no-clean-reference, super-resolution, deep learning.
I Introduction
Three-dimensional electron microscopy (3D-EM) is an essential technique to investigate brain tissue ultrastructures as it allows for their 3D visualization at nanometer resolution [1, 2]. Studying brain tissue ultrastructures requires high-resolution (HR) images over a large field of view (FoV) of the brain tissue. However, since imaging at higher resolutions demands denser sampling, it takes more time, proportionally increasing imaging cost and potential sample damage. Moreover, HR imaging over a large FoV is not feasible under realistic imaging constraints, demanding a trade-off between imaging resolution and FoV. The higher the resolution, the smaller the FoV [3]. Furthermore, imperfect components of imaging systems introduce noise into the images [4]. The mentioned limitations collectively prevent acquiring clean HR EM images over a large FoV of brain tissue, impeding subsequent brain ultrastructure analysis and visualization.
A practical approach to mitigate such limitations and provide clean HR EM images over a large tissue volume includes the following steps: I) low-resolution (LR) imaging of brain samples over a large FoV of interest, II) HR imaging over a small but representative portion of the same samples covered by the LR FoV, and III) utilizing an image super-resolution (SR) technique to computationally reconstruct high-quality HR 3D-EM images from the LR 3D-EM images of brain tissue, which are typically contaminated with noise, artifacts, and distortions.
SR is a low-level vision task that can serve as an integral preprocessing step for many image analyses in neuroscience [5, 6, 7]. It aims to recover the latent clean HR image $\mathbf{x}$ from a degraded LR observation $\mathbf{y}$:

$$\mathbf{y} = \mathcal{H}_{\boldsymbol{\delta}}(\mathbf{x}), \tag{1}$$

where $\mathcal{H}$ is the degradation function parameterized by $\boldsymbol{\delta}$, which is non-invertible, making SR an ill-posed inverse problem. $\boldsymbol{\delta}$ includes the convolution operator with the blur kernel $\mathbf{k}$, the $s$-fold under-sampling operator $\downarrow_s$, and noise ($\mathbf{n}$) [8]. In practice, $\boldsymbol{\delta}$ is unknown and we only have the LR observation.
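For illustration, the forward degradation in (1), specialized to blur, under-sampling, and additive noise, can be simulated directly; below is a minimal PyTorch sketch, assuming a single-channel image, a normalized odd-sized blur kernel, and white Gaussian noise (illustrative assumptions, not the microscope's actual degradation):

```python
import torch
import torch.nn.functional as F

def degrade(x, k, s, sigma):
    """Simulate y = (x conv k), s-fold under-sampled, plus noise n.

    x: clean HR image, shape (1, 1, H, W)
    k: normalized blur kernel, shape (1, 1, kh, kw), odd-sized
    s: integer under-sampling factor
    sigma: standard deviation of the additive Gaussian noise n
    """
    blurred = F.conv2d(x, k, padding=k.shape[-1] // 2)  # convolution with k
    lr = blurred[..., ::s, ::s]                         # s-fold under-sampling
    return lr + sigma * torch.randn_like(lr)            # additive noise n
```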
SR methods can be categorized into two groups: model-based and learning-based methods. Model-based SR methods approximate the degradation function in (1) as a combination of several operations. Assuming that the blurring kernel $\mathbf{k}$ and under-sampling operator $\downarrow_s$ are known and the noise $\mathbf{n}$ is additive:

$$\mathbf{y} = (\mathbf{x} \otimes \mathbf{k})\downarrow_s + \,\mathbf{n}. \tag{2}$$
An estimate $\hat{\mathbf{x}}$ of the HR image can then be obtained by the maximum a posteriori (MAP) formulation as:

$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \big\|\mathbf{y} - (\mathbf{x} \otimes \mathbf{k})\downarrow_s\big\|_p^p + \lambda\,\mathcal{R}(\mathbf{x}). \tag{3}$$

The first term is the likelihood, computed as the $\ell_p$-norm distance between the observation $\mathbf{y}$ and the degraded latent image, where $p$ is determined by the noise distribution [9, 10, 11, 12]. $\mathcal{R}(\mathbf{x})$ is the regularization term, also known as the prior term, penalizing the unknown latent image upon our prior knowledge of the data. The parameter $\lambda$ defines the trade-off between the likelihood and prior terms. To reduce the ill-posedness of SR problems, many regularization terms have been developed [13, 8], each with specific pros and cons. Notably, contributions from total variation [14], self-similarity [15], low-rankness [12], and sparse representation [16] have played a significant role in improving SR performance, among others. Crafted priors enhance SR but have limited performance compared to data-driven methods [8]. Effective SR models involve optimizing multiple priors, which is time- and memory-consuming and requires tuning trade-off parameters. Additionally, SR models are specific to certain degradation settings, necessitating separate models for each degradation. Mismatched LR images with different degradations may result in severe artifacts due to domain gaps [17].
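For concreteness, a minimal sketch of a model-based solver for (3) is given below, assuming a known blur kernel, a known under-sampling factor, an $\ell_2$ likelihood, and an anisotropic total-variation prior [14], minimized by plain gradient descent; practical model-based solvers combine more elaborate priors and optimizers, as discussed above:

```python
import torch
import torch.nn.functional as F

def tv(x):
    """Anisotropic total-variation prior R(x), as in [14]."""
    return (x[..., 1:, :] - x[..., :-1, :]).abs().sum() + \
           (x[..., :, 1:] - x[..., :, :-1]).abs().sum()

def map_sr(y, k, s, lam=0.01, steps=500, lr=0.1):
    """Gradient-descent MAP solver for (3); y: LR observation (1, 1, h, w),
    k: known odd-sized blur kernel, s: known under-sampling factor."""
    x = F.interpolate(y, scale_factor=s, mode="bicubic").requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        degraded = F.conv2d(x, k, padding=k.shape[-1] // 2)[..., ::s, ::s]
        loss = 0.5 * (degraded - y).pow(2).sum() + lam * tv(x)  # likelihood + prior
        loss.backward()
        opt.step()
    return x.detach()
```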
Learning-based SR methods learn a mapping between LR and HR image spaces, which is then used to restore the HR image from the given LR input image. Early work, pioneered by [18], restored HR images by capturing the co-occurrence prior between LR and HR image patches. Numerous patch-based methods have been introduced relying on manifold learning [19], filter learning [20], regression [21], and sparse representation [22]. Deep neural network (DNN)-based SR methods have demonstrated remarkable performance [13]. DNNs with end-to-end training avoid the need for explicit design of priors or degradations. Instead, priors and degradations are encapsulated in the training datasets. The commonly used DNN architectures include convolutional neural networks (CNNs) [23, 24], generative adversarial networks (GANs) [25, 26, 27], vision transformers (ViTs) [28, 29], and denoising diffusion probabilistic models (DDPMs) [30, 31]. In this realm, many works in computer vision and biomedical imaging define a specific degradation function to synthesize LR images from HR counterparts to generate the training data [8]. Several studies were also conducted to incorporate the interpretability of model-based methods into end-to-end learning, e.g., deep unfolding [32, 33, 34], Plug-and-Play (PnP) [35, 36, 37, 38], and deep equilibrium learning [39, 40]. Although most of these degradation-oriented SR approaches lead to satisfactory results on benchmark datasets, they fail to restore high-quality images when it comes to real-world applications [17], such as brain EM images, which is the focus of this study.
Computational approaches to the super-resolution of EM have been studied in the health and material sciences [41, 42, 43, 44]. As a pioneer, [42] proposed a material-specific PnP approach to super-resolve LR EM. Their method was based on the MAP formulation, where the likelihood term was based on a linear degradation model and the prior term was a library-based non-local means (LB-NLM) denoiser designed on HR EM images acquired within a small FoV. The presence of HR edges and textures corresponding to the LR input image in the designed library yielded super-resolved results with fine details. To reduce the computational expenses and improve generalization, the authors in [45] replaced the LB-NLM denoiser with an off-the-shelf Gaussian denoiser, leading to the version of PnP typically used in biomedical applications. However, both methods [42, 45] are essentially model-based, computationally cumbersome, and limited to their degradation models. Experiments in both studies were conducted on EM datasets acquired from nano-materials with simple textural information, which sparsely recurred throughout the image. By leveraging the unique characteristics of such images, the authors in [46] devised a patch-based strategy on acquired pairs of LR and HR EM images in the training of LB-NLM, resulting in better performance than the original LB-NLM method but inferior to DNN-based methods. The authors in [44] introduced a DNN-based SR, named point-scanning super-resolution (PSSR), for EM brain images. They proposed a degradation operator, i.e., a crappifier, to synthesize LR images from acquired HR counterparts, where the crappifier included additive Gaussian noise followed by a down-sampling operator. Using synthetic pairs of LR and HR EM images, they trained a UNet-based residual neural network. The performance of the method was then compared only with bilinear interpolation. Although synthesizing pairs of LR and HR EM images can reduce imaging costs, it can increase the domain gap between the input LR EM and the trained SR model.
DNN-based SR methods can implicitly learn EM degradations if trained with acquired and matched pairs of LR and HR images. However, many challenges impede designing DNN-based frameworks with such data. First, EM images inherently contain noise and artifacts derived from the microscope, sample, and experimental settings. Hence, there is no clean EM image to be used as the reference for training the network. Further, networks pre-trained on natural images cannot restore high-quality brain EM images due to the considerable difference in the physics behind photography and EM, as well as the content dissimilarity between natural and brain EM images. Hence, deploying and designing SR methods for EM images demands specific considerations. In this work, we illustrate and address the mentioned challenges of the SR of EM images. Our key contributions are as follows:
• Investigating training using no-clean references for the $\ell_2$ and $\ell_1$ loss functions.
• Introducing a DL-based image SR framework for EM, named EMSR, equipped with edge-attention and self-attention mechanisms for enhanced edge recovery. Sharing the network's modules between the original noisy LR EM image and its noisier version makes it noise-robust.
• Comparing various training strategies focusing on EM images, including training from pairs of physically acquired LR and HR, synthetically generated LR and HR, as well as LR and denoised HR EM images.
The remainder of this article is organized as follows: Section II describes the proposed image super-resolution method, Section III delves into experimental results, and finally, Section IV concludes the article.
II Proposed Method

Supervised training of a network requires numerous pairs of corrupted LR and corresponding clean reference images. However, brain EM images inevitably include different types of noise, artifacts, and distortions caused by the imaging system and experimental settings. Therefore, clean EM images that could serve as references are unavailable. Here, we investigate training a neural network for EM SR using physically acquired pairs of LR and HR EM images contaminated with real noise-like corruptions.
II-A Electron Microscopy Super-Resolution
In serial block-face scanning electron microscopy (SBEM), a focused high-energy electron beam scans the sample surface, resulting in the acquisition of a 2D image in the $xy$-plane. The diamond knife subsequently removes the top layer of the sample to a specific thickness in the $z$-direction, revealing the next block-face for imaging. The repetition of this process generates a series of 2D images that are stacked to form a 3D volume image, as illustrated in Fig. 1.
The observed block-face image $\mathbf{y}$ is affected by the underlying microscope degradation $\mathcal{H}$ parameterized by $\boldsymbol{\delta}$, $\mathbf{y} = \mathcal{H}_{\boldsymbol{\delta}}(\mathbf{x})$, where $\mathbf{x}$ denotes the latent image that we aim to restore and $s$ is the resolution ratio between the HR and LR images, i.e., the super-sampling ratio. Theoretically, the SR process is to recover the unknown $\mathbf{x}$ via $\hat{\mathbf{x}} = \mathcal{H}^{-1}_{\boldsymbol{\delta}}(\mathbf{y})$, demanding the degradation inversion $\mathcal{H}^{-1}_{\boldsymbol{\delta}}$. If such a mapping exists, we can obtain HR observations through LR imaging, practically accelerating imaging by a factor governed by $s$. Microscope degradation parameters $\boldsymbol{\delta}$ can arise from various sources [4, 47, 48]. These sources include electronic device components such as wires and coils, which produce thermal and electromagnetic interference that is modeled as Gaussian noise. The detector's electron-counting error introduces signal-dependent noise in EM images, which is modeled as Poisson noise. Line-by-line pixel scanning in SBEM can lead to correlated noise. Imperfect electromagnetic lenses and anodes cause blurred observations due to suboptimal focusing of the electron beam. The high-energy electron beam introduces electron charge and causes absorption-based heating. Cutting the sample with a diamond knife can introduce specific artifacts and distortions. Additionally, mechanical disturbances from the environment and microscope can introduce mechanical noise, further exacerbating image degradation.
Hence, $\mathcal{H}_{\boldsymbol{\delta}}$ cannot be well parameterized by simplified assumptions such as block-averaging neighbor pixels for the under-sampling operator [45]. Implicit modeling of the degradation function can be realized through training a neural network with acquired pairs of LR and HR EM images.

II-B Training without Clean Reference
Training with no-clean references has been studied in several image restoration tasks, including denoising, magnetic resonance image reconstruction, and text removal [49, 50, 51]. Here, our focus is on investigating such a training approach for commonly used restoration loss functions, i.e., $\ell_2$ and $\ell_1$, and discussing the corruption levels at which this training remains feasible for EM SR.
Supervised training of a network $f_{\boldsymbol{\theta}}$ for SR requires numerous pairs of degraded LR, $\mathbf{y}$, and clean reference, $\mathbf{x}$. The network's parameters $\boldsymbol{\theta}$ are obtained by optimizing the following empirical loss function:

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}}\, \mathbb{E}_{(\mathbf{y},\mathbf{x})}\big[\mathcal{L}\big(f_{\boldsymbol{\theta}}(\mathbf{y}), \mathbf{x}\big)\big]. \tag{4}$$
By applying the conditional expectation rule for the dependent random variables $\mathbf{y}$ and $\mathbf{x}$, we can reformulate (4) as follows:

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}}\, \mathbb{E}_{\mathbf{y}}\Big[\mathbb{E}_{\mathbf{x}|\mathbf{y}}\big[\mathcal{L}\big(f_{\boldsymbol{\theta}}(\mathbf{y}), \mathbf{x}\big)\big]\Big]. \tag{5}$$
The equation above implies that the network parameters can be optimized separately with respect to $\mathbf{y}$ and $\mathbf{x}$ over the loss function $\mathcal{L}$. Let $\hat{\mathbf{x}} = \mathbf{x} + \mathbf{e}$, where $\mathbf{e} \in \mathbb{R}^N$ is i.i.d. additive noise with mean $\mu\mathbf{1}$ and covariance $\sigma^2\mathbf{I}$, where $\mathbf{I}$ is the $N \times N$ identity matrix and $\mathbf{1}$ is the all-ones vector.
In the case the loss function is $\ell_2$, we can derive an equality that links the solutions of the reference-dependent component in (5) for $\mathbf{x}$ and $\hat{\mathbf{x}}$ as follows (see Appendix A-A):

$$\mathbb{E}_{\hat{\mathbf{x}}|\mathbf{y}}\big[\|f_{\boldsymbol{\theta}}(\mathbf{y})-\hat{\mathbf{x}}\|_2^2\big] = \mathbb{E}_{\mathbf{x}|\mathbf{y}}\big[\|f_{\boldsymbol{\theta}}(\mathbf{y})-\mathbf{x}\|_2^2\big] - 2\mu\,\mathbf{1}^{\top}\,\mathbb{E}_{\mathbf{x}|\mathbf{y}}\big[f_{\boldsymbol{\theta}}(\mathbf{y})-\mathbf{x}\big] + N\sigma^2 + N\mu^2. \tag{6}$$
The equation above states that when $\mu$ is close to zero ($\mu \approx 0$), the second term on the right-hand side of the equation becomes negligible, i.e., $-2\mu\,\mathbf{1}^{\top}\mathbb{E}_{\mathbf{x}|\mathbf{y}}\big[f_{\boldsymbol{\theta}}(\mathbf{y})-\mathbf{x}\big] \approx 0$. Additionally, the third term $N\sigma^2$, which is the noise variance, and the fourth term $N\mu^2$, which stems from the noise mean, are independent of $\boldsymbol{\theta}$ and have no effect on the total optimization problem. Therefore, if we substitute the clean image $\mathbf{x}$ with a random variable $\hat{\mathbf{x}}$ that satisfies $\mathbb{E}[\hat{\mathbf{x}}] \approx \mathbb{E}[\mathbf{x}]$, the network's parameters will remain close to optimal. This enables us to replace the clean reference with its corrupted version $\hat{\mathbf{x}}$, provided their expectation values are sufficiently close, which can be accompanied by the practical assumption that noise should not significantly alter the overall variability and structure of the original image, i.e., $\sigma \ll \sigma_x$.
In the case of the $\ell_1$ loss, we can establish the relationship between the solutions of the reference-dependent part in (5) for both $\mathbf{x}$ and $\hat{\mathbf{x}}$ as below (see Appendix A-B):

$$\Big|\mathbb{E}_{\hat{\mathbf{x}}|\mathbf{y}}\big[\|f_{\boldsymbol{\theta}}(\mathbf{y})-\hat{\mathbf{x}}\|_1\big] - \mathbb{E}_{\mathbf{x}|\mathbf{y}}\big[\|f_{\boldsymbol{\theta}}(\mathbf{y})-\mathbf{x}\|_1\big]\Big| \le N\sqrt{\sigma^2 + \mu^2}, \tag{7}$$
where $N$ is the dimension of $\mathbf{e}$. The inequality above suggests that the difference between the reference-dependent solutions for $\mathbf{x}$ and $\hat{\mathbf{x}}$ is bounded by a function of $\mu$ and $\sigma$. When $\mu$ is small, it significantly reduces the dependence on the noise mean and tightens the upper bound, which becomes primarily dependent on $\sigma$ through $N\sqrt{\sigma^2 + \mu^2} \approx N\sigma$. This implies that weak noise reduces the reliance on $\hat{\mathbf{x}}$ and indicates that it will not significantly alter the overall optimization problem (5). In other words, the network's parameters will remain near optimal even if we replace the clean image $\mathbf{x}$ with its noisy version $\hat{\mathbf{x}}$, as long as $\mu \approx 0$ and the overall structure of the clean image is not significantly altered by noise, $\sigma \ll \sigma_x$.
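The practical consequence can be checked numerically. The toy sketch below (an illustration, not an experiment from this work) fits a one-parameter model with the $\ell_2$ loss against clean and zero-mean-corrupted references; in line with (6), both runs reach nearly the same minimizer:

```python
import torch

torch.manual_seed(0)
n = 100_000
x = torch.rand(n)                  # latent clean targets
y = x + 0.10 * torch.randn(n)      # degraded inputs
x_hat = x + 0.05 * torch.randn(n)  # corrupted references (mu = 0)

def fit(inputs, targets, steps=2000, lr=0.05):
    """Fit a one-parameter 'network' f(y) = w*y with the l2 loss."""
    w = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        ((w * inputs - targets) ** 2).mean().backward()
        opt.step()
    return w.item()

# The zero-mean reference noise only adds theta-independent terms to the
# loss, so the two estimates should agree up to sampling error.
print(fit(y, x), fit(y, x_hat))
```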
These observations hold the promise that the network can be trained under real-world scenarios where the reference is contaminated with weak noise-like corruptions. Here, we aim to determine the rough acceptable level of these corruptions in brain EM imaging (whose sources were discussed in Section II-A), upon the noise statistics $\mu$ and $\sigma$ that overshadow training with no-clean references. Suppose we can decompose $\hat{\mathbf{x}}$ into the clean component $\mathbf{x}$ and the noise-like corruption component $\mathbf{e}$, $\hat{\mathbf{x}} = \mathbf{x} + \mathbf{e}$. We can then establish the following relationships:
$$\mathbb{E}[\hat{\mathbf{x}}] = \mathbb{E}[\mathbf{x} + \mathbf{e}] = \mathbb{E}[\mathbf{x}] + \mu\,\mathbf{1}, \tag{8a}$$
$$\operatorname{Var}[\hat{\mathbf{x}}] = \operatorname{Var}[\mathbf{x}] + \sigma^2 = \sigma_x^2 + \sigma^2. \tag{8b}$$
The inequalities (9a) and (9b) correspond to $\mu \approx 0$ and $\sigma \ll \sigma_x$, which are the requirements for training using pairs of corrupted images under the $\ell_2$ and $\ell_1$ loss functions, and they guarantee that the content of the underlying image, with mean $\mu_x$ and variance $\sigma_x^2$, is much stronger than the corruptions:
$$|\mu| \ll \mu_x, \tag{9a}$$
$$\sigma^2 \ll \sigma_x^2. \tag{9b}$$
The level of corruption in EM is mostly much lower than the image content information, satisfying (9) and allowing for training the network from pairs of corrupted images. It is worth mentioning that rare image slices may exhibit levels of corruption inconsistent with the constraints stated in (9). These corruptions act as anomalies that the network is unable to learn.
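In practice, the statistics in (8)-(9) can be estimated per image slice; a sketch is given below, assuming the denoised HR image obtained via [60] serves as a stand-in for the clean component, and with illustrative (not prescribed) thresholds:

```python
import torch

def corruption_check(x_hat, x_clean, tol_mean=1e-2, tol_var=1e-1):
    """Empirically test (9) for one slice.

    x_hat: acquired (noisy) HR reference; x_clean: its denoised version,
    used here as a surrogate for the clean image x."""
    e = x_hat - x_clean                  # noise-like corruption component
    mu, sigma2 = e.mean(), e.var()       # corruption statistics
    ok_9a = (mu.abs() / x_clean.mean().abs()) < tol_mean  # mean shift negligible
    ok_9b = (sigma2 / x_clean.var()) < tol_var            # corruption power weak
    return mu.item(), sigma2.item(), bool(ok_9a), bool(ok_9b)
```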

II-C Network Architecture
The proposed SR network, which is designed for training using pairs of corrupted LR and HR EM images, is depicted in figure 2. It consists of three key modules: feature extractor, edge attention, and reconstruction. These modules are shared between the given LR image and its noisier version.
II-C1 Feature Extractor
The feature extractor ($\mathcal{H}_{\mathrm{FE}}$) is employed to extract shallow features ($\mathbf{F}_s$) from the given LR image $\mathbf{y}$. It includes a projection, which is a convolutional filter. The extraction process is formulated by:

$$\mathbf{F}_s = \mathcal{H}_{\mathrm{FE}}(\mathbf{y}). \tag{10}$$
II-C2 Edge Attention
The edge-attention module ($\mathcal{H}_{\mathrm{EA}}$) takes $\mathbf{F}_s$ and $\mathbf{y}$ as input, extracting deep features and combining them with edge information using multi-scale edge-attention and self-attention mechanisms, yielding the generation of edge-attentioned features ($\mathbf{F}_{ea}$). The calculation of the edge-attention module is summarized as:

$$\mathbf{F}_{ea} = \mathcal{H}_{\mathrm{EA}}(\mathbf{F}_s, \mathbf{y}). \tag{11}$$
The module consists of basic blocks, as shown in Figure 3. In each basic block, the input features pass through residual blocks with well-studied benefits [52] and then take two parallel paths. In the upper path, the features are fed into residual blocks to produce deep features that are then enhanced using edge information. In the lower path, the features go through convolutional operations and ReLU activation to reconstruct the image in the LR space. The reconstructed image, along with the original LR image, is then fed to the atrous wavelet (ATW) [53], a noise-robust feature extractor, to extract multi-scale edges. The resulting multi-scale edge features are then subjected to concatenation and filtering before being input into the attention block. The attention block generates multi-scale attention maps specifically focused on the deep feature edges. Finally, the attention maps and deep features are combined through element-wise multiplication. The resulting attended features are then added to the features from the upper path, leading to the generation of multi-scale edge-attention features. Subsequently, these features are passed into a ViT block, employing a window-based multi-head self-attention mechanism to capture both local and global image dependencies within the deep multi-scale edge-attention features, and finally pass through convolution layers.
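For reference, the multi-scale edge extraction can be sketched as follows; this is a minimal à trous wavelet decomposition using a B3-spline kernel, a common choice for ATW, so the exact kernel and number of scales here are assumptions rather than the trained configuration:

```python
import torch
import torch.nn.functional as F

def atw_edges(img, levels=3):
    """Multi-scale detail (edge) planes via the a-trous wavelet transform.

    The smoothing kernel is dilated by 2**j at level j; the wavelet plane
    w_j = c_j - c_{j+1} captures edge information at scale j."""
    h = torch.tensor([1., 4., 6., 4., 1.]) / 16.0
    k = (h[:, None] * h[None, :]).view(1, 1, 5, 5)  # separable B3-spline kernel
    c, planes = img, []
    for j in range(levels):
        size = 4 * 2 ** j + 1
        kj = torch.zeros(1, 1, size, size)
        kj[..., ::2 ** j, ::2 ** j] = k             # insert 2**j - 1 holes
        c_next = F.conv2d(c, kj, padding=size // 2)
        planes.append(c - c_next)                   # detail plane at scale j
        c = c_next
    return planes  # img: (1, 1, H, W) -> list of per-scale edge maps
```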
Vision Transformer (ViT): ViTs divide a feature map into a sequence of small patches, forming local windows, and utilize self-attention mechanisms to understand the relationships among them. This capacity to comprehend diverse image dependencies is crucial for representation learning performance in low-level vision tasks such as SR. To capture both global and local image dependencies while maintaining computational efficiency, we adopt the window-based multi-head self-attention (W-MSA) method [29]. The attention maps generated by W-MSA are then processed through the feed-forward network (FFN). These W-MSA and FFN components are integrated into a ViT block, as illustrated in Figure 3, and their computations are outlined below:
$$\mathbf{X}' = \text{W-MSA}\big(\mathrm{LN}(\mathbf{X})\big) + \mathbf{X}, \qquad \mathbf{X}'' = \mathrm{FFN}\big(\mathrm{LN}(\mathbf{X}')\big) + \mathbf{X}', \tag{12}$$

where LN is layer normalization and $\mathbf{X}$ is the input feature map.
In the W-MSA, the input feature map of size $H \times W \times C$ is initially divided into non-overlapping local windows of size $M \times M$, resulting in $\frac{HW}{M^2}$ local feature maps $\mathbf{X}_i \in \mathbb{R}^{M^2 \times C}$. Each of these local feature maps then undergoes the standard self-attention mechanism, with the following calculation:

$$\mathbf{Q} = \mathbf{X}_i\mathbf{W}_Q, \quad \mathbf{K} = \mathbf{X}_i\mathbf{W}_K, \quad \mathbf{V} = \mathbf{X}_i\mathbf{W}_V, \tag{13}$$
where $\mathbf{W}_Q$, $\mathbf{W}_K$, and $\mathbf{W}_V$ represent the query ($\mathbf{Q}$), key ($\mathbf{K}$), and value ($\mathbf{V}$) projection matrices, respectively; $d$ is determined as $d = C/h$, where $h$ denotes the number of attention heads. The attention matrix is computed using the self-attention mechanism within the $i$-th head of a local window:

$$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{SoftMax}\big(\mathbf{Q}\mathbf{K}^{\top}/\sqrt{d}\big)\,\mathbf{V}. \tag{14}$$
The concatenation of all attention heads results in the multi-head self-attention (W-MSA) output.
FFN is a multi-layer perceptron (MLP) used to introduce additional non-linearity to the model through two fully connected layers and ReLU activation.
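A compact sketch of the W-MSA computation in (13)-(14) is shown below; the channel, head, and window sizes are illustrative rather than the trained configuration, and the relative position bias used in [29] is omitted for brevity:

```python
import torch
import torch.nn as nn

class WMSA(nn.Module):
    """Window-based multi-head self-attention (sketch, after [29])."""
    def __init__(self, C=64, heads=8, M=4):
        super().__init__()
        self.h, self.M, self.d = heads, M, C // heads
        self.qkv = nn.Linear(C, 3 * C)   # W_Q, W_K, W_V as one projection
        self.proj = nn.Linear(C, C)

    def forward(self, x):                # x: (B, H, W, C), H and W divisible by M
        B, H, W, C = x.shape
        M = self.M
        # Partition into non-overlapping M x M local windows.
        win = x.view(B, H // M, M, W // M, M, C).permute(0, 1, 3, 2, 4, 5)
        win = win.reshape(-1, M * M, C)  # (B * num_windows, M*M, C)
        q, k, v = self.qkv(win).chunk(3, dim=-1)
        heads = lambda t: t.view(-1, M * M, self.h, self.d).transpose(1, 2)
        q, k, v = heads(q), heads(k), heads(v)            # (nW*B, h, M*M, d)
        attn = (q @ k.transpose(-2, -1)) / self.d ** 0.5  # QK^T / sqrt(d)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, M * M, C)
        out = self.proj(out).view(B, H // M, W // M, M, M, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```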
II-C3 Reconstruction
Shallow features predominantly consist of low frequencies, capturing the overall structure, while the deep features encompass high frequencies corresponding to lost fine details. The long skip connection provides the reconstruction module with low frequencies and makes the training more stable. Further, it helps the edge-attention module focus on learning fine details. The element-wise summation of shallow and deep features in the LR space is fed to the reconstruction module ($\mathcal{H}_{\mathrm{REC}}$) to generate the super-resolved image $\hat{\mathbf{x}}$ with enhanced resolution:

$$\hat{\mathbf{x}} = \mathcal{H}_{\mathrm{REC}}(\mathbf{F}_s + \mathbf{F}_{ea}). \tag{15}$$
The reconstruction module includes an up-sampling process that enlarges the features by pixel shuffling [54]. This up-sampling step is followed by a mapper module including convolution layers, which yields the super-resolved image.
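A minimal sketch of such a reconstruction head is given below, assuming a single-channel output and one pixel-shuffling stage (an actual implementation may cascade several):

```python
import torch.nn as nn

class Reconstruction(nn.Module):
    """Pixel-shuffle up-sampling [54] followed by a convolutional mapper."""
    def __init__(self, C=60, s=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(C, C * s * s, 3, padding=1),  # expand channels by s^2
            nn.PixelShuffle(s),                     # (C*s^2, H, W) -> (C, s*H, s*W)
            nn.Conv2d(C, 1, 3, padding=1),          # map features to the SR image
        )

    def forward(self, feats):                       # feats = F_s + F_ea
        return self.body(feats)
```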
II-C4 Weight Sharing
The aforementioned modules are shared between the given LR EM image and its noisier version, as illustrated in figure 2. The weight sharing encourages the network to produce consistent outputs for both the given LR image and its noisier version, establishing a noise-robust framework for training. This strategy mitigates the absence of a clean reference: The prediction generated from the given LR EM image serves as a reference for the noisier LR EM branch in a self-supervised approach.
II-D Loss Functions
We employ the $\ell_p$-norm loss, $\mathcal{L}_p(\hat{\mathbf{x}}, \mathbf{x}) = \|\hat{\mathbf{x}} - \mathbf{x}\|_p^p$ with $p \in \{1, 2\}$, as a pixel-wise distance measure between the network's prediction $\hat{\mathbf{x}}$ and ground truth $\mathbf{x}$ [13, 55]. Our loss function measures the mismatch between the two network outputs and the reference $\mathbf{x}^{\mathrm{r}}$, namely $\mathcal{L}(\hat{\mathbf{x}}, \mathbf{x}^{\mathrm{r}})$ and $\mathcal{L}(\hat{\mathbf{x}}_n, \mathbf{x}^{\mathrm{r}})$, as well as the mismatch between the two outputs, $\mathcal{L}(\hat{\mathbf{x}}, \hat{\mathbf{x}}_n)$, where $\hat{\mathbf{x}}$ and $\hat{\mathbf{x}}_n$ denote the predictions from the given LR image and its noisier version; see Figure 2. The total loss is then defined as:

$$\mathcal{L}_{\mathrm{total}} = \alpha\,\mathcal{L}(\hat{\mathbf{x}}, \mathbf{x}^{\mathrm{r}}) + \beta\,\mathcal{L}(\hat{\mathbf{x}}_n, \mathbf{x}^{\mathrm{r}}) + \gamma\,\mathcal{L}(\hat{\mathbf{x}}, \hat{\mathbf{x}}_n), \tag{16}$$
where $\alpha$, $\beta$, and $\gamma$ are hyperparameters that govern the trade-off between the components.
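The computation of (16) can be sketched as follows; the weight values and the choice $p = 1$ below are placeholders rather than the trained settings:

```python
def total_loss(model, y, y_noisier, x_ref, alpha=1.0, beta=1.0, gamma=1.0, p=1):
    """Weight-shared loss of (16): both branches use the same model."""
    out = model(y)                # prediction from the LR input
    out_n = model(y_noisier)      # prediction from its noisier version
    dist = lambda a, b: (a - b).abs().pow(p).mean()  # pixel-wise l_p distance
    return (alpha * dist(out, x_ref)      # output vs. no-clean reference
            + beta * dist(out_n, x_ref)   # noisier branch vs. reference
            + gamma * dist(out, out_n))   # consistency between the two branches
```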

III Experimental Settings and Results
III-A Datasets
We conducted experiments using nine LR and HR 3D-EM datasets acquired from the corpus callosum and cingulum regions associated with the white matter of five rat brains [56]. These datasets were acquired both ipsi- and contra-laterally. In four animals, both ipsi- and contra-lateral datasets were available, while in one animal, only ipsi-lateral data was available. Both LR and HR datasets were acquired simultaneously using the SBEM technique. The LR datasets were obtained from large tissue volumes of , with a voxel size of , while the HR datasets were acquired from smaller tissue volumes of , which were covered by the LR FoV, with a voxel size of . The LR and HR 3D-EM datasets totaled approximately two hundred gigabytes in size. The pairs of LR and HR images from the small FoV were utilized in the experiments. In terms of dimensions, the LR and HR 3D-EM pairs had size ranges respectively within and voxels. Animal procedures were approved by the Committee of the Provincial Government of Southern Finland, following European Community Council Directives 86/609/EEC.
III-B Settings
III-B1 Training
Datasets were augmented by adding random zero-mean white Gaussian noise with a standard deviation of , applying random rotation of , and horizontal/vertical flipping on the input data. The noisier version of the input image was generated by adding random zero-mean white Gaussian noise with a standard deviation of . The network was optimized using Adam [57] for steps. The initial learning rate was set to and halved every steps. The network implementation was done using the PyTorch framework. Hyperparameters were set as follows: , , and . In the attention block, three scales of edges extracted by ATW were used. The edge-attention module was configured with three basic blocks (). Each basic block had four residual blocks (), followed by two parallel sets of residual blocks (). The ViT block was equipped with sixteen attention heads (), a patch size of four (), and a multi-layer-perceptron ratio of two. The network maintained a constant channel number of sixty () and utilized a batch size of two during training.
III-B2 Comparison
In our comparative analysis, we assessed the performance of our method with the loss function alongside several SR techniques, including standard bicubic interpolation, DPIR [36], PSSR [44], and SwinIR [29], setting hyper-parameters as in the respective papers. As a preprocessing step, we first utilized bicubic interpolation to resize both the LR and HR images to the closest integer resolution ratio between them. Specifically, we resized the LR and HR images to dimensions of and , leading to a resolution ratio of , where is the number of slices. We conducted comparative experiments using pairs of LR and HR EM images. Additionally, we investigated three training strategies for the proposed method: training using I) real LR and HR image pairs, II) synthetic LR and HR image pairs, and III) LR and denoised HR image pairs. For synthetic training, LR images were generated using two scenarios: first, by bicubically down-sampling HR images, a common practice in computer vision, which we refer to as synthetic (I); second, by introducing a random isotropic Gaussian kernel () and random zero-mean Gaussian noise () during training, followed by bicubic down-sampling and the addition of random zero-mean Gaussian noise (), which we refer to as synthetic (II).
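A sketch of the synthetic (II) pipeline is given below; the kernel-width and noise-level ranges are illustrative assumptions, not the exact training ranges:

```python
import random
import torch
import torch.nn.functional as F

def synthetic_ii(hr, s=4):
    """Generate an LR image from an HR one: random isotropic Gaussian blur
    and noise, bicubic down-sampling, then further noise. hr: (1, 1, H, W)."""
    sigma_k = random.uniform(0.2, 3.0)               # random kernel width
    ax = torch.arange(-10, 11, dtype=torch.float32)
    k1d = torch.exp(-ax ** 2 / (2 * sigma_k ** 2))
    k = k1d[:, None] * k1d[None, :]
    k = (k / k.sum()).view(1, 1, 21, 21)             # isotropic Gaussian kernel
    x = F.conv2d(hr, k, padding=10)
    x = x + random.uniform(0.0, 0.02) * torch.randn_like(x)     # pre-decimation noise
    x = F.interpolate(x, scale_factor=1 / s, mode="bicubic")    # bicubic down-sampling
    return x + random.uniform(0.0, 0.02) * torch.randn_like(x)  # post-decimation noise
```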
III-C Quality Evaluation Metrics
To quantitatively assess the effectiveness of the proposed method and compare it with others, we considered three image quality metrics: the structural similarity index (SSIM) [58] and peak signal-to-noise ratio (PSNR) as standard metrics, as well as the Fourier ring correlation (FRC) [59], which has been utilized for evaluating EM SR [45].
III-C1 SSIM
The SSIM quantifies the similarity between restored and reference images in terms of luminance, contrast, and structure. It is calculated by:
$$\mathrm{SSIM}(\hat{\mathbf{x}}, \mathbf{x}) = \frac{\big(2\mu_{\hat{x}}\mu_{x} + c_1\big)\big(2\sigma_{\hat{x}x} + c_2\big)}{\big(\mu_{\hat{x}}^2 + \mu_{x}^2 + c_1\big)\big(\sigma_{\hat{x}}^2 + \sigma_{x}^2 + c_2\big)}, \tag{17}$$

where $\mu_{\hat{x}}$ and $\mu_{x}$ are the average pixel intensities of $\hat{\mathbf{x}}$ and $\mathbf{x}$ (luminance), $\sigma_{\hat{x}}$ and $\sigma_{x}$ are the standard deviations of the $\hat{\mathbf{x}}$ and $\mathbf{x}$ pixel intensities (contrast), while $\sigma_{\hat{x}x}$ represents the covariance between $\hat{\mathbf{x}}$ and $\mathbf{x}$ (structural similarity). $c_1$ and $c_2$ are small positive constants for division stability, typically set as $c_1 = (0.01L)^2$ and $c_2 = (0.03L)^2$ relative to the maximum pixel value $L$.
III-C2 PSNR
The PSNR measures the ratio of the maximum pixel value $L$ to the mean square error (MSE) between the reconstructed image and the ground truth as below:

$$\mathrm{PSNR}(\hat{\mathbf{x}}, \mathbf{x}) = 10\log_{10}\frac{L^2}{\mathrm{MSE}(\hat{\mathbf{x}}, \mathbf{x})}. \tag{18}$$
III-C3 FRC
The FRC measures the correlation between the reconstructed image and the reference in the frequency domain when the spectrum is subdivided into concentric rings $r$, i.e., $r \in \{1, \dots, R\}$. FRC is calculated using the following formula:

$$\mathrm{FRC}(r) = \frac{\sum_{i \in r} \hat{X}(i)\,X^{*}(i)}{\sqrt{\sum_{i \in r} |\hat{X}(i)|^2 \sum_{i \in r} |X(i)|^2}}, \tag{19}$$
where $\hat{X}$ and $X$ are the Fourier transforms of $\hat{\mathbf{x}}$ and $\mathbf{x}$ over ring $r$ ($^{*}$ denoting complex conjugation), and (19) provides the spectral correlation as a function of spatial frequency. The average correlation across the spectrum is denoted by $\overline{\mathrm{FRC}}$.
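For reference, a minimal 2D FRC implementation following (19) is sketched below; the number of rings is an illustrative choice:

```python
import torch

def frc(x_hat, x, n_rings=32):
    """Fourier ring correlation between a reconstruction and its reference.
    x_hat, x: real-valued images of shape (H, W); returns one value per ring."""
    X1 = torch.fft.fftshift(torch.fft.fft2(x_hat))
    X2 = torch.fft.fftshift(torch.fft.fft2(x))
    H, W = x.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(H) - H // 2,
                            torch.arange(W) - W // 2, indexing="ij")
    radius = torch.sqrt(yy.float() ** 2 + xx.float() ** 2)
    edges = torch.linspace(0, radius.max(), n_rings + 1)
    curve = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        ring = (radius >= lo) & (radius < hi)          # concentric ring mask
        num = (X1[ring] * X2[ring].conj()).sum().real  # cross-spectrum over ring
        den = torch.sqrt((X1[ring].abs() ** 2).sum()
                         * (X2[ring].abs() ** 2).sum()) + 1e-12
        curve.append((num / den).item())
    return curve  # averaging over rings gives the reported FRC score
```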
In the numerical evaluations, the denoised HR 3D-EM images, obtained through the method proposed in [60], were utilized as the ground truth references.
III-D Results
III-D1 Method Comparison
The comparative results were obtained through a five-fold cross-validation process, where data from one animal functioned as the test set, and data from the other animals were used as training sets. The quantitative results are summarized in Table (I). The reported average values, based on SSIM, emphasize the inferior performance of the bicubic method compared to the deep learning-based methods: DPIR, PSSR, SwinIR, and EMSR. Among these methods, SwinIR and EMSR [ours] obtained superior quality metrics. Our approach, employing the $\ell_1$ and $\ell_2$ loss functions, showcased the highest and second-highest scores, respectively. Similarly, the reported FRC values demonstrate the superior performance of EMSR in terms of spectral correlation between restored and ground truth images compared to the competitors, achieving the best and second-best scores when trained with the $\ell_1$ and $\ell_2$ loss functions, respectively. However, in terms of PSNR, DPIR achieved the highest score, and the PSSR method achieved the second-best PSNR. It is essential to emphasize that the effectiveness of PSNR as an evaluation metric for SR model performance is limited. This limitation arises from its pure reliance on pixel values and its inability to capture a direct structural correlation between super-resolved and ground truth images.
To conserve space, we present a curated selection of representative results in figures (5) and (7). These figures provide visual insights into scenarios where our proposed method excelled as the best and also where it did not attain the highest quantitative performance.
Figure 5 showcases results from BRAIN5 [IPSI]. In this sub-dataset, our proposed method demonstrated outstanding performance, achieving the best and second-best quantitative results, based on SSIM and FRC, when utilizing the $\ell_1$ and $\ell_2$ loss functions, respectively. In panel (a), we observe the bicubically interpolated LR image, which exhibits a lack of visual clarity and retains noise. Conversely, DL-based SR methods effectively reduce noise, as evident in Fig. 5 (d)-(h). Among these methods, DPIR, i.e., a PnP method, produces overly smooth results, particularly when restoring fine details, as shown within the regions enclosed by the ellipsoid and dashed circle. This outcome can be attributed to mismatches between the priors in the trained model and the test images. In contrast, PSSR, SwinIR, and EMSR, which were trained using EM images, exhibit the capability to restore intricate details and nuances characteristic of EM brain images. Among them, PSSR sometimes failed to restore particular intricate edges, as represented by the area confined by an ellipsoid. It also led to smeared-out edges, as indicated within the dashed circle. Similarly, SwinIR faced challenges in recovering certain edges, akin to PSSR, as showcased within the region confined by the ellipsoid. It also introduced blurred output and fuzzy edges within the area marked by the dashed circle. On the other hand, EMSR with both the $\ell_1$ and $\ell_2$ loss functions successfully super-resolved LR images by restoring intricate edges with higher contrast while avoiding blurriness. When comparing the results between $\ell_1$ and $\ell_2$, $\ell_1$ exhibited slightly better noise suppression (see the zoomed-in rectangle marked in green). These results align with the theory mentioning that the $\ell_1$ loss, in contrast to $\ell_2$, does not over-penalize large errors, resulting in fewer noise artifacts. The condition checking for training with a no-clean reference is depicted in Figure 6.
Figure 7 presents results from BRAIN2 [CONTRA]. In this subset, the SwinIR method exhibited superior performance in SSIM and FRC, collectively indicating enhanced structural capabilities. SwinIR did not achieve the highest PSNR, yet it maintained a satisfactory level of pixel fidelity. Panels (c) and (d) show that bicubic and DPIR generally produced over-smooth details, as denoted by the yellow arrow. Panel (e) reveals that PSSR excelled in enhancing details and contrast but faced challenges in recovering fine edges, as indicated by the yellow arrow. SwinIR and EMSR, panels (f)-(h), showcased superior resolution enhancement and noise reduction. Particularly, SwinIR delivered slightly sharper SR results, highlighted by the yellow arrow. However, the proposed method, likely due to its edge-attention mechanism, demonstrated a superior ability to super-resolve two closely situated compartments compared to SwinIR, which struggled to effectively separate them, as pointed out by the green arrows.
III-D2 Data Training Strategies
The outcomes of training with different strategies—real pairs featuring corrupted references, real pairs with a denoised reference, and synthetic LR and HR pairs (both synthetic (I) and (II))—are detailed in Table (II). The reported average quantitative results across all datasets revealed that training with an acquired HR image as the reference and with its denoised version resulted in nearly identical SSIM and FRC values, with the denoised reference exhibiting an inferior PSNR. Additionally, it was noted that training with synthetic (I) did not attain favorable super-resolution results. In contrast, synthetic (II) exhibited varying performance with promising outcomes; its average performance was slightly lower than that achieved with real pairs.
Representative results are depicted in Figure 8, spanning from inferior to superior performance. The results for BRAIN1 [IPSI] indicate that the trained network, for both synthetic (I) and (II), failed to produce satisfactory super-resolution results, as evident from the different artifacts. The underlying reason for these shortcomings lies in the inability of both bicubic down-sampling and a pool of random Gaussian noise and blurring kernels to effectively match the degradation in the input LR image. Furthermore, it demonstrates that training using real pairs with either corrupted or denoised references yielded nearly identical outputs, with only subtle differences, such as slightly more homogeneous areas in the case of training with the denoised reference; see the white arrows.

The results for BRAIN2 [CONTRA] indicate encouraging findings. The synthetic (I) training strategy yields unsatisfactory results, as it struggles to match the degradations present in the input LR image; see Figure 9. However, the results for synthetic (II), trained with a diverse range of degradations, outperform training with real pairs, generating sharper edges and enhanced contrast, as indicated by the dashed green circle. The key factor behind these results is synthetic training's ability, under well-matched degradations, to learn deblurring and denoising while super-resolving the input LR image. The low-level feature fidelity in the synthetic pairs is well preserved compared to training with acquired LR and HR images, even in the case of synthetic (I) with bicubic down-sampling, as evident in the black areas marked with asterisks. From a denoising perspective, training with real pairs may offer better performance, benefiting from the independence of the noise-like corruptions in the independently acquired LR and HR images, preventing the learning of noise-like patterns with random characteristics. Notably, both noisy and denoised reference training produce similar outputs.
BRAIN3[IPSI] showcases additional promising outcomes with the synthetic (II) strategy, demonstrating superior super-resolution performance in recovering fine details, and achieving sharp edges while mitigating noise—highlighted in areas marked by circles and arrows.
When comparing real and synthetic datasets, it is recommended to use real image pairs, as they have the potential to enhance the overall quality. The foremost advantage is learning real degradations, which are difficult to simulate; see Fig. 9. Importantly, the separate acquisition of LR and HR images leads to nearly independent noise-like corruptions. This independence is beneficial for the network, as it prevents the learning of noise-like patterns with random characteristics, so the network learns to denoise while super-resolving the LR image. Furthermore, the results indicate that while pairs of synthetic LR, derived from down-sampled HR, and acquired HR images are not always suitable as training pairs, there is potential for computationally generated pairs to advance EM super-resolution. Notably, this approach can address mismatches between acquired LR and HR pairs, i.e., co-registration and contrast, reduce imaging time, and lower costs.
III-D3 Super-Resolver as Enhancer
Applying the trained SR model to HR images of the same resolution enhances image quality. In comparison to a denoiser, it enhances the resolution while also mitigating noise; see the first row of Figure 10. However, in situations where there are mismatches between the trained model and the input image, changes in image contrast may occur, as depicted in the second row of Figure 10. This observation highlights the potential of SR methods to function as denoisers and enhancers, particularly emphasizing the practical capabilities of a self-supervised SR approach that can address such mismatches.

III-D4 Super-Resolution Can Help Avoid Distortions
EM imaging at HR may result in distortions at the image border in the $xy$-plane, a phenomenon not observed in LR imaging, as depicted in Figure 11. However, employing SR techniques enables the generation of an HR image from an LR image, effectively overcoming these distortions.

III-D5 Natural Image Pre-trained Networks on Brain EM
Figure 12 depicts the application of state-of-the-art pre-trained networks designed for natural images to brain EM. BSRGAN [25] and Real-ESRGAN [27] are two networks designed for the super-resolution of natural images, trained on natural and purely synthetic datasets, respectively. When applied to brain EM images, while these methods can restore the overall structure of large tissue compartments, they fail to recover the intricate details and nuances unique to brain EM. In particular, they tend to introduce unrealistic details and cartoonish textures, as visible in the zoomed-in areas.

Metric | Datasets | Bicubic | DPIR | PSSR | SwinIR | EMSR [OURS / $\ell_2$] | EMSR [OURS / $\ell_1$]
---|---|---|---|---|---|---|---
SSIM | BRAIN1 [IPSI] | | | | | |
 | BRAIN1 [CONTRA] | | | | | |
 | BRAIN2 [IPSI] | | | | | |
 | BRAIN2 [CONTRA] | | | | | |
 | BRAIN3 [IPSI] | | | | | |
 | BRAIN3 [CONTRA] | | | | | |
 | BRAIN4 [IPSI] | | | | | |
 | BRAIN4 [CONTRA] | | | | | |
 | BRAIN5 [IPSI] | | | | | |
 | ALL DATASETS | | | | | |
PSNR | BRAIN1 [IPSI] | | | | | |
 | BRAIN1 [CONTRA] | | | | | |
 | BRAIN2 [IPSI] | | | | | |
 | BRAIN2 [CONTRA] | | | | | |
 | BRAIN3 [IPSI] | | | | | |
 | BRAIN3 [CONTRA] | | | | | |
 | BRAIN4 [IPSI] | | | | | |
 | BRAIN4 [CONTRA] | | | | | |
 | BRAIN5 [IPSI] | | | | | |
 | ALL DATASETS | | | | | |
FRC | BRAIN1 [IPSI] | | | | | |
 | BRAIN1 [CONTRA] | | | | | |
 | BRAIN2 [IPSI] | | | | | |
 | BRAIN2 [CONTRA] | | | | | |
 | BRAIN3 [IPSI] | | | | | |
 | BRAIN3 [CONTRA] | | | | | |
 | BRAIN4 [IPSI] | | | | | |
 | BRAIN4 [CONTRA] | | | | | |
 | BRAIN5 [IPSI] | | | | | |
 | ALL DATASETS | | | | | |
Metric | Method | BRAIN1 IPSI | BRAIN1 CONTRA | BRAIN2 IPSI | BRAIN2 CONTRA | BRAIN3 IPSI | BRAIN3 CONTRA | BRAIN4 IPSI | BRAIN4 CONTRA | BRAIN5 IPSI | ALL DATASETS
---|---|---|---|---|---|---|---|---|---|---|---
SSIM | EMSR [Real] | | | | | | | | | |
 | EMSR [Denoised] | | | | | | | | | |
 | EMSR [Synthetic (II)] | | | | | | | | | |
 | EMSR [Synthetic (I)] | | | | | | | | | |
PSNR | EMSR [Real] | | | | | | | | | |
 | EMSR [Denoised] | | | | | | | | | |
 | EMSR [Synthetic (II)] | | | | | | | | | |
 | EMSR [Synthetic (I)] | | | | | | | | | |
FRC | EMSR [Real] | | | | | | | | | |
 | EMSR [Denoised] | | | | | | | | | |
 | EMSR [Synthetic (II)] | | | | | | | | | |
 | EMSR [Synthetic (I)] | | | | | | | | | |
IV Conclusion
We introduced a deep-learning-based SR framework, named EMSR, to address the challenge of acquiring clean HR 3D-EM images across large tissue volumes. As corruptions are inherent in EM, training neural networks with no-clean references under the $\ell_2$ and $\ell_1$ loss functions was explored. Following this, we crafted a noise-robust network that integrates both edge-attention and self-attention mechanisms to focus on enhancing edge features over less informative backgrounds in brain EM images. Utilizing real LR and HR brain EM image pairs, the network underwent training with LR and HR pairs, along with LR and denoised HR pairs. Experimental results, in line with the discussed theory, confirmed the feasibility of training with no-clean references for both loss functions. While both losses demonstrated similar SR performance, consistent with the literature, $\ell_1$ slightly outperformed $\ell_2$. Furthermore, EMSR demonstrated superior or competitive results, both quantitatively and qualitatively, when compared to established SR methods. In addition to training with real LR and HR pairs, we synthesized LR images from HR using a wide range of isotropic Gaussian kernels and Gaussian noise. Experiments with synthetic pairs showed promising results, comparable to models trained on real pairs. Notably, in some cases, the synthesis produced super-resolved images with sharper edges and improved contrast, addressing inherent mismatches in LR and HR pairs, e.g., co-registration and contrast. This synthesis could also aid in deblurring while denoising and super-resolving LR EM.
EMSR simultaneously offers improved resolution and reduced noise, enabling the computational generation of clean HR EM images over large samples from cost-effective LR EM imaging and providing a neuroimaging preprocessing tool for visualization and analysis.
Acknowledgments
The authors would like to thank the Electron Microscopy Unit at the Institute of Biotechnology, University of Helsinki, Finland, for the 3D-EM datasets. They would also like to thank the Bioinformatics Center at the University of Eastern Finland, Finland, and the CSC–IT Center for Science, Finland, for providing computational resources.
Appendix A Training using corrupted reference
Let $\mathbf{x}$ and $\hat{\mathbf{x}}$ be random variables such that $\hat{\mathbf{x}} = \mathbf{x} + \mathbf{e}$, where $\mathbf{e}$ represents i.i.d. noise with a mean of $\mu$ and a variance of $\sigma^2$. The relations between the reference-dependent solutions in (5) for $\mathbf{x}$ and $\hat{\mathbf{x}}$, for both the $\ell_2$ and $\ell_1$ norms, are discussed in the following subsections.
A-A Solution for $\ell_2$-norm loss function
Expanding the $\ell_2$ loss with the corrupted reference $\hat{\mathbf{x}} = \mathbf{x} + \mathbf{e}$ and using the independence of $\mathbf{e}$ from $\mathbf{x}$ and $\mathbf{y}$ gives

$$\mathbb{E}_{\hat{\mathbf{x}}|\mathbf{y}}\big[\|f_{\boldsymbol{\theta}}(\mathbf{y})-\hat{\mathbf{x}}\|_2^2\big] = \mathbb{E}_{\mathbf{x}|\mathbf{y}}\big[\|f_{\boldsymbol{\theta}}(\mathbf{y})-\mathbf{x}\|_2^2\big] - 2\,\mathbb{E}[\mathbf{e}]^{\top}\mathbb{E}_{\mathbf{x}|\mathbf{y}}\big[f_{\boldsymbol{\theta}}(\mathbf{y})-\mathbf{x}\big] + \mathbb{E}\big[\|\mathbf{e}\|_2^2\big], \tag{20}$$

where $\mathbb{E}[\mathbf{e}] = \mu\mathbf{1}$ and $\mathbb{E}[\|\mathbf{e}\|_2^2] = N(\sigma^2 + \mu^2)$, which yields (6).
A-B Bounds for $\ell_1$-norm loss function
We derive two upper bounds for the $\ell_1$ loss, including (7), by using the following (reverse triangle) inequality, which holds for vectors $\mathbf{a}$ and $\mathbf{b}$ in $\mathbb{R}^N$ under the $\ell_1$-norm:

$$\big|\|\mathbf{a}\|_1 - \|\mathbf{b}\|_1\big| \le \|\mathbf{a} - \mathbf{b}\|_1. \tag{21}$$
By setting $\mathbf{a}$ and $\mathbf{b}$ respectively as $f_{\boldsymbol{\theta}}(\mathbf{y}) - \hat{\mathbf{x}}$ and $f_{\boldsymbol{\theta}}(\mathbf{y}) - \mathbf{x}$, we can rewrite the inequality as:

$$\big|\|f_{\boldsymbol{\theta}}(\mathbf{y}) - \hat{\mathbf{x}}\|_1 - \|f_{\boldsymbol{\theta}}(\mathbf{y}) - \mathbf{x}\|_1\big| \le \|\hat{\mathbf{x}} - \mathbf{x}\|_1 = \|\mathbf{e}\|_1. \tag{22}$$
Without loss of generality, we make the assumption that the training error with the corrupted reference, $\|f_{\boldsymbol{\theta}}(\mathbf{y}) - \hat{\mathbf{x}}\|_1$, is greater than or equal to the training error with the clean reference, $\|f_{\boldsymbol{\theta}}(\mathbf{y}) - \mathbf{x}\|_1$, leading to the non-negativity of the left-hand side of (22).
Let $\mathbf{z}$ be a vector in $\mathbb{R}^N$. Upon a well-known corollary of Hölder's inequality,

$$\|\mathbf{z}\|_1 \le \sqrt{N}\,\|\mathbf{z}\|_2, \tag{23}$$
where $N$ is the dimension of $\mathbf{z}$. By setting $\mathbf{z} = \mathbf{e}$ in (23), we can establish a connection between the $\ell_1$ and $\ell_2$ norms as $\|\mathbf{e}\|_1 \le \sqrt{N}\,\|\mathbf{e}\|_2$, which can be transformed by taking the square of each side and applying the expectation rule,

$$\mathbb{E}\big[\|\mathbf{e}\|_1^2\big] \le N\,\mathbb{E}\big[\|\mathbf{e}\|_2^2\big] = N^2\big(\sigma^2 + \mu^2\big). \tag{24}$$
Applying Jensen's inequality, which states that $\varphi(\mathbb{E}[z]) \le \mathbb{E}[\varphi(z)]$ for a convex function $\varphi$, the inequality above can be lower bounded as follows:

$$\big(\mathbb{E}\big[\|\mathbf{e}\|_1\big]\big)^2 \le \mathbb{E}\big[\|\mathbf{e}\|_1^2\big] \le N^2\big(\sigma^2 + \mu^2\big). \tag{25}$$
Taking the square root of both sides of (25) yields:

$$\mathbb{E}\big[\|\mathbf{e}\|_1\big] \le N\sqrt{\sigma^2 + \mu^2}. \tag{26}$$
Using the above inequality, we can establish two upper bounds:
A-B1 Upper-bound (I)
A-B2 Upper-bound (II)
References
- [1] D. G. C. Hildebrand, M. Cicconet, R. M. Torres, B. J. Choi, O. Randlett, et al., “Whole-brain serial-section electron microscopy in larval zebrafish,” Nature, vol. 545, no. 7654, pp. 345–349, 2017.
- [2] Z. Zheng, J. S. Lauritzen, E. Perlman, Robinson, et al., “A complete electron microscopy volume of the brain of adult drosophila melanogaster,” Cell, vol. 174, no. 3, pp. 730–743, 2018.
- [3] N. Varsano and S. G. Wolf, “Electron microscopy of cellular ultrastructure in three dimensions,” Current opinion in structural biology, vol. 76, p. 102444, 2022.
- [4] J. Roels, J. Aelterman, Luong, et al., “An overview of state-of-the-art image restoration in electron microscopy,” Journal of microscopy, vol. 271, no. 3, pp. 239–254, 2018.
- [5] S. Mikula and W. Denk, “High-resolution whole-brain staining for electron microscopic circuit reconstruction,” Nature methods, vol. 12, no. 6, pp. 541–546, 2015.
- [6] B. Imbrosci, D. Schmitz, and M. Orlando, “Automated detection and localization of synaptic vesicles in electron microscopy images,” Eneuro, vol. 9, no. 1, 2022.
- [7] J. Funke, F. Tschopp, W. Grisaitis, Sheridan, et al., “Large scale image segmentation with structured loss based deep learning for connectome reconstruction,” IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 7, pp. 1669–1680, 2018.
- [8] A. Liu, Y. Liu, J. Gu, Y. Qiao, and C. Dong, “Blind image super-resolution: A survey and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [9] D. Ren, W. Zuo, D. Zhang, L. Zhang, and M.-H. Yang, “Simultaneous fidelity and regularization learning for image restoration,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 1, pp. 284–299, 2019.
- [10] C. A. Bouman, Foundations of Computational Imaging: A Model-Based Approach. SIAM, 2022.
- [11] D. Meng and F. De La Torre, “Robust matrix factorization with unknown noise,” in ICCV, pp. 1337–1344, 2013.
- [12] X. Cao, Q. Zhao, D. Meng, Y. Chen, and Z. Xu, “Robust low-rank matrix factorization under general mixture noise distributions,” IEEE Transactions on Image Processing, vol. 25, no. 10, pp. 4677–4690, 2016.
- [13] Z. Wang, J. Chen, and S. C. Hoi, “Deep learning for image super-resolution: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3365–3387, 2020.
- [14] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: nonlinear phenomena, vol. 60, no. 1-4, pp. 259–268, 1992.
- [15] D. Glasner, S. Bagon, and M. Irani, “Super-resolution from a single image,” in ICCV, pp. 349–356, IEEE, 2009.
- [16] Z. Zha, B. Wen, X. Yuan, J. Zhou, C. Zhu, and A. C. Kot, “Low-rankness guided group sparse representation for image restoration,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
- [17] H. Chen, X. He, L. Qing, Y. Wu, C. Ren, R. E. Sheriff, and C. Zhu, “Real-world single image super-resolution: A brief review,” Information Fusion, vol. 79, pp. 124–145, 2022.
- [18] W. T. Freeman and E. C. Pasztor, “Markov networks for super-resolution,” in Proc. 34th Annual Conf. on Information Sciences and Systems (CISS 2000), 2000.
- [19] X. Lu, Y. Yuan, and P. Yan, “Image super-resolution via double sparsity regularized manifold learning,” IEEE transactions on circuits and systems for video technology, vol. 23, no. 12, pp. 2022–2033, 2013.
- [20] Y. Romano, J. Isidoro, and P. Milanfar, “Raisr: Rapid and accurate image super resolution,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 110–125, 2016.
- [21] J. Yang, Z. Lin, and S. Cohen, “Fast image super-resolution based on in-place example regression,” in CVPR, pp. 1059–1066, 2013.
- [22] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE transactions on image processing, vol. 19, no. 11, pp. 2861–2873, 2010.
- [23] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, pp. 184–199, Springer, 2014.
- [24] T.-A. Song, S. R. Chowdhury, F. Yang, and J. Dutta, “Super-resolution pet imaging using convolutional neural networks,” IEEE transactions on computational imaging, vol. 6, pp. 518–528, 2020.
- [25] K. Zhang, J. Liang, L. Van Gool, and R. Timofte, “Designing a practical degradation model for deep blind image super-resolution,” in ICCV, pp. 4791–4800, 2021.
- [26] Y. Sui, O. Afacan, C. Jaimes, A. Gholipour, and S. K. Warfield, “Scan-specific generative neural network for mri super-resolution reconstruction,” IEEE Transactions on Medical Imaging, vol. 41, no. 6, pp. 1383–1399, 2022.
- [27] X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-esrgan: Training real-world blind super-resolution with pure synthetic data,” in ICCV, pp. 1905–1914, 2021.
- [28] Z. Lu, J. Li, H. Liu, C. Huang, Zhang, et al., “Transformer for single image super-resolution,” in CVPR, pp. 457–466, 2022.
- [29] J. Liang, J. Cao, G. Sun, K. Zhang, L. Van Gool, and R. Timofte, “Swinir: Image restoration using swin transformer,” in ICCV, pp. 1833–1844, 2021.
- [30] C. Saharia, J. Ho, W. Chan, T. Salimans, D. J. Fleet, and M. Norouzi, “Image super-resolution via iterative refinement,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [31] K. Mei, M. Delbracio, H. Talebi, Z. Tu, V. M. Patel, and P. Milanfar, “Conditional diffusion distillation,” arXiv preprint arXiv:2310.01407, 2023.
- [32] K. Zhang, L. V. Gool, and R. Timofte, “Deep unfolding network for image super-resolution,” in CVPR, pp. 3217–3226, 2020.
- [33] Q. Ma, J. Jiang, X. Liu, and J. Ma, “Deep unfolding network for spatiospectral image super-resolution,” IEEE Transactions on Computational Imaging, vol. 8, pp. 28–40, 2021.
- [34] W. C. Karl, J. E. Fowler, C. A. Bouman, M. Çetin, B. Wohlberg, and J. C. Ye, “The foundations of computational imaging: A signal processing perspective,” IEEE Signal Processing Magazine, vol. 40, no. 5, pp. 40–53, 2023.
- [35] S. H. Chan, X. Wang, and O. A. Elgendy, “Plug-and-play admm for image restoration: Fixed-point convergence and applications,” IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 84–98, 2016.
- [36] K. Zhang, Y. Li, W. Zuo, L. Zhang, L. Van Gool, and R. Timofte, “Plug-and-play image restoration with deep denoiser prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 10, pp. 6360–6376, 2021.
- [37] S. Shoushtari, J. Liu, E. P. Chandler, M. S. Asif, and U. S. Kamilov, “Prior mismatch and adaptation in pnp-admm with a nonconvex convergence analysis,” arXiv preprint arXiv:2310.00133, 2023.
- [38] S. Abu-Hussein, T. Tirer, S. Y. Chun, Y. C. Eldar, and R. Giryes, “Image restoration by deep projected gsure,” in WACV, pp. 3602–3611, 2022.
- [39] Z. Zou, J. Liu, B. Wohlberg, and U. S. Kamilov, “Deep equilibrium learning of explicit regularization functionals for imaging inverse problems,” IEEE Open Journal of Signal Processing, 2023.
- [40] D. Gilton, G. Ongie, and R. Willett, “Deep equilibrium architectures for inverse problems in imaging,” IEEE Transactions on Computational Imaging, vol. 7, pp. 1123–1133, 2021.
- [41] S. Tsiper, O. Dicker, I. Kaizerman, Z. Zohar, M. Segev, and Y. C. Eldar, “Sparsity-based super resolution for sem images,” Nano Letters, vol. 17, no. 9, pp. 5437–5445, 2017.
- [42] S. Sreehari, S. Venkatakrishnan, K. L. Bouman, J. P. Simmons, L. F. Drummy, and C. A. Bouman, “Multi-resolution data fusion for super-resolution electron microscopy,” in CVPR workshops, pp. 88–96, 2017.
- [43] Z. Gao, W. Ma, S. Huang, P. Hua, and C. Lan, “Deep learning for super-resolution in a field emission scanning electron microscope,” Ai, vol. 1, no. 1, pp. 1–10, 2020.
- [44] L. Fang, F. Monroe, S. W. Novak, L. Kirk, C. R. Schiavon, S. B. Yu, Zhang, et al., “Deep learning-based point-scanning super-resolution imaging,” Nature methods, vol. 18, no. 4, pp. 406–416, 2021.
- [45] E. J. Reid, L. F. Drummy, C. A. Bouman, and G. T. Buzzard, “Multi-resolution data fusion for super resolution imaging,” IEEE Transactions on Computational Imaging, vol. 8, pp. 81–95, 2022.
- [46] Y. Qian, J. Xu, L. F. Drummy, and Y. Ding, “Effective super-resolution methods for paired electron microscopic images,” IEEE Transactions on Image Processing, vol. 29, pp. 7317–7330, 2020.
- [47] B. Titze, Techniques to prevent sample surface charging and reduce beam damage effects for SBEM imaging. PhD thesis, 2013.
- [48] M. G. de Faria, Y. Haddab, Y. Le Gorrec, and P. Lutz, “Influence of mechanical noise inside a scanning electron microscope,” Review of Scientific Instruments, vol. 86, no. 4, p. 045105, 2015.
- [49] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2Noise: Learning image restoration without clean data,” in ICML, vol. 80, pp. 2965–2974, PMLR, 2018.
- [50] N. Moran, D. Schmidt, Y. Zhong, and P. Coady, “Noisier2noise: Learning to denoise from unpaired noisy data,” in CVPR, pp. 12064–12072, 2020.
- [51] A. F. Calvarons, “Improved noise2noise denoising with limited data,” in CVPR, pp. 796–805, 2021.
- [52] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, pp. 770–778, 2016.
- [53] M. Ghahremani, Y. Liu, and B. Tiddeman, “Ffd: Fast feature detector,” IEEE Transactions on Image Processing, vol. 30, pp. 1153–1168, 2020.
- [54] W. Shi, J. Caballero, F. Huszár, J. Totz, Aitken, et al., “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in CVPR, pp. 1874–1883, 2016.
- [55] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, “Loss functions for image restoration with neural networks,” IEEE Transactions on computational imaging, vol. 3, no. 1, pp. 47–57, 2016.
- [56] A. Abdollahzadeh, I. Belevich, E. Jokitalo, A. Sierra, and J. Tohka, “Deepacson automated segmentation of white matter in 3d electron microscopy,” Communications biology, vol. 4, no. 1, p. 179, 2021.
- [57] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- [58] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
- [59] N. Banterle, K. H. Bui, E. A. Lemke, and M. Beck, “Fourier ring correlation as a resolution criterion for super-resolution microscopy,” Journal of structural biology, vol. 183, no. 3, pp. 363–367, 2013.
- [60] M. Ghahremani, M. Khateri, A. Sierra, and J. Tohka, “Adversarial distortion learning for medical image denoising,” arXiv preprint arXiv:2204.14100, 2022.