

FREA-Unet: Frequency-aware U-net for Modality Transfer

Hajar Emami    Qiong Liu    Ming Dong
Abstract

While positron emission tomography (PET) imaging has been widely used in the diagnosis of a number of diseases, its acquisition is costly and exposes patients to radiation. In contrast, magnetic resonance imaging (MRI) is a safer imaging modality that does not involve radiation exposure. Therefore, a need exists for efficient and automated generation of PET images from MRI data. In this paper, we propose a new frequency-aware attention U-net for generating synthetic PET images. Specifically, we incorporate an attention mechanism into the U-net layers responsible for estimating the low/high frequency scales of the image. Our frequency-aware attention U-net computes attention scores for the feature maps in the low/high frequency layers and uses them to help the model focus on the most important regions, leading to more realistic output images. Experimental results on 30 subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset demonstrate that the proposed model achieves superior performance in PET image synthesis, both qualitatively and quantitatively, over the current state of the art.

1 Introduction

Positron emission tomography (PET), a nuclear medical imaging technology, offers potential for diagnosing a number of neurological diseases, such as Alzheimer's disease, epilepsy, and head and neck cancer, by reflecting tissue metabolic activity in the brain. However, obtaining high-quality PET images is costly and requires the injection of a radioactive substance into the body, which may cause side effects to patients. For these reasons, baseline datasets usually contain fewer PET cases than other modalities. For example, in the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, only approximately half of the subjects have PET scans. On the other hand, magnetic resonance imaging (MRI) provides excellent soft tissue contrast with anatomical details and does not expose patients to radiation. To circumvent the problems of PET image acquisition and streamline clinical workflow, recent attention has been given to estimating PET images from corresponding MRI data. Eliminating PET acquisition reduces time, cost, and side effects to patients.

Related work. To date, a number of computational methods have been proposed for medical image synthesis using machine learning. Several studies employed conventional machine learning methods, such as Gaussian mixture models (GMM) and structured random forests (SRF), to generate a missing medical imaging modality from available modalities. In [1], the authors employed a regression forest to predict brain PET images from MRI inputs. Huynh et al. [2] used a structured random forest together with an auto-context model to estimate CT patches from their corresponding MRI patches; all CT patches generated from a given MRI input are then combined to form the corresponding CT image.

Recently, many deep learning models have been successfully applied to learning the mappings between two different image domains [3, 4, 5, 6, 7, 8, 9]. These deep learning approaches have been employed in medical imaging for generating a missing modality from an available one, e.g., generating synthetic CT images from MR images [10, 11, 12, 13] or estimating PET images from MRI data [14, 15, 16].

Deep convolutional neural networks (CNNs), a type of multi-layer deep learning model, can be used to capture nonlinear mappings between input and output image domains. CNN models are trained by minimizing the voxel-wise differences between the generated output images and the ground truth data. CNN models were employed to perform image-to-image mapping between MRI and CT images in [10]. Li et al. [14] developed a deep learning approach in which a 3D CNN is employed to estimate the output PET images given the input MRI modality. The U-net, a well-established fully convolutional network (FCN), was first introduced in [17] for biomedical image segmentation. In [11], Han developed a learning-based approach in which a U-net architecture is employed to generate synthetic CT images from T1-weighted MR data.

Generative adversarial networks (GANs) [18], another type of deep learning model with two competing networks (a generator that synthesizes output images and a discriminator that distinguishes between synthesized and real data), have been used in image generation tasks. Recently, GANs have been used in medical image-to-image translation [19, 16, 15, 20, 12, 21]. Pan et al. [15] used a GAN model to capture the underlying relationship between MRI and PET images in order to generate missing PET data. In the image-to-image translation task, models learn the mapping between two image domains using a training set of paired data. Since paired training data are not available in many tasks, Zhu et al. [8] introduced CycleGAN to translate an image from a source domain to a target domain in the absence of paired data. Unsupervised image-to-image translation using CycleGAN was used in [22] to learn the mapping between unpaired MRI and CT images.

Attention plays an important role in the human visual process for building a visual representation and processing context. Inspired by the human attention mechanism [23], attention-based deep learning models have been used in a variety of computer vision and machine learning tasks, including image classification [24, 25], image segmentation [26], image-to-image translation [27, 28, 29], natural language processing [30, 31], and time series forecasting [32, 33]. Attention can be defined as a scalar matrix showing the relative importance of activation maps at different spatial locations [34]. Incorporating attention into deep learning models improves performance by focusing on the most relevant features. Zhou et al. [25] showed that attention maps can be used to localize the object of interest and improve object localization accuracy. In [25], the attention for the classification task is produced by removing the top average-pooling layer of a classifier CNN, which helps identify the discriminative features across classes. Zagoruyko et al. [35] improved the performance of a small student CNN by transferring attention knowledge from a larger and more powerful teacher CNN. They defined attention under the assumption that the absolute value of a neuron's activation is indicative of the importance of that neuron for classifying the input image.

Jetley et al. [36] incorporated an attention mechanism into a classifier CNN to amplify the relevant features and suppress the irrelevant ones when classifying the input. Their proposed approach improved upon standard CNN architectures in classification. Recent studies show that incorporating attention learning into both image generation and image-to-image translation leads to more realistic outputs. Zhang et al. [37] proposed the self-attention GAN, which uses a self-attention mechanism for image generation. Chen et al. [28] and Mejjati et al. [27] used an attention network to localize the object of interest and exclude the background in image-to-image translation, improving the quality of the generated images.

This work. While deep learning models show great potential in medical imaging, generating synthetic PET images from MRI data is challenging due to the very different appearances of the two modalities. Fig. 1 shows two examples of brain MR images (first column) with the corresponding PET images of the same patients (second column), which have clearly different textures. While MR images show richer texture information than PET images, PET images contain complicated texture at different frequency scales. This impacts synthetic PET generation, suggesting that separately optimizing different frequency scales can lead to more realistic outputs with better preserved details.

Figure 1: Examples of brain MR images (first column) with the corresponding PET images of the same patients (second column), highlighting the significant difference between MRI and PET textures.

In this paper, we propose a novel deep learning method, a frequency-aware U-net model (FREA-Unet), which generates the low/high frequency scales of the image separately in different layers to estimate synthetic PET images. Optimizing the low/high frequency scales with separate loss functions helps to balance the prediction error of the PET image against image sharpness and resolution.

We also use end-to-end-trainable attention modules in FREA-Unet to weight the different features in the low/high frequency scales. Specifically, the attention in the low/high frequency layers is defined as the compatibility scores [36] between the feature maps extracted at the frequency layers and the output of the second-to-last layer in the U-net's decoding path, which captures the features relevant for generating the output image. The extracted spatial attention scores are used in the scale layers so that higher weights are given to the relevant regions, leading to more realistic output images. The major contributions of our work are summarized as follows:

  • We propose FREA-Unet, a novel frequency-aware U-net model for medical imaging modality transfer. Based on the proposed model, we use a modified loss function with different weights to optimize the different scales during FREA-Unet training.

  • We also incorporate an attention mechanism into the low/high frequency layers to focus on the relevant features during image translation.

  • We propose a complete end-to-end model for generating PET images from MRI data.

  • FREA-Unet demonstrates the effectiveness of separating low/high frequency generation and of incorporating the attention mechanism into the image generation model. Through extensive experiments, we show that, both qualitatively and quantitatively, FREA-Unet significantly outperforms other state-of-the-art methods on the ADNI dataset, offering potential for PET/MRI applications.

2 Materials and methods

2.1 FREA-Unet

The goal of the proposed FREA-Unet model is to estimate a mapping $F_{MR\rightarrow PET}$ from the source image domain (MRI) to the target image domain (PET). The mapping $F$ is learned from paired training data $S=\{(mr_{i},pet_{i})\,|\,mr_{i}\in MR,\ pet_{i}\in PET,\ i=1,2,...,N\}$, where $N$ is the number of MRI-PET image pairs. Generating the low/high frequency scales of the image separately with different loss functions helps the model produce more realistic outputs with better preserved low/high frequency details. The proposed FREA-Unet achieves this by explicitly generating the low and high frequency PET scales in two different layers of the U-net's decoder, as pictured in Fig. 2.

There are two end-to-end-trainable attention modules for the low/high frequency layers that weight different features based on their importance in PET image generation. Incorporating the attention mechanism into the layers responsible for generating the low/high frequency scales helps the FREA-Unet model focus on the most relevant features during image translation, leading to more realistic output images.

Figure 2: FREA-Unet architecture. Low and high frequency PET scales are generated in two different layers of FREA-Unet's decoding path. Each frequency scale generation stream has one attention module to weight different features. The outputs of the two frequency scale streams are fused and fed into the last layer of the decoder to generate the final output PET image.

2.2 Frequency-aware network

The U-net architecture has two symmetrical paths, an encoding path and a decoding path, to leverage both local and hierarchical information in order to estimate a more accurate reconstruction of the encoded input. Similar to a regular CNN, the encoding path of the U-net contains convolutional layers to encode the context of the input image. On the other hand, the decoding path includes transposed convolutional layers to reconstruct the estimated image. The U-net architecture has direct connections across its contracting path and expanding path. These skip connections combine high resolution features from the contracting path with the up-sampled features from the expanding path and use them as the inputs to the convolutional layers in the expanding path. This typically leads to improved resolution in the network output.

In our proposed FREA-Unet's contracting path, we have six convolutional layers with 64, 128, 256, 512, 512, and 512 filters, respectively. Each convolutional layer uses these filters to perform 2D convolutions on its input. The input of each layer is the output of its previous layer, except for the first layer, whose input is the MRI image. The convolution outputs are followed by a nonlinear activation function, the Rectified Linear Unit (ReLU), and batch normalization. Batch normalization allows much higher learning rates and accelerates network training. In the expanding path of the FREA-Unet architecture, we have six transposed convolutional layers with 512, 1024, 1024, 512, 256, and 128 filters, respectively, followed by ReLU and batch normalization. In addition, we use dropout in the layers of FREA-Unet as an effective technique for regularization and for preventing the co-adaptation of neurons in the network.
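
As a rough sketch of the layer configuration described above, the snippet below builds the encoder blocks in PyTorch. The filter counts, ReLU, batch normalization, and dropout follow the text; the single input channel, kernel size, stride, and dropout rate are our assumptions, since the paper does not report them.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Encoder block: 2D convolution -> ReLU -> batch normalization (order as in the text).
    # Kernel size 4, stride 2, and padding 1 are assumptions; the paper does not specify them.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
    )

def deconv_block(in_ch, out_ch, p_drop=0.5):
    # Decoder block: transposed convolution -> ReLU -> batch normalization, plus dropout
    # for regularization (the dropout rate is an assumption).
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
        nn.Dropout2d(p_drop),
    )

# Encoder filter counts from the text: 64, 128, 256, 512, 512, 512 (single-channel MRI input assumed).
encoder = nn.ModuleList([
    conv_block(1, 64), conv_block(64, 128), conv_block(128, 256),
    conv_block(256, 512), conv_block(512, 512), conv_block(512, 512),
])
# The decoder uses 512, 1024, 1024, 512, 256, 128 filters (see text); its exact input widths
# depend on how the skip connections are concatenated, which the paper does not detail.
```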

In the proposed FREA-Unet architecture, we assign two different layers in the decoding path to low/high frequency image generation in order to generate the low and high frequency PET images separately. Specifically, we use the fourth layer of the decoding path for low frequency generation and the fifth layer for high frequency PET image generation. The low/high frequency scales of each image are separated using a Gaussian filter during the training phase so that the model can optimize the different scales separately. As illustrated in Fig. 2, the low/high frequency scale outputs are combined after generation. As shown in our experimental results in Section 3, optimizing on different scales helps generate more realistic PET images with improved quantitative results.
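
As a minimal sketch of this frequency separation step, the target PET slice can be split into a smoothed low-frequency component and a residual high-frequency component; the Gaussian width sigma below is an assumption, as the paper does not report the filter parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def split_frequencies(pet_slice, sigma=3.0):
    # Low-frequency target: Gaussian-smoothed PET slice; high-frequency target: the residual.
    # sigma=3.0 is an assumed value, not taken from the paper.
    pet = pet_slice.astype(np.float32)
    low = gaussian_filter(pet, sigma=sigma)
    high = pet - low
    return low, high

# Usage: pet_low, pet_high = split_frequencies(pet_slice) for each training PET slice.
```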

Learning the end-to-end mapping from the MRI input to the low/high frequency scales requires estimating the network parameters by minimizing the prediction error between the predicted images $F(MR;\theta)$ and the corresponding ground truth PET images. The objective for the mapping $F_{MR\rightarrow PET_{low}}$ can be defined as:

$L_{low}(F)=\|F(mr;\theta_{low})-pet_{low}\|_{2}$ (1)

Similarly, the objective for updating the mapping $F_{MR\rightarrow PET_{high}}$ in the high frequency stream of the generator is defined as:

$L_{high}(F)=\|F(mr;\theta_{high})-pet_{high}\|_{1}$ (2)

Since the $L_{1}$ norm leads to less blurry results in image generation tasks [5], we use the $L_{1}$ norm in our high frequency generation stream. Finally, by combining the loss functions of the low/high frequency streams, the full objective of the FREA-Unet model can be expressed as:

$L_{FREA-Unet}=L_{low}(F)+L_{high}(F)$ (3)
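
A sketch of how the objective in Eqs. (1)-(3) could be computed with PyTorch tensors is given below; any weighting coefficients for the two streams are omitted, since Eq. (3) gives a plain sum.

```python
import torch

def frea_unet_loss(pred_low, pet_low, pred_high, pet_high):
    # L2 norm for the low-frequency stream (Eq. 1), L1 norm for the high-frequency stream (Eq. 2);
    # their sum is the full FREA-Unet objective (Eq. 3).
    loss_low = torch.norm(pred_low - pet_low, p=2)
    loss_high = torch.norm(pred_high - pet_high, p=1)
    return loss_low + loss_high
```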

2.3 Attention module

In the proposed FREA-Unet model, the low/high frequency scales are optimized separately. We incorporate an attention mechanism into the low/high scale layers to weight different features based on their importance in the output PET image. Specifically, the attention in the low/high frequency layers is defined as the compatibility scores [36] between the feature maps extracted at the frequency layers and the output of the second-to-last layer in FREA-Unet's decoding path, which captures the features relevant for generating the output image.

We define the compatibility score as the dot product between the feature map $f_{i}$ and the output of the second-to-last layer of the model, which has the same size as the input image and only needs to pass through the final layer to produce the output PET image $O_{pet}$:

$c_{i}=\langle f_{i},O_{pet}\rangle,\quad i\in\{1,\dots,n\}$ (4)

The extracted spatial attention scores are used in the low/high scale layers so that higher weights are given to the relevant regions, leading to more realistic output images.
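
A minimal sketch of the compatibility-score attention in Eq. (4) is shown below, assuming the feature maps have already been resized to the spatial size of $O_{pet}$; the softmax normalization of the scores is our assumption, as the paper only defines the raw dot products.

```python
import torch
import torch.nn.functional as F

def compatibility_attention(feature_maps, o_pet):
    # feature_maps: (n, H, W) -- the n feature maps f_i of a low/high frequency layer,
    #               assumed resized to the spatial size of o_pet.
    # o_pet:        (H, W)    -- output of the second-to-last decoder layer.
    n = feature_maps.shape[0]
    flat_f = feature_maps.reshape(n, -1)           # (n, H*W)
    flat_o = o_pet.reshape(-1)                     # (H*W,)
    scores = flat_f @ flat_o                       # c_i = <f_i, O_pet>, Eq. (4)
    weights = F.softmax(scores, dim=0)             # assumed normalization of the scores
    return feature_maps * weights.view(n, 1, 1)    # emphasize the most relevant feature maps
```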

2.4 Datasets

The brain dataset was acquired from 30 subjects with both MRI and PET scans in the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (www.adni-info.org). ADNI provides a large database of studies aimed at understanding the development and pathology of Alzheimer's disease. Subjects are diagnosed as cognitively normal (CN), significant memory concern (SMC), early mild cognitive impairment (EMCI), mild cognitive impairment (MCI), late mild cognitive impairment (LMCI), or having Alzheimer's disease (AD). All 30 PET scans obtained from ADNI were aligned to their corresponding T1-weighted MRI using coregistration, so there is spatial correspondence between the MRI and PET slices of each subject. We cropped the input MRI and PET images to a size of 256×256 for training.
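
As a simple illustration of the final preprocessing step, a crop to 256×256 is sketched below; the paper states only the output size, so centering the crop is our assumption.

```python
import numpy as np

def center_crop(slice_2d, size=256):
    # Crop a 2D MRI or PET slice to size x size around its center
    # (assumes both dimensions are at least `size`).
    h, w = slice_2d.shape
    top, left = (h - size) // 2, (w - size) // 2
    return slice_2d[top:top + size, left:left + size]
```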

3 Experimental Results

3.1 Experimental Setup

Extensive experiments were performed to compare FREA-Unet with current state-of-the-art models for generating synthetic PET images from MRI inputs. Based on the same training and testing splits, the following models, previously used for similar image generation tasks, are included in our empirical evaluation and comparison.

  • U-net model: Following the techniques developed by Han [11], we implemented and trained a U-net model to compare against the FREA-Unet approach. The U-net model used in our empirical evaluation has no attention mechanism and generates output at only one frequency scale. The architecture includes six convolutional layers in the encoding path and six transposed convolutional layers in the decoding path.

  • pix2pix: We implemented and trained the pix2pix model developed by Isola et al. [5], with five residual blocks in the generator. The discriminator is a regular CNN with five convolutional layers that classifies the input image as real or synthetic.

  • cGANs with U-Net architecture (UcGAN): We implemented and trained a conditional GAN (cGAN) model with the U-net [17] architecture as its generator. We include the comparison with UcGAN because FREA-Unet's core architecture is also based on the U-net. The discriminator is a regular CNN with five convolutional layers.

3.2 Training and Implementation Details

To evaluate performance, three-fold cross-validation [38] was used in our model training and testing. That is, out of the 30 cases, 20 are randomly selected for training, and the remaining cases are used to test the trained model. We repeat the experiment three times and report the averaged results.
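
A sketch of this evaluation protocol, with 20 of the 30 subjects drawn at random for training in each of three repetitions; the random seed is an arbitrary choice.

```python
import numpy as np

def random_splits(n_cases=30, n_train=20, n_repeats=3, seed=0):
    # Each repetition: a random 20/10 train/test split of the 30 subjects.
    # Reported metrics are averaged over the repetitions.
    rng = np.random.default_rng(seed)
    return [(perm[:n_train], perm[n_train:])
            for perm in (rng.permutation(n_cases) for _ in range(n_repeats))]
```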

The weights in FREA-Unet were initialized from a Gaussian distribution with mean 0 and standard deviation 0.02. The model is trained with the Adam optimizer with an initial learning rate of 0.0002 and a batch size of 1. We trained the model for 200 epochs on an NVIDIA GTX 1080 Ti GPU. We used the same training settings for all models in our comparison.
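
A sketch of this training configuration in PyTorch is given below; `FREAUnet` is a hypothetical model class standing in for the network described in Section 2.

```python
import torch
import torch.nn as nn

def init_weights(m):
    # Initialize convolutional weights from a Gaussian with mean 0 and std 0.02, as in the text.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)

# model = FREAUnet()                                        # hypothetical model class
# model.apply(init_weights)
# optimizer = torch.optim.Adam(model.parameters(), lr=2e-4) # initial learning rate 0.0002
# Training runs for 200 epochs with a batch size of 1 (see text).
```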

3.3 Quantitative Measurement

We used three commonly used quantitative measures to evaluate the prediction accuracy of the models between the generated PET images and the real PET images. The first is the Mean Absolute Error (MAE), defined as:

$MAE=\frac{\sum_{i=1}^{H}|realPET(i)-synPET(i)|}{H}$ (5)

where H is the number of body voxels. The second measure is the Structural Similarity Index (SSIM) defined as:

$SSIM=\frac{(2\mu_{PET}\mu_{synPET}+C_{1})(2\sigma_{PET\,synPET}+C_{2})}{(\mu_{PET}^{2}+\mu_{synPET}^{2}+C_{1})(\sigma_{PET}^{2}+\sigma_{synPET}^{2}+C_{2})}$ (6)

where $\mu_{PET}$ denotes the mean value of the real PET image, $\mu_{synPET}$ denotes the mean value of the generated PET image, $\sigma_{PET}^{2}$ is the variance of the real PET image, $\sigma_{synPET}^{2}$ is the variance of the synPET image, $\sigma_{PET\,synPET}$ is their covariance, and $C_{1}=(k_{1}Q)^{2}$ and $C_{2}=(k_{2}Q)^{2}$ are two variables that stabilize the division with weak denominators, with $k_{1}=0.01$ and $k_{2}=0.02$. The last measure is the Peak Signal-to-Noise Ratio (PSNR), defined as:

$PSNR=10\log_{10}\left(\frac{Q^{2}}{MSE}\right)$ (7)

where $Q$ is the maximal intensity value of the PET and synPET images, and MSE is the mean squared error.

For the MAE metric, a lower value means better prediction, that is, more realistic generated PET images. For both SSIM and PSNR, higher values indicate better prediction.
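
A sketch of the three metrics for a pair of real and synthetic PET arrays follows. We compute SSIM with scikit-image rather than re-implementing Eq. (6), and the MAE here averages over all voxels in the array, whereas the paper restricts the sum to body voxels; both are assumptions of this sketch.

```python
import numpy as np
from skimage.metrics import structural_similarity

def evaluate(real_pet, syn_pet):
    # real_pet, syn_pet: float arrays of the same shape (e.g., 256 x 256 slices).
    mae = np.mean(np.abs(real_pet - syn_pet))                        # Eq. (5), averaged over voxels
    q = float(real_pet.max())                                        # maximal intensity value Q
    mse = np.mean((real_pet - syn_pet) ** 2)
    psnr = 10.0 * np.log10(q ** 2 / mse)                             # Eq. (7)
    ssim = structural_similarity(real_pet, syn_pet, data_range=q)    # Eq. (6) via scikit-image
    return mae, psnr, ssim
```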

3.4 Ablation Study

We first performed a model ablation to evaluate the impact of each component of FREA-Unet. In Table 1, we report the MAE, PSNR, and SSIM for different configurations of our model. First, we removed the attention mechanism and optimized the model on the original input scale (without separating low/high frequency scales), using the regular U-net loss. In this case, our model is reduced to the U-net architecture (U-net), with MAE: 52.56 ± 6.71, PSNR: 26.75 ± 1.96, and SSIM: 0.84 ± 0.05.

Next, we removed the optimization on different scales but kept the attention mechanism to weight different features (FREA-Unet-wo-Freq). Our results show that this leads to better results (MAE: 51.10 ± 7.03, PSNR: 27.08 ± 1.33, and SSIM: 0.85 ± 0.04) than U-net, but worse than FREA-Unet. Clearly, separately optimizing the low/high scales helps to achieve more accurate synPETs with preserved organ structures.

Further, we removed the attention mechanism from our model while still optimizing the low/high scales separately (FREA-Unet-wo-Att). This results in a higher MAE (48.19 ± 6.42) and lower PSNR (27.93 ± 1.84) and SSIM (0.86 ± 0.04) compared with the full version of FREA-Unet. This ablation study shows that FREA-Unet works better when attention is employed to weight different features based on their importance in estimating PET images. Separately optimizing the low/high scales using different weights obtained from the attention modules helps generate more realistic synthetic PETs with higher spatial resolution.

Table 1: Performance comparison for ablations of FREA-Unet in synPET generation.
Method | MAE | PSNR | SSIM
U-net | 52.56 ± 6.71 | 26.75 ± 1.96 | 0.84 ± 0.05
FREA-Unet-wo-Freq | 51.10 ± 7.03 | 27.08 ± 1.33 | 0.85 ± 0.04
FREA-Unet-wo-Att | 48.19 ± 6.42 | 27.93 ± 1.84 | 0.86 ± 0.04
FREA-Unet | 46.26 ± 5.03 | 28.45 ± 2.01 | 0.87 ± 0.04

3.5 Performance of Image Synthesis Models

Fig. 3 shows the generated synPET results using different models for several test cases. The columns show the input MRI, the real PET, and the PETs generated by the proposed FREA-Unet model, U-net [11], UcGAN [17], and pix2pix [5], respectively.

Figure 3: Qualitative comparison of synthetic PETs generated with FREA-Unet and state-of-the-art models for different patients. The columns show the input MRI, the corresponding real PET, and the synPETs generated by FREA-Unet, U-net, UcGAN, and pix2pix, respectively.
Table 2: Performance comparison of FREA-Unet with other models for synthetic PET generation. For each model, the average MAE is computed over the entire head. The average PSNR and SSIM are reported in the third and fourth columns, respectively.
Method | MAE | PSNR | SSIM
U-net | 52.56 ± 6.71 | 26.75 ± 1.96 | 0.84 ± 0.04
pix2pix | 59.19 ± 7.89 | 24.86 ± 2.34 | 0.79 ± 0.04
UcGAN | 55.31 ± 6.25 | 26.11 ± 2.21 | 0.82 ± 0.05
FREA-Unet | 46.26 ± 5.03 | 28.45 ± 2.01 | 0.87 ± 0.04

Qualitative comparisons of the different methods show that the synthetic PETs generated by the proposed FREA-Unet estimate anatomical structures more accurately and preserve more details than the synPETs generated by the other models. Since FREA-Unet generates the low/high frequency scales separately, its results have higher spatial resolution and successfully render sharp regions, including boundaries, as shown in Fig. 3.

The average MAE, PSNR, and SSIM computed between the real and synthetic PETs over all test cases are listed in Table 2 for each of the aforementioned methods. FREA-Unet achieved the best performance in synthetic PET generation, with an average MAE of 46.26 ± 5.03 over all test cases.

3.6 Conclusions

In this paper, we proposed FREA-Unet, a frequency-aware U-net model for modality transfer in medical imaging. FREA-Unet optimizes different frequency scales separately and takes advantage of an attention mechanism to weight different features in low/high scale generation. Separately optimizing the low/high scales using different weights obtained from the attention modules helps generate more accurate synthetic PETs with preserved organ structures. We applied this model to estimate PET images from their corresponding MR images on the ADNI dataset, where we were able to generate sharp synthetic PET images similar to real PET images. Our experiments show that the proposed FREA-Unet model outperforms other state-of-the-art methods in medical imaging modality transfer.

References

  • [1] Kang, J., Gao, Y., Shi, F., Lalush, D.S., Lin, W., Shen, D.: Prediction of standard-dose brain pet image by using mri and low-dose brain [18f] fdg pet images. Medical physics 42 (2015) 5301–5309
  • [2] Huynh, T., Gao, Y., Kang, J., Wang, L., Zhang, P., Lian, J., Shen, D.: Estimating ct image from mri data using structured random forest and auto-context model. IEEE transactions on medical imaging 35 (2015) 174–183
  • [3] Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2018) 8789–8797
  • [4] Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 172–189
  • [5] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2017) 1125–1134
  • [6] Liu, M.Y., Breuel, T., Kautz, J.: Unsupervised image-to-image translation networks. In: Advances in neural information processing systems. (2017) 700–708
  • [7] Murez, Z., Kolouri, S., Kriegman, D., Ramamoorthi, R., Kim, K.: Image to image translation for domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2018) 4500–4509
  • [8] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision. (2017) 2223–2232
  • [9] Zhu, J.Y., Zhang, R., Pathak, D., Darrell, T., Efros, A.A., Wang, O., Shechtman, E.: Toward multimodal image-to-image translation. In: Advances in neural information processing systems. (2017) 465–476
  • [10] Nie, D., Cao, X., Gao, Y., Wang, L., Shen, D.: Estimating ct image from mri data using 3d fully convolutional networks. In: Deep Learning and Data Labeling for Medical Applications. Springer (2016) 170–178
  • [11] Han, X.: Mr-based synthetic ct generation using a deep convolutional neural network method. Medical physics 44 (2017) 1408–1419
  • [12] Emami, H., Dong, M., Nejad-Davarani, S.P., Glide-Hurst, C.K.: Generating synthetic cts from magnetic resonance images using generative adversarial networks. Medical physics 45 (2018) 3627–3636
  • [13] Emami, H., Dong, M., Glide-Hurst, C.K.: Attention-guided generative adversarial network to address atypical anatomy in modality transfer. arXiv preprint arXiv:2006.15264 (2020)
  • [14] Li, R., Zhang, W., Suk, H.I., Wang, L., Li, J., Shen, D., Ji, S.: Deep learning based imaging data completion for improved brain disease diagnosis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2014) 305–312
  • [15] Pan, Y., Liu, M., Lian, C., Zhou, T., Xia, Y., Shen, D.: Synthesizing missing pet from mri with cycle-consistent generative adversarial networks for alzheimer’s disease diagnosis. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2018) 455–463
  • [16] Armanious, K., Jiang, C., Fischer, M., Küstner, T., Hepp, T., Nikolaou, K., Gatidis, S., Yang, B.: Medgan: Medical image translation using gans. Computerized Medical Imaging and Graphics 79 (2020) 101684
  • [17] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention, Springer (2015) 234–241
  • [18] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. (2014) 2672–2680
  • [19] Gehari, H.E., Nejad-Davarani, S., Dong, M., Glide-Hurst, C.: Generating synthetic cts from magnetic resonance images using generative adversarial networks. In: Medical physics. Volume 45., WILEY 111 RIVER ST, HOBOKEN 07030-5774, NJ USA (2018) E131–E131
  • [20] Nie, D., Trullo, R., Lian, J., Petitjean, C., Ruan, S., Wang, Q., Shen, D.: Medical image synthesis with context-aware generative adversarial networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer (2017) 417–425
  • [21] Emami, H., Dong, M., Glide-Hurst, C.K.: Attention-guided generative adversarial network to address atypical anatomy in synthetic ct generation. In: 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI), IEEE (2020) 188–193
  • [22] Wolterink, J.M., Dinkla, A.M., Savenije, M.H., Seevinck, P.R., van den Berg, C.A., Išgum, I.: Deep mr to ct synthesis using unpaired data. In: International workshop on simulation and synthesis in medical imaging, Springer (2017) 14–23
  • [23] Rensink, R.A.: The dynamic representation of scenes. Visual cognition 7 (2000) 17–42
  • [24] Xiao, T., Xu, Y., Yang, K., Zhang, J., Peng, Y., Zhang, Z.: The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 842–850
  • [25] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 2921–2929
  • [26] Chen, L.C., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: Scale-aware semantic image segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2016) 3640–3649
  • [27] Mejjati, Y.A., Richardt, C., Tompkin, J., Cosker, D., Kim, K.I.: Unsupervised attention-guided image-to-image translation. In: Advances in Neural Information Processing Systems. (2018) 3693–3703
  • [28] Chen, X., Xu, C., Yang, X., Tao, D.: Attention-gan for object transfiguration in wild images. In: Proceedings of the European Conference on Computer Vision (ECCV). (2018) 164–180
  • [29] Emami, H., Aliabadi, M.M., Dong, M., Chinnam, R.: Spa-gan: Spatial attention gan for image-to-image translation. IEEE Transactions on Multimedia (2020)
  • [30] Eriguchi, A., Hashimoto, K., Tsuruoka, Y.: Tree-to-sequence attentional neural machine translation. arXiv preprint arXiv:1603.06075 (2016)
  • [31] Lin, Z., Feng, M., Santos, C.N.d., Yu, M., Xiang, B., Zhou, B., Bengio, Y.: A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130 (2017)
  • [32] Siridhipakul, C., Vateekul, P.: Multi-step power consumption forecasting in thailand using dual-stage attentional lstm. In: 2019 11th International Conference on Information Technology and Electrical Engineering (ICITEE), IEEE (2019) 1–6
  • [33] Aliabadi, M.M., Emami, H., Dong, M., Huang, Y.: Attention-based recurrent neural network for multistep-ahead prediction of process performance. Computers & Chemical Engineering (2020) 106931
  • [34] Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
  • [35] Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928 (2016)
  • [36] Jetley, S., Lord, N.A., Lee, N., Torr, P.H.: Learn to pay attention. arXiv preprint arXiv:1804.02391 (2018)
  • [37] Zhang, H., Goodfellow, I., Metaxas, D., Odena, A.: Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318 (2018)
  • [38] Stone, M.: Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological) 36 (1974) 111–133