DC-WCNN: A deep cascade of wavelet based convolutional neural networks for MR Image Reconstruction

Abstract

Several variants of Convolutional Neural Networks (CNN) have been developed for Magnetic Resonance (MR) image reconstruction. Among them, U-Net has been shown to be a strong baseline architecture for MR image reconstruction. However, its pooling layers perform sub-sampling, causing information loss that leads to blur and missing fine details in the reconstructed image. We propose a modification to the U-Net architecture to recover fine structures. The proposed network, called WCNN, is a wavelet packet transform based encoder-decoder CNN with residual learning. WCNN uses the discrete wavelet transform in place of pooling layers, the inverse wavelet transform in place of unpooling layers, and residual connections. We also propose a deep cascaded framework (DC-WCNN), which consists of cascades of WCNN and k-space data fidelity units, to achieve high quality MR reconstruction. Experimental results show that WCNN and DC-WCNN give promising results in terms of evaluation metrics and recover fine details better than other methods.

Index Terms— MR image reconstruction, fine details, Wavelet transform, U-Net, pooling, Deep cascade

1 Introduction

Magnetic Resonance Imaging (MRI), a non-invasive anatomical imaging technique, is well known for providing high-resolution images with excellent soft tissue contrast. The key challenge, however, is to reduce its long scan times to ease patient discomfort. Consequently, sub-Nyquist sampling is adopted in k-space to accelerate data acquisition, but naive reconstruction from the undersampled k-space results in images with aliasing artifacts. Approaches based on Compressed Sensing (CS-MRI) aim to solve the de-aliasing problem by enforcing image sparsity and incoherent sampling in k-space [1].

Fig. 1: Top row (left to right): human brain target image, with the region highlighted by the yellow box showing fine details; the US image (5x acceleration factor) has artifacts; the U-Net output suffers from blur; WCNN (ours) shows better recovery of fine structures. Bottom row: reconstruction errors with respect to the target show that WCNN gives lower error than U-Net.

Recently, deep learning methods for CS-MRI based on Convolutional Neural Networks (CNN), which learn an end-to-end mapping between the undersampled (US) and fully-sampled (FS) images in a data driven manner, have gained attention [2]. Wang et al. used a simple CNN to learn the mapping between US and FS images [3]. Subsequently, an encoder-decoder CNN based architecture, U-Net [4], has shown promising results in many image-to-image problems. Lee et al. used U-Net to learn the residual and showed that their method gives better reconstruction with faster training convergence [5]. Hyun et al. demonstrated the effectiveness of U-Net followed by k-space correction to provide consistency in the reconstructed k-space data [6]. Inspired by Dictionary Learning MRI (DLMRI), Schlemper et al. proposed a deep cascaded CNN (DC-CNN), which is a cascade of several CNNs and data fidelity (DF) units [7]. Recent work by Sun et al. replaced the CNN with U-Net in the deep cascaded network (referred to from here on as DC-UNet). DC-UNet has shown improved reconstruction for higher acceleration factors [8].

Although U-Net was originally created for segmentation tasks, its use in the above papers indicates that it could be a good baseline architecture for US to FS reconstruction. This could be attributed to the following reasons: 1) the latent space representation learned from the training data acts as a highly nonlinear form of compressed sensing that provides a mapping between US and FS images, and 2) the hierarchical scale levels due to convolution and pooling layers provide a good receptive field. However, the pooling layers in U-Net cause information loss and in turn degrade the signal information, especially fine structural details, which is undesirable for image reconstruction. Removing the pooling layers from U-Net to obtain a fully convolutional network (FCN) could address this limitation, but such a network cannot provide the desired receptive field [9]. Dilated convolution can be used in place of normal convolution in the FCN to provide the necessary receptive field. However, dilated convolution introduces unwanted checkerboard artifacts and shows inconsistency at the edge regions of the features owing to its sparse sampling property [10].

In our work, we propose to replace the pooling layers with a transformation inspired by classical signal processing theory to circumvent the information loss. We choose the multi-level Haar wavelet packet transform (WPT) as it can provide hierarchical scale levels without information loss. The salient features of WPT that make it highly appropriate for deep learning include: 1) frequency and localization characteristics, which help in the efficient representation of structural and textural details, 2) lossless partitioning of feature maps into orthogonal subbands at multiple scales, and 3) sparsity induced in feature maps at every resolution level, which reduces the overall computational complexity. Fig. 1 highlights the above-mentioned merits of using wavelet transforms in U-Net for the recovery of fine details. Combining these advantages, we summarize our contributions as follows:

• We propose a wavelet based convolutional encoder-decoder neural network, WCNN, for MR image reconstruction with better signal representation, inspired by the work in [11] for vision tasks. The proposed WCNN has residual connections, and wavelet decomposition and recomposition operations in place of pooling and unpooling layers respectively.

• We also propose a deep cascaded architecture called DC-WCNN, by cascading a series of WCNN and data fidelity (DF) units for the MRI reconstruction problem.

• Performance analysis on the Kirby21 brain dataset shows that WCNN and DC-WCNN give promising results for the given 5x undersampling mask and outperform the other compared methods.

2 Methodology

2.1 Problem Formulation

Let $x\in\mathbb{C}^{N}$ be the desired image to be reconstructed from undersampled k-space measurements $y\in\mathbb{C}^{M}$, $M\ll N$, such that $y=F_{u}x$, where $F_{u}$ is the undersampled Fourier encoding matrix. The linear inversion $x_{u}=F_{u}^{H}y$ is called zero-filled reconstruction. Reconstructing $x$ from $y$ is an ill-posed problem due to sub-Nyquist sampling. The proposed method using WCNN can be formulated as the optimization problem:

$$\underset{x,\theta}{\operatorname{argmin}}\;\|x-f_{wcnn}(x_{u}\mid\theta)\|_{2}^{2}+\alpha\|F_{u}x-y\|_{2}^{2}\tag{1}$$

Let $x_{wcnn}=f_{wcnn}(x_{u}\mid\theta)$, where $f_{wcnn}$ is the forward mapping of WCNN parameterized by $\theta$. Here $\alpha$ is a weight factor based on the noise of the acquired data $y$. The above equation constrains $x$ to be approximated by the WCNN reconstruction in the image domain, without any prior information about the acquired data in k-space.
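As a minimal NumPy sketch of the operators defined above, $F_{u}$ can be modelled as an orthonormal 2D FFT followed by a binary sampling mask, and $F_{u}^{H}$ as the inverse FFT of the zero-filled k-space. The function names, the 8x8 toy image and the choice of sampled rows are illustrative assumptions and are not taken from the released code; for convenience $y$ is stored as a full-size zero-filled array rather than as a length-$M$ vector.

```python
import numpy as np

def undersample(x, mask):
    """Sketch of F_u: orthonormal 2D FFT, then keep only the sampled k-space entries."""
    return np.fft.fft2(x, norm="ortho") * mask

def zero_filled(y, mask):
    """Sketch of F_u^H: inverse FFT of k-space with unsampled entries set to zero."""
    return np.fft.ifft2(y * mask, norm="ortho")

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))  # toy complex image

mask = np.zeros((8, 8))
mask[[0, 1, 3, 7], :] = 1.0      # hypothetical pattern: 4 of 8 k-space rows acquired

y = undersample(x, mask)         # y = F_u x
x_u = zero_filled(y, mask)       # x_u = F_u^H y (aliased, zero-filled reconstruction)

# The zero-filled image already agrees with the measurements: F_u x_u = y.
residual = np.abs(undersample(x_u, mask) - y).max()
```

The zero-filled image is consistent with the acquired samples but differs from $x$ because the unsampled k-space entries are simply set to zero, which is the source of the aliasing that the network is trained to remove.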

We use a data fidelity (DF) unit in the k-space domain after the WCNN unit to ensure that the WCNN reconstruction is consistent with the acquired k-space measurements. The data fidelity operation $f_{df}$ can be expressed as,

$$\hat{x}_{rec}(k)=\begin{cases}\hat{x}_{wcnn}(k)&k\notin\Omega\\ \dfrac{\hat{x}_{wcnn}(k)+\lambda\hat{x}_{u}(k)}{1+\lambda}&k\in\Omega\end{cases}\tag{2}$$

Here, $\hat{x}_{wcnn}=F_{f}x_{wcnn}$, $\hat{x}_{u}=F_{f}x_{u}$, $\Omega$ is the index set of known k-space data, $F_{f}$ is the Fourier encoding matrix, and $\hat{x}_{rec}$ is the corrected k-space. We take $\lambda\to\infty$, so that the acquired samples replace the predicted values at $k\in\Omega$. The reconstructed image is obtained by inverse Fourier encoding of $\hat{x}_{rec}$, i.e. $x_{rec}=F_{f}^{H}\hat{x}_{rec}$. The proposed cascaded architecture, DC-WCNN, is a series of $N_{c}$ such WCNN and DF units, which can be formulated as,

$$x_{n}=WCNN_{n}(x_{n-1}^{df})+x_{n-1}^{df}\tag{3}$$
$$x_{n}^{df}=DF_{n}(x_{n})\tag{4}$$

Here $WCNN_{n}$ and $DF_{n}$ denote the $n^{th}$ WCNN reconstruction block and DF unit respectively, $n=1,2,\ldots,N_{c}$, $x_{0}^{df}=x_{u}$, and $x_{rec}=x_{N_{c}}^{df}$ is the output of the last DF unit.
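The data fidelity step of Eq. (2) and the cascade of Eqs. (3)-(4) can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the released implementation: the trained WCNN blocks are replaced by hypothetical placeholder functions, `lam=None` is used here to stand for the limit $\lambda\to\infty$, and the toy image and mask are arbitrary.

```python
import numpy as np

def data_fidelity(x_cnn, k_acquired, mask, lam=None):
    """Eq. (2): correct the predicted k-space on the sampled set Omega (mask > 0).

    lam=None stands for lambda -> infinity, i.e. acquired samples replace predictions.
    """
    k_pred = np.fft.fft2(x_cnn, norm="ortho")
    if lam is None:
        k_on_omega = k_acquired
    else:
        k_on_omega = (k_pred + lam * k_acquired) / (1.0 + lam)
    k_rec = np.where(mask > 0, k_on_omega, k_pred)
    return np.fft.ifft2(k_rec, norm="ortho")

def cascade(x_u, k_acquired, mask, blocks):
    """Eqs. (3)-(4): residual reconstruction block followed by a DF unit, repeated."""
    x_df = x_u                                   # x_0^df = x_u
    for block in blocks:
        x_n = block(x_df) + x_df                 # Eq. (3)
        x_df = data_fidelity(x_n, k_acquired, mask)  # Eq. (4)
    return x_df                                  # x_rec = x_{N_c}^df

rng = np.random.default_rng(0)
x_true = rng.standard_normal((8, 8))
mask = np.zeros((8, 8))
mask[[0, 1, 7], :] = 1.0                         # hypothetical sampled rows
k_acquired = np.fft.fft2(x_true, norm="ortho") * mask
x_u = np.fft.ifft2(k_acquired, norm="ortho")

# Hypothetical stand-ins for two trained WCNN blocks (N_c = 2).
blocks = [lambda x: 0.1 * x, lambda x: -0.05 * x]
x_rec = cascade(x_u, k_acquired, mask, blocks)
```

Whatever the reconstruction blocks predict, the output of the last DF unit matches the acquired samples exactly on $\Omega$ in the $\lambda\to\infty$ case, so only the unsampled k-space locations are filled in by the network.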

The 2D Discrete Wavelet Transform (DWT) and inverse wavelet transform (IWT) layers of WCNN are given by,

$$(X_{1},X_{2},\ldots,X_{K})=DWT(X_{in})\tag{5}$$
$$X=IWT(X_{1},X_{2},\ldots,X_{K})\tag{6}$$

Here, $X_{in}$ is the input image or an intermediate feature map in the network, $K$ is the number of subbands ($K=4$ in our case, with approximation, horizontal, vertical and diagonal subbands), $X_{k}$, $k=1,2,\ldots,K$, denotes the coefficients of the $k^{th}$ subband, and $X$ is the output of wavelet recomposition.
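For illustration, one level of the 2D Haar DWT and IWT of Eqs. (5)-(6) can be written directly on 2x2 pixel blocks. The sketch below assumes the orthonormal Haar scaling and even spatial dimensions; the function names and the ordering of the three detail subbands are assumptions (labelling conventions for horizontal and vertical details vary between libraries).

```python
import numpy as np

def haar_dwt2(x):
    """One level of the orthonormal 2D Haar DWT over the last two axes (even sizes)."""
    a = x[..., 0::2, 0::2]   # top-left pixel of every 2x2 block
    b = x[..., 0::2, 1::2]   # top-right
    c = x[..., 1::2, 0::2]   # bottom-left
    d = x[..., 1::2, 1::2]   # bottom-right
    approx = (a + b + c + d) / 2.0
    det_1 = (a - b + c - d) / 2.0   # difference between the two columns
    det_2 = (a + b - c - d) / 2.0   # difference between the two rows
    det_3 = (a - b - c + d) / 2.0   # diagonal difference
    return approx, det_1, det_2, det_3

def haar_iwt2(approx, det_1, det_2, det_3):
    """Inverse of haar_dwt2: recombine the K = 4 subbands into the full-size map."""
    shape = approx.shape[:-2] + (2 * approx.shape[-2], 2 * approx.shape[-1])
    out = np.empty(shape, dtype=approx.dtype)
    out[..., 0::2, 0::2] = (approx + det_1 + det_2 + det_3) / 2.0
    out[..., 0::2, 1::2] = (approx - det_1 + det_2 - det_3) / 2.0
    out[..., 1::2, 0::2] = (approx + det_1 - det_2 - det_3) / 2.0
    out[..., 1::2, 1::2] = (approx - det_1 - det_2 + det_3) / 2.0
    return out

rng = np.random.default_rng(0)
x_in = rng.standard_normal((8, 8))
subbands = haar_dwt2(x_in)
stacked = np.stack(subbands, axis=0)     # 4 channels at half resolution: (4, 4, 4)
x_back = haar_iwt2(*subbands)            # equals x_in up to floating point error
```

Unlike pooling, the decomposition halves the spatial resolution while keeping every coefficient needed to reconstruct the input, which is what makes the sub-sampling lossless.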

2.2 Background of the proposed architecture

Several studies have brought wavelet transforms into CNNs [9], [12]. The proposed WCNN is a variant of U-Net with contraction and expansion paths wherein DWT and IWT are performed in place of pooling and unpooling layers respectively. The DWT subbands are stacked and passed to the convolutional layers. In this way, all the subbands are learnt jointly, along with the inter-dependencies between them, thereby enhancing spatial context. We use residual connections wherein feature maps from the contracting path are added element-wise to the respective expanding path feature maps instead of being concatenated. The residual learning strategy simplifies the optimization process and minimizes degradation of the original signal information.

The WCNN as a standalone architecture unit could de-alias the US MR image in a single step. The DC-CNN proposed by Schlemper et al. showed that a deep cascade of CNNs is similar to unfolding the optimization process of CS-MRI and improves the performance of MRI reconstruction. We therefore embed WCNN as the reconstruction block in deep cascaded mode with DF units interleaved to form the DC-WCNN architecture.

2.3 Proposed architecture

The proposed DC-WCNN has cascades of WCNN and DF units (Fig. 2). The WCNN consists of three WPT levels (i.e. three subsampling layers in the network). The subband images obtained from each level of DWT are fed to a 4-layer fully convolutional (FC) block in the contraction path. In the expansion path, IWT is performed after every FC block. The convolution layers have 3x3 convolution filters followed by batch normalization (BN) and rectified linear unit (ReLU) operations. In the last layer, only the convolution operation is performed, without BN and ReLU, to predict the residual subbands. In the DC-WCNN, the DF layers are inserted between WCNN units to correct the accumulation of possible distortions in the predicted k-space data over the cascaded blocks. We use the L2 loss function for both standalone and cascaded modes. Given the training data $D$ consisting of US and FS images as input-target pairs $(x_{u},x_{t})$, the loss is given by,

$$L(\theta)=\sum_{(x_{u},x_{t})\in D}\|x_{t}-x_{pred}\|_{2}^{2}\tag{7}$$

Here $x_{pred}=x_{wcnn}$ in the standalone mode and $x_{pred}=x_{rec}$ in the deep cascade mode.

3 Experiments and Results

3.1 Dataset description and Evaluation metrics

We have conducted experiments on the publicly available Kirby21 dataset [13] with human brain data. The dataset consists of 5460 slices of size 256x256 taken from 42 T1-weighted MPRAGE volumes, out of which 3770 slices from 29 volumes are used for training and 1690 slices from 13 volumes for validation. A fixed Cartesian undersampling mask is used, which retains the ten lowest spatial frequencies and selects the remaining lines following a zero-mean Gaussian distribution. The mask has an acceleration factor of 5x (20% sampling) and is used to retrospectively generate undersampled MR images for training and testing. Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), High Frequency Error Norm (HFEN) and Normalised Mean Square Error (NMSE) metrics are used to evaluate the reconstruction quality. The Wilcoxon signed-rank test with an alpha of 0.05 is used to assess statistical significance.
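A mask of the kind described above could be generated along the following lines. This is a sketch only: the width of the Gaussian (`sigma_frac`), the random seed, the choice of which image axis is undersampled and the function name are all assumptions, since the text does not specify them, and the mask actually used in the experiments is a fixed one.

```python
import numpy as np

def cartesian_mask(n_lines=256, accel=5, n_center=10, sigma_frac=0.25, seed=0):
    """Square Cartesian line mask: n_center lowest frequencies plus Gaussian-drawn lines.

    sigma_frac (Gaussian std as a fraction of n_lines) is a hypothetical parameter.
    """
    rng = np.random.default_rng(seed)
    n_keep = n_lines // accel                    # 256 // 5 = 51 lines, about 20%
    center = n_lines // 2
    offsets = np.arange(n_lines) - center        # line position relative to k-space centre

    selected = np.zeros(n_lines, dtype=bool)
    start = center - n_center // 2
    selected[start:start + n_center] = True      # always keep the lowest frequencies

    # Zero-mean Gaussian sampling density over the lines not yet selected.
    pdf = np.exp(-0.5 * (offsets / (sigma_frac * n_lines)) ** 2)
    pdf[selected] = 0.0
    pdf /= pdf.sum()
    extra = rng.choice(n_lines, size=n_keep - n_center, replace=False, p=pdf)
    selected[extra] = True

    # Move from centred layout to FFT layout (zero frequency at index 0), then
    # repeat each selected line across the fully sampled axis.
    line_mask = np.fft.ifftshift(selected)
    return np.repeat(line_mask[:, None], n_lines, axis=1).astype(np.float64)

mask = cartesian_mask()
sampling_fraction = mask.mean()                  # 51 / 256, roughly 0.2
```

Keeping the seed fixed reproduces the same mask on every call, which mirrors the use of a single fixed mask for all slices.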

Table 1: PSNR, SSIM, NMSE and HFEN results for 5x undersampling

Mode	No. of cascades	Model	NMSE	PSNR (dB)	SSIM	HFEN
-	-	US image	0.07014 ± 0.01	25.95 ± 1.29	0.6056 ± 0.05	0.7571 ± 0.01
Standalone mode	-	CNN [3]	0.02775 ± 0.00	29.97 ± 1.33	0.8897 ± 0.02	0.5311 ± 0.07
Standalone mode	-	U-Net [4]	0.01929 ± 0.01	31.57 ± 1.17	0.9302 ± 0.01	0.4583 ± 0.06
Standalone mode	-	U-NetMean	0.01955 ± 0.00	31.51 ± 1.21	0.9303 ± 0.01	0.4574 ± 0.06
Standalone mode	-	WCNN (Ours)	0.01466 ± 0.00	32.75 ± 1.55	0.9361 ± 0.02	0.4228 ± 0.06
Deep cascade mode	1	DC-CNN [7]	0.0179 ± 0.00	31.88 ± 1.50	0.8784 ± 0.02	0.4776 ± 0.07
Deep cascade mode	1	DC-UNet [8]	0.01181 ± 0.00	33.69 ± 1.56	0.9294 ± 0.02	0.4033 ± 0.06
Deep cascade mode	1	DC-WCNN (Ours)	0.01151 ± 0.00	33.8 ± 1.61	0.9308 ± 0.02	0.3879 ± 0.06
Deep cascade mode	2	DC-CNN [7]	0.01249 ± 0.00	33.45 ± 1.69	0.9385 ± 0.02	0.4229 ± 0.07
Deep cascade mode	2	DC-UNet [8]	0.00884 ± 0.00	34.95 ± 1.70	0.9587 ± 0.01	0.357 ± 0.064
Deep cascade mode	2	DC-WCNN (Ours)	0.00811 ± 0.00	35.33 ± 1.83	0.9614 ± 0.01	0.3458 ± 0.066
Deep cascade mode	3	DC-CNN [7]	0.01029 ± 0.00	34.3 ± 1.82	0.9526 ± 0.02	0.3912 ± 0.07
Deep cascade mode	3	DC-UNet [8]	0.0077 ± 0.00	35.56 ± 1.77	0.9652 ± 0.01	0.3378 ± 0.07
Deep cascade mode	3	DC-WCNN (Ours)	0.00682 ± 0.00	36.09 ± 1.96	0.9685 ± 0.01	0.3236 ± 0.06
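For readers who want to reproduce the NMSE and PSNR columns of Table 1, the sketch below uses commonly adopted definitions of the two metrics. The exact normalisation used for the reported numbers is not stated in the text, so the choice of the target maximum as the PSNR peak value is an assumption, and the function names are illustrative.

```python
import numpy as np

def nmse(target, pred):
    """Normalised mean square error: ||target - pred||^2 / ||target||^2."""
    return np.sum(np.abs(target - pred) ** 2) / np.sum(np.abs(target) ** 2)

def psnr(target, pred, data_range=None):
    """Peak signal-to-noise ratio in dB; the peak defaults to max |target| (assumption)."""
    if data_range is None:
        data_range = np.abs(target).max()
    mse = np.mean(np.abs(target - pred) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

# Toy check: a uniform 10% intensity error on a constant image.
target = np.ones((4, 4))
pred = 0.9 * target
toy_nmse = nmse(target, pred)   # approximately 0.01
toy_psnr = psnr(target, pred)   # approximately 20 dB
```

Lower NMSE and HFEN and higher PSNR and SSIM indicate a better reconstruction.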

3.2 Implementation details

We have trained WCNN both as a standalone model and within the deep cascade framework (DC-WCNN). All the standalone models are trained for 150 epochs on Nvidia GTX-1070 GPUs. In the DC-WCNN, the weights of each WCNN are initialized with those of the standalone WCNN and are subsequently fine tuned. The same strategy is followed for the other models in deep cascade mode. The models are implemented in PyTorch and the code is publicly available at https://github.com/sriprabhar/DC-WCNN. The Adam optimizer is used with a learning rate of $0.001$.

3.3 Results and discussion

We conduct two sets of experiments, one in standalone mode and the other in deep cascade mode.

Standalone mode: We compare WCNN with a vanilla CNN [3], U-Net [4] (which commonly has max pooling layers) and U-Net with mean pooling instead of max pooling layers (referred to as U-NetMean in Table 1). The standalone mode rows of Table 1 show that our WCNN model outperforms all other methods, with the best values for PSNR, SSIM, NMSE and HFEN. We note that WCNN has only three subsampling levels and performs better than the two U-Net architectures, which have four pooling (sub-sampling) layers each. Visual comparison in Fig. 3 shows that WPT based sub-sampling in WCNN recovers fine details better than CNN and U-NetMean (U-Net is not shown as it performs similarly to U-NetMean). The approximation subband obtained from DWT is similar to the average pooling performed in U-NetMean, which is basically a smoothing operation. By also feeding the detail subbands to the convolution layers, the blurring induced by this smoothing is mitigated, helping the convolutional layers to learn kernels that improve the overall performance of WCNN.
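The relation between the approximation subband and average pooling noted above can be checked on a small example. The sketch assumes the orthonormal Haar filters, for which the approximation subband equals twice the 2x2 average pool; the helper names and the two 2x2 test patches are illustrative.

```python
import numpy as np

def blocks_2x2(x):
    """Return the four pixels (a, b, c, d) of every non-overlapping 2x2 block."""
    return x[0::2, 0::2], x[0::2, 1::2], x[1::2, 0::2], x[1::2, 1::2]

def avg_pool2(x):
    a, b, c, d = blocks_2x2(x)
    return (a + b + c + d) / 4.0

def haar_approx(x):
    a, b, c, d = blocks_2x2(x)
    return (a + b + c + d) / 2.0     # orthonormal Haar low-pass: 2 x average pool

def haar_diagonal(x):
    a, b, c, d = blocks_2x2(x)
    return (a - b - c + d) / 2.0     # one of the three detail subbands

flat = np.full((2, 2), 0.5)                    # no texture
checker = np.array([[1.0, 0.0], [0.0, 1.0]])   # fine texture with the same mean

same_after_pooling = np.allclose(avg_pool2(flat), avg_pool2(checker))
detail_flat = haar_diagonal(flat)              # zero: nothing beyond the mean
detail_checker = haar_diagonal(checker)        # non-zero: the texture is preserved
```

Average pooling maps both patches to the same value, so the texture cannot be recovered downstream, whereas the detail subband keeps the two patches distinguishable.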

Deep cascade mode: In this mode, we compare DC-WCNN with DC-CNN [7] and DC-UNet [8]. From the deep cascade mode rows of Table 1 we make two observations. Firstly, the deep cascade mode improves the quantitative metrics as compared to standalone mode. This shows that deep cascaded architectures with data fidelity units provide better reconstruction. Secondly, our DC-WCNN model inherits the benefits of WCNN and outperforms the other methods quantitatively. Visual comparison in Fig. 4 shows that DC-CNN (with two cascades of WCNN and DF) and DC-UNet produce smudged structures whereas our method recovers structures much closer to the target.

In both modes, the differences in the metrics are found to be statistically significant ($p<0.05$). We also note that increasing the number of WPT levels increases the computational cost, and hence we have chosen three levels.

4 Conclusion

We propose a wavelet-based CNN (WCNN) and a deep cascaded architecture called DC-WCNN for recovering fine details in MR image reconstruction. The WCNN is a variant of U-Net with DWT in place of pooling and IWT in place of unpooling. The DC-WCNN architecture is a cascade of WCNN and DF units. Experimental results show that WCNN and DC-WCNN outperform the other compared methods at a 5x acceleration factor.

References

[1] Kieren Grant Hollingsworth, “Reducing acquisition time in clinical MRI by data undersampling and compressed sensing reconstruction,” Physics in Medicine and Biology, vol. 60, no. 21, pp. R297–R322, Oct. 2015.
[2] Alexander Selvikvåg Lundervold and Arvid Lundervold, “An overview of deep learning in medical imaging focusing on MRI,” Zeitschrift für Medizinische Physik, vol. 29, no. 2, pp. 102–127, 2019.
[3] S. Wang, Z. Su, L. Ying, X. Peng, S. Zhu, F. Liang, D. Feng, and D. Liang, “Accelerating magnetic resonance imaging via deep learning,” in 2016 IEEE 13th International Symposium on Biomedical Imaging (ISBI), April 2016, pp. 514–517.
[4] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, 2015, pp. 234–241.
[5] D. Lee, J. Yoo, and J. C. Ye, “Deep residual learning for compressed sensing MRI,” in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), April 2017, pp. 15–18.
[6] Chang Min Hyun, Hwa Pyung Kim, Sung Min Lee, Sungchul Lee, and Jin Keun Seo, “Deep learning for undersampled MRI reconstruction,” Physics in Medicine & Biology, vol. 63, no. 13, pp. 135007, Jun. 2018.
[7] Jo Schlemper, Jose Caballero, Joseph V. Hajnal, Anthony Price, and Daniel Rueckert, “A deep cascade of convolutional neural networks for MR image reconstruction,” in Information Processing in Medical Imaging, 2017, pp. 647–658.
[8] Liyan Sun, Zhiwen Fan, Xinghao Ding, Yue Huang, and John Paisley, “Joint CS-MRI reconstruction and segmentation with a unified deep network,” in Information Processing in Medical Imaging, 2019, pp. 492–504.
[9] J. Ye, Y. Han, and E. Cha, “Deep convolutional framelets: A general deep learning framework for inverse problems,” SIAM Journal on Imaging Sciences, vol. 11, no. 2, pp. 991–1048, 2018.
[10] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell, “Understanding convolution for semantic segmentation,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), March 2018, pp. 1451–1460.
[11] Pengju Liu, Hongzhi Zhang, Kai Zhang, Liang Lin, and Wangmeng Zuo, “Multi-level wavelet-CNN for image restoration,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.
[12] T. Guo, H. S. Mousavi, T. H. Vu, and V. Monga, “Deep wavelet prediction for image super-resolution,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), July 2017, pp. 1100–1109.
[13] Bennett A. Landman, Alan J. Huang, Aliya Gifford, Deepti S. Vikram, Issel Anne L. Lim, Jonathan A.D. Farrell, John A. Bogovic, Jun Hua, Min Chen, Samson Jarso, Seth A. Smith, Suresh Joel, Susumu Mori, James J. Pekar, Peter B. Barker, Jerry L. Prince, and Peter C.M. van Zijl, “Multi-parametric neuroimaging reproducibility: A 3-T resource study,” NeuroImage, vol. 54, no. 4, pp. 2854–2866, 2011.