
Dynamic Cardiac MRI Super-Resolution through Self-Supervised Learning

I Methodology

The overall goal of this study is to design a neural network that converts blurred low-resolution (LR) CMRI images obtained under fast scanning into high-quality images comparable to full-scan images. To achieve this goal, we first adopted the golden-angle radial sampling strategy [winkelmann2006optimal] to under-sample the original high-resolution (HR) images in k-space and create their LR counterparts. Then, we trained a neural network to improve image quality.
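A minimal sketch of this retrospective under-sampling step is given below, assuming NumPy and a Cartesian approximation of the radial spokes; the helper names (golden_angle_radial_mask, simulate_low_resolution) and the number of spokes are illustrative choices rather than details from the paper, and a full implementation would typically use a non-uniform FFT.

```python
# Sketch of retrospective under-sampling with a golden-angle radial mask
# approximated on a Cartesian k-space grid (illustrative, not the authors' code).
import numpy as np

GOLDEN_ANGLE = 111.246  # degrees, golden-angle increment [winkelmann2006optimal]

def golden_angle_radial_mask(height, width, num_spokes):
    """Binary k-space mask whose spokes follow the golden-angle schedule."""
    mask = np.zeros((height, width), dtype=bool)
    cy, cx = height // 2, width // 2
    radius = int(np.hypot(cy, cx))
    for s in range(num_spokes):
        theta = np.deg2rad(s * GOLDEN_ANGLE)
        # Sample points along the spoke and mark the nearest Cartesian cells.
        t = np.linspace(-radius, radius, 2 * radius)
        ys = np.clip(np.round(cy + t * np.sin(theta)).astype(int), 0, height - 1)
        xs = np.clip(np.round(cx + t * np.cos(theta)).astype(int), 0, width - 1)
        mask[ys, xs] = True
    return mask

def simulate_low_resolution(hr_image, num_spokes=30):
    """Create the blurred LR counterpart of an HR frame via k-space under-sampling."""
    k_full = np.fft.fftshift(np.fft.fft2(hr_image))            # centred k-space
    k_under = k_full * golden_angle_radial_mask(*hr_image.shape, num_spokes)
    return np.abs(np.fft.ifft2(np.fft.ifftshift(k_under)))     # zero-filled recon
```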

Figure 1: The overall structure of the proposed neural network model.

I-A Neural network

We propose a two-level cascaded neural network model, shown in Fig. 1. The first level contains three transformer networks, and the second level contains a synthesis network. The input of the proposed network is three consecutive LR frames from the same slice. The transformer networks warp the adjacent frames to the target frame, and the synthesis network combines the two transformed frames and the current frame into a super-resolution target frame.
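The sketch below shows, assuming a PyTorch implementation, how such a two-level cascade could be wired; the TransformerNet-style and SynthesisNet-style classes and their call signatures are placeholders for the sub-networks detailed in the following subsections, not the authors' exact code.

```python
# Structural sketch of the two-level cascade (illustrative placeholders).
import torch
import torch.nn as nn

class CascadedSR(nn.Module):
    def __init__(self, transformer_cls, synthesis_cls):
        super().__init__()
        # Level 1: one transformer network per input frame (previous, current, next).
        self.transformers = nn.ModuleList([transformer_cls() for _ in range(3)])
        # Level 2: synthesis network that fuses the aligned frames.
        self.synthesis = synthesis_cls()

    def forward(self, prev_frame, curr_frame, next_frame):
        frames = (prev_frame, curr_frame, next_frame)
        # Warp each frame towards the target (current) frame.
        warped = [net(f, curr_frame) for net, f in zip(self.transformers, frames)]
        # Fuse the warped frames into one super-resolution target frame.
        return self.synthesis(torch.cat(warped, dim=1))
```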

Figure 2: The structure of the transformer network.
Figure 3: The schematic view of the multi-scale multi-step transformation sub-network.

I-A1 Pre-trained network

We pre-trained a neural network for

I-A2 Transformer network

The transformer network is a multi-scale, multi-step model composed of two six-layer densely connected sub-networks [huang2017densely], called the inpainting sub-network and the transformation sub-network, as shown in Fig. 2. The inpainting sub-network predicts an erased region, and the transformation sub-network warps the adjacent frame to the target frame. A 15 pixel × 15 pixel erasing region is cropped from each input image and fed into the inpainting sub-network, which is expected to predict the erased part using the information contained in the surrounding unerased region. The output of the inpainting sub-network is then placed back into the original input image to replace the cropped region. The loss of the inpainting sub-network is

\mathcal{L}_{inp}=\mathbb{E}_{I_{LR_{crop}},I_{HR_{crop}}}\|\phi(I_{LR_{crop}})-\phi(I_{HR_{crop}})\|_{2}^{2}, (1)

where \phi is the feature extractor used in the perceptual loss [johnson2016perceptual]. In this paper, we use the perceptual loss to measure the difference between the neural network outputs and the corresponding ground truth. Compared with the mean squared error, the perceptual loss has the advantage of avoiding over-smoothing when generating images [dosovitskiy2016generating]. A pre-trained VGG16 model [simonyan2014very] is used to calculate the perceptual loss.
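As an illustration, the following sketch shows one way the 15 pixel × 15 pixel erasing step and the VGG16-based perceptual loss could be implemented with PyTorch and torchvision; the choice of VGG16 feature layer, the channel handling, and the helper names are assumptions, not details from the paper.

```python
# Illustrative erasing step and VGG16-based perceptual loss (assumed details).
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

def erase_region(frame, top, left, size=15):
    """Zero out a size x size patch and return (erased frame, original patch)."""
    patch = frame[..., top:top + size, left:left + size].clone()
    erased = frame.clone()
    erased[..., top:top + size, left:left + size] = 0.0
    return erased, patch

class PerceptualLoss(torch.nn.Module):
    """|| phi(x) - phi(y) ||_2^2 with phi taken from a frozen pre-trained VGG16."""
    def __init__(self, layers=16):  # layer cutoff is an assumption
        super().__init__()
        # Older torchvision versions use vgg16(pretrained=True) instead.
        self.features = vgg16(weights="IMAGENET1K_V1").features[:layers].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, prediction, target):
        # Replicate single-channel MR frames to the three channels VGG expects.
        if prediction.shape[1] == 1:
            prediction = prediction.repeat(1, 3, 1, 1)
            target = target.repeat(1, 3, 1, 1)
        return F.mse_loss(self.features(prediction), self.features(target))
```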

To obtain accurate warped frames, a multi-scale training strategy [xue2019video] is adopted. The input LR frames are down-sampled by factors of 2 and 4 using average pooling, and the three branches of the transformation sub-network share weights, as shown in Fig. 3. The multi-scale loss is

\mathcal{L}_{scale}=\sum_{k=1}^{3}\mathbb{E}_{I_{LR_{k}},I_{HR}}\|\phi(I_{LR_{k}})-\phi(I_{HR})\|_{2}^{2}, (2)

where k is the index of the multi-scale branches, k = 1, 2, 3.
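A possible PyTorch realisation of Eq. (2) is sketched below, where the input is down-sampled with average pooling at factors 2 and 4 and the weight-shared transformation sub-network is applied at every scale; transform_net and perceptual_loss are assumed to be defined elsewhere, and their call signatures are illustrative.

```python
# Sketch of the multi-scale loss in Eq. (2); signatures are assumptions.
import torch.nn.functional as F

def multi_scale_loss(transform_net, perceptual_loss, lr_frame, target_frame, hr_frame):
    loss = 0.0
    for factor in (1, 2, 4):  # full scale plus the 2x and 4x down-sampled branches
        lr_k = F.avg_pool2d(lr_frame, factor) if factor > 1 else lr_frame
        tgt_k = F.avg_pool2d(target_frame, factor) if factor > 1 else target_frame
        hr_k = F.avg_pool2d(hr_frame, factor) if factor > 1 else hr_frame
        warped_k = transform_net(lr_k, tgt_k)      # same weights at every scale
        loss = loss + perceptual_loss(warped_k, hr_k)
    return loss
```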

Apart from multi-scale training, our model also adopts multi-step training [shan20183]. The transformation sub-network is trained recurrently to generate accurate warped results. The multi-step loss is

\mathcal{L}_{step}=\frac{1}{3}\sum_{i=1}^{3}\mathcal{L}_{scale_{i}}, (3)

where i is the index of the recurrent step and \mathcal{L}_{scale_{i}} is the multi-scale loss in the i^{th} step. We set the number of steps to 3.
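The following sketch illustrates Eq. (3) under the assumption that each recurrent step feeds the previous warped output back into the same transformation sub-network; this wiring follows Fig. 3 only schematically and reuses the multi_scale_loss sketch above.

```python
# Sketch of the multi-step loss in Eq. (3); the recurrent wiring is an assumption.
def multi_step_loss(transform_net, perceptual_loss, lr_frame, target_frame,
                    hr_frame, num_steps=3):
    losses = []
    current = lr_frame
    for _ in range(num_steps):
        # Each step accumulates the multi-scale loss of the current input ...
        losses.append(multi_scale_loss(transform_net, perceptual_loss,
                                       current, target_frame, hr_frame))
        # ... and re-applies the weight-shared transformation sub-network.
        current = transform_net(current, target_frame)
    return sum(losses) / num_steps
```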

The overall loss of the transformer network is

\mathcal{L}_{tra}=\mathcal{L}_{inp}+\mathcal{L}_{step}. (4)

I-A3 Synthesis network

The synthesis sub-network is a modified encoder-decoder network [shan20183] (Fig. 4). The front half of the network includes six recursive blocks and extracts features from each input frame; the latter half also consists of six recursive blocks and combines the features from the three input frames to reconstruct a super-resolution target frame. In each recursive block, there are two pairs of convolutional layers that share weights. The numbers of kernels in the recursive blocks of the front half are 32, 32, 64, 64, 128, and 128; in the latter half they are 128, 128, 64, 64, 32, and 32. The objective function of the synthesis network is

\mathcal{L}_{syn}=\mathbb{E}_{I_{LR},I_{HR}}\|\phi(I_{LR})-\phi(I_{HR})\|_{2}^{2}. (5)
Figure 4: The structure of the synthesis sub-network.
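As an illustration, one recursive block with two weight-shared convolutional passes could be written in PyTorch as below; the 3 × 3 kernels, ReLU activations, and the channel-adapting entry convolution are assumptions rather than details from the paper.

```python
# Sketch of a recursive block with weight-shared conv pairs (assumed details).
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Applies the same convolution pair twice; an entry conv adapts the channel count."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.entry = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # This conv pair is applied twice with the same weights (the recursion).
        self.pair = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = self.pair(self.entry(x))   # first pass
        return self.pair(x)            # second pass reuses the same weights
```

Stacking such blocks with the kernel counts listed above (32, 32, 64, 64, 128, 128 for the encoder half and the reverse for the decoder half) would reproduce the overall layout of Fig. 4.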

The overall objective function is

\mathcal{L}=\alpha(\mathcal{L}_{tra_{1}}+\mathcal{L}_{tra_{2}}+\mathcal{L}_{tra_{3}})+\beta\mathcal{L}_{syn}, (6)

where \mathcal{L}_{tra_{1}}, \mathcal{L}_{tra_{2}}, and \mathcal{L}_{tra_{3}} are the losses of the three transformer networks, and \alpha and \beta are the weights of the two terms. The Adam optimizer was used in the training process with a learning rate of 10^{-4}. In this study, we jointly trained the transformer sub-networks and the synthesis sub-network end-to-end. In the first 30 epochs, we focused on training the transformer sub-networks, so \alpha was set far larger than \beta. In the last 30 epochs, \alpha was set far smaller than \beta so that training concentrated on the synthesis network.
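A schematic training loop reflecting this two-phase weighting is sketched below; model, loader, and the compute_losses helper are hypothetical placeholders, and the concrete \alpha/\beta values are assumptions, since the paper only states that one weight is set far larger than the other in each phase.

```python
# Sketch of the two-phase joint training schedule (hypothetical helpers).
import torch

def train(model, loader, epochs=60, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        # First 30 epochs emphasise the transformer losses, last 30 the synthesis loss.
        alpha, beta = (1.0, 0.01) if epoch < 30 else (0.01, 1.0)  # assumed values
        for lr_frames, hr_frame in loader:
            # compute_losses is a hypothetical helper returning the summed
            # transformer losses and the synthesis loss for one batch.
            loss_tra_total, loss_syn = model.compute_losses(lr_frames, hr_frame)
            loss = alpha * loss_tra_total + beta * loss_syn
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```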