
Dynamic Cardiac MRI Super-Resolution through Self-Supervised Learning

I Methodology

The overall goal of this study is to design a neural network that converts blurred low-resolution (LR) CMRI images obtained under fast scanning into high-quality images comparable to full-scan images. To achieve this goal, we first adopted the golden-angle radial sampling strategy [winkelmann2006optimal] to under-sample the original high-resolution (HR) images in k-space and create their LR counterparts. Then, we trained a neural network to improve image quality.
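A minimal sketch of this retrospective under-sampling step is given below, assuming NumPy and a Cartesian approximation of the radial spokes; the helper names (golden_angle_radial_mask, simulate_low_resolution) and the number of spokes are illustrative choices rather than details from the paper, and a full implementation would typically use a non-uniform FFT.

```python
# Sketch of retrospective under-sampling with a golden-angle radial mask
# approximated on a Cartesian k-space grid (illustrative, not the authors' code).
import numpy as np

GOLDEN_ANGLE = 111.246  # degrees, golden-angle increment [winkelmann2006optimal]

def golden_angle_radial_mask(height, width, num_spokes):
    """Binary k-space mask whose spokes follow the golden-angle schedule."""
    mask = np.zeros((height, width), dtype=bool)
    cy, cx = height // 2, width // 2
    radius = int(np.hypot(cy, cx))
    for s in range(num_spokes):
        theta = np.deg2rad(s * GOLDEN_ANGLE)
        # Sample points along the spoke and mark the nearest Cartesian cells.
        t = np.linspace(-radius, radius, 2 * radius)
        ys = np.clip(np.round(cy + t * np.sin(theta)).astype(int), 0, height - 1)
        xs = np.clip(np.round(cx + t * np.cos(theta)).astype(int), 0, width - 1)
        mask[ys, xs] = True
    return mask

def simulate_low_resolution(hr_image, num_spokes=30):
    """Create the blurred LR counterpart of an HR frame via k-space under-sampling."""
    k_full = np.fft.fftshift(np.fft.fft2(hr_image))            # centred k-space
    k_under = k_full * golden_angle_radial_mask(*hr_image.shape, num_spokes)
    return np.abs(np.fft.ifft2(np.fft.ifftshift(k_under)))     # zero-filled recon
```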

Figure 1: The overall structure of the proposed neural network model.

I-A Neural network

We propose a two-level cascaded neural network model, shown in Fig. 1. The first level contains three transformer networks, and the second level contains a synthesis network. The input of the proposed network is three consecutive LR frames from the same slice. The transformer networks warp the adjacent frames to the target frame, and the synthesis network combines the two transformed frames and the current frame into a super-resolution target frame.
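The sketch below shows, assuming a PyTorch implementation, how such a two-level cascade could be wired; the TransformerNet-style and SynthesisNet-style classes and their call signatures are placeholders for the sub-networks detailed in the following subsections, not the authors' exact code.

```python
# Structural sketch of the two-level cascade (illustrative placeholders).
import torch
import torch.nn as nn

class CascadedSR(nn.Module):
    def __init__(self, transformer_cls, synthesis_cls):
        super().__init__()
        # Level 1: one transformer network per input frame (previous, current, next).
        self.transformers = nn.ModuleList([transformer_cls() for _ in range(3)])
        # Level 2: synthesis network that fuses the aligned frames.
        self.synthesis = synthesis_cls()

    def forward(self, prev_frame, curr_frame, next_frame):
        frames = (prev_frame, curr_frame, next_frame)
        # Warp each frame towards the target (current) frame.
        warped = [net(f, curr_frame) for net, f in zip(self.transformers, frames)]
        # Fuse the warped frames into one super-resolution target frame.
        return self.synthesis(torch.cat(warped, dim=1))
```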

Figure 2: The structure of the transformer network.
Figure 3: The schematic view of the multi-scale multi-step transformation sub-network.

I-A1 Pre-trained network

We pre-trained a neural network for

I-A2 Transformer network

The transformer network is a multi-scale, multi-step model composed of two six-layer densely connected sub-networks [huang2017densely], called the inpainting sub-network and the transformation sub-network, as shown in Fig. 2. The inpainting sub-network predicts an erased region, and the transformation sub-network warps the adjacent frame to the target frame. A 15 pixel × 15 pixel erasing region is cropped from each input image and fed into the inpainting sub-network, which is expected to predict the erased part using the information contained in the surrounding unerased region. The output of the inpainting sub-network is then placed back into the original input image to replace the cropped region. The loss of the inpainting sub-network is

\mathcal{L}_{inp}=\mathbb{E}_{I_{LR_{crop}},I_{HR_{crop}}}\|\phi(I_{LR_{crop}})-\phi(I_{HR_{crop}})\|_{2}^{2}, (1)

where \phi is the feature extractor used in the perceptual loss [johnson2016perceptual]. In this paper, we use the perceptual loss to measure the difference between the neural network outputs and the corresponding ground truth. Compared with the mean squared error, the perceptual loss has the advantage of avoiding over-smoothing when generating images [dosovitskiy2016generating]. A pre-trained VGG16 model [simonyan2014very] is used to calculate the perceptual loss.
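As an illustration, the following sketch shows one way the 15 pixel × 15 pixel erasing step and the VGG16-based perceptual loss could be implemented with PyTorch and torchvision; the choice of VGG16 feature layer, the channel handling, and the helper names are assumptions, not details from the paper.

```python
# Illustrative erasing step and VGG16-based perceptual loss (assumed details).
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

def erase_region(frame, top, left, size=15):
    """Zero out a size x size patch and return (erased frame, original patch)."""
    patch = frame[..., top:top + size, left:left + size].clone()
    erased = frame.clone()
    erased[..., top:top + size, left:left + size] = 0.0
    return erased, patch

class PerceptualLoss(torch.nn.Module):
    """|| phi(x) - phi(y) ||_2^2 with phi taken from a frozen pre-trained VGG16."""
    def __init__(self, layers=16):  # layer cutoff is an assumption
        super().__init__()
        # Older torchvision versions use vgg16(pretrained=True) instead.
        self.features = vgg16(weights="IMAGENET1K_V1").features[:layers].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, prediction, target):
        # Replicate single-channel MR frames to the three channels VGG expects.
        if prediction.shape[1] == 1:
            prediction = prediction.repeat(1, 3, 1, 1)
            target = target.repeat(1, 3, 1, 1)
        return F.mse_loss(self.features(prediction), self.features(target))
```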

To obtain accurate warped frames, a multi-scale training strategy [xue2019video] is adopted. The input LR frames are down-sampled by factors of 2 and 4 using average pooling, and the three branches of the transformation sub-network share weights, as shown in Fig. 3. The multi-scale loss is

\mathcal{L}_{scale}=\sum_{k=1}^{3}\mathbb{E}_{I_{LR_{k}},I_{HR}}\|\phi(I_{LR_{k}})-\phi(I_{HR})\|_{2}^{2}, (2)

where k is the index of the multi-scale branches, k = 1, 2, 3.
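A possible PyTorch realisation of Eq. (2) is sketched below, where the input is down-sampled with average pooling at factors 2 and 4 and the weight-shared transformation sub-network is applied at every scale; transform_net and perceptual_loss are assumed to be defined elsewhere, and their call signatures are illustrative.

```python
# Sketch of the multi-scale loss in Eq. (2); signatures are assumptions.
import torch.nn.functional as F

def multi_scale_loss(transform_net, perceptual_loss, lr_frame, target_frame, hr_frame):
    loss = 0.0
    for factor in (1, 2, 4):  # full scale plus the 2x and 4x down-sampled branches
        lr_k = F.avg_pool2d(lr_frame, factor) if factor > 1 else lr_frame
        tgt_k = F.avg_pool2d(target_frame, factor) if factor > 1 else target_frame
        hr_k = F.avg_pool2d(hr_frame, factor) if factor > 1 else hr_frame
        warped_k = transform_net(lr_k, tgt_k)      # same weights at every scale
        loss = loss + perceptual_loss(warped_k, hr_k)
    return loss
```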

Apart from multi-scale training, our model also adopts multi-step training [shan20183]. The transformation sub-network is trained recurrently to generate accurate warped results. The multi-step loss is

\mathcal{L}_{step}=\frac{1}{3}\sum_{i=1}^{3}\mathcal{L}_{scale_{i}}, (3)

where i is the index of the recurrent step and \mathcal{L}_{scale_{i}} is the multi-scale loss in the i^{th} step. We set the number of steps to 3.
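The following sketch illustrates Eq. (3) under the assumption that each recurrent step feeds the previous warped output back into the same transformation sub-network; this wiring follows Fig. 3 only schematically and reuses the multi_scale_loss sketch above.

```python
# Sketch of the multi-step loss in Eq. (3); the recurrent wiring is an assumption.
def multi_step_loss(transform_net, perceptual_loss, lr_frame, target_frame,
                    hr_frame, num_steps=3):
    losses = []
    current = lr_frame
    for _ in range(num_steps):
        # Each step accumulates the multi-scale loss of the current input ...
        losses.append(multi_scale_loss(transform_net, perceptual_loss,
                                       current, target_frame, hr_frame))
        # ... and re-applies the weight-shared transformation sub-network.
        current = transform_net(current, target_frame)
    return sum(losses) / num_steps
```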

The overall loss of the transformer network is

\mathcal{L}_{tra}=\mathcal{L}_{inp}+\mathcal{L}_{step}. (4)

I-A3 Synthesis network

The synthesis sub-network is a modified encoder-decoder network [shan20183] (Fig. 4). The front half of the network includes six recursive blocks and extracts features from each input frame; the latter half also consists of six recursive blocks and combines the features from the three input frames to reconstruct a super-resolution target frame. In each recursive block, there are two pairs of convolutional layers that share weights. The numbers of kernels in the recursive blocks of the front half are 32, 32, 64, 64, 128, and 128; in the latter half they are 128, 128, 64, 64, 32, and 32. The objective function of the synthesis network is

\mathcal{L}_{syn}=\mathbb{E}_{I_{LR},I_{HR}}\|\phi(I_{LR})-\phi(I_{HR})\|_{2}^{2}. (5)
Figure 4: The structure of the synthesis sub-network.
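As an illustration, one recursive block with two weight-shared convolutional passes could be written in PyTorch as below; the 3 × 3 kernels, ReLU activations, and the channel-adapting entry convolution are assumptions rather than details from the paper.

```python
# Sketch of a recursive block with weight-shared conv pairs (assumed details).
import torch.nn as nn

class RecursiveBlock(nn.Module):
    """Applies the same convolution pair twice; an entry conv adapts the channel count."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.entry = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # This conv pair is applied twice with the same weights (the recursion).
        self.pair = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        x = self.pair(self.entry(x))   # first pass
        return self.pair(x)            # second pass reuses the same weights
```

Stacking such blocks with the kernel counts listed above (32, 32, 64, 64, 128, 128 for the encoder half and the reverse for the decoder half) would reproduce the overall layout of Fig. 4.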

The overall objective function is

\mathcal{L}=\alpha(\mathcal{L}_{tra_{1}}+\mathcal{L}_{tra_{2}}+\mathcal{L}_{tra_{3}})+\beta\mathcal{L}_{syn}, (6)

where \mathcal{L}_{tra_{1}}, \mathcal{L}_{tra_{2}}, and \mathcal{L}_{tra_{3}} are the losses of the three transformer networks, and \alpha and \beta are the weights of the two terms. The Adam optimizer was used in the training process with a learning rate of 10^{-4}. In this study, we jointly trained the transformer sub-networks and the synthesis sub-network end-to-end. In the first 30 epochs, we focused on training the transformer sub-networks, so \alpha was set far larger than \beta. In the last 30 epochs, \alpha was set far smaller than \beta so that training concentrated on the synthesis network.
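A schematic training loop reflecting this two-phase weighting is sketched below; model, loader, and the compute_losses helper are hypothetical placeholders, and the concrete \alpha/\beta values are assumptions, since the paper only states that one weight is set far larger than the other in each phase.

```python
# Sketch of the two-phase joint training schedule (hypothetical helpers).
import torch

def train(model, loader, epochs=60, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        # First 30 epochs emphasise the transformer losses, last 30 the synthesis loss.
        alpha, beta = (1.0, 0.01) if epoch < 30 else (0.01, 1.0)  # assumed values
        for lr_frames, hr_frame in loader:
            # compute_losses is a hypothetical helper returning the summed
            # transformer losses and the synthesis loss for one batch.
            loss_tra_total, loss_syn = model.compute_losses(lr_frames, hr_frame)
            loss = alpha * loss_tra_total + beta * loss_syn
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```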