
Learning With Context Feedback Loop for Robust Medical Image Segmentation

Kibrom Berihu Girum, Member, IEEE, Gilles Créhange, and Alain Lalande

Manuscript received December 4, 2020; revised January 19, 2021; accepted February 16, 2021. (Corresponding author: Kibrom Berihu Girum.) K.B. Girum is with the Imaging and Artificial Vision (ImViA) Research Laboratory, University of Burgundy, 21000 Dijon, France, and also with the Department of Radiation Oncology, Centre Georges François Leclerc (CGFL), 21000 Dijon, France (corresponding author's email: kibrom2b[at]gmail.com). G. Créhange is with the Department of Radiation Oncology, Institute of Curie, 75005 Paris, France, also with the Imaging and Artificial Vision (ImViA) Research Laboratory, University of Burgundy, 21000 Dijon, France, and also with the Department of Radiation Oncology, Centre Georges François Leclerc (CGFL), 21000 Dijon, France (email: gilles.crehange[at]curie.fr). A. Lalande is with the Department of Medical Imaging, University Hospital of Dijon, 21000 Dijon, France, and also with the Imaging and Artificial Vision (ImViA) Research Laboratory, University of Burgundy, 21000 Dijon, France (email: alain.Lalande[at]u-bourgogne.fr). Copyright ©2021 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at
https://doi.org/10.1109/TMI.2021.3060497
Abstract

Deep learning has been leveraged successfully for medical image segmentation. It employs convolutional neural networks (CNN) to learn distinctive image features from a defined pixel-wise objective function. However, this approach can lead to weak interdependence among output pixels, producing incomplete and unrealistic segmentation results. In this paper, we present a fully automatic deep learning method for robust medical image segmentation that formulates the segmentation problem as a recurrent framework of two systems. The first is a forward system, an encoder-decoder CNN that predicts the segmentation result from the input image. The predicted probabilistic output of the forward system is then encoded by a fully convolutional network (FCN)-based context feedback system. The encoded feature space of the FCN is then integrated back into the forward system's feed-forward learning process. The FCN-based context feedback loop allows the forward system to learn and extract more high-level image features and to fix previous mistakes, thereby improving prediction accuracy over time. Experiments performed on four different clinical datasets demonstrate our method's potential for single- and multi-structure medical image segmentation, outperforming state-of-the-art methods. With the feedback loop, deep learning methods can produce results that are both anatomically plausible and robust to low-contrast images. Therefore, formulating image segmentation as a recurrent framework of two interconnected networks via a context feedback loop is a promising approach for robust and efficient medical image analysis.

Index Terms

CNN, Feedback loop, MRI, Ultrasound, CT.

1 Introduction

1.1 Motivation and background


Medical image segmentation is often profoundly important in clinical image analysis and image-guided interventions. It involves partitioning a medical image into multiple areas that can then be used for better clinical analysis or clinical target visualization. For example, in prostate radiotherapy, accurate clinical target volume segmentation from magnetic resonance (MR), ultrasound (US), or computed tomography (CT) images is often essential in computer-aided diagnosis, therapy, and post-therapy analysis of prostate cancer [1] [2]. Indeed, it is critical to select patients for a specific treatment, guide source delivery during the intervention, and compute the dose distribution using MR, US, and CT images, respectively [1]. Similarly, in cardiac image analysis, accurate segmentation of the heart structures, such as the left and right ventricular cavities and the myocardium, is essential to calculate the volume of the cavities at the end-diastolic and end-systolic phases and the left ventricular myocardial mass [3] [4]. Segmentation of inner ear structures such as the cochlea, vestibule, and semi-circular canals from preoperative CT images can be used in the treatment of patients with hearing impairments. In cochlear implant surgery, preoperative cochlea segmentation from CT images is useful for determining the length of the implanted electrode and improving insertion guidance [5]. Other important applications of image segmentation include 2D echocardiography [6] and brain image segmentation [7], among others.

Despite the necessity of accurate and robust image segmentation in clinical routines, it is often challenging. Indeed, it depends strongly on the imaging modality and the target (such as an anatomical structure or tumor area). The main challenges in developing accurate and automatic medical image segmentation methods include the wide range of patient characteristics, the inherently low-contrast characteristics of some imaging modalities, artifacts (e.g., metallic artifacts in CT), respiratory motion, the limited data available in medical image analysis, and significant variations in the shape and size of organs among cases [8] [9]. Unfortunately, manual segmentation is frequently prone to subjective errors and is time-consuming, and it is not always easy to analyze multiple sequences of examinations manually. Researchers have applied different methods to address these challenges and thereby improve the analysis of medical images and case outcomes. For example, several groups have approached medical image segmentation using contour and shape detection, deformable models, conventional machine learning, and deep learning approaches [8] [10].

Early conventional medical image segmentation approaches were often based on edge detection and the fitting of predefined parametric shapes, level-set methods, and shape models [3] [8]. Many classical machine learning-driven methods (supervised and unsupervised) have also been proposed to tackle this challenging problem. Although traditional machine learning approaches such as active shape models, atlas methods, and statistically supervised and unsupervised methods are still prevalent [10], they often involve the careful integration of hand-crafted image features. These hand-crafted features need to represent the input image. However, designing or engineering distinctive image features by human experts can be difficult and sometimes impossible. Moreover, features designed manually for a given task may not adapt easily to new cases, which is the main obstacle to developing a general method for extracting distinctive image features [8] [10].

In this regard, deep learning methods based on convolutional neural networks (CNN) have emerged as an alternative and reliable solution [10]. These CNN-based methods automatically learn to extract hierarchical, distinctive image features, avoiding the need to develop hand-crafted features while yielding features of a generic nature. In fact, CNN-based approaches have achieved state-of-the-art (SOTA) performance in various tasks such as image classification [11], object detection [12], segmentation [13] [14], registration [15] [16], and other tasks [17] [18]. For example, fully convolutional neural networks are trained on pixel-to-pixel transformations to extract high-level image features and predict outputs from inputs of arbitrary size [14]. Following the success of these approaches in image and video processing, Simonyan et al. [18] proposed to increase the convolutional network depth via a network called VGG16. In this approach, convolutional layers are stacked one after the other to extract more distinctive image features.

Motivated by the promising results of the VGG16 network (a deep encoder-like structure), Badrinarayanan et al. [19] proposed to expand it into a deep fully convolutional neural network (FCN) architecture for semantic pixel-wise segmentation. It employs a trainable encoder network for high-level feature extraction and a corresponding decoder network. The decoder network maps the low-resolution encoder feature maps to full input-resolution feature maps for pixel-wise classification using up-sampling. This up-sampling operation was later replaced by a learnable deconvolution network in the work of Noh et al. [20]. Szegedy et al. [21] also proposed a new CNN design that increases both the depth and the width of CNN-based methods. As CNN architectures go deeper (i.e., a large number of convolutional layers stacked one after the other), residual connections become crucial for training [22] [23]. They smooth training and avoid the loss of high-level features during sampling or pooling operations.

Ronneberger et al. [13] modified the FCN proposed by Long et al. [14] to propagate contextual information from the encoder into the decoder. This is done by connecting the encoder to the decoder through skip connections, creating a U-shaped architecture (hence named U-Net). Similar to residual networks [22], these skip connections recover spatial information (such as localization information) that could be lost during the consecutive striding and convolution operations. Several groups have modified the U-Net architecture for various medical image analysis applications [10]. Çiçek et al. [7] adapted the 2D U-Net for volumetric (3D) segmentation. Tu et al. [24] introduced the auto-context U-Net, in which parallel 2D convolutional layers are applied to the axial, coronal, and sagittal planes of brain images.

Most previous works based on encoder-decoder (or U-Net) architectures have focused on improving the encoder's feature extraction ability, either by constraining the segmentation results [25], incorporating prior knowledge such as shape models [26] [27], or using transfer learning from pre-trained networks [28] [29].

Meanwhile, there has been increasing interest in incorporating the shape of anatomical structures into U-Net-like architectures using multi-task learning [27] [30] [31]. These approaches aim to address the common limitation of encoder-decoder architectures in capturing the structural information and interdependence of the output when training from pixel-wise objective functions [10]. Consequently, several groups have incorporated shape priors into image segmentation using statistical shape modeling [27] or learning-based modeling [25] [30] [32] [33]. These approaches generally showed a relative improvement over methods without the shape prior. However, they often require explicit prior knowledge of the target: the prior (for example, an anatomical shape) must first be modeled and then embedded into the U-Net architecture to constrain the learning process [6] [25] [27]. Other approaches include post-processing methods based on either denoising [34] or variational auto-encoders [35], which showed an improvement in the plausibility of the results. However, they are not free of limitations. Post-processing methods do not see the original input image and thus might not always produce accurate results from a given erroneous segmentation [35].

In encoder-decoder based deep learning methods, although the encoder is essential to extract high-level, distinctive input image features, this extraction can be challenging due to the often inconsistent contrast and potential artifacts associated with the inherent limitations of medical imaging devices. Thus, similar to the residual [22] and long skip connection [13] approaches, different alternative methods have been proposed to improve the propagation of contextual information from the encoder to the decoder. These methods use either gated attention-based networks [36] or recurrent neural networks [37]. They can be used either to highlight salient image features by capturing richer contextual dependencies [38] [39] or to recurrently recover the information lost during the consecutive sampling and re-sampling operations [40] [41] [42] [43] of the feed-forward learning approach. Recurrent-based networks rely on a concept similar to residual networks: first, they accumulate features, leading to better feature representation; second, they smooth the training [42]. Other approaches add additional paths to the encoder-decoder architecture [44] [45]. Most of these approaches are based on CNNs trained in a feed-forward manner and could be biased towards recognizing texture features rather than contextual or shape features [46].

Nonetheless, capturing the contextual information and inter-region relationships, and implicitly learning the prior knowledge of a target, remain of wide interest in deep learning-based medical image analysis. Still, encoder-decoder based image segmentation methods do not get a second chance to look at their segmentation results, except through gradient backpropagation from a defined pixel-wise objective function. Indeed, training from only pixel-wise objective functions can lead to the loss of spatial information, weak interdependence among output pixels, and poor representation of inter-region relationships, hence sometimes producing unrealistic segmentation results [25]. A recent study has also demonstrated that CNNs are strongly biased towards recognizing textures rather than shapes [46]. Thus, capturing the interdependence of the output pixels would enable the network to learn more distinctive image features. The network can learn to extract the texture and contextual similarity between pixels with the same label and the differences between differently labeled neighboring pixels, thereby producing realistic segmentations.

In this regard, the decoder should be intelligent enough to improve the classification of each pixel. To do so, the decoder needs to capture the contextual information and inter-region relationships, and to recognize errors that could be introduced during the consecutive striding, convolution, and up-sampling operations. Indeed, feeding back the predicted probabilistic output could help the decoder extrapolate contextual information and correct mistakes over time, thereby improving prediction accuracy. The idea of a feedback loop is well established in control systems theory, in which the system's output error signal is used to adjust the input signal. Studies on generative adversarial networks (GANs) have also recently shown interest in using a feedback loop to improve the model's learning capacity [47] [48].

Therefore, in this study, to address the difficulties of CNNs in capturing contextual information and inter-region relationships, and to implicitly integrate prior knowledge into the feed-forward learning process, we formulate image segmentation as a recurrent framework using two systems (named Learning with context FeedBack system, LFB-Net). Note that we use the term system to refer to each of the two interconnected networks. The forward system, a modified U-Net architecture, learns to predict a probabilistic output from the raw input image. Probabilistic here refers to the pixel-level softmax assignment at the output of the network. The second system, a fully convolutional network (FCN), hereafter named the feedback system, transforms the predicted probabilistic output of the forward system into another high-level spatial representation. This high-level representation is then integrated back into the feed-forward learning path of the forward system. This feedback strategy, as we will show, allows the main segmentation module to improve its performance over time. More importantly, the feedback system mitigates the conditions in which the forward system can potentially fail.

1.2 Contributions

In this study, we present the first deep learning formalism using a feedback loop in encoder-decoder based image segmentation. To this end, we designed the image segmentation problem as a recurrent framework: the segmentation process of one network is conditioned on the latent space of its previous segmentation result and on a spatial response map from a second network, which enables it to improve its performance over time.

Our main contributions can be summarized as follows:

  1.

    We introduce a feedback looping system to enhance the feed-forward learning process of encoder-decoder CNNs for medical image segmentation.

  2.

    We model the image segmentation problem as a two-system task, passing information from one system to the other. We integrate the proposed feedback looping system with a modified U-Net architecture (the forward system) as a regularizer that enables the network to attend to and fix segmentation uncertainties over time.

  3.

    The proposed method, LFB-Net, is validated on different medical image segmentation applications. Specifically, we evaluated it on short-axis cardiac MRI, long-axis echocardiographic images, CT of the prostate, and CT of the inner ear. Our extensive experiments and ablation studies on clinical databases show that our method consistently outperforms SOTA methods at a reduced computational cost. As shown in the results section, it yields accurate and plausible results without the need to incorporate shape priors or apply post-processing methods.

The rest of this paper is organized as follows. Section 2 introduces the proposed method (LFB-Net). In Section 3, we present the experimental results and discussion. Finally, we draw our conclusions in Section 4.

2 Methodology

The main objective of a supervised FCN is to learn how to best predict the target output $y$ from a given input image $x$, i.e., a mapping of the input image into the labeled target ($f: x \mapsto y$). The network $f$ thus learns through a back-propagation algorithm from a defined loss function $L(y, \hat{y})$ (which is often an error between the model's prediction $\hat{y}$ and the reference $y$). Thus, it learns the model's neural network weights $w$ that best map the input image into the output target image, i.e., $\hat{y} = f(x; w)$.

Figure 1: Framework of the proposed method: (A) the architecture of LFB-Net, in which the encoder, decoder, and bottleneck are the components of the forward system (U-Net architecture), and (B) the convolutional building block of the encoder and the decoder. The feedback system's output (i.e., $h_f$ of size $32^2 \times 256$) is concatenated with the forward system encoder's output (i.e., $h_s$). The switch indicates the alternating training strategy of the two systems. From top to bottom: echocardiography long-axis view, cine-MRI short-axis view, prostate CT, and CT of the inner ear on the left, with segmentation results on the right.

In our model, LFB-Net consists of two major parts: the main segmentation module (forward system) and the feedback system (Fig. 1). Each module is itself composed of an encoder and a decoder. Encoder-decoder based FCN architectures learn the distinctive features of the dataset from a defined objective function $L_f(y, f_d(f_e(x)))$ [14]. Here, the functions $f_e$ and $f_d$ are defined as the encoder and decoder mapping functions, respectively (i.e., $f_e: x \mapsto h$, and then $f_d: h \mapsto y$). The intermediate representation of the input image, $f_e(x) = h$, is often named the latent space or high-level feature space. It is another transformed spatial representation of the input image $x$. We use this representation to describe our method in the following sections.

2.1 Forward system

We formulate the main segmentation network $S$ (forward system in Fig. 1) as two main parts: a segmentation encoder $S_e$, which encodes the raw input image into a high-level feature space, and a decoder $S_d$, which then decodes the encoded features into the target labels. Specifically, the encoder network $S_e$ transforms the input image $x$ into the encoded feature space $h$, i.e., $S_e: x \mapsto h$, and the decoder transforms the intermediate representation $h$ into the desired label $y$, i.e., $S_d: h \mapsto y$. These consecutive stages are equivalent to

\hat{y} = S(x) = S_{d}(S_{e}(x)),   (1)

where $x \in \mathbb{R}^{W \times H}$ is the input image and $\hat{y} \in \mathbb{R}^{W \times H}$ is the predicted output label. Here, $W$ and $H$ indicate the original image width and height, respectively. The latent space of this network is $S_e(x) = h_s \in \mathbb{R}^{W/d \times H/d \times C}$, where $C$ indicates the number of feature channels at the latent space and $d$ is the down-sampling factor set by the depth of the network.

To avoid losing spatial information during down-sampling, this network includes skip connections. It is similar to the U-Net architecture [13], except that the encoder and decoder are designed so that they can be trained jointly or separately, as will be shown in Section 2.3.
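To make the roles of $S_e$ and $S_d$ and the shape bookkeeping concrete, the snippet below is a minimal sketch in Keras (the framework reported in Section 3.1), assuming a toy configuration with two down-sampling stages and arbitrary channel counts; it illustrates the encoder-decoder split, not the published architecture.

```python
# Minimal sketch of the forward system S(x) = S_d(S_e(x)); toy depth and
# channel counts, not the published configuration.
from tensorflow.keras import layers, Model

def build_forward_encoder(input_shape=(256, 256, 1), channels=(32, 64), latent_channels=128):
    x_in = layers.Input(shape=input_shape)
    x, skips = x_in, []
    for c in channels:
        x = layers.Conv2D(c, 3, padding="same", activation="elu")(x)
        skips.append(x)                      # kept for the long skip connections
        x = layers.MaxPooling2D(2)(x)        # halves W and H at every stage
    h_s = layers.Conv2D(latent_channels, 3, padding="same",
                        activation="elu", name="h_s")(x)   # latent space W/d x H/d x C
    return Model(x_in, [h_s] + skips, name="S_e")

def build_forward_decoder(latent_shape=(64, 64, 128),
                          skip_shapes=((256, 256, 32), (128, 128, 64)), n_classes=1):
    h_in = layers.Input(shape=latent_shape)
    skip_ins = [layers.Input(shape=s) for s in skip_shapes]
    x = h_in
    for skip in reversed(skip_ins):          # coarsest skip first
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])  # recover spatial detail lost by pooling
        x = layers.Conv2D(skip.shape[-1], 3, padding="same", activation="elu")(x)
    y_hat = layers.Conv2D(n_classes, 1, activation="sigmoid")(x)
    return Model([h_in] + skip_ins, y_hat, name="S_d")
```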

2.2 Feedback system

The feedback system aims to give the forward system's decoder network a second chance to look back at its predicted output. This could be done by formulating the network recurrently and merging the full-resolution outputs at the last layer of the decoder. However, providing a higher-level representation of the predicted output, with small corrections applied by the feedback system (i.e., spatial feedback), enhances the decoder's learning capacity [48]. Thus, the forward system's probabilistic output is fed to a new FCN architecture ($F$, the feedback system), which transforms the predicted probabilistic output into another high-level feature space. This approach of correcting and feeding back the output is analogous to a teacher correcting a student's exam and then returning the correct answers. This strategy allows the forward system to attend to specific regions of its previous results and improve prediction accuracy.

Let us divide the feedback system $F$ into two parts: the encoder $F_e$, which encodes the forward system's probabilistic output $\hat{y}$ into a high-level feature space $h_f$ (i.e., $F_e: \hat{y} \mapsto h_f$), and the decoder $F_d$, which then transforms the high-level feature space $h_f$ into the desired label $y$ (i.e., $F_d: h_f \mapsto \hat{\hat{y}}$). As in (1), this process can be written as:

\hat{\hat{y}} = F(\hat{y}) = F_{d}(F_{e}(\hat{y})),   (2)

where $\hat{\hat{y}}$ is the predicted probabilistic output label from the feedback system, and $\hat{y}$ is the predicted probabilistic output label from the forward system.

2.3 Integration

We need to integrate the feedback system $F$ with the forward system $S$. We therefore designed the integration of the two systems as a recurrent process, formulated as follows. Firstly, the segmentation module produces a label map $\hat{y}_i$ (at iteration $i$) using (1). Secondly, the feedback system predicts a label map $\hat{\hat{y}}_i$, taking $\hat{y}_i$ from the output of the forward system as input. We can now write the feedback system's encoder response map $F_e$ as:

h_{f_{i}} = F_{e}(\hat{y}_{i})   (3)

where $h_{f_i} \in \mathbb{R}^{W/d \times H/d \times C}$ is the high-level feature space, or latent space, of the feedback system at iteration $i$. It has the same size as the forward system's encoder output, $h_{s_i} \in \mathbb{R}^{W/d \times H/d \times C}$. The decoder ($F_d$) then transforms $h_{f_i}$ into the output label map $\hat{\hat{y}}_i$.

Thirdly, we use the high-level feedback feature space $h_{f_i}$, together with the encoded input from the forward system $h_{s_i}$, as input to the decoder $S_d$ at iteration $i+1$. Therefore, we can now redefine (1) at iteration $i+1$ as:

\hat{y}_{i+1} = S_{d_{i+1}}\big(h_{s_{i}}, F_{e_{i}}(\hat{y}_{i})\big)   (4)

which can be written as

\hat{y}_{i+1} = S_{d_{i+1}}\big(h_{s_{i}}, h_{f_{i}}\big)   (5)

As our system aims to incorporate the contextual spatial feedback into the feed-forward learning process, at the first stage we set the feedback latent space to zero, $h_{f_0} = 0$ (i.e., $h_{f_0} = h_0$). Then, in the second stage (i.e., at iteration $i+1$), we replace $h_0$ with the feedback latent space $h_{f_i}$. This is indicated in Fig. 1 as the switch between $h_0$ and $h_f$.
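As a small illustration of Eq. (5) and of the switch between $h_0$ and $h_f$, the sketch below concatenates the two latent spaces before the first decoder block. The $32^2 \times 256$ latent size follows Fig. 1, but the decoder stub and layer sizes are placeholder assumptions.

```python
# Sketch of the merging step of Eq. (5): S_d receives h_s together with
# either a zero map h_0 (first stage, the "switch") or the feedback latent
# space h_f = F_e(y_hat).
import numpy as np
from tensorflow.keras import layers, Model

h_s_in = layers.Input(shape=(32, 32, 256), name="h_s")   # forward encoder output
h_f_in = layers.Input(shape=(32, 32, 256), name="h_f")   # feedback encoder output (or zeros)
m = layers.Concatenate(name="M")([h_s_in, h_f_in])       # merging block M of Fig. 2
x = layers.Conv2D(256, 3, padding="same", activation="elu")(m)
# ... the up-sampling stages of S_d would follow here ...
decoder_stub = Model([h_s_in, h_f_in], x, name="S_d_stub")

h_s = np.random.rand(1, 32, 32, 256).astype("float32")
h_0 = np.zeros_like(h_s)                                  # first stage: h_f0 = 0
first_pass = decoder_stub.predict([h_s, h_0])
```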

2.4 Training strategy

The integration and training strategy of the forward system and the feedback system can be summarized as follows:


  1.

    Train the neural network weights of the forward system, $w^s_i$, taking the raw input image $x$ and a zero feedback latent space (i.e., $h_{f_i} = 0$) as inputs, and the ground truth labels $y$ as output.

  2.

    Train the neural network weights of the feedback system, $w^f_i$, taking the predicted output of the forward system's decoder network $\hat{y}$ as input, and the ground truth label $y$ as output, as in (2).

  3.

    Train only the neural network weights of the forward system's decoder part, $w^s_{d_{i+1}}$, taking as inputs the high-level features previously extracted from the raw input image in step 1 (i.e., $h_{s_i}$) and the feedback latent space $h_{f_i}$ from the feedback system in step 2, as in (5). Here, the forward system's and the feedback system's encoders are frozen, i.e., they predict using the weights learned and updated during step 1 and step 2, respectively.

  4.

    While not converged, repeat the above steps. Convergence is determined by the change in the validation loss at the output of the forward system. These steps are also shown in Fig. 2.

Note that each step is trained in turn until the network has seen the whole training dataset. During the training phase, the feedback system provides feedback to the forward system; its decoder part ($F_d$) is discarded during the testing phase. A minimal sketch of this alternating scheme is given below.
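This sketch assumes three compiled Keras models sharing weights: `forward`, mapping $(x, h_f) \mapsto \hat{y}$; `feedback`, mapping $\hat{y} \mapsto \hat{\hat{y}}$; and `forward_decoder`, which reuses the frozen $S_e$ and $F_e$ (trainable = False) so that only the decoder weights $w^s_d$ are updated. The model names and the one-epoch granularity per step are illustrative assumptions, not the exact published schedule.

```python
# Sketch of training steps 1-4; `forward`, `feedback` and `forward_decoder`
# are assumed to be compiled Keras models sharing weights as described above.
import numpy as np

def train_round(forward, feedback, forward_decoder, x, y, latent_shape=(32, 32, 256)):
    h0 = np.zeros((len(x),) + latent_shape, dtype="float32")

    # Step 1: train the whole forward system with a zero feedback latent space.
    forward.fit([x, h0], y, epochs=1, verbose=0)

    # Step 2: train the feedback FCN on the forward system's probabilistic output.
    y_hat = forward.predict([x, h0])
    feedback.fit(y_hat, y, epochs=1, verbose=0)

    # Step 3: update only the decoder S_d, feeding it h_s (from the frozen S_e)
    # and h_f = F_e(y_hat) (from the frozen feedback encoder).
    forward_decoder.fit([x, y_hat], y, epochs=1, verbose=0)

# Step 4: repeat train_round(...) until the validation loss of the
# forward system stops improving.
```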

Figure 2: Training scheme of the proposed LFB-Net method. For a given iteration $i$: Step 1: train the forward system; Step 2: train the feedback FCN system. For the next iteration $i+1$: Step 3: train the forward system's decoder network with the feedback loop. In step 3, only the neural network weights of $S_d(h_s, h_f)$ (i.e., $w^s_{d_{i+1}}$) are updated. In this case, while the encoder ($S_e(x; w^s_{e_i})$) is in predicting mode, the feedback loop ($F_e(\hat{y}_i; w^f_{e_i})$) is used to regulate the forward system's decoder ($S_d$) by feeding back its predicted probabilistic output from step 1. M is the merging block for $h_f$ and $h_s$. Freeze indicates that the network predicts the output using its already trained weights.

2.5 Training Loss function

To train our model, we used an average of binary cross-entropy and Dice coefficient loss functions as:

L_{total} = \frac{1}{2}\times(L_{1}+L_{2})   (6)

where $L_1$ and $L_2$ are the cross-entropy and Dice loss functions, respectively. The cross-entropy $L_1$ is computed as:

L_{1} = -\sum_{k=1}^{c}\sum_{i=1}^{I}\log\bigg[\frac{e^{s(k,i)}}{\sum_{i}e^{s(k,i)}}\bigg]   (7)

where $s(k,i)$ is the probabilistic feature map at pixel $i \in I$, belonging to the pixel class $k \in \{1, 2, 3, \ldots, c\}$ ($c$ being the number of classes), and $I$ is the number of pixels in the training batch. The Dice coefficient loss is calculated as:

L_{2} = 1 - \frac{1}{\sum_{k}\gamma_{k}}\bigg[\sum_{k}\gamma_{k}\frac{2\times\sum_{i\in I}u_{i}^{k}v_{i}^{k}}{\sum_{i\in I}u_{i}^{k}+\sum_{i\in I}v_{i}^{k}}\bigg]   (8)

where $u$ is the predicted output of the network, $v$ is a one-hot encoding of the ground truth segmentation map, and $\gamma_k$ is the weight associated with class $k \in \{0, 1, 2, \ldots\}$. Both $u$ and $v$ have shape $I$, with $i \in I$ indexing the pixels in the training batch. The loss function was the same for both the forward and the feedback systems.
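For illustration, a sketch of the combined loss of Eqs. (6)-(8) in TensorFlow is given below, assuming one-hot ground truth maps $v$, softmax predictions $u$, and uniform class weights $\gamma_k$; the weighting actually used may differ.

```python
# Sketch of the training loss: average of cross-entropy and Dice loss (Eq. 6).
import tensorflow as tf

def dice_loss(v, u, eps=1e-6):
    # Per-class soft Dice over all pixels of the batch, as in Eq. (8),
    # with uniform class weights gamma_k.
    axes = (0, 1, 2)
    intersection = tf.reduce_sum(u * v, axis=axes)
    denom = tf.reduce_sum(u, axis=axes) + tf.reduce_sum(v, axis=axes)
    return 1.0 - tf.reduce_mean((2.0 * intersection + eps) / (denom + eps))

def total_loss(v, u):
    # Eq. (6): L_total = (L1 + L2) / 2, with v the reference and u the prediction.
    ce = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(v, u))
    return 0.5 * (ce + dice_loss(v, u))
```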

2.6 Network architecture

The complete network architecture is shown in Fig. 1. In the forward system, each encoder block consists of repeated applications of $3\times 3$ convolutions, exponential linear unit (ELU) activation, and batch normalization, followed by a squeeze-and-excitation network [49] (Fig. 1 (B)). Moreover, we applied a $2\times 2$ max pooling operation with stride 2 for down-sampling. In contrast, the decoder is composed of a $2\times 2$ up-convolution and a concatenation layer, followed by the same block as in the encoder (Fig. 1 (B)).
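A sketch of this building block in Keras is given below, assuming a squeeze-and-excitation reduction ratio of 8 and two convolutions per block; the exact values are those reported in the supplementary material.

```python
# Sketch of the encoder/decoder building block: 3x3 conv -> ELU -> batch norm,
# repeated, followed by squeeze-and-excitation (SE) channel recalibration [49].
from tensorflow.keras import layers

def se_block(x, ratio=8):
    c = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                # squeeze: global channel statistics
    s = layers.Dense(c // ratio, activation="relu")(s)    # excitation bottleneck
    s = layers.Dense(c, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, c))(s)
    return layers.Multiply()([x, s])                      # channel-wise re-weighting

def conv_block(x, filters, n_convs=2):
    for _ in range(n_convs):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.Activation("elu")(x)
        x = layers.BatchNormalization()(x)
    return se_block(x)
```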

The feedback system is a fully convolutional network architecture with a learnable deconvolution network [20]. It consists of repeated applications of $3\times 3$ convolutions, ELU activation, and batch normalization in both the encoder and decoder blocks. We selected a concatenation layer to merge the latent spaces of the two systems. Please refer to the supplementary material for a detailed description of the architectures and hyper-parameter values used in this work.

3 Materials and experiments

In this section, we first introduce the implementation, training, and testing configurations. We then present the experimental studies on four datasets. The datasets were selected to demonstrate our method's performance on various medical image segmentation tasks, including single- and multi-structure segmentation. Specifically, prostate volume segmentation from CT images and cardiac multi-structure segmentation from 2D echocardiography and 3D cardiac-MR images are used. Moreover, to demonstrate our method's performance for applications involving multiple irregularly shaped targets per image and stronger class imbalance, we applied it to inner ear segmentation from μCT. In addition, we performed an extensive ablation study to analyze and validate the design of the method. These experiments focus on demonstrating the importance of learning with a feedback loop for accurate and robust medical image segmentation.

3.1 Experimental setup

The system was implemented in Python using the Keras API with a TensorFlow backend. The weights of the model were updated using (6) and the ADAM optimizer with a learning rate of $10^{-3}$ until convergence [50]. All networks, including the baseline models, were trained from scratch with an early stop of 100 and a batch size of 10. We mostly divided each dataset into 70% for training and the remaining 30% for testing. For the clinical datasets involving volumetric computation, the division was done by patient case. The model was trained on the given training dataset and then tested on new, unseen cases. The model's hyper-parameters were optimized only on the validation dataset.

The proposed network's parameters were the same for all segmentation applications, except for the output activation function and the number of output feature channels. For single-label segmentation (i.e., prostate and inner ear), the output activation was a sigmoid function. For multi-label segmentation (i.e., cardiac cine-MRI and ultrasound, with four output channels), it was a softmax function. In all experiments, the intensity values were normalized according to each dataset's mean and standard deviation, which allows the model to learn the optimal parameters quickly.
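A minimal sketch of this per-dataset z-score normalization is given below (the statistics are assumed to be computed over the whole training set of each dataset).

```python
# Sketch of the z-score intensity normalization described above.
import numpy as np

def normalize(images, eps=1e-8):
    # `images` holds all intensities of one dataset (e.g., N x H x W).
    mean, std = images.mean(), images.std()
    return (images - mean) / (std + eps)
```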

The proposed method was compared with other methods, namely U-Net [13], residually connected U-Net (ResU-Net) [22], FCN [20], Post-DAE [34], and the attention gated network (AGN) [38]. We chose these networks because their architectures are similar to that of our method. For the public datasets, we evaluated our method on the testing data. We also compared the proposed method with [25], [35], and [51] on the public data. During the ablation study, we extensively experimented with and without the feedback loop, for example using only the forward system (FS). All settings were otherwise the same for the baseline methods. Indeed, the baseline methods' hyper-parameters were optimized to yield their best results.

We considered both volume and distance metrics to evaluate the methods: the Dice similarity coefficient (DSC), the Hausdorff distance (HD), and the relative volume difference (RVD). For the volumetric datasets, the evaluation was performed in 3D. Unless specified otherwise, we adopt these metrics' notations throughout the paper. We also ran a statistical test using the Wilcoxon test [52] and consider results significantly different when the p-value is less than 0.05 (i.e., $p<0.05$). Moreover, we visually verified the plausibility of the results. In this context, plausibility refers to a complete, realistic, and anatomically possible segmentation. Therefore, a plausible or realistic output of a method contains no holes, no segments in unexpected regions of the image, and segmented areas that are similar in shape to the target structures.
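For reference, a minimal sketch of the three metrics for boolean 3D masks is given below, assuming NumPy/SciPy; voxel spacing handling is omitted for brevity.

```python
# Sketch of the evaluation metrics: Dice similarity coefficient (DSC),
# Hausdorff distance (HD, in voxel units here) and relative volume
# difference (RVD) for boolean 3D masks `pred` and `ref`.
import numpy as np
from scipy.ndimage import distance_transform_edt

def dsc(pred, ref):
    return 2.0 * np.logical_and(pred, ref).sum() / (pred.sum() + ref.sum())

def hausdorff(pred, ref):
    d_to_pred = distance_transform_edt(~pred)   # distance of every voxel to the prediction
    d_to_ref = distance_transform_edt(~ref)     # distance of every voxel to the reference
    return max(d_to_pred[ref].max(), d_to_ref[pred].max())

def rvd(pred, ref):
    return abs(int(pred.sum()) - int(ref.sum())) / ref.sum()
```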

3.2 Prostate segmentation in radiotherapy

Precise prostate gland segmentation from CT images is critical in the treatment of prostate cancer. However, it is not easy, due to the inherently low soft-tissue contrast of CT and the presence of strong metal artifacts. For example, accurate prostate segmentation is difficult in CT exams of patients who have received low dose rate brachytherapy for prostate cancer, in which tiny radioactive elements are permanently implanted into the gland; the quality of the CT image is degraded by the metal artifacts caused by the implanted radioactive elements. In our study, a clinical database of 78 prostate CT cases was collected. All cases had received prostate cancer treatment using a low dose rate brachytherapy technique. The in-plane resolution varies from 0.4 mm x 0.4 mm to 0.58 mm x 0.58 mm, with a slice thickness between 1.5 mm and 2.5 mm. We resampled all datasets to the same voxel size of 0.5 mm x 0.5 mm x 1.25 mm. A radiation oncologist with more than ten years of experience (>100 implants/year) manually delineated the prostate. Such delineations are routinely done in clinical treatments involving permanent brachytherapy with $^{125}$I for localized prostate cancer [1] [2]. The dataset was randomly divided into 70% for training and the remaining 30% for testing and validation (20% for testing and 10% for validation).

Table 1: Results for single-target organ segmentation (mean ± standard deviation). All metrics are computed in 3D; HD values are in mm. A higher DSC is better, whereas lower values are better for the HD and RVD metrics (bold values are better).
Method
Organ Metric FCN [20] U-Net [13] ResU-Net [22] Post-DAE [34] AGN [38] LFB-Net
Prostate DSC 0.89±2.20 0.89±0.02 0.89±0.02 0.89±0.02 0.89±0.02 0.91±0.02
(3D) HD 14.0±8.14 17.5±13.26 14.4±3.30 11.3±3.30 16.6±7.12 6.1±1.40
Ear DSC 0.93±0.05 0.92±0.06 0.93±0.05 0.87±0.07 0.93±0.05 0.96±0.01
(3D) RVD 0.07±0.08 0.08±0.09 0.06±0.09 0.09±0.11 0.04±0.05 0.02±0.02

From the experimental results shown in Table 1, our method segments the prostate gland with an average volumetric Dice index of 0.91 and a 3D HD of 6.1 mm, outperforming the other methods. For example, it produced a 2% higher Dice index and decreased the 3D HD by 11 mm compared with U-Net [13]. Our method without the feedback loop, i.e., using only the forward system (FS), produces a Dice index of 0.90 and a 3D HD of 10.3 mm, improving both metrics over the other methods. Using the feedback loop reduces the 3D HD error by 4.3 mm compared with not using it (FS only). The attention gated-based method [38] reduced the 3D HD over U-Net [13] by only 0.9 mm, while the Post-DAE method [34] reduced it by 6.2 mm; however, Post-DAE did not improve the average Dice index. The proposed LFB-Net method significantly outperformed ($p<0.05$) the other methods in both the Dice index and the 3D HD.

3.3 Inner ear segmentation

Inner ear segmentation is essential for 3D visualization and modeling of the inner ear for surgery. We chose the Hear-EU cochlear data descriptor public dataset, consisting of μCT scans of 17 dry temporal bone specimens [5]. The ground truth labels include the cochlear scala, the semicircular canals, and the vestibule. The images were acquired at 16.3 and 19.5 μm voxel resolutions for 13 and 4 μCTs, respectively. The original volume size of the μCTs ranged from 618 x 892 x 600 to 1500 x 1500 x 1500 voxels. The μCTs were standardized to a fixed size of 256 x 256 x 256 voxels to meet computational and memory requirements. The ground truth labels of 5 cases were not aligned with the μCT images; these were manually corrected to align perfectly with the μCT data. To compare all methods, two-fold cross-validation was performed by randomly dividing the data at each fold into 76% for training and 24% for validation.

As can be seen from Table 1, LFB-Net achieved a 4% increase in the average Dice index and reduced the relative volume difference by 6% compared with U-Net [13]. It also yielded accurate segmentation results without missing the small unconnected parts of the inner ear on CT images, which can then be used to model the inner ear structures for ear surgeries [53]. In contrast, the other methods yielded less accurate segmentation results. In particular, the post-processing method Post-DAE [34] appeared to degrade the segmentation results of U-Net [13], lowering the Dice index by 5%.

3.4 Cardiac cine-MRI segmentation

The third application of our method, which demonstrates its performance for multi-class segmentation, is cardiac cine-MRI segmentation. Although short-axis cardiac cine-MRI is essential to study the cardiac function of the left and right ventricles, accurate segmentation of both ventricles remains challenging. The segmentation of the endocardial border in diastole and systole is required to evaluate cardiac function (i.e., cavity volume and ejection fraction) for the left and right ventricles, and the epicardial border of the left ventricle is mandatory to evaluate the myocardial mass and thickness. The main challenge is the variability in the shape of the left and right ventricular cavities. In this regard, we chose the Automated Cardiac Diagnosis Challenge (ACDC) dataset [4] to evaluate the proposed method. It contains 100 patients for training and 50 for testing. These datasets are available for download at https://acdc.creatis.insa-lyon.fr/. To analyze how the baseline methods perform on the testing set, we had access to the ground truth upon request to the organizers. We randomly divided the 100 cases into 75 patients (75%) for training and 25 patients (25%) for validation (all cases with both diastolic and systolic phases). To further inspect the performance of the method, we evaluated the end-diastolic and end-systolic phases separately on the 50 testing cases.

In all 3D experimental measurements on the validation set shown in Table 2, our method appeared to improve segmentation accuracy. It yielded average Dice and HD values, respectively, of 0.96 ± 0.02 and 6.7 ± 3.95 mm for the left ventricular cavity (LV), 0.91 ± 0.05 and 13.2 ± 7.90 mm for the right ventricular cavity (RV), and 0.90 ± 0.03 and 9.4 ± 5.21 mm for the myocardium (MYO) on the validation datasets. The 3D HD value of the myocardium refers to the largest distance error from either the endocardium or the epicardium. More importantly, we observed that the improvement is significant in cases where the segmentation task is difficult, such as the right ventricular cavity and the myocardium.

As shown in Table 3, on the ACDC testing set, our method yielded average Dice and 3D HD values, respectively, of 0.94 ± 0.06 and 7.5 ± 5.54 mm for the LV, 0.92 ± 0.06 and 11.9 ± 6.49 mm for the RV, and 0.90 ± 0.03 and 9.5 ± 5.58 mm for the MYO. We observed that LFB-Net significantly outperforms the other methods ($p<0.05$). To further analyze the method's generalizability, results from the unseen testing set and the validation set are shown in Table 4. Although the sample sizes differ between the validation set, 25 cases (each with end-diastolic and end-systolic phases), and the testing set, 50 cases (each with end-diastolic and end-systolic phases), the average values in Table 4 can be used to infer the methods' performance on each set. Indeed, the proposed method showed almost no difference between the unseen testing and validation sets, achieving a total Dice index of 0.92 and a 3D HD value of around 10 mm. In contrast, the other methods showed a large difference between the validation and testing sets. For example, LFB-Net significantly outperformed the other methods in heart segmentation on the testing data (shown in Table 4), but this was not always the case on the validation data.

Table 2: Multi-structure cardiac image segmentation results on the validation sets for the end-diastolic (ED) and end-systolic (ES) phases (HD in mm). LV: left ventricular cavity; MYO: myocardium; RV/LA: right ventricular cavity (RV) for the cine-MRI and left atrium cavity (LA) for the echocardiography. The bold values refer to the best performance for each metric.
LV RV/LA MYO
Dice HD Dice HD Dice HD
Data Method ES ED ES ED ES ED ES ED ES ED ES ED
Echocardiography (2D) AGN [38] 0.92 0.95 9.7 6.0 0.92 0.89 7.1 8.5 0.69 0.69 24.2 25.0
FCN [20] 0.92 0.95 5.5 5.7 0.92 0.90 4.8 5.8 0.86 0.85 7.0 8.1
U-Net [13] 0.93 0.95 5.4 5.4 0.92 0.89 5.1 6.5 0.87 0.86 7.4 8.3
ResU-Net [22] 0.92 0.94 5.7 6.0 0.91 0.88 5.4 6.3 0.86 0.86 7.6 8.4
(54 cases) LFB-Net 0.93 0.95 5.3 5.4 0.93 0.91 4.6 5.4 0.87 0.87 6.7 7.1
Cine-MRI (3D) AGN [38] 0.94 0.96 7.8 6.9 0.84 0.93 15.1 14.3 0.90 0.89 11.1 10.1
FCN [20] 0.93 0.96 7.8 6.7 0.84 0.92 15.58 14.7 0.89 0.87 11.1 10.0
U-Net [13] 0.93 0.96 8.3 8.3 0.84 0.93 16.2 14.4 0.90 0.88 11.9 9.6
ResU-Net [22] 0.92 0.95 9.3 8.5 0.85 0.93 15.5 15.9 0.89 0.88 11.7 10.4
(25 cases) LFB-Net 0.94 0.97 6.7 6.5 0.88 0.95 13.7 12.9 0.91 0.89 10.5 8.4

3.5 Echocardiographic image segmentation

Accurate cardiac structure segmentation from echocardiographic images is profoundly important in cardiac diagnosis. For this, we chose the Cardiac Acquisitions for Multi-structure Ultrasound Segmentation (CAMUS) dataset to evaluate our method [51]. It contains two- and four-chamber acquisitions from 500 patients, at the end-diastolic and end-systolic phases. Thus, a given patient has four images (two in end-diastole and two in end-systole). The segmentation references and raw data of 450 patients are available for download at https://camus.creatis.insa-lyon.fr/challenge/. We randomly divided these data into 396 patients for training and 54 patients for validation.

As shown in Table 2, on the 54 validation exams, our method improved both the Dice and HD values. For the two view acquisitions (two-chamber and four-chamber), our method yielded average Dice indices of 0.94 ± 0.03, 0.92 ± 0.04, and 0.86 ± 0.06 for the 4-chamber view, and 0.94 ± 0.03, 0.92 ± 0.05, and 0.88 ± 0.04 for the 2-chamber view, respectively, in the left ventricular cavity (LV), the left atrium (LA), and the myocardium (MYO). The average HD values were 5.0 ± 2.83 mm, 5.2 ± 3.48 mm, and 6.7 ± 3.04 mm for the 4-chamber view, and 5.6 ± 3.22 mm, 4.8 ± 2.79 mm, and 7.1 ± 3.86 mm for the 2-chamber view, respectively, in the LV, LA, and MYO. Although the method produces similar results for the different view acquisitions in the LV and LA, it improved the myocardium segmentation on the 2-chamber view over the 4-chamber view by 2% in the Dice index.

Table 3: Results for cardiac cine-MRI segmentation on the ACDC testing set (50 cases) at the end-diastolic (ED) and end-systolic (ES) phases. LV: left ventricular cavity; MYO: myocardium; RV: right ventricular cavity. * ($p<0.05$) indicates that the difference between LFB-Net and the other method is significant. Values are expressed as mean ± standard deviation in 3D (HD in mm).
LV RV MYO
Dice HD Dice HD Dice HD
Method ES ED ES ED ES ED ES ED ES ED ES ED
AGN [38] 0.89 0.96 11.5 7.9 0.83 0.91 16.5 13.4 0.89 0.88 12.5 10.5
±0.10* ±0.02* ±9.73* ±5.37* ±0.14* ±0.09* ±10.39* ±8.33* ±0.05* ±0.04* ±5.69* ±6.12*
FCN [20] 0.89 0.96 10.8 7.6 0.85 0.90 15.8 14.0 0.87 0.86 12.6 11.3
±0.09* ±0.02* ±6.2* ±4.49* ±0.12* ±0.07* ±7.70* ±6.55* ±0.05* ±0.04* ±4.60* ±6.09*
U-Net [13] 0.89 0.96 11.3 8.2 0.83 0.90 16.8 14.3 0.89 0.87 12.5 10.9
±0.09* ±0.02* ±7.90* ±5.0* ±0.17* ±0.12* ±10.2* ±8.29* ±0.04* ±0.04* ±5.1* ±6.27*
ResU-Net [22] 0.90 0.96 10.5 8.6 0.85 0.90 15.2 13.5 0.88 0.87 12.6 10.8
±0.10* ±0.02* ±7.34* ±5.0* ±0.11* ±0.12* ±6.91* ±8.22* ±0.09* ±0.04* ±7.98* ±6.39*
LFB-Net 0.92 0.97 8.5 6.5 0.89 0.94 13.0 10.9 0.91 0.89 9.9 9.1
±0.07 ±0.02 ±6.76 ±3.84 ±0.08 ±0.04 ±6.46 ±6.49 ±0.03 ±0.03 ±5.33 ±5.92
Table 4: Total heart segmentation performance on the ACDC validation (Valid.) and testing (Test.) sets in 3D, computed as the average of the MYO, LV, and RV segmentation values. * ($p<0.05$) indicates that the difference between LFB-Net and the other method is significant.
Metric
Method Dice HD (mm)
       Valid.        Test.        Valid.         Test.
AGN [38] 0.910±0.07* 0.894±0.09* 10.9±7.13 12.0±8.23*
FCN [20] 0.903±0.07* 0.890±0.08* 11.0±7.22 12.0±6.53*
U-Net [13] 0.906±0.06* 0.888±0.10* 11.5±7.50 12.3±7.79*
ResU-Net [22] 0.903±0.07 0.893±0.09* 11.9±8.39* 11.9±7.32*
LFB-Net 0.921±0.05 0.920±0.06 9.9±6.47 9.7±6.18
Table 5: Long-axis echocardiographic image segmentation results on the CAMUS testing data at the end-diastolic (ED) and end-systolic (ES) phases (HD in mm). Comparisons are shown between our LFB-Net method and the CAMUS challengers. Results were obtained from the CAMUS challenge portal. The provided inter- and intra-observer values are from only 40 cases (good and medium image quality), excluding ten low-quality image cases. No inter- and intra-observer studies were provided for the left atrium [51].
LV: Endocardium LV: Epicardium Left Atrium
Dice HD Dice HD Dice HD
Data Method ES ED ES ED ES ED ES ED ES ED ES ED
Echo- inter-observer 0.873 0.919 6.6 6.0 0.890 0.913 8.6 8.0 - - - -
cardiography ±\pm0.060 ±\pm0.033 ±\pm2.4 ±\pm2.0 ±\pm0.047 ±\pm0.037 ±\pm3.3 ±\pm 2.9 - - - -
(2D) Intra-observer 0.930 0.945 4.5 4.6 0.951 0.957 5.0 5.0 - - - -
±0.031 ±0.019 ±1.8 ±1.8 ±0.021 ±0.019 ±2.1 ±2.3 - - - -
Oktay O. et al.  [25] 0.913 0.936 5.6 5.6 0.945 0.953 5.9 5.9 0.911 0.881 5.8 6.0
Leclerc S. et al.  [51] 0.912 0.936 5.5 5.3 0.946 0.956 5.7 5.2 0.918 0.889 5.3 5.7
Testing U-net-2 [51] 0.899 0.922 5.3 5.7 0.923 0.932 6.4 6.4 0.888 0.848 6.2 6.9
(50 cases) LFB-Net 0.926 0.946 4.8 4.8 0.952 0.959 5.2 5.2 0.924 0.902 5.0 5.2

As shown in Table 5, on the 50 CAMUS testing exams, LFB-Net achieved an average Dice index of 0.96 ± 0.02 for the LV epicardium, 0.94 ± 0.03 for the LV endocardium, and 0.91 ± 0.07 for the left atrium, outperforming the other CAMUS challengers [25] [51]. LFB-Net improves the segmentation of all labeled structures with less accuracy variability across the test data. Moreover, it notably improved the segmentation of the left atrium at the end-diastolic phase by a large margin. However, as the results were obtained by submitting the predicted images to the challenge website, we could not perform a statistical comparison [51]. The proposed method also achieved results comparable to the intra-observer values; in particular, it yielded better Dice indices except for the endocardium in the systolic phase. Thus, segmentation with the context feedback loop yields consistent results.

3.6 Qualitative segmentation results

As one can observe from the qualitative segmentation results in Fig. 3 for single-label and Fig. 4 for multi-label segmentation, our model produces more plausible results than the other methods. Careful visual inspection, checking for holes within a given structure for single-structure segmentation and between structures for multi-structure segmentation, and comparing shapes, shows that our method produces more plausible results. The other methods produce holes within a structure or between structures and sometimes produce atypical results, i.e., types of errors that would not be made by manual segmentation. Moreover, as seen in the ear segmentation [Rows 4-6], the other methods appear to fail when segmenting multiple unconnected small structures. In contrast, our method produces a more realistic segmentation of all structures. We observed similar behavior throughout the testing data. Indeed, anatomical plausibility is a prerequisite for experts to use the segmented structures for clinical assessments. With the proposed method, the reliability of the segmentation makes the clinical information extracted from these segmented structures trustworthy.

Figure 3: Examples of single label segmentation results. [Rows 1-3] Prostate segmentation examples, and [Rows 4-6] inner ear segmentation examples. The predicted mask (in green) is overlapped with the ground truth (in red).
Figure 4: Examples of multi-label cardiac image segmentation results. [Row 1-2] long-axis echocardiography, and [Row 3-4] short axis cine-MRI segmentation. RV: Right ventricular cavity; LV: Left ventricular cavity; MYO: Myocardium; and LA: Left atrium.

3.7 Network ablation study

Ablation study for system design

To evaluate the contribution of each building block in our method, we created the following configurations.

  1.

    The forward system (FS): a U-Net architecture without the feedback looping system in both the training and the testing phase (using only step 1 in Fig. 2).

  2.

    The FS*: the forward system (FS) without the squeeze-and-excitation network, used to study its effect in our method.

  3.

    The proposed method (LFB-Net): the forward system regularized by the FCN-based feedback system during both the training and testing phases.

Figure 5 shows the ablation study on the 25 ACDC validation cases. It can be seen that training with the feedback loop consistently improved the results. Moreover, using the feedback system's encoder during testing improved accuracy compared with using it only during training. As shown in Fig. 5, our method produces smaller inter-case differences, yielding smaller standard deviations in both the Dice and 3D HD metrics. The SE block [49] also appeared to increase the forward system's accuracy, yielding better results than without it; this held for the total average values. We found that segmentation with the feedback loop significantly outperforms the other two network configurations (FS and FS*) in both the Dice and 3D HD metrics for the RV and MYO ($p<0.05$). Although the feedback loop also improved the average performance for LV segmentation, the improvement was not significant there. Note that most networks, including the SOTA methods, performed well for LV segmentation but not for MYO and RV segmentation. Thus, these results allow us to say that using the feedback loop increases segmentation accuracy, and significantly so for the more complicated structures.

Figure 5: Box plot of the ablation study for system design on the ACDC validation set. [row 1] Mean Dice coefficient (higher is better), [Row 2] mean 3D HD (mm) (lower is better). [Column 1] RV, [column 2] MYO, and [column 3] LV. FS: Forward system; FS*: Forward system without the squeeze-and-excitation network; and LFB: the proposed final method.

As shown in Fig. 6, training with the feedback loop mitigates the conditions in which the forward system potentially fails. Firstly, in these examples, the channel-wise feature recalibration (i.e., the squeeze-and-excitation network) improves the segmentation over using only convolutional layers, which is consistent with the quantitative results in Fig. 5. Training with the feedback loop improves the forward system's accuracy, primarily when it produces low-quality label maps. In these examples, we observed that the improvement was mainly at the basal and apical regions of the heart, in particular for the right ventricular cavity at the end-systolic phase and for the endocardial and epicardial borders at the end-diastolic phase. These qualitative results are consistent with the quantitative results presented in Tables 3 and 5 for the ACDC and CAMUS testing datasets, respectively. Most other methods produce weaker quantitative results for structures that are difficult to segment, such as the right ventricular cavity and the endocardial and epicardial borders for the ACDC data, and the left atrium for the CAMUS data. In contrast, LFB-Net produces better results for these regions.

The proposed feedback system is integrated with the forward system during training. The optimal trained neural network weights of the forward and feedback systems are thus saved simultaneously as the final models. This design enables the forward system to always benefit from the feedback loop. In contrast, this might not always be the case with post-processing methods [34] [35]. For example, the denoising autoencoder-based method [34] degraded the Dice index of Ronneberger et al. [13] in the inner ear segmentation by 5%.

Figure 6: Examples from the ablation study for system design on the ACDC data. From left to right: original image, ground truth, FS, FS*, and our method (LFB-Net). [Rows 1-5] Cases where the feedback loop based segmentation (i.e., LFB-Net) corrects wrong segmentations from the forward system. The red, green, and blue colors show the right ventricular cavity, the myocardium, and the left ventricular cavity, respectively. The white arrows show the areas where FS and FS* fail to segment the targets accurately.

To further study the hypothesis that the feedback loop increases accuracy, specifically for difficult or noisy images, we calculated the percentage of cases on the CAMUS dataset whose Dice is less than 0.88 or whose HD error is greater than 6.5 mm. The experimental results are shown in Table 6, indicating that most of our method's results are above 0.88 Dice and below 6.5 mm HD error, outperforming the other methods. This is particularly true for the MYO and LA. However, for the LV, which is not a difficult structure to segment, our method had 2.8% more cases above the HD threshold than U-Net [13]. These results further demonstrate that the feedback loop is of significant benefit for segmenting difficult structures such as the MYO and LA.

Table 6: Percentage of cases whose Dice is below 88% or HD is above 6.5 mm for 54 CAMUS validation sets.
Dice HD (mm)
% (minimum value) % (maximum value)
Method MYO LA LV MYO LA LV
FCN 53.9% 16.4% 5.6% 51.4% 22% 29.4%
(0.54) (0.36) (0.74) (34.4) (19.6) (19.9)
U-Net 49.5% 17.8% 6.1% 50.9% 24.3% 24.3%
(0.56) (0.64) (0.72) (39.2) (46.6) (20.6)
ResU-Net 56.1% 17.8% 8.9% 54.2% 25.7% 34.1%
(0.60) (0.53) (0.79) (54.2) (25.7) (34.1)
AGN 95.8% 20.1% 6.5% 96.7% 34.1% 24.3%
(0.18) (0.5) (0.73) (93.5) (126.8) (89.0)
LFB-Net 44.4% 13.1% 4.7% 46.3% 18.7% 27.1%
(0.65) (0.74) (0.80) (30.6) (19.6) (16.0)

To examine the worst-case behavior of each method, we also computed the maximum HD error and the minimum Dice coefficient: a lower Dice and a higher HD reveal the model's worst scores. The results, shown in Table 6, illustrate that our method considerably decreased the maximum errors in every metric.

Another essential advantage of our approach is that it produces segmentations with almost no difference across the testing data populations. As can be observed from Table 1 for prostate and inner ear segmentation, the standard deviation is small in all measurements. The same holds for cardiac segmentation from both cine-MRI and echocardiographic images.

Ablation study for system integration strategy

To investigate the best strategy for combining the two systems through their latent spaces (i.e., $h_{s}$ and $h_{f}$), we evaluated three different schemes, namely concatenation, addition, and multiplication, on the prostate testing datasets.

As shown in Fig. 7, the concatenation strategy outperforms the other two. We therefore selected a concatenation layer to merge the latent spaces. Statistically, the three strategies showed no significant difference in the Dice index, but the concatenation strategy significantly outperformed the others in the 3D HD metric.

Figure 7: Merging strategy comparison. Combination strategies for merging $h_{s}$ and $h_{f}$ (Concat: concatenation, Add: addition, Multi: multiplication). The best performance corresponds to a higher Dice and to lower 3D HD and RVD values.
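The three merging schemes compared above amount to different element-wise or channel-wise combinations of the two latent tensors; a minimal sketch is shown below. The tensor shapes are assumed, and the 1x1 convolution used to restore the channel count after concatenation is an illustrative assumption rather than the exact design.

import torch
import torch.nn as nn

# Hypothetical latent feature maps from the forward (h_s) and feedback (h_f) systems
h_s = torch.randn(2, 256, 16, 16)
h_f = torch.randn(2, 256, 16, 16)

# Addition and multiplication: element-wise, channel count unchanged
merged_add = h_s + h_f
merged_mul = h_s * h_f

# Concatenation: channels are stacked, then reduced back with a 1x1 convolution
# (the reduction step is an illustrative assumption)
reduce = nn.Conv2d(512, 256, kernel_size=1)
merged_cat = reduce(torch.cat([h_s, h_f], dim=1))

print(merged_add.shape, merged_mul.shape, merged_cat.shape)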

Ablation study for system training scheme

Our method relies on an alternating training strategy between a modified U-Net (the forward system) and an FCN architecture (the feedback system). The FCN aims to regularize the forward system by feeding back its predicted probabilistic output, thereby improving the learning ability over time. To assess this, we conducted experiments on the ACDC data comparing two training schemes: without and with the feedback loop.
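To make the comparison concrete, the sketch below outlines two successive passes of training with a context feedback loop, with toy stand-ins for the two systems. The architectures, the joint optimizer, and the way the fed-back context enters the decoder are simplifying assumptions; they are not the exact LFB-Net training procedure.

import torch
import torch.nn as nn

# Toy stand-ins: the real forward system is a modified U-Net, the feedback system an FCN.
class ForwardSystem(nn.Module):
    def __init__(self, n_classes=4, ctx_channels=8):
        super().__init__()
        self.encode = nn.Conv2d(1, 8, 3, padding=1)
        self.decode = nn.Conv2d(8 + ctx_channels, n_classes, 3, padding=1)

    def forward(self, image, context):
        # the decoder consumes both the image features and the fed-back context
        feats = torch.relu(self.encode(image))
        return self.decode(torch.cat([feats, context], dim=1))

forward_system = ForwardSystem()
feedback_system = nn.Conv2d(4, 8, 3, padding=1)   # encodes the probabilistic output into a context

optimizer = torch.optim.Adam(
    list(forward_system.parameters()) + list(feedback_system.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

image = torch.randn(2, 1, 64, 64)                 # hypothetical mini-batch
label = torch.randint(0, 4, (2, 64, 64))

# Pass 1 uses an empty context; pass 2 reuses the encoded previous prediction.
context = torch.zeros(2, 8, 64, 64)
for _ in range(2):
    logits = forward_system(image, context)
    loss = criterion(logits, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # encode the (detached) probabilistic output as the context for the next pass
    context = feedback_system(torch.softmax(logits.detach(), dim=1))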

Figure 8 shows how the losses evolve in the two scenarios. The training and validation losses decrease faster when training with the feedback loop than without it. Moreover, although the training loss without the feedback loop keeps decreasing over time, the validation loss does not. This further shows that the model without the feedback loop over-fits the training data more quickly over the iterations than the model with it.

Figure 8: Ablation study of the training scheme on the ACDC data. The red line represents the forward system’s training loss without any update of the decoder from the feedback loop. The blue line represents the forward system’s loss with the feedback loop $h_{f}$. Segmentation with the feedback loop accelerates convergence and reaches a lower validation loss.

Training with the feedback loop took about one extra hour to converge on the ACDC data compared with training without it. Although we designed the segmentation problem as a two-systems task, the total number of parameters to optimize is smaller than that of single-model methods. Our method has 8.5 and 7.9 million parameters during the training and testing phases, respectively, which is computationally more efficient than the 32 million trainable parameters of the U-Net architecture [13]. Thus, our method can deliver results quickly, which can be beneficial for real-time applications. For example, it produces a $256\times 256\times 4$ segmentation result (where 4 indicates the predicted probabilistic outputs for LV, RV, MYO, and background) from a cine-MRI slice within 0.025 s on a personal computer with an Intel i7 CPU and 32 GB of RAM.
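For context, trainable-parameter counts and per-slice inference time can be measured with a few lines such as the following; the placeholder model is hypothetical, and the numbers obtained naturally depend on the hardware used.

import time
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Number of trainable parameters (e.g., ~7.9 M for the testing-phase model)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def time_inference(model: torch.nn.Module, input_shape=(1, 1, 256, 256), repeats=20) -> float:
    """Average wall-clock seconds to produce one 256x256 multi-class probabilistic output."""
    model.eval()
    x = torch.randn(*input_shape)
    start = time.perf_counter()
    for _ in range(repeats):
        model(x)
    return (time.perf_counter() - start) / repeats

# Example with a hypothetical placeholder model
model = torch.nn.Conv2d(1, 4, kernel_size=3, padding=1)
print(count_parameters(model), "trainable parameters")
print(f"{time_inference(model):.4f} s per slice")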

4 Conclusions

In this paper, we introduced a deep learning method for accurate and robust medical image segmentation by formulating the segmentation problem as a two-systems task. It employs a forward system (a modified U-Net) for hierarchical feature extraction-driven image segmentation, along with a contextual feedback system. The FCN-based contextual feedback system regulates the forward system’s segmentation process, allowing the forward system to attend to and improve its previous decisions over time, particularly in uncertain image regions. Modeling image segmentation as a two-systems task enabled us to develop an efficient architecture that can be trained from a small dataset and quickly delivers segmentation results.

We demonstrated our method’s performance through extensive ablation studies and experimental results on prostate, short-axis cardiac MRI, inner ear, and long-axis echocardiographic image segmentation. The experimental results reveal two important points. First, spatial feedback-loop-based image segmentation is an effective feed-forward learning approach that produces both plausible and accurate segmentation results; plausibility was achieved without incorporating a shape prior or applying a post-processing method. Second, our method produces results with reduced segmentation variability across the testing data, showing robustness to low-contrast images and structures, and with reduced maximum errors in all metrics. Moreover, the proposed method yielded significantly better results than the state-of-the-art methods for single- and multi-structure segmentation, especially for complex structures. Thus, our work opens important perspectives towards efficient and accurate medical image analysis by interconnecting two networks through the introduced feedback loop method.

Moreover, the proposed LFB-Net framework can be extended to other medical image analysis tasks. In this regard, future research will focus on how best to exploit the contextual feedback loop’s latent space and efficiently leverage the merged contextual information. When 3D datasets are available, a 3D version of the proposed method could be applied to capture the 3D topology of the target.

References

  • [1] B. J. Davis, E. M. Horwitz, W. R. Lee, J. M. Crook, R. G. Stock, G. S. Merrick, W. M. Butler, P. D. Grimm, N. N. Stone, L. Potters et al., “American brachytherapy society consensus guidelines for transrectal ultrasound-guided permanent prostate brachytherapy,” Brachytherapy, vol. 11, no. 1, pp. 6–19, 2012.
  • [2] K. B. Girum, A. Lalande, M. Quivrin, I. Bessières, N. Pierrat, E. Martin, L. Cormier, A. Petitfils, J. M. Cosset, and G. Créhange, “Inferring postimplant dose distribution of salvage permanent prostate implant (ppi) after primary ppi on ct images,” Brachytherapy, vol. 17, no. 6, pp. 866–873, 2018.
  • [3] C. Petitjean and J.-N. Dacher, “A review of segmentation methods in short axis cardiac mr images,” Medical image analysis, vol. 15, no. 2, pp. 169–184, 2011.
  • [4] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester et al., “Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved?” IEEE transactions on medical imaging, vol. 37, no. 11, pp. 2514–2525, 2018.
  • [5] N. Gerber, M. Reyes, L. Barazzetti, H. M. Kjer, S. Vera, M. Stauber, P. Mistrik, M. Ceresa, N. Mangado, W. Wimmer et al., “A multiscale imaging and modelling dataset of the human inner ear,” Scientific data, vol. 4, p. 170132, 2017.
  • [6] M. H. Jafari, Z. Liao, H. Girgis, M. Pesteie, R. Rohling, K. Gin, T. Tsang, and P. Abolmaesumi, “Echocardiography segmentation by quality translation using anatomically constrained cyclegan,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 655–663.
  • [7] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in International conference on medical image computing and computer-assisted intervention.   Springer, 2016, pp. 424–432.
  • [8] S. Ghose, A. Oliver, R. Martí, X. Lladó, J. C. Vilanova, J. Freixenet, J. Mitra, D. Sidibé, and F. Meriaudeau, “A survey of prostate segmentation methodologies in ultrasound, magnetic resonance and computed tomography images,” Computer methods and programs in biomedicine, vol. 108, no. 1, pp. 262–287, 2012.
  • [9] K. B. Girum, G. Créhange, R. Hussain, and A. Lalande, “Fast interactive medical image segmentation with weakly supervised deep learning method,” International Journal of Computer Assisted Radiology and Surgery, vol. 15, no. 9, pp. 1437–1444, 2020.
  • [10] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
  • [13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [14] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [15] Z. Xu and M. Niethammer, “Deepatlas: Joint semi-supervised learning of image registration and segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 420–429.
  • [16] B. Li, W. J. Niessen, S. Klein, M. de Groot, M. A. Ikram, M. W. Vernooij, and E. E. Bron, “A hybrid deep learning framework for integrated segmentation and registration: Evaluation on longitudinal white matter tract changes,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 645–653.
  • [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [18] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [19] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • [20] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1520–1528.
  • [21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [23] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
  • [24] Z. Tu and X. Bai, “Auto-context and its application to high-level vision tasks and 3d brain image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 10, pp. 1744–1757, 2009.
  • [25] O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. A. Cook, A. De Marvao, T. Dawes, D. P. O‘Regan et al., “Anatomically constrained neural networks (acnns): application to cardiac image enhancement and segmentation,” IEEE transactions on medical imaging, vol. 37, no. 2, pp. 384–395, 2017.
  • [26] K. B. Girum, G. Créhange, R. Hussain, P. M. Walker, and A. Lalande, “Deep generative model-driven multimodal prostate segmentation in radiotherapy,” in Workshop on Artificial Intelligence in Radiation Therapy.   Springer, 2019, pp. 119–127.
  • [27] C. Zotti, Z. Luo, A. Lalande, and P.-M. Jodoin, “Convolutional neural network with shape prior applied to cardiac mri segmentation,” IEEE journal of biomedical and health informatics, vol. 23, no. 3, pp. 1119–1128, 2018.
  • [28] M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio, “Transfusion: Understanding transfer learning for medical imaging,” in Advances in neural information processing systems, 2019, pp. 3347–3357.
  • [29] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, and J. Liu, “Ce-net: Context encoder network for 2d medical image segmentation,” IEEE transactions on medical imaging, vol. 38, no. 10, pp. 2281–2292, 2019.
  • [30] Q. Zeng, D. Karimi, E. H. Pang, S. Mohammed, C. Schneider, M. Honarvar, and S. E. Salcudean, “Liver segmentation in magnetic resonance imaging via mean shape fitting with fully convolutional neural networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 246–254.
  • [31] K. B. Girum, A. Lalande, R. Hussain, and G. Crehange, “A deep learning method for real-time intraoperative us image segmentation in prostate brachytherapy,” International Journal of Computer Assisted Radiology and Surgery, vol. 15, no. 9, pp. 1467–1476, 2020.
  • [32] C. Chen, C. Biffi, G. Tarroni, S. Petersen, W. Bai, and D. Rueckert, “Learning shape priors for robust cardiac mr segmentation from multi-view images,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 523–531.
  • [33] H. Ravishankar, R. Venkataramani, S. Thiruvenkadam, P. Sudhakar, and V. Vaidya, “Learning and incorporating shape models for semantic segmentation,” in International conference on medical image computing and computer-assisted intervention.   Springer, 2017, pp. 203–211.
  • [34] A. J. Larrazabal, C. Martínez, B. Glocker, and E. Ferrante, “Post-dae: Anatomically plausible segmentation via post-processing with denoising autoencoders,” IEEE Transactions on Medical Imaging, vol. 39, no. 12, pp. 3813–3820, 2020.
  • [35] N. Painchaud, Y. Skandarani, T. Judge, O. Bernard, A. Lalande, and P.-M. Jodoin, “Cardiac segmentation with strong anatomical guarantees,” IEEE Transactions on Medical Imaging, vol. 39, no. 11, pp. 3703–3713, 2020.
  • [36] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6077–6086.
  • [37] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [38] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, “Attention gated networks: Learning to leverage salient regions in medical images,” Medical image analysis, vol. 53, pp. 197–207, 2019.
  • [39] A. Sinha and J. Dolz, “Multi-scale self-guided attention for medical image segmentation,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 1, pp. 121–130, 2020.
  • [40] R. Li, K. Li, Y.-C. Kuo, M. Shu, X. Qi, X. Shen, and J. Jia, “Referring image segmentation via recurrent refinement networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5745–5753.
  • [41] J. Chen, L. Yang, Y. Zhang, M. Alber, and D. Z. Chen, “Combining fully convolutional and recurrent neural networks for 3d biomedical image segmentation,” in Advances in neural information processing systems, 2016, pp. 3036–3044.
  • [42] M. Z. Alom, C. Yakopcic, M. Hasan, T. M. Taha, and V. K. Asari, “Recurrent residual u-net for medical image segmentation,” Journal of Medical Imaging, vol. 6, no. 1, p. 014006, 2019.
  • [43] W. Wang, K. Yu, J. Hugonot, P. Fua, and M. Salzmann, “Recurrent u-net for resource-constrained segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2142–2151.
  • [44] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1925–1934.
  • [45] D. Wang, G. Hu, and C. Lyu, “Frnet: an end-to-end feature refinement neural network for medical image segmentation,” The Visual Computer, pp. 1–12, 2020.
  • [46] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel, “Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness,” in International Conference on Learning Representations, 2018.
  • [47] F. Shama, R. Mechrez, A. Shoshan, and L. Zelnik-Manor, “Adversarial feedback loop,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3205–3214.
  • [48] M. Huh, S.-H. Sun, and N. Zhang, “Feedback adversarial learning: Spatial feedback for improving generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1476–1485.
  • [49] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [50] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [51] S. Leclerc, E. Smistad, J. Pedrosa, A. Østvik, F. Cervenansky, F. Espinosa, T. Espeland, E. A. R. Berg, P.-M. Jodoin, T. Grenier et al., “Deep learning for segmentation using an open large-scale dataset in 2d echocardiography,” IEEE transactions on medical imaging, vol. 38, no. 9, pp. 2198–2210, 2019.
  • [52] E. Whitley and J. Ball, “Statistics review 6: Nonparametric methods,” Critical care, vol. 6, no. 6, p. 509, 2002.
  • [53] R. Hussain, A. Lalande, K. B. Girum, C. Guigou, and A. B. Grayeli, “Augmented reality for inner ear procedures: visualization of the cochlear central axis in microscopic videos,” International Journal of Computer Assisted Radiology and Surgery, vol. 15, no. 10, pp. 1703–1711, 2020.