
Learning With Context Feedback Loop for Robust Medical Image Segmentation

Kibrom Berihu Girum, Member, IEEE, Gilles Créhange, and Alain Lalande

Manuscript received December 4, 2020; revised January 19, 2021; accepted February 16, 2021. (Corresponding author: Kibrom Berihu Girum.) K.B. Girum is with the Imaging and Artificial Vision (ImViA) Research Laboratory, University of Burgundy, 21000 Dijon, France, and also with the Department of Radiation Oncology, Centre Georges François Leclerc (CGFL), 21000 Dijon, France (corresponding author's email: kibrom2b[at]gmail.com). G. Créhange is with the Department of Radiation Oncology, Institute of Curie, 75005 Paris, France, also with the Imaging and Artificial Vision (ImViA) Research Laboratory, University of Burgundy, 21000 Dijon, France, and also with the Department of Radiation Oncology, Centre Georges François Leclerc (CGFL), 21000 Dijon, France (email: gilles.crehange[at]curie.fr). A. Lalande is with the Department of Medical Imaging, University Hospital of Dijon, 21000 Dijon, France, and also with the Imaging and Artificial Vision (ImViA) Research Laboratory, University of Burgundy, 21000 Dijon, France (email: alain.Lalande[at]u-bourgogne.fr). Copyright ©2021 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected]. This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at
https://doi.org/10.1109/TMI.2021.3060497
Abstract

Deep learning has been leveraged successfully for medical image segmentation. It employs convolutional neural networks (CNN) to learn distinctive image features from a defined pixel-wise objective function. However, this approach can lead to weak interdependence among output pixels, producing incomplete and unrealistic segmentation results. In this paper, we present a fully automatic deep learning method for robust medical image segmentation that formulates the segmentation problem as a recurrent framework of two systems. The first is a forward system, an encoder-decoder CNN that predicts the segmentation result from the input image. The predicted probabilistic output of the forward system is then encoded by a fully convolutional network (FCN)-based context feedback system. The encoded feature space of the FCN is then integrated back into the forward system's feed-forward learning process. The FCN-based context feedback loop allows the forward system to learn and extract more high-level image features and to fix previous mistakes, thereby improving prediction accuracy over time. Experiments performed on four different clinical datasets demonstrate our method's potential for single- and multi-structure medical image segmentation, outperforming state-of-the-art methods. With the feedback loop, deep learning methods can produce results that are both anatomically plausible and robust to low-contrast images. Therefore, formulating image segmentation as a recurrent framework of two interconnected networks via a context feedback loop is a promising approach for robust and efficient medical image analysis.

Index Terms

CNN, Feedback loop, MRI, Ultrasound, CT.

1 Introduction

1.1 Motivation and background


Medical image segmentation is often profoundly important in clinical image analysis and image-guided interventions. It involves partitioning a medical image into multiple areas that can then be used for better clinical analysis or clinical target visualization. For example, in prostate radiotherapy, accurate clinical target volume segmentation from magnetic resonance (MR), ultrasound (US), or computed tomography (CT) images is often essential in computer-aided diagnosis, therapy, and post-therapy analysis of prostate cancer [1] [2]. Indeed, it is critical to select patients for a specific treatment, guide source delivery during the intervention, and compute the dose distribution using MR, US, and CT images, respectively [1]. Similarly, in cardiac image analysis, accurate segmentation of the heart structures, such as the left and right ventricular cavities and the myocardium, is essential to calculate the volume of the cavities at the end-diastolic and end-systolic phases and the left ventricular myocardial mass [3] [4]. Segmentation of inner ear structures such as the cochlea, vestibule, and semi-circular canals from preoperative CT images can be used in the treatment of patients with hearing impairments. In cochlear implant surgery, preoperative cochlea segmentation from CT images is useful for determining the length of the implanted electrode and improving insertion guidance [5]. Other important applications of image segmentation include 2D echocardiography [6] and brain image segmentation [7], among others.

Despite the necessity of accurate and robust image segmentation in clinical routines, it is often challenging. Indeed, it depends strongly on the imaging modality and the target (such as an anatomical structure or tumor area). The main challenges in developing accurate and automatic medical image segmentation methods include the wide range of patient characteristics, the inherently low-contrast characteristics of some imaging modalities, artifacts (e.g., metallic artifacts in CT), respiratory motion, the limited data available in medical image analysis, and significant variations in the shape and size of organs among cases [8] [9]. Unfortunately, manual segmentation is frequently prone to subjective errors and is time-consuming, and it is not always easy to analyze multiple sequences of examinations manually. Researchers have applied different methods to address these challenges and thereby improve the analysis of medical images and case outcomes. For example, several groups have approached medical image segmentation using contour and shape detection, deformable models, conventional machine learning, and deep learning approaches [8] [10].

Early conventional medical image segmentation approaches were often based on edge detection and the fitting of predefined parametric shapes, level-set methods, and shape models [3] [8]. Many classical machine learning-driven methods (supervised and unsupervised) have also been proposed to tackle this challenging problem. Although traditional machine learning approaches such as active shape models, atlas methods, and statistically supervised and unsupervised methods are still prevalent [10], they often involve the careful integration of hand-crafted image features. These hand-crafted features need to represent the input image. However, designing or engineering distinctive image features by human experts can be difficult and sometimes impossible. Moreover, features designed manually for a given task may not adapt easily to new cases, which is the main obstacle to developing a general method for extracting distinctive image features [8] [10].

In this regard, deep learning methods based on convolutional neural networks (CNN) have emerged as an alternative and reliable solution [10]. These CNN-based methods automatically learn to extract hierarchical, distinctive image features, avoiding the need to develop hand-crafted features while yielding features of a generic nature. In fact, CNN-based approaches have achieved state-of-the-art (SOTA) performance in various tasks such as image classification [11], object detection [12], segmentation [13] [14], registration [15] [16], and other tasks [17] [18]. For example, fully convolutional neural networks are trained on pixel-to-pixel transformations to extract high-level image features and predict outputs from inputs of arbitrary size [14]. Following the success of these approaches in image and video processing, Simonyan et al. [18] proposed to increase the convolutional network depth via a network called VGG16. In this approach, convolutional layers are stacked one after the other to extract more distinctive image features.

Motivated by the promising results of the VGG16 network (a deep encoder-like structure), Badrinarayanan et al. [19] proposed to expand it into a deep fully convolutional neural network (FCN) architecture for semantic pixel-wise segmentation. It employs a trainable encoder network for high-level feature extraction and a corresponding decoder network. The decoder network maps the low-resolution encoder feature maps to full input-resolution feature maps for pixel-wise classification using up-sampling. This up-sampling operation was later replaced by a learnable deconvolution network in the work of Noh et al. [20]. Szegedy et al. [21] also proposed a new CNN design that increases both the depth and the width of CNN-based methods. As CNN architectures go deeper (i.e., a large number of convolutional layers stacked one after the other), residual connections become crucial for training [22] [23]. They smooth training and avoid the loss of high-level features during sampling or pooling operations.

Ronneberger et al. [13] modified the FCN proposed by Long et al. [14] to propagate contextual information from the encoder into the decoder. This is done by connecting the encoder to the decoder through skip connections, creating a U-shaped architecture (hence named U-Net). Similar to residual networks [22], these skip connections recover spatial information (such as localization information) that could be lost during the consecutive striding and convolution operations. Several groups have modified the U-Net architecture for various medical image analysis applications [10]. Çiçek et al. [7] adapted the 2D U-Net for volumetric (3D) segmentation. Tu et al. [24] introduced the auto-context U-Net, in which parallel 2D convolutional layers are applied to the axial, coronal, and sagittal planes of brain images.

Most previous works based on encoder-decoder (or U-Net) architectures have focused on improving the encoder's feature extraction ability, either by constraining the segmentation results [25], incorporating prior knowledge such as shape models [26] [27], or using transfer learning from pre-trained networks [28] [29].

Meanwhile, there has been increasing interest in incorporating the shape of anatomical structures into U-Net-like architectures using multi-task learning [27] [30] [31]. These approaches aim to address the common limitation of encoder-decoder architectures in capturing the structural information and interdependence of the output when training from pixel-wise objective functions [10]. Consequently, several groups have incorporated shape priors into image segmentation using statistical shape modeling [27] or learning-based modeling [25] [30] [32] [33]. These approaches generally showed a relative improvement over methods without the shape prior. However, they often require explicit prior knowledge of the target: the prior (for example, an anatomical shape) must first be modeled and then embedded into the U-Net architecture to constrain the learning process [6] [25] [27]. Other approaches include post-processing methods based on either denoising [34] or variational auto-encoders [35], which showed an improvement in the plausibility of the results. However, they are not free of limitations. Post-processing methods do not see the original input image and thus might not always produce accurate results from a given erroneous segmentation [35].

In encoder-decoder based deep learning methods, although the encoder is essential to extract high-level, distinctive input image features, this extraction can be challenging due to the often inconsistent contrast and potential artifacts associated with the inherent limitations of medical imaging devices. Thus, similar to the residual [22] and long skip connection [13] approaches, different alternative methods have been proposed to improve the propagation of contextual information from the encoder to the decoder. These methods use either gated attention-based networks [36] or recurrent neural networks [37]. They can be used either to highlight salient image features by capturing richer contextual dependencies [38] [39] or to recurrently recover the information lost during the consecutive sampling and re-sampling operations [40] [41] [42] [43] of the feed-forward learning approach. Recurrent-based networks rely on a concept similar to residual networks: first, they accumulate features, leading to better feature representation; second, they smooth the training [42]. Other approaches add additional paths to the encoder-decoder architecture [44] [45]. Most of these approaches are based on CNNs trained in a feed-forward manner and could be biased towards recognizing texture features rather than contextual or shape features [46].

Nonetheless, capturing the contextual information and inter-region relationships, and implicitly learning the prior knowledge of a target, remain of wide interest in deep learning-based medical image analysis. Still, encoder-decoder based image segmentation methods do not get a second chance to look at their segmentation results, except through gradient backpropagation from a defined pixel-wise objective function. Indeed, training from only pixel-wise objective functions can lead to the loss of spatial information, weak interdependence among output pixels, and poor representation of inter-region relationships, hence sometimes producing unrealistic segmentation results [25]. A recent study has also demonstrated that CNNs are strongly biased towards recognizing textures rather than shapes [46]. Thus, capturing the interdependence of the output pixels would enable the network to learn more distinctive image features. The network can learn to extract the texture and contextual similarity between pixels with the same label and the differences between differently labeled neighboring pixels, thereby producing realistic segmentations.

In this regard, the decoder should be intelligent enough to improve the classification of each pixel. To do so, the decoder needs to capture the contextual information and inter-region relationships, and to recognize errors that could be introduced during the consecutive striding, convolution, and up-sampling operations. Indeed, feeding back the predicted probabilistic output could help the decoder extrapolate contextual information and correct mistakes over time, thereby improving prediction accuracy. The idea of a feedback loop is well established in control systems theory, in which the system's output error signal is used to adjust the input signal. Studies on generative adversarial networks (GANs) have also recently shown interest in using a feedback loop to improve the model's learning capacity [47] [48].

Therefore, in this study, to address the difficulties of CNNs in capturing contextual information and inter-region relationships, and to implicitly integrate prior knowledge into the feed-forward learning process, we formulate image segmentation as a recurrent framework using two systems (named Learning with context FeedBack system, LFB-Net). Note that we use the term system to refer to each of the two interconnected networks. The forward system, a modified U-Net architecture, learns to predict a probabilistic output from the raw input image. Probabilistic here refers to the pixel-level softmax assignment at the output of the network. The second system, a fully convolutional network (FCN), hereafter named the feedback system, transforms the predicted probabilistic output of the forward system into another high-level spatial representation. This high-level representation is then integrated back into the feed-forward learning path of the forward system. This feedback strategy, as we will show, allows the main segmentation module to improve its performance over time. More importantly, the feedback system mitigates the conditions in which the forward system can potentially fail.

1.2 Contributions

In this study, we present the first deep learning formalism using a feedback loop in encoder-decoder based image segmentation. To this end, we designed the image segmentation problem as a recurrent framework: the segmentation process of one network is conditioned on the latent space of its previous segmentation result and on a spatial response map from a second network, which enables it to improve its performance over time.

Our main contributions can be summarized as follows:

  1.

    We introduce a feedback looping system to enhance the feed-forward learning process of encoder-decoder CNNs for medical image segmentation.

  2.

    We model the image segmentation problem as a two-system task, passing information from one system to the other. We integrate the proposed feedback looping system with a modified U-Net architecture (the forward system) as a regularizer that enables the network to attend to and fix segmentation uncertainties over time.

  3.

    The proposed method, LFB-Net, is validated on different medical image segmentation applications. Specifically, we evaluated it on short-axis cardiac MRI, long-axis echocardiographic images, CT of the prostate, and CT of the inner ear. Our extensive experiments and ablation studies on clinical databases show that our method consistently outperforms SOTA methods at a reduced computational cost. As shown in the results section, it yields accurate and plausible results without the need to incorporate shape priors or apply post-processing methods.

The rest of this paper is organized as follows. Section 2 introduces the proposed method (LFB-Net). In Section 3, we present the experimental results and discussion. Finally, we draw our conclusions in Section 4.

2 Methodology

The main objective of a supervised FCN is to learn how to best predict the target output $y$ from a given input image $x$, i.e., a mapping of the input image into the labeled target ($f: x \mapsto y$). The network $f$ thus learns through a back-propagation algorithm from a defined loss function $L(y, \hat{y})$ (which is often an error between the model's prediction $\hat{y}$ and the reference $y$). Thus, it learns the model's neural network weights $w$ that best map the input image into the output target image, i.e., $\hat{y} = f(x; w)$.

Figure 1: Framework of the proposed method: (A) the architecture of LFB-Net, in which the encoder, decoder, and bottleneck are the components of the forward system (U-Net architecture), and (B) the convolutional building block of the encoder and the decoder. The feedback system's output (i.e., $h_f$ of size $32^2 \times 256$) is concatenated with the forward system encoder's output (i.e., $h_s$). The switch indicates the alternating training strategy of the two systems. From top to bottom: echocardiography long-axis view, cine-MRI short-axis view, prostate CT, and CT of the inner ear on the left, with segmentation results on the right.

In our model, LFB-Net consists of two major parts: the main segmentation module (forward system) and the feedback system (Fig. 1). Each module is itself composed of an encoder and a decoder. Encoder-decoder based FCN architectures learn the distinctive features of the dataset from a defined objective function $L_f(y, f_d(f_e(x)))$ [14]. Here, the functions $f_e$ and $f_d$ are defined as the encoder and decoder mapping functions, respectively (i.e., $f_e: x \mapsto h$, and then $f_d: h \mapsto y$). The intermediate representation of the input image, $f_e(x) = h$, is often named the latent space or high-level feature space. It is another transformed spatial representation of the input image $x$. We use this representation to describe our method in the following sections.

2.1 Forward system

We formulate the main segmentation network $S$ (forward system in Fig. 1) as two main parts: a segmentation encoder $S_e$, which encodes the raw input image into a high-level feature space, and a decoder $S_d$, which then decodes the encoded features into the target labels. Specifically, the encoder network $S_e$ transforms the input image $x$ into the encoded feature space $h$, i.e., $S_e: x \mapsto h$, and the decoder transforms the intermediate representation $h$ into the desired label $y$, i.e., $S_d: h \mapsto y$. These consecutive stages are equivalent to

\hat{y} = S(x) = S_{d}(S_{e}(x)),   (1)

where $x \in \mathbb{R}^{W \times H}$ is the input image and $\hat{y} \in \mathbb{R}^{W \times H}$ is the predicted output label. Here, $W$ and $H$ indicate the original image width and height, respectively. The latent space of this network is $S_e(x) = h_s \in \mathbb{R}^{W/d \times H/d \times C}$, where $C$ indicates the number of feature channels at the latent space and $d$ is the down-sampling factor set by the depth of the network.

To avoid losing spatial information during down-sampling, this network includes skip connections. It is similar to the U-Net architecture [13], except that the encoder and decoder are designed so that they can be trained jointly or separately, as will be shown in Section 2.3.
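To make the roles of $S_e$ and $S_d$ and the shape bookkeeping concrete, the snippet below is a minimal sketch in Keras (the framework reported in Section 3.1), assuming a toy configuration with two down-sampling stages and arbitrary channel counts; it illustrates the encoder-decoder split, not the published architecture.

```python
# Minimal sketch of the forward system S(x) = S_d(S_e(x)); toy depth and
# channel counts, not the published configuration.
from tensorflow.keras import layers, Model

def build_forward_encoder(input_shape=(256, 256, 1), channels=(32, 64), latent_channels=128):
    x_in = layers.Input(shape=input_shape)
    x, skips = x_in, []
    for c in channels:
        x = layers.Conv2D(c, 3, padding="same", activation="elu")(x)
        skips.append(x)                      # kept for the long skip connections
        x = layers.MaxPooling2D(2)(x)        # halves W and H at every stage
    h_s = layers.Conv2D(latent_channels, 3, padding="same",
                        activation="elu", name="h_s")(x)   # latent space W/d x H/d x C
    return Model(x_in, [h_s] + skips, name="S_e")

def build_forward_decoder(latent_shape=(64, 64, 128),
                          skip_shapes=((256, 256, 32), (128, 128, 64)), n_classes=1):
    h_in = layers.Input(shape=latent_shape)
    skip_ins = [layers.Input(shape=s) for s in skip_shapes]
    x = h_in
    for skip in reversed(skip_ins):          # coarsest skip first
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])  # recover spatial detail lost by pooling
        x = layers.Conv2D(skip.shape[-1], 3, padding="same", activation="elu")(x)
    y_hat = layers.Conv2D(n_classes, 1, activation="sigmoid")(x)
    return Model([h_in] + skip_ins, y_hat, name="S_d")
```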

2.2 Feedback system

The feedback system aims to give the forward system's decoder network a second chance to look back at its predicted output. This could be done by formulating the network recurrently and merging the full-resolution outputs at the last layer of the decoder. However, providing a higher-level representation of the predicted output, with small corrections applied by the feedback system (i.e., spatial feedback), enhances the decoder's learning capacity [48]. Thus, the forward system's probabilistic output is fed to a new FCN architecture ($F$, the feedback system), which transforms the predicted probabilistic output into another high-level feature space. This approach of correcting and feeding back the output is analogous to a teacher correcting a student's exam and then returning the correct answers. This strategy allows the forward system to attend to specific regions of its previous results and improve prediction accuracy.

Let us divide the feedback system $F$ into two parts: the encoder $F_e$, which encodes the forward system's probabilistic output $\hat{y}$ into a high-level feature space $h_f$ (i.e., $F_e: \hat{y} \mapsto h_f$), and the decoder $F_d$, which then transforms the high-level feature space $h_f$ into the desired label $y$ (i.e., $F_d: h_f \mapsto \hat{\hat{y}}$). As in (1), this process can be written as:

\hat{\hat{y}} = F(\hat{y}) = F_{d}(F_{e}(\hat{y})),   (2)

where $\hat{\hat{y}}$ is the predicted probabilistic output label from the feedback system, and $\hat{y}$ is the predicted probabilistic output label from the forward system.

2.3 Integration

We need to integrate the feedback system $F$ with the forward system $S$. We therefore designed the integration of the two systems as a recurrent process, formulated as follows. Firstly, the segmentation module produces a label map $\hat{y}_i$ (at iteration $i$) using (1). Secondly, the feedback system predicts a label map $\hat{\hat{y}}_i$, taking $\hat{y}_i$ from the output of the forward system as input. We can now write the feedback system's encoder response map $F_e$ as:

h_{f_{i}} = F_{e}(\hat{y}_{i})   (3)

where $h_{f_i} \in \mathbb{R}^{W/d \times H/d \times C}$ is the high-level feature space, or latent space, of the feedback system at iteration $i$. It has the same size as the forward system's encoder output, $h_{s_i} \in \mathbb{R}^{W/d \times H/d \times C}$. The decoder ($F_d$) then transforms $h_{f_i}$ into the output label map $\hat{\hat{y}}_i$.

Thirdly, we use the high-level feedback feature space $h_{f_i}$, together with the encoded input from the forward system $h_{s_i}$, as input to the decoder $S_d$ at iteration $i+1$. Therefore, we can now redefine (1) at iteration $i+1$ as:

\hat{y}_{i+1} = S_{d_{i+1}}\big(h_{s_{i}}, F_{e_{i}}(\hat{y}_{i})\big)   (4)

which can be written as

\hat{y}_{i+1} = S_{d_{i+1}}\big(h_{s_{i}}, h_{f_{i}}\big)   (5)

As our system aims to incorporate the contextual spatial feedback into the feed-forward learning process, at the first stage we set the feedback latent space to zero, $h_{f_0} = 0$ (i.e., $h_{f_0} = h_0$). Then, in the second stage (i.e., at iteration $i+1$), we replace $h_0$ with the feedback latent space $h_{f_i}$. This is indicated in Fig. 1 as the switch between $h_0$ and $h_f$.
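As a small illustration of Eq. (5) and of the switch between $h_0$ and $h_f$, the sketch below concatenates the two latent spaces before the first decoder block. The $32^2 \times 256$ latent size follows Fig. 1, but the decoder stub and layer sizes are placeholder assumptions.

```python
# Sketch of the merging step of Eq. (5): S_d receives h_s together with
# either a zero map h_0 (first stage, the "switch") or the feedback latent
# space h_f = F_e(y_hat).
import numpy as np
from tensorflow.keras import layers, Model

h_s_in = layers.Input(shape=(32, 32, 256), name="h_s")   # forward encoder output
h_f_in = layers.Input(shape=(32, 32, 256), name="h_f")   # feedback encoder output (or zeros)
m = layers.Concatenate(name="M")([h_s_in, h_f_in])       # merging block M of Fig. 2
x = layers.Conv2D(256, 3, padding="same", activation="elu")(m)
# ... the up-sampling stages of S_d would follow here ...
decoder_stub = Model([h_s_in, h_f_in], x, name="S_d_stub")

h_s = np.random.rand(1, 32, 32, 256).astype("float32")
h_0 = np.zeros_like(h_s)                                  # first stage: h_f0 = 0
first_pass = decoder_stub.predict([h_s, h_0])
```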

2.4 Training strategy

The integration and training strategy of the forward system and the feedback system can be summarized as follows:


  1.

    Train the neural network weights of the forward system, $w^s_i$, taking the raw input image $x$ and a zero feedback latent space (i.e., $h_{f_i} = 0$) as inputs, and the ground truth labels $y$ as output.

  2.

    Train the neural network weights of the feedback system, $w^f_i$, taking the predicted output of the forward system's decoder network $\hat{y}$ as input, and the ground truth label $y$ as output, as in (2).

  3.

    Train only the neural network weights of the forward system's decoder part, $w^s_{d_{i+1}}$, taking as inputs the high-level features previously extracted from the raw input image in step 1 (i.e., $h_{s_i}$) and the feedback latent space $h_{f_i}$ from the feedback system in step 2, as in (5). Here, the forward system's and the feedback system's encoders are frozen, i.e., they predict using the weights learned and updated during step 1 and step 2, respectively.

  4.

    While not converged, repeat the above steps. Convergence is determined by the change in the validation loss at the output of the forward system. These steps are also shown in Fig. 2.

Note that each step is trained in turn until the network has seen the whole training dataset. During the training phase, the feedback system provides feedback to the forward system; its decoder part ($F_d$) is discarded during the testing phase. A minimal sketch of this alternating scheme is given below.
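This sketch assumes three compiled Keras models sharing weights: `forward`, mapping $(x, h_f) \mapsto \hat{y}$; `feedback`, mapping $\hat{y} \mapsto \hat{\hat{y}}$; and `forward_decoder`, which reuses the frozen $S_e$ and $F_e$ (trainable = False) so that only the decoder weights $w^s_d$ are updated. The model names and the one-epoch granularity per step are illustrative assumptions, not the exact published schedule.

```python
# Sketch of training steps 1-4; `forward`, `feedback` and `forward_decoder`
# are assumed to be compiled Keras models sharing weights as described above.
import numpy as np

def train_round(forward, feedback, forward_decoder, x, y, latent_shape=(32, 32, 256)):
    h0 = np.zeros((len(x),) + latent_shape, dtype="float32")

    # Step 1: train the whole forward system with a zero feedback latent space.
    forward.fit([x, h0], y, epochs=1, verbose=0)

    # Step 2: train the feedback FCN on the forward system's probabilistic output.
    y_hat = forward.predict([x, h0])
    feedback.fit(y_hat, y, epochs=1, verbose=0)

    # Step 3: update only the decoder S_d, feeding it h_s (from the frozen S_e)
    # and h_f = F_e(y_hat) (from the frozen feedback encoder).
    forward_decoder.fit([x, y_hat], y, epochs=1, verbose=0)

# Step 4: repeat train_round(...) until the validation loss of the
# forward system stops improving.
```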

Figure 2: Training scheme of the proposed LFB-Net method. For a given iteration $i$: Step 1: train the forward system; Step 2: train the feedback FCN system. For the next iteration $i+1$: Step 3: train the forward system's decoder network with the feedback loop. In step 3, only the neural network weights of $S_d(h_s, h_f)$ (i.e., $w^s_{d_{i+1}}$) are updated. In this case, while the encoder ($S_e(x; w^s_{e_i})$) is in predicting mode, the feedback loop ($F_e(\hat{y}_i; w^f_{e_i})$) is used to regulate the forward system's decoder ($S_d$) by feeding back its predicted probabilistic output from step 1. M is the merging block for $h_f$ and $h_s$. Freeze indicates that the network predicts the output using its already trained weights.

2.5 Training Loss function

To train our model, we used an average of binary cross-entropy and Dice coefficient loss functions as:

L_{total} = \frac{1}{2}\times(L_{1}+L_{2})   (6)

where $L_1$ and $L_2$ are the cross-entropy and Dice loss functions, respectively. The cross-entropy $L_1$ is computed as:

L_{1} = -\sum_{k=1}^{c}\sum_{i=1}^{I}\log\bigg[\frac{e^{s(k,i)}}{\sum_{i}e^{s(k,i)}}\bigg]   (7)

where $s(k,i)$ is the probabilistic feature map at pixel $i \in I$, belonging to the pixel class $k \in \{1, 2, 3, \ldots, c\}$ ($c$ being the number of classes), and $I$ is the number of pixels in the training batch. The Dice coefficient loss is calculated as:

L_{2} = 1 - \frac{1}{\sum_{k}\gamma_{k}}\bigg[\sum_{k}\gamma_{k}\frac{2\times\sum_{i\in I}u_{i}^{k}v_{i}^{k}}{\sum_{i\in I}u_{i}^{k}+\sum_{i\in I}v_{i}^{k}}\bigg]   (8)

where $u$ is the predicted output of the network, $v$ is a one-hot encoding of the ground truth segmentation map, and $\gamma_k$ is the weight associated with class $k \in \{0, 1, 2, \ldots\}$. Both $u$ and $v$ have shape $I$, with $i \in I$ indexing the pixels in the training batch. The loss function was the same for both the forward and the feedback systems.
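For illustration, a sketch of the combined loss of Eqs. (6)-(8) in TensorFlow is given below, assuming one-hot ground truth maps $v$, softmax predictions $u$, and uniform class weights $\gamma_k$; the weighting actually used may differ.

```python
# Sketch of the training loss: average of cross-entropy and Dice loss (Eq. 6).
import tensorflow as tf

def dice_loss(v, u, eps=1e-6):
    # Per-class soft Dice over all pixels of the batch, as in Eq. (8),
    # with uniform class weights gamma_k.
    axes = (0, 1, 2)
    intersection = tf.reduce_sum(u * v, axis=axes)
    denom = tf.reduce_sum(u, axis=axes) + tf.reduce_sum(v, axis=axes)
    return 1.0 - tf.reduce_mean((2.0 * intersection + eps) / (denom + eps))

def total_loss(v, u):
    # Eq. (6): L_total = (L1 + L2) / 2, with v the reference and u the prediction.
    ce = tf.reduce_mean(tf.keras.losses.categorical_crossentropy(v, u))
    return 0.5 * (ce + dice_loss(v, u))
```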

2.6 Network architecture

The complete network architecture is shown in Fig. 1. In the forward system, each encoder block consists of repeated applications of $3\times 3$ convolutions, exponential linear unit (ELU) activation, and batch normalization, followed by a squeeze-and-excitation network [49] (Fig. 1 (B)). Moreover, we applied a $2\times 2$ max pooling operation with stride 2 for down-sampling. In contrast, the decoder is composed of a $2\times 2$ up-convolution and a concatenation layer, followed by the same block as in the encoder (Fig. 1 (B)).
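A sketch of this building block in Keras is given below, assuming a squeeze-and-excitation reduction ratio of 8 and two convolutions per block; the exact values are those reported in the supplementary material.

```python
# Sketch of the encoder/decoder building block: 3x3 conv -> ELU -> batch norm,
# repeated, followed by squeeze-and-excitation (SE) channel recalibration [49].
from tensorflow.keras import layers

def se_block(x, ratio=8):
    c = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                # squeeze: global channel statistics
    s = layers.Dense(c // ratio, activation="relu")(s)    # excitation bottleneck
    s = layers.Dense(c, activation="sigmoid")(s)
    s = layers.Reshape((1, 1, c))(s)
    return layers.Multiply()([x, s])                      # channel-wise re-weighting

def conv_block(x, filters, n_convs=2):
    for _ in range(n_convs):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.Activation("elu")(x)
        x = layers.BatchNormalization()(x)
    return se_block(x)
```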

The feedback system is a fully convolutional network architecture with a learnable deconvolution network [20]. It consists of repeated applications of $3\times 3$ convolutions, ELU activation, and batch normalization in both the encoder and decoder blocks. We selected a concatenation layer to merge the latent spaces of the two systems. Please refer to the supplementary material for a detailed description of the architectures and hyper-parameter values used in this work.

3 Materials and experiments

In this section, we first introduce the implementation, training, and testing configurations. We then present the experimental studies on four datasets. The datasets were selected to demonstrate our method's performance on various medical image segmentation tasks, including single- and multi-structure segmentation. Specifically, prostate volume segmentation from CT images and cardiac multi-structure segmentation from 2D echocardiography and 3D cardiac-MR images are used. Moreover, to demonstrate our method's performance for applications involving multiple irregularly shaped targets per image and stronger class imbalance, we applied it to inner ear segmentation from μCT. In addition, we performed an extensive ablation study to analyze and validate the design of the method. These experiments focus on demonstrating the importance of learning with a feedback loop for accurate and robust medical image segmentation.

3.1 Experimental setup

The system was implemented in Python using the Keras API with a TensorFlow backend. The weights of the model were updated using (6) and the ADAM optimizer with a learning rate of $10^{-3}$ until convergence [50]. All networks, including the baseline models, were trained from scratch with an early stop of 100 and a batch size of 10. We mostly divided each dataset into 70% for training and the remaining 30% for testing. For the clinical datasets involving volumetric computation, the division was done by patient case. The model was trained on the given training dataset and then tested on new, unseen cases. The model's hyper-parameters were optimized only on the validation dataset.

The proposed network's parameters were the same for all segmentation applications, except for the output activation function and the number of output feature channels. For single-label segmentation (i.e., prostate and inner ear), the output activation was a sigmoid function. For multi-label segmentation (i.e., cardiac cine-MRI and ultrasound, with four output channels), it was a softmax function. In all experiments, the intensity values were normalized according to each dataset's mean and standard deviation, which allows the model to learn the optimal parameters quickly.
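A minimal sketch of this per-dataset z-score normalization is given below (the statistics are assumed to be computed over the whole training set of each dataset).

```python
# Sketch of the z-score intensity normalization described above.
import numpy as np

def normalize(images, eps=1e-8):
    # `images` holds all intensities of one dataset (e.g., N x H x W).
    mean, std = images.mean(), images.std()
    return (images - mean) / (std + eps)
```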

The proposed method was compared with other methods, namely U-Net [13], residually connected U-Net (ResU-Net) [22], FCN [20], Post-DAE [34], and the attention gated network (AGN) [38]. We chose these networks because their architectures are similar to that of our method. For the public datasets, we evaluated our method on the testing data. We also compared the proposed method with [25], [35], and [51] on the public data. During the ablation study, we extensively experimented with and without the feedback loop, for example using only the forward system (FS). All settings were otherwise the same for the baseline methods. Indeed, the baseline methods' hyper-parameters were optimized to yield their best results.

We considered both volume and distance metrics to evaluate the methods: the Dice similarity coefficient (DSC), the Hausdorff distance (HD), and the relative volume difference (RVD). For the volumetric datasets, the evaluation was performed in 3D. Unless specified otherwise, we adopt these metrics' notations throughout the paper. We also ran a statistical test using the Wilcoxon test [52] and consider results significantly different when the p-value is less than 0.05 (i.e., $p<0.05$). Moreover, we visually verified the plausibility of the results. In this context, plausibility refers to a complete, realistic, and anatomically possible segmentation. Therefore, a plausible or realistic output of a method contains no holes, no segments in unexpected regions of the image, and segmented areas that are similar in shape to the target structures.
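For reference, a minimal sketch of the three metrics for boolean 3D masks is given below, assuming NumPy/SciPy; voxel spacing handling is omitted for brevity.

```python
# Sketch of the evaluation metrics: Dice similarity coefficient (DSC),
# Hausdorff distance (HD, in voxel units here) and relative volume
# difference (RVD) for boolean 3D masks `pred` and `ref`.
import numpy as np
from scipy.ndimage import distance_transform_edt

def dsc(pred, ref):
    return 2.0 * np.logical_and(pred, ref).sum() / (pred.sum() + ref.sum())

def hausdorff(pred, ref):
    d_to_pred = distance_transform_edt(~pred)   # distance of every voxel to the prediction
    d_to_ref = distance_transform_edt(~ref)     # distance of every voxel to the reference
    return max(d_to_pred[ref].max(), d_to_ref[pred].max())

def rvd(pred, ref):
    return abs(int(pred.sum()) - int(ref.sum())) / ref.sum()
```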

3.2 Prostate segmentation in radiotherapy

Precise prostate gland segmentation from CT images is critical in the treatment of prostate cancer. However, it is not easy, due to the inherently low soft-tissue contrast of CT and the presence of strong metal artifacts. For example, accurate prostate segmentation is difficult in CT exams of patients who have received low dose rate brachytherapy for prostate cancer, in which tiny radioactive elements are permanently implanted into the gland; the quality of the CT image is degraded by the metal artifacts caused by the implanted radioactive elements. In our study, a clinical database of 78 prostate CT cases was collected. All cases had received prostate cancer treatment using a low dose rate brachytherapy technique. The in-plane resolution varies from 0.4 mm x 0.4 mm to 0.58 mm x 0.58 mm, with a slice thickness between 1.5 mm and 2.5 mm. We resampled all datasets to the same voxel size of 0.5 mm x 0.5 mm x 1.25 mm. A radiation oncologist with more than ten years of experience (>100 implants/year) manually delineated the prostate. Such delineations are routinely done in clinical treatments involving permanent brachytherapy with $^{125}$I for localized prostate cancer [1] [2]. The dataset was randomly divided into 70% for training and the remaining 30% for testing and validation (20% for testing and 10% for validation).

Table 1: Results for single-target organ segmentation (mean ± standard deviation). All metrics are computed in 3D; HD values are in mm. A higher DSC is better, whereas lower values are better for the HD and RVD metrics (bold values are better).
Method
Organ Metric FCN [20] U-Net [13] ResU-Net [22] Post-DAE [34] AGN [38] LFB-Net
Prostate DSC 0.89±2.20 0.89±0.02 0.89±0.02 0.89±0.02 0.89±0.02 0.91±0.02
(3D) HD 14.0±8.14 17.5±13.26 14.4±3.30 11.3±3.30 16.6±7.12 6.1±1.40
Ear DSC 0.93±0.05 0.92±0.06 0.93±0.05 0.87±0.07 0.93±0.05 0.96±0.01
(3D) RVD 0.07±0.08 0.08±0.09 0.06±0.09 0.09±0.11 0.04±0.05 0.02±0.02

From the experimental results shown in Table 1, our method segments the prostate gland with an average volumetric Dice index of 0.91 and a 3D HD of 6.1 mm, outperforming the other methods. For example, it produced a 2% higher Dice index and decreased the 3D HD by 11 mm compared with U-Net [13]. Our method without the feedback loop, i.e., using only the forward system (FS), produces a Dice index of 0.90 and a 3D HD of 10.3 mm, improving both metrics over the other methods. Using the feedback loop reduces the 3D HD error by 4.3 mm compared with not using it (FS only). The attention gated-based method [38] reduced the 3D HD over U-Net [13] by only 0.9 mm, while the Post-DAE method [34] reduced it by 6.2 mm; however, Post-DAE did not improve the average Dice index. The proposed LFB-Net method significantly outperformed ($p<0.05$) the other methods in both the Dice index and the 3D HD.

3.3 Inner ear segmentation

Inner ear segmentation is essential for 3D visualization and modeling of the inner ear for surgery. We chose the Hear-EU cochlear data descriptor public dataset, consisting of μCT scans of 17 dry temporal bone specimens [5]. The ground truth labels include the cochlear scala, the semicircular canals, and the vestibule. The images were acquired at 16.3 and 19.5 μm voxel resolutions for 13 and 4 μCTs, respectively. The original volume size of the μCTs ranged from 618 x 892 x 600 to 1500 x 1500 x 1500 voxels. The μCTs were standardized to a fixed size of 256 x 256 x 256 voxels to meet computational and memory requirements. The ground truth labels of 5 cases were not aligned with the μCT images; these were manually corrected to align perfectly with the μCT data. To compare all methods, two-fold cross-validation was performed by randomly dividing the data at each fold into 76% for training and 24% for validation.

As can be seen from Table 1, LFB-Net achieved a 4% increase in the average Dice index and reduced the relative volume difference by 6% compared with U-Net [13]. It also yielded accurate segmentation results without missing the small unconnected parts of the inner ear on CT images, which can then be used to model the inner ear structures for ear surgeries [53]. In contrast, the other methods yielded less accurate segmentation results. In particular, the post-processing method Post-DAE [34] appeared to degrade the segmentation results of U-Net [13], lowering the Dice index by 5%.

3.4 Cardiac cine-MRI segmentation

The third application of our method, which demonstrates its performance for multi-class segmentation, is cardiac cine-MRI segmentation. Although short-axis cardiac cine-MRI is essential to study the cardiac function of the left and right ventricles, accurate segmentation of both ventricles remains challenging. The segmentation of the endocardial border in diastole and systole is required to evaluate cardiac function (i.e., cavity volume and ejection fraction) for the left and right ventricles, and the epicardial border of the left ventricle is mandatory to evaluate the myocardial mass and thickness. The main challenge is the variability in the shape of the left and right ventricular cavities. In this regard, we chose the Automated Cardiac Diagnosis Challenge (ACDC) dataset [4] to evaluate the proposed method. It contains 100 patients for training and 50 for testing. These datasets are available for download at https://acdc.creatis.insa-lyon.fr/. To analyze how the baseline methods perform on the testing set, we had access to the ground truth upon request to the organizers. We randomly divided the 100 cases into 75 patients (75%) for training and 25 patients (25%) for validation (all cases with both diastolic and systolic phases). To further inspect the performance of the method, we evaluated the end-diastolic and end-systolic phases separately on the 50 testing cases.

In all 3D experimental measurements on the validation set shown in Table 2, our method appeared to improve segmentation accuracy. It yielded average Dice and HD values, respectively, of 0.96 ± 0.02 and 6.7 ± 3.95 mm for the left ventricular cavity (LV), 0.91 ± 0.05 and 13.2 ± 7.90 mm for the right ventricular cavity (RV), and 0.90 ± 0.03 and 9.4 ± 5.21 mm for the myocardium (MYO) on the validation datasets. The 3D HD value of the myocardium refers to the largest distance error from either the endocardium or the epicardium. More importantly, we observed that the improvement is significant in cases where the segmentation task is difficult, such as the right ventricular cavity and the myocardium.

As shown in Table 3, on the ACDC testing set, our method yielded average Dice and 3D HD values, respectively, of 0.94 ± 0.06 and 7.5 ± 5.54 mm for the LV, 0.92 ± 0.06 and 11.9 ± 6.49 mm for the RV, and 0.90 ± 0.03 and 9.5 ± 5.58 mm for the MYO. We observed that LFB-Net significantly outperforms the other methods ($p<0.05$). To further analyze the method's generalizability, results from the unseen testing set and the validation set are shown in Table 4. Although the sample sizes differ between the validation set, 25 cases (each with end-diastolic and end-systolic phases), and the testing set, 50 cases (each with end-diastolic and end-systolic phases), the average values in Table 4 can be used to infer the methods' performance on each set. Indeed, the proposed method showed almost no difference between the unseen testing and validation sets, achieving a total Dice index of 0.92 and a 3D HD value of around 10 mm. In contrast, the other methods showed a large difference between the validation and testing sets. For example, LFB-Net significantly outperformed the other methods in heart segmentation on the testing data (shown in Table 4), but this was not always the case on the validation data.

Table 2: Multi-structure cardiac image segmentation results on the validation sets for the end-diastolic (ED) and end-systolic (ES) phases (HD in mm). LV: left ventricular cavity; MYO: myocardium; RV/LA: right ventricular cavity (RV) for the cine-MRI and left atrium cavity (LA) for the echocardiography. The bold values refer to the best performance for each metric.
LV RV/LA MYO
Dice HD Dice HD Dice HD
Data Method ES ED ES ED ES ED ES ED ES ED ES ED
Echocardiography (2D) AGN [38] 0.92 0.95 9.7 6.0 0.92 0.89 7.1 8.5 0.69 0.69 24.2 25.0
FCN [20] 0.92 0.95 5.5 5.7 0.92 0.90 4.8 5.8 0.86 0.85 7.0 8.1
U-Net [13] 0.93 0.95 5.4 5.4 0.92 0.89 5.1 6.5 0.87 0.86 7.4 8.3
ResU-Net [22] 0.92 0.94 5.7 6.0 0.91 0.88 5.4 6.3 0.86 0.86 7.6 8.4
(54 cases) LFB-Net 0.93 0.95 5.3 5.4 0.93 0.91 4.6 5.4 0.87 0.87 6.7 7.1
Cine-MRI (3D) AGN [38] 0.94 0.96 7.8 6.9 0.84 0.93 15.1 14.3 0.90 0.89 11.1 10.1
FCN [20] 0.93 0.96 7.8 6.7 0.84 0.92 15.58 14.7 0.89 0.87 11.1 10.0
U-Net [13] 0.93 0.96 8.3 8.3 0.84 0.93 16.2 14.4 0.90 0.88 11.9 9.6
ResU-Net [22] 0.92 0.95 9.3 8.5 0.85 0.93 15.5 15.9 0.89 0.88 11.7 10.4
(25 cases) LFB-Net 0.94 0.97 6.7 6.5 0.88 0.95 13.7 12.9 0.91 0.89 10.5 8.4

3.5 Echocardiographic image segmentation

Accurate cardiac structure segmentation from echocardiographic images is profoundly important in cardiac diagnosis. For this, we chose the Cardiac Acquisitions for Multi-structure Ultrasound Segmentation (CAMUS) dataset to evaluate our method [51]. It contains two- and four-chamber acquisitions from 500 patients, at the end-diastolic and end-systolic phases. Thus, a given patient has four images (two in end-diastole and two in end-systole). The segmentation references and raw data of 450 patients are available for download at https://camus.creatis.insa-lyon.fr/challenge/. We randomly divided these data into 396 patients for training and 54 patients for validation.

As shown in Table 2, on the 54 validation exams, our method improved both the Dice and HD values. For the two view acquisitions (two-chamber and four-chamber), our method yielded average Dice indices of 0.94 ± 0.03, 0.92 ± 0.04, and 0.86 ± 0.06 for the 4-chamber view, and 0.94 ± 0.03, 0.92 ± 0.05, and 0.88 ± 0.04 for the 2-chamber view, respectively, in the left ventricular cavity (LV), the left atrium (LA), and the myocardium (MYO). The average HD values were 5.0 ± 2.83 mm, 5.2 ± 3.48 mm, and 6.7 ± 3.04 mm for the 4-chamber view, and 5.6 ± 3.22 mm, 4.8 ± 2.79 mm, and 7.1 ± 3.86 mm for the 2-chamber view, respectively, in the LV, LA, and MYO. Although the method produces similar results for the different view acquisitions in the LV and LA, it improved the myocardium segmentation on the 2-chamber view over the 4-chamber view by 2% in the Dice index.

Table 3: Results for cardiac cine-MRI segmentation on the ACDC testing set (50 cases) at the end-diastolic (ED) and end-systolic (ES) phases. LV: left ventricular cavity; MYO: myocardium; RV: right ventricular cavity. * ($p<0.05$) indicates that the difference between LFB-Net and the other method is significant. Values are expressed as mean ± standard deviation in 3D (HD in mm).
LV RV MYO
Dice HD Dice HD Dice HD
Method ES ED ES ED ES ED ES ED ES ED ES ED
AGN [38] 0.89 0.96 11.5 7.9 0.83 0.91 16.5 13.4 0.89 0.88 12.5 10.5
±0.10* ±0.02* ±9.73* ±5.37* ±0.14* ±0.09* ±10.39* ±8.33* ±0.05* ±0.04* ±5.69* ±6.12*
FCN [20] 0.89 0.96 10.8 7.6 0.85 0.90 15.8 14.0 0.87 0.86 12.6 11.3
±0.09* ±0.02* ±6.2* ±4.49* ±0.12* ±0.07* ±7.70* ±6.55* ±0.05* ±0.04* ±4.60* ±6.09*
U-Net [13] 0.89 0.96 11.3 8.2 0.83 0.90 16.8 14.3 0.89 0.87 12.5 10.9
±0.09* ±0.02* ±7.90* ±5.0* ±0.17* ±0.12* ±10.2* ±8.29* ±0.04* ±0.04* ±5.1* ±6.27*
ResU-Net [22] 0.90 0.96 10.5 8.6 0.85 0.90 15.2 13.5 0.88 0.87 12.6 10.8
±0.10* ±0.02* ±7.34* ±5.0* ±0.11* ±0.12* ±6.91* ±8.22* ±0.09* ±0.04* ±7.98* ±6.39*
LFB-Net 0.92 0.97 8.5 6.5 0.89 0.94 13.0 10.9 0.91 0.89 9.9 9.1
±0.07 ±0.02 ±6.76 ±3.84 ±0.08 ±0.04 ±6.46 ±6.49 ±0.03 ±0.03 ±5.33 ±5.92
Table 4: Total heart segmentation performance on the ACDC validation (Valid.) and testing (Test.) sets in 3D, computed as the average of the MYO, LV, and RV segmentation values. * ($p<0.05$) indicates that the difference between LFB-Net and the other method is significant.
Metric
Method Dice HD (mm)
       Valid.        Test.        Valid.         Test.
AGN [38] 0.910±0.07* 0.894±0.09* 10.9±7.13 12.0±8.23*
FCN [20] 0.903±0.07* 0.890±0.08* 11.0±7.22 12.0±6.53*
U-Net [13] 0.906±0.06* 0.888±0.10* 11.5±7.50 12.3±7.79*
ResU-Net [22] 0.903±0.07 0.893±0.09* 11.9±8.39* 11.9±7.32*
LFB-Net 0.921±0.05 0.920±0.06 9.9±6.47 9.7±6.18
Table 5: Long-axis echocardiographic image segmentation results on the CAMUS testing data at the end-diastolic (ED) and end-systolic (ES) phases (HD in mm). Comparisons are shown between our LFB-Net method and the CAMUS challengers. Results were obtained from the CAMUS challenge portal. The provided inter- and intra-observer values are from only 40 cases (good and medium image quality), excluding ten low-quality image cases. No inter- and intra-observer studies were provided for the left atrium [51].
LV: Endocardium LV: Epicardium Left Atrium
Dice HD Dice HD Dice HD
Data Method ES ED ES ED ES ED ES ED ES ED ES ED
Echo- inter-observer 0.873 0.919 6.6 6.0 0.890 0.913 8.6 8.0 - - - -
cardiography ±\pm0.060 ±\pm0.033 ±\pm2.4 ±\pm2.0 ±\pm0.047 ±\pm0.037 ±\pm3.3 ±\pm 2.9 - - - -
(2D) Intra-observer 0.930 0.945 4.5 4.6 0.951 0.957 5.0 5.0 - - - -
±0.031 ±0.019 ±1.8 ±1.8 ±0.021 ±0.019 ±2.1 ±2.3 - - - -
Oktay O. et al.  [25] 0.913 0.936 5.6 5.6 0.945 0.953 5.9 5.9 0.911 0.881 5.8 6.0
Leclerc S. et al.  [51] 0.912 0.936 5.5 5.3 0.946 0.956 5.7 5.2 0.918 0.889 5.3 5.7
Testing U-net-2 [51] 0.899 0.922 5.3 5.7 0.923 0.932 6.4 6.4 0.888 0.848 6.2 6.9
(50 cases) LFB-Net 0.926 0.946 4.8 4.8 0.952 0.959 5.2 5.2 0.924 0.902 5.0 5.2

As shown in Table 5, on the 50 CAMUS testing exams, LFB-Net achieved an average Dice index of 0.96 ± 0.02 for the LV epicardium, 0.94 ± 0.03 for the LV endocardium, and 0.91 ± 0.07 for the left atrium, outperforming the other CAMUS challengers [25] [51]. LFB-Net improves the segmentation of all labeled structures with less accuracy variability across the test data. Moreover, it notably improved the segmentation of the left atrium at the end-diastolic phase by a large margin. However, as the results were obtained by submitting the predicted images to the challenge website, we could not perform a statistical comparison [51]. The proposed method also achieved results comparable to the intra-observer values; in particular, it yielded better Dice indices except for the endocardium in the systolic phase. Thus, segmentation with the context feedback loop yields consistent results.

3.6 Qualitative segmentation results

As one can observe from the qualitative segmentation results in Fig. 3 for single-label and Fig. 4 for multi-label segmentation, our model produces more plausible results than the other methods. Careful visual inspection, checking for holes within a given structure for single-structure segmentation and between structures for multi-structure segmentation, and comparing shapes, shows that our method produces more plausible results. The other methods produce holes within a structure or between structures and sometimes produce atypical results, i.e., types of errors that would not be made by manual segmentation. Moreover, as seen in the ear segmentation [Rows 4-6], the other methods appear to fail when segmenting multiple unconnected small structures. In contrast, our method produces a more realistic segmentation of all structures. We observed similar behavior throughout the testing data. Indeed, anatomical plausibility is a prerequisite for experts to use the segmented structures for clinical assessments. With the proposed method, the reliability of the segmentation makes the clinical information extracted from these segmented structures trustworthy.

Figure 3: Examples of single label segmentation results. [Rows 1-3] Prostate segmentation examples, and [Rows 4-6] inner ear segmentation examples. The predicted mask (in green) is overlapped with the ground truth (in red).
Figure 4: Examples of multi-label cardiac image segmentation results. [Row 1-2] long-axis echocardiography, and [Row 3-4] short axis cine-MRI segmentation. RV: Right ventricular cavity; LV: Left ventricular cavity; MYO: Myocardium; and LA: Left atrium.

3.7 Network ablation study

Ablation study for system design

To evaluate the contribution of each building block in our method, we created the following configurations.

  1.

    The forward system (FS): a U-Net architecture without the feedback looping system in both the training and the testing phase (using only step 1 in Fig. 2).

  2.

    The FS*: the forward system (FS) without the squeeze-and-excitation network, used to study its effect in our method.

  3.

    The proposed method (LFB-Net): the forward system regularized by the FCN-based feedback system during both the training and testing phases.

Figure 5 shows the ablation study on the 25 ACDC validation cases. It can be seen that training with the feedback loop consistently improved the results. Moreover, using the feedback system's encoder during testing improved accuracy compared with using it only during training. As shown in Fig. 5, our method produces smaller inter-case differences, yielding smaller standard deviations in both the Dice and 3D HD metrics. The SE block [49] also appeared to increase the forward system's accuracy, yielding better results than without it; this held for the total average values. We found that segmentation with the feedback loop significantly outperforms the other two network configurations (FS and FS*) in both the Dice and 3D HD metrics for the RV and MYO ($p<0.05$). Although the feedback loop also improved the average performance for LV segmentation, the improvement was not significant there. Note that most networks, including the SOTA methods, performed well for LV segmentation but not for MYO and RV segmentation. Thus, these results allow us to say that using the feedback loop increases segmentation accuracy, and significantly so for the more complicated structures.

Figure 5: Box plot of the ablation study for system design on the ACDC validation set. [row 1] Mean Dice coefficient (higher is better), [Row 2] mean 3D HD (mm) (lower is better). [Column 1] RV, [column 2] MYO, and [column 3] LV. FS: Forward system; FS*: Forward system without the squeeze-and-excitation network; and LFB: the proposed final method.

As shown in Fig. 6, training with the feedback loop mitigates the conditions in which the forward system potentially fails. Firstly, in these examples, the channel-wise feature recalibration (i.e., the squeeze-and-excitation network) improves the segmentation over using only convolutional layers, which is consistent with the quantitative results in Fig. 5. Training with the feedback loop improves the forward system's accuracy, primarily when it produces low-quality label maps. In these examples, we observed that the improvement was mainly at the basal and apical regions of the heart, in particular for the right ventricular cavity at the end-systolic phase and for the endocardial and epicardial borders at the end-diastolic phase. These qualitative results are consistent with the quantitative results presented in Tables 3 and 5 for the ACDC and CAMUS testing datasets, respectively. Most other methods produce weaker quantitative results for structures that are difficult to segment, such as the right ventricular cavity and the endocardial and epicardial borders for the ACDC data, and the left atrium for the CAMUS data. In contrast, LFB-Net produces better results for these regions.

The proposed feedback system is integrated with the forward system during training. The optimal trained neural network weights of the forward and feedback systems are thus saved simultaneously as the final models. This design enables the forward system to always benefit from the feedback loop. In contrast, this might not always be the case with post-processing methods [34] [35]. For example, the denoising autoencoder-based method [34] degraded the Dice index of Ronneberger et al. [13] in the inner ear segmentation by 5%.

Figure 6: Examples from the ablation study for system design on the ACDC data. From left to right: original image, ground truth, FS, FS*, and our method (LFB-Net). [Rows 1-5] Cases where the feedback loop based segmentation (i.e., LFB-Net) corrects wrong segmentations from the forward system. The red, green, and blue colors show the right ventricular cavity, the myocardium, and the left ventricular cavity, respectively. The white arrows show the areas where FS and FS* fail to segment the targets accurately.

To further study the hypothesis that the feedback loop increases accuracy, specifically for difficult or noisy images, we calculated the percentage of cases on the CAMUS dataset whose Dice is less than 0.88 or whose HD error is greater than 6.5 mm. The experimental results are shown in Table 6, indicating that most of our method's results are above 0.88 Dice and below 6.5 mm HD error, outperforming the other methods. This is particularly true for the MYO and LA. However, for the LV, which is not a difficult structure to segment, our method had 2.8% more cases above the HD threshold than U-Net [13]. These results further demonstrate that the feedback loop is of significant benefit for segmenting difficult structures such as the MYO and LA.

Table 6: Percentage of cases whose Dice is below 88% or HD is above 6.5 mm for 54 CAMUS validation sets.
Dice HD (mm)
% (minimum value) % (maximum value)
Method MYO LA LV MYO LA LV
FCN 53.9% 16.4% 5.6% 51.4% 22% 29.4%
(0.54) (0.36) (0.74) (34.4) (19.6) (19.9)
U-Net 49.5% 17.8% 6.1% 50.9% 24.3% 24.3%
(0.56) (0.64) (0.72) (39.2) (46.6) (20.6)
ResU-Net 56.1% 17.8% 8.9% 54.2% 25.7% 34.1%
(0.60) (0.53) (0.79) (54.2) (25.7) (34.1)
AGN 95.8% 20.1% 6.5% 96.7% 34.1% 24.3%
(0.18) (0.5) (0.73) (93.5) (126.8) (89.0)
LFB-Net 44.4% 13.1% 4.7% 46.3% 18.7% 27.1%
(0.65) (0.74) (0.80) (30.6) (19.6) (16.0)

To examine the worst-case behavior of each method, we also computed the maximum HD error and the minimum Dice coefficient: a lower Dice and a higher HD reveal the model's worst scores. The results, shown in Table 6, illustrate that our method considerably decreased the maximum errors in every metric.

Another essential advantage of our approach is that it produces segmentations with almost no difference across the testing data populations. As can be observed from Table 1 for prostate and inner ear segmentation, the standard deviation is small in all measurements. The same holds for cardiac segmentation from both cine-MRI and echocardiographic images.

Ablation study for system integration strategy

To investigate the best strategy for combining the two systems through their latent spaces (i.e., $h_{s}$ and $h_{f}$), we evaluated three different schemes, namely concatenation, addition, and multiplication, on the prostate testing datasets.

As shown in Fig. 7, the concatenation strategy outperforms the other two. We therefore selected a concatenation layer to merge the latent spaces. Statistically, the three strategies showed no significant difference in the Dice index, but the concatenation strategy significantly outperformed the others in the 3D HD metric.

Figure 7: Merging strategy comparison. Combination strategies for merging $h_{s}$ and $h_{f}$ (Concat: concatenation, Add: addition, Multi: multiplication). The best performance corresponds to a higher Dice and to lower 3D HD and RVD values.
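The three merging schemes compared above amount to different element-wise or channel-wise combinations of the two latent tensors; a minimal sketch is shown below. The tensor shapes are assumed, and the 1x1 convolution used to restore the channel count after concatenation is an illustrative assumption rather than the exact design.

import torch
import torch.nn as nn

# Hypothetical latent feature maps from the forward (h_s) and feedback (h_f) systems
h_s = torch.randn(2, 256, 16, 16)
h_f = torch.randn(2, 256, 16, 16)

# Addition and multiplication: element-wise, channel count unchanged
merged_add = h_s + h_f
merged_mul = h_s * h_f

# Concatenation: channels are stacked, then reduced back with a 1x1 convolution
# (the reduction step is an illustrative assumption)
reduce = nn.Conv2d(512, 256, kernel_size=1)
merged_cat = reduce(torch.cat([h_s, h_f], dim=1))

print(merged_add.shape, merged_mul.shape, merged_cat.shape)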

Ablation study for system training scheme

Our method relies on an alternating training strategy between a modified U-Net (the forward system) and an FCN architecture (the feedback system). The FCN aims to regularize the forward system by feeding back its predicted probabilistic output, thereby improving the learning ability over time. To assess this, we conducted experiments on the ACDC data comparing two training schemes: without and with the feedback loop.
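To make the comparison concrete, the sketch below outlines two successive passes of training with a context feedback loop, with toy stand-ins for the two systems. The architectures, the joint optimizer, and the way the fed-back context enters the decoder are simplifying assumptions; they are not the exact LFB-Net training procedure.

import torch
import torch.nn as nn

# Toy stand-ins: the real forward system is a modified U-Net, the feedback system an FCN.
class ForwardSystem(nn.Module):
    def __init__(self, n_classes=4, ctx_channels=8):
        super().__init__()
        self.encode = nn.Conv2d(1, 8, 3, padding=1)
        self.decode = nn.Conv2d(8 + ctx_channels, n_classes, 3, padding=1)

    def forward(self, image, context):
        # the decoder consumes both the image features and the fed-back context
        feats = torch.relu(self.encode(image))
        return self.decode(torch.cat([feats, context], dim=1))

forward_system = ForwardSystem()
feedback_system = nn.Conv2d(4, 8, 3, padding=1)   # encodes the probabilistic output into a context

optimizer = torch.optim.Adam(
    list(forward_system.parameters()) + list(feedback_system.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

image = torch.randn(2, 1, 64, 64)                 # hypothetical mini-batch
label = torch.randint(0, 4, (2, 64, 64))

# Pass 1 uses an empty context; pass 2 reuses the encoded previous prediction.
context = torch.zeros(2, 8, 64, 64)
for _ in range(2):
    logits = forward_system(image, context)
    loss = criterion(logits, label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # encode the (detached) probabilistic output as the context for the next pass
    context = feedback_system(torch.softmax(logits.detach(), dim=1))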

Figure 8 shows how the losses evolve in the two scenarios. The training and validation losses decrease faster when training with the feedback loop than without it. Moreover, although the training loss without the feedback loop keeps decreasing over time, the validation loss does not. This further shows that the model without the feedback loop over-fits the training data more quickly over the iterations than the model with it.

Figure 8: Ablation study of the training scheme on the ACDC data. The red line represents the forward system’s training loss without any update of the decoder from the feedback loop. The blue line represents the forward system’s loss with the feedback loop $h_{f}$. Segmentation with the feedback loop accelerates convergence and reaches a lower validation loss.

Training with the feedback loop took about one extra hour to converge on the ACDC data compared with training without it. Although we designed the segmentation problem as a two-systems task, the total number of parameters to optimize is smaller than that of single-model methods. Our method has 8.5 and 7.9 million parameters during the training and testing phases, respectively, which is computationally more efficient than the 32 million trainable parameters of the U-Net architecture [13]. Thus, our method can deliver results quickly, which can be beneficial for real-time applications. For example, it produces a $256\times 256\times 4$ segmentation result (where 4 indicates the predicted probabilistic outputs for LV, RV, MYO, and background) from a cine-MRI slice within 0.025 s on a personal computer with an Intel i7 CPU and 32 GB of RAM.
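For context, trainable-parameter counts and per-slice inference time can be measured with a few lines such as the following; the placeholder model is hypothetical, and the numbers obtained naturally depend on the hardware used.

import time
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Number of trainable parameters (e.g., ~7.9 M for the testing-phase model)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def time_inference(model: torch.nn.Module, input_shape=(1, 1, 256, 256), repeats=20) -> float:
    """Average wall-clock seconds to produce one 256x256 multi-class probabilistic output."""
    model.eval()
    x = torch.randn(*input_shape)
    start = time.perf_counter()
    for _ in range(repeats):
        model(x)
    return (time.perf_counter() - start) / repeats

# Example with a hypothetical placeholder model
model = torch.nn.Conv2d(1, 4, kernel_size=3, padding=1)
print(count_parameters(model), "trainable parameters")
print(f"{time_inference(model):.4f} s per slice")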

4 Conclusions

In this paper, we introduced a deep learning method for accurate and robust medical image segmentation by formulating the segmentation problem as a two-systems task. It employs a forward system (a modified U-Net) for hierarchical feature extraction-driven image segmentation, along with a contextual feedback system. The FCN-based contextual feedback system regulates the forward system’s segmentation process, allowing the forward system to attend to and improve its previous decisions over time, particularly in uncertain image regions. Modeling image segmentation as a two-systems task enabled us to develop an efficient architecture that can be trained from a small dataset and quickly delivers segmentation results.

We demonstrated our method’s performance through extensive ablation studies and experimental results on prostate, short-axis cardiac MRI, inner ear, and long-axis echocardiographic image segmentation. The experimental results reveal two important points. First, spatial feedback-loop-based image segmentation is an effective feed-forward learning approach that produces both plausible and accurate segmentation results; plausibility was achieved without incorporating a shape prior or applying a post-processing method. Second, our method produces results with reduced segmentation variability across the testing data, showing robustness to low-contrast images and structures, and with reduced maximum errors in all metrics. Moreover, the proposed method yielded significantly better results than the state-of-the-art methods for single- and multi-structure segmentation, especially for complex structures. Thus, our work opens important perspectives towards efficient and accurate medical image analysis by interconnecting two networks through the introduced feedback loop method.

Moreover, the proposed LFB-Net framework can be extended to other medical image analysis tasks. In this regard, future research will focus on how best to exploit the contextual feedback loop’s latent space and efficiently leverage the merged contextual information. When 3D datasets are available, a 3D version of the proposed method could be applied to capture the 3D topology of the target.

References

  • [1] B. J. Davis, E. M. Horwitz, W. R. Lee, J. M. Crook, R. G. Stock, G. S. Merrick, W. M. Butler, P. D. Grimm, N. N. Stone, L. Potters et al., “American brachytherapy society consensus guidelines for transrectal ultrasound-guided permanent prostate brachytherapy,” Brachytherapy, vol. 11, no. 1, pp. 6–19, 2012.
  • [2] K. B. Girum, A. Lalande, M. Quivrin, I. Bessières, N. Pierrat, E. Martin, L. Cormier, A. Petitfils, J. M. Cosset, and G. Créhange, “Inferring postimplant dose distribution of salvage permanent prostate implant (ppi) after primary ppi on ct images,” Brachytherapy, vol. 17, no. 6, pp. 866–873, 2018.
  • [3] C. Petitjean and J.-N. Dacher, “A review of segmentation methods in short axis cardiac mr images,” Medical image analysis, vol. 15, no. 2, pp. 169–184, 2011.
  • [4] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester et al., “Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved?” IEEE transactions on medical imaging, vol. 37, no. 11, pp. 2514–2525, 2018.
  • [5] N. Gerber, M. Reyes, L. Barazzetti, H. M. Kjer, S. Vera, M. Stauber, P. Mistrik, M. Ceresa, N. Mangado, W. Wimmer et al., “A multiscale imaging and modelling dataset of the human inner ear,” Scientific data, vol. 4, p. 170132, 2017.
  • [6] M. H. Jafari, Z. Liao, H. Girgis, M. Pesteie, R. Rohling, K. Gin, T. Tsang, and P. Abolmaesumi, “Echocardiography segmentation by quality translation using anatomically constrained cyclegan,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 655–663.
  • [7] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3d u-net: learning dense volumetric segmentation from sparse annotation,” in International conference on medical image computing and computer-assisted intervention.   Springer, 2016, pp. 424–432.
  • [8] S. Ghose, A. Oliver, R. Martí, X. Lladó, J. C. Vilanova, J. Freixenet, J. Mitra, D. Sidibé, and F. Meriaudeau, “A survey of prostate segmentation methodologies in ultrasound, magnetic resonance and computed tomography images,” Computer methods and programs in biomedicine, vol. 108, no. 1, pp. 262–287, 2012.
  • [9] K. B. Girum, G. Créhange, R. Hussain, and A. Lalande, “Fast interactive medical image segmentation with weakly supervised deep learning method,” International Journal of Computer Assisted Radiology and Surgery, vol. 15, no. 9, pp. 1437–1444, 2020.
  • [10] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical image analysis, vol. 42, pp. 60–88, 2017.
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
  • [13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [14] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [15] Z. Xu and M. Niethammer, “Deepatlas: Joint semi-supervised learning of image registration and segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 420–429.
  • [16] B. Li, W. J. Niessen, S. Klein, M. de Groot, M. A. Ikram, M. W. Vernooij, and E. E. Bron, “A hybrid deep learning framework for integrated segmentation and registration: Evaluation on longitudinal white matter tract changes,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 645–653.
  • [17] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [18] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [19] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • [20] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1520–1528.
  • [21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [23] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
  • [24] Z. Tu and X. Bai, “Auto-context and its application to high-level vision tasks and 3d brain image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 10, pp. 1744–1757, 2009.
  • [25] O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. A. Cook, A. De Marvao, T. Dawes, D. P. O‘Regan et al., “Anatomically constrained neural networks (acnns): application to cardiac image enhancement and segmentation,” IEEE transactions on medical imaging, vol. 37, no. 2, pp. 384–395, 2017.
  • [26] K. B. Girum, G. Créhange, R. Hussain, P. M. Walker, and A. Lalande, “Deep generative model-driven multimodal prostate segmentation in radiotherapy,” in Workshop on Artificial Intelligence in Radiation Therapy.   Springer, 2019, pp. 119–127.
  • [27] C. Zotti, Z. Luo, A. Lalande, and P.-M. Jodoin, “Convolutional neural network with shape prior applied to cardiac mri segmentation,” IEEE journal of biomedical and health informatics, vol. 23, no. 3, pp. 1119–1128, 2018.
  • [28] M. Raghu, C. Zhang, J. Kleinberg, and S. Bengio, “Transfusion: Understanding transfer learning for medical imaging,” in Advances in neural information processing systems, 2019, pp. 3347–3357.
  • [29] Z. Gu, J. Cheng, H. Fu, K. Zhou, H. Hao, Y. Zhao, T. Zhang, S. Gao, and J. Liu, “Ce-net: Context encoder network for 2d medical image segmentation,” IEEE transactions on medical imaging, vol. 38, no. 10, pp. 2281–2292, 2019.
  • [30] Q. Zeng, D. Karimi, E. H. Pang, S. Mohammed, C. Schneider, M. Honarvar, and S. E. Salcudean, “Liver segmentation in magnetic resonance imaging via mean shape fitting with fully convolutional neural networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 246–254.
  • [31] K. B. Girum, A. Lalande, R. Hussain, and G. Crehange, “A deep learning method for real-time intraoperative us image segmentation in prostate brachytherapy,” International Journal of Computer Assisted Radiology and Surgery, vol. 15, no. 9, pp. 1467–1476, 2020.
  • [32] C. Chen, C. Biffi, G. Tarroni, S. Petersen, W. Bai, and D. Rueckert, “Learning shape priors for robust cardiac mr segmentation from multi-view images,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2019, pp. 523–531.
  • [33] H. Ravishankar, R. Venkataramani, S. Thiruvenkadam, P. Sudhakar, and V. Vaidya, “Learning and incorporating shape models for semantic segmentation,” in International conference on medical image computing and computer-assisted intervention.   Springer, 2017, pp. 203–211.
  • [34] A. J. Larrazabal, C. Martínez, B. Glocker, and E. Ferrante, “Post-dae: Anatomically plausible segmentation via post-processing with denoising autoencoders,” IEEE Transactions on Medical Imaging, vol. 39, no. 12, pp. 3813–3820, 2020.
  • [35] N. Painchaud, Y. Skandarani, T. Judge, O. Bernard, A. Lalande, and P.-M. Jodoin, “Cardiac segmentation with strong anatomical guarantees,” IEEE Transactions on Medical Imaging, vol. 39, no. 11, pp. 3703–3713, 2020.
  • [36] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, “Bottom-up and top-down attention for image captioning and visual question answering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6077–6086.
  • [37] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
  • [38] J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert, “Attention gated networks: Learning to leverage salient regions in medical images,” Medical image analysis, vol. 53, pp. 197–207, 2019.
  • [39] A. Sinha and J. Dolz, “Multi-scale self-guided attention for medical image segmentation,” IEEE Journal of Biomedical and Health Informatics, vol. 25, no. 1, pp. 121–130, 2020.
  • [40] R. Li, K. Li, Y.-C. Kuo, M. Shu, X. Qi, X. Shen, and J. Jia, “Referring image segmentation via recurrent refinement networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5745–5753.
  • [41] J. Chen, L. Yang, Y. Zhang, M. Alber, and D. Z. Chen, “Combining fully convolutional and recurrent neural networks for 3d biomedical image segmentation,” in Advances in neural information processing systems, 2016, pp. 3036–3044.
  • [42] M. Z. Alom, C. Yakopcic, M. Hasan, T. M. Taha, and V. K. Asari, “Recurrent residual u-net for medical image segmentation,” Journal of Medical Imaging, vol. 6, no. 1, p. 014006, 2019.
  • [43] W. Wang, K. Yu, J. Hugonot, P. Fua, and M. Salzmann, “Recurrent u-net for resource-constrained segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 2142–2151.
  • [44] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1925–1934.
  • [45] D. Wang, G. Hu, and C. Lyu, “Frnet: an end-to-end feature refinement neural network for medical image segmentation,” The Visual Computer, pp. 1–12, 2020.
  • [46] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel, “Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness,” in International Conference on Learning Representations, 2018.
  • [47] F. Shama, R. Mechrez, A. Shoshan, and L. Zelnik-Manor, “Adversarial feedback loop,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3205–3214.
  • [48] M. Huh, S.-H. Sun, and N. Zhang, “Feedback adversarial learning: Spatial feedback for improving generative adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1476–1485.
  • [49] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [50] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [51] S. Leclerc, E. Smistad, J. Pedrosa, A. Østvik, F. Cervenansky, F. Espinosa, T. Espeland, E. A. R. Berg, P.-M. Jodoin, T. Grenier et al., “Deep learning for segmentation using an open large-scale dataset in 2d echocardiography,” IEEE transactions on medical imaging, vol. 38, no. 9, pp. 2198–2210, 2019.
  • [52] E. Whitley and J. Ball, “Statistics review 6: Nonparametric methods,” Critical care, vol. 6, no. 6, p. 509, 2002.
  • [53] R. Hussain, A. Lalande, K. B. Girum, C. Guigou, and A. B. Grayeli, “Augmented reality for inner ear procedures: visualization of the cochlear central axis in microscopic videos,” International Journal of Computer Assisted Radiology and Surgery, vol. 15, no. 10, pp. 1703–1711, 2020.