\cftpagenumbersoff

figure \cftpagenumbersofftable

Selecting the Best Optimizers for Deep Learning based Medical Image Segmentation

Aliasghar Mortazi Volastra Therapeutics, NYC, NY, USA. Vedat Cicek Department of Cardiology, Health Sciences University, Istanbul, Turkey. Elif Keles Machine & Hybrid Intelligence Lab, Department of Radiology, Northwestern University, Chicago, IL, USA. Ulas Bagci Machine & Hybrid Intelligence Lab, Department of Radiology, Northwestern University, Chicago, IL, USA.

Abstract

Purpose: The goal of this work is to identify the best optimizers for deep learning in the context of cardiac image segmentation and to provide guidance on how to design segmentation networks with effective optimization strategies.
Approach: Most successful deep learning networks are trained using two types of stochastic gradient descent (SGD) algorithms: adaptive learning and accelerated schemes. Adaptive learning helps with fast convergence by starting with a larger learning rate (LR) and gradually decreasing it. Momentum optimizers are particularly effective at quickly optimizing neural networks within the accelerated schemes category. By revealing the potential interplay between these two types of algorithms (LR and momentum optimizers or momentum rate (MR) in short), in this article, we explore the two variants of SGD algorithms in a single setting. We suggest using cyclic learning as the base optimizer and integrating optimal values of learning rate and momentum rate. The new optimization function proposed in this work is based on the Nesterov accelerated gradient optimizer, which is more efficient computationally and has better generalization capabilities compared to other adaptive optimizers.
Results: We investigated the relationship of LR and MR under an important problem of medical image segmentation of cardiac structures from MRI and CT scans. We conducted experiments using the cardiac imaging dataset from the ACDC challenge of MICCAI 2017, and four different architectures shown to be successful for cardiac image segmentation problems. Our comprehensive evaluations demonstrated that the proposed optimizer achieved better results (over a 2% improvement in the dice metric) than other optimizers in deep learning literature with similar or lower computational cost in both single and multi-object segmentation settings.
Conclusions: We hypothesized that combination of accelerated and adaptive optimization methods can have a drastic effect in medical image segmentation performances. To this end, we proposed a new cyclic optimization method (CLMR) to address the efficiency and accuracy problems in deep learning based medical image segmentation. The proposed strategy yielded better generalization in comparison to adaptive optimizers.

keywords:

Deep Learning Optimization, Segmentation, Cyclic learning, adaptive optimization, accelerated optimization

*Ulas Bagci, \linkable[email protected]

1 Introduction

Optimization algorithms are used in the training phase of deep learning, where the model is presented with a batch of data, the gradients are calculated, and the weights and biases are updated using an optimization algorithm. Once the model has been trained, it can then be used for inference on new data.

Stochastic gradient descent (SGD) algorithms are the main optimization techniques used to train deep neural networks. These algorithms can be divided into two categories: adaptive learning rate methods (e.g., Adam and AdaGrad) and accelerated schemes (e.g., Nesterov momentum). Both the learning rate (LR) and momentum rate (MR) are important factors in the optimization process. LR, in particular, is a key adjustable parameter that has been extensively studied and modified over the years. The momentum term was introduced to the optimization equation by Rumelhart and Williams in 1986 to allow for larger changes in the network weights without causing oscillation [1].

There have been controversial results in the literature about the characteristics of available optimization methods. Therefore, there is a need for exploring which optimization method should be chosen for particular tasks. Most neural network optimizers have been evaluated and tested on classification tasks, which have much lower output dimensions compared to segmentation tasks, which have much higher output dimensions. Hence, these differences between classification and segmentation problems imply a different investigation and method for optimization. In this paper, we develop a new optimization method by exploring LR and MR optimizers for medical image segmentation problems for the first time in the literature. Our proposed optimizer is simple and promising because it fixes the problems with traditional optimizers and demonstrates how a simple new formulation can solve surprisingly these problems.

Non-adaptive vs. adaptive optimizers: SGD is the dominant optimization algorithm in deep learning, which is simple and performs well across many applications. However, it has the disadvantage of scaling the gradient uniformly in all directions (for each parameter of network). Another challenge in SGD is to choose an appropriate value for LR. Since LR is a fixed value in SGD based approaches, it is critical to set it up appropriately since it can directly affect both the convergence speed and prediction accuracy of neural networks. There have been several studies trying to solve this problem by adaptively changing the LR during training which are mostly known as ”adaptive optimizers”. Based on the history of changes in gradients during network optimization, LR is adapted in each iteration. Examples of such methods consist of ADAM [2], ADAGrad [3], and RMSProp [4]. In general, adaptive optimizers make training faster, which has led to their wide use in deep learning applications.

The development of momentum in neural network optimizers has followed a similar trajectory as the learning rate. Momentum optimizers [5] introduced to speed up convergence by considering the changes from last iteration with a multiplier, which called momentum, in updating parameters in current iteration. Selecting an appropriate value for the momentum rate (MR) was initially difficult, but this issue was addressed with the introduction of adaptive optimizers like ADAM, which can adaptively adjust both the MR and LR. These adaptive optimizers have become very popular in the field because they quickly converge on training data.

Although they are widely used, adaptive optimizers may converge to different local minima compared to classical SGD approaches, which can lead to worse generalization and out-of-sample performance. This has been demonstrated by a growing number of recent studies [6, 7, 8]. To improve the generalization ability of neural networks, researchers have returned to using original SGD approaches but with new strategies for improving convergence speed. For example, the YellowFin optimizer demonstrated that manually tuning the learning rate and momentum rate can lead to better results than using the ADAM optimizer [8]. Although it was a proof-of-concept study that provided evidence for the counterintuitive idea that non-adaptive methods can be effective. However, in practical applications, manually tuning these rates is challenging and time-consuming.

In another attempt, a cyclic learning rate (CLR) was introduced in [7] to change the LR according to a cycle (i.e triangle or Gaussian), proposing a practical solution to hand-tuning requirements. The CLR’s only disadvantage was that a fixed MR could limit the search states of LR and MR and cause them to fail until find an optimal solution. Our work will go beyond this constraint.

Summary of our contribution: By motivated from [7], herein we introduce a new version of CLR called ”Cyclic Learning/Momentum Rate” (CLMR). This new optimizer alternates the values of the learning rate and momentum rate during training, which has two benefits compared to adaptive optimizers. First, it is more efficient computationally. Second, it has better generalization performance. Furthermore, CLMR leads to better results than conventional approaches such as SGD and CLR. Lastly, we investigate the effect of changing the frequency of cyclic function in training and generalization and suggest the optimum frequency values. We investigate several optimizers commonly used in medical image segmentation problems, and compare their performance as well as generalization ability in single and multi-object segmentation settings by using cardiac MR images (Cine-MRI).

The rest of the paper is organized as follows. In Section 2, we introduce the background information for neural network optimizers, their notations and their use in medical image segmentation. In section 3, we give the details of the proposed method and network architectures on which segmentation experiments have been conducted. Experimental results are summarized in Section 4. Section 5 concludes the paper with discussions and future work.

2 Background

Optimizing a deep neural network, which is a high-dimensional system with millions of parameters, is one of the most challenging aspects of making these systems more practical. Designing and implementing the best optimizer for deep network training has received a lot of attention in recent decades. These studies mainly address two major issues: (1) Making the network training as fast as possible (fast convergence), (2) increasing the generalizability of networks. SGD optimizers have been the most popular optimizer in deep networks due to their low computational cost, fast convergence. There have been major modifications to original SGD optimizer during last decade to make them for efficient for training deep nets. The following are some of the key optimization studies related to our efforts.

2.1 Optimizers with fixed LR/MR

SGD and Mini-batch gradient descent were first optimizers used for training neural networks. The updating rule for these optimizers include only the value of last iteration as shown in Eq. 1. Choosing appropriate value for a LR is challenging in these optimizers since if LR is very small then convergence is very slow; and if LR is set high, the optimizer will oscillate around global minima instead of converging:

~{}\theta_{i}=~{}\theta_{i-1}-~{}\alpha~{}\nabla_{~{}\theta_{i}}J(~{}\theta_{i}),

(1)

where $\theta$ is network parameters, $~{}\alpha$ is LR, J is cost function to be minimized (function of $\theta$ , $X$ (input), and $Y$ (labels)). The equation 1 can be considered as an updating rule for SGD and mini-batch gradient descent by choosing $X$ and $Y$ as whole samples, a single sample, or a batch of samples in a dataset.

The Momentum optimizer was designed to accelerate the optimization process by taking into account the values from previous iterations, weighted by a factor known as ”momentum,” as mentioned in [5]. The updating for this optimizer is defined as:

~{}\theta_{i}=~{}\theta_{i-1}-~{}\alpha~{}\nabla_{~{}\theta_{i}}J(~{}\theta_{i})-~{}\beta(~{}\theta_{i-1}-~{}\theta_{i-2}),

(2)

where $~{}\beta$ denotes the momentum rate (MR). In Momentum optimizer, the past iterations don’t play any role in cost function and cost function is only calculated for the current iteration only. Also, similar to LR, choosing a proper value for MR is challenging and it has a correlation with LR too.

Nesterov accelerated gradient [9] (NAG) was then introduced to address the limitation of momentum optimizers as well as to accelerate the convergence by including information from previous iterations in calculating the gradient of the cost function as shown in the following equation:

~{}\theta_{i}=~{}\theta_{i-1}-~{}\alpha~{}\nabla_{~{}\theta_{i}}J(~{}\theta_{i}-~{}\beta(~{}\theta_{i-1}-~{}\theta_{i-2}))-~{}\beta(~{}\theta_{i-1}-~{}\theta_{i-2}).

(3)

Compared to optimizers with fixed LR/MR, the NAG optimizer generally shows improved performance in both convergence speed and generalizability.

2.2 Optimizers with adaptive LR and MR

A significant disadvantage of optimizers with a fixed LR/MR is that they cannot incorporate information from the gradients of past iterations in adjusting the learning and momentum rates. For instance, they cannot increase the learning rate for dimensions with a small slope to improve convergence, or reduce the learning rate for dimensions with a steep slope to avoid oscillation around the minimum point. Adagrad [3] is one of fist adaptive LR optimizers used in deep networks adapting the learning rate for each parameter in the network by dividing the gradient of each parameter by its sum of the squares of gradient, as follows:

~{}\theta_{i}=~{}\theta_{i-1}-~{}\alpha~{}\frac{1}{~{}\sqrt[]{G_{i}+~{}\epsilon}}~{}\circ~{}\nabla_{~{}\theta_{i}}J(~{}\theta_{i}),

(4)

where $G_{i}$ is a diagonal (square) matrix and each diagonal element equal to the sum of the square of gradient of its corresponding parameters:

G_{i}=\displaystyle\sum_{i=1}^{I}(~{}\nabla_{~{}\theta_{i}}J(~{}\theta_{i}))^{2},

(5)

where I is the current iteration.

One of the drawbacks of AdaGrad is gradient vanishing due to accumulation of all past square gradient in denominator of Equation 6 during the training. This leads the gradients to converge to zero after several epochs in training. However, AdaDelta, RMSProp, and ADAM optimizers solved this problem by considering a sum of the past samples within a pre-defined window. ADAM optimizer’s updating rule uses past squared gradient (as scale) and also like momentum, it keeps an exponentially decaying average of past gradients. Hence, these adaptive optimizers have advantages over the AdaGrad by adaptively changing both RL and ML as well as resolving the gradient vanishing issue:

~{}\theta_{i}=~{}\theta_{i-1}-~{}\alpha_{i}~{}\frac{~{}\beta_{1}~{}\nabla_{~{}\theta}J(~{}\theta_{i-2})-(1-~{}\beta_{1})~{}\nabla_{~{}\theta}J(~{}\theta_{i-1})}{~{}\sqrt[]{~{}\beta_{2}+~{}\epsilon}}~{}\circ~{}\nabla_{~{}\theta_{i}}J(~{}\theta_{i}).

(6)

Adaptive learning methods are costly because they are required to calculate and keep all the past gradients and their squares to update the next parameters. Also, the adaptive learning optimizer may converge into different minima in comparison with fixed learning rate optimizers [7, 6, 8].

Alternatively, Cyclic learning rate (CLR) was proposed to change the learning rate during training, which needed no additional computational cost. CLR is a method for training neural networks that involves periodically changing the learning rate during training. As mentioned earlier, the learning rate is typically adjusted according to a predetermined schedule, such as increasing the learning rate from a low value to a high value and then decreasing it back to the low value over a set number of training iterations. The learning rate is then reset and the process is repeated. This can help the optimization process by allowing the model to make larger updates at the beginning of training and smaller updates as training progresses, potentially leading to faster convergence and better model performance [7]. Later in Figure 3a, we show how we use CLR in our methodology.

2.3 Cardiac Image Segmentation

Cardiovascular diseases (CVDs) are the leading cause of death worldwide according to the World Health Organization (WHO). CVDs lead to millions of deaths annually and are expected to cause over 23.6 million deaths in 2030 [10]. Cine-MR imaging can provide valuable information about cardiac diseases due to its excellent soft tissue contrast. For example, ejection fraction (EF), an important metric measuring how much blood the left ventricle pumps out with each contraction, can be measured with Cine-MRI. To this end, radiologists often manually measure the volume of the heart at the end of the systole (ES) and the end of the diastole (ED) to measure EF. This is a time-consuming process with known inter-, and intra-observer variations. Due to its significance in functional assessment of heart, there have been numerous machine learning based automated algorithms developed in the literature for measuring EF. In this study, we make our efforts in this application due to its importance in the clinic.

There is a considerable amount of research dedicated to the problem of cardiac segmentation from MR or CT images. Since Xu et al. found a correlation between motion characteristics and tissue properties, they developed a combined motion feature learning architecture for distinguishing myocardial infarction [11]. In our another attempt, CardiacNet in [12] proposed a multi-view CNN to segment the left atrium and proximal pulmonary veins from MR images following by an adaptive fusion. The shape prior information from deep networks were used to guide segmentation network to delineate cardiac substructures from MR images [13, 14]. As previously stated, the literature and methodologies for cardiac segmentation are extensive. Readers are invited to consult references [15] and [16] for more comprehensive information.

3 Methods

We approach the optimization problem from the perspective of a significant medical image analysis application: segmentation. Segmentation is rarely studied from an optimization perspective in comparison to classification.

Over the past few years, there has been a dramatic increase in the use of CNN in computer vision and medical imaging applications, more recently combined with Transformers [17, 18, 19, 20, 21]. The successful CNN-based segmentation approaches can be divided into three broad categories. The first category is named encoder-decoder architecture. One of the most famous works in this category has been done by Badrinarayanan and et al. [22], called SegNet, and it is designed for semantic segmentation. Another category of architecture is called ResNet [23] and it is proposed by He and et al. for image recognition but later, the U-Net [24] was proposed from the similar extending recognition into segmentation with a U-shaped network consisting of skip connections between encoder and decoder. The last category of the architecture is based on the DenseNet[25], instead of having a residual connection, the vectors are concatenated to each other to maximize the information flow through the network. Hence, the information loss during backpropagation can be minimized by considering these connections. DenseNet itself is proposed for image classification, but by combining concepts of the ResNet and DenseNet, a new architecture was introduced in [26] in a U-Net shape to do segmentation. There are many more architectures based on the U-Net style with adaptation from the CNN and Transformers literature. In our study, we conducted experiments in three different (mostly used) segmentation architectures to demonstrate the effect of the connections, as explained in the following subsection. One may increase the number of architectures for more comparisons, but this is outside the scope of our study. CNN Architectures used in the experiments are the following:

1. Encoder-Decoder Architecture: This architecture simply consists of the encoder and decoder part as illustrated in Figure 1, without considering red skip connections. The filter size in all the layer are $3\times 3$ and each encoder and decoder part include $5$ CNN blocks and each CNN blocks consist of different number of layers as mentioned in Table 1. Also, the number of filters in each CNN block are a fixed number and they are mentioned in Table 1 for each layer. Each layer within the CNN block includes Convolution+Batch normalization+ReLu as activation function (CBR).

2. U-Net Architecture: U-Net is particularly popular in medical image analysis. The U-Net model is based on a fully convolutional network, which means that it is built entirely out of convolutional layers and does not contain any fully connected layers. This makes it well-suited for image segmentation tasks, as it can process input images of any size and output a corresponding segmentation map. The U-Net model is known for its ability to handle small, sparsely annotated training datasets, making it a useful tool for medical image analysis where such datasets are common. This architecture is similar to the Encoder-Decoder architecture as illustrated in Figure 1 with red skip connections from encoder to decoder. The number of layers and filters for each block are mentioned in Table 1.

3. DenseNet Architecture: DenseNet is another convolutional neural network architecture that was developed to improve upon the efficiency of training deep networks. The key idea behind DenseNet is to connect all layers in the network directly to every other layer, rather than only connecting each layer to its immediate neighbors as is done in traditional convolutional networks. This allows the network to learn more efficient feature representations and reduces the risk of overfitting. DenseNets have been successful in a number of applications and have achieved state-of-the-art performance on image classification and segmentation tasks. We will use two different DenseNet architectures in our experiments. First, the architecture in Figure 1 with dense blocks (DBs) and skip connections is DenseNet_1. Then, in order to use higher growth rate (GR), in DenseNet_2, at the end of each block a convolution layer with kernel size of $1\times 1$ is used to decrease number of its input filters by C rate, which C is equal 2 in this paper. The GR in DenseNet_2 increased to 24 (from 16 in DenseNet_1) while the number of parameters decreased (Table 1).The number of CBR layers and also the number of parameters are mentioned in Table 1.

Table 1: Number of layers in each block of different architectures and number of parameters.

Enc_Dec

U-Net

DenseNet_1

(GR=16)

DenseNet_2

(GR=24)

Block 1

6 layers,#filters=32

6 layers

Block 2

8 layers,#filters=64

8 layers

Block 3

11 layers,#filters=128

11 layers,

11 layers

Block 4

15 layers,#filters=256

15 layers

Block 5

20 layers,#filters=512

20 layers

Block 6

20 layers,#filters=512

20 layers

Block 7

15 layers,#filters=256

15 layers

Block 8

11 layers,#filters=128

11 layers

Block 9

8 layers,#filters=64

8 layers

Block 10

6 layers,#filters=32

6 layers

# of params

(in million):

77.5

79.1

7.7

8.8

3.1 Dense Block

Within the DB, a concatenation operation is done for combining the feature maps (through direction (axis) of the channels) for the last three layers. So, if the input to $\it{l^{th}}$ layer is $\mathbf{X}_{\it{l}}$ , then the output of $\it{l^{th}}$ layer is:

F(\mathbf{X}_{\it{l}})=CBR(\mathbf{X}_{\it{l}}).

(7)

Since we are doing concatenation before each layer (except the first one), the output of each layer can be calculated only by considering the input and output of first layer as following:

\begin{split}F(\mathbf{X}_{\it{l}})=F(\displaystyle\underset{{\it{l}^{\prime}=0}}{\overset{{\it{l}^{\prime}=\it{l}-1}}{\mathbf{{}^{\frown}}}}F(\mathbf{X}_{\it{l}^{\prime}}))\qquad\text{for}\quad\it{l}\geq 1\quad\\ \text{and}\quad\it{l}=\{1,2,\dots,\it{L},\},\end{split}

(8)

where ^⌢ is concatenation operation. In addition, for initialization $F(\mathbf{X}_{\it{-1}})$ and $F(\mathbf{X}_{\it{0}})$ are considered as {} and $\mathbf{X}_{\it{1}}$ respectively which {} is an empty set and there are $\it{L}$ layers inside of the block.

Assuming the number of output features for each layer is $\mathbf{K_{out}}$ (channel out) and the number of input features for first layer is $\mathbf{K_{in_{1}}}$ (channel in). Then, the feature maps growth (channel out) for second, third, …, and $\it{L^{th}}$ layer are $\mathbf{K_{out}}+\mathbf{K_{in_{1}}}$ , 2 $\mathbf{K_{out}}+\mathbf{K_{in_{1}}}$ , …, and $(\it{L}-1)\mathbf{K_{out}}+\mathbf{K_{in_{1}}}$ respectively. The growth rate for the DB is the same as fourth layer.

Refer to caption — Figure 1: CNN Architecture is used for pixel-wise segmentation. The architecture with CNN blocks without red skip connections is Encoder-Decoder architecture. The architecture with red skip connection (Fig. 2a) is called U-Net, if connections are with Dense block (Fig. 2b), it is called Tiramisu (DenseNet for segmentation)

3.2 Cyclic Learning/Momentum Rate Optimizer

Smith et al [7] argued that a cycling learning may be a more effective alternative to adaptive optimizations especially from generalization perspective. Basically, cyclic learning includes a pre-defined cycle (such as triangle or Gaussian function) that learning rate is changing according to that cycle. Here, we hypothesize (and show later in the results section) that having a cyclic momentum in Nesterov optimizer (Eq. 2) can lead to a better accuracy in segmentation task in generalization phase. As a reminder, momentum in Eq. 2 was used to consider the past iterations by a coefficient called momentum. So, choosing the proper value for momentum is challenging. To this end, we propose changing the MR in the same way that we changed the LR, and we considered the cyclic triangle function for both MR and LR as illustrated in the Figure 3. $cycle_{lr}$ and $cycle_{mr}$ determine the period of triangle function for LR and MR are defined by:

cycle_{lr}=C_{lr}\times It,

(9)

cycle_{mr}=C_{mr}\times It,

(10)

where $C_{lr}$ and $C_{mr}$ are positive even integer numbers, It is number of iteration per each epoch.

In Figures 3a and 3b, the cyclic function for different values of $C_{lr}$ and $C_{mr}$ are illustrated. LR during whole training can be determined from equation 11:

LR=\begin{cases}2\times\frac{max_{lr}-min_{lr}}{C_{lr}\times It}\times i+min_{lr},&for\quad N\times cycle_{lr}\leq i<\frac{2N+1}{2}\times cycle_{lr}\\ -2\times\frac{max_{lr}-min_{lr}}{C_{lr}\times It}\times i+2max_{lr}-min_{lr},&for\quad\frac{2N+1}{2}\times cycle_{lr}\leq i<(N+1)\times cycle_{lr},\end{cases}

(11)

where $max_{lr}$ and $min_{lr}$ are maximum and minimum values of LR function, respectively. $i$ is the iteration indicator during whole training process and $i\in\{1,2,\dots,It\times Ep\}$ , which $Ep$ is total number of epochs in training and $N$ is a set of natural number. MR can also be determined as:

MR=\begin{cases}2\times\frac{max_{mr}-min_{mr}}{C_{mr}\times It}\times i+min_{mr},&for\quad N\times cycle_{mr}\leq i<\frac{2N+1}{2}\times cycle_{mr}\\ -2\times\frac{max_{mr}-min_{mr}}{C_{mr}\times It}\times i+2max_{mr}-min_{mr},&for\quad\frac{2N+1}{2}\times cycle_{mr}\leq i<(N+1)\times cycle_{mr},\end{cases}

(12)

where $max_{mr}$ and $min_{mr}$ are maximum and minimum values of MR function, respectively.

Equations 11 and 12 are used to determine the values of LR and MR in each iteration during training. One of the challenges in using these cyclic LR and MR functions are determining the values of some variables in the equations including $max_{lr}$ , $min_{lr}$ , and $C_{lr}$ for LR; and also $max_{mr}$ , $min_{mr}$ , and $C_{mr}$ for MR. For finding the the $max_{lr}$ and $min_{lr}$ values, as it suggested in [7], one can run the networks with different LR values for a few epochs and then these values are chosen according to how network accuracy changes. Since, when both LR and MR change dynamically and the one value can affect the other one (considering the optimizer formula), it makes more challenging to find the CLMR optimum parameters by proposed solution. It means we need to train large number of networks in order to determine the optimum values of $max_{lr}$ , $min_{lr}$ , $max_{,r}$ , and $min_{mr}$ which is not computationally feasible. Also, a heuristic method was suggested in [7] to find the best value of $C_{lr}$ .

In this paper we propose an alternative way to find best cyclic functions with minimum computational cost. We set fixed values for $max_{lr}$ , $min_{lr}$ , $max_{,r}$ , and $min_{mr}$ parameters and make sure that the selected values cover a good range of values for both LR and MR in practice (illustrated in Figure 3). Then, we did a computationally reasonable heuristic search for finding the appropriate amount of $C_{lr}$ and $C_{mr}$ from the values shown in Figure 3. Since, changing the values of $C_{lr}$ and $C_{mr}$ leads to change in the values of LR and MR in different iterations, there is no need to find the optimum values for minimum and maximum, we did search in 2D space of $C_{lr}$ and $C_{mr}$ to find their optimal values.

4 Experiments and results

4.1 Data

For investigating the performance of proposed method, a dataset from Automatic Cardiac Diagnosis Challenge (ACDC-MICCAI Workshop 2017) were used [27]. These data set includes 150 cine-MR images: 30 normal cases, 30 patients with myocardial infarction, 30 patients with dilated cardiomyopathy, 30 patients with hypertrophic cardiomyopathy, and the remaining 30 patients with abnormal RV. While 100 cine-MR images were used for training (80) and validation (20), the remaining 50 images were used for testing with online evaluation by the challenge organizers. For a fair validation in training procedures, four subjects from each category have been chosen. The binary masks for ground truths of three substructures were provided by the challenge organizers for training and validation while test set was evaluated online (unseen test set. Three substructures are right ventricle (RV), myocardium of left ventricle(Myo.), and left ventricle (LV) at two time points of end-systole (ES) and end-diastole (ED).

The MRIs were obtained using two MRI scanners of different magnetic strengths (1.5T and 3.0T). Cine-MR images were acquired with a SSFP sequence in short axis while on breath hold (and gating). In particular, a series of short axis slices cover the LV from the base to the apex, with a thickness of 5 mm (or sometimes 8 mm) and sometimes an inter-slice gap of 5 mm. The spatial resolution goes from 1.37 to 1.68 $mm^{2}/pixel$ and 28 to 40 volumes cover completely or partially the cardiac cycle.

4.2 Implementation details

The networks were trained for a fixed number of epochs (100) and it was confirmed that they are fully trained. All the images were resized to $200\times 200$ in short axis by using B-spline interpolation. Then, as a preprocessing step, we applied anisotropic filtering and histogram matching to the whole data set. The total number of 2D slices for training was about 1690 and batch size of 10 were chosen for training. Hence, the number of iteration per epoch is $\frac{1690}{10}=169$ and we have a total number of iteration $100\times 169=16,900$ in training. The Cross Entropy loss function was chosen for minimization. All the networks were implemented on Tensorflow with using NVIDIA TitanXP GPUs.

4.3 Results

We calculated Dice Index (DI) and also Cross Entropy (CE) loss on validation set for investigating our proposed optimizer along with other optimizers. In Figures 5a and b, the CE and DI curves versus iterations for U-Net architecture for different optimizers are illustrated. As these curves show, the DI in U-Net with ADAM optimizer is increasing rapidly and sharply at the very beginning and then it is almost fixed afterwards. Although our proposed optimizer (CLMR(C_lr=20, C_mr=20)) is not learning as fast as ADAM optimizer at very beginning in U-Net, it gets better accuracy than ADAM finally. This phenomenon is clearer in CE curves. The quantitative results on test set in Table 2 support the same observation and conclusion. Further, the same pattern happens for DenseNet_2 architecture in Figure 6. This confirms our hypothesis that adaptive optimizers converges faster but to potentially to different local minimas in comparison with classical SGD optimizers.

Figure 5 shows the behavior of U-Net architecture with CLMR optimizer performing 2% increase in dice index (in all three substructures as well as average) than its CRL optimizer counterpart. This proves that having a cyclic momentum rate can yield to better efficiency and accuracy than having a simple cyclic learning rate. The results on the test set comparing CLR and CLMR optimizers in Table 2 support this conclusion too.

Moreover, the curves of DI and CE among different architectures, trained by ADAM and CLMR, are demonstrated in Figure 4 a and b. Although the DenseNet_2 has less parameters in comparison with other architectures, it gets better results than the other architectures. These curves reveals some other important points about using different architectures: first, for all different architectures, the proposed CLMR optimizer works better than ADAM optimizer, indicating the power of proposed cyclic optimizer. Second, DenseNet architectures are getting better results than U-Net and Enc_Dec architectures, which are highly over-parameterized architectures than DenseNet and their saturation can be linked to this information too. Third, comparison between curves of DenseNet_1 and DenseNet_2 shows that having a higher GR (growth rate) in dense connections is more important than having dense block with high number of parameters. Since, DenseNet_2, with GR=24 reached better results in comparison with DenseNet_1 with twice of number of parameters in end of each dense block in comparison to DenseNet_2 and GR=16. These results are supported by the dice metric obtained from test data and are mentioned in Table 2.

Finally, the DI on test data with online evaluation for different architectures with different optimizers are summarized in Table 2. In order to have a better comparison, the box plot of all methods are drawn in Figure 7. As the figure shows, the dice statistic obtained from CLMR is better than other optimizers most of the time in addition to its superior efficiency. In addition, qualitative results for different methods are shown in Figures 8 and 9: the contours for RV, Myo., and LV in ED for different methods and architectures and also ground-truth across four slices from Apex to Base. Usually, segmentation of RV near the Apex is harder than others because RV is almost vanishing at this point. As a result, some methods may not even detect the RV at slices near the Apex. Figure 9 shows the contours for RV, Myo., and LV in ES for different methods and architectures and also ground-truth across four slices from Apex to Base. Since at ES heart is at minimum volume; thus, it is more difficult to segment substructures. The contours generated with DenseNet_2 method is more similar to ground-truth in both ED and ES, which shows the generalizability of the proposed method with an efficient architecture choice.

Table 2: DI in the test data set with online evaluation.

		Adam	Nesterov	CLR	CLMR
Enc_Dec	RV	0.3272	0.1309	0.3833	0.4336
U-Net		0.8574	0.5968	0.8618	0.8820
DenseNet_1		0.8802	0.6936	0.8961	0.8957
DenseNet_2		0.8781	0.7232	0.8910	0.9049
Enc_Dec	Myo	0.1473	0.1492	0.1692	0.1686
U-Net		0.8628	0.6486	0.8588	0.8631
DenseNet_1		0.8787	0.7170	0.8834	0.8960
DenseNet_2		0.8796	0.7196	0.8904	0.8999
Enc_Dec	LV	0.4950	0.3260	0.4972	0.5418
U-Net		0.9238	0.7670	0.8936	0.9360
DenseNet_1		0.9376	0.8465	0.9351	0.9393
DenseNet_2		0.9196	0.8449	0.9378	0.9478
Enc_Dec	Ave.	0.3232	0.1687	0.3499	0.3814
U-Net		0.8813	0.6708	0.8714	0.8937
DenseNet_1		0.8988	0.7524	0.9049	0.9103
DenseNet_2		0.8924	0.7626	0.9064	0.9176

5 Discussions and Conclusions

We proposed a new cyclic optimization method (CLMR) to address the efficiency and accuracy problems in deep learning based medical image segmentation. We hypothesized that having a cyclic learning/momentum function can yield better generalization in comparison to adaptive optimizers. We showed that CMLR is significantly better than adaptive optimizers by considering momentum changes inside the Nesterov optimizer as a cyclic function. Finding the parameters of these cyclic functions are complicated due to the correlation existing between LR and MR function. Thus, we formulated both LR and MR functions and we suggested a method to find the parameters of these cyclic functions with reasonable computational cost.

Our proposed method is just a beginning of a new generation of optimizers which can generalize better than adaptive ones. One of the challenges in designing such optimizers is to set up the parameters of cyclic functions which need further investigation in a broad sense. One can learn these parameters with a neural network or reinforcement learning in an efficient manner: i.e., the $max_{lr}$ , $min_{lr}$ , $max_{,r}$ , $min_{mr}$ , $C_{lr}$ , and $C_{mr}$ can be learned by an policy gradient reinforcement learning approach. In this study, our focus was only on supervised learning methods. However, proposed method can be generalized to semi-supervised or self-supervised methods as well. This is outside the scope of the current paper and can be thought of as a follow-up to what we proposed here.

In our study, our focus was in a particular clinical imaging problem: segmenting cardiac MRI scans. We assessed the optimization problem with single and multi-object settings. One may consider different imaging modalities and with different, and perhaps with newer, architectures to explore the architecture choices versus optimization functions. We believe that, based on our comparative studies, the architecture choice can affect the segmentation results such that more complex architectures require optimization algorithms to be selected wisely.

The choice of optimization algorithm can depend on the specific characteristics of the dataset and the model being trained, as well as the computational resources available. Therefore, our results may not be generalizable to every situation in medical image analysis tasks. For instance, if the medical data is noisy or uncertain, it may be more difficult for the model to accurately predict the labels. This can make the optimization process more sensitive to the choice of optimization algorithm and may require the use of regularization techniques to prevent overfitting. For another example, if the dataset is highly imbalanced, with many more examples of one class than the other, it may be more difficult for the model to accurately predict the minority class. This can make the optimization process more challenging and may require the use of techniques such as class weighting or oversampling to improve the performance of the model. Last, but not least, if the dataset has a large number of features or the features are highly correlated, it may be more difficult to find a good set of weights and biases that accurately model the data. This can make the optimization process more challenging and may require the use of more advanced optimization algorithms.

Our study has some other limitations too. The use of second-order optimization methods are in high demands recently. However, it was not our focus on such methods due to their high burden in computational cost. Second-order optimization methods, which take into account the curvature of the loss function, have shown promising results in a variety of deep learning applications. These methods can be more computationally expensive than first-order methods, which only consider the gradient of the loss function, but may be more effective in certain situations. Further, we focused on the segmentation problem with traditional deep network architectures while reinforcement learning and generative models can require development of new algorithms tailored to specific types of problem.

5.1 Disclosures

No Conflict of Interest.

5.2 Acknowledgments

This project is supported by NIH funding: R01-CA246704, R01-CA240639, U01-DK127384-02S1, R03-EB032943-02, and R15-EB030356.

5.3 Data, Materials, and Code Availability

Data is available under MICCAI 2017 ACDC challenge.

References

[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning Internal Representations by Error Propagation, 318–362. MIT Press, Cambridge, MA, USA (1986).
[2] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).
[3] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” Journal of Machine Learning Research 12(Jul), 2121–2159 (2011).
[4] G. Hinton, N. Srivastava, and K. Swersky, “Neural networks for machine learning lecture 6a overview of mini-batch gradient descent,” Cited on 14(8), 2 (2012).
[5] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural networks 12(1), 145–151 (1999).
[6] A. C. Wilson, R. Roelofs, M. Stern, et al., “The marginal value of adaptive gradient methods in machine learning,” in Advances in Neural Information Processing Systems, 4151–4161 (2017).
[7] L. N. Smith, “Cyclical learning rates for training neural networks,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, 464–472, IEEE (2017).
[8] J. Zhang and I. Mitliagkas, “Yellowfin and the art of momentum tuning,” arXiv preprint arXiv:1706.03471 (2017).
[9] Y. Nesterov, “A method for unconstrained convex minimization problem with the rate of convergence o (1/k^ 2),” in Doklady AN USSR, 269, 543–547 (1983).
[10] “Cardiovascular Diseases (cvds).” http://www.who.int/mediacentre/factsheets/fs317/en/ (2007). [Online; accessed 30-June-2017].
[11] C. Xu, L. Xu, Z. Gao, et al., “Direct delineation of myocardial infarction without contrast agents using a joint motion feature learning architecture,” Medical image analysis 50, 82–94 (2018).
[12] A. Mortazi, R. Karim, K. Rhode, et al., “Cardiacnet: Segmentation of left atrium and proximal pulmonary veins from mri using multi-view cnn,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 377–385, Springer (2017).
[13] A. Mortazi, N. Khosravan, D. A. Torigian, et al., “Weakly supervised segmentation by a deep geodesic prior,” in International Workshop on Machine Learning in Medical Imaging, 238–246, Springer (2019).
[14] O. Oktay, E. Ferrante, K. Kamnitsas, et al., “Anatomically constrained neural networks (acnns): application to cardiac image enhancement and segmentation,” IEEE transactions on medical imaging 37(2), 384–395 (2017).
[15] P. Bizopoulos and D. Koutsouris, “Deep learning in cardiology,” IEEE reviews in biomedical engineering 12, 168–193 (2018).
[16] X. Zhuang, L. Li, C. Payer, et al., “Evaluation of algorithms for multi-modality whole heart segmentation: An open-access grand challenge,” arXiv preprint arXiv:1902.07880 (2019).
[17] A. Srivastava, D. Jha, B. Aydogan, et al., “Multi-scale fusion methodologies for head and neck tumor segmentation,” arXiv preprint arXiv:2210.16704 (2022).
[18] N. K. Tomar, D. Jha, and U. Bagci, “Dilatedsegnet: A deep dilated segmentation network for polyp segmentation,” arXiv preprint arXiv:2210.13595 (2022).
[19] A. Srivastava, D. Jha, E. Keles, et al., “An efficient multi-scale fusion network for 3d organ at risk (oar) segmentation,” arXiv preprint arXiv:2208.07417 (2022).
[20] Z. Zhang and U. Bagci, “Dynamic linear transformer for 3d biomedical image segmentation,” arXiv preprint arXiv:2206.00771 (2022).
[21] U. Demir, Z. Zhang, B. Wang, et al., “Transformer based generative adversarial network for liver segmentation,” arXiv preprint arXiv:2205.10663 (2022).
[22] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv preprint arXiv:1511.00561 (2015).
[23] K. He, X. Zhang, S. Ren, et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778 (2016).
[24] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 234–241, Springer (2015).
[25] G. Huang, Z. Liu, K. Q. Weinberger, et al., “Densely connected convolutional networks,” arXiv preprint arXiv:1608.06993 (2016).
[26] S. Jégou, M. Drozdzal, D. Vazquez, et al., “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, 1175–1183, IEEE (2017).
[27] O. Bernard, A. Lalande, C. Zotti, et al., “Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: Is the problem solved?,” IEEE transactions on medical imaging 37(11), 2514–2525 (2018).

Biographies and photographs of the other authors are not available.