
1 University of California, Los Angeles, CA, USA ([email protected])
2 Chung-Ang University, Seoul, Korea

Small Lesion Segmentation in Brain MRIs with Subpixel Embedding

Alex Wong†,1    Allison Chen†,1    Yangchao Wu1    Safa Cicek1    Alexandre Tiard1    Byung-Woo Hong2    Stefano Soatto1

† denotes authors with equal contributions.
Abstract

We present a method to segment MRI scans of the human brain into ischemic stroke lesion and normal tissues. We propose a neural network architecture in the form of a standard encoder-decoder where predictions are guided by a spatial expansion embedding network. Our embedding network learns features that can resolve detailed structures in the brain without the need for high-resolution training images, which are often unavailable and expensive to acquire. In contrast, the encoder-decoder learns global structures by means of striding and max pooling. Our embedding network complements the encoder-decoder architecture by guiding the decoder with fine-grained details lost to spatial downsampling during the encoder stage. Unlike previous works, our decoder outputs at $2\times$ the input resolution, where a single pixel in the input resolution is predicted by four neighboring subpixels in our output. To obtain the output at the original scale, we propose a learnable downsampler (as opposed to hand-crafted ones, e.g. bilinear) that combines subpixel predictions. Our approach improves the baseline architecture by $\approx 11.7\%$ and achieves the state of the art on the ATLAS public benchmark dataset with a smaller memory footprint and faster runtime than the best competing method. Our source code has been made available at:
https://github.com/alexklwong/subpixel-embedding-segmentation.

1 Introduction

A stroke occurs when a lack of blood flow prevents brain tissue from receiving adequate oxygen and nutrients. This condition affects over 795,000 people annually [31]. The severity of the outcome, including disability and paralysis, depends on the location and intensity of the stroke, as well as the time of diagnosis [2, 33]. Preserving cognitive and motor functions, therefore, hinges on localizing stroke lesions quickly and precisely. However, doing so manually requires expert knowledge, is time consuming, and is ultimately subjective [13, 15].

We focus on automatically segmenting ischemic stroke lesions, which account for 87% of all strokes [31], from T1-weighted anatomical magnetic resonance imaging (MRI) brain scans. These lesions are characterized by high variability in location, shape, and size – the latter two are problematic for conventional convolutional neural networks (CNNs), where precision of irregularly shaped lesion boundaries and recall of small lesions are critical measures of success. Due to the aggressive spatial downsampling (i.e. max pooling, strided convolutions) customary in CNNs, details of local structures are lost in the process. Yet, spatial downsampling is necessary for obtaining a global representation of the input while using fixed-size filters with limited receptive fields. The outcome is segmentations with ambiguous boundaries between lesion and normal tissues and missed lesions that occupy a small number of voxels in the MRIs.

We propose to retain small local structures by learning an embedding that maps the input to high dimensional feature maps at twice the input resolution. Unlike the typical CNN, we do not perform lossy downsampling on this representation; hence, the embedding preserves local structures, but lacks global context. When combined with a standard encoder-decoder, e.g. U-Net [22], the embedding complements the encoder-decoder by supplying the decoder with fine-grained detail to guide segmentation. Our network also outputs at twice the resolution of the input, representing each element in the input with a $2\times 2$ neighborhood of predictions. The final output is obtained by combining the four predictions (akin to an ensemble) as a weighted sum where the contribution of each prediction is learned from the data. Our design not only enables the network to produce robust segmentations but also to localize small lesions (Fig. 3).

Our contributions include (i) an embedding function that preserves fine-grained details of the input by mapping it to larger spatial dimensions, (ii) a neural network architecture that leverages the complementary strengths of the proposed embedding and an encoder-decoder to produce predictions at twice the input resolution, and (iii) a learnable downsampler that combines local predictions in an ensemble fashion to yield robust segmentations at the input resolution. Our approach improves the baseline U-Net architecture by $\approx 11.7\%$ and achieves the state of the art on the ATLAS [13, 14] dataset with lower computational burden than the best competing method.

2 Related Work

Lesion Segmentation. Early works [6] aggregated classification results for the center pixel of patches sampled from an image. However, [6] lacked global context, so [24] addressed this with multi-stage cascaded hierarchical models. More recent works build upon the U-Net [22], a 2D fully-convolutional network with skip connections and up-convolutions. For example, [16] used a Dual Path Network [5] encoder while [29] leveraged dilated convolutions to inexpensively increase receptive fields. Furthermore, [1] fused the U-Net with other high-performing modules, the BConvLSTM [27] and the SENet [10], and [21] introduced X-blocks to the U-Net, leveraging depthwise separable convolutions to reduce computational load. [34] used skip connections between successive encoder resolutions to prevent the loss of features and ConvLSTM [26] modules to maintain localization.

Recent works also leveraged 3D architectural backbones to improve localization. [35] performed 3D convolutions on a subsection of the scan and fused the results with 2D convolutions. [11] proposed an attention gate to combine 2D segmentations along the axial, sagittal, and coronal planes into a 3D volume. However, these works require significantly larger memory footprints, and 3D convolutions are computationally expensive – limiting the models’ practicality. We note that while conventional architectures perform well globally (i.e. recovering the coarse shape of lesions), they struggle to segment small lesions that blend into the background.

Super-Resolution. There is an abundance of works on natural image super-resolution [7, 8, 25, 28, 32] and a growing number in medical imaging. [23] proposed to map MRI images from low to high resolution with an overcomplete dictionary. [19] leveraged SRCNN [7] to super-resolve 2D MRI images and fused them to obtain a 3D volume. [20] handled arbitrary scaling factors with a 3D architecture for multi-modal 3D data. However, these works require low- and high-resolution image pairs for training and are limited to the super-resolution task, while our method does not rely on higher resolution ground truth. More recently, [30] introduced Kite-Net, an upsampling encoder that outputs a latent at $8\times$ resolution, followed by a max-pooling decoder to downsample back to the original resolution. Kite-Net is used in parallel with a U-Net for lesion segmentation. Our approach draws inspiration from super-resolution and latent over-representations as means to retain local structures that are often lost in spatial downsampling. However, unlike [30], we avoid downsampling the latent with pooling (which discards information), and instead employ lossless space-to-depth and depth-to-space [25] operations to retain fine-grained details. Furthermore, we propose to learn a subpixel embedding at $2\times$ the original resolution to guide our segmentation, which uses a much smaller memory footprint than [30]. We show that our approach can capture small lesions that are missed by [21, 22, 30, 34, 35].

3 Method

We propose a method to partition a 3D MRI volume $X\in\mathbb{R}^{C\times H\times W}$ into lesion (positive, 1) and normal (negative, 0) classes. Our method takes, as input, a 3D slice of $c$ consecutive 2D images $x\in\mathbb{R}^{c\times H\times W}$ ($c$ is an odd integer) from $X$ and predicts the binary segmentation for the image $\bar{x}\in\mathbb{R}^{1\times H\times W}$, the $\frac{c+1}{2}$-th image of $x$. In other words, $x$ is a sliding window of $c$ images centered at a target image $\bar{x}$. To avoid sampling out of bounds, we perform mean padding of size $\frac{c-1}{2}\times H\times W$ on both sides of $X$ before sampling $x$ (see Sec. 1 of Supp. Mat. for more details). To segment a single image $\bar{x}$, we propose to learn a deep neural network $f_{\omega}$, parameterized by $\omega$, where $f:\mathbb{R}^{c\times H\times W}\mapsto[0,1]^{1\times H\times W}$ is a function that takes the 3D slice $x$ as input and outputs the sigmoid response $f_{\omega}(x)$, a confidence map corresponding to lesions in $\bar{x}$. To obtain the binary segmentation of $X$, we aggregate our predictions by running $f_{\omega}$ for all $x$ and setting any response greater than a threshold of 0.5 to the lesion class. We note that our method can be extended to multi-class segmentation simply by expanding our output to $[0,1]^{K\times H\times W}$ for $K$ classes and choosing the class with the highest response, i.e. $\arg\max f_{\omega}(\cdot)$, to yield the segmentation.
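The sliding-window inference described above can be summarized with a short sketch. The snippet below is a minimal illustration (not the authors' released implementation): `model` stands in for the trained network $f_{\omega}$, assumed to map a $(1, c, H, W)$ slab to a $(1, 1, H, W)$ confidence map, and the volume mean is used in place of the dataset mean for padding.

```python
import torch

def segment_volume(model, X, c=5, threshold=0.5):
    """Segment a 3D MRI volume X of shape (C, H, W) slice by slice."""
    assert c % 2 == 1, "c must be odd so that a center slice exists"
    pad = (c - 1) // 2
    # Mean-pad along the slice axis so every center slice has c neighbors
    # (the paper pads with the dataset mean; the volume mean is used here).
    mean_slab = X.mean() * torch.ones(pad, X.shape[1], X.shape[2])
    X_padded = torch.cat([mean_slab, X, mean_slab], dim=0)

    segmentation = torch.zeros_like(X)
    with torch.no_grad():
        for i in range(X.shape[0]):
            # c consecutive slices centered at slice i of the original volume
            x = X_padded[i:i + c].unsqueeze(0)       # (1, c, H, W)
            confidence = model(x)                    # (1, 1, H, W) sigmoid response
            segmentation[i] = (confidence[0, 0] > threshold).float()
    return segmentation
```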

Figure 1: Network architecture. SPiN is comprised of (i) a U-Net based encoder-decoder that produces subpixel predictions $f^{0}_{\omega}(x)$ at $2\times$ the input resolution, which are guided by (ii) a subpixel embedding that captures local structure. The final output $f_{\omega}(x)$ is obtained by combining local predictions in a $2\times 2$ neighborhood as a weighted sum based on the per-element contribution predicted by a (iii) learnable downsampler.

3.1 Network Architecture

Our network $f_{\omega}$ (Fig. 1) is composed of two modules: (i) an encoder-decoder (based on U-Net [22]) that outputs at $2\times$ the input resolution, e.g. $2H\times 2W$, whose predictions are guided by (ii) a network that maps the input $x$ to a high dimensional embedding space, also at twice the input resolution. The result is a confidence map comprised of “subpixel” predictions – the output class for each input pixel is represented by four predictions within a $2\times 2$ neighborhood. Rather than using hand-crafted downsampling techniques (e.g. bilinear, nearest neighbor) to obtain the output at the original ($1\times$) spatial resolution, we propose a learnable downsampler that predicts the weight, or contribution, of each subpixel prediction in the local region corresponding to a pixel at the $1\times$ resolution. For simplicity, we refer to our embedding function as a subpixel embedding and our overall architecture ($f_{\omega}$) as a subpixel network, or “SPiN” for short (Fig. 1).

Figure 2: Learnable Downsampler, Space-to-Depth and Depth-to-Space. (a): The Learnable Downsampler predicts the contribution $h(z)$ of each subpixel prediction in $f_{\omega}^{0}(x)$ by conditioning on $f_{\omega}^{0}(x)$ and the latent vector $g(x)$. Subpixel predictions $f_{\omega}^{0}(x)$ are rearranged to the resolution of the input using Space-to-Depth. The final output $f_{\omega}(x)$ is produced by taking the element-wise dot product between $h(z)$ and the reshaped $f_{\omega}^{0}(x)$. (b) Space-to-Depth reduces resolution by rearranging elements from the spatial dimensions into the channel dimension, where each $2\times 2$ neighborhood is reshaped to a 4-element vector. Depth-to-Space conversely performs spatial expansion by rearranging elements from the channel dimension to the height and width dimensions.

Subpixel Embedding consists of feature extraction and spatial expansion phases. Feature extraction is performed by two ResNet blocks [9] with 16 filters per layer; we use a stride of 1 and zero-padded edges to minimize spatial reduction. The extracted $16\times H\times W$ feature maps are fed to a depth-to-space module [25] that rearranges elements from the channel dimension to the height and width dimensions (see Fig. 2-(b)). The resulting set of $4\times 2H\times 2W$ feature maps with twice the spatial resolution then undergoes a $1\times 1$ and a $3\times 3$ convolution layer, with 8 filters each. The resulting $8\times 2H\times 2W$ high dimensional feature maps, produced by our subpixel embedding function, resolve fine local details by increasing the feature map resolution and thus representing the information at each pixel location with four “subpixel” feature vectors.
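A minimal sketch of this embedding branch in PyTorch is given below. The residual block design, the activations, and the use of `nn.PixelShuffle` for depth-to-space are assumptions where the text does not pin down the exact layers; the released code may differ.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Simple residual block with stride 1 and zero padding (an assumption)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1)
        self.proj = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.act(self.conv1(x))
        out = self.conv2(out)
        return self.act(out + self.proj(x))

class SubpixelEmbedding(nn.Module):
    """Maps a (c, H, W) input to an (8, 2H, 2W) embedding."""
    def __init__(self, in_ch):
        super().__init__()
        self.features = nn.Sequential(ResBlock(in_ch, 16), ResBlock(16, 16))
        self.depth_to_space = nn.PixelShuffle(2)   # 16 x H x W -> 4 x 2H x 2W
        self.refine = nn.Sequential(
            nn.Conv2d(4, 8, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(8, 8, kernel_size=3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.refine(self.depth_to_space(self.features(x)))

# Example: SubpixelEmbedding(in_ch=5)(torch.randn(1, 5, 197, 233)).shape
# -> torch.Size([1, 8, 394, 466])
```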

When used as skip connections, these embeddings complement the standard U-Net architecture, which obtains a global representation of the input by spatial downsampling (striding and max pooling) and naturally discards local detail. Hence, we propose to inject these embeddings into the decoder via feature concatenation at the original ($1\times$) resolution and at the $2\times$ output resolution. To reduce the height and width dimensions of the embeddings to match the feature maps at the $1\times$ resolution, we propose a space-to-depth module, which performs the inverse operation of depth-to-space (see Fig. 2-(b)), yielding $32\times H\times W$ feature maps. Unlike striding and pooling, the space-to-depth operation is information preserving as it rearranges feature vectors from the height and width dimensions to the channel dimension. The result is fed through a $3\times 3$ convolutional layer with 8 filters and concatenated with the feature maps of the decoder at the $1\times$ resolution. Similarly, the embeddings at $2\times$ resolution undergo a separate $3\times 3$ convolution to yield the output resolution guidance before being concatenated with their corresponding feature maps in the decoder. Finally, the $2\times$ decoder output $f_{\omega}^{0}(x)\in[0,1]^{1\times 2H\times 2W}$ is produced by convolving a single $3\times 3$ filter over the resulting latent vector $g(x)\in\mathbb{R}^{24\times 2H\times 2W}$. We use subpixel guidance (SPG) to refer to the process of learning and injecting the embedding as skip connections, which substantially helps with localizing small lesions missed by previous works [21, 22, 34, 35] (see Fig. 3). We note that SPG is lightweight and only uses 16K parameters.
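The sketch below illustrates how such guidance could be formed from the embedding; the number of filters in the $2\times$ branch and the use of `nn.PixelUnshuffle` for space-to-depth are assumptions, and the U-Net decoder that consumes the two guidance tensors is omitted.

```python
import torch
import torch.nn as nn

class SubpixelGuidance(nn.Module):
    """Produces skip connections at 1x and 2x resolution from the embedding."""
    def __init__(self, embed_ch=8, guide_ch=8):
        super().__init__()
        # Space-to-depth: 8 x 2H x 2W -> 32 x H x W (lossless rearrangement)
        self.space_to_depth = nn.PixelUnshuffle(2)
        self.skip_1x = nn.Conv2d(embed_ch * 4, guide_ch, 3, padding=1)
        self.skip_2x = nn.Conv2d(embed_ch, guide_ch, 3, padding=1)

    def forward(self, embedding):
        guide_1x = self.skip_1x(self.space_to_depth(embedding))  # concatenated with 1x decoder features
        guide_2x = self.skip_2x(embedding)                       # concatenated with 2x decoder features
        return guide_1x, guide_2x
```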

Learnable Downsampler takes the concatenation $z=[g(x);f_{\omega}^{0}(x)]$ of the latent vector $g(x)$ and the $2\times$ resolution output $f_{\omega}^{0}(x)$ and predicts $h(z)$, where $h:\mathbb{R}^{25\times 2H\times 2W}\mapsto[0,1]^{4\times H\times W}$. In other words, $h(z)$ is a set of $4\times H\times W$ values that determine the contribution of each subpixel prediction in a $2\times 2$ neighborhood of $f_{\omega}^{0}(x)$. To achieve this, we first perform space-to-depth on $z$ to rearrange each $2\times 2$ neighborhood into a 4-element vector. This is followed by two $3\times 3$ convolutions with 16 filters and a $1\times 1$ convolution with 4 filters. $h(z)$ is the softmax response of the result along the channel dimension.

To obtain the final output $f_{\omega}(x)$, we utilize space-to-depth to rearrange $f_{\omega}^{0}(x)$ into the shape of $4\times H\times W$ (to match the shape of $h(z)$) and take its element-wise dot product with $h(z)$. With an abuse of notation, $f_{\omega}(x)=f_{\omega}^{0}(x)\cdot h(z)$. Because $h(z)$ is conditioned on the latent vector $g(x)$ of the input, the predicted weights respect lesion boundaries to yield detailed segmentations. This is unlike bilinear or nearest-neighbor downsampling, where weights are predetermined and independent of the input. We note that our learnable downsampler is also lightweight and only consists of 11K parameters.
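A minimal sketch of the learnable downsampler and the weighted combination is shown below, assuming $g(x)$ has 24 channels and $f_{\omega}^{0}(x)$ has 1 channel, both at $2H\times 2W$; the intermediate activations are assumptions.

```python
import torch
import torch.nn as nn

class LearnableDownsampler(nn.Module):
    """Predicts per-subpixel weights h(z) and combines the 2x2 predictions."""
    def __init__(self, in_ch=25):
        super().__init__()
        self.space_to_depth = nn.PixelUnshuffle(2)   # (C, 2H, 2W) -> (4C, H, W)
        self.weights = nn.Sequential(
            nn.Conv2d(in_ch * 4, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 4, 1),
            nn.Softmax(dim=1))                        # h(z) sums to 1 over each 2x2 neighborhood

    def forward(self, latent, subpixel_pred):
        z = torch.cat([latent, subpixel_pred], dim=1)      # 25 x 2H x 2W
        h = self.weights(self.space_to_depth(z))           # 4 x H x W
        pred = self.space_to_depth(subpixel_pred)          # 4 x H x W
        # Weighted sum of the four subpixel predictions per output pixel
        return (h * pred).sum(dim=1, keepdim=True)         # 1 x H x W
```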

3.2 Loss Function

We assume a training set $\{(x^{(n)},\bar{y}^{(n)})\}_{n=1}^{N}$, where $\bar{y}^{(n)}$ is the ground truth corresponding to $\bar{x}^{(n)}$, the image located at the center of $x^{(n)}$. To train SPiN, we minimize the standard binary cross entropy loss,

$$\ell(y,\bar{y})=\frac{1}{|\Omega|}\sum_{u\in\Omega}-\big(\bar{y}(u)\log y(u)+(1-\bar{y}(u))\log(1-y(u))\big), \qquad (1)$$

where $\Omega\subset\mathbb{R}^{2}$ denotes the spatial image domain, $u$ a pixel coordinate, and $y=f_{\omega}(x)$ the network output. The loss over the training set of $N$ samples reads

$$L(\omega)=\frac{1}{N}\sum_{n=1}^{N}\ell\big(f_{\omega}(x^{(n)}),\bar{y}^{(n)}\big). \qquad (2)$$
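In PyTorch, Eqns. (1) and (2) amount to the standard per-pixel binary cross entropy averaged over pixels and samples; a minimal sketch (assuming $y=f_{\omega}(x)$ is already a sigmoid response and the ground truth is a float tensor in $\{0,1\}$) is:

```python
import torch
import torch.nn.functional as F

def spin_loss(y, y_bar, eps=1e-7):
    """y, y_bar: tensors of shape (N, 1, H, W); returns L(w) from Eqn. (2)."""
    y = y.clamp(eps, 1.0 - eps)  # guard against log(0)
    return F.binary_cross_entropy(y, y_bar, reduction='mean')
```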

We note that previous works [34, 35] used soft Dice loss (an approximation of the true Dice score) to counter the class imbalance between normal and lesion tissues, characteristic in the lesion segmentation problem. However, a minimizer of cross entropy equivalently minimizes Dice, and empirically, we found that directly minimizing cross entropy yields better performance for our model. We hypothesize that our SPG allows small lesions to be recovered more easily, making our method more conducive to minimizing cross entropy, which is not prone to the noisy training signal inherent in soft Dice. We demonstrate this in row 7 of Table 4 in our ablation studies. Also, we note that our loss can be easily extended for multi-class classification to accommodate multiple lesion categories.
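For reference, a commonly used form of the soft Dice loss mentioned above is sketched below; the smoothing constant and reduction are assumptions, and [34, 35] may use slightly different variants.

```python
import torch

def soft_dice_loss(y, y_bar, smooth=1.0):
    """y: predicted confidences, y_bar: binary ground truth, both (N, 1, H, W)."""
    intersection = (y * y_bar).sum(dim=(1, 2, 3))
    union = y.sum(dim=(1, 2, 3)) + y_bar.sum(dim=(1, 2, 3))
    dice = (2.0 * intersection + smooth) / (union + smooth)
    return (1.0 - dice).mean()
```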

4 Experiments and Results

We demonstrate our method on the Anatomical Tracings of Lesions After Stroke (ATLAS) MRI dataset [13, 14], using the metrics defined in Table 1. ATLAS contains 304 T1-weighted MRI scans of stroke patients with corresponding lesion annotations. The data is collected from 11 research sites worldwide, manually annotated, and post-processed (i.e. smoothing and defacing for privacy), leaving 239 patient scans with 189 2D images ($197\times 233$ resolution) each. Since no official data split is provided by [13], previous works [21, 34, 35] evaluated their methods using $k$-fold cross validation on randomly sampled data splits. However, the value of $k$ and the samples within each split varied across works. Due to this lack of consistency, the reported results are not directly comparable. Thus, we propose a training (212 patients) and a held-out testing (27 patients) split to standardize the evaluation protocol for more rigorous comparisons. We provide quantitative comparisons against [21, 22, 30, 34, 35] on the proposed training and testing split in Table 2. We also show qualitative (Fig. 3) and quantitative (Table 3) comparisons on segmenting small lesions using a subset of the test set: 490 images containing only lesions smaller than 100 pixels (0.2% of the image). All reported results for previous works are obtained using their training procedures and open-sourced code. We also provide details on our training and testing split in Sec. 2 of Supp. Mat. and further $k$-fold cross validation comparisons in Sec. 3 of Supp. Mat.

Implementation details. Our model is implemented in PyTorch [18] and optimized using Adam [12]. We used an initial learning rate of $3\times 10^{-4}$, decreased it to $1\times 10^{-4}$ after 400 epochs, and to $5\times 10^{-5}$ after 1400 epochs, for a total of 1600 epochs. We choose $c=5$ for the number of images in the input $x$. During training, $\bar{x}$ and its corresponding $x$ are randomly sampled from $X$. Training takes $\approx 8$ hours on an Nvidia GTX 1080 GPU, and inference takes $\approx 11$ ms per 2D image. For data augmentation, we randomly (i) perform horizontal and vertical flips, (ii) rotate between -30 and 30 degrees, and (iii) add zero-mean Gaussian noise with standard deviation of $1\times 10^{-2}$ to training samples. We perform augmentation with a probability of 1 for 1400 epochs and decrease it to 0.5 thereafter so that training samples are closer to the true distribution of the dataset.
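A sketch of this optimization schedule using standard PyTorch utilities is shown below; `model` stands in for an instance of SPiN (a placeholder module is used here) and the training loop body is omitted.

```python
import torch
import torch.nn as nn

def learning_rate(epoch):
    # 3e-4 for epochs [0, 400), 1e-4 for [400, 1400), 5e-5 for [1400, 1600)
    if epoch < 400:
        return 3e-4
    if epoch < 1400:
        return 1e-4
    return 5e-5

model = nn.Conv2d(5, 1, 3, padding=1)  # placeholder standing in for SPiN
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate(0))

for epoch in range(1600):
    for group in optimizer.param_groups:
        group['lr'] = learning_rate(epoch)
    # ... sample (x, y_bar), forward pass, spin_loss, backward, optimizer.step() ...
```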

Metric       IOU                  DSC                        Precision        Recall
Definition   TP / (TP+FN+FP)      2×TP / (2×TP+FN+FP)        TP / (TP+FP)     TP / (TP+FN)
Table 1: Evaluation metrics. IOU denotes Intersection Over Union, and DSC denotes Dice similarity coefficient. TP, FN and FP correspond to true positive, false negative and false positive respectively.
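A minimal sketch of how the metrics in Table 1 can be computed from binary prediction and ground truth masks is given below (not the evaluation script released with the paper; it assumes at least one positive pixel in both the prediction and the ground truth).

```python
import numpy as np

def segmentation_metrics(pred, target):
    """pred, target: binary numpy arrays of the same shape."""
    tp = float(np.logical_and(pred == 1, target == 1).sum())
    fp = float(np.logical_and(pred == 1, target == 0).sum())
    fn = float(np.logical_and(pred == 0, target == 1).sum())
    return {
        'IOU': tp / (tp + fn + fp),
        'DSC': 2 * tp / (2 * tp + fn + fp),
        'Precision': tp / (tp + fp),
        'Recall': tp / (tp + fn)}
```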
Figure 3: Qualitative results on ATLAS. Columns 2-8 show (zoomed in) head-to-head comparisons across all methods for highlighted areas in column 1. Row 1 demonstrates that SPiN outperforms existing works in capturing shape and boundary details in medium-sized, irregularly-shaped lesions. Furthermore, rows 2 and 3 demonstrate SPiN’s ability to localize small lesions that are missed by other models.
Method          DSC     IOU     Precision   Recall   Runtime (s)   Train Memory (GB)   Test Memory (GB)
U-Net [22] 0.584 0.432 0.674 0.558 1.375 2.291 1.181
D-UNet [35] 0.548 0.404 0.652 0.521 3.425 15.426 15.426
CLCI-Net [34] 0.599 0.469 0.741 0.536 8.860 7.853 7.853
KiU-Net [30] 0.524 0.387 0.703 0.459 1.05 23.566 1.555
X-Net [21] 0.639 0.495 0.746 0.588 5.046 11.839 11.839
SPiN (Ours) 0.703 0.556 0.806 0.654 2.145 3.273 0.803
Table 2: Quantitative comparison on ATLAS. SPiN outperforms all methods across all performance metrics. It is also one of the least computationally expensive models, i.e. smallest test time memory footprint, second in training memory usage, and third fastest in runtime per patient (189 images).

ATLAS test set. Table 2 shows that our approach outperforms competing methods [21, 22, 30, 34, 35] across all evaluation metrics. Specifically, we beat the best performing method, X-Net [21], by an average of $\approx 10.4\%$ with a 72.3% reduction in training memory and a 57.5% reduction in runtime during inference. Our approach also uses a smaller memory footprint, containing only $\approx 5.3$M parameters, compared to $\approx 15$M in [21]. Another key comparison is with KiU-Net, which learns a representation at $8\times$ the original input spatial resolution. Unlike us, KiU-Net [30] uses max pooling layers, which discard information, to reduce the size of its high resolution representation to the original ($1\times$) resolution. In contrast, we maintain the $2\times$ resolution of our embedding until the output layer, which yields subpixel predictions that are aggregated by our learnable downsampler to the $1\times$ resolution. Admittedly, this comes at the cost of runtime – our method requires 2.145s per patient while KiU-Net [30] requires 1.05s. However, we outperform [30] by an average of 33.7% across all metrics and reduce test time memory by half. We show qualitative comparisons in row 1 of Fig. 3, where the segmentation produced by our approach better captures irregularly shaped lesions than those predicted by competing methods.

      Method       DSC       IOU       Precision       Recall
      U-Net [22]       0.368       0.225       0.440       0.316
      D-UNet [35]       0.265       0.180       0.377       0.264
      CLCI-Net [34]       0.246       0.178       0.662       0.215
      KiU-Net [30]       0.246       0.255       0.466       0.206
      X-Net [21]       0.306       0.213       0.546       0.268
      SPiN (Ours)       0.424       0.269       0.546       0.347
Table 3: Evaluation on small lesion subset. While [34] achieves the highest precision, we note they have the second lowest recall out of all methods – missing small lesions can negatively impact patient recovery. In contrast, our method ranks second in precision and first across all other metrics.
Method DSC IOU Precision Recall
Without SPG, LD (Baseline) 0.634 0.487 0.707 0.606
Without SPG 0.637 0.487 0.701 0.613
Replace SPG with additional convolutions 0.627 0.475 0.721 0.596
Replace SPG with bilinear upsampling 0.663 0.513 0.780 0.600
Replace SPG with nearest neighbor upsampling 0.660 0.513 0.762 0.626
Replace LD with downsampling 0.670 0.526 0.786 0.625
Full model with soft Dice loss 0.684 0.546 0.729 0.672
Full model 0.703 0.556 0.806 0.654
Table 4: Ablation study on ATLAS. Removing SPG and/or LD results in performance decrease (rows 1, 2, 6), and SPG cannot be substituted with more parameters or interpolation (rows 3-5). The best results are achieved by our full model (row 8).

Small lesion segmentation. Here, we consider the task of segmenting lesions that occupy fewer than 100 pixels or 0.2% of the image. Due to the challenging nature of the task, we observe an expected drop in performance across all methods (trained on the proposed split) when segmenting small lesions (Table 3), as compared to doing so for all lesion sizes (Table 2). However, we still outperform all competing methods – by even larger margins than on the full test set. This shows that competing methods, while able to localize large and medium sized lesions, actually perform poorly on small lesions. With the exception of precision, where we tie for second with X-Net [21], we rank first in all other metrics. We note that while CLCI-Net [34] has the highest precision, it also achieved second lowest recall, meaning that it misses many small lesions, which is critical to clinical prognosis and thus patient recovery. This is also reflected in DSC and IOU where we outperform [34] by 72% and 51%, respectively. Qualitatively, rows 2 and 3 in Fig. 3 show that our method successfully localized small lesions that [21, 22, 30, 34, 35] missed entirely.

Ablation studies. Table 4 shows the effect of each of our contributions to the architectural design. Row 1 shows that our baseline, a U-Net [22] based encoder-decoder, performs significantly worse (by 11.7%) than the proposed approach because it lacks fine local details from SPG and uses bilinear downsampling instead of a learnable downsampler (LD). Including LD alone, but not SPG (row 2), provides no improvement, as the network only learns a coarse global representation and is still missing details lost during spatial downsampling.

In row 3, we show that solely increasing the number of parameters (i.e. adding ResNet blocks [9] to the baseline) brings no improvement, which suggests that the performance boost is not a result of a larger network. In fact, SPG and the learnable downsampler increase the model size only marginally, as together they add just 27K parameters. Rows 4 and 5 show that using hand-crafted $2\times$ resolution images (from bilinear or nearest neighbor upsampling) does provide some gain. In these experiments, we replace SPG with different interpolation methods, and the higher resolution images undergo $3\times 3$ convolutions before being passed as skip connections to the decoder. However, because the $2\times$ representation is not learned, as it is with SPG, the result is still $\approx 6\%$ worse than our full model. Our learnable downsampler (LD) contributes 4.4% to our performance (row 6), as removing LD and replacing it with bilinear interpolation smooths lesion boundaries, resulting in a loss of detail. Finally, we justify the use of cross entropy for our loss function; row 7 demonstrates that minimizing a soft Dice loss, as in [34, 35], results in worse performance. The best performance is achieved with our full model using SPG and LD, and minimizing cross entropy (row 8).

5 Discussion

We propose SPiN, a network architecture that learns a spatially increasing embedding which, when used as guidance for an encoder-decoder network, helps ensure that small structures are not lost through spatial downsampling in the encoder. We note that our embedding does not create extra spatial information (data processing inequality), but serves as a means of better characterizing local regions for the downstream segmentation task. While we outperform existing works and improve on small lesion segmentation, our method does cost more memory and compute than the baseline. However, the extra cost is within reason (1 GB of memory for training and $\approx 0.7$s in runtime) and does not limit applicability. Despite the improved segmentation performance, we note that there is still room for improvement, especially for small lesions. The highest recall of 0.347 (Table 3), achieved by our model, is admittedly low compared to recall on the full dataset, implying that many small lesions still pass undetected. We note that this is one of the first works to study subpixel architectures for lesion segmentation, and we hope that our encouraging results will motivate further exploration in this direction.

Acknowledgements. This work was supported by NIH-NEI 5R01EY029689 and R01EY030595, ARO W911NF-17-1-0304, NRF-2017R1A2B4006023 and IITP-2021-0-01341, AI Graduate School (CAU) in Korea.

References

  • [1] Asadi-Aghbolaghi, M., Azad, R., Fathy, M., Escalera, S.: Multi-level context gating of embedded collective knowledge for medical image segmentation. arXiv preprint arXiv:2003.05056 (2020)
  • [2] ATLANTIS, T., et al.: Association of outcome with early stroke treatment: pooled analysis of atlantis, ecass, and ninds rt-pa stroke trials. The Lancet 363(9411), 768–774 (2004)
  • [3] Beaulieu, C., De Crespigny, A., Tong, D.C., Moseley, M.E., Albers, G.W., Marks, M.P.: Longitudinal magnetic resonance imaging study of perfusion and diffusion in stroke: evolution of lesion volume and correlation with clinical outcome. Annals of Neurology: Official Journal of the American Neurological Association and the Child Neurology Society 46(4), 568–578 (1999)
  • [4] Brott, T., Marler, J.R., Olinger, C.P., Adams Jr, H.P., Tomsick, T., Barsan, W.G., Biller, J., Eberle, R., Hertzberg, V., Walker, M.: Measurements of acute cerebral infarction: lesion size by computed tomography. Stroke 20(7), 871–875 (1989)
  • [5] Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., Feng, J.: Dual path networks. arXiv preprint arXiv:1707.01629 (2017)
  • [6] Ciresan, D., Giusti, A., Gambardella, L., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. Advances in neural information processing systems 25, 2843–2851 (2012)
  • [7] Dong, C., Loy, C.C., He, K., Tang, X.: Learning a deep convolutional network for image super-resolution. In: European conference on computer vision. pp. 184–199. Springer (2014)
  • [8] Dong, C., Loy, C.C., He, K., Tang, X.: Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38(2), 295–307 (2015)
  • [9] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [10] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7132–7141 (2018)
  • [11] Hui, H., Zhang, X., Li, F., Mei, X., Guo, Y.: A partitioning-stacking prediction fusion network based on an improved attention u-net for stroke lesion segmentation. IEEE Access 8, 47419–47432 (2020)
  • [12] Kingma, D.P., Ba, J.L.: Adam: A method for stochastic optimization. In: ICLR: International Conference on Learning Representations. pp. 1–15 (2015)
  • [13] Liew, S.L., Anglin, J.M., Banks, N.W., Sondag, M., Ito, K.L., Kim, H., Chan, J., Ito, J., Jung, C., Khoshab, N., et al.: A large, open source dataset of stroke anatomical brain images and manual lesion segmentations. Scientific data 5, 180011 (2018)
  • [14] Liew, S.L., Anglin, J.M., Banks, N.W., Sondag, M., Ito, K.L., Kim, H., Chan, J., Ito, J., Jung, C., Lefebvre, S., et al.: The anatomical tracings of lesions after stroke (atlas) dataset-release 1.1. bioRxiv p. 179614 (2017)
  • [15] Maier, O., Menze, B.H., von der Gablentz, J., Häni, L., Heinrich, M.P., Liebrand, M., Winzeck, S., Basit, A., Bentley, P., Chen, L., et al.: Isles 2015-a public evaluation benchmark for ischemic stroke lesion segmentation from multispectral mri. Medical image analysis 35, 250–269 (2017)
  • [16] Manvel, A., Vladimir, K., Alexander, T., Dmitry, U.: Radiologist-level stroke classification on non-contrast ct scans with deep u-net. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 820–828. Springer (2019)
  • [17] Pantano, P., Caramia, F., Bozzao, L., Dieler, C., von Kummer, R.: Delayed increase in infarct volume after cerebral ischemia: correlations with thrombolytic treatment and clinical outcome. Stroke 30(3), 502–507 (1999)
  • [18] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, 8026–8037 (2019)
  • [19] Pham, C.H., Ducournau, A., Fablet, R., Rousseau, F.: Brain mri super-resolution using deep 3d convolutional networks. In: 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017). pp. 197–200. IEEE (2017)
  • [20] Pham, C.H., Tor-Díez, C., Meunier, H., Bednarek, N., Fablet, R., Passat, N., Rousseau, F.: Multiscale brain mri super-resolution using deep 3d convolutional networks. Computerized Medical Imaging and Graphics 77, 101647 (2019)
  • [21] Qi, K., Yang, H., Li, C., Liu, Z., Wang, M., Liu, Q., Wang, S.: X-net: Brain stroke lesion segmentation based on depthwise separable convolution and long-range dependencies. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 247–255. Springer (2019)
  • [22] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
  • [23] Rueda, A., Malpica, N., Romero, E.: Single-image super-resolution of brain mr images using overcomplete dictionaries. Medical image analysis 17(1), 113–132 (2013)
  • [24] Seyedhosseini, M., Sajjadi, M., Tasdizen, T.: Image segmentation with cascaded hierarchical models and logistic disjunctive normal networks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (December 2013)
  • [25] Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A.P., Bishop, R., Rueckert, D., Wang, Z.: Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1874–1883 (2016)
  • [26] Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c.: Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems 28, 802–810 (2015)
  • [27] Song, H., Wang, W., Zhao, S., Shen, J., Lam, K.M.: Pyramid dilated deeper convlstm for video salient object detection. In: Proceedings of the European conference on computer vision (ECCV). pp. 715–731 (2018)
  • [28] Tang, Z., Pan, B., Liu, E., Xu, X., Shi, T., Shi, Z.: Srda-net: Super-resolution domain adaptation networks for semantic segmentation. arXiv preprint arXiv:2005.06382 (2020)
  • [29] Tureckova, A., Rodríguez-Sánchez, A.J.: Isles challenge: U-shaped convolution neural network with dilated convolution for 3d stroke lesion segmentation. In: International MICCAI Brainlesion Workshop. pp. 319–327. Springer (2018)
  • [30] Valanarasu, J.M.J., Sindagi, V.A., Hacihaliloglu, I., Patel, V.M.: Kiu-net: Towards accurate segmentation of biomedical images using over-complete representations. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 363–373. Springer (2020)
  • [31] Virani, S.S., Alonso, A., Benjamin, E.J., Bittencourt, M.S., Callaway, C.W., Carson, A.P., Chamberlain, A.M., Chang, A.R., Cheng, S., Delling, F.N., et al.: Heart disease and stroke statistics—2020 update: a report from the american heart association. Circulation 141(9), e139–e596 (2020)
  • [32] Wang, L., Li, D., Zhu, Y., Tian, L., Shan, Y.: Dual super-resolution learning for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3774–3783 (2020)
  • [33] Van der Worp, H.B., van Gijn, J.: Acute ischemic stroke. New England Journal of Medicine 357(6), 572–579 (2007)
  • [34] Yang, H., Huang, W., Qi, K., Li, C., Liu, X., Wang, M., Zheng, H., Wang, S.: Clci-net: Cross-level fusion and context inference networks for lesion segmentation of chronic stroke. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 266–274. Springer (2019)
  • [35] Zhou, Y., Huang, W., Dong, P., Xia, Y., Wang, S.: D-unet: a dimension-fusion u shape network for chronic stroke lesion segmentation. IEEE/ACM transactions on computational biology and bioinformatics (2019)

Small Lesion Segmentation in Brain MRIs with Subpixel Embedding
SUPPLEMENTARY MATERIALS

Summary of content: In Sec. 0.A, we elaborate on how we select the input $x$ from the 3D MRI scan during the training and testing phases. In Sec. 0.B, we discuss details of our proposed training and testing split and show its data distribution with respect to lesion size. Sec. 0.C provides comparisons between all competing methods using the reported numbers from their papers. We note that these numbers are not directly comparable since each method used a different data split. In Sec. 0.D, we provide additional qualitative examples of our predictions on MRI images for the task of small lesion segmentation. Finally, we show the feature maps produced by our subpixel embedding in Sec. 0.E and discuss how they contribute to guiding our decoder in localizing small lesions and recovering irregularly shaped lesion boundaries.

Figure 4: Selecting slices. First, a 3D MRI scan $X\in\mathbb{R}^{C\times H\times W}$ is mean padded on either side along its channel dimension. The mean padding is represented by the gray boxes appended on either side of the grayscale MRI scan. The green boxes represent a selected volume $x\in\mathbb{R}^{c\times H\times W}$ consisting of $c$ 2D images centered at $\bar{x}$. In the leftmost figure, $\bar{x}$ is selected at one end of the 3D scan $X$, so $x$ contains mean padding images. We select $\bar{x}$ in a sliding window fashion until the rightmost figure, where $\bar{x}$ is chosen at the other end of $X$.

Appendix 0.A Selecting Slices from a 3D MRI

In Sec. 3 of the main text, we discussed how we select $c$ consecutive 2D images from the 3D MRI; here we elaborate on this. Recall that $c$ is an odd integer. First, each MRI $X\in\mathbb{R}^{C\times H\times W}$ is padded with the mean of the entire dataset $\frac{c-1}{2}$ times along the first dimension, $C$, parallel to the axial plane. The resulting tensor $X^{\prime}$ has a shape of $(C+c-1)\times H\times W$. Next, we select a subset of this volume, shown in green in Fig. 4, consisting of $c$ 2D images centered at the image $\bar{x}\in X^{\prime}$ such that the condition $\bar{x}\in X$ is also satisfied. Thus, $\bar{x}$ is a 2D image of the original MRI scan, and it is for $\bar{x}$ that we make our segmentation prediction. Besides $\bar{x}$, the volume is filled with neighboring images or with mean-padded values when no neighbors are available (i.e. near the edges of the 3D MRI scan), allowing us to utilize contextual information without reducing the amount of data. With an abuse of notation, and following the notation used in the main text, we refer to $X^{\prime}$ as $X$.

During training, $\bar{x}$ and its corresponding $x$ are randomly sampled from $X$. For training stability, we also increase the probability of sampling an $\bar{x}$ containing a lesion of any size to 95%. This is done to counteract the class imbalance where images without lesions greatly outnumber those with lesions. During test time, we iteratively select $\bar{x}$ and $x$ from $X$, starting from image 1 to $C$ for a total of $C$ images in $X$. For the size of $x$, we chose $c$ based on hyper-parameter tuning. We note that performance does not significantly improve if we choose $c$ to be larger than 5; however, choosing a large $c$ (such as up to the entire 3D volume, as in methods with 3D architectural backbones [11, 35]) increases memory usage and computation requirements, rendering the method impractical. We also note that choosing $c=1$ reduces performance by $\approx 5\%$ on average.
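A hedged sketch of this lesion-biased sampling of the center slice is shown below; the interface is hypothetical and only illustrates the 95% sampling rule described above.

```python
import numpy as np

def sample_center_index(Y, p_lesion=0.95, rng=np.random):
    """Y: ground truth volume of shape (C, H, W); returns the index of the center slice x_bar."""
    lesion_idx = np.where(Y.reshape(Y.shape[0], -1).any(axis=1))[0]
    if len(lesion_idx) > 0 and rng.rand() < p_lesion:
        return int(rng.choice(lesion_idx))   # slice containing at least one lesion pixel
    return int(rng.randint(Y.shape[0]))      # otherwise a uniformly random slice
```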

Figure 5: Distribution of lesions in the proposed training and testing sets for ATLAS. The number of images (samples) is shown in $\log$ scale. We were careful to select the testing set (right) so that it is representative of the training set (left) distribution. We also ensured that there are sufficient small lesions (fewer than 100 pixels) in the testing set to make the dataset challenging. The small lesion subset is represented by the second bar (1 - 99) in the testing set plot on the right.

Appendix 0.B Details on ATLAS Training and Testing Split

In Sec. 4 of the main text, we proposed a training and testing split for the ATLAS [13, 14] dataset consisting of 212 patients for training (40K images) and a held-out testing set of 27 patients (5.1K images). In Fig. 5, we show the distribution of lesion sizes for the training set (left) and the testing set (right). The data split was randomly chosen and post-processed to ensure that the testing set is representative of the training set data distribution. To make sure that the benchmark is challenging, we ensured that there are sufficient images with small lesions in the testing set. The proposed small lesion evaluation uses the subset of the testing set that contains only lesions with fewer than 100 pixels. Both the training and testing split and the small lesion subset will be released along with our open-sourced code and pretrained models upon publication. We cordially invite researchers to test on this benchmark in order to push the state of the art for small lesion segmentation in MRIs.

      Method       DSC       IOU       Precision       Recall
      U-Net [22]       0.475       0.358       0.586       0.471
      D-UNet [35]       0.535       -       0.633       0.524
      CLCINet [34]       0.581       -       0.649       0.581
      KiU-Net [30]       0.441       0.321       0.580       0.418
      X-Net [21]       0.487       0.372       0.600       0.475
      SPiN (Ours)       0.587       0.461       0.670       0.610
Table 5: Self-reported results on ATLAS. [34, 35] did not report IOU. Cross validation results for [22] were reported by [21]. Because [30] did not evaluate on ATLAS, we used the open-sourced code provided by the authors to train their model on our cross validation split, following the training procedure specified in their paper.

Appendix 0.C Comparison of Reported Results on ATLAS

As noted above and in Sec. 4 of the main text, there is no clear precedent for evaluating and comparing related works. Due to this lack of consistency, we proposed a training and testing split on which we trained and evaluated [21, 22, 30, 34, 35] in our main text. For fairness, we directly used the open-sourced code provided by the authors and trained their models on our proposed data split. The results in Tables 2 and 3 of the main text were obtained by training their models using the procedures specified in their papers. In addition, to account for random seeds affecting performance, we trained each of their models multiple times and selected the best performing checkpoint for evaluation.

For completeness, we also compare our method to the reported results of other works in Table 5. Our reported results were obtained using 6-fold cross validation to minimize the impact of overfitting or selection bias and to better measure generalizability. Although these results seem to demonstrate that our model outperforms all other works, we note that the reported numbers for each method were obtained from training on different data splits. Hence, we do not consider these numbers directly comparable and refer the reader to Tables 2 and 3 of the main text for more rigorous comparisons.

Figure 6: Qualitative results on small lesions in ATLAS. Here we further demonstrate SPiN’s performance specifically on small lesions (no larger than 100 pixels). In each $1\times 3$ panel, the leftmost column shows the entire patient scan with the ground truth overlaid, the middle column shows the zoomed-in binary ground truth in yellow, and the rightmost column shows SPiN’s zoomed-in softmax prediction. Our method is able to segment very small lesions that may otherwise be missed by other methods.

Appendix 0.D Small Lesion Prediction Visualization for SPiN

We present qualitative results of our model on additional small lesion test samples in Fig. 6. Identifying small lesions is a challenging task, as the lesion area is less than 0.2% of the image and requires the model to capture fine-grained details. As noted in the main text, accurately segmenting lesions as early as possible has a significant impact on patient recovery, yet most current works are unable to detect the smaller lesions¹. Here, we demonstrate that our model can segment lesions as small as a few pixels, which in the time-sensitive context of stroke lesions could significantly affect treatment and recovery.

¹ Early onset of stroke tends to yield smaller lesion sizes, which grow to maximum volume after 5 to 7 days [3, 4, 17]; hence, the inability to detect small lesions can be detrimental to patient recovery.

Figure 7: Feature maps from our subpixel embedding network. Column 1: sample MRI image and corresponding ground truth. Columns 2-4: feature maps produced by our embedding network. Despite feeding $x\in\mathbb{R}^{c\times H\times W}$ into our embedding network, it naturally learns to retain the structural details of $\bar{x}$, the target image to be segmented.

Appendix 0.E Features Maps from Subpixel Embedding

In Sec. 3 of the main text, we discussed the architecture of our subpixel embedding network used for guiding the output of the decoder – subpixel guidance (SPG). Here, we show examples of the feature maps produced by the learned embedding in Fig. 7. Our subpixel embedding retains, in its output high resolution feature maps, the anatomical structures present in the low resolution MRI image. We note that this is particularly interesting because our only form of supervision is the cross entropy loss (Eqns. 1, 2 of the main text), i.e. we do not use high resolution MRI images to constrain the latent representation of the subpixel embedding network, so the embedding does not necessarily need to learn shapes that resemble the image of interest. However, as seen in Fig. 7, despite the input being $x\in\mathbb{R}^{c\times H\times W}$, which contains multiple images, our embedding naturally learns to extract features that retain the structures of $\bar{x}$, the target image to be segmented. We hypothesize that the network is able to learn this without additional supervision because the target image is always located at the center of $x$.

As seen in Fig. 7, the feature maps can be treated as a high dimensional super-resolved version of the input target image, where small details are now represented by four times the number of elements (a $2\times 2$ neighborhood per pixel). When used as guidance for the decoder via skip connections, the local structures preserved by these feature maps help the network resolve fine-grained details in the image for the localization of small lesions and the recovery of irregular lesion boundaries.