GAN-based Super-Resolution and Segmentation of Retinal Layers in Optical Coherence Tomography Scans
Abstract
In this paper, we propose a Generative Adversarial Network (GAN) based solution for super-resolution and segmentation of optical coherence tomography (OCT) scans of the retinal layers. This work involves joint optimization of two objective functions, as the network learns to segment the retinal layers as well as to increase the image resolution from low to high by a factor of four. OCT has been identified as a non-invasive and inexpensive imaging modality for discovering potential biomarkers for Alzheimer's diagnosis and progress determination. Current hypotheses presume the thickness of the retinal layers, which are analyzable within OCT scans, to be an effective biomarker for the presence of Alzheimer's. As a logical first step, this work concentrates on the segmentation of these layers and superresolving them for higher clarity and accuracy. We employ a GAN-based approach with a view to semantically segmenting the OCT images as well as superresolving them. In this work, we also use dice loss as an additional reconstruction loss function to improve the performance of this joint task.
Index Terms— Optical coherence tomography, generative adversarial networks, super-resolution, semantic segmentation, retinal layers
1 Introduction
Optical Coherence Tomography (OCT) is a non-invasive imaging technology for projecting cross-sectional images of the internal micro-structure of human tissue in high resolution [schmitt1999optical]. The first OCT scan of the human retina was performed in 1993 [2], which ushered in an era of rapid development of non-invasive opto-medical diagnostic modalities enabling cross-sectional visualization of the internal structure of biological components [3], prominently the human retina. Nowadays, OCT is the primary modality for cross-sectional imaging of the human retina in high resolution.

Alzheimer's Disease (AD) is one of the most common forms of dementia, increasingly prominent among the elderly population. According to several clinical studies, the neurodegenerative process of Alzheimer's, propelled by the abnormal cerebral accumulation of Amyloid-beta and tau protein [4], may also affect the retina. These studies hypothesize the neuronal loss of retinal tissue as a possible biomarker for the presence of AD. These neuronal losses are reflected in Retinal Nerve Fiber Layer (RNFL) thickness [5], [6], macular thickness [7], and Ganglion Cell Layer (GCL) degeneration [8]. Currently, modalities such as Positron Emission Tomography and Magnetic Resonance Imaging, the standards for AD diagnosis [4], come with the added burden of being invasive. Thus, research is ongoing regarding the viability of OCT scans as an alternative, as this modality offers the benefits of being non-invasive, less time consuming, and cost effective. For further research in establishing this modality as a viable biomarker, segmentation of the retinal layers is the first significant step. Due to the difficulty of manually segmenting such images, which have a poor signal-to-noise ratio [9] because of the presence of micro-saccadic eye movements, it is imperative to develop a method of automatic segmentation. Another hindrance commonly faced is the near non-visibility of the layer boundaries, which motivates superresolving the images for improved clarity.

In this work, we have identified the goal as jointly superresolving and segmenting the OCT retinal scans. As a possible solution, we employ a Generative Adversarial Network (GAN) [10] with different generator architectures, and analyze the effect of a dice loss as an additional constraint and how its presence improves the performance.
2 Related Works
Biomedical image segmentation has been a demanding research topic for many years in the domain of computer vision. Since the advent of neural networks as proven methods for effective application in computer vision tasks [krizhevsky2012imagenet], there have been numerous developments targeting semantic segmentation of biomedical images. Most semantic segmentation algorithms follow an encoder-decoder based architecture, popularized by the Fully Convolutional Network (FCN) [long2015fully]. This architecture has two main components: the encoder downsizes the image while extracting features, whereas the decoder upsamples back to the original image size. One of the issues faced with FCN is that this successive downsampling and upsampling results in losing some semantic and spatial information. U-Net [ronneberger2015u] solved that issue by introducing skip connections between the encoder and decoder, which relay the spatial information from the encoder to the corresponding feature maps of the decoder. This model has been widely used in the domain of biomedical image segmentation, and has spawned several variations such as U-Net++, 3D-U-Net, and FC-DenseNet-Tiramisu [zhou2018unet++, cciccek20163d, jegou2017one] with varying degrees of performance. For the specific task of OCT segmentation, RelayNet [roy2017relaynet] follows the U-Net baseline and, to the best of our knowledge, provides the state-of-the-art performance.

Another long-standing challenge in computer vision is constructing high-resolution, photo-realistic images from their low-resolution counterparts. This strenuous task, aptly termed super-resolution, has been a subject of research for many years, even predating the advent of deep learning. Classical methods include various interpolation schemes such as nearest-neighbour, bi-cubic, or bi-linear interpolation. With the success of FCN, a similar baseline was followed to design the Super-Resolution Convolutional Neural Network (SR-CNN) [dong2015image]. Here the image is first upsampled through bi-cubic interpolation and fed through an FCN, resulting in a high-resolution output. The work in [ledig2017photo] continues SR-CNN, with residual blocks replacing the conventional convolution blocks. Using such a network as the generator architecture, GANs have also been used to reconstruct images in higher resolution [ledig2017photo].
GANs have been quite prominent in learning deep representations and modelling high-dimensional data. This type of generative modelling employs two networks trained in a competitive manner, one trained to synthesize new data and the other trained to classify real and synthesized data. Since their inception, GANs have gradually evolved and have been used for various tasks, including image processing and computer vision. Initially, GANs were trained with a noise sample from a particular distribution. Later, with the advent of conditional GANs, it became possible to capture even better representations by rendering both the generator and discriminator networks class conditional. GANs have shown good performance translating data from one domain to another [isola2017image], [zhu2017unpaired], making them appropriate for semantic segmentation.
In this work, the goal is to semantically segment as well as super resolve the segmentation in a joint manner using a GAN. Performance is then evaluated with various generator architectures, with and without dice loss as an additional constraint.
3 Data Acquisition and Pre-processing
For experimentation, OCT images of 45 patients were obtained from the Department of Ophthalmology & Visual Sciences, West Virginia University. Nineteen scans were captured from each patient utilizing the Spectralis OCT imaging platform by Heidelberg Engineering. These images, which totalled 855, were manually labelled by an expert in this domain. This work focuses on seven layers of the retina: the Internal Limiting Membrane (ILM), RNFL, GCL, Inner Plexiform Layer (IPL), Inner Nuclear Layer (INL), Outer Plexiform Layer (OPL), and Outer Nuclear Layer (ONL).
3.1 Dataset Preparation
In any deep learning task, a dataset of significant size is essential for better performance and for learning features more robustly. The data currently available to us is small, so we applied data augmentation techniques to synthetically enlarge the dataset: horizontal flipping, rotation (by 15 degrees), and spatial translation. Apart from these conventional augmentation methods, the dataset was subjected to a sliding crop window with 75% overlap at each sliding step, effectively increasing the dataset by a significant factor. The same augmentations were also applied to the ground truth labels. After cropping, each patch was of size 224x224.
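As a concrete illustration of the overlapping crop scheme, the following is a minimal sketch; the 224x224 patch size follows the label size used later, while the function name and the NumPy-style array interface are assumptions rather than the authors' implementation.

def sliding_crops(image, label, patch=224, overlap=0.75):
    # With a 224-pixel window and 75% overlap, the stride is 56 pixels,
    # so neighbouring crops share three quarters of their area.
    stride = int(patch * (1 - overlap))
    height, width = image.shape[:2]
    crops = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            crops.append((image[top:top + patch, left:left + patch],
                          label[top:top + patch, left:left + patch]))
    return crops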
The presence of speckle noise is a big hindrance in analysing and processing OCT scans. It corrupts the edges between the retinal layers, which makes delineating the layer boundaries difficult for a neural network model. To alleviate this issue, a median filter with a 3x3 window was used. On top of that, an unsharp masking technique was applied to make the boundaries more visible for the task at hand.
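A minimal sketch of this preprocessing step is shown below, assuming OpenCV and 8-bit grayscale B-scans; only the 3x3 median window comes from the text, while the Gaussian blur size and sharpening weights used for unsharp masking are assumptions.

import cv2

def preprocess(scan):
    # 3x3 median filter to suppress speckle noise.
    despeckled = cv2.medianBlur(scan, 3)
    # Unsharp masking: subtract a blurred copy so the weighted sum adds back
    # the high-frequency residual, emphasising the retinal layer boundaries.
    blurred = cv2.GaussianBlur(despeckled, (5, 5), 0)
    return cv2.addWeighted(despeckled, 1.5, blurred, -0.5, 0)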
The OCT images were all manually annotated by an expert in this domain for each of the 7 layers and the background. To obtain labels for the tasks of segmentation and super-resolution at the same time, we referred to the labels of size 224x224 as ground-truth target labels and down-sampled the OCT scans to the size of 56x56 as the input. The input images were fed to the generator, which generates segmented outputs of size 224x224, upscaled by a factor of 4, to be compared against the target labels.
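The pairing of low-resolution inputs with high-resolution targets can be sketched as follows; the choice of area interpolation for down-sampling is an assumption.

import cv2

def make_pair(scan, label, factor=4):
    # Down-sample the 224x224 pre-processed scan to a 56x56 network input;
    # the full-resolution 224x224 label map remains the ground-truth target.
    height, width = scan.shape[:2]
    low_res = cv2.resize(scan, (width // factor, height // factor),
                         interpolation=cv2.INTER_AREA)
    return low_res, label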
4 Methodology
The baseline architecture of a GAN consists of two competing neural networks, aptly named the generator and the discriminator. The purpose of the generator in this work is to produce superresolved segmented labels of the OCT input images, whereas the discriminator learns to differentiate between real ground truth labels and the generated ones. Fig. 2 shows the overall architecture of the GAN. The following subsections describe each component of the architecture in detail.
4.1 Generator
In this paper, three different generator architectures are tried and tested. All of them have the dual task of segmenting the input OCT images as well as superresolving them. Here, we have only tested superresolving the labels by a factor of 4.
4.1.1 ResNet with transposed convolution
The ResNet architecture was one of the most ground-breaking works in computer vision [he2016deep], introducing the concept of skip connections. ResNet is comprised of residual blocks, in which convolutional layers are stacked and the input and output of each block are added together. This sidestepping connection of input and output, appropriately named a residual skip connection, was introduced to combat the problem of vanishing gradients in networks with increasingly deeper layers. These residual blocks can contain any number of convolution operations; for example, in Fig. 3(a) we show a residual block where a single convolution is stacked between two convolutions, with the input to the block bypassed and added to its output. Connecting these blocks to one another, ResNets of varied sizes can be formed.
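A minimal PyTorch sketch of such a residual block follows the description of Fig. 3(a); the 1x1 outer convolutions mirror those mentioned for the up-sampling variant, while the 3x3 middle convolution, channel widths, and activation placement are assumptions.

import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    # A single convolution stacked between two 1x1 convolutions, with the
    # block input added back to its output via the identity skip path.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return F.relu(self.body(x) + x)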
As the task at hand requires superresolution along with segmentation, a module is needed to upsample the low-resolution feature maps to high resolution. To keep in line with the residual blocks, the transposed block depicted in Fig. 3(b) is used for the upsizing operation. In between the 1x1 convolutions, a 2x2 transposed convolution acts as the upsampler. The residual connection also goes through a transposed convolution to maintain spatial consistency.
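A corresponding sketch of the transposed-convolution residual block of Fig. 3(b) is given below; the 2x2 stride-2 transposed convolution between 1x1 convolutions and the transposed convolution on the skip path follow the text, while channel handling and activations are assumptions.

import torch.nn as nn

class UpResidualBlock(nn.Module):
    # The 2x2 stride-2 transposed convolution between two 1x1 convolutions
    # doubles the spatial size; the skip path is up-sampled by its own
    # transposed convolution so the residual addition stays shape-consistent.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(out_channels, out_channels, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=1),
        )
        self.skip = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=2, stride=2)

    def forward(self, x):
        return self.body(x) + self.skip(x)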
4.1.2 ResNet with sub-pixel convolution
In this generator architecture, we replace the transposed convolutions in the ResNet with the sub-pixel convolution popularized by [shi2016real]. It works as follows: given feature maps with dimensions C·r² x H x W, where C is the channel width and H and W are the spatial height and width, the sub-pixel convolution rearranges them into an output of size C x rH x rW, where r is the factor by which the image is being upscaled. We add a sub-pixel convolution block at the end of the residual blocks with an upscale factor of 4.
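A sketch of such a sub-pixel up-sampling head using PyTorch's PixelShuffle is shown below; the 3x3 kernel of the channel-expanding convolution is an assumption.

import torch.nn as nn

class SubPixelUpsample(nn.Module):
    # A convolution expands the channels by r^2, then PixelShuffle
    # rearranges (C*r^2, H, W) feature maps into (C, rH, rW).
    def __init__(self, in_channels, out_channels, r=4):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels * r * r,
                              kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)

    def forward(self, x):
        return self.shuffle(self.conv(x))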
4.1.3 U-Net
U-Net is a popular architecture for biomedical image segmentation. It follows a common encoder-decoder design, where the first part of the network downsamples the image to an abstract representation while extracting features and the latter part upsamples it back to the original image size. The striking difference between a typical autoencoder and U-Net is the presence of skip connections in the latter. These connections bridge the encoder and decoder portions, retrieving feature representations extracted during the encoding operation. Experiments have shown the skip connections to provide a stark improvement over traditional autoencoders. For the task of superresolution, we try two different approaches. In the first approach, we add two transposed convolution blocks at the end of the U-Net architecture, transposed convolution being the convolutional operation used to upscale images. Each transposed convolution block has kernels with a stride of 2, so that each block upscales by a factor of 2. In the second approach, we use the aforementioned sub-pixel convolution method for upsampling with an upscale factor of 4. Fig. 4 shows both types of U-Net architecture used here.
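The two up-sampling heads appended to the U-Net output can be sketched as follows; the kernel sizes of the transposed convolutions and the treatment of the class channels are assumptions.

import torch.nn as nn

def sr_head(channels, num_classes, mode="transposed"):
    # "transposed": two stride-2 transposed-convolution blocks, each x2.
    # "subpixel":   a single sub-pixel convolution block with r = 4.
    if mode == "transposed":
        return nn.Sequential(
            nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, num_classes, kernel_size=1),
        )
    return nn.Sequential(
        nn.Conv2d(channels, num_classes * 16, kernel_size=3, padding=1),
        nn.PixelShuffle(4),
    )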
4.2 Discriminator
To differentiate real high-resolution labels from the generated ones, a discriminator network is needed. The network used here is a PatchGAN classifier [isola2017image]. The input to the discriminator is either the generated superresolved labels G(X) or the ground truth superresolved labels Y. The network consists of several convolutional blocks which successively decrease the spatial size of the input to patches, in order to classify the said input as actual ground truth labels or generated ones. For our experiment, the patch size was chosen as .
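A minimal sketch of a PatchGAN-style discriminator is given below; the number of blocks, the channel widths, and the 8-channel input (7 layers plus background as one-hot label maps) are assumptions.

import torch.nn as nn

class PatchDiscriminator(nn.Module):
    # Stride-2 convolution blocks successively shrink the input; the final
    # one-channel map scores each receptive-field patch as real or generated.
    def __init__(self, in_channels=8, base=64):
        super().__init__()
        layers, channels = [], in_channels
        for out_channels in (base, base * 2, base * 4):
            layers += [nn.Conv2d(channels, out_channels, kernel_size=4,
                                 stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            channels = out_channels
        layers.append(nn.Conv2d(channels, 1, kernel_size=4, stride=1, padding=1))
        self.net = nn.Sequential(*layers)

    def forward(self, labels):
        return self.net(labels)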
4.3 Loss Function
In this work we have opted for three different loss functions for training: the adversarial loss, the generator reconstruction loss, and a dice loss. The combination of all three contributes to the backpropagation and update of the model weights.
4.3.1 Adversarial Loss
The adversarial loss is applied to both the generator G and the discriminator D. If the input to the generator is X, then the adversarial loss for the generator is
$$\mathcal{L}_{adv}(G) = -\,\mathbb{E}_{X}\!\left[\log D\!\left(G(X)\right)\right] \tag{1}$$
and for training the discriminator, the loss function is
$$\mathcal{L}_{adv}(D) = -\,\mathbb{E}_{Y}\!\left[\log D(Y)\right] - \mathbb{E}_{X}\!\left[\log\!\left(1 - D\!\left(G(X)\right)\right)\right] \tag{2}$$
where Y denotes the actual ground truth label. The adversarial loss for the discriminator has two terms, which essentially compete against each other. The first term trains the discriminator to recognize real ground truth labels, and the second term trains it to detect the generated ones. As training goes on, one of the loss terms starts dominating the other, signifying whether the discriminator is learning to distinguish the real labels from the generated ones.
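A sketch of how Eqs. (1) and (2) translate into code, assuming the discriminator outputs raw patch logits; the use of binary cross-entropy with logits is an assumption about the implementation.

import torch
import torch.nn.functional as F

def adversarial_losses(d_real, d_fake):
    # Eq. (1): the generator is rewarded when D scores G(X) as real.
    loss_g = F.binary_cross_entropy_with_logits(
        d_fake, torch.ones_like(d_fake))
    # Eq. (2): the discriminator learns to accept Y and reject G(X).
    # In practice d_fake is computed from a detached G(X) when updating D.
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    return loss_g, loss_d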
4.3.2 Reconstruction Loss
Along with the adversarial loss, another loss function is used to train the generator, termed the reconstruction loss. This loss function is the L1 loss measured between the generated output G(X) and the ground truth label Y. The equation is
$$\mathcal{L}_{rec}(G) = \mathbb{E}_{X,Y}\!\left[\left\lVert Y - G(X)\right\rVert_{1}\right] \tag{3}$$
This loss function helps the generator synthesize output conforming to the ground truth label. The L1 loss is a suitable measure as it calculates the L1 distance between the ground truth and generated labels, appropriately training the generator for the desired task.
4.3.3 Dice Loss
This loss function originates from the semantic segmentation metric called the dice coefficient. The dice coefficient takes values in the range [0, 1], with higher values demonstrating better segmentation. Taking the additive inverse of this metric gives the dice loss [milletari2016v], written as
$$\mathcal{L}_{dice} = -\,\frac{2\sum_{i}^{N} p_{i}\,g_{i}}{\sum_{i}^{N} p_{i}^{2} + \sum_{i}^{N} g_{i}^{2}} \tag{4}$$
where $p_{i}$ denotes the generated label probabilities, $g_{i}$ the ground truth labels, and $N$ the total number of pixels. The task of the network is to minimize this function so that the generator can successfully segment the images, which results in a minimal dice loss when calculated against the ground truths. This loss function acts as an additional reconstruction loss to further emphasize and improve the quality of the generator output. This form of dice loss can be differentiated, yielding the gradient
$$\frac{\partial \mathcal{L}_{dice}}{\partial p_{j}} = -2\left[\frac{g_{j}\left(\sum_{i}^{N} p_{i}^{2} + \sum_{i}^{N} g_{i}^{2}\right) - 2\,p_{j}\sum_{i}^{N} p_{i}\,g_{i}}{\left(\sum_{i}^{N} p_{i}^{2} + \sum_{i}^{N} g_{i}^{2}\right)^{2}}\right] \tag{5}$$
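A differentiable implementation of the dice loss of Eq. (4) can be sketched as follows, assuming soft per-class probabilities and one-hot targets; the epsilon term guarding against an empty denominator is an addition.

import torch

def dice_loss(pred, target, eps=1e-6):
    # pred: soft class probabilities; target: one-hot ground truth labels.
    p = pred.reshape(-1)
    g = target.reshape(-1)
    intersection = (p * g).sum()
    dice = (2.0 * intersection + eps) / ((p * p).sum() + (g * g).sum() + eps)
    # Eq. (4) uses the additive inverse -dice; returning 1 - dice shifts it
    # into [0, 1] and leaves the gradient of Eq. (5) unchanged.
    return 1.0 - dice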
Finally, the total loss function for the generator stands as
$$\mathcal{L}(G) = \mathcal{L}_{adv}(G) + \lambda_{1}\,\mathcal{L}_{rec}(G) + \lambda_{2}\,\mathcal{L}_{dice} \tag{6}$$
where $\lambda_{1}$ and $\lambda_{2}$ are constant coefficients which control the relative importance of the corresponding loss functions; we fine-tuned them by a simple grid search.
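Combining the three terms of Eq. (6) is then a single weighted sum, sketched below with the grid-searched weights reported in Section 5; assigning the weight of 100 to the reconstruction term and 1 to the dice term follows the ordering of Eq. (6).

def generator_objective(loss_adv, loss_rec, loss_dice,
                        lambda_1=100.0, lambda_2=1.0):
    # Eq. (6): adversarial + weighted reconstruction + weighted dice loss.
    return loss_adv + lambda_1 * loss_rec + lambda_2 * loss_dice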
5 Experiment
Experiments are conducted according to the baseline architecture in Fig. 2. After preprocessing and augmentation, the dataset size increased from 855 to 34,199. The dataset was split into training and test sets, with the training set containing 80% of the total data. All of the training was done on a system with two GeForce GTX TITAN X GPUs. The Adam optimizer [kingma2014adam] was used for training, with a learning rate of 0.0001 for both the generator and the discriminator. The value of $\lambda_{1}$ was set to 100 and $\lambda_{2}$ was set to 1. Three different architectures were trained as the generator separately, and results for each were obtained. We also analyzed the effect of the dice loss as an additional cost function by running the same models with and without it. All of the experiments were run for 100 epochs.
6 Results
The metrics used to assess the quality of the superresolved segmentation were the dice coefficient and mean intersection over union (mIOU). Both lie in the range [0, 1], where a higher value denotes better quality. The dice coefficient was chosen over pixel accuracy because the latter does not take into account the problem of class imbalance. As our images contain a dominant portion of background, pixel accuracy as a metric would lead to erroneous conclusions.
6.1 Superresolution Method
To superresolve an image from a low-resolution domain to a higher one, we tried two different modules: transposed convolution and sub-pixel convolution. Analyzing the metrics shown in Table 1, it is evident that sub-pixel convolution performs better as a module for superresolution than transposed convolution. For both ResNet and U-Net as the generator architecture, performance improves with the addition of dice loss as an additional reconstruction constraint. In the case of U-Net as the generator, performance improves with sub-pixel convolution when the dice loss is also involved.
6.2 Dice Loss
We also investigated the effect of the dice loss by testing every architecture with and without it. From the results, the dice loss provides a slight improvement for ResNet with sub-pixel convolution and for U-Net with both transposed convolution and sub-pixel convolution. However, according to the table, it is apparent that the performance of ResNet with transposed convolution improves considerably with the introduction of dice loss as an extra objective function.
7 Conclusion
The goal of this paper was to generate superresolved segmentations of retinal OCT scans using GAN architectures. In this work, we experimented with different architectures as generators, which performed the dual task of semantic segmentation as well as superresolving the segmented images. For this dual training, we tried two popular architectures, U-Net and ResNet, with additional modules of transposed convolution and sub-pixel convolution, separately, for the task of upscaling images from low to high resolution. We also tested dice loss, an objective function originating from the dice coefficient metric, as an additional loss function for the GAN model. As evident from the results, this joint training for the dual task of segmentation and superresolution was achieved effectively. Dice loss as an additional constraint to the original loss emphasized the reconstruction performance, and empirically from the results it can be seen that it improves the results considerably.