
1 Institute of Image Processing and Pattern Recognition, Department of Automation, Shanghai Jiao Tong University, Shanghai, 200240, P. R. China. Email: [email protected]
2 Xi’an Jiaotong University, Xi’an, P. R. China.
3 Department of Ophthalmology, Shanghai Tenth People’s Hospital, Tongji University, Shanghai, P. R. China.

Unsupervised Learning of Local Discriminative Representation for Medical Images

Huai Chen 1 (0000-0002-1815-1486), Jieyu Li 1, Renzhen Wang 2, Yijie Huang 1, Fanrui Meng 1, Deyu Meng 2, Qing Peng 3, Lisheng Wang (✉) 1 (0000-0003-3234-7511)
Abstract

Local discriminative representation is needed in many medical image analysis tasks, such as identifying sub-types of lesions or segmenting detailed components of anatomical structures. However, the commonly applied supervised representation learning methods require a large amount of annotated data, and unsupervised discriminative representation learning distinguishes different images by learning a global feature, so neither is suitable for localized medical image analysis tasks. To avoid these limitations, we introduce local discrimination into unsupervised representation learning in this work. The model contains two branches: an embedding branch, which learns an embedding function to disperse dissimilar pixels over a low-dimensional hypersphere, and a clustering branch, which learns a clustering function to classify similar pixels into the same cluster. These two branches are trained simultaneously in a mutually beneficial manner, and the learnt local discriminative representations can well measure the similarity of local image regions. These representations can be transferred to enhance various downstream tasks. They can also be applied to cluster anatomical structures from unlabeled medical images under the guidance of topological priors from simulation or from other structures with similar topological characteristics. The effectiveness and usefulness of the proposed method are demonstrated by enhancing various downstream tasks and by clustering anatomical structures in retinal images and chest X-ray images.

Keywords:
Unsupervised representation learning · Local discrimination · Topological priors

1 Introduction

In medical image analysis, transferring pre-trained encoders as initial models is an effective practice. Supervised representation learning is widely applied for this purpose, but it usually depends on a large amount of annotated data, and the learnt features may be less effective for new tasks that differ from the original training task [4]. Thus, some researchers have turned to unsupervised representation learning [9, 17], and in particular unsupervised discriminative representation learning has been proposed to measure the similarity of different images [7, 16, 18]. However, these methods mainly learn instance-wise discrimination based on global semantics and cannot characterize the similarities of local regions in an image. Hence, they are less effective for many medical image analysis tasks, such as lesion detection, structure segmentation, and identifying distinctions between different structures, in which local discriminative features need to be captured. In order to make unsupervised representation learning suitable for these tasks, we introduce local discrimination into unsupervised representation learning in this work.

It is known that medical images of humans contain similar anatomical structures, and thus pixels can be classified into several clusters based on their context. Based on this observation, a local discriminative embedding space can be learnt, in which pixels with similar context are distributed closely and dissimilar pixels are dispersed. In this work, a model containing two branches is constructed following a backbone network: an embedding branch generates pixel-wise embedding features and a clustering branch generates pseudo segmentations. By jointly updating these two branches, pixels belonging to the same cluster acquire similar embedding features, while different clusters acquire dissimilar ones. In this way, local discriminative features can be learnt in an unsupervised way and used for evaluating the similarity of local image regions.

The proposed method is further applied to several typical medical image analysis tasks in fundus images and chest X-ray images: (1) the learnt features are utilized in 9 different downstream tasks via transfer learning, including segmentation of retinal vessels, the optic disc (OD) and lungs, detection of haemorrhages and hard exudates, etc., to enhance the performance of these tasks; (2) inspired by specialists' ability to recognize anatomical structures based on prior knowledge, we utilize the learnt features to cluster local regions of the same anatomical structure under the guidance of topological priors, which are generated by simulation or taken from other structures with similar topology.

2 Related work

Instance discrimination [16, 7, 1, 18] is an unsupervised representation learning framework that provides a good initialization for downstream tasks, and it can be considered an extension of the exemplar convolutional neural network (CNN) [4]. The main idea of instance discrimination is to build an encoder that dispersedly embeds training samples over a hypersphere [16]. Specifically, a CNN is trained to project each image onto a low-dimensional unit hypersphere, on which the similarity between images can be evaluated by cosine similarity. In this embedding space, dissimilar images are forced to distribute separately and similar images to distribute closely; thus, the encoder can make instance-level discriminations. Wu et al. [16] introduce a memory bank to store a historical feature vector for each image. The probability of an image being recognized as the $i$-th example is then expressed by the inner product of its embedding vector and the vectors stored in the memory bank, and the discrimination ability of the encoder is obtained by learning to correctly classify each image instance into its corresponding record in the memory bank. However, the vectors stored in the memory bank are usually outdated because of their discontinuous updating. To address this problem, Ye et al. [18] propose a siamese-network framework that introduces augmentation invariance into the embedding space to cluster similar images, enabling real-time comparison.

This ingenious design enables instance discrimination to effectively utilize unlabeled images to train a generalized feature extractor for downstream tasks, shrinking the gap between unsupervised and supervised representation learning [7]. However, summarizing a global feature for an image instance misses local details, which are crucial for medical image tasks, and the high similarity of global semantics between images of the same body part makes instance-wise discrimination less practical. It is therefore more reasonable to focus on local discrimination for medical images. Meanwhile, medical images of the same body part can be divided into several clusters owing to their similar anatomical structures, which inspires us to propose a framework that clusters similar pixels to learn local discrimination.

Figure 1: Illustration of our proposed learning model.

3 Methods

The illustration of our unsupervised framework is shown in Figure 1. The model has two main components. The first learns a local discriminative representation that projects pixels into an $l_2$-normalized low-dimensional space, i.e., a $K$-D unit hypersphere, in which pixels with similar context are closely distributed and dissimilar pixels are far away from each other. The learnt local discriminative representation can serve as a good feature extractor for downstream tasks. The second introduces prior knowledge of topological structure and relative location into local discrimination: the priors are fused into the model to push the distribution of pseudo segmentations closer to the distribution of the priors. By combining structure priors with local discrimination, regions of the expected anatomical structure can be clustered.

3.1 Local discrimination learning

Figure 2: Illustration of local discrimination learning.

As medical images of the same body region contain the same anatomical structures, image pixels can be classified into several clusters, each of which corresponds to a specific kind of structure. Therefore, local discrimination learning is proposed to train a representation that embeds each pixel onto a hypersphere, on which pixels with similar context are encoded closely. To achieve this, two branches, an embedding branch to encode each pixel and a clustering branch to generate pseudo segmentations that cluster pixels, are built on top of a backbone network and trained in a mutually beneficial manner.

3.1.1 Notation:

We denote $f_\theta$ as the deep neural network, where $\theta$ represents the network parameters. The unlabeled image examples are denoted as $X=\{x_1,...,x_N\}$, where $x_i\in\mathbb{R}^{H\times W}$. After feeding $x_i$ into the network, we obtain an embedding feature map $v_i$ and a probability map $r_i$, i.e., $v_i, r_i = f_\theta(x_i)$, where $v_i(h,w)\in\mathbb{R}^K$ is the $K$-dimensional encoded vector for position $(h,w)$ of image $x_i$, and $r_i(h,w)\in\mathbb{R}^M$ is a vector representing the probability of classifying pixel $x_i(h,w)$ into each of $M$ clusters; $r_{mi}(h,w)$ denotes the probability of classifying pixel $x_i(h,w)$ into the $m$-th cluster. We enforce $||r_i(h,w)||_1=1$ and $||v_i(h,w)||_2=1$ by applying $l_1$ and $l_2$ normalization in the clustering branch and the embedding branch, respectively.
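To make this notation concrete, the two branches on top of the backbone might look like the following minimal PyTorch sketch (our own illustrative rendering, not the authors' released code; the channel width `in_channels` is an assumption, and a softmax is used here as one common way to obtain non-negative $l_1$-normalized cluster probabilities):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalDiscriminationHeads(nn.Module):
    """Embedding branch (K-D, l2-normalized) and clustering branch (M clusters, l1-normalized)."""

    def __init__(self, in_channels: int = 16, K: int = 32, M: int = 8):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Conv2d(in_channels, K, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(K, K, 3, padding=1),
        )
        self.cluster = nn.Sequential(
            nn.Conv2d(in_channels, M, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(M, M, 3, padding=1),
        )

    def forward(self, feats: torch.Tensor):
        v = F.normalize(self.embed(feats), p=2, dim=1)  # ||v_i(h,w)||_2 = 1
        r = torch.softmax(self.cluster(feats), dim=1)   # r >= 0, ||r_i(h,w)||_1 = 1
        return v, r
```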

3.1.2 Jointly train clustering branch and embedding branch:

After getting the embedding features and pseudo segmentations, the center embedding feature $c_m$ of the $m$-th cluster can be formulated as follows:

$c_m=\frac{\sum_{i,h,w}r_{mi}(h,w)\,v_i(h,w)}{\left\|\sum_{i,h,w}r_{mi}(h,w)\,v_i(h,w)\right\|_2}$ (1)

where $l_2$ normalization is used to keep $c_m$ on the hypersphere. Thus, the similarity between $c_m$ and $v_i(h,w)$ can be evaluated by cosine similarity as follows:

$t(c_m,v_i(h,w))=c_m^{T}v_i(h,w)$ (2)

To make pixels of the same cluster closely distributed and pixels of different clusters dispersedly distributed, there should be high similarity between $v_i(h,w)$ and its corresponding center embedding feature $c_m$, and low similarity between $c_m$ and $c_n$ ($m\neq n$). Thus, the loss function can be formulated as follows:

$loss_{ld}=-\frac{1}{MNHW}\sum_{m,i,h,w}r_{mi}(h,w)\,t(c_m,v_i(h,w))+\frac{1}{M(M-1)}\sum_{m,n\neq m}c_m^{T}c_n$ (3)
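Under the definitions above, Eqs. (1)-(3) could be computed as in the following sketch (an illustrative implementation under our assumed tensor layout, with a small epsilon guard of our own):

```python
import torch

def local_discrimination_loss(v: torch.Tensor, r: torch.Tensor, eps: float = 1e-8):
    """loss_ld (Eq. 3): pull pixels toward their soft cluster centers,
    push different cluster centers apart.

    v: (N, K, H, W) l2-normalized pixel embeddings
    r: (N, M, H, W) soft cluster assignments (sum over M is 1)
    """
    N, K, H, W = v.shape
    M = r.shape[1]
    # Soft cluster centers c_m (Eq. 1): assignment-weighted embeddings, re-normalized.
    c = torch.einsum('nmhw,nkhw->mk', r, v)          # (M, K)
    c = c / (c.norm(dim=1, keepdim=True) + eps)
    # Cosine similarity t(c_m, v_i(h,w)) = c_m^T v_i(h,w) (Eq. 2).
    t = torch.einsum('mk,nkhw->nmhw', c, v)          # (N, M, H, W)
    attract = (r * t).sum() / (M * N * H * W)
    # Pairwise center similarity, excluding m == n.
    cc = c @ c.t()
    repel = (cc.sum() - cc.diagonal().sum()) / (M * (M - 1))
    return -attract + repel
```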

3.1.3 More constraints:

We also add an entropy loss and an area loss to encourage high-confidence predictions and to avoid blank outputs for some clusters. These losses are as follows:

$loss_{entropy}=-\frac{1}{MNHW}\sum_{m,i,h,w}r_{mi}(h,w)\log r_{mi}(h,w)$ (4)
$area_{mi}=\sum_{h,w}r_{mi}(h,w)$ (5)
$loss_{area}=\frac{1}{NM}\sum_{m,i}\mathrm{relu}\left(\frac{HW}{4M}-area_{mi}\right)$ (6)

where $\mathrm{relu}$ is the rectified linear unit [5]; $loss_{area}$ imposes a penalty whenever the area of a pseudo segmentation is smaller than $\frac{HW}{4M}$.
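A possible implementation of these two constraints, matching Eqs. (4)-(6) (a sketch; the epsilon guard inside the logarithm is our addition):

```python
import torch
import torch.nn.functional as F

def entropy_loss(r: torch.Tensor, eps: float = 1e-8):
    """Eq. (4): encourage confident (low-entropy) cluster assignments."""
    N, M, H, W = r.shape
    return -(r * torch.log(r + eps)).sum() / (M * N * H * W)

def area_loss(r: torch.Tensor):
    """Eqs. (5)-(6): penalize clusters whose soft area falls below HW / (4M)."""
    N, M, H, W = r.shape
    area = r.sum(dim=(2, 3))                  # (N, M) soft area per cluster
    return F.relu(H * W / (4 * M) - area).sum() / (N * M)
```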

3.2 Prior-guided anatomical structure clustering

Figure 3: Recognizing structures based on prior knowledge and local discrimination.

Commonly, specialists can easily identify anatomical structures based on corresponding prior knowledge, including relative location and topological structure, and even based on knowledge of similar structures. Therefore, the network's ability to recognize structures based on local discrimination and topological priors is studied in this part. Reference images, i.e., binary masks of similar structures, real data or simulations that convey knowledge of location and topology, are introduced to force the clustering branch to output the corresponding structures, as shown in Figure 3.

We denote the distribution of the $m$-th cluster as $P_m$ and the distribution of the corresponding references as $Q_m$. The goal of optimization is to minimize the Kullback-Leibler (KL) divergence between them, formulated as follows:

$\mathop{\min}_{f_\theta}KL(P_m||Q_m)=\sum_{i}P_m(r_{mi})\log\frac{P_m(r_{mi})}{Q_m(r_{mi})}$ (7)

To minimize the KL divergence between $P_m$ and $Q_m$, adversarial learning [6] is utilized to encourage the produced pseudo segmentations to be similar to the reference masks. During training, a discriminator $D$ is set to discriminate the pseudo segmentation $r_m$ from the reference mask $s_m$, while $f_\theta$ aims to fool $D$. The loss function for $D$ and the adversarial loss for $f_\theta$ are defined as follows (a PyTorch sketch is given after the equations):

$loss_{D}=loss_{bce}(D(s_m),1)+loss_{bce}(D(r_m),0)$ (8)
$loss_{bce}(\hat{y},y)=-\frac{1}{N}\sum_{i}(y_i\log\hat{y}_i+(1-y_i)\log(1-\hat{y}_i))$ (9)
$loss_{adv}=loss_{bce}(D(r_m),1)$ (10)
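In PyTorch, Eqs. (8) and (10) might be realized as below (a sketch; the `detach` placement reflects the usual GAN training convention rather than anything stated explicitly in the paper):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, s_m, r_m):
    """Eq. (8): D should score reference masks s_m as real (1)
    and pseudo segmentations r_m as fake (0)."""
    real_pred = D(s_m)
    fake_pred = D(r_m.detach())  # detach: this update trains D only, not f_theta
    return (F.binary_cross_entropy(real_pred, torch.ones_like(real_pred))
            + F.binary_cross_entropy(fake_pred, torch.zeros_like(fake_pred)))

def adversarial_loss(D, r_m):
    """Eq. (10): f_theta is rewarded when D scores its pseudo segmentation as real."""
    pred = D(r_m)
    return F.binary_cross_entropy(pred, torch.ones_like(pred))
```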

3.2.1 Reference masks:

(1) From similar structures: similar structures share similar geometry and topology. Therefore, we can utilize segmentation annotations of similar structures to guide the segmentation of the target, e.g., vessel annotations in OCTA can be utilized for clustering retinal vessels in fundus images. (2) From real data: annotations of the target structure can be directly used as the prior knowledge. (3) Simulation: based on their understanding, experts can draw pseudo masks that convey relative location, topology, etc. For example, from a retinal vessel mask, the approximate locations of the OD and fovea can be identified; ellipses can then be placed at these positions to represent the OD and fovea based on their geometric priors, as the sketch below illustrates.
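For illustration, a simulated OD/fovea reference mask could be drawn as follows (a sketch with OpenCV; the centers and axis lengths here are placeholder values, whereas the paper infers the positions per image from the vessel mask):

```python
import numpy as np
import cv2

def simulated_od_fovea_mask(h=512, w=512, od_center=(360, 230), fovea_center=(256, 256)):
    """Draw filled ellipses at rough OD / fovea positions as a simulated reference mask."""
    mask = np.zeros((h, w), dtype=np.uint8)
    # cv2.ellipse(img, center, axes, angle, startAngle, endAngle, color, thickness)
    cv2.ellipse(mask, od_center, (45, 55), 0, 0, 360, 1, -1)     # optic disc
    cv2.ellipse(mask, fovea_center, (30, 30), 0, 0, 360, 1, -1)  # fovea
    return mask
```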

4 Experiments and Discussion

The experiments are divided into two parts: the first shows the effectiveness of the proposed unsupervised local discrimination learning, and the second demonstrates the feasibility of combining local discrimination with topological priors to cluster target structures.

4.1 Network architectures and initialization

The backbone is a U-net consisting of a VGG-like encoder and a decoder. The encoder is a tiny version of VGG-16 without the fully connected layers (FCs), whose channel numbers are a quarter of those of VGG-16. The decoder is composed of 4 convolution blocks, each made up of two convolution layers. The final features of the decoder are concatenated with the features generated by the first convolution block of the encoder for further processing by the clustering branch and the embedding branch. The embedding branch consists of 2 convolution layers with 32 channels and an $l_2$ normalization layer to project each pixel onto a 32-D hypersphere. The clustering branch consists of 2 convolution layers with 8 channels followed by an $l_1$ normalization layer.

To minimize the KL divergence between the pseudo segmentation distribution and the reference distribution, a discriminator is created. The discriminator is a simple classifier with 7 convolution layers and 2 FCs. The channel numbers of the convolution layers are 16, 32, 32, 32, 32, 64 and 64, and each of the first 5 layers is followed by a max-pooling layer that halves the feature map size. The FCs have 32 and 1 channels, and the final FC is followed by a sigmoid layer.
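A sketch of such a discriminator is given below (the activation functions and the 512-pixel input size are our assumptions; the channel plan follows the description above):

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """7 conv layers (16, 32, 32, 32, 32, 64, 64), max-pooling after the
    first 5, then FCs with 32 and 1 channels and a final sigmoid."""

    def __init__(self, in_channels: int = 1, input_size: int = 512):
        super().__init__()
        chans = [16, 32, 32, 32, 32, 64, 64]
        layers, cin = [], in_channels
        for i, cout in enumerate(chans):
            layers += [nn.Conv2d(cin, cout, 3, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]   # activation is our guess
            if i < 5:
                layers.append(nn.MaxPool2d(2))            # halve the size 5 times
            cin = cout
        self.features = nn.Sequential(*layers)
        side = input_size // 2 ** 5
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * side * side, 32), nn.ReLU(inplace=True),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```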

Patch discrimination to initialize the network: it is hard to train the clustering branch and the embedding branch simultaneously from scratch. Thus, we first jointly pre-train the backbone and the embedding branch by patch discrimination, an improvement of instance discrimination [18]. The main idea is that the embedding branch should project similar patches (patches under various augmentations) onto close positions on the hypersphere. The embedding features $v_i$ are first processed by an adaptive average pooling (AAP) layer to generate spatial features, each of which represents a corresponding patch of image $x_i$. We denote $s_i(j)$ as the embedding vector of $x_i(j)$ (the $j$-th patch of $x_i$), where $||s_i(j)||_2=1$ by applying $l_2$ normalization, and $\hat{s}_i(j)$ as the embedding vector of the corresponding augmented patch $\hat{x}_i(j)$. The probability of patch $\hat{x}_i(j)$ being recognized as patch $x_i(j)$ is defined as follows:

$P(ij|\hat{x}_i(j))=\frac{\exp(s_i^{T}(j)\hat{s}_i(j)/\tau)}{\sum_{k,l}\exp(s_k^{T}(l)\hat{s}_i(j)/\tau)}$ (11)

Assuming that the events of patches being recognized as $x_i(j)$ are independent, the joint probability of $\hat{x}_i(j)$ being recognized as $x_i(j)$ and of $\hat{x}_k(l)$ ($k\neq i$ or $l\neq j$) not being recognized as $x_i(j)$ is as follows:

$P_{ij}=P(ij|\hat{x}_i(j))\prod_{k\neq i\ \mathrm{or}\ l\neq j}(1-P(ij|\hat{x}_k(l)))$ (12)

The negative log likelihood and the loss function are formulated as follows:

$J_{ij}=-\log P(ij|\hat{x}_i(j))-\sum_{k\neq i\ \mathrm{or}\ l\neq j}\log(1-P(ij|\hat{x}_k(l)))$ (13)
$loss_{pd}=\sum_{i,j}J_{ij}$ (14)
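A vectorized sketch of $loss_{pd}$ over one batch, following Eqs. (11)-(14) (the temperature value and the numerical epsilon are our choices; the patch layout is an assumed `(B, P, K)` tensor):

```python
import torch

def patch_discrimination_loss(s: torch.Tensor, s_hat: torch.Tensor, tau: float = 0.1):
    """loss_pd (Eqs. 11-14).

    s, s_hat: (B, P, K) l2-normalized embeddings of the original patches and of
    the same patches under augmentation (B images, P patches each).
    """
    B, P, K = s.shape
    s_flat = s.reshape(B * P, K)
    s_hat_flat = s_hat.reshape(B * P, K)
    # logits[a, b]: similarity between anchor patch a and augmented patch b.
    logits = s_flat @ s_hat_flat.t() / tau        # (BP, BP)
    probs = torch.softmax(logits, dim=0)          # Eq. (11): normalize over anchors k,l
    eye = torch.eye(B * P, device=s.device, dtype=torch.bool)
    pos = probs[eye]                              # P(ij | x_hat_i(j)), the diagonal
    neg = probs[~eye]                             # P(ij | x_hat_k(l)), (k,l) != (i,j)
    # Eqs. (13)-(14): negative log of the joint recognition probability.
    return -(torch.log(pos + 1e-8).sum() + torch.log(1 - neg + 1e-8).sum())
```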

We also introduce mixup [19] to make the representations more robust. A virtual sample $\tilde{x}_i=\lambda x_a+(1-\lambda)x_b$ is first generated by linear interpolation of $x_a$ and $x_b$, where $\lambda\in(0,1)$. The embedded representation of patch $\tilde{x}_i(j)$ is $\tilde{s}_i(j)$, and we expect it to be similar to the mixup feature $z_i(j)$. The loss is defined as follows:

$z_i(j)=\frac{\lambda s_a(j)+(1-\lambda)s_b(j)}{||\lambda s_a(j)+(1-\lambda)s_b(j)||_2}$ (15)
$\tilde{P}(ij|\tilde{s}_i(j))=\frac{\exp(z_i^{T}(j)\tilde{s}_i(j)/\tau)}{\sum_{k,l}\exp(z_k^{T}(l)\tilde{s}_i(j)/\tau)}$ (16)
$\tilde{J}_{ij}=-\log\tilde{P}(ij|\tilde{s}_i(j))-\sum_{k\neq i\ \mathrm{or}\ l\neq j}\log(1-\tilde{P}(ij|\tilde{s}_k(l)))$ (17)
$loss_{mixup}=\sum_{i,j}\tilde{J}_{ij}$ (18)
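The mixup targets of Eq. (15) can be formed as below; $loss_{mixup}$ (Eqs. 16-18) then has exactly the same form as $loss_{pd}$, with $z$ playing the role of the anchor embeddings (a sketch reusing the `patch_discrimination_loss` helper above):

```python
import torch.nn.functional as F

def mixup_patch_targets(s_a, s_b, lam: float):
    """Eq. (15): mixup target features, re-normalized onto the hypersphere.

    s_a, s_b: (B, P, K) patch embeddings of the two source images;
    the mixed image is x_tilde = lam * x_a + (1 - lam) * x_b.
    """
    return F.normalize(lam * s_a + (1 - lam) * s_b, p=2, dim=-1)

# z = mixup_patch_targets(s_a, s_b, lam)
# loss_mixup = patch_discrimination_loss(z, s_tilde, tau)
```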

When pre-training this model, we set the training loss to $loss_{pd}+loss_{mixup}$. The output size of the AAP layer is set to $4\times 4$ to split each image into 16 patches. Each batch contains 16 groups of images, each consisting of 2 augmentations of one image, plus 8 corresponding mixup images. The augmentations are RandomResizedCrop, RandomGrayscale, ColorJitter, RandomHorizontalFlip and Rotation90 in pytorch. The optimizer is Adam with an initial learning rate ($lr$) of 0.001, which is halved if the validation loss does not decrease over 3 epochs. The maximum number of training epochs is 20. A possible realization of this setup is sketched below.
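In this sketch, the crop scale, jitter strengths and grayscale probability are guesses, `RandomRotation([90, 90])` stands in for the Rotation90 augmentation, and `model` is a placeholder for the backbone plus embedding branch:

```python
import torch
import torch.nn as nn
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(512, scale=(0.6, 1.0)),
    transforms.RandomGrayscale(p=0.2),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation([90, 90]),   # stand-in for Rotation90
    transforms.ToTensor(),
])

model = nn.Conv2d(3, 32, 3, padding=1)     # placeholder for backbone + embedding branch
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Halve lr when the validation loss has not decreased for 3 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=3)
# After each epoch: scheduler.step(val_loss)
```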

4.2 Experiments for learning local discrimination

4.2.1 Datasets and preprocessing:

Our method is evaluated in two medical scenes. Fundus images: the model is first trained on the diabetic retinopathy (DR) detection dataset of Kaggle [3] (https://www.kaggle.com/c/diabetic-retinopathy-detection/data; 30k images for training, 5k for validation). Then, the pre-trained encoder is transferred to 8 segmentation tasks: (1) retinal vessel: DRIVE [13] (20 for training, 20 for testing), STARE [8] (10 for training, 10 for testing) and CHASEDB1 [10] (20 for training, 8 for testing); (2) OD and cup: Drishti-GS [12] (50 for training, 50 for testing) and IDRID (OD) [11] (54 for training, 27 for testing); (3) lesions: the haemorrhages dataset (Hae) and the hard exudates dataset (HE) from IDRID [11]. Chest X-ray: the encoder is pre-trained on ChestX-ray8 [15] (100k for training and 12k for validation) and transferred to lung segmentation [2] (69 for training, 69 for testing). All images of the above datasets are resized to $512\times 512$.

4.2.2 Implementation details:

(1) Local discriminative representation learning: the model is first initialized by the pre-trained patch discrimination model. The training loss is then set to $loss_{pd}+loss_{mixup}+10\,loss_{ld}+loss_{entropy}+5\,loss_{area}$. Each batch has 6 groups of images, each containing 2 augmentations of one image, plus 3 mixup images. The maximum number of training epochs is 80 and the optimizer is Adam with $lr=0.001$.

(2) Transferring: the encoder of each downstream task is initialized by the learnt local discrimination feature extractor. The decoder is composed of 5 convolution blocks, each of which contains 2 convolution layers and is followed by an up-pooling layer. The loss is set as $loss_{dsc}=\frac{2|p\times g|}{|g|+|p|}$. The model is first trained for 100 epochs in frozen pattern using Adam with $lr=0.001$, and then fine-tuned with $lr=0.0001$ for the following 100 epochs, as sketched below.
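The two-stage transfer schedule might be set up as follows (a sketch with placeholder encoder/decoder modules; the Dice objective is written as $1-\mathrm{DSC}$ so that it is minimized):

```python
import torch
import torch.nn as nn

class SegModel(nn.Module):
    """Placeholder encoder/decoder; in practice the encoder is the pre-trained backbone."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 16, 3, padding=1)
        self.decoder = nn.Conv2d(16, 1, 3, padding=1)
    def forward(self, x):
        return torch.sigmoid(self.decoder(self.encoder(x)))

def dice_loss(p, g, eps: float = 1e-7):
    """1 - soft Dice, the minimization form of the paper's loss_dsc."""
    return 1 - 2 * (p * g).sum() / (p.sum() + g.sum() + eps)

model = SegModel()
# Stage 1 ("frozen"): train the decoder only, lr = 0.001, 100 epochs.
for p in model.encoder.parameters():
    p.requires_grad = False
opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)
# Stage 2 ("fine-tune"): unfreeze the encoder, lr = 0.0001, 100 more epochs.
for p in model.encoder.parameters():
    p.requires_grad = True
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```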

(3) Comparative methods: Random: the network is trained from scratch. Supervised: the encoder is first trained to classify DR severity using the manual DR grades. Wu et al. [16] and Ye et al. [18]: the instance discrimination methods proposed in [16] and [18]. LD: the proposed method.

Table 1: Comparison of results (DSC, %) on downstream segmentation tasks.

                 Retinal vessel          Optic disc and cup        Lesions        X-ray
encoder      DRIVE  STARE  CHASE    GS(cup)  GS(OD)  ID(OD)    Hae    HE      Lung
Random       80.76  76.26  78.30    77.76    95.41   89.11     37.76  57.44   96.34
Supervised   81.06  80.59  78.56    86.94    96.40   93.56     51.11  61.34   -
Wu et al.    74.98  66.15  68.31    84.59    94.58   88.70     26.34  48.67   96.27
Ye et al.    80.87  81.22  79.85    87.30    97.40   94.68     46.79  59.40   96.63
LD           82.15  83.42  80.35    89.30    96.53   95.59     46.72  65.77   97.51

4.2.3 Results:

The evaluation metric is the mean Dice-Sørensen coefficient (DSC): $DSC=\frac{2|P\times G|}{|P|+|G|}$, where $P$ is the binarized prediction and $G$ is the ground truth (a minimal implementation is given after the observations below). Quantitative evaluations for the downstream tasks are shown in Table 1, from which we make the following observations:

1) The generalization ability of the learnt local discriminative representation is demonstrated by the leading performance on the 6 fundus tasks and lung segmentation. Compared with models trained from scratch, models initialized by our pre-trained encoder gain improvements of 1.39%, 7.16%, 2.05%, 11.54%, 1.12%, 6.48%, 8.96%, 8.33% and 1.17% in DSC on the 9 tasks, respectively.

2) Compared with the instance discrimination methods of Wu et al. [16] and Ye et al. [18], the proposed local discrimination is capable of learning finer features and is more suitable for unsupervised representation learning of medical images.

3) The proposed unsupervised method is free from labeled images and the learnt representation is more generalized, while supervised representation learning relies on expensive manual annotations and learns specialized representations. As shown in Table 1, our method outperforms supervised representation learning (whose objective is DR classification) on all tasks except the segmentation of haemorrhages, which is key evidence for DR.
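As referenced above, the DSC metric can be computed as follows (a minimal implementation; the epsilon guard is ours):

```python
import numpy as np

def dsc(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice-Sørensen coefficient between a binary prediction P and ground truth G."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)
```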

Figure 4: Examples of reference masks and predicted results: the first row shows reference images and the second row shows the corresponding predictions.

4.3 Experiments for clustering structures based on prior knowledge

4.3.1 Implementation details:

In this part, we respectively fuse reference images from real data, similar structures and simulations into local discrimination to investigate its ability to cluster anatomical structures. A dataset of 3110 high-quality fundus images from [14] and 1482 frontal X-rays from [15] are utilized as training data. The reference images are constructed in 3 ways: (1) from real references: all 40 retinal vessel masks of DRIVE are utilized as the references for clustering vessel pixels; (2) from similar structures: similar structures share similar priors, so 10 OCTA vessel masks are utilized as the references for retinal vessels in fundus images; (3) simulation: we directly draw 20 simulated lung masks to guide lung segmentation. Meanwhile, based on the vessel masks of DRIVE, we place ellipses at the approximate center locations of the OD and fovea to generate pseudo masks. Some reference masks are shown in Figure 4.

$f_\theta$ needs to jointly learn local discrimination and fool $D$; thus, it is updated by minimizing the following loss:

$loss_{f_\theta}=loss_{pd}+loss_{mixup}+10\,loss_{ld}+loss_{entropy}+5\,loss_{area}+2\,loss_{adv}$ (19)

The optimizer for $f_\theta$ is Adam with $lr=0.001$. The discriminator is optimized by minimizing $loss_D$ using Adam with $lr=0.0005$; one alternating update is sketched below. It is worth noting that for the clustering of the OD and fovea, the masks of real vessels, fovea and OD are concatenated and fed into $D$ to provide enough information, and $f_\theta$ is first pre-trained to cluster retinal vessels. The maximum number of training epochs is 80.
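Putting the pieces together, one alternating update of $f_\theta$ and $D$ might look like the following sketch (assuming the loss helpers sketched earlier and the two optimizers are in scope; `m` indexes the target cluster and `pd_mixup_loss` is the batch's $loss_{pd}+loss_{mixup}$ term):

```python
def train_step(model, D, opt_f, opt_d, images, refs, m, pd_mixup_loss):
    """One alternating update of f_theta (Eq. 19) and D (Eq. 8)."""
    v, r = model(images)
    r_m = r[:, m:m + 1]                       # pseudo segmentation of the m-th cluster
    loss_f = (pd_mixup_loss
              + 10 * local_discrimination_loss(v, r)
              + entropy_loss(r) + 5 * area_loss(r)
              + 2 * adversarial_loss(D, r_m))
    opt_f.zero_grad(); loss_f.backward(); opt_f.step()

    # D update: compare reference masks against detached pseudo segmentations.
    loss_d = discriminator_loss(D, refs, r_m.detach())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```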

4.3.2 Results:

Visualization examples are shown in Figure 4. Quantitative evaluations are as follows: (1) retinal vessel segmentation is evaluated on the test data of STARE; the DSC is 66.25% for the model based on real references and 57.35% for the model based on OCTA annotations. (2) OD segmentation is evaluated on the test data of Drishti-GS and attains a DSC of 83.60%. (3) Fovea segmentation is evaluated on the test data of STARE; because the fovea region is fuzzy, we measure the mean distance between the real fovea center and the predicted center, which is 7.63 pixels. (4) Lung segmentation is evaluated on NLM [2] and the DSC is 81.20%.

Based on the above results, we make the following observations:

1) In general, topological priors generated from simulations or from similar structures in a different modality are effective in guiding the clustering of target regions.

2) However, real masks contain more detailed information and provide more precise guidance. For example, compared with vessel segmentations based on OCTA annotations, which miss thin blood vessels because of the large vessel thickness in the OCTA masks, segmentations based on real masks can recognize thin vessels thanks to the details provided and the constraint of clustering pixels with similar context.

3) For anatomical structures with fuzzy intensity patterns, such as the fovea, combining local similarity with structure priors enables precise recognition.

5 Conclusion

In this paper, we propose an unsupervised framework to learn local discriminative representations for medical images. By transferring the learnt feature extractor, downstream tasks can be improved, decreasing the demand for expensive annotations. Furthermore, target structures can be clustered by fusing prior knowledge into the learning framework. The experimental results show that our method achieves the best performance on 7 out of 9 tasks in fundus and chest X-ray images, demonstrating the strong generalization of the learnt representation. Meanwhile, the feasibility of clustering structures based on prior knowledge and unlabeled images is demonstrated by combining local discrimination with topological priors from real data, similar structures, or even simulations to segment anatomical structures including retinal vessels, the OD, the fovea and the lungs.

References

  • [1] Bachman, P., Hjelm, R.D., Buchwalter, W.: Learning representations by maximizing mutual information across views. In: Advances in Neural Information Processing Systems. pp. 15535–15545 (2019)
  • [2] Candemir, S., Jaeger, S., Palaniappan, K., Musco, J.P., Singh, R.K., Xue, Z., Karargyris, A., Antani, S., Thoma, G., McDonald, C.J.: Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE transactions on medical imaging 33(2), 577–590 (2013)
  • [3] Cuadros, J., Bresnick, G.: Eyepacs: an adaptable telemedicine system for diabetic retinopathy screening. Journal of diabetes science and technology 3(3), 509–516 (2009)
  • [4] Dosovitskiy, A., Fischer, P., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence 38(9), 1734–1747 (2015)
  • [5] Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. pp. 315–323 (2011)
  • [6] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. pp. 2672–2680 (2014)
  • [7] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9729–9738 (2020)
  • [8] Hoover, A., Kouznetsova, V., Goldbaum, M.: Locating blood vessels in retinal images by piecewise threshold probing of a matched filter response. IEEE Transactions on Medical imaging 19(3), 203–210 (2000)
  • [9] Mahmood, U., Rahman, M.M., Fedorov, A., Lewis, N., Fu, Z., Calhoun, V.D., Plis, S.M.: Whole milc: generalizing learned dynamics across tasks, datasets, and populations. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 407–417. Springer (2020)
  • [10] Owen, C.G., Rudnicka, A.R., Mullen, R., Barman, S.A., Monekosso, D., Whincup, P.H., Ng, J., Paterson, C.: Measuring retinal vessel tortuosity in 10-year-old children: validation of the computer-assisted image analysis of the retina (caiar) program. Investigative ophthalmology & visual science 50(5), 2004–2010 (2009)
  • [11] Porwal, P., Pachade, S., Kamble, R., Kokare, M., Deshmukh, G., Sahasrabuddhe, V., Meriaudeau, F.: Indian diabetic retinopathy image dataset (idrid): a database for diabetic retinopathy screening research. Data 3(3),  25 (2018)
  • [12] Sivaswamy, J., Krishnadas, S., Joshi, G.D., Jain, M., Tabish, A.U.S.: Drishti-gs: Retinal image dataset for optic nerve head (onh) segmentation. In: 2014 IEEE 11th international symposium on biomedical imaging (ISBI). pp. 53–56. IEEE (2014)
  • [13] Staal, J., Abràmoff, M.D., Niemeijer, M., Viergever, M.A., Van Ginneken, B.: Ridge-based vessel segmentation in color images of the retina. IEEE transactions on medical imaging 23(4), 501–509 (2004)
  • [14] Wang, R., Chen, B., Meng, D., Wang, L.: Weakly-supervised lesion detection from fundus images. IEEE transactions on medical imaging pp. 1501–1512 (2018)
  • [15] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2097–2106 (2017)
  • [16] Wu, Z., Xiong, Y., Yu, X.S., Lin, D.: Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3733–3742 (2018)
  • [17] Xie, X., Chen, J., Li, Y., Shen, L., Ma, K., Zheng, Y.: Instance-aware self-supervised learning for nuclei segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 341–350. Springer (2020)
  • [18] Ye, M., Zhang, X., Yuen, P.C., Chang, S.F.: Unsupervised embedding learning via invariant and spreading instance feature. In: Proceedings of the IEEE Conference on computer vision and pattern recognition. pp. 6210–6219 (2019)
  • [19] Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: International Conference on Learning Representations (2018)