Artifact-Tolerant Clustering-Guided Contrastive Embedding Learning for Ophthalmic Images
Abstract
Ophthalmic images and derivatives such as the retinal nerve fiber layer (RNFL) thickness map are crucial for detecting and monitoring ophthalmic diseases (e.g., glaucoma). For computer-aided diagnosis of eye diseases, the key technique is to automatically extract meaningful features from ophthalmic images that can reveal the biomarkers (e.g., RNFL thinning patterns) linked to functional vision loss. However, representation learning from ophthalmic images that links structural retinal damage with human vision loss is non-trivial mostly due to large anatomical variations between patients. The task becomes even more challenging in the presence of image artifacts, which are common due to issues with image acquisition and automated segmentation. In this paper, we propose an artifact-tolerant unsupervised learning framework termed EyeLearn for learning representations of ophthalmic images. EyeLearn has an artifact correction module to learn representations that can best predict artifact-free ophthalmic images. In addition, EyeLearn adopts a clustering-guided contrastive learning strategy to explicitly capture the intra- and inter-image affinities. During training, images are dynamically organized in clusters to form contrastive samples in which images in the same or different clusters are encouraged to learn similar or dissimilar representations, respectively. To evaluate EyeLearn, we use the learned representations for visual field prediction and glaucoma detection using a real-world ophthalmic image dataset of glaucoma patients. Extensive experiments and comparisons with state-of-the-art methods verified the effectiveness of EyeLearn for learning optimal feature representations from ophthalmic images.
Ophthalmic image, RNFLT map, artifact correction, representation learning, clustering-guided contrastive learning
1 Introduction
Eye diseases present great challenges and serious threats to human health and quality of life. Vision impairments caused by many eye-related diseases such as glaucoma are irreversible and may result in complete vision loss if left untreated [1]. Detecting potential visual disorders at early stages is critical to reducing vision loss. In clinical practice, diagnosis of eye diseases largely relies on clinicians’ assessments of ophthalmic images [2], which have benefited from the development of various noninvasive medical imaging modalities such as the spectral domain optical coherence tomography (OCT). For example, the retinal nerve fiber layer (RNFL) thickness (RNFLT) map derived from OCT images is commonly used by clinicians to diagnose glaucoma and monitor disease progression [3], based on the clinical knowledge that RNFL damage is a hallmark of glaucoma and therefore predictive of accompanying vision loss [4].
Computer-aided diagnosis has been widely applied to ophthalmic images [5], where the common paradigm is to develop effective feature learning algorithms that can automatically extract relevant biomarkers (e.g., optic nerve thinning and atrophy) to support the detection of eye disease and its progression. Recent works include 1) using RNFLT maps to predict visual fields [6, 7]; 2) using fundus images to detect eye diseases [8, 9]; and 3) using multiple image modalities, such as fundus images and visual fields, to collectively detect eye diseases and their progression [2, 10]. These studies demonstrate the promise of artificial intelligence (AI) algorithms in learning clinically meaningful features. Nevertheless, existing methods tend to learn sub-optimal features from ophthalmic images because of three limitations. First, current methods ignore the fact that ophthalmic images are commonly distorted by artifacts due to segmentation failures in the context of degraded imaging quality and anatomical differences. For example, it has been reported that 35% to 56% of OCT images of the optic nerve have at least one artifact [11]. Artifacts may inhibit conventional AI methods from learning accurate features from ophthalmic images. Second, most current methods do not account for the fact that ophthalmic images are subject to variations in the testing procedure, such as patient head rotation and imperfect fixation [5, 12]; these methods therefore lack a mechanism to explicitly capture the intra- (i.e., feature invariance within the same image) and inter- (i.e., feature invariance between different images) image affinities. Third, current methods typically depend on massive quantities of image annotations to guide the model toward features related to the target tasks, making them less generalizable across tasks, especially when label information is not available.

Compared to prior work, we study a more generic question, namely feature representation or embedding learning of ophthalmic images. In this paper, representation learning and embedding refer to the same concept: learning a low-dimensional feature vector that represents each ophthalmic image with as much information preserved as possible while explicitly embedding the inter- and intra-image affinities. Embedding learning is a fundamental problem and is relevant in multiple respects. Intuitively, well-learned image representations lie in a machine-understandable latent space and can be readily used as inputs to many lightweight machine learning algorithms to enhance their performance [13]. In addition, a pretrained embedding learning model developed on one image type can serve as a starting point for developing deep learning models trained on other image types (i.e., transfer learning) [14]. Other recognized benefits of embedding learning include facilitating the fusion of multi-modal data [15], data interpretation and biomarker discovery [16], and data visualization [17]. The key to embedding learning lies in extracting features from data such that both intra- and inter-data relationships can be accurately captured in a latent space. There are two challenges when applying embedding learning to ophthalmic images:
• Prevalent Image Artifacts: Ophthalmic images are easily affected by artifacts; e.g., RNFLT artifacts can result from layer segmentation failures of the OCT software. Image artifacts distort the derived features, so the learned representations misrepresent the true image semantics.
• Obscure Image Affinities: Clinical interpretation of ophthalmic images is complicated primarily by the subtle retinal anatomical variations between patients, which can be difficult for clinicians to detect. The obscure inter-image affinities, combined with unpredictable image artifacts, make it challenging to learn distinguishing image representations.
We propose a general framework called EyeLearn for learning representations of ophthalmic images by taking the above two obstacles into account. As shown in Fig. 1, our strategy is to adopt an artifact-corrected embedding learning with contrastive embedding regularization to encourage distinguishing representations of ophthalmic images. Specifically, to address the first obstacle, we adopt an artifact correction-based embedding learning of the ophthalmic image, which forces the learned representation to reconstruct the complete image without artifact. To mitigate the second obstacle, we introduce a contrastive learning-based regularization, which encourages similar (i.e., within clusters) or dissimilar (i.e., between clusters) images to be represented with similar or dissimilar vectors. With this learning strategy of integrating artifact correction and contrastive regularization, EyeLearn is expected to output effective and discriminative representations of ophthalmic images that can be useful in relevant ophthalmic analytic tasks. We expect this to represent an improvement over the state-of-the-art methods which fall into either supervised or unsupervised learning.
In summary, the major contributions of this work are:
1) We propose artifact-tolerant embedding learning for ophthalmic images. It is fundamental and well suited to understanding ophthalmic images, which are often distorted by artifacts.
2) We propose EyeLearn, a general framework for representation learning of ophthalmic images. It is tolerant to image artifacts through an artifact correction module, and it pursues discriminative image representations through contrastive learning-based regularization.
3) We evaluate EyeLearn with extensive experiments on a large OCT dataset of glaucoma patients through visual field prediction and glaucoma detection. EyeLearn is superior to state-of-the-art methods that adopt either supervised or unsupervised representation learning.
Section 2 surveys the related work. Section 3 presents preliminaries, including the definition of the embedding learning problem and the building components used in our approach. The proposed model for embedding learning of ophthalmic images is introduced in Section 4. Section 5 reports the results of experiments validating the proposed approach. Finally, Section 6 summarizes the conclusions we draw from this work.
2 Related Work
We first review feature representation learning for medical images. Then, we briefly present recent work on contrastive representation learning. Finally, we summarize related works on representation learning for ophthalmic images.
2.1 Medical Image Representation Learning
Representation learning in the medical domain is concerned with extracting useful information that helps guide diagnostic decision-making. Common medical image modalities include X-ray, magnetic resonance imaging (MRI), functional MRI, computed tomography, ultrasound scanning, optical coherence tomography, etc. [18]. There are two broad paths of research on this topic: self-supervised or unsupervised visual representation learning and supervised visual representation learning [19]. For self-supervised methods, pretext tasks such as image semantic reconstruction and pseudo-label prediction are often used to achieve meaningful representations [20]. For example, Bai et al. [21] proposed anatomical position prediction as a pretext task for cardiac MRI segmentation. Prakash et al. [22] adopted image denoising as a pretext task for the segmentation of nuclei images. Li et al. [23] combined two colorization-based pretext tasks into a single multi-tasking framework called ColorMe to learn useful representations from scopy images. These pretext tasks can be roughly categorized as predictive, generative, contrastive or multi-tasking according to their working mechanisms [20]. For example, generative methods normally learn to generate medical images using a generative adversarial network [19], while multi-tasking methods seek to integrate various pretext tasks to co-optimize the representation learning [23]. For supervised methods, representations are trained end-to-end to explain the respective labels, such as the binary class of benign versus malignant tumors. Popular tasks [24] include detection and severity classification of tumors, skin lesions, colon cancer, blood cancer, etc., achieved by various convolutional neural network (CNN) architectures such as VGGNet [25] and U-Net [26].
Because the acquisition of quality annotations of medical images is expensive and depends heavily on clinicians’ experience and knowledge, unsupervised representation learning has gained growing attention, especially in the current wave of research on generative and contrastive representation learning [27]. Therefore, we adopt an unsupervised representation learning method in this work.
2.2 Contrastive Representation Learning
A recent breakthrough in contrastive learning has shed light on the potential of discriminative models for representation learning [28]. Contrastive learning aims to learn by comparing diverse augmented images while optimizing a noise-contrastive estimation loss, where a positive pair is usually formed by two augmented views of the same image and negative pairs are formed from different images [20]. SimCLR [28] learns representations by maximizing agreement between differently augmented views of the same data example at the instance level. Contrastive clustering [29] extends SimCLR to perform contrastive learning at both the instance and cluster levels, where soft labels of instances are utilized to regularize affinities between samples. To address the reliance of most contrastive learning methods on a large number of explicit pairwise comparisons, SwAV [30] simultaneously clusters the data while enforcing consistency between the cluster assignments produced for different augmentations. In a different way, InterCLR [31] uses a memory bank that stores running-average features of all samples in the dataset, computed in previous steps, to mitigate the limitation of small batch sizes. InterCLR additionally captures inter-image invariance using k-means clustering labels updated per mini-batch, where a positive sample for an input image is selected from the same cluster and negative samples are obtained from other clusters.
We adopt a learning paradigm similar to InterCLR to capture both the inter- and intra-image invariance of ophthalmic images. However, we update the image clusters after every epoch and use them to guide the contrastive learning in the following epoch. Since we use an artifact correction module as the backbone of EyeLearn to constrain the representation learning to be gradually optimized over the epochs, the clustering becomes stable as training proceeds and provides reliable guidance for contrastive learning.
2.3 Ophthalmic Image Representation Learning
Eye diseases such as glaucoma and diabetic retinopathy are chronic disorders that may result in irreversible blindness [1]. The past decade has seen the rise of AI methods to support the diagnosis of eye diseases [12, 5]. For example, Wang et al. [3] used RNFLT maps (221 samples) to predict glaucoma with a CNN model. Nayak et al. [32] proposed an evolutionary convolutional network model to predict glaucoma from fundus images (1,426 samples). An et al. [2] used both RNFLT maps and fundus images (357 samples) to predict glaucoma with a transfer learning-based CNN model. Since some ocular diseases are associated with functional vision loss, other work has attempted to predict the corresponding visual field damage from ophthalmic images. For example, Lazaridis et al. [7] used RNFLT maps (954 samples) to predict visual fields with a multi-input CNN and a multi-channel variational autoencoder, respectively. Christopher et al. [6] used both RNFLT maps and optic nerve head en face images (1,909 samples) to predict visual field damage with a transfer learning-based ResNet model. However, these methods use relatively small sample sizes for model training and evaluation, which tend to produce over-fitted models and biased evaluation outcomes. In addition, they assume that the ophthalmic images are noise-free or simply exclude the noisy samples before training, and therefore cannot handle images containing artifacts.
The essence of AI-based diagnosis is to learn good feature representations that reveal potential disease biomarkers in the ophthalmic image [14, 15]. However, large image variations due to varying image artifacts and patient-specific retinal anatomy make many standard deep learning methods suboptimal at learning quality representations. To mitigate this deficiency, we propose to correct the image artifacts while learning the representation. We also introduce memory bank-supported contrastive learning to encourage distinguishable representations between images. Unlike many supervised methods that require massive quantities of image annotations, our method is fully self-supervised, which makes it more generalizable for extracting features from various types of ophthalmic images.

3 Problem Definition and Preliminaries
First, we formally define the embedding learning problem studied in this paper. Then, we briefly introduce the principle of partial convolution used as a building block in EyeLearn as well as the memory bank introduced to facilitate the contrastive learning.
3.1 Problem Formulation
Without loss of generality, we consider that each ophthalmic image is represented as a two-dimensional (2D) image, e.g., an RNFLT map. Let $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$, $x_i \in \mathbb{R}^{W \times H}$, be a collection of 2D ophthalmic images, where $W$ and $H$ are the width and height of the image, respectively. Embedding learning aims to project each image $x_i$ into a $d$-dimensional latent space $\mathbb{R}^{d}$, i.e., $\mathbf{z}_i = f(x_i)$, where $f(\cdot)$ is the projection function defined by the embedding learning model EyeLearn (refer to Fig. 2 for an illustration) and $\mathbf{z}_i \in \mathbb{R}^{d}$ is the vector representation of $x_i$ that preserves the clinical information conveyed by the image. In addition, we seek to improve embedding learning performance by preserving the relative affinities between ophthalmic images through the contrastive learning-based regularization in EyeLearn.
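For intuition, the following toy PyTorch sketch illustrates such a projection $f$: a small convolutional encoder followed by global average pooling maps a single-channel 2D map to a $d$-dimensional vector. The architecture and input size here are arbitrary placeholders; EyeLearn's actual encoder is the partial-convolution U-Net described in Section 4.1.

```python
import torch
import torch.nn as nn

# Toy stand-in for the projection f: a small CNN encoder followed by global
# average pooling; EyeLearn's real encoder uses partial convolutions (Section 4.1).
encoder = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 512, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

x = torch.randn(1, 1, 128, 128)   # one single-channel map (arbitrary illustrative size)
z = encoder(x)                    # vector representation, here z.shape == (1, 512)
```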
3.2 Partial Convolution
Partial convolution [33] was proposed to handle images that contain missing pixels (e.g., holes). Let $\mathbf{X}_{(i,j)}$ be the pixels within the convolution window at location $(i,j)$ and $\mathbf{M}_{(i,j)}$ be the corresponding binary mask, with valid pixels being ones and invalid pixels being zeros. The partial convolution at location $(i,j)$ is defined as:

$$x'_{(i,j)} = \begin{cases} \mathbf{W}^{T}\left(\mathbf{X}_{(i,j)} \odot \mathbf{M}_{(i,j)}\right) r_{(i,j)} + b, & \text{if } \|\mathbf{M}_{(i,j)}\|_{1} > 0 \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$

with

$$r_{(i,j)} = \frac{\|\mathbf{1}\|_{1}}{\|\mathbf{M}_{(i,j)}\|_{1}}, \tag{2}$$

where $\|\cdot\|_{1}$ is the L1 norm, $\odot$ is the element-wise multiplication, $\mathbf{W}$ is the weight matrix and $b$ is the bias. $r_{(i,j)}$ is a scaling factor that adjusts for the varying number of valid input pixels within the convolution window, where $\mathbf{1}$ is a one-like matrix with the same shape as $\mathbf{M}_{(i,j)}$. We can observe from Eq. 1 that the output at every location depends only on the valid input pixels within the respective convolution window. This is achieved by maintaining a binary mask for the feature map at each convolution layer. After a partial convolution layer $l$, the mask value $m^{(l+1)}_{(i,j)}$ at location $(i,j)$ for the next layer $l+1$ is updated as:

$$m^{(l+1)}_{(i,j)} = \begin{cases} 1, & \text{if } \|\mathbf{M}_{(i,j)}\|_{1} > 0 \\ 0, & \text{otherwise,} \end{cases} \tag{3}$$

meaning that if there exists at least one valid input pixel within the convolution window that contributes to the output $x'_{(i,j)}$, location $(i,j)$ is a valid pixel for the next convolution layer; otherwise, it is an invalid pixel.
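As a concrete illustration, below is a minimal PyTorch sketch of a partial convolution layer following Eqs. 1-3; the class name `PartialConv2d` and the single-channel mask handling are simplifying assumptions, not the authors' implementation. The convolution is applied to the masked input, re-scaled by the ratio of the window size to the number of valid pixels, and the mask is propagated to the next layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    """Minimal partial convolution (Liu et al. [33]), following Eqs. 1-3."""
    def __init__(self, in_ch, out_ch, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride, padding, bias=True)
        # Fixed all-ones kernel used only to count valid pixels per window.
        self.register_buffer("ones_kernel", torch.ones(1, 1, kernel_size, kernel_size))
        self.window_size = kernel_size * kernel_size
        self.stride, self.padding = stride, padding

    def forward(self, x, mask):
        # mask: (B, 1, H, W) with 1 = valid pixel, 0 = invalid (artifact/hole).
        with torch.no_grad():
            valid_count = F.conv2d(mask, self.ones_kernel,
                                   stride=self.stride, padding=self.padding)
        out = self.conv(x * mask)                       # W^T (X ⊙ M) + b
        bias = self.conv.bias.view(1, -1, 1, 1)
        # Re-scale by sum(1)/sum(M) for windows with at least one valid pixel (Eq. 2).
        scale = self.window_size / valid_count.clamp(min=1.0)
        out = (out - bias) * scale + bias
        # Zero the output and update the mask where no valid pixel exists (Eqs. 1, 3).
        new_mask = (valid_count > 0).float()
        return out * new_mask, new_mask
```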
3.3 Memory Bank
During contrastive learning, negative samples are normally constructed from samples within a mini-batch. However, contrastive learning prefers a large number of negative samples to learn quality representations [31, 28]. One approach would be to set a very large batch size, but this would require impractically large computational resources. As an alternative, we use a memory bank of size $K$ to store the representations and clustering labels of the $K$ most recent running samples. Formally, the memory bank (i.e., a size-invariant queue) can be represented as $\mathcal{B} = \{(\mathbf{z}_k, c_k)\}_{k=1}^{K}$, where $\mathbf{z}_k$ is the up-to-date representation of the $k$-th stored ophthalmic image and $c_k$ is its clustering label. In this paper, the image features are updated at the end of each training batch, while the clustering labels are updated at the end of each training epoch.
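A minimal sketch of such a memory bank as a fixed-size FIFO queue is given below; the class and method names are hypothetical, and the bank size of 800 follows Section 5.3. Detached feature vectors are enqueued after every batch, labels are refreshed after every epoch, and negatives are drawn from the stored entries.

```python
import collections
import numpy as np

class MemoryBank:
    """Fixed-size queue of (representation, cluster_label) pairs."""
    def __init__(self, size=800):
        self.size = size
        self.features = collections.deque(maxlen=size)  # detached running representations
        self.labels = collections.deque(maxlen=size)    # cluster labels of stored samples

    def enqueue(self, batch_features, batch_labels):
        # Called at the end of every training batch.
        for z, c in zip(batch_features, batch_labels):
            self.features.append(np.asarray(z))
            self.labels.append(int(c))

    def relabel(self, new_labels):
        # Called at the end of every epoch after re-running k-means on the bank.
        self.labels = collections.deque(new_labels, maxlen=self.size)

    def sample_negatives(self, n, exclude_cluster=None):
        # Draw n stored representations, optionally avoiding a given cluster.
        idx = [i for i, c in enumerate(self.labels)
               if exclude_cluster is None or c != exclude_cluster]
        chosen = np.random.choice(idx, size=min(n, len(idx)), replace=False)
        return np.stack([self.features[i] for i in chosen])
```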
4 Methodology
The proposed framework for learning ophthalmic image representations, EyeLearn, is shown in Fig. 2. EyeLearn consists of an artifact correction-based representation learning module and a contrastive learning-based regularization module, where the latter encapsulates both an intra-contrastive learning and an inter-contrastive learning. Taken together, EyeLearn trains to optimize the following three parts:
• Artifact Correction-based Embedding Learning: the backbone, which reconstructs artifact pixels in masked areas of the input image and outputs its vector representation. This part is trained to optimize an artifact reconstruction loss.
• Intra-Contrastive Learning-based Regularization: applies image augmentations and encourages augmented views of the same image to learn similar representations and views of different images to learn dissimilar representations. This part is trained to optimize an intra-contrastive loss.
• Inter-Contrastive Learning-based Regularization: encourages images in the same cluster to learn similar representations and images in different clusters to learn dissimilar representations. This part is trained to optimize an inter-contrastive loss.
4.1 Artifact Correction-based Embedding Learning
Ophthalmic images are often affected by artifacts or noise, which prevent accurate interpretation and diagnosis of eye diseases. We propose to correct artifacts while learning the feature representation of the image to address this problem. As shown in Fig. 2, the artifact correction is implemented by a U-Net-like structure [26] comprising an encoder that learns the representation and a decoder that reconstructs the image from the learned representation. This encoder-decoder learning scheme gives EyeLearn the potential to learn representations that reflect the corrected, artifact-free version of the image.
Specifically, each encoder layer is a stack of three successive operation layers: a partial convolution layer (refer to Section 3.2 for details) to extract features from the masked image, a normalization layer to normalize the convolution feature map, and a non-linear activation layer (e.g., ReLU) to transform the feature map. In comparison, each decoder layer additionally includes an up-sampling layer and a concatenation layer before the partial convolution, normalization and non-linear activation layers. The concatenation operation concatenates the two feature maps from the encoder and decoder layers at the same level. Notably, the input of the last decoder layer contains the concatenation of the original masked image so that the model can reuse valid pixels for image reconstruction. Formally, for each image $x_i$, the model takes the ground truth image $I_{gt}$, the masked image $I_{in}$, and the binary mask $\mathbf{M}$ (invalid pixels are zeros) as inputs. It then trains to minimize the differences between the ground truth image $I_{gt}$ and the decoder-corrected image $I_{out}$ with respect to both valid and invalid locations:

$$\mathcal{L}_{valid} = \frac{1}{N_{I_{gt}}} \left\| \mathbf{M} \odot \left( I_{out} - I_{gt} \right) \right\|_{1}, \tag{4}$$

$$\mathcal{L}_{hole} = \frac{1}{N_{I_{gt}}} \left\| \left( \mathbf{1} - \mathbf{M} \right) \odot \left( I_{out} - I_{gt} \right) \right\|_{1}, \tag{5}$$

where $N_{I_{gt}}$ denotes the number of pixel locations in $I_{gt}$. In addition, several other losses have proven useful for characterizing the similarity between the ground truth and predicted images [33] in terms of different high-level features, including the perceptual loss $\mathcal{L}_{perc}$, the style losses $\mathcal{L}_{style_{out}}$ and $\mathcal{L}_{style_{comp}}$, and the total variation loss $\mathcal{L}_{tv}$. These losses are calculated from the outputs of an ImageNet-pretrained VGG-16 [25] that takes $I_{gt}$ and $I_{out}$ as inputs; their formal definitions have been established in the literature [33]. Taken together, the autoencoder training aims to optimize a collective reconstruction loss:

$$\mathcal{L}_{recon} = \mathcal{L}_{valid} + \lambda_{1}\mathcal{L}_{hole} + \lambda_{2}\mathcal{L}_{perc} + \lambda_{3}\mathcal{L}_{style_{out}} + \lambda_{4}\mathcal{L}_{style_{comp}} + \lambda_{5}\mathcal{L}_{tv}. \tag{6}$$

Five weight parameters ($\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, $\lambda_{4}$ and $\lambda_{5}$) are used to balance the importance of the different parts. Finally, the encoder output, after an average pooling, is taken as the vector representation $\mathbf{z}_i \in \mathbb{R}^{d}$ for the input image $x_i$, where $d$ is the representation dimension.
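The per-pixel terms of Eqs. 4-5 and the weighted sum of Eq. 6 can be sketched as follows. This is a simplified illustration that assumes the perceptual, style and total variation terms are computed elsewhere from VGG-16 features as in [33]; the default weights follow Section 5.3.

```python
import torch

def valid_hole_losses(i_out, i_gt, mask):
    """L_valid and L_hole (Eqs. 4-5): L1 error on valid vs. masked pixel locations."""
    n = i_gt.numel()
    l_valid = torch.abs(mask * (i_out - i_gt)).sum() / n
    l_hole = torch.abs((1.0 - mask) * (i_out - i_gt)).sum() / n
    return l_valid, l_hole

def reconstruction_loss(i_out, i_gt, mask, l_perc, l_style_out, l_style_comp, l_tv,
                        weights=(6.0, 0.05, 1.0, 1.0, 0.1)):
    """Collective reconstruction loss (Eq. 6) with weights lambda_1..lambda_5."""
    l_valid, l_hole = valid_hole_losses(i_out, i_gt, mask)
    w1, w2, w3, w4, w5 = weights
    return (l_valid + w1 * l_hole + w2 * l_perc
            + w3 * l_style_out + w4 * l_style_comp + w5 * l_tv)
```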
4.2 Intra-Contrastive Learning-based Regularization
Intra-contrastive learning seeks to capture the intra-image invariance through data augmentations. Each image $x_i$ is augmented to generate two variant images $\tilde{x}_i^{a}$ and $\tilde{x}_i^{b}$, which are enforced to learn similar representations relative to other augmented images, i.e., they form a positive pair. In this paper, we adopt five augmentation methods to randomly transform the image: center crop, rotation, zooming, width shift and height shift. We design these augmentation types to account for the fact that ophthalmic image variances may arise from the varying conditions (e.g., the distance of the eye to the lens and eye rotation) of patients when the images were taken.
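A sketch of this augmentation pipeline with torchvision is shown below; the crop size, rotation angle, zoom range and shift range are illustrative assumptions rather than the paper's exact settings.

```python
import torchvision.transforms as T

# Illustrative parameter ranges; the paper's exact augmentation settings are not specified here.
augment = T.Compose([
    T.CenterCrop(200),                                 # center crop
    T.Resize(224),                                     # resize back to the model input size (placeholder)
    T.RandomRotation(degrees=15),                      # rotation
    T.RandomAffine(degrees=0, scale=(0.9, 1.1)),       # zooming
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # width / height shift
])

def two_views(image):
    """Generate the two augmented views that form a positive pair."""
    return augment(image), augment(image)
```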
In addition, for each augmented image (e.g., $\tilde{x}_i^{a}$ or $\tilde{x}_i^{b}$), $Q$ images are chosen to form negative pairs. Instead of constructing negative samples within each mini-batch as many previous works do [28, 29], we maintain a memory bank (Fig. 2) implemented as a queue that stores the most recent representations of $K$ ($K \gg$ batch size) images computed from the latest mini-batches. We then randomly sample $Q$ images from the memory bank to form negative pairs for each augmented image. Following previous work [28], we compute the normalized temperature-scaled cross entropy (NT-Xent) as the contrastive loss for each positive pair $(\tilde{x}_i^{a}, \tilde{x}_i^{b})$ after a projection head $g(\cdot)$:

$$\ell_{intra}(\tilde{x}_i^{a}, \tilde{x}_i^{b}) = -\log \frac{\exp\!\left(\mathrm{sim}\!\left(g(\tilde{\mathbf{z}}_i^{a}), g(\tilde{\mathbf{z}}_i^{b})\right)/\tau\right)}{\exp\!\left(\mathrm{sim}\!\left(g(\tilde{\mathbf{z}}_i^{a}), g(\tilde{\mathbf{z}}_i^{b})\right)/\tau\right) + \sum_{q=1}^{Q}\exp\!\left(\mathrm{sim}\!\left(g(\tilde{\mathbf{z}}_i^{a}), g(\mathbf{z}_q)\right)/\tau\right)}, \tag{7}$$

where $\tilde{\mathbf{z}}_i^{a}$ and $\tilde{\mathbf{z}}_i^{b}$ are the representations of the two augmented views, $\mathbf{z}_q$ is a negative representation sampled from the memory bank, $\mathrm{sim}(\cdot,\cdot)$ is the similarity function, $\tau$ is a temperature parameter, and $g(\cdot)$ is the projection head implemented as a three-layer multilayer perceptron (MLP).
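A minimal sketch of the NT-Xent loss of Eq. 7 with memory-bank negatives is given below, assuming the inputs are already projected embeddings and that cosine similarity is used as the similarity function.

```python
import torch
import torch.nn.functional as F

def nt_xent(z_a, z_b, negatives, temperature=0.5):
    """NT-Xent loss (Eq. 7) for one positive pair (z_a, z_b) and Q negatives.

    z_a, z_b:   (d,) projected embeddings of the positive pair
    negatives:  (Q, d) embeddings sampled from the memory bank
    """
    z_a = F.normalize(z_a, dim=0)
    z_b = F.normalize(z_b, dim=0)
    negatives = F.normalize(negatives, dim=1)
    pos = torch.exp(torch.dot(z_a, z_b) / temperature)       # positive-pair similarity
    neg = torch.exp(negatives @ z_a / temperature).sum()     # sum over Q negatives
    return -torch.log(pos / (pos + neg))
```

The same helper also serves the inter-contrastive loss of Eq. 8 when the positive partner is drawn from the anchor's cluster and the negatives from other clusters.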
4.3 Inter-Contrastive Learning-based Regularization
While intra-contrastive learning captures the feature invariance between multiple augmented versions of an image, inter-contrastive learning seeks to explicitly model the relative affinities between different images. To this end, we adopt a clustering-guided image sampling strategy for constructing positive and negative samples. As shown in Fig. 2, we organize samples into clusters by performing clustering (i.e., k-means) after every epoch, so that images are dynamically assigned their respective clustering labels. The newly updated labels after each epoch are immediately reflected in the memory bank for the next epoch. For an image $x_i$ in cluster $c$, we randomly sample an image $x_j$ with label $c$ from the memory bank to form a positive pair with $x_i$; we then randomly sample $Q$ images from other clusters to form negative pairs. Inter-contrastive learning therefore minimizes the representation difference between images within the same cluster while encouraging dissimilar representations for images in different clusters. Finally, the inter-contrastive loss based on NT-Xent for each positive pair $(x_i, x_j)$ and the $Q$ negative pairs is calculated as:

$$\ell_{inter}(x_i, x_j) = -\log \frac{\exp\!\left(\mathrm{sim}\!\left(g(\mathbf{z}_i), g(\mathbf{z}_j)\right)/\tau\right)}{\exp\!\left(\mathrm{sim}\!\left(g(\mathbf{z}_i), g(\mathbf{z}_j)\right)/\tau\right) + \sum_{q=1}^{Q}\exp\!\left(\mathrm{sim}\!\left(g(\mathbf{z}_i), g(\mathbf{z}_q)\right)/\tau\right)}. \tag{8}$$
Since we use the artifact correction module in EyeLearn as the backbone for learning image representations, the clustering based on the learned representations becomes stable as training proceeds, thus providing reliable guidance for contrastive sampling.
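The clustering-guided sampling step can be sketched as follows, assuming a memory bank like the one in Section 3.3 and scikit-learn's k-means; the function names are hypothetical, and 7 clusters and 16 negatives follow Section 5.3.

```python
import numpy as np
from sklearn.cluster import KMeans

def refresh_cluster_labels(bank_features, n_clusters=7, seed=0):
    """Re-cluster all banked representations at the end of an epoch."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(np.asarray(bank_features))

def sample_inter_pairs(bank_features, bank_labels, anchor_label, n_negatives=16):
    """Pick one positive from the anchor's cluster and negatives from the other clusters."""
    bank_features = np.asarray(bank_features)
    bank_labels = np.asarray(bank_labels)
    pos_pool = np.where(bank_labels == anchor_label)[0]
    neg_pool = np.where(bank_labels != anchor_label)[0]
    positive = bank_features[np.random.choice(pos_pool)]
    negatives = bank_features[np.random.choice(neg_pool, size=n_negatives, replace=False)]
    return positive, negatives
```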
4.4 Optimization of EyeLearn
In EyeLearn, the artifact correction-based embedding learning and the contrastive learning-based regularization share the same learning core (i.e., the encoder), which collectively optimizes the model to learn optimal representations of ophthalmic images under the combined objective:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left( \mathcal{L}_{recon}^{(i)} + \alpha\, \ell_{intra}^{(i)} + \beta\, \ell_{inter}^{(i)} \right), \tag{9}$$

where $N$ denotes the number of training samples, and $\alpha$ and $\beta$ are weight parameters used to balance the contributions of intra-contrastive and inter-contrastive learning, respectively. The entire training and optimization procedure of EyeLearn is summarized in Algorithm 1.
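A minimal sketch of how the three per-sample terms are combined into the objective of Eq. 9 is shown below, with $\alpha$ = 0.002 and $\beta$ = 0.001 as in Section 5.3; the dummy inputs stand in for the per-sample losses produced by the shared encoder.

```python
import torch

def eyelearn_objective(l_recon, l_intra, l_inter, alpha=0.002, beta=0.001):
    """Combined loss of Eq. 9, averaged over the N samples of a mini-batch.

    l_recon, l_intra, l_inter: 1-D tensors of per-sample loss values.
    """
    per_sample = l_recon + alpha * l_intra + beta * l_inter
    return per_sample.mean()

# Illustrative call with dummy per-sample losses for a batch of 4:
loss = eyelearn_objective(torch.rand(4, requires_grad=True),
                          torch.rand(4, requires_grad=True),
                          torch.rand(4, requires_grad=True))
loss.backward()  # in the full model, gradients flow back into the shared encoder
```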
TABLE 1: Visual field (MD) prediction performance (MAE / R2) at different training data ratios.

Methods        | 10%           | 30%           | 50%           | 70%           | 90%
RPCA           | 4.373 / 0.111 | 3.669 / 0.341 | 3.648 / 0.382 | 3.580 / 0.403 | 3.344 / 0.456
DAE            | 3.426 / 0.439 | 3.293 / 0.476 | 3.199 / 0.501 | 3.161 / 0.505 | 3.119 / 0.523
SimCLR         | 3.494 / 0.392 | 3.399 / 0.427 | 3.350 / 0.437 | 3.344 / 0.443 | 3.337 / 0.448
EyeLearn_recon | 3.262 / 0.460 | 3.142 / 0.495 | 3.100 / 0.506 | 3.078 / 0.513 | 3.065 / 0.517
EyeLearn_intra | 3.452 / 0.424 | 3.329 / 0.451 | 3.277 / 0.461 | 3.254 / 0.466 | 3.235 / 0.470
EyeLearn_inter | 3.201 / 0.478 | 3.079 / 0.508 | 3.038 / 0.518 | 3.019 / 0.522 | 3.002 / 0.526
EyeLearn       | 3.079 / 0.494 | 3.002 / 0.519 | 2.963 / 0.529 | 2.948 / 0.535 | 2.933 / 0.540
TABLE 2: Glaucoma detection performance (Accuracy / F1) at different training data ratios.

Methods        | 10%           | 30%           | 50%           | 70%           | 90%
RPCA           | 68.05 / 68.94 | 77.24 / 78.51 | 77.03 / 78.12 | 78.88 / 79.71 | 79.67 / 80.92
DAE            | 76.70 / 77.63 | 77.63 / 79.37 | 79.34 / 80.16 | 80.26 / 80.72 | 80.50 / 81.36
SimCLR         | 75.57 / 76.83 | 77.27 / 78.92 | 77.71 / 79.38 | 77.88 / 79.39 | 77.98 / 79.84
EyeLearn_recon | 77.56 / 78.96 | 79.16 / 80.70 | 79.65 / 81.25 | 79.99 / 81.55 | 80.20 / 81.76
EyeLearn_intra | 78.04 / 80.72 | 79.53 / 81.69 | 80.08 / 82.12 | 80.36 / 82.30 | 80.53 / 82.44
EyeLearn_inter | 79.18 / 80.67 | 80.50 / 82.03 | 81.10 / 82.61 | 81.34 / 82.80 | 81.50 / 82.95
EyeLearn       | 79.91 / 81.57 | 80.82 / 82.37 | 81.19 / 82.73 | 81.47 / 82.97 | 81.57 / 83.09
5 EVALUATION METHODS AND RESULTS
We evaluate EyeLearn using a real-world ophthalmic image dataset, i.e., RNFLT maps of glaucoma patients, and compare it with prior work on glaucoma detection and visual field prediction. Based on the learned representations, visual field prediction aims to predict the mean deviation (MD) level of vision loss, while glaucoma detection aims to predict the binary glaucomatous or non-glaucomatous status of patients.
5.1 Datasets
The dataset includes 30,953 RNFLT maps (each containing pixels) from 11,284 unique patients who received a clinical diagnosis between 2010 and 2021, comprising 14,730 left-eye and 16,223 right-eye RNFLT maps. All left-eye RNFLT maps were horizontally flipped to the right-eye format. Pretraining of the EyeLearn model used 22,953 RNFLT maps; another 2,000 RNFLT maps were used for parameter selection. The remaining 6,000 RNFLT maps were used to evaluate glaucoma detection and visual field mean deviation (MD) prediction.
5.2 Comparative Methods
Glaucoma detection and visual field prediction with deep learning methods have received increasing attention [34, 12], but few existing methods focus on unsupervised feature representation learning. We compare against the following high-performing methods relevant to unsupervised representation learning of RNFLT maps:
• RPCA [35]. Robust principal component analysis (PCA) is a modified version of the widely used PCA method that works well with noisy or corrupted data.
• DAE [36]. The denoising autoencoder extends the autoencoder (AE) by changing the reconstruction criterion and is effective at learning representations from noisy or corrupted data.
• SimCLR [28]. A contrastive learning method that applies data augmentations and encourages augmented views of the same image to learn similar representations and views of different images to learn dissimilar representations.
• EyeLearn. Our proposed artifact-corrected representation learning combined with intra-contrastive and inter-contrastive learning-based regularization.
We perform glaucoma detection and visual field prediction based on the representations learned by the above methods. We further compare EyeLearn with several state-of-the-art visual representation learning methods in two categories designed for end-to-end supervised classification training, including the vision transformer, which has been used for related glaucoma detection tasks [37]:
• Vision Transformers. These are recent vision feature learning models that split an image into patches and use self-attention to capture the dependencies between patches; we compare with ViT [39], the compact convolutional transformer (CCT) [38] and CrossViT [40].
• Vision Graph Neural Network (GNN). This is a recent vision feature learning model that represents an image as a graph of small patches and uses a GNN to capture the semantic relevance between neighboring patches [41].
Because the vision transformer and vision GNN models are designed for supervised image classification, we compare their performance on the binary glaucoma detection task in the experiment.
5.3 Experimental Parameters
For glaucoma detection, we train a linear support vector classifier (SVC) based on learned representations and adopt the Accuracy (Acc) and F1 score as evaluation metrics. For visual field (i.e., MD) prediction, we train a ridge regression model and adopt the mean absolute error (MAE) and R-squared score (R2) as evaluation metrics. We repeat the evaluations 40 times with random partitions of various ratios of training data. The mean performances and standard deviations are reported.
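A sketch of this evaluation protocol with scikit-learn is given below, assuming `Z` holds the frozen learned representations and `y_md`, `y_glaucoma` hold the MD values and binary glaucoma labels (variable names are illustrative). Averaging the returned metrics over 40 random seeds reproduces the reporting scheme described above.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error, r2_score

def evaluate_once(Z, y_md, y_glaucoma, train_ratio=0.1, seed=0):
    """One random split: ridge regression for MD, linear SVC for glaucoma detection."""
    idx_train, idx_test = train_test_split(
        np.arange(len(Z)), train_size=train_ratio, random_state=seed)

    ridge = Ridge().fit(Z[idx_train], y_md[idx_train])
    md_pred = ridge.predict(Z[idx_test])

    svc = LinearSVC().fit(Z[idx_train], y_glaucoma[idx_train])
    cls_pred = svc.predict(Z[idx_test])

    return {
        "MAE": mean_absolute_error(y_md[idx_test], md_pred),
        "R2": r2_score(y_md[idx_test], md_pred),
        "Accuracy": accuracy_score(y_glaucoma[idx_test], cls_pred),
        "F1": f1_score(y_glaucoma[idx_test], cls_pred),
    }
```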
We train the EyeLearn model for 80 epochs with a batch size of 4 and adopt the following parameter settings. The learning rate for the Adam optimizer is 0.0002. In Eq. 6, $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, $\lambda_{4}$ and $\lambda_{5}$ are set to 6, 0.05, 1, 1 and 0.1, respectively. In Eq. 9, $\alpha$ and $\beta$ are set to 0.002 and 0.001, respectively. The representation dimension $d$ is set to 512, the number of image clusters to 7, and the memory bank size $K$ and the number of negative samples $Q$ to 800 and 16, respectively. Detailed parameter settings can be found in our implementation of the EyeLearn model (https://github.com/codesharea/EyeLearn).
5.4 Results
In this section, we first report the embedding performance of various baseline methods by the downstream visual field prediction and glaucoma detection tasks. Then, we examine the contributions of various learning components in EyeLearn through the ablation study. Finally, we compare the performance at various proportions of artifacts in the RNFLT maps.
5.4.1 Visual Field Prediction
The visual field prediction results are shown in Table 1. We can summarize three major observations: a) Deep learning models such as DAE, SimCLR and EyeLearn learn better representations from RNFLT maps than the conventional model RPCA, which we attribute to the efficient feature learning and optimization strategies adopted in deep learning models; b) The autoencoder-based methods, i.e., EyeLearn and DAE, generally outpace the purely contrastive learning model SimCLR, possibly because RNFLT maps share many similar features, making contrastive learning alone ineffective at discriminating between images; in contrast, the autoencoder trains to reconstruct the input features from the learned representation, which is a more robust route to meaningful representations; c) Our method has better generalizability (i.e., more stable performance) as the training ratio changes from 10% to 90%. For example, EyeLearn achieves better prediction performance with less data (i.e., 10%) than other methods trained on more data (i.e., 30%). Our method, integrating both the autoencoder and contrastive learning, performs the best. The average MAE and R2 of EyeLearn improved by 0.255 and 3.4% over DAE, and by 0.400 and 9.4% over SimCLR, respectively, which verifies the superiority of EyeLearn for the embedding learning of ophthalmic images.

5.4.2 Glaucoma Detection
The glaucoma detection results for the unsupervised methods are shown in Table 2. We observe that DAE and EyeLearn outperform the other baseline methods, which again verifies the advantage of autoencoder-based embedding learning of ophthalmic images. Interestingly, unlike in visual field prediction, SimCLR does not outperform RPCA at the 90% training ratio, meaning the representations learned by SimCLR do not generalize equally well across downstream tasks. Among the algorithms, EyeLearn achieves the best prediction performance, which demonstrates its superiority and that its learned representations generalize well. In addition, EyeLearn achieves better detection performance using less training data (i.e., 10%) than RPCA, DAE and SimCLR, which require more data (i.e., 90%) to achieve comparable results. This indicates that EyeLearn has better robustness and generalizability than the other methods.
We further compare with state-of-the-art supervised embedding learning methods; the results are shown in Table 3. We train an SVC classifier and a 2-layer MLP classifier for glaucoma detection based on the representations previously learned by EyeLearn in an unsupervised manner. Our method is superior to the vision transformers and achieves performance (Accuracy and F1 score) comparable to the vision GNN. The number of floating point operations (FLOPs) is a vital measure of a model's computational cost; we observe from Table 3 that EyeLearn requires fewer FLOPs than ViG and CCT, which are the top-two supervised embedding learning methods.



5.4.3 Ablation Study
We develop three variants of EyeLearn to examine the benefits of its different learning parts: EyeLearn_recon removes the contrastive learning-based regularization; EyeLearn_intra removes the inter-contrastive learning-based regularization; and EyeLearn_inter removes the intra-contrastive learning-based regularization. The performance of these variants on visual field prediction and glaucoma detection is shown in Tables 1 and 2. We observe that EyeLearn_intra performs better than EyeLearn_recon in glaucoma detection but not in visual field prediction. This is possibly because intra-contrastive learning captures the intra-feature invariance within an image while ignoring the inter-image feature invariance, which is important for the representations used in visual field prediction. In contrast, EyeLearn_inter outperforms EyeLearn_recon, which demonstrates that capturing inter-image invariance is useful for learning improved embeddings. However, simultaneously capturing the intra-feature and inter-image invariance delivers the best embedding learning model (EyeLearn), as observed in Tables 1 and 2.
5.4.4 Comparison at Different Proportions of Artifact
We inject different proportions of artifact noise into the RNFLT maps before inferring their representations with the pretrained models. The comparative results in Fig. 3 show that EyeLearn consistently performs better than the other methods and exhibits relatively stable performance across different percentages of artifacts. The reason is that EyeLearn is designed to handle image artifacts, so the learned representation is likely to reflect the image with the artifacts corrected.
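The exact artifact-injection procedure is not detailed here; the following is one plausible sketch, under that assumption, of how a given fraction of pixels could be invalidated (zeroed, matching the binary-mask convention) in an RNFLT map for this robustness test.

```python
import numpy as np

def mask_random_region(rnflt_map, artifact_fraction=0.2, seed=0):
    """Zero out a random rectangular region covering roughly the given pixel fraction."""
    rng = np.random.default_rng(seed)
    h, w = rnflt_map.shape
    area = int(artifact_fraction * h * w)
    rh = rng.integers(1, h)                        # random rectangle height
    rw = min(w, max(1, area // rh))                # width chosen to match the target area
    top = rng.integers(0, h - rh + 1)
    left = rng.integers(0, w - rw + 1)
    corrupted = rnflt_map.copy()
    corrupted[top:top + rh, left:left + rw] = 0.0  # invalid pixels, as in the binary mask
    return corrupted
```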
5.5 Case Study
We examine the impact of correcting image artifacts with example images from three categories (Fig. 4). We observe that EyeLearn can impute visually meaningful pixels in the artifact regions; for example, the lower thickness bundles missing in A and B are roughly predicted by our model. We further compute the pairwise Pearson correlations between images in the second row (i.e., based on the raw pixel features) and the third row (i.e., based on the representations learned by EyeLearn); the results are shown in Fig. 5. The images with artifacts collapse into only two categories, whereas EyeLearn tends to recover the three true categories once the artifacts are corrected. In addition, after correcting the image artifacts, the pairwise correlations within the three categories become closer, as can be seen by comparing the color maps in Fig. 5 (a) and (b). This means that EyeLearn is able to learn close representations for similar images and to recover their true relationships that had been distorted by artifacts.
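A sketch of the pairwise Pearson correlation computation used in this case study is given below, assuming one row per image for both the raw pixel features and the learned representations (array names and sizes are illustrative).

```python
import numpy as np

def pairwise_pearson(rows):
    """Pairwise Pearson correlation matrix between the rows of a 2-D array."""
    return np.corrcoef(np.asarray(rows))

# Toy example: 3 "images" described by raw pixels vs. learned embeddings.
raw_pixels = np.random.rand(3, 128 * 128)   # flattened maps (arbitrary illustrative size)
embeddings = np.random.rand(3, 512)         # learned representations (d = 512)
corr_raw = pairwise_pearson(raw_pixels)     # analogue of Fig. 5 (a)
corr_emb = pairwise_pearson(embeddings)     # analogue of Fig. 5 (b)
```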
5.6 Parameter Sensitivity
We study the impacts of important hyperparameters, including the weight parameters $\alpha$ and $\beta$ in Eq. 9, the number of clusters in the clustering-guided contrastive learning, and the size of the memory bank $K$. Fig. 6 shows that variations of these parameters have significant impacts on model performance. For example, EyeLearn tends to achieve improved embedding performance with a larger memory bank, as observed from the trend in Fig. 6 (d).
6 Conclusion
Learning the feature representations of ophthalmic images is fundamental to computer-aided eye disease diagnosis. Recent work has explored the potential to extract biomarker features from ophthalmic images using AI techniques for detecting and monitoring the progression of glaucoma. The challenges for embedding ophthalmic images include prevalent image artifacts and obscure image affinities. To address these problems, we propose EyeLearn, which is a self-supervised framework for representation learning of ophthalmic images. EyeLearn adopts an artifact correction-based representation learning, which enables the learned representation to reflect the corrected image without artifacts. Another feature in EyeLearn is the contrastive learning-based regularization used to encourage improved and discriminative representations between ophthalmic images. We follow a clustering-guided contrastive learning which emphasizes the intra-feature invariance through image augmentation and the inter-image feature invariance by organizing images in clusters (i.e., images in the same clusters learn similar representations).
We designed comprehensive experiments to evaluate the feature representations learned by EyeLearn using a large ophthalmic image dataset of glaucoma patients. Experimental results show that EyeLearn outperforms strong unsupervised representation learning models by a large margin. The comparison also shows that EyeLearn performs comparably to or better than some state-of-the-art supervised methods in terms of model effectiveness and computational efficiency. The case study further demonstrates that EyeLearn is useful for correcting artifacts and learning the actual relations between ophthalmic images.
References
- [1] H. R. Taylor and J. E. Keeffe, “World blindness: a 21st century perspective,” British journal of ophthalmology, vol. 85, no. 3, pp. 261–266, 2001.
- [2] G. An, K. Omodaka, K. Hashimoto, S. Tsuda, Y. Shiga, N. Takada, T. Kikawa, H. Yokota, M. Akiba, and T. Nakazawa, “Glaucoma diagnosis with machine learning based on optical coherence tomography and color fundus images,” Journal of healthcare engineering, vol. 2019, 2019.
- [3] P. Wang, J. Shen, R. Chang, M. Moloney, M. Torres, B. Burkemper, X. Jiang, D. Rodger, R. Varma, and G. M. Richter, “Machine learning models for diagnosing glaucoma from retinal nerve fiber layer thickness maps,” Ophthalmology Glaucoma, vol. 2, no. 6, pp. 422–428, 2019.
- [4] M. Wang, L. Q. Shen, L. R. Pasquale, H. Wang, D. Li, E. Y. Choi, S. Yousefi, P. J. Bex, and T. Elze, “An artificial intelligence approach to assess spatial patterns of retinal nerve fiber layer thickness maps in glaucoma,” Translational vision science & technology, vol. 9, no. 9, pp. 41–41, 2020.
- [5] S. Sengupta, A. Singh, H. A. Leopold, T. Gulati, and V. Lakshminarayanan, “Ophthalmic diagnosis using deep learning with fundus images–a critical review,” Artificial Intelligence in Medicine, vol. 102, p. 101758, 2020.
- [6] M. Christopher, C. Bowd, A. Belghith, M. H. Goldbaum, R. N. Weinreb, M. A. Fazio, C. A. Girkin, J. M. Liebmann, and L. M. Zangwill, “Deep learning approaches predict glaucomatous visual field damage from optical coherence tomography optic nerve head enface images and retinal nerve fiber layer thickness maps,” Ophthalmology, vol. 127, no. 3, p. 346, 2020.
- [7] G. Lazaridis, G. Montesano, S. S. Afgeh, J. Mohamed-Noriega, S. Ourselin, M. Lorenzi, and D. F. Garway-Heath, “Predicting visual fields from optical coherence tomography via an ensemble of deep representation learners,” American journal of ophthalmology, 2022.
- [8] S. Das, K. Kharbanda, M. Suchetha, R. Raman, and E. Dhas, “Deep learning architecture based on segmented fundus image features for classification of diabetic retinopathy,” Biomedical Signal Processing and Control, vol. 68, p. 102600, 2021.
- [9] T. Saba, S. Akbar, H. Kolivand, and S. Ali Bahaj, “Automatic detection of papilledema through fundus retinal images using deep learning,” Microscopy Research and Technique, vol. 84, no. 12, pp. 3066–3077, 2021.
- [10] A. S. Mursch-Edlmayr, W. S. Ng, A. Diniz-Filho, D. C. Sousa, L. Arnould, M. B. Schlenker, K. Duenas-Angeles, P. A. Keane, J. G. Crowston, and H. Jayaram, “Artificial intelligence algorithms to diagnose glaucoma and detect glaucoma progression: translation to clinical practice,” Translational vision science & technology, vol. 9, no. 2, pp. 55–55, 2020.
- [11] S. Choi, F. Jassim, E. Tsikata, Z. Khoueir, L. Y. Poon, B. Braaf, B. J. Vakoc, B. E. Bouma, J. F. de Boer, and T. C. Chen, “Artifact rates for 2d retinal nerve fiber layer thickness versus 3d retinal nerve fiber layer volume,” Translational Vision Science & Technology, vol. 9, no. 3, pp. 12–12, 2020.
- [12] D. Mirzania, A. C. Thompson, and K. W. Muir, “Applications of deep learning in detection of glaucoma: a systematic review,” European Journal of Ophthalmology, vol. 31, no. 4, pp. 1618–1642, 2021.
- [13] S. Guo, S. Chen, and Y. Li, “Face recognition based on convolutional neural network and support vector machine,” in 2016 IEEE International conference on Information and Automation (ICIA). IEEE, 2016, pp. 1787–1792.
- [14] S. Wang, L. Yu, K. Li, X. Yang, C.-W. Fu, and P.-A. Heng, “Dofe: Domain-oriented feature embedding for generalizable fundus image segmentation on unseen datasets,” IEEE Transactions on Medical Imaging, vol. 39, no. 12, pp. 4237–4248, 2020.
- [15] X. Li, M. Jia, M. T. Islam, L. Yu, and L. Xing, “Self-supervised feature learning via exploiting multi-modal data for retinal disease diagnosis,” IEEE Transactions on Medical Imaging, vol. 39, no. 12, pp. 4023–4033, 2020.
- [16] K. Y. Gao, A. Fokoue, H. Luo, A. Iyengar, S. Dey, P. Zhang et al., “Interpretable drug target prediction using deep neural representation.” in IJCAI, vol. 2018, 2018, pp. 3371–3377.
- [17] Ç. Demiralp, C. E. Scheidegger, G. L. Kindlmann, D. H. Laidlaw, and J. Heer, “Visual embedding: A model for visualization,” IEEE Computer Graphics and Applications, vol. 34, no. 1, pp. 10–15, 2014.
- [18] I. Castiglioni, L. Rundo, M. Codari, G. Di Leo, C. Salvatore, M. Interlenghi, F. Gallivanone, A. Cozzi, N. C. D’Amico, and F. Sardanelli, “Ai applications to medical images: From machine learning to deep learning,” Physica Medica, vol. 83, pp. 9–24, 2021.
- [19] S. K. Zhou, H. Greenspan, C. Davatzikos, J. S. Duncan, B. Van Ginneken, A. Madabhushi, J. L. Prince, D. Rueckert, and R. M. Summers, “A review of deep learning in medical imaging: Imaging traits, technology trends, case studies with progress highlights, and future promises,” Proceedings of the IEEE, vol. 109, no. 5, pp. 820–838, 2021.
- [20] S. Shurrab and R. Duwairi, “Self-supervised learning methods and applications in medical imaging analysis: A survey,” arXiv preprint arXiv:2109.08685, 2021.
- [21] W. Bai, C. Chen, G. Tarroni, J. Duan, F. Guitton, S. E. Petersen, Y. Guo, P. M. Matthews, and D. Rueckert, “Self-supervised learning for cardiac mr image segmentation by anatomical position prediction,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 541–549.
- [22] M. Prakash, T.-O. Buchholz, M. Lalit, P. Tomancak, F. Jug, and A. Krull, “Leveraging self-supervised denoising for image segmentation,” in 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, 2020, pp. 428–432.
- [23] Y. Li, J. Chen, and Y. Zheng, “A multi-task self-supervised learning framework for scopy images,” in 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, 2020, pp. 2005–2009.
- [24] D. Sarvamangala and R. V. Kulkarni, “Convolutional neural networks in medical image understanding: a survey,” Evolutionary intelligence, pp. 1–22, 2021.
- [25] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [26] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
- [27] X. Liu, F. Zhang, Z. Hou, L. Mian, Z. Wang, J. Zhang, and J. Tang, “Self-supervised learning: Generative or contrastive,” IEEE Transactions on Knowledge and Data Engineering, 2021.
- [28] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
- [29] Y. Li, P. Hu, Z. Liu, D. Peng, J. T. Zhou, and X. Peng, “Contrastive clustering,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 10, 2021, pp. 8547–8555.
- [30] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” Advances in Neural Information Processing Systems, vol. 33, pp. 9912–9924, 2020.
- [31] J. Xie, X. Zhan, Z. Liu, Y. S. Ong, and C. C. Loy, “Delving into inter-image invariance for unsupervised visual representations,” arXiv preprint arXiv:2008.11702, 2020.
- [32] D. R. Nayak, D. Das, B. Majhi, S. V. Bhandary, and U. R. Acharya, “Ecnet: An evolutionary convolutional network for automated glaucoma detection using fundus images,” Biomedical Signal Processing and Control, vol. 67, p. 102559, 2021.
- [33] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro, “Image inpainting for irregular holes using partial convolutions,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 85–100.
- [34] L. Li, M. Xu, X. Wang, L. Jiang, and H. Liu, “Attention based glaucoma detection: a large-scale database and cnn model,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 571–10 580.
- [35] E. J. Candès, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, pp. 1–37, 2011.
- [36] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, P.-A. Manzagol, and L. Bottou, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion.” Journal of machine learning research, vol. 11, no. 12, 2010.
- [37] D. Song, B. Fu, F. Li, J. Xiong, J. He, X. Zhang, and Y. Qiao, “Deep relation transformer for diagnosing glaucoma with optical coherence tomography and visual field function,” IEEE Transactions on Medical Imaging, vol. 40, no. 9, pp. 2392–2402, 2021.
- [38] A. Hassani, S. Walton, N. Shah, A. Abuduweili, J. Li, and H. Shi, “Escaping the big data paradigm with compact transformers,” arXiv preprint arXiv:2104.05704, 2021.
- [39] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- [40] C.-F. R. Chen, Q. Fan, and R. Panda, “Crossvit: Cross-attention multi-scale vision transformer for image classification,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 357–366.
- [41] K. Han, Y. Wang, J. Guo, Y. Tang, and E. Wu, “Vision gnn: An image is worth graph of nodes,” arXiv preprint arXiv:2206.00272, 2022.