One-Vote Veto: Semi-Supervised Learning for Low-Shot Glaucoma Diagnosis
Abstract
Convolutional neural networks (CNNs) are a promising technique for automated glaucoma diagnosis from images of the fundus, and these images are routinely acquired as part of an ophthalmic exam. Nevertheless, CNNs typically require a large amount of well-labeled data for training, which may not be available in many biomedical image classification applications, especially when diseases are rare and where labeling by experts is costly. This article makes two contributions to address this issue: (1) It extends the conventional Siamese network and introduces a training method for low-shot learning when labeled data are limited and imbalanced, and (2) it introduces a novel semi-supervised learning strategy that uses additional unlabeled training data to achieve greater accuracy. Our proposed multi-task Siamese network (MTSN) can employ any backbone CNN, and we demonstrate with four backbone CNNs that its accuracy with limited training data approaches the accuracy of backbone CNNs trained with a dataset that is 50 times larger. We also introduce One-Vote Veto (OVV) self-training, a semi-supervised learning strategy that is designed specifically for MTSNs. By taking both self-predictions and contrastive predictions of the unlabeled training data into account, OVV self-training provides additional pseudo labels for fine-tuning a pre-trained MTSN. Using a large (imbalanced) dataset with 66,715 fundus photographs acquired over 15 years, extensive experimental results demonstrate the effectiveness of low-shot learning with MTSN and semi-supervised learning with OVV self-training. Three additional, smaller clinical datasets of fundus images acquired under different conditions (cameras, instruments, locations, populations) are used to demonstrate the generalizability of the proposed methods.
Index Terms: Convolutional neural networks, glaucoma diagnosis, low-shot learning, semi-supervised learning.
1 Introduction
Glaucoma is a prevalent and debilitating disease that can lead to progressive and irreversible vision loss through optic nerve damage [1]. The global incidence of glaucoma was estimated at 64.3 million in 2013, and due to aging populations, this number is expected to rise to 111.8 million by 2040 [2]. Improvement in the management of glaucoma would have a major human and socio-economic impact [3]. Early identification and intervention would significantly reduce the economic burden of late-stage disease [4]. In addition, visual impairment in glaucoma patients has been associated with decreased physical activity and mental health [5, 6] and increased risk of motor vehicle accidents [7, 8].
With the recent advances in machine learning, convolutional neural networks (CNNs), trained via supervised learning, have shown promise in diagnosing glaucoma from fundus images (photographs of the back of the eye) [9]. However, this requires large amounts of empirical data for supervised training [10]. In this study, we use 66,715 fundus photographs from the Ocular Hypertension Treatment Study (OHTS) [11, 12, 13], a 22-site, multi-center, longitudinal (phases 1 and 2, 1994-2008) randomized clinical trial of 1,636 subjects (3,272 eyes). The primary goal of the OHTS was to determine whether topical ocular hypotensive medications could delay or prevent the onset of glaucoma in eyes with high intraocular pressure [11]. Conversion to glaucoma was decided by a masked endpoint committee of three glaucoma specialists using fundus photographs and visual fields. Owing to its well-characterized ground-truth labels, the OHTS dataset provides a basis for exploring effective ways of training CNNs to diagnose glaucoma with low-shot learning, when only a small quantity of labeled data is available, and/or semi-supervised learning, when raw data are abundant but labeling resources are scarce, costly, dependent on strong expertise, or simply unavailable. However, as shown in Fig. 1, conventional semi-supervised learning approaches typically require a reliable pre-trained CNN (trained on a small labeled sample) as prior knowledge, which is often difficult to obtain because small training sets lead to over-fitting. There is also a strong motivation to design a feasible semi-supervised learning strategy capable of identifying confident predictions and generating pseudo labels for unlabeled data. We focus specifically on fundus images and glaucoma diagnosis in this article because we have sufficient data to accurately characterize the effectiveness of our methods; the same techniques could also be applied to tasks with limited data, such as rare diseases, or with limited labels (e.g., asthma and diabetes prediction from fundus images). Therefore, this article aims to answer the following questions:
1. Can a CNN be developed to accurately diagnose glaucoma, compared to the expert graders of the OHTS? Will the model be generalizable to other datasets?
2. Is it necessary to train CNNs with thousands of labeled fundus images to diagnose glaucoma, or can diagnosis be achieved using only one image per patient (approximately 1.1K fundus images in the OHTS training set)?
3. Can the performance of a CNN trained using a small sample be improved further by fine-tuning it with additional unlabeled training data?

To answer these questions, we first evaluate the performance of state-of-the-art (SoTA) glaucoma diagnosis algorithms, including six supervised learning algorithms [14, 15, 16, 17, 18, 3], one low-shot learning algorithm [19], and two semi-supervised learning algorithms [20, 21], on the OHTS dataset. Their generalizabilities are further validated on three additional clinical datasets of fundus images: (a) ACRIMA (Spain) [22], (b) Large-Scale Attention-Based Glaucoma (LAG, China) [9], and (c) the UCSD-based Diagnostic Innovations in Glaucoma Study and African Descent and Glaucoma Evaluation Study (DIGS/ADAGES, US) [23].
Furthermore, we propose a novel extension of the conventional Siamese network, referred to as the Multi-Task Siamese Network (MTSN), as depicted in Fig. 2. By minimizing a novel Combined Weighted Cross-Entropy (CWCE) loss, the MTSN can simultaneously perform two tasks: measuring the similarity of a given pair of images (primary task) and classifying each of them as healthy or glaucoma (secondary task). With a small training set of approximately 1.1K fundus images, we explore the feasibility of training an MTSN for glaucoma diagnosis. Although the MTSN may not provide complementary information, it effectively performs a type of “data augmentation” by generating pairs of fundus images for training instead of using independent fundus images. The visual features learned from these two tasks prove to be more informative for glaucoma diagnosis when the training set is small. Our experimental results demonstrate that the MTSN greatly reduces over-fitting and achieves, with a small training set, an accuracy comparable to that obtained with a large training set of approximately 53K fundus images.
Moreover, we propose a novel semi-supervised learning strategy, referred to as One-Vote Veto (OVV) Self-Training, which generates reliable pseudo labels for the unlabeled training data and incorporates them into the labeled training data to fine-tune the MTSN for improved performance and generalizability. Our extensive experiments show that the MTSN fine-tuned with OVV self-training achieves similar performance to the corresponding backbone CNN trained via supervised learning on the OHTS dataset, and achieves higher area under the receiver operating characteristic curve (AUROC) scores on the additional fundus image datasets. The fine-tuned MTSN also outperforms SoTA semi-supervised glaucoma diagnosis approaches [20, 21], and in some cases, even outperforms SoTA supervised approaches. Additionally, we compare our proposed OVV self-training approach with four SoTA general-purpose semi-supervised learning methods, including FreeMatch [24], SoftMatch [25], FixMatch [26], and FlexMatch [27], all of which utilize vision Transformer [28] as their backbone network. The results demonstrate that our proposed OVV self-training approach outperforms these methods on the OHTS dataset and demonstrates better generalizability on three additional fundus image test sets.
We also conduct two additional few-shot biomedical image classification experiments (chest X-ray image classification [29, 30] and lung histopathological image classification [31]) to further validate the effectiveness of the MTSN on other types of image data. The promising results indicate that our proposed algorithms have the potential to solve a variety of biomedical image classification problems.
2 Related Works
Most SoTA glaucoma diagnosis algorithms are developed based on supervised fundus image classification. For example, Judy et al.[16] trained an AlexNet [32] to diagnose glaucoma. As VGG architectures [33] can learn more complicated image features than AlexNet, Gómez-Valverde et al.[15] employed a VGG-19 [33] model to diagnose glaucoma. Nevertheless, VGG architectures [33] consist of hundreds of millions of parameters, making them very memory-consuming. In contrast, GoogLeNet [34] and Inception-v3 [35] have lower computational complexities. Hence, Ahn et al.[36] and Li et al.[14] utilized transfer learning to re-train an Inception-v3 [35] model (pre-trained on the ImageNet [37] database) for glaucoma diagnosis, while Serener and Serte [17] re-trained a pre-trained GoogLeNet [34] model to diagnose glaucoma. However, with the increase of network depth, accuracy gets saturated and then degrades rapidly due to vanishing gradients [38]. To tackle this problem, the residual neural network (ResNet) [38] was developed. Due to its robustness, ResNet-50 [38] has been extensively used for biomedical image analysis, and it is a popular choice [39, 40, 41, 42, 43, 3] for fundus image classification. Additionally, developing low-cost and real-time embedded glaucoma diagnosis systems [44, 45, 18], e.g., based on MobileNet-v2 [46], for mobile devices is also an emerging area.
Machine/deep learning has achieved compelling performance in data-intensive applications, but it is often challenging for these algorithms to yield comparable performance when only a limited amount of labeled training data is available [47]. Low-shot and semi-supervised learning can address these issues. Unfortunately, they are rarely discussed in the field of glaucoma diagnosis. To the best of our knowledge, [19] is the only published low/few-shot glaucoma diagnosis algorithm. This algorithm employs a conventional Siamese network to compare two groups of (negative and positive) fundus images. The Siamese network utilizes two identical CNNs to learn visual embeddings. A bi-directional long short-term memory [48] component is then trained over the CNN outputs for glaucoma diagnosis. However, the training process is complicated since different types of losses are minimized, and the achieved glaucoma diagnosis results are unsatisfactory since each sub-network is only fed with one type of fundus images (either negative or positive). The lack of same-class comparisons leads to a performance bottleneck when compared to the MTSN proposed in this article.

A thorough search of the relevant literature yielded only two published studies on semi-supervised learning specifically for glaucoma diagnosis [21, 20]. Diaz-Pinto et al.[21] utilized a deep convolutional generative adversarial network (DCGAN) [49] for semi-supervised learning of glaucoma diagnosis, where the discriminator is trained to classify healthy and glaucomatous optic neuropathy (GON) fundus images, while also distinguishing between real and fake fundus images. The classifier for the former task is then employed for glaucoma diagnosis. On the other hand, Al Ghamdi et al.[20] developed a glaucoma diagnosis approach based on self-training [50], which is a typical semi-supervised learning approach that uses a pre-trained model (typically yielded via supervised learning) to produce pseudo labels of the unlabeled data. However, producing reliable pseudo labels is a significant challenge in self-training, and the pseudo labels generated by a single pre-trained CNN are usually not trustworthy enough for CNN fine-tuning [51]. Additionally, training a reliable pre-trained classifier with only a small amount of labeled data is notably demanding. In this article, we combine semi-supervised learning with low-shot learning to address these issues using glaucoma diagnosis as an example case. Specifically, our proposed OVV self-training strategy, as discussed in Sect. 3.2, is inspired by the mechanism of learning with external memory (LwEM), used in low-shot learning [52], where the labels of unlabeled training data are predicted by a classifier trained via low-shot learning on a small collection of fundus images with ground-truth labels.
3 Methodology
3.1 Multi-Task Siamese Network
As illustrated in Fig. 1, conventional semi-supervised learning methods initialize a network by pre-training it with a small number of fundus images for glaucoma diagnosis. However, we observed that such approaches are highly sensitive to noise. As a result, we design a novel MTSN specifically for our semi-supervised learning approach, which requires not only predicting the label of a given fundus image but also determining the similarity between a pair of given fundus images to generate pseudo labels through a voting process.
Conventional Siamese networks have become a common choice for metric learning and few/low-shot image recognition tasks [53]. These networks comprise two identical sub-networks, as depicted in Fig. 2. The two fundus images $x_1$ and $x_2$ of a pair are separately fed into these sub-networks, producing two 1D embeddings (features) $f_1$ and $f_2$, respectively. A third 1D embedding is generated by fusing $f_1$ and $f_2$ element-wise (the fusion operation is discussed in Sect. 4.2) and is then passed through a fully connected (FC) layer to produce a scalar $\hat{s}$ indicating the similarity between $x_1$ and $x_2$. If $x_1$ and $x_2$ are dissimilar, $\hat{s}$ approaches 1, and vice versa. The ground-truth labels of $x_1$ and $x_2$ are represented by $y_1$ and $y_2$, respectively, where 0 denotes a healthy image and 1 denotes a GON image.
However, a conventional Siamese network can only determine whether $x_1$ and $x_2$ belong to the same category, rather than predicting their individual categories. A straightforward solution is to connect $f_1$ and $f_2$ to separate FC layers, producing two scalars $\hat{y}_1$ and $\hat{y}_2$ indicating the probabilities that $x_1$ and $x_2$ are GON images, respectively. Refer to Fig. 2 and note that the two FC layers connected to $f_1$ and $f_2$ share the same weights. In this article, we refer to the network architecture in Fig. 2 as an MTSN, which can simultaneously measure the similarity of a given pair of fundus images and classify each of them as either healthy or GON. These two tasks are related yet not directly deducible from one another: a well-trained glaucoma diagnosis network can be employed to compare differences between given pairs of fundus images, but a well-trained fundus image similarity measurement network cannot directly output the category of a given fundus image.
In addition, the visual features learned from the primary and secondary tasks are distinct from one another. For the primary task, the network learns the visual features to classify same-class and different-class fundus image pairs. On the other hand, for the secondary task, the network learns the visual features to classify GON and healthy fundus images. Although this network architecture may not provide complementary information, it effectively performs a type of “data augmentation” by producing pairs of fundus images for training, rather than using independent fundus images. The visual features learned from these two tasks prove to be more informative for glaucoma diagnosis when the training set is small. Furthermore, multi-task learning is effective because requiring an algorithm to perform well on a related task induces regularization, which can be superior to uniform complexity penalization for preventing over-fitting. This idea has been explored in many Siamese neural network works, such as [54, 55, 56].
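A minimal PyTorch sketch of this architecture is given below. It is illustrative only: the backbone, embedding size, and layer names (backbone, fc_sim, fc_cls) are our own placeholders, and the two embeddings are fused with an element-wise absolute difference, one of the two options compared in Sect. 4.2. The two classification heads of Fig. 2 are realized here as a single FC layer applied to both embeddings, which is equivalent to two FC layers with tied weights.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class MTSN(nn.Module):
    """Minimal multi-task Siamese network sketch (not the authors' exact code).

    Primary task:   similarity of an image pair (output approaches 1 for dissimilar classes).
    Secondary task: per-image probability of glaucomatous optic neuropathy (GON).
    """

    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = models.resnet50(weights=None)        # any backbone CNN can be used
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone                         # shared weights for both inputs
        self.fc_sim = nn.Linear(embed_dim, 1)            # pair-similarity head
        self.fc_cls = nn.Linear(embed_dim, 1)            # shared classification head

    def forward(self, x1, x2):
        f1 = self.backbone(x1)                           # 1D embedding of image 1
        f2 = self.backbone(x2)                           # 1D embedding of image 2
        fused = torch.abs(f1 - f2)                       # element-wise absolute difference (EWAD)
        s_hat = torch.sigmoid(self.fc_sim(fused))        # approaches 1 if the pair is dissimilar
        y1_hat = torch.sigmoid(self.fc_cls(f1))          # probability that image 1 is GON
        y2_hat = torch.sigmoid(self.fc_cls(f2))          # probability that image 2 is GON
        return s_hat, y1_hat, y2_hat
```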
In this article, we use $N_h$ and $N_g$ to denote the numbers of healthy and GON fundus images used to train the MTSN, respectively. $N_h$ is usually much greater than $N_g$, because there are far fewer patients with glaucomatous disease than healthy patients, resulting in a severely imbalanced dataset. Therefore, the MTSN is trained by minimizing a CWCE loss, denoted (1), which combines a weighted cross-entropy loss for the primary task (the pair-similarity prediction $\hat{s}$) and a weighted cross-entropy loss for the secondary task (the classification predictions $\hat{y}_1$ and $\hat{y}_2$), with the weights compensating for the class imbalance. The hyper-parameter $\lambda$ balances the primary task loss and the secondary task loss. The choice of $\lambda$ and of the embedding fusion operation is discussed in Sect. 4.2. The motivations for using such a CWCE loss function instead of the commonly used triplet loss [57] or contrastive loss [58] to train the MTSN are:
1. Most datasets for rare disease diagnosis are imbalanced. As detailed in Sect. 4.1, the OHTS training set is severely imbalanced, with 50,208 healthy images and only 2,416 GON images for supervised learning, and 995 healthy images and 152 GON images for low-shot learning. Learning from such an imbalanced dataset without class weights can result in many incorrect predictions, with most GON images likely to be predicted as healthy. To address this issue, a higher weight should be used for the minority class to prevent the CNN from predicting all fundus images as the majority class.
2. In multi-task learning, weighing different types of losses, such as regression and classification losses, is typically challenging [59]. Assigning an incorrect weight may cause one task to perform poorly, even when the other tasks converge to satisfactory results. Formulating both task losses as cross-entropy losses is therefore a simple but effective solution; due to the dataset imbalance problem, these cross-entropy losses additionally have to be weighted.
3. As shown in Fig. 3, OVV self-training requires both labels and probabilities (of being GON images), predicted by a pre-trained model, to produce pseudo labels for the unlabeled data. This network architecture and training loss can efficiently and effectively provide both “self-predicted” and “contrastively-predicted” labels and probabilities, as described in Sect. 3.2.
It should be noted that the two fundus images of a training pair do not need to come from the same patient.
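To make the loss concrete, a minimal sketch of a combined weighted cross-entropy of the kind described above is given below. The inverse-frequency class weights and the way the two terms are combined with $\lambda$ are one plausible instantiation of the ideas above, not necessarily the exact formulation of (1).

```python
import torch
import torch.nn.functional as F


def cwce_loss(s_hat, y1_hat, y2_hat, y1, y2, n_healthy, n_gon, lam=0.1):
    """Sketch of a combined weighted cross-entropy (CWCE) loss.

    s_hat          : predicted pair dissimilarity in [0, 1] (close to 1 for different classes)
    y1_hat, y2_hat : predicted GON probabilities of the two images
    y1, y2         : ground-truth labels as float tensors (0 = healthy, 1 = GON)
    n_healthy, n_gon : class counts in the training set (n_healthy >> n_gon)
    lam            : hyper-parameter balancing the two task losses
    """
    n = float(n_healthy + n_gon)
    w_gon, w_healthy = n_healthy / n, n_gon / n   # the minority (GON) class gets the larger weight

    def weighted_bce(pred, target):
        w = torch.where(target > 0.5,
                        torch.full_like(target, w_gon),
                        torch.full_like(target, w_healthy))
        return F.binary_cross_entropy(pred, target, weight=w)

    # Primary task: pair similarity; the ground truth is 1 if the two labels differ.
    # (Pair-level weighting could be added analogously to handle imbalanced pair types.)
    s = (y1 != y2).float()
    loss_sim = F.binary_cross_entropy(s_hat, s)

    # Secondary task: per-image classification, weighted against the class imbalance.
    loss_cls = weighted_bce(y1_hat, y1) + weighted_bce(y2_hat, y2)

    return loss_sim + lam * loss_cls
```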

3.2 One-Vote Veto Self-Training
As discussed in Sect. 1, self-training aims to improve the performance of a pre-trained model by incorporating reliable predictions of the unlabeled data to obtain useful additional information that can be used for model fine-tuning. A feasible strategy to determine such reliable predictions is, therefore, key to the success of self-training [60].
In conventional semi-supervised learning algorithms, an image classification model pre-trained via supervised learning is fine-tuned by assessing the reliability of its predictions on unlabeled images. To determine the reliability of an unlabeled image, the predicted probability of its most likely class is compared to a pre-determined threshold. If the probability surpasses this threshold, the prediction is accepted as a pseudo label, and the image together with its pseudo label is used to fine-tune the pre-trained model.
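A sketch of this conventional confidence-thresholding scheme is given below, for later comparison with OVV self-training; the classifier interface and the threshold value are illustrative.

```python
import torch


@torch.no_grad()
def confidence_pseudo_labels(classifier, images, threshold=0.95):
    """Conventional pseudo-labelling sketch: keep only highly confident predictions.

    `classifier` is assumed to return one GON logit per image; the threshold is illustrative.
    """
    probs = torch.sigmoid(classifier(images)).squeeze(-1)   # (B,) predicted GON probabilities
    confidence = torch.maximum(probs, 1.0 - probs)          # confidence of the likelier class
    keep = confidence > threshold                           # reliable predictions only
    pseudo_labels = (probs > 0.5).float()
    return images[keep], pseudo_labels[keep]
```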
However, relying solely on probability distributions to generate pseudo labels is often insufficient [50]. Drawing inspiration from LwEM [61], we introduce One-Vote Veto self-training in this paper, as illustrated in Fig. 3 (the superscripts $r$ and $t$ denote "reference" and "target", respectively). Similar to LwEM [61], we use a collection of reference (labeled) fundus images $x^r$ to provide "contrastive predictions" for the target (unlabeled) fundus images $x^t$. The contrastive predictions then vote to veto the unreliable "self-predictions" produced by the MTSN. Our OVV self-training is detailed in Algorithm 1, where the target model updates its parameters during self-training but the reference model does not.
When fine-tuning an MTSN pre-trained through low-shot learning, each mini-batch contains a set of $K$ reference fundus images $x^r_1, \dots, x^r_K$ with their ground-truth labels $y^r_1, \dots, y^r_K$, and an equal number of target fundus images $x^t_1, \dots, x^t_K$ without labels; the two sub-networks produce the corresponding 1D embeddings of the reference and target images. Given a pair consisting of a reference image $x^r_i$ and a target image $x^t_j$, the pre-trained MTSN can "self-predict":
- the scalars $\hat{y}^r_i$ and $\hat{y}^t_j$, which indicate the probabilities that $x^r_i$ and $x^t_j$ are GON images, respectively;
- the corresponding labels, obtained from these probabilities using its fundus image classification functionality (label 1 when the predicted probability exceeds 0.5, and 0 otherwise).
The self-predicted probability $\hat{y}^r_i$ is then used to determine whether the reference fundus image $x^r_i$ is qualified to veto unreliable predictions: if $\hat{y}^r_i$ is not sufficiently consistent with the ground-truth label $y^r_i$, as evaluated by a threshold $\epsilon$, its vote is omitted ($\epsilon$ is the threshold used to select qualified reference fundus images; step 6 in Algorithm 1). In the meantime, the pre-trained MTSN can also "contrastively predict" a GON probability and the corresponding label for the target image $x^t_j$, derived from the predicted similarity between $x^r_i$ and $x^t_j$ and the ground-truth label $y^r_i$ using its input similarity measurement functionality (for example, a target paired with a GON reference and predicted to be dissimilar is contrastively predicted as healthy). In rare cases, the contrastively-predicted label might not be equivalent to the self-predicted label. To determine whether the self-predicted label of $x^t_j$ is reliable enough to be used as its pseudo label, all the reference fundus images in the mini-batch are used to provide additional judgements: each contrastively-predicted probability and label form a vote. With all votes collected from the qualified reference fundus images, the OVV self-training algorithm determines whether the self-predicted label should be used as the pseudo label for $x^t_j$ based on the following criteria (step 9 in Algorithm 1):
- Analogous to the selection of qualified reference fundus images, if any contrastively-predicted probability, or the self-predicted probability $\hat{y}^t_j$, is not close to either 0 (healthy) or 1 (GON), as evaluated by the threshold $\epsilon$, no pseudo label is assigned to $x^t_j$.
- If a minority of more than $n$ qualified reference fundus images disagrees with the majority of the qualified reference fundus images, no pseudo label is assigned to $x^t_j$.
As discussed in Sect. 4, $n = 0$ (all qualified reference images must vote for the same category) achieves the best overall performance; the aforementioned strategy is therefore named "One-Vote Veto" in this paper. Since each target fundus image must be compared with all the reference fundus images in the same mini-batch, the proposed self-training strategy has a computational complexity of $\mathcal{O}(K^2)$ per mini-batch, which is relatively memory-consuming. The reliable target fundus images and their pseudo labels are then added to the low-shot training data to fine-tune the pre-trained MTSN with supervised learning by minimizing a CWCE loss. The OVV self-training performance with respect to different $n$, $\epsilon$, and $K$ values is discussed in Sect. 4.
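The following NumPy sketch summarizes the voting logic for a single target image under our reading of the criteria above (Algorithm 1 itself is not reproduced here); the qualification test, the 0.5 decision boundary, and the requirement that the majority vote agree with the self-prediction are assumptions.

```python
import numpy as np


def ovv_pseudo_label(y_ref, y_ref_self, y_tgt_self, s_hat_pairs, eps=0.01, n_veto=0):
    """One-Vote Veto pseudo-labelling for a single target image (sketch).

    y_ref       : (K,) ground-truth labels of the reference images (0/1)
    y_ref_self  : (K,) self-predicted GON probabilities of the reference images
    y_tgt_self  : scalar self-predicted GON probability of the target image
    s_hat_pairs : (K,) predicted dissimilarity between each reference and the target
    eps         : threshold for qualified references / reliable predictions
    n_veto      : number of dissenting votes tolerated (0 = one-vote veto)

    Returns the pseudo label (0/1), or None if the prediction is vetoed.
    """
    # 1. Keep only qualified references: their self-prediction must agree with
    #    their ground-truth label within the threshold eps.
    qualified = np.abs(y_ref_self - y_ref) < eps
    if not qualified.any():
        return None

    # 2. Contrastive predictions of the target from each qualified reference:
    #    a GON reference predicted to be similar to the target votes "GON",
    #    a GON reference predicted to be dissimilar votes "healthy", and so on.
    p_contrast = y_ref * (1 - s_hat_pairs) + (1 - y_ref) * s_hat_pairs
    p_contrast = p_contrast[qualified]

    # 3. Every contrastive prediction and the self-prediction must themselves be
    #    confident, i.e. close to either 0 (healthy) or 1 (GON).
    confident = np.minimum(p_contrast, 1 - p_contrast) < eps
    if not confident.all() or min(y_tgt_self, 1 - y_tgt_self) >= eps:
        return None

    # 4. One-vote veto: if more than n_veto qualified references disagree with
    #    the majority, the pseudo label is rejected.
    votes = (p_contrast > 0.5).astype(int)
    majority = int(votes.sum() * 2 > len(votes))
    if (votes != majority).sum() > n_veto:
        return None

    self_label = int(y_tgt_self > 0.5)
    if majority != self_label:     # assumption: the votes must also back the
        return None                # self-prediction for it to become a pseudo label
    return self_label
```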
Test set | Experiment | Training strategy | ResNet-50 Accuracy (%) | ResNet-50 F1-score (%) | ResNet-50 AUROC | MobileNet-v2 Accuracy (%) | MobileNet-v2 F1-score (%) | MobileNet-v2 AUROC
ACRIMA [22] | Baseline | Supervised learning | 57.163 | 56.734 | 0.625 | 73.333 | 78.341 | 0.779 |
EWAD | Low-shot learning | 67.092 | 63.175 | 0.758 | 70.355 | 67.797 | 0.820 | |
EWSD | Low-shot learning | 49.504 | 25.523 | 0.437 | 66.241 | 59.107 | 0.823 | |
LAG [9] | Baseline | Supervised learning | 64.318 | 57.445 | 0.714 | 65.122 | 63.268 | 0.781 |
EWAD | Low-shot learning | 79.028 | 65.908 | 0.841 | 79.007 | 69.482 | 0.843 | |
EWSD | Low-shot learning | 78.039 | 68.179 | 0.826 | 78.430 | 61.830 | 0.846 | |
DIGS/ADAGES [23] | Baseline | Supervised learning | 59.639 | 59.708 | 0.648 | 61.478 | 66.128 | 0.669 |
EWAD | Low-shot learning | 67.745 | 60.754 | 0.743 | 69.176 | 68.145 | 0.748 | |
EWSD | Low-shot learning | 65.736 | 55.722 | 0.700 | 68.120 | 64.028 | 0.740 |
Dataset | Veto tolerance $n$ | Threshold $\epsilon$ | Mini-batch size $K$ | ResNet-50 Accuracy (%) | ResNet-50 F1-score (%) | ResNet-50 AUROC | MobileNet-v2 Accuracy (%) | MobileNet-v2 F1-score (%) | MobileNet-v2 AUROC
OHTS [11, 12] | 0 | 0.01 | 20 | 91.415 | 41.148 | 0.898 | 90.609 | 36.960 | 0.887 |
0 | 0.01 | 15 | 92.113 | 41.316 | 0.898 | 93.470 | 36.976 | 0.893 | |
0 | 0.01 | 10 | 94.199 | 43.759 | 0.890 | 93.742 | 37.779 | 0.863 | |
0 | 0.10 | 20 | 90.516 | 38.139 | 0.898 | 88.825 | 34.351 | 0.878 | |
2 | 0.01 | 20 | 90.717 | 35.818 | 0.885 | 93.463 | 35.204 | 0.862 | |
2 | 0.10 | 20 | 92.656 | 32.017 | 0.851 | 90.772 | 32.616 | 0.858 | |
4 | 0.01 | 20 | 92.610 | 29.668 | 0.854 | 92.268 | 31.852 | 0.854 | |
4 | 0.10 | 20 | 92.672 | 28.463 | 0.842 | 90.803 | 32.690 | 0.859 | |
ACRIMA [22] | 0 | 0.01 | 20 | 59.858 | 49.192 | 0.775 | 72.340 | 70.229 | 0.840 |
0 | 0.01 | 15 | 60.426 | 49.365 | 0.751 | 63.404 | 54.895 | 0.814 | |
0 | 0.01 | 10 | 54.610 | 35.743 | 0.721 | 61.986 | 51.273 | 0.826 | |
LAG [9] | 0 | 0.01 | 20 | 80.882 | 66.304 | 0.881 | 79.625 | 65.262 | 0.851 |
0 | 0.01 | 15 | 76.864 | 56.252 | 0.825 | 76.670 | 55.906 | 0.841 | |
0 | 0.01 | 10 | 76.638 | 56.518 | 0.826 | 75.834 | 51.748 | 0.866 | |
DIGS/ADAGES [23] | 0 | 0.01 | 20 | 67.813 | 58.315 | 0.763 | 69.653 | 63.258 | 0.777 |
0 | 0.01 | 15 | 63.045 | 44.727 | 0.753 | 66.383 | 52.888 | 0.773 | |
0 | 0.01 | 10 | 61.819 | 41.340 | 0.727 | 63.965 | 44.198 | 0.789 |
Backbone | Method | Training strategy | OHTS [11, 12] AUROC (95% CI) | ACRIMA [22] AUROC (95% CI) | LAG [9] AUROC (95% CI) | DIGS/ADAGES [23] AUROC (95% CI) | Training time (min)
---|---|---|---|---|---|---|---|
ResNet-50 | Baseline | Supervised learning | 0.904 (95% CI: 0.865, 0.935) | 0.736 (95% CI: 0.698, 0.771) | 0.794 (95% CI: 0.780, 0.807) | 0.744 (95% CI: 0.696, 0.792) | 52.1 |
MTSN | Low-shot learning | 0.869 (95% CI: 0.833, 0.901) | 0.758 (95% CI: 0.723, 0.792) | 0.841 (95% CI: 0.829, 0.853) | 0.743 (95% CI: 0.683, 0.795) | 1.8 | |
MTSN+OVV | Semi-supervised learning | 0.898 (95% CI: 0.857, 0.928) | 0.775 (95% CI: 0.741, 0.808) | 0.881 (95% CI: 0.870, 0.891) | 0.763 (95% CI: 0.695, 0.820) | 203.7 | |
MobileNet-v2 | Baseline | Supervised learning | 0.893 (95% CI: 0.845, 0.932) | 0.794 (95% CI: 0.760, 0.825) | 0.856 (95% CI: 0.844, 0.867) | 0.786 (95% CI: 0.728, 0.835) | 42.9 |
MTSN | Low-shot learning | 0.859 (95% CI: 0.813, 0.896) | 0.820 (95% CI: 0.786, 0.850) | 0.843 (95% CI: 0.831, 0.855) | 0.748 (95% CI: 0.689, 0.802) | 1.2 | |
MTSN+OVV | Semi-supervised learning | 0.887 (95% CI: 0.850, 0.920) | 0.840 (95% CI: 0.808, 0.867) | 0.851 (95% CI: 0.838, 0.862) | 0.777 (95% CI: 0.718, 0.826) | 125.4 | |
DenseNet | Baseline | Supervised learning | 0.898 (95% CI: 0.867, 0.927) | 0.810 (95% CI: 0.778, 0.841) | 0.784 (95% CI: 0.771, 0.798) | 0.743 (95% CI: 0.688, 0.789) | 122.7 |
MTSN | Low-shot learning | 0.854 (95% CI: 0.811, 0.894) | 0.753 (95% CI: 0.716, 0.786) | 0.853 (95% CI: 0.842, 0.865) | 0.732 (95% CI: 0.675, 0.785) | 5.7 | |
MTSN+OVV | Semi-supervised learning | 0.896 (95% CI: 0.861, 0.926) | 0.783 (95% CI: 0.748, 0.817) | 0.831 (95% CI: 0.818, 0.843) | 0.746 (95% CI: 0.678, 0.800) | 324.0 | |
EfficientNet | Baseline | Supervised learning | 0.768 (95% CI: 0.684, 0.834) | 0.633 (95% CI: 0.590, 0.672) | 0.650 (95% CI: 0.634, 0.667) | 0.658 (95% CI: 0.611, 0.702) | 48.7 |
MTSN | Low-shot learning | 0.863 (95% CI: 0.818, 0.899) | 0.845 (95% CI: 0.815, 0.873) | 0.845 (95% CI: 0.833, 0.856) | 0.719 (95% CI: 0.659, 0.774) | 1.5 | |
MTSN+OVV | Semi-supervised learning | 0.886 (95% CI: 0.845, 0.918) | 0.792 (95% CI: 0.758, 0.824) | 0.850 (95% CI: 0.837, 0.861) | 0.749 (95% CI: 0.690, 0.800) | 159.8 |
Backbone: ResNet-50
Percentage of training data (%) | Supervised Accuracy (%) | Supervised F1-score (%) | Supervised AUROC | Low-shot Accuracy (%) | Low-shot F1-score (%) | Low-shot AUROC | Semi-supervised Accuracy (%) | Semi-supervised F1-score (%) | Semi-supervised AUROC
0.5 | 80.402 | 18.066 | 0.720 | 84.769 | 20.228 | 0.759 | 88.856 | 25.351 | 0.797 |
1.0 | 79.557 | 21.573 | 0.799 | 88.538 | 25.654 | 0.806 | 88.453 | 32.472 | 0.857 |
2.0 (baseline) | 84.141 | 26.743 | 0.838 | 87.150 | 31.726 | 0.865 | 91.415 | 41.148 | 0.898 |
10.0 | 85.651 | 32.288 | 0.890 | 89.988 | 38.612 | 0.891 | 91.446 | 39.760 | 0.899 |
50.0 | 92.950 | 40.188 | 0.907 | 94.223 | 40.826 | 0.887 | 92.067 | 38.187 | 0.889 |
90.0 | 89.556 | 37.582 | 0.905 | 92.897 | 41.878 | 0.887 | 93.331 | 43.421 | 0.898 |
Backbone: MobileNet-v2
Percentage of training data (%) | Supervised Accuracy (%) | Supervised F1-score (%) | Supervised AUROC | Low-shot Accuracy (%) | Low-shot F1-score (%) | Low-shot AUROC | Semi-supervised Accuracy (%) | Semi-supervised F1-score (%) | Semi-supervised AUROC
0.5 | 83.033 | 19.035 | 0.720 | 81.954 | 18.035 | 0.745 | 87.918 | 29.246 | 0.826 |
1.0 | 67.659 | 16.458 | 0.748 | 91.764 | 20.628 | 0.772 | 89.407 | 31.357 | 0.841 |
2.0 (baseline) | 78.144 | 23.547 | 0.840 | 86.018 | 30.252 | 0.854 | 90.609 | 36.960 | 0.887 |
10.0 | 90.429 | 34.873 | 0.884 | 89.562 | 36.808 | 0.897 | 89.624 | 36.887 | 0.888 |
50.0 | 92.438 | 43.361 | 0.906 | 91.043 | 39.050 | 0.888 | 93.230 | 40.653 | 0.889 |
90.0 | 93.754 | 42.401 | 0.908 | 93.750 | 37.519 | 0.890 | 91.857 | 41.602 | 0.896 |
4 Experiments
4.1 Datasets and Experimental Setups
The datasets utilized in our experiments were collected by various clinicians in different countries using distinct fundus cameras. The ACRIMA [22] and LAG [9] datasets are publicly available, while the OHTS [11, 12] and DIGS/ADAGES [23] datasets are available upon request after appropriate data use agreements are initiated. Their details are as follows:
- The OHTS [11, 12] is the only multi-center longitudinal study with precise information on the dates/timing of the development of glaucoma (the enrolled subjects did not have glaucoma at study entry), assessed using standardized criteria by an independent Optic Disc Reading Center and confirmed by a three-member glaucoma specialist endpoint committee. In our experiments, a square region centered on the optic nerve head was first extracted from each raw fundus image using a well-trained DeepLabv3+ [63] model. A small part of the raw data are stereoscopic fundus images, each of which was split into two individual fundus images. Through this image pre-processing approach, a total of 74,678 fundus images were obtained. Moreover, ENPOAGDISC (endpoint committee attributable to primary open angle glaucoma based on optic disc changes from photographs) [3] labels are used as the classification ground truth. The fundus images are divided into a training set (50,208 healthy images and 2,416 GON images), a validation set (7,188 healthy images and 426 GON images), and a test set (13,780 healthy images and 660 GON images) by participant. Splitting by participant (instead of by image) ensures that the validation and test sets do not contain images from any eyes or individuals used to train the model. More details on dataset preparation and the baseline supervised learning experiments are provided in our recent publications [3, 10]. Additionally, we select one image (from only one eye) from each patient in the training set to create the low-shot training set (995 healthy images and 152 GON images): if neither eye of a patient converts to glaucoma during the study, the first captured fundus photograph is selected; if either eye converts to glaucoma, the first glaucoma fundus photograph is selected (a sketch of this selection rule is given after this list).
- The ACRIMA [22] dataset consists of 309 healthy images and 396 GON images. It was collected as part of an initiative by the government of Spain. Classification was based on the review of a single experienced glaucoma expert, and images were excluded if they did not provide a clear view of the optic nerve head region [43].
- The LAG [9] dataset contains 3,143 healthy images and 1,711 GON images (fewer fundus images are publicly available than were reported in the original publication [9]), collected by Beijing Tongren Hospital. Similar to the OHTS dataset, we use the well-trained DeepLabv3+ [63] model to extract a square region centered on the optic nerve head from each fundus image.
- The DIGS and ADAGES [23] are longitudinal studies designed to detect and monitor glaucoma based on optical imaging and visual function testing that, when combined, have generated tens of thousands of test results from over 4,000 healthy, glaucoma suspect, or glaucoma eyes. In our experiments, we utilize the DIGS/ADAGES test set (5,184 healthy images and 4,289 GON images) to evaluate the generalizability of our proposed methods.
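Below is a sketch of the patient-level rule used to build the OHTS low-shot training set; the metadata column names (patient_id, acquisition_date, label) are placeholders for whatever metadata accompanies the images.

```python
import pandas as pd


def build_low_shot_set(meta: pd.DataFrame) -> pd.DataFrame:
    """Select one fundus photograph per patient (sketch of the rule in Sect. 4.1).

    meta columns (placeholders): patient_id, acquisition_date, label (0 healthy, 1 GON).
    If any image of a patient is labelled GON, the earliest GON photograph is kept;
    otherwise the earliest photograph overall is kept.
    """
    def pick_one(group: pd.DataFrame) -> pd.Series:
        group = group.sort_values("acquisition_date")
        gon = group[group["label"] == 1]
        return gon.iloc[0] if len(gon) > 0 else group.iloc[0]

    return meta.groupby("patient_id", group_keys=False).apply(pick_one)
```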
Visualizations of the four test sets using t-SNE [62] are provided in Fig. 4. Since healthy and GON images are distributed similarly between the OHTS and LAG datasets, we expect models to perform similarly on these datasets. Dissimilar distributions in the ACRIMA and DIGS/ADAGES datasets led us to believe the performance of models on these datasets would be somewhat worse. Using these four datasets, we conduct three experiments:
1. Supervised learning experiment: We employ transfer learning [64] to train ResNet-50 [38], MobileNet-v2 [46], DenseNet [65], and EfficientNet [66] (pre-trained on the ImageNet database [37]) on the entire OHTS training set (including approximately 53K fundus images). The best-performing models are selected using the OHTS validation set. Their performance is subsequently evaluated on the OHTS test set, the ACRIMA dataset, the LAG dataset, and the DIGS/ADAGES test set.
2. Low-shot learning experiment: The four pre-trained models mentioned above are used as the MTSN backbones and trained on the OHTS low-shot training set (containing 1,147 images) to validate the effectiveness of our proposed low-shot glaucoma diagnosis algorithm. The validation and testing procedures are identical to those in the supervised learning experiment.
3. Semi-supervised learning experiment: The MTSNs trained on the low-shot training set are fine-tuned on the entire OHTS training set without using additional ground-truth labels. The fine-tuned MTSNs are referred to as MTSN+OVV. The validation and testing procedures are identical to those in the supervised learning experiment.
The fundus images are resized to a fixed input resolution. The initial learning rate is set to 0.001 and decays gradually after the 100th epoch. Due to the dataset imbalance problem, the F1-score is used to select the best-performing models during the validation stage. Moreover, we adopt an early stopping mechanism during validation to reduce over-fitting (training is terminated if the F1-score has not increased for 10 epochs). We use three metrics to quantify the performance of the trained models: (1) accuracy, (2) F1-score, and (3) AUROC. While accuracy is commonly reported in image classification papers, the F1-score and AUROC are more comprehensive and appropriate evaluation metrics when the dataset is severely imbalanced.
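The three metrics can be computed directly with scikit-learn, as sketched below (the 0.5 decision threshold for the hard predictions is our choice).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


def evaluate(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    """Accuracy, F1-score, and AUROC for binary glaucoma diagnosis.

    y_true: ground-truth labels (0 = healthy, 1 = GON)
    y_prob: predicted GON probabilities
    """
    y_pred = (y_prob > 0.5).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # F1-score is more informative than accuracy on imbalanced test sets.
        "f1": f1_score(y_true, y_pred),
        # AUROC is threshold-free and computed from the probabilities.
        "auroc": roc_auc_score(y_true, y_prob),
    }
```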
4.2 Ablation Study and Threshold Selection
Test set: OHTS [11, 12]
Training strategy | Method | Accuracy (%) | F1-score (%) | AUROC
Supervised learning | Li et al.[14] | 93.812 | 42.590 | 0.886 |
Gómez-Valverde et al.[15] | 95.068 | 48.039 | 0.903 | |
Judy et al.[16] | 94.188 | 42.835 | 0.908 | |
Serener and Serte [17] | 90.492 | 41.227 | 0.912 | |
Thakur et al.[18] | 94.145 | 44.115 | 0.896 | |
Fan et al.[3] (Baseline Result) | 93.261 | 43.016 | 0.904 | |
Low-shot learning | Kim et al.[19] | 84.203 | 23.161 | 0.786 |
MTSN (Backbone: ResNet-50) (Ours) | 87.150 | 31.726 | 0.865 | |
MTSN (Backbone: MobileNet-v2) (Ours) | 86.018 | 30.252 | 0.854 | |
Semi-supervised learning | Al Ghamdi et al.[20] | 84.808 | 27.579 | 0.830 |
Diaz-Pinto et al.[21] | 76.619 | 20.721 | 0.748 | |
MTSN (Backbone: ResNet-50) + OVV Self-Training (Ours) | 90.244 | 38.454 | 0.899 | |
MTSN (Backbone: MobileNet-v2) + OVV Self-Training (Ours) | 89.360 | 36.599 | 0.891 |
Test set: ACRIMA [22]
Training strategy | Method | Accuracy (%) | F1-score (%) | AUROC
Supervised learning | Li et al.[14] | 60.142 | 46.272 | 0.813 |
Gómez-Valverde et al.[15] | 63.546 | 53.358 | 0.826 | |
Judy et al.[16] | 57.872 | 41.650 | 0.824 | |
Serener and Serte [17] | 54.326 | 36.364 | 0.675 | |
Thakur et al.[18] | 65.106 | 58.020 | 0.794 | |
Fan et al.[3] (Baseline Result) | 53.333 | 31.601 | 0.736 | |
Low-shot learning | Kim et al.[19] | 64.965 | 58.627 | 0.844 |
MTSN (Backbone: ResNet-50) (Ours) | 67.092 | 63.175 | 0.758 | |
MTSN (Backbone: MobileNet-v2) (Ours) | 70.355 | 67.797 | 0.820 | |
Semi-supervised learning | Al Ghamdi et al.[20] | 68.511 | 67.257 | 0.794 |
Diaz-Pinto et al.[21] | 59.149 | 44.828 | 0.818 | |
MTSN (Backbone: ResNet-50) + OVV Self-Training (Ours) | 64.539 | 57.627 | 0.801 | |
MTSN (Backbone: MobileNet-v2) + OVV Self-Training (Ours) | 72.340 | 70.939 | 0.835 |
Test set: LAG [9]
Training strategy | Method | Accuracy (%) | F1-score (%) | AUROC
Supervised learning | Li et al.[14] | 76.535 | 53.756 | 0.855
Gómez-Valverde et al.[15] | 80.202 | 62.417 | 0.883 | |
Judy et al.[16] | 78.348 | 60.174 | 0.860 | |
Serener and Serte [17] | 77.379 | 66.545 | 0.806 | |
Thakur et al.[18] | 80.305 | 65.882 | 0.856 | |
Fan et al.[3] (Baseline Result) | 75.052 | 50.267 | 0.794 | |
Low-shot learning | Kim et al.[19] | 74.619 | 65.000 | 0.805 |
MTSN (Backbone: ResNet-50) (Ours) | 79.028 | 65.908 | 0.841 | |
MTSN (Backbone: MobileNet-v2) (Ours) | 79.007 | 69.482 | 0.843 | |
Semi-supervised learning | Al Ghamdi et al.[20] | 79.028 | 72.382 | 0.860 |
Diaz-Pinto et al.[21] | 65.554 | 56.662 | 0.701 | |
MTSN (Backbone: ResNet-50) + OVV Self-Training (Ours) | 81.644 | 69.929 | 0.879 | |
MTSN (Backbone: MobileNet-v2) + OVV Self-Training (Ours) | 80.470 | 68.692 | 0.849 |
Test set: DIGS/ADAGES [23]
Training strategy | Method | Accuracy (%) | F1-score (%) | AUROC
Supervised learning | Li et al.[14] | 65.293 | 46.955 | 0.780 |
Gómez-Valverde et al.[15] | 65.157 | 45.729 | 0.795 | |
Judy et al.[16] | 63.556 | 41.144 | 0.760 | |
Serener and Serte [17] | 69.108 | 57.478 | 0.757 | |
Thakur et al.[18] | 70.606 | 60.322 | 0.786 | |
Fan et al.[3] (Baseline Result) | 62.500 | 38.042 | 0.744 | |
Low-shot learning | Kim et al.[19] | 63.862 | 52.949 | 0.687 |
MTSN (Backbone: ResNet-50) (Ours) | 67.745 | 60.754 | 0.743 | |
MTSN (Backbone: MobileNet-v2) (Ours) | 69.176 | 68.145 | 0.748 | |
Semi-supervised learning | Al Ghamdi et al.[20] | 64.850 | 54.816 | 0.716 |
Diaz-Pinto et al.[21] | 64.441 | 59.908 | 0.677 | |
MTSN (Backbone: ResNet-50) + OVV Self-Training (Ours) | 66.281 | 55.486 | 0.747 | |
MTSN (Backbone: MobileNet-v2) + OVV Self-Training (Ours) | 70.402 | 66.015 | 0.776 |
We set $\lambda$ in (1) to 0.1, 0.2, 0.3, 0.4, and 0.5, respectively, and compare the MTSN performance when the embedding fusion operation computes the element-wise absolute difference (EWAD) and the element-wise squared difference (EWSD). The comparisons in terms of F1-score and AUROC on the OHTS test set are provided in Fig. 5. It can be seen that the MTSN achieves the best overall performance when $\lambda = 0.1$. This is reasonable, as a higher $\lambda$ places more weight on the image classification task, which easily results in over-fitting. Additionally, the MTSN whose fusion operation computes the EWSD performs better when using ResNet-50 as the backbone CNN but slightly worse when using MobileNet-v2 as the backbone CNN. Therefore, we further evaluate their generalizability on three additional test sets, as shown in Table 1. When the fusion operation computes the EWAD, the MTSN generally performs better or very similarly on the additional test sets, especially when testing the MTSN assembled with ResNet-50 on the ACRIMA dataset. EWAD is, therefore, used in the following experiments. Furthermore, Table 1 provides the results of a baseline supervised learning experiment conducted on the low-shot training set. The results suggest that low-shot learning performs much better than supervised learning when the training set is small.
Furthermore, we discuss the selection of the threshold $\epsilon$ (used to select reliable "self-predictions" and "contrastive predictions" in OVV self-training) and the veto tolerance $n$, as well as the impact of different mini-batch sizes $K$ (each mini-batch contains $K$ pairs of reference and target fundus images). Table 2 shows the MTSN performance with respect to different $n$, $\epsilon$, and $K$ values. When evaluated on the OHTS test set, accuracy and F1-score increase slightly, but AUROC remains almost the same, as $K$ decreases. Moreover, with the increase of $n$ and $\epsilon$, the standard for determining reliable predictions becomes lower, and the semi-supervised learning performance degrades. Based on this experiment, we believe OVV self-training benefits from smaller $n$ and $\epsilon$.
Additionally, MTSNs trained with different mini-batch sizes $K$ are evaluated on the three additional test sets, as shown in Table 2. The network trained with a larger $K$ typically shows better results; when $K$ decreases, the generalizability of the MTSN degrades dramatically, especially in terms of F1-score (which drops by around 9-19%). Therefore, increasing the mini-batch size can improve the MTSN generalizability, as more reference fundus images are used to provide contrastive predictions for the target fundus images, which can veto more unreliable predictions on the unlabeled data. Hence, we increase $K$ to 30 to further improve OVV self-training when comparing it with other published SoTA algorithms in Sect. 4.4. Since our threshold selection experiments cover only a limited number of discrete $n$, $\epsilon$, and $K$ values, better performance may be achievable if more values are tested.
4.3 Comparison of Supervised, Low-Shot, and Semi-Supervised Glaucoma Diagnosis
Comparisons of supervised learning, low-shot learning, and semi-supervised learning (w.r.t. four backbone CNNs: ResNet-50, MobileNet-v2, DenseNet, and EfficientNet) for glaucoma diagnosis are provided in Table 3. First, these results suggest that the MTSNs fine-tuned with OVV self-training that requires a small number of labeled fundus images perform similarly (AUROC 95% CI overlaps considerably) and, in some cases, significantly better (AUROC 95% CI does not overlap) than the backbone CNNs trained with a large number of labeled fundus images (50 times larger) under full supervision.
Specifically, when using ResNet-50, MobileNet-v2, or DenseNet as the backbone CNN, semi-supervised learning performs similarly to supervised learning on the OHTS and DIGS/ADAGES test sets, and in most cases, significantly better than supervised learning on the ACRIMA and LAG datasets. Although EfficientNet trained through supervised learning performs unsatisfactorily on all four test sets, it shows considerable compatibility with the MTSN in the low-shot and semi-supervised learning experiments. Second, as expected, the AUROC scores achieved by low-shot learning are in most, but not all, cases slightly lower than those achieved by the backbone CNNs, when evaluated on the OHTS test set. However, low-shot learning shows better generalizability than supervised learning on the ACRIMA and LAG datasets. Moreover, since low-shot learning uses only a small amount of training data, training an MTSN is much faster than supervised learning. As MTSNs assembled with ResNet-50 and MobileNet-v2 typically demonstrate better performance than those assembled with DenseNet and EfficientNet, we only use the former two CNNs for the following experiments.
We also employ Grad-CAM++ [67] to explain the models’ decision-making, as shown in Fig. 6. These results suggest that the optic nerve head areas impact model decisions most. The neuroretinal rim areas are identified as most important, and the periphery contributed comparatively little to model decisions for both healthy and GON eyes [42].
We also carry out a series of experiments with different percentages of training data, as shown in Table 4(b), to further validate the effectiveness of our proposed low-shot and semi-supervised learning algorithms. The backbone CNNs trained via supervised learning on the small subsets generally perform worse than the MTSNs trained via low-shot and semi-supervised learning. As the amount of labeled training data increases, the models' performance saturates. When using less labeled training data (0.5% and 1.0%), the MTSN performance degrades; however, it can still be greatly improved with OVV self-training (accuracy, F1-score, and AUROC improve by up to 6%, 11%, and 0.08, respectively). In addition, when 10% or more of the entire training set is used, the MTSN performance saturates and OVV self-training brings very limited improvements.
4.4 Comparisons with Other SoTA Glaucoma Diagnosis Approaches
Table 5 provides comprehensive comparisons with nine SoTA glaucoma diagnosis algorithms (our recent work [3] provides the baseline supervised learning results). The results suggest that (a) for low-shot learning, the MTSNs trained by minimizing our proposed CWCE loss perform significantly better than the SoTA low-shot glaucoma diagnosis approach [19] on all four datasets (accuracy, F1-score, and AUROC are up to 5%, 15%, and 0.08 higher, respectively), and (b) for semi-supervised learning, the MTSNs fine-tuned with OVV self-training also achieve superior performance over the other two SoTA semi-supervised glaucoma diagnosis approaches [20, 21] (accuracy, F1-score, and AUROC are up to 6%, 11%, and 0.07 higher, respectively). Compared with the SoTA supervised approaches, the fine-tuned MTSNs demonstrate similar performance on the OHTS test set and better generalizability on the three additional test sets. Therefore, we believe that the MTSN with our proposed OVV self-training is an effective technique for semi-supervised glaucoma diagnosis.
4.5 Comparisons with SoTA General-Purpose Semi-Supervised Learning Approaches
Test Set | Method | Accuracy (%) | F1-score (%) | AUROC |
OHTS [11, 12] | FreeMatch [24] | 86.510 | 27.745 | 0.784 |
SoftMatch [25] | 85.679 | 25.718 | 0.768 | |
FixMatch [26] | 84.695 | 27.825 | 0.807 | |
FlexMatch [27] | 92.715 | 26.331 | 0.696 | |
OVV Self-Training (Ours) | 90.244 | 38.454 | 0.899 | |
ACRIMA [22] | FreeMatch [24] | 77.872 | 77.778 | 0.791 |
SoftMatch [25] | 78.582 | 79.115 | 0.795 | |
FixMatch [26] | 78.582 | 79.622 | 0.792 | |
FlexMatch [27] | 60.426 | 47.850 | 0.644 | |
OVV Self-Training (Ours) | 64.539 | 57.627 | 0.801 | |
LAG [9] | FreeMatch [24] | 78.471 | 67.087 | 0.793 |
SoftMatch [25] | 79.110 | 67.972 | 0.798 | |
FixMatch [26] | 81.314 | 73.394 | 0.827 | |
FlexMatch [27] | 71.302 | 35.360 | 0.654 | |
OVV Self-Training (Ours) | 81.644 | 69.929 | 0.879 | |
DIGS/ADAGES [23] | FreeMatch [24] | 65.429 | 56.716 | 0.685 |
SoftMatch [25] | 65.395 | 56.986 | 0.679 | |
FixMatch [26] | 68.835 | 65.090 | 0.731 | |
FlexMatch [27] | 57.050 | 26.558 | 0.594 | |
OVV Self-Training (Ours) | 66.281 | 55.486 | 0.747 |
Table 6 provides a comprehensive comparison of our proposed OVV self-training approach with four SoTA general-purpose semi-supervised learning methods: FreeMatch [24], SoftMatch [25], FixMatch [26], and FlexMatch [27], all of which employ a vision Transformer [28] as their backbone network. Our results demonstrate that the proposed OVV self-training approach outperforms these methods in terms of F1-score and AUROC on the OHTS dataset, with improvements of approximately 11-13% in F1-score and 0.09-0.20 in AUROC. Furthermore, our method demonstrates better generalizability in terms of AUROC across the three additional fundus image test sets. Although their results are inferior to ours, particularly in terms of AUROC, the comparison is arguably not entirely fair, as these methods were not specifically designed for the diagnosis of glaucoma or other diseases.
4.6 MTSN and CWCE Loss for Few-Shot Multi-Class Biomedical Image Classification
We conduct two additional few-shot multi-class lung disease diagnosis experiments: (a) chest X-ray image classification for COVID-19 and viral pneumonia detection [29, 30], and (b) lung histopathological image classification for lung cancer diagnosis [31], to validate the effectiveness of our proposed MTSN and CWCE loss. The first experiment has three classes of images: (1) healthy, (2) viral pneumonia, and (3) COVID-19, while the second also has three classes: (1) benign tissue, (2) adenocarcinoma, and (3) squamous cell carcinoma (an example of each class is shown in Fig. 7). In these two experiments, we select only a few images from each class for MTSN training. The numbers of images used for training, validation, and testing are given in Table 7, where it can be observed that the training set is much smaller than the validation and test sets. Since the CWCE loss (1) introduced above applies only to binary image classification problems, we extend it here to tackle multi-class image classification problems.
Class | Training | Validation | Test |
Healthy | 27 | 657 | 657 |
Viral pneumonia | 27 | 659 | 659 |
COVID-19 | 24 | 588 | 588 |
Class | Training | Validation | Test |
Benign tissue | 10 | 2,495 | 2,495 |
Adenocarcinoma | 10 | 2,495 | 2,495 |
Squamous cell carcinoma | 10 | 2,495 | 2,495 |
Let us consider the chest X-ray image classification task as an example. Each image is assigned a pair of binary labels whose combination encodes its class: one combination corresponds to healthy images, one to viral pneumonia images, and one to COVID-19 images.
The numbers of healthy images (class 1), viral pneumonia images (class 2), and COVID-19 images (class 3) are denoted as $N_1$, $N_2$, and $N_3$, respectively, and the total number of images is $N = N_1 + N_2 + N_3$. The weight used in the image classification loss with respect to class $k$ is determined from these counts so that rarer classes receive higher weights, analogous to the binary case. The image classification loss then becomes a weighted multi-class cross-entropy over the predicted probabilities that an image belongs to each of the three classes, with a one-hot indicator selecting the ground-truth class. Given a pair of images with ground-truth labels, there are four cases:
- case 1: healthy vs. viral pneumonia;
- case 2: viral pneumonia vs. COVID-19;
- case 3: COVID-19 vs. healthy;
- case 4: the two images belong to the same class.
The weight used in the image similarity comparison loss with respect to case $c$ is likewise determined from the numbers of training pairs falling into each case, so that rarer cases receive higher weights. The similarity loss then becomes a weighted cross-entropy over the predicted probabilities that a pair of images belongs to each of the four cases, with a one-hot indicator selecting the true case. The hyper-parameter $\lambda$ is set empirically.
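A sketch of one way to realize the class and case weights described above is given below; the inverse-frequency form and the pair-case counts in the second example are assumptions, not the paper's exact formulas.

```python
import numpy as np


def inverse_frequency_weights(counts):
    """Weights that are larger for rarer classes (or rarer pair cases).

    counts: per-class (or per-case) sample counts; the returned weights are
    normalised to sum to 1 (an assumed, common choice).
    """
    counts = np.asarray(counts, dtype=float)
    weights = 1.0 / counts
    return weights / weights.sum()


# Class weights for the chest X-ray training set in Table 7
# (27 healthy, 27 viral pneumonia, 24 COVID-19 images):
# the COVID-19 class receives the largest weight.
w_cls = inverse_frequency_weights([27, 27, 24])

# Case weights for the similarity task; the counts are the numbers of training
# pairs falling into each of the four cases (hypothetical values here).
w_case = inverse_frequency_weights([120, 110, 115, 300])
```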
The experimental results of these two multi-class biomedical image classification tasks are presented in Fig. 8 as two confusion matrices. These results demonstrate that our proposed MTSN can be effectively trained with very few images to solve multi-class biomedical image classification problems. Specifically, the achieved accuracy values for chest X-ray image classification (25-shot learning) and lung histopathological image classification (10-shot learning) are 93% and 90%, respectively. The chest X-ray image classification result compares favorably with the accuracy range of 82%-93% achieved by supervised methods using all available training data (2,520 images for training, 840 for validation, and 840 for testing) [68]. Although the accuracy achieved by the MTSN for lung histopathological image classification is lower than the accuracy of over 97% reported in [69] for supervised approaches using the full training set (8,250 images for training, 3,000 for validation, and 3,744 for testing), we believe that our proposed low-shot learning method can achieve comparable results when a small number of additional images is incorporated into MTSN training.
5 Discussion
Extensive experiments demonstrate the effectiveness and efficiency of training an MTSN by minimizing our proposed CWCE loss. Such a low-shot learning approach significantly reduces over-fitting and achieves, with a small training set (1,147 fundus images), an accuracy comparable to that obtained with a large training set (approximately 53K fundus images). We also demonstrate its effectiveness on two additional multi-class few-shot biomedical image classification tasks. Additionally, the MTSNs fine-tuned with OVV self-training outperform the SoTA semi-supervised glaucoma diagnosis algorithms [20, 21] as well as general-purpose semi-supervised learning algorithms [24, 25, 26, 27] trained for glaucoma diagnosis. They perform similarly to, and in some cases better than, SoTA supervised algorithms. However, our proposed method has two limitations:
- In OVV self-training, each target fundus image must be compared with all the reference fundus images in the same mini-batch, resulting in a computational complexity of $\mathcal{O}(K^2)$ per mini-batch. As the mini-batch size increases, OVV self-training becomes relatively memory-consuming. This high computational complexity may, for now, limit the feasibility of the method in clinical practice. Therefore, we plan to improve the OVV self-training strategy in the future by adaptively selecting only a limited number of reference fundus photographs for semi-supervised glaucoma diagnosis, which would reduce the computational complexity and make the method more practical in clinical settings.
- Our proposed OVV self-training strategy is developed for binary image classification and may not be directly applicable to multi-class image classification problems. Therefore, we plan to extend the contrastive prediction procedure to multi-class image classification in future work. Additional hyper-parameter tuning is always possible, but over-fitting occurs easily with limited data.
6 Conclusion
The main contributions of this paper include: (1) a multi-task Siamese network that can learn glaucoma diagnosis from very limited labeled training data; (2) an effective semi-supervised learning strategy, referred to as One-Vote Veto self-training, which can produce pseudo labels for the unlabeled data to fine-tune a pre-trained multi-task Siamese network. Extensive experiments conducted on four fundus image datasets demonstrated the effectiveness of these proposed techniques. The low-shot learning reduces over-fitting and achieves an accuracy on a small training set comparable to that of a large training set. Furthermore, with One-Vote Veto self-training, the multi-task Siamese networks perform similarly to their backbone CNNs (trained via supervised learning on the full training set) on the OHTS test set and show better generalizability on three additional test sets. The methods introduced in this paper can also be applied to other few-shot multi-class biomedical image classification problems, e.g., COVID-19 and lung cancer diagnosis, and other diseases in which only a small quantity of ground-truth labels are available for network training.
References
- [1] R. N. Weinreb and P. T. Khaw, “Primary open-angle glaucoma,” The Lancet, vol. 363, no. 9422, pp. 1711–1720, 2004.
- [2] Y.-C. Tham et al., “Global prevalence of glaucoma and projections of glaucoma burden through 2040: A systematic review and meta-analysis,” Ophthalmology, vol. 121, no. 11, pp. 2081–2090, 2014.
- [3] R. Fan et al., “Detecting glaucoma in the ocular hypertension study using deep learning,” JAMA Ophthalmology, vol. 140, no. 4, pp. 383–391, 2022.
- [4] C. Traverso et al., “Direct costs of glaucoma and severity of the disease: a multinational long term study of resource utilisation in Europe,” British Journal of Ophthalmology, vol. 89, no. 10, pp. 1245–1249, 2005.
- [5] W. Huang et al., “The adverse impact of glaucoma on psychological function and daily physical activity,” Journal of Ophthalmology, vol. 2020, 2020.
- [6] R. K. Parrish et al., “Visual function and quality of life among patients with glaucoma,” Archives of Ophthalmology, vol. 115, no. 11, pp. 1447–1455, 1997.
- [7] M. Kwon et al., “Association between glaucoma and at–fault motor vehicle collision involvement among older drivers: A population-based study,” Ophthalmology, vol. 123, no. 1, pp. 109–116, 2016.
- [8] G. McGwin Jr et al., “Binocular visual field impairment in glaucoma and at-fault motor vehicle collisions,” Journal of Glaucoma, vol. 24, no. 2, pp. 138–143, 2015.
- [9] L. Li et al., “Attention based glaucoma detection: A large-scale database and CNN model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10 571–10 580.
- [10] R. Fan et al., “Detecting glaucoma from fundus photographs using deep learning without convolutions: Transformer for improved generalization,” Ophthalmology Science, vol. 3, no. 1, p. 100233, 2023.
- [11] M. O. Gordon and M. A. Kass, “The ocular hypertension treatment study: Design and baseline description of the participants,” Archives of Ophthalmology, vol. 117, no. 5, pp. 573–583, 1999.
- [12] M. A. Kass et al., “The ocular hypertension treatment study: A randomized trial determines that topical ocular hypotensive medication delays or prevents the onset of primary open-angle glaucoma,” Archives of Ophthalmology, vol. 120, no. 6, pp. 701–713, 2002.
- [13] M. O. Gordon et al., “Assessment of the impact of an endpoint committee in the ocular hypertension treatment study,” American Journal of Ophthalmology, vol. 199, pp. 193–199, 2019.
- [14] Z. Li et al., “Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs,” Ophthalmology, vol. 125, no. 8, pp. 1199–1206, 2018.
- [15] J. J. Gómez-Valverde et al., “Automatic glaucoma classification using color fundus images based on convolutional neural networks and transfer learning,” Biomedical Optics Express, vol. 10, no. 2, pp. 892–913, 2019.
- [16] D. Judy et al., “Automated identification of glaucoma from fundus images using deep learning techniques,” European Journal of Molecular & Clinical Medicine, vol. 7, no. 2, pp. 5449–5458, 2020.
- [17] A. Serener and S. Serte, “Transfer learning for early and advanced glaucoma detection with convolutional neural networks,” in 2019 Medical Technologies Congress (TIPTEKNO). IEEE, 2019, pp. 1–4.
- [18] A. Thakur et al., “Predicting glaucoma before onset using deep learning,” Ophthalmology Glaucoma, vol. 3, no. 4, pp. 262–268, 2020.
- [19] M. Kim et al., “Few-shot learning using a small-sized dataset of high-resolution fundus images for glaucoma diagnosis,” in Proceedings of the 2nd International Workshop on Multimedia for Personal Health and Health Care, 2017, pp. 89–92.
- [20] M. Al Ghamdi et al., “Semi-supervised transfer learning for convolutional neural networks for glaucoma detection,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3812–3816.
- [21] A. Díaz Pinto et al., “Retinal image synthesis and semi-supervised learning for glaucoma assessment,” IEEE Transactions on Medical Imaging, vol. 38, no. 9, pp. 2211–2218, 2019.
- [22] A. Díaz Pinto et al., “CNNs for automatic glaucoma assessment using fundus images: An extensive validation,” Biomedical Engineering Online, vol. 18, pp. 1–19, 2019.
- [23] P. Sample et al., “The African descent and glaucoma evaluation study (ADAGES): Design and baseline data,” Archives of Ophthalmology, vol. 127, no. 9, pp. 1136–1145, 2009.
- [24] Y. Wang et al., “FreeMatch: Self-adaptive thresholding for semi-supervised learning,” in the International Conference on Learning Representations (ICLR), 2023, in press.
- [25] H. Chen et al., “SoftMatch: Addressing the quantity-quality trade-off in semi-supervised learning,” in the International Conference on Learning Representations (ICLR), 2023, in press.
- [26] K. Sohn et al., “FixMatch: Simplifying semi-supervised learning with consistency and confidence,” Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 596–608, 2020.
- [27] B. Zhang et al., “FlexMatch: Boosting semi-supervised learning with curriculum pseudo labeling,” Advances in Neural Information Processing Systems (NeurIPS), vol. 34, pp. 18408–18419, 2021.
- [28] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in the International Conference on Learning Representations (ICLR), 2021.
- [29] M. E. Chowdhury et al., “Can AI help in screening viral and COVID-19 pneumonia?” IEEE Access, vol. 8, pp. 132665–132676, 2020.
- [30] T. Rahman et al., “Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images,” Computers in Biology and Medicine, vol. 132, p. 104319, 2021.
- [31] A. A. Borkowski et al., “Lung and colon cancer histopathological image dataset (LC25000),” CoRR, 2019.
- [32] A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” CoRR, 2014.
- [33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in the International Conference on Learning Representations (ICLR), 2015, pp. 1–14.
- [34] C. Szegedy et al., “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
- [35] C. Szegedy et al., “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826.
- [36] J. M. Ahn et al., “A deep learning model for the detection of both advanced and early glaucoma using fundus photography,” PLoS ONE, vol. 13, no. 11, p. e0207982, 2018.
- [37] J. Deng et al., “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
- [38] K. He et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- [39] S. Liu et al., “A deep learning-based algorithm identifies glaucomatous discs using monoscopic fundus photographs,” Ophthalmology Glaucoma, vol. 1, no. 1, pp. 15–22, 2018.
- [40] A. R. Ran et al., “Detection of glaucomatous optic neuropathy with spectral-domain optical coherence tomography: A retrospective training and validation deep-learning analysis,” The Lancet Digital Health, vol. 1, no. 4, pp. e172–e182, 2019.
- [41] F. A. Medeiros et al., “Detection of progressive glaucomatous optic nerve damage on fundus photographs with deep learning,” Ophthalmology, vol. 128, no. 3, pp. 383–392, 2021.
- [42] M. Christopher et al., “Performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs,” Scientific Reports, vol. 8, no. 1, pp. 1–13, 2018.
- [43] M. Christopher et al., “Effects of study population, labeling and training on glaucoma detection using deep learning algorithms,” Translational Vision Science & Technology, vol. 9, no. 2, pp. 27–27, 2020.
- [44] D. Jain et al., “Open-source, ultra-low-cost smartphone attachment for non-mydriatic fundus photography-open indirect ophthalmoscope,” Investigative Ophthalmology & Visual Science, vol. 57, no. 12, pp. 1685–1685, 2016.
- [45] E. Matthew Lawson and R. Raskar, “Smart phone administered fundus imaging without additional imaging optics,” Investigative Ophthalmology & Visual Science, vol. 55, no. 13, pp. 1609–1609, 2014.
- [46] M. Sandler et al., “MobileNetV2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.
- [47] Y. Wang et al., “Generalizing from a few examples: A survey on few-shot learning,” ACM Computing Surveys (CSUR), vol. 53, no. 3, pp. 1–34, 2020.
- [48] P. Zhou et al., “Attention-based bidirectional long short-term memory networks for relation classification,” in Proceedings of the 54th annual meeting of the Association for Computational Linguistics (volume 2: Short Papers), 2016, pp. 207–212.
- [49] A. Radford et al., “Unsupervised representation learning with deep convolutional generative adversarial networks,” in the International Conference on Learning Representations (ICLR), 2016.
- [50] I. Triguero et al., “Self-labeled techniques for semi-supervised learning: Taxonomy, software and empirical study,” Knowledge and Information Systems, vol. 42, no. 2, pp. 245–284, 2015.
- [51] J. E. Van Engelen and H. H. Hoos, “A survey on semi-supervised learning,” Machine Learning, vol. 109, no. 2, pp. 373–440, 2020.
- [52] A. Miller et al., “Key-value memory networks for directly reading documents,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016, pp. 1400–1409.
- [53] G. Koch et al., “Siamese neural networks for one-shot image recognition,” in ICML Deep Learning Workshop, vol. 2. Lille, 2015.
- [54] J. Jang and C. O. Kim, “Siamese network-based health representation learning and robust reference-based remaining useful life prediction,” IEEE Transactions on Industrial Informatics, vol. 18, no. 8, pp. 5264–5274, 2021.
- [55] Q. Wang et al., “Learning attentions: Residual attentional Siamese network for high performance online visual tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4854–4863.
- [56] W. Wang et al., “Face recognition based on deep learning,” in 1st International Conference on Human Centered Computing (HCC). Springer, 2015, pp. 812–820.
- [57] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in International Workshop on Similarity-Based Pattern Recognition. Springer, 2015, pp. 84–92.
- [58] P. Khosla et al., “Supervised contrastive learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 18661–18673.
- [59] A. Kendall et al., “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7482–7491.
- [60] D. McClosky, E. Charniak, and M. Johnson, “Effective self-training for parsing,” in Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, 2006, pp. 152–159.
- [61] S. Sukhbaatar, J. Weston, R. Fergus et al., “End-to-end memory networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2015, pp. 2440–2448.
- [62] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, pp. 2579–2605, 2008.
- [63] L.-C. Chen et al., “Rethinking atrous convolution for semantic image segmentation,” CoRR, 2017.
- [64] C. Tan et al., “A survey on deep transfer learning,” in International Conference on Artificial Neural Networks (ICANN). Springer, 2018, pp. 270–279.
- [65] G. Huang et al., “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
- [66] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning (ICML). PMLR, 2019, pp. 6105–6114.
- [67] A. Chattopadhay et al., “Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018, pp. 839–847.
- [68] S. Jadon, “COVID-19 detection from scarce chest x-ray image data using few-shot deep learning approach,” in Medical Imaging 2021: Imaging Informatics for Healthcare, Research, and Applications, vol. 11601. International Society for Optics and Photonics, 2021, p. 116010X.
- [69] M. A. Abbas et al., “The histopathological diagnosis of adenocarcinoma & squamous cells carcinoma of lungs by artificial intelligence: A comparative study of convolutional neural networks,” medRxiv, 2020.