
One-Vote Veto: Semi-Supervised Learning for Low-Shot Glaucoma Diagnosis

Rui Fan, Senior Member, IEEE, Christopher Bowd, Nicole Brye, Mark Christopher, Robert N. Weinreb, David J. Kriegman, Fellow, IEEE, and Linda M. Zangwill

This research was supported in part by the Fundamental Research Funds for the Central Universities, and in part by awards from the National Eye Institute (grants EY027510, R214278211, P30EY022589, K99EY030942, EY026574, EY11008, and EY19869), the National Center on Minority Health and Health Disparities, National Institutes of Health (grants EY09341 and EY09307), the Horncrest Foundation, awards to the Department of Ophthalmology and Visual Sciences at Washington University, the NIH Vision Core Grant P30 EY 02687, Merck Research Laboratories, Pfizer, Inc., White House Station, New Jersey, and unrestricted grants from Research to Prevent Blindness, Inc., New York, NY. (Corresponding author: Linda M. Zangwill)

R. Fan is with the College of Electronics & Information Engineering, Shanghai Research Institute for Intelligent Autonomous Systems, State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science Center for Intelligent Autonomous Systems, Tongji University, Shanghai 201804, China, and was with the Hamilton Glaucoma Center, Viterbi Family Department of Ophthalmology, Shiley Eye Institute, and the Department of Computer Science & Engineering, the University of California San Diego, La Jolla, CA 92093, USA (e-mail: [email protected]).

C. Bowd, N. Brye, M. Christopher, R. N. Weinreb, and L. M. Zangwill are with the Hamilton Glaucoma Center, Viterbi Family Department of Ophthalmology and Shiley Eye Institute, the University of California San Diego, La Jolla, CA 92093, USA (e-mail: {cbowd, nbrye, mac157, rweinreb, lzangwill}@health.ucsd.edu).

D. J. Kriegman is with the Department of Computer Science & Engineering, the University of California San Diego, La Jolla, CA 92093, USA (e-mail: [email protected]).
Abstract

Convolutional neural networks (CNNs) are a promising technique for automated glaucoma diagnosis from images of the fundus, and these images are routinely acquired as part of an ophthalmic exam. Nevertheless, CNNs typically require a large amount of well-labeled data for training, which may not be available in many biomedical image classification applications, especially when diseases are rare and where labeling by experts is costly. This article makes two contributions to address this issue: (1) It extends the conventional Siamese network and introduces a training method for low-shot learning when labeled data are limited and imbalanced, and (2) it introduces a novel semi-supervised learning strategy that uses additional unlabeled training data to achieve greater accuracy. Our proposed multi-task Siamese network (MTSN) can employ any backbone CNN, and we demonstrate with four backbone CNNs that its accuracy with limited training data approaches the accuracy of backbone CNNs trained with a dataset that is 50 times larger. We also introduce One-Vote Veto (OVV) self-training, a semi-supervised learning strategy that is designed specifically for MTSNs. By taking both self-predictions and contrastive predictions of the unlabeled training data into account, OVV self-training provides additional pseudo labels for fine-tuning a pre-trained MTSN. Using a large (imbalanced) dataset with 66,715 fundus photographs acquired over 15 years, extensive experimental results demonstrate the effectiveness of low-shot learning with MTSN and semi-supervised learning with OVV self-training. Three additional, smaller clinical datasets of fundus images acquired under different conditions (cameras, instruments, locations, populations) are used to demonstrate the generalizability of the proposed methods.

Index Terms

Convolutional neural networks, glaucoma diagnosis, low-shot learning, semi-supervised learning.

1 Introduction


Glaucoma is a prevalent and debilitating disease that can lead to progressive and irreversible vision loss through optic nerve damage [1]. The number of people with glaucoma worldwide was estimated at 64.3 million in 2013 and, due to aging populations, is expected to rise to 111.8 million by 2040 [2]. Improvement in the management of glaucoma would have a major human and socio-economic impact [3]. Early identification and intervention would significantly reduce the economic burden of late-stage disease [4]. In addition, visual impairment in glaucoma patients has been associated with decreased physical activity and mental health [5, 6] and an increased risk of motor vehicle accidents [7, 8].

With recent advances in machine learning, convolutional neural networks (CNNs), trained via supervised learning, have shown promise in diagnosing glaucoma from fundus images (photographs of the back of the eye) [9]. However, this requires large amounts of empirical data for supervised training [10]. In this study, we use 66,715 fundus photographs from the Ocular Hypertension Treatment Study (OHTS) [11, 12, 13], a 22-site, multi-center, longitudinal (phases 1 and 2, 1994-2008) randomized clinical trial of 1,636 subjects (3,272 eyes). The primary goal of the OHTS was to determine whether topical ocular hypotensive medications could delay or prevent the onset of glaucoma in eyes with high intraocular pressure [11]. Conversion to glaucoma was decided by a masked endpoint committee of three glaucoma specialists using fundus photographs and visual fields. Owing to its well-characterized ground-truth labels, the OHTS dataset provides a basis for exploring effective ways of training CNNs to diagnose glaucoma with low-shot learning, when only a small quantity of labeled data is available, and/or with semi-supervised learning, when raw data are abundant but labeling is scarce, costly, requires strong expertise, or is simply unavailable. However, as shown in Fig. 1, conventional semi-supervised learning approaches typically require a reliable pre-trained CNN (trained on a small sample) as prior knowledge, which is often challenging to obtain due to over-fitting. Moreover, there is a strong motivation to design a feasible semi-supervised learning strategy capable of determining confident predictions and generating pseudo labels for unlabeled data. We focus specifically on fundus images and glaucoma diagnosis in this article because we have sufficient data to accurately characterize the effectiveness of our methods. The same techniques could also be applied to tasks with limited data, such as rare diseases, or with limited labels (e.g., asthma and diabetes prediction from fundus images). Therefore, this article aims to answer the following questions:

  1. Can a CNN be developed to accurately diagnose glaucoma, compared to the expert graders of the OHTS? Will the model be generalizable to other datasets?

  2. Is it necessary to train CNNs with thousands of labeled fundus images to diagnose glaucoma, or can diagnosis be achieved using only one image per patient (approximately 1.1K fundus images in the OHTS training set)?

  3. Can the performance of a CNN trained using a small sample be improved further by fine-tuning it with additional unlabeled training data?

Figure 1: Supervised learning vs. semi-supervised learning.

To answer these questions, we first evaluate the performance of state-of-the-art (SoTA) glaucoma diagnosis algorithms, including six supervised learning algorithms [14, 15, 16, 17, 18, 3], one low-shot learning algorithm [19], and two semi-supervised learning algorithms [20, 21], on the OHTS dataset. Their generalizability is further validated on three additional clinical datasets of fundus images: (a) ACRIMA (Spain) [22], (b) Large-Scale Attention-Based Glaucoma (LAG, China) [9], and (c) the UCSD-based Diagnostic Innovations in Glaucoma Study and African Descent and Glaucoma Evaluation Study (DIGS/ADAGES, US) [23].

Furthermore, we propose a novel extension of the conventional Siamese network, referred to as the Multi-Task Siamese Network (MTSN), as depicted in Fig. 2. By minimizing a novel Combined Weighted Cross-Entropy (CWCE) loss, the MTSN simultaneously performs two tasks: measuring the similarity of a given pair of images (primary task) and classifying them as healthy or glaucoma (secondary task). With a small training set of approximately 1.1K fundus images, we explore the feasibility of training an MTSN for glaucoma diagnosis. Although the MTSN may not provide complementary information, it effectively performs a type of “data augmentation” by generating $C(N,2)$ pairs of fundus images for training instead of using $N$ independent fundus images. The visual features learned from these two tasks prove to be more informative for glaucoma diagnosis when the training set is small. Our experimental results demonstrate that the MTSN greatly reduces over-fitting and achieves, with a small training set, accuracy comparable to that obtained with a large training set containing approximately 53K fundus images.

Moreover, we propose a novel semi-supervised learning strategy, referred to as One-Vote Veto (OVV) Self-Training, which generates reliable pseudo labels for the unlabeled training data and incorporates them into the labeled training data to fine-tune the MTSN for improved performance and generalizability. Our extensive experiments show that the MTSN fine-tuned with OVV self-training achieves similar performance to the corresponding backbone CNN trained via supervised learning on the OHTS dataset, and achieves higher area under the receiver operating characteristic curve (AUROC) scores on the additional fundus image datasets. The fine-tuned MTSN also outperforms SoTA semi-supervised glaucoma diagnosis approaches [20, 21], and in some cases, even outperforms SoTA supervised approaches. Additionally, we compare our proposed OVV self-training approach with four SoTA general-purpose semi-supervised learning methods, including FreeMatch [24], SoftMatch [25], FixMatch [26], and FlexMatch [27], all of which utilize vision Transformer [28] as their backbone network. The results demonstrate that our proposed OVV self-training approach outperforms these methods on the OHTS dataset and demonstrates better generalizability on three additional fundus image test sets.

We also conduct two additional few-shot biomedical image classification experiments (chest X-ray image classification [29, 30] and lung histopathological image classification [31]) to further validate the effectiveness of the MTSN on other types of image data. The promising results indicate that our proposed algorithms have the potential to solve a variety of biomedical image classification problems.

2 Related Works

Most SoTA glaucoma diagnosis algorithms are developed based on supervised fundus image classification. For example, Judy et al. [16] trained an AlexNet [32] to diagnose glaucoma. As VGG architectures [33] can learn more complicated image features than AlexNet, Gómez-Valverde et al. [15] employed a VGG-19 [33] model to diagnose glaucoma. Nevertheless, VGG architectures [33] consist of hundreds of millions of parameters, making them memory-intensive. In contrast, GoogLeNet [34] and Inception-v3 [35] have lower computational complexities. Hence, Ahn et al. [36] and Li et al. [14] utilized transfer learning to re-train an Inception-v3 [35] model (pre-trained on the ImageNet [37] database) for glaucoma diagnosis, while Serener and Serte [17] re-trained a pre-trained GoogLeNet [34] model to diagnose glaucoma. However, as network depth increases, accuracy saturates and then degrades rapidly due to vanishing gradients [38]. To tackle this problem, the residual neural network (ResNet) [38] was developed. Due to its robustness, ResNet-50 [38] has been extensively used for biomedical image analysis and is a popular choice [39, 40, 41, 42, 43, 3] for fundus image classification. Additionally, developing low-cost, real-time embedded glaucoma diagnosis systems [44, 45, 18] for mobile devices, e.g., based on MobileNet-v2 [46], is also an emerging area.

Machine/deep learning has achieved compelling performance in data-intensive applications, but it is often challenging for these algorithms to yield comparable performance when only a limited amount of labeled training data is available [47]. Low-shot and semi-supervised learning can address these issues. Unfortunately, they are rarely discussed in the field of glaucoma diagnosis. To the best of our knowledge, [19] is the only published low/few-shot glaucoma diagnosis algorithm. This algorithm employs a conventional Siamese network to compare two groups of (negative and positive) fundus images. The Siamese network utilizes two identical CNNs to learn visual embeddings. A bi-directional long short-term memory [48] component is then trained over the CNN outputs for glaucoma diagnosis. However, the training process is complicated, since several different types of losses are minimized, and the achieved glaucoma diagnosis results are unsatisfactory, since each sub-network is only fed with one type of fundus image (either negative or positive). The lack of same-class comparisons leads to a performance bottleneck when compared to the MTSN proposed in this article.

Figure 2: An illustration of our MTSN for joint learning of fundus image similarity measurement (primary task) and glaucoma diagnosis (secondary task) in a low-shot manner.

A thorough search of the relevant literature yielded only two published studies on semi-supervised learning specifically for glaucoma diagnosis [21, 20]. Diaz-Pinto et al.[21] utilized a deep convolutional generative adversarial network (DCGAN) [49] for semi-supervised learning of glaucoma diagnosis, where the discriminator is trained to classify healthy and glaucomatous optic neuropathy (GON) fundus images, while also distinguishing between real and fake fundus images. The classifier for the former task is then employed for glaucoma diagnosis. On the other hand, Al Ghamdi et al.[20] developed a glaucoma diagnosis approach based on self-training [50], which is a typical semi-supervised learning approach that uses a pre-trained model (typically yielded via supervised learning) to produce pseudo labels of the unlabeled data. However, producing reliable pseudo labels is a significant challenge in self-training, and the pseudo labels generated by a single pre-trained CNN are usually not trustworthy enough for CNN fine-tuning [51]. Additionally, training a reliable pre-trained classifier with only a small amount of labeled data is notably demanding. In this article, we combine semi-supervised learning with low-shot learning to address these issues using glaucoma diagnosis as an example case. Specifically, our proposed OVV self-training strategy, as discussed in Sect. 3.2, is inspired by the mechanism of learning with external memory (LwEM), used in low-shot learning [52], where the labels of unlabeled training data are predicted by a classifier trained via low-shot learning on a small collection of fundus images with ground-truth labels.

3 Methodology

3.1 Multi-Task Siamese Network

As illustrated in Fig. 1, conventional semi-supervised learning methods initialize a network by pre-training it with a small number of fundus images for glaucoma diagnosis. However, we observed that such approaches are highly sensitive to noise. As a result, we design a novel MTSN specifically for our semi-supervised learning approach, which requires not only predicting the label of a given fundus image but also determining the similarity between a pair of given fundus images to generate pseudo labels through a voting process.

Conventional Siamese networks have become a common choice for metric learning and few/low-shot image recognition tasks [53]. These networks comprise two identical sub-networks, as depicted in Fig. 2. Each pair of fundus images $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ is separately fed into these sub-networks, producing two 1D embeddings (features) $\mathbf{h}_{i}$ and $\mathbf{h}_{j}$, respectively. Another 1D embedding $\mathbf{h}_{i,j}$ is generated by $\Phi(\cdot)$. $\mathbf{h}_{i,j}$ is then passed through a fully connected (FC) layer to produce a scalar $q(\mathbf{x}_{i},\mathbf{x}_{j})\in[0,1]$ indicating the similarity between $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$. If $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are dissimilar, $q(\mathbf{x}_{i},\mathbf{x}_{j})$ approaches 1, and vice versa. The ground-truth labels of $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are denoted by $y_{i}\in\{0,1\}$ and $y_{j}\in\{0,1\}$, respectively, where 0 denotes a healthy image and 1 denotes a GON image.

However, a conventional Siamese network can only determine whether $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ belong to the same category, rather than predicting their individual categories. A straightforward solution is to connect $\mathbf{h}_{i}$ and $\mathbf{h}_{j}$ to separate FC layers, producing two scalars $p(\mathbf{x}_{i})$ and $p(\mathbf{x}_{j})$ indicating the probabilities that $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are GON images, respectively. Refer to Fig. 2 and note that the two FC layers connected to $\mathbf{h}_{i}$ and $\mathbf{h}_{j}$ share the same weights. In this article, we refer to the network architecture in Fig. 2 as an MTSN, which can simultaneously measure the similarity of a given pair of fundus images and classify each of them as either healthy or GON. These two tasks are related yet not directly deducible from one another: a well-trained glaucoma diagnosis network can be employed to compare differences between given pairs of fundus images, but a well-trained fundus image similarity measurement network cannot directly output the category of a given fundus image.
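For illustration, a minimal PyTorch sketch of such an MTSN is given below. It is a sketch only: the ResNet-50 backbone, the embedding size, and the use of the element-wise absolute difference for $\Phi(\cdot)$ (the variant adopted later in Sect. 4.2) are assumptions of this example, not a description of the exact implementation used in our experiments.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class MTSN(nn.Module):
    """Multi-task Siamese network: a shared backbone, a similarity head
    (primary task), and a shared-weight classification head (secondary task)."""

    def __init__(self, embed_dim=2048):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")     # ImageNet pre-training
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  # drop final FC
        self.cls_head = nn.Linear(embed_dim, 1)   # p(x): probability of GON
        self.sim_head = nn.Linear(embed_dim, 1)   # q(x_i, x_j): dissimilarity score

    def embed(self, x):
        # 1D embedding h of a fundus image
        return torch.flatten(self.backbone(x), 1)

    def forward(self, x_i, x_j):
        h_i, h_j = self.embed(x_i), self.embed(x_j)
        # Phi(.) as element-wise absolute difference (EWAD)
        h_ij = torch.abs(h_i - h_j)
        q = torch.sigmoid(self.sim_head(h_ij)).squeeze(1)    # approaches 1 for dissimilar pairs
        p_i = torch.sigmoid(self.cls_head(h_i)).squeeze(1)   # the same FC weights are used
        p_j = torch.sigmoid(self.cls_head(h_j)).squeeze(1)   # for both branches
        return p_i, p_j, q
```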

In addition, the visual features learned from the primary and secondary tasks are distinct from one another. For the primary task, the network learns visual features to classify same-class and different-class fundus image pairs. For the secondary task, the network learns visual features to classify GON and healthy fundus images. Although this network architecture may not provide complementary information, it effectively performs a type of “data augmentation” by producing $C(N,2)$ pairs of fundus images for training, rather than using $N$ independent fundus images. The visual features learned from these two tasks prove to be more informative for glaucoma diagnosis when the training set is small. Furthermore, multi-task learning is effective because requiring an algorithm to perform well on a related task induces regularization, which can be superior to uniform complexity penalization for preventing over-fitting. This idea has been explored in many Siamese neural network works, such as [54, 55, 56].

In this article, we use $n_{0}$ and $n_{1}$ to denote the numbers of healthy and GON fundus images used to train the MTSN, respectively, with $n=n_{0}+n_{1}$. $n_{0}$ is usually much greater than $n_{1}$, because there are fewer patients with glaucomatous disease than healthy patients, resulting in a severely imbalanced dataset. Therefore, the MTSN is trained by minimizing a CWCE loss as follows:

\mathcal{L}=\mathcal{L}_{\text{sim}}+\lambda\mathcal{L}_{\text{cla}}, (1)

where

\mathcal{L}_{\text{sim}}=-\frac{n_{0}(n_{0}-1)+n_{1}(n_{1}-1)}{n(n-1)}\,|y_{i}-y_{j}|\log(q(\mathbf{x}_{i},\mathbf{x}_{j}))-\frac{2n_{0}n_{1}}{n(n-1)}\,(1-|y_{i}-y_{j}|)\log(1-q(\mathbf{x}_{i},\mathbf{x}_{j})), (2)

\mathcal{L}_{\text{cla}}=-\frac{1}{n}\Big(n_{0}\big(y_{i}\log(p(\mathbf{x}_{i}))+y_{j}\log(p(\mathbf{x}_{j}))\big)+n_{1}\big((1-y_{i})\log(1-p(\mathbf{x}_{i}))+(1-y_{j})\log(1-p(\mathbf{x}_{j}))\big)\Big). (3)

The hyper-parameter $\lambda$ balances the primary task loss $\mathcal{L}_{\text{sim}}$ and the secondary task loss $\mathcal{L}_{\text{cla}}$. The choice of $\lambda$ and $\Phi(\cdot)$ is discussed in Sect. 4.2. The motivations for using such a CWCE loss function instead of the commonly used triplet loss [57] or contrastive loss [58] to train the MTSN are:

  1. Most datasets for rare disease diagnosis are imbalanced. As detailed in Sect. 4.1, the OHTS training set is severely imbalanced, with 50,208 healthy images and only 2,416 GON images for supervised learning, and 995 healthy images and 152 GON images for low-shot learning. Learning from such an imbalanced dataset without class weights can result in many incorrect predictions, with most GON images likely to be predicted as healthy. To address this issue, a higher weight should be assigned to the minority class to prevent the CNN from predicting all fundus images as the majority class.

  2. In multi-task learning, weighting different types of losses, such as regression and classification losses, is typically challenging [59]. Assigning an incorrect weight may cause one task to perform poorly, even when the other tasks converge to satisfactory results. Formulating $\mathcal{L}_{\text{sim}}$ as a weighted cross-entropy loss is therefore a simple but effective solution. However, due to the dataset imbalance problem, the cross-entropy losses have to be weighted.

  3. As shown in Fig. 3, OVV self-training requires both labels and probabilities (of being GON images), predicted by a pre-trained model, to produce pseudo labels for unlabeled data. Such a network architecture and training loss can efficiently and effectively provide both “self-predicted” and “contrastively-predicted” labels and probabilities, as described in Sect. 3.2.

It should be noted that the two fundus images forming a training pair for the MTSN need not come from the same patient.
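For concreteness, a minimal PyTorch sketch of the CWCE loss defined in (1)-(3) is shown below; the small constant added inside the logarithms for numerical stability is an implementation detail assumed here, not part of the definition above.

```python
import torch


def cwce_loss(p_i, p_j, q, y_i, y_j, n0, n1, lam=0.3, eps=1e-7):
    """Combined weighted cross-entropy loss of (1)-(3).

    p_i, p_j: predicted GON probabilities of x_i and x_j (float tensors)
    q:        predicted dissimilarity of the pair (x_i, x_j)
    y_i, y_j: ground-truth labels (0 = healthy, 1 = GON), as float tensors
    n0, n1:   numbers of healthy and GON training images, n = n0 + n1
    """
    n = n0 + n1
    d = torch.abs(y_i - y_j)  # 1 if the pair is different-class, 0 otherwise

    # (2): similarity loss; each term is weighted by the frequency of the
    # opposite pair type so that same- and different-class pairs are balanced.
    w_diff = (n0 * (n0 - 1) + n1 * (n1 - 1)) / (n * (n - 1))
    w_same = 2.0 * n0 * n1 / (n * (n - 1))
    l_sim = -(w_diff * d * torch.log(q + eps)
              + w_same * (1.0 - d) * torch.log(1.0 - q + eps)).mean()

    # (3): classification loss, with the minority (GON) terms up-weighted by n0.
    l_cla = -((n0 * (y_i * torch.log(p_i + eps) + y_j * torch.log(p_j + eps))
               + n1 * ((1 - y_i) * torch.log(1 - p_i + eps)
                       + (1 - y_j) * torch.log(1 - p_j + eps))) / n).mean()

    return l_sim + lam * l_cla  # (1)
```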

Figure 3: An illustration of our One-Vote Veto self-training strategy. $\mathbf{h}^{d}_{1,k}$ and $\mathbf{h}^{d}_{m,k}$ are two 1D embeddings, each followed by an FC layer to produce scalars indicating the similarities between the given pairs of reference and target fundus images. The reference fundus images, which have ground-truth labels, are used to train the MTSN by minimizing (1). The contrastive predictions are obtained using (4) and (5). The pseudo labels of the target fundus photographs are generated using the One-Vote Veto self-training strategy detailed in Algorithm 1.

3.2 One-Vote Veto Self-Training

As discussed in Sect. 1, self-training aims to improve the performance of a pre-trained model by incorporating reliable predictions of the unlabeled data to obtain useful additional information that can be used for model fine-tuning. A feasible strategy to determine such reliable predictions is, therefore, key to the success of self-training [60].

In conventional semi-supervised learning algorithms, a pre-trained image classification model (obtained through supervised learning) can be fine-tuned by assessing the reliability of unlabeled images. To determine the reliability of an unlabeled image, the predicted probability of its most likely class is compared to a pre-determined threshold. If the probability surpasses this threshold, the prediction is considered a pseudo label. Subsequently, the image and its corresponding pseudo label are utilized to fine-tune the pre-trained model.
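For reference, the conventional confidence-threshold scheme described above can be summarized in a short sketch; the threshold value of 0.95 is an arbitrary choice for illustration, not a recommended setting.

```python
import torch


def confidence_pseudo_labels(probs, threshold=0.95):
    """Conventional pseudo-labelling: keep a prediction only if the model's
    confidence for the most likely class exceeds a fixed threshold.

    probs: (N, C) softmax outputs of a pre-trained classifier on unlabeled images.
    Returns the indices of the retained images and their pseudo labels.
    """
    confidence, pseudo_labels = probs.max(dim=1)
    keep = confidence > threshold
    return keep.nonzero(as_tuple=True)[0], pseudo_labels[keep]
```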

However, relying solely on probability distributions to generate pseudo labels is often insufficient [50]. Drawing inspiration from LwEM [61], we introduce One-Vote Veto self-training in this paper, as illustrated in Fig. 3 (the superscripts $r$ and $t$ denote “reference” and “target”, respectively). Similar to LwEM [61], we use a collection of $m$ reference (labeled) fundus images $\{\mathbf{x}^{r}_{1},\dots,\mathbf{x}^{r}_{m}\}\in\mathscr{X}^{r}$ to provide “contrastive predictions” for the target (unlabeled) fundus images $\{\mathbf{x}^{t}_{1},\dots,\mathbf{x}^{t}_{m}\}\in\mathscr{X}^{t}$. The contrastive predictions subsequently vote to veto the unreliable “self-predictions” $\{\tilde{y}^{t}_{1},\dots,\tilde{y}^{t}_{m}\}$ produced by the MTSN. Our OVV self-training is detailed in Algorithm 1, where the target model updates its parameters during self-training but the reference model does not.

Data: Reference fundus images $\mathscr{X}^{r}$ and their labels $\mathscr{Y}^{r}$, and target fundus images $\mathscr{X}^{t}$
1  while Training do
2      Given a mini-batch consisting of $\{\mathbf{x}^{r}_{1},\dots,\mathbf{x}^{r}_{m}\}\in\mathscr{X}^{r}$, $\{y^{r}_{1},\dots,y^{r}_{m}\}\in\mathscr{Y}^{r}$, and $\{\mathbf{x}^{t}_{1},\dots,\mathbf{x}^{t}_{m}\}\in\mathscr{X}^{t}$;
3      Initialize an empty set $\mathscr{P}$ to store reliable target fundus images and their pseudo labels;
4      for each target fundus image $\mathbf{x}^{t}_{k}$ do
5          if the self-prediction of $\mathbf{x}^{t}_{k}$ is reliable then
6              for any qualified reference image $\mathbf{x}^{r}_{l}$ do
7                  Compute the self-prediction and contrastive prediction;
8              end for
9              if the criteria to generate pseudo labels are satisfied then
10                 Update the set $\mathscr{P}$;
11             end if
12         end if
13     end for
14     Fine-tune the target model using unique($\mathscr{P}$);
15 end while
16 if the target model outperforms the reference model then
17     Update the reference model parameters;
18 end if
Algorithm 1: One-Vote Veto Self-Training Strategy.

When fine-tuning an MTSN pre-trained through low-shot learning, each mini-batch contains a discrete set of $m$ reference fundus images $\{\mathbf{x}^{r}_{1},\dots,\mathbf{x}^{r}_{m}\}\in\mathscr{X}^{r}$, their ground-truth labels $\{y^{r}_{1},\dots,y^{r}_{m}\}\in\mathscr{Y}^{r}$, and an equal number of $m$ target fundus images $\{\mathbf{x}^{t}_{1},\dots,\mathbf{x}^{t}_{m}\}\in\mathscr{X}^{t}$ without labels. $\mathbf{h}^{r}_{k}$ and $\mathbf{h}^{t}_{k}$ represent the 1D embeddings learned from $\mathbf{x}^{r}_{k}$ and $\mathbf{x}^{t}_{k}$ ($k\in[1,m]\cap\mathbb{Z}$), respectively. Given a pair of reference and target fundus images, $\mathbf{x}^{r}_{l}$ and $\mathbf{x}^{t}_{k}$, the pre-trained MTSN can “self-predict”:

  • the scalars $p(\mathbf{x}^{r}_{l})$ and $p(\mathbf{x}^{t}_{k})$, which indicate the probabilities that $\mathbf{x}^{r}_{l}$ and $\mathbf{x}^{t}_{k}$ are GON images, respectively;

  • their labels $\tilde{y}^{r}_{l}=\delta(p(\mathbf{x}^{r}_{l}))$ and $\tilde{y}^{t}_{k}=\delta(p(\mathbf{x}^{t}_{k}))$, using its fundus image classification functionality ($\delta(p)=1$ when $p>0.5$, and $\delta(p)=0$ otherwise).

$p(\mathbf{x}^{r}_{l})$ is then used to determine whether the reference fundus image $\mathbf{x}^{r}_{l}$ is qualified to veto unreliable predictions. If $|p(\mathbf{x}^{r}_{l})-y^{r}_{l}|>\kappa_{2}$, its vote will be omitted, where $\kappa_{2}$ is a threshold used to select qualified reference fundus images (step 6 in Algorithm 1). In the meantime, the pre-trained MTSN can also “contrastively predict” the scalar

p_{l,k}^{r\rightarrow t}(\mathbf{x}^{r}_{l},\mathbf{x}^{t}_{k})=|p(\mathbf{x}^{r}_{l})-q(\mathbf{x}^{r}_{l},\mathbf{x}^{t}_{k})| (4)

indicating the GON probability as well as the label

\tilde{y}^{r\rightarrow t}_{l,k}(\mathbf{x}^{r}_{l},\mathbf{x}^{t}_{k})=|\delta(p(\mathbf{x}^{r}_{l}))-\delta(q(\mathbf{x}^{r}_{l},\mathbf{x}^{t}_{k}))| (5)

of $\mathbf{x}^{t}_{k}$ from $\mathbf{x}^{r}_{l}$ using its input similarity measurement functionality (in rare cases, $\tilde{y}^{r\rightarrow t}_{l,k}(\mathbf{x}^{r}_{l},\mathbf{x}^{t}_{k})$ might not be equivalent to $\delta(p_{l,k}^{r\rightarrow t}(\mathbf{x}^{r}_{l},\mathbf{x}^{t}_{k}))$). To determine the reliability of $\tilde{y}^{t}_{k}$ and whether it can be used as the pseudo label of $\mathbf{x}^{t}_{k}$, all the reference fundus images $\{\mathbf{x}^{r}_{1},\dots,\mathbf{x}^{r}_{m}\}\in\mathscr{X}^{r}$ in the mini-batch are used to provide additional judgements. Each pair of contrastively predicted scalar (indicating GON probability) and label forms a vote $(p_{l,k}^{r\rightarrow t}(\mathbf{x}^{r}_{l},\mathbf{x}^{t}_{k}),\tilde{y}^{r\rightarrow t}_{l,k}(\mathbf{x}^{r}_{l},\mathbf{x}^{t}_{k}))$. With all votes collected from the qualified reference fundus images, the OVV self-training algorithm determines whether $\tilde{y}^{t}_{k}$ should be used as the pseudo label for $\mathbf{x}^{t}_{k}$ based on the following criteria (step 9 in Algorithm 1):

  • Identical to the process of determining qualified reference fundus images, if any $p_{l,k}^{r\rightarrow t}(\mathbf{x}^{r}_{l},\mathbf{x}^{t}_{k})$ ($l\in[1,m]\cap\mathbb{Z}$) or $p(\mathbf{x}^{t}_{k})$ is not close to either 0 (healthy) or 1 (GON), as evaluated by the threshold $\kappa_{2}$, $\tilde{y}^{t}_{k}$ will not be assigned to $\mathbf{x}^{t}_{k}$.

  • If a minority of more than $\kappa_{1}$ qualified reference fundus images disagree with the majority of the qualified reference fundus images, $\tilde{y}^{t}_{k}$ will not be assigned to $\mathbf{x}^{t}_{k}$.

As discussed in Sect. 4, $\kappa_{1}=0$ (all qualified reference images vote for the same category) achieves the best overall performance; the aforementioned strategy is therefore named “One-Vote Veto” in this paper. Since each target fundus image must be compared with all the reference fundus images in the same mini-batch, the proposed self-training strategy has a computational complexity of $\mathscr{O}(n^{2})$, which is relatively memory-consuming. The reliable target fundus images and their pseudo labels are then included in the low-shot training data to fine-tune the pre-trained MTSN with supervised learning by minimizing a CWCE loss. The OVV self-training performance with respect to different $\kappa_{1}$, $\kappa_{2}$, and $m$ values is discussed in Sect. 4.
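The following simplified sketch illustrates the per-target decision of Algorithm 1 (roughly steps 4-13). It assumes the MTSN interface from the earlier sketch, and it counts dissenting votes against the self-prediction, which is a simplification of the majority rule described above.

```python
import torch


def ovv_pseudo_label(model, x_t, ref_images, ref_labels, kappa1=0, kappa2=0.01):
    """One-Vote Veto decision for a single unlabeled target image x_t.

    Returns (pseudo_label, True) if the self-prediction survives the vote,
    otherwise (None, False).
    """
    with torch.no_grad():
        votes = []
        for x_r, y_r in zip(ref_images, ref_labels):
            p_r, p_t, q = model(x_r.unsqueeze(0), x_t.unsqueeze(0))
            p_r, p_t, q = p_r.item(), p_t.item(), q.item()

            # The self-prediction of x_t must be confident (close to 0 or 1).
            if min(p_t, 1.0 - p_t) > kappa2:
                return None, False
            y_self = int(p_t > 0.5)

            # Skip unqualified reference images (step 6): their own prediction
            # must agree with their ground-truth label within kappa2.
            if abs(p_r - y_r) > kappa2:
                continue

            # Contrastive prediction of x_t from x_r, Eqs. (4) and (5).
            p_contrast = abs(p_r - q)
            if min(p_contrast, 1.0 - p_contrast) > kappa2:
                return None, False  # an unconfident contrastive vote vetoes x_t
            votes.append(abs(int(p_r > 0.5) - int(q > 0.5)))

        # At most kappa1 qualified references may disagree; with kappa1 = 0,
        # a single dissenting vote vetoes the pseudo label.
        if votes and sum(v != y_self for v in votes) <= kappa1:
            return y_self, True
        return None, False
```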

Table 1: Comparison between supervised and low-shot learning (both on the low-shot OHTS training set). $\lambda$ is set to 0.3. The best results are shown in bold font.
Test set Experiments Training strategy ResNet-50 MobileNet-v2
Accuracy (%) F1-score (%) AUROC Accuracy (%) F1-score (%) AUROC
ACRIMA [22] Baseline Supervised learning 57.163 56.734 0.625 73.333 78.341 0.779
EWAD Low-shot learning 67.092 63.175 0.758 70.355 67.797 0.820
EWSD Low-shot learning 49.504 25.523 0.437 66.241 59.107 0.823
LAG [9] Baseline Supervised learning 64.318 57.445 0.714 65.122 63.268 0.781
EWAD Low-shot learning 79.028 65.908 0.841 79.007 69.482 0.843
EWSD Low-shot learning 78.039 68.179 0.826 78.430 61.830 0.846
DIGS/ADAGES [23] Baseline Supervised learning 59.639 59.708 0.648 61.478 66.128 0.669
EWAD Low-shot learning 67.745 60.754 0.743 69.176 68.145 0.748
EWSD Low-shot learning 65.736 55.722 0.700 68.120 64.028 0.740
Table 2: Evaluation of our OVV self-training w.r.t. different $\kappa_{1}$, $\kappa_{2}$, and $m$. The best results are shown in bold font. \uparrow indicates that semi-supervised learning outperforms low-shot learning.
Dataset $\kappa_{1}$ $\kappa_{2}$ $m$ ResNet-50 MobileNet-v2
Accuracy (%) F1-score (%) AUROC Accuracy (%) F1-score (%) AUROC
OHTS [11, 12] 0 0.01 20 91.415 \uparrow 41.148 \uparrow 0.898 \uparrow 90.609 \uparrow 36.960 \uparrow 0.887 \uparrow
0 0.01 15 92.113 \uparrow 41.316 \uparrow 0.898 \uparrow 93.470 \uparrow 36.976 \uparrow 0.893 \uparrow
0 0.01 10 94.199 \uparrow 43.759 \uparrow 0.890 \uparrow 93.742 \uparrow 37.779 \uparrow 0.863 \uparrow
0 0.10 20 90.516 \uparrow 38.139 \uparrow 0.898 \uparrow 88.825 \uparrow 34.351 \uparrow 0.878 \uparrow
2 0.01 20 90.717 \uparrow 35.818 \uparrow 0.885 \uparrow 93.463 \uparrow 35.204 \uparrow 0.862 \uparrow
2 0.10 20 92.656 \uparrow 32.017 \uparrow 0.851 \downarrow 90.772 \uparrow 32.616 \uparrow 0.858 \uparrow
4 0.01 20 92.610 \uparrow 29.668 \downarrow 0.854 \downarrow 92.268 \uparrow 31.852 \uparrow 0.854 \uparrow
4 0.10 20 92.672 \uparrow 28.463 \downarrow 0.842 \downarrow 90.803 \uparrow 32.690 \uparrow 0.859 \uparrow
ACRIMA [22] 0 0.01 20 59.858 \downarrow 49.192 \downarrow 0.775 \uparrow 72.340 \uparrow 70.229 \uparrow 0.840 \uparrow
0 0.01 15 60.426 \downarrow 49.365 \downarrow 0.751 \downarrow 63.404 \downarrow 54.895 \downarrow 0.814 \downarrow
0 0.01 10 54.610 \downarrow 35.743 \downarrow 0.721 \downarrow 61.986 \downarrow 51.273 \downarrow 0.826 \uparrow
LAG [9] 0 0.01 20 80.882 \uparrow 66.304 \uparrow 0.881 \uparrow 79.625 \uparrow 65.262 \downarrow 0.851 \uparrow
0 0.01 15 76.864 \downarrow 56.252 \downarrow 0.825 \downarrow 76.670 \downarrow 55.906 \downarrow 0.841 \downarrow
0 0.01 10 76.638 \downarrow 56.518 \downarrow 0.826 \downarrow 75.834 \downarrow 51.748 \downarrow 0.866 \uparrow
DIGS/ADAGES [23] 0 0.01 20 67.813 \uparrow 58.315 \downarrow 0.763 \uparrow 69.653 \uparrow 63.258 \downarrow 0.777 \uparrow
0 0.01 15 63.045 \downarrow 44.727 \downarrow 0.753 \uparrow 66.383 \downarrow 52.888 \downarrow 0.773 \uparrow
0 0.01 10 61.819 \downarrow 41.340 \downarrow 0.727 \downarrow 63.965 \downarrow 44.198 \downarrow 0.789 \uparrow
Table 3: AUROC (shown along with 95% CI) and training time $t$ (min) per epoch of supervised, low-shot, and semi-supervised glaucoma diagnosis.
Backbone Method Training strategy OHTS [11, 12] ACRIMA [22] LAG [9] DIGS/ADAGES [23] $t$ (min)
ResNet-50 Baseline Supervised learning 0.904 (95% CI: 0.865, 0.935) 0.736 (95% CI: 0.698, 0.771) 0.794 (95% CI: 0.780, 0.807) 0.744 (95% CI: 0.696, 0.792) 52.1
MTSN Low-shot learning 0.869 (95% CI: 0.833, 0.901) 0.758 (95% CI: 0.723, 0.792) 0.841 (95% CI: 0.829, 0.853) 0.743 (95% CI: 0.683, 0.795) 1.8
MTSN+OVV Semi-supervised learning 0.898 (95% CI: 0.857, 0.928) 0.775 (95% CI: 0.741, 0.808) 0.881 (95% CI: 0.870, 0.891) 0.763 (95% CI: 0.695, 0.820) 203.7
MobileNet-v2 Baseline Supervised learning 0.893 (95% CI: 0.845, 0.932) 0.794 (95% CI: 0.760, 0.825) 0.856 (95% CI: 0.844, 0.867) 0.786 (95% CI: 0.728, 0.835) 42.9
MTSN Low-shot learning 0.859 (95% CI: 0.813, 0.896) 0.820 (95% CI: 0.786, 0.850) 0.843 (95% CI: 0.831, 0.855) 0.748 (95% CI: 0.689, 0.802) 1.2
MTSN+OVV Semi-supervised learning 0.887 (95% CI: 0.850, 0.920) 0.840 (95% CI: 0.808, 0.867) 0.851 (95% CI: 0.838, 0.862) 0.777 (95% CI: 0.718, 0.826) 125.4
DenseNet Baseline Supervised learning 0.898 (95% CI: 0.867, 0.927) 0.810 (95% CI: 0.778, 0.841) 0.784 (95% CI: 0.771, 0.798) 0.743 (95% CI: 0.688, 0.789) 122.7
MTSN Low-shot learning 0.854 (95% CI: 0.811, 0.894) 0.753 (95% CI: 0.716, 0.786) 0.853 (95% CI: 0.842, 0.865) 0.732 (95% CI: 0.675, 0.785) 5.7
MTSN+OVV Semi-supervised learning 0.896 (95% CI: 0.861, 0.926) 0.783 (95% CI: 0.748, 0.817) 0.831 (95% CI: 0.818, 0.843) 0.746 (95% CI: 0.678, 0.800) 324.0
EfficientNet Baseline Supervised learning 0.768 (95% CI: 0.684, 0.834) 0.633 (95% CI: 0.590, 0.672) 0.650 (95% CI: 0.634, 0.667) 0.658 (95% CI: 0.611, 0.702) 48.7
MTSN Low-shot learning 0.863 (95% CI: 0.818, 0.899) 0.845 (95% CI: 0.815, 0.873) 0.845 (95% CI: 0.833, 0.856) 0.719 (95% CI: 0.659, 0.774) 1.5
MTSN+OVV Semi-supervised learning 0.886 (95% CI: 0.845, 0.918) 0.792 (95% CI: 0.758, 0.824) 0.850 (95% CI: 0.837, 0.861) 0.749 (95% CI: 0.690, 0.800) 159.8
Table 4: Comparisons of supervised learning, low-shot learning, and semi-supervised learning w.r.t. different percentages of labeled training data. In the baseline experiments (2.0% of the training data), one fundus photograph is selected from each patient; in the experiments with 0.5% and 1.0% of the training data, a subset is created by choosing patient IDs at regular intervals.
Percentage of training data Supervised learning Low-shot learning Semi-supervised learning
Accuracy (%) F1-score (%) AUROC Accuracy (%) F1-score (%) AUROC Accuracy (%) F1-score (%) AUROC
0.5 80.402 18.066 0.720 84.769 20.228 0.759 88.856 25.351 0.797
1.0 79.557 21.573 0.799 88.538 25.654 0.806 88.453 32.472 0.857
2.0 (baseline) 84.141 26.743 0.838 87.150 31.726 0.865 91.415 41.148 0.898
10.0 85.651 32.288 0.890 89.988 38.612 0.891 91.446 39.760 0.899
50.0 92.950 40.188 0.907 94.223 40.826 0.887 92.067 38.187 0.889
90.0 89.556 37.582 0.905 92.897 41.878 0.887 93.331 43.421 0.898
(a) Backbone CNN: ResNet-50.
Percentage of training data Supervised learning Low-shot learning Semi-supervised learning
Accuracy (%) F1-score (%) AUROC Accuracy (%) F1-score (%) AUROC Accuracy (%) F1-score (%) AUROC
0.5 83.033 19.035 0.720 81.954 18.035 0.745 87.918 29.246 0.826
1.0 67.659 16.458 0.748 91.764 20.628 0.772 89.407 31.357 0.841
2.0 (baseline) 78.144 23.547 0.840 86.018 30.252 0.854 90.609 36.960 0.887
10.0 90.429 34.873 0.884 89.562 36.808 0.897 89.624 36.887 0.888
50.0 92.438 43.361 0.906 91.043 39.050 0.888 93.230 40.653 0.889
90.0 93.754 42.401 0.908 93.750 37.519 0.890 91.857 41.602 0.896
(b) Backbone CNN: MobileNet-v2.

4 Experiments

4.1 Datasets and Experimental Setups

The datasets utilized in our experiments were collected by various clinicians in different countries using distinct fundus cameras. The ACRIMA [22] and LAG [9] datasets are publicly available, while the OHTS [11, 12] and DIGS/ADAGES [23] datasets are available upon request after appropriate data use agreements are initiated. Their details are as follows:

  • The OHTS [11, 12] is the only multi-center longitudinal study that has precise information on the dates/timing of the development of glaucoma (the enrolled subjects did not have glaucoma at study entry) using standardized assessment criteria by an independent Optic Disc Reading Center and confirmed by three glaucoma specialist endpoint committee members. In our experiments, a square region centered on the optic nerve head was first extracted from each raw fundus image using a well-trained DeepLabv3+ [63] model. A small portion of the raw data consists of stereoscopic fundus images, each of which was split to produce two individual fundus images. Through this image pre-processing approach, a total of 74,678 fundus images were obtained. Moreover, ENPOAGDISC (endpoint committee attributable to primary open angle glaucoma based on optic disc changes from photographs) [3] labels are used as the classification ground truth. The fundus images are divided into a training set (50,208 healthy images and 2,416 GON images), a validation set (7,188 healthy images and 426 GON images), and a test set (13,780 healthy images and 660 GON images) by participant. Splitting by participant (instead of by image) ensured that the validation and test sets did not contain images from any eyes or individuals used to train the model. More details on dataset preparation and baseline supervised learning experiments are provided in our recent publications [3, 10]. Additionally, we select one image (from only one eye) from each patient in the training set to create the low-shot training set (995 healthy images and 152 GON images). If neither eye of a patient converts to glaucoma during the study, the first captured fundus photograph is selected; if either eye converts to glaucoma, the first glaucoma fundus photograph is selected (a short sketch of this selection rule follows this list).

  • The ACRIMA [22] dataset consists of 309 healthy images and 396 GON images. It was collected as part of an initiative by the government of Spain. Classification was based on the review by a single experienced glaucoma expert. Images were excluded if they did not provide a clear view of the optic nerve head region [43].

  • The LAG [9] dataset contains 3,143 healthy images and 1,711 GON images (the number of published fundus images is fewer than that reported in [9]), collected by Beijing Tongren Hospital. Similar to the OHTS dataset, we also use the well-trained DeepLabv3+ [63] model to extract a square region centered on the optic nerve head from each fundus image.

  • The DIGS and ADAGES [23] are longitudinal studies designed to detect and monitor glaucoma based on optical imaging and visual function testing that, when combined, have generated tens of thousands of test results from over 4,000 healthy, glaucoma suspect, or glaucoma eyes. In our experiments, we utilize the DIGS/ADAGES test set (5,184 healthy images and 4,289 GON images) to evaluate the generalizability of our proposed methods.
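The OHTS low-shot subset construction mentioned above can be expressed as a short pandas sketch; the column names (patient_id, eye, label, date) are hypothetical and only serve to illustrate the selection rule, not the actual data schema.

```python
import pandas as pd


def build_low_shot_subset(df: pd.DataFrame) -> pd.DataFrame:
    """Select one fundus photograph (from one eye) per patient.

    df columns (hypothetical): patient_id, eye, label (0 healthy / 1 GON), date.
    """
    rows = []
    for _, group in df.sort_values("date").groupby("patient_id"):
        gon = group[group["label"] == 1]
        # If either eye converts to glaucoma, take the first GON photograph;
        # otherwise take the first photograph captured for this patient.
        rows.append(gon.iloc[0] if len(gon) > 0 else group.iloc[0])
    return pd.DataFrame(rows)
```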

Figure 4: Comparisons of dataset visualizations produced by t-SNE. In each panel, the healthy and GON images of the OHTS test set are shown together with (a) the healthy and GON images of the ACRIMA dataset, (b) the healthy and GON images of the LAG dataset, and (c) the healthy and GON images of the DIGS/ADAGES test set, with each group plotted in a distinct color.

Visualizations of the four test sets using t-SNE [62] are provided in Fig. 4. Since healthy and GON images are distributed similarly between the OHTS and LAG datasets, we expect models to perform similarly on these datasets. Dissimilar distributions in the ACRIMA and DIGS/ADAGES datasets led us to believe the performance of models on these datasets would be somewhat worse. Using these four datasets, we conduct three experiments:

  1. Supervised learning experiment: We employ transfer learning [64] to train ResNet-50 [38], MobileNet-v2 [46], DenseNet [65], and EfficientNet [66] (pre-trained on the ImageNet database [37]) on the entire OHTS training set (approximately 53K fundus images). The best-performing models are selected using the OHTS validation set. Their performance is subsequently evaluated on the OHTS test set, the ACRIMA dataset, the LAG dataset, and the DIGS/ADAGES test set.

  2. Low-shot learning experiment: The four pre-trained models mentioned above are used as the MTSN backbones and trained on the OHTS low-shot training set (containing 1,147 images) to validate the effectiveness of our proposed low-shot glaucoma diagnosis algorithm. The validation and testing procedures are identical to those in the supervised learning experiment.

  3. Semi-supervised learning experiment: The MTSNs trained on the low-shot training set are fine-tuned on the entire OHTS training set without using additional ground-truth labels. The fine-tuned MTSNs are referred to as MTSN+OVV. The validation and testing procedures are identical to those in the supervised learning experiment.

The fundus images are resized to $224\times224$ pixels. The initial learning rate is set to 0.001 and decays gradually after the 100th epoch. Due to the dataset imbalance problem, the F1-score is used to select the best-performing models during the validation stage. Moreover, we adopt an early stopping mechanism during validation to reduce over-fitting (training is terminated if the F1-score has not increased for 10 epochs). We use three metrics, (1) accuracy, (2) F1-score, and (3) AUROC, to quantify the performance of the trained models. While accuracy is generally reported in image classification papers, the F1-score and AUROC are more comprehensive and informative evaluation metrics when the dataset is severely imbalanced.
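The three evaluation metrics can be computed with scikit-learn as sketched below, where y_true, y_pred, and y_prob denote the ground-truth labels, thresholded predictions, and predicted GON probabilities on a test set.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score


def evaluate(y_true, y_pred, y_prob):
    """Accuracy, F1-score, and AUROC used to report all experiments."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),          # more informative on imbalanced data
        "auroc": roc_auc_score(y_true, y_prob),  # threshold-independent
    }
```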

4.2 Ablation Study and Threshold Selection

Figure 5: MTSN performance on the OHTS test set with respect to different $\lambda$ and $\Phi(\cdot)$: (a) ResNet-50 backbone; (b) MobileNet-v2 backbone.
Table 5: Comparisons with other SoTA glaucoma diagnosis algorithms. The best results for each training strategy are shown in bold font.
Training strategy Method Accuracy (%) F1-score (%) AUROC
Supervised learning Li et al.[14] 93.812 42.590 0.886
Gómez-Valverde et al.[15] 95.068 48.039 0.903
Judy et al.[16] 94.188 42.835 0.908
Serener and Serte [17] 90.492 41.227 0.912
Thakur et al.[18] 94.145 44.115 0.896
Fan et al.[3] (Baseline Result) 93.261 43.016 0.904
Low-shot learning Kim et al.[19] 84.203 23.161 0.786
MTSN (Backbone: ResNet-50) (Ours) 87.150 31.726 0.865
MTSN (Backbone: MobileNet-v2) (Ours) 86.018 30.252 0.854
Semi-supervised learning Al Ghamdi et al.[20] 84.808 27.579 0.830
Diaz-Pinto et al.[21] 76.619 20.721 0.748
MTSN (Backbone: ResNet-50) + OVV Self-Training (Ours) 90.244 38.454 0.899
MTSN (Backbone: MobileNet-v2) + OVV Self-Training (Ours) 89.360 36.599 0.891
(a) OHTS [12] test set.
Training strategy Method Accuracy (%) F1-score (%) AUROC
Supervised learning Li et al.[14] 60.142 46.272 0.813
Gómez-Valverde et al.[15] 63.546 53.358 0.826
Judy et al.[16] 57.872 41.650 0.824
Serener and Serte [17] 54.326 36.364 0.675
Thakur et al.[18] 65.106 58.020 0.794
Fan et al.[3] (Baseline Result) 53.333 31.601 0.736
Low-shot learning Kim et al.[19] 64.965 58.627 0.844
MTSN (Backbone: ResNet-50) (Ours) 67.092 63.175 0.758
MTSN (Backbone: MobileNet-v2) (Ours) 70.355 67.797 0.820
Semi-supervised learning Al Ghamdi et al.[20] 68.511 67.257 0.794
Diaz-Pinto et al.[21] 59.149 44.828 0.818
MTSN (Backbone: ResNet-50) + OVV Self-Training (Ours) 64.539 57.627 0.801
MTSN (Backbone: MobileNet-v2) + OVV Self-Training (Ours) 72.340 70.939 0.835
(b) ACRIMA [22] dataset.
Training strategy Method Accuracy (%) F1-score (%) AUROC
Supervised learning Li et al. [14] 76.535 53.756 0.855
Gómez-Valverde et al.[15] 80.202 62.417 0.883
Judy et al.[16] 78.348 60.174 0.860
Serener and Serte [17] 77.379 66.545 0.806
Thakur et al.[18] 80.305 65.882 0.856
Fan et al.[3] (Baseline Result) 75.052 50.267 0.794
Low-shot learning Kim et al.[19] 74.619 65.000 0.805
MTSN (Backbone: ResNet-50) (Ours) 79.028 65.908 0.841
MTSN (Backbone: MobileNet-v2) (Ours) 79.007 69.482 0.843
Semi-supervised learning Al Ghamdi et al.[20] 79.028 72.382 0.860
Diaz-Pinto et al.[21] 65.554 56.662 0.701
MTSN (Backbone: ResNet-50) + OVV Self-Training (Ours) 81.644 69.929 0.879
MTSN (Backbone: MobileNet-v2) + OVV Self-Training (Ours) 80.470 68.692 0.849
(c) LAG [9] dataset.
Training strategy Method Accuracy (%) F1-score (%) AUROC
Supervised learning Li et al.[14] 65.293 46.955 0.780
Gómez-Valverde et al.[15] 65.157 45.729 0.795
Judy et al.[16] 63.556 41.144 0.760
Serener and Serte [17] 69.108 57.478 0.757
Thakur et al.[18] 70.606 60.322 0.786
Fan et al.[3] (Baseline Result) 62.500 38.042 0.744
Low-shot learning Kim et al.[19] 63.862 52.949 0.687
MTSN (Backbone: ResNet-50) (Ours) 67.745 60.754 0.743
MTSN (Backbone: MobileNet-v2) (Ours) 69.176 68.145 0.748
Semi-supervised learning Al Ghamdi et al.[20] 64.850 54.816 0.716
Diaz-Pinto et al.[21] 64.441 59.908 0.677
MTSN (Backbone: ResNet-50) + OVV Self-Training (Ours) 66.281 55.486 0.747
MTSN (Backbone: MobileNet-v2) + OVV Self-Training (Ours) 70.402 66.015 0.776
(d) DIGS/ADAGES [23] test set.

We set $\lambda$ in (1) to 0.1, 0.2, 0.3, 0.4, and 0.5, respectively, and compare the MTSN performance when $\Phi(\cdot)$ computes the element-wise absolute difference (EWAD) and the element-wise squared difference (EWSD). The comparisons in terms of F1-score and AUROC on the OHTS test set are provided in Fig. 5. The MTSN achieves the best overall performance when $\lambda=0.3$. This is reasonable, as a higher $\lambda$ places more weight on the image classification task, which easily results in over-fitting. Additionally, the MTSN in which $\Phi(\cdot)$ computes the EWSD between $\mathbf{h}_{i}$ and $\mathbf{h}_{j}$ performs better when using ResNet-50 as the backbone CNN but slightly worse when using MobileNet-v2. We therefore further evaluate their generalizability on three additional test sets, as shown in Table 1. When $\Phi(\cdot)$ computes the EWAD, the MTSN generally performs better or very similarly on the additional test sets, especially when testing the MTSN assembled with ResNet-50 on the ACRIMA dataset. EWAD is, therefore, used in the following experiments. Furthermore, Table 1 provides the results of a baseline supervised learning experiment conducted on the low-shot training set. The results suggest that low-shot learning performs much better than supervised learning when the training set is small.
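The two candidate forms of $\Phi(\cdot)$ differ only in how the paired embeddings are combined; a minimal sketch of each is shown below.

```python
import torch


def phi_ewad(h_i, h_j):
    """Element-wise absolute difference (EWAD), used in all later experiments."""
    return torch.abs(h_i - h_j)


def phi_ewsd(h_i, h_j):
    """Element-wise squared difference (EWSD)."""
    return (h_i - h_j) ** 2
```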

Furthermore, we discuss the selection of the thresholds $\kappa_{1}$ and $\kappa_{2}$ (used to select reliable “self-predictions” and “contrastive predictions” in our OVV self-training) as well as the impact of different mini-batch sizes $2m$ on OVV self-training (each mini-batch contains $m$ pairs of reference and target fundus images). Table 2 shows the MTSN performance with respect to different $\kappa_{1}$, $\kappa_{2}$, and $m$. When evaluated on the OHTS test set, accuracy and F1-score increase slightly, while AUROC remains almost unchanged, as $m$ decreases. Moreover, as $\kappa_{1}$ and $\kappa_{2}$ increase, the criterion for determining reliable predictions becomes less strict, and the semi-supervised learning performance degrades. Based on this experiment, we believe OVV self-training benefits from smaller $\kappa_{1}$ and $\kappa_{2}$.

Additionally, MTSNs trained under different $m$ are evaluated on the three additional test sets, as shown in Table 2. The network trained with a larger $m$ typically shows better results. When $m$ decreases, the generalizability of the MTSN degrades dramatically, especially in terms of F1-score (which decreases by around 9-19%). Therefore, increasing the mini-batch size can improve MTSN generalizability, as more reference fundus images are used to provide contrastive predictions for the target fundus images, which can veto more unreliable predictions on the unlabeled data. Hence, we increase $m$ to 30 to further improve OVV self-training when comparing it with other published SoTA algorithms, as shown in Sect. 4.4. Since our threshold selection experiments cover only a limited number of discrete settings of $\kappa_{1}$, $\kappa_{2}$, and $m$, we believe better performance can be achieved when more values are tested.

4.3 Comparison of Supervised, Low-Shot, and Semi-Supervised Glaucoma Diagnosis

Comparisons of supervised learning, low-shot learning, and semi-supervised learning (w.r.t. four backbone CNNs: ResNet-50, MobileNet-v2, DenseNet, and EfficientNet) for glaucoma diagnosis are provided in Table 3. First, these results suggest that the MTSNs fine-tuned with OVV self-training, which require only a small number of labeled fundus images, perform similarly to (AUROC 95% CIs overlap considerably) and, in some cases, significantly better than (AUROC 95% CIs do not overlap) the backbone CNNs trained under full supervision with 50 times more labeled fundus images.

Specifically, when using ResNet-50, MobileNet-v2, or DenseNet as the backbone CNN, semi-supervised learning performs similarly to supervised learning on the OHTS and DIGS/ADAGES test sets, and in most cases significantly better than supervised learning on the ACRIMA and LAG datasets. Although EfficientNet trained through supervised learning performs unsatisfactorily on all four test sets, it shows considerable compatibility with the MTSN in the low-shot and semi-supervised learning experiments. Second, as expected, the AUROC scores achieved by low-shot learning are in most, but not all, cases slightly lower than those achieved by the backbone CNNs when evaluated on the OHTS test set. However, low-shot learning shows better generalizability than supervised learning on the ACRIMA and LAG datasets. Moreover, since low-shot learning uses only a small amount of training data, training an MTSN is much faster than supervised learning. As MTSNs assembled with ResNet-50 and MobileNet-v2 typically demonstrate better performance than those assembled with DenseNet and EfficientNet, we only use the former two CNNs in the following experiments.

Figure 6: Examples of Grad-CAM++ [67] results on (a) OHTS [13] healthy, (b) OHTS [13] GON, (c) ACRIMA [22] healthy, (d) ACRIMA [22] GON, (e) LAG [9] healthy, (f) LAG [9] GON, (g) DIGS/ADAGES [23] healthy, and (h) DIGS/ADAGES [23] GON fundus images: (i) fundus images; (ii) and (v) the class activation maps of (i), obtained by the backbone CNNs trained through supervised learning on the entire training set (containing approximately 53K fundus images); (iii) and (vi) the class activation maps of (i), obtained by MTSNs trained through low-shot learning on a small training set (containing 1,147 fundus images); (iv) and (vii) the class activation maps of (i), obtained by MTSNs fine-tuned with our proposed OVV self-training on the entire training set (approximately 53K fundus images without ground-truth labels).

We also employ Grad-CAM++ [67] to explain the models’ decision-making, as shown in Fig. 6. These results suggest that the optic nerve head areas impact model decisions most. The neuroretinal rim areas are identified as most important, and the periphery contributed comparatively little to model decisions for both healthy and GON eyes [42].
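The class activation maps in Fig. 6 can be reproduced along the following lines, assuming the third-party pytorch-grad-cam package and a torchvision ResNet-50 classifier; the stand-in model and random input below are placeholders for a trained network and a pre-processed fundus image, not the actual pipeline.

```python
import torch
from torchvision.models import resnet50
from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

# A stand-in classifier: in practice this would be a trained glaucoma model.
model = resnet50(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # healthy vs. GON
model.eval()

input_tensor = torch.randn(1, 3, 224, 224)   # a pre-processed fundus image
target_layers = [model.layer4[-1]]           # last convolutional block

cam = GradCAMPlusPlus(model=model, target_layers=target_layers)
heatmap = cam(input_tensor=input_tensor,
              targets=[ClassifierOutputTarget(1)])[0]   # class 1 = GON
```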

We also carry out a series of experiments with respect to different percentages of training data, as shown in Table 4, to further validate the effectiveness of our proposed low-shot and semi-supervised learning algorithms. The backbone CNNs trained via supervised learning on the small subsets generally perform worse than the MTSNs trained via low-shot and semi-supervised learning. As the amount of labeled training data increases, the models' performance saturates. When using less labeled training data (0.5% and 1.0%), the MTSN performance degrades. However, its performance can still be greatly improved with OVV self-training (accuracy, F1-score, and AUROC can be improved by up to 6%, 11%, and 0.08, respectively). In addition, when using over 10% of the entire training data, the MTSN performance saturates, and OVV self-training brings very limited further improvement.

4.4 Comparisons with Other SoTA Glaucoma Diagnosis Approaches

Table 5 provides comprehensive comparisons with nine SoTA glaucoma diagnosis algorithms (our recent work [3] provides the baseline supervised learning results). The results suggest that (a) for low-shot learning, the MTSNs trained by minimizing our proposed CWCE loss perform significantly better than the SoTA low-shot glaucoma diagnosis approach [19] on all four datasets (accuracy, F1-score, and AUROC are up to 5%, 15%, and 0.08 higher, respectively), and (b) for semi-supervised learning, the MTSNs fine-tuned with OVV self-training also achieve superior performance over the two SoTA semi-supervised glaucoma diagnosis approaches [20, 21] (accuracy, F1-score, and AUROC are up to 6%, 11%, and 0.07 higher, respectively). Compared with the SoTA supervised approaches, the fine-tuned MTSNs demonstrate similar performance on the OHTS test set and better generalizability on the three additional test sets. Therefore, we believe that the MTSN with our proposed OVV self-training is an effective technique for semi-supervised glaucoma diagnosis.

4.5 Comparisons with SoTA General-Purpose Semi-Supervised Learning Approaches

Table 6: Comparisons with SoTA general-purpose semi-supervised learning methods that use the vision Transformer [28] as the backbone network. The best results of each dataset are shown in bold font.
Test Set Method Accuracy (%) F1-score (%) AUROC
OHTS [11, 12] FreeMatch [24] 86.510 27.745 0.784
SoftMatch [25] 85.679 25.718 0.768
FixMatch [26] 84.695 27.825 0.807
FlexMatch [27] 92.715 26.331 0.696
OVV Self-Training (Ours) 90.244 38.454 0.899
ACRIMA [22] FreeMatch [24] 77.872 77.778 0.791
SoftMatch [25] 78.582 79.115 0.795
FixMatch [26] 78.582 79.622 0.792
FlexMatch [27] 60.426 47.850 0.644
OVV Self-Training (Ours) 64.539 57.627 0.801
LAG [9] FreeMatch [24] 78.471 67.087 0.793
SoftMatch [25] 79.110 67.972 0.798
FixMatch [26] 81.314 73.394 0.827
FlexMatch [27] 71.302 35.360 0.654
OVV Self-Training (Ours) 81.644 69.929 0.879
DIGS/ADAGES [23] FreeMatch [24] 65.429 56.716 0.685
SoftMatch [25] 65.395 56.986 0.679
FixMatch [26] 68.835 65.090 0.731
FlexMatch [27] 57.050 26.558 0.594
OVV Self-Training (Ours) 66.281 55.486 0.747

Table 6 provides a comprehensive comparison of our proposed OVV self-training approach with four SoTA general-purpose semi-supervised learning methods: FreeMatch [24], SoftMatch [25], FixMatch [26], and FlexMatch [27], all of which employ the vision Transformer [28] as their backbone network. The results demonstrate that OVV self-training outperforms these methods in terms of F1-score and AUROC on the OHTS dataset: the F1-score is approximately 11-13% higher, and the AUROC is approximately 0.09-0.20 higher. Furthermore, our method demonstrates better generalizability in terms of AUROC across the three additional fundus image test sets. Although their results are inferior to ours, particularly in terms of AUROC, a direct comparison may not be entirely fair, as these methods were not specifically designed for the diagnosis of glaucoma or other diseases.

4.6 MTSN and CWCE Loss for Few-Shot Multi-Class Biomedical Image Classification

Figure 7: Examples of images used in two few-shot multi-class lung disease diagnosis tasks.
Figure 8: Experimental results of two few-shot lung disease diagnosis tasks. REC: recall; PRE: precision; ACC: accuracy.

We conduct two additional few-shot multi-class lung disease diagnosis experiments to validate the effectiveness of our proposed MTSN and CWCE loss: (a) chest X-ray image classification for COVID-19 and viral pneumonia detection [29, 30], and (b) lung histopathological image classification for lung cancer diagnosis [31]. The first experiment has three classes of images: (1) healthy, (2) viral pneumonia, and (3) COVID-19, while the second experiment also has three classes: (1) benign tissue, (2) adenocarcinoma, and (3) squamous cell carcinoma (an example of each class is shown in Fig. 7). In these two experiments, we select only a few images from each class for MTSN training. The numbers of images used for training, validation, and testing are given in Table 7, where it can be observed that the training set is much smaller than the validation and test sets. Since the formulation in (1) applies only to binary image classification problems, we extend it here to handle multi-class image classification.

Table 7: Training, validation, and test sample sizes in the two few-shot biomedical image classification experiments.
Class Training Validation Test
Healthy 27 657 657
Viral pneumonia 27 659 659
COVID-19 24 588 588
(a) Chest X-ray image classification.
Class Training Validation Test
Benign tissue 10 2,495 2,495
Adenocarcinoma 10 2,495 2,495
Squamous cell carcinoma 10 2,495 2,495
(b) Lung histopathological image classification.

Let us consider the chest X-ray image classification task as an example. Each image $\mathbf{x}$ is assigned a pair of labels $(r,s)$ with the following values:

  • $r=1$ and $s=0$ when $\mathbf{x}$ is a healthy image;

  • $r=2$ and $s=1$ when $\mathbf{x}$ is a viral pneumonia image;

  • $r=3$ and $s=3$ when $\mathbf{x}$ is a COVID-19 image.

The numbers of healthy images (class 1), viral pneumonia images (class 2), and COVID-19 images (class 3) are denoted as $n_{1}$, $n_{2}$, and $n_{3}$, respectively. The total number of images is $N=n_{1}+n_{2}+n_{3}$. The weight $\omega_{\text{cla},e}$ used in the image classification loss $\mathcal{L}_{\text{cla}}$ w.r.t. class $e$ is given by:

\omega_{\text{cla},e}=\frac{N-n_{e}}{2N}.\qquad(6)
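For example, taking the chest X-ray training split in Table 7(a) ($n_{1}=n_{2}=27$, $n_{3}=24$, $N=78$), Eq. (6) gives $\omega_{\text{cla},1}=\omega_{\text{cla},2}=51/156\approx 0.327$ and $\omega_{\text{cla},3}=54/156\approx 0.346$, so the smaller COVID-19 class receives a slightly larger weight.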

Note that $\sum_{e=1}^{3}\omega_{\text{cla},e}=1$. Therefore, $\mathcal{L}_{\text{cla}}$ can be written as follows:

\mathcal{L}_{\text{cla}}=-\sum_{e=1}^{3}\omega_{\text{cla},e}\Big(k_{e,i}\log\big(p_{e}(\mathbf{x}_{i})\big)+k_{e,j}\log\big(p_{e}(\mathbf{x}_{j})\big)\Big),\qquad(7)

where $p_{e}(\mathbf{x})\in[0,1]$ indicates the probability that $\mathbf{x}$ belongs to class $e$, and $\sum_{e=1}^{3}p_{e}(\mathbf{x})=1$. $k_{e,i}=1$ when $e=r_{i}$, and $k_{e,i}=0$ otherwise. Given a pair of images $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ with ground-truth labels $(r_{i},s_{i})$ and $(r_{j},s_{j})$, respectively, there are four cases:

  • case 1: $|s_{i}-s_{j}|=1$ (healthy vs. viral pneumonia);

  • case 2: $|s_{i}-s_{j}|=2$ (viral pneumonia vs. COVID-19);

  • case 3: $|s_{i}-s_{j}|=3$ (COVID-19 vs. healthy);

  • case 4: $|s_{i}-s_{j}|=0$ (the two images belong to the same class).

The weight $\omega_{\text{sim},c}$ used in the image similarity comparison loss $\mathcal{L}_{\text{sim}}$ with respect to case $c$ is given by:

\omega_{\text{sim},c}=\begin{cases}\frac{1}{3}\left(1-\frac{2n_{1}n_{2}}{N(N-1)}\right)&\text{if }c=1\\ \frac{1}{3}\left(1-\frac{2n_{2}n_{3}}{N(N-1)}\right)&\text{if }c=2\\ \frac{1}{3}\left(1-\frac{2n_{1}n_{3}}{N(N-1)}\right)&\text{if }c=3\\ \frac{1}{3}\left(1-\frac{\sum_{w=1}^{3}n_{w}(n_{w}-1)}{N(N-1)}\right)&\text{if }c=4\end{cases},\qquad(8)
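Continuing the same example ($n_{1}=n_{2}=27$, $n_{3}=24$, $N=78$, $N(N-1)=6006$), Eq. (8) yields $\omega_{\text{sim},1}\approx 0.252$, $\omega_{\text{sim},2}=\omega_{\text{sim},3}\approx 0.261$, and $\omega_{\text{sim},4}\approx 0.225$, i.e., the more frequently occurring pair cases receive smaller weights.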

Note that $\sum_{c=1}^{4}\omega_{\text{sim},c}=1$. Therefore, $\mathcal{L}_{\text{sim}}$ can be written as follows:

\mathcal{L}_{\text{sim}}=-\sum_{c=1}^{4}\omega_{\text{sim},c}\,h_{c,(i,j)}\log\big(q_{c}(\mathbf{x}_{i},\mathbf{x}_{j})\big),\qquad(9)

where $q_{c}(\mathbf{x}_{i},\mathbf{x}_{j})\in[0,1]$ indicates the similarity between $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ under case $c$, and $\sum_{c=1}^{4}q_{c}(\mathbf{x}_{i},\mathbf{x}_{j})=1$. $h_{c,(i,j)}=1$ when $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ belong to case $c$, and $h_{c,(i,j)}=0$ otherwise. The hyper-parameter $\lambda$ is empirically set to $0.3$.
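To make the above concrete, the following PyTorch sketch implements the class weights in (6), the case weights in (8), and the two weighted cross-entropy terms in (7) and (9) for the three-class setting. The function names, the zero-indexed labels, the shape of the similarity head output, and the combination of the two terms via $\lambda$ are illustrative assumptions; this is a minimal sketch, not our exact implementation.

# Minimal PyTorch sketch of the multi-class CWCE loss in Eqs. (6)-(9) for the three-class
# chest X-ray task. Labels are zero-indexed here (r in {0,1,2}); tensor names, the shape of
# the similarity head output, and the combination L_cla + lambda * L_sim are assumptions.
import torch
import torch.nn.functional as F


def class_weights(counts):
    # Eq. (6): omega_{cla,e} = (N - n_e) / (2N); for three classes the weights sum to 1.
    counts = torch.as_tensor(counts, dtype=torch.float32)
    n_total = counts.sum()
    return (n_total - counts) / (2.0 * n_total)


def case_weights(counts):
    # Eq. (8): weights for the four similarity cases (three cross-class pairings + same class).
    n1, n2, n3 = (float(c) for c in counts)
    n_total = n1 + n2 + n3
    denom = n_total * (n_total - 1.0)
    same_class_pairs = n1 * (n1 - 1.0) + n2 * (n2 - 1.0) + n3 * (n3 - 1.0)
    pair_counts = torch.tensor([2 * n1 * n2, 2 * n2 * n3, 2 * n1 * n3, same_class_pairs])
    return (1.0 - pair_counts / denom) / 3.0


def cwce_loss(logits_i, logits_j, logits_sim, r_i, r_j, s_i, s_j, counts, lam=0.3):
    # logits_i, logits_j: (B, 3) class logits for the two images of each Siamese pair.
    # logits_sim:         (B, 4) similarity-case logits for each pair.
    # r_*: class labels in {0, 1, 2}; s_*: ordinal labels in {0, 1, 3} as defined above.
    batch = r_i.numel()
    w_cla = class_weights(counts).to(logits_i.device)
    w_sim = case_weights(counts).to(logits_i.device)

    # Eq. (7): class-weighted cross-entropy on both branches, averaged over the mini-batch.
    loss_cla = (F.cross_entropy(logits_i, r_i, weight=w_cla, reduction="sum")
                + F.cross_entropy(logits_j, r_j, weight=w_cla, reduction="sum")) / batch

    # Map |s_i - s_j| in {1, 2, 3, 0} to case indices {0, 1, 2, 3}.
    diff = (s_i - s_j).abs()
    case = torch.where(diff == 0, torch.full_like(diff, 3), diff - 1)

    # Eq. (9): case-weighted cross-entropy on the similarity comparison head.
    loss_sim = F.cross_entropy(logits_sim, case, weight=w_sim, reduction="sum") / batch

    return loss_cla + lam * loss_sim


# Example with the chest X-ray training split in Table 7(a):
print(class_weights([27, 27, 24]))  # tensor([0.3269, 0.3269, 0.3462])
print(case_weights([27, 27, 24]))   # tensor([0.2524, 0.2614, 0.2614, 0.2248])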

The experimental results of these two multi-class biomedical image classification tasks are presented in Fig. 8 as two confusion matrices. These results demonstrate that our proposed MTSN can be effectively trained with very few images to solve multi-class biomedical image classification problems. Specifically, the achieved accuracies for chest X-ray image classification (approximately 25-shot learning) and lung histopathological image classification (10-shot learning) are 93% and 90%, respectively. The chest X-ray image classification result compares favorably with the accuracy range of 82%-93% achieved by supervised methods using all available training data (2,520 images for training, 840 for validation, and 840 for testing) [68]. Although the accuracy achieved by the MTSN for lung histopathological image classification is lower than the over 97% reported in [69] for supervised approaches using the full training set (8,250 images for training, approximately 3,000 for validation, and 3,744 for testing), we believe that our proposed low-shot learning method can achieve comparable results when a small number of additional images are incorporated for MTSN training.

5 Discussion

Extensive experiments demonstrate the effectiveness and efficiency of training an MTSN by minimizing our proposed CWCE loss. This low-shot learning approach significantly reduces over-fitting and achieves an accuracy with a small training set (1,147 fundus images) comparable to that obtained with a large training set (approximately 53K fundus images). We further demonstrate its effectiveness on two additional multi-class few-shot biomedical image classification tasks. Moreover, the MTSNs fine-tuned with OVV self-training outperform the SoTA semi-supervised glaucoma diagnosis algorithms [20, 21] as well as general-purpose semi-supervised learning algorithms [24, 25, 26, 27] trained for glaucoma diagnosis, and they perform similarly to, and in some cases better than, SoTA supervised algorithms. However, our proposed method has two limitations:

  • In OVV self-training, each target fundus image must be compared with all the reference fundus images in the same mini-batch, resulting in a computational complexity of $\mathscr{O}(n^{2})$ (a schematic illustration of this pairwise comparison is given after this list). As the mini-batch size increases, OVV self-training becomes memory-intensive, and this high computational complexity may currently limit the feasibility of the method in clinical practice. We therefore plan to improve the OVV self-training strategy by adaptively selecting only a limited number of reference fundus photographs for semi-supervised glaucoma diagnosis, which would reduce the computational complexity and make the method more practical in clinical settings.

  • Our proposed OVV self-training strategy is developed for binary image classification and may not be directly applicable to multi-class image classification problems. We therefore plan to extend the contrastive prediction procedure to multi-class image classification in future work. Additional hyper-parameter tuning is always possible, but over-fitting occurs easily with limited labeled data.
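The quadratic growth noted in the first limitation can be illustrated with the following schematic sketch, which simply enumerates all target-reference pairs within a mini-batch before any similarity comparison is applied; the tensor shapes and feature dimension are arbitrary assumptions, and the sketch does not reproduce the actual OVV pseudo-labeling rule.

# Schematic sketch of the O(n^2) pairing in OVV self-training: every unlabeled target
# embedding is paired with every reference embedding in the same mini-batch before the
# similarity head is applied. Shapes and the feature dimension (512) are arbitrary.
import torch


def all_pairs(target_feats, ref_feats):
    # target_feats: (Bt, D), ref_feats: (Br, D) -> (Bt * Br, 2D) concatenated pair features.
    bt, br = target_feats.size(0), ref_feats.size(0)
    t = target_feats.unsqueeze(1).expand(bt, br, -1)  # (Bt, Br, D)
    r = ref_feats.unsqueeze(0).expand(bt, br, -1)     # (Bt, Br, D)
    return torch.cat([t, r], dim=-1).reshape(bt * br, -1)


pairs = all_pairs(torch.randn(64, 512), torch.randn(64, 512))
print(pairs.shape)  # torch.Size([4096, 1024]): 64 x 64 comparisons, growing quadratically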

6 Conclusion

The main contributions of this paper are: (1) a multi-task Siamese network that can learn glaucoma diagnosis from very limited labeled training data, and (2) an effective semi-supervised learning strategy, referred to as One-Vote Veto self-training, which produces pseudo labels for unlabeled data to fine-tune a pre-trained multi-task Siamese network. Extensive experiments conducted on four fundus image datasets demonstrate the effectiveness of these techniques. The low-shot learning approach reduces over-fitting and achieves an accuracy with a small training set comparable to that obtained with a large training set. Furthermore, with One-Vote Veto self-training, the multi-task Siamese networks perform similarly to their backbone CNNs (trained via supervised learning on the full training set) on the OHTS test set and show better generalizability on three additional test sets. The methods introduced in this paper can also be applied to other few-shot multi-class biomedical image classification problems, e.g., COVID-19 and lung cancer diagnosis, and to other diseases for which only a small quantity of ground-truth labels is available for network training.

References

  • [1] R. N. Weinreb and P. T. Khaw, “Primary open-angle glaucoma,” The Lancet, vol. 363, no. 9422, pp. 1711–1720, 2004.
  • [2] Y.-C. Tham et al., “Global prevalence of glaucoma and projections of glaucoma burden through 2040: A systematic review and meta-analysis,” Ophthalmology, vol. 121, no. 11, pp. 2081–2090, 2014.
  • [3] R. Fan et al., “Detecting glaucoma in the ocular hypertension study using deep learning,” JAMA Ophthalmology, vol. 140, no. 4, pp. 383–391, 2022.
  • [4] C. Traverso et al., “Direct costs of glaucoma and severity of the disease: a multinational long term study of resource utilisation in Europe,” British Journal of Ophthalmology, vol. 89, no. 10, pp. 1245–1249, 2005.
  • [5] W. Huang et al., “The adverse impact of glaucoma on psychological function and daily physical activity,” Journal of Ophthalmology, vol. 2020, 2020.
  • [6] R. K. Parrish et al., “Visual function and quality of life among patients with glaucoma,” Archives of Ophthalmology, vol. 115, no. 11, pp. 1447–1455, 1997.
  • [7] M. Kwon et al., “Association between glaucoma and at–fault motor vehicle collision involvement among older drivers: A population-based study,” Ophthalmology, vol. 123, no. 1, pp. 109–116, 2016.
  • [8] G. McGwin Jr et al., “Binocular visual field impairment in glaucoma and at-fault motor vehicle collisions,” Journal of Glaucoma, vol. 24, no. 2, pp. 138–143, 2015.
  • [9] L. Li et al., “Attention based glaucoma detection: A large-scale database and CNN model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 10571–10580.
  • [10] R. Fan et al., “Detecting glaucoma from fundus photographs using deep learning without convolutions: Transformer for improved generalization,” Ophthalmology Science, vol. 3, no. 1, p. 100233, 2023.
  • [11] M. O. Gordon and M. A. Kass, “The ocular hypertension treatment study: Design and baseline description of the participants,” Archives of Ophthalmology, vol. 117, no. 5, pp. 573–583, 1999.
  • [12] M. A. Kass et al., “The ocular hypertension treatment study: A randomized trial determines that topical ocular hypotensive medication delays or prevents the onset of primary open-angle glaucoma,” Archives of Ophthalmology, vol. 120, no. 6, pp. 701–713, 2002.
  • [13] M. O. Gordon et al., “Assessment of the impact of an endpoint committee in the ocular hypertension treatment study,” American Journal of Ophthalmology, vol. 199, pp. 193–199, 2019.
  • [14] Z. Li et al., “Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs,” Ophthalmology, vol. 125, no. 8, pp. 1199–1206, 2018.
  • [15] J. J. Gómez-Valverde et al., “Automatic glaucoma classification using color fundus images based on convolutional neural networks and transfer learning,” Biomedical Optics Express, vol. 10, no. 2, pp. 892–913, 2019.
  • [16] D. Judy et al., “Automated identification of glaucoma from fundus images using deep learning techniques,” European Journal of Molecular & Clinical Medicine, vol. 7, no. 2, pp. 5449–5458, 2020.
  • [17] A. Serener and S. Serte, “Transfer learning for early and advanced glaucoma detection with convolutional neural networks,” in 2019 Medical Technologies Congress (TIPTEKNO).   IEEE, 2019, pp. 1–4.
  • [18] A. Thakur et al., “Predicting glaucoma before onset using deep learning,” Ophthalmology Glaucoma, vol. 3, no. 4, pp. 262–268, 2020.
  • [19] M. Kim et al., “Few-shot learning using a small-sized dataset of high-resolution fundus images for glaucoma diagnosis,” in Proceedings of the 2nd International Workshop on Multimedia for Personal Health and Health Care, 2017, pp. 89–92.
  • [20] M. Al Ghamdi et al., “Semi-supervised transfer learning for convolutional neural networks for glaucoma detection,” in 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 3812–3816.
  • [21] A. Díaz Pinto et al., “Retinal image synthesis and semi-supervised learning for glaucoma assessment,” IEEE Transactions on Medical Imaging, vol. 38, no. 9, pp. 2211–2218, 2019.
  • [22] A. Díaz Pinto et al., “CNNs for automatic glaucoma assessment using fundus images: An extensive validation,” Biomedical Engineering Online, vol. 18, pp. 1–19, 2019.
  • [23] P. Sample et al., “The African descent and glaucoma evaluation study (ADAGES): Design and baseline data,” Archives of Ophthalmology, vol. 127, no. 9, pp. 1136–1145, 2009.
  • [24] Y. Wang et al., “Freematch: Self-adaptive thresholding for semi-supervised learning,” in the International Conference on Learning Representations (ICLR), 2023, in press.
  • [25] H. Chen et al., “Softmatch: Addressing the quantity-quality trade-off in semi-supervised learning,” in the International Conference on Learning Representations (ICLR), 2023, in press.
  • [26] K. Sohn et al., “Fixmatch: Simplifying semi-supervised learning with consistency and confidence,” Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 596–608, 2020.
  • [27] B. Zhang et al., “Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling,” Advances in Neural Information Processing Systems (NeurIPS), vol. 34, pp. 18 408–18 419, 2021.
  • [28] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in the International Conference on Learning Representations (ICLR), 2021.
  • [29] M. E. Chowdhury et al., “Can AI help in screening viral and COVID-19 pneumonia?” IEEE Access, vol. 8, pp. 132 665–132 676, 2020.
  • [30] T. Rahman et al., “Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images,” Computers in Biology and Medicine, vol. 132, p. 104319, 2021.
  • [31] A. A. Borkowski et al., “Lung and colon cancer histopathological image dataset (LC25000),” CoRR, 2019.
  • [32] A. Krizhevsky, “One weird trick for parallelizing convolutional neural networks,” CoRR, 2014.
  • [33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in the International Conference on Learning Representations (ICLR), 2015, pp. 1–14.
  • [34] C. Szegedy et al., “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
  • [35] C. Szegedy et al., “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818–2826.
  • [36] J. M. Ahn et al., “A deep learning model for the detection of both advanced and early glaucoma using fundus photography,” PloS one, vol. 13, no. 11, p. e0207982, 2018.
  • [37] J. Deng et al., “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
  • [38] K. He et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [39] S. Liu et al., “A deep learning-based algorithm identifies glaucomatous discs using monoscopic fundus photographs,” Ophthalmology Glaucoma, vol. 1, no. 1, pp. 15–22, 2018.
  • [40] A. R. Ran et al., “Detection of glaucomatous optic neuropathy with spectral-domain optical coherence tomography: A retrospective training and validation deep-learning analysis,” The Lancet Digital Health, vol. 1, no. 4, pp. e172–e182, 2019.
  • [41] F. A. Medeiros et al., “Detection of progressive glaucomatous optic nerve damage on fundus photographs with deep learning,” Ophthalmology, vol. 128, no. 3, pp. 383–392, 2021.
  • [42] M. Christopher et al., “Performance of deep learning architectures and transfer learning for detecting glaucomatous optic neuropathy in fundus photographs,” Scientific Reports, vol. 8, no. 1, pp. 1–13, 2018.
  • [43] M. Christopher et al., “Effects of study population, labeling and training on glaucoma detection using deep learning algorithms,” Translational Vision Science & Technology, vol. 9, no. 2, pp. 27–27, 2020.
  • [44] D. Jain et al., “Open-source, ultra-low-cost smartphone attachment for non-mydriatic fundus photography-open indirect ophthalmoscope,” Investigative Ophthalmology & Visual Science, vol. 57, no. 12, pp. 1685–1685, 2016.
  • [45] E. Matthew Lawson and R. Raskar, “Smart phone administered fundus imaging without additional imaging optics,” Investigative Ophthalmology & Visual Science, vol. 55, no. 13, pp. 1609–1609, 2014.
  • [46] M. Sandler et al., “MobileNetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510–4520.
  • [47] Y. Wang et al., “Generalizing from a few examples: A survey on few-shot learning,” ACM Computing Surveys (CSUR), vol. 53, no. 3, pp. 1–34, 2020.
  • [48] P. Zhou et al., “Attention-based bidirectional long short-term memory networks for relation classification,” in Proceedings of the 54th annual meeting of the Association for Computational Linguistics (volume 2: Short Papers), 2016, pp. 207–212.
  • [49] A. Radford et al., “Unsupervised representation learning with deep convolutional generative adversarial networks,” in the International Conference on Learning Representations (ICLR), 2015.
  • [50] I. Triguero et al., “Self-labeled techniques for semi-supervised learning: Taxonomy, software and empirical study,” Knowledge and Information Systems, vol. 42, no. 2, pp. 245–284, 2015.
  • [51] J. E. Van Engelen and H. H. Hoos, “A survey on semi-supervised learning,” Machine Learning, vol. 109, no. 2, pp. 373–440, 2020.
  • [52] A. Miller et al., “Key-value memory networks for directly reading documents,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016, pp. 1400–1409.
  • [53] G. Koch et al., “Siamese neural networks for one-shot image recognition,” in ICML Deep Learning Workshop, vol. 2.   Lille, 2015.
  • [54] J. Jang and C. O. Kim, “Siamese network-based health representation learning and robust reference-based remaining useful life prediction,” IEEE Transactions on Industrial Informatics, vol. 18, no. 8, pp. 5264–5274, 2021.
  • [55] Q. Wang et al., “Learning attentions: Residual attentional Siamese network for high performance online visual tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4854–4863.
  • [56] W. Wang et al., “Face recognition based on deep learning,” in 1st International Conference on Human Centered Computing (HCC).   Springer, 2015, pp. 812–820.
  • [57] E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in International Workshop on Similarity-Based Pattern Recognition.   Springer, 2015, pp. 84–92.
  • [58] P. Khosla et al., “Supervised contrastive learning,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 18 661–18 673.
  • [59] A. Kendall et al., “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7482–7491.
  • [60] D. McClosky, E. Charniak, and M. Johnson, “Effective self-training for parsing,” in Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, 2006, pp. 152–159.
  • [61] S. Sukhbaatar, J. Weston, R. Fergus et al., “End-to-end memory networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2015, pp. 2440–2448.
  • [62] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, pp. 2579–2605, 2008.
  • [63] L.-C. Chen et al., “Rethinking atrous convolution for semantic image segmentation,” CoRR, 2017.
  • [64] C. Tan et al., “A survey on deep transfer learning,” in International Conference on Artificial Neural Networks (ICANN).   Springer, 2018, pp. 270–279.
  • [65] G. Huang et al., “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
  • [66] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning (ICML).   PMLR, 2019, pp. 6105–6114.
  • [67] A. Chattopadhay et al., “Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks,” in 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).   IEEE, 2018, pp. 839–847.
  • [68] S. Jadon, “COVID-19 detection from scarce chest x-ray image data using few-shot deep learning approach,” in Medical Imaging 2021: Imaging Informatics for Healthcare, Research, and Applications, vol. 11601.   International Society for Optics and Photonics, 2021, p. 116010X.
  • [69] M. A. Abbas et al., “The histopathological diagnosis of adenocarcinoma & squamous cells carcinoma of lungs by artificial intelligence: A comparative study of convolutional neural networks,” MedRxiv, pp. 2020–05, 2020.