
Adaptive Variance Thresholding: A Novel Approach to Improve Existing Deep Transfer Vision Models and Advance Automatic Knee-Joint Osteoarthritis Classification

Fabi Prezja, Leevi Annala, Sampsa Kiiskinen, Suvi Lahtinen, Timo Ojala
Abstract

Knee-Joint Osteoarthritis (KOA) is a prevalent cause of global disability and is inherently complex to diagnose due to its subtle radiographic markers and individualized progression. One promising classification avenue involves applying deep learning methods; however, these techniques demand extensive, diversified datasets, which pose substantial challenges due to medical data collection restrictions. Existing practices typically resort to smaller datasets and transfer learning. However, this approach often inherits unnecessary pre-learned features that can clutter the classifier’s vector space, potentially hampering performance. This study proposes a novel paradigm for improving post-training specialized classifiers by introducing adaptive variance thresholding (AVT) followed by Neural Architecture Search (NAS). This approach yielded two key outcomes: an increase in the initial accuracy of the pre-trained KOA models and a 60-fold reduction in the NAS input vector space, thus facilitating faster inference and a more efficient hyperparameter search. We also applied this approach to an external model trained for KOA classification. Despite that model’s already strong initial performance, our methodology further improved its average accuracy, making it one of the top three KOA classification models.

keywords:
Knee Osteoarthritis, Adaptive Variance Threshold, Neural Architecture Search, Convolutional Neural Network, Disease stages
Affiliations:

[1] University of Jyväskylä, Faculty of Information Technology, Jyväskylä, 40014, Finland

[3] University of Helsinki, Faculty of Science, Department of Computer Science, Helsinki, Finland

[4] University of Helsinki, Faculty of Agriculture and Forestry, Department of Food and Nutrition, Helsinki, Finland

[5] University of Jyväskylä, Faculty of Mathematics and Science, Department of Biological and Environmental Science, Jyväskylä, 40014, Finland

1 Introduction

Over the last ten years, a significant increase has been observed in the incorporation of artificial intelligence into the medical field[1, 2], spurred by the robust expansion of deep machine learning methodologies[3]. Medicine has been recognized as a key domain for deploying these sophisticated technologies, with deep learning primarily focused on data analysis and clinical decision support. These systems, capable of exploring medical data to identify patterns and relationships, cover many applications. They have displayed considerable advancement in predicting patient outcomes [4, 5, 6], along with enriching diagnostics and disease classification [7, 8, 9, 10, 11]. Apart from data analysis and classification, deep learning has been successful in data segmentation [12, 13], and has also shown progress in the generation [14, 15, 16, 17, 18] and anonymization of medical data[19, 20, 21, 22, 23, 24]. Nonetheless, applying these technological developments to Osteoarthritis (OA) brings its unique set of challenges.

OA, predominantly knee joint osteoarthritis[25, 26, 27] (KOA), is a leading cause of disability worldwide[28], with estimated expenses amounting to as much as 2.5% of the gross national product in western countries[28]. Early detection is hampered by subtle radiographic indicators and variability in disease progression[29, 25]. Employing deep learning in KOA classification [30, 31, 32] heavily relies on diverse and extensive data sets. Yet, obtaining such data sets is challenging, hindered by patient privacy concerns[33, 34], limitations in data collection, and the intrinsic progression of OA. Traditionally, researchers have circumvented these limitations through deep transfer learning, where pre-trained neural networks are repurposed for a specific task using smaller data sets. The peak accuracy achieved for multi-KL classification using radiographic Kellgren and Lawrence grading of OA stands at 74.81% [35, 25]. This process typically involves end-to-end training of transfer architectures, which include a feature learning section and a classification section. In this setup, learned features are usually fed into a classifier for predictions. However, not all repurposed features may be highly relevant to the task, potentially cluttering the classifier with unnecessary or barely varying (near-static) features, while manual classifier design could increase complexity and raise the risk of under-fitting. Feature selection[36] and Neural Architecture Search (NAS)[37, 38] can partly mitigate these issues: feature selection refines the classifier input, while NAS automates the architecture design process and identifies optimal structures often missed in manual design, thereby increasing potential model efficiency and accuracy. Yet, these techniques have not previously been combined with transfer learning-based classification for KOA.

In this study, we propose a method to amplify the performance of existing KOA classification models using an adaptive variance threshold feature filtering technique and neural architecture search. This approach refined the classifier’s feature space, boosted computational efficiency, and advanced the initial accuracy of the evaluated models. Significantly, when we applied our technique to an external model, specifically from Chen et al.[39], its average accuracy improved to 71.14%. This boost propelled the external model into the top three KOA classification models[25, 35].

2 Methods

Our study’s methodology is structured into three clear phases. First, we focus on data collection and essential pre-processing. In the subsequent phase, we explore the intricacies of the deep transfer learning approach. The concluding phase offers a comprehensive overview of the adaptive variance thresholding method and neural architecture search, detailing the metrics for evaluation. Figure 1 illustrates the core approach we used with the pre-trained EfficientNet model for KOA classification.

Figure 1: The study’s methodological pipeline, as applied to the EfficientNetV2 Base Model, showcases the intervention in the classification head.

2.1 Data Collection

Our study utilized knee joint X-ray images derived from the 2019 study by Chen et al.[39], obtained initially from the Osteoarthritis Initiative (OAI). The OAI is a multi-center study focused on biomarkers for knee osteoarthritis that involved 4796 participants aged between 45 and 79. We used the pre-processed primary cohort data[40] from the Chen 2019 study, which had undergone automatic knee joint detection, bounding, and standardization of zoom to 0.14 mm/pixel. This process yielded 8260 images (224 x 224 pixels) extracted from 4130 X-rays, each encompassing both knee joints. The images were classified using the Kellgren and Lawrence (KL) system[41], as illustrated in Figure 2. The distribution of KL grades comprised 3253 images for Grade 0, 1495 for Grade 1, 2175 for Grade 2, 1086 for Grade 3, and 251 for Grade 4.

Figure 2: Sample images display different KL grades, from 0 (indicating no OA signs) to 4 (representing severe OA). As we move from left to right, the severity of OA intensifies. Within each image, two red markers are present: a circle pinpointing osteophyte areas and an arrow highlighting joint space narrowing (JSN).

2.2 Image Pre-Processing

We adjusted each right knee joint image to replicate the orientation of a left knee. We subsequently located and inverted any negative channel images, leading to 189 instances. The contrast of the image histograms was then equalized using Equation 1. In this equation, we took a given grayscale image I of m×n dimensions and used the cumulative distribution function cdf and pixel value v to derive an equalized value h(v) within the range [0, 255]. Here, cdf_min represents the non-zero minimum value of the image’s cumulative distribution, while m×n refers to the total number of pixels.

h(v)=255\frac{cdf(v)-cdf_{min}}{(mn)-cdf_{min}} (1)
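For illustration, a minimal NumPy sketch of the equalization in Equation 1 is given below, assuming 8-bit grayscale images stored as arrays; the function and variable names are illustrative and not the study's released code.

```python
import numpy as np

def equalize_histogram(image: np.ndarray) -> np.ndarray:
    """Equalize an 8-bit grayscale image following Equation 1."""
    hist = np.bincount(image.ravel(), minlength=256)  # pixel-value counts
    cdf = np.cumsum(hist)                             # cumulative distribution function
    cdf_min = cdf[cdf > 0].min()                      # non-zero minimum of the cdf
    mn = image.size                                   # total number of pixels (m*n)
    # h(v) = 255 * (cdf(v) - cdf_min) / (mn - cdf_min), applied via a lookup table
    lookup = np.clip(np.round(255.0 * (cdf - cdf_min) / (mn - cdf_min)), 0, 255)
    return lookup.astype(np.uint8)[image]
```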

2.3 Chen 2019 External Validation

In order to externally validate our approach, we mirrored the exact models and data utilized in the Chen et al. 2019 study[39]. We chose the study’s best-performing VGG19 - Ordinal model (publicly available[40]). We extracted feature vectors from this model’s pre-output dense layer, effectively replicating the conditions under which the original model achieved its top performance. This step was crucial to ensure a fair comparison and assessment of the effectiveness of our approach when applied to an external model.
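As a sketch of this extraction step (not the study's exact code), the penultimate dense-layer activations of a loaded Keras model could be obtained as follows; the model file name, the `images` array, and the assumption that the pre-output dense layer is the second-to-last layer are all illustrative.

```python
import tensorflow as tf

# Hypothetical file name for the publicly released VGG19 - Ordinal model
model = tf.keras.models.load_model("vgg19_ordinal.h5")

# Build a sub-model that outputs the pre-output dense-layer activations
feature_extractor = tf.keras.Model(
    inputs=model.input,
    outputs=model.layers[-2].output,  # assumed to be the pre-output dense layer
)

# images: pre-processed knee-joint images as a NumPy array (assumed available)
features = feature_extractor.predict(images, batch_size=32)  # expected shape: (n_samples, 4096)
```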

2.4 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) [42], a cornerstone in the renaissance of deep learning [3], have often found applications in the realm of computer vision. The strength of CNNs lies in their ability to execute convolution operations between an input and a filter-kernel, thereby highlighting distinctive features captured in a response known as a feature map. As these filters slide across inputs, they create layers of complex feature maps, each representing more abstract concepts from the preceding maps. Mathematically, given an image I with dimensions u×v and a filter-kernel H of s×t dimensions, the generation of a feature map G by convolving I and H across axes u, v is expressed as:

\textbf{G}(u,v)=\sum_{s}\sum_{t}\textbf{I}(s,t)\textbf{H}(u-s,v-t) (2)

Usually, the resulting values of the feature map are filtered through an activation function, which acts to re-map these values using a pre-defined function. One common example is the Rectified Linear Unit activation function[43] (ReLU), which sets negative values to zero, thereby offering computational efficiency by eliminating less essential information. For any feature map value z, the ReLU activation function is defined as:

f(z)=\max(0,z) (3)

Further, a max pooling operation is often employed to down-sample the convolution result. Consequently, cascades of max pooling and convolution yield a gradually decreasing length of features. For an image I with dimensions u×v, the max pooled value g(u_I) for dimension u can be defined as:

g(u_{I})=\lfloor\frac{u_{I}-r}{h}\rfloor+1 (4)

Here, u_I is dimension u from image I, r denotes the pooling window size, and h signifies the stride value. This streamlining process contributes to an efficient and effective model for identifying the necessary classification task features.
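The following toy sketch ties Equations 2-4 together with NumPy/SciPy: a single convolution, a ReLU re-mapping, and the pooled output size along one axis. The sizes and names are illustrative only.

```python
import numpy as np
from scipy.signal import convolve2d

def pooled_size(u_I: int, r: int, h: int) -> int:
    """Output length along one dimension after max pooling, per Equation 4."""
    return (u_I - r) // h + 1

image = np.random.rand(224, 224)   # grayscale image I
kernel = np.random.rand(3, 3)      # filter-kernel H

feature_map = convolve2d(image, kernel, mode="valid")  # convolution of I and H (Equation 2)
feature_map = np.maximum(0.0, feature_map)             # ReLU activation (Equation 3)

print(feature_map.shape)                               # (222, 222)
print(pooled_size(feature_map.shape[0], r=2, h=2))     # 111
```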

2.5 Convolutional Neural Network Architecture

A notable deep learning model, EfficientNet[44], adopts the strategy of compound scaling, which systematically adjusts depth (number of layers), width (size of the layers), and resolution (size of the input image). This scaling process is mathematically encapsulated as follows:

d=\alpha^{\phi}d_{0},\quad w=\beta^{\phi}w_{0},\quad r=\gamma^{\phi}r_{0} (5)

Here, α, β, γ denote constants, ϕ stands for a user-selected coefficient, and d_0, w_0, r_0 represent the depth, width, and resolution of the original model, respectively.
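A small sketch of the compound-scaling rule in Equation 5 follows; the default constants quoted below (α=1.2, β=1.1, γ=1.15) are the values reported for EfficientNet-B0 by Tan and Le[44] and are used here purely for illustration.

```python
def compound_scale(phi: float, d0: float, w0: float, r0: float,
                   alpha: float = 1.2, beta: float = 1.1, gamma: float = 1.15):
    """Scale depth, width, and resolution according to Equation 5."""
    depth = (alpha ** phi) * d0        # number of layers
    width = (beta ** phi) * w0         # layer size (channels)
    resolution = (gamma ** phi) * r0   # input image size
    return depth, width, resolution

# Relative scaling for one compound-coefficient step from a base model
print(compound_scale(phi=1.0, d0=1.0, w0=1.0, r0=224.0))  # (1.2, 1.1, 257.6)
```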

A distinctive feature of EfficientNet is the MBConv block, which chains together transformations in the order of a 1×1 convolution, a depth-wise convolution, a Squeeze-and-Excitation (SE) operation [45], and a subsequent 1×1 convolution:

T_{MB}(\textbf{I})=\textbf{K}_{2}\ast SE(\textbf{D}\ast(\textbf{K}_{1}\ast\textbf{I})) (6)

In this equation, K_1 and K_2 are 1×1 convolutional filters, D stands for the depth-wise convolutional filter, and SE(·) signifies the Squeeze-and-Excitation operation. The EfficientNetV2[46] model extends its predecessor by incorporating a Fused-MBConv block that fuses the initial 1×1 and depth-wise convolutions into a single 3×3 convolution, followed by an SE operation and a concluding 1×1 convolution:

T_{FMB}(\textbf{I})=\textbf{K}_{2}\ast(SE(\textbf{K}_{f}\ast\textbf{I})) (7)

Here, K_f is the 3×3 convolutional filter that merges the initial 1×1 and depth-wise convolutions, and K_2 is the final 1×1 convolutional filter. An activation function follows each convolution, and the block may include a skip connection. The EfficientNetV2 base model (B0) and the modifications implemented in our study are illustrated in Figure 3.

Figure 3: The base architecture of EfficientNetV2, complemented by the changes we implemented after the pooling stage.
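The sketch below approximates the Fused-MBConv transform of Equation 7 with Keras layers: a fused 3×3 convolution, a squeeze-and-excitation step, a final 1×1 projection, and an optional skip connection. It is a simplified illustration rather than the exact EfficientNetV2 block definition (strides and expansion handling are omitted).

```python
import tensorflow as tf
from tensorflow.keras import layers

def fused_mbconv(x, filters: int, expand_ratio: int = 4, se_ratio: float = 0.25):
    """Simplified Fused-MBConv block following Equation 7 (illustrative only)."""
    inputs = x
    expanded = filters * expand_ratio

    # Fused 3x3 convolution K_f (replaces the 1x1 expansion + depth-wise pair)
    x = layers.Conv2D(expanded, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("swish")(x)

    # Squeeze-and-Excitation: channel-wise re-weighting
    se = layers.GlobalAveragePooling2D()(x)
    se = layers.Dense(max(1, int(expanded * se_ratio)), activation="swish")(se)
    se = layers.Dense(expanded, activation="sigmoid")(se)
    x = layers.Multiply()([x, layers.Reshape((1, 1, expanded))(se)])

    # Final 1x1 projection K_2
    x = layers.Conv2D(filters, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)

    # Skip connection when input and output channel counts match
    if inputs.shape[-1] == filters:
        x = layers.Add()([x, inputs])
    return x
```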

We utilized the EfficientNetV2-M architecture for our model and further updated it by introducing flattening and a dense layer consisting of 240 neurons. The model was trained over 25 epochs using the Adam optimizer, with the dataset being partitioned into Training (75%), Validation (15%), and Testing (15%) sets (patient-wise splits). The minimum validation loss governed the early stopping criterion. The CNN training and online affine augmentations (’advanced augmentation preset’) were executed using the open-source Deep Fast Vision library[47].
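Because the experiments were run through the Deep Fast Vision library[47], the exact training code is not reproduced here; the outline below is only a hedged, generic Keras equivalent of the described set-up (EfficientNetV2-M backbone, flattening, a 240-neuron dense layer, Adam, early stopping on validation loss). The dense-layer activation and the data pipeline variables are assumptions.

```python
import tensorflow as tf

backbone = tf.keras.applications.EfficientNetV2M(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))

x = tf.keras.layers.Flatten()(backbone.output)
x = tf.keras.layers.Dense(240, activation="relu")(x)          # 240-dim feature vector (activation assumed)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)   # five KL grades
model = tf.keras.Model(backbone.input, outputs)

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", restore_best_weights=True)

# train_ds / val_ds: augmented training and validation datasets (assumed available)
# model.fit(train_ds, validation_data=val_ds, epochs=25, callbacks=[early_stop])
```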

2.6 Adaptive Variance Thresholding

Variance Thresholding is a popular feature selection method[48, 49] used to enhance machine learning model performance by removing excess features. It calculates the variance of each feature and drops those whose variance falls below a specific user-defined threshold. However, the conventional method requires setting this threshold manually, which does not adapt to different datasets and does not generalize across datasets, or even across feature vectors from different architectures trained on the same dataset. To overcome this, we propose Adaptive Variance Thresholding (AVT), which automatically sets the threshold from a user-defined percentile of the calculated feature variances. This makes the method flexible, as it adapts to the unique characteristics of each dataset without requiring manual changes. The following mathematical steps describe the process of Adaptive Variance Thresholding (AVT):

  1. Compute the variance of each feature in the feature set F. We denote this vector of variances as v, where each element v_i represents the variance of the i-th feature:

     \mathbf{v}=\text{Var}(\textbf{F}_{i}),\quad\forall i\in\{1,2,...,w\} (8)

     Here, F_i denotes the i-th feature in the feature matrix F with dimensions w×z (where w is the number of features and z is the number of samples). Var(F_i) calculates the variance of the i-th feature, with v storing these variances.

  2. Set the threshold j to be the p-th percentile of these computed variances:

     j=\text{percentile}(\mathbf{v},p) (9)

     The function percentile(v, p) computes the p-th percentile of the variances in v, where p is a user-defined parameter.

  3. Form a new feature matrix F', consisting of only those features i whose variance exceeds the threshold j:

     \textbf{F}_{i}\in\textbf{F'}\text{ if }v_{i}\geq j (10)

In the final step, the feature F_i is included in the new feature matrix F' if its corresponding variance v_i is greater than or equal to the threshold j; otherwise, the feature is removed. The power of the adaptive approach lies in its capacity to derive the threshold directly from the data, allowing it to adjust to the distinct characteristics of each dataset. In our study, we examined three threshold levels, referred to as Low, Mid, and High, corresponding to the 1.5th, 50th, and 98.5th percentiles, respectively; given the computational demands associated with this approach, an exhaustive search over thresholds was prohibitively expensive. In essence, we preserved the top 98.5% of varying features in the Low setting, the top 50% in the Mid setting, and only the top 1.5% in the High setting. These settings explored scenarios ranging from mild to substantial dimensionality reduction. The Python class implementing this method can be found in the provided materials; it constitutes a modification of the Scikit-Learn class[50], using NumPy[51] for computations.
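A minimal, stand-alone sketch of AVT as a Scikit-Learn-style transformer is shown below; the released class modifies Scikit-Learn's VarianceThreshold, whereas this version is only an illustrative equivalent (note that it expects a samples × features matrix).

```python
import numpy as np

class AdaptiveVarianceThreshold:
    """Keep only features whose variance reaches a data-derived percentile threshold."""

    def __init__(self, percentile: float = 50.0):
        self.percentile = percentile            # user-defined percentile p

    def fit(self, X: np.ndarray) -> "AdaptiveVarianceThreshold":
        self.variances_ = np.var(X, axis=0)     # v_i for each feature (Equation 8)
        self.threshold_ = np.percentile(self.variances_, self.percentile)  # j (Equation 9)
        self.mask_ = self.variances_ >= self.threshold_
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        return X[:, self.mask_]                 # F': features with v_i >= j (Equation 10)

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        return self.fit(X).transform(X)

# Example with hypothetical feature matrices: the 'High' setting keeps only the
# features whose variance lies above the 98.5th percentile.
# avt_high = AdaptiveVarianceThreshold(percentile=98.5).fit(train_features)
# train_reduced = avt_high.transform(train_features)
# test_reduced = avt_high.transform(test_features)
```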

2.7 Neural Architecture Search

Neural Architecture Search (NAS) [37], a method in machine learning, offers an automated way to design neural network architectures, diminishing the need for exhaustive manual tuning. By systematically exploring a range of potential architectures, NAS pinpoints the ones that yield the best performance for a specific task. The process involves defining a search space of potential architectures and using a controller model to generate, assess, and train these candidate architectures. In our experiment, we applied the AutoKeras[38] Structured Classifier search approach to identify suitable architectures for the Adaptive Variance Thresholding (AVT) sets. Notably, the evaluation of NAS was performed exclusively using validation data. We permitted a maximum of 55 consecutive trials with validation accuracy guiding the early stopping criterion. Each trial was given a training space of up to 25 epochs. All identified architectures are detailed in the appendix, and trained versions of these architectures are available under data availability.
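A hedged sketch of this NAS step with AutoKeras' StructuredDataClassifier is given below, matching the settings described above (up to 55 trials, up to 25 epochs per trial, validation accuracy guiding the search and early stopping); the exact arguments used in the study may differ, and the feature/label variables are assumptions.

```python
import autokeras as ak
import tensorflow as tf

clf = ak.StructuredDataClassifier(
    max_trials=55,                 # maximum number of candidate architectures
    objective="val_accuracy",      # validation accuracy guides the search
    overwrite=True,
)

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy",
                                              restore_best_weights=True)

# train_features_avt / val_features_avt: AVT-reduced feature vectors (assumed available)
clf.fit(
    x=train_features_avt,
    y=train_labels,
    validation_data=(val_features_avt, val_labels),
    epochs=25,
    callbacks=[early_stop],
)

best_model = clf.export_model()    # best architecture found by the search
```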

2.8 Classification Metrics

Accuracy, a common measure of model performance, is the proportion of correct predictions (true positives and true negatives) to the total number of observations. It is mathematically defined as:

\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} (11)

Precision, also referred to as the positive predictive value, denotes the fraction of correct positive predictions in relation to all positive predictions made. The calculation for precision is as follows:

\text{Precision}=\frac{TP}{TP+FP} (12)

Recall, alternatively known as sensitivity or true positive rate, represents the fraction of actual positive instances that the model correctly identified. The formula for recall is:

\text{Recall}=\frac{TP}{TP+FN} (13)

The F1 score aims to balance precision and recall by taking their harmonic mean. It is calculated using the following formula:

\text{F1 Score}=2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}} (14)

In these expressions, TP stands for True Positives, TN represents True Negatives, FP is for False Positives, and FN refers to False Negatives.
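These quantities can be computed directly with Scikit-Learn; the short sketch below uses macro averaging over the five KL grades, and the prediction arrays are assumed to be available.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true, y_pred: ground-truth and predicted KL grades (assumed available)
accuracy = accuracy_score(y_true, y_pred)                     # Equation 11
precision = precision_score(y_true, y_pred, average="macro")  # Equation 12
recall = recall_score(y_true, y_pred, average="macro")        # Equation 13
f1 = f1_score(y_true, y_pred, average="macro")                # Equation 14
```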

3 Results

3.1 External Chen 2019 Model

This study utilized adaptive variance thresholding (AVT) and neural architecture search on pre-trained vision neural networks. Table 1 compares the performance across AVT thresholds on the best pre-trained Chen 2019 model (baseline), highlighting the impact of these techniques. We observed that the model’s NAS input dimensionality significantly decreased from 4096 in the ’Baseline’ condition to 62 in the ’High AVT’ condition. Accuracy generally improved, peaking at 71.14% in the ’High AVT’ scenario. Precision increased to 70.98% in the ’Mid AVT’ condition and slightly declined to 69.76% in the ’High AVT’ setting. Recall remained relatively steady across all conditions, with a minor increase to 68.95% in the ’Mid AVT’ scenario. The F1-score, crucial in handling uneven KL distributions, increased until ’Mid AVT’, then slightly decreased in the ’High AVT’ condition. Figure 4 further details the confusion matrices for each condition.

Table 1: Classification report for the Chen model using Chen test data, with and without AVT intervention (baseline). ’Dimensionality’ denotes the vector space size pre-NAS.
Baseline Low AVT Mid AVT High AVT
Dimensionality 4096 4034 2048 62
Accuracy 69.63% 68.30% 70.41% 71.14%
Precision 64.79% 68.71% 70.98% 69.76%
Recall 67.74% 67.32% 68.95% 67.53%
F1-score 64.96% 67.96% 69.86% 66.88%
Figure 4: The confusion matrices for the external Chen model using Chen test data. The baseline indicates no intervention, whereas Low, Mid, and High represent AVT sets. Appendix Tables 8, 9, and 10 detail the NAS architectures discovered for each set. The X-axis shows predicted labels; the Y-axis shows true labels.

The confusion matrices and accuracy results above show the impact of the four Adaptive Variance Thresholding (AVT) levels, namely Baseline, Low AVT, Mid AVT, and High AVT, on the performance of the neural network model. A general improvement in accuracy was observed as the AVT level increased, rising from approximately 69.63% at Baseline to 71.14% at High AVT. A detailed look into the normalized confusion matrices revealed a rise in correct predictions, notably for the first and third classes, across increasing AVT levels. An exception was observed for KL 1, whose correct predictions at High AVT fell below the Baseline level. On the other hand, the fourth and fifth classes exhibited stability in model performance with varying AVT levels after an initial shift from Baseline to Low AVT.

Different trends surfaced when we focused on the recall values for individual classes (as presented in Table 2). The model’s proficiency in identifying instances of KL 0 consistently strengthened with increasing AVT, leading to the highest recall across all classes and conditions at High AVT (91.39%). Conversely, KL 1 experienced a peak in recall at Low AVT (36.82%), followed by a decrease at High AVT (14.86%), lower than its Baseline value. KL 2 also benefited from higher AVT levels, as indicated by a higher recall at High AVT (75.39%), despite slight decreases at Low and Mid AVT. The KL 3 class maintained a relatively stable recall across all AVT levels, peaking slightly at Low AVT (78.92%). Finally, the recall for KL 4 initially dropped from Baseline to Low AVT but then held relatively steady for Mid and High AVT. These recall trends underline the complex interplay between AVT levels and KL grade-specific performance.

Table 2: Recall report for the external Chen model across classes, derived from the confusion matrix diagonals.
KL Label Baseline Low AVT Mid AVT High AVT
0 87.01% 80.13% 85.13% 91.39%
1 17.57% 36.82% 35.14% 14.86%
2 74.94% 66.22% 68.01% 75.39%
3 74.89% 78.92% 78.03% 77.58%
4 84.31% 74.51% 78.43% 78.43%

3.2 EfficientNetV2M Internal Model

This section focuses on a different neural network model, the EfficientNetV2M. As Table 3 shows, the dimensionality of the NAS input decreased dramatically as we moved from the Baseline to the High AVT level. Regarding model performance, Accuracy peaked at the Mid AVT level, reaching 65.48%, with Precision at that level achieving 67.18%. Precision then took an upward leap at High AVT, reaching 79.72%, the highest among all levels. In contrast, Recall and the F1-score exhibited a slight downward trend after the Baseline, with the lowest values observed at the High AVT level: 58.99% and 58.35%, respectively. Based on Accuracy and the F1-score, the model performance seemed optimal at Mid AVT rather than High AVT, suggesting that extreme dimensionality reduction might not necessarily lead to better overall performance for this set-up.

Table 3: Classification report for the EfficientNet model using the test data, with and without AVT intervention (baseline). ’Dimensionality’ denotes the vector space size pre-NAS.
Baseline Low AVT Mid AVT High AVT
Dimensionality 240 236 120 4
Accuracy 63.23% 64.27% 65.48% 65.08%
Precision 66.25% 65.15% 67.18% 79.72%
Recall 66.24% 63.57% 64.12% 58.99%
F1-score 65.36% 63.25% 64.15% 58.35%

When analyzing the confusion matrices (Figure 5), we observe an overall upward trend in the diagonal values, indicating an increase in correct predictions at higher AVT levels. Notably, KL 0 and KL 2 experienced consistent improvement in accuracy. The performance, however, varied for the other classes. While KL 1’s score declined post-Baseline and dropped to zero at High AVT, KL 3 showed a slight inconsistency, and KL 4 displayed a fluctuating trend, peaking at Baseline and Mid AVT but falling at High AVT. The off-diagonal elements reveal changes in KL confusion across different AVT levels. Mainly, confusion between KL 0 and KL 1 decreased with increasing AVT, while the confusion between KL 1 and KL 2 increased. The recall values in Table 4 further supplement these findings.

Figure 5: The confusion matrices for the EfficientNet model using the test data. The baseline indicates no intervention, whereas Low, Mid, and High represent AVT sets. Appendix Tables 5, 6, and 7 detail the NAS architectures discovered for each set. The X-axis shows predicted labels; the Y-axis shows true labels.
Table 4: Recall report for the EfficientNet model across classes, derived from the confusion matrix diagonals.
KL Label Baseline Low AVT Mid AVT High AVT
0 77.66% 87.68% 89.98% 92.90%
1 39.51% 17.56% 14.15% 0.00%
2 51.73% 56.65% 61.56% 62.72%
3 68.18% 64.77% 60.80% 68.75%
4 94.12% 91.18% 94.12% 70.59%

4 Discussion

This study evaluated a novel method for enhancing the performance of pre-existing models used for diagnosing Knee-Joint Osteoarthritis (KOA) using a combination of adaptive variance thresholding (AVT) and neural architecture search (NAS). This effort led to a notable increase in diagnostic accuracy, a substantial reduction in the NAS input vector space, and an overall improvement in model efficiency. More specifically, our approach improved model accuracy on the Chen KOA model, from 69.63% at the baseline to 71.14% at High AVT. This placed the resulting solution among the top three radiographic KOA classification solutions[25, 35]. We observed that this enhancement was exceptionally beneficial for KL 0 and KL 2, showing a consistent increase in recall as the AVT level increased. Our internal model, EfficientNetV2M, also benefited from the new method. Although the overall model performance peaked at Mid AVT, we observed a remarkable jump in Precision at High AVT, suggesting that extreme dimensionality reduction could still lead to specific performance improvements. However, it is worth noting that while the adaptive variance thresholding approach successfully enhanced overall model accuracy, the level of improvement varied among different AVT levels and ground-truth classes.

Furthermore, as evident from our experiments, the overall performance improvement on the external model was slightly superior to the internal model. The disparity may be attributed to the higher initial dimensionality of the external model, which allowed for less extensive pruning by adaptive variance thresholding. On the other hand, the internal model with lower initial dimensionality might have had fewer redundant features to begin with, hence a lesser margin for improvement. To gain further intuitive understanding, future work could explore the inverse process of identifying convolutional filter indices based on the selected features post-AVT. This process could be an intriguing method to visualize the feature maps primarily associated with the Adaptive Variance Thresholding (AVT) sets.

Neural Architecture Search (NAS) played an instrumental role in this study, helping optimize the architecture of the classifier models for our specific task. The application of AVT resulted in a 60-fold reduction in the NAS input vector space at high levels. This improved the model’s efficiency and expedited the inference speed and hyperparameter search. Despite these improvements, it is worth noting that we selected three threshold values placed at equal intervals. Although the prospect of an increased AVT sampling resolution could yield potentially superior results, it is essential to balance this with the considerably higher computational cost it would incur. In the Neural Architecture Search (NAS) process, we relied on validation accuracy as our primary evaluative metric. However, applying various evaluation metrics could potentially yield a diverse range of classifiers. Therefore, exploring potential meta-metrics which involve weighted combinations of various metrics may provide an edge in identifying architectures that are better tailored for specific tasks.

In our study, a significant constraint arose from the lack of label noise estimates for our dataset. This challenge is especially relevant for studies such as ours, focused on radiographic Knee Osteoarthritis (KOA). The early stages of KOA are usually shrouded in diagnostic uncertainty, thus making substantial label noise not just a mere possibility but a considerable expectation. Past research[23] corroborated this, revealing significant discrepancies in labeling among medical experts. Without an estimate for label noise, our ability to discern the maximum achievable performance on the task is impeded. Moreover, another level of complexity stems from the potential for extensive feature noise, which may result from poor-quality radiographs. It remains unclear how our approach would respond under such circumstances, adding an extra dimension to the limitations faced in the study.

An additional limitation in our study pertained to the resolution parameters employed. Our evaluation was centered around transfer learning architectures and a resolution size of 224 x 224. This raises a question of scale adaptability: it is uncertain how our methodology would perform when applied to features from larger receptive fields in the feature learning segment. The influence of resolution size on model performance, especially when dealing with higher-resolution inputs, remains a topic unexplored within the confines of this study. This constitutes another key limitation that future studies should seek to address.

While our study was rooted in the context of KOA classification, our methodology’s potential may extend beyond this specific domain. Given the inherent adaptability of AVT and NAS, our proposed paradigm has the potential to generalize to other areas of medicine or even entirely different fields that utilize deep transfer learning. However, such generalizations would necessitate extensive further research.

In conclusion, this study introduced a novel paradigm that enhanced the performance of existing deep-learning models for KOA classification. The proposed combination of AVT and NAS improved model accuracy and efficiency, and could potentially be applied to other models and diseases, presenting a promising avenue for future automatic diagnostic research.

Data Availability

Trained models and the AVT code are available in the Google Drive repository.

References

  • [1] F. Wang, L. P. Casalino, D. Khullar, Deep learning in medicine—promise, progress, and challenges, JAMA internal medicine 179 (3) (2019) 293–294.
  • [2] A. L. Beam, I. S. Kohane, Big data and machine learning in health care, Jama 319 (13) (2018) 1317–1318.
  • [3] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, nature 521 (7553) (2015) 436–444.
  • [4] J. N. Kather, J. Krisam, P. Charoentong, T. Luedde, E. Herpel, C.-A. Weis, T. Gaiser, A. Marx, N. A. Valous, D. Ferber, others, Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study, PLoS medicine 16 (1) (2019) e1002730.
  • [5] P. Courtiol, C. Maussion, M. Moarii, E. Pronier, S. Pilcer, M. Sefta, P. Manceron, S. Toldo, M. Zaslavskiy, N. Le Stang, others, Deep learning-based classification of mesothelioma improves prediction of patient outcome, Nature medicine 25 (10) (2019) 1519–1525.
  • [6] A. Diamant, A. Chatterjee, M. Vallières, G. Shenouda, J. Seuntjens, Deep learning in head & neck cancer outcome prediction, Scientific reports 9 (1) (2019) 1–10.
  • [7] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, S. Thrun, Dermatologist-level classification of skin cancer with deep neural networks, nature 542 (7639) (2017) 115–118.
  • [8] Z. Han, B. Wei, Y. Zheng, Y. Yin, K. Li, S. Li, Breast cancer multi-classification from histopathological images with structured deep learning model, Scientific reports 7 (1) (2017) 1–10.
  • [9] M. Bakator, D. Radosav, Deep learning and medical diagnosis: A review of literature, Multimodal Technologies and Interaction 2 (3) (2018) 47.
  • [10] F. Prezja, S. Äyrämö, I. Pölönen, T. Ojala, S. Lahtinen, P. Ruusuvuori, T. Kuopio, Improved accuracy in colorectal cancer tissue decomposition through refinement of established deep learning solutions, Scientific Reports 13 (1) (2023) 15879.
  • [11] F. Prezja, L. Annala, S. Kiiskinen, S. Lahtinen, T. Ojala, P. Ruusuvuori, T. Kuopio, Improving performance in colorectal cancer histology decomposition using deep and ensemble machine learning, arXiv preprint arXiv:2310.16954 (2023).
  • [12] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, K. H. Maier-Hein, nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation, Nature methods 18 (2) (2021) 203–211.
  • [13] X. Liu, L. Song, S. Liu, Y. Zhang, A review of deep-learning-based medical image segmentation methods, Sustainability 13 (3) (2021) 1224.
  • [14] M. J. M. Chuquicusma, S. Hussein, J. Burt, U. Bagci, How to fool radiologists with generative adversarial networks? A visual turing test for lung cancer diagnosis, in: 2018 IEEE 15th international symposium on biomedical imaging (ISBI 2018), IEEE, 2018, pp. 240–244.
  • [15] F. Calimeri, A. Marzullo, C. Stamile, G. Terracina, Biomedical data augmentation using generative adversarial neural networks, in: International conference on artificial neural networks, Springer, 2017, pp. 626–634.
  • [16] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, H. Greenspan, GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification, Neurocomputing 321 (2018) 321–331.
  • [17] V. Thambawita, J. L. Isaksen, S. A. Hicks, J. Ghouse, G. Ahlberg, A. Linneberg, N. Grarup, C. Ellervik, M. S. Olesen, T. Hansen, others, DeepFake electrocardiograms using generative adversarial networks are the beginning of the end for privacy issues in medicine, Scientific reports 11 (1) (2021) 1–8.
  • [18] L. Annala, N. Neittaanmäki, J. Paoli, O. Zaar, I. Pölönen, Generating hyperspectral skin cancer imagery using generative adversarial neural network, in: 2020 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), IEEE, 2020, pp. 1600–1603.
  • [19] H.-C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L. Senjem, J. L. Gunter, K. P. Andriole, M. Michalski, Medical image synthesis for data augmentation and anonymization using generative adversarial networks, in: International workshop on simulation and synthesis in medical imaging, Springer, 2018, pp. 1–11.
  • [20] J. Yoon, L. N. Drumright, M. Van Der Schaar, Anonymization through data synthesis using generative adversarial networks (ads-gan), IEEE journal of biomedical and health informatics 24 (8) (2020) 2378–2388.
  • [21] A. Torfi, E. A. Fox, C. K. Reddy, Differentially private synthetic medical data generation using convolutional gans, Information Sciences 586 (2022) 485–500.
  • [22] S. N. Kasthurirathne, G. Dexter, S. J. Grannis, Generative Adversarial Networks for Creating Synthetic Free-Text Medical Data: A Proposal for Collaborative Research and Re-use of Machine Learning Models, in: AMIA Annual Symposium Proceedings, Vol. 2021, American Medical Informatics Association, 2021, p. 335.
  • [23] F. Prezja, J. Paloneva, I. Pölönen, E. Niinimäki, S. Äyrämö, DeepFake knee osteoarthritis X-rays from generative adversarial neural networks deceive medical experts and offer augmentation potential to automatic classification, Scientific Reports 12 (1) (2022) 1–16. doi:10.1038/s41598-022-23081-4.
  • [24] F. Prezja, J. Paloneva, I. Pölönen, E. Niinimäki, S. Äyrämö, Synthetic (DeepFake) Knee Osteoarthritis X-ray Images from Generative Adversarial Neural Networks, Mendeley Data, V3 (2022). doi:10.17632/fyybnjkw7v.3.
  • [25] P. S. Q. Yeoh, K. W. Lai, S. L. Goh, K. Hasikin, Y. C. Hum, Y. K. Tee, S. Dhanalakshmi, Emergence of deep learning in knee osteoarthritis diagnosis, Computational intelligence and neuroscience 2021 (2021).
  • [26] S. Saarakkala, P. Julkunen, P. Kiviranta, J. Mäkitalo, J. S. Jurvelin, R. K. Korhonen, Depth-wise progression of osteoarthritis in human articular cartilage: investigation of composition, structure and biomechanics, Osteoarthritis and Cartilage 18 (1) (2010) 73–81.
  • [27] M. S. Laasanen, J. Töyräs, R. K. Korhonen, J. Rieppo, S. Saarakkala, M. T. Nieminen, J. Hirvonen, J. S. Jurvelin, Biomechanical properties of knee articular cartilage, Biorheology 40 (1, 2, 3) (2003) 133–140.
  • [28] J. Hermans, M. A. Koopmanschap, S. M. A. Bierma-Zeinstra, J. H. van Linge, J. A. N. Verhaar, M. Reijman, A. Burdorf, Productivity costs and medical costs among working patients with knee osteoarthritis, Arthritis care & research 64 (6) (2012) 853–861.
  • [29] D. J. Hunter, S. Bierma-Zeinstra, Osteoarthritis, The Lancet 393 (10182) (2019) 1745–1759. doi:https://doi.org/10.1016/S0140-6736(19)30417-9.
    URL https://www.sciencedirect.com/science/article/pii/S0140673619304179
  • [30] A. Tiulpin, S. Klein, S. M. A. Bierma-Zeinstra, J. Thevenot, E. Rahtu, J. v. Meurs, E. H. G. Oei, S. Saarakkala, Multimodal machine learning-based knee osteoarthritis progression prediction from plain radiographs and clinical data, Scientific reports 9 (1) (2019) 20038.
  • [31] A. Tiulpin, J. Thevenot, E. Rahtu, P. Lehenkari, S. Saarakkala, Automatic knee osteoarthritis diagnosis from plain radiographs: a deep learning-based approach, Scientific reports 8 (1) (2018) 1727.
  • [32] A. Tiulpin, S. Saarakkala, Automatic grading of individual knee osteoarthritis features in plain radiographs using deep convolutional neural networks, Diagnostics 10 (11) (2020) 932.
  • [33] Centers for Disease Control and Prevention, others, HIPAA privacy rule and public health. Guidance from CDC and the US Department of Health and Human Services, MMWR: Morbidity and mortality weekly report 52 (Suppl 1) (2003) 1–17.
  • [34] P. Voigt, A. von dem Bussche, The EU General Data Protection Regulation (GDPR), A Practical Guide, 1st Ed., Cham: Springer International Publishing 10 (3152676) (2017) 10–5555.
  • [35] B. Zhang, J. Tan, K. Cho, G. Chang, C. M. Deniz, Attention-based cnn for kl grade classification: data from the osteoarthritis initiative, in: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI), IEEE, 2020, pp. 731–735.
  • [36] S. Khalid, T. Khalil, S. Nasreen, A survey of feature selection and feature extraction techniques in machine learning, in: 2014 science and information conference, IEEE, 2014, pp. 372–378.
  • [37] T. Elsken, J. H. Metzen, F. Hutter, Neural architecture search: A survey, The Journal of Machine Learning Research 20 (1) (2019) 1997–2017.
  • [38] H. Jin, F. Chollet, Q. Song, X. Hu, AutoKeras: An AutoML Library for Deep Learning, Journal of Machine Learning Research 24 (6) (2023) 1–6.
    URL http://jmlr.org/papers/v24/20-1355.html
  • [39] P. Chen, L. Gao, X. Shi, K. Allen, L. Yang, Fully automatic knee osteoarthritis severity grading using deep neural networks with a novel ordinal loss, Computerized Medical Imaging and Graphics 75 (2019) 84–92.
  • [40] P. Chen, Knee osteoarthritis severity grading dataset, Mendeley Data 1 (2018) 21–23.
  • [41] J. H. Kellgren, J. Lawrence, Radiological assessment of osteo-arthrosis, Annals of the rheumatic diseases 16 (4) (1957) 494.
  • [42] Y. LeCun, Y. Bengio, others, Convolutional networks for images, speech, and time series, The handbook of brain theory and neural networks 3361 (10) (1995) 1995.
  • [43] V. Nair, G. E. Hinton, Rectified linear units improve Restricted Boltzmann machines, in: ICML 2010 - Proceedings, 27th International Conference on Machine Learning, 2010, pp. 807–814.
  • [44] M. Tan, Q. Le, Efficientnet: Rethinking model scaling for convolutional neural networks, in: International conference on machine learning, PMLR, 2019, pp. 6105–6114.
  • [45] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [46] M. Tan, Q. Le, Efficientnetv2: Smaller models and faster training, in: International conference on machine learning, PMLR, 2021, pp. 10096–10106.
  • [47] F. Prezja, Deep Fast Vision: Accelerated Deep Transfer Learning Vision Prototyping and Beyond, https://github.com/fabprezja/deep-fast-vision (4 2023). doi:10.5281/zenodo.7865289.
    URL https://doi.org/10.5281/zenodo.7865289
  • [48] Scikit-Learn Developers, Code implementation of VarianceThreshold in Scikit-learn, https://github.com/scikit-learn/scikit-learn/blob/7f9bad99d/sklearn/feature_selection/_variance_threshold.py#L14 (2023).
  • [49] M. Kuhn, K. Johnson, others, Applied predictive modeling, Vol. 26, Springer, 2013.
  • [50] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
  • [51] C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, T. E. Oliphant, Array programming with NumPy, Nature 585 (7825) (2020) 357–362. doi:10.1038/s41586-020-2649-2.
    URL https://doi.org/10.1038/s41586-020-2649-2

Acknowledgements

The authors extend their sincere gratitude to Kimmo Riihiaho, Rodion Enkel and Leevi Lind for their exceptional support and invaluable discussions.

Author contributions statement

Conceptualization: F. P.; Methodology: F. P.; Investigation: F. P.; Data curation: All authors; Formal analysis: All authors; Writing – original draft: F. P.; Writing – review & editing: All authors.

Additional information

Competing interests All authors declare that they have no conflicts of interest.

Appendix

EfficientNetV2M NAS

Table 5: NAS-selected architecture for the Low AVT Set of EfficientNet.
Layer Type Output Shape Parameter Count
Input Layer 236 -
Multi Category Encoding 236 -
Normalization 236 473
Dense Layer 32 7584
ReLU Activation 32 -
Dense Layer 32 1056
ReLU Activation 32 -
Dropout 32 -
Dense Layer 5 165
Softmax Activation 5 -
Table 6: NAS-selected architecture for the Mid AVT Set of EfficientNet.
Layer Type Output Shape Parameter Count
Input Layer 120 -
Multi Category Encoding 120 -
Dense Layer 16 1936
ReLU Activation 16 -
Dense Layer 32 544
ReLU Activation 32 -
Dense Layer 5 165
Softmax Activation 5 -
Table 7: NAS-selected architecture for the High AVT Set of EfficientNet.
Layer Type Output Shape Parameter Count
Input Layer 4 -
Multi Category Encoding 4 -
Normalization 4 9
Dense Layer 32 160
ReLU Activation 32 -
Dense Layer 32 1056
ReLU Activation 32 -
Dense Layer 5 165
Softmax Activation 5 -

4.1 Chen VGG19 Ordinal NAS

Table 8: NAS-selected architecture for the Low AVT Set of the external Chen VGG19 Ordinal.
Layer Type Output Shape Parameter Count
Input Layer 4034 -
Multi Category Encoding 4034 -
Normalization 4034 8069
Dense Layer 32 129120
Batch Normalization 32 128
ReLU Activation 32 -
Dense Layer 32 1056
Batch Normalization 32 128
ReLU Activation 32 -
Dense Layer 5 165
Softmax Activation 5 -
Table 9: NAS-selected architecture for the Mid AVT Set of the external Chen VGG19 Ordinal.
Layer Type Output Shape Parameter Count
Input Layer 2048 -
Multi Category Encoding 2048 -
Normalization 2048 4097
Dense Layer 32 65568
Batch Normalization 32 128
ReLU Activation 32 -
Dense Layer 1024 33792
Batch Normalization 1024 4096
ReLU Activation 1024 -
Dense Layer 128 131200
Batch Normalization 128 512
ReLU Activation 128 -
Dropout 128 -
Dense Layer 5 645
Softmax Activation 5 -
Table 10: NAS-selected architecture for the High AVT Set of the external Chen VGG19 Ordinal.
Layer Type Output Shape Parameter Count
Input Layer 62 -
Multi Category Encoding 62 -
Dense Layer 128 8064
ReLU Activation 128 -
Dropout 128 -
Dense Layer 16 2064
ReLU Activation 16 -
Dropout 16 -
Dense Layer 32 544
ReLU Activation 32 -
Dropout 32 -
Dense Layer 5 165
Softmax Activation 5 -