

HoneyBee: A Scalable Modular Framework for Creating Multimodal Oncology Datasets

Aakash Tripathi
Department of Machine Learning
Moffitt Cancer Center
Tampa, FL, 33620
[email protected]

Asim Waqas
Department of Machine Learning
Moffitt Cancer Center
Tampa, FL, 33620
[email protected]

Matthew B. Schabath
Departments of Cancer Epidemiology and Thoracic Oncology
Moffitt Cancer Center & Research Institute
[email protected]

Yasin Yilmaz
Department of Electrical Engineering
University of South Florida
Tampa, FL, 33620
[email protected]

Ghulam Rasool
Department of Machine Learning
Moffitt Cancer Center
Tampa, FL, 33620
[email protected]

Corresponding Author. Also part of the Department of Electrical Engineering, University of South Florida, Tampa, FL, 33620.
Abstract

Developing accurate machine learning models for oncology requires large-scale, high-quality multimodal datasets. However, creating such datasets remains challenging due to the complexity and heterogeneity of medical data. To address this challenge, we introduce HoneyBee, a scalable modular framework for building multimodal oncology datasets that leverages foundation models to generate representative embeddings. HoneyBee integrates various data modalities, including clinical diagnostic and pathology imaging data, medical notes, reports, records, and molecular data. It employs data preprocessing techniques and foundation models to generate embeddings that capture the essential features and relationships within the raw medical data. The generated embeddings are stored in a structured format using Hugging Face datasets and PyTorch dataloaders for accessibility. Vector databases enable efficient querying and retrieval for machine learning applications. We demonstrate the effectiveness of HoneyBee through experiments assessing the quality and representativeness of these embeddings. The framework is designed to be extensible to other medical domains and aims to accelerate oncology research by providing high-quality, machine learning-ready datasets. HoneyBee is an ongoing open-source effort, and the code, datasets, and models are available at the project repository.

1 Introduction

The availability of large-scale public datasets has been a critical factor in advancing machine learning (ML) techniques, particularly in the development of foundation models [5, 13]. Datasets such as ImageNet [8], which contains over 14 million images across 20,000 categories, and COCO [20], which includes 330,000 images with over 200,000 labeled instances, have facilitated the training of deep learning models that achieve state-of-the-art (SOTA) performance on tasks such as object detection, semantic segmentation, and image captioning. In the context of oncology, multimodal medical datasets are likewise essential for developing ML models that can effectively capture the complex and heterogeneous nature of cancer. Medical data typically includes clinical records, imaging data (e.g., histopathology slides, radiology scans), molecular data (e.g., genomics, proteomics), and patient-reported outcomes. The integration of these diverse data modalities enables the identification of patterns and relationships that may not be apparent when analyzing individual modalities in isolation [35, 33, 4].

Multimodal datasets can support the development of ML models for various oncology applications, such as cancer screening, diagnosis, prognosis prediction, treatment response assessment, and post-treatment surveillance. Despite this need, there is a lack of large-scale, high-quality multimodal medical datasets for oncology research. The raw medical data collected from patients during studies are often distributed across multiple institutions and repositories, such as the Cancer Research Data Commons (CRDC) [15], Genomic Data Commons (GDC) [14], Proteomic Data Commons (PDC) [30], and Imaging Data Commons (IDC) [12], each with its own data formats, access protocols, and privacy constraints [32]. Integrating data from these disparate sources requires significant manual effort in terms of data harmonization, quality control, and metadata management. Furthermore, each data modality needs specialized preprocessing and feature engineering techniques to extract meaningful representations for ML tasks. For instance, histopathology slides may require tissue segmentation, color normalization, and tile extraction [21], while genomic data may involve quality control, normalization, and dimensionality reduction [34]. Additionally, clinical records often contain unstructured text data that needs to be processed using natural language processing (NLP) techniques [26]. These preprocessing steps are time-consuming, computationally expensive, and require domain-specific expertise. The scale and quality of existing publicly available multimodal oncology datasets vary considerably. Some datasets, such as The Cancer Genome Atlas (TCGA) [37], offer comprehensive multimodal data for multiple cancer types, including clinical records, imaging data, and molecular profiles, collected during large-scale studies. However, other datasets, like Med-MNIST [41], are more limited in scope, focusing on specific data modalities or providing smaller sample sizes, which may not be sufficient for developing robust ML models. Hence, there is a need for a framework that can efficiently aggregate and preprocess scattered raw public medical datasets to generate large-scale ML-ready feature representations that can be easily integrated into downstream ML pipelines.

To address the lack of critical large-scale multimodal medical datasets, we introduce HoneyBee, a modular and scalable framework for building ML-ready multimodal oncology datasets using open-source foundation models. The key objectives of HoneyBee are as follows:

  1. Develop a set of standardized, modality-specific preprocessing pipelines for clinical records, imaging data, genomic information, and patient outcomes, ensuring consistency and reproducibility across datasets.

  2. Capture complex patterns and relationships within and across data modalities by generating feature-rich embedding vectors from raw medical data using pre-trained foundation models.

  3. Conduct a comprehensive evaluation of the HoneyBee framework using large-scale oncology datasets by assessing performance in real-world applications.

The remainder of this paper is organized as follows. Section 2 reviews related work on multimodal oncology datasets, foundation models, and ML-ready datasets. Section 3 describes the HoneyBee framework in detail, including the data acquisition and integration process, embedding generation using the foundation models, data storage, and accessibility components. Section 4 presents the open and processed public datasets generated using the HoneyBee framework, focusing on the TCGA data source as a case study (the results shown here are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga/). Section 5 demonstrates the utility of HoneyBee datasets through use cases and experimental evaluations. Finally, we discuss the limitations, challenges, and future directions of the HoneyBee framework in Section 6.

2 Related Work

2.1 Challenges and Limitations of Current Public Datasets

Public datasets have been instrumental in advancing ML research, particularly in medical domains. However, these datasets, such as those integrated by the Multimodal Integration of Oncology Data System (MINDS) framework [32], often require extensive preprocessing to be ML-ready. MINDS consolidates data from various sources, such as TCGA [37] and TARGET [36], providing efficient query capabilities and facilitating interactive data exploration. Despite these capabilities, the datasets still face significant challenges, including inconsistent formats, missing data, and the need for domain-specific preprocessing techniques. Existing ML-ready datasets like Med-MNIST [41] and those available on Hugging Face Datasets [19, 7, 17, 23] have limited sample sizes and narrow focus on specific modalities. The need for preprocessing, harmonization, and normalization across different data sources remains a critical barrier to the effective use of these datasets in machine learning applications. Additional related work is provided in Appendix A.3.

2.2 Foundation Models in Medical AI

Foundation models are pre-trained on extensive datasets and can be adapted to various medical tasks [27]. They integrate multiple data types, enhancing diagnostics, personalized treatment, and predictive analytics [43, 16, 44, 39, 25, 28, 31]. Models like UNI [6] and REMEDIS [3] have shown effectiveness in pathology and other medical fields. However, the development of truly multimodal foundation models that utilize raw medical data from various sources, such as whole slide images (WSI), molecular sequences, radiology scans, and electronic health records (EHR), is limited [2]. This limitation is primarily due to the lack of comprehensive multimodal datasets that integrate these diverse data types.

3 The HoneyBee Framework

The HoneyBee framework consists of three main components: data acquisition and integration, embedding generation, and data storage and accessibility, as illustrated in Figure 1. We discuss these components in detail below.

Figure 1: The HoneyBee workflow starts with accessing the public datasets, including data aggregation, cataloging, and metadata tagging using scripts and libraries from MINDS. The curated cohort dataset is then processed through modality-specific pipelines and pre-trained models in the HoneyBee framework. The resulting embeddings, as well as the raw data, are made publicly accessible to support the development of ML applications.

3.1 Data Acquisition and Integration

HoneyBee extends the data integration capabilities of the MINDS framework [32] by incorporating additional preprocessing steps to ensure data quality and compatibility. The data in MINDS is collected from various sources, including public repositories, clinical institutions, and research collaborations. Key data modalities incorporated in HoneyBee datasets include:

Text: Text data includes structured and unstructured medical reports, such as pathology reports, radiology reports, and clinical notes, generally stored in electronic health records (EHRs). These datasets provide valuable information about patient diagnoses, treatments, and outcomes. HoneyBee employs NLP techniques to extract relevant features and normalize the text data for downstream ML. The collected data undergoes integration steps to harmonize heterogeneous datasets and align data elements across modalities. Metadata alignment consolidates metadata from various text sources to ensure consistency by mapping different terminologies to a common ontology and standardizing data types. Data cleaning handles missing values, outliers, and inconsistencies, which may be imputed using statistical methods or domain-specific rules. Harmonization adheres to standards like Fast Healthcare Interoperability Resources (FHIR) [1] for EHRs to ensure interoperability and facilitate data exchange, including standardizing data formats and protocols to convert data from different formats to a common standard.

Imaging: Imaging data may include (1) digitized histopathology whole slide images (WSIs) and (2) radiology scans acquired using computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET), ultrasound, X-ray, etc. These imaging modalities capture detailed visual information about tumor morphology, size, location, and progression. HoneyBee utilizes SOTA computer vision models to process and analyze these high-dimensional imaging data. The collected data undergoes integration steps to harmonize heterogeneous datasets and align data elements across modalities. Metadata alignment ensures consistency in metadata by mapping terminologies to a common ontology and standardizing data types. Data cleaning addresses missing values, outliers, and inconsistencies through statistical imputation or domain-specific rules. Normalization standardizes pixel values across different imaging sources using the acquisition protocols and recording information available in the metadata of the radiology storage file format, Digital Imaging and Communications in Medicine (DICOM). Harmonization utilizes DICOM standards to ensure comparability and interoperability across imaging datasets, including standardizing data formats and protocols. For WSIs, additional preprocessing steps, such as stain normalization, patching, and background removal, are performed to ensure data quality and compatibility.
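As an illustration of the DICOM-driven normalization step, the following hedged sketch (assuming the pydicom package and a single CT slice; the file name and clipping window are illustrative choices, not HoneyBee defaults) converts stored pixel values to Hounsfield units via the RescaleSlope and RescaleIntercept tags and rescales them to [0, 1].

```python
import numpy as np
import pydicom

ds = pydicom.dcmread("example_ct_slice.dcm")             # assumed local CT slice
pixels = ds.pixel_array.astype(np.float32)               # raw stored values

slope = float(getattr(ds, "RescaleSlope", 1.0))          # DICOM rescale metadata
intercept = float(getattr(ds, "RescaleIntercept", 0.0))
hu = pixels * slope + intercept                          # Hounsfield units

hu = np.clip(hu, -1000.0, 400.0)                         # illustrative intensity window
normalized = (hu - hu.min()) / (hu.max() - hu.min() + 1e-8)  # scale to [0, 1]
```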

Molecular: Molecular data types provide insights into the underlying biological mechanisms of cancer at the genomic, transcriptomic, and proteomic levels. HoneyBee incorporates specialized bioinformatics pipelines to preprocess and integrate these complex molecular data. The collected data undergoes integration steps to harmonize heterogeneous datasets and align data elements across modalities. Metadata alignment consolidates different terminologies into a common ontology and standardizes data types. Data cleaning handles missing values, outliers, and inconsistencies through statistical imputation or domain-specific rules. Normalization employs methods to standardize numerical features, ensuring consistent data representation. Harmonization ensures interoperability and data exchange by adhering to established standards and addressing inconsistencies within and between datasets.
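To make the molecular cleaning steps concrete, the sketch below (assuming pandas and scikit-learn; the file name, correlation cutoff, and KNN imputer are illustrative stand-ins for the matrix-factorization and deep-learning imputation methods used in practice) deduplicates samples, drops zero-variance and highly correlated features, and imputes the remaining gaps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# samples x features matrix of expression values (assumed layout and file name)
expr = pd.read_csv("gene_expression.csv", index_col=0)

expr = expr.loc[~expr.index.duplicated(keep="first")]    # deduplicate samples
expr = expr.loc[:, expr.var(skipna=True) > 0.0]          # drop zero-variance features

# drop one of each pair of highly correlated features (illustrative 0.95 cutoff)
corr = expr.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
expr = expr.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

# impute remaining missing values (KNN as a simple stand-in for the methods above)
imputed = pd.DataFrame(KNNImputer(n_neighbors=5).fit_transform(expr),
                       index=expr.index, columns=expr.columns)
```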

3.2 Embedding Generation

HoneyBee utilizes foundation models to generate embeddings from raw medical data. The embedding generation process involves several key steps:

3.2.1 Selection of Foundation Models

The HoneyBee framework includes foundation models tailored for different medical modalities:

  • For radiology scans (CT, MRI, PET), the REMEDIS model, trained on large-scale medical imaging datasets, generates embeddings that capture spatial features and structural information.

  • For WSIs from digitized histopathology specimens, the TissueDetector model detects and segments regions of interest, while the UNI model, a ViT-based encoder pre-trained on the Mass-100K dataset, generates embeddings that capture visual features and patterns [6].

  • For textual data (medical reports, clinical notes), language models from the Hugging Face library are used.

  • For molecular data (gene expression, DNA methylation, protein expression, mutations, miRNA), the SeNMo model, a deep learning framework, integrates various data types and generates unified latent embeddings [34].

3.2.2 Preprocessing of Raw Medical Data

Preprocessing in HoneyBee ensures the compatibility of raw imaging, textual, and molecular data with the foundation models. For imaging data, preprocessing involves resizing images to fit the input requirements of the models while preserving the aspect ratio and minimizing information loss, typically using adaptive techniques. Normalization scales pixel values to a standard range, ensuring uniformity across datasets. Data augmentation techniques such as rotation, flipping, and cropping are applied to enhance the diversity of the dataset, improving model robustness. Tissue detection models are employed to identify and extract regions of interest from WSIs by reading regions at various resolutions and processing them with ML models designed for tissue detection.

For textual data, preparation begins with tokenization, converting text into a sequence of tokens suitable for model processing; this step must handle complex medical terminology to ensure effective representation learning. Optical Character Recognition (OCR) is used to extract text from images of documents, including scanned pathology reports. The OCR process includes steps such as grayscaling, resizing, and thresholding to enhance accuracy. The extracted text undergoes cleaning to remove noise and irrelevant characters, including extra whitespace, newline characters, and non-alphanumeric symbols. Long documents are split into manageable chunks using techniques that control chunk size and overlap.

For molecular data, several steps prepare the data for analysis. Features with no variance are removed, as they do not contribute to the model’s discriminative capabilities. Redundant data points are identified and eliminated through deduplication algorithms to maintain data integrity. Highly correlated features are dropped to avoid multicollinearity, which could negatively impact model stability and interpretability. Missing data points are then imputed using robust techniques, such as matrix factorization and deep learning-based methods, ensuring that the imputed values are consistent with the underlying data distribution. Additional details on all raw data processing are provided in Appendix A.4.
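As a concrete example of the WSI preparation described above, the following sketch (assuming the openslide-python package and a local .svs file; the 512×512 tile size follows Figure 2, and the simple intensity-based background filter is an illustrative stand-in for the tissue-detection model) extracts non-background tiles from a slide.

```python
import numpy as np
import openslide

TILE = 512                                               # tile size from Figure 2
slide = openslide.OpenSlide("example_slide.svs")         # assumed local WSI
width, height = slide.dimensions

tiles = []
for y in range(0, height - TILE + 1, TILE):
    for x in range(0, width - TILE + 1, TILE):
        region = slide.read_region((x, y), 0, (TILE, TILE)).convert("RGB")
        rgb = np.asarray(region)
        # keep tiles where at least half of the pixels are non-white (rough tissue check)
        if (rgb.mean(axis=-1) < 220).mean() > 0.5:
            tiles.append(rgb)
```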

3.2.3 Generation of Embeddings

Each preprocessed data sample is passed through the foundation model, which produces a fixed-length embedding vector. For example, the embedding dimensions can range from 48 to 2048, depending on the model architecture. HoneyBee utilizes GPU acceleration and distributed computing, when available, to efficiently generate embeddings for large-scale datasets. The generated embeddings, along with associated metadata, are stored in a structured format to facilitate downstream tasks such as similarity search, clustering, and ML model training. HoneyBee employs efficient data compression and indexing techniques to optimize the storage and retrieval of high-dimensional embeddings.
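A minimal sketch of this embedding step for clinical text is shown below, assuming a Hugging Face encoder (bert-base-uncased is used here as a stand-in for GatorTron-medium) with mean pooling over the last hidden states; the batch size and maximum sequence length are illustrative choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "bert-base-uncased"                 # stand-in checkpoint; GatorTron-medium would be swapped in
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).to(device).eval()

def embed(texts, batch_size=8):
    """Return one mean-pooled embedding vector per input text."""
    vectors = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                              max_length=512, return_tensors="pt").to(device)
            hidden = model(**batch).last_hidden_state            # [B, T, H]
            mask = batch["attention_mask"].unsqueeze(-1)          # [B, T, 1]
            vectors.append(((hidden * mask).sum(1) / mask.sum(1)).cpu())
    return torch.cat(vectors)

embeddings = embed(["Patient presents with ...", "Pathology report: ..."])
```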

Figure 2: Pathology Workflow: WSIs are processed into 512×512 pixel tiles, followed by tissue detection and embedding generation. Radiology Imaging Workflow: CT, MRI, and PET scans undergo normalization, standardization, slicing, and embedding generation. Molecular Data Workflow: DNA methylation, gene expression, protein expression, DNA mutation, and miRNA expression data are preprocessed (removing NaNs and low-expression genes), and embeddings are generated. Clinical Data Workflow: Text from EHRs and reports is extracted, processed, and embeddings are generated. All processed data is integrated into a multimodal embedding dataset and optionally stored in vector databases.

3.3 Data Storage and Accessibility

The generated embeddings and tabular data are stored using the Hugging Face datasets library, which provides a standardized interface for data access and processing [19]. The datasets are organized into a structured format, containing the embeddings, metadata, and labels (if available). PyTorch DataLoaders are employed to efficiently load and iterate over the datasets during model training and evaluation, handling tasks such as batching, shuffling, and parallel processing. Additionally, HoneyBee datasets can be integrated into vector databases such as Faiss and Annoy [11, 38] to enable fast similarity search, nearest neighbor retrieval, and clustering on the high-dimensional embedding vectors. These databases are optimized for efficient querying and retrieval of embeddings based on similarity metrics such as Euclidean distance or cosine similarity [40]. By deploying vector databases, researchers can quickly identify similar samples, perform data exploration, and retrieve relevant subsets based on embedding similarity, facilitating various downstream tasks such as retrieval augmented generation (RAG) [42]. The structured storage and accessibility components of HoneyBee ensure that the generated embeddings and associated data are readily available for researchers and practitioners to use in their ML pipelines and downstream applications.
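The following hedged sketch illustrates how generated embeddings could be indexed and queried with Faiss for cosine similarity search; the embedding matrix, dimensionality, and number of neighbors are placeholder assumptions rather than HoneyBee defaults.

```python
import faiss
import numpy as np

embeddings = np.random.rand(10_000, 1024).astype("float32")   # placeholder embedding matrix
faiss.normalize_L2(embeddings)                                 # cosine similarity via inner product

index = faiss.IndexFlatIP(1024)                                # exact inner-product index
index.add(embeddings)

query = embeddings[:1]                                          # use the first sample as a query
scores, neighbors = index.search(query, 5)                      # indices of the 5 most similar samples
```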

4 Datasets

TCGA has molecularly characterized over 11,000 primary cancer patients and matched normal samples spanning 33 cancer types. To demonstrate the utility of the HoneyBee framework, we have created a public multimodal dataset for oncology by extracting and processing data from TCGA using MINDS [37, 32]. The dataset makes up 25.60% of all publicly available data in MINDS and includes modalities ranging from clinical data (EHR and pathology reports) and pathology images (WSIs of tumor and diagnostic samples) to radiology images (CT, MRI, and PET) and molecular data (gene expression, DNA methylation, somatic mutations, protein expression, and miRNA expression). We used models pre-trained on large-scale datasets to generate embeddings for each data modality. The dataset is organized by cancer study and data modality, with each subset containing the generated embeddings, metadata, and relevant labels (e.g., survival outcomes, tumor stage). Researchers and practitioners can integrate the TCGA embedding dataset into their ML pipelines using the Hugging Face datasets library and PyTorch DataLoaders. In addition to the embedding datasets, we also provide the original raw data files, accessed through the MINDS platform, and the associated code used for data extraction, preprocessing, and embedding generation.

To generate embeddings for various data modalities, we evaluated and selected specific models based on their performance and suitability for the task. For textual data such as clinical notes and pathology reports, we considered GatorTron-medium and Clinical T5 [43, 22]. Our experiments showed that GatorTron-medium, a 3.9 billion parameter model trained on biomedical corpora, consistently performed better in regression tasks like patient age prediction, achieving a lower overall loss. Thus, we chose GatorTron-medium for generating embeddings from textual data [43]. For pathology images, we evaluated models including UNI and REMEDIS. We found that the smaller embeddings generated by the UNI model (a 1024-dimensional vector per patch) were more effective for tasks like retrieval-augmented generation compared to the multidimensional embedding matrix generated by REMEDIS (patches × 7 × 7 × 2048). Consequently, UNI was selected as our preferred model for pathology image embeddings. For radiology images, we initially used REMEDIS, which is trained on over 4.5 million medical images from modalities such as CT, MRI, and PET scans [3]. However, after experimenting with RadImageNet, we identified it as a promising alternative [24]. While REMEDIS is currently used to generate radiology embeddings, we plan to include RadImageNet embeddings in the publicly available dataset to facilitate comparison across different tasks. Finally, for molecular data, we used the SeNMo model [34], which is designed to learn latent embeddings from multiple molecular data types. SeNMo effectively captures meaningful representations from multi-omics data, supporting downstream tasks such as patient stratification and molecular subtype identification. Table 1 summarizes the models, embedding shapes, and the number of samples for each modality.

Table 1: Overview of models used for different data modalities in HoneyBee. The table lists the types of data (Modality), the corresponding machine learning models utilized (Model), the shape of the embeddings generated by these models (Embedding Shape), and the number of samples for each modality (Sample Count).
Modality Model Embedding Shape Sample Count
Clinical Data GatorTron-medium [43] [1024] 11,428
Pathology Reports GatorTron-medium [43] [1024] 11,208
Pathology Images UNI [6] [patches, 1024] 30,075
Radiology Images REMEDIS [3] [patches, 7, 7, 2048] 11,869
Molecular Data SeNMo [34] [48] 13,804

We preprocess the raw data for each modality, feed it into the selected model at inference time, and generate a fixed-length embedding vector per sample, per modality. The embeddings are stored with associated metadata using the Hugging Face datasets library. To facilitate easy access and utilization of the generated embeddings, we have made the processed TCGA dataset, comprising patient embeddings, publicly available on the Hugging Face platform (https://huggingface.co/datasets/Lab-Rasool/TCGA) under an open Creative Commons Attribution Non Commercial No Derivatives 4.0 license. Additional information about the public data is provided in Appendix A.2.
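A minimal access sketch is shown below, using the Hugging Face datasets library and a PyTorch DataLoader; the configuration name ("clinical") and column names ("embedding", "project_id") are assumptions about the repository layout rather than documented identifiers.

```python
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader

# configuration and column names are assumptions about the repository layout
ds = load_dataset("Lab-Rasool/TCGA", "clinical", split="train")
ds = ds.with_format("torch")                     # return tensors where possible

loader = DataLoader(ds, batch_size=64, shuffle=True)
for batch in loader:
    embeddings = batch["embedding"]              # assumed embedding column, [64, 1024] per Table 1
    labels = batch["project_id"]                 # assumed label column (TCGA study)
    break
```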

5 Use Cases

To validate the effectiveness of the HoneyBee framework in generating meaningful multimodal datasets, we conducted a series of experiments assessing the quality and utility of the embeddings on a downstream ML task. We used all the clinical text data generated from the 33 cancer sites in the TCGA dataset and extracted embeddings using the GatorTron (gatortron-medium) and BERT (bert-base-uncased) models [43, 10]. We trained a random forest classifier to classify the cancer type using the embeddings available in the HoneyBee Hugging Face repository. In addition to showcasing the capability of generating embeddings directly from the models, we demonstrate how parameter-efficient fine-tuning (PEFT), which has gained popularity recently with the explosion of large language model research [18, 9], can make the embedding models more suited for the task. The experiments were conducted using an Nvidia RTX 3090 GPU with 24GB VRAM, 32GB RAM, and a Ryzen 5950X 16-core CPU.

5.1 Analysis of the Extracted Embeddings in the HoneyBee Framework

We began by analyzing the quality of the extracted embeddings and visualizing them using t-SNE, a dimensionality reduction technique that maps high-dimensional data to a lower-dimensional space while preserving local structure. Figure 3 shows the t-SNE plots of the embeddings generated by the pre-trained and fine-tuned versions of the GatorTron (https://huggingface.co/Lab-Rasool/gatortron-base-tcga) and BERT (https://huggingface.co/Lab-Rasool/bert-base-uncased-tcga) models. Each point represents a patient’s clinical information, while the colors correspond to the different TCGA project IDs. In both cases, the fine-tuned embeddings exhibit more visual separation between the different project IDs than the pre-trained embeddings, suggesting that the fine-tuning process has successfully adapted the language models to capture the nuances and distinguishing features of the various cancer types in the TCGA dataset. To showcase the HoneyBee framework’s support for PEFT, we fine-tuned the models with an adapter-based approach: a small set of low-rank adapters is inserted into the transformer (query, key, and value) linear layers of the pre-trained model, and only these adapter layers are updated during fine-tuning, while the original model parameters remain frozen.
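A minimal sketch of this adapter-based setup, using the peft library with a bert-base-uncased backbone, is shown below; the rank, scaling, and dropout values are illustrative assumptions rather than the exact configuration of the released checkpoints, and the label count follows the 33 TCGA studies.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSequenceClassification

# backbone and hyperparameters are illustrative; 33 labels match the TCGA studies
backbone = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=33)

lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "key", "value"],    # low-rank adapters on the Q/K/V projections
)
model = get_peft_model(backbone, lora_config)
model.print_trainable_parameters()               # only the adapter weights are trainable
```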

Figure 3: t-SNE visualization of pre-trained and fine-tuned embeddings generated by the GatorTron and BERT models. Each subplot represents the embeddings for a specific model configuration: (a) t-SNE of GatorTron Pre-trained Embeddings, (b) t-SNE of GatorTron Fine-tuned Embeddings, (c) t-SNE of BERT Pre-trained Embeddings, and (d) t-SNE of BERT Fine-tuned Embeddings. Each point corresponds to a clinical note, with colors indicating different TCGA studies. Fine-tuned embeddings demonstrate improved separation between studies, indicating enhanced model performance in distinguishing between cancer types post-fine-tuning.

5.2 Downstream ML Model Training

To evaluate the effectiveness of the generated embeddings for cancer-type classification, a random forest classifier was trained. The clinical dataset from TCGA, including clinical text data and project IDs, was loaded and converted into a pandas DataFrame. Two models, GatorTron-medium and BERT-base-uncased, were used for generating embeddings, with both pre-trained and fine-tuned versions created. The clinical text data was tokenized and passed through the models to generate fixed-length embeddings. The embeddings and corresponding project IDs were split into training (80%) and test (20%) sets, and a random forest classifier with 100 estimators was trained on the training set embeddings. The trained classifier was evaluated on the test set embeddings, and accuracies were calculated. To ensure robustness and account for variability, the experiment was repeated over 10 runs with different random seeds for both the pre-trained and fine-tuned models, and the mean accuracy and standard deviation of these runs provide error bars for the performance metrics. Table 2 shows the classification accuracies achieved by the random forest classifier using both pre-trained and fine-tuned embeddings from the GatorTron and BERT models. The fine-tuned models exhibited improved performance over the pre-trained models, with the GatorTron model demonstrating superior accuracy in both scenarios.

Table 2: Classification and retrieval accuracies for different cancer types using pre-trained and fine-tuned embeddings. The GatorTron-medium model outperforms the BERT-base-uncased model in both pre-trained and fine-tuned settings, with significant improvements observed after fine-tuning.
Model Pre-trained Accuracy Pre-trained Retrieval Fine-tuned Accuracy Fine-tuned Retrieval
GatorTron-medium 0.889 ± 0.006 0.694 ± 0.034 0.976 ± 0.003 0.943 ± 0.016
BERT-base-uncased 0.798 ± 0.009 0.689 ± 0.033 0.909 ± 0.005 0.860 ± 0.029
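A minimal sketch of the evaluation loop described in Section 5.2 is given below, with placeholder arrays standing in for the precomputed embeddings and project IDs; the 80/20 split, 100 trees, and 10 seeds follow the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# placeholders for the precomputed clinical embeddings and TCGA project IDs
embeddings = np.random.rand(1_000, 1024)
labels = np.random.randint(0, 33, size=1_000)

accuracies = []
for seed in range(10):                                           # 10 runs with different seeds
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, random_state=seed)    # 80/20 split
    clf = RandomForestClassifier(n_estimators=100, random_state=seed, n_jobs=-1)
    clf.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, clf.predict(X_test)))

print(f"accuracy: {np.mean(accuracies):.3f} ± {np.std(accuracies):.3f}")
```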

5.3 Retrieval Benchmark

In addition to classification, we conducted a retrieval benchmark to assess the ability of the embeddings to capture similarity between clinical texts from the same cancer type. The benchmark used FAISS [11] for similarity search and evaluated the retrieval of patient records with matching project IDs. We ran 100 trials; in each trial, a random patient’s clinical embedding was selected as a query, the nearest neighbors were retrieved from the embedding space, and the percentage of correct matches (i.e., retrieved patients having the same project ID as the query) was calculated. The mean and standard deviation of the retrieval accuracy were computed across the trials. The results, presented in Table 2, show significant improvements in retrieval accuracy for fine-tuned models compared to pre-trained models, with the GatorTron [43] model again outperforming BERT in both settings. These retrieval benchmarks demonstrate the effectiveness of the HoneyBee framework in generating embeddings that support both classification and retrieval tasks in medical research and clinical data analysis.
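A hedged sketch of this retrieval benchmark is shown below, with placeholder arrays standing in for the clinical embeddings and project IDs; the number of retrieved neighbors is an illustrative assumption.

```python
import faiss
import numpy as np

# placeholders for the clinical embeddings and their TCGA project IDs
embeddings = np.random.rand(5_000, 1024).astype("float32")
project_ids = np.random.randint(0, 33, size=5_000)

index = faiss.IndexFlatL2(1024)
index.add(embeddings)

rng = np.random.default_rng(0)
hits = []
for _ in range(100):                                             # 100 random query trials
    q = int(rng.integers(len(embeddings)))
    _, nn = index.search(embeddings[q:q + 1], 11)                # 10 neighbors plus the query itself
    neighbors = nn[0][1:]                                        # drop the query from its own results
    hits.append(float((project_ids[neighbors] == project_ids[q]).mean()))

print(f"retrieval accuracy: {np.mean(hits):.3f} ± {np.std(hits):.3f}")
```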

6 Discussion and Conclusion

The HoneyBee framework integrates multimodal data and uses representation learning techniques to create ML-ready datasets for oncology research. The use cases demonstrate the effectiveness of the generated embeddings in capturing information from raw medical data and their utility in tasks such as cancer-type classification. The t-SNE visualizations (Figure 3) show that fine-tuned embeddings separate different cancer types better than raw embeddings, suggesting that fine-tuning adapts the language models to capture distinguishing features of cancer types in the TCGA dataset. Comparing the performance of the GatorTron model to BERT highlights the importance of selecting appropriate foundation models for each data modality. The HoneyBee framework allows for the incorporation of various data modalities and the flexibility to extend to other disease areas. It provides a standardized pipeline for data integration, pre-processing, normalization, harmonization, embedding generation, and storage, facilitating the creation of datasets that can accelerate model development in oncology and other medical domains. This work addresses the need for large-scale, high-quality, ML-ready datasets in the oncology domain, which can drive the development of innovative models and analyses. However, the interpretability and trustworthiness of the generated embeddings require further investigation to facilitate adoption in clinical settings [29]. The TCGA dataset may contain biases due to the patient selection criteria and data collection processes, which could impact the generalizability of the models. HoneyBee invites collaborators to contribute to the ongoing open-source effort.

References

  • [1] HL7 FHIR. Available online: https://www.hl7.org/fhir/. (accessed on 3 April 2024).
  • [2] S. Alfasly, P. Nejat, S. Hemati, J. Khan, I. Lahr, A. Alsaafin, A. Shafique, N. Comfere, D. Murphree, C. Meroueh, et al. Foundation models for histopathology—fanfare or flair. Mayo Clinic Proceedings: Digital Health, 2(1):165–174, 2024.
  • [3] S. Azizi, L. Culp, J. Freyberg, B. Mustafa, S. Baur, S. Kornblith, T. Chen, N. Tomasev, J. Mitrović, P. Strachan, et al. Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging. Nature Biomedical Engineering, 7(6):756–779, 2023.
  • [4] T. Baltrušaitis, C. Ahuja, and L.-P. Morency. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443, 2018.
  • [5] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [6] R. J. Chen, T. Ding, M. Y. Lu, D. F. Williamson, G. Jaume, A. H. Song, B. Chen, A. Zhang, D. Shao, M. Shaban, et al. Towards a general-purpose foundation model for computational pathology. Nature Medicine, 30(3):850–862, 2024.
  • [7] S. Chen, Z. Ju, X. Dong, H. Fang, S. Wang, Y. Yang, J. Zeng, R. Zhang, R. Zhang, M. Zhou, P. Zhu, and P. Xie. Meddialog: a large-scale medical dialogue dataset. arXiv preprint arXiv:2004.03329, 2020.
  • [8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.
  • [9] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024.
  • [10] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.
  • [11] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini, and H. Jégou. The faiss library. arXiv preprint arXiv:2401.08281, 2024.
  • [12] A. Fedorov, W. Longabaugh, D. Pot, D. Clunie, S. Pieper, R. Lewis, H. Aerts, A. Homeyer, M. Herrmann, U. Wagner, T. Pihl, K. Farahani, and R. Kikinis. Nci imaging data commons. International Journal of Radiation Oncology*Biology*Physics, 111(3, Supplement):e101, 2021. 2021 Proceedings of the ASTRO 63rd Annual Meeting.
  • [13] S. Y. Gadre, G. Ilharco, A. Fang, J. Hayase, G. Smyrnis, T. Nguyen, R. Marten, M. Wortsman, D. Ghosh, J. Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems, 36, 2024.
  • [14] R. L. Grossman, A. P. Heath, V. Ferretti, H. E. Varmus, D. R. Lowy, W. A. Kibbe, and L. M. Staudt. Toward a shared vision for cancer genomic data. New England Journal of Medicine, 375(12):1109–1112, 2016.
  • [15] I. V. Hinkson, T. M. Davidsen, J. D. Klemm, I. Chandramouliswaran, A. R. Kerlavage, and W. A. Kibbe. A comprehensive infrastructure for big data in cancer research: Accelerating cancer research and precision medicine. Frontiers in Cell and Developmental Biology, 5, 2017.
  • [16] Y. Ji, Z. Zhou, H. Liu, and R. V. Davuluri. Dnabert: pre-trained bidirectional encoder representations from transformers model for dna-language in genome. Bioinformatics, 37(15):2112–2120, 2021.
  • [17] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. arXiv preprint arXiv:2009.13081, 2020.
  • [18] J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, and D. Lee. Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization. Advances in Neural Information Processing Systems, 36, 2024.
  • [19] Q. Lhoest, A. Villanova del Moral, Y. Jernite, A. Thakur, P. von Platen, S. Patil, J. Chaumond, M. Drame, J. Plu, L. Tunstall, J. Davison, M. Šaško, G. Chhablani, B. Malik, S. Brandeis, T. Le Scao, V. Sanh, C. Xu, N. Patry, A. McMillan-Major, P. Schmid, S. Gugger, C. Delangue, T. Matussière, L. Debut, S. Bekman, P. Cistac, T. Goehringer, V. Mustar, F. Lagunas, A. Rush, and T. Wolf. Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic, Nov. 2021. Association for Computational Linguistics.
  • [20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [21] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
  • [22] Q. Lu, D. Dou, and T. Nguyen. Clinicalt5: A generative language model for clinical text. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5436–5443, 2022.
  • [23] C. H. McCreery, N. Katariya, A. Kannan, M. Chablani, and X. Amatriain. Effective transfer learning for identifying similar questions: matching user questions to covid-19 faqs. In Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining, pages 3458–3465, 2020.
  • [24] X. Mei, Z. Liu, P. M. Robson, B. Marinelli, M. Huang, A. Doshi, A. Jacobi, C. Cao, K. E. Link, T. Yang, et al. Radimagenet: an open radiologic deep learning research dataset for effective transfer learning. Radiology: Artificial Intelligence, 4(5):e210315, 2022.
  • [25] D. MH Nguyen, H. Nguyen, N. Diep, T. N. Pham, T. Cao, B. Nguyen, P. Swoboda, N. Ho, S. Albarqouni, P. Xie, et al. Lvm-med: Learning large-scale self-supervised vision models for medical imaging via second-order graph matching. Advances in Neural Information Processing Systems, 36, 2024.
  • [26] R. Miotto, L. Li, B. A. Kidd, and J. T. Dudley. Deep patient: an unsupervised representation to predict the future of patients from the electronic health records. Scientific reports, 6(1):1–10, 2016.
  • [27] M. Moor, O. Banerjee, Z. S. H. Abad, H. M. Krumholz, J. Leskovec, E. J. Topol, and P. Rajpurkar. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259–265, 2023.
  • [28] D. M. H. Nguyen, T. N. Pham, N. T. Diep, N. Q. Phan, Q. Pham, V. Tong, B. T. Nguyen, N. H. Le, N. Ho, P. Xie, et al. On the out of distribution robustness of foundation models in medical image segmentation. arXiv preprint arXiv:2311.11096, 2023.
  • [29] I. E. Nielsen, D. Dera, G. Rasool, R. P. Ramachandran, and N. C. Bouaynaya. Robust explainability: A tutorial on gradient-based attribution methods for deep neural networks. IEEE Signal Processing Magazine, 39(4):73–84, 2022.
  • [30] R. R. Thangudu, P. A. Rudnick, M. Holck, D. Singhal, M. J. MacCoss, N. J. Edwards, K. A. Ketchum, C. R. Kinsinger, E. Kim, and A. Basu. Abstract LB-242: Proteomic Data Commons: A resource for proteogenomic analysis. Cancer Research, 80(16_Supplement):LB–242, 08 2020.
  • [31] H. Tizhoosh. Foundation models and information retrieval in digital pathology. arXiv preprint arXiv:2403.12090, 2024.
  • [32] A. Tripathi, A. Waqas, K. Venkatesan, Y. Yilmaz, and G. Rasool. Building flexible, scalable, and machine learning-ready multimodal oncology datasets. Sensors, 24(5), 2024.
  • [33] A. Tripathi, A. Waqas, Y. Yilmaz, and G. Rasool. Multimodal transformer model improves survival prediction in lung cancer compared to unimodal approaches. Cancer Research, 84(6_Supplement):4905–4905, 2024.
  • [34] A. Waqas, A. Tripathi, S. Ahmed, A. Mukund, H. Farooq, M. B. Schabath, P. Stewart, M. Naeini, and G. Rasool. SeNMo: A Self-Normalizing Deep Learning Model for Enhanced Multi-Omics Data Analysis in Oncology. arXiv preprint arXiv:2405.08226, 2024.
  • [35] A. Waqas, A. Tripathi, R. P. Ramachandran, P. Stewart, and G. Rasool. Multimodal data integration for oncology in the era of deep neural networks: a review. arXiv preprint arXiv:2303.06471, 2023.
  • [36] J. S. Wei, S. Zhang, I. Kuznetsov, Y. K. Song, S. Asgharzadeh, S. Sindiri, X. Wen, R. Patidar, J. M. G. Auvil, D. S. Gerhard, R. Seeger, J. M. Maris, and J. Khan. Abstract 1744: Rnaseq identified immune signatures associated with adverse outcome in high-risk neuroblastoma. Cancer Research, 77(13):1744–1744, 07 2017.
  • [37] J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, and J. M. Stuart. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10):1113–1120, 2013.
  • [38] J. Wolff. Approximate nearest neighbor query methods for large scale structured datasets. 2016.
  • [39] M. Wornow, R. Thapa, E. Steinberg, J. Fries, and N. Shah. Ehrshot: An ehr benchmark for few-shot evaluation of foundation models. Advances in Neural Information Processing Systems, 36, 2024.
  • [40] P. Wu, S. Wang, K. Dela Rosa, and D. Hu. Forb: A flat object retrieval benchmark for universal image embedding. Advances in Neural Information Processing Systems, 36, 2024.
  • [41] J. Yang, R. Shi, D. Wei, Z. Liu, L. Zhao, B. Ke, H. Pfister, and B. Ni. Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 10(1):41, 2023.
  • [42] K. Yang, A. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil, R. J. Prenger, and A. Anandkumar. Leandojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems, 36, 2024.
  • [43] X. Yang, A. Chen, N. PourNejatian, H. C. Shin, K. E. Smith, C. Parisien, C. Compas, C. Martin, A. B. Costa, M. G. Flores, et al. A large language model for electronic health records. NPJ digital medicine, 5(1):194, 2022.
  • [44] K. Zhang, J. Yu, Z. Yan, Y. Liu, E. Adhikarla, S. Fu, X. Chen, C. Chen, Y. Zhou, X. Li, et al. Biomedgpt: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks. arXiv preprint arXiv:2305.17100, 2023.

Checklist

  1. For all authors…

    1. (a)

      Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] The claims in the abstract and introduction accurately reflect the contributions and scope of the paper by outlining the development and implementation of the HoneyBee framework for creating multimodal oncology datasets with foundation model embeddings.

    2. (b)

      Did you describe the limitations of your work? [Yes] The paper discusses limitations in Section 6, addressing challenges like the complexity and heterogeneity of medical data, biases in the TCGA dataset, and the need for further investigation into the interpretability of generated embeddings.

    3. (c)

      Did you discuss any potential negative societal impacts of your work? [Yes] The paper discusses both positive and negative societal impacts in Section 6, highlighting the potential for improved cancer diagnosis and treatment, as well as concerns about data privacy and biases.

    4. (d)

      Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

  2. If you are including theoretical results…

    1. (a)

      Did you state the full set of assumptions of all theoretical results? [N/A] The paper does not include theoretical results.

    2. (b)

      Did you include complete proofs of all theoretical results? [N/A] The paper does not include theoretical results.

  3. If you ran experiments (e.g. for benchmarks)…

    1. (a)

      Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The HoneyBee framework and the processed TCGA dataset are available on the Hugging Face platform.

    2. (b)

      Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] The paper specifies details about the training and testing processes, including the models used for embedding generation, data splits, and evaluation methods, in Section 5.

    3. (c)

      Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We ran the experiments multiple times with different random seeds and reported the mean accuracy and standard deviation to provide error bars for the performance metrics.

    4. (d)

      Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] The experiments were conducted using an Nvidia RTX 3090 GPU with 24GB VRAM, 32GB RAM, and a Ryzen 5950X 16-core CPU.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. (a)

      If your work uses existing assets, did you cite the creators? [Yes] The paper references the original creators of the datasets, models, and libraries used, including TCGA, MINDS, Hugging Face datasets, and foundation models such as GatorTron, BERT, and UNI.

    2. (b)

      Did you mention the license of the assets? [Yes] The Hugging Face page for the dataset states that the data is available under the CC-BY-NC-ND-4.0 license.

    3. (c)

      Did you include any new assets either in the supplemental material or as a URL? [Yes] The HoneyBee framework and the processed TCGA dataset are publicly available on the Hugging Face platform at https://huggingface.co/datasets/Lab-Rasool/TCGA.

    4. (d)

      Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [Yes] The data used are publicly available datasets, and consent for their use is implied by their availability.

    5. (e)

      Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] The paper ensures that data privacy is maintained, and personally identifiable information is anonymized in the processed datasets.

  5. If you used crowdsourcing or conducted research with human subjects…

    1. (a)

      Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A] The paper does not involve crowdsourcing or research with human subjects.

    2. (b)

      Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A] The paper does not involve research with human subjects.

    3. (c)

      Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A] The paper does not involve crowdsourcing or research with human subjects.

Appendix A Appendix

A.1 Datasheet for the HoneyBee Dataset

A.1.1 Motivation

  1. Q1.

    For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

    • The HoneyBee dataset was created to enable research in oncology by providing a scalable framework for generating machine learning-ready multimodal datasets. It addresses the gap in integrating diverse data types such as clinical, imaging, and molecular data, which is crucial for comprehensive cancer research and improving diagnostic and treatment models.

  2. Q2.

    Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

    • The dataset was created by the Department of Machine Learning at Moffitt Cancer Center in collaboration with the Department of Electrical Engineering at the University of South Florida.

  3. Q3.

    Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.

    • This research was funded by National Science Foundation grants 2234836, 2234468, and 1903466, and Moffitt Cancer Center.

  4. Q4.

    Any other comments?

    • None.

A.1.2 Composition

  1. Q1.

    What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.

    • The instances in the dataset represent various types of medical data, including clinical records (text), pathology images, radiology images, and molecular data (genomics, proteomics).

  2. Q2.

    How many instances are there in total (of each type, if appropriate)?

    • There are approximately 11,428 instances of clinical data, 11,208 instances of pathology reports, 30,075 instances of pathology images, 11,869 instances of radiology images, and 13,804 instances of molecular data.

  3. Q3.

    Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).

    • The dataset is a comprehensive sample from the TCGA repository, representing a diverse set of cancer types and data modalities. The representativeness is validated through the diversity of cancer types and patient demographics included in the TCGA dataset.

  4. Q4.

    What data does each instance consist of? “Raw” data (e.g., unprocessed text or images) or features? In either case, please provide a description.

    • Each instance consists of raw medical data such as clinical text, imaging data, and molecular sequences, as well as processed feature embeddings generated using foundation models.

  5. Q5.

    Is there a label or target associated with each instance? If so, please provide a description.

    • Yes, each instance is associated with metadata and labels such as cancer type, patient outcomes, and other clinical attributes relevant to the dataset.

  6. Q6.

    Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.

    • There may be instances with missing information due to unavailability or incomplete records in the original TCGA dataset.

  7. Q7.

    Are relationships between individual instances made explicit (e.g., users’ movie ratings, social network links)? If so, please describe how these relationships are made explicit.

    • Relationships between instances are made explicit through metadata and unique identifiers linking different data types (e.g., a patient’s clinical record, imaging data, and molecular data are linked through patient IDs).

  8. Q8.

    Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.

    • No, recommended data splits are not provided because the appropriate split depends on the specific research question and use case. Researchers are encouraged to define their splits based on their unique experimental design and objectives.

  9. Q9.

    Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.

    • Some sources of noise and redundancies may exist due to the variability and heterogeneity of medical data. Efforts have been made to clean and preprocess the data, but some inconsistencies may still be present.

  10. Q10.

    Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.

    • The dataset is self-contained but provides links to external resources such as the TCGA repository for accessing raw data. These external resources are maintained by their respective institutions, and access is governed by their terms and conditions.

  11. Q11.

    Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor–patient confidentiality, data that includes the content of individuals’ non-public communications)? If so, please provide a description.

    • The dataset contains de-identified medical data. Care has been taken to ensure patient confidentiality by removing or obfuscating any personally identifiable information.

  12. Q12.

    Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

    • The dataset includes medical information that may be sensitive or distressing, such as details of cancer diagnoses and treatments. Users should handle the data with care and sensitivity.

  13. Q13.

    Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.

    • Subpopulations are identified by metadata attributes such as age, gender, and cancer type. The distribution of these attributes reflects the diversity of the patient cohort in the TCGA dataset.

  14. Q14.

    Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.

    • No, the dataset is de-identified to prevent direct or indirect identification of individuals.

  15. Q15.

    Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.

    • The dataset contains sensitive health data related to cancer diagnoses and treatments. It does not include other forms of sensitive information such as social security numbers or criminal history.

  16. Q16.

    Any other comments?

    • None.

A.1.3 Collection Process

  1. Q1.

    How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If the data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

    • The data was acquired from the TCGA repository using MINDS, which collects directly observable clinical, imaging, and molecular data from patient records and research studies. The data undergoes validation and quality control by the TCGA consortium.

  2. Q2.

    What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software APIs)? How were these mechanisms or procedures validated?

    • Data was collected using software APIs and scripts provided by the MINDS framework, which interfaces with the TCGA repository. These tools have been validated through extensive use in previous research projects.

  3. Q3.

    If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

    • The dataset is not a sample; it includes comprehensive data from the TCGA repository covering various cancer types.

  4. Q4.

    Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

    • The data collection process involved researchers and data scientists from the Department of Machine Learning at Moffitt Cancer Center and the Department of Electrical Engineering at the University of South Florida. They were compensated as part of their institutional roles.

  5. Q5.

    Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

    • The data collection occurred over a period of several months in 2024. The original data in the TCGA dataset spans multiple years, corresponding to the timeframe of the TCGA studies.

  6. Q6.

    Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.

    • Ethical review processes were conducted by the institutions providing the original TCGA data. The HoneyBee dataset uses de-identified data, ensuring compliance with ethical guidelines.

  7. Q7.

    Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

    • The data was obtained from the TCGA repository, a third-party source.

  8. Q8.

    Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.

    • The individuals in question were notified by the TCGA consortium during the original data collection process. Notifications and consent were managed by the TCGA consortium according to ethical guidelines.

  9. Q9.

    Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.

    • The TCGA consortium provided notifications and obtained consent from participants during the original data collection process.

  10. Q10.

    If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

    • Mechanisms for revoking consent were managed by the TCGA consortium as part of its data governance protocols.

  11. Q11.

    Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

    • An analysis of the potential impact on data subjects was conducted by the TCGA consortium during the original data collection and publication process. The HoneyBee framework uses de-identified data to mitigate risks.

  12. Q12.

    Any other comments?

    • None.

A.1.4 Preprocessing/cleaning/labeling

  1. Q1.

    Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remaining questions in this section.

    • Yes, preprocessing steps include tokenization of text data, normalization of imaging data, and quality control of molecular data. Feature extraction was performed using foundation models to generate embeddings.

  2. Q2.

    Was the “raw” data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? If so, please provide a link or other access point to the “raw” data.

    • The raw data is accessible through the TCGA repository. Links to access the raw data are provided in the dataset documentation.

  3. Q3.

    Is the software that was used to preprocess/clean/label the data available? If so, please provide a link or other access point.

    • Yes. The preprocessing software is part of the open-source HoneyBee framework, and the code is available in the project repository.

  4. Q4.

    Any other comments?

    • None.

A.1.5 Uses

  1. Q1.

    Has the dataset been used for any tasks already? If so, please provide a description.

    • The dataset has been used for research in cancer-type classification and similarity retrieval tasks, demonstrating its utility in machine learning applications in oncology.

  2. Q2.

    Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.

    • Research papers and systems that use the dataset are linked from the project repository on Hugging Face and on Papers with Code.

  3. Q3.

    What (other) tasks could the dataset be used for?

    • The dataset could be used for tasks such as prognosis prediction, treatment response assessment, survival analysis, and the development of multimodal diagnostic tools.

  4. Q4.

    Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other risks or harms (e.g., legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms?

    • Users should be aware that the dataset contains de-identified but sensitive medical data. Consumers should apply appropriate ethical safeguards, for example by assessing and mitigating bias in machine learning models, to avoid unfair treatment of individuals or groups.

  5. Q5.

    Are there tasks for which the dataset should not be used? If so, please provide a description.

    • The dataset should not be used for any non-research purposes that could violate patient confidentiality or ethical guidelines. It is intended strictly for scientific research and educational purposes.

  6. Q6.

    Any other comments?

    • None.

A.1.6 Distribution

  1. Q1.

    Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.

    • Yes, the dataset will be distributed to third parties via the Hugging Face platform for research and educational purposes.

  2. Q2.

    How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?

    • The dataset will be distributed via the Hugging Face platform and GitHub. A DOI will be provided for citation purposes.

  3. Q3.

    When will the dataset be distributed?

    • The dataset is already publicly accessible through the project repository and Hugging Face Datasets.

  4. Q4.

    Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.

    • The dataset will be distributed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license. Details will be provided in the dataset documentation.

  5. Q5.

    Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.

    • No third-party IP-based restrictions apply to the de-identified data used in the HoneyBee dataset.

  6. Q6.

    Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.

    • No export controls or regulatory restrictions apply to the dataset.

  7. Q7.

    Any other comments?

    • None.

A.1.7 Maintenance

  1. Q1.

    Who will be supporting/hosting/maintaining the dataset?

    • The dataset will be supported, hosted, and maintained by the Department of Machine Learning at Moffitt Cancer Center and the Department of Electrical Engineering at the University of South Florida.

  2. Q2.

    How can the owner/curator/manager of the dataset be contacted (e.g., email address)?

    • The dataset curators can be contacted via the information provided on the dataset page.

  3. Q3.

    Is there an erratum? If so, please provide a link or other access point.

    • Any errata will be documented and linked in the project repository on Hugging Face and GitHub.

  4. Q4.

    Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (e.g., mailing list, GitHub)?

    • The dataset will be updated periodically by the research team. Updates will be communicated through the project repository.

  5. Q5.

    If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.

    • The dataset uses de-identified data and complies with data retention policies as outlined by the TCGA consortium.

  6. Q6.

    Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.

    • Older versions will be maintained in the project repository with version control to ensure traceability and reproducibility.

  7. Q7.

    If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.

    • Contributions can be made through the GitHub repository. Contributions will be validated by the research team before integration. The process for contributing will be documented in the repository.

  8. Q8.

    Any other comments?

    • None.

A.2 Statistics of Public Processed HoneyBee Datasets

The HoneyBee dataset leverages data from The Cancer Genome Atlas (TCGA) to create a comprehensive and machine learning-ready collection of multimodal oncology data. This dataset integrates clinical records, pathology reports, whole slide images (WSIs), radiology images, and molecular data, providing embeddings generated using state-of-the-art foundation models. Below is the distribution of patients across different cancer types within the TCGA dataset used for generating these embeddings.

Table 3: Patients within The Cancer Genome Atlas (TCGA) dataset used for generating multimodal oncology embeddings.
Projects # of Patients
Adrenocortical Carcinoma (ACC) 92
Bladder Urothelial Carcinoma (BLCA) 412
Breast Invasive Carcinoma (BRCA) 1,098
Cervical Squamous Cell Carcinoma & Endocervical Adenocarcinoma (CESC) 307
Cholangiocarcinoma (CHOL) 51
Colon Adenocarcinoma (COAD) 461
Lymphoid Neoplasm Diffuse Large B-cell Lymphoma (DLBC) 58
Esophageal Carcinoma (ESCA) 185
Glioblastoma Multiforme (GBM) 617
Head and Neck Squamous Cell Carcinoma (HNSC) 528
Kidney Chromophobe (KICH) 113
Kidney Renal Clear Cell Carcinoma (KIRC) 537
Kidney Renal Papillary Cell Carcinoma (KIRP) 291
Acute Myeloid Leukemia (LAML) 200
Lower Grade Glioma (LGG) 516
Liver Hepatocellular Carcinoma (LIHC) 377
Lung Adenocarcinoma (LUAD) 585
Lung Squamous Cell Carcinoma (LUSC) 504
Mesothelioma (MESO) 87
Ovarian Serous Cystadenocarcinoma (OV) 608
Pancreatic Adenocarcinoma (PAAD) 185
Pheochromocytoma and Paraganglioma (PCPG) 179
Prostate Adenocarcinoma (PRAD) 500
Rectum Adenocarcinoma (READ) 172
Sarcoma (SARC) 261
Skin Cutaneous Melanoma (SKCM) 470
Stomach Adenocarcinoma (STAD) 443
Testicular Germ Cell Tumors (TGCT) 263
Thyroid Carcinoma (THCA) 507
Thymoma (THYM) 124
Uterine Corpus Endometrial Carcinoma (UCEC) 560
Uterine Carcinosarcoma (UCS) 57
Uveal Melanoma (UVM) 80
Total 11,428

The table below provides an overview of the foundation models used for generating embeddings for each data modality, the vector dimensions of the embeddings, and references to the respective models.

Table 4: Overview of models used for different data modalities in HoneyBee. The table lists the types of data (Modality), the corresponding foundation models (Model), and the shape of the embeddings generated by these models (Embedding Shape).
Modality Model Embedding Shape
Clinical Data GatorTron-medium [1024]
Pathology Reports GatorTron-medium [1024]
Pathology Images UNI [patches, 1024]
Radiology Images REMEDIS [patches, 7, 7, 2048]
Molecular Data SeNMo [48]
Figure 4: Snapshot of the HoneyBee dataset hosted on Hugging Face, showcasing various dataset configurations including clinical data, pathology reports, WSIs, radiology images, and molecular data.

A.3 Additional Related Work

A.3.1 Advances in Multimodal Learning (MML)

Recent Developments in MML

Advancements in MML, particularly through the use of deep neural networks (DNNs), have significantly enhanced the ability to learn from diverse data sources. Models that integrate computer vision (CV) and natural language processing (NLP) have demonstrated remarkable capabilities. For example, multimodal foundation models like Contrastive Language-Image Pretraining (CLIP) \citeappendix{clipappen} and Generative Pre-trained Transformer 4 (GPT-4) have set new performance benchmarks across various tasks \citeappendix{achiam2023gptappen}. These models leverage large-scale datasets to learn robust representations, enabling them to generalize well across different modalities and applications \citeappendix{waqas2023multimodalappen}.

Applications of MML in Oncology

In oncology, innovative applications of MML are emerging, such as the integration of clinical and genomics data with imaging modalities like Positron Emission Tomography (PET) scans. Models like RadGenNets combine these diverse data sources to predict gene mutations in Non-small cell lung cancer (NSCLC) patients \citeappendix{tripathi2022radgennetsappen}. Additionally, Graph Neural Networks (GNNs) and Transformers are being explored for tumor classification, prognosis prediction, and treatment response assessment. These models can capture complex interactions between different data types, providing more comprehensive insights into cancer biology and patient outcomes \citeappendix{waqas2023multimodalappen}.

A.3.2 Foundation Models in Medical AI

Generalist Medical AI (GMAI)

Foundation models trained on extensive and diverse datasets exhibit versatility across numerous downstream tasks. Generalist Medical AI (GMAI) models, such as Gato and GPT-3, demonstrate state-of-the-art performance in various applications. These models employ in-context learning, allowing them to solve new tasks by learning from text explanations without needing retraining. Their adaptability and robustness make them valuable tools for medical AI, capable of addressing a wide range of problems with minimal additional training \citeappendix{moor2023foundationappen}.

Trustworthiness and Interpretability

Enhancing the trustworthiness of medical AI involves leveraging foundation models to inspect and validate AI systems through medically relevant concepts. This approach facilitates the deployment of trustworthy AI systems in clinical settings. Foundation models can help interpret complex AI decisions by mapping them to known medical concepts, thus improving transparency and trust among healthcare providers. This methodology has proven successful in fields like dermatology, where foundation models aid in interpreting AI predictions and enhancing clinical decision-making \citeappendix{kim2024transparentappen}.

A.3.3 Opportunities and Future Directions

Deep Phenotyping and Personalized Medicine

The development of multimodal AI models that integrate data across various modalities, including biosensors, genetic, epigenetic, proteomic, microbiome, metabolomic, imaging, text, clinical, social determinants, and environmental data, holds significant potential for personalized medicine. Such models can enable applications like individualized treatment plans, integrated real-time pandemic surveillance, digital clinical trials, and virtual health coaches. By capturing a comprehensive picture of a patient’s health, these models can provide tailored insights and interventions, ultimately improving patient outcomes \citeappendix{acosta2022multimodalappen}.

Collaboration and Data Collection

Collaboration across industries and sectors is essential for collecting and linking large, diverse multimodal health datasets. Efforts should focus on developing approaches that pretrain models using extensive unlabeled data across modalities, requiring only limited labeled data for fine-tuning. Federated learning is a promising approach in this context, enabling multiple institutions to collaboratively train models on shared data without compromising data privacy. This decentralized approach allows models to learn from a broader dataset while maintaining patient confidentiality and complying with data protection regulations \citeappendix{zhang2024eliminatingappen}.

A.4 Details on Raw Medical Data Processing

The HoneyBee framework involves several stages of raw medical data processing to generate high-quality, machine learning-ready embeddings. This section provides a detailed overview of the preprocessing techniques applied to clinical data, pathology reports, whole slide images (WSIs), radiology images, and molecular data.

A.4.1 Clinical Data Processing

Clinical data, comprising electronic health records (EHRs) and medical notes, undergo several preprocessing steps to ensure consistency and usability:

  • Text Extraction and Cleaning: Raw clinical text data is extracted from EHR systems. This involves removing unnecessary characters, HTML tags, and special symbols. The text is then normalized by converting it to lowercase and correcting common misspellings. This step ensures that the data is clean and standardized for further processing.

  • Tokenization: The cleaned text is tokenized, converting it into a sequence of words or subwords. Tokenization is crucial for breaking down the text into manageable pieces that can be effectively processed by language models.

  • Embedding Generation: The tokenized text is fed into pre-trained language models such as GatorTron-medium to generate dense vector embeddings. These embeddings capture the semantic information from the clinical narratives, making them suitable for various downstream ML tasks. A minimal code sketch of these steps follows this list.

A.4.2 Pathology Reports Processing

Pathology reports, which include detailed descriptions of tissue samples and diagnostic results, are processed similarly to clinical data:

  • Text Extraction: Pathology reports are extracted from structured and unstructured formats. This extraction process involves parsing the reports to retrieve relevant textual data.

  • Cleaning and Normalization: The text is cleaned to remove irrelevant information, and medical terms are normalized to a common vocabulary. This ensures that the terminology used in the reports is consistent, which is important for accurate analysis and embedding generation.

  • Tokenization and Embedding: The processed text is tokenized and input into language models like GatorTron-medium to generate embeddings that encapsulate the diagnostic information. These embeddings can then be used for various analytical and predictive tasks in oncology.

This process is not without challenges, however. Figure 5 shows an example of a pathology report for which the Optical Character Recognition (OCR) system could not accurately extract text. Such failures highlight the difficulty of processing handwritten or poorly scanned documents, which can lead to incomplete or incorrect data extraction; a minimal OCR sketch follows Figure 5.

Figure 5: Example of a failed pathology report due to OCR errors. The poor quality of the scan and handwritten elements make it difficult for the OCR system to accurately extract text, leading to significant errors in the extracted data.

A.4.3 Whole Slide Images (WSIs) Processing

Figure 6: Visualization of whole slide image (WSI) processing in the HoneyBee framework. (A) Original whole slide image. (B) Tissue detection and segmentation showing regions of interest (blue), background (green), and noise from pen markings (red). (C) Processed WSI with tiles prepared for embedding generation, highlighting areas of tissue and background.

Whole slide images, which are high-resolution digital scans of histopathology slides, require extensive preprocessing due to their large size and complexity:

  • Tiling: WSIs are divided into smaller, manageable tiles (e.g., 512×512 pixels) because their high resolution makes them too large to process in their entirety; a minimal tiling sketch follows this list.

  • Tissue Detection and Segmentation: Models such as TissueDetector are used to identify and segment regions of interest (ROIs) containing tissue, excluding background areas. This step focuses the analysis on the relevant parts of the image, as illustrated in Figure 6.

  • Normalization: Color normalization techniques are applied to account for staining variations across different slides. This ensures that the color differences due to different staining procedures do not affect the analysis.

  • Embedding Generation: Each tile is processed through a vision transformer (ViT)-based model like UNI, which generates embeddings for each tile, capturing the morphological features of the tissue. These embeddings are then used for further analysis and classification tasks.

Not all slides can be processed successfully: some WSIs fail to meet the quality standards required for effective analysis, as illustrated in Figure 7.

Figure 7: Examples of failed whole slide image (WSI) samples in the HoneyBee framework. (A) Sample with significant artifacts and staining issues. (B) Sample with extremely faint staining, making tissue identification difficult. (C) Sample with minimal tissue presence, predominantly background. (D) Sample with blurred tissue, compromising the quality of morphological information. These examples highlight the challenges in processing and analyzing WSIs in a multimodal oncology dataset.

A.4.4 Radiology Images Processing

Radiology images, including CT, MRI, and PET scans, are processed to extract meaningful features for embedding generation:

  • Normalization: Pixel intensities are normalized to a standard range to ensure consistency across different imaging modalities and machines. This step corrects for variations in imaging conditions and enhances the comparability of images.

  • Slicing: Volumetric scans (e.g., CT and MRI) are sliced into 2D images, breaking complex 3D data into more manageable pieces for downstream processing (see the sketch after this list).

  • Preprocessing: Image preprocessing steps, such as resizing and augmentation (e.g., rotation, flipping), are applied to enhance the diversity and robustness of the dataset. These steps help improve the generalizability of the models trained on the data.

  • Embedding Generation: Pre-trained models like REMEDIS are used to process the images and generate embeddings that represent the spatial and structural information contained in the scans. These embeddings can be used for diagnostic and prognostic modeling.

A.4.5 Molecular Data Processing

Molecular data, encompassing genomics, transcriptomics, and proteomics, is processed to extract biologically relevant features:

  • Quality Control: Raw molecular data undergoes quality control measures to remove low-quality samples and artifacts.

  • Normalization: Data normalization techniques, such as log transformation and scaling, are applied to ensure comparability across samples (a minimal sketch follows this list).

  • Embedding Generation: The processed molecular data is input into specialized models like SeNMo, which generate embeddings that capture the complex relationships within and across different molecular datasets.

\bibliographystyleappendix{unsrt}
\bibliographyappendix{appendixref}