QuaLLM-Health: An Adaptation of an LLM-Based Framework for Quantitative Data Extraction from Online Health Discussions

Ramez Kouzy^1,^*^**[email protected] , Roxanna Attar-Olyaee^2,∗, Michael K. Rooney¹, Comron J. Hassanzadeh¹,
Junyi Jessy Li³, Osama Mohamad¹

¹The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
²Texas Tech Health Science Center El Paso, El Paso, Texas, USA
³University of Texas at Austin, Austin, Texas, USA
*co-first authors

Abstract

Background: Health-related discussions on social media like Reddit offer valuable insights, but extracting quantitative data from unstructured text is challenging. In this work, we present an adapted framework from QuaLLM into QuaLLM-Health for extracting clinically relevant quantitative data from Reddit discussions about glucagon-like peptide-1 (GLP-1) receptor agonists using large language models (LLMs).

Methods: We collected 410,710 posts and comments from five GLP-1–related Reddit communities using the Reddit API in July 2024. After filtering for cancer-related discussions with predefined keywords, 2,059 unique entries remained. We developed annotation guidelines to manually extract variables such as cancer survivorship, family cancer history, cancer types mentioned, risk perceptions, and discussions with physicians. Two domain-expert annotators independently annotated a random sample of 100 entries to create a gold-standard dataset, achieving high inter-annotator agreement (Fleiss’ kappa $\kappa\geq 0.8$ ) for key variables. We then employed iterative prompt engineering with OpenAI’s GPT-4o-mini on the gold-standard dataset to build an optimized pipeline that allowed us to extract variables from dataset in question.

Results: The optimized LLM achieved accuracies above 0.85 for all variables, with precision, recall and F1 score macro averaged > 0.90, indicating balanced performance. Stability testing showed a 95% match rate across runs, confirming consistency. Applying the framework to the full dataset enabled efficient extraction of variables necessary for downstream analysis, costing under $3 and completing in approximately one hour.

Conclusion: QuaLLM-Health demonstrates that LLMs can effectively and efficiently extract clinically relevant quantitative data from unstructured social media content. Incorporating human expertise and iterative prompt refinement ensures accuracy and reliability. This methodology can be adapted for large-scale analysis of patient-generated data across various health domains, facilitating valuable insights for healthcare research.

1 Introduction

The increasing volume of health-related discussions on social media platforms, such as Reddit, presents a unique opportunity to derive meaningful insights about patient experiences, medication effects, and public health concerns.¹ However, transforming these unstructured discussions into actionable data poses significant challenges. In this study, we present QuaLLM-Health, an adaptation of the QuaLLM framework specifically designed to extract clinically relevant quantitative data from Reddit conversations about glucagon-like peptide-1 (GLP-1) receptor agonists.² This approach integrates structured prompt engineering, human-in-the-loop validation, and a comprehensive evaluation methodology to effectively translate social media content into a rich source of information for healthcare research.

Our goal with QuaLLM-Health is to leverage the strengths of large language models (LLMs) for the analysis of unstructured online forum data, enabling the extraction of actionable quantitative insights in an efficient and cost-effective manner. By integrating expert knowledge throughout the process, this framework ensures accuracy, consistency, and applicability to the healthcare domain, offering a blueprint for future research in health informatics and patient-centered studies.

1.1 Overview of the Framework

Our adapted framework encompasses the following key components (Figure 1):

1.

Data Collection: Systematic gathering of relevant Reddit posts and comments.
2.

Data Preprocessing: Rigorous filtering and cleaning of the collected data based on established criteria.
3.

Annotation Guideline Development: Creation of a detailed guideline for consistent variable extraction.
4.

Human Annotation: Generation of a high-quality, consensus-based gold standard dataset.
5.

LLM Prompt Engineering: Implementation of iterative prompt engineering techniques with a LLM for automated extraction.
6.

Evaluation: Thorough assessment of the LLM’s performance using the gold standard dataset.
7.

Execution: Deployment of a fine-tuned pipeline on a large dataset for quantitative data extraction for downstream analysis.

Refer to caption — Figure 1: Overview of the QuaLLM-Health framework showing the pipeline from data collection through execution.

2 Data Collection and Preprocessing

2.1 Data Collection

We initiated the data collection process by identifying five Reddit communities (subreddits) with the highest user engagement related to GLP-1 receptor agonists. These included subreddits focused on the medication names (r/Semaglutide, r/WegovyWeightLoss, r/Zepbound, r/Ozempic, and r/Mounjaro). Data were collected using the Reddit API in July 2024.³ This initial data collection resulted in approximately 410,710 entries, comprising both posts and comments.

2.2 Data Filtering

We developed a regex pipeline to automate the filtering process, using keywords and phrases specific to GLP-1 receptor agonists as well as cancer-related terms to identify relevant entries. The cancer-related terms were identified using a predefined list including ’cancer’, ’malignancy’, ’tumor’, ’neoplasm’, ’carcinoma’, ’sarcoma’, ’leukemia’, ’lymphoma’, ’metastasis’, ’oncology’, ’biopsy’, ’chemotherapy’, ’radiation therapy’, ’malignant’, and ’benign tumor’. This filtering step reduced the dataset to approximately 2,390 entries.

2.3 Data Cleaning and Preparation

We removed duplicates and non-English posts, which left us with a dataset comprised approximately 2,059 unique entries suitable for analysis. We then selected a random sample of 100 entries from the filtered dataset, which were given to human annotators for validation and to assist in prompt engineering.

3 Annotation Guideline Development

We developed an annotation guideline describing variables that could be extracted as binary or categorical variables. This was used to label the textual entries (posts or comments) in a more consistent and clear way. This approach aimed to maintain consistency and clarity in identifying clinically relevant information. The annotation guideline, data used, and code are available at https://github.com/ramezkouzy/GLP1-LLM. The guideline included the following key areas for annotation:

•

Inclusion and Exclusion Criteria: Posts were categorized based on whether they discussed GLP-1 receptor agonists in the context of cancer. Instances that did not meet the inclusion criteria were labeled with specific exclusion reasons to maintain dataset relevance and clarity.
•

Cancer Survivorship and Related Details: We identified mentions of cancer survivors, including whether they were currently taking GLP-1 medications and if they are taking it to lose weight.
•

Family Cancer History and Cancer Type: Posts mentioning a family history of cancer or specifying types of cancer were annotated to facilitate deeper analysis of user backgrounds and cancer types involved. Unusual or unlisted types of cancer were captured with a free-text entry.
•

Cancer Risk and Discussions with Physicians: Mentions related to cancer risk, including concerns, discussions of increased or decreased risk, and whether these concerns were addressed with healthcare professionals.
•

Timing of Cancer Diagnosis: Posts mentioning a cancer diagnosis that occurred after starting GLP-1 medication were flagged to help identify any temporal relationships discussed by users.
•

Information Seeking: Instances where users explicitly sought information regarding cancer risks associated with GLP-1 medications were annotated to understand user awareness and areas where more education might be needed.

This systematic and structured annotation allowed us to create a high-quality, gold-standard dataset that could reliably be used to evaluate LLM performance and guide prompt engineering. The finalized annotation guideline is attached in the published github library.^†^††https://github.com/ramezkouzy/GLP1-LLM.

4 Human Annotation Process

Two annotators with domain expertise were tasked with independently annotating the same random sample of 100 entries from the cleaned dataset. Annotations were recorded in a standardized format, capturing all variables as per the guideline. Post-annotation, discrepancies were adjudicated by discussion. The variables and the frequencies are presented in Table 1. Inter-annotator agreement was quantified using Fleiss’ kappa, with results for each variable as follows:

•

Variables with High Agreement ( $\kappa\geq 0.8$ ): Discussions about GLP-1 receptor agonists in the context of cancer, mentions of being a cancer survivor, types of cancer mentioned, cases where a cancer survivor is also taking GLP-1 medication, mentions of family cancer history, discussions of cancer risk with a physician, mentions of other cancer types not previously listed, and discussions about GLP-1 potentially decreasing cancer risk. These variables showed strong consensus among annotators, indicating reliable extraction for these key areas.
•

Variables with Moderate Agreement ( $0.6\leq\kappa<0.8$ ): Mentions of weight loss among cancer survivors, cancer diagnosis after starting GLP-1 medication, seeking information about cancer risks, mentions of increased cancer risk, and expressions of concern regarding cancer risk. These variables had moderate agreement, suggesting a reasonable level of consistency but room for refinement in the annotation guidelines or further training.
•

Variables with Low Agreement ( $\kappa<0.6$ ): Ability to assess misinformation, overall sentiment of the post, tone of the discussion, general context, and references to misinformation. These variables showed significant variability in interpretation between annotators and were excluded from LLM performance evaluations, though they may provide useful insights in qualitative analyses.

Table 1: Cancer-related variables in the random double-annotated sample (N=100).

Variable	N
Cancer Type
Thyroid Cancer	44
Breast Cancer	8
Gyn Cancer	2
Pancreatic Cancer	2
Other	8
No Type Mentioned	32
Cancer History and Experience
Is Cancer Survivor	33
Survivor Taking GLP-1	26
Cancer Diagnosis After Medication	13
Risk Communication and Perception
Mentions Cancer Risk	57
Concerned About Cancer Risk	29
Seeking Cancer Risk Data	9
Discussed Risk with Physician	17
GLP-1 Decreasing Cancer Risk	13

5 LLM Prompt Engineering and Evaluation

5.1 Initial Prompting Strategy

We utilized Open AI’s gpt-4o-mini-2024-07-18 via API calls, providing it with a carefully structured prompt. This prompt was based on the annotation guidelines that human annotators followed, ensuring that the language model received explicit instructions for each variable. Alongside this, we also provided a JSON schema that defined the expected output format, including specific fields and data types. Our goal with this zero-shot prompting strategy was to evaluate the model’s baseline ability to understand and extract variables without relying on any prior examples. The initial evaluation of the LLM’s output against the gold standard annotations revealed mixed results, which are presented in Table 2. The model accurately identified basic variables but struggled with more nuanced aspects, such as detecting cancer type and risk discussion detection.

Table 2: Performance metrics at baseline.

Category	Precision	Recall	F1
inclusion	0.943	0.510	0.640
exclusion_reason	1.000	0.500	0.667
is_survivor	0.939	0.938	0.936
is_survivor_and_taking_med	0.886	0.865	0.848
cancer_type	0.603	0.510	0.539
other_cancer_type	0.936	0.823	0.875
is_survivor_weight_loss	0.844	0.833	0.806
cancer_diagnosis_after_medication	0.910	0.917	0.910
mentions_cancer_risk	0.771	0.740	0.741
concerned_about_cancer_risk	0.812	0.812	0.812
seeking_cancer_risk_data	0.903	0.917	0.906
discussed_risk_with_physician	0.921	0.906	0.911
discussion_GLP1_decreasing_cancer_risk	0.889	0.885	0.851
Macro-average	0.874	0.781	0.803

5.2 Prompt Engineering

To address these limitations, we employed iterative improvements, including chain-of-thought reasoning, few-shot prompting, and edge case inclusion.⁴ The prompts used are detailed in the github library.^‡^‡‡https://github.com/ramezkouzy/GLP1-LLM Importantly, we did not use examples directly from the evaluation dataset. We employed a human-in-the-loop process, as depicted in Figure 1, to fine-tune the prompts and create new examples that specifically highlighted edge cases where the model struggled. We also set the temperature at 0.0 to decrease the stochasticity of the responses, maximizing the consistency of the classification. Additionally, we utilized the model’s reasoning to identify discrepancies and refine the prompts further, ultimately enhancing alignment with the gold standard dataset and reducing inconsistencies.

Each iteration of prompt refinement was followed by a re-evaluation of the LLM’s performance metrics, allowing us to progressively enhance the model’s extraction capabilities and ensure alignment with our gold standard dataset. The final LLM configuration achieved accuracy of at least 0.85 across all included variables, demonstrating reliable extraction performance. Precision, recall, and F1 macro average were higher than 0.90 indicating a well-balanced performance across all variables. Detailed performance metrics for each variable are presented in Table 3.

Table 3: Performance metrics after prompt engineering.

Category	Precision	Recall	F1
inclusion	0.944	0.950	0.947
exclusion_reason	1.000	0.979	0.989
is_survivor	0.921	0.917	0.914
is_survivor_and_taking_med	0.886	0.865	0.848
cancer_type	0.920	0.906	0.906
other_cancer_type	0.929	0.927	0.925
is_survivor_weight_loss	0.883	0.885	0.881
cancer_diagnosis_after_medication	0.924	0.927	0.919
mentions_cancer_risk	0.822	0.823	0.822
concerned_about_cancer_risk	0.851	0.854	0.851
seeking_cancer_risk_data	0.971	0.969	0.969
discussed_risk_with_physician	0.916	0.896	0.902
discussion_GLP1_decreasing_cancer_risk	0.878	0.896	0.881
Macro Average	0.911	0.909	0.904

5.3 Stability Testing

To evaluate the consistency of the LLM, we ran the final prompt configuration five times on the gold standard dataset. The results showed an average pairwise match rate of 95% across runs, indicating strong agreement. Variables with clear textual cues demonstrated high consistency, while minor variations were observed in more complex variables that required inference. Overall, these findings confirm the LLM’s stability in extracting variables under the optimized prompt settings.

6 Application of the Framework

We applied the optimized prompt to the full dataset of 2,059 unique entries. The LLM successfully extracted variables required for downstream analysis in an efficient manner, with the entire process—including prompt engineering—costing under $3 and taking around one hour to complete. The results will be presented in future analysis.

7 Discussion

In this report, we present an adaptation of the QuaLLM framework that leverages the power of LLMs to efficiently analyze large volumes of unstructured text data, with a particular focus on extracting quantitative data. This adapted framework demonstrates value for similar and broader use cases where both quantitative and structured data are required for research purposes.

In this adaptation, we build on previously described concepts, enhancing the utility of LLMs for use cases such as healthcare-related research. We employed iterative prompting, which has proven to be highly effective, as well as few-shot prompting, both of which improved model performance.^5,6 We also emphasize the importance of involving humans at every stage of the workflow, especially in downstream deployment. Insights from domain experts play a crucial role in maximizing performance by addressing domain-specific nuances and challenges.

In addition to enhancing LLM utility, our framework exemplifies a cost-effective and time-efficient method for large-scale text analysis. The structured approach adopted in this study underscores the potential of LLMs in producing reliable and reproducible results even in complex domains such as healthcare. The combination of human expertise and iterative model tuning facilitated a nuanced understanding of social media data, which is often noisy and context-specific. This methodology not only highlights the adaptability of LLMs but also serves as a guide for future research that aims to leverage large language models for extracting actionable insights from unstructured datasets. Moreover, this approach saves significant resources and personnel time, reducing the need for manual data extraction and analysis. Future iterations of this framework could allow even more researchers to find and adapt the methodology to address their specific use cases, further expanding the potential applications across various domains.

7.1 Limitations

We acknowledge that the generalizability of our human evaluation is limited by the relatively small sample size and its focus on specific pipeline stages. More sophisticated fine-tuned or larger models might perform better, but our approach prioritizes efficiency and cost-effectiveness. Setting up the pipeline took considerable time, but we aimed to balance efficiency with accuracy and cost. Even with high stability, the model may still exhibit slight variations between runs, which could make it unsuitable for high-stakes applications like clinical deployment. However, it is also important to remember that human evaluators are not fault-proof, and the comparator is often a human with their own limitations.⁷

8 Conclusion

This study demonstrates the feasibility of using LLMs for extracting clinically relevant data from unstructured social media content. The approach was cost-effective and computationally efficient, enabling a research team to conduct a study faster and with fewer resources while maintaining good accuracy and reliability.

References

[1] Yesha R, Gangopadhyay A. A method for analyzing health behavior in online forums. In: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics. ACM; 2015. DOI: https://doi.org/10.1145/2808719.2812592
[2] Rao VN, Agarwal E, Dalal S, Calacci D, Monroy-Hernández A. QuaLLM: An LLM-based framework to extract quantitative insights from online forums. arXiv [csCP]. Published online May 8, 2024. Available from: http://arxiv.org/abs/2405.05345
[3] Boe B. PRAW: The python reddit api wrapper. URL: https://praw.readthedocs.io/en/v7. 2021;5.
[4] Dunivin ZO. Scalable qualitative coding with LLMs: Chain-of-thought reasoning matches human performance in some hermeneutic tasks. arXiv [csCP]. Published online January 26, 2024. DOI: 10.48550/arxiv.2401.15170
[5] Logan RL IV, Balažević I, Wallace E, Petroni F, Singh S, Riedel S. Cutting down on prompts and parameters: Simple few-shot learning with language models. arXiv [csCP]. Published online June 24, 2021. Available from: http://arxiv.org/abs/2106.13353
[6] Wang B, Deng X, Sun H. Iteratively prompt Pre-trained Language Models for chain of thought. arXiv [csCP]. Published online March 16, 2022. Available from: http://arxiv.org/abs/2203.08383
[7] Kohane IS. Compared with what? Measuring AI against the health care we have. N Engl J Med. Published online October 26, 2024. DOI: https://www.nejm.org/doi/full/10.1056/NEJMp2404691