
CollectiveSFT: Scaling Large Language Models for Chinese Medical Benchmark with Collective Instructions in Healthcare

Jingwei Zhu
School of Software Engineering
University of Science and Technology of China
Hefei, China
[email protected]
Minghuan Tan
Shenzhen Institute of Advanced Technology
Chinese Academy of Sciences
Shenzhen, China
[email protected]
Min Yang
Shenzhen Institute of Advanced Technology
Chinese Academy of Sciences
Shenzhen, China
[email protected]
Ruixue Li
Xiangshui County Party School
Yancheng, China
[email protected]
Hamid Alinejad-Rokny
UNSW BioMedical Machine Learning Lab (BML)
School of Biomedical Engineering
UNSW, Sydney, Australia
[email protected]
Abstract

The rapid progress in Large Language Models (LLMs) has prompted the creation of numerous benchmarks to evaluate their capabilities. This study focuses on the Comprehensive Medical Benchmark in Chinese (CMB) [25], showcasing how dataset diversity and distribution in supervised fine-tuning (SFT) may enhance LLM performance. Remarkably, we successfully trained a smaller base model to achieve scores comparable to larger models, indicating that a diverse and well-distributed dataset can optimize performance regardless of model size. This study suggests that even smaller models may reach high performance levels with carefully curated and varied datasets. By integrating a wide range of instructional content, our approach addresses potential issues such as data quality inconsistencies. Our results imply that a broader spectrum of training data may enhance a model’s ability to generalize and perform effectively across different medical scenarios, highlighting the importance of dataset quality and diversity in fine-tuning processes (https://github.com/CAS-SIAT-XinHai/CollectiveSFT).

1 Introduction

With the rapid development of Large Language Models (LLMs), there is increasing interest in applying LLMs to the physical health domain. Due to the specialized nature of physical health, LLMs need to acquire extensive medical knowledge, ensure accuracy, and exhibit patience when interacting with patients. To evaluate the knowledge and accuracy of LLMs in this domain, various medical benchmarks have been established. Some models have achieved impressive scores, demonstrating their potential as basic doctor assistants for daily use.

Despite these advancements, major concerns remain regarding the instructions used for fine-tuning these models. Chief among them, the diversity and distribution of instructions may still be limited: as highlighted by Zheng et al. [31], the effectiveness of fine-tuning is heavily influenced by the variety and richness of the instruction sets used.

To address this issue, we propose integrating a diverse array of instruction types and related domains into our fine-tuning dataset. Our approach involves collecting instructions from multiple question types and ensuring a comprehensive representation of different domains. Specifically, we focus on creating a dataset that includes real-world dialogue reconstructions, consultation records from medical forums, and various other sources. This comprehensive approach aims to enhance the model’s performance across different medical scenarios.

In this work, we explore the potential of supervised fine-tuning (SFT) in improving the performance of a smaller model in the medical domain. By utilizing a diverse and well-distributed dataset, we aim to demonstrate that even a smaller model can achieve competitive performance in specialized tasks. Our experiments highlight the importance of dataset quality in fine-tuning processes and show that a well-curated dataset can significantly enhance a model’s capabilities, even with limited parameters.

2 Related Work

2.1 Instruction Tuning

Instruction tuning is a highly effective approach for improving the performance of language models on unseen tasks in zero-shot or few-shot scenarios [27]. This method involves training models with a variety of instructions, enabling them to better understand and execute tasks they have not been explicitly trained on.

Natural Instructions [19] represents an effort to create a comprehensive set of human-crafted instructions designed to enhance model performance across a wide range of tasks. These instructions serve as a valuable resource for fine-tuning models to perform well in diverse applications. Building on this concept, Super-NaturalInstructions [26] expands the scope by including even more detailed and varied instructions, further improving the robustness and adaptability of language models.

To address the issue of limited diversity in human-crafted instructions, Unnatural Instructions [13] introduces a vast dataset of imaginative and varied instructions collected with minimal human effort. This innovative approach leverages automated methods to generate a rich and diverse set of instructions, significantly enhancing the model’s ability to handle a wider array of tasks with improved accuracy and efficiency.

2.2 Open-Source Medical Models

In the realm of medical LLMs, several notable open-source projects have emerged, such as HuatuoGPT [28] and BenTsao [24]. These models are designed to assist in medical consultations and diagnostics by leveraging large-scale medical dialogues and literature.

HuatuoGPT and BenTsao [9] have undertaken the task of collecting extensive medical dialogue datasets. They use advanced language models like GPT-4 to reconstruct these dialogues into question-answer pairs for model training. This method aims to improve the models’ understanding of medical consultations and enhance their ability to provide accurate and relevant responses.

However, these models also come with notable limitations. One major concern is the risk of overfitting to specific datasets, which can limit their generalizability to new, unseen medical scenarios. The reliance on reconstructed dialogues might lead to inconsistencies in data quality, affecting the robustness of the models’ responses.

These challenges highlight the need for ongoing refinement and evaluation of open-source medical models. A key area of focus should be the diversity and distribution of datasets used during fine-tuning. Ensuring a wide variety of instructions and data sources may enhance the model’s ability to generalize and perform effectively across various medical tasks. By carefully curating and diversifying the datasets, it is possible to develop more robust and versatile medical LLMs, capable of providing reliable and comprehensive support in healthcare settings. Our work aims to address these issues, striving to improve the overall performance of medical LLMs through strategic dataset diversification.

3 Collective Instruction Set

3.1 Data Collection

The datasets we gather encompass various types, from conversations to question-answering pairs. While we primarily focus on English and Chinese datasets, we also acknowledge the availability of healthcare datasets in other languages, such as HeadQA [23] in Spanish and FrenchMedMCQA [15] in French.

Our review of publicly accessible datasets indicated that many formats are unsuitable for model fine-tuning due to inconsistencies in structure, detail levels, and annotation standards. To tackle these issues, we decided to standardize all datasets into the Alpaca format [22]. This format includes fields for instruction, input, and output, as well as optional fields for system prompts and history, tailored for specific use cases. By adopting a standardized format, we ensure consistent data processing, enhancing its effectiveness for training and fine-tuning models.
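For concreteness, the sketch below shows the shape of a single record in this format; the field values are placeholders rather than entries from our datasets, and the optional fields appear only where the later subsections call for them.

```python
# A single Alpaca-format record as used throughout this work.
# Values are illustrative placeholders, not real dataset entries.
alpaca_record = {
    "instruction": "Task description or question posed to the model.",
    "input": "Additional context for the task (may be left empty).",
    "output": "The target response the model should learn to produce.",
    # Optional fields, used only for the specific cases described below:
    "system": "Optional system prompt.",
    "history": [["an earlier user turn", "the model's earlier reply"]],
}
```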

Reconstructing the datasets involves several steps. First, we extract relevant information from each dataset, preserving key details. Then, we reformat this information into the Alpaca structure, which entails defining clear instructions for the model, specifying inputs, and providing expected outputs. For conversational data, we include history fields to maintain context across dialogue turns.

Table 1 summarizes all collected data, detailing their language, style, topic size, and instruction size. By aligning diverse datasets into a single, coherent format, we facilitate more effective training processes and enhance the models’ ability to generalize across different medical tasks.

In addition to reformatting existing datasets, we also aim to expand our collection with new data sources. This involves curating data from medical forums, academic publications, and other relevant repositories. This ongoing effort ensures our models remain relevant and effective in real-world medical applications.

Moreover, incorporating diverse datasets helps mitigate biases present in individual data sources. By integrating data from various origins and languages, we create a more balanced and comprehensive training environment. This diversity is essential for developing robust, reliable models capable of providing accurate medical advice across different contexts and populations.

Language Dataset Name Style Topic Size Instruction Size
English PubMedQA [14] QA 273,518 273,518
MedMCQA [21] MCQA 182,822 182,822
HeadQA [23] QA 2,657 2,657
Total 458,997 458,997
Chinese cMedQA2 [29] QA 100,000 188,783
cMedDialogue [1] QA 792,099 792,099
webMedQA [11] QA 252,850 50,570
MedicalDialog [12] Dialogue 2,725,989 4,503,475
CMID [8] NER 12,254 11,786
NLPEC [16] MCQA 18,703 18,703
CMB [25] MCQA 269,359 269,359
MLEC-QA [17] MCQA 108,988 108,988
DISCMed [5] Dialogue 464,898 1,362,307
Total 4,745,140 7,306,070
Table 1: Public medical datasets used for fine-tuning our model. The table shows the size of each dataset in its original format and the number of instructions constructed from it for this work.

3.2 Instruction Set Construction

We construct instructions based on the data types of the collected datasets, ensuring that each type is processed into a unified format that the language models can effectively utilize. This standardization is crucial for maintaining consistency and clarity across different data sources, which is essential for optimizing the model’s performance. The following sections detail the strategies used to process various formats of datasets into a standardized format.

Multiple-Choice Question Answering

For the MCQA format, we use a consistent method to process the data. The instruction field typically contains background information and descriptions about the source of the question, which helps the LLM understand the context better. The input field combines the original question with all the answer options. The output field provides the correct answer, along with an explanation if available in the dataset.
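A minimal sketch of this mapping is shown below; the item structure and field names are hypothetical and stand in for whatever schema a given MCQA dataset actually uses.

```python
def mcqa_to_alpaca(item, source_description):
    """Convert one hypothetical MCQA item into an Alpaca-format record."""
    # Lay out the options under the question so the model sees them together.
    options = "\n".join(f"{key}. {text}" for key, text in item["options"].items())
    # The correct answer, followed by an explanation when the dataset has one.
    output = item["answer"]
    if item.get("explanation"):
        output += "\n" + item["explanation"]
    return {
        "instruction": source_description,            # background / source of the question
        "input": item["question"] + "\n" + options,   # question plus all answer options
        "output": output,
    }
```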

Question Answering

The QA format is simpler compared to other formats. We leave the input field blank and fill the instruction field with the original question and the output field with the corresponding answer.
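A one-line sketch of this mapping (the helper name is hypothetical):

```python
def qa_to_alpaca(question, answer):
    """Plain QA pair: question as the instruction, empty input, answer as output."""
    return {"instruction": question, "input": "", "output": answer}
```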

Dialogue

The dialogue format differs slightly from others due to the nature of conversational data. In this case, we include an additional field named "history" that contains the entire chat history up to that point. The instruction field contains the current question, the input field is left blank, and the output field provides the response. This approach helps the LLM understand the context of the ongoing conversation.
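The sketch below illustrates this construction, assuming the dialogue has already been paired into (patient utterance, doctor reply) turns; the pairing step itself is omitted and the helper name is hypothetical.

```python
def dialogue_to_alpaca(turns):
    """Emit one Alpaca record per doctor reply, carrying earlier turns as history."""
    records, history = [], []
    for patient_msg, doctor_msg in turns:  # turns: list of (patient, doctor) pairs
        records.append({
            "instruction": patient_msg,                   # current question
            "input": "",
            "output": doctor_msg,                         # current response
            "history": [list(pair) for pair in history],  # full chat history so far
        })
        history.append((patient_msg, doctor_msg))
    return records
```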

Sequence Labeling

For sequence labeling, specifically in Named Entity Recognition (NER) tasks, we set the instruction field to request an analysis of specific noun entities and the intent of the description. The input field contains the original content, while the output field consolidates all identified noun entities into a new description that captures the intended meaning. This method aids the LLM in recognizing and understanding specialized terminology in the medical domain.
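As an illustration, a record for an NER-style query might be built as sketched below; the exact wording of the instruction and the output template are illustrative choices, not the literal prompts used in our data.

```python
def ner_to_alpaca(text, entities, intent):
    """Turn an entity-annotated medical query into an Alpaca-format record."""
    return {
        "instruction": ("Identify the medical noun entities mentioned in the "
                        "following text and describe the intent of the query."),
        "input": text,  # the original content
        # Consolidate the entities and intent into a single descriptive answer.
        "output": f"The entities are {', '.join(entities)}; the query is about {intent}.",
    }
```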

By standardizing these diverse data formats into a single instructional framework, we ensure consistency and clarity in training. This approach enhances the LLM’s ability to generalize and perform effectively across various medical tasks, leading to more reliable and robust models.

4 Experiments

4.1 Hyperparameter Optimization

We employ advanced tools like LLaMA-Factory [30] to fine-tune our models, exploring various hyperparameters such as cut-off length, epoch count, and learning rate. These parameters are crucial for the models’ performance and efficiency.
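As an illustration of the kind of settings we sweep, the snippet below lists training arguments in the spirit of a LLaMA-Factory SFT run; the parameter names follow LLaMA-Factory’s conventions to the best of our knowledge, and the values are placeholders rather than our final configuration.

```python
# Illustrative SFT arguments; names approximate LLaMA-Factory's conventions and
# values are placeholders, not the exact configuration used for CollectiveSFT-7B.
sft_args = {
    "model_name_or_path": "internlm/internlm2_5-7b",  # base model (assumed HF identifier)
    "stage": "sft",                      # supervised fine-tuning
    "finetuning_type": "full",           # or "lora" for lighter-weight runs
    "dataset": "collective_medical",     # hypothetical name for our merged instruction set
    "cutoff_len": 1024,                  # cut-off length, discussed below
    "num_train_epochs": 3,               # epoch count
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
}
```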

As the base model for fine-tuning, we selected InternLM2.5-7B [6] for its outstanding reasoning capabilities. This model stands out for its ability to handle complex tasks with high accuracy and efficiency. Additionally, the 7B parameter size strikes a balance between performance and resource requirements: it is a common choice for personal deployment because it does not demand extensive computational resources, making it accessible for a wider range of applications, including those with limited hardware. By choosing InternLM2.5-7B, we aim to leverage its strengths in reasoning while keeping fine-tuning feasible for personal and small-scale deployments, ensuring that the process is both effective and practical.

Our experiments indicate that the cut-off length profoundly affects the model’s performance. Specifically, on the same dataset, a shorter cut-off length yields better results. This improvement is related to the dataset’s average instance length; a cut-off length close to that average helps the model capture the essential information within each instance, enhancing output accuracy and relevance.

In benchmark scenarios, particularly with multiple-choice questions, a slightly shorter cut-off length proves beneficial. For instance, CMB Exam emphasizes accuracy in answering specific questions over conversational abilities. By aligning the cut-off length with the dataset’s average length, we boost the model’s efficiency and accuracy for these specialized tasks. Shorter cut-off lengths enable the model to concentrate on the core content of questions and options, improving its ability to select correct answers. Adjusting other hyperparameters like epoch count and learning rate in tandem with cut-off length further refines performance. A higher epoch count allows the model to learn more comprehensively from the training data, while a well-tuned learning rate ensures optimal convergence without overshooting or getting trapped in local minima.

Overall, our hyperparameter optimization strategy balances these parameters to achieve peak performance for specific applications. Through systematic experimentation with different settings, we fine-tune our models to excel in their tasks, ensuring reliable and effective performance in real-world medical applications.

4.2 Performance over CMB Benchmark

We achieve an outstanding score on the CMB using a remarkably small model, as shown in Table 2, one significantly smaller than any other model at the top of the benchmark. This achievement can be attributed to the diversity and distribution of our dataset. Our results indicate that dataset quality is the most critical factor influencing the performance of model fine-tuning.

By using a wide variety of data formats and sources, we create a training set that is rich and representative of diverse medical scenarios. This strategy allows our smaller model to generalize better and perform effectively across different tasks within the CMB. The success of our fine-tuning process shows the importance of dataset diversity and demonstrates that even with fewer model parameters, top performance can be achieved through careful dataset selection and distribution.

Furthermore, our findings challenge the conventional belief that larger models are inherently superior. Instead, they emphasize that a well-curated and diverse dataset can significantly enhance a model’s capabilities, enabling smaller models to compete with and even surpass larger ones. This has important implications for the development of efficient, resource-conserving models that do not compromise on performance.

Model Total Avg. Training Grad. Nursing Exam Pharm. Exam Med. Tech. Exam Prof. Knowledge Med. Postgrad.
CollectiveSFT-7B 77.05 83.00 85.75 79.25 72.50 90.25 80.25
InternLM2.5-7B [6] 71.40 75.80 78.13 68.28 70.92 65.00 72.19
HuatuoGPTII-34B [7] 76.80 82.50 75.50 73.25 68.75 87.75 77.00
Qwen-72B-Chat [3] 74.38 88.00 75.00 77.00 70.25 94.25 65.50
Yi-34B-Chat [2] 69.17 78.75 69.50 69.75 63.75 87.00 56.50
AntGLM-Med-10 [18] 64.09 81.75 62.00 63.75 60.25 82.50 64.50
GPT-4 [20] 59.46 64.50 60.75 39.50 57.00 77.50 61.25
HuatuoGPTII-7B [7] 59.00 70.75 64.75 60.00 57.75 70.25 53.75
Qwen-14B-Chat [3] 57.64 69.00 60.50 51.25 51.75 73.00 50.00
Baichuan2-13B-Chat [4] 48.87 56.50 47.75 44.50 45.50 63.25 39.25
Qwen-7B-Chat [3] 46.58 56.25 46.00 42.00 37.25 63.50 39.50
ChatGLM2-6B [10] 45.05 48.25 47.25 43.75 43.00 54.25 42.25
Table 2: Performance comparison of open-source medical models on the CMB, focusing on specific exam scores and overall averages. Only open-source models were selected, excluding closed-source models. Data retrieved from the CMB leaderboard on July 24, 2024 (https://cmedbenchmark.llmzoo.com/static/leaderboard.html).

5 Discussion and Conclusion

In this article, we have highlighted the potential of using diverse datasets to improve model performance using SFT. Our findings suggest that incorporating a variety of data types is an effective way to enhance the capabilities of models, achieving better performance with fewer GPU resources.

Our study also uncovered some limitations associated with this method. One notable issue is that while the fine-tuned smaller models excel at answering multiple-choice questions accurately and effectively, they may lose some of their conversational abilities. This loss means that although the models perform well on specific tasks like MCQA, they struggle to maintain engaging and coherent conversations with users during interactive sessions. This trade-off between specialized task performance and general conversational ability is an important consideration for the application in real-world scenarios.

Additionally, we observed common problems associated with smaller models, such as hallucination. Hallucination refers to the generation of plausible but incorrect or nonsensical information by the model. This issue can undermine the reliability of the model’s responses and poses a significant challenge for its deployment in sensitive domains like healthcare, where accuracy is paramount.

In conclusion, while the use of diverse datasets in supervised fine-tuning offers a promising pathway for quickly enhancing a model’s knowledge base and task-specific performance, it also presents several challenges that need to be addressed. Future work should focus on developing strategies to preserve the conversational capabilities of fine-tuned models and reduce instances of hallucination. Overall, this method shows great potential for improving the efficiency and effectiveness of LLMs, but it requires careful consideration and further innovation to fully realize its benefits.

Acknowledgments and Disclosure of Funding

This work was partially supported by China Postdoctoral Science Foundation (2023M733654), Guangdong Basic and Applied Basic Research Foundation (2023A1515110496), Shenzhen Science and Technology Innovation Program (KQTD20190929172835662).

References

  • [1] Chinese medical dialogue data 中文医疗问答数据集 (2019), https://github.com/Toyhom/Chinese-medical-dialogue-data
  • [2] 01.AI: Young, A., Chen, B., Li, C., Huang, C., Zhang, G., Zhang, G., Li, H., Zhu, J., Chen, J., Chang, J., Yu, K., Liu, P., Liu, Q., Yue, S., Yang, S., Yang, S., Yu, T., Xie, W., Huang, W., Hu, X., Ren, X., Niu, X., Nie, P., Xu, Y., Liu, Y., Wang, Y., Cai, Y., Gu, Z., Liu, Z., Dai, Z.: Yi: Open foundation models by 01.AI (2024)
  • [3] Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., Hui, B., Ji, L., Li, M., Lin, J., Lin, R., Liu, D., Liu, G., Lu, C., Lu, K., Ma, J., Men, R., Ren, X., Ren, X., Tan, C., Tan, S., Tu, J., Wang, P., Wang, S., Wang, W., Wu, S., Xu, B., Xu, J., Yang, A., Yang, H., Yang, J., Yang, S., Yao, Y., Yu, B., Yuan, H., Yuan, Z., Zhang, J., Zhang, X., Zhang, Y., Zhang, Z., Zhou, C., Zhou, J., Zhou, X., Zhu, T.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
  • [4] Baichuan: Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305 (2023), https://arxiv.org/abs/2309.10305
  • [5] Bao, Z., Chen, W., Xiao, S., Ren, K., Wu, J., Zhong, C., Peng, J., Huang, X., Wei, Z.: Disc-medllm: Bridging general large language models and real-world medical consultation (2023)
  • [6] Cai, Z., Cao, M., Chen, H., Chen, K., Chen, K., Chen, X., Chen, X., Chen, Z., Chen, Z., Chu, P., Dong, X., Duan, H., Fan, Q., Fei, Z., Gao, Y., Ge, J., Gu, C., Gu, Y., Gui, T., Guo, A., Guo, Q., He, C., Hu, Y., Huang, T., Jiang, T., Jiao, P., Jin, Z., Lei, Z., Li, J., Li, J., Li, L., Li, S., Li, W., Li, Y., Liu, H., Liu, J., Hong, J., Liu, K., Liu, K., Liu, X., Lv, C., Lv, H., Lv, K., Ma, L., Ma, R., Ma, Z., Ning, W., Ouyang, L., Qiu, J., Qu, Y., Shang, F., Shao, Y., Song, D., Song, Z., Sui, Z., Sun, P., Sun, Y., Tang, H., Wang, B., Wang, G., Wang, J., Wang, J., Wang, R., Wang, Y., Wang, Z., Wei, X., Weng, Q., Wu, F., Xiong, Y., Xu, C., Xu, R., Yan, H., Yan, Y., Yang, X., Ye, H., Ying, H., Yu, J., Yu, J., Zang, Y., Zhang, C., Zhang, L., Zhang, P., Zhang, P., Zhang, R., Zhang, S., Zhang, S., Zhang, W., Zhang, W., Zhang, X., Zhang, X., Zhao, H., Zhao, Q., Zhao, X., Zhou, F., Zhou, Z., Zhuo, J., Zou, Y., Qiu, X., Qiao, Y., Lin, D.: Internlm2 technical report (2024)
  • [7] Chen, J., Wang, X., Gao, A., Jiang, F., Chen, S., Zhang, H., Song, D., Xie, W., Kong, C., Li, J., Wan, X., Li, H., Wang, B.: Huatuogpt-ii, one-stage training for medical adaption of llms (2023), https://arxiv.org/abs/2311.09774
  • [8] Chen, N., Su, X., Liu, T., Hao, Q., Wei, M.: A benchmark dataset and case study for chinese medical question intent classification. BMC Medical Informatics and Decision Making 20(3), 125 (Jul 2020). https://doi.org/10.1186/s12911-020-1122-3
  • [9] Du, Y., Zhao, S., Cai, M., Chen, J., Wang, H., Chen, Y., Guo, H., Qin, B.: The calla dataset: Probing llms’ interactive knowledge acquisition from chinese medical literature (2023)
  • [10] Du, Z., Qian, Y., Liu, X., Ding, M., Qiu, J., Yang, Z., Tang, J.: Glm: General language model pretraining with autoregressive blank infilling (2022), https://arxiv.org/abs/2103.10360
  • [11] He, J., Fu, M., Tu, M.: Applying deep matching networks to chinese medical question answering: A study and a dataset. BMC Medical Informatics and Decision Making 19(2),  52 (2019). https://doi.org/10.1186/s12911-019-0761-8
  • [12] He, X., Chen, S., Ju, Z., Dong, X., Fang, H., Wang, S., Yang, Y., Zeng, J., Zhang, R., Zhang, R., Zhou, M., Zhu, P., Xie, P.: Meddialog: Two large-scale medical dialogue datasets (2020)
  • [13] Honovich, O., Scialom, T., Levy, O., Schick, T.: Unnatural instructions: Tuning language models with (almost) no human labor. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 14409–14428. Association for Computational Linguistics, Toronto, Canada (Jul 2023). https://doi.org/10.18653/v1/2023.acl-long.806, https://aclanthology.org/2023.acl-long.806
  • [14] Jin, Q., Dhingra, B., Liu, Z., Cohen, W., Lu, X.: PubMedQA: A dataset for biomedical research question answering. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). pp. 2567–2577. Association for Computational Linguistics, Hong Kong, China (Nov 2019). https://doi.org/10.18653/v1/D19-1259, https://aclanthology.org/D19-1259
  • [15] Labrak, Y., Bazoge, A., Dufour, R., Daille, B., Gourraud, P.A., Morin, E., Rouvier, M.: FrenchMedMCQA: A French multiple-choice question answering dataset for medical domain. In: Lavelli, A., Holderness, E., Jimeno Yepes, A., Minard, A.L., Pustejovsky, J., Rinaldi, F. (eds.) Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI). pp. 41–46. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Hybrid) (Dec 2022). https://doi.org/10.18653/v1/2022.louhi-1.5, https://aclanthology.org/2022.louhi-1.5
  • [16] Li, D., Hu, B., Chen, Q., Peng, W., Wang, A.: Towards medical machine reading comprehension with structural knowledge and plain text. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1427–1438. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.emnlp-main.111, https://aclanthology.org/2020.emnlp-main.111
  • [17] Li, J., Zhong, S., Chen, K.: MLEC-QA: A Chinese Multi-Choice Biomedical Question Answering Dataset. In: Moens, M.F., Huang, X., Specia, L., Yih, S.W.t. (eds.) Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 8862–8874. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic (Nov 2021). https://doi.org/10.18653/v1/2021.emnlp-main.698, https://aclanthology.org/2021.emnlp-main.698
  • [18] Li, Q., Yang, X., Wang, H., Wang, Q., Liu, L., Wang, J., Zhang, Y., Chu, M., Hu, S., Chen, Y., Shen, Y., Fan, C., Zhang, W., Xu, T., Gu, J., Zheng, J., Group, G.Z.A.: From beginner to expert: Modeling medical knowledge into general llms (2024), https://arxiv.org/abs/2312.01040
  • [19] Mishra, S., Khashabi, D., Baral, C., Hajishirzi, H.: Cross-task generalization via natural language crowdsourcing instructions. In: Muresan, S., Nakov, P., Villavicencio, A. (eds.) Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 3470–3487. Association for Computational Linguistics, Dublin, Ireland (May 2022). https://doi.org/10.18653/v1/2022.acl-long.244, https://aclanthology.org/2022.acl-long.244
  • [20] OpenAI: GPT-4 technical report (2023)
  • [21] Pal, A., Umapathi, L.K., Sankarasubbu, M.: Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In: Flores, G., Chen, G.H., Pollard, T., Ho, J.C., Naumann, T. (eds.) Proceedings of the Conference on Health, Inference, and Learning. Proceedings of Machine Learning Research, vol. 174, pp. 248–260. PMLR (07–08 Apr 2022), https://proceedings.mlr.press/v174/pal22a.html
  • [22] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., Hashimoto, T.B.: Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca (2023)
  • [23] Vilares, D., Gómez-Rodríguez, C.: HEAD-QA: A healthcare dataset for complex reasoning. In: Korhonen, A., Traum, D., Màrquez, L. (eds.) Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 960–966. Association for Computational Linguistics, Florence, Italy (Jul 2019). https://doi.org/10.18653/v1/P19-1092, https://aclanthology.org/P19-1092
  • [24] Wang, H., Liu, C., Xi, N., Qiang, Z., Zhao, S., Qin, B., Liu, T.: Huatuo: Tuning llama model with chinese medical knowledge (2023)
  • [25] Wang, X., Chen, G.H., Song, D., Zhang, Z., Chen, Z., Xiao, Q., Jiang, F., Li, J., Wan, X., Wang, B., et al.: Cmb: A comprehensive medical benchmark in chinese. arXiv preprint arXiv:2308.08833 (2023)
  • [26] Wang, Y., Mishra, S., Alipoormolabashi, P., Kordi, Y., Mirzaei, A., Naik, A., Ashok, A., Dhanasekaran, A.S., Arunkumar, A., Stap, D., Pathak, E., Karamanolakis, G., Lai, H., Purohit, I., Mondal, I., Anderson, J., Kuznia, K., Doshi, K., Pal, K.K., Patel, M., Moradshahi, M., Parmar, M., Purohit, M., Varshney, N., Kaza, P.R., Verma, P., Puri, R.S., Karia, R., Doshi, S., Sampat, S.K., Mishra, S., Reddy A, S., Patro, S., Dixit, T., Shen, X.: Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In: Goldberg, Y., Kozareva, Z., Zhang, Y. (eds.) Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. pp. 5085–5109. Association for Computational Linguistics, Abu Dhabi, United Arab Emirates (Dec 2022). https://doi.org/10.18653/v1/2022.emnlp-main.340, https://aclanthology.org/2022.emnlp-main.340
  • [27] Wei, J., Bosma, M., Zhao, V., Guu, K., Yu, A.W., Lester, B., Du, N., Dai, A.M., Le, Q.V.: Finetuned language models are zero-shot learners. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=gEZrGCozdqR
  • [28] Zhang, H., Chen, J., Jiang, F., Yu, F., Chen, Z., Li, J., Chen, G., Wu, X., Zhang, Z., Xiao, Q., Wan, X., Wang, B., Li, H.: Huatuogpt, towards taming language models to be a doctor. arXiv preprint arXiv:2305.15075 (2023)
  • [29] Zhang, S., Zhang, X., Wang, H., Guo, L., Liu, S.: Multi-scale attentive interaction networks for chinese medical question answer selection. IEEE Access 6, 74061–74071 (2018). https://doi.org/10.1109/ACCESS.2018.2883637
  • [30] Zheng, Y., Zhang, R., Zhang, J., Ye, Y., Luo, Z., Feng, Z., Ma, Y.: Llamafactory: Unified efficient fine-tuning of 100+ language models. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Association for Computational Linguistics, Bangkok, Thailand (2024), http://arxiv.org/abs/2403.13372
  • [31] Zheng, Z., Liao, L., Deng, Y., Nie, L.: Building emotional support chatbots in the era of llms (2023)