
A Question Answering Based Pipeline for Comprehensive Chinese EHR Information Extraction

Huaiyuan Ying Center for Statistical Science
Tsinghua University
Beijing, China
[email protected]
   Sheng Yu Center for Statistical Science
Tsinghua University
Beijing, China
[email protected]
Abstract

Electronic health records (EHRs) hold significant value for research and applications. As a new approach to information extraction, question answering (QA) can extract more flexible information than conventional methods and is more accessible to clinical researchers, but its progress is impeded by the scarcity of annotated data. In this paper, we propose a novel approach that automatically generates training data for transfer learning of QA models. Our pipeline incorporates a preprocessing module to handle challenges posed by extraction types that are not readily compatible with extractive QA frameworks, including cases with discontinuous answers and many-to-one relationships. The obtained QA model exhibits excellent performance on subtasks of information extraction in EHRs, and it can effectively handle few-shot or zero-shot settings involving yes-no questions. Case studies and ablation studies demonstrate the necessity of each component in our design, and the resulting model is deemed suitable for practical use.

Index Terms:
information extraction, question answering, electronic health records, discontinuous answer spans

I Introduction

Information extraction (IE) from electronic health records (EHRs) aims to convert free-text content into structured data to facilitate diverse downstream tasks and analyses in healthcare research and services, such as developing patient data registries and EHR-linked biobanks. Conventionally, IE encompasses basic tasks such as named entity recognition (NER), entity linking, coreference resolution, and relation extraction (RE) [1] to extract entities, entity types, relations, events, and other valuable information. However, many biomedical studies require information that is significantly more sophisticated than what the above tasks can extract [2, 3]. Combining basic tasks to extract more sophisticated information not only requires careful design but is also difficult for users without a background in natural language processing (NLP). With the advancement of pretrained language models, question answering (QA) models can achieve impressive performance in extracting information that is more flexible and more complex than what conventional tasks handle [4, 5]. QA also has the advantage of accepting natural language queries, which can significantly lower the barrier to using IE in EHRs. Notably, the advent of large language models (LLMs) has further enhanced the ability to summarize information [6, 7, 8, 9], opening new possibilities in IE for EHRs.

QA tasks generally assume a data format that includes a context, a question, and an answer. Depending on how the answer is obtained, QA models can be classified as extractive or generative: extractive models must identify the answer as a substring of the context, while generative models generate the answer autoregressively, conditioned on the context and the question [10]. As a result, extractive QA is less flexible than generative QA, but it is also less prone to hallucination, i.e., the tendency of generative models to fabricate answers from their parameters rather than the given context. As the goal of IE in EHRs is often to extract answers from the EHR text, we focus on extractive QA in this work to avoid potential hallucinations.

Training QA models requires the above-mentioned data format: each sample comprises a context, a question, and an answer. As QA models aim to answer a wide range of questions posed in natural language, their training samples require more diversity than those of other IE tasks [11]. Consequently, annotating QA data requires significantly more thought and time than the basic IE tasks, making the development of QA models costly. Fine-tuning LLMs for the biomedical domain also requires QA data, as LLMs like ChatGPT are obtained by fine-tuning unsupervised base models with QA data. Although LLMs are much more general-purpose than previous QA models and exhibit surprisingly good performance on medical data out-of-the-box, they still need to be further fine-tuned with QA data on EHRs to deliver satisfactory results.

In this paper, we propose a pipeline to automatically generate QA data from EHRs, and apply transfer learning on pretrained general domain QA models for IE tasks in EHRs. We leverage the fact that many projects have annotated abundant EHRs for basic IE tasks including NER and RE, and various NLP tools can perform these annotations as well. Based on the semantic types of the annotated entities, we design templates to convert entity and relation annotations to questions and answers, while we also prepare samples to train the model to judge answerability. Additionally, extractive QA models by default only identify a continuous span as the answer, but the EHR information to be extracted is usually discretely distributed. To address this issue, we incorporate additional preprocessing and post-processing in our pipeline. In brief, we break down paragraphs into sentences, exclude sentences without potential answers, and then merge the remaining answer spans. With transfer learning, we observed satisfactory performance. The final model demonstrates competency across various IE subtasks in EHRs and exhibits impressive generalization abilities. We summarize our contributions as follows:

  • We propose a pipeline that utilizes accessible data annotations to train an extractive QA system, enabling it to assist in EHR IE. The pipeline accommodates diverse types of questions and performs multiple IE subtasks simultaneously.

  • We introduce processing techniques that equip the extractive QA system with the ability to judge answerability and extract discontinuous spans, thereby achieving superior performance compared to traditional IE methods.

II Related Work

II-A EHR Information Extraction

The extraction of information from health records and clinical documents has a long history. In its early stages, rule-based and expert-based systems were commonly used methods for Information Extraction (IE) [12, 13]. Software such as cTAKES [14] and MetaMap [15], equipped with NLP analysis engines, proved to be beneficial for clinical tasks. However, the drawback of these methods was the need to create numerous rules for each specific task, which made them less efficient.

Consequently, machine learning and deep learning-based approaches have gained significant research interest as more efficient IE methods in the clinical domain [16]. Support vector machines and conditional random fields are widely used for detecting entities or events [17, 18]. Deep learning methods in the clinical domain have also benefited from transfer learning from the general domain. Apart from the popular BERT (Bidirectional Encoder Representations from Transformers) backbone [19], other strategies have shown effectiveness, such as MT-clinical BERT, which uses multi-task learning for knowledge sharing among subtasks [20], OIE4KGC, which employs a UMLS knowledge-graph enhanced model [21], and self-supervised and retrieval-augmented methods [22].

The applications of information extraction are diverse. For instance, phenotyping involves identifying diseases with specific features, like childhood obesity and neuropsychiatric disorders [23]. IE can also be employed in feature engineering for the classification of cancer types and stages [24, 25]. Additionally, drug-related studies can draw on extracted examples for insights [26], and the extracted information can be utilized to optimize clinical workflows or even provide clinical decision support [27, 28].

However, to the best of our knowledge, no existing method has utilized QA to assist in EHR IE tasks such as NER and relation extraction. Using QA offers the advantage of addressing all the subtasks simultaneously in a more convenient manner.

II-B Question Answering

We focus on machine reading comprehension (MRC) QA, where the model provides an answer based on both the question and a given context. This QA task encompasses four main question types: yes-no, multi-choice, extractive, and generative [10]. For the first three types, models typically use only the encoder/embedding to obtain word representations. The widely adopted three-stage architecture proposed by FastQA [29] consists of semantic understanding, interaction, and prediction. Since the emergence of BERT-like models, enhancing the semantic understanding stage has become the mainstream approach to improving model performance [30, 31, 32]. The three types differ, however, in their prediction modules: extractive QA predicts the probability of each token being a start or end token, yes-no QA performs binary classification [33, 34], and multi-choice QA selects the most plausible hypothesis from constructed options [35, 36]. The SQuAD 2.0 dataset introduced unanswerable questions for extractive QA [37], which is the task formulation we adopt for our problem.

On the other hand, generative QA always requires an additional decoding module and is more commonly used in open-domain QA rather than machine reading comprehension. Interaction-aware embeddings are fed into a seq2seq decoder [38], transformer decoder [39], or other structures like T5 [40], GPT, etc. However, evaluating generative answers is challenging [41], and it is even more difficult to determine whether the model provides unfounded answers. As a result, generative QA is not the preferred approach for our pipeline.

III Methods

In this section, we provide a detailed description of the entire pipeline. Its main purpose is to transform the dependency annotations into question-answer pairs and to train a model for extracting information from EHR data. The pipeline comprises three main components: (1) preprocessing of the dependency and textual data, (2) a QA model capable of discriminating unanswerable questions, and (3) post-processing and combination of model outputs for application. The pipeline architecture and translated examples are displayed in Figure 1. For simplicity and privacy protection, we show only a few clauses from the original lengthy EHR paragraphs.


Figure 1: The architecture of our pipeline. From the EHR corpus, we obtain the original dependency annotations and the context. The colored words are the annotated general entities; each color stands for one entity type, and the arrows represent the dependency annotations. During preprocessing, the dependency annotations are transformed into questions through manually constructed templates based on relation types. The contexts are split according to many-to-one correspondences of the relation pairs, resulting in sentence-level or paragraph-level texts. The questions and the texts are concatenated and fed into the QA model for training. We also introduce impossible questions with plausible answers constructed from annotations of the same type. The QA model judges the answerability of each question-context pair and outputs the answer span. Finally, the answers from the split texts are merged to produce the final outputs.

III-A Notation

The raw annotations of our EHR data consist only of general entity annotations, which specify the textual content, type, and start token of each entity, together with the dependencies between entities. The type of an entity $A$ is denoted as $t(A)$, and each dependency pair is denoted as $(A,B)$, indicating that a dependency relation exists from $A$ to $B$. Each sample pair $(A,B)$ is categorized into the class named "$t(A)$-$t(B)$". There are 15 entity types and 21 relation classes in total in the annotations.

Given a specific key (the information we are interested in), the output should consist of the corresponding retrieved values. After preprocessing, the input is the concatenation of a question $Q=(q_{1},\cdots,q_{n})$ and a context $X=(X_{1},\cdots,X_{m})$. The pipeline output should include one or several spans $A_{1}=(X_{a_{1s}},\cdots,X_{a_{1e}}),\cdots,A_{t}=(X_{a_{ts}},\cdots,X_{a_{te}})$ from the context, where $a_{is}$ and $a_{ie}$ denote the start and end tokens of the $i$-th span, respectively.

III-B Preprocessing

The preprocessing consists of three parts which will be introduced below. We will still utilize the examples in Figure 1.

III-B1 Transform dependency to question-answer pairs

Firstly, we utilize pre-defined templates to convert the entity and dependency annotations into two kinds of question-answer pairs. Specifically, for disease, abnormality, and body part entities, we can directly query them without specifying the dependency, for instance, "What abnormality does the patient have?". These questions equip the QA model with the ability to perform NER tasks effectively. For the dependency annotations, we manually design up to three question templates for each relation class "$t(A)$-$t(B)$" to query the left and right entities respectively. For example, for the first annotation in Figure 1, we can query the disease name with the template "What disease has the patient's 'family member' suffered from?", or reversely query the family member with "Which family member of the patient has suffered from 'disease'?". This captures bidirectional relationships and mitigates annotation inconsistencies.
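To make this step concrete, the following Python sketch shows how a dependency pair could be turned into a bidirectional pair of questions. It is a minimal illustration: the relation classes, template wordings, and field names such as left_type and right_text are simplified placeholders rather than the exact schema of our annotations.

```python
# Illustrative templates for two relation classes; each class queries the
# right entity given the left one and vice versa (placeholder wording).
RELATION_TEMPLATES = {
    "family member-disease": {
        "ask_right": "What disease has the patient's {left} suffered from?",
        "ask_left": "Which family member of the patient has suffered from {right}?",
    },
    "body part-abnormality": {
        "ask_right": "What abnormalities are there in the {left} of the patient?",
        "ask_left": "Where does the patient show {right}?",
    },
}

# Entity-level templates used for the NER-like questions.
ENTITY_TEMPLATES = {
    "disease": "What disease does the patient have?",
    "abnormality": "What abnormality does the patient have?",
}


def dependency_to_qa(pair, context):
    """Convert one dependency pair (A, B) into two question-answer samples."""
    relation_class = f"{pair['left_type']}-{pair['right_type']}"
    tpl = RELATION_TEMPLATES.get(relation_class)
    if tpl is None:
        return []
    return [
        {"context": context,
         "question": tpl["ask_right"].format(left=pair["left_text"]),
         "answer": pair["right_text"]},
        {"context": context,
         "question": tpl["ask_left"].format(right=pair["right_text"]),
         "answer": pair["left_text"]},
    ]


if __name__ == "__main__":
    pair = {"left_type": "family member", "left_text": "mother",
            "right_type": "disease", "right_text": "hypertension"}
    for sample in dependency_to_qa(pair, "The patient's mother has hypertension."):
        print(sample)
```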

III-B2 Split paragraphs for discontinuous answer spans

Secondly, in EHRs it is common that one abnormality corresponds to multiple body parts, or one body part has multiple abnormalities, so the answer to a question may consist of multiple spans. However, extractive QA, which returns a single continuous span, faces challenges in capturing such discretely distributed answers. To address this issue, we handle cases of many-to-one correspondence during preprocessing as follows. If two answer spans are adjacent or separated only by punctuation, we merge them into one span: in Figure 1, "fluid accumulation" and "gas accumulation" are two annotated entities within the same body part, and they are merged into "fluid accumulation, gas accumulation". When many-to-one correspondence still exists after merging, we split the EHR paragraph into sentences so that each sentence has at most one answer span for the question. The model, to be explained in the next subsection, is designed to provide an answer for answerable questions and an empty string for unanswerable ones. During evaluation, the outputs for the individual sentences undergo post-processing to derive the final answer. Taking the same example in Figure 1, the CT report is split into three sentences: the first and third sentences each contain one answer span for the question "What abnormalities are there in the abdominal cavity of the patient?", while the second sentence contains none. The final answer is the combination of "fluid accumulation, gas accumulation" and "fat gap is turbid". This preprocessing strategy proves to be highly effective: after merging adjacent spans, most dependency pairs follow a one-to-one correspondence within a single sentence in the EHR setting.
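A minimal sketch of these two operations is given below, assuming character-offset annotations; the punctuation set and sentence delimiters are illustrative and would need to match the actual Chinese EHR formatting.

```python
import re

PUNCT = set("，,、;；")                       # separators allowed between merged spans
SENT_DELIM = re.compile(r"(?<=[。；;！？])")   # split positions after sentence-final marks


def merge_adjacent_spans(context, spans):
    """Merge answer spans that are adjacent or separated only by punctuation.

    `spans` holds (start, end) character offsets into `context`, end exclusive.
    """
    merged = []
    for start, end in sorted(spans):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = context[prev_end:start]
            if all(ch in PUNCT or ch.isspace() for ch in gap):
                merged[-1] = (prev_start, end)  # extend the previous span
                continue
        merged.append((start, end))
    return merged


def split_into_sentences(paragraph):
    """Split a paragraph into sentence-level contexts for the QA model."""
    return [s for s in SENT_DELIM.split(paragraph) if s.strip()]


if __name__ == "__main__":
    ctx = "Fluid accumulation, gas accumulation in the abdominal cavity."
    print(merge_adjacent_spans(ctx, [(0, 18), (20, 36)]))   # -> [(0, 36)]
    print(split_into_sentences("腹腔积液、积气；脂肪间隙浑浊。"))
```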

III-B3 Construct impossible questions

Furthermore, to adhere to the format of the SQuAD dataset, we include "plausible answers" for impossible questions. For this purpose, we exclude the original impossible questions and sentences and instead construct new samples. Concretely, we select paragraphs containing more than one dependency pair of the same relation type. If the left entities of two such pairs differ, we take the question derived from one pair and use the sentence containing the other pair's answer as the context. In the second example of Figure 1, "What abnormalities are there in the abdominal cavity of the patient?" is an impossible question for the clause "limited low-density fluid shadow seen in the gallbladder fossa;", but it comes with the plausible answer "limited low-density fluid shadow", which has the same entity type as the gold answer. This process forms an impossible question-answer pair with a seemingly plausible answer that differs from the original, thereby enhancing the model's capability to judge answerability.
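The construction can be sketched as follows; the dictionary fields such as relation_class and answer_sentence are hypothetical names for the quantities described above, not the actual annotation schema.

```python
from collections import defaultdict
from itertools import combinations


def build_impossible_samples(paragraph_pairs):
    """Build unanswerable samples with plausible answers from one paragraph.

    Each item of `paragraph_pairs` describes one dependency pair, e.g.
    {"relation_class": "body part-abnormality",
     "left_text": "abdominal cavity",
     "question": "What abnormalities are there in the abdominal cavity ...?",
     "answer_text": "fluid accumulation, gas accumulation",
     "answer_sentence": "fluid accumulation, gas accumulation in the abdominal cavity;"}.
    For two pairs of the same relation class with different left entities, the
    question of one is paired with the answer sentence of the other, so the
    context contains a same-type, plausible-looking but wrong answer span.
    """
    by_class = defaultdict(list)
    for pair in paragraph_pairs:
        by_class[pair["relation_class"]].append(pair)

    samples = []
    for pairs in by_class.values():
        for a, b in combinations(pairs, 2):
            if a["left_text"] == b["left_text"]:
                continue  # same left entity: the question might still be answerable
            for q, other in ((a, b), (b, a)):
                samples.append({
                    "question": q["question"],
                    "context": other["answer_sentence"],
                    "answer": "",                          # unanswerable
                    "plausible_answer": other["answer_text"],
                })
    return samples
```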

After the preprocessing stage, we obtain question-answer pairs along with the context at the paragraph or sentence level. The answer can either be empty or represented by a continuous span. These processed data are then fed into the model.

III-C QA Model

The main body of the QA model we employ is the original Retro-Reader [42], one of the benchmark models for SQuAD 2.0 (the Stanford Question Answering Dataset 2.0); we give a brief introduction to the model here. Retro-Reader comprises two primary components: a sketchy reading module responsible for making a coarse judgment about answerability, and an intensive reading module that generates candidate answer spans. Additionally, it integrates two threshold-based verification modules to further enhance the model's capability for predicting answerability. The architecture is summarized in Figure 2.



Figure 2: The Retro-Reader model contains two reading modules and a rear verification module. The scores produced by the reading modules are compared to a threshold to decide whether the question is answerable.

In the sketchy reading module, the input is first encoded by a pretrained multi-layer Transformer [39, 19]. The last layer of hidden states $\bm{H}_{1}$ is used as the embedding, which is fed to a fully connected layer for a two-way classification. The predicted probability of not having an answer, $\hat{y}_{i}$, is used for the loss computation, where $y_{i}\in\{0,1\}$ is the true label:

L_{ans1} = -\frac{1}{N}\sum_{i=1}^{N}\left(y_{i}\log\hat{y}_{i}+(1-y_{i})\log(1-\hat{y}_{i})\right)    (1)

The intensive reading module uses another Transformer that starts from the same pretrained weights. We denote its embeddings by $\bm{H}_{2}=(H_{Q},H_{P})$, standing for the question and the passage respectively. The module adopts cross attention to compute a question-aware representation of the text, setting $Q=H$, $K=V=H_{Q}$ in the multi-head attention of [39]. The final embedding $H^{\prime}$, after attention weighting, is passed through two linear layers with a softmax operator to compute the start and end position probabilities. The loss of the span prediction is written as:

L_{span} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log(s_{y_{i}^{s}})+\log(e_{y_{i}^{e}})\right]    (2)

which is the average negative log-likelihood of the true start and end positions. Another binary cross-entropy loss $L_{ans2}$ is introduced so that this module can also identify unanswerable questions. The loss of the whole module is therefore:

L = \alpha_{1}L_{span} + \alpha_{2}L_{ans2}    (3)
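For concreteness, a PyTorch sketch of this combined loss is given below. The logits are assumed to come from the intensive reading module; the weight alpha2 = 0.8 matches the value we use later in Section IV-B, while alpha1 = 1.0 is an illustrative default.

```python
import torch
import torch.nn.functional as F


def intensive_loss(start_logits, end_logits, ans_logits,
                   start_pos, end_pos, is_impossible,
                   alpha1=1.0, alpha2=0.8):
    """Weighted sum of the span loss (Eq. 2) and the answerability loss L_ans2."""
    # Eq. (2): mean negative log-likelihood of the true start and end positions.
    l_span = F.cross_entropy(start_logits, start_pos) + \
             F.cross_entropy(end_logits, end_pos)
    # Binary cross-entropy on the no-answer prediction.
    l_ans2 = F.binary_cross_entropy_with_logits(ans_logits, is_impossible.float())
    return alpha1 * l_span + alpha2 * l_ans2


if __name__ == "__main__":
    B, T = 2, 16                                   # batch size, sequence length
    start_logits, end_logits = torch.randn(B, T), torch.randn(B, T)
    ans_logits = torch.randn(B)
    start_pos = torch.tensor([3, 0])               # position 0 ([CLS]) for impossible samples
    end_pos = torch.tensor([5, 0])
    is_impossible = torch.tensor([0, 1])
    print(intensive_loss(start_logits, end_logits, ans_logits,
                         start_pos, end_pos, is_impossible))
```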

For our usage, both reading modules are trained on sentence-level and paragraph-level data to handle the multiple granularities in the preprocessed data. As the answerability judgment is essential in the pipeline, we slightly enlarge the weight of $L_{ans2}$. Finally, three scores, an external verification score, a has-answer score, and a no-answer score, are computed for rear verification:

score_{ext} = \hat{y}_{i} - (1-\hat{y}_{i})
score_{has} = \max(s_{k}+e_{l}),\quad 1 < k < l \leq n
score_{null} = s_{1} + e_{1}    (4)

Here $s_{1}$ and $e_{1}$ are the start and end probabilities of the [CLS] token. If the difference score $score_{diff}=score_{null}-score_{has}$ and the mixture score $\beta_{1}score_{diff}+\beta_{2}score_{ext}$ are both larger than a chosen threshold $\delta$, the QA model predicts the null string; otherwise, it outputs the answer span.
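The rear verification step can be sketched as follows; the weights beta1, beta2 and the threshold delta are tuned hyperparameters, and the values used here are placeholders.

```python
def rear_verification(p_noans, start_scores, end_scores,
                      beta1=0.5, beta2=0.5, delta=0.0):
    """Combine the scores of Eq. (4) and decide whether to output a span.

    p_noans      : sketchy-module probability that the question has no answer
    start_scores : per-token start scores s_1..s_n with the [CLS] token first
    end_scores   : per-token end scores   e_1..e_n with the [CLS] token first
    """
    score_ext = p_noans - (1 - p_noans)

    # Best non-[CLS] span score and its boundaries (k <= l for simplicity).
    n = len(start_scores)
    score_has, best_span = float("-inf"), None
    for k in range(1, n):
        for l in range(k, n):
            s = start_scores[k] + end_scores[l]
            if s > score_has:
                score_has, best_span = s, (k, l)

    score_null = start_scores[0] + end_scores[0]
    score_diff = score_null - score_has
    mixture = beta1 * score_diff + beta2 * score_ext

    # Both scores favour "no answer" when large, so the null string is
    # predicted once they exceed the threshold.
    if score_diff > delta and mixture > delta:
        return None            # unanswerable: output the empty string
    return best_span           # (start, end) token indices of the answer


if __name__ == "__main__":
    print(rear_verification(0.2, [0.1, 0.7, 0.2, 0.1], [0.1, 0.1, 0.8, 0.2]))
```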

III-D Postprocessing and Application

To ensure consistency between preprocessing and evaluation, we establish the format of the gold standard. The answer should be a concatenation of the originally annotated general entities, with one comma placed between entities. The presence of the comma is taken into account when computing the EM (exact match) score but not when computing the F1 score. If the answer for a sentence is empty, it is not concatenated. If all sentence-level answers are empty, we consider the question unanswerable.
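A minimal sketch of this merging step is shown below; the comma separator follows the gold-standard format described above, and whether a Chinese or ASCII comma is used depends on the corpus.

```python
def merge_sentence_answers(sentence_answers, sep=","):
    """Concatenate non-empty sentence-level answers with a comma.

    Returns None when every sentence-level answer is empty, i.e. the
    question is considered unanswerable.
    """
    parts = [ans for ans in sentence_answers if ans]
    return sep.join(parts) if parts else None


if __name__ == "__main__":
    print(merge_sentence_answers(
        ["fluid accumulation, gas accumulation", "", "fat gap is turbid"]))
    print(merge_sentence_answers(["", ""]))   # -> None (unanswerable)
```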

In a real-world setting, conducting comprehensive information extraction for all the EHRs is essential. Depending on the type of EHR, different templates will be applied and queried. If the EHR contains disease, abnormality and body part information, the NER step should be performed first, and the resulting entities will be utilized in the templates to query the text. Otherwise, the template itself with several fed-in words is sufficient for the query. Once all the possible queries are listed, we split the paragraph, obtain the model outputs, and merge them to complete the task. Questions that yield an empty string as the final answer are dismissed as non-informative.

IV Experiments

IV-A Data and Annotations

We utilized real de-identified EHR data from a single hospital as the basis for our study. The corpus comprises various EHR types, such as past medical history, personal history, family history, CT reports, and radiology reports. Our annotators were instructed to first label the general entities and then establish the dependency relationships between them. The identified entities encompass 15 types, including numbers, sizes, trends, properties of abnormalities, diseases, body parts, immune groups, and values. These general entities are further connected to form 21 distinct relation classes. For training and testing purposes, we selected 18 high-frequency classes.

Following the preprocessing steps, we compiled a dataset of 1718 EHR passages, yielding a total of 14528 question-answer pairs based on dependencies. The training set consists of 11451 QA pairs, the dev set contains 1510 pairs, and the test set contains 1567 pairs. Notably, the dataset contains 3576 impossible questions, with 350 instances each in the dev and test sets. Additionally, the dataset includes 5600 NER-like questions drawn from 908 passages. The two types of questions are processed separately.

IV-B Implementation

The model backbone is detailed in Section III. For Chinese EHR texts, we employed the "roberta-base-chinese" and "bert-base-chinese" pretrained embeddings. The model was trained in parallel on two RTX 2080 Ti (11 GB) GPUs, with a train and eval batch size of 2 per GPU and a maximum sequence length of 512. The model was trained on the entire training set for four epochs; training completed within approximately 40 minutes, and producing predictions for evaluation took around five minutes. We set $\alpha_{2}=0.8$ in the intensive reading module loss, and all other hyperparameters remained consistent with those used in [42].
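As an illustration of how question-context pairs are fed to the model, the snippet below encodes one pair with the Hugging Face bert-base-chinese tokenizer under the same maximum length. This is a hedged sketch of the input format, not our exact training code; the Retro-Reader heads sit on top of such encodings.

```python
from transformers import AutoTokenizer

MAX_LENGTH = 512
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")


def encode(question, context):
    """Encode a question-context pair as [CLS] question [SEP] context [SEP]."""
    return tokenizer(
        question,
        context,
        truncation="only_second",      # truncate the context, never the question
        max_length=MAX_LENGTH,
        padding="max_length",
        return_offsets_mapping=True,   # map answer character offsets to tokens
    )


if __name__ == "__main__":
    enc = encode("患者腹腔内有什么异常？", "腹腔积液、积气；脂肪间隙浑浊。")
    print(len(enc["input_ids"]))       # 512 after padding
```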

V Results

V-A NER QA Results

The NER-like QA evaluation is tested only on questions that pertain to diseases, abnormalities and body parts, the only three entity types not categorized into general entities or events. The evaluation standard we adopt involves both entity-level Exact Match (EM) score and token-level micro F1 score. A token is considered a True Positive if both its position and label are correct, and a False Negative if either of them mismatches.
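For reference, a simplified sketch of the two metrics is shown below; it computes exact match on the answer string and a character-level F1, and omits the position and entity-label checks used in the full NER evaluation.

```python
from collections import Counter


def exact_match(pred, gold):
    """Entity-level exact match on the full answer string."""
    return int(pred == gold)


def token_f1(pred, gold):
    """Token (character) level F1 between prediction and gold answer."""
    pred_tokens, gold_tokens = list(pred), list(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    n_common = sum(common.values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(pred_tokens)
    recall = n_common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("腹腔积液", "腹腔积液"), round(token_f1("腹腔积液", "腹腔少量积液"), 3))
```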

For Chinese NER, we opted to use the most commonly employed baseline model, which consists of BERT/RoBERTa (Chinese) in combination with a classifier (linear layer) for NER sequence tagging. The performance of each baseline model as well as our pipeline is summarized in Table I.

TABLE I: NER results comparison
Model Names                                  EM      F1
BERT-base-chinese + linear                   0.906   0.815
RoBERTa-base-chinese + linear                0.851   0.933
RoBERTa-base-QA pipeline w/o finetuning      0.166   0.255
RoBERTa-base-QA pipeline (ours)              0.894   0.965
a The F1 and EM scores of the baseline models and our pipeline.

The initial performance of the QA pipeline is notably poor when applied without finetuning. This discrepancy in results can be attributed to the inherent differences between the clinical domain and the general domain. However, following the training process, our pipeline demonstrates significant improvement and achieves competitive results on the NER task.

V-B Relation-like QA Results

In this section, we evaluate the other question types for information extraction collectively. For evaluation, we employ the F1 and EM metrics. The token-level F1 computation is similar to the NER task, with the distinction that QA does not involve entity class labels. Since the extraction of information within given keys cannot be directly classified into existing tasks, such as Open IE or relation extraction, we establish our baseline model as a pipeline without any training.

TABLE II: Relation QA results comparison
Model Names                                  EM      F1
RoBERTa-base-QA pipeline w/o finetuning      0.363   0.514
RoBERTa-base-QA pipeline (ours)              0.922   0.953

TABLE III: Answerability judgment results comparison
Model Names                                  Accuracy
RoBERTa-base-QA pipeline w/o finetuning      0.919
RoBERTa-base-QA pipeline (ours)              0.998


Figure 3: Examples of translated gold annotations and predictions from different models. The colored words in the context are the gold answer, and the red words mark boundary mismatches. Note that the translation may omit some Chinese expressions, so the examples are for reference only and may not reflect the full picture of Chinese EHR texts.

The performance improvement in Table II demonstrates the effectiveness of the processing pipeline, which renders the model able to answer EHR-related questions by constructing such question-answer pairs from the original data. We also manually checked the model's predictions. Among the 122 test-set samples that do not exactly match the gold standard, approximately two thirds of the errors result from a boundary mismatch of one to two tokens. These errors also include some subtle inconsistencies among the annotations. We provide examples in Figure 3, where the "Boundary Mismatch" line illustrates such cases; the figure caption contains the translations. The mismatched characters, marked in red, are often locative prepositions or function words, and both the predicted and gold spans are plausible in these cases. Intuitively, the performance is therefore even better than the scores indicate and appears suitable for practical use. The other main causes of error are linking the answer to the wrong left entity of the same entity type at the paragraph level, and unexpectedly predicting a long sequence when the true answer is short. The answerability judgment accuracy, on the other hand, is very close to 1 after finetuning, as shown in Table III, which supports the validity of merging sentence-level outputs.

V-C Ablation Study

In this section, we validate the necessity of our design, which consists of two main components in the preprocessing stage. The first component splits some paragraphs into sentences, which addresses many-to-one correspondence and discontinuous answer spans. The second component constructs impossible questions with plausible answers, which enhances the model's ability to capture dependencies and discriminate answerability.

Table IV and Table V provide evidence of the design's effectiveness. Each component contributes a gain of approximately 0.02 to 0.05 in the EM score and 0.05 to 0.08 in the F1 score for relation-like questions. For NER-like questions, however, the performance drops sharply without the splitting component. This drop can be attributed to the fact that most paragraphs contain more than five entities of the same type, making it challenging for the model to handle NER-like questions without sentence splitting. Note that the ablation of plausible answers is not conducted for NER-like questions, because no plausible answer exists when a given entity type does not appear in the sentence.

TABLE IV: Ablation results comparison on relation-like questions
Model Names                                      EM      F1
RoBERTa-base-QA pipeline w/o splitting           0.844   0.88
RoBERTa-base-QA pipeline w/o plausible answers   0.88    0.921
RoBERTa-base-QA pipeline (ours)                  0.922   0.953

TABLE V: Ablation results comparison on NER questions
Model Names                                      EM      F1
RoBERTa-base-QA pipeline w/o splitting           0.189   0.322
RoBERTa-base-QA pipeline w/o plausible answers   -       -
RoBERTa-base-QA pipeline (ours)                  0.894   0.965
a Here '-' means the ablation is not carried out.

A case study can help us better understand how these techniques work. In Figure 3, we present two instances to illustrate their effectiveness. In the first example, two abnormalities point to the same body location but are mentioned in two separate clauses. Without the splitting operation during preprocessing, the model would likely predict only one of the abnormalities. We observed similar cases even when the two spans are adjacent, as they are initially annotated as separate entities in the original annotations.

In both examples, the two clauses have analogous structures. When the model is not fine-tuned on impossible questions with plausible answers, it is more likely to confuse the correspondence between the clauses and predict a wrong span of the same type. Both techniques are therefore necessary for more stable and accurate extraction.

V-D Generalization Performance

Finally, we aim to explore whether the model can be utilized to extract information beyond the annotations, simulating a zero-shot setting. For this purpose, we manually propose 40 questions not covered in the 18 relation classes. These questions comprise 20 wh-questions and 20 yes-no questions. As our pipeline relies on extractive Question-Answering (QA), we expect the model to extract the negation word if the answer is ”no,” and extract the relevant statement if the answer is ”yes.” Therefore, if the model predicts a negation word, we consider its answer as ”no,” and as ”yes” otherwise.
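The mapping from extracted spans to yes/no answers can be sketched as follows; the negation word list is illustrative and would be adapted to the negation cues actually observed in the EHR corpus.

```python
# Hypothetical list of Chinese negation cues (placeholder values).
NEGATION_WORDS = ("无", "未见", "未", "否认", "没有")


def span_to_yes_no(predicted_span):
    """Map an extracted span to a yes/no answer as described above."""
    if not predicted_span or not predicted_span.strip():
        return None                       # unanswerable / empty prediction
    if any(neg in predicted_span for neg in NEGATION_WORDS):
        return "no"                       # the span contains a negation word
    return "yes"                          # a relevant affirmative statement


if __name__ == "__main__":
    print(span_to_yes_no("否认高血压病史"))   # -> "no"
    print(span_to_yes_no("高血压病史3年"))    # -> "yes"
```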

TABLE VI: Generalization results
Tasks                   EM      F1
zero-shot questions     0.65    0.763
yes-no questions        0.85    -
a Here '-' means the F1 score cannot be computed.

The generalization results are presented in Table VI. The model performs well on yes-no questions, achieving satisfactory accuracy. However, it only performs acceptably on other types of questions. The primary reason for this discrepancy is that the model lacks an understanding of the EHR writing format during the fine-tuning process, making it challenging for it to answer questions related to overall information.

V-E Discussion and Limitations

The results obtained from our pipeline demonstrate its superiority, meeting most of the requirements in our task settings. However, due to limitations in annotations and equipment, several potential improvements are not explored.

Firstly, we acknowledge the need to handle a combination of yes-no questions, multiple-choice questions, and extractive questions within the context of EHR extraction. While extracting information from EHRs, we may encounter queries about a patient's diagnosis of specific diseases, or about disease classification or severity, which typically require constrained answer options. Although our pipeline showcased competence in handling yes-no questions, it would be better to apply targeted models. An additional module, such as a rule-based or BERT-based classifier, could achieve high accuracy in distinguishing among these three question classes. By incorporating such classifiers, the models described in Section II could be adapted accordingly to again form a unified framework.

Secondly, we recognize the importance of enabling few-shot or even zero-shot capabilities for our models. In real-world scenarios, medical concepts may lack annotated data [43], or the existing data may not align precisely with practical requirements. Pretrained language models possess transfer capabilities, and LLMs even exhibit emergent abilities [44]; prompting or other techniques may be devised to trigger such capacity. As new techniques continue to emerge, other avenues for improvement may become possible, ultimately leading to a more comprehensive and powerful EHR IE system in the foreseeable future.

VI Conclusion

In this paper, we present a pipeline for comprehensive information extraction from Electronic Health Records using question-answering models. Our approach involves three main steps: paragraph splitting, impossible question construction, and output merging. This pipeline is designed to handle unified Named Entity Recognition and dependency-based extraction, and effectively address the issue of discontinuous multi-span answers. By fine-tuning the pipeline on our Electronic Health Records data, we achieve competitive performance on both types of questions, demonstrating a certain level of generalization to answer zero-shot and yes-no questions. Consequently, the pipeline proves to be practically useful for real-world tasks.

References

  • [1] S. Singh, “Natural language processing for information extraction,” arXiv preprint arXiv:1807.02383, 2018.
  • [2] S. Ouyang, X. Yao, Y. Wang, Q. Peng, Z. He, and J. Xia, “An overview of the text mining task for “gene-disease” association semantics,” J. Med. Inform, vol. 43, no. 12, pp. 6–9, 2022.
  • [3] D. N. Nicholson and C. S. Greene, “Constructing knowledge graphs and their biomedical applications,” Computational and structural biotechnology journal, vol. 18, pp. 1414–1428, 2020.
  • [4] K. Ishwari, A. Aneeze, S. Sudheesan, H. Karunaratne, A. Nugaliyadde, and Y. Mallawarrachchi, “Advances in natural language question answering: A review,” arXiv preprint arXiv:1904.05276, 2019.
  • [5] S. J. Athenikos and H. Han, “Biomedical question answering: A survey,” Computer methods and programs in biomedicine, vol. 99, no. 1, pp. 1–24, 2010.
  • [6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  • [7] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
  • [8] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • [9] C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang, “Wizardlm: Empowering large language models to follow complex instructions,” 2023.
  • [10] Q. Jin, Z. Yuan, G. Xiong, Q. Yu, H. Ying, C. Tan, M. Chen, S. Huang, X. Liu, and S. Yu, “Biomedical question answering: a survey of approaches and challenges,” ACM Computing Surveys (CSUR), vol. 55, no. 2, pp. 1–36, 2022.
  • [11] A. Rogers, M. Gardner, and I. Augenstein, “Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension,” ACM Computing Surveys, vol. 55, no. 10, pp. 1–45, 2023.
  • [12] S. Sohn and G. K. Savova, “Mayo clinic smoking status classification system: extensions and improvements,” in AMIA Annual Symposium Proceedings, vol. 2009.   American Medical Informatics Association, 2009, p. 619.
  • [13] G. K. Savova, J. Fan, Z. Ye, S. P. Murphy, J. Zheng, C. G. Chute, and I. J. Kullo, “Discovering peripheral arterial disease cases from radiology notes using natural language processing,” in AMIA Annual Symposium Proceedings, vol. 2010.   American Medical Informatics Association, 2010, p. 722.
  • [14] G. K. Savova, J. J. Masanz, P. V. Ogren, J. Zheng, S. Sohn, K. C. Kipper-Schuler, and C. G. Chute, “Mayo clinical text analysis and knowledge extraction system (ctakes): architecture, component evaluation and applications,” Journal of the American Medical Informatics Association, vol. 17, no. 5, pp. 507–513, 2010.
  • [15] A. R. Aronson and F.-M. Lang, “An overview of metamap: historical perspective and recent advances,” Journal of the American Medical Informatics Association, vol. 17, no. 3, pp. 229–236, 2010.
  • [16] Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn et al., “Clinical information extraction applications: a literature review,” Journal of biomedical informatics, vol. 77, pp. 34–49, 2018.
  • [17] Q. Li, S. A. Spooner, M. Kaiser, N. Lingren, J. Robbins, T. Lingren, H. Tang, I. Solti, and Y. Ni, “An end-to-end hybrid algorithm for automated medication discrepancy detection,” BMC medical informatics and decision making, vol. 15, no. 1, pp. 1–12, 2015.
  • [18] C. M. Rochefort, D. L. Buckeridge, and A. J. Forster, “Accuracy of using automated methods for detecting adverse events from electronic health record data: a research protocol,” Implementation Science, vol. 10, pp. 1–9, 2015.
  • [19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [20] A. Mulyar, O. Uzuner, and B. McInnes, “Mt-clinical bert: scaling clinical information extraction with multitask learning,” Journal of the American Medical Informatics Association, vol. 28, no. 10, pp. 2108–2115, 2021.
  • [21] I. Muhammad, A. Kearney, C. Gamble, F. Coenen, and P. Williamson, “Open information extraction for knowledge graph construction,” in Database and Expert Systems Applications: DEXA 2020 International Workshops BIOKDD, IWCFS and MLKgraphs, Bratislava, Slovakia, September 14–17, 2020, Proceedings 31.   Springer, 2020, pp. 103–113.
  • [22] Y.-P. Chen, Y.-H. Lo, F. Lai, and C.-H. Huang, “Disease concept-embedding based on the self-supervised method for medical information extraction from electronic health records and disease retrieval: Algorithm development and validation study,” Journal of Medical Internet Research, vol. 23, no. 1, p. e25113, 2021.
  • [23] S. Lyalina, B. Percha, P. LePendu, S. V. Iyer, R. B. Altman, and N. H. Shah, “Identifying phenotypic signatures of neuropsychiatric disorders from electronic medical records,” Journal of the American Medical Informatics Association, vol. 20, no. e2, pp. e297–e305, 2013.
  • [24] Y. Si and K. Roberts, “A frame-based nlp system for cancer-related information extraction,” in AMIA annual symposium proceedings, vol. 2018.   American Medical Informatics Association, 2018, p. 1524.
  • [25] D. Martinez, G. Pitson, A. MacKinlay, and L. Cavedon, “Cross-hospital portability of information extraction of cancer staging information,” Artificial intelligence in medicine, vol. 62, no. 1, pp. 11–21, 2014.
  • [26] X. Liu and H. Chen, “Azdrugminer: an information extraction system for mining patient-reported adverse drug events in online patient forums,” in Smart Health: International Conference, ICSH 2013, Beijing, China, August 3-4, 2013. Proceedings.   Springer, 2013, pp. 134–150.
  • [27] J. Gobeill, A. Gaudinat, and P. Ruch, “Exploiting incoming and outgoing citations for improving information retrieval in the trec 2015 clinical decision support track,” in Proceedings of The 24th Text REtrieval Conference (TREC 2015).   17-20 November 2015, 2015.
  • [28] Y. Si, J. Wang, H. Xu, and K. Roberts, “Enhancing clinical concept extraction with contextual embeddings,” Journal of the American Medical Informatics Association, vol. 26, no. 11, pp. 1297–1304, 2019.
  • [29] D. Weissenborn, G. Wiese, and L. Seiffe, “Making neural qa as simple as possible but not simpler,” arXiv preprint arXiv:1703.04816, 2017.
  • [30] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [31] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” arXiv preprint arXiv:1909.11942, 2019.
  • [32] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders as discriminators rather than generators,” arXiv preprint arXiv:2003.10555, 2020.
  • [33] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova, “Boolq: Exploring the surprising difficulty of natural yes/no questions,” arXiv preprint arXiv:1905.10044, 2019.
  • [34] W. Yoon, J. Lee, D. Kim, M. Jeong, and J. Kang, “Pre-trained language model for biomedical question answering,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2019, pp. 727–740.
  • [35] J. Li, S. Zhong, and K. Chen, “Mlec-qa: A chinese multi-choice biomedical question answering dataset,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 8862–8874.
  • [36] D. Jin, S. Gao, J.-Y. Kao, T. Chung, and D. Hakkani-tur, “Mmm: Multi-stage multi-task learning for multi-choice reading comprehension,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 05, 2020, pp. 8010–8017.
  • [37] P. Rajpurkar, R. Jia, and P. Liang, “Know what you don’t know: Unanswerable questions for squad,” 2018.
  • [38] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in neural information processing systems, vol. 27, 2014.
  • [39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [40] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
  • [41] A. Chen, G. Stanovsky, S. Singh, and M. Gardner, “Evaluating question answering evaluation,” in Proceedings of the 2nd workshop on machine reading for question answering, 2019, pp. 119–124.
  • [42] Z. Zhang, J. Yang, and H. Zhao, “Retrospective reader for machine reading comprehension,” in Proceedings of the AAAI conference on artificial intelligence, vol. 35, no. 16, 2021, pp. 14 506–14 514.
  • [43] S. Sivarajkumar and Y. Wang, “Healthprompt: A zero-shot learning paradigm for clinical natural language processing,” in AMIA Annual Symposium Proceedings, vol. 2022.   American Medical Informatics Association, 2022, p. 972.
  • [44] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler et al., “Emergent abilities of large language models,” arXiv preprint arXiv:2206.07682, 2022.